This is the fourth day of my November challenge.
TextRNN
TextRNN simply feeds the word embeddings into a bidirectional LSTM, passes the final hidden state into a fully connected layer, and applies softmax for classification. The model structure is shown below:
Code:
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                           batch_first=True, bidirectional=bidirectional)
        # hidden_dim * 2 because the final forward and backward states are concatenated;
        # it does not depend on n_layers.
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [batch_size, seq_len]
        embedded = self.dropout(self.embedding(text))
        # output: [batch, seq, 2*hidden if bidirectional else hidden]
        # hidden/cell: [n_layers * num_directions, batch, hidden]
        output, (hidden, cell) = self.rnn(embedded)
        # concat the final forward (hidden[-2, :, :]) and backward (hidden[-1, :, :]) hidden states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # hidden: [batch_size, hidden_dim * num_directions]
        return self.fc(hidden.squeeze(0))  # [batch_size, output_dim]
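As a quick shape check (a minimal sketch with made-up sizes, not part of the original training script), the model can be run on a dummy batch:

import torch

# hypothetical sizes, chosen only to illustrate the shapes
model = RNN(vocab_size=5000, embedding_dim=300, hidden_dim=256, output_dim=10)
dummy = torch.randint(0, 5000, (8, 32))  # [batch_size=8, seq_len=32]
logits = model(dummy)
print(logits.shape)                      # torch.Size([8, 10])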
TextRNN_ATT
TextRNN_ATT adds an attention mechanism on top of TextRNN:
class RNN_ATTs(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0, hidden_size2=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            bidirectional=bidirectional, batch_first=True, dropout=dropout)
        self.tanh1 = nn.Tanh()
        # self.u = nn.Parameter(torch.Tensor(config.hidden_size * 2, config.hidden_size * 2))
        self.w = nn.Parameter(torch.zeros(hidden_dim * 2))
        self.tanh2 = nn.Tanh()
        self.fc1 = nn.Linear(hidden_dim * 2, hidden_size2)
        self.fc = nn.Linear(hidden_size2, output_dim)

    def forward(self, x):
        emb = self.embedding(x)  # [batch_size, seq_len, embedding_dim] = [128, 32, 300]
        H, _ = self.lstm(emb)    # [batch_size, seq_len, hidden_size * num_directions] = [128, 32, 256]
        M = self.tanh1(H)        # [128, 32, 256]
        # M = torch.tanh(torch.matmul(H, self.u))
        alpha = F.softmax(torch.matmul(M, self.w), dim=1).unsqueeze(-1)  # [128, 32, 1]
        out = H * alpha          # [128, 32, 256]
        out = torch.sum(out, 1)  # [128, 256]
        out = F.relu(out)
        out = self.fc1(out)      # [128, 64]
        out = self.fc(out)       # [128, output_dim]
        return out
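To see what the attention step does in isolation, here is a small standalone sketch (hypothetical sizes, random tensors; not part of the original code) that reproduces the alpha computation and checks that the weights over the sequence sum to 1:

import torch
import torch.nn.functional as F

batch, seq_len, hid2 = 4, 32, 256      # hypothetical sizes
H = torch.randn(batch, seq_len, hid2)  # stand-in for the LSTM outputs
w = torch.randn(hid2)                  # stand-in for the learned attention vector self.w

M = torch.tanh(H)
alpha = F.softmax(torch.matmul(M, w), dim=1).unsqueeze(-1)  # [batch, seq_len, 1]
sentence_vec = torch.sum(H * alpha, dim=1)                  # [batch, hid2]

print(alpha.squeeze(-1).sum(dim=1))  # each row sums to 1.0
print(sentence_vec.shape)            # torch.Size([4, 256])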
The data set
The dataset is the cnews dataset, which contains three files: cnews.train.txt, cnews.val.txt, and cnews.test.txt. There are 10 categories: sports, entertainment, home furnishing, real estate, education, fashion, politics, games, technology, and finance. Web disk address:
Link: pan.baidu.com/s/1awlBYclO… Extraction code: RTNV
Word vector construction
The first step is to read the corpus and segment it into words.
Ideas:
1. Create the default word segmentation object seg.
2. Open the file and read the article by line.
3. Strip the trailing whitespace and separate the label from the article.
4. Append the segmented words to src_data and the label to labels.
5. Return the result.
I annotated the code as follows:
import codecs
import pkuseg
from tqdm import tqdm

def read_corpus(file_path):
    """
    Read the corpus file and split each line into a label and a segmented article.
    :param file_path: path to the corpus file
    :return: (src_data, labels)
    """
    src_data = []
    labels = []
    seg = pkuseg.pkuseg()  # use the default word segmentation model
    with codecs.open(file_path, 'r', encoding='utf-8') as fout:
        for line in tqdm(fout.readlines(), desc='reading corpus'):
            if line is not None:
                # each line is "label \t article"
                pair = line.strip().split('\t')
                if len(pair) != 2:
                    print(pair)
                    continue
                src_data.append(seg.cut(pair[1]))  # segmented article content
                labels.append(pair[0])             # label
    return (src_data, labels)  # returns the segmented articles and their labels
After this step, we have the labels and the segmented articles. The call:
src_sents, labels = read_corpus('cnews/cnews.train.txt')
Map labels to indices:
labels = {label: idx for idx, label in enumerate(labels)}
This produces a dictionary keyed by label; since duplicate keys are overwritten, each label maps to the index of its last occurrence in labels.
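For intuition, here is a toy example (made-up labels, not the real cnews label order) of what the comprehension produces:

raw_labels = ['sports', 'finance', 'sports', 'games']  # toy stand-in for the labels list
label_map = {label: idx for idx, label in enumerate(raw_labels)}
print(label_map)  # {'sports': 2, 'finance': 1, 'games': 3}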
The second step is to construct the word vector
This step mainly uses the from_corpus method in vocab.py.
Ideas:
1. Create a VocabEntry object vocab_entry.
2. Count the word frequencies of the segmented articles to build a dictionary of word and word frequency.
3. Select the top size - 2 most frequent words from the dictionary, keeping only those with frequency >= min_feq.
4. Take the word out of each (word, frequency) pair.
5. Call the add method to put each word into vocab_entry, which assigns it an id; the id is the index used to look up the word's vector.
The code is as follows:
@staticmethod
def from_corpus(corpus, size, min_feq=3):
    vocab_entry = VocabEntry()
    # chain comes from the itertools library, which provides very useful iteration-based
    # functions; it concatenates multiple iterators into one larger iterator.
    # The * unpacks corpus so every article's word list is chained together before counting.
    word_freq = Counter(chain(*corpus))
    # most_common(n) implements the Top-N functionality.
    valid_words = word_freq.most_common(size - 2)
    # keep only words whose frequency is at least min_feq
    valid_words = [word for word, value in valid_words if value >= min_feq]
    # print('number of word types: {}, number of word types w/ frequency >= {}: {}'
    #       .format(len(word_freq), min_feq, len(valid_words)))
    for word in valid_words:
        vocab_entry.add(word)
    return vocab_entry
After the vocabulary is created, save it to a JSON file:
vocab = Vocab.build(src_sents, labels, 50000, 3)
print('generated vocabulary, source %d words' % (len(vocab.vocab)))
vocab.save('./vocab.json')
training
Training uses train_rnn.py; first look at the parameters parsed in the main method.
parameter
parse = argparse.ArgumentParser()
parse.add_argument("--train_data_dir", default='./cnews/cnews.train.txt', type=str, required=False)
parse.add_argument("--dev_data_dir", default='./cnews/cnews.val.txt', type=str, required=False)
parse.add_argument("--test_data_dir", default='./cnews/cnews.test.txt', type=str, required=False)
parse.add_argument("--output_file", default='deep_model.log', type=str, required=False)
parse.add_argument("--batch_size", default=4, type=int)
parse.add_argument("--do_train", default=True, action="store_true", help="Whether to run training.")
parse.add_argument("--do_test", default=True, action="store_true", help="Whether to run testing.")
parse.add_argument("--learnning_rate", default=5e-4, type=float)
parse.add_argument("--num_epoch", default=50, type=int)
parse.add_argument("--max_vocab_size", default=50000, type=int)
parse.add_argument("--min_freq", default=2, type=int)
parse.add_argument("--hidden_size", default=256, type=int)
parse.add_argument("--embed_size", default=300, type=int)
parse.add_argument("--dropout_rate", type=float)
parse.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parse.add_argument("--GRAD_CLIP", default=1, type=float)
parse.add_argument("--vocab_path", default='vocab.json', type=str)
Parameter Description:
train_data_dir: path to the training set.
dev_data_dir: path to the validation set.
test_data_dir: path to the test set.
output_file: path of the output log.
batch_size: the batch size.
do_train: whether to train. The default value is True.
do_test: whether to test. The default value is True.
learnning_rate: the learning rate.
num_epoch: the number of epochs.
max_vocab_size: the maximum vocabulary size.
min_freq: the minimum word frequency; words below this value are filtered out.
hidden_size: the dimension of the LSTM hidden state.
embed_size: the length of the embedding vector.
dropout_rate: the dropout rate.
warmup_steps: the number of warmup steps.
vocab_path: the path where the vocabulary is saved.
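As a quick check of these defaults (a minimal sketch; parse is the ArgumentParser built above), parse_args can be called with an empty argument list so that only the defaults are used:

args = parse.parse_args([])  # empty list -> command-line flags are ignored, defaults are used
print(args.batch_size, args.num_epoch, args.vocab_path)
# 4 50 vocab.json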
Word vector construction
vocab = build_vocab(args)
label_map = vocab.labels
print(label_map)
Build_vocab method:
def build_vocab(args):
    if not os.path.exists(args.vocab_path):
        src_sents, labels = read_corpus(args.train_data_dir)
        labels = {label: idx for idx, label in enumerate(labels)}
        vocab = Vocab.build(src_sents, labels, args.max_vocab_size, args.min_freq)
        vocab.save(args.vocab_path)
    else:
        vocab = Vocab.load(args.vocab_path)
    return vocab
Create the model
Create the RNN model, move it to the GPU, and call the train method to train it.
rnn_model = RNN_ATTs(len(vocab.vocab), args.embed_size, args.hidden_size,
len(label_map), n_layers=1, bidirectional=True, dropout=args.dropout_rate)
rnn_model.to(device)
train(args, rnn_model, train_data, dev_data, vocab, dtype='RNN')
The train method is annotated as follows:
def train(args, model, train_data, dev_data, vocab, dtype='CNN'):
    LOG_FILE = args.output_file
    # record the training log
    with open(LOG_FILE, "a") as fout:
        fout.write('\n')
        fout.write('==========' * 6)
        fout.write('start trainning: {}'.format(dtype))
        fout.write('\n')
    time_start = time.time()
    if not os.path.exists(os.path.join('./runs', dtype)):
        os.makedirs(os.path.join('./runs', dtype))
    tb_writer = SummaryWriter(os.path.join('./runs', dtype))
    # total number of optimization steps
    t_total = args.num_epoch * (math.ceil(len(train_data) / args.batch_size))
    # optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, ...))  # optional 8-bit optimizer
    optimizer = AdamW(model.parameters(), lr=args.learnning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
                                                num_warmup_steps=args.warmup_steps,
                                                num_training_steps=t_total)
    criterion = nn.CrossEntropyLoss()  # use cross entropy as the loss
    global_step = 0
    total_loss = 0.
    logg_loss = 0.
    val_acces = []
    train_epoch = trange(args.num_epoch, desc='train_epoch')
    for epoch in train_epoch:  # train by epoch
        model.train()
        for src_sents, labels in batch_iter(train_data, args.batch_size, shuffle=True):
            src_sents = vocab.vocab.to_input_tensor(src_sents, args.device)
            global_step += 1
            optimizer.zero_grad()
            logits = model(src_sents)
            y_labels = torch.tensor(labels, device=args.device)
            example_losses = criterion(logits, y_labels)
            example_losses.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.GRAD_CLIP)
            optimizer.step()
            scheduler.step()
            total_loss += example_losses.item()
            if global_step % 100 == 0:
                loss_scalar = (total_loss - logg_loss) / 100
                logg_loss = total_loss
                with open(LOG_FILE, "a") as fout:
                    fout.write("epoch: {}, iter: {}, loss: {},learn_rate: {}\n".format(
                        epoch, global_step, loss_scalar, scheduler.get_lr()[0]))
                print("epoch: {}, iter: {}, loss: {}, learning_rate: {}".format(
                    epoch, global_step, loss_scalar, scheduler.get_lr()[0]))
                tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                tb_writer.add_scalar("loss", loss_scalar, global_step)
        print("Epoch", epoch, "Training loss", total_loss / global_step)
        eval_loss, eval_result = evaluate(args, criterion, model, dev_data, vocab)  # evaluate the model
        with open(LOG_FILE, "a") as fout:
            fout.write("EVALUATE: epoch: {}, loss: {},eval_result: {}\n".format(epoch, eval_loss, eval_result))
        eval_acc = eval_result['acc']
        if len(val_acces) == 0 or eval_acc > max(val_acces):
            # save the model if the validation accuracy is the best so far
            print("best model on epoch: {}, eval_acc: {}".format(epoch, eval_acc))
            torch.save(model.state_dict(), "classifa-best-{}.th".format(dtype))
        val_acces.append(eval_acc)
    time_end = time.time()
    print("run model of {},taking total {} m".format(dtype, (time_end - time_start) / 60))
    with open(LOG_FILE, "a") as fout:
        fout.write("run model of {},taking total {} m\n".format(dtype, (time_end - time_start) / 60))
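The evaluate function is not listed in this post. Based on how it is called above (it takes the criterion, the dev data, and the vocab, and returns a loss plus a result dict with an 'acc' key), a minimal sketch might look like the following; the batching and accuracy computation here are my assumptions, not the original code:

def evaluate(args, criterion, model, dev_data, vocab):
    """Hypothetical sketch: average loss and accuracy on dev_data."""
    model.eval()
    total_loss, total_correct, total_count = 0., 0, 0
    with torch.no_grad():
        for src_sents, labels in batch_iter(dev_data, args.batch_size):
            src_sents = vocab.vocab.to_input_tensor(src_sents, args.device)
            y_labels = torch.tensor(labels, device=args.device)
            logits = model(src_sents)
            total_loss += criterion(logits, y_labels).item() * len(labels)
            total_correct += (logits.argmax(dim=-1) == y_labels).sum().item()
            total_count += len(labels)
    model.train()
    return total_loss / total_count, {'acc': total_correct / total_count}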
The batch_iter method is annotated as follows:
def batch_iter(data, batch_size, shuffle=False):
    """
    Yield batches of the data.
    :param data: list of tuple
    :param batch_size:
    :param shuffle:
    :return:
    """
    batch_num = math.ceil(len(data) / batch_size)  # number of batches
    index_array = list(range(len(data)))
    if shuffle:
        np.random.shuffle(index_array)  # shuffle the example indices
    for i in range(batch_num):
        indices = index_array[i * batch_size:(i + 1) * batch_size]           # indices of this batch
        examples = [data[idx] for idx in indices]                            # fetch the examples
        examples = sorted(examples, key=lambda x: len(x[1]), reverse=True)   # sort within the batch, longest first
        src_sents = [e[0] for e in examples]                                 # segmented sentences
        labels = [label_map[e[1]] for e in examples]                         # label_map maps each label to its value
        yield src_sents, labels
Vocab.vocab.to_input_tensor does the following:
1. Convert the words to their ids using self.words2indices.
2. Find the longest sentence in the batch and pad the others with the '<PAD>' id (0) so they all have the same length.
3. Wrap the result of step 2 in a torch.tensor.
The code is as follows:
def to_input_tensor(self, sents: List[List[str]], device: torch.device):
    """
    Convert the segmented sentences to ids, pad them to the max length
    in the batch, and return them as a tensor on the given device.
    :param sents: list of list<str>
    :param device:
    :return:
    """
    sents = self.words2indices(sents)
    sents = pad_sents(sents, self.word2id['<PAD>'])
    sents_var = torch.tensor(sents, device=device)
    return sents_var
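pad_sents itself is not shown in this post; given how it is used above (pad every sentence with the '<PAD>' id up to the longest sentence in the batch, as described in step 2), a minimal sketch could be:

def pad_sents(sents, pad_token):
    """Hypothetical sketch: pad each id list to the length of the longest one."""
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]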
Start training:
validation
Change do_train to False and do_test to True to run the model in test mode; TextRNN achieves a score of 0.96.
parse.add_argument("--do_train", default=False, action="store_true", help="Whether to run training.")
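The test branch itself is not listed either. Since train saves the best weights to classifa-best-{dtype}.th, a hypothetical test step (my sketch, not the original script; test_data is assumed to be loaded the same way as train_data) would reload that checkpoint and call evaluate on the test set:

# hypothetical sketch of the do_test branch
rnn_model.load_state_dict(torch.load("classifa-best-RNN.th"))
rnn_model.to(device)
criterion = nn.CrossEntropyLoss()
test_loss, test_result = evaluate(args, criterion, rnn_model, test_data, vocab)
print("test acc: {}".format(test_result['acc']))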