This is the fourth day of my November challenge.

TextRNN

TextRNN feeds the word embeddings into a bidirectional LSTM, concatenates the final forward and backward hidden states, and passes them through a fully connected layer followed by softmax for classification. The model is as follows:

Code:

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                           batch_first=True, bidirectional=bidirectional)
        # hidden_dim * 2 because the LSTM is bidirectional; it does not depend on n_layers
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [batch_size, seq_len]
        embedded = self.dropout(self.embedding(text))
        # output: [batch, seq, 2*hidden if bidirectional else hidden]
        # hidden/cell: [num_directions * n_layers, batch, hidden]
        output, (hidden, cell) = self.rnn(embedded)
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # hidden: [batch_size, hidden_dim * num_directions]
        return self.fc(hidden)  # [batch_size, output_dim]
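A quick way to sanity-check the shapes is to push a dummy batch through the model. This is only an illustrative sketch; the vocabulary size, batch size, and sequence length below are made up:

import torch

model = RNN(vocab_size=1000, embedding_dim=300, hidden_dim=256, output_dim=10)
dummy_batch = torch.randint(1, 1000, (4, 32))  # [batch_size=4, seq_len=32], ids > 0 so <PAD> (id 0) is avoided
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 10]), one score per class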

TextRNN_ATT

TextRNN_ATT adds an attention mechanism on top of TextRNN:

class RNN_ATTs(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0, hidden_size2=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            bidirectional=bidirectional, batch_first=True, dropout=dropout)
        self.tanh1 = nn.Tanh()
        # self.u = nn.Parameter(torch.Tensor(config.hidden_size * 2, config.hidden_size * 2))
        self.w = nn.Parameter(torch.zeros(hidden_dim * 2))
        self.tanh2 = nn.Tanh()
        self.fc1 = nn.Linear(hidden_dim * 2, hidden_size2)
        self.fc = nn.Linear(hidden_size2, output_dim)
​
    def forward(self, x):
        emb = self.embedding(x)  # [batch_size, seq_len, embedding_dim] = [128, 32, 300]
        H, _ = self.lstm(emb)  # [batch_size, seq_len, hidden_size * num_direction]=[128, 32, 256]
​
        M = self.tanh1(H)  # [128, 32, 256]
        # M = torch.tanh(torch.matmul(H, self.u))
        alpha = F.softmax(torch.matmul(M, self.w), dim=1).unsqueeze(-1)  # [128, 32, 1]
        out = H * alpha  # [128, 32, 256]
        out = torch.sum(out, 1)  # [128, 256]
        out = F.relu(out)
        out = self.fc1(out)  # [128, 64]
        out = self.fc(out)  # [128, output_dim]
        return out
​
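For intuition, the attention step above scores every time step of the LSTM output H with the learned vector w, normalizes the scores with softmax, and takes the weighted sum over time. A toy-sized sketch of just that step (the shapes here are invented for illustration):

import torch
import torch.nn.functional as F

H = torch.randn(2, 5, 8)  # [batch=2, seq_len=5, hidden*2=8], stand-in for the LSTM output
w = torch.randn(8)        # stand-in for the learned scoring vector self.w

scores = torch.matmul(torch.tanh(H), w)          # [2, 5], one score per time step
alpha = F.softmax(scores, dim=1).unsqueeze(-1)   # [2, 5, 1], attention weights over time steps
out = torch.sum(H * alpha, dim=1)                # [2, 8], attention-weighted sentence vector

print(alpha.squeeze(-1).sum(dim=1))              # each row sums to 1.0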

The dataset

The cnews dataset is used. It contains three files: cnews.train.txt, cnews.val.txt, and cnews.test.txt. There are 10 categories: sports, entertainment, home furnishing, real estate, education, fashion, politics, games, technology, and finance. Network disk address:

Link: pan.baidu.com/s/1awlBYclO… Extraction code: RTNV

Word vector construction

The first step is to read the corpus and perform word segmentation.

Ideas:

1. Create the default word segmentation object seg.

2. Open the file and read the articles line by line.

3. Strip trailing whitespace and separate the label from the article.

4. Append the segmented words to src_data and the label to labels.

5. Return the result.

I annotated the code as follows:

import codecs
import pkuseg
from tqdm import tqdm

def read_corpus(file_path):
    """
    Read the corpus and segment each article.
    :param file_path: path of the corpus file
    :return: (src_data, labels)
    """
    src_data = []
    labels = []
    seg = pkuseg.pkuseg()  # use the default pkuseg segmenter
    with codecs.open(file_path, 'r', encoding='utf-8') as fout:
        for line in tqdm(fout.readlines(), desc='reading corpus'):
            if line is not None:
                # strip whitespace and separate the label from the article on the tab character
                pair = line.strip().split('\t')
                if len(pair) != 2:
                    print(pair)
                    continue
                src_data.append(seg.cut(pair[1]))  # segmented article content
                labels.append(pair[0])  # label
    return (src_data, labels)  # return the segmented articles and their labels

After this step, the labels and the segmented articles are obtained. The call looks like this:

src_sents, labels = read_corpus('cnews/cnews.train.txt')

Map the labels to indices:

    labels = {label: idx for idx, label in enumerate(labels)}

This produces a dictionary keyed by label; because the comprehension enumerates the whole labels list, each label's value is the index of its last occurrence.
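A small example of this behavior with toy labels:

labels = ['sports', 'sports', 'finance', 'sports', 'games']
labels = {label: idx for idx, label in enumerate(labels)}
print(labels)  # {'sports': 3, 'finance': 2, 'games': 4}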

The second step is to construct the word vector.

This step mainly uses the from_corpus method in vocab.py.

Ideas:

1. Create a VocabEntry object.

2. Count the word frequency of the segmented articles, producing a dictionary of word-to-frequency counts.

3. Select the top size - 2 most frequent words from the dictionary.

4. Keep the word from each (word, frequency) pair, filtering out words with frequency below min_feq.

5. Call the add method to put each word into vocab_entry, generating the word and its id. The id is the word's index, which the embedding layer later uses to look up its vector.

The code is as follows:

# vocab.py imports: from itertools import chain; from collections import Counter

@staticmethod
def from_corpus(corpus, size, min_feq=3):
    vocab_entry = VocabEntry()
    # chain comes from the itertools library, which provides very useful iterator-based functions;
    # it concatenates multiple iterators into one larger iterator.
    # The * unpacks corpus so that all of the segmented articles are chained together.
    word_freq = Counter(chain(*corpus))
    # most_common() implements Top-N: take the size - 2 most frequent words
    valid_words = word_freq.most_common(size - 2)
    # keep only the words whose frequency is at least min_feq
    valid_words = [word for word, value in valid_words if value >= min_feq]
    print('number of word types: {}, number of word types w/ frequency >= {}: {}'
          .format(len(word_freq), min_feq, len(valid_words)))
    for word in valid_words:
        vocab_entry.add(word)
    return vocab_entry

After it is built, save the word vector to a JSON file:

vocab = Vocab.build(src_sents, labels, 50000, 3)
print('generated vocabulary, source %d words' % (len(vocab.vocab)))
vocab.save('./vocab.json')
​

Training

Training uses train_rnn.py. First, let's look at the parameters of the main method.

Parameters

parse = argparse.ArgumentParser()
parse.add_argument("--train_data_dir", default='./cnews/cnews.train.txt', type=str, required=False)
parse.add_argument("--dev_data_dir", default='./cnews/cnews.val.txt', type=str, required=False)
parse.add_argument("--test_data_dir", default='./cnews/cnews.test.txt', type=str, required=False)
parse.add_argument("--output_file", default='deep_model.log', type=str, required=False)
parse.add_argument("--batch_size", default=4, type=int)
parse.add_argument("--do_train", default=True, action="store_true", help="Whether to run training.")
parse.add_argument("--do_test", default=True, action="store_true", help="Whether to run testing.")
parse.add_argument("--learnning_rate", default=5e-4, type=float)
parse.add_argument("--num_epoch", default=50, type=int)
parse.add_argument("--max_vocab_size", default=50000, type=int)
parse.add_argument("--min_freq", default=2, type=int)
parse.add_argument("--hidden_size", default=256, type=int)
parse.add_argument("--embed_size", default=300, type=int)
parse.add_argument("--dropout_rate", default=0.2, type=float)
parse.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parse.add_argument("--GRAD_CLIP", default=1, type=float)
parse.add_argument("--vocab_path", default='vocab.json', type=str)

Parameter description:

train_data_dir: path of the training set.

dev_data_dir: path of the validation set.

test_data_dir: path of the test set.

output_file: path of the output log.

batch_size: the batch size.

do_train: whether to train. The default value is True.

do_test: whether to test. The default value is True.

learnning_rate: the learning rate.

num_epoch: the number of epochs.

max_vocab_size: the maximum vocabulary size.

min_freq: the minimum word frequency; words below this value are filtered out.

hidden_size: the size of the hidden layer.

embed_size: the embedding dimension.

dropout_rate: the dropout rate.

warmup_steps: the number of warmup steps.

GRAD_CLIP: the gradient clipping threshold.

vocab_path: the path where the vocabulary is saved.
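Note that the train method below reads args.device, but no --device argument is defined above. Presumably the script sets it after parsing the arguments; a minimal sketch of that missing piece:

args = parse.parse_args()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
args.device = device  # train() and to_input_tensor() read args.device later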

Word vector construction

    vocab = build_vocab(args)
    label_map = vocab.labels
    print(label_map)

The build_vocab method:

def build_vocab(args):
    if not os.path.exists(args.vocab_path):
        src_sents, labels = read_corpus(args.train_data_dir)
        labels = {label: idx for idx, label in enumerate(labels)}
        vocab = Vocab.build(src_sents, labels, args.max_vocab_size, args.min_freq)
        vocab.save(args.vocab_path)
    else:
        vocab = Vocab.load(args.vocab_path)
    return vocab

Create the model

Create the RNN model, move it to the GPU, and call the train method to start training.

  rnn_model = RNN_ATTs(len(vocab.vocab), args.embed_size, args.hidden_size,
                        len(label_map), n_layers=1, bidirectional=True, dropout=args.dropout_rate)
  rnn_model.to(device)
  train(args, rnn_model, train_data, dev_data, vocab, dtype='RNN')
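train_data and dev_data are not constructed in the snippets shown in this post. They are presumably lists of (segmented article, label) pairs built from read_corpus; a sketch under that assumption:

train_src, train_labels = read_corpus(args.train_data_dir)
dev_src, dev_labels = read_corpus(args.dev_data_dir)
train_data = list(zip(train_src, train_labels))  # list of (segmented article, label string) pairs
dev_data = list(zip(dev_src, dev_labels))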

The train method is annotated as follows:

def train(args, model, train_data, dev_data, vocab, dtype='CNN'):
    LOG_FILE = args.output_file  # training log file
    # record the start of training
    with open(LOG_FILE, "a") as fout:
        fout.write('\n')
        fout.write('==========' * 6)
        fout.write('start training: {}'.format(dtype))
        fout.write('\n')
    time_start = time.time()
    if not os.path.exists(os.path.join('./runs', dtype)):
        os.makedirs(os.path.join('./runs', dtype))
    tb_writer = SummaryWriter(os.path.join('./runs', dtype))
    # total number of optimization steps
    t_total = args.num_epoch * (math.ceil(len(train_data) / args.batch_size))
    # optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, ...))  # optional 8-bit optimizer
    optimizer = AdamW(model.parameters(), lr=args.learnning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
                                                num_warmup_steps=args.warmup_steps,
                                                num_training_steps=t_total)
    criterion = nn.CrossEntropyLoss()  # use cross entropy as the loss
    global_step = 0
    total_loss = 0.
    logg_loss = 0.
    val_acces = []
    train_epoch = trange(args.num_epoch, desc='train_epoch')
    for epoch in train_epoch:  # loop over epochs
        model.train()
        for src_sents, labels in batch_iter(train_data, args.batch_size, shuffle=True):
            src_sents = vocab.vocab.to_input_tensor(src_sents, args.device)
            global_step += 1
            optimizer.zero_grad()
            logits = model(src_sents)
            y_labels = torch.tensor(labels, device=args.device)
            example_losses = criterion(logits, y_labels)
            example_losses.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.GRAD_CLIP)
            optimizer.step()
            scheduler.step()
            total_loss += example_losses.item()
            if global_step % 100 == 0:
                loss_scalar = (total_loss - logg_loss) / 100
                logg_loss = total_loss
                with open(LOG_FILE, "a") as fout:
                    fout.write("epoch: {}, iter: {}, loss: {}, learn_rate: {}\n".format(
                        epoch, global_step, loss_scalar, scheduler.get_lr()[0]))
                print("epoch: {}, iter: {}, loss: {}, learning_rate: {}".format(
                    epoch, global_step, loss_scalar, scheduler.get_lr()[0]))
                tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                tb_writer.add_scalar("loss", loss_scalar, global_step)
        print("Epoch", epoch, "Training loss", total_loss / global_step)
        eval_loss, eval_result = evaluate(args, criterion, model, dev_data, vocab)  # evaluate on the dev set
        with open(LOG_FILE, "a") as fout:
            fout.write("EVALUATE: epoch: {}, loss: {}, eval_result: {}\n".format(
                epoch, eval_loss, eval_result))
        eval_acc = eval_result['acc']
        if len(val_acces) == 0 or eval_acc > max(val_acces):
            # save the model whenever it achieves the best dev accuracy so far
            print("best model on epoch: {}, eval_acc: {}".format(epoch, eval_acc))
            torch.save(model.state_dict(), "classifa-best-{}.th".format(dtype))
        val_acces.append(eval_acc)
    time_end = time.time()
    print("run model of {}, taking total {} m".format(dtype, (time_end - time_start) / 60))
    with open(LOG_FILE, "a") as fout:
        fout.write("run model of {}, taking total {} m\n".format(dtype, (time_end - time_start) / 60))
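The evaluate function called in the training loop is not listed in the post. A minimal sketch, assuming it returns the average loss over the dev set and a dict containing accuracy (which matches how its return values are used above):

def evaluate(args, criterion, model, dev_data, vocab):
    model.eval()
    total_loss, total_correct, total_count = 0., 0, 0
    with torch.no_grad():
        for src_sents, labels in batch_iter(dev_data, args.batch_size):
            src_sents = vocab.vocab.to_input_tensor(src_sents, args.device)
            y_labels = torch.tensor(labels, device=args.device)
            logits = model(src_sents)
            total_loss += criterion(logits, y_labels).item() * len(labels)
            total_correct += (logits.argmax(dim=-1) == y_labels).sum().item()
            total_count += len(labels)
    model.train()
    return total_loss / total_count, {'acc': total_correct / total_count}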

The batch_iter method is annotated as follows:

# requires: import math and numpy as np; label_map is the label-to-id dict built earlier
def batch_iter(data, batch_size, shuffle=False):
    """
    Batch the data.
    :param data: list of (segmented article, label) tuples
    :param batch_size:
    :param shuffle:
    :return:
    """
    batch_num = math.ceil(len(data) / batch_size)  # number of batches
    index_array = list(range(len(data)))  # indices of all samples
    if shuffle:  # shuffle the sample indices if required
        np.random.shuffle(index_array)
    for i in range(batch_num):
        indices = index_array[i * batch_size:(i + 1) * batch_size]  # indices of the current batch
        examples = [data[idx] for idx in indices]  # samples of the current batch
        examples = sorted(examples, key=lambda x: len(x[1]), reverse=True)  # sort the batch in descending order
        src_sents = [e[0] for e in examples]  # segmented article contents
        labels = [label_map[e[1]] for e in examples]  # map the label strings to ids via label_map
        yield src_sents, labels

Vocab.vocab.to_input_tensor

1. Convert the words to ids using self.words2indices.

2. Find the longest sentence in the batch and pad the remaining sentences with the <PAD> id (0) to a uniform length.

3. Wrap the result of step 2 in a torch.tensor.

The code is as follows:

# requires: from typing import List
def to_input_tensor(self, sents: List[List[str]], device: torch.device):
    """
    Convert the segmented sentences to ids, pad every sentence to the max length,
    and return a tensor on the given device.
    :param sents: list of list<str>
    :param device:
    :return:
    """
    sents = self.words2indices(sents)
    sents = pad_sents(sents, self.word2id['<PAD>'])
    sents_var = torch.tensor(sents, device=device)
    return sents_var
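pad_sents is not shown in the post either; a minimal sketch of what it presumably does (pad every sentence with the <PAD> id up to the length of the longest sentence in the batch):

def pad_sents(sents, pad_token):
    """Pad each sentence in sents to the length of the longest one."""
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]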

Start training:

Validation

Change do_train to False and do_test to True to run only the evaluation; TextRNN achieves a score of 0.96.

parse.add_argument("--do_train", default=False, action="store_true", help="Whether to run training.")
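The test branch itself is not shown in the post. A minimal sketch, assuming it reloads the best checkpoint saved during training and runs evaluate on the test set:

if args.do_test:
    test_src, test_labels = read_corpus(args.test_data_dir)
    test_data = list(zip(test_src, test_labels))
    rnn_model.load_state_dict(torch.load("classifa-best-RNN.th"))  # best weights saved by train()
    criterion = nn.CrossEntropyLoss()
    test_loss, test_result = evaluate(args, criterion, rnn_model, test_data, vocab)
    print("test loss: {}, test acc: {}".format(test_loss, test_result['acc']))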