Demo

This post summarizes the last two months of work. The demo has been uploaded to GitHub; it contains implementations of CNN, LSTM, BiLSTM, and GRU models, combinations of CNN with LSTM and BiLSTM, as well as multi-layer and multi-channel CNN, LSTM, and BiLSTM variants. This article summarizes the problems I ran into recently, the solutions and related strategies, and the experience gained along the way (honestly, not much of it).

Demo Site:  https://github.com/bamtercelboo/cnn-lstm-bilstm-deepcnn-clstm-in-pytorch

(1) A brief introduction to PyTorch

PyTorch is a relatively new, Python-first deep learning framework that provides tensors and dynamic neural networks with strong GPU acceleration.

For beginners who have never used PyTorch, the official 60-minute blitz tutorial is a good place to start: http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

(2) CNN and LSTM

For an introduction to convolutional neural networks (CNN), see:

(https://www.zybuluo.com/hanbingtao/note/485480)

For an introduction to long short-term memory networks (LSTM), see:

(https://zybuluo.com/hanbingtao/note/581764)

(3) Data preprocessing

1. The corpus I am using is already fairly standard data, but some preprocessing is still needed when loading it, such as lowercasing, handling numbers, and stripping characters like "\n" and "\t". The third-party library torchtext is now used to load and preprocess the data (a minimal sketch follows this list).

2. Torchtext builds the vocabulary (word table) and handles lowercasing of the corpus data.

3. Special characters and digits in the corpus data also need to be processed.

4. Points to note:

Python's random module can be used to shuffle the data when loading the datasets.

Torchtext builds the training, development, and test set iterators, with the option to shuffle the data on every iteration.
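Here is a minimal sketch of the torchtext loading described above. The file names, the text-TAB-label TSV layout, and the cleanup rules are my own assumptions, and the classic Field/Iterator API used here was later moved to torchtext.legacy (and then removed) in newer torchtext releases.

    import re
    from torchtext import data   # classic API; newer versions: torchtext.legacy.data

    def clean_str(s):
        """Illustrative cleanup: normalize digits and collapse newlines, tabs and extra spaces."""
        s = re.sub(r'\d+', ' <num> ', s)
        s = re.sub(r'\s+', ' ', s)
        return s.strip()

    # lower=True lets torchtext handle lowercasing of the corpus data
    TEXT = data.Field(lower=True, tokenize=lambda s: clean_str(s).split(), batch_first=True)
    LABEL = data.Field(sequential=False)

    # File names and the two-column TSV layout are assumptions about the corpus format.
    train, dev, test = data.TabularDataset.splits(
        path='./data', train='train.tsv', validation='dev.tsv', test='test.tsv',
        format='tsv', fields=[('text', TEXT), ('label', LABEL)])

    TEXT.build_vocab(train)     # build the word table from the training split
    LABEL.build_vocab(train)

    # shuffle=True reshuffles the training data on each iteration.
    train_iter, dev_iter, test_iter = data.Iterator.splits(
        (train, dev, test), batch_sizes=(16, len(dev), len(test)),
        shuffle=True, repeat=False)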

(4) Word Embedding

1. Word embedding simply means that each word in the corpus is mapped to a corresponding word vector; word2vec is currently one of the most common ways to train such word vectors. See:

(http://www.cnblogs.com/bamtercelboo/p/7181899.html)

2. The vocabulary has already been built with torchtext as described above. There are two ways to initialize the word vectors: load externally pre-trained word vectors, or initialize them randomly. Word vectors trained on our own corpus may not work well, because the corpus may be too small. Pre-trained vectors such as the Google News vectors are widely recognized in the field; however, they are huge, so if your hardware (GPU) is limited you may not want to try them.

3. Here are a few places to download pre-trained word vectors:

word2vec-GoogleNews-vectors

(https://github.com/mmihaltz/word2vec-GoogleNews-vectors)

glove-vectors

(https://nlp.stanford.edu/projects/glove/)
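The article itself does not show loading code, but as one hedged illustration, gensim is a convenient way to read the GoogleNews binary format from the link above; the file name and path below are whatever you downloaded.

    from gensim.models import KeyedVectors

    # Path/filename are assumptions: point this at the archive downloaded above.
    w2v = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)
    print(w2v['language'].shape)   # each vector is 300-dimensional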

4. How to load external word vectors:

For words in the vocabulary that can be found in the external word vectors, simply load the corresponding vectors.

Words in the vocabulary that cannot be found in the external word vectors are known as OOV (Out Of Vocabulary) words. The more OOV words there are, the more they can affect the results, so handling them well is particularly important. Several strategies can be considered (a sketch follows this list):

Use the average of the word vectors that were found.

Use random initialization or all zeros. With random initialization, you can either use one random vector for all OOV words or a different random vector for each OOV word; which works better depends on the task.

Draw the random values from (-0.25, 0.25) or (-0.1, 0.1); the best range depends on the dataset and the external word vectors, so test it yourself. In my tests, 0.25 worked better than 0.1.

Pay special attention to whether the processed OOV vectors fall within a reasonable range. Check this manually or in the code after processing: if values in a processed word vector reach 15 or 30, your processing is very likely wrong or your code has a bug, and this will strongly affect the results.
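As a hedged sketch of the random-initialization strategy above: the names, the 300-dimensional assumption, and the 0.25 range are illustrative. Here, pretrained maps words to vectors (a plain dict or the gensim object loaded earlier both work) and vocab maps words to indices.

    import numpy as np
    import torch

    def build_embedding_matrix(vocab, pretrained, dim=300, scale=0.25):
        """vocab: dict word -> index; pretrained: word -> dim-sized vector."""
        matrix = np.empty((len(vocab), dim), dtype=np.float32)
        oov = 0
        for word, idx in vocab.items():
            if word in pretrained:
                matrix[idx] = pretrained[word]
            else:
                oov += 1
                # random init in (-scale, scale); averaging the found vectors
                # or using zeros are the other strategies mentioned above
                matrix[idx] = np.random.uniform(-scale, scale, dim)
        print('OOV words: %d / %d' % (oov, len(vocab)))
        # sanity check from the note above: values should stay in a small range
        assert np.abs(matrix).max() < 15, 'suspiciously large values, check the loading code'
        return torch.from_numpy(matrix)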

5. Use the external word vectors in the model:
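A minimal sketch of this step: copy the matrix built above into an nn.Embedding layer. Whether requires_grad is True or False is exactly the fine-tune / no-fine-tune choice discussed later.

    import torch
    import torch.nn as nn

    weights = torch.rand(10000, 300)      # stand-in for the matrix built above
    vocab_size, dim = weights.size()

    embed = nn.Embedding(vocab_size, dim)
    embed.weight.data.copy_(weights)      # load the external vectors into the layer
    embed.weight.requires_grad = True     # True = fine-tune them, False = keep them fixed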

(5) Parameter initialization

The nn.Conv2d() convolution layer in PyTorch has a weight and a bias, and it is necessary to initialize the weight explicitly. Failing to initialize it may slow down convergence and hurt the final results.

Useful initializers include torch.nn.init.uniform(), torch.nn.init.normal() and torch.nn.init.xavier_uniform(); for details see http://pytorch.org/docs/master/nn.html#torch-nn-init

nn.LSTM() in PyTorch exposes an all_weights attribute, which contains both the weights and the biases as a nested list of parameter matrices (a sketch follows below).
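A hedged sketch of both initializations: the layer sizes and the choice of a uniform initializer for the LSTM are mine, not necessarily the demo's, and older PyTorch releases spell the init functions without the trailing underscore.

    import torch.nn as nn
    import torch.nn.init as init

    # Convolution: initialize the weight explicitly and zero the bias.
    conv = nn.Conv2d(1, 200, (3, 300))
    init.xavier_uniform_(conv.weight)
    conv.bias.data.zero_()

    # LSTM: all_weights is a nested list with one [w_ih, w_hh, b_ih, b_hh]
    # group per layer and direction.
    lstm = nn.LSTM(300, 150, bidirectional=True)
    for group in lstm.all_weights:
        for param in group:
            if param.dim() > 1:
                init.uniform_(param, -0.1, 0.1)   # weight matrices
            else:
                param.data.zero_()                # biases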

(6) Parameter tuning and strategies

Neural network parameter setting

Kernel size: the paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification" tests different kernel sizes. According to its results, most good settings combine sizes between 1 and 10; the best choice depends on your own task (several of these settings are wired together in the sketch after this list).

Kernel num: in a CNN this is the number of feature maps per convolution window; it is usually set between 100 and 600. I usually use 200 or 300.

Dropout: the dropout rate is set to 0.5 in most papers; 0.5 is said to work well and helps prevent overfitting. For different tasks, however, the dropout rate needs to be adjusted appropriately. Trying dropout at different positions in the network can also give surprisingly good results.

Batch size: the batch size also needs tuning; judging from related blog posts it is generally no larger than 128, and sometimes much smaller.

Learning rate: the appropriate initial value differs between optimizers, and there are some classic configurations, e.g. Adam with lr = 0.001.

Number of iterations: set it according to your task, model, convergence speed, and degree of fitting.

Hidden size in the LSTM: the dimension of the LSTM hidden layer also affects the results. With 300-dimensional external word vectors, hidden sizes of 150 or 300 are reasonable choices. The largest hidden size I have used is 600, and because of my hardware it was already very slow to train. If your hardware allows, you can try larger values, but the relationship between the hidden size and the word vector dimension is worth keeping in mind (I believe it has some influence).

Norm constraints: the max_norm and norm_type arguments of nn.Embedding in PyTorch put a norm constraint on the embedding vectors.

L2 regularization: in PyTorch, L2 regularization, also called weight decay, is set in the optimizer through the weight_decay argument (native L1 regularization has been dropped and has to be implemented by hand); it is usually set to 1e-8.

Vanishing and exploding gradients: these are also problems to watch out for during training.
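To make several of these settings concrete, here is a minimal, hedged sketch of a Kim-style CNN text classifier. The code and all values (kernel sizes 3/4/5, kernel_num 200, dropout 0.5, the Embedding max_norm constraint, Adam with lr = 0.001 and weight_decay = 1e-8) are illustrative, not necessarily the demo's exact implementation, and the gradient clipping at the end is one common remedy for exploding gradients, added as an assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextCNN(nn.Module):
        """Minimal CNN text classifier; all hyperparameter values are illustrative."""
        def __init__(self, vocab_size=10000, embed_dim=300, kernel_num=200,
                     kernel_sizes=(3, 4, 5), num_classes=2, dropout=0.5):
            super(TextCNN, self).__init__()
            # max_norm / norm_type re-normalize embedding rows whose L2 norm exceeds 5
            self.embed = nn.Embedding(vocab_size, embed_dim, max_norm=5.0, norm_type=2)
            self.convs = nn.ModuleList(
                [nn.Conv2d(1, kernel_num, (k, embed_dim)) for k in kernel_sizes])
            self.dropout = nn.Dropout(dropout)
            self.fc = nn.Linear(kernel_num * len(kernel_sizes), num_classes)

        def forward(self, x):                        # x: (batch, seq_len) word indices
            x = self.embed(x).unsqueeze(1)           # (batch, 1, seq_len, embed_dim)
            x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
            x = [F.max_pool1d(h, h.size(2)).squeeze(2) for h in x]
            x = self.dropout(torch.cat(x, 1))
            return self.fc(x)

    model = TextCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-8)

    x = torch.randint(0, 10000, (16, 40))            # fake batch: 16 sentences of length 40
    loss = F.cross_entropy(model(x), torch.randint(0, 2, (16,)))
    loss.backward()
    # Clipping the gradient norm is one common answer to exploding gradients
    # (an assumption on my part, not necessarily what the demo does).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()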

Strategies for improving neural-network accuracy

Data preprocessing: when building the vocabulary, words with a frequency of 1 can be removed; the frequency threshold is itself a hyperparameter. If accuracy drops after removing them, you can instead remove them with a certain probability, or treat the word vectors of those words differently with a certain probability (see the sketch after this list).

Dynamic learning rate: PyTorch (as of version 0.2) implements dynamic learning-rate adjustment; for usage see http://pytorch.org/docs/master/optim.html#how-to-adjust-learning-rate

Batch normalization: PyTorch provides BatchNorm1d() and BatchNorm2d() for batch normalization; the momentum parameter deserves attention.

Wide vs. narrow convolution: in deep convolutional models, wide convolution should be used; with narrow convolution, dimension problems appear. With the data I am using now, dimension problems already occur in a two-layer convolutional network.

Character-level processing: my initial processing is at the word level, but the data can also be split into characters and the resulting character vectors randomly initialized, which is another possible strategy. I have tried it, but it did not improve my current task.

Optimizer: PyTorch provides several optimizers; the one we use most is Adam, and it works very well. See http://pytorch.org/docs/master/optim.html#algorithms

Fine-tune or no-fine-tune: this is a very important strategy. In general, fine-tuning the word vectors gives much better results than not fine-tuning them.
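A few of the strategies above can be sketched in code. All values and the stand-in model are illustrative, and the vocabulary-pruning line assumes the TEXT field and train split from the preprocessing sketch earlier.

    import torch
    import torch.nn as nn
    from torch.optim.lr_scheduler import StepLR

    # Vocabulary pruning: drop words occurring fewer than min_freq times
    # (with the earlier torchtext setup): TEXT.build_vocab(train, min_freq=2)

    # Wide convolution: padding by kernel_size - 1 keeps the output from
    # shrinking, avoiding the dimension problems of stacked narrow convolutions.
    wide_conv = nn.Conv2d(1, 200, (5, 300), padding=(4, 0))

    # Dynamic learning rate: a scheduler adjusts the LR during training;
    # halving every 10 epochs with StepLR is only one possible schedule.
    model = nn.Linear(300, 2)                               # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.5)
    for epoch in range(30):
        # ... run one training epoch with `optimizer` here ...
        scheduler.step()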

(7) References and acknowledgments

What deep learning (RNN, CNN) tuning experience do you have?

https://www.zhihu.com/question/41631631

Who can explain word embedding?

https://www.zhihu.com/question/32275069

Deep learning from scratch

https://zybuluo.com/hanbingtao/note/581764

PyTorch parameter initialization and Finetune

https://zhuanlan.zhihu.com/p/25983105

Overfitting and regularization

http://hpzhao.com/2017/03/29/机器学习中的正则化/#more

Batch Normalization

https://discuss.pytorch.org/t/example-on-how-to-use-batch-norm/216/2


Bamtercelboo is a senior labmate of mine. He summed up this hands-on experience when I had just joined the lab, and I personally think it is well worth reading. He maintains https://github.com/bamtercelboo, which contains a lot of practical code; you are welcome to visit it.

You are welcome to follow the "Deep Learning and Natural Language Processing" public account, where I will frequently post my work, theory, and practice along this road.