The BERT model from Google’s paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” broke 11 records in natural language processing. I have recently been working on question answering in NLP and found time to write a detailed interpretation of the paper. I find that most people who follow AI do not really grasp its main conclusions, so to help you quickly understand the essence of the paper, this interpretation is organized as a roadmap of key knowledge points for beginners and those new to deep learning. If you cannot follow someone else’s interpretation of a paper, it usually means you need to fill in some basic knowledge first. In addition to the main paper, the fifth part lists reference papers that should give you a more comprehensive understanding of the NLP field.
I. General introduction
The BERT model is essentially a language encoder that converts an input sentence or paragraph into a feature vector. The paper has two highlights: 1. The bidirectional encoder. The author follows the encoder from Attention Is All You Need and puts forward the concept of bidirectional encoding, using the masked language model to realize it. 2. Two pre-training tasks: the masked language model and next-sentence prediction. The authors argue that many current language models underestimate the power of pre-training, and that the masked language model captures bidirectional context far better than a standard language model that only predicts the next word.
II. Model framework
The BERT model adopts the framework of OpenAI’s Improving Language Understanding with Unsupervised Learning. Its model structure and parameter settings are the same as OpenAI GPT; only the pre-training tasks have been changed. GPT constrains the encoder so that each token (word) can only attend to the content that precedes it.
The whole pipeline is divided into two stages: 1. The pre-training stage (figure on the left): a multi-task, transfer-learning stage whose purpose is to learn representations of the input sentences. 2. The fine-tuning stage (figure on the right): with a small number of labeled samples, a feed-forward neural network is added on top to reach the task objective. Because the fine-tuning objective consists only of a simple feed-forward network and a small number of annotated samples, training is short. A minimal sketch of such a fine-tuning head is shown below.
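To make the fine-tuning idea concrete, here is a minimal PyTorch sketch of the kind of feed-forward head added on top of the encoder for a classification task. The class name, the hidden size of 768 and the two labels are illustrative assumptions of mine, not details from the paper, and other task types (span prediction, tagging) use different heads.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A single feed-forward layer placed on top of the encoder output."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_vector):
        # cls_vector: encoder output for the [CLS] token, shape (batch, hidden)
        return self.classifier(self.dropout(cls_vector))

# usage with a dummy batch of [CLS] vectors
logits = ClassificationHead()(torch.randn(4, 768))   # shape (4, 2)
```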
1. Input description
Compared with other language models whose input is a single sentence or document, the BERT model gives a broader definition of the input: it can be either a single sentence or a pair of sentences (such as question-answer pairs).
The input is represented as the sum of the token vector, segment vector and position vector corresponding to each word. (See Attention Is All You Need for position vectors.)
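As a concrete illustration, here is a minimal PyTorch sketch of that three-way sum. The vocabulary size of 30522 and the layer norm/dropout come from the released English BERT code rather than from anything stated above, and the sketch uses learned position embeddings as that code does.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment and position embeddings, then layer norm and dropout."""
    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position=512, type_vocab_size=2, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.segment_emb = nn.Embedding(type_vocab_size, hidden_size)
        self.position_emb = nn.Embedding(max_position, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids, segment_ids):
        # position indices 0..seq_len-1, broadcast over the batch dimension
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        x = (self.token_emb(token_ids)
             + self.segment_emb(segment_ids)
             + self.position_emb(positions))
        return self.dropout(self.layer_norm(x))

# usage: a batch of 2 sequences of length 8
token_ids = torch.randint(0, 30522, (2, 8))
segment_ids = torch.zeros(2, 8, dtype=torch.long)
embeddings = BertInputEmbeddings()(token_ids, segment_ids)  # shape (2, 8, 768)
```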
2. Pre-training process - the masked language model
The masked language model is used to train deep bidirectional representation vectors. The author takes a very direct approach: mask some of the words in the sentence and let the encoder predict what those words are. The training procedure is as follows: 15% of the tokens are randomly selected as prediction targets. Of these, (1) 80% are replaced with the [MASK] token; (2) 10% are replaced with a random word; (3) 10% are kept unchanged. The author notes that the advantage of this scheme is that the encoder does not know which tokens it will have to predict or which tokens have been replaced, so it is forced to learn a representation vector for every token. The author also explains that only 15% of the tokens in each batch are masked because of the training cost: a bidirectional encoder converges more slowly than a left-to-right one. A small sketch of the masking rule follows.
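Below is a minimal Python sketch of this 80/10/10 rule, assuming a plain token list and a toy vocabulary. The released code samples the prediction targets somewhat differently (for example, it caps the number of predictions per sequence), so treat this only as an illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Pick ~15% of tokens as prediction targets and apply the 80/10/10 rule."""
    tokens = list(tokens)
    labels = [None] * len(tokens)       # None means "not a prediction target"
    for i, original in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = original            # the encoder must recover this token
        dice = rng.random()
        if dice < 0.8:                  # 80%: replace with the [MASK] token
            tokens[i] = MASK_TOKEN
        elif dice < 0.9:                # 10%: replace with a random word
            tokens[i] = rng.choice(vocab)
        # remaining 10%: keep the original token unchanged
    return tokens, labels

# usage with a toy vocabulary
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```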
3. Predict the next sentence.
A binary classification model is pre-trained to learn relationships between sentences; predicting whether one sentence actually follows another is a simple way to learn them. Training setup: positive and negative samples are kept at a 1:1 ratio; for 50% of the pairs the second sentence really is the next sentence (positive samples), and for the other 50% a randomly chosen sentence is used instead (negative samples).
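Here is a sketch of how such training pairs might be built, assuming each document is already split into sentences. Unlike the released code, this toy version does not force negative sentences to come from a different document or respect length limits.

```python
import random

def make_nsp_examples(documents, rng=random):
    """Build IsNext / NotNext sentence pairs with a roughly 1:1 ratio.
    `documents` is a list of documents, each given as a list of sentences."""
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # positive sample: sentence B really follows sentence A
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # negative sample: sentence B is drawn at random
                other = rng.choice(documents)
                examples.append((doc[i], rng.choice(other), "NotNext"))
    return examples

# usage with two tiny "documents"
docs = [["Sentence A1.", "Sentence A2.", "Sentence A3."],
        ["Sentence B1.", "Sentence B2."]]
for example in make_nsp_examples(docs):
    print(example)
```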
4. Parameters of pre-training stage
(1) Batches of 256 sentences, each up to 512 tokens.
(2) 1,000,000 training steps.
(3) A training corpus of more than 3.3 billion words in total.
(4) About 40 epochs over the corpus.
(5) Adam optimizer with β1 = 0.9 and β2 = 0.999.
(6) The learning rate is warmed up over the first 10,000 steps and then decays linearly.
(7) L2 weight decay with a coefficient of 0.01.
(8) Dropout of 0.1.
(9) GELU is used instead of ReLU as the activation function.
(10) The BERT Base version was trained on 16 TPU chips and the Large version on 64 TPU chips; training took 4 days to complete.
The paper defines two versions: the Base version (L=12, H=768, A=12, Total Parameters=110M) and the Large version (L=24, H=1024, A=16, Total Parameters=340M), where L is the number of network layers, H is the hidden size, and A is the number of self-attention heads.
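A small sketch of how the optimizer settings in (5)-(7) fit together. The peak learning rate of 1e-4 is the paper's pre-training value; the Linear module is only a stand-in so that the optimizer line actually runs, and the plain AdamW here is an approximation of the optimizer in the released code.

```python
import torch

def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup over the first 10k steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# stand-in module, just so the optimizer line below runs
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

for step in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(step, bert_lr(step))
```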
5. Fine-tuning stage
The fine-tuning stage uses a different network head depending on the task. Most of the hyperparameters are the same as in pre-training, except for the batch size, learning rate, and number of epochs: Batch size: 16, 32. Learning rate (Adam): 5e-5, 3e-5, 2e-5. Number of epochs: 3, 4.
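The paper simply recommends trying this small grid per downstream task and keeping the best result on the dev set. A sketch of that selection loop might look like the following, where fine_tune_and_eval is a hypothetical placeholder to be replaced by a real training loop.

```python
import random
from itertools import product

# the small grid of fine-tuning hyperparameters listed above
batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
num_epochs = [3, 4]

def fine_tune_and_eval(batch_size, lr, epochs):
    """Hypothetical placeholder: run task-specific fine-tuning and
    return the dev-set score. Replace with a real training loop."""
    return random.random()

best_score, best_config = -1.0, None
for bs, lr, ep in product(batch_sizes, learning_rates, num_epochs):
    score = fine_tune_and_eval(bs, lr, ep)
    if score > best_score:
        best_score, best_config = score, (bs, lr, ep)

print("best config:", best_config, "dev score:", best_score)
```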
III. Experimental results
1. Performance on classification datasets
2. Performance on the Q&A data set
SQuAD v1.1 and TriviaQA are question answering datasets. EM (exact match) checks whether the predicted answer string exactly matches a reference answer; F1 is a combined measure of precision and recall over the answer tokens.
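For reference, here is a small Python version of these two metrics, close to (but simplified from) the official SQuAD v1.1 evaluation script: it normalizes by lower-casing and stripping punctuation and articles, and it compares against a single reference answer rather than taking the maximum over several.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, strip punctuation, articles and extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over answer tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Norman conquest", "Norman Conquest"))                   # 1.0 after normalization
print(f1_score("in the 10th and 11th centuries", "10th and 11th centuries"))   # ~0.89
```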
3. Performance on named entity recognition
4. Common sense reasoning
IV. Ablation study
An ablation study is an experiment designed to investigate whether some of the structures proposed in a model are actually effective. It plays a large role in extending a model and deploying it in engineering practice.
1. Pre-training effect test
LTR & No NSP: a left-to-right (LTR) language model is used instead of the masked language model, and the next-sentence prediction task is dropped. +BiLSTM: a bidirectional LSTM is added on top of the LTR & No NSP model during fine-tuning.
2. The influence of the complexity of model structure on the results
L represents the number of network layers, H represents the hidden size, and A represents the number of self-attention heads.
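To get a feel for how L and H translate into model size, here is a rough back-of-the-envelope parameter count. The vocabulary size of 30522 is taken from the released English model, and biases, layer norms and the pooler are ignored, so the totals come out slightly below the 110M and 340M reported in the paper.

```python
def bert_param_count(L, H, vocab_size=30522, max_pos=512, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder.
    A (the number of attention heads) only splits H across heads and does not
    change the count, so it is not an argument here."""
    embeddings = (vocab_size + max_pos + 2) * H      # token + position + segment tables
    attention = 4 * H * H                            # Q, K, V and output projections
    ffn = 2 * H * (ffn_mult * H)                     # the two feed-forward projections
    return embeddings + L * (attention + ffn)

print(f"base  (L=12, H=768):  ~{bert_param_count(12, 768) / 1e6:.0f}M")   # ~109M vs 110M reported
print(f"large (L=24, H=1024): ~{bert_param_count(24, 1024) / 1e6:.0f}M")  # ~334M vs 340M reported
```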
3. Influence of the number of pre-training steps on the results
4. The impact of feature-based methods on results
V. Important reference papers
If you want to understand the key trends in NLP in 2017 and 2018, the following papers are worth reading; all of them can be found directly through Google.
“Attention Is All You Need”: one of the most important breakthrough papers in NLP in 2017.
“Convolutional Sequence to Sequence Learning”: another of the most important breakthrough papers in NLP in 2017.
“Deep Contextualized Word Representations”: the NAACL 2018 best paper, the famous ELMo.
“Improving Language Understanding by Generative Pre-Training”: OpenAI GPT, the main model that BERT references and compares against.
“An Efficient Framework for Learning Sentence Representations”: a sentence-vector representation method.
“Semi-supervised Sequence Tagging with Bidirectional Language Models”: proposes a bidirectional language model.
VI. Personal views
Personally, if you have followed the development of NLP over the past two years, the breakthrough of the BERT model feels reasonable: most of its ideas borrow from previous breakthroughs. For example, the idea of the bidirectional encoder builds on the paper Semi-supervised Sequence Tagging with Bidirectional Language Models. The paper also adds some new ideas that feel natural in hindsight. (During the National Day holiday, while working on a question answering model at home, I wondered why the previous sentence and the following sentence could not be used as annotated data to train a binary classification model.) In my opinion, the most valuable part of the whole paper is the two pre-training tasks, which do not require large amounts of annotated data and are of great reference value in engineering practice and basic NLP training. Two big trends stand out in natural language processing in 2017 and 2018: on the one hand, models are moving from complex back to simple; on the other hand, transfer learning and semi-supervised learning are hot topics. These two trends are the first signs of NLP moving from academia to industry, because in reality large quantities of high-quality annotated data are simply not available, and expensive resources and equipment cannot solve the efficiency problem.