BERT is an NLP model proposed by Google in 2018 and has been one of the most significant breakthroughs in the field of NLP in recent years. It set new records on 11 NLP tasks, including GLUE, SQuAD 1.1, and MultiNLI.
1. Introduction
The BERT model was proposed in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT mainly uses the Encoder structure of the original Transformer. For those unfamiliar with the Transformer, check out our previous article, “The Transformer Model”, or Jay Alammar’s blog, The Illustrated Transformer. In general, BERT has the following characteristics:
Structure: BERT uses the Encoder structure of the Transformer, but the model is deeper. The original Transformer Encoder contains 6 Encoder blocks, while BERT-Base contains 12 Encoder blocks and BERT-Large contains 24 Encoder blocks.
Training: Training is divided into two stages: pre-training and fine-tuning. The pre-training stage is similar to Word2Vec, ELMo, etc.: the model is trained on a large dataset with a set of pre-training tasks. The fine-tuning stage then adapts the model to downstream tasks such as text classification, part-of-speech tagging, and question answering. BERT can be fine-tuned for different tasks without changing its structure.
Pre-training task 1: BERT’s first pre-training task is Masked LM, which randomly masks some words in a sentence and then predicts the masked words from the surrounding context, so that the meaning of a word can be understood from the full text. Masked LM is the core of BERT and differs from the biLSTM-based prediction approach, which will be discussed later.
Pre-training task 2: BERT’s second pre-training task is Next Sentence Prediction (NSP), which mainly helps the model better understand the relationship between sentences.
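For readers who want to check these sizes in code, here is a small sketch that inspects the configurations of BERT-Base and BERT-Large. It assumes the Hugging Face transformers library, which is not part of the original paper, just one convenient implementation:

```python
# Minimal sketch (assumes the Hugging Face `transformers` package is installed).
# Only the model configurations are downloaded, not the weights.
from transformers import BertConfig

for name in ("bert-base-uncased", "bert-large-uncased"):
    cfg = BertConfig.from_pretrained(name)
    # num_hidden_layers is the number of Encoder blocks mentioned above
    print(name, "-> Encoder blocks:", cfg.num_hidden_layers, ", hidden size:", cfg.hidden_size)
# Expected: 12 blocks / hidden size 768 for BERT-Base, 24 blocks / hidden size 1024 for BERT-Large
```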
2. BERT structure
The figure above is BERT’s structure diagram. The figure on the left represents the pre-training process, while the figure on the right represents the fine-tuning process for specific tasks.
2.1 BERT’s input
BERT’s input can consist of a pair of sentences (sentences A and B) or a single sentence. In addition, BERT adds some special tokens:
- The [CLS] token is placed at the beginning of the input, and the representation vector C that BERT produces for it can be used for subsequent classification tasks.
- The [SEP] token is used to separate the two input sentences. For example, if sentences A and B are input, a [SEP] token is appended after each of them.
- The [MASK] token is used to mask some words in the sentence. After a word is replaced with [MASK], the vector that BERT outputs at the [MASK] position is used to predict what the word is.
For example, given the two sentences “my dog is cute” and “he likes playing” as an input sample, BERT converts them into “[CLS] my dog is cute [SEP] he likes play ##ing [SEP]”. BERT uses the WordPiece method, which can break words into subwords, so some words are split into smaller pieces. For example, “playing” becomes “play” + “##ing”.
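As a quick illustration of this input format, the sketch below uses the Hugging Face BertTokenizer, one possible WordPiece implementation (the library and checkpoint name are assumptions, not something the article prescribes):

```python
# Minimal sketch of WordPiece tokenization and the [CLS] ... [SEP] ... [SEP] layout
# (assumes the Hugging Face `transformers` package; downloads the bert-base-uncased vocabulary).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("my dog is cute", "he likes playing")   # sentence A, sentence B
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# The sequence starts with [CLS] and has a [SEP] after each sentence.
print(encoding["token_type_ids"])  # segment ids: 0 for sentence A tokens, 1 for sentence B tokens
# Whether a word is split depends on the vocabulary: common words usually stay whole,
# while rarer words are broken into sub-pieces prefixed with "##", as in "play" + "##ing" above.
```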
When BERT receives the input sentence, it transforms each word into an Embedding, denoted by E. Unlike the Transformer, BERT’s input Embedding is obtained by adding three parts: Token Embedding, Segment Embedding, and Position Embedding.
Token Embedding: the word embedding of each token, such as [CLS], my, dog, and so on; it is learned during training.
Segment Embedding: used to indicate whether each word belongs to sentence A or sentence B. If only one sentence is input, only EA (the sentence-A segment embedding) is used. It is also learned during training.
Position Embedding: encodes the position at which a word appears. Unlike the Transformer’s fixed sinusoidal formula, BERT’s Position Embedding is learned during training. In BERT, the maximum sequence length is assumed to be 512.
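To make the “sum of three embeddings” concrete, here is a minimal PyTorch sketch. The class name and the way positions are generated are illustrative assumptions; real BERT also applies LayerNorm and dropout to the summed result:

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Sketch: input embedding E = Token Embedding + Segment Embedding + Position Embedding."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)      # Token Embedding, learned
        self.segment = nn.Embedding(num_segments, hidden)  # Segment Embedding: sentence A or B
        self.position = nn.Embedding(max_len, hidden)      # Position Embedding, learned (max length 512)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

# Usage: one sequence of 6 tokens, the first 4 from sentence A and the last 2 from sentence B
emb = BertInputEmbedding()
token_ids = torch.randint(0, 30522, (1, 6))
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(token_ids, segment_ids).shape)  # torch.Size([1, 6, 768])
```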
2.2 BERT pre-training
After the Embeddings of the words in the sentence are fed into BERT, the model is trained with two pre-training tasks.
The first one is Masked LM: some words in the sentence are randomly replaced with [MASK], the sentence is passed into BERT to encode the information of each word, and finally the correct word at each masked position is predicted using the encoding T[MASK] at that [MASK] position.
The second is Next Sentence Prediction: sentences A and B are input into BERT, and the encoding C of [CLS] is used to predict whether B is the next sentence of A.
The process of BERT’s pre-training is shown in the figure below.
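In code, the two pre-training heads can be pictured as two small linear layers on top of the encoder output. The sketch below is a simplification (real BERT adds an extra transform to the MLM head and ties it to the token embeddings); the variable names are assumptions:

```python
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30522
mlm_head = nn.Linear(hidden, vocab_size)  # Masked LM head: predicts the original word at a [MASK] position
nsp_head = nn.Linear(hidden, 2)           # NSP head: predicts IsNext / NotNext from C, the [CLS] output

# Suppose `encoder_out` is BERT's output for one sequence of 10 tokens
encoder_out = torch.randn(1, 10, hidden)
c = encoder_out[:, 0]        # C: the vector at the [CLS] position
t_mask = encoder_out[:, 4]   # T[MASK]: the vector at a masked position (index 4 here, chosen arbitrarily)

word_logits = mlm_head(t_mask)    # scores over the whole vocabulary for the masked word
is_next_logits = nsp_head(c)      # scores for "B is / is not the next sentence of A"
print(word_logits.shape, is_next_logits.shape)  # torch.Size([1, 30522]) torch.Size([1, 2])
```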
2.3 Using BERT for specific NLP tasks
The pre-trained BERT model can then be fine-tuned for specific NLP tasks, and it can be applied to a variety of different tasks, as shown in the following figure.
Sentence-pair classification tasks: for example, natural language inference (MNLI) and sentence semantic equivalence judgment (QQP). As shown in figure (a) above, the two sentences are passed into BERT, and the output value C of [CLS] is used to classify the sentence pair.
Single-sentence classification tasks: for example, sentence sentiment analysis (SST-2) and judging whether a sentence is grammatically acceptable (CoLA). As shown in figure (b), only one sentence is input, without the [SEP] flag, and the output value C of [CLS] is again used for classification.
Question answering tasks: for example, the SQuAD v1.1 dataset, where each sample consists of a Question and a Paragraph. The Question is a question, the Paragraph is a passage from Wikipedia, and the Paragraph contains the answer to the Question. The training goal is to find the Start and End positions of the answer within the Paragraph. As shown in figure (c) above, the Question and Paragraph are passed to BERT, which then predicts the Start and End positions based on the outputs of all the Paragraph tokens.
Single-sentence tagging tasks: for example, named entity recognition (NER). A single sentence is input and, based on BERT’s output T for each word, the model predicts the word’s category: Person, Organization, Location, Miscellaneous, or Other (not a named entity).
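As one concrete example of such fine-tuning, the sketch below builds a sentence-pair classifier whose head sits on the [CLS] representation C. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the sentences and label are made up for illustration:

```python
# Minimal fine-tuning sketch for a sentence-pair classification task (e.g. an MNLI-style setup).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

inputs = tokenizer("A man is playing a guitar.", "A person plays music.", return_tensors="pt")
labels = torch.tensor([0])                # hypothetical label id (e.g. entailment)
outputs = model(**inputs, labels=labels)  # the classification head uses the [CLS] representation
outputs.loss.backward()                   # in a real run, an optimizer step would follow
print(outputs.logits.shape)               # torch.Size([1, 3])
```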
3. Pre-training tasks
The pre-training part is the focus of BERT. Next, we will look at the details of BERT’s pre-training. BERT includes two pre-training tasks: Masked LM and Next Sentence Prediction.
3.1 Masked LM
Let’s first review the pre-training methods of previous language models, using the sentence “I / like / learn / natural / language / processing” as an example. When training a language model, some Mask operation is usually needed to prevent information leakage. Information leakage here means that, when predicting the word “natural”, the model already has access to the word “natural” itself. The reason the Transformer Encoder leaks information in this way will be discussed later.
CBOW of Word2Vec: the word at position i is predicted from the words before and after it, but a bag-of-words model is used, so word order information is not available. For example, the word “natural” is predicted using both “I / like / learn” and “language / processing”. In effect, CBOW’s training masks the word “natural”.
ELMo: ELMo uses a biLSTM during training. When predicting “natural”, the forward LSTM masks all the words after “natural” and predicts using the preceding context “I / like / learn”; the backward LSTM masks the words before “natural” and predicts using the following context “language / processing”. The outputs of the forward and backward LSTMs are then concatenated, so ELMo uses the left and right contexts separately for prediction, rather than both at once.
OpenAI GPT: OpenAI GPT is another algorithm that uses the Transformer to train a language model, but it uses the Transformer’s Decoder, which is a one-way (unidirectional) structure. When predicting “natural”, only the preceding context “I / like / learn” is used. The Decoder includes a Mask operation that masks all the words after the word currently being predicted.
The chart below shows the differences between BERT and ELMo and OpenAI GPT.
The authors of BERT believe that the best word predictions come from using both the left (preceding) and right (following) context. ELMo’s left-to-right and right-to-left models form only a shallow bidirectional model. BERT aims to train a deep bidirectional model on the Transformer Encoder structure, so the Masked LM method was proposed for training.
Masked LM is used to prevent information leakage. For example, when predicting the word “natural”, if “natural” in the input is not masked, the model can directly copy the information of “natural” to the output.
BERT only predicts the words at [Mask] positions during training, so that contextual information from both sides can be used. However, in later use the [Mask] token does not appear in real sentences, which would hurt the model’s performance. Therefore, the following strategy is adopted during training: 15% of the words in a sentence are randomly selected for masking. Among the selected words, 80% are really replaced with [Mask], 10% are left unchanged, and the remaining 10% are replaced with a random word.
For example, if the word “hairy” in the sentence “my dog is hairy” is selected for masking, then:
- 80% of the time, convert “my dog is hairy” to “my dog is [Mask]”.
- 10% of the time, keep the sentence “my dog is hairy” unchanged.
- 10% of the time, replace the word “hairy” with another random word, such as “apple”, converting “my dog is hairy” into “my dog is apple”.
That’s BERT’s first pre-training task, Masked LM.
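A minimal sketch of this 80% / 10% / 10% strategy is shown below. It is simplified (a real implementation works on token ids, batches whole sequences, and never masks [CLS] or [SEP]); the function name and toy vocabulary are assumptions:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[Mask]"):
    """Pick ~15% of positions; of those, 80% -> [Mask], 10% -> keep, 10% -> random word."""
    masked = list(tokens)
    labels = {}  # position -> original word that the model must predict
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok
        r = random.random()
        if r < 0.8:
            masked[i] = mask_token              # 80%: really replace with [Mask]
        elif r < 0.9:
            pass                                # 10%: keep the original word
        else:
            masked[i] = random.choice(vocab)    # 10%: replace with a random word, e.g. "apple"
    return masked, labels

# Usage with the article's example sentence
toy_vocab = ["apple", "dog", "cat", "run", "blue"]
print(mask_tokens("my dog is hairy".split(), toy_vocab))
```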
3.2 Next Sentence Prediction
BERT’s second pre-training task is Next Sentence Prediction (NSP): given two sentences A and B, the model must predict whether sentence B is the next sentence of sentence A.
The main reason BERT uses this pre-training task is that many downstream tasks, such as question answering (QA) and natural language inference (NLI), require the model to understand the relationship between two sentences, which cannot be learned from language modeling alone.
During training, with 50% probability BERT picks two consecutive sentences A and B, and with 50% probability it picks two unrelated sentences A and B; it then predicts whether sentence B is the next sentence of sentence A using the output C of the [CLS] flag. For example:
- Input = [CLS] I like to play [Mask] league [SEP] My best [Mask] is Yasso [SEP]
  Label = B is the next sentence of A
- Input = [CLS] I like to play [Mask] league [SEP] The weather is very [Mask] [SEP]
  Label = B is not the next sentence of A
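A hedged sketch of how such sentence pairs can be sampled from a document collection is shown below. It is simplified: tokenization, length limits, and the check that a “random” sentence B does not accidentally follow A are all omitted; the function name and toy documents are assumptions:

```python
import random

def make_nsp_example(documents):
    """Build one (sentence A, sentence B, is_next) pair: 50% real next sentence, 50% random sentence."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, is_next = doc[i + 1], True          # B really is the next sentence of A
    else:
        other = random.choice(documents)            # B sampled from a (possibly different) document
        sent_b, is_next = random.choice(other), False
    return sent_a, sent_b, is_next

docs = [["I like to play the league.", "My best hero is Yasso."],
        ["The weather is very nice.", "Let's go for a walk."]]
print(make_nsp_example(docs))
```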
4. Summary
Since BERT’s pre-training uses Masked LM, only 15% of the words are predicted in each batch, so more pre-training steps are needed. Sequential models such as ELMo make predictions for every word.
BERT uses the Transformer Encoder together with the Masked LM pre-training method, so it can make bidirectional predictions. OpenAI GPT uses the Transformer’s Decoder structure, whose Mask only allows sequential (left-to-right) prediction.
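To make this difference concrete, the small sketch below (an illustration, not from the paper) shows the causal mask a GPT-style Decoder applies versus the unrestricted attention a BERT-style Encoder allows:

```python
import torch

seq_len = 5
# Decoder (OpenAI GPT): position i may only attend to positions <= i, so prediction is left-to-right.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
# Encoder (BERT): every position may attend to every other position, so both directions are used.
bidirectional_mask = torch.ones(seq_len, seq_len)
print(causal_mask)
print(bidirectional_mask)
```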
5. References
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Attention Is All You Need