1. ELMo (2018.2.15)

ELMo is based on a bidirectional LSTM and is trained with a classical language-model objective. The so-called deep contextualized word vectors are obtained by pre-training on a large corpus and are then combined into different NLP tasks through task-specific methods.

1. Prediction model

For a sequence of N tokens $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of token $t_k$ given the preceding tokens $(t_1, t_2, \ldots, t_{k-1})$, so the probability of the sequence factorizes as

$p(t_1, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \ldots, t_{k-1})$

Following this prediction model, a sentence (each token first represented by a traditional context-independent word embedding) is fed through an $L$-layer LSTM to obtain context-dependent representations. In this $L$-layer LSTM, the representation of token $t_k$ at layer $j$ is $h^{LM}_{k,j},\; j = 1, \ldots, L$. The top-layer LSTM output $h^{LM}_{k,L}$ is passed to a Softmax layer to predict the next token $t_{k+1}$. The above describes the forward LSTM; the backward LSTM follows the same idea in the opposite direction. Training jointly maximizes the forward and backward log-likelihoods. The word-embedding and Softmax parameters are shared, while the forward and backward LSTMs each have their own parameters.

2. ELMo

For an $L$-layer bidirectional LSTM, each token $t_k$ has $2L + 1$ vector representations: the input embedding plus the forward and backward hidden vectors of each layer. For each layer $j$, $h^{LM}_{k,j}$ is the concatenation of the layer-$j$ forward and backward vectors, $h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}]$.

ELMo collapses the outputs of the multi-layer biLSTM into a single vector, $ELMo_k = E(R_k; \Theta_e)$. In the simplest case, ELMo uses only the top-layer vector, $E(R_k) = h^{LM}_{k,L}$.

The best-performing ELMo model, however, takes a weighted combination of all biLSTM layer outputs, with the weights normalized by a Softmax:

$ELMo^{task}_k = \gamma^{task} \sum_{j=0}^{L} s^{task}_j h^{LM}_{k,j}$

where $s^{task} = \mathrm{softmax}(w)$ are the normalized layer weights and $\gamma$ is a scale factor. In some cases, Layer Normalization is applied to each biLSTM layer's output before weighting with $s^{task}$.
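To make the weighted combination concrete, here is a minimal sketch in PyTorch (the layer count, dimensions and toy tensors are illustrative assumptions, not ELMo's real configuration):

```python
import torch

def elmo_combine(layer_outputs, w, gamma):
    """Weighted sum of biLSTM layer outputs, as in the ELMo scalar mix.

    layer_outputs: list of L+1 tensors, each of shape (seq_len, 2*hidden),
                   i.e. the token embedding layer plus the L biLSTM layers.
    w:             unnormalized weights, shape (L+1,), learned per task.
    gamma:         scalar scale factor, learned per task.
    """
    s = torch.softmax(w, dim=0)                      # s^task = softmax(w)
    stacked = torch.stack(layer_outputs, dim=0)      # (L+1, seq_len, 2*hidden)
    mixed = (s[:, None, None] * stacked).sum(dim=0)  # weighted sum over layers
    return gamma * mixed                             # ELMo_k^task

# toy usage: embedding layer + 2 biLSTM layers, 5 tokens, hidden size 4
layers = [torch.randn(5, 8) for _ in range(3)]
w = torch.zeros(3, requires_grad=True)
gamma = torch.tensor(1.0, requires_grad=True)
print(elmo_combine(layers, w, gamma).shape)  # torch.Size([5, 8])
```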

3. Using ELMo in downstream tasks

(1) Concatenate the ELMo vector $ELMo_k$ with the ordinary word vector $x_k$ at the input: $[x_k; ELMo_k]$. (2) Concatenate the ELMo vector $ELMo_k$ with the hidden-layer output vector $h_k$: $[h_k; ELMo_k]$; this variant brought improvements on SNLI and SQuAD.

2. GPT, GPT-2.0, GPT-3.0

GPT is a pre-trained prediction model that predates Bert and is a pioneering work in the pre-train + fine-tune paradigm. However, every version of GPT uses a one-directional prediction model, built from Transformer decoders.

  • GPT1.0

As shown above, the GPT model is a stack of 12 Transformer decoders. In the fine-tuning stage, the special tokens Start, Delim and Extract are used to construct the model input.

  • GPT 2.0

The model structure and pre-training method are essentially the same as GPT 1.0. Differences:

  1. Use a larger and higher-quality corpus.
  2. Fine-tuning is de-emphasized; zero-shot experimental settings are used.
  3. Bigger models.
  • GPT 3.0

A few key points:

  1. Bigger models
  2. Fine-tuning is not needed
  3. Few-shot experimental settings work best.
  4. It is still a one-way prediction model implemented with Transformer.

3. BERT (2018.10.11)

Bert uses the Transformer as its building block and trains jointly on two self-supervised tasks: predicting masked words and predicting the relationship between two sentences. The first task can be regarded as learning token-level information, and it is bidirectional; the second learns sentence-level or even document-level information. The components are as follows:

  1. Input/Output Representations

Use a sentence pair {A, B} to construct the input: join the two sentences with [SEP] and prepend [CLS]. The last-layer hidden state corresponding to [CLS] can be used for sentence-classification tasks. The input representation is the sum of three embeddings (a minimal sketch follows the list below):

  • WordPiece embeddings to construct the Token embedding
  • Segment embeddings, which indicate whether a token belongs to sentence A or sentence B.
  • Position Embeddings, the learned embedding vector.
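A minimal sketch of the three-way sum (the vocabulary size, hidden size and token ids are illustrative; the real model also applies LayerNorm and dropout afterwards, which are omitted here):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Sketch of Bert's input layer: token + segment + position embeddings."""

    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)   # WordPiece embeddings
        self.segment = nn.Embedding(2, hidden)          # sentence A / B
        self.position = nn.Embedding(max_len, hidden)   # learned positions

    def forward(self, token_ids, segment_ids):
        seq_len = token_ids.size(1)
        pos_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(pos_ids))

# [CLS] sentence A [SEP] sentence B [SEP] -> segment ids 0/0/0/1/1
emb = BertInputEmbedding()
tokens = torch.tensor([[101, 2023, 102, 2003, 102]])
segments = torch.tensor([[0, 0, 0, 1, 1]])
print(emb(tokens, segments).shape)  # torch.Size([1, 5, 768])
```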
  2. Pre-training BERT
  • 2.1 Pre-training Task #1: Masked LM (masked language model)

Since traditional conditional language models can only be learned left-to-right or right-to-left, BERT simply masks some tokens randomly in order to obtain a bidirectional representation through training. It then predicts only the words that are Masked, which is called Masked LM.

Downstream tasks do not contain [MASK], so there is a mismatch between pre-training and fine-tuning. The masking strategy adopted by the authors is therefore: instead of always replacing the chosen token with [MASK], 15% of the tokens in the training data are randomly selected for prediction, and each selected token is replaced with [MASK] 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time. For other details, refer to appendix A.
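A simplified sketch of this selection-and-replacement strategy (here each position is sampled independently with probability 0.15, whereas real implementations select exactly 15% of the positions per sequence):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Sketch of Bert's masking: pick ~15% of positions for prediction, then
    replace with [MASK] 80% of the time, a random token 10%, keep 10%."""
    output = list(tokens)
    targets = {}  # position -> original token (only these enter the MLM loss)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_rate:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            output[i] = "[MASK]"
        elif r < 0.9:
            output[i] = random.choice(vocab)
        # else: keep the original token unchanged
    return output, targets

vocab = ["use", "language", "models", "to", "predict", "words"]
print(mask_tokens(["[CLS]", "use", "language", "models", "[SEP]"], vocab))
```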

The last-layer hidden vector $T_i$ is then used to predict the original token with a cross-entropy loss. The loss for this task is denoted masked_LM_loss.

Note: masked_LM_loss is computed only over the masked tokens, not the other tokens.

  • 2.2 Pre-training Task #2: Next Sentence Prediction

Many downstream tasks, such as QA and NLI, need to learn sentence-level relationships. BERT's second task is therefore a binary Next Sentence Prediction (NSP) task. When constructing the training data, 50% of the sentence pairs {A, B} have the IsNext relationship and the other 50% have the NotNext relationship. The loss for this task is denoted next_sentence_loss.

The two tasks are trained jointly, and the final loss is the sum of the two task losses: total_loss = masked_LM_loss + next_sentence_loss.
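A sketch of how the two losses are combined, assuming the model has already produced MLM logits over the vocabulary and 2-way NSP logits (shapes and label values below are toy examples):

```python
import torch
import torch.nn.functional as F

def bert_pretrain_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """total_loss = masked_LM_loss + next_sentence_loss.

    mlm_labels uses -100 at non-masked positions so that they are ignored by
    cross_entropy; only the masked tokens contribute to masked_LM_loss.
    """
    masked_lm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),  # (batch*seq, vocab)
        mlm_labels.view(-1),                       # (batch*seq,)
        ignore_index=-100,
    )
    next_sentence_loss = F.cross_entropy(nsp_logits, nsp_labels)  # 2-way
    return masked_lm_loss + next_sentence_loss

# toy shapes: batch 2, seq 8, vocab 100
mlm_logits = torch.randn(2, 8, 100)
mlm_labels = torch.full((2, 8), -100)
mlm_labels[0, 3] = 42                      # one masked position
nsp_logits = torch.randn(2, 2)
nsp_labels = torch.tensor([0, 1])          # IsNext / NotNext
print(bert_pretrain_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels))
```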

  3. Fine-tuning

After pre-training, the model's token representations are passed through an additional layer for token-level tasks such as entity tagging and MRC, while the leading [CLS] representation is used for classification tasks such as sentiment analysis and sentence-entailment prediction. Different tasks of course require different task-specific loss functions, for example:

  • GLUE uses the leading [CLS] token, whose representation is denoted $C$, and the classification-layer weights are denoted $W$. A standard classification loss, $\log(\mathrm{softmax}(CW^T))$, is then used.

  • SQuAD v1.1 constructs the input with A as the question and B as the passage. Two vectors are then introduced: a start vector $S \in \mathbb{R}^H$ and an end vector $E \in \mathbb{R}^H$.

The representation of the $i$-th word is denoted $T_i$. The probability that word $i$ starts the answer is $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$, and the end-position distribution uses the same formula with $E$. The candidate span $(i, j)$ with the maximum score $S \cdot T_i + E \cdot T_j$, subject to $j \geq i$, is the final prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions.
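A sketch of this span decoding, assuming we already have the token representations $T$ and the two learned vectors $S$ and $E$ (the maximum answer length constraint is an extra, illustrative assumption):

```python
import torch

def best_span(T, S, E, max_answer_len=30):
    """Sketch of SQuAD decoding: pick (i, j) with j >= i maximizing S.T_i + E.T_j.

    T: token representations, shape (seq_len, hidden)
    S, E: start and end vectors, shape (hidden,)
    """
    start_scores = T @ S                 # (seq_len,)  S . T_i
    end_scores = T @ E                   # (seq_len,)  E . T_j
    best = (None, None, float("-inf"))
    for i in range(T.size(0)):
        for j in range(i, min(i + max_answer_len, T.size(0))):
            score = start_scores[i] + end_scores[j]
            if score > best[2]:
                best = (i, j, score.item())
    return best

T = torch.randn(20, 768)
S, E = torch.randn(768), torch.randn(768)
print(best_span(T, S, E))    # (start index, end index, score)
```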

paper, code

Scenarios where Bert is a good fit:

  • Sentence- or paragraph-matching tasks
  • Tasks where the answer is contained in the text itself, without relying heavily on features outside the text
  • NLP tasks with short input lengths
  • Tasks that demand deep semantic features; the deeper the semantic features required, the better suited Bert is

4. RoBERTa (2019.7.26)

RoBERTa (a Robustly Optimized BERT Pretraining Approach) is an improved BERT. The main work of this article is as follows:

  1. Static masking vs. dynamic masking

    Bert randomly selects 15% of the tokens in each sequence for masking. To reduce the mismatch with downstream tasks, Bert replaces each selected token with (1) [MASK] 80% of the time, (2) the original token 10% of the time, and (3) another random word 10% of the time. However, once the 15% of tokens have been selected they never change: after the random selection at the start, the same positions stay masked for all N subsequent epochs. This is called static masking. RoBERTa instead copies the pre-training data 10 times and randomly selects 15% of the tokens for masking in each copy, so the same sentence gets 10 different mask patterns. Each copy is then trained for N/10 epochs, which means that over the N training epochs the masked tokens of each sequence change. This is called dynamic masking. Does the change actually help? The authors ran an experiment changing only static masking to dynamic masking while keeping all other parameters fixed; dynamic masking does indeed improve performance (see the sketch after this list).

  2. With NSP vs. without NSP
  3. More data and longer training
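A minimal sketch of dynamic masking as re-sampling the mask pattern on every pass over the data (token-level and simplified; the function name and 15% rate mirror the description above):

```python
import random

def dynamic_mask(tokens, mask_rate=0.15):
    """RoBERTa-style dynamic masking: sample a fresh mask pattern every time a
    sequence is fed to the model, instead of fixing it once before training."""
    n = max(1, int(len(tokens) * mask_rate))
    positions = random.sample(range(len(tokens)), n)
    return [("[MASK]" if i in positions else t) for i, t in enumerate(tokens)]

sentence = "use language models to predict the next word".split()
for epoch in range(3):
    # the same sentence gets a different mask pattern each epoch
    print(epoch, dynamic_mask(sentence))
```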

5. AlBert (2019.12.26)

AlBert's main goal is to reduce the number of parameters and shrink the model, cutting training and inference time without hurting performance. In a standard Bert network, denote the WordPiece embedding size as $E$, the number of encoder layers as $L$, and the hidden size as $H$; the feed-forward/filter size is set to $4H$ and the number of attention heads to $H/64$. AlBert keeps the main structure of Bert's model and makes three main improvements:

1. Factorized embedding parameterization

In Bert and later improvements such as XLNet and RoBERTa, the WordPiece embedding size $E$ is tied to the hidden size $H$; in practice $E$ always equals $H$. This setup is not necessarily optimal. The authors argue that WordPiece embeddings learn context-independent representations, while the hidden-layer vectors learn context-dependent representations, and that the strength of Bert-class models comes from using context to learn context-dependent representations. Untying the WordPiece embedding size $E$ from the hidden size $H$ therefore makes more efficient use of the overall model parameters, with $H$ much larger than $E$.

If the vocabulary size is $V$, the embedding matrix of Bert-class models has $V \times E$ parameters, which easily reaches tens of millions of parameters.

Therefore, AlBert wants to break the binding between $E$ and $H$ to reduce the number of parameters in the model and improve its performance.

In AlBert, the embedding parameters are factorized into two smaller matrices. Instead of mapping the one-hot vector directly into the $H$-dimensional hidden space, AlBert first maps it into a lower-dimensional embedding of size $E$ and then projects it into the hidden space. That is, the embedding matrix is decomposed into two matrices of sizes $V \times E$ and $E \times H$, so the embedding parameter count drops from $O(V \times H)$ to $O(V \times E + E \times H)$. When $H$ is much larger than $E$, the reduction is very significant.
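A small sketch of the factorization and its effect on the parameter count (the sizes V, E and H below are illustrative):

```python
import torch.nn as nn

V, E, H = 30000, 128, 4096   # illustrative sizes (AlBert xxlarge-style H)

# Bert-style: one V x H embedding matrix
bert_emb = nn.Embedding(V, H)

# AlBert-style: factorized into V x E and E x H
albert_emb = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(bert_emb))     # 30000 * 4096 = 122,880,000
print(count(albert_emb))   # 30000 * 128 + 128 * 4096 = 4,364,288
```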

The figure above shows Bert's one-hot vector input. The first projection is context-independent between words; only in the attention layers do the words become context-dependent. Therefore the first projection does not need very high-dimensional vectors. In AlBert, the first mapping can use a very low dimension and then be expanded to a large dimension when attention is applied over the context. The vocabulary of an NLP pre-trained model is usually very large, so mapping the one-hot vectors directly to the hidden size would require a very large embedding matrix. Factorizing it has two advantages:

  1. The parameters are greatly reduced
  2. The context-independent word representation is decoupled from the context-dependent representation, so the context-dependent representation (the hidden size) can be enlarged freely; that is, the network can be made wider

2. Cross-layer parameter sharing

There are several ways to share parameters, such as sharing only the feed-forward network (FFN) parameters or sharing only the attention parameters. AlBert's strategy is to share all parameters across layers.
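A minimal sketch of cross-layer sharing: one layer's parameters reused at every depth. The use of nn.TransformerEncoderLayer here is a stand-in for AlBert's actual encoder layer, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Sketch of AlBert-style cross-layer sharing: a single Transformer layer
    is reused for all L layers, instead of L separately parameterized layers."""

    def __init__(self, hidden=768, heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                                batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)    # same weights applied at every depth
        return x

x = torch.randn(2, 16, 768)
print(SharedEncoder()(x).shape)  # torch.Size([2, 16, 768])
```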

3. Inter-sentence coherence loss

Bert uses two losses: masked_LM_loss and next_sentence_loss. One paper found that removing next_sentence_loss had little impact on model performance. The authors suspect that the reason NSP contributes so little is that the task is too easy: the model only needs to detect that the two sentences come from different topics, which overlaps with what MLM already learns, so almost everything Bert learns comes from MLM. AlBert therefore adopts a sentence-order prediction (SOP) loss. The training data are constructed much as in Bert: two consecutive sentences from the same document form a positive example, and the same two sentences with their order swapped form a negative example. It is still somewhat of a dark art.
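A minimal sketch of how an SOP training pair could be constructed (sentence splitting and sampling details are simplified assumptions):

```python
import random

def make_sop_example(doc_sentences):
    """Sketch of AlBert's SOP data: two consecutive sentences from the same
    document; label 1 if in original order, 0 if swapped."""
    i = random.randrange(len(doc_sentences) - 1)
    a, b = doc_sentences[i], doc_sentences[i + 1]
    if random.random() < 0.5:
        return (a, b), 1        # positive: correct order
    return (b, a), 0            # negative: same sentences, order swapped

doc = ["He opened the door.", "The room was dark.", "He turned on the light."]
print(make_sop_example(doc))
```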

Analysis:

  1. Effect: AlBert outperforms Bert only in the large-and-above versions, such as xlarge and xxlarge.
  2. Speed: at inference time, AlBert and Bert of the same size are almost identical; strictly speaking, AlBert is slightly slower because its embedding layer performs an extra matrix multiplication. So AlBert does not bring an inference-speed improvement!

paper

6. ELECTRA (2020.4.23)

The language-model objective Bert uses is MLM: randomly mask 15% of the tokens and train the model to predict the tokens at the masked positions. As is well known, Bert is very expensive to train. ELECTRA borrows the idea of a GAN to build its model. In Bert, only the tokens at masked positions are predicted, which the ELECTRA authors consider too easy a task. (AlBert made a similar argument about Next Sentence Prediction; a recurring theme of Bert improvements is making the pre-training task harder.) So how is a harder task introduced? ELECTRA first uses an MLM model, which the authors call the Generator, to rewrite a sentence by filling the [MASK] positions with sampled tokens; it then uses another network, which the authors call the Discriminator, to predict for every token in the sentence whether it has been replaced (a sequence-labeling problem). The authors call this task Replaced Token Detection (RTD).

  1. Pre-training

As shown in the figure above, both the Generator and the Discriminator are Bert-style encoders. In the pre-training stage, G and D are trained together; in the fine-tuning stage, G is no longer updated and only D's parameters are tuned.

For an input $x = [x_1, \ldots, x_n]$, the encoder produces contextual vectors $h(x) = [h_1, \ldots, h_n]$.

In the pre-training stage, gradients cannot be back-propagated from the Discriminator through the Generator's sampling step, so the two networks are in effect trained with their own losses:

  • Generator

The Generator network on the left is a standard MLM. For each [MASK] position $t$, the Generator outputs the probability of generating token $x_t$ (with $e(\cdot)$ the token embedding):

$p_G(x_t \mid x) = \frac{\exp\big(e(x_t)^{\top} h_G(x)_t\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_G(x)_t\big)}$

Its loss is the MLM loss over the masked positions:

$\mathcal{L}_{MLM}(x, \theta_G) = \mathbb{E}\Big[\sum_{i \in m} -\log p_G\big(x_i \mid x^{masked}\big)\Big]$

  • Discriminator

The Discriminator's training example, denoted $x^{corrupt}$ in the paper, is generated by replacing the masked-out tokens with the Generator's samples. The Discriminator then predicts, for each position $t$, whether $x_t$ has been replaced:

$D(x^{corrupt}, t) = \mathrm{sigmoid}\big(w^{\top} h_D(x^{corrupt})_t\big)$

The Discriminator loss on the right is a binary cross-entropy over all positions:

$\mathcal{L}_{Disc}(x, \theta_D) = \mathbb{E}\Big[\sum_{t=1}^{n} -\mathbb{1}\big(x^{corrupt}_t = x_t\big)\log D(x^{corrupt}, t) - \mathbb{1}\big(x^{corrupt}_t \neq x_t\big)\log\big(1 - D(x^{corrupt}, t)\big)\Big]$

  • loss

The final training objective is a weighted sum of the two losses: $\min_{\theta_G, \theta_D} \sum_{x} \mathcal{L}_{MLM}(x, \theta_G) + \lambda\, \mathcal{L}_{Disc}(x, \theta_D)$. Since the Discriminator's task is relatively easy, the RTD loss is small compared with the MLM loss, so the coefficient $\lambda$ is added; the authors use $\lambda = 50$ in training.
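A sketch of the combined objective, assuming the MLM loss and the discriminator's per-token logits have already been computed (names and toy values are illustrative):

```python
import torch
import torch.nn.functional as F

def electra_loss(mlm_loss, disc_logits, replaced_labels, lam=50.0):
    """Sketch of ELECTRA's combined objective: L = L_MLM + lambda * L_Disc.

    disc_logits:     per-token scores from the discriminator, shape (batch, seq)
    replaced_labels: 1.0 where the token was replaced by the generator, else 0.0
    lam:             weighting coefficient (the paper uses 50)
    """
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced_labels)
    return mlm_loss + lam * disc_loss

mlm_loss = torch.tensor(7.2)                      # toy MLM loss value
disc_logits = torch.randn(2, 8)
replaced = torch.zeros(2, 8)
replaced[0, 3] = 1.0                              # one replaced token
print(electra_loss(mlm_loss, disc_logits, replaced))
```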

Paper Code reference: 1

7. XLNet (2019.6.19)

The authors group existing pre-training approaches into two modes: AR (autoregressive) and AE (autoencoding). The AR mode includes ELMo and GPT; its problem is the one-directional prediction model. Although ELMo runs language models in both directions, its effect is still limited. The AE mode, represented by BERT, realizes a bidirectional language model through Masked LM, but it has two problems: 1. The [MASK] token exists only in pre-training and not in fine-tuning, which causes a mismatch between the two stages. 2. BERT assumes that the predicted (masked) tokens are independent of each other, which is too simplistic given that natural language is full of long-range dependencies. For example, given the pre-training input "Natural [MASK] [MASK] processing", where the two masked pieces are "lang" and "uage", BERT's objective is $p(\text{lang} \mid \text{Natural, processing}) + p(\text{uage} \mid \text{Natural, processing})$, whereas an AR model would use $p(\text{lang} \mid \text{Natural}) + p(\text{uage} \mid \text{Natural, lang})$. The distribution BERT learns is therefore built on this independence assumption and ignores the dependency between the masked tokens.

Therefore, XLNet proposed a plan, which had three changes compared with BERT:

  1. Use Permutation Language Modeling. The author proposes two-stream attention to achieve this.
  2. Taking the idea from the Transformer-XL, it works better with long text, so it performs better on SQuAD tasks.
  3. No longer use the NSP (Next Sentence Prediction) task. XLNet also uses Relative Segment Encoding, which only encodes whether two tokens are in the same segment, rather than which segment each belongs to.
  4. Bigger and better corpus.

XLNet implementation

  1. Permutations

For the sequence x = [This, is, a, sentence] of length $T$, there are $T!$ possible factorization orders. For example, suppose we compute the probability of the token in the third position given the first two. For the three orders [1, 2, 3, 4], [1, 2, 4, 3] and [4, 3, 2, 1], the corresponding probabilities are $Pr(a \mid This, is)$, $Pr(sentence \mid This, is)$ and $Pr(is \mid sentence, a)$. The objective of XLNet is therefore to run an AR prediction model over all factorization orders and maximize the expected one-directional likelihood over these orders.

  2. Attention Mask

The attention mask solves the token-order problem. In the Transformer, position embeddings and word embeddings are added together as input; for example, $Pr(\text{This} \mid \text{is}+2)$ means that "is" carries its position index 2. In XLNet, however, the factorization order is scrambled while the tokens keep their original positions, so the permutation has to be realized with an attention mask rather than by reordering the input. For example, for the order [3, 2, 4, 1], the model predicts position 3 ("a") first. Since no token precedes it in this order, it has no context, and its mask row is [0, 0, 0, 0]. Proceeding in the same way, the mask matrix for this order is:

$\begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix}$

Another way to write this (where "–" marks a masked-out position) is:
$Pr(\text{This} \mid -, \text{is}+2, \text{a}+3, \text{sentence}+4)$
$Pr(\text{is} \mid -, -, \text{a}+3, -)$
$Pr(\text{a} \mid -, -, -, -)$
$Pr(\text{sentence} \mid -, \text{is}+2, \text{a}+3, -)$
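A small sketch that rebuilds the mask matrix above from a factorization order (position $i$ may attend to position $j$ only if $j$ comes earlier in the order):

```python
import numpy as np

def permutation_mask(order):
    """Build the attention mask for one factorization order.

    order[k] is the (1-based) position predicted at step k. mask[i-1][j-1] = 1
    means position i may attend to position j, i.e. j precedes i in the order.
    """
    n = len(order)
    rank = {pos: k for k, pos in enumerate(order)}   # step at which pos appears
    mask = np.zeros((n, n), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if rank[j] < rank[i]:
                mask[i - 1][j - 1] = 1
    return mask

# x = [This, is, a, sentence], factorization order (3, 2, 4, 1)
print(permutation_mask([3, 2, 4, 1]))
# [[0 1 1 1]
#  [0 0 1 0]
#  [0 0 0 0]
#  [0 1 1 0]]
```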

  3. Two-stream self-attention

There is one more thing to consider: we need not only the conditional probability of a token given its context, but also which position the model is currently predicting. That is, we want to compute $Pr(\text{This} \mid 1st, \text{is}+2)$: the probability that "This" is the 1st token given that "is" is the 2nd. The standard Transformer simply adds the position embedding to the word embedding, i.e., $Pr(\text{This} \mid \text{This}+1, \text{is}+2)$. But in XLNet the factorization order is scrambled, so the model would not know that it is predicting the 1st position, nor even whether "This" is in the sequence at all. The solution is a two-stream self-attention mechanism: at layer $m$, the token at position $i$ is represented by two vectors, $h^m_i$ and $g^m_i$. $h$ is initialized as word embedding + position embedding, while $g$ is initialized as a shared trainable embedding plus the position embedding. $h$ is updated in the content stream, exactly like ordinary self-attention, using Q, K, V. $g$ is updated in the query stream, using the visible (unmasked) content vectors $h$ as K and V and the previous layer's $g$ as Q.

The figure below shows how $g^m_4$ at layer $m$ is updated:

For an input sequence (1, 2, 3, 4), if you want to compute the factorization order (3, 2, 4, 1), you can do so entirely through the attention-mask matrix while the actual input order stays (1, 2, 3, 4): computing $p(1 \mid 2, 3, 4)$, $p(2 \mid 3)$, $p(3)$ and $p(4 \mid 2, 3)$ realizes the order (3, 2, 4, 1). Under the hood it is still standard self-attention. In this way XLNet "reads" the tokens that come after the current one while still building an autoregressive prediction model.

In the mask matrix, the red dots represent the unmasked parts.

Refer to the principle and code analysis of self-attention and mask operations in Transformer

8. SpanBERT (2019.7.24)

Compared with Bert, SpanBERT’s changes mainly include two points:

  1. In MLM, instead of masking random individual tokens, SpanBERT masks all tokens within a contiguous span whose length and starting position are randomly selected.
  2. No longer use the NSP (Next Sentence Prediction) task; use the SBO (Span Boundary Objective) task instead.

The figure above illustrates SpanBERT's training. The span "an American football game" is masked out, and SBO uses the hidden vectors of the tokens at the two boundaries of the span ($x_4$, $x_9$) to predict each token inside the span. The figure also shows the MLM and SBO losses when predicting "football", which is the third token in the span, so its position vector is $p_3$.

1. Mask span

For each input sequence, spans are sampled iteratively until 15% of the tokens are masked. Both the span length and the starting position are random: the span length follows a geometric distribution, clipped at a maximum of 10. The authors found experimentally that performance was best with $p = 0.2$, which gives an average span length of 3.8.
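A simplified sketch of the span-sampling loop (the real implementation masks complete words and renormalizes the clipped geometric distribution; this version just clips it):

```python
import random
import numpy as np

def sample_spans(seq_len, mask_rate=0.15, p=0.2, max_span=10):
    """Sketch of SpanBERT span selection: span lengths follow a geometric
    distribution clipped at max_span; sampling stops at ~15% of tokens."""
    budget = int(seq_len * mask_rate)
    masked = set()
    while len(masked) < budget:
        length = min(np.random.geometric(p), max_span)   # paper: avg span ~3.8
        start = random.randrange(seq_len)
        masked.update(range(start, min(start + length, seq_len)))
    return sorted(masked)

print(sample_spans(seq_len=128))   # indices of masked positions
```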

In Bert, among the tokens selected for masking, 80% are replaced with [MASK], 10% are replaced with random words, and 10% remain unchanged. SpanBERT uses the same strategy, except that it is applied to each span as a whole.

2. SBO task

Let the masked span be $(x_s, \ldots, x_e)$, where $(s, e)$ denote its start and end positions. The SBO task uses the boundary representations $(x_{s-1}, x_{e+1})$ to predict each token inside the span. When predicting the token at position $i$:

$y_i = f\big(x_{s-1}, x_{e+1}, p_{i-s+1}\big)$

The position embedding $p_{i-s+1}$ encodes the position of token $i$ relative to the span boundary. $f(\cdot)$ is a simple two-layer neural network.

Then $y_i$ at position $i$ is used to predict $x_i$, with a cross-entropy loss.
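A sketch of the SBO head under these definitions (hidden size, vocabulary size and the exact layer arrangement of $f$ are assumptions; the paper uses a two-layer feed-forward network with GeLU and LayerNorm):

```python
import torch
import torch.nn as nn

class SBOHead(nn.Module):
    """Sketch of the Span Boundary Objective: predict a token inside the span
    from the two boundary representations and a relative position embedding."""

    def __init__(self, hidden=768, max_span=30, pos_dim=200, vocab=30522):
        super().__init__()
        self.pos = nn.Embedding(max_span, pos_dim)       # paper uses 200-dim
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden + pos_dim, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, vocab),
        )

    def forward(self, x_left, x_right, rel_pos):
        # x_left = x_{s-1}, x_right = x_{e+1}, rel_pos = i - s + 1
        h = torch.cat([x_left, x_right, self.pos(rel_pos)], dim=-1)
        return self.mlp(h)                                # logits over vocab

head = SBOHead()
x4, x9 = torch.randn(1, 768), torch.randn(1, 768)         # boundary tokens
logits = head(x4, x9, torch.tensor([3]))                  # 3rd token in span
print(logits.shape)                                       # torch.Size([1, 30522])
```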

3. Loss

The final loss is the sum of the MLM and SBO losses.

4. Implementation details

  1. The position embeddings are 200-dimensional.
  2. Longer single-sequence inputs are used.

5. Experiment and analysis

  1. SpanBERT is particularly strong on question-answering tasks.
  2. Dropping NSP and training on single longer sequences is generally better than the original Bert setup of concatenating two segments.

6. Reference

paper

9. Bert-WWM

Bert-WWM (Whole Word Masking) differs from Bert only in how it masks. In English corpora, text is tokenized with WordPiece, which splits longer words into pieces that are then masked independently (in the vocabulary, "##piece" marks a piece that continues the previous word). In Chinese Bert, masking is usually performed directly at the character (token) level. Bert-WWM, however, first performs word segmentation; when a token is chosen for masking, all the other tokens belonging to the same word are masked as well. The model input is still at the token level. The following example makes this clearer:

Original text: use language model to predict the probability of the next word.
Tokenized text: use language mod ##el to predict the pro ##ba ##bility of the next word.
Bert mask result: use language mod [MASK] to predict the pro [MASK] ##bility of the next word.
Bert-WWM mask result: use language [MASK] [MASK] to predict the [MASK] [MASK] [MASK] of the next word.

And then everything else is the same as Bert.
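A minimal sketch of whole-word masking over WordPiece output, grouping "##" pieces with the word they belong to (the mask rate and grouping rule are simplified):

```python
import random

def whole_word_mask(word_pieces, mask_rate=0.15):
    """Sketch of whole-word masking: WordPiece pieces starting with '##' belong
    to the previous word; if any piece of a word is chosen, mask them all."""
    # group piece indices into words
    words, current = [], []
    for i, piece in enumerate(word_pieces):
        if piece.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    words.append(current)

    output = list(word_pieces)
    n_words = max(1, int(len(words) * mask_rate))
    for word in random.sample(words, n_words):
        for i in word:                       # mask every piece of the word
            output[i] = "[MASK]"
    return output

pieces = "use language mod ##el to predict the pro ##ba ##bility".split()
print(whole_word_mask(pieces, mask_rate=0.3))
```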

Here are its results on LCQMC:

10. ERNIE (2019.3)

First of all, Baidu's ERNIE does not differ from Bert in model structure. The difference is that the tokens it masks are not chosen completely at random; instead, it makes use of knowledge-graph information. According to the paper, masking in this way lets ERNIE directly model semantic knowledge, enhancing the model's semantic representation ability.

  1. Different ways to mask

    As shown in the figure, Bert only masks single tokens, while ERNIE adopts three masking modes: token-level, entity-level and phrase-level.

  2. Introduce DLM to learn from dialogue. ERNIE also introduces forum dialogue data and uses a DLM (Dialogue Language Model) to model the query-response dialogue structure: dialogue pairs are taken as input so that the model learns to judge whether the current multi-turn dialogue is real or fake.

11. ERNIE 2.0

ERNIE 2.0 introduces multi-task learning: as many as seven tasks are used to pre-train the model (not in the fine-tuning stage), and pre-training proceeds by adding the tasks one by one.