Hello, everyone. I’m Rumor, a young woman spinning and bouncing along at the forefront of algorithms.

Since the Attention mechanism was proposed, Seq2Seq models with Attention have improved on all tasks, so the current Seq2Seq model usually refers to a model combining RNN and Attention. Google then introduced the Transformer to solve the Seq2Seq problem, replacing the LSTM with a full-attention structure and achieving better results on translation tasks. This article mainly introduces the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. I did not fully understand it on my first reading, so I hope my own interpretation can help you understand this model faster.

For those unfamiliar with the principles of Attention and Transformer:

  • 【NLP】Transformer model principles explained in detail
  • 【NLP】The principles of Attention

Since BERT’s release, more and more research results have appeared, and here is a summary of them. Each node has further notes; you can add me on WeChat (“leerumorrr”) to get the xMind version, and make a friend (•̀ㅂ•́)✧ along the way

![](https://pic4.zhimg.com/80/v2-0321c5f0e1601e458989864629644323_1440w.jpg)

1. BERT model

BERT’s full name is Bidirectional Encoder Representations from Transformers: it is the Encoder of a bidirectional Transformer (the Decoder is not used, because it cannot access the information it needs to predict). The main innovation of the model is the pre-training method, which uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations respectively.

1.1 Model Structure

Since the components of the model were already analyzed in the Transformer article, I won’t go into more detail here. The structure of the BERT model is as follows:

![](https://pic1.zhimg.com/80/v2-d942b566bde7c44704b7d03a1b596c0c_1440w.jpg)

Compared with OpenAI GPT (Generative Pre-trained Transformer), BERT connects its Transformer blocks bidirectionally. Just like the difference between a unidirectional RNN and a bidirectional RNN, this intuitively works better.

Compared to ELMo, both are bidirectional, but the target functions are different. ELMo uses

$$P(w_i \mid w_1, \dots, w_{i-1}) \quad \text{and} \quad P(w_i \mid w_{i+1}, \dots, w_n)$$

as target functions, trains the two representations independently and then concatenates them, while BERT uses

$$P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_n)$$

as the target function to train the LM.

1.2 Embedding

The Embedding here is the sum of three kinds of embeddings:

Among them:

  • Token Embeddings are the word vectors; the first token is the [CLS] flag, which can be used for downstream classification tasks
  • Segment Embeddings are used to distinguish the two sentences, because pre-training does not only do LM but also a classification task that takes two sentences as input
  • Position Embeddings are not the trigonometric (sinusoidal) functions of the Transformer from the previous article, but a learned embedding
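
Conceptually, the input representation is just an element-wise sum of the three lookups. A minimal sketch in PyTorch (the hidden size, vocabulary size and the final LayerNorm are my own illustrative choices, not BERT’s exact configuration):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Sum of token, segment, and (learned) position embeddings."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # word vectors, [CLS]/[SEP] included in vocab
        self.segment = nn.Embedding(n_segments, hidden)  # distinguishes sentence A from sentence B
        self.position = nn.Embedding(max_len, hidden)    # learned, not the sinusoidal encoding
        self.norm = nn.LayerNorm(hidden)                 # illustrative normalization of the sum

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        seq_len = token_ids.size(1)
        pos_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        emb = self.token(token_ids) + self.segment(segment_ids) + self.position(pos_ids)
        return self.norm(emb)
```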

1.3 Pre-training Task #1: Masked LM

The goal of the first pre-training task is to build a language model. From the model structure above we can see what makes this model different: it is bidirectional. As for why it is bidirectional, the author explained on Reddit that if the pre-trained model is to be used for other tasks, people want information not just from the left side of a word but from both sides. With that in mind, ELMo simply concatenates a left-to-right and a right-to-left model trained separately. Intuitively we want a deeply bidirectional model, but an ordinary LM cannot be trained that way, because during training a word could “see itself” through the bidirectional context (I don’t entirely agree with that, and I will write about bidirectional LMs later). So the author used the masking trick.
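
To make the “see itself” problem concrete, here is the contrast in my own notation (following the paper’s description, not a formula from the original article): a left-to-right LM conditions each word only on its left context, while the masked LM predicts only the masked positions $\mathcal{M}$ from an input $\tilde{\mathbf{w}}$ in which those positions have been replaced, so a word never appears in its own conditioning context:

$$
\mathcal{L}_{\text{LM}} = -\sum_{i} \log P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(w_i \mid \tilde{\mathbf{w}}).
$$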

During training, the author randomly masks 15% of the tokens, instead of predicting every word as CBOW does. As for why, I think it may be due to the structure of the model itself: since the input and output are sequences of the same length, the model is actually doing sequence-level LM.

How to mask is also tricky. If the [MASK] marker were always used to replace the masked token (it never appears at actual prediction time), it would hurt the model. So among the tokens selected for masking, 80% are replaced with [MASK], 10% are replaced with another random word, and 10% are left unchanged. Why exactly this split, the authors do not say… It is important to note that under Masked LM the pre-trained model does not know which word is actually masked, so the model has to pay attention to every word.
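
A rough sketch of this 80/10/10 rule (plain Python; the function name and token-list interface are my own placeholders, not from the paper or any particular library):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    """Pick ~15% of positions; of those, 80% -> [MASK],
    10% -> random word, 10% -> left unchanged."""
    inputs = list(tokens)
    labels = [None] * len(tokens)        # only masked positions get a prediction target
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                  # model must recover the original token
        dice = random.random()
        if dice < 0.8:
            inputs[i] = mask_token       # 80%: replace with [MASK]
        elif dice < 0.9:
            inputs[i] = random.choice(vocab)  # 10%: replace with a random word
        # else: 10% keep the original token unchanged
    return inputs, labels
```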

1.4 Pre-training Task #2: Next Sentence Prediction

Since tasks such as QA and NLI are involved, a second pre-training task is added so that the model learns the relationship between two sentences. The training input is a pair of sentences A and B, where B is the actual next sentence of A 50% of the time; given the two sentences, the model predicts whether B is the next sentence of A. During pre-training this reaches 97–98% accuracy.
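
A sketch of how such training pairs could be sampled under the 50/50 rule described above (the corpus format and function name are my own assumptions):

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """Build one (A, B, is_next) example: B is the true next sentence
    50% of the time, a random sentence from the corpus otherwise.
    Assumes doc_sentences has at least two sentences."""
    idx = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[idx]
    if random.random() < 0.5:
        sent_b, is_next = doc_sentences[idx + 1], 1        # IsNext
    else:
        sent_b, is_next = random.choice(all_sentences), 0  # NotNext
    return sent_a, sent_b, is_next
```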

1.5 Fine-tuning

Classification: for sequence-level classification tasks, BERT directly takes the final hidden state of the first token [CLS], adds a layer of weights, and predicts the label probabilities with softmax:
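
A minimal sketch of this classification head in PyTorch (the hidden size and number of labels are illustrative; `sequence_output` stands for the encoder’s final hidden states):

```python
import torch.nn as nn

class BertClassifierHead(nn.Module):
    """Linear layer + softmax over the final hidden state of [CLS]."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, sequence_output):
        # sequence_output: (batch, seq_len, hidden); [CLS] is at position 0
        cls_state = sequence_output[:, 0]
        logits = self.classifier(cls_state)
        return logits.softmax(dim=-1)    # label probabilities
```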

Other prediction tasks require some adjustments, as shown here:

Since most of the parameters are the same as in pre-training, fine-tuning is fast, so the authors recommend trying more hyperparameter settings.

2. Advantages and Disadvantages

2.1 Advantages

BERT was the state-of-the-art model as of October 2018: through pre-training and fine-tuning it swept 11 NLP tasks, which is its biggest advantage. It also uses the Transformer, which is more efficient than an RNN and can capture dependencies over longer distances. Compared with previous pre-trained models, it captures truly bidirectional context information.

2.2 Disadvantages

The main issues the authors mention in the paper concern the [MASK] tokens used during MLM pre-training:

  1. The [MASK] marker does not appear at actual prediction time, and using [MASK] too heavily during training can hurt the model’s performance
  2. Only 15% of the tokens in each batch are predicted, so BERT converges more slowly than left-to-right models (which predict every token)

3. Summary

After reading it, I felt the ideas were all things we already had; I just didn’t expect them to work so well, and neither did anyone else. However, the many points not explained in detail in the paper suggest that such excellent results also came from continuous experimentation, and the training data is larger than that of OpenAI GPT, which has a similar structure, so both data and model structure are indispensable.

Welcome to follow my WeChat official account to be the first to get cutting-edge interpretations of algorithms ⬇️

[Reference] :

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. Surpassing humans! Google dominates SQuAD, and BERT sweeps 11 NLP tests
  3. Zhihu: How do you evaluate the BERT model?