As the 2018 rookie of the year in natural language processing, BERT has dominated the NLP field for the past few years. On arrival it stunned the community, outperforming previous algorithms across the board, setting new records on 11 NLP benchmarks, and even reaching “surpassing human” performance on some of them. It is considered one of the most important language models for future NLP research and industrial applications.
Let me explain it in 2000 words 👇
How BERT works
Pre-trained models
What is a pre-trained model? In a word: a model trained to have strong generalization ability!
For example, suppose we have a large amount of Wikipedia data. We can use this huge corpus to train a model with very strong generalization ability. When we then need it for a specific scenario, such as text-similarity calculation, we only have to modify the output layer slightly and run an incremental training pass on our own data to adjust the weights a little.
The advantage of pre-training is that the specific downstream scenario no longer needs a large corpus to train on from scratch, which saves both time and compute. BERT is exactly such a pre-trained model with strong generalization ability.
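A minimal sketch of this pre-train-then-fine-tune idea, assuming the Hugging Face transformers library is available; the SimilarityClassifier class, the bert-base-uncased checkpoint, and the two-label head are illustrative choices, not details from this post:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class SimilarityClassifier(nn.Module):
    """A pre-trained BERT encoder plus one small task-specific output layer."""
    def __init__(self, pretrained_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained_name)   # generalized knowledge
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)   # task-specific layer

    def forward(self, **inputs):
        outputs = self.encoder(**inputs)
        cls_vector = outputs.last_hidden_state[:, 0]   # the [CLS] representation
        return self.head(cls_vector)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = SimilarityClassifier()
batch = tokenizer("How old are you?", "What is your age?", return_tensors="pt")
logits = model(**batch)   # fine-tuning would now update all weights with a small learning rate
print(logits.shape)       # torch.Size([1, 2])
```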
Bidirectional Encoder Representations from Transformers
BERT is short for Bidirectional Encoder Representations from Transformers.
“Bidirectional” means that when the model processes a word, it can use both the words before it and the words after it. This bidirectionality comes from the fact that BERT, unlike a traditional language model, does not predict the most likely current word given all the preceding words; instead, it randomly masks some words and predicts them using all the unmasked words.
Look at the picture 👀
ELMo: uses bidirectional information and provides context-dependent features, but there is still a gap between the unsupervised pre-training corpus and the real task corpus, so the features may or may not suit a particular task.
OpenAI GPT: uses only unidirectional information, since the decoder can see only the preceding tokens; it replaces ELMo’s LSTM with a Transformer. Although it can be fine-tuned, some tasks have inputs that differ from the pre-training format, and it struggles with the mismatch between single-sentence and sentence-pair inputs.
BERT: uses bidirectional information and generalizes well; for a specific task, only one output layer needs to be added for fine-tuning.
The main innovation of the BERT model lies in its pre-training method: it uses a Masked Language Model and Next Sentence Prediction to capture word-level and sentence-level representations, respectively.
Masked Language Model
To put it simply, the Masked Language Model can be understood as a cloze test: the authors randomly mask 15% of the words in each sentence and predict them from their context.
More formally 👉 the Masked Language Model pre-training method randomly masks some tokens in a sentence and then trains the model to predict the masked tokens from the surrounding context.
The specific operation is as follows:
Randomly mask 15% of the tokens in the corpus, then feed the final hidden vectors at the masked positions into a softmax over the vocabulary to predict the masked tokens.
Such as:
My dog is hairy → My dog is [MASK]
Here, “hairy” is masked, and unsupervised learning is used to predict the word at the masked position. However, this approach has a problem: the [MASK] token appears at 15% of the training positions but never appears during fine-tuning, creating a mismatch between the two stages. To mitigate this, the authors treat the selected positions as follows:
- 80% of the time, replace the word with [MASK]: my dog is hairy → my dog is [MASK].
- 10% of the time, replace it with a random word: my dog is hairy → my dog is apple.
- 10% of the time, keep it unchanged: my dog is hairy → my dog is hairy.
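A rough sketch of this masking scheme in plain Python; the mask_tokens function and the toy vocabulary are made up for illustration and only follow the 15% / 80-10-10 recipe described above (a real implementation works on WordPiece ids, not words):

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "dog", "hairy", "my", "is"]          # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None where no prediction is required."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:                # pick roughly 15% of the tokens
            labels.append(tok)                         # the model must recover the original token
            r = random.random()
            if r < 0.8:                                # 80%: replace with [MASK]
                masked.append(MASK)
            elif r < 0.9:                              # 10%: replace with a random word
                masked.append(random.choice(VOCAB))
            else:                                      # 10%: keep the original word
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)                        # loss is computed only at masked positions
    return masked, labels

print(mask_tokens("my dog is hairy".split()))
```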
So why use random words with a certain probability? 🧐
This is because the Transformer has to maintain a distributed contextual representation of every input token; otherwise it would simply memorize that [MASK] equals “hairy”. As for the negative impact of the random replacements, any other token (i.e. a non-“hairy” token) is affected with only a 15% * 10% = 1.5% probability, which is negligible. The Transformer’s global attention also increases how much information each position can gather, while the masking keeps the model from seeing the complete input.
Next Sentence Prediction
Given two sentences A and B, where B has a 50% chance of being the actual next sentence of A, the model is trained to predict whether B follows A. This equips the model with the ability to understand relationships in long-range context.
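A sketch of how such sentence pairs could be constructed, assuming the corpus is stored as a list of documents, each a list of sentences; the make_nsp_example function and the toy corpus are illustrative:

```python
import random

def make_nsp_example(corpus):
    """corpus: list of documents, each a list of at least two sentences."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:                          # 50%: B really is the next sentence
        sentence_b, label = doc[i + 1], "IsNext"
    else:                                              # 50%: B is a random sentence from the corpus
        sentence_b, label = random.choice(random.choice(corpus)), "NotNext"
    return sentence_a, sentence_b, label

corpus = [
    ["I love Beijing.", "I love Shanghai.", "Both are huge cities."],
    ["My dog is hairy.", "He sheds everywhere."],
]
print(make_nsp_example(corpus))
```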
Many current NLP tasks, such as QA (question answering) and NLI (natural language inference), require understanding the relationship between two sentences, so this objective helps the pre-trained model adapt to such tasks.
Personal understanding:
- BERT first uses masking to widen the field of view and increase the amount of information each position can gather; with repeated, random masking this is not fundamentally different from the sequential prediction of RNN-style training, except that the masked positions vary.
- The global view greatly reduces the difficulty of learning; then A+B/C is used as a sample, so each sample has a 50% chance of containing roughly half noise.
- The model cannot learn from masked A+B/C directly, because it does not know which part is noise, so the Next Sentence Prediction task is trained jointly with the Masked Language Model: NSP helps the model tell noise from non-noise, while the Masked Language Model does most of the semantic learning.
The input
BERT’s input is a single linear sequence in which the two sentences are separated by the [SEP] delimiter; the sequence starts with [CLS] and ends with [SEP].
The actual input of the Bert model is the sum of Token Embeddings, Segment Embeddings, and Position Embeddings.
- Token Embeddings: the word (token) vectors.
- Segment Embeddings: distinguish the two sentences and indicate which sentence each token belongs to.
- Position Embeddings: encode the position of each token in the sequence.
Toy vocabulary (the original example is a Chinese sentence tokenized character by character): {0: I, 1: love, 2: Bei, 3: jing, 4: Shang, 5: hai, 6: ,}
Sentence fed to BERT: “I love Beijing, I love Shanghai”
token embedding ids: [0, 1, 2, 3, 6, 0, 1, 4, 5]
position embedding ids: [0, 1, 2, 3, 4, 5, 6, 7, 8]
segment embedding ids: [0, 0, 0, 0, 0, 1, 1, 1, 1]
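A minimal PyTorch sketch of how the three embeddings are combined; the hidden size of 16 and the tiny vocabulary are made up to match the toy example above, not BERT’s real configuration:

```python
import torch
from torch import nn

vocab_size, max_len, num_segments, hidden = 7, 9, 2, 16    # toy sizes for the example above

token_emb = nn.Embedding(vocab_size, hidden)
position_emb = nn.Embedding(max_len, hidden)
segment_emb = nn.Embedding(num_segments, hidden)

token_ids = torch.tensor([[0, 1, 2, 3, 6, 0, 1, 4, 5]])    # "I love Beijing , I love Shanghai"
position_ids = torch.arange(9).unsqueeze(0)                # positions 0..8
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1]])  # first vs. second sentence

# The actual input to the Transformer encoder is the element-wise sum of the three.
input_embeddings = token_emb(token_ids) + position_emb(position_ids) + segment_emb(segment_ids)
print(input_embeddings.shape)   # torch.Size([1, 9, 16])
```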
The output
The output of Bert’s pre-training model is nothing more than one or more vectors.
Downstream tasks can use them in two ways: fine-tuning (updating the parameters of the pre-trained model) or feature extraction (keeping the pre-trained parameters fixed and feeding its outputs to the downstream task as features).
The original BERT paper uses the fine-tuning approach, but it also evaluates feature extraction: on the NER task, the best feature-extraction setup is only slightly worse than fine-tuning. The advantage of feature extraction is that the required vectors can be computed in advance, saved, and reused, which greatly speeds up training of the downstream model.
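A hedged sketch of the feature-extraction route, assuming the Hugging Face transformers library: compute the [CLS] vectors once, cache them, and train a lightweight downstream model on top. The checkpoint name and output file are illustrative:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()                                   # the pre-trained parameters stay frozen

sentences = ["I love Beijing.", "I love Shanghai."]

with torch.no_grad():                            # no gradients: pure feature extraction
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    features = encoder(**batch).last_hidden_state[:, 0]   # one [CLS] vector per sentence

torch.save(features, "cls_features.pt")          # reuse these features across downstream training runs
```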
Afterword
BERT’s main contributions
- Introduced the Masked Language Model, so that a bidirectional language model can be used for pre-training.
- Introduced a new sentence-level self-supervised objective for pre-training, which learns the relationship between sentences.
- Further verified that larger models work better: 12 –> 24 layers.
- Provided a general solution framework for downstream tasks, so that task-specific model customization is no longer required.
- Broke records on many NLP tasks and set off the boom in unsupervised pre-training for NLP.
BERT’s advantages
- The Transformer encoder’s self-attention mechanism gives BERT its bidirectionality.
- Because of this bidirectionality and the multi-layer self-attention, BERT has to use the cloze-style Masked Language Model to complete token-level pre-training.
- To obtain sentence-level semantic representations above the word level, BERT adds Next Sentence Prediction, trained jointly with the Masked Language Model.
- To support multi-task transfer learning, BERT designs a fairly general input layer and output layer.
- Fine-tuning is inexpensive.
BERT’s shortcomings
- The [MASK] token never appears at prediction time, and too many [MASK] tokens during training hurt model performance.
- Only 15% of the tokens in each batch are predicted, so BERT converges more slowly than left-to-right models (which predict every token).
- BERT consumes enormous hardware resources (the base model takes 16 TPU chips for four days; the large model takes 64 TPU chips for four days).
Comments and guidance from experts are welcome ~