
What is BERT

BERT (Bidirectional Encoder Representations from Transformers) is a method for pre-training the Encoder network of the Transformer model, which greatly improves performance on downstream tasks. This article does not go into specific technical details, only the main ideas. For details, see the paper: arxiv.org/pdf/1810.04…

The first task

BERT’s first pre-training task is to randomly mask one or more words and then ask the model to predict which words were masked. The specific process is shown in the figure.

  • The second input word in the figure was originally cat, but it was replaced by the [MASK] symbol.

  • [MASK] is converted to the word vector xM by the Embedding layer.

  • The output at that position is uM, and uM depends not only on xM but on all the input vectors x1 through x6. So uM carries information from the entire input. Because uM knows all the information about the context, it can be used to predict the masked word cat.

  • uM is fed into Softmax as a feature vector to obtain a probability distribution P, and the word with the maximum probability can be looked up in the dictionary. The masked word here is cat, so the model is trained to maximize the probability of cat in the output distribution P (a toy sketch of this objective follows this list).
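
To make the objective concrete, here is a toy sketch in PyTorch. The random tensor uM and the tiny vocabulary are stand-ins for BERT's real encoder output and dictionary, not the actual model.

```python
# Toy sketch of the masked-word prediction objective (not the real BERT encoder):
# a random vector u_M stands in for the Transformer output at the [MASK] position.
import torch
import torch.nn.functional as F

vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
target_id = vocab.index("cat")                         # the word that was masked

hidden_dim = 16
u_M = torch.randn(1, hidden_dim)                       # encoder output at the [MASK] position
to_vocab = torch.nn.Linear(hidden_dim, len(vocab))     # maps u_M to scores over the vocabulary

logits = to_vocab(u_M)
p = logits.softmax(dim=-1)                             # probability distribution P
loss = F.cross_entropy(logits, torch.tensor([target_id]))  # minimizing this maximizes P("cat")
print(p, loss.item())
```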

What’s the point of the first task

BERT does not need manually annotated data for this task, which greatly saves cost and time. Training data is also easy to obtain: any book or article can be used, and the labels are generated automatically, making large-scale pre-training straightforward.

The second task

BERT’s second pre-training task is to take two sentences and determine whether or not they are adjacent in the original text.

First, prepare the training data: 50% of the samples are pairs of genuinely adjacent sentences, and the remaining 50% are pairs of randomly sampled, non-adjacent sentences.

Genuinely adjacent sentence pairs are processed as shown in the figure below: the two sentences are concatenated using the symbols [CLS] and [SEP], where [CLS] stands for “classification” and [SEP] separates the two sentences. Because these two sentences really are adjacent, the pair is labeled true.

Non-adjacent sentence pairs are spliced together in the same way, as shown in the figure below, but since the two sentences are not adjacent, the pair is labeled false.
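
Below is a minimal sketch of how such 50/50 training pairs might be built, assuming the corpus is simply a Python list of sentences in document order. The corpus and helper name are illustrative, not from the paper.

```python
# Build next-sentence-prediction pairs: half adjacent (True), half random (False).
import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            # Positive sample: two genuinely adjacent sentences -> label True
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative sample: a randomly chosen, non-adjacent sentence -> label False
            j = random.choice([k for k in range(len(sentences)) if k not in (i, i + 1)])
            pairs.append((sentences[i], sentences[j], False))
    return pairs

corpus = ["The cat sat on the mat.", "It was very fluffy.",
          "Stocks fell sharply today.", "Analysts blamed inflation."]
for a, b, label in make_nsp_pairs(corpus):
    print(f"[CLS] {a} [SEP] {b}", "->", label)
```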

After processing the training data, we train the model to determine whether two sentences are adjacent to each other. The specific process is shown in the figure below.

  • Feed the concatenated token sequence [CLS][first sentence][SEP][second sentence] into the model.

  • Each token is converted to a word vector by the Embedding layer.

  • The final output at the [CLS] position is the vector c. Because the model aggregates information from the entire input, c captures both sentences, so it can be used to judge whether the two sentences are really adjacent.

  • Input the vector c into a binary classifier whose output is either 0 for false or 1 for true. The model is trained so that the predicted label for each sentence pair is as close to its real label as possible (a toy sketch of this classifier follows this list).
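
A toy sketch of the binary classifier on top of the [CLS] vector c; a random tensor stands in for the real encoder output.

```python
# Binary "are these sentences adjacent?" head on top of the [CLS] vector c.
import torch
import torch.nn.functional as F

hidden_dim = 16
c = torch.randn(1, hidden_dim)                  # encoder output at the [CLS] position
classifier = torch.nn.Linear(hidden_dim, 1)     # binary classifier on top of c

logit = classifier(c)
prob_adjacent = torch.sigmoid(logit)            # near 1 -> true, near 0 -> false
label = torch.tensor([[1.0]])                   # the two sentences really are adjacent
loss = F.binary_cross_entropy_with_logits(logit, label)
print(prob_adjacent.item(), loss.item())
```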

What’s the point of the second task

Two adjacent sentences are usually related. Training the model to make this binary judgment forces it, including the word vectors in the Embedding layer, to capture that intrinsic association.

The Transformer Encoder uses self-attention, whose role is to find correlations between the inputs. This task also trains self-attention to find the correct correlations between inputs.

The third task

The first task is to predict the masked words, and the second is to determine whether two sentences are adjacent. BERT combines these two tasks to pre-train the Encoder structure of the Transformer.

We need to prepare the data as shown in the figure below: two genuinely adjacent sentences are used as training data, and 15% of the words (here, two of them) are masked at random. There are three targets in total. Since the sentences are really adjacent, the first target is true; the second target is the actually masked word branch; and the third target is the actually masked word was.

We also need sentence pairs that are not really adjacent as training data, again with words masked. Here only one word is masked, so there are two targets: the first target is false, and the second target is the masked word south.

With three targets, as above, there are three loss functions (or two if there are only two targets). The first target is a binary classification task, while the second and third targets are multi-class classification tasks over the vocabulary. The objective function is the sum of these loss functions; its gradient with respect to the model parameters is computed, and the parameters are updated by gradient descent.
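
A toy sketch of summing the losses for the three-target example above. Random tensors stand in for the encoder outputs, and the word ids for branch and was are hypothetical.

```python
# Combine the binary (adjacent?) loss and the two masked-word losses into one objective.
import torch
import torch.nn.functional as F

hidden_dim, vocab_size = 16, 1000
c = torch.randn(1, hidden_dim)                       # [CLS] output -> first target (binary)
u_masked = torch.randn(2, hidden_dim)                # outputs at the two masked positions

nsp_head = torch.nn.Linear(hidden_dim, 1)            # adjacent or not
mlm_head = torch.nn.Linear(hidden_dim, vocab_size)   # which word was masked

nsp_label = torch.tensor([[1.0]])                    # true: the sentences are adjacent
mlm_labels = torch.tensor([17, 42])                  # hypothetical ids of "branch" and "was"

loss_nsp = F.binary_cross_entropy_with_logits(nsp_head(c), nsp_label)
loss_mlm = F.cross_entropy(mlm_head(u_masked), mlm_labels, reduction="sum")  # two word losses

total_loss = loss_nsp + loss_mlm                     # the objective: sum of the loss functions
total_loss.backward()                                # gradients for gradient descent
```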

BERT advantages

BERT can automatically generate labels without manual data annotation, which is a time-consuming and expensive job.

BERT can use all kinds of text data: books, web pages, news, etc.

The results BERT achieved are very strong.

BERT shortcomings

BERT’s idea is simple and the model is effective, but training it is very expensive. Ordinary users rarely have the time and resources to train BERT from scratch. Fortunately, pre-trained models have been released publicly and can simply be downloaded.

References

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[J]. 2018.

[2] Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need[J]. arXiv, 2017.