[toc]

Introducing the topic

Previous-generation models

Before introducing the BERT model, let’s review earlier natural language models.

Word2Vec

Disadvantages
  • Because the mapping between a word and its vector is one-to-one, polysemy cannot be handled (see the sketch after this list).
  • Word2Vec is a static approach: although versatile, it cannot be dynamically adapted to specific tasks.
Advantages
  • Since Word2Vec takes context into account during training, it is better than earlier embedding methods (though not as good as the contextual methods that appeared after 2018).

  • Compared with earlier embedding methods, its vectors have fewer dimensions, so it is faster.

  • It is versatile and can be used in a variety of NLP tasks.
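
The static nature of these vectors can be seen directly. Below is a minimal sketch (assuming the gensim library is installed; the toy corpus is made up): whatever the context, a word always maps to the same single vector, which is exactly why polysemy cannot be handled.

```python
# A minimal sketch of Word2Vec's static embeddings (assumes gensim is installed).
from gensim.models import Word2Vec

# Toy corpus: "bank" appears in two very different contexts.
corpus = [
    ["he", "deposited", "money", "at", "the", "bank"],
    ["she", "sat", "on", "the", "river", "bank"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)

# Word2Vec stores exactly one vector per word, so both senses of "bank"
# share the same representation -- this is the polysemy problem.
print(model.wv["bank"].shape)  # (50,)
```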

ELMo

Advantages
  • ELMo addresses polysemy.
  • The word vectors generated by ELMo exploit context information and can be reweighted per downstream task to suit different tasks.
Disadvantages
  • LSTMs extract long-distance features less effectively than Transformers.

  • LSTMs process tokens sequentially, so they cannot be parallelized over the sequence.

A comparison of the models

Only BERT satisfies all three requirements: parallelism, bidirectional context semantics, and deeper long-distance semantics.

BERT profile

BERT is an open-source natural language processing model released by Google. In recent years, improvements on it have appeared one after another, and it has been widely adopted in both academia and industry. Examples include Baidu’s ERNIE, Meituan’s MT-BERT, Huawei’s TinyBERT, and Facebook’s RoBERTa.

What is BERT?

BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

The paper is titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, and “BERT” is an acronym for “Bidirectional Encoder Representations from Transformers”. As shown in the figure below, BERT uses information from both directions in every layer, whereas GPT uses only left-to-right information and ELMo merely concatenates independently trained left-to-right and right-to-left representations.

  • An aside:

BERT, ELMo, and several related deep learning models are named after characters from Sesame Street.

Main principle analysis

Introducing BERT

Before introducing BERT, let’s imagine what an ideal state-of-the-art model would look like. It should:

  • Support parallel computation
  • Take context into account
  • Handle polysemy
  • Generalize well enough to handle different downstream tasks

GPT-2 comes close to meeting these requirements, but it is a unidirectional Transformer, which sacrifices right-side context during training. That brings us to BERT, the main subject of this article.

Overall structure of BERT model

BERT is a fine-tuning-based, multi-layer bidirectional Transformer encoder, where each Transformer block is essentially the same as in the original Transformer. Two model sizes are released, with the feed-forward (filter) size set to 4H in both:

BERT-Base: L=12, H=768, A=12, total parameters = 110M

BERT-Large: L=24, H=1024, A=16, total parameters = 340M

The number of layers (i.e. Transformer blocks) is denoted L, the hidden size is denoted H, and the number of self-attention heads is denoted A.
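
For reference, these two configurations can be written down with Hugging Face’s transformers library (a sketch, assuming the library is installed; the keyword names below are that library’s, not the paper’s notation):

```python
# Sketch: the two BERT sizes expressed as Hugging Face BertConfig objects
# (assumes the `transformers` library is installed).
from transformers import BertConfig

# BERT-Base: L=12 layers, H=768 hidden size, A=12 attention heads, ~110M params.
bert_base = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=4 * 768,   # feed-forward (filter) size = 4H
)

# BERT-Large: L=24, H=1024, A=16, ~340M params.
bert_large = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4 * 1024,
)
```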

Model comparison diagram

The details of BERT are as follows:

From the figure above, we can see that BERT extracts features with the Transformer encoder and thereby obtains bidirectional context semantics.

BERT model input

The input representation can encode either a single sentence or a sentence pair (for example, [question, answer]) as one token sequence. For a given token, its input representation is the sum of three embeddings, visualized in the figure below (a minimal sketch of the summation follows the list):

  • Token Embeddings: the word-piece embeddings. The first token is the special [CLS] flag, whose final hidden state can be used for downstream classification tasks; for non-classification tasks, this token can be ignored.

  • Segment Embeddings: used to distinguish the two sentences, because pre-training includes not only the language-model task but also a classification task that takes two sentences as input;

  • Position Embeddings: learned by the model (rather than generated by a fixed function).
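
A minimal PyTorch sketch of this summation (the token ids and sizes are illustrative; this is not BERT’s actual implementation):

```python
# Sketch: BERT input = token + segment + position embeddings, summed element-wise
# (PyTorch; sizes are illustrative).
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)      # word-piece embeddings
segment_emb = nn.Embedding(2, hidden)             # sentence A / sentence B
position_emb = nn.Embedding(max_len, hidden)      # learned absolute positions

input_ids = torch.tensor([[101, 7592, 2088, 102]])          # [CLS] hello world [SEP]
segment_ids = torch.zeros_like(input_ids)                    # all tokens in sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, 2, 3

# The three embeddings are added (not concatenated) to form the input representation.
# (The real implementation also applies LayerNorm and dropout afterwards.)
x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(x.shape)  # torch.Size([1, 4, 768])
```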

Pre-training

Masked Language Model (MLM)

To train deep bidirectional Transformer representations, a simple approach is used: mask part of the input tokens at random, then predict those masked tokens. This is known as a “masked LM” (MLM). The goal of pre-training is to build a language model, and BERT uses a bidirectional Transformer. Why “bidirectional”? Because when pre-training a language model for downstream tasks, we need not only the information to the left of a word but also the information to its right.

During training, 15% of the tokens in each sequence are masked at random, rather than predicting every word the way CBOW does in Word2Vec. MLM randomly masks some tokens of the input, and the goal is to predict the original vocabulary id of each masked token from its context. Of the 15% selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. Unlike left-to-right language-model pre-training, the MLM objective lets the representation fuse the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer. The Transformer encoder does not know which tokens it will be asked to predict, or which have been replaced by random tokens, so it must maintain a contextual representation of every input token. In addition, since random substitution occurs for only 1.5% of all tokens (10% of the 15%), it does not harm the model’s understanding of the language.
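
A minimal sketch of this masking scheme in plain Python (the [MASK] id and vocabulary size are placeholder values):

```python
# Sketch of BERT-style MLM masking: pick 15% of positions, then replace
# 80% of them with [MASK], 10% with a random token, and keep 10% unchanged.
import random

MASK_ID = 103          # placeholder [MASK] id
VOCAB_SIZE = 30522     # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = "not predicted"
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            labels[i] = inputs[i]          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10%: keep the original token
    return inputs, labels
```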

What is the difference between the original Transformer and BERT here? BERT’s position encoding is learned, while the original Transformer’s is generated by sinusoidal functions.

The original Transformer uses sinusoidal position encoding, a functional encoding of absolute positions. Because the Transformer is built on self-attention, the dot-product operation does give these sines and cosines some relative-position information, but that information carries no notion of direction, and after projection through the attention weight matrices it may be lost.

BERT instead uses learned position embeddings, a parameterized encoding of absolute positions, which is added to (rather than concatenated with) the word vector at the corresponding position.
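
To make the contrast concrete, here is a sketch of both encodings in PyTorch (sizes are illustrative):

```python
# Sketch: sinusoidal (fixed, original Transformer) vs. learned (trainable, BERT)
# position encodings.
import math
import torch
import torch.nn as nn

max_len, hidden = 512, 768

# Original Transformer: fixed sinusoidal encoding, computed by a function.
pe = torch.zeros(max_len, hidden)
position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, hidden, 2, dtype=torch.float) * (-math.log(10000.0) / hidden)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# BERT: learned position embeddings, trained like any other parameter.
learned_pe = nn.Embedding(max_len, hidden)

# Both variants are added to (not concatenated with) the token embeddings.
```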

Sentence coherence determination: Next Sentence Prediction (NSP)

Many sentence-level tasks, such as automated question answering (QA) and natural language inference (NLI), require understanding the relationship between two sentences. (These sentence pairs first go through the Masked LM step above, so 15% of their tokens are already masked.) For this task, the data is randomly split into two equal parts: in one part, the two sentences of each pair are contiguous in the original text; in the other, they are not. The model is then asked to predict which sentence pairs are continuous and which are not.
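
A minimal sketch of how such sentence pairs can be constructed (plain Python; the `documents` structure is a placeholder):

```python
# Sketch: build NSP pairs -- 50% real next sentences (label 1), 50% random (label 0).
import random

def make_nsp_pairs(documents):
    """documents: list of documents, each a list of sentences (placeholder structure)."""
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], 1))        # IsNext
            else:
                rand_doc = random.choice(documents)
                rand_sent = random.choice(rand_doc)
                pairs.append((doc[i], rand_sent, 0))         # NotNext
    return pairs
```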

The effects of the different pre-training objectives are compared below:

Comparing BERT-Base with the ablated variants shows that both the MLM and NSP objectives are necessary, and that bidirectional self-attention matters, especially on the SQuAD and MRPC tasks.

Fine-tuning

Tasks that can be solved (a minimal fine-tuning sketch follows the list)

  • Sentence-pair classification

Judging the relationship between two sentences, such as semantic similarity or coherence, is essentially text classification.

Input: two sentences;

Output: sentence relationship label.

  • Single sentence text classification

Determine which category the sentence belongs to, such as news automatic classification, problem area classification, etc.

Input: a sentence;

Output: the sentence’s category label.

  • Extractive question answering

Given a question and a passage of text, extract the answer to the question from the text, as in machine reading comprehension. In essence this is sequence labeling (predicting the answer span).

Input: a question, a piece of text;

Output: the start and end positions of the answer span in the text.

  • Single sentence sequence labeling

Each token of the input sentence is labeled with a target tag, such as word segmentation, part-of-speech tagging, entity recognition, etc.

Input: a piece of text;

Output: the label for each token in the text.
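
As promised above, here is a minimal fine-tuning sketch for single-sentence classification with Hugging Face transformers (assuming the library is installed; the model name, data, and hyperparameters are placeholders):

```python
# Sketch: fine-tuning BERT for single-sentence classification with Hugging Face
# transformers (model name and data are illustrative placeholders).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["the movie was great", "the movie was terrible"]   # toy data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # BERT body + one linear classification layer
outputs.loss.backward()                   # one training step
optimizer.step()
```

The same pattern applies to the other three task types: only the output head and the input format change (for example, BertForTokenClassification for sequence labeling and BertForQuestionAnswering for extractive QA).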

Model evaluation

Experimental data and corresponding NLP tasks
Introduction to the GLUE benchmark
| Abbreviation | Full name | Task |
|---|---|---|
| MNLI | Multi-Genre Natural Language Inference | Entailment / inference between sentence pairs |
| QQP | Quora Question Pairs | Whether two questions are equivalent |
| QNLI | Question Natural Language Inference | Whether a sentence answers the question |
| SST-2 | Stanford Sentiment Treebank | Sentiment analysis |
| CoLA | Corpus of Linguistic Acceptability | Linguistic acceptability judgment |
| STS-B | Semantic Textual Similarity Benchmark | Semantic similarity |
| MRPC | Microsoft Research Paraphrase Corpus | Whether sentence pairs are semantically equivalent |
| RTE | Recognizing Textual Entailment | Entailment / inference between sentence pairs |
| WNLI | Winograd Natural Language Inference | Entailment / inference between sentence pairs |

These tasks cover various aspects of understanding sentence meaning and inter-sentence relationships. The results are shown below:

As shown in the figure above, the BERT model reached the state of the art at that time.

The BERT family

After BERT was released, improved variants appeared in quick succession. Here are some of the improvements made to BERT.

Modifying the Masked Language Model (MLM) task

ERNIE

BERT masks individual tokens of the input sequence at random, so a masked token can often be inferred from its immediate collocations, and words that should be strongly correlated end up separated during training. This makes it hard to capture the semantics of units that carry a single meaning in Chinese text, such as multi-character entities and phrases.

Therefore, while still taking characters as input, ERNIE masks phrases and entity words in the input sequence as contiguous spans, so that phrase-level information is integrated into the character embeddings.

The goal is for the model to learn the semantics of entities and phrases; after training, the character embeddings carry entity- and phrase-level information, which is very helpful for text tasks involving many phrases and entities (especially entity recognition).
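
A minimal sketch of the span-masking idea (plain Python; the entity/phrase spans are assumed to come from an external detector, and this is a simplification of ERNIE’s actual procedure):

```python
# Sketch: ERNIE-style span masking -- mask every token of a selected entity/phrase,
# instead of masking tokens independently. Spans are assumed to be given.
import random

MASK_ID = 103  # placeholder [MASK] id

def mask_spans(token_ids, spans, mask_prob=0.15):
    """spans: list of (start, end) index ranges marking entities/phrases."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for start, end in spans:
        if random.random() < mask_prob:
            for i in range(start, end):          # mask the whole span at once
                labels[i] = inputs[i]
                inputs[i] = MASK_ID
    return inputs, labels
```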

ERNIE’s performance also improved over the original BERT

RoBERTa
  • Change static masking to dynamic masking

Throughout pre-training, BERT keeps the same 15% of tokens masked; that is, the 15% of tokens are randomly selected once at the start and do not change over the subsequent N epochs. This is called static masking.

RoBERTa instead duplicates the pre-training data 10 times, and each copy independently selects 15% of tokens to mask; in other words, the same sentence gets 10 different masks. Each copy is then trained for N/10 epochs, so over the N epochs of training the masked tokens of each sequence change. This is called dynamic masking.
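
A toy sketch of the contrast (plain Python; the masking function is a deliberately simplified stand-in for real MLM masking):

```python
# Sketch: static vs. dynamic masking on a toy corpus of token-id sequences.
import random

MASK_ID = 103  # placeholder [MASK] id

def mask_once(seq, mask_prob=0.15):
    # Minimal stand-in for MLM masking: randomly replace tokens with [MASK].
    return [MASK_ID if random.random() < mask_prob else t for t in seq]

corpus = [[7592, 2088, 2003, 2307], [1045, 2293, 17953, 2361]]
num_epochs = 4

# Static masking (original BERT): masks drawn once, reused for all epochs.
static_data = [mask_once(seq) for seq in corpus]
for _ in range(num_epochs):
    pass  # train on static_data -- the masked positions never change

# Dynamic masking (RoBERTa): masks re-drawn each epoch, so the masked
# positions of every sequence change over the course of training.
for _ in range(num_epochs):
    dynamic_data = [mask_once(seq) for seq in corpus]
    # train on dynamic_data
```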

Modifying the Next Sentence Prediction (NSP) task

ALBERT
  • Remove the NSP task and use the SOP task

BERT’s NSP task is a binary classification: positive samples are two consecutive sentences from the same document, while negative samples pair sentences from two different documents. The task was intended to improve downstream tasks (such as text matching), but later studies found it not very effective. NSP actually mixes two subtasks: topic prediction (whether the two sentences come from the same document) and coherence prediction (whether the two sentences are consecutive). Topic prediction is much easier than coherence prediction, and it largely overlaps with what is already learned by the MLM objective.

Therefore, ALBERT proposes a new task, Sentence-Order Prediction (SOP), which keeps only the coherence part and removes the influence of topic prediction. Positive samples are obtained the same way as in NSP, while negative samples are created by simply reversing the order of the two sentences. Because both sentences of an SOP pair come from the same document, the task focuses purely on sentence order and carries no topic signal. Moreover, a model that solves SOP can also solve NSP, but not vice versa. Adding this task improves the final results.

In short, by changing how positive and negative samples are constructed, this method removes the influence of topic prediction and makes pre-training focus on predicting sentence coherence.
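
A minimal sketch of how SOP pairs differ from NSP pairs (plain Python; the `documents` structure is a placeholder):

```python
# Sketch: SOP pairs -- positives are consecutive sentences in order (label 1),
# negatives are the same two sentences with their order swapped (label 0).
import random

def make_sop_pairs(documents):
    """documents: list of documents, each a list of sentences (placeholder structure)."""
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], 1))   # correct order
            else:
                pairs.append((doc[i + 1], doc[i], 0))   # swapped order (negative)
    return pairs
```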

Reducing the model size

  • TinyBERT

TinyBERT is a knowledge-distillation method for Transformer models developed by Huawei and Huazhong University of Science and Technology. The resulting model is less than 1/7 the size of BERT and about 9 times faster, without significant performance degradation.
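
TinyBERT’s full recipe also distills embeddings, hidden states, and attention matrices layer by layer; the sketch below shows only the basic soft-label (prediction-layer) distillation idea in PyTorch, with placeholder shapes, and is not TinyBERT’s actual method:

```python
# Sketch: the basic soft-label distillation idea (prediction layer only).
# TinyBERT additionally distills embeddings, hidden states, and attention matrices;
# this is NOT its full method. Shapes and temperature are illustrative.
import torch
import torch.nn.functional as F

temperature = 2.0
teacher_logits = torch.randn(8, 2)                        # stand-in for BERT (teacher) outputs
student_logits = torch.randn(8, 2, requires_grad=True)    # stand-in for the small student

# KL divergence between softened teacher and student distributions.
distill_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2

distill_loss.backward()
```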

Incorporating knowledge graphs

  • K-BERT

Multimodal

  • VideoBERT
  • VisualBERT
  • SpeechBERT

Python Implementation Results

Original paper Results

GLUE test results are provided by the GLUE evaluation server; the number below each task indicates the number of training examples.

Hands-on practice

Baidu 2021 Language and Intelligent Technology Competition: Machine reading Comprehension task


See the demo folder for the code

Description of the contest:

Given a question Q, a passage P, and its title T, the system must judge, based on the passage content, whether P contains the answer to Q. If so, it outputs the answer A; otherwise it outputs “No answer”. Each sample in the dataset is a quadruple, for example:

Passage (P): Guava is warm, sweet, sour, astringent… On top of that, guava is low in fat and calories. One guava weighs about 0.9 grams of fat or 84 calories. Guava has 38 percent less fat and 42 percent fewer calories than apples.

Title (T): Calories in Guava Juice – Mumsnet.com

Reference answer (A): [‘One guava weighs about 0.9 grams of fat or 84 calories’]
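
For illustration only, here is a hedged sketch of the extractive-QA pattern with Hugging Face transformers; it is not the competition code in the demo folder, and the English model name and texts are placeholders:

```python
# Sketch: extractive QA with a BERT-style model (Hugging Face transformers).
# Not the competition code -- model name and texts are illustrative placeholders,
# and a base model would first need fine-tuning on a QA dataset to give real answers.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "How many calories are in one guava?"
passage = "Guava is low in fat and calories. One guava has about 0.9 grams of fat or 84 calories."

inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model predicts start/end logits; the answer is the span between the argmaxes.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)
```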

The comparison of ERNIE, BERT, and RoBERTa results is as follows; it can be seen that ERNIE performs best.

Ways to improve on the baseline

  • Load additional datasets
  • Model ensembling
  • RoBERTa

Summary and discussion

  • BERT overview

  • BERT features

Truly bidirectional: BERT uses a bidirectional Transformer that exploits both the left and right context of a word simultaneously during feature extraction. This is fundamentally different from a bidirectional RNN, which processes the left and right context of the current word in two separate passes, and from a CNN, which only uses context within a limited window;

Dynamic representation: word vectors are computed from context, so the representation of a word changes with its context, solving the polysemy problem of Word2Vec;

Parallel capability: the Transformer uses self-attention internally, so the features of all tokens in an input sequence are extracted simultaneously, in parallel.

Easy transfer learning: with a pre-trained BERT, you only need to load the pre-trained model as the representation layer of the current task and build the task-specific structure on top, with little modification or tuning of the code.

  • Model improvements

MLM: dynamic masking (RoBERTa)

MLM: change random token masking to entity- and phrase-level span masking (ERNIE)

NSP: remove it, or replace it with the SOP task (ALBERT)

  • BERT model

BERT is a language representation model trained with large data, a large model, and enormous computational cost, and it achieved state-of-the-art (SOTA) results on 11 natural language processing tasks. You may have guessed where it came from: yes, Google. Some joke that an experiment of this scale puts the average laboratory and researcher to shame, but it does offer some valuable lessons:

  1. Deep learning is representation learning

“We show that pre-trained representations eliminate the needs of many heavily engineered task-specific architectures”. In most of the 11 tasks, only a linear output layer is added on top of the fine-tuned pre-trained representation. Even for sequence labeling (e.g. NER), where no output dependencies are modeled (i.e. non-autoregressive, no CRF), it still surpasses the previous SOTA, showing its powerful representation-learning ability.

  2. Scale matters

“One of our core claims is that the deep bidirectionality of BERT, which is enabled by masked LM pre-training, is the single most important improvement of BERT compared to previous work”. Masking in a language model is not a new idea to many people, but it is the BERT authors who verified its powerful representation-learning ability at such a scale of data + model + compute. Similar ideas, and many other models, may well have been proposed and tested by other laboratories before; limited by scale, their potential was never fully explored, and regrettably they were drowned in the flood of papers.

  3. Pre-training is important

“We believe that this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided that the model has been sufficiently pre-trained”. Pre-training has been widely used in various fields (e.g., ImageNet in CV, Word2Vec in NLP), mostly via large models and large data. How much such a large pre-trained model can help small-scale tasks is a question the authors answer here. BERT is pre-trained with a Transformer, but I suspect an LSTM or GRU would not differ that much in accuracy; parallel training capability is another matter.

  4. Views on the BERT model

In fact, there are two reasons for the high performance. Besides the model improvements, and more importantly, a large dataset (BooksCorpus, 800M words, plus English Wikipedia, 2,500M words) and large computational power (matching the large model) were used for pre-training on related tasks, yielding a steady increase in performance on the target tasks. The bidirectionality of this model is different from ELMo’s, and many people misjudge how much this form of bidirectionality contributes to the novelty. I think this detail may account for its remarkable improvement over ELMo: ELMo concatenates a left-to-right model and a right-to-left model, whereas BERT directly opens a window around each word during training, like a sequence-level CBOW.

References and citations

CSDN: Introduction to the ELMo algorithm

Li Hongyi (Hung-yi Lee): Human Language Processing lecture slides

Zhihu: A 10,000-word article giving you an overview of the BERT family

Exploration and practice of Meituan BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Attention Is All You Need

Attention? Attention!

Neural Machine Translation by Jointly Learning to Align and Translate

ERNIE: Enhanced Representation through Knowledge Integration

TinyBERT: Distilling BERT for Natural Language Understanding

K-BERT: Enabling Language Representation with Knowledge Graph

CSDN: A detailed guide to using ERNIE, the Chinese pre-trained model

Machine reading comprehension algorithms and practices

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

Zhihu: TinyBERT, smaller and faster than BERT