
Let’s learn Bert pre-training techniques

Part 1 BERT model and attention mechanism

Bert Model

A Brief introduction to Bert

  • In 2018, Google open-sourced the BERT model
  • BERT is a landmark model in the field of NLP. Since its release, many BERT variants have emerged, and BERT-style models began to top the SuperGLUE benchmark (the leaderboard of the strongest SOTA models; accessing it may require a VPN in some regions).

Figure: the development of NLP

  • New paradigm: two convergences (model architectures converging on the Transformer, and training converging on the pre-training + fine-tuning pipeline)

Bert – Overview of network architecture

Simply put, BERT's network structure is a stack of multiple Transformer encoder layers.

Transformer network architecture

Recommended: Mu Li's paper-reading video on the Transformer.

Core elements of a Transformer layer:

  • Multi-head attention
  • FFN (fully connected feed-forward layer)
  • LayerNorm (layer normalization)
  • Residual connections
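As a minimal sketch of how these four elements fit together, here is one encoder block in PyTorch (an assumed dependency; this is an illustration, not BERT's exact implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A minimal Transformer encoder block: multi-head attention,
    FFN, residual connections, and LayerNorm (post-LN, as in BERT)."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Multi-head self-attention + residual connection + LayerNorm
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.drop(a))
        # Position-wise FFN + residual connection + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x

x = torch.randn(2, 16, 768)        # (batch, seq_len, hidden)
print(EncoderBlock()(x).shape)     # torch.Size([2, 16, 768])
```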

Figure: the BERT model (left) and a Transformer block (right)

Attention mechanism

The development of the attention mechanism

2014: the attention mechanism first gained attention in neural machine translation

2017: the Transformer was born, built entirely on self-attention

[Paper link] Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014)

How attention was implemented in that original work

Calculation of the attention score

Step 1: the attention score is simply the similarity between the current decoder state $s$ (a vector representation of the sentence generated so far) and each encoder hidden state $h_i$, using the dot product as the similarity measure: $e_i = s^\top h_i$.

Calculation of the attention distribution

Step 2: for the attention distribution, softmax maps the attention scores into a probability distribution that sums to 1; these are the attention weights: $\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}$.

Calculation of the attention output:

Step 3: the attention output is the weighted sum of the $h_i$ using the weights from Step 2: $o = \sum_i \alpha_i h_i$.

The three figures above illustrate the attention mechanism in a Seq2Seq model.
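A minimal NumPy sketch of the three steps (the decoder state `s` and encoder states `H` are random placeholders for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H = np.random.randn(5, 8)   # encoder hidden states h_1..h_5, dimension 8
s = np.random.randn(8)      # current decoder state (the "query")

scores = H @ s              # Step 1: attention scores via dot product
weights = softmax(scores)   # Step 2: attention distribution (sums to 1)
output = weights @ H        # Step 3: attention output = weighted sum of h_i

print(weights.sum())        # 1.0
print(output.shape)         # (8,)
```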

General formulation of attention mechanism

In this first work, attention lives inside a Seq2Seq model. How can we extract the attention mechanism from that setting and express it in more general mathematical language? We need a general formulation, so that we can better understand the principle of attention and apply it in other settings.

[Definition]: Given a set of key–value vector pairs and a query vector, the attention mechanism computes a weighted average of the values, where each value's weight is determined by how well the query matches the corresponding key (the query "attends to" the keys).

Let's use the old joke about putting an elephant into a refrigerator (in three steps) as an analogy for how the attention mechanism works.

Step 1 (open the refrigerator door): evaluate the relevance between the query and each key to obtain a relevance score

  • There are many ways to evaluate relevance. The dot product is the most common form, but there are others. The three most common forms are:
  • Dot-product attention (the original form): $e = q^\top k$
  • Multiplicative (bilinear) attention: $e = q^\top W k$, where $W$ is a learnable parameter matrix (essentially the form the Transformer adopts via learned projections)
  • Additive attention: $e = v^\top \tanh(W_1 q + W_2 k)$ (the most popular form in machine translation around 2014–2015)

Step 2 (put the elephant in): normalize the relevance scores into a probability distribution (softmax)

Step 3 (close the refrigerator door): take the weighted average of the values under that distribution
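A small NumPy sketch of the three scoring forms from Step 1 (`W`, `W1`, `W2`, and `v` are illustrative randomly initialized parameters, not trained weights):

```python
import numpy as np

d_q, d_k, d_h = 8, 8, 16
q, k = np.random.randn(d_q), np.random.randn(d_k)
W  = np.random.randn(d_q, d_k)      # bilinear / multiplicative parameter matrix
W1 = np.random.randn(d_h, d_q)
W2 = np.random.randn(d_h, d_k)
v  = np.random.randn(d_h)

dot_score            = q @ k                            # dot-product attention
multiplicative_score = q @ W @ k                        # multiplicative (bilinear) attention
additive_score       = v @ np.tanh(W1 @ q + W2 @ k)     # additive (Bahdanau-style) attention
print(dot_score, multiplicative_score, additive_score)
```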

Self-attention mechanism

Can we obtain a representation of a sentence using the attention mechanism?

Modeling sentence representations with attention

  • From the point of view of a single word: attention is computed between the vector representation of, e.g., "it" and all the words in the sentence, and the result is used to update its vector representation;

  • The attention operation is fully parallel, so every pair of words in the sentence interacts;


[Matrix form]: verify the matrix dimensions

Self-attention for the entire sentence: stacking the sentence's existing hidden vectors into a matrix $X \in \mathbb{R}^{n \times d}$, we have $Z = \mathrm{softmax}(X X^\top)\, X$.
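A quick NumPy check of the matrix dimensions under this formulation (random vectors, purely illustrative):

```python
import numpy as np

n, d = 6, 8                           # 6 tokens, hidden size 8
X = np.random.randn(n, d)             # token vectors stacked row-wise

scores = X @ X.T                      # (n, n): every token scores every other token
scores -= scores.max(axis=1, keepdims=True)                      # numerical stability
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
Z = A @ X                             # (n, d): updated token representations

print(A.shape, Z.shape)               # (6, 6) (6, 8)
```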

[Open-source tool] BertViz visualizes BERT's attention and representations

NLP models before BERT

Before 2017, LSTM networks dominated the NLP field, so why was this kind of RNN abandoned?

RNN models:

  • Interactions between all the words in a sentence require O(L) sequential steps, so long-distance semantic dependencies are hard to learn (vanishing gradients).
  • Poor parallelism: both forward and backward propagation are constrained by the O(L) sequential dependency.

CNN models:

  • CNN models are efficient
  • However, it is hard for a CNN to model long-distance dependencies in text: doing so requires stacking very many CNN layers;

Self-attention needs auxiliary components

Self-attention alone is not enough to build the full model; several auxiliary components are needed

Auxiliary 1: Sentence order

  • Position Encoding
    • Where does the position encoding go? It can be added at the embedding layer, or added to every layer
    • Concrete forms: sinusoidal (sine/cosine) position encoding (as in the Transformer paper), or learned position encoding (what BERT actually uses): $P_i$ is randomly initialized and learned along with the model. See the sketch below.
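A sketch of the two options, assuming PyTorch: the sinusoidal encoding from the Transformer paper, and a learned position embedding table as used in BERT (sizes are illustrative):

```python
import torch
import torch.nn as nn

def sinusoidal_pe(max_len, d_model):
    """Sine/cosine position encoding from the Transformer paper."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Learned position embeddings: randomly initialized and trained with the model (BERT-style)
learned_pe = nn.Embedding(512, 768)

print(sinusoidal_pe(512, 768).shape)          # torch.Size([512, 768])
print(learned_pe(torch.arange(512)).shape)    # torch.Size([512, 768])
```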

Auxiliary 2: Nonlinear transformation

  • Add a fully connected feed-forward layer:
    • Formula: $\mathrm{FFN}(x) = W_2\, f(W_1 x + b_1) + b_2$
    • Note that the operation is pointwise: the output at position $i$ is determined only by $x_i$, without taking in information from the rest of the sentence.
    • Activation function improvement: GELU, a smoother variant of ReLU (its derivative is not zero when the input is negative). A small sketch follows.
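A small PyTorch sketch of the position-wise FFN with GELU, including a check that the operation really is pointwise (dimensions follow BERT-base but are otherwise illustrative):

```python
import torch
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(768, 3072),   # expand
    nn.GELU(),              # smoother than ReLU: nonzero gradient for negative inputs
    nn.Linear(3072, 768),   # project back
)

x = torch.randn(2, 16, 768)      # (batch, seq_len, hidden)
y = ffn(x)                       # applied pointwise: each token independently
print(y.shape)                   # torch.Size([2, 16, 768])

# Pointwise check: perturbing token 0 does not change the output at token 1
x2 = x.clone(); x2[:, 0, :] += 1.0
print(torch.allclose(ffn(x2)[:, 1, :], y[:, 1, :]))   # True
```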

Auxiliary 3: Multi-head attention

  • The multi-head mechanism:
    • Different heads attend to different context dependencies
    • A model-ensembling effect (similar in spirit to Google's Inception modules)
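A sketch of how a hidden vector is split into heads and merged back (pure tensor reshaping in PyTorch; 12 heads of size 64 for a 768-dim model, as in BERT-base):

```python
import torch

batch, seq_len, d_model, n_heads = 2, 16, 768, 12
d_head = d_model // n_heads                     # 64

x = torch.randn(batch, seq_len, d_model)
# Split into heads: (batch, n_heads, seq_len, d_head)
heads = x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)
print(heads.shape)                              # torch.Size([2, 12, 16, 64])
# ... each head runs scaled dot-product attention independently ...
# Merge the heads back: (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))                   # True
```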

Auxiliary 4: Scaling

  • Scaled dot-product attention:
    • Divide by a factor of $\sqrt{d_k}$ when computing the attention score ($d_k$ is usually 64 per head in BERT): $\mathrm{score}(q, k) = \frac{q^\top k}{\sqrt{d_k}}$

    • Effect: when the model dimension is higher, the dot products become larger in magnitude; softmax then saturates, the gradients become small, and training becomes unstable.

Can the effect of scaling be demonstrated? Yes: see the small numerical demonstration below.
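A rough NumPy demonstration (random values, so exact numbers vary): without the $1/\sqrt{d_k}$ factor, dot products grow with the dimension and softmax saturates.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

np.random.seed(0)
d_k = 64
q, K = np.random.randn(d_k), np.random.randn(10, d_k)

raw = K @ q                      # unscaled scores: variance grows with d_k
scaled = raw / np.sqrt(d_k)      # scaled dot-product attention

print(softmax(raw).max())        # typically close to 1 -> nearly one-hot, tiny gradients
print(softmax(scaled).max())     # much flatter distribution, healthier gradients
```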

Auxiliary 5: Residual connections

  • Both the multi-head attention layer and the FFN have a residual connection
  • The point of residual connections is to alleviate the vanishing-gradient problem in deep networks: they change the shape of the loss function and make the loss landscape smoother.

Auxiliary 6: LayerNormalization

  • Layer normalization: compute the mean and variance of each token vector (over its hidden dimension);

  • BatchNorm is commonly used in CV, while LayerNorm is used in NLP because sentence lengths are not fixed
  • Significance of layer normalization: similar to residual connections, it makes the loss landscape smoother (2018) and reduces the variance of the gradients (2019)

The figure below shows the similarities and differences between BatchNormalization and LayerNormalization.
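A minimal PyTorch comparison of the two: LayerNorm normalizes each token vector over its hidden dimension, while BatchNorm normalizes each feature over the batch.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16, 768)                   # (batch, seq_len, hidden)

ln = nn.LayerNorm(768)                        # mean/var over the hidden dim, per token
y_ln = ln(x)
print(y_ln.mean(dim=-1)[0, 0].item())         # ~0 for every single token vector

bn = nn.BatchNorm1d(768)                      # mean/var over batch (and positions), per feature
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (batch, channels, length)
print(y_bn.mean(dim=(0, 1))[0].item())        # ~0 for every feature across the batch
```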

  • The placement of layer normalization matters:
    • For Post-LN, the gradient of the parameters of the last layer satisfies a bound that does not shrink with depth;
    • For Pre-LN, the corresponding upper bound is smaller (Xiong et al., 2020, Microsoft Research Asia).

The upper bound of the gradient becomes smaller, so training becomes more stable. However, the paper only shows that placement matters; the performance of different placements still has to be verified one by one. In theory, LayerNorm can be placed in more than 70 different combinations of positions.
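A sketch of the two placements in PyTorch (`sublayer` here is just a linear layer standing in for attention or the FFN):

```python
import torch
import torch.nn as nn

d = 768
norm = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)        # stands in for attention or the FFN
x = torch.randn(2, 16, d)

post_ln = norm(x + sublayer(x))   # Post-LN: LayerNorm after the residual add (original Transformer / BERT)
pre_ln  = x + sublayer(norm(x))   # Pre-LN: LayerNorm inside the residual branch (Xiong et al., 2020)
print(post_ln.shape, pre_ln.shape)
```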

  • Searching for the LayerNorm placement:
    • AutoTrans proposed using automated machine learning to search for the optimal LayerNorm placement;
    • e.g., a new Transformer structure learned on the Multi-30K dataset

Network Optimization Techniques

  • Multi-head attention mechanism
  • Scaling
  • Residual connections
  • LayerNormalization


Review

Self-attention alone is not enough to build the full model; the auxiliary components above are needed

Derivation of the scaling factor:

Blog.csdn.net/qq_37430422…

Part 2: Pre-training and fine-tuning

Pre-training Knowledge Tree

The era of pre-trained models

Since Google proposed BERT, pre-trained models have kept growing, and parameter counts long ago passed the hundred-million level. GPT-3 reached the hundred-billion level. Huawei's PanGu model and WuDao 2.0 from the Zhiyuan Institute (BAAI) have also reached the hundred-billion level. Google's Switch Transformer, with over a trillion parameters, is currently the largest pre-trained model.

  • Pre-trained models have genuinely pushed the boundaries of academia and industry and brought a new paradigm to NLP (alleviating the data-annotation problem)

Pre-training tasks

Word Vector (Word2vec) review

You shall know a word by the company it keeps! — J.R.Firth (1957)

  • Pre-BERT training: word vectors
    • The vocabulary $V$: each word corresponds to an index;
    • Each word $w$ corresponds to a vector $v_w$;
    • Trained by predicting neighboring words;
      • Skip-gram
      • CBOW;
    • Skip-gram: formally, if $c$ is the center word and $o$ is a context word, the probability of $o$ given $c$ is:

$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$
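A tiny NumPy sketch of this skip-gram probability (vocabulary size, dimensions, and vectors are made up for illustration):

```python
import numpy as np

V, d = 10, 4                    # toy vocabulary size and embedding dimension
U = np.random.randn(V, d)       # "output" (context) vectors u_w
v_c = np.random.randn(d)        # "input" vector of the center word c

logits = U @ v_c                # u_w . v_c for every word w in the vocabulary
p = np.exp(logits - logits.max())
p /= p.sum()                    # P(o | c) for every candidate context word o

o = 3                           # some context word index
print(p[o], p.sum())            # probability of word o given c; the distribution sums to 1
```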

Limitations of word vectors:

The complete meaning of Word is always contextual, and no study of meaning apart from a complete context can be taken seriously. — J.R.Firth (1957)

  • It is difficult to handle polysemy (one word, several meanings)
  • The vectors are fixed and cannot express context, e.g. the word "ant" (蚂蚁) in:
    • "Ant is going public soon!" (the company Ant Group)
    • "Ants climbing a tree is delicious!" (a Sichuan dish)
    • "Ants live in colonies." (the insect)


Word2vec Based Models

  • In 2017 and before, the general shape of NLP models:
    • Word vectors + encoder (figure below):
      • An LSTM or Transformer encoder learns how to extract context information from the training set;

    • What are the flaws of this pattern?
      • The randomly initialized encoder puts a lot of pressure on the optimizer;
      • The dataset has to be good enough and large enough for the encoder to learn how to extract context information;


Leading to the pre-training task

We now want an effective pre-trained embedding + encoder. How? And trained on what?

  • First, labeled tasks are excluded; a large amount of unlabeled text data is used instead.
  • Unsupervised –> self-supervised:

Self-supervised learning is the dark matter of deep learning (dark matter is hard to observe but is 80% to 95% of the matter in the universe) — Yann LeCun

BERT pre-training task – Mask prediction task

Masked language modeling (MLM)

The mask prediction task:

(1) Mask a word (replace it with the special "[MASK]" token);

(2) Feed the sentence into the BERT network;

(3) Predict the masked word.

What might be the problem with the mask prediction task?

  • Downstream tasks, e.g. predicting the sentiment of a sentence, contain no [MASK] tokens;

The implementation of the mask prediction task therefore needs some refinements:

  • Only a small fraction (15%) of the tokens in the corpus are selected for prediction.
  • Of that 15%:
    • Most (80%) are masked, i.e. replaced by [MASK];
    • Some (10%) are randomly replaced with other tokens (the model must still predict what the token should be);
    • Some (10%) keep the original token (the model must still predict the token here);

It can be seen that after this kind of language-model training, BERT is naturally able to do some tasks, such as sentence correction, out of the box. A hedged sketch of the masking strategy follows.
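A hedged Python sketch of the 15% / 80-10-10 masking strategy described above (the vocabulary and tokens are made up; this is not BERT's actual data pipeline):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """Pick ~15% of tokens as prediction targets; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mlm_prob:
            labels[i] = tok                       # the model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # else 10%: keep the original token
    return inputs, labels

vocab = ["play", "the", "song", "little", "robin", "red", "##bre", "##ast"]
print(mask_tokens(["play", "the", "song", "little", "robin"], vocab))
```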

BERT pre-training task – next sentence prediction

Next Sentence Prediction (NSP)

Determine whether sentence B follows sentence A.

So the input to BERT is formatted as [CLS] A sent [SEP] B sent [SEP];

The sentence-pair relation is aggregated into the [CLS] position of the input, and at prediction time BertPooler is used to extract a representation of the entire input sequence:

  • Note: we do not directly take the raw vector at the [CLS] position; instead it passes through the BertPooler module, which applies an MLP and a tanh operation to obtain a pooled vector representation, which is then fed into the binary classification layer.
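A simplified PyTorch sketch of this pooling step (it mirrors the shape of the BertPooler idea, but is not the official implementation):

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Take the [CLS] position, pass it through Linear + tanh,
    then feed the result to a binary (is-next-sentence) classifier."""
    def __init__(self, hidden=768):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.activation = nn.Tanh()
        self.nsp_head = nn.Linear(hidden, 2)      # IsNext / NotNext

    def forward(self, sequence_output):           # (batch, seq_len, hidden)
        cls_vec = sequence_output[:, 0]           # vector at the [CLS] position
        pooled = self.activation(self.dense(cls_vec))
        return self.nsp_head(pooled)              # (batch, 2)

print(Pooler()(torch.randn(2, 16, 768)).shape)    # torch.Size([2, 2])
```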

Subword tokenizer (subword tokenization)

BERT tokenizer

For example:

  • The sentence "play the song little robin redbreast" becomes "[CLS] play the song little robin red ##bre ##ast [SEP]" when fed into the model.
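The example can be reproduced with the HuggingFace transformers tokenizer (assuming that library is installed; exact output may vary slightly across versions):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("play the song little robin redbreast")
print(tokens)
# e.g. ['play', 'the', 'song', 'little', 'robin', 'red', '##bre', '##ast']

ids = tokenizer.encode("play the song little robin redbreast")  # adds [CLS] ... [SEP]
print(tokenizer.convert_ids_to_tokens(ids))
```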

Why use a subword tokenizer:

  • Traditional vocabularies treat whole English words as units; they are generally very large, and rare (tail) words are under-trained
  • Traditional word-level representations do not handle rare or unseen words well (the OOV problem)
  • Traditional tokenization is not conducive to learning relationships between affixes
    • The relationships learned among "old", "older", and "oldest" cannot be generalized to "smart", "smarter", and "smartest".
  • Character embeddings, as a solution to OOV, are too fine-grained and have a high memory/length overhead
  • Subwords, whose granularity lies between words and characters, balance the OOV problem well

Common subword models: Byte Pair Encoding (BPE), WordPiece

BPE algorithm (2016) :

Generating the vocabulary:

(1) Prepare a corpus and decide the desired subword vocabulary size;

(2) Split words into their smallest units, e.g. the 26 English letters plus the various symbols, as the initial vocabulary;

(3) Count the frequency of adjacent unit pairs within words over the corpus, and merge the pair with the highest frequency into a new subword unit;

(4) Repeat step 3 until the vocabulary size chosen in step 1 is reached, or the highest pair frequency in the next iteration is 1.

After obtaining the subword vocabulary, each word can be encoded as follows:

  • Sort all the subwords in the vocabulary by length, in descending order;
  • For a word w, walk through the sorted vocabulary and check whether the current subword is a substring of the word (greedy longest match); if so, output that subword and continue matching the remainder of the word;
  • If part of the word still has no match after traversing the whole vocabulary, replace the remainder with a special unknown-token symbol (e.g. "<unk>");

For example:

  • Suppose the following words (and their frequencies) are present in the corpus;
  • Observe how the vocabulary changes at each merge step (a toy sketch follows below).
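A toy Python sketch of the BPE merge loop in the spirit of Sennrich et al. (2016); the example words and frequencies are illustrative:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs within words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the chosen pair into a single new symbol everywhere it occurs."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# Words pre-split into characters, with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best, list(vocab))
```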

The WordPiece algorithm (Bert’s approach)

The steps for generating the vocabulary are similar to BPE's; the prefix "##" marks a token that does not begin a full word;

The biggest difference from BPE lies in how the two subwords to merge are chosen: BPE merges the adjacent pair with the highest frequency, while WordPiece merges the adjacent pair that most increases the likelihood of the language model over the training data;

  • Suppose the sentence $S = (t_1, t_2, \ldots, t_n)$ consists of $n$ tokens, and assume each subword occurs independently
  • The log-likelihood of the sentence under this language model is $\log P(S) = \sum_{i=1}^{n} \log P(t_i)$
  • Suppose the two subwords $t_x$ and $t_y$ at adjacent positions are merged, and the subword produced by the merge is denoted $t_z$. The change in log-likelihood is then $\Delta = \log P(t_z) - \log P(t_x) - \log P(t_y) = \log \frac{P(t_z)}{P(t_x)\,P(t_y)}$

  • It is easy to see that the change in likelihood is exactly the (pointwise) mutual information between the two subwords.

WordPiece example:

  • English: uses the bert-base-uncased tokenizer
    • 30,522 subwords; English is lowercased
    • E.g., the sentence "play the song little Robin redbreast" becomes ["play", "the", "song", "little", "robin", "red", "##bre", "##ast"];
  • Chinese: uses Google's bert-base-chinese tokenizer
    • 21,128 tokens;
    • Case-insensitive;
    • Chinese is split into single characters;
    • E.g., the sentence "我很喜欢一首歌: yesterday once more." is tokenized as ["我", "很", "喜", "欢", "一", "首", "歌", ":", "yes", "##ter", "##day", "on", "##ce", "more", "."];

BERT embedding layer

  • The embedding layer is the sum of three embeddings:
    • Token embeddings;
    • Segment embeddings;
    • Position embeddings;

Commonly used BERT versions

  • Google has open-sourced two English versions of BERT and one Chinese version:
    • BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
    • BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
    • BERT-base, Chinese: same architecture as BERT-base (12 layers, 768-dim hidden states, 12 attention heads).
  • Pre-training corpus:
    • English models: BooksCorpus (800 million words) and English Wikipedia (2,500 million words)
    • Chinese model: Chinese Wikipedia, roughly one-tenth the size of English Wikipedia;
  • How expensive is pre-training?
    • 64 TPU chips for a total of 4 days.
  • Fine-tuning, by contrast, is relatively quick!

Fine-tuning task

Fine-tuning Knowledge Tree

BERT brought SOTA

Two questions need to be answered:

Q1: How to fine-tune?

Q2: How to fine-tune better?

BERT fine tuning: sentence classification task

Sentence Classification

Classification tasks:

  • Categorize texts according to given criteria;
  • Typical application scenarios of text classification:
    • Sentiment analysis: comment/dialogue understanding
    • Topic categories: news and public opinion/academic articles
    • Intent recognition: search scenarios/dialogue scenarios
    • Content moderation: identifying inappropriate, illegal, discriminatory, or violent content
    • ...
  • Fine-tuning for the classification task:
    • How is it done? Essentially the same as next sentence prediction
    • The input sentence is formatted as [CLS] A sent [SEP].
    • How does the representation of a sentence become a single vector?
      • BertPooler: take the vector representation of [CLS], obtain a pooled vector through the MLP and tanh operations, then feed it into the classification layer (a sketch follows below);
      • Other pooling operations: [max, avg, attention, capsule]
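A hedged fine-tuning sketch using HuggingFace transformers (assumed dependency; the sentences and labels are made up for illustration):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie!", "terrible plot."], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])                 # illustrative sentiment labels

outputs = model(**batch, labels=labels)       # cross-entropy loss over the pooled [CLS] output
outputs.loss.backward()                       # then step an optimizer, as in ordinary training
print(outputs.logits.shape)                   # torch.Size([2, 2])
```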

Sentence pair classification

  • Sentence rewriting scenarios
  • Q&A / search scenarios

Fine-tuning for classification tasks:

  • Loss function:
    • Cross-entropy loss (CE)
  • The fine-tuning procedure is no different in principle from ordinary network training;

A nice bonus of fine-tuning on text classification tasks:

BERT fine tuning: Sequence annotation task

Sequence labeling

Sequence labeling tasks:

  • For a sequence $x_1, \ldots, x_n$ to be labeled, we need to predict a tag for each $x_i$;
  • So we first define the tag set;
  • Chinese word segmentation: tag set {Begin, Middle, End, Single};
    • E.g., the phrase 危险基因协同的神经生物学作用 ("the neurobiological role of risk-gene synergy") is segmented by turning it into a sequence labeling task such as 危/B 险/E 基/B 因/E 协/B 同/E 的/S …
  • Named entity recognition: tag the entities in a sentence; uses the BIO tagging scheme; entity types include {PER, ORG};
    • E.g., 乔/B-PER 布/I-PER 斯/I-PER 就/O 职/O 于/O 苹/B-ORG 果/I-ORG 公/I-ORG 司/I-ORG ("Jobs works at Apple");
  • Part-of-speech (POS) tagging: tag the part of speech of each word, again with position-prefixed tags at the character level;
    • E.g., based on jieba's segmentation: 我/B-r 在/B-p 北/B-ns 京/I-ns …

Sequence labeling with BERT:

  • Input format: [CLS] A sent [SEP]
  • Method 1: pass each token's vector representation through a linear layer + softmax;
  • Method 2: add a CRF layer, which learns transition patterns between labels and avoids illegal sequences such as "B-PER, I-ORG".
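A sketch of Method 1 (per-token linear layer + softmax/argmax) using HuggingFace transformers (assumed dependency; the label set and sentence are illustrative, and a CRF layer is not included):

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]    # illustrative BIO tag set
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=len(labels))

batch = tokenizer(["乔布斯就职于苹果公司"], return_tensors="pt")
logits = model(**batch).logits        # (batch, seq_len, num_labels)
pred = logits.argmax(-1)              # one tag per token; a CRF layer could replace this step
print(pred.shape)
```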

BERT fine-tuning: Relational classification tasks

relation classification

Relation classification tasks:

  • Extract structured knowledge from unstructured text: determine the semantic relation between a head entity and a tail entity;
  • Significance: building knowledge graphs
  • Method 1:
    • Encode the sentence with BERT;
    • Obtain vector representations of the two entities by pooling;
    • Concatenate the head and tail entity vectors, then classify through a linear layer;

  • Method 2:
    • BERT adds relation/entity position encodings to the embedding layer

Method 3: insert newly defined marker characters into the sentence to indicate the positions of the head and tail entities, using reserved tokens such as [unused1]

A look at the class-imbalance problem

Sentence label task

Scenarios with imbalanced samples:

  • Zipf's law: in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table
  • Few positive examples in binary classification: e.g., the proportion of positive nucleic-acid test results is very small.

  • Re-sampling: let $C$ be the number of classes in the dataset and $n_i$ the number of samples in class $i$; the probability of sampling an example from class $i$ can be written as $p_i = \frac{n_i^{q}}{\sum_{j=1}^{C} n_j^{q}}$:
    • Instance-balanced sampling: each sample is drawn with equal probability, i.e. $q = 1$, giving $p_i = \frac{n_i}{\sum_{j=1}^{C} n_j}$

    • Class-balanced sampling: each class is drawn with equal probability, i.e. $q = 0$, giving $p_i = \frac{1}{C}$

    • General (square-root) re-sampling, typically assuming $q = \frac{1}{2}$.

  • Re-weighting: take binary classification as an example
    • Ordinary cross-entropy loss: $\mathrm{CE}(p, y) = -y \log p - (1 - y)\log(1 - p)$
    • By adding a coefficient $\alpha$, the contribution of minority-class samples to the total loss can be controlled: $\mathrm{CE}_{\alpha}(p, y) = -\alpha\, y \log p - (1 - \alpha)(1 - y)\log(1 - p)$

    • Focal Loss [2017, CV, object detection] combined with class re-weighting: $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
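A hedged PyTorch sketch of binary focal loss with class re-weighting (α and γ use the common defaults from the 2017 paper; the logits and targets are made up for illustration):

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples via (1 - p_t)^gamma
    and re-weights the two classes via alpha."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)                      # prob of the true class
    alpha_t = torch.where(targets == 1, torch.tensor(alpha), torch.tensor(1 - alpha))
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```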