Author: Serendipity (Zhihu)

Background

A language model is how a machine understands human language. The Transformer, published in 2017, was an attempt to free language modeling from RNNs and LSTMs. The subsequent Bert showed that brute force can work wonders, using a larger model and more data to raise NLP benchmarks by a large margin. Partly because GPT doubled down on the auto-regressive approach, XLNet combined the strengths of GPT and Bert and then beat Bert with more data. It was not long before Roberta, an enhanced Bert, beat XLNet with still more data. However, once a Bert-style model grows to a certain size it is limited by hardware resources, so Google reduced Bert's size through matrix factorization and parameter sharing; as a result, with the same number of parameters as Bert, Albert's inference ability is further improved. I have been studying language models over the past few months, so let me record my understanding of the Transformer and the other representative NLP models.

1. Transformer

1.1 Background of Transformer

Before 2017, language models were built on RNNs and LSTMs, which can learn relationships across a context but cannot be parallelized, making training and inference difficult. The Transformer paper therefore proposed a model for language modeling based entirely on attention. The Transformer frees NLP tasks from their dependence on RNNs and LSTMs and uses self-attention to model context, improving the speed of both training and inference. It is also the foundation of the more powerful NLP pre-training models that followed, so it is worth describing at length.

1.2 Process of Transformer

Flow chart of Transformer

<1> Inputs are the padded input token ids, with size [Batch size, Max seq length].

<2> Initialize the embedding matrix and map Inputs to token embedding via embedding lookup; the size is [Batch size, Max seq length, embedding size]. The token embedding is then multiplied by the square root of the embedding size.

<3> Create the positional encoding with the sin and cos functions to represent a token's absolute position, add it to the token embedding, and then apply dropout.

<4> multi-head attention

<4.1> The input token embedding is passed through Dense layers to generate Q, K and V, each of size [Batch size, Max seq length, embedding size]. Q, K and V are then each split along the last dimension into num heads parts and concatenated along the 0th dimension, giving size [num heads*batch size, Max seq length, embedding size/num heads]; this completes the multi-head split (see the sketch after <4.7>).

<4.2> Transpose the 1st and 2nd dimensions of K, and then dot Q with the transposed K. The size of the result is [num heads*batch size, Max seq length, Max seq length].

<4.3> Divide the result of <4.2> by the square root of hidden size (in Transformer, hidden size=embedding size) to complete the scale operation.

<4.4> Set the dot-product results at the padded positions of <4.3> to a very small number (-2^32+1) to complete the mask operation, so that after the subsequent softmax the padded positions can be ignored.

<4.5> Apply the softmax operation to the masked result.

<4.6> Dot product softmax result with V to get attention result, size is [num heads*batch size, Max seq length, hidden size/num heads].

<4.7> Split the attention result along the 0th dimension into num heads parts and concat them along the 2nd dimension to produce the multi-head attention result, with size [Batch size, Max seq length, hidden size]. In Figure 2 there is another linear operation after the concat, but it is not in the code.
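
To make steps <4.1>-<4.7> concrete, here is a minimal numpy sketch of multi-head scaled dot-product attention with a padding mask. The shapes, random weights and padding pattern are made up for illustration; this is not the official Transformer code, and it scales by the square root of hidden size as described above (the original paper scales by the square root of the per-head dimension).

```python
import numpy as np

batch, max_len, hidden, num_heads = 2, 5, 8, 2
head_dim = hidden // num_heads
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(batch, max_len, hidden))      # stand-in for the output of <3>
pad_mask = np.array([[1, 1, 1, 0, 0],                      # 1 = real token, 0 = padding
                     [1, 1, 1, 1, 1]])

w_q, w_k, w_v = (rng.normal(size=(hidden, hidden)) for _ in range(3))
q, k, v = token_emb @ w_q, token_emb @ w_k, token_emb @ w_v     # <4.1> Q, K, V via Dense

def split_heads(x):  # [batch, len, hidden] -> [num heads*batch, len, hidden/num heads]
    x = x.reshape(batch, max_len, num_heads, head_dim)
    return x.transpose(0, 2, 1, 3).reshape(batch * num_heads, max_len, head_dim)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(hidden)           # <4.2>-<4.3> dot product + scale
key_mask = np.repeat(pad_mask, num_heads, axis=0)[:, None, :]   # broadcast over heads and queries
scores = np.where(key_mask == 0, -2.0**32 + 1, scores)          # <4.4> mask the padded keys

weights = np.exp(scores - scores.max(-1, keepdims=True))        # <4.5> softmax
weights = weights / weights.sum(-1, keepdims=True)

attn = weights @ vh                                             # <4.6> weighted sum of V

attn = attn.reshape(batch, num_heads, max_len, head_dim)        # <4.7> concat the heads back
out = attn.transpose(0, 2, 1, 3).reshape(batch, max_len, hidden)
print(out.shape)                                                # (2, 5, 8)
```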

<5> Add the token embedding and the multi-head attention result, then perform layer normalization.

<6> Pass the result of <5> through 2 Dense layers, with activation=relu in the first layer and activation=None in the second.

<7> has the same function as <5>.

<8> Outputs are the padded output token ids. Unlike Inputs, they must be preceded by a start-of-sequence symbol to indicate the beginning of sequence generation; Inputs do not need this symbol.

<9> has the same function as <2>.

<10> has the same function as <3>.

<11> is similar to <4>, except for the mask. The mask in <11> not only sets the dot products at padded positions to a very small number, but also sets the dot products between the current token and its subsequent tokens to a very small number.

<12> has the same function as <5>.

<13> is similar to <4>. The only difference is that Q comes from the Outputs side (the result of <12>), while K and V come from the result of <7> (the encoder output).

<14> has the same function as <5>.

<15> has the same function as <6>.

<16> functions the same as <7>, and the size of the result is [Batch size, Max Seq Length, hidden Size].

<17> Dot the result of <16> with the transpose of the embedding matrix; the size of the generated result is [Batch size, Max seq length, vocab size].

<18> Apply the softmax operation to the result of <17>; the output represents the predicted probability distribution over the vocab for the next token at the current position.

<19> Compute the cross entropy between the predicted distribution over the vocab and the one-hot encoding of the true next token, then sum the cross entropy of the non-padding tokens as the loss. Train with Adam.

1.3 Technical details of transformer

Self-attention in the Transformer evolved from ordinary dot-product attention; for the evolution process, see the article "Attention is everywhere, but do you really understand it?".

1.3.1 Why is <2> multiplied by the square root of embedding size?

The paper does not explain why this is done. Having looked at the code, my guess is that the embedding matrix is initialized with Xavier init, which gives it a variance of 1/embedding size; multiplying by the square root of embedding size brings the variance back to about 1, which may be more conducive to the convergence of the embedding matrix.

1.3.2 Why is positional encoding added to the input embedding?

Because self-attention is position-independent: no matter how the tokens of a sentence are ordered, the hidden embedding that self-attention computes for each token is the same, which obviously does not match human intuition. The model therefore needs a way to express a token's position, and the Transformer uses a fixed positional encoding to express the token's absolute position in the sentence. The positional encoding formula is as follows:

Positional encoding formula
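
For reference, the sinusoidal formula from the original paper, where pos is the token's position in the sentence and i indexes the embedding dimension:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$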

As for why positional encoding can represent positional information, see the article on how the positional encoding in the Transformer paper relates to trigonometric functions.

1.3.3 Why should the results of <4.2> be scaled?

Take two arrays of length len, each with mean 0 and variance 1; their dot product has mean 0 and variance len. The change in variance pushes the inputs of softmax toward positive or negative extremes, where the gradient approaches 0, which is bad for training convergence. Dividing by the square root of len brings the variance back to 1, which helps training converge.
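
A quick numerical check of this claim, with made-up sizes and purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
length, trials = 64, 100_000
a = rng.normal(size=(trials, length))     # mean 0, variance 1
b = rng.normal(size=(trials, length))

dots = (a * b).sum(axis=1)                # dot products of length-64 vectors
print(dots.var())                         # ~64, i.e. roughly len
print((dots / np.sqrt(length)).var())     # ~1 after scaling by sqrt(len)
```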

1.3.4 Why does <5> add the input and output of multi-head attention?

Like the residual learning unit in ResNet, this follows the idea of ensembling and addresses network degradation.

1.3.5 Why does attention need a multi-head?

Multi-head attention is equivalent to dividing one large space into several smaller, mutually exclusive subspaces and computing attention in each subspace separately. Although attention computed in a single small subspace is less precise than in the large space, the subspaces are computed in parallel, and concatenating them helps the network capture richer information. Think of channels in a CNN.

1.3.6 Why add FFN after Multi-head attention?

By analogy with CNNs, where convolution blocks and fully connected layers alternate to good effect, adding an FFN after multi-head attention increases the nonlinear transformation capacity of the whole block compared with multi-head attention alone.

1.3.7 Why does <11> mask the dot product result of the token at the current moment and subsequent tokens?

Natural language generation (such as machine translation and text summarization) is auto-regressive: at inference time, the token at the current step can only be generated from the preceding tokens. To keep training consistent with inference, subsequent tokens must therefore not be used when generating the current token during training. This is also consistent with how humans produce natural language.
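
For illustration, here is a small numpy sketch of the usual decoder-side mask: a lower-triangular (causal) mask combined with the padding mask, applied as a large negative bias to the scaled dot products. The exact masking details may differ from the particular Transformer code described above.

```python
import numpy as np

max_len = 5
pad_mask = np.array([1, 1, 1, 1, 0])              # 1 = real token, 0 = padding
causal = np.tril(np.ones((max_len, max_len)))     # row t may attend to columns <= t

allowed = causal * pad_mask[None, :]              # combine causal and padding masks
bias = np.where(allowed == 0, -2.0**32 + 1, 0.0)  # added to the scaled dot products before softmax
print(bias)
```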

1.4 Summary of Transformer

When the Transformer was first released, I happened to be an intern in Baidu's NLP department. At the time I thought the Transformer was more of a gimmick, and self-attention was no better than RNNs and LSTMs on small models. But as models and datasets grew, self-attention proved far better than traditional RNNs and LSTMs, both in training speed thanks to parallelization and in modeling long-range dependencies. The Transformer is now the basis of many representative NLP pre-training models: the Bert series uses the Transformer encoder, and the GPT series uses the Transformer decoder. The Transformer's multi-head attention is also widely used in the recommendation space.

2. Bert

2.1 Bert’s background

Before Bert, the methods for applying pre-trained embeddings to downstream tasks fell roughly into two categories. One is feature-based: for example, ELMo feeds the pre-trained embeddings into the downstream task's network as features. The other is fine-tuning: for example, GPT attaches the downstream task to the pre-trained model and trains them together. However, both approaches face the same problem: they cannot learn bidirectional context directly. ELMo learns the left and right contexts separately and concatenates them to represent context, while GPT learns only the preceding context. The authors therefore proposed Bert, a pre-training model based on the Transformer encoder that learns bidirectional context directly. Bert uses 12 Transformer encoder blocks and is pre-trained on 13GB of data, a good example of brute force working wonders in NLP.

2.2 Bert’s process

Bert is an improvement on the Transformer encoder, so its overall flow is not very different; the differences lie in the embedding, the multi-head attention, and the loss.

2.2.1 Differences between Bert and Transformer on embedding

bert pre-train and fine-tune

There are three main differences between Bert and Transformer in embedding:

<1> The Transformer's embedding consists of two parts. One is the token embedding, a vector representing the token generated by looking up the token ids in the embedding matrix. The other is the position embedding, a constant vector created with the sin and cos functions. Bert's embedding consists of three parts. The first is likewise the token embedding, generated by looking up the token ids in the embedding matrix. The second is the segment embedding, generated by looking up which segment the current token belongs to in a segment embedding matrix whose vocab size is 2. The third is the position embedding; unlike the Transformer, Bert creates a position embedding matrix and generates the position vector by looking up the token's position in that matrix.

<2> Transformer follows embedding with a dropout, but Bert follows embedding with a Layer normalization and then with a dropout.

<3> Bert adds a special token "[CLS]" before the token sequence; the vector corresponding to this token is used in subsequent classification tasks. For sentence-pair tasks, the two sentences are separated by a special token "[SEP]".
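
Putting the three differences above together, here is a minimal numpy sketch of Bert-style embedding; the table sizes, token ids and segment layout are made up for illustration.

```python
import numpy as np

vocab_size, max_len, hidden = 30000, 8, 16
rng = np.random.default_rng(0)

token_table    = rng.normal(size=(vocab_size, hidden))  # token embedding matrix
segment_table  = rng.normal(size=(2, hidden))           # segment embedding, vocab size = 2
position_table = rng.normal(size=(max_len, hidden))     # learned position embedding matrix

# "[CLS] tok tok [SEP] tok tok [SEP] pad" (ids are made up)
token_ids   = np.array([101, 2026, 3899, 102, 2003, 10140, 102, 0])
segment_ids = np.array([0,   0,    0,    0,   1,    1,     1,   1])
positions   = np.arange(max_len)

emb = token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

# layer normalization (without the learned gain/bias), then dropout would follow
emb = (emb - emb.mean(-1, keepdims=True)) / np.sqrt(emb.var(-1, keepdims=True) + 1e-12)
print(emb.shape)   # (8, 16)
```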

2.2.2 Differences between Bert and Transformer in multi-head attention

There are two main differences between Bert and Transformer in multi-head attention:

<1> In <4.7>, the Transformer only concats the attention results of <4.6>; Bert concats the attention results of <4.6> together with the attention results of all the previous N-1 encoder blocks.

<2> The Transformer has no linear layer after <4.7> (possibly because the Transformer code I looked at is not the official implementation), whereas Bert adds a linear layer after <4.7>.

2.2.3 Differences between Bert and Transformer in Loss

There are two main differences between Bert and Transformer in loss:

<1> The Transformer's loss is computed in the decoder stage, as described in step <19>. Bert's pre-training loss consists of two parts. One part is the NSP loss: the token "[CLS]" is passed through one Dense layer followed by a binary classification loss, where 0 indicates that segment B is the next sentence of segment A and 1 indicates that segment A and segment B come from two different texts. The other part is the MLM loss: each token in a segment has a 15% probability of being masked, and a masked token is represented by "[MASK]" with 80% probability, replaced by a random token with 10% probability, and kept as the original token with 10% probability. The masked token's distribution over the vocab is generated by multiplying the encoder output by the transpose of the embedding matrix; the cross entropy between this distribution and the one-hot encoding of the true token is computed and summed as the loss. The two losses are added as the total loss, and Adam is used for training. Bert's fine-tuning loss is designed according to the nature of the task. For example, in classification tasks the token "[CLS]" is passed through one Dense layer followed by a classification loss. In question answering, a start position and an end position are predicted among the paragraph tokens, and the loss is designed from the predicted and true distributions of the start and end positions. In sequence tagging, the part of speech of each token is predicted, and the loss is designed from the predicted and true part-of-speech distributions of each token.

<2> After the encoder, and before computing the NSP and MLM losses, Bert applies an extra Dense operation to the inputs of NSP and MLM respectively. These parameters are only useful for pre-training, not for fine-tuning. The Transformer computes its loss directly after the decoder, without this Dense operation.

2.3 Bert’s technical details

2.3.1 Why does Bert need additional segment embedding?

One of Bert's pre-training tasks is to judge the relationship between segment A and segment B, so the embedding needs to contain information about which segment the current token belongs to, and neither the token embedding nor the position embedding provides it. Bert therefore adds a segment embedding whose vocab size is 2: index=0 indicates that the token belongs to segment A, and index=1 indicates that the token belongs to segment B.

2.3.2 Why does the Transformer follow the embedding with dropout, while Bert follows it with layer normalization and then dropout?

LN addresses the vanishing-gradient problem, while dropout addresses overfitting. Adding LN after the embedding helps the embedding matrix converge.

2.3.3 In multi-head attention, why does Bert concat not only the attention result of <4.6> but also the attention results of the previous N-1 encoder blocks?

Concatenating the attention results of the first N encoder blocks provides more information than using only the attention result of the Nth encoder block. The linear operation that follows changes the size of the result back to [Batch size, Max seq length, hidden size]. This is essentially the same idea as Transformer question 1.3.4: ensembling provides more information.

2.3.4 Why is the probability of a token being masked 15%? Why are there 3 cases after masking?

The 15% probability is the best value found through experiments (XLNet also lands near this probability), indicating that at this rate there are enough masked samples to learn from, while not losing so much segment information that the representation of the masked tokens' context suffers. However, since the token "[MASK]" never appears in downstream tasks, there is an inconsistency between pre-training and fine-tuning. To reduce the impact of this inconsistency on the model, a masked token is represented by "[MASK]" 80% of the time, replaced by a random token 10% of the time, and kept as the original token 10% of the time. These three percentages are also the best combination found through multiple experiments; with them, fine-tuning on downstream tasks achieves the best results.
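
A sketch of the 15% / 80% / 10% / 10% rule described above; the token ids and "[MASK]" id are made up, and real implementations also avoid masking special tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID, VOCAB_SIZE = 103, 30000     # hypothetical ids

def mask_tokens(token_ids):
    token_ids = token_ids.copy()
    labels = np.full_like(token_ids, -1)        # -1 = position not used in the MLM loss
    for i, tok in enumerate(token_ids):
        if rng.random() < 0.15:                 # 15% of tokens are selected
            labels[i] = tok                     # the original token must be predicted
            r = rng.random()
            if r < 0.8:                         # 80%: replace with "[MASK]"
                token_ids[i] = MASK_ID
            elif r < 0.9:                       # 10%: replace with a random token
                token_ids[i] = rng.integers(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return token_ids, labels

ids = np.array([2026, 3899, 2003, 10140, 1998, 2016, 2003, 2307])
print(mask_tokens(ids))
```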

2.4 Bert’s summary

In contrast to papers that claim to work well but do not hold up in practice, Bert has genuinely influenced both academia and industry. From GLUE to SQuAD, the top-scoring methods on the current leaderboards are improvements built on Bert. In my own work, Bert has delivered better business results than I expected. Bert's position in NLP is comparable to that of Inception or ResNet in CV: in CV, algorithms surpassed human accuracy several years ago, whereas NLP only reached that point with Bert. However, Bert is not omnipotent. Bert's framework makes it suitable for natural language understanding; since it has no decoding process, it is not suited to natural language generation. How to adapt Bert into a framework suitable for machine translation and text summarization is therefore a point worth studying in the future.

3. XLNet

3.1 Background of XLNET

Pre-training language models fall into two categories. The first is the auto-regressive model, such as GPT, which predicts the next token based on the tokens of all previous steps. The auto-regressive loss is defined as follows:

Autoregressive loss
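
A standard way to write this objective, following the notation of the XLNet paper:

$$\max_{\theta}\ \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})$$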

The second is the auto-encoder model, such as Bert, which randomly masks several tokens in a sentence and then predicts the masked tokens from their context. The auto-encoder loss is defined as follows:

Auto-encoder loss
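
A standard formulation, where $\hat{\mathbf{x}}$ is the corrupted (masked) input, $\bar{\mathbf{x}}$ is the set of masked tokens, and $m_t=1$ if token $t$ is masked:

$$\max_{\theta}\ \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})$$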

The auto-regressive model can only use the preceding context during training, but there is no gap between training and inference. The auto-encoder model can use both the preceding and following context during training, but the "[MASK]" token it relies on in training never appears at inference time, creating a gap between training and inference. For this reason, the authors proposed XLNet, an auto-regressive model based on Transformer-XL that combines the advantages of the auto-regressive and auto-encoder models.

3.2 XLNET process

3.2.1 Factorization order

The factorization order of a sentence is a random permutation of its tokens. To combine the advantages of the auto-regressive and auto-encoder models, XLNet uses factorization orders to bring context information into the auto-regressive loss. In the ordinary auto-regressive loss, predicting token 2 can use the information of token 1 but not of tokens 3/4/5. After introducing a factorization order, say 1->4->2->3->5, predicting token 2 can use the information of tokens 1 and 4, but not of tokens 3 and 5. Using a factorization order does not change the input order of the sentence; what changes is the mask used when computing each token's attention result in Transformer-XL's multi-head attention. Originally, the attention for the current token masked the current token's subsequent tokens in the sentence; now it masks the tokens that come after the current token in the factorization order. In this way, the context of the current token can be used when computing its attention result: in the factorization order above, computing the attention result of token 2 uses the information of tokens 1 and 4, where token 1 comes before token 2 in the original sentence and token 4 comes after it.

The factorization order is implemented with an appropriate mask when computing multi-head attention. For example, for a factorization order in which token 2 comes after tokens 1 and 3, when computing the attention result of token 2, tokens 2/4/5 are masked; only the dot products of token 2 with tokens 1 and 3 are computed, followed by softmax and a weighted sum to obtain the attention result.
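
Here is a small numpy sketch of how a factorization order can be turned into an attention mask (illustrative only, not XLNet's actual implementation), using the 1->4->2->3->5 example above with 0-based indices:

```python
import numpy as np

factorization_order = [0, 3, 1, 2, 4]     # the order 1->4->2->3->5, 0-based
seq_len = len(factorization_order)

rank = np.empty(seq_len, dtype=int)       # rank[pos] = place of position pos in the order
for r, pos in enumerate(factorization_order):
    rank[pos] = r

# mask[i, j] = 1 means token i may use token j when computing its attention result
mask = (rank[None, :] < rank[:, None]).astype(int)
print(mask)
# The row for token 2 (index 1) is [1 0 0 1 0]: it may attend to tokens 1 and 4,
# matching the example above.
```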

3.2.2 Dual-stream self-attention mechanism

XLNet uses the Transformer-XL framework and adds a dual-stream self-attention mechanism on top of the Transformer.

The dual-stream self-attention mechanism

Compared with the ordinary Transformer, XLNet adds one extra multi-head attention + FFN computation. The dual-stream self-attention mechanism consists of a query stream g and a content stream h. h is the same multi-head attention as in the Transformer: the attention result at step t uses the position information and token information of the first t positions in the factorization order. g is modified on the basis of the Transformer's multi-head attention: the attention result at step t uses the position information of the first t positions but only the token information of the first t-1 positions in the factorization order. To reduce the difficulty of optimization, XLNet only computes g for the last 1/6 or 1/7 of the tokens in the factorization order, and only those g values enter the auto-regressive loss for training; h is trained along the way. After pre-training, fine-tuning uses h rather than g for the downstream task, and the fine-tuning process is exactly the same as ordinary fine-tuning.

3.3 Technical details of XLNET

3.3.1 Advantages of factorization order

In theory, if pre-training could cover all factorization orders of a sentence, the model would learn exactly how each token relates to its context. In practice, however, the number of factorization orders grows exponentially with sentence length, so actual training uses only one or a few factorization orders per sentence. Moreover, compared with the ordinary auto-regressive setting, where only the preceding information is available, introducing factorization orders lets the model use the preceding and following context at the same time, which strengthens its inference ability.

3.3.2 Why use dual-stream self-attention?

Consider two factorization orders, 1->3->2->4->5 and 1->3->2->5->4. When predicting the 4th token in the factorization order, the auto-regressive loss uses the token information and position information of the first t-1 tokens, which are identical in both cases, so the predicted distribution over the vocab would be the same even though the tokens to be predicted (token 4 versus token 5) are different. The position in the original sentence of the token to be predicted is therefore ambiguous, and additional information is needed to represent it. To this end, XLNet uses dual-stream multi-head attention + FFN. The query stream g computes the output at step t using the position information of the first t positions and the token information of the first t-1 positions in the factorization order. The content stream h computes the output at step t using the position information and token information of the first t positions in the factorization order. During pre-training, the auto-regressive loss is computed with g and minimized, and h is trained along the way. After pre-training, g is dropped and the model switches seamlessly to fine-tuning as an ordinary Transformer using h.

3.4 Summary of XLNET

Since I have only read the paper and never used XLNet in real work, I can only discuss its theory. After Bert, many papers improved on Bert, but the innovations were limited; XLNet is the only paper I have read that integrates context information and the auto-regressive loss under the Transformer framework. XLNet used 126GB of data for pre-training, an order of magnitude more than Bert's 13GB. Shortly after XLNet was released, Roberta, an improved Bert pre-trained on 160GB of data, beat XLNet again.

4. Albert

4.1 Background of Albert

Increasing the size of a pre-training model usually improves its inference ability, but beyond a certain size it runs into the limits of GPU/TPU memory. The authors therefore applied two parameter-reduction techniques to Bert to shrink the model, and also modified Bert's NSP loss. With the same number of parameters as Bert, Albert has stronger inference ability.

4.2 Albert process

4.2.1 Decomposition of word vector matrix

In Bert and many of its improved versions, embedding size equals hidden size, which is not necessarily optimal. Bert's token embedding is context-independent, while the hidden embedding after multi-head attention + FFN is context-dependent, and the goal of Bert's pre-training is to provide more accurate hidden embeddings, not token embeddings, so the token embedding does not need to be as large as the hidden embedding. Albert factorizes the token embedding: it first reduces the embedding size, then uses a Dense operation to map the low-dimensional token embedding up to the hidden size. In Bert, where embedding size = hidden size, the embedding parameters number vocab size * hidden size; after factorization, the parameter count is vocab size * embedding size + embedding size * hidden size, so as long as embedding size << hidden size the number of parameters is reduced, as the sketch below shows.
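
A quick back-of-the-envelope count with illustrative sizes (vocab size 30K, hidden size 768, embedding size 128):

```python
vocab_size, hidden_size, embedding_size = 30000, 768, 128

bert_style_params = vocab_size * hidden_size                                      # E = H
albert_style_params = vocab_size * embedding_size + embedding_size * hidden_size  # factorized

print(f"Bert-style embedding:  {bert_style_params:,}")     # 23,040,000
print(f"Albert factorization:  {albert_style_params:,}")   # 3,938,304
```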

4.2.2 Parameter Sharing

Bert's 12 Transformer encoder blocks are stacked in series; although every block looks identical, their parameters are not shared. Albert shares parameters across the Transformer encoder blocks, which greatly reduces the number of parameters in the whole model (see the toy sketch below).
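
A toy sketch of the difference; the "block" here is just a Dense + relu stand-in for a real Transformer encoder block.

```python
import numpy as np

def make_block(hidden, rng):
    w = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
    return lambda x: np.maximum(x @ w, 0.0)    # toy "encoder block": Dense + relu

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

bert_blocks = [make_block(16, rng) for _ in range(12)]   # Bert: 12 separate parameter sets
albert_block = make_block(16, rng)                       # Albert: 1 shared parameter set

out_bert = x
for block in bert_blocks:        # serial computation through 12 different blocks
    out_bert = block(out_bert)

out_albert = x
for _ in range(12):              # the same block applied 12 times in a loop
    out_albert = albert_block(out_albert)

print(out_bert.shape, out_albert.shape)   # (4, 16) (4, 16)
```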

4.2.3 Sentence order prediction (SOP)

In addition to the auto-encoder (MLM) loss, Bert uses the NSP loss to improve its inference ability on sentence-pair relation tasks. Albert abandons the NSP loss and uses the SOP loss instead. The NSP loss judges the relationship between segment A and segment B: 0 indicates that segment B is the next sentence of segment A, and 1 indicates that segment A and segment B come from two different texts. The SOP loss judges the order of segment A and segment B: 0 indicates that segment B is the next sentence of segment A, and 1 indicates that segment A is the next sentence of segment B.

4.3 Technical details of Albert

4.3.1 Parameter reduction technique

Albert uses two parameter-reduction techniques, but their contributions to parameter reduction differ. The first is the factorization of the word embedding matrix: when the embedding size is reduced from 768 to 64, about 21M parameters are saved, but the model's inference ability also drops. The second is parameter sharing across the multi-head attention + FFN layers: with embedding size = 128, about 77M parameters are saved, and the model's inference ability again drops somewhat. Although parameter reduction hurts inference ability, the parameter count can be brought back to Bert's order of magnitude by enlarging the model, at which point its inference ability exceeds Bert's.

There are two common routines in academic papers. The first is to keep piling on parameters and data to improve the model's inference ability. The second is to cut parameters while keeping the drop in inference ability small. The parameter-reduction techniques Albert uses look like the second routine, but they actually serve the first. When Bert went from large to xlarge, although the model grew to 1270M parameters, it degraded and its inference ability dropped sharply, indicating that large is roughly the limit of inference ability under Bert's framework. With parameter reduction, Albert large has only 18M parameters, compared with 334M for Bert large; although Albert large's inference ability is worse than Bert large's, the reduced model still has room to grow. By xxlarge, the model's inference ability improves and surpasses Bert's best model.

4.3.2 Loss

Before Albert, many improved versions of Bert questioned the NSP loss. Structbert, for example, constructs samples so that with 1/3 probability segment B is the next sentence of segment A, with 1/3 probability segment A is the next sentence of segment B, and with 1/3 probability segment A and segment B come from two different texts. Roberta simply abandoned the NSP loss and changed the sample construction: the two input segments are replaced by consecutive sentences sampled from one text until 512 tokens are filled; when the end of the text is reached before 512 tokens are filled, a "[SEP]" is added and sampling continues from another text until 512 tokens are filled.

Building on structbert, Albert drops the case where segment A and segment B come from two different texts, leaving only a 1/2 probability that segment B is the next sentence of segment A and a 1/2 probability that segment A is the next sentence of segment B. The paper's explanation is that the NSP loss mixes two functions, topic prediction and coherence prediction; topic prediction is much easier to learn than coherence prediction, and the MLM loss already covers topic prediction, so Bert ends up learning little coherence prediction. By abandoning the two-different-texts case, Albert's SOP loss focuses on coherence prediction, which improves the model's ability on sentence-pair inference.

4.4 Summary of Albert

Although Albert reduces the number of parameters, it does not reduce inference time: inference merely changes from serially computing 12 different Transformer encoder blocks to computing the same encoder block 12 times in a loop. Albert's greatest contribution is that the model has more room to grow than the original Bert: as the model gets larger, its inference ability keeps improving.

5. Other papers

5.1 GPT

GPT was released before Bert and used the Transformer decoder as its pre-training framework. Seeing the drawback that the decoder can only use the preceding context and not the following context, Bert switched to the Transformer encoder as its pre-training framework, which can use both at once, and achieved great success.

5.2 structbert

Structbert's innovation lies mainly in the loss. Besides the MLM loss, it adds a loss for reconstructing token order and a loss for judging the relationship between two segments. The token-order loss selects token triples in a segment with a certain probability, randomly shuffles them, and after the encoder tries to restore the correct order of the shuffled triples. The segment-relationship loss constructs samples in which segment B is the next sentence of segment A, segment A is the next sentence of segment B, or segment A and segment B come from two different texts, and uses "[CLS]" to predict which of these three categories the sample belongs to.

5.3 Roberta

Shortly after XLNet topped GLUE with 126GB of data, Roberta beat XLNet with 160GB of data. Roberta has four main changes. The first is the dynamic mask: Bert used a static mask, meaning masking was done during data preprocessing, so the same sample received the same mask throughout training; the dynamic mask re-masks each sample every time it is fed in, which provides more distinct mask patterns for training and works well. The second is the sample construction: Roberta abandons the NSP loss and replaces the two input segments with consecutive sentences sampled from one text until 512 tokens are filled; when the end of the text is reached before 512 tokens are filled, a "[SEP]" is added and sampling continues from another text until 512 tokens are filled. The third is a larger batch size: with the same total amount of training data, increasing the batch size improves the model's inference ability. The fourth is subword tokenization (analogous to using characters in Chinese), which enlarges the vocabulary from 30K to 50K. Although subword tokenization performed worse than full-word tokenization in the experiments, the authors believe subword should win out because of its theoretical advantages (facepalm).

6. Summary

The difference between NLP and CV is that NLP is cognitive learning while CV is perceptual learning: NLP involves an extra step of symbol mapping compared with CV. Because of this, NLP has developed more slowly than CV; CV has many successful startups and many sub-fields that have reached commercial viability, while NLP has fewer. However, since the Transformer was released in 2017, NLP has entered a period of rapid iteration, and the publication of Bert raised NLP benchmarks by a large margin, producing many sub-fields that can reach commercial level. By 2019, the field was moving faster and faster. I started writing this technical share on National Day, just one week after Albert was published, and by the time I finished it was already November. A few days ago Google published T5, which beat Albert again. The T5 paper is said to be 50 pages long and reads as an overview of NLP pre-training models, well worth the time.

Finally, my understanding is limited; if there are any mistakes, please point them out.

The above is the development history of NLP pre-training models from Transformer to Albert. Follow the author on Zhihu to learn more about NLP algorithms.
