Abstract: Should we still use a 15% masking probability in masked language models? Paper: Should You Mask 15% in Masked Language Modeling? Link: arxiv.org/pdf/2202.08… Authors: Alexander Wettig*, Tianyu Gao*, Zexuan Zhong, Danqi Chen
Paper introduction
In earlier pre-trained models, masked language models (MLMs) typically used a masking rate of 15%, under the belief that masking more would leave too little context to learn good representations, while masking less would make training too expensive. Surprisingly, the authors found that masking 40% of the input tokens can outperform the 15% baseline, and even masking 80% preserves most of the performance, as measured by fine-tuning on downstream tasks.
- In detailed ablation experiments, the authors found that increased masking has two distinct effects: it corrupts more of the context (making the task harder) while providing more prediction targets (more training signal).
- The 80-10-10 rule (replacing selected positions with 80% [MASK], 10% original tokens, and 10% random tokens) is unnecessary; using [MASK] alone works better.
- At higher masking rates, simple uniform random masking still performs as well as or better than span masking and PMI-masking.
Overall, the results of this study contribute to a better understanding of masked language models and suggest new recipes for efficient pre-training. Let's look at the detailed experimental results.
The "15% masking" convention in pre-training can be broken
"15% masking" refers to randomly masking 15% of the tokens in the pre-training input and training the model to predict the masked tokens.
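As a toy illustration of this objective (my own sketch, not the paper's preprocessing code; it uses whitespace tokens and a string "[MASK]" symbol instead of a real subword tokenizer and token ids), uniform random masking at a configurable rate could look like this:

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; real tokenizers use token ids

def uniform_mask(tokens, mask_rate=0.15, seed=None):
    """Randomly mask `mask_rate` of the tokens; return the masked sequence
    and the {position: original token} targets the model must predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]   # the model is trained to predict the original token
        masked[pos] = MASK_TOKEN    # the input only sees [MASK]
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = uniform_mask(tokens, mask_rate=0.40, seed=0)
print(masked)   # masked input sequence
print(labels)   # prediction targets
```

Raising `mask_rate` from 0.15 to 0.40 is the only change the paper's headline experiments make to this standard objective.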
In this work, the authors found that with an efficient pre-training recipe, 40% to 50% of the input tokens can be masked while achieving better downstream performance than the default 15%. Table 1 shows examples of 15%, 40%, and 80% masking and their downstream task performance. Even at 80% masking, when most of the context is destroyed, the model still learns a good pre-trained representation and retains over 95% of the downstream performance of 15% masking. This breaks the previous convention of choosing a 15% masking rate and raises the question of how models benefit from high masking rates, which may become a hot topic in future research on masked language models.
Pre-training can mask more than 15%
To understand how many tokens can be masked in MLM and how the masking rate affects pre-trained model performance, the authors pre-trained a series of models with masking rates ranging from 15% to 80%. Figure 1 shows how downstream task performance changes across masking rates.
Masking up to 50% of the tokens achieves comparable or better results than the default 15% masking model, and masking 40% achieves the best overall downstream performance (although the optimal masking rate varies from task to task). The results show that sticking to a 15% masking rate is not necessary when pre-training language models; under an efficient pre-training recipe, the optimal rate for large models can be as high as 40%.
To further compare the 15% and 40% masking rates, their GLUE test results are shown in Table 2, and Figure 2 plots downstream task performance over the course of training.
Table 2 further verifies that 40% masking performs significantly better than 15%: SQuAD improves by nearly 2 points. Figure 2 also shows that 40% masking maintains a consistent advantage over 15% throughout the training process.
"Re-understanding" the masking rate
In this section, the authors analyze how the masking rate affects MLM pre-training from two perspectives: task difficulty and optimization efficiency. They further discuss the relationship between the masking rate, model size, and different corruption strategies, and their impact on downstream task performance.
The relationship between masking rate, corruption rate, and prediction rate
Specifically, the masking rate can be decomposed into two quantities: the corruption rate (m_corr), the proportion of the sentence that is corrupted, and the prediction rate (m_pred), the proportion of tokens the model is trained to predict. Studying m_corr and m_pred separately reveals a new pattern: a higher prediction rate leads to a better model, while a higher corruption rate leads to a worse one. Table 3 shows ablation results that disentangle m_corr and m_pred. We can see that (1) fixing m_corr at 40% and reducing m_pred from 40% to 20% leads to a steady decline on downstream tasks, indicating that more predictions lead to better performance; (2) fixing m_pred at 40% and reducing m_corr results in consistently better performance, suggesting that a lower corruption rate makes the pre-training task easier to learn; and (3) the benefit of a high prediction rate can outweigh the harm of a high corruption rate, yielding better overall performance.
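To make the distinction concrete, here is a minimal sketch (my own illustration, covering only the m_pred <= m_corr case; the paper also studies m_pred > m_corr, which requires a different construction): corrupt an m_corr fraction of the tokens, but keep loss targets for only an m_pred fraction of the sequence.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; real implementations use token ids

def corrupt_and_select_targets(tokens, m_corr=0.40, m_pred=0.20, seed=None):
    """Corrupt an m_corr fraction of the tokens with [MASK], but keep loss
    targets for only an m_pred fraction of the sequence (m_pred <= m_corr)."""
    assert m_pred <= m_corr, "this simple variant only covers m_pred <= m_corr"
    rng = random.Random(seed)
    n = len(tokens)
    corrupt_pos = rng.sample(range(n), round(n * m_corr))     # positions that get corrupted
    n_pred = min(round(n * m_pred), len(corrupt_pos))
    target_pos = set(rng.sample(corrupt_pos, n_pred))         # subset scored by the loss

    corrupted = list(tokens)
    labels = {}
    for pos in corrupt_pos:
        if pos in target_pos:
            labels[pos] = corrupted[pos]   # predicted and counted in the loss
        corrupted[pos] = MASK_TOKEN        # corrupted either way
    return corrupted, labels
```

In ordinary MLM the two rates coincide (m_corr = m_pred = the masking rate); the ablation above is what lets the paper attribute gains to more predictions and losses to more corruption.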
High masking rates suit larger models better
As the figure above shows, under an efficient pre-training setup, the optimal masking rate averages around 40% for large models, but only about 20% for base and medium models. This clearly shows that models with more parameters benefit more from higher masking rates.
Since 2019, most MLM pre-training work has followed BERT's finding that it is beneficial to replace 10% of the masked-out positions with the original token (keeping the word unchanged) and another 10% with a random token; this 80-10-10 rule has been used almost universally ever since. The motivation is that [MASK] tokens create a mismatch between pre-training and downstream fine-tuning, and that using original or random tokens in place of [MASK] can mitigate this gap. By this reasoning, masking more of the context should widen the gap even further, yet the authors observed stronger downstream performance, which raises the question of whether the 80-10-10 rule is needed at all. The authors first revisit the 80-10-10 rule and relate it to the corruption rate and prediction rate, arguing as follows:
Same-token prediction: predicting an unchanged token is a very easy task, since the model can simply copy the input to the output. The loss on same-token predictions is very small, and this objective is better viewed as an auxiliary regularizer that ensures token information propagates from the embeddings to the last layer. Therefore, same-token predictions should be counted in neither the corruption rate nor the prediction rate: they do not corrupt the input and contribute little to learning.
Random-token corruption: substituting random tokens adds to both corruption and prediction, because the input is corrupted and the prediction task is non-trivial. In fact, the authors find that the loss on random tokens is slightly higher than on [MASK], for two reasons: (1) the model must determine for every token whether its input is a random replacement, and (2) it must make predictions that can differ drastically from the input embedding. To verify these claims, the authors take an m = 40% model that uses only [MASK] replacement as the baseline and add three variants:
1. "+5% same": masks 40% of the tokens and additionally predicts 5% unchanged tokens, for a prediction rate of 45%.
2. "w/ 5% random": masks 35% of the tokens and randomly replaces another 5%, for a prediction rate of 40%.
3. "80-10-10": the BERT configuration, in which 80% of the masked-out tokens are replaced with [MASK], 10% keep the original token, and 10% become a random token (a minimal sketch of this replacement policy appears after the results below).
The results are shown in Table 4. Both same-token prediction and random-token corruption degrade performance on most downstream tasks, and the "80-10-10" rule is worse than simply using [MASK] on all tasks. This indicates that, under the fine-tuning paradigm, a model trained with [MASK] alone can quickly adapt to complete, uncorrupted sentences without needing original or random substitutions. Based on these results, the authors recommend using only [MASK] during pre-training.
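For reference, here is a minimal sketch of the 80-10-10 replacement policy (my own illustration with a toy vocabulary and string tokens rather than token ids); setting mask_prob=1.0 recovers the [MASK]-only policy the authors recommend:

```python
import random

MASK_TOKEN = "[MASK]"
VOCAB = "the quick brown fox jumps over a lazy dog cat".split()  # stand-in vocabulary

def replace_selected(tokens, positions, mask_prob=0.8, same_prob=0.1, seed=None):
    """BERT-style 80-10-10 replacement over already-selected positions:
    80% become [MASK], 10% keep the original token, 10% become a random token.
    With mask_prob=1.0 every selected position becomes [MASK]."""
    rng = random.Random(seed)
    out = list(tokens)
    for pos in positions:
        r = rng.random()
        if r < mask_prob:
            out[pos] = MASK_TOKEN              # 80%: replace with [MASK]
        elif r < mask_prob + same_prob:
            pass                               # 10%: keep the original token
        else:
            out[pos] = rng.choice(VOCAB)       # 10%: random vocabulary token
    return out
```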
Uniform masking is more effective at high masking rates
To understand the interaction between masking rate and masking strategy, the authors ran experiments with multiple masking strategies at different masking rates and found that simple uniform random masking performs as well as or better than more sophisticated strategies once each uses its optimal masking rate.
Figure 5 shows results for uniform masking, T5-style span masking, and PMI masking at masking rates from 15% to 40%. We find that (1) for all masking strategies, the optimal masking rate is higher than 15%; (2) the optimal masking rates for span and PMI masking are lower than that for uniform masking; and (3) when every strategy uses its optimal masking rate, uniform masking matches or even beats the more advanced strategies.
Why do higher masking rates compete with advanced masking strategies? Raising the rate of uniform masking increases the probability of masking highly correlated tokens together, which reduces trivial predictions and forces the model to learn more robust representations. Even with uniform masking, a higher masking rate increases the chance of "accidentally" covering an entire PMI span: by sampling masks over the corpus, the authors compute this probability in Figure 6 and find that it grows by a factor of 8 when the masking rate increases from 15% to 40%. Similarly, a higher masking rate produces longer spans of consecutive masked tokens. This suggests that increasing the masking rate yields effects similar to advanced masking strategies while learning better representations.
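As a rough sanity check on this intuition (a back-of-the-envelope calculation under the simplifying assumption that each token is masked independently with probability p, not the paper's corpus-based measurement behind Figure 6), the probability that a fixed span of L consecutive tokens is fully masked is roughly p^L:

```python
# Compare the chance of fully masking a span of length L at 15% vs. 40% masking,
# assuming (as an approximation) that tokens are masked independently.
for L in (2, 3, 4):
    p15, p40 = 0.15 ** L, 0.40 ** L
    print(f"span length {L}: {p15:.4f} -> {p40:.4f}  (x{p40 / p15:.1f})")
```

For short spans of two or three tokens this simple estimate gives a factor of roughly 7 to 19, in the same ballpark as the 8x increase the authors measure on real PMI spans.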
Conclusions
In this paper, the authors conduct a comprehensive study of the masking rate in masked language models and find that a 40% masking rate consistently outperforms the traditional 15% rate on downstream tasks. Decomposing the masking rate into a corruption rate and a prediction rate provides a better understanding of masking and shows that larger models benefit more from higher masking rates. The paper also shows that the 80-10-10 rule is largely unnecessary and that simple uniform masking matches more complex masking schemes at higher masking rates.
Resources
- QbitAI: Danqi Chen and Tsinghua Special Scholarship alumni release a new result that breaks the training rule proposed by Google's BERT
www.qbitai.com/2022/02/327…