
EMNLP 2021 has a paper titled Frustratingly Simple Pretraining Alternatives to Masked Language Modeling, i.e. frustratingly simple pre-training tasks to substitute for MLM. But I would put a question mark on that claim. First, some of the tasks the authors propose feel too difficult even for the model; if I had to do them myself, I couldn't either. Second, the results look rather mediocre.

As shown in the figure below, the authors propose five pre-training tasks to replace MLM, namely Shuffle, Random, Shuffle + Random, Token Type and First Char.

Pretraining Tasks

Shuffle

The authors mention that this method was inspired by ELECTRA. Specifically, 15% of the tokens in a sentence are randomly shuffled, and the model then has to solve a token-level binary classification problem: for each position, predict whether the token there has been shuffled. The benefit of this pre-training task is that the model can gain syntactic and semantic knowledge by learning to tell apart contexts in which tokens have been scrambled.

For the Shuffle task, the loss function is a simple binary cross-entropy loss:


$$
\mathcal{L}_{\text{shuffle}}=-\frac{1}{N}\sum_{i=1}^N \Big[y_i\log p(x_i) + (1-y_i)\log\big(1-p(x_i)\big)\Big] \tag{1}
$$

where $N$ is the number of tokens in a sample, $y_i$ and $p(x_i)$ are both scalars, and $p(x_i)$ is the predicted probability that the $i$-th token has been shuffled.
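
To make the data construction concrete, here is a minimal sketch of how one might build Shuffle examples and compute the loss in Eq. (1). The 15% ratio comes from the paper; the helper names, the toy tokenization and the random permutation scheme are my own assumptions, not the authors' code.

```python
import random
import torch
import torch.nn.functional as F

def make_shuffle_example(tokens, ratio=0.15):
    """Shuffle ~15% of the token positions; label 1 marks a shuffled position."""
    n = len(tokens)
    k = max(2, int(n * ratio))                   # need at least 2 positions to permute
    positions = random.sample(range(n), k)
    permuted = positions[:]
    random.shuffle(permuted)
    new_tokens = list(tokens)
    for src, dst in zip(positions, permuted):    # move each selected token to a new slot
        new_tokens[dst] = tokens[src]
    labels = [1.0 if i in positions else 0.0 for i in range(n)]
    return new_tokens, labels

def shuffle_loss(logits, labels):
    """Token-level binary cross-entropy of Eq. (1);
    `logits` would come from a per-token linear head on top of the encoder."""
    return F.binary_cross_entropy_with_logits(logits, labels)

tokens = "the cat sat on the mat and looked out of the window".split()
shuffled_tokens, labels = make_shuffle_example(tokens)
fake_logits = torch.randn(len(tokens))           # stand-in for the model's output
loss = shuffle_loss(fake_logits, torch.tensor(labels))
```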

Random Word Detection (Random)

Replace 15% of the tokens in the input sentence with words drawn at random from the vocabulary, even if the whole sentence then becomes grammatically incoherent. In essence this is also a binary classification problem: for each position, predict whether the token has been replaced. The loss function is the same as Eq. (1).
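
The data construction mirrors the Shuffle task; below is a sketch with a toy vocabulary and hypothetical helper names of my own (the labels and loss have the same shape as for Shuffle).

```python
import random

def make_random_example(tokens, vocab, ratio=0.15):
    """Replace ~15% of tokens with random words from the vocabulary;
    label 1 marks a replaced position (binary classification, loss as in Eq. (1))."""
    n = len(tokens)
    k = max(1, int(n * ratio))
    positions = set(random.sample(range(n), k))
    new_tokens = [random.choice(vocab) if i in positions else tok
                  for i, tok in enumerate(tokens)]
    labels = [1.0 if i in positions else 0.0 for i in range(n)]
    return new_tokens, labels

vocab = ["apple", "run", "blue", "quickly", "seven", "door"]   # toy vocabulary
corrupted, labels = make_random_example("the cat sat on the mat".split(), vocab)
```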

Manipulated Word Detection (Shuffle + Random)

This task combines the Shuffle and Random tasks into a single, harder task. Honestly, this is where I lose it: asked to judge shuffled tokens and random tokens separately, I might manage, but I'm not sure I could judge them accurately once the two are combined. The authors may have taken this into account, since the proportions of shuffled and randomly replaced tokens are each reduced to 10%. Note that the two manipulations do not overlap, i.e. a token will never be both shuffled and randomly replaced. The task now becomes a three-way classification problem, and its loss is again a cross-entropy loss:


$$
\mathcal{L}_{\text{manipulated}} = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^3 y_{ij}\log p_{ij}(x_i)\tag{2}
$$

where $j$ ranges over Shuffle ($j=1$), Random ($j=2$) and Original ($j=3$), and $p_{ij}$ is the predicted probability that the $i$-th token belongs to class $j$. Both $y_{ij}$ and $p_{ij}$ are scalars, while $p_i$ is a three-dimensional vector.
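
Putting the two manipulations together, the sketch below is one way to build the three-way labels on disjoint position sets; the 10% ratios come from the paper, but the helper and class ordering are my own illustration, not the authors' code. Eq. (2) is then just the standard cross-entropy over the three classes.

```python
import random
import torch
import torch.nn.functional as F

SHUFFLED, RANDOM, ORIGINAL = 0, 1, 2        # class indices (j = 1, 2, 3 in Eq. (2))

def make_manipulated_example(tokens, vocab, ratio=0.10):
    """Shuffle ~10% of positions and randomly replace a *disjoint* ~10%;
    every position gets one of three labels. Assumes the sentence is long enough."""
    n = len(tokens)
    k = max(2, int(n * ratio))
    picked = random.sample(range(n), 2 * k)
    shuffle_pos, random_pos = picked[:k], picked[k:]   # disjoint by construction

    new_tokens = list(tokens)
    permuted = shuffle_pos[:]
    random.shuffle(permuted)
    for src, dst in zip(shuffle_pos, permuted):
        new_tokens[dst] = tokens[src]
    for i in random_pos:
        new_tokens[i] = random.choice(vocab)

    labels = [ORIGINAL] * n
    for i in shuffle_pos:
        labels[i] = SHUFFLED
    for i in random_pos:
        labels[i] = RANDOM
    return new_tokens, labels

tokens = "the quick brown fox jumps over the lazy dog near the river bank today".split()
corrupted, labels = make_manipulated_example(tokens, vocab=["tree", "blue", "seven"])
fake_logits = torch.randn(len(tokens), 3)              # stand-in for a 3-way head
loss = F.cross_entropy(fake_logits, torch.tensor(labels))   # Eq. (2)
```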

Masked Token Type Classification (Token Type)

This task is a four-way classification problem that decides whether the token at the current position is a stop word, a number, a punctuation mark, or regular content. Specifically, the authors use the NLTK toolkit to decide whether a token is a stop word, and anything that does not fall into the first three categories goes into the content category. In addition, 15% of the tokens are replaced with the special token [MASK] before prediction. As to why, my guess is that predicting the category of a visible token would be too easy; to make it harder, the model has to first figure out what token should be there and only then predict which category it belongs to. The loss function is again a cross-entropy loss.
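
Here is a minimal sketch of how the four-way labels could be derived with NLTK's English stop-word list plus simple checks for digits and punctuation; the exact label order, the digit/punctuation heuristics and the helper names are my assumptions, not the paper's implementation.

```python
import random
import string
from nltk.corpus import stopwords   # requires: nltk.download("stopwords")

STOP, NUMBER, PUNCT, CONTENT = 0, 1, 2, 3
STOP_WORDS = set(stopwords.words("english"))

def token_type_label(token):
    """Map a token to one of the four classes of the Token Type task."""
    if token.lower() in STOP_WORDS:
        return STOP
    if any(ch.isdigit() for ch in token):
        return NUMBER
    if all(ch in string.punctuation for ch in token):
        return PUNCT
    return CONTENT

def make_token_type_example(tokens, ratio=0.15, mask_token="[MASK]"):
    """Labels are computed from the original tokens, then ~15% of the
    inputs are replaced with [MASK]; only masked positions need predicting."""
    labels = [token_type_label(t) for t in tokens]
    n = len(tokens)
    masked_pos = set(random.sample(range(n), max(1, int(n * ratio))))
    inputs = [mask_token if i in masked_pos else t for i, t in enumerate(tokens)]
    return inputs, labels, masked_pos

inputs, labels, masked = make_token_type_example("He bought 3 apples , then left .".split())
```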

Masked First Character Prediction (First Char)

Finally, the authors propose a simplified version of the MLM task. The original MLM task requires a $|V|$-way classification at each masked position, i.e. a Softmax over a vector the size of the vocabulary. That task is actually quite hard, because the candidate set is so large, and there may also be a risk of overfitting. This last task instead only predicts the first character of the token at the current position, which turns it into a 29-way classification problem: the 26 English letters, one class for digits, one for punctuation marks, and one for everything else, 29 classes in total. As before, 15% of the tokens are replaced with [MASK] and then predicted.
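
A sketch of the 29-way label mapping described above (26 letters + digits + punctuation + other); the mapping function and masking helper are my own illustration under that class layout.

```python
import random
import string

def first_char_label(token):
    """Map a token to one of 29 classes based on its first character:
    0-25 for 'a'-'z', 26 for digits, 27 for punctuation, 28 for everything else."""
    ch = token[0].lower()
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if ch.isdigit():
        return 26
    if ch in string.punctuation:
        return 27
    return 28

def make_first_char_example(tokens, ratio=0.15, mask_token="[MASK]"):
    """~15% of tokens are masked and the model predicts the 29-way
    first-character class at those positions."""
    labels = [first_char_label(t) for t in tokens]
    masked = set(random.sample(range(len(tokens)), max(1, int(len(tokens) * ratio))))
    inputs = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    return inputs, labels, masked

inputs, labels, masked = make_first_char_example("She paid 20 dollars for it !".split())
```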

Results

The results are shown in the figure above, so you can just look at them. Honestly, the numbers are a bit underwhelming, but the authors note that if training ran as long as the baseline's, they are confident they could surpass it. Which leaves me with a question: why not simply train a bit longer? Were they racing the EMNLP deadline?

Personal summary

The main contribution of this paper is the five new pre-training tasks the authors propose as replacements for MLM. Because MLM is token-level, these five tasks are token-level as well. Will someone at EMNLP next year propose sentence-level alternatives to the NSP/SOP pre-training tasks? One more thing to tease is the title: "Frustratingly Simple XXXX" is a pattern I feel like I've seen several times already, and it gives off a bit of a clickbait vibe.