This article documents three pre-training tasks: LM, MLM, and PLM.
I. LM (Language Model)
Probabilistic language modeling is the most common unsupervised task in natural language processing. LM is a classical probability density estimation problem; in practice, it usually refers to an autoregressive or unidirectional language model.
In a unidirectional language model, the representation of each token encodes only the token itself and the context on one side (either its left or its right). To capture richer semantics and obtain better representations, context information should be encoded from both directions. One solution is the bidirectional language model (BiLM), which consists of a left-to-right forward language model and a right-to-left backward language model.
The loss function of the language model is as follows:
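In the standard forward autoregressive form, with $x_{1:T}$ denoting the token sequence and $x_{<t}$ the tokens before position $t$, it can be written as:

$$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log p\big(x_t \mid x_{<t}\big)$$

For a BiLM, the forward loss above and the corresponding backward loss (conditioning on $x_{>t}$) are summed.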
II. MLM (Masked Language Modeling)
MLM was first proposed in the paper "A New Tool for Measuring Readability", where it was called the cloze task. It was later adopted in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" as a new pre-training task to compensate for the shortcomings of the standard unidirectional language model. In the MLM task of that paper, part of the words in the input sentence are first replaced with [MASK], and the model is then trained to predict the masked words from the remaining ones, so that it learns the semantic information in the sentence. However, MLM introduces an inconsistency between the pre-training and fine-tuning phases, because the [MASK] token used in pre-training never appears during fine-tuning. To mitigate this, the authors use a special masking strategy: 15% of the words are randomly selected each time; among the selected words, 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are kept unchanged. A small sketch of this strategy is given below.
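As a rough illustration of the 80/10/10 strategy, here is a minimal word-level sketch. The helper name mask_tokens, the toy vocabulary, and operating on word strings instead of subword IDs are all simplifying assumptions, not BERT's actual implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Sketch of BERT-style masking: pick ~15% of positions as prediction
    targets; for each target use [MASK] 80% of the time, a random vocabulary
    word 10% of the time, and the original word the remaining 10%."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = position is not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                # the model must recover this word
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                masked[i] = mask_token
            elif r < 0.9:                  # 10%: replace with a random word
                masked[i] = random.choice(vocab)
            # else: 10%: keep the original word unchanged
    return masked, labels

sentence = "the cat sat on the mat".split()
print(mask_tokens(sentence, vocab=sentence))
```

Keeping 10% of the selected words unchanged and replacing another 10% with random words forces the model to maintain a useful representation for every input token, not just for [MASK] positions.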
The loss function of MLM is as follows:
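In a common formulation, with $m(\mathbf{x})$ denoting the set of masked tokens of the input $\mathbf{x}$ and $\mathbf{x}_{\setminus m(\mathbf{x})}$ the corrupted input with those tokens masked out, the loss can be written as:

$$\mathcal{L}_{\text{MLM}} = -\sum_{\hat{x} \in m(\mathbf{x})} \log p\big(\hat{x} \mid \mathbf{x}_{\setminus m(\mathbf{x})}\big)$$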
III. PLM (Permuted Language Modeling)
The MLM task is widely used in pre-training, but as noted above it leads to an inconsistency between the pre-training and fine-tuning phases, and PLM was proposed as a replacement. PLM is a pre-training task based on random permutations of the input sequence: a permutation is randomly sampled from the set of all possible permutations, part of the words in the permuted order are selected as masked prediction targets, and the model is then trained to predict them from the other words. Note that the random permutation does not change the original positions of the tokens in the sequence; it only defines which positions are predicted and in what order. A small sketch follows.
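The following minimal sketch (the helper plm_examples is hypothetical, word-level, and ignores XLNet's actual two-stream attention implementation) illustrates how a sampled permutation only decides which positions are predicted and what context each prediction sees, while the tokens keep their original positions:

```python
import random

def plm_examples(tokens, num_predict=2):
    """Sketch of permuted LM target selection: sample a random factorization
    order over positions, take the last `num_predict` positions in that order
    as prediction targets, and condition each target on the tokens that come
    before it in the permuted order. The input sequence itself is not shuffled."""
    order = list(range(len(tokens)))
    random.shuffle(order)                          # random factorization order
    targets = set(order[-num_predict:])            # positions to be predicted
    examples = []
    for k, pos in enumerate(order):
        if pos in targets:
            visible = [(p, tokens[p]) for p in order[:k]]   # context seen so far
            examples.append({"predict_pos": pos,
                             "predict_word": tokens[pos],
                             "context": visible})
    return examples

print(plm_examples("the cat sat on the mat".split()))
```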
The loss function of PLM is as follows:
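Following the XLNet-style formulation, with $\mathcal{Z}_T$ the set of all permutations of a length-$T$ sequence, $z_t$ the $t$-th element of a sampled permutation $\mathbf{z}$, and $\mathbf{z}_{<t}$ its first $t-1$ elements, the expected loss is:

$$\mathcal{L}_{\text{PLM}} = -\,\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big)\right]$$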