##1. Quantization
The high-precision weights of the model are represented in lower precision to make the model smaller.
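As an added illustration (not from the original post), post-training dynamic quantization in PyTorch stores the weights of selected layer types in int8; a minimal sketch, assuming a plain `torch.nn` model:

```python
import torch

# Toy stand-in for a large pre-trained network (hypothetical sizes).
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
)

# Dynamic quantization: nn.Linear weights are stored as int8, and activations
# are quantized on the fly at inference time, shrinking the model in memory.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```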
##2. Pruning
The less important parts of the model are discarded to make the model smaller.
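As an added illustration, unstructured magnitude pruning with `torch.nn.utils.prune` discards the smallest weights; a minimal sketch on a single linear layer:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(768, 768)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor and drop the re-parametrization.
prune.remove(layer, "weight")
```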
##3. Model distillation
3.1. DistilBERT (2019.10.2)
The basic principles of knowledge distillation:
Two networks are defined: a teacher network and a student network. The teacher is a large pre-trained model, and the student is a small model obtained by transferring knowledge from the teacher. Following Hinton's paper, the general distillation pattern is: the output of the teacher network serves as the soft label, while the ground-truth target of the student's task is the hard label. Losses built from both kinds of labels are combined for joint training, so that the student network ends up with the same inference target as the teacher network.
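A common way to write this paradigm (a generic Hinton-style formulation added here for illustration, not a formula from this post) is a weighted sum of the hard-label cross-entropy and a temperature-softened KL term:

```latex
% Generic Hinton-style distillation objective (illustrative):
% cross-entropy on the hard (ground-truth) label plus KL divergence to the
% temperature-softened teacher distribution.
L_{KD} = \alpha \, \mathrm{CE}\!\big(y, \sigma(z_S)\big)
       + (1 - \alpha) \, T^2 \, \mathrm{KL}\!\big(\sigma(z_T / T) \,\big\|\, \sigma(z_S / T)\big)
```

Here $z_S$ and $z_T$ are the student and teacher logits, $\sigma$ is the softmax, $T$ is the temperature, and $\alpha$ balances the hard-label and soft-label terms.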
Brief description of model:
The student network has the same structure as the teacher network; both are BERT. The main changes are:
- The student network is built by keeping one out of every two layers of the teacher network. The authors found that, for a fixed parameter budget, reducing the hidden dimension brings a smaller gain in computational efficiency than reducing the number of layers, so only the number of layers is changed.
- The token-type embeddings and the pooler are removed, and the student has half as many layers as the teacher.
- The student network is initialized with the parameters of one out of every two teacher layers (a sketch of this initialization follows the list).
- Larger batches, dynamic masking, and removal of the NSP task are used; these points are taken from RoBERTa.
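A minimal sketch of this layer-selection initialization, assuming HuggingFace-style models that expose their Transformer blocks as `encoder.layer` (that attribute path is an assumption; adjust it for other implementations):

```python
import copy

def init_student_from_teacher(teacher, student):
    """Copy one out of every two Transformer blocks from the teacher into the student.

    Assumes both models expose their blocks as `encoder.layer` (HuggingFace-style).
    """
    teacher_blocks = teacher.encoder.layer          # e.g. 12 blocks
    student_blocks = student.encoder.layer          # e.g. 6 blocks
    for s_idx, t_idx in enumerate(range(0, len(teacher_blocks), 2)):
        student_blocks[s_idx] = copy.deepcopy(teacher_blocks[t_idx])

    # The embedding weights are copied as-is; token-type embeddings and the
    # pooler are simply dropped from the student.
    student.embeddings = copy.deepcopy(teacher.embeddings)
```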
Three losses:
- $L_{ce}$: let $t_i$ be the target probability distribution produced by the teacher model and $s_i$ the one produced by the student model; the loss is computed over the two distributions as $L_{ce} = \sum_i t_i \cdot \log(s_i)$. The KL divergence measures how far apart two distributions are; through this loss, the rich prior knowledge in the teacher network is brought into the training of the student. Softmax-temperature is used here: a temperature $T$ controls the smoothness of the output probabilities and is set to 1 at inference time.
- $L_{mlm}$: the masked-language-modeling loss of BERT.
- $L_{cos}$: the cosine similarity between the hidden states of the student network and the teacher network.
The three losses are then combined in a weighted sum:
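A minimal PyTorch sketch of this weighted sum (the weights `alpha`, `beta`, `gamma`, the temperature `T`, and the assumption that hidden states are already flattened to `(num_tokens, dim)` are illustrative, not values from the paper):

```python
import torch
import torch.nn.functional as F

def distilbert_loss(student_logits, teacher_logits, labels,
                    student_hidden, teacher_hidden,
                    T=2.0, alpha=5.0, beta=2.0, gamma=1.0):
    # L_ce: distillation loss on temperature-softened distributions.
    l_ce = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # L_mlm: standard masked-language-modeling cross-entropy on the hard labels.
    l_mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                            labels.view(-1), ignore_index=-100)

    # L_cos: cosine embedding loss aligning student and teacher hidden states.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    l_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    return alpha * l_ce + beta * l_mlm + gamma * l_cos
```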
Experiments:
Inference time is improved by about 60%:
Ablations show that $L_{ce}$, $L_{cos}$, and the parameter initialization have a large effect on the results:
3.2 TinyBERT (2019.11.23)
Main contributions:
- The proposed distillation method additionally takes the attention layers of the Transformer into account.
- A two-stage distillation scheme is proposed: the same distillation is applied in both the pre-training and fine-tuning stages, which gives better results.
- Experiments show strong results.
Brief description of model:
Problem definition: the teacher model has M layers and the student model has N layers, and a layer-mapping function assigns each student layer to a teacher layer. The process of the student learning from the teacher is to minimize the following objective function:
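Written out (following the TinyBERT paper, where student layer $m$ is mapped to teacher layer $g(m)$ and $\lambda_m$ is a per-layer weight; the embedding layer and the prediction layer are counted as extra "layers"), the objective is roughly:

```latex
% Layer-to-layer distillation: each student layer m is compared with the teacher
% layer g(m) it is mapped to, weighted by \lambda_m, summed over the training set X.
\mathcal{L}_{\text{model}} = \sum_{x \in \mathcal{X}} \sum_{m}
    \lambda_m \, \mathcal{L}_{\text{layer}}\big(f^{S}_{m}(x),\, f^{T}_{g(m)}(x)\big)
```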
Definitions of the individual losses:
1. Transformer-layer distillation. This is divided into attention-based distillation and hidden-state-based distillation. The authors distill the attention layers because recent research has found that the attention matrices learned by BERT contain rich linguistic knowledge, including syntactic and coreference information that is important for natural language understanding.
   - Attention-based distillation: compute the MSE between the student and teacher attention matrices, $L_{attn} = \frac{1}{h}\sum_{i=1}^{h} \text{MSE}(A_i^S, A_i^T)$, where $h$ is the number of attention heads.
   - Hidden-state-based distillation: compute the MSE between the Transformer layer outputs (the hidden states), $L_{hidn} = \text{MSE}(H^S W_h, H^T)$, where $W_h$ is a learnable parameter matrix that maps the student hidden states to the same dimension as the teacher's.
2. Embedding-layer distillation: compare the embedding-layer outputs in the same way, with a projection matrix analogous to $W_h$.
3. Prediction-layer distillation: compute the loss between $z^S$ and $z^T$, the logits predicted by the student network and the teacher network respectively; this is the KL-style loss of the standard distillation paradigm.
According to the above definitions, the complete loss of the student network is finally obtained by combining the layer-wise losses (the embedding loss for the embedding layer, the attention and hidden-state losses for the Transformer layers, and the prediction loss for the output layer):
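A compact PyTorch sketch of the attention-based and hidden-state losses defined above (tensor shapes and the projection argument are illustrative):

```python
import torch.nn.functional as F

def transformer_layer_loss(attn_S, attn_T, hidden_S, hidden_T, proj):
    """Attention + hidden-state distillation for one mapped (student, teacher) layer pair.

    attn_S, attn_T : (batch, heads, seq, seq) attention matrices of student / teacher.
    hidden_S       : (batch, seq, d_student) student hidden states.
    hidden_T       : (batch, seq, d_teacher) teacher hidden states.
    proj           : a learnable linear map W_h from d_student to d_teacher,
                     e.g. torch.nn.Linear(d_student, d_teacher, bias=False).
    """
    # Attention-based distillation: MSE, averaged over the h heads and all positions.
    l_attn = F.mse_loss(attn_S, attn_T)
    # Hidden-state distillation: project student states to the teacher dimension, then MSE.
    l_hidn = F.mse_loss(proj(hidden_S), hidden_T)
    return l_attn + l_hidn
```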
Two-stage distillation:
The paper proposes two stages of distillation: distillation is first performed in the pre-training stage to obtain a general TinyBERT model. In the fine-tuning stage, data augmentation is performed, followed by the same distillation, resulting in the task-specific TinyBERT model.
Experiments:
TinyBERT is empirically effective: it reaches more than 96% of the performance of the BERT baseline while being 7.5 times smaller and 9.4 times faster at inference. TinyBERT is also significantly better than the DistilBERT baseline, with only about 28% of its parameters and 31% of its inference time:
On the GLUE benchmark it achieves results close to BERT (about 3 percentage points lower):
Paper | Code
##4. Optimization of model structure
4.1 DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering (ACL 2020)
When using a BERT model for QA problems such as question answering or reading comprehension, the question and the document must be concatenated as the model input; the input text is then interactively encoded by multiple layers of self-attention, and a linear classifier searches the document for possible answer spans. Documents are usually very long, so this involves a lot of computation. This paper proposes splitting the BERT model into two parts so that some of the work can be done in advance. Studies have shown that in multi-layer Transformer models, the lower layers mainly encode local surface features (such as part of speech and syntax), while the upper layers gradually focus on global semantic information relevant to downstream tasks. Therefore the assumption that document encoding can be question-independent in the lower layers is reasonable. The idea of this paper is thus: in the lower layers, the question and the document are encoded separately; in the upper layers, the hidden representations of the question and the document are combined and cross-encoded. See the diagram below:
In addition, the authors found that this structure loses noticeable accuracy on SQuAD, so they add two distillation loss terms, whose aim is to minimize the difference between DeFormer's upper-layer representations and classification-layer logits and those of the original BERT model.
Experiments: on three QA tasks, BERT and XLNet with the DeFormer decomposition obtain 2.7-3.5x speedups and 65.8-72.0% memory savings with only a 0.6-1.8% loss in effectiveness. However, it is still too slow for real-time use.
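A structural sketch of the decomposition described above (the split point and module names are illustrative, not the authors' code): the lower layers process the question and the document independently, so the document side can be precomputed, and only the upper layers see the concatenated sequence.

```python
import torch
import torch.nn as nn

class DecomposedEncoder(nn.Module):
    """Toy DeFormer-style encoder: lower layers are question/document independent."""

    def __init__(self, layers, split_at):
        super().__init__()
        # `layers` could be e.g. [nn.TransformerEncoderLayer(768, 12, batch_first=True) for _ in range(12)]
        self.lower = nn.ModuleList(layers[:split_at])   # run on question and document separately
        self.upper = nn.ModuleList(layers[split_at:])   # run on the joint sequence

    def forward(self, question_emb, document_emb):
        q, d = question_emb, document_emb
        # Lower layers: no cross-attention between question and document, so the
        # document representations can be computed (and cached) offline.
        for layer in self.lower:
            q, d = layer(q), layer(d)
        # Upper layers: concatenate along the sequence axis and cross-encode.
        joint = torch.cat([q, d], dim=1)
        for layer in self.upper:
            joint = layer(joint)
        return joint
```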
4.2 ALBERT
There are three major changes:
- The dimension of the embedding layer ($E$) is set much smaller than that of the hidden layer ($H$): the $V \times H$ embedding matrix is factorized into a $V \times E$ matrix and an $E \times H$ matrix, where $E \ll H$ (see the sketch after this list).
- All parameters are shared across the encoder layers.
- The NSP task is replaced with the SOP (sentence order prediction) task.
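To make the parameter saving concrete (with illustrative sizes, not figures from the paper): for V = 30000, H = 768 and E = 128, the embedding goes from V·H ≈ 23M parameters down to V·E + E·H ≈ 3.9M. A minimal sketch of the factorized embedding:

```python
import torch.nn as nn

V, E, H = 30000, 128, 768   # vocab size, embedding size, hidden size (E << H)

# Standard embedding: V * H parameters (~23M for these sizes).
full_embedding = nn.Embedding(V, H)

# ALBERT-style factorization: V * E + E * H parameters (~3.9M for these sizes).
factorized = nn.Sequential(
    nn.Embedding(V, E),            # project tokens into a small E-dimensional space
    nn.Linear(E, H, bias=False),   # then map up to the hidden size H
)
```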
4.3 FastBERT
1. Major contributions
The purpose of this article is to optimize inference (predict-time) performance with an early-exit strategy: at prediction time, each Transformer layer is followed by a classifier (the method proposed by the authors targets classification problems). If a classifier's result is sufficiently reliable, the remaining layers are skipped and inference ends early.
Thus, samples that are easy to predict (with distinct features) can get a result after one or two layers, while harder samples need to go through all the layers. Since a classifier is far less computationally expensive than a Transformer layer, inference performance improves on average.
2. Brief description of the model
The diagram above shows the model architecture; the training and inference procedure is as follows:
- Pre-training: BERT (or another pre-trained model) is used as the backbone without modification; the pre-training stage is unchanged.
- Fine-tuning for backbone: fine-tune the backbone on the classification task.
- Self-distillation for branch: distill the knowledge of the backbone into the branch classifiers using unlabeled data. That is, compute the KL divergence between the probability distribution predicted by the backbone and the probability distribution predicted by each layer's branch classifier, then sum the losses over all layers ($L-1$ students in total) to get the final loss.
- Adaptive inference: use the student classifiers to classify samples at different depths. If the result is clear, prediction finishes immediately; if not, the sample continues to the next layer. The authors define the **uncertainty** of a prediction as:
$p_s(i)$ denotes the output probability distribution of the classifier and $N$ the number of class labels; this definition is essentially the entropy of the probability distribution, normalized using $N$. The more concentrated the distribution, the lower the entropy, the lower the uncertainty, and the clearer the classification result. The early-exit condition at each layer is governed by a speed threshold: when the uncertainty produced by a layer's classifier is below speed, inference terminates early. Therefore, as speed increases, fewer samples are sent to the higher layers and inference becomes faster.
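A minimal sketch of this adaptive-inference loop (the classifier interfaces, the use of the [CLS] position, and the single-example assumption are illustrative):

```python
import torch

def normalized_entropy(probs, eps=1e-12):
    """Uncertainty = entropy of the predicted distribution, normalized by log(N)."""
    n_classes = probs.size(-1)
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)
    return entropy / torch.log(torch.tensor(float(n_classes)))

def adaptive_inference(x, transformer_layers, classifiers, speed=0.5):
    """Run one sample layer by layer; exit as soon as a branch classifier is confident."""
    hidden = x  # (1, seq_len, dim): a single example
    for layer, clf in zip(transformer_layers, classifiers):
        hidden = layer(hidden)
        probs = torch.softmax(clf(hidden[:, 0]), dim=-1)    # classify the [CLS] position
        if normalized_entropy(probs).item() < speed:        # confident enough: stop early
            return probs
    return probs  # fell through all layers: use the final-layer prediction
```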
3. Experimental results
Taking the results on the Chinese datasets as an example, several conclusions can be drawn:
- Overall, it is considerably better than DistilBERT.
- As the speed threshold is raised, inference speeds up with only a small loss in accuracy, so the overall speed/accuracy trade-off is good.
References: paper | code