BERT performs well, but the model is large and slow, so model compression methods are needed. TinyBERT is a compressed version of BERT proposed by researchers from Huazhong University of Science and Technology and Huawei. TinyBERT is compressed mainly through model distillation. On the GLUE benchmark it retains 96% of BERT-Base's performance while being over 7 times smaller and about 9 times faster.

1. Introduction

Model distillation is a common model compression method: a large teacher model is trained first, and then a small student model is trained on the teacher model's outputs. The student model learns the teacher's generalization ability by imitating the teacher's predicted probabilities (soft labels).

There have been many studies on compressing BERT with model distillation, such as distilling BERT into a BiLSTM, and Hugging Face's DistilBERT, which was described in detail in the previous article, "Distilling BERT with a Model". Readers who are not yet familiar with model distillation can refer to that article.

This article introduces TinyBERT, another BERT distillation model. In previous distillation models, the loss function mainly targets the predicted probabilities output by the teacher model, while TinyBERT's loss function has four parts: the embedding loss, the Transformer attention loss, the Transformer hidden state loss, and the final prediction layer loss. That is, the student model learns not only the teacher model's predicted probabilities, but also the features of its embedding and Transformer layers.

The figure above shows the structure of TinyBERT (student) and BERT (teacher). It can be seen that TinyBERT reduces both the number of Transformer layers and the hidden dimension relative to BERT.

2. TinyBERT

The loss used in TinyBERT's distillation process consists of the following four parts:

  • Embedding loss function
  • Transformer layer attention loss function
  • Transformer layer hidden state loss function
  • Prediction layer loss function

Let’s first look at the mapping of each layer in TinyBERT’s distillation.

2.1 Mapping method of TinyBERT distillation

Suppose TinyBERT has M Transformer layers and BERT has N Transformer layers. The layers involved in TinyBERT's distillation are the embedding layer (numbered 0), the Transformer layers (numbered 1 to M), and the prediction layer (numbered M+1).

We need to match each TinyBERT layer to a BERT layer to learn from, and then distill. The mapping function is g(m) = n, where m is the TinyBERT layer number and n is the BERT layer number.

For embedding, TinyBERT’s embedding (0) corresponds to BERT’s embedding (0), that is, g(0) = 0.

For the output layer, TinyBERT's output layer (M+1) corresponds to BERT's output layer (N+1), that is, g(M+1) = N+1.

For the intermediate Transformer layers, TinyBERT uses a uniform mapping that picks one BERT layer every k = N/M layers, that is, g(m) = m × N/M. For example, if TinyBERT has 4 Transformer layers and BERT has 12, then TinyBERT's layer 1 learns from BERT's layer 3, TinyBERT's layer 2 learns from BERT's layer 6, and so on.
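As a quick illustration, here is a minimal Python sketch of this mapping (a sketch only: the function name layer_mapping and its arguments are my own, the paper just defines g(m)):

```python
# Minimal sketch of the uniform layer mapping g(m) described above.
# M = number of TinyBERT (student) Transformer layers, N = number of BERT (teacher) layers.

def layer_mapping(m: int, M: int, N: int) -> int:
    """Map student layer m (0 .. M+1) to the teacher layer it distills from."""
    if m == 0:            # embedding layer
        return 0
    if m == M + 1:        # prediction (output) layer
        return N + 1
    return m * N // M     # uniform mapping for the intermediate Transformer layers

# Example: 4-layer TinyBERT distilled from 12-layer BERT
print([layer_mapping(m, M=4, N=12) for m in range(0, 6)])  # [0, 3, 6, 9, 12, 13]
```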

2.2 Embedding loss function

In the embedding loss, E^S is TinyBERT's embedding, E^T is BERT's embedding, L is the length of the input sequence, d′ is TinyBERT's embedding dimension, and D is BERT's embedding dimension. Since the BERT model is compressed, d′ < D. TinyBERT wants the embedding learned by the student to carry semantics similar to BERT's original embedding, so the following loss is adopted to reduce the difference between them.

Because the embedding dimensions differ, the loss cannot be computed on E^S and E^T directly, so TinyBERT adds a mapping matrix W_e of shape (d′ × D); after E^S is multiplied by W_e, its dimension matches E^T. The embedding loss is then the mean squared error (MSE) between the two: L_embd = MSE(E^S · W_e, E^T).
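Below is a minimal PyTorch sketch of this embedding loss, assuming TinyBERT4-like sizes (hidden size 312 for the student, 768 for BERT-Base); the class name EmbeddingDistillLoss and the tensor shapes are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDistillLoss(nn.Module):
    """Project student embeddings into the teacher's dimension with W_e, then take the MSE."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.W_e = nn.Linear(d_student, d_teacher, bias=False)  # learnable mapping matrix

    def forward(self, student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
        # student_emb: (batch, seq_len, d_student), teacher_emb: (batch, seq_len, d_teacher)
        return F.mse_loss(self.W_e(student_emb), teacher_emb)

# Example with illustrative dimensions (312 -> 768)
loss_fn = EmbeddingDistillLoss(d_student=312, d_teacher=768)
s = torch.randn(2, 128, 312)   # TinyBERT embeddings
t = torch.randn(2, 128, 768)   # BERT embeddings
print(loss_fn(s, t))
```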

2.3 Transformer layer attention loss function

TinyBERT has two loss functions at the Transformer layer. The first is the attention loss.

The attention loss encourages the attention score matrices produced by TinyBERT's multi-head attention to stay close to BERT's attention score matrices. Studies have found that the attention matrices BERT learns encode linguistic knowledge such as syntax and coreference relations; for details, see the paper "What Does BERT Look At? An Analysis of BERT's Attention". TinyBERT therefore imitates BERT's attention with the following loss, where h is the number of heads in multi-head attention and A_i^S, A_i^T are the attention score matrices of head i in the student and teacher: L_attn = (1/h) Σ_i MSE(A_i^S, A_i^T).
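A minimal PyTorch sketch of this attention loss follows; the function name and the (batch, heads, seq_len, seq_len) tensor layout are assumptions for illustration (the paper distills the unnormalized attention scores, i.e. before softmax):

```python
import torch
import torch.nn.functional as F

def attention_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    """Average the per-head MSE between student and teacher attention score matrices."""
    # Both tensors: (batch, heads, seq_len, seq_len); student and teacher use the same number of heads.
    h = student_attn.size(1)
    return sum(F.mse_loss(student_attn[:, i], teacher_attn[:, i]) for i in range(h)) / h

# Example with 12 heads and sequence length 128
s = torch.randn(2, 12, 128, 128)   # TinyBERT attention scores
t = torch.randn(2, 12, 128, 128)   # BERT attention scores
print(attention_loss(s, t))
```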

2.4 Transformer layer hidden state loss function

TinyBERT's second loss function at the Transformer layer is the hidden state loss.

The hidden state loss is similar to the embedding loss: because the hidden sizes differ, the student's hidden states also go through a mapping matrix W_h before the MSE is computed against the matched teacher layer, that is, L_hidn = MSE(H^S · W_h, H^T).
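A corresponding PyTorch sketch, again with illustrative dimensions and names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Identical in form to the embedding loss, but applied to the hidden states of
# student layer m and of the matched teacher layer g(m).
W_h = nn.Linear(312, 768, bias=False)   # mapping matrix: student hidden size -> teacher hidden size

def hidden_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(W_h(student_hidden), teacher_hidden)

print(hidden_loss(torch.randn(2, 128, 312), torch.randn(2, 128, 768)))
```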

2.5 Prediction layer loss function

The prediction layer loss uses cross entropy between the softened outputs of teacher and student: L_pred = CE(z^T / t, z^S / t), where t is the distillation temperature, z^T is the vector of logits predicted by BERT, and z^S is the vector of logits predicted by TinyBERT.
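A minimal PyTorch sketch of this soft cross entropy, with illustrative names and shapes (the default temperature of 1 here is just a placeholder):

```python
import torch
import torch.nn.functional as F

def prediction_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, t: float = 1.0) -> torch.Tensor:
    """Soft cross entropy between temperature-scaled teacher and student logits."""
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)          # soft labels from BERT
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)  # TinyBERT log-probabilities
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Example: a batch of 8 examples on a 2-class task
print(prediction_loss(torch.randn(8, 2), torch.randn(8, 2), t=1.0))
```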

3. TinyBERT's two-stage training method

BERT training has two phases: the first phase trains a general pre-trained model (pre-training), and the second phase fine-tunes it for a specific downstream task (fine-tuning). TinyBERT's distillation process is likewise divided into two stages.

  • In the first stage (general distillation), a general TinyBERT is distilled from the pre-trained BERT on a large-scale general-domain corpus.
  • In the second stage (task-specific distillation), TinyBERT applies data augmentation and distills from a fine-tuned BERT to obtain a task-specific model (see the sketch after this list for how the per-layer losses are combined).
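To tie the pieces together, here is a hedged sketch of how the per-layer losses from Section 2 could be summed into a single distillation objective; the per-layer weights (lambdas) are shown as a generic hyperparameter, and this is not meant as the authors' exact implementation:

```python
import torch

def distillation_objective(layer_losses, lambdas):
    """Weighted sum of per-layer losses: embedding loss for layer 0, attention plus
    hidden-state loss for the Transformer layers, prediction loss for the output layer."""
    # layer_losses[m] is the loss of student layer m against teacher layer g(m)
    return sum(lam * loss for lam, loss in zip(lambdas, layer_losses))

# Example with a 4-layer student: embedding + 4 Transformer layers + prediction layer
losses = [torch.tensor(0.5)] * 6
print(distillation_objective(losses, lambdas=[1.0] * 6))
```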

4. References

  • TinyBERT: Distilling BERT for Natural Language Understanding