The Problems with BERT
After BERT was released, it set new state-of-the-art results on many NLP task leaderboards. However, the model is very large, which causes some problems. ALBERT’s paper divides these problems into two categories:
Memory limit
Consider a simple neural network with one input node, two hidden nodes, and one output node. Even for such a simple network there are 7 parameters to learn, because each hidden and output node has its own weights and bias
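To make the count concrete, here is a toy PyTorch sketch of the network described above (assuming one bias per hidden/output node):

```python
import torch.nn as nn

# Toy network: 1 input -> 2 hidden -> 1 output.
# Parameters: (1*2 weights + 2 biases) + (2*1 weights + 1 bias) = 7
toy = nn.Sequential(nn.Linear(1, 2), nn.Linear(2, 1))
print(sum(p.numel() for p in toy.parameters()))  # 7
```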
BERT-large is a complex model: 24 hidden layers, large feed-forward networks and many attention heads in each layer, and a total of 340 million parameters. Training it from scratch requires an enormous amount of computing resources
Model Degradation
The recent trend in NLP research is to use larger and larger models for better performance. ALBERT’s authors argue that blindly stacking more parameters may actually hurt performance
In the paper, the authors ran an interesting experiment
If larger models lead to better performance, why not double the hidden size of the largest BERT model (BERT-large) from 1024 to 2048?
They call this model “BERT-xlarge”. Surprisingly, the larger model performs worse than BERT-large on both the language modeling task and a reading comprehension test (RACE)
From the graph given in the original article (below), we can see how performance degrades
ALBERT
Overview
ALBERT greatly reduces the number of model parameters through techniques such as parameter sharing and matrix factorization, and replaces the NSP (Next Sentence Prediction) loss with an SOP (Sentence Order Prediction) loss to improve downstream task performance. However, the number of layers in ALBERT does not decrease, so inference time is not improved. The reduction in parameters does make training faster, though, and it lets ALBERT scale to models larger than BERT (ALBERT-xxlarge), achieving better performance
ALBERT’s architecture is basically the same as BERT’s: a Transformer encoder with the GELU activation function. There are three main innovations:
- Factorized embedding parameterization
- Cross-layer parameter sharing
- Replacing the NSP task with an SOP task
The main effect of the first two changes is to reduce the number of parameters. The third is not really an innovation, since plenty of prior work had already found that BERT’s NSP task has no positive effect
Factorized Embedding Parameterization
The original BERT model and the various Transformer-based pre-trained language models share a common property: $E = H$, where $E$ is the embedding dimension and $H$ is the hidden dimension. This creates a problem: whenever the hidden dimension is increased, the embedding dimension must increase as well, which causes the number of embedding parameters to grow with it. ALBERT’s authors therefore untie $E$ and $H$ by adding a projection matrix after the embedding layer to change the dimension. The embedding dimension $E$ stays fixed; if $H$ grows, we only need an up-projection after the embedding
Therefore, instead of mapping the one-hot vectors directly into a hidden space of size $H$, ALBERT decomposes the embedding into two matrices. The original number of parameters is $V \times H$, where $V$ is the vocabulary size. After decomposing into two steps, this becomes $V \times E + E \times H$, which significantly reduces the parameter count when $H$ is large
For example, when $V$ is 30000, $H$ is 768, and $E$ is 256, the number of embedding parameters drops from 23 million to 7.8 million
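As a rough illustration (a minimal PyTorch sketch with assumed module names, not ALBERT’s actual implementation), the factorization replaces one large lookup table with a small table followed by a projection:

```python
import torch.nn as nn

V, H, E = 30000, 768, 256  # vocab size, hidden size, embedding size

# BERT-style: a single V x H lookup table (E tied to H)
tied = nn.Embedding(V, H)                                   # 30000 * 768 ≈ 23M

# ALBERT-style: a V x E lookup table plus an E x H projection
factorized = nn.Sequential(nn.Embedding(V, E),              # 30000 * 256
                           nn.Linear(E, H, bias=False))     # + 256 * 768 ≈ 7.8M

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(tied), count(factorized))  # 23040000 7876608
```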
The embedding-factorization experiments show that, without parameter sharing, performance keeps improving as $E$ increases. With parameter sharing this no longer holds, and the largest $E$ does not work best. The experiments also show that parameter sharing costs about 1-2 points of accuracy
Cross-Layer Parameter Sharing
In a standard Transformer, every layer has its own parameters, including its self-attention and fully connected (feed-forward) sub-layers, so the number of parameters grows significantly with the number of layers. Previous work had some success sharing only the self-attention parameters or only the feed-forward parameters. ALBERT’s authors instead share the parameters of all layers, which is equivalent to learning only the first layer’s parameters and reusing them in all the remaining layers
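Conceptually (a minimal PyTorch sketch, not the official ALBERT code), cross-layer sharing means applying one encoder layer repeatedly instead of stacking separately parameterized layers:

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """All 'layers' reuse the weights of a single Transformer encoder layer."""
    def __init__(self, hidden=768, heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            activation="gelu", batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):  # same module applied num_layers times
            x = self.layer(x)
        return x
```

Only one layer’s parameters exist, so the parameter count no longer grows with depth, but the compute per forward pass is unchanged, which is why inference is not faster.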
The authors also found experimentally that parameter sharing helps stabilize the model. The experimental results are shown in the figure below
BERT-base and ALBERT use the same number of layers and a hidden size of 768, giving 110 million parameters for BERT-base but only 31 million for ALBERT. The experiments show that sharing the feed-forward parameters hurts accuracy noticeably, while sharing the attention parameters has almost no effect
Sentence-Order Prediction (SOP)
BERT introduced a binary classification task called Next Sentence Prediction (NSP), designed specifically to improve the performance of downstream tasks that use sentence pairs, such as natural language inference. But papers such as RoBERTa and XLNet showed that NSP is ineffective and that its effect on downstream tasks is unreliable
So ALBERT proposes a different task: Sentence Order Prediction (SOP). The key idea is:
- Take two consecutive sentences from the same document as a positive sample
- Swap the order of the two sentences and use it as a negative sample
SOP improves performance on multiple downstream tasks (SQuAD, MNLI, SST-2, RACE)
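As a sketch of how such training pairs could be built (a hypothetical helper, not ALBERT’s actual data pipeline), two consecutive segments either keep their order (positive) or are swapped (negative):

```python
import random

def make_sop_example(seg_a, seg_b):
    """seg_a and seg_b are consecutive segments from the same document."""
    if random.random() < 0.5:
        return (seg_a, seg_b), 1   # positive: original order
    return (seg_b, seg_a), 0       # negative: swapped order

pair, label = make_sop_example("He went to the store.", "He bought some milk.")
```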
Adding Data & Removing Dropout
ALBERT initially used the same training data as BERT. Since additional training data can improve performance, ALBERT was also trained on 157 GB of data after adding the STORIES dataset. In addition, the model still had not overfit the training set at 1M steps, so the authors simply removed dropout, which brought a significant improvement on the MLM validation set
Conclusion
When I first read this paper I was surprised, because it shrinks a BERT of the same size by more than 10x, making it feasible for ordinary users to run. But a closer look at the experiments shows that the reduction in parameters comes at a cost
Note that the reported speedup refers to training time, not inference time. Inference time is not improved: even with parameter sharing, all 12 encoder layers still have to be executed, so inference takes about as long as BERT
The parameters used in the experiment are as follows
It can be concluded that:
- ALBERT did get better results than BERT for the same amount of training time
- For the same inference time, ALBERT-base and ALBERT-large performed worse than BERT, by about 2-3 points. The authors also mention at the end that they will keep looking for ways to speed the model up (e.g. sparse attention and block attention).
In addition, combining this with the Universal Transformer suggests that the number of Transformer layers could be adjusted dynamically during training and inference (saying goodbye to fixed configurations of 12, 24, or 48 layers). We could also try to offset the downside of pure parameter sharing: since deeper Transformer layers learn more task-related information, the shared module could be improved by adding memory units or a personalized embedding at each layer