ALBERT
Although the BERT model is very effective, its effectiveness relies on an enormous number of parameters, so training a BERT model from scratch costs a great deal of time and compute, and such a complex model can even hurt the final results. This article introduces ALBERT, a slimmer version of the blockbuster BERT model: through several optimization strategies it obtains a model much smaller than BERT, yet it outperforms BERT on benchmarks such as GLUE and RACE.
ALBERT’s paper:
Lan Z, Chen M, Goodman S, et al. Albert: A lite bert for self-supervised learning of language representations[J]. arXiv preprint arXiv:1909.11942, 2019.
Directory:
- Training for large-scale pre-training models
- Review of BERT model
- Several optimization strategies used by ALBERT
- ALBERT summary
- Future (current) works
- Prospects for the NLP field
1. Training for large-scale pre-training models
We have witnessed a big shift in the approach to natural language understanding over the last two years.
The breakthroughs of the past two years in natural language processing mainly come from "full-network pre-training" algorithms, including the three papers in the figure above (GPT, BERT, etc.). The key idea of full-network pre-training is that pre-training and the downstream tasks share most of the parameters. By contrast, Skip-Gram and Continuous Bag of Words (CBOW) only pre-train the word embedding part, which contains relatively few parameters; to do a downstream task you have to add many new parameters on top and train them from scratch, which in turn requires a lot of labeled data for the downstream task.
How powerful are full-network pre-training models? Here is a comparison with AlexNet, which is essentially the fuse that ignited the deep learning boom in AI. In the figure below, the left plot shows the error on ImageNet: in 2012 there was a steep drop in error because of AlexNet, and error kept dropping sharply every year afterwards thanks to deep learning. On the right is the RACE dataset, released in 2017, where a model using only local pre-training (similar to Skip-Gram and CBOW) reached 45.9% accuracy. In June 2018, GPT, which is essentially full-network pre-training, achieved 59% accuracy. BERT, in November of the same year, pushed it to 72%. Then XLNet reached 81.8%, RoBERTa 83.2%, and ALBERT 89.4%. Thanks to full-network pre-training, accuracy roughly doubled from 45.9% to 89.4%.
2. Review of BERT model
Can we improve full-network pre-training models, similar to what the computer vision community did after AlexNet?
As can be seen from the figure above, after AlexNet most of the accuracy gains came from increasing network depth. The figure below is from BERT's paper: BERT also ran experiments on widening and deepening the network, and found that accuracy improves as the network gets wider and deeper.
Is having a better NLU model as easy as increasing the model size?
Here is a quick review of BERT's model. In the figure below, merely doubling BERT's width pushes the parameter count to 1.27 billion, and the model immediately runs out of memory.
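As a rough sanity check (a back-of-the-envelope sketch that counts only weight matrices and ignores biases, LayerNorm, and position embeddings), the parameter count of a Transformer encoder grows quadratically with the hidden size, which is why doubling the width blows the model up to roughly 1.27 billion parameters:

```python
# Rough estimate of encoder parameters; only weight matrices are counted
# (biases, LayerNorm, and position/type embeddings are ignored).
def approx_params(hidden, layers, vocab=30522, ffn_mult=4):
    per_layer = 4 * hidden * hidden                 # Q, K, V and output projections
    per_layer += 2 * hidden * (ffn_mult * hidden)   # feed-forward up- and down-projection
    return layers * per_layer + vocab * hidden      # encoder layers + token embedding

print(f"BERT-large (H=1024): {approx_params(1024, 24) / 1e6:.0f}M")  # ~333M, close to BERT-large's 334M
print(f"2x width   (H=2048): {approx_params(2048, 24) / 1e9:.2f}B")  # ~1.27B
```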
Can we use parameters more efficiently? Can we greatly reduce the number of parameters without reducing accuracy, or while only reducing it slightly? This is the theme of ALBERT.
3. Several optimization strategies adopted by ALBERT
3.1 What are the main parameters of BERT and Transformer
In the figure above, the attention and feed-forward blocks account for about 80% of the total parameters, and the token embedding projection block accounts for about 20%.
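This split can be checked with a quick estimate for a BERT-Base-sized model (a sketch counting only the weight matrices; exact numbers vary slightly with the vocabulary used):

```python
# Approximate parameter split for a BERT-Base-sized encoder (H=768, 12 layers,
# ~30k vocabulary), counting only weight matrices.
H, L, V = 768, 12, 30522
attention_ffn = L * (4 * H * H + 2 * H * 4 * H)  # attention + feed-forward blocks
token_embedding = V * H                          # token embedding / projection
total = attention_ffn + token_embedding
print(f"attention + feed-forward: {attention_ffn / total:.0%}")  # roughly 80%
print(f"token embedding:          {token_embedding / total:.0%}")  # roughly 20%
```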
3.2 Method 1: factorized embedding parametrization
The figure below shows BERT's one-hot vector input. In the first projection (the embedding lookup) there is no interaction between words; words only interact with each other when attention is computed. Therefore a very high-dimensional vector is not needed for this first projection.
The picture below shows ALBERT's improvement: first project the one-hot input into a very low dimension, then project it up into the dimension we need. This has two advantages:
- The context-independent representation (the word embedding) is decoupled from the context-dependent representation (the hidden states). You can freely enlarge the context-dependent representation, i.e., make the network wider.
- The projection from the one-hot vectors carries most of the embedding parameters, and the factorization makes this part very small.
Matrix decomposition here is essentially a low-rank factorization that reduces the parameter count by lowering the dimension of the embedding part. In the original BERT (taking Base as an example), the embedding dimension is 768, the same as the hidden dimension. However, such a high dimension is not needed for a distributed representation of words; in the Word2Vec era, 50 to 300 dimensions were typically used. A very simple idea is to split the embedding matrix to reduce its parameter count, which can be expressed as:

$$O(V \times H) \rightarrow O(V \times E + E \times H)$$

where $V$ is the vocabulary size, $H$ the hidden dimension, and $E$ the word embedding dimension.
Take BERT-Base as an example: the hidden size is 768 and the vocabulary size is about 30,000, so the embedding matrix has 768 × 30,000 = 23,040,000 parameters. If the embedding dimension is reduced to 128, the embedding part has 128 × 30,000 + 128 × 768 = 3,938,304 parameters. The difference is 19,101,696, roughly 19M, so the embedding parameters drop from about 23M to about 4M. Viewed globally, however, BERT-Base has about 110M parameters, and saving 19M is not a revolutionary change. Therefore, the factorization of the embedding layer is not the main means of reducing the parameter count.
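A minimal PyTorch sketch of this factorization (illustrative; the module and names are not ALBERT's actual implementation) together with the corresponding parameter counts:

```python
import torch.nn as nn

# Factorized embedding parametrization (sketch): the V x H embedding is replaced
# by a V x E lookup followed by an E x H projection.
class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embedding_size=128, hidden_size=768):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_size)       # V x E
        self.projection = nn.Linear(embedding_size, hidden_size, bias=False)  # E x H

    def forward(self, input_ids):
        return self.projection(self.word_embeddings(input_ids))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(nn.Embedding(30000, 768)))   # 23040000, the original V x H embedding
print(count(FactorizedEmbedding()))      # 3938304, matching the numbers above
```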
3.3 Method 2: Cross-layer parameter sharing
This method is inspired by the paper "Gong L, He D, Li Z, et al. Efficient training of BERT by progressively stacking[C]//International Conference on Machine Learning. 2019: 2337-2346", which observes that the attention in each Transformer layer focuses on similar regions, so the weights can be shared across layers.
The other method ALBERT proposes to reduce the number of parameters is cross-layer parameter sharing, where multiple layers use the same parameters. There are three ways to share: share only the feed-forward network parameters, share only the attention parameters, or share all parameters; ALBERT shares all parameters by default. Note that weight sharing does not mean the outputs of the 12 transformer encoder layers are identical, but that the layers reuse the same variables. The amount of computation is therefore unchanged, while the encoder's parameter count drops to 1/12 of the original.
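A minimal sketch of cross-layer sharing (using PyTorch's built-in nn.TransformerEncoderLayer as a stand-in for ALBERT's layer; names and sizes here are illustrative):

```python
import torch.nn as nn

# Cross-layer parameter sharing (sketch): one encoder layer is instantiated once
# and applied num_layers times, so a 12-layer forward pass still costs 12 layers
# of computation, but the layer parameters are stored only once.
class SharedEncoder(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, hidden_states):
        for _ in range(self.num_layers):  # the same weights are reused at every layer
            hidden_states = self.layer(hidden_states)
        return hidden_states
```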
3.4 Widening the model
3.5 Deepening the model
3.6 Pre-training strategy SOP replaces NSP
Design better self-supervised learning tasks
This pre-training strategy is another innovation. SOP stands for Sentence Order Prediction and replaces the NSP task in BERT; after all, some experiments show that NSP adds little and can even hurt the model. SOP is similar to NSP in that it also asks whether the second segment follows the first. However, the negative examples are not built from unrelated sentences: instead, two consecutive sentences are swapped to form a negative example.
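A minimal sketch of how SOP training pairs might be built (illustrative; the real data pipeline works on token segments rather than raw strings):

```python
import random

# SOP example construction (sketch): positives are two consecutive segments in
# their original order; negatives are the same two segments with the order swapped.
# (NSP, by contrast, draws its negatives from a different document.)
def make_sop_example(segment_a, segment_b):
    if random.random() < 0.5:
        return (segment_a, segment_b), 1  # label 1: correct order
    return (segment_b, segment_a), 0      # label 0: swapped order

pair, label = make_sop_example("He went to the store.", "He bought a gallon of milk.")
print(pair, label)
```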
3.7 Removing dropout
Further increase model capacity by removing dropout
ALBERT’s removal of Dropout provides a very small improvement in accuracy, but significantly reduces memory usage during training.
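For reference, assuming the Hugging Face transformers library, this amounts to setting the dropout probabilities to zero in the model config:

```python
from transformers import AlbertConfig, AlbertModel

# ALBERT without dropout (sketch, assumes the Hugging Face transformers library):
# both dropout probabilities are set to 0.0 explicitly here.
config = AlbertConfig(hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0)
model = AlbertModel(config)
```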
3.8 Increasing the training data
4. ALBERT summary
ALBERT's idea is to slim BERT down so that an even bigger BERT can be trained. The authors reduce the number of parameters through weight sharing and matrix factorization. This lowers the space complexity, but not the amount of computation, so the model is not faster at downstream tasks or inference. Therefore, the author notes that current work on optimizing BERT is also moving in the direction of reducing time complexity.
ALBERT’s innovations are shown below:
5. Future (current) works
6. Prospects for the NLP field