The BERT model was proposed in 2018 and brought unprecedented advances to many natural language processing tasks. As a result, a lot of work was built on BERT in 2019, including two improved models, RoBERTa and ALBERT. RoBERTa trained BERT on larger data sets with better-tuned settings, improving BERT’s performance further. ALBERT mainly compresses BERT, reducing the number of parameters by sharing parameters across all layers and by decomposing the Embedding matrix.

1. Introduction

This article mainly introduces two improved BERT models, RoBERTa and ALBERT. For the BERT model itself, you can refer to the previous article “Thoroughly Understanding The Google BERT Model”. First, the main characteristics of RoBERTa and ALBERT are briefly reviewed.

RoBERTa:

  • Bigger training set, bigger batch.
  • NSP Loss is not required.
  • Use longer training sequences.
  • Dynamic Mask.

ALBERT:

  • Decompose the Embedding matrix and reduce the dimensions.
  • All Transformer layers share parameters.
  • Use Sentence Order Prediction (SOP) instead of NSP.

2. RoBERTa

RoBERTa mainly tested a number of training settings in BERT (such as whether NSP Loss is useful, the batch size, etc.), found the best settings, and then trained BERT on a larger data set.

2.1 Larger data sets

The original BERT used only a 16G data set, while RoBERTa trained BERT on a much larger, 160G corpus:

  • BOOKCORPUS, 16G, the original BERT’s training data set
  • CC-NEWS, 76G
  • OPENWEBTEXT, 38G
  • STORIES, 31G

2.2 Remove NSP Loss

BERT adopted NSP Loss in the training process, originally intended to make the model better capture the semantics of the text. Given two segments X = [x1, x2, ..., xN] and Y = [y1, y2, ..., yM], the NSP task in BERT predicts whether Y comes after X.

However, NSP Loss has been questioned in many papers, such as XLNet. RoBERTa designed an experiment to verify the usefulness of NSP Loss, comparing four input formats:

Segment-Pair + NSP: This is BERT’s original training method, using NSP Loss. The two input segments X and Y can each contain multiple sentences, but the total length of X + Y must be less than 512 tokens.

Sentence-Pair + NSP: Similar to the previous setting and also using NSP Loss, but the two inputs X and Y are each a single sentence. Since an input therefore contains far fewer tokens than in Segment-Pair, the batch size is increased so that the total number of tokens per batch is similar to Segment-Pair.

Full-Sentences: No NSP is used. Sentences are sampled contiguously from one or more documents until the total length reaches 512 tokens. When the end of one document is reached, a document separator token is added to the sequence and sampling continues from the next document (a minimal packing sketch follows these four formats).

Doc-Sentences: Similar to Full-Sentences and also without NSP, but sentences are sampled only from a single document, so the input may be shorter than 512 tokens. Doc-Sentences therefore dynamically increases the batch size so that each batch contains about the same number of tokens as Full-Sentences.
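To make the Full-Sentences format concrete, here is a minimal sketch of how such packing could work. The names `documents`, `sep_id` and the data layout are illustrative placeholders, not RoBERTa’s actual implementation.

```python
# Minimal sketch of Full-Sentences packing; no NSP label is produced.
# `documents` is a list of documents, each a list of tokenized sentences
# (lists of token ids); `sep_id` is an illustrative document separator token id.

MAX_LEN = 512

def pack_full_sentences(documents, sep_id):
    sequences, current = [], []
    for doc_idx, doc in enumerate(documents):
        for sentence in doc:
            if current and len(current) + len(sentence) > MAX_LEN:
                sequences.append(current)          # emit a (close to) 512-token sequence
                current = []
            current.extend(sentence)
        if doc_idx < len(documents) - 1:           # crossing a document boundary:
            current.append(sep_id)                 # add an extra separator token
    if current:
        sequences.append(current)
    return sequences

# Example: two toy "documents" whose sentences are already lists of token ids.
docs = [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9]]]
print(pack_full_sentences(docs, sep_id=0))   # -> [[1, 2, 3, 4, 5, 0, 6, 7, 8, 9]]
```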

In the experimental results above, the top two rows use NSP. It can be seen that Segment-Pair (multiple sentences) works better than Sentence-Pair (single sentences): using single sentences degrades BERT’s performance on downstream tasks, probably because the model cannot learn long-range dependencies from single sentences.

The middle two rows are the results without NSP Loss. Both settings perform better than those with NSP, which indicates that NSP Loss does not actually help, so RoBERTa discards it.

2.3 Dynamic Mask

The original BERT masks the data once before training and then keeps the masks unchanged throughout training, which is called a Static Mask. That is, for the same sentence, the masked words stay the same over the entire training process.

RoBERTa adopted a Dynamic Mask strategy instead: the whole data set is duplicated 10 times and each copy is masked independently, so each sentence gets 10 different mask patterns, and BERT is trained on these 10 copies. More generally, dynamic masking means the mask pattern is regenerated every time a sequence is fed to the model.
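A minimal sketch of the difference between the two strategies, assuming a hypothetical `mask_tokens` helper, a toy data set, and a fixed 15% masking rate (the real BERT masking also mixes in random-replacement and keep-original cases, omitted here):

```python
import random

MASK_PROB = 0.15
MASK_ID = 103          # illustrative [MASK] token id

# Toy data set: each "sentence" is a list of token ids.
dataset = [[11, 12, 13, 14, 15, 16], [21, 22, 23, 24, 25, 26]]

def mask_tokens(token_ids, seed=None):
    """Replace roughly 15% of tokens with [MASK]; a fixed seed fixes the pattern."""
    rng = random.Random(seed)
    return [MASK_ID if rng.random() < MASK_PROB else t for t in token_ids]

# Static mask: the pattern is generated once in preprocessing and reused every epoch.
static_masked = [mask_tokens(seq, seed=i) for i, seq in enumerate(dataset)]

# Dynamic mask: a fresh pattern is drawn every time a sequence is fed to the model.
def get_training_example(seq):
    return mask_tokens(seq)            # no seed -> a different mask on every call
```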

The following figure shows the experimental results. It can be seen that the result of using Dynamic Mask is slightly better than the original Static Mask, so RoBERTa also used Dynamic Mask.

2.4 Larger Batch

Previous studies on neural machine translation have shown that using a very large batch with a correspondingly increased learning rate can both speed up optimization and improve final performance. RoBERTa also experimented with the batch size. The original BERT used batch = 256 for 1M training steps, which is equivalent in total training compute to batch = 2K for 125K steps, or batch = 8K for about 31K steps, as the quick check below shows. The following figure shows the results for different batch sizes (each with its own learning rate); the best results are achieved with batch = 2K.
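The three settings are comparable because they process roughly the same total number of training sequences; a quick check (with 31K rounded from 31.25K):

```python
# Total sequences processed = batch size x number of training steps.
settings = {
    "batch=256, steps=1M":   256   * 1_000_000,
    "batch=2K,  steps=125K": 2_000 * 125_000,
    "batch=8K,  steps=31K":  8_000 * 31_000,
}
for name, total in settings.items():
    print(f"{name}: ~{total / 1e6:.0f}M sequences")
# All three settings see roughly the same ~250M training sequences.
```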

3. ALBERT

BERT’s pre-trained model has a large number of parameters and takes a long time to train. ALBERT is a compressed version of BERT that reduces the number of parameters and the time required for training.

Note that ALBERT only reduces the number of BERT’s parameters, not its computation. ALBERT can reduce training time because fewer parameters means less communication during distributed training, but it cannot reduce inference time, because the Transformer computation at inference is the same as BERT’s.

Here are some optimizations for ALBERT.

3.1 Factorized embedding parameterization

This decomposes the Embedding matrix to reduce parameters. In BERT, the Embedding dimension is the same as the Transformer hidden size, both H. If the vocabulary size is V, the word Embedding matrix has V × H parameters, which becomes very large when the vocabulary is large.

Therefore, ALBERT uses a factorized method. Instead of directly mapping the one-hot word vectors to H-dimensional vectors, ALBERT first maps them to a lower-dimensional space (E dimensions) and then projects them to H dimensions, a process similar to a matrix factorization.
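A quick back-of-the-envelope comparison (the sizes V = 30,000, H = 768, E = 128 are illustrative, roughly BERT-base scale) shows how much the factorization saves:

```python
V, H, E = 30_000, 768, 128     # vocabulary size, hidden size, factorized embedding size

bert_embedding   = V * H             # one V x H embedding matrix
albert_embedding = V * E + E * H     # V x E embedding followed by an E x H projection

print(f"BERT-style embedding:   {bert_embedding:,} parameters")    # 23,040,000
print(f"ALBERT-style embedding: {albert_embedding:,} parameters")  #  3,938,304
```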

3.2 Cross-layer parameter sharing

This is the parameter sharing mechanism: all Transformer layers share one set of parameters. A Transformer layer contains the multi-head attention parameters and the feed-forward parameters. ALBERT experimented with four ways of sharing different parts of them, listed below; a minimal sketch of the all-shared idea follows the list.

All-shared: All Transformer parameters are shared.

Shared-attention: Only multi-head attention parameters are shared in Transformer.

Shared-ffn: Only feed-forward parameters in Transformer are shared.

Not-shared: No parameters are shared.
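A minimal sketch of the all-shared idea in PyTorch, using nn.TransformerEncoderLayer as a stand-in for BERT’s Transformer layer (the sizes are BERT-base-like defaults, not ALBERT’s exact architecture):

```python
import torch
import torch.nn as nn

class SharedTransformerEncoder(nn.Module):
    """ALBERT-style all-shared stack: one encoder layer is created and
    applied num_layers times, so its parameters are reused at every depth."""

    def __init__(self, hidden_size=768, num_heads=12, ffn_size=3072, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)   # same weights, applied at every layer
        return x

encoder = SharedTransformerEncoder()
out = encoder(torch.randn(2, 16, 768))   # (batch, sequence, hidden)
print(out.shape)                         # torch.Size([2, 16, 768])
```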

The figure above shows the number of parameters under different sharing modes. The model with all parameters shared is much smaller than the model without sharing. With E = 768, the not-shared setting corresponds to BERT-base, with 108M parameters; after sharing all parameters, the model shrinks to 31M parameters.
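These numbers can be roughly reproduced with a back-of-the-envelope calculation (BERT-base sizes; biases, LayerNorm and position embeddings are omitted for simplicity):

```python
V, H, L, FFN = 30_522, 768, 12, 3072   # vocab size, hidden size, layers, feed-forward size

embedding = V * H                      # ~23.4M word embedding parameters
per_layer = 4 * H * H + 2 * H * FFN    # attention projections + feed-forward, ~7.1M

not_shared = embedding + L * per_layer # one set of weights per layer
all_shared = embedding + 1 * per_layer # one shared set of weights

print(f"not-shared: ~{not_shared / 1e6:.0f}M, all-shared: ~{all_shared / 1e6:.0f}M")
# -> not-shared: ~108M, all-shared: ~31M
```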

Sharing parameters effectively reduces the model size, and it also helps stabilize training. The authors compared the L2 distance between the input and output of each Transformer layer in ALBERT and BERT and found that ALBERT’s curves are much smoother, as shown in the figure below.

3.3 Replace NSP with SOP

As RoBERTa’s results showed, NSP Loss was not useful for the model, so ALBERT also reconsidered it.

ALBERT argues that the NSP task in BERT is too simple: because the negative examples are randomly sampled, their two sentences often come from different topics, e.g. the first sentence from sports news and the second from entertainment news. When performing NSP, BERT therefore usually does not need to truly learn the semantics and ordering of sentences; it only needs to recognize that their topics differ.

ALBERT replaces NSP with SOP (Sentence Order Prediction), which predicts whether the order of two sentences has been swapped. Two consecutive sentences from the same document are used as input, and with some probability their order is swapped; the model must predict whether the swap happened. This forces the model to learn sentence-level semantics and inter-sentence relationships better.
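A minimal sketch of how SOP training examples could be constructed (the sentences and the 50% swap probability are illustrative):

```python
import random

def make_sop_example(sent_a, sent_b, rng=random):
    """Build one SOP example from two consecutive sentences of the same document.

    Label 1: the sentences keep their original order (positive example).
    Label 0: the sentences are swapped (negative example).
    """
    if rng.random() < 0.5:
        return (sent_a, sent_b), 1
    return (sent_b, sent_a), 0

pair, label = make_sop_example("The match started at noon.",
                               "The home team scored first.")
print(pair, label)
```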

4. Summary

RoBERTa is more like a BERT model that has been carefully tuned and trained with a much larger data set.

ALBERT compresses the number of BERT parameters and reduces the overhead of distributed training. However, it does not reduce the amount of computation, so inference speed does not improve.

5. References

  • RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations