RoBERTa’s biggest improvements over BERT are as follows:

  1. Dynamic Masking
  2. Removing the NSP (Next Sentence Prediction) task
  3. Using a larger batch size

Static Masking vs. Dynamic Masking

  • Static masking: the mask matrix is generated during data preprocessing, so each sample is randomly masked only once and the same mask is reused in every epoch
  • Modified static masking: during preprocessing the data is duplicated 10 times, each copy with a different mask, so the same sentence gets 10 different mask patterns; the model is then trained for N/10 epochs on each copy (N epochs in total)
  • Dynamic masking: a new mask pattern is generated every time a sequence is fed to the model. That is, masks are not produced during preprocessing but are drawn on the fly when the input is passed to the model (see the sketch after this list)
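Below is a minimal sketch of what dynamic masking can look like in code, assuming a PyTorch setup with BERT-style special tokens and a 15% masking rate; the function name and its arguments (`vocab_size`, `mask_token_id`, `special_token_ids`) are placeholders for illustration, not names from the RoBERTa code base. The key point is that the Bernoulli mask is redrawn every time the function is called (i.e. per batch, per epoch), instead of being fixed once during preprocessing.

```python
# Minimal sketch of dynamic masking (illustrative, not RoBERTa's actual code).
import torch

def dynamic_mask(input_ids: torch.Tensor,
                 vocab_size: int,
                 mask_token_id: int,
                 special_token_ids: set,
                 mlm_prob: float = 0.15):
    """Return (masked_inputs, labels); a fresh mask is drawn on every call."""
    labels = input_ids.clone()

    # Draw a new Bernoulli mask each time this runs (per batch),
    # instead of fixing the mask once during preprocessing.
    prob = torch.full(input_ids.shape, mlm_prob)
    special = torch.tensor([[tok in special_token_ids for tok in seq]
                            for seq in input_ids.tolist()])
    prob.masked_fill_(special, 0.0)            # never mask [CLS]/[SEP]/[PAD]
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                     # loss only on masked positions

    inputs = input_ids.clone()
    # BERT recipe: 80% -> [MASK], 10% -> random token, 10% -> keep original
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    inputs[replace] = mask_token_id
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    inputs[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return inputs, labels
```

In practice, Hugging Face's DataCollatorForLanguageModeling (with mlm=True) applies this kind of on-the-fly masking at batch-collation time, which is how dynamic masking is usually implemented today.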

Removing the NSP task

In fact, by 2020 many papers had long since stopped using the NSP task, but RoBERTa was one of the early works to question its usefulness. RoBERTa experimented with four input formats:

  • Segment-pair + NSP: the input contains two parts, each a segment (which may span multiple sentences) from the same document or from different documents; together the two segments contain fewer than 512 tokens. Pre-training includes both the MLM task and the NSP task. This is what the original BERT did
  • Sentence-pair + NSP: the input also contains two parts, each a single natural sentence from the same document or from different documents, and the two sentences together contain fewer than 512 tokens. Because these inputs are much shorter than 512 tokens, the batch size is increased so that the total number of tokens per batch stays similar to Segment-pair + NSP. Pre-training includes both the MLM task and the NSP task
  • Full-sentences: the input has only one part (not two), made up of consecutive sentences sampled from the same document or from different documents, with at most 512 tokens in total. The input may cross document boundaries, and when it does, a separator token marking the document boundary is added after the end of the previous document. Pre-training does not include the NSP task
  • Doc-sentences: the input has only one part (not two) and is constructed like Full-sentences, except that it may not cross document boundaries; the input consists of consecutive sentences from the same document, with at most 512 tokens in total. Inputs sampled near the end of a document can be shorter than 512 tokens, so in these cases the batch size is dynamically increased to reach roughly the same total number of tokens as Full-sentences. Pre-training does not include the NSP task
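In the RoBERTa paper, the two NSP-free formats (Full-sentences and Doc-sentences) matched or slightly outperformed the NSP variants, which is why the final model drops NSP. The following is a minimal sketch of Full-sentences-style packing, not RoBERTa's actual data pipeline: sentences are concatenated across document boundaries up to the 512-token budget, and a separator token is inserted at each document boundary. The `sep_id` argument and the tokenized `documents` structure are assumptions for illustration.

```python
from typing import Iterator, List

def pack_full_sentences(documents: List[List[List[int]]],
                        sep_id: int,
                        max_len: int = 512) -> Iterator[List[int]]:
    """documents: a list of documents, each a list of tokenized sentences (lists of token ids)."""
    buffer: List[int] = []
    for doc in documents:
        for sent in doc:
            if buffer and len(buffer) + len(sent) > max_len:
                yield buffer                 # emit one training sequence of <= max_len tokens
                buffer = []
            buffer.extend(sent)
        # the next sentences come from a new document: mark the boundary
        if buffer:
            buffer.append(sep_id)
    if buffer:
        yield buffer
```

Doc-sentences differs only in that the buffer is flushed at every document boundary instead of appending a separator and continuing.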

Using a larger batch size

In fact, I have read a claim before (from the Chinese-BERT-wwm work) that reducing the batch size significantly hurts the experimental results.

The RoBERTa authors also ran related experiments and found that using a large batch size helps improve performance.

In the paper's comparison table, bsz is the batch size; steps is the number of training steps (bsz × steps is kept roughly constant, so a larger bsz corresponds to fewer steps); lr is the learning rate; ppl is perplexity (lower is better); and the last two columns are accuracies on different downstream tasks.
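Batches of thousands of sequences rarely fit in GPU memory, so a common way to get a large effective batch size is gradient accumulation. Below is a minimal PyTorch sketch, assuming an existing `model` with a Hugging Face-style output exposing `.loss`, plus an `optimizer` and a `data_loader` of small micro-batches; `accum_steps` is chosen so that micro_batch_size × accum_steps equals the target batch size.

```python
import torch

def train_with_large_effective_batch(model, optimizer, data_loader,
                                     accum_steps: int = 32):
    """Sum gradients over `accum_steps` micro-batches before each optimizer step."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(data_loader, start=1):
        loss = model(**batch).loss / accum_steps   # scale so accumulated grads average correctly
        loss.backward()                            # gradients accumulate in .grad
        if step % accum_steps == 0:
            optimizer.step()                       # one update per accum_steps micro-batches
            optimizer.zero_grad()
```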
