How do you get the best results with limited data? Besides a good model, the most effective lever is data augmentation. Natural language processing (NLP) differs from images, which can be augmented by cropping, flipping, zooming, and so on; text needs its own techniques. The paper CoDA: Contrast-Enhanced and Diversity-Promoting Data Augmentation for Natural Language Understanding claims that its CoDA scheme improves on RoBERTa-large by 2.2% on average, so let's take a look at what it actually proposes.

Data augmentation

The paper mainly discusses data augmentation for text classification, but these augmentation methods can of course be used in other NLP tasks as well.

In a word, data augmentation applies some transformation to the original training set to obtain a new training set, and the model parameters are then learned from the data of both sets.
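In standard notation (the symbols below are illustrative rather than the paper's exact notation), this can be written as:

$$
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}
\;\xrightarrow{\;q\;}\;
\hat{\mathcal{D}} = \{(\hat{x}_i, y_i)\}_{i=1}^{N},
\qquad \hat{x}_i = q(x_i)
$$

$$
\min_{\theta}\; \sum_{i=1}^{N} \Big[ \mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x_i), y_i\big) + \mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(\hat{x}_i), y_i\big) \Big]
$$

where $q(\cdot)$ is a label-preserving transformation and $f_{\theta}$ is the classifier.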

The data augmentation methods given in previous papers can be summarized in the following figure:

Back-translation is easy to understand: as shown in the figure above, an English sentence is translated into German and then back into English, and the result is used as an additional training sample.
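A minimal back-translation sketch using publicly available MarianMT models on Hugging Face (the model names and helper functions here are illustrative choices, not necessarily what the paper used):

```python
# pip install transformers sentencepiece
from transformers import MarianMTModel, MarianTokenizer


def _translate(texts, model_name):
    """Translate a batch of sentences with a MarianMT model."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def back_translate(texts):
    """English -> German -> English, producing paraphrased training samples."""
    german = _translate(texts, "Helsinki-NLP/opus-mt-en-de")
    return _translate(german, "Helsinki-NLP/opus-mt-de-en")


print(back_translate(["The movie was surprisingly good."]))
```

In practice the intermediate language and the decoding strategy of the translation model control how diverse the paraphrases are.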

Adversarial training: adversarial training is used to improve the robustness of models on text data. It requires no extra domain knowledge, only samples produced by the model itself, namely the inputs the model is most likely to get wrong. The two most commonly used losses for adversarial training are shown below.
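In their standard forms (notation mine), the adversarial loss perturbs the input within an $\epsilon$-ball to maximize the supervised loss, while the virtual adversarial loss maximizes the divergence between the model's predictions and needs no labels:

$$
\mathcal{L}_{\mathrm{adv}}(x_i, y_i) = \max_{\|\delta\|\le\epsilon} \mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x_i+\delta),\, y_i\big)
$$

$$
\mathcal{L}_{\mathrm{vadv}}(x_i) = \max_{\|\delta\|\le\epsilon} D_{\mathrm{KL}}\big(f_{\theta}(x_i)\,\big\|\, f_{\theta}(x_i+\delta)\big)
$$

For text, the perturbation $\delta$ is applied to the word embeddings rather than to the raw tokens.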

In essence, the idea is to find inputs that the model treats as similar and add them to training. In practice, however, exact adversarial samples are hard to obtain, so the model's gradient is used to construct approximate adversarial samples, as shown in the following formula:
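A common single-step approximation (again in my notation) takes the gradient of the loss with respect to the input embeddings, normalizes it, and scales it to the allowed radius:

$$
g = \nabla_{e(x_i)}\, \mathcal{L}\big(f_{\theta}(x_i), y_i\big),
\qquad
\delta = \epsilon \cdot \frac{g}{\|g\|_2}
$$

where $e(x_i)$ denotes the word embeddings of $x_i$; multi-step variants iterate this update and re-project $\delta$ onto the $\epsilon$-ball.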

Promoting diversity

In fact, the ideas behind the augmentation methods above are quite similar: find samples that resemble the existing ones, with largely the same training objective. This raises a question: are different augmentation methods equivalent or complementary? Can mixing several augmentation methods improve generalization? In computer vision this has been shown to work, but it is much harder for text, where slight changes can cause large semantic differences.

The paper proposes several ways of combining data augmentation methods, as shown in the figure below:

The three strategies are straightforward. (a) Random combination: within a mini-batch, a randomly chosen augmentation method turns x into x'. (c) Sequential stacking: a sequence of augmentation methods is applied to turn x into x'. (b) Mixup interpolation: the embeddings e_i and e_j of two samples are interpolated, with the mixing coefficient λ drawn from a Beta distribution.
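The mixup step, in its usual form (notation mine; the paper may differ in whether the labels are also interpolated):

$$
\tilde{e} = \lambda\, e_i + (1-\lambda)\, e_j,
\qquad
\tilde{y} = \lambda\, y_i + (1-\lambda)\, y_j,
\qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha)
$$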

It is worth noting that sequential stacking requires choosing an order for the augmentation methods, and not every order is valid. For example, back-translation cannot be applied after generating adversarial training samples, since the adversarial perturbation lives in the embedding space rather than in the text.

This stacking can be expressed simply by the following formula:
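A sketch of the composed objective (notation mine; the paper's exact formulation may differ):

$$
\hat{x}_i = q_{\mathrm{adv}}\big(q_{\mathrm{bt}}(x_i)\big)
$$

$$
\mathcal{L}(x_i, y_i) =
\mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x_i), y_i\big)
+ \mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(\hat{x}_i), y_i\big)
+ \mathcal{R}_{\mathrm{CS}}\big(f_{\theta}(x_i), f_{\theta}(\hat{x}_i)\big)
$$

where $q_{\mathrm{bt}}$ is back-translation and $q_{\mathrm{adv}}$ constructs the adversarial perturbation on top of its output.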

Here x_i is first back-translated, and then the adversarial sample that is hardest for the model to distinguish is constructed from the result; finally the loss is computed on both the original and the augmented samples. There are three loss terms: the first is the ordinary cross-entropy, the second is the adversarial loss, and the third is the consistency loss, which requires the original sample and its adversarial counterpart to yield similar predictions. R_CS is defined as follows:
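One standard choice for $\mathcal{R}_{\mathrm{CS}}$ (which may or may not be the paper's exact definition) is the symmetric KL divergence between the two predictive distributions:

$$
\mathcal{R}_{\mathrm{CS}}(p, q) = D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,p)
$$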

Contrastive regularization

Looking at the three loss terms above, they require that the augmented sample x_i' be consistent with the prediction for x_i, but nothing requires x_i' to be distinguishable from other samples x_j. To make fuller use of the augmented data, the paper also introduces a contrastive learning objective: given that x_i' is generated from x_i, the model should learn which original sample is the "parent" of each augmented sample. As shown below:

A memory bank is used to store historical embeddings as a large pool of negative samples. To keep the encoder from updating too quickly (which would make the stored embeddings inconsistent), a momentum key encoder is introduced; its parameters are updated not by gradients but by the following formula:
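This is the MoCo-style momentum update (with the momentum coefficient $m$ close to 1):

$$
\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q
$$

where $\theta_q$ are the parameters of the gradient-trained query encoder and $\theta_k$ those of the momentum key encoder.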

When we have a sample x_i, we obtain three embeddings:
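Reading the description below, these are (in my notation) the encoder outputs for the original and augmented samples plus a key from the momentum encoder:

$$
z_i = g_{\theta_q}(x_i),
\qquad
\hat{z}_i = g_{\theta_q}(\hat{x}_i),
\qquad
k_i = g_{\theta_k}(x_i)
$$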

The new contrastive learning objective is as follows:
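An InfoNCE-style sketch of this objective (notation mine), which pulls both $z_i$ and $\hat{z}_i$ toward the key $k_i$ and away from the memory-bank negatives:

$$
\mathcal{L}_{\mathrm{CL}}(x_i) =
-\log \frac{\exp\!\big(\mathrm{sim}(z_i, k_i)/\tau\big)}
{\exp\!\big(\mathrm{sim}(z_i, k_i)/\tau\big) + \sum_{k^{-} \in M} \exp\!\big(\mathrm{sim}(z_i, k^{-})/\tau\big)}
-\log \frac{\exp\!\big(\mathrm{sim}(\hat{z}_i, k_i)/\tau\big)}
{\exp\!\big(\mathrm{sim}(\hat{z}_i, k_i)/\tau\big) + \sum_{k^{-} \in M} \exp\!\big(\mathrm{sim}(\hat{z}_i, k^{-})/\tau\big)}
$$

where $\mathrm{sim}(\cdot,\cdot)$ is a dot-product or cosine similarity.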

Here τ is the temperature and M is the memory bank. The expression has a simple meaning: the similarity between the embeddings of the sample x_i and the augmented sample x_i' and the key k_i produced by the momentum key encoder should be higher than their similarity to the negative samples in the memory bank. Integrated with the previous losses, this becomes the final learning objective:
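Schematically (the weighting coefficient $\lambda_{\mathrm{CL}}$ is my own addition; the paper may combine the terms differently):

$$
\mathcal{L}_{\mathrm{final}} = \sum_{i}\Big[
\mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x_i), y_i\big)
+ \mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(\hat{x}_i), y_i\big)
+ \mathcal{R}_{\mathrm{CS}}\big(f_{\theta}(x_i), f_{\theta}(\hat{x}_i)\big)
+ \lambda_{\mathrm{CL}}\, \mathcal{L}_{\mathrm{CL}}(x_i)
\Big]
$$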

Experiments

Combining multiple augmentation methods works better, and the stacking of back-translation + adversarial training achieves the best results; see the original paper for the details of parameter tuning. What is most interesting is the data-efficiency gain brought by augmentation: as the figure below shows, the improvement from CoDA is still significant:

References

  1. CoDA: Contrast-Enhanced and Diversity-Promoting Data Augmentation for Natural Language Understanding

    Openreview.net/pdf?id=Ozk9…
