This is day 6 of my participation in the November Writing Challenge. For event details, see: The Last Writing Challenge of 2021

In real-world projects we often cannot collect much data. To expand the dataset, we can use data augmentation to increase the number of samples. So how should we go about it?

What is data augmentation

Data augmentation, also called data enhancement, is the practice of making a limited amount of data deliver the value of a larger dataset without collecting substantially more data.

Data augmentation can be divided into supervised and unsupervised methods. Supervised augmentation is further divided into single-sample and multi-sample methods. Unsupervised augmentation has two directions: generating new data and learning augmentation policies.

Data augmentation can be applied to audio, image, text, and video data. This article focuses on methods for text and image data.

Text data augmentation methods

For text data, two traditional and effective augmentation methods are adding noise and back translation; both are supervised methods. Adding noise creates new data similar to the original by replacing or deleting words in the original data. Back translation translates the original data into another language and then back into the original language.

  • Back translation (translate twice, e.g. Chinese to English, then English back to Chinese). Because languages differ in word order and phrasing, back translation often yields new data that differs noticeably from the original.
  • EDA (Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks), which performs four operations: synonym replacement, random insertion, random swap, and random deletion.
  1. SR (Synonym Replacement): randomly select n words in the sentence that are not stopwords, and replace each with a synonym chosen at random from a thesaurus.
  2. RI (Random Insertion): randomly select a word that is not a stopword, pick one of its synonyms at random, and insert it at a random position in the sentence. This process can be repeated n times.
  3. RS (Random Swap): randomly select two words in the sentence and swap their positions. This process can be repeated n times.
  4. RD (Random Deletion): remove each word in the sentence independently with probability p.
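
The four EDA operations can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the `STOPWORDS` set and `SYNONYMS` table are hypothetical toy data standing in for a real stopword list and thesaurus, and sentences are assumed to be already tokenized into word lists.

```python
import random

# Hypothetical toy resources; a real setup would use a proper stopword
# list and a thesaurus.
STOPWORDS = {"the", "a", "is", "on", "and"}
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "dog": ["hound", "puppy"],
}

def synonym_replacement(words, n, rng):
    """SR: replace up to n non-stopwords with a randomly chosen synonym."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in STOPWORDS and w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    """RI: n times, insert a synonym of a random non-stopword at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w not in STOPWORDS and w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out

def random_swap(words, n, rng):
    """RS: n times, pick two distinct positions and swap the words."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """RD: drop each word independently with probability p (keep at least one)."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

rng = random.Random(0)
sentence = "the quick dog is happy".split()
print(synonym_replacement(sentence, 2, rng))
print(random_insertion(sentence, 1, rng))
print(random_swap(sentence, 1, rng))
print(random_deletion(sentence, 0.2, rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing models trained on augmented data.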

In addition to traditional data augmentation, we can also use deep learning techniques such as MixMatch, a semi-supervised approach. (Semi-supervised learning was proposed to make better use of unlabeled data and to reduce dependence on large labeled datasets; it has proven to be a powerful learning paradigm.)

  • MixMatch guesses low-entropy labels for augmented unlabeled samples, then mixes the labeled and unlabeled data using MixUp.
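
The two ingredients named above, low-entropy label guessing and MixUp, can be sketched as follows. This is a simplified illustration in plain Python rather than a full MixMatch training loop: the function names, the default `temperature=0.5` and `alpha=0.75`, and the use of plain lists for features and label distributions are assumptions made for the example.

```python
import random

def sharpen(probs, temperature=0.5):
    """Lower the entropy of a label distribution (temperature < 1 sharpens it)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def guess_label(predictions, temperature=0.5):
    """Average the model's predictions over several augmented copies of one
    unlabeled sample, then sharpen the average into a low-entropy guess."""
    k = len(predictions)
    mean = [sum(col) / k for col in zip(*predictions)]
    return sharpen(mean, temperature)

def mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """Convex-combine two (features, label distribution) pairs.
    Taking max(lam, 1 - lam) keeps the result closer to the first pair."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Predictions of a hypothetical model on two augmentations of one unlabeled sample.
guessed = guess_label([[0.6, 0.4], [0.8, 0.2]])
# Mix a labeled sample (one-hot label) with the unlabeled sample and its guess.
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], guessed)
print(guessed, mixed_x, mixed_y, lam)
```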

Traditional data augmentation methods help, but mainly when the dataset is small. For deep learning models that need large amounts of training data, their effect is limited. Unsupervised Data Augmentation (UDA) offers a way to address the shortage of labeled data at scale.

Besides ordinary data augmentation, the other key ingredient of the MixMatch algorithm is MixUp augmentation. The success of UDA, in turn, comes from using augmentation algorithms targeted at the specific task.

  • UDA applies different augmentation methods to different tasks, producing more effective data than conventional noise such as Gaussian noise and dropout noise. The noise it produces is valid, realistic, and diverse.
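
The article does not spell out how the augmented unlabeled data is used during training. As I understand the UDA approach, the model is encouraged to give consistent predictions for an unlabeled sample and its augmented version, measured with a KL divergence. The sketch below illustrates that idea under stated assumptions: `model` and `augment` are hypothetical callables (a function returning a probability list, and a task-specific augmentation such as back translation), not part of any real library.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(model, unlabeled_batch, augment):
    """Average KL divergence between predictions on each original sample
    and on its augmented version. No labels are needed."""
    total = 0.0
    for x in unlabeled_batch:
        p_orig = model(x)           # treated as the fixed target
        p_aug = model(augment(x))   # prediction on the noised input
        total += kl_divergence(p_orig, p_aug)
    return total / len(unlabeled_batch)
```

In training, a term like this would be added to the ordinary supervised loss on the labeled data.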

In addition, task-specific, performance-oriented augmentation strategies can learn to supply training signals that are missing or underrepresented in the original labeled set (for example, color augmentation for image data).

Image data augmentation methods

For image data, the methods we usually adopt are as follows:

  1. Random cropping. Crop part of the original image, such as a corner, the center, or the upper or lower part, but do not make the crop too small.
  2. Flip or mirror the original image, either horizontally or vertically.
  3. Rotate the original image. Rotating by different angles yields additional samples.
  4. Adjust the brightness or contrast of the original image: lighten or darken it, or increase or decrease the contrast.
  5. Adjust the chroma of the original image by changing the ratio of the R, G, and B color components.
  6. Adjust the saturation of the image. Saturation is the purity of a color: the higher the purity, the more vivid the color appears; the lower the purity, the duller it appears.
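
The six operations above can be sketched on a tiny in-memory image. To stay dependency-free, this example assumes an image is a list of rows of `(r, g, b)` tuples with values from 0 to 255; in practice an image library would normally be used instead. The function names are illustrative, rotation is limited to 90 degrees, and the contrast formula (scaling around mid-gray 128) is one simple variant among several.

```python
import colorsys
import random

def _clamp(v):
    """Round to an int and keep it inside the valid 0..255 range."""
    return max(0, min(255, int(round(v))))

def random_crop(img, crop_h, crop_w, rng):
    """1. Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]

def flip_horizontal(img):
    """2. Mirror left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """2. Mirror top to bottom."""
    return [list(row) for row in img[::-1]]

def rotate90(img):
    """3. Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def adjust_brightness(img, factor):
    """4. factor > 1 lightens, factor < 1 darkens."""
    return [[tuple(_clamp(c * factor) for c in px) for px in row] for row in img]

def adjust_contrast(img, factor):
    """4. Scale each channel's distance from mid-gray (128)."""
    return [[tuple(_clamp(128 + (c - 128) * factor) for c in px) for px in row]
            for row in img]

def adjust_channels(img, r_gain, g_gain, b_gain):
    """5. Change the ratio of the R, G, and B components."""
    gains = (r_gain, g_gain, b_gain)
    return [[tuple(_clamp(c * g) for c, g in zip(px, gains)) for px in row]
            for row in img]

def adjust_saturation(img, factor):
    """6. Scale the S channel in HSV space (0 gives a colorless image)."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            r2, g2, b2 = colorsys.hsv_to_rgb(h, min(1.0, s * factor), v)
            new_row.append((_clamp(r2 * 255), _clamp(g2 * 255), _clamp(b2 * 255)))
        out.append(new_row)
    return out

image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (100, 100, 100)]]
rng = random.Random(0)
augmented = [random_crop(image, 1, 2, rng), flip_horizontal(image),
             rotate90(image), adjust_brightness(image, 1.2),
             adjust_channels(image, 1.0, 0.9, 1.1), adjust_saturation(image, 0.5)]
print(len(augmented), "augmented samples from one original image")
```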

In addition, we can apply Gaussian blur, sharpening, added noise, conversion to grayscale, and other methods.
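
A few of these extra transforms can be sketched in the same dependency-free style, again assuming an image is a list of rows of `(r, g, b)` tuples in the range 0 to 255. The function names are illustrative, and the 3x3 mean filter is a simple stand-in for a true Gaussian blur, not an implementation of it.

```python
import random

def to_grayscale(img):
    """Luma-weighted grayscale, kept as three equal channels."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            y = int(round(0.299 * r + 0.587 * g + 0.114 * b))
            new_row.append((y, y, y))
        out.append(new_row)
    return out

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every channel, clamped to 0..255."""
    def noisy(c):
        return max(0, min(255, int(round(c + rng.gauss(0.0, sigma)))))
    return [[tuple(noisy(c) for c in px) for px in row] for row in img]

def box_blur(img):
    """3x3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neigh = [img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(tuple(int(round(sum(p[ch] for p in neigh) / len(neigh)))
                             for ch in range(3)))
        out.append(row)
    return out

image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (100, 100, 100)]]
print(to_grayscale(image))
print(add_gaussian_noise(image, 10.0, random.Random(0)))
print(box_blur(image))
```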

Toolkits

For Chinese text data, you can use Textda, a Chinese text data augmentation toolkit.

There is also EDA_NLP, which implements simple data augmentation techniques for improving performance on text classification tasks.

In addition, other open-source tools such as AugLy can be used to augment data. AugLy is an open-source data augmentation Python library from Facebook that currently supports four modalities: audio, image, text, and video. On the one hand, it can augment data based on real data; on the other, it can help detect similar content and eliminate the interference caused by duplicate data.
