Preface

The emergence of MoCo in 2019 set off a boom in visual self-supervised learning. SimCLR, BYOL, SwAV, and other mainstream self-supervised learning algorithms were then proposed one after another, and the field showed unprecedented prosperity. At the end of 2021, MAE took self-supervised learning to yet another level. But behind this prosperity, self-supervised learning has gone through a long process of iteration and development.

Self-supervised learning has a very strong motivation: at present, most neural networks are still trained under the supervised paradigm, which requires a large amount of annotated data and is therefore time-consuming and laborious. Self-supervised learning was proposed to break this dependence on manual annotation and train networks efficiently even when no labels are available. Since the training of neural networks needs a task to drive it, the core of self-supervised learning is to construct tasks that are genuinely conducive to model learning. At present, the methods for constructing such tasks can be roughly divided into three categories:

  1. Pretext Task

  2. Contrastive Learning

  3. Masked Image Modeling

This article mainly introduces how pretext-task-based self-supervised learning algorithms construct tasks that drive the model to learn efficiently. We will analyze four representative papers to understand pretext-task-based self-supervised learning algorithms.

Relative Location

Paper link: arxiv.org/abs/1505.05…

Context information contains a great deal of supervisory signal and is widely used in natural language processing for large-scale self-supervised training. Similarly, in vision, the context information within an image can be used for self-supervised training. The author randomly selects two patches from an image and asks the model to predict the position of one patch relative to the other. The author argues that the model can solve this relative-position prediction task well only if it has a good understanding of the scenes and objects in the image and of how its different parts relate to each other.

To do this, one of the red boxes is picked at random, as shown above (left), and the model is asked to predict its position relative to the blue box. The specific network structure is shown above (right): the features of the two patches are first extracted by two separate branches, and then fused at the end to make the prediction.
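
To make this late-fusion design concrete, here is a minimal PyTorch-style sketch, assuming a shared backbone for the two branches; the backbone, feature dimension, and layer sizes below are placeholders, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class RelativeLocNet(nn.Module):
    """Late-fusion network: both patches go through the same backbone,
    and the two feature vectors are concatenated and classified into
    one of the 8 possible relative positions."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Placeholder backbone; the paper uses an AlexNet-style CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(2 * feat_dim, 8)  # 8 neighbor positions

    def forward(self, patch_a, patch_b):
        feat_a = self.backbone(patch_a)  # features of the reference patch
        feat_b = self.backbone(patch_b)  # features of the query patch
        return self.classifier(torch.cat([feat_a, feat_b], dim=1))


# usage sketch:
# logits = RelativeLocNet()(center_patches, neighbor_patches)
# loss = nn.CrossEntropyLoss()(logits, relative_position_labels)
```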

Predicting the relative position of two patches is simple and reasonable, but avoiding trivial shortcut solutions is a key consideration in designing the model. After long exploration, the author found that the following two situations tend to lead to such trivial solutions:

  1. Low-level features such as boundary patterns and texture continuity allow the model to directly predict the relative location of two patches without understanding their contents.

  2. The effect of chromatic aberration. Because light of different colors has different wavelengths, the convex lens of the camera projects light of different colors to slightly different positions; in general, green light lands closer to the image center than blue and red light. The model can therefore learn each patch's position relative to the lens (image) center and use this cue to infer the relative position of two patches without understanding their contents.

The author proposes a solution for each of these two situations:

  1. A gap is left between the sampled patches (they are not sampled contiguously), and random jitter is further introduced so that the exact gap varies.

  2. To address chromatic aberration, the author randomly drops two of the three RGB channels in each patch and fills the dropped channels with Gaussian noise.
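
For illustration, here is a minimal NumPy sketch of both countermeasures: sampling a patch pair with a gap and random jitter, and randomly dropping two color channels. The patch size, gap, jitter range, and noise scale are assumed, illustrative values, not necessarily the paper's exact settings:

```python
import numpy as np


def sample_patch_pair(img, patch=96, gap=48, jitter=7, rng=np.random):
    """Sample a reference patch and one of its 8 neighbors from an H x W x 3
    image, leaving a gap between patches and jittering each patch position.
    The sizes are illustrative; the image must be large enough for a 3x3 grid."""
    h, w, _ = img.shape
    step = patch + gap  # grid spacing between neighboring patches
    # top-left corner of the reference (center) patch
    y0 = rng.randint(step, h - step - patch + 1)
    x0 = rng.randint(step, w - step - patch + 1)
    label = rng.randint(8)  # which of the 8 neighbors to sample
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    dy, dx = offsets[label]
    y1, x1 = y0 + dy * step, x0 + dx * step
    # jitter each patch position independently, then keep it inside the image
    y0 = np.clip(y0 + rng.randint(-jitter, jitter + 1), 0, h - patch)
    x0 = np.clip(x0 + rng.randint(-jitter, jitter + 1), 0, w - patch)
    y1 = np.clip(y1 + rng.randint(-jitter, jitter + 1), 0, h - patch)
    x1 = np.clip(x1 + rng.randint(-jitter, jitter + 1), 0, w - patch)
    center = img[y0:y0 + patch, x0:x0 + patch].copy()
    neighbor = img[y1:y1 + patch, x1:x1 + patch].copy()
    return center, neighbor, label


def drop_color_channels(patch, noise_std=10.0, rng=np.random):
    """Keep one RGB channel at random and replace the other two with Gaussian
    noise, suppressing chromatic-aberration cues (noise scale is illustrative)."""
    keep = rng.randint(3)
    out = patch.astype(np.float32)
    for c in range(3):
        if c != keep:
            out[..., c] = rng.normal(out[..., c].mean(), noise_std, out[..., c].shape)
    return out
```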

Through the above design, the model achieves good results.

Colorization

Paper link: arxiv.org/abs/1603.08…

The author constructs an image-colorization task to make the model learn the semantic information of images: the author believes the model can do this task well only if it understands the semantics of the various objects and scenes in the image and how they relate to each other.

As shown in the figure above, the author converts the image into the CIE Lab color space, feeds the L channel into the model, and asks the model to predict the values of the a and b channels. However, the design of the loss function requires careful thought.

If the a and b channels are predicted directly by regression, the final coloring effect is not ideal: the resulting images look desaturated and grayish.

Because each object in an image can plausibly take multiple colors, the author instead casts colorization as a per-pixel classification problem over quantized color bins, and further proposes class rebalancing to reweight the contribution of different pixels to training so that rare colors are not ignored.
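
A minimal sketch of this classification-based loss, assuming the ab plane has already been quantized into K bins; the hard-label cross-entropy here is a simplification of the paper's soft-encoding scheme, and the weights shown are placeholders:

```python
import torch
import torch.nn.functional as F


def colorization_loss(ab_logits, ab_target_bins, class_weights):
    """Colorization cast as per-pixel classification over quantized ab color bins.
    This simplifies the paper's soft-encoded targets to hard bin labels.

    ab_logits:      (N, K, H, W) predicted scores over K color bins
    ab_target_bins: (N, H, W)    ground-truth ab bin index for each pixel
    class_weights:  (K,)         rebalancing weights (rare colors get larger weight)
    """
    return F.cross_entropy(ab_logits, ab_target_bins, weight=class_weights)


# Example with dummy tensors; K = 313 is the number of in-gamut ab bins used
# in the paper, but any K works here. Real rebalancing weights would be
# derived from the empirical color distribution of the training set.
K = 313
logits = torch.randn(2, K, 56, 56)
targets = torch.randint(0, K, (2, 56, 56))
weights = torch.ones(K)  # placeholder weights
loss = colorization_loss(logits, targets, weights)
```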

Context Encoders

Paper link: arxiv.org/abs/1604.07…

Similar to Colorization above, Context Encoders also have the model reconstruct the original image so that it learns the semantic information of the image. The difference is that Context Encoders reconstruct the image along the spatial dimension, whereas Colorization reconstructs it along the channel dimension. The author argues that the model can complete this reconstruction task well only if it has a good understanding of the semantics of the whole image.

The specific approach is shown in the figure above: the author masks part of the image and feeds the masked image into the network; the masked region then serves as the supervision target for the predicted region, and an L2 loss drives model learning. This completes the basic pipeline of self-supervised learning by reconstructing images. However, the author found that using only the L2 loss easily makes the reconstructions blurry and fails to recover the high-frequency details of the image, because under an L2 loss it is easiest for the model to regress toward the mean of the plausible outputs.

To address this problem, the author further adds an adversarial loss, which makes the reconstructed images sharper and improves the model's understanding of image semantics.
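
A minimal sketch of the combined objective under these assumptions: a generator that inpaints the masked region and a discriminator that judges whether the result looks real. The loss weighting and the exact GAN formulation below are placeholders rather than the paper's precise settings:

```python
import torch
import torch.nn.functional as F


def context_encoder_losses(pred, target, mask, disc_fake, disc_real, adv_weight=0.001):
    """Joint objective sketch: masked-region L2 reconstruction + adversarial loss.

    pred, target: (N, 3, H, W) reconstructed and original images
    mask:         (N, 1, H, W) 1 where pixels were masked out, 0 elsewhere
    disc_fake:    discriminator logits on reconstructions
    disc_real:    discriminator logits on real images
    adv_weight is an assumed weighting factor, not necessarily the paper's value.
    """
    # L2 reconstruction loss computed only over the masked region
    rec_loss = ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    # generator side of the adversarial loss: reconstructions should look real
    gen_adv = F.binary_cross_entropy_with_logits(disc_fake, torch.ones_like(disc_fake))
    gen_loss = rec_loss + adv_weight * gen_adv
    # discriminator loss: real vs. reconstructed (in a real training loop,
    # compute disc_fake from detached reconstructions for this term)
    disc_loss = (F.binary_cross_entropy_with_logits(disc_real, torch.ones_like(disc_real))
                 + F.binary_cross_entropy_with_logits(disc_fake, torch.zeros_like(disc_fake)))
    return gen_loss, disc_loss
```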

Finally, the author explores how to mask the images and finds that masking a fixed central region does not generalize particularly well, whereas random-block and random-region masking achieve good and similar results.
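
For illustration, here is a sketch of one possible random-block mask generator; the block count and size range are arbitrary choices, not the paper's settings:

```python
import numpy as np


def random_block_mask(h, w, num_blocks=5, min_size=16, max_size=48, rng=np.random):
    """Return an (h, w) binary mask with several randomly placed rectangular
    blocks set to 1 (masked). Central-region masking would instead mask a
    single fixed square at the image center."""
    mask = np.zeros((h, w), dtype=np.float32)
    for _ in range(num_blocks):
        bh = rng.randint(min_size, max_size + 1)
        bw = rng.randint(min_size, max_size + 1)
        y = rng.randint(0, h - bh + 1)
        x = rng.randint(0, w - bw + 1)
        mask[y:y + bh, x:x + bw] = 1.0
    return mask
```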

Rotation Prediction

Paper link: arxiv.org/abs/1803.07…

The author hopes to make the model understand the semantic information of images by having it recognize the rotation angle applied to an image. The author argues that the model can complete this rotation-recognition task only if it can identify and extract the main objects in the image and understand their semantics and their relationship to the rest of the image. The specific procedure is as follows:

First, a set of rotation operations is defined and the rotated images are fed into the network. The model is then trained as a classifier that outputs which rotation was applied to the current image. Through such a simple task, the author finds that the model learns to extract the main objects in the picture and to understand its semantic information. The author further analyzes the reasons for the success of the method:

1) Unlike many other transformations, rotation leaves no obvious low-level artifacts that the model could exploit as a shortcut for recognizing the rotation; therefore, the model must rely on the semantic content of the image.

2) Identifying the rotation angle is a very well-defined task: since objects in images are usually upright, the model can in principle identify unambiguously how much an image has been rotated, unless, of course, the object is rotationally symmetric, such as a circle.
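
As a concrete illustration of this setup, here is a minimal PyTorch sketch that builds the four rotated copies (0°, 90°, 180°, 270°, as in the paper) of each image together with the corresponding classification targets; the backbone and training loop are omitted, and the name `backbone_and_head` is a placeholder:

```python
import torch


def make_rotation_batch(images):
    """images: (N, C, H, W). Return 4N rotated copies and their labels
    (0, 1, 2, 3 for rotations of 0, 90, 180 and 270 degrees)."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels


# training step sketch (backbone_and_head is any 4-way image classifier):
# rotated, labels = make_rotation_batch(batch)
# logits = backbone_and_head(rotated)
# loss = torch.nn.functional.cross_entropy(logits, labels)
```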

Based on the above four examples, we can see that pretext-task-based self-supervised learning algorithms share the following two characteristics:

1) A well-defined task, such as identifying rotation angles and relative positions.

2) Reasonable constraints, i.e., measures introduced to prevent the model from finding trivial shortcut solutions.

That's all for this installment on pretext-task-based self-supervised learning algorithms; we hope these four classic examples give readers a brief introduction to this type of algorithm. MMSelfSup currently covers most mainstream self-supervised learning algorithms. You are welcome to try it out and contribute PRs.