Hello, this is Duibai.

After a two-year hiatus, Kaiming He published a paper proposing a new paradigm for visual self-supervised learning:

the masked autoencoder, MAE, which opened a path toward large vision models.

This time, doctoral students at Peking University have proposed a new method, CAE, which demonstrates stronger generalization on downstream tasks than MAE.

So what exactly is this study about?

Since Kaiming He proposed MAE, Masked Image Modeling (MIM), a self-supervised representation learning approach, has attracted more and more attention.

Its main idea is to divide the input image into patches, randomly mask a portion of them, and then predict the masked regions.

The prediction target can be token IDs (as in Microsoft's BEiT) or RGB values (as in MAE).
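As a rough illustration (not the paper's code), here is a minimal PyTorch sketch of the masking step; the patch size and mask ratio are illustrative assumptions:

```python
import torch

def random_mask(images, patch_size=16, mask_ratio=0.4):
    """Split images into patches and randomly choose which to mask.

    images: [B, 3, H, W]; returns flattened patches plus a boolean
    mask (True = masked) over the patch grid. Illustrative only.
    """
    B, C, H, W = images.shape
    # Unfold into non-overlapping patches: [B, N, C * patch_size**2]
    patches = images.unfold(2, patch_size, patch_size) \
                    .unfold(3, patch_size, patch_size)
    patches = patches.reshape(B, C, -1, patch_size, patch_size) \
                     .permute(0, 2, 1, 3, 4).flatten(2)
    N = patches.shape[1]                # e.g. (224 // 16) ** 2 = 196
    num_masked = int(N * mask_ratio)
    # Independent random permutation per image; the first
    # `num_masked` indices become the masked positions
    ids = torch.rand(B, N).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, ids[:, :num_masked], True)
    return patches, mask
```

The model is then trained to predict the chosen target (token IDs or RGB values) for exactly the positions where the mask is True.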

With MIM, an encoder can learn good representations and generalize well to downstream tasks.

Masked modeling has long been common in NLP, and with the development of ViT it has begun to make progress in vision as well.

The team behind CAE argues that two recent representative works, BEiT and MAE, do not fully tap the encoder's potential and limit the quality of the representations learned in pre-training.

To put it simply, in BEiT only part of the encoder is responsible for representation learning, while the rest is occupied with solving the pretext task.

MAE has the opposite problem: its decoder also does some representation learning, which may teach the encoder to be "lazy."

Against this background, the team proposed the Context Autoencoder, CAE for short. Its main idea is to separate "representation learning" from the "pretext task."

During pre-training, the encoder is responsible only for representation learning, while the decoder is responsible only for solving the pretext task; this division of labor pushes the encoder's representation ability as far as possible.

CAE consists of four parts.

1. The encoder is a ViT model that learns representations of the image's visible patches, extracting their features Zv.

2. A latent contextual regressor predicts the representations Zm of the masked patches from Zv.

3. The decoder takes Zm and the corresponding positional encodings as input and predicts properties of the masked patches, such as RGB values or token IDs, from Zm alone. Zv is not updated in this process, so the representation learning task rests entirely with the encoder.

4. Latent representation alignment constrains Zm, encouraging the output of the latent contextual regressor to lie in the same space as the encoder's output. Specifically, the masked patches of the image are also fed through the encoder (without gradient backpropagation), and the representations obtained serve as the learning targets for Zm.
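To make the four-part pipeline concrete, here is a heavily simplified PyTorch sketch of one pre-training step. The module names, layer counts, and dimensions are my assumptions, not the authors' code; in particular, the paper's regressor uses cross-attention, which is simplified to a plain transformer here:

```python
import torch
import torch.nn as nn

class CAESketch(nn.Module):
    """Simplified CAE skeleton: encoder / latent contextual regressor /
    decoder / latent representation alignment. Illustrative only."""

    def __init__(self, dim=768, vocab_size=8192, num_patches=196):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # 1. Encoder: stands in for the ViT that sees only visible patches
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), 6)
        # 2. Latent contextual regressor: predicts masked representations
        #    Zm from the visible ones Zv
        self.regressor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), 2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # 3. Decoder: maps Zm to the pretext target (token IDs here);
        #    a single linear layer for brevity
        self.decoder = nn.Linear(dim, vocab_size)

    def forward(self, patches, mask):
        # patches: [B, N, 16*16*3], mask: [B, N] bool (True = masked)
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed
        d = x.shape[-1]

        # 1. Encode only the visible patches -> Zv
        zv = self.encoder(x[~mask].reshape(B, -1, d))

        # 2. Regress representations of masked positions -> Zm,
        #    querying with mask tokens plus positional embeddings
        queries = self.mask_token + \
            self.pos_embed.expand(B, -1, -1)[mask].reshape(B, -1, d)
        zm = self.regressor(torch.cat([zv, queries], dim=1))[:, zv.shape[1]:]

        # 3. Decode Zm into pretext-task predictions; Zv itself is not
        #    refined here, so representation learning stays with the encoder
        logits = self.decoder(zm)

        # 4. Alignment target: run the masked patches through the encoder
        #    with gradients stopped
        with torch.no_grad():
            zm_target = self.encoder(x[mask].reshape(B, -1, d))
        return logits, zm, zm_target
```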

Why does alignment matter? To predict the masked parts, the output of the latent contextual regressor (which is also the decoder's input) must contain good semantic information. The alignment operation encourages the encoder's output to carry such semantics, improving the quality of the encoder's representations.

The paper visualizes the effect of alignment: all patches are fed into the encoder, and the resulting representations are passed directly to the decoder for RGB reconstruction. CAE can reconstruct the original image (first row: originals; second row: reconstructions), indicating that the outputs of the encoder and the latent contextual regressor indeed lie in the same encoding space.

Without the alignment constraint, the output looks like this… well, it's all gibberish.

An encoder trained this way also learns worse representations, and downstream task results suffer accordingly.

The loss function consists of two parts: one supervises the decoder's predictions with a cross-entropy loss, and the other supervises the alignment with an MSE loss.
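In code, and assuming the `logits`, `zm`, and `zm_target` outputs of the sketch above plus ground-truth token IDs for the masked patches, the combined loss could look like this (the weighting factor is my assumption):

```python
import torch.nn.functional as F

def cae_loss(logits, token_targets, zm, zm_target, align_weight=1.0):
    """Two-part CAE loss: cross-entropy on the decoder's token
    predictions plus MSE aligning Zm with the encoder's output.
    `align_weight` is an illustrative hyperparameter."""
    # logits: [B, M, vocab], token_targets: [B, M] (long)
    pred_loss = F.cross_entropy(logits.transpose(1, 2), token_targets)
    align_loss = F.mse_loss(zm, zm_target)
    return pred_loss + align_weight * align_loss
```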

In addition, the paper further verifies that MIM methods, represented by CAE, are better suited to downstream tasks than contrastive learning methods such as MoCo v3 and DINO.

Analyzing the properties of the random cropping operation, the paper observes that a random crop has a high probability of containing the center region of the image.

In ImageNet-1K, the center region usually contains an object from the 1000-class label set (figure below). As a result, contrastive learning methods mainly extract features of the image's main subject.

MIM methods, by contrast, learn features for every patch, including the image's background regions rather than just the main object, which makes the learned representations better suited to downstream detection and segmentation tasks.

The attention maps of CAE and MoCo v3 are visualized below: red means higher attention, blue lower. The first row shows the originals, the second MoCo v3, the third CAE. MoCo v3 responds strongly mainly in the image's subject region, while CAE takes nearly all patches into account.
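For reference, one common way to produce such heatmaps is sketched below. This is a generic recipe, not the authors' script; it assumes you have already extracted last-layer attention weights from a ViT whose token 0 is a CLS token:

```python
import torch
import torch.nn.functional as F

def attention_heatmap(attn, grid=14, image_size=224):
    """Turn ViT attention weights into an image-sized heatmap.

    attn: [heads, tokens, tokens] attention from the last block.
    Averages heads, takes CLS-to-patch attention, and upsamples
    to the image resolution. Assumes tokens = 1 + grid * grid.
    """
    cls_to_patches = attn.mean(0)[0, 1:]          # [grid * grid]
    heat = cls_to_patches.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    # Normalize to [0, 1] so red/blue colormaps are comparable
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return heat[0, 0]                             # [H, W]
```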

Experimental results

The research team ran experiments on ImageNet-1K with ViT-Small and ViT-Base. Input images had a resolution of 224×224, and each image was divided into a 14×14 grid of 16×16 patches.

Each time, 75 of the patches are randomly masked, and the remaining patches stay visible.
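The patch bookkeeping works out as follows (a trivial check, with the mask count taken from the setting above):

```python
image_size, patch_size = 224, 16
grid = image_size // patch_size          # 14 patches per side
num_patches = grid * grid                # 196 patches total
num_masked = 75                          # per the setting above
num_visible = num_patches - num_masked   # 121 visible patches
print(grid, num_patches, num_visible)    # 14 196 121
```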

Following BEiT, the paper uses the DALL-E tokenizer to tokenize the input image and obtain the prediction targets.
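Obtaining such targets looks roughly like the following, based on OpenAI's publicly released `dall_e` package (URLs and preprocessing as in their repo; treat the exact calls and the 112×112 resize, which BEiT uses so the token grid matches the 14×14 patch grid, as assumptions):

```python
import torch
from dall_e import load_model, map_pixels  # OpenAI's DALL-E tokenizer

dev = torch.device("cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)

x = torch.rand(1, 3, 112, 112)            # image tensor in [0, 1]
z_logits = enc(map_pixels(x))             # [1, 8192, 14, 14] logits
tokens = torch.argmax(z_logits, dim=1)    # [1, 14, 14] token IDs
```

The resulting token IDs serve as the cross-entropy targets for the masked positions.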

The final results show stronger representations on semantic segmentation compared with other MIM methods such as MAE and BEiT, as well as with contrastive learning and supervised pre-training.

The same holds for object detection and instance segmentation.

Finally, you are welcome to follow my WeChat official account, Duibainotes, which tracks frontier machine learning topics such as NLP, recommender systems, and contrastive learning. I also share my entrepreneurial experience and reflections on life there. If you'd like to discuss further, feel free to add me on WeChat to talk about technical problems. Thank you!