MLP-Mixer: An All-MLP Architecture for Vision. The Mixer model proposed in this paper uses nothing but the simplest MLP structure, yet reaches SOTA-level performance on ImageNet. An MLP here is just two FC layers, which makes you wonder:

FC is all you need, neither Conv nor Attention!

With sufficient data and compute, inductive biases built into a model may actually become a constraint, and a simpler, more direct model can do just as well. The following result says it all: when the amount of training data is small, the ranking is BiT > ViT > Mixer, but as the amount of data increases, the performance of the three becomes almost identical.

In terms of network architecture, MLP-Mixer and ViT are very similar:

  1. The preprocessing divides the image into patches and obtains a series of patch embeddings through a linear projection (see the sketch after this list).

  2. The main body of the network is an isotropic design, consisting of N consecutive, identical layers.
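As a concrete illustration of step 1, here is a minimal PyTorch sketch of the per-patch linear projection (the module and parameter names are my own, not from the paper):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to an embedding vector with a single linear layer."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=512):
        super().__init__()
        assert image_size % patch_size == 0
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        self.proj = nn.Linear(in_channels * patch_size * patch_size, dim)

    def forward(self, x):                      # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        # (B, C, H/p, p, W/p, p) -> (B, H/p, W/p, C, p, p) -> (B, S, C*p*p)
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * p * p)
        return self.proj(x)                    # (B, S, dim), S = num_patches

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 512])
```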

The difference lies mainly in the layers themselves: ViT adopts the Transformer layer, while MLP-Mixer adopts the Mixer layer. The Mixer layer is very simple and includes only two MLPs (plus skip connections); a code sketch follows the two blocks below:

(1) Token-mixing MLP block: the input is a feature table of shape S×C (S tokens, each with C channels); the MLP operates along the token dimension, i.e., the same MLP is applied to every channel across all S tokens;

(2) Channel-mixing MLP block: the input is again the S×C feature table; the MLP operates along the channel dimension, i.e., the same MLP is applied to every token across all C channels.
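Putting the two blocks together, below is a minimal PyTorch sketch of one Mixer layer (hidden-dimension names roughly follow the paper's tokens_mlp_dim / channels_mlp_dim; the module structure is my own illustrative sketch, not the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MlpBlock(nn.Module):
    """Two fully-connected layers with a GELU in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MixerLayer(nn.Module):
    """One Mixer layer: token-mixing MLP + channel-mixing MLP,
    each preceded by LayerNorm and wrapped in a skip connection."""
    def __init__(self, num_tokens, dim, tokens_mlp_dim, channels_mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mix = MlpBlock(num_tokens, tokens_mlp_dim)    # mixes along S
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mix = MlpBlock(dim, channels_mlp_dim)       # mixes along C

    def forward(self, x):                      # x: (B, S, C)
        # token mixing: transpose to (B, C, S), apply the MLP over tokens, transpose back
        y = self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + y
        # channel mixing: plain MLP over the feature dimension of each token
        return x + self.channel_mix(self.norm2(x))

x = torch.randn(2, 196, 512)                     # 196 tokens, 512 channels
print(MixerLayer(196, 512, 256, 2048)(x).shape)  # torch.Size([2, 196, 512])
```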

For images, most networks do nothing more than mix features in two ways:

(i) at a given spatial location (i.e., across channels);

(ii) between different spatial locations.

For example, an ordinary convolution performs (i) and (ii) simultaneously. More specifically, a 1×1 convolution only does (i), while a single-channel depth-wise conv only does (ii). The Transformer layer is more involved: the projection layers implement (i), self-attention implements (ii) (and in fact also touches (i)), and the FFN implements (i). For the Mixer layer, the token-mixing MLP block implements (ii) and the channel-mixing MLP block implements (i), which is a neat way to explain the design.
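A quick toy check of this reading (my own example, not from the paper): perturbing the input shows that a 1×1 convolution only mixes channels at each location, while a depth-wise convolution only mixes spatial locations within each channel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8, 14, 14)           # a small feature map: 8 channels, 14x14 grid

# (i) channel mixing only: a 1x1 convolution. Perturb one spatial location and
# check that the output changes nowhere else.
conv1x1 = nn.Conv2d(8, 8, kernel_size=1)
x_pert = x.clone()
x_pert[:, :, 3, 5] += 1.0
diff = (conv1x1(x_pert) - conv1x1(x)).abs()
other = diff.clone(); other[0, :, 3, 5] = 0
print(diff[0, :, 3, 5].max().item() > 0, other.max().item() == 0.0)   # True True

# (ii) spatial mixing only: a depth-wise conv (groups == channels). Perturb one
# channel and check that no other channel is affected.
dw = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8)
x_pert = x.clone()
x_pert[:, 2] += 1.0
diff = (dw(x_pert) - dw(x)).abs()
other = diff.clone(); other[0, 2] = 0
print(diff[0, 2].max().item() > 0, other.max().item() == 0.0)         # True True
```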

Because the token-mixing MLP operates across tokens, it is sensitive to the order of the tokens (it is not permutation-invariant), unlike ViT, whose self-attention is permutation-invariant. Mixer therefore does not need position embeddings. The token-mixing MLP block, however, also means that the network can only accept images of a fixed input size, since its parameters depend on the number of tokens. This may be troublesome for dense prediction tasks such as detection and segmentation, which basically require variable-size inputs.
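The fixed-resolution constraint is easy to see from the shapes involved; the toy snippet below (my own naming, not from the paper) shows that the token-mixing weights have the number of tokens baked in:

```python
import torch
import torch.nn as nn

# Token-mixing weights have shape (hidden, num_tokens): the sequence length is baked in.
num_tokens = 196                        # a 224x224 image with 16x16 patches
token_mixing_fc = nn.Linear(num_tokens, 256)

x = torch.randn(2, 512, num_tokens)     # (batch, channels, tokens) after the transpose
print(token_mixing_fc(x).shape)         # works: torch.Size([2, 512, 256])

x_bigger = torch.randn(2, 512, 576)     # a 384x384 image would give 576 tokens
try:
    token_mixing_fc(x_bigger)
except RuntimeError as e:
    print("shape mismatch:", e)         # fails: the weights expect exactly 196 tokens
```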

The network hyper-parameter design of Mixer is similar to that of ViT, as follows:

After pre-training on different datasets and transferring to other tasks, Mixer's performance compared with ViT and other models is shown below. It can be seen that Mixer comes close to SOTA, and its inference time is basically comparable.

However, this requires pre-training on large datasets such as ImageNet-21k and JFT-300M. When the training data is insufficient, Mixer easily overfits and performs worse than CNNs and ViT. The following table compares Mixer with other models under different settings: for example, when trained only on ImageNet, Mixer-B/16 performs worse than ViT-B/16.

Regarding Mixer, some argued that it actually still contains convolutions. For example, LeCun stated: 1st layer "Per-patch fully-connected" == "conv layer with 16×16 kernels and 16×16 stride", other layers of "MLP-Mixer" == "Conv layer with 1×1 kernels". This is true from an implementation point of view, but can a network with only 1×1 convolution layers really be counted as a CNN? In fact, the paper also explains the connection between Mixer and CNNs from another angle: the channel-mixing MLP is equivalent to a 1×1 convolution, and the token-mixing MLP is equivalent to a single-channel depth-wise convolution whose kernel size equals the full feature-map size. Note that the token-mixing MLP shares its parameters across all channels, whereas a depth-wise convolution uses a different kernel for each channel.
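Both equivalences are easy to verify numerically. The sketch below (my own variable names; the weight copying is only meant to expose the mapping, not how either model is implemented) checks that a per-patch fully-connected layer matches a conv with 16×16 kernel and stride 16, and that a channel-mixing linear layer matches a 1×1 conv:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
img = torch.randn(1, 3, 224, 224)

# Per-patch fully-connected layer == conv with 16x16 kernel and 16x16 stride.
fc = nn.Linear(3 * 16 * 16, 512)
conv = nn.Conv2d(3, 512, kernel_size=16, stride=16)
conv.weight.data = fc.weight.data.view(512, 3, 16, 16)   # same weights, reshaped
conv.bias.data = fc.bias.data

patches = img.unfold(2, 16, 16).unfold(3, 16, 16)         # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, -1)
out_fc = fc(patches)                                      # (1, 196, 512)
out_conv = conv(img).flatten(2).transpose(1, 2)           # (1, 196, 512)
print(torch.allclose(out_fc, out_conv, atol=1e-5))        # True

# Channel-mixing linear layer == 1x1 convolution.
lin = nn.Linear(512, 512)
conv1x1 = nn.Conv2d(512, 512, kernel_size=1)
conv1x1.weight.data = lin.weight.data.view(512, 512, 1, 1)
conv1x1.bias.data = lin.bias.data

tokens = torch.randn(1, 196, 512)
grid = tokens.transpose(1, 2).reshape(1, 512, 14, 14)     # lay tokens out on a grid
out_lin = lin(tokens)
out_conv1x1 = conv1x1(grid).flatten(2).transpose(1, 2)
print(torch.allclose(out_lin, out_conv1x1, atol=1e-5))    # True
```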

Shortly afterwards came a simple paper from Oxford University, "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", whose model structure is essentially the same as Mixer's; the paper just uses the term FFN instead of MLP, and its experiments are not as thorough as Google's:

There is also a short description of this network in the paper, which is basically the same as what we have described above:

In essence, ViT, ResNet-style CNNs, and MLP-Mixer are all residual networks, which is probably the most important thing they share.

Finally, Google may have opened up a new research direction that could spawn plenty of follow-up papers, just as ViT did.

Recommended reading

CPVT: Implicitly encoding positional information with a convolution

DETR: Object detection based on Transformers

MoCo V3: I’m not who you think I am!

The application of Transformer in semantic segmentation

ViT: Transformer is All You Need!

PVT: A Pyramid Vision Transformer backbone for dense prediction tasks!

FixRes: Beating SOTA on ImageNet twice

How can Transformer break into CV and kill CNN?

Try MoCo instead of ImageNet pre-training!