• Article reposted from the WeChat official account "Alchemy of Machine Learning"
  • Author: Brother Lian Dan (feel free to get in touch and improve together)
  • Contact: WeChat CYX645016617
  • Title: "MLP-Mixer: An All-MLP Architecture for Vision"
  • Paper link: arxiv.org/pdf/2105.01…

"Preface": I have been busy with various things lately, so updates have been slow, but I found some time to write. This article is very simple and takes only about 5 minutes to read.

Main text

The paper proposes the MLP-Mixer architecture, referred to below simply as Mixer. It is a competitive yet conceptually and technically simple architecture that uses neither convolution nor self-attention.

Similar to the Transformer, the input to the Mixer model is a sequence of image patches after a linear projection, which is simply called the embedding; it is a feature of shape patches × channels. Each element of the embedded sequence is called a token.
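As a concrete illustration (not from the article; the PyTorch framing, image size, and names here are my own assumptions), the per-patch linear projection can be written as a convolution whose kernel size and stride both equal the patch size:

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes: a 96x96 RGB image cut into 32x32 patches,
# each projected to a 128-dimensional embedding (so 3x3 = 9 patches).
patch_size, embed_dim = 32, 128

# A per-patch fully connected layer is equivalent to a Conv2d whose
# kernel size and stride both equal the patch size.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 96, 96)               # (batch, channels, height, width)
tokens = patch_embed(x)                     # (1, 128, 3, 3)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 9, 128): patches x channels
```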

The overall structure of Mixer is shown below:

Mixer uses two types of MLP layers:

  • Channel-mixing MLPs: allow communication between features in different channels;
  • Token-mixing MLPs: allow communication between different spatial positions (patches);
  • The two types of MLP layers are applied in alternation.
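A minimal sketch of what "mixing" along each axis means, assuming the toy shapes used later in this post (9 patches, 128 channels) and arbitrary hidden widths; the PyTorch code and names below are my assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

def mlp(dim, hidden):
    # Both MLP types share the same structure: two fully connected layers with a GELU.
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

tokens = torch.randn(1, 9, 128)        # (batch, patches, channels)

token_mixing = mlp(9, 64)              # mixes information across the 9 spatial positions
channel_mixing = mlp(128, 512)         # mixes information across the 128 channels

# Token mixing operates on the patch axis, so the tensor is transposed first.
y = token_mixing(tokens.transpose(1, 2)).transpose(1, 2)   # (1, 9, 128)
# Channel mixing operates directly on the channel axis.
z = channel_mixing(tokens)                                  # (1, 9, 128)
```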

"Reading the figure"

  • As you can see in the caption, I think the "per-patch fully connected" layer is the embedding: for example, a 32×32×3 color patch is mapped by a fully connected layer to a 128-dimensional vector.
  • The Mixer Layer is the main innovation proposed in the paper. It consists of a token-mixing MLP and a channel-mixing MLP, each composed of two fully connected layers and a GELU activation.
  • Now look at the upper part of the figure, which shows the details of the Mixer Layer. Suppose an image is divided into 9 patches, and each patch becomes a 128-dimensional vector after embedding; stacking the embeddings gives a 9×128 matrix (the steps below are put together in the code sketch after this list):
    1. This matrix first goes through a LayerNorm, which normalizes along the 128-dimensional channel axis;
    2. The matrix is then transposed to 128×9;
    3. The token-mixing MLP is applied; it is the token-mixing one because it operates along the patch dimension of 9;
    4. The result is transposed back to 9×128, followed by another LayerNorm;
    5. The channel-mixing MLP is then applied along the channel dimension of 128;
    6. Two skip connections are added, one around each of the two MLPs.
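Putting the six steps above together, here is a minimal sketch of one Mixer layer in PyTorch (the class name, hidden widths, and framework choice are my assumptions; the shapes follow the 9 × 128 toy example above):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    # num_patches=9 and dim=128 follow the toy numbers in the walkthrough;
    # the hidden widths are arbitrary choices for illustration.
    def __init__(self, num_patches=9, dim=128, token_hidden=64, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                         # x: (batch, 9, 128)
        # Steps 1-4: LayerNorm, transpose to (batch, 128, 9),
        # token-mixing MLP over the patch axis, transpose back.
        y = self.norm1(x).transpose(1, 2)          # (batch, 128, 9)
        y = self.token_mlp(y).transpose(1, 2)      # (batch, 9, 128)
        x = x + y                                  # first skip connection (step 6)
        # Step 5: second LayerNorm, channel-mixing MLP over the channel axis,
        # second skip connection (step 6).
        x = x + self.channel_mlp(self.norm2(x))
        return x

layer = MixerLayer()
out = layer(torch.randn(2, 9, 128))                # shape preserved: (2, 9, 128)
```

Stacking several such layers on top of the patch embedding, and adding a classification head, would give the full Mixer model.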

"As you can see, the whole structure is really very simple. Go and try it out for yourself."