Vision Sparse MLP: A Fully-MLP Architecture with Conditional Computation

The original document: www.yuque.com/lart/papers…

From the abstract:

Mixture-of-Experts (MoE) with sparse conditional computation has been proved an effective architecture for scaling attention-based models to more parameters with comparable computation cost. In this paper, we propose Sparse-MLP, scaling the recent MLP-Mixer model with sparse MoE layers, to achieve a more computation-efficient architecture.

This paper adopts the idea of MoE: conditional computation scales up the number of parameters while keeping the computational cost from growing too much.

We replace a subset of dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers:

  1. one with MLP experts mixing information within channels along image patch dimension,
  2. one with MLP experts mixing information within patches along the channel dimension.

MoE is applied to both the spatial (token-mixing) MLP and the channel MLP; a sketch of such a two-stage Sparse block is given below.
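The following is a minimal PyTorch sketch of this two-stage design, not the authors' code: the spatial MoE acts along the patch axis via a transpose, the channel MoE acts along the channel axis, and the concrete MoE modules (one possible implementation is sketched in the Core operations section below) are passed in as arguments. The layer norms and skip connections follow the usual Mixer layout and are an assumption here.

```python
import torch
import torch.nn as nn

class SparseBlock(nn.Module):
    """Hypothetical two-stage Sparse block: spatial MoE, then channel MoE."""
    def __init__(self, channels, moe_s: nn.Module, moe_c: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.moe_s = moe_s    # experts mixing information along the patch axis
        self.moe_c = moe_c    # experts mixing information along the channel axis

    def forward(self, x):                        # x: (batch, patches, channels)
        y = self.norm1(x).transpose(1, 2)        # (batch, channels, patches)
        x = x + self.moe_s(y).transpose(1, 2)    # stage 1: mix within channels, along patches
        x = x + self.moe_c(self.norm2(x))        # stage 2: mix within patches, along channels
        return x
```

With `moe_s = moe_c = nn.Identity()` the block degenerates to pure skip connections, which makes the shape handling easy to check.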

Besides, to reduce computational cost in routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers are to re-scale image representations by two simple but effective linear transformations.

This refers to the introduction of a structure that rescales the spatial and channel dimensions of the representation (the Re-represent layers).

When pre-training on ImageNet-1k with MoCo v3 algorithm, our models can outperform dense MLP models by 2.5% on ImageNet Top-1 accuracy with fewer parameters and computational cost. On small-scale downstream image classification tasks, i.e., Cifar10 and Cifar100, our Sparse-MLP can still achieve better performance than baselines.

The main content

  • MoE technology is introduced into the MLP model to replace the original MLP layers. Conditional computation expands the capacity and expressiveness of the model by constructing multiple experts with different weights, while a gated routing mechanism constrains how many experts are actually activated per input; this is what is meant by conditional computation, and it keeps the extra computational cost and latency small.
  • Because the number of spatial tokens and the number of channels differ greatly in the original representation, the routing cost and the expert forward cost of the spatial MoE layer are unbalanced. The authors therefore re-project the representation with linear layers before and after the spatial MoE layer so that the number of spatial tokens and the number of channels are better balanced, making the computation more balanced and efficient.

Mixture of Experts

Core operations


$$\text{MoE}(x) = \sum_{i=1}^{N}G(x)_iE_i(x), \quad G(x) = \text{TopK}(\text{softmax}(W_g(x) + \epsilon)):\mathbb{R}^{D} \rightarrow \mathbb{R}^{N}, \quad E_i(x):\mathbb{R}^D \rightarrow \mathbb{R}^D.$$

Here, from left to right, are: the aggregation operation of the MoE layer containing $N$ experts; the gating network, which computes routing weights conditioned on the input (softmax produces normalized weights, and noise $\epsilon \sim \mathcal{N}(0, \frac{1}{N^2})$ is added); and the $i$-th expert layer.
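A minimal PyTorch sketch of this layer, written for illustration rather than as the authors' implementation; the expert hidden width, applying the noise only at training time, and the per-expert routing loop are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoE(nn.Module):
    """Hypothetical MoE layer matching the formula above (not the authors' code)."""
    def __init__(self, dim, num_experts=8, top_k=1, hidden_ratio=4):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)   # W_g
        # E_i: each expert is a small MLP mapping R^D -> R^D.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * hidden_ratio), nn.GELU(),
                          nn.Linear(dim * hidden_ratio, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (..., dim)
        shape = x.shape
        x = x.reshape(-1, shape[-1])             # route every token independently
        logits = self.gate(x)                    # W_g(x): (tokens, N)
        if self.training:                        # noise eps ~ N(0, 1/N^2)
            logits = logits + torch.randn_like(logits) / logits.shape[-1]
        weights = F.softmax(logits, dim=-1)      # normalized routing weights
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # MoE(x) = sum_i G(x)_i * E_i(x), restricted to the top-k experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out.reshape(shape)
```

In practice the per-expert loop is usually replaced by batched dispatch/combine operations for efficiency; the loop form above simply mirrors the formula.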

Similar work

  • The MoE idea largely comes from an ICLR 2017 paper: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean). That paper states: “The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation.” It inserts sparsely-gated MoE layers, whose experts are combined through a trainable gating network, between stacked LSTM layers, increasing model capacity by more than 1000x with only a small loss in efficiency. Beyond the main idea, the optimization strategies for the MoE structure in the present paper are also largely consistent with it.
  • In addition, this Sparse-MLP work shares much in common with another work by the same authors on enhancing ViT, Go Wider Instead of Deeper.

(Figure: Go Wider Instead of Deeper)

  • In terms of the loss function, however, it is actually closer to Scaling Vision with Sparse Mixture of Experts; the auxiliary losses are very similar.

Details of this paper

In this paper, the spatial and channel MLPs in the last few layers of MLP-Mixer are replaced with MoE structures (covering both the spatial and the channel parts). Such a setting introduces more parameters and improves the expressive power of the model.

Training multi-expert models is not easy, mainly because the sparse gated routing means that not all experts get fully trained, the so-called load-imbalance problem. Most multi-expert methods therefore need dedicated losses to address it.

As for the loss function, this paper follows the setting of previous work and applies a load-balance loss, which encourages a balanced distribution of inputs across experts.

The loss consists of two parts:

  • Importance Loss: the purpose is to make the importance of each expert as close as possible during routing, so that every expert has a chance of being selected and can be fully trained.
    • To this end, the importance of the experts is defined as $\text{Imp}(X) = \{\sum_{x \in X}\text{softmax}(W_g x)_i\}_{i=1}^{N}$.
    • $W_g \in \mathbb{R}^{D \times N}$ is the gating weight matrix of the MoE layer, which maps a $D$-dimensional input $x$ to the $N$ experts; after the softmax, it gives the weight with which each sample $x$ is assigned to each expert. Summing the weights assigned to the $i$-th expert over the batch input $X$ gives its importance, which reflects the relative role of each expert, compared with the others, in processing the overall input.
    • Therefore, to balance the importance of the experts as much as possible, so that each of them can "perform" well, the squared coefficient of variation of the importance distribution over experts is used as the importance loss:
      $$L_{imp}(X) = \left(\frac{\text{std}(\text{Imp}(X))}{\text{mean}(\text{Imp}(X))}\right)^2.$$

Table 4 from Scaling Vision with Sparse Mixture of Experts (the load-imbalance example discussed below).

  • Load Loss: the importance loss only ensures that the experts receive similar routing weights on average. Unfortunately, a routing configuration whose weights look balanced can still send nearly all assignments to a few experts (see the table above: although the weight sums over inputs 1–4 equal 2 for every expert, the top-k selection only ever picks experts 1 and 3, so expert 2 never gets properly trained).
    • Here the load of an expert is defined as $\text{Load}(X) = \{\sum_{x \in X} p_i(x)\}_{i=1}^{N}$.
    • $p_i(x) := Pr(G(x)_i \ge \text{threshold}_k(G(x)))$ is the probability that expert $i$ is selected for a given sample (its gating score exceeds the threshold, i.e., it lies among the top-$k$ experts with the largest weights) when the batch data is fed in.
    • This probability seems tricky to compute, but because the authors inject normally distributed noise, everything becomes computable; roughly, it reduces to the tail probability of a normally distributed variable:


    $$p_i(x) := Pr(G(x)_i \ge \text{threshold}_k(G(x))) = Pr(W_g(x)_i + \epsilon \ge \text{threshold}_k(W_g(x) + \epsilon)) = Pr(\epsilon \ge \text{threshold}_k(W_g(x) + \epsilon) - W_g(x)_i)$$

    • The load loss is likewise the squared coefficient of variation of the load distribution: $L_{load}(X) = \left(\frac{\text{std}(\text{Load}(X))}{\text{mean}(\text{Load}(X))}\right)^2$.

So $L_{aux} = \lambda(\frac{1}{2}L_{imp} + \frac{1}{2}L_{load})$. The hyperparameter $\lambda$ controls how strongly the auxiliary loss encourages balanced routing across experts while ensuring it does not overwhelm the original model loss. The setting follows previous work, $\lambda = 0.01$; according to that work, the impact of this parameter on performance is not significant.
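A hedged sketch of these two balancing terms in PyTorch, assuming clean gating scores $W_g(x)$ as input. The handling of the $k$-th threshold is simplified compared with the original paper (which excludes element $i$ itself when computing it), and the noise standard deviation $1/N$ matches the distribution quoted above.

```python
import torch
import torch.nn.functional as F

def cv_squared(t):
    """Squared coefficient of variation: (std / mean)^2."""
    return t.var(unbiased=False) / t.mean().clamp_min(1e-9) ** 2

def auxiliary_loss(logits, k=1, lam=0.01):
    """logits: clean gating scores W_g(x), shape (batch_tokens, num_experts)."""
    n = logits.shape[-1]
    noise_std = 1.0 / n                                   # eps ~ N(0, 1/N^2)

    # Importance: per-expert sum of softmax routing weights over the batch.
    importance = F.softmax(logits, dim=-1).sum(dim=0)     # Imp(X), shape (N,)
    l_imp = cv_squared(importance)

    # Load: probability of each expert landing in the top-k under the noise.
    noisy = logits + torch.randn_like(logits) * noise_std
    threshold = noisy.topk(k, dim=-1).values[:, -1:]      # k-th largest noisy score
    normal = torch.distributions.Normal(0.0, noise_std)
    p = 1.0 - normal.cdf(threshold - logits)              # Pr(expert i is selected)
    load = p.sum(dim=0)                                   # Load(X), shape (N,)
    l_load = cv_squared(load)

    return lam * (0.5 * l_imp + 0.5 * l_load)
```

During training this value would simply be added to the main task loss.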

In addition to the multi-expert layers themselves, the authors note that in the original MLP-Mixer the patch-based tokenization makes the number of spatial tokens less than 1/3 of the number of channels. For the spatial MoE layer (MoE_s) this leads to an imbalance between the cost of the routing part and the cost of the expert part. The authors therefore introduce Re-represent layers to rescale the spatial and channel dimensions of the MoE_s inputs and outputs, implemented with simple linear layers; a sketch is given after the description below.

In practice, $S_1 = 2S$ and $C_1 = C/2$.

There are two such layers, one at the input and one at the output. Together they balance the computation of the MoE_s layers wrapped between them (during the MoE_s operation the number of channels is reduced and the number of spatial patches is increased).
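Since the original pseudo-code figure is not reproduced here, the following is a minimal sketch of the idea under the assumptions above: two linear maps, one over the channel axis and one over the patch axis, rescale $(S, C)$ to $(S_1, C_1)$ before MoE_s and back afterwards. The class, argument names, and example sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ReRepresent(nn.Module):
    """Hypothetical Re-represent layer: rescale (S, C) -> (S_out, C_out)."""
    def __init__(self, s, c, s_out, c_out):
        super().__init__()
        self.token_proj = nn.Linear(s, s_out)     # acts along the patch axis
        self.chan_proj = nn.Linear(c, c_out)      # acts along the channel axis

    def forward(self, x):                         # x: (batch, S, C)
        x = self.chan_proj(x)                     # (batch, S, C_out)
        x = self.token_proj(x.transpose(1, 2))    # (batch, C_out, S_out)
        return x.transpose(1, 2)                  # (batch, S_out, C_out)

# Wrapping MoE_s: shrink channels and grow tokens, run the spatial experts on
# the rescaled tensor, then restore the original shape; the paper uses
# S_1 = 2S and C_1 = C / 2.
S, C = 196, 768                                   # example Mixer-B/16-like sizes (assumed)
pre = ReRepresent(S, C, 2 * S, C // 2)            # (S, C)    -> (2S, C/2)
post = ReRepresent(2 * S, C // 2, S, C)           # (2S, C/2) -> (S, C)
```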

The experimental results

We find that scaling MLP models in parameters and training them from scratch with limited training data will lead to an overfitting problem. Such finding is consistent with previous work on MLP models (Tolstikhin et al. 2021) and attention based models (Chen, Xie, and He 2021; Dosovitskiy et al. 2020; Xue et al. 2021).

In order to better obtain model capacity, we adopt MoCo v3 algorithm as our default self-supervised training algorithm (He et al. 2019; Chen et al. 2020b; Chen, Xie, and He 2021), and fine-tune models on downstream image classification tasks.

All models are pre-trained in a self-supervised way with MoCo v3.

The ablation experiments in this paper mainly discuss the following four points:

  • The impact of the number of experts

Here MoE_s and MoE_c are examined separately, fixing one while varying the other. Increasing the number of MoE_s experts improves performance, while increasing the number of MoE_c experts leads to a decline, which the authors attribute to overfitting. (The observation that increasing the number of experts on channel features leads to overfitting was also made in the authors' earlier work Go Wider Instead of Deeper.)

  • The number of selected experts K in routing

Here, different K values are tried at different positions, with all experiments based on the B configuration. As can be seen, the channel experts benefit from activating more experts at the same time, while a single spatial expert is enough.

  • The location of Sparse Blocks, or MoE structures

For the B configuration, the Sparse Blocks are originally placed in the last two stages. The comparison made here shows that placing such blocks near the end of the model works better.

  • The role of the Re-represent layers

It can be seen that with the Re-represent layers the model becomes faster, yet performance does not drop but actually improves. This is an interesting phenomenon, but the authors do not give a detailed explanation or analysis; they only mention that the layers are used to balance routing and expert computation costs. Would such a structure also bring an improvement if used directly in MLP-Mixer?

Reference links

  • Sparse MLP: A Fully MLP Architecture with Conditional Computation: arxiv.org/pdf/2109.02…
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer: arxiv.org/pdf/1701.06…
  • Go Wider Instead of Deeper: arxiv.org/pdf/2107.11…
  • Scaling Vision with Sparse Mixture of Experts: arxiv.org/pdf/2106.05…