**Preface:** This paper tackles the drastic performance drop that efficient networks suffer at extremely low computational cost. Micro-Factorized Convolution is proposed, which factorizes a convolution matrix into low-rank matrices and integrates sparse connectivity into the convolution. A new dynamic activation function, Dynamic Shift-Max, is proposed to improve non-linearity by taking the maximum over multiple dynamic fusions between an input feature map and its circular channel shifts.

Building on these two new operations, a family of networks called MicroNet is presented, which achieves significant performance improvements over the state of the art in the low-FLOP regime. Under a 12M-FLOP constraint, MicroNet achieves 59.4% top-1 accuracy on ImageNet classification, 9.6% higher than MobileNetV3.

Paper: MicroNet: Improving Image Recognition with Extremely Low FLOPs

Code: github.com/liyunsheng1…


Starting Point

Recent advances in efficient CNN architectures have successfully reduced the computational cost of ImageNet classification by two orders of magnitude, from 3.8G FLOPs (ResNet-50) to about 40M FLOPs (e.g. MobileNet, ShuffleNet), with reasonable performance degradation.

However, when the computational cost is reduced further, they suffer a significant performance drop. For example, the top-1 accuracy of MobileNetV3 falls sharply from 65.4% to 58.0% and then 49.8% as the cost decreases from 44M to 21M and 12M MAdds, respectively.

The goal of this paper is to improve accuracy in the extremely low-FLOP regime, from 21M down to 4M MAdds, which marks another order of magnitude reduction in computational cost.

Dealing with very low computational costs (4M-21M FLOPs) is very challenging: with a 224×224×3 input, a first-layer 3×3 convolution with 8 output channels already consumes 2.7M MAdds. The remaining budget is too limited to design the convolution layers and the 1000-class classifier needed for effective classification.
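As a sanity check on that number, the first-layer cost can be computed directly (a minimal sketch assuming the common stride-2 stem, so the 3×3 convolution produces a 112×112 output grid):

```python
# MAdds of the first layer: 3x3 conv, 3 input channels, 8 output channels.
# Assumption: stride 2, so the 224x224 input yields a 112x112 output grid.
h_out = w_out = 224 // 2
kernel_cost = 3 * 3 * 3            # 3x3 spatial kernel over 3 input channels
madds = h_out * w_out * kernel_cost * 8
print(madds)  # 2709504, i.e. ~2.7M MAdds
```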

As shown in the figure above, the common strategies of shrinking the width or depth of existing efficient CNNs (such as MobileNet and ShuffleNet) result in severe performance degradation.

This paper focuses on the new operator design while fixing the input resolution at 224×224 with a budget cost of 4M FLOPs.

Innovative ideas

This paper deals with extremely low FLOPs from two perspectives: node connectivity and non-linearity, which relate to network width and network depth respectively.

First, reducing node connectivity to widen the network provides a good trade-off under a given computational budget. Second, the reduced network depth is compensated by improved per-layer non-linearity, which determines the non-linearity of the network. These two factors motivate the design of more efficient convolution and activation functions.

Methods

Micro-Factorized Convolution

It consists of two parts, Micro-Factorized Pointwise Convolution and Micro-Factorized Depthwise Convolution, which can be combined in different ways.

Micro-Factorized Pointwise Convolution

In this paper, Micro-Factorized Convolution (MF-Conv) is proposed to factorize a pointwise convolution into two group-convolution layers, where the group number G adapts to the channel number C:

G = √(C / R)

where R is the channel reduction ratio between the two layers.

This choice of G achieves a good trade-off between the number of channels and node connectivity for a given computational cost.
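To make the trade-off concrete, here is a small sketch (with illustrative channel counts, not the paper's exact configurations) counting parameters of a full pointwise convolution versus its micro-factorized form:

```python
import math

def group_count(C, R):
    """Adaptive group number from the paper: G = sqrt(C / R)."""
    return math.isqrt(C // R)  # assumes C / R is a perfect square

def pointwise_params(C):
    """Full C -> C pointwise convolution."""
    return C * C

def micro_factorized_params(C, R):
    """Two group convolutions: squeeze C -> C/R, then expand C/R -> C."""
    G = group_count(C, R)
    squeeze = G * (C // G) * (C // (R * G))
    expand  = G * (C // (R * G)) * (C // G)
    return squeeze + expand

C, R = 32, 2  # illustrative sizes
print(group_count(C, R))             # 4 groups
print(pointwise_params(C))           # 1024
print(micro_factorized_params(C, R)) # 256: 4x fewer parameters
```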

As shown in the figure above, the C input channels are divided into G groups; the first group convolution compresses them to C/R channels, a permutation matrix Φ of size C/R × C/R reorders the channels in the middle (similar to ShuffleNet's channel shuffle), and the second group convolution expands back to C channels.
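The permutation between the two group convolutions can be sketched as an interleaving of channels, as in ShuffleNet's channel shuffle (a minimal pure-Python illustration on a list of channel values; the function name is ours):

```python
def channel_shuffle(x, groups):
    """Interleave channels across groups: [g0c0, g1c0, g0c1, g1c1, ...]."""
    n = len(x) // groups  # channels per group
    return [x[g * n + i] for i in range(n) for g in range(groups)]

# 8 channels in 2 groups: [0..3] and [4..7] become interleaved,
# so the next group convolution sees a mix of both groups.
print(channel_shuffle(list(range(8)), groups=2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```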

Micro-Factorized Depthwise Convolution

This part follows the factorized convolution of Inception-v2: on top of a depthwise convolution, the k×k kernel is factorized into a k×1 and a 1×k kernel.
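The saving from this factorization is easy to quantify: a k×k depthwise kernel costs k² multiply-adds per position, while the k×1 plus 1×k pair costs 2k. A small sketch with illustrative feature-map sizes (not the paper's exact configurations):

```python
def depthwise_madds(h, w, c, k):
    """Full k x k depthwise convolution."""
    return h * w * c * k * k

def factorized_madds(h, w, c, k):
    """k x 1 followed by 1 x k depthwise convolutions."""
    return h * w * c * (k + k)

print(depthwise_madds(56, 56, 16, 5))   # 1254400
print(factorized_madds(56, 56, 16, 5))  # 501760: k^2 / 2k = 2.5x cheaper
```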

Micro-factorized pointwise and depthwise convolutions can be combined in two different ways: (a) a regular combination and (b) a lite combination.

The former simply concatenates the two convolutions. The lite combination shown above uses micro-factorized depthwise convolution to expand the number of channels by applying multiple spatial filters to each channel, then applies a group-adaptive pointwise convolution to fuse and compress the channels. Compared with the regular combination, it saves channel-fusion (pointwise) computation and spends more resources on learning spatial filters (depthwise), which proves more effective in the lower layers of the network.

Dynamic Shift-Max

Considering that micro-factorized pointwise convolution focuses on connections within each group, Dynamic Shift-Max is proposed: a new dynamic non-linearity that strengthens the connections between the groups created by micro-factorization.

Dynamic Shift-Max outputs the maximum of K fusions, where each fusion is a weighted sum over J circular group shifts of the input:

y_i = max_{1≤k≤K} Σ_{j=0}^{J-1} a^k_{i,j}(x) · x_{(i + j·C/J) mod C}

with the coefficients a^k_{i,j}(x) generated dynamically from the input x.

where i is the channel index, J is the number of fused group shifts, and K is the number of fusions over which the maximum is taken. Setting J = K = 2 achieves a good compromise between accuracy and complexity.

The formula can be explained in one sentence: for each channel i, form K fusions, each a weighted sum of x over J group shifts, and take the maximum of the K fusions as the activation output.
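That sentence maps directly to code. A minimal sketch for a single spatial position, with hypothetical fixed coefficients standing in for the dynamically generated a^k_{i,j}(x) (in the paper they are produced from x by a small side branch), and assuming the circular shift step is the group size C/J:

```python
def dynamic_shift_max(x, a, J, K):
    """x: list of C channel values at one spatial position.
    a[k][j]: fusion coefficients (static here for illustration;
    normally input-dependent)."""
    C = len(x)
    shift = C // J  # assumed group size used for circular shifts
    out = []
    for i in range(C):
        # K fusions, each a weighted sum over J circularly shifted channels
        fuses = [sum(a[k][j] * x[(i + j * shift) % C] for j in range(J))
                 for k in range(K)]
        out.append(max(fuses))  # max over the K fusions
    return out

x = [1.0, -2.0, 3.0, -4.0]
a = [[1.0, 0.0],   # k = 0: identity fusion (keeps channel i)
     [0.5, 0.5]]   # k = 1: average of channel i and its shifted partner
print(dynamic_shift_max(x, a, J=2, K=2))  # [2.0, -2.0, 3.0, -3.0]
```

Note that with the identity fusion among the candidates, the output never falls below the input channel itself, so the operator behaves like a channel-mixing generalization of a max-based activation.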

In this way, Dynamic Shift-Max realizes two forms of non-linearity: (a) the maximum over K fusions, each fusing J groups of the output, and (b) input-dependent (dynamic) parameters.

The first non-linearity complements micro-factorized pointwise convolution, which focuses on connections within each group, by strengthening the connections between groups. The second lets the network adapt this reinforcement to the input x. Together, the two operations increase the representational capacity of the network and compensate for the reduced number of layers.

MicroNet

Conclusion

With the restriction of 12M FLOPs, MicroNet achieved a top-1 accuracy of 59.4% in ImageNet classification, 9.6% higher than MobileNetV3.

Evaluation on ImageNet classification. Left: top-1 accuracy vs. FLOPs. Right: top-1 accuracy vs. latency. Note that MobileNetV3 ×0.75 is added to facilitate comparison. MicroNet outperforms MobileNetV3, especially when the computational cost is extremely low (top-1 accuracy improves by more than 5% when FLOPs are below 15M or latency is below 9ms).

Dynamic Shift-Max compared to other activation functions on ImageNet.