Introduction:
ShuffleNet_v1 was released after MobileNet_v1 and before MobileNet_v2. Like MobileNet, the ShuffleNet series is designed for mobile deployment.
ShuffleNet_v1 combines depthwise separable convolution and group convolution, and proposes a ShuffleNet Unit built on two operations: pointwise group convolution and channel shuffle. Depthwise convolution comes from Xception, while the idea of group convolution dates back to AlexNet and is used effectively in ResNeXt and Deep Roots.
Channel shuffle for group convolution
Xception and ResNeXt use depthwise separable convolutions and group convolutions, respectively, to build blocks that balance representational power and computational cost. However, neither design fully addresses the 1×1 (pointwise) convolutions, which account for considerable complexity. For example, ResNeXt applies group convolution only to the 3×3 layers, so the pointwise convolutions take up 93.4% of the multiply-adds in each residual unit. In small networks, under a tight complexity budget, the expensive pointwise convolutions allow only a limited number of channels, which may hurt accuracy.
To solve this problem, ShuffleNet proposes sparse channel connections by applying group convolution to the 1×1 layers as well: the channels are divided into g groups, and each output channel is connected only to the input channels within its own group. The left figure below shows the normal (dense) channel connection, and the right figure shows the channel connection after grouping.
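The effect of grouping a 1×1 layer can be seen directly from the parameter count. Below is a minimal PyTorch sketch; the channel counts 12/24 and g = 3 are arbitrary illustrative values, not taken from the paper:

```python
import torch
import torch.nn as nn

g = 3
x = torch.randn(1, 12, 56, 56)                              # (N, C, H, W)

dense_1x1   = nn.Conv2d(12, 24, kernel_size=1)              # every output sees all 12 inputs
grouped_1x1 = nn.Conv2d(12, 24, kernel_size=1, groups=g)    # each output sees only 12/g = 4 inputs

print(sum(p.numel() for p in dense_1x1.parameters()))       # 12*24 + 24 = 312
print(sum(p.numel() for p in grouped_1x1.parameters()))     # (12/g)*24 + 24 = 120

print(dense_1x1(x).shape, grouped_1x1(x).shape)             # same output shape, roughly 1/g the cost
```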
The problem with stacking such grouped layers is that each output channel is derived only from the input channels within its own group (as shown in the left figure below), which blocks the flow of information between channel groups and weakens the model's representational power. The solution is to connect the different groups (as shown in the figure below) so that information can flow across channel groups.
ShuffleNet achieves this with a simple channel shuffle: the channels produced by the grouped convolution are interleaved so that each group in the next layer receives channels from every group of the previous layer (as shown on the right in the figure above). Note that this operation is differentiable and can be embedded in the network for end-to-end training.
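A minimal PyTorch sketch of the channel shuffle operation described above (reshape, transpose, flatten back; it assumes the channel count is divisible by the number of groups):

```python
import torch

def channel_shuffle(x, groups):
    # x: (N, C, H, W), C must be divisible by groups
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)                 # flatten back to (N, C, H, W)

# Quick check: with 2 groups, channels [0..5] become [0, 3, 1, 4, 2, 5].
x = torch.arange(6).float().view(1, 6, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())
```

Because the operation is just a reshape and a transpose, it adds no parameters and is fully differentiable.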
ShuffleNet Unit
ShuffleNet uses residual connections, and the ShuffleNet Unit is adapted from the standard residual unit.
The figure on the left shows a residual unit. The ShuffleNet Unit replaces the 1×1 convolutions with 1×1 group convolutions; in the downsampling (stride-2) variant it also adds a 3×3 average pooling on the shortcut path and replaces element-wise addition with channel concatenation, which enlarges the number of channels at little extra computational cost. In addition, the ShuffleNet Unit removes the ReLU after the depthwise convolution: Xception first proposed using only a linear transformation there, and MobileNet_v2 later explained why applying ReLU after a depthwise convolution loses information (discussed in detail in an earlier article of this series). The final ShuffleNet Unit is therefore shown on the right.
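A rough PyTorch sketch of the stride-1 unit just described, reusing the channel_shuffle function from the earlier sketch (the bottleneck ratio of 1/4 follows the paper; the stride-2 variant with average pooling on the shortcut and concatenation is omitted, and the channel counts are assumed divisible by the group number):

```python
import torch
import torch.nn as nn

class ShuffleUnit(nn.Module):
    """Stride-1 ShuffleNet unit: 1x1 group conv -> shuffle -> 3x3 depthwise -> 1x1 group conv."""
    def __init__(self, channels, groups=3):
        super().__init__()
        mid = channels // 4                                            # bottleneck width
        self.groups = groups
        self.gconv1 = nn.Conv2d(channels, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)                                 # no ReLU after depthwise conv
        self.gconv2 = nn.Conv2d(mid, channels, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.gconv1(x)))    # 1x1 group conv + ReLU
        out = channel_shuffle(out, self.groups)       # let information cross channel groups
        out = self.bn2(self.dwconv(out))              # 3x3 depthwise conv (linear, no ReLU)
        out = self.bn3(self.gconv2(out))              # 1x1 group conv
        return torch.relu(x + out)                    # residual add, then ReLU
```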
Compared with ResNet and ResNeXt, the ShuffleNet Unit requires much less computation, so ShuffleNet can afford wider feature maps under the same computational budget.
For an input with C channels, feature map size H × W, and M bottleneck channels, a ResNet unit costs HW(2CM + 9M^2) FLOPs and a ResNeXt unit costs HW(2CM + 9M^2/g) FLOPs, while a ShuffleNet Unit costs only HW(2CM/g + 9M) FLOPs, where g is the number of groups. For how these counts are computed, see the earlier article "The difference between FLOPS and FLOPs".
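Plugging in some hypothetical sizes (illustrative values only, not taken from the paper) makes the gap concrete:

```python
# Illustrative values: a 28x28 feature map, 240 input channels,
# 60 bottleneck channels, and g = 3 groups.
H, W, C, M, g = 28, 28, 240, 60, 3

resnet     = H * W * (2 * C * M + 9 * M ** 2)       # ~48.0 MFLOPs
resnext    = H * W * (2 * C * M + 9 * M ** 2 / g)   # ~31.0 MFLOPs
shufflenet = H * W * (2 * C * M / g + 9 * M)        # ~ 7.9 MFLOPs

for name, flops in [("ResNet", resnet), ("ResNeXt", resnext), ("ShuffleNet", shufflenet)]:
    print(f"{name:10s} {flops / 1e6:.1f} MFLOPs")
```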
In addition, although depthwise convolution has low theoretical complexity, it is hard to implement efficiently on low-power mobile devices, likely because its ratio of computation to memory access is worse than that of other dense operations. Therefore, in the ShuffleNet Unit, depthwise convolution is applied only to the bottleneck feature map to minimize this overhead.
ShuffleNet structure
In the architecture table, the last column (Complexity) gives the FLOPs. In Stage 2, the first pointwise convolution does not use group convolution because the number of input channels is small. Each stage doubles the number of channels of the previous stage. A larger g encodes more feature channels within the same computational budget, but g should not be too large, otherwise each filter sees too few input channels and accuracy may suffer.
Conclusion
Here, the number after ShuffleNet (e.g., 0.5×, 1×, 2×) is a scale factor s applied to the number of filters in each layer, which controls the model size. Since both the input and output channel counts of every layer scale with s, the overall complexity scales roughly with s^2.
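As a quick sanity check of the s^2 claim (using the generic cost of a convolution layer rather than a formula from the paper):

$$\mathrm{FLOPs} \propto H W k^{2} C_{\mathrm{in}} C_{\mathrm{out}} \;\longrightarrow\; H W k^{2} (sC_{\mathrm{in}})(sC_{\mathrm{out}}) = s^{2} \cdot H W k^{2} C_{\mathrm{in}} C_{\mathrm{out}}$$

so halving the width (s = 0.5) cuts the computation to roughly a quarter.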
Experiments show that using more groups generally yields a lower error rate, and that the error rate with channel shuffle is lower than without it.
At the same computational cost, ShuffleNet achieves a lower error rate than other models.
ShuffleNet 2× outperforms MobileNet_v1, with less computation and a lower error rate.