A guided read-through of computer vision papers: SENet

gossip

There is always a moment when we miss faraway loved ones. On the Mid-Autumn Festival (the fifteenth day of the eighth lunar month), let us put aside our busy work, go home, and raise a toast at a reunion dinner with our elders. This is the first time I have not spent the Mid-Autumn Festival at home, and I feel okay. These days the festival itself matters less; what matters most is the family reunion. Dear friends, happy Mid-Autumn Festival: may you be joyful and light of heart, healthy and at ease, and may your families be reunited, harmonious, and thriving! ❤️

preface

Paper: Squeeze-and-Excitation Networks

code

A block that can be grafted onto and integrated into existing networks 😇

Momenta won the ImageNet 2017 challenge with its SENet architecture. The first author of the paper, Hu Jie, is a senior R&D engineer at Momenta. Founded in 2016, Momenta is an autonomous-driving company. Its core technologies are environment perception, high-precision maps, and driving decision algorithms based on deep learning; its products include autonomous-driving solutions at various levels as well as big-data services derived from them. Momenta is dedicated to "building the brain of autonomous driving" and counts among its staff world-class deep learning experts, including authors of the image recognition frameworks Faster R-CNN and ResNet and winners of competitions such as ImageNet 2015, ImageNet 2017, and the MS COCO Challenge 2015. Team members come mainly from Tsinghua University, MIT, Microsoft Research Asia, and other universities and research institutions, as well as from Baidu, Alibaba, Tencent, Huawei, SenseTime, and other well-known high-tech companies, bringing deep technical expertise, strong creativity, and rich industry experience.

SENet won the ImageNet 2017 classification task, the last edition of the ImageNet competition, and the paper was accepted as a CVPR 2018 oral. Moreover, SENet is conceptually simple, easy to implement, computationally cheap, and modular, so it can be embedded seamlessly into mainstream network architectures. Practice has repeatedly shown that SENet helps networks achieve better task performance.

As the core of a convolutional neural network, the convolution kernel is usually regarded as an information aggregator that combines spatial information and channel-wise information within a local receptive field. A convolutional neural network consists of a series of convolutional layers, nonlinear layers, and downsampling layers, so that it can capture image features from receptive fields that grow until they cover the whole image.

Abstract

  1. Convolution is the core operation of CNNs; it fuses spatial and channel information.
  2. Prior work has mostly studied enhancing spatial feature extraction;
  3. This paper proposes the SE block for channel features, which recalibrates channel-wise features adaptively;
  4. SE blocks can be stacked into an SENet and achieve good results on multiple datasets;
  5. SENet greatly improves accuracy while adding only a few parameters;
  6. ILSVRC 2017 champion.

The method was evaluated extensively on the ImageNet dataset, but SENets are not limited to a particular dataset or task. By leveraging SENets, we ranked first in the ILSVRC 2017 classification competition. Our best model ensemble achieves a top-5 error of 2.251% on the test set, a relative improvement of about 25% over the previous year's winning entry (top-5 error of 2.991%).

The SE block is designed from the channel dimension: an attention mechanism is proposed to recalibrate the features, retaining valuable features and suppressing less useful ones.

The structure of the SE building block is shown in the figure above. The features U are first passed through a Squeeze operation, which produces a channel descriptor by aggregating the feature maps across their spatial dimensions (H×W). The function of this descriptor is to produce an embedding of the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all of its layers. The aggregation is followed by an Excitation operation, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights. These weights are applied to the feature maps U to generate the output of the SE block, which can be fed directly into subsequent layers of the network. An SE network (SENet) can be built simply by stacking a collection of SE blocks. SE blocks can also be used as drop-in replacements for the original blocks at a range of depths in a network architecture.
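
As a concrete illustration of this pipeline, here is a minimal PyTorch sketch of an SE block; the module name, the use of nn.AdaptiveAvgPool2d, and the default reduction of 16 are choices made for this example rather than taken from the authors' released code.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block: squeeze -> excite -> scale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: global average pooling collapses each H x W map to a single number
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck MLP (FC -> ReLU -> FC -> sigmoid) yields per-channel gates
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (N, C, H, W) feature maps produced by the preceding convolutions
        n, c, _, _ = u.shape
        z = self.pool(u).view(n, c)          # squeeze  -> (N, C)
        s = self.fc(z).view(n, c, 1, 1)      # excite   -> per-channel weights in (0, 1)
        return u * s                         # scale    -> recalibrated feature maps
```

The final multiplication is the scaling step: each of the C channels of U is reweighted by its learned importance before being passed on to the next layer.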

Paper details

Idea: Let our neural network use global information to enhance useful information and suppress useless information.

Assumptions: the transformation F_tr maps an input X of size H'×W'×C' to feature maps U of size H×W×C.

Let K = [K_1, K_2, …, K_C], where each element K_c is a filter kernel. Then

u_c = K_c * X = Σ_{s=1..C'} K_c^s * x^s,

where * represents the convolution operation (bias is ignored). Each output channel u_c is a sum over all input channels, so channel dependencies are implicitly entangled with the spatial correlations captured by the filters.
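
To make the formula concrete, here is a small, purely illustrative PyTorch check (toy shapes chosen for the example) that an output channel of a standard convolution really is the sum of per-input-channel convolutions:

```python
import torch
import torch.nn.functional as F

# Toy shapes, just to illustrate u_c = K_c * X = sum_s K_c^s * x^s (bias ignored)
x = torch.randn(1, 3, 8, 8)            # input X with C' = 3 channels
k = torch.randn(4, 3, 3, 3)            # C = 4 kernels, each spanning all 3 input channels
u = F.conv2d(x, k, padding=1)          # standard conv -> U of shape (1, 4, 8, 8)

# Recompute output channel 0 as an explicit sum of single-channel convolutions
u0 = sum(F.conv2d(x[:, s:s+1], k[0:1, s:s+1], padding=1) for s in range(3))
print(torch.allclose(u[:, 0:1], u0, atol=1e-5))   # True: channels are mixed inside the conv
```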

Squeeze stage: z_c = F_sq(u_c) = (1 / (H×W)) · Σ_{i=1..H} Σ_{j=1..W} u_c(i, j).

Excitation stage: s = F_ex(z, W) = σ(W_2 · δ(W_1 · z)), where δ is the ReLU, σ is the sigmoid, W_1 has shape (C/r)×C and W_2 has shape C×(C/r).

VGGNets and Inception models showed that increasing the depth of a network can significantly improve the quality of the representations it learns. By adjusting the distribution of the inputs to each layer, batch normalization (BN) added stability to the learning process in deep networks and produced smoother optimization surfaces. Building on this work, ResNets demonstrated that it is possible to learn considerably deeper and stronger networks through shortcut connections. Highway networks introduced a self-gating mechanism to regulate the flow of information along shortcut connections. Following these works, there have been further reformulations of the connections between network layers, showing promising improvements to the learning and representational properties of deep networks. Related design directions include:

  1. Grouped convolutions: ResNeXt;
  2. Multi-branch networks: the GoogLeNet series;
  3. Applications of 1×1 convolutions: Xception, etc.

Earlier work used local information to study the relationships between channels; the method proposed in this paper uses global information.

Designing and developing new CNN architectures is a difficult engineering task, often requiring the selection of many new hyperparameters and layer configurations. By contrast, SE blocks are simple and can be used directly in existing state-of-the-art architectures, effectively improving performance by replacing components with their SE counterparts. The SE module is also computationally lightweight, adding only a slight increase in model complexity and computational burden.

SENet advantages:

  1. The SE block is simple in design and plug-and-play;
  2. The SE block adds few parameters.

Google's MnasNet (MnasNet: Platform-Aware Neural Architecture Search for Mobile) uses reinforcement learning to propose an automated neural architecture search method for resource-constrained mobile CNNs. MnasNet uses the SE block.

  1. The attention mechanism can be understood as giving more "attention" to the most meaningful parts;
  2. Attention mechanisms have been widely applied in sequence learning, image understanding and localization, image captioning, and lip reading;
  3. The block in this paper focuses on attention along the channel dimension.

The first step is the Squeeze operation. We compress the features along the spatial dimensions, turning each two-dimensional feature channel into a single real number. This number has, in a sense, a global receptive field, and the output dimension matches the number of input feature channels. It characterizes the global distribution of responses over the feature channels and lets layers close to the input obtain a global receptive field, which is very useful in many tasks.

The second step is the Excitation operation. It is a gating mechanism similar to the gates in recurrent neural networks. A weight is generated for each feature channel via parameters W, where W is learned to explicitly model the correlations between feature channels.

Finally, there is the Reweight operation. We treat the weights output by the Excitation as the importance of each feature channel after feature selection, and then multiply each channel of the original features by its weight.

An SE network can be generated simply by stacking a collection of SE building blocks. SE blocks can also be used as direct replacements for the original blocks at any depth in an architecture. However, while the template of the building block is generic, the role it performs adapts to the needs of the network at different depths. In the early layers, it learns to excite informative features in a class-agnostic way, strengthening the quality of the shared low-level representations. In later layers, SE blocks become increasingly specialized and respond to different inputs in a highly class-specific manner. The benefits of the feature recalibration performed by SE blocks can therefore accumulate across the entire network. SE blocks are simple in design and can be used directly with existing state-of-the-art architectures, whose modules can be strengthened by direct replacement with their SE counterparts.
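
As a sketch of such a drop-in replacement, here is one way to insert an SE block into a ResNet-style residual unit in PyTorch; the class name and the simplifying assumptions (stride 1, equal input and output channels) are mine, and SEBlock refers to the earlier sketch:

```python
import torch
import torch.nn as nn

class SEBasicBlock(nn.Module):
    """ResNet-style basic block with an SE block applied to the residual branch
    before the identity addition (simplified: stride 1, equal in/out channels)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.se = SEBlock(channels, reduction)   # SEBlock from the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)            # recalibrate the channels of the residual branch
        return self.relu(out + x)     # then add the identity shortcut
```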

The Conv2D process was described by the formula above, and the convolution kernel was interpreted from the channel-dimension perspective.

The Conv2D operation blends spatial information with channel information. The purpose of this paper is to improve the network's sensitivity to information along the channel dimension. The corresponding operations are Squeeze and Excitation.

Problem: U does not make good use of contextual information outside the local receptive field. Solution: use global pooling to compress the spatial information into a channel descriptor, i.e., squeeze the data down to the channel dimension. This operation can be regarded as producing an image descriptor, of the kind common in feature engineering.

  1. To exploit the information aggregated across channels, add the Excitation operation;
  2. To achieve this, two criteria need to be met: (1) the operation must be able to learn a nonlinear relationship between channels; (2) it must allow multiple channels to be "emphasized" at the same time (a non-mutually-exclusive gating);
  3. A sigmoid gating mechanism is used to satisfy both.

The figure above shows the activation-function selection experiment. Conclusion: sigmoid works best.

Integration with other architectures: this paper integrates the SE block with ResNet and Inception, as shown in the following two figures.

Architecture of the original Inception module (left) and the SE-Inception module (right).

Architecture of the original Residual module (left) and the SE-ResNet module (right).

In exchange for this slight extra computational burden, SE-ResNet-50 outperforms ResNet-50 in accuracy and in fact approaches the accuracy of the deeper ResNet-101 network, which requires ~7.58 GFLOPs.

The total number of additional weight parameters introduced by the FC layers is given by the following formula:

(2/r) · Σ_{s=1..S} N_s · C_s^2

where r is the reduction ratio, S is the number of stages (a stage is the set of blocks operating on feature maps of a common spatial dimension), C_s is the output channel dimension of stage s, and N_s is the number of repeated blocks in stage s (when bias terms are used in the FC layers, the extra parameters and computational cost are usually negligible). SE-ResNet-50 introduces roughly 2.5 million additional parameters.

  1. One block adds 2·C_s^2/r parameters (the two FC layers of shapes C_s×(C_s/r) and (C_s/r)×C_s);
  2. A stage contains N_s such blocks;
  3. A model has S stages, which gives the formula above.
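
A quick back-of-the-envelope check of this formula for SE-ResNet-50 (r = 16, with the per-stage block counts and output channel widths assumed from the standard ResNet-50 configuration):

```python
# (N_s, C_s) for ResNet-50's four stages, reduction ratio r = 16
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]

extra = sum(2 * n * c * c / r for n, c in stages)   # (2/r) * sum_s N_s * C_s^2
print(f"{extra / 1e6:.2f} M extra parameters")      # ~2.51 M, matching the ~2.5 M quoted above
```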

Because it is flexible, the SE block can be inserted into a CNN in several ways. Three variants are considered: (1) the SE-PRE block, in which the SE block is moved before the residual unit; (2) the SE-POST block, in which the SE unit is moved after the summation with the identity branch (after the ReLU); and (3) the SE-Identity block, in which the SE unit is placed on the identity connection, in parallel with the residual unit. These variants are shown in Figure 5, and the performance of each variant is reported in Table 14. We observed similar performance for SE-PRE, SE-Identity, and the proposed SE block.
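
The placement variants can be summarized in a few lines of illustrative forward-pass code; the function names are mine, branch stands for the residual branch, and se is an SE block such as the sketch above:

```python
import torch

def se_standard(x, branch, se):
    return torch.relu(se(branch(x)) + x)    # proposed SE block: recalibrate the branch, then add identity

def se_pre(x, branch, se):
    return torch.relu(branch(se(x)) + x)    # SE-PRE: SE applied before the residual unit

def se_post(x, branch, se):
    return se(torch.relu(branch(x) + x))    # SE-POST: SE applied after the summation (post ReLU)

def se_identity(x, branch, se):
    return torch.relu(branch(x) + se(x))    # SE-Identity: SE on the identity branch, in parallel
```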

SE-ResNet complete architecture ✊

(Left) ResNet-50. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a 32×4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets, and the number of stacked blocks in a stage is shown outside. The inner brackets following "fc" indicate the output dimensions of the two fully connected layers in an SE module.

Experiments

Discussion points

  1. Horizontal comparison across architectures

Lower numbers (error rates) are better.

  2. Adjusting the reduction ratio

The reduction ratio r controls the number of neurons in the first FC (dense) layer.

The paper recommends r = 16

  3. GAP vs. GMP (global average pooling vs. global max pooling)

The results show that average pooling works better.

  4. Comparison of different activation functions in the Excitation stage (sigmoid works best, as noted above)

  5. Different positions of the SE block within the residual unit (SE-PRE, SE-POST, SE-Identity)

The results were pretty much the same.

  6. SE blocks at different stages of ResNet

Inserting SE blocks at every stage works best, and they help more in the deeper stages than in the shallow ones.

  7. Is the Squeeze operation necessary?

Yes, the Squeeze operation is necessary.

  8. Exploring the role of the Excitation

The excitations in early layers are more general (class-agnostic), while those in later layers are more class-specific; SE_5_2 is an inflection point.

Removing the SE blocks in the last stage greatly reduces the parameter count without affecting the model much.
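
As a rough check on this claim, using the same formula and r = 16 as above, most of the extra parameters of SE-ResNet-50 indeed sit in its final stage:

```python
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]        # (N_s, C_s) for ResNet-50
total = sum(2 * n * c * c / r for n, c in stages)
last_stage = 2 * 3 * 2048 ** 2 / r                          # SE parameters in the final stage only
print(f"{last_stage / 1e6:.2f} M of {total / 1e6:.2f} M")   # ~1.57 M of ~2.51 M
```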

conclusion

SENet learns a weight (importance score) for each channel of a convolutional layer, and it combines well with other networks (VGG, ResNet).

Compared with increasing model width (width in WRN, cardinality in ResNeXt) or depth, the SE block's channel reweighting adds fewer parameters, requires less computation, and achieves a better improvement.

Finally, a happy Mid-Autumn Festival!