
Abstract

In this article, we describe a new mobile architecture, MobileNetV2, that improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks, as well as across a spectrum of different model sizes. We also describe an efficient way to apply these mobile models to object detection in a novel framework we call SSDLite. In addition, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3, which we call Mobile DeepLabv3.

The architecture is based on an inverted residual structure, where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. In addition, we find it important to remove non-linearities in the narrow layers in order to maintain representational power. We show that this improves performance and provide the intuition that led to this design.

Finally, our approach allows the input/output domains to be decoupled from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet [1] classification, COCO object detection [2], and VOC image segmentation [3]. We evaluate the trade-offs between accuracy and the number of operations measured by multiply-adds (MAdds), as well as actual latency and the number of parameters.

1. Introduction

Neural networks have revolutionized many areas of machine intelligence, providing superhuman accuracy for challenging image recognition tasks. However, the drive to improve accuracy often comes at a cost: modern state-of-the-art networks require massive computing resources that exceed the capabilities of many mobile and embedded applications.

This paper introduces a new neural network architecture tailored specifically for mobile and resource-constrained environments. Our network pushes the state of the art for mobile-tailored computer vision models by significantly reducing the number of operations and the amount of memory required, while retaining the same accuracy.

Our main contribution is a novel layer module: the inverted residual with linear bottleneck. The module takes a low-dimensional compressed representation as input, which is first expanded to a higher dimension and filtered with a lightweight depthwise convolution. Features are subsequently projected back to a low-dimensional representation with a linear convolution. The official implementation is available in [4] as part of the TensorFlow-Slim model library.
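As a rough, unofficial sketch (not the TensorFlow-Slim reference code mentioned above), the expand, depthwise filter, and linear-project sequence can be written in Keras roughly as follows; the function name and the default `expansion=6` are placeholders:

```python
import tensorflow as tf

def inverted_residual(x, out_channels, stride=1, expansion=6):
    """Sketch of an inverted residual block: expand, depthwise filter, linearly project."""
    in_channels = x.shape[-1]
    hidden = in_channels * expansion

    # 1x1 expansion to a higher-dimensional representation, with ReLU6
    y = tf.keras.layers.Conv2D(hidden, 1, use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU(max_value=6.0)(y)

    # 3x3 lightweight depthwise filtering
    y = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU(max_value=6.0)(y)

    # 1x1 linear projection back to the narrow bottleneck (no non-linearity)
    y = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)

    # Shortcut between the thin bottlenecks when shapes allow it
    if stride == 1 and in_channels == out_channels:
        y = tf.keras.layers.Add()([x, y])
    return y
```

Note that the projection convolution intentionally has no activation; this is the "linear bottleneck" discussed in Section 3.2.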

This module can be efficiently implemented using standard operations in any modern framework and allows our models to beat state-of-the-art approaches along multiple performance points on standard benchmarks. Furthermore, this convolutional module is particularly suitable for mobile designs, because it allows a significant reduction of the memory footprint needed during inference by never fully materializing large intermediate tensors. This reduces the need for main memory access in many embedded hardware designs that provide small amounts of very fast software-controlled cache memory.

2. Related work

Tuning deep neural architectures to strike an optimal balance between accuracy and performance has been an area of active research over the last several years. Both manual architecture search and improvements in training algorithms, carried out by numerous teams, have led to dramatic improvements over early designs such as AlexNet [5], VGGNet [6], GoogLeNet [7], and ResNet [8]. Recently there has been much progress in algorithmic architecture exploration, including hyperparameter optimization [9, 10, 11], various methods of network pruning [12, 13, 14, 15, 16, 17], and connectivity learning [18, 19]. A substantial amount of work has also been dedicated to changing the connectivity structure of the internal convolutional blocks, such as in ShuffleNet [20], or introducing sparsity [21] and others [22].

Recently, [23, 24, 25, 26] opened up a new direction of bringing optimization methods, including genetic algorithms and reinforcement learning, to architectural search. However, one drawback is that the resulting networks end up very complex. In this article, we pursue the goal of developing better intuition about how neural networks operate and using it to guide the simplest possible network design. Our approach should be seen as complementary to the one described in [23] and related work. In this vein, our approach is similar to those taken in [20, 22] and allows further performance improvements while providing a glimpse into its internal operation. Our network design is based on MobileNetV1 [27]. It retains its simplicity and does not require any special operators, while significantly improving its accuracy, achieving state of the art on multiple image classification and detection tasks for mobile applications.

3. Preliminaries, discussion, and intuition

3.1 Depthwise separable convolutions

Depthwise separable convolutions are a key building block for many efficient neural network architectures [27, 28, 20], and we use them in the present work as well. The basic idea is to replace a full convolutional operator with a factorized version that splits the convolution into two separate layers. The first layer, called a depthwise convolution, performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a 1×1 convolution, called a pointwise convolution, which is responsible for building new features by computing linear combinations of the input channels.
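As a minimal sketch of this two-layer factorization (assuming the usual MobileNet convention of batch normalization and ReLU6 after each of the two layers, which is not stated in the paragraph above):

```python
import tensorflow as tf

def depthwise_separable_conv(x, out_channels, stride=1):
    """Depthwise filtering per channel, then a 1x1 pointwise combination of channels."""
    # Depthwise: one 3x3 filter applied to each input channel independently
    y = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU(max_value=6.0)(y)
    # Pointwise: 1x1 convolution builds new features as linear combinations of channels
    y = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU(max_value=6.0)(y)
    return y
```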

A standard convolution takes an $h_i \times w_i \times d_i$ input tensor $L_i$ and applies a convolutional kernel $K \in \mathbb{R}^{k \times k \times d_i \times d_j}$ to produce an $h_i \times w_i \times d_j$ output tensor $L_j$. The computational cost of a standard convolutional layer is $h_i \cdot w_i \cdot d_i \cdot d_j \cdot k \cdot k$.

Depthwise separable convolutions are a drop-in replacement for standard convolutional layers. Empirically they work almost as well as regular convolutions but only cost:


$$h_{i} \cdot w_{i} \cdot d_{i}\left(k^{2}+d_{j}\right) \tag{1}$$

which is the sum of the depthwise and the 1×1 pointwise convolutions. Effectively, depthwise separable convolution reduces computation compared to traditional layers by a factor of almost $k^{2}$. MobileNetV2 uses $k = 3$ (3×3 depthwise separable convolutions), so the computational cost is 8 to 9 times smaller than that of standard convolutions, at only a small reduction in accuracy [27].
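As a quick sanity check of the roughly $k^{2}$ saving, the snippet below plugs an arbitrary example shape (not one taken from the paper) into the two cost formulas:

```python
def standard_cost(h, w, d_in, d_out, k):
    # h * w * d_in * d_out * k * k multiply-adds for a standard convolution
    return h * w * d_in * d_out * k * k

def separable_cost(h, w, d_in, d_out, k):
    # h * w * d_in * (k^2 + d_out): depthwise plus 1x1 pointwise, as in equation (1)
    return h * w * d_in * (k * k + d_out)

# Hypothetical layer: 56x56 feature map, 64 -> 128 channels, 3x3 kernel
h, w, d_in, d_out, k = 56, 56, 64, 128, 3
print(standard_cost(h, w, d_in, d_out, k) / separable_cost(h, w, d_in, d_out, k))
# ~8.4x fewer multiply-adds, close to the k^2 = 9 upper bound
```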

3.2. Linear bottlenecks

Consider a deep neural network consisting of $n$ layers $L_i$, each of which has an activation tensor of dimensions $h_i \times w_i \times d_i$. In this section we discuss the basic properties of these activation tensors, which we treat as containers of $h_i \times w_i$ "pixels" with $d_i$ dimensions. Informally, for an input set of real images, we say that the set of layer activations (for any layer $L_i$) forms a "manifold of interest." It has long been assumed that manifolds of interest in neural networks could be embedded in low-dimensional subspaces. In other words, when we look at all individual d-channel pixels of a deep convolutional layer, the information encoded in those values actually lies in some manifold, which in turn is embeddable into a low-dimensional subspace.

At first glance, such a fact could be captured and exploited by simply reducing the dimensionality of a layer, thus reducing the dimensionality of the operating space. This has been successfully exploited by MobileNetV1 [27], which effectively trades off computation and accuracy via a width multiplier parameter, and has been incorporated into efficient model designs of other networks as well [20]. Following that intuition, the width multiplier approach allows one to reduce the dimensionality of the activation space until the manifold of interest spans the entire space. However, this intuition breaks down when we recall that deep convolutional neural networks actually have non-linear per-coordinate transformations, such as ReLU. For example, ReLU applied to a line in 1-D space produces a "ray", whereas in $\mathbb{R}^{n}$ space it generally results in a piecewise linear curve with n joints.

It is easy to see that, in general, if a result of a layer transformation ReLU(Bx) has a non-zero volume S, the points mapped to the interior of S are obtained via a linear transformation B of the input, thus indicating that the part of the input space corresponding to the full-dimensional output is limited to a linear transformation. In other words, deep networks only have the power of a linear classifier on the non-zero-volume part of the output domain. We refer to the supplementary material for a more formal statement.

On the other hand, when ReLU collapses a channel, it inevitably loses information in that channel. However, if we have many channels and there is structure in the activation manifold, the information might still be preserved in the other channels. In the supplementary material, we show that if the input manifold can be embedded into a significantly lower-dimensional subspace of the activation space, then the ReLU transformation preserves the information while introducing the needed complexity into the set of expressible functions.

To summarize, we highlight two properties that indicate that the manifold of interest should lie in a low-dimensional subspace of the higher-dimensional activation space:

  1. If the manifold of interest remains of non-zero volume after the ReLU transformation, it corresponds to a linear transformation.

  2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.

These two insights provide us with empirical hints for optimizing existing neural architectures: assuming the manifold of interest is low-dimensional, we can capture this by inserting linear bottleneck layers into the convolutional blocks. Experimental evidence suggests that using linear layers is crucial, as it prevents non-linearities from destroying too much information. In Section 6, we show empirically that using non-linear layers in bottlenecks indeed hurts performance by several percentage points, further validating our hypothesis. We note that similar findings are reported in [29], where removing the non-linearity from the input of a traditional residual block improved performance on the CIFAR dataset.
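To make the second property concrete, here is a small toy experiment in the spirit of the discussion above (the point set, matrix sizes, and use of the pseudo-inverse are illustrative choices, not the paper's exact setup): a 2-D manifold is embedded into $n$-dimensional space by a random matrix, passed through ReLU, and projected back; the reconstruction degrades when $n$ is small and is largely preserved when $n$ is large.

```python
import numpy as np

rng = np.random.default_rng(0)
# A 2-D "manifold of interest": points on a spiral
t = np.linspace(0, 4 * np.pi, 500)
X = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)        # shape (500, 2)

for n in (2, 3, 15, 30):
    T = rng.normal(size=(2, n))          # random embedding into an n-dim activation space
    Y = np.maximum(X @ T, 0.0)           # ReLU applied in the high-dimensional space
    X_rec = Y @ np.linalg.pinv(T)        # project back with the pseudo-inverse
    err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
    print(f"n={n:3d}  relative reconstruction error = {err:.3f}")
# Small n collapses part of the manifold; large n tends to preserve most of the information.
```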

For the rest of this article, we will use bottleneck convolution. We refer to the ratio between the size of the input bottleneck and the internal size as the expansion ratio.

3.3 Inverted residuals

The bottleneck blocks appear similar to residual blocks, where each block contains an input followed by several bottlenecks and then an expansion [8]. However, inspired by the intuition that the bottlenecks actually contain all the necessary information, while the expansion layer acts merely as an implementation detail that accompanies a non-linear transformation of the tensor, we use shortcuts directly between the bottlenecks. Figure 3 provides a schematic visualization of the difference in the designs. The motivation for inserting shortcuts is similar to that of classical residual connections: we want to improve the ability of gradients to propagate across multiple layers. However, the inverted design is considerably more memory efficient (see Section 5) and works slightly better in our experiments.

The basic implementation structure of the bottleneck convolution, together with its running time and parameter count, is shown in Table 1. For a block of size $h \times w$, expansion factor $t$, and kernel size $k$, with $d'$ input channels and $d''$ output channels, the total number of multiply-adds required is $h \cdot w \cdot d' \cdot t\left(d' + k^{2} + d''\right)$. Compared with (1), this expression has an extra term, since we have an additional 1×1 convolution; however, the nature of our networks allows us to utilize much smaller input and output dimensions. In Table 3, we compare the sizes needed for each resolution between MobileNetV1, MobileNetV2, and ShuffleNet.

3.4 Interpretation of information flow

An interesting property of our architecture is that it provides a natural separation between the input/output domains of the building blocks (the bottleneck layers) and the layer transformation, a non-linear function that converts input to output. The former can be seen as the capacity of the network at each layer, whereas the latter can be seen as its expressiveness. This is in contrast with traditional convolutional blocks, both regular and separable, where expressiveness and capacity are tangled together and are functions of the output layer depth.

In particular, in our case, when the inner layer depth is zero, the underlying convolution is the identity function thanks to the shortcut connection. When the expansion ratio is smaller than 1, this becomes a classical residual convolutional block [8, 30]. However, for our purposes, we show that an expansion ratio greater than 1 is the most useful.

This interpretation allows us to study the expressiveness of the network separately from its capacity, and we believe it is necessary to further explore this separation to better understand network properties.

4. Model architecture

We now describe our architecture in detail. As discussed in the previous section, the basic building block is a bottleneck depthwise separable convolution with residuals. The detailed structure of this block is shown in Table 1. The architecture of MobileNetV2 contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers described in Table 2. We use ReLU6 as the non-linearity because of its robustness when used with low-precision computation [27]. We always use a kernel size of 3 × 3, as is standard for modern networks, and utilize dropout and batch normalization during training.

We use a constant expansion rate throughout the network except for the first layer. In our experiments, we found that expansion rates between 5 and 10 resulted in nearly identical performance curves, with smaller networks performing better at slightly lower expansion rates and larger networks performing slightly better at larger expansion rates.

For all our main experiments, we use an expansion factor of 6 applied to the size of the input tensor. For example, for a bottleneck layer that takes a 64-channel input tensor and produces a tensor with 128 channels, the intermediate expansion layer then has 64 × 6 = 384 channels.

Trade-off hyperparameters: As in [27], we tailor our architecture to different performance points by using the input image resolution and width multiplier as tunable hyperparameters, which can be adjusted depending on the desired accuracy/performance trade-off. Our primary network (width multiplier 1, 224 × 224) has a computational cost of 300 million multiply-adds and uses 3.4 million parameters. We explore the performance trade-offs for input resolutions from 96 to 224 and width multipliers from 0.35 to 1.4. The network computational cost ranges from 7 multiply-adds to 585M MAdds, while the model size varies between 1.7M and 6.9M parameters.
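For illustration only, a width multiplier is typically applied by scaling each layer's channel count; the rounding-to-a-multiple-of-8 convention in the hypothetical helper below is a common implementation detail, not something specified in the text above:

```python
def scale_channels(channels, width_multiplier, divisor=8):
    """Hypothetical helper: scale a layer's channel count by the width multiplier."""
    scaled = max(divisor, int(channels * width_multiplier + divisor / 2) // divisor * divisor)
    # Avoid shrinking the layer by more than ~10% due to rounding
    if scaled < 0.9 * channels * width_multiplier:
        scaled += divisor
    return scaled

# e.g. a 32-filter layer under width multipliers 0.35, 1.0, and 1.4
print([scale_channels(32, m) for m in (0.35, 1.0, 1.4)])  # [16, 32, 48]
```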

A small implementation difference from [27] is that, for multipliers less than 1, we apply the width multiplier to all layers except the last convolution layer. This improves the performance of smaller models.

5. Implementation notes

5.1 Memory-efficient inference

The inverted residual bottleneck layers allow a particularly memory-efficient implementation, which is very important for mobile applications. Standard efficient implementations of inference, such as those in TensorFlow [31] or Caffe [32], build a directed acyclic compute hypergraph G, consisting of edges representing the operations and nodes representing the intermediate computation tensors. The computation is scheduled so as to minimize the total number of tensors that need to be stored in memory. In the most general case, it searches over all plausible computation orders Σ(G) and picks the one that minimizes


$$M(G)=\min _{\pi \in \Sigma(G)} \max _{i \in 1 . . n}\left[\sum_{A \in R(i, \pi, G)}|A|\right]+\operatorname{size}\left(\pi_{i}\right)$$

where $R(i, \pi, G)$ is the list of intermediate tensors connected to any of the nodes $\pi_{i} \ldots \pi_{n}$, $|A|$ denotes the size of tensor $A$, and $\operatorname{size}(i)$ is the total amount of memory needed for internal storage during operation $i$.

For graphs with only trivial parallel structure (such as residual connections), there is only one non-trivial feasible computation order, and therefore the total amount of memory needed, and a bound on it, for inference on the compute graph G simplifies to:


$$M(G)=\max _{o p \in G}\left[\sum_{A \in \mathrm{op}_{\text {inp }}}|A|+\sum_{B \in \mathrm{op}_{\text {out }}}|B|+|o p|\right] \tag{2}$$

Or to restate, the amount of memory is simply the maximum total size of the combined inputs and outputs across all operations. In what follows, we show that if we treat a bottleneck residual block as a single operation (and treat the inner convolution as a disposable tensor), the total amount of memory is dominated by the size of the bottleneck tensors, rather than by the size of the tensors that are internal to the bottleneck (which are much larger).
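As a hedged illustration of equation (2), rather than any framework's real scheduler, the snippet below evaluates the bound for a toy list of operations described only by their input/output tensor sizes (all numbers are made up):

```python
def memory_bound(ops):
    """Equation (2): max over ops of (sum of input sizes + sum of output sizes + internal size)."""
    return max(sum(op["inputs"]) + sum(op["outputs"]) + op.get("internal", 0) for op in ops)

# Toy sequential graph; sizes are element counts, not figures from the paper
ops = [
    {"inputs": [224 * 224 * 3],  "outputs": [112 * 112 * 32]},   # stem convolution
    {"inputs": [112 * 112 * 32], "outputs": [112 * 112 * 16]},   # bottleneck block treated as one op
    {"inputs": [112 * 112 * 16], "outputs": [56 * 56 * 24]},
]
print(memory_bound(ops))  # the single largest input+output footprint dominates the bound
```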

The bottleneck residual block operator $\mathcal{F}(x)$ shown in Figure 3b can be expressed as a composition of three operators $\mathcal{F}(x)=[A \circ \mathcal{N} \circ B] x$, where $A$ is a linear transformation $A: \mathcal{R}^{s \times s \times k} \rightarrow \mathcal{R}^{s \times s \times n}$, $\mathcal{N}$ is a non-linear per-channel transformation $\mathcal{N}: \mathcal{R}^{s \times s \times n} \rightarrow \mathcal{R}^{s^{\prime} \times s^{\prime} \times n}$, and $B$ is a linear transformation to the output domain $B: \mathcal{R}^{s^{\prime} \times s^{\prime} \times n} \rightarrow \mathcal{R}^{s^{\prime} \times s^{\prime} \times k^{\prime}}$.

For our networks $\mathcal{N}=\operatorname{ReLU} 6 \circ \text{dwise} \circ \operatorname{ReLU} 6$, but the results apply to any per-channel transformation. Suppose the size of the input domain is $|x|$ and the size of the output domain is $|y|$; then the memory required to compute $\mathcal{F}(x)$ can be as low as $\left|s^{2} k\right|+\left|s^{\prime 2} k^{\prime}\right|+O\left(\max \left(s^{2}, s^{\prime 2}\right)\right)$.

The algorithm is based on the fact that the inner tensor $\mathcal{I}$ can be represented as a concatenation of $t$ tensors of size $n/t$ each, and our function can then be represented as


$$\mathcal{F}(x)=\sum_{i=1}^{t}\left(A_{i} \circ N \circ B_{i}\right)(x)$$

By accumulating the sum, we only require one intermediate block of size $n/t$ to be kept in memory at all times. Using $n = t$, we end up having to keep only a single channel of the intermediate representation at all times. The two constraints that enable us to use this trick are (a) the fact that the inner transformation (which includes the non-linearity and the depthwise convolution) is per-channel, and (b) the consecutive non-per-channel operators have a significant ratio of input size to output size. For most traditional neural networks, this trick would not produce a significant improvement.
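A rough numpy sketch of this t-way split, with dense matrices standing in for the 1×1 convolutions, an elementwise ReLU6 in place of the full inner transformation, and arbitrary toy sizes (this is not the paper's implementation):

```python
import numpy as np

def split_bottleneck(x, B, A, t):
    """Compute F(x) = sum_i (A_i o N o B_i)(x), holding only n/t inner channels at a time.

    x: (pixels, k) input; B: (k, n) expansion; A: (n, k_out) projection.
    N is just ReLU6 here; the depthwise step is omitted for brevity.
    """
    n = B.shape[1]
    step = n // t
    out = np.zeros((x.shape[0], A.shape[1]))
    for i in range(t):
        sl = slice(i * step, (i + 1) * step)
        inner = np.clip(x @ B[:, sl], 0.0, 6.0)   # only n/t inner channels live in memory here
        out += inner @ A[sl, :]                   # accumulate the partial projection
    return out

# Toy check against the unsplit computation (sizes are arbitrary)
rng = np.random.default_rng(1)
x, B, A = rng.normal(size=(10, 8)), rng.normal(size=(8, 48)), rng.normal(size=(48, 16))
full = np.clip(x @ B, 0.0, 6.0) @ A
assert np.allclose(split_bottleneck(x, B, A, t=4), full)
```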

We note that the number of multiply-add operators needed to compute F(X) using a t-way split is independent of t; however, in existing implementations we find that replacing one matrix multiplication with several smaller ones hurts runtime performance due to increased cache misses. We find this approach to be most helpful when t is a small constant between 2 and 5. It significantly reduces the memory requirement, while still allowing one to utilize most of the efficiencies gained by using the highly optimized matrix multiplication and convolution operators provided by deep learning frameworks. It remains to be seen whether special framework-level optimizations could lead to further runtime improvements.

6. Experiments

6.1 ImageNet classification

Training setup: We train our models using TensorFlow [31]. We use the standard RMSPropOptimizer, with both decay and momentum set to 0.9. We use batch normalization after every layer, and the standard weight decay is set to 0.00004. Following the MobileNetV1 [27] setup, we use an initial learning rate of 0.045 and a learning rate decay rate of 0.98 per epoch. We use 16 GPU asynchronous workers and a batch size of 96.
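The numbers above can be wired into a training configuration roughly as follows; this is a sketch under the stated hyperparameters, not the original training code, and `steps_per_epoch` is a placeholder that depends on the dataset and batch size:

```python
import tensorflow as tf

steps_per_epoch = 10000  # placeholder

# Learning rate 0.045, decayed by 0.98 once per epoch
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.045,
    decay_steps=steps_per_epoch,
    decay_rate=0.98,
    staircase=True,
)

# RMSProp with decay (rho) and momentum both set to 0.9
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule, rho=0.9, momentum=0.9)

# The 0.00004 weight decay would typically be added as L2 regularization on the
# convolution kernels, e.g. tf.keras.regularizers.l2(0.00004) on each Conv2D layer.
```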

Results: We compare our networks against MobileNetV1, ShuffleNet, and NASNet-A models. Statistics for a few selected models are shown in Table 4, with the full performance graph shown in Figure 5.

6.2 Object detection

We evaluate and compare the performance of MobileNetV2 and MobileNetV1 as feature extractors [33] for object detection with a modified version of the Single Shot Detector (SSD) [34] on the COCO dataset [2]. We also compare against YOLOv2 [35] and the original SSD (with VGG-16 [6] as the base network) as baselines. We do not compare performance with other architectures such as Faster-RCNN [36] and R-FCN [37], since our focus is on mobile/real-time models.

SSDLite: In this article, we introduce a mobile-friendly variant of the regular SSD. We replace all the regular convolutions with separable convolutions (depthwise followed by a 1×1 projection) in the SSD prediction layers. This design is in line with the overall design of MobileNets and is seen to be much more computationally efficient. We call this modified version SSDLite. Compared to regular SSD, SSDLite dramatically reduces both the parameter count and the computational cost, as shown in Table 5.
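A minimal sketch of one such separable prediction head (a depthwise 3×3 followed by a 1×1 projection); the anchor and output counts below are made-up placeholders, not the actual SSDLite configuration:

```python
import tensorflow as tf

def ssdlite_prediction_head(feature_map, num_anchors, num_outputs_per_anchor):
    """Separable replacement for a regular 3x3 SSD prediction convolution."""
    y = tf.keras.layers.DepthwiseConv2D(3, padding="same", use_bias=False)(feature_map)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU(max_value=6.0)(y)
    # 1x1 projection produces the per-anchor predictions (class scores or box offsets)
    return tf.keras.layers.Conv2D(num_anchors * num_outputs_per_anchor, 1)(y)

# e.g. a box-regression head: 6 anchors x 4 box coordinates per spatial location
inputs = tf.keras.Input(shape=(10, 10, 96))
boxes = ssdlite_prediction_head(inputs, num_anchors=6, num_outputs_per_anchor=4)
```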

For MobileNetV1, we follow the setup in [33]. For MobileNetV2, the first layer of SSDLite is attached to the expansion of layer 15 (with an output stride of 16). The second and the rest of the SSDLite layers are attached on top of the last layer (with an output stride of 32). This setup is consistent with MobileNetV1, as all layers are attached to feature maps of the same output stride.

6.3 Semantic segmentation

In this section, we compare MobileNetV1 and MobileNetV2 models used as feature extractors with DeepLabv3 [39] for the task of mobile semantic segmentation. DeepLabv3 adopts atrous convolution [40, 41, 42], a powerful tool for explicitly controlling the resolution of computed feature maps, and builds five parallel heads, including (a) the Atrous Spatial Pyramid Pooling module (ASPP) [43], which contains three 3 × 3 convolutions with different atrous rates, (b) a 1×1 convolution head, and (c) image-level features [44]. We denote by output stride the ratio of the input image spatial resolution to the final output resolution, which is controlled by applying atrous convolution appropriately. For semantic segmentation, we usually employ an output stride of 16 or 8 for denser feature maps. We conduct the experiments on the PASCAL VOC 2012 dataset [3], with extra annotated images from [45], and use mIOU as the evaluation metric.
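For intuition about atrous convolution and the ASPP head, here is a heavily simplified Keras sketch (the rates and filter counts are illustrative, and the image-level-features branch is omitted; this is not the DeepLabv3 configuration used in the experiments):

```python
import tensorflow as tf

def tiny_aspp(x, filters=256, rates=(6, 12, 18)):
    """Parallel 1x1 branch plus 3x3 atrous branches with different rates, then concatenate."""
    branches = [tf.keras.layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        # dilation_rate > 1 enlarges the receptive field without shrinking the feature map,
        # which is how the output stride can be kept at 16 or 8
        branches.append(
            tf.keras.layers.Conv2D(filters, 3, padding="same", dilation_rate=r, activation="relu")(x)
        )
    return tf.keras.layers.Concatenate()(branches)
```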

To build a mobile model, we experimented with three design variations: (1) different feature extractors, (2) simplifying the DeepLabv3 heads for faster computation, and (3) different inference strategies for boosting the performance. Our results are summarized in Table 7. We observe that: (a) inference strategies, including multi-scale inputs and adding left-right flipped images, significantly increase the MAdds and are thus not suitable for on-device applications; (b) using output stride = 16 is more efficient than output stride = 8; (c) MobileNetV1 is already a powerful feature extractor and requires only about 4.9 to 5.7 times fewer MAdds than ResNet-101 [8] (e.g., mIOU: 78.56 vs 82.70, and MAdds: 941.9B vs 4870.6B); (d) it is more efficient to build the DeepLabv3 heads on top of the second-to-last feature map of MobileNetV2 than on the original last-layer feature map; doing so attains similar performance but requires about 2.5 times fewer operations than the MobileNetV1 counterpart, because the second-to-last feature map contains 320 channels instead of 1280; and (e) the DeepLabv3 heads are computationally expensive, and removing the ASPP module significantly reduces the MAdds with only a slight performance degradation. At the end of Table 7, we identify a potential candidate for on-device applications (in bold), which attains 75.32% mIOU and requires only 2.75B MAdds.

6.4. Ablation study

Inverted residual connections. The importance of residual connections has been studied extensively [8, 30, 46]. The new result reported in this article is that the shortcut connecting the bottlenecks performs better than the shortcut connecting the expanded layers (see Figure 6b for a comparison).

Importance of linear bottlenecks. Linear bottleneck models are strictly less powerful than models with non-linearities, because the activations can always operate in the linear regime with appropriate changes to biases and scaling. However, our experiments shown in Figure 6a indicate that linear bottlenecks improve performance, providing support for the hypothesis that non-linearity destroys information in low-dimensional space.

7. Conclusions and future work

We describe a very simple network architecture that allows us to build a family of highly efficient mobile models. Our basic building unit has several properties that make it particularly suitable for mobile applications. It allows very memory-efficient inference and relies on standard operations present in all neural frameworks.

For the ImageNet dataset, our architecture improves the state of the art across a wide range of performance points. For the object detection task, our network outperforms state-of-the-art real-time detectors on the COCO dataset in terms of both accuracy and model complexity. Notably, our architecture combined with the SSDLite detection module requires 20 times less computation and 10 times fewer parameters than YOLOv2.

On the theoretical side: the proposed convolutional block has a unique property that allows the network expressiveness (encoded by the expansion layers) to be separated from its capacity (encoded by the bottleneck inputs). Exploring this is an important direction for future research.

Both MobileNet models are trained and evaluated with the open-source TensorFlow Object Detection API [38]. Both models use an input resolution of 320 × 320. We benchmark and compare mAP (the COCO challenge metric), the number of parameters, and the number of multiply-adds. The results are shown in Table 6. MobileNetV2 SSDLite is not only the most efficient model, but also the most accurate of the three. Notably, MobileNetV2 SSDLite is 20× more efficient and 10× smaller, while still outperforming YOLOv2 on the COCO dataset.