The core of the ESPNet series is the dilated convolution pyramid: each branch uses a different dilation rate, so multi-scale features can be fused without increasing the parameter count. Compared with depthwise separable convolution, the depthwise separable dilated convolution pyramid is even more cost-effective. In addition, HFF's multi-scale feature fusion method is also worth borrowing.

Source: the WeChat public account "Xiao Fei's Algorithm Engineering Notes"

ESPNet


ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation

  • Paper: Arxiv.org/abs/1803.06…
  • Code: Github.com/sacmehta/ES…

Introduction

ESPNet is a lightweight network for semantic segmentation. Its core is the ESP module, which combines a point-wise convolution and a dilated convolution pyramid: the former reduces computational complexity, while the latter resamples features over multiple effective receptive fields. The ESP module is more efficient than other convolution factorization methods (MobileNet/ShuffleNet). ESPNet reaches 112 FPS / 21 FPS / 9 FPS on a GPU / laptop / edge device respectively.

ESP module

The ESP module decomposes the standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions. The point-wise convolution maps the input into a low-dimensional feature space, and the dilated convolution pyramid then uses $K$ groups of $n\times n$ dilated convolutions to resample the low-dimensional features in parallel, where the $k$-th dilated convolution has dilation rate $2^{k-1}$, $k=\{1,\cdots,K\}$. This decomposition greatly reduces the parameters and memory of the ESP module while maintaining a large effective receptive field.

  • Width divider K

For a standard convolution with input dimension $M$, output dimension $N$, and kernel size $n\times n$, the number of parameters to learn is $n^2MN$ and the effective receptive field is $n\times n$. The hyperparameter $K$ adjusts the computational complexity of the ESP module. First, a point-wise convolution reduces the input dimension from $M$ to $\frac{N}{K}$ (reduce). Then the low-dimensional features are split and transformed by the dilated convolution pyramid described above, and finally the outputs of the $K$ groups of dilated convolutions are merged. The ESP module therefore contains $\frac{MN}{K}+\frac{(nN)^2}{K}$ parameters with an effective receptive field of $[(n-1)2^{K-1}+1]^2$, an improvement in both parameter count and receptive field.
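The parameter and receptive-field arithmetic above can be checked with a few lines of plain Python. The sample values ($n=3$, $M=N=128$, $K=4$) are illustrative choices, not numbers from the paper.

```python
# Parameter and receptive-field arithmetic of the ESP decomposition.
# n: kernel size, M: input channels, N: output channels, K: width divider.

def standard_conv_params(n, M, N):
    """Parameters of an n x n standard convolution: n^2 * M * N."""
    return n * n * M * N

def esp_params(n, M, N, K):
    """Point-wise reduce (M -> N/K) plus K dilated n x n branches (N/K -> N/K),
    i.e. MN/K + (nN)^2/K when K divides N."""
    d = N // K                          # channels per pyramid branch
    reduce_params = M * d               # 1x1 point-wise convolution
    pyramid_params = K * n * n * d * d  # K dilated branches
    return reduce_params + pyramid_params

def esp_receptive_side(n, K):
    """Side of the effective receptive field: (n-1) * 2^(K-1) + 1."""
    return (n - 1) * 2 ** (K - 1) + 1

print(standard_conv_params(3, 128, 128))  # 147456
print(esp_params(3, 128, 128, 4))         # 40960 (~3.6x fewer parameters)
print(esp_receptive_side(3, 4))           # 17, i.e. a 17x17 field vs 3x3
```

So with these sample sizes the ESP module uses roughly 3.6x fewer parameters than the standard convolution while enlarging the receptive field from 3x3 to 17x17.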

  • Hierarchical feature fusion (HFF) for de-gridding

The paper finds that although the dilated convolution pyramid brings a larger receptive field, directly concatenating the branch outputs produces strange gridding artifacts, as shown in Figure 2. To solve this, the branch outputs are hierarchically added before concatenation. Compared with adding an extra convolution for post-processing, HFF removes the gridding artifacts without much extra computation. In addition, to aid gradient flow through the network, a shortcut connection from input to output is added to the ESP module.
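The hierarchical addition in HFF can be sketched in a few lines. Plain Python lists stand in for feature maps here; this is a minimal sketch of the fusion order, not the paper's implementation.

```python
# Hierarchical Feature Fusion (HFF) sketch: before concatenating the K dilated
# branches, each branch output is summed with the previous (smaller-dilation)
# fused output, which suppresses the gridding artifacts described above.

def hff_fuse(branch_outputs):
    """branch_outputs: list of K equal-length feature vectors, ordered by
    increasing dilation rate. Returns the concatenation of the hierarchically
    summed branches."""
    fused = [branch_outputs[0]]            # smallest dilation passes through
    for k in range(1, len(branch_outputs)):
        prev = fused[-1]                   # add the previously fused branch
        fused.append([a + b for a, b in zip(branch_outputs[k], prev)])
    # concatenate along the channel axis
    return [v for branch in fused for v in branch]

out = hff_fuse([[1, 1], [2, 2], [4, 4]])
print(out)  # [1, 1, 3, 3, 7, 7]
```

Note that each branch accumulates all coarser branches before concatenation, which is what smooths out the checkerboard-like gridding.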

Relationship with other CNN modules

The paper lists the core modules of several lightweight networks and compares them. The ESP module scores well in terms of parameter count, memory, and receptive field.

ESPNet

Figure 4 shows the evolution of ESPNet. $l$ is the spatial level of the feature map; modules at the same $l$ have feature maps of the same size. The red and green modules are the down-sampling and up-sampling modules respectively, with $\alpha_2=2$ and $\alpha_3=8$.

Experiments

Here are just some of the experiments. For more details, see the paper.

Ablation comparison obtained by replacing the ESP module in the Figure 4d network with alternative modules.

Comparison with other semantic segmentation models.

Conclusion

ESPNet is a lightweight network for semantic segmentation whose core module is designed specifically for segmentation scenarios: a dilated convolution pyramid extracts features at multiple receptive fields while reducing the parameter count.

ESPNetV2


ESPNetv2: A light-weight, Power Efficient, and General Purpose Convolutional Neural Network

  • Paper: Arxiv.org/abs/1811.11…
  • Code: Github.com/sacmehta/ES…

Introduction

Model lightweighting includes three approaches: model compression, model quantization, and lightweight architecture design. The paper designs the lightweight network ESPNetv2; the main contributions are as follows:

  • A general-purpose lightweight architecture that supports both visual and sequential data, i.e. vision tasks and natural language processing tasks.
  • An extension of ESPNet with depthwise dilated separable convolutions, achieving better accuracy with fewer parameters than ESPNet.
  • Experiments showing that ESPNetv2 attains good accuracy with a low parameter count across multiple vision tasks, including image classification, semantic segmentation, and object detection.
  • A cyclic learning rate scheduler that outperforms the usual fixed learning rate schedulers.

Depth-wise dilated separable convolution

Suppose the input is $X\in \mathbb{R}^{W\times H\times c}$, the kernel is $\mathbf{K}\in \mathbb{R}^{n\times n\times c\times \hat{c}}$, and the output is $Y\in \mathbb{R}^{W\times H\times \hat{c}}$. The parameter counts and effective receptive fields of standard convolution, group convolution, depthwise separable convolution, and depthwise dilated separable convolution are shown in Table 1.
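The parameter counts behind Table 1 follow standard formulas and can be reproduced directly. The sample sizes ($n=3$, $c=\hat{c}=128$, $g=4$, $d=4$) are illustrative, not from the paper; biases are omitted.

```python
# Parameter counts for the convolution variants compared in Table 1.
# c: input channels, c_hat: output channels, n: kernel size, g: groups,
# d: dilation rate.

def standard_conv(n, c, c_hat):
    return n * n * c * c_hat

def group_conv(n, c, c_hat, g):
    return n * n * c * c_hat // g       # each group sees c/g input channels

def depthwise_separable(n, c, c_hat):
    return n * n * c + c * c_hat        # depth-wise n x n + point-wise 1 x 1

def receptive_side(n, d):
    """Dilation enlarges the receptive field with no extra parameters."""
    return (n - 1) * d + 1

print(standard_conv(3, 128, 128))        # 147456
print(group_conv(3, 128, 128, 4))        # 36864
print(depthwise_separable(3, 128, 128))  # 17536
print(receptive_side(3, 4))              # 9: a 3x3 kernel covers a 9x9 field
```

The depthwise dilated separable variant has the same parameter count as the depthwise separable one but trades a plain 3x3 field for the enlarged `receptive_side(n, d)` field, which is what the ESP-style pyramid exploits.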

EESP unit

Building on the ESP module, improved with depthwise dilated separable convolutions and grouped point-wise convolutions, the paper proposes the EESP (Extremely Efficient Spatial Pyramid) module. The original structure of the ESP module is shown in Figure 1a. The paper first replaces the point-wise convolution with a grouped point-wise convolution, then replaces each dilated convolution with a depthwise dilated separable convolution, and finally uses HFF to remove gridding artifacts, giving the structure in Figure 1b. This reduces the computational complexity by a factor of $\frac{Md+n^2d^2K}{\frac{Md}{g}+(n^2+d)dK}$, where $K$ is the number of layers in the dilated convolution pyramid and $d=\frac{N}{K}$. Considering that computing $K$ point-wise convolutions independently is equivalent to a single grouped point-wise convolution with $K$ groups, and grouped convolution is more efficient in implementation, the structure is further refined into Figure 1c.
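The complexity ratio above can be evaluated numerically. The sample configuration ($M=N=240$, $g=K=4$, $n=3$) is one plausible setting for illustration; the exact magnitude depends on the chosen sizes.

```python
# EESP complexity reduction from the ratio above, with d = N/K channels per
# branch. Numerator: ESP multiply-adds (point-wise + dilated pyramid);
# denominator: EESP with grouped point-wise convolution (g groups) and
# depth-wise dilated branches.

def eesp_speedup(M, N, K, n, g):
    d = N // K
    esp_cost = M * d + n * n * d * d * K           # original ESP module
    eesp_cost = M * d // g + (n * n + d) * d * K   # EESP module
    return esp_cost / eesp_cost

print(round(eesp_speedup(M=240, N=240, K=4, n=3, g=4), 1))  # 7.1
```

With these sizes the EESP module is roughly 7x cheaper than the corresponding ESP module, most of the saving coming from replacing the dense dilated convolutions with depthwise ones.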

In order to learn multi-scale features more efficiently, this paper proposes a Strided EESP module (with shortcut connection to an input image) with the following improvements:

  • Change the depthwise dilated separable convolutions to stride=2 versions.
  • Add average pooling to the module's original shortcut.
  • Replace the element-wise addition with a concatenation operation to increase the feature dimension of the output.
  • To prevent information loss during downsampling, add a shortcut connected to the input image: it uses multiple pooling operations to match its spatial size to the module's output feature map, then applies two convolutions to extract features and adjust the dimensions, and finally performs an element-wise addition.

Network architecture

The network structure of ESPNetv2 is shown in Table 2. Each convolution in the EESP module is followed by a BN layer and PReLU, except that the PReLU after the module's final group convolution is applied after the element-wise addition. $g=K=4$, and the rest is similar to ESPNet.

Cyclic learning rate scheduler

For image classification training, the paper designs a cyclic learning rate scheduler. Within each cycle of length $T$, the learning rate at step $t$ is computed as:

$\eta_{max}$ and $\eta_{min}$ are the maximum and minimum learning rates respectively, and $T$ is the cycle period.

The visualization of the cyclic learning rate scheduler is shown in Figure 4.
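A cyclic schedule built from the $\eta_{max}$ / $\eta_{min}$ / $T$ quantities above can be sketched as follows. The cosine shape within each cycle is an assumption for illustration; the paper defines its own cycle formula, which may differ.

```python
import math

# Cyclic learning-rate schedule with warm restarts: within each cycle of
# length T, the rate decays from eta_max to eta_min, then jumps back up.
# The cosine decay shape here is an illustrative assumption.

def cyclic_lr(t, eta_max, eta_min, T):
    """Learning rate at integer step t."""
    phase = (t % T) / T  # position within the current cycle, in [0, 1)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * phase))

print(cyclic_lr(0, 0.1, 0.001, 50))   # cycle start: eta_max = 0.1
print(cyclic_lr(25, 0.1, 0.001, 50))  # mid-cycle: halfway between the two
print(cyclic_lr(50, 0.1, 0.001, 50))  # warm restart: back to 0.1
```

The periodic restarts are what distinguish this from a plain decaying schedule: each cycle briefly raises the rate again, which the paper reports works better than a fixed learning rate.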

Experiments

Image classification performance comparison.

Semantic segmentation performance comparison.

Object detection performance comparison.

Text generation performance comparison.

Conclusion

ESPNetv2 combines depthwise separable convolution design with ESPNet to make the model even lighter. Together with richer feature fusion, it extends to a variety of tasks and performs very well.

Conclusion


The core of the ESPNet series is the dilated convolution pyramid: each branch uses a different dilation rate, so multi-scale features can be fused without increasing the parameter count. Compared with depthwise separable convolution, the depthwise separable dilated convolution pyramid is even more cost-effective. In addition, HFF's multi-scale feature fusion method is also worth borrowing.




