The paper proposes introducing a small number of large-kernel convolution layers to substantially enlarge the effective receptive field, narrowing the gap between CNNs and ViTs, especially on downstream tasks. The paper is thorough and also optimizes the practical implementation, making it well worth reading and trying out.
Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs
- Paper: Arxiv.org/abs/2203.06…
- Code: Github.com/megvii-rese…
Introduction
The performance of convolutional networks has been continuously surpassed by Vision Transformers (ViTs) in image classification, pretext-task feature learning, downstream tasks, and so on. It is generally believed that the performance of ViTs mainly comes from the MHSA (multi-head self-attention) mechanism, and many studies have compared the advantages and disadvantages of MHSA and convolution from different perspectives. Explaining the performance difference between ViTs and CNNs is not the purpose of this paper. Instead of studying the difference between MHSA and convolution, it focuses on the paradigm difference between ViTs and CNNs in building long-range spatial connections: in ViTs, MHSA usually works with a large receptive field (≥ 7×7), so each output can aggregate information from a wide region, whereas current CNN practice enlarges the receptive field by stacking small (3×3) convolutions, so each output covers only a small range of information.
Based on the receptive-field difference above, the paper tries to close the performance gap between ViTs and CNNs by introducing a small number of large-kernel convolution layers. It proposes RepLKNet, which builds spatial relations through re-parameterized large convolutions. RepLKNet is adapted from the Swin Transformer backbone, replacing MHSA with large depth-wise convolutions, and performs better than its ViT counterpart. In addition, the visualization in Figure 1 shows that, compared with stacked small convolutions, introducing large kernels significantly enlarges the effective receptive field (ERF) and can even make the network attend to shape features the way ViTs do.
Guidelines of Applying Large Convolutions
Using large convolutions naively leads to a significant drop in both accuracy and speed. Through experiments, the paper summarizes five guidelines for using large kernels efficiently, each accompanied by a remark.
Guideline 1: large depth-wise convolutions can be efficient in practice.
The computational cost of large convolutions is very high: the number of parameters and FLOPs grows quadratically with kernel size. Depth-wise convolution can compensate for this. Changing the kernels of each stage from standard [3,3,3,3] convolutions to depth-wise [31,29,27,13] convolutions only increases FLOPs by about 18.6% and parameters by about 10.4%. However, 3×3 depth-wise convolution has low computational efficiency on parallel devices because its ratio of computation to memory access is low. As the kernel grows, each loaded feature value is reused more times, so the computational density of depth-wise convolution rises accordingly. According to the roofline model, as computational density increases, latency should not grow as fast as the FLOP count.
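To make the scaling concrete, here is a minimal sketch (not from the paper) that counts weight parameters and FLOPs for dense versus depth-wise convolutions, assuming stride 1, "same" padding, and no bias; the 256-channel, 56×56 setting is our own illustrative choice:

```python
def conv_cost(in_ch, out_ch, k, h, w, depthwise=False):
    """Rough weight-parameter and FLOP count for one stride-1, 'same'-padded conv layer."""
    groups = in_ch if depthwise else 1
    params = (in_ch // groups) * out_ch * k * k      # bias ignored
    flops = params * h * w                           # one MAC per kernel weight per output position
    return params, flops

# Illustrative setting: 256 channels, 56x56 feature map.
for name, k, dw in [("dense 3x3", 3, False), ("dense 31x31", 31, False),
                    ("depth-wise 3x3", 3, True), ("depth-wise 31x31", 31, True)]:
    p, f = conv_cost(256, 256, k, 56, 56, depthwise=dw)
    print(f"{name:16s} params={p/1e6:8.3f}M  FLOPs={f/1e9:8.3f}G")
```

Running this shows the quadratic growth with kernel size, and also that a depth-wise 31×31 layer still costs far less than even a dense 3×3 layer of the same width.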
Remark 1
As shown in Table 1, current deep learning frameworks implement depth-wise convolution inefficiently. The paper therefore tried several ways of optimizing the CUDA kernel and finally chose a block-wise (inverse) implicit GEMM algorithm, integrated into the MegEngine framework. Compared with PyTorch, the share of total latency contributed by depth-wise convolution drops from 49.5% to 12.3%, roughly proportional to its share of the FLOPs. For the detailed analysis and implementation, see the article "Why can a 31×31 convolution kernel cost about as much time as a 9×9 one?" (zhuanlan.zhihu.com/p/479182218…)
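For readers who want to get a feel for the latency on their own hardware, a crude PyTorch timing sketch might look like the following (our example, not the paper's benchmark; the MegEngine block-wise implicit GEMM kernel is not exercised here, so absolute numbers will differ):

```python
import torch

def time_depthwise(k, ch=256, hw=56, iters=50):
    """Crudely time a depth-wise conv of kernel size k on the GPU (requires CUDA)."""
    x = torch.randn(32, ch, hw, hw, device="cuda")
    conv = torch.nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch, bias=False).cuda()
    for _ in range(10):                      # warm-up
        conv(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        conv(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per forward pass

for k in (3, 9, 17, 31):
    print(f"k={k:2d}: {time_depthwise(k):.2f} ms")
```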
Guideline 2: identity shortcut is vital especially for networks with very large kernels.
To verify the importance of the identity shortcut for large kernels, MobileNetV2 is used as the baseline, and the depth-wise kernel size is varied with and without the shortcut. As shown in Table 2, with the shortcut, large kernels bring a 0.77% accuracy gain; without it, enlarging the kernel drops accuracy to 53.98%.
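As a rough illustration of the structure being ablated, here is a simplified MobileNetV2-style inverted-bottleneck block with the shortcut made optional; the exact layer configuration, activation, and training setup in the paper's ablation may differ:

```python
import torch
from torch import nn

class LargeKernelBlock(nn.Module):
    """Inverted-bottleneck block with a large depth-wise conv; the identity shortcut is optional."""
    def __init__(self, ch, kernel_size=13, expand=4, use_shortcut=True):
        super().__init__()
        hidden = ch * expand
        self.use_shortcut = use_shortcut
        self.body = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size, padding=kernel_size // 2,
                      groups=hidden, bias=False),                     # large depth-wise conv
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        out = self.body(x)
        return x + out if self.use_shortcut else out   # Guideline 2: keep the identity path

x = torch.randn(1, 64, 56, 56)
print(LargeKernelBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```

Setting `use_shortcut=False` corresponds to the no-shortcut setting compared in Table 2.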
Remark 2
This principle also applies to ViTs. A recent study found that, without shortcut connections, attention in ViTs collapses doubly exponentially with depth, resulting in over-smoothed attention maps. Although the reason for the accuracy drop with large kernels may differ from that in ViTs, such a network likewise struggles to capture local features. The paper therefore argues, with reference to "Residual Networks Behave Like Ensembles of Relatively Shallow Networks", that shortcuts make the model an implicit ensemble of sub-models with different receptive field sizes (small and large receptive fields are accumulated directly), so it can benefit from a larger receptive field without losing the ability to capture small-scale features.
Guideline 3: re-parameterizing with small kernels helps to make up the optimization issue.
In this experiment, the 3×3 kernels in MobileNetV2 are replaced with 9×9 and 13×13 kernels respectively, and structural re-parameterization is used to help training. As shown in Figure 2, the 3×3 kernel is first replaced with a larger one, then a 3×3 depth-wise branch is added in parallel; each branch is followed by BN and the two outputs are summed. After training, the parallel small-kernel branch and its BN are merged into the large kernel, yielding a model without the extra small convolution layer. The overall idea is similar to RepVGG; interested readers can refer to the earlier public-account article "RepVGG: VGG, eternal god!". A code sketch of the merging step is given below.
As shown in Table 3, enlarging the kernel from 9 to 13 causes an accuracy drop, which structural re-parameterization resolves. In the semantic segmentation task, re-parameterization likewise eliminates the degradation caused by the larger kernel.
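The merging of the parallel branches can be sketched as follows (a minimal re-implementation of the general conv+BN fusion and kernel-padding idea, using a made-up 13×13/3×3 depth-wise example; this is not the repository's exact code):

```python
import torch
import torch.nn.functional as F
from torch import nn

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv, returning equivalent (weight, bias)."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def merge_small_into_large(w_large, b_large, w_small, b_small):
    """Zero-pad the small kernel to the large kernel's size and add the two branches."""
    pad = (w_large.shape[-1] - w_small.shape[-1]) // 2
    return w_large + F.pad(w_small, [pad] * 4), b_large + b_small

# Training-time structure: parallel 13x13 and 3x3 depth-wise convs, each followed by BN.
ch = 8
large, bn_l = nn.Conv2d(ch, ch, 13, padding=6, groups=ch, bias=False), nn.BatchNorm2d(ch)
small, bn_s = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False), nn.BatchNorm2d(ch)
for m in (bn_l, bn_s):
    m.eval()                                   # use running statistics

x = torch.randn(2, ch, 32, 32)
y_train = bn_l(large(x)) + bn_s(small(x))

# Inference-time structure: a single 13x13 depth-wise conv with the merged weights.
w, b = merge_small_into_large(*fuse_conv_bn(large, bn_l), *fuse_conv_bn(small, bn_s))
y_deploy = F.conv2d(x, w, b, padding=6, groups=ch)
print(torch.allclose(y_train, y_deploy, atol=1e-5))   # True
```

The check at the end confirms that the deployed single-branch convolution reproduces the output of the training-time two-branch structure.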
Remark 3
ViTs have optimization problems on small datasets, which are usually alleviated by adding a convolution layer in front of attention: for example, adding a 3×3 depth-wise convolution before each self-attention layer, which is similar to the parallel 3×3 branch proposed here. The added convolution pre-introduces translation priors and locality into the ViT, making it easier to optimize on small datasets. A similar phenomenon appears in RepLKNet: when the pre-training dataset grows to 73 million images, structural re-parameterization is no longer needed to aid optimization.
Guideline 4: large convolutions boost downstream tasks much more than ImageNet classification.
As shown in Table 3 above, enlarging the kernel benefits the segmentation task more than the classification task. The results in Table 5 show a similar phenomenon: the improvement on ADE20K is more pronounced. This suggests that even if pre-trained models have similar ImageNet accuracy, their downstream performance can differ significantly.
Remark 4
According to the paper, there are two main reasons for this phenomenon:
- Large kernels significantly enlarge the effective receptive field (ERF), allowing outputs to incorporate more context, which is critical for downstream tasks.
- Large kernels guide the network to learn more shape features. Image classification can be solved with either texture or shape cues, whereas object recognition requires shape information, so models biased toward shape are clearly better suited for downstream tasks. The strong downstream performance of ViTs also benefits from their strong shape bias. By contrast, traditional CNNs pre-trained on ImageNet tend to be biased toward texture.
Guideline 5: large kernel (e.g., 13×13) is useful even on small feature maps (e.g., 7×7).
To verify that large kernels are still useful on small feature maps, the depth-wise convolution in MobileNetV2's final stage (feature map size 7×7) is enlarged to 7×7 and 13×13 for comparison, using the structural re-parameterization suggested by Guideline 3. The results in Table 4 show that, although the feature map is already small, enlarging the kernel still improves performance.
Remark 5
When the kernel becomes larger than the small feature map, the translation invariance of convolution no longer strictly holds: as shown in Figure 3, two adjacent outputs are associated with different portions of the kernel weights. This is in line with the philosophy of ViTs, which gain capacity by relaxing the symmetry prior (i.e., the requirement that outputs at all positions use exactly the same weights). Interestingly, the 2D relative position encoding used in Transformers (which indexes other positions by their offset from the current position) can also be viewed as a depth-wise convolution with a kernel of size (2H−1)×(2W−1), where H and W are the height and width of the feature map. So large kernels not only help to learn the relative positions between features, but can also encode absolute position information thanks to the extra padding (see the paper "On Translation Invariance in CNNs: Convolutional Layers Can Exploit Absolute Spatial Location").
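A tiny numerical check of that equivalence (our illustration, not from the paper): a single-channel depth-wise convolution with a (2H−1)×(2W−1) kernel and (H−1, W−1) padding aggregates every input position with a weight indexed only by the relative offset, which is exactly what a relative position table does.

```python
import torch
import torch.nn.functional as F

H, W = 5, 5
x = torch.randn(1, 1, H, W)
rel = torch.randn(2 * H - 1, 2 * W - 1)   # relative-position table, indexed by the offset (p - i, q - j)

# View 1: depth-wise conv with a (2H-1)x(2W-1) kernel and (H-1, W-1) padding.
y_conv = F.conv2d(x, rel.view(1, 1, 2 * H - 1, 2 * W - 1), padding=(H - 1, W - 1))

# View 2: explicit relative-position aggregation:
# out[i, j] = sum over (p, q) of rel[(p - i) + H - 1, (q - j) + W - 1] * x[p, q]
y_rel = torch.zeros(1, 1, H, W)
for i in range(H):
    for j in range(W):
        for p in range(H):
            for q in range(W):
                y_rel[0, 0, i, j] += rel[(p - i) + H - 1, (q - j) + W - 1] * x[0, 0, p, q]

print(torch.allclose(y_conv, y_rel, atol=1e-5))   # True
```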
RepLKNet: a Large-Kernel Architecture
Based on the above guidelines, the paper proposes RepLKNet, a pure CNN architecture built around large kernels. Since current SOTA small models are still dominated by CNNs, the paper mainly compares against ViTs in the large-model regime.
Architecture Specification
The structure of RepLKNet is shown in Figure 4, and the details of each module are as follows:
- Stem: since RepLKNet is mainly intended for downstream tasks, it needs to capture more detail early in the network. After downsampling with the initial stride-2 3×3 convolution, a 3×3 depth-wise convolution extracts low-level features, followed by a 1×1 convolution and another 3×3 depth-wise convolution for further downsampling.
- Stages 1-4: each stage contains several RepLK Blocks, which use the depth-wise convolution recommended by Guideline 1 and the identity shortcut recommended by Guideline 2. Following Guideline 3, each large depth-wise convolution carries a parallel 5×5 depth-wise convolution for structural re-parameterization. Besides receptive field and spatial feature extraction, a model's representational capacity also depends on the channel dimension; to add non-linearity and cross-channel information exchange, a 1×1 convolution increases the channel dimension before the depth-wise convolution. Referring to the feed-forward network (FFN) used in Transformers and MLP networks, the paper proposes a CNN-style ConvFFN consisting of a shortcut, two 1×1 convolutions, and GELU, with a hidden dimension typically four times the input. Following ViT and Swin, a ConvFFN is placed after each RepLK Block (see the sketch after this list).
- Transition Blocks: placed between stages. A 1×1 convolution first expands the channel dimension, and then 3×3 depth-wise convolution performs 2× downsampling.
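A simplified PyTorch sketch of one RepLK Block followed by a ConvFFN, based on our reading of Figure 4 (the official implementation additionally carries the parallel 5×5 re-parameterization branch, its normalization layout, and other details omitted here; `dw_ratio` is our own knob reflecting the 1.5× widening mentioned below for RepLKNet-31XL):

```python
import torch
from torch import nn

def conv_bn(in_ch, out_ch, k, groups=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, groups=groups, bias=False),
                         nn.BatchNorm2d(out_ch))

class RepLKBlock(nn.Module):
    """1x1 expand -> large depth-wise conv -> 1x1 project, wrapped by an identity shortcut."""
    def __init__(self, ch, kernel_size=31, dw_ratio=1.0):
        super().__init__()
        mid = int(ch * dw_ratio)
        self.pre = nn.Sequential(conv_bn(ch, mid, 1), nn.ReLU(inplace=True))
        self.dw = nn.Sequential(conv_bn(mid, mid, kernel_size, groups=mid), nn.ReLU(inplace=True))
        self.post = conv_bn(mid, ch, 1)

    def forward(self, x):
        return x + self.post(self.dw(self.pre(x)))

class ConvFFN(nn.Module):
    """CNN-style FFN: shortcut + (1x1 conv -> GELU -> 1x1 conv), hidden width = 4x input."""
    def __init__(self, ch, ratio=4):
        super().__init__()
        self.net = nn.Sequential(conv_bn(ch, ch * ratio, 1), nn.GELU(), conv_bn(ch * ratio, ch, 1))

    def forward(self, x):
        return x + self.net(x)

x = torch.randn(1, 128, 28, 28)
y = ConvFFN(128)(RepLKBlock(128, kernel_size=31)(x))
print(y.shape)   # torch.Size([1, 128, 28, 28])
```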
In general, each stage has three hyperparameters: the number of RepLK Blocks B, the channel dimension C, and the kernel size K, so a RepLKNet architecture can be written as [B1, B2, B3, B4], [C1, C2, C3, C4], [K1, K2, K3, K4].
Making Large Kernels Even Larger
The paper fixes B = [2, 2, 18, 2] and C = [128, 256, 512, 1024], and simply tunes K to obtain five networks with different kernel sizes, called RepLKNet-3/7/13/25/31. Without any special tuning of the training configuration, the parameters and accuracy of each model are shown in Table 5. In addition, the training configuration is fine-tuned for comparison with SOTA models, giving RepLKNet-31B. On this basis, enlarging the channels to C = [192, 384, 768, 1536] yields RepLKNet-31L, and further enlarging them to C = [256, 512, 1024, 2048] yields RepLKNet-31XL, whose RepLK Blocks use an intermediate width 1.5× the input dimension.
Discussion
Large-Kernel CNNs have Larger ERF than Deep Small-Kernel Models
Generally speaking, stacking small convolutions can eventually reach the same theoretical receptive field as a single large convolution, so why do traditional networks underperform large-kernel ones? The paper argues that, even at the same theoretical receptive field, a single large kernel is more effective than many small layers, for two reasons:
- According to the properties of the effective receptive field, its size is proportional to K√L: it grows linearly with the kernel size K but only sub-linearly (with the square root) of the depth L.
- Increasing depth creates optimization problems. Although ResNet seems to alleviate this, recent studies show that ResNet's effective receptive field does not grow significantly with depth.
Therefore, a large-kernel design needs only a few layers to reach the desired effective receptive field, while avoiding the optimization problems brought by increasing depth.
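As a back-of-the-envelope check of that argument (our arithmetic, treating the K√L proportionality as an equality purely for illustration):

```latex
\text{ERF} \propto K\sqrt{L}
\;\;\Rightarrow\;\;
\underbrace{31\cdot\sqrt{1}}_{\text{one }31\times31\text{ layer}}
\approx
\underbrace{3\cdot\sqrt{L}}_{L\text{ stacked }3\times3\text{ layers}}
\;\;\Rightarrow\;\;
L \approx \left(\tfrac{31}{3}\right)^{2} \approx 107 .
```

This is consistent with the point above that simply stacking more small layers is an inefficient way to grow the effective receptive field.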
The paper also visualizes and quantifies the effective receptive fields of ResNet and RepLKNet, finding that RepLKNet's effective receptive field is overall much larger than ResNet's.
Large-kernel Models are More Similar to Human in Shape Bias
Some studies have found that ViTs are closer to human vision, making predictions based on the shape of the target, whereas CNNs rely more on local texture. Using the toolbox at github.com/bethgelab/m… the paper measures shape bias and finds that large-kernel models are indeed more shape-biased, similar to ViTs.
Dense Convolutions vs. Dilated Convolutions
Dilated convolution is a common way to enlarge the range covered by a convolution, so the paper compares dilated depth-wise convolution with ordinary dense depth-wise convolution. As shown in Table 11, although the maximum receptive field can be the same, the representational capacity of dilated depth-wise convolution is much weaker and accuracy drops significantly. This matches expectation: although the receptive field of a dilated convolution is large, it uses only a small fraction of the input features in its computation.
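To see the "few features used" point concretely, here is a toy comparison (ours, not the paper's experiment): a 3×3 kernel with dilation 4 spans the same 9×9 window as a dense 9×9 kernel but touches only 9 of the 81 input positions.

```python
import torch
import torch.nn.functional as F

# One-channel all-ones input; each output value counts how many input positions contribute to it.
x = torch.ones(1, 1, 9, 9)

dense = F.conv2d(x, torch.ones(1, 1, 9, 9))                 # dense 9x9 kernel: 81 taps
dilated = F.conv2d(x, torch.ones(1, 1, 3, 3), dilation=4)   # 3x3 kernel, dilation 4: same 9x9 span, 9 taps

print(dense.item(), dilated.item())   # 81.0 9.0
```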
Experiment
ImageNet image classification performance comparison.
Cityscapes semantic segmentation performance comparison.
ADE20K semantic segmentation performance comparison.
MSCOCO target detection performance comparison.
Conclusion
The paper proposes introducing a small number of large-kernel convolution layers to substantially enlarge the effective receptive field, narrowing the gap between CNNs and ViTs, especially on downstream tasks. The paper is thorough and also optimizes the practical implementation, making it well worth reading and trying out.