RFBNet is a pleasant paper: no heavy formulas, easy to follow, well illustrated, and elegantly written. RFB is a module that can be integrated into other detection algorithms, and it is easy to understand from Figs. 2, 3, 4, and 5 of the paper; it is inspired by the human visual perception system. The authors build RFBNet on top of SSD, combining the ideas of Inception and dilated (atrous) convolution to mimic human visual perception as closely as possible, and the final experimental results are very good.

There are no special training tricks; it occupies few resources yet works well. The source code also reads comfortably, is easy to run, and is fast, and training and testing on your own dataset are convenient. The experiments are solid and demonstrate that RFB performs excellently. If one must find a shortcoming, it is that the method works well without a clear explanation of why: the paper gives no principled reason for the good performance beyond the analogy to human visual cognition.

Terminology:

1. RFB: Receptive Field Block, a lightweight module proposed in this paper that can be integrated into various detection algorithms. It combines the ideas of Inception and dilated convolution and can be understood by referring to Figs. 2, 3, 4, and 5 of the paper;

Abstract

1. Existing mainstream object detectors generally follow one of two approaches:

1.1 Use a strong backbone network (e.g., ResNet-101, Inception) to extract highly discriminative features, at the cost of heavy computation and slow running speed;

1.2 Use a lightweight backbone network (e.g., MobileNet): fast, but weaker in accuracy;

2. This paper proposes RFB and integrates it into SSD to form RFBNet. With the designed RFB module, highly discriminative features can be extracted even from a lightweight backbone, so the final RFBNet is both fast and accurate.

3. RFB is inspired by the Receptive Field (RF) structure of human vision: it takes both the size and the eccentricity of RFs into consideration, ultimately improving the discriminability and robustness of features.

4. RFBNet performs very well on Pascal VOC and COCO, with mAP over 80% on Pascal VOC, and because of the lightweight backbone (MobileNet, or a light VGG16) it is also very fast; overall, a very practical network structure;

1 Introduction

Mainstream two-stage object detection algorithms such as R-CNN, Fast R-CNN, and Faster R-CNN perform excellently on Pascal VOC and MS COCO, and generally contain two stages:

1st stage: generate category-agnostic region proposals;

2nd stage: extract CNN features for each proposal, then perform bbox classification (CLS) and regression (REG).

In the schemes above, the CNN backbone that extracts proposal features plays an important role. We want to extract target features that are not only discriminative but also robust to translation.

Existing SOTA detectors also support this view: they are generally based on a strong backbone such as ResNet-101 or Inception. FPN builds a feature pyramid top-down, combining features from high- and low-level feature maps, and Mask R-CNN uses RoIAlign to generate more accurate region features. Both rely on highly discriminative features extracted by a powerful backbone, but the drawback is heavy computation and long forward inference time;

To improve detection speed, one-stage detectors such as YOLO and SSD emerged, directly omitting region proposal generation. Although they run in real time, they lose 10% to 40% accuracy compared with SOTA two-stage methods. Newer one-stage detectors such as DSSD and RetinaNet improve detection accuracy to the point of rivaling two-stage detectors, but they still use a powerful backbone (e.g., ResNet-101), so inference time goes up.

So what if we want a detector that is both accurate and fast? The authors give a direction: use a lightweight backbone, but add some hand-crafted mechanisms to make the extracted features more expressive, which is more elegant than blindly piling on layers and convs. In the human visual cortex, the size of the population receptive field (pRF) is a function of eccentricity in the retinotopic maps. Although there are differences among visual field maps, pRF size increases with eccentricity in each of them, as shown in Fig. 1. This finding reveals the importance of keeping the target region as close to the center of the receptive field as possible, which helps improve the model's robustness to small spatial displacements. Some traditional shallow feature descriptors, such as HSOG and DERF, are inspired by this mechanism and achieve good performance in local image matching.

Several points can be drawn from Fig. 1:

(A): 1. pRF size is a function of eccentricity in the retinotopic maps, and it is positively correlated with eccentricity;

2. The size of the receptive field differs across visual field maps.

(B): The spatial array of pRFs based on the parameters in (A). The radius of each circle is the RF size at the corresponding eccentricity;

By contrast, one can apply RFs of a single fixed size on a regular sampling grid at every position of the feature map (which is in fact just a conventional conv operation on one feature map), but this may cost feature discriminability and robustness, as shown in Fig. 3.

Inception: multi-scale RFs are obtained by concatenating conv branches whose kernels have different sizes (each branch may stack multiple convs). Inception V1-V3 performs excellently in object detection and image classification. However, all kernels in Inception are sampled at the same center, just like a conventional conv operating around one center; in my understanding, although each branch reaches a different receptive field, it does so only by stacking convs of uniform scale around the same center.

ASPP: Atrous Spatial Pyramid Pooling connects multiple parallel dilated-conv branches between consecutive feature maps, with a different atrous rate on each branch to obtain samples at different distances from the center. However, the dilated convs all use kernels of the same size on the preceding feature map, so the extracted features are not discriminative enough.
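
To make this structure concrete, here is a minimal ASPP-style block in PyTorch. This is an illustrative sketch, not DeepLab's actual code; the class name, channel counts, and the rate set are my own assumptions. Each branch is a 3x3 conv with a different dilation rate, and a 1x1 conv fuses the concatenated outputs:

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Parallel 3x3 dilated-conv branches with different atrous rates,
    concatenated and fused by a 1x1 conv (illustrative sketch only)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 64, 32, 32)
y = ASPPSketch(64, 32)(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

Note that every branch uses the same 3x3 kernel; only the dilation rate varies, which is exactly the limitation the paper points out.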

Deformable CNN: the spatial distribution of RFs adapts to the scale and shape of the target. Although the sampling grid is variable, the eccentricity of RFs is not taken into account; that is, all pixels in an RF contribute equally to the output response, so the finding in Fig. 1 is not exploited.

Inspired by the receptive field structure of human vision, this paper proposes RFB, which takes both the size and the eccentricity of RFs into account, so that highly discriminative features can be extracted even with a lightweight backbone, making the detector fast and accurate. Specifically, RFB applies conv and pooling operations with different kernel sizes (multi-branch pooling with varying kernels) to model RFs of different sizes, uses dilated conv to control the eccentricity of each receptive field, and finally recombines the branch outputs into the final representation.

As Figure 2 shows, RFB also concatenates multiple conv branches, similar to the Inception structure. Each branch applies a conventional conv at a different kernel size followed by a dilated conv: the conventional conv kernel sizes simulate pRFs of different sizes, while the dilation obtained on each branch simulates the ratio between pRF size and eccentricity.

Finally, concatenation plus a 1x1 conv reduces the number of channels, and the resulting spatial array of RFs on the feature map resembles hV4 of the human visual system in Fig. 1. Reading this far, one can see that every RFB operation is carefully designed to mimic human visual perception, with the final arrangement matching hV4 in Fig. 1.

In this paper RFB is further integrated into SSD to form the one-stage RFBNet. RFBNet's accuracy is comparable to SOTA two-stage detectors, yet it uses only a lightweight backbone rather than ResNet-101, so it is fast. In addition, the experiments integrate the RFB module into MobileNet with very good results, fully demonstrating the generalization ability and adaptability of RFB itself.

Highlights:

1. Propose the RFB module, inspired by the receptive field structure of human vision, which takes the size and eccentricity of RFs into consideration and can extract highly discriminative features even with a lightweight network structure;

2. Integrate RFB into SSD (directly replacing SSD's top conv layers with RFB) to form RFBNet, which is fast and accurate;

3. RFBNet performs very well on Pascal VOC and COCO, with mAP over 80% on Pascal VOC, and thanks to the lightweight backbone (MobileNet, or a light VGG16) it is fast and real-time;

2 Related Work

Two-stage detector

R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, FPN, and Mask R-CNN are covered, with model performance improving step by step.

One-stage detector

Representative one-stage detectors are YOLO and SSD, which directly predict class scores and bbox locations for multiple targets over the whole feature map. Both use a lightweight backbone to improve detection speed, but their accuracy is weaker than two-stage methods.

DSSD and RetinaNet replace the lightweight backbone with ResNet-101 and add modules such as deconvolution and Focal Loss; their performance approaches or even surpasses two-stage detectors, but at the expense of detection speed.

Receptive field

Objective of this paper: improve the performance of one-stage detectors without introducing heavy computational overhead. Therefore, rather than blindly swapping in a computationally expensive backbone, the paper introduces RFB, inspired by the RF mechanism of the human visual system, which can extract highly discriminative features from a lightweight backbone.

RFs have also been studied in depth, e.g., in Inception, ASPP, and Deformable CNN, as shown in Figure 3:

Inception: multi-scale RFs are obtained by concatenating conv branches whose kernels have different sizes (each branch may stack multiple convs). However, all kernels operate on the same conv center, and larger kernels are needed to reach the same sampling coverage (which I do not understand very well), which may lose important information about the target.

ASPP: to use multi-scale information, multiple parallel dilated-conv branches are connected between consecutive feature maps, with a different atrous rate on each branch to obtain samples at different distances from the center. However, the dilated convs use kernels of the same size on the preceding feature map, implicitly assuming every position in the feature map contributes equally to the target, which may confuse the target with its context; the extracted features are not discriminative enough.

Deformable CNN: the spatial distribution of RFs adapts to the scale and shape of the target. Although the sampling grid is variable, the eccentricity of RFs is not taken into account; that is, all pixels in an RF contribute equally to the output response, so the finding in Fig. 1 is not exploited.

RFB: emphasizes the relationship between RF size and eccentricity in a daisy-shaped configuration similar to hV4 in Fig. 1, and favors positions close to the conv center: smaller kernels assign larger weights to positions near the center, so the farther a position is from the center, the less it contributes to the central response.

So far, Inception and ASPP have not achieved good results in one-stage detectors, but RFB combines the advantages of both and achieves a solid performance gain.

3 Method

This section covers the visual perception cortex, each component of RFB, how RFB simulates visual perception, the RFBNet architecture, and the training/testing scheme.

3.1 Visual Cortex Revisited

Functional magnetic resonance imaging (fMRI) can measure human brain activity at millimeter resolution, and receptive field modeling has become an important tool for analyzing it. The pRF model (fMRI + pRF) further explores the pooled responses of neurons within the visual cortex, where pRF size is positively correlated with eccentricity, and the correlation coefficient differs across visual field maps; this is not fully understood here, but can be grasped with Fig. 1.

3.2 Receptive Field Block

RFB is a multi-branch conv block with two internal parts:

1. A multi-branch conv layer with different kernel sizes, equivalent to an Inception structure, used to simulate multi-scale pRFs;

2. Dilated conv (and pooling) operations, used to simulate the relationship between pRF size and eccentricity in human visual perception;

A visualization of the RFB module is shown in Fig. 2.

Multi-branch convolution layer

Following the idea validated by Inception: multiple branches with different conv kernels yield multi-scale RFs and perform better than a fixed kernel. RFB adopts the latest Inception variants: Inception V4 and Inception-ResNet V2;

Specifically, on each branch a 1 × 1 conv first reduces the number of feature-map channels, forming a bottleneck, followed by a conventional n × n conv. The 5×5 conv is replaced by two stacked 3×3 convs, which reduces parameters while adding nonlinearity, and an n × n conv is further replaced by 1 × n + n × 1 convs. There is also a ResNet-style shortcut, as shown in Figure 4.
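
The savings from these factorizations are easy to verify with a quick parameter count (bias terms ignored; the channel count C = 128 is an arbitrary example, not taken from the paper):

```python
# Parameter count (no bias) of a conv layer with c_in input and c_out output channels.
def conv_params(k_h, k_w, c_in, c_out):
    return k_h * k_w * c_in * c_out

C = 128
p_5x5 = conv_params(5, 5, C, C)                                # one 5x5 conv
p_two_3x3 = 2 * conv_params(3, 3, C, C)                        # two stacked 3x3 convs
p_1x3_3x1 = conv_params(1, 3, C, C) + conv_params(3, 1, C, C)  # 1x3 + 3x1 pair

print(p_5x5, p_two_3x3, p_1x3_3x1)  # 409600 294912 98304
```

So two stacked 3×3 convs use about 28% fewer parameters than one 5×5 (while covering the same receptive field), and the 1×n + n×1 factorization saves even more.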

Dilated pooling or convolution layer

As Fig. 4 shows, dilated convolution is added. The core idea is to enlarge the receptive field of each layer's feature map to capture more context, while keeping the same number of parameters and producing a higher-resolution feature map. Dilated convolution has worked well in SSD and R-FCN; RFBNet uses it to simulate the effect of pRF eccentricity in human visual perception.

This is easy to see in Fig. 4: after the conventional conv on each branch, a dilated conv layer (conv, rate) follows, and the kernel size of the conventional conv corresponds exactly to the dilation rate of the dilated conv, which lets the block simulate both the size and the eccentricity of pRFs (though I do not fully understand the underlying principle);
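
The pairing can be checked with simple receptive-field arithmetic. Assuming the branch layout described above (a k × k conv followed by a 3×3 dilated conv with rate = k, for k = 1, 3, 5; the exact kernel/rate set is my reading of Fig. 4, not quoted from the paper), the stacked receptive fields come out evenly spread across scales:

```python
def dilated_extent(k, rate):
    """Spatial extent of a k x k conv kernel with the given dilation rate."""
    return rate * (k - 1) + 1

def stacked_rf(kernels):
    """Receptive field of a stack of stride-1 convs with these kernel extents."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

# Each branch: a k x k conventional conv, then a 3x3 dilated conv with rate = k.
for k in (1, 3, 5):
    rf = stacked_rf([k, dilated_extent(3, k)])
    print(f"branch k={k}, rate={k}: receptive field {rf}x{rf}")
# branch k=1: 3x3; k=3: 9x9; k=5: 15x15
```

Larger kernels thus get pushed farther from the center (larger eccentricity), mirroring the pRF size-eccentricity relation of Fig. 1.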

Finally, the branches are concatenated, followed by a 1 x 1 conv, and an element-wise sum is taken with the shortcut. Visualizations are shown in Figs. 1 and 2.
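
Putting the pieces together, a minimal RFB-style block might look like the following PyTorch sketch. This is illustrative only: the branch count, bottleneck width, and exact layer ordering are assumptions and differ from the released rfb_net_vgg.py:

```python
import torch
import torch.nn as nn

class RFBSketch(nn.Module):
    """Minimal RFB-style block (sketch, not the official implementation):
    each branch = 1x1 bottleneck -> k x k conv -> 3x3 dilated conv (rate = k);
    branches are concatenated, fused by a 1x1 conv, then summed with a shortcut."""
    def __init__(self, channels, mid=32):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in (1, 3, 5):
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, mid, 1),                    # bottleneck
                nn.Conv2d(mid, mid, k, padding=k // 2),         # k x k conv
                nn.Conv2d(mid, mid, 3, padding=k, dilation=k),  # rate matches k
            ))
        self.fuse = nn.Conv2d(mid * 3, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return self.relu(out + x)  # element-wise sum with the shortcut

x = torch.randn(1, 128, 19, 19)
print(RFBSketch(128)(x).shape)  # torch.Size([1, 128, 19, 19])
```

The paddings are chosen so every branch preserves the spatial size, which is what makes the channel-wise concatenation valid.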

3.3 RFB Net Detection Architecture

RFBNet is based on the multi-scale one-stage SSD; the RFB module can directly replace SSD's top conv layers, giving high speed and good accuracy while reusing many SSD parameters;

Lightweight backbone

The backbone reuses SSD's VGG16: pre-trained on ILSVRC CLS-LOC, with the fc6 and fc7 fully connected layers converted (via subsampling) into conv layers. pool5 is changed from 2x2-s2 to 3x3-s1 (so the feature-map scale is not reduced), dilated conv is added to enlarge the receptive field, and fc8 and all dropout layers are dropped.
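
These tail modifications can be sketched as follows. The channel sizes and the dilation value of 6 follow the standard SSD conversion and are assumed to carry over here; layer names are illustrative:

```python
import torch
import torch.nn as nn

# SSD-style VGG16 tail (sketch): pool5 changed from 2x2/s2 to 3x3/s1 so the
# feature map is not downsampled, fc6 becomes a dilated 3x3 conv to preserve
# its original receptive field, and fc7 becomes a 1x1 conv.
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)            # was 2x2, stride 2
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)  # ex-fc6
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)                        # ex-fc7

x = torch.randn(1, 512, 19, 19)
y = conv7(torch.relu(conv6(torch.relu(pool5(x)))))
print(y.shape)  # torch.Size([1, 1024, 19, 19])
```

Because pool5 no longer downsamples, the dilated conv6 is what keeps the effective receptive field comparable to the original fc6.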

RFB on multi-scale feature maps

SSD uses a multi-scale feature pyramid to detect targets: the feature-map scale gradually shrinks while the receptive field grows. RFBNet inherits this structure and attaches an RFB-s module to the high-resolution conv4_3 to simulate the small pRFs of the human retina. The RFB module is not used on the top two layers; according to the authors, this is mainly because those feature maps are too small for kernels around 5 x 5.

3.4 Training Settings

The RFBNet implementation is based on SSD.PyTorch, and the training strategy follows SSD: data augmentation, OHEM, default-box scales and aspect ratios, and the loss function (LOC: smooth L1 loss; CLS: softmax loss). Only the lr schedule is modified, and the new conv layers are initialized with the MSRA method, corresponding to kaiming_normal in the code;
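
The MSRA initialization of the new layers might look like this sketch (the `mode` and `nonlinearity` arguments are my assumptions; the released code may pass different ones):

```python
import torch
import torch.nn as nn

def init_new_layers(module):
    """MSRA (He/Kaiming) init for newly added conv layers,
    matching kaiming_normal in the code (argument choices assumed)."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Apply to a stand-in for the newly added RFB layers.
rfb = nn.Sequential(nn.Conv2d(512, 256, 1), nn.Conv2d(256, 256, 3, padding=1))
rfb.apply(init_new_layers)
```

`Module.apply` walks the whole submodule tree, so a single call covers every conv inside the new RFB blocks.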

4 Experiments

RFBNet is evaluated on Pascal VOC 2007 (20 classes) and MS COCO (80 classes). Pascal VOC computes mAP with an IoU threshold of 0.5 between GT and predicted bboxes; the COCO-style AP averages over IoU thresholds from 0.5 to 0.95 with stride 0.05, evaluating the detector more comprehensively.
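
Both protocols reduce to box IoU. A minimal sketch of the COCO threshold grid and the IoU computation (an illustrative helper, not the official COCO API):

```python
# COCO-style AP averages over IoU thresholds 0.50, 0.55, ..., 0.95 (stride 0.05).
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(thresholds)  # [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.333: half-overlapping boxes
```

A prediction with IoU 0.33 counts as a miss under Pascal VOC's 0.5 threshold and under every COCO threshold, which is why COCO-style AP rewards tighter localization.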

4.1 Pascal VOC 2007

RFBNet is trained on trainval 2007 + trainval 2012 for 250 epochs. Directly reusing SSD's lr = 1e-3 makes the loss fluctuate heavily during RFBNet training and convergence difficult, so the experiments use a warmup strategy: over the first five epochs, lr is gradually raised from 1e-6 to 4e-3, after which the conventional schedule divides it by 10 at epochs 150 and 200.
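
That schedule can be sketched as a small function (the linear ramp shape during warmup is an assumption; the released code may ramp differently):

```python
def learning_rate(epoch, base_lr=4e-3, warmup_epochs=5, warmup_start=1e-6,
                  milestones=(150, 200), gamma=0.1):
    """Warmup-then-step lr schedule as described: ramp 1e-6 -> 4e-3 over the
    first 5 epochs, then divide by 10 at epochs 150 and 200 (sketch only)."""
    if epoch < warmup_epochs:
        t = epoch / warmup_epochs          # linear ramp toward base_lr
        return warmup_start + (base_lr - warmup_start) * t
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

print(learning_rate(0))                 # 1e-06
print(learning_rate(10))                # 0.004
print(round(learning_rate(160), 6))     # 0.0004
print(round(learning_rate(210), 6))     # 4e-05
```

The warmup keeps early gradients from blowing up the newly initialized RFB layers before the step decay takes over.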

Table 1 shows RFBNet's performance. SSD300*/SSD512* use data augmentation that generates more small-scale training samples. RFBNet512 reaches 82.2% mAP, very close to the two-stage R-FCN, but R-FCN with its ResNet-101 backbone is much slower;

4.2 Ablation Study

A round of ablation experiments follows; results are in Tables 2 and 3:

RFB module

Baseline: SSD300* (with the new data augmentation), mAP 77.2%; replacing the final conv layers with RFB-max pooling gives 79.1%, showing that the RFB module performs well;

Cortex map simulation

Add RFB-s: modify the RFB parameters to simulate the ratio of pRF size to eccentricity in cortex maps. 79.1% -> 79.6% (RFB max-pooling), 80.1% -> 80.5% (RFB-dilated conv);

More prior anchors

The original SSD places 4 default boxes on conv4_3, conv10_2, and conv11_2 and 6 on the other layers, but HR and S3FD argue that low-level feature maps used for detecting small-scale targets (such as conv4_3) recall small targets more easily if more anchors are set.

Therefore, six default boxes are set on conv4_3 for both SSD and RFBNet. The experiments show that SSD does not improve, but RFBNet does (79.6% -> 79.8%).

Dilated convolutional layer

Table 2 compares RFB-max pooling, RFB-avg pooling, and RFB-dilated conv; RFB-dilated conv performs best (79.8% -> 80.5%). Reason: although the two pooling-based schemes avoid extra parameters, they limit the feature fusion of multi-scale RFs.

Comparison with other architectures

Inception-L: the module's parameters are modified so that it has RFs of the same size as RFB;

ASPP-S: the original ASPP parameters were tuned on segmentation-style image data, so its RFs are too large for object detection; the RF sizes are therefore also adjusted in the experiment to match RFB;

This ablation is also straightforward: the top layer in Fig. 5 is replaced by each module in Table 3, with all other training settings kept identical, and RFB comes out best. The authors further argue that RFB's larger RF yields more accurate target localization.

4.3 Microsoft COCO

On to COCO: training uses trainval35k (train + a 35k subset of val), reusing SSD's COCO settings but with smaller default-box scales (targets on COCO are much smaller than on Pascal VOC), and the same "warmup" strategy as on Pascal VOC. 120 epochs are trained in total;

Table 4 shows RFBNet performing well: better than the SSD300* baseline, faster than the ResNet-101-based R-FCN (600 x 1000 px input), and RFBNet512 close to RetinaNet500 in accuracy. But RetinaNet500 uses a ResNet-101-FPN backbone + Focal Loss, while RFBNet512 is based only on the lightweight VGG16 and is much faster;

RetinaNet800, while performing best, takes longer since it uses 800 px short-edge input images;

RFBNet512-E further improves performance with two changes (mAP: 34.4%, at a cost of only 3 ms):

1. Similar to FPN, before applying RFB-s, upsample conv7_fc's feature map (RFBNet is fully convolutional, so conv7_fc is also a 4-D tensor) and concatenate it with conv4_3.

2. Add a branch with a 7 x 7 convolution kernel to all RFB modules; this corresponds to the rfb_net_E_vgg.py code, which splits it into a 1 x 7 conv, a 7 x 1 conv, a dilation = 7 dilated conv, etc.
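
A hedged sketch of what such a branch could look like (channel widths, paddings, and layer order here are illustrative guesses, not copied from rfb_net_E_vgg.py):

```python
import torch
import torch.nn as nn

# Sketch of the extra RFBNet-E branch: a 7x7 kernel factorized into
# 1x7 + 7x1 convs, followed by a 3x3 dilated conv with rate 7.
branch7 = nn.Sequential(
    nn.Conv2d(256, 64, 1),                        # bottleneck
    nn.Conv2d(64, 64, (1, 7), padding=(0, 3)),    # 1 x 7
    nn.Conv2d(64, 64, (7, 1), padding=(3, 0)),    # 7 x 1
    nn.Conv2d(64, 64, 3, padding=7, dilation=7),  # dilated, rate = 7
)

x = torch.randn(1, 256, 38, 38)
print(branch7(x).shape)  # torch.Size([1, 64, 38, 38])
```

The 1×7 + 7×1 pair keeps the 7×7 coverage at a fraction of the parameters, consistent with the factorizations used in the base RFB module.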

5 Discussion

Inference speed comparison

Combining Table 1 and Fig. 6, RFBNet is really impressive: fast, real-time, and accurate, sitting right in the upper-left corner of the plot;

Other lightweight backbone

Table 5: adding the RFB module to MobileNet-SSD, training on train + val35k and testing on minival5k, further improves performance, which nicely demonstrates the generalization ability of RFB itself.

Training from scratch

At present, detectors are usually trained from a classification model pre-trained on ImageNet, used as the backbone and then fine-tuned on the detection dataset (reusing the classifier's parameters, since a classification model already has strong feature extraction ability). Training detector parameters from scratch generally does not work very well;

DSOD: trained from scratch on VOC 2007 with a lightweight structure and no pre-trained model, reaching 77.7% mAP on the test set; yet a pre-trained model does not improve it further. Pure alchemy; hard to explain;

RFBNet300 was also trained from scratch on VOC 07+12 trainval. With a lightweight VGG16 added as the pre-trained backbone, mAP reaches 80.5%. Impressive.

6 Conclusion

1. This paper proposes RFB, inspired by the receptive field structure of human vision; combined with the RFB module, even a lightweight backbone can extract highly discriminative features, without resorting to a deep, computationally heavy backbone network.

2. RFB models the relationship between the size and eccentricity of RFs, generating more discriminative and robust features;

3. RFB is integrated into SSD (with a lightweight VGG16 backbone) to form RFBNet. RFBNet performs well on Pascal VOC and COCO, with mAP over 80% on Pascal VOC, comparable to detectors with heavy backbones (e.g., ResNet-101), while remaining fast thanks to its lightweight backbone;