In this article, the PyTorch implementation of Faster R-CNN is analyzed graphically to visualize the calls and dependencies between modules, as well as the CUDA implementations of NMS and ROI Align.
Contents
1. Data Reading
1.1 BBox reading
1.2 Implementation of the DataLoader
2. Overall structure of the model
2.1 Module and layer dependencies
3. Generation of bboxes
4. Calculation of the RPN
4.1 CUDA implementation of NMS
4.2 RPN error calculation
5. Second bbox screening
5.1 ROI Align
6. Final classification of bboxes
7. Model training
7.1 Installing the COCO API
7.2 Extra subset of COCO samples
7.3 Training script
8. Model prediction
9. Summary
The original code comes from github.com/jwyang/fast… ; I made some modifications according to personal preference, and my code is at github.com/johnhany/fa… .
The PyTorch environment configuration method can be found in two articles: Ubuntu computer vision development environment configuration (Python/C++) and Manjaro computer vision development environment configuration (Python/C++).
trainval_net.py is used as the entry point to walk through the PyTorch implementation of Faster R-CNN.
PASCAL VOC 2007 is used as the dataset and ResNet-101 as the CNN backbone.
Data Reading
The data reading process is roughly as follows: the bounding boxes (bbox for short) of the training samples are read in advance, and the bbox coordinates and other information are kept in memory. This step does not need to load the training-set images into memory, but it does screen samples whose aspect ratio exceeds a certain range, and it also generates horizontally flipped bboxes as required, which helps improve the generalization ability of the model during training. When a sample needs to be read through the DataLoader, the image data is read from disk, preprocessed, and scaled to the desired size.
The code for reading data is detailed below.
BBox reading
The following figure describes the call relationships inside the combined_roidb('voc_2007_trainval') function. (The diagram was drawn with diagrams.net.)
The reading process of bbox is as follows:
- Some datasets combine two or more different sub-sample sets into a larger, richer training set, separated by "+", for example COCO's "coco_2014_train+coco_2014_valminusminival". Here, `roi_data_layer.roidb.combined_roidb.get_roidb()` reads the bboxes of each sub-sample set. To be specific:
  - `get_roidb('voc_2007_trainval')` creates, through a lambda function, an instance of the `lib.datasets.pascal_voc.pascal_voc` class, which derives from `lib.datasets.imdb.imdb`. The `imdb` class stores the image identifiers, bboxes, category information and so on of different sample sets. On initialization, `pascal_voc('trainval', '2007')` creates a list named `image_index` that holds the identifiers of all sample images (for example "000005"), from which the images can later be read from disk.
  - The `lib.datasets.imdb.imdb.set_proposal_method('gt')` method sets the `lib.datasets.imdb.imdb.roidb_handler` member to the `gt_roidb()` method of `lib.datasets.pascal_voc.pascal_voc`. This method reads the ground-truth bboxes of the sample set. (`lib.datasets.pascal_voc.pascal_voc` also defines an `rpn_roidb()` method, which obtains bboxes using the RPN network instead.) In this process the real bboxes are read from the Annotations folder of the PASCAL VOC dataset (e.g. "000005.xml"), and the result is a dict containing the keys "boxes", "gt_classes", "gt_ishard", "gt_overlaps", "flipped" and "seg_areas", which hold, respectively, the bbox coordinates, the bbox classes, whether the sample is hard, a one-hot sparse matrix marking the class, whether the image is flipped horizontally, and the bbox areas (see the sketch after this list).
  - `roi_data_layer.roidb.combined_roidb.get_training_roidb(imdb)` preprocesses the bboxes: it uses `lib.datasets.imdb.imdb.append_flipped_images()` to add a horizontally flipped copy of every bbox (doubling `image_index` in the process), and uses `roi_data_layer.roidb.combined_roidb.prepare_roidb(imdb)` to prepare values that may be needed during training.
- Merge the lists of bboxes from each sub-sample set into one list.
- Filter out samples that contain no bbox.
- Sort the samples by width-to-height ratio. Samples with a ratio greater than 2 or less than 0.5 are marked "need_crop" so that they can be cropped later.
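For reference, here is a minimal sketch of what one entry in the resulting roidb might look like; the concrete values, dtypes, and any extra fields added later by `prepare_roidb()` are illustrative assumptions, not the repo's exact output.

```python
import numpy as np
import scipy.sparse

# Hypothetical roidb entry for one image with 2 ground-truth objects and 21 classes (20 + background).
roidb_entry = {
    'boxes': np.array([[48, 240, 195, 371],            # (x1, y1, x2, y2) of each gt bbox
                       [8, 12, 352, 498]], dtype=np.uint16),
    'gt_classes': np.array([12, 15], dtype=np.int32),   # class index of each bbox
    'gt_ishard': np.array([0, 0], dtype=np.int32),      # "difficult" flag from the VOC annotation
    'gt_overlaps': scipy.sparse.csr_matrix(             # one-hot sparse matrix, shape (2, 21)
        np.eye(21, dtype=np.float32)[[12, 15]]),
    'flipped': False,                                    # whether this entry is the flipped copy
    'seg_areas': np.array([19536.0, 168015.0], dtype=np.float32),  # bbox areas
}
```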
Implementation of the DataLoader
The figure below describes the dependency between torch.utils.data.DataLoader and the roi_data_layer.roibatchLoader.roibatchLoader class. The latter creates a PyTorch-compatible dataset object so that the DataLoader can read mini-batches of samples during training and prediction.
When the roibatchLoader class is initialized, it uses the previously sorted ratio_list and ratio_index to build a list whose length equals the total number of samples, in which every batch_size consecutive entries share the same value. If the width-to-height ratio (simply "ratio" below) of the whole batch is less than 1, the smallest ratio in the batch is used as the unified ratio of that batch; if it is greater than 1, the largest ratio is used. This keeps the ratio consistent within a batch. A sketch of this logic follows.
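The per-batch ratio unification can be sketched as follows (an illustrative re-implementation under the assumptions above, not the repo's exact code):

```python
import numpy as np

def batch_target_ratios(ratio_list, batch_size):
    """ratio_list is already sorted ascending; returns one target ratio per sample."""
    num_samples = len(ratio_list)
    target = np.ones(num_samples)
    for left in range(0, num_samples, batch_size):
        right = min(left + batch_size, num_samples) - 1
        if ratio_list[right] < 1:            # whole batch is "tall": use the smallest ratio
            target[left:right + 1] = ratio_list[left]
        elif ratio_list[left] > 1:           # whole batch is "wide": use the largest ratio
            target[left:right + 1] = ratio_list[right]
        else:                                # batch straddles 1: keep the ratio at 1
            target[left:right + 1] = 1.0
    return target
```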
The DataLoader obtains a sample from the dataset through an index value; this is implemented in `__getitem__(self, index)`. The specific process is as follows:
- According to `index`, get the bbox information from `roidb`. In training mode the index is obtained indirectly through `ratio_index`. The dict containing the bbox information is then wrapped in a list `minibatch_db` of length 1.
- `roi_data_layer.minibatch.get_minibatch(minibatch_db, num_classes)` reads the image and builds the mini-batch. To be specific:
  - Reading the image and constructing the batch happens mainly in `roi_data_layer.minibatch._get_image_blob(roidb, 0)`:
    - Read the image with `imread`. The original repo used `scipy.misc.imread`; I replaced it with `imageio.imread`, and OpenCV's `cv2.imread` would of course also work.
    - Convert the image color space from RGB to OpenCV's default BGR. If `cv2.imread` was used in the previous step, this step is not needed.
    - If the bbox is marked "flipped", flip the image horizontally.
    - Use `lib.model.utils.blob.prep_im_for_blob()` to subtract the sample-set mean (the same set of mean values is used by the different CNNs for both training and prediction) and rescale the image so that its shorter side is 600.
    - Use `lib.model.utils.blob.im_list_to_blob()` to create a blob of size `(1, max_h, max_w, 3)` and copy the image matrix into it, aligned to the upper-left corner (a padding sketch follows this list).
  - Scale the bboxes by the same factor and save all the information about the sample in one dict (i.e. `blobs`).
- Finally, check whether the sample needs to be cropped because its aspect ratio is out of range; if so, it is cropped along the longer side at a random position to bring the ratio back within range.
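As referenced in the list above, the blob construction can be sketched like this (a simplified stand-in for what `im_list_to_blob()` does; the helper name and details are illustrative):

```python
import numpy as np

def im_list_to_blob_sketch(images):
    """Pack a list of HxWx3 float images into one (N, max_h, max_w, 3) blob;
    each image is copied into the upper-left corner and the rest stays zero."""
    max_h = max(im.shape[0] for im in images)
    max_w = max(im.shape[1] for im in images)
    blob = np.zeros((len(images), max_h, max_w, 3), dtype=np.float32)
    for i, im in enumerate(images):
        blob[i, :im.shape[0], :im.shape[1], :] = im
    return blob
```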
Overall structure of the model
The Faster R-CNN model structure based on ResNet-101 and its data flow are shown in the figure below.
The model essentially splits ResNet into two parts. The first part includes the input layer of ResNet and the first 3 groups of residual blocks (RCNN_base in the figure); the feature map produced here (the tensor named base_feat in the figure) is passed into the RPN network, which performs a preliminary screening (using NMS) to give 2000 bboxes, and also gives the classification cross-entropy error and the smooth-L1 coordinate error of the 256 bboxes with the highest confidence (128 foreground and 128 background). The second part includes an ROI Align layer and the fourth group of ResNet residual blocks (RCNN_top in the figure); from the feature map produced here, the classification error and coordinate error of the 128 highest-confidence bboxes are obtained. Finally, the four error values are added up to give the final error of the whole model.
Interpretation of the input data (a sketch of the top-level forward call follows this list):
- `im_data`: tensor of size [4, 3, 600, 800], the sample images scaled by the dataloader. Images of different original sizes are rescaled so that the shorter side is 600, so the length of the other side is not always 800.
- `im_info`: tensor of size [4, 3], where 4 in the first dimension is the batch size and the 3 values of the second dimension store each image's height, width and scaling ratio. Because `im_data` is built according to the largest image in the batch, smaller images may exist in the same batch, so the actual size of an image has to be obtained from `im_info`. It is worth noting that in the current code the actual height and width are only used to clip bboxes that exceed the image boundary; the third value is not used at all, so it is redundant.
- `gt_boxes`: tensor of size [4, 20, 5], where the 5 values of the third dimension are the coordinates and class label of a ground-truth bbox (4+1 values). The class-label dimension is actually redundant, because the category information could be derived from the index of the second dimension; it is presumably designed this way so that it can be compared directly with the network's predictions (which must predict both the bbox coordinates and the class label).
- `num_boxes`: tensor of size [4], originally meant to hold the number of bboxes in each image, but it is not used in the model, so its role can be ignored.
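Putting the four inputs together, the top-level training step in trainval_net.py looks roughly like this (a simplified sketch; it assumes `fasterRCNN`, `dataloader` and `optimizer` are already constructed, and the exact return signature should be checked against the repo):

```python
for step, (im_data, im_info, gt_boxes, num_boxes) in enumerate(dataloader):
    # the model returns the kept rois plus the four loss terms discussed below
    rois, cls_prob, bbox_pred, \
        rpn_loss_cls, rpn_loss_box, \
        RCNN_loss_cls, RCNN_loss_bbox, rois_label = fasterRCNN(im_data, im_info,
                                                               gt_boxes, num_boxes)
    loss = rpn_loss_cls.mean() + rpn_loss_box.mean() \
         + RCNN_loss_cls.mean() + RCNN_loss_bbox.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```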
There are two modules related to the ResNet network in the figure:
- `RCNN_base`: contains the input layer of ResNet and the first 3 groups of residual blocks:

```python
self.RCNN_base = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                               resnet.maxpool, resnet.layer1, resnet.layer2, resnet.layer3)
```

  None of its conv or BN layers participate in training;

- `RCNN_top`: the fourth group of ResNet residual blocks:

```python
self.RCNN_top = nn.Sequential(resnet.layer4)
```

  Here the BN layers likewise do not participate in training, but the conv layers are fine-tuned to adapt to the new data and loss function.
Module and layer dependencies
- `lib.model.faster_rcnn.faster_rcnn._fasterRCNN` defines the base class of Faster R-CNN. It calls:
  - `lib.model.rpn.rpn._RPN`: the RPN network, which generates a large number of bboxes and then performs a preliminary screening to obtain 2000 bboxes. Inside this network:
    - `lib.model.rpn.proposal_layer._ProposalLayer` generates a large number of bboxes (17100 for a 600×800 image) from the set of 9 predefined anchors and then filters them with NMS, keeping 2000 bboxes. The NMS here is computed on the GPU with CUDA.
    - `lib.model.rpn.anchor_target_layer._AnchorTargetLayer` generates 9 bboxes at each pixel of the feature map produced by the first 3 groups of ResNet residual blocks and gives the predicted class label and coordinates of each bbox; these bboxes are used to compute the two losses `rpn_loss_cls` and `rpn_loss_bbox`.
  - `lib.model.rpn.proposal_target_layer_cascade._ProposalTargetLayer` selects 128 bboxes from the 2000 bboxes kept in the previous step.
  - `lib.model.roi_layers.roi_align.ROIAlign`: the ROI Align layer, which normalizes feature maps of different sizes to the same size so that the final output layers can produce prediction tensors of the same length. ROI Align is also computed on the GPU with CUDA.
- `lib.model.faster_rcnn.resnet.resnet` inherits from `_fasterRCNN` and defines the `RCNN_base` and `RCNN_top` modules.
Below we explain the specific functions and computational logic of each part of the network according to the calculation sequence of the model.
Generation of bboxes
During the initialization of `lib.model.faster_rcnn.resnet.resnet`, a `lib.model.rpn.proposal_layer._ProposalLayer` layer is created inside `lib.model.rpn.rpn._RPN`. When this `_ProposalLayer` is initialized, `lib.model.rpn.generate_anchors.generate_anchors()` is called. This function generates 9 sets of basic bbox coordinates (called anchors) from 3 aspect ratios (0.5, 1, 2) and 3 scales (8, 16, 32). These 9 anchors are then used to generate a large number of bboxes at different positions of the feature map obtained from the convolutional layers; these bboxes serve as candidate detection windows, from which the optimal ones are screened out in the subsequent steps.
The nine sets of anchors that result are as shown below.
It should be noted that some values of the actual matrix may differ by plus or minus 1 because of differences in floating-point rounding during the calculation. If you are writing a paper, check that the values of this matrix match the MATLAB results given by the paper's authors.
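A simplified version of the anchor generation (the same idea as `lib.model.rpn.generate_anchors.generate_anchors()`, though not line-for-line identical to it):

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Return a (9, 4) array of (x1, y1, x2, y2) anchors centered on the 16x16 base window."""
    anchors = []
    cx = cy = (base_size - 1) / 2.0                # center of the base window
    for ratio in ratios:                           # ratio = height / width
        w = np.round(np.sqrt(base_size * base_size / ratio))
        h = np.round(w * ratio)
        for scale in scales:                       # enlarge by the scale factors 8 / 16 / 32
            ws, hs = w * scale, h * scale
            anchors.append([cx - (ws - 1) / 2, cy - (hs - 1) / 2,
                            cx + (ws - 1) / 2, cy + (hs - 1) / 2])
    return np.array(anchors)

print(make_anchors())   # 9 anchors; individual values may differ by ±1 due to rounding
```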
Calculation of the RPN
The calculation process of RPN network is shown in the following figure.
The feature map here is calculated from the first 3 residual blocks of ResNet. The original sample tensor with size [4, 3, 600, 800] is mapped to a feature map with size [4, 1024, 38, 50].
After two conv operations and a softmax, the number of channels of this feature map is reduced from 1024 to 18; meanwhile another conv branch gives a tensor with 36 channels. Both tensors are fed into `_ProposalLayer`. The tensor `rpn_cls_prob` of size [4, 18, 38, 50] can be understood as 9 foreground and 9 background predictions for every pixel of the 38×50 feature map: for the 9 anchors centered at that pixel it gives the probability of containing a foreground or background target (after the softmax the values can be treated directly as probabilities). Similarly, the tensor `rpn_bbox_pred` of size [4, 36, 38, 50] can be understood as the coordinates of the 9 bboxes (4×9=36). In `_ProposalLayer`, a total of 38×50=1900 groups of bboxes are generated, with adjacent groups actually spaced 16 pixels apart. Each group produces 9 bboxes from the 9 anchors, so the total number of bboxes is 1900×9=17100.
The four coordinates of these bboxes are not generated directly as values such as x, y, w and h, but as offsets in the x and y directions relative to the anchor plus scaling coefficients for the width and height (the width and height are scaled exponentially, which both guarantees that they stay positive and allows values over a large range to be generated). In this way the preceding conv layer only needs to produce small values in order to map to bbox coordinates and sizes with a wide value range. The mapping from the offsets and scaling coefficients to the actual bbox coordinates is done in `lib.model.rpn.bbox_transform.bbox_transform_inv()`. After the mapping, `lib.model.rpn.bbox_transform.clip_boxes()` is also called to clip the bboxes that go beyond the feature map boundary (this is why the `im_info` input is needed).
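The mapping from predicted offsets and scaling coefficients back to box coordinates can be sketched as follows (a single-image NumPy version for clarity; the repo's `bbox_transform_inv()` operates on batched tensors):

```python
import numpy as np

def bbox_transform_inv_sketch(anchors, deltas):
    """anchors: (N, 4) as (x1, y1, x2, y2); deltas: (N, 4) as (dx, dy, dw, dh)."""
    widths = anchors[:, 2] - anchors[:, 0] + 1.0
    heights = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights

    pred_ctr_x = ctr_x + deltas[:, 0] * widths       # shift the anchor center
    pred_ctr_y = ctr_y + deltas[:, 1] * heights
    pred_w = widths * np.exp(deltas[:, 2])           # exponential scaling keeps w, h positive
    pred_h = heights * np.exp(deltas[:, 3])

    boxes = np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                      pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)
    # clip_boxes() would then clamp each coordinate to [0, width - 1] / [0, height - 1]
    return boxes
```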
Then the 17100 bboxes are sorted by their `rpn_cls_prob` value and the top 12000 are taken. NMS then removes, among these 12000 bboxes, those that overlap too much with a higher-confidence bbox (for example, an intersection-over-union greater than 0.7), and the 2000 highest-scoring remaining bboxes are output as `rois`.
CUDA implementation of NMS
We know that Python supports calling extensions implemented in C++ to make better use of CPU performance. Because of Python's GIL, multi-threaded Python code can only use one CPU core at a time, so computationally intensive tasks cannot raise CPU utilization. With a C++ extension it is easy to bring CPU utilization up to 100% and speed up the algorithm. Similarly, in Faster R-CNN, while conv, BN and other operations can be computed on the GPU by PyTorch, some algorithms not implemented by PyTorch would still be slow if computed directly on the CPU. In particular, easily parallelized algorithms like NMS can be implemented in CUDA C and compiled into Python extensions, which significantly improves the training speed of the model.
PyTorch now provides CPU and GPU implementations of NMS, ROI Pooling and ROI Align; the interfaces are in `torchvision.ops`: pytorch.org/docs/stable… . NMS can therefore be computed directly with `torchvision.ops.nms`. For learning purposes, however, we will still use the CUDA implementation provided in the repo. Also, as you will see below, that CUDA code has plenty of room for improvement (redundant video memory, frequent data transfers between CPU and GPU, and so on). If time permits, it is worth reading PyTorch's own NMS implementation as well.
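For comparison, the torchvision operator mentioned above can be used like this (the boxes and scores are made-up values):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 100., 100.],
                      [5., 5., 105., 105.],       # heavily overlaps the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.7)      # indices of the boxes kept, sorted by score
print(keep)                                       # tensor([0, 2]): box 1 is suppressed by box 0
```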
The Python NMS interface is in `lib.model.roi_layers.nms`, and from there it points to the C++ module `_C.nms`. The C++ function declaration is in `lib/model/csrc/nms.h`. As you can see, the input parameters and return value of this function use the `Tensor` type defined in ATen, the PyTorch C++ API. When CUDA is available, the input `Tensor` is passed to the `nms_cuda` function in `lib/model/csrc/cuda/nms.cu`:
```cpp
at::Tensor nms_cuda(const at::Tensor boxes, float nms_overlap_thresh)
```
These 12000 bboxes can be divided into several groups; within each group NMS screens out the better bboxes, and the per-group results are then merged. This can be regarded as approximately equivalent to running a single NMS over the original 12000 bboxes. There may be cases where a bbox in another group overlaps highly with the optimal bbox of the current group, but the primary role of NMS is to discard unwanted bboxes (i.e. to suppress elements that are not local maxima). So it is also a perfectly reasonable strategy to use parallel NMS to produce somewhat more than 2000 bboxes and then keep the 2000 with the highest scores.
In the CUDA code, a two-dimensional grid of 188×188 blocks is created. Each block is one-dimensional and contains 64 threads (the last block contains 32). In this way there are 187×64+32=12000 threads along the x or y direction of the grid, and each thread computes the overlap between one bbox and the optimal bbox of its block. For bboxes whose overlap exceeds 0.7, a 0/1 bit in a contiguous `mask` memory records whether the bbox is kept. Since all the bboxes have already been sorted on the Python side, each bbox within a block only needs to compute its overlap with the first bbox of that block. The overlap computation for a single bbox is done in the `nms_kernel` function, which is called as follows:
```cpp
nms_kernel<<<blocks, threads>>>(boxes_num,
                                nms_overlap_thresh,
                                boxes_dev,
                                mask_dev);
```
`<<<...>>>` is CUDA C-specific syntax for calling a CUDA global function from the host (i.e. the CPU). Here the bbox data has already been converted from `Tensor` to a `float` array.
The `nms_kernel` function has the following signature:
```cpp
__global__ void nms_kernel(const int n_boxes, const float nms_overlap_thresh,
                           const float *dev_boxes, unsigned long long *dev_mask)
```
It calls a device function to compute the overlap of two bboxes:
```cpp
__device__ inline float devIoU(float const * const a, float const * const b)
```
CUDA global functions (`__global__`) run on the device side (i.e. on the GPU) and can be called from the host through the `<<<...>>>` syntax, such as `nms_kernel` being called inside the `nms_cuda` function; device functions (`__device__`) run on the device and can only be called by other device functions or global functions, such as `devIoU` being called inside `nms_kernel`. A global function is usually called a "kernel": it is a single unit of computation after parallelization, and the position of the current kernel within the whole task can be obtained through `blockIdx`, `threadIdx`, etc. The kernel is, in effect, the body of the innermost level of a two- or multi-level for loop, and `blockIdx`, `threadIdx` are the index variables of those loops.
The 188×188 blocks can be regarded as a symmetric matrix, where the elements (i, j) and (j, i) both represent the overlap between the i-th and j-th bbox, so this computation actually wastes a lot of space and time.
RPN error calculation
The previously obtained tensor `rpn_cls_score` of size [4, 18, 38, 50] is passed into `_AnchorTargetLayer` to compute the error of the RPN network. In `_AnchorTargetLayer`, 17100 bboxes are generated as a grid in the same way as in `_ProposalLayer`; bboxes that extend beyond the 38×50 feature map are then discarded.
Suppose, for example, that 5944 valid bboxes fall within the feature map. `lib.model.rpn.bbox_transform.bbox_overlaps_batch()` is called to compute the overlap between each bbox and the ground truth. In this process, if the overlap between a bbox and some ground-truth bbox is greater than 0.7, its predicted label is set to 1 (foreground); if the overlap is less than 0.3, the label is set to 0 (background). After all bboxes have been processed, if the number of bboxes labeled 1 exceeds 128, some of them are randomly selected and their labels changed to -1 (neither foreground nor background); similarly, if there are more than 128 bboxes labeled 0, some of their labels are randomly changed to -1. In this way, among the class predictions of the 5944 bboxes there are exactly 128 foreground and 128 background, and all the remaining bboxes are labeled -1. A sketch of this labeling logic follows.
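The labeling just described can be sketched like this (an illustrative single-image version; the repo works on batched tensors and additionally keeps, for each ground-truth box, the anchor with the highest overlap as foreground):

```python
import torch

def assign_anchor_labels(overlaps, num_fg=128, num_total=256):
    """overlaps: (num_anchors, num_gt) IoU matrix for the anchors inside the image."""
    labels = overlaps.new_full((overlaps.size(0),), -1)   # -1: neither foreground nor background
    max_overlap, _ = overlaps.max(dim=1)
    labels[max_overlap >= 0.7] = 1                        # foreground
    labels[max_overlap < 0.3] = 0                         # background

    fg = torch.nonzero(labels == 1).view(-1)
    if fg.numel() > num_fg:                               # keep at most 128 foreground anchors
        drop = fg[torch.randperm(fg.numel())[:fg.numel() - num_fg]]
        labels[drop] = -1
    bg = torch.nonzero(labels == 0).view(-1)
    num_bg = num_total - min(fg.numel(), num_fg)
    if bg.numel() > num_bg:                               # and just enough background anchors
        drop = bg[torch.randperm(bg.numel())[:bg.numel() - num_bg]]
        labels[drop] = -1
    return labels
```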
Then `lib.model.rpn.bbox_transform.bbox_transform_batch()` is called to map the ground-truth bboxes into the same format as the RPN network's predictions, i.e. the x and y coordinates are converted to offsets and the width and height are log-scaled.
In this way we obtain the RPN network's predicted bbox class labels and coordinates, as well as the class labels and coordinates of the real bboxes. Since the class labels are discrete values (-1, 0, 1) while the coordinates are continuous, the class-label prediction is optimized as a classification problem with a cross-entropy loss, and the coordinates are optimized as a regression problem with the smooth-L1 loss. The input to the cross entropy is the 4×(128+128) bboxes with the highest overlap (the point is to manually adjust the ratio of positive to negative samples, so that the large number of -1 and 0 labels does not hurt the RPN's ability to recognize foreground objects), while the input to the smooth L1 is all the bbox coordinates.
The smooth L1 error is defined as follows:
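For reference, this is the standard smooth-L1 form used in Faster R-CNN; the σ factor (and its exact value in the repo's config) is an assumption about how the code parameterizes the loss:

$$
\mathrm{smooth}_{L_1}(x)=
\begin{cases}
0.5\,(\sigma x)^2, & \text{if } |x| < 1/\sigma^2\\
|x| - 0.5/\sigma^2, & \text{otherwise}
\end{cases}
$$

where $x$ is the difference between a predicted coordinate and its target value.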
Two manually set groups of parameters, `rpn_bbox_inside_weights` and `rpn_bbox_outside_weights`, adjust the coefficient of x in the formula, but the two groups of coefficients are identical and are initialized uniformly.
Second bbox screening
In the previous steps, the RPN network generated 17100 bboxes as a grid from the 9 predefined anchors (assuming an input image size of 600×800), then used NMS for a preliminary screening and kept 2000 bboxes, stored in the tensor `rois`. These are passed to `_ProposalTargetLayer` for further screening.
In `_ProposalTargetLayer`, the key is the `lib.model.rpn.proposal_target_layer_cascade._ProposalTargetLayer._sample_rois_pytorch` function. In this function the overlap between every candidate bbox and every ground-truth bbox is computed. Bboxes with an overlap of at least 0.5 are considered foreground, and their predicted class is set to the class with the highest overlap. If an image ends up with more than 32 foreground bboxes, some of them are discarded at random so that only 32 remain. Meanwhile, bboxes whose overlap with the ground truth is below 0.5 are considered background, and the number of background bboxes kept is chosen so that the total number of foreground + background bboxes per image is 128. A sketch of this sampling follows.
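This foreground/background sampling can be sketched as follows (a simplified single-image version with the thresholds described above; the repo additionally uses a lower bound for the background overlap interval):

```python
import torch

def sample_rois(overlaps, rois_per_image=128, fg_rois_per_image=32):
    """overlaps: (num_rois, num_gt) IoU matrix between candidate rois and ground-truth boxes."""
    max_overlap, gt_assignment = overlaps.max(dim=1)    # best-matching gt box for each roi
    fg_inds = torch.nonzero(max_overlap >= 0.5).view(-1)
    bg_inds = torch.nonzero(max_overlap < 0.5).view(-1)

    num_fg = min(fg_rois_per_image, fg_inds.numel())
    fg_keep = fg_inds[torch.randperm(fg_inds.numel())[:num_fg]]
    num_bg = rois_per_image - num_fg                     # fill the rest of the 128 with background
    bg_keep = bg_inds[torch.randperm(bg_inds.numel())[:num_bg]]

    keep = torch.cat([fg_keep, bg_keep])
    return keep, gt_assignment[keep]                     # kept roi indices and their matched gt index
```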
The coordinates of these 128 bboxes are also transformed, in a process similar to that of the RPN network, for regression optimization.
ROI Align
The role of ROI Pooling is to normalize the previously obtained 4×128 bboxes of different sizes (assuming a batch size of 4) to the same size, so that the subsequent conv layers can continue processing them.
Compared with ROI Pooling, ROI Align has an obvious advantage when the feature map is not evenly divisible by the target size, because ROI Align uses bilinear interpolation to take extreme values from feature maps of irregular size.
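The bilinear interpolation that ROI Align relies on can be sketched in a few lines (a NumPy version, in the same spirit as the CUDA device function shown later):

```python
import numpy as np

def bilinear_interpolate(feat, y, x):
    """Sample a 2-D feature map `feat` of shape (H, W) at a fractional location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    ly, lx = y - y0, x - x0                      # distances to the top-left neighbour
    return ((1 - ly) * (1 - lx) * feat[y0, x0] + (1 - ly) * lx * feat[y0, x1]
            + ly * (1 - lx) * feat[y1, x0] + ly * lx * feat[y1, x1])
```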
ROI Align can also be parallelized, so in the code it is likewise implemented in CUDA. The output tensor of ROI Align has dimensions [512, 1024, 7, 7]: 1024 comes from the output of ResNet's third group of residual blocks, and 512 is exactly 4×128, meaning that the 128 predicted bboxes of each of the 4 input images are processed in one pass. The 512 in the first dimension also indicates that, from ROI Align onwards, each predicted bbox is treated as an independent sample.
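As with NMS, torchvision also exposes this operator directly; a usage sketch with the shapes from this section (the rois here are random and only meant to show the expected format):

```python
import torch
from torchvision.ops import roi_align

base_feat = torch.randn(4, 1024, 38, 50)              # feature map produced by RCNN_base
# rois in (batch_index, x1, y1, x2, y2) format, in input-image coordinates, 4 x 128 = 512 rows
xy1 = torch.rand(512, 2) * 300
wh = torch.rand(512, 2) * 150 + 16
rois = torch.cat([torch.randint(0, 4, (512, 1)).float(), xy1, xy1 + wh], dim=1)

pooled = roi_align(base_feat, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 16,             # the backbone stride up to this point
                   sampling_ratio=0)
print(pooled.shape)                                    # torch.Size([512, 1024, 7, 7])
```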
Python's ROI Align interface is defined in `lib.model.roi_layers.roi_align`. It points to the C++ extensions `_C.roi_align_forward` and `_C.roi_align_backward` (for the forward pass and backpropagation respectively); the corresponding C++ header file is `lib/model/csrc/ROIAlign.h`.
When CUDA is available, for example, the forward function is located in `lib/model/csrc/cuda/ROIAlign_cuda.cu`:
```cpp
at::Tensor ROIAlign_forward_cuda(const at::Tensor& input,
                                 const at::Tensor& rois,
                                 const float spatial_scale,
                                 const int pooled_height,
                                 const int pooled_width,
                                 const int sampling_ratio)
```
The inputs are again `at::Tensor`, and a grid of 4096 blocks is created (ideally there would be 1024×7×7 blocks, so that each block corresponds to one element of the output tensor); each block has 512 threads, corresponding to the sample dimension of the output tensor.
Again, `<<<...>>>` is used here to call the global function `RoIAlignForward`:
```cpp
template <typename T>
__global__ void RoIAlignForward(const int nthreads, const T* bottom_data,
                                const T spatial_scale, const int channels,
                                const int height, const int width,
                                const int pooled_height, const int pooled_width,
                                const int sampling_ratio,
                                const T* bottom_rois, T* top_data)
```
Here the template type `T` is instantiated as `float` so that sub-pixel coordinates can be computed. For example, in a 2×2 region the four points (0.5, 0.5), (0.5, 1.5), (1.5, 0.5) and (1.5, 1.5) are sampled, and bilinear interpolation at these points gives the local extreme pixel value.
The interpolation is defined in the device function `bilinear_interpolate`:
```cpp
template <typename T>
__device__ T bilinear_interpolate(const T* bottom_data,
                                  const int height, const int width,
                                  T y, T x,
                                  const int index /* index for debug only*/)
```
Final classification of bboxes
The [512, 1024, 7, 7] tensor generated by ROI Align is passed to ResNet's fourth group of residual blocks to obtain a tensor of size [512, 2048]. Note that the BN layers in this residual block do not participate in training, but the conv layers are trained on the new samples.
This [512, 2048] tensor passes through the `RCNN_bbox_pred` layer to produce a [512, 4] tensor `bbox_pred`, the predicted coordinates of the 4×128 bboxes.
At the same time, the [512, 2048] tensor also passes through the `RCNN_cls_score` layer to produce a [512, 21] tensor `cls_score`, the category predictions (20 positive classes + 1 background class) of the 4×128 bboxes.
Then the classification error `RCNN_loss_cls` is computed from `cls_score` with the cross-entropy loss function, and the coordinate regression error `RCNN_loss_bbox` is computed from `bbox_pred` with the smooth-L1 loss function.
Finally, the two sets of errors of RPN and the two sets of errors generated here are added together to form the final error of the model:
```python
loss = rpn_loss_cls.mean() + rpn_loss_box.mean() \
     + RCNN_loss_cls.mean() + RCNN_loss_bbox.mean()
```
Model training
The README of the original repo does not cover all the preparation needed to train the network, so here are some additions.
Installing the COCO API
The approach requiring the least modification is to install the COCO API under the data folder of the code directory:
```bash
cd data && git clone https://github.com/cocodataset/cocoapi.git && cd cocoapi/PythonAPI && make
```
Extra subset of COCO samples
If training with the COCO dataset, besides downloading the COCO dataset itself, two extra annotation subsets may need to be added; the download address is dataset.d2.mpi-inf.mpg.de/Hosang17cvp… .
Training script
After modifying `self.model_path` in `lib/model/faster_rcnn/resnet.py` and `__C.DATA_DIR` in `lib/model/utils/config.py`, compile the CUDA code with the following commands (don't forget to install the dependency packages listed in requirements.txt):
```bash
cd lib
python setup.py build develop
```
Start training with the following command:
```bash
python trainval_net.py --dataset pascal_voc --net res101 --bs 4 --nw 8 --lr 4e-3 --lr_decay_step 8 --epochs 10 --cuda
```
On an RTX 2080 Ti, training on PASCAL VOC 2007 takes about 130 minutes, with a video memory footprint of about 9309 MB and a host memory footprint of about 2709 MB.
Model prediction
Verify the performance of the model on the test set with the following command:
```bash
python test_net.py --dataset pascal_voc --net res101 --checksession 1 --checkepoch 10 --checkpoint 2504 --cuda --load_dir models
```
The resulting mAP is 0.7573.
Summary
To sum up, the pipeline of Faster R-CNN is as follows: 9 anchors are defined in advance; the RPN network generates a large number of bboxes as a grid from these 9 anchors, then uses NMS for a preliminary screening and keeps 2000 bboxes. During this screening no ground-truth information is used, only the foreground/background predictions given by the RPN network; however, the ground truth is used to compute the RPN error, where the bbox category is optimized by classification and the coordinates by regression. Then a second screening is performed among the 2000 bboxes according to how well they match the ground truth, producing 128 bboxes. Finally, the specific classification error of these bboxes (if the training set contains 20 categories, 21 classes need to be predicted, the extra one being the background class) and the coordinate error are computed, again optimized by classification and regression respectively.