A Survey of Instance Segmentation Methods
Abstract
Instance segmentation is an important research topic in computer vision. It underpins applications in geographic information systems, medical imaging, autonomous driving, robotics, and other fields, and is therefore of significant research value. This paper reviews the development and latest progress of instance segmentation. It first introduces the basic logic of instance segmentation, then summarizes the main research methods, their principles, and their network architectures, and analyzes the published mainstream instance segmentation methods. Finally, it analyzes the current problems and future development trends of the instance segmentation task and proposes some feasible solutions.
Keywords: instance segmentation, image segmentation, semantic segmentation, deep learning
1. Introduction
Image segmentation refers to dividing an image into a number of non-intersecting regions according to features such as gray level, color, spatial texture, and geometric shape, so that these features show consistency or similarity within the same region but obvious differences across regions. An example is shown in the figure below.
[Figure: image segmentation example]
Object detection identifies what is in an image and localizes it, as shown in the figure below, taking the recognition and detection of a person as an example.
[Figure: person detection example]
Semantic segmentation assigns a category label to every pixel in the image, as shown in the figure below.
[Figure: semantic segmentation example]
Instance segmentation is a combination of object detection and semantic segmentation: objects are detected in the image (object detection), and then each of their pixels is labeled (semantic segmentation).
The purpose of instance segmentation is to detect the objects in an input image and assign a category label to each of their pixels. Instance segmentation can distinguish different instances of the same foreground semantic category, which is its biggest difference from semantic segmentation. Instance segmentation emerged later than semantic segmentation, so instance segmentation models are mainly based on deep learning, but the task is nonetheless an important part of image segmentation. With the development of deep learning, instance segmentation methods such as SDS (Simultaneous Detection and Segmentation), DeepMask, and the MultiPath network appeared in succession, gradually improving segmentation accuracy and efficiency.
1.1 Problems and difficulties in instance segmentation
A. Small-object segmentation. Deep neural networks generally have larger receptive fields and are more robust to pose, deformation, illumination, and so on, but their resolution is lower and details are lost. Shallow neural networks have smaller receptive fields, rich details, and high resolution, but lack semantic information. Therefore, if an object is small, it has few details in the shallow CNN layers, and those details almost disappear in the deep layers. Solutions to this problem include dilated convolution (see the sketch after this list) and feature-resolution enhancement.
B. Geometric transformations. CNNs are not inherently spatially invariant to geometric transformations.
C. Occlusion. Occlusion causes the loss of target information. Several methods have been proposed to address it, such as deformable ROI pooling, deformable convolution, and adversarial networks; alternatively, GANs can be used.
D. Image degradation. Causes of image degradation include poor lighting, low camera quality, and image compression. However, most datasets (ImageNet, COCO, PASCAL VOC, etc.) do not exhibit image degradation.
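As a minimal illustration of the dilated-convolution remedy mentioned in item A, the following PyTorch sketch (with hypothetical channel sizes) shows how dilation enlarges the receptive field without reducing spatial resolution:

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation=2 covers a 5x5 receptive field while keeping
# the spatial resolution unchanged (stride=1, padding=2), which is why
# dilated convolution helps preserve the details of small objects.
x = torch.randn(1, 64, 56, 56)                                      # dummy feature map
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)              # 3x3 receptive field
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # 5x5 receptive field
assert standard(x).shape == dilated(x).shape == x.shape
```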
2. Basic process of instance segmentation
An instance segmentation model generally consists of three parts: image input, instance segmentation processing, and output of the segmentation result. After the image is input, the model typically uses a backbone network such as VGGNet or ResNet to extract image features, which are then processed by the instance segmentation model. Within the model, either object detection first determines the location and category of each target instance and segmentation is then performed inside the selected regions, or semantic segmentation is performed first and different instances are then distinguished; finally, the instance segmentation result is output. A schematic of this flow is sketched below.
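The following Python sketch illustrates the two flows just described. Every function here is a hypothetical stand-in for a trained network component, not a real API:

```python
import numpy as np

def backbone(image):
    """Stand-in for a VGGNet/ResNet backbone: image -> feature map."""
    h, w = image.shape[:2]
    return np.random.rand(256, h // 16, w // 16)

def detect_then_segment(image):
    """Route 1: locate instances first, then segment inside each box."""
    feats = backbone(image)
    boxes = [(40, 40, 200, 160)]                              # detection head (stub)
    masks = [np.zeros(image.shape[:2], bool) for _ in boxes]  # per-box mask head (stub)
    return boxes, masks

def segment_then_group(image):
    """Route 2: label every pixel first, then group pixels into instances."""
    feats = backbone(image)
    semantic = np.zeros(image.shape[:2], np.int64)            # semantic head (stub)
    return semantic                                           # grouping happens afterwards

boxes, masks = detect_then_segment(np.zeros((256, 256, 3), np.float32))
```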
2.1 Main technical routes of instance segmentation
Research on instance segmentation has long followed two lines: bottom-up methods based on semantic segmentation and top-down methods based on detection, both of which are two-stage approaches.
Top-down instance segmentation method
The idea is as follows: first, the region containing each instance is found by an object detector (as a bounding box), semantic segmentation is then carried out inside each detection box, and each segmentation result is output as a different instance. Detection is usually followed by segmentation, as in FCIS, Mask R-CNN, PANet, and Mask Scoring R-CNN.
The originator of top-down dense instance segmentation is DeepMask, which predicts a mask proposal at each spatial position using a sliding-window approach. This approach has three drawbacks:
- The association between masks and features (local consistency) is lost, as in DeepMask, where a fully connected network is used to extract masks
- Feature extraction is redundant; for example, DeepMask extracts a mask for every foreground feature
- Location information is lost due to downsampling (convolutions with stride larger than 1)
Bottom-up instance segmentation method
Each instance is treated as a category: based on clustering, per-pixel embeddings are learned so that the distance between instances is maximized and the distance within an instance is minimized, and pixels are finally grouped into different instances. As a grouping method, bottom-up is usually inferior to top-down.
The idea is as follows: pixel-level semantic segmentation is carried out first, and different instances are then distinguished by clustering and metric learning (see the sketch after this list). While this approach preserves better low-level characteristics (details and location information), it also suffers from the following disadvantages:
- The demand for high-quality dense segmentation can lead to sub-optimal results
- Poor generalization ability, unable to deal with complex scenarios with multiple categories
- The post-processing method is cumbersome
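As a toy illustration of the grouping step, the sketch below clusters hypothetical per-pixel embeddings with DBSCAN; real methods learn the embeddings with a metric-learning loss, but the grouping idea is the same:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical embeddings for the foreground pixels of one image, produced
# by a metric-learning segmentation head: shape (num_pixels, embed_dim).
embeddings = np.random.rand(500, 8)

# Pixels with nearby embeddings are grouped into the same instance;
# label -1 marks pixels that could not be assigned to any instance.
instance_ids = DBSCAN(eps=0.3, min_samples=20).fit_predict(embeddings)
```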
Single-shot instance segmentation was influenced by research on single-stage object detection, and accordingly takes two forms. One is inspired by one-stage, anchor-based detection models such as YOLO and RetinaNet; representative works include YOLACT and SOLO. The other is inspired by anchor-free detection models such as FCOS; representative works include PolarMask and AdaptIS.
2.2 Development history of instance segmentation methods
The main development timeline is shown below.
[Figure: development timeline of instance segmentation methods]
2.3 Classification of instance segmentation methods by network architecture
There are four main approaches: classification of mask proposals, detection followed by segmentation, labelling pixels followed by clustering, and dense sliding window methods.
The categories and their main techniques are summarized in the table below.
[Table: the four categories of instance segmentation methods and their main techniques]
The advantages and disadvantages of these four types of methods are as follows:
2.3.1 Classification of mask proposals
[Figure: general framework of classification-of-mask-proposals techniques]
2.3.2 Detection followed by segmentation
[Figure: general framework of detection-followed-by-segmentation techniques]
2.3.3 Labelling pixels followed by clustering
This class of methods benefits from semantic segmentation and can predict high-resolution object masks. Compared with detection-followed-by-segmentation techniques, labelling-pixels-followed-by-clustering methods achieve lower accuracy on the common benchmarks. Because pixel labelling is computationally intensive, they also tend to require more computing power.
[Figure: general framework of labelling-pixels-followed-by-clustering techniques]
2.3.4 Dense sliding window method
[Figure: general framework of dense sliding window methods]
3. Typical methods of instance segmentation
3.1 DeepMask
The DeepMask network uses VGGNet to extract features from the input image and generates segmentation proposals. The extracted features are shared by two branches: the first predicts a segmentation mask for the selected object, and the second predicts an objectness score for the input patch. The network was validated on the PASCAL VOC 2007 and MS COCO datasets with good segmentation accuracy.
3.2 Fast R-CNN
Fast R-CNN solves several problems of R-CNN and thus improves object detection. Fast R-CNN trains the detector end to end: it simplifies training by learning the softmax classifier and the class-specific bounding-box regression jointly, rather than training the individual components of the model separately as R-CNN does. Fast R-CNN shares convolutional computation across region proposals and adds an ROI pooling layer between the last convolutional layer and the first fully connected layer to extract a fixed-size feature for each region proposal; in effect, it warps regions at the feature level instead of at the image level. The features from the ROI pooling layer are fed into a stack of fully connected layers that finally branch into two outputs: softmax probabilities over object categories and refined per-class bounding-box offsets. Compared with R-CNN, Fast R-CNN greatly improves efficiency, with training about 3 times faster and testing about 10 times faster. A usage sketch of the ROI pooling operation follows.
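A minimal usage sketch of ROI pooling via torchvision's `roi_pool`, with hypothetical shapes and one made-up proposal; boxes are given as (batch_index, x1, y1, x2, y2) in input-image coordinates:

```python
import torch
from torchvision.ops import roi_pool

feats = torch.randn(1, 256, 50, 50)                # stride-16 features of an 800x800 image
rois = torch.tensor([[0., 64., 64., 320., 192.]])  # one region proposal
# Each proposal is pooled to a fixed 7x7 grid, ready for the FC layers.
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```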
3.3 Mask R-CNN (2017.3)
Mask R-CNN, proposed by He et al. [39], is a new instance segmentation model that extends Faster R-CNN [40]. Mask R-CNN is a two-stage method: in the first stage, a Region Proposal Network (RPN) generates candidate regions of interest (ROIs); in the second stage, the model predicts each ROI's category, bounding-box offset, and binary mask. The mask is predicted by a newly added third branch, which is what distinguishes Mask R-CNN from other methods. In addition, Mask R-CNN proposed ROIAlign, which avoids quantization when pooling features so that segmented instances are located more accurately.
R-CNN combines AlexNet with region proposals obtained by selective search. Training the R-CNN model involves the following steps. The first step computes class-agnostic region proposals using selective search. The next step fine-tunes a pre-trained CNN model such as AlexNet using these region proposals. Next, the features extracted by the CNN are used to train a set of class-specific support vector machine (SVM) classifiers, which replace the softmax classifier learned during fine-tuning. Finally, class-specific bounding-box regression is trained for each object class using the CNN features.
Its structure is very similar to that of Faster R-CNN, but there are three major differences:
- The ResNet-FPN structure is adopted as the base network; multi-level feature maps help detect multi-scale and small objects. The original FPN outputs feature maps for the P2, P3, P4, and P5 stages, but Mask R-CNN adds a P6, obtained by max-pooling P5, to provide features with a larger receptive field. This level is used only in the RPN.
- The RoI Align method is proposed to replace RoI Pooling, because the rounding in RoI Pooling loses accuracy, which is fatal for the segmentation task. RoI Align, proposed by Mask R-CNN, cancels the rounding operation and keeps all floating-point coordinates, obtaining the values of multiple sampling points by bilinear interpolation; the final value of each bin is then obtained by max-pooling over its sampling points (see the sketch after this list).
- After obtaining the features of each region of interest, a mask branch is added to predict the category of each pixel, on top of the original classification and regression. The implementation adopts the FCN (Fully Convolutional Network) structure, building an end-to-end network with convolution and deconvolution; each pixel is finally classified, achieving a good segmentation effect.
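A minimal sketch of RoI Align via torchvision's `roi_align`, again with hypothetical shapes; `sampling_ratio` sets the number of bilinear sampling points per bin, and `aligned=True` removes the half-pixel offset, so no coordinate is ever rounded:

```python
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 256, 50, 50)                # stride-16 feature map
rois = torch.tensor([[0., 64., 64., 320., 192.]])  # (batch_idx, x1, y1, x2, y2)
out = roi_align(feats, rois, output_size=(14, 14),
                spatial_scale=1 / 16, sampling_ratio=2, aligned=True)
print(out.shape)  # torch.Size([1, 256, 14, 14]) -> fed to the mask head
```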
The main steps of the Mask R-CNN algorithm are as follows (a minimal inference sketch follows the list):
- First, the input image is fed into the feature-extraction network to obtain feature maps.
- Then a fixed number of ROIs (also known as anchors) are set for each position of the feature map, and these are fed into the RPN for binary classification (foreground vs. background) and coordinate regression, yielding refined ROI regions.
- The ROIAlign operation proposed in the paper is performed on the ROIs obtained in the previous step: the original image is first aligned with the pixels of the feature map, and the feature map is then aligned with the fixed-size ROI features.
- Finally, these ROI regions undergo multi-class classification and candidate-box regression, and an FCN mask-generation branch is introduced to complete the segmentation task.
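For reference, torchvision ships a ready-made Mask R-CNN; the sketch below runs it on a dummy image to show the output format (boxes, labels, scores, and soft masks):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(pretrained=True).eval()
image = torch.rand(3, 480, 640)              # dummy RGB image with values in [0, 1]
with torch.no_grad():
    out = model([image])[0]                  # one result dict per input image
# out["boxes"]: (N, 4), out["labels"]: (N,), out["scores"]: (N,),
# out["masks"]: (N, 1, 480, 640) soft masks; threshold to get binary masks.
binary_masks = out["masks"].squeeze(1) > 0.5
```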
In general, with the support of Faster R-CNN and FPN, Mask R-CNN opened the door to multi-task learning within the R-CNN structure. It appeared later than some other instance segmentation methods (such as FCIS), but it established the dominance of the proposal-based instance segmentation approach (even though the detect-first, segment-later logic is not entirely natural).
Mask R-CNN uses the object boxes produced by the R-CNN stage to distinguish instances and then segments within each box. The obvious problem is that if a box is wrong, the segmentation result will be wrong too, so it is not a good scheme for tasks requiring high edge accuracy. Moreover, because everything depends on box accuracy, results for some non-rectangular objects are easily degraded.
3.4 PANet
PANet is a two-stage instance segmentation model proposed by Liu et al. [41]. To shorten the information path, the model uses precise low-level localization information to enhance the feature pyramid, creating a bottom-up path augmentation; this shortens the information path between low-level and top-level features of the deep network. To restore the broken information pathway between candidate regions and all feature levels, Liu et al. [41] developed adaptive feature pooling, which aggregates the features of all levels within each candidate region, so that information from every level flows to the subsequent sub-networks that generate predictions. In addition, the model uses a fully connected layer to augment mask prediction; owing to the complementary nature of the fully convolutional network, this alternate branch captures a different view of each candidate region. As a result of these improvements, PANet ranked first in the MS COCO 2017 instance segmentation task.
3.5 YOLACT (2019.4)
[Figure: YOLACT architecture]
- YOLACT adds a mask branch to an existing one-stage object detection model in the same way that Mask R-CNN adds one to Faster R-CNN, but without an explicit localization step.
- YOLACT splits the instance segmentation task into two parallel sub-tasks: (1) generating k prototype masks for each image through a Protonet network; (2) predicting k mask coefficients for each instance. The instance masks are finally generated by linear combination. In this process, the network learns how to localize masks for instances of different positions, colors, and semantics.
- YOLACT breaks the problem into two parallel parts, using FC layers (good at producing semantic vectors) to generate the mask coefficients and conv layers (good at producing spatially coherent masks) to generate the prototype masks. Because the prototypes and mask coefficients are computed independently, the overhead on top of the backbone detector comes mainly from the assembly step, which can be implemented as a single matrix multiplication (see the sketch below). In this way, spatial consistency is maintained in feature space while the model remains one-stage and fast.
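A NumPy sketch of the assembly step with hypothetical sizes: the prototypes P and the per-instance coefficients C are combined as a single matrix multiplication followed by a sigmoid, M = sigmoid(P C^T):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

k, h, w, n = 32, 138, 138, 10                 # prototypes, spatial size, detections
prototypes = np.random.rand(h, w, k)          # Protonet output P
coefficients = np.random.randn(n, k)          # mask coefficients C from the head
# One soft mask per detected instance, from a single matrix multiplication.
masks = sigmoid(prototypes.reshape(-1, k) @ coefficients.T).reshape(h, w, n)
```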
Backbone: ResNet-101 + FPN, the same as RetinaNet. Protonet: an FCN attached to an FPN output that predicts the prototype masks for the whole image. Prediction head: compared with the RetinaNet head, it has an extra mask-coefficient branch, so it outputs 4 + C + k values per anchor.
As described, a mask-coefficient branch is added to the head, and the prototypes are combined with the coefficients to obtain the masks. Because of the NMS stage, accurate bbox predictions are still needed, and soft-NMS is not a suitable replacement here. Note that during training, the ground-truth bbox is used to crop the assembled full-image segmentation before computing the loss against the ground-truth mask; this presupposes reasonably accurate boxes and alleviates the foreground/background pixel imbalance.
The subsequent YOLACT++ mainly adds mask re-scoring and the DCN structure to further improve accuracy: (1) following Mask Scoring R-CNN, a fast mask re-scoring branch is added to better assess the quality of instance masks; (2) deformable convolution (DCN) is introduced into the backbone network; (3) the anchor design of the prediction head is optimized.
3.6 PolarMask (2019.10)
PolarMask further refines the description of the boundary, making it suitable for mask prediction. Its most important features are: (1) anchor-free and bbox-free, requiring no detection box; (2) fully convolutional: compared with FCOS, the regression target goes from 4 values (a box) to 36 rays, expressing instance segmentation and object detection within the same modeling framework.
PolarMask models the contour in a polar coordinate system, transforming instance segmentation into an instance-center classification problem and a dense distance-regression problem. The authors also proposed two effective methods, Polar Centerness and Polar IoU Loss, to optimize the sampling of high-quality positive samples and the dense distance regression. Without any tricks (multi-scale training, extended training time, etc.), PolarMask achieved 32.9 mAP on COCO test-dev with ResNeXt-101. This was the first demonstration that the seemingly more complex instance segmentation problem can be made as simple, in network design and computational complexity, as anchor-free object detection.
[Figure: PolarMask architecture]
The whole network is as simple as FCOS: first the standard backbone + FPN, then the head, where the bbox branch of FCOS is replaced with a mask branch, simply changing channel=4 to channel=n with n=36, the number of rays. A new Polar Centerness replaces the bbox centerness of FCOS. PolarMask and FCOS thus show no significant difference in network complexity. A decoding sketch of the polar representation follows.
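A NumPy sketch (with a made-up center and ray lengths) of how the polar representation decodes into a contour: each of the n = 36 rays leaves the instance center at a fixed angle, and the regressed length of the ray gives one polygon vertex:

```python
import numpy as np

n = 36
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)  # one fixed angle per ray
cx, cy = 100.0, 120.0                 # predicted instance center (hypothetical)
lengths = np.full(n, 40.0)            # regressed ray lengths (hypothetical)

xs = cx + lengths * np.cos(angles)
ys = cy + lengths * np.sin(angles)
contour = np.stack([xs, ys], axis=1)  # (36, 2) polygon approximating the mask
```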
3.7 SOLO (2019.12)
SOLO divides an image into an S×S grid, giving S² positions. Unlike TensorMask and DeepMask, which put the mask on the channel dimension of the feature map, SOLO borrows from semantic segmentation and puts the defined category of the object's center position on the channel dimension, thereby retaining geometric structure information.
Essentially, an instance category approximates the location of an instance's center. Thus, classifying each pixel into its corresponding instance category is equivalent to regressing the object center pixel by pixel, turning a position prediction task from a regression problem into a classification problem. The significance is that classification can model a variable number of instances with a fixed number of channels far more intuitively and simply, without relying on post-processing such as grouping or learned pixel embeddings.
To handle scale, SOLO uses FPN to assign objects of different sizes to feature maps of different levels, which serve as the objects' scale categories. In this way, all instances are separated, and objects can be classified by instance category.
SOLO divides the picture into an S×S grid; if an object's center (of mass) falls in a grid cell, that cell takes on two tasks: (1) predicting the object's semantic category and (2) predicting the object's instance mask. These correspond to the network's two branches, the category branch and the mask branch. Meanwhile, SOLO attaches FPN behind the backbone to handle scale: after each FPN level, the two parallel branches perform category and mask prediction, with a different grid number per level (small instances correspond to more grid cells).
Category branch: responsible for predicting the semantic category of objects, outputting S×S×C scores, i.e. one C-dimensional category vector per grid cell, similar to YOLO. This branch uses focal loss.
Mask branch: an intuitive way to predict instance masks is to use an FCN as in semantic segmentation, but FCNs are spatially invariant, whereas location information is needed here. The authors therefore use CoordConv: the pixel coordinates x and y (normalized to [-1, 1]) are concatenated with the input features before being fed into the network, so the input dimension becomes H×W×(D+2), as sketched below.
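A minimal PyTorch sketch of the CoordConv input construction (shapes are hypothetical): two channels holding the normalized x and y coordinates are concatenated to the feature map:

```python
import torch

def coord_conv_input(feats):
    """(N, D, H, W) -> (N, D + 2, H, W) with normalized coordinate channels."""
    n, _, h, w = feats.shape
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([feats, xs, ys], dim=1)

feats = torch.randn(2, 256, 64, 64)   # FPN features (hypothetical shape)
out = coord_conv_input(feats)         # torch.Size([2, 258, 64, 64])
```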
Experimental results: [Figure: experimental results]
The accuracy of SOLO has surpassed that of Mask R-CNN, and it also holds a clear advantage over PolarMask, which follows a similar line of thinking.
3.8 RDSNet (2019.12)
[Figure: RDSNet architecture]
The starting point of RDSNet is that detection should not become a bottleneck for segmentation; the two should promote each other. Segmentation itself may be fairly accurate, yet inaccurate localization can still make the final result poor; conversely, if the segmentation result were known in advance, the detection result would improve.
RDSNet obtains its detection and segmentation results in the manner of YOLACT, but also incorporates foreground/background handling from an embedding perspective (experiments show that correlating foreground and background works better than a foreground-only linear combination). After the bbox predictions are obtained, NMS and an expand operation are applied to ensure that as much of the valid region as possible is selected (expansion factor 1.5 for training, 1.2 for testing). A mask-based boundary refinement module then adjusts the object's box.
3.9 PointRend (2019.12)
[Figure: PointRend architecture]
PointRend borrows an idea from rendering: with adaptive (non-uniform) sampling rather than a fixed grid, jagged edges do not become obvious when the display scale changes. Accordingly, PointRend uses non-uniform sampling to decide which boundary points to re-examine as the resolution is increased, and re-determines which instance those points belong to. In essence it is a new upsampling method, optimized for segmenting object edges, so that it performs better on the edge regions that are hard to segment.
The PointRend method can be summarized as an iterative upsampling process (a sketch follows the steps):
While output resolution < image resolution:
- The coarse prediction_i is upsampled by 2× bilinear interpolation.
- N "hard points" are picked out: points whose result is likely to differ from that of the surrounding points (such as points on object edges).
- For each hard point, a "representation vector" is built from two parts: fine-grained features, obtained by bilinear interpolation at the point's coordinates on a low-level feature map (similar to RoI Align), and the coarse prediction from step 1.
- An MLP computes a new prediction from the "representation vector", updating coarse prediction_i to coarse prediction_i+1. This MLP can be regarded as a small network of 1×1 convolutions that operates only on the "hard" representation vectors.
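The sketch below is a hypothetical reconstruction of one refinement step for a binary mask: uncertainty is taken as closeness of the probability to 0.5, and a toy 1×1 convolution stands in for the MLP. It follows the four steps above but is not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def pointrend_step(coarse, fine_feats, mlp, n_points=1000):
    """One refinement step. coarse: (N, 1, H, W) mask logits;
    fine_feats: (N, D, H2, W2) low-level features."""
    # Step 1: 2x bilinear upsampling of the coarse prediction.
    coarse = F.interpolate(coarse, scale_factor=2, mode="bilinear",
                           align_corners=False)
    n, _, h, w = coarse.shape
    # Step 2: pick the n_points most uncertain points (prob closest to 0.5).
    prob = coarse.sigmoid().view(n, -1)
    idx = (-(prob - 0.5).abs()).topk(n_points, dim=1).indices        # (N, P)
    ys = torch.div(idx, w, rounding_mode="floor").float() / (h - 1) * 2 - 1
    xs = (idx % w).float() / (w - 1) * 2 - 1
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(1)                # (N, 1, P, 2)
    # Step 3: representation vector = fine-grained features + coarse logits.
    fine = F.grid_sample(fine_feats, grid, align_corners=False)      # (N, D, 1, P)
    old = F.grid_sample(coarse, grid, align_corners=False)           # (N, 1, 1, P)
    rep = torch.cat([fine, old], dim=1).squeeze(2)                   # (N, D+1, P)
    # Step 4: the "MLP" (1x1 convs) re-predicts only the hard points.
    new = mlp(rep)                                                   # (N, 1, P)
    coarse.view(n, -1).scatter_(1, idx, new.view(n, -1))
    return coarse

mlp = torch.nn.Conv1d(256 + 1, 1, kernel_size=1)  # toy stand-in for the MLP
coarse = torch.randn(1, 1, 28, 28)
feats = torch.randn(1, 256, 112, 112)
refined = pointrend_step(coarse, feats, mlp)      # (1, 1, 56, 56)
```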
3.10 BlendMask (2020.1)
[Figure: BlendMask architecture]
BlendMask is a one-stage dense instance segmentation method that combines the ideas of top-down and bottom-up approaches. On top of the anchor-free detector FCOS, a bottom module is added to extract low-level detail features and an instance-level attention map is predicted. Drawing on the fusion methods of FCIS and YOLACT, a Blender module is proposed to better integrate these two kinds of features. In the end, BlendMask surpassed Mask R-CNN on COCO in both accuracy (41.3 AP) and speed (BlendMask-RT: 34.2 mAP at 25 FPS on a 1080 Ti).
The detector module directly uses FCOS, while the BlendMask module consists of three parts: a bottom module that processes low-level features, whose output score maps are called Bases; a top layer attached in series to the detector's box head, which generates the top-level attention corresponding to the Bases; and finally a Blender that combines the Bases with the attention, as sketched below.
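A NumPy sketch of the Blender idea for a single instance, with hypothetical sizes: the K-channel attention map (already resized to the Base resolution) is softmax-normalized over K and used to weight and sum the Bases:

```python
import numpy as np

K, H, W = 4, 56, 56
bases = np.random.rand(K, H, W)    # bottom-module score maps ("Bases")
attn = np.random.randn(K, H, W)    # top-level attention for one instance
attn = np.exp(attn) / np.exp(attn).sum(axis=0, keepdims=True)  # softmax over K
mask = (attn * bases).sum(axis=0)  # blended (H, W) instance mask
```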
Advantages of BlendMask:
- Low computation: the one-stage detector FCOS saves computation compared with the RPN used by Mask R-CNN, avoiding the position-sensitive feature map and mask-feature computation.
- Low merging cost: the proposed attention-guided Blender module computes the global map representation with roughly ten times less computation than the more complex hard alignment used in FCN and FCIS at the same resolution.
- Higher mask quality: BlendMask is a dense pixel-prediction method whose output resolution is not limited by the top-level sampling. In Mask R-CNN, obtaining more accurate mask features requires increasing the RoIPooler resolution, which doubles the head's computation time and network depth.
- Stable inference time: the inference time of Mask R-CNN increases with the number of detected bboxes, while BlendMask's inference is faster and its growth in time is negligible.
- Flexible: it can be added to other detection algorithms.
3.11 TensorMask
[Figure: TensorMask architecture]
TensorMask treats instance segmentation as 4D tensor prediction; its core idea is to use structured 4D tensors to represent masks over the spatial domain.
TensorMask is a dense sliding-window instance segmentation framework, and the first such framework to achieve results qualitatively and quantitatively close to Mask R-CNN. TensorMask establishes a complementary direction for research on instance segmentation.
3.12 Comparison of the main methods on the COCO dataset
[Table: comparison of the main methods on COCO]
4. Common datasets for instance segmentation
Commonly used datasets for instance segmentation include PASCAL VOC, MS COCO, Cityscapes, and ADE20K. This section introduces several of them in terms of the number of images, categories, and samples.
4.1 PASCAL VOC dataset
The VOC dataset is one of the mainstream computer-vision datasets, usable for classification, segmentation, object detection, action detection, and person layout. From 2005 to 2012, PASCAL VOC released sub-datasets for image classification, object detection, image segmentation, and other tasks every year and held a world-class computer vision competition. The PASCAL VOC dataset initially covered 4 categories and stabilized at 21 for the segmentation task, covering aircraft, bicycles, boats, buses, cars, horses, motorcycles, trains, and other categories. The number of images stabilized at 11,540 from an early 1,578. PASCAL VOC consists of training and test sets, with a separate test set for the actual competition. After 2012 the PASCAL VOC competition was discontinued, but the dataset is open source and can be downloaded for use.
4.2 Microsoft Common Objects in Context (MS COCO)
MS COCO is another large-scale dataset for object detection, segmentation, and captioning. It contains many categories and a large number of labels: 91 object categories, 328,000 images (more than 80,000 for training, more than 40,000 for validation, and more than 80,000 for testing), and 2.5 million annotated instances. MS COCO is a popular dataset with many images per object category, fine annotation, and highly diverse scenes.
4.3 Cityscapes
Cityscapes is another large-scale dataset focused on semantic understanding of urban street scenes. It contains stereo video sequences recorded in street scenes of 50 cities, with 5,000 frames of high-quality pixel-level annotation (2,975 for training, 500 for validation, and 1,525 for testing) and a further set of 20,000 coarsely annotated frames. Usually only the finely annotated portion is used. Cityscapes is a popular street-scene dataset.
4.4 ADE20K
ADE20K is a newer scene-understanding dataset with more than 20,000 images in total: 20,210 in the training set, 2,000 in the validation set, and 3,352 in the test set, densely annotated with an open-dictionary label set. ADE20K contains 151 object categories, such as cars, sky, streets, windows, lawns, seas, and coffee tables. Each image may contain multiple objects of different categories, and object scales vary greatly, making detection difficult.
References
[1] Instance Segmentation: From Mask R-CNN to BlendMask. Tencent Cloud Community (tencent.com)
[2] Understanding Semantic Segmentation and Instance Segmentation – Zhihu (Zhihu.com)
[4] zhuanlan.zhihu.com/p/165135767
[5] www.aas.net.cn/cn/article/…
[6] TensorMask: A New Method for Instance Segmentation — the effect of TensorMask vs. Mask R-CNN. Zhihu (zhihu.com)
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
[8] Anurag Arnab and Philip H. S. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
From: www.zhihu.com/people/xu-s…