It is difficult to make a fair comparison between different object detectors. Which model is the best? There is no straight answer to this question. For real-life applications, we choose to balance accuracy and speed. Besides the detector type, we also need to understand the other choices that affect performance:
- Feature extractors (VGG16, ResNet, Inception, MobileNet).
- Output strides for the extractor.
- Input image resolutions.
- Matching strategy and IoU threshold (how predictions are matched to ground truth and excluded when calculating the loss).
- Non-max suppression IoU threshold.
- Hard example mining ratio (positive vs. negative anchor ratio).
- The number of proposals or predictions.
- Bounding box encoding.
- Data augmentation.
- Training dataset.
- Use of multi-scale images in training or testing (with cropping).
- Which feature map layer(s) for object detection.
- Localization loss function.
- Deep learning software platform used.
- Training configurations including batch size, input image resize, learning rate, and learning rate decay.
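Two of the options above, the matching IoU threshold and the non-max suppression IoU threshold, both rest on the same overlap measure. The following is a minimal sketch (not any particular paper's implementation) of IoU between two boxes and greedy NMS over scored detections:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: repeatedly keep the highest-scoring box
    and drop the remaining boxes that overlap it above the threshold."""
    order = np.argsort(scores)[::-1]  # indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = np.array([i for i in rest
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```

Lowering the NMS threshold removes more overlapping predictions (fewer duplicates, but risk of suppressing adjacent objects); raising it keeps more.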
Worst of all, the technology moves so fast that any comparison quickly becomes obsolete. Here, we first summarize the results of the individual papers so you can view and compare them side by side. Then we summarize a survey by Google Research. By presenting multiple viewpoints in one place, we hope to understand the performance metrics better.
Performance results
In this section, we summarize the performance reported in the corresponding paper. Feel free to browse this section quickly.
Faster R-CNN (arxiv.org/pdf/1506.01…)
These are the results on the PASCAL VOC 2012 test set. We are interested in the last three rows, which represent the performance of Faster R-CNN. The second column is the number of RoIs generated by the RPN network. The third column is the training dataset used. The fourth column is the mean average precision (mAP) used to measure accuracy.
For a refresher on mAP: medium.com/@jonathan_h…
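For intuition on the mAP columns in the tables that follow, here is a toy sketch of how average precision (AP) is computed for one class from ranked detections, using all-point interpolation (the COCO-style area under the interpolated precision-recall curve); the function name and input format are illustrative, not from any of the papers:

```python
import numpy as np

def average_precision(is_tp, num_gt):
    """AP from detections sorted by descending confidence.
    is_tp[i] is True if the i-th ranked detection matches a ground-truth box;
    num_gt is the total number of ground-truth boxes for this class."""
    is_tp = np.asarray(is_tp, dtype=float)
    tp_cum = np.cumsum(is_tp)            # true positives up to each rank
    fp_cum = np.cumsum(1.0 - is_tp)      # false positives up to each rank
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # interpolate: precision at each recall is the max precision to the right
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    # integrate precision over recall
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

mAP is then the mean of the per-class APs; mAP@[.5:.95] further averages over matching IoU thresholds from 0.5 to 0.95.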
PASCAL VOC 2012 test set results
VOC 2012 for Faster R-CNN
Results on MS COCO
COCO for Faster R-CNN
Timing in milliseconds on a K40 GPU, using the PASCAL VOC 2007 test set.
R-FCN (arxiv.org/pdf/1605.06…)
PASCAL VOC 2012 test set results
VOC 2012 for R-FCN
(Multi-scale training and testing were used for some of the results.)
Results on MS COCO
COCO for R-FCN
SSD (arxiv.org/pdf/1512.02…).
These are the results on PASCAL VOC 2007, 2012, and MS COCO using 300×300 and 512×512 input images.
SSD
(SSD300* and SSD512* apply data augmentation for small objects to improve mAP.)
Performance:
Speed is measured with a batch size of 1 or 8 during inference.
(YOLO here refers to v1, which is slower than YOLOv2 or YOLOv3.)
MS COCO results:
COCO for SSD
YOLO (arxiv.org/pdf/1612.08…)
PASCAL VOC 2007 test set results.
VOC 2007 for YOLOv2
(We include the VOC 2007 test here because it has results for different image resolutions.)
PASCAL VOC 2012 test set results.
VOC 2012 for YOLOv2
Results on MS COCO.
COCO for YOLOv2
YOLOv3 (pjreddie.com/media/files…).
Results on MS COCO
COCO for YOLOv3
Performance of YOLOv3
Performance of YOLOv2 on COCO
FPN (arxiv.org/pdf/1612.03…).
Results on MS COCO.
COCO for FPN
RetinaNet (arxiv.org/pdf/1708.02…).
Results on MS COCO
COCO for RetinaNet
Speed (ms) and accuracy (AP) on the MS COCO test-dev set.
COCO for RetinaNet
Compare the results of papers
It is unwise to compare results from different papers side by side. These experiments were done under different settings. Nevertheless, we decided to plot them together so you at least have a general idea of where they stand. But please note that we should never compare these numbers directly.
For the results presented below, the models were trained with the PASCAL VOC 2007 and 2012 data. mAP was measured on the PASCAL VOC 2012 test set. For SSD, the chart shows results for 300×300 and 512×512 input images. For YOLO, the results are for 288×288, 416×416, and 544×544 images. Higher-resolution images for the same model give better mAP but are slower to process.
* indicates that data augmentation for small objects is applied.
** indicates that the results are measured on the VOC 2007 test set. We include them because the YOLO paper has few VOC 2012 test results. Since VOC 2007 results are in general better than the 2012 results, we also add the R-FCN VOC 2007 results as a cross-reference.
Input image resolution and the feature extractor affect speed. Below are the highest and lowest FPS reported in the corresponding papers. However, the results below can vary widely, in particular because they are measured at different mAPs.
Results on the COCO dataset
In recent years, many results have been measured exclusively on the COCO object detection dataset. The COCO dataset is harder for object detection, and detectors usually achieve much lower mAP on it. Here is a comparison of some key detectors.
FPN and Faster R-CNN* (using ResNet as the feature extractor) have the highest accuracy (mAP@[.5:.95]). RetinaNet builds on top of FPN using ResNet. So the highest mAP achieved by RetinaNet is the combined effect of the pyramid features, the complexity of the feature extractor, and focal loss. Be warned, though, that this is not an apples-to-apples comparison. We will show Google's survey later for a better comparison. But it is beneficial to look at each model's claims first.
Takeaway so far
The frames per second (FPS) of single-shot detectors are impressive at lower-resolution images, but this comes at the cost of accuracy. Those papers try to show that they can beat the accuracy of region-based detectors. However, since high-resolution images are often used for such claims, they are less conclusive. Hence, their claims keep shifting. In addition, different optimization techniques are applied, which makes it hard to isolate the merit of each model. In fact, single-shot and region-based detectors are getting much more similar in design and implementation. But with some reservations, we can say:
- Region-based detectors such as Faster R-CNN show a small accuracy advantage if real-time speed is not required.
- Single-shot detectors are here for real-time processing. But applications need to verify whether they meet their accuracy requirements.
Compare SSD MobileNet, YOLOv2, YOLO9000 and Faster R-CNN
A 30-minute demo video has been uploaded to Bilibili: www.bilibili.com/video/av755…
Report by Google Research (arxiv.org/pdf/1611.10…)
Google Research offers a survey of the speed and accuracy trade-offs among Faster R-CNN, R-FCN, and SSD. (The paper does not cover YOLO.) It re-implements those models in TensorFlow and trains them on the MS COCO dataset. This establishes a more controlled environment and makes the trade-offs easier to compare. It also introduces MobileNet, which achieves high accuracy with much lower complexity.
Speed vs. accuracy
The most important question is not which detector is best. That may be impossible to answer. The real question is which detector and which configuration give us the best balance of speed and accuracy that our application needs. Below is the accuracy vs. speed trade-off (time measured in milliseconds).
In general, Faster R-CNN is more accurate, while R-FCN and SSD are faster.
- Faster R-CNN using Inception ResNet with 300 proposals gives the highest accuracy at 1 FPS for all tested cases.
- SSD on MobileNet has the highest mAP among the models targeted at real-time processing.
The chart also helps us locate the sweet spots with a good return on speed.
- R-FCN models using ResNet strike a good balance between accuracy and speed.
- Faster R-CNN with ResNet can attain similar performance if we restrict the number of proposals to 50.
Feature extractor
The paper studies how the accuracy of the feature extractor impacts the accuracy of the detector. Both Faster R-CNN and R-FCN can take advantage of a better feature extractor, but it matters much less for SSD.
(The x-axis is the top-1 classification accuracy of each feature extractor.)
Object size
SSD performs well on large objects even with a simpler extractor. With a better extractor, SSD can even match the accuracy of other detectors on large objects. But SSD performs much worse on small objects than the other methods.
For example, SSD has problems detecting the bottles in the table below, while the other methods can.
Input image resolution
Higher resolution significantly improves object detection for small objects, while also helping large objects. When the resolution is reduced by a factor of two in both dimensions, accuracy drops by 15.88% on average, but the inference time also drops by 27.4% on average.
Number of proposals
The number of proposals generated can significantly affect the speed of Faster R-CNN (FRCNN) without a major drop in accuracy. For example, with Inception ResNet, Faster R-CNN can run up to 3x faster when using 50 proposals instead of 300, while accuracy drops by only 4%. Because R-FCN does much less work per RoI, the speed gain from fewer proposals is far less significant.
GPU time
This is the GPU time for different models using different feature extractors.
Although many papers use FLOPs (the number of floating-point operations) to measure complexity, it does not necessarily reflect the actual speed. Whether a model is sparse or dense affects the time needed. Somewhat ironically, the less dense models usually take longer on average to finish each floating-point operation. In the figure below, most dense models have slopes (the ratio of GPU time to FLOPs) greater than or equal to 1, while the less dense ones have slopes smaller than 1. In other words, even when their overall execution time is shorter, the less dense models execute floating-point operations less efficiently. However, the paper does not study this cause thoroughly.
Memory
MobileNet has the smallest footprint. It requires less than 1 GB of memory in total.
2016 COCO object detection challenge
The winning entry of the 2016 COCO object detection challenge is an ensemble of five Faster R-CNN models using ResNet and Inception ResNet. It achieves 41.3% mAP@[.5, .95] on the COCO test set, with a significant improvement in locating small objects.
Lessons learned
Some of the main findings of the Google Research paper:
- R-FCN and SSD models are faster on average, but cannot beat Faster R-CNN in accuracy if speed is not a concern.
- Faster R-CNN takes at least 100 milliseconds per image.
- Using only low-resolution feature maps for detection hurts accuracy badly.
- Input image resolution impacts accuracy significantly. Reducing image width and height by half lowers accuracy by 15.88% on average, but also reduces inference time by 27.4% on average.
- The choice of feature extractor impacts the detection accuracy of Faster R-CNN and R-FCN, but matters much less for SSD.
- Post-processing, including non-max suppression (which runs only on the CPU), takes about 40 ms per image for the fastest models, capping the speed at 25 FPS.
- If only one IoU is used to calculate the mAP, use mAP@IoU=0.75.
- Using stride 8 instead of 16 improves mAP by 5% but increases running time by 63% when using Inception ResNet as the feature extractor.
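The 25 FPS cap above is simple arithmetic: a serial stage of ~40 ms per image bounds throughput no matter how fast the rest of the pipeline runs. A minimal sketch (the function name is illustrative):

```python
def fps_ceiling(stage_ms):
    """Upper bound on end-to-end throughput imposed by a serial stage
    that takes `stage_ms` milliseconds per image."""
    return 1000.0 / stage_ms

# ~40 ms of CPU-side NMS per image caps the pipeline at 25 FPS,
# regardless of how fast the GPU inference itself is.
print(fps_ceiling(40.0))  # 25.0
```

This is why speeding up post-processing (or moving it onto the GPU) matters once the detector itself is fast enough.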
The most accurate
- The most accurate single model uses Faster R-CNN with Inception ResNet and 300 proposals. It runs at 1 second per image.
- The most accurate overall is an ensemble model with multi-crop inference. It achieves the state-of-the-art detection accuracy of the 2016 COCO challenge. It uses the vector of average precisions to select the five most different models.
The fastest
- SSD with MobileNet provides the best accuracy trade-off among the fastest detectors.
- SSD is fast but performs worse on small objects compared with the other methods.
- For large objects, SSD can outperform Faster R-CNN and R-FCN in accuracy with lighter and faster extractors.
Achieve a good balance between accuracy and speed
- Faster R-CNN can match the speed of R-FCN and SSD at 32 mAP if we reduce the number of proposals to 50.
Translated from: medium.com/jonathan_h…