A general guide to object detection algorithms and their implementations

Object detection is a key component of many deep learning models and has undergone revolutionary transformations in recent years. Object detection refers to the process of identifying objects in images and extracting a bounding box for each of them. These object detection algorithms power applications in areas such as autonomous driving, security cameras, and robotics, and in almost any field involving vision, from medical imaging to newer trends such as Amazon Go cashierless grocery stores.

For years, the main challenge was that many applications required object detection in real time. Some newer implementations offer faster inference, allowing different areas of computer vision to develop and mature. With each new release improving on the performance of its predecessor, we are now seeing real-time object detection become practical.

Despite these exciting advances, one of the biggest obstacles in object detection is that models are very large and heavy: running them takes a lot of time and computing power. Modeling object detection with deep learning is doubly challenging. First, because objects can vary in size and number, the network must be able to cope with this variability. Second, the number of possible bounding box combinations is huge, and these networks are computationally demanding, so it is a struggle to compute output at real-time speed.

To understand this barrier and what might be done to overcome it, we will look at the different techniques available for object detection and how the field has matured through recent history. As you will see, no single model is both the fastest and the most accurate. We have to deal with trade-offs between speed and accuracy: some models run faster at the cost of accuracy, and vice versa. Let’s move into a bit of history so you can understand where each model sits on this spectrum.

Basic types of object detection algorithms

There are many deep learning algorithms that solve the problem of object detection. These object detectors have three main components:

  • Backbone for extracting features from the image
  • Feature network that takes features from the backbone and outputs features that represent the salient characteristics of the image
  • Final (class/box) network that uses these features to predict the class and location of an object

Before we go too far, it’s important to mention how object detectors are evaluated. Two common metrics are intersection over union (IoU) and average precision (AP). The IoU measures the overlap between the ground truth bounding box and the predicted bounding box. It is calculated as the intersection area of the ground truth box and the predicted box, divided by their union area. The resulting value is a number between 0 and 1: the higher the number, the greater the overlap. Average precision is the area under the precision-recall curve (AUC-PR). In some cases, you might see the metric written as AP50, where the subscript 50 means average precision at a 50% IoU threshold. When average precision is calculated over many classes and the average of those values is taken, it is called mean average precision (mAP).
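
To make the IoU calculation concrete, here is a minimal sketch in plain Python. It assumes boxes in [x1, y1, x2, y2] format; the function name is illustrative:

    def iou(box_a, box_b):
        """Intersection over union of two boxes in [x1, y1, x2, y2] format."""
        # Coordinates of the intersection rectangle.
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])

        # The intersection area is zero when the boxes do not overlap.
        intersection = max(0, x2 - x1) * max(0, y2 - y1)

        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

        # Union = sum of both areas minus the double-counted intersection.
        return intersection / (area_a + area_b - intersection)

    print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 0.1428... (25 / 175)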

Now we’re ready to explore recent and common deep learning algorithms that you can use for your next object detection project.

Region-based models

Region-based convolutional neural networks use a set of region proposals to detect objects. Faster R-CNN is the most recent model in this family of object detectors. It is the successor to R-CNN and Fast R-CNN. Before looking at Faster R-CNN, let’s take a moment to consider these predecessors.

R-CNN

The R-CNN model combines convolutional neural networks with bottom-up region proposals to localize objects. R-CNN takes an image and extracts up to 2,000 bottom-up region proposals. A region proposal is a location with a high probability of containing the target object. R-CNN then uses a large CNN to compute features for each proposed region. Next, it classifies each region using a class-specific linear support vector machine (SVM).
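
A rough sketch of this three-stage pipeline in Python might look like the following. Selective search comes from opencv-contrib; the feature extractor is a hypothetical stand-in for a real CNN, and the random labels exist only to keep the sketch self-contained:

    import cv2  # requires opencv-contrib-python for ximgproc
    import numpy as np
    from sklearn.svm import LinearSVC

    def extract_cnn_features(crop):
        # Hypothetical stand-in: a real R-CNN warps the crop to a fixed
        # size and runs it through a pre-trained CNN. We just flatten a
        # resized patch so the sketch runs end to end.
        return cv2.resize(crop, (16, 16)).ravel()

    image = cv2.imread("input.jpg")

    # Stage 1: up to 2,000 bottom-up region proposals via selective search.
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    proposals = ss.process()[:2000]

    # Stage 2: compute CNN features for every proposed region.
    features = np.array([extract_cnn_features(image[y:y + h, x:x + w])
                         for (x, y, w, h) in proposals])

    # Stage 3: a class-specific linear SVM scores each region. In real
    # training, labels come from IoU-based matching against ground truth.
    labels = np.random.randint(0, 2, size=len(features))
    scores = LinearSVC().fit(features, labels).decision_function(features)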

R-CNN has a couple of drawbacks:

  • Training uses a multi-stage pipeline that first obtains object proposals, fits an SVM to the ConvNet features, and finally learns the bounding box regressors. This multi-stage training is slower than single-stage training.

  • Training uses deep networks that consume a lot of time and space. This means more time as well as more computational power.

  • Object detection is slow because it performs a ConvNet forward pass for each object proposal.

Fast R-CNN

Fast R-CNN is a ConvNet-based object detector that classifies object proposals. It takes an image and a set of object proposals as input. Fast R-CNN processes the image with convolutional and max pooling layers to produce a convolutional feature map. A region of interest (RoI) pooling layer then uses max pooling to extract a fixed-length feature vector from the feature map for each region proposal.

The feature vectors are then fed into fully connected layers that branch into two outputs. One produces four values representing the bounding box of the object, while the other outputs softmax probabilities over the object classes. The four bounding box values encode the position of the box, such as the coordinates of its top-left and bottom-right corners.
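
torchvision ships an RoI pooling operator, so the fixed-length feature extraction step can be sketched directly. The feature map and proposal values below are made up for illustration:

    import torch
    from torchvision.ops import roi_pool

    # A dummy convolutional feature map: batch of 1, 256 channels, 32x32.
    feature_map = torch.randn(1, 256, 32, 32)

    # One proposal per row: (batch_index, x1, y1, x2, y2), here given
    # directly in feature-map coordinates (spatial_scale=1.0).
    proposals = torch.tensor([[0, 4.0, 4.0, 20.0, 24.0]])

    # Max-pool each proposal into a fixed 7x7 grid, as in Fast R-CNN.
    pooled = roi_pool(feature_map, proposals, output_size=(7, 7),
                      spatial_scale=1.0)
    print(pooled.shape)  # torch.Size([1, 256, 7, 7])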

Fast R-CNN improves on R-CNN: it achieves a higher mean average precision, training is single-stage, and it does not require disk storage to cache features.

Faster R-CNN

The Faster R-CNN model consists of two modules:

  • A deep convolutional network responsible for proposing regions (the region proposal network)
  • A Fast R-CNN detector that uses the proposed regions

The region proposal network shares full-image convolutional features with the detection network, and it predicts object bounding boxes and objectness scores. The Fast R-CNN detector then uses the region proposals from the region proposal network for object detection. The region proposal network and Fast R-CNN are combined into a single network by sharing convolutional features. In general, the Faster R-CNN model receives an image and outputs rectangular object proposals, each with an objectness score.

TensorFlow provides a Faster R-CNN object detection API that you can use to build detection models with minimal effort. You can also run predictions right away using a pre-trained Faster R-CNN model; TensorFlow offers pre-trained models through TensorFlow Hub.
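
As a sketch, loading a pre-trained Faster R-CNN from TensorFlow Hub and running a prediction could look like this. The model handle and output keys follow the TensorFlow Hub catalog at the time of writing and may change:

    import tensorflow as tf
    import tensorflow_hub as hub

    # A published Faster R-CNN model; variants with other backbones and
    # input sizes are also available on TensorFlow Hub.
    detector = hub.load(
        "https://tfhub.dev/tensorflow/faster_rcnn/resnet50_v1_640x640/1")

    # The detector expects a batch of uint8 images.
    image = tf.io.decode_jpeg(tf.io.read_file("input.jpg"))
    result = detector(image[tf.newaxis, ...])

    # Outputs include bounding boxes, class indices, and confidence scores.
    boxes = result["detection_boxes"]
    classes = result["detection_classes"]
    scores = result["detection_scores"]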

Single Shot Detector (SSD)

In this model, objects in the image are detected in a single forward pass. During the training phase, SSD uses input images and ground truth bounding boxes for each object. SSD uses a single neural network to predict objects in the image: a feedforward convolutional neural network that generates bounding boxes and scores for the objects present. Convolutional feature layers make it possible to detect objects at multiple scales.

SSD works by evaluating a small set of default bounding boxes with different aspect ratios and scales. Each box predicts shape offsets and confidences for each category. The default boxes are matched to the ground truth boxes during training: matched boxes are treated as positives, while unmatched boxes are treated as negatives. SSD thus generates fixed bounding boxes through a feedforward convolutional neural network and scores the objects in those boxes. The model’s loss is calculated as a weighted sum of localization loss and confidence loss.
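
Concretely, the SSD paper defines the training objective as

    L(x, c, l, g) = (1 / N) * (L_conf(x, c) + α * L_loc(x, l, g))

where N is the number of matched default boxes, L_conf is the confidence (softmax) loss, L_loc is the localization (smooth L1) loss between the predicted box l and the ground truth box g, and the weight α balances the two terms.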

Like other object detection models, SSD uses a base model for feature extraction. By default it uses the VGG-16 network as the backbone.
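
torchvision publishes a pre-trained SSD300 with exactly this VGG-16 backbone, so a quick inference sketch could look like the following (image loading details are illustrative):

    import torch
    from torchvision.io import read_image
    from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

    # SSD300 with the default VGG-16 backbone, pre-trained on COCO.
    model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT).eval()

    # The model takes a list of float images with values in [0, 1].
    image = read_image("input.jpg").float() / 255.0
    with torch.no_grad():
        predictions = model([image])

    # Each prediction holds boxes, labels, and confidence scores.
    print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])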

SSD is faster because it requires only a single shot, unlike region proposal networks, which require two shots:

  • One to generate the object proposals
  • The other to detect objects from these proposals

SSD achieves a mean average precision score of 74.3% at 59 frames per second on an Nvidia Titan X on the VOC2007 dataset.

YOLO Models

YOLO models are also single-shot models. The first YOLO model, short for You Only Look Once, was introduced in 2016. The original proposal was to predict bounding boxes and class probabilities from an image in a single evaluation using a single neural network. This first model processed 45 frames per second in real time, using features from the entire image to predict bounding boxes.

There were two main limitations in the initial implementation of YOLO:

  • It could predict only one class per grid cell
  • It didn’t perform well on small objects

Let’s take a moment to look at the newer and more popular versions of YOLO. We’ll get the ball rolling with YOLO version 4.

YOLO V4

YOLO V4 brings together the following components:

  • CSPDarknet-53 backbone for feature extraction
  • Spatial pyramid pooling (SPP) and a path aggregation network (PAN) to collect features from different stages
  • YOLO V3 head for predicting classes and bounding boxes

Let’s take a closer look at these components.

CSPDarknet-53 is a convolutional neural network that serves as the backbone of the object detector. It is based on Darknet-53, a convolutional neural network that uses residual connections and serves as the backbone of YOLO v3. CSPDarknet-53 applies a cross stage partial network (CSPNet) strategy to divide the feature map of the base layer into two parts: one part passes through the subsequent blocks while the other bypasses them, and the two parts are then merged. This partitioning reduces computation time.

SPP is a convolutional neural network architecture that uses spatial pyramid pooling to remove the fixed-size input constraint of a network. The output of the SPP layer is fed into a fully connected layer or another classifier. PANet is a network that shortens the information path between the lower and topmost feature levels in order to achieve reliable information propagation.

YOLO V4 introduced two new methods to improve accuracy:

  • Bag of freebies — These are strategies that are applied to improve the performance of a model without increasing its latency at inference. One such strategy is data augmentation, whose goal is to expose the model to varied images, making it more robust. Photometric distortions and geometric distortions are two kinds of image augmentation that improve an object detector’s performance (a short sketch follows this list). Photometric distortion includes adjusting the contrast, saturation, and brightness of images. Geometric distortions applied to object detectors include randomly scaling, rotating, and cropping images.

  • Bag of specials — These are plugin modules and post-processing methods that increase the inference cost by a small amount while noticeably improving the object detector’s accuracy. The aim of the plugin modules is to enhance certain model attributes, such as enlarging the receptive field or strengthening feature integration capability. The post-processing methods are used to screen the model’s prediction results, as non-maximum suppression does.
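
As a minimal sketch of the photometric and geometric distortions mentioned under bag of freebies, torchvision’s transforms can express both kinds of augmentation. The parameter values are arbitrary:

    from torchvision import transforms

    # Photometric distortions: contrast, saturation, and brightness.
    photometric = transforms.ColorJitter(
        brightness=0.3, contrast=0.3, saturation=0.3)

    # Geometric distortions: random scaling/cropping and rotation. Note
    # that for real detector training the bounding boxes must be
    # transformed too; these plain transforms only move the pixels.
    geometric = transforms.Compose([
        transforms.RandomResizedCrop(size=416, scale=(0.5, 1.0)),
        transforms.RandomRotation(degrees=10),
    ])

    augment = transforms.Compose([photometric, geometric])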

YOLO V4 uses several data augmentation strategies:

  • CutMix — Combines images by cutting out part of one image and pasting it onto an augmented image

  • Mosaic data augmentation — Mixes four training images into one. This enables the detection of objects outside their normal context (a minimal sketch follows this list).

  • Self-adversarial training (SAT) — A new augmentation strategy that operates in two forward-backward stages. In the first stage, the network alters the original image instead of the network weights. In the second stage, the neural network is trained to detect an object in the altered image.
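
Here is a minimal sketch of mosaic augmentation that only mixes pixels; a real implementation must also remap each image’s bounding boxes into the mosaic:

    import numpy as np

    def mosaic(images, out_size=416):
        """Tile four images into one canvas around a random center."""
        assert len(images) == 4
        out = np.zeros((out_size, out_size, 3), dtype=np.uint8)
        # Random split point dividing the canvas into four quadrants.
        cx = np.random.randint(out_size // 4, 3 * out_size // 4)
        cy = np.random.randint(out_size // 4, 3 * out_size // 4)
        quadrants = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                     (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
        for img, (x1, y1, x2, y2) in zip(images, quadrants):
            h, w = y2 - y1, x2 - x1
            # Nearest-neighbor resize of each image into its quadrant,
            # kept dependency-free on purpose.
            rows = np.linspace(0, img.shape[0] - 1, h).astype(int)
            cols = np.linspace(0, img.shape[1] - 1, w).astype(int)
            out[y1:y2, x1:x2] = img[rows][:, cols]
        return out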

YOLO V4 achieves 65.7% average precision (AP50) at roughly 65 frames per second in real time on a Tesla V100.

YOLO V5

There was a lot of controversy around the YOLO V5 release. Essentially, it is not a new version of YOLO but an implementation of YOLO V4 in PyTorch. Another issue at the heart of the release is that the developers did not publish a paper that could be peer-reviewed. Alexey Bochkovskiy, one of the authors of YOLO V4, took issue with this and responded to the YOLO V5 release. You can read his comments in the GitHub issue.

The YOLO V5 page on GitHub claims that this version is faster than previous versions of YOLO. The page also provides pre-trained checkpoints that you can download and start using immediately.

CenterNet

CenterNet is built upon the one-stage keypoint-based detector known as CornerNet. CornerNet produces heatmaps for the top-left corners and the bottom-right corners. The heatmaps are locations of key points for different objects. Each keypoint is assigned a confidence score.

CornerNet also generates embeddings that are used to determine whether two corners belong to the same object. It generates offsets that can be used to learn how to remap the corners from the heatmaps to the input image.

CenterNet is a one-stage detector that detects each object as a triplet of keypoints, resulting in improved precision and recall. It explores the central part of a proposal. The idea is that if a bounding box has a high intersection over union with the ground truth box, then there is a high probability that the center keypoint in its central region is predicted as the same class. At inference time, after a proposal has been generated as a pair of corner keypoints, it is determined to be an object based on whether a center keypoint of the same class falls within the proposal’s central region.

CenterNet consists of two modules: cascade corner pooling and center pooling. The center pooling module captures more recognizable information in the central region, making it easier to identify the center of a proposal. This module is responsible for predicting the center keypoint. It uses the maximum responses in the horizontal and vertical directions across the object to predict the center keypoint.
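
Center pooling is easy to sketch: for every position on a feature map, the maximum response along its row and along its column are added together. Here is a minimal PyTorch version, ignoring the efficient pooling kernels used in the paper:

    import torch

    def center_pooling(features):
        """Add the row-wise and column-wise maxima to every position."""
        # features: (batch, channels, height, width)
        row_max = features.max(dim=3, keepdim=True).values  # max over width
        col_max = features.max(dim=2, keepdim=True).values  # max over height
        # Broadcasting spreads each maximum back over the whole map.
        return row_max + col_max

    heatmap = center_pooling(torch.randn(1, 64, 128, 128))
    print(heatmap.shape)  # torch.Size([1, 64, 128, 128])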

The cascade corner pooling module is responsible for enriching the information collected at the top-left and bottom-right corners. It pools the maximum values along the object’s boundary directions and then along the object’s internal directions to predict the corners.

CenterNet achieves an average precision of 47.0% on the MS-COCO dataset, an improvement of at least 4.9% over existing one-stage detectors.

EfficientDet

EfficientDet is an algorithm introduced by Google. It is built on EfficientNet and introduces a new bi-directional feature pyramid network (BiFPN) and new scaling rules. EfficientDet optimizes object detector components to improve performance and efficiency, leading to smaller models with less computation. These optimizations include:

  • Employing EfficientNet as the backbone. Applying EfficientNet-B3 increases accuracy by 3% while reducing computation by 20%.
  • Improving the efficiency of the feature network. This is done using a bi-directional feature pyramid network (BiFPN), a type of feature pyramid network that allows fast and easy multi-scale feature fusion. It uses regular and efficient connections so that information can flow in both top-down and bottom-up directions.
  • Improving efficiency further through a fast normalized fusion technique. The observation is that since input features come at different resolutions, they contribute unequally to the fused output features. This is addressed by weighting each input and letting the network learn the importance of each input feature (a sketch follows this list).
  • Introducing a new compound scaling method for object detectors that leads to better accuracy. This is done using a simple compound coefficient that jointly scales up the resolution, depth, and width of the backbone, BiFPN, and class/box networks.
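
The fast normalized fusion mentioned in the list above can be sketched as follows: each input feature gets a learnable non-negative weight, and the weights are normalized by their sum rather than by a softmax. Shapes here are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FastNormalizedFusion(nn.Module):
        """Computes O = sum_i (w_i / (eps + sum_j w_j)) * I_i."""

        def __init__(self, num_inputs, eps=1e-4):
            super().__init__()
            self.weights = nn.Parameter(torch.ones(num_inputs))
            self.eps = eps

        def forward(self, inputs):
            # ReLU keeps the learned weights non-negative.
            w = F.relu(self.weights)
            w = w / (w.sum() + self.eps)
            return sum(wi * x for wi, x in zip(w, inputs))

    fuse = FastNormalizedFusion(num_inputs=2)
    out = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])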

EfficientDet-D7 achieves an average precision of 52.2 on the COCO dataset.
