Object detection Backbone

In the first step of most object detection algorithms, a convolutional neural network (CNN) processes the input image to generate a relatively deep feature map. The various detection algorithms then work on this feature map to produce their results and compute the loss.

Object detection: feedforward neural network

Object detection usually relies on a deep feedforward neural network built around convolution. Such a network is made up of a few commonly used layer types: convolutional layers, pooling layers, fully connected layers, and so on.

Convolution layer

Convolution is an operation that extracts features from data through the parameters of a convolution kernel: the kernel slides over the input, and at each position the result is obtained by multiplying the kernel with the patch beneath it element by element and summing the products. Convolution is a linear operation, whereas a neural network needs to fit nonlinear relationships, so an activation function such as sigmoid, ReLU or tanh is applied after it.
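The multiply-and-sum step followed by an activation can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a single-channel image, stride 1 and no padding; the names `conv2d` and `relu` and the example kernel are hypothetical and not taken from any library.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding 2D convolution as used in CNNs (no kernel flip)."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for i in range(img_h - k_h + 1):
        row = []
        for j in range(img_w - k_w + 1):
            # Multiply the kernel with the patch under it element by element, then sum.
            total = 0.0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Apply ReLU element-wise: negative responses become 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
# Hypothetical 2x2 kernel that responds to left-to-right intensity drops.
kernel = [
    [1, -1],
    [1, -1],
]

linear_response = conv2d(image, kernel)  # linear: can be negative
activated = relu(linear_response)        # nonlinear: negatives clipped to 0
print(linear_response)
print(activated)
```

The 4x4 input and 2x2 kernel give a 3x3 output. The first result is purely linear in the input, and only the ReLU step introduces the nonlinearity.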

No matter how big the input image is, the number of parameters in a convolutional layer is fixed: it depends only on the kernel size and the number of input and output channels.
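A quick way to check this is to count the parameters directly. The sketch below uses the standard counting formula for a 2D convolutional layer; the function names and the example layer (3x3 kernels, 3 input channels, 64 output channels) are illustrative assumptions.

```python
def conv_param_count(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable parameters in a 2D convolutional layer.

    The input image height and width do not appear anywhere in this formula.
    """
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Output height (or width) of a convolution along one spatial axis."""
    return (in_size + 2 * padding - kernel) // stride + 1


# Example layer (assumed for illustration): 3x3 kernels, 3 -> 64 channels.
params = conv_param_count(3, 3, in_channels=3, out_channels=64)

# The same layer applied to two very different input sizes.
small_out = conv_output_size(224, kernel=3, stride=1, padding=1)
large_out = conv_output_size(1024, kernel=3, stride=1, padding=1)

print(params)                # same for both inputs
print(small_out, large_out)  # only the output size changes
```

Only the size of the output feature map grows with the input. The weights are shared across every position the kernel visits.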

Pooling layer

Pooling is a downsampling technique: it reduces the size of the image.

In general image processing, downsampling serves two purposes: 1. making the image fit the size of the display area; 2. generating a thumbnail of the image.

The convolution operation shrinks the input image and extracts its features, but the dimensionality of the resulting feature map is still very high.

In practice, the pooling layer partitions the feature map produced by the convolution operation into non-overlapping blocks and computes the maximum or average value of each block; these values form the pooled feature map.
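The block-wise procedure can be sketched as follows. This is a minimal example, assuming a single-channel feature map and square blocks whose size equals the stride; `pool2d` is a hypothetical helper name.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool over non-overlapping size x size blocks (stride equals block size)."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            # Collect the values of one block.
            block = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(block))
            elif mode == "avg":
                row.append(sum(block) / len(block))
            else:
                raise ValueError("mode must be 'max' or 'avg'")
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool2d(feature_map, size=2, mode="max")
avg_pooled = pool2d(feature_map, size=2, mode="avg")
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [3.5, 5.0]]
```

The 4x4 feature map becomes 2x2 in both cases: max pooling keeps the strongest response in each block, average pooling keeps the mean.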

What the pooling layer does:

  1. Reduces dimensionality, which shrinks the model and speeds up computation
  2. Lowers the risk of overfitting and makes feature extraction more robust
  3. Makes the output less sensitive to small translations and rotations
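Points 1 and 3 can be illustrated with a toy example. The sketch below assumes 2x2 max pooling on a 4x4 map with a single active value, and the helper names are hypothetical. It only demonstrates translation; rotation is not covered, and the insensitivity holds only while the shift stays inside one pooling block.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling over non-overlapping blocks."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]


def single_activation(row, col, size=4, value=9):
    """A size x size map that is zero everywhere except at (row, col)."""
    grid = [[0] * size for _ in range(size)]
    grid[row][col] = value
    return grid


original = single_activation(0, 0)
shifted = single_activation(1, 1)   # moved by one pixel, same 2x2 block
crossed = single_activation(1, 2)   # moved into the neighbouring block

pooled_original = max_pool_2x2(original)
pooled_shifted = max_pool_2x2(shifted)
pooled_crossed = max_pool_2x2(crossed)

print(pooled_original)  # [[9, 0], [0, 0]]
print(pooled_shifted)   # [[9, 0], [0, 0]]  (unchanged by the small shift)
print(pooled_crossed)   # [[0, 9], [0, 0]]  (a larger shift does change it)
```

Each pooled map holds 4 values instead of 16, which is the dimensionality reduction from point 1.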

Fully connected layer

It is generally attached to the feature map output by the convolutional network and outputs a one-dimensional vector. Its main function is to map the features abstracted by the convolutional layers to a label space of a specific dimension, giving the prediction result from which the loss is computed.
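As a rough sketch of that mapping: flatten the feature map, multiply by a weight matrix to get one score per label, then turn the scores into a prediction and a loss. Softmax and cross-entropy are used here as one common choice, not necessarily what a given detector uses; the weights, the three-class label space and all function names are made up for illustration.

```python
import math


def flatten(feature_map):
    """Turn a 2D feature map into a one-dimensional vector."""
    return [v for row in feature_map for v in row]


def fully_connected(inputs, weights, biases):
    """One output score per row of weights: dot(weights_row, inputs) + bias."""
    return [sum(w * x for w, x in zip(w_row, inputs)) + b
            for w_row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss for a single example whose true label index is `target`."""
    return -math.log(probs[target])


feature_map = [[1.0, 0.0],
               [0.0, 2.0]]
# Hypothetical weights: 3 labels, 4 inputs each.
weights = [
    [0.5, 0.0, 0.0, 0.5],
    [-0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
biases = [0.0, 0.0, 0.0]

vector = flatten(feature_map)
scores = fully_connected(vector, weights, biases)
probs = softmax(scores)
predicted = probs.index(max(probs))
loss = cross_entropy(probs, target=0)
print(vector, scores, predicted, loss)
```

The 2x2 feature map becomes a vector of length 4, and the layer maps it to 3 scores, one per label.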

URL: blog.csdn.net/zfjBIT/arti…

This article gives a relatively accessible explanation of these topics; refer to it for details.

Object detection technology: development of network structures

  • AlexNet -> smaller convolution kernels and a deeper network structure -> VGGNet

  • VGGNet -> solve the sharp increase in parameter count and the vanishing gradients caused by deeper network structures -> Inception

  • Inception -> solve the poor correlation of the gradients propagated back through a network that is too deep -> ResNet (Residual Network)

  • ResNet -> optimize the parameter count and the amount of computation -> DenseNet

  • Image pyramid (rescale the input image to several scales; images at different scales produce features at different scales) -> solve the multi-scale problem and extract multi-scale features -> feature pyramid (FPN)
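The image pyramid in the last item can be sketched as repeated downsampling of the input. This is a minimal illustration, assuming a single-channel image and 2x2 block averaging as the resampling method; the function names are hypothetical. It covers only the image pyramid side: FPN builds its pyramid from feature maps inside the network instead, which is not shown here.

```python
def downsample_2x(image):
    """Halve height and width by averaging each non-overlapping 2x2 block."""
    h, w = len(image), len(image[0])
    return [
        [(image[i][j] + image[i][j + 1]
          + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]


def image_pyramid(image, levels):
    """Return [original, 1/2 scale, 1/4 scale, ...] with up to `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        current = pyramid[-1]
        if len(current) < 2 or len(current[0]) < 2:
            break  # too small to downsample again
        pyramid.append(downsample_2x(current))
    return pyramid


# Toy 8x8 "image" whose pixel value encodes its position.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

pyramid = image_pyramid(image, levels=3)
shapes = [(len(level), len(level[0])) for level in pyramid]
print(shapes)  # [(8, 8), (4, 4), (2, 2)]
```

Running a feature extractor on every level yields features at several scales, at the cost of one forward pass per level.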