Deep learning and fundamentals of computer vision

I. Definitions and relationship of artificial intelligence, deep learning, and computer vision

1. Artificial intelligence and deep learning

  • What is artificial intelligence?
    • In essence, artificial intelligence is the simulation of human thinking and problem solving.
  • What is deep learning?
    • It is a class of algorithms that simulates the structure of the human brain, using artificial neural networks as the architecture to extract the higher-dimensional, deeper logical relationships hidden behind data, so as to achieve more accurate results.
  • Artificial intelligence: expert systems, physical models, etc.
    • Machine learning: k-NN, SVM, etc.
      • Deep learning: fully connected neural networks, convolutional neural networks, recurrent neural networks, etc.

2. Computer Vision

  • Computer vision is the study of how to make machines “see”. More specifically, it uses computers and imaging systems in place of human eyes to identify, track, and measure objects, and to perform further image processing and analysis.
  • Vision is the main source of information for the human brain and the gateway to the palace of artificial intelligence.
  • Common areas of computer vision:
    • Process control: guiding robotic arms and industrial robots
    • Navigation: autonomous driving and mobile robots
    • Detection: video surveillance and face recognition
    • Organizing information: intelligent search based on images and image sequences
    • Modeling objects or environments: medical image analysis systems and terrain modeling
    • Intelligent interaction: emotion recognition, human-computer interaction
  • The four main tasks of computer vision:
    • Image classification and recognition
    • Semantic segmentation
    • Object detection
    • Instance segmentation
    • Other tasks (image enhancement, object tracking, visual creativity)

II. Cutting-edge applications of deep learning in computer vision

  • Face recognition
  • OCR (optical character recognition): license plate and bank card number recognition
  • Image search engines
  • Autonomous driving
  • Intelligent monitoring
  • Visual creativity
  • Robotic arm guidance

Classical computer vision — digital image processing

1. Computer vision and digital image processing

  • Computer vision imitates how the human eye and brain “see” and “understand” a picture; the key words are “seeing” and “understanding”. The input is an image, and the output is information extracted from the image, such as a model or recognition results: background segmentation, motion detection, object recognition, face recognition, and so on.
  • Digital image processing covers the various kinds of pre-processing applied to an image, including transformation, analysis, reconstruction, and pixel-level processing of an existing image. The input is an image and the output is also an image, for example: image enhancement, denoising, filtering, etc.
  • Computer graphics is analogous to human “drawing”: it uses the computer to generate graphics. The input is a model and the output is an image (pixels), creating new visual content, for example: fingerprint generation, 3D effects, game and film production, etc.

2. Image processing

  • Color mode:
    • RGB color mode: colors are obtained by varying and superimposing the red (R), green (G) and blue (B) channels; each channel's value range is [0, 255].
    • Grayscale: the range is [0, 255], where 0 represents black and 255 represents white.
    • HSV: closer to the way humans perceive color, so it is more perceptually intuitive.
      • Hue (H): the basic attribute of color, commonly what we call the color name, such as red or green; the value range is [0°, 360°].
      • Saturation (S): the purity of the color; the higher the saturation, the purer the color, and the lower the saturation, the grayer it becomes. The value range is [0%, 100%].
      • Value (V): also called lightness (L); the value range is [0%, 100%].
  • Transformation of color space:
    • RGB to grayscale: Gray = (R + G + B) / 3
    • RGB to HSV:
    • HSV to RGB:
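
A minimal sketch of these color-space conversions, assuming OpenCV (opencv-python) and NumPy are available; "photo.jpg" is a placeholder path.

```python
import cv2
import numpy as np

bgr = cv2.imread("photo.jpg")                    # OpenCV loads images in BGR order
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)     # built-in (weighted) grayscale conversion
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)       # BGR -> HSV
bgr_back = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)  # HSV -> BGR

# Note: in OpenCV's 8-bit HSV representation, H is stored in [0, 179] (degrees / 2),
# while S and V are stored in [0, 255] rather than [0%, 100%].
h, s, v = cv2.split(hsv)

# Simple-average grayscale, as in the formula (R + G + B) / 3 above.
simple_gray = bgr.astype(np.float32).mean(axis=2).astype(np.uint8)
```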

3. Image filtering (smoothing filtering and edge detection)

  • Image filtering is the mathematical principle behind the convolutional layers of deep neural networks.
  • It is a kind of image preprocessing
  • Function:
    • Smoothing filtering: suppresses noise mixed into the image (noise reduction)
    • Edge detection: extracts image features for image recognition
  • Smoothing filtering methods:
    • Simple average (mean) filtering
    • Gaussian filtering
  • Edge detection methods:
    • Roberts operator:
    • Prewitt operator:
    • Sobel operator:
    • Canny operator:
      • A. Smooth the image with Gaussian filtering
      • B. Calculate the gradient magnitude and direction with Sobel or other gradient operators
      • C. Perform non-maximum suppression on the gradient magnitude
        • Along the gradient direction, compare a pixel B with its two neighbors A and C; if B is the local maximum it is kept, otherwise it is suppressed (set to 0).
        • The gradient direction is perpendicular to the direction of the potential edge.
      • D. Detect and connect edges with a double-threshold (hysteresis) algorithm
        • Pixels with gradient magnitude above T-max are kept, pixels below T-min are suppressed, and pixels between T-min and T-max are kept only if they are connected to a kept (strong) edge; otherwise they are discarded.
    • Comparison of effects of different operators:
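
The smoothing filters and edge detectors above can be tried with a few OpenCV calls. This is a hedged sketch: the file path, kernel sizes and thresholds are illustrative choices, not prescribed values.

```python
import cv2

img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

mean_blur = cv2.blur(img, (5, 5))                          # simple average (box) filter
gauss_blur = cv2.GaussianBlur(img, (5, 5), sigmaX=1.5)     # Gaussian filter

sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)        # horizontal gradient
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)        # vertical gradient

# Canny bundles the four steps above: Gaussian smoothing, Sobel gradients,
# non-maximum suppression, and double-threshold (hysteresis) edge linking.
edges = cv2.Canny(img, threshold1=50, threshold2=150)      # T_min, T_max
```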

4. Image threshold segmentation

  • Classical application of computer vision segmentation task.
  • Threshold segmentation based on the Otsu algorithm (maximum between-class variance method):
    • Convert the image to grayscale
    • Compute the mean gray level w of the whole image
    • Select a threshold T that divides all pixels into two classes with proportions N0 and N1
    • Compute the mean gray level w0 of class N0 and w1 of class N1
    • Compute the between-class variance g = N0·(w0 - w)² + N1·(w1 - w)² = N0·N1·(w0 - w1)²
    • Traverse all candidate thresholds to find the T that maximizes g
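
A small sketch of the search described above, written in plain NumPy, plus OpenCV's built-in Otsu thresholding for comparison; the image path is a placeholder.

```python
import cv2
import numpy as np

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

def otsu_threshold(gray):
    # Histogram of gray levels, normalized to class probabilities.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_g = 0, 0.0
    for t in range(1, 256):
        n0, n1 = prob[:t].sum(), prob[t:].sum()          # class proportions N0, N1
        if n0 == 0 or n1 == 0:
            continue
        w0 = np.sum(np.arange(t) * prob[:t]) / n0        # mean gray level of class 0
        w1 = np.sum(np.arange(t, 256) * prob[t:]) / n1   # mean gray level of class 1
        g = n0 * n1 * (w0 - w1) ** 2                     # between-class variance
        if g > best_g:
            best_g, best_t = g, t
    return best_t

t = otsu_threshold(gray)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```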

5. Basic morphological filtering: dilation, erosion, opening and closing operations

  • Morphological filtering is a form of data cleaning for images: it removes noise.
  • Dilation: enlarges bright (white) regions in an image by adding pixels to the perceived boundaries of objects. Often used to thicken edges or fill small holes.
  • Erosion: removes pixels along object boundaries, shrinking bright regions. Often used to extract an image's skeleton and to eliminate isolated pixels and noise.
  • Opening: erosion followed by dilation.
  • Closing: dilation followed by erosion.
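
The four operations map directly onto OpenCV functions. A minimal sketch, assuming a binary input image ("mask.png" is a placeholder) and an illustrative 3x3 structuring element.

```python
import cv2
import numpy as np

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((3, 3), np.uint8)                           # structuring element

dilated = cv2.dilate(binary, kernel, iterations=1)           # dilation: grow bright regions
eroded = cv2.erode(binary, kernel, iterations=1)             # erosion: shrink bright regions
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)    # opening: erosion then dilation
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)   # closing: dilation then erosion
```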

Classical computer vision algorithms

1. Hough transform

  • Mainly used to identify regular shapes.
  • Recognition and extraction of higher-order features.
  • Theory:
    • In the automatic analysis of digital images, one of the most common sub-problems is detecting simple shapes such as lines, circles and ellipses. In most cases an edge detector is first used to preprocess the image, turning the original image into one that contains only edges. Because the image or the edge detection is imperfect, some points or pixels may be missing, or noise may make the detected boundary deviate from the actual one, so the detected edge points cannot be grouped into lines, circles and ellipses directly. The Hough transform solves this problem: through its voting step, the parameters of a shape can be found in the parameter space, and from those parameters the computer knows which shape the edge represents.
  • Steps:
    • A. Select the type of shape to be identified
    • B. Map points from the Cartesian (image) coordinate system into the shape's parameter space (for example, (ρ, θ) for lines)
    • C. Look for intersections to determine the recognized shape (by accumulating votes and finding local maxima in the parameter space)
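
A hedged sketch of these steps using OpenCV's probabilistic Hough transform for line detection; the path, Canny thresholds and Hough parameters are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)                              # edge map first

# Each edge point votes in (rho, theta) parameter space; accumulator maxima
# above `threshold` are returned as line segments.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                        threshold=80, minLineLength=30, maxLineGap=10)

vis = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 2)    # draw detected lines
```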

2. Template matching

  • Classical applications of computer vision recognition tasks.
  • Template matching is one of the most primitive and basic pattern recognition methods. It studies where the pattern of a particular object appears in an image, and thereby identifies the object; it is the most basic and most common matching method in image processing. Template matching has its limitations, mainly that it only handles parallel translation: if the target in the image is rotated or changes in size, the algorithm fails.
  • A template is a small known image. Template matching searches for a target in a larger image: given that the target exists in the image and has the same size, orientation and pixel content as the template, a suitable algorithm can locate it and determine its coordinates.
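
A minimal sketch of template matching with OpenCV; the two image paths and the normalized cross-correlation score are illustrative choices.

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)       # large search image
tmpl = cv2.imread("template.jpg", cv2.IMREAD_GRAYSCALE)    # small known template

# Slide the template over the image and score every location.
result = cv2.matchTemplate(img, tmpl, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

h, w = tmpl.shape
top_left = max_loc                                          # best-matching position
bottom_right = (top_left[0] + w, top_left[1] + h)
print("match score:", max_val, "at", top_left)
```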

3. Defects of classical computer vision algorithms

  • In practical applications, classical computer vision algorithms are not robust to interference and noise.
  • For example, they generalize poorly under rotation, scale change, deformation, occlusion, brightness change, and so on.
  • Improvement:
    • SIFT algorithm:
      • The Scale-Invariant Feature Transform (SIFT) is a feature descriptor used in image processing.
      • SIFT features are based on interest points of local appearance on an object and are independent of image scale and rotation. They also tolerate changes in illumination, noise and small viewpoint changes well. Because of these properties they are highly distinctive and relatively easy to extract, so objects can be identified in a large feature database with few misidentifications. The detection rate under partial occlusion is also quite high; as few as three SIFT features can be enough to compute an object's position and orientation. With today's hardware and a small feature database, recognition speed can approach real time. SIFT features carry a large amount of information and are suitable for fast, accurate matching against massive databases. (A code sketch appears after this list.)
    • Cascade algorithm (Cascade classifier)
  • Solution: Convolutional neural network
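
A sketch of SIFT feature matching between an object image and a scene image, assuming OpenCV >= 4.4 (where SIFT is available in the main module); the paths and the ratio-test threshold are illustrative.

```python
import cv2

img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)    # scale- and rotation-invariant features
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only confident matches (Lowe's ratio test).
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "good matches")
```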

Convolutional neural network

1. Basic introduction of neural network

  • An artificial neural network (ANN) is a complex network structure formed by a large number of interconnected processing units (neurons). It is an abstraction, simplification and simulation of the organizational structure and operating mechanism of the human brain: an information processing system that imitates the structure and function of the brain's neural networks and simulates neural activity with a mathematical model.
  • An artificial neural network may be single-layer or multi-layer; each layer contains a number of neurons, and the neurons are linked by directed connections with adjustable weights. By repeatedly training the network on known data and gradually adjusting the connection weights, the network processes information and learns to simulate the relationship between inputs and outputs. It does not need to know the exact input-output relationship in advance and does not require many hand-set parameters; it only needs the non-constant factors that cause the output to change. Compared with traditional data processing methods, neural networks therefore have clear advantages for fuzzy, random and nonlinear data, and are especially suitable for large-scale systems with complex structure and unclear information.
  • The multi-layer feedforward neural network (also called the multi-layer perceptron), proposed by Minsky and Papert, is the most commonly used network structure at present.

2. Basic constitution of neural network

  • Neurons:
  • Neural network:
    • It’s made up of multiple neurons.
    • Fully connected neural network:
    • Convolutional Neural Network (CNN) :
      • Mainly used in computer vision.
    • Recurrent neural network (RNN) :
      • Mainly used in natural language processing.

3. Common activation functions

  • The sigmoid function:
    • Formula: f(x) = 1 / (1 + e^(-x))
    • One of the most widely used activation functions; it has an exponential (S-shaped) form and is physically the closest to a biological neuron. Its output lies in (0, 1), so it can be interpreted as a probability or used to normalize data.
    • Disadvantages:
      • A. Soft saturation: the derivative is f′(x) = f(x)(1 − f(x)), which approaches 0 on both sides as |x| grows. During backpropagation, the gradient passed through a sigmoid contains a factor f′(x), so once the input falls into the saturated region the backpropagated gradient becomes very small and the parameters can hardly be trained effectively; this is the vanishing gradient problem. In practice, gradients typically vanish within about 5 layers.
      • B. The output of the sigmoid is always greater than 0, so it is not zero-mean; this is called the bias (mean-shift) phenomenon, and it causes neurons in later layers to receive non-zero-mean signals from the previous layer as input.
  • The tanh function:
    • Formula: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    • Compared with the sigmoid function, the mean output of tanh is 0, so it converges faster than the sigmoid and needs fewer iterations.
    • Disadvantages:
      • It also has soft saturation, which causes the gradient to disappear.
  • The ReLU function:
    • Formula: f(x) = max(0, x)
    • ReLU stands for Rectified Linear Unit.
    • For x > 0 there is no saturation, so the gradient does not decay and the vanishing gradient problem is avoided; this makes it possible to train deep networks directly in a supervised way, without relying on unsupervised layer-by-layer pretraining. However, as training progresses, some inputs fall into the hard saturation region (x < 0), so the corresponding weights can no longer be updated; this is called “neuron death”.
    • Like the sigmoid, the mean output of ReLU is also greater than 0, so the bias (mean-shift) effect and neuron death together affect the convergence of the network.
  • Image:
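
Since the formulas above are simple, a plain NumPy sketch of the three activation functions (and the sigmoid derivative responsible for soft saturation) may help; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # output in (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # f'(x) = f(x)(1 - f(x)), -> 0 for large |x|

def tanh(x):
    return np.tanh(x)                  # zero-centered output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # gradient 1 for x > 0, 0 otherwise

x = np.linspace(-6, 6, 5)
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```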

4. Neural network training

  • Loss function:
    • The loss function is used to evaluate the difference between the predicted value and the real value of the model. The better the loss function is, the better the performance of the model is generally. Different models generally use different loss functions.
    • The loss function is divided into empirical risk loss function and structural risk loss function. The empirical risk loss function refers to the difference between the predicted result and the actual result, and the structural risk loss function refers to the empirical risk loss function plus the regular term.
  • Gradient descent:
    • Gradient descent is an iterative method that can be used to solve least-squares problems (both linear and nonlinear). When solving for the parameters of a machine learning model, i.e., an unconstrained optimization problem, gradient descent is one of the most commonly used methods (the least-squares method is another). To minimize the loss function, gradient descent iterates step by step toward the minimum of the loss and the corresponding model parameters; conversely, to maximize a function, gradient ascent is used. In machine learning, two variants are derived from the basic method: stochastic gradient descent and batch gradient descent.
  • Back propagation algorithm:
    • The back propagation (BP) algorithm is a learning algorithm suited to multi-layer neural networks and is based on gradient descent. The input-output relation of a BP network is essentially a mapping: an n-input, m-output BP neural network realizes a continuous mapping from n-dimensional Euclidean space to a finite domain in m-dimensional Euclidean space, and this mapping is highly nonlinear. Its information-processing ability comes from the repeated composition of simple nonlinear functions, which gives it strong function approximation ability. This is the basis for applying the BP algorithm.
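
To make the loss-function / gradient-descent loop concrete, here is a minimal sketch that fits a 1-D linear model y = w*x + b by batch gradient descent on a mean-squared-error loss; the synthetic data and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=100)   # noisy "true" relationship

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    pred = w * x + b
    err = pred - y
    loss = np.mean(err ** 2)           # empirical risk (MSE loss)
    grad_w = 2 * np.mean(err * x)      # dL/dw
    grad_b = 2 * np.mean(err)          # dL/db
    w -= lr * grad_w                   # step against the gradient
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}, final loss={loss:.4f}")
```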

5. Introduction to convolutional neural networks

  • Processing process:
    • Image Input -> Convolution layer -> Pooling layer -> Full connection layer -> Result Output
  • Convolution layer:
    • A convolution kernel (operator) slides over the input to perform the convolution operation.
    • Size after convolution: output = (input - kernel + 2 × padding) / stride + 1
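
A tiny helper for the output-size formula above (integer division assumes the dimensions divide evenly); the 32x32 / 5x5 example is illustrative.

```python
def conv_output_size(n_in, kernel, stride=1, padding=0):
    # output = (input - kernel + 2 * padding) / stride + 1
    return (n_in - kernel + 2 * padding) // stride + 1

# A 32x32 input convolved with a 5x5 kernel, stride 1, no padding -> 28x28.
print(conv_output_size(32, kernel=5, stride=1, padding=0))   # 28
```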

6. Classical convolutional neural network structure

  • AlexNet
  • ResNet
    • The residual network is easy to optimize and can improve accuracy by adding depth. Its internal residual blocks use skip connections, which alleviate the vanishing gradient problem caused by increasing depth in deep neural networks.
  • Inception
    • When building a convolutional layer, one has to decide whether the filter size should be 1×1, 3×3 or 5×5, and whether to add a pooling layer. The purpose of the Inception module is to make that decision for you; although the network architecture becomes more complex, it performs very well.

Optimization of convolutional neural networks

1. Measure the performance of the model

  • Take the prediction of classification tasks as an example:
    • TP = True Positive: the sample's actual label is “yes” (positive) and the model also predicts “yes”.
    • FP = False Positive: the sample's actual label is “no” but the model predicts “yes”.
    • TN = True Negative: the sample's actual label is “no” and the model predicts “no”.
    • FN = False Negative: the sample's actual label is “yes” but the model predicts “no”.
  • Accuracy:
    • Accuracy refers to the overall proportion of correct predictions over both positive and negative samples.
    • accuracy = (TP + TN)/(TP + FP + TN + FN)
    • Disadvantages: In practice, not suitable for unbalanced data sets.
  • Precision:
    • Precision refers to the correctness of positive predictions.
    • precision = TP/(TP + FP)
    • Precision is usually prioritized in use cases where avoiding a large number of false positives matters most.
  • Recall / sensitivity:
    • Recall (sensitivity) refers to the proportion of all truly positive samples that the model predicts as positive.
    • recall = sensitivity = TP/(TP + FN)
    • Recall is usually prioritized in use cases where detecting every true positive matters most (i.e., avoiding false negatives).
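
A small helper that computes the metrics above from label/prediction lists; the example labels are made up.

```python
def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```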

2. Optimization of overfitting problem

  • Overfitting problem:
    • Overfitting means the hypothesis is made too strict in order to fit the training data perfectly, so that noise in the data is also fitted, which leads to errors when predicting future data.
  • A. Division of training verification set:
    • The existing data are divided into a training set, a validation set and a test set (typically 60%, 20% and 20%). The training set is used to train the model, performance on the validation set is used to tune the model, and finally the test set is used to evaluate it.
  • B. Cross validation:
    • The data are divided into several equal folds that take turns serving as training and validation sets; the mean and variance of the errors across folds are used to judge the model.
  • C. Data augmentation:
    • Data augmentation generates new training images by rotating, deforming, mirroring, changing brightness or color, and adding white noise to the original images.
  • D. Regularization (Regularization) :
    • Keep all the characteristic variables, but reduce the order of magnitude of the characteristic variables.
  • E. Random Dropout:
    • In each training step, only a random subset of the nodes in a layer is trained.
    • It effectively avoids biased results caused by uneven weight allocation.
    • Disadvantage: training takes longer, usually 2-3 times as long as a network trained without dropout.
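
A minimal NumPy sketch of (“inverted”) dropout applied to one layer's activations at training time; the keep probability is an illustrative choice.

```python
import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                           # nothing is dropped at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    # Scale by 1/keep_prob so the expected activation stays unchanged.
    return activations * mask / keep_prob

a = np.ones((4, 5))
print(dropout(a, keep_prob=0.5))
```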

3. Optimization of gradient disappearance/explosion problem

  • In essence, vanishing and exploding gradients are caused by the multiplicative effect of gradient backpropagation through too many network layers.
  • A. Unsaturated activation function:
    • ReLU: the derivative of the activation function is 1 in the first quadrant (x > 0), so the gradient does not decay there.
    • LeakyReLU: retains almost all of ReLU's advantages and also addresses the vanishing gradient of ReLU for negative inputs.
  • B. Gradient Clipping:
    • When the gradient exceeds a set threshold, it is rescaled (clipped) back to the threshold.
  • C. Initialization of network parameters:
    • Xavier initialization:
      • Initialization uniformly distributed over a fixed range.
      • Xavier's derivation rests on several assumptions: one is that the activation function is linear, which does not hold for ReLU; another is that the activation values are symmetric about 0, which does not hold for the sigmoid function.
    • He initialization:
      • It corrects the shortcomings of Xavier initialization (it is suited to ReLU-style activations).
    • Pre-train initialization (transfer learning) :
      • Use a network with preset weights and fine-tune from there.
  • D. Batch Normalization:
    • A very important assumption in machine learning is the i.i.d. (independent and identically distributed) assumption: training data and test data follow the same distribution, which is the basic guarantee that a model trained on the training data performs well on the test set. BN keeps the distribution of each layer's inputs the same throughout the training of a deep neural network.
    • Function:
      • Prevention of gradient explosion
      • Solve the Internal Covariate Shift and improve learning efficiency
      • Reduced reliance on good weight initialization
      • Help solve overfitting
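
A minimal sketch of batch normalization at training time for a batch of activations of shape (batch, features); gamma and beta are the learned scale and shift, and eps is a small constant for numerical stability.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                          # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)      # normalize each feature
    return gamma * x_hat + beta                  # learned scale and shift

x = np.random.randn(8, 3) * 4 + 10               # activations with arbitrary scale/offset
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))         # roughly 0 mean, unit std per feature
```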

4. Optimization of model training

  • A. Batch (mini-batch) training
    • Divide the training set into several batches and train on them one after another.
    • Advantages:
      • Improve training speed
      • Randomness is introduced into the training process to help find the global optimal solution
    • Disadvantages:
      • The training time is too long
  • B. Gradient Descent with Momentum optimizer
  • C. RMSProp optimizer
  • D. Adam (adaptive moment estimation) optimizer
    • It combines the momentum gradient descent optimizer with the RMSProp optimizer.

5. Other optimization strategies

  • Bayes limit
    • The theoretical upper limit on what can be recognized from the collected data.
    • Accuracy ordering: Bayes limit > human-level recognition > training accuracy > validation and test accuracy > accuracy in a long-running real application environment
  • A. Reduce training error:
    • More complex models
    • Longer training and optimization
    • Better hyperparameters
  • B. Reduce validation & test error (overfitting problem) :
    • More comprehensive data
    • Strategies for solving overfitting
    • Simplified model structure & parameter combination
  • C. Satisficing metrics and optimizing metrics:
    • Choose the option that performs best on the optimizing metric, subject to meeting the satisficing metrics.
    • There is usually one optimizing metric; the rest are satisficing metrics.
  • D. Considerations for the output layer:
    • Linear: Regression prediction
    • Sigmoid: binary classification
    • Softmax: multi-class classification
  • E. Training and optimization on imbalanced data:
    • Data augmentation to enlarge the number of minority-class samples
    • Modify the loss function to give higher weight to minority-class samples (a sketch follows below)
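
As a sketch of the second strategy, here is a class-weighted binary cross-entropy loss that gives the minority (positive) class a larger weight; the weight value and example data are illustrative.

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=5.0, eps=1e-7):
    y_prob = np.clip(y_prob, eps, 1 - eps)       # avoid log(0)
    loss = -(pos_weight * y_true * np.log(y_prob)
             + (1 - y_true) * np.log(1 - y_prob))
    return loss.mean()

y_true = np.array([1, 0, 0, 0, 0])               # imbalanced labels
y_prob = np.array([0.3, 0.1, 0.2, 0.05, 0.4])    # model's predicted probabilities
print(weighted_bce(y_true, y_prob))
```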