This article introduces the basic principles of MTCNN and FaceNet; the next article covers the implementation.

Preface

Nowadays, computer vision and artificial intelligence are closely tied to everyday life, in areas such as face detection and recognition, traffic violation monitoring, license plate recognition, photo beautification on mobile phones, autonomous driving, and human-versus-machine Go matches. Deep learning, built on the development and refinement of deep neural networks, has made remarkable achievements in computer vision. For example, in the ImageNet image classification and detection challenge, many algorithms based on deep neural networks have reached error rates below human-level performance. The AlphaGo program defeated the top Go players Lee Sedol and Ke Jie, and the theory behind it likewise relies on deep neural networks. Deep learning has clearly come to the fore in computer vision: its development has not only cracked many difficult vision problems and raised the level of image understanding, but also accelerated progress in computer vision and related artificial intelligence technologies.

Taking face detection and recognition as an example of image recognition in practice: commonly used face detection algorithms include Dlib, OpenCV, OpenFace, MTCNN, and so on, while commonly used face recognition algorithms include FaceNet, the InsightFace models, and so on. In this article, MTCNN and FaceNet are used to implement face detection and recognition.

1. MTCNN

MTCNN (Multi-task Cascaded Convolutional Networks) is a multi-task neural network model for face detection proposed in 2016 by the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. It consists of three cascaded networks and uses the idea of candidate boxes plus classifiers to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows; R-Net, which filters and refines the candidate windows with higher precision; and O-Net, which produces the final bounding boxes and facial landmark positions. Like many convolutional neural network models for image problems, MTCNN also makes use of techniques such as image pyramids, bounding box regression, and non-maximum suppression (NMS).

The goal of MTCNN's first stage is to generate face candidate boxes. The image is first transformed into a "pyramid". The reason is that faces in a picture can appear at many different scales, and the detection algorithm needs to adapt to this variation in target scale. In essence, target detection is a dot product between the template weights and the features of the target region: if the template scale matches the target scale, detection works well. MTCNN solves the multi-scale problem with an image pyramid, i.e., the original image is repeatedly scaled down by a fixed factor to obtain a set of images at multiple scales, stacked like a pyramid.
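As an illustration, here is a minimal Python/OpenCV sketch of how such pyramid scales might be computed. The function name and the default `min_face_size=20` are assumptions for illustration; 12 is P-Net's input size and 0.709 is a commonly used scale factor.

```python
import cv2

def build_image_pyramid(img, min_face_size=20, min_det_size=12, factor=0.709):
    """Build an MTCNN-style image pyramid by repeatedly rescaling the image."""
    scales = []
    # Initial scale so that a face of min_face_size maps to P-Net's 12x12 input.
    scale = min_det_size / min_face_size
    min_side = min(img.shape[:2]) * scale
    while min_side >= min_det_size:
        scales.append(scale)
        scale *= factor      # shrink by a fixed factor for the next level
        min_side *= factor
    # Resize the original image once per scale to form the pyramid.
    return [cv2.resize(img, None, fx=s, fy=s) for s in scales]
```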

P-Net is trained on single-scale (12x12) image patches. At inference time, in order to detect faces of various sizes more accurately, the image is rescaled so that the faces to be detected are brought towards the model's input scale (12x12).
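For reference, below is a simplified PyTorch sketch of a P-Net-like architecture, following the commonly cited layer configuration; this is an illustration, not the authors' original code. Because the network is fully convolutional, a 12x12 input yields a single prediction, while a larger image yields a dense map of predictions.

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Simplified P-Net-like network: fully convolutional, so a 12x12 input
    produces a 1x1 output, while a larger image produces a score map."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, 3), nn.PReLU(10),
            nn.MaxPool2d(2, 2, ceil_mode=True),
            nn.Conv2d(10, 16, 3), nn.PReLU(16),
            nn.Conv2d(16, 32, 3), nn.PReLU(32),
        )
        self.cls = nn.Conv2d(32, 2, 1)   # face / non-face score per location
        self.bbox = nn.Conv2d(32, 4, 1)  # bounding-box regression offsets

    def forward(self, x):
        x = self.features(x)
        return torch.softmax(self.cls(x), dim=1), self.bbox(x)

# A 12x12 crop gives one prediction; a full image would give a grid of them.
probs, offsets = PNet()(torch.randn(1, 3, 12, 12))
print(probs.shape, offsets.shape)  # (1, 2, 1, 1) and (1, 4, 1, 1)
```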

Mapping P-Net's output back onto the original image gives the probability of a face appearing in each region. Once this result is obtained, regions with low scores can be filtered out with a threshold, and the NMS algorithm can then be used to remove regions that overlap heavily with higher-scoring ones.
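A minimal sketch of NMS, assuming boxes are given as (x1, y1, x2, y2) numpy arrays with a score per box:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes whose overlap with the kept box is low enough.
        order = order[1:][iou <= iou_threshold]
    return keep
```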

In addition, bounding box regression can be used to correct the positions of the candidate boxes obtained so far. Bounding box regression is essentially a mapping that moves the original proposal window P (the red box below) closer to the ground-truth window G (the green box below).
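A common way to apply such corrections, used in many MTCNN implementations (the exact parameterization below is an assumption), is to treat the regressed offsets as fractions of the box width and height:

```python
import numpy as np

def refine_boxes(boxes, offsets):
    """Apply regression offsets (dx1, dy1, dx2, dy2), expressed as fractions
    of the box width/height, to move candidate boxes closer to ground truth."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    refined = boxes.astype(np.float64).copy()
    refined[:, 0] += offsets[:, 0] * w
    refined[:, 1] += offsets[:, 1] * h
    refined[:, 2] += offsets[:, 2] * w
    refined[:, 3] += offsets[:, 3] * h
    return refined
```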

MTCNN's R-Net and O-Net both follow a process similar to P-Net: each stage outputs classification scores as well as box correction values, and every box whose classification score is above the threshold and whose overlap with other boxes is not too high is kept and corrected.

P-Net ultimately outputs many regions that may contain faces, and these regions are fed into R-Net for further processing.

Compared with the first stage, P-Net, R-Net adds a fully connected layer.

The regions output by P-Net are only candidates that may contain a face with some confidence. R-Net refines this input: it discards most of the incorrect candidates, applies bounding box regression and facial landmark localization to the remaining regions once again, and finally outputs more credible face regions for O-Net to use. Whereas P-Net relies on the 1*1*32 features of its fully convolutional output, R-Net adds a 128-unit fully connected layer after its last convolutional layer, which retains more image features and therefore achieves better accuracy than P-Net.

The basic structure of O-Net is a more complex convolutional neural network, with one more convolutional layer than R-Net.

O-Net takes a larger input, and its structure ends with a larger, 256-unit fully connected layer, retaining even more image features. It then performs face/non-face classification, bounding box regression, and facial landmark localization, and finally outputs the coordinates of the upper-left and lower-right corners of the face region together with the five facial landmark points. With richer input features and a more complex structure, O-Net also performs better, and its output is taken as the final output of the model.

At this point, MTCNN has completed the face detection task: given an image, MTCNN can mark the face regions and locate the facial landmarks. Combined with the FaceNet model, it can then determine whether several images show the same person, which extends to many applications, including security checks and face unlocking.
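As a usage sketch, the open-source `mtcnn` pip package (a third-party implementation, not the original authors' code) exposes this whole cascade through a single call; the file names below are placeholders:

```python
# pip install mtcnn opencv-python
import cv2
from mtcnn import MTCNN

detector = MTCNN()
img = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)

for face in detector.detect_faces(img):
    x, y, w, h = face["box"]                   # face bounding box
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    for px, py in face["keypoints"].values():  # eyes, nose, mouth corners
        cv2.circle(img, (px, py), 2, (255, 0, 0), 2)

cv2.imwrite("face_detected.jpg", cv2.cvtColor(img, cv2.COLOR_RGB2BGR))
```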

2. FaceNet

The FaceNet model was developed by Google engineers Florian Schroff, Dmitry Kalenichenko, and James Philbin. The main idea of FaceNet is to map face images into a multi-dimensional space and express facial similarity through distance in that space: the distance between images of different people is larger than the distance between images of the same person. In this way, face recognition can be realized through the spatial mapping of face images. FaceNet uses a deep neural network to perform the image mapping and a triplet-based loss function to train it, and the network directly outputs a 128-dimensional embedding vector.

The network structure of FaceNet is shown in the figure below. "Batch" denotes a batch of face training data, which is fed into a deep convolutional neural network; an L2 normalization operation is then applied, and the embedding layer produces the feature representation of the face image.

The so-called embedding can be understood as a mapping: features are mapped from the original feature space to a new feature space, and the new features can be called an embedding of the original ones.

The mapping here takes the features output by the fully connected layer at the end of the convolutional neural network and projects them onto a hypersphere; triplet loss is then used as the supervision signal to compute the network's loss and gradients.

Triplet loss is a loss computed over triplets. A triplet consists of an Anchor (A), a Positive (P), and a Negative (N): any image can serve as the anchor A, an image of the same person is its positive P, and an image of a different person is its negative N. The learning objective is as follows:

Before training, the Euclidean distance between A and P may be large while the distance between A and N may be small, as shown on the left of the figure above. During training, the distance between A and P gradually decreases while the distance between A and N gradually increases. The network thus directly learns the separability of features: distances between features of the same class should be as small as possible, and distances between features of different classes should be as large as possible.
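A minimal numpy sketch of the triplet loss on L2-normalized embeddings (the margin value 0.2 follows the commonly cited setting):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor-positive distance below the anchor-negative distance
    by at least `margin`; distances are computed on L2-normalized embeddings."""
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = map(l2_normalize, (anchor, positive, negative))
    pos_dist = np.sum((a - p) ** 2, axis=-1)   # squared distance A-P
    neg_dist = np.sum((a - n) ** 2, axis=-1)   # squared distance A-N
    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()
```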

The final FaceNet model produces a feature vector for each face image; the Euclidean distance between the feature vectors of two images then directly measures how different the two faces are.
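For example, a simple comparison function might look like the sketch below; the distance threshold of 1.1 is an illustrative assumption that in practice must be tuned on a validation set for the specific FaceNet model used.

```python
import numpy as np

def is_same_person(emb1, emb2, threshold=1.1):
    """Compare two 128-d face embeddings by Euclidean distance."""
    dist = np.linalg.norm(emb1 - emb2)
    return dist, dist < threshold

# Random vectors standing in for real FaceNet embeddings.
e1, e2 = np.random.rand(128), np.random.rand(128)
print(is_same_person(e1, e2))
```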

3. OpenCV

Combining MTCNN and FaceNet gives a face recognition pipeline. However, to recognize faces in real time we also need to capture live images. The tool I used is OpenCV, since it provides a function for accessing the PC camera: in cv2.VideoCapture(0), the argument 0 selects the built-in PC camera, and changing it to 1 selects an external camera.
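A minimal capture loop might look like this (the detection/recognition step is left as a placeholder):

```python
import cv2

cap = cv2.VideoCapture(0)          # 0 = built-in camera, 1 = external camera
while cap.isOpened():
    ok, frame = cap.read()         # grab one frame from the camera
    if not ok:
        break
    # ... run MTCNN detection / FaceNet recognition on `frame` here ...
    cv2.imshow("camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```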
