Primer

Take the face recognition pipeline of the OpenFace project as an example. This process can be regarded as a basic framework for using deep convolutional networks on face problems, as shown in the figure below.

According to the figure above, the face recognition project can be divided into five main steps:

1. Input a photo.
2. Detect the faces in the photo and classify whether each one is live.
3. Align and crop the detected live faces.
4. Extract features from each aligned, cropped face, producing a feature code (an embedding).
5. Compare pairs of feature codes.

Detailed introduction

1. Input Image -> Detect

Input: Raw images that may contain human faces.

Output: bounding boxes of live face positions.

This step is called face detection. OpenFace uses the existing face detection methods in dlib and OpenCV. These methods have nothing to do with deep learning; they rely on traditional computer vision features (HOG, Haar, etc.).
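As a minimal sketch of this step (not OpenFace's exact code), both dlib's HOG-based detector and OpenCV's Haar cascade can be called in a few lines; the file name "photo.jpg" is just a placeholder:

```python
import cv2
import dlib

img = cv2.imread("photo.jpg")                      # placeholder input image
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)         # dlib expects RGB order

# dlib's traditional HOG + linear SVM frontal face detector
detector = dlib.get_frontal_face_detector()
boxes = detector(rgb, 1)                           # 1 = upsample once, helps with small faces
for b in boxes:
    print(b.left(), b.top(), b.right(), b.bottom())

# The same idea with OpenCV's Haar cascade
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```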

Those interested in this step of face detection can refer to the following information:

dlib's implementation: blog.dlib.net/2014/02/dli…

Face Detection using Haar Cascades

Later algorithms also began to use deep-learning-based object detection methods, such as MTCNN, to detect faces.

The detected faces also need to be checked for liveness, to rule out spoofing with photos or video replays. The detected faces are fed into a liveness classification network, which keeps only the live ones.

2. Detect -> Transform -> Crop

Input: original image + face bounding box

Output: a “calibrated” image containing only the face

Given the original image and the bounding box, this step detects key points (landmarks) on the face and then aligns the face according to them. The key points are the green dots in the picture below: typically the positions of the eyes, the position of the nose, the outline of the face, and so on. With these key points we can “calibrate”, or “align”, the face. The idea is that the original face may be tilted; using the key points, an affine transformation “straightens” the face into a canonical pose, removing as much of the error caused by pose variation as possible. This step is commonly called face alignment.
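A rough sketch of the idea, assuming dlib's 68-point landmark predictor and an OpenCV affine warp (this is not OpenFace's own alignment code, and the target eye/nose positions are chosen arbitrarily for illustration):

```python
import cv2
import dlib
import numpy as np

# The landmark model file must be downloaded separately from dlib's model zoo.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(rgb_img, box, size=96):
    """Warp the face so the eyes and nose land on fixed positions in a size x size crop."""
    shape = predictor(rgb_img, box)
    pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
    left_eye = pts[36:42].mean(axis=0)     # indices from the 68-point landmark scheme
    right_eye = pts[42:48].mean(axis=0)
    nose = pts[30]
    src = np.float32([left_eye, right_eye, nose])
    dst = np.float32([[0.3 * size, 0.35 * size],    # illustrative target positions
                      [0.7 * size, 0.35 * size],
                      [0.5 * size, 0.60 * size]])
    M = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(rgb_img, M, (size, size))
```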

In OpenFace, this step also uses a traditional (non-deep-learning) method, which has the advantage of being fast. The corresponding paper is:

pdfs.semanticscholar.org/d78b/6a5b0d…

3. Crop -> Representation

Input: single face image after calibration

Output: a vector representation.

This step uses a deep convolutional network to convert the input face image into a vector representation. The vector used in OpenFace is 128×1, i.e. a 128-dimensional vector.
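OpenFace's own network is a Torch model, but the same input/output contract can be sketched with dlib's bundled ResNet encoder, which also maps an aligned face to a 128-dimensional vector (a stand-in for illustration, not OpenFace's model; the .dat files are downloaded separately):

```python
import dlib
import numpy as np

shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def embed(rgb_img, box):
    """Aligned face -> 128-D vector (the 'feature code' / embedding)."""
    shape = shape_predictor(rgb_img, box)
    descriptor = face_encoder.compute_face_descriptor(rgb_img, shape)
    return np.array(descriptor)              # shape: (128,)
```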

VGG16 is a relatively simple baseline deep learning model: the network takes an image as input, passes it through a series of convolutions, and produces category probabilities through fully connected classification layers.

In general image applications, we can remove the fully connected layers and use the output of a convolutional layer (usually the last one, e.g. conv5_3) as the extracted feature. However, if the same approach is applied to face recognition, i.e. the last convolutional layer is taken as the “vector representation” of the face, the results are not good. How can this be improved? We will come back to that; first, let us discuss what properties we want this vector representation of a face to have.
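As an aside, the generic "drop the fully connected layers" extraction mentioned above can be sketched with torchvision's VGG16; the slice index used for conv5_3 is an assumption about torchvision's layer ordering and is worth checking against the printed model:

```python
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1")
# Keep only the convolutional part up to (and including) the ReLU after conv5_3;
# the fully connected classifier is dropped entirely.
conv5_3 = nn.Sequential(*list(vgg.features.children())[:30]).eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)          # a dummy image batch
    feat = conv5_3(x)                        # (1, 512, 14, 14)
    vec = feat.flatten(1)                    # flattened "feature vector"
```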

Ideally, we’d like the distance between vector representations to be a direct reflection of face similarity:

For the same person’s face image, the corresponding vector Euclidean distance should be relatively small.

For different people’s face images, the Euclidean distance between the corresponding vectors should be relatively large.

This representation can actually be regarded as a kind of “embedding”. The original VGG16 model is trained with a softmax loss, which places no requirement on the distances between the vector representations of different classes, so its features cannot be used directly as a face representation.

For example, suppose we use a CNN to classify MNIST and design the network so that the layer before the classifier is only 2-dimensional. We can then plot the 2-dimensional vector representation of each sample (each color in the plot corresponds to one category):

The figure above is the result of training directly with softmax; it does not have the properties we want from a feature:

We want vector representations of the same class to be as close as possible. But here samples of the same class (purple, for example) can have a large intra-class distance.

We want vectors of different classes to be as far apart as possible. But near the center of the plot, the categories are squeezed very close together.
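The experiment above can be reproduced with any small CNN whose second-to-last layer is forced to 2 dimensions; the architecture below is an assumed illustration (PyTorch), not necessarily the one behind the original figure:

```python
import torch.nn as nn

class MNIST2D(nn.Module):
    """Small MNIST classifier with a 2-D feature layer so the features can be plotted."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.to_2d = nn.Linear(64 * 7 * 7, 2)   # the 2-D "feature" layer
        self.classifier = nn.Linear(2, 10)      # ordinary softmax classification on top

    def forward(self, x):
        feat = self.to_2d(self.conv(x).flatten(1))
        return self.classifier(feat), feat      # logits for the loss, features for plotting
```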

So what is the right way to train facial features? There are many approaches. One of them is Center Loss. Center Loss simply adds another loss term on top of the softmax loss: it defines a “center” point for each category, pulls the features of each category toward their center, and keeps the centers of different categories far apart from each other. After adding Center Loss, the trained features look roughly like this:

This kind of representation is much closer to what we want. The original Center Loss paper: ydwen.github.io/…/WenE… . The two figures above are also taken from this paper.
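A minimal sketch of the Center Loss idea (here the centers are learned as ordinary parameters, a common simplification of the paper's running-average center update; the weighting factor is illustrative):

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Pull each feature toward a learnable center for its class."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        centers_batch = self.centers[labels]                      # (batch, feat_dim)
        return 0.5 * (features - centers_batch).pow(2).sum(dim=1).mean()

# Usage sketch: total loss = softmax cross-entropy + a small weight on Center Loss
# loss = ce(logits, labels) + 0.003 * center_loss(features, labels)
```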

By the way, Center Loss is not the only option. There are many other ways to learn face feature representations, such as Triplet Loss (FaceNet: A Unified Embedding for Face Recognition and Clustering). Triplet Loss trains the network directly on triplets (image 1 of person A, image 2 of person A, an image of person B). The final classification layer is removed, and the network is forced to produce a unified representation for images of the same face (person A in the triplet).
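A sketch of the triplet loss itself (the margin value is illustrative): anchor and positive are two embeddings of the same person, negative is an embedding of a different person.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on batches of embeddings (shape: batch x dim)."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared distance to the positive
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared distance to the negative
    return F.relu(d_ap - d_an + margin).mean()     # positive must be closer by the margin
```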

4. Practical application

Input: the vector representation of a face.

With the vector representation of a face, the rest is straightforward, because the representation has the property that vectors of the same person are close together while vectors of different people are far apart. The typical applications fall into the following categories; a minimal code sketch of each follows the list.

Face Identification. Testing whether A and B are the same person: compute the distance between the two vectors and set an appropriate threshold.

Face Recognition. This is the most common application: given an image, find the most similar face in a database. It can obviously be cast as a nearest-neighbor search over the distances.

Face Clustering. Clustering the faces in a database; k-means can be applied directly to the vector representations.
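A minimal sketch of all three applications on top of the 128-D vectors (the threshold and all names are illustrative; scikit-learn's KMeans stands in for any clustering routine):

```python
import numpy as np
from sklearn.cluster import KMeans

def verify(vec_a, vec_b, threshold=0.8):
    """Face identification / verification: are the two embeddings the same person?
    The threshold must be tuned on a validation set."""
    return np.linalg.norm(vec_a - vec_b) < threshold

def recognize(query_vec, db_vecs, db_names):
    """Face recognition: nearest-neighbor search over the database embeddings."""
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)
    i = int(np.argmin(dists))
    return db_names[i], float(dists[i])

def cluster(db_vecs, num_people):
    """Face clustering: plain k-means on the embeddings."""
    return KMeans(n_clusters=num_people, n_init=10).fit_predict(db_vecs)
```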

Reference:

www.zhihu.com/question/60…