**【 Abstract 】** Starting from the technical details of face recognition, this article walks you through how the technology has developed. Then, through a hands-on platform example, it shows how to use public cloud computing resources to quickly train a usable face recognition model.

Preface

You have probably all seen Mission: Impossible 4, directed by Brad Bird and starring Tom Cruise. In a train station packed with people, a target is identified by a computer in the blink of an eye and agents move in to follow; the beautiful woman approaching the hero turns out to be a deadly assassin, and his phone beeps an alarm and displays her name and profile. Behind these scenes are the face recognition algorithms this article introduces, along with how to train such a model on a public cloud AI platform.

As one of the earliest and most widely deployed technologies in the field of artificial intelligence, face recognition aims to determine the identity of faces in pictures and videos. Its applications range from everyday phone face unlock and face payment to access control in the security field, and so on. The human face is an inherent attribute of every person; it is unique and difficult to copy, which provides the necessary premise for identity verification.

Face recognition research began in the 1960s. With the development of computer technology and optical imaging, and especially with the resurgence of neural networks in recent years — above all the great success of convolutional neural networks in image recognition tasks — the performance of face recognition systems has improved enormously. In this article, we start from the technical details of face recognition and walk through how the technology has developed. In the second half of the article, we use the custom image feature of the ModelArts platform to show how to use public cloud computing resources to quickly train a usable face recognition model.

Main text

Whether a face recognition system is based on traditional image processing and machine learning or on deep learning, the overall flow is the same. As shown in Figure 1, it includes face detection, alignment, encoding and matching. This part therefore first gives an overview of a face recognition system based on traditional image processing and machine learning algorithms, so that we can see how deep learning algorithms later developed in the field of face recognition.

Face recognition process

Traditional machine learning algorithms

As mentioned above, the purpose of face recognition is to determine the identity of the face in an image, so the face must first be detected in the image; this step is ultimately an object detection problem. A traditional image object detection algorithm consists mainly of three parts — proposal generation, feature engineering and classification — and the optimization ideas of the famous RCNN series of algorithms are also built around these three parts.

First comes proposal generation. The simplest idea is to crop a large number of candidate boxes out of the image and then check whether each box contains a target; if it does, the position of that box in the original image is the location of the detected target. The better a proposal-generation strategy covers the targets, the better the strategy. Common strategies include sliding windows, Selective Search, Randomized Prim and so on, all of which generate a large number of candidate boxes, as shown in the figure below.

After obtaining a large number of candidate boxes, the most important part of a traditional face detection algorithm is feature engineering. Feature engineering essentially uses the expert experience of algorithm engineers to extract various features of faces in different scenes, such as edge features, shape and morphological features, texture features and so on. Concrete algorithms include LBP, Gabor, Haar, SIFT and other feature extractors, which convert a face picture represented by a two-dimensional matrix into various feature-vector representations.

Once feature vectors are obtained, a traditional machine learning classifier — AdaBoost, cascade classifiers, SVM, random forest, etc. — can be used to decide whether the features correspond to a human face. After classification, we obtain the face region, the feature vector and the classification confidence. With this information, we can complete face alignment, feature representation, and face matching and recognition.

Take the classic Haar + AdaBoost approach as an example of the traditional method. In the feature extraction stage, Haar features are first used to extract many simple features from the image; examples of Haar features are shown in the figure below. In order to detect faces of different sizes, a Gaussian pyramid is usually built so that Haar features can be extracted from the image at different resolutions.

A Haar feature is calculated as the difference between the sum of pixels in the white area and the sum in the black area, so its value differs between face and non-face regions. In practice, this can be computed quickly using integral images. For training images normalized to 20*20, there are roughly 10,000 available Haar features, a feature scale at which machine learning algorithms can comfortably be used for classification and recognition.
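As a concrete illustration (a minimal NumPy sketch, not from the original text), the integral image lets any rectangle sum, and hence a two-rectangle Haar-like feature, be evaluated with a handful of array lookups:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded so the first row/column are zeros."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle [x, x+w) x [y, y+h) using 4 lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, x, y, w, h):
    """A simple two-rectangle Haar-like feature: white (top) minus black (bottom)."""
    white = box_sum(ii, x, y, w, h // 2)
    black = box_sum(ii, x, y + h // 2, w, h // 2)
    return white - black

img = np.random.randint(0, 256, (20, 20)).astype(np.float64)  # a 20x20 training patch
ii = integral_image(img)
print(haar_two_rect_vertical(ii, 4, 4, 8, 10))
```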

After Haar features are obtained, the AdaBoost algorithm can be used for classification. AdaBoost combines several relatively weak classifiers into a strong one. Face detection is then completed according to the cascade of classifiers and the trained threshold of each selected feature.
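For example, the classic Haar + AdaBoost cascade is available off the shelf in OpenCV; a minimal sketch (the image path is a placeholder) looks like this:

```python
# Minimal sketch of Haar-cascade face detection with OpenCV's pretrained
# frontal-face cascade; "test.jpg" is a placeholder input image.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("test.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                       # one (x, y, w, h) box per face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("result.jpg", img)
```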

As can be seen from the above, traditional machine learning algorithms are feature-based, so they require a great deal of expert experience from algorithm engineers for feature engineering and parameter tuning, and the resulting accuracy is not very good. Moreover, it is difficult for hand-designed features to be robust to the many variations found in unconstrained environments. In the past, image algorithm engineers relied on traditional image processing methods to extract a large number of features based on the actual scene and expert experience, and then applied statistical learning to those features, so overall performance depended heavily on the scene and on expert experience. For face recognition — where the number of categories is huge and the samples per category are severely imbalanced — the results in unconstrained scenarios were not good. Therefore, in recent years, with the great success of deep learning in image processing, face recognition technology has also shifted to deep learning and has achieved very good results.

Application of deep learning in face recognition

In a deep-learning-based face recognition system, the problem is divided into a target detection problem and a classification problem, and target detection in deep learning is essentially a combination of classification and regression. With the successful application of convolutional neural networks to image classification, the performance of face recognition systems improved quickly and dramatically. As a result, a large number of computer vision algorithm companies were born, and face recognition has been applied to every corner of social life.

In fact, using neural networks for face recognition is not a new idea. In 1997, researchers proposed a probabilistic decision-based neural network (PDBNN) for face detection, eye localization and face recognition. The face recognition PDBNN is divided into a fully connected sub-network for each training subject, which reduces the number of hidden units and avoids over-fitting. The researchers trained two PDBNNs separately on intensity and edge features, and then combined their outputs to obtain the final classification decision. However, because computing power and data were severely lacking at the time, the algorithm was relatively simple and did not achieve good results. Only with the maturation of back-propagation theory and computing frameworks in recent years did the performance of face recognition algorithms begin to improve dramatically.

In deep learning, a complete face recognition system also includes the four steps shown in Figure 1. The first step is the face detection algorithm, which is essentially a target detection algorithm. The second step is face alignment, which is currently done either by geometric alignment based on key points or by deep-learning-based face alignment. The third step is feature representation: following the idea of a classification network, certain feature layers of the network are extracted as the face's feature representation, and standard face images are processed in the same way. Finally, the whole face recognition system is completed by comparison and retrieval. Below is a brief overview of how face detection and face recognition algorithms have developed.

Face detection

After deep learning's great success in image classification, it was soon applied to face detection. At first, most approaches relied on the scale invariance of CNNs: scale the image to different sizes, run inference, and directly predict the class and location. In addition, because the position is regressed directly from each point of the feature map, the accuracy of the resulting face boxes was relatively low. Therefore, coarse-to-fine detection strategies based on multi-stage classifiers were proposed, such as Cascade CNN, DenseBox and MTCNN.

MTCNN is a multi-task method that, for the first time, handles face region detection and face key-point detection together. Like Cascade CNN, it is based on a cascade framework, but its overall design is more ingenious and reasonable. MTCNN consists of three parts — PNet, RNet and ONet — whose network structures are shown in the figure below.

First, PNet resizes the input image to different sizes and, taking each as input, directly regresses face classification and face detection boxes through a few convolutional layers; this part is called coarse detection. The coarsely detected faces are then cropped from the original image and fed into RNet for face detection again. Finally, the resulting faces are fed into ONet, and ONet's output is the final face detection result. The overall MTCNN pipeline is relatively simple and can be deployed quickly, but it has several drawbacks: multi-stage training is time-consuming, and a large number of intermediate results must be stored, which takes a lot of space. In addition, because the network regresses bounding boxes directly from feature points, detection of small faces is not very good. Moreover, during inference the network has to resize the image to many different scales to handle faces of different sizes (see the sketch below), which seriously slows down inference.
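As a rough sketch of that multi-scale input (an illustration under common MTCNN defaults, not the original implementation), the image pyramid fed to PNet can be built like this:

```python
# Build the list of rescaled copies of the input image that PNet consumes one
# by one. min_face=20 and factor=0.709 are commonly used defaults; 12 is the
# minimum face size PNet can see.
from PIL import Image

def build_pyramid(img, min_face=20, factor=0.709):
    scales = []
    m = 12.0 / min_face
    min_side = min(img.size) * m
    while min_side >= 12:
        scales.append(m)
        m *= factor
        min_side *= factor
    return [img.resize((int(img.width * s), int(img.height * s))) for s in scales]

pyramid = build_pyramid(Image.new("RGB", (640, 480)))
print([p.size for p in pyramid])   # each size is run through PNet separately
```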

With the development of target detection, more and more experimental evidence showed that the bottleneck in target detection lies in the contradiction between low-level feature maps, which have low semantics but relatively high localization accuracy, and high-level feature maps, which have high semantics but poor localization accuracy. Target detection networks therefore began to adopt anchor-based strategies and cross-layer fusion strategies, as in the famous Faster-RCNN, SSD and YOLO series. Face detection algorithms likewise increasingly use anchors and multi-scale outputs to handle faces of different sizes; among them, the best-known algorithm is the SSH network structure.

As can be seen from the figure above, the SSH network already processes the outputs of different network layers: only one inference pass is needed to detect faces of different sizes, hence the name Single Stage. The SSH network is also relatively simple: branches are attached to different convolutional layers of VGG for computation and output. In addition, high-level features are up-sampled and fused with low-level features via an element-wise sum (Eltwise Sum). SSH also designs a Detection Module and a Context Module; the Context Module, as part of the Detection Module, adopts an Inception-like structure to capture more context information and a larger receptive field.
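A minimal PyTorch sketch of the low/high-level fusion described above (an illustration, not the SSH source; the channel sizes are assumptions): project both maps with 1×1 convolutions, upsample the high-level map, and fuse by element-wise sum:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch=128):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.reduce_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)

    def forward(self, low_feat, high_feat):
        low = self.reduce_low(low_feat)
        # upsample the semantically rich (but coarse) high-level map to the
        # resolution of the low-level map, then fuse by element-wise sum
        high = F.interpolate(self.reduce_high(high_feat),
                             size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        return low + high

fuse = FuseBlock(low_ch=512, high_ch=512)
out = fuse(torch.randn(1, 512, 80, 80), torch.randn(1, 512, 40, 40))
print(out.shape)   # torch.Size([1, 128, 80, 80])
```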

The Detection Module in SSH

The Context Module inside SSH's Detection Module

SSH uses 1×1 convolutions to output the final regression and classification branches, without any fully connected layer, so inputs of different sizes can all produce outputs; this also reflects the fully convolutional design trend of the time. Unfortunately, the network does not output landmark points, the context structure does not use the now-popular feature pyramid structure, and the VGG16 backbone is relatively shallow. As face detection kept developing and the various tricks matured, newer designs appeared; we therefore next introduce the RetinaFace network, which is widely used in face detection.

RetinaFace is essentially built on the RetinaNet network structure and adopts feature pyramid techniques to fuse multi-scale information, which plays an important role in detecting small objects. The network structure is shown below.

RetinaFace's backbone is an ordinary convolutional neural network; by adding a feature pyramid and a Context Module it incorporates contextual information and performs multiple tasks, including classification, detection, landmark regression and a self-supervised auxiliary task.

Because face detection is essentially a target detection task, future directions in target detection also apply to face detection. Detecting small and occluded targets remains difficult, and since most detection networks are deployed on edge devices, model compression and inference acceleration on the device side further test an algorithm engineer's understanding and application of deep learning detection algorithms.

Face recognition

Face recognition is essentially a classification problem: each person is treated as a class. However, many issues arise in practice. First, the number of classes is huge: to recognize everyone in a town, the number of classes can approach hundreds of thousands. Second, only a few labeled samples are available for each person, producing a lot of long-tail data. These problems require the traditional CNN classification network to be modified.

As we know, although a deep convolutional network is a black-box model, it can learn to represent the features of pictures or objects through training on data. A face recognition algorithm can therefore extract face feature vectors with a convolutional network and then complete recognition by comparing similarity against a base library. The focus of this kind of embedding task is whether the network can generate distinct features for different faces and similar features for the same face — in other words, how to maximize inter-class distance while minimizing intra-class distance.
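As a toy sketch of this embed-and-compare idea (the 512-dimensional features, gallery size and threshold are illustrative assumptions):

```python
# L2-normalize the face features, then rank base-library ("gallery") entries
# by cosine similarity against the probe feature.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

gallery = normalize(np.random.randn(1000, 512))   # base-library features
names = [f"person_{i}" for i in range(1000)]
query = normalize(np.random.randn(512))           # feature of the probe face

scores = gallery @ query                          # cosine similarities
best = int(np.argmax(scores))
if scores[best] > 0.5:                            # illustrative threshold
    print("match:", names[best], float(scores[best]))
else:
    print("no match in base library")
```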

In face recognition, the backbone can be any of various convolutional neural networks for feature extraction, such as ResNet, Inception and other classic architectures; the key lies in the design and implementation of the loss function of the last layer. We now analyze the loss functions of deep-learning-based face recognition algorithms from two perspectives.

Idea 1: metric learning, including Contrastive Loss, Triplet Loss and their sampling methods.

Idea 2: margin-based classification, including Softmax with Center Loss, SphereFace, NormFace, AM-Softmax (CosFace) and ArcFace.

1. Metric Learning

(1) Contrastive loss

One of the first applications of metric learning in deep learning was DeepID2. Its main improvement is to train verification and classification on the same network, with two supervision signals, where a Contrastive Loss is applied to the feature layer as the verification loss.

Contrastive Loss considers not only minimizing the distance between samples of the same class, but also maximizing the distance between different classes, and it makes full use of the label information of training samples to improve face recognition accuracy. In essence, this loss makes photos of the same person close enough in feature space, while pushing different people far enough apart until a certain threshold is exceeded. (It sounds a bit like Triplet Loss.)

DeepID2 introduces two supervision signals and trains the network with both. The identification (recognition) signal is expressed as follows:
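In its commonly cited form (reconstructed here), the identification signal is a softmax cross-entropy over the $n$ identities, where $\hat{p}_t$ is the predicted probability of the true identity $t$:

$$\mathrm{Ident}(f, t, \theta_{id}) = -\sum_{i=1}^{n} p_i \log \hat{p}_i = -\log \hat{p}_t$$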

The verification signal is expressed as follows:
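A commonly cited form of this verification signal is a contrastive loss over feature pairs $(f_i, f_j)$ with margin $m$ (reconstructed here):

$$\mathrm{Verif}(f_i, f_j, y_{ij}, \theta_{ve}) = \begin{cases} \tfrac{1}{2}\lVert f_i - f_j\rVert_2^2, & y_{ij} = 1 \\ \tfrac{1}{2}\max\big(0,\, m - \lVert f_i - f_j\rVert_2\big)^2, & y_{ij} = -1 \end{cases}$$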

Based on these signals, DeepID2 is trained not on single images but on image pairs: two images are input at a time, with label 1 if they show the same person and -1 otherwise.

(2) The Triplet loss from FaceNet

FaceNet, the 2015 paper from Google, was a watershed in the field of face recognition. It proposed a unified framework for most face-related problems: recognition, verification, search and so on can all be handled in a feature space, and the key problem becomes simply how to map faces into that feature space better.

Building on DeepID2, Google dropped the classification layer (the classification loss) and improved Contrastive Loss into Triplet Loss, for one purpose only: to learn a better feature.

The Triplet Loss is given directly below. The input is no longer an image pair but a triplet of three images: an anchor face, a negative face and a positive face. The anchor and the positive face belong to the same person, while the negative face is a different person. The Triplet Loss can then be expressed as:
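In its standard published form (reconstructed here, with $f(\cdot)$ the embedding and $\alpha$ the margin):

$$L = \sum_{i=1}^{N}\Big[\lVert f(x_i^a) - f(x_i^p)\rVert_2^2 - \lVert f(x_i^a) - f(x_i^n)\rVert_2^2 + \alpha\Big]_+$$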

The intuitive interpretation of this formula is that, in feature space, the distance between the anchor and the positive should be smaller than the distance between the anchor and the negative by at least a margin alpha. The intuitive difference between Triplet Loss and Contrastive Loss is shown below.

(3) Metric learning problems

Both of the above loss functions work well and match our intuition, and they are widely used in practical projects, but the approach still has some shortcomings.

  • **Model training relies on a large amount of data, and fitting is slow.** Since Contrastive Loss and Triplet Loss are built on pairs or triplets, a large number of positive and negative samples must be prepared, and even after long training it is impossible to iterate over all possible combinations of samples. Some blog posts report that it takes about a month to fit an Asian dataset of roughly 10,000 identities and 500,000 images.
  • **The sampling scheme affects model training.** For Triplet Loss, for example, the anchor, negative and positive faces must be sampled during training. Good sampling can speed up training and convergence, but purely random sampling rarely achieves very good results.
  • **Hard triplet mining is lacking, which is a problem for most model training.** In face recognition, for instance, hard negatives are similar-looking but different people, while hard positives are the same person with completely different pose, expression and so on. Mining and specially handling hard examples is crucial for improving the accuracy of the recognition model.

2. Tricks to remedy the deficiencies of metric learning

(1) Finetune

Reference paper: Deep Face Recognition

In the paper Deep Face Recognition, in order to speed up Triplet Loss training, Softmax is first used to train the face recognition model; the top classification layer is then removed, and Triplet Loss is used to fine-tune the feature layer of the model. This achieves good results while accelerating training, and it is also the most common way to train Triplet Loss.

(2) Modification of Triplet Loss

In Defense of the Triplet Loss for Person Re-identification

The authors point out a drawback of Triplet Loss: the triplets of anchor (a), positive (p) and negative (n) needed for training are selected at random from the training set. Driven by the loss function, it is likely that very easy combinations are selected — positive samples that look very similar and negative samples that look very different. But letting the network keep learning from easy samples limits its generalization ability. The authors therefore modified Triplet Loss and added a new trick, and a large number of experiments showed that the improvement is very effective.

In the FaceNet Triplet Loss training provided by Google, once B triplets are selected, the images are arranged sequentially in groups of three, giving 3B images in total; however, these 3B images can actually form far more valid triplet combinations, so using only B of them is wasteful.

The paper proposes a TriHard loss, whose core idea is to add hard-example mining on top of Triplet Loss: for each training batch, P identities (pedestrians) are randomly selected, and K different pictures are randomly chosen for each, so a batch contains P×K images. Then, for each picture a in the batch, we select the hardest positive sample and the hardest negative sample to form a triplet with a. Denote by A the set of pictures with the same ID as a and by B the remaining pictures with different IDs; the TriHard loss can then be expressed as:
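In its commonly cited form (reconstructed here, with $d$ the Euclidean distance in feature space and $\alpha$ the margin):

$$L_{TriHard} = \frac{1}{P\times K}\sum_{a \in batch}\Big[\max_{p \in A} d_{a,p} - \min_{n \in B} d_{a,n} + \alpha\Big]_+$$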

Here α is a manually set margin (threshold) parameter. TriHard loss computes the Euclidean distance in feature space between a and every other image in the batch, then selects the farthest (least similar) positive sample p and the closest (most similar) negative sample n to compute the triplet loss, where d denotes Euclidean distance. The loss function can also be written in another way, as follows:

In addition, the authors put forward several experimental observations in the paper:

  • The squared Euclidean distance does not perform as well as the plain (non-squared) Euclidean distance (the reason is briefly discussed later).
  • A soft-margin loss function is proposed to replace the original Triplet Loss expression. The soft margin makes the loss smoother, helps avoid convergence to a bad local optimum, and to some extent accelerates convergence.
  • Batch hard sampling is introduced.

By taking hard examples into account, this method performs better than the traditional Triplet Loss.

(3) Modifications to the loss and the sampling method

Deep Metric Learning via Lifted Structured Feature Embedding

This paper first points out that existing triplet-based methods cannot fully exploit the batches of mini-batch SGD training. It creatively converts the vector of pairwise distances into a pairwise distance matrix, and then designs a new structured loss function, achieving very good results. The figure below shows the sampling diagrams of contrastive embedding, triplet embedding and lifted structured embedding.

Intuitively, lifted structured embedding involves many more sample pairs. To avoid the training difficulties caused by this large amount of data, the authors give a structured loss function on this basis, as shown below.

Here P is the set of positive pairs and N the set of negative pairs. Compared with the loss functions above, this loss starts to consider a whole set of samples. However, not all the negative edges between sample pairs carry useful information; in other words, randomly sampled negative edges carry very limited information, so a non-random sampling method is needed.

From the structured loss above, we can see that the hardest pairs — the most similar negative pairs and least similar positive pairs (the use of max in the loss) — are considered in the final computation, which is equivalent to adding difficult neighbors into the mini-batch during training. In this way, the training data has a high probability of containing hard negatives and hard positives, and as training continues, training on hard samples pushes inter-class distance up and intra-class distance down.

As shown in the figure above, in metric learning this paper does not select sample pairs at random but integrates multiple hard-to-distinguish samples into training. In addition, taking the max, i.e. using only the single hardest negative, causes the network to converge to a bad local optimum — I suspect because the truncation of max makes the gradient too steep or introduces too many discontinuities. The authors therefore further improve the loss with a smooth upper bound, as shown below.

(4) Further modifications to the sampling method and Triplet Loss

Sampling Matters in Deep Embedding Learning

  • Modification of sampling method

**The article points out that because the distances of hard negative samples are small, this sampling method is easily affected by noise, causing the model to collapse during training.** FaceNet once proposed semi-hard negative mining, which keeps samples from being too hard. According to the authors' analysis, however, samples should be drawn evenly across distances: the best sampling state is a set of evenly scattered negative samples containing hard, semi-hard and easy examples. The authors therefore propose a new sampling method, distance-weighted sampling.

If we sample all pairs uniformly and compute their distances, the distribution of pairwise distances has the following form:
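The commonly cited result for points distributed uniformly on an $n$-dimensional unit hypersphere is (reconstructed here):

$$q(d) \propto d^{\,n-2}\Big[1 - \tfrac{1}{4}d^{2}\Big]^{\frac{n-3}{2}}$$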

Then, for a given distance, the sampling probability can be obtained through the inverse of the above function, and the proportion of samples drawn at each distance follows this probability. Given an anchor, the probability of sampling a negative example is as follows:
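In its commonly cited form (reconstructed here, with $\lambda$ a clipping constant that bounds the weight given to rare distances):

$$\Pr\big(n^{*}=n \mid a\big) \propto \min\big(\lambda,\; q^{-1}(d_{an})\big)$$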

Because the training samples are strongly correlated with the training gradients, the authors also plot the relationship between sampling distance, sampling method and the variance of the gradient, as shown in the figure below. Samples drawn by hard negative mining all lie in high-variance regions; if the dataset contains noise, sampling is easily affected by it and the model may collapse. Randomly drawn samples tend to concentrate in low-variance regions, where the loss is small, but the model is not actually well trained. Semi-hard negative mining samples from a very narrow range, which may cause the model to converge very early while the loss is still decreasing slowly; again, the model is not well trained at that point. The method proposed in this paper achieves uniform sampling over the whole dataset.

  • Modification of Loss Function

**The authors observe a problem with contrastive loss and Triplet Loss: when the negative sample is very hard, the loss function is very flat, which means the gradient is very small. A small gradient means the hard sample cannot be trained sufficiently, the network obtains no effective information from it, and so performance on hard samples deteriorates.** So if the loss were not flat around hard samples — that is, if its derivative were 1, as commonly used in deep learning (just like ReLU) — would hard samples no longer suffer from vanishing gradients? In addition, the loss should handle both positive and negative samples as in Triplet Loss and include a margin design, i.e. adapt to different data distributions. The loss function is as follows:
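A commonly cited form of this margin-based loss (reconstructed here, consistent with the notation explained below: $D_{ij}$ is the pair distance, $y_{ij}\in\{1,-1\}$, $\beta$ the boundary and $\alpha$ the separation margin) is:

$$\ell(i, j) = \Big[\alpha + y_{ij}\big(D_{ij} - \beta\big)\Big]_+$$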

Call the distance between the anchor and a positive sample the positive-pair distance, and the distance between the anchor and a negative sample the negative-pair distance. The parameter β in the formula defines the boundary between positive-pair and negative-pair distances: if a positive-pair distance D_ij is greater than β, the loss increases; likewise, if a negative-pair distance D_ij is less than β, the loss increases. α controls the separation margin between the pairs; y_ij is 1 for a positive pair and -1 for a negative pair. The figure below shows the loss curve.

From the curve you can see why the gradient vanishes for very hard samples: as the distance approaches zero, the blue line becomes flatter and the gradient becomes smaller and smaller. In addition, the authors also optimize the settings of β, including a sample-level bias, a class-level bias and the hyperparameters, further refining the loss so that it can adapt automatically during training.

3. Margin Based Classification

Margin-based classification does not compute a metric learning loss directly on the feature layer with strong explicit constraints; instead, it still treats face recognition as a classification task. By modifying the Softmax formula, a margin is imposed on the feature layer indirectly, which makes the features learned by the network more discriminative.

(1) Center Loss

A Discriminative Feature Learning Approach for Deep Face Recognition

This ECCV 2016 paper mainly proposes a new loss, Center Loss, to assist Softmax Loss in face training: it compresses samples of the same class together, ultimately yielding more discriminative features. Center Loss maintains a class center for each category and minimizes the distance between each sample in the mini-batch and its corresponding class center, thereby reducing intra-class distance. The loss that minimizes the distance between a sample and its class center is shown below.

For each sample in the batch, the distance to its class center is measured with the Euclidean distance, treated as the distance on a high-dimensional manifold; the class center has the same dimension as the feature. On top of Softmax, the Center Loss is therefore:
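In its standard published form (reconstructed here, with $x_i$ the feature of sample $i$, $c_{y_i}$ its class center, and $\lambda$ weighting the center term against the softmax term):

$$L_C = \frac{1}{2}\sum_{i=1}^{m}\lVert x_i - c_{y_i}\rVert_2^2, \qquad L = L_{Softmax} + \lambda\, L_C$$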

My personal understanding of Center Loss is that it adds clustering to the loss function: as training progresses, samples are consciously pulled toward the center of their class in each batch, further enlarging the differences between classes. However, in my opinion, for high-dimensional features the Euclidean distance does not reflect clustering distance well, so such simple clustering cannot achieve particularly good results in high dimensions.

(2) L-Softmax

The original Softmax rewrites the inner product between the feature and the weight vector as a relation between their norms and the angle between them. On this basis, L-Softmax introduces a positive integer m, so that:
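The margin requirement is commonly written as follows (reconstructed here; for $\theta_{y_i}\in[0, \pi/m]$ we have $\cos(m\theta_{y_i}) \le \cos\theta_{y_i}$, so satisfying the right-hand inequality is stricter than the original Softmax condition):

$$\lVert W_{y_i}\rVert\lVert x_i\rVert\cos(\theta_{y_i}) \;\ge\; \lVert W_{y_i}\rVert\lVert x_i\rVert\cos(m\theta_{y_i}) \;>\; \lVert W_j\rVert\lVert x_i\rVert\cos(\theta_j)$$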

The resulting decision boundary constrains the above inequality more strictly, making intra-class spacing more compact and inter-class spacing more discriminative. Combining the above with the Softmax formula gives the L-Softmax loss:
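In its commonly cited form (reconstructed here, with $\psi(\theta)$ the margin-modified cosine defined just below):

$$L_i = -\log\frac{e^{\lVert W_{y_i}\rVert\lVert x_i\rVert\,\psi(\theta_{y_i})}}{e^{\lVert W_{y_i}\rVert\lVert x_i\rVert\,\psi(\theta_{y_i})} + \sum_{j\neq y_i} e^{\lVert W_j\rVert\lVert x_i\rVert\cos(\theta_j)}}$$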

Since cosine is a decreasing function on [0, π], multiplying the angle by m makes the inner product smaller, and eventually, as training proceeds, the distance between the classes themselves increases. By controlling the size of m, the changes in intra-class and inter-class distance can be observed, as shown in the two-dimensional diagram below:

In order to ensure that, during back-propagation and inference, the angle between class vectors can exceed the margin range while the function remains monotonically decreasing, the authors construct a new function form:
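The commonly cited construction (reconstructed here) extends $\cos(m\theta)$ piecewise so that it decreases monotonically over $[0, \pi]$, with $k \in \{0, 1, \dots, m-1\}$:

$$\psi(\theta) = (-1)^{k}\cos(m\theta) - 2k, \qquad \theta \in \Big[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\Big]$$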

Some practitioners report that L-Softmax is hard to tune, and m needs repeated adjustment to achieve good results.

(3) NormFace

NormFace: L2 Hypersphere Embedding for Face Verification

This is a very interesting paper, which makes many interesting observations about normalizing weights and features. It argues that SphereFace, while effective, is not elegant. In the testing phase, SphereFace measures similarity by the cosine between features, i.e. by angle. During training, however, the weights are not normalized; since the loss keeps decreasing during training, the weight norms grow larger and larger, so the optimization direction of the SphereFace loss is not entirely rigorous — part of the optimization actually goes into increasing the feature length. Some bloggers have run experiments and found that as m increases, the scale of the coordinates also increases, as shown in the figure below.

The authors therefore normalize the features during optimization; the corresponding loss function is as follows:
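In its commonly cited form (reconstructed here, with $\tilde{W}$ and $\tilde{f}$ the normalized weights and features and $s$ the scale factor):

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s\,\tilde{W}_{y_i}^{T}\tilde{f}_i}}{\sum_{j} e^{\,s\,\tilde{W}_{j}^{T}\tilde{f}_i}}$$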

Here W and f are the normalized weights and features, and their dot product is the cosine of the angle. The parameter s is introduced for its mathematical properties, ensuring a reasonable gradient magnitude; there is a more intuitive explanation in the original paper, which is worth reading but not essential here. s can either be a learnable parameter or a hyperparameter, and the paper's authors give many recommended values, which can be found in the paper. In fact, the normalized Euclidean distance used in FaceNet is equivalent to the cosine distance.

(4) AM-Softmax / CosFace

Additive Margin Softmax for Face Verification

CosFace: Large Margin Cosine Loss for Deep Face Recognition

Looking at the papers above, you will notice that something is still missing — the margin, or at least the margin plays only a small role. AM-Softmax therefore introduces a margin on top of normalization. The loss function is as follows:
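In its standard published form (reconstructed here, with normalized weights and features, scale $s$ and additive margin $m$):

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s(\cos\theta_{y_i} - m)}}{e^{\,s(\cos\theta_{y_i} - m)} + \sum_{j\neq y_i} e^{\,s\cos\theta_j}}$$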

Intuitively, cos θ − m is smaller than cos θ, so the loss value is larger than in NormFace, which gives the sense of a margin. m is a hyperparameter that controls the penalty: the larger m is, the stronger the penalty. The advantage of this method is that it is easy to reproduce, there are few parameter-tuning tricks, and the results are very good.

(5) ArcFace

Compared with AM-Softmax, the difference lies in the way ArcFace introduces the margin. Its loss function is:
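In its standard published form (reconstructed here; note the additive angular margin $m$ inside the cosine):

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s\cos(\theta_{y_i} + m)}}{e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j\neq y_i} e^{\,s\cos\theta_j}}$$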

At first glance it looks like AM-Softmax, but note that m is inside the cosine. The paper points out that the feature boundary obtained from this formulation is better and has a stronger geometric interpretation.
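To make the formula concrete, here is a compact PyTorch sketch of an ArcFace-style head written from the loss above (an illustration rather than the authors' code; s = 64 and m = 0.5 are commonly published defaults, and the numerically simpler acos form is used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        # cosine of the angle between normalized features and class weights
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)        # additive angular margin
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (one_hot * target + (1 - one_hot) * cos)
        return F.cross_entropy(logits, labels)

head = ArcFaceHead(feat_dim=512, num_classes=10)
loss = head(torch.randn(8, 512), torch.randint(0, 10, (8,)))
loss.backward()
```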

But is introducing the margin this way problem-free? Think about it: is cos(θ + m) definitely smaller than cos(θ)?

Finally, let us use the figure from the paper to explain this question and to summarize margin-based classification for this chapter.

This figure is from the ArcFace paper. The abscissa is θ, the angle between the feature and the class center; the ordinate is the value of the exponent in the numerator of the loss function (ignoring s). The smaller this value is, the larger the loss.

After reading so many classification-based face recognition papers, you may get the feeling that everyone is essentially working on the loss function — or, more specifically, designing the target logit–θ curve shown above.

This curve determines how off-target samples are optimized, i.e. how much penalty is applied depending on how far a sample deviates from its target. Two points:

1. Overly strong constraints do not generalize well. For example, SphereFace's loss satisfies the requirement that the maximum intra-class distance be smaller than the minimum inter-class distance when m = 3 or 4. At that point the loss is large, i.e. the target logits are small, but it does not follow that this generalizes to samples outside the training set. Imposing constraints that are too strong reduces model performance and makes training hard to converge.

2. It is important to choose which samples to optimize. The ArcFace paper points out that over-penalizing samples with θ ∈ [60°, 90°] may prevent training from converging; optimizing samples with θ ∈ [30°, 60°] can improve accuracy, while over-optimizing samples with θ ∈ [0°, 30°] brings no significant improvement. As for samples with even larger angles, which deviate too far from the target, forcing optimization on them is likely to degrade model performance.

This also answers the question left above: the fact that ArcFace's curve rises at large angles is not important, and may even be beneficial, because optimizing hard samples with very large angles may not help. This is similar to FaceNet's semi-hard strategy for sample selection.

Margin-based classification

1. A discriminative feature learning approach for deep face recognition [14]

Center Loss is proposed and added, with a weight, to the original Softmax Loss. By maintaining a class center in Euclidean space, it reduces intra-class distance and enhances the discriminative power of the features.

2. Large-margin softmax loss for convolutional neural networks [10]

An earlier article by the SphereFace authors that introduced a margin into Softmax Loss without normalizing the weights. It also covers the training details of SphereFace.

Use ModelArts to train face models

Explanation of the face recognition algorithm implementation

The face recognition algorithm deployed in this article mainly consists of two parts:

  1. The first part is the face detection model, which locates the face in the picture and returns its position.
  2. The second part is the face feature representation model, also called the recognition model. It encodes the cropped face image into a fixed-dimensional vector, which is then compared against the base library to complete the overall face recognition process.

As shown in the figure below, the overall implementation is divided into an offline part and an online part. Before recognizing different people, the trained algorithm is first used to generate a standard face base library, and the base data is saved in ModelArts. Then, during each inference, the input image passes through the face detection model and the face recognition model to obtain the face feature, which is searched against the base library for the most similar feature, completing the face recognition process.

In the implementation, we use a RetinaFace + ResNet50 + ArcFace pipeline to extract features from face images, where RetinaFace serves as the detection model and ResNet50 + ArcFace as the feature extraction model.

In the image, there are two scripts for running training, corresponding to face detection training and face recognition training respectively.

  • The training script for face detection is:

    run_face_detection_train.sh

The start command of the script is

 sh run_face_detection_train.sh data_path model_output_path

model_output_path is the model output path, data_path is the input path of the face detection training set, and the input image directory structure is as follows:

  detection_train_data/
    train/
      images/
      label.txt
    val/
      images/
      label.txt
    test/
      images/
      label.txt
  • The training script for face recognition is:

    run_face_recognition_train.sh

The start command of the script is

 sh run_face_recognition_train.sh data_path model_output_path

model_output_path is the model output path, data_path is the input path of the face recognition training set, and the input data structure is as follows:

  recognition_train_data/
    cele.idx
    cele.lst
    cele.rec
    property
  • The script that generates the base library:

    run_generate_data_base.sh

The script starts with the following commands:

sh run_generate_data_base.sh data_path detect_model_path recognize_model_path db_output_path

data_path is the base-library input path, detect_model_path is the detection model input path, recognize_model_path is the recognition model input path, and db_output_path is the base-library output path.

  • The face recognition (inference) script:

    run_face_recognition.sh

The script starts with the following commands:

sh run_face_recognition.sh data_path db_path detect_model_path recognize_model_path

Here data_path is the test image input path, db_path is the base-library path, detect_model_path is the detection model input path, and recognize_model_path is the recognition model input path.

The training process

Huawei Cloud ModelArts provides training jobs for model training and for managing training parameters and versions, which is very useful for developers who iterate many times. Training jobs offer preset images and algorithms; at present there are preset images for the common frameworks (including Caffe, MXNet, PyTorch and TensorFlow) as well as Huawei's own Ascend-Powered-Engine.

In this article, based on ModelArts's custom image feature, we upload a complete image that has already been debugged locally and train the model with Huawei Cloud GPU resources.

We want to train a face recognition model on Huawei Cloud ModelArts using publicly available celebrity data from the web. Because the face recognition network was designed by our own engineers, it has to be uploaded through a custom image. The whole training process is therefore divided into the following nine steps:

  1. Build the local Docker environment
  2. Download the base image file from Huawei Cloud
  3. Build a custom image environment based on your requirements
  4. Import the training data into the custom image
  5. Import the face recognition base library into the custom image
  6. Import the pre-trained model into the custom image
  7. Upload a custom image to SWR
  8. Use Huawei Cloud training jobs for training
  9. Use Huawei Cloud for inference

Build the local Docker environment

The Docker environment can be built on a local computer, or you can purchase an Elastic Cloud Server on Huawei Cloud to build it. Refer to the official Docker documentation for the whole process:

Docs.docker.com/engine/inst…

Download the base image file from Huawei Cloud

Official website:

Support.huaweicloud.com/engineers-m…

We need an MXNet environment for training, so we first download the base image for the custom image from Huawei Cloud. The download command given on the official website is as follows:

An explanation of this command can be found in the base image specification for training jobs.

Support.huaweicloud.com/engineers-m…

According to our script requirements, I used the CUDA 9 image:

The official documentation also offers another method: using a Dockerfile. The Dockerfile of the base image can also be found in the training base image specification. Dockerfile:

Github.com/huaweicloud…

Build a custom image environment based on your requirements

Because I'm lazy, I didn't build my own image with a Dockerfile — I did it the other way!

Since we need CUDA 9 plus some Python dependencies, and assuming CUDA 9 is already available in the official image, we can simply add a requirements.txt alongside the training script, following this tutorial. Simple, efficient and quick!! Here is the tutorial ~~~

Support.huaweicloud.com/modelarts_f…

Upload a custom image to SWR

Official website tutorial:

  • Support.huaweicloud.com/engineers-m…
  • Support.huaweicloud.com/usermanual-…

The image upload page states that a file must be no larger than 2 GB after decompression. However, the official base image is already 3.11 GB, and after adding the required pre-trained model the image exceeds 5 GB, so we cannot upload through the page and must use the client. Uploading an image starts with creating an organization.

If you still find the product documentation hard to follow, try the pull/push image walkthrough on the SWR page:

The first step is to log in to the repository:

The second step is to pull an image; here we replace it with our own custom image.

Step 3: change the organization to the organization name created in the product documentation. In this step, you rename the local image to the image name recognized on the cloud. See the illustration below for details:

Step 4: push the image.

Once you've mastered these four steps, you can set the tutorial aside and upload with the client: log in using the client and then push. Client login can be done with the generated temporary docker login command, found under "My Images" -> "Client Upload" -> "Generate temporary docker login command":

In the local Docker environment, after logging in with this generated temporary docker login command, use the following command to upload the image:

Use Huawei Cloud training jobs for training

Huawei Cloud ModelArts provides training jobs for users to train models. A training job offers preset images and also allows custom images. The preset images cover most of the mainstream frameworks, and when there is no special requirement it is convenient to train directly with them. This test again uses a custom image.

With a custom image, you need to configure your own environment inside the image, and if you change the way the training job starts, you also need to modify the training startup script. The startup script run_train.sh is stored in the /home/work/ directory of the official image pulled from the Huawei Cloud ModelArts website; modify your custom startup script based on it. The main thing to note is dls_get_app, the command for downloading data from OBS. Other parts can be adapted to your own training script.

To upload the training results or model to OBS, use the dls_upload_model command, the counterpart of dls_get_app. In our training, the upload portion of the script is as follows:

Training jobs can currently be debugged with one free hour of V100. A nice aspect of ModelArts training jobs is version management: all parameters passed to the training script through the run parameters are recorded in the version, and you can compare versions to compare parameters. Another convenience is that a new job can be created based on a particular version, which saves re-entering all parameters and makes debugging easier.

After training completes in the training job, the model can also be deployed online in ModelArts.
