Researchers from the University of Hertfordshire and GBG Plc have published a comprehensive review of face recognition methods, ranging from traditional approaches to the current wave of deep learning. This article presents the deep learning methods; for the traditional face recognition methods, please refer to the original paper.

Adapted by Heart of the Machine from the arXiv paper by Daniel Saez Trigueros et al.

Paper address: https://arxiv.org/abs/1811.00116

Face recognition has been one of the most studied topics in computer vision and biometrics since the 1970s. Traditional approaches based on hand-crafted features and classical machine learning techniques have recently been superseded by deep neural networks trained on very large datasets. In this paper, the authors conduct a comprehensive and up-to-date literature review of popular face recognition methods, covering both traditional methods (geometry-based, holistic, feature-based and hybrid methods) and deep learning methods.

Introduction

Face recognition refers to the ability to identify or verify the identity of a subject in an image or video. The first face recognition algorithms were developed in the early 1970s [1,2]. Since then, their accuracy has improved dramatically, and face recognition is now often preferred over biometric modalities traditionally considered more robust, such as fingerprint or iris recognition [3]. One key difference that makes face recognition more attractive than other biometrics is that it is essentially non-intrusive. For example, fingerprint recognition requires the user to press a finger against a sensor, iris recognition requires the user to get very close to a camera, and speaker recognition requires the user to speak aloud. Modern face recognition systems, by contrast, only require the user to be within the camera’s field of view (assuming they are at a reasonable distance from it). This makes face recognition the most user-friendly biometric modality. It also gives face recognition a wider range of potential applications, since it can be deployed in environments where users are not expected to cooperate with the system, such as surveillance. Other common applications include access control, fraud detection, identity authentication and social media.

When deployed in an unconstrained environment, face recognition is also one of the most challenging biometrics, due to the high variability of face images in the real world (such images are often referred to as faces in-the-wild). Face images vary in head pose, age, occlusion, illumination conditions and facial expression. Figure 1 shows examples of these variations.



Figure 1: Typical variations found in faces in-the-wild. (a) Head pose, (b) age, (c) illumination, (d) facial expression, (e) occlusion.

Face recognition technology has changed significantly over the years. Traditional methods rely on a combination of artificially designed features (such as edge and texture descriptors) with machine learning techniques (such as principal component analysis, linear discriminant analysis, or support vector machines). It is difficult to artificially design robust features for different changes in an unconstrained environment, which makes past researchers focus on special methods for each change type, such as methods that can cope with different ages [4,5], methods that can cope with different postures [6], methods that can cope with different lighting conditions [7,8], etc. Recently, traditional face recognition methods have been replaced by deep learning methods based on convolutional neural network (CNN). The main advantage of deep learning methods is that they can be trained on very large data sets to learn the best features that characterize those data. The large number of natural face images available on the Internet has enabled researchers to collect large-scale face datasets [9-15], which contain various variations in the real world. The CNN-based face recognition methods trained with these datasets have achieved very high accuracy because they can learn robust features in face images and thus cope with real-world changes presented by face images used during training. In addition, the increasing popularity of deep learning methods in computer vision is also accelerating the development of face recognition research, because CNN is also being used to solve many other computer vision tasks, such as target detection and recognition, segmentation, optical character recognition, facial expression analysis, age estimation, etc.

Face recognition systems are usually composed of the following building blocks:

  • Face detection. A face detector finds the position of faces in an image and, if any are present, returns the coordinates of a bounding box for each one. See Figure 3a.

  • Face alignment. The goal of face alignment is to scale and crop face images using a set of reference points located at fixed positions in the image. This process usually requires a landmark detector to find a set of facial feature points; in the simple 2D alignment case, this amounts to finding the best affine transformation that maps the detected points onto the reference points (see the sketch after this list). Figures 3b and 3c show two face images aligned using the same set of reference points. More sophisticated 3D alignment algorithms (e.g. [16]) can also perform face frontalization, i.e. warping the face so that it faces forward.

  • Face representation. In the face representation stage, the pixel values of a face image are transformed into a compact and discriminative feature vector, also known as a template. Ideally, all faces of the same subject should map to similar feature vectors.

  • Face matching. In the face matching building block, two templates are compared to produce a similarity score that indicates how likely they are to belong to the same subject.
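To make the alignment and matching steps concrete, the sketch below warps a face onto a fixed set of reference points and compares two templates with cosine similarity. It is a minimal illustration only: the five reference coordinates, the 112×112 crop size and the use of OpenCV's estimateAffinePartial2D are assumptions made for the example, not details taken from the paper.

    import numpy as np
    import cv2

    # Illustrative reference positions (in a 112x112 crop) for five landmarks:
    # left eye, right eye, nose tip, left mouth corner, right mouth corner.
    REFERENCE_POINTS = np.float32([
        [38.3, 51.7], [73.5, 51.5],
        [56.0, 71.7],
        [41.5, 92.4], [70.7, 92.2],
    ])

    def align_face(image, landmarks, size=(112, 112)):
        """Warp the image so the detected landmarks map onto the reference points."""
        # Least-squares estimate of the best-fitting similarity transform (2D case).
        matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), REFERENCE_POINTS)
        return cv2.warpAffine(image, matrix, size)

    def match_templates(template_a, template_b):
        """Cosine similarity between two templates; higher means more likely the same subject."""
        a = template_a / np.linalg.norm(template_a)
        b = template_b / np.linalg.norm(template_b)
        return float(np.dot(a, b))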



Figure 2: Building blocks for face recognition.

Face representation is widely considered the most important component of a face recognition system, and it is also the focus of the next section.

Figure 3: (a) Bounding box found by a face detector. (b) and (c): aligned faces and reference points.

Deep learning methods

Convolutional neural networks (CNNs) are the most widely used deep learning method in face recognition. The main advantage of deep learning is that models can be trained with large amounts of data to learn face representations that are robust to the variations present in the training data. Rather than designing specific features that are robust to different types of intra-class variation (such as illumination, pose, facial expression, age, etc.), these features are learned from the training data. The main drawback of deep learning methods is that they require very large datasets for training, and these datasets must contain enough variation to generalize to previously unseen samples. Fortunately, several large-scale face datasets containing in-the-wild face images have been released publicly [9-15] and can be used to train CNN models. Besides learning discriminative features, neural networks can also perform dimensionality reduction and can be trained as classifiers or with metric learning methods. CNNs are considered end-to-end trainable systems that do not need to be combined with any other specific method.

CNN models for face recognition can be trained with different strategies. One is to treat the problem as a classification problem, where each subject in the training set corresponds to a class. After training, the model can be used to recognize subjects that are not present in the training set by discarding the classification layer and using the features of the previous layer as the face representation [99]. In the deep learning literature, these features are commonly referred to as bottleneck features. After this first training stage, the model can be further trained with other techniques to optimize the bottleneck features for the target application (for example, using joint Bayesian [9] or fine-tuning the CNN with a different loss function [10]). Another common way to learn face representations is to learn the bottleneck features directly by optimizing a distance metric between pairs of faces [100,101] or triplets of faces [102].
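A minimal sketch of the first of these strategies (classification training followed by bottleneck-feature extraction) is given below, assuming PyTorch; the convolutional backbone, the embedding size and the helper names are illustrative placeholders rather than any architecture from the literature. The network is trained as an N-way classifier with a softmax (cross-entropy) loss, and at test time the classification layer is dropped so the penultimate layer serves as the bottleneck representation.

    import torch
    import torch.nn as nn

    class FaceNetwork(nn.Module):
        def __init__(self, num_subjects, embedding_dim=512):
            super().__init__()
            # Placeholder convolutional backbone (a real system would use a much deeper CNN).
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.bottleneck = nn.Linear(128, embedding_dim)            # face representation
            self.classifier = nn.Linear(embedding_dim, num_subjects)   # used only for training

        def forward(self, x, return_embedding=False):
            features = self.bottleneck(self.backbone(x))
            if return_embedding:
                return features                 # template for unseen subjects
            return self.classifier(features)    # logits for the softmax loss

    # Training: loss = nn.CrossEntropyLoss()(model(images), subject_labels)
    # Testing:  templates = model(images, return_embedding=True)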

The idea of using neural networks for face recognition is not new. An early method based on probabilistic decision-based neural networks (PDBNN) was proposed in 1997 for face detection, eye localization and face recognition [103]. The face recognition PDBNN was divided into one fully connected sub-network per training subject in order to reduce the number of hidden units and avoid overfitting. Two PDBNNs were trained separately using intensity and edge features, and their outputs were combined to obtain the final classification decision. Another early approach [104] used a combination of self-organizing maps (SOM) and convolutional neural networks. A self-organizing map [105] is a type of neural network trained in an unsupervised way that maps input data to a lower-dimensional space while preserving the topological properties of the input space (i.e. inputs that are close in the original space are also close in the output space). Note that neither of these early methods was trained end-to-end (edge features were used in [103] and a SOM in [104]), and the proposed neural network architectures were shallow. An end-to-end CNN-based face recognition method was proposed in [100]. This approach uses a siamese architecture trained with a contrastive loss function [106]. The contrastive loss implements a metric learning procedure whose goal is to minimize the distance between pairs of feature vectors corresponding to the same subject and to maximize the distance between pairs corresponding to different subjects (see the sketch below). The CNN architecture used in this method was also shallow and the training dataset was small.
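The contrastive loss used by this siamese approach can be sketched as follows; this is a hedged illustration in PyTorch rather than the exact formulation of [106], and the margin value is an arbitrary placeholder.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(emb_a, emb_b, same_subject, margin=1.0):
        """Contrastive loss over a batch of embedding pairs.

        same_subject is 1.0 for genuine pairs and 0.0 for impostor pairs. Genuine
        pairs are pulled together; impostor pairs are pushed at least `margin` apart.
        """
        distance = F.pairwise_distance(emb_a, emb_b)
        loss = (same_subject * distance.pow(2)
                + (1.0 - same_subject) * F.relu(margin - distance).pow(2))
        return loss.mean()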

None of these methods achieved groundbreaking results, mainly because of the low capacity of the networks used and the relatively small datasets available for training at the time. It was not until these models were scaled up and trained with large amounts of data [107] that the first deep learning methods for face recognition [99,9] achieved state-of-the-art performance. Of particular note is Facebook’s DeepFace [99], one of the earliest CNN-based methods for face recognition, which used a high-capacity model to achieve 97.35% accuracy on the LFW benchmark, reducing the error rate of the previous state of the art by 27%. The researchers trained a CNN with a softmax loss on a dataset containing 4.4 million faces from 4,030 subjects. The paper made two novel contributions: (1) an effective face alignment system based on explicit 3D face modelling, and (2) a CNN architecture containing locally connected layers [108,109], which, unlike regular convolutional layers, can learn different features from each region of the image. At the same time, the DeepID system [9] achieved similar results by training 60 different CNNs on patches spanning ten regions, three scales, and RGB or grayscale channels. During testing, 160 bottleneck features are extracted from each patch and from its horizontally flipped version, forming a 19,200-dimensional feature vector (160×2×60). Similar to [99], the proposed CNN architecture also uses locally connected layers. The verification results were obtained by training a joint Bayesian classifier [48] on the 19,200-dimensional feature vectors extracted by the CNNs. The system was trained on a dataset containing 202,599 face images of 10,177 celebrities [9].

For CNN-based face recognition methods, three main factors affect accuracy: training data, CNN architecture and loss function. As in most deep learning applications, large training sets are needed to prevent overfitting. In general, the accuracy of a CNN trained for a classification task increases with the number of samples per class, because the model can learn more robust features when exposed to more intra-class variation. For face recognition, however, we are interested in extracting features that generalize to subjects not present in the training set. Hence, datasets for face recognition also need to contain a large number of subjects so that the model can learn more inter-class variation. The effect of the number of subjects in a dataset on face recognition accuracy was studied in [110]. In this work, a large dataset was first sorted in decreasing order by the number of images per subject. Then, a CNN was trained with different subsets of the data, gradually increasing the number of subjects. The best accuracy was obtained when training with the 10,000 subjects that had the most images; adding further subjects decreased accuracy, since very few images were available for each of the extra subjects. Another study [111] investigated whether wider datasets are better than deeper datasets (a dataset is considered wider if it contains more subjects, and deeper if it contains more images per subject). The study concluded that, for an equal number of images, wider datasets lead to better accuracy. The researchers argue that this is because wider datasets contain more inter-class variation and therefore generalize better to unseen subjects. Table 1 shows some of the public datasets most commonly used to train face recognition CNNs.



Table 1: Public large face datasets.

CNN architectures for face recognition draw heavily on the architectures that performed well in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). For example, a 16-layer version of the VGG network [112] is used in [11], and a similar but smaller network is used in [10]. Two different types of CNN architectures are explored in [102]: VGG-style networks [112] and GoogLeNet-style networks [113]. Even though both achieve comparable accuracy, the GoogLeNet-style network has 20 times fewer parameters. More recently, residual networks (ResNets) [114] have become the preferred choice for many object recognition tasks, including face recognition [115-121]. The main innovation of ResNet is the introduction of a building block that uses shortcut connections to learn a residual mapping, as shown in Figure 7. Shortcut connections allow much deeper architectures to be trained because they ease the flow of information across layers. A comprehensive study of different CNN architectures is conducted in [121]; the best trade-off between accuracy, speed and model size was obtained with a 100-layer ResNet using a residual module similar to the one proposed in [122].
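The shortcut-connection idea can be illustrated with a simplified residual block, sketched below in PyTorch; the layer sizes and the identity shortcut are simplifications relative to the module shown in Figure 7.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Simplified residual block: the shortcut adds the block's input to its
        output, so the stacked layers only have to learn a residual mapping."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(x + out)   # shortcut connection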



Figure 7: The original residual module proposed in [114].

The choice of loss function for training CNN-based methods has recently become the most active research area in face recognition. Even though CNNs trained with the softmax loss have been very successful [99,9,10,123], some researchers argue that this loss does not generalize well to subjects that are not present in the training set. This is because the softmax loss encourages features that increase inter-class differences (in order to separate the classes in the training set) but does not necessarily reduce intra-class differences. Several ways of mitigating this problem have been proposed. A simple approach is to optimize the bottleneck features with a discriminative subspace method such as joint Bayesian [48], as done in [9,124,125,126,10,127]. Another approach is to use metric learning. For example, a pairwise contrastive loss is used as the only supervisory signal in [100,101] and is combined with a classification loss in [124-126]. The most widely used metric learning method in face recognition is the triplet loss [128], first applied to face recognition in [102]. The goal of the triplet loss is to separate the distances of positive and negative pairs by a margin. Mathematically, the following condition must be satisfied for each triplet [102]:

||f(x_A) - f(x_P)||² + α < ||f(x_A) - f(x_N)||²

where x_A is the anchor image, x_P is an image of the same subject, x_N is an image of a different subject, f is the mapping learned by the model, and α is the margin imposed between the distances of positive and negative pairs. In practice, CNNs trained with the triplet loss converge more slowly than those trained with the softmax loss, because a large number of triplets (or pairs, for the contrastive loss) is needed to cover the whole training set. Although this can be alleviated by selecting hard triplets (i.e. triplets that violate the margin condition) during training [102], it is common practice to train with the softmax loss in a first stage and then fine-tune the bottleneck features with the triplet loss in a second stage [11,129,130]. Several variations of the triplet loss have been proposed. For example, the dot product is used as the similarity measure instead of the Euclidean distance in [129]; a probabilistic triplet loss is proposed in [130]; and a modified triplet loss that also minimizes the standard deviations of the positive and negative score distributions is proposed in [131,132]. Another loss function used to learn discriminative features is the center loss proposed in [133]. The goal of the center loss is to minimize the distances between the bottleneck features and the centers of their corresponding classes. Training jointly with the softmax loss and the center loss has been shown to yield features that increase inter-class differences (softmax loss) while reducing intra-class differences (center loss). The center loss has the advantage of being more efficient and easier to implement than the contrastive and triplet losses, since it does not require pairs or triplets to be formed during training. A related metric learning method is the range loss proposed in [134] to improve training with unbalanced datasets. The range loss has two components: the intra-class component minimizes the k largest distances between samples of the same class, while the inter-class component maximizes the distance between the two closest class centers in each training batch. By using these extreme cases, the range loss exploits the same amount of information from every class, regardless of how many samples are available per class. Like the center loss, the range loss needs to be combined with the softmax loss to prevent the loss from degrading to zero [133].
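For concreteness, the triplet condition above and the idea of the center loss can be written as short loss functions. The sketch below assumes PyTorch, uses an illustrative margin value, and learns the class centres directly by gradient descent rather than with the update rule of [133].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, alpha=0.2):
        """Hinge form of the triplet condition; alpha is an illustrative margin."""
        pos_dist = (anchor - positive).pow(2).sum(dim=1)   # ||f(x_A) - f(x_P)||^2
        neg_dist = (anchor - negative).pow(2).sum(dim=1)   # ||f(x_A) - f(x_N)||^2
        return F.relu(pos_dist - neg_dist + alpha).mean()

    class CenterLoss(nn.Module):
        """Simplified center loss: pull each bottleneck feature towards a learnable
        centre of its class. In practice it is weighted and combined with a softmax loss."""
        def __init__(self, num_classes, feature_dim):
            super().__init__()
            self.centers = nn.Parameter(torch.zeros(num_classes, feature_dim))

        def forward(self, features, labels):
            return 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).mean()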

When combining different loss functions, a difficulty arises in finding the right balance between the terms. Recently, several modifications of the softmax loss have been proposed so that it can learn discriminative features without being combined with other losses. One approach shown to increase the discriminative power of the bottleneck features is feature normalization [115,118]. For example, [115] proposes normalizing the features to have unit L2 norm, and [118] proposes normalizing them to have zero mean and unit variance. Another successful line of work introduces a margin in the decision boundary between classes in the softmax loss [135]. For simplicity, consider binary classification with the softmax loss. In this case, the decision boundary between the two classes (if the biases are zero) is given by:

||x|| (||W_1|| cos θ_1 - ||W_2|| cos θ_2) = 0

where x is the feature vector, W_1 and W_2 are the weight vectors corresponding to each class, and θ_1 and θ_2 are the angles between x and W_1 and W_2 respectively. By introducing a multiplicative margin m into this equation, the two decision boundaries become more stringent:

||x|| (||W_1|| cos(m θ_1) - ||W_2|| cos θ_2) = 0   (class 1)
||x|| (||W_1|| cos θ_1 - ||W_2|| cos(m θ_2)) = 0   (class 2)

As Figure 8 shows, this margin effectively increases inter-class separation and intra-class compactness. Several approaches have been proposed that differ in how the margin is incorporated into the loss [116,119-121]. For example, in [116] the weight vectors are normalized to unit norm so that the decision boundary depends only on the angles θ_1 and θ_2. In [119,120], an additive cosine margin is proposed; compared with the multiplicative margin [135,116], the additive margin is easier to implement and optimize. In this work, in addition to normalizing the weight vectors, the feature vectors are also normalized and scaled as in [115]. An alternative additive margin is proposed in [121], which keeps the advantages of [119,120] while having a better geometric interpretation, since the margin is added to the angle rather than to the cosine. Table 2 summarizes the decision boundaries for the different variants of the softmax loss with a margin. These approaches currently represent the state of the art in face recognition.
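As an illustration of the last variant, where the margin is added to the angle rather than to the cosine, the sketch below shows one possible margin-based softmax in PyTorch; the scale and margin values and the exact formulation are assumptions for the example, not the implementation of any of the cited papers.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AngularMarginSoftmax(nn.Module):
        """Softmax classifier with an additive angular margin. Weights and features
        are L2-normalised, so the logits depend only on the angles between them."""
        def __init__(self, feature_dim, num_classes, s=64.0, m=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_classes, feature_dim))
            self.s, self.m = s, m   # illustrative scale and margin values

        def forward(self, features, labels):
            cosine = F.linear(F.normalize(features), F.normalize(self.weight))  # cos(theta)
            theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
            # Add the margin m to the angle of the target class only.
            one_hot = F.one_hot(labels, cosine.size(1)).float()
            logits = torch.cos(theta + self.m * one_hot)
            return F.cross_entropy(self.s * logits, labels)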

Figure 8: Effect of introducing a margin m into the decision boundary between two classes. (a) Softmax loss and (b) softmax loss with a margin.

Table 2: Decision boundaries for different variants of the softmax loss with a margin. Note that the decision boundaries shown are for class 1 in the binary classification case.