This article is a comprehensive review of deep facial expression recognition (FER), the 30th paper introduced on AI Front. First, we present the standard pipeline of a deep FER system together with the relevant background knowledge. We then introduce the datasets that are currently widely used in the literature and the accepted criteria for data selection and evaluation on these datasets. For state-of-the-art deep FER techniques, we review existing deep neural network designs for FER based on static images and on dynamic image sequences, together with the related training strategies, and discuss their advantages and limitations. We then extend the review to other related issues and application scenarios. Finally, we analyze the remaining challenges and corresponding opportunities in this field, as well as future directions for designing robust deep FER systems.
Facial expressions are one of the most powerful, natural and universal signals humans use to express their emotional states and intentions. Automatic analysis of facial expressions is widely used in social robotics, medical care, driver fatigue monitoring and many other human-computer interaction systems.
FER systems can be divided into two categories according to feature representation: static-image FER and dynamic-sequence FER. In static-image-based approaches, the feature representation is encoded using only the spatial information of a single image, while dynamic approaches also consider the temporal relationship between successive frames of the input facial expression sequence.
Most traditional methods use hand-crafted features or shallow learning. Since 2013, however, expression recognition competitions such as FER2013 and Emotion Recognition in the Wild (EmotiW) have collected sufficient training data from challenging real-world scenes, pushing FER from the laboratory toward real-world conditions. As the amount of data grows, hand-crafted features are no longer sufficient to capture the diversity of factors unrelated to facial expression. With the significant improvement in processing power (GPUs) and the emergence of various powerful neural network architectures, many fields have turned to deep learning methods, which greatly improve recognition accuracy. Similarly, deep learning techniques are increasingly being used to address the challenges of facial expression recognition in real-world situations.
Figure 1 Facial expression recognition datasets and methods
Although deep learning has strong feature learning ability, its application to FER still faces some problems. First, deep neural networks need a large amount of training data to avoid overfitting, but existing facial expression databases are too small to train the deep networks that have achieved the most promising results in object recognition tasks. In addition, inter-subject variability is high due to personal attributes such as age, gender, ethnic background, and level of expressiveness. Pose, illumination, and occlusion are also common in unconstrained expression scenes. The relationship between these factors and expressions is nonlinear, so the deep network must be made robust to intra-class variation while learning effective expression features.
Figure 2 Deep facial expression recognition system
There are many variations in natural scenes that have nothing to do with facial expression, such as different backgrounds, illumination, and head poses. Therefore, before training a deep neural network, preprocessing is needed to calibrate and align the visual semantic information of the face.
Face alignment is a necessary preprocessing step in many face-related recognition tasks. Below we introduce some common methods and publicly available implementations used in deep expression recognition systems. (For a survey of face alignment, see: Automatic Analysis of Facial Actions: A Survey, https://ieeexplore.ieee.org/abstract/document/7990582/.)
Given the training data, the first step is to detect the face and remove the background and irrelevant regions. The Viola-Jones face detector is a classic and widely used face detection method, implemented in many toolkits (such as OpenCV and Matlab). After obtaining the face bounding box, the original image can be cropped to the face region. After face detection, facial landmark localization can be used to further improve FER performance. Using the landmark coordinates, the face can be warped to a uniform predefined template (commonly with an affine transformation), which reduces the variation caused by in-plane rotation and face deformation. At present, the most commonly used face alignment method is IntraFace, which has been applied in many deep FER works. It uses the Supervised Descent Method (SDM) for cascaded landmark localization and accurately predicts 49 landmarks.
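As a rough illustration of this preprocessing step (not the exact toolchain of any particular paper), a face can be detected and cropped with OpenCV's Haar-cascade implementation of the Viola-Jones detector; the output size and the choice of keeping the largest detection are illustrative assumptions:

```python
# Minimal face detection and cropping sketch using OpenCV's Viola-Jones
# (Haar cascade) detector; paths and the 224x224 output size are illustrative.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(image_path, size=(224, 224)):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                     # no face detected
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    face = img[y:y + h, x:x + w]                        # crop to the bounding box
    return cv2.resize(face, size)
```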
Deep neural networks need enough training data to ensure generalization on a given recognition task. However, no public FER database provides that much training data, so data augmentation becomes a very important step in a deep expression recognition system. Data augmentation techniques can be divided into two categories: offline augmentation and online (on-the-fly) augmentation.
Offline data augmentation for deep FER mainly expands the database through image processing operations. The most common operations are random perturbations and deformations such as rotation, horizontal flipping, and scaling. These operations generate more training samples and make the network more robust to shifted and rotated faces. Beyond basic image operations, CNNs or GANs can also be used to synthesize additional training data.
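A rough sketch of such offline augmentation, under assumed angle and scale values (not those of any specific paper), might look like this:

```python
# Offline augmentation sketch: each training image is flipped, rotated and
# rescaled, and the results are written out as extra samples.
# The angles and scale factors are illustrative choices.
import os
import cv2

def augment_offline(image_path, out_dir):
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    name = os.path.splitext(os.path.basename(image_path))[0]
    samples = {"flip": cv2.flip(img, 1)}                 # horizontal flip
    for angle in (-10, 10):                              # small rotations
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        samples[f"rot{angle}"] = cv2.warpAffine(img, M, (w, h))
    for scale in (0.9, 1.1):                             # mild rescaling
        samples[f"scale{scale}"] = cv2.resize(img, None, fx=scale, fy=scale)
    for tag, im in samples.items():
        cv2.imwrite(os.path.join(out_dir, f"{name}_{tag}.png"), im)
```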
Online augmentation is generally built into deep learning toolkits to reduce overfitting. During training, the input samples are randomly cropped and horizontally flipped on the fly, which can yield a training set roughly ten times larger than the original database.
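For example, with a toolkit such as torchvision, an on-the-fly augmentation pipeline along these lines could be written as follows; the 48-to-44 crop size is only an assumed FER2013-style setting:

```python
# Minimal online-augmentation pipeline: random cropping and horizontal
# flipping are applied on the fly to every training sample.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # FER images are often grayscale
    transforms.RandomCrop(44),                    # random 44x44 crop from 48x48 input
    transforms.RandomHorizontalFlip(),            # mirror with probability 0.5
    transforms.ToTensor(),
])
```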
FER performance can be impaired by illumination and head pose changes, so we introduce two kinds of face normalization to reduce these effects: illumination normalization and pose normalization.
Illumination normalization: the INface toolbox is the most commonly used illumination normalization tool. Studies show that histogram equalization combined with illumination normalization achieves better face recognition accuracy. Three widely used illumination normalization methods are isotropic diffusion-based normalization, DCT-based normalization, and difference of Gaussians (DoG) filtering.
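As a minimal sketch, histogram equalization and DoG filtering on a grayscale face could be combined as below; the Gaussian sigmas are illustrative values, not the INface defaults:

```python
# Simple illumination normalization sketch: histogram equalization followed
# by a Difference of Gaussians (DoG) filter, rescaled back to [0, 255].
import cv2

def normalize_illumination(gray):
    eq = cv2.equalizeHist(gray)                            # histogram equalization
    blur_small = cv2.GaussianBlur(eq, (0, 0), sigmaX=1.0)  # fine-scale smoothing
    blur_large = cv2.GaussianBlur(eq, (0, 0), sigmaX=2.0)  # coarse-scale smoothing
    dog = cv2.subtract(blur_small, blur_large)             # Difference of Gaussians
    return cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX)
```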
Pose normalization: some FER works normalize faces to a frontal view. One of the most commonly used methods was proposed by Hassner et al.: after localizing facial landmarks, a 3D texture reference model is built and the visible facial components are estimated; the input face is then back-projected onto the reference frame to generate an initial frontalized face. Recently, a series of GAN-based deep models have also been proposed for face frontalization (FF-GAN, TP-GAN, DR-GAN).
Deep learning uses multi-layer network structures to perform multiple nonlinear transformations and extract high-level abstract representations of images. Below we briefly introduce some deep learning methods used for FER.
CNNs are robust to changes in face position and scale, and handle unseen face pose variations better than multi-layer perceptrons.
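A minimal static-image FER CNN in PyTorch might look like the sketch below, assuming 48x48 grayscale input and 7 expression classes; the layer sizes are illustrative rather than taken from any model in Table 1:

```python
# Minimal CNN classifier sketch for static-image FER.
import torch.nn as nn

class SimpleFERCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),                                  # class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```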
Table 1 CNN models used for FER, with their settings and characteristics
Applications of other CNN-derived models in FER:
Region-based CNN (R-CNN) has been used in FER to learn features:
- Facial expression recognition in the wild based on multimodal texture features
- Combining multimodal features within a fusion network for emotion recognition in the wild

Faster R-CNN recognizes facial expressions by generating high-quality candidate regions:
- Facial expression recognition with Faster R-CNN
DBNs, proposed by Hinton et al., can learn deep hierarchical representations of the training data. DBN training has two steps: pre-training and fine-tuning. First, greedy layer-wise training initializes the deep network, which helps avoid poor local optima without requiring large amounts of labeled data. The network parameters are then fine-tuned with supervised gradient descent.
Unlike the networks described earlier, deep autoencoders (DAEs) are trained to reconstruct their input by minimizing the reconstruction error. DAEs come in many variants: denoising autoencoders, which recover the original uncorrupted data from partially corrupted inputs; sparse autoencoders, which enforce sparsity on the learned feature representation; contractive autoencoders, which add an activity-dependent regularization term to extract locally invariant features; and convolutional autoencoders, which replace the hidden layers of the DAE with convolutional layers.
RNNs are connectionist models that capture temporal information and are well suited to sequential data prediction. The Back-Propagation Through Time (BPTT) algorithm is used to train RNNs. The LSTM, proposed by Hochreiter and Schmidhuber, is a special form of RNN designed to address the vanishing and exploding gradient problems of traditional RNN training.
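As an illustration of how an RNN can be applied to FER sequences, the sketch below runs an LSTM over per-frame CNN features; the feature and hidden dimensions are assumptions:

```python
# Sketch of temporal modeling with an LSTM over per-frame CNN features.
import torch.nn as nn

class FrameSequenceLSTM(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):        # frame_feats: (batch, time, feat_dim)
        out, _ = self.lstm(frame_feats)    # hidden states for every time step
        return self.fc(out[:, -1])         # classify from the last time step
```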
After the deep features are learned, the final step of FER is to classify the test face into one of the basic expression categories. Deep neural networks can perform facial expression recognition end to end: a loss layer is added at the end of the network to regulate the back-propagated error, and the predicted probability of each class can then be output directly by the network. Another approach is to use the deep neural network purely as a feature extractor and then classify the extracted features with a traditional classifier, such as an SVM or a random forest.
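A hedged sketch of the second strategy is shown below, using randomly generated stand-ins for the deep features; in practice these would come from the penultimate layer of a trained CNN:

```python
# Deep features + traditional classifier sketch: a linear SVM trained on
# feature vectors extracted by a CNN. Random data stands in for real features.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 256))        # placeholder 256-D deep features
y_train = rng.integers(0, 7, size=100)       # placeholder expression labels (7 classes)
X_test = rng.normal(size=(20, 256))

clf = LinearSVC(C=1.0)                       # linear SVM on the extracted features
clf.fit(X_train, y_train)
pred = clf.predict(X_test)                   # predicted expression classes
```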
Table 2 Overview of publicly available facial expression databases
Elicit: P = posed, S = spontaneous
Condit (collection condition): Lab (collected in the laboratory), Web (collected from the web), Movie (movie screenshots)
According to the data type, current work can be divided into two categories: deep FER networks for static images and deep FER networks for dynamic image sequences.
Table 3 Algorithm evaluation of static image deep FER network
Training deep networks directly on relatively small facial expression datasets tends to lead to overfitting. To address this problem, many studies use additional task-related data to pre-train their self-built networks from scratch, or fine-tune well-known pre-trained models (AlexNet, VGG, VGG-Face, and GoogLeNet).
Auxiliary data can be drawn from large face recognition (FR) databases (CASIA-WebFace, Celebrity Face in the Wild (CFW), the FaceScrub dataset) or from relatively large FER databases (FER2013 and the Toronto Face Database). Knyazev et al. found that FR models trained on larger FR databases, even with weaker face recognition performance, achieved better expression recognition after fine-tuning on FER2013. Pre-training on a large FR database has a positive effect on recognition accuracy, and further fine-tuning with facial expression databases can improve it further.
Ng et al. proposed a multi-stage fine-tuning method: in the first stage, FER2013 is used to fine-tune the pre-trained model; in the second stage, training data from the target database is used to fine-tune the model again so that it better fits the target database.
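The sketch below illustrates this kind of two-stage fine-tuning in PyTorch; the backbone, learning rates, epoch counts, and data loaders are placeholders rather than the settings used by Ng et al.:

```python
# Two-stage fine-tuning sketch: pretrained backbone -> FER2013 -> target database.
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")   # pretrained backbone (placeholder choice)
model.classifier[6] = nn.Linear(4096, 7)        # replace final layer: 7 expression classes
criterion = nn.CrossEntropyLoss()

def fine_tune(model, loader, epochs, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            opt.step()

# Stage 1: adapt to expressions on the large FER2013 set (loader is a placeholder).
# fine_tune(model, fer2013_loader, epochs=10, lr=1e-3)
# Stage 2: adapt to the target database with a smaller learning rate.
# fine_tune(model, target_loader, epochs=5, lr=1e-4)
```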
Figure 3 Combinations of different fine-tuning methods. FER28 and FER32 are different parts of the FER2013 database, and EmotiW is the target database. The two-stage fine-tuning approach achieved the best results.
Ding et al. found that, because of the gap between the FR and FER domains, face-dominated information is still retained in the fine-tuned FR network, which weakens the network's ability to represent different expressions. They therefore proposed a new training algorithm, FaceNet2ExpNet, which uses the face-domain knowledge learned by the FR network to regularize the training of the target FER network. Training is divided into two stages:
Figure 4 (a) In the first stage, the deep face network is fixed and provides feature-level regularization: a distribution function is used to make the features of the expression network gradually approximate those of the face network. (b) In the second stage, to further improve the discriminative power of the learned features, randomly initialized fully connected layers are appended and the whole expression network is trained jointly with the expression class labels.
Since the fine-tuned face network already achieves competitive performance on facial expression datasets, it serves as a good initialization for the expression network. Moreover, because fully connected layers usually capture more domain-specific semantic features, the face network is used only to guide the learning of the convolutional layers, while the fully connected layers are trained from scratch with facial expression information.
Traditional methods usually use the RGB image of the whole face as network input to learn features, but raw pixels lack useful information such as texture and invariance to rotation, translation, and scaling. Some approaches address this by feeding hand-crafted features and their extensions to the network as input.
Figure 5 Image pixels (left) and LBP features (middle). Levi et al. proposed mapping these two types of information into a 3D metric space (right) as input to the CNN.
In addition to LBP features, SIFT features, AGE (angle + gradient + edge) features, and NCDV (neighborhood-center difference vector) features have been used to diversify the network input.
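For instance, an LBP map can be computed with scikit-image and rescaled so that it can be fed to a CNN like an ordinary image; the radius and number of sampling points below are illustrative:

```python
# LBP map sketch: compute a local binary pattern image as an alternative
# network input, then rescale it to the usual 8-bit image range.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_map(gray_image, n_points=8, radius=1):
    lbp = local_binary_pattern(gray_image, n_points, radius, method="uniform")
    return (255 * lbp / lbp.max()).astype(np.uint8)   # rescale to [0, 255]
```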
On top of the basic CNN structure, some studies propose adding auxiliary network blocks or layers to enhance the expression-related representation ability.
(a) Hu et al. embedded three types of supervised blocks into the CNN structure to realize shallow, intermediate, and deep supervision. These blocks are designed to capture the hierarchical characteristics of the original network. The class scores of each block are then accumulated in a fusion layer for a second level of supervision.
(b) Cai et al. proposed the island loss layer. The island loss computed at the feature extraction layer and the softmax loss computed at the decision layer are combined to supervise CNN training.
(c) Liu et al. proposed an (N+M)-tuple cluster loss layer. During training, identity-aware hard-sample mining and positive-sample mining are used to reduce the influence of identity-related variation within the same facial expression category.
Studies show that ensembles of multiple networks outperform a single network. Two factors should be considered for network ensembles:
(1) The networks should be sufficiently diverse to ensure complementarity.
(2) An appropriate ensemble method should be used to effectively combine the networks.
For the first factor, different training data and different network structures and parameters can be used to increase diversity.
For the second factor, networks can be combined at two levels: the feature level and the decision level. At the feature level, the most common method is to concatenate the features learned by different networks into a new feature vector that represents the image. At the decision level, the three most common methods are majority voting, simple averaging, and weighted averaging, as sketched after the examples below.
(a) Feature-level ensemble: Bargal et al. proposed concatenating three different features (the FC5 output of VGG13, the FC7 output of VGG16, and the pooling output of a ResNet) after normalization into a single feature vector (FV) that describes the input frame.
(b) Decision-level ensemble: Kim et al. proposed a three-level ensemble structure fused at the decision level to obtain sufficient decision diversity.
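A hedged sketch of the decision-level fusion rules mentioned above (simple and weighted averaging of class-probability vectors); the example probabilities and weights are made up for illustration:

```python
# Decision-level fusion sketch: simple and weighted averaging of the
# class-probability vectors produced by several networks.
import numpy as np

def average_fusion(prob_list):                        # list of (num_classes,) vectors
    return np.mean(prob_list, axis=0)

def weighted_fusion(prob_list, weights):
    weights = np.asarray(weights) / np.sum(weights)   # normalize the weights
    return np.sum([w * p for w, p in zip(weights, prob_list)], axis=0)

probs = [np.array([0.6, 0.3, 0.1]),                   # e.g. three networks, 3 classes
         np.array([0.5, 0.4, 0.1]),
         np.array([0.2, 0.7, 0.1])]
print(np.argmax(average_fusion(probs)))               # fused prediction (simple average)
print(np.argmax(weighted_fusion(probs, [0.5, 0.3, 0.2])))  # fused prediction (weighted)
```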
Many existing FER networks focus on a single task and learn expression-sensitive features without considering interactions with other latent factors. In the real world, however, FER is intertwined with factors such as head pose, illumination, and subject identity (facial morphology). To address this, multi-task learning is introduced to transfer knowledge from related tasks and to suppress nuisance factors.
Figure 8 Example FER multi-task network. In the MSCNN proposed by Zhang et al., a pair of images is fed into the network during training. The expression recognition task uses a cross-entropy loss to learn expression-variant features, while the face verification task uses a contrastive loss to reduce the variation among features of the same expression.
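A sketch of this kind of joint objective, combining a cross-entropy term for the expression task with a contrastive term over feature pairs; the margin and weighting are illustrative assumptions, not the values used in MSCNN:

```python
# Multi-task loss sketch: cross-entropy for expression classification plus a
# contrastive term that pulls together features of matching pairs.
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels, feat_a, feat_b, same_pair, margin=1.0, w=0.5):
    ce = F.cross_entropy(logits, labels)               # expression recognition term
    dist = F.pairwise_distance(feat_a, feat_b)         # distance between paired features
    contrastive = torch.where(same_pair,               # same_pair: boolean tensor
                              dist.pow(2),             # pull matching pairs together
                              F.relu(margin - dist).pow(2)).mean()  # push others apart
    return ce + w * contrastive
```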
In a cascaded network, modules dealing with different tasks are combined sequentially to form a deeper network in which the output of one module serves as the input of the next. Related studies propose combinations of different structures to learn hierarchical features, so that variations unrelated to expression can be filtered out layer by layer.
Figure 9 Example FER cascaded network. Liu et al. proposed the AU-aware deep network (AUDN), which consists of three sequential modules: in the first module, a two-layer CNN is trained to generate an over-complete representation encoding the appearance changes of all expressions at all locations; in the second module, an AU-aware receptive field layer searches for subsets of the over-complete representation; in the last module, a multi-layer RBM learns hierarchical features.
Because the frames in a given video clip vary in expression intensity, directly measuring the per-frame error on the target dataset does not produce satisfactory results. Many methods therefore aggregate the network outputs of the frames in each sequence, which substantially improves FER performance. We classify these methods into two categories: decision-level frame aggregation and feature-level frame aggregation.
Decision-level frame aggregation:
Figure 10 Decision-level frame aggregation proposed by Kahou et al. (a) For sequences with more than 10 frames, the frames are divided into 10 groups along the time axis and the probability vectors are averaged within each group. (b) For sequences with fewer than 10 frames, the sequence is expanded to 10 frames by uniformly repeating frames.
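One plausible reading of this scheme is sketched below: the per-frame probability vectors are averaged within each of the 10 temporal groups and the group averages are concatenated into a fixed-length descriptor (how the group averages are combined afterwards is an assumption here):

```python
# Decision-level frame aggregation sketch: reduce a variable-length sequence
# of per-frame probability vectors to a fixed-length descriptor.
import numpy as np

def aggregate_frames(frame_probs, n_groups=10):
    frame_probs = np.asarray(frame_probs)               # (num_frames, num_classes)
    if len(frame_probs) < n_groups:                     # expand short sequences
        reps = int(np.ceil(n_groups / len(frame_probs)))
        frame_probs = np.repeat(frame_probs, reps, axis=0)[:n_groups]
    groups = np.array_split(frame_probs, n_groups)      # 10 temporal groups
    return np.concatenate([g.mean(axis=0) for g in groups])  # fixed-length vector
```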
Feature-level frame aggregation: Liu et al. extracted per-frame image features of a given sequence and then applied three statistical models to them: a linear subspace of the feature vectors, the covariance matrix, and a multi-dimensional Gaussian distribution.
Most methods focus on recognizing peak, high-intensity expressions and ignore subtle low-intensity ones. In this section, we introduce several deep networks that take training samples of specific intensities as input so as to exploit the intrinsic relationship between expressions of the same subject at different intensities.
Figure 11 The peak-piloted deep network (PPDN) proposed by Zhao et al. for intensity-invariant expression recognition. PPDN takes as input a pair of peak and non-peak images of the same expression from the same subject and uses an L2-norm loss to minimize the distance between their features. Peak gradient suppression (PGS) is used in back-propagation so that the features of non-peak expressions are driven toward those of peak expressions, while the gradient of the peak expression in the L2-norm minimization is ignored to avoid the inverse effect.
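A hedged sketch of this idea: an L2 term pulls non-peak features toward peak features, and detaching the peak branch approximates peak gradient suppression; this is a simplification of the actual PPDN training procedure:

```python
# Peak-piloted loss sketch: drive non-peak features toward peak features
# while blocking gradients through the peak branch (approximating PGS).
import torch.nn.functional as F

def peak_guided_loss(non_peak_feat, peak_feat):
    # detach() stops gradients flowing into the peak branch, so only the
    # non-peak features are updated toward the peak representation
    return F.mse_loss(non_peak_feat, peak_feat.detach())
```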
Although the frame aggregation described above can integrate the learned frame-level features into a single feature vector representing the whole video sequence, the crucial temporal dependence is not exploited. In contrast, spatio-temporal FER networks take a sequence of frames within a time window as input, without prior knowledge of expression intensity, and use both textural information and the temporal dependence in the image sequence for more subtle expression recognition.
RNN and C3D:
Figure 12 The 3DCNN-DAP model proposed by Liu et al. The input n-frame sequence is convolved with 3D filters. The 13 × C × K part filters correspond to 13 manually defined facial regions; they are convolved with the K feature maps to generate detection maps of facial action parts for the C expression categories.
Facial landmark trajectory:
Figure 13 The spatio-temporal network proposed by Zhang et al. The temporal network PHRNN models the trajectories of facial landmarks, and the spatial network MSCNN learns identity-invariant features. The two networks are trained separately, and the predicted probabilities from the two networks are then fused for spatio-temporal FER.
Network ensemble:
Simonyan et al. proposed a two-stream CNN for video action recognition, in which one CNN extracts optical-flow information between video frames and the other extracts appearance information from still images; the outputs of the two networks are then fused. This structure has also inspired work in the FER field.
Figure 14 Jung et al. proposed a joint fine-tuning method to jointly train DTAN (belonging to the "RNN and C3D" category) and DTGN (belonging to the "facial landmark trajectory" category).
Table 4 Evaluation results of representative methods of dynamic deep expression recognition on common data sets.
S = Spatial network, T = Temporal network, LOSO = leave-one-subject-out
Domain related issues
- Occlusion and non-frontal head poses are two of the main challenges for FER; they can alter the visual appearance of the original expression, especially in real-world situations.
- Although RGB data is the current standard for deep FER, it is easily affected by lighting conditions and lacks depth information for different parts of the face.
- Facial expression synthesis in real scenes can generate different facial expressions through an interactive interface.
- Beyond using CNNs for FER, some studies use visualization techniques to qualitatively analyze how the CNN contributes to the appearance-based learning process of FER and to determine which parts of the face provide the most discriminative information.
- Beyond the original expression classification problem, new problems have been proposed: dominant and complementary emotion recognition challenges, and real versus fake expressed emotion challenges.
As FER research shifts its primary focus to challenging real-world conditions, many researchers are using deep learning to address difficulties such as illumination variation, occlusion, non-frontal head poses, identity bias, and low-intensity expression recognition. Given that FER is a data-driven task and that training a sufficiently deep network requires a large amount of data, the main challenge facing deep FER systems is the lack of training data, in terms of both quality and quantity.
Because people of different ages, cultures, and genders display facial expressions in different ways, an ideal facial expression dataset should include abundant sample images with accurate labels not only for expression but also for other attributes such as age, gender, and ethnicity, which would facilitate deep FER research across ages, genders, and cultures. On the other hand, accurately annotating a large number of images from complex natural scenes is an obvious obstacle to building expression databases. A reasonable approach is reliable crowdsourcing under expert guidance, or fully automated annotation tools that provide roughly accurate labels which are then refined by experts.
Another major issue is that, although facial expression recognition has been extensively studied, the expressions defined so far cover only a small set of specific categories and cannot represent everything humans express in real-world interactions. Two models are available to describe a wider range of emotions: the FACS model, which combines different facial muscle action units to describe visible changes in facial expressions; and the dimensional model, which uses two continuous-valued variables, valence and arousal, to encode fine-grained changes in emotion.
In addition, the bias between different databases and the imbalanced distribution of expression categories are two other problems to be solved in deep FER. The dataset bias problem can be addressed by deep domain adaptation and knowledge distillation. One solution to the imbalanced-category problem is to use data augmentation and synthesis to balance the class distribution in the preprocessing stage; another option is to add a cost-sensitive loss layer to the deep network during training.
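For the cost-sensitive option, a minimal sketch is to weight the cross-entropy loss inversely to class frequency; the class counts below are made-up numbers:

```python
# Cost-sensitive loss sketch: class weights inversely proportional to class
# frequency make rare expression categories contribute more to training.
import torch
import torch.nn as nn

class_counts = torch.tensor([7000., 500., 4000., 9000., 6000., 4000., 3000.])
weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse-frequency weights
criterion = nn.CrossEntropyLoss(weight=weights)                    # cost-sensitive loss layer
```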
Finally, human expressive behavior in real life is encoded through multiple channels, of which facial expression is only one. Although expression recognition based on visible face images can achieve satisfactory results, future work should combine it with other modalities in a high-level framework to provide complementary information and further enhance robustness. For example, participants in the EmotiW challenge and the Audio-Video Emotion Challenge (AVEC) consider the audio modality the second most important element and employ a variety of fusion techniques for multimodal emotion recognition.
Link to the original paper:
https://arxiv.org/pdf/1804.08348.pdf