Abstract: This article introduces the current state of and challenges in speech emotion recognition, with an emphasis on the shortage of annotated data.
This article is shared by SSIL_SZT_ZS from the Huawei Cloud community post "Application and Challenges of Speech Emotion Recognition".
Emotion plays an important role in interpersonal communication, and emotion recognition has great application value. Successfully detecting a person's emotional state matters for social robots, medical treatment, education quality assessment and other human-computer interaction systems. The main points of this article are:
1. Basic concepts and application scenarios of emotion recognition.
2. An introduction to speech emotion recognition technology and its challenges.
3. How to address the lack of annotated data, and our solution.
1. What is emotion recognition?
Emotion is a person's attitude towards external events or dialogue. Human emotions are generally divided into happiness, anger, sadness, fear and surprise. Emotion recognition is the process by which a machine analyzes collected signals to infer a person's emotional state. Generally, two kinds of signals can be used for emotion recognition: physiological signals such as respiration, heart rate and body temperature, and behavioral signals including facial expressions, voice and posture. Faces and voice are most often used because they are easy to acquire. Emotion recognition helps a system understand the subject's emotional state and attitude toward a topic or event.
In the interaction between artificial intelligence (AI) products and humans, the user experience can be greatly improved if the product accurately grasps the user's current emotional state and responds accordingly. This matters in product recommendation, public opinion monitoring, human-machine dialogue and other areas. For example, in sales, knowing users' satisfaction with products helps a platform formulate better sales strategies. In the film and television industry, understanding the audience's feelings about a show helps to craft more engaging plots and to schedule specific shows. In human-machine dialogue, grasping the user's emotional state helps an intelligent robot respond appropriately, express comfort and understanding in time, and improve the user experience. In terms of public opinion, administrative departments can monitor it more promptly and effectively by understanding people's emotional tendencies toward popular events, providing support for policy making. Emotion recognition can be used in many other real-world situations as well, so emotion recognition algorithms have high research value.
Considering the difficulty of data collection, privacy and other factors, this article focuses on recognizing the speaker's emotion from speech, i.e. the speech emotion recognition (SER) task.
2. Introduction of speech emotion recognition technology
Speech is the main medium of communication in daily life. It not only conveys ideas, but also expresses the speaker's emotional state. The goal of speech emotion recognition is to recognize the human emotional state from speech. It mainly consists of two steps: feature extraction and classifier construction.
The input audio signal is approximately continuous in value. To extract audio features, the audio is usually divided into frames, windowed, and transformed with the short-time Fourier transform (STFT). This yields spectral features of dimension T × D, where T is the number of frames (related to the audio length) and D is the feature dimension, with each dimension corresponding to a different frequency. Some works also apply Mel filtering to this spectrum.
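As a concrete illustration of this pipeline, the sketch below computes a T × D (log-Mel) spectrogram via framing, windowing and the STFT. It assumes the librosa library is available; the frame length, hop length and number of Mel bands are illustrative defaults, not values prescribed by this article.

```python
# Minimal sketch of the T x D spectral feature extraction described above,
# assuming librosa; frame/hop lengths and the Mel-band count are illustrative.
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    y, _ = librosa.load(wav_path, sr=sr)                       # mono waveform
    # Framing + windowing + STFT -> complex spectrogram (freq bins x frames)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann")
    power = np.abs(stft) ** 2
    # Optional Mel filtering, as mentioned in the text
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel.T                                           # shape (T, D)
```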
Spectral features contain a wealth of information, such as speech content, rhythm, tone, intonation and so on. Extracting speech features related to emotion is still an immature research direction. The emergence of deep learning simplifies manual feature design: using emotion labels as the supervisory signal, a deep model is trained in a data-driven way to extract implicit semantic features related to emotion. Because audio is sequential, deep feature extractors are usually based on CNN/GRU/LSTM structures, CRNNs, or CNN+Attention.
Traditional machine learning methods can build classifiers on top of hand-crafted or deep speech features, for example Gaussian mixture models (GMM), hidden Markov models (HMM), support vector machines (SVM) and other classical methods. In addition, thanks to the development of deep learning, a neural-network classifier can be trained end-to-end together with the deep feature extractor to obtain an emotion classifier.
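As a minimal sketch of such an end-to-end model, the code below combines a small CNN with a GRU (a CRNN) and a linear classification head, assuming PyTorch; the layer sizes and the four emotion classes are illustrative choices, not a configuration from this article.

```python
# Minimal sketch of an end-to-end CRNN emotion classifier (CNN + GRU),
# assuming PyTorch; layer sizes and the number of classes are illustrative.
import torch
import torch.nn as nn

class CRNNEmotionClassifier(nn.Module):
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(                       # local time-frequency patterns
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.gru = nn.GRU(32 * (n_mels // 2), 128, batch_first=True)  # temporal modeling
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                               # x: (batch, T, n_mels)
        h = self.cnn(x.unsqueeze(1))                    # (batch, 32, T/2, n_mels/2)
        h = h.permute(0, 2, 1, 3).flatten(2)            # (batch, T/2, 32 * n_mels/2)
        _, last = self.gru(h)                           # last hidden state
        return self.fc(last.squeeze(0))                 # emotion logits

logits = CRNNEmotionClassifier()(torch.randn(8, 100, 64))   # e.g. 8 clips, 100 frames
```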
3. Challenges of speech emotion recognition
We have introduced the common methods used in speech emotion analysis, but speech emotion recognition also faces some challenges in practice:
- Emotional subjectivity and ambiguity: Speech emotion recognition is a relatively young field, and there is no official standard for defining emotions. Different listeners may judge the emotion of the same passage differently. In addition, an utterance often contains emotional changes and the labels are highly subjective, so much research work lacks generality.
- Extraction and selection of emotional features: Speakers, emotion categories and utterance lengths vary widely, so hand-designed features cannot cover all emotional information. Deep features, on the other hand, are effective but not interpretable.
- Lack of annotated data: Deep learning methods require a large amount of high-quality annotated data to achieve good performance. Because of the subjectivity and ambiguity of emotion, labeling speech emotion is time-consuming and requires many trained annotators. Collecting a large amount of emotion-labeled data is therefore an urgent problem in the field of speech emotion recognition.
4. How to solve the problem of data shortage?
Data is the driving force of deep learning, and large-scale, high-quality data is the key to its success. However, in many practical problems, only a small amount of labeled data is available because of annotation cost, which severely restricts deep learning methods. With the development of Internet social platforms, a large amount of multimedia data is produced every day, so large-scale unlabeled data is easy to obtain. This has driven the development of semi-supervised learning methods that use labeled and unlabeled data simultaneously. On the other hand, multimedia data typically contains multiple modalities, so some work explores using annotation knowledge in one modality to enhance a task in another modality. These two approaches are described below.
4.1 Semi-supervised learning
Semi-supervised learning generally involves two data sets: a small labeled data set and a large unlabeled data set. The goal is to use the unlabeled data to enhance supervised learning. Classical semi-supervised learning methods fall into many categories, such as self-training, generative models, semi-supervised support vector machines (S3VM), graph-based methods, multi-view learning and so on. The main classes of semi-supervised learning methods are described below.
- The steps of the self-training algorithm are as follows: (1) train a classifier on the labeled training set; (2) use the classifier to classify the unlabeled data and estimate the prediction error; (3) select the samples whose classification error is small and add them, with their predicted labels, to the training set. Repeat the training process until all unlabeled data is labeled.
- Multi-view learning is a variant of self-training. It assumes that each data point can be classified from different views. The steps are: (1) train a separate classifier on the labeled data for each view; (2) use these classifiers to classify the unlabeled data from the different views; (3) based on the multiple classification results, select the credible unlabeled samples and add them to the training set, then repeat the training process. The advantage of this method is that predictions from different views complement each other, improving classification accuracy.
- Label propagation algorithm: a graph-based semi-supervised algorithm. It constructs a graph that captures the relationships between unlabeled and labeled data, and then propagates labels along these relationships.
Semi-supervised learning with deep models is called semi-supervised deep learning. It mainly includes three categories: fine-tuning, self-training based on deep models, and semi-supervised training of neural networks.
Fine-tuning: first train the network with unlabeled data (by reconstruction with an autoencoder, or training on pseudo-labels), and then fine-tune it on the labeled data of the target task.
The basic steps of deep-learning-based self-training are: (1) train a deep model with the labeled data; (2) use the deep model as a classifier, or use its deep features, to classify the unlabeled data; (3) add the samples labeled with high confidence to the training set and repeat the process.
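The sketch below illustrates this self-training loop on toy data with a scikit-learn classifier; the logistic-regression model, the 0.9 confidence threshold and the random data are purely illustrative, not part of the article's setup.

```python
# Minimal sketch of the self-training loop described above, on toy data;
# model choice, threshold and data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 8))                     # small labeled set
y_lab = rng.integers(0, 4, size=40)                  # 4 emotion classes (toy labels)
X_unl = rng.normal(size=(400, 8))                    # large unlabeled set

clf = LogisticRegression(max_iter=1000)
for _ in range(5):                                   # repeat the self-training loop
    clf.fit(X_lab, y_lab)                            # (1) train on labeled data
    if len(X_unl) == 0:
        break
    preds = clf.predict(X_unl)                       # (2) classify unlabeled data
    conf = clf.predict_proba(X_unl).max(axis=1)
    keep = conf >= 0.9                               # (3) keep confident predictions
    X_lab = np.concatenate([X_lab, X_unl[keep]])
    y_lab = np.concatenate([y_lab, preds[keep]])
    X_unl = X_unl[~keep]                             # remove newly pseudo-labeled data
```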
Semi-supervised approaches to training deep networks include many techniques, such as Pseudo-Label [1], Ladder Networks [2], Temporal Ensembling [3], Mean Teacher [4] and FixMatch [5]. Some of the main ones are described below.
1. Pseudo-Label [1]: This method takes the network's predictions on unlabeled data as the labels of that data and uses them to train the network. Although the method is simple, it works very well. As can be seen from the figure below, after adding the unlabeled data, data points of the same category are clustered more densely.
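A minimal sketch of the pseudo-label loss, assuming PyTorch: the model's own hard predictions on unlabeled inputs act as targets alongside the supervised loss. The stand-in linear model and the weight alpha are illustrative.

```python
# Minimal sketch of the pseudo-label idea: the model's predictions on
# unlabeled data serve as hard targets. Assumes PyTorch; alpha is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_label_loss(model, x_lab, y_lab, x_unl, alpha=0.5):
    sup_loss = F.cross_entropy(model(x_lab), y_lab)            # supervised term
    with torch.no_grad():
        pseudo = model(x_unl).argmax(dim=1)                    # hard pseudo-labels
    unsup_loss = F.cross_entropy(model(x_unl), pseudo)         # unsupervised term
    return sup_loss + alpha * unsup_loss

model = nn.Linear(64, 4)                                       # stand-in emotion classifier
loss = pseudo_label_loss(model, torch.randn(8, 64), torch.randint(0, 4, (8,)),
                         torch.randn(32, 64))
```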
2. Temporal Ensembling [3]: Temporal Ensembling is a development of the pseudo-label method whose goal is to construct better pseudo-labels. The diagram below shows the structure of this approach, which has two implementations, the π-model and temporal ensembling.
The unsupervised loss of the π-model requires the model's outputs to be consistent for the same input under different regularization or data augmentation conditions, which encourages the network to learn the internal invariance of the data. Temporal ensembling instead maintains a moving average ẑ_i of the prediction z_i from each iteration and uses it as the target of the unsupervised loss.
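A minimal sketch of the temporal-ensembling target update, assuming PyTorch: a moving average of each sample's predictions is maintained across epochs and, after the startup bias correction described in [3], used as the unsupervised target. The momentum value is illustrative.

```python
# Minimal sketch of the temporal-ensembling target z_i -> ẑ_i update.
import torch

def update_ensemble(Z, z_new, epoch, alpha=0.6):
    """Z: (N, C) running average of predictions; z_new: (N, C) current predictions."""
    Z = alpha * Z + (1.0 - alpha) * z_new            # moving average of predictions
    targets = Z / (1.0 - alpha ** (epoch + 1))       # startup bias correction, as in [3]
    return Z, targets

Z = torch.zeros(100, 4)                              # 100 unlabeled samples, 4 classes
Z, targets = update_ensemble(Z, torch.softmax(torch.randn(100, 4), dim=1), epoch=0)
```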
3. Mean Teacher [4]: The Mean Teacher method improves the quality of pseudo-labels from the model's perspective, following the principle that "the average is better". The teacher model is obtained by taking a moving average of the student model's weights after each iteration, and the teacher model is then used to construct high-quality pseudo-labels that supervise the student model's loss on unlabeled data.
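A minimal sketch of the Mean Teacher weight update, assuming PyTorch: the teacher's parameters are an exponential moving average of the student's parameters, updated after every optimizer step. The stand-in linear model and the decay value are illustrative.

```python
# Minimal sketch of the Mean Teacher update: teacher weights = EMA of student weights.
import copy
import torch
import torch.nn as nn

student = nn.Linear(64, 4)                 # stand-in for a speech emotion model
teacher = copy.deepcopy(student)

@torch.no_grad()
def update_teacher(teacher, student, decay=0.99):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)   # EMA of weights

update_teacher(teacher, student)           # called after every optimizer step
```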
4. FixMatch [5]: FixMatch extends the consistency regularization principle of Temporal Ensembling, namely that the model's predictions for different augmentations of the same sample should be consistent, so that it learns invariances within the data. The FixMatch method therefore uses the weakly augmented sample to generate a pseudo-label, which supervises the model's output on the strongly augmented sample.
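A minimal sketch of the FixMatch unlabeled loss, assuming PyTorch: the prediction on a weakly augmented view provides a pseudo-label that supervises the strongly augmented view, and only confident pseudo-labels contribute to the loss. The stand-in model, the toy augmentations and the 0.95 threshold are illustrative.

```python
# Minimal sketch of FixMatch: weak view -> pseudo-label, strong view -> supervised output.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unl, weak_aug, strong_aug, threshold=0.95):
    with torch.no_grad():
        probs = torch.softmax(model(weak_aug(x_unl)), dim=1)    # weak view
        conf, pseudo = probs.max(dim=1)                         # pseudo-labels + confidence
    logits_strong = model(strong_aug(x_unl))                    # strong view
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * (conf >= threshold).float()).mean()          # confidence mask

model = nn.Linear(64, 4)                                        # stand-in classifier
weak = lambda x: x + 0.01 * torch.randn_like(x)                 # toy weak augmentation
strong = lambda x: x + 0.30 * torch.randn_like(x)               # toy strong augmentation
loss = fixmatch_unlabeled_loss(model, torch.randn(32, 64), weak, strong)
```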
4.2 Cross-modal knowledge transfer
Cross-modal knowledge transfer exploits the internal relationships between the modalities of multimedia data, transferring annotation information from one modality to the target modality to obtain annotated data. As shown in the figure below, cross-modal knowledge transfer includes transfer from vision to speech, from text to images, and so on. Some classical cross-modal knowledge transfer works are described below.
1. Image sentiment analysis based on cross-media transfer [6]: This method uses paired text-image data from Twitter to perform the image sentiment analysis task; the specific steps are shown below.
A trained text sentiment classifier first labels the text, and the resulting label is assigned directly to the paired image. An image sentiment classifier is then trained on the images with these pseudo annotations.
2. SoundNet [7]
Through pre-trained visual object and scene recognition networks, knowledge is transferred from the visual modality to the audio modality: the transferred labels are used to train an audio model that classifies acoustic scenes or objects.
3. Emotion Recognition in Speech using Cross-Modal Transfer in the Wild [8]
This method uses a pre-trained facial emotion recognition model as the teacher model and uses the teacher's predictions to train the speech emotion recognition model.
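A minimal sketch of this kind of cross-modal teacher-student distillation, assuming PyTorch: a face emotion model provides soft targets for a speech emotion model on paired data. The stand-in linear models and the temperature are illustrative, not the configuration used in [8].

```python
# Minimal sketch of cross-modal distillation: face teacher -> speech student.
import torch
import torch.nn as nn
import torch.nn.functional as F

face_teacher = nn.Linear(512, 4)     # stand-in for a pre-trained face emotion model
speech_student = nn.Linear(64, 4)    # stand-in for the speech emotion model

def distillation_loss(face_feats, speech_feats, temperature=2.0):
    with torch.no_grad():
        soft_targets = torch.softmax(face_teacher(face_feats) / temperature, dim=1)
    log_probs = torch.log_softmax(speech_student(speech_feats) / temperature, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")  # match the teacher

loss = distillation_loss(torch.randn(8, 512), torch.randn(8, 64))
```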
5. Our speech emotion recognition scheme
This section describes our approach to dealing with the lack of annotated data.
Combining cross-modal knowledge transfer and semi-supervised learning
In order to solve the problem of data scarcity in speech emotion recognition, we proposed an architecture combining cross-modal knowledge transfer and semi-supervised learning in 2021, which achieved state-of-the-art results on the CH-SIMS and IEMOCAP speech emotion recognition datasets. This work has been published in the SCI journal Knowledge-Based Systems under the title "Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition". Below is the architecture diagram of our solution:
Our scheme is based on two observations:
- Direct cross-modal label transfer introduces errors, because the relationship between facial emotion and speech emotion is complex and not completely consistent.
- Semi-supervised learning does not perform well when only a little annotated data is available. The model's prediction errors may be reinforced over time, resulting in poor accuracy for certain categories.
Inspired by the idea of multi-view learning, our method takes advantage of the two modalities present in video data, recognizes emotion in each modality, and fuses the results to obtain more accurate pseudo-labels. For speech emotion recognition, the scheme first extracts STFT features from the speech and applies SpecAugment data augmentation. Given the Transformer's success in modeling sequence data, the scheme uses a Transformer encoder to encode the speech, then applies mean pooling to obtain the speech features and classify emotions.
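A minimal sketch of the speech branch described above, assuming PyTorch: spectral frames are projected, encoded by a Transformer encoder, mean-pooled and classified. The layer sizes and the four emotion classes are illustrative, not the paper's settings.

```python
# Minimal sketch of a Transformer-encoder speech emotion classifier with mean pooling.
import torch
import torch.nn as nn

class SpeechEmotionTransformer(nn.Module):
    def __init__(self, feat_dim=64, d_model=128, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)               # project spectral frames
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x):                                      # x: (batch, T, feat_dim)
        h = self.encoder(self.proj(x))                         # (batch, T, d_model)
        return self.fc(h.mean(dim=1))                          # mean pooling -> logits

logits = SpeechEmotionTransformer()(torch.randn(8, 100, 64))
```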
Knowledge transfer across modalities
To perform cross-modal emotion transfer, a strong facial expression recognition model based on MobileNet is trained on large facial expression datasets. This model performs facial expression recognition on frames extracted from the video, and the per-frame predictions are then aggregated to obtain the facial expression prediction for the whole video segment.
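A minimal sketch of aggregating per-frame face predictions into a clip-level prediction, assuming PyTorch; the stand-in linear layer replaces the MobileNet-based expression model, and averaging softmax outputs over frames is one simple aggregation choice, not necessarily the one used in the paper.

```python
# Minimal sketch of clip-level aggregation of per-frame face emotion predictions.
import torch
import torch.nn as nn

face_model = nn.Linear(1280, 4)                 # stand-in for the MobileNet classifier

def clip_level_prediction(frame_feats):         # frame_feats: (n_frames, 1280)
    frame_probs = torch.softmax(face_model(frame_feats), dim=1)
    clip_probs = frame_probs.mean(dim=0)        # average predictions over frames
    return clip_probs, clip_probs.argmax()      # soft and hard clip-level labels

probs, label = clip_level_prediction(torch.randn(30, 1280))   # e.g. 30 frames
```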
Semi-supervised speech emotion recognition
Inspired by the consistency regularization assumption in FixMatch, we design a semi-supervised speech emotion recognition method. Specifically, the method applies two types of augmentation to each input speech sample. The strong augmentation, the SpecAugment algorithm, produces spectral features of a heavily distorted version of the speech, while the weak augmentation (feature dropout, etc.) produces speech features with little change. The model uses the weakly augmented samples to generate pseudo-labels that supervise the training on the strongly augmented samples.
Combining semi-supervised learning with cross-modal knowledge transfer
In each training iteration, the method generates a pseudo-label from the weakly augmented sample and then fuses it with the pseudo-label transferred across modalities to improve its quality. Two fusion methods are explored in this work: a weighted sum and multi-view consistency. The resulting high-quality pseudo-label is then used to supervise training on the strongly augmented samples.
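A minimal sketch of the weighted-sum fusion, assuming PyTorch: the speech model's prediction on the weak view is combined with the cross-modal (face) prediction, and the fused distribution yields the pseudo-label and its confidence. The weight of 0.5 is illustrative, not the value used in the paper.

```python
# Minimal sketch of weighted-sum fusion of speech and cross-modal pseudo-labels.
import torch

def fuse_pseudo_labels(speech_probs, face_probs, weight=0.5):
    """Both inputs are (batch, n_classes) probability distributions."""
    fused = weight * speech_probs + (1.0 - weight) * face_probs   # weighted sum
    conf, label = fused.max(dim=1)                                # fused pseudo-label
    return label, conf

label, conf = fuse_pseudo_labels(torch.rand(8, 4).softmax(1), torch.rand(8, 4).softmax(1))
```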
Through several such iterations, the model continuously improves the quality of the pseudo-labels.
Compared with semi-supervised learning methods and cross-modal methods, our method achieves the best results on both the CH-SIMS and IEMOCAP datasets. The results are as follows:
References
[1] Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks
[2] Semi-Supervised Learning with Ladder Networks
[3] Temporal Ensembling for Semi-Supervised Learning
[4] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
[5] FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
[6] Cross-Media Learning for Image Sentiment Analysis in the Wild
[7] SoundNet: Learning Sound Representations from Unlabeled Video
[8] Emotion Recognition in Speech using Cross-Modal Transfer in the Wild