
Emotion recognition in natural scenes based on deep neural networks and a small number of audio and video training samples


Authors: Wan Ding 1, Mingyu Xu 2, Dongyan Huang 3, Weisi Lin 4, Minghui Dong 3, Xinguo Yu 1, Haizhou Li 3,5

1. Central China Normal University, China

2. University of British Columbia, Canada

3. A*STAR, Singapore

4. Nanyang Technological University, Singapore

5. ECE Department, National University of Singapore, Singapore

Abstract

This paper introduces our team's system for the 2016 Emotion Recognition in the Wild Challenge (EmotiW2016). The EmotiW2016 challenge requires classifying videos into seven basic emotions (neutral, anger, sadness, happiness, surprise, fear, and disgust) based on facial expressions, actions, and sound. The training and test data come from clips of movies and reality TV shows. The proposed solution first performs emotion recognition separately on the video (facial expression) and audio channels, then fuses the predictions of the two subsystems (score-level fusion). Video emotion recognition first extracts convolutional neural network (CNN) features from the facial expression images. The deep CNN used for image feature extraction starts from a network pre-trained on ImageNet and is then fine-tuned on the FER2013 facial emotion recognition image dataset. Video features are then extracted from the CNN features using three image-set models. Finally, different kernel classifiers (SVM, PLS, etc.) classify the emotions in facial videos. Audio emotion recognition does not use any external dataset; it trains a long short-term memory recurrent neural network (LSTM-RNN) directly on the challenge data. Experimental results show that the proposed video subsystem, audio subsystem, and their fusion all achieve state-of-the-art accuracy. The system reaches a recognition accuracy of 53.9% on the EmotiW2016 challenge test set, 13.5% higher than the baseline (40.47%).

Introduction

As one of the key technologies for human-computer affective interaction, emotion recognition from audio and video signals has been an active research area for decades. Early work mainly focused on emotion recognition under staged laboratory recording conditions. In recent years, more and more researchers have turned their attention to emotion recognition in natural situations. The Facial Expression Recognition and Analysis Challenge (FERA) [1], the Audio/Visual Emotion Challenge (AVEC) [2], and the Emotion Recognition in the Wild Challenge (EmotiW) [3] have become benchmarks for studying and testing emotion recognition methods in natural situations. For emotion recognition, facial expression and voice are the two most important information channels; together they account for nearly 93% of emotion expression [4]. Based on how features are extracted along the time dimension, facial emotion recognition methods can be divided into three categories. The first category is based on hand-crafted spatiotemporal features, such as Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) and Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) [5-7].

The first type of method treats video data as a three-dimensional volume of pixels, extracting texture features along its spatial and spatiotemporal planes. The second type treats a video as a set of images, using image-set modeling to extract video features for emotion recognition. Image-set-based methods treat video frames as images of the same object captured under different conditions (pose, lighting, etc.). The third type uses sequence models, such as recurrent neural networks (RNNs), to capture the temporal cues for emotion recognition contained in the video. Compared to spatiotemporal-feature methods, image-set methods and RNN methods are more robust to changes in facial expression over time. RNN models generally contain a large number of free parameters; with a limited number of training videos, image-set methods can achieve better recognition results than RNN methods [8-9, 37]. For image feature extraction from video frames, one approach is to use hand-crafted features, such as the combination of traditional features like Dense SIFT [9] and Histograms of Oriented Gradients (HOG) [10] with different image-set modeling methods [11-14], as used by Liu et al. [8] for emotion recognition in facial videos. The experimental results in [8] also indicate that different traditional image features are complementary for facial emotion recognition.

Yao et al. [15] defined an emotion recognition feature based on the differences between local regions of a face image. They first registered local regions using face frontalization [16], then extracted LBP features from the local regions, and finally used feature selection to detect the most dissimilar regions, using the differences in LBP feature values of these regions as the emotion recognition feature of the face image. Their method achieved good results in both the static and audiovisual emotion recognition challenges of EmotiW2015. Besides hand-crafting image features, another approach to image feature extraction is to use deep convolutional neural networks (DCNNs).

Here, "deep" means that the network has more than three convolutional layers. A DCNN is an end-to-end image classification model whose convolutional-layer outputs can be used as image features with a certain degree of generality [17]. Training an effective DCNN usually requires a large number of samples (e.g., 100,000 facial expression images), whereas publicly available facial emotion recognition datasets are usually small (e.g., FER2013 has only about 30,000 images). To address this, Liu et al. [8] trained a DCNN on the face recognition dataset CFW [18] (about 170,000 images); experiments showed the learned DCNN features outperformed traditional hand-crafted features (Dense SIFT and HOG). Ng et al. [19] adopted a transfer learning strategy, using a pre-trained general image recognition network to initialize the emotion recognition network and then training (fine-tuning the weights of) the network on the FER2013 dataset [20]. The fine-tuned DCNN achieved good results in the EmotiW2015 static facial expression recognition sub-challenge. Kim et al. [37] used decision fusion: they trained multiple DCNNs directly on a small dataset and then combined the DCNNs' emotion predictions for the face images by mean fusion. However, feature-level fusion of multiple DCNNs still needs further research.

In audio emotion recognition, experience shows that acoustic features are complementary to facial visual features: fusing face-based and audio-based emotion recognition results achieves better performance than either channel alone [8-9, 21-22]. In recent years, LSTM-RNNs [26] have been widely used in speech emotion recognition and other acoustic modeling tasks [2, 22-23, 27-29]. Compared with traditional Hidden Markov Models (HMMs) [23] and standard recurrent neural networks, LSTM-RNNs can extract dependencies over longer time intervals (e.g., more than 100 time steps) without suffering from problems such as vanishing gradients [25].

The system introduced in this paper combines different methods: facial video emotion recognition is based on DCNN features and image-set modeling, while audio emotion recognition is based on an LSTM-RNN model. The main contributions are twofold. First, DCNN image feature extraction adopts a transfer learning method based on weight fine-tuning; the resulting DCNN features, trained on a small number of samples, outperform DCNN features trained on a large face recognition dataset [19]. Second, the audio emotion recognition LSTM-RNN model was trained with only a small number of samples (the 773 audio utterances provided by EmotiW2016), yet still exceeded the benchmark method's recognition rate by 7%. The details of the method are introduced in the following sections.

1. The proposed method

1.1 Emotion Recognition Based on Facial Video

The facial video emotion recognition method proposed in this paper consists of three steps. The first step is to extract DCNN image features from the facial images of each frame of the video. The second step is to extract dynamic features based on the image set modeling method. The last step is classification. Since the video features based on the image set are usually located on non-Euclidean manifolds [13], kernel functions are used to map them to Euclidean space for final classification after feature extraction. In the method we use, the second and third steps directly apply the open source code provided in reference [8] for dynamic feature extraction and classification.

1.1.1 Deep CNN Image Features

Convolutional neural networks (CNNs) borrow the organizational structure of neurons in the animal visual cortex. The network structure uses techniques such as local connectivity, weight sharing, and pooling to reduce network complexity and keep features translation invariant. Deep CNNs typically contain multiple convolutional layers, and the output of each convolutional layer can serve as a feature description of the input image. Let the input image be $I_{W,H,C}$, where $W$ is the width, $H$ the height, and $C$ the number of channels (typically the RGB channels). For a local region $L_{w,h,C}$ in $I$, the corresponding feature value is

$$o_L = \sigma(K_{w,h,C} * L_{w,h,C} + b) \qquad (1)$$

where $K_{w,h,C}$ is a kernel of the same size as $L$; $*$ denotes the convolution operation; $b$ is a bias variable; $\sigma$ is the activation function, in practice usually a rectified linear unit (ReLU); and $o_L$ is the feature value of region $L$. Convolving the kernel $K$ with each local region of $I$ yields the feature map $M$, which can then be used as an image feature vector for further processing.
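As an illustration, the convolution-plus-ReLU step of Eq. (1) can be sketched in a few lines of NumPy. This is a toy implementation for clarity, not the authors' code (real systems use optimized libraries such as Caffe, and, like most deep learning frameworks, this computes cross-correlation without flipping the kernel):

```python
import numpy as np

def conv2d_relu(image, kernel, bias=0.0):
    """Valid 2D convolution over a multi-channel image, followed by ReLU.

    Each output value is o_L = relu(K * L + b) for every local region L
    of the input. `image` is H x W x C, `kernel` is kh x kw x C.
    """
    H, W, C = image.shape
    kh, kw, kc = kernel.shape
    assert kc == C
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i:i + kh, j:j + kw, :]   # local region L
            out[i, j] = np.sum(region * kernel) + bias
    return np.maximum(out, 0.0)                     # ReLU activation

# Toy example: 5x5 RGB image, 3x3 kernel -> 3x3 feature map
rng = np.random.default_rng(0)
fmap = conv2d_relu(rng.standard_normal((5, 5, 3)), rng.standard_normal((3, 3, 3)))
print(fmap.shape)  # (3, 3)
```

Stacking one such feature map per kernel, and layering convolutions with pooling in between, gives the multi-layer feature extractor described above.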

1.1.2 Dynamic Features of Facial Videos

Given a d-dimensional image feature $f$, a video can be regarded as a set of image feature vectors $F=[f_1, f_2, \ldots, f_n]$, where $f_i \in \mathbb{R}^d$ is the feature vector of the $i$-th frame. Three image-set models are used to extract video (image-set) features from $F$: a linear subspace [14], a covariance matrix [13], and a multivariate Gaussian distribution [15]. The basis $P$ of the linear subspace model is obtained from the eigendecomposition

$$F F^{\top} = P \Lambda P^{\top} \qquad (2)$$

where $P = [p_1, p_2, \ldots, p_r]$ and $p_j$ ($j \in [1, r]$) are the $r$ leading eigenvectors.
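The subspace basis can be computed from the singular value decomposition of the feature matrix, since the left singular vectors of $F$ are exactly the eigenvectors of $F F^{\top}$. A minimal sketch (the dimensions are chosen arbitrarily for illustration):

```python
import numpy as np

def linear_subspace(F, r):
    """Leading-r orthonormal basis of the span of the frame features.

    F is d x n (one column per frame); the columns of P are the r
    principal eigenvectors of F F^T, taken here from the left singular
    vectors of F.
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return U[:, :r]          # P = [p1 ... pr], with P^T P = I

rng = np.random.default_rng(1)
F = rng.standard_normal((128, 30))   # e.g. 30 frames of 128-d CNN features
P = linear_subspace(F, r=10)
print(P.shape)  # (128, 10)
```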

The covariance matrix $C$ is computed as

$$C = \frac{1}{n-1} \sum_{i=1}^{n} (f_i - \bar{f})(f_i - \bar{f})^{\top} \qquad (3)$$

where $\bar{f}$ is the mean of the image features. Assume that the feature vectors in $F$ follow a $d$-dimensional Gaussian distribution $N(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are the mean and covariance, respectively:

$$N(f \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\!\left(-\tfrac{1}{2}(f-\mu)^{\top}\Sigma^{-1}(f-\mu)\right) \qquad (4)$$

The Gaussian is then embedded as a $(d+1)\times(d+1)$ symmetric positive-definite matrix, which serves as the video feature:

$$G = |\Sigma|^{-\frac{1}{d+1}} \begin{pmatrix} \Sigma + \mu\mu^{\top} & \mu \\ \mu^{\top} & 1 \end{pmatrix} \qquad (5)$$
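Both set models are straightforward to compute in NumPy. The sketch below assumes the determinant-normalized SPD embedding form of Eq. (5) common in the image-set literature; it is illustrative, not the authors' code:

```python
import numpy as np

def covariance_feature(F):
    """Sample covariance of the frame features (Eq. 3). F is d x n."""
    D = F - F.mean(axis=1, keepdims=True)
    return D @ D.T / (F.shape[1] - 1)

def gaussian_embedding(F):
    """Embed N(mu, Sigma) as a (d+1)x(d+1) SPD matrix.

    A sketch of the standard Gaussian embedding used by image-set
    models; the determinant normalization is an assumption based on
    the form commonly cited for this model.
    """
    d, n = F.shape
    mu = F.mean(axis=1, keepdims=True)
    Sigma = covariance_feature(F)
    G = np.block([[Sigma + mu @ mu.T, mu],
                  [mu.T, np.ones((1, 1))]])
    sign, logdet = np.linalg.slogdet(Sigma)
    return np.exp(-logdet / (d + 1)) * G

rng = np.random.default_rng(2)
F = rng.standard_normal((8, 50))     # 50 frames of 8-d features (toy sizes)
G = gaussian_embedding(F)
print(G.shape)  # (9, 9)
```

Since these features live on non-Euclidean manifolds (SPD matrices, Grassmann points), they are mapped to Euclidean space by kernel functions before classification, as described in the next subsection.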

1.1.3 Kernel Functions and Classifiers

Regarding kernel functions, we chose two: polynomial and RBF (radial basis function). For the classifier, we used PLS (partial least squares regression) [30]. Experiments by Liu et al. on the EmotiW2014 dataset [8] showed that PLS outperforms support vector machines (SVM) and logistic regression in facial emotion recognition; we observed the same trend on the EmotiW2016 dataset. Given video feature variables $X$ and 0-1 labels $Y$ (the seven-way emotion recognition can be viewed as seven binary classification tasks), the PLS classifier decomposes them as

$$X = U_x V_x^{\top} + r_x, \qquad Y = U_y V_y^{\top} + r_y$$

where $U_x$ and $U_y$ are the projected X-scores and Y-scores, $V_x$ and $V_y$ are the loadings, and $r_x$ and $r_y$ are the residuals. PLS determines the regression coefficients between $X$ and $Y$ by finding $U_x$ and $U_y$ whose column vectors have maximum covariance. Given the maximum-covariance score matrices $U_x$ and $U_y$, the regression coefficient $\beta$ is

$$\beta = X^{\top} U_y \,(U_x^{\top} X X^{\top} U_y)^{-1}\, U_x^{\top} Y$$

Given a video feature vector $x$, the corresponding classification prediction is $\hat{y} = x\beta$.
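A minimal NIPALS-style PLS sketch illustrates the score decomposition and regression step. This is a generic textbook variant under our own naming, not the implementation from [8] or [30]:

```python
import numpy as np

def pls_fit(X, Y, n_components):
    """Minimal NIPALS-style PLS2 regression (an illustrative sketch).

    X is samples x features; Y is samples x classes (0-1 indicator
    columns). Returns beta such that Y_hat ~= (X - Xm) @ beta + Ym.
    """
    Xm, Ym = X.mean(0), Y.mean(0)
    Xk, Yc = X - Xm, Y - Ym
    Ws, Ps, Qs = [], [], []
    for _ in range(n_components):
        # weight vector giving X-scores of maximal covariance with Y
        w = np.linalg.svd(Xk.T @ Yc, full_matrices=False)[0][:, 0]
        t = Xk @ w                    # X-score (a column of Ux)
        p = Xk.T @ t / (t @ t)        # X-loading (a column of Vx)
        q = Yc.T @ t / (t @ t)        # Y-loading
        Xk = Xk - np.outer(t, p)      # deflate X
        Ws.append(w); Ps.append(p); Qs.append(q)
    W, P, Q = map(np.column_stack, (Ws, Ps, Qs))
    beta = W @ np.linalg.solve(P.T @ W, Q.T)
    return beta, Xm, Ym

def pls_predict(x, beta, Xm, Ym):
    """Regression prediction; argmax over columns gives the class."""
    return (x - Xm) @ beta + Ym

# Toy usage with 7-class one-hot labels (arbitrary sizes)
rng = np.random.default_rng(3)
X = rng.standard_normal((40, 12))
Y = np.eye(7)[rng.integers(0, 7, 40)]
beta, Xm, Ym = pls_fit(X, Y, n_components=6)
pred_class = pls_predict(X, beta, Xm, Ym).argmax(axis=1)
```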

1.2 Audio-based emotion recognition

Audio-based emotion recognition first extracts acoustic features frame by frame, then trains an LSTM-RNN (long short-term memory recurrent neural network) to extract temporal features and classify emotions. Assume the audio feature sequence of a video clip is $F=[f_1, f_2, \ldots, f_n]$ and the corresponding emotion label is $c$. Before training the LSTM-RNN, we define frame-by-frame emotion labels $C=[c_1, c_2, \ldots, c_n]$, where $c_i = c$ for $i = 1, \ldots, n$. The output of the LSTM is likewise a frame-by-frame prediction; we take the average of the frame-wise predictions as the final emotion prediction for the video clip.
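The frame-wise labeling and prediction averaging can be sketched as follows (the helper names are hypothetical; a real system would pass the targets to the LSTM trainer):

```python
import numpy as np

def expand_labels(clip_label, n_frames, n_classes):
    """Repeat a clip-level emotion label for every frame (C = [c, c, ..., c]),
    as one-hot targets for frame-wise LSTM training."""
    targets = np.zeros((n_frames, n_classes))
    targets[:, clip_label] = 1.0
    return targets

def clip_prediction(frame_probs):
    """Average the frame-wise posteriors, then pick the clip-level emotion."""
    return int(np.argmax(frame_probs.mean(axis=0)))

# Toy example: 3 frames, 2 emotion classes
frame_probs = np.array([[0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
print(clip_prediction(frame_probs))  # prints 1
```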

1.2.1 Audio Features

The method uses the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [31] for emotion recognition. The eGeMAPS feature set is designed based on expert knowledge. Compared with traditional high-dimensional feature sets [32], eGeMAPS has only 88 features, yet shows higher robustness in speech emotion modeling [33-34]. The acoustic low-level descriptors (LLDs) of eGeMAPS cover spectral, cepstral, prosodic, and voice-quality information. In addition to the LLDs, eGeMAPS also includes statistical functionals such as the arithmetic mean and the coefficient of variation.

1.2.2 LSTM-RNN

Compared to traditional activation units such as sigmoid and tanh, LSTM-RNNs use special units called memory blocks. The structure of an LSTM memory block is shown in Figure 2. For a memory block in a network layer, its input at time t is the output $x_t$ of the previous layer at time t together with the output $h_{t-1}$ of the block itself at time t-1. A memory block consists of four main parts: input gate, memory cell, forget gate, and output gate. The memory cell has a self-recurrent connection with weight 1.0, which keeps the cell state constant in the absence of external input. The input gate allows (or blocks) the input signal from changing the memory cell state. The output gate allows (or blocks) the cell state from affecting the block output. The forget gate scales the cell's self-recurrent connection, letting the cell retain or clear its previous state as needed. The memory block computes as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

$$h_t = o_t \odot \tanh(c_t)$$

where $x_t$ and $h_{t-1}$ are the inputs; the $W$ and $U$ are weight matrices; the $b$ are bias vectors; $\sigma$ is the sigmoid function; $\tilde{c}_t$ is the cell-state candidate at time t; $i$, $f$, $c$, and $o$ are the outputs of the input gate, forget gate, memory cell, and output gate, respectively; and $h_t$ is the final output of the block at time t.
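A single memory-block update, following the gate equations above, can be written directly in NumPy. This is an illustrative sketch with arbitrary sizes and parameter layout, not the CURRENNT implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One memory-block update. `params` holds, per gate g in {i, f, c, o}:
    W_g (hidden x input), U_g (hidden x hidden), and b_g (hidden,)."""
    W, U, b = params['W'], params['U'], params['b']
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # cell candidate
    c = f * c_prev + i * c_tilde                                # cell state
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate
    h = o * np.tanh(c)                                          # block output
    return h, c

# Toy example: one 88-d eGeMAPS frame, 8 hidden units
rng = np.random.default_rng(4)
p = {k: {g: rng.standard_normal((8, 88) if k == 'W' else (8, 8)) * 0.1
         for g in 'ifco'} for k in ('W', 'U')}
p['b'] = {g: np.zeros(8) for g in 'ifco'}
h, c = lstm_step(rng.standard_normal(88), np.zeros(8), np.zeros(8), p)
```

Iterating this step over the frame sequence, and adding a softmax output layer, yields the frame-wise emotion predictions that are averaged per clip.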

1.3 System Integration

We further fused the prediction results of the facial-video and audio subsystems. The fusion introduces a weight vector $w = [\lambda_1, \lambda_2, \ldots, \lambda_c]$, where $c$ is the number of emotion categories. The final prediction $S$ is computed class-wise as

$$S = w \odot S_A + (1 - w) \odot S_V$$

where $S_A$ and $S_V$ are the emotion prediction scores of the audio and video subsystems, respectively, and $\odot$ denotes element-wise multiplication.
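The class-wise fusion can be sketched as follows. The combination rule (weighting the audio score by λ_k and the video score by 1-λ_k per category) is an assumption based on the description, and the scores and weights below are made up for illustration:

```python
import numpy as np

def fuse_scores(s_audio, s_video, w):
    """Class-wise weighted score fusion: category k is combined as
    w[k]*S_A[k] + (1-w[k])*S_V[k]. All inputs are length-c vectors."""
    w = np.asarray(w)
    return w * s_audio + (1.0 - w) * s_video

# Hypothetical 7-class scores; give the audio subsystem more say on
# categories where it is stronger (e.g. fear, sadness)
s_a = np.array([0.1, 0.05, 0.4, 0.1, 0.1, 0.2, 0.05])
s_v = np.array([0.3, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1])
w   = np.array([0.3, 0.3, 0.7, 0.3, 0.3, 0.7, 0.3])
print(int(np.argmax(fuse_scores(s_a, s_v, w))))  # prints 2
```

Setting all weights to 0.5 reduces this to plain mean fusion, which is the uniform-weight scheme compared against in the experiments.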

2. Experiment

2.1 EmotiW2016 Data

Emotion recognition based on audio and video is one of the sub-challenges of EmotiW2016. The dataset consists of multimedia video clips whose emotional states are labeled using the semi-automatic method defined in [40]. The task is to design an algorithm that automatically classifies video clips into seven basic emotional states (anger, disgust, fear, happiness, neutral, sadness, and surprise). EmotiW2016 is a continuation of EmotiW2013-15; the main change is that, in addition to clips extracted from movies, clips from reality TV shows are introduced into the test set to test the generality of emotion recognition methods trained on movie data. The sub-challenge dataset contains 1739 video clips: 773 in the training set, 373 in the validation set, and 593 in the test set. The final challenge result is based on the accuracy of the system on the test set.

2.2 Implementation of Deep Neural Networks

2.2.1 CNN Image Feature Extraction

We fine-tuned the pre-trained AlexNet deep CNN model [39] using the Caffe toolkit [38] and the FER2013 dataset. Both the pre-trained AlexNet model and the FER2013 dataset are publicly available. When using the FER2013 dataset (~28,000 face images), we first scaled the default 48x48x1 FER2013 images to 256x256x3 to fit the input requirements of the AlexNet model. The network was trained with stochastic gradient descent, with hyperparameters momentum = 0.9, weight decay = 0.0005, initial (base) learning rate = 0.001, learning rate decay = 0.1, decay epochs = 10, and batch size = 128. Because the final fully connected layer is retrained from scratch rather than keeping the AlexNet weights, its initial learning rate is increased by a factor of 4, to 0.004 instead of 0.001. The training termination strategy is early stopping: training stops when the recognition rate on the validation set no longer improves. The output of the final pooling layer of the trained network serves as the image feature for facial emotion recognition.
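A Caffe solver configuration along these lines might look like the sketch below. The file name is hypothetical, and `stepsize` is expressed in iterations (roughly 10 epochs of FER2013 at batch size 128), since Caffe's `step` policy counts iterations rather than epochs; the 4x learning-rate multiplier for the retrained fully connected layer would be set via `lr_mult` in the network definition, not in the solver:

```
net: "fer2013_finetune.prototxt"   # hypothetical network definition file
base_lr: 0.001        # initial (base) learning rate
lr_policy: "step"
gamma: 0.1            # learning rate decay factor
stepsize: 2200        # ~10 epochs at batch size 128 (assumption)
momentum: 0.9
weight_decay: 0.0005
```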

2.2.2 Audio Feature Extraction

We first used the MATLAB toolbox to extract the audio signals from the EmotiW2016 video clips and converted them to 16 kHz mono. We then used the openSMILE toolkit [35] to extract eGeMAPS audio features frame by frame, with the audio frame length set to 0.04 s.

2.2.3 Structure of LSTM-RNN

We evaluated six different BLSTM-RNN architectures for audio emotion recognition. The architectures of the six LSTM-RNNs are shown in Table 2.

In the experiment, the audio LSTM was implemented and trained using the CURRENNT toolkit [36]. The learning rate was 1e-5, and the batch size was 10 sentences (each sentence corresponds to an audio feature sequence extracted from a movie clip). The termination strategy for LSTM training was also early stopping. Starting from randomly initialized network weights, we trained each of the six LSTM structures 10 times. The recognition rate on the validation dataset was between 31% and 35%, with the best model based on structure 4. We used this as the final model for audio emotion recognition.

3. System Evaluation

To evaluate the CNN image features, we tested three classifiers (PLS, SVM, and Logistic Regression) on the validation sets of the EmotiW2014 and EmotiW2016 datasets. The test results are shown in Tables 5 and 6.

The results show that PLS exhibits superior classification performance compared to SVM and Logistic Regression with both CNN features and traditional handcrafted features. We then evaluated combinations of different methods, as shown in Table 7. Based on the experimental results, we selected Dense SIFT and CNN image features for face-video emotion recognition in our final system. For audio emotion recognition, we compared the LSTM method with the traditional method (EmotiW2014 baseline); the results are shown in Figure 4. The LSTM method achieves 8% higher accuracy than the traditional method.

The final experiment involved the fusion of video and audio systems. We tested three fusion schemes: the first used the same weights for all emotion categories; the second and third assigned different weights to each emotion category subsystem. Results on the validation dataset showed that the LSTM-based audio recognition method performed well (outperforming the video method) in classifying fear and sadness, but performed poorly in classifying disgust and surprise. The experimental results also showed that using different weights for system fusion better combined the relative strengths and weaknesses of the subsystems, achieving a better fusion effect than using uniform weights. Table 3 lists the three fusion schemes tested in the experiment. Fusion scheme 3 achieved the best results on both the validation and final test datasets, with a recognition accuracy of 53.9% on the test set.

Experimental results show that: 1. The proposed method performs best in recognizing anger and happiness, achieving 80% and 75% accuracy, respectively. These results are on par with the top-performing methods in EmotiW 2014 and 2015. 2. Compared to the top-performing methods in 2014 and 2015, the LSTM-based audio emotion recognition method achieves a 10% improvement in accuracy for recognizing fear. 3. Compared to the top-performing methods in 2014 and 2015, the proposed method exhibits overfitting in recognizing the neutral state. Specifically, while achieving approximately 70% accuracy on the development dataset (on par with the two top-performing methods), it performs poorly on the test dataset, with an accuracy decrease of approximately 7%.

4. Summary

This paper proposes a method for audio-video emotion recognition in natural scenarios. The method uses only a small amount of sample data to train its deep neural networks, yet achieves state-of-the-art recognition accuracy: 53.9% on the EmotiW2016 test set, 13.5% higher than the baseline of 40.47% [41]. The results show that: first, when the amount of facial video emotion data available for training is small, a transfer learning strategy based on DCNN weight fine-tuning is effective; second, for audio emotion recognition, an LSTM-RNN trained directly on the small amount of training data provided by EmotiW2016 outperforms traditional methods. Our future work will proceed in two directions. First, we will seek more effective facial emotion recognition features by examining different pre-trained DCNNs and different fine-tuning strategies. Second, we will study audio-based emotion recognition in more depth and improve recognition by designing more effective LSTM-RNN models.
