Research on Integrated Neural Network Speech Emotion Recognition Model

2026-04-06 05:28:49 · #1

Background

Emotion recognition helps provide a better human-computer interaction experience and is an essential capability for future computers. Speech emotion recognition has therefore gradually become a research hotspot and has seen numerous applications in recent years. Although research on speech emotion recognition began as early as the 1980s, the task remains challenging for machines, and, compared with the field of speech recognition, very few publicly available databases exist for it.

A speech emotion recognition system consists of two parts: a feature extractor and an emotion recognition classifier.

Acoustic features used in speech emotion recognition fall into three categories: prosodic features, spectral features, and voice quality features. Commonly used prosodic features include duration, fundamental frequency, and energy. Spectral features generally include linear spectral features such as LPC and OSALPC, and cepstral features such as MFCC and LPCC. Voice quality features generally include formant frequencies and their bandwidths, frequency perturbation (jitter), amplitude perturbation (shimmer), and glottal parameters. These acoustic features, extracted frame by frame, characterize short-term audio and are collectively referred to as low-level descriptors (LLDs).

However, human perception of emotion often rests on how emotion fluctuates over a stretch of speech. To describe emotion over a longer time span, global features of the sentence are therefore usually calculated. Global features characterize the dynamic changes of the low-level descriptors across the entire sentence and are composed of statistical values of those descriptors; common statistics include the mean, extreme values, range of variation, variance, skewness, moments, and linear regression parameters.

Feature design is a crucial step in traditional speech emotion recognition methods, as it determines the quality of the emotion features. However, finding the optimal feature subset is tedious and database-dependent. To date, there is no universally accepted optimal feature set for speech emotion recognition, and researchers mostly select features empirically.
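As a minimal sketch of the statistical-functional idea described above (assuming NumPy; the function name and the choice of statistics are illustrative, not the paper's exact feature set), a frame-level LLD sequence can be collapsed into one sentence-level global feature vector like this:

```python
import numpy as np

def global_features(lld):
    """Collapse a (T, D) matrix of frame-level low-level descriptors
    into one fixed-length vector of per-dimension statistics."""
    return np.concatenate([
        lld.mean(axis=0),                   # mean
        lld.min(axis=0),                    # minimum (extreme value)
        lld.max(axis=0),                    # maximum (extreme value)
        lld.max(axis=0) - lld.min(axis=0),  # range of variation
    ])

# 50 frames of a hypothetical 3-dimensional descriptor sequence
frames = np.random.default_rng(0).normal(size=(50, 3))
feat = global_features(frames)
```

However long the sentence is, the resulting vector has a fixed length (here 4 statistics × 3 dimensions), which is what allows a standard discriminative classifier to be applied afterwards.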

Depending on the feature source and the classifier training method, speech emotion recognition systems can determine emotion at two levels: the short speech segment level and the complete sentence level. For segment-level recognition, a sentence is divided into multiple speech segments, and the features of each segment are used to train the classifier. Low-level descriptors extracted from the speech frames are fed into a sequence classifier, typically a Gaussian mixture model or a hidden Markov model, to model the speaker's emotional dynamics. During training, the emotion label of a short segment is the label of its corresponding sentence; during testing, since a sentence yields multiple segment-level results, a majority vote produces the final result. For sentence-level recognition, the classifier is fed features extracted from the entire sentence: the global features are first computed from the low-level descriptors via statistical functions and then passed to a discriminative classifier to identify the sentence's emotion. Such discriminative classifiers include almost all traditional classifiers, such as support vector machines, decision trees, and K-nearest-neighbor models.
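The majority vote used at test time in segment-level recognition can be sketched in a few lines (the labels are hypothetical examples; `Counter.most_common` breaks ties in first-seen order):

```python
from collections import Counter

def sentence_label(segment_predictions):
    """Majority vote over per-segment emotion predictions."""
    return Counter(segment_predictions).most_common(1)[0][0]

# hypothetical per-segment outputs for one utterance
print(sentence_label(["happy", "neutral", "happy", "angry", "happy"]))  # happy
```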

In recent years, deep neural networks (DNNs) have been introduced into the field of speech emotion recognition due to their powerful ability to learn hierarchical features from raw data. Han et al. designed a DNN to learn short-term speech segment emotion features, and used an extreme learning machine (ELM) in the backend to perform sentence-level emotion classification on global features. Lee et al. proposed a recurrent neural network based on the maximum likelihood learning criterion to model random speech segment label sequences, which greatly improved the accuracy of speech emotion recognition. Mirsamadi et al. explored different RNN structures for speech emotion recognition and proposed an attention mechanism to weight speech frames with different levels of emotion importance. Mao et al. designed a convolutional neural network to learn significantly discriminative emotion features in speech.

Comparing different methods on the same database reveals significant differences in their confusion matrices. Even with the same low-level descriptors, different classifiers achieve inconsistent recognition rates across the individual emotion categories. This indicates that no single classifier performs well across all emotion categories; for example, an SVM might fail to identify "happy" reliably while a DNN classifier succeeds. Such discrepancies are related not only to data imbalance but also directly to the modeling capabilities of the classifiers themselves.

Based on this conclusion, to improve the accuracy of speech emotion recognition, this paper proposes an ensemble learning method using two types of neural networks as base classifiers. As mentioned in the literature, the base classifiers in an ensemble system should have as different structures as possible to achieve better generalization performance. This paper selects recurrent neural networks suitable for processing sequential data and wide residual networks with outstanding performance in image classification as base classifiers.

Introduction to base classifiers

1. Long Short-Term Memory Recurrent Neural Network

Due to their unique structure, RNNs possess powerful capabilities for processing sequential data. The connections between hidden layers at consecutive time steps allow the hidden state of the previous step to be passed to the current step. This process repeats, so information from the first step of the sequence can reach the last, which models correlations across the sequence. However, once the input sequence exceeds a certain length, the performance of plain RNNs drops sharply due to the vanishing gradient problem. The long short-term memory (LSTM) model was designed to overcome this issue.

In general, an LSTM module consists of four elements: an input gate i, a forget gate f, an output gate o, and a memory cell c. The three gates regulate the relationship between the memory-cell states at different time steps. At time step t, the input is x_t, the states of the three gates and the memory cell are i_t, f_t, o_t, and c_t, respectively, and the output of the LSTM layer is h_t; their relationship is given by the following equations:

i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)    (1)

f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)    (2)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t-1} + b_c)    (3)

o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)    (4)

h_t = o_t ⊙ tanh(c_t)    (5)

where W_x, W_h, and W_c are the connection weights from the input, the previous hidden-layer output, and the memory cell to each gate, respectively, and b is the bias of each gate.
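One time step of the peephole-style LSTM described above can be sketched in NumPy (the dictionary keys and dimensions are illustrative; peephole weights are taken as diagonal, i.e. element-wise, as is standard):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: input, forget and output gates plus memory cell.
    W holds the input (x*), recurrent (h*) and peephole (c*) weights per gate."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in ("xi", "xf", "xc", "xo")}
W |= {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in ("hi", "hf", "hc", "ho")}
W |= {k: rng.normal(scale=0.1, size=n_hid) for k in ("ci", "cf", "co")}  # diagonal peepholes
b = {k: np.zeros(n_hid) for k in ("i", "f", "c", "o")}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```

Because the memory cell is updated additively (gated by f_t) rather than through repeated matrix multiplication, gradients can flow across many time steps without vanishing as quickly as in a plain RNN.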

2. Wide residual network (WRN)

As is well known, the more layers a CNN has, the harder it becomes to train, owing to the vanishing gradient problem. Residual networks were proposed to make very deep convolutional networks trainable, and experiments have shown that they achieve superior image recognition performance even with far more layers than traditional CNNs. Rather than pushing depth still further, wide residual networks (WRNs) were proposed in the literature, improving image recognition accuracy with a shallower but wider network structure.

Residual networks are built by stacking residual modules sequentially. A residual module typically contains two convolutional layers, each preceded by a batch normalization layer and a ReLU activation. Compared with an ordinary residual network, a WRN widens each convolutional layer by expanding its number of convolutional kernels by a factor of K, improving the layer's feature learning capability. Studies have shown that a WRN can match the image recognition rate of an ordinary residual network with far fewer layers. Figure 1 shows the structure of a residual module and a WRN. The WRN in Figure 1 uses four types of residual modules that differ in their number of convolutional kernels; N consecutive modules of the same type are stacked into a group, and the four groups, together with pooling layers and a softmax layer, are stacked sequentially to form the WRN.
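A back-of-envelope calculation (illustrative numbers, not the paper's configuration) shows what the widening factor K costs: since both the input and output channel counts of a block's convolutions grow by K, the parameter count of a residual block grows roughly by K²:

```python
def conv_params(c_in, c_out, k=3):
    """Weight count of a k x k convolution, ignoring biases."""
    return c_in * c_out * k * k

def residual_block_params(width, widen=1, k=3):
    """Two 3x3 conv layers of a (widened) residual block."""
    w = width * widen
    return 2 * conv_params(w, w, k)

base = residual_block_params(16)           # plain residual block, 16 kernels
wide = residual_block_params(16, widen=4)  # WRN block with K = 4
print(wide / base)  # 16.0, i.e. K^2 more parameters
```

The trade is parameters for depth: a much shallower stack of such widened blocks can reach the accuracy of a far deeper plain residual network, while training faster in wall-clock terms because wide convolutions parallelize well.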

Integrated Neural Network Speech Emotion Recognition System

1. RNN Speech Emotion Recognition Subsystem

The block diagram of the RNN subsystem is shown in Figure 2. The system input is the feature sequence s(1), s(2), …, s(T) of a sentence, where T is the number of segments the sentence is divided into (and thus the number of RNN time steps) and s(t) is the feature vector extracted from the t-th segment. The computation proceeds as follows: at each time step, the raw feature vector passes through a fully connected layer and then enters the LSTM layer; the LSTM outputs from all time steps are averaged in a pooling layer to obtain the global features of the input sentence; and the global features are fed into a softmax layer that computes the probability of the sentence belonging to each emotion category, yielding the recognition result. Since the RNN processes the entire sentence directly, only the sentence label is needed as the training target, and the cross-entropy loss function is used during training.

Figure 2 RNN subsystem
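The Figure 2 pipeline can be sketched end to end in NumPy (a plain tanh recurrence stands in for the LSTM layer here; all weight names and dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_subsystem(segments, W_fc, W_in, W_rec, W_out):
    """Sketch of the Fig. 2 pipeline: fully connected layer -> recurrent
    layer (a plain tanh cell standing in for the LSTM) -> mean pooling
    over all time steps -> softmax over emotion classes."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for s in segments:                      # one segment feature per time step
        x = np.tanh(W_fc @ s)               # fully connected layer
        h = np.tanh(W_in @ x + W_rec @ h)   # recurrent layer
        states.append(h)
    g = np.mean(states, axis=0)             # pooling: sentence global feature
    return softmax(W_out @ g)               # probability per emotion class

rng = np.random.default_rng(1)
T, d, hdim, n_emotions = 8, 32, 16, 4       # 32-dim segment features, 4 emotions
probs = rnn_subsystem(
    rng.normal(size=(T, d)),
    rng.normal(scale=0.1, size=(hdim, d)),
    rng.normal(scale=0.1, size=(hdim, hdim)),
    rng.normal(scale=0.1, size=(hdim, hdim)),
    rng.normal(scale=0.1, size=(n_emotions, hdim)),
)
```

The mean pooling over hidden states is what turns a variable-length sequence into a fixed-length global feature, so a single softmax layer and a single sentence label suffice for training.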

The speech segment feature s(t) input to the RNN is formed by stacking the frame features that fall within a time window: given the window length w and the frame feature vectors f, s(t) is the concatenation of the w consecutive frame feature vectors in the t-th window. In this paper, the frame features comprise 12-dimensional MFCCs, energy, zero-crossing rate, fundamental frequency, and voice quality, plus their temporal differences, for 32 dimensions in total.
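The frame-stacking step can be written as a single reshape (a sketch with non-overlapping windows; handling of a trailing partial window is an assumption, here it is simply dropped):

```python
import numpy as np

def stack_segments(frames, w):
    """Stack w consecutive frame feature vectors into one segment
    feature s(t); trailing frames that do not fill a window are dropped."""
    T = frames.shape[0] // w
    return frames[: T * w].reshape(T, -1)

frames = np.arange(20.0).reshape(10, 2)  # 10 frames of a 2-dim feature
segs = stack_segments(frames, w=4)       # 2 segments of dimension 4 * 2 = 8
```

With the paper's 32-dimensional frame features, each segment feature s(t) would have 32·w dimensions.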

2. WRN Speech Emotion Recognition Subsystem

Figure 3 shows the block diagram of the WRN subsystem. The spectrogram of a sentence is cut into several segments along the time axis, and each spectrum segment is fed into the WRN to obtain a probability distribution over the emotion categories. Statistics computed over these segment outputs form the global features of the sentence, which are then fed into a softmax layer to obtain the sentence-level probability distribution over the emotion categories and, ultimately, the recognition result. In summary, the WRN subsystem consists of two parts: a WRN classifier that classifies spectrum segments and a softmax classifier that classifies the entire sentence. During training, all training samples are first cut into spectrum segments; each segment is assigned the emotion label of its corresponding sentence and used to train the WRN. The WRN outputs for the training segments are then grouped by their respective sentences and used to calculate the global features, so the softmax layer is trained sentence by sentence with the sentence's emotion label as the target.

Figure 3 WRN subsystem

The global features in this subsystem are calculated as follows. Taking sentence i as an example, and assuming the task distinguishes K emotions, each spectrum segment s input to the WRN yields a probability P_s(E_k) of belonging to the k-th emotion E_k. For each emotion, the following statistics are computed over U, the set of spectrum segments belonging to sentence i:

μ_k = (1/|U|) Σ_{s∈U} P_s(E_k)    (6)

m_k = min_{s∈U} P_s(E_k)    (7)

M_k = max_{s∈U} P_s(E_k)    (8)

r_k = |{s ∈ U : P_s(E_k) > 0.5}| / |U|    (9)

Here μ_k, m_k, M_k, and r_k are, respectively, the average probability, minimum probability, maximum probability, and the fraction of segments with probability greater than 0.5 for sentence i belonging to E_k. The global feature of sentence i is then the concatenation

g_i = [μ_1, m_1, M_1, r_1, …, μ_K, m_K, M_K, r_K].
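The sentence-level statistics described above (mean, minimum, maximum, and the fraction of segments above 0.5, per emotion) can be computed in a few NumPy lines (the function name and the concatenation order are illustrative):

```python
import numpy as np

def sentence_global_feature(P):
    """P: (n_segments, K) matrix whose row s holds the WRN output
    probabilities P_s(E_k) for one spectrum segment.  Returns the
    mean / min / max / rate-over-0.5 statistics for every emotion,
    concatenated into one global feature vector of length 4 * K."""
    stats = np.stack([
        P.mean(axis=0),
        P.min(axis=0),
        P.max(axis=0),
        (P > 0.5).mean(axis=0),   # fraction of segments with P_s(E_k) > 0.5
    ])                            # shape (4, K)
    return stats.T.ravel()        # [mu_1, min_1, max_1, r_1, ..., per emotion]

P = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2]])  # 3 segments, K = 3 emotions (toy values)
g = sentence_global_feature(P)
```

As with the RNN subsystem's mean pooling, these statistics convert a variable number of segment outputs into a fixed-length vector that the sentence-level softmax classifier can consume.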

3. Integrated neural network speech emotion recognition system

The ensemble system consists of two base classifiers and an ensemble layer (softmax), as shown in Figure 4. The outputs of both the RNN and WRN subsystems are probability distribution vectors over the emotion categories. To realize the ensemble, this paper adds the two vectors together as a new global variable. Specifically, given a training set {(x_i, y_i)}, i = 1, …, N, where x_i is a speech sample, y_i is the corresponding emotion label, and N is the number of samples, the RNN and WRN subsystems are trained separately. For sample i, the subsystems generate probability vectors denoted p_i^RNN and p_i^WRN, respectively, and their addition produces the new global variable:

p_i = p_i^RNN + p_i^WRN    (10)

In the ensemble layer, the softmax classifier is trained on these fused vectors, with the corresponding sentence labels as targets.

Figure 4 Integrated neural network speech emotion recognition system

During the testing phase, the test utterance enters both subsystems simultaneously, each generating a probability distribution vector. The fused global variable is then calculated by Eq. (10) and fed into the ensemble layer to produce the final emotion recognition result.
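The fusion and ensemble-layer classification at test time can be sketched as follows (the subsystem outputs and ensemble-layer weights below are toy values for illustration, not trained parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_predict(p_rnn, p_wrn, W, b):
    """Fuse the two subsystem outputs by element-wise addition, Eq. (10),
    then classify the fused vector with the ensemble softmax layer."""
    p = p_rnn + p_wrn              # new global variable
    return softmax(W @ p + b)

# hypothetical subsystem outputs for one test utterance (4 emotions)
p_rnn = np.array([0.6, 0.2, 0.1, 0.1])
p_wrn = np.array([0.5, 0.3, 0.1, 0.1])
W, b = np.eye(4), np.zeros(4)      # toy ensemble-layer weights
probs = ensemble_predict(p_rnn, p_wrn, W, b)
print(probs.argmax())  # 0
```

Additive fusion keeps a category strong only if at least one subsystem is confident in it, while the trained softmax layer on top can learn to reweight the two subsystems' contributions per category.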

Conclusion

For the task of speech emotion recognition, this paper designs and implements an ensemble system using a recurrent neural network (RNN) and a wide residual network (WRN) as base classifiers. This method aims to combine the advantages of deep neural networks with different architectures to improve the accuracy of speech emotion recognition. Specifically, the RNN is used to model sequence information and provide recognition results at the sentence level, while the WRN learns the feature representation of spectral segments and performs recognition at the speech segment level. Experiments demonstrate the effectiveness of this ensemble system compared to single-classifier speech emotion recognition systems, and also show that the WRN, introduced for the first time in this paper, performs comparably to mainstream RNN-based methods in this area. Somewhat regrettably, the performance improvement brought by the ensemble method in this experiment is not significant. This may be due to two reasons: firstly, the database used in the experiment suffers from data imbalance; and secondly, there are design issues with the ensemble method. Regarding these two points, future work will explore data augmentation methods to alleviate the problem of imbalanced dataset distribution, and will also try different ensemble methods to enhance the system's ability to model speech emotions.
