Abstract
To address the problem that traditional Mel Frequency Cepstral Coefficient (MFCC) features cannot fully represent the dynamic characteristics of speech, this paper introduces first-order and second-order differences on top of the static MFCC features to extract dynamic MFCC features, and constructs a hybrid model, TWM, which combines TIM-NET (Temporal-aware Bi-directional Multi-scale Network) with an improved WGAN-GP (Wasserstein Generative Adversarial Network with Gradient Penalty) and a multi-head attention mechanism. The multi-head attention mechanism not only helps prevent vanishing gradients but also allows deeper networks that capture long-range dependencies and learn from information at different time steps, improving the accuracy of the model; WGAN-GP addresses the problem of insufficient sample size by improving the quality of generated speech samples. Experimental results show that this method significantly improves the accuracy and robustness of speech emotion recognition on the RAVDESS and EMO-DB datasets.
0 Introduction
With the continuous development of technology, human-computer interaction is becoming increasingly mature. However, current machines are still unable to effectively recognize human emotions, making speech emotion recognition a hot research topic[1]. The main problems faced by this technology are: 1) effectively extracting emotional features from speech; 2) building an accurate emotion recognition model; 3) obtaining emotional speech data with sufficient diversity and complexity.
Early research on speech emotion recognition mainly relied on traditional acoustic features, which can be divided into prosodic features, spectrum-based features, and voice quality features[2]. Because these hand-crafted features have limited representational power, researchers began to explore more effective ones. Liu et al.[3] improved the Gammatone Frequency Cepstral Coefficient (GFCC) and proposed the VGFCC (Gammatone Frequency Cepstral Coefficient based on VMD) feature. Kumaran et al.[4] combined MFCC (Mel Frequency Cepstral Coefficient) and GFCC features and used a Convolutional Recurrent Neural Network (CRNN) for recognition. MFCC is a commonly used spectrum-based feature in speech emotion recognition. However, traditional MFCC features mainly reflect the static characteristics of the speech signal and cannot effectively capture its dynamic changes, which may affect the accuracy of emotion recognition. To address this issue, this paper combines MFCC features with their first-order and second-order derivatives[5] to capture both the static and dynamic properties of the speech signal, providing a more comprehensive feature representation and improving the performance of speech emotion recognition models.
In terms of building emotion recognition models, TIM-NET (Temporal-aware Bi-directional Multi-scale Network), an advanced network architecture, has demonstrated good performance in various emotion recognition tasks. TIM-NET adapts to emotional changes by fusing features from multiple time scales. However, when dealing with complex emotional data, TIM-NET still suffers from an insufficient ability to process long time series: its time-step limitation makes the modeling of long-term dependencies inadequate. To overcome this issue, this paper incorporates a multi-head attention mechanism into the TIM-NET architecture. The multi-head attention mechanism helps capture long-range dependencies and enables the construction of deeper networks, where each attention head focuses on a different aspect of the data. This significantly enhances the network's ability to handle complex relationships within the data.
In response to the lack of diversity and complexity in emotional data, this paper adopts an improved generative adversarial network, WGAN-GP (Wasserstein GAN with Gradient Penalty), to generate more diverse and complex data samples. WGAN-GP alleviates the mode collapse and training instability common in traditional GANs by introducing a gradient penalty mechanism, which constrains the gradient of the discriminator with respect to its inputs and makes the adversarial training between the generator and discriminator more stable. With this approach, the study generates diverse emotional speech data, effectively improving the model's generalization ability and recognition accuracy.
This study constructs a hybrid model (TWM: TIM-NET + WGAN-GP + Multi-Head Attention) based on the TIM-NET network, which integrates a multi-head attention mechanism and an improved WGAN-GP, and uses the improved MFCC features for speech emotion recognition. This model advances feature extraction, emotion modeling, and data generation, effectively improving the robustness and accuracy of speech emotion recognition.
1 TIM-NET Network Model
TIM-NET is capable of learning long-term emotional dependencies from both forward and backward directions, and capturing multi-scale features at the frame level. The TIM-NET network architecture is shown in Fig.1[6].
Fig.1 TIM-NET network structure diagram
Firstly, MFCC features are used as the network input: each speech signal is divided into 50 ms frames, a Hamming window is applied, a 2048-point Fourier transform is computed for each frame, and Mel-scale triangular filter bank analysis is then performed to extract the first 39 coefficients as low-frequency features. TIM-NET consists of several forward and backward Temporal Aware Blocks (TAB), each comprising two sub-blocks and a Sigmoid function σ. The Sigmoid function learns a Temporal Attention Map A and generates the Temporal Aware Feature F by element-wise multiplication with the input features. The sub-blocks of a TAB use Dilated Causal Convolutions (DC Conv) with causal constraints to expand the receptive field while avoiding information loss, followed by batch normalization, ReLU (Rectified Linear Unit) activation, and spatial dropout to enhance the model's nonlinearity and generalization ability[7]. Temporal modeling determines emotional categories by integrating complementary information from the past and future. The forward and backward processing are shown in Eqs. (1) and (2), respectively, combining bidirectional semantic dependency with discourse-level global context representation, where the initial features $\vec{F}_0 = \overleftarrow{F}_0$ are the output of the first 1×1 convolution layer and G denotes the global time pooling operation.
$$\vec{F}_j = A\left(\vec{F}_{j-1}\right), \quad j = 1, 2, \cdots, n \tag{1}$$
$$\overleftarrow{F}_j = A\left(\overleftarrow{F}_{j-1}\right), \quad j = 1, 2, \cdots, n \tag{2}$$
here, the forward feature representation $\vec{F}_j$ and backward feature representation $\overleftarrow{F}_j$ at the j-th TAB capture temporal information in the two directions, with n the total number of TABs. A(·) represents the transformation applied by a TAB, introducing non-linearity for complex pattern learning. j is the index that marks the sequence of operations or layers (for example, the j-th TAB).
The entire process is shown in Eq. (3) :
$$g_j = G\left(\vec{F}_j + \overleftarrow{F}_j\right) \tag{3}$$
where $g_j$ is the feature fusion result at index j, obtained by applying G to the sum of the forward and backward features. The global time pooling operation G averages over the time dimension and generates a representation vector for the specific receptive field of the j-th TAB. In order to adapt to temporal-scale variations in pronunciation habits, TIM-NET designs a multi-scale dynamic module. During training, the module selects the appropriate time scale based on the current input and fuses the features of the Dynamic Receptive Field (DRF) through weighted summation.
The DRF fusion gdrf is shown in Eq. (4) :
$$g_{\mathrm{drf}} = \sum_{j=1}^{n} w_j g_j \tag{4}$$
where $g_j$ is calculated using Eq. (3), each $g_j$ coming from a different TAB, and $w_j$ is the j-th trainable parameter of the dynamic receptive field weight vector $w_{\mathrm{drf}} = [w_1, w_2, \cdots, w_n]^T$. In other words, $w_{\mathrm{drf}}$ provides the trainable weight coefficients that determine the contribution of each $g_j$ to the final DRF fusion output $g_{\mathrm{drf}}$. Once the emotion representation $g_{\mathrm{drf}}$ has strong discriminability, a fully connected layer with a softmax function can be used for emotion classification.
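As a minimal NumPy sketch of Eqs. (3) and (4) (the paper's implementation uses TensorFlow/Keras; the fixed weight values below are illustrative stand-ins for trained parameters):

```python
import numpy as np

def global_time_pool(f_fwd, f_bwd):
    """Eq. (3): G averages the sum of the forward and backward
    temporal-aware features over the time axis, producing one
    pooled vector g_j per TAB pair."""
    return (f_fwd + f_bwd).mean(axis=0)

def drf_fusion(g_list, w_drf):
    """Eq. (4): the Dynamic Receptive Field output g_drf is the
    weighted sum of per-TAB vectors g_j, with trainable weights
    w_drf = [w_1, ..., w_n]."""
    g = np.stack(g_list)             # (n, d): one pooled vector per TAB
    w = np.asarray(w_drf)[:, None]   # (n, 1): broadcast weights
    return (w * g).sum(axis=0)       # (d,): fused representation

# toy example: 4 time steps, 3-dim features, 2 TABs with equal weights
f_fwd = np.ones((4, 3))
f_bwd = np.ones((4, 3))
g1 = global_time_pool(f_fwd, f_bwd)        # every entry is 2.0
g_drf = drf_fusion([g1, 3 * g1], [0.5, 0.5])
```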
Although TIM-NET handles the dynamic features of time series to a certain extent via temporal aware blocks and the multi-scale dynamic module, its feature fusion may not fully capture all time-step changes in very long sequences. This time-step limitation can cause information loss when TIM-NET processes speech data with long time spans, limiting its modeling of long-term dependencies and, in turn, its ability to capture all relevant emotional information in complex emotional data.
2 TWM Network Model with Improved MFCC Features
The TWM model in this study is built on the TIM-NET network, integrating a multi-head attention mechanism and an improved WGAN-GP, and additionally uses the first-order and second-order differences of the MFCC features. The network structure is shown in Fig.2.
Fig.2 TWM network model with improved MFCC features
Firstly, a generative adversarial network (WGAN-GP) is used for data augmentation. The generator produces synthetic emotional data samples, while the discriminator distinguishes real samples from generated ones. WGAN-GP introduces the Wasserstein distance to measure the difference between the generated and real sample distributions, and adds a gradient penalty term to the loss function to improve the quality of the generated samples and the stability of training. The generator and discriminator are optimized alternately during training, so that the generated samples gradually approach the distribution of the real samples. The generated samples are then used together with real samples to enlarge the training dataset and improve the robustness and generalization ability of the TIM-NET model. Subsequently, the real data containing MFCC and its Delta (Δ) and Delta-Delta (ΔΔ) features, together with the generated data, are input into the TIM-NET module. The TIM-NET module uses Dilated Causal Convolution (DC Conv) to extract temporal features: dilated convolution expands the receptive field of the kernel in the time dimension by introducing larger time-span intervals inside the kernel, thus better capturing long-term temporal dependencies. By stacking these dilated convolutional layers, TIM-NET extracts multi-scale temporal features and enhances the representational ability of the features. Afterwards, the temporal information and multi-scale features extracted by TIM-NET are fed into the multi-head attention mechanism, which processes different feature subspaces in parallel through multiple attention heads to capture complex relationships between features.
Each attention head weights and aggregates the input features to generate multi-dimensional weighted feature representations, enhancing the model's ability to model long-term dependencies and feature relationships and effectively addressing the time-step limitation[8]. After the multi-head attention output, a weight layer is added; it processes the features further by matrix-multiplying the attention output with a trainable weight matrix. The output of the weight layer is passed to a fully connected (dense) layer, and a softmax activation function produces the final emotion classification result.
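The weight layer and classification head described above can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation (which uses TensorFlow/Keras): the temporal mean-pooling step before the dense layer is an assumption, since the paper does not specify how the time axis is collapsed, and all matrices stand in for trained parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def twm_head(attn_out, W_weight, W_dense, b_dense):
    """Sketch of the TWM output stage: multiply the attention output
    by a trainable weight matrix (the 'weight layer'), pool over time
    (assumed detail), then apply a dense softmax classifier."""
    h = attn_out @ W_weight      # weight layer: (T, d) @ (d, d) -> (T, d)
    pooled = h.mean(axis=0)      # collapse the time axis (assumption)
    return softmax(pooled @ W_dense + b_dense)

# toy example: 10 time steps, 8-dim features, 7 emotion classes
rng = np.random.default_rng(0)
probs = twm_head(rng.normal(size=(10, 8)), rng.normal(size=(8, 8)),
                 rng.normal(size=(8, 7)), np.zeros(7))
```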
2.1 Extraction of Improved MFCC Feature
Because the human ear is more sensitive to the dynamic characteristics of sound, and MFCC only reflects the static features of the speech signal, the first-order and second-order derivatives are extracted after obtaining the MFCC features. The first-order derivative captures the relationship between the current speech frame and its neighboring frames, while the second-order derivative captures the rate of change of the first-order derivative. Combining the static MFCCs with their first- and second-order derivatives creates a more comprehensive representation of the speech signal, better capturing the characteristics of different emotional states.
This study improves the MFCC features by introducing the first-order differential Δ and second-order differential ΔΔ of MFCC to better capture the dynamic changes of speech signals. Specifically, the Δ feature captures the dynamic information of the speech signal by computing the first-order time derivative of the MFCC features, as shown in Eq. (5), while the ΔΔ (second-order differential) feature characterizes the rate of change of that gradient by computing the second-order time derivative, as shown in Eq. (6).
$$\Delta_t = \frac{\sum_{n=1}^{N} n\left(c_{t+n} - c_{t-n}\right)}{2\sum_{n=1}^{N} n^2} \tag{5}$$
$$\Delta\Delta_t = \frac{\sum_{n=1}^{N} n\left(\Delta_{t+n} - \Delta_{t-n}\right)}{2\sum_{n=1}^{N} n^2} \tag{6}$$
here, $\Delta_t$ is the Δ feature of the t-th frame, $\Delta\Delta_t$ is the ΔΔ feature of the t-th frame, $c_{t+n}$ is the MFCC feature of frame t+n, and N is the half-window size of the difference computation (usually 2 or 3 frames). The original MFCC features, the computed Δ (first-order differential) features, and the ΔΔ (second-order differential) features are concatenated along the feature dimension to form a feature matrix containing rich dynamic information. This enhances the expressiveness and accuracy of the features, thereby improving the performance of speech emotion recognition.
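A small NumPy sketch of Eqs. (5) and (6) and the concatenation step (edge-padding at the sequence boundaries is an assumed implementation detail; libraries such as librosa implement the same regression-based delta):

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based differential of Eqs. (5)-(6).
    feat: (frames, coeffs); edges are padded by repetition so that
    every frame has N neighbours on each side."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    T = feat.shape[0]
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        # c_{t+n} - c_{t-n}, vectorised over all frames t
        out += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return out / denom

def improved_mfcc(mfcc):
    """Concatenate static MFCC with Δ (Eq. 5) and ΔΔ (Eq. 6)."""
    d1 = delta(mfcc)       # first-order differential
    d2 = delta(d1)         # second-order differential of the Δ stream
    return np.concatenate([mfcc, d1, d2], axis=1)

# 50 frames of 39 static coefficients -> 117-dim dynamic features
rng = np.random.default_rng(0)
feats = improved_mfcc(rng.normal(size=(50, 39)))
```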
2.2 WGAN-GP
Traditional GAN training often suffers from instability and mode collapse, mainly because it uses the Jensen-Shannon divergence as the loss function for measuring the difference between the generated and real sample distributions. When the two distributions differ greatly, this measure can yield weak training signals and unstable training. WGAN addresses these issues by introducing the Wasserstein distance as the optimization objective. The Wasserstein distance provides a more stable metric, as it directly measures the minimum transport cost between the two distributions, and therefore yields meaningful gradients even when the generated and real distributions barely overlap, making training more stable. However, WGAN still faces challenges, especially unstable discriminator training: WGAN requires the discriminator to be a 1-Lipschitz function, meaning that its gradient must stay within a fixed bound at all data points[9]. If the discriminator's gradient is not controlled, training may become unstable and fail to converge. WGAN-GP improves WGAN by introducing a gradient penalty term that keeps the discriminator's gradient close to the 1-Lipschitz constraint at all data points, thereby preventing unstable discriminator training. The gradient penalty term is calculated as shown in Eq. (7).
$$L_{\mathrm{GP}} = \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x})\right\|_2 - 1\right)^2\right] \tag{7}$$
here, $\mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}$ denotes the expectation over samples drawn from the interpolation distribution $\mathbb{P}_{\hat{x}}$, and λ is the penalty coefficient. $\hat{x}$ is a sample interpolated between real and generated samples, i.e. $\hat{x} = \epsilon x_{\mathrm{real}} + (1 - \epsilon) x_{\mathrm{fake}}$, where $x_{\mathrm{real}}$ is a real sample, $x_{\mathrm{fake}}$ is a generated sample, and $\epsilon$ is drawn uniformly from [0, 1]. $\nabla_{\hat{x}} D(\hat{x})$ is the gradient of the discriminator D at $\hat{x}$, and $\|\cdot\|_2$ is the L2 norm, i.e. the magnitude of the gradient. The penalty keeps this magnitude close to 1, thereby maintaining 1-Lipschitz continuity.
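The gradient penalty of Eq. (7) can be sketched as follows. In the paper's TensorFlow setting the gradient ∇D would come from automatic differentiation; here a toy linear discriminator with an analytic gradient is used so the sketch stays self-contained, and λ = 10 is the value commonly used for WGAN-GP (an assumption, not stated in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(x_real, x_fake, disc_grad, lam=10.0):
    """Eq. (7): sample x_hat on straight lines between real and fake
    batches, then penalise (||grad D(x_hat)||_2 - 1)^2.  disc_grad
    returns the discriminator gradient at each row."""
    eps = rng.uniform(0.0, 1.0, size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake   # interpolated samples
    grads = disc_grad(x_hat)                      # (batch, dim)
    norms = np.linalg.norm(grads, axis=1)         # ||grad D(x_hat)||_2
    return lam * np.mean((norms - 1.0) ** 2)

# toy linear discriminator D(x) = x @ w has gradient w everywhere;
# ||w|| = 1 here, so the penalty is exactly zero
w = np.array([0.6, 0.8])
gp = gradient_penalty(rng.normal(size=(4, 2)), rng.normal(size=(4, 2)),
                      lambda x: np.tile(w, (x.shape[0], 1)))
```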
2.3 Multi-Head Attention Mechanism
The multi-head attention mechanism processes different feature subspaces in parallel through multiple attention heads to capture the complex relationships between features. Each attention head computes its own weighted sum, and the results are concatenated to generate the final output[8]. This mechanism better models the interactions and long-term dependencies between features. The calculation first applies linear transformations to the query, key, and value matrices Q, K, and V, and computes scaled dot-product attention for each head, as shown in Eq. (8).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{8}$$
here, $d_k$ is the dimension of the key vectors. Scaling by $\sqrt{d_k}$ keeps the input to the softmax function within a stable range, especially when the key dimension is large. In order to learn emotional feature information from different subspaces, this study uses multiple attention heads, each performing independent linear transformations on Q, K, and V to obtain multiple attention outputs. The output $\mathrm{head}_i$ of the i-th head is calculated as shown in Eq. (9).
$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \tag{9}$$
here, $W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_q}$, $W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, and $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$ are the trainable weight matrices of the i-th attention head, where $d_{\mathrm{model}}$ is the dimension of the model's feature space and $d_q$, $d_k$, $d_v$ are the dimensions of the query, key, and value spaces, respectively. $W_i^{Q}$ transforms the input Q into the query space of the i-th head, $W_i^{K}$ transforms K into the key space, and $W_i^{V}$ transforms V into the value space. These weights are learned during training to capture different aspects of feature relevance. Finally, the attention outputs of all heads are concatenated to obtain the final output of multi-head attention (MHA), as shown in Eq. (10).
$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \cdots, \mathrm{head}_h\right)W^{O} \tag{10}$$
The multi-head attention mechanism can capture different features of the input sequence from multiple subspaces, enhancing the network's ability to model long-range dependencies. Introducing the multi-head attention mechanism into the TIM-NET network effectively alleviates the time-step limitation the original network faces when processing long-term emotional data, thereby improving the overall performance of the model[10].
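Eqs. (8)-(10) can be written out directly in NumPy (a sketch with randomly initialized projection matrices standing in for trained weights; frameworks such as Keras provide this as a built-in layer):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Eq. (8): scaled dot-product attention."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Eqs. (9)-(10): per-head projections, attention per head,
    concatenation, and a final output projection W^O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# toy self-attention: 5 time steps, d_model = 8, 2 heads of width 4
rng = np.random.default_rng(1)
h, d_model, d_k = 2, 8, 4
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(5, d_model))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo)
```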
3 Experiment
3.1 Data Set
The experiments used two datasets, RAVDESS and EMO-DB. The RAVDESS dataset contains 1440 speech files covering 8 emotion categories: anger, neutral, calm, happy, sad, fearful, disgust, and surprise[11]. The EMO-DB dataset is a German emotional speech database recorded by the Technical University of Berlin, containing 535 utterances across 7 emotional categories: neutral, anger, fear, happiness, sadness, disgust, and boredom[12].
3.2 Parameter Setting and Evaluation Indicators
In this study, the model was built with the TensorFlow and Keras frameworks. The Adam optimizer was used, the activation function was ReLU, the dropout rate was 0.1, training ran for 200 epochs with a batch size of 64, and five-fold cross-validation was performed, with model performance reported as the mean of the five results. The experiments use the two most common evaluation metrics in speech emotion recognition (SER): weighted average recall (WAR) and unweighted average recall (UAR).
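The two metrics can be computed as follows (a sketch under the standard SER definitions: WAR is overall accuracy, i.e. recall weighted by class frequency, and UAR is the unweighted mean of per-class recalls):

```python
import numpy as np

def war_uar(y_true, y_pred):
    """WAR: fraction of all samples classified correctly (classes
    weighted by their frequency).  UAR: mean of per-class recalls,
    so every class counts equally regardless of its size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    war = np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    uar = np.mean(recalls)
    return war, uar

# imbalanced toy example: class 0 has 3 samples, class 1 has 1
war, uar = war_uar([0, 0, 0, 1], [0, 0, 1, 1])
```

On imbalanced corpora the two diverge: here WAR is 3/4 while UAR averages a 2/3 recall on class 0 with a perfect recall on class 1, giving 5/6.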
3.3 Experimental Results and Analysis
3.3.1 Experimental comparison of improving MFCC features
In order to verify the effectiveness of the improved MFCC features, this study constructed three feature sets and combined them with the TWM model to conduct ablation experiments on the RAVDESS and EMO-DB corpora[13]. The feature sets are as follows: S1: MFCC features only. S2: MFCC and its first-order differential features. S3: MFCC and its first-order and second-order differential features.
These feature sets are used to analyze the performance of the TWM model with different feature combinations in the speech emotion recognition task. The experimental results are shown in Table 1; the evaluation measures are UAR (%) / WAR (%).
Table 1 Performance of different feature sets on the RAVDESS and EMO-DB corpora
The experimental results show that as the dimensionality of the feature set increases, both the UAR and WAR of the model improve significantly. This indicates that introducing the first-order differential (Delta) and second-order differential (Delta-Delta) features better captures the dynamic information in speech signals, thereby improving the accuracy of speech emotion recognition. Among the three feature sets, S3 performed best, achieving 93.00% UAR and 93.15% WAR on the RAVDESS corpus and 93.92% UAR and 93.60% WAR on the EMO-DB corpus. This indicates that combining MFCC with its first-order and second-order differential features in the TWM model yields a more comprehensive representation of the speech signal and thus improves model performance.
3.3.2 Ablation experiment
To validate the effectiveness of the proposed TWM model, the improved MFCC features were used as inputs, and the EMO-DB and RAVDESS datasets were used for training and evaluation. WAR and UAR were used as metrics, and TIM-NET served as the baseline model to verify the impact of the different modules on performance. The experimental results are shown in Table 2; the evaluation measures are UAR (%) / WAR (%).
Table 2 Performance comparison of models on different datasets
From Table 2, it can be seen that the TWM model, with 13 M parameters, outperforms TIM-NET (2 M), TIM-NET + WGAN-GP (12 M), and TIM-NET + MHA (3 M) on both the RAVDESS and EMO-DB datasets, achieving the highest UAR and WAR on both. This result validates the effectiveness of integrating the multi-head attention mechanism (MHA) and the generative adversarial network (WGAN-GP) in the TWM model. Despite the increase in model size, the TWM model delivers significant performance improvements in emotion recognition over the TIM-NET baseline (2 M parameters). The inclusion of WGAN-GP and MHA not only enhanced performance but also increased the model's capacity for learning complex emotional features.
3.3.3 Model robustness verification
To verify the robustness of the TWM model, Gaussian noise was introduced during training and testing phases to evaluate the model's performance under data perturbations. The introduction of noise simulates real-world data interference, thereby validating the model's robustness and reliability. This study used Gaussian noise with a standard deviation of 0.01 to perturb the input data. The chosen standard deviation ensures that the noise impact remains within a reasonable range, avoiding excessive interference with the original signal[14]. The validation was performed on the RAVDESS and EMO-DB datasets, and the corresponding confusion matrices for various emotions are shown in Figs.3 and 4.
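The perturbation used in this experiment reduces to adding zero-mean Gaussian noise with the stated standard deviation to the input features, e.g.:

```python
import numpy as np

def add_gaussian_noise(features, std=0.01, seed=0):
    """Perturb input features with zero-mean Gaussian noise
    (std = 0.01, as in the robustness experiment).  The seeded
    generator is for reproducibility and is an assumed detail."""
    rng = np.random.default_rng(seed)
    return features + rng.normal(0.0, std, size=features.shape)

# perturb a zero feature vector: the result is pure N(0, 0.01) noise
noisy = add_gaussian_noise(np.zeros(10000), std=0.01, seed=0)
```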
As observed in the confusion matrix in Fig.3, without the addition of Gaussian noise, the TWM model achieves high classification accuracy across all categories. The model's performance is balanced and consistent, demonstrating its strong ability to classify emotions accurately. In contrast, Fig.4 shows the confusion matrices with Gaussian noise, where the TWM model still maintains high classification accuracy despite the data perturbations. This proves the robustness of the TWM model, confirming its effectiveness in handling noisy data and its superior performance under real-world conditions.
Fig.3 Emotion classification confusion matrices of the RAVDESS and EMO-DB datasets
Fig.4 Emotion classification confusion matrices of the RAVDESS and EMO-DB datasets with Gaussian noise
3.3.4 Compared to existing methods
Zhao et al.[15] and García-Ordás et al.[17] used Mel spectrograms and MFCC as audio descriptors and proposed fully convolutional neural network classifiers, achieving average accuracies of 75.28% on the RAVDESS dataset and 92.71% on the EMO-DB dataset. Jahangir et al.[16] used data augmentation and feature fusion combined with a CNN model, achieving accuracies of 90.60% and 93.30% on the RAVDESS and EMO-DB datasets, respectively. Zhang et al.[11] proposed a speech emotion recognition method based on a Dilated Residual Network (DRN) combined with an auxiliary classifier and channel-spatial attention fusion; the model achieved 92.91% accuracy on RAVDESS and 89.15% on EMO-DB, demonstrating its effectiveness and generalization ability. Zhou et al.[18] used MFCC as input features and designed dual-path temporal convolutional channels based on Temporal Convolutional Networks (TCN) and cross-gated mechanisms to extract multi-scale cross-fused features, achieving average accuracies of 87.32% on RAVDESS and 89.30% on EMO-DB. To verify the effectiveness and robustness of the proposed method, this paper compares the proposed model with the advanced speech emotion recognition methods above on the RAVDESS and EMO-DB datasets. As shown in Table 3, our method achieves high accuracy on both datasets.
Table 3 Comparative analysis of the proposed method with other methods
4 Conclusions
This paper proposes a speech emotion recognition model (TWM) based on improved MFCC features, a multi-head attention mechanism, and an improved WGAN-GP. Introducing the first- and second-order derivatives of the MFCC features significantly improves the ability to capture dynamic information in speech signals, enhancing the accuracy and robustness of emotion recognition. Built on the TIM-NET network, the TWM model not only alleviates the limitations of traditional models in processing long time series by combining the multi-head attention mechanism and WGAN-GP, but also improves generalization to diverse emotional data through data augmentation. Future research can explore applications to other emotion recognition tasks and combine more advanced feature extraction techniques and model optimization strategies to achieve higher recognition accuracy and stronger robustness.