Abstract:
In recent years, Speech Emotion Recognition (SER) has become an essential tool for interpreting human emotions from auditory data. This research develops an SER system using deep learning and multiple datasets of emotive speech. The primary objective is to investigate the use of Convolutional Neural Networks (CNNs) for audio feature extraction. Data augmentation techniques including time stretching, pitch shifting, and noise injection are applied to improve the quality and variety of the training data. Feature extraction methods including Zero Crossing Rate (ZCR), chroma STFT, Mel-Frequency Cepstral Coefficients (MFCC), Root Mean Square (RMS) energy, and the Mel spectrogram transform the audio signals into numerical features on which the model is trained. The study concludes with a thorough evaluation of the model's performance: with this method, the model achieved an accuracy of 94.57% on the test dataset, and the proposed approach was further validated on the EMO-DB and IEMOCAP datasets. Promising directions for future work include further data augmentation, feature engineering, and hyperparameter optimization; following these development paths should allow SER systems to be deployed in real-world scenarios with greater accuracy and resilience.
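To make the pipeline described above concrete, the sketch below illustrates two of the named features (Zero Crossing Rate and RMS energy) and the noise-injection augmentation using plain NumPy on a synthetic tone. This is a minimal illustration under assumed parameters (frame length 2048, hop 512, noise factor 0.005); the paper's actual implementation, frame settings, and remaining features (chroma STFT, MFCC, Mel spectrogram) are not specified here and would typically be computed with an audio library such as librosa.

```python
import numpy as np

def add_noise(y, noise_factor=0.005):
    # Noise-injection augmentation: mix low-amplitude Gaussian noise
    # into the waveform (noise_factor is an assumed value).
    return y + noise_factor * np.random.randn(len(y))

def zero_crossing_rate(y, frame_length=2048, hop_length=512):
    # Fraction of sign changes within each analysis frame.
    zcr = []
    for start in range(0, len(y) - frame_length + 1, hop_length):
        frame = y[start:start + frame_length]
        crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        zcr.append(crossings / frame_length)
    return np.array(zcr)

def rms_energy(y, frame_length=2048, hop_length=512):
    # Root-mean-square energy of each analysis frame.
    rms = []
    for start in range(0, len(y) - frame_length + 1, hop_length):
        frame = y[start:start + frame_length]
        rms.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(rms)

# Synthetic one-second 440 Hz tone standing in for a speech sample.
sr = 22050
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)

# Augment, then stack per-frame features into one vector,
# as is done before feeding a CNN-based classifier.
y_aug = add_noise(y)
features = np.concatenate([zero_crossing_rate(y_aug), rms_energy(y_aug)])
print(features.shape)  # 40 frames per feature -> (80,)
```

In a full system, each feature would contribute one or more rows to a fixed-size feature matrix per utterance, which is what the CNN consumes.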
Key words: voice signal; emotion recognition; deep learning; CNN
DOI:10.11916/j.issn.1005-9113.2024005 |
CLC Number: TN18, TN912.3
Fund: