Deep Learning-Based Speech Emotion Recognition: Leveraging Diverse Datasets and Augmentation Techniques for Robust Modeling
Author Name | Affiliation | Postcode
Ayush Porwal* | Department of Electronics and Instrumentation Engineering, Shri GS Institute of Technology and Science | 452001
Praveen Kumar Tyagi | Department of Electronics and Communication Engineering, Maulana Azad National Institute of Technology
Ajay Sharma | School of Computing Science and Engineering, VIT Bhopal University, Kothrikalan, Sehore
Dheeraj Kumar Agarwal | Department of Electronics and Communication Engineering, Maulana Azad National Institute of Technology
Abstract:
In recent years, Speech Emotion Recognition (SER) has developed into an essential instrument for interpreting human emotions from auditory data. This research develops an SER system using deep learning and multiple datasets of emotive speech, with the primary objective of investigating Convolutional Neural Networks (CNNs) for sound feature extraction. Data augmentation techniques, including time stretching, pitch manipulation, and noise injection, are applied to improve data quality. Feature extraction methods, including Zero Crossing Rate, Chroma_stft, MFCC, RMS, and Mel Spectrogram, transform the audio signals into recognizable features on which the model is trained, and the study concludes with a thorough evaluation of the model's performance. With this method, the model achieved an accuracy of 94.57% on the test dataset, and the proposed work was also validated on the EMO-DB and IEMOCAP datasets. Future development paths include further data augmentation, feature engineering, and hyperparameter optimization; pursuing them will allow SER systems to be deployed in real-world scenarios with greater accuracy and resilience.
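The augmentation and feature extraction steps named in the abstract map naturally onto the librosa library (Chroma_stft and Mel Spectrogram are librosa feature names). The sketch below rests on that assumption; the parameter values (noise amplitude, stretch rate, pitch steps, number of MFCCs) and the time-averaging of each feature are illustrative guesses, not the authors' exact pipeline.

    import numpy as np
    import librosa

    def augment(y, sr):
        """Return augmented variants of a waveform (parameter values are assumptions)."""
        noisy = y + 0.005 * np.random.randn(len(y))                 # noise injection
        stretched = librosa.effects.time_stretch(y, rate=0.9)      # time stretching
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2) # pitch manipulation
        return [noisy, stretched, shifted]

    def extract_features(y, sr):
        """Concatenate the five features named in the abstract, each averaged over time."""
        zcr = np.mean(librosa.feature.zero_crossing_rate(y=y), axis=1)       # Zero Crossing Rate
        chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)    # Chroma_stft
        mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)  # MFCC
        rms = np.mean(librosa.feature.rms(y=y), axis=1)                      # RMS
        mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)    # Mel Spectrogram
        return np.concatenate([zcr, chroma, mfcc, rms, mel])

Each augmented variant would typically inherit the emotion label of its source clip, multiplying the effective size of the training set. A compact 1-D CNN over the resulting feature vectors could then look like the following sketch; Keras is an assumption, and the layer sizes and hyperparameters are placeholders, since the abstract does not specify the architecture.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(n_features, n_classes):
        """Hypothetical 1-D CNN classifier over pooled feature vectors."""
        model = models.Sequential([
            layers.Input(shape=(n_features, 1)),
            layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
            layers.MaxPooling1D(pool_size=2),
            layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
            layers.GlobalAveragePooling1D(),
            layers.Dropout(0.3),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model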
Key words:  voice signal  emotion recognition  deep learning  CNN
DOI:10.11916/j.issn.1005-9113.2024005
CLC Number: TN18, TN912.3
Fund:
