Abstract: The discrete emotion description model labels human emotions with discrete adjectives, and can therefore represent only a limited number of single, explicit emotions. The dimensional emotion model, by contrast, quantifies the implicit states of complex emotions along multiple continuous dimensions. In addition, the conventional speech emotion feature, the Mel Frequency Cepstral Coefficient (MFCC), neglects the correlation between the spectral features of adjacent frames because of frame-by-frame processing, and is therefore prone to losing useful information. To address this problem, this paper proposes an improved method that extracts the time firing series feature and the firing position information feature from the spectrogram to supplement the MFCC, and applies each of them to speech emotion estimation. From the predicted values, the proposed method calculates the correlation coefficients of each feature along the three dimensions P (pleasure-displeasure), A (arousal-nonarousal), and D (dominance-submissiveness), uses them as feature weights, obtains the final PAD values of the emotional speech through weighted fusion, and finally maps these values into the PAD 3D emotion space. Experiments showed that the two added features not only detect the speaker's emotional state but also capture the correlation between the spectral features of adjacent frames, complementing the MFCC. While improving discrete estimation of the basic emotion types, the method represents the estimation results as coordinate points in the PAD 3D emotion space, quantitatively reveals the positions of and relations among the various emotions in that space, and indicates the mixed emotional content of the emotional speech. This study lays a foundation for subsequent research on the classification and estimation of complex speech emotions.
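
To make the fusion step concrete, the sketch below combines per-feature PAD predictions using each feature's correlation coefficients as weights, following the description above. This is a minimal illustration only: the function name, the example values, and the per-dimension weight normalisation are assumptions, not the authors' exact procedure.

    # Minimal sketch of correlation-weighted PAD fusion (illustrative, not the
    # authors' exact implementation).
    import numpy as np

    def fuse_pad(predictions, weights):
        """Fuse per-feature PAD predictions into one (P, A, D) vector.

        predictions: dict feature name -> array of shape (3,), the predicted
                     (P, A, D) values for one utterance from that feature.
        weights:     dict feature name -> array of shape (3,), that feature's
                     correlation coefficients with the P, A and D annotations
                     (assumed non-negative here).
        """
        names = list(predictions)
        preds = np.stack([predictions[n] for n in names])  # (n_features, 3)
        w = np.stack([weights[n] for n in names])          # (n_features, 3)
        w = w / w.sum(axis=0, keepdims=True)                # normalise per dimension (assumption)
        return (w * preds).sum(axis=0)                      # weighted sum per dimension

    # Hypothetical example for one utterance: MFCC, time firing series and
    # firing position features each give a PAD prediction.
    pad = fuse_pad(
        predictions={"mfcc":        np.array([0.42, 0.30, 0.10]),
                     "time_firing": np.array([0.35, 0.25, 0.05]),
                     "firing_pos":  np.array([0.50, 0.20, 0.15])},
        weights={"mfcc":        np.array([0.62, 0.55, 0.48]),
                 "time_firing": np.array([0.58, 0.60, 0.40]),
                 "firing_pos":  np.array([0.54, 0.50, 0.52])},
    )
    print(pad)  # fused (P, A, D) coordinate in the 3D emotion space

The fused vector can then be plotted as a point in the PAD space, so that the distance between points gives a quantitative view of how the estimated emotion relates to the basic emotion categories.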