Speech Emotion Recognition Based on an ASGRU-CNN Spatiotemporal Dual-Channel Model
Speech emotion recognition is key to achieving natural human-computer interaction, and improving its accuracy remains a major challenge. To this end, a novel speech emotion recognition model called ASGRU-CNN is proposed. The overall framework consists of two parallel branches: the first is a spatial feature extraction module in which 3D convolution, 2D convolution, and pooling operations form a cascade structure; the second is a temporal feature extraction module composed of a sliced recurrent structure combined with an attention mechanism. The model takes fused prosodic and spectral features as input, and after processing by the two branches the features are passed to a fully connected layer for speech emotion classification. Experiments were conducted on the CASIA and EMO-DB databases as well as their augmented versions. Compared with other speech emotion recognition models, the proposed model shows better robustness and generalization.
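As a rough illustration of the dual-branch design described above, the following PyTorch sketch pairs a 3D/2D convolutional spatial branch with a sliced-GRU temporal branch that uses a simple learned attention weighting, then concatenates both outputs for classification by a fully connected layer. All layer sizes, the number of slices, the attention scoring form, the input shapes, and the class count are illustrative assumptions; the abstract does not specify the paper's exact configuration.

```python
# Minimal sketch of the ASGRU-CNN idea; hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialBranch(nn.Module):
    """3D-conv -> 2D-conv -> pooling cascade over a (planes, time, freq) input."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.pool3d = nn.MaxPool3d((1, 2, 2))
        self.conv2d = nn.Conv2d(8 * 3, 32, kernel_size=3, padding=1)  # assumes 3 feature planes
        self.pool2d = nn.AdaptiveAvgPool2d((4, 4))
        self.fc = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x):                      # x: (B, 1, 3, T, F)
        h = self.pool3d(F.relu(self.conv3d(x)))
        b, c, d, t, f = h.shape
        h = h.reshape(b, c * d, t, f)          # fold depth planes into channels for the 2D stage
        h = self.pool2d(F.relu(self.conv2d(h)))
        return F.relu(self.fc(h.flatten(1)))


class SlicedGRUAttention(nn.Module):
    """Slice the frame sequence, encode each slice with a GRU, then attend over slices."""
    def __init__(self, in_dim, hidden=64, n_slices=8, out_dim=128):
        super().__init__()
        self.n_slices = n_slices
        self.low_gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.high_gru = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)       # per-slice attention score
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                      # x: (B, T, in_dim), T divisible by n_slices
        b, t, d = x.shape
        slices = x.reshape(b * self.n_slices, t // self.n_slices, d)
        _, h_last = self.low_gru(slices)       # final hidden state of each slice
        slice_feats = h_last[-1].reshape(b, self.n_slices, -1)
        seq, _ = self.high_gru(slice_feats)    # higher-level recurrence over slice features
        weights = torch.softmax(self.attn(seq).squeeze(-1), dim=1)
        context = (weights.unsqueeze(-1) * seq).sum(dim=1)
        return F.relu(self.fc(context))


class ASGRUCNN(nn.Module):
    """Concatenate both branch outputs and classify with a fully connected layer."""
    def __init__(self, frame_dim=40, n_classes=6):
        super().__init__()
        self.spatial = SpatialBranch()
        self.temporal = SlicedGRUAttention(in_dim=frame_dim)
        self.classifier = nn.Linear(128 + 128, n_classes)

    def forward(self, spec_cube, frame_seq):
        fused = torch.cat([self.spatial(spec_cube), self.temporal(frame_seq)], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = ASGRUCNN()
    spec = torch.randn(2, 1, 3, 64, 40)        # (batch, 1, planes, time, freq)
    frames = torch.randn(2, 64, 40)            # (batch, time, frame-level features)
    print(model(spec, frames).shape)           # torch.Size([2, 6])
```

In this sketch the two branches see different views of the same utterance: a stacked spectrogram cube for the convolutional branch and a frame-level feature sequence for the sliced-GRU branch; how the paper actually feeds the fused prosodic and spectral features to each branch is not stated in the abstract.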