Dual-Modal Emotion Recognition Method Combining Speech and Text Based on Deep Learning
Aiming at the problems of low emotion recognition accuracy in human-computer interaction and the failure to make full use of features from different modalities, a speech emotion recognition method based on deep-learning fusion of audio and text features was proposed. The STE-ER model was obtained by fusing the speech and text emotion recognition modules at the feature level. Experimental results on the public IEMOCAP dataset showed that using HuBERT to extract features in the SPEECH module improved the emotion recognition rate by 7.1% compared with the spectrogram method, and that using BERT in the TEXT module improved it by 5.1% compared with Word2Vec. Compared with either module alone, the emotion recognition accuracy was significantly improved after the SPEECH and TEXT modules were fused under different strategies, and the recognition rate of the feature-level-fusion STE-ER model was 5.2% higher than that of maximum-confidence decision-level fusion.
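To illustrate the feature-level fusion described above, the following is a minimal PyTorch sketch, not the authors' released STE-ER code: HuBERT utterance embeddings are concatenated with BERT sentence embeddings before a shared classifier. The checkpoint names, the mean/[CLS] pooling choices, and the four-class output (a common IEMOCAP setup) are assumptions.

```python
import torch
import torch.nn as nn
from transformers import HubertModel, BertModel

class SpeechTextFusionSketch(nn.Module):
    """Hypothetical sketch of feature-level speech-text fusion
    for emotion recognition (not the authors' STE-ER model)."""

    def __init__(self, num_emotions=4):  # 4 classes assumed for IEMOCAP
        super().__init__()
        # Pretrained encoders; these checkpoint names are assumptions
        self.speech_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = (self.speech_encoder.config.hidden_size
                     + self.text_encoder.config.hidden_size)
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, input_values, input_ids, attention_mask):
        # HuBERT: mean-pool frame-level features into one utterance vector
        speech_feat = self.speech_encoder(input_values).last_hidden_state.mean(dim=1)
        # BERT: use the pooled [CLS] output as the sentence vector
        text_feat = self.text_encoder(input_ids=input_ids,
                                      attention_mask=attention_mask).pooler_output
        # Feature-level fusion: concatenate the two modality vectors
        fused = torch.cat([speech_feat, text_feat], dim=-1)
        return self.classifier(fused)  # emotion-class logits
```

Concatenation before the classifier is what distinguishes this feature-level strategy from decision-level fusion, where each module would produce its own class probabilities and only the outputs (e.g., the maximum-confidence prediction) would be combined.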