Speech Emotion Recognition with Multi-task Learning
In recent speech emotion recognition research, deep learning models are used to identify emotion from speech signals. However, traditional single-task learning models do not pay enough attention to the acoustic emotional information in speech, resulting in low emotion recognition accuracy. In view of this, this paper proposes a multi-task, end-to-end speech emotion recognition network that mines the acoustic emotion in speech to improve recognition accuracy. To avoid the information loss caused by using frequency-domain features, the model adopts Wav2vec2.0 as its backbone network to extract the acoustic and semantic features of speech, and an attention mechanism integrates the two kinds of features into self-supervised features. To make full use of the acoustic emotional information in speech, emotion-related phoneme recognition is used as an auxiliary task, and the multi-task learning model mines the acoustic emotion in the self-supervised features. Experimental results on the public IEMOCAP dataset show that the proposed multi-task learning model achieves a weighted accuracy of 76.0% and an unweighted accuracy of 76.9%, a significant improvement over the traditional single-task learning model. Ablation experiments further verify the effectiveness of the auxiliary task and the self-supervised network fine-tuning strategy.
deep learning; multi-task learning; speech emotion recognition; self-supervised model; fine-tuning strategy
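To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a multi-task model of this kind: a shared Wav2vec2.0 backbone feeding an emotion-classification head and an auxiliary phoneme-recognition head, with attention pooling over frame-level features. The checkpoint name, head sizes, pooling scheme, and auxiliary loss weight `alpha` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultiTaskSER(nn.Module):
    """Sketch of a multi-task SER model with a shared Wav2vec2.0 backbone."""
    def __init__(self, num_emotions=4, num_phonemes=40, alpha=0.3):
        super().__init__()
        # Shared self-supervised backbone; fine-tuned jointly with both heads.
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.backbone.config.hidden_size
        # Simple self-attention pooling over frames: one plausible way to fuse
        # frame-level features into an utterance-level vector.
        self.attn = nn.Linear(hidden, 1)
        self.emotion_head = nn.Linear(hidden, num_emotions)       # main task
        self.phoneme_head = nn.Linear(hidden, num_phonemes + 1)   # aux task (+1 for CTC blank)
        self.alpha = alpha  # weight of the auxiliary phoneme loss (assumed value)

    def forward(self, waveform):
        feats = self.backbone(waveform).last_hidden_state          # (B, T, H)
        weights = torch.softmax(self.attn(feats), dim=1)           # (B, T, 1)
        pooled = (weights * feats).sum(dim=1)                      # (B, H)
        emotion_logits = self.emotion_head(pooled)                 # (B, num_emotions)
        phoneme_logits = self.phoneme_head(feats)                  # (B, T, num_phonemes + 1)
        return emotion_logits, phoneme_logits

# Joint objective: cross-entropy on emotion plus CTC on the phoneme sequence.
model = MultiTaskSER()
wave = torch.randn(2, 16000)                                       # two 1 s utterances at 16 kHz
emo_logits, pho_logits = model(wave)
emo_loss = nn.functional.cross_entropy(emo_logits, torch.tensor([0, 2]))
log_probs = pho_logits.log_softmax(-1).transpose(0, 1)             # (T, B, C) layout for CTC
targets = torch.randint(1, 41, (2, 10))                            # dummy phoneme labels
ctc_loss = nn.functional.ctc_loss(
    log_probs, targets,
    input_lengths=torch.full((2,), log_probs.size(0)),
    target_lengths=torch.full((2,), 10),
)
loss = emo_loss + model.alpha * ctc_loss
```

In this reading of the abstract, the auxiliary phoneme task only shapes the shared representation during training; at inference time only the emotion head is used.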