Multimodal speech emotion recognition based on text feature energy encoding
Energy is an important characteristic of emotional expression. Different words carry different energy values when spoken, reflecting the speaker's emotional state. When speech is transcribed into text, the energy expressed by each word is not retained, so this information is lost when text features are extracted. Therefore, for the text modality, this paper proposes an energy encoding scheme that adds the energy value of each word and each pause in the speech signal to the transcribed text, so that the text features contain energy information; discourse-level text features are then obtained through the DC-BERT model. For the speech modality, the openSMILE toolkit is used to extract shallow acoustic features, and a random forest (RF) algorithm selects the 1000 features with the highest importance for emotion as a new feature set. Deep features are extracted from this new feature set through a Transformer encoder network, and the shallow and deep features are fused to form multi-level speech emotion features. Finally, a bidirectional long short-term memory network with attention (BiLSTM-ATT), based on a self-attention mechanism, is used to classify emotions. Experiments show that the weighted accuracy of the proposed method on the IEMOCAP dataset reaches 76.49%.
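To make the energy encoding idea concrete, the sketch below computes per-word RMS energy from the waveform and interleaves quantized energy tokens with the transcript. The abstract does not specify the exact token format; the word-level timestamps (e.g., from forced alignment), the 50 ms pause threshold, and the `[E*]`/`[PAUSE_E*]` token names are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def rms_energy(signal, start, end, sr=16000):
    """RMS energy of the waveform between two timestamps (in seconds)."""
    seg = signal[int(start * sr):int(end * sr)]
    return float(np.sqrt(np.mean(seg ** 2))) if len(seg) else 0.0

def encode_energy(signal, words, sr=16000, n_levels=10):
    """Interleave quantized energy tokens with the transcribed words.

    `words` is a list of (word, start, end) tuples, e.g. from forced
    alignment; gaps between consecutive words are treated as pauses
    and receive their own energy token.
    """
    def quantize(en, hi):
        return min(int(n_levels * en / hi), n_levels - 1)

    energies = [rms_energy(signal, s, e, sr) for _, s, e in words]
    hi = max(energies) or 1.0  # avoid division by zero on silence
    tokens, prev_end = [], 0.0
    for (word, start, end), en in zip(words, energies):
        if start - prev_end > 0.05:  # pause longer than 50 ms (assumed)
            p = rms_energy(signal, prev_end, start, sr)
            tokens.append(f"[PAUSE_E{quantize(p, hi)}]")
        tokens += [word, f"[E{quantize(en, hi)}]"]
        prev_end = end
    return " ".join(tokens)

# Toy example: a 1-second synthetic signal and a hand-made alignment
sig = np.random.randn(16000).astype(np.float32)
words = [("i", 0.0, 0.2), ("am", 0.3, 0.5), ("fine", 0.6, 1.0)]
print(encode_energy(sig, words))
```

The energy-augmented string can then be tokenized and fed to the text encoder in place of the plain transcript.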
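The RF-based feature selection step can be sketched with scikit-learn as below. The 6373-dimensional ComParE functional set and the forest size are assumptions for illustration; only the top-1000 selection follows directly from the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: openSMILE feature matrix (n_utterances x n_features); y: emotion
# labels. Synthetic stand-ins are used here for a runnable example.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6373))   # 6373 = ComParE set size (assumed)
y = rng.integers(0, 4, size=500)   # 4 emotion classes, as in IEMOCAP

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Keep the 1000 features with the highest impurity-based importance
top = np.argsort(rf.feature_importances_)[::-1][:1000]
X_selected = X[:, top]
print(X_selected.shape)  # (500, 1000)
```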
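One common reading of a BiLSTM-ATT classifier is a BiLSTM whose hidden states are pooled over time by a learned attention layer before a softmax classifier. The PyTorch sketch below follows that reading under assumed layer sizes; it is not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BiLSTMAtt(nn.Module):
    """BiLSTM followed by attention pooling over time, then a classifier.
    A generic BiLSTM-ATT head; all dimensions are illustrative."""
    def __init__(self, input_dim, hidden_dim=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.att = nn.Linear(2 * hidden_dim, 1)   # attention scorer
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):                 # x: (batch, time, input_dim)
        h, _ = self.lstm(x)               # (batch, time, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)  # weights over time steps
        context = (w * h).sum(dim=1)      # weighted sum -> (batch, 2*hidden)
        return self.fc(context)           # emotion logits

model = BiLSTMAtt(input_dim=256)
logits = model(torch.randn(8, 50, 256))  # 8 utterances, 50 frames each
print(logits.shape)                      # torch.Size([8, 4])
```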