Multimodal Emotion Recognition Based on Hierarchical Fusion Strategy and Contextual Information Embedding
Existing fusion strategies often simply concatenate modal features, disregarding personalized fusion requirements based on the characteristics of each modality. In addition, considering the emotion of each utterance in isolation, without accounting for its emotional state within the surrounding context, can lead to errors in emotion recognition. To address these issues, this paper proposes a multimodal emotion recognition method based on a hierarchical fusion strategy and the embedding of contextual information. The method progressively integrates different modal features in a hierarchical manner, reducing noise interference from individual modalities and resolving inconsistencies in expression across modalities. It also leverages contextual information to analyze the emotional representation of each utterance within its context, enhancing overall emotion recognition performance. In the binary emotion classification task, the proposed method improves accuracy by 1.54% compared with the state-of-the-art (SOTA) model; in the multi-class emotion recognition task, it improves the F1 score by 2.79% compared with the SOTA model.
hierarchical fusion; noise interference; context information embedding
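To make the idea of hierarchical (layered) fusion concrete, the following is a minimal sketch of staged fusion of three modality feature vectors. The staged pairing (text and audio fused first, then combined with visual features), the feature dimensions, and the class name `HierarchicalFusion` are illustrative assumptions for exposition only, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Illustrative sketch: fuse modality features in stages rather than
    concatenating all of them at once. Dimensions are placeholders."""
    def __init__(self, d_text=100, d_audio=74, d_visual=35, d_hidden=128):
        super().__init__()
        # Stage 1: fuse text and audio features into a joint representation.
        self.text_audio = nn.Sequential(
            nn.Linear(d_text + d_audio, d_hidden), nn.ReLU())
        # Stage 2: fuse the stage-1 representation with visual features.
        self.with_visual = nn.Sequential(
            nn.Linear(d_hidden + d_visual, d_hidden), nn.ReLU())

    def forward(self, text, audio, visual):
        stage1 = self.text_audio(torch.cat([text, audio], dim=-1))
        stage2 = self.with_visual(torch.cat([stage1, visual], dim=-1))
        return stage2  # fused utterance representation for an emotion classifier

# Usage with a dummy batch of 8 utterances
fusion = HierarchicalFusion()
out = fusion(torch.randn(8, 100), torch.randn(8, 74), torch.randn(8, 35))
print(out.shape)  # torch.Size([8, 128])
```

Fusing modalities in stages lets each step operate on a smaller, more homogeneous input, which is one way the noise of an individual modality can be attenuated before it reaches the final representation.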