Multimodal Emotion Recognition Method Based on Dynamic Time Sequence Modeling
Existing emotion recognition studies have not fully considered the local-global information and long-term temporal dependencies in speech signals, and text feature extraction also suffers from feature sparsity and information loss. To solve these problems, a multimodal emotion recognition method based on dynamic time sequence modeling is proposed. A dynamic time window module is designed to segment the speech signal and capture local-global information, and the spatial information in the signal is captured by bidirectional sequence modeling. Considering the importance of text information for emotion analysis, a convolutional neural network based on the Transformer model is used to capture longer contextual information by modeling the dependencies between different positions in the text; finally, the two modalities are fused to obtain the final emotion classification. Experimental results on the IEMOCAP dataset show that the model achieves better multimodal emotion recognition than other mainstream models.
multimodal sentiment analysis; dynamic time window; bidirectional time sequence modeling; convolutional neural networks; multimodal fusion
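As a rough illustration of the dynamic time window idea, a speech signal can be segmented into overlapping local windows whose features are then paired with a global summary of the whole signal. This is only a minimal sketch: the abstract does not specify the window policy, window length, or the features extracted, so the fixed window size, overlap ratio, and mean-energy features below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def segment_local_global(signal, win_len=160, overlap=0.5):
    """Illustrative local-global segmentation of a 1-D speech signal.

    Assumptions (not from the paper): fixed window length `win_len`,
    overlap ratio `overlap`, and mean-energy features per window.
    Returns (local_features, global_feature).
    """
    step = max(1, int(win_len * (1 - overlap)))
    starts = range(0, max(1, len(signal) - win_len + 1), step)
    windows = [signal[s:s + win_len] for s in starts]
    # Local view: per-window mean energy; global view: whole-signal mean energy.
    local = np.array([np.mean(w ** 2) for w in windows])
    global_feat = float(np.mean(signal ** 2))
    return local, global_feat

# Usage: a synthetic sinusoid standing in for a speech frame sequence.
sig = np.sin(np.linspace(0.0, 20.0 * np.pi, 1600))
local, g = segment_local_global(sig)
```

In a full system, each local window would be fed to the bidirectional sequence model, while the global summary provides signal-level context for fusion with the text branch.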