A Multimodal Sentiment Analysis Model Enhanced with Non-verbal Information and Contrastive Learning
In recent years, deep learning methods have gained popularity in multimodal sentiment analysis because of their strong representation and fusion capabilities. Existing studies typically analyze an individual's emotions from multimodal information such as text, facial expressions, and speech intonation, relying mainly on complex fusion methods. However, these models give insufficient consideration to the dynamic changes of emotion over long time sequences, which leads to suboptimal sentiment analysis performance. To address this issue, this paper proposes a multimodal sentiment analysis model enhanced with non-verbal information and contrastive learning. First, the model uses long-term textual information to learn the dynamic changes of audio and video over extended time sequences. Next, a gating mechanism eliminates redundant information and semantic ambiguity between modalities. Finally, contrastive learning strengthens the interaction between modalities and improves the model's generalization. Experimental results show that the model improves the Pearson correlation coefficient (Corr) and F1 score by 3.7% and 2.1%, respectively, on the CMU-MOSI dataset, and by 1.4% and 1.1%, respectively, on the CMU-MOSEI dataset. The proposed model therefore exploits inter-modal interaction information effectively while eliminating information redundancy.
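The abstract does not provide implementation details, so the following is only a minimal sketch, assuming a PyTorch implementation, of how a text-conditioned gating step and an InfoNCE-style contrastive objective between modality representations could be combined. All names (GatedFusion, info_nce, text_feat, and the feature dimensions) are illustrative assumptions, not the authors' actual code.

```python
# Hypothetical sketch (assumed PyTorch); module names and shapes are illustrative,
# not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Gate a non-verbal (audio or video) feature with the text feature to
    suppress redundant or semantically ambiguous information before fusion."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feat: torch.Tensor, nonverbal_feat: torch.Tensor) -> torch.Tensor:
        # Gate values in (0, 1) decide how much non-verbal information to keep.
        g = torch.sigmoid(self.gate(torch.cat([text_feat, nonverbal_feat], dim=-1)))
        return text_feat + g * nonverbal_feat


def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss: matching (text, non-verbal) pairs in a
    batch are positives, every other pairing serves as a negative."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, dim = 8, 128
    text_feat = torch.randn(batch, dim)   # e.g. pooled long-sequence text features
    audio_feat = torch.randn(batch, dim)
    video_feat = torch.randn(batch, dim)

    fuse_audio = GatedFusion(dim)
    fuse_video = GatedFusion(dim)
    fused = fuse_audio(text_feat, audio_feat) + fuse_video(text_feat, video_feat)

    loss = info_nce(text_feat, audio_feat) + info_nce(text_feat, video_feat)
    print(fused.shape, loss.item())
```

In this sketch the contrastive term would be added to the main regression or classification loss; the actual gating formulation, fusion order, and loss weighting used by the paper may differ.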