Multimodal sentiment analysis based on attention mechanism and contrastive learning
To address the inadequate integration of information across modalities and the limited modeling of temporal dependencies in existing multimodal sentiment analysis models, a model incorporating cross-modal attention, global self-attention, and contrastive learning was proposed to deepen sentiment analysis. Specifically, features from the speech, text, and image modalities were independently extracted and mapped into a unified vector space. Inter-modal information was then modeled and fused using both cross-attention and global self-attention mechanisms. Meanwhile, contrastive learning tasks based on data, labels, and temporal order were introduced to enhance the model's understanding of multimodal feature variability. Experimental evaluations on two publicly available datasets, CMU-MOSI and CMU-MOSEI, show that the proposed model improves binary classification accuracy by 1.2 and 1.6 percentage points and F1 score by 1.0 and 1.6 percentage points, respectively, compared with the modality-invariant and -specific representations (MISA) model.
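To illustrate the pipeline summarized above, the minimal PyTorch sketch below projects the three modality streams into a shared space, fuses them with cross-modal attention followed by global self-attention, and adds an InfoNCE-style contrastive objective. All module names, feature dimensions, the text-centred pairing of modalities, and the single contrastive loss are assumptions made for illustration; the paper's exact architecture and its separate data-, label-, and time-based contrastive tasks are not reproduced here.

```python
# Illustrative sketch only; layer sizes and modality pairing are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_vision=47, d_model=128, n_heads=4):
        super().__init__()
        # Map each modality into a unified vector space of (assumed) dimension d_model.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_vision, d_model)
        # Cross-modal attention: text queries attend to audio and vision keys/values.
        self.cross_ta = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_tv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Global self-attention over the concatenated fused sequence.
        self.global_attn = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.head = nn.Linear(d_model, 1)  # sentiment score for binary classification

    def forward(self, text, audio, vision):
        # Inputs: (batch, seq_len, feature_dim) per modality.
        t, a, v = self.proj_t(text), self.proj_a(audio), self.proj_v(vision)
        # Cross-attention injects audio and vision information into the text stream.
        t_a, _ = self.cross_ta(t, a, a)
        t_v, _ = self.cross_tv(t, v, v)
        # Global self-attention models interactions across the whole fused sequence.
        fused = self.global_attn(torch.cat([t_a, t_v], dim=1))
        pooled = fused.mean(dim=1)  # temporal pooling
        return self.head(pooled), pooled


def info_nce(anchor, positive, temperature=0.07):
    """Illustrative InfoNCE contrastive loss with in-batch negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```

In this sketch the contrastive loss would be computed between pooled representations of matching samples (for example, augmented views or same-label pairs) and added to the sentiment prediction loss; the paper defines three such tasks over data, labels, and temporal order.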