Image-Text Sentiment Classification Model Based on Multi-scale Cross-modal Feature Fusion
For the image-text sentiment classification task, a cross-modal feature fusion strategy that combines early fusion with a Transformer model is commonly used to fuse image and text features. However, this strategy tends to focus on the information unique to each single modality while ignoring the interconnections and common information among the modalities, which leads to unsatisfactory cross-modal feature fusion. To solve this problem, an image-text sentiment classification method based on multi-scale cross-modal feature fusion is proposed. On the one hand, at the local scale, local feature fusion is carried out with a cross-modal attention mechanism, so that the model not only attends to the information unique to the image and the text but also explores the connections and common information between them. On the other hand, at the global scale, global feature fusion based on the MLM loss enables the model to perform global modeling of the image and text data, further mining the relationship between them and thereby promoting the deep fusion of image and text features. Compared with ten baseline models on two public datasets, MVSA-Single and MVSA-Multiple, the proposed method shows clear advantages in accuracy, F1 score, and number of model parameters, verifying its effectiveness.
Keywords: image-text sentiment classification; cross-modal feature fusion; Transformer model; attention mechanism; MLM loss
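The abstract names two fusion mechanisms without giving their details: local-scale fusion via cross-modal attention, and global-scale fusion via an MLM loss. As a rough illustration of the local-scale idea only, the following is a minimal PyTorch sketch of bidirectional cross-modal attention, where each modality queries the other so the model can pick up shared information in addition to modality-specific features. The module names, dimensions, and mean-pooling are hypothetical and are not taken from the paper.

```python
# A minimal sketch of cross-modal attention fusion (assumption: PyTorch;
# all names and shapes are illustrative, not the paper's actual design).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One modality's features attend over the other modality's features;
    a residual connection keeps the modality-specific information."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats:   (batch, seq_q, dim) -- e.g. text tokens
        # context_feats: (batch, seq_k, dim) -- e.g. image patches
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual + layer norm

if __name__ == "__main__":
    dim = 256
    text = torch.randn(2, 32, dim)    # 32 text token features
    image = torch.randn(2, 49, dim)   # 7x7 grid of image patch features
    t2i = CrossModalAttention(dim)
    i2t = CrossModalAttention(dim)
    text_fused = t2i(text, image)     # text queries attend over image
    image_fused = i2t(image, text)    # image queries attend over text
    joint = torch.cat([text_fused.mean(1), image_fused.mean(1)], dim=-1)
    print(joint.shape)                # torch.Size([2, 512])
```

For the global scale, the abstract only states that an MLM loss drives the fusion. One common way such an objective is realized is to mask some text tokens and predict them from the fused image-text representation, so the prediction can only succeed if image information flows into the text features. The sketch below assumes that setup; the vocabulary size, head, and masking scheme are illustrative assumptions.

```python
# A minimal sketch of an MLM-style loss over fused text features
# (assumption: the masked tokens are predicted from cross-modally
# fused features; vocab size and head are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 30522, 256
mlm_head = nn.Linear(dim, vocab_size)  # hypothetical prediction head

def mlm_loss(fused_text_feats, token_ids, mask):
    # fused_text_feats: (batch, seq, dim) text features after fusion
    # token_ids:        (batch, seq) original token ids
    # mask:             (batch, seq) bool, True where a token was masked
    logits = mlm_head(fused_text_feats)            # (batch, seq, vocab)
    return F.cross_entropy(logits[mask], token_ids[mask])
```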