Cross-Modal Multi-level Fusion Sentiment Analysis Method Based on Visual Language Model
Image-text multimodal sentiment analysis aims to predict sentiment polarity by jointly exploiting the visual and textual modalities. The key to this task is obtaining high-quality multimodal representations of both modalities and fusing them efficiently. To this end, a cross-modal multi-level fusion sentiment analysis method based on a visual language model (MFVL) is proposed. First, building on a pre-trained visual language model, high-quality multimodal representations and modality-bridging representations are generated by freezing the pre-trained parameters and fine-tuning the large language model with low-rank adaptation (LoRA). Second, a cross-modal multi-head co-attention fusion module is designed to perform interactive weighted fusion of the visual and textual modality representations. Finally, a mixture-of-experts module is designed to deeply fuse the visual, textual, and modality-bridging representations for multimodal sentiment analysis. Experimental results show that MFVL achieves state-of-the-art performance on the public benchmark datasets MVSA-Single and HFM.
Visual Language Model; Multimodal Fusion; Multi-head Attention; Mixture of Experts Network; Sentiment Analysis
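Since only the abstract is shown here, the following PyTorch sketch illustrates just the general shape of the two fusion stages it names: a bidirectional multi-head co-attention step in which each modality attends to the other, followed by a small softmax-gated mixture-of-experts layer over the pooled visual, textual, and modality-bridging representations. All module names, dimensions, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- module names, dimensions, and gating are
# assumptions; MFVL's actual architecture is defined in the paper body.
import torch
import torch.nn as nn

class CrossModalCoAttention(nn.Module):
    """Bidirectional multi-head co-attention: each modality attends to the other."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # Text queries attend over image tokens, and vice versa.
        t_fused, _ = self.txt2img(text, image, image)
        v_fused, _ = self.img2txt(image, text, text)
        return t_fused, v_fused

class MoEFusion(nn.Module):
    """Softmax-gated mixture of experts over concatenated modality representations."""
    def __init__(self, dim: int = 768, num_experts: int = 4, num_classes: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU())
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(3 * dim, num_experts)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_rep, image_rep, bridge_rep):
        x = torch.cat([text_rep, image_rep, bridge_rep], dim=-1)      # (B, 3D)
        weights = torch.softmax(self.gate(x), dim=-1)                 # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1) # (B, E, D)
        fused = (weights.unsqueeze(-1) * expert_out).sum(dim=1)       # (B, D)
        return self.classifier(fused)

# Toy usage: batch of 2, 16 text tokens, 49 image patches, hidden size 768.
coattn, moe = CrossModalCoAttention(), MoEFusion()
text, image = torch.randn(2, 16, 768), torch.randn(2, 49, 768)
bridge = torch.randn(2, 768)  # stand-in for the modality-bridging representation
t_fused, v_fused = coattn(text, image)
logits = moe(t_fused.mean(dim=1), v_fused.mean(dim=1), bridge)
print(logits.shape)  # torch.Size([2, 3])
```

Mean pooling over token dimensions before the mixture-of-experts layer is one simple design choice; the paper may well pool or gate differently.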