

Cross-Modal Multi-level Fusion Sentiment Analysis Method Based on Visual Language Model

Image-text multimodal sentiment analysis aims to predict sentiment polarity by fusing the visual and textual modalities; obtaining high-quality representations of both modalities and fusing them efficiently is one of the key steps in this task. Therefore, a cross-modal multi-level fusion sentiment analysis method based on a visual language model (MFVL) is proposed. First, a pre-trained visual language model is used with its parameters frozen, and low-rank adaptation (LoRA) is applied to fine-tune the large language model, producing high-quality modality representations and modality-bridge representations. Then, a cross-modal multi-head co-attention fusion module is designed to perform interactive weighted fusion of the visual and textual modality representations. Finally, a mixture-of-experts fusion module is designed to deeply fuse the visual and textual modality representations together with the modality-bridge representations for multimodal sentiment analysis. Experimental results show that MFVL achieves state-of-the-art performance on the public benchmark datasets MVSA-Single and HFM.
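The pipeline described in the abstract (frozen backbone with a low-rank update, cross-modal multi-head co-attention between the two modalities, then mixture-of-experts fusion of the pooled representations plus a bridge representation) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the mean pooling, the random linear experts, and the stand-in feature vectors are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_co_attention(q_seq, kv_seq, n_heads=4):
    """Cross-modal attention: queries from one modality, keys/values from the other."""
    d = q_seq.shape[1]
    d_h = d // n_heads
    out = np.empty_like(q_seq)
    for h in range(n_heads):
        s = slice(h * d_h, (h + 1) * d_h)
        scores = q_seq[:, s] @ kv_seq[:, s].T / np.sqrt(d_h)   # (L_q, L_kv)
        out[:, s] = softmax(scores) @ kv_seq[:, s]             # weighted fusion
    return out

def moe_fuse(x, gate_w, expert_ws):
    """Mixture of experts: a softmax gate weights each expert's linear output."""
    gates = softmax(x @ gate_w)                   # (E,)
    outs = np.stack([x @ w for w in expert_ws])   # (E, d_out)
    return gates @ outs                           # (d_out,)

rng = np.random.default_rng(0)
d, n_classes, n_experts = 16, 3, 4
text = rng.normal(size=(5, d))   # stand-in for 5 text-token representations
img = rng.normal(size=(7, d))    # stand-in for 7 visual-patch representations
bridge = rng.normal(size=d)      # stand-in for the modality-bridge representation

# LoRA idea: frozen weight W plus a trainable low-rank update B @ A (rank r << d)
W = rng.normal(size=(d, d))          # frozen backbone weight
A = rng.normal(size=(2, d)) * 0.01   # r = 2
B = np.zeros((d, 2))                 # zero init: update starts as a no-op
W_eff = W + B @ A                    # equals W before any training step

t2v = multi_head_co_attention(text, img).mean(axis=0)  # text attends to image
v2t = multi_head_co_attention(img, text).mean(axis=0)  # image attends to text
fused = np.concatenate([t2v, v2t, bridge])             # (3d,)

gate_w = rng.normal(size=(3 * d, n_experts))
expert_ws = [rng.normal(size=(3 * d, n_classes)) for _ in range(n_experts)]
probs = softmax(moe_fuse(fused, gate_w, expert_ws))    # 3-class sentiment probabilities
```

The co-attention step is symmetric (each modality queries the other), matching the "interactive weighted fusion" wording, and the MoE gate lets the model weight experts differently per input.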

Visual Language Model; Multimodal Fusion; Multi-head Attention; Mixture of Experts Network; Sentiment Analysis

Xie Runfeng, Zhang Bochao, Du Yongping


Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China


Funding: National Key Research and Development Program of China (2023YFB3308004); National Natural Science Foundation of China (92267107)

2024

Pattern Recognition and Artificial Intelligence
Chinese Association of Automation; National Research Center for Intelligent Computing Systems; Institute of Intelligent Machines, Chinese Academy of Sciences


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.954
ISSN: 1003-6059
Year, volume (issue): 2024, 37(5)