Cross-modal Information Filtering-based Networks for Visual Question Answering

Visual question answering (VQA) is a multi-modal task whose bottleneck is the fusion of different modalities. It requires not only a full understanding of the visual content and text in an image, but also the ability to align cross-modal representations. The attention mechanism offers an effective path to multi-modal fusion. However, previous methods usually feed the extracted image features directly into the attention computation, ignoring the noise and incorrect information those features contain, and most are limited to shallow interaction between modalities without considering the deep semantic information between them. To address these problems, a cross-modal information filtering network (CIFN) is proposed. First, with the question features as the supervision signal, a purpose-built information filtering module filters the image feature information so that it better fits the question representation. The image features and question features are then sent to a cross-modal interaction layer, where self-attention and guided attention model the intra-modal and inter-modal relationships respectively, yielding finer-grained multi-modal features. Extensive experiments on the VQA2.0 dataset show that introducing the information filtering module effectively improves model accuracy: overall accuracy on test-std reaches 71.51%, which compares favorably with most state-of-the-art methods.
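The two stages named in the abstract (question-guided filtering of image features, followed by self-attention and guided attention) can be illustrated with a minimal sketch. This is not the authors' CIFN implementation: the abstract does not specify the form of the filtering module, so the sigmoid gate below, the mean pooling of word features, the absence of learned projections, and all function names (`filter_regions`, `attend`) are assumptions made purely for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def filter_regions(regions, question_vec):
    """Assumed filter: scale each region feature by a sigmoid gate
    computed from its scaled dot product with the question vector."""
    scale = math.sqrt(len(question_vec))
    gates = [1.0 / (1.0 + math.exp(-dot(r, question_vec) / scale))
             for r in regions]
    filtered = [[g * x for x in r] for g, r in zip(gates, regions)]
    return filtered, gates

def attend(queries, keys_values):
    """Scaled dot-product attention with no learned projections;
    keys and values are the same vectors."""
    scale = math.sqrt(len(keys_values[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * k[i] for w, k in zip(weights, keys_values))
                    for i in range(len(q))])
    return out

# Toy inputs (hypothetical): two question-word features, three image regions.
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
question_vec = [sum(col) / len(words) for col in zip(*words)]  # mean pooling
regions = [[2.0, 2.0, 0.0, 0.0],    # aligned with the question
           [-2.0, -2.0, 0.0, 0.0],  # opposed to the question
           [0.0, 0.0, 1.0, 1.0]]    # orthogonal to the question

filtered, gates = filter_regions(regions, question_vec)
self_attended = attend(filtered, filtered)  # intra-modal (self-attention)
guided = attend(self_attended, words)       # inter-modal (question-guided)
```

In this toy setup the gate passes most of the question-aligned region, halves the orthogonal one, and suppresses the opposed one before any attention is computed, which is the intuition behind filtering ahead of the interaction layer. A real model would use learned projections, multiple heads, and stacked layers.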

Keywords: Visual question answering; Deep learning; Attention mechanism; Multi-modal fusion; Information filtering

He Shiyang (何世阳), Wang Zhaohui (王朝晖), Gong Shengrong (龚声蓉), Zhong Shan (钟珊)

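School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215008

Soochow College, Soochow University, Suzhou, Jiangsu 215006

School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, Jiangsu 215500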

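Funding: National Natural Science Foundation of China (61972059, 42071438); Natural Science Foundation of Jiangsu Province (BK20191474, BK20191475); Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172021K01)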

2024

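Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)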

Indexed in: CSTPCD, PKU Core (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(5)