
Cross-modal Fusion Visual Question Answering Method Based on Visual-Language Feature Encoding

Existing visual question answering (VQA) methods use the same encoder to encode both visual and language features, ignoring the differences between the two modalities. This introduces question-irrelevant visual noise when encoding visual features and leaves key visual features under-attended. To address this problem, a cross-modal fusion VQA method based on visual-language feature encoding is proposed. A dynamic attention mechanism encodes the visual features so that their attention span is adjusted dynamically according to the question, and a guided attention mechanism with dual gating is designed to filter out the interfering information introduced during multimodal fusion, improving the quality of multimodal feature fusion and strengthening the representational power of the fused multimodal features. On the Test-dev and Test-std splits of the public VQA-2.0 dataset, the method reaches accuracies of 71.73% and 71.94%, respectively, improvements of 1.10 and 1.04 percentage points over the baseline method. The proposed method improves answer-prediction accuracy on visual question answering tasks.
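The abstract describes the dual-gated guided-attention fusion step only at a high level, so the paper's exact formulation is not given here. Below is a minimal PyTorch sketch, under my own assumptions, of what such a layer could look like: visual features cross-attend to question features, and two learned sigmoid gates (one conditioned on the visual query, one on the attended question context) jointly filter the attended information before the residual connection. All module and variable names (DualGatedGuidedAttention, gate_v, gate_c) are illustrative, not from the paper.

import torch
import torch.nn as nn

class DualGatedGuidedAttention(nn.Module):
    """Illustrative guided attention with dual gating (assumed design,
    not the authors' exact architecture).

    Visual features attend to question features; two sigmoid gates
    jointly filter the attended context so that question-irrelevant
    interference is suppressed before the residual connection.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate_v = nn.Linear(dim, dim)  # gate conditioned on the visual query
        self.gate_c = nn.Linear(dim, dim)  # gate conditioned on the attended context
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, ques: torch.Tensor) -> torch.Tensor:
        # vis:  (batch, num_regions, dim) region-level visual features
        # ques: (batch, num_tokens,  dim) question token features
        ctx, _ = self.attn(query=vis, key=ques, value=ques)
        # Dual gates: elementwise values in (0, 1); small values suppress
        # interfering context, large values pass it through.
        gate = torch.sigmoid(self.gate_v(vis)) * torch.sigmoid(self.gate_c(ctx))
        return self.norm(vis + gate * ctx)

# Usage sketch with random features: 36 image regions, 14 question tokens.
layer = DualGatedGuidedAttention(dim=512, num_heads=8)
fused = layer(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
print(fused.shape)  # torch.Size([2, 36, 512])

Multiplying the two gates means attended context passes through only when both the visual query and the attended question context mark it as relevant, which is one plausible reading of dual-gated filtering; the dynamic-attention visual encoder is not sketched because the abstract gives too little detail to reconstruct it.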

Visual question answering; Attention mechanism; Multimodal fusion

Liu Runzhi, Chen Niannian, Zeng Fang


School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621010, China


2024

Journal of Southwest University of Science and Technology
Southwest University of Science and Technology

Impact factor: 0.348
ISSN: 1671-8755
Year, Volume (Issue): 2024, 39(3)