Research on Visual Question Answering Fusing Visual Grounding Information
To enhance a Visual Question Answering (VQA) model's ability to capture question-relevant information in images, Visual Grounding (VG) information is introduced to augment the model's understanding of the complete image context. Semantic features from the image and shallow textual features are integrated into an image-based text encoder, which maps the textual features into the image space. The resulting textual features and the image features are then fed into a text-based image decoder to generate the VG information. Experimental results demonstrate that the model achieves the best performance across four evaluation metrics: Accuracy, Open, Binary, and Consistency, with improvements of 0.84%, 0.74%, 3.38%, and 2.95%, respectively. In particular, Accuracy reaches 56.94%, indicating that VG information effectively increases the proportion of question-relevant information in the image features.
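The step of mapping textual features into the image space through an image-based text encoder is commonly realized with cross-attention, where each question token attends over image patch features. The abstract does not specify the exact mechanism, so the following is only an illustrative NumPy sketch under that assumption (single head, no learned projections; all array shapes are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats):
    """Map text features into the image space: each text token attends
    over image patch features and returns a weighted mix of them.
    Single-head, projection-free -- a simplification for illustration."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)  # (tokens, patches)
    attn = softmax(scores, axis=-1)                   # rows sum to 1
    return attn @ image_feats                         # (tokens, dim)

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((8, 64))    # 8 question tokens (assumed)
image_feats = rng.standard_normal((49, 64))  # 7x7 grid of image patches (assumed)
grounded = cross_attention(text_feats, image_feats)
print(grounded.shape)  # (8, 64): text features expressed in the image space
```

In a full model these grounded textual features, together with the image features, would feed the text-based image decoder that produces the VG information; the learned query/key/value projections and multi-head structure omitted here would be part of that encoder.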