
Research on a Visual Question Answering Algorithm Fusing Visual Grounding Information

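To improve how well a visual question answering (VQA) model captures relevant information in an image, visual grounding (VG) information is introduced to strengthen the model's understanding of the complete image. Image semantic features and shallow text features are fed together into an image-based text encoder, which maps the text features into the image space. The resulting text features and the image features are then fed into a text-based image decoder, which generates the visual grounding information. Experimental results show that the model achieves the best results on all four evaluation metrics (Accuracy, Open, Binary, and Consistency), improving them by 0.84%, 0.74%, 3.38%, and 2.95%, respectively, with Accuracy reaching 56.94%. This indicates that visual grounding information effectively increases the proportion of question-relevant information in the image features.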
Research on Visual Question Answering Fusing Visual Grounding Information
To enhance the capture of relevant information in images by a Visual Question Answering (VQA) model, Visual Grounding (VG) information is introduced to augment the model's understanding of the complete image context. This involves feeding semantic features from the image and shallow textual features into an image-based text encoder, mapping textual features to the image space. Subsequently, the obtained textual features and image features are fed into a text-based image decoder to generate VG information. Experimental results demonstrate that the model achieves the best performance across four evaluation metrics: Accuracy, Open, Binary, and Consistency, with improvements of 0.84%, 0.74%, 3.38%, and 2.95%, respectively. Specifically, Accuracy reaches 56.94%, indicating that VG information effectively increases the proportion of question-related information in the image features.
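The data flow described in the abstract can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration, not the authors' implementation: the encoder and decoder are each reduced to a single scaled dot-product cross-attention step with no learned projections, the feature sizes and random inputs are arbitrary, and the gate (suggested only by the "gated mechanism" keyword) is a hypothetical scalar sigmoid that blends each image region with its grounding feature.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over the context rows."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(context))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, context), weights


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(image_feats, grounding_feats, gate_w):
    """Hypothetical gate: blend each image region with its grounding feature."""
    fused = []
    for img, vg in zip(image_feats, grounding_feats):
        # Scalar gate computed from the concatenated [image, grounding] vector.
        g = sigmoid(sum(w * v for w, v in zip(gate_w, img + vg)))
        fused.append([g * v + (1.0 - g) * i for i, v in zip(img, vg)])
    return fused


random.seed(0)
dim, n_regions, n_tokens = 4, 3, 5  # illustrative sizes only
image_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_regions)]
text_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
gate_w = [random.gauss(0, 1) for _ in range(2 * dim)]  # stand-in for learned weights

# Step 1 (image-based text encoder): text tokens attend over image regions,
# so every token becomes a weighted mix of image features.
text_in_image_space, _ = cross_attention(text_feats, image_feats)

# Step 2 (text-based image decoder): image regions attend over the mapped text,
# producing one grounding feature per region.
grounding_feats, region_weights = cross_attention(image_feats, text_in_image_space)

# Step 3 (assumed): gate the grounding information into the image features.
fused = gated_fusion(image_feats, grounding_feats, gate_w)
```

In this sketch, `fused` keeps the shape of `image_feats` (one vector per region), so it could stand in for the original image features in a downstream answer predictor. A real model would add learned query, key, and value projections, multiple heads, and stacked layers.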

Visual question answering; Visual grounding; Gated mechanism; Encoder; Decoder

Wu Jinman, Che Jin, Bai Xuebing, Chen Yumin


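School of Electronic and Electrical Engineering, Ningxia University, Yinchuan 750021, Ningxia, China

Ningxia Key Laboratory of Intelligent Sensing for Desert Information, Yinchuan 750021, Ningxia, China

School of Advanced Interdisciplinary Studies, Ningxia University, Yinchuan 750021, Ningxia, China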

Keywords: visual question answering; visual grounding; gated mechanism; encoder; decoder

Funding: National Natural Science Foundation of China (62366042); Natural Science Foundation of Ningxia Hui Autonomous Region (2023AAC03127)

2024

Changjiang Information & Communications
Hubei Communication Services Company


Impact factor: 0.338
ISSN: 2096-9759
Year, volume (issue): 2024, 37(5)