Multimodal Interaction Network for Image Captioning
In image captioning, multimodal approaches are widely exploited by simultaneously providing visual inputs and semantic attributes to capture multi-level information. However, most approaches still use the two modalities in isolation, without considering the correlation between them. To fill this gap, we first introduce a Bi-Directional Attention Flow (BiDAF) module that extends the self-attention mechanism in a bi-directional manner to model complex interactions between the two modalities. Then, through a Gated Linear Memory (GLM) module that realizes the same function as a Long Short-Term Memory (LSTM) with only one forget gate, the decoder complexity is effectively reduced and the multimodal interaction information is captured. Finally, we apply BiDAF and GLM as the encoder and decoder of the image captioning model, respectively, forming the Multimodal Interaction Network (MINet). Experimental results on COCO show that MINet not only has a more concise decoder, produces better image descriptions, and achieves higher evaluation scores than existing multimodal methods, but also describes images more efficiently without pre-training.
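The abstract does not give the GLM equations; the following is only a minimal sketch of the stated idea, assuming an LSTM-like recurrent cell reduced to a single forget gate that interpolates between the previous memory and a linear candidate computed from the fused multimodal features. All names, dimensions, and the exact gating form are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn


class GatedLinearMemory(nn.Module):
    """Illustrative single-gate recurrent cell (assumed form of GLM).

    One forget gate blends the previous memory state with a linear
    candidate derived from the current multimodal input, avoiding the
    input/output gates and nonlinearly squashed cell of a full LSTM.
    """

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.forget_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.candidate = nn.Linear(input_dim, hidden_dim)

    def forward(self, x: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_dim) fused visual/semantic features for one step
        # h_prev: (batch, hidden_dim) previous memory state
        f = torch.sigmoid(self.forget_gate(torch.cat([x, h_prev], dim=-1)))
        c = self.candidate(x)            # linear candidate, no tanh squashing
        h = f * h_prev + (1.0 - f) * c   # single gate mixes old memory and new input
        return h


# Hypothetical usage with assumed dimensions
if __name__ == "__main__":
    cell = GatedLinearMemory(input_dim=512, hidden_dim=512)
    x = torch.randn(4, 512)   # encoder output for one decoding step
    h = torch.zeros(4, 512)   # initial memory state
    h = cell(x, h)
    print(h.shape)            # torch.Size([4, 512])
```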