Cross-modal image and text retrieval based on graph convolution and multi-head attention
Hua Chunjian 1, Zhang Hongtu 1, Jiang Yi 1, Yu Jianfeng 1, Chen Ying 2
Author information
- 1. School of Mechanical Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China; Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, Jiangsu, China
- 2. School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China
Abstract
Aiming at the problem that existing cross-modal retrieval methods struggle to measure the weight of the data at each node and are limited in mining local consistency within modalities, a cross-modal image and text retrieval method based on a multi-head attention mechanism is proposed. First, when constructing the modality graph, each image-text sample serves as an independent node, and graph convolution is used to extract the interaction information between samples, improving local consistency within the data of each modality. Then, an attention mechanism is introduced into the graph convolution to adaptively learn the weight coefficient of each neighboring node, thereby distinguishing the influence of different neighbors on the central node. Finally, a multi-head attention layer with weight parameters is constructed to fully learn multiple sets of related features between nodes. Compared with eight existing methods, the proposed method improves the mAP obtained in experiments by 2.6% to 42.5% on the Wikipedia dataset and by 3.3% to 54.3% on the Pascal Sentence dataset.
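The abstract describes weighting each neighbor node via attention inside a graph convolution, then concatenating several such attention heads. The authors' implementation is not given here; as a rough illustration of that general idea (a GAT-style multi-head graph-attention layer over a modality graph), the following NumPy sketch may help. All names, dimensions, and the toy graph are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_graph_attention(X, A, W_heads, a_heads):
    """One multi-head graph-attention layer (illustrative sketch only).

    X: (N, F) node features, one image or text sample per node
    A: (N, N) adjacency matrix of the modality graph (nonzero = neighbor)
    W_heads: list of (F, F') projection matrices, one per head
    a_heads: list of (2*F',) attention vectors, one per head
    Returns the concatenation of all heads' outputs, shape (N, H*F').
    """
    N = X.shape[0]
    outputs = []
    for W, a in zip(W_heads, a_heads):
        H = X @ W                                      # project node features
        # attention logit e_ij = LeakyReLU(a^T [h_i || h_j])
        e = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                z = np.concatenate([H[i], H[j]]) @ a
                e[i, j] = z if z > 0 else 0.2 * z      # LeakyReLU
        e = np.where(A > 0, e, -1e9)                   # mask non-neighbors
        alpha = softmax(e, axis=1)                     # learned neighbor weights
        outputs.append(alpha @ H)                      # weighted aggregation
    return np.concatenate(outputs, axis=1)

# toy modality graph: 4 nodes, 8-dim features, 2 attention heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])
W_heads = [rng.normal(size=(8, 4)) for _ in range(2)]
a_heads = [rng.normal(size=(8,)) for _ in range(2)]
out = multi_head_graph_attention(X, A, W_heads, a_heads)
print(out.shape)  # (4, 8): 2 heads of 4 features each, concatenated
```

Because the attention weights `alpha` are row-normalized over each node's neighbors, different neighbors contribute differently to the central node's representation, which is the effect the abstract attributes to introducing attention into the graph convolution.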
Keywords
attention weight / adjacency matrix / multi-head attention / common subspace / cross-modal retrieval
Publication year
2024