
An Image-Text Retrieval Method Based on Semantic Adjustment and Two-Level Matching

Image-text retrieval hinges on measuring the similarity between an image and a text, and existing methods fall into two groups: global matching and local matching. Global matching aligns only global semantics, local matching aligns only fine-grained semantics, and in both cases global and local features interact too little. To address these problems, this paper proposes Semantic Adjustment and Two-level Matching (SATM), which combines the two matching strategies and uses global features to adjust the semantics of local features. First, semantic adjustment is performed within each modality: a self-attention mechanism uses global features to enhance local features. Second, stacked cross-attention between modalities lets the local features of the image and the text interact, performs local matching, and generates cross-modal global features. Third, semantic adjustment is performed between modalities: the cross-modal global features guide the local features to form global features, on which global matching is performed. Finally, the two matching similarities are combined for cross-modal retrieval. Because the method considers both global and local matching and lets global and local features exchange information, it can significantly improve cross-modal retrieval accuracy. Extensive comparative experiments on the Flickr30K and MS-COCO benchmark datasets verify its effectiveness and superiority.
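The two-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the mean-pooled global feature, the single cross-attention step, and the names `alpha` and `temperature` are all assumptions chosen for illustration. The sketch computes a local similarity by letting each word attend over image regions, a global similarity between pooled features, and a weighted sum of the two.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; `temperature` sharpens the distribution."""
    m = max(scores)
    exps = [math.exp((s - m) * temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, temperature=4.0):
    """Attention-weighted sum of `keys` for a single query vector."""
    weights = softmax([cosine(query, k) for k in keys], temperature)
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def local_similarity(regions, words):
    """Local matching: average cosine between each word and the
    region context it attends to (one cross-attention step)."""
    return sum(cosine(w, attend(w, regions)) for w in words) / len(words)


def mean_pool(vectors):
    """Assumed stand-in for a global feature: the mean of local features."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def two_level_similarity(regions, words, alpha=0.5):
    """Combine global and local similarity; `alpha` is a hypothetical weight."""
    global_sim = cosine(mean_pool(regions), mean_pool(words))
    local_sim = local_similarity(regions, words)
    return alpha * global_sim + (1 - alpha) * local_sim


# Toy usage: two image regions and one-word captions in a 2-D feature space.
regions = [[1.0, 0.0], [0.0, 1.0]]
matched = two_level_similarity(regions, [[1.0, 0.0]])
mismatched = two_level_similarity(regions, [[0.0, -1.0]])
```

In retrieval, a score like this would be computed for every image-text pair and the candidates ranked by it. The paper's semantic adjustment steps, in which global features modulate local ones, are not modeled here.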

cross-modal retrieval; image text matching; global matching; local matching; semantic adjustment

Liu Hongzhou (刘洪洲), Zhang Hong (张鸿)


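School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China

Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology), Wuhan, Hubei 430065, China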


2024

Computer Technology and Development (计算机技术与发展)
Shaanxi Computer Society


CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
Year, volume (issue): 2024, 34(12)