
A survey on image-text cross-modal retrieval

Image-text cross-modal retrieval, a core topic in cross-modal computing research, has long received considerable attention from both academia and industry. Over the past few decades, advances in deep learning, particularly the wide adoption and refinement of deep neural networks, Transformer architectures, and image-text contrastive learning, have driven significant progress and breakthroughs in the field. Building on a systematic review of the development of image-text cross-modal retrieval, this paper focuses on five key steps in its modeling process: preparing training data, designing the data input format, selecting image and text feature extraction mechanisms, selecting image-text modeling methods, and establishing optimization objectives. To evaluate existing models objectively, their performance is compared on multiple authoritative annotated benchmark datasets, revealing the practical performance limits of current cross-modal retrieval methods. By analyzing and summarizing the evolution of each key step in light of current research results, the paper then forecasts future trends in cross-modal learning. The findings indicate that, although current image-text cross-modal retrieval technology has advanced markedly, there is still room for improvement, and that researchers can pursue four directions: fine-grained retrieval, economical pre-training methods, new forms of image-text interaction, and image-text pre-training empowered by Artificial Intelligence Generated Content (AIGC).

Keywords: image-text retrieval; cross-modal learning; deep learning; attention mechanism

Zhang Zhenxing (张振兴), Wang Yaxiong (王亚雄)


School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230000, China


Funding: National Natural Science Foundation of China (62302140)

Journal of Beijing Jiaotong University
Beijing Jiaotong University

Indexed in: CSTPCD, Peking University Core Journals (北大核心)
Impact factor: 0.525
ISSN: 1673-0291
Year, volume (issue): 2024, 48(2)