Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning

Abstract: Traditional image-sentence cross-modal retrieval methods usually aim to learn consistent representations of heterogeneous modalities, so that similar instances in one modality can be retrieved given a query from another modality. The basic assumption behind these methods is that parallel multi-modal data (i.e., data in which the different modalities of the same example are aligned) can be obtained in advance. In other words, image-sentence cross-modal retrieval is a supervised task with the alignments as ground truth. However, in many real-world applications, it is difficult to realign a large amount of parallel data for new scenarios because of the substantial labor costs; the resulting data are non-parallel, and existing methods cannot be applied to them directly. On the other hand, auxiliary parallel multi-modal data with similar semantics often exist and can help in learning consistent representations for the non-parallel data. Therefore, in this paper, we address "Alignment Efficient Image-Sentence Retrieval" (AEIR), which uses the auxiliary parallel image-sentence data as the source domain data and takes the non-parallel data as the target domain data. Unlike single-modal transfer learning, AEIR learns consistent image-sentence cross-modal representations of the target domain by transferring the alignments of existing parallel data. Specifically, AEIR learns consistent image-sentence representations in the source domain from parallel data, while transferring the alignment knowledge across domains by jointly optimizing a newly designed constraint based on cross-domain cross-modal metric learning together with an intra-modal domain adversarial loss. Consequently, consistent representations for the target domain can be learned effectively, taking both structure transfer and semantic transfer into account. Furthermore, extensive experiments on different transfer scenarios validate that AEIR achieves better retrieval results than the baselines.
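To make the supervised setting in the abstract concrete, the sketch below shows one common way that aligned image-sentence pairs can act as ground truth: a hinge-based bidirectional ranking loss over a batch of parallel embeddings. This is a generic formulation from the image-sentence retrieval literature, not the AEIR objective; the function name `bidirectional_ranking_loss`, the `margin` value, and the use of cosine similarity are assumptions made purely for illustration.

```python
import numpy as np

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking loss over a batch of aligned pairs (illustrative only).

    Assumption: row i of img_emb and row i of txt_emb describe the same example,
    i.e. the batch is "parallel" in the sense used by the abstract.
    """
    # L2-normalise so that the dot product is the cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    sim = img @ txt.T          # sim[i, j] = cos(image i, sentence j)
    pos = np.diag(sim)         # similarity of each aligned (ground-truth) pair

    # Image -> sentence: a non-matching sentence j should score at least
    # `margin` below the matching sentence for image i.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Sentence -> image: a non-matching image i should score at least
    # `margin` below the matching image for sentence j.
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    # Exclude the aligned pairs themselves (the diagonal) from the penalty.
    mask = 1.0 - np.eye(sim.shape[0])
    return float(((cost_i2t + cost_t2i) * mask).sum() / sim.shape[0])


# Toy check: perfectly aligned embeddings versus a batch whose pairing is broken.
aligned_images = np.eye(3)
aligned_sentences = np.eye(3)
misaligned_sentences = np.eye(3)[[1, 0, 2]]

loss_aligned = bidirectional_ranking_loss(aligned_images, aligned_sentences)
loss_misaligned = bidirectional_ranking_loss(aligned_images, misaligned_sentences)
```

A loss of this kind needs the pairing between rows, which is exactly the information that non-parallel target-domain data lacks; that gap is what the abstract proposes to close by transferring alignment knowledge from a source domain.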

Keywords:

image-sentence retrieval; transfer learning; semantic transfer; structure transfer

Authors:

Yang YANG, Jinyi GUO, Guangyu LI, Lanyu LI, Wenjie LI, Jian YANG

Affiliations:

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Department of Computing, Hong Kong Polytechnic University, Hong Kong 100872, China

State Key Lab. for Novel Software Technology, Nanjing University, Nanjing 210094, China

14th Research Institute of China Electronics Technology Group Corporation, Nanjing 210094, China

Funding:

National Key R&D Program of China; National Natural Science Foundation of China; National Natural Science Foundation of China; National Natural Science Foundation of China; Natural Science Foundation of Jiangsu Province of China; Jiangsu Shuangchuang (Mass Innovation and Entrepreneurship) Talent Program; Young Elite Scientists Sponsorship Program by CAST; Fundamental Research Funds for the Central Universities; Fundamental Research Funds for the Central Universities

Grant numbers:

2022YFF0712100; 62006118; 62276131; 62006119; BK20200460; NJ2022028; 30922010317

Publication year:

2024

DOI:

10.1007/s11704-023-3186-6

Frontiers of Computer Science

Higher Education Press

CSTPCD, EI

Impact factor: 0.303

ISSN: 2095-2228

Year, volume (issue): 2024, 18(1)

Number of references: 53