
Multilingual Text-Video Cross-Modal Retrieval Model via Multilingual-Visual Common Space Learning

This paper studies the challenging problem of multilingual text-video cross-modal retrieval. Traditional text-video cross-modal retrieval models are usually designed for a single language, such as English, and only support text queries in that specific language. When retrieval in a different language is required, training data for the target language must be collected and a new model must be built and retrained, which makes it difficult to adapt such models to other languages quickly and effectively. In recent years, research on multilingual problems has steadily deepened, laying a solid foundation for multilingual cross-modal retrieval. To address this problem, this paper proposes a simple and effective multilingual text-video cross-modal retrieval model based on multilingual-visual common space learning, which maps different languages and visual features into the same common space. In this space, video vectors serve as anchors and are aligned with the vectors of each language, thereby realizing multilingual cross-modal learning within a unified framework: a single model satisfies retrieval needs in multiple languages, and its performance is examined in three training scenarios with non-parallel, parallel, and pseudo-parallel corpora. Moreover, the interoperability and complementarity among different languages are exploited in multilingual modeling to compensate for the limitations of monolingual text feature representation, and a noise-robust learning method based on contrastive learning is introduced on both the text and video sides to further improve the feature representations of the different modalities. Experiments on the multilingual VATEX and MSR-VTT datasets demonstrate that the proposed model can be applied to retrieval tasks in multiple languages simply and quickly while achieving strong performance: in the common pseudo-parallel scenario, compared with state-of-the-art methods, it improves the sum of recalls by approximately 5.97% on Chinese VATEX and 1.37% on Chinese MSR-VTT.
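To make the core idea concrete, the following is a minimal PyTorch sketch of a video-anchored multilingual-visual common space with a contrastive alignment loss, as described in the abstract. The encoder inputs, feature dimensions, projection heads, temperature value, and the symmetric InfoNCE-style loss are illustrative assumptions for exposition only; they are not the paper's exact architecture or objective.

# Hypothetical sketch (not the paper's exact design): project pre-extracted video
# and per-language text features into one common space, use the video embedding as
# the anchor, and align it with every language via a contrastive (InfoNCE-style) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceModel(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, common_dim=512, languages=("en", "zh")):
        super().__init__()
        # One projection head for video and one per language; all share the common space.
        self.video_proj = nn.Linear(video_dim, common_dim)
        self.text_proj = nn.ModuleDict({lang: nn.Linear(text_dim, common_dim) for lang in languages})

    def forward(self, video_feat, text_feats):
        # video_feat: (B, video_dim); text_feats: {language: (B, text_dim)}
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = {lang: F.normalize(self.text_proj[lang](x), dim=-1) for lang, x in text_feats.items()}
        return v, t

def video_anchored_contrastive_loss(v, t, temperature=0.05):
    # Align the video anchor with each language; matched video-text pairs share an index.
    labels = torch.arange(v.size(0), device=v.device)
    loss = 0.0
    for lang_emb in t.values():
        sim = v @ lang_emb.t() / temperature  # (B, B) similarity logits
        # Symmetric cross-entropy covers both video-to-text and text-to-video retrieval.
        loss = loss + F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)
    return loss / (2 * len(t))

# Usage with random tensors standing in for pre-extracted features.
model = CommonSpaceModel()
v, t = model(torch.randn(4, 2048), {"en": torch.randn(4, 768), "zh": torch.randn(4, 768)})
video_anchored_contrastive_loss(v, t).backward()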

multilingual; cross-modal retrieval; cross-modal feature representation; contrastive learning

林俊安、包翠竹、董建锋、杨勋、王勋


School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou 310018

School of Information Science and Technology, University of Science and Technology of China, Hefei 230026


"Pioneer" and "Leading Goose" R&D Program of Zhejiang Province; Zhejiang Provincial Basic Public Welfare Technology Research Program; the 8th Young Elite Scientists Sponsorship Program by the China Association for Science and Technology

2023C01212; LGF21F020010; 2022QNRC001

2024

Chinese Journal of Computers (计算机学报)
China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 3.18
ISSN: 0254-4164
Year, Volume (Issue): 2024, 47(9)