Multilingual Text-Video Cross-Modal Retrieval Model via Multilingual-Visual Common Space Learning
This paper addresses the challenging task of multilingual cross-modal text-video retrieval. Traditional text-video retrieval models are usually designed for a single language, such as English, and support text queries only in that language. When retrieval in a different language is required, training data for the target language must be collected and a new model must be built and retrained, which makes it difficult to apply such models to multilingual retrieval tasks quickly and effectively. In recent years, research on multilingual problems has steadily deepened, laying a solid foundation for multilingual cross-modal retrieval. To solve this problem, this paper proposes a simple and effective multilingual text-video cross-modal retrieval model based on multilingual-visual common space learning, which maps the features of different languages and of video into the same common space. In this space, video vectors serve as anchors and are aligned with the vectors of each language, enabling cross-modal learning across multiple languages and thereby establishing a unified multilingual learning framework. This method solves the multilingual retrieval problem with only one model, and the model's performance is explored in three training scenarios: non-parallel, parallel, and pseudo-parallel corpora. At the same time, the interoperability and complementarity between different languages in multilingual modeling are exploited to compensate for the limitations of monolingual text feature representation, and a robust learning method based on contrastive learning is introduced on both the text and video sides, further improving the representation ability of features in each modality. Experimental results on the VATEX and MSR-VTT multilingual datasets demonstrate that the proposed model can be applied to multilingual retrieval tasks simply and quickly while achieving outstanding performance: compared with state-of-the-art methods in the pseudo-parallel scenario, it improves sum recall by approximately 5.97% on Chinese VATEX and 1.37% on MSR-VTT, respectively.
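The abstract's core idea of anchoring each language's text vectors to shared video vectors via contrastive learning can be illustrated with a minimal sketch. The code below is an illustrative assumption, not the paper's actual implementation: it computes a symmetric InfoNCE-style loss per language in a shared embedding space, with video embeddings as the common anchors; all function names, the temperature value, and the use of NumPy are hypothetical choices for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(sim):
    """InfoNCE loss for a similarity matrix whose diagonal holds the
    positive (matched) video-text pairs; off-diagonal entries are negatives."""
    logits = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # -log p(correct match)

def multilingual_alignment_loss(video_emb, text_embs, temperature=0.07):
    """Align every language to the same video anchors in one common space.

    video_emb : (n, d) array of video embeddings (the shared anchors).
    text_embs : dict mapping a language tag (e.g. "en", "zh") to an
                (n, d) array of text embeddings paired row-wise with videos.
    """
    v = l2_normalize(video_emb)
    total = 0.0
    for lang, t in text_embs.items():
        t = l2_normalize(t)
        sim = v @ t.T / temperature
        # Symmetric loss: video-to-text and text-to-video retrieval directions.
        total += 0.5 * (info_nce(sim) + info_nce(sim.T))
    return total / len(text_embs)
```

Because every language is contrasted against the same video anchors, no parallel text between languages is strictly required, which is consistent with the non-parallel and pseudo-parallel training scenarios described above: text embeddings that stay close to their paired video should yield a much lower loss than unrelated ones.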