Multimodal Pre-Trained Framework for Aligning Image–Text Relation Semantics
World Scientific
Image–text relation (ITR) in social media plays a crucial role in mining the semantics of posts. Vision-language pre-trained models (VL-PTMs) and multimodal PTMs have been used to create multimodal embeddings. The conventional practice of fine-tuning pre-trained models with labeled data for specific image–text relation tasks often falls short because general pre-training objectives are misaligned with task-specific requirements. In this research, we introduce a pre-trained framework tailored to aligning image–text relation semantics. Our framework leverages unlabeled data to enhance the learning of image–text relation representations through deep multimodal clustering and multimodal contrastive learning tasks. Our method significantly narrows the gap between generic VL-PTMs and image–text relation tasks, delivering a performance boost of up to 10.4 points in linear probe tests. By achieving state-of-the-art results on image–text relation datasets, our pre-training framework demonstrates its effectiveness in capturing and aligning image–text semantics. Visualizations generated with class activation maps (CAM) also show that our models provide more accurate image–text semantic correspondence. The code is available at https://github.com/qingyuannk/ITR.
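For context on the multimodal contrastive learning objective mentioned in the abstract, the sketch below shows a generic symmetric InfoNCE loss over paired image and text embeddings in PyTorch. This is an illustrative assumption about the common form of such an objective, not the authors' implementation; the function name multimodal_contrastive_loss and the temperature value are hypothetical, and the actual code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch_size, dim) tensors from the two encoders.
    Matching image-text pairs share the same row index; all other rows in
    the batch act as in-batch negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by a temperature hyperparameter.
    logits = image_emb @ text_emb.t() / temperature

    # The positive pair for row i is column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```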
Image–text relation; multimodal semantic alignment; multimodal model pre-training
Lin Sun, Yindu Su, Zhewei Zhou, Qingyuan Li, Ruichen Xia
Department of Computer Science, Hangzhou City University, 51 Huzhou Street, Hangzhou 310015, Zhejiang, P. R. China
College of Computer Science and Technology, Zhejiang University, 38 Zheda Road, Hangzhou 310027, Zhejiang, P. R. China
Zhejiang Development & Planning Institute, 598 Gudun Road, Hangzhou 310012, Zhejiang, P. R. China