
Multimodal Pre-Trained Framework for Aligning Image–Text Relation Semantics

Image–text relation (ITR) in social media plays a crucial role in mining the semantics of posts. Vision and language pre-trained models (PTMs) or multimodal PTMs have been used to create multimodal embeddings. The conventional practice of fine-tuning pre-trained models with labeled data for specific image–text relation tasks often falls short due to misalignment between general pre-training objectives and task-specific requirements. In this research, we introduce a pre-trained framework tailored for aligning image–text relation semantics. Our framework leverages unlabeled data to enhance the learning of image–text relation representations through deep multimodal clustering and multimodal contrastive learning tasks. Our method significantly narrows the gap between generic Vision-Language Pre-trained Models (VL-PTMs) and image–text relation tasks, yielding a performance boost of up to 10.4 points in linear probe tests. By achieving state-of-the-art results on image–text relation datasets, our pre-training framework demonstrates its effectiveness in capturing and aligning image–text semantics. Visualizations generated by class activation maps (CAM) also show that our models provide more accurate image–text semantic correspondence. The code is available at: https://github.com/qingyuannk/ITR.
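The multimodal contrastive learning objective mentioned above can be sketched as a symmetric InfoNCE loss over paired image and text embeddings. The following is a minimal NumPy sketch under assumptions of our own: the function names, the temperature value, and the use of a symmetric CLIP-style loss are illustrative, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: arrays of shape (B, D); row i of each is a matching pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(logits))      # matching pairs lie on the diagonal

    def xent(l):
        # Numerically stable cross-entropy against the diagonal labels.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

A loss of this form pulls each image embedding toward its paired text embedding while pushing it away from the other texts in the batch, which is one common way to align image and text semantics in a shared space.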

Keywords: image–text relation; multimodal semantic alignment; multimodal model pre-training

Lin Sun, Yindu Su, Zhewei Zhou, Qingyuan Li, Ruichen Xia


Department of Computer Science, Hangzhou City University, 51 Huzhou Street, Hangzhou 310015, Zhejiang, P. R. China

College of Computer Science and Technology, Zhejiang University, 38 Zheda Road, Hangzhou 310027, Zhejiang, P. R. China

Zhejiang Development & Planning Institute, 598 Gudun Road, Hangzhou 310012, Zhejiang, P. R. China

2025

International Journal of Pattern Recognition and Artificial Intelligence