Multimodal Pre-Trained Framework for Aligning Image–Text Relation Semantics
World Scientific
Image–text relation (ITR) in social media plays a crucial role in mining the semantics of posts. Vision-language pre-trained models (VL-PTMs) and multimodal PTMs have been used to create multimodal embeddings. The conventional practice of fine-tuning pre-trained models with labeled data for specific image–text relation tasks often falls short because general pre-training objectives are misaligned with task-specific requirements. In this research, we introduce a pre-trained framework tailored to aligning image–text relation semantics. Our framework leverages unlabeled data to enhance the learning of image–text relation representations through deep multimodal clustering and multimodal contrastive learning tasks. Our method significantly narrows the gap between generic VL-PTMs and image–text relation tasks, delivering a performance boost of up to 10.4 points in linear probe tests. By achieving state-of-the-art results on image–text relation datasets, our pre-training framework demonstrates its effectiveness in capturing and aligning image–text semantics. Visualizations generated with class activation maps (CAM) also show that our models provide more accurate image–text semantic correspondence. The code is available at https://github.com/qingyuannk/ITR.
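For context on the multimodal contrastive learning objective mentioned in the abstract, the sketch below shows a generic symmetric InfoNCE loss over paired image and text embeddings in PyTorch. This is an illustrative assumption about the common form of such an objective, not the authors' implementation; the function name multimodal_contrastive_loss and the temperature value are hypothetical, and the actual code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch_size, dim) tensors from the two encoders.
    Matching image-text pairs share the same row index; all other rows in
    the batch act as in-batch negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by a temperature hyperparameter.
    logits = image_emb @ text_emb.t() / temperature

    # The positive pair for row i is column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```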
Image–text relation; multimodal semantic alignment; multimodal model pre-training
Lin Sun, Yindu Su, Zhewei Zhou, Qingyuan Li, Ruichen Xia
Department of Computer Science, Hangzhou City University, 51 Huzhou Street, Hangzhou 310015, Zhejiang, P. R. China
College of Computer Science and Technology, Zhejiang University, 38 Zheda Road, Hangzhou 310027, Zhejiang, P. R. China
Zhejiang Development & Planning Institute, 598 Gudun Road, Hangzhou 310012, Zhejiang, P. R. China