Current multimodal pre-training techniques for vision and language predominantly focus on aligning global semantic features between images and text, yet they inadequately explore fine-grained feature interactions between modalities. To address this gap, this paper proposes a novel multimodal pre-training strategy based on cross-modal guidance and alignment. Our method employs a dual-stream feature extraction network designed for visual sequence compression to facilitate modality feature extraction. During this phase, synergistic image-text guidance is integrated within the visual encoder, orchestrating the compression of visual sequences layer by layer; this mitigates the obfuscation of modality-specific fine-grained interactions by irrelevant visual information. Subsequently, in the modality feature alignment phase, we perform fine-grained relational reasoning on the image and text features to achieve localized alignment between visual tokens and textual tokens, strengthening the model's comprehension of fine-grained alignment relationships. After fine-tuning, on image-text retrieval tasks our approach achieves an average recall of 86.4% for images and 94.88% for texts, a 5.36% improvement in zero-shot image-text retrieval over the canonical CLIP (Contrastive Language-Image Pre-training) algorithm. Moreover, our method also surpasses existing mainstream multimodal pre-training methods in accuracy on classification tasks such as visual question answering.
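To make the notion of localized alignment between visual and textual tokens concrete, the following is a minimal sketch of a token-level similarity score. It is not the paper's exact formulation: the function name, shapes, and the max-over-tokens aggregation (a FILIP-style late-interaction choice) are all assumptions for illustration.

```python
# Hedged sketch of fine-grained (token-level) image-text alignment.
# Assumption: both encoders emit L2-normalized token features; the
# max-over-tokens aggregation is one common choice, not the paper's.
import numpy as np

def token_alignment_score(img_tokens, txt_tokens):
    """Symmetric average of each token's best cross-modal match.

    img_tokens: (n_img, d) array of L2-normalized visual token features
    txt_tokens: (n_txt, d) array of L2-normalized textual token features
    """
    sim = img_tokens @ txt_tokens.T  # (n_img, n_txt) cosine similarities
    i2t = sim.max(axis=1).mean()     # image->text: best text token per image token
    t2i = sim.max(axis=0).mean()     # text->image: best image token per text token
    return 0.5 * (i2t + t2i)         # fine-grained alignment score in [-1, 1]

# toy usage with random normalized features (49 visual tokens, 12 text tokens)
rng = np.random.default_rng(0)
img = rng.normal(size=(49, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(12, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
score = token_alignment_score(img, txt)
```

In a contrastive setting, such a score would replace the single global image-text cosine similarity inside the loss, so that gradients reward local token correspondences rather than only the pooled representations.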