Acta Electronica Sinica, 2024, Vol. 52, Issue (10): 3368-3381. DOI: 10.12263/DZXB.20240271

Multimodal Pretraining with Cross-Modal Guidance and Alignment


才华¹, 易亚希², 付强³, 冉越², 孙俊喜⁴

Author Information

  • 1. School of Electronic Information Engineering, Changchun University of Science and Technology, Changchun 130022, Jilin, China; Changchun China Optical Science and Technology Museum, Changchun 130117, Jilin, China
  • 2. School of Electronic Information Engineering, Changchun University of Science and Technology, Changchun 130022, Jilin, China
  • 3. Institute of Space Optoelectronic Technology, Changchun University of Science and Technology, Changchun 130022, Jilin, China
  • 4. School of Information Science and Technology, Northeast Normal University, Changchun 130117, Jilin, China


Abstract

Current multimodal pre-training techniques for vision and language predominantly focus on aligning global semantic features between images and text, yet they inadequately explore the fine-grained feature interactions between modalities. Addressing this gap, this paper proposes a novel multimodal pre-training strategy informed by cross-modal guidance and alignment. For modality feature extraction, our method employs a dual-stream feature extraction network designed around visual sequence compression: joint image-text guidance is integrated within the visual encoder, orchestrating the compression of visual sequences layer by layer. This mitigates the interference of text-irrelevant, redundant visual information with fine-grained cross-modal interactions. Subsequently, in the modality feature alignment phase, we apply fine-grained relational reasoning to the image and text features to achieve localized feature alignment between visual tokens and textual tokens, bolstering the model's comprehension of fine-grained alignment relationships. After fine-tuning, our approach achieves average recall rates of 86.4% for image retrieval and 94.88% for text retrieval on image-text retrieval tasks, and its overall zero-shot image-text retrieval metric improves by 5.36% over the canonical CLIP (Contrastive Language-Image Pre-training) algorithm. Moreover, our method also surpasses existing mainstream multimodal pre-training methods in accuracy on classification tasks such as visual question answering.
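The core idea sketched in the abstract — using the text to guide layer-wise compression of the visual sequence, so that text-irrelevant visual tokens are discarded before fine-grained alignment — can be illustrated with a minimal, self-contained sketch. This is purely illustrative and not the authors' implementation; the function name, scoring rule (max cosine similarity to any text token), and keep ratio are all assumptions for the example.

```python
import numpy as np

def compress_visual_tokens(visual, text, keep_ratio=0.5):
    """Text-guided visual sequence compression (illustrative sketch).

    Scores each visual token by its maximum cosine similarity to any
    text token and keeps only the highest-scoring fraction, dropping
    visual tokens that are unrelated to the text.
    """
    # L2-normalize both token sets so dot products are cosine similarities.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    sim = v @ t.T                       # (num_visual, num_text)
    scores = sim.max(axis=1)            # text-relevance of each visual token
    k = max(1, int(len(visual) * keep_ratio))
    keep = np.argsort(-scores)[:k]      # indices of the most relevant tokens
    return visual[np.sort(keep)]        # preserve original token order

rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 8))       # 16 visual tokens, dimension 8
text = rng.normal(size=(4, 8))          # 4 text tokens
compressed = compress_visual_tokens(visual, text, keep_ratio=0.25)
print(compressed.shape)                 # → (4, 8)
```

In the paper's setting this pruning is applied layer by layer inside the visual encoder rather than once at the output, but the selection principle is the same.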


Key words

multimodal pre-training / cross-modal guidance / visual sequence compression / dual-stream feature extraction / fine-grained relational reasoning / localized feature alignment
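The average recall figures quoted in the abstract are conventionally the mean of Recall@1, Recall@5, and Recall@10 over all queries. A minimal illustration of how this metric is computed (the ranks below are hypothetical, not the paper's data):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth match ranks within the top k."""
    ranks = np.asarray(ranks)
    return float((ranks <= k).mean())

# Hypothetical 1-based ranks of the ground-truth item for 8 queries.
ranks = [1, 3, 1, 7, 2, 12, 1, 4]
avg_recall = np.mean([recall_at_k(ranks, k) for k in (1, 5, 10)])
print(round(avg_recall * 100, 2))   # → 66.67
```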


Funding

National Natural Science Foundation of China (61890963)

National Natural Science Foundation of China (U2341226)

Jilin Province Special Project for Talent (20240602015RC)

Open Fund of the Xi'an Key Laboratory of Aircraft Optical Imaging and Measurement Technology (2023-13)

Publication Year

2024
Acta Electronica Sinica (电子学报), Chinese Institute of Electronics
Indexed in: CSTPCD; Peking University Core Journal List (北大核心)
Impact factor: 1.237
ISSN: 0372-2112