Multi-modal Fusion Method Based on Dual Encoders

Dual encoder models offer faster inference than fusion encoder models and allow image and text representations to be pre-computed at inference time. However, the shallow interaction module used in dual encoder models is insufficient for complex vision-language understanding tasks. To address this problem, this paper proposes a new multi-modal fusion method. First, a pre-interactive bridge tower structure (PBTS) is proposed, which establishes connections between the top layer of the unimodal encoders and each layer of the cross-modal encoder. This enables comprehensive, bottom-up interaction between visual and textual representations at different semantic levels, and thus more effective cross-modal alignment and fusion. Second, to better learn deep interactions between images and text, a two-stage cross-modal attention double distillation method (TCMDD) is proposed. It uses a fusion encoder model as the teacher and performs knowledge distillation on the cross-modal attention matrices of both the unimodal encoders and the fusion module, in both the pre-training and fine-tuning stages. The method is pre-trained on 4 million images and fine-tuned on three public datasets to validate its effectiveness. Experimental results show that the proposed multi-modal fusion method achieves better performance on multiple vision-language understanding tasks.
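The abstract does not state the exact form of the attention distillation objective. As a minimal sketch, assuming the student is trained to match the teacher's cross-modal attention distributions under a KL divergence (a common choice, not confirmed by the source), one training-step loss could look like the following. The function names and the `t2i` / `i2t` dictionary keys (text-to-image and image-to-text attention scores) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn one row of raw attention scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_kl(teacher_scores, student_scores, eps=1e-12):
    """Mean KL(teacher || student) over the rows of two attention score matrices.

    Each matrix is a list of rows; row i holds the raw scores of query token i
    over the key tokens of the other modality.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must have the same number of rows")
    total = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        p = softmax(t_row)  # teacher distribution (the target)
        q = softmax(s_row)  # student distribution (the one being trained)
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(teacher_scores)


def distillation_loss(teacher, student):
    """Sum the text-to-image and image-to-text attention KL terms (assumed form)."""
    return (attention_kl(teacher["t2i"], student["t2i"])
            + attention_kl(teacher["i2t"], student["i2t"]))


# Toy example: 2 text tokens attending over 3 image patches, and the reverse.
teacher = {"t2i": [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]],
           "i2t": [[1.0, 0.2], [0.1, 0.9], [0.4, 0.6]]}
student = {"t2i": [[1.0, 1.0, 1.0], [0.0, 0.5, 0.5]],
           "i2t": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]}
loss = distillation_loss(teacher, student)
```

In a two-stage setup such as the one the abstract describes, a term of this kind would be added to the task loss during both pre-training and fine-tuning; how the terms are weighted is not given in the source.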

Keywords: Multi-modal fusion; Dual encoder; Cross-modal attention distillation; Bridge tower structure

Huang Xiaofei (黄晓飞), Guo Weibin (郭卫斌)

School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

62076094

2024

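Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)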

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(9)