Dual-encoder models offer faster inference than fusion-encoder models because image and text representations can be pre-computed before inference. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To address these issues, this paper proposes a new multi-modal fusion method. First, a pre-interactive bridge tower structure (PBTS) is proposed to establish connections between the top layer of each unimodal encoder and every layer of the cross-modal encoder. This enables comprehensive bottom-up interaction between visual and textual representations at different semantic levels, allowing more effective cross-modal alignment and fusion. Second, to better learn the deep interactions between images and text, a two-stage cross-modal attention double distillation method (TCMDD) is proposed, which uses a fusion-encoder model as the teacher and simultaneously distills knowledge of its cross-modal attention matrices into the unimodal encoders and the fusion module during both the pre-training and fine-tuning stages. The method is pre-trained on 4 million images and fine-tuned on three public datasets to validate its effectiveness. Experimental results show that the proposed multi-modal fusion method achieves better performance on multiple vision-language understanding tasks.
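To make the two ideas concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a bridge layer that injects the top-layer unimodal representation into a cross-modal encoder layer (the PBTS idea), and a KL-based distillation loss between teacher and student cross-modal attention distributions (the TCMDD idea). All module names, shapes, and hyper-parameters here are illustrative assumptions.

```python
# Illustrative sketch only; names and dimensions are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BridgeLayer(nn.Module):
    """Pre-interactive bridge: merges the top-layer unimodal representation
    into one cross-modal encoder layer before that layer's own fusion step."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, cross_modal_states, unimodal_top_states):
        # Add the projected top-layer unimodal states to the running
        # cross-modal states, then normalize.
        return self.norm(cross_modal_states + self.proj(unimodal_top_states))


def attn_distill_loss(student_attn_logits, teacher_attn_logits):
    """KL divergence between teacher and student cross-modal attention
    distributions; the teacher (fusion-encoder model) is detached."""
    # Shapes: (batch, heads, query_len, key_len)
    s = F.log_softmax(student_attn_logits, dim=-1)
    t = F.softmax(teacher_attn_logits.detach(), dim=-1)
    return F.kl_div(s, t, reduction="batchmean")


if __name__ == "__main__":
    # Toy tensors standing in for encoder outputs and attention logits.
    text_top = torch.randn(2, 16, 256)           # top layer of a unimodal (text) encoder
    cross_states = torch.randn(2, 16, 256)       # states inside one cross-modal layer
    fused = BridgeLayer(256)(cross_states, text_top)

    student_logits = torch.randn(2, 8, 16, 49)   # student cross-modal attention logits
    teacher_logits = torch.randn(2, 8, 16, 49)   # teacher cross-modal attention logits
    loss = attn_distill_loss(student_logits, teacher_logits)
    print(fused.shape, loss.item())
```

In a full model, one such bridge would feed each cross-modal layer, and the distillation loss would be added to the pre-training and fine-tuning objectives for both the unimodal encoders and the fusion module.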