
Infrared and Visible Image Fusion Method via Interactive Self-attention

To address the limited fusion performance of existing infrared and visible image fusion methods, which rely solely on local or global feature representations and lack cross-modality feature interaction, an interactive self-attention fusion method is proposed. A Transformer is used to model the global dependencies of the local features extracted by a convolutional neural network, combining local and global relationships and improving feature representation. In addition, a cross-modality attention interaction model is constructed, which allows features to be exchanged interactively across spatial locations and independent channels, realizing a local-to-global feature mapping and thereby strengthening the complementary characteristics of the two image types. Subjective and objective experiments on the TNO, M3FD, and Roadscene datasets show that, compared with seven other state-of-the-art fusion methods, the proposed method offers clear advantages in fusion performance, model generalization, and computational efficiency, verifying its effectiveness and superiority.
Infrared and Visible Image Fusion Method via Interactive Self-attention
The fusion of infrared and visible images aims to merge their complementary information to generate a fused output with better visual perception and scene understanding. Existing CNN-based methods typically employ convolutional operations to extract local features but fail to model long-range relationships. In contrast, Transformer-based methods usually adopt a self-attention mechanism to model global dependencies, but lack the complement of local information. More importantly, these methods often ignore the learning of modality-specific interactive information, which limits fusion performance. To address these issues, this paper introduces an infrared and visible image fusion method via interactive self-attention, namely ISAFusion.

First, we devise a collaborative learning scheme that seamlessly integrates CNN and Transformer. This scheme leverages residual convolutional blocks to extract local features, which are then aggregated into the Transformer to model global features, thus enhancing the feature representation ability. Second, we construct a cross-modality interactive attention module, a cascade of Token-ViT and Channel-ViT. This module models long-range dependencies along the token and channel dimensions in an interactive manner, allowing feature communication between spatial locations and independent channels. The generated global features focus on the intrinsic characteristics of the different modality images, which effectively strengthens their complementary information and achieves better fusion performance. Finally, we train the fusion network end-to-end with a comprehensive objective function comprising a structural similarity index measure (SSIM) loss, a gradient loss, and an intensity loss. This design ensures that the fusion model preserves structural information, valuable pixel intensities, and rich texture details from the source images.
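For concreteness, the composite objective described above (SSIM, gradient, and intensity terms) could be organized roughly as in the following PyTorch-style sketch. The weights, the Sobel-based gradient operator, the element-wise maximum aggregation of the sources, and the `pytorch_msssim` dependency are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a composite fusion objective: SSIM + gradient + intensity.
# Assumes single-channel (grayscale) inputs of shape (N, 1, H, W) in [0, 1].
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation


def sobel_gradient(x):
    """Approximate image gradients with fixed Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return gx.abs() + gy.abs()


def fusion_loss(fused, ir, vis, w_ssim=1.0, w_grad=10.0, w_int=1.0):
    # Structural term: keep the fused image structurally similar to both sources.
    loss_ssim = (1 - ssim(fused, ir, data_range=1.0)) + \
                (1 - ssim(fused, vis, data_range=1.0))
    # Texture term: match the element-wise maximum of the source gradients.
    loss_grad = F.l1_loss(sobel_gradient(fused),
                          torch.max(sobel_gradient(ir), sobel_gradient(vis)))
    # Intensity term: preserve the brighter pixel intensities of the sources.
    loss_int = F.l1_loss(fused, torch.max(ir, vis))
    return w_ssim * loss_ssim + w_grad * loss_grad + w_int * loss_int
```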
To verify the effectiveness and superiority of the proposed method, we carry out experiments on three benchmarks, namely the TNO, Roadscene, and M3FD datasets. Seven representative methods, namely U2Fusion, RFN-Nest, FusionGAN, GANMcC, YDTR, SwinFusion, and SwinFuse, are selected for comparison. Eight metrics are used for objective evaluation: average gradient, mutual information, phase congruency, feature mutual information with pixel, edge-based similarity, gradient-based similarity, multi-scale structural similarity index measure, and visual information fidelity. In the comparative experiments, ISAFusion achieves more balanced fusion results, retaining the typical targets of the infrared image and the rich texture details of the visible image, which yields a better visual effect and is more suitable for the human visual system. From the objective perspective, ISAFusion outperforms the compared methods on all three datasets, which is consistent with the subjective analysis. We also evaluate the runtime efficiency of the different methods; the results show that our method is second only to YDTR, indicating competitive computational efficiency. In summary, compared with the seven state-of-the-art competitors, our method delivers better fusion performance, stronger robustness, and higher computational efficiency.

In addition, we conduct ablation experiments to verify the effectiveness of each designed component. The results indicate that removing any component degrades the fusion performance to some extent; more specifically, we find that discarding the position embedding has a positive effect on fusion performance. The qualitative and quantitative ablation studies demonstrate the rationality and superiority of each designed component. In the future, we will explore a more effective CNN-Transformer learning scheme to further improve fusion performance and extend it to other fusion tasks, such as multi-band, multi-exposure, and multi-focus image fusion.
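For readers who want a concrete picture of the cross-modality interactive attention described above, the following PyTorch-style sketch shows one way a cascaded token-level and channel-level attention block could be organized. The class names, tensor layouts, shared attention weights, and the exchange of queries between modalities are illustrative assumptions rather than the published architecture.

```python
# Sketch of a cascaded token-wise / channel-wise attention block with
# cross-modality interaction. Inputs are flattened feature maps of shape
# (N, HW, C) for the infrared and visible branches.
import torch
import torch.nn as nn


class TokenAttention(nn.Module):
    """Multi-head attention over spatial tokens (sequence length = HW)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_src, kv_src):
        out, _ = self.attn(q_src, kv_src, kv_src)
        return out


class ChannelAttention(nn.Module):
    """Attention applied along the channel dimension (sequence length = C)."""
    def __init__(self, tokens, heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(tokens, heads, batch_first=True)

    def forward(self, q_src, kv_src):
        # Transpose so channels become the sequence dimension: (N, C, HW).
        q, kv = q_src.transpose(1, 2), kv_src.transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2)


class InteractiveBlock(nn.Module):
    """Cascade: token-level interaction followed by channel-level interaction."""
    def __init__(self, dim, tokens):
        super().__init__()
        self.token_attn = TokenAttention(dim)
        self.channel_attn = ChannelAttention(tokens)

    def forward(self, ir_tokens, vis_tokens):
        # Each modality queries the other at the token (spatial) level...
        ir_t = ir_tokens + self.token_attn(ir_tokens, vis_tokens)
        vis_t = vis_tokens + self.token_attn(vis_tokens, ir_tokens)
        # ...and then again at the channel level.
        ir_c = ir_t + self.channel_attn(ir_t, vis_t)
        vis_c = vis_t + self.channel_attn(vis_t, ir_t)
        return ir_c, vis_c
```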

Image fusion; Self-attention mechanism; Feature interaction; Deep learning; Multi-modality images

杨帆、王志社、孙婧、余朝发


School of Applied Science, Taiyuan University of Science and Technology, Taiyuan 030024, China

Ordnance NCO Academy, Army Engineering University of PLA, Wuhan 430075, China


Fundamental Research Program of Shanxi Province

202203021221144

2024

Acta Photonica Sinica
Chinese Optical Society; Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.948
ISSN:1004-4213
Year, Volume (Issue): 2024, 53(6)