Convolutional neural networks (CNNs) applied to the task of fusing infrared and visible images have recently achieved strong fusion results. Many of these methods are built on autoencoder architectures, which are trained in a self-supervised manner and rely on hand-designed fusion strategies to fuse features in the testing phase. However, existing autoencoder-based methods rarely make full use of both shallow and deep features, and CNNs are limited by their receptive field, which makes it difficult to establish long-range dependencies and therefore loses global information. In contrast, the Transformer, with the help of its self-attention mechanism, can establish long-range dependencies and effectively obtain global contextual information. As for fusion strategies, most methods are designed in a coarse way and do not specifically consider the characteristics of the different modal images. Therefore, a CNN and a Transformer are combined in the encoder so that the encoder can extract more comprehensive features, and an attention model is applied in the fusion strategy to refine the features in a more fine-grained way. Experimental results show that the proposed fusion algorithm achieves excellent results in both subjective and objective evaluations compared with other image fusion algorithms.
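To make the two core ideas concrete, the following is a minimal PyTorch sketch of a hybrid encoder that pairs convolutional layers (shallow, local features) with a Transformer block (long-range, global context), together with an attention-based fusion strategy that weights the infrared and visible features adaptively instead of using a fixed rule such as averaging. The layer sizes, the single Transformer layer, and the squeeze-and-excitation-style channel attention are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class HybridEncoder(nn.Module):
    """CNN stem for local features, followed by a Transformer encoder
    layer that models long-range dependencies across spatial positions."""

    def __init__(self, in_ch=1, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=dim * 2, batch_first=True
        )

    def forward(self, x):
        f = self.conv(x)                       # (B, C, H, W) local features
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C), one token per pixel
        tokens = self.transformer(tokens)      # global self-attention
        g = tokens.transpose(1, 2).reshape(b, c, h, w)
        return f + g                           # combine shallow (local) and deep (global) features


class AttentionFusion(nn.Module):
    """Channel-attention fusion: per-channel weights are predicted from
    the concatenated infrared/visible features, so each modality
    contributes adaptively rather than by a hand-fixed rule."""

    def __init__(self, dim=64):
        super().__init__()
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * dim, 2 * dim, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * dim, 2 * dim, 1), nn.Sigmoid(),
        )
        self.dim = dim

    def forward(self, f_ir, f_vis):
        w = self.weight(torch.cat([f_ir, f_vis], dim=1))
        w_ir, w_vis = w[:, :self.dim], w[:, self.dim:]
        return w_ir * f_ir + w_vis * f_vis


if __name__ == "__main__":
    enc, fuse = HybridEncoder(), AttentionFusion()
    ir = torch.randn(1, 1, 32, 32)   # infrared input
    vis = torch.randn(1, 1, 32, 32)  # visible input
    fused = fuse(enc(ir), enc(vis))
    print(fused.shape)               # torch.Size([1, 64, 32, 32])
```

In this sketch the residual sum `f + g` is one simple way to retain both the shallow convolutional features and the deep Transformer features; a decoder (not shown) would then reconstruct the fused image from the fused feature map.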