
Object Detection Algorithm Based on CNN-Transformer Dual-Modal Feature Fusion

To address the shortcomings of single-modal object detection, an object detection algorithm based on CNN-Transformer dual-modal feature fusion is proposed. Building on YOLOv5, a dual-stream feature extraction network is constructed that takes infrared and visible-light images as simultaneous inputs. A CNN-based backbone for infrared feature extraction and a Transformer-based backbone for visible-light feature extraction are then proposed to strengthen feature extraction for each modality. Finally, following a mid-level fusion strategy, a dual-modal feature fusion module is designed that effectively fuses the dual-modal features at corresponding scales of the two branches, achieving cross-modal information complementarity. The proposed algorithm is validated on several datasets. Experimental results show that on the KAIST dataset, detection on dual-modal images improves accuracy by 5.7% and 17.4% over the baseline detecting infrared and visible-light images alone, respectively; on the FLIR dataset, the corresponding improvements over the baseline are 11.6% and 17.1%; on the self-built GIR dataset, detection accuracy also improves markedly. In addition, the algorithm can process infrared or visible-light images individually, again with clear accuracy gains over the baseline.
Object Detection Algorithm Based on CNN-Transformer Dual Modal Feature Fusion
To overcome the limitations of single-modal object detection, this study proposes a dual-modal feature fusion object detection algorithm based on CNN-Transformer. By fully leveraging the clear contour information of infrared images and the rich detail information of visible-light images, integrating the complementary information of the two modalities significantly enhances object detection performance and extends applicability to more complex real-world scenarios. The core of the proposed algorithm is a dual-stream feature extraction network that processes infrared and visible-light images simultaneously. Since infrared images provide clear contours that can guide object localization, a CNN-based Feature Extraction (CFE) module is adopted for the infrared branch to better capture object location information and improve the expressive power of the features. Visible-light images, in contrast, contain rich detail such as color and texture distributed across the whole image, so a Transformer-based Feature Extraction (TFE) module is adopted for the visible branch to better capture global context and fine detail. This differentiated feature extraction strategy exploits the respective advantages of the two modalities, enabling the algorithm to adapt to object detection tasks under different scenes and conditions. In addition, a dual-modal feature fusion module is introduced that fuses the feature information of the two modalities through effective inter-modal interaction; it preserves the original features of each modality while realizing inter-modal complementarity, enhancing the expressiveness of object features and further improving multimodal detection performance. To validate the algorithm, extensive experiments were conducted on three datasets: the public KAIST and FLIR datasets and the self-built GIR dataset, which contain infrared images, visible-light images, and infrared-visible image pairs. Training and testing on these multimodal images evaluates the applicability and performance of the algorithm in various situations. The results show that on KAIST the proposed algorithm improves detection accuracy on dual-modal images by 5.7% and 17.4% over the baseline on infrared and visible-light images, respectively; on FLIR, the corresponding gains are 11.6% and 17.1%; and on the self-built GIR dataset the algorithm also yields a notable improvement. Moreover, the algorithm can independently process either infrared or visible-light images, with significant accuracy gains over the baseline in both cases, further validating its applicability and robustness. During visualization, it can also flexibly display detection results on either the visible or the infrared image to meet different needs.
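The dual-stream design described above (a CNN-style branch for infrared, an attention-style branch for visible light, and mid-level fusion of same-scale features) can be illustrated with a minimal NumPy sketch. All names, shapes, and operators here are illustrative assumptions, not the paper's actual implementation, which builds on YOLOv5.

```python
# Illustrative sketch (not the paper's code): a CNN-style branch for the
# infrared image, an attention-style branch for visible patch tokens, and
# channel-wise mid-level fusion of same-scale feature maps.
import numpy as np

def cfe_branch(ir, kernel):
    """CNN-based Feature Extraction (CFE) stand-in: one 'valid' 2-D
    convolution over the infrared image followed by ReLU."""
    kh, kw = kernel.shape
    h, w = ir.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(ir[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)                    # ReLU

def tfe_branch(tokens):
    """Transformer-based Feature Extraction (TFE) stand-in: single-head
    self-attention over flattened visible-light patch tokens."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ tokens                           # global mixing of tokens

def fuse(f_ir, f_vis):
    """Mid-level fusion stand-in: concatenate same-scale features from the
    two branches along the channel axis so later layers see both modalities."""
    return np.concatenate([f_ir, f_vis], axis=-1)

rng = np.random.default_rng(0)
ir = rng.random((6, 6))                            # toy infrared image
vis = rng.random((16, 8))                          # 16 visible patch tokens, dim 8
f_ir = cfe_branch(ir, np.ones((3, 3)) / 9.0)[..., None]   # (4, 4, 1)
f_vis = tfe_branch(vis).reshape(4, 4, 8)                  # (4, 4, 8)
fused = fuse(f_ir, f_vis)
print(fused.shape)  # (4, 4, 9)
```

Concatenation is only one plausible fusion operator; the paper's module performs richer inter-modal interaction. The essential constraint the sketch shows is scale matching: both branches must produce feature maps of the same spatial resolution before their channels can be fused.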

Object detection; Convolutional neural networks; Transformer; Dual modal; Feature fusion; Infrared; Visible light

Yang Chen, Hou Zhiqiang, Li Xinyue, Ma Sugang, Yang Xiaobao


School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China

Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an 710121, China


National Natural Science Foundation of China; Natural Science Foundation of Shaanxi Province

62072370; 2023-JC-YB-598

2024

Acta Photonica Sinica (光子学报)
Chinese Optical Society; Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences

Indexed in: CSTPCD; PKU Core Journals (北大核心)
Impact factor: 0.948
ISSN:1004-4213
Year, Volume (Issue): 2024, 53(3)