Multi-modal Perception Fusion Method Based on Cross Attention
To address the limited perception capability of single sensors and the complexity of late-fusion processing for multiple sensors in intelligent-vehicle road target detection, this study proposes a multi-modal perception fusion method based on Transformer cross-attention. First, exploiting the ability of cross-attention to effectively fuse multi-modal information, an end-to-end fusion perception network was constructed that receives the outputs of the visual and point-cloud detection networks and performs post-fusion processing. Second, the 3D detections of the point-cloud detection network were processed for high recall and fed into the network together with the 2D detections output by the visual detector. Finally, the network fused the 2D target information with the 3D information and output corrections to the 3D detections, yielding more accurate post-fusion detection results. Validation on the KITTI public dataset showed that, after introducing 2D detection information through the proposed fusion method, the comprehensive average improvements across the car, cyclist, and pedestrian categories were 7.07%, 2.82%, 2.46%, and 1.60% over the four baseline methods PointPillars, PointRCNN, PV-RCNN, and CenterPoint, respectively. Compared with rule-based post-fusion methods, the proposed fusion network achieved average improvements of 1.88% and 4.90% on moderate- and hard-difficulty samples for pedestrians and cyclists, respectively, indicating stronger adaptability and generalization. Finally, a real-vehicle test platform was built and the algorithm was validated: a qualitative visual analysis of selected real-vehicle test scenarios confirmed the proposed detection method and network model under actual road conditions.
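The core fusion step described above can be sketched as single-head cross-attention in which high-recall 3D detections act as queries and 2D visual detections supply keys and values. The sketch below is a minimal, hypothetical NumPy illustration; the feature dimension, projection matrices, and residual-correction form are assumptions for illustration, not the paper's actual network architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(feat_3d, feat_2d, Wq, Wk, Wv):
    """Fuse 2D detection features into 3D detection queries.

    feat_3d: (N3, d) embeddings of high-recall 3D point-cloud detections (queries)
    feat_2d: (N2, d) embeddings of 2D visual detections (keys/values)
    """
    Q, K, V = feat_3d @ Wq, feat_2d @ Wk, feat_2d @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (N3, N2): relevance of each 2D box to each 3D box
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return feat_3d + attn @ V                # residual correction of the 3D features

d = 16
feat_3d = rng.normal(size=(5, d))  # e.g. 5 point-cloud detections
feat_2d = rng.normal(size=(8, d))  # e.g. 8 camera detections
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention_fuse(feat_3d, feat_2d, Wq, Wk, Wv)
print(fused.shape)  # (5, 16): one corrected feature per 3D detection
```

In a full pipeline, the fused features would then pass through a regression head that outputs the corrected 3D box parameters; here only the attention-based fusion itself is shown.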