Multimodal generative image data augmentation for 3D object detection
Traditional generative image data augmentation algorithms usually lose 3D attribute information, rendering them unsuitable for 3D object detection in autonomous driving. To address this problem, we propose a multimodal image enhancement algorithm based on the stable diffusion model, and build on it a data augmentation method specifically designed for 3D object detection. The algorithm further constrains the image generation process by introducing additional modal inputs. In addition, a multimodal feature online generation module is devised to extract information such as scene descriptions, semantic distributions, and depth features in real time. Meanwhile, for the multimodal feature fusion network, an enhanced gated self-attention module is designed to accurately capture depth information in the latent feature space. This effectively preserves the 3D attribute information of the image while allowing targeted modifications to 2D features such as texture, color, and illumination. Leveraging the algorithm's depth-preserving characteristics, the generated images are combined with 3D pseudo-labels to create new image samples, thereby achieving data augmentation. 3D detection results on the nuScenes public dataset demonstrate the effectiveness of our algorithm in preserving 3D attributes, particularly for larger categories such as buses and trucks, whose AP values show noticeable improvements of 17.2% and 14.1%, respectively. In addition, mAP and NDS increase by 6.8% and 3.4%, respectively.
Keywords: data enhancement; stable diffusion; image generation; object detection; feature fusion
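The gated self-attention fusion mentioned in the abstract can be illustrated with a minimal sketch: extra modal tokens (e.g., depth features) are concatenated with the visual tokens, self-attention is applied over the joint sequence, and the result is added back through a learnable tanh gate so the fusion starts near the identity. This is a hedged, pure-Python illustration with a single head and identity Q/K/V projections, not the paper's implementation; the function names `self_attention` and `gated_self_attention` and the gate parameter `gamma` are hypothetical.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    # single-head scaled dot-product attention with identity
    # Q/K/V projections, for illustration only
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in tokens])
        out.append([sum(w * v[j] for w, v in zip(scores, tokens))
                    for j in range(d)])
    return out

def gated_self_attention(visual_tokens, depth_tokens, gamma):
    # attend over the joint (visual + depth) sequence, keep only the
    # visual positions, and add them back through a tanh gate; with
    # gamma initialized at 0 the module starts as the identity map
    joint = visual_tokens + depth_tokens
    attended = self_attention(joint)[:len(visual_tokens)]
    gate = math.tanh(gamma)
    return [[x + gate * a for x, a in zip(xv, av)]
            for xv, av in zip(visual_tokens, attended)]
```

The zero-initialized gate is the key design choice: conditioning on new modalities is injected gradually, so pretrained generation behavior is not disturbed at the start of training.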