Precise image translation based on conditional diffusion model for driving scenarios
Objective Safety is the most important consideration for autonomous driving vehicles. New autonomous driving methods require numerous training and testing processes before they can be applied in real vehicles. However, training and testing autonomous driving methods directly in real-world scenarios is costly and risky. Many researchers therefore first train and test their methods in simulated scenarios and then transfer the learned knowledge to real-world scenarios. However, the two kinds of scenarios differ considerably in scene modeling, lighting, and vehicle dynamics, so an autonomous driving model trained in simulated scenarios cannot be effectively generalized to real-world scenarios. With the development of deep learning, image translation, which aims to transform the content of an image from one presentation form to another, has achieved considerable success in many fields, such as image beautification, style transfer, scene design, and video special effects. If image translation is applied to translating simulated driving scenarios into real ones, it can not only alleviate the poor generalization of autonomous driving models but also effectively reduce the cost and risk of training in real scenarios. Unfortunately, existing image translation methods applied to autonomous driving lack datasets of paired simulated and real scenarios, and most mainstream image translation methods are based on generative adversarial networks (GANs), which suffer from mode collapse and unstable training. The generated images also exhibit numerous detail problems, such as distorted object contours and unnatural small objects in the scene. These problems not only degrade the perception module of autonomous driving, which in turn affects driving decisions, but also harm the evaluation metrics of image translation. In this paper, a multimodal conditional diffusion model based on the denoising diffusion probabilistic model (DDPM), which has achieved remarkable success in various image generation tasks, is proposed to address the problems of insufficient paired simulation-real data, mode collapse, unstable training, and inadequate diversity of generated data in existing image translation methods.
Method First, an image translation method based on the diffusion model, which offers good training stability and generative diversity, is proposed to solve the mode collapse and unstable training problems of existing mainstream GAN-based image translation methods. Second, a multimodal feature fusion method based on a multihead self-attention mechanism is developed to address the limitation of traditional diffusion models, which cannot integrate prior information to control the image generation process. The proposed method feeds the early-fused data to a convolutional layer to extract high-level features and then obtains high-level fused feature vectors through the multihead self-attention mechanism. Finally, considering that semantic segmentation and depth maps can precisely represent contour and depth information, respectively, the conditional diffusion model (CDM) is designed by fusing the semantic segmentation map and the depth map with the noise image before sending them to the denoising network. In this model, the semantic segmentation map, depth map, and noise image can perceive one another through the proposed multimodal feature fusion method, and the output fusion features are fed to the next sublayer of the network.
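As a rough, hypothetical illustration only (not the authors' released code), the fusion step described above might be sketched in PyTorch as follows. The channel sizes, the use of a color-coded segmentation map, early fusion by channel concatenation, and the single convolution are assumptions made for this sketch.

```python
# Hypothetical sketch of the multimodal fusion block: the noisy image, semantic
# segmentation map, and depth map are concatenated early, passed through a
# convolution to extract high-level features, and then fused with multihead
# self-attention. Channel sizes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, img_ch=3, seg_ch=3, depth_ch=1, dim=128, heads=8):
        super().__init__()
        # Early fusion: stack the three modalities along the channel axis.
        self.conv = nn.Conv2d(img_ch + seg_ch + depth_ch, dim, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, dim)
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

    def forward(self, noisy_img, seg_map, depth_map):
        x = torch.cat([noisy_img, seg_map, depth_map], dim=1)   # (B, C, H, W)
        feat = self.norm(self.conv(x))                          # high-level features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                # (B, H*W, C) spatial tokens
        fused, _ = self.attn(tokens, tokens, tokens)            # multihead self-attention fusion
        return fused.transpose(1, 2).reshape(b, c, h, w)        # fused features for the next sublayer
```

In the described architecture, a block of this kind would presumably stand in for the self-attention layers of the U-Net denoising network so that the conditioning maps and the noisy image can attend to one another.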
After the iterative denoising process, the final output of the denoising network contains semantic and depth information; thus, the semantic segmentation and depth maps play a conditional guiding role in the diffusion model. Following the settings of the DDPM, a U-Net is adopted as the denoising network. Compared with the U-Net in the DDPM, its self-attention layers are replaced with the improved self-attention proposed in this paper so that the fusion features can be learned effectively. After the denoising network of the CDM is trained, the proposed model can be applied to simulated-to-real image translation. Noise is first added to simulated images collected from the Carla simulator, and the paired semantic segmentation and depth maps are then sent to the denoising network together with the noisy images to perform step-by-step denoising. Finally, realistic driving scene images are obtained, realizing image translation with precise contour details and distances that remain consistent between the simulated and translated images.
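For concreteness, a simplified sketch of the conditional reverse (denoising) loop is given below, following the standard DDPM sampling rule. The `denoise_net` interface, the noise schedule handling, and the choice of a partial noising step `t_start` are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of conditional DDPM sampling for simulated-to-real translation.
# `denoise_net` is assumed to predict the noise eps from the noisy image, timestep,
# and the segmentation/depth conditions (e.g., via a fusion block as sketched above).
import torch

@torch.no_grad()
def translate(denoise_net, sim_img, seg_map, depth_map, betas, t_start):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Forward step: perturb the simulated image with Gaussian noise up to step t_start.
    eps = torch.randn_like(sim_img)
    x = torch.sqrt(alpha_bars[t_start]) * sim_img + torch.sqrt(1 - alpha_bars[t_start]) * eps

    # Reverse steps: denoise step by step, guided by the segmentation and depth maps.
    for t in reversed(range(t_start + 1)):
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps_pred = denoise_net(x, t_batch, seg_map, depth_map)
        coef = betas[t] / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps_pred) / torch.sqrt(alphas[t])       # DDPM posterior mean
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # translated, real-looking driving scene image
```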
Result The model is trained on the Cityscapes dataset and compared with state-of-the-art (SOTA) methods from recent years. Experimental results indicate that the proposed approach achieves superior translation results with improved semantic precision and richer contour details. The evaluation metrics are the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS), which measure the similarity between the generated and original images and the difference among the generated images, respectively. A lower FID score represents better generation quality, i.e., a smaller gap between the generated and real image distributions, whereas a higher LPIPS value indicates better generation diversity. Compared with the SOTA methods, the proposed method achieves better FID and LPIPS scores of 44.20 and 0.377, respectively.
Conclusion In this paper, a novel image-to-image translation method based on a conditional diffusion model, together with a multimodal fusion method based on a multihead attention mechanism, is proposed for autonomous driving scenarios. Experimental results show that the proposed method can effectively solve the problems of insufficient paired datasets, imprecise translation results, unstable training, and insufficient generation diversity in existing image translation methods. Thus, this method improves the image translation precision of driving scenarios and provides theoretical support and a data basis for realizing safe and practical autonomous driving systems.
Keywords: simulation to reality; image translation; diffusion model; multimodal fusion; driving scenario