Emerging three-dimensional (3D) object detection technology plays a key role in autonomous driving: by providing environmental perception and obstacle detection, it forms the basis for the decision-making and control of autonomous driving systems. Many scholars have comprehensively examined and studied the outstanding methodologies and achievements in this field. However, because the technology is updated and advances rapidly, continuously tracking the latest developments and keeping pace with the knowledge frontier is not only a crucial task for the academic community but also a foundation for addressing emerging challenges. This paper reviews the emerging results of the past two years and systematically expounds the cutting-edge theories in this direction. First, the background of 3D object detection is briefly introduced and related survey studies are reviewed. Then, several popular datasets, such as KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago), are summarized in terms of data scale and diversity, and the evaluation principles of the corresponding benchmarks are further introduced. Next, dozens of recent detection methods are divided, according to the type and number of sensors, into five categories: monocular-based, stereo-based, multi-view-based, LiDAR-based, and multimodal-based, and each category is further subdivided according to differences in model architecture or data preprocessing. Within each category, representative algorithms are first briefly reviewed, the most cutting-edge methods in that category are then surveyed in detail, and the potential development prospects and severe challenges currently faced by the category are further analyzed in depth. Finally, future research directions in the field of 3D object detection are discussed.
Survey of 3D object detection algorithms for autonomous driving
Conventional two-dimensional (2D) object detection technology primarily emphasizes classifying the target to be detected and defining its bounding box in image space coordinates but lacks the capability to provide accurate information regarding the real three-dimensional (3D) spatial position of the target. This limitation restricts its applicability in autonomous driving systems (ADs), particularly for tasks such as obstacle avoidance and path planning in real 3D environments. The emerging field of 3D object detection represents a substantial technological advancement. This field primarily relies on neural networks to extract features from input data, commonly obtained from camera images or LiDAR-captured point clouds. Following feature extraction, 3D object detection predicts the category of the target and furnishes crucial data, including its spatial coordinates, dimensions, and yaw angle in a real-world coordinate system. This detection facilitates the provision of essential preliminary information for subsequent operations, such as object tracking, trajectory forecasting, and path planning. Consequently, this technology has assumed a vital role within the field of ADs, serving as a cornerstone within the domain of perception tasks. The field of 3D object detection has witnessed the emergence of numerous exceptional methodologies with notable accomplishments. Several scholars have conducted comprehensive reviews and in-depth assessments of these pertinent works and their associated outcomes. However, prior reviews may have omitted the latest developments due to the rapid evolution of technology within the domain of computer vision. Therefore, constantly monitoring the most recent advancements and remaining at the frontline of this realm is not only an imperative task for the academic community but also a fundamental endeavor to respond effectively to the emerging challenges posed by incessant and rapid technological progress. Based on the preceding considerations, this paper conducts a systematic review of the latest developments and cutting-edge theories in the realm of 3D object detection. In contrast to earlier review studies, the current work offers distinct advantages because it includes more cutting-edge methodologies and covers a broader spectrum of fields. For example, while most prior reviews predominantly concentrated on individual sensors, this work uniquely incorporates a multitude of diverse sensor types. Moreover, this work encompasses a wide array of distinct training strategies, ranging from semi-supervised and weakly supervised methods to active learning and knowledge distillation techniques, thereby substantially enhancing the breadth and depth of research within this field. Specifically, this work starts with a concise contextualization of the progress of the field and conducts a brief examination of pertinent review research. Subsequently, the fundamental definition of 3D object detection is explored, and multiple widely used datasets are comprehensively summarized in terms of data scale and diversity, extending the discourse to an introduction of the evaluation criteria integral to the relevant benchmark assessments. Among these datasets, three widely recognized datasets are particularly highlighted: KITTI, nuScenes, and Waymo Open. Next, the multitude of detection methods proposed over the past two years is categorized into five distinct groups, primarily dictated by the type and quantity of sensors involved: monocular-based, stereo-based, multi-view-based, LiDAR-based, and multimodal-based. Additionally, further subcategorization is conducted within each group according to the specific data preprocessing methods or model architectures utilized. Within each method category grounded in a distinct sensor type, the examination starts with a comprehensive review of the pioneering representative algorithms. An intricate exposition of the latest and most advanced methodologies within that specific domain is then offered. Furthermore, an in-depth analysis of the prospective development pathways and the formidable challenges currently encountered by that category is conducted. Among the five categories, the monocular method relies solely on a single camera for the classification and localization of environmental objects. This approach is cost-effective and easy to implement. However, it grapples with the ill-posed problem of regressing depth information from monocular images, which frequently results in reduced accuracy. The stereo-based method leverages stereo image pairs to enforce geometric constraints, leading to more precise depth estimation and comparatively higher detection accuracy. However, the requirement for stereo camera calibration drives up deployment costs, and the method remains susceptible to environmental factors. The multi-view-based method seeks to establish a unified feature space through the utilization of multiple surrounding cameras. Unlike the first two approaches, this method provides improved safety and practicality owing to its panoramic perspective. However, the absence of direct constraints between cameras preserves its inherently ill-posed nature. LiDAR-based methods excel in directly providing accurate depth information, which eliminates the need for additional depth estimation and leads to enhanced detection efficiency and accuracy compared with image-centric methods. Despite these advantages, the substantial hardware cost associated with LiDAR poses a considerable financial burden on real-world deployments. Multimodal-based approaches leverage the complementary advantages of image and point cloud data, albeit at the cost of the increased computational time required for the concurrent processing of both modalities. In a broader context, each of the five method categories exhibits unique strengths and limitations, necessitating a careful selection based on financial considerations and specific application prerequisites during real-world engineering deployment. Upon concluding the exhaustive review of all methodologies, comprehensive statistical analyses of these techniques are conducted on datasets such as KITTI, nuScenes, and Waymo Open. The statistical evaluations encompass detection performance and inference time. In this research, we have meticulously reviewed 3D object detection algorithms in the context of autonomous driving. This comprehensive study encompasses detection algorithms based on various mainstream sensors and includes an exploration of the latest advancements in this field. Subsequently, we perform a comprehensive statistical analysis and comparison of the performance and latency demonstrated by all the methods on widely recognized datasets. A summary of the current research status is presented, and prospects for future research directions are provided.
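To make the detection output and the benchmark evaluation criteria mentioned above concrete, the following is a minimal illustrative sketch in Python: a 7-degree-of-freedom box (category, center coordinates, dimensions, and yaw in a real-world frame) of the kind a 3D detector predicts, together with a rotated-box 3D intersection-over-union (IoU) of the sort KITTI-style benchmarks threshold when matching predictions to ground truth. The field names, the vertical-center convention, and the use of shapely for the rotated overlap are assumptions made for this sketch; it is not the official evaluation code of any benchmark or surveyed method.

```python
# Minimal sketch (not from the surveyed methods): a 7-DoF 3D box and a
# rotated-box 3D IoU, combining bird's-eye-view (BEV) footprint overlap
# with vertical overlap. Conventions here are illustrative assumptions.
from dataclasses import dataclass
import math

from shapely.geometry import Polygon  # used only for rotated-rectangle overlap


@dataclass
class Box3D:
    cls: str            # object category, e.g. "Car"
    x: float            # box center in a real-world (e.g. LiDAR) frame, meters
    y: float
    z: float            # vertical center (a per-sketch convention)
    l: float             # length, width, height in meters
    w: float
    h: float
    yaw: float          # heading angle around the vertical axis, radians
    score: float = 1.0  # detector confidence (ground truth keeps the default)

    def bev_polygon(self) -> Polygon:
        """Ground-plane footprint of the box as a rotated rectangle."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        corners = []
        for dx, dy in ((self.l / 2, self.w / 2), (self.l / 2, -self.w / 2),
                       (-self.l / 2, -self.w / 2), (-self.l / 2, self.w / 2)):
            corners.append((self.x + c * dx - s * dy, self.y + s * dx + c * dy))
        return Polygon(corners)


def iou_3d(a: Box3D, b: Box3D) -> float:
    """3D IoU: BEV footprint intersection area times vertical overlap."""
    bev_inter = a.bev_polygon().intersection(b.bev_polygon()).area
    z_overlap = max(0.0, min(a.z + a.h / 2, b.z + b.h / 2)
                    - max(a.z - a.h / 2, b.z - b.h / 2))
    inter = bev_inter * z_overlap
    union = a.l * a.w * a.h + b.l * b.w * b.h - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    gt = Box3D("Car", x=10.0, y=2.0, z=0.9, l=4.2, w=1.8, h=1.6, yaw=0.05)
    pred = Box3D("Car", x=10.3, y=2.1, z=0.95, l=4.0, w=1.8, h=1.6, yaw=0.0, score=0.87)
    # KITTI, for example, counts a car prediction as a true positive at 3D IoU >= 0.7.
    print(f"3D IoU = {iou_3d(gt, pred):.3f}")
```

Benchmarks then sweep the detector's confidence scores over such matches to compute average precision (and, in nuScenes, additional translation, scale, and orientation error terms), which is the basis of the performance statistics reported later in this survey.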
autonomous driving; 3D object detection; monocular; stereo; multi-view; light detection and ranging (LiDAR); multimodal