Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods either convert the estimated depth maps to pseudo-LiDAR and then apply LiDAR-based object detectors, or focus on image and depth fusion learning. However, they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex convolution-based fusion. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps to achieve more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them using cross-attention so that the branches exchange information with each other. Furthermore, with the help of pixel-wise relative depth values in depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.
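The two-branch fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the form of the depth-based relative position embeddings, so the additive bias below (a penalty proportional to the absolute depth difference between two patches) is a hypothetical stand-in, and `cross_attention`, `relative_depth_bias`, and `strength` are illustrative names. The sketch also omits learned projections, multiple heads, and the shifted-window scheme of Swin Transformer, and uses plain Python lists so that it runs without dependencies.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, bias=None):
    """Single-head scaled dot-product cross-attention.

    queries: n_q x d, keys: n_k x d, values: n_k x d_v,
    bias: optional n_q x n_k matrix added to the attention logits.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    fused = []
    for i, q in enumerate(queries):
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        if bias is not None:
            logits = [l + b for l, b in zip(logits, bias[i])]
        weights = softmax(logits)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


def relative_depth_bias(depths_q, depths_k, strength=1.0):
    """Hypothetical bias (an assumption, not from the paper):
    patches at similar depth attend to each other more strongly."""
    return [[-strength * abs(dq - dk) for dk in depths_k] for dq in depths_q]


# Toy inputs: three aligned patches, each with an RGB token, a depth token,
# and a mean depth value taken from the estimated depth map.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_tokens = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
patch_depths = [5.0, 5.5, 40.0]

bias = relative_depth_bias(patch_depths, patch_depths)

# Each branch queries the other one, so information flows in both directions.
rgb_fused = cross_attention(rgb_tokens, depth_tokens, depth_tokens, bias)
depth_fused = cross_attention(depth_tokens, rgb_tokens, rgb_tokens, bias)
```

With this toy bias, the first two patches (depths 5.0 and 5.5) mostly attend to each other, while the distant third patch (depth 40.0) attends almost only to itself. A learned embedding indexed by quantized relative depth would be a closer analogue of the relative position tables used in Swin Transformer, but the abstract does not say which form the paper adopts.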

Authors:

Cong Pan, Junran Peng, Zhaoxiang Zhang

Affiliations:

Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190

School of Future Technology, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China

Huawei Inc., Beijing 100085, China

Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190

University of Chinese Academy of Sciences (UCAS), Beijing 100049

Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (HKISI CAS), Hong Kong 999077, China

Funding:

Major Project for New Generation of AI; National Natural Science Foundation of China (four grants); InnoHK Program

Grant numbers:

2018AAA0100400; 61836014; U21B2042; 62072457; 62006231

Publication year:

2024

DOI:

10.1109/JAS.2023.123660

Journal: IEEE/CAA Journal of Automatica Sinica (自动化学报(英文版))

Indexed in: CSTPCD, EI

ISSN: 2329-9266

Year, volume (issue): 2024, 11(3)

Cong Pan, Junran Peng, Zhaoxiang Zhang. Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(3): 673-689. DOI: 10.1109/JAS.2023.123660.

Number of references: 91