
A survey of lightweight visual localization technology

Visual localization aims to recover the camera pose of a currently observed image with respect to a known 3D scene. With its low cost, high accuracy, and ease of integration, visual localization is one of the key technologies for enabling intelligent interaction between computing devices and the real world, and it has attracted broad attention from application domains such as mixed reality and autonomous driving. As a long-standing fundamental task in computer vision, visual localization has achieved remarkable research progress. However, existing methods generally suffer from excessive computational overhead and storage consumption, which makes efficient deployment on mobile platforms and the update and maintenance of scene models difficult, thereby largely limiting the practical application of visual localization technology. To address this problem, a growing body of research has begun to focus on making visual localization lightweight. Lightweight visual localization studies more efficient scene representations and the corresponding localization methods, and it is gradually becoming an important research direction in the field. This survey first reviews early visual localization frameworks and then categorizes existing lightweight visual localization work from the perspective of scene representation. For each category, it analyzes and summarizes the characteristics and advantages, application scenarios, and technical difficulties, and introduces representative achievements. Furthermore, this survey compares the performance of several representative lightweight visual localization methods on commonly used indoor and outdoor datasets, with evaluation metrics covering three dimensions: offline mapping time, storage footprint of the scene map, and localization accuracy. Existing lightweight visual localization techniques still face many difficulties and challenges; the representational capacity of scene models and the generalization and robustness of localization methods leave considerable room for improvement. Finally, this survey analyzes and forecasts future development trends of lightweight visual localization.
Lightweight visual-based localization technology
Visual-based localization determines the camera translation and orientation of an image observation with respect to a prebuilt 3D representation of the environment. It is an essential technology that empowers intelligent interactions between computing facilities and the real world. Compared with alternative positioning systems, the capability to estimate an accurate 6DOF camera pose, along with the flexibility and frugality in deployment, positions visual-based localization technology as a cornerstone of many applications, ranging from autonomous vehicles to augmented and mixed reality. As a long-standing problem in computer vision, visual localization has made remarkable progress over the past decades. A primary branch of prior art relies on a preconstructed 3D map obtained by structure-from-motion techniques. Such 3D maps, a.k.a. SfM point clouds, store 3D points and per-point visual features. To estimate the camera pose, these methods typically establish correspondences between 2D keypoints detected in the query image and 3D points of the SfM point cloud through descriptor matching. The 6DOF camera pose of the query image is then recovered from these 2D-3D matches by leveraging geometric principles introduced by photogrammetry. Despite delivering fairly sound and reliable performance, such a scheme often has to consume several gigabytes of storage for just a single scene, which results in computationally expensive overhead and a prohibitive memory footprint for large-scale applications and resource-constrained platforms. Furthermore, it suffers from other drawbacks, such as costly map maintenance and privacy vulnerability. The aforementioned issues pose a major bottleneck in real-world applications and have thus prompted researchers to shift their focus toward leaner solutions. Lightweight visual-based localization seeks to introduce improvements in scene representations and the associated localization methods, making the resulting framework computationally tractable and memory-efficient without incurring a notable performance expense.

As background, this literature review first introduces several flagship frameworks of the visual-based localization task as preliminaries. These frameworks can be broadly classified into three categories: image-retrieval-based methods, structure-based methods, and hierarchical methods. The 3D scene representations adopted in these conventional frameworks, such as reference image databases and SfM point clouds, generally exhibit a high degree of redundancy, which causes excessive memory usage and inefficiency in distinguishing scene features for descriptor matching. Next, this review provides a guided tour of recent advances that promote the brevity of 3D scene representations and the efficiency of the corresponding visual localization methods. From the perspective of scene representations, existing research efforts in lightweight visual localization can be classified into six categories. Within each category, this literature review analyzes its characteristics, application scenarios, and technical limitations while also surveying some of the representative works. First, several methods have been proposed to enhance memory efficiency by compressing SfM point clouds. These methods reduce the size of SfM point clouds through a combination of techniques including feature quantization, keypoint subset sampling, and feature-free matching. Extreme compression rates, such as 1% and below, can be achieved with barely noticeable accuracy degradation. Employing line maps as scene representations has also become a focus of research in the field of lightweight visual localization. In human-made scenes characterized by salient structural features, the substitution of line maps for point clouds offers two major merits: 1) the abundance and rich geometric properties of line segments make line maps a concise option for depicting the environment; 2) line features exhibit better robustness in weakly textured areas or under temporally varying lighting conditions. However, the lack of a unified line descriptor and the difficulty of establishing 2D-3D correspondences between 3D line segments and image observations remain the main challenges. In the field of autonomous driving, high-definition maps constructed from vectorized semantic features have unlocked a new wave of cost-effective and lightweight solutions to visual localization for self-driving vehicles. Recent trends involve the utilization of data-driven techniques to learn to localize. This end-to-end philosophy has given rise to two categories of regression-based methods. Scene coordinate regression (SCR) methods eschew the explicit processes of feature extraction and matching. Instead, they establish a direct mapping between observations and scene coordinates through regression. While a grounding in geometry remains essential for camera pose estimation in SCR methods, pose regression methods employ deep neural networks to establish the mapping from image observations to camera poses without any explicit geometric reasoning. Absolute pose regression techniques are akin to image retrieval approaches with limited accuracy and generalization capability, while relative pose regression techniques typically serve as a postprocessing step following the coarse localization stage. Neural radiance fields and related volumetric approaches have emerged as a novel form of neural implicit scene representation. While visual localization based solely on a learned volumetric implicit map is still in an exploratory phase, the progress made over the past year or two has already yielded impressive performance in terms of scene representation capability and localization precision.

Furthermore, this study quantitatively evaluates the performance of several representative lightweight visual localization methods on well-known indoor and outdoor datasets. Evaluation metrics, including offline mapping time, storage demand, and localization accuracy, are considered for making comparisons. Results reveal that SCR methods generally stand out among the existing work, boasting remarkably compact scene maps and high localization success rates. Existing lightweight visual localization methods have dramatically pushed the performance boundary. However, challenges still remain in terms of scalability and robustness when enlarging the scene scale and taking considerable visual disparity between query and mapping images into consideration. Therefore, extensive efforts are still required to promote the compactness of scene representations and improve the robustness of localization methods. Finally, this review provides an outlook on development trends in the hope of facilitating future research.
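As an illustration of the structure-based scheme summarized above, the following Python sketch shows how a camera pose can be recovered from 2D-3D matches between query-image keypoints and an SfM point cloud. It is a minimal example built on OpenCV; the array names (`query_kpts`, `query_descs`, `points_3d`, `point_descs`) are illustrative assumptions and do not refer to any specific system surveyed in the paper.

```python
# Minimal sketch of structure-based localization: descriptor matching + RANSAC-PnP.
import numpy as np
import cv2


def localize_query(query_kpts, query_descs, points_3d, point_descs, K):
    """Estimate the 6DOF camera pose of a query image from 2D-3D matches.

    query_kpts:  (N, 2) pixel coordinates of keypoints in the query image
    query_descs: (N, D) descriptors of those keypoints
    points_3d:   (M, 3) coordinates of SfM points
    point_descs: (M, D) per-point descriptors stored in the map
    K:           (3, 3) camera intrinsic matrix
    """
    # 1) Descriptor matching: mutual nearest neighbours between query and map descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(query_descs.astype(np.float32),
                            point_descs.astype(np.float32))
    if len(matches) < 4:          # PnP needs at least four correspondences
        return None

    pts_2d = np.float32([query_kpts[m.queryIdx] for m in matches])
    pts_3d = np.float32([points_3d[m.trainIdx] for m in matches])

    # 2) Geometric pose estimation: PnP inside a RANSAC loop to reject outlier matches.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, None,
        reprojectionError=8.0, iterationsCount=1000)
    if not ok:
        return None

    R, _ = cv2.Rodrigues(rvec)    # world-to-camera rotation
    return R, tvec                # 6DOF pose of the query image
```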
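The scene coordinate regression idea mentioned above can likewise be sketched in a few lines. The toy network below maps an RGB image to dense 3D scene coordinates; the architecture and loss are illustrative assumptions rather than a reproduction of any particular SCR method, and the predicted 2D-3D correspondences would still be fed to a RANSAC-PnP solver, as in the previous sketch, to obtain the final pose.

```python
# Minimal sketch of scene coordinate regression (SCR): a fully convolutional network
# regresses per-pixel 3D scene coordinates; pose estimation remains geometric.
import torch
import torch.nn as nn


class SceneCoordinateRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                     # downsampling feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(256, 3, 1)                   # 3 channels = (x, y, z)

    def forward(self, image):
        # image: (B, 3, H, W) -> scene coordinates: (B, 3, H/8, W/8)
        return self.head(self.backbone(image))


def scr_loss(pred_coords, gt_coords, valid_mask):
    """Mean Euclidean error over pixels with valid ground-truth scene coordinates."""
    diff = torch.norm(pred_coords - gt_coords, dim=1)      # per-pixel distance, (B, H, W)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```

Training minimizes `scr_loss` against ground-truth scene coordinates rendered from the mapping data; at test time the dense predictions replace the descriptor-matched correspondences of the structure-based pipeline.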

visual localization; camera pose estimation; 3D scene representation; lightweight map; feature matching; scene coordinate regression; pose regression

Ye Hanqiao, Liu Yangdong, Shen Shuhan


School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China

Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China


Supported by: National Natural Science Foundation of China (U22B2055, 62273345); Beijing Natural Science Foundation (L223003)

2024

Journal of Image and Graphics
Institute of Remote Sensing Applications, Chinese Academy of Sciences; China Society of Image and Graphics; Institute of Applied Physics and Computational Mathematics, Beijing

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 1.111
ISSN:1006-8961
Year, Volume (Issue): 2024, 29(10)