Frontier research and latest trends in 3D visual-language reasoning

The core idea of 3D visual reasoning is to understand the relationships among the visual entities in a point cloud scene. Because traditional 3D visual reasoning typically requires professional expertise, non-professional users find it difficult to convey their intentions to the computer, which limits the popularization and promotion of this technology. To this end, researchers use natural language as the semantic background and query condition that reflects user intentions, and then interact it with the point cloud information to complete the corresponding tasks. This paradigm, called 3D visual-language reasoning, is widely applied in fields such as autonomous driving, robot navigation, and human-computer interaction, and has become a research direction attracting great attention in computer vision. Over the past few years, 3D visual-language reasoning technology has developed rapidly and shown a blossoming trend, yet a comprehensive summary of the latest research progress is still lacking. This paper focuses on the two most representative lines of work, anchor box prediction and content generation, and systematically summarizes the latest advances in the field. First, the paper summarizes the problem definition and existing challenges of 3D visual-language reasoning and outlines several common backbone networks. Second, according to the downstream scenarios each method targets, the paper further subdivides the two classes of 3D visual-language reasoning techniques and discusses the advantages and disadvantages of each method in depth. Next, the paper compares and analyzes the performance of the various methods on different benchmark datasets. Finally, the paper looks ahead to the future development of 3D visual-language reasoning technology, with the aim of promoting in-depth research and wide application in this field.
Comprehensive survey on 3D visual-language understanding techniques
The core of 3D visual reasoning is to understand the relationships among different visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to possess professional expertise. However, non-professional users face difficulty conveying their intentions to computers, which hinders the popularization and advancement of this technology. Users now anticipate a more convenient way to convey their intentions to the computer to achieve information exchange and obtain personalized results. To address this issue, researchers utilize natural language as a semantic background or query criterion to reflect user intentions, and accomplish various tasks by interacting such natural language with 3D point clouds. Through multimodal interaction, often built on Transformer or graph neural network architectures, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding aims to locate the desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition aims to identify and localize 3D objects of novel classes defined by an unbounded (open) vocabulary at inference time, generalizing beyond the limited number of base classes labeled during training. 3D dense captioning aims to identify all possible instances within a 3D point cloud scene and generate a corresponding natural language description for each instance. The goal of 3D visual question answering is to comprehend an entire 3D scene and provide an appropriate answer. Text-guided scene generation synthesizes a realistic 3D scene, composed of a complex background and multiple objects, from natural language descriptions.

The aforementioned paradigm, known as 3D visual-language understanding, has gained significant traction in recent years in fields such as autonomous driving, robot navigation, and human-computer interaction. Consequently, it has become a highly anticipated research direction within the computer vision domain. Over the past three years, 3D visual-language understanding technology has developed rapidly and showcased a blossoming trend. However, comprehensive summaries of the latest research progress remain lacking. It is therefore necessary to systematically summarize recent studies, comprehensively evaluate the performance of different approaches, and prospectively point out future research directions; this need motivates the present survey.

To this end, this study focuses on the two most representative lines of 3D visual-language understanding work, anchor box prediction and content generation, and systematically summarizes their latest research advances. First, the study provides an overview of the problem definition and existing challenges in 3D visual-language understanding, and outlines some common backbones used in this area. The challenges include 3D-language alignment and complex scene understanding, while the common backbones involve a priori rules, multilayer perceptrons, graph neural networks, and Transformer architectures. Subsequently, the study delves into downstream scenarios, emphasizing the two types of 3D visual-language understanding techniques, namely bounding box prediction and content generation, and thoroughly explores the advantages and disadvantages of each method. Furthermore, the study compares and analyzes the performance of various methods on different benchmark datasets. Finally, the study looks ahead to the future prospects of 3D visual-language reasoning technology, with the aim of promoting in-depth research and widespread application in this field.

The major contributions of this study can be summarized as follows: 1) Systematic survey of 3D visual-language understanding. To the best of our knowledge, this survey is the first to thoroughly discuss recent advances in 3D visual-language understanding. We categorize algorithms into different taxonomies from the perspective of downstream scenarios to provide readers with a clear comprehension of our article. 2) Comprehensive performance evaluation and analysis. We compare existing 3D visual-language understanding approaches on several publicly available datasets. Our in-depth analysis can help researchers select baselines suitable for their specific applications while also offering valuable insights into the modification of existing methods. 3) Insightful discussion of future prospects. Based on the systematic survey and comprehensive performance comparison, some promising future research directions are discussed, including large-scale 3D foundation models, the computational efficiency of 3D modeling, and the incorporation of additional modalities.
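To make the cross-modal interaction described above concrete, the following is a minimal, hypothetical PyTorch sketch of a Transformer-style fusion block for 3D visual grounding. It assumes that object proposal features have already been extracted by a point cloud backbone (e.g., a PointNet++-style encoder) and that the linguistic query has been tokenized into word embeddings; all names (CrossModalGrounding, proposal_feats, text_feats, etc.) and dimensions are illustrative and do not correspond to any specific method covered by the survey.

```python
# Hypothetical sketch: Transformer-style cross-modal fusion for 3D visual grounding.
# Proposal features from a 3D backbone and word embeddings for the query are assumed
# to be available; module names and dimensions are illustrative only.
import torch
import torch.nn as nn


class CrossModalGrounding(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Cross-attention: object proposals attend to the language tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Self-attention over proposals to model inter-object relations in the scene.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Grounding head: one confidence score per proposal.
        self.score_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, proposal_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # proposal_feats: (B, N, d_model) object proposal features from a 3D backbone
        # text_feats:     (B, T, d_model) word embeddings of the linguistic query
        x, _ = self.cross_attn(proposal_feats, text_feats, text_feats)
        x = self.norm1(proposal_feats + x)
        y, _ = self.self_attn(x, x, x)
        x = self.norm2(x + y)
        return self.score_head(x).squeeze(-1)  # (B, N) grounding scores


if __name__ == "__main__":
    fuse = CrossModalGrounding()
    proposals = torch.randn(2, 32, 256)   # 32 object proposals per scene
    words = torch.randn(2, 20, 256)       # 20 query tokens
    scores = fuse(proposals, words)
    print(scores.shape, scores.argmax(dim=-1))  # best-matching proposal per scene
```

In the anchor-box-prediction family of methods, the 3D bounding box of the highest-scoring proposal would typically be returned as the grounding result, whereas content-generation tasks such as dense captioning or question answering would instead feed the fused proposal features to a language decoder.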

deep learning; computer vision; 3D visual-language understanding; cross-modal learning; visual grounding; dense captioning; visual question answering; scene generation

雷印杰、徐凯、郭裕兰、杨鑫、武玉伟、胡玮、杨佳琪、汪汉云


College of Electronics and Information Engineering, Sichuan University, Chengdu 610065

College of Computer Science and Technology, National University of Defense Technology, Changsha 410073

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073

School of Computer Science and Technology, Dalian University of Technology, Dalian 116081

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081

Wangxuan Institute of Computer Technology, Peking University, Beijing 100091

School of Computer Science, Northwestern Polytechnical University, Xi'an 710072

School of Computer and Big Data / School of Software, Information Engineering University, Zhengzhou 450001



Supported by: National Natural Science Foundation of China (U23B2013, 62276176)

2024

中国图象图形学报 (Journal of Image and Graphics)
Sponsored by: Institute of Remote Sensing Applications, Chinese Academy of Sciences; China Society of Image and Graphics; Institute of Applied Physics and Computational Mathematics


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 1.111
ISSN: 1006-8961
Year, Volume (Issue): 2024, 29(6)