
A High-Quality Depth Estimation Method for a Single Image

Depth estimation from a single image is a key task in robot navigation and scene understanding, and a challenging problem in computer vision. To address inaccurate depth estimation from a single image, we propose a single-image depth estimation method based on the Vision Transformer (ViT). First, a pre-trained DenseNet downsamples the image and encodes the features into a sequence suitable for the ViT. Then, densely connected ViT blocks process global context information, and the feature sequence is reassembled into high-dimensional feature maps. Finally, RefineNet upsamples the features to produce a complete depth map. We conduct comparative experiments against representative depth estimation methods on the NYU V2 dataset, ablation experiments on the network structure, and a quantitative analysis of the mean relative error (REL), root mean square error (RMS), and other metrics. The results show that the proposed method generates high-quality depth maps with rich detail from a single image. Compared with the traditional encoder-decoder method, its PSNR is higher by 1.052 dB on average, its REL is lower by 7.7%-21.8%, and its RMS is lower by 5.6%-16.9%.
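The reported metrics (REL, RMS, PSNR) follow standard definitions; the sketch below is a minimal NumPy version for reference. The function names and the 10 m depth cap used for PSNR (the usual NYU V2 maximum depth) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rel(pred, gt):
    # mean absolute relative error: mean(|d - d*| / d*)
    return np.mean(np.abs(pred - gt) / gt)

def rms(pred, gt):
    # root mean square error over all pixels
    return np.sqrt(np.mean((pred - gt) ** 2))

def psnr(pred, gt, max_depth=10.0):
    # peak signal-to-noise ratio in dB; max_depth is the assumed
    # maximum depth value (10 m is common for NYU V2)
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_depth ** 2 / mse)
```

Lower REL and RMS indicate more accurate depth; higher PSNR indicates a cleaner reconstruction.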

deep learning; depth estimation; vision Transformer; attention mechanism

包永堂、燕帅、齐越


College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191

Qingdao Research Institute, Beihang University, Qingdao 266100


2024

Journal of Computer-Aided Design & Computer Graphics
China Computer Federation

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.892
ISSN: 1003-9775
Year, Volume (Issue): 2024, 36(11)