Monocular depth estimation combining pyramid structure and attention mechanism
Monocular depth estimation is the prediction of a dense depth image from a single color image. A monocular depth estimation algorithm combining a pyramid structure and an attention mechanism was proposed to address the boundary ambiguity and insufficient capture of contextual information in current monocular depth estimation algorithms. The algorithm adopted an encoder-decoder framework. The encoder used the PVTv2 network, exploiting the strength of the Transformer in modeling global information to obtain more adequate global semantic information. The decoder consisted of a depth estimation main branch and two pyramid sub-branches. The depth estimation main branch used spatial and channel attention mechanisms to focus adaptively on the important feature regions and feature channels of the encoder and decoder features. The Laplacian pyramid sub-branch and the depth residual pyramid sub-branch learned rich local information from the color image and from the depth features of the main branch, respectively, and transferred it to the main branch to address the missing details and chaotic structures in monocular depth estimation. Experimental results demonstrated that, on the indoor public dataset NYU Depth V2, compared with the advanced algorithm P3Depth, the accuracy under the δ < 1.25 threshold increased by 1.22%, and the absolute error and root mean square error decreased by 5.8% and 2.8%, respectively. On the outdoor public dataset KITTI, the absolute error, root mean square logarithmic error, and root mean square error of the algorithm decreased by 8.5%, 3.9%, and 0.4%, respectively. The algorithm improved the accuracy of depth estimation and produced visually convincing depth maps.
deep learning; monocular depth estimation; pyramid structure; attention mechanism; Transformer
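The abstract reports threshold accuracy, absolute error, root mean square error, and root mean square logarithmic error without defining them. The sketch below shows how such metrics are commonly computed on depth benchmarks. It is an illustration, not the authors' evaluation code: it assumes that "absolute error" means the mean absolute relative error, that depths are given as flat sequences of positive values in the same unit, and that pixels without valid ground truth are skipped. The function name `depth_metrics` and the dictionary keys are hypothetical.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Compute common depth-estimation metrics over paired depth values.

    pred, gt: flat sequences of predicted and ground-truth depths (same unit).
    Pairs with a non-positive value are treated as invalid and skipped
    (an assumption; benchmarks usually apply their own validity mask).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    n = len(pairs)

    # Threshold accuracy: fraction of pixels with max(p/g, g/p) < threshold.
    delta = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n
    # Mean absolute relative error (assumed meaning of "absolute error").
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean square error in depth units.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Root mean square error of log-depth.
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    return {"delta": delta, "abs_rel": abs_rel, "rmse": rmse, "rmse_log": rmse_log}


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
```

Higher is better for threshold accuracy, and lower is better for the three error metrics, which is why the abstract reports an increase for the former and decreases for the latter.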