Virtual viewpoint image synthesis using neural radiance fields with depth information supervision
Objective Viewpoint synthesis techniques are widely applied in computer graphics and computer vision. Depending on whether they rely on geometric information, virtual viewpoint synthesis methods can be classified into two categories: image-based rendering and model-based rendering. 1) Image-based rendering typically uses input from camera arrays or light field cameras to achieve high-quality rendering without reconstructing the geometry of the scene. Among image-based methods, depth-map-based rendering is currently a popular research topic for virtual viewpoint rendering. However, this technique is sensitive to depth errors, which lead to holes and artifacts in the generated virtual viewpoint image. In addition, obtaining precise depth information for real-world scenes is difficult in practice. 2) Model-based rendering builds a 3D geometric model of the real-world scene and synthesizes virtual viewpoint images through projection transformation, clipping, hidden-surface removal, and texture mapping. However, the difficulty of quickly modeling real-world scenes is a significant drawback of this approach. With the emergence of neural rendering, the neural radiance field technique represents the 3D scene with a neural network and combines it with volume rendering for viewpoint synthesis, producing photorealistic results. However, this approach relies heavily on view appearance and requires a large number of input views for modeling; as a result, it may explain the training images perfectly yet generalize poorly to novel test views. Depth information can be introduced as supervision to reduce the dependence of the neural radiance field on view appearance, but structure from motion (SfM) produces sparse depth values containing inaccuracies and outliers when the number of input views is limited. Therefore, this study proposes a virtual viewpoint synthesis algorithm that supervises the neural radiance field with dense depth values obtained from a depth estimation network and introduces an embedding vector into the fitting function of the neural radiance field to improve virtual viewpoint image quality.
Method First, the camera intrinsic and extrinsic matrices were calibrated for the input views. The 3D point cloud in the world coordinate system was transformed into the camera coordinate system using the extrinsic matrix, and the resulting points were projected onto the image plane using the intrinsic matrix to obtain sparse depth values. Next, each RGB view was fed into the neural window fully-connected conditional random fields (NeWCRFs) network to obtain estimated dense depth values, and the standard deviation between the estimated depth values and the sparse depth values was computed. The NeWCRFs network uses the FC-CRFs module, built on a multi-head attention mechanism, as the decoder and a vision Transformer as the encoder, forming a U-shaped encoder-decoder structure for depth estimation. Finally, training of the neural radiance field was supervised with the estimated depth values and the computed standard deviations. Training began by casting camera rays through the input views to determine the sampling locations and the sample-point parameterization scheme. The reparameterized sample-point locations were fed into the network, which output volume density and color values; rendered color values and rendered depth values were then computed by volume rendering. The training process was supervised with a color loss between the rendered color value and the true color value and a depth loss between the predicted depth value and the rendered depth value.
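To make the sparse-depth step concrete, the following is a minimal NumPy sketch of projecting the SfM point cloud into a view to form a sparse depth map; the function name, array layouts, and the world-to-camera convention (R, t) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_to_sparse_depth(points_world, R, t, K, height, width):
    """Project SfM world points into one view to build a sparse depth map.

    points_world: (N, 3) 3D points in world coordinates
    R, t:         world-to-camera rotation (3, 3) and translation (3,)
    K:            camera intrinsic matrix (3, 3)
    Returns an (H, W) depth map that is zero where no point projects.
    """
    # World -> camera coordinates via the extrinsic parameters.
    points_cam = points_world @ R.T + t            # (N, 3)
    z = points_cam[:, 2]
    valid = z > 1e-6                               # keep points in front of the camera

    # Camera -> pixel coordinates via the intrinsic matrix.
    uvw = points_cam[valid] @ K.T                  # (M, 3)
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = z[valid]

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)
    depth[v[inside], u[inside]] = z[inside]        # sparse: most pixels stay zero
    return depth
```

The per-pixel standard deviation used for supervision can then be obtained by comparing the dense depth predicted by the NeWCRFs network against these sparse values at the non-zero pixels.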
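Similarly, a PyTorch-style sketch of the rendering and loss computation is given below. The rendered depth as the weight-averaged sample distance follows standard NeRF volume rendering; the standard-deviation weighting of the depth term and the balance weight lam are assumptions about the loss design, not the paper's exact formulation.

```python
import torch

def volume_render(sigma, rgb, t_vals):
    """Standard NeRF-style volume rendering along each ray.

    sigma:  (R, S)    densities at S samples on R rays
    rgb:    (R, S, 3) colors at the samples
    t_vals: (R, S)    sample depths along each ray
    """
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                   # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                    # (R, S)
    color = (weights[..., None] * rgb).sum(dim=1)              # rendered color
    depth = (weights * t_vals).sum(dim=1)                      # rendered depth
    return color, depth

def total_loss(color, depth, gt_color, est_depth, depth_std, lam=0.1):
    # Color loss against the ground-truth pixels, plus a depth loss that
    # trusts the estimated dense depth less where its deviation from the
    # sparse SfM depth is large (assumed weighting scheme).
    color_loss = ((color - gt_color) ** 2).mean()
    depth_loss = (((depth - est_depth) / (depth_std + 1e-6)) ** 2).mean()
    return color_loss + lam * depth_loss
```

Dividing the depth residual by the standard deviation lets the supervision rely on the dense depth less in regions where it disagrees with the sparse SfM depth.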
Result Experiments were conducted on the NeRF Real dataset, which comprises eight real-world scenes captured by forward-facing cameras. The proposed method was compared with the neural radiance field (NeRF) method that uses only RGB supervision and with a method that employs sparse depth supervision. The assessment criteria were peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and learned perceptual image patch similarity (LPIPS). The results indicate that, with a limited number of input views, the proposed method surpasses both the RGB-only NeRF method and the sparse-depth-supervised method in image quality. Specifically, the proposed method achieved a 24% improvement in PSNR over the NeRF method and a 19.8% improvement over the sparse depth supervision method, as well as a 36% improvement in SSIM over the NeRF method and a 16.6% improvement over the sparse depth supervision method. Data efficiency was evaluated by comparing the PSNR reached after the same number of iterations, where the proposed method demonstrated a significant improvement over the NeRF method.
Conclusions In this study, we proposed a method for synthesizing virtual viewpoint images using neural radiance fields supervised by dense depth. The method uses the dense depth values output by the depth estimation network to supervise the training of the neural radiance field and introduces an embedding vector into the fitting function during training. The experiments demonstrate that our approach effectively addresses the problem of sparse depth values caused by insufficient views or inconsistent view colors and achieves high-quality synthesized images, particularly when the number of input views is limited.