Geometric attribute-guided 3D semantic instance reconstruction
Objective 3D vision aims to capture the geometric and optical features of the real world from multiple perspectives and convert them into a digital form that computers can understand and process; it is an important aspect of computer graphics. However, sensors provide only partial observations of the world because of viewpoint occlusion, sparse sensing, and measurement noise, so the resulting representation of a scene is incomplete. Semantic instance reconstruction was proposed to solve this problem. It converts 2D/3D data obtained from multiple sensors into a semantic representation of the scene, including a model of each object instance in the scene. Machine learning and computer vision techniques are applied to achieve high-precision reconstruction, and point cloud-based methods have demonstrated prominent advantages. However, existing methods disregard the prior geometric and semantic information of objects, and their subsequent simple max-pooling operation loses key structural information of objects, resulting in poor instance reconstruction performance.
Method In this study, a geometric attribute-guided semantic instance reconstruction network (GANet), which consists of a 3D object detector, a spatial Transformer, and a mesh generator, is proposed. We design the spatial Transformer to utilize the geometric and semantic information of instances. After the 3D bounding boxes of the instances in the scene are obtained, box sampling uses the scale information of each instance to extract the real local point cloud of that instance from the scene, and semantic information is then embedded for foreground point segmentation. Compared with ball sampling, box sampling reduces noise and obtains more effective information. The encoder's feature embedding and Transformer layers then extract rich and crucial geometric details of objects from coarse to fine to obtain the corresponding local features. The feature embedding layer also utilizes the prior semantic information of objects to help the algorithm quickly approximate the target shape, and the attention module in the Transformer integrates the correlation information between points. The algorithm also uses the object's global features provided by the detector. Because the scene space and the canonical space are inconsistent, a purpose-designed feature space Transformer is used to align the object's global features. Finally, the fused features are sent to the mesh generator for mesh reconstruction. The loss function of GANet consists of two parts: detection loss and shape loss. Detection loss is the weighted sum of the instance confidence, semantic classification, and bounding box estimation losses. Shape loss consists of three parts: the Kullback-Leibler divergence between the predicted and standard normal distributions, the foreground point segmentation loss, and the occupancy point estimation loss. The occupancy point estimation loss is the cross-entropy between the predicted and real occupancy values of the spatial candidate points.
Result GANet was compared with the latest methods on the ScanNet v2 dataset. The algorithm used the computer-aided design (CAD) models provided by Scan2CAD, which cover 8 categories, as ground truth for training. The mean average precision of semantic instance reconstruction increased by 8% compared with the second-ranked method, RfD-Net. The average precision for bathtub, trash bin, sofa, chair, and cabinet is better than that of RfD-Net. The visualization results show that GANet reconstructs object models that are more consistent with the scene. Ablation experiments were also conducted on the same dataset. The complete network performed better than four ablated variants: one that replaced box sampling with ball sampling, one that replaced the Transformer with PointNet, one that removed the semantic embedding of point cloud features, and one that removed the feature transformation. The experimental results indicate that box sampling obtains more effective local point cloud information, the Transformer-based point cloud encoder enables the network to use more critical local structural information of the foreground point cloud during reconstruction, and semantic embedding provides prior information for instance reconstruction. Feature space transformation aligns the global prior information of an object, further improving the reconstruction.
Conclusion In this study, we proposed a geometric attribute-guided network. The network accounts for the complexity of scene objects and makes better use of their geometric and attribute information. The experimental results show that it outperforms several state-of-the-art approaches. Current 3D-based semantic instance reconstruction algorithms have achieved good results, but acquiring and annotating 3D data remain relatively expensive. Future research can focus on how to use 2D data to assist semantic instance reconstruction.
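The difference between box sampling and ball sampling described in the Method part can be illustrated with a small sketch. It is not the authors' implementation: the function names `box_sample` and `ball_sample` are hypothetical, the box is assumed to be axis-aligned (a detector would normally also predict a heading angle), and the toy points are made up for illustration.

```python
import numpy as np

def box_sample(points, center, size):
    """Keep the points inside an axis-aligned box (assumed form: center + size)."""
    half = np.asarray(size, dtype=float) / 2.0
    offset = np.abs(points - np.asarray(center, dtype=float))
    return points[np.all(offset <= half, axis=1)]

def ball_sample(points, center, radius):
    """Keep the points within a fixed radius of the center."""
    dist = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
    return points[dist <= radius]

# Toy scene: a long, thin object along the x axis plus nearby clutter.
points = np.array([
    [0.0, 0.0, 0.0],    # object
    [0.9, 0.0, 0.0],    # object, near one end
    [-0.9, 0.0, 0.0],   # object, near the other end
    [0.0, 0.8, 0.0],    # clutter beside the object
    [0.0, 0.0, 0.8],    # clutter above the object
])
center = [0.0, 0.0, 0.0]
size = [2.0, 0.4, 0.4]  # box scale, as a detector might predict it

in_box = box_sample(points, center, size)
# A ball must reach the longest half-extent of the box to cover the object,
# so it also sweeps in the clutter.
in_ball = ball_sample(points, center, radius=1.0)

print(len(in_box), len(in_ball))
```

In this toy case the box keeps only the three object points, while a ball large enough to cover the object also picks up both clutter points, which is the kind of noise the abstract says box sampling reduces.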
Keywords: scene reconstruction; three-dimensional point cloud; semantic instance reconstruction; mesh generation; object detection
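The loss described in the Method part can be written out as follows. This is only a sketch of the structure stated in the abstract: all symbols are assumptions introduced here, including the weights $\lambda$, the predicted latent distribution $q$, the number of candidate points $N$, the real occupancy $o_i \in \{0, 1\}$, and the predicted occupancy $\hat{o}_i$. The abstract does not say whether the shape terms are weighted, which direction the divergence takes, or whether the cross-entropy is averaged, so the unweighted sum, the direction shown, and the binary, averaged form are illustrative choices.

```latex
\begin{align}
\mathcal{L} &= \mathcal{L}_{\mathrm{det}} + \mathcal{L}_{\mathrm{shape}}, \\
\mathcal{L}_{\mathrm{det}} &= \lambda_{\mathrm{conf}} \mathcal{L}_{\mathrm{conf}}
  + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}
  + \lambda_{\mathrm{box}} \mathcal{L}_{\mathrm{box}}, \\
\mathcal{L}_{\mathrm{shape}} &= D_{\mathrm{KL}}\!\left(q \,\|\, \mathcal{N}(0, I)\right)
  + \mathcal{L}_{\mathrm{seg}} + \mathcal{L}_{\mathrm{occ}}, \\
\mathcal{L}_{\mathrm{occ}} &= -\frac{1}{N} \sum_{i=1}^{N}
  \left[ o_i \log \hat{o}_i + (1 - o_i) \log\left(1 - \hat{o}_i\right) \right].
\end{align}
```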