Zero-shot 3D Shape Classification Based on Semantic-enhanced Language-Image Pre-training Model
Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential for zero-shot 3D shape classification. However, there is a large modality gap between 3D shapes and texts, which limits further improvement of classification accuracy. To address this problem, a zero-shot 3D shape classification method based on semantic-enhanced CLIP is proposed in this paper. First, 3D shapes are represented as multi-view images. Then, to improve the ability to recognize unknown categories in zero-shot learning, a semantic descriptive text is generated for each view and its corresponding category by a visual-language generative model, using image captioning and visual question answering; these texts serve as a semantic bridge between the views and the category prompt texts. Finally, a fine-tuned semantic encoder aggregates the semantic descriptive texts into semantic descriptions of each category, which carry rich semantic information, are highly interpretable, and effectively reduce the semantic gap between views and category prompt texts. Experiments show that the proposed method outperforms existing zero-shot classification methods on the ModelNet10 and ModelNet40 datasets.
Keywords: 3D shape classification; Zero-shot learning; Contrastive Language-Image Pre-training (CLIP); Semantic descriptive text
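To make the view-based zero-shot pipeline summarized in the abstract more concrete, the following minimal sketch scores rendered views of a 3D shape against semantically enriched category prompts with an off-the-shelf CLIP model and averages the scores over views. The prompt strings, view file paths, and the simple mean-pooling aggregation are illustrative assumptions; the paper's actual method generates the category descriptions with a visual-language generative model and a fine-tuned semantic encoder rather than using hand-written prompts.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP backbone (assumed checkpoint; the paper may use a different one).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical semantically enriched category descriptions; in the paper these are
# derived from image captioning and visual question answering, not written by hand.
category_texts = [
    "a 3D rendering of a chair, an object with four legs, a flat seat and a backrest",
    "a 3D rendering of a table, an object with a flat top supported by legs",
    "a 3D rendering of a lamp, an object with a base, a stem and a shade",
]

# Hypothetical multi-view renderings of one 3D shape.
view_paths = ["view_00.png", "view_01.png", "view_02.png"]
views = [Image.open(p).convert("RGB") for p in view_paths]

# Encode all views and all category descriptions in one batch.
inputs = processor(text=category_texts, images=views, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_views, num_categories); softmax over categories,
# then mean-pool over views to obtain one score vector per shape (an assumed aggregation).
view_probs = outputs.logits_per_image.softmax(dim=-1)
shape_probs = view_probs.mean(dim=0)
predicted_category = category_texts[shape_probs.argmax().item()]
print(predicted_category)
```

Because CLIP's text and image embeddings live in a shared space, no training on the target categories is needed: adding a new class only requires adding its description string, which is what makes the zero-shot setting and the quality of the category descriptions central to the method.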