Journal of Electronics & Information Technology, 2024, Vol. 46, Issue 8: 3314-3323. DOI: 10.11999/JEIT231161

Zero-shot 3D Shape Classification Based on Semantic-enhanced Language-Image Pre-training Model

Ding Bo¹, Zhang Libao¹, Qin Jian¹, He Yongjun²

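Author information

  • 1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080
  • 2. Faculty of Computing, Harbin Institute of Technology, Harbin 150006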

Abstract

Contrastive Language-Image Pre-training (CLIP) has shown great potential in zero-shot 3D shape classification. However, the large modality gap between 3D shapes and text limits further gains in classification accuracy. To address this problem, this paper proposes a zero-shot 3D shape classification method based on semantic-enhanced CLIP. First, each 3D shape is represented as a set of multiple views. Then, to improve recognition of unseen categories in zero-shot learning, a vision-language generative model produces semantic descriptive text for each view and its corresponding category; this text serves as a semantic bridge between the views and the category prompt texts. The semantic descriptive texts are obtained in two ways: image captioning and visual question answering. Finally, a fine-tuned semantic encoder concretizes the semantic descriptive texts into semantic descriptions of each category. These descriptions carry rich semantic information and are highly interpretable, and they effectively narrow the semantic gap between views and category prompt texts. Experiments show that the proposed method outperforms existing zero-shot classification methods on the ModelNet10 and ModelNet40 datasets.
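
The classification step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for embeddings that a CLIP image encoder and a text encoder would produce, and the function names, the mixing weight `alpha`, and the weighted-sum fusion of view and description similarities are all assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    na, nb = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(na, nb))


def classify_shape(
    view_embs: List[Vector],
    desc_embs: List[Vector],
    prompt_embs: Dict[str, Vector],
    alpha: float = 0.5,
) -> str:
    """Return the class whose prompt best matches the views and descriptions.

    view_embs:   one embedding per rendered view of the 3D shape.
    desc_embs:   one embedding per view's descriptive text (caption or VQA answer).
    prompt_embs: one embedding per category prompt text.
    alpha:       hypothetical mixing weight (assumption, not from the paper);
                 1.0 uses only the views, 0.0 uses only the descriptions.
    """
    scores = {}
    for label, prompt in prompt_embs.items():
        total = 0.0
        for view, desc in zip(view_embs, desc_embs):
            # The description similarity acts as the "semantic bridge" term.
            total += alpha * cosine(view, prompt) + (1 - alpha) * cosine(desc, prompt)
        scores[label] = total / len(view_embs)  # average over all views
    return max(scores, key=scores.get)


# Toy example: the single view leans toward "table", its description toward "chair".
prompts = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
views = [[0.4, 0.6, 0.0]]
descs = [[1.0, 0.0, 0.0]]
predicted = classify_shape(views, descs, prompts)
```

In this toy setup, view-only matching (`alpha=1.0`) and description-assisted matching (`alpha=0.5`) can disagree, which is the kind of case where an intermediate textual description could help close the gap between views and category prompts.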

Key words

3D shape classification / Zero-shot / Contrastive Language-Image Pre-training (CLIP) / Semantic descriptive text

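Funding

National Natural Science Foundation of China (61673142)

Natural Science Foundation of Heilongjiang Province (LH2022F029)

Natural Science Foundation of Heilongjiang Province (JQ2019F002)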

Publication year

2024
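Journal of Electronics & Information Technology
Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China

Indexing: CSTPCD, Peking University Core Journals
Impact factor: 1.302
ISSN: 1009-5896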