
Zero-Shot Action Recognition Based on CLIP Model and Knowledge Database

Zero-shot action recognition (ZSAR) aims to learn knowledge from seen action classes and transfer it to unseen action classes, thereby enabling the recognition and classification of unknown action samples. However, existing ZSAR models rely on limited training data, which restricts the prior knowledge they can learn and prevents them from accurately mapping visual features to semantic labels; this is a key factor limiting zero-shot learning performance. To address this issue, this study proposes a ZSAR framework that introduces an external knowledge database together with the contrastive language-image pre-training (CLIP) model, exploiting the knowledge the multimodal CLIP model accumulates through self-supervised contrastive learning to expand the prior knowledge available to the ZSAR model. A temporal encoder is also designed to compensate for CLIP's lack of temporal modeling capability. To enrich the semantic features learned by the model and narrow the semantic gap between visual features and semantic labels, the semantic labels of seen action classes are extended: simple text labels are replaced with more detailed descriptive sentences, enriching the semantic information of the text representations. On this basis, a knowledge database is constructed outside the model, providing additional auxiliary information without increasing the model's parameter scale and strengthening the association between visual and text feature representations. Finally, following the zero-shot learning protocol, the model is fine-tuned to adapt it to the ZSAR task, improving its generalization ability. The proposed method is evaluated extensively on two mainstream datasets, HMDB51 and UCF101, where it improves recognition performance over current state-of-the-art methods by 3.8% and 2.3%, respectively, demonstrating the effectiveness of the proposed approach.
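To make the described pipeline concrete, below is a minimal sketch of CLIP-based zero-shot action recognition with a learned temporal encoder, assuming the OpenAI CLIP package (github.com/openai/CLIP). The TemporalEncoder architecture, embedding dimension, example class descriptions, and the dummy frame tensor are all illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package


class TemporalEncoder(nn.Module):
    """Hypothetical temporal encoder: a small Transformer over per-frame
    CLIP features, standing in for the paper's module that compensates
    for CLIP's lack of temporal modeling."""

    def __init__(self, dim: int = 512, layers: int = 2, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -> (batch, dim)
        return self.encoder(frame_feats).mean(dim=1)


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
temporal = TemporalEncoder().to(device)

# Extended semantic labels: descriptive sentences instead of bare class
# names (the descriptions here are our own illustrations).
descriptions = [
    "a person swinging a golf club on a grass course",
    "a person climbing a steep rock wall with ropes",
]
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(descriptions).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Dummy tensor standing in for preprocessed video clips:
    # (batch, num_frames, 3, 224, 224).
    frames = torch.randn(2, 8, 3, 224, 224, device=device)
    b, t = frames.shape[:2]
    frame_feats = model.encode_image(frames.flatten(0, 1)).view(b, t, -1)

video_feats = temporal(frame_feats.float())
video_feats = video_feats / video_feats.norm(dim=-1, keepdim=True)

# Zero-shot classification: cosine similarity between the video embedding
# and the description embeddings of (unseen) action classes.
logits = 100.0 * video_feats @ text_feats.t()
pred = logits.argmax(dim=-1)
```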
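The external knowledge database can be read as a retrieval-augmentation step: the video feature queries a fixed bank of precomputed text embeddings and fuses the retrieved entries back into itself, adding no trainable parameters. The similarity-weighted residual fusion rule below is our assumption of one plausible realization, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def retrieve_and_fuse(video_feat: torch.Tensor,
                      knowledge_bank: torch.Tensor,
                      k: int = 5,
                      alpha: float = 0.5) -> torch.Tensor:
    """video_feat: (batch, dim); knowledge_bank: (num_entries, dim),
    a frozen bank of text embeddings built offline."""
    q = F.normalize(video_feat, dim=-1)
    kb = F.normalize(knowledge_bank, dim=-1)
    sim = q @ kb.t()                        # (batch, num_entries)
    weights, idx = sim.topk(k, dim=-1)      # top-k nearest entries
    weights = weights.softmax(dim=-1)       # similarity-weighted average
    retrieved = (weights.unsqueeze(-1) * kb[idx]).sum(dim=1)
    # Residual fusion keeps the original video feature dominant while
    # injecting the retrieved auxiliary knowledge.
    return F.normalize(q + alpha * retrieved, dim=-1)
```

Because the bank is frozen and the fusion is parameter-free, this matches the abstract's claim of providing auxiliary information without increasing the model's parameter scale.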

zero-shot learning (ZSL); action recognition; contrastive language-image pre-training (CLIP) model; knowledge database

Hou Yonghong, Zheng Haochun, Gao Jiajun, Ren Yi


School of Electrical and Information Engineering, Tianjin University, Tianjin 300072

School of Future Technology, Tianjin University, Tianjin 300072

Institute of Software, Chinese Academy of Sciences, Beijing 100190


2025

Journal of Tianjin University (Science and Technology)
Tianjin University


Peking University Core Journal (北大核心)
Impact factor: 0.793
ISSN: 0493-2137
Year, Volume (Issue): 2025, 58(1)