Zero-Shot Action Recognition Based on the CLIP Model and a Knowledge Database
Zero-shot action recognition (ZSAR) aims to learn knowledge from seen action classes and apply it to unseen action classes, thereby recognizing and classifying samples of unknown actions. However, existing ZSAR models are limited by the amount of available training data, which restricts their ability to learn prior knowledge and to accurately map visual features to semantic labels. To address this issue, a ZSAR framework was proposed in this study by introducing an external knowledge database and building on the contrastive language-image pre-training (CLIP) model. This framework utilized the knowledge acquired by the multimodal CLIP model through self-supervised contrastive learning to expand the prior knowledge available for ZSAR. Moreover, a temporal encoder was designed to compensate for the CLIP model's lack of temporal modeling capability. To enhance the semantic features and bridge the gap between visual features and semantic labels, the semantic labels of the seen action classes were extended: simple text labels were replaced with more detailed descriptive sentences to enrich the semantic information of the text representations. On this basis, a knowledge database was constructed outside the model, which provided additional information without increasing the model parameter scale and strengthened the association between the visual and text features. Finally, following the ZSAR protocol, the model was fine-tuned for the ZSAR task to improve its generalization ability. The proposed method was evaluated extensively on two mainstream datasets, HMDB51 and UCF101. The experimental results demonstrate improvements of 3.8% and 2.3% on these two datasets, respectively, compared with previous methods, validating the effectiveness of the proposed approach.
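As a rough illustration of the pipeline described above, the sketch below shows per-frame CLIP image features passed through a small Transformer-based temporal encoder and then matched against CLIP text embeddings of class description sentences via cosine similarity. The feature dimension, layer counts, pooling, temperature, and function names are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEncoder(nn.Module):
    """Illustrative temporal encoder: a small Transformer over per-frame CLIP
    embeddings, pooled into one clip-level feature (sizes are assumptions)."""
    def __init__(self, dim=512, num_layers=2, num_heads=8, max_frames=16):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):               # (B, T, dim) CLIP image features
        x = frame_feats + self.pos_embed[:, :frame_feats.size(1)]
        x = self.encoder(x)                       # model temporal dependencies
        return x.mean(dim=1)                      # average pooling -> (B, dim)

def zero_shot_scores(video_feat, class_text_feats, temperature=0.01):
    """Cosine similarity between the clip-level video feature and CLIP text
    embeddings of descriptive class sentences (one per unseen class)."""
    v = F.normalize(video_feat, dim=-1)           # (B, dim)
    t = F.normalize(class_text_feats, dim=-1)     # (C, dim)
    return (v @ t.T) / temperature                # (B, C) class logits
```

In this sketch, an unseen-class prediction is simply the argmax over the resulting logits; the external knowledge database and descriptive-sentence labels from the abstract would enter through the text embeddings supplied to `zero_shot_scores`.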