This paper presents a novel multi-modal model for video action recognition built upon the contrastive language-image pre-training (CLIP) model. The presented model extends CLIP in two ways, i.e., by incorporating temporal modeling in the visual encoder and by leveraging prompt learning for language descriptions of action classes, to better learn multi-modal video representations. Specifically, we design a virtual-frame interaction module (VIM) within the visual encoder that transforms the class tokens of sampled video frames into virtual-frame tokens through a linear transformation; temporal modeling operations based on temporal convolution and virtual-frame token shift are then performed to effectively model the spatio-temporal change information in the video. In the language branch, we propose a visual-reinforcement prompt module (VPM) that leverages an attention mechanism to fuse the visual information carried by the class token and the visual tokens, both output by the visual encoder, to enhance the language representations. Fully supervised experiments on four publicly available video datasets, as well as few-shot and zero-shot experiments on two video datasets, demonstrate the effectiveness and generalization capability of the proposed multi-modal model.
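As an illustration only, the sketch below shows one way the virtual-frame interaction described above could be organized: per-frame class tokens are mixed into virtual-frame tokens by a linear transformation, then processed with a temporal token shift and a temporal convolution. This is a minimal PyTorch-style sketch under our own assumptions; the module name VirtualFrameInteraction, the tensor shapes, and all hyper-parameters are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of the virtual-frame interaction idea described in the
# abstract: per-frame class tokens -> virtual-frame tokens (linear mixing),
# then a temporal token shift and a temporal convolution. Names, shapes, and
# hyper-parameters are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class VirtualFrameInteraction(nn.Module):
    def __init__(self, dim: int, num_frames: int, num_virtual: int):
        super().__init__()
        # Linear transformation mixing the T frame class tokens into
        # num_virtual virtual-frame tokens (applied along the time axis).
        self.to_virtual = nn.Linear(num_frames, num_virtual)
        # Temporal convolution over the virtual-frame tokens.
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    @staticmethod
    def temporal_shift(x: torch.Tensor, shift_div: int = 4) -> torch.Tensor:
        # Shift a fraction of channels forward/backward along time (TSM-style).
        b, t, d = x.shape
        fold = d // shift_div
        out = torch.zeros_like(x)
        out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels unchanged
        return out

    def forward(self, cls_tokens: torch.Tensor) -> torch.Tensor:
        # cls_tokens: (B, T, D) class tokens of the sampled frames.
        virtual = self.to_virtual(cls_tokens.transpose(1, 2)).transpose(1, 2)   # (B, V, D)
        virtual = self.temporal_shift(virtual)
        virtual = self.temporal_conv(virtual.transpose(1, 2)).transpose(1, 2)   # (B, V, D)
        return virtual


# Usage on dummy data: 8 sampled frames, 512-dim CLIP tokens, 4 virtual frames.
vim = VirtualFrameInteraction(dim=512, num_frames=8, num_virtual=4)
print(vim(torch.randn(2, 8, 512)).shape)  # torch.Size([2, 4, 512])
```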
Key words
Video action recognition/language-visual contrastive learning/multi-modal model/temporal modeling/prompt learning
Funding
National Natural Science Foundation of China (61972062)
Applied Basic Research Program of Liaoning Province (2023JH2/101300191)
Young and Middle-aged Talents Program of the State Ethnic Affairs Commission (61972062)