Video captioning is the process of using computer vision and natural language processing techniques to convert video content into textual descriptions. It has a wide range of applications, including signal recognition and decoding, online video conferencing, video surveillance and security, video translation, and content retrieval. Video captioning models based on deep learning have achieved significant performance gains; however, these models often have high computational complexity and are difficult to deploy on mobile devices with limited computing resources. To address this problem, two lightweight models are proposed for general video captioning and dense video captioning tasks. Both models are based on the UniVL model, and the minimum model architecture that satisfies the requirements of each video captioning task is determined experimentally. To further reduce model size, an adaptive embedding compression strategy is also proposed, which compresses the models according to the characteristics of different video datasets. In addition, knowledge distillation techniques that use information from different layers are employed to optimize the training of the proposed lightweight models, allowing them to exchange information with teacher models and thereby improve performance. Experimental results show that, compared with the baseline model, the proposed lightweight models achieve a 75% reduction in model parameters with a performance decrease of less than 10%.
Key words
video captioning/model compression/lightweight/knowledge distillation/pre-trained model
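The layer-wise knowledge distillation mentioned in the abstract is commonly implemented as a loss that combines a temperature-scaled soft-label term on the output logits with a mean-squared-error term on matched intermediate hidden states. The sketch below illustrates this general recipe only; the function name, the specific loss weighting, and the use of NumPy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Generic layer-wise distillation loss (illustrative, not the
    paper's exact formulation): temperature-scaled KL between teacher
    and student output distributions, plus MSE between matched
    intermediate hidden states."""
    # Soft-label term: KL(teacher || student) at temperature T,
    # averaged over the batch and rescaled by T^2 as is conventional.
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)))
    kl = kl / student_logits.shape[0]
    # Hidden-state term: mean squared error between matched layers
    # (assumes the student layer has been projected to the teacher's
    # hidden size beforehand).
    mse = np.mean((student_hidden - teacher_hidden) ** 2)
    return alpha * kl * temperature ** 2 + (1.0 - alpha) * mse
```

When student and teacher outputs coincide, both terms vanish and the loss is zero; during training, minimizing this loss pulls the compressed student toward the teacher at both the output and intermediate-layer levels.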