Lightweight Video Captioning Model and Performance Optimization Based on Knowledge Distillation
Video captioning is the process of using computer vision and natural language processing techniques to convert video content into textual descriptions. It has a wide range of applications, including signal recognition and decoding, online video conferencing, video surveillance and security, video translation, and content retrieval. Deep-learning-based video captioning models have achieved significant performance gains; however, these models often have high computational complexity and are difficult to deploy on mobile devices with limited computing resources. To address this problem, two lightweight models are proposed for the general video captioning and dense video captioning tasks. Both are based on the UniVL model, and experiments determine the minimum model architecture that satisfies the requirements of each captioning task. To further reduce model size, an adaptive embedding compression strategy is also proposed, which compresses the models according to the characteristics of different video datasets. Additionally, knowledge distillation techniques that use information from different layers are employed to optimize training of the proposed lightweight models, exchanging information with the teacher model to improve their performance. Experimental results show that, compared with the baseline model, the proposed lightweight models achieve a 75% reduction in model parameters with a performance decrease of less than 10%.
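The output-layer part of the knowledge distillation scheme described above can be illustrated with a minimal NumPy sketch. The function names, temperature, and loss-weighting parameters below are illustrative assumptions, not the paper's exact configuration: the student's training loss blends a softened KL-divergence term against the teacher's output distribution with the standard cross-entropy against ground-truth labels.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Illustrative KD loss: alpha * soft-target KL + (1 - alpha) * hard-label CE.

    T and alpha are hypothetical hyperparameters, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 as in Hinton et al.'s classic formulation.
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1) * T**2
    # Standard cross-entropy on the ground-truth labels (temperature 1).
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), np.asarray(labels)] + 1e-12)
    return float(np.mean(alpha * kl + (1 - alpha) * ce))
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains; as the student drifts from the teacher, the distillation term grows and pulls it back toward the teacher's output distribution.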
video captioning; model compression; lightweight; knowledge distillation; pre-trained model