Video Description Generation Method Based on Latent Feature Augmented Network
Video description generation aims to describe objects and their interactions in a video using natural language. Existing methods do not fully exploit the spatio-temporal semantic information in videos, which limits their ability to generate accurate descriptive sentences. To address this, a Latent Feature Augmented Network (LFAN) model is proposed for video description generation. Different feature extractors are used to obtain appearance, motion, and object features, and the object-level features are fused with the frame-level appearance and motion features; the fused features are then augmented. Before generating the description, a graph neural network and a long short-term memory network are used to reason about the spatio-temporal relationships between objects, yielding latent features that carry both spatio-temporal and semantic information. Finally, a decoder combining a long short-term memory network and a gated recurrent unit generates the description sentence for the video. This network can accurately learn object features and guide the generation of more accurate words and object relationships. Experimental results on the MSVD and MSR-VTT datasets show that the LFAN model significantly improves the accuracy of the generated descriptions and achieves better semantic consistency with the video content, reaching BLEU@4 and ROUGE-L scores of 57.0 and 74.1 on MSVD, and 43.8 and 62.1 on MSR-VTT, respectively.
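The abstract does not specify implementation details, so the following is a minimal PyTorch sketch of the pipeline it outlines: projecting appearance, motion, and object features into a shared space, fusing them, applying a simple graph message-passing step and a temporal LSTM to obtain latent spatio-temporal features, and decoding words with a stacked LSTM and GRU. All layer sizes, the fusion scheme, and the single-step graph layer are illustrative assumptions, not the paper's actual modules or hyper-parameters.

# Minimal, runnable sketch of an LFAN-style pipeline (assumed details).
import torch
import torch.nn as nn


class LatentFeatureAugmentedNet(nn.Module):
    def __init__(self, app_dim=2048, mot_dim=1024, obj_dim=2048,
                 hidden=512, vocab_size=10000):
        super().__init__()
        # Project frame-level appearance/motion and object-level features
        # into a shared space before fusion.
        self.app_proj = nn.Linear(app_dim, hidden)
        self.mot_proj = nn.Linear(mot_dim, hidden)
        self.obj_proj = nn.Linear(obj_dim, hidden)
        # Assumed graph layer: one round of adjacency-weighted message passing
        # over object nodes to model their relationships.
        self.graph_fc = nn.Linear(hidden, hidden)
        # Temporal reasoning over the fused frame features.
        self.temporal_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Decoder combining an LSTM and a GRU, as the abstract states.
        self.dec_lstm = nn.LSTMCell(hidden + hidden, hidden)
        self.dec_gru = nn.GRUCell(hidden, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, app, mot, obj, adj, captions):
        # app, mot: (B, T, app_dim / mot_dim)  frame-level features
        # obj:      (B, N, obj_dim)            object-level features
        # adj:      (B, N, N)                  object relation graph
        # captions: (B, L)                     word ids for teacher forcing
        B = app.size(0)
        # Object-level reasoning with one graph message-passing step.
        o = torch.relu(self.obj_proj(obj))                    # (B, N, H)
        o = torch.relu(self.graph_fc(torch.bmm(adj, o)))      # (B, N, H)
        obj_ctx = o.mean(dim=1, keepdim=True)                 # (B, 1, H)
        # Fuse object context into every frame's appearance + motion features.
        frames = torch.relu(self.app_proj(app)) + torch.relu(self.mot_proj(mot))
        fused = frames + obj_ctx                              # (B, T, H)
        # Temporal LSTM yields the latent spatio-temporal features.
        latent, _ = self.temporal_lstm(fused)                 # (B, T, H)
        video_ctx = latent.mean(dim=1)                        # (B, H)
        # Decode words with a stacked LSTM + GRU.
        h_l = torch.zeros(B, video_ctx.size(1), device=app.device)
        c_l = torch.zeros_like(h_l)
        h_g = torch.zeros_like(h_l)
        logits = []
        for t in range(captions.size(1)):
            w = self.embed(captions[:, t])                    # (B, H)
            h_l, c_l = self.dec_lstm(torch.cat([w, video_ctx], dim=1), (h_l, c_l))
            h_g = self.dec_gru(h_l, h_g)
            logits.append(self.out(h_g))
        return torch.stack(logits, dim=1)                     # (B, L, vocab)


# Example forward pass with random tensors.
model = LatentFeatureAugmentedNet()
app = torch.randn(2, 8, 2048)                 # 8 sampled frames, appearance
mot = torch.randn(2, 8, 1024)                 # motion features
obj = torch.randn(2, 5, 2048)                 # 5 detected objects
adj = torch.softmax(torch.randn(2, 5, 5), dim=-1)
caps = torch.randint(0, 10000, (2, 12))
print(model(app, mot, obj, adj, caps).shape)  # torch.Size([2, 12, 10000])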