Existing video description generation methods extract and combine features in relatively simple ways, causing the model to lose important semantic information related to the video description and limiting accurate description and understanding of the video content. To address these deficiencies, this paper proposes a video description generation method based on enhanced global-local feature fusion. First, different feature extractors are used to extract local and global features from the video clips; then, to model the correlations between these different levels of features (local and global), a feature fusion enhancement network performs feature fusion, enriching the feature information available to the model. The bidirectional long short-term memory network used by the decoder is followed by a reconstruction network, which reconstructs the video feature sequences produced by the encoder, and the descriptive sentences for the video are finally generated by a long short-term memory network. Experimental results on the MSVD and MSR-VTT datasets show that the proposed model significantly improves the accuracy of the generated descriptive sentences.
video description generation; enhanced feature fusion network; natural language processing
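To make the described pipeline concrete, the following is a minimal sketch in PyTorch of the overall structure (local/global feature fusion, a bidirectional LSTM over the fused sequence, a reconstruction head, and an LSTM that produces word logits). The fusion mechanism (concatenation plus a sigmoid gate), all layer sizes, and the module names are assumptions for illustration only, since the abstract does not specify them; word embeddings and attention in the decoder are omitted for brevity. This is not the authors' implementation.

```python
# Illustrative sketch only; fusion via concat + gating is an assumption,
# as are all dimensions and module names.
import torch
import torch.nn as nn

class FusionEnhancement(nn.Module):
    """Fuses per-clip local and global features (assumed: concat + sigmoid gate)."""
    def __init__(self, local_dim, global_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(local_dim + global_dim, hidden_dim)
        self.gate = nn.Linear(local_dim + global_dim, hidden_dim)

    def forward(self, local_feats, global_feats):
        # local_feats, global_feats: (batch, time, dim)
        x = torch.cat([local_feats, global_feats], dim=-1)
        return torch.sigmoid(self.gate(x)) * torch.tanh(self.proj(x))

class CaptionModel(nn.Module):
    """BiLSTM over fused features, a reconstruction head, and an LSTM word decoder."""
    def __init__(self, local_dim, global_dim, hidden_dim, vocab_size):
        super().__init__()
        self.fusion = FusionEnhancement(local_dim, global_dim, hidden_dim)
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.reconstruct = nn.Linear(2 * hidden_dim, hidden_dim)  # rebuilds the fused sequence
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, local_feats, global_feats):
        fused = self.fusion(local_feats, global_feats)   # (B, T, H)
        enc, _ = self.bilstm(fused)                      # (B, T, 2H)
        recon = self.reconstruct(enc)                    # reconstruction of the fused sequence
        out, _ = self.decoder(enc)                       # (B, T, H)
        return self.vocab(out), recon, fused

# Toy usage with random tensors standing in for CNN features.
model = CaptionModel(local_dim=512, global_dim=1024, hidden_dim=256, vocab_size=10000)
logits, recon, fused = model(torch.randn(2, 20, 512), torch.randn(2, 20, 1024))
recon_loss = nn.functional.mse_loss(recon, fused)  # auxiliary reconstruction objective
```

The reconstruction loss here is paired with the usual cross-entropy over the word logits during training; the exact weighting between the two terms is another detail the abstract leaves unspecified.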