Text-to-Video Generation: Research Status, Progress, and Challenges
Text-to-video generation aims to produce semantically consistent, photo-realistic, temporally consistent, and logically coherent videos from textual descriptions. This paper first surveys the current state of research in text-to-video generation, providing a detailed overview of three mainstream approaches: methods based on recurrent networks and Generative Adversarial Networks (GANs), methods based on Transformers, and methods based on diffusion models. Each class of models has its strengths and weaknesses in video generation. Methods based on recurrent networks and GANs can generate videos of higher resolution and longer duration but struggle with complex open-domain content. Transformer-based methods are proficient at generating open-domain videos but suffer from unidirectional bias and error accumulation, which makes high-fidelity generation difficult. Diffusion models exhibit good generalization but are constrained by slow inference and high memory consumption, which makes it challenging to generate high-definition, long videos. The paper then reviews evaluation benchmarks and metrics for text-to-video generation and compares the performance of existing methods. Finally, potential future research directions in the field are outlined.