Survey of Zero-Shot Transfer Learning Methods Based on Vision-Language Pre-Trained Models
In recent years, remarkable advances in Artificial Intelligence (AI) across unimodal domains, such as computer vision and Natural Language Processing (NLP), have highlighted the growing importance and necessity of multimodal learning. Among emerging techniques, Zero-Shot Transfer (ZST) based on vision-language pre-trained models has attracted widespread attention from researchers worldwide. Owing to the strong generalization capabilities of pre-trained models, leveraging vision-language pre-trained models not only improves the accuracy of zero-shot recognition tasks but also makes it possible to address zero-shot downstream tasks that are beyond the scope of conventional approaches. This review provides an overview of ZST methods based on vision-language pre-trained models. First, it introduces conventional approaches to Few-Shot Learning (FSL) and summarizes its main forms. It then discusses the distinctions between ZST based on vision-language pre-trained models and FSL, highlighting the new tasks that ZST can address. Subsequently, it surveys the application of ZST methods to various downstream tasks, including sample recognition, object detection, semantic segmentation, and cross-modal generation. Finally, it analyzes the challenges faced by current ZST methods based on vision-language pre-trained models and outlines potential directions for future research.
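As a concrete illustration of the ZST paradigm surveyed here, the sketch below performs zero-shot image classification with a CLIP-style vision-language pre-trained model through the Hugging Face transformers library. The checkpoint name, prompt template, and image path are illustrative assumptions rather than prescriptions from the surveyed works.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly released CLIP checkpoint (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes the model was never fine-tuned on; each is wrapped
# in a natural-language prompt so the text encoder can embed it.
class_names = ["cat", "dog", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("example.jpg")  # hypothetical input image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the temperature-scaled cosine similarities between
# the image embedding and each prompt embedding; softmax converts them into
# zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")

Because no task-specific training is involved, extending this classifier to new categories amounts to editing the prompt list; this is precisely what separates ZST from FSL pipelines, which still require a handful of labeled support samples per class.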