The Vision Transformer (ViT), which adapts the Transformer encoder architecture to image data, has achieved remarkable success in the field of computer vision. Over the past few years, research centered on ViT has surged and consistently delivered strong performance, making work built on this model a pivotal and prominent research direction for computer vision tasks. This paper therefore provides a comprehensive survey of recent advances and developments in ViT. It first briefly revisits the fundamental principles of the Transformer and its adaptation into ViT, analyzing the structural characteristics and advantages of the ViT model. It then categorizes and synthesizes the main directions of improvement for ViT backbone networks and their representative models, based on the distinguishing features of each ViT variant: enhancements in locality, structural modifications, self-supervised improvements, and lightweight and efficient designs, which are thoroughly examined and compared. Finally, the paper discusses the remaining shortcomings of current ViT models and their improved variants, and offers a prospective view on future research directions for ViT. This comprehensive analysis serves as a valuable reference for researchers deliberating on the choice of deep learning methodologies for their investigations into ViT backbone networks.