Multimodal pre-training has attracted increasing interest for vision-language tasks. Recent comprehensive studies have demonstrated that multimodal representation learning can benefit vision-language downstream tasks. Multimodal pre-training requires large-scale training data and self-supervised learning. This paper reviews significant transformer-based research on Vision-Language (VL) pre-training that emerged after BERT. First, the application background and development significance of multimodal pre-training are expounded. Second, this paper introduces the development of mainstream multimodal networks and analyzes the advantages and disadvantages of each method. Then, we explain the cost functions used in multi-task pre-training. Next, we describe the large-scale image-text datasets mentioned in recent studies. Finally, combining different VL downstream tasks, this paper describes the task objectives, datasets, and training methods.