To improve the semantic segmentation of remote sensing images, this paper proposes a multi-scale Transformer network (MSTNet) based on the characteristics of segmentation targets, namely small inter-class variance and large intra-class variance, focusing on two key points: global contextual information and multi-scale semantic features. MSTNet consists of an encoder and a decoder. The encoder includes an improved Transformer-based visual attention network (VAN) backbone and an improved multi-scale semantic feature extraction module (MSFEM) based on atrous spatial pyramid pooling (ASPP) to extract multi-scale semantic features. The decoder is designed with a lightweight multi-layer perceptron (MLP) combined with the encoder to fully analyze the global contextual information and multi-scale representation features, utilizing the inductive property of the Transformer. The proposed MSTNet was validated on two high-resolution remote sensing semantic segmentation datasets, ISPRS Potsdam and LoveDA, achieving a mean intersection over union (mIoU) of 79.50% and 54.12%, and a mean F1-score (mF1) of 87.46% and 69.34%, respectively. The experimental results verify that the proposed method effectively improves the semantic segmentation of remote sensing images.