The Vision Transformer (ViT) shows promise in enhancing image super-resolution performance. However, the diverse scales of objects inherent in remote sensing images significantly constrain super-resolution quality. To address this, a remote sensing image super-resolution method is introduced that combines multi-scale processing and multiple attention mechanisms in a Transformer network, with the goal of strengthening its feature learning capability and effectively improving super-resolution performance on remote sensing images. Specifically, the input features are successively downsampled to obtain features at multiple scales. The low-resolution features are then transformed stage by stage through a Transformer network that alternates dense attention and sparse attention, and the resulting output is upsampled and fused with the higher-resolution features. The combination of dense and sparse attention enables the simultaneous extraction of local and global dependencies, while the multi-path, multi-scale transformation strengthens the modeling of small objects in the images. Extensive experimental results on two public remote sensing datasets validate the effectiveness of the proposed method.
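The pipeline described above (successive downsampling, alternating dense and sparse attention at the coarse scales, then upsampling and fusion with the finer scales) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: all function names, the window size, 2x average-pool downsampling, nearest-neighbour upsampling, and additive fusion are assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(tokens):
    # Full self-attention over all tokens: captures global dependencies.
    # tokens: (N, C)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores, axis=-1) @ tokens

def sparse_attention(tokens, window=4):
    # Windowed self-attention: each token attends only within its
    # local window, capturing local dependencies cheaply.
    out = np.zeros_like(tokens)
    for s in range(0, tokens.shape[0], window):
        out[s:s + window] = dense_attention(tokens[s:s + window])
    return out

def downsample(f):
    # 2x average pooling on a (H, W, C) feature map (assumed scheme).
    H, W, C = f.shape
    return f.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def upsample(f):
    # Nearest-neighbour 2x upsampling (assumed scheme).
    return f.repeat(2, axis=0).repeat(2, axis=1)

def multi_scale_transform(f, levels=2):
    # Build a pyramid of features by successive downsampling.
    pyramid = [f]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    # Starting from the coarsest scale, alternate dense and sparse
    # attention, then upsample and fuse with the next finer scale.
    out = pyramid[-1]
    for lvl in reversed(range(levels)):
        H, W, C = out.shape
        tokens = out.reshape(H * W, C)
        tokens = dense_attention(tokens)    # global dependencies
        tokens = sparse_attention(tokens)   # local dependencies
        out = upsample(tokens.reshape(H, W, C)) + pyramid[lvl]  # fusion
    return out
```

Each stage keeps the spatial resolution of its pyramid level, so the final output has the same shape as the input feature map; in the actual network the attention blocks would of course carry learned projections rather than raw token products.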