Image super-resolution reconstruction with transposed self-attention and local feature enhancement
Objective Research on super-resolution image reconstruction based on deep learning techniques has made remarkable progress in recent years. In particular, when the development of traditional convolutional neural networks reached a bottleneck, the Transformer, which performs extremely well in natural language processing, was introduced to the super-resolution image reconstruction task. However, the computational complexity of the Transformer grows quadratically with the spatial size (width × height) of the input image, which prevents the Transformer from being fully transferred to low-level computer vision tasks. Recent methods, such as image restoration using Swin Transformer (SwinIR), have achieved excellent performance by dividing the image into windows, computing self-attention within each window, and exchanging information across windows. However, this window-division strategy incurs a growing computational burden as the window size increases. Moreover, it cannot completely model the global information of an image, resulting in a partial loss of information. To solve these problems, we model the long-range dependencies of images by constructing a Transformer block while keeping the number of parameters moderate. Excellent super-resolution reconstruction performance is achieved by constructing global dependencies among features.

Method The proposed super-resolution network based on transposed self-attention (SRTSA) consists of four main stages: a shallow feature extraction module, a deep feature extraction module, an image upsampling module, and an image reconstruction module. The shallow feature extraction part consists of a 3 × 3 convolution. The deep feature extraction part mainly consists of a global and local information extraction block (GLIEB). Our proposed GLIEB performs simple relational modeling through a sufficiently lightweight nonlinear activation free block (NAFBlock). Although dropout can improve the robustness of a model, we discard the dropout layer to avoid losing information before the features are modeled globally. When modeling feature information globally with the transposed self-attention mechanism, we keep the features that benefit image reconstruction and discard those that harm it by replacing the softmax activation function in the self-attention mechanism with the ReLU activation function, which makes the reconstructed global dependencies more robust. Given that an image contains both global and local information, a residual channel attention module is used to supplement the local information and enhance the expressive ability of the model. Furthermore, a new dual-channel gating mechanism is introduced to control the flow of information through the model, improving its feature modeling capability and robustness. The image upsampling module uses subpixel convolution to expand the features to the target dimensions, and the reconstruction module employs a 3 × 3 convolution to obtain the final reconstruction result. For the loss function, although many loss functions have been proposed to optimize model training, we use the same L1 loss function as SwinIR to supervise training so as to demonstrate the advancement and effectiveness of the model itself. The L1 loss function provides stable gradients that allow the model to converge quickly. In the training phase, 800 images from the DIV2K dataset are used. These training images are randomly rotated or horizontally flipped to augment the dataset, and 16 LR image patches of size 48 × 48 pixels are used as input in each iteration. The Adam optimizer is used for training.
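To make the core mechanism concrete, the following PyTorch sketch illustrates transposed self-attention: the attention map is computed across channels (C × C) rather than across spatial positions (HW × HW), so its cost scales with the channel count instead of the image size, and ReLU replaces softmax so that negatively correlated features are suppressed. The module layout, head count, and normalization are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedSelfAttention(nn.Module):
    """Channel-wise ("transposed") self-attention sketch: the C x C attention
    map replaces the HW x HW map, and ReLU (instead of softmax) zeroes out
    channels with negative correlation. Layout is illustrative."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Flatten spatial positions: (b, heads, c/heads, h*w)
        q = q.reshape(b, self.num_heads, c // self.num_heads, h * w)
        k = k.reshape(b, self.num_heads, c // self.num_heads, h * w)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (b, heads, c', c')
        attn = F.relu(attn)  # keep helpful correlations, discard negative ones
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)
```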
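The residual channel attention module that supplements local information is described only at a high level; a plausible RCAN-style realization is sketched below, with the reduction ratio and body layout chosen for illustration rather than taken from the paper.

```python
import torch.nn as nn

class ResidualChannelAttention(nn.Module):
    """RCAN-style residual channel attention sketch: global average pooling
    summarizes each channel, a bottleneck MLP produces per-channel weights,
    and a residual connection preserves the input features."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.body(x)
        return x + res * self.attn(res)  # reweight channels, keep residual path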
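The upsampling and reconstruction modules are specified directly in the text: subpixel convolution expands the features to the target dimensions, and a 3 × 3 convolution produces the output image. A minimal sketch, with channel counts assumed:

```python
import torch.nn as nn

class UpsampleTail(nn.Module):
    """Sub-pixel upsampling plus a 3x3 reconstruction convolution. For scale
    s, a conv expands channels by s^2 and PixelShuffle rearranges them into
    an s-times larger feature map. Channel counts are illustrative."""
    def __init__(self, dim, scale, out_channels=3):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.Conv2d(dim, dim * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.reconstruct = nn.Conv2d(dim, out_channels, 3, padding=1)

    def forward(self, x):
        return self.reconstruct(self.upsample(x))
```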
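The training protocol described above (L1 supervision as in SwinIR, Adam, 16 LR patches of 48 × 48 pixels per iteration with random rotation or horizontal flipping) corresponds to a standard loop; the model, learning rate, and data loader below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

# One training epoch; `model` and `loader` are hypothetical placeholders.
# `loader` yields (16, 3, 48, 48) LR patches with rotation/flip augmentation
# already applied, paired with the corresponding HR patches.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # lr is an assumption
for lr_patches, hr_patches in loader:
    sr = model(lr_patches)
    loss = F.l1_loss(sr, hr_patches)  # L1: stable gradients, fast convergence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```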
Result We test on five datasets commonly used in super-resolution tasks, namely, Set5, Set14, the Berkeley segmentation dataset 100 (BSD100), Urban100, and Manga109, to demonstrate the effectiveness and robustness of the proposed method. We also compare the proposed method with the SRCNN, VDSR, EDSR, RCAN, SAN, HAN, NLSA, and SwinIR networks in terms of objective metrics. These networks are supervised with the L1 loss function during training. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are calculated on the Y channel of the YCbCr space of the output image to measure the reconstruction quality. Experimental results show that the PSNR and SSIM values obtained by our method are both optimal. In the ×2 super-resolution task, compared with SwinIR, the PSNR of the proposed method is improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB on the five datasets, and the SSIM is enhanced by 0.0004, 0.0016, 0.0009, and 0.0027 on the four datasets other than Manga109. The reconstruction results demonstrate that SRTSA recovers more detailed information and richer texture structure than most methods. Attribution analysis with local attribution maps (LAM) shows that SRTSA uses a wider range of pixels in the reconstruction process than other methods such as SwinIR, which fully illustrates the global modeling capability of SRTSA.

Conclusion The proposed super-resolution image reconstruction algorithm based on a transposed self-attention mechanism converts global relationship modeling in the spatial dimension into global relationship modeling in the channel dimension, thereby fully modeling the global relationships of feature information without losing the local relationships of features. The reconstruction incorporates both global and local information, which effectively improves image super-resolution performance. The excellent PSNR and SSIM results on five datasets and the significantly higher quality of the reconstructed images, with rich details and sharp edges, fully demonstrate the effectiveness and superiority of the proposed method.
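For reproducibility of the evaluation protocol in the Result section, PSNR on the Y channel of YCbCr can be computed as sketched below (BT.601 conversion for 8-bit RGB); SSIM is evaluated analogously on the same Y arrays, for example with skimage.metrics.structural_similarity.

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma (the Y channel of YCbCr) from an 8-bit RGB image."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of two same-sized 8-bit RGB images."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```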