Dynamic association learning of self-attention and convolution in image restoration
Objective Convolutional neural networks (CNNs) and self-attention (SA) have achieved great success in multimedia applications, and this study investigates dynamic association learning of SA and convolution for image restoration. However, owing to their intrinsic local connectivity and translation equivariance, CNNs have at least two shortcomings: 1) a limited receptive field and 2) sliding-window weights that are static at inference and therefore cannot cope with content diversity. The former prevents the network from capturing long-range pixel dependencies, while the latter sacrifices adaptability to the input content. As a result, CNNs fall far short of modeling the global rain distribution and produce results with obvious rain residue. Meanwhile, because SA is computed globally, its computational complexity grows quadratically with the spatial resolution, making it infeasible for high-resolution images. In view of the strengths and weaknesses of the two architectures, this study proposes an association learning method that combines their advantages and suppresses their respective shortcomings to achieve high-quality and efficient restoration.
Method This study combines the advantages of the CNN and SA architectures, in particular by fully exploiting CNNs' local perception and translation invariance for local context and global structural representations, as well as SA's global aggregation ability. We take inspiration from the observation that the rain distribution, beyond being a prediction target, reflects the location and degree of degradation. Therefore, we propose to refine background textures with the predicted degradation prior in an association learning manner. We accomplish image deraining by associating rain streak removal with background recovery, and an image deraining network and a background recovery network are specifically designed for these two subtasks. The key part of association learning is a novel multi-input
attention module (MAM). It generates the degradation prior and produces the degradation mask according to the predicted rain distribution. Benefiting from the global correlation calculation of SA, MAM can extract informative complementary components from the rainy input (query) with the degradation mask (key) and thereby help realize accurate texture restoration. SA tends to aggregate feature maps according to attention importance, whereas convolution diversifies them to focus on local textures. Unlike Restormer, which is built from pure Transformer blocks, our design runs SA and CNNs in parallel and yields a hybrid fusion network. The network comprises one residual Transformer branch (RTB) and one encoder-decoder branch (EDB). The former takes a few learnable tokens (feature channels) as input and stacks multihead attention and feed-forward networks to encode global features of the image. The latter, conversely, leverages a multiscale encoder-decoder to represent contextual knowledge. We propose a lightweight hybrid fusion block that aggregates the outputs of RTB and EDB to yield the final solution to the subtask. In this way, we construct our final model as a two-stage Transformer-based method, namely ELF, for single image deraining.
Result An ablation experiment is conducted on the Test 1200 dataset to validate the effectiveness of each part of the algorithm. The experimental results show that fusing CNN and SA effectively improves the model's representation ability. At the same time, association learning between degradation removal and background restoration effectively improves the overall restoration quality. The proposed method is compared with more than 10 recent methods on synthetic and real data across three restoration tasks and achieves significant improvements. In image rain removal, ELF improves the peak signal-to-noise ratio (PSNR) by 0.9 dB compared with the multi-stage progressive image restoration network (MPRNet) on
the synthetic dataset Test 1200. In the underwater enhancement task, ELF exceeds Ucolor by 4.15 dB on the R90 dataset. In the low-illumination image enhancement task, ELF achieves a 1.09 dB improvement over the LLFlow algorithm.
Conclusion We rethink image deraining as a composite task of rain streak removal, texture recovery, and their association learning, and propose the ELF model for image deraining. Accordingly, ELF adopts a two-stage architecture and an association learning module to account for the two goals of rain streak removal and texture reconstruction while facilitating the learning capability. The joint optimization promotes compatibility while maintaining model compactness. Extensive results on image deraining and joint detection tasks demonstrate the superiority of our ELF model over state-of-the-art techniques. The proposed method is both efficient and effective and outperforms representative methods in common tasks such as image rain removal, low-light image enhancement, and underwater enhancement.
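
The gains reported in the Result part are differences in PSNR, measured in decibels. As a minimal sketch of how such figures can be read (this is not the evaluation code of this study), the snippet below computes PSNR from the standard definition, 10·log10(MAX² / MSE), and converts a PSNR gain into the corresponding reduction in mean squared error. The function names `psnr` and `mse_ratio_for_gain`, the flat-list image representation, the 8-bit peak value of 255, and the toy pixel values are all assumptions made for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of pixel intensities (an assumption that
    keeps this sketch free of third-party dependencies).
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


# Toy pixel values, not taken from any dataset mentioned in the abstract.
clean = [52, 55, 61, 59, 79, 61, 76, 61]
restored = [54, 55, 60, 59, 77, 62, 76, 60]

print(f"PSNR of toy example: {psnr(clean, restored):.2f} dB")
for gain in (0.9, 1.09, 4.15):
    print(f"+{gain} dB PSNR -> MSE multiplied by {mse_ratio_for_gain(gain):.3f}")
```

Because PSNR is logarithmic in the error, a fixed dB gain corresponds to the same relative MSE reduction whatever the baseline PSNR is. Comparisons across datasets still require the same peak value and color space on both sides.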