ACCURATE IMAGE RESTORATION WITH ATTENTION RETRACTABLE TRANSFORMER

Abstract

Recently, Transformer-based image restoration networks have achieved promising improvements over convolutional neural networks due to parameter-independent global interactions. To lower computational cost, existing works generally limit self-attention computation within non-overlapping windows. However, each group of tokens is always drawn from a dense area of the image. This is considered a dense attention strategy, since the interactions of tokens are restrained to dense regions. Obviously, this strategy could result in restricted receptive fields. To address this issue, we propose the Attention Retractable Transformer (ART) for image restoration, which presents both dense and sparse attention modules in the network. The sparse attention module allows tokens from sparse areas to interact and thus provides a wider receptive field. Furthermore, the alternating application of dense and sparse attention modules greatly enhances the representation ability of the Transformer while providing retractable attention on the input image. We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks. Experimental results validate that our proposed ART outperforms state-of-the-art methods on various benchmark datasets both quantitatively and visually. We also provide code and models at https://github.com/gladzhang/ART.

1. INTRODUCTION

Image restoration aims to recover the high-quality image from its low-quality counterpart and includes a series of computer vision applications, such as image super-resolution (SR) and denoising. It is an ill-posed inverse problem since there is a huge number of candidates for any input. Recently, deep convolutional neural networks (CNNs) have been investigated to design various models Kim et al. (2016b); Zhang et al. (2020; 2021b) and have achieved state-of-the-art results on several image restoration tasks. In contrast, even higher performance can be achieved with Transformers. Based on joint dense and sparse attention strategies, we design two types of self-attention blocks. We utilize fixed non-overlapping local windows to obtain tokens for the first block, named dense attention block (DAB), and sparse grids to obtain tokens for the second block, named sparse attention block (SAB). To better understand the difference between our work and SwinIR, we show a visual comparison in Fig. 1. As we can see, the image is divided into four groups and tokens in each group interact with each other. Visibly, the token in our sparse attention block can learn relationships from farther tokens, while the one in the dense attention block of SwinIR cannot. At the same computational cost, the sparse attention block has a stronger ability to compensate for the lack of global information. We treat our dense and sparse attention blocks as successive ones and apply them to extract deep features. In practice, the alternating application of DAB and SAB can provide retractable attention for the model to capture both local and global receptive fields. Our main contributions can be summarized as follows:

• We propose sparse attention to compensate for the drawback of mainly using dense attention in existing Transformer-based image restoration networks. The interactions among tokens extracted from a sparse area of an image bring a wider receptive field to the module.
F_0 ∈ R^{H×D×C}, where C is the dimension size of the new feature embedding. Next, the shallow feature is normalized and fed into the residual groups, which consist of the core Transformer attention blocks. The deep feature is extracted and then passes through another 3×3 Conv to obtain further feature embeddings F_1. Then we use an element-wise sum to obtain the final feature map F_R = F_0 + F_1. Finally, we employ the restoration module to generate the high-quality image I_HQ from the feature map F_R.

Residual Group. We use N_G successive residual groups to extract the deep feature. Each residual group consists of N_B pairs of attention blocks. We design two successive attention blocks, shown in Fig. 2(b). The input feature x_{l-1} passes through layer normalization (LN) and multi-head self-attention (MSA). After adding the shortcut, the output x'_l is fed into the multi-layer perceptron (MLP). x_l is the final output at the l-th block. The process is formulated as

x'_l = MSA(LN(x_{l-1})) + x_{l-1},    x_l = MLP(LN(x'_l)) + x'_l. (1)

Lastly, we also apply a 3×3 convolutional layer to refine the feature embeddings.

Loss Function. We optimize our ART with two well-studied types of loss functions. For image SR, the goal of training ART is to minimize the L_1 loss

L = ||I_HQ − I_G||_1, (4)

where I_HQ is the output of ART and I_G is the ground-truth image. For image denoising and JPEG compression artifact reduction, we utilize the Charbonnier loss with hyper-parameter ε set to 10^{-3}:

L = sqrt(||I_HQ − I_G||^2 + ε^2). (5)
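As a concrete sketch, the pipeline above (shallow conv, residual groups, 3×3 conv, global residual, restoration head) can be written in PyTorch. The residual groups here are stand-in conv blocks rather than the DAB/SAB Transformer blocks, and all names and sizes are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ARTSketch(nn.Module):
    """Skeleton of the ART pipeline: shallow conv -> residual groups -> 3x3 conv,
    global residual F_R = F_0 + F_1, then a conv restoration head
    (denoising-style, I_HQ = Conv(F_R) + I_LQ). Stand-in blocks only."""
    def __init__(self, in_ch=3, dim=16, n_groups=2):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, dim, 3, 1, 1)   # shallow feature F_0
        # placeholder for N_G residual groups of DAB/SAB pairs
        self.groups = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(dim, dim, 3, 1, 1), nn.GELU())
            for _ in range(n_groups)
        ])
        self.refine = nn.Conv2d(dim, dim, 3, 1, 1)      # conv after the deep feature
        self.restore = nn.Conv2d(dim, in_ch, 3, 1, 1)   # restoration head

    def forward(self, x):
        f0 = self.shallow(x)
        f1 = self.refine(self.groups(f0))               # deep feature F_1
        fr = f0 + f1                                    # F_R = F_0 + F_1
        return self.restore(fr) + x                     # I_HQ = Conv(F_R) + I_LQ
```

For super-resolution, the restoration head would instead be the sub-pixel upsampler of Eq. (2).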

3.2. ATTENTION RETRACTABLE TRANSFORMER

We elaborate the details of our proposed two types of self-attention blocks in this section. As plotted in Fig. 2(b), the interactions of tokens are concentrated in the multi-head self-attention (MSA) module. We formulate the calculation in MSA as MSA(X) = Softmax(QK^T/√C)V, where Q, K, V ∈ R^{N×C} are respectively the query, key, and value obtained from linear projections of the input X ∈ R^{N×C}. N is the length of the token sequence, and C is the dimension size of each token. Here we assume that the number of heads is 1, reducing MSA to single-head self-attention for simplicity.

Multi-head Self-Attention. Given an image of size H×D, a vision Transformer first splits the raw image into numerous patches. These patches are projected by convolutions with stride size P. The new projected feature map X ∈ R^{h×w×C} is prepared with h = H/P and w = D/P. Common MSA uses all the tokens extracted from the whole feature map and sends them to the self-attention module to learn relationships between each other. It suffers from a high computational cost, which is Ω(MSA) = 4hwC^2 + 2(hw)^2C. To lower the computational cost, existing works generally utilize non-overlapping windows to obtain shorter token sequences. However, they mainly consider tokens from a dense area of an image. Different from them, we propose retractable attention strategies, which provide interactions of tokens from not only dense areas but also sparse areas of an image to obtain a wider receptive field.

Dense Attention. As shown in Fig. 3(a), dense attention allows each token to interact with a smaller number of tokens, drawn from neighboring positions in a non-overlapping W×W window. All tokens are split into several groups and each group has W×W tokens. We apply these groups to compute self-attention (h/W) × (w/W) times, and the computational cost of the new module, named D-MSA, is Ω(D-MSA) = (4W^2C^2 + 2W^4C) × (h/W) × (w/W) = 4hwC^2 + 2W^2hwC. Sparse Attention. Meanwhile, as shown in Fig.
3(b), we propose sparse attention to allow each token to interact with a smaller number of tokens, drawn from sparse positions with interval size I. The tokens are likewise split into several groups and each group has (h/I) × (w/I) tokens. We further utilize these groups to compute self-attention I×I times. We name the new multi-head self-attention module S-MSA, and the corresponding computational cost is Ω(S-MSA) = (4(h/I)(w/I)C^2 + 2((h/I)(w/I))^2C) × I × I = 4hwC^2 + 2(h/I)(w/I)hwC. By contrast, our proposed D-MSA and S-MSA modules have lower computational cost than full MSA since W^2 < hw and (h/I)(w/I) < hw. After computing all groups, the outputs are merged to form the original-size feature map. In practice, we apply these two attention strategies to design two types of self-attention blocks, named dense attention block (DAB) and sparse attention block (SAB), as plotted in Fig. 2.

Successive Attention Blocks. We propose the alternating application of these two blocks. As local interactions have higher priority, we fix the order with DAB in front of SAB. Besides, we provide a long-distance residual connection around every three pairs of blocks. We show the effectiveness of this joint application with residual connections in the supplementary material.

Attention Retractable Transformer. We demonstrate that the application of these two blocks enables our model to capture local and global receptive fields simultaneously. We treat the successive attention blocks as a whole and obtain a new type of Transformer named the Attention Retractable Transformer, which can provide interactions for both local dense tokens and global sparse tokens.
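The two grouping rules reduce to tensor reshapes. The helpers below use identity Q = K = V projections (single head, no learned weights) purely to show how tokens are grouped under each strategy; they are an illustrative sketch, not the paper's code:

```python
import torch

def attend(groups):
    """Single-head self-attention within each group: Softmax(QK^T / sqrt(C)) V,
    with identity projections for brevity."""
    c = groups.shape[-1]
    attn = torch.softmax(groups @ groups.transpose(-2, -1) / c ** 0.5, dim=-1)
    return attn @ groups

def dense_attention(x, win=4):
    """D-MSA grouping on an (h, w, C) map: tokens of each non-overlapping
    win x win window interact with each other."""
    h, w, c = x.shape
    g = x.reshape(h // win, win, w // win, win, c).permute(0, 2, 1, 3, 4)
    out = attend(g.reshape(-1, win * win, c))
    out = out.reshape(h // win, w // win, win, win, c).permute(0, 2, 1, 3, 4)
    return out.reshape(h, w, c)

def sparse_attention(x, interval=4):
    """S-MSA grouping: tokens at the same offset modulo `interval` interact,
    so each of the interval^2 groups samples the whole map with stride `interval`."""
    h, w, c = x.shape
    I = interval
    g = x.reshape(h // I, I, w // I, I, c).permute(1, 3, 0, 2, 4)
    out = attend(g.reshape(I * I, (h // I) * (w // I), c))
    out = out.reshape(I, I, h // I, w // I, c).permute(2, 0, 3, 1, 4)
    return out.reshape(h, w, c)
```

Both keep the 4hwC^2 + O(·)hwC cost profile derived above, because every token still attends to exactly W^2 (respectively (h/I)(w/I)) partners.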

3.3. DIFFERENCES TO RELATED WORKS

We summarize the differences between our proposed ART and closely related works in Tab. 1 and conclude them in three points. (2) Different designs of sparse attention. GG-Transformer utilizes adaptively-dilated partitions, MaxViT utilizes fixed-size grid attention, and CrossFormer utilizes cross-scale long-distance attention. In those works, as the layers get deeper, the interval of tokens in sparse attention becomes smaller and the channel dimension of tokens becomes larger, so each token learns more semantic-level information. In contrast, the interval and the channel dimension of tokens in our ART stay unchanged and each token represents accurate pixel-level information. (3) Different model structures. Different from these works using a pyramid model structure, our proposed ART enjoys an isotropic structure. Besides, we provide a long-distance residual connection between Transformer encoders, which enables the features of deep layers to preserve more low-frequency information from shallow layers. More discussion can be found in the supplementary material.

3.4. IMPLEMENTATION DETAILS

Some details on how to apply our ART to construct the image restoration model are introduced here. Firstly, the residual group number, DAB number, and SAB number in each group are set to 6, 3, and 3, respectively. Secondly, all convolutional layers are equipped with a 3×3 kernel, 1-length stride, and 1-length padding, so the height and width of the feature map remain unchanged. In practice, we treat each 1×1 patch as a token. Besides, we set the channel dimension to 180 for most layers, except for the shallow feature extraction and the image reconstruction process. Thirdly, the window size in DAB is set to 8 and the interval size in SAB is adjustable according to different tasks, which is discussed in Sec. 4.2. Lastly, to adjust the division of windows and sparse grids, we apply padding and mask strategies to the input feature map of self-attention, so that the number of divisions is always an integer.

Figure 4: Left: PSNR (dB) comparison of our ART using all dense attention blocks (DAB), using all sparse attention blocks (SAB), and using alternating DAB and SAB. Middle: PSNR (dB) comparison of our ART using a large interval size in the sparse attention block, which is (8, 8, 8, 8, 8, 8) for the six residual groups, a medium interval size, which is (8, 8, 6, 6, 4, 4), and a small interval size, which is (4, 4, 4, 4, 4, 4). Right: PSNR (dB) comparison of SwinIR, ART-S, and ART.

Training Settings. Data augmentation is performed on the training data through horizontal flips and random rotations of 90°, 180°, and 270°. Besides, we crop the original images into 64×64 patches as the basic training inputs for image SR, 128×128 patches for image denoising, and 126×126 patches for JPEG CAR. We set the training batch size to 32 for image SR, and 8 for image denoising and JPEG CAR, in order to make a fair comparison. We choose ADAM Kingma & Ba (2015) to optimize our ART model with β1 = 0.9, β2 = 0.999, and zero weight decay.
The initial learning rate is set to 2×10^{-4} and is halved when the training iteration count reaches certain milestones. Taking image SR as an example, we train ART for 500k iterations in total and halve the learning rate when training reaches 250k, 400k, 450k, and 475k iterations, where 1k means one thousand. Our ART is implemented in PyTorch Paszke et al. (2017) with 4 NVIDIA RTX8000 GPUs.
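The optimizer and halving schedule quoted above map directly onto torch.optim; the single parameter below is a stand-in for the model, and this is a sketch of the schedule, not the authors' training loop:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

param = torch.nn.Parameter(torch.zeros(1))   # stand-in for the ART parameters
optimizer = Adam([param], lr=2e-4, betas=(0.9, 0.999), weight_decay=0)
# image SR: 500k iterations in total, halving at 250k, 400k, 450k, 475k
scheduler = MultiStepLR(optimizer,
                        milestones=[250_000, 400_000, 450_000, 475_000],
                        gamma=0.5)

optimizer.step()             # in real training this runs every iteration, after backprop
for _ in range(500_000):     # scheduler stepped once per iteration, not per epoch
    scheduler.step()
# final learning rate: 2e-4 * 0.5**4 = 1.25e-5
```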

4.2. ABLATION STUDY

For ablation experiments, we train our models for image super-resolution (×2) on the DIV2K and Flickr2K datasets. The results are evaluated on the Urban100 benchmark dataset.

Design Choices for DAB and SAB. We demonstrate the necessity of the simultaneous usage of the dense attention block (DAB) and sparse attention block (SAB) by conducting an ablation study. We set three experimental conditions: using 6 DABs, using 6 SABs, and using 3 pairs of alternating DAB and SAB. We keep the rest of the experimental setup the same and train all models for 100k iterations. The experimental results are shown in Fig. 4 (Left). As we can see, only using DAB or SAB suffers from poor performance, because the model lacks either the global or the local receptive field. On the other hand, the structure of SAB following DAB brings higher performance. It validates that both local contextual interactions and global sparse interactions are important for improving the representation ability of the Transformer by obtaining retractable attention on the input feature.

Impact of Interval Size. The interval size in the sparse attention block has a vital impact on the performance of our ART. In fact, if the interval size is set to 1, sparse attention degenerates to full attention. Generally, a smaller interval means wider receptive fields but higher computational cost. We compare the experimental results under different interval settings in Fig. 4 (Middle).

Quantitative Comparisons. Table 5 shows the PSNR/SSIM comparisons of our ART with existing state-of-the-art methods. We can see that our proposed method achieves the best performance. Even better results are achieved by ART+ using self-ensemble. These results indicate that our ART also performs outstandingly when solving image compression artifact reduction problems.

5. CONCLUSION

In this work, we propose the Attention Retractable Transformer (ART) for image restoration, which offers two types of self-attention blocks to enhance the representation ability of the Transformer. Most previous Transformer backbones for image restoration mainly utilize dense attention modules that limit self-attention computation within non-overlapping regions, and thus suffer from restricted receptive fields. Without introducing additional computational cost, we employ a sparse attention mechanism to enable tokens from sparse areas of the image to interact with each other. In practice, the alternating application of dense and sparse attention modules provides retractable attention for the model and brings promising improvement. Experiments on image SR, denoising, and JPEG CAR tasks validate that our method achieves state-of-the-art results on various benchmark datasets both quantitatively and visually. In future work, we will apply our proposed method to more image restoration tasks, such as image deraining, deblurring, and dehazing. We will further explore the potential of sparse attention in solving low-level vision problems.



1 https://github.com/gladzhang/ART



Various models have been designed Kim et al. (2016b); Zhang et al. (2020; 2021b) for image restoration. SRCNN Dong et al. (2014) first introduced deep CNNs into image SR. Then several representative works utilized residual learning (e.g., EDSR Lim et al. (2017)) and attention mechanisms (e.g., RCAN Zhang et al. (2018b)) to train very deep networks for image SR. Meanwhile, a number of methods were also proposed for image denoising, such as DnCNN Zhang et al. (2017a), RPCNN Xia & Chakrabarti (2020), and BRDNet Tian et al. (2020). These CNN-based networks have achieved remarkable performance. However, due to parameter-dependent receptive field scaling and content-independent local interactions of convolutions, CNNs have limited ability to model long-range dependencies. To overcome this limitation, recent works have begun to introduce self-attention into computer vision systems Hu et al. (2019); Ramachandran et al. (2019); Wang et al. (2020); Zhao et al. (2020). Since the Transformer has been shown to achieve state-of-the-art performance in natural language processing Vaswani et al. (2017) and high-level vision tasks Dosovitskiy et al. (2021); Touvron et al. (2021); Wang et al. (2021); Zheng et al. (2021); Chu et al. (2021), researchers have been investigating Transformer-based image restoration networks Yang et al. (2020); Wang et al. (2022b). Chen et al. proposed a pre-trained image processing Transformer named IPT Chen et al. (2021a). Liang et al. proposed a strong baseline model named SwinIR Liang et al. (2021) based on the Swin Transformer Liu et al. (2021) for image restoration. Zamir et al. also proposed an efficient Transformer model with a U-Net structure named Restormer Zamir et al. (2022).

Figure 1: (a) Dense attention and sparse attention strategies of our ART. (b) Dense attention strategy with shifted windows in SwinIR.

Despite showing outstanding performance, existing Transformer backbones for image restoration still suffer from serious defects. As we know, SwinIR Liang et al. (2021) takes advantage of the shifted window scheme to limit self-attention computation within non-overlapping windows. On the other hand, IPT Chen et al. (2021a) directly splits features into P×P patches to shrink the original feature map by P^2 times, treating each patch as a token. In short, these methods compute self-attention on shorter token sequences, and the tokens in each group are always from a dense area of the image. This is considered a dense attention strategy, which obviously causes a restricted receptive field. To address this issue, we employ a sparse attention strategy. We extract each group of tokens from a sparse area of the image to provide interactions, similar to previous studies (e.g., GG-Transformer Yu et al. (2021), MaxViT Tu et al. (2022b), CrossFormer Wang et al. (2022a)) but different from them: our sparse attention module focuses on equal-scale features, and we pay more attention to pixel-level information than semantic-level information. Since sparse attention has not yet been well explored for low-level vision problems, our proposed method bridges this gap. We further propose the Attention Retractable Transformer, named ART, for image restoration. Following RCAN Zhang et al. (2018b) and SwinIR Liang et al. (2021), we retain the residual-in-residual structure Zhang et al. (2018b) for the model architecture.


Figure 2: (a) The architecture of our proposed ART for image restoration. (b) The inner structure of two successive attention blocks, DAB and SAB, with the two attention modules D-MSA and S-MSA.

Various designs and improving techniques have been introduced into basic CNN frameworks. These techniques include but are not limited to the residual structure Kim et al. (2016a); Zhang et al. (2021a), skip connections Zhang et al. (2018b; 2020), dropout Kong et al. (2022), and attention mechanisms Dai et al. (2019); Niu et al. (2020). Recently, due to the limited ability of CNNs to model long-range dependencies, researchers have started to replace the convolution operator with pure self-attention modules for image restoration Yang et al. (2020); Liang et al. (2021); Zamir et al. (2022); Chen et al. (2021a).

As shown in Fig. 2(a), a residual connection is employed to obtain the final output in each residual group module.

Figure 3: (a) Dense attention strategy. Tokens of each group are from a dense area of the image. (b) Sparse attention strategy. Tokens of each group are from a sparse area of the image.

Restoration Module. The restoration module is applied as the last stage of the framework to obtain the reconstructed image. Image restoration tasks can be divided into two categories according to the usage of upsampling. For image super-resolution, we take advantage of the sub-pixel convolutional layer Shi et al. (2016) to upsample the final feature map F_R. Next, we use a convolutional layer to obtain the final reconstructed image I_HQ. The whole process is formulated as

I_HQ = Conv(Upsample(F_R)). (2)

For tasks without upsampling, such as image denoising, we directly use a convolutional layer to reconstruct the high-quality image. Besides, we add the original image to the last output of the restoration module for better performance. We formulate the whole process as

I_HQ = Conv(F_R) + I_LQ. (3)
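The super-resolution branch of Eq. (2) can be sketched with PyTorch's PixelShuffle, following the common sub-pixel layout of one Conv + PixelShuffle(2) per ×2 stage; sizes and names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

def make_sr_head(dim=32, scale=4, out_ch=3):
    """Sub-pixel restoration head: I_HQ = Conv(Upsample(F_R)). Each x2 stage maps
    C -> 4C channels and pixel-shuffles them into 2x spatial resolution. Assumes
    the scale is a power of two; x3 SR would use a single PixelShuffle(3) stage."""
    layers, s = [], scale
    while s > 1:
        layers += [nn.Conv2d(dim, 4 * dim, 3, 1, 1), nn.PixelShuffle(2)]
        s //= 2
    layers.append(nn.Conv2d(dim, out_ch, 3, 1, 1))   # final conv to RGB
    return nn.Sequential(*layers)

f_r = torch.randn(1, 32, 8, 8)            # stand-in for the final feature map F_R
print(make_sr_head(scale=4)(f_r).shape)   # -> torch.Size([1, 3, 32, 32])
```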

L_2 loss Dong et al. (2016); Sajjadi et al. (2017); Tai et al. (2017), L_1 loss Lai et al. (2017); Zhang et al. (2020), and Charbonnier loss Charbonnier et al. (1994). Same as previous works Zhang et al. (2018b); Liang et al. (2021), we utilize the L_1 loss for image super-resolution (SR) and the Charbonnier loss for image denoising and compression artifact reduction. For image SR, the goal of training ART is to minimize the L_1 loss function, which is formulated as

(1) Different tasks. GG-Transformer Yu et al. (2021), MaxViT Tu et al. (2022b), and CrossFormer Wang et al. (2022a) are proposed to solve high-level vision problems. Our ART is the only one to employ sparse attention in low-level vision fields.

4.1. EXPERIMENTAL SETTINGS

Data and Evaluation. We conduct experiments on three image restoration tasks: image SR, denoising, and JPEG compression artifact reduction (CAR). For image SR, following previous works Zhang et al. (2018b); Haris et al. (2018), we use DIV2K Timofte et al. (2017) and Flickr2K Lim et al. (2017) as training data, and Set5 Bevilacqua et al. (2012), Set14 Zeyde et al. (2010), B100 Martin et al. (2001), Urban100 Huang et al. (2015), and Manga109 Matsui et al. (2017) as test data. For image denoising and JPEG CAR, same as SwinIR Liang et al. (2021), we use DIV2K, Flickr2K, BSD500 Arbelaez et al. (2010), and WED Ma et al. (2016) as training data. We use BSD68 Martin et al. (2001), Kodak24 Franzen (1999), McMaster Zhang et al. (2011), and Urban100 as test data for image denoising. Classic5 Foi et al. (2007) and LIVE1 Sheikh et al. (2006) are the test data for JPEG CAR. Note that we crop large input images into 200×200 partitions with overlapping pixels during inference. Following Lim et al. (2017), we adopt the self-ensemble strategy to further improve the performance of our ART and name the result ART+. We evaluate experimental results with PSNR and SSIM Wang et al. (2004) values on the Y channel of images transformed to YCbCr space.
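The Y-channel PSNR used for evaluation can be sketched as follows, using the BT.601 luma transform that SR papers conventionally apply; this is an illustrative re-implementation, not the authors' exact evaluation script:

```python
import numpy as np

def psnr_y(ref, out, border=0):
    """PSNR (dB) on the Y channel of two uint8 RGB images of shape (H, W, 3).
    `border` optionally crops boundary pixels, a common practice in SR evaluation."""
    def luma(img):                      # BT.601 luma: Y in [16, 235]
        img = img.astype(np.float64)
        return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                       + 24.966 * img[..., 2]) / 255.0
    y1, y2 = luma(ref), luma(out)
    if border:
        y1 = y1[border:-border, border:-border]
        y2 = y2[border:-border, border:-border]
    mse = np.mean((y1 - y2) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```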

Figure 7: Visual comparison with challenging examples on color image denoising (σ=50).

4.5. JPEG COMPRESSION ARTIFACT REDUCTION

We compare our ART with state-of-the-art JPEG CAR methods: RNAN Zhang et al. (2019), RDN Zhang et al. (2020), DRUNet Zhang et al. (2021a), and SwinIR Liang et al. (2021). Following most recent works, we set the compression quality factors of the original images to 40, 30, and 10. We provide the PSNR and SSIM comparison results in Table 5.
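Degraded inputs at a given quality factor are typically synthesized by re-encoding the clean image as JPEG and decoding it back; a minimal sketch with Pillow (assuming Pillow is available — this is not the authors' data pipeline):

```python
import io
import numpy as np
from PIL import Image

def jpeg_pair(clean_rgb, quality=10):
    """Return (degraded, clean): re-encode a clean uint8 RGB array at JPEG
    `quality` (e.g., 10/30/40 as above) and decode it back, yielding the
    compressed input paired with its ground truth."""
    buf = io.BytesIO()
    Image.fromarray(clean_rgb).save(buf, format="JPEG", quality=quality)
    degraded = np.array(Image.open(io.BytesIO(buf.getvalue())).convert("RGB"))
    return degraded, clean_rgb
```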

• We further propose the Attention Retractable Transformer (ART) for image restoration. Our ART offers two types of self-attention blocks to obtain retractable attention on the input feature. With the alternating application of dense and sparse attention blocks, the Transformer model can capture local and global receptive fields simultaneously.


Table 1: Comparison to related works. The differences between our ART and other works.

(Middle). As we can see, a smaller interval size brings better performance at a higher computational cost.

PSNR (dB)/SSIM comparisons for image super-resolution on five benchmark datasets. We color the best and second-best results in red and blue.

Table 3: Model size comparisons (×4 SR). Output size is 3×640×640 for the Mult-Adds calculation.

Comparison of Variant Models. We provide a new version of our model for fair comparisons and name it ART-S. Different from ART, the MLP ratio in ART-S is set to 2 (4 in ART) and the interval size is set to 8. We demonstrate that ART-S has a comparable model size with SwinIR. We provide the PSNR comparison results in Fig. 4 (Right). As we can see, our ART-S achieves better performance than SwinIR. More comparative results can be found in the following experiment parts.

Color image denoising PSNR (dB) excerpt (σ = 15/25/50 on BSD68, Kodak24, McMaster, and Urban100):
IRCNN Zhang et al. (2017b): 33.86/31.16/27.86 | 34.69/32.18/28.93 | 34.58/32.18/28.91 | 33.78/31.20/27.70
FFDNet Zhang et al. (2018a): 33.87/31.21/27.96 | 34.63/32.13/28.98 | 34.66/32.35/29.18 | 33.83/31.40/28.05

Table 5: PSNR (dB)/SSIM comparisons. The best and second best results are in red and blue. Columns follow the method order of Sec. 4.5 (RNAN, RDN, DRUNet, SwinIR), followed by ART and ART+.

Classic5, q=10: —/0.8178 | 30.00/0.8188 | 30.16/0.8234 | 30.27/0.8249 | 30.27/0.8258 | 30.32/0.8263
Classic5, q=30: 33.38/0.8924 | 33.43/0.8930 | 33.59/0.8949 | 33.73/0.8961 | 33.74/0.8964 | 33.78/0.8967
Classic5, q=40: 34.27/0.9061 | 34.27/0.9061 | 34.41/0.9075 | 34.52/0.9082 | 34.55/0.9086 | 34.58/0.9089
LIVE1, q=10: 29.63/0.8239 | 29.67/0.8247 | 29.79/0.8278 | 29.86/0.8287 | 29.89/0.8300 | 29.92/0.8305
LIVE1, q=30: 33.45/0.9149 | 33.51/0.9153 | 33.59/0.9166 | 33.69/0.9174 | 33.71/0.9178 | 33.74/0.9181
LIVE1, q=40: 34.47/0.9299 | 34.51/0.9302 | 34.58/0.9312 | 34.67/0.9317 | 34.70/0.9322 | 34.73/0.9324

PSNR (dB)/SSIM comparisons. The best and second best results are in red and blue.

Visual Comparisons. The visual comparison for color image denoising of different methods is shown in Fig. 7. Our ART can preserve detailed textures and high-frequency components while removing heavy noise corruption. Compared with other methods, it performs better at restoring clean and crisp images. This demonstrates that our ART is also suitable for image denoising.

ACKNOWLEDGMENTS

This work was supported in part by NSFC grant 62141220, 61972253, U1908212, 62172276, 61972254, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, the National Natural Science Foundation of China under Grant No. 62271414, Zhejiang Provincial Natural Science Foundation of China under Grant No. LR23F010001. This work was also supported by the Shenzhen Science and Technology Project (JCYJ20200109142808034), and in part by Guangdong Special Support (2019TX05X187). Xin Yuan would like to thank Research Center for Industries of the Future (RCIF) at Westlake University for supporting this work.

REPRODUCIBILITY STATEMENT

We provide the reproducibility statement of our proposed method in this section. We introduce the model architecture and the core dense and sparse attention modules in Sec. 3, where we also give the implementation details. In Sec. 4.1, we provide the detailed experimental settings. To ensure reproducibility, we provide the source code and pre-trained models at the website 1 . Everyone can run our code to check the training and testing process according to the given instructions. The pre-trained models provided at the website can be used to verify the validity of the corresponding results. For more details, please refer to the website or the submitted supplementary materials.

APPENDIX

Retractable vs. Dense Attention. We further show a typical visual comparison with SwinIR in Fig. 6. As SwinIR mainly utilizes a dense attention strategy, it restores wrong texture structures under the influence of close patches with mainly vertical lines. However, our ART can reconstruct the right texture, thanks to the wider receptive field provided by the sparse attention strategy. Visibly, the patch is able to interact with farther patches with similar horizontal lines, so it can be reconstructed clearly. This comparison demonstrates the advantage of retractable attention and its strong ability to restore high-quality outputs.

Model Size Comparisons. Table 3 provides comparisons of the number of parameters and Mult-Adds of different networks, including existing state-of-the-art methods. We calculate the Mult-Adds assuming that the output size is 3×640×640 under ×4 image SR. Compared with previous CNN-based networks, our ART has a comparable number of parameters and Mult-Adds but achieves higher performance. Besides, we can see that our ART-S has fewer parameters and Mult-Adds than most of the compared methods. The model size of ART-S is similar to that of SwinIR. However, ART-S still achieves better performance than all compared methods except our ART. It indicates that our method achieves promising performance at an acceptable computational and memory cost.

Visual Comparisons. We also provide some challenging examples for visual comparison (×4) in Fig. 5. We can see that our ART is able to alleviate heavy blurring artifacts while restoring detailed edges and textures. Compared with other methods, ART obtains visually pleasing results by recovering more high-frequency details. This indicates that ART performs better for image SR.

4.4. IMAGE DENOISING

We show color image denoising results to compare our ART with representative methods in Tab. 

