ACCURATE IMAGE RESTORATION WITH ATTENTION RETRACTABLE TRANSFORMER

Abstract

Recently, Transformer-based image restoration networks have achieved promising improvements over convolutional neural networks due to parameter-independent global interactions. To lower computational cost, existing works generally limit self-attention computation within non-overlapping windows. However, each group of tokens is always from a dense area of the image. This is considered a dense attention strategy, since the interactions of tokens are restrained within dense regions. Obviously, this strategy could result in restricted receptive fields. To address this issue, we propose the Attention Retractable Transformer (ART) for image restoration, which presents both dense and sparse attention modules in the network. The sparse attention module allows tokens from sparse areas to interact and thus provides a wider receptive field. Furthermore, the alternating application of dense and sparse attention modules greatly enhances the representation ability of the Transformer while providing retractable attention on the input image. We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks. Experimental results validate that our proposed ART outperforms state-of-the-art methods on various benchmark datasets both quantitatively and visually. We also provide code and models at https://github.com/gladzhang/ART.

1. INTRODUCTION

Image restoration aims to recover the high-quality image from its low-quality counterpart and includes a series of computer vision applications, such as image super-resolution (SR) and denoising. It is an ill-posed inverse problem since there is a huge number of candidates for any original input. Recently, deep convolutional neural networks (CNNs) have been investigated to design various models Kim et al. (2016b); Zhang et al. (2020; 2021b; 2022) for image restoration. SRCNN Dong et al. (2014) firstly introduced deep CNN into image SR. Then several representative works utilized residual learning (e.g., EDSR Lim et al. (2017)) and attention mechanisms (e.g., RCAN Zhang et al. (2018b)) to train very deep networks for image SR. Meanwhile, a number of methods were also proposed for image denoising, such as DnCNN Zhang et al. (2017a), RPCNN Xia & Chakrabarti (2020), and BRDNet Tian et al. (2020). These CNN-based networks have achieved remarkable performance. However, due to parameter-dependent receptive field scaling and content-independent local interactions of convolutions, CNNs have limited ability to model long-range dependencies.

To overcome this limitation, recent works have begun to introduce self-attention into computer vision systems Hu et al. (2019); Ramachandran et al. (2019); Wang et al. (2020); Zhao et al. (2020). Since Transformer has been shown to achieve state-of-the-art performance in natural language processing Vaswani et al. (2017) and high-level vision tasks Dosovitskiy et al. (2021); Touvron et al. (2021); Wang et al. (2021); Zheng et al. (2021); Chu et al. (2021), researchers have been investigating Transformer-based image restoration networks Yang et al. (2020); Wang et al. (2022b). Chen et al. proposed a pre-trained image processing Transformer named IPT Chen et al. (2021a). Liang et al. proposed a strong baseline model named SwinIR Liang et al. (2021) based on Swin Transformer Liu et al. (2021) for image restoration. Zamir et al. also proposed an efficient Transformer model using a U-net structure named Restormer Zamir et al. (2022).

Figure 1: (a) Dense attention and sparse attention strategies of our ART. (b) Dense attention strategy with shifted window of SwinIR.

Despite showing outstanding performance, existing Transformer backbones for image restoration still suffer from serious defects. SwinIR Liang et al. (2021) takes advantage of a shifted window scheme to limit self-attention computation to non-overlapping windows. On the other hand, IPT Chen et al. (2021a) directly splits features into P×P patches, shrinking the original feature map by a factor of P^2 and treating each patch as a token. In short, these methods compute self-attention with shorter token sequences, and the tokens in each group always come from a dense area of the image. This is considered a dense attention strategy, which obviously causes a restricted receptive field. To address this issue, we employ a sparse attention strategy: each group of tokens is extracted from a sparse area of the image so that distant tokens can interact. Related ideas appear in previous studies (e.g., GG-Transformer Yu et al. (2021), MaxViT Tu et al. (2022b), CrossFormer Wang et al. (2022a)), but our design differs from them: our sparse attention module operates on equal-scale features, and we pay more attention to pixel-level information than semantic-level information. Since sparse attention has not been well explored for low-level vision problems, our method bridges this gap.

We further propose the Attention Retractable Transformer (ART) for image restoration. Following RCAN Zhang et al. (2018b) and SwinIR Liang et al. (2021), we retain the residual-in-residual structure Zhang et al. (2018b) for the model architecture. Based on the joint dense and sparse attention strategies, we design two types of self-attention blocks. We utilize fixed non-overlapping local windows to obtain tokens for the first block, named dense attention block (DAB), and sparse grids to obtain tokens for the second block, named sparse attention block (SAB). To better understand the difference between our work and SwinIR, we show a visual comparison in Fig. 1. As we can see, the image is divided into four groups, and the tokens in each group interact with each other. Visibly, a token in our sparse attention block can learn relationships with farther tokens, while one in the dense attention block of SwinIR cannot. At the same computational cost, the sparse attention block has a stronger ability to compensate for the lack of global information. We treat our dense and sparse attention blocks as successive ones and apply them to extract deep features. In practice, the alternating application of DAB and SAB provides retractable attention that lets the model capture both local and global receptive fields.
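To make the two token groupings concrete, below is a minimal PyTorch sketch of how dense windows and sparse grids could partition a feature map into attention groups. The function names dense_partition and sparse_partition, the window size win, the interval itv, and the tensor layout are our own illustrative assumptions, not the released ART implementation.

    import torch

    def dense_partition(x, win):
        # x: (B, H, W, C) feature map; win: window size, assumed to divide H and W.
        # Tokens inside each non-overlapping win x win window form one group,
        # so every group comes from a dense, contiguous area of the image.
        b, h, w, c = x.shape
        x = x.view(b, h // win, win, w // win, win, c)
        x = x.permute(0, 1, 3, 2, 4, 5)        # (B, H/win, W/win, win, win, C)
        return x.reshape(-1, win * win, c)     # one group of win*win tokens per window

    def sparse_partition(x, itv):
        # x: (B, H, W, C); itv: sampling interval, assumed to divide H and W.
        # Tokens that are itv pixels apart fall into the same group, so one
        # group spans the whole image and attention gains a wider receptive field.
        b, h, w, c = x.shape
        x = x.view(b, h // itv, itv, w // itv, itv, c)
        x = x.permute(0, 2, 4, 1, 3, 5)        # (B, itv, itv, H/itv, W/itv, C)
        return x.reshape(-1, (h // itv) * (w // itv), c)

For a 64×64 feature map with win = itv = 8, both routines yield 64 groups of 64 tokens each, so the self-attention cost is identical; the difference is that a dense group covers one contiguous 8×8 patch while a sparse group is spread over the entire 64×64 area.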

Our main contributions can be summarized as follows:

• We propose the sparse attention to compensate for the defect of mainly using dense attention in existing Transformer-based image restoration networks. The interactions among tokens extracted from a sparse area of an image bring a wider receptive field to the module.

• We further propose the Attention Retractable Transformer (ART) for image restoration. Our ART offers two types of self-attention blocks to obtain retractable attention on the input feature. With the alternating application of dense and sparse attention blocks, the Transformer model can capture local and global receptive fields simultaneously.

• We employ ART to train an effective Transformer-based network. We conduct extensive experiments on three image restoration tasks: image super-resolution, denoising, and JPEG compression artifact reduction. Our method achieves state-of-the-art performance.
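As a structural illustration of the alternation described above, the following is a hypothetical sketch; AttnBlock is a stand-in placeholder for a full Transformer block (the grouping from the earlier sketch plus attention and an MLP), and ResidualGroup and depth=4 are our own illustrative names and values, not the paper's configuration.

    import torch.nn as nn

    class AttnBlock(nn.Module):
        # Placeholder Transformer block; `sparse` would only switch whether
        # tokens are grouped by dense windows (DAB) or a sparse grid (SAB).
        def __init__(self, dim, sparse):
            super().__init__()
            self.sparse = sparse
            self.body = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.body(x)   # residual connection inside the block

    class ResidualGroup(nn.Module):
        # Alternate dense (even index) and sparse (odd index) blocks, and wrap
        # the whole group in another skip connection ("residual in residual").
        def __init__(self, dim, depth=4):
            super().__init__()
            self.blocks = nn.Sequential(
                *[AttnBlock(dim, sparse=bool(k % 2)) for k in range(depth)]
            )

        def forward(self, x):
            return x + self.blocks(x)

Under this arrangement, a token alternately attends within its local window and across the whole image, which realizes the retractable local/global behavior described in the contributions.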

2. RELATED WORK

With the rapid development of CNN, numerous works based on CNN have been proposed to solve image restoration problems Anwar & Barnes (2020); Dudhane et al. (2022); Zamir et al. (2020; 2021); Li et al. (2022); Chen et al. (2021b) and have achieved superior performance over conventional restoration approaches Timofte et al. (2013); Michaeli & Irani (2013); He et al. (2010). The pioneering work SRCNN Dong et al. (2014) was firstly proposed for image SR. DnCNN Zhang et al. (2017a) was a representative image denoising method. Following these works, various model designs have been proposed.

