ACCURATE IMAGE RESTORATION WITH ATTENTION RETRACTABLE TRANSFORMER

Abstract

Recently, Transformer-based image restoration networks have achieved promising improvements over convolutional neural networks due to parameter-independent global interactions. To lower computational cost, existing works generally limit self-attention computation to non-overlapping windows. However, each group of tokens is always drawn from a dense area of the image. This can be regarded as a dense attention strategy, since the interactions of tokens are restrained to dense regions. Such a strategy inevitably leads to restricted receptive fields. To address this issue, we propose the Attention Retractable Transformer (ART) for image restoration, which incorporates both dense and sparse attention modules in the network. The sparse attention module allows tokens from sparse areas to interact and thus provides a wider receptive field. Furthermore, the alternating application of dense and sparse attention modules greatly enhances the representation ability of the Transformer while providing retractable attention on the input image. We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks. Experimental results validate that our proposed ART outperforms state-of-the-art methods on various benchmark datasets both quantitatively and visually. We also provide code and models at https://github.com/gladzhang/ART.

1. INTRODUCTION

Image restoration aims to recover a high-quality image from its low-quality counterpart and covers a series of computer vision applications, such as image super-resolution (SR) and denoising. It is an ill-posed inverse problem since there is a huge number of candidate solutions for any given input. Recently, deep convolutional neural networks (CNNs) have been investigated to design various models Kim et al. (2016b); Zhang et al. (2020; 2021b; 2022) and have achieved state-of-the-art results on several image restoration tasks. In contrast, even higher performance can be achieved when using Transformers.



Numerous CNN-based models have been proposed for image restoration. SRCNN Dong et al. (2014) first introduced deep CNNs into image SR. Subsequently, several representative works utilized residual learning (e.g., EDSR Lim et al. (2017)) and attention mechanisms (e.g., RCAN Zhang et al. (2018b)) to train very deep networks for image SR. Meanwhile, a number of methods were also proposed for image denoising, such as DnCNN Zhang et al. (2017a), RPCNN Xia & Chakrabarti (2020), and BRDNet Tian et al. (2020). These CNN-based networks have achieved remarkable performance. However, due to the parameter-dependent scaling of the receptive field and the content-independent local interactions of convolutions, CNNs have limited ability to model long-range dependencies. To overcome this limitation, recent works have begun to introduce self-attention into computer vision systems Hu et al. (2019); Ramachandran et al. (2019); Wang et al. (2020); Zhao et al. (2020). Since the Transformer has been shown to achieve state-of-the-art performance in natural language processing Vaswani et al. (2017) and high-level vision tasks Dosovitskiy et al. (2021); Touvron et al. (2021); Wang et al. (2021); Zheng et al. (2021); Chu et al. (2021), researchers have been investigating Transformer-based image restoration networks Yang et al. (2020); Wang et al. (2022b). Chen et al. proposed a pre-trained image processing Transformer named IPT Chen et al. (2021a). Liang et al. proposed a strong baseline model named SwinIR Liang et al. (2021), based on the Swin Transformer Liu et al. (2021), for image restoration. Zamir et al. also proposed an efficient Transformer model with a U-net structure named Restormer Zamir et al. (
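To make the contrast between window-based (dense) attention and the sparse alternative concrete, the two token-grouping strategies can be sketched as follows. This is a simplified NumPy illustration of the grouping idea only, not the authors' implementation; the function names and the toy grid are our own:

```python
import numpy as np

def dense_groups(x, w):
    """Partition an (H, W) token grid into non-overlapping w x w windows
    (the dense strategy): each attention group holds spatially adjacent tokens."""
    H, W = x.shape
    return (x.reshape(H // w, w, W // w, w)
             .transpose(0, 2, 1, 3)
             .reshape(-1, w * w))

def sparse_groups(x, s):
    """Gather tokens at stride s across the whole grid (the sparse strategy):
    each attention group spans the image, widening the receptive field."""
    H, W = x.shape
    return (x.reshape(H // s, s, W // s, s)
             .transpose(1, 3, 0, 2)
             .reshape(-1, (H // s) * (W // s)))

# A 4x4 grid of token indices, with window size / stride 2.
tokens = np.arange(16).reshape(4, 4)
print(dense_groups(tokens, 2))   # first group: [0, 1, 4, 5] (one local window)
print(sparse_groups(tokens, 2))  # first group: [0, 2, 8, 10] (strided over the grid)
```

Self-attention is then computed within each group in both cases; only the membership of the groups differs, which is why alternating the two strategies lets local detail and long-range context both be captured at the same per-group cost.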

