DISCOVERING DISTINCTIVE "SEMANTICS" IN SUPER-RESOLUTION NETWORKS

Abstract

Figure 1: Distributions of the deep representations of classification and super-resolution networks. For classification networks, the semantics of the deep feature representations are artificially predefined according to the training data (category labels). For SR networks, however, the learned deep representations carry a different kind of "semantics". During training, the SR networks are only provided with downsampled clean LR images; there is no supervision signal related to image degradation information. Surprisingly, we find that SR networks' deep representations are spontaneously discriminative to different degradations. Notably, not every SR network has such a property. In Sec. 4.3, we reveal two factors that facilitate SR networks to extract such degradation-related representations: adversarial learning and global residual.

1. INTRODUCTION

The emergence of deep convolutional neural networks (CNNs) has given birth to a large number of new solutions to low-level vision tasks (Dong et al., 2014; Zhang et al., 2017). Among this progress, image super-resolution (SR) has enjoyed a great performance leap. Compared with traditional methods (e.g., interpolation (Keys, 1981) and sparse coding (Yang et al., 2008)), SR networks achieve better performance with improved efficiency.

However, even though we have benefited a lot from powerful CNNs, we know little about what happens inside SR networks and what actually distinguishes them from traditional approaches. Does the performance gain merely come from more complex mapping functions? Or is there something different inside SR networks, akin to the discriminative capability of classification networks? On the other hand, as a classic regression task, SR is expected to perform a continuous mapping from low-resolution (LR) to high-resolution (HR) images. It is generally a local operation without consideration of the global context. But with the introduction of GAN-based models (Ledig et al., 2017; Wang et al., 2018), more delicate SR textures can be generated. It seems that the network has learned some kind of semantics, which is beyond our common perception of regression tasks. We may then raise the questions: are there any "semantics" in SR networks? If yes, do these semantics differ from those in classification networks? Existing literature cannot answer these questions, as there is little research on interpreting low-level vision deep models. Nevertheless, discovering the semantics in SR networks is of great importance: it can not only help us further understand the underlying working mechanisms, but also guide us to design better networks and evaluation algorithms. In this study, we give affirmative answers to the above questions by unfolding the semantics hidden in super-resolution networks.
Specifically, different from the artificially predefined semantics associated with object classes in high-level vision, semantics in SR networks are distinct in terms of image degradation instead of image content. Accordingly, we name such semantics deep degradation representations (DDR). More interestingly, such degradation-related semantics exist spontaneously, without any predefined labels. We reveal that a well-trained deep SR network is naturally a good descriptor of degradation information. Notably, the semantics in this paper have different implications from those in high-level vision. Previously, researchers have disclosed the hierarchical nature of classification networks (Zeiler & Fergus, 2014; Gu et al., 2018): as the layers deepen, the learned features respond more to abstract high-level patterns (e.g., faces and legs), showing stronger discriminability to object categories (see Fig. 4). However, similar research in low-level vision is absent, since there are no predefined semantic labels. In this paper, we reveal the differences in deep "semantics" between classification and SR networks, as illustrated in Fig. 1. Our observation stems from a representative blind SR method, CinCGAN (Yuan et al., 2018), and we further extend it to more common SR networks, SRResNet and SRGAN (Ledig et al., 2017). We also reveal more interesting phenomena to help interpret the semantics, including the analogy to classification networks and the influential factors for extracting DDR. Moreover, we improve the results of several tasks by exploiting DDR. We believe our findings could lay the groundwork for the interpretability of SR networks, and inspire more exploration of the mechanisms of low-level vision deep models. Contributions. 1) We have successfully discovered the "semantics" in SR networks, denoted as deep degradation representations (DDR).
Through in-depth analysis, we also find that global residual learning and adversarial learning can facilitate the SR network to extract such degradation-related representations. 2) We reveal the differences in deep representations between classification and SR networks for the first time. This further expands our knowledge of the deep representations of high- and low-level vision models. 3) We apply our findings to several fundamental tasks, including distortion identification, blind SR and generalization evaluation, and achieve very appealing results.

2. RELATED WORK

Super-resolution. Super-resolution (SR) is a fundamental task in low-level vision, which aims to reconstruct the high-resolution (HR) image from the corresponding low-resolution (LR) counterpart. SRCNN (Dong et al., 2014) is the first CNN-based method for SR. Since then, a large number of deep-learning-based methods have been developed (Dong et al., 2016; Lim et al., 2017; Zhang et al., 2018b; Ledig et al., 2017; Zhang et al., 2019). Generally, current CNN-based SR methods can be categorized into two groups. One is MSE-based methods, which target minimizing the distortion (e.g., mean square error) between the ground-truth HR image and the super-resolved image to yield high PSNR values, such as SRCNN (Dong et al., 2014), VDSR (Kim et al., 2016), EDSR (Lim et al., 2017), RCAN (Zhang et al., 2018b), SAN (Dai et al., 2019), etc. The other is GAN-based methods, which incorporate a generative adversarial network (GAN) and perceptual loss (Johnson et al., 2016) to obtain perceptually pleasing results, such as SRGAN (Ledig et al., 2017), ESRGAN (Wang et al., 2018), RankSRGAN (Zhang et al., 2019) and SROBB (Rad et al., 2019). Recently, blind SR has attracted more and more attention (Gu et al., 2019; Bell-Kligler et al., 2019; Luo et al., 2020; Wang et al., 2021), which aims to solve SR with unknown real-world degradations.

Figure 2: CinCGAN (Yuan et al., 2018) is trained on the DIV2K-mild dataset in an unpaired manner. If the input image conforms to the training data distribution, CinCGAN generates better restoration results than BM3D (a). Otherwise, it tends to ignore the unseen degradation types (b)&(c). In contrast, the traditional method BM3D (Dabov et al., 2007) has stable performance and similar denoising effects on all input images, regardless of the input degradation types. Zoom in for the best view.
A comprehensive survey of blind SR (Liu et al., 2021) summarizes existing methods. We regard SR as a representative research object and study its deep semantic representations; the study can also offer inspiration for other low-level vision tasks. Network interpretability. At present, most existing works on neural network interpretability focus on high-level vision tasks, especially image classification. Zhang et al. (2020) systematically reviewed the existing literature on network interpretability and proposed a novel taxonomy to categorize it. Here we only discuss several classic works. By adopting deconvolutional networks (Zeiler et al., 2010), Zeiler et al. (Zeiler & Fergus, 2014) projected the downsampled low-resolution feature activations back to the input pixel space, and then performed a sensitivity analysis to reveal which parts of the image are important for classification. Simonyan et al. (Simonyan et al., 2013) generated a saliency map from the gradients through a single backpropagation pass. Based on class activation maps (CAM) (Zhou et al., 2016), Selvaraju et al. (Selvaraju et al., 2017) proposed Grad-CAM (gradient-weighted CAM) to produce a coarse-grained attribution map of the important regions in the image, which is broadly applicable to any CNN-based architecture. For more information about the network interpretability literature, please refer to the survey paper (Zhang et al., 2020). For low-level vision tasks, however, similar research is rare. Recently, the local attribution map (LAM) (Gu & Dong, 2021) was proposed to interpret super-resolution networks; it can localize the input features that influence the network outputs. Besides, Wang et al. (Wang et al., 2020b) presented pioneering work that bridges the representation relationship between high- and low-level vision.
They learned the mapping between deep representations of low- and high-quality images, and leveraged it as a deep degradation prior (DDP) for low-quality image classification. Inspired by these previous works, we interpret SR networks from a new perspective: we dive into their deep feature representations and discover the "semantics" of SR networks. More background knowledge is described in the supplementary file.

3. MOTIVATION

To begin with, we present an interesting phenomenon, which drives us to explore the deep representations of SR networks. It is well known that SR networks are superior to traditional methods in specific scenarios, but inferior in generalization ability. In blind SR, the degradation types of the input test images are unknown. Traditional methods treat different images equally without distinguishing degradation types, so their performance is generally stable and predictable. How about SR networks, especially those designed for blind SR? As shown in Fig. 2, CinCGAN (Yuan et al., 2018) produces good restoration results on inputs that conform to its training data distribution, but tends to ignore unseen degradation types. In other words, the network seems to figure out the specific degradation types within its training data distribution, and a distribution mismatch may make the network "turn off" its ability. This makes the performance of CinCGAN unstable and unpredictable. For comparison, we process the above three types of degraded images with a traditional denoising method, BM3D (Dabov et al., 2007). The visual results show that BM3D has an obvious and stable denoising effect on all degradation types. Although the results of BM3D may be mediocre (the image textures are largely over-smoothed), it does take effect on every input image. This observation reveals a significant discrepancy between traditional methods and SR networks, and indicates that the deep network has learned more than a regression function, since it demonstrates the ability to distinguish among different degradation types. Inspired by this observation, we try to find the semantics hidden in SR networks.

3.1. ANALYSIS METHOD

To analyze the deep feature representations of SR networks, we employ t-SNE (van der Maaten & Hinton, 2008) for dimensionality reduction. This algorithm is commonly used in manifold learning, and it has been successfully applied in previous works (Donahue et al., 2014; Mnih et al., 2015; Wen et al., 2016; Zahavy et al., 2016; Veličković et al., 2017; Wang et al., 2020b; Huang et al., 2020) for feature projection and visualization.
In our experiments, we first reduce the dimensionality of the feature maps to a reasonable amount (50 in this paper) using PCA (Hotelling, 1933), then apply t-SNE to project the 50-dimensional representations into two-dimensional space, and finally visualize the results in a scatterplot. Furthermore, we introduce the Calinski-Harabasz Index (CHI) (Caliński & Harabasz, 1974) to quantitatively evaluate the distributions of the visualized datapoints. The CHI score is higher when clusters are well separated, which indicates stronger semantic discriminability.
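As a sketch of this projection pipeline, the 50-component PCA step can be written directly with NumPy's SVD; the t-SNE embedding and CHI score would typically come from a library such as scikit-learn, which we only note in comments to keep the sketch dependency-light. The feature shapes and names are illustrative, not taken from the paper's code.

```python
import numpy as np

def pca_reduce(features: np.ndarray, n_components: int = 50) -> np.ndarray:
    """Project row-vector features onto their top principal components.

    features: (n_samples, n_dims) array of flattened feature maps.
    Returns an (n_samples, n_components) array.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data: the right singular vectors are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy stand-in for deep features of 300 images (e.g., 3 degradation types x 100).
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 4096))
reduced = pca_reduce(feats, n_components=50)
print(reduced.shape)  # (300, 50)

# The 50-D points would then be embedded into 2-D and scored, e.g.:
#   from sklearn.manifold import TSNE
#   points2d = TSNE(n_components=2).fit_transform(reduced)
#   from sklearn.metrics import calinski_harabasz_score
#   chi = calinski_harabasz_score(points2d, labels)
```

The per-component variance of the output is sorted in decreasing order, which is the usual sanity check that the projection is indeed a PCA.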

4. DIVING INTO DEEP DEGRADATION REPRESENTATIONS

What do the deep features of SR networks represent? As discussed in Sec. 3, since CinCGAN performs differently on various degradations, we compare the features generated from three testing datasets: 1) DIV2K-mild: the training and testing data used in CinCGAN, synthesized from the DIV2K (Agustsson & Timofte, 2017) dataset, containing noise, blur, pixel shifting and other degradations. 2) DIV2K-noise20: DIV2K with additive Gaussian noise (σ = 20). 3) Hollywood100: 100 images selected from the Hollywood dataset (Laptev et al., 2008), containing real-world old-film degradations. Each test dataset includes 100 images. As shown in Fig. 3, the feature representations of the classification network are clustered by color (object category), while the representations of SR networks are clustered by marker shape (degradation type), suggesting a significant difference in feature representations between classification and SR networks. Such findings can not only deepen our understanding of low-level vision models, but also help promote the development of other tasks. In Sec. 5, we apply DDR to several fundamental tasks and achieve appealing results, implying the great potential of DDR.

4.2. DIFFERENCES IN SEMANTICS BETWEEN CLASSIFICATION AND SR NETWORKS

In high-level vision, classification is one of the most representative tasks, where artificially predefined semantic labels on object classes are given as supervision. We choose ResNet18 (He et al., 2016) as the classification backbone and conduct experiments on the CIFAR10 dataset (Krizhevsky et al., 2009). We extract the forward features of each input testing image at different network layers, as described in Fig. 3 (e)-a. Fig. 4 shows that as the network deepens, the extracted feature representations form obvious discriminative clusters, i.e., the learned features become increasingly semantically discriminative. Such discriminative semantics in classification networks are coherent with the artificially predefined labels. This is an intuitive and natural observation, on which lots of representation and discriminative learning methods are based (Wen et al., 2016; Oord et al., 2018; Lee et al., 2019; Wang et al., 2020b). Further, we add blur and noise degradations to the CIFAR10 test images, and then investigate the feature representations of classification and SR networks. Note that no degradation is added to the training data. As shown in Fig. 5, after adding degradations to the test data, the deep representations obtained by the classification network (ResNet18) are still clustered by object categories, indicating that the features focus more on high-level object class information. On the contrary, the deep representations obtained by SR networks (SRResNet and SRGAN) are clustered with regard to degradation types: features of the same object category are not clustered together, while those of the same degradation type are, showing a different kind of "semantic" discriminability. This phenomenon intuitively illustrates the differences in the deep semantic representations between SR and classification networks, i.e., degradation-related semantics versus content-related semantics.
More interestingly, the "semantics" in SR networks exist naturally, because the SR networks only see clean data, without any input or labelled degradation information.

4.3. HOW DO GLOBAL RESIDUAL AND ADVERSARIAL LEARNING AFFECT THE DEEP REPRESENTATIONS?

Previously, we have elaborated on the deep degradation representations in CinCGAN, SRGAN and SRResNet. Nevertheless, we further discover that not every SR network structure has such a property. Specifically, we find two crucial factors that influence the learned representations: i) image global residual (GR), and ii) generative adversarial learning (GAN). Global Residual. We train two SRResNet networks - SRResNet (with global residual) and SRResNet-woGR (without global residual), as shown in Fig. 3. Both architectures are common in practice (Kim et al., 2016; Shi et al., 2016). The DIV2K (Agustsson & Timofte, 2017) dataset is used for training, where the LR images are bicubic-downsampled and clean. Readers can refer to the supplementary file for more details. After testing, the feature visualization analysis is shown in Fig. 6. The results show that for MSE-based SR methods, GR is essential for producing discriminative representations on degradation types. The features in "ResBlock16" of SRResNet show distinct discriminability, where the clean, blur and noise data are clustered separately. On the contrary, SRResNet-woGR shows no discriminability even in deep layers. This phenomenon reveals that GR significantly impacts the learned feature representations. It is inferred that learning the global residual removes most of the content information and makes the network concentrate more on the contained degradation. This claim is also corroborated by visualizing the feature maps in the supplementary file. Adversarial Learning. MSE-based and GAN-based methods are currently the two prevailing trends in CNN-based SR.
Previous studies only reveal that the output images of MSE-based and GAN-based methods are different, but the differences between their feature representations are rarely discussed. Since their learning mechanisms are quite different, will there be a discrepancy in their deep feature representations? We directly adopt SRResNet and SRResNet-woGR as generators and build two corresponding GAN-based models, namely SRGAN and SRGAN-woGR. After training, we perform the same test and analysis process described earlier. The results show that for GAN-based methods, the deep features are discriminative to degradation types whether GR is present or not. As shown in Fig. 7 (d)(h), the deep representations in "ResBlock16" of SRGAN-woGR are already clustered according to different degradation types. This suggests that the learned deep representations of MSE-based and GAN-based methods are dissimilar, and that adversarial learning helps the network learn features that are more informative for distinguishing image degradation rather than image content.

4.4. HOW DOES DDR EVOLVE THROUGH THE TRAINING PROCESS?

We also reveal the relationship between the model performance and DDR discriminability. We select SRResNet models at different training iterations for testing, and report the model performance together with the discriminability of their deep features at each iteration.

Under review as a conference paper at ICLR 2023

The DDR phenomenon is mainly introduced by overfitting the degradation in the training data. Specifically, since the training data (DIV2K-clean) do not contain extra degradations, the trained SR network lacks the ability to deal with unseen degradations. When fed images with degradations (e.g., noise and blur), it produces features with unprocessed noise or blurring. These patterned features naturally show strong discriminability between different degradations. As for GR, models with GR produce features that contain fewer components of the original content information: GR helps remove redundant image content information and makes the network concentrate more on degradation-related information. GAN training also enhances the high-frequency degradation information. Besides, prolonging the training iterations and deepening the network will make the network further overfit to the training data.

4.6. WHY CAN SR NETWORKS HARDLY GENERALIZE TO UNSEEN DEGRADATIONS?

Classical SR models (Dong et al., 2014; Lim et al., 2017) assume that the input LR images are generated by a fixed downsampling kernel (e.g., bicubic). However, it is difficult to apply such simple SR models to real scenarios with unknown degradations. We claim that SR and restoration networks learn to overfit the distribution of degradations, rather than the distribution of natural clean images. To verify this statement, we compare the representations of SRGAN-wGR models trained on clean data and on clean+noise data, respectively, as presented in Fig. 9.

5. APPLICATIONS AND INSPIRATIONS

Image Distortion Identification Using DDR Features. Image distortion identification (Liang et al., 2020) is an important subsidiary pretreatment for many image processing systems, especially for image quality assessment (IQA). It aims to recognize the distortion type of a distorted image, so as to facilitate downstream tasks (Mittal et al., 2012a; Gu et al., 2019; Liang et al., 2020). Previous methods usually resort to designing handcrafted features that can distinguish different degradation types (Mittal et al., 2012a;b) or training a classification model via supervised learning (Kang et al., 2014; Bosse et al., 2017; Liang et al., 2020). Since DDR is related to image degradation, it can naturally serve as an excellent prior feature for image distortion identification. To obtain DDR, we do not need any degradation information but only a well-trained SR model (trained on clean data). Following BRISQUE (Mittal et al., 2012a), we adopt the deep representations of SRGAN as input features (using PCA to reduce the original features to a 120-dimensional vector), and then use a linear SVM to classify the degradation types of the LIVE dataset (Sheikh et al., 2006). As shown in Tab. 1, compared with BRISQUE and MLLNet (Liang et al., 2020), DDR features achieve excellent results on recognizing different distortion types. More inspiringly, DDR is not obtained by any distortion-related supervision. Blind SR with DDR Guidance. To super-resolve real images with unknown degradations, many blind SR methods resort to estimating and utilizing the degradation information. For instance, IKC (Gu et al., 2019) iteratively estimates and corrects the blur kernel to refine its SR results.
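To make the distortion-identification recipe concrete, here is a toy sketch of the pipeline: PCA-reduce deep features to a compact vector, then fit a classifier on distortion labels. For a dependency-light illustration, a nearest-centroid classifier stands in for the paper's linear SVM, and random Gaussian blobs stand in for DDR features; all names, shapes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "DDR features": 3 distortion types, 60 samples each, separated blobs.
n_per_class, n_dims, n_classes = 60, 512, 3
feats = np.concatenate([
    rng.normal(loc=4.0 * c, scale=1.0, size=(n_per_class, n_dims))
    for c in range(n_classes)
])
labels = np.repeat(np.arange(n_classes), n_per_class)

# PCA to a compact vector (the paper uses 120-D; a small value suffices here).
centered = feats - feats.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:8].T

# Nearest-centroid classification (simplified stand-in for a linear SVM).
centroids = np.stack([reduced[labels == c].mean(axis=0) for c in range(n_classes)])
dists = np.linalg.norm(reduced[:, None, :] - centroids[None, :, :], axis=-1)
pred = dists.argmin(axis=1)
accuracy = (pred == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

In a real setup, `feats` would be the flattened SR-network activations for each LIVE image and the classifier would be `sklearn.svm.LinearSVC`; the point of the sketch is only the feature-reduction-then-linear-classifier shape of the method.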

6. CONCLUSIONS

In this paper, we discover the deep degradation representations in deep SR networks, which are different from those of high-level vision networks. We demonstrate that a well-trained deep SR network is naturally a good descriptor of degradation information. We reveal the differences in deep representations between classification and SR networks. We draw a series of interesting observations on the intrinsic features of deep SR networks, such as the effects of global residual and adversarial learning. Further, we apply DDR to several fundamental tasks and achieve appealing results. The exploration of DDR is of great significance and inspiration for relevant work.

A.1 BACKGROUND

To better understand the behaviors of CNNs, many efforts have been put into neural network interpretability for high-level vision. For low-level vision tasks, however, similar research is absent. The possible reasons are as follows. In high-level vision tasks, there are usually artificially predefined semantic labels/categories, so we can intuitively associate feature representations with these labels. In low-level vision tasks, however, there are no explicit predefined semantics, making it hard to map the representations into a domain that humans can make sense of. Further, high-level vision usually performs classification in a discrete target domain with distinct categories, while low-level vision aims to solve a regression problem with continuous output values. Hence, without the guidance of predefined category semantics, it is not so straightforward to interpret low-level vision networks. In this paper, we take super-resolution (SR), one of the most representative tasks in low-level vision, as the research object. Previously, it was generally thought that the features extracted from an SR network carry no specific "semantic" information, and that the network simply learns some complex non-linear functions to model the relation between network input and output.
Are the features of SR networks really lacking any semantics? Can we find any kind of "semantics" in SR networks? In this paper, we aim to answer these questions. We reveal that semantics do exist in SR networks, and we first discover and interpret the "semantics" of deep representations in SR networks. Different from high-level vision networks, such semantics relate to image degradation types and degrees. Accordingly, we designate the deep semantic representations in SR networks as deep degradation representations (DDR).

A.2 LIMITATIONS

In this paper, we only explore the deep representations of SR networks; other low-level vision networks are also worth exploring. We apply DDR to three tasks without elaborate designs in the application parts. For blind SR, we make a simple attempt to improve the model performance; the design is not optimal, and we believe there should be a more efficient and effective way to utilize DDR. For generalization evaluation, DDR can only evaluate the model generalization under constrained conditions. It shows the possibility of designing a generalization evaluation metric, but there is still a long way to go to realize this goal.

A.3 DEEP REPRESENTATIONS OF REAL-WORLD IMAGES

In the main paper, we mainly conduct experiments on synthetic degradations. The difficulty with real-world datasets is that it is hard to keep the content the same while changing the degradations. If we simply use two real-world datasets that contain different contents and different degradations, it is hard to tell whether the feature discriminability is targeted at image content or at image degradation. Hence, synthetic data at least allow us to control the variables. In addition, we find a plausible real-world dataset, Real-City100, which is proposed in the camera SR paper. The authors use iPhoneX and NikonD5500 devices to capture controllable images: by adjusting the camera focal length, each camera captures paired images with the same content but different resolutions. The low-resolution images contain real-world degradations such as real noise and real blur.

[Figure: example patches of iphoneX-HQ, iphoneX-LQ, NikonD5500-HQ and NikonD5500-LQ.]

A.4.1 FORMULATIONS

Image classification. Given an input image X, a classification network predicts Ŷ = G_CL(X), where G_CL represents the classification network and Ŷ ∈ R^C is the predicted probability vector indicating which of the C categories X belongs to. In practice, cross-entropy loss is usually adopted to train the classification network:

CE(Y, Ŷ) = -Σ_{i=1}^{C} y_i log ŷ_i,   (2)

where Y ∈ R^C is a one-hot vector representing the ground-truth class label, and ŷ_i is the i-th element of Ŷ, indicating the predicted probability that X belongs to the i-th class.
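As a quick numerical check of the cross-entropy formula, the following sketch evaluates CE for a one-hot label against two hypothetical probability vectors (the values are made up for illustration):

```python
import math

def cross_entropy(y_onehot, y_pred):
    """CE(Y, Y_hat) = -sum_i y_i * log(y_hat_i) for a one-hot label Y."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_pred) if y > 0)

# Three classes; the ground truth is class 1 (0-indexed).
y = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]  # most probability mass on the correct class
uncertain = [0.40, 0.30, 0.30]  # mass spread across classes

print(cross_entropy(y, confident))  # -log(0.9) ~ 0.105
print(cross_entropy(y, uncertain))  # -log(0.3) ~ 1.204
```

Because the label is one-hot, only the predicted probability of the true class contributes, so confident correct predictions yield a much smaller loss.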

Super-resolution.

A general image degradation process can be modeled as follows:

X = (Y ⊗ k) ↓_s + n,   (3)

where Y is the high-resolution (HR) image and ⊗ denotes the convolution operation. X is the degraded low-resolution (LR) image. There are three types of degradation in this model: blur kernel k, downsampling operation ↓_s and additive noise n. Hence, super-resolution can be regarded as a superset of other restoration tasks like denoising and deblurring. Super-resolution (SR) is the inverse problem of Eq. (3). Given the input LR image X ∈ R^{M×N}, the super-resolution network attempts to produce its HR version: Ŷ = G_SR(X), where G_SR represents the super-resolution network, Ŷ ∈ R^{sM×sN} is the predicted HR image and s is the upscaling factor. This procedure can be regarded as a typical regression task. At present, there are two groups of methods: MSE-based and GAN-based methods. The former treats SR as a reconstruction problem, utilizing pixel-wise loss such as the L2 loss to achieve high PSNR values:

L_2(Y, Ŷ) = (1 / (s^2 MN)) Σ_{i=1}^{sM} Σ_{j=1}^{sN} ||Y_{i,j} - Ŷ_{i,j}||_2^2.   (5)

However, such a loss tends to produce over-smoothed images. To generate photo-realistic SR results, the latter incorporates adversarial learning and perceptual loss to obtain better visual perception. The optimization is expressed as the following min-max problem:

min_{θ_{G_SR}} max_{θ_{D_SR}} E_{Y∼p_HR}[log D_SR(Y)] + E_{X∼p_LR}[log(1 - D_SR(G_SR(X)))].   (6)

In such adversarial learning, a discriminator D_SR is introduced to distinguish super-resolved images from real HR images. Then, the generator loss is defined as:

L_G = -log D_SR(G_SR(X)).   (7)

From these formulations, we can clearly see that image classification and image super-resolution represent two typical machine learning tasks: classification and regression. The output of the classification task is discrete, while the output of the regression task is continuous.
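The degradation model in Eq. (3) can be sketched in a few lines of NumPy: Gaussian blur via direct 2-D convolution, stride-s subsampling, and additive Gaussian noise. The kernel size, σ values and scale factor are illustrative choices, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Normalized 2-D Gaussian blur kernel k."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(hr, kernel, scale=2, noise_sigma=10.0, rng=None):
    """X = (Y conv k) downarrow_s + n, on a single-channel image in [0, 255]."""
    rng = rng or np.random.default_rng(0)
    pad = kernel.shape[0] // 2
    padded = np.pad(hr, pad, mode="reflect")
    blurred = np.zeros_like(hr, dtype=np.float64)
    # Direct 2-D convolution (correlation; equivalent for a symmetric Gaussian).
    for i in range(hr.shape[0]):
        for j in range(hr.shape[1]):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            blurred[i, j] = (window * kernel).sum()
    lr = blurred[::scale, ::scale]                  # downsampling by stride s
    lr = lr + rng.normal(0, noise_sigma, lr.shape)  # additive noise n
    return np.clip(lr, 0, 255)

hr = np.tile(np.linspace(0, 255, 64), (64, 1))  # toy 64x64 gradient "HR image"
lr = degrade(hr, gaussian_kernel(), scale=2)
print(lr.shape)  # (32, 32)
```

Setting `noise_sigma=0` and the kernel to a delta recovers plain subsampling, which is why denoising and deblurring can be viewed as special cases of this model.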

A.4.2 ARCHITECTURES

Due to the different output types, the CNN architectures of classification and super-resolution networks also differ. Generally, classification networks contain multiple downsampling layers (e.g., pooling and strided convolution) to gradually reduce the spatial resolution of feature maps. After several convolutional and downsampling layers, one or more fully-connected layers aggregate global semantic information and generate a vector containing C elements. For the output layer, the SoftMax operator is frequently used to normalize this vector into a probabilistic representation. Some renowned classification network structures include AlexNet Krizhevsky et al. (2012) and ResNet He et al. (2016). In contrast, SR networks generally maintain the spatial resolution of feature maps and adopt upsampling layers (e.g., pixel-shuffle Shi et al. (2016)) in the later stages. Thus, the spatial resolution of feature maps would increase. Another difference is that the output of the SR network is a three-channel image, rather than an abstract probability vector. Well-known SR network structures include SRCNN Dong et al. (2014), FSRCNN Dong et al. (2016), SRResNet Ledig et al. (2017), RDN Zhang et al. (2018c), RCAN Zhang et al. (2018b), etc. An intuitive comparison of classification and SR networks in CNN architecture is shown in Fig. 18. We can notice that one gradually downsamples and the other gradually upsamples, which displays the structural discrepancy between high-level and low-level vision tasks. Although there are several important architectural differences, classification networks and SR networks can share some proven effective building modules, like the skip connection He et al. (2016).

A.5 IMPLEMENTATION DETAILS

In the main paper, we conduct experiments on ResNet18 He et al. (2016) and SRResNet/SRGAN Ledig et al. (2017). We elaborate more details on the network structures and training settings here. For ResNet18, we directly adopt the network structure depicted in He et al. (2016). Cross-entropy loss (Eq. 2) is used as the loss function. The learning rate is initialized to 0.1 and decreased with a cosine annealing strategy. We apply the SGD optimizer with weight decay 5 × 10^-4. The trained model yields an accuracy of 92.86% on the CIFAR10 testing set, which consists of 10,000 images. For SRResNet-wGR/SRResNet-woGR, we stack 16 residual blocks (RB) as shown in Fig. 3 of the main paper. The residual block is the same as depicted in Wang et al. (2018), in which all the BN layers are removed. Two pixel-shuffle layers Shi et al. (2016) are utilized to conduct upsampling in the network, while the global residual branch is upsampled by bilinear interpolation. L1 loss is adopted as the loss function. The learning rate is initialized to 2 × 10^-4 and is halved at [100k, 300k, 500k, 600k] iterations. A total of 600,000 iterations are executed. For SRGAN-wGR/SRGAN-woGR, the generator is the same as SRResNet-wGR/SRResNet-woGR, and the discriminator is designed as in Ledig et al. (2017). Adversarial loss (Eq. 7) and perceptual loss Johnson et al. (2016) are combined as the loss functions, kept the same as in Ledig et al. (2017). The learning rate of both generator and discriminator is initialized to 1 × 10^-4 and is halved at [50k, 100k, 200k, 300k] iterations. A total of 600,000 iterations are executed. For all the super-resolution networks, we apply the Adam optimizer Kingma & Ba (2014) with β1 = 0.9 and β2 = 0.99. All the training LR patches are of size 128 × 128. When testing, 32 × 32 patches are fed into the networks to obtain deep features. In practice, we find that the patch size has little effect on revealing the deep degradation representations.
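The global-residual layout described above can be sketched as a minimal PyTorch module: a trunk of residual blocks without BN, pixel-shuffle upsampling, plus a bilinear global-residual branch. The block count and channel width here are toy values, not the trained configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block without BN layers, in the style noted above."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class TinySRResNetGR(nn.Module):
    """Toy SRResNet-wGR: trunk + two pixel-shuffle x2 stages + bilinear GR branch."""
    def __init__(self, ch=16, n_blocks=2, scale=4):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2),
        )
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        out = self.tail(self.up(self.body(self.head(x))))
        # Global residual branch: bilinear-upsampled input added to the output,
        # so the trunk only needs to model the residual.
        return out + F.interpolate(x, scale_factor=self.scale, mode="bilinear",
                                   align_corners=False)

net = TinySRResNetGR()
lr = torch.randn(1, 3, 32, 32)
sr = net(lr)
print(tuple(sr.shape))  # (1, 3, 128, 128)
```

Removing the `F.interpolate` term in `forward` gives the corresponding woGR variant, which is the architectural difference studied in Sec. 4.3.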
All the above models are trained on the PyTorch platform with GeForce RTX 2080 Ti GPUs. For the experiment of distortion identification, we use the aforementioned trained models to conduct inference on the LIVE dataset Sheikh et al. (2006). We crop the central 96 × 96 patch of each image and feed it into the SR networks to obtain the corresponding deep representations. Then, the deep representations of each image are reduced to a 120-dimensional vector using PCA. Afterwards, a linear SVM is adopted as the classification tail. In practice, we find that the vector dimension can be even larger for better performance. Notably, unlike previous methods, the features here are not trained on any degradation-related labels or signals; the SR networks are only trained using clean data. Nevertheless, the deep representations serve as excellent prior features for recognizing various distortion types, which is of great importance and very encouraging.

A.6 DEFINITIONS OF WD, BD AND CHI

In Sec. 3.1 of the main paper, we describe the adopted analysis method for deep feature representations. Much other literature has adopted similar approaches to interpret and visualize deep models, such as Graph Attention Networks Veličković et al. (2017), Recurrent Networks Karpathy et al. (2015), Deep Q-Networks Zahavy et al. (2016) and neural models in NLP Li et al. (2015). Most of the aforementioned works adopt t-SNE as a qualitative analysis technique. To better illustrate and quantitatively measure the semantic discriminability of deep feature representations, we take a step further and introduce several indicators, originally used to evaluate clustering performance, computed on the data structure after dimensionality reduction by t-SNE. Specifically, we adopt within-cluster dispersion (WD), between-clusters dispersion (BD) and the Calinski-Harabasz Index (CHI) Caliński & Harabasz (1974) to provide rough yet practicable quantitative measures for reference.
For K clusters, WD, BD and CHI are defined as:

WD(K) = \sum_{k=1}^{K} \sum_{i=1}^{n(k)} \lVert x_k^i - \bar{x}_k \rVert^2,

where x_k^i represents the i-th datapoint belonging to class k and \bar{x}_k is the mean of all n(k) datapoints that belong to class k. Datapoints belonging to the same class should be close to each other, and WD measures the compactness within a cluster.

BD(K) = \sum_{k=1}^{K} n(k) \lVert \bar{x}_k - \bar{x} \rVert^2,

where \bar{x} represents the mean of all datapoints. BD measures the distance between clusters. Intuitively, a larger BD value indicates stronger discriminability between different feature clusters. Given K clusters and N datapoints in total (N = \sum_k n(k)), by combining WD and BD, the CHI is formulated as:

CHI(K) = \frac{BD(K)}{WD(K)} \cdot \frac{N - K}{K - 1}.

It is the ratio of the between-clusters dispersion mean to the within-cluster dispersion mean. The CHI score is higher when clusters are dense and well separated, which relates to the standard notion of a cluster.

Table 3: Quantitative measures for the discriminability of the projected deep feature representations. We statistically report the mean value and the standard deviation of each metric. The adopted indicators well reflect the effect of feature clustering quantitatively.

t-SNE minimizes the Kullback-Leibler divergence between two distributions that measure pairwise similarities of the input high-dimensional data and of the corresponding low-dimensional points in the embedding. Further, t-SNE involves a non-convex optimization performed by gradient descent, so several optimization parameters need to be chosen, such as the perplexity, the number of iterations and the learning rate. Hence, the solutions may differ due to the choice of optimization parameters and the initial random states. In this paper, we used exactly the same optimization procedure for all experiments.
Moreover, we conduct extensive experiments using different parameters and demonstrate that the quality of the optima does not vary much from run to run, which is also emphasized in the t-SNE paper. To make the quantitative analysis more statistically solid, for each projection process we run t-SNE five times and report the average and standard deviation of every metric.
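The three indicators above can be computed directly from the projected datapoints. A minimal sketch following the definitions in this section (the function name is ours):

```python
import numpy as np

def wd_bd_chi(points, labels):
    """Within-cluster dispersion (WD), between-clusters dispersion (BD) and the
    Calinski-Harabasz Index (CHI) for projected datapoints with known labels."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    K, N = len(classes), len(points)
    overall_mean = points.mean(axis=0)
    wd = bd = 0.0
    for k in classes:
        cluster = points[labels == k]
        centroid = cluster.mean(axis=0)
        wd += ((cluster - centroid) ** 2).sum()          # compactness within clusters
        bd += len(cluster) * ((centroid - overall_mean) ** 2).sum()  # spread between clusters
    chi = (bd / wd) * (N - K) / (K - 1)
    return wd, bd, chi
```

The same CHI value is available as `sklearn.metrics.calinski_harabasz_score`; dense, well-separated clusters yield a larger CHI.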

A.7 FROM SHALLOW TO DEEP SR NETWORKS

In the main paper, we reveal that a shallow 3-layer SRCNN Dong et al. (2014) does not manifest representational discriminability on degradation types. Thus, we hypothesize that only deep SR networks possess such degradation-related semantics. To verify this statement, we gradually deepen SRCNN and observe how its deep representations change. We construct SRCNN models with different depths, from a shallow 3 layers to 13 layers. We train these models on DIV2K-clean data (inputs are only downsampled, without other degradations) and test them on classical SR benchmarks. As shown in Tab. 4, the model achieves better SR performance as the network depth increases, suggesting that deeper networks and more parameters lead to greater learning capacity. On the other hand, the deep representations also gradually manifest discriminability on degradation types, as depicted in Fig. 14. When the model has only 3 layers, its representations cannot distinguish different degradation types. However, when we increase the depth to 13 layers, the deep representations begin to show discriminability on degradation types, with the CHI score increasing to 168.12.

Table 7: Quantitative evaluations (CHI). There appears to be a spectrum (continuous transition) for the discriminability of DDR.

How about the same degradation type but with different degradation degrees? Will the deep representations still be discriminative to them? To explore this question, more experiments and analysis are performed. We test super-resolution networks on degraded images with different noise degrees and blur degrees. The results are depicted in Table 7 and Fig. 17. It can be seen that the deep degradation representations are discriminative not only to cross-degradation (different degradation types) but also to intra-degradation (same degradation type but different degrees). This suggests that even for the same type of degradation, different degradation degrees will also cause significant differences in features.
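The degraded test inputs used here follow the DIV2K-noise/DIV2K-blur recipes (additive Gaussian noise of level σ, Gaussian blur of a given width). A minimal grayscale sketch of such synthesis; the kernel size, edge padding and function names are our choices, not the paper's exact pipeline:

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng=None):
    """DIV2K-noise style: additive Gaussian noise with level sigma (0-255 range)."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255)

def gaussian_blur(img, width, ksize=21):
    """DIV2K-blur style: separable Gaussian blur with kernel width `width`."""
    ax = np.arange(ksize) - ksize // 2
    kernel = np.exp(-(ax ** 2) / (2 * width ** 2))
    kernel /= kernel.sum()
    pad = ksize // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    # separable filtering: horizontal then vertical 1-D convolution
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, tmp)
```

For instance, `add_gaussian_noise(lr, 20)` corresponds to noise20 and `gaussian_blur(lr, 4)` to blur4.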
The greater the difference between degradation degrees, the stronger the discriminability of the feature representations. This reflects another difference between the representation semantics of super-resolution and classification networks. For classification, the semantic discriminability of feature representations is generally discrete, because the semantics are associated with discrete object categories. Nevertheless, there appears to be a spectrum (continuous transition) for the discriminability of the deep degradation representations, i.e., the discriminability has a monotonic relationship with the divergence between degradation types and degrees. For example, the degradation difference between noise levels 10 and 20 is less distinct than that between noise levels 10 and 30, and the discriminability of the feature representations is correspondingly smaller. From Table 7, there are several notable observations. 1) Compared with blur degradation, noise degradation is easier to discriminate. It is difficult to obtain deep representations that are strongly discriminative to different blur levels; even for the GAN-based method, the global residual (GR) is indispensable for obtaining representations that can discriminate different blur levels. 2) The representations obtained by the GAN-based method have more discriminative semantics with respect to degradation types and degrees than those of the MSE-based method. 3) Again, the global residual strengthens the representation discriminability for degradations.

A.10 EXPLORATION OF NETWORK STRUCTURE

In the main paper, we choose ResNet18 He et al. (2016) and SRResNet/SRGAN Ledig et al. (2017) as the backbones of the classification and SR networks, respectively. In order to eliminate the influence of different network structures, we design a unified backbone framework, which is composed of the

(Figure 20 panels: DIV2K-clean, DIV2K-blur, DIV2K-noise)

same basic building modules but connected with different tails for downsampling and upsampling, to conduct classification and super-resolution respectively. The unified architecture is shown in Fig. 18. Differing from the residual block in the main paper, we adopt a residual channel attention layer as the basic building block, inspired by SENet Hu et al. (2018) and RCAN Zhang et al. (2018b). For classification, the network tail consists of three max-pooling layers and a fully connected layer; for super-resolution, the network tail consists of two pixel-shuffle layers that upsample the feature maps. According to the conclusions in the main paper, we adopt the global residual (GR) in the network design to obtain deep degradation representations (DDR). Except for the network structure, all the training protocols are kept the same as in the main paper; the training details are the same as depicted in Sec. A.5. After training, the unified backbone framework for classification yields an accuracy of 92.08% on the CIFAR10 testing set. The experimental results are shown in Fig. 19 and Fig. 20.

Figure 20: Projected feature representations extracted from the unified backbone framework (super-resolution) using t-SNE.

Following prior work, we also take advantage of the superior manifold-learning capability of t-SNE for feature projection. In this section, we further explain the effectiveness of adopting t-SNE and why we choose to project high-dimensional features into two-dimensional datapoints. We first compare the projection results of PCA and t-SNE. From the results shown in Fig. 21, it can be observed that the features projected by t-SNE are successfully clustered according to the semantic labels, while the features projected by PCA are not well separated. This is because PCA is a linear dimensionality-reduction method that cannot handle the complex non-linear data produced by neural networks. Thus, t-SNE is a better choice for conducting dimensionality reduction on CNN features.
This suggests the effectiveness of t-SNE for the purpose of feature projection. Note that we do not claim t-SNE is the optimal choice for dimensionality reduction. We simply utilize t-SNE as a rational tool to show the trend behind deep representations, since t-SNE has proven effective and practical in our experiments and in the literature. Then, we discuss the target dimensionality. We conduct dimensionality reduction to different dimensions. Since the highest dimension supported by t-SNE is 3, we first compare the two-dimensional and the three-dimensional features projected by t-SNE. The qualitative and quantitative results are shown in Fig. 21 and Tab. 9. When we reduce the features to three dimensions, the reduced representations also show discriminability to semantic labels. However, the quantitative results show that two dimensions portray the discriminability better than three or higher dimensions. For PCA, the results are similar: with higher dimensions, the discriminability decreases. Hence, it is reasonable to reduce high-dimensional features into two-dimensional datapoints. Such settings are also adopted in Donahue et al. (2014). To utilize t-SNE, we first use PCA to pre-reduce the features to 50 dimensions. Since PCA is deterministic, its result is fixed. For t-SNE, we report the mean and standard deviation over 5 runs. The quantitative results show that t-SNE surpasses PCA and that reducing to two dimensions is better. The features are obtained from the "Conv5 4" layer of ResNet18.

A.13 VISUALIZATION OF FEATURE MAPS

So far, we have successfully revealed the degradation-related semantics in SR networks with dimensionality reduction. In this section, we directly visualize the deep feature maps extracted from SR networks to provide some intuitive and qualitative interpretations.
Specifically, we extract the feature maps obtained from four models (SRResNet-wGR, SRResNet-woGR, SRGAN-wGR and SRGAN-woGR) on images with different degradations (clean, blur4, noise20), respectively. Then we treat each feature map as a one-channel image and plot it. The visualized feature maps are shown in Fig. 22. We select the 8 feature maps with the largest eigenvalues for display; the complete results are shown in the supplementary file.

Influence of degradations on feature maps. From Fig. 22(a), we can observe that the deep features obtained by SRResNet-woGR portray various characteristics of the input image, including edges, textures and contents. In particular, we highlight in "red rectangles" the features that retain most of the image content. As shown in Fig. 22(b), after applying blur and noise degradations to the input image, the extracted features exhibit similar degradations as well. For blurred/noisy input images, the extracted feature maps also contain corresponding blur/noise degradations.

Effect of global residual. In Sec. 4.3, we have revealed the importance and effectiveness of the global residual (GR) for obtaining deep degradation representations in SR networks. But why is GR so important? What is its role? Through visualization, we can provide a qualitative and intuitive explanation here. Comparing Fig. 22(a) and Fig. 22(b), it can be observed that by adopting GR, the extracted features contain fewer components of the original shape and content information. Thus, GR helps remove redundant image content information and makes the network concentrate more on obtaining features related to low-level degradation information.

Effect of GAN. Previously, we have discussed the difference between MSE-based and GAN-based SR methods in their deep representations. We find that the GAN-based method can better obtain feature representations that are discriminative to different degradation types.
(e) Hollywood100: 100 images selected from the Hollywood dataset Laptev et al. (2008), containing old film frames with unknown degradations, which may include compression, noise, blur and other real-world degradations. Datasets (a), (b), (c) and (d) have the same image contents but different degradations. However, we find that the deep degradation representations (DDR) obtained by SR networks are discriminative to these degradation types, even though the network has not seen these degradations at all during training. Further, for a real-world degradation like that in (e), the DDR are still able to discern it.



Note that BM3D is a denoising method while CinCGAN is able to upsample the resolution of the input image. Thus, after applying BM3D, we apply bicubic interpolation to unify the resolution of the output image. This is reasonable as we only evaluate their denoising effects. Note that the class labels in the scatterplots are only used to assign a color/symbol to the datapoints for better visualization. We use the same architecture as the original paper Dong et al. (2014) and add a global residual for better visualization. For efficiency, we selected 100 testing images of each category (1,000 images in total).

& Patel (2018); Dong et al. (2020); Deng et al. (2020a), etc. However, an interesting phenomenon is that even though we have successfully applied CNNs to many tasks, we still do not have a thorough understanding of their intrinsic working mechanisms.



Figure 2: Different degraded input images and their corresponding outputs produced by CinCGAN (Yuan et al., 2018), BM3D (Dabov et al., 2007), and SRCNN (Dong et al., 2014). CinCGAN (Yuan et al., 2018) is trained on the DIV2K-mild dataset in an unpaired manner. If the input image conforms to the training data distribution, CinCGAN generates better restoration results than BM3D (a). Otherwise, it tends to ignore the unseen degradation types (b)&(c). On the other hand, the traditional method BM3D (Dabov et al., 2007) has stable performance and similar denoising effects on all input images, regardless of the input degradation types. Zoom in for the best view.

Figure 3: (a)-(d): The projected deep feature representations. The deep features of CinCGAN and SRGAN are separated by degradation types, even if the image contents are aligned. (e)-a: ResNet18 (He et al., 2016) for classification. "Conv2 x" represents the 2nd group of residual blocks. (e)-b: SRResNet-woGR (without global residual). (e)-c: SRResNet (with global residual). "RB1" represents the 1st residual block.

CinCGAN (Yuan et al., 2018) is a representative solution for real-world SR without paired training data. It maps a degraded LR image to its clean version using data-distribution learning before conducting the SR operation. However, we find that it has a limited application scope even though it is developed for blind settings. If the degradation of the input image is not included in the training data, CinCGAN fails to transfer the degraded input to a clean one. More interestingly, instead of producing extra artifacts, CinCGAN seems not to process the input image at all and retains all the original defects. Readers can refer to Fig. 2 for an illustration, where CinCGAN performs well on the testing image of the DIV2K-mild dataset (same distribution as its training data) but produces unsatisfactory results for other degradation types. In other words, the network seems to figure out the specific degradation types within its training data distribution, and a distribution mismatch may make the network "turn off" its ability. This makes the performance of CinCGAN unstable and unpredictable. For comparison, we process the above three types of degraded images with a traditional denoising method, BM3D (Dabov et al., 2007). The visual results show that BM3D has an obvious and stable denoising effect for all degradation types. Although the results of BM3D may be mediocre (the image textures are largely over-smoothed), it does take effect on every input image.
This observation reveals a significant discrepancy between traditional methods and SR networks.

Figure 5: Feature representation differences between classification and SR networks. The same object category is represented by the same color, and the same image degradation type is depicted by the same marker shape. For the classification network, feature representations are clustered by the same color, while representations of SR networks are clustered by the same marker shape, suggesting that there is a significant difference in feature representations between classification and SR networks.

Figure 6: Projected feature representations extracted from different layers of SRResNet-woGR (1st row) and SRResNet (2nd row) using t-SNE. With image global residual (GR), the representations of MSE-based SR networks show discriminability to degradation types.

Figure 7: Projected feature representations extracted from different layers of SRGAN-woGR (1st row) and SRGAN (2nd row) using t-SNE. Even without GR, GAN-based SR networks can still obtain DDR.

Figure 9: By training with more degraded data, the deep representations become more uniform.

Figure 11: DDR clustering of different models. A lower CHI score connotes better generalization.

Figure 12: Projected feature representations of SRGAN-wGR on Real-City100 dataset.

, VGG Simonyan & Zisserman (2015), ResNet He et al. (2016), InceptionNet Szegedy et al. (2015); Ioffe & Szegedy (2015); Szegedy et al. (2017), DenseNet Huang et al. (2017), SENet Hu et al. (2018), etc. Unlike classification networks, super-resolution networks usually do not rely on downsampling layers but on upsampling layers (e.g., bilinear upsampling, transposed convolution Zeiler et al. (2010) or subpixel convolution Shi et al. (2016); Lim et al. (2017)) and attention mechanisms Hu et al. (2018); Zhang et al. (2018b).

Figure 14: With more layers, the model's deep representations gradually manifest discriminability on degradation types.

Figure 18: Unified backbone framework for classification and super-resolution. The two networks share the same backbone structure but have different tails.

Figure 21: Comparison between PCA and t-SNE for projecting feature representations ("Conv5 4" layer of ResNet18).
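The projection pipeline compared in Fig. 21 (deterministic PCA pre-reduction to 50 dimensions followed by 2-D t-SNE, as described in Sec. A.12) can be sketched as follows. This assumes scikit-learn is available; the function name and defaults are ours:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_features(features, pre_dim=50, perplexity=30.0, seed=0):
    """Project high-dimensional CNN features to 2-D datapoints:
    PCA pre-reduction to `pre_dim`, then non-linear t-SNE embedding."""
    feats = np.asarray(features, dtype=float)
    # PCA components cannot exceed the number of samples or features
    pre_dim = min(pre_dim, *feats.shape)
    reduced = PCA(n_components=pre_dim).fit_transform(feats)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)
```

Since t-SNE is stochastic, the paper reports the mean and standard deviation of WD/BD/CHI over five runs with different seeds.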

Figure 22: Visualization of feature maps. GR and GAN help the network obtain more degradation-related features.

As shown in Fig. 22(a) and Fig. 22(c), the feature maps extracted by the GAN-based method contain less object shape and content information compared with the MSE-based method. This partially explains why the deep representations of the GAN-based method are more discriminative, even without the global residual. Comparing Fig. 22(c) and Fig. 22(d), when there is a global residual, the feature maps containing the original image content information are further reduced, leading to stronger discriminability to degradation types.

A.14 SAMPLES OF DIFFERENT DATASETS

In the main paper, we adopt several different datasets to conduct experiments. Fig. 23 displays some example images from these datasets.
(a) DIV2K-clean: the original DIV2K Agustsson & Timofte (2017) dataset. The high-resolution (HR) ground-truth (GT) images have 2K resolution and are of high visual quality. The low-resolution (LR) input images are downsampled from HR by bicubic interpolation, without any further degradations.
(b) DIV2K-noise: adding Gaussian noise to the DIV2K-clean LR input, thus making it contain extra noise degradation. DIV2K-noise20 means the additive Gaussian noise level σ is 20.
(c) DIV2K-blur: applying Gaussian blur to the DIV2K-clean LR input, thus making it contain extra blur degradation. DIV2K-blur4 means the Gaussian blur width is 4.
(d) DIV2K-mild: officially synthesized from the DIV2K Agustsson & Timofte (2017) dataset as a challenge dataset Timofte et al. (2017; 2018), which contains noise, blur, pixel shifting and other degradations. The degradation modelling is unknown to challenge participants.
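Sec. A.13 displays the 8 feature maps with the largest eigenvalues. One plausible realization of such a selection is sketched below, using per-channel variance as a stand-in for the eigenvalue criterion (this substitution and the function name are our assumptions, not the paper's exact procedure):

```python
import numpy as np

def top_feature_maps(feats, n=8):
    """Pick the n channels of a (C, H, W) feature tensor with the largest
    per-channel variance, a simple energy proxy for display selection."""
    c, h, w = feats.shape
    flat = feats.reshape(c, -1)
    energy = flat.var(axis=1)               # variance of each channel
    idx = np.argsort(energy)[::-1][:n]      # indices of the n most energetic maps
    return idx, feats[idx]
```

Each selected map can then be plotted as a one-channel image, as done for Fig. 22.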

We find that different networks learn different semantic representations. For example, in Sec. 4.2, we reveal the differences in the learned representations between classification and SR networks. In Sec. 4.3, we show that not all SR network structures can easily obtain DDR.

The PSNR↑/NIQE↓ results on datasets with different degradations.

Simonyan et al. (2013); Samek et al. (2017); Zeiler & Fergus (2014); Selvaraju et al. (2017); Montavon et al. (2018); Karpathy et al. (2015); Mahendran & Vedaldi (2016); Zhang et al. (2020); Adebayo et al. (2018). Most of them attempt to interpret CNN decisions by visualization techniques, such as visualizing the intermediate feature maps (or saliency maps and class activation maps) Simonyan et al. (2013); Zeiler & Fergus (2014); Adebayo et al. (2018); Zhou et al. (2016); Selvaraju et al. (2017), computing the class notion images which maximize the class score Simonyan et al. (2013), or projecting feature representations Wen et al. (2016); Wang et al. (2020b); Zhu et al. (2018); Huang et al. (2020). For high-level vision tasks, especially image classification, researchers have established a set of techniques for interpreting deep models and have built up a preliminary understanding of CNN behaviors Gu et al. (2018). One representative work is done by Zeiler & Fergus (2014), who reveal the hierarchical nature of CNNs by visualizing and interpreting the feature maps: the shallow layers respond to low-level features such as corners, curves and other edge/color conjunctions; the middle layers capture more complex texture combinations; the deeper layers learn to encode more abstract and class-specific patterns, e.g., faces and legs. These patterns can be well interpreted by human perception and help partially explain CNN decisions for high-level vision tasks.

This is the most widely used loss function in many image restoration tasks Dong et al. (2014); Lim et al. (2017); Zhang et al. (2018b;a); Cai et al. (2016); He et al. (

Projected feature representations extracted from different layers of ResNet18 using t-SNE. As the network deepens, the representations become more discriminative to object categories, which clearly shows the semantics of the representations in classification.

Quantitative measures for the projected deep feature representations obtained by SRResNet-woGR and SRResNet-wGR.

Quantitative measures for the projected deep feature representations obtained by SRGAN-woGR and SRGAN-wGR.

By evaluating the discriminability of the projection results (clustering effect), we can roughly measure the generalization performance over different degradation types. The worse the clustering effect, the better the generalizability. Fig. 11 shows the DDR clustering of different models. RRDB (clean) is unable to deal with degraded data and obtains lower PSNR values on blur and noise inputs; its CHI score is 322.16. By introducing degraded data into training, the model gains better generalization, and the CHI score drops to 14.04. With DDR guidance, the generalization ability is further enhanced, and the CHI score decreases to 4.95. These results are consistent with those in the previous section. Interestingly, we do not need ground-truth images to evaluate model generalization. A similar attempt has been made in the recent work Liu et al. (2022). Note that CHI is only a rough index, which cannot accurately measure minor differences. DDR shows the possibility of designing a generalization evaluation metric, but there is still a long way to go to realize this goal.

Even for the same type of degradation, different degradation degrees will also cause differences in features. The greater the difference between degradation degrees, the stronger the discriminability. First row: SRResNet-wGR. Second row: SRGAN-wGR.

Previously, we introduced deep degradation representations by showing that the deep representations of SR networks are discriminative to different degradation types (e.g., clean, blur and noise). How about the same degradation type but with different degradation degrees? Will the deep representations still be discriminative to them?
-: 0∼20. +: 20∼100. ++: 100∼500. +++: ≥500.

Quantitative measures for the discriminability of the projected deep feature representations obtained by unified backbone framework (classification).

The quantitative results are reported in Tab. 8. From the results, we can see that the observations are consistent with the findings in the main paper. This suggests that the semantic representations do not stem from the network structures but from the task itself. Hence, our findings are not limited to specific structures but are universal.

A.11 MORE INSPIRATIONS AND FUTURE WORK

Disentanglement of Image Content and Degradation. In plenty of image editing and synthesis tasks, researchers seek to disentangle an image into different attributes so that the image can be finely edited Karras et al. (2019); Ma et al. (2018); Deng et al. (2020b); Lee et al. (2018); Nitzan et al. (2020). For example, semantic face editing Shen et al. (2020a;b); Shen & Zhou (2020) aims at manipulating the facial attributes of a given image, e.g., pose, gender, age, smile, etc. Most methods attempt to learn disentangled representations and to control the facial attributes by manipulating the latent space. In low-level vision, deep degradation representations can make it possible to decompose an image into content and degradation information, which can promote a number of new areas, such as degradation transfer and degradation editing. Further, more in-depth research on deep degradation representations will also greatly improve our understanding of the nature of images.

These settings are also adopted in Wang et al. (2020b); Veličković et al. (2017); Huang et al. (2020), which are proven effective.

Quantitative comparison with dimensionality reduction methods and reduced dimensions.

