DISCOVERING DISTINCTIVE "SEMANTICS" IN SUPER-RESOLUTION NETWORKS

ABSTRACT

Image super-resolution (SR) is a representative low-level vision problem. Although deep SR networks have achieved extraordinary success, their working mechanisms remain largely unknown. Specifically, can SR networks learn semantic information, or do they merely perform complex mapping functions? What hinders SR networks from generalizing to real-world data? These questions not only raise our curiosity, but also influence how SR networks are developed. In this paper, we make a first attempt to answer the above fundamental questions. After comprehensively analyzing the feature representations (via dimensionality reduction and visualization), we successfully discover the distinctive "semantics" in SR networks, i.e., deep degradation representations (DDR), which relate to image degradation instead of image content. We show that a well-trained deep SR network is naturally a good descriptor of degradation information. Our experiments also reveal two key factors (adversarial learning and global residual learning) that influence the extraction of such semantics. We further apply DDR to several interesting applications (such as distortion identification, blind SR and generalization evaluation) and achieve promising results, demonstrating the correctness and effectiveness of our findings.

1. INTRODUCTION

The emergence of deep convolutional neural networks (CNNs) has given birth to a large number of new solutions to low-level vision tasks (Dong et al., 2014; Zhang et al., 2017). Among these advances, image super-resolution (SR) has enjoyed a great performance leap. Compared with traditional methods (e.g., interpolation (Keys, 1981) and sparse coding (Yang et al., 2008)), SR networks achieve better performance with improved efficiency. However, even though we have benefited greatly from powerful CNNs, we have little knowledge about what happens inside SR networks and what, exactly, distinguishes them from traditional approaches. Does the performance gain merely come from more complex mapping functions? Or is there something different inside SR networks, as in classification networks with their discriminative capability?

On the other hand, as a classic regression task, SR is expected to perform a continuous mapping from low-resolution (LR) to high-resolution (HR) images. It is generally a local operation without consideration of the global context. But with the introduction of GAN-based models (Ledig et al., 2017; Wang et al., 2018), more delicate SR textures can be generated. It seems that the network has learned some kind of semantics, which is beyond our common expectation for regression tasks. We may then ask: are there any "semantics" in SR networks? If so, do these semantics have different definitions from those in classification networks? Existing literature cannot answer these questions, as there is little research on interpreting low-level vision deep models. Nevertheless, discovering the semantics in SR networks is of great importance. It can not only help us further understand the underlying working mechanisms, but also guide us to design better networks and evaluation algorithms. In this study, we give affirmative answers to the above questions by unfolding the semantics hidden in super-resolution networks.
Specifically, different from the artificially predefined semantics associated with object classes in high-level vision, the semantics in SR networks are distinctive in terms of image degradation instead of image content. Accordingly, we name such semantics deep degradation representations (DDR). More interestingly, such degradation-related semantics exist spontaneously, without any predefined labels. We reveal that a well-trained deep SR network is naturally a good descriptor of degradation information. Notably, the semantics in this paper have different implications from those in high-level vision. Previously, researchers have disclosed the hierarchical nature of classification networks (Zeiler & Fergus, 2014; Gu et al., 2018). As the layers deepen, the learned features respond more to abstract high-level patterns (e.g., faces and legs), showing stronger discriminability to object categories (see Fig. 4). However, similar research in low-level vision is absent, since there are no predefined semantic labels. In this paper, we reveal the differences in deep "semantics" between classification and SR networks, as illustrated in Fig. 1. Our observation stems from a representative blind SR method, CinCGAN (Yuan et al., 2018), and we further extend it to more common SR networks, SRResNet and SRGAN (Ledig et al., 2017). We also reveal more interesting phenomena that help interpret the semantics, including the analogy to classification networks and the influential factors for extracting DDR. Moreover, we improve the results of several tasks by exploiting DDR. We believe our findings could lay the groundwork for the interpretability of SR networks, and inspire more exploration of the mechanisms of low-level vision deep models.

Contributions. 1) We discover the "semantics" in SR networks, denoted as deep degradation representations (DDR).
Through in-depth analysis, we also find that global residual learning and adversarial learning facilitate the SR network in extracting such degradation-related representations. 2) We reveal, for the first time, the differences in deep representations between classification and SR networks. This further expands our knowledge of the deep representations of high- and low-level vision models. 3) We apply our findings to several fundamental tasks and achieve very appealing results, including distortion identification, blind SR and generalization evaluation.
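The feature analysis underlying these findings (dimensionality reduction and visualization of deep representations under different degradations) can be sketched in miniature. The snippet below is a toy illustration, not the paper's actual pipeline: a fixed random projection stands in for a trained SR network's feature extractor, PCA stands in for the visualization step, and synthetic noisy patches stand in for real degraded inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "deep features": a fixed random linear map over flattened
# 16x16 patches. In the paper's setting these would be activations taken
# from a trained SR network (e.g., SRGAN); no such network is assumed here.
W = rng.standard_normal((256, 64)) / 16.0

def deep_features(patches):
    """Map (N, 256) flattened patches to (N, 64) feature vectors."""
    return np.maximum(patches @ W, 0.0)  # ReLU-like nonlinearity

def pca_2d(feats):
    """Project feature vectors onto their top-2 principal components."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Two degradation types applied to the same underlying content:
clean = rng.uniform(0.0, 1.0, size=(100, 256))
noisy = clean + rng.normal(0.0, 0.3, size=clean.shape)  # additive noise

embedding = pca_2d(deep_features(np.vstack([clean, noisy])))
print(embedding.shape)  # (200, 2): one 2-D point per patch for plotting
```

If the features were degradation-discriminative (as DDR is claimed to be), scatter-plotting the two halves of `embedding` would show two separated clusters.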

2. RELATED WORK

Super-resolution. Super-resolution (SR) is a fundamental task in low-level vision, which aims to reconstruct the high-resolution (HR) image from its low-resolution (LR) counterpart. SRCNN (Dong et al., 2014) is the first CNN-based method proposed for SR. Since then, a large number of deep-learning-based methods have been developed (Dong et al., 2016; Lim et al., 2017; Zhang et al., 2018b; Ledig et al., 2017; Zhang et al., 2019). Generally, current CNN-based SR methods can be categorized into two groups. One comprises MSE-based methods, which aim to minimize the distortion (e.g., mean squared error) between the ground-truth HR image and the super-resolved image to yield high PSNR values, such as SRCNN (Dong et al., 2014), VDSR (Kim et al., 2016), EDSR (Lim et al., 2017), RCAN (Zhang et al., 2018b) and SAN (Dai et al., 2019). The other comprises GAN-based methods, which incorporate a generative adversarial network (GAN) and a perceptual loss (Johnson et al., 2016) to obtain perceptually pleasing results, such as SRGAN (Ledig et al., 2017),



Figure 1: Distributions of the deep representations of classification and super-resolution networks. For classification networks, the semantics of the deep feature representations are artificially predefined according to the training data (category labels). However, for SR networks, the learned deep representations carry a different kind of "semantics" from classification. During training, the SR networks are only provided with downsampled clean LR images; there is no supervision signal related to image degradation. Surprisingly, we find that the deep representations of SR networks are spontaneously discriminative to different degradations. Notably, not every SR network has this property. In Sec. 4.3, we reveal two factors that help SR networks extract such degradation-related representations: adversarial learning and global residual learning.
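The two factors named in the caption can be sketched schematically. The snippet below is an illustrative toy, not the paper's architecture: `upsample`, `trunk`, and the discriminator score are all hypothetical stand-ins. It shows (a) a generator with a global residual connection, where the trunk predicts only a correction on top of the upsampled input, and (b) an SRGAN-style adversarial loss term for the generator.

```python
import numpy as np

def upsample(lr, scale=2):
    # Nearest-neighbour upsampling as a stand-in for bicubic interpolation.
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

def trunk(x):
    # Stand-in for the convolutional body of an SR network.
    return 0.1 * np.tanh(x)

def generator(lr, scale=2):
    # Global residual: image content is carried by the skip connection,
    # and the trunk only outputs a correction on top of it.
    up = upsample(lr, scale)
    return up + trunk(up)

def adversarial_loss(d_score_fake):
    # Non-saturating GAN loss for the generator, given the discriminator's
    # estimated probability that the super-resolved image is real.
    return -np.log(d_score_fake + 1e-12)

lr = np.random.default_rng(0).uniform(0.0, 1.0, size=(8, 8))
sr = generator(lr, scale=2)
print(sr.shape)  # (16, 16)
```

Because the skip path already reconstructs the content, the trunk's features are not forced to re-encode it; the paper argues this, together with the adversarial term, is what lets degradation-related cues surface in the representations.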

