DEEP GENERATIVE MODEL BASED RATE-DISTORTION FOR IMAGE DOWNSCALING ASSESSMENT

Anonymous authors
Paper under double-blind review

Abstract

In this paper, we propose a novel measure, namely Image Downscaling Assessment by Rate-Distortion (IDA-RD), to quantitatively evaluate image downscaling algorithms. In contrast to image-based methods that measure the quality of downscaled images, ours is a process-based measure that draws on rate-distortion theory to quantify the distortion incurred during downscaling. Our main idea is that downscaling and super-resolution (SR) can be viewed as the encoding and decoding processes in the rate-distortion model, respectively, and that a downscaling algorithm that preserves more details in the resulting low-resolution (LR) images should lead to less distorted high-resolution (HR) images in SR. In other words, the distortion should increase as the downscaling algorithm deteriorates. However, it is non-trivial to measure this distortion as it requires the SR algorithm to be blind and stochastic. Our key insight is that such requirements can be met by recent SR algorithms based on deep generative models that can find all matching HR images for a given LR image on their learned image manifolds. Empirically, we first validate our IDA-RD measure with synthetic downscaling algorithms that simulate distortions by adding various types and levels of degradations to the downscaled images. We then test our measure on traditional downscaling algorithms such as bicubic, bilinear, and nearest neighbor interpolation, as well as state-of-the-art downscaling algorithms such as DPID (Weber et al., 2016), L0-regularized downscaling (Liu et al., 2017), and Perceptual downscaling (Oeztireli & Gross, 2015). Experimental results show the effectiveness of our IDA-RD in evaluating image downscaling algorithms.

1. INTRODUCTION

Image downscaling is a fundamental problem in image processing and computer vision. To address the diverse application scenarios, various digital devices with different resolutions, such as smartphones, iPads, and desktop monitors, co-exist, which makes this problem even more important. In contrast to image super-resolution (SR), which aims to "add" information to low-resolution (LR) images, image downscaling algorithms focus on "preserving" information present in the high-resolution (HR) images, which is particularly important for applications and devices with very limited screen space. Traditional image downscaling algorithms low-pass filter an image before resampling it. While this prevents aliasing in the downscaled LR image, important high-frequency details of the HR image are removed simultaneously, resulting in a blurred or overly smooth LR image. To improve the quality of downscaled images, several sophisticated approaches have been proposed recently, including remapping of high-frequency information (Gastal & Oliveira, 2017), optimization of perceptual image quality metrics (Oeztireli & Gross, 2015), using L0-regularized priors (Liu et al., 2017), and pixelizing the HR image (Gerstner et al., 2012; Han et al., 2018; Kuang et al., 2021; Shang & Wong, 2021). Nevertheless, research in image downscaling algorithms has significantly slowed down due to the lack of a quantitative measure to evaluate them. Specifically, standard distance measures (e.g. L1, L2 norm) and full-reference image quality assessment (IQA) methods are not applicable here due to the absence of ground truth LR images; existing No-Reference IQA (NR-IQA) metrics (Mittal et al., 2012b;a; Bosse et al., 2017) cannot be applied either as they rely on the "naturalness" of HR images, which is not present in LR images (we will verify this in our experiments).
In this paper, we propose a new quantitative measure for image downscaling based on Claude Shannon's rate-distortion theory (Berger, 2003), namely Image Downscaling Assessment by Rate-Distortion (IDA-RD). The main idea of our IDA-RD measure is that a superior image downscaling algorithm would try to retain as much information as possible in the LR image, thereby reducing the distortion when being up-scaled (a.k.a. super-resolved) to the size of the original HR image. However, such an upscaling method is non-trivial as it must satisfy two challenging requirements: i) blindness, i.e. it must apply to all kinds of downscaling algorithms without knowing them in advance; ii) stochasticity, i.e. it must be able to generate a manifold of HR images that captures the conditional distribution of the super-resolution process. Our key insight is that both such requirements can be satisfied by the recent success of deep generative models in blind and stochastic super-resolution. To demonstrate the flexibility of our IDA-RD measure, we show that it can be successfully implemented with two mainstream generative models: Generative Adversarial Networks (Menon et al., 2020) and Normalizing Flows (Lugmayr et al., 2020). Extensive experiments demonstrate the effectiveness of our IDA-RD measure in evaluating image downscaling algorithms.

Our contributions include:
• Drawing on Claude Shannon's rate-distortion theory (Berger, 2003), we propose the Image Downscaling Assessment by Rate-Distortion (IDA-RD) measure to quantitatively evaluate image downscaling algorithms, which fills a gap in existing image downscaling research.
• We demonstrate the effectiveness of our IDA-RD measure with extensive experiments on both synthetic and real-world image downscaling algorithms.

2. RELATED WORK

Image Downscaling has a long history and its traditional methods (e.g. bicubic) have now become the standard for image processing and computer vision software, making it difficult to trace their origins. We therefore only review recent attempts at developing better image downscaling algorithms. For example, Gastal & Oliveira (2017) conduct a discrete Gabor frequency analysis and propose to remap the high-frequency information of HR images to the representable range of the downsampled spectrum, thereby preserving high-frequency details in image downscaling. Oeztireli & Gross (2015) model image downscaling as an optimization problem and minimize a perceptual metric (SSIM) between the input and downscaled image. However, the limitations of SSIM are also carried over to their approach. DPID (Weber et al., 2016) preserves small details by assigning higher weights to the input pixels whose color deviates from their local neighborhood within the convolutional filter. Liu et al. (2017) propose an optimization framework using two L0-regularized priors that addresses two issues of image downscaling, i.e. salient feature preservation and downscaled image construction. Image thumbnailing, a special case of image downscaling, has been studied by Sun & Ling (2013) (Wang et al., 2003) and Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018). However, such IQA metrics are not applicable in the evaluation of image downscaling algorithms as there are no ground truth LR images for comparison. Thus, most researchers rely on subjective evaluation of downscaled images, which is costly and time-consuming.

No-Reference Image Quality Assessment (NR-IQA) addresses IQA in the absence of a reference (i.e. ground truth) image. For example, Mittal et al. (2012a) propose BRISQUE, an NR-IQA metric that uses natural scene statistics (NSS) to quantify the loss of "naturalness" in distorted images.

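[Figure 1 diagram: the input HR image set X is passed through the downscaling method f_ds under evaluation and then through the upscaling method f_us; the IDA-RD score is the expected distortion E_{X,Q}[D(·,·)].]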

Figure 1: Illustration of the proposed IDA-RD measure. Given a downscaling method $f_{ds}$ to be evaluated, i) we first use it to downscale several HR images; ii) then, we upscale them back to the original resolution with $f_{us}$ and measure the distortion from the corresponding HR images. Such an upscaling method leverages the recent success in deep generative models and thus can i) apply to arbitrarily downscaled images and ii) output a manifold of HR images that captures the conditional distribution given a downscaled image.

Using locally normalized luminances, BRISQUE models a regressor which maps the feature space to image quality scores. Based on their NSS, Mittal et al. (2012b) (Karras et al., 2019; 2020; 2021) developed by Nvidia has shown impressive, arguably state-of-the-art, results in high-resolution and high-quality image synthesis, leading to various applications in image processing and manipulation (Abdal et al., 2019; 2020; Zhu et al., 2020). In this paper, we follow Menon et al. (2020) and implement our measure with a StyleGAN generator pre-trained on portrait images. Nevertheless, normalizing flows (Rezende & Mohamed, 2015; Papamakarios et al., 2021; Keller et al., 2021), which construct complex distributions by transforming a probability density function through a series of invertible mappings, have attracted increasing attention in the past several years. In this paper, we employ the SRFlow (Lugmayr et al., 2020) model to implement our measure, which directly learns the conditional distribution of the HR output given the LR input.

3. OUR APPROACH

In this section, we first introduce the definition of our metric derived from Claude Shannon's rate-distortion theory (Berger, 2003), and then detail how deep generative models help to sidestep the data scarcity challenge that impedes the application of the proposed metric.

3.1. METRIC DEFINITION

We create a proxy task, namely the lossy compression problem underpinned by Claude Shannon's rate-distortion theory (Berger, 2003), and formulate image downscaling as its encoding process:

$$\inf_{Q_f(\hat{x}|x)} \mathbb{E}[D_Q(X, \hat{X})] \quad \text{s.t.} \quad I_Q(X; \hat{X}) \le R \tag{1}$$

where $X$ is the set of input high-resolution images, $\hat{X}$ is the set of output reconstructed images, $R$ is a rate constraint determined by the downscaling process (footnote 1), $Q_f(\hat{x}|x)$, or $Q$ for short, is the probability density function (PDF) of reconstructed HR images $\hat{x}$ conditioned on an input HR image $x$ with respect to a given lossy image reconstruction function $f$ such that $\hat{x} = f(x) = f_{us}(f_{ds}(x))$, where $f_{us}$ and $f_{ds}$ denote image upscaling and downscaling functions respectively, and $D_Q$ is a distortion metric between two image sets where the image correspondence is determined by $Q$. Thus, we propose to use the expectation of the distortion as an evaluation metric for image downscaling:

$$S(f_{ds}) = \mathbb{E}[D_Q(X, \hat{X})] = \mathbb{E}_{x}\{\mathbb{E}_{\hat{x}|x}[D(x, \hat{x})]\} \tag{2}$$

where $x \in X$, $\hat{x} \in \hat{X}$, and $D$ is a distortion metric between two images, e.g., LPIPS (Zhang et al., 2018). The lower $S$, the better the downscaling algorithm $f_{ds}$. Although straightforward, the application of such a metric remained a challenge in the past as it requires a strong upscaling function $f_{us}$ that can:
• Reconstruct the input image $x$ regardless of the input downscaling algorithm $f_{ds}$.
• Generate a conditional distribution of reconstructed images $\hat{x}|x$ for each $x$.
Of these, the first is commonly known as blind image super-resolution, which is essentially a many-to-one mapping problem that aims to map different distorted downscaled images to the same high-resolution image; the second is commonly known as one-to-many super-resolution due to its ill-posed nature caused by the information loss during downscaling (Lugmayr et al., 2020).

Data Scarcity Challenge. Combining the above two requirements makes the desired $f_{us}$ an extremely challenging many-to-many mapping problem that has remained unsolved for decades. Specifically, the numerous kinds of distorted downscaled images and the corresponding countless high-resolution images for each of them make it infeasible to collect sufficient data for supervised learning methods:

$$f_{us} = \arg\min_{f_\theta} \mathbb{E}_{I_{LR}}\left(\mathbb{E}_{I_{HR}} \| I_{HR} - f_\theta(I_{LR}) \|\right) \tag{3}$$

where $I_{HR}$ and $I_{LR}$ denote the high-resolution (HR) and low-resolution (LR) training images respectively, $\mathbb{E}_{I_{HR}}$ indicates that there are many $I_{HR}$ corresponding to the same $I_{LR}$, and $\mathbb{E}_{I_{LR}}$ indicates that there are many $I_{LR}$ obtained by different image downscaling methods $f_{ds}$.
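As an illustration of Eq. 2, the following minimal sketch estimates the score by Monte Carlo sampling: each HR image is downscaled once, upscaled several times by a stochastic upscaler, and the distortions are averaged. The helper names (`ida_rd`, `box_downscale`, `noisy_nearest_upscale`, `mse`) are hypothetical, and the toy upscaler and MSE distortion are stand-ins for the generative SR model and LPIPS used in the paper, not the actual implementation.

```python
import numpy as np

def ida_rd(hr_images, f_ds, f_us, distortion, n_q=5, seed=0):
    """Monte Carlo estimate of S(f_ds) = E_x E_{x_hat|x}[D(x, x_hat)] (Eq. 2)."""
    rng = np.random.default_rng(seed)
    per_image = []
    for x in hr_images:
        lr = f_ds(x)  # encode: downscale the HR image
        # decode n_q times: draw n_q stochastic reconstructions x_hat | x
        samples = [distortion(x, f_us(lr, rng)) for _ in range(n_q)]
        per_image.append(np.mean(samples))
    return float(np.mean(per_image)), float(np.std(per_image))

# --- Toy stand-ins (assumptions, not the paper's models) ---
def box_downscale(x, factor=4):
    """Average-pool a 2D grayscale image by an integer factor."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def noisy_nearest_upscale(lr, rng, factor=4, sigma=0.01):
    """Nearest-neighbour upscaling plus small noise, mimicking a stochastic f_us."""
    hr = np.kron(lr, np.ones((factor, factor)))
    return hr + rng.normal(0.0, sigma, hr.shape)

def mse(a, b):
    """Mean squared error, used here in place of LPIPS."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(1)
images = [rng.random((32, 32)) for _ in range(8)]
score, spread = ida_rd(images, box_downscale, noisy_nearest_upscale, mse)
# A deliberately degraded downscaler (halved intensities) should score worse (higher).
degraded_score, _ = ida_rd(images, lambda x: 0.5 * box_downscale(x),
                           noisy_nearest_upscale, mse)
print(f"clean: {score:.4f}  degraded: {degraded_score:.4f}")
```

In this toy setting the degraded downscaler yields the higher score, matching the intended reading that a lower $S$ indicates a better downscaling algorithm.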

3.2. EVALUATION WITH DEEP GENERATIVE MODELS

Our key insight is that the above-mentioned data scarcity challenge (Eq. 3) can be overcome by the recent successes in deep generative modeling (Goodfellow et al., 2014; Radford et al., 2015; Arjovsky et al., 2017; Karras et al., 2019; 2020; 2021; Rezende & Mohamed, 2015; Papamakarios et al., 2021; Keller et al., 2021). In deep generative modeling, a neural network model is trained to learn a manifold of natural and high-resolution (HR) images from samples in the training dataset. This has been successfully applied to various image processing tasks (Abdal et al., 2019; 2020; Zhu et al., 2020). To demonstrate the flexibility of our metric, we show two implementations of it using two mainstream deep generative models: i) Generative Adversarial Networks (GANs) and ii) Normalizing Flows.

Implementation with a GAN generator. Similar to Menon et al. (2020), we implement the upsampling function $f_{us}$ in our metric using an optimization-based GAN inversion method (Abdal et al., 2019; 2020). Leveraging the power of a pre-trained StyleGAN (Karras et al., 2019) generator $G$, we define our GAN-based $f_{us}$ (Eq. 2) as locating the optimized StyleGAN latent code $z_i^*$ so that its corresponding HR image $G(z_i^*)$ synthesized by $G$ shares the same downscaled image as an input LR image $I_{LR} = f_{ds}(x)$:

$$f_{us}(I_{LR}, i) = G(z_i^*) = \arg\min_{G(z_i)} \| I_{LR} - f_{ds}(G(z_i)) \| \tag{4}$$

where $I_{LR} = f_{ds}(x)$ denotes the input LR image downscaled by $f_{ds}$, $z_i$ denotes the $i$-th randomly initialized latent code to be optimized to get the $i$-th sample from $\hat{x}|I_{LR}$ (i.e., $G(z_i^*)$), and $i = 1, 2, 3, ...$ is the index. It can be observed that i) our $f_{us}$ sidesteps the data scarcity challenge (Eq. 3) by using a StyleGAN generator that is trained with HR images only (i.e., without any many-to-many LR-HR training pairs); ii) it relocates the supervision to downscaling (i.e., enforcing different HR images to be downscaled to the same LR image) and thus outputs high-quality HR images $G(z_i^*)$ for an arbitrary choice of $f_{ds}$; iii) it is inherently stochastic given the random choices of $z_i$.

Implementation with a Flow model. We employ the SRFlow model (Lugmayr et al., 2020) and implement the $f_{us}$ in our metric with a conditional invertible neural network. Leveraging its invertible nature, $f_{us}$ is trained to explicitly learn the conditional distribution $\hat{x}|I_{LR}$ by minimizing the negative log-likelihood:

$$f_{us} = \arg\min_{f_\theta} -\log p_z(f_\theta(x | I_{LR})) \tag{5}$$

where $I_{LR} = f_{ds}^{bicubic}(x)$ is a bicubic-downscaled image of the HR input $x$, and $z$ denotes a random latent variable whose distribution encodes $\hat{x}|I_{LR}$ with a 'reparameterization trick'. Although $f_{us}$ is trained with only bicubic downscaling, we observed, surprisingly, that it can also be applied to evaluate other downscaling methods. We use SRFlow in the final version of our metric as it performs similarly to the GAN-based implementation but has a much lower time cost. Please see Sec. 4.4 for a detailed ablation study.
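The GAN-based upscaling described above (optimizing randomly initialized latent codes so that the generated HR image downscales to the given LR image) can be sketched on a toy problem. The sketch below assumes a random linear "generator" and average-pooling as $f_{ds}$, and uses plain gradient descent; all names (`make_toy_generator`, `invert`, `box_downscale`) are hypothetical, and this is not PULSE or StyleGAN, which use a learned nonlinear generator and spherical gradient descent.

```python
import numpy as np

FACTOR, HR_SIZE, LATENT_DIM = 4, 16, 32

def box_downscale(x, factor=FACTOR):
    """Toy f_ds: average-pool a 2D image by an integer factor."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def make_toy_generator(rng):
    """Toy linear generator G(z) = reshape(W z); stands in for a pre-trained GAN."""
    W = rng.normal(0.0, 1.0 / np.sqrt(LATENT_DIM), (HR_SIZE * HR_SIZE, LATENT_DIM))
    return (lambda z: (W @ z).reshape(HR_SIZE, HR_SIZE)), W

def invert(lr, G, W, z0, steps=2000, step_size=2.0):
    """Minimize ||lr - f_ds(G(z))||^2 over z by gradient descent, starting from z0."""
    z = z0.copy()
    for _ in range(steps):
        residual = lr - box_downscale(G(z))
        # Adjoint of average pooling: spread each LR residual over its HR block.
        grad_hr = -2.0 * np.kron(residual, np.ones((FACTOR, FACTOR))) / FACTOR**2
        z -= step_size * (W.T @ grad_hr.ravel())
    return z

rng = np.random.default_rng(0)
G, W = make_toy_generator(rng)
x = G(rng.normal(size=LATENT_DIM))   # an HR image on the toy manifold
lr = box_downscale(x)                # its LR version, I_LR = f_ds(x)
# Different random initializations z_i give different HR samples for the same LR input.
samples = [G(invert(lr, G, W, rng.normal(size=LATENT_DIM))) for _ in range(3)]
max_residual = max(np.abs(lr - box_downscale(s)).max() for s in samples)
print("max LR residual:", max_residual)
print("HR samples differ:", np.abs(samples[0] - samples[1]).max() > 1e-3)
```

Because the toy problem has more latent dimensions than LR pixels, many latent codes reproduce the same LR image, so each random start lands on a different HR sample. This is the one-to-many behaviour that the measure relies on.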

4. EXPERIMENTS

To validate the effectiveness of our IDA-RD measure, we first test it with synthetic image downscaling methods whose performance is known beforehand (Sec. 4.2). Specifically, we simulate different types and levels of downscaling distortions by adding controllable degradations (e.g., Gaussian Blur, Contrast Change) to bicubic-downscaled images. In principle, the heavier the degradation, the worse the results of downscaling, and the higher our measure should be. We also validate the effectiveness of our IDA-RD measure across different scaling factors. Second, we show that our measure can also be used to evaluate real-world image downscaling methods like Bicubic, Bilinear, Nearest Neighbour, and state-of-the-art downscaling methods like L0-regularized (Liu et al., 2017), Perceptual (Oeztireli & Gross, 2015) and DPID (Weber et al., 2016) (Sec. 4.3). Third, we perform a thorough ablation study to justify the algorithmic choices of our measure (Sec. 4.4). Finally, we empirically justify our motivation in Sec. 4.5. Please see the appendix for additional experiments and examples of downscaled images (Appendix A.1).

4.1. EXPERIMENTAL SETUP

Dataset. Unless specified, we use a balanced subset of 900 images from the FFHQ dataset (Karras et al., 2019), including face images at 1024×1024 resolution, as the set of input high-resolution images $X$ in Eq. 2 for our IDA-RD measure. Please see Appendix A.2 for more details on how we construct balanced subsets of images from FFHQ. The results on other datasets, including NPRportrait 1.0 (Rosin et al., 2022) and AFHQ-Cat (Choi et al., 2020), are shown in Sec. 4.4. Note that we use these domain-specific datasets as they are more stable for SRFlow. Please see Appendix A.8 for the results and discussions on real-world datasets, e.g., DIV2K (Agustsson & Timofte, 2017), Flickr30k (Young et al., 2014).

Image Upscaling Algorithms. We use SRFlow (Lugmayr et al., 2020) as the $f_{us}$ in Eq. 2. Specifically, we use the models provided by the authors for 4× and 8× super-resolution that are pre-trained on the DIV2K (Agustsson & Timofte, 2017) and Flickr2K datasets (footnote 2). Unless specified, we use the 8× model for all experiments. Note that we also test PULSE (Menon et al., 2020) as an alternative in Sec. 4.4. For PULSE, we use the same StyleGAN generator pre-trained with FFHQ (Karras et al., 2019). This model generates face images of size 1024×1024. We use a learning rate of 0.4, and stop the optimization for each image after 200 steps of spherical gradient descent. The noise signals of the StyleGAN generator are kept fixed.

Hyperparameters. Unless specified, we use $N_Q = 5$ as the number of images upscaled from a single downscaled image for the estimation of $Q$ in Eq. 2; we use LPIPS (Zhang et al., 2018) as the distortion measure $D$ in Eq. 2; we use $N_X = 900$ as the number of images in the set of high-resolution images $X$ in Eq. 2.
In this section, we demonstrate the effectiveness of our IDA-RD measure by testing its performance on synthetic downscaling methods, which simulate the effects of different downscaling methods by adding controllable degradations to bicubic-downscaled images.

4.2.1. EFFECTIVENESS ACROSS DEGRADATION TYPES AND LEVELS

As detailed below, we test our IDA-RD measure with four sets of synthetic downscaling methods that apply different types and levels of degradations to bicubic-downscaled images respectively. Gaussian Blur. We apply Gaussian blur to the bicubic-downscaled images. The standard deviation of the blur kernel σ is chosen from {1.0, 2.0, 4.0}. The kernel size was set as (3σ + 1). The results are shown in Table 1 (a). Gaussian Noise. We add Gaussian noise to the bicubic-downscaled images. The standard deviation σ of the noise is chosen from {0.05, 0.1, 0.2}. The results are shown in Table 1 (a). Contrast Change. We apply contrast change to bicubic-downscaled images. To increase the contrast, we select the scale factor from {1.5, 2.0, 2.5}. Note that such scaling can cause degradation due to the clipping of extreme intensity values. Similarly, to decrease the contrast, we select the contrast parameter from {0.25, 0.50, 0.75}. The results are shown in Table 1 (a). Quantization. We apply pixel quantization to bicubic-downscaled images and select the number of color thresholds from {5, 10, 15}. Specifically, we apply Otsu's multilevel thresholding algorithm (Otsu, 1979) to the graylevel histogram which is derived from the color image, and then apply these thresholds uniformly to each of the RGB color channels. The results are shown in Table 1 (b). Mixed Degradations. In addition to single degradations mentioned above, we also demonstrate the effectiveness of our IDA-RD measure on their mixtures. The results are shown in Table 1 (c). It can be observed that our IDA-RD measure works as expected (i.e., the stronger the degradation, the worse the downscaling algorithm, and the higher the IDA-RD) for all synthetic image downscaling methods, which demonstrates its effectiveness.
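The synthetic degradations above can be sketched in a few lines of NumPy. This is a minimal sketch on grayscale images in [0, 1], with several assumptions that the text does not specify: the kernel size 3σ + 1 is rounded up to the next odd integer, borders are handled by edge padding, and contrast is changed by scaling intensities around mid-gray 0.5 followed by clipping. The function names are hypothetical; Otsu-based quantization is omitted.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel of size (3*sigma + 1), bumped to odd (assumption)."""
    size = int(3 * sigma + 1)
    if size % 2 == 0:
        size += 1
    offsets = np.arange(size) - size // 2
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding (padding mode is an assumption)."""
    k = gaussian_kernel_1d(sigma)
    pad = len(k) // 2
    padded = np.pad(img, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def add_gaussian_noise(img, sigma, rng):
    """Additive zero-mean Gaussian noise, clipped back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def change_contrast(img, factor):
    """Scale intensities around mid-gray (assumed formula); clipping causes the degradation."""
    return np.clip((img - 0.5) * factor + 0.5, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((16, 16))            # stands in for a bicubic-downscaled image
blurred = gaussian_blur(img, sigma=1.0)
noisy = add_gaussian_noise(img, sigma=0.05, rng=rng)
low_contrast = change_contrast(img, factor=0.5)
high_contrast = change_contrast(img, factor=2.0)
```

A synthetic downscaling method in the sense of this section is then the composition of bicubic downscaling with one (or, for mixed degradations, several) of these functions.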

4.2.2. EFFECTIVENESS ACROSS SCALE FACTORS

We further demonstrate the effectiveness of our IDA-RD measure on synthetic downscaling algorithms across different scaling factors. As Table 2 shows, we test our IDA-RD on synthetic downscaling algorithms of different levels of Gaussian Blur degradation as mentioned above. It can be observed that: i) the larger the scaling factor, the more the information loss, and the higher the IDA-RD; ii) the stronger the degradation, the worse the downscaling algorithm, and the higher the IDA-RD; which justifies the validity of our IDA-RD measure.

4.3. EVALUATING EXISTING DOWNSCALING METHODS

We apply our method to compare six existing downscaling algorithms, consisting of three traditional methods: Bicubic, Bilinear, Nearest Neighbor (N.N.), and three state-of-the-art (SOTA) methods: DPID (Weber et al., 2016), L0-regularized downscaling (Liu et al., 2017), and Perceptual (Oeztireli & Gross, 2015) downscaling. The results are shown in Table 3. It can be observed that: i) when applied to classical downscaling algorithms (i.e., Bicubic, Bilinear, and N.N.), our IDA-RD measure ranks the quality of these algorithms in the correct order (Bilinear > Bicubic > N.N.), although, as expected, the difference between the results of Bicubic and Bilinear downscaling is not significant; ii) for the SOTA algorithms, the common belief is that they should perform better than Bilinear downscaling. However, none of them achieves a better IDA-RD score, suggesting that although SOTA image downscaling methods excel in perceptual quality, they actually lose more information than Bilinear downscaling. Nevertheless, it can be observed that the DPID and L0-regularized methods are slightly better than Perceptual downscaling on our IDA-RD measure, which is consistent with previous understanding. These observations indicate that our IDA-RD measure is a useful complement to visual inspection, i.e., a good image downscaling algorithm should be both visually satisfying and achieve a low IDA-RD score, which further validates the role of our measure in providing new insights into image downscaling algorithms. Please see Appendix A.10 for a qualitative comparison.

4.4. ABLATION STUDY

In this experiment, we justify the algorithmic choices of our IDA-RD measure, i.e., f us , D, the number of images used to estimate Q and in X, and the content of X in Eq. 2, by performing a thorough ablation study on them. Choice of f us . As Table 4 shows, both PULSE (Menon et al., 2020) and SRFlow (Lugmayr et al., 2020) have similar results when used as f us in our IDA-RD measure, i.e., N.N. > Perceptual > L0-regularized > DPID > Bicubic > Bilinear. However, since SRFlow yields more distinguishable results and runs much faster (Table 15 in Appendix A.4), we use it in our IDA-RD measure. Nevertheless, our IDA-RD is very flexible (i.e., not restricted to PULSE or SRFlow) and will benefit from future progresses of blind and stochastic super-resolution methods (please see Appendix A.9). Number of Images used to Estimate Q. As Table 5 shows, for a downscaled image, we investigate how many images are required to be upscaled from it (by f us ) to achieve a robust estimation of the conditional distribution Q and thus our IDA-RD, namely N Q . It can be observed that the results become stable when N Q ≥ 5, so we choose N Q = 5 for our IDA-RD measure. Choice of D. As Table 6 shows, we test different choices of D including multiple image distortion metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) (Wang et al., 2004) , MS-SSIM (Multi-Scale SSIM), and LPIPS (Zhang et al., 2018) . Experimental results demonstrate a similar trend across all of them, indicating the flexibility of our IDA-RD measure. Nevertheless, since LPIPS is a more advanced metric that has been shown to be more consistent with human perception, we use it in the final version of our IDA-RD measure. Number of Images in X. As Table 7 shows, we investigate how many images are required in the test dataset X consisting of high-resolution images to achieve a robust estimation of IDA-RD, namely N X . 
It can be observed that the results become stable when $N_X \ge 900$, so we choose $N_X = 900$ for our IDA-RD measure.

Table 7: Ablation study of $N_X$, the number of images in test dataset $X$ in Eq. 2. Synthetic image downscaling methods with Contrast Decrease with σ = 0.75 (DG1); Gaussian Noise with σ = 0.05 (DG2); mixed noise consisting of Gaussian Blur with σ = 1.0, Contrast Decrease with σ = 0.75, and Gaussian Noise with σ = 0.05 (DG3); are used in the experiments.

N_X   30            300           600           900           1200          1500
DG1   0.320±0.026   0.321±0.047   0.321±0.046   0.330±0.047   0.325±0.047   0.329±0.047
DG2   0.501±0.055   0.473±0.051   0.481±0.050   0.482±0.051   0.483±0.051   0.484±0.051
DG3   0.483±0.088   0.312±0.048   0.321±0.045   0.320±0.048   0.321±0.047   0.322±0.048

The Content of X. As Table 8 shows, in addition to FFHQ (Karras et al., 2019), we test our IDA-RD measure on another two datasets: the NPRportrait 1.0 benchmark set (Rosin et al., 2022) and AFHQ-Cat (Choi et al., 2020). For the former, we use all 60 images at around 800×1024 resolution from the NPRportrait 1.0 benchmark set as $X$, which was carefully constructed so as to include a controlled diversity of gender, age and ethnicity; for the latter, we use a random sample of 900 images at 512×512 resolution from the AFHQ-Cat dataset as $X$. We test them with 4× image downscaling. It can be observed that our conclusions hold for all datasets, which further verifies the robustness of our method to the content of $X$. Without loss of generality, we use FFHQ in our IDA-RD measure.

Table 8: Ablation study of the contents of dataset $X$ in Eq. 2. (1) Bicubic (2) Bilinear (3) Nearest Neighbor (N.N.) (4) DPID (5) Perceptual (6) L0-regularized.

4.5. MOTIVATION JUSTIFICATION

Invalidity of NR-IQA Metrics. As Table 9 shows, existing NR-IQA metrics, such as NIQE (Mittal et al., 2012b) and BRISQUE (Mittal et al., 2012a), are not suitable for the image downscaling problem, especially extreme downscaling. It can be observed that i) NIQE struggles to calculate proper scores at all resolutions below 128×128; ii) BRISQUE does not provide the correct scores at a resolution of 32×32. Please see Appendix A.5 for results on higher resolutions.

5. CONCLUSION

In this paper, we presented Image Downscaling Assessment by Rate Distortion (IDA-RD), a quantitative measure for the evaluation of image downscaling algorithms. Our measure circumvents the requirement of a ground-truth LR image by measuring the distortion in the HR space, which is enabled by the recent success of blind and stochastic super-resolution algorithms based on deep generative models. We validate our approach by testing various synthetic downscaling algorithms, simulated by adding degradations, on various datasets. We also test our measure on real-world image downscaling algorithms, which further validates the role of our measure in providing new insights into image downscaling algorithms. Please see Appendix A.6 for Limitation and Future Work.

A APPENDIX

A.1 EXAMPLES OF DOWNSCALED IMAGES USED IN OUR EXPERIMENTS

Table 11 and Table 12 show examples of images downscaled by synthetic and real-world image downscaling methods used in our experiments, respectively.

A.2 BALANCING FFHQ INTO AGE-, GENDER-, AND RACE-BALANCED SUBSETS

We partition the FFHQ dataset (Karras et al., 2019) into subsets (i.e., $X$ in Eq. 2) that are balanced in age, gender and ethnicity for a fair evaluation of our IDA-RD measure. For the gender and age labels of FFHQ images, we use those offered by the FFHQ-features-dataset (footnote 3); for the ethnicity labels of FFHQ images, we use the recognition results of DeepFace (footnote 4). According to the above, we define i) four age groups: Minors (0-18), Youth (19-36), Middle Aged (36-54) and Seniors (54+); ii) three major ethnic groups: Asian, White and Black; iii) two gender groups: Male and Female. We apply K-means to cluster FFHQ images into 24 (4×3×2) groups and select images from them evenly to generate the subsets used in our experiments. As Table 13 shows, the subsets used in our experiments are highly balanced in terms of age, gender and ethnicity.
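The even selection across the 24 attribute groups can be illustrated with a short sketch. Note the swap: instead of the K-means clustering described above, it uses plain stratified sampling on records that are assumed to already carry `age_group`, `ethnicity` and `gender` labels. The function name `balanced_subset`, the record layout, and the synthetic records are hypothetical and for illustration only.

```python
import itertools
import random
from collections import Counter, defaultdict

def balanced_subset(records, per_group, seed=0):
    """Draw up to `per_group` records from every (age, ethnicity, gender) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["age_group"], rec["ethnicity"], rec["gender"])].append(rec)
    subset = []
    for key in sorted(groups):  # sorted for reproducible ordering
        members = groups[key]
        subset.extend(rng.sample(members, min(per_group, len(members))))
    return subset

# Synthetic labelled records (illustrative only): 4 x 3 x 2 = 24 groups, 10 each.
AGES = ["Minors", "Youth", "Middle Aged", "Seniors"]
ETHNICITIES = ["Asian", "White", "Black"]
GENDERS = ["Male", "Female"]
records = [
    {"id": f"{a}-{e}-{g}-{i}", "age_group": a, "ethnicity": e, "gender": g}
    for a, e, g in itertools.product(AGES, ETHNICITIES, GENDERS)
    for i in range(10)
]

subset = balanced_subset(records, per_group=3)
counts = Counter((r["age_group"], r["ethnicity"], r["gender"]) for r in subset)
print(len(subset), "images across", len(counts), "groups")
```

Groups with fewer than `per_group` members contribute all of their records, so perfect balance in this sketch requires every group to be sufficiently populated.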

A.3 ADDITIONAL ABLATION STUDY ON D THE DISTORTION MEASURE

As a complement to Table 6 in the main paper, Table 14 shows additional results for the ablation study of $D$, which further justifies our choice of LPIPS as the distortion measure in our IDA-RD.

Table 15 shows the running times of our IDA-RD measure using PULSE and SRFlow as $f_{us}$ (Eq. 2) on an Nvidia RTX3090 GPU, respectively. It can be observed that the SRFlow implementation runs much faster, which justifies our choice of using it in our IDA-RD measure.

Table 16a and Table 16b show additional results of NIQE (Mittal et al., 2012b) and BRISQUE (Mittal et al., 2012a) at higher resolutions where the two scores work better.

A.6 LIMITATION AND FUTURE WORK

Limitations. Since our measure makes use of GAN- and Flow-based super-resolution (SR) models, the limitations of these models are carried over as well. First of all, we cannot use test data beyond the learnt distribution of the SR model. For example, unlike the SRFlow (Lugmayr et al., 2020) model used in the main paper, which is trained on general images, our GAN-based implementation uses a StyleGAN generator pre-trained on portrait images, which only allows for the use of portrait face images to evaluate downscaling algorithms. Also, although highly unlikely to occur, we cannot

Future work. Our framework still requires a ground truth HR image. However, we believe the distortion can be calculated without such a ground truth image. To further validate our IDA-RD measure, in the future we will use the meta-measure methodology (Pont-Tuset & Marques, 2013; Fan et al., 2019), in which secondary, easily quantifiable measures are constructed to quantify the performance of a less easily quantifiable measure.

A.7 ABLATION STUDY OF N_X FOR IDA-RD IMPLEMENTED WITH PULSE

As Table 17 shows, we also investigate how many images are required in the test dataset $X$ consisting of high-resolution images to achieve a robust estimation of IDA-RD implemented with PULSE (Menon et al., 2020). Similar to those in the main paper, it can be observed that the results become stable when $N_X \ge 900$, which further justifies our choice of $N_X = 900$ for IDA-RD.

Table 18 shows our IDA-RD scores on two real-world datasets: DIV2K (Agustsson & Timofte, 2017) and Flickr30k (Young et al., 2014). It can be observed that our conclusions still hold (N.N. > Perceptual > L0-regularized > DPID > Bicubic > Bilinear), which further justifies the validity of the proposed IDA-RD measure. Note that both experiments are conducted with a scaling factor of 4× as we observed that SRFlow becomes unstable for higher scaling factors (Fig. 2). For stable use of SRFlow, we intentionally used domain-specific datasets in the main paper. Note that all state-of-the-art image downscaling methods (i.e., Perceptual, L0-regularized, DPID) used in our experiments are general ones that are applicable to all domains (i.e., not tuned for specific domains).

Table 18: IDA-RD scores on two real-world datasets: DIV2K (Agustsson & Timofte, 2017) and Flickr30k (Young et al., 2014). UD: "unknown downscaled" images provided by DIV2K. A scaling factor of 4× is used for both datasets as we observed that higher scaling factors make SRFlow unstable.

A.9 ADDITIONAL ABLATION STUDY OF f_us

As Table 19 shows, we tested our IDA-RD measure with some other choices of state-of-the-art SR methods: BSRGAN (Zhang et al., 2021), RSR (Castillo et al., 2021) and Real-ESRGAN (Wang et al.). However, all these methods are blind but non-stochastic (Sec. 4.5), so they do not satisfy the requirement of our IDA-RD measure and generate less informative results. Specifically, the results of BSRGAN and Real-ESRGAN are less distinguishable among different downscaling methods; the results of RSR are slightly better but still not comparable to SRFlow.

Table 19: Additional ablation study of $f_{us}$, the image upscaling algorithms. Following Sec. A.8, we use the DIV2K dataset and a scaling factor of 4×. BSRGAN (Zhang et al., 2021), RSR (Castillo et al., 2021) and Real-ESRGAN (Wang et al.) are blind but non-stochastic SR methods (Sec. 4.5), which do not satisfy the requirement of our IDA-RD measure and generate less informative results.

A.10 QUALITATIVE EVALUATION OF EXISTING DOWNSCALING METHODS

As Fig. 3 shows, state-of-the-art image downscaling methods achieve better perceptual quality by "exaggerating" perceptually important features in the original image (e.g., building lights, water



Note that in image downscaling, such a constraint on R is always satisfied as the downscaled images are of a fixed resolution defined by users. https://github.com/andreas128/SRFlow https://github.com/DCGM/ffhq-features-dataset https://github.com/serengil/deepface



Table 1: IDA-RD scores for synthetic image downscaling with different types and levels of degradations (a), (b); with mixed degradations (c). The numbers in parentheses denote degradation parameters. As a reference, the IDA-RD score for the bicubic-downscaled image without degradation is 0.11±0.145. It is best to Zoom In to view the examples of downscaled images with different types and levels of degradations. IDA-RD scores for synthetic image downscaling methods with different scaling factors. (•): the resolution of downscaled images. Bicubic: bicubic-downscaled image without degradation. G.B.: Gaussian Blur. The 32× super-resolution is achieved by a concatenation of a 8× and a 4× upscaling implemented by pretrained SRFlow models. Scaling Factor Bicubic G.B. (σ = 1.0) G.B. (σ = 2.0) G.B. (σ = 4.0) 4× (256 × 256) 0.058±0.

Examples of images downscaled by synthetic image downscaling methods, i.e., those adds controllable degradations to bicubic-downscaled images (Sec. 4.2). The numbers below images are the degradation parameters. LR: bicubic-downscaled images, Dec.: decrease, Inc.: increase, Gauss.

Figure2: SRFlow becomes unstable for a scaling factor of 8× on real-world datasets, e.g., DIV2K (Row 1), while such cases never happen for domain-specific datasets, e.g., FFHQ (Row 2).

.008 0.011±0.008 0.024±0.022 0.013±0.011 0.025±0.018 0.011±0.008 RSR 0.231±0.071 0.208±0.095 0.423±0.132 0.288±0.099 0.379±0.123 0.231±0.071 Real-ESRGAN 0.014±0.010 0.015±0.011 0.026±0.022 0.016±0.012 0.026±0.017 0.017±0.013 reflections), thus leading to over-exaggeration in the upscaled images. As a result, they have lower IDA-RD scores than bicubic and bilinear downscaling.

Figure 3: Qualitative evaluation of existing image downscaling methods. Original: the input HR image; LR: the downscaled LR image; SR1, SR2, SR3: three instances of upscaled images; MD1, MD2, MD3: difference map visualizations of (SR1, Original), (SR2, Original), and (SR3, Original), respectively. The white numbers in the top-left corners are the corresponding LPIPS scores of the difference map visualizations. State-of-the-art image downscaling methods (DPID, Perceptual and L0-reg.) achieve better perceptual quality by "exaggerating" perceptually important features in the original image (e.g., building lights, water reflections), thus leading to over-exaggeration in the upscaled images and lower IDA-RD scores.

Despite the aforementioned works, there does not exist a good quantitative measure for the evaluation of image downscaling methods, which impedes the research on them.

IDA-RD scores for real-world image downscaling methods with different scaling factors. S.F.: Scaling Factor; the resolutions of downscaled images (e.g., 512×512 for 2×, 64×64 for 16×) are omitted for simplicity. N.N.: Nearest Neighbour. L0-reg.: L0-regularized. Note that the relatively large standard deviations in some cases (especially when the scaling factors are small) indicate the algorithmic biases of image downscaling methods against individual images, e.g., flat images with large color blocks may suffer less from information loss. The 32× super-resolution is achieved by a concatenation of an 8× and a 4× upscaling implemented by pretrained SRFlow models.
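The cascaded 32× upscaling mentioned in this caption can be illustrated with a minimal sketch. The real stages are pretrained SRFlow models; here `upscale_nn` is a hypothetical nearest-neighbour placeholder, `cascade_upscale` is an illustrative helper name, and applying the 8× stage before the 4× stage is an assumption, since the text does not state the order.

```python
def upscale_nn(img, factor):
    """Placeholder upscaler (assumption): nearest-neighbour repetition."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]


def cascade_upscale(img, factors, upscale):
    """Apply the given upscaling stages one after another."""
    for factor in factors:
        img = upscale(img, factor)
    return img


# A 32x32 LR image, i.e. a 1024x1024 HR image downscaled by 32x.
lr = [[(r + c) % 256 for c in range(32)] for r in range(32)]
sr = cascade_upscale(lr, (8, 4), upscale_nn)  # 8x stage, then 4x stage
print(len(sr), len(sr[0]))  # 1024 1024
```

The only point of the sketch is that the stage factors multiply (8 × 4 = 32); any pair of upscalers with the same call signature could be substituted for the placeholder.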

Ablation study of f_us, the image upscaling algorithm. PULSE (Menon et al., 2020) and SRFlow (Lugmayr et al., 2020) have similar results, but those of SRFlow are more distinguishable.

Ablation study of N_Q, the number of images required for a robust estimation of Q in Eq. 2.

Ablation study of D, the distortion measure in Eq. 2. Dec.: Decrease. Param.: Parameter. Please see Table 14 in Appendix A.3 for experiments with other synthetic downscaling methods.

NIQE and BRISQUE scores at different resolutions. The test image was randomly selected from the FFHQ dataset and bicubic-downscaled to different resolutions (LR). Different levels of Gaussian Blur with kernel σ = 1.0, 2.0, 4.0 were applied as synthetic image downscaling methods.

Invalidity of Non-blind and Non-stochastic SR Methods. As Table 10 shows, non-blind and non-stochastic SR methods like ESRGAN (Wang et al., 2018) and SR3 (Saharia et al., 2022) fail to distinguish among image downscaling algorithms, which justifies the choice of blind and stochastic SR methods in our IDA-RD.

Invalidity of using ESRGAN and SR3 in our IDA-RD measure.

Examples of images downscaled by real-world image downscaling methods. N.N.: Nearest Neighbour; L0-reg.: L0-regularized.

Statistics of our balanced FFHQ subsets. MI: Minors, Y: Youth, MA: Middle Aged, S: Senior; A: Asian, W: White, B: Black; M: Male, F: Female. J.E.: Joint Entropy, which measures the extent to which a subset is balanced. As a reference, a fully-balanced subset has a joint entropy of -24 × (1/24) × log_2(1/24) ≈ 4.5850.

ADDITIONAL RESULTS WITH NIQE AND BRISQUE

As a complement to Table 9 in the main paper, Table
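The joint entropy (J.E.) reference value quoted in the caption on the balanced FFHQ subsets can be checked with a short sketch. It assumes each image carries one (age, race, gender) label triple drawn from the 4 × 3 × 2 = 24 combinations listed in that caption; the function name `joint_entropy` and the toy label lists are illustrative, not taken from the authors' code.

```python
import math
from collections import Counter
from itertools import product


def joint_entropy(labels):
    """Shannon entropy (in bits) of the empirical joint label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


AGES = ("MI", "Y", "MA", "S")
RACES = ("A", "W", "B")
GENDERS = ("M", "F")

# Fully balanced toy subset: every one of the 24 combinations appears 10 times.
balanced = list(product(AGES, RACES, GENDERS)) * 10
# Skewed toy subset: one combination is heavily over-represented.
skewed = balanced + [("Y", "W", "M")] * 100

print(round(joint_entropy(balanced), 4))  # 4.585, i.e. log2(24)
print(joint_entropy(skewed) < joint_entropy(balanced))  # True
```

A subset whose joint entropy is close to log2(24) is therefore close to uniform over all 24 attribute combinations, which is the sense in which the caption uses J.E. as a balance indicator.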

Ablation study of D, the distortion measure in Eq. 2. Dec.: Decrease. Param.: Parameter.

Running times of our IDA-RD with PULSE and SRFlow as f_us (Eq. 2), respectively. N_X: the number of images in the test dataset X in Eq. 2.

Additional results of NIQE and BRISQUE at higher resolutions (lower is better).

Ablation study of N_X for IDA-RD implemented with PULSE.
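The kind of stability check behind this ablation can be sketched as a Monte Carlo estimate. The sketch assumes the measure averages a distortion D between each HR test image and several stochastic upscalings of its downscaled version; Eq. 2 itself is not reproduced here. Everything in the block is a toy stand-in: 1-D signals replace images, and `downscale_avg`, `downscale_degraded`, `stochastic_upscale` and `estimate_ida_rd` are hypothetical names, not PULSE, SRFlow, or the authors' implementation.

```python
import random
import statistics


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return statistics.fmean((p - q) ** 2 for p, q in zip(a, b))


def downscale_avg(x):
    """Toy 2x downscaler: average each pair of neighbouring samples."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def downscale_degraded(x):
    """Toy degraded downscaler: pull every LR sample halfway to the mean."""
    lr = downscale_avg(x)
    m = statistics.fmean(lr)
    return [0.5 * v + 0.5 * m for v in lr]


def stochastic_upscale(lr, rng, sigma=0.01):
    """Toy stochastic 2x upscaler: repeat samples and add Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in lr for _ in range(2)]


def estimate_ida_rd(images, downscale, upscale, distortion, n_q, rng):
    """Average distortion over all test images and n_q upscaled samples each."""
    per_image = []
    for x in images:
        lr = downscale(x)
        per_image.append(statistics.fmean(
            distortion(x, upscale(lr, rng)) for _ in range(n_q)))
    return statistics.fmean(per_image), statistics.pstdev(per_image)


rng = random.Random(0)
test_set = [[rng.random() for _ in range(16)] for _ in range(900)]

# Re-estimate the score on growing subsets to see when it stops moving.
for n_x in (100, 300, 900):
    mean, std = estimate_ida_rd(test_set[:n_x], downscale_avg,
                                stochastic_upscale, mse, n_q=5, rng=rng)
    print(f"N_X={n_x}: {mean:.4f} ± {std:.4f}")
```

Swapping `downscale_avg` for `downscale_degraded` should raise the estimated distortion, mirroring the expectation that a downscaler which discards more detail leads to more distorted reconstructions.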

