TRANSFERABLE RECOGNITION-AWARE IMAGE PROCESSING

Abstract

Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods only optimize for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we propose simple approaches to improve machine interpretability of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate transforming model. Interestingly, the processing model's ability to enhance recognition quality can transfer when evaluated on models of different architectures, recognized categories, tasks and training datasets. This makes the solutions applicable even when we do not have the knowledge of future recognition models, e.g., if we upload processed images to the Internet. We conduct experiments on multiple image processing tasks, with ImageNet classification and PASCAL VOC detection as recognition tasks. With our simple methods, substantial accuracy gain can be achieved with strong transferability and minimal image quality loss. Through a user study we further show that the accuracy gain can transfer to a black-box, third-party cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/anonymous20202020/Transferable_RA.

1. INTRODUCTION

Unlike in image recognition, where a neural network maps an image to a semantic label, a neural network used for image processing maps an input image to an output image with some desired properties. Examples include super-resolution (Dong et al., 2014), denoising (Xie et al., 2012), deblurring (Eigen et al., 2013), and colorization (Zhang et al., 2016). The goal of such systems is to produce images of high perceptual quality to a human observer. For example, in image denoising, we aim to remove noise that is not useful to an observer and restore the image to its original "clean" form. Metrics like PSNR/SSIM (Wang et al., 2004) are often used (Dong et al., 2014; Tong et al., 2017) to approximate human-perceived similarity between the processed and original images, and direct human assessment of output fidelity is often considered the "gold standard" (Ledig et al., 2017; Zhang et al., 2018b). Therefore, techniques have been proposed to make outputs look perceptually pleasing to humans (Johnson et al., 2016; Ledig et al., 2017; Isola et al., 2017). However, while looking good to humans, image processing outputs may not be accurately recognized by image recognition systems. As shown in Fig. 1, the output image of a denoising model could easily be recognized by a human as a bird, but a recognition model classifies it as a kite. One could decouple the processing and recognition tasks by training a recognition model specifically on the denoising model's output images, or by leveraging domain adaptation approaches to adapt the recognition model to this domain, but the performance on natural images can be harmed. This retraining/adaptation scheme might also be impractical considering the significant overhead of catering to various image processing tasks and models. With the fast-growing size of image data, images are often "viewed" and analyzed more by machines than by humans.
Nowadays, any image uploaded to the Internet is likely to be analyzed by certain vision systems. Therefore, it is of great importance for processed images to be recognizable not only by humans, but also by machines. In other words, recognition systems (e.g., an image classifier) should be able to accurately extract the underlying semantic meaning of the image content. In this way, images become easier to search, are recommended to more interested audiences, and so on, as these procedures are mostly executed by machines based on their understanding of the images. Therefore, we argue that image processing systems should also aim at machine recognizability. We call this problem "Recognition-Aware Image Processing". It is also important that the enhanced recognizability is not specific to any concrete recognition model, i.e., achieved only when the output images are evaluated on that particular model. Instead, the improvement should ideally transfer when evaluated on different models, to support usage without access to possible future recognition systems, for example when we upload images to the Internet or share them on social media. This is another case where we cannot separate the processing and recognition tasks by training them individually, since the recognition model is not under our control. We may not know what network architecture (e.g., ResNet or VGG) will be used for inference, what object categories the model recognizes (e.g., animals or scenes), or even what task will be performed (e.g., classification or detection). Without these specifications, it might be hard to enhance an image's machine semantics. In this work, we introduce simple yet highly effective approaches to make image processing outputs more accurately recognized by downstream recognition systems, with improvements transferable among different recognition architectures, categories, tasks and training datasets. The approaches we propose add a recognition loss optimized jointly with the image processing loss.
The recognition loss is computed using a fixed recognition model pretrained on natural images, and can be optimized without further supervision from class labels for training images. It can be optimized either directly by the original image processing network, or through an intermediate transforming network. Interestingly, the accuracy gain transfers favorably among different recognition model architectures, object categories, and recognition tasks, which renders our simple solutions effective even when we do not have access to the recognition model. Note that our contribution does not lie in designing novel network architectures or training procedures, but in making processed images more accurately recognized based on existing architectures/procedures. We view our methods' simplicity mainly as a strength, as they are easy to deploy and could serve as baselines for this new problem. We conduct extensive experiments on multiple image processing (super-resolution, denoising, JPEG deblocking) and downstream recognition (classification, detection) tasks. The results demonstrate that our methods can substantially boost recognition accuracy (e.g., up to 10% absolute, or 20% relative gain), with minimal loss in image quality. Results are also compared with alternative approaches in Appendix A. We explore the transferability phenomenon in different scenarios (architectures, categories, tasks/datasets, black-box models) and demonstrate the similarities of models' decision boundaries to give an explanation. To our best knowledge, this is the first study on the transferability of accuracy gains from processing models trained with a recognition loss. Our contributions can be summarized as:

• We propose to study the problem of enhancing the machine interpretability of image processing outputs, a desired property considering the amount of images analyzed by machines nowadays.

• We propose simple yet effective methods towards this goal, suitable for different use cases. Extensive experiments are conducted on multiple image processing and recognition tasks.

• We show that using our simple approaches, the recognition accuracy improvement can transfer among recognition architectures, categories, tasks and datasets, a desirable behavior making the proposed methods applicable without access to downstream recognition models.

• We provide a decision boundary analysis of recognition models and show their similarities, to gain a better understanding of the transferability phenomenon.

2. RELATED WORK

Image processing/enhancement problems such as super-resolution and denoising have a long history (Tsai, 1984; Park et al., 2003; Rudin et al., 1992; Candès et al., 2006).

3. METHOD

We first introduce the problem setting of "recognition-aware" image processing, and then develop several approaches, each suited for different use cases. Our methodology, although introduced in a vision context, can be extended to other domains (e.g., speech) as well.

Problem Setting. In a typical image processing problem, given a set of training input images {I_{in}^k} and corresponding target images {I_{target}^k}, we aim to train a network that maps an input to its target. Denoting this network as P (processing), parameterized by W_P, our optimization objective is:

\min_{W_P} L_{proc} = \frac{1}{N} \sum_{k=1}^{N} l_{proc}(P(I_{in}^k), I_{target}^k),   (1)

where l_{proc} is the per-sample loss function (e.g., L2). Performance is typically evaluated by similarity (e.g., PSNR/SSIM) between I_{target}^k and P(I_{in}^k), or by human assessment. In recognition-aware processing, we are additionally interested in a recognition task, with a trained recognition model R (recognition). We assume each target image I_{target}^k is associated with a semantic label S^k for the recognition task. Our goal is to train a processing model P such that the recognition performance on the output images P(I_{in}^k) is high, when evaluated using R with the semantic labels {S^k}. In practice, R might not be available (e.g., it is on the cloud), in which case we could resort to other models if the performance improvement transfers among models.

Optimizing Recognition Loss. Given that our goal is to make the output images of P more recognizable by R, it is natural to add a recognition loss on top of the image processing objective (Eqn. 1) during training:

\min_{W_P} L_{recog} = \frac{1}{N} \sum_{k=1}^{N} l_{recog}(R(P(I_{in}^k)), S^k),   (2)

where l_{recog} is the per-example recognition loss defined by the downstream recognition task; for image classification, l_{recog} could be the cross-entropy (CE) loss. Adding the image processing loss (Eqn. 1) and recognition loss (Eqn.
2) together, our total training objective becomes:

\min_{W_P} L_{proc} + \lambda L_{recog},   (3)

where λ is a coefficient controlling the weight of L_{recog} relative to L_{proc}. We denote this simple solution as "RA (Recognition-Aware) processing", visualized in Fig. 2 left. Note that once training is finished, the recognition model used as a loss is no longer needed; during inference, we only need the processing model P, so no overhead is introduced in deployment.

A potential shortcoming of directly optimizing L_{recog} is that it might deviate P from optimizing the original loss L_{proc}, so the trained P may generate images that are not as good as if we only optimized L_{proc}. We will show, however, that with a proper choice of λ, we can substantially boost recognition performance with nearly no sacrifice in image quality.

If using R as a fixed loss could only boost accuracy on R itself, the usefulness of the method would be limited, since sometimes we have no knowledge of the future downstream recognition model or even the task. Interestingly, we find that processing models trained with the loss of one recognition model R_1 can also boost performance when evaluated using another model R_2, even if R_2 has a different architecture, recognizes different categories, or performs a different task. This makes our method effective even when we cannot access the target downstream model: we can simply use another trained model as the loss function. This phenomenon also implies that the "recognizability" of a processed image can be more general than the extent to which it fits a specific model.

Unsupervised Optimization of Recognition Loss. The solution above requires semantic labels for the training images, which, however, may not always be available.
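As a concrete illustration, one training step under the joint objective (Eqn. 3) could look as follows. This is a minimal PyTorch sketch with hypothetical model handles P and R (R kept frozen as a fixed loss network), not the paper's exact training code; the unsupervised variant, which matches R's output on the target image instead of using labels, is included as the else branch.

```python
import torch
import torch.nn.functional as F

def ra_training_step(P, R, optimizer, x_in, x_target, labels=None, lam=1e-3):
    """One joint update of L_proc + lambda * L_recog on the processing model P."""
    for p in R.parameters():          # R acts purely as a fixed loss network
        p.requires_grad_(False)

    out = P(x_in)                     # processed image
    loss_proc = F.mse_loss(out, x_target)

    if labels is not None:            # supervised RA: cross-entropy on class labels
        loss_recog = F.cross_entropy(R(out), labels)
    else:                             # unsupervised RA: match R's output on the target
        with torch.no_grad():
            target_logits = R(x_target)
        loss_recog = F.kl_div(F.log_softmax(R(out), dim=1),
                              F.softmax(target_logits, dim=1),
                              reduction="batchmean")

    loss = loss_proc + lam * loss_recog
    optimizer.zero_grad()
    loss.backward()                   # gradients flow through frozen R into P only
    optimizer.step()
    return loss.item()
```

With lam around 10^-3, in line with the quality analysis reported later, the recognition term improves accuracy with little effect on PSNR/SSIM.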
When semantic labels are unavailable, we could instead regress the recognition model's output on the target image, R(I_{target}^k). The recognition objective changes to:

\min_{W_P} L_{recog} = \frac{1}{N} \sum_{k=1}^{N} l_{dis}(R(P(I_{in}^k)), R(I_{target}^k)),   (4)

where l_{dis} is a distance measure between two of R's outputs (e.g., L2 distance, KL divergence). We call this approach "unsupervised RA". Note that it is only unsupervised for training the model P, not necessarily for the model R. The (pre)training of R is not our concern, since in our problem setting R is a given trained model, which can be trained either with or without full supervision. Unsupervised RA is related to the "knowledge distillation" paradigm (2017), where the feature distance is minimized instead of the output distance. We provide a comparison in Appendix A.

Using an Intermediate Transformer. Sometimes we want to prevent the added recognition loss L_{recog} from deviating P from its original loss. We can achieve this by introducing an intermediate transformation model T: P's output is first fed to T, and T's output serves as the input to R (Fig. 2 right). T's parameters W_T are optimized for the recognition loss:

\min_{W_T} L_{recog} = \frac{1}{N} \sum_{k=1}^{N} l_{recog}(R(T(P(I_{in}^k))), S^k).   (5)

With T handling the recognition loss, the model P can "focus" on its original image processing loss L_{proc}. The optimization objective becomes:

\min_{W_P} L_{proc} + \min_{W_T} \lambda L_{recog}.   (6)

In Eqn. 6, P solely optimizes L_{proc} as in the original image processing problem (Eqn. 1). P is learned as if there were no recognition loss, and therefore the image processing quality of its output is not affected. This can be implemented by "detaching" the gradient generated by L_{recog} between models T and P (Fig. 2 right). We term this solution "RA with transformer".
Its downside compared with direct optimization using P is that there are two instances of each image (the outputs of models P and T), one "for humans" and the other "for machines". Therefore, the transformer is best suited for cases where we want to guarantee that the image processing quality is not affected at all, at the expense of maintaining another image. For example, in classifying images, we can present the higher-quality image to users for a better experience and pass the other image to the backend for accurate machine classification.
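The gradient cut in "RA with transformer" could be implemented by detaching P's output before feeding it to T, so that no recognition gradient reaches P. A minimal sketch under assumed model and optimizer handles (illustrative names, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def ra_transformer_step(P, T, R, opt_P, opt_T, x_in, x_target, labels, lam=1e-3):
    out = P(x_in)

    # P's update: image processing loss only, untouched by recognition.
    loss_proc = F.mse_loss(out, x_target)
    opt_P.zero_grad()
    loss_proc.backward()
    opt_P.step()

    # T's update: recognition loss on the *detached* output, so no gradient
    # from L_recog can flow back into P (the "cut" in Fig. 2 right).
    transformed = T(out.detach())
    loss_recog = lam * F.cross_entropy(R(transformed), labels)
    opt_T.zero_grad()
    loss_recog.backward()
    opt_T.step()
    return loss_proc.item(), loss_recog.item()
```

Because of the detach, P trains exactly as in plain processing, while T alone absorbs the recognition objective of Eqn. 5.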

4. EXPERIMENTS

We experiment on three image processing tasks: image super-resolution (SR), denoising, and JPEG deblocking (17 tasks from ImageNet-C (Hendrycks & Dietterich, 2019) in Appendix F). To obtain the input images, for SR, we downsample images by a factor of 4×; for denoising, we add Gaussian noise with 0.1 standard deviation; for JPEG deblocking, a quality factor of 10 is used for compression. We pair these three tasks with two visual recognition tasks, ImageNet classification and PASCAL VOC object detection, and adopt SRResNet (Ledig et al., 2017) as the processing model. "Plain Processing" denotes using image processing models trained without the recognition loss (Eqn. 1). We observe that plain processing can boost accuracy over unprocessed images. These two are considered as baselines. Note there exist instances where the unprocessed images are classified correctly but the plain processing outputs are classified incorrectly: for example, on super-resolution with ResNet-18 as the recognition model, such instances account for 6.14% of the validation set. However, plain image processing does help recognition accuracy on average. From Table 1, RA processing significantly boosts the accuracy on output images over plainly processed ones, for all image processing tasks and recognition models. This is more prominent when the accuracy of plain processing is lower, e.g., in SR and JPEG deblocking, where we mostly obtain ~10% accuracy gain (close to 20% in relative terms). Even without semantic labels, our unsupervised RA can still outperform the baseline methods in most cases, though it achieves lower accuracy than its supervised counterpart. Also in SR and JPEG deblocking, using an intermediate transformer T can bring additional improvement over RA processing. See Appendix D for results on detection.

Transfer between Recognition Architectures.
In reality, the model R on which we eventually evaluate the output images might not be available to use as a training loss, e.g., it could be on the cloud, kept confidential, or decided later. In this case, we could train a processing model P using a recognition model R_A (source) that is accessible to us, and after obtaining the trained P, evaluate its output images' accuracy using another, unseen R_B (target). We evaluate model architecture pairs on ImageNet in Table 2 for RA processing, where the row is the source model (R_A) and the column is the target model (R_B). In each column of Table 2, training with any model R_A produces substantially higher accuracy than plainly processed images on R_B, indicating that the improvement is transferable among recognition architectures. This phenomenon enables us to use RA processing without knowledge of the downstream recognition architecture.

Transfer between Recognition Tasks and Datasets. We evaluate task transferability when task A is classification and task B is object detection in Table 4, where rows are classification models used for the RA loss and columns are detection models used for evaluation. There is also a dataset shift, since models P and R are both trained on ImageNet; during evaluation, P is fed with VOC images and we use a VOC-trained detection model R. We observe that using the classification loss on model A (row) gives accuracy gains on model B over plain processing in most cases. Such task transferability suggests the "machine semantics" of an image could be a task-agnostic property.

Transfer to a Black-box, Third-party Cloud Model. We compare the images generated from plain processing and RA models using the "General" model at clarifai.com, a company providing state-of-the-art image classification cloud services. We have no knowledge of the model's architecture or the datasets it was trained on; we can only access the service through APIs.
The model also recognizes over 11,000 concepts that are different from the 1000-class ImageNet categories. For this experiment, we only take the output category with the maximum probability as the prediction. We use the SR processing model trained with R18/ImageNet as the RA model. We randomly sample image indices from the ImageNet validation set, and query clarifai.com for predictions on the images generated by both the plain and RA processing models. From the results, we then randomly select 100 instances where clarifai.com gives different predictions on the plain and RA images, to compose a survey for a user study. For each of the 100 instances, the survey presents the user with the target image and both prediction labels generated from the plain/RA images, in randomized left/right order. The survey asks the user to indicate which label(s), in his/her opinion, describe the image to a satisfactory level. The user has the option to choose none, either, or both labels. The survey and instructions can be found at https://tinyurl.com/y698779q. 10 volunteers participated in our survey. The resulting average satisfaction rates for plain and RA super-resolved images are 40.1% and 55.3% respectively. We achieve 15.2% absolute gain (37.9% relative gain) on the recognition satisfaction rate, indicating the strong transferability our method provides without knowledge of the black-box cloud model.

Image Processing Quality Assessment. We compare output image quality using conventional metrics (PSNR/SSIM). When using RA with transformer, the output quality of P is guaranteed to be unaffected, so here we evaluate RA processing. We use R18 as the loss on ImageNet, and report results with different λs (Eqn. 3) in Table 5. λ = 0 corresponds to plain processing. When λ = 10^-4, PSNR/SSIM are only marginally worse, yet the accuracy obtained is significantly higher. This suggests that the added recognition loss is not harmful when λ is chosen properly.
When λ is excessively large (10^-2), image quality is hurt more, and interestingly even the recognition accuracy starts to decrease, which could be due to the change of the effective learning rate. A proper balance between the processing and recognition losses is needed for both image quality and accuracy. We also measure image quality using the PieAPP metric (Prashnani et al., 2018), which places more emphasis on perceptual difference: on SR, when λ = 0 / 10^-4 / 10^-3, PieAPP (lower is better) = 1.329/1.313/1.323. Interestingly, RA processing can slightly improve perceptual quality as measured by PieAPP.
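For reference, the PSNR figures reported throughout follow the standard definition; a minimal implementation for float images in [0, 1] (an assumption of ours, not the paper's exact evaluation code):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a processed image and its target."""
    mse = np.mean((np.asarray(img, dtype=np.float64) -
                   np.asarray(ref, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```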

5. ANALYSIS AND CONCLUSION

In Fig. 3, we visualize examples where the output image is incorrectly classified with a plain processing model, but correctly recognized with RA processing. With smaller λ (10^-4 and 10^-3), the image is nearly the same as the plainly processed one. When λ is too large (10^-2), we can see some extra textures when zooming in. More results are presented in Appendix H.

Decision Boundary Analysis and Transferability. Inspired by prior works' analysis of adversarial example transferability (Liu et al., 2016; Tramèr et al., 2017), we conduct a decision boundary analysis to gain insight into RA processing's transferability. The task used is SR with ImageNet. We restrict our analysis to a single direction at a time due to the image's high dimension: given an input image x and a direction d (a unit vector with the same dimension as x), we analyze how the output of the recognition model R changes when x moves along d by an amount δ, i.e., when the input is x + δd. We define the boundary distance BD(R, x, d) as the smallest δ at which R's prediction on x + δd differs from its prediction on x. Consider a two-model scenario, with a source model R_s and a target model R_t sharing the same output categories. We define their inter-boundary distance (IBD): IBD(R_s, R_t, x, d) = |BD(R_s, x, d) - BD(R_t, x, d)|. Intuitively, if R_s(x) = R_t(x) (same prediction within the boundary), a small IBD between R_s and R_t means they have close boundaries along the direction d, since x does not need to move far beyond one model's boundary to reach the other's. In this case, changes made to x along d likely have a transferring effect from the source to the target model due to their close boundaries. We take the image x to be a plain processing output, and consider two types of directions: 1. a random direction d_r; 2. the direction obtained by subtracting the plain processing output x from the RA processing output x_s, i.e., (x_s - x)/||x_s - x||_2. The RA processing model here is trained with the source model R_s, so x_s is specific to R_s. We call this the "RA direction" (d_RA), since it points from the plain output x to the RA output x_s.
We take all validation images such that the plain processing output x produces the same wrong prediction when fed to R_s and R_t, i.e., R_s(x) = R_t(x) ≠ ground truth. For each image, we compute BD(R_s, x, d), BD(R_t, x, d) and IBD(R_s, R_t, x, d) with d being the random direction and the RA direction. We present results with R_s being R18 and R_t being R50, as other model pairs produce similar trends. The average results are shown in Table 6. We first observe that BDs are much smaller along the RA direction than the random direction. This indicates that moving along the RA direction changes the model's wrong prediction at x faster, possibly to a correct prediction. More importantly, under either the random or the RA direction, IBD is always smaller than the source/target BDs, which indicates that R_s and R_t's boundaries are relatively close, leading to a transferring effect. This result along the RA direction can explain why RA processing leads to transferable accuracy gains, since the RA loss moves the processing output x along this direction. We further visualize decision boundaries in Fig. 4 with examples (see Appendix I for more). We use R18 as the source and each of the other models as the target. We plot d_RA as the horizontal axis and d_r as the vertical axis. The origin represents the plain processing output x, and the color of point (u, v) represents the predicted class of the image x + u·d_RA + v·d_r. From the plot, we can see that different models share similar decision boundaries, and also tend to change to the same prediction once we move far enough from the origin along a direction. In both examples, we confirm that when we move in the RA direction (to the right along the horizontal axis), the first color we encounter (green for top, purple for bottom) represents the image's correct label.
This suggests the signal from RA loss (RA direction) can correct the wrong prediction with plain processing output (x at origin), and such correction is transferable given the similar decision boundaries among models. 
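The boundary distance BD and inter-boundary distance IBD used in this analysis can be estimated with a simple line search. The sketch below assumes a hypothetical predict helper that returns the recognizer's argmax class; the step granularity and search range are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def boundary_distance(predict, x, d, max_delta=10.0, steps=200):
    """Smallest step delta at which the prediction on x + delta*d changes."""
    base = predict(x)
    for delta in np.linspace(0.0, max_delta, steps)[1:]:
        if predict(x + delta * d) != base:
            return delta
    return np.inf                     # no boundary crossed within the search range

def inter_boundary_distance(predict_s, predict_t, x, d, **kw):
    """IBD(R_s, R_t, x, d) = |BD(R_s, x, d) - BD(R_t, x, d)|."""
    return abs(boundary_distance(predict_s, x, d, **kw) -
               boundary_distance(predict_t, x, d, **kw))
```

A small IBD relative to the individual BDs, as reported in Table 6, is what indicates that the two models' boundaries lie close together along the probed direction.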

Conclusion.

We investigated an important yet largely ignored problem: enhancing the machine interpretability of image processing outputs. We found that our simple approaches, optimizing with an additional recognition loss, can significantly boost recognition accuracy with minimal or no loss in image quality. Moreover, such gains can transfer to architectures, object categories, and vision tasks unseen during training, or even to a black-box cloud model, indicating the enhanced interpretability is not specific to one particular model but generalizes to others. This makes the approaches applicable even when the future downstream recognition models are unknown. Finally, we analyzed the decision boundary similarities of recognition models to explain this transferability phenomenon.

A COMPARISON WITH ALTERNATIVES

We analyze some alternatives to our approaches. Unless otherwise specified, experiments in this section are conducted using RA processing on super-resolution, with ResNet-18 trained on ImageNet as the recognition model, and λ = 10^-3 where applicable. Under this setting, we achieve 61.8% classification accuracy on the output images.

Training/Fine-tuning the Recognition Model. Instead of fixing the recognition model R, we could train/fine-tune it together with the image processing model P, to optimize the recognition loss. Many prior works (Sharma et al., 2018; Bai et al., 2018; Zhang et al., 2018a) do train/fine-tune the recognition model jointly with the image processing model. We use SGD with momentum as R's optimizer, and the final accuracy reaches 63.0%. However, since we do not fix R, it becomes a model that specifically recognizes super-resolved images, and we found its performance on the original target images drops from 69.8% to 60.5%. Moreover, when transferring the trained P to ResNet-50, the accuracy is 62.4%, worse than the 66.7% obtained when training with a fixed ResNet-18. This suggests we lose some transferability if we do not fix the recognition model R.

Training Recognition Models from Scratch. We could first train a super-resolution model, and then train R from scratch on the output images. Doing this, we achieve 66.1% accuracy, higher than the 61.8% of RA processing. However, R's accuracy on the original clean images drops from 69.8% to 66.1%. Alternatively, we could train R from scratch on interpolated low-resolution images, in which case we achieve 66.0% on interpolated validation data but only 50.2% on the original data. In summary, training/fine-tuning R to cater to super-resolved or interpolated images can harm its performance on original images, and causes additional overhead in storing models. In contrast, RA processing boosts the accuracy on output images while keeping performance on original images intact.
Training without the Image Processing Loss. It is possible to train the processing model on the recognition loss L_recog alone, without keeping the original image processing loss L_proc (Eqn. 3). This might presumably lead to better recognition performance, since the model P can now "focus on" optimizing the recognition loss. However, we found that removing the original image processing loss hurts recognition performance: the accuracy drops from 61.8% to 60.9%; even worse, the PSNR/SSIM metrics drop from 26.69/0.804 to 16.92/0.263, which is expected since the image processing loss is not optimized during training. This suggests the original image processing loss is helpful for recognition accuracy, since it helps restore the corrupted image to its original form.

Perceptual/Feature Loss. Our unsupervised RA method optimizes the distance between the recognition model's output probabilities on processed and target images. This is related to the perceptual loss (also called feature loss) used in Johnson et al. (2016); Ledig et al. (2017). The perceptual loss optimizes the distance between processed and target images in VGG feature space, and was originally proposed to improve output quality from a human observer's perspective. To compare both methods, we follow Ledig et al. (2017) to optimize the perceptual loss from VGG-16. We find the perceptual loss yields lower accuracy than unsupervised RA (56.7% vs. 61.0% on the VGG-16 recognition model). This could be because using final probabilities provides more semantic supervision, while intermediate features improve the outputs from a perceptual perspective.
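For clarity, the perceptual (feature) loss compared against here measures distance in an intermediate feature space rather than between output probabilities. A minimal sketch, where features stands in for a truncated VGG forward pass (a hypothetical handle, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def perceptual_loss(features, out, target):
    """L2 distance between feature maps of processed and target images."""
    with torch.no_grad():             # the target's features are a fixed regression goal
        f_target = features(target)
    return F.mse_loss(features(out), f_target)
```

Substituting R's output probabilities for the feature maps here recovers the unsupervised RA objective (Eqn. 4) with an L2 distance.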

B EXPERIMENTAL DETAILS

General Setup. We evaluate our proposed methods on three image processing tasks: image super-resolution, denoising, and JPEG deblocking. In these tasks, the target images are all original images from the datasets. To obtain the input images, for super-resolution, we use a downsampling scale factor of 4×; for denoising, we add Gaussian noise with a standard deviation of 0.1; for JPEG deblocking, a quality factor of 10 is used to compress the image to JPEG format. The image processing loss used is the mean squared error (MSE, or L2) loss. For the recognition tasks, we consider image classification and object detection, two common tasks in computer vision. In total, we have 6 (3 × 2) task pairs to evaluate. Training is performed on the training set and results on the validation set are reported. We adopt SRResNet (Ledig et al., 2017) as the architecture of the image processing model P, which is simple yet effective in optimizing the MSE loss. Even though SRResNet is originally designed for super-resolution, we find it also performs well on denoising and JPEG deblocking when its upscale parameter is set to 1, keeping input and output sizes the same.

The results for object detection, when evaluated on the same architecture, are shown in Table 8. We observe a similar trend as in classification: using the recognition loss consistently improves mAP over plain image processing by a notable margin. On super-resolution, RA processing mostly performs on par with RA with transformer, but on the other two tasks using a transformer is slightly better. The model with a transformer performs better more often possibly because, with this extra network in the middle, the capacity of the whole system is increased.
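The input degradations in the setup above could be generated roughly as follows (a sketch for image tensors in [0, 1]; the JPEG-deblocking input, produced by compressing at quality factor 10 with an image library, is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def make_sr_input(img):
    """Super-resolution input: 4x bicubic downsampling of a (N, C, H, W) tensor."""
    return F.interpolate(img, scale_factor=0.25, mode="bicubic",
                         align_corners=False).clamp(0, 1)

def make_denoise_input(img, sigma=0.1):
    """Denoising input: additive Gaussian noise with standard deviation sigma."""
    return (img + sigma * torch.randn_like(img)).clamp(0, 1)
```

The specific resampling mode is our assumption; the paper states only the 4× factor, the noise level of 0.1, and the JPEG quality factor of 10.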

E MORE RESULTS ON TRANSFERABILITY

We present additional results on transferability in this section. In Table 14, we evaluate RA processing on all 17 types of corruptions, at corruption level 5 as in (Hendrycks & Dietterich, 2019). We observe that RA processing brings consistent improvement over plain processing, sometimes by an even larger margin than on the tasks considered in Sec. 4. In Table 15, we experiment with different levels of the corruption types "speckle noise" and "snow". We also evaluate our variants, Unsupervised RA and RA with Transformer. We observe that when the corruption level is higher, our methods tend to bring larger recognition accuracy gains. In this setting, we note that using a Transformer can sometimes hurt accuracy compared with plain processing. This is possibly due to insufficient training data in the ImageNet-C dataset, since the transformer's additional parameters typically require more training data. In the majority of other cases, it improves slightly over RA processing.



Figure 1: Image processing aims for images that look visually pleasing to humans, not images that are accurately recognized by machines. In this work we aim to enhance the recognition accuracy on output images. Zoom in for details.

Figure 2: Left: RA (Recognition-Aware) processing. In addition to the image processing loss, we add a recognition loss computed by a fixed recognition model R, for the processing model P to optimize. Right: RA with transformer. "Recognition Loss" stands for the dashed box in the left figure. A transformer T is introduced between the output of P and the input of R to optimize the recognition loss. We cut the gradient of the recognition loss flowing to P, such that P only optimizes the image processing loss and image quality is not affected.
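The gradient cut in the right panel amounts to a single detach() call in the training step. A minimal PyTorch sketch, where P, T, a frozen R, and their optimizers are placeholders for any modules playing these roles:

```python
import torch

def ra_transformer_step(P, T, R, x, target, label, opt_P, opt_T,
                        ce=torch.nn.functional.cross_entropy):
    """One training step of RA with transformer.

    detach() blocks the recognition loss from reaching P, so P keeps
    optimizing only the image processing loss while T absorbs the
    recognition loss; R stays fixed throughout.
    """
    out = P(x)                                    # processed image
    img_loss = torch.nn.functional.mse_loss(out, target)
    rec_loss = ce(R(T(out.detach())), label)      # gradient cannot flow into P
    opt_P.zero_grad(); img_loss.backward()        # updates P from image loss only
    opt_T.zero_grad(); rec_loss.backward()        # updates T from recognition loss only
    opt_P.step(); opt_T.step()
    return img_loss.item(), rec_loss.item()
```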

Figure 3: Examples where outputs of RA processing models can be correctly classified while those from plain processing models cannot. PSNR/SSIM/class prediction is shown below each output image. Slight differences between images from plain and RA processing models can be noticed when zoomed in.

Figure 4: Different models' decision boundaries are similar, especially along the RA direction (horizontal axis).

Figure 5: Examples where output images from RA processing models can be correctly classified while those from plain processing models cannot. PSNR/SSIM/class prediction is shown below each output image. Slight differences between images from plain and RA processing models (especially with large λs) can be noticed when zoomed in.

Since the initial success of deep neural networks on these problems (Dong et al., 2014; Xie et al., 2012; Wang et al., 2016b), a large body of work has investigated better model architecture designs and training techniques (Dong et al., 2016; Kim et al., 2016b; Shi et al., 2016; Kim et al., 2016a; Lai et al., 2017; Chen et al., 2018). These works focus on generating images of high visual quality under conventional metrics (PSNR/SSIM) or human evaluation, without considering recognition performance on the output. There are also a number of works that relate image recognition to image processing. Some works (Zhang et al., 2016; Larsson et al., 2016; Zhang et al., 2018c; Sajjadi et al., 2017; Lee et al., 2019) use image recognition accuracy as an evaluation metric for image colorization/super-resolution/denoising, but do not optimize for it during training. Wang et al. (2016a); VidalMata et al. (2019); Banerjee et al. (2019) investigate how to achieve more accurate recognition on low-resolution or corrupted/noisy images, but do not resort to a recognition loss. Wang et al. (2019) propose a method to make denoised images more accurately segmented. Liu et al. (2019) introduce a theoretical framework for the classification-distortion-perception tradeoff and conduct experiments on simulated or toy datasets, while our work develops practical approaches for real-world datasets. Most existing works consider only one image processing task or image domain and develop task-specific techniques, while our simpler approach is task-agnostic and more widely applicable.

Accuracy (%) on ImageNet classification. R18 means ResNet-18, etc. The five models achieve 69.8, 76.2, 77.4, 74.7, and 73.4 on original images. RA processing techniques substantially boost recognition accuracy.

Transfer between recognition architectures. A processing model trained with a source model R_A (row) as recognition loss can improve the recognition performance on a target model R_B (column).

Transfer between different object categories (500-way accuracy %). Refer to text for details.

We observe that RA still benefits the accuracy even when transferring across categories (e.g., in SR, 60.1% to 66.5% transferring from Cat A to Cat B). The improvement is only marginally lower than directly training on the same categories (e.g., 60.2% to 67.8% on Cat B). This suggests that RA processing models do not impose category-specific signals onto the images, but rather signals that enable a wider set of classes to be better recognized.

Transfer from ImageNet classification to PASCAL VOC object detection (mAP). Note that rows are classification models and columns are detection models, so even the same name in row and column (e.g., "R18") indicates different models trained on different tasks and datasets.

PSNR/SSIM/Accuracy using different λs.

Decision boundary analysis results.


The experiments are run on 1-4 NVIDIA TITAN Xp GPUs. Training time is linear in the number of iterations trained, and the memory footprint is constant for a fixed input size. The training process finishes in 2-24 hours depending on the model size, method variant, and recognition task, and the maximum GPU memory taken is 30GB (multi-GPU) with a batch size of 20.

Object Detection. For object detection, we evaluate on the PASCAL VOC 2007 and 2012 detection datasets (https://pjreddie.com/projects/pascal-voc-dataset-mirror/), using Faster-RCNN (Ren et al., 2015) as the recognition model. Our implementation is based on the code from Yang et al. (2017). Following common practice (Redmon et al., 2016; Ren et al., 2015; Dai et al., 2016), we use VOC 07 and 12 trainval data as the training set and evaluate on VOC 07 test data. The Faster-RCNN training uses the same hyperparameters as in Yang et al. (2017). For the recognition model's backbone architecture, we evaluate ResNet-18/50/101 and VGG-16 (without BN (Ioffe & Szegedy, 2015)), obtaining mAPs of 74.2, 76.8, 77.9, and 72.2 on the test set, respectively. Given those trained detectors as recognition loss functions, we train the models on the training set for 7 epochs, with a learning rate decay of 10× at epochs 6 and 7, and a batch size of 1. We report the mean Average Precision (mAP) of processed images on the test set. As in image classification, we use λ = 10^-3 for RA processing and λ = 10^-2 for RA with transformer.

C RESULTS ON MORE ARCHITECTURES

In our main paper, we use SRResNet (Ledig et al., 2017) as our processing model P. Here we provide results with other, more recent processing models, including SRDenseNet (SRDNet) (Tong et al., 2017), Residual Dense Network (RDN) (Zhang et al., 2018d), and Deep Back-Projection Networks (DBPN) (Haris et al., 2018). We present results in Table 7, with super-resolution as the processing task, ImageNet classification as the recognition task, and R being ResNet-18.
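The detection training schedule above (7 epochs, learning rate divided by 10 at epochs 6 and 7) can be sketched as a simple step-decay rule; the function name is illustrative:

```python
def lr_at_epoch(base_lr, epoch, milestones=(6, 7), gamma=0.1):
    """Return the learning rate for a 1-indexed epoch under step decay:
    the rate is multiplied by `gamma` at each milestone epoch reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

This is the same behavior a standard multi-step scheduler provides in common deep learning frameworks.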

Accuracy (%) on ResNet-18 ImageNet classification, with more processing architectures. The processing task is super-resolution.

mAP on VOC object detection. The four models achieve 74.2, 76.8, 77.9, and 72.2 on original images.

ImageNet-C results (top-1 accuracy %) under corruption levels 1-5, for corruption types "snow" and "speckle noise" (five columns per type, levels 1 to 5).
Plain Processing    57.1 45.1 46.0 37.1 34.5    60.3 57.0 48.4 43.2 36.6
RA Processing       60.3 51.7 51.7 45.7 43.7    62.7 60.8 54.2 50.3 45.2
Unsupervised RA     60.2 51.3 50.6 43.6 41.5    62.9 60.5 53.8 49.4 43.9
RA w/ Transformer   55.7 46.7 48.1 42.7 40.9    59.0 57.7 52.2 49.2 44.7

In Table 16, we examine the transferability of RA Processing between recognition architectures, using the same two corruption types, "speckle noise" and "snow", at corruption level 5. Note that the recognition loss used during training is from a ResNet-18, and we evaluate the improvement over plain processing on ResNet-50/101, DenseNet-121, and VGG-16. We observe that the improvement over plain processing transfers among different architectures.

G EXPERIMENTS ON SEMANTIC SEGMENTATION

We experiment with the Cityscapes (Cordts et al., 2016) semantic segmentation task, using the recent HRNet-W48 (Wang et al., 2020) architecture as the recognition model. As with other tasks, we use λ = 0.001 for RA processing and SRResNet as the processing model. We train for 18 epochs; other settings are the same as in the main paper.

RA Processing on Cityscapes semantic segmentation.

The table compares the recognition accuracy (mIoU) of plain and RA models. We observe that RA processing improves the accuracy substantially, with similar image quality in terms of PSNR/SSIM.


We provide the model transferability results of RA processing on object detection in Table 9. Rows indicate the models used as recognition loss during training and columns indicate the evaluation models. We see a similar trend as in classification (Table 1): using other architectures as the loss can also improve recognition performance over plain processing; the loss model that achieves the highest performance is mostly the evaluation model itself, as seen from the fact that most boldface numbers lie on the diagonal.

In Table 10, we present the results when transferring between recognition architectures using unsupervised RA. We note that for super-resolution and JPEG deblocking, a similar trend holds as in (supervised) RA processing: using any architecture in training improves over plain processing. For denoising, however, this is not always the case: some models P trained with unsupervised RA are slightly worse than their plain processing counterparts. A possible reason is that the noise level in our experiments is not large enough, so plain processing already achieves very high accuracy.

In Table 11, we present the results of transferring between architectures when we use a transformer T. We use the processing model P and transformer T trained with R_A together when evaluating on R_B. From Table 11, the improvement is still transferable in most cases, but there are a few exceptions. For example, when R_A is a ResNet or DenseNet and R_B is VGG-16, the accuracy mostly falls behind plain processing by a large margin. This weaker transferability is possibly caused by the fact that the image processing loss imposes no constraint on T's output, so T "overfits" more to the specific R it is trained with.

E.2 TRANSFERRING BETWEEN RECOGNITION TASKS

In Section 4, we investigated the transferability of the improvement from classification to detection. Here we evaluate the opposite direction, from detection to classification. The results are shown in Table 12. Using RA processing still consistently improves over plain processing for every pair of models, but we note that the improvement is not as significant as directly training with classification models as the loss (Table 1 and Table 2). Additionally, the results when we transfer the model P trained with unsupervised RA on image classification to object detection are shown in Table 13. In most cases it improves over plain processing, but for image denoising this is not always the case. Similar to the results in Table 10, this could be because the noise level is relatively low in our experiments.

F RESULTS ON IMAGENET-C

We evaluate our methods on the ImageNet-C benchmark (Hendrycks & Dietterich, 2019), which imposes 17 different types of corruption on the ImageNet (Deng et al., 2009) validation set. Although the ImageNet-C benchmark is designed for evaluating the robustness of recognition models rather than image processing models, it is a good testbed for our methods on a broader range of processing tasks. Since only corrupted images from the validation set are released, we divide them evenly per class into two halves and train/test on the first/second half. The corrupted image is the input to the processing model and the original clean image is the target. The recognition model used in this experiment is an ImageNet-pretrained ResNet-18.

We also evaluate the transfer effect from ImageNet classification to Cityscapes semantic segmentation. We take a plain processing model and an RA processing model (with ResNet-18 as the recognition model) trained on ImageNet classification, and compare their outputs' performance on Cityscapes segmentation. As Table 18 shows, RA processing trained on ImageNet classification can help segmentation accuracy on Cityscapes as well. We also note that both accuracies are higher than when we train the processing model on Cityscapes, possibly due to the abundance of data in ImageNet (1.2 million images) compared with Cityscapes (3k images).
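The per-class half split described above can be sketched as follows; `split_per_class` is an illustrative helper, not code from our release:

```python
from collections import defaultdict

def split_per_class(samples):
    """Split (path, label) pairs evenly per class into two halves.

    Mirrors the ImageNet-C protocol in the text: for each class, the first
    half of its images goes to training and the second half to testing.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    train, test = [], []
    for label, paths in by_class.items():
        half = len(paths) // 2
        train += [(p, label) for p in paths[:half]]
        test += [(p, label) for p in paths[half:]]
    return train, test
```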

H MORE IMAGE QUALITY VISUALIZATIONS

We provide more visualizations in Fig. 5 of cases where the output image of a plain image processing model is incorrectly classified by ResNet-18 but the output of an RA processing model is correctly recognized, complementing Fig. 3 in Section 5.

I MORE DECISION BOUNDARY VISUALIZATIONS

We provide more visualizations of different models' decision boundaries in Fig. 6, complementing Fig. 4. We can see that different recognition models share similar boundaries.
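Such boundary plots are produced by probing model predictions on a 2-D plane through an input image. A generic sketch of this probing (function and argument names are illustrative; in our setting, one direction could be the RA direction, i.e., the RA output minus the plain output, and the other a random direction):

```python
import numpy as np

def decision_plane(predict, x, d1, d2, span=1.0, n=21):
    """Sample class predictions on the plane x + a*d1 + b*d2.

    `predict` maps an image array to a class id; the returned n-by-n grid of
    class ids can be plotted to show where decision boundaries lie.
    """
    coords = np.linspace(-span, span, n)
    grid = np.empty((n, n), dtype=int)
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            grid[i, j] = predict(x + a * d1 + b * d2)
    return grid
```

Overlaying such grids from several recognition models on the same plane is what reveals the similarity of their boundaries.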

