TRANSFERABLE RECOGNITION-AWARE IMAGE PROCESSING

Abstract

Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods optimize only for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we propose simple approaches to improve the machine interpretability of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate transforming model. Interestingly, the processing model's ability to enhance recognition quality transfers when evaluated on models of different architectures, recognized categories, tasks, and training datasets. This makes the solutions applicable even when we have no knowledge of future recognition models, e.g., when we upload processed images to the Internet. We conduct experiments on multiple image processing tasks, with ImageNet classification and PASCAL VOC detection as recognition tasks. With our simple methods, substantial accuracy gains can be achieved with strong transferability and minimal loss in image quality. Through a user study we further show that the accuracy gain can transfer to a black-box, third-party cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/anonymous20202020/Transferable_RA.

1. INTRODUCTION

Unlike in image recognition, where a neural network maps an image to a semantic label, a neural network used for image processing maps an input image to an output image with some desired properties. Examples include super-resolution (Dong et al., 2014), denoising (Xie et al., 2012), deblurring (Eigen et al., 2013), and colorization (Zhang et al., 2016). The goal of such systems is to produce images of high perceptual quality to a human observer. For example, in image denoising, we aim to remove noise that is not useful to an observer and restore the image to its original "clean" form. Metrics like PSNR/SSIM (Wang et al., 2004) are often used (Dong et al., 2014; Tong et al., 2017) to approximate human-perceived similarity between the processed and original images, and direct human assessment of the fidelity of the output is often considered the "gold standard" (Ledig et al., 2017; Zhang et al., 2018b). Therefore, techniques have been proposed to make outputs look perceptually pleasing to humans (Johnson et al., 2016; Ledig et al., 2017; Isola et al., 2017). However, while looking good to humans, image processing outputs may not be accurately recognized by image recognition systems. As shown in Fig. 1, the output image of a denoising model can easily be recognized by a human as a bird, yet a recognition model classifies it as a kite. One could decouple the image processing and recognition tasks by training a recognition model specifically on the outputs of the denoising model, or by leveraging domain adaptation approaches to adapt the recognition model to this domain, but performance on natural images could then suffer. Such a retraining/adaptation scheme may also be impractical given the significant overhead of catering to various image processing tasks and models. With the fast-growing size of image data, images are often "viewed" and analyzed more by machines than by humans.
Nowadays, any image uploaded to the Internet is likely to be analyzed by certain vision systems. Therefore, it is of great importance for processed images to be recognizable not only by humans but also by machines. In other words, recognition systems (e.g., image classifiers) should be able to accurately infer the underlying semantic meaning of the image content. In this way, the images become easier to search, more likely to be recommended to interested audiences, and so on, as these procedures are mostly executed by machines based on their understanding of the images. Therefore, we argue that image processing systems should also aim at machine recognizability. We call this problem "Recognition-Aware Image Processing". It is also important that the enhanced recognizability not be specific to any concrete recognition model, i.e., achieved only when the output images are evaluated on that particular model. Instead, the improvement should ideally transfer when evaluated on different models, to support usage without access to possible future recognition systems, for example when we upload images to the Internet or share them on social media. This is another case where we cannot decouple the processing and recognition tasks by training them individually, since the recognition model is not under our control. We may not know what network architecture (e.g., ResNet or VGG) will be used for inference, what object categories the model recognizes (e.g., animals or scenes), or even what task will be performed (e.g., classification or detection). Without these specifications, it is hard to enhance the images' machine recognizability. In this work, we introduce simple yet highly effective approaches to make image processing outputs more accurately recognized by downstream recognition systems, with improvements transferable across different recognition architectures, categories, tasks, and training datasets. The approaches we propose add a recognition loss optimized jointly with the image processing loss.
The recognition loss is computed using a fixed recognition model pretrained on natural images, and can be optimized without further supervision from class labels for training images. It can be optimized either directly by the original image processing network or through an intermediate transforming network. Interestingly, the accuracy gain transfers favorably among different recognition model architectures, object categories, and recognition tasks, which renders our simple solutions effective even when we do not have access to the recognition model. Note that our contribution does not lie in designing novel network architectures or training procedures, but in making processed images more accurately recognized using existing architectures/procedures. We also view our method's simplicity mainly as a strength, as it is easy to deploy and can serve as a baseline for this new problem. We conduct extensive experiments on multiple image processing tasks (super-resolution, denoising, JPEG deblocking) and downstream recognition tasks (classification, detection). The results demonstrate that our methods can substantially boost recognition accuracy (e.g., up to 10%, or 20% relative gain) with minimal loss in image quality. Results are also compared with alternative approaches in Appendix A. We explore the transferability phenomenon in different scenarios (architectures, categories, tasks/datasets, black-box models) and demonstrate similarities between models' decision boundaries to give an explanation. To the best of our knowledge, this is the first study of the transferability of accuracy gains from processing models trained with a recognition loss. Our contributions can be summarized as:

• We propose to study the problem of enhancing the machine interpretability of image processing outputs, a desired property considering the amount of images analyzed by machines nowadays.
• We propose simple yet effective methods toward this goal, suitable for different use cases. Extensive experiments are conducted on multiple image processing and recognition tasks.
• We show that with our simple approaches, the recognition accuracy improvement transfers among recognition architectures, categories, tasks, and datasets, a desirable behavior that makes the proposed methods applicable without access to downstream recognition models.
• We provide a decision boundary analysis of recognition models and show their similarities to gain a better understanding of the transferability phenomenon.
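The joint objective described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's actual implementation: the tiny networks, tensor shapes, loss choices (MSE for processing, cross-entropy for recognition), and the weight `lam` are all assumed placeholders. The key points are that the total loss combines the image processing loss with a recognition loss computed by a fixed, pretrained recognition model, and that gradients flow only into the processing network.

```python
import torch
import torch.nn as nn

class TinyProcessor(nn.Module):
    """Stand-in for an image processing network (e.g., a denoiser)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, x):
        return self.conv(x)

class TinyRecognizer(nn.Module):
    """Stand-in for a pretrained recognition model (e.g., a classifier)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(3, num_classes)
    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))

def joint_loss(processor, recognizer, x, y_img, y_cls, lam=0.1):
    """L = L_proc + lam * L_rec, computed on the processed output."""
    out = processor(x)
    l_proc = nn.functional.mse_loss(out, y_img)                   # image processing loss
    l_rec = nn.functional.cross_entropy(recognizer(out), y_cls)   # recognition loss
    return l_proc + lam * l_rec

processor = TinyProcessor()
recognizer = TinyRecognizer()
for p in recognizer.parameters():   # the recognition model stays fixed
    p.requires_grad_(False)

x = torch.randn(2, 3, 8, 8)         # degraded inputs (illustrative shape)
y_img = torch.randn(2, 3, 8, 8)     # clean target images
y_cls = torch.tensor([1, 4])        # class labels (could be pseudo-labels
                                    # from the recognizer on clean images)

loss = joint_loss(processor, recognizer, x, y_img, y_cls)
loss.backward()                     # gradients reach only the processor
```

The same objective applies unchanged when an intermediate transforming network is inserted between the processor and the recognizer; in that case only the transforming network's parameters receive gradients.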



Figure 1: Image processing aims for images that look visually pleasing to humans, but not necessarily images that are accurately recognized by machines. In this work we try to enhance the recognition accuracy of output images. Zoom in for details.

