TRANSFERABLE RECOGNITION-AWARE IMAGE PROCESSING

Abstract

Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods optimize only for human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we propose simple approaches to improve the machine interpretability of processed images: optimizing the recognition loss directly on the image processing network, or through an intermediate transforming model. Interestingly, the processing model's ability to enhance recognition transfers when evaluated on models with different architectures, recognized categories, tasks and training datasets. This makes the solutions applicable even when we do not know the future recognition models, e.g., if we upload processed images to the Internet. We conduct experiments on multiple image processing tasks, with ImageNet classification and PASCAL VOC detection as recognition tasks. With our simple methods, substantial accuracy gain can be achieved with strong transferability and minimal image quality loss. Through a user study we further show that the accuracy gain can transfer to a black-box, third-party cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/anonymous20202020/Transferable_RA.

1. INTRODUCTION

Unlike in image recognition, where a neural network maps an image to a semantic label, a neural network used for image processing maps an input image to an output image with some desired properties. Examples include super-resolution (Dong et al., 2014), denoising (Xie et al., 2012), deblurring (Eigen et al., 2013), and colorization (Zhang et al., 2016). The goal of such systems is to produce images of high perceptual quality to a human observer. For example, in image denoising, we aim to remove noise that is not useful to an observer and restore the image to its original "clean" form. Metrics like PSNR/SSIM (Wang et al., 2004) are often used (Dong et al., 2014; Tong et al., 2017) to approximate human-perceived similarity between the processed and original images, and direct human assessment of the fidelity of the output is often considered the "gold standard" (Ledig et al., 2017; Zhang et al., 2018b). Therefore, techniques have been proposed to make outputs look perceptually pleasing to humans (Johnson et al., 2016; Ledig et al., 2017; Isola et al., 2017). However, while looking good to humans, image processing outputs may not be accurately recognized by image recognition systems. As shown in Fig. 1, the output image of a denoising model can easily be recognized by a human as a bird, yet a recognition model classifies it as a kite. One could separate the image processing and recognition tasks by specifically training a recognition model on the output images produced by the denoising model to achieve better performance on such images.
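The idea of directly optimizing a recognition loss on the processing network, mentioned in the abstract, amounts to training with a combined objective: a pixel-level fidelity term plus a classifier's loss on the processed output. The sketch below illustrates this objective in plain NumPy; the function names and the weighting parameter `lam` are our own illustrative choices, not identifiers from the released code.

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for a single example.
    z = logits - logits.max()                      # stabilize the exponentials
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def recognition_aware_loss(processed, target, logits, label, lam=0.01):
    """Combined objective for a recognition-aware processing model.

    processed, target: output and ground-truth images (same shape).
    logits: a recognition model's scores on the processed image.
    lam: trade-off between perceptual fidelity and recognizability.
    """
    pixel_loss = np.mean((processed - target) ** 2)  # MSE, a PSNR surrogate
    rec_loss = cross_entropy(logits, label)          # recognition loss term
    return pixel_loss + lam * rec_loss
```

With a perfect reconstruction (`processed == target`), the objective reduces to `lam` times the recognition loss, so the recognition term only perturbs the processing model where it does not conflict strongly with fidelity; in practice both terms would be minimized jointly by backpropagating through the (frozen or jointly trained) recognition model.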



Figure 1: Image processing aims for images that look visually pleasing to humans, but not necessarily images that are accurately recognized by machines. In this work we try to enhance the recognition accuracy of output images. Zoom in for details.

