DJMIX: UNSUPERVISED TASK-AGNOSTIC AUGMENTATION FOR IMPROVING ROBUSTNESS

Abstract

Convolutional Neural Networks (CNNs) are vulnerable to unseen noise on input images at test time, so improving their robustness is crucial. In this paper, we propose DJMix, a data augmentation method that improves robustness by mixing each training image with its discretized counterpart. Discretization is performed in an unsupervised manner by an autoencoder, and the mixed images are nearly indistinguishable from the original images. Therefore, DJMix can easily be adapted to various image recognition tasks. We verify the effectiveness of our method on classification, semantic segmentation, and detection, using both clean and noisy test images.

1. INTRODUCTION

CNNs are the de facto standard components of image recognition systems and achieve excellent performance. However, CNNs are vulnerable to unseen noise on input images. Such harmful noise includes not only adversarially generated noise (Szegedy et al., 2014; Goodfellow et al., 2014), but also naturally occurring noise such as blur caused by defocusing and artifacts produced by JPEG compression (Vasiljevic et al., 2016; Hendrycks & Dietterich, 2019). Natural noise on input images is inevitable in the real world; therefore, making CNNs robust to natural noise is crucial for practitioners.

A simple approach to this problem is adding noise to training images, but this does not make models generalize to unseen corruptions and perturbations (Vasiljevic et al., 2016; Geirhos et al., 2018; Gu et al., 2019). For example, even if Gaussian noise of a certain variance is added during training, models fail to generalize to Gaussian noise of other variances. Nonetheless, some data augmentation methods are effective at improving robustness. For example, Yin et al. reported that extensive augmentation, such as AutoAugment (Cubuk et al., 2019), improves robustness. Similarly, Hendrycks et al. proposed mixing differently augmented images during training to circumvent the vulnerability. We review previous approaches further in Section 2.

Despite the effectiveness of these data augmentation and mixing approaches, they require handcrafted image transformations, such as rotation and solarizing. In particular, when geometric transformations are used, defining targets for the mixed images is no longer trivial in non-classification tasks such as semantic segmentation and detection. This lack of applicability to other tasks motivates us to introduce robust data augmentation without such transformations. In this paper, we propose Discretizing and Joint Mixing (DJMix), which mixes original and discretized training images to improve robustness.
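The mixing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the blending ratio being sampled from a Beta distribution and the `autoencoder` callable are assumptions made for the sketch.

```python
import numpy as np

def djmix_augment(x, autoencoder, alpha=1.0, rng=np.random):
    """Illustrative sketch of DJMix augmentation.

    x           : input image as an array in [0, 1]
    autoencoder : callable returning a discretized reconstruction of x
                  (assumed interface; trained without labels)
    alpha       : Beta-distribution parameter for the blending ratio
                  (an assumption; the paper may sample differently)
    """
    x_hat = autoencoder(x)           # discretized reconstruction of x
    lam = rng.beta(alpha, alpha)     # blending ratio in (0, 1)
    # Convex combination of the original and discretized images;
    # the result stays visually close to the original input.
    return lam * x + (1.0 - lam) * x_hat
```

Because the discretized reconstruction is close to the input, the mixed image keeps the same spatial layout, which is what allows the segmentation and detection targets to be reused unchanged.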
The difference between the original and discretized images is nearly imperceptible, as shown in Figure 1, which enables the use of DJMix in various image recognition tasks. In Section 3, we introduce DJMix and analyze it empirically and theoretically. We show that DJMix reduces the mutual information between inputs and internal representations, encouraging models to ignore harmful features and improving CNNs' resilience to test-time noise.

Figure 1: Schematic view of DJMix. In DJMix, CNN models are trained to minimize the divergence between the features of each input image f_θ(x) and those of its discretized image f_θ(x̂), as well as the task-specific loss between f_θ(x̂) and the label y. This pipeline barely affects the appearance of input images and thus can be used for various image recognition tasks, e.g., classification, (semantic) segmentation, and detection.

To benchmark the robustness of CNNs to unseen noise, Hendrycks & Dietterich (2019) introduced ImageNet-C as a corrupted counterpart of the ImageNet validation set (Russakovsky et al., 2015). CNN models are trained on the original training set without any prior information about the corruptions, and then evaluated on this noisy validation set. Similarly, Geirhos et al. created noisy versions of ImageNet and compared the behavior of humans and CNN models under image noise. In addition to these datasets designed for classification, we created the Segmentation-C and Detection-C datasets, which are corrupted counterparts of the PASCAL-VOC validation sets (Everingham et al., 2015). We demonstrate the robustness of CNN models trained with DJMix on various tasks using these benchmark datasets in Section 4. Additionally, we perform experimental analyses, including ablation studies, to verify DJMix in Section 5.

Our contributions are summarized as follows:

1. We introduce DJMix, a simple task-agnostic data augmentation method for improving robustness. DJMix mixes each original training image with its discretized counterpart and can be straightforwardly adapted to various image recognition tasks. We empirically demonstrate the effectiveness of this approach.

2. We analyze DJMix theoretically from the Information Bottleneck perspective, which could help analyze other robust training methods. We also investigate DJMix from the Fourier sensitivity perspective.

3. We create two datasets, Segmentation-C and Detection-C, to benchmark the robustness of CNN models on semantic segmentation and detection tasks.
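The training objective sketched in the Figure 1 caption — a task loss on the discretized image's prediction plus a divergence between the predictions for the original and discretized images — can be illustrated as follows. This is a hedged sketch: the use of KL divergence on softmax outputs, the weight `lam`, and all function names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per example, over the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def djmix_loss(logits_x, logits_xhat, y, lam=1.0):
    """Sketch of a DJMix-style objective (illustrative, not the paper's exact loss).

    logits_x    : model outputs f_theta(x) for the original images
    logits_xhat : model outputs f_theta(x_hat) for the discretized images
    y           : integer class labels
    lam         : weight of the consistency term (an assumption)
    """
    p_x = softmax(logits_x)
    p_xhat = softmax(logits_xhat)
    # Task-specific loss on the discretized image's prediction (cross-entropy).
    task = -np.log(p_xhat[np.arange(len(y)), y] + 1e-12).mean()
    # Divergence between predictions for the original and discretized images.
    consistency = kl_divergence(p_x, p_xhat).mean()
    return task + lam * consistency
```

When the two predictions agree, the consistency term vanishes and the objective reduces to the ordinary task loss; disagreement between f_θ(x) and f_θ(x̂) is penalized, which is the mechanism the Information Bottleneck analysis in Section 3 builds on.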

2. RELATED WORK

Small corruptions or perturbations of images can drastically change the predictions of CNN models. While adversarially generated noise, i.e., adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2014), can be regarded as the worst case, natural noise also harms the performance of CNNs. Such natural noise includes blur and artifacts generated by JPEG compression (Vasiljevic et al., 2016; Hendrycks & Dietterich, 2019). Because of this vulnerability, CNN models sometimes make inconsistent predictions across adjacent video frames (Gu et al., 2019; Hendrycks & Dietterich, 2019). For real-world applications of CNNs, this vulnerability needs to be overcome.

A strong defense against adversarial examples is adversarial training, in which CNN models are trained on adversarial examples (Goodfellow et al., 2014; Madry et al., 2018). Unfortunately, this approach fails against natural noise, because CNNs trained on a specific type of noise do not generalize to other types (Geirhos et al., 2018; Vasiljevic et al., 2016). Instead, we need robust methods that are agnostic to test-time noise a priori (Hendrycks & Dietterich, 2019).

Data augmentation is a practical approach to improving generalization on clean data, e.g., by randomly flipping and cropping (Krizhevsky et al., 2012; He et al., 2016), mixing different images (Zhang et al., 2018; Tokozume et al., 2018), or erasing random regions (DeVries & Taylor; Zhong et al., 2020). Some types of data augmentation are also reported to improve robustness. For example, strong data augmentation, namely AutoAugment (Cubuk et al., 2019), can improve the

