Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity

Abstract

While deep neural networks fit the training distribution well, improving their generalization to the test distribution and their robustness to input perturbations remains a challenge. Although a number of mixup-based augmentation strategies have been proposed to partially address these issues, it remains unclear how best to utilize the supervisory signal within each input for mixup from an optimization perspective. We propose a new perspective on batch mixup and formulate the optimal construction of a batch of mixup data that maximizes the data saliency measure of each individual mixup example and encourages supermodular diversity among the constructed mixup data. This leads to a novel discrete optimization problem minimizing the difference between submodular functions. We also propose an efficient iterative submodular minimization algorithm based on modular approximation, which computes the mixup for each minibatch and is therefore suitable for minibatch-based neural network training. Our experiments show that the proposed method achieves state-of-the-art generalization, calibration, and weakly supervised localization results compared to other mixup methods. The source code is available at https://github.com/snu-mllab/Co-Mixup.

1. Introduction

Deep neural networks have been applied to a wide range of artificial intelligence tasks such as computer vision, natural language processing, and signal processing with remarkable performance (Ren et al., 2015; Devlin et al., 2018; Oord et al., 2016). However, it has been shown that neural networks have excessive representation capability and can even fit random data (Zhang et al., 2016). Due to these characteristics, neural networks can easily overfit to training data and show a large generalization gap when tested on previously unseen data. To improve the generalization performance of neural networks, a body of research has developed regularizers based on priors or augmented the training data with task-dependent transforms (Bishop, 2006; Cubuk et al., 2019). Recently, a new task-independent data augmentation technique, called mixup, has been proposed (Zhang et al., 2018). The original mixup, called Input Mixup, linearly interpolates a given pair of input data and can be easily applied to various data and tasks, improving the generalization performance and robustness of neural networks. Other mixup methods, such as manifold mixup (Verma et al., 2019) or CutMix (Yun et al., 2019), have also been proposed, addressing different ways to mix a given pair of input data. Puzzle Mix (Kim et al., 2020) utilizes saliency information and local statistics to ensure that mixup data have rich supervisory signals. However, these approaches only consider mixing a given random pair of input data and do not fully utilize the rich supervisory signal in the training data, including collections of object saliency, relative arrangement, etc.
In this work, we simultaneously consider mix-matching different salient regions among all input data so that each generated mixup example accumulates as many salient regions from multiple input data as possible while ensuring diversity among the generated mixup examples.
We verify the performance of the proposed method by training classifiers on CIFAR-100, Tiny-ImageNet, ImageNet, and the Google commands dataset (Krizhevsky et al., 2009; Chrabaszcz et al., 2017; Deng et al., 2009; Warden, 2017). Our experiments show that models trained with Co-Mixup achieve state-of-the-art performance compared to other mixup baselines. In addition to the generalization experiments, we evaluate on weakly supervised object localization and robustness tasks and confirm that Co-Mixup outperforms other mixup baselines.

2. Related works

Mixup. Data augmentation has been widely used to prevent deep neural networks from overfitting to the training data (Bishop, 1995). The majority of conventional augmentation methods generate new data by applying transformations that depend on the data type or the target task (Cubuk et al., 2019). Zhang et al. (2018) proposed mixup, which can be applied independently of the data type and task, and improves the generalization and robustness of deep neural networks. Input mixup (Zhang et al., 2018) linearly interpolates between two input data and utilizes the mixed data with the corresponding soft label for training. Following this work, manifold mixup (Verma et al., 2019) applies mixup in the hidden feature space, and CutMix (Yun et al., 2019) suggests a spatial copy-and-paste based mixup strategy on images. Guo et al. (2019) train an additional neural network to optimize the mixing ratio. Puzzle Mix (Kim et al., 2020) proposes a mixup method based on saliency and local statistics of the given data. In this paper, we propose a discrete optimization-based mixup method that simultaneously finds the best combination of collections of salient regions among all input data while encouraging diversity among the generated mixup examples.

Saliency. The seminal work from Simonyan et al. (2013) generates a saliency map using a pre-trained neural network classifier without any additional training of the network. Following this work, measuring the saliency of data using neural networks has been studied to obtain more precise saliency maps (Zhao et al., 2015; Wang et al., 2015) or to reduce the saliency computation cost (Zhou et al., 2016; Selvaraju et al., 2017). Saliency information is widely applied to tasks in various domains, such as object segmentation or speech recognition (Jung and Kim, 2011; Kalinli and Narayanan, 2007).

Figure 1: Example comparison of existing mixup methods and the proposed Co-Mixup. We provide more samples in Appendix H.
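To make the pairwise baselines discussed in this section concrete, the following is a minimal sketch of Input mixup (linear interpolation of two inputs and their labels) and a CutMix-style patch mix (spatial copy and paste, with the label ratio given by the pasted area). It is an illustration under stated assumptions, not the authors' implementation and not Co-Mixup itself: the function names `input_mixup` and `cutmix`, the nested-list image representation, and the explicit box arguments are hypothetical choices made here for readability.

```python
def input_mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two 2D inputs and their one-hot labels.

    lam is the mixing ratio in [0, 1]; it is passed in explicitly here
    (an assumption made to keep the sketch deterministic).
    """
    x = [[lam * a + (1.0 - lam) * b for a, b in zip(r1, r2)]
         for r1, r2 in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def cutmix(x1, y1, x2, y2, top, left, height, width):
    """Paste a rectangular patch of x2 into a copy of x1.

    The soft label is mixed in proportion to the area kept from x1.
    Assumes 0 <= top < rows and 0 <= left < cols; the box is clipped
    at the bottom and right image borders.
    """
    rows, cols = len(x1), len(x1[0])
    x = [row[:] for row in x1]  # copy so x1 is left unmodified
    bottom = min(top + height, rows)
    right = min(left + width, cols)
    for i in range(top, bottom):
        for j in range(left, right):
            x[i][j] = x2[i][j]
    lam = 1.0 - ((bottom - top) * (right - left)) / (rows * cols)
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


if __name__ == "__main__":
    # Toy 4x4 "images": all ones (class 0) and all zeros (class 1).
    ones = [[1.0] * 4 for _ in range(4)]
    zeros = [[0.0] * 4 for _ in range(4)]
    _, soft_label = input_mixup(ones, [1.0, 0.0], zeros, [0.0, 1.0], lam=0.25)
    print(soft_label)
    _, soft_label = cutmix(ones, [1.0, 0.0], zeros, [0.0, 1.0], 0, 0, 2, 2)
    print(soft_label)
```

Both functions operate on a single given pair of inputs, which is the limitation the proposed batch-level formulation targets. In Zhang et al. (2018) the mixing ratio is sampled from a Beta distribution; in this sketch it is a fixed argument so that the output is reproducible.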

