Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity

Abstract

While deep neural networks fit the training distribution well, improving their generalization to the test distribution and their robustness to input perturbations remains a challenge. Although a number of mixup-based augmentation strategies have been proposed to partially address these issues, it remains unclear how to best utilize the supervisory signal within each input for mixup from the optimization perspective. We propose a new perspective on batch mixup and formulate the optimal construction of a batch of mixup data that maximizes the data saliency measure of each individual mixup data and encourages supermodular diversity among the constructed mixup data. This leads to a novel discrete optimization problem minimizing the difference between submodular functions. We also propose an efficient iterative submodular minimization algorithm based on a modular approximation, which computes the mixup for each minibatch efficiently enough for minibatch-based neural network training. Our experiments show that the proposed method achieves state-of-the-art generalization, calibration, and weakly supervised localization results compared to other mixup methods. The source code is available at https://github.com/snu-mllab/Co-Mixup.

1. Introduction

Deep neural networks have been applied to a wide range of artificial intelligence tasks such as computer vision, natural language processing, and signal processing with remarkable performance (Ren et al., 2015; Devlin et al., 2018; Oord et al., 2016). However, it has been shown that neural networks have excessive representation capability and can even fit random data (Zhang et al., 2016). Due to these characteristics, neural networks can easily overfit to training data and show a large generalization gap when tested on previously unseen data. To improve the generalization performance of neural networks, a body of research has proposed regularizers based on priors or augmentation of the training data with task-dependent transforms (Bishop, 2006; Cubuk et al., 2019). Recently, a new task-independent data augmentation technique, called mixup, has been proposed (Zhang et al., 2018). The original mixup, called Input Mixup, linearly interpolates a given pair of input data and can be easily applied to various data and tasks, improving the generalization performance and robustness of neural networks. Other mixup methods, such as manifold mixup (Verma et al., 2019) or CutMix (Yun et al., 2019), have also been proposed, addressing different ways to mix a given pair of input data. Puzzle Mix (Kim et al., 2020) utilizes saliency information and local statistics to ensure that mixup data have rich supervisory signals. However, these approaches only consider mixing a given random pair of input data and do not fully utilize the rich supervisory signal in the training data, including the collection of object saliency, relative arrangement, etc. In this work, we simultaneously consider mix-matching different salient regions among all input data so that each generated mixup example accumulates as many salient regions from multiple input data as possible while ensuring diversity among the generated mixup examples.
To this end, we propose a novel optimization problem that maximizes the saliency measure of each individual mixup example while encouraging diversity among them collectively. This formulation results in a novel discrete submodular-supermodular objective. We also propose a practical modular approximation method for the supermodular term and present an efficient iterative submodular minimization algorithm suitable for minibatch-based mixup for neural network training. As illustrated in Figure 1, while the proposed method, Co-Mixup, mix-matches the collection of salient regions utilizing inter-arrangements among input data, the existing methods do not consider the saliency information (Input Mixup & CutMix) or disassemble salient parts (Puzzle Mix). We verify the performance of the proposed method by training classifiers on CIFAR-100, Tiny-ImageNet, ImageNet, and the Google commands dataset (Krizhevsky et al., 2009; Chrabaszcz et al., 2017; Deng et al., 2009; Warden, 2017). Our experiments show that the models trained with Co-Mixup achieve state-of-the-art performance compared to other mixup baselines. In addition to the generalization experiment, we conduct weakly-supervised object localization and robustness tasks and confirm that Co-Mixup outperforms other mixup baselines.

2. Related works

Mixup

Data augmentation has been widely used to prevent deep neural networks from over-fitting to the training data (Bishop, 1995). The majority of conventional augmentation methods generate new data by applying transformations depending on the data type or the target task (Cubuk et al., 2019). Zhang et al. (2018) proposed mixup, which can be independently applied to various data types and tasks, and improves generalization and robustness of deep neural networks. Input mixup (Zhang et al., 2018) linearly interpolates between two input data and utilizes the mixed data with the corresponding soft label for training. Following this work, manifold mixup (Verma et al., 2019) applies the mixup in the hidden feature space, and CutMix (Yun et al., 2019) suggests a spatial copy-and-paste based mixup strategy on images. Guo et al. (2019) train an additional neural network to optimize the mixing ratio. Puzzle Mix (Kim et al., 2020) proposes a mixup method based on saliency and local statistics of the given data. In this paper, we propose a discrete optimization-based mixup method that simultaneously finds the best combination of collections of salient regions among all input data while encouraging diversity among the generated mixup examples.

Saliency

The seminal work of Simonyan et al. (2013) generates a saliency map using a pre-trained neural network classifier without any additional training of the network. Following this work, measuring the saliency of data using neural networks has been studied to obtain a more precise saliency map (Zhao et al., 2015; Wang et al., 2015) or to reduce the saliency computation cost (Zhou et al., 2016; Selvaraju et al., 2017). Saliency information is widely applied to tasks in various domains, such as object segmentation or speech recognition (Jung and Kim, 2011; Kalinli and Narayanan, 2007).

Submodular-Supermodular optimization

A submodular (supermodular) function is a set function with the diminishing (increasing) returns property (Narasimhan and Bilmes, 2005). It is known that any set function can be expressed as the sum of a submodular and a supermodular function (Lovász, 1983), called a BP function. Various problems in machine learning can be naturally formulated as BP functions (Fujishige, 2005), but minimizing a BP function is known to be NP-hard (Lovász, 1983). Therefore, approximate algorithms based on modular approximations of the submodular or supermodular terms have been developed (Iyer and Bilmes, 2012). Our formulation falls into the category of BP functions, consisting of a smoothness function within each mixed output (submodular) and a diversity function among the mixup outputs (supermodular).

3. Preliminary

Existing mixup methods return $\{h(x_1, x_{i(1)}), \ldots, h(x_m, x_{i(m)})\}$ for given input data $\{x_1, \ldots, x_m\}$, where $h : \mathcal{X} \times \mathcal{X} \to \mathcal{X}$ is a mixup function and $(i(1), \ldots, i(m))$ is a random permutation of the data indices. In the case of input mixup, $h(x, x')$ is $\lambda x + (1 - \lambda) x'$, where $\lambda \in [0, 1]$ is a random mixing ratio. Manifold mixup applies input mixup in the hidden feature space, and CutMix uses $h(x, x') = \mathbb{1}_B \odot x + (1 - \mathbb{1}_B) \odot x'$, where $\mathbb{1}_B$ is a binary rectangular-shape mask for an image $x$ and $\odot$ represents the element-wise product. Puzzle Mix defines $h(x, x')$ as $z \odot \Pi^\top x + (1 - z) \odot \Pi'^\top x'$, where $\Pi$ is a transport plan and $z$ is a discrete mask. In detail, for $x \in \mathbb{R}^n$, $\Pi \in \{0, 1\}^n$ and $z \in \mathcal{L}^n$ for $\mathcal{L} = \{ l/L \mid l = 0, 1, \ldots, L \}$.

In this work, we extend the existing mixup functions as $h : \mathcal{X}^m \to \mathcal{X}^{m'}$, which performs mixup on a collection of input data and returns another collection. Let $x_B \in \mathbb{R}^{m \times n}$ denote the batch of input data in matrix form. Then, our proposed mixup function is $h(x_B) = \big(g(z_1 \odot x_B), \ldots, g(z_{m'} \odot x_B)\big)$, where $z_j \in \mathcal{L}^{m \times n}$ for $j = 1, \ldots, m'$ with $\mathcal{L} = \{ l/L \mid l = 0, 1, \ldots, L \}$ and $g : \mathbb{R}^{m \times n} \to \mathbb{R}^n$ returns the column-wise sum of a given matrix. Note that the $k$-th column of $z_j$, denoted as $z_{j,k} \in \mathcal{L}^m$, can be interpreted as the mixing ratio among the $m$ inputs at the $k$-th location. Also, we enforce $\|z_{j,k}\|_1 = 1$ to maintain the overall statistics of the given input batch. Given the one-hot target labels $y_B \in \{0, 1\}^{m \times C}$ of the input data with $C$ classes, we generate soft target labels for mixup data as $y_B^\top \tilde{o}_j$ for $j = 1, \ldots, m'$, where $\tilde{o}_j = \frac{1}{n} \sum_{k=1}^{n} z_{j,k} \in [0, 1]^m$ represents the input source ratio of the $j$-th mixup data. We train models to estimate the soft target labels by minimizing the cross-entropy loss.
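The batch mixup function and its soft label can be illustrated with a minimal sketch. It assumes flattened inputs stored as plain Python lists and a hand-written mask with L = 2; the function names `mix_batch` and `soft_label` are hypothetical and are not taken from the released code.

```python
def mix_batch(x_batch, z):
    """Return g(z * x_B): the column-wise sum of the masked inputs.

    x_batch: m inputs, each a list of n values (the matrix x_B).
    z:       m x n mask for one output; column k holds the mixing ratios
             of the m inputs at location k and must sum to 1.
    """
    m, n = len(x_batch), len(x_batch[0])
    for k in range(n):
        col_sum = sum(z[i][k] for i in range(m))
        assert abs(col_sum - 1.0) < 1e-9, "each column of z must sum to 1"
    return [sum(z[i][k] * x_batch[i][k] for i in range(m)) for k in range(n)]


def soft_label(y_batch, z):
    """Return y_B^T o~, where o~ = (1/n) * sum_k z[:, k] is the input source ratio."""
    m, n = len(z), len(z[0])
    ratio = [sum(z[i]) / n for i in range(m)]
    num_classes = len(y_batch[0])
    return [sum(ratio[i] * y_batch[i][c] for i in range(m)) for c in range(num_classes)]


# Three inputs with four locations each, and one-hot labels over three classes.
x_batch = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0]]
y_batch = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Mask for one output with L = 2, so every ratio lies in {0, 0.5, 1}.
z = [[1.0, 0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

mixed = mix_batch(x_batch, z)   # mixed input of length n
label = soft_label(y_batch, z)  # soft target over the C classes
```

In this toy case the second location blends inputs 1 and 2 equally, and the soft label weights each class by the fraction of locations its input contributes.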

4.1. Objective

Saliency

Our main objective is to maximize the saliency measure of mixup data while maintaining the local smoothness of data, i.e., spatially nearby patches in a natural image look similar, temporally adjacent signals have similar spectrum in speech, etc. (Kim et al., 2020). As we can see from CutMix in Figure 1, disregarding saliency can give a misleading supervisory signal by generating mixup data that does not match the target soft label. While the existing mixup methods only consider the mixup between two inputs, we generalize the number of inputs $m$ to any positive integer. Note that each $k$-th location of the outputs has $m$ candidate sources from the inputs. We model the unary labeling cost as the negative value of the saliency, and denote the cost vector at the $k$-th location as $c_k \in \mathbb{R}^m$. For the saliency measure, we calculate the gradient values of the training loss with respect to the input and measure the $\ell_2$ norm of the gradient values across input channels (Simonyan et al., 2013; Kim et al., 2020). Note that this method does not require any additional architecture-dependent modules for saliency calculation. In addition to the unary cost, we encourage adjacent locations to have similar labels for the smoothness of each mixup data. In summary, the objective can be formulated as follows:

$$\sum_{j=1}^{m'} \sum_{k=1}^{n} c_k^\top z_{j,k} + \beta \sum_{j=1}^{m'} \sum_{(k,k') \in \mathcal{N}} \left(1 - z_{j,k}^\top z_{j,k'}\right) - \eta \sum_{j=1}^{m'} \sum_{k=1}^{n} \log p(z_{j,k}),$$

where the prior $p$ is given by $z_{j,k} \sim \frac{1}{L} \mathrm{Multi}(L, \lambda)$ with $\lambda = (\lambda_1, \ldots, \lambda_m) \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha)$, which is a generalization of the mixing ratio distribution of Zhang et al. (2018), and $\mathcal{N}$ denotes a set of adjacent locations (i.e., neighboring image patches in vision, subsequent spectrums in speech, etc.).

Figure 2 (caption, partially recovered): As $\tau$ increases, more inputs are used to create each output on average. (c) Mean batch saliency measurement of a batch of mixup data using the ImageNet dataset. We normalize the saliency measure of each input to sum up to 1. (d) Diversity measurement of a batch of mixup data. We calculate the diversity as $1 - \sum_{j} \sum_{j' \neq j} \tilde{o}_j^\top \tilde{o}_{j'} / m$, where $\tilde{o}_j = o_j / \|o_j\|_1$. We can control the diversity among Co-Mixup data (red) and find the optimum by controlling $\tau$.

Diversity

Note that the naive generalization above leads to identical outputs because the objective is separable and identical for each output. In order to obtain diverse mixup outputs, we model a similarity penalty between outputs. First, we represent the input source information of the $j$-th output by aggregating assigned labels as $\sum_{k=1}^{n} z_{j,k}$. For simplicity, let us denote $\sum_{k=1}^{n} z_{j,k}$ as $o_j$. Then, we measure the similarity between the $o_j$'s by using the inner product on $\mathbb{R}^m$. In addition to the input source similarity between outputs, we model the compatibility between input sources, represented as a symmetric matrix $A_c \in \mathbb{R}_+^{m \times m}$. Specifically, $A_c[i_1, i_2]$ quantifies the degree to which inputs $i_1$ and $i_2$ are suitable to be mixed together. In summary, we use the inner product on $A = (1 - \omega) I + \omega A_c$ for $\omega \in [0, 1]$, resulting in a supermodular penalty term. Note that, by minimizing $\langle o_j, o_{j'} \rangle_A = o_j^\top A o_{j'}$, $\forall j \neq j'$, we penalize output mixup examples with similar input sources and encourage each individual mixup example to have high compatibility within. In this work, we measure the distance between locations of salient objects in each input and use the distance matrix $A_c[i, j] = \|\arg\max_k s_i[k] - \arg\max_k s_j[k]\|_1$, where $s_i$ is the saliency map of the $i$-th input and $k$ is a location index (e.g., $k$ is a 2-D index for image data). From now on, we denote this inner-product term as the compatibility term.

Over-penalization

The conventional mixup methods generate as many mixup examples as there are examples in a given mini-batch. In our setting, this is the case when $m = m'$. However, the compatibility penalty between outputs is influenced by the pigeonhole principle. For example, suppose the first output consists of two inputs. Then, the inputs must be used again for the remaining $m' - 1$ outputs, or only $m - 2$ inputs can be used. In the latter case, the number of available inputs ($m - 2$) is less than the number of outputs ($m' - 1$), and thus, the same input must be used more than twice. Empirically, we found that the compatibility term above over-penalizes the optimization so that a substantial portion of outputs are returned as singletons without any mixup. To mitigate the over-penalization issue, we apply clipping to the compatibility penalty term. Specifically, we model the objective so that no extra penalty occurs when the compatibility among outputs is below a certain level.

Now we present our main objective as follows:

$$z^* = \operatorname*{argmin}_{z_{j,k} \in \mathcal{L}^m,\ \|z_{j,k}\|_1 = 1} f(z),$$

where

$$f(z) := \sum_{j=1}^{m'} \sum_{k=1}^{n} c_k^\top z_{j,k} + \beta \sum_{j=1}^{m'} \sum_{(k,k') \in \mathcal{N}} \left(1 - z_{j,k}^\top z_{j,k'}\right) + \gamma \underbrace{\max\left\{\tau,\ \sum_{j=1}^{m'} \sum_{j' \neq j}^{m'} \Big(\sum_{k=1}^{n} z_{j,k}\Big)^{\!\top} A \Big(\sum_{k=1}^{n} z_{j',k}\Big)\right\}}_{= f_c(z)} - \eta \sum_{j=1}^{m'} \sum_{k=1}^{n} \log p(z_{j,k}). \tag{1}$$

In Figure 2, we describe the properties of the BP optimization problem of Equation (1) and statistics of the resulting mixup data. Next, we verify the supermodularity of the compatibility term. We first extend the definition of the submodularity of a multi-label function as follows (Windheuser et al., 2012).

Definition 1. For a given label set $\mathcal{L}$, a function $s : \mathcal{L}^m \times \mathcal{L}^m \to \mathbb{R}$ is pairwise submodular if, $\forall x, x' \in \mathcal{L}^m$, $s(x, x) + s(x', x') \leq s(x, x') + s(x', x)$. A function $s$ is pairwise supermodular if $-s$ is pairwise submodular.

Proposition 1. The compatibility term $f_c$ in Equation (1) is pairwise supermodular for every pair of $(z_{j_1,k}, z_{j_2,k})$ if $A$ is positive semi-definite.

Proof. See Appendix B.1.

Finally, note that $A = (1 - \omega) I + \omega A_c$, where $A_c$ is a symmetric matrix. By using spectral decomposition, $A_c$ can be represented as $U D U^\top$, where $D$ is a diagonal matrix and $U^\top U = U U^\top = I$. Then, $A = U\big((1 - \omega) I + \omega D\big) U^\top$, and thus for small $\omega > 0$, we can guarantee $A$ to be positive semi-definite.
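To make the roles of the terms in Equation (1) concrete, the following minimal sketch evaluates the unary, smoothness, and clipped compatibility terms on a toy problem with two inputs, two outputs, and two locations. It assumes plain Python lists, omits the prior term, takes $A = I$, and uses illustrative values for the weights; the function names are hypothetical and not taken from the released code.

```python
def aggregate(z_j):
    """o_j = sum_k z_{j,k}: total mass each input contributes to output j."""
    return [sum(row) for row in z_j]


def quad(u, A, v):
    """Return u^T A v for vectors u, v and a square matrix A."""
    return sum(u[a] * A[a][b] * v[b] for a in range(len(u)) for b in range(len(v)))


def objective(z, cost, A, neighbors, beta, gamma, tau):
    """Unary + smoothness + clipped compatibility terms of Equation (1).

    z:         list of m' masks, indexed as z[j][i][k] (output, input, location).
    cost:      m x n list, cost[i][k] = negative saliency of input i at location k.
    neighbors: list of (k, k') pairs of adjacent locations.
    The prior term is omitted in this sketch.
    """
    m, n = len(cost), len(cost[0])
    unary = sum(cost[i][k] * z_j[i][k]
                for z_j in z for i in range(m) for k in range(n))
    smooth = sum(1.0 - sum(z_j[i][k] * z_j[i][kp] for i in range(m))
                 for z_j in z for (k, kp) in neighbors)
    o = [aggregate(z_j) for z_j in z]
    compat = sum(quad(o[j], A, o[jp])
                 for j in range(len(z)) for jp in range(len(z)) if jp != j)
    return unary + beta * smooth + gamma * max(tau, compat)


# Input 0 is salient at location 0, input 1 at location 1 (saliency sums to 1).
cost = [[-0.7, -0.3], [-0.2, -0.8]]
identity = [[1.0, 0.0], [0.0, 1.0]]
neighbors = [(0, 1)]

# Both outputs greedily take the most salient input at every location.
z_same = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
# Each output keeps a different input intact.
z_diverse = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]

f_same = objective(z_same, cost, identity, neighbors, beta=0.32, gamma=1.0, tau=0.5)
f_diverse = objective(z_diverse, cost, identity, neighbors, beta=0.32, gamma=1.0, tau=0.5)
```

With these illustrative weights, the two identical saliency-greedy outputs pay a large compatibility penalty, so the diverse assignment attains the lower objective value even though its unary cost is higher.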

4.2. Algorithm

Our main objective consists of modular (unary, prior), submodular (smoothness), and supermodular (compatibility) terms. To optimize the main objective, we employ the submodular-supermodular procedure by iteratively approximating the supermodular term as a modular function (Narasimhan and Bilmes, 2005). Note that $z_j$ represents the labeling of the $j$-th output and $o_j$ represents the aggregated input source information of the $j$-th output, $\sum_{k=1}^{n} z_{j,k}$. Before introducing our algorithm, we first inspect the simpler case without clipping.

Proposition 2. The compatibility term $f_c$ without clipping is modular with respect to $z_j$.

Proof. Note that $A$ is a positive symmetric matrix by definition. Then, for an index $j_0$, we can represent $f_c$ without clipping in terms of $o_{j_0}$ as

$$\sum_{j=1}^{m'} \sum_{j'=1, j' \neq j}^{m'} o_j^\top A o_{j'} = 2 \sum_{j=1, j \neq j_0}^{m'} o_j^\top A o_{j_0} + \sum_{j=1, j \neq j_0}^{m'} \sum_{j'=1, j' \notin \{j_0, j\}}^{m'} o_j^\top A o_{j'} = \Big(2 \sum_{j=1, j \neq j_0}^{m'} A o_j\Big)^{\!\top} o_{j_0} + c = v_{-j_0}^\top o_{j_0} + c,$$

where $v_{-j_0} \in \mathbb{R}^m$ and $c \in \mathbb{R}$ are values independent of $o_{j_0}$. Finally, $v_{-j_0}^\top o_{j_0} + c = \sum_{k=1}^{n} v_{-j_0}^\top z_{j_0,k} + c$ is a modular function of $z_{j_0}$.

By Proposition 2, we can apply a submodular minimization algorithm to optimize the objective with respect to $z_j$ when there is no clipping. Thus, we can optimize the main objective without clipping in a coordinate descent fashion (Wright, 2015). For the case with clipping, we modularize the supermodular compatibility term under the following criteria:

1. The modularized function value should increase as the compatibility across outputs increases.
2. The modularized function should not apply an extra penalty for compatibility below a certain level.

Borrowing the notation from the proof of Proposition 2, for an index $j$, $f_c(z) = \max\{\tau, v_{-j}^\top o_j + c\} = \max\{\tau - c, v_{-j}^\top o_j\} + c$. Note that $o_j = \sum_{k=1}^{n} z_{j,k}$ represents the input source information of the $j$-th output and $v_{-j} = 2 \sum_{j'=1, j' \neq j}^{m'} A o_{j'}$ encodes the status of the other outputs. Thus, we can interpret the supermodular term as a penalization of each label of $o_j$ in proportion to the corresponding $v_{-j}$ value (criterion 1), but not for compatibility below $\tau - c$ (criterion 2). As a modular function which satisfies the criteria above, we use the following function:

$$f_c(z) \approx \max\{\tau', v_{-j}\}^\top o_j \quad \text{for some } \tau' \in \mathbb{R}, \tag{2}$$

where the maximum is taken element-wise. Note that, by satisfying the criteria above, the modular function reflects the diversity and over-penalization desiderata described in Section 4.1. We illustrate the proposed mixup procedure with the modularized diversity penalty in Figure 3.

Proposition 3. The modularization given by Equation (2) satisfies the criteria above.

Proof. See Appendix B.2.
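The decomposition in Proposition 2 and the modular surrogate of Equation (2) can be checked numerically on a small example. The sketch below assumes plain Python lists and an arbitrary symmetric matrix chosen only for illustration; the helper names are hypothetical.

```python
def mat_vec(A, v):
    """Return the matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Return the inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def v_minus(o, A, j):
    """v_{-j} = 2 * sum_{j' != j} A o_{j'}, the status of the other outputs."""
    total = [0.0] * len(A)
    for jp, o_jp in enumerate(o):
        if jp != j:
            total = [t + 2.0 * a for t, a in zip(total, mat_vec(A, o_jp))]
    return total


def compatibility(o, A):
    """Unclipped compatibility: sum over ordered pairs j != j' of o_j^T A o_{j'}."""
    return sum(dot(o[j], mat_vec(A, o[jp]))
               for j in range(len(o)) for jp in range(len(o)) if jp != j)


def modular_penalty(o, A, j, tau_prime):
    """Equation (2): max{tau', v_{-j}}^T o_j with an element-wise maximum."""
    clipped = [max(tau_prime, value) for value in v_minus(o, A, j)]
    return dot(clipped, o[j])


# Three inputs, three outputs, n = 2 locations (each o_j sums to 2).
A = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]]
o = [[2, 0, 0], [1, 1, 0], [0, 0, 2]]

full = compatibility(o, A)
# Proposition 2: full = v_{-0}^T o_0 + c, where c involves only the other outputs.
linear_part = dot(v_minus(o, A, 0), o[0])
constant = compatibility(o[1:], A)
penalty = modular_penalty(o, A, 0, tau_prime=3.0)
```

Here `linear_part + constant` reproduces `full`, and raising `tau_prime` above an entry of $v_{-j}$ replaces that entry by the constant level, so labels with low compatibility receive no extra penalty relative to each other.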

Algorithm 1 Iterative submodular minimization

Initialize $z$ as $z^{(0)}$.
Let $z^{(t)}$ denote a solution of the $t$-th step.
$\Phi$: modularization operator based on Equation (2).
for $t = 1, \ldots, T$ do
    for $j = 1, \ldots, m'$ do
        $f_j^{(t)}(z_j) := f(z_j; z_{1:j-1}^{(t)}, z_{j+1:m'}^{(t-1)})$.
        $\tilde{f}_j^{(t)} = \Phi(f_j^{(t)})$.
        Solve $z_j^{(t)} = \operatorname{argmin} \tilde{f}_j^{(t)}(z_j)$.
    end for
end for
return $z^{(T)}$

By applying the modular approximation described in Equation (2) to $f_c$ in Equation (1), we can iteratively apply a submodular minimization algorithm to obtain the final solution as described in Algorithm 1. In detail, each step can be performed as follows:

1) Conditioning the main objective $f$ on the current values except $z_j$, denoted as $f_j(z_j) = f(z_j; z_{1:j-1}, z_{j+1:m'})$.
2) Modularization of the compatibility term of $f_j$ as in Equation (2), resulting in a submodular function $\tilde{f}_j$. We denote the modularization operator as $\Phi$, i.e., $\tilde{f}_j = \Phi(f_j)$.
3) Applying a submodular minimization algorithm to $\tilde{f}_j$.

Please refer to Appendix C for implementation details.

Analysis

Narasimhan and Bilmes (2005) proposed a modularization strategy for general supermodular set functions, and applied a submodular minimization algorithm that can monotonically decrease the original BP objective. However, the proposed Algorithm 1 based on Equation (2) is much more suitable for minibatch-based mixup for neural network training than the set modularization proposed by Narasimhan and Bilmes (2005) in terms of complexity and modularization variance due to randomness. For simplicity, let us assume each $z_{j,k}$ is an $m$-dimensional one-hot vector. Then, our problem is to optimize $m'n$ one-hot $m$-dimensional vectors. To apply the set modularization method, we need to assign each possible value of $z_{j,k}$ as an element of $\{1, 2, \ldots, m\}$. Then the supermodular term in Equation (1) can be interpreted as a set function with $m'nm$ elements, and to apply the set modularization, $O(m'nm)$ sequential evaluations of the supermodular term are required. In contrast, Algorithm 1 calculates $v_{-j}$ in Equation (2) in only $O(m')$ time per iteration. In addition, each modularization step of the set modularization method requires a random permutation of the $m'nm$ elements. In this case, the optimization can be strongly affected by the randomness from the permutation step. As a result, the optimal labeling of each $z_{j,k}$ from the compatibility term is strongly influenced by the random ordering, undermining the interpretability of the algorithm. Please refer to Appendix D for an empirical comparison between Algorithm 1 and the method of Narasimhan and Bilmes (2005).
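The loop structure of Algorithm 1 can be illustrated on a deliberately simplified problem. The sketch below assumes one-hot labels and drops the smoothness and prior terms ($\beta = \eta = 0$), so that each modularized subproblem separates over locations and is solved by a per-location argmin rather than by a graph-cut solver. The function name and the toy values are hypothetical.

```python
def co_mixup_sketch(cost, A, num_outputs, gamma, tau_prime, num_steps, init):
    """Coordinate descent over outputs with the modular penalty of Equation (2).

    cost[i][k]:   negative saliency of input i at location k.
    init[j][k]:   index of the input initially assigned to location k of output j.
    Returns labels[j][k], the input chosen at location k of output j.
    Simplification: no smoothness or prior term, one-hot labels only.
    """
    m, n = len(cost), len(cost[0])
    labels = [list(row) for row in init]
    for _ in range(num_steps):
        for j in range(num_outputs):
            # v_{-j} = 2 * sum_{j' != j} A o_{j'}, using the latest labels of
            # outputs already updated in this step and the old ones otherwise.
            v = [0.0] * m
            for jp in range(num_outputs):
                if jp == j:
                    continue
                o_jp = [labels[jp].count(i) for i in range(m)]
                for a in range(m):
                    v[a] += 2.0 * sum(A[a][b] * o_jp[b] for b in range(m))
            penalty = [gamma * max(tau_prime, value) for value in v]
            # The modularized objective is a sum over locations, so each
            # location independently picks the input with the lowest cost.
            labels[j] = [min(range(m), key=lambda i: cost[i][k] + penalty[i])
                         for k in range(n)]
    return labels


# Three inputs over four locations; saliency of each input sums to 1.
cost = [[-0.4, -0.4, -0.1, -0.1],
        [-0.1, -0.1, -0.4, -0.4],
        [-0.25, -0.25, -0.25, -0.25]]
identity = [[1.0 if a == b else 0.0 for b in range(3)] for a in range(3)]
init = [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]

unclipped = co_mixup_sketch(cost, identity, 3, gamma=0.1, tau_prime=0.0,
                            num_steps=2, init=init)
clipped = co_mixup_sketch(cost, identity, 3, gamma=0.1, tau_prime=6.0,
                          num_steps=2, init=init)
```

In this toy setting the unclipped penalty keeps every output a singleton, while the clipped penalty lets two outputs combine the salient halves of inputs 0 and 1, which mirrors the over-penalization discussion in Section 4.1.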

5. Experiments

We evaluate our proposed mixup method on generalization, weakly supervised object localization, calibration, and robustness tasks. First, we compare the generalization performance of the proposed method against baselines by training classifiers on CIFAR-100 (Krizhevsky et al., 2009) , Tiny-ImageNet (Chrabaszcz et al., 2017) , ImageNet (Deng et al., 2009) , and the Google commands speech dataset (Warden, 2017) . Next, we test the localization performance of classifiers following the evaluation protocol of Qin and Kim (2019) . We also measure calibration error (Guo et al., 2017) of classifiers to verify Co-Mixup successfully alleviates the over-confidence issue by Zhang et al. (2018) . In Section 5.4, we evaluate the robustness of the classifiers on the test dataset with background corruption in response to the recent problem raised by Lee et al. (2020) that deep neural network agents often fail to generalize to unseen environments. Finally, we perform a sensitivity analysis of Co-Mixup and provide the results in Appendix F.3.

5.1. Classification

We first train PreActResNet18 (He et al., 2016) , WRN16-8 (Zagoruyko and Komodakis, 2016) , and ResNeXt29-4-24 (Xie et al., 2017) on CIFAR-100 for 300 epochs. We use stochastic gradient descent with an initial learning rate of 0.2 decayed by factor 0.1 at epochs 100 and 200. We set the momentum as 0.9 and add a weight decay of 0.0001. With this setup, we train a vanilla classifier and reproduce the mixup baselines (Zhang et al., 2018; Verma et al., 2019; Yun et al., 2019; Kim et al., 2020) , which we denote as Vanilla, Input, Manifold, CutMix, Puzzle Mix in the experiment tables. Note that we use identical hyperparameters regarding Co-Mixup over all of the experiments with different models and datasets, which are provided in Appendix E. We further test Co-Mixup on other datasets; Tiny-ImageNet, ImageNet, and the Google commands dataset (Table 1 ). For Tiny-ImageNet, we train PreActResNet18 for 1200 epochs following the training protocol of Kim et al. (2020) . As a result, Co-Mixup consistently improves Top-1 error rate over baselines by 0.67%. In the ImageNet experiment, we follow the experimental protocol provided in Puzzle Mix (Kim et al., 2020) , which trains ResNet-50 (He et al., 2015) for 100 epochs. As a result, Co-Mixup outperforms all of the baselines in Top-1 error rate. We further test Co-Mixup on the speech domain with the Google commands dataset and VGG-11 (Simonyan and Zisserman, 2014) . We provide a detailed experimental setting and dataset description in Appendix F.1. From Table 1 , we confirm that Co-Mixup is the most effective in the speech domain as well.
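The CIFAR-100 learning-rate schedule described above can be written as a small step function. This is a sketch that assumes zero-indexed epochs and that the decay takes effect at the start of epochs 100 and 200; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.2, milestones=(100, 200), factor=0.1):
    """Step schedule: multiply the rate by `factor` for every milestone reached."""
    num_decays = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** num_decays


# Learning rate at the first epoch and right after each decay point.
schedule = [learning_rate(epoch) for epoch in (0, 100, 200)]
```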

5.2. Localization

We compare weakly supervised object localization (WSOL) performance of classifiers trained on ImageNet (in Table 1 ) to demonstrate that our mixup method better guides a classifier to focus on salient regions. We test the localization performance using CAM (Zhou et al., 2016) , a WSOL method using a pre-trained classifier. We evaluate localization performance following the evaluation protocol in Qin and Kim (2019) , with binarization threshold 0.25 in CAM. Table 2 summarizes the WSOL performance of various mixup methods, which shows that our proposed mixup method outperforms other baselines.
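As a rough illustration of how a class activation map is turned into a localization, the sketch below binarizes a toy map and scores the resulting box against a ground-truth box with intersection over union. It assumes that the threshold 0.25 is applied relative to the maximum activation and that the box tightly encloses all cells above the threshold; the exact procedure of the referenced evaluation protocol may differ, and the function names are hypothetical.

```python
def cam_to_box(cam, threshold=0.25):
    """Tight box (row_min, col_min, row_max, col_max) around cells whose
    activation is at least `threshold` times the maximum activation."""
    peak = max(max(row) for row in cam)
    cells = [(r, c) for r, row in enumerate(cam) for c, value in enumerate(row)
             if value >= threshold * peak]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


def iou(box_a, box_b):
    """Intersection over union of two boxes with inclusive integer corners."""
    def area(box):
        return max(0, box[2] - box[0] + 1) * max(0, box[3] - box[1] + 1)

    inter = (max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
             min(box_a[2], box_b[2]), min(box_a[3], box_b[3]))
    inter_area = area(inter)
    return inter_area / (area(box_a) + area(box_b) - inter_area)


cam = [[0.0, 0.1, 0.0, 0.0],
       [0.1, 0.9, 0.8, 0.0],
       [0.0, 0.7, 1.0, 0.1],
       [0.0, 0.0, 0.2, 0.0]]
predicted = cam_to_box(cam)
score = iou(predicted, (1, 1, 3, 2))  # hypothetical ground-truth box
```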

5.3. Calibration

We evaluate the expected calibration error (ECE) (Guo et al., 2017) of classifiers trained on CIFAR-100. Note that ECE is calculated as the weighted average of the absolute difference between the confidence and accuracy of a classifier.

Table 2: WSOL results on ImageNet and ECE (%) measurements of CIFAR-100 classifiers. Recovered values (column headers not recoverable): 3.9, 17.7, 13.1, 5.6, 7.5, 1.9.
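The ECE definition above, a weighted average of the gap between confidence and accuracy, can be sketched as follows. The sketch assumes equal-width confidence bins and 10 bins; the function name and the toy predictions are hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average over bins of |mean confidence - accuracy|.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes to the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Two over-confident predictions and two under-confident ones.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 0, 1, 1])
```

In the toy case, the high-confidence bin is over-confident (accuracy 0.5 at confidence 0.95) and the mid-confidence bin is under-confident (accuracy 1.0 at confidence 0.55), and both gaps contribute equally.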

5.4. Robustness

In response to the recent problem raised by Lee et al. (2020) that deep neural network agents often fail to generalize to unseen environments, we consider the situation where the statistics of the foreground object, such as color or shape, are unchanged, but the background is corrupted (or replaced). In detail, we consider the following operations: 1) replacement with another image and 2) adding Gaussian noise. We use ground-truth bounding boxes to separate the foreground from the background, and then apply the previous operations independently to obtain test datasets. We provide a detailed description of the datasets in Appendix G. With the test datasets described above, we evaluate the robustness of the pre-trained classifiers. As shown in Table 3, Co-Mixup shows significant performance gains on the various background corruption tests compared to the other mixup baselines. For each corruption case, the classifier trained with Co-Mixup outperforms the others in Top-1 error rate, with performance margins of 2.86% and 3.33% over the Vanilla model.
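The two background operations can be sketched on a toy single-channel image. The sketch assumes an inclusive bounding box given as (top, left, bottom, right) and a fixed random seed; the function name and the noise level are hypothetical and not taken from the paper.

```python
import random


def corrupt_background(image, box, replacement=None, noise_std=0.0, seed=0):
    """Keep pixels inside `box` and corrupt everything outside it.

    image:       H x W list of floats.
    box:         (top, left, bottom, right), inclusive, from the ground truth.
    replacement: if given, an H x W image whose pixels replace the background;
                 otherwise Gaussian noise with std `noise_std` is added.
    """
    rng = random.Random(seed)
    top, left, bottom, right = box
    corrupted = []
    for r, row in enumerate(image):
        new_row = []
        for c, value in enumerate(row):
            if top <= r <= bottom and left <= c <= right:
                new_row.append(value)               # foreground is untouched
            elif replacement is not None:
                new_row.append(replacement[r][c])   # operation 1: replace
            else:
                new_row.append(value + rng.gauss(0.0, noise_std))  # operation 2: noise
        corrupted.append(new_row)
    return corrupted


image = [[1.0] * 3 for _ in range(3)]
other = [[0.0] * 3 for _ in range(3)]
replaced = corrupt_background(image, (1, 1, 1, 1), replacement=other)
noisy = corrupt_background(image, (1, 1, 1, 1), noise_std=0.5)
```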

5.5. Baselines with multiple inputs

To further investigate the effect of the number of inputs for the mixup in isolation, we conduct an ablation study on baselines using multiple mixing inputs. For fair comparison, we use a Dirichlet(α, . . . , α) prior for the mixing ratio distribution and select the best performing α in {0.2, 1.0, 2.0}. Note that we overlay multiple boxes in the case of CutMix.

A Supplementary notes for objective

A.1 Notations

In Table 5, we provide a summary of notations in the main text.

Table 5: A summary of notations.

Notation                                                      Meaning
$m$, $m'$, $n$                                                # inputs, # outputs, dimension of data
$c_k \in \mathbb{R}^m$ ($1 \leq k \leq n$)                    labeling cost for $m$ input sources at the $k$-th location
$z_{j,k} \in \mathcal{L}^m$ ($1 \leq j \leq m'$, $1 \leq k \leq n$)   input source ratio at the $k$-th location of the $j$-th output
$z_j \in \mathcal{L}^{m \times n}$                            labeling of the $j$-th output
$o_j \in \mathbb{R}^m$                                        aggregation of the labeling of the $j$-th output
$A \in \mathbb{R}^{m \times m}$                               compatibility between inputs

A.2 Interpretation of compatibility

In our main objective, Equation (1), we introduce a compatibility matrix $A = (1 - \omega) I + \omega A_c$ between inputs. By minimizing $\langle o_j, o_{j'} \rangle_A$ for $j \neq j'$, we encourage each individual mixup example to have high compatibility within. Figure 5 explains how the compatibility term works by comparing simple cases. Note that our framework can reflect any compatibility measure for the optimal mixup.

B.1 Proof of proposition 1

Lemma 1. For a positive semi-definite matrix $A \in \mathbb{R}_+^{m \times m}$ and $x, x' \in \mathbb{R}^m$, $s(x, x') = x^\top A x'$ is pairwise supermodular.

Proof. $s(x, x) + s(x', x') - s(x, x') - s(x', x) = x^\top A x + x'^\top A x' - 2 x^\top A x' = (x - x')^\top A (x - x')$, and because $A$ is positive semi-definite, $(x - x')^\top A (x - x') \geq 0$.

Proposition 1. The compatibility term $f_c$ in Equation (1) is pairwise supermodular for every pair of $(z_{j_1,k}, z_{j_2,k})$ if $A$ is positive semi-definite.

Proof. For $j_1$ and $j_2$ such that $j_1 \neq j_2$,

$$\max\Big\{\tau,\ \sum_{j=1}^{m'} \sum_{j'=1, j' \neq j}^{m'} \Big(\sum_{k=1}^{n} z_{j,k}\Big)^{\!\top} A \Big(\sum_{k=1}^{n} z_{j',k}\Big)\Big\} = \max\{\tau,\ c + 2 z_{j_1,k}^\top A z_{j_2,k}\} = -\min\{-\tau,\ -c - 2 z_{j_1,k}^\top A z_{j_2,k}\},$$

for some $c \in \mathbb{R}$. By Lemma 1, $-z_{j_1,k}^\top A z_{j_2,k}$ is pairwise submodular, and because a budget additive function preserves submodularity (Horel and Singer, 2016), $\min\{-\tau, -c - 2 z_{j_1,k}^\top A z_{j_2,k}\}$ is pairwise submodular with respect to $(z_{j_1,k}, z_{j_2,k})$. Hence its negation, $f_c$, is pairwise supermodular.
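The inequality in Lemma 1 can be sanity-checked numerically. The sketch below assumes a random positive semi-definite matrix built as $B^\top B$ and random vectors drawn with a fixed seed, and it also evaluates a small indefinite matrix to show that the inequality can fail without positive semi-definiteness. The helper names are hypothetical.

```python
import random


def bilinear(x, A, y):
    """Return s(x, y) = x^T A y."""
    return sum(x[a] * A[a][b] * y[b] for a in range(len(x)) for b in range(len(y)))


def supermodular_gap(x, y, A):
    """s(x, x) + s(y, y) - s(x, y) - s(y, x); Lemma 1 shows this is >= 0 for PSD A."""
    return bilinear(x, A, x) + bilinear(y, A, y) - bilinear(x, A, y) - bilinear(y, A, x)


def random_psd(m, rng):
    """Build a positive semi-definite matrix as B^T B from a random m x m matrix B."""
    B = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(m)]
    return [[sum(B[k][a] * B[k][b] for k in range(m)) for b in range(m)]
            for a in range(m)]


rng = random.Random(0)
A_psd = random_psd(3, rng)
gaps = [supermodular_gap([rng.random() for _ in range(3)],
                         [rng.random() for _ in range(3)], A_psd)
        for _ in range(1000)]

# A symmetric but indefinite matrix (eigenvalues +1 and -1) violates the inequality.
A_indefinite = [[0.0, 1.0], [1.0, 0.0]]
counterexample = supermodular_gap([1.0, 0.0], [0.0, 1.0], A_indefinite)
```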

B.2 Proof of proposition 3

Proposition 3. The modularization given by Equation (2) satisfies the criteria.

Proof. Note that, by the definition in Equation (1), the compatibility between the $j$-th and $j'$-th outputs is $o_{j'}^\top A o_j$, and thus, $v_{-j}^\top o_j$ represents the compatibility between the $j$-th output and the others. In addition, $\|o_j\|_1 = \|\sum_{k=1}^{n} z_{j,k}\|_1 = \sum_{k=1}^{n} \|z_{j,k}\|_1 = n$. In a local view, for the given $o_j$, let us define a vector $o'_j$ as $o'_j[i_1] = o_j[i_1] + \alpha$ and $o'_j[i_2] = o_j[i_2] - \alpha$ for $\alpha > 0$. Without loss of generality, let us assume $v_{-j}$ is sorted in ascending order. Then, $v_{-j}^\top o_j \leq v_{-j}^\top o'_j$ implies $i_1 > i_2$, and because the max function preserves the ordering, $\max\{\tau', v_{-j}\}^\top o_j \leq \max\{\tau', v_{-j}\}^\top o'_j$. Thus, criterion 1 is locally satisfied.

Next, for $\tau' > 0$, $\|\max\{\tau', v_{-j}\} \odot o_j\|_1 \geq \tau' \|o_j\|_1 = \tau' n$. Let $i_0$ be an index such that $v_{-j}[i] < \tau'$ for $i < i_0$ and $v_{-j}[i] \geq \tau'$ for $i \geq i_0$. Then, for $o_j$ containing positive elements only in indices smaller than $i_0$, $\max\{\tau', v_{-j}\}^\top o_j = \tau' n$, which means there is no extra penalty from the compatibility. In this respect, the proposed modularization satisfies criterion 2 as well.

C Implementation details

We perform the optimization after down-sampling the given inputs and saliency maps to the specified size (4 × 4). After the optimization, we up-sample the optimal labeling to match the size of the inputs and then mix inputs according to the up-sampled labeling. For the saliency measure, we calculate the gradient values of the training loss with respect to the input data and measure the $\ell_2$ norm of the gradient values across input channels (Simonyan et al., 2013). In classification experiments, we retain the gradient information of network weights obtained from the saliency calculation for regularization. For the distance in the compatibility term, we measure the $\ell_1$-distance between the most salient regions.

For the initialization in Algorithm 1, we use i.i.d. samples from a categorical distribution with equal probabilities. We use the alpha-beta swap algorithm from pyGCO to solve the minimization step in Algorithm 1, which can find local minima of a multi-label submodular function. However, the worst-case complexity of the alpha-beta swap algorithm with $|\mathcal{L}| = 2$ is $O(m^2 n)$, and in the case of a mini-batch with 100 examples, iteratively applying the algorithm can become a bottleneck during network training. To mitigate the computational overhead, we partition the mini-batch (each partition of size 20) and then apply Algorithm 1 independently to each partition.

The worst-case theoretic complexity of the naive implementation of Algorithm 1 increases exponentially as $|\mathcal{L}|$ increases. Specifically, the worst-case theoretic complexity of the alpha-beta swap algorithm is proportional to the square of the number of possible states of $z_{j,k}$, which is proportional to $m^{|\mathcal{L}|-1}$. To reduce the number of possible states in the multi-label case, we solve the problem for binary labels ($|\mathcal{L}| = 2$) in the first inner cycle and then extend to multiple labels ($|\mathcal{L}| = 3$) only for the currently assigned indices of each output in the subsequent cycles. This reduces the number of possible states to $O(m + \bar{m}^{|\mathcal{L}|-1})$ where $\bar{m} \ll m$. Here, $\bar{m}$ means the number of currently assigned indices for each output.

Based on the above implementation, we train models with Co-Mixup in a feasible time. For example, in the case of ImageNet training with 16 Intel I9-9980XE CPU cores and 4 NVIDIA RTX 2080Ti GPUs, Co-Mixup training requires 0.964s per batch, whereas the vanilla training without mixup requires 0.374s per batch. Note that Co-Mixup requires saliency computation, and when we compare the algorithm with Puzzle Mix, which performs the same saliency computation, Co-Mixup is only about 1.04 times slower. Besides, as we down-sample the data to a fixed size regardless of the data dimension, the additional computation cost of Co-Mixup relatively decreases as the data dimension increases. Finally, we present the empirical time complexity of Algorithm 1 in Figure 6. As shown in the figure, Algorithm 1 has linear time complexity over $|\mathcal{L}|$ empirically. Note that we use $|\mathcal{L}| = 3$ in all of our main experiments, including the classification task.
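The down-sampling and normalization of the saliency map that precede the optimization can be sketched as follows. The sketch assumes block averaging as the down-sampling operator and a square map whose side is divisible by the target size; the actual operator used in the released code may differ, and the function names are hypothetical.

```python
def average_pool(grid, out_size):
    """Down-sample a square grid to out_size x out_size by block averaging.

    Assumes the side length of `grid` is divisible by `out_size`.
    """
    block = len(grid) // out_size
    return [[sum(grid[r * block + dr][c * block + dc]
                 for dr in range(block) for dc in range(block)) / block ** 2
             for c in range(out_size)]
            for r in range(out_size)]


def unary_cost(saliency, out_size=4):
    """Negative saliency on the coarse grid, normalized so saliency sums to 1."""
    pooled = average_pool(saliency, out_size)
    total = sum(sum(row) for row in pooled)
    return [[-value / total for value in row] for row in pooled]


# Toy 8 x 8 saliency map that grows toward the bottom-right corner.
saliency = [[float(r + c + 1) for c in range(8)] for r in range(8)]
cost = unary_cost(saliency)
```

The resulting 4 × 4 grid holds one cost per location for this input; stacking the grids of all $m$ inputs gives the cost vectors $c_k$ used in the objective.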

D Algorithm Analysis

In this section, we perform comparison experiments to analyze the proposed Algorithm 1. First, we compare our algorithm with the exact brute force search algorithm to inspect the optimality of the algorithm. Next, we compare our algorithm with the BP algorithm proposed by Narasimhan and Bilmes (2005) .

D.1 Comparison with Brute Force

To inspect the optimality of the proposed algorithm, we compare the function values of the solutions of Algorithm 1, the brute force search algorithm, and random guess. Due to the exponential time complexity of the brute force search, we compare the algorithms on small-scale experiment settings. Specifically, we test the algorithms on settings of $(m = m' = 2, n = 4)$, $(m = m' = 2, n = 9)$, and $(m = m' = 3, n = 4)$, varying the number of inputs and outputs $(m, m')$ and the dimension of data $n$. We generate the unary cost matrix in the objective $f$ by sampling data from a uniform distribution. We perform experiments with 100 different random seeds and summarize the results in Table 6. From the table, we find that the proposed algorithm achieves near-optimal solutions over various settings. We also measure the relative error of our solution with respect to random guess, $(f(z_{\text{ours}}) - f(z_{\text{brute}})) / (f(z_{\text{random}}) - f(z_{\text{brute}}))$. As a result, our algorithm achieves a relative error of less than 0.01.

D.2 Comparison with another BP algorithm

We compare the proposed algorithm and the BP algorithm proposed by Narasimhan and Bilmes (2005) . We evaluate function values of solutions by each method using a random unary cost matrix from a uniform distribution. We compare methods over various scales by controlling the number of mixing inputs m. 

E Hyperparameter settings

We perform Co-Mixup after down-sampling the given inputs and saliency maps to pre-defined resolutions regardless of the size of the input data. In addition, we normalize the saliency of each input to sum up to 1 and define the unary cost using the normalized saliency. As a result, we use an identical hyperparameter setting for various datasets: CIFAR-100, Tiny-ImageNet, and ImageNet. In detail, we use (β, γ, η, τ) = (0.32, 1.0, 0.05, 0.83) for all experiments. Note that τ is normalized according to the size of inputs (n) and the ratio of the number of inputs and outputs (m/m'), and we use an isotropic Dirichlet distribution with α = 2 for the prior p. For the compatibility matrix, we use ω = 0.001. For baselines, we tune the mixing ratio hyperparameter, i.e., the beta distribution parameter (Zhang et al., 2018), among {0.2, 1.0, 2.0} for all of the experiments if the specific setting is not provided in the original papers.
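The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the released implementation: average pooling is an assumed choice of down-sampling, the 4 × 4 target size and m = 20 are illustrative values, and the function names `downsample` and `normalize` are hypothetical.

```python
import numpy as np


def downsample(saliency, out_size):
    """Average-pool an (H, W) map to (out_size, out_size)."""
    h, w = saliency.shape
    if h % out_size or w % out_size:
        raise ValueError("input size must be divisible by out_size")
    blocks = saliency.reshape(out_size, h // out_size, out_size, w // out_size)
    return blocks.mean(axis=(1, 3))


def normalize(saliency, eps=1e-12):
    """Rescale a non-negative saliency map so that it sums to 1."""
    return saliency / (saliency.sum() + eps)


rng = np.random.default_rng(0)
raw_saliency = rng.uniform(size=(32, 32))  # stand-in for a real saliency map
saliency = normalize(downsample(raw_saliency, out_size=4))

# Isotropic Dirichlet prior with alpha = 2 over m inputs (m is illustrative).
m = 20
prior = rng.dirichlet(np.full(m, 2.0))
```

Because every saliency map is reduced to the same resolution and rescaled to unit mass, the unary costs live on the same scale for any input size, which is what allows one hyperparameter setting to be shared across datasets.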

F.1 Another Domain: Speech

In addition to the image domain, we conduct experiments on the speech domain, verifying that Co-Mixup works on various domains. Following Zhang et al. (2018), we train LeNet (LeCun et al., 1998) and VGG-11 (Simonyan and Zisserman, 2014) on the Google commands dataset (Warden, 2017). The dataset consists of 65,000 utterances, and each utterance is about one second long and belongs to one of 30 classes. We train each classifier for 30 epochs with the same training setting and data pre-processing as Zhang et al. (2018). In more detail, we use 160 × 100 normalized spectrograms of utterances for training. As shown in 

F.2 Calibration

In this section, we summarize the expected calibration error (ECE) (Guo et al., 2017) of classifiers trained with various mixup methods. For evaluation, we use the official code provided by the TensorFlow-Probability library and set the number of bins to 10. As shown in Table 10, Co-Mixup classifiers have the lowest calibration error on CIFAR-100 and Tiny-ImageNet. As pointed out by Guo et al. (2017), the Vanilla networks have over-confident predictions; however, we find that mixup classifiers tend to have under-confident predictions (Figure 7; Figure 8). As shown in the figures, Co-Mixup successfully alleviates the over-confidence issue and does not suffer from under-confident predictions.
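For reference, the ECE metric with equal-width bins can be sketched in NumPy as below. This follows the standard binned definition (a weighted mean of the absolute gap between accuracy and confidence per bin) and is not the TensorFlow-Probability code used for the reported numbers; the function name and argument layout are assumptions made for this sketch.

```python
import numpy as np


def expected_calibration_error(confidences, predictions, labels, num_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lower, upper in zip(bin_edges[:-1], bin_edges[1:]):
        # Bins are (lower, upper]; a confidence of exactly 0 falls in no bin,
        # which does not occur for the maximum softmax probability.
        in_bin = (confidences > lower) & (confidences <= upper)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


# Two predictions at confidence 0.95 (both correct) and two at 0.55 (one correct).
ece = expected_calibration_error(
    confidences=[0.95, 0.95, 0.55, 0.55],
    predictions=[0, 1, 2, 3],
    labels=[0, 1, 2, 0],
)
```

In this toy case each bin has a gap of 0.05 between accuracy and mean confidence, so the resulting ECE is 0.05. In the first bin accuracy exceeds confidence (under-confidence), and in the second confidence exceeds accuracy (over-confidence); ECE penalizes both equally.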

F.3 Sensitivity analysis

To show the sensitivity to hyperparameters, we sweep each hyperparameter and measure the Top-1 error rate of the model using PreActResNet18 on the CIFAR-100 dataset. We sweep the label smoothness coefficient β ∈ {0, 0.16, 0.32, 0.48, 0.64}, compatibility coefficient γ ∈ {0.6, 0.8, 1.0, 1.2, 1.4}, clipping level τ ∈ {0.79, 0.81, 0.83, 0.85, 0.87}, compatibility matrix parameter ω ∈ {0, 5 × 10^-4, 10^-3, 5 × 10^-3, 10^-2}, and the size of partition m ∈ {2, 4, 10, 20, 50}. Table 11 shows that Co-Mixup outperforms the best baseline (PuzzleMix, 20.62%) with a large pool of 



https://github.com/Borda/pyGCO https://www.tensorflow.org/probability/api_docs/python/tfp/stats/expected_calibration_error



Figure 1: Example comparison of existing mixup methods and the proposed Co-Mixup. We provide more samples in Appendix H.

Figure 2: (a) Analysis of our BP optimization problem. The x-axis is a one-dimensional arrangement of solutions: the mixed output is more salient but not diverse towards the right and less salient but diverse on the left. The unary term (red) decreases towards the right side of the axis, while the supermodular term (green) increases. By optimizing the sum of the two terms (brown), we obtain the balanced output z*. (b) A histogram of the number of inputs mixed for each output given a batch of 100 examples from the ImageNet dataset. As τ increases, more inputs are used to create each output on average. (c) Mean batch saliency measurement of a batch of mixup data using the ImageNet dataset. We normalize the saliency measure of each input to sum up to 1. (d) Diversity measurement of a batch of mixup data. We calculate the diversity as $1 - \sum_{j} \sum_{j' \neq j} \tilde{o}_j^\top \tilde{o}_{j'} / m$, where $\tilde{o}_j = o_j / \|o_j\|_1$. We can control the diversity among Co-Mixup data (red) and find the optimum by controlling τ.

Figure 3: Visualization of the proposed mixup procedure. For a given batch of input data (left), a batch of mixup data (right) is generated, which mix-matches different salient regions among the input data while preserving the diversity among the mixup examples. The histograms on the right represent the input source information of each mixup data (o j ).

Figure 4: Confidence-Accuracy plots for classifiers on CIFAR-100. From the figure, the Vanilla network shows over-confident predictions, whereas other mixup baselines tend to have under-confident predictions. We find that Co-Mixup has the best-calibrated predictions.

Figure 5: Let us consider Co-Mixup with three inputs and two outputs. The figure represents two Co-Mixup results. Each input is denoted as a number and color-coded. Let us assume that input 1 and input 2 are more compatible, i.e., $A_{12} \gg A_{23}$ and $A_{12} \gg A_{13}$. Then, the left Co-Mixup result has a larger inner-product value $\langle o_1, o_2 \rangle_A$ than the right. Thus the mixup result on the right has higher compatibility than the result on the left within each output example.

Figure 6: Mean execution time (ms) of Algorithm 1 per batch of data over 100 trials. The left figure shows the time complexity of the algorithm over |L| and the right figure shows the time complexity over the number of inputs m. Note that the other parameters are fixed to the classification experiment setting, m = m' = 20, n = 16, and |L| = 3.

Figure 7: Confidence-Accuracy plots for classifiers on CIFAR-100. Note, ECE is calculated by the mean absolute difference between the two values.

Figure 11: Input batch.

Figure 12: Mixed output batch.

Co-Mixup significantly outperforms all other baselines in Top-1 error rate. Co-Mixup achieves 19.87% in Top-1 error rate with PreActResNet18, outperforming the best baseline by 0.75%. We further test Co-Mixup on different models (WRN16-8 & ResNeXt29-4-24) and verify Co-Mixup improves Top-1 error rate over the best performing baseline.



Top-1 error rates of various mixup methods for background corrupted ImageNet validation set. The values in the parentheses indicate the error rate increment by corrupted inputs compared to clean inputs.

Table 4 reports the classification test errors on CIFAR-100 with PreActResNet18. From the table, we find that mixing multiple inputs decreases the performance gains of each mixup baseline. These results demonstrate that mixing multiple inputs could lead to possible degradation of the performance and support the necessity of considering saliency information and diversity as in Co-Mixup.

Top-1 error rates of mixup baselines with multiple mixing inputs on CIFAR-100 and PreActResNet18. We report the mean values of three different random seeds. Note that Co-Mixup optimally determines the number of inputs for each output by solving the optimization problem.

Mean function values of the solutions over 100 different random seeds. Rel. error means the relative error between ours and random guess.

Table 7 shows the averaged function values with standard deviations in parentheses. As we can see from the table, the proposed algorithm achieves much lower function values and deviations than the method by Narasimhan and Bilmes (2005) over various settings. Note that the method by Narasimhan and Bilmes (2005) has high variance due to randomization in the algorithm. We further compare the algorithm convergence time in Table 8. The experiments verify that the proposed algorithm is much faster and more effective than the method by Narasimhan and Bilmes (2005).

Mean function values of the solutions over 100 different random seeds. We report the standard deviations in parentheses. Random represents the random guess algorithm.

Convergence time (s) of the algorithms.

Table 9, we verify that Co-Mixup is still effective in the speech domain. Top-1 classification test error on the Google commands dataset. We stop training if validation accuracy does not increase for 5 consecutive epochs.

Expected calibration error (%) of classifiers trained with various mixup methods on CIFAR-100, Tiny-ImageNet, and ImageNet. Note that, on all three datasets, Co-Mixup outperforms all of the baselines in Top-1 accuracy.

Acknowledgements

This research was supported in part by Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd, Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00882, (SW STAR LAB) Development of deployable learning intelligence via self-sustainable and trustworthy machine learning), and Research Resettlement Fund for the new faculty of Seoul National University. Hyun Oh Song is the corresponding author.


hyperparameters. We also find that Top-1 error rate increases as the partition batch size m increases until m = 20.

Table 11: Hyperparameter sensitivity results (Top-1 error rates) on CIFAR-100 with PreActResNet18. We report the mean values of three different random seeds.

F.4 Comparison with non-mixup baselines

We compare the generalization performance of Co-Mixup with non-mixup baselines, verifying that the proposed method achieves state-of-the-art generalization performance not only among the mixup-based methods but also against other general regularization-based methods. One such regularization method, VAT (Miyato et al., 2018), uses a virtual adversarial loss, which is defined as the KL divergence between the predictions on input data and on locally perturbed input data. We perform the experiment with VAT regularization on CIFAR-100 with PreActResNet18 for 300 epochs in the supervised setting. We tune α (coefficient of the VAT regularization term) in {0.001, 0.01, 0.1}, ε (radius of the ℓ∞ ball) in {1, 2}, and the number of noise update steps in {0, 1}. Table 12 shows that Co-Mixup, which achieves a Top-1 error rate of 19.87%, outperforms the VAT regularization method.

G Detailed description for background corruption

We build the background corrupted test datasets based on the ImageNet validation dataset to compare the robustness of the pre-trained classifiers against background corruption. ImageNet consists of images $\{x_1, \ldots, x_M\}$, labels $\{y_1, \ldots, y_M\}$, and the corresponding ground-truth bounding boxes $\{b_1, \ldots, b_M\}$. We use the ground-truth bounding boxes to separate the foreground from the background. Let $z_j$ be a binary mask of image $x_j$, which has value 1 inside the ground-truth bounding box $b_j$. Then, we generate two types of background corrupted samples $\tilde{x}_j$ by considering the following operations:

1. Replacement with another image as $\tilde{x}_j = x_j \odot z_j + x_{i(j)} \odot (1 - z_j)$ for a random permutation $\{i(1), \ldots, i(M)\}$.
2. Adding Gaussian noise as $\tilde{x}_j = x_j \odot z_j + \epsilon \odot (1 - z_j)$, where $\epsilon \sim \mathcal{N}(0, 0.1^2)$. We clip pixel values of $\tilde{x}_j$ to [0, 1].
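The two corruption operations can be sketched in NumPy as below. This is an illustrative sketch under assumptions, not the evaluation code: images are taken to be float arrays in [0, 1] of shape (M, H, W, C), bounding boxes are given as (top, left, bottom, right) pixel indices, clipping is applied only to the Gaussian-noise variant, and the function names are hypothetical.

```python
import numpy as np


def box_mask(height, width, box):
    """Binary mask that is 1 inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    mask = np.zeros((height, width, 1), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


def replace_background(images, masks, rng):
    """Keep each foreground; take the background from a permuted image."""
    perm = rng.permutation(len(images))
    return images * masks + images[perm] * (1.0 - masks)


def gaussian_background(images, masks, rng, std=0.1):
    """Keep each foreground; fill the background with N(0, std^2) noise."""
    noise = rng.normal(0.0, std, size=images.shape)
    return np.clip(images * masks + noise * (1.0 - masks), 0.0, 1.0)


rng = np.random.default_rng(0)
images = rng.uniform(size=(4, 8, 8, 3))  # stand-in for validation images
masks = np.stack([box_mask(8, 8, (2, 2, 6, 6)) for _ in range(4)])

swapped = replace_background(images, masks, rng)
noisy = gaussian_background(images, masks, rng)
```

In both variants the pixels inside the bounding box are left untouched, so any change in the classifier's prediction can be attributed to the background alone.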

H Co-Mixup generated samples

In Figure 12, we present Co-Mixup generated image samples by using images from ImageNet. We use an input batch consisting of 24 images, which is visualized in Figure 11. As can be seen from Figure 12, Co-Mixup efficiently mix-matches salient regions of the given inputs, maximizing saliency, and creates diverse outputs. In Figure 12, inputs with the target objects on the left side are mixed with the objects on the right side, and objects on the top side are mixed with the objects on the bottom side. In Figure 13, we present Co-Mixup generated image samples with larger τ using the same input batch. By increasing τ, we can encourage Co-Mixup to use more inputs to mix per output.

Figure 13: Another mixed output batch with larger τ.

