TEACH ME HOW TO INTERPOLATE A MYRIAD OF EMBEDDINGS

Abstract

Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Yet, its extensions focus on the definition of interpolation and the space where it takes place, while the augmentation itself is less studied: For a mini-batch of size m, most methods interpolate between m pairs with a single scalar interpolation factor λ. In this work, we make progress in this direction by introducing MultiMix, which interpolates an arbitrary number n of tuples, each of length m, with one vector λ per tuple. On sequence data, we further extend to dense interpolation and loss computation over all spatial positions. Overall, we increase the number of tuples per mini-batch by orders of magnitude at little additional cost. This is possible by interpolating at the very last layer before the classifier. Finally, to address inconsistencies due to linear target interpolation, we introduce a self-distillation approach to generate and interpolate synthetic targets. We empirically show that our contributions result in significant improvement over state-of-the-art mixup methods on four benchmarks. By analyzing the embedding space, we observe that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.

1. INTRODUCTION

Mixup (Zhang et al., 2018) is a data augmentation method that interpolates between pairs of training examples, thus regularizing a neural network to favor linear behavior in-between examples. Besides improving generalization, it has important properties such as reducing overconfident predictions and increasing the robustness to adversarial examples. Several follow-up works have studied interpolation in the latent or embedding space, which is equivalent to interpolating along a manifold in the input space (Verma et al., 2019), and a number of nonlinear and attention-based interpolation mechanisms (Yun et al., 2019; Kim et al., 2020; 2021; Uddin et al., 2021; Hong et al., 2021). However, little progress has been made in the augmentation process itself, i.e., the number of examples being interpolated and the number of interpolated examples being generated.

Mixup was originally motivated as a way to go beyond empirical risk minimization (ERM) (Vapnik, 1999) through a vicinal distribution expressed as an expectation over an interpolation factor λ, which is equivalent to the set of linear segments between all pairs of training inputs and targets. In practice however, in every training iteration, a single scalar λ is drawn and the number of interpolated pairs is limited to the size of the mini-batch, as illustrated in Figure 1(a). This is because, if interpolation takes place in the input space, it would be expensive to increase the number of examples per iteration. To our knowledge, these limitations exist in all mixup methods.

In this work, we argue that a data augmentation process should augment the data seen by the model, or at least by its last few layers, as much as possible. In this sense, we follow manifold mixup (Verma et al., 2019) and generalize it in a number of ways to introduce MultiMix, as illustrated in Figure 1(b). First, rather than pairs, we interpolate tuples that are as large as the mini-batch.
Effectively, instead of linear segments between pairs of examples in the mini-batch, we sample on their entire convex hull. Second, we draw a different vector λ for each tuple. Third, and most important, we increase the number of interpolated tuples per iteration by orders of magnitude, while only slightly decreasing the actual training throughput in examples per second. This is possible by interpolating at the deepest layer possible, i.e., just before the classifier, which also happens to be the most effective choice. The interpolated embeddings are thus only processed by a single layer.

Apart from increasing the number of examples seen by the model, another idea is to increase the number of loss terms per example. In many modalities of interest, the input is a sequence in one or more dimensions: pixels or patches in images, voxels in video, points or triangles in high-dimensional surfaces, to name a few. The structure of input data is expressed in matrices or tensors, which often preserve a certain spatial resolution until the deepest network layer, before they collapse e.g. by global average pooling (Szegedy et al., 2015; He et al., 2016) or by taking the output of a classification token (Vaswani et al., 2017; Dosovitskiy et al., 2020). In this sense, we choose to operate at the level of sequence elements rather than representing examples by a single vector. We introduce dense MultiMix, which is the first approach of this kind in mixup-based data augmentation. In particular, we interpolate densely the embeddings and targets of sequence elements and we also apply the loss densely, as illustrated in Figure 2. This is an extreme form of augmentation where the number of interpolated tuples and loss terms increases further by one or two orders of magnitude, but at little cost.
Finally, linear interpolation of targets, which is the norm in most mixup variants, has a limitation: Given two examples with different class labels, the interpolated example may actually lie in a region associated with a third class in the feature space, which is identified as manifold intrusion (Guo et al., 2019). In the absence of any data other than the mini-batch, a straightforward way to address this limitation is to devise targets originating in the network itself. This naturally leads to self-distillation, whereby a moving average of the network acts as a teacher and provides synthetic soft targets (Tarvainen & Valpola, 2017), to be interpolated exactly like the original hard targets.

In summary, we make the following contributions:

1. We introduce MultiMix, which, given a mini-batch of size m, interpolates an arbitrary number n ≫ m of tuples, each of length m, with one interpolation vector λ per tuple, compared with m pairs, all with the same scalar λ, for most mixup methods (subsection 3.2).
2. We extend to dense interpolation and loss computation over all spatial positions (subsection 3.4).
3. We use online self-distillation to generate and interpolate soft targets for mixup, compared with linear target interpolation for most mixup methods (subsection 3.3).
4. We improve over state-of-the-art mixup methods on image classification, robustness to adversarial attacks, object detection and out-of-distribution detection (section 4).

2. RELATED WORK

Mixup: interpolation methods In general, mixup interpolates between pairs of input examples (Zhang et al., 2018) or embeddings (Verma et al., 2019) and their corresponding target labels. 

3.1. PRELIMINARIES AND BACKGROUND

Problem formulation Let x ∈ X be an input example and y ∈ Y its one-hot encoded target, where X = R^D is the input space, Y = {0, 1}^c and c is the total number of classes. Let f_θ : X → R^d be an encoder that maps the input x to an embedding z = f_θ(x), where d is the dimension of the embedding. A classifier g_W : R^d → Δ^{c−1} maps z to a vector p = g_W(z) of predicted probabilities over classes, where Δ^n ⊂ R^{n+1} is the unit n-simplex, i.e., p ≥ 0 and 1_c^⊤ p = 1, and 1_c ∈ R^c is an all-ones vector. The overall network mapping is f := g_W ∘ f_θ. Parameters (θ, W) are learned by optimizing over mini-batches.

Given a mini-batch of m examples, let X = (x_1, ..., x_m) ∈ R^{D×m} be the inputs, Y = (y_1, ..., y_m) ∈ R^{c×m} the targets and P = (p_1, ..., p_m) ∈ R^{c×m} the predicted probabilities of the mini-batch, where P = f(X) := (f(x_1), ..., f(x_m)). The objective is to minimize the cross-entropy

H(Y, P) := −1_c^⊤ (Y ⊙ log(P)) 1_m / m    (1)

of predicted probabilities P relative to targets Y, averaged over the mini-batch, where ⊙ is the Hadamard (element-wise) product. In summary, the mini-batch loss is

L(X, Y; θ, W) := H(Y, g_W(f_θ(X))).    (2)

Mixup Mixup methods commonly interpolate pairs of inputs or embeddings and the corresponding targets at the mini-batch level while training. Given a mini-batch of m examples with inputs X and targets Y, let Z = (z_1, ..., z_m) ∈ R^{d×m} be the embeddings of the mini-batch, where Z = f_θ(X). Manifold mixup (Verma et al., 2019) interpolates the embeddings and targets by forming a convex combination of the pairs with interpolation factor λ ∈ [0, 1]:

Z̃ = Z(λI + (1 − λ)Π)    (3)
Ỹ = Y(λI + (1 − λ)Π)    (4)

where λ ∼ Beta(α, α), I is the identity matrix and Π ∈ R^{m×m} is a permutation matrix. Input mixup (Zhang et al., 2018) interpolates inputs rather than embeddings:

X̃ = X(λI + (1 − λ)Π).    (5)

Whatever the interpolation method and the space where it is performed, the interpolated data, e.g.
λ_k ∈ Δ^{m−1}, that is, λ_k ≥ 0 and 1_m^⊤ λ_k = 1. We then interpolate embeddings and targets by taking n convex combinations over all m examples:

Z̃ = ZΛ    (6)
Ỹ = YΛ    (7)

where Λ = (λ_1, ..., λ_n) ∈ R^{m×n}. We thus generalize manifold mixup (Verma et al., 2019):

1. from pairs to tuples of length m, as long as the mini-batch: m-term convex combination (6),(7) vs. 2-term in (3),(4), Dirichlet vs. Beta distribution;
2. from m to an arbitrary number n of tuples: interpolated embeddings Z̃ ∈ R^{d×n} (6) vs. R^{d×m} in (3), interpolated targets Ỹ ∈ R^{c×n} (7) vs. R^{c×m} in (4);
3. from a fixed λ across the mini-batch to a different λ_k for each interpolated item.

Loss Again, we replace the original mini-batch embeddings Z by the interpolated embeddings Z̃ and minimize the average cross-entropy H(Ỹ, P̃) (1) between the predicted probabilities P̃ = g_W(Z̃) and the interpolated targets Ỹ (7). Compared with (2), the mini-batch loss becomes

L_M(X, Y; θ, W) := H(YΛ, g_W(f_θ(X)Λ)).    (8)
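To make the sampling concrete, the interpolation in (6),(7) amounts to a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the function name `multimix` and the array shapes are ours, following the column convention above:

```python
import numpy as np

rng = np.random.default_rng(0)

def multimix(Z, Y, n=1000, alpha=1.0):
    """MultiMix sampling: n convex combinations over all m mini-batch examples.

    Z: (d, m) embeddings, Y: (c, m) one-hot targets (columns are examples).
    Returns interpolated (d, n) embeddings and (c, n) targets, as in (6),(7).
    """
    m = Z.shape[1]
    # one interpolation vector per tuple: lambda_k ~ Dir(alpha), the columns of Lambda
    Lam = rng.dirichlet(alpha * np.ones(m), size=n).T  # (m, n), each column on the simplex
    return Z @ Lam, Y @ Lam
```

With m = 128 and n = 1000, Λ is a 128×1000 matrix, so the extra cost over the forward pass is essentially two matrix products per mini-batch.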

3.3. MULTIMIX WITH SELF-DISTILLATION

Networks We use an online self-distillation approach whereby the learned network f := g_W ∘ f_θ becomes the student, whereas a teacher network f′ := g_{W′} ∘ f_{θ′} of the same architecture is obtained by an exponential moving average of the parameters (Tarvainen & Valpola, 2017; Grill et al., 2020). The teacher parameters (θ′, W′) are not learned: We stop the gradient in the computation graph.

Views Given two transformations T and T′, we generate two different augmented views v = t(x) and v′ = t′(x) for each input x, where t ∼ T and t′ ∼ T′. Then, given a mini-batch of m examples with inputs X and targets Y, let V = t(X), V′ = t′(X) ∈ R^{D×m} be the mini-batch views corresponding to the two augmentations and Z = f_θ(V), Z′ = f_{θ′}(V′) ∈ R^{d×m} the embeddings obtained by the student and teacher encoders respectively.

Interpolation We obtain the interpolated embeddings Z̃, Z̃′ from Z, Z′ by (6) and targets Ỹ from Y by (7), using the same Λ. The predicted class probabilities are given by P̃ = g_W(Z̃) and P̃′ = g_{W′}(Z̃′), again obtained by the student and teacher classifiers, respectively.

Loss We learn parameters (θ, W) by minimizing a classification and a self-distillation loss:

γ H(Ỹ, P̃) + (1 − γ) H(P̃′, P̃),    (9)

where γ ∈ [0, 1]. The former brings the probabilities P̃ predicted by the student close to the targets Ỹ, as in (8). The latter brings P̃ close to the probabilities P̃′ predicted by the teacher.
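The two ingredients above, the EMA teacher update and the two-term objective (9), can be sketched as follows. This is a minimal illustration assuming parameters stored as lists of arrays; the function names are ours, not from the paper's code:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.999):
    """Teacher parameters as an exponential moving average of the student's
    (Tarvainen & Valpola, 2017). No gradient flows to the teacher; it is
    only updated through this rule after each student step."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def multimix_distill_loss(Y_mix, P_student, P_teacher, gamma=0.5, eps=1e-12):
    """gamma * H(Y_mix, P) + (1 - gamma) * H(P', P), as in (9).
    Columns are probability vectors; cross-entropy is averaged over columns."""
    def H(Q, P):
        return float(-np.mean(np.sum(Q * np.log(P + eps), axis=0)))
    return gamma * H(Y_mix, P_student) + (1.0 - gamma) * H(P_teacher, P_student)
```

Setting gamma = 1 recovers the plain MultiMix loss (8); gamma = 0 trains purely against the teacher's soft targets.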

3.4. DENSE MULTIMIX

We now extend to the case where the embeddings are structured, e.g. in tensors. This happens e.g. with token vs. sentence embeddings in NLP and patch vs. image embeddings in vision. This works by removing spatial pooling and applying the loss function densely over all tokens/patches. The idea is illustrated in Figure 2. For the sake of exposition, our formulation uses sets of matrices grouped either by example or by spatial position. In practice, all operations are on tensors.

Preliminaries The encoder is now f_θ : X → R^{d×r}, mapping the input x to an embedding z = f_θ(x) ∈ R^{d×r}, where d is the number of channels and r is the spatial resolution; if there is more than one spatial dimension, these are flattened. Given a mini-batch of m examples, we have again inputs X = (x_1, ..., x_m) ∈ R^{D×m} and targets Y = (y_1, ..., y_m) ∈ R^{c×m}. Each embedding z_i = f_θ(x_i) = (z_i^1, ..., z_i^r) ∈ R^{d×r} for i = 1, ..., m consists of features z_i^j ∈ R^d for spatial position j = 1, ..., r. We group features by position in matrices Z^1, ..., Z^r, where Z^j = (z_1^j, ..., z_m^j) ∈ R^{d×m} for j = 1, ..., r.

Attention Each feature vector will inherit the target of the corresponding input example. However, we also attach a level of confidence according to an attention map. Given an embedding z ∈ R^{d×r} with target y ∈ Y and a vector u ∈ R^d, the attention map

a = h(z^⊤ u) ∈ R^r    (10)

measures the similarity of the features of z to u, where h is a non-linearity, e.g. softmax, or ReLU followed by ℓ1 normalization. There are different ways to define the vector u. For example, u = z 1_r / r by global average pooling (GAP) of z, or u = W y assuming a linear classifier with W ∈ R^{d×c}, similar to class activation mapping (CAM) (Zhou et al., 2016). In the case of no attention, a = 1_r / r is uniform. Given a mini-batch, let a_i = (a_i^1, ..., a_i^r) ∈ R^r be the attention map of embedding z_i (10). We group attention by position in vectors a^1, ..., a^r, where a^j = (a_1^j, ..., a_m^j) ∈ R^m for j = 1, ..., r.

Interpolation For each spatial position j = 1, ..., r, we draw interpolation vectors λ_k^j ∼ Dir(α) for k = 1, ..., n and define Λ^j = (λ_1^j, ..., λ_n^j) ∈ R^{m×n}. Because input examples are assumed to contribute according to the attention vector a^j ∈ R^m, we scale the rows of Λ^j accordingly and then normalize its columns back to Δ^{m−1} so that they can define convex combinations:

M^j = diag(a^j) Λ^j    (11)
M^j ← M^j diag(1_m^⊤ M^j)^{−1}    (12)

We then interpolate embeddings and targets by taking n convex combinations over m examples:

Z̃^j = Z^j M^j    (13)
Ỹ^j = Y M^j    (14)

This is similar to (6),(7), but there is a different interpolated embedding matrix Z̃^j ∈ R^{d×n} as well as target matrix Ỹ^j ∈ R^{c×n} per position, even though the original target matrix Y is one.

Classifier The classifier is now g_W : R^{d×r} → R^{c×r}, maintaining the same spatial resolution as the embedding and generating one vector of predicted probabilities per spatial position. This is done by removing average pooling or any down-sampling operation. The interpolated embeddings Z̃^1, ..., Z̃^r (13) are grouped by example into z̃_1, ..., z̃_n ∈ R^{d×r}, mapped by g_W to predicted probabilities p̃_1, ..., p̃_n ∈ R^{c×r} and grouped again by position into P̃^1, ..., P̃^r ∈ R^{c×n}. In the simple case where the original classifier is linear, i.e. W ∈ R^{d×c}, it is seen as a 1×1 convolution and applied densely to each column (feature) of Z̃^j for j = 1, ..., r.

Loss Finally, we learn parameters θ, W by minimizing the weighted cross-entropy H(Ỹ^j, P̃^j; s) of P̃^j relative to the interpolated targets Ỹ^j, again densely at each position j, where

H(Y, P; s) := −1_c^⊤ (Y ⊙ log(P)) s / (1_n^⊤ s)

generalizes (1) and the weight vector is defined as s = 1_m^⊤ M^j ∈ R^n. This is exactly the vector used to normalize the columns of M^j in (12).
The motivation is that the columns of M^j are the original interpolation vectors weighted by attention: A small ℓ1 norm indicates that, for the given position j, we are sampling from examples of low attention, hence the loss is to be discounted.
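The per-position interpolation (11)-(14) and the loss weights s can be sketched as follows; again a minimal NumPy illustration under our own naming, not the reference code:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_multimix_position(Zj, Y, aj, n=5, alpha=1.0):
    """Attention-weighted interpolation at a single spatial position j.

    Zj: (d, m) features at position j, Y: (c, m) targets, aj: (m,) attention.
    Returns (d, n) embeddings, (c, n) targets, and loss weights s = 1^T M^j.
    """
    m = Zj.shape[1]
    Lam = rng.dirichlet(alpha * np.ones(m), size=n).T  # (m, n) interpolation vectors
    M = aj[:, None] * Lam          # scale rows by attention: M^j = diag(a^j) Lambda^j
    s = M.sum(axis=0)              # column l1 norms, later used to weight the loss
    M = M / s                      # normalize columns back onto the simplex
    return Zj @ M, Y @ M, s
```

Running this for every j = 1, ..., r with a shared random seed per mini-batch reproduces the grouped-by-position formulation; in practice all positions are processed as one tensor operation.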

4.1. SETUP

We use a mini-batch of size m = 128 examples in all experiments. For every mini-batch, we apply MultiMix with probability 0.5 or input mixup otherwise. For MultiMix, the default settings are given in subsection 4.4. We follow the experimental settings of AlignMixup (Venkataramanan et al., 2022). We report the mean and standard deviation of the top-1 accuracy (%) for five runs on image classification. On robustness to adversarial attacks (subsection 4.2), we report the top-1 error. We also experiment on object detection (subsection 4.3) and out-of-distribution detection (subsection A.3).

Image classification

In Table 1 we observe that MultiMix and Dense MultiMix already outperform SoTA on all datasets except CIFAR-10 with R-18, where they are on par with Co-Mixup. The addition of distillation increases the gain and outperforms SoTA on all datasets. Both distillation and dense improve over vanilla MultiMix and their effect is complementary on all datasets. On TI for example, distillation improves by 0.95%, dense by 1.33% and their combination by 2.0%. This combination brings an impressive gain of 2.26% over the previous SoTA, AlignMixup. We provide additional analysis of the embedding space on 10 classes of CIFAR-100 in subsection A.4.

In Table 3 we observe that vanilla MultiMix is already more robust than SoTA on all datasets and settings except FGSM on CIFAR-100 with R-18, where it is on par with AlignMixup. The addition of dense, distillation or both again increases the robustness and shows that their effect is complementary. The overall gain is more impressive than in classification error. For example, against the strong PGD attack on CIFAR-10 with W16-8, the SoTA Co-Mixup improves the baseline by 3.8% and our best result improves the baseline by 9.4%, which is more than double.

In Table 4, we observe that, while vanilla MultiMix is slightly worse than AlignMixup, dense and distillation bring improvements over the SoTA on both datasets and are still complementary. This is consistent with the classification results. Compared with the baseline, our best setting brings a gain of 2.40% mAP on Pascal VOC07+12 and 3.14% on MS-COCO.

4.4. ABLATIONS

All ablations are performed using R-18 on CIFAR-100. For MultiMix, we study the effect of the layer where we interpolate, the number of tuples n and a fixed value of the Dirichlet parameter α. More ablations are given in the supplementary material.

Interpolation layer For MultiMix, we use the entire network as the encoder f_θ by default, except for the last fully-connected layer, which we use as the classifier g_W. Thus, we interpolate embeddings in the deepest layer by default. Here, we study the effect of different decompositions of the network f = g_W ∘ f_θ, such that interpolation takes place at a different layer. When using distillation, we interpolate at the same layer for both the teacher and the student. In Figure 3(a), we observe that mixing at the deeper layers of the network significantly improves performance. The same behavior is observed when adding dense, distillation, or both. This validates our default choice. It is interesting that the authors of input mixup (Zhang et al., 2018) found that convex combinations of three or more examples in the input space with weights from the Dirichlet distribution do not bring further gain. This agrees with the finding of SuperMix (Dabouei et al., 2021) for four or more examples. Figure 3(a) suggests that further gain emerges when mixing in deeper layers.

Number of tuples n Since our aim is to increase the amount of data seen by the model, or at least part of the model, it is important to study the number n of interpolated embeddings. We observe from Figure 3(b) that accuracy increases overall with n and saturates for n ≥ 1000 for all variants of MultiMix. Our best setting, Dense MultiMix with distillation, works best at n = 1000. We choose this as the default, given also that the training cost increases with n. The training speed as a function of n is given in the supplementary material and is nearly constant for n ≤ 1000.
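The decomposition studied in the interpolation-layer ablation can be sketched abstractly: treat the network as a list of layer functions and interpolate the mini-batch right after layer k, so that all subsequent layers see the mixed batch. A hypothetical sketch with toy layers (the function name and the layers are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_mixing(layers, X, Lam, mix_at):
    """Run X (features as m columns) through `layers`, interpolating the whole
    mini-batch with Lambda (m, n) right after layer index `mix_at`.
    Larger mix_at mixes deeper; the paper's default mixes just before the classifier."""
    Z = X
    for k, layer in enumerate(layers):
        Z = layer(Z)
        if k == mix_at:
            Z = Z @ Lam  # all subsequent layers process n interpolated embeddings
    return Z

# toy two-layer 'network' acting on (d, m) column features
layers = [lambda Z: np.tanh(0.5 * Z), lambda Z: 2.0 * Z]
```

Note that mixing early means every later layer pays the cost of the enlarged batch of n embeddings, which is one reason the deepest layer is both the cheapest and, per Figure 3(a), the most effective choice.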
Dirichlet parameter α Our default setting is to draw α uniformly at random from [0.5, 2] for every interpolation vector (column of Λ). Here we study the effect of a fixed value of α. In Figure 3(c), we observe that the best accuracy comes with α = 1 for most MultiMix variants, corresponding to the uniform distribution over the convex hull of the mini-batch embeddings. However, all measurements are lower than with the default α ∼ U[0.5, 2]. For example, from Table 1 (CIFAR-100, R-18), dense MultiMix + distillation has accuracy 82.52, compared with 82.23 in Figure 3(c) for α = 1.
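The default α-sampling described above can be sketched as follows; the helper name is ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_interpolation_matrix(m, n):
    """Default MultiMix setting: draw a fresh alpha ~ U[0.5, 2] for every
    interpolation vector (column of Lambda), rather than fixing one alpha."""
    alphas = rng.uniform(0.5, 2.0, size=n)
    cols = [rng.dirichlet(a * np.ones(m)) for a in alphas]
    return np.stack(cols, axis=1)  # (m, n), each column on the simplex
```

Columns drawn with small α concentrate near the vertices of the simplex (close to single examples), while α near 2 favors more evenly mixed combinations, so varying α per column covers both regimes in one mini-batch.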

5. CONCLUSION

In terms of input interpolation, the take-home message of this work is that, instead of devising smarter and more complex interpolation functions in the input space or the first layers of the representation, it is more beneficial to just perform linear interpolation in the very last layer, where the cost is minimal, and then increase as much as possible the number of interpolated embeddings for mixup. This is more in line with the original motivation of mixup as a way to go beyond ERM. In terms of target interpolation, the take-home message is the opposite: instead of linear interpolation of original targets, find new synthetic targets for the interpolated embeddings with the help of the network itself, then interpolate them linearly. This idea fits nicely with self-distillation, which is popular in settings such as self-supervised representation learning and continual learning. Interestingly, self-distillation can be seen as yet another form of augmentation, but in the model space. A natural extension of this work is to settings other than supervised classification. A limitation is that it is not straightforward to combine the sampling scheme of MultiMix with complex interpolation methods, unless they are fast to compute in the embedding space.

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022.
Jianchao Zhu, Liangliang Shi, Junchi Yan, and Hongyuan Zha. AutoMix: Mixup networks for sample interpolation via cooperative barycenter learning. In ECCV, 2020.

In Table 6 and Table 7, we observe that MultiMix and its variants outperform all the additional mixup methods on image classification. Furthermore, they are more robust to FGSM and PGD attacks as compared to these additional methods.
The remaining observations in subsection 4.2 are still valid. These results validate the qualitative analysis of Figure 4 .

A.5 MORE ABLATIONS

As in subsection 4.4, all ablations here are performed using R-18 on CIFAR-100.

Mixup methods with distillation In subsection 4.2 and Table 6, we observe that distillation significantly improves the performance when used with MultiMix. Here, we also study its effect when applied to SoTA mixup methods. Given a mini-batch of m examples with inputs X and targets Y, we obtain the augmented views V and V′ as discussed in subsection 3.3. We then follow the mixup strategy of each mixup method and obtain the corresponding predicted class probabilities P̃, P̃′ from the student and teacher classifier, respectively. E.g., for manifold mixup (Verma et al., 2019), we interpolate the embeddings Z = f_θ(V), Z′ = f_{θ′}(V′) using (3) and obtain P̃ = g_W(Z̃) and P̃′ = g_{W′}(Z̃′). In each case, we obtain the interpolated targets Ỹ using (4) and train the student network using (9).

Mixup methods with dense loss In Table 6 we observe that dense interpolation and the dense loss improve vanilla MultiMix. Here, we study the effect of the dense loss when applied to SoTA mixup methods. Given a mini-batch of m examples, we follow the mixup strategy of the SoTA mixup methods to obtain the mixed embedding Z̃^j ∈ R^{d×m} for each spatial position j = 1, ..., r. Then, as discussed in subsection 3.4, we obtain the predicted class probabilities P̃^j ∈ R^{c×m}, again for each j = 1, ..., r. Finally, we compute the cross-entropy loss H(Ỹ, P̃^j) (1) densely at each spatial position j, where the interpolated target label Ỹ ∈ R^{c×m} is given by (4). In Table 9, we observe that using a dense loss improves the performance of all SoTA mixup methods. The baseline improves by 1.4% accuracy (76.76 → 78.16) and manifold mixup by 0.67% (80.20 → 80.87). On average, we observe a gain of 0.7% brought by the dense loss. An exception is AlignMixup (Venkataramanan et al., 2022), which drops by 0.35% (81.71 → 81.36).
This may be due to the alignment process, whereby the interpolated dense embeddings are not very far from the original. Finally, we study the effect of using a dense distillation loss on SoTA mixup methods. Here, similarly to (9), the loss has two terms for each spatial position j: the first is the dense cross-entropy loss H(Ỹ, P̃^j) as above and the second is the dense distillation loss H((P̃′)^j, P̃^j), where P̃′ is obtained by the teacher. In Table 9, we observe that dense distillation further improves the performance of SoTA mixup methods as compared to using the dense loss only.

Two-stage distillation Following SuperMix (Dabouei et al., 2021), we also study the effect of using a two-stage distillation process with MultiMix, rather than online self-distillation. In the first stage, we train the teacher using only clean examples for 300 epochs, achieving a top-1 accuracy of 75.62%. This is slightly lower than the 76.76% of the baseline from Table 9, which is trained for 2000 epochs. In the second stage, we fix the teacher parameters and train the student using the predictions of the teacher network as targets. In particular, we use only the second term H(P̃′, P̃) of (9), that is, γ = 0. At inference, the top-1 accuracy drops by 16% (75.62 → 59.77). This shows that using the setting of SuperMix is not effective, while also being computationally expensive because of the two-stage training. We also study the effect of training the student with both the interpolated labels Ỹ (7) and the interpolated predictions P̃′ of the pretrained teacher as targets. In particular, we use (9) with our default γ = 1/2. At inference, the top-1 accuracy improves by 4.7% compared with the teacher (75.62 → 80.35). However, the student accuracy of 80.35% is still inferior to our 82.28% by online self-distillation (Table 9). This shows that joint training of teacher and student is beneficial.
Training speed In Figure 5, we analyze the training speed of MultiMix and its variants as a function of the number of tuples n. In terms of speed, vanilla MultiMix is on par with the baseline up to n = 1000, while bringing an accuracy gain of 5%. The best performing variant, dense MultiMix with distillation, is only 15.6% slower at n = 1000 as compared to the baseline, which is arguably worth the impressive 5.8% accuracy gain. Increasing n beyond 1000 brings a drop in training speed, due to computing Λ and then using it to interpolate (6),(7). Because n > 1000 also brings little performance benefit according to Figure 3(b), we set n = 1000 as the default for all MultiMix variants.

Dense MultiMix: spatial attention In subsection 3.4, we discuss different options for attention in dense MultiMix. In particular, no attention amounts to defining a uniform a = 1_r/r. Otherwise, a is defined by (10). The vector u can be defined as u = z 1_r/r by global average pooling (GAP) of z, which is the default, or u = W y assuming a linear classifier with W ∈ R^{d×c}. The latter is similar to class activation mapping (CAM) (Zhou et al., 2016), but here the current value of W is used online while training. The non-linearity h can be softmax or ReLU followed by ℓ1 normalization (ℓ1 ∘ relu), which is the default. Here, we study the effect of these options on the performance of dense MultiMix. In Table 10, we observe that using GAP for u and ℓ1 ∘ relu as h yields the best performance overall. Changing GAP to CAM or ℓ1 ∘ relu to softmax is inferior, more so in the presence of distillation. The combination of CAM with softmax is the weakest, even weaker than uniform attention. CAM may fail because of using the non-optimal value of W while training; softmax may fail because of being too selective. Compared to our best result, uniform attention is clearly inferior, by nearly 1% in the presence of distillation.
This validates that the use of spatial attention in dense MultiMix is clearly beneficial. The intuition is the same as in weakly supervised tasks: In the absence of dense



https://github.com/alldbi/SuperMix



Figure 1: Data augmentation from a mini-batch B consisting of m = 10 points in two dimensions. (a) mixup: sampling of m points on linear segments between m pairs of points in B, using the same interpolation factor λ. (b) MultiMix: sampling of n = 300 points in the convex hull of B.

X̃ (Zhang et al., 2018) or Z̃ (Verma et al., 2019), replaces the original mini-batch data and gives rise to predicted probabilities P̃ = (p̃_1, ..., p̃_m) ∈ R^{c×m} over classes, e.g. P̃ = f(X̃) (Zhang et al., 2018) or P̃ = g_W(Z̃) (Verma et al., 2019). Then, the average cross-entropy H(Ỹ, P̃) (1) between the predicted probabilities P̃ and interpolated targets Ỹ is minimized. The number of interpolated data is m, the same as the original mini-batch data.

3.2. MULTIMIX

Interpolation Given a mini-batch of m examples with embeddings Z and targets Y, we draw interpolation vectors λ_k ∼ Dir(α) for k = 1, ..., n, where Dir(α) is the symmetric Dirichlet distribution and

Figure 2: Dense MultiMix (subsection 3.4) for the special case m = 2 (two examples), n = 1 (one interpolated embedding), r = 9 (spatial resolution 3 × 3). The embeddings z_1, z_2 ∈ R^{d×9} of input images x_1, x_2 are extracted by the encoder f_θ. Attention maps a_1, a_2 ∈ R^9 are extracted (10), multiplied element-wise with interpolation vectors λ, (1 − λ) ∈ R^9 (11) and ℓ1-normalized per spatial position (12). The resulting weights are used to form the interpolated embedding z̃ ∈ R^{d×9} as a convex combination of z_1, z_2 per spatial position (13). Targets are interpolated similarly (14).

Figure 3: Ablation study of MultiMix and its variants on CIFAR-100 using R-18. (a) Interpolation layers (R-18 block; 0: input mixup). (b) Number of tuples n. (c) Dirichlet parameter α.

Figure 5: Training speed (images/sec) of MultiMix and its variants vs. number of tuples n on CIFAR-100 using R-18. Measured on NVIDIA RTX 2080 TI GPU, including forward and backward pass.

Other definitions of interpolation include the combination of content and style from two images (Hong et al., 2021) and the spatial alignment of dense features (Venkataramanan et al., 2022). Our dense MultiMix variant also uses dense features but without aligning them, hence it can mix a very large number of images and generate a large number of interpolated samples. Our work is orthogonal to these methods, as we focus on the sampling process of augmentation rather than on the definition of interpolation.

Mixup: sampling To the best of our knowledge, the only methods that interpolate more than two examples for image classification are OptTransMix (Zhu et al., 2020), SuperMix (Dabouei et al., 2021) and ζ-Mixup (Abhishek et al., 2022). All three methods operate in the input space and limit the number of interpolated examples to the mini-batch size m, whereas our MultiMix generates an arbitrary number of interpolated examples (n = 1000 in practice) in the embedding space. To determine the interpolation weights, OptTransMix involves a complex optimization process and only applies to images with a clean background; ζ-Mixup uses random permutations of a fixed vector; and SuperMix uses a Dirichlet distribution over not more than 3 samples in practice. We also sample weights from the Dirichlet distribution, but interpolate as many examples as the mini-batch, m.

We use PreActResnet-18 (R-18) (He et al., 2016) and WRN16-8 (Zagoruyko & Komodakis, 2016).

Table 1: Image classification on CIFAR-10/100 and TI (TinyImagenet). Mean and standard deviation of top-1 accuracy (%) for 5 runs. R: PreActResnet, W: WRN. ⋆: reproduced, †: reported by AlignMixup. Bold black: best; blue: second best; underline: best baseline. Gain: improvement over best baseline. Comparison with additional baselines is given in subsection A.2.

Image classification and training speed on ImageNet. Top-1 accuracy (%): higher is better.



Robustness to FGSM & PGD attacks. Mean and standard deviation of Top-1 error (%) for 5 runs: lower is better. ⋆: reproduced, †: reported by AlignMixup. Bold black: best; Blue: second best; underline: best baseline. Gain: reduction of error over best baseline. TI: TinyImageNet. R: PreActResnet, W: WRN. Comparison with additional baselines is given in subsection A.2.

In terms of training speed, the vanilla MultiMix is on par with the baseline, bringing a gain of 2.49%. The addition of distillation is on par with SoTA AlignMixup, bringing a gain of 0.80%. Adding both dense and distillation brings a gain of 0.89% over AlignMixup, while being 19.4% slower. The inference speed is the same for all methods.



Image classification and training speed on ImageNet. Top-1 accuracy (%): higher is better. Speed: images/sec (×10³): higher is better. *: reproduced with same teacher and student model, †: reported by AlignMixup. Bold black: best; Blue: second best; underline: best baseline. Gain: improvement over best baseline.

A.3 MORE RESULTS: OUT-OF-DISTRIBUTION DETECTION

This is a standard benchmark for evaluating over-confidence. Here, in-distribution (ID) examples are those on which the network has been trained, and out-of-distribution (OOD) examples are drawn from any other distribution. Given a mixture of ID and OOD examples, the network should predict an ID example with high confidence and an OOD example with low confidence, i.e., the confidence of the predicted class should be below a certain threshold for OOD examples. Following AlignMixup (Venkataramanan et al., 2022), we compare MultiMix and its variants with SoTA methods trained using R-18 on CIFAR-100 as ID examples, while drawing OOD examples from LSUN (Yu et al., 2015), SUN (Xiao et al., 2010) and TI. We use detection accuracy, area under the ROC curve (AuROC) and area under the precision-recall curve (AuPR) as evaluation metrics. In Table 8, we observe that MultiMix and its variants outperform SoTA on all datasets and metrics by a large margin. Although the gain of vanilla MultiMix and dense MultiMix over SoTA mixup methods is small on image classification, these variants significantly reduce over-confident incorrect predictions and achieve superior performance on out-of-distribution detection.

A.4 ANALYSIS OF THE EMBEDDING SPACE

Qualitative analysis  We qualitatively analyze the embedding space on 10 CIFAR-100 classes in Figure 4. We observe that the quality of embeddings of the baseline is extremely poor, with severely overlapping classes, which explains its poor performance on image classification. All mixup methods result in clearly better clustered and more uniformly spread classes. Manifold mixup (Verma et al., 2019) produces five tightly clustered classes, but the other five are still severely overlapping. SaliencyMix (Uddin et al., 2021) and AlignMixup (Venkataramanan et al., 2022) yield four somewhat clustered classes and six moderately overlapping ones. Our best setting, i.e., dense MultiMix with distillation, results in five tightly clustered classes and another five that overlap somewhat, but less than in all competitors. More plots, including variants of MultiMix, are given in the supplementary material.

Quantitative analysis  We also quantitatively assess the embedding space on the CIFAR-100 test set using alignment and uniformity (Wang & Isola, 2020). Alignment measures the expected pairwise distance of examples in the same class; lower alignment indicates that the classes are more tightly clustered. Uniformity measures the (log of the) expected pairwise similarity of all examples using a Gaussian kernel as a similarity function; lower uniformity indicates that the classes are more uniformly spread in the embedding space. On CIFAR-100, we obtain alignment 3.02 for the baseline, 1.27 for Manifold Mixup (Verma et al., 2019), 2.44 for SaliencyMix (Uddin et al., 2021), 2.04 for AlignMixup and 0.92 for dense MultiMix with distillation. We also obtain uniformity −1.94 for the baseline, −2.38 for Manifold Mixup (Verma et al., 2019), −2.82 for SaliencyMix (Uddin et al., 2021), −4.77 for AlignMixup (Venkataramanan et al., 2022) and −5.68 for dense MultiMix with distillation.
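The two metrics above can be computed in a few lines. Below is a minimal numpy sketch following the Wang & Isola (2020) definitions (squared Euclidean distance for alignment, a Gaussian kernel with t = 2 for uniformity); the function names are ours:

```python
import numpy as np

def alignment(x, y):
    """Mean squared pairwise distance between examples of the same class.

    Lower values indicate more tightly clustered classes."""
    total, count = 0.0, 0
    for c in np.unique(y):
        z = x[y == c]
        n = len(z)
        sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
        i, j = np.triu_indices(n, k=1)
        total += sq[i, j].sum()
        count += n * (n - 1) // 2
    return total / count

def uniformity(x, t=2.0):
    """Log of the mean Gaussian-kernel similarity over all pairs.

    Lower values indicate examples more uniformly spread in the space."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    i, j = np.triu_indices(len(x), k=1)
    return np.log(np.exp(-t * sq[i, j]).mean())
```

Tighter clusters drive alignment down, while classes that are spread far apart drive uniformity down, which is the direction of the improvements reported above.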

Image classification on CIFAR-100 using R-18: The effect of distillation, dense loss and both on SoTA mixup methods. Top-1 accuracy (%): higher is better. * : 'vanilla' refers to teacher pre-training and 'distil' to self-distillation where teacher and student are trained concurrently from scratch. ‡ : Instead of dense MultiMix, we only apply the loss densely.


APPENDIX A: MORE EXPERIMENTS

A.1 MORE ON SETUP

Settings and hyperparameters  We train MultiMix and its variants with mixed examples only. We use a mini-batch of size m = 128 examples in all experiments. For every mini-batch, we apply MultiMix with probability 0.5 or input mixup otherwise. For input mixup, we interpolate the standard m pairs (5). For MultiMix, we use the entire network as the encoder f_θ by default, except for the last fully-connected layer, which we use as the classifier g_W. We use n = 1000 tuples and draw a different α ∼ U[0.5, 2.0] for each interpolation vector sampled from the Dirichlet distribution by default. For multi-GPU experiments, all training hyperparameters including m and n are per GPU.

For dense MultiMix, the spatial resolution is 4 × 4 (r = 16) on CIFAR-10/100 and 7 × 7 (r = 49) on ImageNet by default. We obtain the attention map by (10), using GAP for the vector u and ReLU followed by ℓ1 normalization as the non-linearity h by default. To predict class probabilities and compute the loss densely, we use the classifier g_W as a 1×1 convolution by default; when interpolating at earlier layers, we follow the process described in subsection 3.4. For distillation, both the teacher and student networks have the same architecture. By default, we use γ = 1/2 in (9), that is, equal contribution of original labels and teacher predictions.

CIFAR-10/100 training  Following the experimental settings of AlignMixup (Venkataramanan et al., 2022), we train MultiMix and its variants using SGD for 2000 epochs, using the same random seed as AlignMixup. We set the initial learning rate to 0.1 and decay it by a factor of 0.1 every 500 epochs. The momentum is set to 0.9 and the weight decay to 0.0001. We use a batch size of m = 128 and train on a single NVIDIA RTX 2080 TI GPU for 10 hours.

TinyImageNet training  Following the experimental settings of PuzzleMix (Kim et al., 2020), we train MultiMix and its variants using SGD for 1200 epochs, using the same random seed as AlignMixup. We set the initial learning rate to 0.1 and decay it by a factor of 0.1 after 600 and 900 epochs. The momentum is set to 0.9 and the weight decay to 0.0001. We train on two NVIDIA RTX 2080 TI GPUs for 18 hours.

ImageNet training  Following the experimental settings of PuzzleMix (Kim et al., 2020), we train MultiMix and its variants using the same random seed as AlignMixup. We train R-50 using SGD with momentum 0.9 and weight decay 0.0001, and ViT-S/16 using AdamW with default parameters. The initial learning rate is set to 0.1 and 0.01, respectively. We decay the learning rate by 0.1 at 100 and 200 epochs. We train on 32 NVIDIA V100 GPUs for 20 hours.
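The default MultiMix sampling described above can be sketched as follows. This is a minimal numpy illustration, not the training code: the embeddings and classifier are random stand-ins for f_θ and g_W, and a fresh α ∼ U[0.5, 2.0] is drawn per interpolation vector:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, c, n = 128, 64, 10, 1000   # mini-batch size, embedding dim, classes, tuples

z = rng.standard_normal((m, d))            # stand-in for embeddings f_theta(x)
y = np.eye(c)[rng.integers(0, c, size=m)]  # one-hot targets of the mini-batch

# One interpolation vector per tuple, each spanning all m examples;
# a different Dirichlet parameter alpha ~ U[0.5, 2.0] per vector.
alphas = rng.uniform(0.5, 2.0, size=n)
lam = np.stack([rng.dirichlet(np.full(m, a)) for a in alphas])  # (n, m)

z_mix = lam @ z   # (n, d): n interpolated embeddings from one mini-batch
y_mix = lam @ y   # (n, c): the correspondingly interpolated soft targets

W = 0.01 * rng.standard_normal((d, c))     # stand-in for classifier g_W
logits = z_mix @ W                         # scored against y_mix by the loss
```

Because interpolation happens on d-dimensional embeddings at the very last layer, raising the number of tuples from m to n = 1000 adds only a few (n × m)-sized matrix products rather than n extra forward passes.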

Tasks and metrics

We use top-1 error (%, lower is better) or top-1 accuracy (%, higher is better) as evaluation metric on image classification and robustness to adversarial attacks (subsection 4.2 and subsection A.2). Additional datasets and metrics are reported separately for transfer learning to object detection (subsection 4.3) and out-of-distribution detection (subsection A.3).

A.2 MORE RESULTS: CLASSIFICATION AND ROBUSTNESS

Using the experimental settings of subsection A.1, we extend Table 1 and Table 2 with additional baselines.

Dense MultiMix: Interpolation of targets  When interpolating targets densely, assuming the same target of the entire example at every spatial position naively implies that the object of interest is present everywhere, whereas spatial attention provides a better hint as to where the object may really be.

Dense MultiMix: Spatial resolution  We study the effect of spatial resolution on dense MultiMix. By default, we use a resolution of 4 × 4 at the last residual block of R-18 on CIFAR-100. Here, we additionally investigate 1 × 1 (downsampling by average pooling with kernel size 4, same as GAP), 2 × 2 (downsampling by average pooling with kernel size 2) and 8 × 8 (upsampling by using stride 1 in the last residual block). We measure accuracy 81.07% for spatial resolution 1 × 1, 81.43% for 2 × 2, 81.88% for 4 × 4 and 80.83% for 8 × 8. We thus observe that performance improves with spatial resolution up to 4 × 4, which is optimal, and then drops at 8 × 8. This drop may be due to assuming the same target at each spatial position. The resolution 8 × 8 is also more expensive computationally.
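The downsampling used in this ablation is plain average pooling. A minimal numpy sketch (the helper name is ours), showing that pooling a 4 × 4 map with kernel size 4 coincides with GAP:

```python
import numpy as np

def avg_pool(feat, k):
    """Average-pool a (C, H, W) feature map with kernel size and stride k."""
    C, H, W = feat.shape
    return feat.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 4, 4))  # last-block features of R-18 on CIFAR

r2 = avg_pool(feat, 2)   # 2 x 2 resolution
r1 = avg_pool(feat, 4)   # 1 x 1 resolution, identical to GAP
```

Each pooled cell averages a k × k block, so the 1 × 1 case reduces every channel to its global mean, which is exactly why the 1 × 1 setting above behaves like the non-dense (GAP) variant.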

