TEACH ME HOW TO INTERPOLATE A MYRIAD OF EMBEDDINGS

Abstract

Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Yet, its extensions focus on the definition of interpolation and the space where it takes place, while the augmentation itself is less studied: For a mini-batch of size m, most methods interpolate between m pairs with a single scalar interpolation factor λ. In this work, we make progress in this direction by introducing MultiMix, which interpolates an arbitrary number n of tuples, each of length m, with one vector λ per tuple. On sequence data, we further extend to dense interpolation and loss computation over all spatial positions. Overall, we increase the number of tuples per mini-batch by orders of magnitude at little additional cost. This is possible by interpolating at the very last layer before the classifier. Finally, to address inconsistencies due to linear target interpolation, we introduce a self-distillation approach to generate and interpolate synthetic targets. We empirically show that our contributions result in significant improvement over state-of-the-art mixup methods on four benchmarks. By analyzing the embedding space, we observe that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.

1. INTRODUCTION

Mixup (Zhang et al., 2018) is a data augmentation method that interpolates between pairs of training examples, thus regularizing a neural network to favor linear behavior in-between examples. Besides improving generalization, it has important properties such as reducing overconfident predictions and increasing robustness to adversarial examples. Several follow-up works have studied interpolation in the latent or embedding space, which is equivalent to interpolating along a manifold in the input space (Verma et al., 2019), as well as a number of nonlinear and attention-based interpolation mechanisms (Yun et al., 2019; Kim et al., 2020; 2021; Uddin et al., 2021; Hong et al., 2021). However, little progress has been made in the augmentation process itself, i.e., the number of examples being interpolated and the number of interpolated examples being generated.

Mixup was originally motivated as a way to go beyond empirical risk minimization (ERM) (Vapnik, 1999) through a vicinal distribution expressed as an expectation over an interpolation factor λ, which is equivalent to the set of linear segments between all pairs of training inputs and targets. In practice, however, in every training iteration, a single scalar λ is drawn and the number of interpolated pairs is limited to the size of the mini-batch, as illustrated in Figure 1(a). This is because, if interpolation takes place in the input space, increasing the number of examples per iteration would be expensive. To our knowledge, these limitations exist in all mixup methods.

In this work, we argue that a data augmentation process should augment the data seen by the model, or at least by its last few layers, as much as possible. In this sense, we follow manifold mixup (Verma et al., 2019) and generalize it in a number of ways to introduce MultiMix, as illustrated in Figure 1(b). First, rather than pairs, we interpolate tuples that are as large as the mini-batch.
Effectively, instead of sampling on linear segments between pairs of examples in the mini-batch, we sample on their entire convex hull. Second, we draw a different vector λ for each tuple. Third, and most important, we increase the number of interpolated tuples per iteration by orders of magnitude while only slightly decreasing the actual training throughput in examples per second. This is possible by interpolating at the deepest layer possible, i.e., just before the classifier, which also happens to be the most effective choice. The interpolated embeddings are thus only processed by a single layer.

Apart from increasing the number of examples seen by the model, another idea is to increase the number of loss terms per example. In many modalities of interest, the input is a sequence in one or more dimensions: pixels or patches in images, voxels in video, points or triangles in high-dimensional surfaces, to name a few. The structure of input data is expressed in matrices or tensors, which often preserve a certain spatial resolution until the deepest network layer, before they collapse, e.g., by global average pooling (Szegedy et al., 2015; He et al., 2016) or by taking the output of a classification token (Vaswani et al., 2017; Dosovitskiy et al., 2020). In this sense, we choose to operate at the level of sequence elements rather than representing examples by a single vector. We introduce dense MultiMix, the first approach of this kind in mixup-based data augmentation. In particular, we densely interpolate the embeddings and targets of sequence elements, and we also apply the loss densely, as illustrated in Figure 2. This is an extreme form of augmentation where the number of interpolated tuples and loss terms increases further by one or two orders of magnitude, but at little cost.
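The core interpolation step can be sketched compactly. The following is an illustrative NumPy sketch under our own naming (the function multimix, weight matrix lam, and Dirichlet parameter alpha are ours), not the paper's implementation: given m last-layer embeddings with one-hot targets, it draws one interpolation vector per tuple from a Dirichlet distribution and forms n interpolated embeddings and soft targets on the convex hull of the mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def multimix(z, y, n, alpha=1.0, rng=rng):
    """Interpolate n tuples, each spanning the whole mini-batch.

    z: (m, d) embeddings from the last layer before the classifier.
    y: (m, c) one-hot targets.
    Returns n interpolated embeddings and soft targets, sampled
    on the convex hull of the mini-batch via Dirichlet weights.
    """
    m = z.shape[0]
    # One interpolation vector per tuple: each row of lam sums to 1.
    lam = rng.dirichlet(alpha * np.ones(m), size=n)   # (n, m)
    z_tilde = lam @ z                                 # (n, d)
    y_tilde = lam @ y                                 # (n, c)
    return z_tilde, y_tilde

# Toy usage: m=4 examples, d=8 features, c=3 classes, n=100 tuples.
m, d, c, n = 4, 8, 3, 100
z = rng.normal(size=(m, d))
y = np.eye(c)[rng.integers(0, c, size=m)]
z_t, y_t = multimix(z, y, n)
```

Because each row of lam sums to one, the interpolated targets remain valid probability distributions, and the interpolated embeddings only need to pass through the final classifier layer, which is why n can be orders of magnitude larger than m at little cost.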
Finally, linear interpolation of targets, which is the norm in most mixup variants, has a limitation: Given two examples with different class labels, the interpolated example may actually lie in a region of the feature space associated with a third class, a phenomenon identified as manifold intrusion (Guo et al., 2019). In the absence of any data other than the mini-batch, a straightforward way to address this limitation is to devise targets originating in the network itself. This naturally leads to self-distillation, whereby a moving average of the network acts as a teacher and provides synthetic soft targets (Tarvainen & Valpola, 2017), to be interpolated exactly like the original hard targets.

In summary, we make the following contributions:
1. We introduce MultiMix, which, given a mini-batch of size m, interpolates an arbitrary number n ≫ m of tuples, each of length m, with one interpolation vector λ per tuple, compared with m pairs, all sharing the same scalar λ, in most mixup methods (subsection 3.2).
2. We extend to dense interpolation and loss computation over all spatial positions (subsection 3.4).
3. We use online self-distillation to generate and interpolate soft targets for mixup, compared with linear target interpolation in most mixup methods (subsection 3.3).
4. We improve over state-of-the-art mixup methods on image classification, robustness to adversarial attacks, object detection and out-of-distribution detection (section 4).
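The teacher update in this kind of online self-distillation is a simple exponential moving average, as in Tarvainen & Valpola (2017). The sketch below is ours (the function name ema_update, the dict-of-arrays parameter representation, and the momentum value are illustrative assumptions, not the paper's code):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of the student's parameters.

    teacher, student: dicts mapping parameter names to numpy arrays.
    The teacher slowly tracks the student; its softmax outputs serve
    as soft targets, interpolated exactly like the one-hot targets.
    """
    for name, w in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * w
    return teacher

# Toy usage: one parameter tensor, momentum 0.9 for visibility.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
teacher = ema_update(teacher, student, momentum=0.9)
```

Since the teacher's soft targets reflect what the network itself predicts in a given region of the embedding space, interpolating them instead of hard labels mitigates manifold intrusion.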

2. RELATED WORK

Mixup: interpolation methods In general, mixup interpolates between pairs of input examples (Zhang et al., 2018) or embeddings (Verma et al., 2019) and their corresponding target labels. Several follow-up methods mix input images according to spatial position, either at random rectangles (Yun et al., 2019) or based on attention (Uddin et al., 2021; Kim et al., 2020; 2021), in an attempt to focus on a different object in each image. We also use attention in our dense MultiMix variant, but



Figure 1: Data augmentation from a mini-batch B consisting of m = 10 points in two dimensions. (a) mixup: sampling of m points on linear segments between m pairs of points in B, using the same interpolation factor λ. (b) MultiMix: sampling of n = 300 points in the convex hull of B.
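The two sampling schemes of the figure can be reproduced in a few lines. This is an illustrative sketch with our own variable names (B for the mini-batch, W for the weight matrix), using the figure's m = 10 and n = 300:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 300
B = rng.normal(size=(m, 2))              # mini-batch: m points in 2D

# (a) mixup: m points on segments between random pairs of B,
# all sharing a single scalar interpolation factor lambda.
lam = rng.beta(1.0, 1.0)
mix = lam * B + (1.0 - lam) * B[rng.permutation(m)]   # (m, 2)

# (b) MultiMix: n points in the convex hull of B, one Dirichlet
# weight vector (a row of W, summing to 1) per sampled point.
W = rng.dirichlet(np.ones(m), size=n)    # (n, m)
hull = W @ B                             # (n, 2)
```

Every row of W sums to one, so each of the n MultiMix samples is a convex combination of the whole mini-batch rather than of a single pair.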

