AN ENSEMBLE VIEW ON MIXUP

Abstract

Deep ensembles are widely used to improve the generalization, calibration, uncertainty estimates, and adversarial robustness of neural networks. In parallel, the data augmentation technique of mixup has grown popular for the very same reasons. Could these two techniques be related? This work suggests that both implement a similar inductive bias to "linearize" decision boundaries. We show how to obtain diverse predictions from a single mixup machine by interpolating a test instance with multiple reference points. These "mixup ensembles" are cheap: one needs to train and store a single model, as opposed to the K independent members forming a deep ensemble. Motivated by the limitations of ensembles to model uncertainty far away from the training data, we propose a variant of mixup that builds augmented examples using both random interpolations and extrapolations of examples. We evaluate the efficacy of our proposed methods across a variety of in-domain and out-domain metrics on the CIFAR-10 and CIFAR-10-NEG datasets.

1. INTRODUCTION

In an unprecedented tour de force, deep learning has surpassed human performance on a variety of applications such as image classification, speech recognition, natural language processing, and game playing (LeCun et al., 2015; Silver et al., 2016; Goodfellow et al., 2016). While this is nothing short of an incredible achievement, there is growing interest in the research community in evaluating machine learning models beyond average accuracy (Stock and Cisse, 2018). This is because, regardless of their impressive in-domain accuracy, deep learning models crumble under distribution shifts (Arjovsky et al., 2020; Gulrajani and Lopez-Paz, 2020), have a limited ability to say "I don't know" (Tagasovska and Lopez-Paz, 2019; Belghazi and Lopez-Paz, 2021), and are fooled by "adversarial" perturbations imperceptible to humans (Goodfellow et al., 2015). In sum, we have built machines that are extremely predictive under the conditions they were trained on, but unreliable under changing circumstances.

Deep ensembles (Lakshminarayanan et al., 2016) are a popular technique to improve the generalization (Wilson and Izmailov, 2022), calibration (Guo et al., 2017), uncertainty estimates (Belghazi and Lopez-Paz, 2021; Ovadia et al., 2019), and adversarial robustness (Pang et al., 2019; Abbasi et al., 2020; Adam and Speciel, 2020) of deep neural network models. To construct a deep ensemble, practitioners (i) initialize a collection of K neural networks at random, called ensemble members; (ii) train them all to convergence on the same data; and (iii) use as prediction rule the average of the outputs of the K trained neural networks. As such, deep ensembles are K times more expensive than single neural network models, both at training and at evaluation time.
Moreover, while the uncertainty estimates provided by deep ensembles result in better-calibrated decision boundaries in-between classes, how to signal low confidence far away from the training data remains an open question (Amersfoort et al., 2020).

On a separate thread of research, the regularization technique mixup (Zhang et al., 2018) has risen in popularity for exhibiting the same benefits as deep ensembles. In a nutshell, mixup trains neural network classifiers on convex combinations of random pairs of examples and the corresponding label mixtures. More specifically, deep neural networks trained with mixup achieve better generalization (Chun et al., 2020; Zhang et al., 2021a), calibration (Thulasidasan et al., 2020; Zhang et al., 2021b), uncertainty estimates (Lee et al., 2021), and adversarial robustness (Pang et al., 2020; Archambault et al., 2019; Zhang et al., 2021a; Lamb et al., 2021).

In this work, we take the parallels above as hints to consider mixup as an implicit ensemble method. Our main intuition is summarized in Figure 1. In the zero-training-error case, the members of a deep ensemble fluctuate randomly in-between training examples, because each member is initialized at random. After ensembling, these random fluctuations cancel each other out, smoothing (linearizing) decision boundaries and often improving generalization. This is the very inductive bias directly optimized by mixup! In fact, as we increase the number of members in a deep ensemble, the resulting predictor minimizes the mixup loss implicitly, even though each of its members is an ERM model.

Contributions. If our intuitions are correct, how could we leverage a single mixup machine like an ensemble? We propose a recipe that interpolates a single test instance with multiple "reference" examples, providing a multiplicity of predictions (Section 3). Mixup ensembles are K times cheaper than deep ensembles, as only one machine is trained and deployed.
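The mixup augmentation described above is simple to implement. The following is a minimal NumPy sketch of the batch-mixing step, with a Beta(α, α)-distributed mixing weight as in Zhang et al. (2018); the function name and interface are illustrative, not the exact pipeline used in our experiments.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Build a mixup batch: convex combinations of random pairs of
    examples and of their one-hot labels (Zhang et al., 2018).

    x        : (n, d) array of inputs
    y_onehot : (n, C) array of one-hot labels
    alpha    : concentration of the Beta(alpha, alpha) mixing weight
    """
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    lam = rng.beta(alpha, alpha)       # mixing weight, lies in [0, 1]
    perm = rng.permutation(n)          # random partner for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# the classifier is then trained with the usual cross-entropy loss,
# using (x_mix, y_mix) in place of the raw batch
```

Because the labels are mixed with the same weight as the inputs, the network is explicitly encouraged to behave linearly along the segments joining pairs of training examples.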
We also discuss the limitations of deep ensembles when it comes to detecting out-of-distribution examples that lie far away from the training data. Motivated by this shortcoming, we propose Ex-mixup, a variant of mixup that constructs augmented points by allowing both the interpolation and the extrapolation of pairs of training examples (Section 4). We conclude with a series of numerical experiments verifying the efficacy of the proposed methods (Section 5) and some concluding thoughts (Section 6).
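To make the extrapolation idea concrete, one plausible sketch is to stretch the mixing weight beyond [0, 1], so that weights outside the unit interval produce points past either endpoint of the segment joining two examples. The widened sampling range and the label handling below are assumptions for illustration; the exact Ex-mixup formulation is given in Section 4.

```python
import numpy as np

def ex_mixup_batch(x, y_onehot, alpha=1.0, ext=0.5, rng=None):
    """Sketch of an extrapolating mixup variant (hypothetical details).

    The Beta(alpha, alpha) weight is rescaled from [0, 1] to the
    widened range [-ext, 1 + ext]; weights outside [0, 1] extrapolate
    beyond the pair of examples rather than interpolating between them.
    """
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    lam = rng.beta(alpha, alpha)              # in [0, 1]
    lam = lam * (1.0 + 2.0 * ext) - ext       # stretched to [-ext, 1 + ext]
    perm = rng.permutation(n)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    # label mixtures extrapolated outside the simplex are clipped and
    # renormalized so they remain valid soft targets
    y_mix = np.clip(y_mix, 0.0, None)
    y_mix /= y_mix.sum(axis=1, keepdims=True)
    return x_mix, y_mix
```

Setting `ext=0.0` recovers standard mixup, so the two variants can share one implementation.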

2. BASICS ON DEEP ENSEMBLES

Let us start our exposition by providing the necessary background on deep ensembles (Lakshminarayanan et al., 2016). Constructing a deep ensemble of neural networks proceeds as follows:

1. Instantiate K neural networks of equal architecture at different random initializations $\{\theta_1, \ldots, \theta_K\}$. These neural networks, denoted by $\{f_{\theta_1}, \ldots, f_{\theta_K}\}$, are referred to as the members of the ensemble.
2. Train the K neural networks via Empirical Risk Minimization (ERM) on the training data.
3. Deploy as the final prediction rule the average of the K neural networks, $f(x) = \frac{1}{K} \sum_{k=1}^{K} f_{\theta_k}(x)$. Use "one minus the largest softmax score", $1 - \max_c f(x)_c$, as a measure of predictive uncertainty (Hendrycks and Gimpel, 2016; Lakshminarayanan et al., 2016).

Without loss of generality, the sequel considers classification problems with C classes. Thus, ensemble members, as well as the ensemble itself, output C-dimensional softmax vectors belonging to the simplex $\Delta_C$. When using modern neural network architectures, one can assume that each of the K members in the ensemble achieves zero training error (Zhang et al., 2017). However, since the members are initialized at random, they may disagree in those regions of the input space where the labeling is noisy (high aleatoric uncertainty) or where training data was lacking (high epistemic uncertainty) (Kendall and Gal, 2017; Hüllermeier and Waegeman, 2020; Tagasovska and Lopez-Paz, 2019). Therefore, the disagreement between members carries information about where the ensemble should be uncertain.
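The deployment step above reduces to averaging softmax vectors and reading off the top score. A minimal NumPy sketch, assuming the K members' softmax outputs have already been computed and stacked:

```python
import numpy as np

def ensemble_predict(member_probs):
    """Combine the outputs of a deep ensemble.

    member_probs : (K, n, C) array holding the C-dimensional softmax
                   vectors of each of the K members on n test inputs.

    Returns the averaged softmax vectors f(x) = (1/K) sum_k f_{theta_k}(x)
    and the uncertainty score 1 - max_c f(x)_c per input.
    """
    probs = member_probs.mean(axis=0)          # (n, C) ensemble prediction
    uncertainty = 1.0 - probs.max(axis=1)      # (n,) high value = low confidence
    return probs, uncertainty
```

For instance, averaging three members' outputs on two inputs, where the members agree on the first input but split on the second, yields a low uncertainty score for the first and a high one for the second.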

Figure 1: Deep ensembles and mixup exhibit smoother decision boundaries than ERM (first three panels). On the one hand, ensembles "linearize" their decision boundaries by averaging zero-training-error predictors that fluctuate randomly in-between training examples. On the other hand, mixup optimizes for this linear behavior explicitly during training. As we add more members to a deep ensemble, the resulting predictor exhibits a lower mixup loss, even though it has never been trained to minimize such a statistic (fourth panel). Do ensembles and mixup improve generalization because of the same inductive bias?

