AN ENSEMBLE VIEW ON MIXUP

Abstract

Deep ensembles are widely used to improve the generalization, calibration, uncertainty estimates and adversarial robustness of neural networks. In parallel, the data augmentation technique of mixup has grown popular for the very same reasons. Could these two techniques be related? This work suggests that both implement a similar inductive bias to "linearize" decision boundaries. We show how to obtain diverse predictions from a single mixup machine by interpolating a test instance with multiple reference points. These "mixup ensembles" are cheap: one needs to train and store one single model, as opposed to the K independent members forming a deep ensemble. Motivated by the limitations of ensembles to model uncertainty far away from the training data, we propose a variant of mixup that builds augmented examples using both random interpolations and extrapolations of examples. We evaluate the efficacy of our proposed methods across a variety of in-domain and out-domain metrics on the CIFAR-10 and CIFAR-10-NEG datasets.

1. INTRODUCTION

In an unprecedented tour de force, deep learning has surpassed human performance on a variety of applications such as image classification, speech recognition, natural language processing, and game playing (LeCun et al., 2015; Silver et al., 2016; Goodfellow et al., 2016). While this is nothing short of an incredible achievement, there is a growing interest in the research community in evaluating machine learning models beyond average accuracy (Stock and Cisse, 2018). This is because, regardless of their impressive in-domain accuracy, deep learning models crumble under distribution shifts (Arjovsky et al., 2020; Gulrajani and Lopez-Paz, 2020), have limited ability to say "I don't know" (Tagasovska and Lopez-Paz, 2019; Belghazi and Lopez-Paz, 2021), and are fooled by "adversarial" perturbations imperceptible to humans (Goodfellow et al., 2015). In sum, we have built machines that are extremely predictive under the conditions they are trained on, but unreliable under changing circumstances.

On a separate thread of research, the regularization technique mixup (Zhang et al., 2018) has risen in popularity for exhibiting the same benefits as deep ensembles. In a nutshell, mixup trains neural network classifiers on convex combinations of random pairs of examples, and the corresponding label mixtures. More specifically, deep neural networks trained with mixup achieve better generalization (Chun et al., 2020; Zhang et al., 2021a), calibration (Thulasidasan et al., 2020; Zhang et al., 2021b), uncertainty estimates (Lee et al., 2021), and adversarial robustness (Pang et al., 2020; Archambault et al., 2019; Zhang et al., 2021a; Lamb et al., 2021).

In this work, we take the parallels above as hints to consider mixup as an implicit ensemble method. Our main intuition is summarized in Figure 1. In the zero-training-error case, the members of a deep
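The mixup recipe described above can be sketched in a few lines. The function below is an illustrative numpy implementation, not the authors' code: it samples a mixing weight from a Beta(α, α) distribution, pairs each example with a random partner, and forms the same convex combination of inputs and one-hot labels. The function name, the per-example mixing weights, and the random pairing via a permutation are implementation choices assumed here.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Build a mixup batch: convex combinations of random example pairs.

    x: (n, d) array of inputs; y: (n, c) array of one-hot labels.
    `alpha` is the Beta concentration parameter from the mixup paper.
    """
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    lam = rng.beta(alpha, alpha, size=(n, 1))  # one mixing weight per pair
    perm = rng.permutation(n)                  # random partner for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]    # mixed inputs
    y_mix = lam * y + (1.0 - lam) * y[perm]    # correspondingly mixed labels
    return x_mix, y_mix
```

A network is then trained on `(x_mix, y_mix)` with the usual cross-entropy loss; because labels are mixed with the same weights as inputs, the targets remain valid probability vectors.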



Deep ensembles (Lakshminarayanan et al., 2016) are a popular technique to improve the generalization (Wilson and Izmailov, 2022), calibration (Guo et al., 2017), uncertainty estimates (Belghazi and Lopez-Paz, 2021; Ovadia et al., 2019), and adversarial robustness (Pang et al., 2019; Abbasi et al., 2020; Adam and Speciel, 2020) of deep neural network models. To construct a deep ensemble, practitioners (i) initialize a collection of K neural networks at random, called ensemble members; (ii) train them all to convergence on the same data; (iii) use the average of the outputs of the K trained neural networks as the prediction rule. As such, deep ensembles are K times more expensive than single neural network models, both at training and evaluation time. Moreover, while the uncertainty estimates provided by deep ensembles result in better-calibrated decision boundaries in-between classes, how to signal low confidence far away from the training data remains an open question (Amersfoort et al., 2020).
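The three-step recipe (i)–(iii) can be made concrete with a toy sketch. The snippet below stands in a one-layer softmax classifier for a deep network so that it stays self-contained; the class and function names are illustrative, and the tiny gradient-descent loop is only a placeholder for "train to convergence". The key point is step (iii): the ensemble prediction is the average of the members' output probabilities.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class TinyNet:
    """Toy linear-softmax classifier standing in for a deep network."""

    def __init__(self, d, c, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(d, c))  # (i) random initialization

    def fit(self, x, y, lr=0.5, steps=200):         # (ii) train on the same data
        for _ in range(steps):
            p = softmax(x @ self.W)
            self.W -= lr * x.T @ (p - y) / len(x)   # cross-entropy gradient step
        return self

    def predict_proba(self, x):
        return softmax(x @ self.W)

def ensemble_predict(members, x):
    """(iii) prediction rule: average the members' output probabilities."""
    return np.mean([m.predict_proba(x) for m in members], axis=0)
```

Because each member must be trained and evaluated separately, both costs in this sketch scale linearly with K, which is exactly the K-fold overhead noted above.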

