COMBINING ENSEMBLES AND DATA AUGMENTATION CAN HARM YOUR CALIBRATION

Abstract

Ensemble methods, which average over multiple neural network predictions, are a simple approach to improve a model's calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby improved accuracy from combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods which marginalize over sets of weights and data augmentation techniques which soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces a new state of the art in uncertainty calibration across CIFAR-10, CIFAR-100, and ImageNet.¹

1. INTRODUCTION

Many success stories in deep learning (Krizhevsky et al., 2012; Sutskever et al., 2014) are in restricted settings where predictions are only made for inputs similar to the training distribution. In real-world scenarios, neural networks can face truly novel data points during inference, and in these settings it can be valuable to have good estimates of the model's uncertainty. For example, in healthcare, reliable uncertainty estimates can prevent over-confident decisions for rare or novel patient conditions (Dusenberry et al., 2019). We highlight two recent trends obtaining state-of-the-art results on uncertainty and robustness benchmarks.

Ensemble methods are a simple approach to improve a model's calibration and robustness (Lakshminarayanan et al., 2017). The same network architecture optimized with different initializations can converge to different functional solutions, leading to decorrelated prediction errors. By averaging predictions, ensembles can rule out individual mistakes (Lakshminarayanan et al., 2017; Ovadia et al., 2019). Additional work has gone into efficient ensembles such as MC-dropout (Gal and Ghahramani, 2016), BatchEnsemble, and its variants (Wen et al., 2020; Dusenberry et al., 2020; Wenzel et al., 2020). These methods significantly improve calibration and robustness while adding few parameters to the original model.

Data augmentation is an approach which is, in principle, orthogonal to ensembles, encoding additional priors in the form of invariant feature transformations. Intuitively, data augmentation enables the model to train on more data, encouraging the model to capture certain invariances with respect to its inputs and outputs; data augmentation may also produce data that is closer to an out-of-distribution target task. It has been a key factor driving the state of the art: for example, Mixup (Zhang et al., 2018; Thulasidasan et al., 2019a), AugMix (Hendrycks et al., 2020), and test-time data augmentation (Ashukha et al., 2020).
A common wisdom in the community suggests that ensembles and data augmentation should naturally combine. For example, the majority of uncertainty models in vision with strong performance are built upon baselines leveraging standard data augmentation (He et al., 2016; Hendrycks et al., 2020) (e.g., random flips, cropping); Hafner et al. (2018) cast data augmentation as an explicit prior for Bayesian neural networks, treating it as beneficial when ensembling; and Hendrycks et al. (2020) highlight further improved results in AugMix when combined with Deep Ensembles (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995). However, we find the complementary benefits between data augmentation and ensembles are not universally true. Section 3.1 illustrates the poor calibration of combining ensembles (MC-dropout, BatchEnsemble, and Deep Ensembles) with Mixup on CIFAR: the model outputs excessively low confidence. Motivated by this pathology, we investigate in more detail why this happens and propose a method to resolve it.

Contributions. In contrast to prior work, which finds individually that ensembles and Mixup improve calibration, we find that combining ensembles and Mixup consistently degrades calibration performance across three ensembling techniques. From a detailed analysis, we identify a compounding under-confidence, where the soft labels in Mixup introduce a negative confidence bias that hinders its combination with ensembles. We further find this to be true for other label-based strategies such as label smoothing. Finally, we propose CAMixup to correct this bias, pairing well with ensembles. CAMixup produces new state-of-the-art calibration on CIFAR-10/100 (e.g., 0.4% and 2.3% on CIFAR-10 and CIFAR-10C), building on Wide ResNet 28-10 with competitive accuracy (e.g., 97.5% and 89.8%), and on ImageNet (1.5%), building on ResNet-50 with competitive accuracy (77.4%).
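To make the label-softening effect concrete, the following is a minimal NumPy sketch of Mixup (Zhang et al., 2018), not the paper's implementation; the function name, `alpha` default, and batch layout are illustrative. Each mixed example carries a soft label that is a convex combination of two one-hot labels, which is the source of the confidence bias discussed above.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Mixup: form convex combinations of random input pairs and their labels.

    x: (batch, ...) inputs; y_onehot: (batch, num_classes) one-hot labels.
    Returns mixed inputs and soft labels (rows still sum to 1).
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # random pairing of examples
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]  # soft labels
    return x_mix, y_mix
```

Training on `y_mix` rather than hard labels is what pushes the model toward lower-confidence predictions; with small `alpha`, `lam` concentrates near 0 or 1 and the softening is mild.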

2. BACKGROUND ON CALIBRATION, ENSEMBLES AND DATA AUGMENTATION

2.1 CALIBRATION

Uncertainty estimation is critical, but ground truth is difficult to obtain for measuring performance. Fortunately, calibration error, which assesses how well a model reliably forecasts its predictions over a population, helps address this. Let $(\hat{Y}, \hat{P})$ denote the class prediction and associated confidence (predicted probability) of a classifier.

Expected Calibration Error (ECE): One notion of miscalibration is the expected difference between confidence and accuracy (Naeini et al., 2015):

$$\mathbb{E}_{\hat{P}}\left[\left|\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) - p\right|\right].$$

ECE approximates this by binning the predictions in $[0, 1]$ under $M$ equally-spaced intervals, and then taking a weighted average of each bin's accuracy/confidence difference. Let $B_m$ be the set of examples in the $m$-th bin whose predicted confidence falls into the interval $\left(\frac{m-1}{M}, \frac{m}{M}\right]$. The bin $B_m$'s accuracy and confidence are

$$\mathrm{Acc}(B_m) = \frac{1}{|B_m|} \sum_{x_i \in B_m} \mathbb{1}(\hat{y}_i = y_i), \qquad \mathrm{Conf}(B_m) = \frac{1}{|B_m|} \sum_{x_i \in B_m} \hat{p}_i,$$

where $\hat{y}_i$ and $y_i$ are the predicted and true labels and $\hat{p}_i$ is the confidence for example $x_i$. Given $n$ examples, ECE is

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\mathrm{Acc}(B_m) - \mathrm{Conf}(B_m)\right|.$$
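The binned ECE computation can be sketched directly from these definitions (a minimal NumPy illustration; the function name and the 15-bin default are our choices, not prescribed by the paper):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, num_bins=15):
    """ECE: weighted average over bins of |Acc(B_m) - Conf(B_m)|."""
    confidences = np.asarray(confidences, dtype=float)
    accuracies = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Bin B_m: examples whose confidence falls in ((m-1)/M, m/M].
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = accuracies[in_bin].mean()    # Acc(B_m)
            conf = confidences[in_bin].mean()  # Conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

An under-confident model (accuracy above confidence in most bins), such as the ensemble-plus-Mixup combinations studied here, incurs ECE just as an over-confident one does, since the per-bin gap is taken in absolute value.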

2.2 ENSEMBLES

Aggregating the predictions of multiple models into an ensemble is a well-established strategy to improve generalization (Hansen and Salamon, 1990; Perrone and Cooper, 1992; Dietterich, 2000).

BatchEnsemble: BatchEnsemble takes a network architecture and shares its parameters across ensemble members, adding only a rank-1 perturbation for each layer in order to decorrelate member predictions (Wen et al., 2020). For a given layer, define the weight matrix shared among $K$ ensemble members as $W \in \mathbb{R}^{m \times d}$. A tuple of trainable vectors $r_k \in \mathbb{R}^m$ and $s_k \in \mathbb{R}^d$ is associated with each ensemble member $k$. The weight matrix for ensemble member $k$ is $W_k = W \circ F_k$, where $F_k = r_k s_k^\top \in \mathbb{R}^{m \times d}$ and $\circ$ denotes the element-wise product. Applying rank-1 perturbations via $r_k$ and $s_k$ adds few additional parameters to the overall model. We use an ensemble size of 4 in all experiments.

MC-Dropout: Gal and Ghahramani (2016) interpret Dropout (Srivastava et al., 2014) as an ensemble model, leading to its application for uncertainty estimation by sampling multiple dropout masks at test time and ensembling the resulting predictions. We use an ensemble size of 20 in all experiments.

Deep Ensembles: Composing an ensemble of models, each trained with a different random initialization, provides diverse predictions (Fort et al., 2019) which have been shown to outperform strong baselines.
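The BatchEnsemble layer above can be sketched as follows (a minimal NumPy sketch under our own shape conventions, not the reference implementation; the function name and dense-layer setting are illustrative):

```python
import numpy as np

def batch_ensemble_forward(x, W, r, s):
    """One dense BatchEnsemble layer: member k uses W_k = W * outer(r_k, s_k).

    x: (K, batch, d) inputs routed to each of K members;
    W: (m, d) shared weights; r: (K, m), s: (K, d) rank-1 factors.
    Returns per-member outputs of shape (K, batch, m).
    """
    outputs = []
    for k in range(r.shape[0]):
        W_k = W * np.outer(r[k], s[k])  # rank-1 perturbation of shared weights
        outputs.append(x[k] @ W_k.T)
    return np.stack(outputs)
```

In practice the perturbed matrices need not be materialized: since `x @ W_k.T == ((x * s_k) @ W.T) * r_k`, all members can share one matrix multiply against `W`, which is what makes BatchEnsemble cheap relative to a Deep Ensemble.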



¹Contact: ywen@utexas.edu. Code: https://github.com/google/edward2/tree/master/experimental/marginalization_mixup.




