COMBINING ENSEMBLES AND DATA AUGMENTATION CAN HARM YOUR CALIBRATION

Abstract

Ensemble methods, which average over multiple neural network predictions, are a simple approach to improving a model's calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice: combining the two techniques improves accuracy at the expense of calibration, while selecting only one of them preserves good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence between methods that marginalize over sets of weights and data augmentation techniques that soften labels. Finally, we propose a simple correction that achieves the best of both worlds, with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces new state-of-the-art results in uncertainty calibration across CIFAR-10, CIFAR-100, and ImageNet.

1. INTRODUCTION

Many success stories in deep learning (Krizhevsky et al., 2012; Sutskever et al., 2014) are in restricted settings where predictions are only made for inputs similar to the training distribution. In real-world scenarios, neural networks can face truly novel data points during inference, and in these settings it can be valuable to have good estimates of the model's uncertainty. For example, in healthcare, reliable uncertainty estimates can prevent over-confident decisions for rare or novel patient conditions (Dusenberry et al., 2019). We highlight two recent trends obtaining state-of-the-art results on uncertainty and robustness benchmarks.

Ensemble methods are a simple approach to improving a model's calibration and robustness (Lakshminarayanan et al., 2017). The same network architecture, optimized from different initializations, can converge to different functional solutions, leading to decorrelated prediction errors. By averaging predictions, ensembles can rule out individual mistakes (Lakshminarayanan et al., 2017; Ovadia et al., 2019). Additional work has gone into efficient ensembles such as MC-dropout (Gal and Ghahramani, 2016), BatchEnsemble, and its variants (Wen et al., 2020; Dusenberry et al., 2020; Wenzel et al., 2020). These methods significantly improve calibration and robustness while adding few parameters to the original model.

Data augmentation is an approach which is, in principle, orthogonal to ensembles, encoding additional priors in the form of invariant feature transformations. Intuitively, data augmentation enables the model to train on more data, encouraging it to capture certain invariances with respect to its inputs and outputs; data augmentation may also produce data that is closer to an out-of-distribution target task. It has been a key factor driving state-of-the-art results: for example, Mixup (Zhang et al., 2018; Thulasidasan et al., 2019a), AugMix (Hendrycks et al., 2020), and test-time data augmentation (Ashukha et al., 2020).

A common wisdom in the community suggests that ensembles and data augmentation should naturally combine. For example, the majority of uncertainty models in vision with strong performance are
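As a point of reference for the ensembling discussion above, the following is a minimal sketch of how a deep ensemble averages the predicted class probabilities of its members (in the sense of Lakshminarayanan et al., 2017). It is not the paper's edward2 implementation: the members here are stand-in "models" (random linear maps with a softmax), and all names are illustrative placeholders.

```python
import numpy as np

def ensemble_predict(members, x):
    """Average predicted class probabilities over ensemble members."""
    probs = np.stack([member(x) for member in members], axis=0)  # [M, N, C]
    return probs.mean(axis=0)                                    # [N, C]

def make_member(dim, num_classes, seed):
    """Stand-in 'model': a random linear map followed by a softmax."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, num_classes))
    def member(x):
        logits = x @ W
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return member

# Toy usage: five members, a batch of four 8-dimensional inputs.
members = [make_member(dim=8, num_classes=10, seed=s) for s in range(5)]
x = np.random.default_rng(42).normal(size=(4, 8))
avg_probs = ensemble_predict(members, x)
print(avg_probs.shape)         # (4, 10)
print(avg_probs.sum(axis=-1))  # each row sums to 1
```

Averaging probabilities (rather than logits) is the standard deep-ensemble choice; it is this averaging step that interacts with soft labels in the pathology studied in this paper.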
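The label softening behind that interaction can likewise be seen in a minimal Mixup sketch (Zhang et al., 2018). This is an illustrative stand-in, not the AugMix or edward2 code; the function name, toy data, and the alpha value are assumptions made for the example.

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    """Mixup: convex combinations of input pairs and their one-hot labels.
    The mixed labels are 'soft' targets, which is the label-softening
    effect discussed in this paper."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # random partner for each example
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed

# Toy usage: two classes, four examples.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = np.eye(2)[np.array([0, 1, 0, 1])]
x_m, y_m = mixup(x, y, alpha=0.2, rng=rng)
print(y_m)  # mixed targets are convex combinations of one-hot labels
```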



Contact: ywen@utexas.edu. Code: https://github.com/google/edward2/tree/master/ experimental/marginalization_mixup.

