DICE: DIVERSITY IN DEEP ENSEMBLES VIA CONDITIONAL REDUNDANCY ADVERSARIAL ESTIMATION

Abstract

Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease the performance of individual members. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction, without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation.

1. INTRODUCTION

Averaging the predictions of several models can significantly improve the generalization ability of a predictive system. Due to its effectiveness, ensembling has been a popular research topic (Nilsson, 1965; Hansen & Salamon, 1990; Wolpert, 1992; Krogh & Vedelsby, 1995; Breiman, 1996; Dietterich, 2000; Zhou et al., 2002; Rokach, 2010; Ovadia et al., 2019) as a simple alternative to fully Bayesian methods (Blundell et al., 2015; Gal & Ghahramani, 2016). It is currently the de facto solution for many machine learning applications and Kaggle competitions (Hin, 2020). Ensembling reduces the variance of estimators (see Appendix E.1) thanks to the diversity in predictions. This reduction is most effective when errors are uncorrelated and members are diverse, i.e., when they do not simultaneously fail on the same examples. Conversely, an ensemble of M identical networks is no better than a single one. In deep ensembles (Lakshminarayanan et al., 2017), the weights are traditionally trained independently: diversity among members relies only on the randomness of the initialization and of the learning procedure. Figure 1 shows that the performance of this procedure quickly plateaus with additional members. To obtain more diverse ensembles, we could adapt the training samples through bagging (Breiman, 1996) and bootstrapping (Efron & Tibshirani, 1994), but a reduction of training samples has a negative impact on members with multiple local minima (Lee et al., 2015). Sequential boosting does not scale well for time-consuming deep learners that overfit their training dataset. Liu & Yao (1999a;b) and Brown et al. (2005b) explicitly quantified the diversity and regularized members into having negatively correlated errors. However, these ideas have not significantly improved accuracy when applied to deep learning (Shui et al., 2018; Pang et al., 2019): while members should predict the same target, these methods force disagreements among strong learners and therefore increase their bias.
This highlights the main objective and challenge of our paper: finding a training strategy that reaches an improved trade-off between ensemble diversity and individual accuracies (Masegosa, 2020).
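
The variance argument in this introduction can be made concrete with a small sketch. It is an illustration only, not the derivation of Appendix E.1. It assumes a simplified setting in which the M members have errors with the same variance sigma2 and the same pairwise correlation rho between 0 and 1; the function names are hypothetical. Under these assumptions the variance of the averaged error is sigma2 * (rho + (1 - rho) / M), so it shrinks as 1/M when errors are uncorrelated (rho = 0) and does not shrink at all when members are identical (rho = 1).

```python
import math
import random


def ensemble_variance(sigma2: float, rho: float, m: int) -> float:
    """Closed-form variance of the mean of m errors.

    Assumes (for illustration) that every error has variance sigma2 and
    that every pair of errors has correlation rho.
    """
    return sigma2 * (rho + (1.0 - rho) / m)


def simulate_ensemble_variance(sigma2: float, rho: float, m: int,
                               trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, for 0 <= rho <= 1.

    Each member error is a shared Gaussian component plus an independent
    one, which yields variance sigma2 and pairwise correlation rho.
    """
    rng = random.Random(seed)
    shared_scale = math.sqrt(rho * sigma2)
    own_scale = math.sqrt((1.0 - rho) * sigma2)
    means = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)
        errors = [shared_scale * shared + own_scale * rng.gauss(0.0, 1.0)
                  for _ in range(m)]
        means.append(sum(errors) / m)
    avg = sum(means) / trials
    # Unbiased sample variance of the ensemble-averaged error.
    return sum((x - avg) ** 2 for x in means) / (trials - 1)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        exact = ensemble_variance(1.0, rho, 5)
        approx = simulate_ensemble_variance(1.0, rho, 5)
        print(f"rho={rho}: closed form={exact:.3f}, simulated={approx:.3f}")
```

In this toy setting, the term sigma2 * rho is a floor that adding members cannot remove, which is consistent with the plateau described above for independently trained members whose errors remain correlated.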

