DICE: DIVERSITY IN DEEP ENSEMBLES VIA CONDITIONAL REDUNDANCY ADVERSARIAL ESTIMATION

Abstract

Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members' accuracy. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in the neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction, without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation.

1. INTRODUCTION

Averaging the predictions of several models can significantly improve the generalization ability of a predictive system. Due to its effectiveness, ensembling has been a popular research topic (Nilsson, 1965; Hansen & Salamon, 1990; Wolpert, 1992; Krogh & Vedelsby, 1995; Breiman, 1996; Dietterich, 2000; Zhou et al., 2002; Rokach, 2010; Ovadia et al., 2019) as a simple alternative to fully Bayesian methods (Blundell et al., 2015; Gal & Ghahramani, 2016). It is currently the de facto solution for many machine learning applications and Kaggle competitions (Hin, 2020). Ensembling reduces the variance of estimators (see Appendix E.1) thanks to the diversity in predictions. This reduction is most effective when errors are uncorrelated and members are diverse, i.e., when they do not simultaneously fail on the same examples. Conversely, an ensemble of M identical networks is no better than a single one. In deep ensembles (Lakshminarayanan et al., 2017), the weights are traditionally trained independently: diversity among members only relies on the randomness of the initialization and of the learning procedure. Figure 1 shows that the performance of this procedure quickly plateaus with additional members. To obtain more diverse ensembles, we could adapt the training samples through bagging (Breiman, 1996) and bootstrapping (Efron & Tibshirani, 1994), but a reduction of training samples has a negative impact on members with multiple local minima (Lee et al., 2015). Sequential boosting does not scale well for time-consuming deep learners that overfit their training dataset. Liu & Yao (1999a;b); Brown et al. (2005b) explicitly quantified the diversity and regularized members into having negatively correlated errors. However, these ideas have not significantly improved accuracy when applied to deep learning (Shui et al., 2018; Pang et al., 2019): while members should predict the same target, they force disagreements among strong learners and therefore increase their bias.
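The variance-reduction argument above can be illustrated numerically (our own toy sketch, not from the paper): averaging M members with independent zero-mean errors divides the error variance by roughly M, whereas fully correlated errors gain nothing from averaging.

```python
import random

random.seed(0)

def error_variance(errors):
    """Empirical variance of a list of prediction errors."""
    m = sum(errors) / len(errors)
    return sum((e - m) ** 2 for e in errors) / len(errors)

M, N = 5, 20_000  # ensemble size, number of test points

# Independent errors: each member adds its own unit-variance noise,
# and the ensemble averages the M noisy predictions.
indep = [sum(random.gauss(0, 1) for _ in range(M)) / M for _ in range(N)]

# Fully correlated errors: all members share the same noise sample,
# so the ensemble average is just that single sample.
shared = []
for _ in range(N):
    e = random.gauss(0, 1)
    shared.append(sum(e for _ in range(M)) / M)  # equals e

v_indep, v_shared = error_variance(indep), error_variance(shared)
print(v_indep, v_shared)  # roughly 1/M = 0.2 vs. roughly 1.0
```

This is exactly why decorrelating members' errors, rather than merely adding members, is the lever DICE targets.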
This highlights the main objective and challenge of our paper: finding a training strategy to reach an improved trade-off between ensemble diversity and individual accuracies (Masegosa, 2020). Our core approach is to encourage all members to predict the same thing, but for different reasons. Diversity is therefore enforced in the feature space, not on the predictions. Intuitively, to maximize the impact of a new member, its extracted features should bring information about the target that is so far absent, and thus unpredictable from the other members' features. This removes spurious correlations, e.g., information redundantly shared among features extracted by different members but useless for class prediction. Such redundancy may be caused by a detail in the image background, and therefore will not be found in features extracted from other images belonging to the same class. It could make members fail simultaneously, as shown in Figure 2. Our new learning framework, called DICE, is driven by Information Bottleneck (IB) (Tishby, 1999; Alemi et al., 2017) principles, which force features to be concise by forgetting task-irrelevant factors. Specifically, DICE leverages the Minimum Necessary Information criterion (Fischer, 2020) for deep ensembles, and aims at reducing the mutual information (MI) between features and inputs, but also the information shared between features: we prevent extracted features from being redundant. As mutual information can detect arbitrary dependencies between random variables (such as symmetry, see Figure 2), we increase the distance between pairs of members: this promotes diversity by reducing the predictions' covariance. Most importantly, DICE protects the features' informativeness by conditioning the mutual information upon the target. We build upon recent neural approaches (Belghazi et al., 2018) based on the Donsker-Varadhan representation of the KL formulation of MI.
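The Donsker-Varadhan (DV) representation behind such neural estimators can be sketched with a hand-picked critic instead of a trained network: any critic T gives a lower bound I(Z1; Z2) ≥ E_joint[T] − log E_product[exp(T)], which a MINE-style estimator then tightens by training T. The function names and the quadratic critic below are our own illustrative choices, not the paper's implementation.

```python
import math
import random

random.seed(0)

def dv_lower_bound(joint, product, critic):
    """Donsker-Varadhan bound: E_joint[T] - log E_product[exp(T)].

    `joint` holds pairs drawn from p(z1, z2); `product` holds pairs whose
    coordinates were drawn independently (the product of the marginals).
    """
    e_joint = sum(critic(a, b) for a, b in joint) / len(joint)
    e_prod = sum(math.exp(critic(a, b)) for a, b in product) / len(product)
    return e_joint - math.log(e_prod)

# Fixed (untrained) critic: large when the two features agree.
critic = lambda a, b: -((a - b) ** 2)

n = 5000
z1 = [random.gauss(0, 1) for _ in range(n)]
# Dependent case: member 2's feature is a noisy copy of member 1's.
dep_joint = [(z, z + 0.1 * random.gauss(0, 1)) for z in z1]
# Independent case: member 2's feature is drawn afresh.
ind_joint = [(z, random.gauss(0, 1)) for z in z1]
# Product-of-marginals samples: shuffle the second coordinate.
perm = random.sample(range(n), n)
product = [(dep_joint[i][0], dep_joint[perm[i]][1]) for i in range(n)]

mi_dep = dv_lower_bound(dep_joint, product, critic)
mi_ind = dv_lower_bound(ind_joint, product, critic)
print(mi_dep, mi_ind)  # clearly larger for the dependent features
```

In DICE the critic is a neural discriminator trained adversarially, and the bound is estimated conditionally on the class so that only class-irrelevant redundancy is penalized.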
We summarize our contributions as follows:
• We introduce DICE, a new adversarial learning framework to explicitly increase diversity in ensembles by minimizing the conditional redundancy between features.
• We rationalize our training objective with arguments from information theory.
• We propose an implementation through neural estimation of the conditional redundancy.
• We consistently improve accuracy on CIFAR-10/100, as summarized in Figure 1, with better uncertainty estimation and calibration.
• We analyze how the two components of our loss modify the accuracy-diversity trade-off.
• We improve out-of-distribution detection and online co-distillation.

2. DICE MODEL

Notations. Given an input distribution X, a network θ is trained to extract the best possible dense features Z to model the distribution p_θ(Y|X) over the targets, which should be close to the Dirac distribution on the true label. Our approach is designed for ensembles with M members θ_i, i ∈ {1, . . . , M}, extracting features Z_i. In the branch-based setup, members share low-level weights to reduce the computational cost. At inference, we average the M predictions. We initially consider an ensemble of M = 2 members.
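Averaging the M predictions at inference can be sketched as follows (a minimal illustration with made-up probability vectors; `ensemble_predict` is our own name, not the paper's code):

```python
def ensemble_predict(member_probs):
    """Average the per-class probability vectors of M ensemble members."""
    M = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[k] for p in member_probs) / M for k in range(n_classes)]

# Two members disagreeing on a 3-class problem.
p1 = [0.7, 0.2, 0.1]
p2 = [0.3, 0.5, 0.2]
avg = ensemble_predict([p1, p2])
print(avg)  # approximately [0.5, 0.35, 0.15]
```

Since each input is a valid probability vector, the average is one as well, so the ensemble prediction needs no renormalization.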



Figure 1: DICE better leverages ensemble size. Without weight sharing, 5 networks trained with DICE match 7 networks trained independently. With low-level weight sharing, 4 branches trained with DICE match 7 traditional branches. Dataset: CIFAR-100. Backbone: ResNet-32. Details in Table 8.

Figure 2: Outline. DICE prevents features from being predictable from each other conditionally upon the target class. Features extracted by members 1 and 2 from a single input should not share more information than features extracted from two different inputs of the same class: i.e., a discriminator observing one member's features should not be able to tell the two cases apart.

