DICE: DIVERSITY IN DEEP ENSEMBLES VIA CONDITIONAL REDUNDANCY ADVERSARIAL ESTIMATION

Abstract

Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members' performances. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation.

1. INTRODUCTION

Averaging the predictions of several models can significantly improve the generalization ability of a predictive system. Due to its effectiveness, ensembling has been a popular research topic (Nilsson, 1965; Hansen & Salamon, 1990; Wolpert, 1992; Krogh & Vedelsby, 1995; Breiman, 1996; Dietterich, 2000; Zhou et al., 2002; Rokach, 2010; Ovadia et al., 2019) as a simple alternative to fully Bayesian methods (Blundell et al., 2015; Gal & Ghahramani, 2016) . It is currently the de facto solution for many machine learning applications and Kaggle competitions (Hin, 2020) . Ensembling reduces the variance of estimators (see Appendix E.1) thanks to the diversity in predictions. This reduction is most effective when errors are uncorrelated and members are diverse, i.e., when they do not simultaneously fail on the same examples. Conversely, an ensemble of M identical networks is no better than a single one. In deep ensembles (Lakshminarayanan et al., 2017) , the weights are traditionally trained independently: diversity among members only relies on the randomness of the initialization and of the learning procedure. Figure 1 shows that the performance of this procedure quickly plateaus with additional members. To obtain more diverse ensembles, we could adapt the training samples through bagging (Breiman, 1996) and bootstrapping (Efron & Tibshirani, 1994) , but a reduction of training samples has a negative impact on members with multiple local minima (Lee et al., 2015) . Sequential boosting does not scale well for time-consuming deep learners that overfit their training dataset. Liu & Yao (1999a; b) ; Brown et al. (2005b) explicitly quantified the diversity and regularized members into having negatively correlated errors. 
However, these ideas have not significantly improved accuracy when applied to deep learning (Shui et al., 2018; Pang et al., 2019): while members should predict the same target, these methods force disagreements among strong learners and therefore increase their bias. This highlights the main objective and challenge of our paper: finding a training strategy that reaches an improved trade-off between ensemble diversity and individual accuracies (Masegosa, 2020).

Figure 2: Outline. DICE prevents features from being predictable from each other conditionally upon the target class. Features extracted by members (1, 2) from one input should not share more information than features extracted from two inputs in the same class, i.e., the two kinds of feature pairs should not be distinguishable.

Our core approach is to encourage all members to predict the same thing, but for different reasons. Diversity is therefore enforced in the feature space and not on predictions. Intuitively, to maximize the impact of a new member, its extracted features should bring information about the target that is currently absent, and hence unpredictable from the other members' features. This would remove spurious correlations, e.g. information redundantly shared among features extracted by different members but useless for class prediction. Such redundancy may be caused by a detail in the image background and will therefore not be found in features extracted from other images belonging to the same class. It could make members fail simultaneously, as shown in Figure 2. Our new learning framework, called DICE, is driven by Information Bottleneck (IB) (Tishby, 1999; Alemi et al., 2017) principles, which force features to be concise by forgetting task-irrelevant factors. Specifically, DICE leverages the Minimum Necessary Information criterion (Fischer, 2020) for deep ensembles, and aims at reducing the mutual information (MI) between features and inputs, but also the information shared between features.
We prevent extracted features from being redundant. As mutual information can detect arbitrary dependencies between random variables (such as symmetry, see Figure 2), we increase the distance between pairs of members: it promotes diversity by reducing predictions' covariance. Most importantly, DICE protects features' informativeness by conditioning mutual information upon the target. We build upon recent neural approaches (Belghazi et al., 2018) based on the Donsker-Varadhan representation of the KL formulation of MI.

We summarize our contributions as follows:
• We introduce DICE, a new adversarial learning framework to explicitly increase diversity in ensembles by minimizing the conditional redundancy between features.
• We rationalize our training objective by arguments from information theory.
• We propose an implementation through neural estimation of conditional redundancy.

We consistently improve accuracy on CIFAR-10/100 as summarized in Figure 1, with better uncertainty estimation and calibration. We analyze how the two components of our loss modify the accuracy-diversity trade-off. We improve out-of-distribution detection and online co-distillation.

2. DICE MODEL

Notations. Given an input distribution $X$, a network $\theta$ is trained to extract the best possible dense features $Z$ to model the distribution $p_\theta(Y|X)$ over the targets, which should be close to the Dirac on the true label. Our approach is designed for ensembles with $M$ members $\theta_i$, $i \in \{1, \dots, M\}$, extracting $Z_i$. In the branch-based setup, members share low-level weights to reduce computation cost. We average the $M$ predictions at inference. We initially consider an ensemble of $M = 2$ members.

Quick overview. First, we train each member separately for classification with information bottleneck. Second, we train members together to remove spurious redundant correlations while adversarially training a discriminator. In conclusion, members learn to classify with conditionally uncorrelated features for increased diversity. Our procedure is driven by the following theoretical findings.

2.A.1 BASELINE: NON-CONDITIONAL OBJECTIVE

The Minimum Necessary Information (MNI) criterion from Fischer (2020) aims at finding minimal statistics. In deep ensembles, $Z_1$ and $Z_2$ should capture only minimal information from $X$, while preserving the necessary information about the task $Y$. First, we consider separately the two Markov chains $Z_1 \leftarrow X \leftrightarrow Y$ and $Z_2 \leftarrow X \leftrightarrow Y$. As entropy measures information, the entropy of $Z_1$ and $Z_2$ not related to $Y$ should be minimized. We recover IB (Alemi et al., 2017) in deep ensembles:

$$\mathrm{IB}_{\beta_{ib}}(Z_1, Z_2) = \frac{1}{\beta_{ib}}\left[I(X; Z_1) + I(X; Z_2)\right] - \left[I(Y; Z_1) + I(Y; Z_2)\right] = \mathrm{IB}_{\beta_{ib}}(Z_1) + \mathrm{IB}_{\beta_{ib}}(Z_2).$$

Second, let's consider $I(Z_1; Z_2)$: we minimize it following the minimality constraint of the MNI.

$$\mathrm{IBR}_{\beta_{ib},\delta_r}(Z_1, Z_2) = \frac{1}{\beta_{ib}}\underbrace{\left[I(X; Z_1) + I(X; Z_2)\right]}_{\text{Compression}} - \underbrace{\left[I(Y; Z_1) + I(Y; Z_2)\right]}_{\text{Relevancy}} + \delta_r \underbrace{I(Z_1; Z_2)}_{\text{Redundancy}} = \mathrm{IB}_{\beta_{ib}}(Z_1) + \mathrm{IB}_{\beta_{ib}}(Z_2) + \delta_r I(Z_1; Z_2).$$

Figure 3 (Yeung, 1991): DICE minimizes conditional redundancy (green vertical stripes) with no overlap with relevancy (red stripes).

Analysis. In this baseline criterion, relevancy encourages $Z_1$ and $Z_2$ to capture information about $Y$. Compression and redundancy (R) split the information from $X$ into two compressed and independent views. The relevancy-compression-redundancy trade-off depends on the values of $\beta_{ib}$ and $\delta_r$.
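To make the redundancy term concrete, the following minimal sketch computes $I(Z_1; Z_2)$ and $I(Z_1; Z_2 | Y)$ exactly on two toy discrete distributions. It is only an illustration: the function names and both toy distributions are hypothetical and are not taken from our experiments. In `label_only`, both features simply copy a binary label; in `spurious`, both features additionally copy a nuisance bit that is independent of the label.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in nats from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def marginalize_y(joint3):
    """Drop the label from a dict {(a, b, y): probability}."""
    joint = defaultdict(float)
    for (a, b, _), p in joint3.items():
        joint[(a, b)] += p
    return dict(joint)


def conditional_mutual_information(joint3):
    """I(A;B|Y) in nats: the average over y of I(A;B | Y=y)."""
    py = defaultdict(float)
    for (_, _, y), p in joint3.items():
        py[y] += p
    total = 0.0
    for y0, p_y in py.items():
        cond = {(a, b): p / p_y
                for (a, b, y), p in joint3.items() if y == y0}
        total += p_y * mutual_information(cond)
    return total


# Toy case 1 (hypothetical): both features equal the uniform binary label.
label_only = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# Toy case 2 (hypothetical): both features also copy a nuisance bit s
# that is uniform and independent of the label.
spurious = {((y, s), (y, s), y): 0.25 for y in (0, 1) for s in (0, 1)}

for name, dist in (("label_only", label_only), ("spurious", spurious)):
    mi = mutual_information(marginalize_y(dist))
    cmi = conditional_mutual_information(dist)
    print(f"{name}: I(Z1;Z2)={mi:.4f}  I(Z1;Z2|Y)={cmi:.4f}")
```

In the first case the unconditional term equals $\log 2$ although the only shared information is the label itself, while the conditional term is zero. In the second case the conditional term isolates exactly the $\log 2$ nats carried by the nuisance bit.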

2.A.2 DICE: CONDITIONAL OBJECTIVE

The problem is that the compression and redundancy terms in IBR also reduce necessary information related to $Y$: it is detrimental to have $Z_1$ and $Z_2$ fully disentangled while training them to predict the same $Y$. As shown in Figure 3, redundancy regions (blue horizontal stripes) overlap with relevancy regions (red stripes). Indeed, the true constraints that the MNI criterion entails are the following conditional equalities given $Y$:

$$I(X; Z_1 | Y) = I(X; Z_2 | Y) = I(Z_1; Z_2 | Y) = 0.$$

Mutual information being non-negative, we transform them into our main DICE objective:

$$\mathrm{DICE}_{\beta_{ceb},\delta_{cr}}(Z_1, Z_2) = \frac{1}{\beta_{ceb}}\underbrace{\left[I(X; Z_1 | Y) + I(X; Z_2 | Y)\right]}_{\text{Conditional Compression}} - \underbrace{\left[I(Y; Z_1) + I(Y; Z_2)\right]}_{\text{Relevancy}} + \delta_{cr}\underbrace{I(Z_1; Z_2 | Y)}_{\text{Conditional Redundancy}} = \mathrm{CEB}_{\beta_{ceb}}(Z_1) + \mathrm{CEB}_{\beta_{ceb}}(Z_2) + \delta_{cr} I(Z_1; Z_2 | Y),$$

where we recover two conditional entropy bottleneck (CEB) (Fischer, 2020) components, $\mathrm{CEB}_{\beta_{ceb}}(Z_i) = \frac{1}{\beta_{ceb}} I(X; Z_i | Y) - I(Y; Z_i)$, with $\beta_{ceb} > 0$ and $\delta_{cr} > 0$.

Analysis. The relevancy terms force features to be informative about the task $Y$. But contrary to IBR, the DICE bottleneck constraints only minimize information irrelevant to $Y$. First, the conditional compression removes from $Z_1$ (or $Z_2$) the information from $X$ that is not relevant to $Y$. Second, the conditional redundancy (CR) reduces spurious correlations between members and only forces them to have independent bias, but definitely not independent features.

$\mathrm{CEB}_{\beta_{ceb}}(Z_i) = \frac{1}{\beta_{ceb}} I(X; Z_i | Y) - I(Y; Z_i)$ is variationally upper bounded by:

$$\mathrm{VCEB}_{\beta_{ceb}}(\{e_i, b_i, c_i\}) = \frac{1}{N}\sum_{n=1}^{N}\left\{\frac{1}{\beta_{ceb}} D_{KL}\left(e_i(z|x_n) \,\|\, b_i(z|y_n)\right) - \mathbb{E}_{\epsilon}\left[\log c_i(y_n | e_i(x_n, \epsilon))\right]\right\}. \tag{2}$$

See explanation in Appendix E.4. $e_i(z|x)$ is the true features distribution generated by the encoder, $c_i(y|z)$ is a variational approximation of the true distribution $p(y|z)$ by the classifier, and $b_i(z|y)$ is a variational approximation of the true distribution $p(z|y)$ by the backward encoder.
This loss is applied separately on each member $\theta_i = \{e_i, c_i, b_i\}$, $i \in \{1, 2\}$. Practically, we parameterize all distributions with Gaussians. The encoder $e_i$ is a traditional neural network feature extractor (e.g. ResNet-32) that learns distributions (means and covariances) rather than deterministic points in the feature space. That is why $e_i$ transforms an image into 2 tensors: a features-mean $e_i^{\mu}(x)$ and a diagonal features-covariance $e_i^{\sigma}(x)$, each of size $d$ (e.g. 64). The classifier $c_i$ is a dense layer that transforms a features-sample $z$ into logits to be aligned with the target $y$ through conditional cross entropy. $z$ is obtained via the reparameterization trick: $z = e_i(x, \epsilon) = e_i^{\mu}(x) + \epsilon\, e_i^{\sigma}(x)$ with $\epsilon \sim N(0, 1)$. Finally, the backward encoder $b_i$ is implemented as an embedding layer of size $(K, d)$ that maps the $K$ classes to class-features-means $b_i^{\mu}(z|y)$ of size $d$, as we set the class-features-covariance to 1. The Gaussian parametrization also enables the exact computation of the $D_{KL}$ (see Appendix E.3), which forces (1) the features-mean $e_i^{\mu}(x)$ to converge to the class-features-mean $b_i^{\mu}(z|y)$ and (2) the predicted features-covariance $e_i^{\sigma}(x)$ to be close to 1. The advantage of VCEB over VIB (Alemi et al., 2017) is the class-conditional $b_i^{\mu}(z|y)$ versus the non-conditional $b_i^{\mu}(z)$, which protects class information.
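As an illustration of the Gaussian parametrization described above, the sketch below evaluates the per-sample variational CEB term for one member in plain Python. It is a minimal sketch under stated assumptions, not our implementation: we treat the encoder output `sigma_e` as a per-dimension standard deviation (the text calls it a diagonal covariance), the class-conditional distribution has unit covariance, the expectation over the noise is approximated with a single sample, and `logits_fn` is a hypothetical stand-in for the dense classifier.

```python
import math
import random


def kl_diag_gaussian_to_unit(mu_e, sigma_e, mu_b):
    """Exact KL( N(mu_e, diag(sigma_e^2)) || N(mu_b, I) )."""
    return 0.5 * sum(s * s + (m - b) ** 2 - 1.0 - math.log(s * s)
                     for m, s, b in zip(mu_e, sigma_e, mu_b))


def reparameterize(mu_e, sigma_e, rng):
    """z = mu + eps * sigma with eps ~ N(0, 1), sampled per dimension."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu_e, sigma_e)]


def log_softmax(logits):
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return [l - lse for l in logits]


def vceb_single(mu_e, sigma_e, mu_b, logits_fn, label, beta_ceb, rng):
    """One-sample estimate of the bracketed term of the variational bound:
    KL / beta_ceb plus the cross entropy of the classifier on a sampled z."""
    z = reparameterize(mu_e, sigma_e, rng)
    nll = -log_softmax(logits_fn(z))[label]
    return kl_diag_gaussian_to_unit(mu_e, sigma_e, mu_b) / beta_ceb + nll


# Hypothetical 2-dimensional example with a classifier that ignores z.
rng = random.Random(0)
uniform_classifier = lambda z: [0.0, 0.0]
loss = vceb_single([1.0, 0.0], [1.0, 1.0], [0.0, 0.0],
                   uniform_classifier, label=0, beta_ceb=1.0, rng=rng)
print(loss)
```

The KL term vanishes only when the predicted mean matches the class mean and the predicted standard deviation equals 1, which are the two effects listed above.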

2.B.2 ADVERSARIAL ESTIMATION OF CONDITIONAL REDUNDANCY

Theoretical Problem. We now focus on estimating $I(Z_1; Z_2 | Y)$, with no such Markov properties. Despite being a pivotal measure, mutual information estimation historically relied on nearest neighbors (Singh et al., 2003; Kraskov et al., 2004; Gao et al., 2018) or density kernels (Kandasamy et al., 2015) that do not scale well in high dimensions. We benefit from recent advances in neural estimation of mutual information (Belghazi et al., 2018), built on optimizing Donsker & Varadhan (1975) dual representations of the KL divergence. Mukherjee et al. (2020) extended this formulation to conditional mutual information estimation:

$$\mathrm{CR} = I(Z_1; Z_2 | Y) = D_{KL}\left(P(Z_1, Z_2, Y) \,\|\, P(Z_1, Y)\, p(Z_2 | Y)\right) = \sup_{f} \; \mathbb{E}_{x \sim p(z_1, z_2, y)}\left[\log f(x)\right] - \log \mathbb{E}_{x \sim p(z_1, y) p(z_2 | y)}\left[f(x)\right] = \mathbb{E}_{x \sim p(z_1, z_2, y)}\left[\log f^*(x)\right] - \log \mathbb{E}_{x \sim p(z_1, y) p(z_2 | y)}\left[f^*(x)\right],$$

where the supremum is taken over positive functions $f$ and $f^*$ computes the pointwise likelihood ratio, i.e., $f^*(z_1, z_2, y) = \frac{p(z_1, z_2, y)}{p(z_1, y)\, p(z_2 | y)}$.

Empirical Neural Estimation. We estimate CR (1) using the empirical data distribution and (2) replacing $f^* = \frac{w^*}{1 - w^*}$ by $f = \frac{w}{1 - w}$, where $w$ is the output of a discriminator trained to imitate the optimal $w^*$. Let $B_j$ be a batch sampled from the observed joint distribution $p(z_1, z_2, y) = p(e_1(z|x), e_2(z|x), y)$; we select the features extracted by the two members from one input. Let $B_p$ be sampled from the product distribution $p(z_1, y)\, p(z_2 | y) = p(e_1(z|x), y)\, p(z_2 | y)$; we select the features extracted by the two members from two different inputs that share the same class. We train a multi-layer network $w$ on the binary task of distinguishing these two distributions with the standard cross-entropy loss:

$$L_{ce}(w) = -\frac{1}{|B_j| + |B_p|}\left[\sum_{(z_1, z_2, y) \in B_j} \log w(z_1, z_2, y) + \sum_{(z_1, z_2', y) \in B_p} \log\left(1 - w(z_1, z_2', y)\right)\right]. \tag{3}$$

If $w$ is calibrated (see Appendix B.3), a consistent (Mukherjee et al., 2020) estimate of CR is:

$$\hat{I}^{CR}_{DV} = \underbrace{\frac{1}{|B_j|}\sum_{(z_1, z_2, y) \in B_j} \log f(z_1, z_2, y)}_{\text{Diversity}} - \underbrace{\log\left(\frac{1}{|B_p|}\sum_{(z_1, z_2', y) \in B_p} f(z_1, z_2', y)\right)}_{\text{Fake correlations}}, \quad \text{with } f = \frac{w}{1 - w}.$$

Intuition. By training our members to minimize $\hat{I}^{CR}_{DV}$, we force triples from the joint distribution to be indistinguishable from triples from the product distribution. Let's imagine that two features are conditionally correlated: some spurious information is shared between features only when they are from the same input and not from two inputs (from the same class). This correlation can be informative about a detail in the background, or an unexpected shape in the image, that is rarely found in samples from this input's class. In that case, the product and joint distributions are easily distinguishable by the discriminator. The first adversarial component will force the extracted features to reduce the correlation, and ideally one of the two features loses this information: it reduces redundancy and increases diversity. The second term would create fake correlations between features from different inputs. As we are not interested in a precise estimation of the CR, we get rid of this second term that, empirically, did not increase diversity, as detailed in Appendix G:

$$\hat{L}^{CR}_{DV}(e_1, e_2) = \frac{1}{|B_j|}\sum_{(z_1, z_2, y) \in B_j \sim p(e_1(z|x), e_2(z|x), y)} \log f(z_1, z_2, y). \tag{4}$$

Summary. First, we train each member for classification with VCEB from equation 2, as shown in Step 1 from Figure 4. Second, as shown in Step 2 from Figure 4, the discriminator, conditioned on the class $Y$, learns to distinguish features sampled from one image versus features sampled from two images belonging to $Y$. Simultaneously, both members adversarially (Goodfellow et al., 2014) delete spurious correlations to reduce the CR estimation from equation 4 with differentiable signals: it conditionally aligns features. We provide pseudo-code in Appendix B.4.
While we derive similar losses for IBR and CEBR in Appendix E.5, the full DICE loss is finally:

$$L_{DICE}(\theta_1, \theta_2) = \mathrm{VCEB}_{\beta_{ceb}}(\theta_1) + \mathrm{VCEB}_{\beta_{ceb}}(\theta_2) + \delta_{cr}\, \hat{L}^{CR}_{DV}(e_1, e_2). \tag{5}$$
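The quantities of this section can be sketched from discriminator outputs alone. The snippet below is a minimal illustration, assuming that the discriminator outputs are already available as probabilities in (0, 1) for a batch of joint triples and a batch of product triples; the function names and the small `eps` guard against division by zero are our own hypothetical choices and do not appear in the method description.

```python
import math


def density_ratio(w_out, eps=1e-7):
    """f = w / (1 - w), with w clamped away from 0 and 1 (eps is an assumption)."""
    w_out = min(max(w_out, eps), 1.0 - eps)
    return w_out / (1.0 - w_out)


def discriminator_cross_entropy(w_joint, w_product, eps=1e-7):
    """Binary cross entropy: joint triples are positives, product triples negatives."""
    total = sum(-math.log(max(w, eps)) for w in w_joint)
    total += sum(-math.log(max(1.0 - w, eps)) for w in w_product)
    return total / (len(w_joint) + len(w_product))


def cr_estimate_dv(w_joint, w_product):
    """Donsker-Varadhan style estimate: mean log-ratio on joint triples
    minus the log of the mean ratio on product triples."""
    diversity = sum(math.log(density_ratio(w)) for w in w_joint) / len(w_joint)
    fake = math.log(sum(density_ratio(w) for w in w_product) / len(w_product))
    return diversity - fake


def cr_loss(w_joint):
    """Loss minimized by the members: only the first (joint) term is kept."""
    return sum(math.log(density_ratio(w)) for w in w_joint) / len(w_joint)


# Hypothetical discriminator outputs.
print(cr_estimate_dv([0.5, 0.5], [0.5, 0.5]))   # uninformative discriminator
print(cr_estimate_dv([0.8, 0.7], [0.2, 0.3]))   # discriminator separates the batches
print(cr_loss([0.8, 0.7]))
```

When the discriminator cannot separate the two batches (all outputs at 0.5), the ratio is 1 everywhere and the estimate is zero; pushing the outputs on joint triples towards 0.5 is what the member loss encourages.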

2.C FULL PROCEDURE WITH M MEMBERS

We expand our objective for an ensemble with $M > 2$ members. For simplicity, we only consider pairwise interactions, which keeps quadratic rather than exponential growth in the number of components, and we truncate higher-order interactions, e.g. $I(Z_i; Z_j, Z_k | Y)$ (see Appendix F.1). Driven by previous variational and neural estimations, we train $\theta_i = \{e_i, b_i, c_i\}$, $i \in \{1, \dots, M\}$ on:

$$L_{DICE}(\theta_{1:M}) = \sum_{i=1}^{M} \mathrm{VCEB}_{\beta_{ceb}}(\theta_i) + \frac{\delta_{cr}}{M - 1}\sum_{i=1}^{M}\sum_{j=i+1}^{M} \hat{L}^{CR}_{DV}(e_i, e_j),$$

while training adversarially $w$ on $L_{ce}$. Batch $B_j$ is sampled from the concatenation of the joint distributions $p(z_i, z_j, y)$ where $i, j \in \{1, \dots, M\}$, $i \neq j$, while $B_p$ is sampled from the product distribution $p(z_i, y)\, p(z_j | y)$. We use the same discriminator $w$ for the $\binom{M}{2}$ estimates. This improves scalability by reducing the number of parameters to be learned. Indeed, an additional member in the ensemble only adds $256 \times d$ trainable weights in $w$, where $d$ is the features dimension. See Appendix B.3 for additional information related to the discriminator $w$.
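The aggregation over members can be sketched as follows, assuming that the per-member classification losses and the pairwise redundancy losses have already been computed as plain numbers. The function name and the dictionary layout keyed by index pairs are hypothetical conveniences for this illustration.

```python
from itertools import combinations


def dice_loss(vceb_losses, pairwise_cr, delta_cr):
    """Sum of per-member losses plus delta_cr / (M - 1) times the sum of
    the redundancy losses over all unordered pairs (i, j) with i < j."""
    num_members = len(vceb_losses)
    if num_members < 2:
        raise ValueError("an ensemble needs at least two members")
    pairs = list(combinations(range(num_members), 2))
    if set(pairwise_cr) != set(pairs):
        raise ValueError("pairwise_cr must have exactly one entry per pair i < j")
    redundancy = sum(pairwise_cr[pair] for pair in pairs)
    return sum(vceb_losses) + delta_cr / (num_members - 1) * redundancy


# Hypothetical values for M = 2 and M = 3.
print(dice_loss([1.0, 2.0], {(0, 1): 0.5}, delta_cr=0.2))
print(dice_loss([1.0, 1.0, 1.0],
                {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}, delta_cr=0.2))
```

With two members the normalization factor is 1, so the expression reduces to the two-member loss. With more members, each member appears in M − 1 pairs, and dividing by M − 1 keeps the per-member weight of the redundancy term comparable across ensemble sizes.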

3. RELATED WORK

To reduce the training cost of deep ensembles (Hansen & Salamon, 1990; Lakshminarayanan et al., 2017), Huang et al. (2017) collect snapshots on training trajectories. One-stage end-to-end co-distillation (Song & Chai, 2018; Lan et al., 2018; Chen et al., 2020b) shares low-level features among members in a branch-based ensemble while forcing each member to mimic a dynamic weighted combination of the predictions to increase individual accuracy. However, both methods correlate errors among members, homogenize predictions and fail to fit the different modes of the data, which overall reduces diversity. Beyond random initializations (Kolen & Pollack, 1991), authors implicitly introduced stochasticity into the training, by providing subsets of data to learners with bagging (Breiman, 1996) or by backpropagating subsets of gradients (Lee et al., 2016); however, the reduction of training samples hurts performance for sufficiently complex models that overfit their training dataset (Nakkiran et al., 2019). Boosting with sequential training is not suitable for deep members (Lakshminarayanan et al., 2017). Some approaches applied different data augmentations (Dvornik et al., 2019; Stickland & Murray, 2020), or used different networks or hyperparameters (Singh et al., 2016; Ruiz & Verbeek, 2020; Yang & Soatto, 2020), but they are not general-purpose and depend on specific engineering choices. Others explicitly encourage orthogonality of the gradients (Ross et al., 2020; Kariyappa & Qureshi, 2019; Dabouei et al., 2020) or of the predictions, by boosting (Freund & Schapire, 1999; Margineantu & Dietterich) or with a negative correlation regularization (Shui et al., 2018), but they reduce members' accuracy. Second-order PAC-Bayes bounds motivated the diversity loss in Masegosa (2020). As far as we know, adaptive diversity promoting (ADP) (Pang et al., 2019) is the only approach more accurate than the independent baseline: it decorrelates the non-maximal predictions.
The limited success of these logits approaches suggests that we seek diversity in features. Empirically, we found that increasing the ($L_1$, $L_2$, $-\cos$) distances between features (Kim et al., 2018) reduces performance: these distances are not invariant to variables' symmetry. Concurrently with our findings, the approach of Sinha et al. (2020) is somewhat equivalent to our IBR objective (see Appendix C.2) but without information bottleneck motivations for the diversity loss. The uniqueness of mutual information (see Appendix E.2) as a distance measure between variables has been applied in countless machine learning projects, such as reinforcement learning (Kim et al., 2019a), metric learning (Kemertas et al., 2020), or evolutionary algorithms (Aguirre & Coello, 2004). Objectives are often a trade-off between (1) informativeness and (2) compression. In computer vision, unsupervised deep representation learning (Hjelm et al., 2019; van den Oord et al., 2018; Tian et al., 2020a; Bachman et al., 2019) maximizes correlation between features and inputs following Infomax (Linsker, 1988; Bell & Sejnowski, 1995), while discarding information not shared among different views (Bhardwaj et al., 2020), or penalizing predictability of one latent dimension given the others for disentanglement (Schmidhuber, 1992; Comon, 1994; Kingma & Welling, 2014; Kim & Mnih, 2018; Blot et al., 2018). The ideal level of compression is task dependent (Soatto & Chiuso, 2014). As a selection criterion, features should not be redundant (Battiti, 1994; Peng et al., 2005) but relevant and complementary given the task (Novovičová et al., 2007; Brown, 2009). As a learning criterion, correlations between features and inputs are minimized according to Information Bottleneck (Tishby, 1999; Alemi et al., 2017; Kirsch et al., 2020; Saporta et al., 2019), while those between features and targets are maximized (LeCun et al., 2006; Qin & Kim, 2019).
It forces the features to ignore task-irrelevant factors (Zhao et al., 2020) and reduces overfitting (Alemi et al., 2018) while protecting needed information (Tian et al., 2020b). Fischer & Alemi (2020) conclude that conditional alignment is superior for reaching the MNI point.

4. EXPERIMENTS

In this section, we present our experimental results on the CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) datasets. We detail our implementation in Appendix B. We took most hyperparameter values from Chen et al. (2020b) . Hyperparameters for adversarial training and information bottleneck were fine-tuned on a validation dataset made of 5% of the training dataset, see Appendix D.1. Bold highlights best score. First, we show gain in accuracy. Then, we further analyze our strategy's impacts on calibration, uncertainty estimation, out-of-distribution detection and co-distillation. 

4.B ABLATION STUDY

Branch-based is attractive: it reduces bias by gradient diffusion among shared layers, at only a slight cost in diversity, which makes our approach even more valuable. We therefore study the 4-branches ResNet-32 on CIFAR-100 in the following experiments. We ablate the two components of DICE: (1) deterministic, with VIB or VCEB, and (2) no adversarial loss, or with redundancy, conditionally or not. We measure diversity by the ratio-error (Aksela, 2003), $r = \frac{N_{single}}{N_{shared}}$, which computes the ratio between the number of single errors $N_{single}$ and of shared errors $N_{shared}$. A higher average over the $\binom{M}{2}$ pairs means higher diversity as members are less likely to err on the same inputs. Our analysis remains valid for non-pairwise diversity measures, analyzed in Appendix A.5. In Figure 5, CEB has slightly higher diversity than Ind.: it benefits from compression. ADP reaches higher diversity but sacrifices individual accuracies. On the contrary, co-distillation OKDDip sacrifices diversity for individual accuracies. The DICE curve is above all others, and notably $\delta_{cr} = 0.2$ induces an optimal trade-off between ensemble diversity and individual accuracies on validation. CEBR reaches the same diversity with lower individual accuracies: information about $Y$ is removed. Figure 6 shows that, starting from random initializations, diversity begins small: DICE minimizes the estimated CR in features and increases diversity in predictions compared to CEB ($\delta_{cr} = 0.0$). The effect is correlated with $\delta_{cr}$: a high value (0.6) creates too much diversity. On the contrary, a negative value (−0.025) can decrease diversity. Figure 8 highlights opposing dynamics in accuracies.
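The ratio-error can be computed directly from hard predictions. The sketch below is an illustration under our reading of the definition: a "single" error is an input on which exactly one member of a pair is wrong, and a "shared" error is an input on which both are wrong. The function names, and the choice to return infinity when a pair has no shared error, are hypothetical.

```python
from itertools import combinations


def ratio_error(pred_a, pred_b, labels):
    """N_single / N_shared for one pair of members."""
    single = shared = 0
    for a, b, y in zip(pred_a, pred_b, labels):
        wrong_a, wrong_b = a != y, b != y
        if wrong_a and wrong_b:
            shared += 1
        elif wrong_a or wrong_b:
            single += 1
    # Assumption: a pair that never fails together is reported as infinity.
    return single / shared if shared else float("inf")


def mean_ratio_error(all_preds, labels):
    """Average of the ratio-error over all unordered pairs of members."""
    pairs = list(combinations(range(len(all_preds)), 2))
    return sum(ratio_error(all_preds[i], all_preds[j], labels)
               for i, j in pairs) / len(pairs)


# Hypothetical predictions of three members on four inputs.
labels = [0, 1, 2, 3]
member_a = [0, 1, 0, 0]
member_b = [0, 0, 2, 0]
member_c = [1, 1, 2, 0]
print(ratio_error(member_a, member_b, labels))
print(mean_ratio_error([member_a, member_b, member_c], labels))
```

In this toy example every pair has two single errors and one shared error, so both the pairwise value and the average are 2.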

4.C FURTHER ANALYSIS: UNCERTAINTY ESTIMATION AND CALIBRATION

Procedure. We follow the procedure from Ashukha et al. (2019). To evaluate the quality of the uncertainty estimates, we report two complementary proper scoring rules (Gneiting & Raftery, 2007): the Negative Log-Likelihood (NLL) and the Brier Score (BS) (Brier, 1950). To measure the calibration, i.e., how well classification confidences match the observed prediction accuracy, we report the Expected Calibration Error (ECE) (Naeini et al., 2015) and the Thresholded Adaptive Calibration Error (TACE) (Nixon et al., 2019) with 15 bins: TACE resolves some pathologies in ECE by thresholding and adaptive binning. Ashukha et al. (2019) showed that "comparison of [. . .] ensembling methods without temperature scaling (Guo et al., 2017) might not provide a fair ranking". Therefore, we randomly divide the test set into two equal parts and compute metrics for each half using the temperature $T$ optimized on the other half: their mean is reported. Table 3 compares results after temperature scaling (TS) while those before TS are reported in Table 9 in Appendix A.6.

Results. We recover that ensembling improves performances (Ovadia et al., 2019), as one single network (1-net) performs significantly worse than ensemble approaches with 4-branches ResNet-32. Members' disagreements decrease internal temperature and improve uncertainty estimation. DICE performs best even after TS, and reduces NLL from 8.13 to 7.98 and BS from 3.24 to 3.12 compared to independent learning. Calibration criteria benefit from diversity though they do "not provide a consistent ranking" as stated in Ashukha et al. (2019): for example, we notice that ECE highly depends on hyperparameters, especially $\delta_{cr}$, as shown in Figure 8.

$1 - w$'s predictions behave like an "input-dependent temperature". To measure the ability of our ensemble to distinguish in- and out-of-distribution (OOD) images, we consider other datasets at test time following Hendrycks & Gimpel (2017) (see Appendix D.2).
The confidence score is estimated with the maximum softmax value: the confidence for OOD images should ideally be lower than for CIFAR-100 test images. Temperature scaling (results in Table 7) refines performances (results without TS in Table 6). DICE beats Ind. and CEB in both cases. Moreover, we suspected that features were more correlated for OOD images: they may share redundant artifacts. DICE×w multiplies the classification logits by the mean over all pairs of $1 - w(z_i, z_j, \hat{y})$, $i \neq j$, with predicted $\hat{y}$ (as the true $y$ is not available at test time). DICE×w performs even better than DICE+TS, but at the cost of additional operations. It shows that $w$ can detect spurious correlations, which are adversarially deleted only when found in training.

The inference time in network-ensembles grows linearly with $M$. Sharing early features is one solution. We experiment with another one by using only the $M$-th branch at test time. We combine DICE with OKDDip (Chen et al., 2020b): the $M$-th branch (= the student) learns to mimic the soft predictions from the $M-1$ first branches (= the teacher), among which we enforce diversity. Our teacher has lower internal temperature (as shown in Experiment 4.c): DICE performs best when soft predictions are generated with lower $T$. We improve state-of-the-art by {+0.42, +0.53} for {3,4}-branches.
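The DICE×w confidence score described above can be sketched in a few lines. This is an illustration only: we assume that the ensemble logits are already averaged over members and that the discriminator outputs for all pairs, evaluated at the predicted class, are given as probabilities; the function names are hypothetical.

```python
import math


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_confidence(logits):
    """Baseline confidence score: the maximum softmax probability."""
    return max(softmax(logits))


def dice_times_w_confidence(ensemble_logits, w_outputs):
    """Scale the logits by the mean of (1 - w) over all member pairs,
    then take the maximum softmax probability."""
    scale = sum(1.0 - w for w in w_outputs) / len(w_outputs)
    return max_softmax_confidence([scale * l for l in ensemble_logits])


# Hypothetical logits and discriminator outputs for a 2-class problem.
logits = [2.0, 0.0]
print(dice_times_w_confidence(logits, [0.1, 0.2, 0.1]))  # weakly correlated features
print(dice_times_w_confidence(logits, [0.9, 0.8, 0.9]))  # strongly correlated features
```

Since the scale lies between 0 and 1, it acts as an inverse temperature: the more the discriminator finds the features of different members predictable from each other, the flatter the softmax and the lower the confidence.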

5. CONCLUSION

In this paper, we addressed the task of improving deep ensembles' learning strategies. Motivated by arguments from information theory, we derive a novel adversarial diversity loss, based on conditional mutual information. We tackle the trade-off between individual accuracies and ensemble diversity by deleting spurious and redundant correlations. We reach state-of-the-art performance on standard image classification benchmarks. In Appendix F.2, we also show how to regularize deterministic encoders with conditional redundancy without compression: this increases the applicability of our research findings. The success of many real-world systems in production depends on the robustness of deep ensembles: we hope to pave the way towards general-purpose strategies that go beyond independent learning.

Appendices

Appendix A shows additional experiments. Appendix B describes our implementation to facilitate reproduction. In Appendix C, we summarize the concurrent approaches (see Table 10 ). In Appendix D, we describe the datasets and the metrics used in our experiments. Appendix E clarifies certain theoretical formulations. In Appendix F, we explain that DICE is a second-order approximation in terms of information interactions and then we try to apply our diversity regularization to deterministic encoders. Appendix G motivates the removal of the second term from our neural estimation of conditional redundancy. We conclude with a sociological analogy in Appendix H.

A ADDITIONAL EXPERIMENTS

A.1 COMPARISONS WITH CO-DISTILLATION AND SNAPSHOT-BASED APPROACHES

A.2 OUT-OF-DISTRIBUTION DETECTION

Table 6 summarizes our OOD experiments in the 4-branches ResNet-32 setup. We recover that IB improves OOD detection (Alemi et al., 2018). Moreover, we empirically validate our intuition: features from in-distribution images are on average less predictable from each other than pairs of features from OOD images. $w$ can perform alone as an OOD detector, but is best used as a complement to DICE. In DICE×w, logits are multiplied by the sigmoid output of $w$ averaged over all pairs. Table 7 shows that temperature scaling improves all approaches without modifying the ranking. Finally, DICE×w, even without TS, is better than DICE, even with TS.

We measured diversity in 4.b with the ratio error (Aksela, 2003). But as stated by Kuncheva & Whitaker (2003), diversity can be measured in numerous ways. For pairwise measures, we averaged over the $\binom{M}{2}$ pairs: the Q-statistics is positive when classifiers recognize the same object, and the agreement score measures the frequency with which both classifiers predict the same class. Note that even if we only apply pairwise constraints, we also increase non-pairwise measures: for example, the Kohavi-Wolpert variance (Kohavi et al., 1996), which measures the variability of the predicted class, and the entropy diversity, which measures overall disagreement.

Ensemble with M Models. In the general case, we only consider pairwise interactions, therefore we need to estimate $\binom{M}{2}$ values. To reduce the number of parameters, we use only one discriminator $w$. Features associated with $z_k$ are filled with zeros when we sample from $p(z_i, z_j, y)$ or from $p(z_i, y)\, p(z_j | y)$, where $i, j, k \in \{1, \dots, M\}$, $k \neq i$ and $k \neq j$. Therefore, the input tensor for the discriminator is of size $(M \times d + 64)$ and its first layer has $(M \times d + 64) \times 256$ dense weights: the number of weights in $w$ scales linearly with $M$ and $d$, as $w$'s input grows linearly while $w$'s hidden size remains fixed.
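The zero-filling scheme and the resulting weight count can be sketched as follows. This is a minimal illustration with hypothetical function names; in particular, we assume that the extra 64 input dimensions hold a class-conditioning vector, which the text above does not state explicitly.

```python
def discriminator_input(z_i, z_j, i, j, num_members, class_vector):
    """Build the flat input for the shared discriminator: one slot of size d
    per member, zeros everywhere except the slots of members i and j,
    followed by the class-conditioning vector (an assumption)."""
    if i == j:
        raise ValueError("a pair requires two different members")
    d = len(z_i)
    slots = [[0.0] * d for _ in range(num_members)]
    slots[i] = list(z_i)
    slots[j] = list(z_j)
    flat = [value for slot in slots for value in slot]
    return flat + list(class_vector)


def first_layer_weights(num_members, d, class_dim=64, hidden=256):
    """Number of dense weights in the first layer of the discriminator."""
    return (num_members * d + class_dim) * hidden


# Tiny example with d = 3, M = 4 and a 2-dimensional class vector.
x = discriminator_input([1.0, 2.0, 3.0], [4.0, 5.0, 6.0],
                        i=0, j=2, num_members=4, class_vector=[9.0, 9.0])
print(x)
print(first_layer_weights(4, 64), first_layer_weights(5, 64))
```

With d = 64, going from 4 to 5 members adds 256 × 64 first-layer weights, which matches the linear growth described above.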

A.6 UNCERTAINTY ESTIMATION AND CALIBRATION BEFORE TEMPERATURE SCALING

$\delta_{cr}$ value. For branch-based and network-based CIFAR-100, we found $\delta_{cr}$ at {0.1, 0.15, 0.2, 0.22, 0.25} for {2, 3, 4, 5, 6} members to perform best on the validation dataset when training on 95% of the classical training dataset. For CIFAR-10, we found {0.1} for 4 members. We found that lower values of $\delta_r$ were necessary for our baselines IBR and CEBR.

Scheduling. For fair comparison, we apply the traditional ramp-up scheduling up to step 80 from the co-distillation literature (Lan et al., 2018; Kim et al., 2019b; Chen et al., 2020b) to all concurrent approaches and to our redundancy training.

Sampling. To sample from $p(z_1, z_2, y)$, we select features extracted from one image. To sample from $p(z_1, y)\, p(z_2 | y)$, we select features extracted from two different inputs that share the same class $y$. In practice, we keep a memory from previous batches as the batch size is 128 whereas we have 100 classes in CIFAR-100. This memory, of size $M \times d \times K \times 4$, is updated at the end of each training step. Our sampling is a special case of k-NN sampling (Molavipour et al., 2020): as we sample from a discrete categorical variable, the closest neighbour has exactly the same discrete value. The training can be unstable as it minimises the divergence between two distributions. To make them overlap over the feature space, we sample num_sample = {4} times from the Gaussian distribution of $Z_1$ and $Z_2$ with the reparameterization trick. This procedure is similar to instance noise (Sønderby et al., 2016) and it allows us to safely optimise $w$ at each iteration. It gives better robustness than just giving the Gaussian mean. Moreover, we progressively ease the discriminator task by scheduling the covariance through time with a linear ramp-up. First the covariance is set to 1 until epoch 100, then it linearly reduces to the predicted covariance $e_i^{\sigma}(x)$ until step 250. We sample with a ratio (ratio_neg_pos) of one positive pair for {2, 4} negative pairs on CIFAR-{10, 100}.
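The sampling of joint and product triples can be sketched as below. This is a simplified illustration with hypothetical names: it draws same-class partners from the current batch only, whereas the procedure described above also relies on a memory of previous batches, and it skips inputs that have no same-class partner.

```python
import random
from collections import defaultdict


def sample_pairs(z1, z2, labels, ratio_neg_pos, rng):
    """Return (joint, product) lists of triples (z_1, z_2, y).
    Joint: both features come from the same input.
    Product: the second feature comes from another input of the same class."""
    by_class = defaultdict(list)
    for n, y in enumerate(labels):
        by_class[y].append(n)
    joint, product = [], []
    for n, y in enumerate(labels):
        joint.append((z1[n], z2[n], y))
        others = [m for m in by_class[y] if m != n]
        if not others:
            continue  # simplification: no memory of previous batches here
        for _ in range(ratio_neg_pos):
            m = rng.choice(others)
            product.append((z1[n], z2[m], y))
    return joint, product


# Hypothetical batch of three inputs; strings stand in for feature vectors.
rng = random.Random(0)
joint, product = sample_pairs(["a0", "a1", "a2"], ["b0", "b1", "b2"],
                              labels=[0, 0, 1], ratio_neg_pos=2, rng=rng)
print(joint)
print(product)
```

Here the third input is alone in its class within the batch, so it contributes a joint triple but no product triple, which is the situation the memory of previous batches is meant to avoid.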
Clipping Following Bachman et al. (2019) , we clip the density ratios (tanhclip) by computing the non linearity exp[τ tanh log[f (z1,z2,y)] τ ]. A lower τ reduces the variance of the estimation and stabilizes the training even with a strong discriminator, at the cost of additional bias. The clipping threshold τ was set to 10 as in Song & Ermon (2020) .
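The clipping step can be illustrated with a minimal Python sketch. It assumes the discriminator outputs a probability $w \in (0, 1)$ and that the density ratio is $f = w/(1-w)$, as written in Algorithm 1. The function names and the `eps` guard are illustrative assumptions, not the authors' implementation.

```python
import math


def density_ratio(w_out: float, eps: float = 1e-7) -> float:
    """Density ratio f = w / (1 - w) from a discriminator probability w."""
    # The eps guard is an assumption of this sketch: it avoids division by zero.
    w_out = min(max(w_out, eps), 1.0 - eps)
    return w_out / (1.0 - w_out)


def tanhclip(ratio: float, tau: float = 10.0) -> float:
    """Soft clipping exp(tau * tanh(log(ratio) / tau)) of a positive ratio."""
    return math.exp(tau * math.tanh(math.log(ratio) / tau))


if __name__ == "__main__":
    # The clipped ratio always stays inside (exp(-tau), exp(tau)).
    for w_out in (0.1, 0.5, 0.9, 0.999999):
        f = density_ratio(w_out)
        print(w_out, f, tanhclip(f, tau=10.0))
```

For ratios close to 1 the clipping is almost the identity, while very large or very small ratios saturate smoothly, which is how a lower $\tau$ trades variance for bias.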

B.4 PSEUDO-CODE

Algorithm 1: Full DICE Procedure for M = 2 members

/* Setup */
Parameters: $\theta_1 = \{e_1, b_1, c_1\}$, $\theta_2 = \{e_2, b_2, c_2\}$

$z_i^n \leftarrow e_i^{\mu}(z|x^n) + \epsilon\, e_i^{\sigma}(z|x^n)$, $\forall n \in B$ with $\epsilon \sim N(0, 1)$
$\text{VCEB}_i \leftarrow \frac{1}{b} \sum_{n \in B} \{ \frac{1}{\beta^s_{ceb}} D_{KL}(e_i(z|x^n) \,\|\, b_i(z|y^n)) - \log c_i(y^n|z_i^n) \}$

$z_{i,k}^n \leftarrow e_i^{\mu}(z|x^n) + \epsilon\, e_i^{\sigma,s}(z|x^n)$, $\forall n \in B$ with $\epsilon \sim N(0, 1)$
$B_j \leftarrow \{(z_{1,k}^n, z_{2,k}^n, y^n)\}$, $\forall n \in B$, $k \in \{1, \ldots, num_s\}$ // Joint Distrib.
$\mathcal{L}^{CR}_{DV} \leftarrow \frac{1}{|B_j|} \sum_{t \in B_j} \log f(t)$ with $f(t) \leftarrow \text{tanhclip}(\frac{w(t)}{1 - w(t)}, \tau)$
$\theta_{1,2} \leftarrow g_{\theta_{1,2}}(\nabla_{\theta_1} \text{VCEB}_1 + \nabla_{\theta_2} \text{VCEB}_2 + \delta^s_{cr} \nabla_{\theta_{1,2}} \mathcal{L}^{CR}_{DV})$ // Backprop Ensemble

/* Step 3: Adversarial Training */
for _ ← 1 to $nstep_d$ do
    $B_j \leftarrow \{(z_{1,k}^n, z_{2,k}^n, y^n)\}$, $\forall n \in B$, $\forall k \in \{1, \ldots, num_s\}$ // Joint Distrib.
    $B_p \leftarrow \{(z_{1,k}^n, z_{2,k'}^{n'}, y^n)\}$, $\forall n \in B$, $\forall k \in \{1, \ldots, num_s\}$, $k' \in \{1, \ldots, ratio^{neg}_{pos}\}$
        with $n' \in B$, $y^n = y^{n'}$, $n \neq n'$ // Product distribution
    $w \leftarrow g_w(\nabla_w \mathcal{L}_{ce}(w))$ // Backprop Discriminator
    Sample new $z_{i,k}^n$

/* Test Procedure */
Data: Inputs $\{x^n\}_{n=1}^{T}$ // Test Data
Output: $\arg\max_{k \in \{1, \ldots, K\}} \left( \frac{1}{2} [c_1(e_1^{\mu}(z|x^n)) + c_2(e_2^{\mu}(z|x^n))] \right)$, $\forall n \in \{1, \ldots, T\}$

B.5 EMPIRICAL LIMITATIONS

Our approach relies on very recent works in neural network estimation of mutual information, that still suffer from loose approximations. Improvements in this area would facilitate our learning procedure. Our approach increases the number of operations because of the adversarial procedure, but only during training: the inference time remains the same.

DML (Zhang et al., 2018) Pred. pairwise Net
CL-ILR (Song & Chai, 2018) Pred. Branch
ONE (Lan et al., 2018) Preds Gate Branch
FFL (Kim et al., 2019b) Pred. Feat. Fus. Both
OKDDip (Chen et al., 2020b) Pred. asymetric Both ≈
KDCL (Guo et al., 2020) Pred. Data Augmentation Weights on val Net
PCL (Wu & Gong, 2020) Pred. Data Augmentation Feat. Fus. Mean teacher Branch ≈
AFD (Chung et al., 2020) Features Net ≈
GAL (Kariyappa & Qureshi, 2019) Gradients Net
GPMR (Dabouei et al., 2020) Gradients Grads. Magnitude Net
ADP (Pang et al., 2019) Non maximum pred. Entropy Pred. Both
DIBS (Sinha et al., 2020) JSD

C.1 CO-DISTILLATION APPROACHES

Contrary to traditional distillation (Hinton et al., 2015), which aligns the soft predictions of a smaller student with those of a static, pre-trained, strong teacher, online co-distillation performs teaching in an end-to-end, one-stage procedure: the teacher and the student are trained simultaneously.

Distillation in Logits

The seminal "Deep Mutual Learning" (DML) (Zhang et al., 2018) introduced the main idea: multiple networks learn to mimic each other by reducing KL-losses between pairs of predictions. "Collaborative learning for deep neural networks" (CL-ILR) (Song & Chai, 2018) used the branch-based architecture by sharing low-level layers to reduce the training complexity, and "Knowledge Distillation by On-the-Fly Native Ensemble" (ONE) (Lan et al., 2018) used a weighted combination of logits as teacher, hence providing better information to each network. "Online Knowledge Distillation via Collaborative Learning" (KDCL) (Guo et al., 2020) computed the optimum weight on a held-out validation dataset. "Feature Fusion for Online Mutual Knowledge Distillation" (FFL) (Kim et al., 2019b) introduced a feature fusion module. These approaches improve individual performance at the cost of increased homogenization. "Online Knowledge Distillation with Diverse Peers" (OKDDip) (Chen et al., 2020b) slightly alleviates this problem with an asymmetric distillation and a self-attention mechanism. "Peer Collaborative Learning for Online Knowledge Distillation" (PCL) (Wu & Gong, 2020) benefited from the mean-teacher paradigm with temporal ensembling and from diverse data augmentation, at the cost of multiple inferences through the shared backbone.

Distillation in Features

Whereas all previous approaches only apply distillation on the logits, the recent "Feature-map-level Online Adversarial Knowledge Distillation" (AFD) (Chung et al., 2020) aligned feature distributions by adversarial training. Note that this is the opposite of our approach, as they force distributions to be similar while we force them to be uncorrelated.
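The mutual-mimicry idea behind logit-level co-distillation can be sketched for two networks as follows. This is a simplified illustration that assumes each network is penalised by the KL divergence from its peer's prediction to its own; the function names are hypothetical, and the classification losses, temperatures and weightings used by the cited methods are omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_kl_losses(logits_1, logits_2):
    """Return the mimicry loss of network 1 and of network 2."""
    p1, p2 = softmax(logits_1), softmax(logits_2)
    # Network 1 is pulled towards p2 through KL(p2 || p1), and vice versa.
    return kl_divergence(p2, p1), kl_divergence(p1, p2)


if __name__ == "__main__":
    print(mutual_kl_losses([2.0, 0.5, 0.1], [1.0, 1.5, 0.2]))
```

Both losses vanish only when the two predictions coincide, which illustrates why such objectives tend to homogenize the members.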

C.2 DIVERSITY APPROACHES

On the other hand, some recent papers in computer vision explicitly encourage diversity among the members with regularization losses.

Diversity in Logits

"Diversity Regularization in Deep Ensembles" (Shui et al., 2018) applied negative correlation (Liu & Yao, 1999a) to regularize the training for improved calibration, with no impact on accuracy. "Learning under Model Misspecification: Applications to Variational and Ensemble methods" (Masegosa, 2020) theoretically motivated the minimization of second-order PAC-Bayes bounds for ensembles, empirically estimated through a generalized variational method. "Adaptive Diversity Promoting" (ADP) (Pang et al., 2019) decorrelates only the non-maximal predictions to maintain the individual accuracies, while promoting ensemble entropy. It forces different members to rank the non-maximal predictions differently. However, Liang et al. (2018) have shown that the ranking of outputs is critical: for example, non-maximal logits tend to be more separated from each other for in-domain inputs than for out-of-domain inputs. Therefore individual accuracies are decreased. Coefficients α and β are respectively set to 2 and 0.5, as in the original paper.

Diversity in Features

One could think about increasing classical distances among features, like $L_2$ in (Kim et al., 2018), but in our experiments it reduces overall accuracy: it is not even invariant to linear transformations such as translation. "Diversity inducing Information Bottleneck in Model Ensembles" from Sinha et al. (2020) trains a multi-branch network and applies VIB on each individual branch, by encoding p(z|y) ∼ N(0, 1), which was shown to be hard to learn (Wu & Fischer, 2020). Moreover, we notice that their diversity-inducing adversarial loss is an estimation of the JS-divergence between pairs of features, built on the dual f-divergence representation (Nowozin et al., 2016); a similar idea was recently used for saliency detection (Chen et al., 2020a). As the JS-divergence is a symmetrical formulation of the KL, we argue that DIBS and IBR share the same motivations and only have minor discrepancies: the DIBS loss additionally contains adversarial terms with both terms sampled from the same branch and with both terms sampled from the same prior. In our experiments, these differences reduce overall performance. We will include their scores when they publish measurable results on CIFAR datasets or when they release their code.

Diversity in Gradients

"Improving adversarial robustness of ensembles with diversity training" (GAL) (Kariyappa & Qureshi, 2019) enforced diversity in the gradients with a gradient alignment loss. "Exploiting Joint Robustness to Adversarial Perturbations" (Dabouei et al., 2020) considered the optimal bound for the similarity of gradients. However, as stated in the latter, "promoting diversity of gradient directions slightly degrades the classification performance on natural examples . . . [because] classifiers learn to discriminate input samples based on distinct sets of representative features". Therefore we do not consider them as concurrent work.

D EXPERIMENTAL SETUP

D.1 TRAINING DATASETS

We train our procedure on two image classification benchmarks, CIFAR-100 and CIFAR-10 (Krizhevsky et al., 2009). Each consists of 60k natural color images of size 32×32, in respectively 100 classes and 10 classes, with 50k training images and 10k test images. For hyperparameter selection and ablation studies, we train on 95% of the training dataset, and analyze performances on the validation dataset made of the remaining 5%.

D.2 OOD

Dataset We used the traditional out-of-distribution datasets for CIFAR-100, described in (Liang et al., 2018): TinyImageNet (Deng et al., 2009), LSUN (Yu et al., 2015), iSUN (Xu et al., 2015), and CIFAR-10. We borrowed the evaluation code from https://github.com/uoguelph-mlrg/confidence_estimation (DeVries & Taylor, 2018).

Metrics We reported the standard metrics for binary classification: FPR at 95% TPR, Detection error, AUROC (Area Under the Receiver Operating Characteristic curve) and AUPR (Area under the Precision-Recall curve, -in or -out depending on which dataset is specified as positive). See Liang et al. (2018) for definitions and interpretations of these metrics.

in its class. These class-embeddings are similar to class-prototypes, highlighting a theoretical link between CEB (Fischer, 2020; Fischer & Alemi, 2020) and prototype based learning methods (Liu & Nakagawa, 2001).

E.4 DIFFERENCE BETWEEN VCEB AND VIB

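In Fischer (2020), CEB is variationally upper bounded by VCEB. We detail the computations:

$$
\begin{aligned}
\text{CEB}_{\beta_{ceb}}(Z) &= \frac{1}{\beta_{ceb}} I(X; Z|Y) - I(Y; Z) && \text{(Definition)} \\
&= \frac{1}{\beta_{ceb}} [I(X, Y; Z) - I(Y; Z)] - I(Y; Z) && \text{(Chain rule)} \\
&= \frac{1}{\beta_{ceb}} [I(X; Z) - I(Y; Z)] - I(Y; Z) && \text{(Markov assumptions)} \\
&= \frac{1}{\beta_{ceb}} [-H(Z|X) + H(Z|Y)] - [H(Y) - H(Y|Z)] && \text{(MI as diff. of 2 ent.)} \\
&\leq \frac{1}{\beta_{ceb}} [-H(Z|X) + H(Z|Y)] - [-H(Y|Z)] && \text{(Non-negativity of ent.)} \\
&\approx \frac{1}{N} \sum_{n=1}^{N} \int \left\{ \frac{1}{\beta_{ceb}} \log \frac{e(z|x^n)}{b(z|y^n)} - \log c(y^n|z) \right\} e(z|x^n)\, \partial z && \text{(Empirical data distrib.)} \\
&\approx \text{VCEB}_{\beta_{ceb}}(\theta = \{e, b, c\}),
\end{aligned}
$$

where

$$\text{VCEB}_{\beta_{ceb}}(\theta = \{e, b, c\}) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \frac{1}{\beta_{ceb}} D_{KL}(e(z|x^n) \,\|\, b(z|y^n)) - \mathbb{E}_{\epsilon} \log c(y^n | e(x^n, \epsilon)) \right\}.$$

As a reminder, Alemi et al. (2017) upper bounded

$$\text{IB}_{\beta_{ib}}(Z) = \frac{1}{\beta_{ib}} I(X; Z) - I(Y; Z)$$

by

$$\text{VIB}_{\beta_{ib}}(\theta = \{e, b, c\}) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \frac{1}{\beta_{ib}} D_{KL}(e(z|x^n) \,\|\, b(z)) - \mathbb{E}_{\epsilon} \log c(y^n | e(x^n, \epsilon)) \right\}.$$

In VIB, all feature distributions $e(z|x)$ are moved towards the same class-agnostic distribution $b(z) \sim N(\mu, \sigma)$, independently of $y$. In VCEB, the $e(z|x)$ are moved towards the class-conditional marginal $b(z|y) \sim N(b^{\mu}(y), b^{\sigma}(y))$. This is the only difference between VIB and VCEB. VIB leads to a looser approximation with more bias than VCEB.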
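The difference between the two bounds can be made concrete with a small per-sample sketch. It assumes diagonal Gaussian encoders parameterised by a mean and a standard deviation, unit-variance priors as in B.2, and a classifier log-probability supplied by the caller. All names are illustrative, and the expectation over $\epsilon$ is replaced by a single given value.

```python
import math


def kl_diag_gaussians(mu_e, sigma_e, mu_b, sigma_b):
    """KL(N(mu_e, sigma_e^2) || N(mu_b, sigma_b^2)) for diagonal Gaussians."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu_e, sigma_e, mu_b, sigma_b):
        total += math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5
    return total


def vceb_term(mu_e, sigma_e, class_means, y, log_prob_y, beta_ceb):
    """Per-sample VCEB: the prior mean depends on the label y."""
    ones = [1.0] * len(mu_e)
    return kl_diag_gaussians(mu_e, sigma_e, class_means[y], ones) / beta_ceb - log_prob_y


def vib_term(mu_e, sigma_e, shared_mean, log_prob_y, beta_ib):
    """Per-sample VIB: one class-agnostic prior mean for every sample."""
    ones = [1.0] * len(mu_e)
    return kl_diag_gaussians(mu_e, sigma_e, shared_mean, ones) / beta_ib - log_prob_y


if __name__ == "__main__":
    class_means = {0: [1.0, 0.0], 1: [-1.0, 0.0]}  # hypothetical class embeddings
    mu_e, sigma_e = [1.0, 0.0], [1.0, 1.0]
    log_p = math.log(0.5)
    print(vceb_term(mu_e, sigma_e, class_means, 0, log_p, beta_ceb=2.0))
    print(vib_term(mu_e, sigma_e, [0.0, 0.0], log_p, beta_ib=2.0))
```

In this toy case the encoder output sits exactly on its class embedding, so the VCEB compression term is zero while the VIB term still pays for the distance to the shared prior.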

E.5 TRANSFORMING IBR AND CEBR INTO TRACTABLE LOSSES

In this section we derive the variational approximation of the IBR criterion, defined by:

$$\text{IBR}_{\beta_{ib}, \delta_r}(Z_1, Z_2) = \text{IB}_{\beta_{ib}}(Z_1) + \text{IB}_{\beta_{ib}}(Z_2) + \delta_r I(Z_1; Z_2).$$

Redundancy Estimation To estimate the redundancy component, we apply the same procedure as for conditional redundancy but without the categorical constraint, as in the seminal work of Belghazi et al. (2018) for mutual information estimation. Let $B_j$ and $B_p$ be two random batches sampled respectively from the observed joint distribution $p(z_1, z_2) = p(e_1(z|x), e_2(z|x))$ and the product distribution $p(z_1)p(z_2) = p(e_1(z|x))p(e_2(z|x'))$, where $x, x'$ are two inputs that may not belong to the same class. We similarly train a network $w$ that tries to discriminate these two distributions. With $f = \frac{w}{1-w}$, the redundancy estimation is:

$$\hat{I}^{R}_{DV} = \underbrace{\frac{1}{|B_j|} \sum_{(z_1, z_2) \in B_j} \log f(z_1, z_2)}_{\text{Diversity}} - \log\left( \frac{1}{|B_p|} \sum_{(z_1, z_2') \in B_p} f(z_1, z_2') \right),$$

and the final loss:

$$\mathcal{L}^{R}_{DV}(e_1, e_2) = \frac{1}{|B_j|} \sum_{(z_1, z_2) \in B_j} \log f(z_1, z_2).$$

IBR Finally we train $\theta_1 = \{e_1, b_1, c_1\}$ and $\theta_2 = \{e_2, b_2, c_2\}$ jointly by minimizing:

$$\mathcal{L}_{IBR}(\theta_1, \theta_2) = \text{VIB}_{\beta_{ib}}(\theta_1) + \text{VIB}_{\beta_{ib}}(\theta_2) + \delta_r \mathcal{L}^{R}_{DV}(e_1, e_2).$$

CEBR For ablation study, we also consider a criterion that would benefit from CEB's tight approximation but with non-conditional redundancy regularization:

$$\mathcal{L}_{CEBR}(\theta_1, \theta_2) = \text{VCEB}_{\beta_{ceb}}(\theta_1) + \text{VCEB}_{\beta_{ceb}}(\theta_2) + \delta_r \mathcal{L}^{R}_{DV}(e_1, e_2).$$

Applying information-theoretic principles for deep ensembles leads to tackling interactions among features through conditional mutual information minimization. We define the order of an information interaction as the number of different extracted features involved.

First Order Tackling the first-order interaction $I(X; Z_i|Y)$ with VCEB empirically increased overall performance compared to ensembles of deterministic feature extractors learned with categorical cross entropy, at no cost in inference and almost no additional cost in training. In the Markov chain $Z_i \leftarrow X \rightarrow Z_j$, the chain rule provides: $I(Z_i; Z_j|Y) \leq I(X; Z_i|Y)$. More generally, $I(X; Z_i|Y)$ upper bounds higher order interactions such as the third order $I(Z_i; Z_j, Z_k|Y)$. In conclusion, VCEB reduces an upper bound of higher order interactions with quite a simple variational approximation.

Second Order In this paper, we directly target the second-order interaction $I(Z_i; Z_j|Y)$ through a more complex adversarial training. We increase diversity and performances by removing spurious correlations shared by $Z_i$ and $Z_j$ that would otherwise cause simultaneous errors.

Higher Order Higher order interactions include the third order $I(Z_i; Z_j, Z_k|Y)$, the fourth order $I(Z_i; Z_j, Z_k, Z_l|Y)$, etc., up to the $M$-th order. They capture more complex correlations among features. For example, $Z_j$ alone (and $Z_k$ alone) could be unable to predict $Z_i$, while $[Z_j, Z_k]$ together could. However, we only consider first and second order interactions in the current submission. This truncation is common practice, for example in the feature selection literature (Battiti, 1994; Fleuret, 2004; Brown, 2009; Peng et al., 2005). The main reason to truncate higher order interactions is computational, as the number of components would grow exponentially and add significant cost in training. Another reason is empirical: the additional hyper-parameters may be hard to calibrate. But these higher order interactions could be approximated through neural estimations like the second order. For example, for the third order, features $Z_i$, $Z_j$ and $Z_k$ could be given simultaneously to the discriminator $w$. The complete analysis of these higher order interactions has huge potential and could lead to a future research project.

F.2 LEARNING FEATURES INDEPENDENCE WITHOUT COMPRESSION

The question is whether we could learn deterministic encoders with second order I(Z i ; Z j |Y ) regularization without tackling first order I(X; Z i |Y ). We summarized several approaches in Table 11 . First Approach Without Sampling Deterministic encoders predict deterministic points in the features space. Feeding the discriminator w with deterministic triples without sampling increases diversity and reaches 77.09, compared to 76.78 for independent deterministic. Compared to DICE, w's task has been simplified: indeed, w tries to separate the joint and the product deterministic distributions that may not overlap anymore. This violates convergence conditions, destabilizes overall adversarial training and the equilibrium between the encoders and the discriminator.

Sampling and Reparameterization Trick

To make the joint and product distributions overlap over the features space, we apply the reparameterization trick on features with variance 1. This second approach is similar to instance noise (Sønderby et al., 2016), which tackled the instability of adversarial training. We reached 77.33 by protecting individual accuracies.

Synergy between CEB and CR In comparison, we obtain 77.51 with DICE. In addition to theoretical motivations, VCEB and CR work empirically in synergy. First, the adversarial learning is simplified and only focuses on spurious correlations that VCEB has not already deleted. This may explain the improved stability with respect to the value of $\delta_{cr}$ and the reduction in standard deviations in performances. Second, VCEB learns a Gaussian distribution: a mean but also an input-dependent covariance $e_i^{\sigma}(x)$. This covariance fits the uncertainty of a given sample: in a similar context, Yu et al. (2019) have shown that large covariances were given for difficult samples. Sampling from this input-dependent covariance performs better than using an arbitrary fixed variance shared by all dimensions from all extracted features from all samples, from 77.29 to 77.51.

Conclusion DICE benefits from both components: learning redundancy along with VCEB improves results, at almost no extra cost. We think CR can definitely be applied with deterministic encoders as long as the inputs of the discriminator are sampled from overlapping distributions in the features space. Future work could study new methods to select the variance in sampling. As compression losses yield additional hyper-parameters and may underperform for some architectures/datasets, learning only the conditional redundancy (without compression) could increase the applicability of our contributions.

with $f = \frac{w}{1-w}$. In this paper, we focused only on the left hand side (LHS) component from equation 11, which leads to $\mathcal{L}^{CR}_{DV}$ in equation 4. We showed empirically that it improves ensemble diversity and overall performances. The LHS forces features extracted from the same input to be unpredictable from each other, to simulate that they have been extracted from two different images. Now we investigate the impact of the right hand side (RHS) component from equation 11. We conjecture that the RHS forces features extracted from two different inputs from the same class to create fake correlations, to simulate that they have been extracted from the same image. Overall, the RHS would correlate members and decrease diversity in our ensemble.
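The two components can be sketched numerically. Equation 11 is not reproduced in this section, so the sketch assumes it has the same two-term Donsker-Varadhan form as the redundancy estimate of E.5: a mean of $\log f$ over the joint batch (LHS) minus the log of the mean of $f$ over the product batch (RHS), with $f = w/(1-w)$. The function name and the input format (lists of discriminator probabilities) are illustrative assumptions.

```python
import math


def dv_components(w_joint, w_product):
    """Return (lhs, rhs) of a Donsker-Varadhan style estimate.

    w_joint:   discriminator probabilities on samples from the joint batch.
    w_product: discriminator probabilities on samples from the product batch.
    """
    def ratio(w):
        return w / (1.0 - w)  # density ratio f = w / (1 - w), needs 0 < w < 1

    lhs = sum(math.log(ratio(w)) for w in w_joint) / len(w_joint)
    rhs = math.log(sum(ratio(w) for w in w_product) / len(w_product))
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = dv_components([0.8, 0.7, 0.9], [0.3, 0.4, 0.2])
    print("LHS:", lhs, "RHS:", rhs, "estimate:", lhs - rhs)
```

Under this assumed form, a loss that keeps only the LHS backpropagates through pairs extracted from the same input only, while the RHS involves pairs from different inputs of the same class.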

G.2 EXPERIMENTS

These intuitions are confirmed by experiments with a 4-branches ResNet-32 on CIFAR-100, which are illustrated in Figure 12 . Training only with the RHS and removing the LHS (the opposite of what is done in DICE) reduces diversity compared to CEB. Moreover, keeping both the LHS and the RHS leads to slightly reduced diversity and ensemble accuracy compared to DICE. We obtained 77.40± 0.19 with LHS+RHS instead of 77.51± 0.17 with only the LHS. In conclusion, dropping the RHS performs better while reducing the training cost. 

H SOCIOLOGICAL ANALOGY

We showed that increasing diversity in features while encouraging the different learners to agree improves performance for neural networks: the optimal diversity-accuracy trade-off was obtained with a large diversity. To finish, we make a short analogy with the importance of diversity in our society. Decision-making in groups is better than individual decision-making as long as the members do not belong to the same cluster. Homogenization of the decision makers increases vulnerability to failures, whereas diversity of backgrounds sparks new discoveries (Muldoon, 2016): ideas should be shared and debated among members reflecting the diversity of the society's various components. Academia especially needs this diversity to promote trust in research (Sierra-Mercado & Lázaro-Muñoz, 2018), to improve quality of the findings (Swartz et al., 2019), productivity of the teams (Vasilescu et al., 2015) and even schooling's impact (Bowman, 2013).

I LEARNING STRATEGY OVERVIEW

We provide in Figure 13 a zoomed version of our learning strategy. 



Figure 1: DICE better leverages ensemble size. Without weights sharing, 5 networks trained with DICE match 7 networks trained independently. With low-level weights sharing, 4 branches trained with DICE match 7 traditional branches. Dataset: CIFAR-100. Backbone: ResNet-32. Details in Table 8.

Figure 3: Venn Information Diagram (Yeung, 1991). DICE minimizes conditional redundancy (green vertical stripes) with no overlap with relevancy (red stripes).

Figure 4: Learning strategy overview. Blue arrows represent training criteria: (1) classification with conditional entropy bottleneck applied separately on members 1 and 2, and (2) adversarial training to delete spurious correlations between members and increase diversity. X and X' belong to the same Y for conditional redundancy minimization. See Figure 13 for a larger version.

Figure 5: Ensemble diversity/individual accuracy trade-off for different strategies. DICE (r. CEBR) is learned with different δ cr (r. δ r ).

Figure 6: Impact of the diversity coefficient δ cr in DICE on the training dynamics on validation: CR is negatively correlated with diversity.

Figure 7: Confidence estimates separate images from CIFAR-100 and OOD images from TinyImageNet (crop) for different strategies (AUROC ↑). DICE×w uses the discriminator to scale its confidence: 1 - w's predictions behave like an "input-dependent temperature".

Figure 8: Training dynamics on the validation dataset while training on 95% of the training dataset. A higher diversity coefficient decreases individual performance (lower left), but increases ensemble performance in terms of accuracy (upper left), uncertainty estimation (upper right) up to a value, found at δ cr = 0.2 for 4-branches ResNet-32. Calibration before temperature scaling (lower right) highly benefits from higher diversity. Learning rate updates create "steps" in the curves.

Figure 10: Discriminator dynamics and learning curve. The task becomes harder for higher values of δ cr : the joint and product features distributions tend to be indistinguishable.

Figure 11: The discriminator remains calibrated even at the end of the adversarial training.

FIRST, SECOND AND HIGHER-ORDER INFORMATION INTERACTIONS F.1 DICE REDUCES FIRST AND SECOND ORDER INTERACTIONS

IMPACT OF THE SECOND TERM IN THE NEURAL ESTIMATION OF CONDITIONAL REDUNDANCY

G.1 CONDITIONAL REDUNDANCY IN TWO COMPONENTS

The conditional redundancy can be estimated by the difference between two components:

Figure 12: Training dynamics and ablation study of components from equation 11. Adding the RHS overall decreases ensemble performances, in terms of accuracy (upper left) or uncertainty estimation (upper right), when combined with CEB or DICE(=LHS). It decreases diversity (lower right) with no clear impact on individual accuracy (lower left).

Figure 13: Learning strategy overview. Blue arrows represent training criteria: (1) classification with conditional entropy bottleneck applied separately on members 1 and 2, and (2) adversarial training to delete spurious correlations between members and increase diversity. X and X belong to the same Y for

We now approximate the two CEB and the CR components in DICE objective from equation 1.

2.B APPROXIMATING DICE INTO A TRACTABLE LOSS

2.B.1 VARIATIONAL APPROXIMATION OF CONDITIONAL ENTROPY BOTTLENECK

We leverage Markov assumptions in $Z_i \leftarrow X \leftrightarrow Y$, $i \in \{1, 2\}$, and empirically estimate on the classification training dataset of $N$ i.i.d. points $D = \{x^n, y^n\}_{n=1}^{N}$, $y^n \in \{1, \ldots, K\}$. Following Fischer (2020), $\text{CEB}_{\beta_{ceb}}$

CIFAR-100 ensemble classification accuracy (Top-1, %). 89± 0.09 77.51± 0.17 78.08± 0.18 77.92± 0.08 81.67±0.14 81.93± 0.13 79.59±0.13 80.05±0.11 80.55± 0.12

Uncertainty estimation (NLL, BS) and calibration (ECE, TACE) on CIFAR-100 after temperature scaling.

Individual accuracy for branch-based co-distillation on CIFAR-100

Ensemble Accuracy on different setups. Concurrent approaches' accuracies are those reported in recent papers. DICE outperforms co-distillation and snapshot-based ensembles collected on the training trajectory, which fail to capture the different modes of the data(Ashukha et al., 2019).

Out-of-distribution performances before temperature scaling.

Out-of-distribution performances after temperature scaling.

Table 8 the Memory Split Advantage (MSA) from Chirkova et al. (2020): splitting the memory budget between three branches of ResNet-32 results in better performance than spending twice the budget on one ResNet-110. DICE further improves this advantage. Our framework is particularly effective in the branch-based setting, as it reduces the computational overhead (especially in terms of FLOPS) at a slight cost in diversity. A 4-branches DICE ensemble has the same accuracy on average as a classical 7-branches ensemble.

Ensemble effectiveness evaluation. Top-1 accuracy (%), number of parameters (M) and floating point operations (GFLOPs). This table is summarized in Figure 1. DICE always outperforms the independent learning baseline, even with only 1 member because of the CEB component. The saturation phenomenon is reduced.

Uncertainty estimation (NLL, BS) and calibration (ECE, TACE) on CIFAR-100 before temperature scaling for 4-branches ResNet-32.

and discriminator w, randomly initialized Input: Observations {x n , y n } N n=1 , coefficients β ceb and δ cr , schedulings sche ceb and rampup endstep startstep , clipping threshold τ , batch size b, optimisers g θ1,2 and g w , number of discriminators step nstep d , number of samples num s , ratio of positive/negative sample ratio neg

Summary of different approaches.

negativity of ent.)

Comparison between deterministic and distribution encoders on 4-branches ResNet-32 for Top-1 accuracy (%) on CIFAR-100.

unifies our main results on CIFAR-100 from Table 1 and CIFAR-10 from Table 2.

ACKNOWLEDGMENTS

This work was granted access to the HPC resources of IDRIS under the allocation 20XX-AD011011953 made by GENCI. We acknowledge the financial support by the ANR agency in the chair VISA-DEEP (project number ANR-20-CHIA-0022-01). Finally, we would like to thank those who helped and supported us during these confinements, in particular Julie and Rouille.

B TRAINING DETAILS

B.1 GENERAL OPTIMIZATION

Experiments Classical hyperparameters were taken from (Chen et al., 2020b) for conducting fair comparisons. Newly added hyperparameters were fine-tuned on a validation dataset made of 5% of the training dataset.

Architecture We implemented the proposed method with ResNet (He et al., 2016) and Wide-ResNet (Zagoruyko & Komodakis, 2016) architectures. Following standard practices, we average the logits of our predictions uniformly. For branch-based ensembles, the last block and the classifier are specific to each member, while the other low-level layers are shared.

Learning Following (Chen et al., 2020b), we used SGD with Nesterov momentum of 0.9, a mini-batch size of 128, a weight decay of 5e-4, 300 epochs, and a standard learning rate scheduler that sets values {0.1, 0.001, 0.0001} at steps {0, 150, 225} for CIFAR-10/100. In CIFAR-100, we additionally set the learning rate at 0.00001 at step 250. We used traditional basic data augmentation that consists of horizontal flips and a random crop of 32 pixels with a padding of 4 pixels. The learning curve is shown on Figure 8.
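The step-wise learning rate schedule described above can be written as a small lookup. This is a sketch only: it assumes the "steps" are epoch indices within the 300 training epochs and uses the values exactly as stated in the text; the function name is hypothetical.

```python
def learning_rate(epoch: int, dataset: str = "cifar100") -> float:
    """Piecewise-constant learning rate as a function of the epoch."""
    # (start epoch, learning rate) pairs, taken from the text above.
    milestones = [(0, 0.1), (150, 0.001), (225, 0.0001)]
    if dataset == "cifar100":
        # Extra drop stated for CIFAR-100 only.
        milestones.append((250, 0.00001))
    lr = milestones[0][1]
    for start, value in milestones:
        if epoch >= start:
            lr = value
    return lr


if __name__ == "__main__":
    for epoch in (0, 149, 150, 225, 260):
        print(epoch, learning_rate(epoch), learning_rate(epoch, "cifar10"))
```

In a training loop, the returned value would simply be assigned to the optimizer at the start of each epoch.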

B.2 INFORMATION BOTTLENECK IMPLEMENTATION

Architecture Features are extracted just before the dense layer since deeper layers are more semantic, of size d = {64, 128, 256} for {ResNet-32, WRN-28-2, ResNet-110}. Our encoder does not provide a deterministic point in the features space but a feature distribution encoded by a mean and a diagonal covariance matrix. The covariance is predicted after a Softplus activation function with one additional dense layer, taking as input the features mean, with d(d + 1) trainable weights. In training we sample once from this features distribution with the reparameterization trick. In inference, we predict from the distribution's mean (and therefore only once). We parameterized $b(z|y) \sim N(b^{\mu}(y), 1)$ with trainable mean and unit diagonal covariance, with d additional trainable weights per class. As noticed in (Fischer & Alemi, 2020), this can be represented as a single embedding layer mapping one-hot classes to d-dimensional tensors. Therefore in total we only add d(d + 1 + K) trainable weights, which can all be discarded during inference. For VIB, the embedding $b^{\mu}$ is shared among classes: in total it adds d(d + 2) trainable weights. Contrary to recent IB approaches (Wu et al., 2019b; Wu & Fischer, 2020; Fischer & Alemi, 2020), we only have one dense layer to predict logits after the features bottleneck, and we did not change the batch normalization, for fair comparisons with traditional ensemble methods.

Scheduling We employ the jump-start method that facilitates the learning of bottleneck-inspired models (Wu et al., 2019b; Wu & Fischer, 2020; Fischer & Alemi, 2020): we progressively anneal the value of $\beta_{ceb}$. For CIFAR-10, we took the scheduling from (Fischer & Alemi, 2020), except that we widened the intervals to make the training loss decrease more smoothly: $\log(\beta_{ceb})$ reaches values {100, 10, 2} at steps {0, 5, 100}. No standard scheduling was available for CIFAR-100. As it is more difficult than CIFAR-10, we added additional jump-epochs with lower values: $\log(\beta_{ceb})$ reaches values {100, 10, 2, 1.5, 1} at steps {0, 8, 175, 250, 300}. This slow scheduling progressively increases the covariance predictions $e^{\sigma}(x)$ and facilitates learning. For VIB, we scheduled similarly using the equivalence from (Fischer, 2020): $\beta_{ib} = \beta_{ceb} + 1$. We found VCEB to have lower standard deviation in performances than VIB: $\beta_{ib}$ can hinder the learnability (Wu et al., 2019b). These schedulings have been used in all our setups, without and with redundancy losses, for ResNet-32, ResNet-110 and WRN-28-2, for 1 to 10 members.
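The CIFAR-100 annealing of $\log(\beta_{ceb})$ can be sketched as a piecewise schedule over the listed jump-epochs. The text only states the values reached at each step, so the linear interpolation between consecutive steps is an assumption of this sketch, as are the function and constant names.

```python
# Jump-epochs and target values of log(beta_ceb) for CIFAR-100, from the text.
STEPS = (0, 8, 175, 250, 300)
VALUES = (100.0, 10.0, 2.0, 1.5, 1.0)


def log_beta_ceb(epoch, steps=STEPS, values=VALUES):
    """Value of log(beta_ceb) at a given epoch (linear interpolation assumed)."""
    if epoch <= steps[0]:
        return values[0]
    for i in range(1, len(steps)):
        if epoch <= steps[i]:
            s0, s1 = steps[i - 1], steps[i]
            v0, v1 = values[i - 1], values[i]
            return v0 + (v1 - v0) * (epoch - s0) / (s1 - s0)
    return values[-1]  # hold the last value after the final step


if __name__ == "__main__":
    for epoch in (0, 4, 8, 100, 175, 250, 300):
        print(epoch, log_beta_ceb(epoch))
```

The CIFAR-10 schedule could be obtained from the same function by passing `steps=(0, 5, 100)` and `values=(100.0, 10.0, 2.0)`.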

B.3 ADVERSARIAL TRAINING IMPLEMENTATION

Redundancy Following standard adversarial learning practices, our discriminator for redundancy estimation is an MLP with 4 layers of size {256, 256, 100, 1}, with leaky-ReLUs of slope 0.2, optimized by RMSProp with learning rate {0.003, 0.005} for CIFAR-{10, 100}. We empirically found that four steps for the discriminator for one step of the classifier increase stability. Specifically, it takes as input the concatenation of the two hidden representations of size d, sampled with a repa-

E ADDITIONAL THEORETICAL ELEMENTS E.1 BIAS VARIANCE COVARIANCE DECOMPOSITION

The Bias-Variance-Covariance Decomposition (Ueda & Nakano, 1996) generalizes the Bias-Variance Decomposition (Kohavi et al., 1996) by treating the ensemble of M members as a single learning unit. The estimation improves when the covariance between members is zero: the reduction factor of the variance component equals M when errors are uncorrelated. Compared to the Bias-Variance Decomposition (Kohavi et al., 1996), it leads to a variance reduction of $\frac{1}{M}$. Brown et al. (2005a; b) summarized it this way: "in addition to the bias and variance of the individual estimators, the generalisation error of an ensemble also depends on the covariance between the individuals. This raises the interesting issue of why we should ever train ensemble members separately; why shouldn't we try to find some way to capture the effect of the covariance in the error function?".
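The effect of the covariance term can be checked numerically. The sketch below assumes the usual form of the decomposition for a uniform average of M members, where the non-bias part of the error is $\frac{1}{M}\,\text{var} + (1 - \frac{1}{M})\,\text{covar}$, with var and covar the average variance and average pairwise covariance of the members. The function name is illustrative.

```python
def ensemble_variance(var: float, cov: float, m: int) -> float:
    """Variance of the uniform average of m estimators.

    var: average variance of the individual members.
    cov: average pairwise covariance between distinct members.
    """
    return var / m + (1.0 - 1.0 / m) * cov


if __name__ == "__main__":
    # Uncorrelated members: the variance is divided by m.
    print(ensemble_variance(1.0, 0.0, 4))
    # Identical members (cov == var): no gain over a single network.
    print(ensemble_variance(1.0, 1.0, 4))
    # Negatively correlated members do even better than uncorrelated ones.
    print(ensemble_variance(1.0, -0.2, 4))
```

The second case matches the remark in the introduction that an ensemble of M identical networks is no better than a single one.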

E.2 MUTUAL INFORMATION

Nobody knows what entropy really is.

John von Neumann to Claude Shannon

At the cornerstone of Shannon's information theory in 1948 (Shannon, 1948), mutual information is the difference between the sum of individual entropies and the entropy of the variables considered jointly. Stated otherwise, it is the reduction in the uncertainty of one variable due to the knowledge of the other variable (Cover, 1999). Entropy owed its name to the thermodynamic measure of uncertainty introduced by Rudolf Clausius and developed by Ludwig Boltzmann.

The conditional mutual information generalizes mutual information when a third variable is given:

E.3 KL BETWEEN GAUSSIANS

The Kullback-Leibler divergence (Kullback, 1959) between two gaussian distributions takes a particularly simple form: 

