AGREE TO DISAGREE: DIVERSITY THROUGH DIS-AGREEMENT FOR BETTER TRANSFERABILITY

Abstract

Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features (present in the training data but absent from the test data) and (ii) only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution, referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to assess a priori whether a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, and demonstrate in multiple experiments how the proposed method can mitigate shortcut learning, enhance uncertainty estimation and OOD detection, and improve transferability.

1. INTRODUCTION

While gradient-based learning algorithms such as Stochastic Gradient Descent (SGD), are nowadays ubiquitous in the training of Deep Neural Networks (DNNs), it is well known that the resulting models are (i) brittle when exposed to small distribution shifts (Beery et al., 2018; Sun et al., 2016; Amodei et al., 2016) , (ii) can easily be fooled by small adversarial perturbations (Szegedy et al., 2014) , (iii) tend to pick up spurious correlations (McCoy et al., 2019; Oakden-Rayner et al., 2020; Geirhos et al., 2020) -present in the training data but absent from the downstream task -, as well as (iv) fail to provide adequate uncertainty estimates (Kim et al., 2016; van Amersfoort et al., 2020; Liu et al., 2021b) . Recently those learning algorithms have been investigated for their implicit bias toward simplicity -known as Simplicity Bias (SB), seen as one of the reasons behind their superior generalization properties (Arpit et al., 2017; Dziugaite & Roy, 2017) . While for deep neural networks, simpler decision boundaries are often seen as less likely to overfit, Shah et al. (2020) , Pezeshki et al. (2021) demonstrated that the SB can still cause the aforementioned issues. In particular, they show how the SB can be extreme, compelling predictors to rely only on the simplest feature available, despite the presence of equally or even more predictive complex features. Its effect is greatly increased when we consider the more realistic out of distribution (OOD) setting (Ben-Tal et al., 2009) , in which the source and target distributions are different, known to be a challenging problem (Sagawa et al., 2020; Krueger et al., 2021) . The difference between the two domains can be categorized into either a distribution shift -e.g. a lack of samples in certain parts of the data manifold due to limitations of the data collection pipeline -, or as simply having completely different distributions. 
In the first case, the SB in its extreme form would increase the chances of learning to rely on spurious features, i.e. shortcuts that do not generalize to the target distribution. Classic manifestations of this in vision applications are when models learn to rely mostly on textures or backgrounds instead of more complex and likely more generalizable semantic features such as shapes (Beery et al., 2018; Ilyas et al., 2019; Geirhos et al., 2020). In the second instance, by relying only on the simplest feature, and being invariant to more complex ones, the SB would cause confident predictions (low uncertainty) on completely OOD samples, even if complex features are contradicting simpler ones. This brings us to our goal of deriving a method which can (i) learn more transferable features, better suited to generalize despite distribution shifts, and (ii) provide accurate uncertainty estimates also for OOD samples. We aim to achieve those two objectives through learning an ensemble of diverse predictors $(h_1, \dots, h_K)$, with $h : \mathcal{X} \to \mathcal{Y}$, and $K$ being the ensemble size.

Figure 1: [...] 2020). The two classes, red and blue, can easily be separated by a vertical decision boundary. Other ways to separate the two classes (with horizontal lines, for instance) are more complex, i.e. they require more hyperplanes. The simplicity bias will push models to systematically learn the simpler feature, as in the second column (b). Using D-BAT, we are able to learn the model in column (c), relying on a more complex decision boundary, effectively overcoming the simplicity bias. The ensemble $h_{ens}(x) = h_1(x) + h_2(x)$, in column (d), outputs a flat distribution at points where the two models disagree, effectively maximizing the uncertainty at those points. In this experiment the samples from $D_{ood}$ were obtained through computing adversarial perturbations, see App. D.2 for more details.
Suppose that our training data is drawn from the distribution D, and D ood is the distribution of OOD data on which we will be tested. Importantly, D and D ood may have non-overlapping support, and D ood is not known during training. Our proposed method, D-BAT (Diversity-By-disAgreement Training), relies on the following idea: Diverse hypotheses should agree on the source distribution D while disagreeing on the OOD distribution D ood . Intuitively, a set of hypotheses should agree on what is known i.e. on D, while formulating different interpretations of what is not known, i.e. on D ood . Even if each individual predictor might be wrongly confident on OOD samples, while predicting different outcomes -the resulting uncertainty of the ensemble on those samples will be increased. Disagreement on D ood can itself be enough to promote learning diverse representations of instances of D. In the context of object detection, if one model h 1 is relying on textures only, this model will generate predictions on D ood based on textures, when enforcing disagreement on D ood , a second model h 2 would be discouraged to use textures in order to disagree with h 1 -and consequently look for a different hypothesis to classify instances of D e.g. using shapes. This process is illustrated in Fig. 2 . A 2D direct application of our algorithm can be seen in Fig. 1 . Once trained, the ensemble can either be used by forming a weighted average of the probability distribution from each hypothesis, or-if given some labeled data from the downstream task-by selecting one particular hypothesis.

Contributions.

Our results can be summarized as:
• We introduce D-BAT, a simple yet efficient novel diversity-inducing regularizer which enables training ensembles of diverse predictors.
• We provide a proof, in a simplified setting, that D-BAT promotes diversity, encouraging the models to utilize different predictive features.
• We show on several datasets of varying complexity how the induced diversity can help to (i) tackle shortcut learning, and (ii) improve uncertainty estimation and transferability.

2. RELATED WORK

Diversity in ensembles. It is intuitive that in order to gain from ensembling several predictors $h_1, \dots, h_K$, those should be diverse. The bias-variance-covariance decomposition (Ueda & Nakano, 1996), which generalizes the bias-variance decomposition to ensembles, shows how the error decreases as the covariance between the members of the ensemble decreases. Despite its importance, there is still no well-accepted definition and understanding of diversity, and it is often derived from prediction errors of members of the ensemble (Zhou, 2012). This creates a conflict between trying to increase the accuracy of individual predictors $h$, and trying to increase diversity. In this view, creating a good ensemble is seen as striking a good balance between individual performance and diversity. To promote diversity in ensembles, a classic approach is to add stochasticity into the training by using different subsets of the training data for each predictor (Breiman, 1996), or using different data augmentation methods (Stickland & Murray, 2020). Another approach is to add orthogonality constraints on the predictors' gradients (Ross et al., 2020; Kariyappa & Qureshi, 2019). Recently, the information bottleneck (Tishby et al., 2000) has been used to promote ensemble diversity (Ramé & Cord, 2021; Sinha et al., 2021). Unlike the aforementioned methods, D-BAT can be trained on the full dataset; importantly, it does not set constraints on the output for in-distribution samples, but on a separate OOD distribution. Moreover, as opposed to Sinha et al. (2021), our individual predictors do not share the same encoder.

Simplicity bias. While the simplicity bias, by promoting simpler decision boundaries, can act as an implicit regularizer and improve generalization (Arpit et al., 2017; Gunasekar et al., 2018), it also contributes to the brittleness of gradient-based machine learning (Shah et al., 2020). Recently, Teney et al.
(2021) proposed to evade the simplicity bias by adding gradient orthogonality constraints, not at the output level, but at an intermediary hidden representation obtained after a shared and fixed encoder. While their results are promising, the reliance on a pre-trained encoder limits the type of features that can be used to the set of features extracted by the encoder. In particular, if a feature was already discarded by the encoder due to SB, it is effectively lost. In contrast, our method does not rely on a pre-trained encoder, and requires a comparatively small ensemble size to counter the simplicity bias. A more detailed comparison with D-BAT is provided in App. F.1.

Shortcut learning. The failures of DNNs across application domains due to shortcut learning have been documented extensively by Geirhos et al. (2020). They introduce a taxonomy of predictors distinguishing between (i) predictors which can be learnt by the training algorithm, (ii) predictors performing well on in-distribution training data, (iii) predictors performing well on in-distribution test data, and finally (iv) predictors performing well on in-distribution and OOD test data. The last category contains the intended solutions. In our experiments, by learning diverse predictors, D-BAT increases the chance of finding one solution generalizing to both in-distribution and out-of-distribution test data, see § 4.1 for more details.

OOD generalization. Generalizing to distributions not seen during training is accomplished by two approaches: robust training, and invariant learning. In the former, the test distribution is assumed to be within a set of known plausible distributions (say $U$). Then, robust training minimizes the loss over the worst possible distribution in $U$ (Ben-Tal et al., 2009). Numerous approaches exist for defining the set $U$; see the survey by Rahimian & Mehrotra (2019). Most recently, Sagawa et al.
(2020) model the set of plausible domains as the convex hull over predefined subgroups of datapoints and Krueger et al. (2021) extend this by taking affine combinations beyond the convex hull. Our approach also borrows from this philosophy -when we do not know the labels of the OOD data, we assume the worst case and try predict as diverse labels as possible. This is similar to the notion of discrepancy introduced in domain adaptation theory (Mansour et al., 2009; Cortes & Mohri, 2011; Cortes et al., 2019) . A different line of work defines a set of environments and asks that our outputs be 'invariant' (i.e. indistinguishable) among the different environments (Bengio et al., 2013; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020) . When only a single training environment is present, like in our setting, this is akin to adversarial domain adaptation. Here, the data of one domain is modified to be indistinguishable to the other (Ganin et al., 2016; Long et al., 2017) . However, this approach is fundamentally limited. E.g. in Fig. 2 a model which classifies both the crane and the porcupine as a crane is invariant, but incorrect. Furthermore, it is worth noting that prior work in OOD generalization are often considering datasets where the spurious feature is not fully predictive in the training distribution (Zhang et al., 2021; Saito et al., 2017; 2018; Nam et al., 2020; Liu et al., 2021a) , and fail in our challenging settings of § 4.1 (see App. F for more in-depth comparisons). Lastly, parallel to our work, Lee et al. (2022) adopt a similar approach and improve OOD generalization by minimizing the mutual information on unlabeled target data between pairs of predictors. However, their work does not investigate uncertainty estimation and is not motivated by domain adaptation theory as ours is (Mansour et al., 2009) , see App. F.7 for a more in-depth comparison. Uncertainty estimation. 
DNNs are notoriously unable to provide reliable confidence estimates, which is impeding the progress of the field in safety-critical domains (Begoli et al., 2019), as well as hurting model interpretability (Kim et al., 2016). To improve the confidence estimates of DNNs, Gal & Ghahramani (2016) [...] Liu et al., 2021b). We show in our experiments how D-BAT can help to associate high uncertainty to those samples by maximizing the disagreement outside of D (see § 4.2, as well as Fig. 1).

3. DIVERSITY THROUGH DISAGREEMENT

3.1. MOTIVATING D-BAT

Figure 3: If $h_1$ is computed by minimizing the training loss on $D$, its loss on the OOD task $D_{ood}$ may be very large, i.e. $h_1$ may be very far from the optimal OOD model $h_{ood}$ as measured by $L_{D_{ood}}(h_1, h_{ood})$ (left). To mitigate this, we propose to learn a diverse ensemble $\{h_1, \dots, h_4\}$ which is maximally 'spread-out' (with distance measured using $L_{D_{ood}}(\cdot, \cdot)$) and covers the entire space of possible solutions $\mathcal{H}^\star_t$. This minimizes the distance between the unknown $h_{ood}$ and our learned ensemble, ensuring we learn transferable features with good performance on $D_{ood}$.

We will first define some notation and explain why standard training fails for OOD generalization. Then, we introduce the concept of discrepancy which will motivate our D-BAT algorithm.

Setup. Let us formally define the OOD problem. $\mathcal{X}$ is the input space and $\mathcal{Y}$ the output space. We define a domain as a pair of a distribution over $\mathcal{X}$ and a labeling function $h : \mathcal{X} \to \mathcal{Y}$. Given any distribution $D$ over $\mathcal{X}$, two labeling functions $h_1$ and $h_2$, and a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, we define the expected loss as the expectation

$$L_D(h_1, h_2) = \mathbb{E}_{x \sim D}\left[L(h_1(x), h_2(x))\right].$$

Now, suppose that the training data is drawn from the distribution $(D_t, h_t)$, but we will be tested on a different distribution $(D_{ood}, h_{ood})$. While the labelling function $h_{ood}$ is unknown, we assume that we have access to unlabelled samples from $D_{ood}$. Finally, let $\mathcal{H}$ be the set of all labelling functions, i.e. the set of all possible prediction models, and further define $\mathcal{H}^\star_t$ and $\mathcal{H}^\star_{ood}$ to be the optimal labelling functions on the train and the OOD domains:

$$\mathcal{H}^\star_t := \arg\min_{h \in \mathcal{H}} L_{D_t}(h, h_t), \qquad \mathcal{H}^\star_{ood} := \arg\min_{h \in \mathcal{H}} L_{D_{ood}}(h, h_{ood}).$$

We assume that there exists an ideal transferable function $h^\star \in \mathcal{H}^\star_t \cap \mathcal{H}^\star_{ood}$. This assumption captures the reality that the training task and the OOD testing task are closely related to each other. Otherwise, we would not expect any OOD generalization.

Beyond standard training. Just using the training data, standard training would train a model $h_{ERM} \in \mathcal{H}^\star_t$. However, as we discussed in the introduction, if we use gradient descent to find the ERM solution, then $h_{ERM}$ will likely be the simplest model, i.e. it will likely pick up spurious correlations in $D_t$ which are not present in $D_{ood}$. Thus, the error on OOD data might be very high:

$$L_{D_{ood}}(h_{ERM}, h_{ood}) \le \max_{h \in \mathcal{H}^\star_t} L_{D_{ood}}(h, h_{ood}).$$

Instead, we would ideally like to minimize the right hand side in order to find $h^\star$. The main difficulty is that we do not have access to the OOD labels $h_{ood}$. So we can instead use the following proxy:

$$L_{D_{ood}}(h_1, h_{ood}) = \max_{h_2 \in \mathcal{H}^\star_t \cap \mathcal{H}^\star_{ood}} L_{D_{ood}}(h_1, h_2) \le \max_{h_2 \in \mathcal{H}^\star_t} L_{D_{ood}}(h_1, h_2).$$

In the above we used the two following facts: (i) that $\forall h_2 \in \mathcal{H}^\star_{ood}$, $L_{D_{ood}}(h_1, h_{ood}) = L_{D_{ood}}(h_1, h_2)$, and (ii) that $\mathcal{H}^\star_t \cap \mathcal{H}^\star_{ood}$ is non-empty. Recall that $\mathcal{H}^\star_t = \arg\min_{h \in \mathcal{H}} L_{D_t}(h, h_t)$. So, in order to minimize the upper bound, we want to pick $h_2$ to minimize risk on our training data (i.e. to belong to $\mathcal{H}^\star_t$), but otherwise maximally disagree with $h_1$ on the OOD data. That way we minimize the worst-case expected loss $\min_{h \in \{h_1, h_2\}} \max_{h' \in \mathcal{H}^\star_t} L_{D_{ood}}(h, h')$; this process is illustrated in Fig. 3. The latter is closely related to the concept of discrepancy in domain adaptation (Mansour et al., 2009; Cortes et al., 2019). However, the main difference between the definitions is that we restrict the maximum to the set $\mathcal{H}^\star_t$, whereas the standard notions use an unrestricted maximum. Thus, our version is tighter when the train and OOD tasks are closely related.

Deriving D-BAT. We make two final changes to the discrepancy term above to derive D-BAT. First, if $L_D(h_1, h_2)$ is a loss function which quantifies disagreement, then suppose we have another loss function $A_D(h_1, h_2)$ which quantifies agreement. Then, we can minimize agreement instead of maximizing disagreement:

$$\arg\min_{h_2 \in \mathcal{H}^\star_t} A_D(h_1, h_2) = \arg\max_{h_2 \in \mathcal{H}^\star_t} L_D(h_1, h_2).$$
Secondly, we relax the constrained formulation $h_2 \in \mathcal{H}^\star_t$ by adding a penalty term with weight $\alpha$:

$$h_{\text{D-BAT}} \in \arg\min_{h_2 \in \mathcal{H}} \underbrace{L_{D_t}(h_2, h_t)}_{\text{fit train data}} + \alpha \underbrace{A_{D_{ood}}(h_1, h_2)}_{\text{disagree on OOD}}.$$

The above is the core of our D-BAT procedure: given a first model $h_1$, we train a second model $h_2$ to fit the training data $D$ while disagreeing with $h_1$ on $D_{ood}$. Thus, we have

$$L_{D_{ood}}(h_1, h_{ood}) \le \max_{h_2 \in \mathcal{H}^\star_t} L_{D_{ood}}(h_1, h_2) \approx L_{D_{ood}}(h_1, h_{\text{D-BAT}}),$$

implying that D-BAT gives us a good proxy for the unknown OOD loss, and can be used for uncertainty estimation. Following a similar argument for $h_1$, we arrive at the following training procedure:

$$\min_{h_1, h_2} \frac{1}{2}\left(L_{D_t}(h_1, h_t) + L_{D_t}(h_2, h_t)\right) + \alpha A_{D_{ood}}(h_1, h_2).$$

However, we found the training dynamics for simultaneously learning $h_1$ and $h_2$ to be unstable. Hence, we propose a sequential variant which we describe next.

3.2. ALGORITHM DESCRIPTION

Binary classification formulation. Concretely, given a binary classification task with $\mathcal{Y} = \{0, 1\}$, we train two models sequentially. The training of the first model $h_1$ is done in a classical way, minimizing its empirical classification loss $L(h_1(x), y)$ over samples $(x, y)$ from $\hat{D}$. Once $h_1$ is trained, we train the second model $h_2$, adding a term $A_{\tilde{x}}(h_1, h_2)$ representing the agreement on samples $\tilde{x}$ of $\hat{D}_{ood}$, with some weight $\alpha \ge 0$:

$$h_2^\star \in \arg\min_{h_2 \in \mathcal{H}} \frac{1}{N}\Big(\sum_{(x,y) \in \hat{D}} L(h_2(x), y) + \alpha \sum_{\tilde{x} \in \hat{D}_{ood}} A_{\tilde{x}}(h_1, h_2)\Big)$$

Given $p^{(y)}_{h,x}$ the probability of class $y$ predicted by $h$ given $x$, the agreement $A_{\tilde{x}}(h_1, h_2)$ is defined as:

$$A_{\tilde{x}}(h_1, h_2) = -\log\left(p^{(0)}_{h_1,\tilde{x}} \cdot p^{(1)}_{h_2,\tilde{x}} + p^{(1)}_{h_1,\tilde{x}} \cdot p^{(0)}_{h_2,\tilde{x}}\right) \qquad \text{(AG)}$$

In the above formula, the term inside the log can be derived from the expected loss when $L$ is the 0-1 loss and $h_1$, $h_2$ are independent. See App. B for more details.

Multi-class classification formulation. The previous formulation requires a distribution over two labels in order to compute the agreement term (AG). We extend the agreement term $A_{\tilde{x}}(h_1, h_2)$ to the multi-class setting by binarizing the softmax distributions $h_1(\tilde{x})$ and $h_2(\tilde{x})$. A simple way to do this is to take as positive class the predicted class of $h_1$: $\tilde{y} = \arg\max(h_1(\tilde{x}))$ with associated probability $p^{(\tilde{y})}_{h_1,\tilde{x}}$, while grouping the remaining complementary class probabilities in a negative class $\neg\tilde{y}$. We would then have $p^{(\neg\tilde{y})}_{h_1,\tilde{x}} = 1 - p^{(\tilde{y})}_{h_1,\tilde{x}}$. We can then use the same bins to binarize the softmax distribution of the second model $h_2(\tilde{x})$. Another similarly sound approach would be to do the opposite and use the predicted class of $h_2$ instead of $h_1$. In our experiments both approaches performed well. In Alg. 2 we show the second approach, which is a bit more computationally efficient in the case of ensembles of more than 2 predictors, as the binarization bins are built only once, instead of building them for each pair $(h_i, h_m)$ for $0 \le i < m$.
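The agreement term and its multi-class binarization can be illustrated with a minimal sketch in plain Python. The function names, the `pivot` argument, the small constant `eps` added for numerical stability, and the normalization of the combined objective by the number of training samples are assumptions made for illustration; they are not taken from the paper's source code.

```python
import math

def binary_agreement(p1, p2, eps=1e-12):
    """Agreement term for binary tasks.

    p1 and p2 are the probabilities of class 1 predicted by h1 and h2 on
    the same OOD sample. The value inside the log is the probability that
    the two models output different labels.
    """
    disagree = (1.0 - p1) * p2 + p1 * (1.0 - p2)
    return -math.log(disagree + eps)  # eps is an assumption, avoids log(0)

def multiclass_agreement(probs1, probs2, pivot="h1", eps=1e-12):
    """Agreement term for softmax outputs, binarized around a pivot class.

    The positive class is the predicted class of h1 (pivot="h1") or of
    h2 (pivot="h2"); all other classes are merged into a negative class.
    """
    reference = probs1 if pivot == "h1" else probs2
    y = max(range(len(reference)), key=reference.__getitem__)
    p1_pos, p2_pos = probs1[y], probs2[y]
    disagree = p1_pos * (1.0 - p2_pos) + (1.0 - p1_pos) * p2_pos
    return -math.log(disagree + eps)

def second_model_objective(train_losses, agreements, alpha):
    """Training loss of h2 plus alpha-weighted agreement on OOD samples.

    Assumption: both sums are divided by the number of training samples.
    """
    n = len(train_losses)
    return (sum(train_losses) + alpha * sum(agreements)) / n

if __name__ == "__main__":
    # Two models that are confident in the same class agree strongly.
    print(binary_agreement(0.9, 0.9))
    # Two models that are confident in different classes barely agree.
    print(binary_agreement(0.9, 0.1))
    print(multiclass_agreement([0.7, 0.2, 0.1], [0.1, 0.8, 0.1]))
```

Minimizing this quantity on OOD samples pushes the second model to place its probability mass on a different label than the first one, while the classification loss keeps both models accurate on the training data.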

3.3. LEARNING DIVERSE FEATURES

It is possible, under some simplifying assumptions, to rigorously prove that minimizing $L_{\text{D-BAT}}$ results in learning predictors which use diverse features. We introduce the following theorem: [...] The proof is provided in App. C. It crucially relies on the fact that $D_{ood}$ has positive weight on data points which only contain the alternative feature $s$, or only contain the feature $c$. Thus, as long as $D_{ood}$ is supported on a diverse enough dataset with features present in different combinations, we can expect D-BAT to learn models which utilize a variety of such features.

4. EXPERIMENTS

We conduct two main types of experiments, (i) we evaluate how D-BAT can mitigate shortcut learning, bypassing simplicity bias, and generalize to OOD distributions, and (ii) we test the uncertainty estimation and OOD detection capabilities of D-BAT models.

4.1. OOD GENERALIZATION AND AVOIDING SHORTCUTS

We estimate our method's ability to avoid spurious correlations and learn more transferable features on 6 different datasets. In this setup, we use a labelled training dataset D which might have a lot of highly correlated spurious features, and an unlabelled perturbation dataset D ood . We then test the performance of the learnt model on a test dataset. This test dataset may be drawn from the same distribution as D ood (which tests how well D-BAT avoids spurious features), or from a completely different distribution from D ood (which tests if D-BAT generalizes to new domains). We compare D-BAT against ERM, both when used to obtain a single model and an ensemble. Our results are summarized in Tab. 1. For each dataset, we report both the best-model accuracy and, when applicable, the best-ensemble accuracy. All experiments in Tab. 1 are with an ensemble of size 2. Among the two models of the ensemble, the best model is selected according to its validation accuracy. We show results for a larger ensemble size of 5 in Fig. 4. Finally, in Fig. 4.c (right) we compare the performance of D-BAT against numerous other baseline methods. See Appendix D for additional details on the setup as well as numerous other results.

Table 1: Test accuracies on the six datasets described in § 4.1. For each dataset, we compare single model and ensemble test accuracies for D-BAT and ERM. In the left column group we consider the scenario where D ood is also our test distribution (we can imagine we have access to unlabeled data from the test distribution). In the right column group we consider D ood and our test distribution to be different, e.g. belonging to different domains. See § 4.1 for more details and a summary of our findings. In bold are the best scores along with any score within standard deviation reach. For datasets with completely spurious correlations, as we know ERM models would fail to learn anything generalizable, we are not interested in using them in an ensemble, hence the missing values for those datasets.

            | Dood = test data (unlabelled)                     | Dood ≠ test data
            | Single Model            | Ensemble                | Single Model            | Ensemble
Dataset D   | ERM        | D-BAT      | ERM        | D-BAT      | ERM        | D-BAT      | ERM        | D-BAT
C-MNIST     | 12.3 ± 0.7 | 90.2 ± 3.7 | -          | -          | 27.1 ± 2.8 | 90.1 ± 1.9 | -          | -
M/F-D       | 52.9 ± 0.1 | 94.8 ± 0.3 | -          | -          | 52.9 ± 0.1 | 89.0 ± 0.6 | -          | -
M/C-D       | 50.0 ± 0.0 | 73.3 ± 1.2 | -          | -          | 50.0 ± 0.0 | 58.0 ± 0.6 | -          | -
Waterbirds  | 86.0 ± 0.5 | 88.7 ± 0.2 | 85.8 ± 0.4 | 87.5 ± 0.0 | -          | -          | -          | -
Office-Home | 50.4 ± 1.0 | 51.1 ± 0.7 | 52.0 ± 0.5 | 52.7 ± 0.2 | 51.7 ± 0.6 | 51.7 ± 0.3 | 53.9 ± 0.4 | 54.5 ± 0.5
Camelyon17  | 80.3 ± 0.4 | 93.1 ± 0.3 | 80.9 ± 1.5 | 91.9 ± 0.4 | 80.3 ± 0.4 | 88.8 ± 1.4 | 80.9 ± 1.5 | 85.9 ± 0.9

Training data (D). We consider two kinds of training data: synthetic datasets with completely spurious correlation, and more realistic datasets over which we do not have any control and which may naturally contain some spurious features. We use the former to have a controlled setup, and the latter to judge our performance in the real world.

Datasets with completely spurious correlation:

To know whether we learn a shortcut, and estimate our method's ability to overcome the SB, we design three datasets of varying complexity with a known shortcut, in a similar fashion to Teney et al. (2021). The Colored-MNIST, or C-MNIST for short, consists of MNIST (Lecun & Cortes, 1998) images for which the color and the shape of the digits are equally predictive, i.e. all the 1s are pink, all the 5s are orange, etc. The color being simpler to learn than the shape, the simplicity bias will result in models trained on this dataset relying solely on the color information while being invariant to the shape information. This dataset is a multiclass dataset with 10 classes. The test distribution consists of images where the label is carried by the shape of the digit and the color is random. Following a similar idea, we build the M/F-Dominoes (M/F-D) dataset by concatenating MNIST images of 0s and 1s with Fashion-MNIST (Xiao et al., 2017) images of coats and dresses. The source distribution consists of images where the MNIST and F-MNIST parts are equally predictive of the label. In the test distribution, the label is carried by the F-MNIST part and the MNIST part is a 0 or 1 MNIST image picked at random. The M/C-Dominoes (M/C-D) dataset is built in the same way, concatenating MNIST digits 0s and 1s with CIFAR-10 (Krizhevsky, 2009) images of cars and trucks. See App. E for samples from those datasets.

Natural datasets: To test our method in this more general case we run further experiments on three well-known domain adaptation datasets. We use the Waterbirds (Sagawa et al., 2020) and Camelyon17 (Bandi et al., 2018) [...]

Results and discussion.
• D-BAT can tackle extreme spurious correlations. This is unlike prior methods from domain adaptation (Zhang et al., 2021; Saito et al., 2017; 2018; Nam et al., 2020; Liu et al., 2021a) which all fail when the spurious feature is completely correlated with the label, see App.
F for an extended discussion and comparison in which we show those methods cannot improve upon ERM in that scenario. First we look at results without D-BAT for the C-MNIST, M/F-D and M/C-D datasets in Tab. 1. Looking at the ERM column, we observe how the test accuracies are near random guessing. This is a verification that without D-BAT, due to the simplicity bias, only the simplest feature is leveraged to predict the label and the models fail to generalize to domains for which the simple feature is spurious. D-BAT however, is effectively promoting models to use diverse features. This is demonstrated by the test accuracies of the best D-BAT model being much higher than of ERM. • D-BAT improves generalization to new domains. In Tab. 1, in the case D ood ̸ = test data, we observe that despite differences between D ood and the test distribution (e.g. the target distribution for M/C-D is using CIFAR-10 images of cars and trucks whereas D ood uses images of frogs, cats, etc. but no cars or trucks), D-BAT is still able to increase the generalization to the test domain. • Improved generalization on natural datasets. We observe a significant improvement in test accuracy for all our natural datasets. While the improvement is limited for the Office home dataset when considering a single model, we observe D-BAT ensembles nonetheless outperform ERM ensembles. The improvement is especially evident on the Camelyon17 dataset where D-BAT outperforms many known methods as seen in Fig. 4 .c. • Ensembles built using D-BAT generalize better. In Fig. 4 we observe how D-BAT ensembles trained on the Waterbirds and Office-Home datasets generalize better.

4.2. BETTER UNCERTAINTY & OOD DETECTION

MNIST setup. We run two experiments to investigate D-BAT's ability to provide good uncertainty estimates. The first one is similar to the MNIST experiment in Liu et al. (2021b): it consists of learning to differentiate MNIST digits 0s from 1s. The uncertainty of the model, computed as the entropy, is then estimated for fake interpolated images of the form $t \cdot \mathbf{1} + (1 - t) \cdot \mathbf{0}$ for $t \in [-1, 2]$. An ideal model would assign (i) low uncertainty values for $t$ near 0 and 1, corresponding to in-distribution samples, and (ii) high uncertainty values elsewhere. Liu et al. (2021b) showed how only Gaussian Processes are able to fulfill those two conditions, most models failing to attribute high uncertainty away from the decision boundary (as can also be seen in Fig. 1 when looking at individual models). We train ensembles of size 2 and average over 20 seeds. For D-BAT, we use as D ood the remaining (OOD) digits 2 to 9, along with some random cropping. We use a LeNet.

MNIST results. Results in Fig. 5 suggest that D-BAT is able to give reliable uncertainty estimates for OOD datapoints, even when those samples are away from the decision boundary. This is in sharp contrast with deep ensembles, which only model uncertainty near the decision boundary.

CIFAR-10 setup. We train ensembles of 4 models and benchmark three different methods in their ability to identify what they do not know. For this we look at the histograms of the probability of their predicted classes on OOD samples. As training set we use the CIFAR-10 classes {0, 1, 2, 3, 4}. We use the CIFAR-100 (Krizhevsky, 2009) test set as OOD samples to compute the histograms. For D-BAT we use the remaining CIFAR-10 classes, {5, 6, 7, 8, 9}, as D ood , and set α to 0.2. Histograms are averaged over 5 seeds. The three methods considered are simple deep ensembles (Lakshminarayanan et al., 2017), MC-Dropout models (Gal & Ghahramani, 2016), and D-BAT ensembles.
For the three methods we use a modified ResNet-18 (He et al., 2016) with added dropout to accommodate MC-Dropout, we use a dropout probability of 0.2 for the three methods. For MC-Dropout, we compute uncertainty estimates sampling 20 distributions. CIFAR-10 results. In Fig. 6 , we observe for both deep ensembles and MC-Dropout a large amount of predicted probabilities larger than 0.9, which indicate those methods are overly confident on OOD data. In contrast, most of the predicted probabilities of D-BAT ensembles are smaller than 0.7. The average ensemble accuracies for all those methods are 92% for deep ensembles, 91.2% for D-BAT ensembles, and 90.4% for MC-Dropout.
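The entropy-based uncertainty used in the MNIST experiment, and the way disagreement between ensemble members raises it, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the ensemble is combined by a (weighted) average of the members' probability distributions, and the toy probability vectors stand in for real model outputs.

```python
import math

def interpolate(img_one, img_zero, t):
    """Pixel-wise interpolation t * img_one + (1 - t) * img_zero.

    Values of t outside [0, 1] extrapolate away from both images.
    """
    return [t * a + (1.0 - t) * b for a, b in zip(img_one, img_zero)]

def ensemble_average(distributions, weights=None):
    """Weighted average of the members' predicted class distributions."""
    k = len(distributions)
    if weights is None:
        weights = [1.0 / k] * k  # assumption: uniform weights by default
    n_classes = len(distributions[0])
    return [
        sum(w * dist[c] for w, dist in zip(weights, distributions))
        for c in range(n_classes)
    ]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

if __name__ == "__main__":
    # Both members are confident and agree: the ensemble stays confident.
    agree = ensemble_average([[0.99, 0.01], [0.99, 0.01]])
    # Both members are confident but disagree: the average is flat,
    # so the ensemble entropy is close to its maximum log(2).
    disagree = ensemble_average([[0.99, 0.01], [0.01, 0.99]])
    print(entropy(agree), entropy(disagree))
    # Extrapolated "fake" image for t = 2 on a two-pixel toy example.
    print(interpolate([1.0, 0.0], [0.0, 1.0], 2.0))
```

Each individual member in the second case is wrongly confident, yet the averaged distribution is flat, which is the mechanism by which disagreement on OOD samples translates into high ensemble uncertainty.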

5. LIMITATIONS

Is the simplicity bias gone? While we showed in § 4.1 that our approach can clearly mitigate shortcut learning, a bad choice of D ood distribution can introduce an additional shortcut. In essence, our approach fails to promote diverse representations when differentiating D from D ood is easier than learning to utilize diverse features. Furthermore, we want to stress that learning complex features is not necessarily unilaterally better than learning simple features, and is not our goal. Complex features are better only so far as they can better explain both the train distribution and OOD data. With our approach, we aim to get a diverse yet simple set of hypotheses. Intuitively, D-BAT tries to find the best hypothesis which may be somewhere within the top-k simplest hypotheses, and not necessarily the simplest one which the simplicity bias is pushing us towards.

6. CONCLUSION

Training deep neural networks often results in the models learning to rely on shortcuts present in the training data but absent from the test data. In this work we introduced D-BAT, a novel training method to promote diversity in ensembles of predictors. By encouraging disagreement on OOD data, while agreeing on the training data, we effectively (i) give strong incentives to our predictors to rely on diverse features, (ii) which enhance the transferability of the ensemble and (iii) improve uncertainty estimation and OOD detection. Future directions include improving the selection of samples of the OOD distribution and develop stronger theory. D-BAT could also find applications beyond OOD generalization-e.g. ( Ţifrea et al., 2021) recently used disagreement for anomaly/novelty detection or to test for biases in our trained models (Stanczak & Augenstein, 2021) . Figure 7 : Simultaneous D-BAT training: two models trained simultaneously using D-BAT on our 2D toy task (see Fig. 1 ). We observe how we do not recover the ERM solution. The two obtained models are diverse but seemingly more complex (e.g. in terms of their boundary decision) than models trained sequentially as in Fig. 1 . C PROOF OF THM.3.1 We redefine here the setup for clarity: • Given a joint source distribution D of triplets of random variables (C, S, Y ) taking values in {0, 1} 3 . • Assuming D has the following pmf: While at the same time agreeing on the source distribution D: P D (C = c, S = s, Y = y) = 1/2 if c = s = y, P (c,s)∼D P1 (Y |c, s) = P2 (Y |c, s)) = 1 The expectation in eq.1 becomes: ( 1) = 1 2 -log( P2 (Y = 0|C = 1, S = 0)) -log( P2 (Y = 1|C = 0, S = 1)) Which is minimized for P2 (Y = 1|C = 0, S = 1) = P2 (Y = 0|C = 1, S = 0) = 1. Which means the posterior of the second model, according to our disagreement constrain, will be: et al., 1998) : P2 (Y = 1 | C = c, S = s) = • For the C-MNIST dataset, we used a standard LeNet, with 3 input channels instead of 1. 
• For the MF-Dominoes dataset, we increase the input dimension of the first fully-connected layer to 960.
• For the MC-Dominoes dataset, we use 3 input channels, increase the number of output channels of the first convolution to 32, and of the second one to 56. We modify the fully-connected layers to be 2016 → 512 → 256 → c, with c the number of classes.

In those experiments, for both cases D ood = D test and D ood ≠ D test, the test and validation distributions are distributions in which the spurious feature is random, e.g. random color for C-MNIST and random 0 or 1 on the top part for MF-Dominoes and MC-Dominoes. We use the AdamW optimizer (Loshchilov & Hutter, 2019) for all our experiments. For all the datasets in this section, we only train ensembles of 2 models, which we denote M 1 and M 2. When building the OOD datasets, we make sure the images used are not shared with the images used to build the training, test and validation sets. Our results are obtained by averaging over 5 seeds. For further details on the implementation, we invite the reader to check the source code, see § A.

D.2 IMPLEMENTATION DETAILS FOR FIG.1

Instead of relying on an external OOD distribution set, it is also possible to find, given some datapoint x, a perturbation δ⋆ by directly minimizing the agreement in some neighborhood of x (i.e. for ∥δ⋆∥ ≤ ϵ):

$$\delta^\star \in \arg\min_{\delta \;\text{s.t.}\; \|\delta\| \le \epsilon} \; -\log\Big( p^{(0)}_{h_1,(x+\delta)} \cdot p^{(1)}_{h_2,(x+\delta)} + p^{(1)}_{h_1,(x+\delta)} \cdot p^{(0)}_{h_2,(x+\delta)} \Big)$$

This problem can be solved using several projected gradient descent steps, as is typically done in the adversarial training literature. While this approach works for the 2D example, it does not work for complex high-dimensional input spaces combined with deep networks: those are notorious for their sensitivity to very small ℓ_p-bounded perturbations, so it would most of the time be easy to find a bounded perturbation maximizing the disagreement.
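The projected gradient descent procedure described above can be sketched as follows. This is only an illustration under our own assumptions, not the code used for Fig. 1: the two models are hypothetical linear-logistic classifiers on 2D inputs, the neighborhood is an ℓ2 ball, and the names `find_perturbation`, `agreement_loss` and `agreement_grad` are ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob1(model, x):
    """P(y = 1 | x) for a linear-logistic model given as (weights, bias)."""
    w, b = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def agreement_loss(h1, h2, x):
    """-log of the probability that h1 and h2 predict different classes at x."""
    p1, p2 = prob1(h1, x), prob1(h2, x)
    return -math.log(p1 * (1.0 - p2) + (1.0 - p1) * p2)

def agreement_grad(h1, h2, x):
    """Analytic gradient of agreement_loss with respect to the input x."""
    p1, p2 = prob1(h1, x), prob1(h2, x)
    q = p1 * (1.0 - p2) + (1.0 - p1) * p2
    grad = []
    for i in range(2):
        dp1 = p1 * (1.0 - p1) * h1[0][i]
        dp2 = p2 * (1.0 - p2) * h2[0][i]
        dq = dp1 * (1.0 - 2.0 * p2) + dp2 * (1.0 - 2.0 * p1)
        grad.append(-dq / q)
    return grad

def find_perturbation(h1, h2, x, eps=0.5, lr=0.1, steps=50):
    """Projected gradient descent on the agreement loss inside an l2 ball."""
    delta = [0.0, 0.0]
    for _ in range(steps):
        point = [x[0] + delta[0], x[1] + delta[1]]
        g = agreement_grad(h1, h2, point)
        delta = [d - lr * gi for d, gi in zip(delta, g)]
        norm = math.hypot(delta[0], delta[1])
        if norm > eps:  # project back onto the l2 ball of radius eps
            delta = [d * eps / norm for d in delta]
    return delta

h1 = ((3.0, 0.0), 0.0)  # hypothetical model with a vertical boundary
h2 = ((0.0, 3.0), 0.0)  # hypothetical model with a horizontal boundary
x = (1.0, 0.2)
delta = find_perturbation(h1, h2, x)
x_adv = (x[0] + delta[0], x[1] + delta[1])
```

The point `x_adv` then plays the role of an OOD sample on which the two models are pushed to disagree; the step size, radius and number of steps are arbitrary choices for this toy setting.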

D.3 STANDARD DEVIATIONS FOR MNIST UNCERTAINTY EXPERIMENTS

For clarity we omitted the standard deviations in Fig. 5. In Fig. 8 we show each individual curve with its associated standard deviation.

D.4 IMPLEMENTATION DETAILS FOR THE CAMELYON17 EXPERIMENTS

We use a ResNet-50 (He et al., 2016) as model. We train for 60 epochs with a fixed learning rate of 0.001 and SGD as optimizer. We use an ℓ2 penalty term of 0.0001 and a momentum term β = 0.9. For D-BAT, we tune α ∈ {10^-1, 10^-2, 10^-3, 10^-4, 10^-5, 10^-6} and found α = 10^-6 to be best. For each set of hyperparameters, we train a deep-ensemble and a D-BAT ensemble of size 2, and select the parameters associated with the highest averaged validation accuracy over the two predictors of the ensemble. Our results are obtained by averaging over 3 seeds. In Fig. 9, we plot the evolution of the test accuracy as a function of α for both setups discussed in § 4.1. In the first "ideal" setup we have access to unlabeled target data to use as D ood. In the second setup we do not; instead we use samples from different hospitals. In the case of the Camelyon dataset, we use the available unlabeled validation data. Despite this data belonging to a different domain, we still get a significant improvement in test accuracy.

D.5 IMPLEMENTATION DETAILS FOR THE WATERBIRDS EXPERIMENTS

The Waterbirds dataset is built by combining images of birds with either a water or land background. It contains four categories:
• Waterbirds on water

F ADDITIONAL DISCUSSIONS AND EXPERIMENTS

When two features are equally predictive but have different complexities, the more complex feature will be discarded due to the extreme simplicity bias. This happens despite the uncertainty over the potential spuriousness of the simpler feature. For this reason it is important to be able to learn both features if we hope to improve our chances at OOD generalization. Recent methods such as Saito et al. (2017), Saito et al. (2018), Zhang et al. (2021), Nam et al. (2020) and Liu et al. (2021a) all fail in this challenging scenario; we explain why in the following subsections F.1 to F.6. In F.7, we add a comparison between D-BAT and the concurrent work of Lee et al. (2022).

F.1 COMPARISON WITH TENEY ET AL. (2021)

In their work, Teney et al. (2021) add a regularisation term $\delta_{g_{\varphi_1}, g_{\varphi_2}}$ which, given an input x, a hidden representation h = f_θ(x) produced by an encoder f_θ with parameters θ, and a pair of classifiers $g_{\varphi_1}$ and $g_{\varphi_2}$ with parameters φ_1 and φ_2 respectively, promotes orthogonality of the classifiers' gradients with respect to h:

$$\delta_{g_{\varphi_1}, g_{\varphi_2}} = \nabla_h g^\star_{\varphi_1}(x) \cdot \nabla_h g^\star_{\varphi_2}(x) \tag{T}$$

where $\nabla_h g^\star$ is the gradient of the classifier's top predicted score. We implemented the objective of Teney et al. (2021) with two different encoders: f_θ(x) = x (identity) and a two-layer CNN. We tested it on our MM-Dominoes dataset (see App. E). The classification heads are trained simultaneously. Considering two classification heads, we find two sets of hyperparameters, one that gives the best compromise between accuracy and randomized-accuracy, and one that keeps the accuracy close to 1. In the first setup in Fig. 15, we observe that none of the pairs of models trained with equation T as regularizer are particularly good at capturing any of the two features in the data. In contrast, D-BAT (with D(1)_ood) is able to learn a second model having both high accuracy and high randomized-accuracy, hence capturing, together with the first model, the two data modalities. For the second set of hyperparameters in Fig. 16, we observe that the improvement in randomized accuracy is only marginal if we do not want to sacrifice accuracy. We believe those results are explained by the many ways gradients of a neural network can be orthogonal while still encoding identical information. Better results might require training more classification heads (up to 96 heads are used in Teney et al. (2021)).
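The last point, that orthogonal gradients do not imply different information, can be illustrated with a deliberately simple construction. The sketch below is our own toy example and not the implementation of Teney et al. (2021): the heads are linear (so the gradient of the top score with respect to h is just the corresponding weight row), and the representation is a hypothetical one in which the same feature is stored in two coordinates.

```python
def scores(W, h):
    """Class scores of a linear head g(h) = W h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def predict(W, h):
    s = scores(W, h)
    return max(range(len(s)), key=s.__getitem__)

def top_score_gradient(W, h):
    """Gradient w.r.t. h of the top predicted score; for a linear head it is a row of W."""
    return W[predict(W, h)]

def gradient_overlap(W1, W2, h):
    """Dot product of the two top-score gradients (the quantity in equation T)."""
    g1, g2 = top_score_gradient(W1, h), top_score_gradient(W2, h)
    return sum(a * b for a, b in zip(g1, g2))

# Hypothetical representation in which the same feature appears twice.
feature = 0.8
h = [feature, feature]
W1 = [[-1.0, 0.0], [1.0, 0.0]]  # head 1 reads coordinate 0 only
W2 = [[0.0, -1.0], [0.0, 1.0]]  # head 2 reads coordinate 1 only

overlap = gradient_overlap(W1, W2, h)
same_prediction = predict(W1, h) == predict(W2, h)
```

Here the regularizer is already at its minimum (`overlap` is zero) although both heads compute the same function of the same underlying feature, which is one way the penalty can be satisfied without any gain in diversity.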

F.2 COMPARISON WITH ZHANG ET AL. (2021)

In their work, Zhang et al. (2021) argue that while a model can be biased, there exist unbiased functional subnetworks. They introduced Modular Risk Minimization (MRM) to find those subnetworks. We implemented the MRM method (Alg. 1 from their paper) and tested it on our MM-Dominoes dataset (§ 4.1). We observed that their approach cannot handle the extreme case we consider, where the spurious feature is fully predictive in the train distribution (but not in OOD). They need it to be, say, only 90% predictive. On our dataset, in the first phase of Alg. 1, the model trained on the source task learns to completely ignore the bottom row due to the extreme simplicity bias, ensuring there is no useful subnetwork. We found the randomized-accuracy of subnetworks obtained with MRM to be no better than random. This is because, in extreme cases, the network which the simplicity bias pushes us to learn may completely ignore the actual feature and instead only focus on the spurious feature. In such a case, there is no unbiased subnetwork.

Contrary to Saito et al. (2017), we aim to train an ensemble of predictors able to generalize to unknown target tasks and do not assume access to the target data. In particular, the unlabelled OOD data we need can be different from the downstream transfer target data. We make this distinction clear in § 4.1, where D(3)_ood for the dominoes datasets are built using combinations of 1s and 0s with images from classes not present in the target and source tasks. Despite the lack of target data, the r-acc improves by resp. 28% and 38% for the MM-Dominoes and MF-Dominoes datasets. Further, we focus on mitigating extreme simplicity bias as described by Shah et al. (2020), where a spurious feature can have the same predictive power as a non-spurious one on the source task (but not on the unknown target task). While Saito et al. (2017) use the concept of diversity, their formulation measures diversity in terms of the inner product between the weights.
However, since neural networks are highly non-convex, it is possible for two networks to effectively learn the exact same function which relies on spurious features, while still having different parameterizations. Thus, our method can be viewed as a "functional" extension of the method in (Shah et al., 2020). Further, the encoder F itself can learn a representation such that F 1 and F 2 rely on the same information while minimizing the regularizer. To see this, we trained the method of Alg. 1 from Saito et al. (2017) on our MM-Dominoes dataset. Tuning λ ∈ {0.1, 1, 10, 100}, we were unable to learn a model F t which transfers to the target task.

F.4 COMPARISON WITH SAITO ET AL. (2018)

Contrary to Saito et al. (2018), we do not aim at training a domain-agnostic representation, but instead at overcoming simplicity bias to generalize to OOD settings. E.g. in colored MNIST, a classifier which throws out the shape and simply uses color (or vice versa) is domain agnostic. But for overcoming spurious features, models in our ensemble would need to use both color and digit. Thus a domain-agnostic representation is insufficient for OOD generalization. Furthermore, the training procedure of Saito et al. (2018) consists in first training a shared feature extractor G and two classification heads F 1 and F 2 to minimize the cross-entropy on the source task. In a second step the classification heads F 1 and F 2 are trained to increase the discrepancy on samples from the target distribution while fixing the feature extractor G. However, in the case where a spurious feature is as predictive as the non-spurious one, as in our experiments of § 4.1, the extreme simplicity bias would force the feature extractor to become invariant to the complex feature. The second and third steps of the algorithm would fail from there.

F.5 COMPARISON WITH NAM ET AL. (2020)

In this work, two models are trained simultaneously, one being the biased model while the other is the debiased model.
During training, the first model gives higher weights to training samples agreeing with the current bias of the model. On the other hand, the second model learns by giving higher weights to training samples conflicting with the biased model. In order to work, the algorithm assumes that the ratio of bias-aligned samples is smaller than 100%, which is not the case for our datasets in § 4.1. In these challenging datasets, where the biased feature is as predictive as the unbiased feature, the second model fails to find bias-conflicting samples, and hence fails to de-bias itself. For this reason, the work of Nam et al. (2020) fails to counter extreme simplicity bias.

We implemented the JTT method from Liu et al. (2021a) and report test accuracies on the Waterbirds, Camelyon17, and Office-Home datasets in Table 2. We tuned T, the number of epochs for the first model, in {1, 2, 5, 10, 20, 60}. We tuned the upsampling weight λ in {6, 50, 100}. We pick the model with the best validation accuracy.

Table 2: Comparison between ERM, D-BAT, and JTT. For JTT, results are reported for a single seed. While JTT is efficient when small sub-groups are present in the data (as is the case in the Waterbirds dataset), the method fails to significantly improve upon ERM when the distribution shift is more severe, as in the Office-Home and Camelyon17 datasets.

[Table 2; surviving column headers: Method, Waterbirds]

F.7 COMPARISON WITH LEE ET AL. (2022)

The concurrent work of Lee et al. (2022) proposes to measure diversity between two models using the mutual information (MI) between their predictions on the entire OOD distribution, whereas our loss is defined on the per-datapoint difference in the predictions. This means that our loss decomposes as a sum over the datapoints and is well defined on small mini-batches, whereas computing the MI requires processing the entirety (or at least a very large part) of the data. Besides such practical advantages, our notion of diversity naturally arises out of discrepancy-based domain adaptation theory, whereas the choice of using MI is ad hoc and in fact may not give the expected results. Consider the toy problem in Fig. 3 of Lee et al. (2022): the predictions of the two models actually have maximum mutual information, since they predict the exact opposite on all the unlabelled perturbation data. Thus, MI would say that the two models have zero diversity, whereas discrepancy would say they have very high diversity. Hence, MI is theoretically the wrong measure to use. We confirmed this intuition by running experiments on the same setup as in Lee et al. (2022): we compared, for the two notions of diversity (MI and discrepancy), which pairs of predictors are optimal. Results can be seen in Fig. 17.
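The contrast between the two diversity measures can be made concrete on a handful of hard predictions. The sketch below is a simplified illustration of ours: it uses a plug-in estimate of mutual information on binary labels and a plain disagreement rate, which is not the estimator of Lee et al. (2022) nor the exact D-BAT loss; the prediction vectors are made up.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Empirical mutual information (in bits) between two label sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def disagreement_rate(a, b):
    """Fraction of datapoints on which the two predictors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical predictions of a first model on eight unlabelled OOD points.
h1 = [0, 0, 1, 1, 0, 1, 0, 1]
identical = list(h1)                    # second model copies the first
opposite = [1 - y for y in h1]          # second model always predicts the opposite
unrelated = [0, 1, 0, 1, 0, 0, 1, 1]    # second model is independent of the first

for name, h2 in [("identical", identical), ("opposite", opposite), ("unrelated", unrelated)]:
    print(name, mutual_information(h1, h2), disagreement_rate(h1, h2))
```

In this toy case the "opposite" predictor has the same mutual information with `h1` as an exact copy, so an MI-minimizing criterion would prefer the "unrelated" predictor, whereas the disagreement rate singles out the "opposite" one.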



, whereas minimizing mutual information would yield the wrong diagonal classifier. The disagreement scores match intuitive definitions of diversity, whereas mutual information does not.



Figure 1: Example of applying D-BAT on a simple 2D toy example similar to the LMS-5 dataset introduced by Shah et al. (2020). The two classes, red and blue, can easily be separated by a vertical decision boundary. Other ways to separate the two classes, with horizontal lines for instance, are more complex, i.e. they require more hyperplanes. The simplicity bias will push models to systematically learn the simpler feature, as in the second column (b). Using D-BAT, we are able to learn the model in column (c), relying on a more complex decision boundary, effectively overcoming the simplicity bias. The ensemble h ens (x) = h 1 (x) + h 2 (x), in column (d), outputs a flat distribution at points where the two models disagree, effectively maximizing the uncertainty at those points. In this experiment the samples from D ood were obtained through computing adversarial perturbations, see App. D.2 for more details.

Figure 2: Illustration of how D-BAT can promote learning diverse features. Consider the task of classifying bird pictures among several classes. The red color represents the attention of a first model h 1. This model learnt to use some simple yet discriminative feature to recognise an African Crowned Crane on the left. Now suppose we use the top image D ood on which the models must disagree. h 2 cannot again use the same feature as h 1, since then it will not disagree on D ood. Instead, h 2 would look for other distinctive features of the crane which are not present on the right, e.g. using its beak and red throat pouch.

Theorem 3.1 (D-BAT favors diversity). Given a joint source distribution D of triplets of random variables (C, S, Y ) taking values in {0, 1} 3 . Assuming D has the following PMF: P D (C = c, S = s, Y = y) = 1/2 if c = s = y, and 0 otherwise, which intuitively corresponds to experiments § 4.1 in which two features (e.g. color and shape) are equally predictive of the label y. Assuming a first model learnt the posterior distribution P 1 (Y = 1 | C = c, S = s) = c, meaning that it is invariant to feature s. Given a distribution D ood uniform over {0, 1} 3 outside of the support of D, the posterior solving the D-BAT objective will be P 2 (Y = 1 | C = c, S = s) = s, invariant to feature c.

Figure 4: All results are in the "D ood = test data" setting. (a) and (b): Test accuracies as a function of the ensemble size for both D-BAT and Deep Ensembles (ERM ensembles). We observe a significant advantage of D-BAT on both the Waterbirds and the Office-Home datasets. The difference is especially visible on the Waterbirds dataset, which has a stronger spurious correlation. Results have been obtained by averaging over 3 seeds for the Waterbirds dataset and 6 seeds for the Office-Home dataset. (c): Comparison of D-BAT with several other methods on the Camelyon17 dataset; all results except those of D-BAT are taken from Sagawa et al. (2022).

Figure 5: Entropy of ensembles of two models trained with and without D-BAT (deep-ensemble), for inputs x taken from along line t•1+(1-t)•0 for t ∈ [-1, 2]. In-distribution samples are obtained for t ∈ {0, 1}. All ensembles have a similar test accuracy of 99%. Unlike deep ensembles, D-BAT ensembles are able to correctly give high uncertainty values for points far away from the decision boundary. The standard deviations have been omitted here for clarity, but can be seen in App. D.3.
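The uncertainty measure plotted here can be sketched in a few lines. This is an illustration under our own assumptions: the ensemble prediction is taken as the average of the members' class-probability vectors (a normalized version of h_1 + h_2), entropy is in nats, and the probability vectors are made up rather than taken from the experiments.

```python
import math

def ensemble_probs(prob_vectors):
    """Average the class-probability vectors of the ensemble members."""
    m = len(prob_vectors)
    return [sum(col) / m for col in zip(*prob_vectors)]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

# Hypothetical member outputs at two inputs.
agreeing = [[0.99, 0.01], [0.98, 0.02]]      # both members confident in class 0
disagreeing = [[0.99, 0.01], [0.01, 0.99]]   # members confident in different classes

low = entropy(ensemble_probs(agreeing))
high = entropy(ensemble_probs(disagreeing))
```

When the members confidently disagree, the averaged distribution is flat and the entropy approaches its maximum of log 2 for two classes, which is the behaviour expected of a D-BAT ensemble away from the training data.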

• Assuming a first model learnt the posterior distribution $P_1(Y = 1 \mid C = c, S = s) = c$.
• Given a distribution D ood uniform over {0, 1}^3 outside of the support of D.

From there, training a second model h 2 following the D-BAT objective would mean minimizing the agreement on D ood:

$$\min_{P_2} \; \mathbb{E}_{(c,s) \sim D_{ood}} \Big[ -\log\big( P_1(Y = 1 \mid c, s)\, P_2(Y = 0 \mid c, s) + P_1(Y = 0 \mid c, s)\, P_2(Y = 1 \mid c, s) \big) \Big] \tag{1}$$
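The minimization in eq. 1 can be checked by brute force in the setting of Theorem 3.1. The sketch below is ours and makes one simplification that is not in the proof: it only enumerates deterministic posteriors (values in {0, 1}) for the second model, keeps those that agree with the first model on the support of D, and picks the one with the smallest agreement loss on D ood. The small constant `eps` only guards against log(0).

```python
import math
from itertools import product

def p1_y1(c, s):
    """First model: P1(Y = 1 | c, s) = c, i.e. it ignores s."""
    return float(c)

def agreement_loss(p1, p2, eps=1e-12):
    """Integrand of eq. 1: -log of the probability that the two models disagree."""
    return -math.log(p1 * (1.0 - p2) + (1.0 - p1) * p2 + eps)

SOURCE = [(0, 0), (1, 1)]  # support of D: c == s
OOD = [(0, 1), (1, 0)]     # support of D_ood: c != s
INPUTS = SOURCE + OOD

def dbat_objective(p2_table):
    """Expected agreement loss on D_ood for a table {(c, s): P2(Y = 1 | c, s)}."""
    return sum(agreement_loss(p1_y1(c, s), p2_table[(c, s)]) for c, s in OOD) / len(OOD)

def agrees_on_source(p2_table):
    return all(p2_table[x] == p1_y1(*x) for x in SOURCE)

# Enumerate the 16 deterministic posteriors; keep those that agree with P1 on D.
candidates = []
for values in product([0.0, 1.0], repeat=len(INPUTS)):
    table = dict(zip(INPUTS, values))
    if agrees_on_source(table):
        candidates.append(table)

best = min(candidates, key=dbat_objective)
```

The selected table satisfies `best[(c, s)] == s` on all four inputs, matching the posterior stated in Theorem 3.1.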

D.1 IMPLEMENTATION DETAILS FOR THE C-MNIST, M/M-D, M/F-D AND M/C-D EXPERIMENTS

In the experiments on C-MNIST, M/F-D and M/C-D, we used different versions of LeNet (LeCun et al., 1998).

Figure 8: Entropy of ensembles of two models trained with ((b) and (c)) and without D-BAT (deep-ensemble, (a)), for inputs x taken from along line t•1+(1-t)•0 for t ∈ [-1, 2]. For deep-ensembles in (a), we notice how the standard deviation is near 0 for OOD regions t ∈ ]-1, 0] ∪ [1, 2[, which indicates a lack of diversity between members of the ensemble. This is in sharp contrast with D-BAT ensembles in (b) and (c), which clearly show some variability in those regions. The high variability is explained by the fact that we are not optimizing specifically to be able to detect OOD samples in those regions, but instead we are gaining this ability as a by-product of diversity, and diversity can be reached in many different configurations.

Figure 11: Samples from the training data distribution for C-MNIST, MM-Dominoes, MF-Dominoes, and MC-Dominoes. Those datasets are used to evaluate D-BAT's aptitude to evade the simplicity bias. For C-MNIST, the simple feature is the color and the complex one is the shape. For all the Dominoes datasets, the simple feature is the top row, while the complex feature is the bottom one. One could indeed separate 0s from 1s by simply looking at the value of the middle pixels (if low value then 0 else 1).

Figure 12: OOD distributions used for the C-MNIST experiments. D (1) ood is the distribution used to train D-BAT when we assumed we have access to unlabeled target data. D (2) ood is the distribution we used to show how D-BAT could work despite not having unlabeled target data. When experimenting on D (2) ood we remove the shapes 5 to 9 from the training dataset, that way D (2) ood is really OOD.

Figure 13: OOD distributions used for the MF-Dominoes experiments. D (1) ood corresponds to our experiments when we have access to unlabeled target data. D (2) ood is very different from the target distribution as the second row is made only of images from categories not present in the training and test distributions.

Figure 14: OOD distributions used for the MC-Dominoes experiments. D (1) ood corresponds to our experiments when we have access to unlabeled target data. D (2) ood is very different from the target distribution as the second row is made only of images from categories not present in the training and test distributions.

COMPARISON WITH SAITO ET AL. (2017) Contrary to Saito et al. (

(2020) fails to counter extreme simplicity bias.

F.6 COMPARISON WITH LIU ET AL. (2021A)

The work of Liu et al. (2021a) is similar to the work of Nam et al. (2020) and shares the same limitation. A first model is trained through ERM before a second model is trained by upweighting the samples misclassified by the first model. This method, as for Nam et al. (2020), fails to induce diversity when all the samples are correctly classified by the first model, as is the case for our datasets in § 4.1.
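The failure mode is easy to see from the reweighting step alone. The sketch below is a schematic of the upweighting idea with names of our choosing (`jtt_weights`, `upweight`), not the implementation of Liu et al. (2021a); the labels and predictions are made up.

```python
def jtt_weights(first_model_preds, labels, upweight):
    """Per-sample weights: `upweight` for samples the first model misclassifies, 1 otherwise."""
    return [upweight if p != y else 1.0 for p, y in zip(first_model_preds, labels)]

labels = [0, 1, 1, 0, 1, 0]

# Usual case: the first model makes a few mistakes, which get upweighted.
imperfect_preds = [0, 1, 0, 0, 1, 1]
weights_imperfect = jtt_weights(imperfect_preds, labels, upweight=50.0)

# Extreme simplicity bias: the spurious feature is fully predictive on the
# training set, so the first model makes no mistake at all.
perfect_preds = list(labels)
weights_perfect = jtt_weights(perfect_preds, labels, upweight=50.0)
```

With a perfect first model every weight equals 1, so the second stage reduces to plain ERM on the same data and receives no signal pushing it towards a different feature.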

propose to use dropout at inference time, a method referred to as MC-Dropout. Other popular methods used for uncertainty estimation are Bayesian Neural Networks (BNNs) (Hernández-Lobato & Adams, 2015) and Gaussian Processes (Rasmussen & Williams, 2005). All those methods but gaussian processes, were recently shown to fail to adequately provide high uncertainty estimates on OOD samples away from the boundary decision (van Amersfoort et al., 2020;

In the official version released in the WILDS suite, the background is predictive of the label in 95% of cases, i.e. 95% of waterbirds (resp. land-birds) are seen on water (resp. land). Due to the simplicity bias, this means that ERM models tend to overuse the background information. The test

D.7 NOTE ON SELECTING α

Depending on the experiment, the value of α used ranged from 1 to 10^-6. We explain the variability in those values by (i) the capacity of the model used and (ii) the OOD distribution selected. If the model used has a large capacity, it can more easily overfit the OOD distribution and find shortcuts to disagree on D ood without relying on different features to classify the training samples, as discussed in § 5. For this reason we observed that larger models, such as the ResNet-18 and ResNet-50 used respectively on the CIFAR-10 and Camelyon17 datasets, require a smaller α than smaller LeNet architectures. Furthermore, when the OOD distribution is close to the training distribution, smaller α values are preferred, as in our Camelyon17 experiments. In this case, disagreeing too strongly on the OOD data might force a second model M 2 to give erroneous predictions in order to disagree with M 1, assuming that this first model generalizes well to the OOD set.

D.8 COMPUTATIONAL RESOURCES

All of our experiments were run on single-GPU machines. Most of our experiments require little computational resources and can be entirely reproduced on e.g. Google Colab (see App. A). For the Camelyon17, Waterbirds and Office-Home datasets, which use a ResNet-50 or ResNet-18 architecture, we used an Nvidia V100 GPU, and the hyperparameter search and training took about two weeks.

A SOURCE CODE

Link to the source code to reproduce our experiments: https://github.com/mpagli/Agree-to-Disagree

B ALGORITHMS

The D-BAT training algorithm can be applied to both binary and multi-class classification problems. For our experiments on binary classification, i.e. for the Camelyon17, Waterbirds, M/F-D and M/C-D datasets (see § 4.1) and for our MNIST experiments in Fig. 5, we used Alg. 1. This algorithm assumes a first model h 1 has already been trained with e.g. empirical risk minimization, and trains a second model following the algorithm described in § 3.2. For our multi-class experiments, i.e. for the C-MNIST and Office-Home experiments (see § 4.1) and the CIFAR-10 uncertainty experiments (see § 4.2), we used Alg. 2. This algorithm trains a full ensemble of size M using D-BAT as described in § 3.2.
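The sequential, binary-classification case can be sketched on the four-point setting of Theorem 3.1. The code below is a minimal illustration under our own assumptions and is not the released implementation: the first model is a fixed, hypothetical logistic function of the first feature, the second model is a logistic regression trained by full-batch gradient descent, and the loss is the cross-entropy on the training points plus α times the agreement penalty −log(p_1(1−p_2) + (1−p_1)p_2) on the unlabelled OOD points. All hyperparameter values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h1(x):
    """Hypothetical pre-trained first model: relies on the first feature only."""
    return sigmoid(4.0 * (2.0 * x[0] - 1.0))

def h2(w, x):
    """Second model: logistic regression with parameters w = [w_0, w_1, bias]."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])

def train_second_model(train, ood, alpha=1.0, lr=0.5, steps=2000):
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        # Cross-entropy on labelled training data: dL/dz = p2 - y.
        for x, y in train:
            dz = (h2(w, x) - y) / len(train)
            for i, xi in enumerate((x[0], x[1], 1.0)):
                grad[i] += dz * xi
        # Agreement penalty on unlabelled OOD data:
        # A = -log(q) with q = p1 (1 - p2) + (1 - p1) p2.
        for x in ood:
            p1, p2 = h1(x), h2(w, x)
            q = p1 * (1.0 - p2) + (1.0 - p1) * p2
            dz = alpha * (-(1.0 - 2.0 * p1) * p2 * (1.0 - p2) / q) / len(ood)
            for i, xi in enumerate((x[0], x[1], 1.0)):
                grad[i] += dz * xi
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Training data: both features agree with the label. OOD data: features conflict.
train = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
ood = [(0.0, 1.0), (1.0, 0.0)]
w = train_second_model(train, ood)
```

After training, `h2` still fits the training points but follows the second feature on the OOD points, i.e. it disagrees there with `h1`, which follows the first feature.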

Algorithm 1 D-BAT for binary classification

Input: train data D, OOD data D ood, stopping time T, D-BAT coefficient α, learning rate η, pre-trained model h 1, randomly initialized model h 2 with weights ω 0, and its loss L.

), and a classification loss L.
for m ∈ 0, …, M−1 do
    for t ∈ 0, …, T−1 do
        … − η ∇_{ω^(m)} (L(h_m, x, y) + αA)
    end for
end for

Sequential vs. simultaneous training. Nothing prevents the use of the D-BAT objective while training all the predictors of the ensemble simultaneously. While we had some successes in doing so, we advocate against it as this can discard the ERM solution. We found that the training dynamics of simultaneous training have a tendency to generate more complex solutions than sequential training. In our experiments on the 2D toy setting, sequential training gives two models which are both simple and diverse (see Fig. 1), whereas simultaneous training generates two relatively simple predictors but of higher complexity (see Fig. 7); in particular, it would deprive us of the simplest solution (Fig. 1.b). In general, as we do not know the spuriousness of the features, the simplest predictor is still of importance.

and validation sets are made more evenly, with 50% of waterbirds (resp. land-birds) being seen on water (resp. land). We use the train/validation/test splits provided by the WILDS library. We use a ResNet-50 (He et al., 2016) as model. We train for 300 epochs with a fixed learning rate of 0.001 and SGD as optimizer. We use an ℓ2 penalty term of 0.0001 and a momentum term β = 0.9. For D-BAT, we tune α ∈ {10^0, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5} and found α = 10^-4 to be best. For each set of hyperparameters, we train a deep-ensemble and a D-BAT ensemble of size 2, and select the parameters associated with the highest averaged validation accuracy over the two predictors of the ensemble. Our results are obtained by averaging over 3 seeds. For our D-BAT experiments we only consider the case where we have access to unlabeled target data.
We use the validation split as it is from the same distribution as the target data.

D.6 IMPLEMENTATION DETAILS FOR THE OFFICE-HOME EXPERIMENTS

The Office-Home dataset is made of four domains: Art, Clipart, Product, and Real-world. We train on the grouped Product and Clipart domains, and measure the generalization to the Real-world domain. This dataset has 65 classes. We use a ResNet-18 and train for 600 epochs with a fixed learning rate of 0.001 and SGD as optimizer. We use an ℓ2 penalty term of 0.0001 and a momentum term β = 0.9. For D-BAT, we tune α ∈ {10^0, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5, 10^-6} and found α = 10^-5 to be best. For each set of hyperparameters, we train a deep-ensemble and a D-BAT ensemble of size 2, and select the parameters associated with the highest averaged validation accuracy over the two predictors of the ensemble. Our results are obtained by averaging over 6 seeds. We experiment with both the "ideal" case in which some unlabeled target data is available to use as D ood (D ood = D test; see Fig. 4.b) as well as the case in which we use a different domain (Art) as D ood (D ood ≠ D test). For this latter setup, the evolution of the test accuracy given the ensemble size is in Fig. 10. In both cases, the validation split, just as the test split, comes from the Real-world domain.

2021) with hyperparameters favoring the compromise between accuracy (test-acc) and randomized-accuracy (r-acc). We run 5 different seeds for Teney et al. (2021), each run consisting of two classification heads and a shared encoder chosen to be the identity (a) or a CNN encoder (b). The acc and r-acc are displayed for the 10 resulting classification heads. We compare with two models obtained using D-BAT: the first model, learning the simplest feature, is in the bottom right corner, and the second model, trained with diversity, is in the top right corner. We observe that the method of Teney et al. (2021) fails to reach a good r-acc and sacrifices accuracy. D-BAT is able to retrieve both data modalities without sacrificing accuracy.
2021) with hyperparameters yielding an accuracy (test-acc) close to 1 while maximizing the randomized-accuracy (r-acc). We run 5 different seeds for Teney et al. (2021), each run consisting of two classification heads and a shared encoder chosen to be the identity (a) or a CNN encoder (b). The acc and r-acc are displayed for the 10 resulting classification heads. We compare with two models obtained using D-BAT: the first model, learning the simplest feature, is in the bottom right corner, and the second model, trained with diversity, is in the top right corner. We observe that the method of Teney et al. (2021) only marginally improves the randomized-acc.

OOD datapoints X are sampled randomly in the off-diagonal [-1, 0]^2 and [0, 1]^2 regions. The set of hyperplanes h_θ with θ ∈ [0, π/2] all achieve a perfect train accuracy. We fix the first classifier to be the horizontal classifier h_1 = h_{θ=0}. Then, we measure the disagreement between h_1 and different choices of h_2 = h_θ (in b), as well as their mutual information (in c) using the code provided in (Lee et al., 2022). Maximizing the disagreement yields the correct vertical classifier h_2 = h_{θ=π/2}

