NEURAL ENSEMBLE SEARCH FOR UNCERTAINTY ESTIMATION AND DATASET SHIFT

Anonymous

Abstract

Ensembles of neural networks achieve superior performance compared to standalone networks not only in terms of predictive performance, but also uncertainty calibration and robustness to dataset shift. Diversity among networks is believed to be key for building strong ensembles, but typical approaches, such as deep ensembles, only ensemble different weight vectors of a fixed architecture. Instead, we propose two methods for constructing ensembles to exploit diversity among networks with varying architectures. We find the resulting ensembles are indeed more diverse and also exhibit better uncertainty calibration, predictive performance and robustness to dataset shift in comparison with deep ensembles on a variety of classification tasks.

1. INTRODUCTION

Automatically learning useful representations of data using deep neural networks has been successful across various tasks (Krizhevsky et al., 2012; Hinton et al., 2012; Mikolov et al., 2013). While some applications rely only on the predictions made by a neural network, many critical applications also require reliable predictive uncertainty estimates and robustness under dataset shift, that is, when the data distribution observed at deployment differs from the training data distribution. Examples include medical imaging (Esteva et al., 2017) and self-driving cars (Bojarski et al., 2016). However, several studies have shown that neural networks are not always robust to dataset shift (Ovadia et al., 2019; Hendrycks & Dietterich, 2019), nor do they exhibit calibrated predictive uncertainty, resulting in incorrect predictions made with high confidence (Guo et al., 2017).

Using an ensemble of networks rather than a stand-alone network improves both predictive uncertainty calibration and robustness to dataset shift. Ensembles also outperform approximate Bayesian methods (Lakshminarayanan et al., 2017; Ovadia et al., 2019; Gustafsson et al., 2020). Their success is usually attributed to the diversity among the base learners; however, there are various definitions of diversity (Kuncheva & Whitaker, 2003; Zhou, 2012) without a consensus. In practice, ensembles are usually constructed by choosing a fixed state-of-the-art architecture and creating base learners by independently training random initializations of it. This is referred to as deep ensembles (Lakshminarayanan et al., 2017), a state-of-the-art method for uncertainty estimation. However, as we show, base learners with varying network architectures make more diverse predictions. Therefore, picking a strong, fixed architecture for the ensemble's base learners neglects diversity in favor of base learner strength.
This has implications for ensemble performance, since both diversity and base learner strength are important. To overcome this, we propose Neural Ensemble Search (NES); a NES algorithm finds a set of diverse neural architectures that together form a strong ensemble. Note that, a priori, it is not obvious how to find diverse architectures that work well as an ensemble: one cannot select them randomly, since it is important to select strong ones, nor can one optimize them individually, as that ignores diversity. By directly optimizing the ensemble loss while maintaining independent training of base learners, a NES algorithm implicitly encourages diversity, without the need for explicitly defining diversity. In detail, our contributions are as follows:

1. We show that ensembles composed of varying architectures perform better than ensembles composed of a fixed architecture. We demonstrate that this is due to increased diversity among the ensemble's base learners (Sections 3 and 5).

2. Based on these findings and the importance of diversity, we propose two algorithms for Neural Ensemble Search: NES-RS and NES-RE. NES-RS is a simple random search based algorithm, and NES-RE is based on regularized evolution (Real et al., 2019). Both search algorithms seek performant ensembles with varying base learner architectures (Section 4).

3. With experiments on classification tasks, we evaluate the ensembles found by NES-RS and NES-RE in terms of both predictive performance and uncertainty calibration, comparing them to deep ensembles with fixed, optimized architectures. We find our ensembles outperform deep ensembles not only on in-distribution data but also under dataset shift (Section 5).

The code for our experiments is available at: https://anonymousfiles.io/ZaY1ccR5/.

2. RELATED WORK

Ensemble Learning and Uncertainty Estimation. Ensembles of neural networks (Hansen & Salamon, 1990; Krogh & Vedelsby, 1995; Dietterich, 2000) are commonly used to boost performance (Szegedy et al., 2015; Simonyan & Zisserman, 2015; He et al., 2016). In practice, strategies for building ensembles include the popular approach of independently training multiple initializations of the same network (i.e. deep ensembles (Lakshminarayanan et al., 2017)), training base learners on bootstrap samples of the training data (i.e. bagging) (Zhou et al., 2002), joint training with diversity-encouraging losses (Liu & Yao, 1999; Lee et al., 2015; Zhou et al., 2018; Webb et al., 2019; Jain et al., 2020; Pearce et al., 2020) and using checkpoints along the training trajectory of a network (Huang et al., 2017; Loshchilov & Hutter, 2017). Much recent interest in ensembles has been due to their strong predictive uncertainty estimation, with extensive empirical studies (Ovadia et al., 2019; Gustafsson et al., 2020) observing that ensembles outperform other approaches for uncertainty estimation, notably including Bayesian methods (Gal & Ghahramani, 2016; Welling & Teh, 2011) and post-hoc calibration (Guo et al., 2017). Note that He et al. (2020) draw a link between Bayesian methods and deep ensembles. We focus on deep ensembles as they provide state-of-the-art results in uncertainty estimation; Ashukha et al. (2020) found many sophisticated ensembling techniques to be equivalent to a small-sized deep ensemble in terms of test performance.

AutoML. AutoML (Hutter et al., 2018) is the process of automatically designing machine learning systems. Automatic ensemble construction is commonly used in AutoML (Feurer et al., 2015). Lévesque et al. (2016) use Bayesian optimization to tune non-architectural hyperparameters of an ensemble's base learners, relying on ensemble selection (Caruana et al., 2004).
Specific to neural networks, AutoML also includes neural architecture search (NAS), the process of automatically designing network architectures (Elsken et al., 2019). Existing strategies using reinforcement learning (Zoph & Le, 2017), evolutionary algorithms (Real et al., 2019) or gradient-based methods (Liu et al., 2019) have demonstrated that NAS can find architectures that surpass hand-crafted ones. Some recent research connects ensemble learning with NAS. Methods proposed by Cortes et al. (2017) and Macko et al. (2019) iteratively add (sub-)networks to an ensemble to improve the ensemble's performance. Whereas those works focus on how to build the ensemble from its base learners, our work fixes the ensembling mechanism and instead focuses on generating a set of architectures that is diverse and performs well as an ensemble. The search spaces considered by these works are also limited compared to ours: Cortes et al. (2017) consider fully-connected layers, and Macko et al. (2019) only use NASNet-A (Zoph et al., 2018) blocks with varying depth and number of filters. All aforementioned works focus only on predictive performance and do not consider uncertainty estimation or dataset shift. Concurrent to our work, Wenzel et al. (2020) consider ensembles whose base learners have varying hyperparameters, using an approach similar to NES-RS. However, they focus on non-architectural hyperparameters such as L2 regularization strength and dropout rates, keeping the architecture fixed. As in our work, they also consider predictive uncertainty calibration and robustness to shift, finding improvements over deep ensembles.

3.1. DEFINITIONS AND SET-UP

Let D_train = {(x_i, y_i) : i = 1, ..., N} be the training dataset, where the input x_i ∈ R^D and, assuming a classification task, the output y_i ∈ {1, ..., C}. We use D_val and D_test for the validation and test datasets, respectively. Denote by f_θ a neural network with weights θ, so f_θ(x) ∈ R^C is the predicted probability vector over the classes for input x. Let ℓ(f_θ(x), y) be the neural network's loss for data point (x, y). Given M networks f_{θ_1}, ..., f_{θ_M}, we construct the ensemble F of these networks by averaging the outputs, yielding F(x) = (1/M) Σ_{i=1}^{M} f_{θ_i}(x). In addition to the ensemble's loss ℓ(F(x), y), we will also consider the average base learner loss and the oracle ensemble's loss. The average base learner loss is simply defined as (1/M) Σ_{i=1}^{M} ℓ(f_{θ_i}(x), y); we use this to measure the average base learner strength. Similar to Lee et al. (2015) and Zhou et al. (2018), the oracle ensemble F_OE composed of base learners f_{θ_1}, ..., f_{θ_M} is defined to be the function which, given an input x, returns the prediction of the base learner with the smallest loss for (x, y); that is, F_OE(x) = f_{θ_k}(x), where k ∈ argmin_i ℓ(f_{θ_i}(x), y). Of course, the oracle ensemble can only be constructed if the true class y is known. We use the oracle ensemble loss as a measure of the diversity in base learner predictions. Intuitively, if base learners make diverse predictions for x, the oracle ensemble is more likely to find some base learner with a small loss, whereas if all base learners make identical predictions, the oracle ensemble yields the same output as any (and all) base learners. Therefore, as a rule of thumb, a small oracle ensemble loss indicates more diverse base learner predictions.

Proposition 3.1. Suppose ℓ is negative log-likelihood (NLL). Then the oracle ensemble loss, ensemble loss, and average base learner loss satisfy the following inequality:

ℓ(F_OE(x), y) ≤ ℓ(F(x), y) ≤ (1/M) Σ_{i=1}^{M} ℓ(f_{θ_i}(x), y).

We refer to Appendix A for a proof.
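To make these three quantities concrete, the following minimal sketch computes them for a single data point under NLL. This is our own simplified illustration, not the paper's implementation; `base_probs` holds the softmax output of each of the M base learners.

```python
import math

def nll(probs, y):
    """Negative log-likelihood of the true class y under a probability vector."""
    return -math.log(probs[y])

def ensemble_losses(base_probs, y):
    """Given each base learner's class-probability vector for one input (x, y),
    return (oracle ensemble loss, ensemble loss, average base learner loss)."""
    M = len(base_probs)
    per_learner = [nll(p, y) for p in base_probs]
    oracle = min(per_learner)  # best single base learner on this (x, y)
    # F(x): average the probability vectors, then take NLL.
    avg_probs = [sum(p[c] for p in base_probs) / M
                 for c in range(len(base_probs[0]))]
    ensemble = nll(avg_probs, y)
    avg_base = sum(per_learner) / M
    return oracle, ensemble, avg_base
```

On any input, the three returned values satisfy the chain of inequalities in Proposition 3.1.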
Proposition 3.1 suggests that strong ensembles require not only strong average base learners (smaller upper bound), but also more diversity in their predictions (smaller lower bound). There is extensive theoretical work relating strong base learner performance and diversity to the generalization properties of ensembles (Hansen & Salamon, 1990; Zhou, 2012; Jiang et al., 2017; Bian & Chen, 2019; Goodfellow et al., 2016). Notably, Breiman (2001) showed that the generalization error of random forests depends on the strength of individual trees and the correlation between their mistakes. The fixed architecture used to build deep ensembles is typically chosen to be a strong stand-alone architecture, either hand-crafted or found by NAS. However, since ensemble performance depends not only on strong base learners but also on their diversity, optimizing the base learner's architecture and then constructing a deep ensemble neglects diversity in favor of strong base learner performance.

3.2. VISUALIZING SIMILARITY IN BASE LEARNER PREDICTIONS

[Figure 1: t-SNE visualization (dimensions 1 and 2) of test predictions made by 20 independently trained initializations of each of five sampled architectures, Arch_1 to Arch_5.]

Having base learner architectures vary allows more diversity in their predictions. In this section, we provide empirical evidence for this by visualizing the base learners' predictions. Fort et al. (2019) found that base learners in a deep ensemble explore different parts of the function space, by applying dimensionality reduction to their predictions. Building on this, we uniformly sample five architectures from the DARTS search space (Liu et al., 2019), train 20 initializations of each architecture on CIFAR-10, and visualize the similarity among the networks' predictions on the test dataset using t-SNE (Van der Maaten & Hinton, 2008). Experiment details are available in Section 5 and Appendix B. As shown in Figure 1, predictions made by different initializations of a fixed architecture cluster together, suggesting that base learners with varying architectures explore different parts of the function space. Moreover, we also visualize the predictions of the base learners of two ensembles, each of size M = 30, where one is a deep ensemble and the other has varying architectures (found by NES-RS, which will be introduced in Section 4). Figure 11 shows more diversity in the ensemble with varying architectures than in the deep ensemble. For each of the two ensembles shown in Figure 11, we also compute the average pairwise predictive disagreement among the base learners (the percentage of test inputs on which two base learners disagree), which we find to be 11.88% for the ensemble with varying architectures and 10.51% for the ensemble with the fixed architecture (this is consistent across independent runs). This indicates higher predictive diversity in the ensemble with varying architectures, in line with the t-SNE results.
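The disagreement metric reported above can be sketched as follows (a simplified stand-in for our evaluation code; `preds` holds each base learner's predicted class labels on the test set):

```python
def avg_pairwise_disagreement(preds):
    """preds: list of M prediction lists, one per base learner, each of length N.
    Returns the mean fraction of test inputs on which a pair of base learners
    predicts different classes, averaged over all pairs."""
    M, N = len(preds), len(preds[0])
    pair_rates = []
    for i in range(M):
        for j in range(i + 1, M):
            disagree = sum(a != b for a, b in zip(preds[i], preds[j]))
            pair_rates.append(disagree / N)
    return sum(pair_rates) / len(pair_rates)
```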

4. NEURAL ENSEMBLE SEARCH

In this section, we introduce neural ensemble search (NES). In summary, a NES algorithm optimizes the architectures of base learners in an ensemble to minimize ensemble loss. Given a network f : R^D → R^C, let L(f, D) = Σ_{(x,y)∈D} ℓ(f(x), y) be the loss of f over dataset D. Given a set of base learners {f_1, ..., f_M}, let Ensemble be the function which maps {f_1, ..., f_M} to the ensemble F = (1/M) Σ_{i=1}^{M} f_i as defined in Section 3. To emphasize the architecture, we use the notation f_{θ,α} to denote a network with architecture α ∈ A and weights θ, where A is a search space (SS) of architectures. A NES algorithm aims to solve the following optimization problem:

min_{α_1,...,α_M ∈ A} L(Ensemble(f_{θ_1,α_1}, ..., f_{θ_M,α_M}), D_val)    (1)
s.t. θ_i ∈ argmin_θ L(f_{θ,α_i}, D_train) for i = 1, ..., M

Eq. 1 is difficult to solve for at least two reasons. First, we are optimizing over M architectures, so the search space is effectively A^M, compared to A in typical NAS, making it more difficult to explore fully. Second, a larger search space also increases the risk of overfitting the ensemble loss to D_val. A possible approach here is to treat the ensemble as a single large network to which we apply NAS, but joint training of an ensemble through a single loss has been empirically observed to underperform independent training of base learners, especially for large neural networks (Webb et al., 2019). Instead, our general approach to solving Eq. 1 consists of two steps:

1. Pool building: build a pool P = {f_{θ_1,α_1}, ..., f_{θ_K,α_K}} of size K consisting of potential base learners, where each f_{θ_i,α_i} is a network trained independently on D_train.

2. Ensemble selection: select M base learners f_{θ*_1,α*_1}, ..., f_{θ*_M,α*_M} from P to form an ensemble which minimizes loss on D_val. (We assume K ≥ M.)
Step 1 reduces the options for the base learner architectures, with the intention of making the search more feasible and focusing on strong architectures. Step 2 then selects a performant ensemble, which implicitly encourages base learner strength and diversity. This procedure also ensures that the ensemble's base learners are trained independently. We propose using forward step-wise selection for step 2; that is, given the set of networks P, we start with an empty ensemble and add to it the network from P which minimizes ensemble loss on D_val. We repeat this without replacement until the ensemble is of size M. Let ForwardSelect(P, D_val, M) denote the set of M base learners selected from P by this procedure. Note that selecting the ensemble from P is a combinatorial optimization problem; a greedy approach such as ForwardSelect is nevertheless effective (Caruana et al., 2004), while keeping the computational overhead low, given the predictions of the networks on D_val. We also experimented with three other ensemble selection algorithms: (1) starting with the best network by validation performance, add the next best network to the ensemble only if it improves validation performance, iterating until the ensemble size is M or all models have been considered; (2) select the top M networks by validation performance; (3) forward step-wise selection with replacement. We typically found that these three performed comparably to or worse than our choice, ForwardSelect. We have not yet discussed the algorithm for building the pool in step 1; we propose two approaches, NES-RS (Section 4.1) and NES-RE (Section 4.2). NES-RS is a simple random search based algorithm, while NES-RE is based on regularized evolution (Real et al., 2019), a state-of-the-art NAS algorithm.
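As an illustration, ForwardSelect can be sketched as follows. This is a minimal pure-Python version under our own simplifying assumptions: networks are represented by their cached validation predictions (a dict mapping a network name to its per-input probability vectors), and the ensemble loss is NLL.

```python
import math

def forward_select(pool_probs, y_val, M):
    """Greedy forward step-wise selection without replacement.
    pool_probs: dict name -> list of per-input class-probability vectors on D_val.
    y_val: true validation labels. Returns the M selected names."""
    def ensemble_nll(names):
        total = 0.0
        for idx, y in enumerate(y_val):
            # Ensemble probability of the true class: average over members.
            avg = sum(pool_probs[n][idx][y] for n in names) / len(names)
            total += -math.log(avg)
        return total

    selected = []
    while len(selected) < M:
        remaining = [n for n in pool_probs if n not in selected]
        best = min(remaining, key=lambda n: ensemble_nll(selected + [n]))
        selected.append(best)
    return selected
```

Note how the greedy step evaluates each candidate jointly with the already-selected members, so a network that is individually weak but complementary can still be chosen.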
Note that while gradient-based NAS methods have recently become popular, they are not naively applicable in our setting as the base learner selection component ForwardSelect is typically non-differentiable.

4.1. NES WITH RANDOM SEARCH

In NAS, random search (RS) is a competitive baseline on carefully designed architecture search spaces (Li & Talwalkar, 2019; Yang et al., 2020; Yu et al., 2020). Motivated by its success and simplicity, we first introduce NES with random search (NES-RS). NES-RS builds the pool P by independently sampling architectures uniformly from the search space A (and training them). Since the architectures of networks in P vary, applying ensemble selection is a simple way to exploit diversity, yielding a performant ensemble. Algorithm 1 describes NES-RS in pseudocode.

Algorithm 1: NES with Random Search

Data: Search space A; ensemble size M; computational budget K; D_train, D_val.
1. Sample K architectures α_1, ..., α_K independently and uniformly from A.
2. Train each architecture α_i using D_train, yielding a pool of networks P = {f_{θ_1,α_1}, ..., f_{θ_K,α_K}}.
3. Select base learners {f_{θ*_1,α*_1}, ..., f_{θ*_M,α*_M}} = ForwardSelect(P, D_val, M) by forward step-wise selection without replacement.
4. return ensemble Ensemble(f_{θ*_1,α*_1}, ..., f_{θ*_M,α*_M})
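Algorithm 1 can be sketched end-to-end as follows. All callables here are hypothetical stand-ins of our own: `sample_arch` for uniform sampling from A, `train` for training on D_train, and `ensemble_val_loss` for the ensemble loss on D_val; the selection loop is the greedy ForwardSelect step.

```python
def nes_rs(sample_arch, train, ensemble_val_loss, K, M):
    """Sketch of NES-RS: random-search pool building + forward step-wise
    ensemble selection without replacement."""
    pool = [train(sample_arch()) for _ in range(K)]  # Algorithm 1, lines 1-2
    selected = []                                    # ForwardSelect(P, D_val, M)
    for _ in range(M):
        candidates = [f for f in pool if f not in selected]
        best = min(candidates,
                   key=lambda f: ensemble_val_loss(selected + [f]))
        selected.append(best)
    return selected
```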

4.2. NES WITH REGULARIZED EVOLUTION

A more guided approach for building the pool P is using regularized evolution (RE) (Real et al., 2019) . While RS has the benefit of simplicity, by sampling architectures uniformly, the resulting pool might contain many weak architectures, leaving few strong architectures for ForwardSelect to choose between. Therefore, NES-RS might require a large pool in order to explore interesting parts of the search space. RE is an evolutionary algorithm used for NAS which explores the search space by evolving a population of architectures. In summary, RE starts with a randomly initialized fixed-size population of architectures. At each iteration, a subset of size m of the population is sampled, from which the best network by validation loss is selected as the parent. A mutated copy of the parent architecture, called the child, is trained and added to the population, and the oldest member of the population is removed, preserving the population size. This is iterated until the computational budget is reached, returning the history, i.e. all the networks evaluated during the search, from which the best model is chosen by validation loss. Based on RE for NAS, we propose NES-RE to build the pool of potential base learners. NES-RE starts by randomly initializing a population p of size P . At each iteration, we apply ForwardSelect to the population to select an ensemble of size m, and we uniformly sample one base learner from the ensemble to be the parent. A mutated copy of the parent is added to p and the oldest network is removed, as in regularized evolution. This process is repeated until the computational budget is reached, and the history is returned as the pool P. See Algorithm 2 for pseudocode and Figure 2 for an illustration. 
Algorithm 2: NES with Regularized Evolution

Data: Search space A; ensemble size M; comp. budget K; D_train, D_val; population size P; number of parent candidates m.
1. Sample P architectures α_1, ..., α_P independently and uniformly from A.
2. Train each architecture α_i using D_train, and initialize p = P = {f_{θ_1,α_1}, ..., f_{θ_P,α_P}}.
3. while |P| < K do
4.    Select m parent candidates {f_{θ̃_1,α̃_1}, ..., f_{θ̃_m,α̃_m}} = ForwardSelect(p, D_val, m).
5.    Sample uniformly a parent architecture α̃ from {α̃_1, ..., α̃_m}. // α̃ stays in p.
6.    Apply mutation to α̃, yielding child architecture β.
7.    Train β using D_train and add the trained network f_{θ,β} to p and P.
8.    Remove the oldest member of p. // as done in RE (Real et al., 2019).
9. Select base learners {f_{θ*_1,α*_1}, ..., f_{θ*_M,α*_M}} = ForwardSelect(P, D_val, M) by forward step-wise selection without replacement.
10. return ensemble Ensemble(f_{θ*_1,α*_1}, ..., f_{θ*_M,α*_M})

Also, note the distinction between the population and the pool in NES-RE: the population is evolved, whereas the pool is the set of all networks evaluated during evolution (i.e., the history) and is used post-hoc for selecting the ensemble, similar to Real et al. (2019). Moreover, ForwardSelect is used both for selecting m parent candidates (line 4 in Algorithm 2) and choosing the final ensemble of size M (line 9 in Algorithm 2). In general, m ≠ M.
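The evolution loop of Algorithm 2 can be sketched as follows. This is a simplified stand-in of our own: `sample_arch`, `mutate`, `train`, and `forward_select(pool, size)` are hypothetical placeholders for the components described above, and trained networks are treated as opaque values.

```python
import random
from collections import deque

def nes_re(sample_arch, mutate, train, forward_select, K, P, m, M):
    """Sketch of NES-RE: evolve a fixed-size population while recording all
    evaluated networks in the history (the pool), then select the final
    ensemble from the history."""
    population = deque(train(sample_arch()) for _ in range(P))  # lines 1-2
    history = list(population)                                  # the pool P
    while len(history) < K:
        parents = forward_select(list(population), m)           # line 4
        parent = random.choice(parents)                         # line 5
        child = train(mutate(parent))                           # lines 6-7
        population.append(child)
        history.append(child)
        population.popleft()                                    # line 8: oldest out
    return forward_select(history, M)                           # line 9
```

Using a deque makes the "remove the oldest member" step of regularized evolution a constant-time `popleft`.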

4.3. ENSEMBLE ADAPTATION TO DATASET SHIFT

Using deep ensembles is a common way of building a model robust to distributional shift relative to the training data. In general, one may not know the type of distributional shift that occurs at test time. However, by using an ensemble, diversity in base learner predictions prevents the model from relying on a single base learner's predictions, which may be not only incorrect but also overconfident. We assume that one does not have access to data points with the test-time shift at training time, but one does have access to some validation data D_shift_val with a validation shift, which encapsulates one's belief about the test-time shift. Crucially, the test and validation shifts are disjoint. A simple way to adapt NES-RS and NES-RE to return ensembles robust to shift is to use D_shift_val instead of D_val whenever applying ForwardSelect to select the final ensemble; in Algorithms 1 and 2, this is in lines 3 and 9, respectively. Note that in line 4 of Algorithm 2, we can also replace D_val with D_shift_val when expecting test-time shift; however, to avoid running NES-RE once for each of D_val and D_shift_val, we simply sample one of the two uniformly at each iteration, in order to explore architectures that work well both in-distribution and under shift. See Appendices C.1.3 and B.4 for further discussion.

5. EXPERIMENTS

We compare NES to deep ensembles on different choices of architecture search space (the DARTS and NAS-Bench-201 (Dong & Yang, 2020) search spaces) and dataset (Fashion-MNIST, CIFAR-10, CIFAR-100, ImageNet-16-120 and Tiny ImageNet). For CIFAR-10/100 and Tiny ImageNet, we also consider the dataset shifts proposed by Hendrycks & Dietterich (2019). The metrics used are: NLL, classification error and expected calibration error (ECE) (Guo et al., 2017; Naeini et al., 2015). Hyperparameter and training details are provided in Appendix B.

Baselines. We compare the ensembles found by NES to the baseline of deep ensembles built using a fixed, optimized architecture. On the DARTS search space, the fixed architecture is either: (1) optimized by random search, called DeepEns (RS), (2) the architecture found using the DARTS algorithm, called DeepEns (DARTS), or (3) the architecture found using RE, called DeepEns (AmoebaNet). On the NAS-Bench-201 search space, instead of DeepEns (DARTS/AmoebaNet), we compare to DeepEns (GDAS), whose architecture is found using GDAS (Dong & Yang, 2019).

Results under dataset shift. We evaluate on the shifted test data; see Hendrycks & Dietterich (2019) for details. The severity of the shift varies between 1-5. The fixed architecture used in the baseline DeepEns (RS) is selected based on its loss over D_shift_val, but the DARTS and AmoebaNet architectures remain unchanged. As shown in Figures 3b and 4b, ensembles picked by NES-RS and NES-RE are more robust to dataset shift than all three baselines. Unsurprisingly, DeepEns (DARTS/AmoebaNet) perform poorly compared to the other methods, as they are not optimized to deal with dataset shift, highlighting that highly optimized architectures can fail heavily under dataset shift.

Classification error and uncertainty calibration. We also assess the ensembles using classification error and expected calibration error (ECE). ECE measures the mismatch between the model's confidence and the corresponding achieved accuracy at different levels of confidence.
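For completeness, a minimal sketch of the ECE computation with equal-width confidence bins (the binning scheme follows Guo et al. (2017); the interface and default bin count are our own simplification):

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: partition predictions into equal-width confidence bins and average
    the |mean confidence - accuracy| gap per bin, weighted by bin size.
    confidences: top-class probabilities; correct: 0/1 per prediction."""
    N = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if idx:
            avg_conf = sum(confidences[i] for i in idx) / len(idx)
            accuracy = sum(correct[i] for i in idx) / len(idx)
            ece += (len(idx) / N) * abs(avg_conf - accuracy)
    return ece
```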
As shown in Figure 5, ensembles found with NES tend to exhibit superior uncertainty calibration and are either competitive with or outperform deep ensembles at most shift severities. Notably, on CIFAR-10, ECE is reduced by up to 40% compared to deep ensembles. Note that good uncertainty calibration is especially important when models are used under dataset shift. In terms of classification error, we find that ensembles built by NES consistently outperform deep ensembles, with reductions of up to 7 percentage points in error, as shown in Table 1. As with NLL, NES-RE tends to outperform NES-RS.

Diversity and average base learner strength. To understand why ensembles found by NES algorithms outperform deep ensembles with fixed, optimized architectures, we view the ensembles through the lenses of the average base learner loss and the oracle ensemble loss, as defined in Section 3 and shown in Figure 6. Recall that a small oracle ensemble loss indicates higher diversity. We see that NES finds ensembles with smaller oracle ensemble losses, indicating greater diversity among base learners. Unsurprisingly, the average base learner is occasionally weaker for NES as compared to DeepEns (RS). Despite this, the ensemble performs better (Figure 3), highlighting once again the importance of diversity.

Results on the NAS-Bench-201 search space. We also compare NES to deep ensembles on the NAS-Bench-201 search space, which has two benefits: it shows our findings are not specific to the DARTS search space, and NAS-Bench-201 is a search space for which all architectures' trained weights are available, which allows us to compare NES to the deep ensemble of the optimal architecture by validation loss. Results shown in Figure 7 compare the losses of the ensemble, the average base learner and the oracle ensemble. Interestingly, despite DeepEns (Optimal) having a significantly lower average base learner loss than the other methods, its lack of diversity (as indicated by a higher oracle ensemble loss) yields an ensemble which is outperformed by both NES algorithms.

Ablations. We also apply ensemble selection to a pool consisting only of random initializations of a fixed architecture, to ascertain whether the improvement of NES is due only to ensemble selection. Our results show that NES continues to outperform this baseline, affirming the importance of varying the architecture. In Appendix C.4, we consider baselines of ensembles with other hyperparameters being varied. In particular, we consider ensembles with a fixed, optimized architecture but varying learning rates and L2 regularization strengths (similar to concurrent work by Wenzel et al. (2020)), and ensembles with architectures varying only in terms of width and depth. The results again show that NES tends to improve upon these baselines.

6. CONCLUSION

We showed that ensembles with varying architectures are more diverse than ensembles with fixed architectures and argued that deep ensembles with fixed, optimized architectures neglect diversity. To address this, we proposed Neural Ensemble Search, which exploits diversity among base learners of varying architectures to find strong ensembles. We demonstrated empirically that NES-RE and NES-RS outperform deep ensembles in terms of both predictive performance and uncertainty calibration, on in-distribution data and under dataset shift. We found that even NES-RS, a simple random search based algorithm, finds ensembles capable of outperforming deep ensembles built with state-of-the-art architectures.

A PROOF OF PROPOSITION 3.1

Taking the loss function to be NLL, we have ℓ(f(x), y) = -log [f(x)]_y, where [f(x)]_y is the probability assigned by the network f to x belonging to the true class y, i.e. the predicted probability vector f(x) indexed by the true target y. Note that t ↦ -log t is a convex and decreasing function.

We first prove ℓ(F_OE(x), y) ≤ ℓ(F(x), y). Recall, by definition of F_OE, we have F_OE(x) = f_{θ_k}(x) where k ∈ argmin_i ℓ(f_{θ_i}(x), y); therefore [F_OE(x)]_y = [f_{θ_k}(x)]_y ≥ [f_{θ_i}(x)]_y for all i = 1, ..., M. That is, f_{θ_k} assigns the highest probability to the correct class y for input x. Since the average of the probabilities is at most their maximum and -log is a decreasing function, we have

ℓ(F(x), y) = -log( (1/M) Σ_{i=1}^{M} [f_{θ_i}(x)]_y ) ≥ -log([f_{θ_k}(x)]_y) = ℓ(F_OE(x), y).

For the second inequality, we apply Jensen's inequality in its finite form: for a real-valued convex function φ whose domain is a subset of R and numbers t_1, ..., t_n in its domain, φ( (1/n) Σ_{i=1}^{n} t_i ) ≤ (1/n) Σ_{i=1}^{n} φ(t_i). Noting that -log is convex, ℓ(F(x), y) ≤ (1/M) Σ_{i=1}^{M} ℓ(f_{θ_i}(x), y) follows directly.

B EXPERIMENTAL AND IMPLEMENTATION DETAILS

We describe details of the experiments shown in Section 5 and Appendix C. Note that unless stated otherwise, all sampling over a discrete set is done uniformly in the discussion below.

B.1 ARCHITECTURE SEARCH SPACES

DARTS search space. The first architecture search space we consider in our experiments is the one from DARTS (Liu et al., 2019). We search for two types of cells: normal cells, which preserve the spatial dimensions, and reduction cells, which reduce the spatial dimensions. These cells are stacked using a pre-determined macro-architecture, where they are usually repeated and connected using additional skip connections. Each cell is a directed acyclic graph, where nodes represent feature maps in the computational graph and edges between them correspond to operation choices (e.g. a convolution operation). A cell receives the outputs of the previous and previous-previous cells at its 2 input nodes. It additionally contains 5 nodes: 4 intermediate nodes, each aggregating the information coming from 2 previous nodes in the cell, and an output node, which concatenates the outputs of all intermediate nodes across the channel dimension. AmoebaNet contains one more intermediate node, making it a deeper architecture. The set of possible operations (eight in total in DARTS) that we use for each edge in the cells is the same as in DARTS, but we leave out the "zero" operation, since it is not necessary for non-differentiable approaches such as random search and evolution. Random sampling of architectures is done by sampling the structure of the cell and the operations at each edge. The total number of architectures contained in this space is ≈ 10^18. We refer the reader to Liu et al. (2019) for more details.

NAS-Bench-201 search space. NAS-Bench-201 (Dong & Yang, 2020) is a tabular NAS benchmark, i.e. all architectures in the cell search space are trained and evaluated beforehand, so one can quickly query their performance (and weights) from a table. Since this space is exhaustively evaluated, its size is limited: cells contain only 4 nodes in total (1 input, 2 intermediate and 1 output node), with 5 operation choices on every edge connecting two nodes.
This means that there are only 15,625 possible architecture configurations in this space. The networks are constructed by stacking 5 cells, with fixed residual blocks in between for reducing the spatial resolution. Each architecture is trained for 200 epochs with 3 different seeds on 3 image classification datasets. For more details, please refer to Dong & Yang (2020).
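As an illustration of the DARTS-style cell representation described above, the following sketch samples a random cell. The operation names are indicative only (the exact DARTS operation set differs slightly), and "zero" is excluded as in our setting; each intermediate node draws 2 distinct predecessor nodes and an operation per edge.

```python
import random

# Indicative operation set (hypothetical subset of the DARTS choices).
OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3", "skip_connect"]

def sample_cell(n_intermediate=4, rng=random):
    """Sample a random DARTS-style cell as a DAG: nodes 0 and 1 are the cell
    inputs; intermediate node i (i >= 2) takes two edges from distinct earlier
    nodes, each edge labelled with an operation. The output node (implicit)
    concatenates all intermediate nodes."""
    cell = []
    for node in range(2, 2 + n_intermediate):
        srcs = rng.sample(range(node), 2)  # two distinct predecessor nodes
        cell.append([(src, rng.choice(OPS)) for src in srcs])
    return cell
```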

B.2 DATASETS

Fashion-MNIST (Xiao et al., 2017). Fashion-MNIST consists of a training set of 60k 28×28 grayscale images and a test set of 10k images, with 10 classes in total. We split the 60k training images into 50k used to train the networks and 10k used only for validation.

CIFAR-10/100 (Krizhevsky et al., 2009). CIFAR-10 and CIFAR-100 both consist of 60k 32×32 colour images, with 10 and 100 classes, respectively. We use 10k of the 50k training images as the validation set and the original 10k test set for final evaluation.

Tiny ImageNet (Le & Yang, 2015). Tiny ImageNet has 200 classes; each class has 500 training, 50 validation and 50 test colour images at 64×64 resolution. Since the original test labels are not available, we split the 10k validation examples into 5k for testing and 5k for validation.

ImageNet-16-120 (Dong & Yang, 2020). This variant of downsampled ImageNet (Chrabaszcz et al., 2017) contains the images of the first 120 classes of ImageNet, downsampled to 16×16 resolution.

Our choice of validation and test corruptions follows the recommendation of Hendrycks & Dietterich (2019). Also, as mentioned in Section 5, each of these corruptions has 5 severity levels, which yields 5 corresponding severity levels for D_shift_val and D_shift_test.

B.3 TRAINING ROUTINE

The macro-architecture we use has 16 initial channels and 8 cells (6 normal and 2 reduction). Networks were trained with a batch size of 100 for 100 epochs on CIFAR-10 and CIFAR-100 and for 15 epochs on Fashion-MNIST; for Tiny ImageNet, we used a batch size of 128 for 100 epochs. Unlike DARTS, we do not use any data augmentation during training, nor any additional regularization such as ScheduledDropPath (Zoph et al., 2018) or auxiliary heads, except for Tiny ImageNet, for which we used ScheduledDropPath and the standard data augmentation that is the default in DARTS. All other hyperparameter settings are exactly as in DARTS (Liu et al., 2019). All results shown are averaged over multiple runs, with error bars indicating a 95% confidence interval. We used a budget K = 400 in all experiments, except Tiny ImageNet on the DARTS search space, which used K = 200, and ImageNet-16-120 on the NAS-Bench-201 search space, which used K = 1000.

B.4 IMPLEMENTATION DETAILS OF NES-RE

Parallelization. Running NES-RE on a single GPU requires evaluating hundreds of networks sequentially, which is slow. To circumvent this, we distribute the "while |P| < K" loop in Algorithm 2 over multiple GPUs, called worker nodes, using the parallelism scheme provided by the hpbandster (Falkner et al., 2018) codebase. 3 In brief, the master node keeps track of the population and history (lines 1, 4-6, 8 in Algorithm 2) and distributes the training of networks to the individual worker nodes (lines 2, 7 in Algorithm 2). In our experiments, we always use 20 worker nodes and evolve a population of size P = 50 when working over the DARTS search space. Over NAS-Bench-201, we use a single worker, since it is a tabular NAS benchmark and hence quick to evaluate on. During iterations of evolution, we use an ensemble size of m = 10 to select parent candidates.
Mutations. We adapt the mutations used in RE to the DARTS search space. As in RE, we first pick a normal or reduction cell at random to mutate and then sample one of the following mutations:

C ADDITIONAL EXPERIMENTS

In this section we provide additional results for the experiments conducted in Section 5. Note that, as with all results shown in Section 5, all evaluations are made on test data unless stated otherwise.
As shown in Figure 7, we see a similar trend on Fashion-MNIST as on the other datasets: NES ensembles outperform deep ensembles, with NES-RE outperforming NES-RS. To understand why NES algorithms outperform deep ensembles on Fashion-MNIST (Xiao et al., 2017), we compare the average base learner loss (Figure 8) and oracle ensemble loss (Figure 9) of NES-RS, NES-RE and DeepEns (RS). Notice that, apart from the case of ensemble size M = 30, NES-RS and NES-RE find ensembles with both stronger and more diverse base learners (smaller losses in Figures 8 and 9, respectively). While it is expected that the oracle ensemble loss is smaller for NES-RS and NES-RE than for DeepEns (RS), it initially appears surprising that DeepEns (RS) has a larger average base learner loss, considering that the architecture for the deep ensemble is chosen to minimize the base learner loss. We found that this is because the loss depends sensitively not only on the architecture but also on the initialization of the base learner networks. Therefore, re-training the best architecture by validation loss to build the deep ensemble yields base learners with higher losses due to the use of different random initializations. NES algorithms are not affected by this, since they simply select the ensemble's base learners from the pool without re-training anything, which allows them to exploit good architectures as well as good initializations. Note that this was not the case for the CIFAR-10-C experiments; there, base learner losses did not depend as sensitively on the initialization as on the architecture.
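For reference, the two diagnostics compared above can be sketched as follows, assuming the standard definitions from Section 3: the average NLL over base learners, and the NLL of an "oracle" that, for each input, uses the prediction of the base learner with the smallest loss on that input (smaller oracle loss generally indicates a more diverse ensemble):

```python
from math import log

def per_learner_nlls(probs, labels):
    """probs[m][n][c]: predicted class probabilities of base learner m on
    input n; labels[n] is the true class. Returns nlls[m][n]."""
    return [[-log(p_n[y]) for p_n, y in zip(p_m, labels)] for p_m in probs]

def avg_base_learner_nll(probs, labels):
    """Mean NLL over all base learners and inputs."""
    nlls = per_learner_nlls(probs, labels)
    return sum(map(sum, nlls)) / (len(nlls) * len(labels))

def oracle_ensemble_nll(probs, labels):
    """For each input, take the smallest base learner NLL, then average."""
    nlls = per_learner_nlls(probs, labels)
    return sum(min(col) for col in zip(*nlls)) / len(labels)
```

By construction the oracle ensemble loss is never larger than the average base learner loss; the gap between the two is what the diversity discussion above refers to.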
In Table 2, we compare the classification error and expected calibration error (ECE) of NES algorithms with the deep ensembles baseline for various ensemble sizes on Fashion-MNIST. As with the loss, NES algorithms also achieve smaller errors, while ECE remains approximately the same for all methods.
To assess how well models respond to completely out-of-distribution (OOD) inputs (inputs which do not belong to any of the classes the model can predict), we investigate the entropy of the predicted probability distribution over the classes when the input is OOD. Higher entropy of the predicted probabilities indicates more uncertainty in the model's output. For CIFAR-10 on the DARTS search space, we compare the entropy of predictions made by NES ensembles and deep ensembles on two types of OOD inputs: images from the SVHN dataset and Gaussian noise. In Figure 10, we notice that NES ensembles indicate higher uncertainty than deep ensembles when given Gaussian noise inputs but behave similarly to deep ensembles on inputs from SVHN.
Results on CIFAR for larger models. In addition to the results on CIFAR-10 and CIFAR-100 using the settings described in Appendix B.3, we also train larger models (around 3M parameters) by scaling up the number of stacked cells and initial channels in the network. We run NES and the other baselines as before and plot NLL and classification test error in Figures 19 and 20 for budget K = 90. As shown, NES algorithms tend to outperform or be competitive with the baselines. Note, however, that more runs (with error bars) are needed for conclusive results in this case.
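The entropy diagnostic used here can be sketched as follows (a minimal sketch; `member_probs` holds each base learner's predicted class probabilities for a single input):

```python
from math import log

def predictive_entropy(member_probs):
    """Entropy of the ensemble's averaged predictive distribution for one
    input; member_probs[m][c] is base learner m's probability for class c.
    Higher entropy means more uncertainty, e.g. on OOD inputs."""
    n_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / len(member_probs)
            for c in range(n_classes)]
    return -sum(p * log(p) for p in mean if p > 0.0)
```

A maximally uncertain prediction (uniform over C classes) attains the upper bound log C, while a confident, peaked prediction has entropy near zero.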

C.2 ABLATION STUDY: NES-RE OPTIMIZING ONLY ON CLEAN DATA

We also include a variant of NES-RE, called NES-RE-0, in Figure 21. NES-RE and NES-RE-0 are the same, except that NES-RE-0 uses the validation set D val without any shift during iterations of evolution, as in line 4 of Algorithm 2. Following the discussion in Appendix B.4, recall that this is unlike NES-RE, where we sample the validation set to be either D val or D shift val at each iteration of evolution. Therefore, NES-RE-0 evolves the population without taking dataset shift into account, with D shift val only being used for the post-hoc ensemble selection step in line 9 of Algorithm 2. As shown in Figure 21, NES-RE-0 shows a minor improvement over NES-RE in terms of loss for ensemble size M = 30 in the absence of dataset shift. This is in line with expectations, because evolution in NES-RE-0 focuses on finding base learners which form strong ensembles for in-distribution data. On the other hand, when there is dataset shift, the performance of NES-RE-0 ensembles degrades, yielding higher loss and error than both NES-RS and NES-RE. Nonetheless, NES-RE-0 still manages to outperform the DeepEns baselines consistently. We draw two conclusions from these results: (1) NES-RE-0 can be a competitive option in the absence of dataset shift. (2) Sampling the validation set to be D val or D shift val in line 4 of Algorithm 2, as done in NES-RE, plays an important role in returning a final pool P of base learners from which ForwardSelect can select ensembles robust to dataset shift.
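The only difference between the two variants, the choice of validation set in line 4 of Algorithm 2, can be sketched as follows (a minimal illustration, not the authors' code):

```python
import random

def select_validation_set(d_val, d_shift_val, variant="NES-RE"):
    """Line 4 of Algorithm 2: NES-RE samples uniformly between the clean
    and shifted validation sets at each iteration of evolution; NES-RE-0
    always uses the clean one."""
    if variant == "NES-RE-0":
        return d_val
    return random.choice([d_val, d_shift_val])
```

In both variants, D shift val is still used for the post-hoc ensemble selection step (line 9 of Algorithm 2) when evaluating under shift.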

C.3 WHAT IF DEEP ENSEMBLES USE ENSEMBLE SELECTION OVER INITIALIZATIONS?

Recall that NES algorithms differ from deep ensembles in two important ways: the ensembles use varying architectures, and NES utilizes ensemble selection (i.e. ForwardSelect applied to P) to pick the base learners. In this section, we conduct a study to investigate the following question: is the improvement offered by NES over deep ensembles due only to ensemble selection? In other words, we wish to isolate the impact of varying architectures by comparing NES to a baseline that also incorporates ensemble selection into the construction of deep ensembles. Using the DARTS search space on Tiny ImageNet, we empirically compare NES to the baselines "DeepEns + ES", which operate as follows: optimize a fixed architecture for the base learners, train K random initializations of it to form a pool, and apply ForwardSelect to select an ensemble of size M from the pool. This yields the three additional baselines DeepEns + ES (DARTS/AmoebaNet/RS), which correspond to optimizing the fixed architecture using the DARTS algorithm (DARTS), regularized evolution (AmoebaNet) and random search (RS). The results indicate that NES outperforms or is on par with the DeepEns + ES baselines, as shown in Table 3 and Figure 22. In particular, both NES algorithms outperform DeepEns + ES (DARTS/AmoebaNet). DeepEns + ES (RS) is the most competitive of the deep ensemble baselines; it is improved upon by NES-RE and is competitive with NES-RS. Also, as expected, deep ensembles with ensemble selection consistently perform better than their counterparts without ensemble selection. Table 3 also includes the computational cost of each method measured in terms of the number of networks trained. For each DeepEns + ES baseline, we used a pool of size K = 200 (as with NES) from which the ensemble is selected. This cost comes in addition to the cost of optimizing the fixed base learner architecture prior to forming the pool.
For instance, the architecture for DeepEns + ES (RS) is optimized by random search, selecting the best architecture by validation loss from a random sample of K = 200 (trained) architectures; this yields a total cost of 400 networks trained. This is twice the cost of NES algorithms which required training 200 architectures to form the pool.
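For reference, forward step-wise selection without replacement (ForwardSelect, as described in Algorithm 1) can be sketched as follows, assuming base learners are scored by the validation NLL of their averaged predicted probabilities:

```python
from math import log

def forward_select(probs, labels, M):
    """Greedily build an ensemble of size M: at each step, add the base
    learner from the pool that most reduces the ensemble's NLL on the
    validation data. probs[m][n][c] are base learner m's probabilities."""
    def ensemble_nll(members):
        total = 0.0
        for n, y in enumerate(labels):
            p = sum(probs[m][n][y] for m in members) / len(members)
            total -= log(p)
        return total / len(labels)

    chosen, pool = [], set(range(len(probs)))
    while len(chosen) < M and pool:
        best = min(pool, key=lambda m: ensemble_nll(chosen + [m]))
        chosen.append(best)
        pool.remove(best)  # without replacement
    return chosen
```

Selection without replacement means each base learner appears at most once, so the returned ensemble has size at most M (see also the footnote on this point).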

C.4 COMPARING NES TO ENSEMBLES WITH OTHER VARYING HYPERPARAMETERS

Since varying the architecture in an ensemble improves predictive performance and uncertainty estimation, as demonstrated in Section 5, it is natural to ask which other hyperparameters should be varied in an ensemble, and which hyperparameters matter more than others. Note that concurrent work by Wenzel et al. (2020) has shown that varying hyperparameters such as L2 regularization strength, dropout rate and the label smoothing parameter also improves upon deep ensembles. While these questions lie outside the scope of our work and are left for future work, we conduct preliminary experiments to address them.
Table 3: A comparison of NES to deep ensembles with ensemble selection over initializations for Tiny ImageNet over the DARTS search space with ensemble size M = 10. The computational costs are reported in terms of the number of networks trained (a typical network from this search space takes 3 hours to train on an NVIDIA RTX 2080Ti). The "arch" column indicates the number of architectures evaluated to find the architecture and the "ensemble" column indicates the number of architectures evaluated for building the ensemble. Note that for DARTS and AmoebaNet we convert the GPU hours for finding the architecture into the number of networks trained by dividing by 3. See Appendix C.3 for details.

In this section, we consider two additional baselines working over the DARTS search space on CIFAR-10/100:
1. HyperEns: Optimize a fixed architecture, train K random initializations of it in which the learning rate and L2 regularization strength are also sampled randomly, and select the final ensemble of size M from the pool using ForwardSelect. This is similar to hyper ens from Wenzel et al. (2020).
2. NES-RS (depth, width): As described in Appendix B.1, NES navigates a complex (non-Euclidean) search space of architectures by varying the cell, which involves changing both the DAG structure of the cell and the operations at each edge of the DAG. Here we instead keep the cell fixed (the optimized DARTS cell) and vary only the width and depth of the overall architecture: the number of initial channels ∈ {12, 14, 16, 18, 20} (width) and the number of layers ∈ {5, 8, 11} (depth). We apply NES-RS over this substantially simpler search space as usual: train K randomly sampled architectures (i.e. sampling only depth and width) to form a pool and select the ensemble from it.
Figure 23 and Table 4 compare these two baselines to DeepEns (DARTS), NES-RS and NES-RE. 4 As shown in Figure 23, NES-RE tends to outperform the baselines, though it is on par with HyperEns on CIFAR-100 without dataset shift (Figure 23a). Under dataset shift (Figures 23b and 23c), both NES algorithms substantially outperform all baselines. Note that both HyperEns and NES-RS (depth, width) follow the same protocol as NES-RS and NES-RE: ensemble selection uses a shifted validation dataset when evaluating on a shifted test dataset. In terms of classification error, the observations are similar, as shown in Table 4. Lastly, we view the diversity of the ensembles from the perspective of oracle ensemble loss in Figure 24.
As in Section 5, results here also suggest that NES algorithms tend to find more diverse ensembles despite having higher average base learner loss.
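The NES-RS (depth, width) search space described above is small enough to enumerate directly; the following sketch (variable names are ours) makes the grid explicit:

```python
import itertools
import random

CHANNELS = [12, 14, 16, 18, 20]  # width: number of initial channels
LAYERS = [5, 8, 11]              # depth: number of layers

# The NES-RS (depth, width) baseline samples only from this grid,
# keeping the optimized DARTS cell itself fixed.
space = list(itertools.product(CHANNELS, LAYERS))
print(len(space))  # 15

def sample_depth_width():
    """One random (width, depth) configuration for the pool."""
    return random.choice(space)
```

With only 15 configurations, this space is vastly simpler than the full cell search space of ≈ 10^18 architectures, which is precisely the point of the comparison.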



Footnotes:
1. This approach returns an ensemble of size at most M.
2. We did not consider the DARTS algorithm on NAS-Bench-201, since it returns degenerate architectures with poor performance on this space (Dong & Yang, 2020). GDAS, on the other hand, yields state-of-the-art performance on this space (Dong & Yang, 2020).
3. https://github.com/automl/HpBandSter
4. Note that runs of DeepEns (DARTS), NES-RE and NES-RS differ slightly in this section relative to Section 5, as we tune the learning rate and L2 regularization strength for each dataset instead of using the defaults from Liu et al. (2019). This yields a fair comparison: HyperEns varies the learning rate and L2 regularization while using a fixed, optimized architecture (DARTS), whereas NES varies the architecture while using fixed, optimized learning rate and L2 regularization strength.



Figure 1: t-SNE visualization: five different architectures, each trained with 20 different initializations.

Figure 2: Illustration of one iteration of NES-RE. Network architectures are represented as colored bars of different lengths illustrating different layers and widths. Starting with the current population, ensemble selection is applied to select parent candidates, among which one is sampled as the parent. A mutated copy of the parent is added to the population, and the oldest member is removed.

Select base learners {f θ*1,α*1, . . . , f θ*M,α*M} = ForwardSelect(P, D val, M) by forward step-wise selection without replacement; return the ensemble Ensemble(f θ*1,α*1, . . . , f θ*M,α*M).

Figure 3: NLL vs. ensemble sizes on CIFAR-10, CIFAR-100 and Tiny ImageNet with and without respective dataset shifts (Hendrycks & Dietterich, 2019) over DARTS search space.

Figure 4: NLL vs. budget K on CIFAR-10/100 and Tiny ImageNet with and without respective dataset shifts over the DARTS search space. Ensemble size is fixed at M = 10.

Figure 5: ECE vs. dataset shift severity on CIFAR-10, CIFAR-100 and Tiny ImageNet over the DARTS search space. No dataset shift is indicated as severity 0. Ensemble size is fixed at M = 10.

Figure 7: Ensemble, average base learner and oracle ensemble NLL for ImageNet-16-120 on the NAS-Bench-201 search space. Ensemble size M = 3.

Figure 6: Average base learner loss and oracle ensemble loss (see Section 3 for definitions) on CIFAR-10/100 and Tiny ImageNet. Recall that small oracle ensemble loss generally corresponds to higher diversity. These findings are qualitatively consistent across datasets and also over shifted test data. See Appendix C.1.

Figure 7: Results on Fashion-MNIST with varying ensemble sizes M. Lines show the mean NLL achieved by the ensembles with 95% confidence intervals.

Figure 10: Entropy of predicted probabilities when trained on CIFAR-10 over the DARTS search space.


Figure 11: t-SNE visualization: predictions of base learners in two ensembles, one with fixed architecture and one with varying architectures.

Figure 12: NLL vs. ensemble sizes on CIFAR-10, CIFAR-100 and Tiny ImageNet with varying dataset shifts (Hendrycks & Dietterich, 2019) over DARTS search space.

Figure 13: Classification error rate (between 0-1) vs. ensemble size on DARTS search space.

Figure 14: Average base learner and oracle ensemble NLL across ensemble sizes and shift severities on CIFAR-10 over DARTS search space.

Figure 15: Average base learner and oracle ensemble NLL across ensemble sizes and shift severities on CIFAR-100 over DARTS search space.

Figure 16: Average base learner and oracle ensemble NLL across ensemble sizes and shift severities on Tiny ImageNet over DARTS search space.

Figure 18: Ensemble error vs. budget K. Ensemble size fixed at M = 10.

Figure 21: Results on CIFAR-10 (Hendrycks & Dietterich, 2019) with varying ensemble sizes M and shift severity. Lines show the mean NLL achieved by the ensembles with 95% confidence intervals. See Appendix C.1.3 for the definition of NES-RE-0.

Figure 22: Loss vs. ensemble size for NES and deep ensembles (with/without ensemble selection over initializations). The left plot shows that NES-RE outperforms all other methods across ensemble sizes. The right plot shows that ensembles produced by NES algorithms also consistently have higher diversity (as indicated by smaller oracle ensemble loss). See Appendix C.3 for details.

Figure 23: Plots show NLL vs. ensemble sizes comparing NES to the baselines introduced in Appendix C.4 on CIFAR-10 and CIFAR-100 with and without respective dataset shifts (Hendrycks & Dietterich, 2019).

Figure 24: Average base learner loss and oracle ensemble loss for NES and the baselines introduced in Appendix C.4 on CIFAR-10 and CIFAR-100. Recall that small oracle ensemble loss generally corresponds to higher diversity.



Table 2: Error and ECE of ensembles on Fashion-MNIST for different ensemble sizes M. Best values and all values within the 95% confidence interval are bold faced.

Ensemble NLL vs. budget K. Ensemble size fixed at M = 10.

Table 4: Classification errors comparing NES to the baselines introduced in Appendix C.4 for different shift severities and M = 10. Best values are bold faced.


• identity: no mutation is applied to the cell.
• op mutation: sample one edge in the cell and replace its operation with another operation sampled from the list of operations.
• hidden state mutation: sample one intermediate node in the cell, then sample one of its two incoming edges. Replace the input node of that edge with another sampled node, without altering the edge's operation.
See Real et al. (2019) for details and illustrations of these mutations. Note that for NAS-Bench-201, following Dong & Yang (2020), we only use op mutations.
Adaptation of NES-RE to dataset shifts. As described in Section 4.3, at each iteration of evolution, the validation set used in line 4 of Algorithm 2 is sampled uniformly between D val and D shift val when dealing with dataset shift. In this case, we use shift severity level 5 for D shift val. Once the evolution is complete and the pool P has been formed, then for each severity level s ∈ {0, 1, . . . , 5}, we apply ForwardSelect with D shift val of severity s to select an ensemble from P (line 9 in Algorithm 2), which is then evaluated on D shift test of severity s. (Here s = 0 corresponds to no shift.) This only applies to CIFAR-10, CIFAR-100 and Tiny ImageNet, as we do not consider dataset shift for Fashion-MNIST and ImageNet-16-120.
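These three mutations can be sketched as follows, assuming a simple cell encoding (a list with one entry per intermediate node, each holding two (input_node, operation) pairs); this encoding is our illustration, not the authors' implementation, and for simplicity the sketch does not forbid a node's two edges sharing a source after a hidden state mutation:

```python
import copy
import random

OPS = ["max_pool_3x3", "avg_pool_3x3", "skip_connect", "sep_conv_3x3",
       "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5"]

def mutate(cell):
    """Apply one sampled mutation (identity / op / hidden state) to a copy
    of the cell; intermediate node j may draw inputs from nodes 0..j+1
    (the 2 cell inputs plus earlier intermediate nodes)."""
    child = copy.deepcopy(cell)
    kind = random.choice(["identity", "op", "hidden_state"])
    if kind == "identity":
        return child  # no change
    node = random.randrange(len(child))  # pick an intermediate node
    edge = random.randrange(2)           # one of its two incoming edges
    src, op = child[node][edge]
    if kind == "op":
        # swap in a different operation, keeping the edge's input node
        child[node][edge] = (src, random.choice([o for o in OPS if o != op]))
    else:
        # hidden state: rewire the edge to a different predecessor node
        preds = [p for p in range(node + 2) if p != src]
        child[node][edge] = (random.choice(preds), op)
    return child
```

The parent cell is left untouched; a mutated copy is what gets added to the population in one NES-RE iteration (Figure 2).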

