TRAVERSING BETWEEN MODES IN FUNCTION SPACE FOR FAST ENSEMBLING

Abstract

Deep ensembles are a simple yet powerful way to improve the performance of deep neural networks. Recent works on mode connectivity have shown that parameters of ensembles are connected by low-loss subspaces, and that one can efficiently collect ensemble parameters in those subspaces. While this provides a way to efficiently train ensembles, at inference time one must still execute multiple forward passes using all the ensemble parameters, which often becomes a serious bottleneck for real-world deployment. In this work, we propose a novel framework to reduce such costs. Given a low-loss subspace connecting two modes of a neural network, we build an additional neural network that predicts the outputs of the original network evaluated at a certain point in the low-loss subspace. This additional neural network, which we call a "bridge", is a lightweight network that takes minimal features from the original network and predicts outputs for the low-loss subspace without forward passes through the original network. We empirically demonstrate that we can indeed train such bridge networks and use them to significantly reduce inference costs.

1. INTRODUCTION

Deep Ensemble (DE) (Lakshminarayanan et al., 2017) is a simple algorithm to improve both the predictive accuracy and the uncertainty calibration of deep neural networks, where a neural network is trained multiple times on the same data but with different random seeds. Due to this randomness, the parameters obtained from the multiple training runs reach different local optima, called modes, on the loss surface (Fort et al., 2019). These parameters represent a set of diverse functions serving as an effective approximation for Bayesian Model Averaging (BMA) (Wilson and Izmailov, 2020). An apparent drawback of DE is that it requires multiple training runs. This cost can be huge, especially in large-scale settings for which parallel training is not feasible. Garipov et al. (2018) and Draxler et al. (2018) showed that modes in the loss surface of a deep neural network are connected by relatively simple low-dimensional subspaces on which every parameter retains low training error, and that the parameters along those subspaces are good candidates for ensembling. Based on this observation, Garipov et al. (2018) and Huang et al. (2017) proposed algorithms to quickly construct deep ensembles without having to run multiple independent training runs. While fast ensembling methods based on mode connectivity reduce training costs, they do not address another important drawback of DE: the inference cost. One must still execute multiple forward passes using all the parameters collected for the ensemble, and this cost often becomes critical in real-world scenarios, where training is done in a resource-abundant setting with plenty of computation time, but inference must be done in a resource-limited environment for deployment. In such settings, reducing the inference cost is much more important than reducing the training cost. In this paper, we propose a novel approach to scale up DE by reducing inference cost.
We start from an assumption: if two modes in an ensemble are connected by a simple subspace, we can predict the outputs corresponding to the parameters on the subspace using only the outputs computed from the modes. In other words, we can predict the outputs evaluated at the subspace without having to forward the actual parameters on the subspace through the network. If this is indeed possible, then given two modes, for instance, we can approximate an ensemble of three models consisting of parameters collected from three different locations (one from the subspace connecting the two modes, and one from each mode) with only two forward passes and a small auxiliary forward pass. We show that we can actually implement this idea using an additional lightweight network whose inference cost is low relative to that of the original neural network. This additional network, which we call a "bridge network", takes some features from the original neural network (e.g., features from the penultimate layer) and directly predicts the outputs computed from the connecting subspace. In other words, the bridge network lets us travel between modes in function space. We present two types of bridge networks depending on the number of modes involved in prediction, network architectures for bridge networks, and training procedures. Through empirical validation on various image classification benchmarks, we show that 1) bridge networks can predict outputs of connecting subspaces quite accurately with minimal computation cost, and 2) DEs augmented with bridge networks can significantly reduce inference costs without a significant sacrifice in performance.

2. PRELIMINARIES

2.1. PROBLEM SETUP

In this paper, we discuss the K-way classification problem taking D-dimensional inputs. A classifier is constructed with a deep neural network f_θ : ℝ^D → ℝ^K, which is decomposed into a feature extractor f^(ft)_ϕ : ℝ^D → ℝ^{D_ft} and a classifier f^(cls)_ψ : ℝ^{D_ft} → ℝ^K, i.e., f_θ(x) = f^(cls)_ψ ∘ f^(ft)_ϕ(x). Here, ϕ ∈ Φ and ψ ∈ Ψ denote the parameters of the feature extractor and the classifier, respectively, θ = (ϕ, ψ) ∈ Θ, and D_ft is the dimension of the feature. An output of the classifier is a class probability vector.

2.2. FINDING LOW-LOSS SUBSPACES

While there are a few low-loss subspaces known to connect modes of deep neural networks, in this paper we focus on Bezier curves as suggested in Garipov et al. (2018). Let θ_i and θ_j be two parameters (usually corresponding to modes) of a neural network. The quadratic Bezier curve between them is defined as

{ (1-r)² θ_i + 2r(1-r) θ^(be)_{i,j} + r² θ_j | r ∈ [0, 1] },

where θ^(be)_{i,j} is a pin-point parameter characterizing the curve. Based on this curve parameterization, a low-loss subspace connecting (θ_i, θ_j) is found by minimizing the following loss w.r.t. θ^(be)_{i,j},

∫₀¹ L(θ^(be)_{i,j}(r)) dr,

where θ^(be)_{i,j}(r) denotes the point at position r on the curve, θ^(be)_{i,j}(r) = (1-r)² θ_i + 2r(1-r) θ^(be)_{i,j} + r² θ_j, and L : Θ → ℝ is the loss function evaluating parameters (e.g., cross-entropy). Since the integral above is usually intractable, we instead minimize the stochastic approximation

E_{r∼U(0,1)}[ L(θ^(be)_{i,j}(r)) ],

where U(0, 1) is the uniform distribution on [0, 1]. For a more detailed procedure for Bezier curve training, please refer to Garipov et al. (2018).
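To make the parameterization concrete, the following is a minimal sketch (illustrative names, not the paper's code) of evaluating a point on the quadratic Bezier curve, treating each parameter vector as a flat list of floats:

```python
def bezier_point(theta_i, theta_bend, theta_j, r):
    """theta(r) = (1-r)^2 * theta_i + 2r(1-r) * theta_bend + r^2 * theta_j."""
    w_i, w_b, w_j = (1 - r) ** 2, 2 * r * (1 - r), r ** 2
    return [w_i * a + w_b * b + w_j * c
            for a, b, c in zip(theta_i, theta_bend, theta_j)]

theta_i = [1.0, 0.0]   # one mode
theta_j = [0.0, 1.0]   # another mode
theta_b = [0.5, 2.0]   # trained pin-point parameter (hypothetical values)

assert bezier_point(theta_i, theta_b, theta_j, 0.0) == theta_i   # curve starts at theta_i
assert bezier_point(theta_i, theta_b, theta_j, 1.0) == theta_j   # curve ends at theta_j
```

The stochastic training objective then simply draws r ∼ U(0, 1) per step and applies the usual loss at bezier_point(θ_i, θ^(be)_{i,j}, θ_j, r).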

2.3. ENSEMBLES WITH BEZIER CURVES

Let {θ_1, ..., θ_m} be a set of parameters independently trained as a deep ensemble. Then, for each pair (θ_i, θ_j), we can construct a low-loss Bezier curve. Since all the parameters along those Bezier curves achieve low loss, we can add them to the ensemble for improved performance. For instance, choosing r = 0.5, we can collect θ^(be)_{i,j}(0.5) for all (i, j) pairs and construct an ensembled predictor as

(1 / (m + m(m-1)/2)) [ Σ_{i=1}^{m} f_{θ_i}(x) + Σ_{i<j} f_{θ^(be)_{i,j}(0.5)}(x) ].

While this strategy provides an effective way to increase the number of ensemble members, it requires an additional O(m²) forward passes at inference time. Our primary goal in this paper is to reduce this additional cost by bypassing the direct forward passes with θ^(be)_{i,j}(r).
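As a sketch of the ensembled predictor above (illustrative names, not the paper's code): given class-probability vectors from the m modes and from the midpoints of all m(m-1)/2 curves, the ensemble is a plain average over m + m(m-1)/2 members.

```python
def ensemble_average(mode_probs, bezier_probs):
    """Average class-probability vectors from modes and Bezier midpoints."""
    members = mode_probs + bezier_probs   # m + m(m-1)/2 members in total
    n, k = len(members), len(members[0])
    return [sum(p[c] for p in members) / n for c in range(k)]

# Two modes (m = 2) and the single curve connecting them -> 3 members:
probs = ensemble_average(
    mode_probs=[[0.7, 0.3], [0.5, 0.5]],
    bezier_probs=[[0.6, 0.4]],
)
assert abs(sum(probs) - 1.0) < 1e-9   # still a probability vector
```

Each entry of bezier_probs costs one full forward pass here, which is exactly the O(m²) overhead the bridge networks below are designed to remove.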

3. MAIN CONTRIBUTION

In this section, we present a novel method that directly predicts outputs of neural networks evaluated at parameters on Bezier curves without actual forward passes with them.

3.1. BRIDGE NETWORKS

Let us first recall our key assumption stated in the introduction: if two modes in an ensemble are connected by a simple low-loss subspace (a Bezier curve), then we can predict the outputs corresponding to the parameters on the subspace using only the information obtained from the modes. The intuition behind this assumption is that, since the parameters are connected by a simple curve, the corresponding outputs may also be connected via a relatively simple mapping which is far less complex than the original neural network. If such a mapping exists, we may learn it via a lightweight neural network.

More specifically, let z_i := f^(ft)_{ϕ_i}(x) and v_i := f_{θ_i}(x) = f^(cls)_{ψ_i}(z_i) for i ∈ {1, ..., m}, and let v_{i,j}(r) := f_{θ^(be)_{i,j}(r)}(x). In order to use v_{i,j}(r) together with v_i in an ensemble, we would have to forward x through f_{θ^(be)_{i,j}(r)}, starting from the bottom layer. Instead, we reuse z_i to predict v_{i,j}(r) with a lightweight neural network. We call such a lightweight neural network a "bridge network", since it lets us move directly from v_i to v_{i,j}(r) in function space rather than through the actual parameter space. A bridge network is usually constructed with a Convolutional Neural Network (CNN) whose inference cost is much lower than that of f_{θ_i}. In the following, we introduce two types of bridge networks depending on the number of modes involved in the computation.

Type I bridge networks A type I bridge network h^(r)_{i,j} takes a feature z_i from only one mode and predicts v_{i,j}(r) as v_{i,j}(r) ≈ ṽ_{i,j}(r) = h^(r)_{i,j}(z_i). A type I bridge network can be constructed between any pair of connected modes (θ_i, θ_j), and an ensembled prediction for a specific mode θ_i with its Bezier parameter θ^(be)_{i,j} can be approximated as

(1/2) [ v_i + h^(r)_{i,j}(z_i) ],

whose inference cost is nearly identical to that of v_i alone (nearly a single forward pass). One can also connect θ_i with multiple modes {θ_{j_1}, ..., θ_{j_k}}, learn bridge networks between (i, j_1), ..., (i, j_k), and construct an ensemble

(1 / (1 + k)) [ v_i + Σ_{l=1}^{k} h^(r)_{i,j_l}(z_i) ].

Still, since the costs of the h^(r)_{i,j_l} are far lower than that of v_i, the inference cost does not significantly increase.

Type II bridge networks A type II bridge network between (θ_i, θ_j) takes two features (z_i, z_j) to predict v_{i,j}(r): v_{i,j}(r) ≈ ṽ_{i,j}(r) = H^(r)_{i,j}(z_i, z_j). An ensembled prediction with the type II bridge network is then constructed as

(1/3) [ v_i + v_j + H^(r)_{i,j}(z_i, z_j) ],

where we construct an ensemble of three models with effectively two forward passes (for v_i and v_j). Similar to the type I bridge networks, we may construct multiple bridges and use them together for an ensemble. Fig. 1 presents a schematic diagram comparing forward passes of ensembles with and without a type II bridge network.
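The two ensembling rules above can be sketched as follows, with the bridge outputs stood in for by precomputed probability vectors; all names are illustrative stand-ins, not real models.

```python
def type1_ensemble(v_i, bridge_outs):
    """(1 / (1 + k)) * [v_i + sum of k type I bridge predictions]."""
    members = [v_i] + bridge_outs
    n = len(members)
    return [sum(m[c] for m in members) / n for c in range(len(v_i))]

def type2_ensemble(v_i, v_j, h_out):
    """(1/3) * [v_i + v_j + H(z_i, z_j)] -- three members, two real forward passes."""
    return [(a + b + c) / 3 for a, b, c in zip(v_i, v_j, h_out)]

# Type I: one heavy forward pass (v_i) plus one cheap bridge prediction.
p1 = type1_ensemble([1.0, 0.0], [[0.5, 0.5]])
# Type II: two heavy forward passes approximate a three-member ensemble.
p2 = type2_ensemble([0.9, 0.1], [0.6, 0.4], [0.6, 0.4])
assert abs(sum(p1) - 1.0) < 1e-9 and abs(sum(p2) - 1.0) < 1e-9
```

The design choice is a cost trade-off: a type I bridge adds members at almost no cost on top of a single model, while a type II bridge turns every pair of modes already in the ensemble into a free third member.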

3.2. LEARNING BRIDGE NETWORKS

Fixing a position r on Bezier curves In the definitions of the bridge networks above, we fixed the value r. In principle, we may parameterize the bridge networks to take r as an additional input and predict v_{i,j}(r) for any r ∈ [0, 1], but we found this to be ineffective due to the difficulty of learning the outputs corresponding to arbitrary r values. Moreover, as we empirically observed in Fig. 2, ensembling with Bezier parameters is most effective at r = 0.5, and adding additional parameters evaluated at other r values does not significantly improve performance. Therefore, we fix r = 0.5 and aim to learn bridge networks predicting v_{i,j}(0.5) throughout the paper.

Training procedure Let {θ_1, ..., θ_m} be a set of parameters in an ensemble. Given a set of Bezier parameters {θ^(be)_{i,j}} connecting them, we learn a bridge network (either type I or II) for each Bezier curve. The training procedure is straightforward. We first minimize the Kullback-Leibler divergence between the actual output from the Bezier parameters and the prediction made by the bridge network. This makes the bridge network imitate the original function defined by the Bezier parameters, in the same manner as conventional knowledge distillation (Hinton et al., 2015). In addition, we also maximize the Kullback-Leibler divergence between the base prediction and the bridge prediction to regularize the bridge to predict differently from the base model. Such regularization is quite important when the training error of the base model is near zero; in that case, the base network and the target network (the one on the Bezier curve) produce almost identical outputs. Further, we apply the mixup (Zhang et al., 2018) method to explore more diverse responses, preventing the bridge from learning to simply copy the outputs of the base model. Refer to Algorithm 1 for the detailed training procedure.
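A minimal sketch of the per-example objective used in Algorithm 1, assuming probability vectors as plain lists; kl and bridge_loss are illustrative names, and in practice the bridge parameters ω are updated by backpropagating this loss.

```python
import math

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bridge_loss(v_target, v_base, v_bridge, lam=0.1):
    # Imitate the model on the Bezier curve (first term), while being
    # pushed away from the base model's prediction (the -lambda term).
    return kl(v_target, v_bridge) - lam * kl(v_base, v_bridge)

# A bridge that exactly matches the curve's output minimizes the first term:
assert kl([0.2, 0.8], [0.2, 0.8]) < 1e-9
```

When the base model interpolates the training data, v_base ≈ v_target on clean training inputs, which is why the regularizer and the mixup perturbation described above are needed to keep the learning signal non-trivial.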

5. EXPERIMENTS

In this section, we answer the following three questions: • Do bridge networks really learn to predict the outputs of a function from the Bezier curves? • How much ensemble gain do we obtain via bridge networks at lower computational cost? • How many bridge networks do we need in order to achieve a given ensemble performance? We answer them sequentially in Sections 5.2 to 5.4 with empirical validation. By changing the channel size of the convolutional layers in the bridge network, we can balance the trade-off between performance gains and computational costs; we examine this trade-off in Table 2. We refer to a bridge network with less than 15% of the floating-point operations (FLOPs) of the base model as Bridge sm (small version), and a bridge with more than 15% as Bridge md (medium version).


To confirm that the bridge network H^(0.5)_{1,2} predicts v_{1,2}(0.5) well compared to the other baselines, i.e., that there indeed exists a correspondence between the bridge network and the Bezier curve, we measure the R² score, which quantifies how similar the outputs of the following baselines are to those of the target function f_{θ^(be)_{1,2}(0.5)}: (1) 'Type I/II Bridge' denotes the bridge network imitating the function of θ^(be)_{1,2}(0.5); (2) 'Other Type I/II Bridge' denotes the bridge network imitating the function of θ^(be)_{i,j}(0.5) for some (i, j) ≠ (1, 2); and (3) 'Other Bezier' denotes the base model with the parameters θ^(be)_{i,j}(0.5) for some (i, j) ≠ (1, 2). Table 1 summarizes the results. Compared to the baselines (i.e., 'Other Type I/II Bridge' and 'Other Bezier'), the bridge networks produce outputs much closer to the target outputs. The R² values between the predictions and the targets are significantly higher than those from the wrong targets, demonstrating that the bridge predictions indeed approximate the target outputs of interest. Fig. 3 visualizes the predicted logits from θ_1, θ_2, θ^(be)_{1,2}(0.5), and the bridge network H^(0.5)_{1,2} for test examples of CIFAR-10.

Relation between model size and regression result

We measure the relation between the size of the bridge networks and the goodness of fit of their predictions, as measured by R² scores. Table 2 shows that we can achieve decent R² scores with a small number of parameters, and that the prediction improves as we increase the flexibility of the bridge network. The results also show that a higher R² score leads to better ensemble results.

Using multiple type I bridge networks As a type I bridge network requires features from only one mode of each curve for inference, we can use multiple type I bridge networks for a single base model without significantly increasing inference cost, as mentioned in Section 3.1. Table 3 reports the performance gain of a single base model with an increasing number of type I bridges. Each bridge approximates the models on a different Bezier curve between a single mode and the others (i.e., Bezier curves between modes A-B, A-C, and so on, where A, B, and C are different modes), not the models on a single Bezier curve. Adding more bridge networks introduces more diverse outputs to the ensemble. One can see that the performance continuously improves as the number of bridges increases, with low additional inference cost. Fig. 4 shows how efficiently type I bridge networks increase performance relative to FLOPs.

5.3.2. TYPE II BRIDGE NETWORKS

Performance Classification results are summarized in Table 4; the ensembles with type II bridge networks consistently improve predictive accuracy and uncertainty calibration with a negligible increase in inference cost. Fig. 5 shows how efficiently our type II bridge networks achieve high performance in terms of relative FLOPs.

Computational cost

We report FLOPs for inference in Table 4 to indicate the relative computational costs of the competing models. Fig. 5 summarizes the trade-off between FLOPs and performance under various metrics. As one can see from these results, our bridge networks achieve remarkable gains in performance; in some cases, adding bridge ensembles achieved performance gains larger than those achieved by adding entire ensemble members. For instance, in the Tiny ImageNet experiments, DE-4 + 2 bridges was better than DE-5 (DEE ≈ ×7.239). Please refer to Appendix B for the full results, including various DE sizes and other datasets.

5.4. HOW MANY TYPE II BRIDGES ARE REQUIRED?

For an ensemble of m parameters, the number of pairs that can be connected by Bezier curves is m(m-1)/2, which grows quadratically with m. In the previous experiment, we constructed Bezier curves and bridges for all possible pairs (which explains the large inference costs for Bezier ensembles), but in practice we found that it is not necessary to use bridge networks for all of those pairs. As an example, we compare the performance of DE-4 + bridge ensembles with an increasing number of bridges on the Tiny ImageNet dataset. The results are summarized in Table 4. Just one bridge dramatically increases the performance, and the performance gain gradually saturates as we add more bridges. Notably, only one bridge suffices to outperform DE-5 (DEE ≈ ×5.962).

6. CONCLUSION

In this paper, we proposed a novel framework for efficient ensembling that reduces the inference costs of ensembles with lightweight networks called bridge networks. Bridge networks predict the neural network outputs corresponding to the parameters obtained from the Bezier curves connecting two ensemble parameters, without actual forward passes through the network. Instead, they reuse features and outputs computed from the ensemble members and predict the outputs corresponding to the Bezier parameters directly in function space. Using various image classification benchmarks, we demonstrated that such bridge networks can be trained as simple CNNs with minimal inference costs, and that bridge-augmented ensembles achieve significant gains in both accuracy and uncertainty calibration.

A.2 DATASETS AND MODELS

Dataset We use the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Tiny ImageNet (Li et al., 2017), and ImageNet (Russakovsky et al., 2015) datasets. We apply data augmentation consisting of random cropping of 32 pixels with padding of 4 pixels and random horizontal flipping. We subtract per-channel means from input images and divide them by per-channel standard deviations.

Network We use CNNs with residual paths similar to the ResNet block structure. To use the features of the base models, we embed one or more features from different layers of the base models. For the CIFAR-10 dataset, we use ResNet-32×2 as the base network, which consists of 15 blocks and 32 layers with a widening factor of 2, and we use 3-block CNNs as type I and type II bridge networks. The bridge networks use the features z of the third-to-last block. For the CIFAR-100 dataset, we use ResNet-32×4 as the base network, which is almost the same as ResNet-32×2 but with a widening factor of 4, and we use 3-block CNNs as type I and type II bridge networks. The bridge networks use the features z of the third-to-last block. For the Tiny ImageNet dataset, we use ResNet-18 as the base network, which consists of 8 blocks and 18 layers, and we use 2-block CNNs as type I and type II bridge networks. The bridge networks use the features z of the third-to-last and second-to-last blocks. For the ImageNet dataset, we use ResNet-50 as the base network, which consists of 17 blocks and 50 layers, and we use 3-block CNNs as type I and type II bridge networks. The bridge networks use the features z of the third-to-last and second-to-last blocks.

Optimization We train the base ResNet networks for 200 epochs with learning rate 0.1. We use the SGD optimizer with momentum 0.9 and adjust the learning rate with a simple cosine scheduler. We use weight decay 0.001 for the CIFAR-10 dataset, 0.0005 for the CIFAR-100 and Tiny ImageNet datasets, and 0.0001 for the ImageNet dataset.
Regularization We introduced two additional hyperparameters for training bridge models: 1) the regularization scale λ and 2) the mixup coefficient α. Since the training error of the base network is near zero for the family of residual networks on CIFAR-10/100, given a training input without any modification, the base network and the target network (the one on the Bezier curve) produce almost identical outputs, so a bridge trained on them would simply copy the outputs of the base network. To prevent this, we perturb the inputs via mixup and regularize the bridge to produce outputs different from those of the base models. On the other hand, for datasets such as ImageNet, where the models fail to achieve near-zero training error, the base network and the target networks are already distinct enough, so we found that the bridge can be trained easily without such tricks (i.e., we used λ = 0.0 and α = 0.0). We search 0.0 ≤ λ ≤ 0.4 for the regularization scale λ. We use α = 0.4 for the CIFAR-10/100 and Tiny ImageNet datasets, and do not use mixup (α = 0.0) for the ImageNet dataset.
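The input perturbation can be sketched as follows, assuming inputs as flat lists; random.betavariate from the standard library provides the Beta(α, α) draw. Note that bridge training needs no labels, so unlike the original mixup recipe only the inputs are mixed here.

```python
import random

def mixup(x1, x2, alpha=0.4):
    """Return lam * x1 + (1 - lam) * x2 with lam ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha) if alpha > 0 else 1.0
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]

random.seed(0)
x = mixup([0.0, 0.0], [1.0, 1.0], alpha=0.4)
assert all(0.0 <= v <= 1.0 for v in x)   # mixed input lies between the originals
```

With α = 0.4 most draws of λ land near 0 or 1, so the perturbed input stays close to a real example while still moving the base and target outputs apart.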

A.3 EVALUATION

Efficiency metrics Dehghani et al. (2021) pointed out that there can be contradictions between commonly used metrics (e.g., FLOPs, the number of parameters, and speed) and suggested refraining from reporting results using just a single one. We therefore present both FLOPs and the number of parameters in the results.

Uncertainty metrics Let p(x) ∈ [0, 1]^K be the predicted probabilities for a given input x, where p^(k) denotes the kth element of the probability vector, i.e., p^(k) is the predicted confidence on the kth class. We use the following common metrics on a dataset D consisting of inputs x and labels y:
• Accuracy (ACC): ACC(D) = E_{(x,y)∈D}[ 1[y = argmax_k p^(k)(x)] ].
• Negative log-likelihood (NLL): NLL(D) = E_{(x,y)∈D}[ -log p^(y)(x) ].
• Brier score (BS): BS(D) = E_{(x,y)∈D}[ ‖p(x) - y‖²₂ ], where y denotes the one-hot encoded version of the label y, i.e., y^(y) = 1 and y^(k) = 0 for k ≠ y.
• Expected calibration error (ECE): ECE(D, N_bin) = Σ_{b=1}^{N_bin} n_b |δ_b| / (n_1 + ... + n_{N_bin}), where N_bin is the number of bins, n_b is the number of examples in the bth bin, and δ_b is the calibration error of the bth bin. Specifically, the bth bin consists of predictions having maximum confidence values in [(b-1)/N_bin, b/N_bin), and the calibration error is the difference between the accuracy and the average confidence within the bin. We fix N_bin = 15 in this paper.
We evaluate calibrated metrics, i.e., the metrics above computed with temperature scaling (Guo et al., 2017), as Ashukha et al. (2020) suggested. Specifically, (1) we first find the optimal temperature minimizing the NLL over the validation examples, and (2) compute the uncertainty metrics including NLL, BS, and ECE using temperature-scaled predicted probabilities under the optimal temperature. Moreover, we evaluate the following Deep Ensemble Equivalent (DEE) score, which measures the relative performance with respect to DE in terms of NLL:

DEE(D) = min{ m ≥ 1 | NLL_{DE-m}(D) ≤ NLL(D) },

where NLL_{DE-m}(D) denotes the NLL of DE-m on the dataset D.

We note that mimicking the original function defined by deep neural networks using relatively cheap networks is reminiscent of Knowledge Distillation (KD) (Hinton et al., 2015), and thus one can think of the proposed approach as a special instance of knowledge distillation. However, the proposed bridge network differs fundamentally from KD in that 1) it uses a very small network that cannot be properly trained with a typical distillation procedure, and 2) while KD builds a student mapping inputs to outputs, ours reuses outputs from the models related to the target function to be mimicked, and this actually plays a key role in the function matching. Here, we empirically validate this claim. Specifically, Table 13 compares bridge networks mimicking output probabilities from θ^(be)_{1,2} 1) when they take inputs x as in the typical knowledge distillation framework, and 2) when they take outputs from θ_1 and θ_2 as we propose. The former consistently underperforms the latter, even if we introduce some frontal convolutional layers for dealing with image inputs. This indicates that the typical knowledge distillation procedure suffers from the insufficient capacity of the bridge network, while our proposed method does not. Consequently, our proposed method, which reuses informative outputs from θ_1 and θ_2, is distinct from typical knowledge distillation when the capacity of the bridge network is limited.
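The expected calibration error used above can be sketched with equal-width confidence bins, assuming lists of maximum confidences and 0/1 correctness indicators; names are illustrative.

```python
def ece(confidences, correct, n_bin=15):
    """Expected calibration error with equal-width bins over [0, 1]."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bin):
        lo, hi = b / n_bin, (b + 1) / n_bin
        # last bin is closed on the right so that confidence 1.0 is counted
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bin - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)   # (n_b / n) * |delta_b|
    return total

# A perfectly calibrated toy case: 90% confidence, 90% accuracy.
assert abs(ece([0.9] * 10, [1] * 9 + [0])) < 1e-9
```

In the paper's setting the confidences fed to this metric are first rescaled by the optimal temperature found on the validation split.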



https://sites.research.google/trc/about/ https://github.com/pytorch/pytorch/blob/master/LICENSE



Figure 1: Comparing ensembles with a Bezier curve (left) and a type II bridge network (right).

Figure 3: Bar plots in the third column depict whether the bridge network (orange) outputs the same logit values as the base model with the Bezier parameters (blue) for given test inputs displayed in the first column. We also depict the predicted logits from the base model with θ 1 and θ 2 in the second and fourth columns, respectively. Additional results are available in Fig. 6.

Efficiency metrics We choose FLOPs and the number of parameters (#Params) for efficiency evaluation, as these metrics are commonly used to assess efficiency (Dehghani et al., 2021). Because the FLOPs and #Params of the base model differ for each dataset, we report the relative FLOPs and relative #Params with respect to the corresponding base model for better comparison.

Uncertainty metrics As suggested by Ashukha et al. (2020), along with the classification accuracy (ACC), we report the calibrated versions of Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), and Brier Score (BS) as metrics for uncertainty evaluation. We also measure the Deep Ensemble Equivalent (DEE) score proposed in Ashukha et al. (2020), which shows the relative performance with respect to DE in terms of NLL and can roughly be interpreted as the effective number of models in an ensemble. See Appendix A.3 for more details.

Indeed, as Fig. 3 shows for two test examples of CIFAR-10, the bridge network H^(0.5)_{1,2} predicts the logits from the Bezier parameter θ^(be)_{1,2}(0.5) well. Appendix B.1 provides additional examples which further verify this.

Figure 4: The cost-performance plots of type I bridge(s) compared to DE on Tiny ImageNet. The x-axis denotes the relative FLOPs quantifying the inference cost of the model compared to a single base model, and the y-axis shows the corresponding predictive performance. On the basis of DE (black dashed line), the upper left position is preferable in ACC, and the lower left position is preferable in NLL, ECE, and BS.

Figure 5: The cost-performance plots of type II bridge(s) compared to DE on Tiny ImageNet. Others are identical to Fig. 4 except that we extend the DE basis from DE-2 to DE-7 (black dashed lines).

Reproducibility statement Please refer to Appendix A for full experimental details including datasets, models, evaluation metrics, and computing resources.

A EXPERIMENTAL DETAILS

A.1 FILTER RESPONSE NORMALIZATION

Throughout the experiments using convolutional neural networks, we use Filter Response Normalization (FRN; Singh and Krishnan, 2020) instead of Batch Normalization (BN; Ioffe and Szegedy, 2015) to avoid recomputation of BN statistics along the subspaces. Besides, FRN is fully made up of learned parameters and does not utilize dependencies between training examples; thus, it gives us a clearer interpretation of the parameter space (Wenzel et al., 2020; Izmailov et al., 2021).
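For reference, a sketch of FRN followed by the thresholded linear unit (TLU) on one channel's flattened activations, following Singh and Krishnan (2020); the per-channel learned scalars γ, β, τ and the stdlib-only code are illustrative.

```python
import math

def frn_tlu(x, gamma=1.0, beta=0.0, tau=0.0, eps=1e-6):
    """Filter Response Normalization + TLU for one channel of one example."""
    nu2 = sum(v * v for v in x) / len(x)               # mean of squares; no mean subtraction
    y = [gamma * v / math.sqrt(nu2 + eps) + beta for v in x]
    return [max(v, tau) for v in y]                    # thresholded linear unit

out = frn_tlu([2.0, -2.0, 2.0, -2.0])
assert abs(out[0] - 1.0) < 1e-3 and out[1] == 0.0      # negatives clipped at tau = 0
```

Because ν² is computed from the current example alone, parameters sampled along a Bezier curve need no running-statistics recomputation, unlike BN.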

where NLL_{DE-m}(D) denotes the NLL of DE-m on the dataset D. Here, we linearly interpolate the NLL_{DE-m}(D) values for m ∈ ℝ to make the DEE score continuous.

A.4 COMPUTING RESOURCES

We conduct the Tiny ImageNet experiments on 8 TPUv2 and 8 TPUv3 cores, supported by the TPU Research Cloud¹, and the others on 8 RTX3090 GPUs. We attach the code to the supplementary material. We use PyTorch (Paszke et al., 2019) with a BSD-style license; visit the PyTorch GitHub repository² for more details.
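The continuous DEE score under this linear interpolation can be sketched as follows, assuming a list de_nlls where de_nlls[i] is the NLL of DE-(i+1) and the values decrease with i; names are illustrative.

```python
def dee_score(nll, de_nlls):
    """Smallest real m with interpolated NLL_{DE-m} <= nll, clamped to [1, len(de_nlls)]."""
    if nll >= de_nlls[0]:
        return 1.0
    for i in range(len(de_nlls) - 1):
        hi, lo = de_nlls[i], de_nlls[i + 1]
        if lo <= nll <= hi:
            # linearly interpolate between m = i+1 and m = i+2
            return (i + 1) + (hi - nll) / (hi - lo)
    return float(len(de_nlls))

# With DE NLLs 1.0, 0.8, 0.7, a model at NLL 0.9 is "worth" 1.5 ensemble members:
assert abs(dee_score(0.9, [1.0, 0.8, 0.7]) - 1.5) < 1e-9
```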

Figure 6: Bar plots in the third column depict whether the bridge network (orange) outputs the same logit values as the base model with the Bezier parameters θ^(be)_{1,2}(0.5) (blue).

Algorithm 1 Training bridge networks
Require: Training dataset D, a pair of parameters (θ_1, θ_2) and the corresponding Bezier parameter θ^(be)_{1,2}, bridge parameters ω, regularization scale λ, mixup coefficient α, learning rate η.
while not converged do
  Sample a mini-batch B from D.
  for x_i ∈ B do
    x_i ← mixup(x_i, α)
    z_1 ← f^(ft)_{ϕ_1}(x_i), v_1 ← f^(cls)_{ψ_1}(z_1)
    v_{1,2}(0.5) ← f_{θ^(be)_{1,2}(0.5)}(x_i)
    if type I then
      ṽ_{1,2}(0.5) ← h^(0.5)_{1,2}(z_1; ω)
    else
      z_2 ← f^(ft)_{ϕ_2}(x_i), v_2 ← f^(cls)_{ψ_2}(z_2)
      ṽ_{1,2}(0.5) ← H^(0.5)_{1,2}(z_1, z_2; ω)
    end if
    ℓ_i ← D_KL(v_{1,2}(0.5) || ṽ_{1,2}(0.5)) − λ D_KL(v_1 || ṽ_{1,2}(0.5))
  end for
  ω ← ω − η ∇_ω (1/|B|) Σ_i ℓ_i
end while
return ω

Figure 2: Top row shows ensemble performance when one member from the Bezier curve at position r is added to DE-2. Bottom row shows ensemble performance when members are sequentially added to DE-2 from the Bezier curve. For accuracy, higher is better; for NLL, ECE, and BS, lower is better.

Table 1: R² scores quantifying how similar the outputs of the following models are to the target function defined with the Bezier parameter θ^(be)_{1,2}(0.5). Refer to the main text in Section 5.2 for a detailed description of each model. All values are measured on the test split of each dataset.

Table 2: FLOPs, #Params, R² scores, and ensemble performance metrics for various type II bridge network sizes on CIFAR-100. We use ResNet-32×4 as the base model and 3-block CNNs with residual connections as bridge networks. The number after CNN indicates the number of channels. R² scores are measured with respect to the target Bezier parameter at r = 0.5.

Table 3: Performance improvement of the ensemble by adding type I bridges to the single base ResNet model on the Tiny ImageNet dataset. FLOPs, #Params, and DEE metrics are measured with respect to the single base model. Bridge sm and Bridge md denote the small and medium versions of the bridge network based on their FLOPs.

Table 4 summarizes the classification results comparing DE, DE with Bezier curves, and DE with type II bridge networks; for more experimental results, including other datasets, please refer to Appendix B. From Table 4, one can see that with only a slight increase in computational cost, the ensembles with bridge networks achieve almost a DEE 1.962 ensemble gain in the DE-4 case, and this gain is not specific to a particular DE size.

Table 4: Performance improvement of the ensemble by adding type II bridges as members to existing DE ensembles on the Tiny ImageNet dataset. FLOPs, #Params, and DEE metrics are measured with respect to the corresponding DEs. Type II bridges consistently improve the accuracy and uncertainty metrics of the ensemble before saturation. Bridge sm and Bridge md denote the small and medium versions of the bridge network based on their FLOPs.

Full result of performance improvement of the ensemble by adding type I bridges on the CIFAR-10 dataset. We use the same settings as described in Table 3.

Table 6: Full results of the performance improvement of the ensemble by adding type II bridges on the CIFAR-10 dataset. We use the same settings as described in Table 4.

Table 7: Full results of the performance improvement of the ensemble by adding type I bridges on the CIFAR-100 dataset. We use the same settings as described in Table 3.

Table 8: Full results of the performance improvement of the ensemble by adding type II bridges on the CIFAR-100 dataset. We use the same settings as described in Table 4.

Table 9: Full results of the performance improvement of the ensemble by adding type I bridges on the Tiny ImageNet dataset. We use the same settings as described in Table 3.

Model         | FLOPs   | #Params | ACC (%)      | NLL           | ECE           | BS            | DEE
+ 3 Bridge sm | × 1.264 | × 1.279 | 65.61 ± 0.10 | 1.388 ± 0.003 | 0.014 ± 0.002 | 0.455 ± 0.000 | 3.022 ± 0.079
+ 4 Bridge sm | × 1.352 | × 1.372 | 65.68 ± 0.06 | 1.380 ± 0.002 | 0.012 ± 0.002 | 0.454 ± 0.000 | 3.233 ± 0.084
+ 1 Bridge md | × 1.277 | × 1.290 | 65.94 ± 0.15 | 1.418 ± 0.003 | 0.018 ± 0.002 | 0.453 ± 0.001 | 2.562 ± 0.056
+ 2 Bridge md | × 1.554 | × 1.580 | 66.59 ± 0.09 | 1.372 ± 0.001 | 0.016 ± 0.002 | 0.445 ± 0.000 | 3.437 ± 0.036
+ 3 Bridge md | × 1.831 | × 1.870 | 66.79 ± 0.11 | 1.353 ± 0.001 | 0.015 ± 0.001 | 0.443 ± 0.000 | 3.967 ± 0.043
+ 4 Bridge md | × 2.108 | × 2.160 | 66.88 ± 0.15 | 1.342 ± 0.001 | 0.018 ± 0.001 | 0.441 ± 0.000 | 4.450 ± 0.062

Table 10: Full results of the performance improvement of the ensemble by adding type II bridges on the Tiny ImageNet dataset. We use the same settings as described in Table 4.

Model         | FLOPs    | #Params  | ACC (%)      | NLL           | ECE           | BS            | DEE
DE-5          | × 5.000  | × 5.000  | 68.54 ± 0.08 | 1.329 ± 0.001 | 0.018 ± 0.001 | 0.422 ± 0.001 | 5.000
… Bridge md   | × 6.112  | × 6.202  | 69.15 ± 0.19 | 1.257 ± 0.000 | 0.015 ± 0.000 | 0.416 ± 0.000 | 10.198 ± 0.493
+ 6 Bezier    | × 10.000 | × 10.000 | 69.26 ± 0.07 | 1.271 ± 0.002 | 0.016 ± 0.001 | 0.414 ± 0.000 | 9.118 ± 0.383

Table 11: Full results of the performance improvement of the ensemble by adding type I bridges on the ImageNet dataset. We use the same settings as described in Table 3.

Model         | FLOPs   | #Params | ACC (%)      | NLL           | ECE           | BS            | DEE
Base          | × 1.000 | × 1.000 | .85 ± 0.06   | 1.618 ± 0.005 | 0.037 ± 0.002 | 0.485 ± 0.003 | 1.000
+ 1 Bridge md | × 1.194 | × 1.222 | 76.57 ± 0.02 | 1.418 ± 0.003 | 0.018 ± 0.002 | 0.453 ± 0.001 | 2.562 ± 0.056
+ 2 Bridge md | × 1.388 | × 1.444 | 76.74 ± 0.05 | 1.372 ± 0.001 | 0.016 ± 0.002 | 0.445 ± 0.000 | 3.437 ± 0.036
+ 3 Bridge md | × 1.582 | × 1.666 | 76.85 ± 0.05 | 1.353 ± 0.001 | 0.015 ± 0.001 | 0.443 ± 0.000 | 3.967 ± 0.043
+ 4 Bridge md | × 1.776 | × 1.888 | 76.96 ± 0.03 | 1.342 ± 0.001 | 0.018 ± 0.001 | 0.441 ± 0.000 | 4.450 ± 0.062
DE-2          | × 2.000 | × 2.000 | 77.20 ± 0.07 | 1.456 ± 0.004 | 0.022 ± 0.002 | 0.450 ± 0.002 | 2.000

Table 12: Full results of the performance improvement of the ensemble by adding type II bridges on the ImageNet dataset. We use the same settings as described in Table 4.

Model         | FLOPs   | #Params | ACC (%)      | NLL           | ECE           | BS            | DEE
DE-2          | × 2.000 | × 2.000 | 77.20 ± 0.07 | 0.880 ± 0.002 | 0.013 ± 0.001 | 0.317 ± 0.001 | 2.000
+ 1 Bridge md | × 2.243 | × 2.256 | 77.43 ± 0.05 | 0.870 ± 0.001 | 0.012 ± 0.000 | 0.314 ± 0.000 | 2.564 ± 0.046
+ 1 Bezier    | × 3.000 | × 3.000 | 77.65 ± 0.08 | 0.861 ± 0.001 | 0.011 ± 0.001 | 0.311 ± 0.001 | 3.059 ± 0.082
DE-3          | × 3.000 | × 3.000 | 77.64 ± 0.04 | 0.862 ± 0.001 | 0.013 ± 0.001 | 0.311 ± 0.000 | 3.000
+ 1 Bridge md | × 3.243 | × 3.256 | 77.76 ± 0.07 | 0.856 ± 0.001 | 0.012 ± 0.001 | 0.310 ± 0.000 | 3.559 ± 0.038
+ 2 Bridge md | × 3.486 | × 3.512 | 77.82 ± 0.07 | 0.853 ± 0.000 | 0.012 ± 0.001 | 0.309 ± 0.000 | 3.850 ± 0.069
+ 3 Bridge md | × 3.729 | × 3.768 | 77.92 ± 0.06 | 0.851 ± 0.001 | 0.012 ± 0.001 | 0.308 ± 0.000 | 4.010 ± 0.063
+ 3 Bezier    | × 6.000 | × 6.000 | 78.30 ± 0.05 | 0.834 ± 0.001 | 0.012 ± 0.000 | 0.303 ± 0.001 | 7.821 ± 0.391
DE-4          | × 4.000 | × 4.000 | 77.87 ± 0.04 | 0.851 ± 0.001 | 0.012 ± 0.001 | 0.308 ± 0.000 | 4.000

FLOPs, R² scores, and model performance metrics for the type I bridge network and CNN(ft)×n + CNN models of various sizes on CIFAR-100. Here, ×n denotes the number of convolution layers used as a frontal feature extractor CNN(ft), and CNN denotes the same architecture used in the type I bridge network. R² scores are measured with respect to the target Bezier r = 0.5.

Model           | FLOPs   | R²            | ACC (%)      | NLL           | ECE           | BS
…               | …       | ± 0.003       | 72.18 ± 0.19 | 1.016 ± 0.004 | 0.031 ± 0.002 | 0.379 ± 0.001
CNN (ft) ×2 + CNN | × 0.230 | 0.626 ± 0.007 | 63.45 ± 0.71 | 1.318 ± 0.019 | 0.015 ± 0.002 | 0.481 ± 0.007
CNN (ft) ×4 + CNN | × 0.373 | 0.682 ± 0.005 | 67.39 ± 0.42 | 1.166 ± 0.015 | 0.013 ± 0.001 | 0.436 ± 0.004
CNN (ft) ×6 + CNN | × 0.515 | 0.685 ± 0.006 | 67.43 ± 0.55 | 1.156 ± 0.019 | 0.014 ± 0.001 | 0.433 ± 0.006
CNN (ft) ×8 + CNN | × 0.648 | 0.701 ± 0.001 | 68.36 ± 0.35 | 1.121 ± 0.009 | 0.012 ± 0.002 | 0.422 ± 0.004

B.3 COMPARISON WITH THE TYPICAL KNOWLEDGE DISTILLATION

B ADDITIONAL EXPERIMENTS

B.1 ADDITIONAL EXAMPLES

We visually inspect the logit regression of the type II bridge network. Our bridge network predicts the logits at r = 0.5 on the Bezier curve very accurately when the two base models (r = 0 and r = 1) give similar output logits (deer, ship, and frog). When the base models are not confident on the samples (airplane, bird, cat, and horse), the network recovers the scale of the logits approximately. However, it fails on some very difficult samples (truck and dog), where even the base models are highly confused.
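The r = 0.5 regression target above is the network obtained by evaluating the quadratic Bezier curve connecting the two modes in parameter space, θ(r) = (1-r)²θ₀ + 2r(1-r)θ_b + r²θ₁, with θ_b a trained control point. A minimal sketch of that interpolation on a toy two-parameter "network" (the state-dict names are placeholders):

```python
import numpy as np

def bezier_point(theta0, theta_b, theta1, r):
    """Quadratic Bezier curve in parameter space: r = 0 and r = 1 are the
    two modes; theta_b is the control point bending the curve into the
    low-loss region. Each theta is a dict of parameter arrays."""
    return {k: (1 - r) ** 2 * theta0[k]
               + 2 * r * (1 - r) * theta_b[k]
               + r ** 2 * theta1[k]
            for k in theta0}

# Toy example: two 2-D "parameter vectors" and a control point.
w0 = {"w": np.array([0.0, 0.0])}
wb = {"w": np.array([1.0, 2.0])}
w1 = {"w": np.array([2.0, 0.0])}

mid = bezier_point(w0, wb, w1, 0.5)
print(mid["w"])  # -> [1. 1.]
```

The type II bridge is trained to predict the outputs of the network parameterized by θ(0.5) directly, so at inference time this extra parameter evaluation and forward pass can be skipped.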

B.2 FULL TYPE I AND TYPE II BRIDGE RESULTS

We report full experimental results for classification tasks: 1) type I bridge network results in Table 5, Table 7, Table 9, and Table 11; 2) type II bridge network results in Table 6, Table 8, Table 10, and Table 12.
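In these tables, a "+ k Bridge" entry is evaluated as an ensemble whose members are the base networks together with the bridge-predicted outputs, assuming standard averaging of the members' predictive distributions; a minimal sketch (function names are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """Average the softmax outputs of all members (base networks and
    bridge-predicted logits alike) into one predictive distribution."""
    probs = np.stack([softmax(z) for z in member_logits])
    return probs.mean(axis=0)

# Two toy "members" on a single 3-class example
p = ensemble_predict([np.array([[2.0, 0.0, 0.0]]),
                      np.array([[0.0, 2.0, 0.0]])])
print(p.shape)  # -> (1, 3); rows sum to 1
```

Because each bridge forward pass is far cheaper than a full base network, each extra member adds only the fractional FLOPs overhead reported in the FLOPs columns above.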

