PACKED-ENSEMBLES FOR EFFICIENT UNCERTAINTY ESTIMATION

Abstract

Deep Ensembles (DE) are a prominent approach for achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, the hardware limitations of real-world systems constrain them to smaller ensembles and lower-capacity networks, significantly deteriorating their performance and properties. We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single shared backbone and a single forward pass, improving training and inference speeds. PE is designed to operate within the memory limits of a standard neural network. Our extensive experiments indicate that PE accurately preserves the properties of DE, such as diversity, and performs equally well in terms of accuracy, calibration, out-of-distribution detection, and robustness to distribution shift. We make our code available at github.com/ENSTA-U2IS/torch-uncertainty.



Figure 1: Evaluation of computation cost vs. performance trade-offs for multiple uncertainty quantification techniques on CIFAR-100. The y-axis shows the accuracy, and the x-axis the inference speed in images per second. The circle area is proportional to the number of parameters. Optimal approaches are closer to the top-right corner. Packed-Ensembles strikes a good balance between predictive performance and speed.

1. INTRODUCTION

Real-world safety-critical machine learning decision systems such as autonomous driving (Levinson et al., 2011; McAllister et al., 2017) impose exceptionally high reliability and performance requirements across a broad range of metrics: accuracy, calibration, robustness to distribution shifts, uncertainty estimation, and computational efficiency under limited hardware resources. Despite significant improvements in performance in recent years, vanilla Deep Neural Networks (DNNs) still exhibit several shortcomings, notably overconfidence in both correct and wrong predictions (Nguyen et al., 2015; Guo et al., 2017; Hein et al., 2019). Deep Ensembles (Lakshminarayanan et al., 2017) have emerged as a prominent approach to address these challenges by leveraging predictions from multiple high-capacity neural networks. By averaging predictions or by voting, DE achieve high accuracy and robustness, since potentially unreliable predictions are exposed via the disagreement between individuals. Thanks to the simplicity and effectiveness of the ensembling strategy (Dietterich, 2000), DE have become widely used and dominate performance across various benchmarks (Ovadia et al., 2019; Gustafsson et al., 2020). DE meet most real-world application requirements except computational efficiency: they are demanding in terms of memory storage, number of operations, and inference time during both training and testing, as their costs grow linearly with the number of individuals. Their computational costs are, therefore, prohibitive under tight hardware constraints.
This limitation of DE has inspired numerous computationally efficient alternatives: multi-head networks (Lee et al., 2015; Chen & Shrivastava, 2020), ensemble-imitating layers (Wen et al., 2019; Havasi et al., 2020; Ramé et al., 2021), multiple forward passes on different weight subsets of the same network (Gal & Ghahramani, 2016; Durasov et al., 2021), ensembles of smaller networks (Kondratyuk et al., 2020; Lobacheva et al., 2020), ensembles computed from a single training run (Huang et al., 2017; Garipov et al., 2018), and efficient Bayesian Neural Networks (Maddox et al., 2019; Franchi et al., 2020). These approaches typically improve storage usage, training cost, or inference time at the cost of lower accuracy and diversity in the predictions. An essential property of ensembles for improving predictive uncertainty estimation is the diversity of their predictions. Perrone & Cooper (1992) show that the independence of individuals is critical to the success of ensembling. Fort et al. (2019) argue that the diversity of DE, due to randomness from weight initialization, data augmentation and batching, and stochastic gradient updates, is superior to that of the efficient ensembling alternatives, despite their predictive performance boosts. Few approaches manage to mirror this property of DE at a computational cost close to that of a single DNN (in terms of memory usage, number of forward passes, and image throughput). In this work, we aim to design a DNN architecture that closely mimics the properties of ensembles, in particular having a set of independent networks, in a computationally efficient manner. Previous works propose ensembles composed of small models (Kondratyuk et al., 2020; Lobacheva et al., 2020) and achieve performance comparable to a single large model. We build upon this idea and devise a strategy based on small networks that aims to match the performance of an ensemble of large networks.
To this end, we leverage grouped convolutions (Krizhevsky et al., 2012) to delineate multiple subnetworks within the same network. The parameters of each subnetwork are not shared across subnetworks, leading to independent smaller models. This method enables fast training and inference times while keeping predictive uncertainty quantification close to DE (Figure 1). In summary, our contributions are the following:
• We propose Packed-Ensembles (PE), an efficient ensembling architecture relying on grouped convolutions, as a formalization of structured sparsity for Deep Ensembles;
• We extensively evaluate PE regarding accuracy, calibration, OOD detection, and distribution shift on classification and regression tasks, and show that PE achieves state-of-the-art predictive uncertainty quantification;
• We thoroughly study and discuss the properties of PE (diversity, sparsity, stability, behavior of subnetworks) and release our PyTorch implementation.

2. BACKGROUND

In this section, we present the formalism for this work and offer a brief background on grouped convolutions and ensembles of DNNs. Appendix A summarizes the main notations in Table 3 .

2.1. BACKGROUND ON CONVOLUTIONS

The convolutional layer (LeCun et al., 1989) consists of a series of cross-correlations between feature maps h^j ∈ ℝ^{C_j × H_j × W_j}, regrouped in batches of size B, and a weight tensor ω^j ∈ ℝ^{C_{j+1} × C_j × s_j^2}, with C_j, H_j, W_j three integers representing the number of channels, the height, and the width of h^j, respectively. C_{j+1} and s_j are two integers corresponding to the number of channels of h^{j+1} (the output of the layer) and the kernel size. Finally, j is the layer's index and will be fixed in the following formulae. The bias of convolution layers is omitted for simplicity. The output value of the convolution layer, denoted ⊛, is:

z^{j+1}(c, :, :) = (h^j ⊛ ω^j)(c, :, :) = Σ_{k=0}^{C_j - 1} ω^j(c, k, :, :) ⋆ h^j(k, :, :),   (1)

where c ∈ ⟦0, C_{j+1} - 1⟧ is the index of the considered channel of the output feature map, ⋆ is the classical 2D cross-correlation operator, and z^j is the pre-activation feature map such that h^j = ϕ(z^j), with ϕ an activation function.

To embed an ensemble of subnetworks, we leverage grouped convolutions, already used in ResNeXt (Xie et al., 2017) to train several DNN branches in parallel. The grouped convolution operation with γ groups and weights ω_γ^j ∈ ℝ^{C_{j+1} × (C_j/γ) × s_j^2} is given in (2), γ dividing C_j for all layers. Any output channel c is produced by a specific group (set of filters), identified by the integer ⌊γc / C_{j+1}⌋, which only uses 1/γ of the input channels:

z^{j+1}(c, :, :) = (h^j ⊛ ω_γ^j)(c, :, :) = Σ_{k=0}^{C_j/γ - 1} ω_γ^j(c, k, :, :) ⋆ h^j(k + ⌊γc / C_{j+1}⌋ C_j/γ, :, :).   (2)

The grouped convolution layer is mathematically equivalent to a classical convolution whose weights are multiplied element-wise by a binary tensor mask m ∈ {0, 1}^{C_{j+1} × C_j × s_j^2} such that, for each group m ∈ ⟦0, γ - 1⟧, mask_m^j(k, l, :, :) = 1 if ⌊γl / C_j⌋ = ⌊γk / C_{j+1}⌋ = m, and 0 otherwise.
The complete layer mask is finally defined as mask^j = Σ_{m=0}^{γ-1} mask_m^j, and the grouped convolution can therefore be rewritten as z^{j+1} = h^j ⊛ (ω^j ∘ mask^j), where ∘ is the Hadamard product.
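This equivalence between a grouped convolution and a masked standard convolution can be checked numerically. Below is a minimal NumPy sketch (function names and tensor sizes are illustrative, not taken from our released code); it compares a γ = 2 grouped cross-correlation against a standard one whose weights are zeroed outside the group blocks:

```python
import numpy as np

def conv2d(h, w):
    """Naive 'valid' 2D cross-correlation. h: (C_in, H, W), w: (C_out, C_in, s, s)."""
    C_out, C_in, s, _ = w.shape
    H, W = h.shape[1] - s + 1, h.shape[2] - s + 1
    z = np.zeros((C_out, H, W))
    for c in range(C_out):
        for i in range(H):
            for j in range(W):
                z[c, i, j] = np.sum(w[c] * h[:, i:i + s, j:j + s])
    return z

def grouped_conv2d(h, w_g, groups):
    """Grouped version: w_g has shape (C_out, C_in // groups, s, s); the group of
    output channel c is floor(groups * c / C_out), as in equation (2)."""
    C_out, C_in_g, s, _ = w_g.shape
    H, W = h.shape[1] - s + 1, h.shape[2] - s + 1
    z = np.zeros((C_out, H, W))
    for c in range(C_out):
        g = groups * c // C_out
        block = h[g * C_in_g:(g + 1) * C_in_g]  # only 1/groups of the input channels
        for i in range(H):
            for j in range(W):
                z[c, i, j] = np.sum(w_g[c] * block[:, i:i + s, j:j + s])
    return z

rng = np.random.default_rng(0)
C_in, C_out, s, gamma = 8, 8, 3, 2
h = rng.normal(size=(C_in, 6, 6))
w_g = rng.normal(size=(C_out, C_in // gamma, s, s))

# Equivalent dense weight: scatter each group's filters into the right block,
# leaving zeros elsewhere (this zero pattern is exactly the binary mask).
w_full = np.zeros((C_out, C_in, s, s))
for c in range(C_out):
    g = gamma * c // C_out
    w_full[c, g * (C_in // gamma):(g + 1) * (C_in // gamma)] = w_g[c]

assert np.allclose(grouped_conv2d(h, w_g, gamma), conv2d(h, w_full))
```

The zero blocks of `w_full` play the role of (1 - mask^j), which is why grouped convolutions can be read as structured sparsity applied to a dense layer.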

2.2. BACKGROUND ON DEEP ENSEMBLES

For an image classification problem, let us define a dataset D = {(x_i, y_i)}_{i=1}^{|D|} containing |D| pairs of samples x_i = h_i^0 ∈ ℝ^{C_0 × H_0 × W_0} and one-hot-encoded labels y_i ∈ ℝ^{N_C}, modeled as realizations of a joint distribution P(X, Y), where N_C is the number of classes in the dataset. The input data x_i is processed by a neural network f_θ, a parametric probabilistic model such that ŷ_i = f_θ(x_i) = P(Y = y_i | X = x_i; θ). This approach considers the prediction ŷ_i as the parameters of a Multinoulli distribution. Deep Ensembles average the predictions of M such networks, trained independently from different random initializations:

P(y_i | x_i, D) = (1/M) Σ_{m=0}^{M-1} P(y_i | x_i, θ_m).
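As a toy illustration of this averaging rule, the snippet below (the logit values are made up for the example) averages the Multinoulli parameters of M = 3 members and extracts the ensemble prediction and its confidence:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Logits from M = 3 hypothetical ensemble members for one sample, N_C = 4 classes.
member_logits = np.array([[2.0, 0.5, 0.1, 0.0],
                          [1.8, 0.9, 0.2, 0.1],
                          [0.4, 2.1, 0.3, 0.2]])

probs = softmax(member_logits)         # each row parameterizes a Multinoulli
ensemble = probs.mean(axis=0)          # P(y | x, D) = (1/M) sum_m P(y | x, theta_m)
prediction = int(np.argmax(ensemble))  # ensemble class
confidence = float(ensemble.max())     # maximum softmax probability
assert np.isclose(ensemble.sum(), 1.0)
```

Here two members favor class 0 and one favors class 1, so the averaged distribution still predicts class 0, but with a confidence well below any individual member's peak, reflecting the disagreement.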

3. PACKED-ENSEMBLES

This section describes how to efficiently train multiple subnetworks using grouped convolutions. We then explain why our new architecture is equivalent to training several networks in parallel.

3.1. REVISITING DEEP ENSEMBLES

Although Deep Ensembles provide undisputed benefits, they also come with the significant drawback that the training time and the memory usage at inference increase linearly with the number of networks. To alleviate these problems, we propose assembling small subnetworks, which are essentially DNNs with fewer parameters. Moreover, while ensembles have to this day mostly been trained sequentially, we suggest leveraging grouped convolutions to massively accelerate their training and inference thanks to their smaller size. Propagating through grouped convolutions with M groups, M being the number of subnetworks in the ensemble, ensures that the subnetworks are trained independently while dividing their encoding dimension by a factor M. More details on the usefulness of grouped convolutions for training ensembles can be found in subsection 3.3. To create Packed-Ensembles (illustrated in Figure 2), we build on small subnetworks but compensate for the dramatic decrease in model capacity by multiplying the width by a hyperparameter α, which can be seen as an expansion factor. Hence, we propose Packed-Ensembles-(α, M, 1) as a flexible formalization of ensembles of small subnetworks. For an ensemble of M subnetworks, Packed-Ensembles-(α, M, 1) therefore modifies the encoding dimension by a factor α/M, and the inference of our ensemble is computed with the following formula, omitting the index i of the sample:

ŷ = (1/M) Σ_{m=0}^{M-1} P(y | θ_{α,m}, x),  with θ_{α,m} = {ω_α^j ∘ mask_m^j}_j,

where ω_α^j is the weight tensor of layer j, of dimension (αC_{j+1}) × (αC_j) × s_j^2. In the following, we introduce another hyperparameter, γ, corresponding to the number of groups within each subnetwork of the Packed-Ensemble, creating another level of sparsity. These groups are also called "subgroups" and are applied inside the different subnetworks. Formally, we denote our technique Packed-Ensembles-(α, M, γ), with the hyperparameters in the parentheses.
In this work, we consider a constant number of subgroups across the layers; therefore, γ divides αC j for all j.
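The masks mask_m^j make the subnetworks fully independent: an input perturbation routed to one member never reaches the others. A small sketch with 1 × 1 convolutions (which reduce to plain matrix products; sizes and names here are arbitrary, for illustration only) demonstrates this:

```python
import numpy as np

rng = np.random.default_rng(42)
M, c = 2, 4  # M subnetworks, c channels each (the per-member encoding dimension)

# A packed 1x1-convolution layer: one dense weight matrix, masked so that it is
# block-diagonal, i.e. omega ∘ mask with one block per subnetwork.
omega = rng.normal(size=(M * c, M * c))
mask = np.kron(np.eye(M), np.ones((c, c)))
packed_w = omega * mask

x = rng.normal(size=(M * c,))
y = packed_w @ x

# Perturb only subnetwork 0's input channels: subnetwork 1's output must not move.
x2 = x.copy()
x2[:c] += 1.0
y2 = packed_w @ x2
assert np.allclose(y[c:], y2[c:])      # member 1 untouched
assert not np.allclose(y[:c], y2[:c])  # member 0 changed
```

Because the same independence holds layer after layer, gradients also stay confined to each member's block, which is what makes the parallel training equivalent to training M separate networks.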

3.2. COMPUTATIONAL COST

For a convolutional layer with C_j input channels, C_{j+1} output channels, kernels of size s_j, and γ subgroups, the number of parameters of a Packed-Ensembles layer is equal to:

M × (αC_j / M) × (αC_{j+1} / M) × s_j^2 × γ^{-1} = α^2 C_j C_{j+1} s_j^2 / (Mγ).

The same formula applies to dense layers, seen as 1 × 1 convolutions. When the subnetworks are fully convolutional or dense, two special cases emerge. If α = √M (with γ = 1), the number of parameters in the ensemble equals the number of parameters in a single model. With α = M, each subnetwork has the capacity of a single model, and the ensemble is therefore equivalent in size to DE.
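These counts are easy to verify with a small helper (a hypothetical function, assuming channel counts divisible by M and γ):

```python
def packed_params(c_in, c_out, s, alpha, M, gamma=1):
    """Parameters of one Packed-Ensembles conv layer:
    M x (alpha*C_j / M) x (alpha*C_{j+1} / M) x s^2 / gamma, biases ignored."""
    return M * (alpha * c_in // M) * (alpha * c_out // M) * s * s // gamma

single = 64 * 128 * 3 * 3  # a standard 3x3 conv layer, 64 -> 128 channels

# alpha = sqrt(M): the whole ensemble is the size of one model.
assert packed_params(64, 128, 3, alpha=2, M=4) == single
# alpha = M: each subnetwork is the size of one model, i.e. Deep Ensembles.
assert packed_params(64, 128, 3, alpha=4, M=4) == 4 * single
# Subgroups (gamma > 1) add a further level of sparsity.
assert packed_params(64, 128, 3, alpha=2, M=4, gamma=2) == single // 2
```

The general scaling α²/(Mγ) relative to a single model makes the memory trade-off explicit: α trades capacity for parameters, while M and γ recover them.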

3.3. IMPLEMENTATION DETAILS

We propose a simple way of designing efficient ensemble convolutional layers using grouped convolutions. To take advantage of the parallelization capabilities of GPUs in training and inference, we replace the sequential training architecture, (a) in Figure 3, with the parallel implementations (b) and (c). Figure 3 summarizes different equivalent architectures for a simple ensemble of M = 3 DNNs with three convolutional layers and a final dense layer (equivalent to a 1 × 1 convolution), with α = γ = 1. In (b), we propose to stack the feature maps on the channel dimension (denoted as the rearrange operation). This yields a feature map h^j of size (M C_j) × H_j × W_j, regrouped in batches of size only B/M, with B the batch size of the ensemble. One solution to keep the same batch size is to repeat the batch M times so that its size equals B after the rearrangement. Using convolutions with M groups and γ subgroups per subnetwork, each feature map is convolved separately by each subnetwork, which yields its own independent output. Grouped convolutions are propagated until the end to ensure that gradients stay independent between subnetworks. Other operations, such as Batch Normalization (Ioffe & Szegedy, 2015), can be applied directly as long as they can be grouped or act independently on each channel. Figure 4a illustrates the mask used to implement Packed-Ensembles in the case M = 2; similarly, Figure 4b shows the mask for M = 2 and γ = 2. Finally, (b) and (c) are also equivalent: it is possible to replace the rearrange operation and the first grouped convolution with a standard convolution when the same images are provided simultaneously to all the subnetworks. We confirm in Appendix F that this procedure is not detrimental to the ensemble's performance, and we take advantage of this property for this final optimization and simplification.
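The equivalence between the sequential ensemble (a) and the packed forward pass (b)/(c) can be sketched with 1 × 1 convolutions, for which a grouped convolution is simply a block-diagonal matrix product (sizes and names are illustrative, not from the released code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, C = 3, 4
# M independent 1x1-conv subnetworks, each a C x C weight matrix.
weights = [rng.normal(size=(C, C)) for _ in range(M)]
x = rng.normal(size=(C,))  # one input sample's features

# (a) sequential ensemble: forward the same input through each member in turn.
seq = np.concatenate([w @ x for w in weights])

# (b)/(c) packed: repeat the input M times on the channel axis (the "rearrange"
# step) and apply a single block-diagonal (grouped) convolution in one pass.
stacked = np.tile(x, M)
block_diag = np.zeros((M * C, M * C))
for m, w in enumerate(weights):
    block_diag[m * C:(m + 1) * C, m * C:(m + 1) * C] = w
packed = block_diag @ stacked

assert np.allclose(seq, packed)
```

One matrix product replaces M sequential ones, which is exactly the GPU-friendly reformulation that grouped convolutions provide for full convolutional layers.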

4. EXPERIMENTS

To validate the performance of our method, we conduct experiments on classification tasks and measure the influence of the parameters α and γ. Regression tasks are detailed in Appendix N. 

4.1.1. METRICS, OOD DATASETS, AND IMPLEMENTATION

We evaluate the overall performance of the models in classification tasks using the accuracy (Acc) in % and the Negative Log-Likelihood (NLL). We choose the classical Expected Calibration Error (ECE) (Naeini et al., 2015) for the calibration of uncertainties, and measure the quality of the OOD detection using the Area Under the Precision/Recall curve (AUPR) and the Area Under the Receiver Operating Characteristic curve (AUC), as well as the False Positive Rate at 95% recall (FPR95), all expressed in %, similarly to Hendrycks & Gimpel (2017). We use accuracy as the validation criterion (i.e., the final trained model is the one with the highest accuracy). During inference, we average the softmax probabilities of all subnetworks and take the index of the maximum of the output vector as the predicted class of the ensemble. We define the prediction confidence as this maximum value (also called the maximum softmax probability). For OOD detection tasks on CIFAR-10 and CIFAR-100, we use the SVHN dataset (Netzer et al., 2011) as the out-of-distribution dataset and transform the initial classification problem into a binary classification between in-distribution and OOD data, using the maximum softmax probability as the criterion. We discuss the different OOD criteria in Appendix E. For ImageNet, we use two out-of-distribution datasets, ImageNet-O (Hendrycks et al., 2021b) and Texture (Wang et al., 2022), and use the Mutual Information (MI) as the criterion for the ensemble techniques (see Appendix E for details on MI) and the maximum softmax probability for the single model and MIMO. To measure robustness under distribution shift, we use ImageNet-R (Hendrycks et al., 2021a) and evaluate the accuracy, ECE, and NLL on this dataset, denoted rAcc, rECE, and rNLL, respectively. We implement our models using the PyTorch-Lightning framework built on top of PyTorch; both are open-source Python frameworks.
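For reference, a minimal version of the ECE with equal-width confidence bins might look as follows (a sketch, not the exact implementation used in our experiments):

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error: weighted average, over equal-width confidence
    bins, of |accuracy(bin) - mean confidence(bin)|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            err += in_bin.sum() / n * gap
    return err

# Overconfident model: 90% confidence but only 50% accuracy -> ECE = 0.4.
conf = np.full(10, 0.9)
correct = np.array([1, 0] * 5, dtype=float)
assert np.isclose(ece(conf, correct), 0.4)
```

A perfectly calibrated model (confidence matching accuracy in every bin) yields an ECE of 0.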
Appendix B and Table 4 detail the hyper-parameters used in our experiments across architectures and datasets. Most training instances are completed on a single Nvidia RTX 3090 except for ImageNet, for which we use 2 to 8 Nvidia A100-80GB.

4.1.2. RESULTS

Table 1 presents the average performance for the classification task over five runs using the hyperparameters in Table 4. We demonstrate that Packed-Ensembles, in the setting α = 2 and γ = 2, yields results similar to Deep Ensembles while having a lower memory cost than a single model. For CIFAR-10, the relative performance of PE compared to DE appears to increase as the original architecture becomes larger. When using ResNet-18, Packed-Ensembles matches Deep Ensembles on OOD detection metrics but shows slightly worse performance on the others. However, with ResNet-50, both models perform similarly, and PE slightly outperforms DE in classification performance with WideResNet28-10.

Table 1: Performance comparison (averaged over five runs) on CIFAR-10/100 using ResNet-18 (R18), ResNet-50 (R50), and Wide ResNet28-10 (WR) architectures. All ensembles have M = 4 subnetworks; we highlight the best performances in bold. For our method, we consider α = γ = 2, except for WR on C100, where γ = 1. Mult-Adds corresponds to the inference cost, i.e., the number of Giga multiply-add operations for a forward pass, estimated with Torchinfo (2022).

Table 1 reports results for α = 2 and γ = 2. However, the optimal values of these hyperparameters depend on the balance between computational cost and performance. To help users strike the best compromise, we provide Figures 6 and 7 in Appendix D, which illustrate the impact of changing α on the performance of Packed-Ensembles.

5. DISCUSSIONS

We have shown that Packed-Ensembles provide attractive properties, mainly a quality of uncertainty quantification similar to Deep Ensembles at a reduced architecture and computing cost. Several questions can be raised, and we conducted studies, detailed in the Appendix sections, to provide possible answers.

Discussion on the sparsity. As described in section 3, one can interpret PE as leveraging grouped convolutions to approximate Deep Ensembles with a mask operation applied to some components. In Appendix C, using a simplified model, we propose a bound on the approximation error based on the Kullback-Leibler divergence between the DE and its pruned version. This bound depends on the density of ones in the mask, p, and more specifically on the terms p(1 - p) and (1 - p)^2 / p. By manipulating these terms, which amounts to modifying the number of subnetworks M, the number of groups γ, and the expansion factor α, we could theoretically control the approximation error.

On the sources of stochasticity. Diversity is essential in ensembles and is usually obtained by exploiting two primary sources of stochasticity: the random initialization of the model's parameters and the shuffling of the batches. A third source of stochasticity is introduced during training by the non-deterministic behavior of the backpropagation algorithms. In Appendix F, we study the function-space diversity arising from every possible combination of these sources. It follows that a single one of these sources is often sufficient to generate diversity, and no particular pattern seems to emerge to predict the best combination. Specifically, we highlight that even the sole use of non-deterministic algorithms introduces enough diversity between the subnetworks of the ensemble.

Ablation study

We perform ablation studies to assess the impact of the parameters M, α, and γ on the performance of Packed-Ensembles; Appendix D provides in-depth details. No clear-cut behavior appears from the results we obtained. A trend shows that a higher number of subnetworks helps OOD detection, but the improvement in AUPR is not significant.

Training speed. Depending on the chosen hyperparameters α, M, and γ, PE may have fewer parameters than the single model, as shown in Table 1, which translates into an expected lower number of operations. A study of the training and inference speeds, developed in Appendix H, shows that PE-(2, 4, 1) does not significantly increase the training and testing times compared to the single model while improving accuracy and uncertainty quantification. However, it also hints that the grouped-convolution speedup is not optimal, despite the significant acceleration offered by 16-bit floating-point precision.

OOD criteria

The maximum softmax probability is often used as a criterion for discriminating OOD samples. However, this criterion is not unique; others can be used, such as the Mutual Information, the maximum logit, or the Shannon entropy of the mean prediction. Although no particular relationship is expected between the criterion and PE, we obtained different OOD detection performances depending on the selected criterion. The results on CIFAR-100 are detailed in Appendix E and show that an approach based on the maximum logit seems to give the best OOD detection results. It should be noted that the notion of OOD depends on the training distribution, and such a discussion does not necessarily generalize to all datasets. Indeed, preliminary results have shown that the Mutual Information outperforms the other criteria for our method on the ImageNet dataset.

7. CONCLUSIONS

We propose a new ensemble framework, Packed-Ensembles, that can approximate Deep Ensembles in terms of uncertainty quantification and accuracy. Our research provides several new findings. First, we show that small independent neural networks can be as effective as large, deep neural networks when used in ensembles. Secondly, we demonstrate that not all sources of diversity are essential for improving ensemble diversity. Thirdly, we show that Packed-Ensembles are more stable than single DNNs. Fourthly, we highlight that there is a trade-off between accuracy and the number of parameters, and Packed-Ensembles enables us to create flexible and efficient ensembles. In the future, we intend to explore Packed-Ensembles for more complex downstream tasks.

8. REPRODUCIBILITY

Alongside this paper, we provide the source code of Packed-Ensembles layers. Additionally, we have created two notebooks demonstrating how to train ResNet-50-based Packed-Ensembles using public datasets such as CIFAR-10 and CIFAR-100. To ensure reproducibility, we report the performance given a specific random seed with a deterministic training process. Furthermore, it should be noted that the source code contains two PyTorch Module classes to produce Packed-Ensembles efficiently. A readme file at the root of the project details how to install and run experiments. In addition, we showcase how to get Packed-Ensembles from LeNet (LeCun et al., 1998). To further promote accessibility, we have created an open-source pip-installable PyTorch package, torch-uncertainty, that includes Packed-Ensembles layers. With these resources, we hope to encourage the broader research community to engage with and build upon our work.

9. ETHICS

The purpose of this paper is to propose a new method for better uncertainty estimation in deep-learning-based models. Nevertheless, we acknowledge its limitations, which could become particularly concerning in safety-critical systems. While this work aims to improve the reliability of Deep Neural Networks, the approach is not ready for deployment in safety-critical systems, and we show its limitations in several experiments. Many more validation and verification steps would be crucial before considering real-world implementation, to ensure robustness to various unknown situations, including corner cases, adversarial attacks, and potential biases.

A NOTATIONS

We summarize the main notations used in the paper in Table 3.

Table 3: Main notations.
D = {(x_i, y_i)}_{i=1}^{|D|} : The set of |D| data samples and the corresponding labels
j, m, L : The index of the current layer, the index of the current subnetwork, and the number of layers
z^j : The pre-activation feature map, output of layer j - 1 and input of layer j
ϕ : The activation function (considered constant throughout the network)
h^j : The feature map output of layer j, h^j = ϕ(z^j)
H_j, W_j : The height and width of the feature maps output by layer j - 1
C_j : The number of channels of the feature maps output by layer j - 1
n_j : The number of parameters of layer j
θ_{α,m} : The set of weights of subnetwork m with a width factor α
ω^j_{α,γ} : The weights of layer j with γ groups and a width factor α


B TRAINING DETAILS

For ImageNet, we use the A3 procedure from Wightman et al. (2021) for all models. Training with the exact A3 procedure was not always possible; refer to the specific paragraphs below for more details. Please note that the hyperparameters of the training procedures have not been optimized for our method and have been taken directly from the literature (He et al., 2016; Wightman et al., 2021). We strengthened the data augmentations for WideResNet on CIFAR-100, as we were not able to replicate the results from Zagoruyko & Komodakis (2016).

Masksembles. We use the code proposed by Durasov et al. (2021). We modified the mask generation function using binary search, as proposed by the authors, since the original was unable to build masks for ResNet-50x4. We note that the code implies performing batch repeats at the start of the forward pass; all results for this technique are therefore computed with this specification. The ResNet implementations are built using Masksemble2D layers with M = 4 and a scale factor of 2 after each convolution.

BatchEnsemble. For BatchEnsemble, we use two different values for the weight decay: Table 4 provides the weight decay corresponding to the shared weights, but we do not apply weight decay to the vectors S and R (which generate the rank-1 matrices).

ImageNet. The batch size of Masksembles ResNet-50x4 is reduced to 1120 because of memory constraints. Concerning the BatchEnsembles based on ResNet-50 and ResNet-50x4, we clip the norm of the gradients to 0.0005 to avoid divergence.

C DISCUSSION ON THE SPARSITY

In this section, we estimate the expected distance between a dense fully-connected layer and a sparse one. For simplicity, we assume here that we operate on a fully-connected layer.

Proposition C.1. Given a fully-connected layer j + 1 defined by:

z^{j+1}(c) = Σ_{k=0}^{C_j - 1} ω^j(c, k) h^j(k)

and its approximation defined by:

ẑ^{j+1}(c) = Σ_{k=0}^{C_j - 1} ω^j(c, k) mask^j(k, c) h^j(k),

under the assumption that h^j follows a Gaussian distribution h^j ~ N(μ^j, Σ^j), where Σ^j is the covariance matrix and μ^j the mean vector, the Kullback-Leibler divergence between the layer and its approximation is bounded by:

D_KL(z, ẑ)(c) ≤ (1/2) [ p + 1/p - 2 + p(1 - p) Σ_{k=0}^{C_j - 1} ω^j(c, k)^2 μ^j(k)^2 / (σ_z^{j+1})^2(c) + (1 - p)^2 μ_z^{j+1}(c)^2 / (p (σ_z^{j+1})^2(c)) ],   (7)

where p ∈ [0, 1] is the fraction of the parameters of z^{j+1}(c) kept in the approximation ẑ^{j+1}(c). A plot of (7) is provided in Figure 5.

Proof. Since h^j(k) follows a Gaussian distribution and, at inference time, ω^j is constant and linearly combined with a Gaussian random variable, z^{j+1}(c) is also Gaussian. By linearity of expectation, its mean is:

μ_z^{j+1}(c) = Σ_{k=0}^{C_j - 1} ω^j(c, k) μ^j(k),

and its variance is:

(σ_z^{j+1})^2(c) = Σ_{k=0}^{C_j - 1} ω^j(c, k) [ ω^j(c, k) Σ(k, k) + 2 Σ_{k' < k} ω^j(c, k') Σ(k', k) ].

If we assume Σ(i, k) = 0 for all i ≠ k, the variance simplifies into:

(σ_z^{j+1})^2(c) = Σ_{k=0}^{C_j - 1} ω^j(c, k)^2 Σ(k, k).

Let us now consider the masked case introduced at the end of section 2.1, and assume mask^j ~ Ber(p), where p is the parameter of the Bernoulli distribution (i.e., one minus the pruning rate). In the limit of large C_j, ẑ^{j+1}(c) follows a Gaussian distribution with mean and variance:

μ̂_z^{j+1}(c) = p Σ_{k=0}^{C_j - 1} ω^j(c, k) μ^j(k),
(σ̂_z^{j+1})^2(c) = p Σ_{k=0}^{C_j - 1} ω^j(c, k)^2 [ (1 - p) μ^j(k)^2 + Σ(k, k) ].

Hence:

μ̂_z^{j+1}(c) = p μ_z^{j+1}(c),
(σ̂_z^{j+1})^2(c) = p [ (σ_z^{j+1})^2(c) + (1 - p) Σ_{k=0}^{C_j - 1} ω^j(c, k)^2 μ^j(k)^2 ].

To assess the dissimilarity between z and ẑ, we write the Kullback-Leibler divergence between the two Gaussian distributions:

D_KL(z, ẑ)(c) = ln( σ̂_z^{j+1}(c) / σ_z^{j+1}(c) ) + [ (σ_z^{j+1})^2(c) + (μ_z^{j+1}(c) - μ̂_z^{j+1}(c))^2 ] / [ 2 (σ̂_z^{j+1})^2(c) ] - 1/2.

Using ln x ≤ x - 1, we can straightforwardly write the inequality:

D_KL(z, ẑ)(c) ≤ (1/2) [ (σ̂_z^{j+1})^2(c) / (σ_z^{j+1})^2(c) - 1 + ( (σ_z^{j+1})^2(c) + (μ_z^{j+1}(c) - μ̂_z^{j+1}(c))^2 ) / (σ̂_z^{j+1})^2(c) - 1 ].

Substituting the mean and variance of ẑ^{j+1}(c), the first term becomes:

(σ̂_z^{j+1})^2(c) / (σ_z^{j+1})^2(c) - 1 = p - 1 + p(1 - p) Σ_{k=0}^{C_j - 1} ω^j(c, k)^2 μ^j(k)^2 / (σ_z^{j+1})^2(c),

and, since (σ̂_z^{j+1})^2(c) ≥ p (σ_z^{j+1})^2(c) and μ_z^{j+1}(c) - μ̂_z^{j+1}(c) = (1 - p) μ_z^{j+1}(c), the second term is bounded by:

( (σ_z^{j+1})^2(c) + (1 - p)^2 μ_z^{j+1}(c)^2 ) / ( p (σ_z^{j+1})^2(c) ) = 1/p + (1 - p)^2 μ_z^{j+1}(c)^2 / ( p (σ_z^{j+1})^2(c) ).

Summing the two bounds yields:

D_KL(z, ẑ)(c) ≤ (1/2) [ p + 1/p - 2 + p(1 - p) Σ_{k=0}^{C_j - 1} ω^j(c, k)^2 μ^j(k)^2 / (σ_z^{j+1})^2(c) + (1 - p)^2 μ_z^{j+1}(c)^2 / ( p (σ_z^{j+1})^2(c) ) ],

which is exactly (7). ∎
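Under the assumptions of Proposition C.1 (diagonal covariance, large C_j), the bound can be sanity-checked numerically against the closed-form KL divergence between the two Gaussians. The sketch below draws random layer statistics and verifies the inequality:

```python
import numpy as np

rng = np.random.default_rng(7)
C, p = 256, 0.5                 # input width and Bernoulli keep-probability
w = rng.normal(size=C)          # omega^j(c, :)
mu = rng.normal(size=C)         # mu^j
var = rng.uniform(0.5, 2.0, C)  # diagonal of Sigma^j

# Moments of the dense pre-activation z^{j+1}(c).
mu_z = w @ mu
var_z = np.sum(w**2 * var)
A = np.sum(w**2 * mu**2)

# Moments of the masked layer in the large-C_j Gaussian limit.
mu_hat = p * mu_z
var_hat = p * (var_z + (1 - p) * A)

# Closed-form KL divergence D_KL(z || z_hat) between the two Gaussians.
kl = (np.log(np.sqrt(var_hat / var_z))
      + (var_z + (mu_z - mu_hat)**2) / (2 * var_hat) - 0.5)

# Bound (7).
bound = 0.5 * (p + 1 / p - 2
               + p * (1 - p) * A / var_z
               + (1 - p)**2 * mu_z**2 / (p * var_z))

assert 0.0 <= kl <= bound
```

The gap between `kl` and `bound` comes from the two relaxations in the proof: ln x ≤ x - 1 and (σ̂_z)² ≥ p (σ_z)².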

D ABLATION STUDY

Our algorithm mainly depends on three hyperparameters: M, the number of subnetworks in the ensemble; α, which controls the representation power of the DNN; and γ, an extra parameter that controls the sparsity of the DNN. To evaluate the sensitivity of Packed-Ensembles to these parameters, we train 5 ResNet-50 on CIFAR-10, similarly to the protocol explained in section 4.1. Figures 6 and 7 show that the more subnetworks we add by increasing M, the better the performance in terms of accuracy and AUPR. We also note that the results are stable with respect to γ. Moreover, the resulting accuracy tends to increase with α until it reaches a plateau. These statements are confirmed by the results in Table 5.

E DISCUSSION ON THE OOD CRITERIA

Besides the maximum softmax probability, one can use the Maximum Logit (ML) and the entropy of the posterior predictive distribution as uncertainty criteria, the latter defined by Ent. = H(P(y_i | x, D)), with H the entropy function. Another metric is the Mutual Information, defined by:

MI = H(P(y_i | x, D)) - (1/M) Σ_{m=0}^{M-1} H(P(y_i | θ_{α,m}, x)).

It represents a measure of the ensemble disagreement: the entropy of the posterior predictive distribution minus the average entropy over the members' predictions. The last metric, used in active learning, is the variation ratio (Beluch et al., 2018), which measures the dispersion of a nominal variable as the proportion of predicted class labels that are not the modal class prediction. It is defined by v = 1 - f/M, where f is the number of predictions falling into the modal class.

In Table 6, we report the results for the different criteria. We note that ML seems to be the best criterion to detect OOD samples, followed by Ent. and then MI. Note that v, widely used in active learning, does not seem effective at detecting OOD samples. This shows that it is essential to use a good criterion in addition to good ensembling.
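A minimal implementation of the MI criterion might look as follows (a sketch; the member probabilities are made up). Identical members yield MI = 0, while disagreeing members yield a large MI:

```python
import numpy as np

def entropy(P, axis=-1):
    # Shannon entropy in nats, with clipping to avoid log(0).
    return -np.sum(P * np.log(np.clip(P, 1e-12, 1.0)), axis=axis)

def mutual_information(member_probs):
    """MI = H(mean prediction) - mean over members of H(member prediction).
    member_probs: array of shape (M, N_C)."""
    return entropy(member_probs.mean(axis=0)) - entropy(member_probs, axis=-1).mean()

agree = np.array([[0.7, 0.2, 0.1]] * 4)          # identical members: MI = 0
disagree = np.array([[0.90, 0.05, 0.05],
                     [0.05, 0.90, 0.05],
                     [0.05, 0.05, 0.90]])        # strong disagreement: MI large

assert np.isclose(mutual_information(agree), 0.0)
assert mutual_information(disagree) > 0.5
```

This is why MI is a disagreement measure: it is zero whenever all members output the same distribution, however uncertain that distribution is.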

F DISCUSSION ABOUT THE SOURCES OF STOCHASTICITY

As written in the introduction of the paper, diversity is essential to the success of ensembling, both for accuracy and for calibration and OOD detection. Three primary sources can induce weight diversity, and therefore diversity in the function space, during training: the initialization of the weights, the composition of the batches, and the use of non-deterministic backpropagation algorithms. In Table 7, we measure the performance and diversity of Packed-Ensembles trained on CIFAR-100. Diversity is measured by the mutual information and is twofold: we compute the in-distribution mutual information (IDMI) on the test set of CIFAR-100 and the OOD mutual information (OODMI) on SVHN. Concerning performance, we compute the accuracy, ECE, and AUPR, which are proxies for the quality of this diversity. The results of Table 7 lead to several takeaways. First, they hint that there is no clear best set of trivial sources of stochasticity. Except for the first (greyed) line, which corresponds to ensembling completely identical networks (the training being fully deterministic, as the null MI confirms), the results are equivalent in diversity (via mutual information) and in ID/OOD performance. Secondly, they show that the use of non-deterministic algorithms can be sufficient to generate diversity, although this effect does not always occur, depending on the selected architecture and the precision used (float16 or float32). Given that no best set of sources of stochasticity emerges, we use the faster non-deterministic backpropagation algorithms together with different initializations, for programming convenience and to ensure sufficient stochasticity.

G DISCUSSION ABOUT THE WIDTH OF THE SUBNETWORKS

Deep neural networks are heavily over-parameterized, as stated by the lottery ticket hypothesis (Frankle & Carbin, 2018), which suggests that up to 80% of neurons can be removed without significant loss of performance. The MIMO approach builds on this assumption by training multiple networks simultaneously, where neurons may be shared by several subnetworks. In our work, however, we assign each neuron to a specific DNN of the ensemble, guaranteeing their independence. This way, the DNNs can learn independent representations. However, as in MIMO, we rely on the fact that not all neurons are useful, so we split the width of the initial DNN into a set of DNNs. Although the decomposition may seem crude, it enables better parallelization of Packed-Ensembles during training and inference. To address the problem of subnetworks that are not sufficiently wide, we introduce a hyperparameter α to increase their width. In Figure 8, we explore the impact of subnetwork width. We observe that the accuracy of the DNN increases with the width while the AUPR remains relatively constant. This finding suggests that α is paramount in maintaining a balance in the DNN's width. We also note that reducing the width of the DNN does not significantly impact its accuracy. Hence, our decision to split the width of the DNN to create multiple subnetworks is justified, since the uncertainty quantification remains unaltered and the accuracy is not significantly compromised. In addition, α provides an extra degree of freedom to our ensemble, enabling us to trade off its accuracy against the number of parameters and the associated computational cost.
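To make the parameter budget concrete, here is a small pure-Python sketch (function names are ours) counting the weights of a standard convolutional layer and of its packed counterpart, in which channels are widened by α and the convolution uses M·γ groups. With α = 2, M = 4, γ = 1, the packed layer has exactly as many parameters as the original one, consistent with PE operating within the memory limits of a standard network:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution with square k x k kernels (bias omitted)."""
    assert c_in % groups == 0
    return (c_in // groups) * c_out * k * k

def packed_conv_params(c_in, c_out, k, alpha, M, gamma):
    """Parameter count of the packed version of the same layer: channels are
    widened by alpha and the convolution uses M * gamma groups, so each
    subnetwork owns an independent slice of the weights."""
    return conv_params(alpha * c_in, alpha * c_out, k, groups=M * gamma)

single = conv_params(64, 128, 3)                          # standard layer
packed = packed_conv_params(64, 128, 3, alpha=2, M=4, gamma=1)
print(packed / single)  # -> 1.0, i.e. alpha^2 / (M * gamma) times the single model
```

The ratio α²/(Mγ) makes explicit how α buys subnetwork capacity while M and γ pay it back in grouping.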

H DISCUSSION ABOUT THE TRAINING VELOCITY

Our experiments show that grouped convolutions are not as fast as they could theoretically be, which confirms the statements made by many PyTorch and TensorFlow users. Following the idea that grouped convolutions are bandwidth-bound, we advise readers to leverage Native Automatic Mixed Precision (AMP) and the cuDNN benchmark flag when training a Packed-Ensembles to reduce the bandwidth bottleneck compared to the baseline. AMP also halves the VRAM usage while yielding equally good results. Future improvements to PyTorch grouped convolutions should help Packed-Ensembles develop its full potential, increasing its current assets. We note in Table 8
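As a configuration sketch, the two recommendations above amount to the following PyTorch fragment (the names `model`, `loader`, and `optimizer` are placeholders, and this is not the paper's exact training script):

```python
import torch

torch.backends.cudnn.benchmark = True          # let cuDNN pick the fastest kernels

scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():            # forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()              # scaled backward to avoid underflow
    scaler.step(optimizer)
    scaler.update()
```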

I DISTRIBUTION SHIFT

In this section, we evaluate the robustness of Packed-Ensembles under dataset shift. We use models trained on CIFAR-100 (Krizhevsky, 2009) and shift the data using the corruptions and perturbations proposed by Hendrycks & Dietterich (2019) to produce CIFAR-100-C. There are five levels of perturbation, called "severity", from one, the weakest, to five, the strongest.

For the training in the one-dimensional regression setting, we minimize the Gaussian NLL (20) using networks with two output neurons which estimate the parameters of a heteroscedastic Gaussian distribution (Nix & Weigend, 1994; Kendall & Gal, 2017). One output corresponds to the mean of the predicted Gaussian distribution, and a softplus applied to the second gives its variance. The ensemble's mean μ̂_θ(x_i) is computed as the empirical mean over the estimators, and its variance with the mixture formula σ̂_θ(x_i)² = M⁻¹ Σ_m (σ_θm(x_i)² + μ_θm(x_i)²) - μ̂_θ(x_i)² (Lakshminarayanan et al., 2017). The per-network loss is L(μ_θm(x_i), σ_θm(x_i)², y_i) = (y_i - μ_θm(x_i))² / (2σ_θm(x_i)²) + ½ log σ_θm(x_i)² + constant.
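A minimal NumPy sketch of this training objective and of the mixture aggregation (function names are ours):

```python
import numpy as np

def gaussian_nll(mu, var, y):
    """Heteroscedastic Gaussian negative log-likelihood of one prediction."""
    return 0.5 * np.log(2 * np.pi * var) + 0.5 * (y - mu) ** 2 / var

def mixture_moments(mus, vars_):
    """Mean and variance of the uniform mixture of M Gaussians
    (Lakshminarayanan et al., 2017). mus, vars_: arrays of shape (M,)."""
    mu_star = mus.mean()
    var_star = (vars_ + mus ** 2).mean() - mu_star ** 2
    return mu_star, var_star
```

For instance, two members predicting N(0, 1) and N(2, 1) mix into mean 1 and variance 2: the spread of the member means inflates the predicted variance, which is the mechanism behind the ensemble's uncertainty estimate.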



Footnotes:
See https://einops.rocks/api/rearrange/
Note that the benchmark uncertainty-baselines only uses ECE to measure calibration.
Available at github.com/nikitadurasov/masksembles
See https://docs.nvidia.com/deeplearning/cudnn/api/index.html
For instance, https://github.com/pytorch/pytorch/issues/75747



Figure 2: Overview of the considered architectures: (left) baseline vanilla network; (center) Deep Ensembles; (right) Packed-Ensembles-(α, M = 3, γ = 2).

Figure 3: Equivalent architectures for Packed-Ensembles. (a) corresponds to the first sequential version, (b) to the version with the rearrange operation and grouped convolutions and (c) to the final version beginning with a full convolution.
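The equivalence behind versions (a) and (b) can be checked numerically: a convolution with G groups is exactly G independent convolutions applied to disjoint channel slices, which is how the subnetworks are packed into one forward pass. A naive NumPy sketch (function names ours; "valid" padding, no bias, loops kept for clarity rather than speed):

```python
import numpy as np

def conv2d(x, w):
    """Valid 2D cross-correlation. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

def grouped_conv2d(x, w, groups):
    """Split channels into `groups` independent slices and convolve each.
    w has shape (C_out, C_in // groups, k, k)."""
    xs = np.split(x, groups, axis=0)
    ws = np.split(w, groups, axis=0)
    return np.concatenate([conv2d(xg, wg) for xg, wg in zip(xs, ws)], axis=0)
```

With `groups = M`, each slice plays the role of one subnetwork: no weight connects channels across groups, which is what guarantees the members' independence.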

Figure 4: Diagram representation of a subnetwork mask: mask j , with M = 2, j an integer corresponding to a fully connected layer

DATASETS AND ARCHITECTURES First, we demonstrate the efficiency of Packed-Ensembles on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), showing how the method adapts to tasks of different complexities. As we propose to replace a single model architecture with several subnetworks, we study the behavior of PE on architectures of various sizes: ResNet-18, ResNet-50 (He et al., 2016), and Wide ResNet28-10 (Zagoruyko & Komodakis, 2016). We compare it against Deep Ensembles (Lakshminarayanan et al., 2017) and three other approximated ensembles from the literature: BatchEnsemble (Wen et al., 2019), MIMO (Havasi et al., 2020), and Masksembles (Durasov et al., 2021). Second, we report our results for Packed-Ensembles on ImageNet (Deng et al., 2009), which we compare against all baselines. We run experiments with ResNet-50 and ResNet-50x4. All training runs are started from scratch.

H Discussion about the training velocity
I Distribution shift
K On the equivalence between sequential training and Packed-Ensembles
L Using groups is not sufficient to equal Packed-Ensembles
M Efficiency of the networks trained on ImageNet
N Regression

The batch size of the training procedure
mask_j^m — the mask corresponding to the layer j of the subnetwork m
⌊•⌋ — the floor function
⋆, ⊛, • — the 2D cross-correlation, the convolution, and the Hadamard product
s_j — the size of the kernel of the layer j
M — the number of subnetworks in an ensemble
ŷ_i^m — the prediction of the subnetwork number m for the input x_i
ŷ_i — the prediction of the ensemble for the input x_i
α — the width-augmentation factor of Packed-Ensembles
γ — the number of subgroups of Packed-Ensembles
θ_{α,m} — the weights of the subnetwork m given the width factor α

Figure 5: KL divergence for different values of p and σ j+1 z , with µ j (k) = 0.1 ∀j, k and w j (c, k) = 0.1 ∀j, c, k.

Figure 6: Accuracy and AUPR of Packed-Ensembles with ResNet-50 on CIFAR-100 depending on α.

E ABOUT OOD CRITERIA

Deep Ensembles (Lakshminarayanan et al., 2017) and Packed-Ensembles are ensembles of DNNs that can be used to quantify the uncertainty of the predictions. Similarly to Bayesian Neural Networks, one can take the softmax outputs of the posterior predictive distribution, which define MSP = max_{y_i} P(y_i | x, D). The MSP can also be used for a classical DNN, except that in this case we use the conditional likelihood instead of the posterior predictive distribution.

Figure 8: Accuracy and AUPR curves of ResNet-18 in red and ResNet-50 in blue on CIFAR-100 with different widths. When the width is equal to 1, it corresponds to the original ResNet; when the width is equal to x, the width of every layer is multiplied by x.

Based on the results in Table 2, we can conclude that Packed-Ensembles improves uncertainty quantification for OOD detection and distribution shift on ImageNet compared to Deep Ensembles and a single model, and that it improves accuracy at a moderate training and inference cost.

4.1.3 STUDY ON THE PARAMETERS α AND γ

Performance comparison on ImageNet using ResNet-50 (R50) and ResNet-50x4 (R50x4). All ensembles have M = 4 subnetworks and γ = 1. We highlight the best performances in bold. For OOD tasks, we use ImageNet-O (IO) and Texture (T), and for distribution shift we use ImageNet-R. The number of parameters and operations are available in Appendix M.

6 RELATED WORK

Ensembles and uncertainty quantification. Bayesian Neural Networks (BNNs) (MacKay, 1992; Neal, 1995) are the cornerstone and primary source of inspiration for uncertainty quantification in deep learning. Despite the progress enabled by variational inference (Jordan et al., 1999; Blundell et al., 2015), BNNs remain challenging to scale and train for large DNN architectures (Dusenberry et al., 2020). DE (Lakshminarayanan et al., 2017) arise as a practical and efficient instance of BNNs, coarsely but effectively approximating the posterior distribution of the weights (Wilson & Izmailov, 2020). DE are currently the best-performing approach for both predictive performance and uncertainty estimation (Ovadia et al., 2019; Gustafsson et al., 2020).

Efficient ensembles. The appealing properties in performance and diversity of DE (Fort et al., 2019), but also their major downside related to computational cost, have inspired a large cohort of approaches aiming to mitigate it. BatchEnsemble (Wen et al., 2019) spawns an ensemble at each layer thanks to an efficient parameterization of subnetwork-specific parameters trained in parallel. MIMO (Havasi et al., 2020) shows that a large network can encapsulate multiple subnetworks using a multi-input multi-output configuration. A single network can be used in ensemble mode by disabling different subsets of weights at each forward pass (Gal & Ghahramani, 2016; Durasov et al., 2021). Liu et al. (2022) leverage the sparse-network training algorithm of Mocanu et al. (2018) to produce ensembles of sparse networks. Ensembles can also be computed from a single training run by collecting intermediate model checkpoints (Huang et al., 2017; Garipov et al., 2018), by computing the posterior distribution of the weights by tracking their trajectory during training (Maddox et al., 2019; Franchi et al., 2020), or by ensembling predictions over multiple augmentations of the input sample (Ashukha et al., 2020).
However, most of these approaches require multiple forward passes.

Summary of the main notations of the paper.

Hyperparameters for image classification experiments. HFlip denotes the classical horizontal flip. Table 4 summarizes all the hyperparameters used in the paper for CIFAR-10 and CIFAR-100. In all cases, we use SGD combined with a multistep learning-rate scheduler multiplying the rate by γ-lr at each milestone. Note that BatchEnsemble based on ResNet-50 uses a lower learning rate of 0.08 instead of 0.1 for stability. The "Medium" data augmentation corresponds to a combination of mixup (Zhang et al., 2018a) and cutmix (Yun et al., 2019) with 0.5 switch probability and using timm's augmentation classes (Wightman, 2019), with coefficients respectively 0.5 and 0.2. In this case, we also use RandAugment (Cubuk et al., 2020) with m = 9, n = 2, and mstd = 1, and a label smoothing (Szegedy et al., 2016) of intensity 0.1. To ensure that the layers convey sufficient information and are not weakened by groups, we have set a constant minimum number of channels per group to 64 for all experiments presented in the paper.

Performance (Acc / ECE / AUPR) of Packed-Ensembles for various α and γ with ResNet-50 on CIFAR-100 and M = 4.

Comparison of the effect of the different uncertainty criteria for OOD on CIFAR-100 with different sets of parameters for Packed-Ensembles.

Comparison of the diversity and the performance w.r.t. the different sources of stochasticity on CIFAR-100. ND corresponds to the use of non-deterministic backpropagation algorithms, DI to different initializations, and DB to different compositions of the batches. The standard error (over five runs) is included in small font.

The width and depth of deep neural networks are crucial research topics, and researchers strive to determine the best approaches for increasing the depth of DNNs, which can lead to improved accuracy. According to Nguyen et al. (2020), the width and depth of a DNN are connected with its capacity to learn block structures, which can improve accuracy. Therefore, the model's capacity may decrease if the width is divided.

Comparison of training and inference times of different ensemble techniques using torch1.12.1+cu113 on an RTX 3090. All ensembles have four subnetworks.

that using float16, Packed-Ensembles is only 1.6× slower than the single model during inference. Furthermore, Packed-Ensembles is only 2.3× slower during training than the single model, making it an efficient method capable of training four models in half the time of Deep Ensembles.

Comparison between the results obtained with Packed-Ensembles and a similar ResNeXt-50. The dataset is CIFAR-10.

Comparison of the efficiency of the networks trained on ImageNet (Deng et al., 2009). All ensembles have M = 4 subnetworks and γ = 1. Mult-Adds corresponds to the inference cost, i.e., the number of Giga multiply-add operations for a forward pass, estimated with Torchinfo (2022).

As a result, the ResNeXt subnetworks share parameters and are therefore not independent. We keep the same training optimization procedures and data-augmentation strategies detailed in Appendix B.

M EFFICIENCY OF THE NETWORKS TRAINED ON IMAGENET

This appendix provides the efficiency of the networks trained on ImageNet-1k (see Section 4.1.3), in number of parameters and multiply-additions. PE-(3, 4, 1) was preferred to PE-(3, 4, 2) for ResNet-50 to improve the representation capacity of the subnetworks.

N REGRESSION

To generalize our work, we propose to study regression tasks. We replicate the setting developed by Hernández-Lobato & Adams (2015), Gal & Ghahramani (2016), and Lakshminarayanan et al. (2017).

We compare Packed-Ensembles-(2, 3, 1) and Deep Ensembles on the UCI datasets in Table 11. The subnetworks of these methods are based on multi-layer perceptrons with a single hidden layer, containing 400 neurons for the larger Protein dataset and 200 for the others, and a ReLU non-linearity. The results show that Packed-Ensembles and Deep Ensembles provide equivalent results on most datasets.

Comparison of the results obtained with Packed-Ensembles and Deep Ensembles on regression tasks.

ACKNOWLEDGMENTS

This work was supported by AID Project ACoCaTherm and Hi!Paris. This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-AD011011970R1) and (Grant 2022-AD011011970R2).


In real-world scenarios, distributional shift is crucial, as explained by Ovadia et al. (2019), and it is critical to study how much a model's predictions degrade as the data drifts from the original training distribution. As shown in Figure 9, Packed-Ensembles achieves the highest accuracy and lowest ECE under distributional shift, making the method robust to this source of uncertainty.

J STABILIZATION OF THE PERFORMANCE

We perform each training task five times on CIFAR-10 and CIFAR-100 to obtain better estimates and be able to compute the variance. Let us first note that the standard deviation for the single DNN on CIFAR-100 with a ResNet-50 architecture amounts to 0.68%. Ensemble strategies shrink the standard deviation to 0.43% for Deep Ensembles and 0.19% for Packed-Ensembles. Thus, it seems that Packed-Ensembles makes DNN predictions more stable, in addition to improving accuracy and uncertainty quantification. This result is interesting as it appears to contradict Neal et al. (2019), who claim that wider DNNs have a smaller variance. This stability might come from the ensembling.

K ON THE EQUIVALENCE BETWEEN SEQUENTIAL TRAINING AND PACKED-ENSEMBLES

The sequential training of Deep Ensembles differs significantly from the training procedure of Packed-Ensembles. The main differences lie in the subnetworks' batch composition and the selection of the best models. Concerning Packed-Ensembles, the batches are strictly the same for all subnetworks, thus removing one source of stochasticity compared to sequential learning. Yet, in practice, we show empirically that random initialization and stochastic algorithms are sufficient to get diverse subnetworks (see Appendix F for more details). For the selection of models, Packed-Ensembles considers the subnetworks as a whole (i.e., it maximizes the ensemble accuracy on the validation set) and therefore selects the best ensemble at a given epoch. On the other hand, sequential training selects the best networks individually, possibly at different epochs, which does not guarantee that the best ensemble is selected but ensures the optimality of the subnetworks over the epochs.
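The two selection strategies can be summarized by the following pure-Python sketch (function names are ours): Packed-Ensembles keeps a single epoch for the whole ensemble, while sequential training keeps a possibly different epoch per member.

```python
def select_packed(val_acc_ensemble):
    """PE: pick the one epoch where the *ensemble* validation accuracy peaks.
    val_acc_ensemble: list of ensemble accuracies, one per epoch."""
    return max(range(len(val_acc_ensemble)), key=val_acc_ensemble.__getitem__)

def select_sequential(val_acc_members):
    """Sequential DE: pick, for each member, its own best epoch.
    val_acc_members: one per-epoch accuracy list per member."""
    return [max(range(len(acc)), key=acc.__getitem__) for acc in val_acc_members]
```

For example, `select_packed([0.1, 0.3, 0.2])` returns epoch 1, whereas `select_sequential([[0.1, 0.5], [0.6, 0.2]])` returns `[1, 0]`: each member is frozen at a different epoch, which is exactly why sequential selection cannot guarantee the best ensemble.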

L USING GROUPS IS NOT SUFFICIENT TO EQUAL PACKED-ENSEMBLES

To make sure that our results cannot simply be explained by the use of groups, we compare Packed-Ensembles to a single ResNeXt-50 (32×4d) (Xie et al., 2017) in Table 9. ResNeXt-50 is roughly equivalent to our method but does not propagate groups, which are only used in the middle layer of each bottleneck block.

