PACKED-ENSEMBLES FOR EFFICIENT UNCERTAINTY ESTIMATION

Abstract

Deep Ensembles (DE) are a prominent approach for achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, the hardware limitations of real-world systems constrain practitioners to smaller ensembles and lower-capacity networks, significantly deteriorating their performance and properties. We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single shared backbone and a single forward pass, improving training and inference speeds. PE is designed to operate within the memory limits of a standard neural network. Our extensive experiments indicate that PE accurately preserves the properties of DE, such as diversity, and performs equally well in terms of accuracy, calibration, out-of-distribution detection, and robustness to distribution shift. We make our code available at github.

1. INTRODUCTION

Real-world safety-critical machine learning decision systems, such as autonomous driving (Levinson et al., 2011; McAllister et al., 2017), impose exceptionally high reliability and performance requirements across a broad range of metrics: accuracy, calibration, robustness to distribution shifts, uncertainty estimation, and computational efficiency under limited hardware resources. Despite significant improvements in performance in recent years, vanilla Deep Neural Networks (DNNs) still exhibit several shortcomings, notably overconfidence in both correct and wrong predictions (Nguyen et al., 2015; Guo et al., 2017; Hein et al., 2019). Deep Ensembles (Lakshminarayanan et al., 2017) have emerged as a prominent approach to address these challenges by leveraging predictions from multiple high-capacity neural networks. By averaging predictions or by voting, DE achieve high accuracy and robustness, since potentially unreliable predictions are exposed via the disagreement between individual members. Thanks to the simplicity and effectiveness of the ensembling strategy (Dietterich, 2000), DE have become widely used and dominate performance across various benchmarks (Ovadia et al., 2019; Gustafsson et al., 2020). DE meet most real-world application requirements except computational efficiency. Specifically, DE are computationally demanding in terms of memory storage, number of operations, and inference time during both training and testing, as their costs grow linearly with the number of members, and are therefore prohibitive under tight hardware constraints. Numerous computationally efficient alternatives have been proposed; these typically improve storage usage, training cost, or inference time at the expense of accuracy and diversity in the predictions. Yet an essential property of ensembles for improving predictive uncertainty estimation is precisely the diversity of their predictions.
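The DE recipe referenced above (train several networks independently, then average their predicted class probabilities, using their disagreement as an uncertainty cue) can be sketched in a few lines. This is a minimal illustration, not the paper's experimental setup: the toy classifier, ensemble size, and input shape are placeholder choices.

```python
import torch
import torch.nn as nn

def make_model():
    # Toy linear classifier on CIFAR-sized inputs; in practice each
    # ensemble member would be a full DNN trained separately.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

M = 4  # ensemble size
ensemble = [make_model() for _ in range(M)]  # independent weights via random init

x = torch.randn(8, 3, 32, 32)  # a batch of 8 images
with torch.no_grad():
    # Stack each member's softmax output: shape (M, batch, classes).
    probs = torch.stack([m(x).softmax(dim=-1) for m in ensemble])

mean_probs = probs.mean(dim=0)            # averaged predictive distribution, (8, 10)
disagreement = probs.var(dim=0).sum(-1)   # simple per-sample diversity proxy, (8,)
pred = mean_probs.argmax(dim=-1)          # final class decision
```

The averaged distribution is what the accuracy and calibration metrics are computed on, while quantities such as the variance across members feed uncertainty estimation and OOD detection.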
Perrone & Cooper (1992) show that the independence of individuals is critical to the success of ensembling. Fort et al. (2019) argue that the diversity of DE, which stems from randomness in weight initialization, data augmentation and batching, and stochastic gradient updates, is superior to that of efficient ensembling alternatives, despite the predictive performance boosts of the latter. Few approaches manage to mirror this property of DE at a computational cost close to that of a single DNN (in terms of memory usage, number of forward passes, and image throughput).

In this work, we aim to design a DNN architecture that closely mimics the properties of ensembles, in particular having a set of independent networks, in a computationally efficient manner. Previous works propose ensembles composed of small models (Kondratyuk et al., 2020; Lobacheva et al., 2020) and achieve performance comparable to a single large model. We build upon this idea and devise a strategy based on small networks that aims to match the performance of an ensemble of large networks. To this end, we leverage grouped convolutions (Krizhevsky et al., 2012) to delineate multiple subnetworks within the same network. The parameters of each subnetwork are not shared across subnetworks, yielding independent smaller models. This method enables fast training and inference while keeping predictive uncertainty quantification close to that of DE (Figure 1). In summary, our contributions are the following:

• We propose Packed-Ensembles (PE), an efficient ensembling architecture relying on grouped convolutions, as a formalization of structured sparsity for Deep Ensembles;
• We extensively evaluate PE in terms of accuracy, calibration, OOD detection, and distribution shift on classification and regression tasks, and show that PE achieves state-of-the-art predictive uncertainty quantification;
• We thoroughly study and discuss the properties of PE (diversity, sparsity, stability, behavior of subnetworks) and release our PyTorch implementation.
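The key mechanism behind these contributions is that a grouped convolution with M groups splits its channels into M slices with entirely separate weights, so a single layer can host M independent subnetworks. The following sketch illustrates the idea under assumed layer sizes (the channel counts and ensemble size here are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

M = 4                   # number of subnetworks (ensemble members)
c_in, c_out = 16, 32    # channels per subnetwork

# One grouped convolution packs M independent convolutions into a single layer.
packed = nn.Conv2d(M * c_in, M * c_out, kernel_size=3,
                   padding=1, groups=M, bias=False)

# Its parameter count equals M separate small convolutions, not one dense one.
dense = nn.Conv2d(M * c_in, M * c_out, kernel_size=3, padding=1, bias=False)
assert packed.weight.numel() == dense.weight.numel() // M

# Group g only sees input channels [g*c_in, (g+1)*c_in): perturbing the first
# subnetwork's input slice leaves all other subnetworks' outputs unchanged.
x = torch.randn(1, M * c_in, 8, 8)
y = packed(x)
x2 = x.clone()
x2[:, :c_in] += 1.0            # perturb subnetwork 0's input only
y2 = packed(x2)
assert torch.allclose(y[:, c_out:], y2[:, c_out:])  # groups 1..M-1 unaffected
```

The second assertion is the independence property discussed above: information never crosses group boundaries, so the M subnetworks behave like separate models while sharing one forward pass.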

2. BACKGROUND

In this section, we present the formalism for this work and offer a brief background on grouped convolutions and ensembles of DNNs. Appendix A summarizes the main notations in Table 3.

2.1. BACKGROUND ON CONVOLUTIONS

The convolutional layer (LeCun et al., 1989) consists of a series of cross-correlations between feature maps $h^j \in \mathbb{R}^{C_j \times H_j \times W_j}$, regrouped in batches of size $B$, and a weight tensor $\omega^j \in \mathbb{R}^{C_{j+1} \times C_j \times s_j^2}$, where $s_j$ is the kernel size.
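In PyTorch terms, this corresponds to a standard `nn.Conv2d` whose weight tensor has shape $(C_{j+1}, C_j, s_j, s_j)$. A quick shape check under arbitrary example sizes:

```python
import torch
import torch.nn as nn

B, C_j, H_j, W_j = 8, 16, 32, 32   # batch size and input feature-map dimensions
C_j1, s_j = 32, 3                  # output channels C_{j+1} and kernel size s_j

conv = nn.Conv2d(C_j, C_j1, kernel_size=s_j, padding=s_j // 2, bias=False)
# The weight tensor matches omega^j's dimensions: (C_{j+1}, C_j, s_j, s_j).
assert conv.weight.shape == (C_j1, C_j, s_j, s_j)

h_j = torch.randn(B, C_j, H_j, W_j)   # a batch of input feature maps h^j
h_j1 = conv(h_j)                      # cross-correlation, as in the definition
# "Same" padding preserves the spatial size here.
assert h_j1.shape == (B, C_j1, H_j, W_j)
```

Note that, as is conventional in deep learning, the layer computes a cross-correlation rather than a true (flipped-kernel) convolution; the distinction is immaterial since the weights are learned.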



Figure 1: Evaluation of computation cost vs. performance trade-offs for multiple uncertainty quantification techniques on CIFAR-100. The y-axis shows accuracy and the x-axis shows inference speed in images per second. The circle area is proportional to the number of parameters. Optimal approaches are closer to the top-right corner. Packed-Ensembles strikes a good balance between predictive performance and speed.

This limitation of DE has inspired numerous approaches proposing computationally efficient alternatives: multi-head networks (Lee et al., 2015; Chen & Shrivastava, 2020), ensemble-imitating layers (Wen et al., 2019; Havasi et al., 2020; Ramé et al., 2021), multiple forwards on different weight subsets of the same network (Gal & Ghahramani, 2016; Durasov et al., 2021), ensembles of smaller networks (Kondratyuk et al., 2020; Lobacheva et al., 2020), computing ensembles from a single training run (Huang et al., 2017; Garipov et al., 2018), and efficient Bayesian Neural Networks (Maddox et al., 2019; Franchi et al., 2020).
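Among these families, "multiple forwards on different weight subsets" is easy to illustrate: MC-Dropout (Gal & Ghahramani, 2016) keeps dropout active at test time and averages several stochastic passes through the same network, each pass effectively using a different weight subset. A hedged sketch, with an arbitrary toy architecture and sample count:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # the randomness source: a different weight subset per pass
    nn.Linear(128, 10),
)

model.train()  # keep dropout active at inference time: the MC-Dropout trick
x = torch.randn(4, 3, 32, 32)
T = 16  # number of stochastic forward passes
with torch.no_grad():
    # Each pass drops a different random subset of units: shape (T, batch, classes).
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(T)])

mean_probs = probs.mean(dim=0)       # predictive distribution, (4, 10)
epistemic = probs.var(dim=0).sum(-1) # dispersion across passes as an uncertainty cue
```

This achieves ensemble-like averaging with a single set of stored weights, but the T passes are correlated through that shared network, which is precisely the diversity limitation relative to DE discussed in the next paragraph.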

