SUPERWEIGHT ENSEMBLES: AUTOMATED COMPOSITIONAL PARAMETER SHARING ACROSS DIVERSE ARCHITECTURES

Abstract

Neural network ensembles boost task performance, but have excessive storage requirements. Recent work in efficient ensembling has made the memory cost more tractable by sharing learned parameters between ensemble members. Existing efficient ensembles achieve high predictive accuracy, but they are overly restrictive in two ways: 1) they constrain ensemble members to have the same architecture, limiting their usefulness in applications such as anytime inference, and 2) they reduce the parameter count at a small predictive performance penalty, but do not provide an easy way to trade off parameter count for predictive performance without increasing inference time. In this paper, we propose SuperWeight Ensembles, an approach for architecture-agnostic parameter sharing. SuperWeight Ensembles share parameters between layers that perform sufficiently similar computation, even if they have different shapes. This allows anytime prediction with heterogeneous ensembles by selecting a subset of members during inference, a flexibility not supported by prior work. In addition, SuperWeight Ensembles provide control over the total number of parameters used, allowing us to increase or decrease the parameter count without changing the model architecture. On the anytime prediction task, our method shows a consistent boost over prior work while allowing more flexibility in architectures and efficient parameter sharing. SuperWeight Ensembles preserve the performance of prior work in the low-parameter regime, and even outperform fully-parameterized ensembles with 17% fewer parameters on CIFAR-100 and 50% fewer parameters on ImageNet.

1. INTRODUCTION

Ensembling aggregates the predictions of multiple models in an effort to boost task performance (Atwood et al., 2020; Ostyakov & Nikolenko, 2019) while also improving robustness, calibration, and accuracy (Lakshminarayanan et al., 2017; Dietterich, 2000). However, it also requires more memory and compute, since multiple models need to be trained, stored, and used for inference. Efficient ensembling reduces the total number of parameters by sharing parameters between members (e.g., Lee et al., 2015; Wen et al., 2020; Wenzel et al., 2020). As illustrated in Figure 1(a), these methods share parameters while introducing diversity by a) perturbing the shared parameters to create distinct layer weights or b) perturbing layer inputs for each ensemble member. A significant drawback of these methods is that they often make strong architectural assumptions, such as ensemble member homogeneity (i.e., each member has the same architecture), which limits their use. For example, homogeneous ensembles are ill-suited to tasks like anytime prediction because one only has $n$ options for computational complexity, where $n$ is the number of ensemble members. In contrast, a heterogeneous ensemble can select any subset of its members to provide a range of inference times (e.g., a 4-member heterogeneous ensemble can adjust to $\binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 15$ levels of inference latency).

Thus, we present SuperWeight Ensembles, a method for efficient ensembling that uses automated parameter sharing between diverse architectures to enable parameter-efficient anytime inference with heterogeneous ensemble members. Figure 1(c) shows that our heterogeneous ensembles achieve state-of-the-art performance in anytime inference. A key challenge we address is learning where parameters can be reused in heterogeneous ensembles, whose members may have architectures that vary in both the number of layers and the number of channels.
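The count of anytime operating points above is simply the number of non-empty subsets of ensemble members, which generalizes to $2^n - 1$ for $n$ members. A minimal sketch of this bookkeeping (the helper name `anytime_operating_points` is our own, not from any released code):

```python
from itertools import combinations
from math import comb

def anytime_operating_points(n_members: int) -> int:
    # Number of non-empty member subsets:
    # C(n,1) + C(n,2) + ... + C(n,n) = 2^n - 1.
    return sum(comb(n_members, k) for k in range(1, n_members + 1))

# A 4-member heterogeneous ensemble offers 15 latency/accuracy operating points.
assert anytime_operating_points(4) == 15

# Enumerating the subsets themselves, e.g., to pick the cheapest one
# that fits a latency budget at inference time:
members = ["net_a", "net_b", "net_c", "net_d"]
subsets = [s for k in range(1, len(members) + 1)
           for s in combinations(members, k)]
assert len(subsets) == 15
```

With homogeneous members, subsets of equal size cost the same, collapsing these 15 options back to just 4 distinct latency levels, which is why heterogeneity matters for anytime prediction.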
Prior work has explored strategies where parameters are shared between members that have constant depths but varying widths (Yu & Huang, 2019b; Yu et al., 2019; Wang et al., 2020; Li et al., 2021), as well as cases where the widths are held constant but the depths vary (Ruiz & Verbeek, 2021; Kaya et al., 2019; Huang et al., 2018; Yang et al., 2020; Wu et al., 2018). In contrast, as illustrated in Figure 1(b), SuperWeight Ensemble members can vary in both the widths and depths of the network. Our approach is built on the intuition that, despite being trained independently, two models trained on the same task likely have to detect similar features. However, these features may occur in different locations (i.e., different layers) across ensemble members, especially when members have distinct network architectures. Our goal, then, is to detect where these recurring computations exist in our heterogeneous ensemble.

A neural network can be seen as a composition of feature detectors (Savarese & Maire, 2019). Motivated by this and the Hebbian principle that "neurons that fire together wire together", we propose a technique that groups detectors that frequently co-occur into units we call SuperWeights. Our SuperWeights are analogous to SuperPixels (Ren & Malik, 2003), which represent a semantically coherent region of pixels within an image; our SuperWeights instead represent feature detectors that frequently co-occur. An example of a SuperWeight would be the filters of a CNN that capture a distinctive pattern. Prior work has shown such patterns may repeat across many types of classification networks (Raghu et al., 2021), which we take advantage of in our approach. A single layer may require a concatenation of several SuperWeights, each of which may be shared with layers within the same ensemble member or across members. To learn these complex sharing patterns, we propose a gradient-analysis approach for determining where SuperWeights can be reused.
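This excerpt does not specify the form of the gradient analysis, but one plausible, purely illustrative instantiation decides sharing by the directional similarity of training gradients: blocks whose gradients consistently point the same way are treated as performing similar computation. The names `gradient_similarity`, `should_share`, and the threshold value below are all our own assumptions, not the paper's method:

```python
import numpy as np

def gradient_similarity(grad_a: np.ndarray, grad_b: np.ndarray) -> float:
    # Cosine similarity between the flattened gradients of two candidate blocks.
    a, b = grad_a.ravel(), grad_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_share(grad_a: np.ndarray, grad_b: np.ndarray,
                 threshold: float = 0.5) -> bool:
    # Hypothetical sharing rule: blocks whose gradients point in similar
    # directions are candidates for sharing one SuperWeight.
    return gradient_similarity(grad_a, grad_b) >= threshold

# Aligned gradients -> candidates for sharing; orthogonal ones stay separate.
g1 = np.array([[1.0, 2.0], [3.0, 4.0]])
assert should_share(g1, 2.0 * g1)                                    # cosine = 1.0
assert not should_share(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # cosine = 0.0
```

Note that blocks of different shapes would first need to be brought to a common size (e.g., by comparing only overlapping sub-blocks) before such a comparison applies.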
To further improve our model's parameter efficiency, we also take advantage of template mixing (e.g., Bagherinezhad et al., 2017; Plummer et al., 2022; Savarese & Maire, 2019). We construct SuperWeights as weighted linear combinations of templates of trainable parameters, which we call Weight Templates. This creates a hierarchical representation for neural network weight generation: we begin by combining Weight Templates to create SuperWeights, then concatenate SuperWeights to create the layer weights used by each ensemble member. Two SuperWeights using the same Weight Templates may learn different linear coefficients for combining them, allowing for per-layer SuperWeight sub-specialization. This template mixing can boost performance when the same parameters are reused many times, since each combination of templates can use a unique set of coefficients, and it also helps us support a wide range of parameter budgets. The parameter count can be adjusted by increasing or decreasing the number of shared templates, allowing for parameter-budget vs. performance trade-offs with no change in model architecture.
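The two-level construction above can be sketched in a few lines: a bank of shared Weight Templates, SuperWeights formed as linear combinations of them with per-SuperWeight coefficients, and a layer weight formed by concatenating SuperWeights. This is an illustrative NumPy sketch under our own shape and naming assumptions (`superweight`, a 3-template bank, a 3x3 conv weight), not the paper's implementation; in practice both the templates and the coefficients would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared bank of Weight Templates: 3 templates for a (64, 64, 3, 3) conv weight.
templates = rng.standard_normal((3, 64, 64, 3, 3))

def superweight(coeffs: np.ndarray, templates: np.ndarray) -> np.ndarray:
    # A SuperWeight is a linear combination of the shared templates;
    # each SuperWeight learns its own coefficients (here drawn at random).
    return np.tensordot(coeffs, templates, axes=1)

# Two SuperWeights reuse the same templates with independent coefficients,
# enabling per-layer sub-specialization despite full parameter sharing.
sw1 = superweight(rng.standard_normal(3), templates)
sw2 = superweight(rng.standard_normal(3), templates)

# A wider layer's weight is a concatenation of SuperWeights along out-channels.
layer_weight = np.concatenate([sw1, sw2], axis=0)
assert layer_weight.shape == (128, 64, 3, 3)
```

Under this scheme, growing or shrinking the template bank changes the parameter count while every generated layer weight keeps its original shape, which is what decouples the parameter budget from the architecture.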



Figure 1: Comparison to prior work. (a) Prior work in efficient ensembling (e.g., Lee et al., 2015; Wen et al., 2020; Wenzel et al., 2020) uses hand-crafted strategies that require ensemble members to have identical architectures and adds diversity by perturbing weights and/or features. In contrast, (b) illustrates our SuperWeight Ensembles, which learn effective soft parameter sharing between members, even for diverse architectures. As shown in (c), this enables our approach to support a range of inference times while outperforming prior work in efficient ensembling and anytime inference (Ruiz & Verbeek, 2021; Havasi et al., 2021; Wen et al., 2020; Yu et al., 2019; Yu & Huang, 2019a) on CIFAR-100 using WRN-28-5 (Zagoruyko & Komodakis, 2016).

