SUPERWEIGHT ENSEMBLES: AUTOMATED COMPOSITIONAL PARAMETER SHARING ACROSS DIVERSE ARCHITECTURES

Abstract

Neural net ensembles boost task performance, but have excessive storage requirements. Recent work in efficient ensembling has made the memory cost more tractable by sharing learned parameters between ensemble members. Existing efficient ensembles have high predictive accuracy, but they are overly restrictive in two ways: 1) They constrain ensemble members to have the same architecture, limiting their usefulness in applications such as anytime inference, and 2) They reduce the parameter count for a small predictive performance penalty, but do not provide an easy way to trade off parameter count for predictive performance without increasing inference time. In this paper, we propose SuperWeight Ensembles, an approach for architecture-agnostic parameter sharing. SuperWeight Ensembles share parameters between layers that have sufficiently similar computation, even if they have different shapes. This allows anytime prediction with heterogeneous ensembles by selecting a subset of members during inference, a flexibility not supported by prior work. In addition, SuperWeight Ensembles provide control over the total number of parameters used, allowing us to increase or decrease the number of parameters without changing the model architecture. On the anytime prediction task, our method shows a consistent boost over prior work while allowing for more flexibility in architectures and efficient parameter sharing. SuperWeight Ensembles preserve the performance of prior work in the low-parameter regime, and even outperform fully-parameterized ensembles with 17% fewer parameters on CIFAR-100 and 50% fewer parameters on ImageNet.

1. INTRODUCTION

Ensembling aggregates the predictions of multiple models to boost task performance (Atwood et al., 2020; Ostyakov & Nikolenko, 2019) while also improving robustness, calibration, and accuracy (Lakshminarayanan et al., 2017; Dietterich, 2000). However, it also requires more memory and compute, since multiple models must be trained, stored, and used for inference. Efficient ensembling reduces the total number of parameters by sharing parameters between members (e.g., Lee et al., 2015; Wen et al., 2020; Wenzel et al., 2020). As illustrated in Figure 1(a), these methods share parameters while introducing diversity by a) perturbing the shared parameters to create distinct layer weights or b) perturbing layer inputs for each ensemble member.

A significant drawback of these methods is that they often make strong architectural assumptions, such as ensemble member homogeneity (i.e., every member has the same architecture), which limits their use. For example, homogeneous ensembles are ill-suited to tasks like anytime prediction because they offer only n options for computational complexity, where n is the number of ensemble members. In contrast, a heterogeneous ensemble can select any subset of its members to provide a range of inference times (e.g., a 4-member heterogeneous ensemble can adjust to $\binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 15$ levels of inference latency). Thus, we present SuperWeight Ensembles, which allow for parameter-efficient anytime inference by using heterogeneous ensemble members. Figure 1(c) shows that our heterogeneous ensembles achieve state-of-the-art performance in anytime inference.

In this paper, we propose SuperWeight Ensembles, a method for efficient ensembling using automated parameter sharing between diverse architectures. A key challenge we address is learning where we can reuse parameters for heterogeneous ensembles, where ensemble members may have
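The subset-counting argument above is easy to verify directly: every non-empty subset of a heterogeneous ensemble's members is a distinct inference configuration, while a homogeneous ensemble only distinguishes subsets by their size. The sketch below illustrates this; the per-member costs are illustrative assumptions, not values from the paper.

```python
from itertools import combinations

# Hypothetical per-member inference costs (ms) for a 4-member
# heterogeneous ensemble; the values are assumed for illustration.
member_costs = [5.0, 8.0, 12.0, 20.0]
n = len(member_costs)

# Every non-empty subset of members is a valid anytime-inference
# configuration: C(4,1) + C(4,2) + C(4,3) + C(4,4) = 15 in total.
subsets = [s for k in range(1, n + 1)
           for s in combinations(range(n), k)]
print(len(subsets))  # 15 latency levels

# With heterogeneous members, different subsets generally have
# different total costs, yielding a fine-grained latency range.
latency_levels = sorted({sum(member_costs[i] for i in s) for s in subsets})

# A homogeneous ensemble (identical members) only distinguishes
# subsets by size, giving just n distinct compute budgets.
homogeneous_levels = {k * member_costs[0] for k in range(1, n + 1)}
print(len(homogeneous_levels))  # 4 latency levels
```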

