MORE OR LESS: WHEN AND HOW TO BUILD CONVOLUTIONAL NEURAL NETWORK ENSEMBLES

Abstract

Convolutional neural networks are used to solve increasingly complex problems with ever more data. As a result, researchers and practitioners seek to scale the representational power of such models by adding more parameters. However, additional parameters require additional critical resources in terms of memory and compute, leading to higher training and inference costs. Thus, a consistent challenge is to obtain the highest possible accuracy within a given parameter budget. As neural network designers navigate this complex landscape, they are guided by conventional wisdom informed by past empirical studies. We identify a critical part of this design space that is not well understood: how to decide between expanding a single convolutional network and increasing the number of networks in the form of an ensemble. We study this question in detail across various network architectures and data sets. We build an extensive experimental framework that captures numerous angles of the possible design space in terms of how a new set of parameters can be used in a model, and we consider a holistic set of metrics such as training time, inference time, and memory usage. The framework provides a robust assessment by controlling for the total number of parameters. Contrary to conventional wisdom, we show that a holistic and robust assessment uncovers a wide design space where ensembles provide better accuracy, train faster, and deploy at speeds comparable to single convolutional networks with the same total number of parameters.

1. INTRODUCTION

Scaling capacity of deep learning models. Convolutional neural network models are becoming as accurate as humans on perceptual tasks. They are now used in numerous and diverse applications such as drug discovery, data compression, and automated gameplay. These models increasingly grow in size, with more parameters and layers, driven by two major trends. First, there is a continuous rise in data complexity and size in many applications (Shazeer et al., 2017). Second, there is an increasing need for higher accuracy as models are utilized in more critical applications, such as self-driving cars and medical diagnosis (Grzywaczewski, 2017). This effect is especially pronounced in computer vision and natural language processing: model sizes are three orders of magnitude larger than they were just three years ago (Sanh et al., 2019). With bigger model sizes, the time, computation, and memory needed to train and deploy such models also increase. Thus, it is a consistent challenge to design models that maximize accuracy while remaining practical with respect to the resources they need (Lee et al., 2015; Huang et al., 2017b). In this paper, we study the following question: Given a number of parameters (neurons), how should we design a convolutional neural network to optimize holistically for accuracy, training cost, and inference cost?

The holistic design space is very complex. Designers of convolutional neural network models navigate a complex design landscape to address this question. First, they need to decide on a network architecture. Then, they have to consider whether to use a single network or build an ensemble model with multiple networks. Additionally, they have to decide how many neural networks to use and their individual designs, i.e., the depth, width, and number of networks in their model. Modern applications with diverse requirements further complicate these decisions, as what is desirable varies.
Facebook, for instance, requires convolutional neural network models that strike specific tradeoffs between accuracy and inference time across 250 different types of smartphones (Wu et al., 2019). As a result, not just accuracy but a diversity of metrics, such as inference time and memory usage, inform whether a model gets used (Sze et al., 2017b).

Scattered conventional wisdom. There exist bits and pieces of scattered conventional wisdom to guide a neural network designer. These take the form of various empirical studies that demonstrate how depth and width in a single neural network model relate to certain metrics such as accuracy. First, it is generally known that deeper and wider networks can improve accuracy. In fact, recent convolutional architectures, such as ResNets and DenseNets, are designed precisely to enable this outcome (He et al., 2016; Huang et al., 2017a;b). The caveat with enlarging a single neural network is that accuracy runs into diminishing returns as we continue to add more layers or widen existing ones (Coates et al., 2011; Dauphin and Bengio, 2013). On the other hand, increasing the number of networks in the model, i.e., building ensembles, is considered a relatively robust but expensive approach to improving accuracy, as ensemble models train and deploy k networks instead of one (Russakovsky et al., 2015; Wasay et al., 2020). The consensus is to use ensembles when the goal is to achieve high accuracy without much regard to training cost, inference time, and memory usage, e.g., in competitions such as COCO and ImageNet (Lee et al., 2015; Russakovsky et al., 2015; Huang et al., 2017a; Ju et al., 2017). All these studies, however, exist in silos. Any form of cross-comparison is impossible as they use different data sets, network architectures, and hardware.

Lack of a robust and holistic assessment. Most past studies operate within the confines of a single convolutional network and do not consider the dimension of ensemble models.
Those that do compare with ensembles mostly do so unfairly, comparing an ensemble of k networks against a model that contains only one such network (Lee et al., 2015; Russakovsky et al., 2015; Huang et al., 2017a; Ju et al., 2017). There are recent studies that make this comparison under a fixed parameter budget (Chirkova et al., 2020; Kondratyuk et al., 2020). However, these studies consider only the metric of generalization accuracy and explore a very small part of the design space: two classes of convolutional architectures at a single depth. A holistic analysis needs to include resource-related metrics such as training time, inference cost, and memory usage; all of these metrics are critical for practical applications (Sze et al., 2017a; Wu et al., 2019). Furthermore, to provide reliable guidance to a model designer, a robust comparison needs to consider a range of architectures and model sizes with various depth and width configurations. This is critical, especially because varying just the width of convolutional networks in isolation, as done by recent studies (Chirkova et al., 2020; Kondratyuk et al., 2020), is known to be far less effective at improving accuracy (Eigen et al., 2013; Ba and Caruana, 2014).

Single networks vs. ensembles. In this paper, we bridge the gap in the understanding of the design space by providing answers to the following questions. Given specific requirements in terms of accuracy, training time, and inference time, should we train and deploy a convolutional model with a single network or one that contains an ensemble of networks? How should we design the networks within an ensemble? As these constraints and requirements evolve, should we switch between these alternatives, and if so, why and when?

Method. We introduce the following methodology to map the design space accurately.
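To make the notion of a parameter-matched comparison concrete, here is a minimal sketch (our illustration, not the paper's framework). It assumes a hypothetical, simplified architecture of plain 3x3 convolutional stacks, and picks the per-network width of a k-network ensemble so that the ensemble's total parameter count fits within the budget of a single larger network:

```python
# Hypothetical illustration of a fixed-parameter-budget comparison.
# Assumption (not from the paper): each network is a plain stack of
# 3x3 conv layers with a constant channel width.

def conv_stack_params(depth: int, width: int, in_channels: int = 3) -> int:
    """Parameter count of `depth` 3x3 conv layers (weights + biases),
    each with `width` output channels."""
    total = 0
    channels = in_channels
    for _ in range(depth):
        total += channels * width * 9 + width  # 3x3 kernels + biases
        channels = width
    return total

def matched_width(budget: int, k: int, depth: int) -> int:
    """Largest per-network width such that k networks of the given
    depth still fit within `budget` parameters in total."""
    width = 1
    while k * conv_stack_params(depth, width + 1) <= budget:
        width += 1
    return width

single = conv_stack_params(depth=16, width=64)  # one "large" network
w = matched_width(single, k=4, depth=16)        # width per ensemble member
ensemble = 4 * conv_stack_params(16, w)         # 4-network ensemble
assert ensemble <= single                       # same parameter budget
```

Only under such budget matching does "single network vs. ensemble" become a fair question; without it, an ensemble of k networks simply has k times the parameters of its baseline.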
Since there is no robust theoretical framework to consistently analyze the design space and the complex interactions among its many parameters and metrics, we develop a detailed and extensive experimental framework to isolate the impact of the critical design knobs, (i) depth, (ii) width, and (iii) number of networks, on all relevant metrics: (i) accuracy, (ii) training time, (iii) inference time, and (iv) memory usage. Crucially, the number of parameters is a control knob in our framework, and we only compare alternatives under the same parameter budget. To establish the robustness of our findings, we experiment across various architectures, data complexities, and classification tasks. We present and analyze data amounting to over one year of GPU run time. We also explain trends by breaking metrics down into their constituents when necessary.

Results: The Ensemble Switchover Threshold (EST). (i) Contrary to conventional wisdom, we show that when we make a holistic and robust comparison between single convolutional networks

