GROWING EFFICIENT DEEP NETWORKS BY STRUCTURED CONTINUOUS SPARSIFICATION

Abstract

We develop an approach to growing deep network architectures over the course of training, driven by a principled combination of accuracy and sparsity objectives. Unlike existing pruning or architecture search techniques that operate on full-sized models or supernet architectures, our method can start from a small, simple seed architecture and dynamically grow and prune both layers and filters. By combining a continuous relaxation of discrete network structure optimization with a scheme for sampling sparse subnetworks, we produce compact, pruned networks, while also drastically reducing the computational expense of training. For example, we achieve 49.7% inference FLOPs and 47.4% training FLOPs savings compared to a baseline ResNet-50 on ImageNet, while maintaining 75.2% top-1 accuracy, all without any dedicated fine-tuning stage. Experiments across CIFAR, ImageNet, PASCAL VOC, and Penn Treebank, with convolutional networks for image classification and semantic segmentation, and recurrent networks for language modeling, demonstrate that we both train faster and produce more efficient networks than competing architecture pruning or search methods.

1. INTRODUCTION

Deep neural networks are the dominant approach to a variety of machine learning tasks, including image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), object detection (Girshick, 2015; Liu et al., 2016), semantic segmentation (Long et al., 2015; Chen et al., 2017), and language modeling (Zaremba et al., 2014; Vaswani et al., 2017; Devlin et al., 2019). Modern neural networks are overparameterized, and training larger networks usually yields improved generalization accuracy. Recent work (He et al., 2016; Zagoruyko & Komodakis, 2016; Huang et al., 2017) illustrates this trend through increasing depth and width of convolutional neural networks (CNNs). Yet, training is compute-intensive, and real-world deployments are often limited by parameter and compute budgets. Neural architecture search (NAS) (Zoph & Le, 2017; Liu et al., 2019; Luo et al., 2018; Pham et al., 2018; Savarese & Maire, 2019) and model pruning (Han et al., 2015; 2016; Guo et al., 2016) methods aim to reduce these burdens. NAS addresses an issue that further compounds training cost: the enormous space of possible network architectures. While hand-tuning architectural details, such as the connection structure of convolutional layers, can improve performance (Iandola et al., 2016; Sifre & Mallat, 2014; Chollet, 2017; Howard et al., 2017; Zhang et al., 2018; Huang et al., 2018), a principled way of deriving such designs remains elusive. NAS methods aim to automate exploration of possible architectures, producing an efficient design for a target task under practical resource constraints. However, during training, most NAS methods operate on a large supernet architecture, which encompasses candidate components beyond those that are eventually selected for inclusion in the resulting network (Zoph & Le, 2017; Liu et al., 2019; Luo et al., 2018; Pham et al., 2018; Savarese & Maire, 2019).
Consequently, NAS-based training may typically be more thorough, but more computationally expensive, than training a single hand-designed architecture. Model pruning techniques similarly focus on improving the resource efficiency of neural networks during inference, at the possible expense of increased training cost. Common strategies aim to generate a lighter version of a given network architecture by removing individual weights (Han et al., 2015; 2016; Molchanov et al., 2017) or structured parameter sets (Li et al., 2017; He et al., 2018; Luo et al., 2017). However, the majority of these methods train a full-sized model prior to pruning and, after pruning, utilize additional fine-tuning phases in order to maintain accuracy. Hubara et al. (2016) and Rastegari et al. (2016) propose the use of binary weights and activations, allowing inference to benefit from reduced storage costs and efficient computation through bit-counting operations. Yet, training still involves tracking high-precision weights alongside lower-precision approximations. We take a unified view of pruning and architecture search, regarding both as acting on a configuration space, and propose a method to dynamically grow deep networks by continuously reconfiguring their architecture during training.
Our approach not only produces models with efficient inference characteristics, but also reduces the computational cost of training; see Figure 1. Rather than starting with a full-sized network or a supernet, we start from simple seed networks and progressively adjust (grow and prune) them. Specifically, we parameterize an architectural configuration space with indicator variables governing addition or removal of structural components. Figure 2(a) shows an example, in the form of a two-level configuration space for CNN layers and filters. We enable learning of indicator values (and thereby, architectural structure) by combining a continuous relaxation with binary sampling, as illustrated in Figure 2(b). A per-component temperature parameter ensures that long-lived structures are eventually baked into the network's discrete architectural configuration. While the recently proposed AutoGrow (Wen et al., 2020) also seeks to grow networks over the course of training, our technical approach differs substantially and leads to significant practical advantages. At a technical level, AutoGrow implements an architecture search procedure over a predefined modular structure, subject to hand-crafted, accuracy-driven growing and stopping policies. In contrast, we parameterize architectural configurations and utilize stochastic gradient descent to learn the auxiliary variables that specify structural components, while simultaneously training the weights within those components. Our technical approach yields the following advantages:

• Fast Training by Growing: Training is a unified procedure, from which one can request a network structure and associated weights at any time. Unlike AutoGrow and the majority of pruning techniques, fine-tuning to optimize weights in a discovered architecture is optional. We achieve excellent results even without any fine-tuning stage.
• Principled Approach via Learning by Continuation + Sampling: We formulate our approach in the spirit of learning-by-continuation methods, which relax a discrete optimization problem to an increasingly stiff continuous approximation. Critically, we introduce an additional sampling step to this strategy. From this combination, we gain the flexibility of exploring a supernet architecture, but the computational efficiency of only actually training a much smaller active subnetwork.

• Budget-Aware Optimization Objectives: The parameters governing our architectural configuration are themselves updated via gradient descent. We have flexibility to formulate a variety of resource-sensitive losses, such as counting total FLOPs, in terms of these parameters.

• Broad Applicability: Though we use progressive growth of CNNs in width and depth as a motivating example, our technique applies to virtually any neural architecture. One has flexibility in how to parameterize the architecture configuration space. We also show results with LSTMs.

We demonstrate these advantages while comparing to recent NAS and pruning methods through extensive experiments on classification, semantic segmentation, and word-level language modeling.
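To make the gating mechanism concrete, the following sketch illustrates the continuous relaxation plus sampling idea in NumPy. It is a hypothetical illustration, not the paper's implementation: the function names `gate_probability` and `sample_active_subnetwork` are our own, and the per-component logits stand in for the auxiliary architecture variables described above.

```python
import numpy as np

def gate_probability(logit, temperature):
    # Continuous relaxation: a sigmoid with a (per-component) temperature
    # maps the real-valued auxiliary variable to a keep-probability.
    # As temperature -> 0 the sigmoid stiffens and the gate saturates,
    # baking long-lived components into the discrete architecture.
    return 1.0 / (1.0 + np.exp(-np.asarray(logit, dtype=float) / temperature))

def sample_active_subnetwork(logits, temperature, rng):
    # Binary sampling: draw a 0/1 gate per structural component, so
    # only the sampled (small) subnetwork is trained at this step.
    probs = gate_probability(logits, temperature)
    return (rng.random(len(logits)) < probs).astype(int)
```

Lowering the temperature over training makes the keep-probabilities saturate toward 0 or 1, so the sampled subnetwork stops fluctuating and the architecture effectively discretizes.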
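A budget-aware objective of the kind listed above can be written directly in terms of the same auxiliary logits. The sketch below is again a hypothetical illustration under our assumed names (`expected_flops`, `budget_objective`, penalty weight `lam`): it forms an expected-FLOPs penalty that is differentiable with respect to the architecture parameters, since each component's cost is weighted by its gate probability.

```python
import numpy as np

def expected_flops(logits, flops_per_component, temperature):
    # Differentiable resource estimate: each component contributes its
    # FLOPs weighted by the probability that its gate is active.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float) / temperature))
    return float(np.dot(probs, flops_per_component))

def budget_objective(task_loss, logits, flops_per_component, temperature, lam=1e-9):
    # Combined objective: accuracy term plus a resource penalty whose
    # gradient flows back to the architecture logits.
    return task_loss + lam * expected_flops(logits, flops_per_component, temperature)
```

Because the penalty is a smooth function of the logits, the same stochastic gradient descent that trains the weights can also shrink or grow the architecture toward a compute budget.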



Figure 1: Growing Networks during Training. We define an architecture configuration space and simultaneously adapt network structure and weights. (a) Applying our approach to CNNs, we maintain auxiliary variables that determine how to grow and prune both filters (i.e., channel-wise) and layers, subject to practical resource constraints. (b) By starting with a small network and growing its size, we utilize fewer resources in early training epochs, compared to pruning or NAS methods. (c) Consequently, our method significantly reduces the total computational cost of training, while delivering trained networks of comparable or better size and accuracy.

