GROWING EFFICIENT DEEP NETWORKS BY STRUCTURED CONTINUOUS SPARSIFICATION

Abstract

We develop an approach to growing deep network architectures over the course of training, driven by a principled combination of accuracy and sparsity objectives. Unlike existing pruning or architecture search techniques that operate on full-sized models or supernet architectures, our method can start from a small, simple seed architecture and dynamically grow and prune both layers and filters. By combining a continuous relaxation of discrete network structure optimization with a scheme for sampling sparse subnetworks, we produce compact, pruned networks, while also drastically reducing the computational expense of training. For example, we achieve savings of 49.7% in inference FLOPs and 47.4% in training FLOPs compared to a baseline ResNet-50 on ImageNet, while maintaining 75.2% top-1 accuracy, all without any dedicated fine-tuning stage. Experiments across CIFAR, ImageNet, PASCAL VOC, and Penn Treebank, with convolutional networks for image classification and semantic segmentation, and recurrent networks for language modeling, demonstrate that we both train faster and produce more efficient networks than competing architecture pruning or search methods.

1. INTRODUCTION

Deep neural networks are the dominant approach to a variety of machine learning tasks, including image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), object detection (Girshick, 2015; Liu et al., 2016), semantic segmentation (Long et al., 2015; Chen et al., 2017), and language modeling (Zaremba et al., 2014; Vaswani et al., 2017; Devlin et al., 2019). Modern neural networks are overparameterized, and training larger networks usually yields improved generalization accuracy. Recent work (He et al., 2016; Zagoruyko & Komodakis, 2016; Huang et al., 2017) illustrates this trend through increasing depth and width of convolutional neural networks (CNNs). Yet, training is compute-intensive, and real-world deployments are often limited by parameter and compute budgets. Neural architecture search (NAS) (Zoph & Le, 2017; Liu et al., 2019; Luo et al., 2018; Pham et al., 2018; Savarese & Maire, 2019) and model pruning (Han et al., 2016; 2015; Guo et al., 2016) methods aim to reduce these burdens. NAS addresses an issue that further compounds training cost: the enormous space of possible network architectures. While hand-tuning architectural details, such as the connection structure of convolutional layers, can improve performance (Iandola et al., 2016; Sifre & Mallat, 2014; Chollet, 2017; Howard et al., 2017; Zhang et al., 2018; Huang et al., 2018), a principled way of deriving such designs remains elusive. NAS methods aim to automate exploration of possible architectures, producing an efficient design for a target task under practical resource constraints. However, during training, most NAS methods operate on a large supernet architecture, which encompasses candidate components beyond those that are eventually selected for inclusion in the resulting network (Zoph & Le, 2017; Liu et al., 2019; Luo et al., 2018; Pham et al., 2018; Savarese & Maire, 2019).
Consequently, NAS-based training may typically be more thorough, but more computationally expensive, than training a single hand-designed architecture. Model pruning techniques similarly focus on improving the resource efficiency of neural networks during inference, at the possible expense of increased training cost. Common strategies aim to generate a lighter version of a given network architecture by removing individual weights (Han et al., 2015; 2016; Molchanov et al., 2017) or structured parameter sets (Li et al., 2017; He et al., 2018; Luo et al., 2017). However, the majority of these methods train a full-sized model prior to pruning and,

