ACCELERATED TRAINING VIA PRINCIPLED METHODS FOR INCREMENTALLY GROWING NEURAL NETWORKS

Abstract

We develop an approach to efficiently grow neural networks, in which parameterization and optimization strategies are designed by considering their effects on the training dynamics. Unlike existing growing methods, which follow simple replication heuristics or rely on auxiliary gradient-based local optimization, we craft a parameterization scheme that dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves, while maintaining the inference functionality of the network. To address the optimization difficulty arising from the uneven training received by subnetworks introduced at different growth phases, we propose a learning rate adaptation mechanism that rebalances the gradient contributions of these separate subcomponents. Experimental results show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original training computation budget. We demonstrate that these gains translate into real wall-clock training speedups.

1. INTRODUCTION

Modern neural network design typically follows a "larger is better" rule of thumb, with models consisting of millions of parameters achieving impressive generalization performance across many tasks, including image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Real et al., 2019; Zhai et al., 2022), object detection (Girshick, 2015; Liu et al., 2016; Ghiasi et al., 2019), semantic segmentation (Long et al., 2015; Chen et al., 2017; Liu et al., 2019a), and machine translation (Vaswani et al., 2017; Devlin et al., 2019). Within a class of network architecture, deeper or wider variants of a base model typically yield further improvements in accuracy. Residual networks (ResNets) (He et al., 2016b) and wide residual networks (Zagoruyko & Komodakis, 2016) illustrate this trend in convolutional neural network (CNN) architectures. Dramatically scaling up network size into the billions-of-parameters regime has recently revolutionized transformer-based language modeling (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). The size of these models imposes prohibitive training costs and motivates techniques that offer cheaper alternatives for selecting and deploying networks. For example, hyperparameter tuning is notoriously expensive, as it commonly relies on training the network multiple times; recent techniques aim to circumvent this by making hyperparameters transferable between models of different sizes, allowing them to be tuned on a small network prior to training the original model once (Yang et al., 2021). Our approach incorporates these ideas, but extends the scope of transferability to include the parameters of the model itself. Rather than view training small and large models as separate events, we grow a small model into a large one through many intermediate steps, each of which introduces additional parameters to the network.
Our contribution is to do so in a manner that preserves the function computed by the model at each growth step (functional continuity) and offers stable training dynamics, while also saving compute by leveraging intermediate solutions. More specifically, we use partially trained subnetworks as scaffolding that accelerates training of newly added parameters, yielding greater overall efficiency than training a large static model from scratch. Motivating this general strategy, we view aspects of prior work as hinting that deep network training may naturally be amenable to dynamically growing model size. For example, residual connections (He et al., 2016b) introduce depth-wise shortcuts, mitigating the vanishing gradient problem and thereby making very deep networks trainable end-to-end. Prior to ResNet, manually circumventing this issue involved
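To make the notion of functional continuity concrete, one common way to grow a network without changing its output is to zero-initialize the output-side weights of newly added units: the new units receive gradients and begin to train, but contribute nothing at the moment of growth. The sketch below illustrates this for widening the hidden layer of a one-hidden-layer ReLU network. It is a generic illustration under our own naming (`grow_hidden_layer`, `mlp` are hypothetical helpers), not the paper's exact parameterization scheme, which additionally controls weight, activation, and gradient scaling.

```python
import numpy as np

def mlp(x, W1, b1, W2):
    """One-hidden-layer ReLU network: x (d,) -> output (k,)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

def grow_hidden_layer(W1, b1, W2, new_units, rng):
    """Widen the hidden layer from h to h + new_units units while
    preserving the function computed by the network.

    W1: (h, d) input-side weights; b1: (h,) biases; W2: (k, h) output-side
    weights. New input-side rows are randomly initialized so the new units
    produce nonzero activations (and hence nonzero gradients), but the
    corresponding output-side columns are zero, so the network's output is
    unchanged at the moment of growth.
    """
    h, d = W1.shape
    k = W2.shape[0]
    W1_new = np.vstack([W1, rng.standard_normal((new_units, d)) / np.sqrt(d)])
    b1_new = np.concatenate([b1, np.zeros(new_units)])
    W2_new = np.hstack([W2, np.zeros((k, new_units))])
    return W1_new, b1_new, W2_new
```

After a growth step the widened network computes exactly the same function, so training resumes from the intermediate solution rather than restarting, which is what allows partially trained subnetworks to serve as scaffolding for the new parameters.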

