ACCELERATED TRAINING VIA PRINCIPLED METHODS FOR INCREMENTALLY GROWING NEURAL NETWORKS

Abstract

We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering their effects on the training dynamics. Unlike existing growing methods, which follow simple replication heuristics or utilize auxiliary gradient-based local optimization, we craft a parameterization scheme which dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves, and maintains the inference functionality of the network. To address the optimization difficulty resulting from imbalanced training effort distributed to subnetworks fading in at different growth phases, we propose a learning rate adaptation mechanism that rebalances the gradient contributions of these separate subcomponents. Experimental results show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original training computation budget. We demonstrate that these gains translate into real wall-clock training speedups.

1. INTRODUCTION

Modern neural network design typically follows a "larger is better" rule of thumb, with models consisting of millions of parameters achieving impressive generalization performance across many tasks, including image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Real et al., 2019; Zhai et al., 2022), object detection (Girshick, 2015; Liu et al., 2016; Ghiasi et al., 2019), semantic segmentation (Long et al., 2015; Chen et al., 2017; Liu et al., 2019a), and machine translation (Vaswani et al., 2017; Devlin et al., 2019). Within a class of network architecture, deeper or wider variants of a base model typically yield further improvements to accuracy. Residual networks (ResNets) (He et al., 2016b) and wide residual networks (Zagoruyko & Komodakis, 2016) illustrate this trend in convolutional neural network (CNN) architectures. Dramatically scaling up network size into the billion-parameter regime has recently revolutionized transformer-based language modeling (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). The size of these models imposes prohibitive training costs and motivates techniques that offer cheaper alternatives for selecting and deploying networks. For example, hyperparameter tuning is notoriously expensive as it commonly relies on training the network multiple times; recent techniques aim to circumvent this by making hyperparameters transferable between models of different sizes, allowing them to be tuned on a small network prior to training the original model once (Yang et al., 2021). Our approach incorporates these ideas, but extends the scope of transferability to include the parameters of the model itself. Rather than view training small and large models as separate events, we grow a small model into a large one through many intermediate steps, each of which introduces additional parameters to the network.
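To make the width-transferability idea concrete, the sketch below shows the standard fan-in scaling rule that underlies hyperparameter transfer across widths. It is a generic illustration in the spirit of Yang et al. (2021), not the exact parameterization developed later in this paper; the function name `init_std` is ours.

```python
def init_std(fan_in, gain=1.0):
    """Standard deviation for fan-in scaled weight initialization.

    With Var[w] = gain**2 / fan_in, a pre-activation that sums fan_in
    independent terms has variance roughly independent of layer width,
    which is what lets hyperparameters tuned on a narrow model carry
    over when the model is made wider.
    """
    return gain / fan_in ** 0.5
```

For example, quadrupling the width halves the per-weight scale: `init_std(400)` equals `0.5 * init_std(100)`, so activation magnitudes stay comparable as width grows.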
Our contribution is to do so in a manner that preserves the function computed by the model at each growth step (functional continuity) and offers stable training dynamics, while also saving compute by leveraging intermediate solutions. More specifically, we use partially trained subnetworks as scaffolding that accelerates training of newly added parameters, yielding greater overall efficiency than training a large static model from scratch. Motivating this general strategy, we view aspects of prior works as hinting that deep network training may naturally be amenable to dynamically growing model size. For example, residual connections (He et al., 2016b) introduce depth-wise shortcuts, solving a gradient vanishing issue and thereby making very deep networks end-to-end trainable; prior to ResNet, this issue required manual workarounds. Larsson et al. (2016) show that end-to-end training of FractalNet, an alternative shortcut architecture, implicitly trains shallower subnetworks first. If such phenomena, though perhaps difficult to analyze, occur more broadly, it suggests that one might achieve computational advantage by adopting an explicit growth strategy that matches the implicit subnetwork training schedule which occurs within a large static network. While overparameterization benefits generalization, a more detailed view suggests possible compatibility between maintaining an overparameterized deep network and dynamically growing such a network. The "double-descent" bias-variance trade-off curve (Belkin et al., 2019) indicates that large model capacity may be a safe strategy to ensure operation in the modern interpolating regime with consequently low test error. Small models, as we take for a starting point in dynamic growth, might not be sufficiently overparameterized and incur higher test error. However, Nakkiran et al. (2020) experimentally observe that double descent occurs with respect to both model size and number of training epochs.
To remain in the interpolating regime, a model must be overparameterized relative to the amount it has been trained, which can be satisfied by an appropriate growth schedule. Competing recent efforts to grow deep models from simple architectures (Chen et al., 2016; Li et al., 2019; Dai et al., 2019; Liu et al., 2019b; Wu et al., 2020b; Wen et al., 2020; Wu et al., 2020a; Yuan et al., 2021; Evci et al., 2022) draw inspiration from other sources, such as the progressive development processes of biological brains. In particular, Net2Net (Chen et al., 2016) grows the network by randomly splitting neurons learned in previous phases. This replication scheme, shown in Figure 1(a), is a common paradigm for most existing methods. Splitting steepest descent (Wu et al., 2020b) determines which neurons to split and how to split them by solving a combinatorial optimization problem with auxiliary variables. Firefly (Wu et al., 2020a) further improves flexibility by incorporating optimization for adding new neurons. Both methods outperform simple heuristics, but require additional training effort in their gradient-based parameterization schemes. Furthermore, all existing methods use a global learning rate scheduler to govern weight updates, ignoring the discrepancy in total training time among subnetworks introduced in different growth phases. We develop a growing framework around the principles of enforcing transferability of parameter settings from smaller to larger models (extending Yang et al. (2021)), offering functional continuity, smoothing optimization dynamics, and rebalancing learning rates between older and newer subnetworks.
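A minimal sketch of how random-but-function-preserving width growth can work: new hidden units receive random fan-in weights but zero fan-out weights, so the network's output is unchanged at the moment of growth while the new units still receive gradients. This is a generic illustration of the principle (the helper names `forward` and `grow_hidden` are ours, and the paper's actual scheme additionally handles variance transfer).

```python
import random

def forward(x, W1, W2):
    # One-hidden-layer MLP with ReLU: y = W2 @ relu(W1 @ x).
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def grow_hidden(W1, W2, new_neurons):
    # Append hidden units with random fan-in weights and zero fan-out
    # weights: each new unit contributes 0 to the output (functional
    # continuity), yet its weights receive nonzero gradients and can
    # begin training immediately after the growth step.
    in_dim = len(W1[0])
    for _ in range(new_neurons):
        W1.append([random.gauss(0.0, 1.0 / in_dim ** 0.5) for _ in range(in_dim)])
        for row in W2:
            row.append(0.0)
    return W1, W2
```

Growing a 2-unit hidden layer to 5 units this way leaves the computed function exactly unchanged, which can be checked by comparing outputs before and after the call.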



Figure 1: Dynamic network growth strategies. Unlike (a), which relies on either splitting (Chen et al., 2016; Liu et al., 2019b; Wu et al., 2020b) or adding neurons with auxiliary local optimization (Wu et al., 2020a; Evci et al., 2022), our initialization (b) of new neurons is random but function-preserving. Additionally, our separate learning rate (LR) scheduler governs weight updates in order to address the discrepancy in total accumulated training between different growth stages.

Figure 1(b) illustrates key differences with prior work. Our core contributions are:
• Parameterization using Variance Transfer: We propose a parameterization scheme that accounts for the variance transition between networks of smaller and larger width in a single training process. Initialization of new weights is gradient-free and requires neither additional memory nor training.
• Improved Optimization with Rate Adaptation: Subnetworks trained for different lengths of time receive distinct learning rate schedules, with dynamic relative scaling driven by weight norm statistics.
• Better Performance and Broad Applicability: Our method not only trains networks quickly, but also yields excellent generalization accuracy, even outperforming the original fixed-size models. Flexibility in designing a network growth curve allows choosing different trade-offs between training resources and accuracy. Furthermore, adopting an adaptive batch size schedule provides acceleration in terms of wall-clock training time. We demonstrate results on image classification and machine translation tasks, across a diverse set of network architectures.

