ACCELERATED TRAINING VIA PRINCIPLED METHODS FOR INCREMENTALLY GROWING NEURAL NETWORKS

Abstract

We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering their effects on the training dynamics. Unlike existing growing methods, which follow simple replication heuristics or utilize auxiliary gradient-based local optimization, we craft a parameterization scheme which dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves, and maintains the inference functionality of the network. To address the optimization difficulty resulting from imbalanced training effort distributed to subnetworks fading in at different growth phases, we propose a learning rate adaption mechanism that rebalances the gradient contribution of these separate subcomponents. Experimental results show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original computation budget for training. We demonstrate that these gains translate into real wall-clock training speedups.

1. INTRODUCTION

Modern neural network design typically follows a "larger is better" rule of thumb, with models consisting of millions of parameters achieving impressive generalization performance across many tasks, including image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Real et al., 2019; Zhai et al., 2022), object detection (Girshick, 2015; Liu et al., 2016; Ghiasi et al., 2019), semantic segmentation (Long et al., 2015; Chen et al., 2017; Liu et al., 2019a) and machine translation (Vaswani et al., 2017; Devlin et al., 2019). Within a class of network architecture, deeper or wider variants of a base model typically yield further improvements to accuracy. Residual networks (ResNets) (He et al., 2016b) and wide residual networks (Zagoruyko & Komodakis, 2016) illustrate this trend in convolutional neural network (CNN) architectures. Dramatically scaling up network size into the billion-parameter regime has recently revolutionized transformer-based language modeling (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). The size of these models imposes prohibitive training costs and motivates techniques that offer cheaper alternatives to select and deploy networks. For example, hyperparameter tuning is notoriously expensive as it commonly relies on training the network multiple times, and recent techniques aim to circumvent this by making hyperparameters transferable between models of different sizes, allowing them to be tuned on a small network prior to training the original model once (Yang et al., 2021). Our approach incorporates these ideas, but extends the scope of transferability to include the parameters of the model itself. Rather than view training small and large models as separate events, we grow a small model into a large one through many intermediate steps, each of which introduces additional parameters to the network.
Our contribution is to do so in a manner that preserves the function computed by the model at each growth step (functional continuity) and offers stable training dynamics, while also saving compute by leveraging intermediate solutions. More specifically, we use partially trained subnetworks as scaffolding that accelerates training of newly added parameters, yielding greater overall efficiency than training a large static model from scratch. Motivating this general strategy, we view aspects of prior works as hinting that deep network training may naturally be amenable to dynamically growing model size. For example, residual connections (He et al., 2016b) introduce depth-wise shortcuts, solving a gradient vanishing issue and thereby making very deep networks end-to-end trainable. Prior to ResNet, manually circumventing this issue involved 


Figure 1: Dynamic network growth strategies: (a) prior approaches; (b) ours, with function-preserving initialization and stage-wise LR. Different from (a), which rely on either splitting (Chen et al., 2016; Liu et al., 2019b; Wu et al., 2020b) or adding neurons with auxiliary local optimization (Wu et al., 2020a; Evci et al., 2022), our initialization (b) of new neurons is random but function-preserving. Additionally, our separate learning rate (LR) scheduler governs weight updates in order to address the discrepancy in total accumulated training between different growth stages.

adding outputs and losses to intermediate layers (Szegedy et al., 2015), effectively causing a shallower subnetwork to train first and bootstrap the training of the full network. Larsson et al. (2016) show that end-to-end training of FractalNet, an alternative shortcut architecture, implicitly trains shallower subnetworks first. If such phenomena, though perhaps difficult to analyze, occur more broadly, it suggests that one might achieve a computational advantage by adopting an explicit growth strategy that matches the implicit subnetwork training schedule occurring within a large static network. While overparameterization benefits generalization, a more detailed view suggests possible compatibility between the desire to maintain an overparameterized deep network and the desire to dynamically grow such a network. The "double-descent" bias-variance trade-off curve (Belkin et al., 2019) indicates that large model capacity may be a safe strategy to ensure operation in the modern interpolating regime, with consequently low test error. Small models, such as those we take as a starting point for dynamic growth, might not be sufficiently overparameterized and may incur higher test error. However, Nakkiran et al. (2020) experimentally observe that double-descent occurs with respect to both model size and number of training epochs.
To remain in the interpolating regime, a model must be overparameterized relative to the amount it has been trained, which can be satisfied by an appropriate growth schedule. Competing recent efforts to grow deep models from simple architectures (Chen et al., 2016; Li et al., 2019; Dai et al., 2019; Liu et al., 2019b; Wu et al., 2020b; Wen et al., 2020; Wu et al., 2020a; Yuan et al., 2021; Evci et al., 2022) draw inspiration from other sources, such as the progressive development processes of biological brains. In particular, Net2Net (Chen et al., 2016) grows the network by randomly splitting learned neurons from previous phases. This replication scheme, shown in Figure 1(a), is a common paradigm for most existing methods. Splitting steepest descent (Wu et al., 2020b) determines which neurons to split and how to split them by solving a combinatorial optimization problem with auxiliary variables. Firefly (Wu et al., 2020a) further improves flexibility by incorporating optimization for adding new neurons. Both methods outperform simple heuristics, but require additional training effort in their gradient-based parameterization schemes. Furthermore, all existing methods use a global learning rate scheduler to govern weight updates, ignoring the discrepancy in total training time among subnetworks introduced in different growth phases. We develop a growing framework around the principles of enforcing transferability of parameter settings from smaller to larger models (extending Yang et al. (2021)), offering functional continuity, smoothing optimization dynamics, and rebalancing learning rates between older and newer subnetworks. Figure 1(b) illustrates key differences with prior work. Our core contributions are:

2. RELATED WORK

Network Growing. A diverse range of techniques train models by progressively expanding the network architecture (Wei et al., 2016; Elsken et al., 2018; Dai et al., 2019; Wen et al., 2020; Yuan et al., 2021). Within this space, the method of Yang et al. (2021) focuses on zero-shot HP transfer across model scale and establishes a principled network parameterization scheme to facilitate HP transfer. This serves as an anchor for our strategy, though, as Section 3 details, modifications are required to account for dynamic growth.

Learning Rate Adaptation. Surprisingly, the existing spectrum of network growing techniques utilizes relatively standard learning rate schedules and does not address potential discrepancies among subcomponents added at different phases. One might expect that newer weights should have higher learning rates than older weights. While general-purpose adaptive optimizers (e.g., AdaGrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), Adam (Kingma & Ba, 2015), AvaGrad (Savarese et al., 2021)) might ameliorate this issue, we choose to explicitly account for the discrepancy. As layer-adaptive learning rates (LARS) (Ginsburg et al., 2018; You et al., 2020) are beneficial in some contexts, we explore further learning rate adaptation specific to both layer and growth stage.

3. METHOD

3.1 PARAMETERIZATION AND OPTIMIZATION WITH GROWING DYNAMICS

Functionality Preservation. We grow network capacity by expanding the width of computational units (e.g., hidden dimensions in linear layers, filters in convolutional layers). To illustrate our scheme, consider a 3-layer fully-connected network with ReLU activations $\phi$:

$h^{in} = \phi(W^{in} x), \quad h^{o} = \phi(W^{h} h^{in}), \quad y = W^{out} h^{o}$,   (1)

where $x \in \mathbb{R}^{C_{in}}$ is the network input, $y \in \mathbb{R}^{C_{out}}$ is the output, and $h^{in} \in \mathbb{R}^{H_{in}}$, $h^{o} \in \mathbb{R}^{H_{out}}$ are the hidden activations. In this case, $W^{in}$ is an $H_{in} \times C_{in}$ matrix, while $W^{h}$ is $H_{out} \times H_{in}$ and $W^{out}$ is $C_{out} \times H_{out}$. After training the network for a few epochs, we increase its capacity by increasing the dimensionality of each hidden state, i.e., from $H_{in}$ and $H_{out}$ to $\widetilde{H}_{in}$ and $\widetilde{H}_{out}$, respectively. The layer parameter matrices $W$ change shape accordingly and become $\widetilde{W}$.

Table 1: Parameterization and learning rate adaptation rules (width ratios $\widetilde{H}_{in}/H_{in}$ and $\widetilde{H}_{out}/H_{out}$). New weights are initialized with variances $1/C_{in}$, $1/\widetilde{H}_{in}$, and $1/(\widetilde{H}_{out})^2$ for the input, hidden, and output layers, respectively. LR adaptation factors are $1$, $1$, and $1/H^{0}_{out}$ at the 0-th stage, and $\|W^{in}_i \setminus W^{in}_{i-1}\| / \|W^{in}_0\|$, $\|W^{h}_i \setminus W^{h}_{i-1}\| / \|W^{h}_0\|$, and $\|W^{out}_i \setminus W^{out}_{i-1}\| / \|W^{out}_0\|$ at the $i$-th stage.

Expanding $W^{in}$ along the output dimension (Figure 2(a)) generates new features $h^{in}_n = \phi(W^{in}_n x)$ and changes the first set of activations from $h^{in}$ to

$\widetilde{h}^{in} = \mathrm{concat}(h^{in}, h^{in}_n, h^{in}_n)$.   (2)

Next, we expand $W^{h}$ across both input and output dimensions, as shown in Figure 2(b). We initialize new weights $W^{h}_{e}$ of shape $H_{out} \times \frac{\widetilde{H}_{in} - H_{in}}{2}$ and add to $W^{h}$ two copies of it with different signs: $+W^{h}_{e}$ and $-W^{h}_{e}$. This preserves the output of the layer, since $\phi(W^{h} h^{in} + W^{h}_{e} h^{in}_n + (-W^{h}_{e}) h^{in}_n) = \phi(W^{h} h^{in}) = h^{o}$. We then add two copies of new weights $W^{h}_{n}$, of shape $\frac{\widetilde{H}_{out} - H_{out}}{2} \times \widetilde{H}_{in}$, yielding activations

$\widetilde{h}^{o} = \mathrm{concat}(h^{o}, \phi(W^{h}_{n} \widetilde{h}^{in}), \phi(W^{h}_{n} \widetilde{h}^{in}))$.   (3)

We similarly expand $W^{out}$ to match the dimension of $\widetilde{h}^{o}$. As Figure 2(c) shows, the final output is:

$\widetilde{y} = W^{out} h^{o} + W^{out}_{e} \phi(W^{h}_{n} \widetilde{h}^{in}) + (-W^{out}_{e}) \phi(W^{h}_{n} \widetilde{h}^{in}) = y$,   (4)

which preserves the original output features in Eq. 1.
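To make the function-preservation argument concrete, the following numpy sketch grows the hidden widths of the toy 3-layer network above and checks that the output is unchanged. The dimensions, random distributions, and variable names here are illustrative only; they do not reproduce the paper's actual initialization variances.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)

# Small 3-layer MLP: C_in=4, H_in=6, H_out=6, C_out=3.
C_in, H_in, H_out, C_out = 4, 6, 6, 3
W_in = rng.normal(0, 1, (H_in, C_in))
W_h = rng.normal(0, 1, (H_out, H_in))
W_out = rng.normal(0, 1, (C_out, H_out))

def forward(W_in, W_h, W_out, x):
    # matches Eq. 1: no activation on the output layer
    return W_out @ relu(W_h @ relu(W_in @ x))

x = rng.normal(0, 1, C_in)
y_before = forward(W_in, W_h, W_out, x)

# Grow H_in -> 8 and H_out -> 8: each adds two duplicated copies of one new row.
dH_in, dH_out = 2, 2
W_in_n = rng.normal(0, 1, (dH_in // 2, C_in))            # random new input rows
W_h_e = rng.normal(0, 1, (H_out, dH_in // 2))            # cancelling +/- columns
W_h_n = rng.normal(0, 1, (dH_out // 2, H_in + dH_in))    # random new hidden rows
W_out_e = rng.normal(0, 1, (C_out, dH_out // 2))         # cancelling +/- columns

W_in_big = np.vstack([W_in, W_in_n, W_in_n])             # Eq. 2: duplicate new rows
W_h_big = np.vstack([
    np.hstack([W_h, +W_h_e, -W_h_e]),                    # old rows: +/- copies cancel
    W_h_n, W_h_n,                                        # Eq. 3: duplicated new rows
])
W_out_big = np.hstack([W_out, +W_out_e, -W_out_e])       # Eq. 4: +/- cancel on output

y_after = forward(W_in_big, W_h_big, W_out_big, x)
assert np.allclose(y_before, y_after)  # function preserved
```

The cancellation relies on the new hidden activations being duplicated exactly, which is why each growth step adds new rows in identical pairs.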
[Figure 2 panels: (a) Input Layer, (b) Hidden Layer, (c) Output Layer. New weights are drawn as $W^{in}_{n} \sim \mathcal{N}(0, 1/H_{in})$, $W^{h}_{n}, W^{h}_{e} \sim \mathcal{N}(0, 1/\widetilde{H}_{in})$, and $W^{out}_{e} \sim \mathcal{N}(0, 1/\widetilde{H}_{out}^{2})$; old blocks are carried over as $W^{in}_{new} = W^{in}_{old}$, $W^{h}_{new} \triangleq (H_{in}/\widetilde{H}_{in}) W^{h}_{old}$, and $W^{out}_{new} \triangleq (H_{out}/\widetilde{H}_{out}) W^{out}_{old}$.]

However, this correction considers training differently-sized models separately, which fails to accommodate training dynamics in which width grows incrementally. To make the weights of the old subnetwork $W^{out}_{old} \sim \mathcal{N}(0, 1/(H_{out})^2)$ compatible with the parameterization of the entire weight tensor, we rescale them by the fan-in ratio: $W^{out}_{new} = W^{out}_{old} \cdot H_{out}/\widetilde{H}_{out}$; see also Table 1 (top). This parameterization rule transfers to modern convolutional networks with batch normalization (BN): given a weight scaling ratio of $c$, the running mean $\mu$ and variance $\sigma$ of BN layers are modified to $c\mu$ and $c^2\sigma$, respectively.

Stage-wise Learning Rate Adaptation (RA). Following Yang et al. (2021), we employ a learning rate scaling factor $\propto 1/\text{fan-in}$ on the output layer when using SGD, compensating for the initialization scheme. However, subnetworks from different growth stages still share a global learning rate, though they have been trained for different lengths of time. This may cause divergent behavior among the corresponding weights, making the training iterations after growing sensitive to the scale of the newly-initialized weights. Instead of adjusting newly added parameters via local optimization (Wu et al., 2020a; Evci et al., 2022), we govern the update of each subnetwork in a stage-wise manner. Suppose a layer scales up in width across growing stages $S_0 \subset S_1 \subset \ldots \subset S_{N-1}$, with associated weight and gradient tensors $W_0 \subset W_1 \subset \ldots \subset W_{N-1}$ and $G_0 \subset G_1 \subset \ldots \subset G_{N-1}$, respectively.
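As a sketch of the rescaling rules above, the snippet below rescales old output weights by the fan-in ratio and adjusts BN running statistics under a weight scaling ratio $c$; the tensor shapes and BN epsilon are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

H_out, H_out_new = 64, 96        # output-layer fan-in before / after growth
c = H_out / H_out_new            # weight scaling ratio (here 2/3)

# Old output weights, drawn with std 1/H_out (variance 1/H_out^2);
# rescaling by c makes them match the 1/H_out_new^2 parameterization.
W_out_old = rng.normal(0.0, 1.0 / H_out, (10, H_out))
W_out_rescaled = W_out_old * c

# BN statistics under weight scaling c: mean -> c*mu, variance -> c^2*sigma.
mu = rng.normal(size=32)
var = rng.uniform(0.5, 1.5, size=32)
mu_new, var_new = c * mu, (c ** 2) * var

# Check: normalizing a c-scaled input with the rescaled statistics gives
# (up to the small epsilon term) the same output as before scaling.
x = rng.normal(size=32)
eps = 1e-5
norm_old = (x - mu) / np.sqrt(var + eps)
norm_new = (c * x - mu_new) / np.sqrt(var_new + eps)
assert np.allclose(norm_old, norm_new, atol=1e-3)
```

The equivalence is exact only in the limit of zero epsilon, since the epsilon inside the square root does not scale with $c^2$; in practice the discrepancy is negligible.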
We adapt the learning rate and update the $i$-th sub-weights $W_i \setminus W_{i-1}$ as:

$\eta_i = \eta_0 \cdot \frac{f(S_i \setminus S_{i-1})}{f(S_0)}, \qquad W_i \setminus W_{i-1} \leftarrow -\eta_i \cdot (G_i \setminus G_{i-1}), \quad i > 0$   (5)

where $\eta_0$ is the base learning rate and $f$ is an implicit function that maps subnetworks of different stages to corresponding train-time statistics. We train the model for $T_{total}$ epochs by expanding the channel number of each layer to $C_{final}$ across $N$ growth phases. Existing methods (Liu et al., 2019b; Wu et al., 2020a) lack a systematic way of distributing training resources across a growth trajectory. Toward maximizing efficiency, we experiment with coupling model size and training epoch allocation.

Architectural Scheduler. We denote the initial channel width as $C_0$ and expand exponentially:

$C_n = C_{n-1} + \lfloor p_c C_{n-1} \rceil_2$ if $n < N-1$; $\quad C_n = C_{final}$ if $n = N-1$,   (6)

where $\lfloor \cdot \rceil_2$ rounds to the nearest even number and $p_c$ is the growth rate between stages.

Epoch Scheduler. We denote the number of epochs assigned to the $n$-th training stage as $T_n$, with $\sum_{n=0}^{N-1} T_n = T_{total}$. We similarly adapt $T_n$ via an exponential growing scheduler:

$T_n = T_{n-1} + \lfloor p_t T_{n-1} \rceil$ if $n < N-1$; $\quad T_n = T_{total} - \sum_{i=0}^{N-2} T_i$ if $n = N-1$.   (7)

Wall-clock Speedup via Batch Size Adaptation. Though the smaller architectures in early growth stages require fewer FLOPs, hardware capabilities may still restrict practical gains. When growing width, in order to ensure that small models fully exploit GPU parallelism, we adapt the batch size along with the exponentially-growing architecture, in reverse order:

$B_{n-1} = B_{base}$ if $n = N$; $\quad B_{n-1} = B_n + \lfloor p_b B_n \rceil$ if $n < N$,   (8)

where $B_{base}$ is the batch size of the large baseline model. Algorithm 1 summarizes our full method.
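The three schedulers in Eqs. 6-8 can be sketched as follows. The exact rounding convention of the round-to-nearest operator is an assumption here (Python's built-in `round` uses banker's rounding), so per-stage values may differ slightly from the schedules reported in Section 4.

```python
def round_even(x):
    # nearest even integer: the round-to-even operator in Eq. 6
    return 2 * round(x / 2)

def width_schedule(C0, C_final, N, p_c):
    # Eq. 6: exponential width growth; the final stage jumps to the target width.
    C = [C0]
    for _ in range(N - 2):
        C.append(C[-1] + round_even(p_c * C[-1]))
    C.append(C_final)
    return C

def epoch_schedule(T0, T_total, N, p_t):
    # Eq. 7: exponential epoch growth; the final stage absorbs the remainder.
    T = [T0]
    for _ in range(N - 2):
        T.append(T[-1] + round(p_t * T[-1]))
    T.append(T_total - sum(T))
    return T

def batch_schedule(B_base, N, p_b):
    # Eq. 8: batch sizes assigned in reverse, largest for the smallest model.
    B = [0] * N
    B[-1] = B_base
    for n in range(N - 1, 0, -1):
        B[n - 1] = B[n] + round(p_b * B[n])
    return B
```

For example, `epoch_schedule(8, 160, 9, 0.2)` always sums to $T_{total} = 160$ because the final stage absorbs the remainder, and `batch_schedule(128, 4, 0.5)` assigns monotonically shrinking batches as the model grows.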

4. EXPERIMENTS

We evaluate on image classification and machine translation tasks. For image classification, we use CIFAR-10 (Krizhevsky et al., 2014), CIFAR-100 (Krizhevsky et al., 2014) and ImageNet (Deng et al., 2009). For neural machine translation, we use the IWSLT'14 dataset (Cettolo et al., 2014) and report the BLEU (Papineni et al., 2002) score on the German-to-English (De-En) translation task.

Large Baselines via Fixed-size Training. We use VGG-11 (Simonyan & Zisserman, 2015) with BatchNorm (Ioffe & Szegedy, 2015), ResNet-20 (He et al., 2016a), and MobileNetV1 (Howard et al., 2017) for CIFAR-10, and VGG-19 with BatchNorm, ResNet-18, and MobileNetV1 for CIFAR-100. We follow Huang et al. (2016) for data augmentation and processing, adopting random shifts/mirroring and channel-wise normalization. CIFAR-10 and CIFAR-100 models are trained for 160 and 200 epochs respectively, with a batch size of 128 and initial learning rate (LR) of 0.1 using SGD. We adopt a cosine LR schedule and set weight decay and momentum to 5e-4 and 0.9. For ImageNet, we train the baseline ResNet-50 and MobileNetV1 (Howard et al., 2017) using SGD with batch sizes of 256 and 512, respectively. We adopt the same data augmentation scheme as Gross & Wilber (2016), a cosine LR scheduler with initial LR of 0.1, weight decay of 1e-4 and momentum of 0.9. For IWSLT'14, we train an Encoder-Decoder Transformer (6 attention blocks each) (Vaswani et al., 2017). We set the width as $d_{model} = d_{ffn}/4 = 512$, the number of heads $n_{head} = 8$, and $d_k = d_q = d_v = d_{model}/n_{head} = 64$. We train the model using Adam for 20 epochs with learning rate 1e-3 and $(\beta_1, \beta_2) = (0.9, 0.98)$. Batch size is 1500 and we use 4000 warm-up iterations.

Implementation Details. We compare with the growing methods Net2Net (Chen et al., 2016), Splitting (Liu et al., 2019b), Firefly-split, Firefly (Wu et al., 2020a) and GradMax (Evci et al., 2022).
For image classification, we run the comparison methods except GradMax alongside our algorithm for all architectures under the same growing scheduler. For the architecture scheduler, we set $p_c$ to 0.2 and $C_0$ to 1/4 of the large baselines in Eq. 6 for all layers, and grow the seed architecture within N = 9 stages towards the large ones. For the epoch scheduler, we set $p_t$ to 0.2, and $T_0$ to 8, 10, and 4 in Eq. 7 on CIFAR-10, CIFAR-100, and ImageNet respectively. Total training epochs $T_{total}$ are the same as for the respective large fixed-size models. For CIFAR-10 and CIFAR-100, we train the models and report results averaged over 3 random seeds. For machine translation, we grow the encoder and decoder layers' widths while fixing the embedding layer dimension for a consistent positional encoding table. The total number of growing stages is 4, each trained for 5 epochs. The initial width is 1/8 of the large baseline (i.e., $d_{model} = 64$ and $d_{k,q,v} = 8$). We set the growing ratio $p_c$ to 1.0 so that $d_{model}$ evolves as 64, 128, 256 and 512. We train all models on an NVIDIA 2080Ti 12GB GPU for CIFAR-10, CIFAR-100, and IWSLT'14, and two NVIDIA A40 48GB GPUs for ImageNet.

4.1. CIFAR RESULTS

All models grow from a small seed architecture to the full-sized one in 9 stages, each trained for {8, 9, 11, 13, 16, 19, 23, 28, 33} epochs (160 total) on CIFAR-10 and {10, 12, 14, 17, 20, 24, 29, 35, 39} epochs (200 total) on CIFAR-100. Net2Net grows by splitting via simple neuron replication, hence achieving the same training efficiency as our gradient-free method under the same growing schedule. Splitting and Firefly require additional training effort for their neuron selection schemes and allocate extra GPU memory for auxiliary variables during the local optimization stage. This is computationally expensive, especially when growing a large model. We evaluate ResNet-20, VGG-11, and MobileNetV1 on CIFAR-10, and also evaluate all methods on CIFAR-100 using ResNet-18, VGG-19, and MobileNetV1. Results in Table 3 show that Firefly consistently achieves better test accuracy than Firefly-split, suggesting that adding new neurons provides more flexibility for exploration than merely splitting. Both Firefly and our method achieve better performance than the original VGG-19, suggesting that network growing might have an additional regularizing effect. Our method yields the best accuracy and the largest training cost reduction.

4.2. IMAGENET RESULTS

We first grow ResNet-50 on ImageNet and compare the performance of our method to Net2Net and Firefly under the same epoch schedule: {4, 4, 5, 6, 8, 9, 11, 14, 29} (90 total) with 9 growing phases. We also grow MobileNetV1 from a small seed architecture, which is more challenging than ResNet-50. We train Net2Net and our method using the same scheduler as for ResNet-50. We also compare with Firefly-Opt (a variant of Firefly) and GradMax, reporting their best results from Evci et al. (2022). Note that both methods not only adopt additional local optimization, but also train with a less efficient growing scheduler: their final full-sized architecture must be trained for 75% of the schedule, while ours requires only 32.2%. Table 4 shows that our method outperforms all competing approaches.

4.3. IWSLT14 DE-EN RESULTS

We grow a Transformer from $d_{model} = 64$ to $d_{model} = 512$ within 4 stages, each trained for 5 epochs using Adam. Applying gradient-based growing methods to the Transformer architecture is non-trivial due to their domain-specific designs of local optimization. As such, we only compare with Net2Net. We also design variants of our method for self-comparison, based on the adaptation rules for Adam in Appendix A.1. As shown in Table 5, our method generalizes well to the Transformer architecture for the machine translation task. Comparison among variants is also consistent with Table 7, demonstrating the benefit of learning rate adaptation.

4.4. ANALYSIS

Variance Transfer. We train a simple neural network with 4 convolutional layers on CIFAR-10. The network consists of 4 resolution-preserving convolutional layers with 64, 128, 256 and 512 channels, each with a 3 × 3 kernel and followed by BatchNorm and ReLU activations. Max-pooling after each layer downsamples resolution by 4, 2, 2, and 2. These CNN layers are followed by a linear layer for classification. We first construct four variants of this network, given by combinations of training epochs ∈ {13 (1×), 30 (2.3×)} and initialization methods ∈ {standard, µTransfer (Yang et al., 2021)}. We also grow from a thin architecture within 3 stages, where the channel number of each layer starts at only 1/4 of the original, i.e., {16, 32, 64, 128} → {32, 64, 128, 256} → {64, 128, 256, 512}, with each stage trained for 10 epochs. For network growing, we compare baselines with standard initialization and with variance transfer. We train all baselines using SGD, with weight decay set to 0 and learning rates sweeping over {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0}. In Figure 3(b), growing with Var. Transfer (blue) achieves overall better test accuracy than standard initialization (orange). Large baselines with merely µTransfer initialization consistently underperform standard ones, validating that the compensating LR re-scaling of Yang et al. (2021) is necessary. We also observe, in both Figure 3(a) and 3(b), that all growing baselines outperform fixed-size ones under the same training cost, demonstrating positive regularization effects. We further show the effect of our initialization scheme by comparing test performance on standard ResNet-20 on CIFAR-10. As shown in Figure 4(a), compared with standard initialization, our variance transfer not only achieves better final test accuracy but also appears more stable. See Appendix A.4 for a fully-connected network example.

Learning Rate Adaptation.
We investigate the value of our proposed stage-wise learning rate adaptation as an optimizer for growing networks. As shown by the red curve in Figure 3, rate adaptation not only achieves the best train loss and test accuracy among all baselines, but also appears more robust across different learning rates. In Figure 4(a), rate adaptation further improves final test accuracy for ResNet-20 on CIFAR-10 under the same initialization scheme. Our rate adaptation mechanism rebalances subcomponents' gradient contributions so that they exhibit lower divergence than under a global LR, when components are added at different stages and trained for different durations. In Figure 5, we observe that the LR for the newly added Subnet-8 (red) in the last stage starts around 1.8× the base LR, then quickly adapts to a smoother level. This demonstrates that our method is able to automatically adapt the updates applied to new weights, without any additional local optimization costs (Wu et al., 2020b; Evci et al., 2022). All of the above show that our method has a positive effect on stabilizing training dynamics, which is lost if one attempts to train different subcomponents using a global LR scheduler. Appendix A.2 and A.3 provide more visualizations.

Flexible Growing Scheduler. Our growing scheduler gains the flexibility to explore the best trade-offs between training budget and test performance in a unified configuration scheme (Eq. 6 and Eq. 7). We compare the exponential epoch scheduler ($p_t \in \{0.2, 0.25, 0.3, 0.35\}$) to a linear one ($p_t = 0$) when growing ResNet-20 on CIFAR-10, denoted as the 'Exp.' and 'Linear' baselines in Figure 6. Both baselines use architectural schedulers with $p_c \in \{0.2, 0.25, 0.3, 0.35\}$, each generating trade-offs between training cost and test accuracy by varying $T_0$. The exponential scheduler yields better overall trade-offs than the linear one with the same $p_c$. In addition to different growing schedulers, we also plot a baseline of fixed-size training with different model sizes.
Growing methods with both schedulers consistently outperform the fixed-size baselines, demonstrating that the regularization effect of network growth benefits generalization performance.

5. CONCLUSION

We propose an efficient and accurate method for network growing, based on principled rules for both parameterization and optimization. Our parameter transition from older to newer subnetworks is general and quick to execute when expanding the network. Our carefully designed learning rate adaptation mechanism improves optimization dynamics in networks consisting of subcomponents with heterogeneous training durations. Applications to widely-used architectures on image classification and machine translation tasks demonstrate that our method bests the accuracy of competitors, even outperforming the original fixed-size models, while saving considerable training cost.

A APPENDIX

A.1 RATE ADAPTATION FOR ADAPTIVE OPTIMIZERS

Table 6: Stage-wise rate adaptation rules for adaptive optimizers. For Adam, the $i$-th stage update acts on the sub-tensors $(m_{t[i]} \setminus m_{t[i-1]}) / \sqrt{v_{t[i]} \setminus v_{t[i-1]}}$. For AvaGrad, the 0-th stage uses $\frac{\eta_{t[0]}}{\|\eta_{t[0]}/\sqrt{d_{t[0]}}\|_2} \odot m_{t[0]}$, and the $i$-th stage uses $\frac{\eta_{t[i]} \setminus \eta_{t[i-1]}}{\|(\eta_{t[i]} \setminus \eta_{t[i-1]})/\sqrt{d_{t[i]} - d_{t[i-1]}}\|_2} \odot (m_{t[i]} \setminus m_{t[i-1]})$ (stage-wise).

Similarly, we generalize our approach to AvaGrad by treating $\eta_t$, $d_t$, and $m_t$ of the original paper as stage-wise variables.

Preserving Optimizer State/Buffer. Essential to adaptive methods are training-time statistics (e.g., running averages $m_t$ and $v_t$ in Adam) which are stored as buffers and used to compute parameter-wise learning rates. Different from fixed-size models, parameter sets are expanded when growing networks, which in practice requires re-instantiating a new optimizer at each growth step. Given that our initialization scheme maintains the functionality of the network, we are also able to preserve and inherit buffers from previous states, effectively keeping the optimizer's state intact when adding new parameters. We investigate the effects of this state preservation experimentally.

Results with Adam and AvaGrad. Table 7 shows the results of growing ResNet-20 on CIFAR-10 with Adam and AvaGrad.
For the large, fixed-size baseline, we train Adam with lr = 0.1, ϵ = 0.1 and AvaGrad with lr = 0.5, ϵ = 10.0, which yields the best results for ResNet-20 following Savarese et al. (2021). We consider different settings for comparison: (1) optimizer without buffer preservation: buffers are refreshed at each new growth phase; (2) optimizer with buffer preservation: the buffer/state is inherited from the previous phase, hence preserved at growth steps; (3) optimizer with buffer preservation and rate adaptation (RA): applies our rate adaptation strategy described in Table 6 while also preserving internal state/buffers. We observe that (1) consistently underperforms (2), which suggests that preserving the state/buffers of adaptive optimizers is crucial when growing networks.

Comparison with Layer-wise Adaptive Optimizer. We also consider LARS (Ginsburg et al., 2018; You et al., 2020), a layer-wise adaptive variant of SGD, to compare different adaptation concepts: layer-wise versus layer + stage-wise (ours). Note that although LARS was originally designed for training with large batches, we adopt a batch size of 128 when growing ResNet-20 on CIFAR-10. We search the initial learning rate (LR) for LARS over {1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 5e-1} and observe that a value of 0.02 yields the best results. We adopt the default initial learning rate of 0.1 for both standard SGD and our method. As shown in Table 8, LARS underperforms both standard SGD and our adaptation strategy, suggesting that layer-wise learning rate adaptation by itself, i.e., without accounting for stage-wise discrepancies, is not sufficient for successfully growing networks.
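A minimal sketch of option (2), buffer preservation, for a single Adam-managed weight tensor. The state layout below is illustrative rather than the paper's implementation; the assumption that old weights occupy the leading slice of the grown tensor follows the appending expansion scheme of Section 3.1.

```python
import numpy as np

# Minimal Adam state for one parameter tensor: step count plus running
# first/second moment buffers m and v.
def make_adam_state(shape):
    return {"step": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def grow_adam_state(old, new_shape):
    # Re-instantiate state at the new shape, but inherit the old statistics
    # on the slice occupied by the pre-existing weights instead of zeroing
    # everything (option (2) above). New rows/columns start from zero.
    new = make_adam_state(new_shape)
    new["step"] = old["step"]
    r, c = old["m"].shape
    new["m"][:r, :c] = old["m"]
    new["v"][:r, :c] = old["v"]
    return new

# Grow a (4, 3) tensor's state to (6, 5), preserving the old buffers.
old = make_adam_state((4, 3))
old["step"] = 100
old["m"] += 0.5
old["v"] += 0.25
new = grow_adam_state(old, (6, 5))
```

Refreshing buffers instead (option (1)) would correspond to simply calling `make_adam_state((6, 5))`, discarding the accumulated statistics of the old subnetwork.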

A.2 MORE VISUALIZATIONS ON RATE ADAPTATION

We show additional plots of stage-wise rate adaptation when growing ResNet-20 on CIFAR-10. Figure 8 shows the evolution of the adaptation factors, relative to the LR of the seed architecture, from the 1st to the 8th stage (the stage index starts at 0). We see an overall trend that the learning rate of newly-added weights starts at >1× the base LR, then quickly adapts to a relatively stable level. This demonstrates that our approach efficiently and automatically adapts new weights so that they gradually and smoothly fade in throughout the current stage's optimization procedure.

A.4 FULLY-CONNECTED NETWORK EXAMPLE

Each image is flattened into a 3072-dimensional (32 × 32 × 3) vector prior to being given as input to the network. We consider two variants of this baseline network, adopting training epochs (costs) ∈ {25 (1×), 50 (2×)}. We also grow from a thin architecture to the original one within 10 stages, each stage consisting of 5 epochs, where the number of units in each hidden layer grows from 50 to 100, 150, ..., 500. The total training cost is equivalent to the fixed-size network trained for 25 epochs. We train all baselines using SGD, with weight decay set to 0 and learning rates sweeping over {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}; results are shown in Figure 13(a). Compared to standard initialization (green), the loss curve given by growing with variance transfer (blue) is more similar to the curve of the large baseline (all using standard SGD), which is also consistent with observations when training models of different scales separately (Yang et al., 2021). Rate adaptation (red) further lowers training loss. Interestingly, we observe in Figure 13(b) that the test accuracy behavior differs from the training loss behavior in Figure 13(a), which may suggest that regularization is missing due to, for example, the lack of parameter-sharing schemes (as in CNNs) in this fully-connected network.

Another direct and intuitive application of our method is fitting a continuously growing datastream, where $D_0 \subset D_1 \subset \ldots \subset D_n \subset \ldots \subset D_{N-1}$.
The network complexity scales up together with the data, so that larger capacity can be trained on more data samples. Orthogonalized SGD (OSGD) (Wan et al., 2020) addresses the optimization difficulty in this context, dynamically re-balancing task-specific gradients by prioritizing specific loss influences. We further extend our optimizer by introducing a dynamic variant of orthogonalized SGD, which progressively adjusts the priority of tasks on different subnets during network growth. Suppose the data increases from $D_{n-1}$ to $D_n$. We first accumulate the old gradients $G_{n-1}$ using one additional epoch on $D_{n-1}$ and then grow the network width. For each batch of $D_n$, we project the gradients of the new architecture ($n$-th stage), denoted $G_n$, onto the parameter subspace orthogonal to $G^{pad}_{n-1}$, a zero-padded version of $G_{n-1}$ of the required shape. The final gradients $G^{*}_{n}$ are then calculated by re-weighting the original $G_n$ and its orthogonal counterpart:

$G^{*}_{n} = G_n - \lambda \cdot \mathrm{proj}_{G^{pad}_{n-1}}(G_n), \quad \lambda: 1 \to 0$   (9)

where $\lambda$ is a dynamic hyperparameter weighting the original and orthogonal gradients. When $\lambda = 1$, subsequent updates do not interfere with earlier directions of parameter updates. We then anneal $\lambda$ to 0 so that the newly-introduced data and subnetwork can smoothly fade in throughout the training procedure.

Implementation Details. We implement the task in two settings, denoted 'progressive class' and 'progressive data', on the CIFAR-100 dataset within 9 stages. In the progressive class setting, we first randomly select 20 classes in the first stage and then add 10 new classes at each growing stage. In the progressive data setting, we sequentially sample a fraction of the data with replacement at each stage, i.e., 20%, 30%, ..., 100%.

ResNet-18 on Continuous CIFAR-100. We evaluate our method on continuous datastreams by growing a ResNet-18 on CIFAR-100 and comparing final test accuracies.
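Eq. 9 can be sketched on flattened gradient vectors as follows; treating the projection as a single-vector projection (rather than, e.g., per-layer) is an assumption made here for illustration.

```python
import numpy as np

def project_onto(u, v):
    # component of u along the direction of v
    d = float(np.dot(v, v))
    return np.zeros_like(u) if d == 0.0 else (np.dot(u, v) / d) * v

def osgd_gradient(G_n, G_prev_pad, lam):
    # Eq. 9: G*_n = G_n - lam * proj_{G_prev_pad}(G_n),
    # with lam annealed from 1 to 0 over the course of the stage.
    return G_n - lam * project_onto(G_n, G_prev_pad)

rng = np.random.default_rng(0)
G_prev = rng.normal(size=10)   # zero-padded gradients from the previous stage
G_n = rng.normal(size=10)      # gradients of the grown architecture

g_full = osgd_gradient(G_n, G_prev, lam=1.0)  # orthogonal to the old direction
g_none = osgd_gradient(G_n, G_prev, lam=0.0)  # plain gradient, no projection
```

At `lam=1.0` the update cannot interfere with the accumulated old-task direction; annealing toward `lam=0.0` recovers the ordinary gradient, letting the new data and subnetwork fade in smoothly.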
As shown in Table 9, compared with the large baseline our growing method achieves 1.53× cost savings with only a slight performance degradation in both settings. The dynamic OSGD variant outperforms the large baseline with a 1.46× acceleration, demonstrating that this extension improves optimization on continuous datastreams by gradually re-balancing the task-specific gradients of dynamic networks.
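The re-weighted update of Eq. 9 can be sketched as follows. This is a minimal NumPy sketch treating gradients as flat vectors and projecting onto the single direction of the padded old gradient; the function names and the linear annealing schedule are our illustration, not details from the paper:

```python
import numpy as np

def dynamic_osgd_update(g_n, g_prev_pad, lam):
    """Eq. 9: G*_n = G_n - lam * proj_{G^pad_{n-1}}(G_n), gradients as flat vectors."""
    denom = float(np.dot(g_prev_pad, g_prev_pad))
    if denom == 0.0:
        return g_n  # no old-gradient direction to protect
    proj = (np.dot(g_n, g_prev_pad) / denom) * g_prev_pad
    return g_n - lam * proj

def anneal(step, total_steps):
    # one simple choice for lam: 1 -> 0, linear over the growth stage
    return max(0.0, 1.0 - step / total_steps)
```

With lam = 1 the resulting update is orthogonal to the padded old gradient, so it does not interfere with earlier update directions; with lam = 0 it reduces to plain SGD, letting the new data and subnetwork fully fade in.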



Figure 2 illustrates the process for initializing W.¹ As Figure 2(a) shows, we first expand W^in along the output dimension by adding two copies of new weights W^in_n, each of size (H′_in − H_in)/2 along that dimension, where H′_in denotes the width after growth.

Figure 2: Initialization scheme. In practice, we also add noise to the expanded parameter sets for symmetry breaking.
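To make the two-copy construction concrete, below is a minimal NumPy sketch of a function-preserving width expansion for a two-layer ReLU network: the two copies of each new hidden unit receive opposite-signed outgoing weights, so the network output is unchanged until noise breaks the symmetry. The ± pairing and all names here are our illustration; the paper's exact variance-transfer scaling is the one given in Table 1:

```python
import numpy as np

def grow_hidden_two_copies(W1, W2, k, rng):
    """Add 2*k hidden units to a ReLU MLP y = W2 @ relu(W1 @ x), preserving its function.

    W1: (H, D) input weights; W2: (C, H) output weights.
    The k new units are duplicated; their outgoing weights are +v and -v,
    so the two copies' (identical) activations cancel in the output.
    """
    _, D = W1.shape
    new_in = rng.normal(0.0, np.sqrt(1.0 / D), size=(k, D))
    v = rng.normal(0.0, 0.1, size=(W2.shape[0], k))
    W1_big = np.concatenate([W1, new_in, new_in], axis=0)  # two copies of new units
    W2_big = np.concatenate([W2, v, -v], axis=1)           # opposite-signed outgoing weights
    return W1_big, W2_big

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)
```

In practice a small noise term is then added for symmetry breaking, as the caption notes, after which the two copies evolve independently.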

Figure 4: (a) Performance with Var. Transfer and Rate Adaptation when growing ResNet-20. (b) and (c) visualize the gradients of different sub-components over training in the last two stages.

Figure 4(b) and 4(c) visualize the gradients of different sub-components for the 17-th convolutional layer of ResNet-20 during the last two growing phases, under standard SGD and rate adaptation, respectively. Our rate adaptation mechanism rebalances subcomponents' gradient contributions, yielding lower divergence than a global LR when components are added at different stages and trained for different durations. In Figure 5, we observe that the LR for the newly added Subnet-8 (red) in the last stage starts around 1.8× the base LR, then quickly adapts to a smoother level. This demonstrates that our method automatically adapts the updates applied to new weights, without any additional local optimization costs (Wu et al., 2020b; Evci et al., 2022). Together, these results show that our method stabilizes training dynamics in a way that is lost if one attempts to train different subcomponents using a global LR scheduler. Appendix A.2 and A.3 provide more visualizations.

Figure 5: Visualization of our adaptive LR.
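As a concrete illustration of weight-norm-driven rate scaling, one simple heuristic is to scale each subnet's learning rate by the ratio of the seed subnet's weight norm to its own, so that freshly added subnets with small norms start above the base LR and settle as they mature. This is a hypothetical sketch of the general idea, not the exact rule from Table 1:

```python
import numpy as np

def adapt_subnet_lrs(base_lr, subnet_weights, eps=1e-12):
    """Scale each subnet's LR by ||w_ref|| / ||w_i||, with the seed subnet as reference.

    subnet_weights: list of flat weight vectors, oldest (seed) subnet first.
    Newly added subnets with small norms receive a larger LR, which then
    decays toward the base LR as their norms approach the reference.
    """
    ref = np.linalg.norm(subnet_weights[0])
    return [base_lr * ref / (np.linalg.norm(w) + eps) for w in subnet_weights]
```

Under this rule, a new subnet whose weight norm is half the seed's would start at 2× the base LR, qualitatively matching the >1× starting factors observed in Figures 5 and 8.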

Wall-Clock Training Speedup. We benchmark GPU memory consumption and wall-clock training time on CIFAR-100 for each stage during training on a single NVIDIA 2080Ti GPU. The large baseline ResNet-18 trains for 140 minutes to achieve 78.36% accuracy. As shown by the green bar of Figure 7(b), the growing method achieves only marginal wall-clock acceleration under the same fixed batch size: the growing ResNet-18 takes 120 minutes to achieve 78.12% accuracy. The low GPU utilization, visible in the green bar of Figure 7(a), hinders FLOPs savings from translating into real-world training acceleration. In contrast, the red bar of Figure 7 shows that our batch size adaptation yields a large wall-clock acceleration by filling GPU memory, and the corresponding parallel execution resources, while maintaining test accuracy: ResNet-18 trains for 84 minutes (1.7× speedup) and achieves 78.01% accuracy.



• Parameterization using Variance Transfer: We propose a parameterization scheme accounting for the variance transition between networks of smaller and larger width within a single training process. Initialization of new weights is gradient-free and requires neither additional memory nor training.
• Improved Optimization with Rate Adaptation: Subnetworks trained for different lengths have distinct learning rate schedules, with dynamic relative scaling driven by weight norm statistics.

Table 1: Parameterization and optimization transition for different layers during growing.

Table 1 (bottom) summarizes our LR adaptation rule for SGD when f is instantiated as weight norm. Alternative heuristics are possible; see Appendix A.1.

Input: data X, labels Y, task loss L
Output: grown model S
Initialize: S_0 with C_0, T_0, B_0, η_0
for n = 0 to N − 1 do
    if n > 0 then
        Init. S_n from S_{n−1} using VT in Table 1.
        Update C_n and T_n using Eq. 6 and Eq. 7.
        Update B_n using Eq. 8 (optional).
    Iter…
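The stage-wise procedure can be sketched as a schematic training loop. In this sketch, `grow` stands in for the variance-transfer initialization and the Eq. 6-8 schedule updates, and `train_stage` for the inner optimization of one epoch; both are placeholders we introduce for illustration, not the paper's implementation:

```python
def grow_and_train(n_stages, seed_width, width_step, epochs_per_stage, train_stage, grow):
    """Stage-wise growing: (grow if n > 0), then train for the stage's epochs, repeat."""
    model = {"width": seed_width, "age": 0}
    history = []
    for n in range(n_stages):
        if n > 0:
            model = grow(model, width_step)  # VT init (Table 1), Eq. 6-8 schedule updates
        for _ in range(epochs_per_stage):
            train_stage(model)               # inner optimization for stage n
            model["age"] += 1
        history.append(model["width"])
    return history
```

With 10 stages of 5 epochs and a width step of 50, this reproduces the 50 → 500 schedule used in the fully-connected experiment of Appendix A.4.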




Rate Adaptation Rule for Adam (Kingma & Ba, 2015) and AvaGrad (Savarese et al., 2021).

Generalization to Adam and AvaGrad for ResNet-20 on CIFAR-10.

Both methods are adaptive optimizers in which different heuristics derive a parameter-wise learning rate strategy; this provides primitives that can be extended using our stage-wise adaptation principle for network growing. For example, vanilla Adam adapts the global learning rate with running estimates of the first moment m_t and the second moment v_t of the gradients, where the number of global training steps t is a single integer when training a fixed-size model. When growing networks, our learning rate adaptation instead maintains a vector t that tracks each subcomponent's 'age' (i.e., the number of steps it has been trained for). As such, for a newly grown subcomponent at a stage i > 0, t[i] starts at 0 and the learning rate is adapted from m_t / (√v_t + ε), evaluated with the subcomponent's own age t[i].
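A minimal sketch of the per-subcomponent step counter: Adam's bias-correction factor, which modulates the effective step size, is computed from each subcomponent's own age rather than the global step. The function names are ours, and only the standard Adam bias correction is shown:

```python
import math

def adam_bias_correction(t, beta1=0.9, beta2=0.999):
    """Standard Adam bias-correction factor sqrt(1 - beta2^t) / (1 - beta1^t), for age t >= 1."""
    return math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

def per_subnet_scales(ages, beta1=0.9, beta2=0.999):
    # ages[i]: number of steps subcomponent i has been trained for (the vector t)
    return [adam_bias_correction(t, beta1, beta2) for t in ages]
```

A freshly grown subcomponent (small t[i]) thus gets the same early-training step-size behavior that Adam gives a fixed-size model at the start of training, while older subcomponents keep their mature factor near 1.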

Comparisons among Standard SGD, LARS and Ours for ResNet-20 Growth on CIFAR-10.

Our rate adaptation bests the other settings for both Adam and AvaGrad, indicating that the strategy generalizes effectively to these optimizers in the growing scenario. Together, these results also demonstrate that our method has the flexibility to incorporate the different statistics tracked and used by distinct optimizers, with Adam and AvaGrad as examples. Finally, our proposed stage-wise rate adaptation strategy can be applied to virtually any optimizer.

Growing ResNet-18 using Incremental CIFAR-100.

APPENDIX

A.4 SIMPLE EXAMPLE ON FULLY-CONNECTED NEURAL NETWORKS

Additionally, we train a simple fully-connected neural network with 8 hidden layers on CIFAR-10; each hidden layer has 500 neurons and is followed by a ReLU activation. The network has a final linear layer with 10 neurons for classification. Note that each CIFAR-10 image is flattened to a 3072-dimensional (32 × 32 × 3) vector.
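For reference, the layer sizes above determine the full-width parameter count via a simple sum over consecutive layer dimensions (a quick sanity-check sketch; the totals below are our own arithmetic, not figures reported in the paper):

```python
def mlp_param_count(in_dim=3072, hidden=500, n_hidden=8, out_dim=10):
    """Weights + biases of a fully-connected net with n_hidden equal-width hidden layers."""
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))
```

At the full width of 500 this gives about 3.3M parameters, while the seed width of 50 used at the first growth stage gives about 0.17M, illustrating the gap the 10-stage schedule bridges.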

