LEARNING TO GROW PRETRAINED MODELS FOR EFFICIENT TRANSFORMER TRAINING

Abstract

Scaling transformers has led to significant breakthroughs in many domains, leading to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.

1. INTRODUCTION

The transformer architecture (Vaswani et al., 2017) has emerged as a general purpose architecture for modeling many structured domains (Devlin et al., 2019; Brown et al., 2020; Rives et al., 2021; Dosovitskiy et al., 2021; Touvron et al., 2021a) . Perhaps more so than other architectures, the transformer empirically seems to have inductive biases that make it especially amenable to scaling (Rosenfeld et al., 2019; Kaplan et al., 2020) , which has led to a paradigm in which larger versions of smaller, existing models are trained and released on a periodic basis (e.g., the GPT lineage of models (Radford et al., 2018; 2019; Brown et al., 2020) ). New instances of such models are typically trained completely from scratch, despite the fact that they are often scaled-up versions of their smaller counterparts. Given the compute required to train even the smaller models, we argue that training each model from scratch is wasteful, and that prior knowledge implicit in the parameters of smaller pretrained models should be leveraged to enable faster training of larger models. One approach to this problem is through the lens of model growth, wherein a smaller model's pretrained parameters are used to initialize a subset of the larger model's parameters. While earlier works generally froze the parameters initialized from the pretrained model and only trained the new (randomly initialized) parameters (Fahlman & Lebiere, 1989; Fahlman, 1990; Gutstein et al., 2008) , subsequent work has shown that copying a subset of the pretrained parameters to initialize the new parameters and then finetuning the entire network significantly accelerates training and sometimes even leads to better performance (Chen et al., 2015) . 
When applied to modern transformers, these mechanisms roughly translate to a depth-expansion operator in which pretrained models are stacked (or combined with identity layers) to initialize deeper transformers (Gong et al., 2019; Yang et al., 2020), and a width-expansion operator in which the smaller model's matrices are copied to initialize the larger model's matrices (e.g., in block-diagonal fashion) (Chen et al., 2021; Gu et al., 2020). Noting the empirical effectiveness of such recipes, we observe that existing mechanisms generally do not have a learning component (e.g., randomly copying over neurons for width-expansion or stacking consecutive layers for depth-expansion). This paper instead proposes an efficient, data-driven approach for learning to grow transformers. In particular, our approach frames the problem of initializing the larger model's parameters as learning a linear mapping from the smaller model's parameters, i.e., $\Theta^{(large)} = M\,\Theta^{(small)}$, where $\Theta^{(small)}$ and $\Theta^{(large)}$ are the vectorized parameters of the small/large models. Due to the high dimensionality of the parameters, this mapping is completely intractable to learn without any restrictions on $M$.
We thus factorize the linear mapping to be a composition of sparse width- and depth-expansion operators, $M = L_{depth} R_{width}$, where both width and depth matrices are further factorized into Kronecker products of smaller matrices that express architectural knowledge (e.g., through grouping parameters by layers and neurons). We show that our growth operators can represent existing approaches such as layer-stacking and neuron-copying as special cases. We find that with a small amount of learning on $M$ (e.g., 100 gradient steps) to initialize the larger model, we can significantly accelerate training of both vision and language transformers. Figure 1 illustrates our approach. We apply our learned linear growth operator (LiGO) to popular families of models, including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), and ViT (Dosovitskiy et al., 2021; Touvron et al., 2021a;b), and find that LiGO can consistently improve transformer training efficiency over the traditional way of training from scratch across domains and model sizes. For instance, LiGO saves 44.7% and 22.5% of the FLOPs of training BERT-Base and GPT2-Medium from scratch, respectively, by reusing pretrained smaller models that are half as big. Similarly, for vision transformers, when using DeiT-S (Touvron et al., 2021a) for initialization, LiGO yields 55% savings in FLOPs with no performance drop on ImageNet (Deng et al., 2009). These FLOPs savings directly translate to similar wall-clock savings. We further find that models trained using LiGO achieve similar performance to the trained-from-scratch baselines when transferred to downstream tasks.

2. RELATED WORK

Efficient training. Efficient training of transformers has been studied from multiple perspectives. Some methods that are orthogonal to our work include mixed precision training (Shoeybi et al., 2019), large batch optimization (You et al., 2019), distributed training (Huang et al., 2019), and dropping layers (Zhang & He, 2020) or tokens (Hou et al., 2022). Knowledge inheritance (Qin et al., 2021) explores knowledge distillation during pretraining to efficiently learn larger transformers. Progressive training, which first trains a small transformer with few layers and then gradually expands it by stacking layers, has also been applied to accelerate transformer training (Gong et al., 2019; Yang et al., 2020; Li et al., 2022; Shen et al., 2022). Net2Net (Chen et al., 2015) uses function-preserving transformations to grow width by copying neurons and depth by using identity layers. Recently, bert2BERT (Chen et al., 2021) extended Net2Net to transformers. In contrast to these approaches, our approach learns to (linearly) transform the parameters of a smaller model to initialize a larger model. While there is a line of work on learning to grow neural networks in a data-driven way, these methods are in general difficult to apply to modern-scale transformers since they (for example) involve growing a single neuron at a time or employ expensive optimization/search procedures (Wei et al., 2016; Cai et al., 2018; Wu et al., 2019; 2021; Evci et al., 2022). Network initialization. Our work is also related to work on neural network initialization. Existing works include controlling the norm of the parameters (Mishkin & Matas, 2015; Kilcher et al., 2018; Dai et al., 2019; Wu et al., 2019; Glorot & Bengio, 2010) or replacing the normalization layers (Brock et al., 2021; Zhang et al., 2019; Huang et al., 2020).
MetaInit (Dauphin & Schoenholz, 2019) proposes an automatic method that optimizes the norms of weight tensors to minimize the gradient quotient on minibatches of random Gaussian samples. GradInit (Zhu et al., 2021) learns to initialize larger networks by adjusting the norm of each layer. Our work focuses on using smaller pretrained transformers to better initialize larger transformers, which remains an understudied problem. Structured matrices. Finally, our work is also related to structured matrices, which are typically used to replace dense weight matrices to reduce training and inference cost. Examples include sparse and low-rank matrices (Chiu et al., 2021; Han et al., 2015), Chebyshev matrices (Tang et al., 2019), Toeplitz matrices (Sindhwani et al., 2015), Kronecker-product matrices (Zhang et al., 2015), and butterfly matrices (Dao et al., 2019). A unified framework to learn a broad family of structured matrices is presented in Sindhwani et al. (2015). Dao et al. (2022) propose Monarch matrices, which inherit the expressiveness of butterfly matrices and achieve reasonable accuracy-efficiency tradeoffs in many applications. While our approach is inspired by these works, we propose to grow pretrained models by learning structured sparse linear operators with Kronecker factorization, which to our knowledge has not been explored in the literature.

3. PROPOSED APPROACH

Notation. We denote the parameters of a neural network with $L$ layers and $D$ dimensions as $\Theta_{L,D} = [W_1 \cdots W_L]^\top \in \mathbb{R}^{LD \times D}$, where $W_l \in \mathbb{R}^{D \times D}$ denotes the weights for the $l$-th layer. With slight abuse of notation, we denote the vectorization of $\Theta_{L,D}$ as $\mathrm{vec}(\Theta_{L,D})^\top = [\mathrm{vec}(W_1)^\top \cdots \mathrm{vec}(W_L)^\top]$. Our goal is to re-use the parameters $\Theta = \Theta_{L_1,D_1}$ from a pretrained smaller model to initialize a large model $\Theta^{(new)} = \Theta_{L_2,D_2}$ through a model growth operator $M: \mathbb{R}^{L_1 D_1 \times D_1} \to \mathbb{R}^{L_2 D_2 \times D_2}$ that maps the weights of the smaller network to the weights of the larger one, i.e., $\Theta^{(new)} = M(\Theta)$, where $L_1 < L_2$ and $D_1 < D_2$. After model growth, we adopt $\Theta^{(new)}$ as the initialization of the large model and train it using standard recipes.

3.1. EXISTING GROWTH OPERATORS

Existing works have separately established model growth operators for depth ($L_1 < L_2$, $D_1 = D_2$) and width ($L_1 = L_2$, $D_1 < D_2$). We summarize these methods below.

Depth expansion. StackBERT (Gong et al., 2019) proposes to duplicate the smaller model to double the depth, based on the observation that upper layers share similar functionality with the lower layers. In contrast, interpolation-based depth expansion methods (Chang et al., 2017; Dong et al., 2020) interleave every layer to form a deeper model, which can be roughly interpreted as simulating a finer-grained solution to the original dynamical system from a neural ODE perspective (Chen et al., 2018). Letting $L_2 = kL_1$, the two methods' growth operators can be formulated as:

$$\text{(StackBERT)}\quad W^{(new)}_l = W_{l \bmod L_1}, \qquad \text{(Interpolation)}\quad W^{(new)}_l = W_{\lfloor l/k \rfloor}, \qquad \forall l \in [L_2]. \tag{1}$$

Width expansion. Net2Net (Chen et al., 2015) expands the width of neural networks by randomly copying neurons while preserving output values via normalization. This can be seen as growing a matrix associated with a particular layer by duplicating the columns and rows of its weight matrix. Suppose a layer has weight matrix $W_l \in \mathbb{R}^{D_1 \times D_1}$. To expand it to a matrix $W^{(new)}_l \in \mathbb{R}^{D_2 \times D_2}$ ($D_2 > D_1$), Net2Net copies $W_l$ to the upper-left corner of $W^{(new)}_l$, fills the new columns via a random selection matrix $S_l$, and finally duplicates and normalizes rows according to the selection matrix from the previous layer. Formally, the growth operator of Net2Net can be written as:

$$\text{(Net2Net)}\quad W^{(new)}_l = \begin{bmatrix} I \\ S_{l-1}^\top \end{bmatrix} D_l^{-1} W_l \begin{bmatrix} I & S_l \end{bmatrix}, \qquad D_l = \mathrm{diag}(S_{l-1}\mathbf{1}) + I, \qquad \forall l \in [L_2] \tag{2}$$

where $S_l \in \{0,1\}^{D_1 \times (D_2 - D_1)}$ is a random selection matrix. The diagonal of $D_l$ is a $D_1$-dimensional histogram whose $i$-th entry indicates the number of times the $i$-th column of $W_l$ was copied.
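To make these operators concrete, the sketch below implements the three growth rules above in NumPy. This is an illustrative reading of Eqs. 1-2, not the authors' code; the helper names are ours, and `net2net_grow` takes the selection matrices as explicit arguments rather than sampling them.

```python
import numpy as np

def stack_depth(layers, k):
    """StackBERT-style depth growth (Eq. 1): layer l of the new model
    reuses layer (l mod L1) of the small model."""
    L1 = len(layers)
    return [layers[l % L1] for l in range(k * L1)]

def interpolate_depth(layers, k):
    """Interpolation-style depth growth (Eq. 1): each small-model layer
    is repeated k times in place."""
    return [W for W in layers for _ in range(k)]

def net2net_grow(W, S_prev, S_cur):
    """Net2Net-style width growth of one weight matrix (Eq. 2):
    columns are expanded with this layer's selection matrix S_cur, and
    rows are expanded with the previous layer's selection S_prev,
    with D_l = diag(S_prev @ 1) + I normalizing the duplicated rows."""
    D1 = W.shape[0]
    D_l = np.diag(S_prev.sum(axis=1) + 1.0)
    left = np.vstack([np.eye(D1), S_prev.T]) @ np.linalg.inv(D_l)
    right = np.hstack([np.eye(D1), S_cur])
    return left @ W @ right  # shape (D2, D2)
```

Note that none of these rules involve any learned quantities: the new weights are fully determined by the small model and the (random) selection matrices.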

3.2. LEARNING TO GROW WITH A STRUCTURED LINEAR GROWTH OPERATOR

While existing operators have been empirically successful in accelerating transformer-based models such as BERT (Gong et al., 2019; Chen et al., 2021), we observe that they generally do not have a learning component and perform the depth- and width-expansions separately. In this section we introduce a general framework for learning to grow with a linear growth operator (LiGO), which generalizes existing operators by combining the width- and depth-growth operators in a data-driven way. We can formulate the problem of initializing the weights of the larger model $\Theta^{(new)}$ from the smaller model $\Theta$ through the following optimization problem:

$$\arg\min_{M} \; \mathbb{E}_{x \sim \mathcal{D}}\, \mathcal{L}(x; \Theta^{(new)}), \qquad \text{subject to} \quad \Theta^{(new)} = M(\Theta),$$

where $\mathcal{D}$ is the data distribution and $\mathcal{L}$ is the loss function. It is of course intractable to optimize over the entire operator space, and thus we further simplify the function $M$ to be a linear transformation, which results in the following formulation:

$$\mathrm{vec}(\Theta^{(new)}) = \mathrm{vec}(M(\Theta)) = M\,\mathrm{vec}(\Theta), \qquad M \in \mathbb{R}^{L_2 D_2^2 \times L_1 D_1^2}.$$

This simplified objective is still completely infeasible to apply to contemporary neural networks, where $L_1 D_1^2$ can easily be in the hundreds of millions. We therefore propose an efficient parameterization of $M$ for tractable learning.
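As a back-of-the-envelope check, the count below uses hypothetical BERT-Small→BERT-Base-like sizes (6 layers/512 dims grown to 12 layers/768 dims; illustrative, not exact configs) to show why the unrestricted map is hopeless, and previews the reductions derived in the following subsections:

```python
# Learnable-parameter counts for growing an (L1=6, D1=512) model into an
# (L2=12, D2=768) model; sizes are illustrative, not exact BERT configs.
L1, D1, L2, D2 = 6, 512, 12, 768

dense_M = (L2 * D2**2) * (L1 * D1**2)            # unrestricted M: ~1e13 entries
factored = L1 * D1**2 * D2**2 + L1 * L2 * D2**2  # after depth-width decomposition
kron = L1 * L2 + 2 * L1 * D1 * D2                # after Kronecker factorization

print(f"dense: {dense_M:.2e}, factored: {factored:.2e}, kronecker: {kron:.2e}")
```

The unrestricted operator would have on the order of 10^13 entries, while the final Kronecker-factored form of Section 3.2.2 needs only a few million.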

3.2.1. DECOMPOSITION ALONG DEPTH AND WIDTH

Our first step is to decompose the LiGO operator as $M = L_{depth} R_{width}$, where $L_{depth}$ and $R_{width}$ expand the depth and width of the model separately. Concretely, we decompose $M$ as

$$M = \underbrace{\begin{bmatrix} \mathrm{diag}(\ell_{1,1}) & \cdots & \mathrm{diag}(\ell_{1,L_1}) \\ \vdots & \ddots & \vdots \\ \mathrm{diag}(\ell_{L_2,1}) & \cdots & \mathrm{diag}(\ell_{L_2,L_1}) \end{bmatrix}}_{L_{depth}} \; \underbrace{\begin{bmatrix} R_1 & & \\ & \ddots & \\ & & R_{L_1} \end{bmatrix}}_{R_{width}} \tag{5}$$

where $R_l \in \mathbb{R}^{D_2^2 \times D_1^2}$ and $\ell_{i,j} \in \mathbb{R}^{D_2^2}$. In the above, $L_{depth}$ is an array of diagonal matrices and $R_{width}$ is a block-diagonal matrix, i.e., both matrices are highly structured and sparse. When applying $R_{width}$ to the weights $\mathrm{vec}(\Theta)$, the parameters of each layer are transformed independently via $\mathrm{vec}(W^{(new)}_l) = R_l\,\mathrm{vec}(W_l)$ and lifted to a higher dimension. The $l$-th row block of $L_{depth}$ corresponds to the growth operator of the $l$-th layer, which amounts to linearly combining all layers of the smaller model via $\mathrm{vec}(W^{(new)}_l)_k = \sum_{l'=1}^{L_1} (\ell_{l,l'})_k\,\mathrm{vec}(W_{l'})_k$. This factorization effectively reduces the complexity of the LiGO operator from $O(D_1^2 L_1 D_2^2 L_2)$ to $O(D_1^2 D_2^2 L_1)$ and encodes architectural knowledge by grouping parameters by layers. Later in Section 3.4, this representation is also shown to preserve high representation power owing to its connection with Monarch matrices (Dao et al., 2022; 2019).
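One direct (if memory-hungry) way to read the decomposition above is as a per-layer width lift followed by a coordinate-wise depth mixture. The sketch below applies the factorized operator without ever materializing $M$; it is our own illustrative helper, not the authors' implementation:

```python
import numpy as np

def apply_depth_width(layers, R_blocks, ell):
    """Apply M = L_depth @ R_width to a list of layer weights.
    layers:   L1 matrices of shape (D1, D1)
    R_blocks: L1 width operators of shape (D2**2, D1**2)
    ell:      array (L2, L1, D2**2) of per-coordinate depth weights
    Returns L2 matrices of shape (D2, D2)."""
    # Width: lift each layer independently, vec(W_l_new) = R_l @ vec(W_l).
    lifted = [R @ W.reshape(-1) for R, W in zip(R_blocks, layers)]
    D2 = int(round(np.sqrt(lifted[0].size)))
    # Depth: coordinate-wise linear combination of the lifted layers.
    return [sum(ell[i, j] * lifted[j] for j in range(len(layers))).reshape(D2, D2)
            for i in range(ell.shape[0])]
```

The dense `R_blocks` here are exactly the blocks the next subsection replaces with Kronecker factors, since storing $L_1$ matrices of size $D_2^2 \times D_1^2$ is still prohibitive at transformer scale.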

3.2.2. PARAMETER SHARING VIA KRONECKER FACTORIZATION

The above LiGO operator requires $O(D_1^2 D_2^2 L_1)$ parameters for $R_{width}$ and $O(L_1 L_2 D_2^2)$ for $L_{depth}$. The width operator $R_{width}$ is thus still prohibitively expensive given that $D_1$ (and $D_2$) can easily be in the hundreds or thousands. In this section, we propose a Kronecker factorization to further reduce the number of learnable parameters for each growth operator.

Depth. For depth, we treat an entire layer as a single group and construct a new layer by combining existing layers, effectively tying parameters for all neurons in the same layer. Formally, each block in $L_{depth}$ is simplified to $\mathrm{diag}(\ell_{i,j}) = w_{i,j} I$. Then the entire matrix can be written as a Kronecker factorization, $L_{depth} = w \otimes I$, where $w \in \mathbb{R}^{L_2 \times L_1}$ is a matrix whose entry $w_{i,j}$ indicates the blending weight of the $j$-th layer of the small model in forming the $i$-th layer of the large model. This strategy reduces the number of parameters in $L_{depth}$ to $O(L_1 L_2)$, and is shown on the left-hand side of Figure 1.

Width. For width, we decompose each diagonal block of the width expansion operator $R_{width}$ using the Kronecker factorization (Schacke, 2004): $R_l = A_l \otimes B_l$, where $A_l, B_l \in \mathbb{R}^{D_2 \times D_1}$. Since $\mathrm{vec}(CAB) = (B^\top \otimes C)\,\mathrm{vec}(A)$, we then have

$$R_{width}\,\mathrm{vec}(\Theta) = \begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix} \mathrm{vec}(\Theta) = \mathrm{vec}\!\left(\begin{bmatrix} B_1 W_1 A_1^\top & \cdots & B_{L_1} W_{L_1} A_{L_1}^\top \end{bmatrix}^\top\right). \tag{6}$$

Here we observe that $B_l W_l A_l^\top$ performs in- and out-dimension expansion by $A_l$ and $B_l$, respectively. Each new column/row is a linear combination of columns/rows of the small model's weight matrix. This factorization, which can be seen as grouping parameters by neurons, reduces the number of parameters to $O(L_1 D_1 D_2)$. Figure 1 (right) illustrates LiGO's width-expansion operator. Altogether, we obtain the final parameterization of the LiGO operator $M$:

$$M = \underbrace{\left( \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,L_1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{L_2,1} & w_{L_2,2} & \cdots & w_{L_2,L_1} \end{bmatrix} \otimes I \right)}_{\text{Depth expansion}} \; \underbrace{\begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix}}_{\text{Width expansion}} \tag{8}$$

We can exploit the factorization to implement the LiGO operator (Eq. 8) efficiently. Training. LiGO expands a model in three steps: (1) for each layer, inserting new rows by linearly combining existing rows through $B_l$; (2) for each layer, inserting new columns by linearly combining existing columns through $A_l$; and finally (3) reconstructing each layer by linearly combining the weight matrices with $w$ along the depth. We then run a few steps (e.g., 100 iterations) of SGD to optimize $M$, which has negligible compute cost relative to regular training. After obtaining $M$, we initialize the large model with $M\,\mathrm{vec}(\Theta)$ and train the parameters $\Theta^{(new)}$ through SGD as usual. Algorithm 1 summarizes a forward pass of LiGO with a transformer. Finally, as shown in Appendix A, we note that StackBERT (Eq. 1), Interpolation (Eq. 1), and Net2Net (Eq. 2) are all special cases of LiGO (Eq. 8) with particular settings of $L_{depth}$ and $R_{width}$.
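Under the final parameterization, the whole growth step reduces to small $D_2 \times D_1$ matrices and an $L_2 \times L_1$ mixing matrix. A minimal sketch of the efficient forward pass (our own, assuming NumPy's row-major vectorization, which swaps the Kronecker order relative to the column-major convention used in Eq. 6):

```python
import numpy as np

def ligo_expand(layers, w, A, B):
    """Efficient LiGO forward: width-expand each layer as
    B_j @ W_j @ A_j.T, then depth-combine the results with w,
    never forming the full operator M."""
    widened = [B[j] @ layers[j] @ A[j].T for j in range(len(layers))]
    return [sum(w[i, j] * widened[j] for j in range(len(layers)))
            for i in range(w.shape[0])]

# Sanity check: with row-major (NumPy) vectorization, the identity
# vec(B W A^T) = (B kron A) vec(W) holds, matching the Kronecker-factored R_l.
rng = np.random.default_rng(0)
A0, B0 = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
W0 = rng.normal(size=(2, 2))
assert np.allclose(np.kron(B0, A0) @ W0.reshape(-1),
                   (B0 @ W0 @ A0.T).reshape(-1))
```

In practice the `A`, `B`, and `w` parameters are the quantities optimized with the ~100 SGD steps described above, after which `ligo_expand` produces the initialization of the large model.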

3.3. LIGO FOR TRANSFORMERS

While LiGO can be applied to any multi-layer neural network architecture, in this paper we focus on using LiGO to grow transformers, which have been shown to be particularly amenable to scaling. Below we briefly describe how LiGO is applied to the main transformer embedding/attention layers and defer further details (e.g., growing bias vectors and layer norm parameters) to Appendix B.1. Embedding layer. The embedding layer can be regarded as a linear layer whose inputs are one-hot vectors. We learn a matrix $B^{(emb)}$ to extend its output dimension. This embedding layer is also used as the final output layer for our transformer language modeling experiments. Attention and feedforward layers. An attention layer consists of multi-head attention weights ($W^Q, W^K, W^V$) and a linear projection ($W^O$). Let $A_l^k$ and $B_l^k$, where $k \in \{Q, K, V, O\}$, be the $l$-th layer's in- and out-dimension expansion matrices (Eq. 6) for the query, key, value, and projection matrices. To make sure new input and output channels are aligned across modules, we tie the LiGO operator as follows: for all $l \in [L_1]$, (1) $A_l^k = (B^{(emb)})^\top$ for all $k \in \{Q, K, V\}$; (2) $A_l^O = (B_l^V)^\top$; (3) $B_l^O = B^{(emb)}$. The last constraint is added to take into account the residual connections (Chen et al., 2021). We similarly tie parameters for the feed-forward networks: $A_l^{(fc1)} = (B^{(emb)})^\top$, $A_l^{(fc2)} = (B_l^{(fc1)})^\top$, and $B_l^{(fc2)} = B^{(emb)}$. Since transformers make heavy use of residual layers with skip connections, we found that simply using the same $B^{(emb)}$ to parameterize $A_l^k$ and $B_l^k$ for many layers/modules worked well in practice. This reduces the number of learnable parameters even further and enables fast learning of $M$ on a small amount of data (100 gradient steps).
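The tying constraints can be stated compactly in code. The sketch below derives the expansion matrices for one attention + FFN block from the few free parameters; the function name and argument shapes are our illustrative choices, not the released implementation:

```python
import numpy as np

def tied_block_operators(B_emb, B_v, B_fc1):
    """Derive the tied LiGO expansion matrices for one attention + FFN
    block from the free out-expansions B_emb, B_v, B_fc1 (constraints
    (1)-(3) and the FFN ties described above)."""
    A_qkv = {k: B_emb.T for k in ("Q", "K", "V")}  # (1) A^k = (B_emb)^T
    A_O = B_v.T                                    # (2) A^O = (B^V)^T
    B_O = B_emb                                    # (3) B^O = B_emb (residual)
    A_fc1 = B_emb.T                                # FFN input follows embedding
    A_fc2 = B_fc1.T                                # fc2 consumes fc1's outputs
    B_fc2 = B_emb                                  # fc2 returns to the residual
    return A_qkv, A_O, B_O, A_fc1, A_fc2, B_fc2
```

Each derived matrix is a transpose or alias of a free one, so growing a whole block adds no learnable parameters beyond `B_emb`, the per-layer `B_q`/`B_k`/`B_v`, and `B_fc1`.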

3.4. CONNECTION TO MONARCH MATRICES

As shown in Section 3.2.1, our depth-width decomposition factorizes $M$ into a product of two structured sparse matrices. We examine the expressiveness of this factorized representation by relating it to Monarch matrices (Dao et al., 2022), defined below.

Definition 1. Let the space of Monarch matrices be $\mathcal{M} \subseteq \mathbb{R}^{mn_1 \times mn_2}$. Then a matrix $M \in \mathcal{M}$ if $M = P_1 L P_2^\top R = P_1\,\mathrm{diag}(L_1, \cdots, L_{n_1})\,P_2^\top\,\mathrm{diag}(R_1, \cdots, R_{n_2})$, where $L_i \in \mathbb{R}^{b_1 \times b_2}$ and $R_i \in \mathbb{R}^{b_3 \times b_4}$ are dense rectangular matrices with $n_1 b_2 = n_2 b_3$, $P_1$ is the permutation $\pi(i) = (i - b_1\lfloor i/b_1 \rfloor - 1)n_1 + \lfloor i/b_1 \rfloor + 1$, and $P_2$ is the permutation $\pi(j) = (j - b_2\lfloor j/b_2 \rfloor - 1)n_1 + \lfloor j/b_2 \rfloor + 1$.

It is clear that the block-diagonal matrix $R$ has the identical form to our width-growth operator $R_{width}$. By applying the permutation matrices $P_1$ and $P_2$ to $L$, $L$ is transformed into exactly the same form as our depth-growth operator $L_{depth}$ in Eq. 5. This implies that our depth-width decomposition coincides with a Monarch sparsification of dense matrices, which generalizes butterfly matrices (Dao et al., 2019) and enjoys rich expressivity properties (Dao et al., 2020; 2022).
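The permutation argument can be checked numerically: conjugating a block-diagonal matrix by a perfect-shuffle (stride) permutation turns it into an array of diagonal blocks, i.e., exactly the structure of $L_{depth}$ in Eq. 5. A small self-contained check (our construction; the shuffle below is a standard stride permutation used for illustration, not the exact $\pi$ of Definition 1):

```python
import numpy as np

def shuffle_perm(m, n):
    """Stride permutation on R^{m*n}: index b*n + o  ->  o*m + b."""
    P = np.zeros((m * n, m * n))
    for b in range(m):
        for o in range(n):
            P[o * m + b, b * n + o] = 1.0
    return P

m, n = 3, 4                      # m dense n-by-n diagonal blocks
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(n, n)) for _ in range(m)]
L = np.zeros((m * n, m * n))
for i, Bi in enumerate(blocks):
    L[i * n:(i + 1) * n, i * n:(i + 1) * n] = Bi

P = shuffle_perm(m, n)
T = P @ L @ P.T                  # array-of-diagonals form, like L_depth
for r in range(n):               # every m-by-m sub-block of T is diagonal
    for c in range(n):
        blk = T[r * m:(r + 1) * m, c * m:(c + 1) * m]
        assert np.allclose(blk, np.diag(np.diag(blk)))
        assert np.allclose(np.diag(blk), [Bi[r, c] for Bi in blocks])
```

The check confirms that, up to permutations, "block-diagonal with dense blocks" and "grid of diagonal blocks" describe the same sparsity budget, which is the sense in which LiGO's factorization matches Monarch's.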

4. EXPERIMENTS

We conduct experiments to answer three key research questions. Q1: To what extent can LiGO improve the training efficiency (FLOPs and wall time) of transformers compared to training from scratch and other growth operators? Q2: Can LiGO be universally effective across transformers from different domains (e.g., language and vision) and sizes? Q3: Can models trained using LiGO achieve similar performance compared to the baselines when transferred to downstream tasks?

4.1. EXPERIMENTAL SETUP

Following Shen et al. (2022), we train GPT2 models with a batch size of 384 and a sequence length of 1024. For vision transformers, we build our models based on DeiT (Touvron et al., 2021a) and CaiT (Touvron et al., 2021b), and apply their default hyper-parameters for training on the ImageNet dataset. We train all our vision transformers for 300 epochs with a batch size of 1024. For transfer learning with BERT/RoBERTa, we follow Tan & Bansal (2020) and train for 3 epochs with a learning rate of 1e-4 and a batch size of 32 for all tasks in GLUE. On SQuAD v1.1 and SQuAD 2.0, we fine-tune for 2 epochs with a learning rate of 5e-5 and a batch size of 12. We run both GLUE and SQuAD evaluations three times with different random seeds and report the mean numbers. For transfer learning experiments on DeiT, we finetune the pretrained models for 1000 epochs with a batch size of 768 and a learning rate of 0.01, and use the same data augmentation as in training on ImageNet. We use the same pretraining data and experimental settings for all the baselines (including our approach) for a fair comparison. Note that we include the additional compute required for training LiGO in all our tables and figures. However, since LiGO is only trained for 100 steps, its influence on the visualized and quantitative saving percentages is negligible.

4.2. RESULTS AND ANALYSIS

BERT. Similarly, LiGO significantly outperforms the recent bert2BERT method, which saves about 30% of the computational cost. We observe that KI does not provide any real savings in training as it requires additional computation for knowledge distillation. Figure 2(c) shows that our LiGO approach is flexible in growing either BERT-Small or BERT-Base for accelerating BERT-Large training. As expected, reusing BERT-Base instead of BERT-Small leads to more savings in FLOPs (45.2% vs 30.3%), as BERT-Base contains more implicit knowledge in its parameters. Comparing the downstream performance of different BERT-Base models on both the GLUE and SQuAD benchmarks, we find that BERT trained with LiGO achieves very similar performance compared to the baselines on both benchmarks. Finally, in Table 5 of Appendix C.3, we show that growing BERT-Small to BERT-Base with 100 steps of LiGO and then finetuning on GLUE tasks without additional pretraining outperforms directly finetuning BERT-Small.

RoBERTa and GPT2.

Combining with other training strategies. We also find that LiGO can be effectively combined with orthogonal strategies such as layer dropping (Zhang & He, 2020), token dropping (Hou et al., 2022), and staged training (Chen et al., 2021). More details are included in Appendix B.3.
Figure 5 shows that LiGO can be combined with these other training techniques to improve the computational savings by 4.7%, 7.4%, …, respectively.

Depth-only expansion. We examine the effectiveness of our proposed depth expansion operator ($L_{depth}$) by only growing the depth of BERT from 6 layers to 12 layers, i.e., BERT(6, 768)→BERT(12, 768). We compare with stacking (StackBERT; Gong et al., 2019), interpolation (InterBERT; Chang et al., 2017; Dong et al., 2020) (see Eq. 1), and MSLT (Yang et al., 2020). For LiGO, we only apply its $L_{depth}$ component to the pretrained model weights. Results in Figure 6(a) show that a data-driven approach works well even when growing only across the depth dimension.

Width-only expansion. We also verify the effectiveness of $R_{width}$ by only extending BERT's width from 512 to 768, i.e., BERT(12, 512)→BERT(12, 768). We compare LiGO-based initialization with direct copy (Wei et al., 2016), function-preserving initialization (FPI; Chen et al., 2015), and advanced knowledge initialization (AKI; Chen et al., 2021). LiGO's width expansion component outperforms all other methods, as shown in Figure 6(b).

Number of growing steps. Our main experiments use just 100 gradient steps to grow. We tune our LiGO on the pretraining set for 100, 500, 1000, and 10000 steps and compute the additional FLOPs for BERT-Small→BERT-Base training.

5. CONCLUSION

This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where the larger transformer's parameters are initialized as a linear mapping from the smaller pretrained model's parameters. The linear map is factorized into a composition of sparse width- and depth-expansion operators with a Kronecker factorization that groups parameters into layers and neurons. We demonstrate the effectiveness of our proposed approach on both language and vision transformers of different sizes, outperforming several competing methods. While our compute resources prevented us from applying LiGO to even larger transformers, it would be interesting to see if this approach can be applied on top of even larger models.

A UNIVERSALITY OF LIGO OPERATOR

Proposition 1. StackBERT (Eq. 1), Interpolation (Eq. 1), and Net2Net (Eq. 2) are all special cases of the LiGO operator (Eq. 8).

Proof. We prove Proposition 1 by constructing the parameters in $L_{depth}$ and $R_{width}$.

Stacking. Stacking-based methods (Gong et al., 2019; Yang et al., 2020) duplicate the entire stack of lower blocks on top of the small model to form the new layers (Eq. 1). Formally, this operation corresponds to the following operator (shown for $L_2 = 2L_1$):

$$M = \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \\ I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{L_{depth}} \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{R_{width}}$$

Interpolation. Interpolation-based methods (Chang et al., 2017; Dong et al., 2020) repeat each layer twice in place. We can construct the following matrix to achieve layer interpolation (Eq. 1):

$$M = \underbrace{\begin{bmatrix} I & & \\ I & & \\ & I & \\ & I & \\ & & \ddots \end{bmatrix}}_{L_{depth}} \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{R_{width}}$$

We remark that any rearrangement of layers to construct new layers (mathematically, a permutation of existing layers with replacement) can be constructed in a similar way.

Net2Net. As shown in Eq. 6, the Kronecker factorization of $R_l$ amounts to decomposing the general growth operator into in-dimension and out-dimension expansion.
We can construct Net2Net (Chen et al., 2015) based growth by simply letting

$$L_{depth} = I, \qquad R_{width} = \begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix}, \qquad A_l = \begin{bmatrix} I \\ \bar{S}_{l-1} \end{bmatrix}, \quad B_l = \begin{bmatrix} I \\ S_l \end{bmatrix},$$

where $S_l \in \{0,1\}^{(D_2 - D_1) \times D_1}$ is a selection matrix that enlarges the out dimension, and $\bar{S}_{l-1} = S_{l-1}\,\mathrm{diag}(\mathbf{1}^\top S_{l-1})^{-1}$ copies the selection from $S_{l-1}$ with normalization to guarantee function preservation in expansion.
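The function-preservation claim can be verified numerically on a two-layer ReLU MLP with one-hot selection columns. This is our own toy verification, not code from the paper:

```python
import numpy as np

def net2net_mlp(W1, W2, S):
    """Grow the hidden width of y = W2 @ relu(W1 @ x) by duplicating
    hidden units (each column of S one-hot picks a unit to copy) and
    renormalizing the outgoing weights so the function is unchanged."""
    D_l = np.diag(S.sum(axis=1) + 1.0)       # copy counts, as in Eq. 2
    W1_new = np.vstack([W1, S.T @ W1])       # duplicate selected rows
    W2_scaled = W2 @ np.linalg.inv(D_l)      # split outgoing weight mass
    W2_new = np.hstack([W2_scaled, W2_scaled @ S])
    return W1_new, W2_new

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
S = np.zeros((4, 2)); S[0, 0] = 1.0; S[2, 1] = 1.0
W1n, W2n = net2net_mlp(W1, W2, S)
x = rng.normal(size=3)
y_small = W2 @ np.maximum(W1 @ x, 0.0)
y_big = W2n @ np.maximum(W1n @ x, 0.0)
assert np.allclose(y_small, y_big)           # function preserved exactly
```

The check works because duplicating a hidden unit and splitting its outgoing weight by the copy count leaves every downstream pre-activation unchanged, which is precisely the role of the normalization $\bar{S}_{l-1}$ above.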

B IMPLEMENTATION DETAILS B.1 GROWING TRANSFORMERS WITH LIGO

The transformer architecture consists of an embedding layer, a stack of attention blocks, and an output layer. Each attention block consists of a Multi-Head Attention (MHA) module followed by a FeedForward Network (FFN), with skip connections across both modules. Applying LiGO requires the following considerations:

B.3 ORTHOGONAL EFFICIENT TRAINING STRATEGIES

For layer dropping, we follow the same progressive dropping-rate schedule as Zhang & He (2020), and set the maximum dropping rate to 0.1 to recover the performance. For token dropping, we randomly set aside 15% of tokens in the middle layers. In the first 50k steps of staged training, only a sub-network is activated and trained; afterwards, we perform full-model training for 350k steps.

C ADDITIONAL EXPERIMENTS

C.1 REUSING SMALLER MODELS TRAINED FOR ONLY A FEW STEPS

LiGO focuses on utilizing the knowledge of smaller models that have already been pretrained and are available. In this section, we investigate how LiGO can leverage smaller existing models that have only been trained for a few steps to accelerate training of a larger model. We perform an experiment on BERT-Base by reusing a BERT-Small trained for only 50k steps instead of the full 220k steps used in our experiments. Figure 7 shows that LiGO still provides 35.2% savings in FLOPs and 30.2% savings in wall time over training BERT-Base from scratch. Our extensive experiments on BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), DeiT (Touvron et al., 2021a), and CaiT (Touvron et al., 2021b) show that LiGO can consistently improve transformer training efficiency over the traditional way of training from scratch across domains and model sizes. One interesting future direction of our work is scaling LiGO to very large models with more than 100B parameters, such as GPT3 (Brown et al., 2020). While we currently do not possess the compute resources for this extremely large-scale study, we perform a preliminary experiment on GPT2-1.5B (Radford et al., 2019) by using GPT2-Medium as the initialization. We train for 15k steps on the C4 dataset (Raffel et al., 2020) and find that our proposed LiGO saves about 39% of the computation cost (FLOPs) of training GPT2-1.5B from scratch to reach the same log perplexity (3.3). We believe it is imperative to study the extent to which the benefits of LiGO remain at the scale of modern large language models. We hope to cover this in future work.



Footnotes:
- For notational brevity we assume that each hidden layer has the same number of dimensions D, but LiGO can be straightforwardly generalized to layers with different dimensions (e.g., FFN layers of transformers).
- We therefore have vec(Θ_{L,D})^⊤ ∈ R^{LD²}. Our approach is also agnostic with regard to vectorization order.
- We define a single layer as f_l(x) = W_l x + b_l, where the row number of W_l corresponds to the output dimension and the column number of W_l corresponds to the input dimension.
- While the original BERT (Devlin et al., 2019) paper also uses the Toronto Book Corpus (Zhu et al., 2015), we do not include it here since it is no longer publicly available.



Figure 1: Our linear growth operator (LiGO) accelerates training by using the weights of a smaller model Θ to initialize the weights of the larger model Θ(new). LiGO is parameterized as a sparse linear map M that can be decomposed into width- and depth-expansion operators. The width operator R width and depth operator L depth are structured matrices obtained from Kronecker products of smaller matrices, which encode architectural knowledge by grouping parameters into layers and neurons. While we show the expansion operators for simple multi-layer perceptrons for illustrative purposes, in practice we apply LiGO to enable faster training of transformer networks. In our approach, we learn the growth matrix M with 100 steps of SGD, use this to initialize the larger model, and then continue training as usual. Best viewed in color.

Figure 2: Results on BERT. (a-b) show validation log perplexity vs. FLOPs and wall time, respectively, for training BERT-Base by reusing BERT-Small. (c) shows log perplexity vs. FLOPs when growing BERT-Small and BERT-Base to BERT-Large. The solid line indicates the final perplexity of the larger model trained from scratch, while the dotted line represents the performance of the smaller model trained from scratch. LiGO offers about 45% savings in FLOPs and 40% savings in wall time over BERT-Base training from scratch. Our approach is also flexible in reusing either BERT-Small or BERT-Base for accelerating BERT-Large training.

Figure 2 shows the comparison between the different baselines for training BERT models. As seen in Figure 2(a), LiGO saves 44.7% of the computational cost (FLOPs) of training BERT-Base (12 layers, 768 dimensions) from scratch by reusing BERT-Small (6 layers, 512 dimensions). LiGO also offers 40.7% savings in wall time compared to training from scratch (Figure 2(b)). Among the compared methods, StackBERT is the most competitive in terms of both FLOPs and wall time, although LiGO obtains +10.6% and +7.2% improvements in FLOPs and wall time, respectively, on top of StackBERT.

Figure 4: Results on DeiT. (a) Accuracy vs. FLOPs and (b) accuracy vs. wall time for training DeiT-B. LiGO saves FLOPs and wall time by more than 50% over training from scratch on ImageNet.

Vision Transformers. Figure 4 shows that by growing from DeiT-S, LiGO can save 55.4% FLOPs and 52% GPU wall time to reach the same performance of 81% on ImageNet. Interestingly, the model initialized by our data-driven growth operator (with only 100 gradient steps of tuning) already achieves 72% accuracy at the beginning of training and reaches a final accuracy of 81.7% at the end of training. Compared to the next best method, bert2BERT, LiGO obtains more than 15% additional savings, which once again demonstrates the effectiveness of our approach for growing vision transformers. Table 2 shows that finetuning results on downstream tasks are on par with the model trained from scratch, showing that LiGO does not harm the model's generalization capabilities when transferred to downstream datasets. We find similar savings for CaiT-XS→CaiT-S, where LiGO reduces FLOPs by 52.6% and wall time by 46.1% over training CaiT-S from scratch on ImageNet (see Appendix C.2 for more details).

Figure 6: Results on depth-only and width-only growth. LiGO saves 51.7% FLOPs when expanding depth only, and 41.6% FLOPs when expanding width only.

Downstream transfer learning performance on GLUE and SQuAD. All of the results are based on BERT-Base models trained using the different baselines. LiGO achieves similar or even better performance than the original training from scratch baseline on several downstream tasks, despite improving training efficiency.

Table 1 shows the per-task performance.

Results on RoBERTa and GPT2. LiGO reduces FLOPs by 47.2% and 22.5% for RoBERTa-Base and GPT2-Medium, respectively, demonstrating its effectiveness across different training strategies and architectures.

Transfer learning performance of DeiT-B. DeiT-B model trained using LiGO performs similarly to the original train-from-scratch baseline on all downstream tasks.

Effect of number of gradient steps. "+FLOPs" stands for additional FLOPs (in 10^15).

Table 3 shows that training LiGO for up to 1000 steps results in identical model convergence (reaching 1.8 PPL at 215K steps). This suggests that tuning model weights under the linear constraints of LiGO can achieve faster convergence. Training LiGO for more than 10000 steps yields a model with slightly faster convergence (214K steps), but results in less overall savings.

Downstream performance using AdapterFusion (Pfeiffer et al., 2020) on the GLUE benchmark. All of the results are based on BERT-Base models trained using different baselines. LiGO achieves on-par performance with the model trained from scratch under adapter-based tuning, with 44.7% savings in FLOPs and 40.5% savings in wall time. This shows that LiGO does not harm the model's generalization capability when adapters are used as a parameter-efficient finetuning strategy for transferring a trained model to downstream datasets.

C.5 INITIAL RESULTS ON BILLION+ PARAMETER MODELS

4.1. EXPERIMENTAL SETUP

Datasets. We follow Tan & Bansal (2020) and use the English Wikipedia corpus for training BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 dataset (Raffel et al., 2020) for training GPT2 (Radford et al., 2019), and ImageNet (Deng et al., 2009) for training vision transformers. We use GLUE (Wang et al., 2018), SQuADv1.1 (Rajpurkar et al., 2016), and SQuADv2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models. We test the downstream performance of vision transformers (DeiT (Touvron et al., 2021a)) by performing transfer learning on 5 downstream image classification tasks: CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), Stanford-Cars (Krause et al., 2013), and ChestXRay8 (Wang et al., 2017).

Models. We experiment with growing the following language and vision transformers: (1) BERT-Small→BERT-Base, BERT-Base→BERT-Large, and BERT-Small→BERT-Large; (2) RoBERTa-Small→RoBERTa-Base; (3) GPT2-Base→GPT2-Medium; (4) DeiT-S→DeiT-B; and (5) CaiT-XS→CaiT-S. BERT-Small has 6 layers with 512 hidden dimensions, while the other named models are their usual sizes. See Appendix B.2 for full details.

Baselines. We compare our approach with the following baselines: (1) training from scratch, where we train the larger transformer without using any smaller pretrained models; (2) progressive training methods designed for growing depth in transformers (StackBERT (Gong et al., 2019) and MSLT (Yang et al., 2020)); (3) bert2BERT (Chen et al., 2021), which extends Net2Net (Chen et al., 2015) for width expansion and uses stacking for depth expansion; and (4) KI (Qin et al., 2021), which uses distillation to transfer knowledge from the smaller model to the larger model.

Implementation details. We always use 100 gradient steps to learn the LiGO operator for all models, which is negligible in terms of FLOPs/wall time compared to the full training that follows initialization.
We train both BERT and RoBERTa models for 400K steps with a warmup of 10K steps. We remove the next-sentence prediction task (Liu et al., 2019) and use a fixed sequence length of 128 for pretraining.

ACKNOWLEDGMENTS

PW sincerely thanks Zhen Wang for the insightful discussion and for providing reference repositories for language model pretraining. PW also appreciates Hao Tan's assistance in reproducing fine-tuning results on the GLUE datasets. YK and LTH were partially supported by an MIT-IBM Watson AI grant and an Amazon award. We also acknowledge support from the IBM Research AI Hardware Center and the Center for Computational Innovation at Rensselaer Polytechnic Institute for the computational resources on the AiMOS supercomputer. The research of ZW is in part supported by the US Army Research Office Young Investigator Award (W911NF2010240).


* Work done during an internship at MIT-IBM Watson AI Lab. https://vita-group.github.io/LiGO/ 


Embedding layer. For both language and vision transformers, the embedding layer can be regarded as a linear layer, whose inputs are one-hot embeddings in language models. We use a learnable matrix B^(emb) to extend its output dimension.

Multi-head attention blocks. An attention layer in a transformer consists of multi-head attention weights (W^Q, W^K, W^V) and a linear projection (W^O). Let A^k_l and B^k_l with k ∈ {Q, K, V, O} be the in- and out-dimension expansion matrices (Eq. 6) for the query, key, value, and projection in the l-th layer, respectively. Applying B^k_l to W^k (k ∈ {Q, K, V}) constructs new heads by a weighted summation of the rows of all existing heads. To make sure the new input and output channels are aligned across modules, we tie our LiGO operator with the following scheme: both the bias and layer normalization inherit the associated linear transformation's out-dimension expansion matrices to grow the width. For depth expansion, each module independently combines the same module from other layers (Eq. 8) with learnable coefficients w.

Feed-forward networks. Each attention block is followed by a two-layer FFN. Let A^k_l and B^k_l with k ∈ {fc1, fc2} be the in- and out-dimension expansion matrices (Eq. 6) for the first and second FFN layers in the l-th layer, respectively. We tie the parameters for the feed-forward networks in the same way, so that their input and output dimensions remain aligned with B^(emb).

Output layer. For the output head, we have A^(out) = B^(emb)⊤, since the output dimension of the attention layers is always aligned with B^(emb) by our construction. The output layer does not need out-dimension expansion. Algorithm 1 summarizes LiGO for growing transformers: its input is a small transformer with hidden dimension D_1 and L_1 layers, and its output is a large transformer with hidden dimension D_2 and L_2 layers, whose weight matrices are denoted Ω with the corresponding superscripts. The final steps set Ω^(out) ← W^(out) B^(emb)⊤ and then train the transformer with parameters Ω. (The intermediate steps of the algorithm listing are omitted here.)
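As a rough sketch of the embedding/output tying described above (our own illustrative code; the minimal model, variable names, and identity-padded initialization are assumptions), the embedding's out-expansion B^(emb) is reused as the output head's in-expansion via A^(out) = B^(emb)⊤, so the grown output layer stays aligned with the grown hidden dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D1, D2 = 10, 4, 6   # vocab size, small and large hidden dimensions

# Small model: embedding (V x D1) and output head (V x D1).
W_emb = rng.standard_normal((V, D1))
W_out = rng.standard_normal((V, D1))

# Learnable out-dimension expansion for the embedding (identity-padded here).
B_emb = np.zeros((D2, D1)); B_emb[:D1, :D1] = np.eye(D1)

# Grow the embedding's output dimension: each new hidden unit is a
# weighted combination of the old ones.
W_emb_new = W_emb @ B_emb.T            # (V x D2)

# The output head reuses A_out = B_emb^T as its in-dimension expansion,
# i.e. Omega_out = W_out @ B_emb^T, matching the final step of Algorithm 1.
W_out_new = W_out @ B_emb.T            # (V x D2)
assert W_emb_new.shape == (V, D2) and W_out_new.shape == (V, D2)
```

This tying means no separate expansion matrix is learned for the output layer, and the residual-stream dimensions seen by the embedding and the output head grow consistently.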

C.2 RESULTS ON CAIT

In addition to DeiT (Touvron et al., 2021a), we perform additional experiments with CaiT (Touvron et al., 2021b) on ImageNet and find that when reusing CaiT-XS, LiGO offers about 52.6% savings in FLOPs and 46.1% savings in wall time over training CaiT-S from scratch (see Figure 8).

C.3 DIRECT FINETUNING WITHOUT FURTHER PRETRAINING

We perform additional experiments by directly finetuning BERT-Base initialized by LiGO (from BERT-Small) without any further pretraining. We observe in Table 5 that the LiGO-initialized model can benefit downstream tasks compared to BERT-Small trained from scratch (1st row vs 2nd row).

C.4 GLUE PERFORMANCE USING ADAPTERFUSION

LiGO is mainly proposed for improving the efficiency of the pre-training stage and hence is compatible with various finetuning schemes, such as full model finetuning, adapters (Houlsby et al., 2019; Pfeiffer et al., 2020), or prompt tuning (Lester et al., 2021; Jia et al., 2022), for adaptation to downstream tasks. We test BERT-Base models trained using the different baselines by using AdapterFusion (Pfeiffer et al., 2020) instead of full finetuning on the GLUE benchmark. Table 6 shows that LiGO also achieves on-par performance with the model trained from scratch.

