LEARNING TO GROW PRETRAINED MODELS FOR EFFICIENT TRANSFORMER TRAINING

Abstract

Scaling transformers has led to significant breakthroughs in many domains, leading to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.

1. INTRODUCTION

The transformer architecture (Vaswani et al., 2017) has emerged as a general purpose architecture for modeling many structured domains (Devlin et al., 2019; Brown et al., 2020; Rives et al., 2021; Dosovitskiy et al., 2021; Touvron et al., 2021a). Perhaps more so than other architectures, the transformer empirically seems to have inductive biases that make it especially amenable to scaling (Rosenfeld et al., 2019; Kaplan et al., 2020), which has led to a paradigm in which larger versions of smaller, existing models are trained and released on a periodic basis (e.g., the GPT lineage of models (Radford et al., 2018; 2019; Brown et al., 2020)). New instances of such models are typically trained completely from scratch, despite the fact that they are often scaled-up versions of their smaller counterparts. Given the compute required to train even the smaller models, we argue that training each model from scratch is wasteful, and that prior knowledge implicit in the parameters of smaller pretrained models should be leveraged to enable faster training of larger models. One approach to this problem is through the lens of model growth, wherein a smaller model's pretrained parameters are used to initialize a subset of the larger model's parameters. While earlier works generally froze the parameters initialized from the pretrained model and only trained the new (randomly initialized) parameters (Fahlman & Lebiere, 1989; Fahlman, 1990; Gutstein et al., 2008), subsequent work has shown that copying a subset of the pretrained parameters to initialize the new parameters and then finetuning the entire network significantly accelerates training and sometimes even leads to better performance (Chen et al., 2015).
When applied to modern transformers, these mechanisms roughly translate to a depth-expansion operator in which pretrained models are stacked (or combined with identity layers) to initialize deeper transformers (Gong et al., 2019; Yang et al., 2020), and a width-expansion operator in which the smaller model's matrices are copied to initialize the larger model's matrices (e.g., in block-diagonal fashion) (Chen et al., 2021; Gu et al., 2020).

Noting the empirical effectiveness of such recipes, we observe that existing mechanisms generally do not have a learning component (e.g., randomly copying over neurons for width-expansion or stacking consecutive layers for depth-expansion). This paper instead proposes an efficient, data-driven approach for learning to grow transformers. In particular, our approach frames the problem of initializing the larger model's parameters as learning a linear mapping from the smaller model's parameters, i.e., Θ(large) = M Θ(small), where Θ(small) and Θ(large) are the vectorized parameters of the small and large models. Due to the high dimensionality of the parameters, this mapping is completely intractable to learn without any restrictions on M.
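To make the scale concrete, a back-of-the-envelope count (with hypothetical, order-of-magnitude parameter sizes, not figures from the paper) shows why an unrestricted dense M cannot be learned directly:

```python
# Hypothetical, order-of-magnitude sizes: growing a model roughly half
# the size of BERT-Base (~50M parameters) into BERT-Base (~110M).
d_small = 50_000_000   # dimension of vectorized Theta(small)
d_large = 110_000_000  # dimension of vectorized Theta(large)

# A dense linear map M in R^{d_large x d_small} would itself contain
# d_large * d_small entries -- vastly more parameters than either model,
# which is why LiGO restricts M to a sparse, factorized form.
dense_entries = d_large * d_small
print(f"{dense_entries:.1e}")  # 5.5e+15
```

Even at these modest sizes, the dense map would have about 5.5 quadrillion entries, motivating the structured factorization described next.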
We thus factorize the linear mapping to be a composition of sparse width- and depth-expansion operators, M = L_depth R_width, where both width and depth matrices are further factorized to be a Kronecker product of smaller matrices that express architectural knowledge (e.g., through grouping parameters by layers and neurons). We show that our growth operators can represent existing approaches such as layer-stacking and neuron-copying as special cases. We find that with a small amount of learning on M (e.g., 100 gradient steps) to initialize the larger model, we can significantly accelerate training of both vision and language transformers. Figure 1 illustrates our approach. We apply our learned linear growth operator (LiGO) to popular model families, including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), and ViT (Dosovitskiy et al., 2021; Touvron et al., 2021a;b), and find that LiGO can consistently improve transformer training efficiency over the traditional way of training from scratch across domains and model sizes. For instance, LiGO saves 44.7% and 22.5% of the FLOPs for training BERT-Base and GPT2-Medium from scratch by reusing pretrained smaller models that are half as big. Similarly, for vision transformers, when using DeiT-S (Touvron et al., 2021a) for initialization, LiGO yields 55% savings in FLOPs with no performance drop on ImageNet (Deng et al., 2009). These FLOPs savings directly translate to similar wall clock savings. We further find that models trained using LiGO achieve similar performance to the trained-from-scratch baselines when transferred to downstream tasks.
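As a toy illustration of how a Kronecker-structured width operator acts on a single weight matrix (a sketch with made-up sizes, not the paper's full parameterization), the expansion can be written as a small matrix A applied on both sides of the weights; choosing A to duplicate neurons recovers the copying heuristic as a special case, whereas LiGO learns A from data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: grow one fully connected layer from width 4 to width 8.
d_s, d_l = 4, 8
W_small = rng.standard_normal((d_s, d_s))

# Expansion matrix A maps small-model neurons to large-model neurons.
# This particular A copies neuron (i mod d_s) into slot i, i.e. the
# neuron-copying heuristic as a special case; LiGO learns A instead.
A = np.zeros((d_l, d_s))
for i in range(d_l):
    A[i, i % d_s] = 1.0

# Expanding both the output and input dimensions of the layer:
#   W_large = A @ W_small @ A.T
# which, on row-major-vectorized weights, is exactly the Kronecker-
# structured linear map kron(A, A) -- one structured block of M.
W_large = A @ W_small @ A.T
vec_large = np.kron(A, A) @ W_small.reshape(-1)

assert np.allclose(vec_large, W_large.reshape(-1))
print(W_large.shape)  # (8, 8)
```

Because M is built from such Kronecker blocks, only the small factor matrices (here A) need to be learned, rather than the full vectorized map.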

2. RELATED WORK

Efficient training. Efficient training of transformers has been studied from multiple perspectives. Some methods that are orthogonal to our work include mixed precision training (Shoeybi et al., 2019), large batch optimization (You et al., 2019), distributed training (Huang et al., 2019), and dropping layers (Zhang & He, 2020) or tokens (Hou et al., 2022). Knowledge inheritance (Qin et al., 2021) explores knowledge distillation during pretraining to efficiently learn larger transformers. Progressive training, which first trains a small transformer with few layers and then gradually expands by stacking layers, has also been applied to accelerate transformer training (Gong et al., 2019; Yang et al., 2020; Li et al., 2022; Shen et al., 2022). Net2Net (Chen et al., 2015) uses function-preserving transformations to grow width by copying neurons and depth by using identity layers. Recently, bert2BERT (Chen et al., 2021) extends Net2Net to transformers. In contrast to these approaches, our approach learns to (linearly) transform the parameters of a smaller model to initialize a



Figure 1: Our linear growth operator (LiGO) accelerates training by using the weights of a smaller model Θ(old) to initialize the weights of the larger model Θ(new). LiGO is parameterized as a sparse linear map M that can be decomposed into width- and depth-expansion operators. The width-operator R_width and depth-operator L_depth are structured matrices obtained from Kronecker products of smaller matrices which encode architectural knowledge by grouping parameters into layers and neurons. While we show the expansion operators for simple multi-layer perceptrons for illustrative purposes, in practice we apply LiGO to enable faster training of transformer networks. In our approach, we learn the growth matrix M with 100 steps of SGD, use this to initialize the larger model, and then continue training as usual. Best viewed in color.


* Work done during an internship at MIT-IBM Watson AI Lab. https://vita-group.github.io/LiGO/ 

