SPEEDING UP DEEP LEARNING TRAINING BY SHARING WEIGHTS AND THEN UNSHARING

Abstract

It has been widely observed that increasing deep learning model sizes often leads to significant performance improvements on a variety of natural language processing and computer vision tasks. At the same time, however, computational costs and training time increase dramatically as models grow larger. In this paper, we propose a simple approach to speed up training for a particular kind of deep network that contains repeated structures, such as the transformer module. In our method, we first train such a deep network with the weights shared across all the repeated layers up to some point. We then stop weight sharing and continue training until convergence. The untying point is determined automatically by monitoring gradient statistics. Our adaptive untying criterion is derived from a theoretical analysis of deep linear networks. Empirical results show that our method is able to reduce the training time of BERT by 50%.

1. INTRODUCTION

It has been widely observed that increasing model size often leads to significantly better performance on various real tasks, especially natural language processing and computer vision applications (Amodei et al., 2016; He et al., 2016a; Wu et al., 2016; Vaswani et al., 2017; Devlin et al., 2019; Brock et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, as models get larger, training can become extremely resource-intensive and time-consuming. As a consequence, there has been growing interest in developing systems and algorithms for efficient distributed large-batch training (Goyal et al., 2017; Shazeer et al., 2018; Lepikhin et al., 2020; You et al., 2020).

In this paper, we seek to speed up deep learning training by exploiting particular network architectures rather than by distributed training. In particular, we are interested in speeding up the training of a special kind of deep network that is constructed by repeatedly stacking the same layer, for example, the transformer module (Vaswani et al., 2017). We propose a simple method for efficiently training such networks. In our approach, we first force the weights to be shared across all the repeated layers and train the network; then, at some point, we stop weight sharing and continue training until convergence. The point at which weight sharing stops can be either predefined or chosen automatically by monitoring gradient statistics during training. Empirical studies show that our method can reduce the training time of BERT (Devlin et al., 2019) by 50%.

Our method is motivated by the successes of weight sharing models, in particular ALBERT (Lan et al., 2020), a variant of BERT in which the weights across all the transformer layers are shared. As long as its architecture is sufficiently large, ALBERT can be comparable with or even outperform the original BERT on various downstream natural language processing benchmarks.
However, when its architecture is the same as that of the original BERT, ALBERT performs significantly worse. Since the weights in the original BERT are not shared at all, it is natural to expect that ALBERT's performance will improve if we stop its weight sharing at some point during training. To make this idea work, however, we need to know when to untie the shared weights; a randomly chosen untying point will not work. We can see this from the two extreme cases: ALBERT, which shares weights all the time, and BERT, which has no weight sharing at all.

To find an effective solution for automatic weight untying, we turn to a theoretical analysis of deep linear networks (Hardt & Ma, 2017; Laurent & Brecht, 2018; Wu et al., 2019). A deep linear model is constructed by stacking a series of matrix multiplications. In its forward pass, a deep linear model is trivially equivalent to a single matrix. However, when trained with backpropagation, its behavior is analogous to that of deep models with non-linearities, yet much easier to understand. Our theoretical analysis shows that, when learning a positive definite matrix (which admits an optimal solution with all layers having the same weights), training with weight sharing can bring significantly faster convergence. More importantly, our theoretical analysis leads to the adaptive weight untying rule that we need to construct our algorithm (see Algorithm 2). Empirical studies on real tasks show that our adaptive untying method can be at least as effective as using the best untying point, which is obtained by running multiple experiments, each with a different point at which to untie the weights, and then choosing the best result.

The rest of this paper is organized as follows. We present our weight sharing algorithm in Section 2. It contains three versions, depending on how weight sharing is stopped during training. In Section 3, we present our theoretical results for positive definite deep linear models.
All the proofs are deferred to the Appendix. In Section 4, we discuss related work. In Section 5, we show detailed experimental setups and results. We also provide various ablation studies on different choices in implementing our algorithm. Finally, we conclude this paper with discussions in Section 6.
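Before proceeding, the forward-pass equivalence of deep linear models mentioned above can be made concrete with a minimal NumPy sketch (layer count, dimension, and data below are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3  # n stacked linear layers, each a d x d matrix (illustrative sizes)

# A deep linear model: y = W_n ... W_2 W_1 x.
weights = [rng.standard_normal((d, d)) for _ in range(n)]

def forward(x, ws):
    """Apply the stacked linear layers to input x, first layer first."""
    for w in ws:
        x = w @ x
    return x

x = rng.standard_normal(d)

# In the forward pass the stack is trivially equivalent to a single matrix:
# the ordered product of the layer weights.
collapsed = np.linalg.multi_dot(weights[::-1])
assert np.allclose(forward(x, weights), collapsed @ x)
```

The interesting behavior thus lies entirely in the backward pass, where the factored parameterization changes the optimization dynamics even though the represented function is a single matrix.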

2. ALGORITHM: SHARING WEIGHTS AND THEN UNSHARING

Assume we have a deep network obtained by repeatedly stacking the same neural module n times, such as the transformer module in transformer models (Vaswani et al., 2017). Denote by w_1, . . . , w_n the weights of these n layers. In our method, we first train the deep network with all the weights tied. Then, after a certain number of training steps, we untie the weights and further train the network until convergence. In what follows, we first present a simple version of our algorithm in which the weight untying point is predefined. Then, we move to its adaptive version, in which the layers are automatically and gradually untied according to gradient statistics. Finally, we discuss a simplified variant of this adaptive method which unties all the layers at once.

Untying weights at a fixed point. This is the simplest version of our method (Algorithm 1). We first train the deep network with all the weights tied for a fixed number of steps, and then untie the weights and continue training until convergence.

Algorithm 1 SHARING WEIGHTS AND THEN UNTYING AT A FIXED POINT
1: Input: total number of training steps T, untying point τ, learning rates {α^(t), t = 1, . . . , T}
2: Randomly and equally initialize weights w_1^(0) = w_2^(0) = · · · = w_n^(0)
3: // Training with the weights tied
4: for t = 1, . . . , τ do
5:     w_i^(t) = w_i^(t-1) − α^(t) × mean( grad(loss, w_k^(t-1)), k = 1, . . . , n ),  i = 1, . . . , n
6: for t = τ + 1, . . . , T do  // untie the weights and continue training
7:     w_i^(t) = w_i^(t-1) − α^(t) × grad(loss, w_i^(t-1)),  i = 1, . . . , n

Note that, from lines 1 to 5, we initialize all the weights equally and then update them using the mean of their gradients. It is easy to see that such an update is equivalent to weight sharing or tying. For the sake of simplicity, in lines 5 and 7, we only show how to update the weights using the plain (stochastic) gradient descent rule. One can replace this plain update rule with any favorite optimization method, for example, the Adam optimization algorithm (Kingma & Ba, 2015). While the repeated layers are the most natural units for weight sharing, they are not the only choice.
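The two update rules of Algorithm 1 can be sketched on a toy model. The snippet below uses a hypothetical model of n stacked scalar "layers" (not the transformer setting), with the loss, data, and hyperparameters chosen only for illustration; it applies the mean-gradient (tied) update for the first τ steps and per-layer (untied) updates afterwards:

```python
import numpy as np

# Toy model: prediction = (product of n scalar layer weights) * x,
# loss = (prediction - y)^2. Hyperparameters are illustrative.
n, T, tau, lr = 4, 200, 100, 0.01
x, y = 1.0, 2.0  # a single hypothetical training example

w = np.full(n, 1.2)  # equal initialization, as required for tying (line 2)

def grads(w):
    """Per-layer gradients d loss / d w_i of the squared-error loss."""
    pred = np.prod(w) * x
    return np.array([2 * (pred - y) * x * np.prod(w) / w[i] for i in range(n)])

for t in range(T):
    g = grads(w)
    if t < tau:
        w -= lr * g.mean()  # tied update (line 5): every layer gets the mean gradient
    else:
        w -= lr * g         # untied update (line 7): each layer uses its own gradient

# In this symmetric toy the two phases happen to coincide once the weights are
# equal; the point is only to show the two update rules side by side.
print(abs(np.prod(w) * x - y))  # residual is small after training
```

Swapping the plain gradient step for Adam, as the text notes, only changes the two update lines.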
We may instead view several layers together as a weight sharing unit and share the weights across those units; the layers within the same unit can have different weights. For example, for a 24-layer transformer model, we may combine every four layers into a weight sharing unit, giving six such units for weight sharing. Such flexibility in choosing weight sharing units allows for a balance between "full weight sharing" and "no weight sharing" at all.

Adaptive weight untying. The theoretical analysis in Section 3 motivates us to adaptively and gradually untie weights based on the gradient correlation of adjacent layers (Algorithm 2). To implement this idea, all layers are put in the same group during initialization. Then, at any time step of the training, suppose we have groups G = {G_1, G_2, ..., G_k}. For each group G_i, we compute the correlation
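The grouping step of this adaptive scheme can be sketched as follows. The exact correlation statistic is defined by the algorithm's formula; the sketch below assumes, purely as a placeholder, that it behaves like the cosine similarity between the flattened gradients of adjacent layers, and the threshold value is likewise a hypothetical choice:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened gradient vectors
    (an assumed stand-in for the paper's correlation statistic)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def maybe_split(group, grads, threshold=0.5):
    """Split a group of tied layers wherever the adjacent-layer gradient
    correlation falls below `threshold` (threshold is an assumption)."""
    splits, current = [], [group[0]]
    for prev, layer in zip(group, group[1:]):
        if cosine(grads[prev], grads[layer]) < threshold:
            splits.append(current)
            current = []
        current.append(layer)
    splits.append(current)
    return splits

# Example: 4 tied layers; the gradients of layers 0-1 and of layers 2-3 are
# well aligned, but layers 1 and 2 point in different directions, so the
# group splits between them.
g = {0: np.array([1.0, 0.0]), 1: np.array([0.9, 0.1]),
     2: np.array([0.0, 1.0]), 3: np.array([0.1, 0.9])}
print(maybe_split([0, 1, 2, 3], g))  # → [[0, 1], [2, 3]]
```

Layers that remain in the same group continue to receive the mean-gradient (tied) update; once a group splits, its sub-groups are updated independently.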