SPEEDING UP DEEP LEARNING TRAINING BY SHARING WEIGHTS AND THEN UNSHARING

Abstract

It has been widely observed that increasing deep learning model sizes often leads to significant performance improvements on a variety of natural language processing and computer vision tasks. At the same time, however, computational costs and training time increase dramatically as models grow larger. In this paper, we propose a simple approach to speed up training for deep networks that contain repeated structures, such as the transformer module. In our method, we first train such a network with the weights shared across all the repeated layers, up to some point. We then stop weight sharing and continue training until convergence. The untying point is determined automatically by monitoring gradient statistics. Our adaptive untying criterion is derived from a theoretical analysis of deep linear networks. Empirical results show that our method is able to reduce the training time of BERT by 50%.

1. INTRODUCTION

It has been widely observed that increasing model size often leads to significantly better performance on various real tasks, especially natural language processing and computer vision applications (Amodei et al., 2016; He et al., 2016a; Wu et al., 2016; Vaswani et al., 2017; Devlin et al., 2019; Brock et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, as models get larger, training can become extremely resource-intensive and time-consuming. As a consequence, there has been growing interest in developing systems and algorithms for efficient distributed large-batch training (Goyal et al., 2017; Shazeer et al., 2018; Lepikhin et al., 2020; You et al., 2020).

In this paper, we seek to speed up deep learning training by exploiting particular network architectures rather than by distributed training. Specifically, we are interested in speeding up the training of deep networks constructed by repeatedly stacking the same layer, for example, the transformer module (Vaswani et al., 2017). We propose a simple method for efficiently training such networks: we first force the weights to be shared across all the repeated layers and train the network; then, at some point, we stop weight sharing and continue training until convergence. The point at which to stop weight sharing can be either predefined or chosen automatically by monitoring gradient statistics during training. Empirical studies show that our method can reduce the training time of BERT (Devlin et al., 2019) by 50%.

Our method is motivated by the successes of weight sharing models, in particular ALBERT (Lan et al., 2020), a variant of BERT in which the weights are shared across all the transformer layers. As long as its architecture is sufficiently large, ALBERT is comparable with or even outperforms the original BERT on various downstream natural language processing benchmarks.
However, with the same architecture as the original BERT, ALBERT performs significantly worse. Since the weights in the original BERT are not shared at all, it is natural to expect that ALBERT's performance would improve if we stopped its weight sharing at some point during training. To make this idea work, however, we need to know when to untie the shared weights; a randomly chosen untying point will not work, as the two extreme cases show: ALBERT, which shares weights all the time, and BERT, which has no weight sharing at all.

To find an effective rule for automatic weight untying, we turn to a theoretical analysis of deep linear networks (Hardt & Ma, 2017; Laurent & Brecht, 2018; Wu et al., 2019). A deep linear model is constructed by stacking a series of matrix multiplications. In its forward pass, a deep linear model is trivially equivalent to a single matrix; when trained with backpropagation, however, its behavior is analogous to that of deep models with non-linearities, while being much easier to understand. Our theoretical analysis shows that, when learning a positive definite matrix (which admits an optimal solution with all layers having the same weights), training with weight sharing brings significantly faster convergence. More importantly, the analysis leads to the adaptive weight untying rule underlying our algorithm (see Algorithm 2). Empirical studies on real tasks show that our adaptive untying method is at least as effective as using the best fixed untying point, where the latter is found by running multiple experiments, each untying the weights at a different point, and choosing the best result.

The rest of this paper is organized as follows. We present our weight sharing algorithm, which comes in three versions depending on how weight sharing is stopped during training, in Section 2. In Section 3, we present our theoretical results for positive definite deep linear models.
All the proofs are deferred to the Appendix. In Section 4, we discuss related work. In Section 5, we show detailed experimental setups and results. We also provide various ablation studies on different choices in implementing our algorithm. Finally, we conclude this paper with discussions in Section 6.

2. ALGORITHM: SHARING WEIGHTS AND THEN UNSHARING

Assume we have a deep network obtained by repeatedly stacking the same neural module n times, such as the transformer module in transformer models (Vaswani et al., 2017). Denote by w_1, ..., w_n the weights of these n layers. In our method, we first train the deep network with all the weights tied. Then, after a certain number of training steps, we untie the weights and continue training until convergence. In what follows, we first present a simple version of our algorithm in which the weight untying point is predefined. We then move to its adaptive version, in which the layers are gradually and automatically untied according to gradient statistics. Finally, we discuss a simplified variant of this adaptive method which unties all the layers at once.

Untying weights at a fixed point. This is the simplest version of our method (Algorithm 1). We first train the deep network with all the weights tied for a fixed number of steps, and then untie the weights and continue training until convergence.

Algorithm 1 SHARING WEIGHTS AND THEN UNTYING AT A FIXED POINT
1: Input: total number of training steps T, untying point τ, learning rates {α^(t), t = 1, ..., T}
2: Randomly and equally initialize weights w_1^(0) = ... = w_n^(0)
3: for t = 1 to T do
4:   if t < τ then
5:     w_i^(t) = w_i^(t-1) - α^(t) · mean(grad(loss, w_k^(t-1)), k = 1, ..., n), i = 1, ..., n
6:   else
7:     w_i^(t) = w_i^(t-1) - α^(t) · grad(loss, w_i^(t-1)), i = 1, ..., n

Note that, in lines 2 and 5, we initialize all the weights equally and then update them using the mean of their gradients. It is easy to see that such an update is equivalent to weight sharing or tying. For the sake of simplicity, in lines 5 and 7, we only show how to update the weights using the plain (stochastic) gradient descent rule. One can replace this plain update rule with any preferred optimization method, for example, the Adam algorithm (Kingma & Ba, 2015).
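As an illustration, Algorithm 1 can be sketched on a toy deep linear model (the setting analyzed in Section 3). This is our own minimal sketch, not the paper's BERT implementation; the function names, dimensions, step counts, and learning rate below are illustrative choices.

```python
import numpy as np

def layer_grads(Ws, Phi):
    """Per-layer gradients of R = 0.5 * ||W_L ... W_1 - Phi||_F^2."""
    d = Ws[0].shape[0]
    P = np.eye(d)
    for W in Ws:                     # Ws[0] is W_1; build the product W_L ... W_1
        P = W @ P
    E = P - Phi
    gs = []
    for l in range(len(Ws)):
        before = np.eye(d)           # W_{l-1:1}
        for W in Ws[:l]:
            before = W @ before
        after = np.eye(d)            # W_{L:l+1}
        for W in Ws[l + 1:]:
            after = W @ after
        gs.append(after.T @ E @ before.T)
    return gs

def train_swe_fixed(Phi, L=8, T=200, tau=60, eta=0.01):
    """Algorithm 1: tied updates (gradient mean) for t < tau, untied afterwards."""
    d = Phi.shape[0]
    Ws = [np.eye(d) for _ in range(L)]            # equal initialization
    for t in range(T):
        gs = layer_grads(Ws, Phi)
        if t < tau:                                # weight sharing phase
            g_mean = sum(gs) / L
            Ws = [W - eta * g_mean for W in Ws]
        else:                                      # untied phase
            Ws = [W - eta * g for W, g in zip(Ws, gs)]
    P = np.eye(d)
    for W in Ws:
        P = W @ P
    return 0.5 * np.linalg.norm(P - Phi) ** 2

Phi = np.diag([1.5, 2.0, 2.5, 3.0])               # positive definite target
print(train_swe_fixed(Phi))                       # final loss, near zero
```

Replacing the per-layer gradient step with the mean of the gradients is exactly what keeps the equally initialized weights identical throughout the tied phase.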
While the repeated layers are the most natural units for weight sharing, they are not the only choice. We may view several layers together as a weight sharing unit and share the weights across those units; the layers within the same unit can have different weights. For example, for a 24-layer transformer model, we may combine every four layers into a weight sharing unit, which yields six such units. This flexibility in choosing weight sharing units allows for a balance between full weight sharing and no weight sharing at all.

Adaptive weight untying. The theoretical analysis in Section 3 motivates us to adaptively and gradually untie weights based on the gradient correlation of adjacent layers (Algorithm 2). To implement this idea, all layers are put in the same group at initialization. Then, at any time step of training, suppose we have groups G = {G_1, G_2, ..., G_k}. For each group G_i, we compute the correlation between the gradients of any two adjacent layers by

correlation_j^(i) = ⟨g_j^(i), g_{j+1}^(i)⟩ / (‖g_j^(i)‖ ‖g_{j+1}^(i)‖),    (1)

where g_j^(i) and g_{j+1}^(i) denote the gradients of two adjacent layers in G_i. Once correlation_j^(i) falls below a threshold ρ (0.5 by default) consecutively for a certain number of times (3 by default), we split the group G_i into two subgroups at the j-th layer of G_i. The layers in the same group are updated using the gradient mean of this group, as in Algorithm 1.

Algorithm 2 ADAPTIVE WEIGHT UNTYING
1: Input: total number of training steps T, threshold ρ, learning rates {α^(t), t = 1, ..., T}
2: All layers are put in the same group and randomly initialized with the same weights
3: for t = 1 to T do
4:   for adjacent layers i, i+1 in the same group do
5:     Calculate the gradient correlation as in Equation 1   // motivated by Section 3
6:     If the correlation falls below ρ, break the group between layers i and i+1
7:   Update the weights in each group using the average of the gradients within that group

Simplified adaptive untying. To simplify the multi-step adaptive untying, instead of gradually splitting the groups, we may untie the weights of all layers at once. Every certain number of iteration steps, we monitor the gradient correlations as described for multi-step adaptive untying. If more than half of the adjacent-layer correlations are below the predefined threshold ρ (0.5 by default) consecutively for a certain number of times (3 by default), we stop sharing weights for all layers.
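The group-splitting rule of Algorithm 2 can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: `grads` is assumed to be a list of per-layer gradient arrays, and the patience bookkeeping via `low_counts` is our own choice.

```python
import numpy as np

RHO, PATIENCE = 0.5, 3  # defaults from the paper

def grad_correlation(g_a, g_b):
    """Cosine similarity between two layers' flattened gradients (Equation 1)."""
    a, b = g_a.ravel(), g_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def update_groups(groups, grads, low_counts):
    """Split a group between layers i and i+1 once their gradient correlation
    stays below RHO for PATIENCE consecutive checks."""
    new_groups = []
    for group in groups:             # each group is a list of layer indices
        start = 0
        for j in range(len(group) - 1):
            i = group[j]
            c = grad_correlation(grads[i], grads[i + 1])
            low_counts[i] = low_counts.get(i, 0) + 1 if c < RHO else 0
            if low_counts[i] >= PATIENCE:          # break the tie here
                new_groups.append(group[start:j + 1])
                start = j + 1
                low_counts[i] = 0
        new_groups.append(group[start:])
    return new_groups
```

Layers within each returned group would still be updated with that group's gradient mean, as in Algorithm 1; only the grouping itself changes over time.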

3. THEORETIC MOTIVATION

In this section, we show how our method, including the adaptive untying criterion, can be motivated by analyzing the dynamics of training a deep linear network by gradient descent. At first glance, deep linear models may look trivial, since a deep linear model is equivalent to a single matrix. However, when trained with backpropagation, their behavior is analogous to that of generic deep models. Thus, analyzing deep linear models has attracted increasing interest in the theoretical research community (Hardt & Ma, 2017; Laurent & Brecht, 2018; Wu et al., 2019).

A deep linear network is a series of matrix multiplications

f(x; W_1, ..., W_L) = W_L W_{L-1} ... W_1 x,   W_l ∈ R^{d×d}, l = 1, ..., L.

The task is to train the deep linear network to learn a target matrix Φ ∈ R^{d×d}. To focus on the training dynamics, we adopt the simplified objective function

R(W_1, ..., W_L) = (1/2) ‖W_L W_{L-1} ... W_2 W_1 - Φ‖_F².

Denote by ∇_l R the gradient of R with respect to W_l. We have

∇_l R = ∂R/∂W_l = W_{L:l+1}^T (W_{L:1} - Φ) W_{l-1:1}^T,   where W_{l2:l1} = W_{l2} W_{l2-1} ... W_{l1+1} W_{l1}.

The standard gradient update is

W_l(t+1) = W_l(t) - η ∇_l R(t),   l = 1, ..., L.

To train with weights shared, all the layers need to have the same initialization, and the update is

W_l(t+1) = W_l(t) - (η/L) Σ_{i=1}^L ∇_i R(t),   l = 1, ..., L.    (2)

Since the initialization and the updates are the same for all layers, the parameters W_1(t), ..., W_L(t) are equal for all t; for simplicity, we denote the shared weight at time t by W_0(t). Notice that, because the gradients are averaged, the norm of the update to each layer does not scale with L.

Suppose the target matrix Φ is positive definite. It is immediate that W_l = Φ^{1/L} for all l is a solution for the deep linear network. Before looking into the detailed convergence analysis, it is worth first stating a lemma that reveals the updates under weight-shared training.

Lemma 1.
With a positive definite target matrix Φ, initializing W_0(0) = I and updating the parameters according to Equation 2, we have

W_l(t+1) - W_l(t) = -η W_0^{L-1}(t) (W_0^L(t) - Φ),   l = 1, ..., L, ∀t ≥ 0.

Intuitively, Lemma 1 shows that weight sharing allows all the layers to be trained "equally well": the layers far away from the output layer do not suffer from vanishing or exploding gradients. In the following subsections, we first study the convergence of continuous-time gradient descent, which demonstrates the benefit of training with weight sharing when learning a positive definite matrix Φ; we then extend the results to discrete-time gradient descent. Throughout, we draw a comparison with training under the zero-asymmetric (ZAS) initialization (Wu et al., 2019).

3.1. CONTINUOUS-TIME GRADIENT DESCENT

With η → 0, the training dynamics of continuous-time gradient descent are described by

dW_l(t)/dt = Ẇ_l(t) = -∇_l R(t),   l = 1, ..., L, t ≥ 0.

The ZAS initialization sets W_1 = W_2 = ... = W_{L-1} = I and W_L = 0. It helps avoid saddle points and has the following convergence guarantee.

Theorem 1. [Continuous-time gradient descent without weight sharing (Wu et al., 2019)] For the deep linear network f(x; W_1, ..., W_L) = W_L W_{L-1} ... W_1 x, continuous-time gradient descent with the zero-asymmetric initialization satisfies R(t) ≤ exp(-2t) R(0).

Theorem 1 shows that, with the zero-asymmetric initialization, continuous-time gradient descent converges linearly to the globally optimal solution for a general target matrix Φ. Next, consider the special case where the goal is to learn a positive definite matrix Φ. Based on Lemma 1, we have the following convergence result for training with weight sharing.

Theorem 2. [Continuous-time gradient descent with weight sharing] For the deep linear network f(x; W_1, ..., W_L) = W_L W_{L-1} ... W_1 x, initialize all W_l(0) with the identity matrix I and update according to Equation 2.
With a positive definite target matrix Φ, continuous-time gradient descent satisfies

R(t) ≤ exp(-2L min(1, λ_min²(Φ)) t) R(0).

Remark 1. The difference between the convergence rates in Theorem 1 and Theorem 2 is not an artifact of the analysis. For example, when the target matrix is simply Φ = αI with α > 1, it can be shown explicitly that with the initialization of Theorem 1 we have Ṙ(0) = -2R(0), while training with weight sharing (Theorem 2) gives Ṙ(0) = -2L R(0). This implies that the convergence results in Theorem 1 and Theorem 2 cannot be improved in general. The extra factor of L in the exponent leads to faster convergence.

The key step in showing the acceleration is

dR(t)/dt = Σ_{l=1}^L tr(∇_l R(t)^T Ẇ_l(t)) ≤ -2L λ_min²(W_0(t)^{L-1}) R(t),

where the factor L comes from the summation. This sheds light on two important factors: 1. All layers need to have sufficiently large updates (i.e., Ẇ_l(t) is large for all l). 2. Each layer's update needs to correlate well with its gradient (i.e., ∇_l R(t) correlates with Ẇ_l(t)). Initializing the weights to be the same and using the average of the gradients guarantees that all layers are sufficiently trained; the high correlation of ∇_l R(t) and Ẇ_l(t) relies on Φ being positive definite.

Suppose instead that the gradients of different layers do not correlate well (e.g., tr(∇_i R(t)^T ∇_j R(t)) ≈ 0 for i ≠ j) while the weights are still forced to be shared via the updates of Equation 2. Recalling that Ẇ_l(t) = -(1/L) Σ_{i=1}^L ∇_i R(t), we then have

Σ_{l=1}^L tr(∇_l R(t)^T Ẇ_l(t)) ≈ -(1/L) Σ_{l=1}^L ‖∇_l R(t)‖_F²,

which loses the extra factor of L in the convergence rate because of the leading 1/L. When dealing with real deep learning models, there is no guarantee that the gradients at different layers correlate highly. Thus, we may monitor gradient correlations during training: share weights while the gradients correlate well, and break the ties when the gradient correlations fall below a certain threshold. This matches the adaptive untying rule proposed in Section 2.
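Lemma 1 and the weight-shared dynamics can be checked numerically. The snippet below is a small sanity check of our own (not a proof): it verifies at every step that the tied update equals -η W_0^{L-1}(t)(W_0^L(t) - Φ) and that the loss is driven to zero for a randomly drawn positive definite target.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, eta = 3, 6, 1e-3
A = rng.normal(size=(d, d))
Phi = A @ A.T + np.eye(d)                       # positive definite target

def general_grad(Ws, l, Phi):
    """∇_l R = W_{L:l+1}^T (W_{L:1} - Φ) W_{l-1:1}^T, with no sharing assumed."""
    k = Ws[0].shape[0]
    before, after = np.eye(k), np.eye(k)
    for W in Ws[:l]:                            # W_{l-1:1}
        before = W @ before
    for W in Ws[l + 1:]:                        # W_{L:l+1}
        after = W @ after
    full = after @ Ws[l] @ before               # W_{L:1}
    return after.T @ (full - Phi) @ before.T

W0 = np.eye(d)                                  # shared weight, W_l(0) = I
for t in range(2000):
    gs = [general_grad([W0] * L, l, Phi) for l in range(L)]
    closed = np.linalg.matrix_power(W0, L - 1) @ (np.linalg.matrix_power(W0, L) - Phi)
    assert np.allclose(sum(gs) / L, closed)     # Lemma 1: tied update in closed form
    W0 = W0 - eta * closed                      # Equation 2

R = 0.5 * np.linalg.norm(np.linalg.matrix_power(W0, L) - Phi) ** 2
print(R)                                        # essentially zero: W_0^L ≈ Φ
```

The closed-form check relies on W_0(t) remaining a polynomial in Φ (hence commuting with it), which is exactly the induction used in the proof of Lemma 1.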

3.2. DISCRETE-TIME GRADIENT DESCENT

Here we extend the previous results to discrete-time gradient descent with a positive constant step size η. It can be shown that, with the zero-asymmetric initialization, training with gradient descent achieves R(t) ≤ ε within O(L³ log(1/ε)) steps, whereas initializing and training with weight sharing, the deep linear network learns a positive definite matrix Φ to R(t) ≤ ε within O(L log(1/ε)) steps, reducing the required number of iterations by a factor of L². For easy comparison, we first restate, without proof, the discrete-time convergence result for ZAS.

Theorem 3. [Discrete-time gradient descent without weight sharing (Wu et al., 2019)] For the deep linear network f(x; W_1, ..., W_L) = W_L W_{L-1} ... W_1 x with the zero-asymmetric initialization and discrete-time gradient descent, if the learning rate satisfies η ≤ min((4L³ξ⁶)⁻¹, (144L²ξ⁴)⁻¹), where ξ = max(2‖Φ‖_F, 3L^{-1/2}, 1), then we have linear convergence R(t) ≤ (1 - η/2)^t R(0).

Since the learning rate is η = O(L⁻³), Theorem 3 indicates that gradient descent achieves R(t) ≤ ε within O(L³ log(1/ε)) steps. In the special case of learning a positive definite matrix Φ, initializing all weights W_l to be the same and training with weight sharing, we have the convergence result stated in Theorem 4.

4. RELATED WORK

Lan et al. (2020) propose ALBERT, with the weights shared across all its transformer layers; large ALBERT models achieve good performance on several natural language understanding benchmarks. Bai et al. (2019b) propose trellis networks, which are temporal convolutional networks with shared weights, and obtain good results for language modeling. This line of work was then extended to deep equilibrium models (Bai et al., 2019a), which are equivalent to infinite-depth weight-tied feedforward networks. Dabre & Fujita (2019) show that the translation quality of a model that recurrently stacks a single layer is comparable to that of a model with the same number of separate layers. Zhang et al.
(2020) also demonstrate the application of weight sharing in neural architecture search.

Deep linear models have been widely studied for their simplicity and their similarity to generic deep learning models. Baldi & Hornik (1989) show that all local minima are also global minima for two-layer linear networks, and Laurent & Brecht (2018) extend the same result to deep linear networks. Hardt & Ma (2017) show that the PL condition is satisfied in the neighborhood of a global optimum. Shamir (2019) shows that, for one-dimensional deep linear networks with the Xavier or near-identity initialization, at least exp(Ω(L)) steps are required to converge, where L is the depth. Wu et al. (2019) show that this can be improved to O(L³ log(1/ε)) with the special zero-asymmetric initialization.

5. EXPERIMENTS

In this section, we present the experimental setup and results for training the BERT Large model with the standard training procedure from the literature as well as with our Sharing WEights (SWE) method. In what follows, unless stated otherwise, BERT always means the BERT Large model.

5.1. EXPERIMENTAL SETUP

We use the TensorFlow official implementation of BERT (team & contributors). We first show experimental results with English Wikipedia and BookCorpus for pre-training, as in the original BERT paper (Devlin et al., 2019); we then move to the enlarged XLNet pre-training dataset (Yang et al., 2019). We preprocess all datasets with WordPiece tokenization (Schuster & Nakajima, 2012) and mask 15% of the tokens in each sequence. For experiments on English Wikipedia and BookCorpus, we randomly choose tokens to mask. For experiments on the XLNet dataset, we do whole-word masking: if a word is broken into multiple tokens, either all of its tokens are masked or none is. For all experiments, we set both the batch size and the sequence length to 512. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with weight decay rate 0.01, β_1 = 0.9, and β_2 = 0.999.

For English Wikipedia and BookCorpus, we use Pre-LN (He et al., 2016b; Wang et al., 2019b) instead of the original BERT's Post-LN. Note that the correct implementation of Pre-LN contains a final layer norm right before the final classification/masked language modeling layer. Unlike the claim made by Xiong et al. (2020), we observe that using Pre-LN with learning rate warmup leads to better performance. In our implementation, the learning rate starts from 0.0, increases linearly to the peak value of 3 × 10⁻⁴ (the learning rate used by Xiong et al. (2020)) at the 10k-th iteration, and then decays linearly to 0.0. For the XLNet dataset, we apply the same Pre-LN setup except that the peak learning rate is chosen to be 2 × 10⁻⁴; a peak of 3 × 10⁻⁴ makes training unstable on this dataset and yields worse performance than 2 × 10⁻⁴. After pre-training, we fine-tune the models on the Stanford Question Answering Dataset (SQuAD v1.1 and v2.0) (Rajpurkar et al., 2016) and the GLUE benchmark (Wang et al., 2019a).
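The pre-training schedule described above (linear warmup from 0 to the peak at step 10k, then linear decay to 0 at the final step) can be written as a small helper; the function name and signature are our own, and the defaults reflect the baseline's 1-million-step run.

```python
def lr_schedule(step, peak=3e-4, warmup=10_000, total=1_000_000):
    """Linear warmup to `peak` at step `warmup`, then linear decay to 0 at `total`."""
    if step <= warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)

# e.g. lr_schedule(10_000) == 3e-4 and lr_schedule(1_000_000) == 0.0
```

For the SWE runs, which take only 500k steps, `total` would be set accordingly.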
For all fine-tuning tasks, we follow the settings in the literature: the model is fine-tuned for 3 epochs; the learning rate warms up linearly from 0.0 to its peak in the first 10% of the training iterations and then decays linearly to 0.0. We select the best peak learning rate on the validation set from {1 × 10⁻⁵, 1.5 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 4 × 10⁻⁵, 5 × 10⁻⁵, 7.5 × 10⁻⁵, 10 × 10⁻⁵, 12 × 10⁻⁵}. For the SQuAD datasets, we fine-tune each model 5 times and report the average. For the GLUE benchmark, for each training method, we train one BERT model and submit its predictions over the test sets to the GLUE benchmark website to obtain test results. We observed that, when using Pre-LN, the GLUE fine-tuning process is stable and no model diverged.

Training methods. The training procedure in the TensorFlow official implementation of BERT serves as our baseline; the baseline training takes 1 million steps (both on English Wikipedia plus BookCorpus and on the enlarged XLNet dataset). Our Sharing WEights (SWE) method takes only half that number of iterations. For a complete comparison, we also report results from the baseline method with half the number of iterations. The three versions of our method, with their hyperparameter settings, are listed below.

SWE-F Fixed-point untying. It serves as a baseline for the adaptive versions. We ran experiments with a set of different untying points and identified the best untying point τ = 50k. The full results for different τ values are presented in Section 5.3.1.

SWE-A Adaptive untying. We check the gradient correlations every 1000 iterations. If the gradient correlation of adjacent layers is below the threshold ρ = 0.5 for three consecutive checks, we break the tie. The effect of different ρ values is studied in Section 5.3.1.

SWE-S Simplified adaptive untying.
We use the same setup as in SWE-A, except that we break all the ties at once when the majority of the gradient correlations are below the threshold ρ for three consecutive checks.
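The SWE-S stopping rule can be sketched as a small stateful check, run at each monitoring step (every 1000 iterations in our setup). This is an illustrative sketch of our own; the factory/closure structure is an assumption, not the paper's implementation.

```python
def make_swe_s_monitor(rho=0.5, patience=3):
    """Untie all layers once a majority of adjacent-layer gradient
    correlations stays below `rho` for `patience` consecutive checks."""
    state = {"low_checks": 0, "untied": False}

    def check(correlations):
        if state["untied"]:
            return True
        low = sum(c < rho for c in correlations)
        state["low_checks"] = state["low_checks"] + 1 if low > len(correlations) / 2 else 0
        if state["low_checks"] >= patience:
            state["untied"] = True        # stop sharing weights for all layers
        return state["untied"]

    return check
```

Unlike SWE-A, there is no per-pair bookkeeping: a single majority vote over all adjacent-layer correlations decides when to untie everything.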

5.2. EXPERIMENT RESULTS

For English Wikipedia and BookCorpus, both the pretraining and finetuning results of our method vs. the baseline are shown in Table 1. From the results, we see that our method with 500k training iterations matches the performance of the baseline with 1 million training iterations, and significantly outperforms the baseline with 500k training iterations. The results for the XLNet dataset are shown in Table 2; we observe similar advantages of our approach over the baseline.

5.3. ABLATION STUDIES

In this section, we study the effects of different choices in implementing our method.

5.3.1. WHEN TO STOP WEIGHT SHARING

Under review as a conference paper at ICLR 2021

In this section, we study the effects of using different untying points (Algorithm 1) and thresholds (Algorithm 2). If weights are shared throughout the entire pretraining process, the final performance will be much worse than without any form of weight sharing (Lan et al., 2020); on the other hand, no weight sharing at all yields slower convergence. Results for different untying points τ and threshold values ρ are summarized in Table 3; models are trained for 500k iterations on English Wikipedia and BookCorpus. From the results, we see that for the SWE-F method, a smaller τ value performs better than a larger one: the weight sharing stage should not be too long. We also see that the performance of the SWE-A method is not sensitive to the threshold ρ.

Note that we are not restricted to sharing weights only across the original layers. We can group several consecutive layers into a weight sharing unit, and we denote by A × B the configuration that groups A layers into a unit which is shared B times. Since BERT has 24 layers, the baseline method without weight sharing can be viewed as "24x1", and our method shown in Table 1 as "1x24". We present results for more choices of weight sharing units in Table 4. We can see that, to achieve good results, the chosen weight sharing unit should be no larger than 6 layers; that is, the weights of a layer must be shared at least 4 times.

6. CONCLUSION

We proposed a simple weight sharing method to speed up the training of deep networks with repeated layers, and showed promising empirical results on BERT training. Our method is motivated by the successes of weight sharing models in the literature as well as by our theoretical analysis of deep linear models. For future work, we will extend our empirical studies to other deep learning models and tasks, and analyze under which conditions our method is helpful.

Proof (of Lemma 2). From the proof of Lemma 1, we know that W_0(t) has the same eigenvectors as Φ. Take any eigenvector, and denote by λ(t) and λ^L the corresponding eigenvalues of W_0(t) and Φ, respectively. By Lemma 1, we have

λ(t+1) = λ(t) - η λ(t)^{L-1} (λ(t)^L - λ^L).

For λ > 1, we show that λ(t) ∈ [1, λ] for all t ≥ 0 by setting η ≤ 1/(L λ^{2L}). Since λ(0) = 1, the claim holds trivially at t = 0. Suppose the claim holds at t = t_0; then

λ(t_0+1) = λ(t_0) - η λ(t_0)^{L-1} (λ(t_0) - λ) Σ_{i=0}^{L-1} λ(t_0)^i λ^{L-1-i}.

With η ≤ 1/(L λ^{2L}), the coefficient satisfies η λ(t_0)^{L-1} Σ_{i=0}^{L-1} λ(t_0)^i λ^{L-1-i} ≤ L λ^{2L-2} / (L λ^{2L}) ≤ 1, so

λ(t_0+1) ≤ λ(t_0) + (λ - λ(t_0)) = λ,

and η ≥ 0 together with λ(t_0) ≤ λ guarantees λ(t_0+1) ≥ λ(t_0) ≥ 1. By induction, when λ > 1 we have λ(t) ∈ [1, λ] for all t ≥ 0.

Similarly, for λ < 1, we show that λ(t) ∈ [λ, 1] by setting η ≤ 1/L. The claim again holds trivially at t = 0. Suppose λ(t_0) ∈ [λ, 1]; then, since η λ(t_0)^{L-1} Σ_{i=0}^{L-1} λ(t_0)^i λ^{L-1-i} ≤ ηL ≤ 1,

λ(t_0+1) ≥ λ(t_0) - (λ(t_0) - λ) = λ,

and η ≥ 0 guarantees λ(t_0+1) ≤ λ(t_0) ≤ 1. By induction, when λ < 1 we have λ(t) ∈ [λ, 1] for all t ≥ 0.

Since the two claims hold for every eigenvalue λ, setting η ≤ 1/(L max(λ_max²(Φ), 1)) gives

λ_min(W_0(t)) ≥ min(λ_min(Φ)^{1/L}, 1),   λ_max(W_0(t)) ≤ max(λ_max(Φ)^{1/L}, 1),

which completes the proof.

A.4 PROOF OF THEOREM 4

Proof. Denote φ = max(λ_max(Φ), 1). From the proof of Lemma 1, we have

∇_l R(t) = W_0^{L-1}(t) (W_0^L(t) - Φ).

By Lemma 2, setting η ≤ 1/(L max(λ_max²(Φ), 1)), we have λ_min(W_0(t)) ≥ min(λ_min(Φ)^{1/L}, 1) and λ_max(W_0(t)) ≤ max(λ_max(Φ)^{1/L}, 1). It follows immediately that

‖∇_l R(t)‖_F ≤ φ^{(L-1)/L} √(2R(t)),   ‖∇_l R(t)‖_F ≥ min(λ_min(Φ), 1)^{(L-1)/L} √(2R(t)).

With one step of the gradient update (writing W_0 for W_0(t) and ⟨·, ·⟩ for the Frobenius inner product),

R(t+1) - R(t) = (1/2)‖(W_0 - η∇_0 R(t))^L - Φ‖_F² - (1/2)‖W_0^L - Φ‖_F²
             = ⟨(W_0 - η∇_0 R(t))^L - W_0^L, W_0^L - Φ⟩ + (1/2)‖(W_0 - η∇_0 R(t))^L - W_0^L‖_F².

Expand (W_0 - η∇_0 R(t))^L = A_0 + ηA_1 + ... + η^L A_L, with A_0 = W_0^L, and let

I_1 = η⟨A_1, W_0^L - Φ⟩,   I_2 = Σ_{k=2}^L η^k ⟨A_k, W_0^L - Φ⟩,   I_3 = (1/2)‖Σ_{k=1}^L η^k A_k‖_F².

We have R(t+1) - R(t) ≤ I_1 + I_2 + I_3.

Note that

‖W_0‖ ≤ φ^{1/L},   ‖∇_0 R(t)‖_F ≤ φ^{(L-1)/L} √(2R(t)).

Using the facts that, for 0 ≤ y ≤ x/L²,

(x + y)^L ≤ x^L + 2L x^{L-1} y,   (x + y)^L ≤ x^L + L x^{L-1} y + L² x^{L-2} y²,

and taking η ≤ 1/(L²φ√(2R(t))), we have

‖Σ_{k=1}^L η^k A_k‖_F ≤ (φ^{1/L} + ηφ^{(L-1)/L}√(2R(t)))^L - φ ≤ 2ηLφ² √(2R(t)),
‖Σ_{k=2}^L η^k A_k‖_F ≤ (φ^{1/L} + ηφ^{(L-1)/L}√(2R(t)))^L - φ - ηLφ^{2(L-1)/L}√(2R(t)) ≤ 2η²L²φ³ R(t).

Thus we have

I_2 ≤ ‖Σ_{k=2}^L η^k A_k‖_F ‖W_0^L - Φ‖_F ≤ 2√2 η²L²φ³ R(t)^{3/2},
I_3 ≤ (1/2)‖Σ_{k=1}^L η^k A_k‖_F² ≤ 4η²L²φ⁴ R(t).

For I_1, since A_1 = -L W_0^{L-1} ∇_0 R(t), we directly have

I_1 = -ηL ⟨∇_0 R(t), W_0^{L-1}(W_0^L - Φ)⟩ = -ηL ‖∇_0 R(t)‖_F²,

together with ‖∇_0 R(t)‖_F ≥ min(λ_min(Φ), 1) √(2R(t)). Therefore, by setting

η ≤ min( 1/(L max(λ_max²(Φ), 1)), 1/(√2 L²φ R(t)^{1/2}), min(λ_min²(Φ), 1)/(2√2 L²φ³ R(t)^{1/2}), min(λ_min²(Φ), 1)/(4L²φ⁴) ),

we have

R(t+1) - R(t) ≤ (2 - 2L) η min(λ_min²(Φ), 1) R(t).

Directly setting η ≤ min(λ_min²(Φ), 1) / (4√d L² max(λ_max⁴(Φ), 1)) satisfies all the requirements above, which gives

R(t) ≤ (1 - (2L-2) min(λ_min²(Φ), 1) η)^t R(0) ≤ exp(-(2L-2) min(λ_min²(Φ), 1) η t) R(0).



Theorem 4. [Discrete-time gradient descent with weight sharing] For the deep linear network f(x; W_1, ..., W_L) = W_L W_{L-1} ... W_1 x, initialize all W_l(0) with the identity matrix I and update according to Equation 2. With a positive definite target matrix Φ, setting η ≤ min(λ_min²(Φ), 1) / (4√d L² max(λ_max⁴(Φ), 1)) gives linear convergence

R(t) ≤ exp(-(2L-2) min(λ_min²(Φ), 1) η t) R(0).

Taking λ_min(Φ)/λ_max(Φ) and d as constants and focusing on the scaling with L and ε, we have η = O(L⁻²). Because of the extra factor of L in the exponent, when learning a positive definite matrix Φ, training with weight sharing achieves R(t) ≤ ε within O(L log(1/ε)) steps. The dependency on L is thus reduced from cubic to linear, which shows the acceleration from weight sharing.

To the best of our knowledge, ZAS gives the state-of-the-art convergence rate; it is the only work showing global convergence of deep linear networks trained by gradient descent for an arbitrary target matrix.

Table 1: Training BERT on English Wikipedia and BookCorpus. Our method with half a million iterations matches the baseline performance with one million iteration steps, and outperforms the baseline with half a million iterations.

Table 2: Training BERT on the XLNet dataset. Our method with half a million iterations matches the baseline performance with one million iteration steps, and outperforms the baseline with half a million iterations.

Table 3: Results for different untying points τ and thresholds ρ. Models are trained for 500k iterations on English Wikipedia and BookCorpus.

Table 4: Grouping several consecutive layers into a weight sharing unit instead of sharing weights only across the original layers. A × B means grouping A layers into a unit which is shared B times. Models are trained for 500k iterations on English Wikipedia and BookCorpus.

A PROOFS OF SECTION 3

A.1 PROOF OF LEMMA 1

Proof. By definition, we have

∇_l R(t) = W_{L:l+1}^T(t) (W_{L:1}(t) - Φ) W_{l-1:1}^T(t).

To see the result in Lemma 1, it is sufficient to show that W_0(t) is symmetric and has the same eigenvectors as Φ. First, by initializing W_0(0) = I, the requirement trivially holds at t = 0. Now, suppose W_0(t_0) has the same eigenvectors as Φ. Then, since all layers share the weight W_0(t_0), which commutes with Φ,

∇_l R(t_0) = W_0^{L-l}(t_0) (W_0^L(t_0) - Φ) W_0^{l-1}(t_0) = W_0^{L-1}(t_0) (W_0^L(t_0) - Φ),

so ∇_l R(t_0) has the same eigenvectors as Φ for every layer l. Therefore the update for W_0(t_0), given by (1/L) Σ_{i=1}^L ∇_i R(t_0), also has the same eigenvectors as Φ. Inductively, all subsequent weights W_0(t) have the same eigenvectors as Φ. We can thus exchange the order of matrix multiplication, and for all t ≥ 0 we have

W_l(t+1) - W_l(t) = -(η/L) Σ_{i=1}^L ∇_i R(t) = -η W_0^{L-1}(t) (W_0^L(t) - Φ).

A.2 PROOF OF THEOREM 2

Proof. From Lemma 1, we have Ẇ_l(t) = -W_0^{L-1}(t)(W_0^L(t) - Φ). For the loss function R(t), we have

dR(t)/dt = Σ_{l=1}^L tr(∇_l R(t)^T Ẇ_l(t)) = -L ‖W_0^{L-1}(t)(W_0^L(t) - Φ)‖_F² ≤ -2L λ_min²(W_0^{L-1}(t)) R(t).

Under continuous-time gradient descent with W_0(0) = I, it is easy to see that the eigenvalues of W_0(t) stay between min(λ_min(Φ)^{1/L}, 1) and max(λ_max(Φ)^{1/L}, 1), so λ_min(W_0^{L-1}(t)) ≥ min(λ_min(Φ), 1). Therefore we have

dR(t)/dt ≤ -2L min(1, λ_min²(Φ)) R(t), and hence R(t) ≤ exp(-2L min(1, λ_min²(Φ)) t) R(0).

A.3 TECHNICAL LEMMA

Lemma 2. Initializing W_0(0) = I and training with the weight sharing update (Equation 2), setting η ≤ 1/(L max(λ_max²(Φ), 1)) gives

λ_min(W_0(t)) ≥ min(λ_min(Φ)^{1/L}, 1),   λ_max(W_0(t)) ≤ max(λ_max(Φ)^{1/L}, 1).

