σREPARAM: STABLE TRANSFORMER TRAINING WITH SPECTRAL REPARAMETRIZATION

Abstract

Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the "attention entropy" for each attention head during the course of training, which is a proxy for the attention's sharpness. We observe a common, non-monotonic evolution of attention entropy across different settings: the attention entropy first quickly decreases in the initial phase of training, then quickly increases, and finally enters a long stable phase. While the exact shape can be affected by hyperparameters such as warmup, initialization, and learning rate, we found a close correlation between the minima of attention entropy and the model's training stability. To this end, we propose a simple and efficient solution dubbed σReparam, where we reparametrize all linear layers with Spectral Normalization and an additional learned scalar. We provide a lower bound on the attention entropy as a function of the spectral norms of the query and key projections, which suggests that small attention entropy can be obtained with large spectral norms. σReparam decouples the growth rate of a weight matrix's spectral norm from its dimensionality, which we verify empirically. We conduct experiments with σReparam on image classification, image self-supervised learning, automatic speech recognition and language modeling tasks. We show that σReparam provides great stability and robustness with respect to the choice of hyperparameters.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) are state-of-the-art models in many application domains. However, despite their empirical success and wide adoption, great care often needs to be taken in order to achieve good training stability and convergence. In the original paper (Vaswani et al., 2017), residual connections and Layer Normalizations (LNs) (Ba et al., 2016) are extensively used for each Attention and MLP block (specifically, in the "Post Norm" fashion). There have since been various works attempting to promote better training stability and robustness. For example, the "Pre Norm" (Radford et al., 2019) scheme has gained wide popularity, where one moves the placement of LNs to the beginning of each residual block. Others have argued that it is important to properly condition the residual connections. Bachlechner et al. (2021) propose to initialize the residual connections to zero to promote better signal propagation. Zhang et al. (2018) and Huang et al. (2020) remove LNs with carefully designed initialization schemes.

In this work, we study the training instability of Transformers through the lens of training dynamics. We start by monitoring the average entropy of the attention heads (by treating each attention head as a multinomial distribution) over all query positions and examples. Interestingly, the average attention entropy often evolves in a pattern consisting of three phases. In the beginning, attention entropy starts high (corresponding to uniform attention scores) and quickly drops to a small value. This is followed by a second stage where it quickly increases to a relatively high entropy regime. Lastly, the attention entropy curve stabilizes and smoothly evolves to convergence. See the top left plot of Figure 1 for an illustration, which is a Vision Transformer (ViT) (Touvron et al., 2021) trained on ImageNet classification using well optimized hyperparameters.
Empirically, we have found that the attention entropy is directly correlated with the model's stability and convergence. In particular, small attention entropy reached in the initial phase often causes slow convergence, fluctuations in training loss and, in the worst case, divergence. This is shown in Figure 1, where we vary the learning rate and warmup epochs of the baseline ViT model. We see that both a decreased learning rate and increased warmup epochs smooth the attention entropy curves, which in turn yields lower training losses. On the other hand, increasing the learning rate has a detrimental impact on training: the attention entropy collapses to near zero and training diverges. We denote the rapid dip of attention entropy to a near zero value, and its resulting pathological optimization dynamics, as "entropy collapse".

The remaining questions are: 1) How do we get rid of entropy collapse? 2) Can we improve training stability by doing so? We answer them by showing that attention entropy is closely related to the spectral norms of the query and key projections. In particular, we show a lower bound on the attention entropy, which suggests that large spectral norms of the projections can more easily lead to entropy collapse. We then provide a simple fix, dubbed σReparam, which reparameterizes all weight matrices by sequentially applying Spectral Normalization (Miyato et al., 2018) and a learned multiplicative scalar. Intuitively, σReparam decouples the update of the spectral norms of weights from their dimensionality, which allows them to update smoothly in a controlled way. Also note that σReparam does not change the model space, which allows one to learn an arbitrarily expressive model. We validate σReparam on four tasks: image classification, image self-supervised learning, automatic speech recognition and language modeling.
We show that σReparam effectively slows down the growth of each layer's spectral norms, and as a result, their attention entropy curves are greatly smoothed. This allows us to achieve great robustness with respect to the choice of hyper parameters. In certain cases, we are able to remove Layer Norms and still achieve competitive results.

2. RELATED WORKS

Transformers have relied heavily on LNs to achieve training stability. Besides the popular Post Norm and Pre Norm configurations, other variants have been proposed (Wang et al., 2022; Shleifer et al., 2021). σReparam does not rely on LN and can even work in its absence, which avoids the computational overhead of explicit activation normalization. There have also been numerous attempts to design better Transformer initialization schemes, including Zhang et al. (2018); Huang et al. (2020); Yang et al. (2022); Bachlechner et al. (2021). σReparam is an orthogonal approach as it addresses the training dynamics of attention layers, which makes it compatible with standard initialization methods and provides robust performance. σReparam is a special case of weight reparameterization, which has found wide adoption in Deep Learning. Weight Norm (WN) (Salimans & Kingma, 2016) is a well known example of such methods, but its effectiveness in Transformers is limited. In ConvNets, simple additive weight reparameterization (Ding et al., 2021) has been shown to speed up training convergence. To the best of our knowledge, σReparam is the first simple reparameterization technique that provides competitive performance with well optimized baseline models.

3.1. ATTENTION ENTROPY

At the core of Transformers is dot-product attention. Let $X \in \mathbb{R}^{T \times d}$ denote an input sequence to an attention layer (we assume self-attention for simplicity of presentation), where $T$ and $d$ are the number of tokens and the token dimension, respectively; and let $W_K, W_Q \in \mathbb{R}^{d \times n_a}$ and $W_V \in \mathbb{R}^{d \times n_v}$ denote the key, query and value matrices. A simple attention layer then computes $\mathrm{Att}(X) = A X W_V$, where $A = \psi(a)$, $a = X W_K W_Q^\top X^\top$, and $\psi$ is the row-wise softmax function. We define the attention entropy of row $i$ of $A$ by $\mathrm{Ent}(A_i) = -\sum_{j=1}^{T} A_{i,j} \log(A_{i,j})$. We also overload the notation and let $\mathrm{Ent}(A) = \frac{1}{T} \sum_{i=1}^{T} \mathrm{Ent}(A_i)$ denote the average attention entropy of $A$. As shown in Figure 1, the attention entropy (and the entropy collapse phenomenon) is a strong indicator of the training stability of Transformers. Our goal is to alleviate the entropy collapse problem and achieve a smooth evolution of the attention entropy through training.

We next investigate the properties of attention entropy. We show in the next theorem that $\mathrm{Ent}(A)$ is directly connected to the spectral norm (the largest singular value) of $W_K W_Q^\top$.

Theorem 3.1 (Attention entropy lower bound). Assume without loss of generality $\|X\|_2 \le 1$, and let $\sigma = \|W_K W_Q^\top\|_2$ denote the spectral norm. Then it holds that:

$$\mathrm{Ent}(A_i) \ge \log\left(1 + (T-1)\, e^{-\sigma\sqrt{\frac{T}{T-1}}}\right) + \frac{\sigma\sqrt{T(T-1)}\; e^{-\sigma\sqrt{\frac{T}{T-1}}}}{1 + (T-1)\, e^{-\sigma\sqrt{\frac{T}{T-1}}}} \qquad (1)$$

Moreover, there exist inputs $X$ and weights $W_K, W_Q$ for which the lower bound in Eq. (1) is tight.

Therefore, for large $\sigma$ and $T$, the minimum attainable entropy behaves like $\Omega(T \sigma e^{-\sigma})$. We note that the bound on the entropy in Theorem 3.1 is tight in the sense that it is achievable for some inputs $X$. Moreover, the typical Frobenius (L2) regularization would not ensure a small $\sigma$ (a small Frobenius norm is much more restrictive than a small spectral norm), hence it would not be as effective in preventing entropy collapse. Proofs for Theorem 3.1 and the following Proposition are provided in Appendix A.
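The definitions above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper names (`softmax_rows`, `attention_entropy`, `entropy_lower_bound`), the shapes and the random inputs are all illustrative assumptions. It computes the per-row attention entropy, evaluates the bound of Eq. (1), and builds the two-valued logit vector from the proof in Appendix A to illustrate that the bound is attained.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_entropy(A):
    """Entropy of each row of A; 0 * log(0) is treated as 0."""
    log_A = np.log(np.clip(A, 1e-300, None))
    return -(A * log_A).sum(axis=-1)

def entropy_lower_bound(sigma, T):
    """Right-hand side of Eq. (1) for spectral norm sigma and T tokens."""
    delta = sigma * np.sqrt(T / (T - 1))
    z = 1.0 + (T - 1) * np.exp(-delta)
    return np.log(z) + sigma * np.sqrt(T * (T - 1)) * np.exp(-delta) / z

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
T, d, n_a = 16, 32, 8
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, 2)               # enforce ||X||_2 = 1
W_K = 0.5 * rng.normal(size=(d, n_a))
W_Q = 0.5 * rng.normal(size=(d, n_a))

sigma = np.linalg.norm(W_K @ W_Q.T, 2)  # spectral norm of W_K W_Q^T
A = softmax_rows(X @ W_K @ W_Q.T @ X.T)
row_entropy = attention_entropy(A)      # Ent(A_i) for every row i
mean_entropy = row_entropy.mean()       # Ent(A)
bound = entropy_lower_bound(sigma, T)

# Logits with one large entry and T-1 equal small entries, with norm sigma,
# attain the bound (this is the minimizer described in Appendix A).
u_star = np.full(T, -sigma / np.sqrt(T * (T - 1)))
u_star[0] = sigma * np.sqrt(1.0 - 1.0 / T)
tight_entropy = attention_entropy(softmax_rows(u_star[None, :]))[0]
```

In this sketch every row entropy lies between `bound` and `log(T)`, and `tight_entropy` coincides with `bound` up to floating-point error.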

3.2. σREPARAM

We then present σReparam, a method to reparameterize the weights of a linear layer with:

$$\widehat{W} = \frac{\gamma}{\sigma(W)}\, W, \qquad (2)$$

where $\sigma(W) \in \mathbb{R}$ is the spectral norm of $W$ and $\gamma \in \mathbb{R}$ is a learnable parameter, initialized to 1. In practice, $\sigma(W)$ can be computed via power iteration (Mises & Pollaczek-Geiringer, 1929) as in Spectral Normalization (SN) (Miyato et al., 2018); see Algorithm 1 for a sketch implementation. Note that σReparam brings little extra overhead, as the power iteration mainly consists of two matrix-vector products and is only performed on the parameters rather than the activations. During inference, one can compute $\widehat{W}$ once and freeze it, which means that it has the same cost as a regular linear layer.

Why σReparam? Unlike the standard SN, σReparam introduces an additional multiplier $\gamma$ which explicitly controls the spectral norm of the weights, and there is no explicit pressure to regularize the spectral norm. The additional multiplier is necessary to avoid restricting the capacity of the network, and we find that training and overall performance are significantly degraded in its absence. Since the representational capacity of the layer remains unchanged, it is not immediately clear why σReparam would effectively regularize the spectral norm of the weights. While a full theoretical characterization is beyond the scope of this paper, we identify a property of adaptive optimizers which, if left unchecked, causes the spectral norm of weight matrices to grow rapidly for large weight matrices. To illustrate this, we adopt common assumptions in stochastic optimization, and model the stochastic gradients at some point in the optimization by $g = \mu + \epsilon \in \mathbb{R}^{w \times w}$, where $\mu$ is the mean and $\epsilon$ is a random variable with $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon^2] = n^2 \in \mathbb{R}^{w \times w}$. A typical Adam optimizer update attempts to approximate the following ideal update, where the square, square root and division are taken elementwise: $\Delta = \frac{\mathbb{E}[g]}{\sqrt{\mathbb{E}[g^2]}}$. The following proposition lower bounds the spectral norm of the ideal update $\sigma(\Delta)$:

Proposition 3.2. It holds that:

$$\sigma(\Delta) \ge \sqrt{w}\,\sqrt{1 - \frac{1}{w^2} \sum_{i,j=1}^{w} \frac{n_{i,j}^2}{\mu_{i,j}^2 + n_{i,j}^2}} \qquad (3)$$

Note that the noise second moment $n^2$ is typically of the order of $\mu^2$, hence Eq. (3) indicates that the spectral norm of the ideal update should be large, growing linearly with $\sqrt{w}$. Moreover, for large batch sizes we would have $n^2 \ll 1$, resulting in $\sigma(\Delta) \sim \sqrt{w}$ (footnote 1). While such a large spectral norm could be offset by a proper learning rate adjustment, this would be counterproductive since 1) a small learning rate typically induces inferior performance, and 2) architectures with layers of varying sizes, such as attention layers, would require per-layer learning rate tuning. In contrast, σReparam avoids this issue since the spectral norm of each layer is controlled by a single parameter $\gamma$, hence the size of its update does not scale with $w$ and is uniform across layers.
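To make the reparameterization and the proposition concrete, here is a hedged NumPy sketch. It is not the paper's Algorithm 1: the class `SigmaReparamLinear`, its argument names and the forward-only design (no gradients, no training loop) are assumptions made for illustration. The first part estimates σ(W) with power iteration and rescales the weight by γ/σ(W); the second part evaluates the ideal update Δ and the right-hand side of the proposition on synthetic µ and n².

```python
import numpy as np

class SigmaReparamLinear:
    """Forward-only linear layer with W_hat = (gamma / sigma(W)) * W."""

    def __init__(self, d_in, d_out, init_std=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=init_std, size=(d_out, d_in))
        self.b = np.zeros(d_out)
        self.gamma = 1.0                      # learnable scalar, initialized to 1
        u = rng.normal(size=d_out)
        self.u = u / np.linalg.norm(u)        # persistent left singular vector estimate

    def spectral_norm_estimate(self, n_iters=1):
        """Power iteration: two matrix-vector products per step."""
        u = self.u
        for _ in range(max(1, n_iters)):
            v = self.W.T @ u
            v = v / (np.linalg.norm(v) + 1e-12)
            u = self.W @ v
            u = u / (np.linalg.norm(u) + 1e-12)
        self.u = u
        return float(u @ self.W @ v)

    def effective_weight(self, n_iters=1):
        return (self.gamma / self.spectral_norm_estimate(n_iters)) * self.W

    def __call__(self, x, n_iters=1):
        return x @ self.effective_weight(n_iters).T + self.b


def ideal_update(mu, n2):
    """Elementwise ideal Adam update E[g] / sqrt(E[g^2]) with E[g^2] = mu^2 + n^2."""
    return mu / np.sqrt(mu**2 + n2)

def proposition_bound(mu, n2):
    """Lower bound on the spectral norm of the ideal update."""
    w = mu.shape[0]
    return np.sqrt(w) * np.sqrt(1.0 - (n2 / (mu**2 + n2)).sum() / w**2)


# Many power-iteration steps so the estimate has converged in this toy check.
layer = SigmaReparamLinear(d_in=64, d_out=32)
W_hat = layer.effective_weight(n_iters=500)
w_hat_norm = np.linalg.norm(W_hat, 2)         # close to gamma = 1

# Synthetic gradient statistics with noise of the same order as the mean.
rng = np.random.default_rng(1)
w = 128
mu = rng.normal(size=(w, w))
n2 = mu**2
update_norm = np.linalg.norm(ideal_update(mu, n2), 2)
bound = proposition_bound(mu, n2)             # equals sqrt(w / 2) for n2 = mu^2
```

In training one would typically run a single power-iteration step per update and reuse `u` across steps, as in Spectral Normalization; the 500 steps here only serve to make the one-off check accurate.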

4.1. SUPERVISED IMAGE CLASSIFICATION

Improved robustness. We first start from a well tuned recipe with ViT-B on ImageNet-1k (Touvron et al., 2021), and vary its hyperparameters in the grid [base lr ∈ {5e-4, 1e-3}, batch size ∈ {1024, 2048}, warmup epochs ∈ {0, 5}]. 7 of the 8 configurations diverge; only the default [5e-4, 2048, 5] configuration trains. We next apply σReparam to all the linear layers (including the initial patch embedding), and remove all the LayerNorm instances. All configurations in the same grid search converge with an average top-1 accuracy of 81.4% (max 82.2%, shown in Table 1). This suggests improved robustness with respect to hyperparameters.

Simplified recipe. σReparam also enables a simplified framework for training ViT-B and ViT-L models, in contrast to state-of-the-art ImageNet-1k ViT training protocols such as the fully supervised MAE recipe (He et al., 2022) and DeiT (Touvron et al., 2021) (Table 1). In the case of ViT-B models, we are able to train for a shorter duration, remove all LayerNorm layers, remove LR warmup, remove cosine scheduling (requiring only a simple step schedule at 210 epochs) and use no weight decay. Furthermore, σReparam enables SGD training via LARS (You et al., 2017) (with momentum 0.9), something not possible with traditional ViT training protocols (Touvron et al., 2021; He et al., 2022). These simplifications also have the added benefit of reducing GPU memory overhead (footnote 2). For the ViT-L model we relax the LR schedule back to cosine and match the baseline model's training interval. Both models use FP32 precision on the attention operands and keep mixed precision training for the rest of the network. The full set of hyperparameters is available in Appendix E.

To further understand the effect of σReparam, we track both the attention entropy and the largest singular value of the attention weight matrix over the course of training.
In Figure 2 , σReparam maintains a lower largest attention weight singular value and presents a higher, but monotonically decreasing attention entropy throughout training. As previously discussed, a smaller bounded singular value helps with stable training, whereas a higher attention entropy encourages exploration of more diverse solutions. This is reinforced by the accelerated performance observed in Test Top 1 and the 50 epoch reduction in training time for the σReparam ViT-B/16 shown in Figure 2 .

4.2. SELF-SUPERVISED TRAINING OF VISUAL REPRESENTATIONS

In computer vision, self-supervised learning (SSL) has been effective in enabling efficient training on downstream tasks (Assran et al., 2022). Most of this progress has been made using convolutional architectures, while works using ViTs often require specialized training recipes (Caron et al., 2021). Recently, it was found that ViTs suffer from training instabilities in SSL tasks (Chen et al., 2021). These instabilities can be remedied through a combination of frozen patch embedders, initialization …

Figure 3, panel descriptions. (a) Linear probe performance for the best (solid line) and worst (dashed line) trials of each method, against relevant metrics from the first attention layer (top to bottom): attention entropy, the spectral norm of the attention weights, and the ℓ∞-gradient norm of the attention weights. We see that the Frozen Patcher method functions as intended, regulating its gradient norm and protecting it from the large gradient norms that induce instability in Baseline. We also observe a second form of instability during training: the growing spectral norm leads to a poorly behaved attention mechanism, entropy collapse, and a drop in performance, as described in Section 3. This affects Baseline as well as Frozen Patcher, as neither method gives specific protection against this second type of instability (solid and dashed red, and dashed green lines). Finally, we see that σReparam with and without layer normalization regulate both the gradient norms and the spectral norms, giving defense against both types of instability. (b) Linear probe performance of every trial. We see that σReparam is the most stable method. σReparam + LN is also quite stable. In the cases where it experiences instabilities, we see that it is able to recover much more quickly than Baseline and Frozen Patcher. This is due to the regularization of the spectral norm, which 1) prevents any arising instability from pushing the model too far away from the current solution, and 2) keeps the attention mechanism useful, such that gradients are available for any required correction.

We observe two types of instability. The first, as observed in Chen et al. (2021), is induced by large gradient norms in early layers. The second, described in Section 3, relates to entropy collapse. We find that Frozen Patcher protects against the first type, but is still susceptible to the second. σReparam, however, can protect against both types of instability, yielding more reliable training (see Figure 3). As noted in Chen et al. (2021), instabilities reduce final performance. We show the impact of instability on performance in Figure 4. Finally, we look at the performance attainable when training for a longer duration of 300 epochs in Table 2. The best performing run is given by σReparam + LN, with Frozen Patcher performing almost as well, and both outperforming the reference SimCLR result (Chen et al., 2021). Ultimately, we see that while σReparam produces the lowest degree of instability, the best overall method for stable training of SimCLR ViTs is σReparam + LN, producing the highest ImageNet1k linear probe performance at both 100 epochs (69.6%) and 300 epochs (74.5%), as well as very stable training over many trials, at both long and short learning rate warmup.

4.3. SPEECH

In this section we focus on experiments for automatic speech recognition (ASR).

Data. All experiments are performed on the 100h subset of audio paired with transcriptions (train-clean-100) of the LibriSpeech dataset (Panayotov et al., 2015). The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models. Test sets (test-clean and test-other) are used only to report final word error rate (WER) performance without an external language model. We keep the original 16kHz sampling rate and compute log-mel filterbanks with 80 coefficients for a 25ms sliding window, strided by 10ms, later normalized to zero mean and unit variance per input sequence.

Acoustic Models. We use what is, to the best of our knowledge, the current state-of-the-art model on 100h of LibriSpeech (Likhomanenko et al., 2021a). The model consists of a 1D convolution to perform striding, a Transformer encoder with post-LN, and a final linear layer to map to the output number of tokens (footnote 3). The model is trained with the Connectionist Temporal Classification (Graves et al., 2006) loss. To speed up model training (2-3x) and decrease memory usage, we use CAPE positional embeddings (Likhomanenko et al., 2021c) instead of relative embeddings (Shaw et al., 2018).

Training. We use Adagrad (Duchi et al., 2011) if not specified otherwise, with the LR decayed by a factor of 2 each time the WER reaches a plateau on the validation sets. We use dynamic batching of 240s of audio per GPU and train with tensor cores fp32 on 8 Ampere A100 (40GB) GPUs for 350-500k updates. No weight decay is used. Default warmup is set to 64k for the baselines and varied for different models.
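The feature settings in the Data paragraph translate into simple frame arithmetic. The sketch below is illustrative only: the function names are hypothetical, the frame count assumes no padding at the edges, and the normalization assumes a single mean and standard deviation over the whole sequence (the text does not say whether statistics are per coefficient).

```python
import numpy as np

SAMPLE_RATE = 16_000                       # Hz, as stated in the text
WIN = SAMPLE_RATE * 25 // 1000             # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000             # 10 ms stride -> 160 samples
N_MELS = 80                                # log-mel coefficients per frame

def num_frames(n_samples):
    """Number of full windows that fit, assuming no edge padding."""
    if n_samples < WIN:
        return 0
    return 1 + (n_samples - WIN) // HOP

def normalize_per_sequence(feats, eps=1e-5):
    """Zero mean, unit variance over one (frames, N_MELS) feature matrix."""
    return (feats - feats.mean()) / (feats.std() + eps)

# One second of audio and random stand-in features for illustration.
frames_per_second = num_frames(SAMPLE_RATE)
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(frames_per_second, N_MELS))
normalized = normalize_per_sequence(feats)
```

Under these assumptions one second of 16kHz audio yields 98 frames of 80 coefficients each.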

Data augmentation

The default LR is 0.03 and also optimized across models. We also apply gradient clipping of 1.

4.3.1. TRAINING STABILITY, ROBUSTNESS AND GENERALIZATION

First, we examine the training stability of the baselines using both "Pre Norm" (pre-LN) and "Post Norm" (post-LN) architectures. If we vary LR, warmup, and gradient clipping, all post-LN experiments either diverge or fail to train. At the same time, pre-LN is stable: we can reduce warmup from 64k to 16k, increase the learning rate from 0.03 to 0.5, and obtain better results than before. While pre-LN is more stable than post-LN, it generalizes worse: validation WER is worse while training loss is lower, see Table 3. When we switch to σReparam we observe the same stability as for pre-LN (Figure 5), while having better generalization than the unoptimized pre-LN. We are not able to match the post-LN results until we combine post-LN with σReparam, which allows us to achieve similar performance on the dev and test sets and lower training loss. In Figure 5, both σReparam and σReparam with post-LN demonstrate robustness with respect to training hyperparameters. We also compare with Spectral Normalization (SN), where γ is fixed to 1 and not learnable, and with WN baselines. Both SN and WN perform poorly compared to σReparam, see Table 3.

In prior works it was reported that post-LN can be impossible to train with very deep architectures, see e.g. Liu et al. (2020b; a). We reproduced similar results: if we double the encoder size, post-LN does not train, while pre-LN works out of the box and improves over the smaller architecture. We applied the same settings to σReparam and to the combination of σReparam and post-LN: in both cases the models train well out of the box and achieve results similar to pre-LN. This confirms σReparam's ability to provide stable training even with post-LN.

4.3.2. TRAINING WITH SGD

Prior works report different problems training Transformers with SGD (see e.g. Li et al., 2022). First, we experimented with the baselines, pre-LN and post-LN, and observed similar issues: it is hard to find hyperparameters that enable the model to train. Following the vision experiments, we switch to the LARS (You et al., 2017) (with momentum 0.9) optimizer, and are able to train pre-LN and post-LN by carefully tuning the LR (the rest stays the same, including gradient clipping), which is varied from 0.1 to 1.5, see … and, combined together with post-LN, it achieves similar performance to the best results from Table 3 while keeping the train loss low (footnote 4).

4.4. LANGUAGE

Setup. We use the WikiText-103 language model (LM) benchmark, which consists of 103M tokens sampled from English Wikipedia (Merity et al., 2017). Our baseline is a highly optimized Transformer (Baevski & Auli, 2019) with 32 layers, 8 heads, 128 head dimensions, 1024 model dimensions, 4096 fully connected dimensions and post LayerNorm. The word embedding and softmax matrices are tied (Press & Wolf, 2017). We partition the training data into non-overlapping blocks of 512 contiguous tokens and train the model to autoregressively predict each token (Baevski & Auli, 2019). Validation and test perplexity is measured by predicting the last 256 words out of the input of 512 consecutive words, to avoid evaluating tokens at the beginning with limited context (early token curse, Press et al., 2021).

Table 5: WikiText-103 language modeling results in perplexity (PPL, lower is better).

Model                                         train   dev    test
σReparam w/ weight decay                      16.5    17.9   18.6
σReparam w/o weight decay                     12.9    18.5   19.3
Baseline Transformer (Baevski & Auli, 2019)   15.4    18.1   18.7

Results. We do not experience training instability with the baseline Transformer, likely because the masked attention in autoregressive models makes entropy collapse less likely to occur. Nonetheless, we experimented with σReparam to test its generality on a different modality and problem. We apply σReparam to all linear layers of the Transformer while removing all LayerNorms, and search over learning rates in the grid [1, 1.5, 2, 2.5] and weight decay in the grid [1e-3, 1e-4, 0]. All other hyperparameters are kept the same as the baseline. The results are shown in Table 5. We see that even in the absence of LayerNorm, σReparam shows strong performance in convergence and dev/test performance. With a mild weight decay, σReparam also outperforms the baseline with respect to dev/test PPL.
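The evaluation protocol in the Setup paragraph (score the last 256 tokens of each 512-token input) can be sketched as a sliding window. This is an assumption-laden illustration rather than the benchmark's reference code: the function names are hypothetical, and the sketch assumes that the window advances by 256 tokens and that the first window is scored with whatever shorter context is available.

```python
import numpy as np

def eval_windows(n_tokens, block=512, scored=256):
    """Yield (context_start, score_start, score_end) so each token is scored once.

    Each scored span of up to `scored` tokens is preceded by as much context
    as fits in a `block`-token input ending at `score_end`.
    """
    for score_start in range(0, n_tokens, scored):
        score_end = min(score_start + scored, n_tokens)
        context_start = max(0, score_end - block)
        yield context_start, score_start, score_end

def perplexity(token_nll):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return float(np.exp(np.mean(token_nll)))

# Windows for a hypothetical 1024-token evaluation stream.
windows = list(eval_windows(1024))

# A model assigning probability 1/2 to every scored token has perplexity 2.
example_ppl = perplexity(np.full(10, np.log(2.0)))
```

From the third window onward, every scored token has at least 256 tokens of preceding context, which is the point of the protocol.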

5. CONCLUSION

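We analyze the training stability of Transformers through the lens of attention entropy. We show that training instability or divergence is often accompanied by the entropy collapse phenomenon, and provide a simple fix named σReparam. We demonstrate over a wide set of benchmarks, domains, and training methodologies that σReparam provides great stability and robustness, often leading to simplified model design and/or better performance.

A PROOF OF THEOREM 3.1 AND PROPOSITION 3.2

Theorem 3.1 (Attention entropy lower bound). Assume without loss of generality $\|X\|_2 \le 1$, and let $\sigma = \|W_K W_Q^\top\|_2$ denote the spectral norm. Then it holds that:

$$\mathrm{Ent}(A_i) \ge \log\left(1 + (T-1)\, e^{-\sigma\sqrt{\frac{T}{T-1}}}\right) + \frac{\sigma\sqrt{T(T-1)}\; e^{-\sigma\sqrt{\frac{T}{T-1}}}}{1 + (T-1)\, e^{-\sigma\sqrt{\frac{T}{T-1}}}} \qquad (1)$$

Moreover, there exist inputs $X$ and weights $W_K, W_Q$ for which the lower bound in Eq. (1) is tight.

Proof. WLOG let $u \in \mathbb{R}^T$ denote the $j$'th row of $a$. From the condition that $\|X\|_2 \le 1$ it holds that $\|u\| \le \sigma$. Let $p = p(u)$ denote the softmax probabilities given by:

$$p_i = \frac{e^{u_i}}{Z}, \qquad (4)$$

where $Z = \sum_{j=1}^{T} e^{u_j}$ is the partition function. The entropy given $p(u)$ is then:

$$\mathrm{Ent}(u) = -\sum_{i=1}^{T} \frac{e^{u_i}}{Z} \log\left(\frac{e^{u_i}}{Z}\right) = -\sum_{i=1}^{T} \frac{u_i e^{u_i}}{Z} + \log(Z). \qquad (5)$$

We wish to solve the following constrained minimization problem:

$$\min_u \mathrm{Ent}(u) \quad \text{s.t.} \quad \|u\|^2 \le \sigma^2, \qquad (6)$$

where $\sigma > 0$. Define the Lagrangian:

$$L(u, \lambda) = \mathrm{Ent}(u) + \frac{1}{2}\lambda\left(\|u\|^2 - \sigma^2\right). \qquad (7)$$

To find all saddle points, we solve the system of equations:

$$\frac{\partial L(u, \lambda)}{\partial u} = 0, \qquad \frac{\partial L(u, \lambda)}{\partial \lambda} = 0, \qquad (8)$$

giving rise to the following set of equations:

$$\forall\, 1 \le k \le T: \quad \lambda u_k = \sum_{i=1}^{T} \frac{e^{u_i}}{Z}\left(\delta_{i,k} - \frac{e^{u_k}}{Z}\right)\left(1 + \log\left(\frac{e^{u_i}}{Z}\right)\right) \qquad (9)$$

$$= p_k\left(\log(p_k) + \mathrm{Ent}(u)\right), \qquad (10)$$

$$\|u\|^2 = \sigma^2. \qquad (11)$$

As a first step, assume that for the minimizer $u^\star$ of Eq. (6) there exists an index $k$ such that $u^\star_k = 0$. Using Eq. (10):

$$0 = \log(p_k) + \mathrm{Ent}(u) = -\sum_{i=1}^{T} p_i \log\left(\frac{p_i}{p_k}\right) = -\sum_{i=1}^{T} p_i \log(e^{u_i}) = -\sum_{i=1}^{T} p_i u_i = -\mathbb{E}u. \qquad (12)$$

From the first set of equations we arrive at the condition, for all $u_j, u_{j'} \ne 0$:

$$\frac{p_j\left(\log(p_j) + \mathrm{Ent}(u)\right)}{u_j} = \frac{p_{j'}\left(\log(p_{j'}) + \mathrm{Ent}(u)\right)}{u_{j'}} \qquad (13)$$

$$\longrightarrow \quad p_j + \frac{\mathbb{E}u}{u_j} = p_{j'} + \frac{\mathbb{E}u}{u_{j'}} \qquad (14)$$

$$\longrightarrow \quad p_j = p_{j'}. \qquad (15)$$

This however implies that $u^\star_1 = u^\star_2 = \dots = u^\star_T = 0$, hence a contradiction to Eq. (11). Now, assuming $u_k \ne 0$ for all $k$, we have that, for all $u_j \ne u_{j'}$:

$$\frac{e^{u_j} - e^{u_{j'}}}{\frac{1}{u_{j'}} - \frac{1}{u_j}} = Z\,\mathbb{E}u = \text{const}. \qquad (16)$$

The monotonicity of the LHS of Eq. (16) implies that $u$ contains only 2 distinct values. WLOG assume $u^\star_1 = \alpha$ and, for all $i > 1$, $u^\star_i = \beta = -\sqrt{\frac{\sigma^2 - \alpha^2}{T-1}}$. Then we have:

$$\frac{e^{\alpha} - e^{\beta}}{\frac{1}{\beta} - \frac{1}{\alpha}} = \alpha e^{\alpha} + (T-1)\,\beta\, e^{\beta}, \qquad (17)$$

with a solution:

$$\alpha = \sigma\sqrt{1 - \frac{1}{T}}, \qquad \beta = -\sigma\sqrt{\frac{1}{T(T-1)}}. \qquad (18)$$

Since $\alpha - \beta = \sigma\sqrt{\frac{T}{T-1}}$, substituting into Eq. (5) gives the corresponding entropy:

$$\mathrm{Ent}(u^\star) = \log\left(1 + (T-1)\, e^{-\sigma\sqrt{\frac{T}{T-1}}}\right) + \frac{\sigma\sqrt{T(T-1)}\; e^{-\sigma\sqrt{\frac{T}{T-1}}}}{1 + (T-1)\, e^{-\sigma\sqrt{\frac{T}{T-1}}}}. \qquad (19)$$

Proposition 3.2. It holds that:

$$\sigma(\Delta) \ge \sqrt{w}\,\sqrt{1 - \frac{1}{w^2} \sum_{i,j=1}^{w} \frac{n_{i,j}^2}{\mu_{i,j}^2 + n_{i,j}^2}}$$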
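The proof of the proposition is not reproduced in this section. As a hedged sketch of one standard argument (not necessarily the authors' own), the bound follows from comparing the Frobenius norm of the ideal update with its spectral norm, assuming, as in Section 3.2, that $\Delta$ is formed elementwise and that $\mathbb{E}[g^2] = \mu^2 + n^2$ because $\mathbb{E}[\epsilon] = 0$:

```latex
\begin{align*}
\Delta_{i,j} &= \frac{\mu_{i,j}}{\sqrt{\mu_{i,j}^2 + n_{i,j}^2}}
\quad\Longrightarrow\quad
\|\Delta\|_F^2 = \sum_{i,j=1}^{w} \frac{\mu_{i,j}^2}{\mu_{i,j}^2 + n_{i,j}^2}
= w^2 - \sum_{i,j=1}^{w} \frac{n_{i,j}^2}{\mu_{i,j}^2 + n_{i,j}^2}, \\
\|\Delta\|_F^2 &\le \operatorname{rank}(\Delta)\,\sigma(\Delta)^2 \le w\,\sigma(\Delta)^2
\quad\Longrightarrow\quad
\sigma(\Delta) \ge \frac{\|\Delta\|_F}{\sqrt{w}}
= \sqrt{w}\,\sqrt{1 - \frac{1}{w^2}\sum_{i,j=1}^{w} \frac{n_{i,j}^2}{\mu_{i,j}^2 + n_{i,j}^2}}.
\end{align*}
```

The second line uses only the fact that the squared Frobenius norm is the sum of the squared singular values, each of which is at most $\sigma(\Delta)^2$.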

C.2 REDUCED LEARNING RATE WARMUP

In Chen et al. (2021) the authors noted that the learning rate warmup period needed extending from its typical ImageNet1k default of 10 epochs to 40 epochs, enhancing the stability of the method. We observe that using σReparam, either with or without Layer Norm, we are able to achieve stable SimCLR+ViT training at the original warmup period of 10 epochs (see Figure 6). As with our analysis at the longer warmup period, we also investigate the performance distribution across the trials, giving a sense of how instability impacts the final model (see Figure 6).

Figure 6, panel descriptions. (a) Linear probe performance for the best (solid line) and worst (dashed line) trials of each method, against relevant metrics from the first attention layer (top to bottom): attention entropy, the spectral norm of the attention weights, and the ℓ∞-gradient norm of the attention weights. Our observations are consistent with those of the longer warmup of 40 epochs investigated in Figure 3, except that here, Frozen Patcher is less able to tame early layer gradient norms than it was in the longer warmup (dashed green line). (b) Linear probe performance of every trial. Observations are again consistent with the longer warmup; σReparam with and without Layer Norm are the most stable methods. σReparam (0.01) refers to a σReparam with an initialization scheme of trunc normal(.01) instead of trunc normal(.02), with the former showing some signs of instability. Understanding the source of this instability will be the subject of future work. σReparam + LN uses the default trunc normal(.02).

First, we found that it is better to initialize γ as 1 rather than compute it from the initialized kernel, as the spectral norm can take different values depending on the initialization of the kernel. In the latter case we observed values greater than 1 for the spectral norm, which cause divergence or no training. From a practical standpoint, it is natural to keep γ = 1. We compared different initialization distributions for the kernel (e.g. uniform, normal) and did not see any differences. The only factor that matters is the std of the initialization distribution, which also influences the effective LR. In speech we found that training is robust with respect to changes of std (Figure 5); however, a larger std performs better, with a sweet spot at 0.2-0.3.
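The dependence of the initial spectral norm on the kernel initialization can be illustrated numerically. The sketch below is a rough illustration under stated assumptions: it uses a plain Gaussian initializer rather than a truncated normal, square kernels with hypothetical widths 256 and 768, and the well-known random-matrix estimate std · (√d_out + √d_in) for the largest singular value of a Gaussian matrix.

```python
import numpy as np

def init_spectral_norm(d_out, d_in, std, seed=0):
    """Spectral norm of a freshly initialized Gaussian kernel."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=std, size=(d_out, d_in))
    return float(np.linalg.norm(W, 2))

def gaussian_estimate(d_out, d_in, std):
    """Asymptotic estimate of the largest singular value of a Gaussian matrix."""
    return std * (np.sqrt(d_out) + np.sqrt(d_in))

# Hypothetical layer widths and init scales, for illustration only.
results = {
    (d, std): init_spectral_norm(d, d, std)
    for d in (256, 768)
    for std in (0.01, 0.02)
}
estimates = {key: gaussian_estimate(key[0], key[0], key[1]) for key in results}
```

In this toy setting the initial spectral norm grows with both the std and the layer width, and for a 768-wide kernel with std 0.02 it already exceeds 1. A γ computed from the initialized kernel would therefore differ from layer to layer, whereas initializing γ to 1 does not depend on these choices.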

D.2 FULL LIBRISPEECH EXPERIMENTS

We also evaluate σReparam on large scale data in the speech domain: we now take the whole of LibriSpeech as the training data. We again consider the Adagrad optimizer with two learning rate schedules: cosine (with 1 phase of 500k iterations) and step-wise decay, as before for the train-clean-100 experiments. We use exactly the same architecture and hyperparameters as in Table 9, except for dropout and layer drop, which are decreased to 0.1 to reduce model regularization. For all models we tune only the learning rate. The spectral reparametrization of keys and queries is done separately from that of values, and the learning rate on γ is set to twice the main learning rate. As for train-clean-100, our experiments (see Tables 7 and 8) show that σReparam accompanied by post-LN can match the post-LN baseline, while being robust to hyperparameter changes (e.g. it allows larger learning rate values without any issues).



Footnote 1. This would be exact for full batch optimization.
Footnote 2. We observe an 8.2% memory reduction in full FP32 (for a 1:1 comparison) with a batch size of 86 per GPU.
Footnote 3. The token set consists of the 26 English alphabet letters augmented with the apostrophe and a word boundary token.
Footnote 4. For the separate reparametrization for (keys, queries) and values we observe less stable training with LARS and no warmup relative to reparametrizing them together.



Figure 1: The training loss curves of ViT-B on ImageNet, together with the attention entropy for three layers. From top left to bottom right: baseline with default hyper parameters from Touvron et al. (2021); 0.2× learning rate; 2× warmup epochs; 2× learning rate. We see a close correlation between the dip of the attention entropy and the convergence and stability of the training loss.

Figure 2: Test performance, attention entropy, and largest singular value of attention weights of a supervised σReparam ViT-B/16 alongside supervised MAE ViT-B/16 and SN baselines. Best (solid line) and worst (dashed line) trials of each method are presented. The MAE ViT-B/16 presents a more constrained attention entropy in contrast to the DeiT formulation from Figure 1 due to the longer warmup, lower learning rate and stronger weight decay.

Statistics of best and worst trials per method. Stability over 10 trials per method.

Figure 3: Ten trials of SimCLR for each method on ImageNet1k with 40 epochs of learning rate warmup.

We use SpecAugment (Park et al., 2019), activated right at the beginning of training. We use two frequency masks with frequency mask parameter F = 30, and ten time masks with maximum time-mask ratio p = 0.1 and time mask parameter T = 50; time warping is not used.
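The masking policy described above can be sketched in a few lines of NumPy. This is an illustrative approximation, not the authors' pipeline: the function name `spec_augment` is hypothetical, mask widths are assumed to be drawn uniformly from [0, F] and [0, min(T, p · n_frames)], and masked regions are filled with zeros (the mean after per-sequence normalization).

```python
import numpy as np

def spec_augment(feats, rng, n_freq_masks=2, F=30, n_time_masks=10, T=50, p=0.1):
    """Apply frequency and time masks to a (n_frames, n_mels) feature matrix.

    Returns a masked copy; the input is left untouched. No time warping.
    """
    out = feats.copy()
    n_frames, n_mels = out.shape

    # Frequency masks: zero out f consecutive mel channels, f in [0, F].
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, min(F, n_mels) + 1))
        f0 = int(rng.integers(0, n_mels - f + 1))
        out[:, f0:f0 + f] = 0.0

    # Time masks: zero out t consecutive frames, t in [0, min(T, p * n_frames)].
    max_t = min(T, int(p * n_frames))
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[t0:t0 + t, :] = 0.0
    return out

# Stand-in features: 1000 frames (about 10 s at a 10 ms stride), 80 mel channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 80))
augmented = spec_augment(features, rng)
```

With these settings at most 60 of the 80 mel channels and at most 500 frames can be masked in a single utterance; for short utterances the ratio p, rather than T, limits the width of each time mask.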

Figure 5: Robustness of σReparam with respect to different hyperparameters: learning rate (left), warmup (middle) and initialization std value (right).

Stability over 8 trials per method.

Figure 6: Eight trials of SimCLR for each method on ImageNet1k with 10 epochs of learning rate warmup.

Figure 7: Linear probe performance on ImageNet1k at the end of training over 8 trials for each method. Trials are ordered by decreasing performance, with run rank 1 (8) corresponding to the best (worst) trial. Frozen Patcher produces the best individual run, with all other methods marginally lower. σReparam + LN and σReparam are the methods most reliably giving good performance, with Baseline and Frozen Patcher each susceptible to at least one instability type.

Supervised Image Classification on ImageNet1k. The B/L refer to ViT-B/16 and ViT-L/16 variants respectively. SN corresponds to the Spectral Norm baseline without the learnable scalar. Also note that the WN configuration leads to immediate divergence without using Layer Norm, and here we only report the result with WN + LN.

(top) Best SimCLR ImageNet1k trial top-1 linear probing performance when training for 300 epochs. σReparam + LN yields the highest performing run, with Frozen Patcher performing competitively. (bottom) Configuration of the variants used in our stability analysis. The MoCo v3 weight initialization and patch initialization scheme are described in Chen et al. (2021). For full hyperparameters, see Table 6 of Appendix C.1.

Linear probe performance on ImageNet1k at the end of training over 10 trials for each method. Trials are ordered by decreasing performance, with run rank 1 (10) corresponding to the best (worst) trial. Frozen Patcher and σReparam + LN produce the best individual runs, with σReparam marginally lower. σReparam + LN and σReparam are the methods that most reliably give good performance, with Baseline and Frozen Patcher each susceptible to at least one instability type.

The methods with the best performing individual runs are Frozen Patcher and σReparam + LN, whereas the most stable methods are σReparam + LN and σReparam. Our main stability experiments use 40 epochs of learning rate warmup, matching the setting of Chen et al. (2021). Using σReparam, as in the supervised setting, gives training stability even at the lower learning rate warmup of 10 epochs. For more details, see Appendix C.2.

Comparison between different normalizations and our re-parametrization for speech domain: training loss and word error rate are reported for the best models.



Comparison between different normalizations and our re-parametrization for speech domain when the LARS optimizer is used without warmup: training loss and word error rate are reported for the best models. σReparam re-parametrizes the joint matrix for keys, queries and values in self-attention. DV denotes model divergence: we are not able to train SN with the post-LN configuration.

Default hyperparameters of the variants of SimCLR used in our stability analysis. The MoCo v3 weight initialization and patch initialization scheme are described in Chen et al. (2021). SinCos refers to stacked 2D SinCos positional encodings (Vaswani et al., 2017). The table is divided vertically into hyperparameters that differ across methods (top) and hyperparameters shared across methods (bottom).

Here we outline the hyperparameters of our experimental setup for SimCLR+ViT stability. For the variations, alongside their default hyperparameters, see Table 6. These hyperparameters are used in all SimCLR runs unless stated otherwise.

Comparison between different normalizations and our re-parametrization for speech domain on full LibriSpeech with step-wise LR schedule: word error rates are reported for the best models.

Comparison between different normalizations and our re-parametrization for speech domain on full LibriSpeech with cosine LR schedule: word error rates are reported for the best models.

ABLATIONS ON SEPARATE σREPARAM FOR KEY, QUERIES AND VALUES

We found that joint and separate re-parametrization end up behaving more or less similarly, while separate normalization reaches a lower training loss thanks to its larger capacity, which gives it potential to scale. However, when training with LARS, joint re-parametrization is preferable: it gives stable training and results comparable to those of adaptive optimizers, see Section 4.3.2.

D.4 HYPERPARAMETERS

We present hyperparameters for our speech experiments in Table 9 and for speech experiments with LARS in Table 10.


Proof. We have that:n 2 i,j

B IMPLEMENTATION OF σREPARAM

To compute the spectral norm of the current matrix, we approximate it with the power method, which speeds up computation. See Algorithm 1 for a sketch implementation.

Algorithm 1: Pseudo code of σReparam in a PyTorch-like style.

    # parameters. W: weight matrix, shape (d, c); gamma: the learned spectral norm, shape (1,)
    # buffers. u: shape (d,), v: shape (c,), the left and right singular vectors of W
    if init:  # initialize u, v as random unit vectors and gamma to 1
    # update u, v with the power method
    sigma = einsum('d,dc,c->', u, W, v)
    W_hat = gamma / sigma * W  # the effective spectral norm of W_hat would be gamma
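
The power-method estimate that Algorithm 1 relies on can be sketched in runnable form. The following NumPy version is a minimal illustration under stated assumptions, not the authors' code: the class name `SigmaReparam`, the explicit `power_step` method, and the choice of when to call it (for example, once per training step) are all hypothetical. Gradient handling is omitted.

```python
import numpy as np

class SigmaReparam:
    """NumPy sketch of W_hat = gamma / sigma * W with a power-method sigma."""

    def __init__(self, W, rng):
        self.W = W  # weight matrix, shape (d, c)
        d, c = W.shape
        # u, v start as random unit vectors; gamma starts at 1.
        self.u = rng.standard_normal(d)
        self.u /= np.linalg.norm(self.u)
        self.v = rng.standard_normal(c)
        self.v /= np.linalg.norm(self.v)
        self.gamma = 1.0

    def power_step(self):
        # One power-method update of the left/right singular vector estimates.
        self.u = self.W @ self.v
        self.u /= np.linalg.norm(self.u)
        self.v = self.W.T @ self.u
        self.v /= np.linalg.norm(self.v)

    def effective_weight(self):
        # sigma = u^T W v approximates the largest singular value of W.
        sigma = np.einsum('d,dc,c->', self.u, self.W, self.v)
        return self.gamma / sigma * self.W
```

Once `u` and `v` have converged, the largest singular value of `effective_weight()` equals `gamma`, so the spectral norm of the layer is controlled by a single learned scalar regardless of the shape of `W`. At least one call to `power_step` is needed before the estimate is meaningful, because the random initial vectors can give a small or negative `sigma`.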

