DEEP TRANSFORMERS WITHOUT SHORTCUTS: MODIFYING SELF-ATTENTION FOR FAITHFUL SIGNAL PROPAGATION

Abstract

Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation layers). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address several intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.

1. INTRODUCTION

Despite numerous impressive successes, the practice of training deep neural networks (DNNs) has progressed to a large extent independently of theoretical justification. Most successful modern DNN architectures rely on particular arrangements of skip connections and normalisation layers, but a general principle for how to use these components in new architectures (assuming they are even applicable) remains unknown, and their roles in existing ones are still not completely understood. The residual architecture, arguably the most popular and successful of these, was first developed in the context of convolutional networks (CNNs) (He et al., 2016), and later in self-attention networks, yielding the ubiquitous transformer architecture (Vaswani et al., 2017). One proposed explanation for the success of residual architectures is that they have superior signal propagation compared to vanilla DNNs (e.g. Balduzzi et al., 2017; Xiao et al., 2018; Hayou et al., 2019; De & Smith, 2020; Martens et al., 2021), where signal propagation refers to the transmission of geometric information through the layers of a DNN, as represented by a kernel function (Daniely et al., 2016; Poole et al., 2016; Schoenholz et al., 2017). Recently, using signal propagation principles to train DNNs at high depths, without the skip connections and/or normalisation layers found in residual architectures, has become an area of interest in the community. The reasons are two-fold. First, it would validate the signal propagation hypothesis for the effectiveness of residual architectures, thus clarifying our understanding of DNN trainability. Second, it could lead to general principles and techniques for achieving trainability in DNNs beyond the residual paradigm, with the potential for improved or more efficient architectures. For CNNs, Xiao et al. (2018) showed that improved signal propagation from better initialisation enables very deep vanilla networks to be effectively trained, although at significantly reduced speeds compared to residual networks. Martens et al. (2021) later proposed Deep Kernel Shaping (DKS), which uses activation function transformations to control signal propagation, achieving training speed parity between vanilla and residual networks on ImageNet assuming the use of strong 2nd-order optimizers like K-FAC (Martens & Grosse, 2015). Zhang et al. (2022) extended ideas from DKS to a larger class of activation functions, achieving near parity in terms of generalisation as well. The key quantity that is analysed in signal propagation is the DNN's initialisation-time kernel, or more precisely, the approximate kernel given by the infinite width limit (Neal, 2012; Matthews et al., 2018; Lee et al., 2018; Yang, 2019). For MLPs, and for CNNs that use a Delta-initialisation (Balduzzi et al., 2017; Xiao et al., 2018), this kernel can be written as a simple recursion over layers that involves only 2D functions, facilitating a straightforward analysis.
Figure 1: Normalised kernel matrices $\mathrm{diag}(\Sigma_l)^{-1/2} \cdot \Sigma_l \cdot \mathrm{diag}(\Sigma_l)^{-1/2}$ (which are like kernel matrices except with cosine similarities instead of inner-products) at various depths for standard attention-only vanilla transformers and two of our proposed alternatives (Section 3). Standard attention-only vanilla transformers (top) quickly suffer from rank collapse where all entries of the normalised kernel converge to 1, whereas our approaches, U-SPA and E-SPA, maintain controlled signal propagation even at large depths. Moreover, our main method E-SPA (bottom) exhibits a recency bias, where cosine similarities corresponding to nearby pairs of locations are larger, akin to positional encoding. Equivalent plots for attention-only transformers with skips and normalisation can be found in Fig. 7.
Unfortunately, the evolution of the kernel across layers of a transformer is more complicated, and as a result, existing approaches like DKS are not applicable to transformers (or indeed any architecture that contains self-attention layers). More concretely, if $X_l \in \mathbb{R}^{T \times d}$ denotes a length-$T$ sequence of activations at layer $l$ of a transformer, then the kernel matrix $\Sigma_l = X_l X_l^\top / d \in \mathbb{R}^{T \times T}$ for layer $l$ (or more precisely its limit as $d \to \infty$) can be written as a function of the kernel matrix $\Sigma_{l-1}$ of the previous layer (Hron et al., 2020). In the case of self-attention layers, the dependence of $\Sigma_l$ on $\Sigma_{l-1}$ cannot be simplified or decomposed into lower dimensional functions, leading to a recursion that is intrinsically high dimensional and harder to analyse or control. Analogously to the case of MLPs, where signal propagation is judged by looking at the behaviour of the (one-dimensional) kernel, signal propagation in transformers can be judged by looking at the evolution of these (high-dimensional) kernel matrices through the layers of the network. One situation we must avoid is where the diagonal entries rapidly grow or shrink with depth, which corresponds to uncontrolled activation norms and can lead to saturated losses or numerical issues. A more subtle form of signal degradation can occur where $\Sigma_l$ converges to a rank-1 matrix, which is known as rank collapse (Dong et al., 2021). Dong et al. (2021) showed that skip connections are essential to avoid the collapsed state: skipless transformers quickly converge to rank collapse at large depths, which we corroborate in Fig. 1 (top). Moreover, Noci et al. (2022) showed that rank collapse may lead to zero gradients for certain parameters in attention layers, hindering the trainability of deep transformers. Thus, avoiding rank collapse is necessary for deep transformers to be trainable, and the question of whether one can train deep skipless transformers remains open.
In the present work we address this question, demonstrating for the first time that it is possible to successfully train deep transformers without skip connections or normalisation layers. To do so, we study the problem of signal propagation and rank collapse in deep skipless transformers, and derive three approaches to prevent it in Section 3. Our methods use combinations of: 1) parameter initialisations, 2) bias matrices, and 3) location-dependent rescaling, and highlight several intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In Section 4, we empirically demonstrate that our approaches result in trainable deep skipless transformers. On WikiText-103 and C4 datasets we show that using our main approach, Exponential Signal Preserving Attention (E-SPA), it is possible to match the training loss of standard transformers with our skipless ones by training for around 5 times longer. Moreover, by combining this approach with skip connections, we show that transformers without normalisation layers are able to match the training speed of standard ones.

2. PROBLEM SETTING

Transformer models

The input to a transformer consists of a sequence $x = (x_i)_{i=1}^{T}$ over $T$ locations consisting of tokens from a vocabulary $V$: $x_i \in \{1, \dots, |V|\}$. The model takes this sequence, and using a trainable embedding matrix $E \in \mathbb{R}^{|V| \times d}$, creates a matrix of vector representations $X_0 \in \mathbb{R}^{T \times d}$ by performing a direct look-up for each location: $[X_0]_i = E_{x_i} \in \mathbb{R}^d$. After this, the sequence is successively transformed via a series of $L$ "transformer blocks", with independently initialised parameters. We will denote by $X_l \in \mathbb{R}^{T \times d}$ the output sequence for block $l$, and will sometimes refer to the rows of $X_l$ as representation vectors. Each transformer block consists of two component blocks that both employ a standard residual structure (He et al., 2016), computing the sum of a residual branch which performs the main computation, and a shortcut branch (aka a skip connection) which just copies the block's inputs to its output. The attention block, which is the first of these component blocks, applies layer normalisation (LN) (Ba et al., 2016) or RMSNorm (Zhang & Sennrich, 2019), followed by multi-head attention (MHA), on its residual branch. The MLP block, which comes next, applies LN followed by a standard (typically shallow) MLP on its residual branch. The LN and MLP operations are applied to each sequence element independently (using shared parameters), so that information is only communicated between sequence elements via the MHA operation (which we will define later). In summary we have
$$\hat{X}_l = \alpha X_{l-1} + \beta\, \mathrm{MHA}(\mathrm{RMSNorm}(X_{l-1})) \quad \text{and} \quad X_l = \alpha \hat{X}_l + \beta\, \mathrm{MLP}(\mathrm{RMSNorm}(\hat{X}_l)), \tag{1}$$
where the shortcut and residual weights $\alpha, \beta$ are typically both 1. In this work we will focus on skipless transformers with $\alpha = 0$ and $\beta = 1$, and on vanilla transformers, which are skipless transformers without normalisation layers. For simplicity we will also devote much of our analysis to attention-only models, which are transformers without MLP blocks (so that $\hat{X}_l = X_l$).
Note that we restrict our analysis to decoder-only transformers in this work, as they have a simpler structure which is easier to analyse, and are widely used in practice. Also note that Eq. (1) corresponds to the "Pre-LN" (Baevski & Auli, 2018; Child et al., 2019) rather than the original "Post-LN" transformer (Wang et al., 2019). In Post-LN transformers, the normalisation operation is applied at the output of each MLP and attention block instead of at the beginning of each residual branch.

Self-attention

Given an input sequence $X \in \mathbb{R}^{T \times d}$, the self-attention mechanism computes
$$\mathrm{Attn}(X) = A(X)V(X), \quad \text{with} \quad A(X) = \mathrm{softmax}\Big(\tfrac{1}{\sqrt{d_k}}\, Q(X)K(X)^\top\Big), \tag{2}$$
where the softmax function is applied row-wise. $Q(X) = XW^Q$, $K(X) = XW^K$ and $V(X) = XW^V$ denote the queries, keys and values respectively, with trainable parameters $W^Q, W^K \in \mathbb{R}^{d \times d_k}$ and $W^V \in \mathbb{R}^{d \times d_v}$. In practice, the attention mechanism, Eq. (2), is applied over $h$ "heads" (with independent parameters), giving rise to so-called multi-head attention:
$$\mathrm{MHA}(X) \triangleq \mathrm{Concat}\big(\mathrm{Attn}_1(X), \dots, \mathrm{Attn}_h(X)\big)\, W^O, \tag{3}$$
where $W^O \in \mathbb{R}^{h d_v \times d}$ are trainable parameters and usually $d_k = d_v = \frac{d}{h}$. In this case, we can define $W^V = \mathrm{Concat}(W^V_1, \dots, W^V_h) \in \mathbb{R}^{d \times d}$, where $W^V_n$ denotes the value parameters for head $n$. We focus our investigation on models that perform next-token prediction: at location $i-1$ the model outputs a prediction over the identity of the $i$-th target token, but using only information from input tokens 1 through $i-1$. This corresponds to using a causal masked attention with mask $M \in \mathbb{R}^{T \times T}$ satisfying $M_{i,j} = \mathbb{1}\{i \ge j\}$, where the attention matrix $A$ in Eq. (2) is replaced with
$$A(X) = \mathrm{softmax}\Big(M \circ \tfrac{1}{\sqrt{d_k}}\, Q(X)K(X)^\top - \Gamma(1 - M)\Big), \tag{4}$$
where $\circ$ denotes the elementwise product and $\Gamma$ is a large positive constant that zeros the attention coefficients corresponding to future tokens, making $A$ a lower triangular matrix.
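The masked attention of Eqs. (2) and (4) can be illustrated with a minimal single-head NumPy sketch. The helper names (`softmax_rows`, `causal_attention`), the value `gamma=1e9` standing in for $\Gamma$, and the Gaussian weight scale are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def softmax_rows(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, W_q, W_k, W_v, gamma=1e9):
    # Single-head causal attention following Eq. (4).
    # gamma plays the role of the large constant Gamma (assumed value).
    T = X.shape[0]
    d_k = W_q.shape[1]
    M = np.tril(np.ones((T, T)))                      # M[i, j] = 1 if i >= j
    logits = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)
    A = softmax_rows(M * logits - gamma * (1.0 - M))  # future tokens get weight 0
    return A @ (X @ W_v), A

rng = np.random.default_rng(0)
T, d, d_k = 5, 8, 4
X = rng.normal(size=(T, d))
# Illustrative Gaussian initialisation (an assumption, not the paper's scheme).
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))
out, A = causal_attention(X, W_q, W_k, W_v)
```

Here `A` is lower triangular with rows summing to 1, so the first location can only attend to itself.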

Signal propagation in transformers

As discussed in Section 1, Dong et al. (2021) showed that deep skipless transformers suffer from rank collapse, where the kernel matrix converges in depth to have rank 1, and Noci et al. (2022) showed that rank collapse can prevent trainability. Moreover, Noci et al. (2022) demonstrated that rank collapse in transformers can occur in the absence of normalisation layers even with skip connections, and that downscaling the residual branch by setting $\beta = \frac{1}{\sqrt{L}}$ can alleviate this issue. This latter observation is in line with previous findings concerning the benefits of downscaling residual weights in ResNets (Hanin & Rolnick, 2018; Zhang et al., 2018; Arpit et al., 2019; Hayou et al., 2021; Bachlechner et al., 2021) and transformers (Zhang et al., 2019; Xu et al., 2020; Huang et al., 2020; Touvron et al., 2021; Wang et al., 2022). Davis et al. (2021) showed that concatenation acts similarly to a downweighted skip as an alternative way to connect skip and residual branches. De & Smith (2020) noted that the interaction of standard skip connections and normalisations can also effectively downweight the residual branch to give better signal propagation properties, but only if the normalisation layer is placed on the residual branch, as in Pre-LN transformers. Such an effect does not occur for Post-LN transformers, where the normalisation layer is after the residual branch, and we indeed observe in Fig. 7 that Post-LN attention-only transformers also suffer from rank collapse at large depths. This may explain some of the training instabilities of Post-LN transformers that have been observed in practice (Xiong et al., 2020; Liu et al., 2020).
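Rank collapse can be reproduced in a toy setting. The sketch below is a simplification, not the paper's experiment: it assumes zero query-key logits, so that every masked attention matrix is uniform over past locations, and an identity input kernel. It then tracks the smallest entry of the normalised kernel as depth grows. All function names are hypothetical.

```python
import numpy as np

def normalised_kernel(Sigma):
    # Convert a kernel matrix into a matrix of cosine similarities.
    s = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(s, s)

def uniform_causal_attention(T):
    # Masked softmax with all-zero logits: row i is uniform over j <= i.
    M = np.tril(np.ones((T, T)))
    return M / M.sum(axis=1, keepdims=True)

def min_cosine_by_depth(T, depths):
    # With Sigma_0 = I, the depth-l kernel is Pi_l Pi_l^T, Pi_l = A^l.
    A = uniform_causal_attention(T)
    Pi = np.eye(T)
    out = {}
    for l in range(1, max(depths) + 1):
        Pi = A @ Pi
        if l in depths:
            out[l] = normalised_kernel(Pi @ Pi.T).min()
    return out

result = min_cosine_by_depth(T=8, depths=(1, 5, 30))
```

In this toy model the smallest cosine similarity rises towards 1 with depth, meaning every location's representation aligns with that of the first token, which is the rank-1 collapsed state.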

3. CONSTRUCTING TRAINABLE DEEP TRANSFORMERS WITHOUT SHORTCUTS

To date, the only strategy for rectifying rank collapse in transformers relies on skip/shortcut connections, which "skip" around the trainability issues intrinsic to self-attention layers. We seek instead to tackle this issue directly. To do so, we first develop a better understanding of signal propagation through attention layers, then derive modifications from our insights to achieve faithful signal propagation in deep transformers, allowing them to be trained regardless of the use of skip connections.
To start with, we consider the simplified setting of a deep attention-only vanilla transformer, and suppose we are in a single-head setting ($h = 1$) or a multi-head setting where the attention matrix $A$ does not vary across heads. If block $l \le L$ has attention matrix $A_l$ at initialisation, then the final block's representation $X_L$ takes the following form:
$$X_L = [A_L A_{L-1} \cdots A_1]\, X_0 W, \tag{5}$$
where $W = \prod_{l=1}^{L} W^V_l W^O_l \in \mathbb{R}^{d \times d}$ can be made to be orthogonal at initialisation (so that $W^\top W = I_d$), if each $W^V_l$ and $W^O_l$ are orthogonally initialised (assuming $d_v = \frac{d}{h}$). Going forward we will assume such an orthogonal initialisation, providing an ablation study in Fig. 9. In that case, if we denote by $\Sigma_0 = X_0 X_0^\top \in \mathbb{R}^{T \times T}$ the input kernel matrix across locations, and $\Pi_l = A_l A_{l-1} \cdots A_1$ to be the product of attention matrices up to the $l$-th block, then the locationwise kernel matrix $\Sigma_l = X_l X_l^\top \in \mathbb{R}^{T \times T}$ at block $l$ simplifies to
$$\Sigma_l = \Pi_l \cdot \Sigma_0 \cdot \Pi_l^\top. \tag{6}$$
From this simplified formula for kernel matrices in deep attention-only transformers, we identify three requirements on $(A_l)_l$:
(i) $\Sigma_l = \Pi_l \cdot \Sigma_0 \cdot \Pi_l^\top$ must be well-behaved at each block, avoiding degenerate situations such as rank collapse and exploding/vanishing diagonal values.
(ii) $A_l$ must be elementwise non-negative $\forall l$ (recalling that $A_l$ is constructed through the softmax operation, Eq. (4)).
(iii) $A_l$ should be lower triangular $\forall l$, for compatibility with causal masked attention.
In Sections 3.1 and 3.2, we focus on finding attention matrices that satisfy our desiderata above, and in Section 3.3 we demonstrate how to modify softmax attention to achieve these attention matrices.
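Equation (6) can be checked numerically. The following sketch assumes fixed, data-independent attention matrices (random lower-triangular row-stochastic matrices chosen for illustration) and orthogonal value and output projections drawn via QR decomposition. The helper names are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    # Q factor of a Gaussian matrix is orthogonal (sign fix omitted;
    # it only matters for exact Haar uniformity).
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

def random_causal_attention(T, rng):
    # Illustrative lower-triangular, non-negative, row-stochastic matrix.
    W = np.tril(rng.uniform(0.1, 1.0, size=(T, T)))
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, d, L = 6, 16, 4
X0 = rng.normal(size=(T, d))
Sigma0 = X0 @ X0.T

X, Pi = X0, np.eye(T)
for _ in range(L):
    A = random_causal_attention(T, rng)
    W = random_orthogonal(d, rng) @ random_orthogonal(d, rng)  # W^V_l W^O_l
    X = A @ X @ W        # attention-only vanilla block
    Pi = A @ Pi          # running product A_l ... A_1

Sigma_L = X @ X.T
max_err = np.abs(Sigma_L - Pi @ Sigma0 @ Pi.T).max()
```

Because each $W^V_l W^O_l$ is orthogonal, the weights cancel in $X_L X_L^\top$ and `max_err` is at the level of floating-point round-off.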

3.1. IDENTITY ATTENTION: ISSUES AND VALUE-SKIPINIT

An obvious solution to our above requirements on $(A_l)_l$ is the trivial one: $A_l = I_T$ $\forall l$, where each sequence location attends only to itself. In this case, $\Sigma_L = \Sigma_0$ perfectly preserves the input kernel matrix, and will avoid rank collapse assuming $\Sigma_0$ is non-degenerate. Unfortunately, identity attention is not by itself a viable route to trainable vanilla transformers. This is because to achieve $A_l = I$ we would need to saturate the softmax operation (Eq. (2)), so that gradients do not pass to the query and key parameters, and the attention matrix stays close to identity during training. To provide a partial solution to this that achieves an identity attention matrix at initialisation yet is still trainable, we introduce our first approach, Value-SkipInit, based on SkipInit (De & Smith, 2020) and the related ReZero method (Bachlechner et al., 2021). In Value-SkipInit, we modify the attention operation $\mathrm{Attn}(X) = A(X)V(X)$ to
$$\mathrm{Attn}(X) = \big(\alpha I + \beta A(X)\big) \cdot V(X), \tag{7}$$
with trainable parameters $\alpha$ and $\beta$ that are initialised to 1 and 0 respectively. Thus, at initialisation, the attention matrix is the identity. For transformers with MLP blocks, this yields identical behaviour to a standard MLP acting on each sequence location independently at initialisation, so we can apply the DKS or TAT frameworks (Martens et al., 2021; Zhang et al., 2022) to achieve well-behaved signal propagation in the entire model. We note that Value-SkipInit is an approach to remove the standard skip connections in transformer blocks, Eq. (1), but can be construed as adding a skip connection around the value computation, and in that respect isn't strictly "skipless". Moreover, there is useful information contained in the positions of sequence locations that we do not employ when all attention matrices are identity at initialisation. As a result, we treat Value-SkipInit as a baseline for our main two methods, which we describe next.
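Equation (7) amounts to a one-line change to the attention output. The sketch below is illustrative only: it takes the attention matrix `A` as given (here a uniform causal matrix chosen as a stand-in) and treats `alpha` and `beta` as plain floats rather than trainable parameters.

```python
import numpy as np

def value_skipinit_attention(X, A, W_v, alpha=1.0, beta=0.0):
    # Eq. (7): Attn(X) = (alpha * I + beta * A) V(X).
    # alpha=1, beta=0 are the initial values stated in the text.
    T = X.shape[0]
    return (alpha * np.eye(T) + beta * A) @ (X @ W_v)

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
# Stand-in attention matrix: uniform over past locations (an assumption).
M = np.tril(np.ones((T, T)))
A = M / M.sum(axis=1, keepdims=True)

out_init = value_skipinit_attention(X, A, W_v)            # identity attention
out_std = value_skipinit_attention(X, A, W_v, 0.0, 1.0)   # standard attention
```

At initialisation the output equals `X @ W_v`, so each location is processed independently, while gradients with respect to `beta` can still bring the attention matrix into play during training.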

3.2. SIGNAL PRESERVING ATTENTION METHODS

Returning to our requirements on $(A_l)_l$ in Section 3, we see that controlling their product is key to achieving faithful signal propagation. Given that we are now interested in non-identity $(A_l)_l$, this becomes more difficult. To overcome this, we consider $A_l$ of the form $A_l = L_l L_{l-1}^{-1}$, such that individual $L_i$'s cancel in the product, giving $\Pi_l = L_l L_0^{-1}$. Then if $L_0$ satisfies $L_0^{-1} \Sigma_0 L_0^{-\top} = I_T$, we have
$$\Sigma_l = L_l L_l^\top. \tag{9}$$
Assuming input embeddings are initialised independently, and no repeated tokens in the input sequence, we have $\Sigma_0 = I_T$ in the large width limit, and can thus take $L_0 = I_T$. In practice, we will apply slight modifications to our methods to account for repeated tokens, as detailed in Appendix B. Now if $L_l$ is lower triangular, then from Eq. (9) we see it is simply a Cholesky factor for the kernel matrix $\Sigma_l$. By the uniqueness of Cholesky factors for PSD matrices up to sign, this means we just need to choose $\{\Sigma_l\}_{l=0}^{L}$ to be a family of well-behaved kernel matrices that satisfy our non-negativity constraint (ii) on $A_l = L_l L_{l-1}^{-1}$. We identify two such families, which give rise to our main method, Signal Preserving Attention (SPA):
1. Uniform (U-SPA): $\Sigma_l(\rho_l) = (1 - \rho_l) I_T + \rho_l \mathbf{1}\mathbf{1}^\top$. Here, the kernel matrix $\Sigma_l(\rho_l)$ has diagonal entries equal to 1 and off-diagonal entries equal to $\rho_l$. The condition $\rho_l \le \rho_{l+1}$ is required for elementwise non-negativity of the $A$'s, as shown in Theorem 1. Setting $\rho_0 = 0$ yields the identity input kernel matrix, and as long as $\rho_L < 1$, we avoid rank collapse in the skipless attention-only setting.
2. Exponential (E-SPA): $\Sigma_l(\gamma_l)_{i,j} = \exp(-\gamma_l |i - j|)$. Here, diagonal entries are again 1, but now off-diagonals decay exponentially for more distant locations, with decay rate $\gamma_l$. Thus, unlike U-SPA, E-SPA captures the notion of positional encoding, as the vector representations for nearby locations have larger inner products (i.e. are more similar to each other). The condition $\gamma_l \ge \gamma_{l+1}$ is required for elementwise non-negativity of the $A$'s, as established in Theorem 2. Setting $\gamma_0 = \infty$ yields the identity input kernel matrix, and rank collapse will be prevented as long as $\gamma_L > 0$.
We state and prove Theorems 1 and 2 in Appendix I, and in particular provide a closed-form solution for $L_l L_{l-1}^{-1}$ in the case of E-SPA in Theorem 3, enabling cheap computation. From these theorems, we see that our proposed SPA approaches are viable as long as the kernel matrix values become progressively larger (in an elementwise fashion) as depth increases, as dictated by $(\rho_l)_{l=1}^{L}$ and $(\gamma_l)_{l=1}^{L}$. We find $\rho_L = 0.8$ and $\gamma_L = 0.005$ to be good default choices for the final block, and describe how to vary $\rho_l$ and $\gamma_l$ across depth in Appendix C. In Alg. 1 in Appendix D, we summarise how to construct a trainable self-attention layer with our main method E-SPA using ideas from Section 3.3, given an input-output decay rate pair $\gamma_{\mathrm{in}}, \gamma_{\mathrm{out}}$ (in the notation of Alg. 1). Given a decreasing set of decay rates $(\gamma_l)_{l=0}^{L}$, at block $l$ we set $\gamma_{\mathrm{in}} = \gamma_{l-1}$ and $\gamma_{\mathrm{out}} = \gamma_l$.
In Fig. 1, we verify that our two proposed SPA schemes, U-SPA and E-SPA, successfully avoid rank collapse in attention-only vanilla transformers even at large depths. Moreover, because $\Sigma_l$ has diagonals equal to 1 for all $l$, there is an implicit mechanism in these two schemes to control the representation vector norms across all sequence locations at deep layers. This means that they can be used with or without normalisation layers, as we will verify empirically in Section 4. Moreover, in Fig. 1 we see that E-SPA exhibits a recency bias as expected, where representation vectors for nearby locations have larger cosine similarity, as seen with positional encoding schemes like ALiBi (Press et al., 2022) (Fig. 7). As a result, even though all three of our approaches successfully avoid rank collapse, we expect E-SPA to outperform both U-SPA and Value-SkipInit.
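The Cholesky construction above can be sketched in NumPy. This is an illustration under stated assumptions rather than the paper's Alg. 1: it uses generic dense Cholesky factorisations instead of the closed form of Theorem 3, and the depth schedules (linear in $\rho$, geometric in $\gamma$) are hypothetical stand-ins for those of Appendix C. Only the end values $\rho_L = 0.8$ and $\gamma_L = 0.005$ come from the text.

```python
import numpy as np

def uspa_kernel(T, rho):
    # Uniform kernel: ones on the diagonal, rho elsewhere.
    return (1.0 - rho) * np.eye(T) + rho * np.ones((T, T))

def espa_kernel(T, gamma):
    # Exponential kernel: entries exp(-gamma * |i - j|).
    idx = np.arange(T)
    return np.exp(-gamma * np.abs(idx[:, None] - idx[None, :]))

def spa_attention(Sigma_in, Sigma_out):
    # A = L_out L_in^{-1}, computed by solving L_in^T A^T = L_out^T.
    L_in = np.linalg.cholesky(Sigma_in)
    L_out = np.linalg.cholesky(Sigma_out)
    return np.linalg.solve(L_in.T, L_out.T).T

def spa_attention_stack(kernels):
    return [spa_attention(kernels[l - 1], kernels[l])
            for l in range(1, len(kernels))]

def propagate(attn, Sigma0):
    # Apply Sigma_l = A_l Sigma_{l-1} A_l^T block by block.
    Sigma = Sigma0
    for A in attn:
        Sigma = A @ Sigma @ A.T
    return Sigma

T, L = 8, 4
rhos = np.linspace(0.0, 0.8, L + 1)      # hypothetical schedule, rho_0 = 0
gammas = np.geomspace(1.0, 0.005, L)     # hypothetical decreasing schedule
u_kernels = [uspa_kernel(T, r) for r in rhos]
e_kernels = [np.eye(T)] + [espa_kernel(T, g) for g in gammas]  # gamma_0 = inf
u_attn = spa_attention_stack(u_kernels)
e_attn = spa_attention_stack(e_kernels)
```

With $\Sigma_0 = I_T$, calling `propagate` on either stack recovers the final target kernel, and each $A_l$ is lower triangular and non-negative up to round-off, in line with requirements (i) to (iii).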

3.3. REVERSE ENGINEERING SELF-ATTENTION LAYERS AT INITIALISATION

In Section 3.2 we identified two families of lower-triangular non-negative attention matrices $(A_l)_l$ which enable us to obtain well-behaved kernel matrices $\Sigma_L$ at large depth $L$, and hence faithful signal propagation. It remains to show how we can actually realise these attention matrices via parameter initialisation and minor modifications of the attention mechanism. More precisely, we will show how for any given lower-triangular $A \in \mathbb{R}^{T \times T}$ with non-negative entries, the masked softmax self-attention layer, Eq. (4), can be initialised and augmented such that its output is exactly $\mathrm{Attn}(X) = AV(X)$ at initialisation. To do so, we first define matrices $D, P \in \mathbb{R}^{T \times T}$ such that $A = DP$, where $D$ is diagonal with positive entries and $P$ is lower triangular with row sums 1. Then if $B = \log(P)$ (see footnote 6) and $M$ is the causal mask $M_{i,j} = \mathbb{1}\{i \ge j\}$, we set
$$\mathrm{Attn}(X) = D\,P(X)\,V(X), \quad P(X) = \mathrm{softmax}\Big(M \circ \Big[\tfrac{1}{\sqrt{d_k}}\, Q(X)K(X)^\top + B\Big] - \Gamma(1 - M)\Big), \tag{10}$$
which reduces to $\mathrm{Attn}(X) = AV(X)$ (with a data-independent attention matrix) at initialisation if the query-key dot product $\frac{1}{\sqrt{d_k}} Q(X)K(X)^\top = 0$. We note that zero query-key dot products occur if one uses a $\frac{1}{d_k}$ scaling rather than $\frac{1}{\sqrt{d_k}}$ in the infinite width limit for independently initialised $W^Q, W^K$ (Yang, 2019). In practice, to achieve zero initial query-key dot product, we can initialise either: 1) $W^Q = 0$, 2) $W^K = 0$, or 3) both $W^Q$ and $W^K$ to have small initial scale (which achieves approximately zero dot product). In our experiments we found these three options to all perform similarly, and decided to use option 1): $W^Q = 0$. An empirical evaluation of the sensitivity to the choice of initial scale in option 3) is provided in Fig. 8.
In Eq. (10), $D$ acts as a non-trainable location-dependent rescaling, and is needed to realise arbitrary attention matrices since the softmax output $P(X)$ is constrained to have row sums equal to 1 (see footnote 7). Given target kernel matrices with 1's on the diagonal (which we have in SPA), its inclusion is akin to using RMSNorm at initialisation. However, this will gradually stop being true during training, leading to a (slightly) different model class. The additive pre-softmax biases $B$ will also have an effect on the model class, but this will be similar to that of the ALiBi positional encoder (Press et al., 2022), which too involves a non-trainable bias matrix being added to the logits. In our early experiments we tried including a trainable gain parameter on the bias matrix, initialised to 1, in order to better preserve the model class, but found that this didn't have a significant impact on training performance.
We also note that this approach to controlling the attention matrix by zeroing the query-key dot product at initialisation is compatible with popular positional encodings like Relative (Shaw et al., 2018; Huang et al., 2019; Dai et al., 2019) and RoPE (Su et al., 2021), as discussed in Appendix E. Unless stated otherwise, we use RoPE in our experiments, and provide an ablation in Fig. 10.
Footnote 6: We take $\log(0) = -\infty$ or a large negative constant, e.g. $-10^{30}$, in practice.
Footnote 7: In SPA, $D$ also corrects for the fact that masked softmax attention tends to reduce the norms of representation vectors for locations at the end of the sequence. To see this, if we have kernel matrix $\Sigma = XX^\top / d$ with $\Sigma_{ii} = 1$, $\forall i$, and softmax attention matrix $A$ with row $i$ denoted $a_i$, then $(A \Sigma A^\top)_{ii} = \frac{1}{d} \|a_i X\|_2^2 \le \frac{1}{d} \|X\|_2^2 \|a_i\|_2^2 = \|a_i\|_2^2$. But $a_i$ sums to 1 (as a softmax output), so $\|a_i\|_2 \le 1$ with equality if and only if $a_i$ has exactly one non-zero entry (equal to 1). This holds for the first token (which can only attend to itself in masked attention) so $(A \Sigma A^\top)_{11} = 1$, but not in general for later tokens, so $(A \Sigma A^\top)_{ii}$ will usually be less than 1, for $i > 1$.
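The decomposition $A = DP$ and the bias $B = \log(P)$ from Eq. (10) can be sketched as follows. The function names, the random target matrix `A`, and `gamma=1e9` as a stand-in for $\Gamma$ are illustrative assumptions. The zero query weights correspond to option 1) in the text, and $\log(0)$ is replaced by $-10^{30}$ as in footnote 6.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decompose_attention(A):
    # A = D P with D diagonal (row sums of A) and P row-stochastic.
    d = A.sum(axis=1)
    return d, A / d[:, None]

def bias_from_P(P, neg=-1e30):
    # B = log(P), with log(0) replaced by a large negative constant.
    B = np.full_like(P, neg)
    mask = P > 0
    B[mask] = np.log(P[mask])
    return B

def spa_softmax_attention(X, W_q, W_k, W_v, d_scale, B, gamma=1e9):
    # Eq. (10): D softmax(M o [QK^T / sqrt(d_k) + B] - Gamma (1 - M)) V.
    T = X.shape[0]
    d_k = W_q.shape[1]
    M = np.tril(np.ones((T, T)))
    logits = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)
    P = softmax_rows(M * (logits + B) - gamma * (1.0 - M))
    return d_scale[:, None] * (P @ (X @ W_v))

rng = np.random.default_rng(0)
T, d, d_k = 6, 8, 4
X = rng.normal(size=(T, d))
A = np.tril(rng.uniform(0.1, 1.0, size=(T, T)))   # illustrative target matrix
d_scale, P = decompose_attention(A)
B = bias_from_P(P)
W_q = np.zeros((d, d_k))                          # option 1): W^Q = 0
W_k = rng.normal(size=(d, d_k)) / np.sqrt(d)
W_v = rng.normal(size=(d, d_k)) / np.sqrt(d)
out = spa_softmax_attention(X, W_q, W_k, W_v, d_scale, B)
```

With zero query weights the logits reduce to `B`, the softmax returns `P` exactly, and the layer output equals `A @ (X @ W_v)` regardless of the input `X`.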

3.4. ADDRESSING MLP BLOCKS AND SKIP CONNECTIONS

Up until this point we have restricted our focus to attention-only skipless transformers, for the sake of simplicity. However, MLP blocks are an important part of the transformer architecture that we must also address. And we would like for our approaches to be compatible with skip connections too, both for the sake of generality, and because they might combine profitably.

MLP blocks

Because MLP blocks operate on the representation vectors independently across locations, their effect on the kernel matrix can be easily computed using the standard limiting kernel formulas for MLPs (Neal, 2012), which are exact in the infinite width limit. In particular, there is a known $f$ that maps the kernel matrix $\Sigma_l$ of an MLP block's input sequence to the kernel matrix $\hat{\Sigma}_l$ of its output sequence. In principle, one could modify SPA to account for this change to the kernel matrix by taking $A_l = L_l \hat{L}_{l-1}^{-1}$, where $L_l$ and $\hat{L}_l$ are the Cholesky factors of $\Sigma_l$ and $\hat{\Sigma}_l$ respectively. Unfortunately, there is no guarantee that the resulting $A_l$ would be elementwise non-negative in general. In our experiments we thus elected to ignore the effect on the kernel matrices of the MLP blocks, approximating $f$ as the identity function, which we found to work well enough in practice. As discussed in Martens et al. (2021), this approximation becomes better as the MLP blocks tend to linear functions (for which $f$ is exactly identity), which will happen as we increase the shortcut weight $\alpha$ relative to the residual weight $\beta$ (see Eq. (1)), or when the MLP's activation functions are close to linear. Notably, the latter condition will tend to be true when using DKS or TAT to transform the activation functions in deep networks. Note, when using DKS, $f$ decreases the size of off-diagonal elements of the kernel matrix, which intuitively will help to combat rank collapse.

Skip connections

When using skip connections as in Eq. (1), the output kernel matrix of a block is given by $\Sigma_{\mathrm{block}} = \alpha^2 \Sigma_{\mathrm{shortcut}} + \beta^2 \cdot \Sigma_{\mathrm{residual}}$. One of the main ways that kernel matrices degenerate in DNNs is when their diagonal entries either explode or shrink with depth. As shown by Martens et al. (2021), this can be prevented in fully-connected and convolutional DNN architectures by rescaling the output of the activation functions (which happens automatically as part of DKS or TAT), and by replacing weighted sums with "normalised sums", which yields normalised skip connections defined by the condition $\alpha^2 + \beta^2 = 1$. When using normalised skip connections with U-SPA, both $\Sigma_{\mathrm{shortcut}}$ and $\Sigma_{\mathrm{residual}}$ will have the form $\Sigma(\rho) = (1 - \rho) I_T + \rho \mathbf{1}\mathbf{1}^\top$ for some $\rho$, and thus so will $\Sigma_{\mathrm{block}}$. Moreover, we will have $\rho_{\mathrm{block}} = \alpha^2 \rho_{\mathrm{shortcut}} + \beta^2 \cdot \rho_{\mathrm{residual}}$, which is less than $\rho_{\mathrm{residual}}$. This means we can easily adjust U-SPA to be compatible with normalised skip connections by replacing $L_l$ in the formula $A_l = L_l L_{l-1}^{-1}$ with the Cholesky factor of $\Sigma\big((\rho_l - \alpha^2 \rho_{l-1}) / \beta^2\big)$. In the case of E-SPA, we note that $\Sigma_{\mathrm{block}}$ won't be of the form $\Sigma(\gamma)_{i,j} = \exp(-\gamma |i - j|)$ for some $\gamma$, even when both $\Sigma_{\mathrm{shortcut}}$ and $\Sigma_{\mathrm{residual}}$ are. To work around this, we use an approximation described in Appendix F, which seeks to make the combined effect on the kernel matrix of the normalised skip and attention block approximately equal to the effect of a skipless attention block.
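The U-SPA adjustment for normalised skip connections can be checked with a short sketch. It works purely at the level of kernel matrices, taking the formula $\Sigma_{\mathrm{block}} = \alpha^2 \Sigma_{\mathrm{shortcut}} + \beta^2 \Sigma_{\mathrm{residual}}$ from the text as given. The function names and the specific values of $\alpha$, $\rho_{l-1}$ and $\rho_l$ are illustrative assumptions.

```python
import numpy as np

def uspa_kernel(T, rho):
    return (1.0 - rho) * np.eye(T) + rho * np.ones((T, T))

def uspa_skip_attention(T, rho_prev, rho_next, alpha):
    # Target the residual-branch kernel Sigma((rho_l - a^2 rho_{l-1}) / b^2)
    # with a^2 + b^2 = 1, so the block output kernel is Sigma(rho_l).
    beta_sq = 1.0 - alpha ** 2
    rho_res = (rho_next - alpha ** 2 * rho_prev) / beta_sq
    if not (rho_prev <= rho_res < 1.0):
        raise ValueError("residual-branch rho must lie in [rho_prev, 1)")
    L_prev = np.linalg.cholesky(uspa_kernel(T, rho_prev))
    L_res = np.linalg.cholesky(uspa_kernel(T, rho_res))
    A = np.linalg.solve(L_prev.T, L_res.T).T   # A = L_res L_prev^{-1}
    return A, rho_res

T = 6
alpha = np.sqrt(0.5)              # illustrative shortcut weight
rho_prev, rho_next = 0.2, 0.4     # illustrative consecutive targets
A, rho_res = uspa_skip_attention(T, rho_prev, rho_next, alpha)

Sigma_prev = uspa_kernel(T, rho_prev)
Sigma_block = alpha ** 2 * Sigma_prev + (1.0 - alpha ** 2) * (A @ Sigma_prev @ A.T)
```

For these values the residual branch targets $\rho = 0.6$, and mixing it with the shortcut kernel gives back $\Sigma(0.4)$, the intended block-level kernel.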

4. EXPERIMENTS

We now assess the capabilities of our proposed methods in training deep skipless and/or normaliser-free transformers. Our main experiment setting uses a transformer with 36 transformer blocks, which is deep enough that the effects of poor signal propagation and rank collapse render skipless training without modifications impossible. We begin our investigation on WikiText-103 (Merity et al., 2017), focusing primarily on training performance of our various methods, before moving to the larger C4 dataset (Raffel et al., 2019), where overfitting is not an issue. We use the Adam optimiser (Kingma & Ba, 2014), as is standard for transformers, and train without dropout (Srivastava et al., 2014). Additional results and experimental details are provided in Appendices G and H respectively.

WikiText-103 baselines

To start with, we verify that a standard deep transformer without skip connections is untrainable even with normalisation layers (LN) and transformed activations, and that our approaches remedy this. methods and Value-Skipinit to standard transformers both with and without skips, on a 36 block transformer. We clearly see that removing skip connections from a standard transformer makes it untrainable, with training loss plateauing around 7.5. This holds true too even when we use DKS to transform the GeLU MLPs, highlighting that the issue lies with the attention layers, which suffer from rank collapse as shown in Fig. 1 . 0K 20K 40K 60K 80K 100K Training step On the other hand, all three of our approaches train even for vanilla deep transformers, with our E-SPA method outperforming U-SPA and Value-Skipinit. However, the default transformer with skips and LN still retains a training speed advantage compared to our skipless methods, mirroring the situation for CNNs without powerful second-order optimisers (Martens et al., 2021; Zhang et al., 2022) . In Table 1 , we assess the effect of different activation functions in the MLP blocks, as well as the use of LN, in skipless transformers using our proposed methods. We see that at depth 36 we achieve good training performance for a range of activations: DKS-transformed GeLU, TAT-transformed Leaky ReLU, as well untransformed GeLU (Hendrycks & Gimpel, 2016) , but not untransformed Sigmoid. We also see that layer normalisation is relatively unimportant for training speed, and can even be harmful with transformed activations when using SPA, which already has an inbuilt mechanism to control activation norms (as discussed at the end of Section 3.2). Normalised skip connections In Fig. 3 , we see that one way to match the training loss of the default transformer, without more iterations, is by using normalised skip connections. 
While this is perhaps unsurprising, we observe that our E-SPA method (left) matches standard training both with and without normalisation, whereas a standard transformer with normalised skip connections (right) requires normalisation in order to match the training speed of the default Pre-LN.

C4 baseline

So far, we have tested our proposed methods on WikiText-103 (Merity et al., 2017), on which we observed overfitting without the use of extra regularisation. Therefore, we further compare our methods to standard transformers on a larger dataset, C4 (Raffel et al., 2019), where overfitting is not an issue. Importantly, we see similar trends across validation (Fig. 4a), training (Fig. 13) and downstream task (Table 6) performance, and so the benefits of our methods do extend beyond training. Due to the memory overhead of longer sequences, we use a 32-block transformer. E-SPA performs best among skipless transformers in all settings: training, validation and downstream tasks.

Zhang et al. (2022) observed a similar training speed gap for skipless CNNs, and showed that such a gap can be closed by using more sophisticated second-order optimisers like K-FAC (Martens & Grosse, 2015) or Shampoo (Gupta et al., 2018; Anil et al., 2020). As second-order optimisers for transformers are not well established, we instead demonstrate in Fig. 4b that the training loss gap can be closed by simply training for longer. We observe that our E-SPA method matches the training loss of a standard Pre-LN transformer on C4 if one trains for around 5 times longer with Adam, in line with the findings from the convolutional case (Zhang et al., 2022). An equivalent plot for WikiText-103 is provided in Fig. 14.

5. CONCLUSION

We have shown for the first time that it is possible to successfully train deep transformers without skip connections or normalisation layers. To do so, we have proposed three approaches: E-SPA, U-SPA and Value-SkipInit, each of which controls the attention matrices of a transformer to enable faithful signal propagation even at large depths. Our best approach, E-SPA, enables deep vanilla transformers to match their standard counterparts with around 5 times more iterations, and also enables deep transformers without normalisation to match the training speed of standard ones. We hope that our work may pave the way to new and improved architectures, and to more research into improving the capabilities of deep learning in practice using insights from theory.

REPRODUCIBILITY STATEMENT

Pseudocode for our main approach, E-SPA, can be found in Alg. 1, using the notation and setup provided in Section 2. All experimental details can be found in Appendix H, including general and experiment-specific implementation details.

A COMPATIBILITY WITH NON-CAUSAL ATTENTION

In Section 3, we focus on causal masked self-attention for two reasons. First, next-token prediction using causal masked self-attention is arguably the most popular setting for self-attention. Second, it is a more challenging setting in terms of controlling signal propagation, due to the additional constraint that attention matrices must be lower triangular. In this section we describe how our methods can be made compatible with non-causal attention, where the attention matrices are no longer required to be lower triangular.

To start with, Value-SkipInit does not modify the softmax-attention computation and hence is already compatible with any form of attention. For our SPA methods, it is straightforward to extend to non-causal attention by changing $L_l$ in Eqs. (8) and (9) from being the Cholesky factor of $\Sigma_l$ to being the (symmetric) matrix square root of $\Sigma_l$. In this case, for U-SPA it is possible to show analytically that $A_l$ in Eq. (8) will be element-wise non-negative if $\rho_l \geq \rho_{l-1}$ (exactly like the Cholesky case in Theorem 1). This is easy to see because matrix square roots, inverses and products of uniform kernel matrices are all still uniform, of the form $\Sigma(\rho) = (1-\rho) I_T + \rho \mathbf{1}\mathbf{1}^\top$ (up to positive rescaling), so that $A_l$ will be too, and one simply needs to track $\rho$ and verify that it is positive. For E-SPA, we have verified empirically that the resulting $A_l = L_l L_{l-1}^{-1}$ will be non-negative if $\gamma_l \leq \gamma_{l-1}$, just like the Cholesky case in Theorem 2.
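
The U-SPA claim above can be checked with scalar arithmetic alone. The sketch below is illustrative and not taken from the paper: it relies on the fact that a uniform kernel $\Sigma(\rho)$ has eigenvalue $1-\rho+T\rho$ on the all-ones direction and $1-\rho$ on its orthogonal complement, so that $A_l = \Sigma_l^{1/2}\Sigma_{l-1}^{-1/2}$ has one value on the diagonal and one value off it. The function name and the example values of $\rho$ and $T$ are assumptions made for illustration.

```python
import math


def uspa_noncausal_entries(rho_prev, rho_next, T):
    """Diagonal and off-diagonal entries of A = S_next @ inv(S_prev), where
    S is the symmetric square root of Sigma(rho) = (1 - rho) I + rho 11^T.

    Requires 0 <= rho_prev < 1 so that Sigma(rho_prev) is invertible.
    """
    # Eigenvalues of the square roots: s on the complement of the all-ones
    # direction, t on the all-ones direction itself.
    s_prev, s_next = math.sqrt(1 - rho_prev), math.sqrt(1 - rho_next)
    t_prev = math.sqrt(1 - rho_prev + T * rho_prev)
    t_next = math.sqrt(1 - rho_next + T * rho_next)
    # A = (s_next / s_prev) (I - P) + (t_next / t_prev) P, with P = 11^T / T.
    ratio_s, ratio_t = s_next / s_prev, t_next / t_prev
    off_diag = (ratio_t - ratio_s) / T
    diag = ratio_s + off_diag
    return diag, off_diag


# Increasing rho gives a non-negative A; decreasing rho gives negative
# off-diagonal entries.
print(uspa_noncausal_entries(0.2, 0.5, T=8))
print(uspa_noncausal_entries(0.5, 0.2, T=8))
```

When $\rho$ increases, the ratio on the all-ones direction is at least 1 and the ratio on the complement is at most 1, which is why the off-diagonal entry cannot be negative.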

B MODIFICATIONS TO SPA METHODS FOR REPEATED TOKENS

For simplicity, we assumed that our input sequences had no repeated tokens when presenting SPA in Section 3. This meant that we could take the input kernel matrix $\Sigma_0$ to be the identity, with zero off-diagonals, which was convenient for our construction of SPA. The effect of repeated tokens, e.g. if the word 'cat' occurs multiple times in the same sentence, is that our input kernel matrices $\Sigma_0$ will have non-zero off-diagonal entries, corresponding to entries where a token is repeated. This will impact the kernel matrices $\Sigma_l$ at deeper layers with SPA, particularly the diagonal values of $\Sigma_l$, which we would like to be able to control. In this section we discuss how we can modify our SPA approaches to account for the fact that we will often be working with sequences where a fraction of the tokens are repeated.

We stress that the general principle that all our methods (Value-SkipInit and SPA methods) follow is independent of the input kernel matrix (i.e. independent of duplicate input tokens): we seek to prevent the product of attention matrices from deviating away from the identity matrix and degenerating to a rank-1 matrix. From Eq. (6), we see that if the product of attention matrices is rank-1 then, regardless of the input kernel, we will have a rank-1 output kernel, i.e. rank collapse (Dong et al., 2021). On the other hand, if we control the deviation of the attention matrix product from the identity then, no matter the input kernel, the output kernel will bear some similarity to the input kernel. So as long as the input kernel is non-degenerate and has full rank (regardless of duplicate tokens), so too will the output kernel.

Recall that with attention matrices $(A_l)_l$, an input kernel matrix across $T$ locations $\Sigma_0 \in \mathbb{R}^{T \times T}$ gets mapped, at depth $l$, to

$$\Sigma_l = \Pi_l \cdot \Sigma_0 \cdot \Pi_l^\top \tag{11}$$

where $\Pi_l = A_l A_{l-1} \cdots A_1$ is the product of attention matrices.
In SPA, we parameterise $A_l = L_l L_{l-1}^{-1}$ for $(L_l)_l$ corresponding to the Cholesky factors of some family of kernel matrices $(\Sigma_l)_l$, with either uniform or exponentially decaying off-diagonals:

1. U-SPA: $\Sigma_l(\rho_l) = (1-\rho_l) I_T + \rho_l \mathbf{1}\mathbf{1}^\top$ for $0 \leq \rho_l \leq 1$.
2. E-SPA: $\Sigma_l(\gamma_l)_{i,j} = \exp(-\gamma_l |i-j|)$ for $\gamma_l \geq 0$.

It is important that $A_l = L_l L_{l-1}^{-1}$ is constructed from two Cholesky matrices belonging to the same family, because Theorems 1 and 2 show that in that case we will satisfy our non-negativity constraint on $A_l$ (which is computed through a softmax operation), and otherwise there is no a priori reason to suppose that non-negativity will be satisfied. Therefore, in an ideal world, $\Sigma_0$ would be an element of our family of kernel matrices, so that we can set $L_0$ to be the Cholesky factor of $\Sigma_0$, satisfying $L_0^{-1} \Sigma_0 L_0^{-\top} = I_T$.

Typically, $\Sigma_0 = XX^\top / d_0$ for input token embeddings $X \in \mathbb{R}^{T \times d_0}$, which are independently initialised for different token identities. Thus, in the wide limit $d_0 \to \infty$, (up to rescaling) $\Sigma_0$ will have diagonals equal to 1, and off-diagonals equal to 1 if there is a repeated token and 0 otherwise. We plot examples of such kernel matrices in Fig. 5 for different fractions of repeated tokens $r$, under the assumption that repeated tokens occur independently of location.

Clearly, when $r = 0$, $\Sigma_0 = I$ is a member of both the uniform and exponential families of kernel matrices, corresponding to uniform off-diagonals of $\rho_0 = 0$ or exponentially decaying off-diagonals with rate $\gamma_0 = \infty$. In this case we can set $L_0 = I$ too. On the other hand, for $r > 0$, it is in general difficult to say much more about an individual sequence's $\Sigma_0$, given that different sequences will have repeated tokens in different locations. Moreover, the naive approach of ignoring the repeated tokens and treating $\Sigma_0 = I$ leads to increasing diagonal values of $\Sigma_l$ (i.e. activation norms) at large depth without corrections, as shown in Fig. 6. This imbalance between blocks could be problematic for training dynamics, and is also incompatible with frameworks like DKS and TAT, which suppose that diagonal values of $\Sigma_l$ are constant across blocks and locations, usually set to 1.

To circumvent this, we derive our modifications by considering the average input kernel matrix $\bar{\Sigma}_0$ (averaged over different sequences), under the assumption that repeated tokens occur independently of location:

$$\bar{\Sigma}_0 = (1-r) I_T + r \mathbf{1}\mathbf{1}^\top \tag{12}$$

By linearity of Eq. (11) in $\Sigma_0$ (and because we are controlling $A_l$ to be input-independent at initialisation in Section 3), it also follows that the average depth-$l$ kernel matrix $\bar{\Sigma}_l$ satisfies:

$$\bar{\Sigma}_l(X, X) = \Pi_l \cdot \bar{\Sigma}_0 \cdot \Pi_l^\top \tag{13}$$

so we can modify our SPA approaches to control the average kernel matrix $\bar{\Sigma}_l$ instead.

For U-SPA, the situation is more straightforward as $\bar{\Sigma}_0$ is a uniform kernel matrix with $\rho = r$, so it suffices to simply let $\rho_0 = r$ instead of $\rho_0 = 0$.

For E-SPA, we are unable to view $\bar{\Sigma}_0$ as having exponentially decaying off-diagonals, and so we keep $\gamma_0 = \infty$, which translates to $L_0 = I$ and $\Pi_l = L_l$. Instead, to help us understand the effect of repeated tokens (and to motivate our modifications), we first expand Eq. (13):

\begin{align}
\Pi_l \cdot \bar{\Sigma}_0 \cdot \Pi_l^\top &= L_l \cdot \bar{\Sigma}_0 \cdot L_l^\top \notag \\
&= (1-r) L_l L_l^\top + r L_l \mathbf{1}\mathbf{1}^\top L_l^\top \notag \\
&= \Sigma_l(\gamma_l) + r L_l (\mathbf{1}\mathbf{1}^\top - I) L_l^\top \notag \\
&= \Sigma_l(\gamma_l) + r L_l O_T L_l^\top \tag{14}
\end{align}

where $O_T = \mathbf{1}\mathbf{1}^\top - I_T \in \mathbb{R}^{T \times T}$ is 0 on the diagonal and 1 off the diagonal. Note that all terms in Eq. (14) are easily computable (given that $\Sigma_l(\gamma_l)_{i,j} = \exp(-\gamma_l |i-j|)$ and $L_l$ has an analytic form provided in Lemma 1). So we can use Eq. (14) to compute the expected diagonal $(\bar{\Sigma}_l)_{i,i}$ for each location $i$ (where the expectation is taken over different sequences), which we collect in a diagonal matrix $\bar{D}_l$:

$$\bar{D}_l = \mathrm{Diag}(L_l \cdot \bar{\Sigma}_0 \cdot L_l^\top) \in \mathbb{R}^{T \times T}$$

Thus, we propose to replace $A_l = L_l L_{l-1}^{-1}$ with $A_l = \bar{D}_l^{-1/2} L_l L_{l-1}^{-1} \bar{D}_{l-1}^{1/2}$, setting $\bar{D}_0 = I$ by default.
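
As an illustration of the diagonal correction above, the following sketch computes the expected diagonal for E-SPA. It is not the paper's code: it assumes that the analytic form of the Cholesky factor referenced as Lemma 1 is the standard closed form for the kernel $\exp(-\gamma|i-j|)$, and it uses the observation that, because $L_l L_l^\top$ has unit diagonal, entry $i$ of the diagonal of $L_l \bar{\Sigma}_0 L_l^\top$ reduces to $(1-r) + r\,(\text{row sum of } L_l)_i^2$. The function names are hypothetical; the values $\gamma = 0.005$, $T = 100$ and $r = 0.02$ are only example inputs.

```python
import math


def espa_cholesky(gamma, T):
    """Lower-triangular Cholesky factor of Sigma_ij = exp(-gamma * |i - j|).

    Closed form (0-indexed): column 0 holds rho^i, and column j >= 1 holds
    sqrt(1 - rho^2) * rho^(i - j) for i >= j, where rho = exp(-gamma).
    """
    rho = math.exp(-gamma)
    a = math.sqrt(1 - rho * rho)
    return [[(rho ** (i - j)) * (1.0 if j == 0 else a) if j <= i else 0.0
             for j in range(T)] for i in range(T)]


def expected_diagonal(gamma, T, r):
    """Diagonal of L @ Sigma0_bar @ L^T for Sigma0_bar = (1 - r) I + r 11^T.

    Since L L^T has unit diagonal, entry i is (1 - r) + r * (row sum of L)_i^2.
    """
    L = espa_cholesky(gamma, T)
    return [(1 - r) + r * sum(row) ** 2 for row in L]


D = expected_diagonal(gamma=0.005, T=100, r=0.02)
# The first location keeps a diagonal of 1; later locations grow above 1.
print(D[0], D[-1])
```

Dividing row $i$ of $L_l$ by the square root of entry $i$ is then exactly the left factor $\bar{D}_l^{-1/2}$ in the modified attention matrix.
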
With this rescaling, the product of attention matrices becomes $\Pi_l = \prod_{i=1}^{l} A_i = \bar{D}_l^{-1/2} L_l$, and Eq. (13) is updated to:

$$\bar{\Sigma}_l(X, X) = \bar{D}_l^{-1/2} L_l \cdot \bar{\Sigma}_0 \cdot L_l^\top \bar{D}_l^{-1/2} \tag{15}$$

which has diagonals controlled to 1. Though our modifications for repeated tokens only consider averages across different sequences, we find that for individual sequences they still lead to well-behaved diagonal values of $\Sigma_l$, as shown in Fig. 6. Moreover, we find that their effect on off-diagonals is still favourable for individual sequences, as shown in Fig. 1.

C SETTING $(\rho_l)_l$ AND $(\gamma_l)_l$

In this section we describe how we set the uniform off-diagonals $(\rho_l)_l$ and the exponential decay rates $(\gamma_l)_l$ in SPA, at different depths. Recall that Theorems 1 and 2 show that we are free to choose $(\rho_l)_{l=0}^{L}$ and $(\gamma_l)_{l=0}^{L}$ such that $(\rho_l)_{l=0}^{L}$ increase with depth and $(\gamma_l)_{l=0}^{L}$ decrease with depth. Moreover, $\rho_0 = 0$ (or the repeated token fraction $r$, as per Appendix B) and $\gamma_0 = \infty$ at the input layer, whilst we have found $\rho_L = 0.8$ and $\gamma_L = 0.005$ to be good default values for the last block.

For U-SPA, in terms of setting how $(\rho_l)_l$ vary with depth, we tried different polynomial rates of increase with depth from $\rho_0$ to $\rho_L$, but did not observe a noticeable difference in performance across different rates, so chose to increase from $\rho_0$ to $\rho_L$ linearly in depth.

For E-SPA, we choose $(\gamma_l)_l$ so that the diagonal elements of the attention matrices $A_l$ are constant across blocks, akin to using a constant shortcut weight $\alpha$ over different blocks. From Theorem 3, we have that the diagonal entries of $A_l$ satisfy:

$$(A_l)_{i,i} = \frac{a(\gamma_l)}{a(\gamma_{l-1})}, \quad \forall i > 1$$

where $a(\gamma) = \sqrt{1 - \exp(-2\gamma)}$ with inverse $\gamma(a) = -\frac{1}{2}\log(1 - a^2)$. This means that for a set of positive decreasing $(\gamma_l)_l$, there exists a corresponding set of decreasing $(a_l)_l$ with values between 0 and 1. Because $\gamma_0 = \infty$, we have $a_0 = 1$, and likewise for a given $\gamma_L$ we can compute $a_L$.
Thus, to have constant diagonal values of $A_l$ over different blocks $l$, we choose to set $a_l = a_L^{l/L}$ for $l \leq L$, and thus

$$\gamma_l = \gamma(a_l) = -\frac{1}{2}\log\left(1 - a_L^{2l/L}\right)$$

We found this scheme to work well empirically, and note the similarity to other works which have discussed the choice of how to scale branches in residual architectures (Hayou et al., 2021). We leave further study of how to set $(\rho_l)_l$ and $(\gamma_l)_l$ to future work.
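
The E-SPA schedule above is simple enough to write out directly. The following is a minimal sketch under the definitions of $a(\gamma)$ and $\gamma(a)$ given in this section; the function names are hypothetical, and the defaults $\gamma_L = 0.005$ and $L = 36$ are taken from the text.

```python
import math


def a_of_gamma(gamma):
    """a(gamma) = sqrt(1 - exp(-2 * gamma)); equals 1 at gamma = infinity."""
    return math.sqrt(-math.expm1(-2.0 * gamma))


def gamma_of_a(a):
    """Inverse map gamma(a) = -0.5 * log(1 - a^2), with gamma(1) = infinity."""
    return math.inf if a >= 1.0 else -0.5 * math.log1p(-a * a)


def espa_gamma_schedule(gamma_last, num_blocks):
    """Decay rates gamma_0, ..., gamma_L obtained from a_l = a_L^(l / L)."""
    a_last = a_of_gamma(gamma_last)
    return [gamma_of_a(a_last ** (l / num_blocks)) for l in range(num_blocks + 1)]


gammas = espa_gamma_schedule(gamma_last=0.005, num_blocks=36)
# Diagonal of A_l at each block: a(gamma_l) / a(gamma_{l-1}).
diagonals = [a_of_gamma(g1) / a_of_gamma(g0) for g0, g1 in zip(gammas, gammas[1:])]
print(gammas[0], gammas[1], gammas[-1])
print(min(diagonals), max(diagonals))
```

By construction, every block has the same attention diagonal $a_L^{1/L}$, which plays the role of a constant shortcut weight.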

D E-SPA ALGORITHM

In Alg. 1 we present pseudocode to construct a trainable E-SPA masked attention layer.

Algorithm 1: Modified E-SPA masked attention layer.
Input: Input sequence representation $X \in \mathbb{R}^{T \times d}$.
Output: Updated sequence representation $\mathrm{Attn}(X) \in \mathbb{R}^{T \times d}$.
Hyperparameters: Input $\gamma_{in}$ and output $\gamma_{out}$ exponential decay rates, $\gamma_{in} \geq \gamma_{out}$. Number of heads $h$ with head dimension $d_h = d/h$. Causal mask $M \in \mathbb{R}^{T \times T}$ s.t. $M_{i,j} = \mathbb{1}\{i \geq j\}$. Large constant $\Gamma$, e.g. $10^{30}$, to enforce the causal mask.
Trainable parameters: $W^Q_n, W^K_n, W^V_n \in \mathbb{R}^{d \times d_h}$, for $1 \leq n \leq h$. $W^O \in \mathbb{R}^{d \times d}$.

Compute the diagonal rescaling matrix $D$ and the bias matrix $B$ from $\gamma_{in}$ and $\gamma_{out}$ (see Section 3.3).
for head $n \in \{1, \ldots, h\}$ do
    Initialise $W^K_n, W^V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \frac{1}{d})$ (alternatively orthogonally).
    Initialise $W^Q_n = 0$.
    Set $Q_n(X) = XW^Q_n$, $K_n(X) = XW^K_n$ and $V_n(X) = XW^V_n$.
    Compute $P_n(X) = \mathrm{softmax}\left(M \circ \frac{1}{\sqrt{d_h}} Q_n(X) K_n(X)^\top + B - \Gamma(1 - M)\right)$.
    Set $\mathrm{Attn}_n(X) = D P_n(X) \cdot V_n(X)$.
Initialise $W^O \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \frac{1}{d})$ (alternatively as an orthogonal matrix).
return $\mathrm{Concat}\left(\mathrm{Attn}_1(X), \ldots, \mathrm{Attn}_h(X)\right) W^O$
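
Alg. 1 uses the rescaling $D$ and bias $B$ without restating how they are built. One reading consistent with the description of Section 3.3 (bias matrices plus location-dependent rescaling) is to factor the target attention matrix $A = L_{out} L_{in}^{-1}$ as $A = DP$, with $D$ holding the row sums of $A$ and $P$ row-stochastic, and to set $B = \log P$. The sketch below assumes that reading; it also assumes the standard closed-form Cholesky factor of the exponential kernel and its bidiagonal inverse. All helper names are hypothetical, and it covers only the attention matrix at initialisation (where $W^Q = 0$), not the value or output projections.

```python
import math


def espa_cholesky(gamma, T):
    """Closed-form Cholesky factor of Sigma_ij = exp(-gamma * |i - j|)."""
    rho = math.exp(-gamma)
    a = math.sqrt(1 - rho * rho)
    return [[(rho ** (i - j)) * (1.0 if j == 0 else a) if j <= i else 0.0
             for j in range(T)] for i in range(T)]


def espa_cholesky_inverse(gamma, T):
    """Inverse of the factor above; it is lower bidiagonal."""
    rho = math.exp(-gamma)
    a = math.sqrt(1 - rho * rho)
    inv = [[0.0] * T for _ in range(T)]
    inv[0][0] = 1.0
    for i in range(1, T):
        inv[i][i] = 1.0 / a
        inv[i][i - 1] = -rho / a
    return inv


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def espa_rescale_and_bias(gamma_in, gamma_out, T):
    """Split A = L_out @ inv(L_in) into row sums D and log-probabilities B.

    Assumes gamma_in > gamma_out so that the lower triangle of A is positive.
    """
    A = matmul(espa_cholesky(gamma_out, T), espa_cholesky_inverse(gamma_in, T))
    D = [sum(row) for row in A]
    # Entries above the diagonal are masked later, so their bias is left at 0.
    B = [[math.log(A[i][j] / D[i]) if j <= i else 0.0 for j in range(T)]
         for i in range(T)]
    return A, D, B


def masked_softmax_row(logits, i, big=1e30):
    """softmax(logits - big * (1 - M)) for row i of the causal mask M."""
    shifted = [x - (0.0 if j <= i else big) for j, x in enumerate(logits)]
    top = max(shifted)
    exps = [math.exp(x - top) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


T = 6
A, D, B = espa_rescale_and_bias(gamma_in=0.5, gamma_out=0.2, T=T)
# With W^Q = 0 the query-key term vanishes, so the logits are just B.
attn = [[D[i] * p for p in masked_softmax_row(B[i], i)] for i in range(T)]
print(max(abs(attn[i][j] - A[i][j]) for i in range(T) for j in range(T)))
```

Under these assumptions, the rescaled softmax output reproduces $A$ at initialisation, so a kernel with decay rate $\gamma_{in}$ is mapped to one with decay rate $\gamma_{out}$.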

E COMPATIBILITY OF SPA WITH EXISTING POSITIONAL ENCODINGS

In Section 3.3, we showed how to control the attention matrix at initialisation, using bias matrices and location-dependent rescaling, as well as making the query-key dot product, $\frac{1}{\sqrt{d_k}} Q(X)K(X)^\top = \frac{1}{\sqrt{d_k}} XW^Q (XW^K)^\top$, zero at initialisation. This scheme is used in our SPA methods. We detailed several ways to achieve this, and in practice we chose to initialise $W^Q$ to zero, and initialise $W^K$ as usual (Gaussian fan-in or orthogonal). In this section we show how zero-initialising the query-key dot product is also possible when using two standard positional encodings: RoPE (Su et al., 2021) and Relative (Shaw et al., 2018; Huang et al., 2019; Dai et al., 2019). This means that we can use our methods in Section 3.3 in combination with these positional encoders.

Let us denote the unscaled query-key dot product as $S = XW^Q (XW^K)^\top$.

For RoPE, the $(i, j)$ query-key dot product, $S_{i,j}$, is modified from $(W^Q X_i)^\top (W^K X_j)$ to

$$(R_i W^Q X_i)^\top (R_j W^K X_j) \tag{16}$$

where $R_i, R_j$ are some location-dependent rotation matrices defined in Su et al. (2021), and $X_i$ denotes row $i$ of the incoming representation $X \in \mathbb{R}^{T \times d}$ to the attention block. Clearly, Eq. (16) is 0 for all $i, j$ if $W^Q$ is zero.

For Relative positional encoding, we take the scheme from Dai et al. (2019). In that case, the $(i, j)$ query-key dot product, $S_{i,j}$, is modified from $(W^Q X_i)^\top (W^K X_j)$ to

$$(W^Q X_i)^\top (W^K X_j) + (W^Q X_i)^\top (W^{K,R} R_{i-j}) + u^\top (W^K X_j) + v^\top (W^{K,R} R_{i-j}) \tag{17}$$

where $R_{i-j} \in \mathbb{R}^d$ is fixed, and $W^{K,R} \in \mathbb{R}^{d_k \times d}$, $u, v \in \mathbb{R}^{d_k}$ are trainable. Thus, we can achieve zero query-key dot products at initialisation if we initialise $u, v = 0$, in addition to $W^Q = 0$.

F USING NORMALISED SKIP CONNECTIONS WITH E-SPA

In this section, we describe how to combine our E-SPA method with normalised skip connections in our attention blocks:

$$Z = \alpha X + \sqrt{1-\alpha^2}\, A X W^V \tag{18}$$

where $\alpha = 0$ is the skipless setting that our methods were originally designed for. The general gist is that we look at the combined effect, on the kernel matrix after the residual attention block, of the residual branch and the diagonal terms in the attention matrices for preserving signal propagation, and approximate the combination to match the setting without skips, i.e. $\alpha = 0$.

To combine E-SPA with normalised skip connections, we consider how the cosine similarity between two locations $T_1, T_2 \leq T$ is affected through the normalised skip connection. Suppose we have an input kernel matrix $\frac{1}{d} XX^\top = \Sigma$, where $(\Sigma)_{i,i} = 1, \forall i$, and $d$ is the width. Then after the residual attention block, Eq. (18), we have:

\begin{align}
\frac{1}{d} ZZ^\top = {} & \alpha^2 \frac{1}{d} XX^\top + (1-\alpha^2) \frac{1}{d} A X W^V W^{V\top} X^\top A^\top \tag{19} \\
& + \alpha\sqrt{1-\alpha^2}\, \frac{1}{d} \left( X W^{V\top} X^\top A^\top + A X W^V X^\top \right) \tag{20}
\end{align}

Either $W^V$ is an orthogonal matrix sampled uniformly at random from the Haar measure, or $W^V \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \frac{1}{d})$. In both cases $\frac{1}{d} A X W^V W^{V\top} X^\top A^\top$ goes towards $\frac{1}{d} A X X^\top A^\top$, and the cross terms in Eq. (20) converge to 0 for large $d$. Thus, we can consider the large-$d$ approximation:

\begin{align}
\frac{1}{d} ZZ^\top &= \alpha^2 \frac{1}{d} XX^\top + (1-\alpha^2) \frac{1}{d} A X X^\top A^\top \tag{21} \\
&= \alpha^2 \Sigma + (1-\alpha^2) A \Sigma A^\top \tag{22}
\end{align}

Then, if we look at the inner product between locations $T_1 \neq T_2$ such that $T_1, T_2 > 1$, we have:

\begin{align}
\frac{1}{d} (ZZ^\top)_{T_1,T_2} &= \alpha^2 \Sigma_{T_1,T_2} + (1-\alpha^2)(A \Sigma A^\top)_{T_1,T_2} \notag \\
&= \alpha^2 \Sigma_{T_1,T_2} + (1-\alpha^2) \sum_{i,j} A_{T_1,i} \Sigma_{i,j} A_{T_2,j} \notag \\
&= \alpha^2 \Sigma_{T_1,T_2} + (1-\alpha^2) A_{T_1,T_1} \Sigma_{T_1,T_2} A_{T_2,T_2} + (1-\alpha^2) \sum_{i \neq T_1 \cup j \neq T_2} A_{T_1,i} \Sigma_{i,j} A_{T_2,j} \notag \\
&= \left(\alpha^2 + (1-\alpha^2)\lambda^2\right) \Sigma_{T_1,T_2} + (1-\alpha^2) \sum_{i \neq T_1 \cup j \neq T_2} A_{T_1,i} \Sigma_{i,j} A_{T_2,j} \notag \\
&= \left(\alpha^2 + (1-\alpha^2)\lambda^2\right) \Sigma_{T_1,T_2} + (1-\alpha^2)\delta \tag{23} \\
&= \nu \Sigma_{T_1,T_2} + (1-\alpha^2)\delta \notag
\end{align}

where $\lambda = A_{T_1,T_1} = A_{T_2,T_2}$ is the constant diagonal of the attention matrices in E-SPA, c.f. Theorem 3, and $\nu = \alpha^2 + (1-\alpha^2)\lambda^2$. Moreover, we have defined $\delta = \sum_{i \neq T_1 \cup j \neq T_2} A_{T_1,i} \Sigma_{i,j} A_{T_2,j}$, a term that cannot be controlled with only knowledge of $\Sigma_{T_1,T_2}$ and that will vary from sequence to sequence; we therefore argue that it can be discarded from a signal propagation perspective. For example, one could consider a sequence where the kernel matrix $\Sigma$ has all off-diagonals equal to 0 apart from $\Sigma_{T_1,T_2}$, and $T_1$ could be sufficiently distant from $T_2$ such that there is no $i$ for which $A_{T_1,i}$ and $A_{T_2,i}$ are large at the same time (which can be seen using the analytic form of $A$ given in Theorem 3).

Thus, from Eq. (23), we see that after a residual attention block, an input cosine similarity of $\Sigma_{T_1,T_2}$ between locations $T_1, T_2$ is diluted by a factor of $\nu = \alpha^2 + (1-\alpha^2)\lambda^2 < 1$, with shortcut weight $\alpha$ and attention diagonal probability $\lambda$. Our approximation then seeks to preserve this factor $\nu$ when $\alpha > 0$ to match the skipless case.

Now, in the skipless case $\alpha = 0$ described in Section 3, at block $l$ we suppose that the incoming kernel matrix $\Sigma$ has exponentially decaying off-diagonals with rate $\gamma_{l-1}$, and that we construct the attention matrix $A$ so that the output kernel matrix has exponentially decaying off-diagonals with rate $\gamma_l$. From Theorem 3, we see that this gives diagonal entries of $A$ equal to:

$$\lambda_0 = \frac{a(\gamma_l)}{a(\gamma_{l-1})}$$

where $a(\gamma) = \sqrt{1 - \exp(-2\gamma)}$. Thus, to preserve $\nu$ to match the case $\alpha = 0$, if we have shortcut weight $\alpha > 0$ we need the attention matrix diagonal probability $\lambda_\alpha$ to satisfy:

$$\lambda_\alpha = \sqrt{\frac{\lambda_0^2 - \alpha^2}{1 - \alpha^2}} \tag{24}$$

In turn, when we have shortcut weight $\alpha$, this means that we need to choose our outgoing decay rate at block $l$, $\gamma_{l,\alpha}$, such that $a(\gamma_{l,\alpha}) = \lambda_\alpha a(\gamma_{l-1})$. Inverting the definition of $a(\gamma)$, we see that we need to set $\gamma_{l,\alpha}$ to:

\begin{align}
\gamma_{l,\alpha} &= -\frac{1}{2}\log\left(1 - \lambda_\alpha^2\, a(\gamma_{l-1})^2\right) \tag{25} \\
&= -\frac{1}{2}\log\left(1 - \frac{\lambda_0^2 - \alpha^2}{1 - \alpha^2}\left(1 - \exp(-2\gamma_{l-1})\right)\right) \notag
\end{align}

To summarise, if we have a sequence $(\gamma_l)_l$ of decreasing exponential decay rates, our proposed approximation when using shortcut weight $\alpha$ at block $l$ is, using the notation of Alg. 1, to set $\gamma_{in} = \gamma_{l-1}$ as normal, and to set $\gamma_{out} = \gamma_{l,\alpha}$ from Eq. (25) in order to preserve the signal propagation from the combined residual attention block, Eq. (23). As a sanity check, this approximation reduces to the standard skipless setting when $\alpha = 0$.
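
The adjustment of the outgoing decay rate for a shortcut weight $\alpha$ can be sketched as follows. This is a minimal illustration of Eqs. (24) and (25) as reconstructed from the surrounding definitions, not the authors' code; the function names are hypothetical, and the explicit check that $\lambda_0 > \alpha$ is a guard added here, since the square root in Eq. (24) is only real in that case.

```python
import math


def a_of_gamma(gamma):
    """a(gamma) = sqrt(1 - exp(-2 * gamma))."""
    return math.sqrt(-math.expm1(-2.0 * gamma))


def gamma_out_with_skip(gamma_prev, gamma_next, alpha):
    """Outgoing decay rate gamma_{l,alpha} for shortcut weight 0 <= alpha < 1.

    gamma_prev and gamma_next are the skipless rates gamma_{l-1} >= gamma_l.
    """
    lam0 = a_of_gamma(gamma_next) / a_of_gamma(gamma_prev)
    lam_alpha_sq = (lam0 ** 2 - alpha ** 2) / (1.0 - alpha ** 2)
    if lam_alpha_sq <= 0.0:
        # Guard added for this sketch: the formula needs lambda_0 > alpha.
        raise ValueError("shortcut weight alpha must be below lambda_0")
    return -0.5 * math.log1p(-lam_alpha_sq * a_of_gamma(gamma_prev) ** 2)


# With alpha = 0 the skipless rate is recovered; with alpha > 0 the attention
# matrix must mix more strongly, so the outgoing decay rate is smaller.
print(gamma_out_with_skip(1.0, 0.5, alpha=0.0))
print(gamma_out_with_skip(1.0, 0.5, alpha=0.5))
```

The returned rate would be passed as $\gamma_{out}$ to the attention layer, with $\gamma_{in} = \gamma_{l-1}$ unchanged.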

G ADDITIONAL RESULTS

In this section we present additional results and ablations that were not included in Section 4.

Normalised kernel matrix evolution for attention-only transformers with skips and normalisation

In Fig. 7 we plot the evolution of normalised kernel matrices for transformers with skips and/or RMSNorm normalisation, in addition to those for vanilla transformers in Fig. 1. We see that both skipless with RMSNorm (fourth row) and Post-LN (bottom row) converge to rank collapse at larger depths. The degeneration of skipless with RMSNorm is expected from the results of Dong et al. (2021), and while the convergence to rank collapse is slower for Post-LN, it is still expected (Hayou et al., 2021; Noci et al., 2022). This is because the residual and shortcut branches are effectively given a constant weighting at all blocks in Post-LN, even as the network's depth increases. On the other hand, Pre-LN exhibits sensible signal propagation even at depth 100, as the positioning of the LN in the residual branch effectively downweights the residual branch at later blocks (De & Smith, 2020). Likewise, Pre-LN with skip weight α = 0.98 also exhibits faithful signal propagation, because each block is effectively downweighted. This effect means that the kernel matrix of standard Pre-LN (fifth row) increases elementwise faster with depth than that of Pre-LN with normalised skips (sixth row), despite both kernel matrices being qualitatively similar at block 100. Note that all methods besides our SPA methods use the ALiBi positional encoder (Press et al., 2022) (detailed in Appendix H.2), and we observe that both Pre-LN kernel matrices also have a recency bias, like our main method, E-SPA (third row).

Validation perplexity across different skipless methods

In Table 1 we compared the training speeds for our skipless methods, and found that E-SPA outperforms both U-SPA and Value-SkipInit. In particular, we found E-SPA with a DKS-transformed GeLU activation without LN to perform best. In Table 4, we present the corresponding results for validation perplexity. Again, we see that E-SPA is the best performing of our attention modifications, but in this case TAT with Leaky ReLU and no LN matches or outperforms DKS with GeLU.

Sensitivity to query-key initialisation scale

Recall from Section 3.3 that for attention layers using SPA we seek to initialise weights such that the query-key dot product, $\frac{1}{\sqrt{d_k}} XW^Q (XW^K)^\top$, is zero or small at initialisation. In our main experiments, we achieve this by initialising $W^Q = 0$ and initialising $W^K$ as normal. In Fig. 8, we assess the sensitivity of our E-SPA scheme to the scale of a non-zero attention dot product, when $W^K, W^Q$ are both orthogonally initialised but with scale σ (i.e. at each layer, $W^K$ and $W^Q$ are initialised as two independent uniform orthogonal matrices multiplied by σ). We see that for small initialisation scales, varying the initial scale has little effect, but that training performance degrades at larger scales, when our attention matrix reverse engineering in Section 3.3 (which expects a small or zero query-key dot product at initialisation) is less precise.

Ablation over orthogonal initialisation

Recall from Section 3 that our kernel matrix evolution for attention-only transformers is exact at finite widths using orthogonally initialised weight matrices, and will be approximate using standard (Gaussian) fan-in initialised weights. In Fig. 9, we ablate over using orthogonally initialised weight matrices, compared to Gaussian fan-in initialisation. Across activations, we see that E-SPA with orthogonal initialisation slightly outperforms Gaussian fan-in initialisation (by around 0.15 train loss).

Ablation over positional encodings

As noted in Appendix E, our methods are compatible with several standard positional encodings, and by default all our experiments use the popular RoPE (Su et al., 2021) positional encoding. In Fig. 10 and Table 5, we assess the effect of removing positional encodings on training and validation performance respectively, in our skipless methods. We see that all methods are improved when combined with RoPE; however, the improvement is mildest for E-SPA, which as discussed has an in-built recency bias akin to a positional encoder. Moreover, E-SPA without additional positional encoding still outperforms all other approaches, including U-SPA (which on its own has no notion of position) with RoPE.

Stable residual connection (Hayou et al., 2021; Noci et al., 2022) on WikiText-103

In Fig. 11, we provide an equivalent plot to Fig. 3 using a Stable ResNet (Hayou et al., 2021) rescaling of the shortcut weights ($\alpha = 1$, $\beta = O(\frac{1}{\sqrt{L}})$). In this case, the shortcut weight is always 1 and the residual weight β is uniform across blocks and scales as $O(\frac{1}{\sqrt{L}})$ in depth $L$. Hayou et al. (2021) showed that such a scaling leads to non-degenerate signal propagation in large-depth MLPs/CNNs without normalisation, and Noci et al. (2022) showed that the stable scaling prevents rank collapse in transformers without normalisation. We see in Fig. 11 that for large enough β, the stable residual weighting with normalisation matches the training speed of the default transformer (which is unsurprising given that once β = 1 it is exactly default Pre-LN). However, there is a small but consistent gap without normalisation (with optimal β = 0.1). Here, $L = 36$, so $\frac{1}{\sqrt{L}} = \frac{1}{6} \approx 0.17$, or alternatively if we count the 72 nonlinear layers (one self-attention and one element-wise nonlinearity for each transformer block), we have $\frac{1}{\sqrt{72}} \approx 0.12$.

Different shortcut weights for MLP and self-attention blocks

Recall that in a transformer block, Eq. (1), there are two distinct skip connections: one for the self-attention block and one for the MLP block. Moreover, we observed a training speed gap when we remove both skip connections in Fig. 2. This leads us to ask whether only one of the skips, MLP or attention, is causing this gap in training speed. In Fig. 12, we investigate this by varying the MLP shortcut weight for skipless attention blocks (left) and varying the attention shortcut weight for skipless MLP blocks (right). For all skip connections (for both MLP and self-attention blocks), we use a normalised skip connection ($\beta = \sqrt{1-\alpha^2}$). We observe that removing either skip connection results in a comparable loss of training speed (the default Pre-LN on WikiText-103 obtained a train loss of 1.76 after 100K steps), although having one is still better than having neither.
Moreover, we observe that the attention and MLP blocks prefer slightly different shortcut weights, with MLP (dense) shortcuts performing better at slightly lower weightings (α = 0.9 or 0.97) compared to attention shortcuts (α = 0.98 or 0.99), whereas our experiments in Figs. 3 and [...]

Training performance on C4

In Fig. 13 we compare the training performance of our various vanilla transformers to the default Pre-LN transformer on C4. This is akin to Fig. 4a, which compared validation performance.

Downstream tasks

Typically, transformers are pre-trained on a large corpus of data before evaluation on a set of downstream tasks. To assess whether our conclusions about training performance in pre-training transfer over to downstream tasks, we assess models trained on C4 on 5 common-sense downstream tasks: BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2020), PIQA (Bisk et al., 2020), and SIQA (Sap et al., 2019). These datasets are commonly used to evaluate large pre-trained transformers (Brown et al., 2020; Rae et al., 2021; Smith et al., 2022; Hoffmann et al., 2022). In Table 6, we see that the conclusions of pre-training on C4 largely carry over to the downstream tasks:

• Among transformers trained for the same number of steps (50K), E-SPA beats U-SPA and Value-SkipInit each on 4 out of 5 downstream tasks. However, the default transformer outperforms skipless transformers on all tasks with the same amount of training.
• With around 5 times longer training (200K and 300K steps), E-SPA achieves similar performance on downstream tasks to the standard transformer, outperforming it on 2 out of 5 tasks (Winogrande and PIQA).

Kernel matrix evolution during training

Fig. 15 shows the evolution of the empirically computed normalised kernel matrix during training of a (finite-width) vanilla E-SPA transformer on WikiText-103. The network depicted has both attention and MLP blocks, which use DKS-transformed GeLU activations.
We note that the untrained network shows good agreement with Fig. 1, despite the fact that Fig. 1 is computed for an attention-only network in the infinite-width limit. We also note that while significant changes to the kernel matrix occur during training, it retains the property of being larger close to the diagonal.

H IMPLEMENTATION DETAILS

We first describe all additional general implementation details for our experiments, before going into details relevant for individual results.

H.1 GENERAL IMPLEMENTATION DETAILS

Datasets

We present experiments on WikiText-103 (Merity et al., 2017) and C4 (Raffel et al., 2019). For both datasets, we use the SentencePiece tokeniser (Kudo & Richardson, 2018) with vocabulary size $|V| = 32{,}000$. For WikiText-103 we use a sequence length of 512 for both training and validation. For C4, the sequence length is 2048 for both.

Embedding

By default the embedding layer $E \in \mathbb{R}^{|V| \times d_0}$ is fan-out initialised, so that $E_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \frac{1}{d_0})$. This means that each embedding entry has variance $\frac{1}{d_0}$ at initialisation, whereas our Gram matrices (which are dimension-normalised) expect each entry to have variance 1 at initialisation. To get around this, in our proposed vanilla methods we rescale the embedding matrix by a fixed factor of $\sqrt{d_0}$, so that $E \leftarrow \sqrt{d_0} E$, which is then treated as a lookup table. At initialisation, this has the same effect as using RMSNorm on the embedding layer. In all our experiments, the unembedding layer weights are shared with the embedding weights. We do not use the embedding layer positional encoding introduced by Vaswani et al. (2017).

Model

In all our experiments apart from Table 3, the model width is $d = 1024$ across all blocks. We use 8-head multi-head attention, so that $d_k = 128$. The MLP block consists of a single hidden layer with width $4d$, with input and output dimensions both equal to $d$. All our experiments use the Pre-LN transformer block, Eq. (1), rather than Post-LN. On WikiText-103, we use a 36-block transformer by default for all experiments apart from Table 3. On C4, we use a 32-block transformer due to memory constraints. Any normalisation layer we consider is RMSNorm (Zhang & Sennrich, 2019), which is simpler than Layer Normalisation (Ba et al., 2016) and is commonly used in transformers (Rae et al., 2021). By default, our models use the RoPE positional encoder (Su et al., 2021) (apart from the ablation in Fig. 10).

[Figure caption fragment: Step 0 corresponds to the untrained network. Note that this is computed with an actual finite-width network with randomly sampled (and then trained) parameters, unlike Fig. 1, which corresponds to the infinite-width limit (at initialisation).]

Parameter initialisations

By default, all weight matrices are fan-in initialised, $\mathcal{N}(0, \frac{\sigma^2}{\text{fan-in}})$, with σ = 1. The two exceptions to this are: 1) when using orthogonal initialisation, we use the scale-corrected uniform orthogonal initialisation (Martens et al., 2021) with scale σ = 1 (which for square matrices is just an orthogonal matrix sampled from the Haar measure (Meckes, 2019) multiplied by σ), and 2) for the parameter matrix immediately after the activation, we set σ to take an input activation norm ("q-values" or diagonals of Gram matrices) of 1 to an output activation norm of 1. In the latter case, σ = 1 by construction for activations transformed by DKS/TAT (Martens et al., 2021; Zhang et al., 2022). All bias parameters are initialised to 0.

(Transformed) Activations

For DKS (Martens et al., 2021) we set the slope parameter ξ = 5, and for TAT (Zhang et al., 2022) with Leaky ReLU, we set η = 0.9. Both values were chosen by a small hyperparameter sweep on WikiText-103. The DKS and TAT transformations are chosen without consideration of the attention blocks, so that the transformer can be viewed as an MLP (potentially with residual connections). Unless stated otherwise, all skipless transformers use a DKS-transformed GeLU as the nonlinearity in the MLP block by default.

Loss

We perform next-token prediction, with softmax cross-entropy loss. More concretely, if $X_L \in \mathbb{R}^{T \times d}$ denotes our final layer representation and $E \in \mathbb{R}^{|V| \times d}$ denotes our embedding layer, then we obtain logits $X_L E^\top \in \mathbb{R}^{T \times |V|}$ for each location for a $|V|$-way classification. The loss is softmax cross-entropy, obtained by using each location $i$ to predict the identity, in $|V|$, of the token at location $i + 1$.
In our training loss results, we use an exponential moving average with smoothing factor 0.01 to reduce randomness from mini-batching.

Optimiser We use the Adam optimiser (Kingma & Ba, 2014) with global gradient clipping of 0.1 by default (Pascanu et al., 2013). We do not use weight decay in our experiments.

Training We use mini-batches of 16 sequences for WikiText-103 and 8 for C4, due to memory constraints. Unless stated otherwise, we train for 100K steps on WikiText-103 and 50K steps on C4.

Learning rate In all experiments we use a linear learning rate warm-up period of 4000 steps, which increases from 0 to a maximum value (a tunable hyperparameter). In the remaining iterations, we use a "one-cycle" cosine learning rate schedule (Loshchilov & Hutter, 2016) which reaches 0 at the final iteration. The maximum learning rate was tuned in $\{1, 2, 3, 5\} \times 10^{-n}$ over $n \in \mathbb{Z}$ for all experiments. As the embedding layer does not change across the different depths and skip/normalisation settings we considered, we use a different, fixed maximum learning rate for the embedding layer, which was chosen to match the optimal learning rate for the default transformer ($2 \times 10^{-4}$ for WikiText-103 and $5 \times 10^{-4}$ for C4) and not tuned further.
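The learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `learning_rate` and its arguments are hypothetical, and the half-cosine form of the decay is an assumption about what the "one-cycle" cosine schedule looks like after warm-up.

```python
import math


def learning_rate(step, max_lr, total_steps, warmup_steps=4000):
    """Linear warm-up from 0 to max_lr, then cosine decay to 0 at total_steps.

    Assumed form: lr = 0.5 * max_lr * (1 + cos(pi * progress)) after warm-up.
    """
    if step < warmup_steps:
        # Linear warm-up phase.
        return max_lr * step / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


# Example: WikiText-103 embedding-layer setting from the text (100K steps, 2e-4).
schedule = [learning_rate(s, 2e-4, 100_000) for s in (0, 2000, 4000, 52_000, 100_000)]
```

With these settings the rate is 0 at step 0, peaks at the end of warm-up, and returns to 0 at the final step.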

H.2 ADDITIONAL IMPLEMENTATION DETAILS FOR INDIVIDUAL EXPERIMENTS

Attention-only kernel matrix evolution: Figs. 1 and 7 We calculate the evolution of the kernel matrix $\Sigma$ directly in $T \times T$ kernel matrix-space, where $T = 100$. Our input kernel matrix $\Sigma$ is sampled assuming a fraction $r = 0.02$ of repeated tokens, with value 1 if the token is repeated, and 0 otherwise. For all configurations of skip/normalisation/attention modifications corresponding to a row of Fig. 7, we use 8 heads. We now detail how each operation in any configuration of Fig. 7 affects the kernel matrix. From this it should be possible to reconstruct the kernel evolution for any row in Fig. 7. The 3 possible operations are: 1) attention, 2) skip connection, or 3) LN/RMSNorm operation.

1. Attention. Because our SPA methods (second and third rows) are agnostic to the number of heads at initialisation (all attention matrices in a self-attention block are the same across heads at initialisation), we apply Eq. (6) directly, so that a single attention block amounts to $\Sigma \leftarrow A \Sigma A^\top$ for attention matrix $A$ and incoming kernel matrix $\Sigma$. Our E-SPA method uses $\gamma_L = 0.005$ and our U-SPA method uses $\rho_L = 0.8$. For all other rows, the self-attention operation uses ALiBi (Press et al., 2022), a popular positional encoder which uses head-dependent pre-softmax bias matrices. More specifically, from the default pre-softmax bias matrices given by ALiBi (for 8 heads), we obtain 8 attention matrices $\{A_h\}_{h=1}^{8}$ using the softmax operation (which is exact assuming zero query-key dot product at initialisation). Because the different heads in transformers are typically concatenated along 8 equal size fractions of the total width $d$, the kernel evolution of an attention block on kernel matrix $\Sigma$ with 8-head ALiBi corresponds to (Martens et al., 2021):

$$\Sigma \leftarrow \frac{1}{8} \sum_{h=1}^{8} A_h \Sigma A_h^\top$$

2. Skip connection. For a skip connection with shortcut weight $\alpha$ and residual weight $\beta$, if $\Sigma_{\text{attn}}(\Sigma)$ denotes the output of a kernel matrix $\Sigma$ after a self-attention operation (i.e. from point 1. above), then an incoming kernel matrix $\Sigma$ gets mapped to:

$$\Sigma \leftarrow \alpha^2 \Sigma + \beta^2 \Sigma_{\text{attn}}(\Sigma).$$

3. Normalisation. For a normalisation operation, the incoming kernel matrix $\Sigma$ gets mapped to:

$$\Sigma \leftarrow \text{diag}(\Sigma)^{-\frac{1}{2}} \cdot \Sigma \cdot \text{diag}(\Sigma)^{-\frac{1}{2}}$$

Hyperparameter tuning of transformers with normalised skips All experiments with skip connections (i.e. shortcut weight $\alpha \neq 0$ for either the MLP or attention block) use untransformed GeLU activation in the MLP blocks. We combine E-SPA with normalised skips as described in Appendix F. We note that for high shortcut weights $\alpha$ and small final decay rate $\gamma_L$, the value of the attention matrix diagonal $\lambda_\alpha$, Eq. (24), may not be real, because the input to the square root in Eq. (24) may be negative. To get around this, we tune $\gamma_L \in \{0.005, 0.2, 0.4, 0.6\}$ when using a 36 block transformer on WikiText-103 and $\gamma_L \in \{0.005, 0.2, 0.4, 0.7, 1\}$ for Table 2, which used a 32 block transformer on C4. For Table 2, the normalised skip connections had separately tuned attention and MLP shortcut weights, as we observed a difference in the optimal shortcut weight for self-attention vs MLP blocks in Fig. 12. We tuned the attention shortcut weight in the range $\alpha \in \{0.98, 0.99, 0.995, 0.9975, 0.999\}$ and the MLP shortcut weight in the range $\alpha \in \{0.98, 0.99, 0.995\}$. Likewise, both the stable residual weights were tuned in $\beta \in \{0.05, 0.1, 0.15, 0.2\}$ separately for self-attention and MLP skips. The selected shortcut/residual weights (using validation performance) are presented in Table 7.

Depth scaling Due to memory constraints, for our deeper networks in Table 3 we use width $d = 512$ rather than 1024, with 8 heads to give $d_k = 64$. All depth scaling runs used a DKS-transformed GeLU with $\xi = 5$.
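The three kernel operations used for Figs. 1 and 7 (attention, skip connection, normalisation) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the function names are hypothetical, the input kernel is taken to be the identity rather than the sampled repeated-token kernel, and a single uniform causal attention matrix stands in for the ALiBi or SPA attention matrices used in the paper.

```python
import numpy as np


def attention_update(sigma, attn_mats):
    """Sigma <- (1/H) * sum_h A_h Sigma A_h^T, averaged over the H heads."""
    return sum(a @ sigma @ a.T for a in attn_mats) / len(attn_mats)


def skip_update(sigma, attn_mats, alpha, beta):
    """Sigma <- alpha^2 Sigma + beta^2 Sigma_attn(Sigma)."""
    return alpha**2 * sigma + beta**2 * attention_update(sigma, attn_mats)


def normalise(sigma):
    """Sigma <- diag(Sigma)^{-1/2} Sigma diag(Sigma)^{-1/2}."""
    scale = 1.0 / np.sqrt(np.diag(sigma))
    return scale[:, None] * sigma * scale[None, :]


def uniform_causal_attention(T):
    """Hypothetical stand-in: location i attends uniformly to locations <= i."""
    return np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]


# Skipless attention-only stack: repeated application drives all cosine
# similarities towards 1, i.e. rank collapse.
T, depth = 16, 60
sigma = np.eye(T)
A = uniform_causal_attention(T)
for _ in range(depth):
    sigma = attention_update(sigma, [A])
collapsed = normalise(sigma)
```

With this stand-in attention matrix every entry of `collapsed` is close to 1, which is the rank-collapse behaviour that the top row of Fig. 1 illustrates for standard attention.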

I THEORETICAL RESULTS

In this section we state and prove our theoretical results, including Theorems 1 and 2:

Theorem 1. (Non-negativity for U-SPA) Let $\Sigma = (1 - \rho) I_T + \rho \mathbf{1}\mathbf{1}^\top$ and $\Sigma' = (1 - \rho') I_T + \rho' \mathbf{1}\mathbf{1}^\top$, with respective (positive) Cholesky factors $L$ and $L'$. Then if $\rho \leq \rho'$, $L' L^{-1}$ is elementwise non-negative.

Theorem 2. (Non-negativity for E-SPA) Let matrices $(\Sigma)_{i,j} = \exp(-\gamma|i - j|)$ and $(\Sigma')_{i,j} = \exp(-\gamma'|i - j|)$ with respective (positive) Cholesky factors $L$ and $L'$. Then if $\gamma \geq \gamma'$, $L' L^{-1}$ is elementwise non-negative.

We prove Theorem 1 second as it is more involved.
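Both theorems can be sanity-checked numerically before reading the proofs. The sketch below is an illustration only: the helper names, the matrix size $T = 8$ and the particular values of $\rho$ and $\gamma$ are arbitrary choices, not settings taken from the experiments.

```python
import numpy as np


def uniform_kernel(T, rho):
    """Sigma = (1 - rho) I_T + rho 1 1^T (U-SPA kernel)."""
    return (1.0 - rho) * np.eye(T) + rho * np.ones((T, T))


def exponential_kernel(T, gamma):
    """Sigma_{i,j} = exp(-gamma |i - j|) (E-SPA kernel)."""
    idx = np.arange(T)
    return np.exp(-gamma * np.abs(idx[:, None] - idx[None, :]))


def cholesky_ratio(sigma_in, sigma_out):
    """Return L' L^{-1}, with L and L' the Cholesky factors of the inputs."""
    L_in = np.linalg.cholesky(sigma_in)
    L_out = np.linalg.cholesky(sigma_out)
    # Solve A L_in = L_out for A via the transposed system L_in^T A^T = L_out^T.
    return np.linalg.solve(L_in.T, L_out.T).T


T = 8
A_uniform = cholesky_ratio(uniform_kernel(T, 0.2), uniform_kernel(T, 0.8))            # rho <= rho'
A_exponential = cholesky_ratio(exponential_kernel(T, 0.5), exponential_kernel(T, 0.005))  # gamma >= gamma'

# Reversing the order violates the theorems' conditions and yields negative entries.
A_uniform_rev = cholesky_ratio(uniform_kernel(T, 0.8), uniform_kernel(T, 0.2))
A_exponential_rev = cholesky_ratio(exponential_kernel(T, 0.005), exponential_kernel(T, 0.5))
```

The reversed cases show that the ordering conditions $\rho \leq \rho'$ and $\gamma \geq \gamma'$ matter: without them the resulting matrix has negative entries.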

I.1 PROOF OF THEOREM 2

We actually prove Theorem 2 as a corollary of Theorem 3, which provides the analytic form for $L' L^{-1}$.

Theorem 3. Let matrices $(\Sigma)_{i,j} = \exp(-\gamma|i - j|)$ and $(\Sigma')_{i,j} = \exp(-\gamma'|i - j|)$ with respective (positive) Cholesky factors $L$ and $L'$. Then if $\gamma \geq \gamma' > 0$, $A = L' L^{-1}$ takes the following form:

$$A_{i,j} = \begin{cases} 1, & \text{if } i = j = 1 \\ \dfrac{a(\gamma')}{a(\gamma)}, & \text{if } i = j \neq 1 \\ \left(\exp(-\gamma') - \dfrac{a(\gamma')}{a(\gamma)} \exp(-\gamma)\right) \exp\big(-\gamma'(i - j - 1)\big), & \text{if } j = 1,\ i > j \\ \dfrac{a(\gamma')}{a(\gamma)} \big(\exp(-\gamma') - \exp(-\gamma)\big) \exp\big(-\gamma'(i - j - 1)\big), & \text{if } j \neq 1,\ i > j \\ 0, & \text{else} \end{cases}$$

where $a(\gamma) = \sqrt{1 - \exp(-2\gamma)}$, and likewise $a(\gamma') = \sqrt{1 - \exp(-2\gamma')}$.

Proof of Theorem 2. From Theorem 3, we have the analytic form of $A$. Clearly $a(\gamma), a(\gamma')$ are positive, and moreover if $\gamma' \leq \gamma$, then $\exp(-\gamma') - \exp(-\gamma)$ is non-negative. Finally, because $\frac{a(\gamma')}{a(\gamma)} \leq 1$ when $\gamma' \leq \gamma$, $\exp(-\gamma') - \frac{a(\gamma')}{a(\gamma)} \exp(-\gamma)$ is non-negative too.

To prove Theorem 3, we first compute the analytic form of the Cholesky factor $L$ for $(\Sigma)_{i,j} = \exp(-\gamma|i - j|)$ in Lemma 1.

Lemma 1. Let $(\Sigma)_{i,j} = \exp(-\gamma|i - j|)$ with (positive) Cholesky factor $L$ such that $L L^\top = \Sigma$. Then, we have:

$$L_{i,j} = \begin{cases} \exp(-\gamma|j - i|), & \text{if } j = 1,\ i \geq j \\ \sqrt{1 - \exp(-2\gamma)}\, \exp(-\gamma|j - i|), & \text{if } j \neq 1,\ i \geq j \\ 0, & \text{else} \end{cases} \tag{28}$$

Proof of Lemma 1. It is clear that $\Sigma$ is positive semi-definite, as it is the covariance matrix of a stationary Ornstein-Uhlenbeck process, hence a Cholesky factor must exist. We now show that it is $L$, Eq. (28).
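If we define $l = \min(m, n)$, then we have:

$$\begin{aligned} (L L^\top)_{m,n} &= L_{m,1} L_{n,1} + \sum_{i=2}^{l} L_{m,i} L_{n,i} \\ &= \exp(-\gamma(m + n - 2)) + \big(1 - \exp(-2\gamma)\big) \sum_{i=2}^{l} \exp(-\gamma(m + n - 2i)) \\ &= \exp(-\gamma(m + n - 2)) \left[ 1 + \big(1 - \exp(-2\gamma)\big) \sum_{i=1}^{l-1} \exp(2\gamma i) \right] \\ &= \exp(-\gamma(m + n - 2)) \left[ 1 + \big(1 - \exp(-2\gamma)\big) \exp(2\gamma) \frac{1 - \exp(2\gamma(l - 1))}{1 - \exp(2\gamma)} \right] \\ &= \exp(-\gamma(m + n - 2)) \exp(2\gamma(l - 1)) \\ &= \exp(-\gamma(m + n - 2l)) = \exp(-\gamma|m - n|) = \Sigma_{m,n}, \end{aligned}$$

where the fifth line uses $(1 - \exp(-2\gamma)) \exp(2\gamma) = -(1 - \exp(2\gamma))$.

In the proof of Theorem 3, for the columns $l > 1$ the entries of $AL$ satisfy:

$$\begin{aligned} (AL)_{k,l} &= a(\gamma') \left[ \exp(-\gamma(k - l)) + \sum_{i=l}^{k-1} \big(\exp(-\gamma') - \exp(-\gamma)\big) \exp\big(-\gamma'(k - i - 1)\big) \exp(-\gamma(i - l)) \right] \\ &= a(\gamma') \left[ \exp(-\gamma(k - l)) + \big(\exp(-\gamma') - \exp(-\gamma)\big) \exp\big(-\gamma'(k - l - 1)\big) \frac{1 - \exp\big((\gamma' - \gamma)(k - l)\big)}{1 - \exp(\gamma' - \gamma)} \right] \\ &= a(\gamma') \left[ \exp(-\gamma(k - l)) + \exp(-\gamma') \exp\big(-\gamma'(k - l - 1)\big) \Big(1 - \exp\big((\gamma' - \gamma)(k - l)\big)\Big) \right] \\ &= a(\gamma') \left[ \exp(-\gamma(k - l)) + \exp\big(-\gamma'(k - l)\big) \Big(1 - \exp\big((\gamma' - \gamma)(k - l)\big)\Big) \right] \\ &= a(\gamma') \left[ \exp\big(-\gamma'(k - l)\big) + \exp\big(-\gamma(k - l)\big) - \exp\big(-\gamma(k - l)\big) \right] \\ &= a(\gamma') \exp\big(-\gamma'(k - l)\big) \\ &= L'_{k,l} \end{aligned}$$

I.2 PROOF OF THEOREM 1

Theorem 1. (Non-negativity for U-SPA) Let $\Sigma = (1 - \rho) I_T + \rho \mathbf{1}\mathbf{1}^\top$ and $\Sigma' = (1 - \rho') I_T + \rho' \mathbf{1}\mathbf{1}^\top$, with respective (positive) Cholesky factors $L$ and $L'$. Then if $\rho \leq \rho'$, $L' L^{-1}$ is elementwise non-negative.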

I.2.1 PRELIMINARIES

Before diving into the actual proof of the theorem we derive several useful properties and introduce notation. First we make a slight notational change from the statement of the theorem by replacing $\Sigma$, which depends on $\rho$, with $C_n(x)$, where $n$ represents the size of the matrix, while $x$ replaces $\rho$. In addition, since the case of $\rho = \rho'$ (or $x = y$ in the rest of the proof) is trivial, as the resulting matrix is then the identity, we restrict ourselves to the case where $0 < x < y < 1$. In the mathematical derivations we denote with capital English letters (e.g. $A, B, C, T$) any temporary expressions, which are expanded on the following lines. Note that these are never general definitions, so they might be used multiple times for different expressions. We denote vectors and vector functions in bold and scalars and scalar functions in standard font.

Let $\mathbf{1}_n$ be the $n$-dimensional vector with only ones:

$$\mathbf{1}_n = \begin{pmatrix} 1 & \dots & 1 \end{pmatrix}^T \in \mathbb{R}^n$$

We denote with $\mathbf{x}_n$ the $n$-dimensional vector with only $x$'s:

$$\mathbf{x}_n = x \mathbf{1}_n$$

First, we define the linear map $p_n(x)$ as:

$$p_n(x) = nx + 1$$

Proposition 1. The vector $\mathbf{1}_n$ is an eigenvector of $C_n(x)$ with eigenvalue $p_{n-1}(x)$.

Proof. Directly calculating the $k$-th entry of the product $C_n(x)\mathbf{1}_n$ gives:

$$[C_n(x)\mathbf{1}_n]_k = \sum_{i=1}^{n} [C_n(x)]_{k,i} = \sum_{i=1}^{n} \big(\delta_{ki} + (1 - \delta_{ki})x\big) = 1 + (n - 1)x$$

which directly implies that $C_n(x)\mathbf{1}_n = p_{n-1}(x)\mathbf{1}_n$.

Corollary 1. The vector $\mathbf{1}_n$ is an eigenvector of $C_n(x)^{-1}$ with eigenvalue $\frac{1}{p_{n-1}(x)}$.

Corollary 2. $C_n(x)^{-1}\mathbf{x}_n = \frac{1}{p_{n-1}(x)}\mathbf{x}_n$.

Corollary 3. $\mathbf{x}_n^T C_n(x)^{-1}\mathbf{x}_n = \frac{nx^2}{p_{n-1}(x)}$.

Further we define the following useful functions:

$$\begin{aligned} d_n(x) &= 1 - \mathbf{x}_n^T C_n(x)^{-1}\mathbf{x}_n = 1 - \frac{nx^2}{1 + (n - 1)x} = \frac{1 + nx - x - nx^2}{1 + (n - 1)x} = \frac{(1 - x)(nx + 1)}{1 + (n - 1)x} = (1 - x)\frac{p_n(x)}{p_{n-1}(x)}, \\ r_n(x) &= \sqrt{d_n(x)}\, p_{n-1}(x) = \sqrt{(1 - x)\, p_n(x)\, p_{n-1}(x)} \;\Rightarrow\; \frac{r_n(x)}{p_{n-1}(x)} = \sqrt{d_n(x)} \;\Rightarrow\; \frac{r_n(x)}{\sqrt{d_n(x)}} = p_{n-1}(x). \end{aligned}$$

Definition 1. The function $\mathbf{v}_n : [0, 1] \times [0, 1] \to \mathbb{R}^n$ maps the two bounded scalars $x$ and $y$ to the vector:

$$\mathbf{v}_n(x, y) = L_n(x)^{-T} L_n(y)^{-1}\mathbf{y}_n - \sqrt{\frac{d_n(y)}{d_n(x)}}\, C_n(x)^{-1}\mathbf{x}_n.$$
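The identities in these preliminaries are easy to confirm numerically. The following sketch mirrors the notation of the text ($C_n$, $p_n$, $d_n$, $r_n$) as Python functions; the values $n = 5$ and $x = 0.3$ are arbitrary illustrative choices.

```python
import numpy as np


def C(n, x):
    """C_n(x): ones on the diagonal, x everywhere else."""
    return (1.0 - x) * np.eye(n) + x * np.ones((n, n))


def p(n, x):
    """p_n(x) = n x + 1."""
    return n * x + 1.0


def d(n, x):
    """d_n(x) = (1 - x) p_n(x) / p_{n-1}(x)."""
    return (1.0 - x) * p(n, x) / p(n - 1, x)


def r(n, x):
    """r_n(x) = sqrt(d_n(x)) p_{n-1}(x)."""
    return np.sqrt(d(n, x)) * p(n - 1, x)


n, x = 5, 0.3
ones = np.ones(n)
x_vec = x * ones

# Proposition 1: 1_n is an eigenvector of C_n(x) with eigenvalue p_{n-1}(x).
eig_residual = C(n, x) @ ones - p(n - 1, x) * ones

# Corollary 3 and the definition of d_n(x).
quad_form = x_vec @ np.linalg.solve(C(n, x), x_vec)

# sqrt(d_n(x)) is the last diagonal entry of the Cholesky factor of C_{n+1}(x).
last_diag = np.linalg.cholesky(C(n + 1, x))[-1, -1]
```

The last check anticipates the block form of $L_{n+1}(x)$ used in the main proof, whose bottom-right entry is $\sqrt{d_n(x)}$.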

I.2.2 LEMMAS

Lemma 2. If $0 < x < y < 1$ then all entries of the vector $\mathbf{v}_n(x, y)$ are non-negative: $[\mathbf{v}_n(x, y)]_i \geq 0$.

Lemma 3. If $0 < x < y < 1$ then the function

$$g_n(x, y) = \frac{y(1 - y)\, p_{n-1}(x)}{r_n(x)\, r_n(y)} - \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\, \frac{x}{p_n(x)}$$

is non-negative.

Lemma 4. If $0 < x < y < 1$ then the function

$$f_n(x, y) = \frac{xy(1 - y)}{r_n(x)\, r_n(y)} + \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\, \frac{x}{p_n(x)} - \sqrt{\frac{d_n(y)}{d_n(x)}}\, \frac{x}{p_{n-1}(x)}$$

is non-negative, in agreement with the conclusion $f_n(x, y) \geq 0$ reached in its proof.

Definition 2. The partial sum of the functions $f_n(x, y)$ from $k + 1$ to $n - 1$ is denoted by $h_k^n(x, y) = \sum_{i=k+1}^{n-1} f_i(x, y)$, which by Lemma 4 is always non-negative.
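Lemmas 2 and 3 can be checked numerically straight from Definition 1 and the definition of $g_n$. The sketch below is a spot check on a small grid under arbitrary choices of $n$, $x$ and $y$; it does not replace the proofs. The function names mirror the symbols in the text.

```python
import numpy as np


def C(n, x):
    return (1.0 - x) * np.eye(n) + x * np.ones((n, n))


def p(n, x):
    return n * x + 1.0


def d(n, x):
    return (1.0 - x) * p(n, x) / p(n - 1, x)


def r(n, x):
    return np.sqrt(d(n, x)) * p(n - 1, x)


def v(n, x, y):
    """v_n(x, y) = L_n(x)^{-T} L_n(y)^{-1} y_n - sqrt(d_n(y)/d_n(x)) C_n(x)^{-1} x_n."""
    Lx = np.linalg.cholesky(C(n, x))
    Ly = np.linalg.cholesky(C(n, y))
    first = np.linalg.solve(Lx.T, np.linalg.solve(Ly, y * np.ones(n)))
    second = np.sqrt(d(n, y) / d(n, x)) * np.linalg.solve(C(n, x), x * np.ones(n))
    return first - second


def g(n, x, y):
    """g_n(x, y) as defined in Lemma 3."""
    return (y * (1.0 - y) * p(n - 1, x) / (r(n, x) * r(n, y))
            - np.sqrt(d(n + 1, y) / d(n + 1, x)) * x / p(n, x))


# Spot check on a grid of pairs with 0 < x < y < 1.
pairs = [(0.1, 0.2), (0.2, 0.8), (0.5, 0.9), (0.05, 0.95)]
min_v = min(v(n, x, y).min() for n in range(1, 8) for x, y in pairs)
min_g = min(g(n, x, y) for n in range(1, 8) for x, y in pairs)
```

The last entry of $\mathbf{v}_{n+1}(x, y)$ coincides with $g_n(x, y)$, which is the link between the two lemmas exploited in the proof of Lemma 2.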

I.2.3 MAIN PROOF

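The theorem is proven by induction. First, for $n = 1$ we have that $C_1(x) = C_1(y) = [1]$, which implies that $L_1(x) = L_1(y) = [1]$ and the condition is trivially satisfied. Now assuming that the statement is true for all integers up to $n$, we prove that it also holds for $n + 1$. The matrix $C_{n+1}(x)$ and its Cholesky factor have the block forms

$$C_{n+1}(x) = \begin{pmatrix} C_n(x) & \mathbf{x}_n \\ \mathbf{x}_n^T & 1 \end{pmatrix}, \qquad L_{n+1}(x) = \begin{pmatrix} L_n(x) & \mathbf{0}_n \\ \mathbf{x}_n^T L_n(x)^{-T} & \sqrt{d_n(x)} \end{pmatrix},$$

which we verify directly:

$$L_{n+1}(x) L_{n+1}(x)^T = \begin{pmatrix} L_n(x) & \mathbf{0}_n \\ \mathbf{x}_n^T L_n(x)^{-T} & \sqrt{d_n(x)} \end{pmatrix} \begin{pmatrix} L_n(x)^T & L_n(x)^{-1}\mathbf{x}_n \\ \mathbf{0}_n^T & \sqrt{d_n(x)} \end{pmatrix} = \begin{pmatrix} L_n(x) L_n(x)^T & \mathbf{x}_n \\ \mathbf{x}_n^T & A \end{pmatrix}$$

$$A = \mathbf{x}_n^T L_n(x)^{-T} L_n(x)^{-1}\mathbf{x}_n + d_n(x) = \mathbf{x}_n^T C_n(x)^{-1}\mathbf{x}_n + d_n(x) = 1$$

The inverse of the Cholesky factor is

$$L_{n+1}(x)^{-1} = \begin{pmatrix} L_n(x)^{-1} & \mathbf{0}_n \\ -\frac{1}{r_n(x)}\mathbf{x}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix},$$

since

$$\begin{aligned} L_{n+1}(x) L_{n+1}(x)^{-1} &= \begin{pmatrix} L_n(x) & \mathbf{0}_n \\ \mathbf{x}_n^T L_n(x)^{-T} & \sqrt{d_n(x)} \end{pmatrix} \begin{pmatrix} L_n(x)^{-1} & \mathbf{0}_n \\ -\frac{1}{r_n(x)}\mathbf{x}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix} \\ &= \begin{pmatrix} I_n & \mathbf{0}_n \\ \mathbf{x}_n^T L_n(x)^{-T} L_n(x)^{-1} - \frac{\sqrt{d_n(x)}}{r_n(x)}\mathbf{x}_n^T & 1 \end{pmatrix} = \begin{pmatrix} I_n & \mathbf{0}_n \\ \mathbf{x}_n^T C_n(x)^{-1} - \frac{1}{p_{n-1}(x)}\mathbf{x}_n^T & 1 \end{pmatrix} = \begin{pmatrix} I_n & \mathbf{0}_n \\ \mathbf{0}_n^T & 1 \end{pmatrix}. \end{aligned}$$

Finally,

$$\begin{aligned} L_{n+1}(y) L_{n+1}(x)^{-1} &= \begin{pmatrix} L_n(y) & \mathbf{0}_n \\ \mathbf{y}_n^T L_n(y)^{-T} & \sqrt{d_n(y)} \end{pmatrix} \begin{pmatrix} L_n(x)^{-1} & \mathbf{0}_n \\ -\frac{1}{r_n(x)}\mathbf{x}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix} \\ &= \begin{pmatrix} L_n(y) L_n(x)^{-1} & \mathbf{0}_n \\ \mathbf{y}_n^T L_n(y)^{-T} L_n(x)^{-1} - \sqrt{\frac{d_n(y)}{d_n(x)}}\frac{1}{p_{n-1}(x)}\mathbf{x}_n^T & \sqrt{\frac{d_n(y)}{d_n(x)}} \end{pmatrix} = \begin{pmatrix} L_n(y) L_n(x)^{-1} & \mathbf{0}_n \\ \mathbf{v}_n(x, y)^T & \sqrt{\frac{d_n(y)}{d_n(x)}} \end{pmatrix}. \end{aligned}$$

Using the fact that $L_n(y) L_n(x)^{-1}$ is a lower triangular and non-negative matrix by the inductive assumption, combined with Lemma 2 and the fact that $d_n(x) \geq 0$, it follows that $L_{n+1}(y) L_{n+1}(x)^{-1}$ is also a lower triangular non-negative matrix.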

I.2.4 PROOF OF LEMMA 2

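First we inspect the evolution of $\mathbf{v}_n(x, y)$ as we increase $n$:

$$\mathbf{v}_{n+1}(x, y) = L_{n+1}(x)^{-T} L_{n+1}(y)^{-1}\mathbf{y}_{n+1} - \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\, C_{n+1}(x)^{-1}\mathbf{x}_{n+1} = A - \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{1}{p_n(x)}\mathbf{x}_{n+1}$$

$$\begin{aligned} A &= L_{n+1}(x)^{-T} L_{n+1}(y)^{-1}\mathbf{y}_{n+1} = \begin{pmatrix} L_n(x)^{-T} & -\frac{1}{r_n(x)}\mathbf{x}_n \\ \mathbf{0}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix} \begin{pmatrix} L_n(y)^{-1} & \mathbf{0}_n \\ -\frac{1}{r_n(y)}\mathbf{y}_n^T & \frac{1}{\sqrt{d_n(y)}} \end{pmatrix} \mathbf{y}_{n+1} \\ &= \begin{pmatrix} L_n(x)^{-T} & -\frac{1}{r_n(x)}\mathbf{x}_n \\ \mathbf{0}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix} \begin{pmatrix} L_n(y)^{-1}\mathbf{y}_n \\ -\frac{ny^2}{r_n(y)} + \frac{y}{\sqrt{d_n(y)}} \end{pmatrix} = \begin{pmatrix} L_n(x)^{-T} & -\frac{1}{\sqrt{d_n(x)}\, p_{n-1}(x)}\mathbf{x}_n \\ \mathbf{0}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix} \begin{pmatrix} L_n(y)^{-1}\mathbf{y}_n \\ \frac{B}{r_n(y)} \end{pmatrix} \end{aligned}$$

$$B = y\, p_{n-1}(y) - ny^2 = y((n - 1)y + 1) - ny^2 = y - y^2 = y(1 - y)$$

$$A = \begin{pmatrix} L_n(x)^{-T} & -\frac{1}{r_n(x)}\mathbf{x}_n \\ \mathbf{0}_n^T & \frac{1}{\sqrt{d_n(x)}} \end{pmatrix} \begin{pmatrix} L_n(y)^{-1}\mathbf{y}_n \\ \frac{y(1 - y)}{r_n(y)} \end{pmatrix} = \begin{pmatrix} L_n(x)^{-T} L_n(y)^{-1}\mathbf{y}_n - \frac{y(1 - y)}{r_n(x) r_n(y)}\mathbf{x}_n \\ \frac{y(1 - y)}{\sqrt{d_n(x)}\, r_n(y)} \end{pmatrix} = \begin{pmatrix} B \\ \frac{y(1 - y)}{\sqrt{d_n(x)}\, r_n(y)} \end{pmatrix}$$

$$\begin{aligned} B &= L_n(x)^{-T} L_n(y)^{-1}\mathbf{y}_n - \frac{y(1 - y)}{r_n(x) r_n(y)}\mathbf{x}_n = \mathbf{v}_n(x, y) + \sqrt{\frac{d_n(y)}{d_n(x)}}\, C_n(x)^{-1}\mathbf{x}_n - \frac{y(1 - y)}{r_n(x) r_n(y)}\mathbf{x}_n \\ &= \mathbf{v}_n(x, y) - \frac{y(1 - y)}{r_n(x) r_n(y)}\mathbf{x}_n + \sqrt{\frac{d_n(y)}{d_n(x)}}\frac{1}{p_{n-1}(x)}\mathbf{x}_n \end{aligned}$$

$$A = \begin{pmatrix} \mathbf{v}_n(x, y) - \frac{y(1 - y)}{r_n(x) r_n(y)}\mathbf{x}_n + \sqrt{\frac{d_n(y)}{d_n(x)}}\frac{1}{p_{n-1}(x)}\mathbf{x}_n \\ \frac{y(1 - y)\, p_{n-1}(x)}{r_n(x) r_n(y)} \end{pmatrix}$$

$$\mathbf{v}_{n+1}(x, y) = A - \begin{pmatrix} \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{1}{p_n(x)}\mathbf{x}_n \\ \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{x}{p_n(x)} \end{pmatrix} = \begin{pmatrix} \mathbf{v}_n(x, y) - \left( \frac{y(1 - y)}{r_n(x) r_n(y)} + \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{1}{p_n(x)} - \sqrt{\frac{d_n(y)}{d_n(x)}}\frac{1}{p_{n-1}(x)} \right)\mathbf{x}_n \\ \frac{y(1 - y)\, p_{n-1}(x)}{r_n(x) r_n(y)} - \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{x}{p_n(x)} \end{pmatrix}$$

Since $\mathbf{x}_n = x\mathbf{1}_n$, the top block equals $\mathbf{v}_n(x, y) - f_n(x, y)\mathbf{1}_n$ and the bottom entry equals $g_n(x, y)$. Using the definitions from Lemma 3, Lemma 4 and Definition 2 we have that:

$$\begin{aligned} \mathbf{v}_1 &= [g_0]^T \\ \mathbf{v}_2 &= [g_0 - f_1,\ g_1]^T \\ \mathbf{v}_3 &= [g_0 - f_1 - f_2,\ g_1 - f_2,\ g_2]^T \\ &\ \ \vdots \\ \mathbf{v}_n &= \Big[ g_0 - \sum_{i=1}^{n-1} f_i,\ g_1 - \sum_{i=2}^{n-1} f_i,\ \dots,\ g_k - \sum_{i=k+1}^{n-1} f_i,\ \dots \Big]^T = [g_0 - h_0^n,\ g_1 - h_1^n,\ \dots,\ g_k - h_k^n,\ \dots]^T \end{aligned}$$

Or in other words we have that:

$$[\mathbf{v}_n(x, y)]_k = g_k(x, y) - h_k^n(x, y)$$

Thus proving the lemma reduces to proving that:

$$g_k(x, y) - h_k^n(x, y) \geq 0 \quad \forall k, n$$

Expanding the expression that we need to show is non-negative:

$$\begin{aligned} L_k^n(x, y) &= g_k(x, y) - h_k^{n+1}(x, y) = g_k(x, y) - \sum_{i=k+1}^{n} f_i(x, y) \\ &= \frac{y(1 - y)\, p_{k-1}(x)}{r_k(x) r_k(y)} - \sqrt{\frac{d_{k+1}(y)}{d_{k+1}(x)}}\frac{x}{p_k(x)} - \sum_{i=k+1}^{n} \left( \frac{xy(1 - y)}{r_i(x) r_i(y)} + \sqrt{\frac{d_{i+1}(y)}{d_{i+1}(x)}}\frac{x}{p_i(x)} - \sqrt{\frac{d_i(y)}{d_i(x)}}\frac{x}{p_{i-1}(x)} \right) \\ &= \frac{y(1 - y)\, p_{k-1}(x)}{r_k(x) r_k(y)} - \sum_{i=k+1}^{n} \frac{xy(1 - y)}{r_i(x) r_i(y)} - B \end{aligned}$$

$$\begin{aligned} B &= \sqrt{\frac{d_{k+1}(y)}{d_{k+1}(x)}}\frac{x}{p_k(x)} + \sum_{i=k+1}^{n} \sqrt{\frac{d_{i+1}(y)}{d_{i+1}(x)}}\frac{x}{p_i(x)} - \sum_{i=k+1}^{n} \sqrt{\frac{d_i(y)}{d_i(x)}}\frac{x}{p_{i-1}(x)} \\ &= \sum_{i=k}^{n} \sqrt{\frac{d_{i+1}(y)}{d_{i+1}(x)}}\frac{x}{p_i(x)} - \sum_{i=k}^{n-1} \sqrt{\frac{d_{i+1}(y)}{d_{i+1}(x)}}\frac{x}{p_i(x)} = \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{x}{p_n(x)} \end{aligned}$$

$$L_k^n(x, y) = \frac{y(1 - y)\, p_{k-1}(x)}{r_k(x) r_k(y)} - \sum_{i=k+1}^{n} \frac{xy(1 - y)}{r_i(x) r_i(y)} - \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{x}{p_n(x)}$$

Now we turn our attention to the sum in the middle:

$$B = \sum_{i=k+1}^{n} \frac{xy(1 - y)}{r_i(x) r_i(y)} = xy(1 - y) \sum_{i=k+1}^{n} \frac{1}{r_i(x) r_i(y)} = xy\sqrt{\frac{1 - y}{1 - x}} \sum_{i=k+1}^{n} \frac{\sqrt{(1 - x)(1 - y)}}{r_i(x) r_i(y)}$$

Using the Cauchy-Schwarz inequality:

$$B \leq xy\sqrt{\frac{1 - y}{1 - x}} \sqrt{\left[ \sum_{i=k+1}^{n} \left( \frac{\sqrt{1 - x}}{r_i(x)} \right)^2 \right] \left[ \sum_{i=k+1}^{n} \left( \frac{\sqrt{1 - y}}{r_i(y)} \right)^2 \right]}$$

Let us define the partial sum in the brackets as:

$$I_k^n(x) = \sum_{i=k}^{n} \left( \frac{\sqrt{1 - x}}{r_i(x)} \right)^2 = \sum_{i=k}^{n} \frac{1 - x}{(1 - x)\, p_i(x)\, p_{i-1}(x)} = \sum_{i=k}^{n} \frac{1}{p_i(x)\, p_{i-1}(x)}$$

We prove by induction that:

$$I_k^n(x) = \frac{n - k + 1}{p_{k-1}(x)\, p_n(x)}$$

First, for $n = k$, we have that $I_n^n(x) = \frac{1}{p_n(x) p_{n-1}(x)}$, which clearly satisfies the above equation. Assuming this is true for $n$, we now show it holds for $n + 1$:

$$I_k^{n+1}(x) = \frac{n - k + 1}{p_{k-1}(x)\, p_n(x)} + \frac{1}{p_n(x)\, p_{n+1}(x)} = \frac{(n - k + 1)\, p_{n+1}(x) + p_{k-1}(x)}{p_{k-1}(x)\, p_n(x)\, p_{n+1}(x)} = \frac{(n - k + 2)(nx + 1)}{p_{k-1}(x)\, p_n(x)\, p_{n+1}(x)} = \frac{n - k + 2}{p_{k-1}(x)\, p_{n+1}(x)}$$

which concludes the induction. This now means that:

$$B \leq xy\sqrt{\frac{1 - y}{1 - x}} \sqrt{I_{k+1}^n(x)\, I_{k+1}^n(y)} = xy\sqrt{\frac{1 - y}{1 - x}} \sqrt{\frac{(n - k)^2}{p_k(x)\, p_n(x)\, p_k(y)\, p_n(y)}}$$

Thus we can conclude that:

$$L_k^n(x, y) \geq \frac{y(1 - y)\, p_{k-1}(x)}{r_k(x) r_k(y)} - xy\sqrt{\frac{1 - y}{1 - x}} \sqrt{\frac{(n - k)^2}{p_k(x)\, p_n(x)\, p_k(y)\, p_n(y)}} - \sqrt{\frac{d_{n+1}(y)}{d_{n+1}(x)}}\frac{x}{p_n(x)}$$

From the fact that $h_k^n(x, y) \geq 0$ (each $f_i$ being non-negative), it follows that $L_k^n$ is a decreasing function of $n$, hence we only need to show that, for any fixed $x$ and $y$, $L_k^\infty(x, y)$ is non-negative. For this, we need to take the limit of the second and third terms in the above bound. The square of the third term satisfies

$$\lim_{n \to \infty} \frac{d_{n+1}(y)}{d_{n+1}(x)}\frac{x^2}{p_n(x)^2} = \lim_{n \to \infty} \frac{x^2 (1 - y)\, p_{n+1}(y)}{(1 - x)\, p_{n+1}(x)\, p_n(x)\, p_n(y)} = \frac{x^2 (1 - y)}{1 - x} \lim_{n \to \infty} \frac{p_{n+1}(y)}{p_{n+1}(x)\, p_n(x)\, p_n(y)} = 0$$

where the last line is true since the denominator is of higher degree in $n$.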
I n+1 k (x) = Taking the limit of the second term corresponds to computing the limit: A = lim n→∞ n -k p n (x) = im n→∞ n -k nx + 1 = 1 x Hence, this means that: All of the limits must be taken for fixed x and y (e.g. we can't have them approach 0 or 1 simultanously with n). A 2 = y 2 (kx -x + 1)(nx + 1)(ny + 1)(nx + 1) L ∞ k (x, B 2 = x 2 y 2 (n -k)(kx -x + 1)(nx + x + 1) C 2 = x 2 (kx + 1)(ky + 1)(ky -y + 1)(ny + y + 1) From the definition of g n (x, y) and the fact that p n (x) and r n (x) are non-negative functions we can conclude that: A = xy 1 -y 1 -x (n -k) 2 p k (x g n (x, y) ⩾ 0 ⇔ y(1 -y)p n-1 (x) r n (x)r n (y) x 2 p n (x) 2 = y 2 (1 -y) 2 p n-1 (x) 2 (1 -x)p n (x)p n-1 (x)(1 -y)p n (y)p n-1 (y) - (1 -y) pn+1(y) pn(y) (1 -x) pn+1(x) pn(x) x 2 p n (x) 2

=

(1 -y)y 2 p n-1 (x) (1 -x)p n (x)p n (y)p n-1 (y) -(1 -y)x 2 p n+1 (y) (1 -x)p n+1 (x)p n (x)p n (y) (1 -y) (1 -x)p n (x)p n (y) y 2 p n-1 (x) p n-1 (y) -x 2 p n+1 (y) p n+1 (x) = (1 -y)B (1 -x)p n (x)p n (y)p n-1 (y)p n+1 (x) B = y 2 p n-1 (x)p n+1 (x) -x 2 p n+1 (y)p n-1 (y) = y 2 ((nx + 1) -x)((nx + 1) + x) -x 2 ((ny + 1) -y)((ny + 1) + y) = y 2 (nx + 1) 2 -y 2 x 2 -x 2 (ny + 1) 2 + x 2 y 2 = (y(nx + 1) + x(ny + 1))(y(nx + 1) -x(ny + 1)) = (2nxy + x + y)(y -x) A = (y -x) (1 -y)(2nxy + x + y) (1 -x)p n (x)p n (y)p n-1 (y)p n+1 (x) Hence for x ⩽ y we have that A ⩾ 0 ⇔ g n (x, y) ⩾ 0. (1 -x) pn(x) pn-1(x) x 2 p n-1 (x) 2 -(1 -y) pn+1(y) pn(y) (1 -x) pn+1(x) pn(x) x 2 p n (x) 2  L = A 2 -C 2 -B 2 + 2BC 2L 1 = A 2 -C 2 -B 2 = x 2 y 2 (1 -y) 2 (1 -x)p n (x)p n-1 (x)(1 -y)p n (y)p n-1 (y) -  x 2 1 -y 1 -x 1 p n (x)



We note that this formula for the kernel matrix is exact even at finite widths, if W is orthogonal at initialisation. This is in contrast to the standard kernel correspondence of NNs at initialisation, where the usual kernel equations are only approximations at finite width that become increasingly accurate as width increases (Neal, 2012; Daniely et al., 2016; Yang, 2019; Martens, 2021; Li et al., 2022).

We describe compatibility of our methods with non-causal attention in Appendix A.

This is similar in principle to the Delta-initialisation for CNNs (Balduzzi et al., 2017; Xiao et al., 2018), and modifications to GNNs to enable compatibility with TAT (Zaidi et al., 2022).

Because the embedding matrix $E$ is initialised with variance 1/fan-out, we rescale the embeddings it produces by $\sqrt{d_0}$ to get a $\Sigma_0$ with ones on the diagonal.

Constraint (iii) is satisfied as lower triangular matrices are closed under multiplication and inversion.

As argued by Nakkiran et al. (2021), the validation curves measure the "training speed" of the online setting.

We find that $r \approx 0.008$ for SentencePiece tokenisations like those we use in WikiText-103 and C4, but $r \approx 0.05$ for character-level prediction like in EnWiki-8.



Figure 1: Normalised kernel matrices $\text{diag}(\Sigma_l)^{-\frac{1}{2}} \cdot \Sigma_l \cdot \text{diag}(\Sigma_l)^{-\frac{1}{2}}$ (which are like kernel matrices except with cosine similarities instead of inner-products) at various depths for standard attention-only vanilla transformers and two of our proposed alternatives (Section 3). Standard attention-only vanilla transformers (top) quickly suffer from rank collapse where all entries of the normalised kernel converge to 1, whereas our approaches, U-SPA and E-SPA, maintain controlled signal propagation even at large depths. Moreover, our main method E-SPA (bottom) exhibits a recency bias, where cosine similarities corresponding to nearby pairs of locations are larger, akin to positional encoding. Equivalent plots for attention-only transformers with skips and normalisation can be found in Fig. 7.

Figure 3: Transformers with normalised skip connections, trained for 100K steps. E-SPA (left) without normalisation matches the training speed of a standard transformer. Results denote the mean of 2 seeds.

Figure 2: Comparison of skipless transformers. Vanilla transformers are not trainable without our modifications. All curves average over 3 seeds.

Figure 4: Results on C4. (a): E-SPA again outperforms our other approaches. (b): Vanilla E-SPA matches default Pre-LN after 5x more iterations.

Figure 5: Input token kernel matrices for different proportion of repeated tokens.

L in and L out to be the Cholesky factors of (Σ in i ) i,j = exp(-γ in |i -j|) and (Σ out i ) i,j = exp(-γ out |i -j|) respectively. Set A = L out L -1 in (using analytic form in Theorem 3) & decompose A = DP where D diagonal & P has row sums 1. Denote B = log(P).
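The construction in this step can be sketched as follows: build $A = L_{\text{out}} L_{\text{in}}^{-1}$ from the analytic form in Theorem 3, split it as $A = DP$ with $D$ diagonal and $P$ row-stochastic, and take $B = \log(P)$. The function names and the values of $T$, $\gamma_{\text{in}}$ and $\gamma_{\text{out}}$ are illustrative assumptions. Reading $B$ as a pre-softmax bias and $D$ as a per-location rescaling is our interpretation of the abstract's "bias matrices and location-dependent rescaling", not something stated in this step.

```python
import numpy as np


def a(gamma):
    """a(gamma) = sqrt(1 - exp(-2 gamma)), as in Theorem 3."""
    return np.sqrt(1.0 - np.exp(-2.0 * gamma))


def analytic_ratio(T, gamma_in, gamma_out):
    """A = L_out L_in^{-1} from Theorem 3 (requires gamma_in >= gamma_out > 0).

    Indices are 0-based here, so the paper's first column j = 1 is j == 0.
    """
    ratio = a(gamma_out) / a(gamma_in)
    A = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1):
            if i == j:
                A[i, j] = 1.0 if j == 0 else ratio
                continue
            decay = np.exp(-gamma_out * (i - j - 1))
            if j == 0:
                A[i, j] = (np.exp(-gamma_out) - ratio * np.exp(-gamma_in)) * decay
            else:
                A[i, j] = ratio * (np.exp(-gamma_out) - np.exp(-gamma_in)) * decay
    return A


def decompose(A):
    """Split A = D P with D diagonal and P having unit row sums; B = log(P)."""
    row_sums = A.sum(axis=1)
    P = A / row_sums[:, None]
    with np.errstate(divide="ignore"):
        B = np.log(P)  # -inf above the diagonal, which acts as a causal mask
    return np.diag(row_sums), P, B


T, gamma_in, gamma_out = 6, 0.4, 0.1
A = analytic_ratio(T, gamma_in, gamma_out)
D, P, B = decompose(A)
```

Because $A$ is elementwise non-negative (Theorem 2) with a positive diagonal, its row sums are positive, so $P$ is a valid attention matrix and a row-wise softmax of $B$ recovers it.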

Figure 7: Equivalent of Fig. 1 but with four additional configurations for Transformers: 1) Skipless+LN; 2) Pre-LN; 3) Pre-LN with normalised skips $\alpha = 0.98$ and $\beta = \sqrt{1 - \alpha^2}$; and 4) Post-LN. All rows except our SPA methods (i.e. all rows except the second and third) use pre-softmax bias matrices from the ALiBi (Press et al., 2022) positional encoding to compute attention matrices, assuming zero query-key dot product.

Figure 8: The sensitivity of training performance to the initialisation scale σ of W Q and W K using vanilla E-SPA. Mean and error bars are over 3 seeds.

Figure 9: Ablation over using orthogonal parameter initialisations, compared to standard Gaussian fan-in, for our main method E-SPA on vanilla transformers over a range of activation functions. We see that orthogonal initialisation leads to a small improvement in training speed. Curves denote the mean over 3 seeds.

Figure 10: Ablation over positional encoding using a vanilla transformer with E-SPA, averaged over 3 random seeds.

Figure 12: Effect of using different shortcut weights for the attention and MLP blocks, trained on WikiText-103 for 100K steps. Mean and standard deviation over 2 seeds.

Figure 13: Training performance on C4, equivalent to Fig. 4a.

Figure 14: Longer training on WikiText-103 with E-SPA on a vanilla transformer. It matches the performance of standard transformer after 4.5x more iterations of training.

Figure 15: The normalised kernel matrix for various transformer blocks at three different stages of training of a vanilla E-SPA transformer on WikiText-103. Step 0 corresponds to the untrained network. Note that this is computed with an actual finite-width network with randomly sampled (and then trained) parameters, unlike Fig. 1, which corresponds to the infinite-width limit (at initialisation).

Hyperparameter tuning of skipless transformers For experiments on WikiText-103 with 100K steps, for our U-SPA transformers we tuned ρ L ∈ {0.6, 0.8}, and for our E-SPA transformers we tuned γ L ∈ {0.005, 0.2}. For all other settings (longer/deeper training on WikiText-103 or any C4 experiment), we used the default γ L = 0.005 and ρ L = 0.8. All hyperparameters throughout our work are tuned based on training loss.

x)p n+1 (x) = p n+1 (x)(n -k + 1) -p k-1 (x) p k-1 (x)p n (x)p n+1 (x) = ((n + 1)x + 1)(n -k + 1) + ((k -1)x + 1) p k-1 (x)p n (x)p n+1 (x) = x(n 2 -nk + n + n -k + 1 + k -1) + n -k + 1 + 1 p k-1 (x)p n (x)p n+1 (x) = x(n 2 -nk + 2n) + n -k + 2 p k-1 (x)p n (x)p n+1 (x) = xn(n -k + 2) + n -k + 2 p k-1 (x)p n (x)p n+1 (x) = (n -k + 2)(nx + 1) p k-1 (x)p n (x)p n+1 (x) = n -k + 2 p k-1 (x)p n+1 (x)Which concludes the proof. This now means that:B ⩽ xy 1-y 1-x I n k+1 (x)I n k+1 (y) = xy 1 -y 1 -x (n -k) 2 p k (x)p n (x)p k (y)p n (y)Thus we can conclude thatL n k (x, y) ⩾ y(1 -y)p k-1 (x) r k (x)r k (y) -xy 1 -y 1 -x (n -k) 2 p k (x)p n (x)p k (y)p n (y) -d n+1 (y) d n+1 (x) x p n (x)

1 -y)p n+1 (y) (1 -x)p n+1 (x)p n (x)p n (y) = x 2 (1 -y) (1 -x) lim n→∞ p n+1 (y) p n+1 (x)p n (x)p n (y) = 0

x)p k (y)xy = y(1 -y)p k-1 (x) (1 -x)p k (x)p k-1 (x)(1 -y)p k (y)p k-1 (y) -√ xy 1 -y 1 -x 1 p k (x)p k (y) = y 1 -y 1 -x p k-1 (x) p k (x)p k (y)p k-1 (y) -√ xy 1 -y 1 -x 1 p k (x)p k (y) x)p k (y)p k-1 (y) yp k-1 (x) -xp k-1 (y) A = yp k-1 (x) -xp k-1 (y) = yx(k -1) + y -xy(k -1) -x = y -x ⩾ 0 ⇒ L ∞ k (x, y) ⩾ 0 Note:

k) 2 p k (x)p n (x)p k (y)p n (y) y)p k-1 (x) r k (x)r k (y) = y(1 -y)p k-1 (x) (1 -x)p k (x)p k-1 (x) (1 -y)p k (y)p k-1 y)p n+1 (y) (1 -x)p n (y)p n (x)p n+1 (xx)p k (y) p k (x)p k (y)p n+1 (y) p n (x)p n (y)p n+1 (x) k) 2 p n (x)p n (y) -x p k (x)p k (y)p n+1 (y) p n (x)p n (y)p n+1 (xx)p k (y)p k-1 (y)p n (x)p n (y)p n+1 (x) A = y p k-1 (x)p n (x)p n+1 (x)p n (y) B = xy(n -k) p k-1 (x)p n+1 (x) = x C = x p k (x)p k (y)p k-1 (y)p n+1 (y)

)p n (x)p k (y)p n (yy)p n+1 (y) (1 -x)p n (y)p n (x)p n+1 (x) = x 1 -y 1 -x p n+1 (y) p n (y)p n (x)p n+1 (x)C = √ xy 1 -y 1 -x 1 p k (x)p k (y) 2 = 1 -xy(n -k) 2 p n (x)p n (y) = p n (x)p n (y) ± xy(n -k) 2 p n (x)p n (y) = n 2 xy + nx + ny + 1 ± xy(n -k) 2 p n (x)p n (y) = xyk(n 2 ± (n -k) 2 ) + 1 p n (x)p n (y) ⩾ 0 (1 -R) 2 = 1 + R 2 -2R (C -A) 2 -B 2 = xy 1 -y 1 -x 1 p k (x)p k (y) (1 + R 2 -2R) -x 2 1 -y 1 -x p n+1 (y) p n (y)p n (x)p n+1 (x) = x 1 -y 1 -x xyk(n 2 + (n -k) 2 ) + 1 p k (x)p k (y)p n (x)p n (y) -x 2 1 -y 1 -x p n+1 (y) p n (y)p n (x)p n+1 (x) x)p n (y)p k (x)p k (y)p n+1 (x) K -xy 1 -y 1 -x 2R p k (x)p k (y) K = p n+1 (x)(yk(n 2 + (n -k) 2 ) + 1) -xp n+1 (y)p k (x)p k (y) = (nx + x + 1)y(2n 2 -2nk + k 2 ) -x(ny + 1)(kx + 1)(ky + 1) C -(A + B) = √ xy 1 -y 1 -x 1 p k (x)p k (y) 1 -√ xy(n -k) p n (x)p n (y) -xp n+1 (y)p k (x)p k (y) yp n (y)p n (x)p n+1 (x) -k) + p n+1 (y)p k (x)p k (y) yp n+1 (x) 2 = x(n -k) 2 + p n+1 (y)p k (x)p k (y) yp n+1 (x) + 2 √ x(n -k) p n+1 (y)p k (x)p k (y) yp n+1 (x) = xy(n -k) 2 p n+1 (x) + p n+1 (y)p k (x)p k (y) yp n+1 (x) + 2 √ xy(n -k) p n+1 (y)p n+1 (x)p k (x)p k (y) yp n+1 (x) 1 -(A + B) 2 = yp n+1 (x) -xy(n -k) 2 p n+1(x) -p n+1 (y)p k (x)p k (y) -2E yp n+1 (x) E = √ xy(n -k) p n+1 (y)p n+1 (x)p k (x)p k (y) I.2.5 PROOF OF LEMMA 3

Denoting with A the left hand side of the second equation we have:A = y 2 (1 -y) 2 p n-1 (x) 2 r n (x) 2 r n (y) 2 -d n+1 (y) d n+1 (x)

x 2 1 -y 1 -x p n (y) p n-1 (y)p n (x)p n-1 (x)p n+1 (y) p n (y)p n+1 (x)p n (x) = x 2 1 -y 1 -x 1 p n (x) p n (y) 2 p n+1 (x) -p n+1 (y)p n-1 (y)p n-1 (x) p n-1 (x)p n (y)p n-1 (y) A = p n (y) 2 p n+1 (x) -p n+1 (y)p n-1 (y)p n-1 (x) = p n (y) 2 (p n (x) + x) -(p n (x) -x)(p n (y) + y)(p n (y) -y) = p n (y) 2 p n (x) + xp n (y) 2 -(p n (x) -x)(p n (y) 2 -y 2 ) = p n (y) 2 p n (x) + xp n (y) 2 -p n (y) 2 p n (x) + y 2 p n (x) -xp n (y) 2 + xy 2 = y 2 p n (x) + xy 2 ⩾ 0 ⇒ C ⩾ BWith this we can now conclude that f n (x, y) ⩾ 0 ⇔ A 2 -(C -B) 2 ⩾ 0

p n (y) p n-1 (y)p n-1 (x) + p n+1 (y) p n (y)p n+1 (x) = x 2 1 -y 1 -x 1 p n (x) y 2 p n-1 (x)p n (y)p n-1 (y) -p n (y) p n-1 (y)p n-1 (x) -p n+1 (y) p n (y)p n+1 (x) = x 2 1 -y 1 -x 1 p n (x) y 2 -p n (y) 2 p n-1 (x)p n (y)p n-1 (y) -p n+1 (y) p n (y)p n+1 (x)

WT103 train loss of skipless transformers for different activations, with or without LN. Mean and standard deviation are computed across 3 random seeds.

Validation perplexity for different skips on C4 with and without LN after 50K iterations.

WT-103 training loss across depths. Deeper vanilla E-SPA transformers improve with more training.

Validation perplexity equivalent of Table 1.

Validation perplexity equivalent of Fig. 10.

11 use a joint weighting for both.

Downstream evaluation of various models (default Pre-LN and also our skipless models) from Figs. 4a and 4b, (pre-)trained on C4. We evaluate zero-shot on 5 common sense tasks: BoolQ, HellaSwag, Winogrande, PIQA, SIQA. The number of steps each model was trained for is shown in brackets. Reported values are percentage accuracies: higher is better.

Shortcut and residual weights for Table 2. Weights are presented as $(\alpha_{\text{attn}}, \beta_{\text{attn}})/(\alpha_{\text{MLP}}, \beta_{\text{MLP}})$, where $\alpha_{\text{attn}}$ denotes the shortcut weight for the attention block and $\alpha_{\text{MLP}}$ denotes the shortcut weight for the MLP block. Residual weights $\beta$ are defined similarly. * denotes the initial value of a trainable parameter.

ACKNOWLEDGEMENTS

We thank Christos Kaplanis for helpful discussions during initial stages of this project, as well as the anonymous reviewers for their feedback. BH is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1).


By the uniqueness of (positive) Cholesky factors, the proof is complete.

Proof of Theorem 3. We want to show $(AL)_{k,l} = (L')_{k,l}$, $\forall k, l$. This is clearly true for the top diagonal $k = l = 1$. We now show this for the rest of the first column, when $l = 1$, $k > 1$:

Applying the geometric sum to Eq. (29) yields:

$$\begin{aligned} &= y^2 p_n(x)^2 - x^2 p_n(y)^2 = y^2 (nx + 1)^2 - x^2 (ny + 1)^2 \\ &= n^2 x^2 y^2 + 2nxy^2 + y^2 - n^2 x^2 y^2 - 2nx^2 y - x^2 \\ &= 2nxy(y - x) + (y - x)(y + x) = (y - x)(2nxy + x + y) \geq 0 \end{aligned}$$

Hence with this we can conclude that $f_n(x, y) \geq 0$ $\forall n$.

