THE LIPSCHITZ CONSTANT OF SELF-ATTENTION

Anonymous authors
Paper under double-blind review

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

1. INTRODUCTION

Lipschitz continuity is a strong form of continuity for functions. Loosely speaking, a function is Lipschitz continuous if changing its input by a certain amount cannot change its output by more than K times that amount. The constant K is a hard constraint on how rapidly the function's output can vary, and the smallest such K is known as the function's Lipschitz constant. For example, f_1(x) = √|x| and f_2(x) = exp(x) for x ∈ R are not Lipschitz continuous, because their output can change arbitrarily fast as x approaches 0 and +∞ respectively. On the other hand, g_1(x) = tanh(x) and g_2(x) = αx are Lipschitz continuous, because their rate of change (derivative) is bounded. In deep learning, we often use Lipschitz continuity as a constraint for neural networks, to control how much a network's output can change relative to its input. Such Lipschitz constraints are useful in several contexts.
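Before surveying these contexts, the intuition above can be probed numerically. The following sketch (ours, for illustration only) estimates the Lipschitz quotient |f(x) - f(x')| / |x - x'| over pairs of grid points: for tanh the quotients stay below 1, while for √|x| they blow up near 0.

```python
import numpy as np

def max_quotient(f, xs):
    """Largest pairwise Lipschitz quotient |f(xi)-f(xj)| / |xi-xj| on a grid."""
    xs = np.asarray(xs, dtype=float)
    fx = f(xs)
    num = np.abs(fx[:, None] - fx[None, :])   # |f(xi) - f(xj)|
    den = np.abs(xs[:, None] - xs[None, :])   # |xi - xj|
    mask = den > 0
    return (num[mask] / den[mask]).max()

grid = np.linspace(-3, 3, 2001)
q_tanh = max_quotient(np.tanh, grid)                      # bounded by 1
q_sqrt = max_quotient(lambda x: np.sqrt(np.abs(x)), grid) # unbounded near 0

assert q_tanh <= 1.0 + 1e-9   # tanh is 1-Lipschitz
assert q_sqrt > 10.0          # quotient already large at this grid resolution
```

Refining the grid around 0 makes the √|x| quotient arbitrarily large, mirroring the unbounded-derivative argument in the text.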
For example, Lipschitz constraints can endow models with provable robustness against adversarial perturbations (Cisse et al., 2017; Tsuzuku et al., 2018; Anil et al., 2019), and guaranteed generalisation bounds (Sokolić et al., 2017). Moreover, the dual form of the Wasserstein distance is defined as a supremum over Lipschitz functions with a given Lipschitz constant, hence Lipschitz-constrained networks are used for estimating Wasserstein distances (Peyré & Cuturi, 2019). Further, Lipschitz-constrained networks can stabilise training for GANs, an example being spectral normalisation (Miyato et al., 2018). Finally, Lipschitz-constrained networks are also used to construct invertible models and normalising flows. For example, Lipschitz-constrained networks can be used as a building block for invertible residual networks and hence flow-based generative models (Behrmann et al., 2019; Chen et al., 2019). Additionally, Neural ODEs (Chen et al., 2018; Grathwohl et al., 2019) are typically defined using vector fields parametrised via Lipschitz networks, so that the flow generated by the vector field is guaranteed to exist for all times.

Nonetheless, designing Lipschitz-continuous neural networks and computing (or even upper-bounding) their Lipschitz constant is a hard problem. Previous work has mostly focused on fully-connected and convolutional networks, not only because they are common in deep learning, but also because they are relatively simple to analyse, as compositions of linear maps and pointwise non-linearities. Even in this case, however, exact evaluation of the Lipschitz constant of fully-connected and convolutional networks is NP-hard (Virmaux & Scaman, 2018), and obtaining a tight upper bound remains a challenging task (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020). Fully-connected and convolutional networks are not the only neural networks worthy of interest.
Recently, self-attention (Vaswani et al., 2017) has become a popular alternative to recurrent neural networks. Self-attention is a key component of the Transformer (Vaswani et al., 2017), which has found success as a building block in models of various data modalities, starting with natural-language processing (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) and extending to computer vision (Zhang et al., 2019; Parmar et al., 2019), audio generation (Huang et al., 2019), and reinforcement learning (Parisotto et al., 2020). However, so far no previous work has analysed the Lipschitz properties of self-attention, and thus it has been unclear whether self-attention is a viable option in applications that require Lipschitz constraints. In this work, we address this gap in the theory of self-attention by providing a thorough analysis of its Lipschitz properties. In particular, we make the following contributions:

• We prove that the widely used dot-product self-attention is not Lipschitz, and therefore not suitable to use in applications requiring Lipschitz constraints.
• We formulate L2 self-attention as an alternative, and show that it is Lipschitz.
• We derive a theoretical upper bound on the Lipschitz constant of L2 self-attention, and provide empirical evidence of the asymptotic tightness of the bound.
• As a practical demonstration of the theory, we use this bound to formulate invertible self-attention, and explore its use in a Transformer architecture for character-level language modelling.

2. LIPSCHITZ CONSTANT OF FULLY-CONNECTED/CONVOLUTIONAL LAYERS

We first define the notion of Lipschitz continuity, and proceed to define the Lipschitz constant.

Definition 2.1. Given two metric spaces (X, d_X) and (Y, d_Y), a function f : X → Y is called Lipschitz continuous (or K-Lipschitz) if there exists a constant K ≥ 0 such that

d_Y(f(x), f(x')) ≤ K d_X(x, x') for all x, x' ∈ X. (1)

The smallest such K is the Lipschitz constant of f, denoted Lip(f). In this paper, we focus on the common case where X = R^n, Y = R^m, and d_X, d_Y are induced by a p-norm ||x||_p := (Σ_i |x_i|^p)^{1/p}. We will primarily consider the cases p = 2 and p = ∞, where ||x||_∞ := max_i |x_i|. To emphasise the dependence of the Lipschitz constant on the choice of p-norm, we will often denote it by Lip_p(f). In this case, it follows directly from Definition 2.1 that the Lipschitz constant is given by

Lip_p(f) = sup_{x ≠ x' ∈ R^n} ||f(x) - f(x')||_p / ||x - x'||_p. (2)

Next, we outline some basic results that are useful for estimating Lipschitz constants, also covered in related works (Virmaux & Scaman, 2018; Behrmann et al., 2019). We describe how these results are used to provide bounds on the Lipschitz constant of fully-connected networks (FCN) and convolutional neural networks (CNN), using the fact that both are compositions of linear maps and pointwise non-linearities. To begin with, the following theorem suggests a way to bound Lip_p(f) for a differentiable Lipschitz function f:

Theorem 2.1 (Federer, 1969). Let f : R^n → R^m be differentiable and Lipschitz continuous under a choice of p-norm ||·||_p. Let J_f(x) denote its total derivative (Jacobian) at x. Then Lip_p(f) = sup_{x ∈ R^n} ||J_f(x)||_p, where ||J_f(x)||_p is the induced operator norm on J_f(x).

Hence if f is a linear map represented by a matrix W, then

Lip_p(f) = ||W||_p := sup_{||x||_p = 1} ||W x||_p = { σ_max(W) if p = 2; max_i Σ_j |W_ij| if p = ∞ } (3)

where ||W||_p is the operator norm on matrices induced by the vector p-norm, and σ_max(W) is the largest singular value of W.
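Both branches of Equation (3) are easy to compute in practice. The following sketch (ours) estimates σ_max(W) by power iteration on W^⊤W, as described in Appendix B, and computes the ∞-norm as the largest absolute row sum, checking both against NumPy's built-ins.

```python
import numpy as np

def op_norm_2_power_iteration(W, n_iters=100, seed=0):
    """Estimate sigma_max(W) = ||W||_2 by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(W.shape[1])
    for _ in range(n_iters):
        b = W.T @ (W @ b)
        b /= np.linalg.norm(b)
    # Rayleigh quotient approximates lambda_max(W^T W) = sigma_max(W)^2
    return np.sqrt(b @ W.T @ W @ b / (b @ b))

def op_norm_inf(W):
    """||W||_inf = largest absolute row sum."""
    return np.abs(W).sum(axis=1).max()

W = np.random.default_rng(1).standard_normal((4, 3))
sigma = op_norm_2_power_iteration(W)
assert np.isclose(sigma, np.linalg.svd(W, compute_uv=False)[0], rtol=1e-4)
assert np.isclose(op_norm_inf(W), np.linalg.norm(W, ord=np.inf))
```

As noted in Appendix B, the Rayleigh quotient approaches σ_max(W) from below, so a truncated power iteration yields an underestimate rather than an upper bound.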
Under this choice of norm, many common non-linearities (including relu, sigmoid, tanh, elu) are 1-Lipschitz. ||W||_2 = σ_max(W) is usually estimated via power iteration; we provide details on how this is done in Appendix B. Since we now know the Lipschitz constants of the components of both FCN and CNN, we can bound their Lipschitz constants by applying the following lemma:

Lemma 2.1 (Federer, 1969). Let g, h be two composable Lipschitz functions. Then g ∘ h is also Lipschitz with Lip(g ∘ h) ≤ Lip(g) Lip(h).

Corollary 2.1. For a fully-connected network (FCN) or a convolutional neural network (CNN) f = W_K ∘ ρ_{K-1} ∘ W_{K-1} ∘ … ∘ ρ_1 ∘ W_1, we have Lip_p(f) ≤ Π_k ||W_k||_p under a choice of p-norm with 1-Lipschitz non-linearities ρ_k.

The above bound is not necessarily tight; there are various works that compute tighter bounds for FCN and CNN (e.g. Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).

3. LIPSCHITZ CONSTANT OF SELF-ATTENTION

3.1. DOT-PRODUCT SELF-ATTENTION IS NOT LIPSCHITZ

We now investigate whether self-attention is Lipschitz. We first consider the widely used (scaled) dot-product multihead self-attention as formulated by Vaswani et al. (2017). Let x_1, …, x_N be a sequence of N elements, where x_i ∈ R^D for i = 1, …, N. We represent this sequence as a matrix X ∈ R^{N×D} such that the ith row of X is the ith element of the sequence, i.e. X_{i:} = x_i. Dot-product multihead self-attention (DP-MHA) is a map from R^{N×D} to R^{N×D} consisting of H 'heads', where H is chosen to divide D. Each head is a map from R^{N×D} to R^{N×D/H} defined by

DP(X) := softmax( X W^Q (X W^K)^⊤ / √(D/H) ) X W^V, (4)

where W^Q, W^K, W^V ∈ R^{D×D/H} are learnable parameters specific to each head. The input to the softmax is an N × N matrix of dot products (hence dot-product self-attention), and the softmax is applied to each row of this matrix.
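A single head of Equation (4) can be sketched directly in NumPy (our illustrative implementation, with small dimensions chosen arbitrarily); note that the rows of the softmax output form a stochastic matrix, a fact used in the proof below.

```python
import numpy as np

def softmax(Z, axis=-1):
    """Numerically stable row-wise softmax."""
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def dp_head(X, W_Q, W_K, W_V):
    """One head of dot-product self-attention, Equation (4)."""
    d = W_Q.shape[1]                                   # d = D / H
    P = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))  # N x N stochastic matrix
    return P @ (X @ W_V)

rng = np.random.default_rng(0)
N, D, H = 5, 8, 2
X = rng.standard_normal((N, D))
W_Q, W_K, W_V = (rng.standard_normal((D, D // H)) for _ in range(3))
out = dp_head(X, W_Q, W_K, W_V)
P = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(D // H))

assert out.shape == (N, D // H)
assert np.allclose(P.sum(axis=1), 1.0)   # each row of P sums to 1
```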
Finally, the outputs of all heads are concatenated into an N × D matrix and are right-multiplied by W^O ∈ R^{D×D}; thus DP-MHA is defined by

MHA(X) := [DP^1(X), …, DP^H(X)] W^O.

In what follows, we will prove that MHA as defined above is not Lipschitz, assuming that the MHA map is non-trivial, i.e. W^Q, W^K, W^V, W^O ≠ 0. It is sufficient to show that a single head DP is not Lipschitz, since MHA is a linear combination of the outputs of each head. Let us write Equation (4) as DP(X) = P X W^V, where P ∈ R^{N×N} is the output of the softmax (we suppress the dependence of P on X to reduce clutter below). P is a stochastic matrix, i.e. its entries are non-negative and its rows sum to 1. Since the rows of X are the x_i's, a linear transformation of each x_i by some matrix A is equivalent to right multiplication of X by A^⊤. So right multiplication of X by W^V is a linear map and thus Lipschitz. Therefore, we are interested in the mapping f(X) = P X; this is not a linear mapping because P itself is a non-linear function of X. In fact, we show that f is not Lipschitz, thus proving the first main result of the paper:

Theorem 3.1. DP-MHA is not Lipschitz for any vector p-norm ||·||_p with p ∈ [1, ∞].

Summary of proof. We use Theorem 2.1, noting that if the supremum of the norm of the Jacobian is infinite, then the mapping is not Lipschitz. In particular, we show that when x_i = 0 for some i, some elements of the Jacobian of f grow proportionally to the sample variance of x_{≠i}, which is unbounded.

Proof. We show the proof for the case D = 1 (i.e. X ∈ R^{N×1}, a column vector) for readability. See Appendix C for the general case, which follows the same logic. The mapping f can be written as

f(X) = P X = softmax(a X X^⊤) X = [f_1(X), …, f_N(X)]^⊤ ∈ R^{N×1},

where a = W^K W^Q ∈ R (we assume a ≠ 0 such that self-attention is non-trivial) and f_i(X) = Σ_j P_ij x_j with P_{i:} = softmax(a x_i X).
Hence f can be interpreted as a map of each x_i to a point in the convex hull of x_1, …, x_N. Since f is a map from R^{N×1} to R^{N×1}, its Jacobian is the N × N matrix J_f = (J_ij)_{i,j=1}^N, where J_ij = ∂f_i(X)/∂x_j ∈ R. By taking partial derivatives we can show that

J_ij = a X^⊤ P^{(i)} [E^{ji} X + δ_ij X] + P_ij,

where E^{ij} ∈ R^{N×N} is a binary matrix with zeros everywhere except the (i, j)th entry, δ_ij is the Kronecker delta, and P^{(i)} := diag(P_{i:}) - P_{i:}^⊤ P_{i:}. So for i = j:

J_ii = a X^⊤ P^{(i)} E^{ii} X + a X^⊤ P^{(i)} X + P_ii. (8)

Let us investigate the scalar X^⊤ P^{(i)} X. We observe that it is in fact the variance of a discrete distribution. Specifically:

X^⊤ P^{(i)} X = Σ_k P_ik x_k² - (Σ_k P_ik x_k)² = Var(𝕏), (9)

where 𝕏 is a discrete distribution with support at the inputs {x_1, …, x_N} and probability mass function given by their softmax probabilities P(𝕏 = x_j) = P_ij. A consequence of this interpretation is that P^{(i)} is positive semi-definite (PSD), since X^⊤ P^{(i)} X = Var(𝕏) ≥ 0, with equality if and only if the x_j are all equal.

We use this observation to show that J_ii is unbounded, and so ||J_f||_p is unbounded, hence DP-MHA is not Lipschitz. Consider the case x_i = 0. Then P_{i:} = softmax(a x_i X) = (1/N) 1^⊤, i.e. we have uniform attention regardless of x_{≠i}. The first term of J_ii in Equation (8) vanishes since E^{ii} X = [0, …, x_i, …, 0]^⊤ = 0, and the last term becomes 1/N. Now consider the second term a X^⊤ P^{(i)} X = a Var(𝕏). Here 𝕏 is uniformly distributed, since P(𝕏 = x_j) = P_ij = 1/N. Hence the second term is equal to a times the sample variance of x_1, …, x_N, which can be arbitrarily large.

High-level intuition for proof. At x_i = 0, f_i(X) = (1/N) Σ_k x_k, the mean of the inputs. The rate of change of f_i is governed by how fast the softmax saturates when x_i is perturbed, which is determined by how spread out the x_{≠i} are.
The more spread out they are (the higher the sample variance), the greater the rate of saturation of the softmax, and the faster the rate of change of f_i. Since the sample variance of x_{≠i} can be arbitrarily large, the rate of change of f_i can also be arbitrarily large, i.e. the entries of the Jacobian (and hence its p-norm) can become arbitrarily large. In Appendix D, we show that adding bias terms to x_i W^Q and x_j W^K does not resolve the issue.

The implications of this result are the following. (1) There can be undesirable behaviour (e.g. training instabilities) for the Transformer when some inputs are close to zero. (2) Dot-product self-attention (and hence the standard Transformer) is not a suitable choice when we require a Lipschitz neural network, such as for formulating invertible residual networks (Behrmann et al., 2019). Therefore, to use self-attention and Transformers in such applications, a Lipschitz formulation of self-attention is required, together with an explicit (ideally tight) upper bound on its Lipschitz constant, to quantify how much the output can change with respect to changes in the input.

One method to make dot-product self-attention Lipschitz is to ensure its inputs are bounded. Indeed, if the input space is compact, e.g. [0, 1]^{N×D}, any continuously differentiable function is Lipschitz, including dot-product self-attention. However, as we further discuss in Section 6, such an approach has its own challenges, since it makes the Lipschitz constant depend on the input range. Instead, in the next section we formulate a version of self-attention that is provably Lipschitz on all of R^{N×D}, allowing us to derive an upper bound that holds for any subset of R^{N×D}.
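The blow-up in the proof of Theorem 3.1 is easy to observe numerically. The sketch below (ours; the input values are chosen arbitrarily) takes D = 1, a = 1, sets x_1 = 0, and estimates J_11 by central finite differences: each tenfold spread of the remaining inputs grows J_11 roughly a hundredfold, matching J_11 = Var(𝕏) + 1/N under uniform attention.

```python
import numpy as np

def f1(X):
    """First output f_1(X) of f(X) = softmax(X X^T) X for D = 1, a = 1."""
    logits = X[0] * X
    P = np.exp(logits - logits.max())
    P /= P.sum()
    return P @ X

def J11(X, eps=1e-6):
    """Central finite-difference estimate of the Jacobian entry df_1/dx_1."""
    Xp, Xm = X.copy(), X.copy()
    Xp[0] += eps
    Xm[0] -= eps
    return (f1(Xp) - f1(Xm)) / (2 * eps)

base = np.array([0.0, 1.0, -1.0, 2.0, -2.0])   # x_1 = 0, rest spread out
entries = [J11(scale * base) for scale in (1.0, 10.0, 100.0)]

# J_11 ~ scale^2 * Var(base) + 1/N: unbounded as the inputs spread out
assert entries[1] > 50 * entries[0]
assert entries[2] > 50 * entries[1]
```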

3.2. L2 SELF-ATTENTION: A LIPSCHITZ FORMULATION OF SELF-ATTENTION

The pathology in dot-product self-attention arises because the softmax probabilities P_{i:} are constant with respect to x_{≠i} when x_i = 0. This behaviour can be undesirable as we want P_ij to vary according to x_j, regardless of whether x_i is zero or not. Hence we propose an alternative form of self-attention based on L2 distance:

P_ij ∝ exp(L_ij) := exp( -||x_i W^Q - x_j W^K||₂² / √(D/H) ), (10)

with the normalisation constant ensuring that Σ_j P_ij = 1. We will refer to it as L2 self-attention. It is reminiscent of the standard squared-exponential kernel, but with softmax normalisation that ensures that each row of the kernel matrix sums to 1. Normalisation is usually necessary to deal with inputs of varying length N (Wang et al., 2018), hence we keep the softmax for L2 self-attention. Similarly to dot-product self-attention, L2 self-attention can be computed efficiently with matrix operations; see Appendix E for details, with a comparison of wall-clock runtimes between different choices of attention.

We first state the mathematical formulation of L2 multihead self-attention (L2-MHA) before proving the main result: the upper bound on its Lipschitz constant with respect to ||·||_p for p = 2, ∞. The full L2-MHA map F : R^{N×D} → R^{N×D} is defined as

F(X) := [f^1(X) W^{V,1}, …, f^H(X) W^{V,H}] W^O, where f^h(X) := P^h X A^h.

In the above, W^{V,h} ∈ R^{D×D/H}, W^O ∈ R^{D×D}, P^h is defined as in Equation (10) with W^{Q,h} = W^{K,h} ∈ R^{D×D/H}, and A^h := W^{Q,h} W^{Q,h⊤} / √(D/H) ∈ R^{D×D}. There are two changes from the usual form of multihead self-attention: (1) We require W^{Q,h} = W^{K,h} for each head f^h(X) to be Lipschitz. In Lemma F.1 of Appendix F we show that L2-MHA is not Lipschitz for arbitrary W^{Q,h}, W^{K,h}, and that tying W^{Q,h} = W^{K,h} is sufficient for L2-MHA to be Lipschitz, with intuition for why tying is sufficient. (2) In each head of the self-attention f^h(X), right multiplication by A^h has been included for the theorem below to hold (details are in the proof).
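One head of L2 self-attention with tied weights can be sketched as follows (our illustration; the paper's Appendix E gives its own efficient formulation). The N × N matrix of squared distances is formed from matrix products via the standard expansion ||q_i - q_j||² = ||q_i||² - 2 q_i·q_j + ||q_j||².

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def l2_head(X, W_Q):
    """One head f^h(X) = P^h X A^h of L2-MHA with tied W_Q = W_K."""
    d = W_Q.shape[1]                 # d = D / H
    Q = X @ W_Q                      # keys equal queries since W_K = W_Q
    sq = (Q ** 2).sum(axis=1)
    dist2 = sq[:, None] - 2 * Q @ Q.T + sq[None, :]  # pairwise ||q_i - q_j||^2
    P = softmax_rows(-dist2 / np.sqrt(d))            # Equation (10)
    A = W_Q @ W_Q.T / np.sqrt(d)                     # the extra map A^h
    return P @ X @ A

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))
W_Q = rng.standard_normal((8, 4))
out = l2_head(X, W_Q)
assert out.shape == (6, 8)
assert np.isfinite(out).all()
```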
In practice, there is little harm done by this extra linear transformation, since when the heads are combined together in F, each f^h(X) is additionally transformed by W^{V,h}, a free parameter. The second main result of the paper is the following:

Theorem 3.2. L2-MHA is Lipschitz, with the following bound on Lip_∞(F):

Lip_∞(F) ≤ ( (4φ⁻¹(N-1) + 1) / √(D/H) ) max_h ( ||W^{Q,h}||_∞ ||W^{Q,h⊤}||_∞ ) max_h ||W^{V,h}||_∞ ||W^O||_∞

and the following bound on Lip_2(F):

Lip_2(F) ≤ ( √N / √(D/H) ) ( 4φ⁻¹(N-1) + 1 ) ( Σ_h ||W^{Q,h}||₂² ||W^{V,h}||₂² )^{1/2} ||W^O||₂

where φ(x) := x exp(x + 1) is an invertible univariate function on x > 0, and N is the input sequence length. Specifically, φ⁻¹(N-1) = W₀((N-1)/e), where W₀ is the Lambert W-function, which grows sub-logarithmically as O(log N - log log N) (Corless et al., 1996). Hence the above bounds can be simplified to O(log N) for p = ∞ and O(√N log N) for p = 2.

Proof. See Appendix F, which uses the key observation that X^⊤ P^{(i)} X is a covariance matrix (c.f. Equation (9)) to bound ||J_F||_p, the norm of the Jacobian of F. Appendix G shows how the argument can be modified to prove the analogous result for the case with masking in the self-attention.

These bounds are complemented by the concurrent work of Vuckovic et al. (2020), which provides a O(√D log N) bound on Lip_1(F) using measure-theoretic tools.
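The quantity φ⁻¹(N-1) in Theorem 3.2 is cheap to evaluate: since φ(x) = x e^{x+1} = y is equivalent to x e^x = y/e, we have φ⁻¹(y) = W₀(y/e). A quick sketch (ours) using SciPy's Lambert W checks this inverse and its sub-logarithmic growth.

```python
import numpy as np
from scipy.special import lambertw

def phi(x):
    """phi(x) = x exp(x + 1), as in Theorem 3.2."""
    return x * np.exp(x + 1)

def phi_inv(y):
    """phi(x) = y  <=>  x e^x = y / e  <=>  x = W_0(y / e)."""
    return np.real(lambertw(y / np.e))

for N in (10, 100, 1000):
    x = phi_inv(N - 1)
    assert np.isclose(phi(x), N - 1)   # phi_inv really inverts phi
    assert x < np.log(N)               # grows more slowly than log N
```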

4. APPLICATION: INVERTIBLE SELF-ATTENTION

4.1. INVERTIBLE RESIDUAL NETWORK

Consider the residual function g(x) := x + f(x). Behrmann et al. (2019) give the following sufficient condition for its invertibility: if f is a contraction with respect to some metric, i.e. if Lip(f) < 1, and the metric space on which f is defined is complete, then g is invertible. (A Euclidean space with a metric induced by a p-norm ||·||_p for p ∈ [1, ∞] is always complete.) Specifically, the inverse g⁻¹(y) is the unique fixed point of the recursion x_{i+1} := y - f(x_i), since by the definition of the inverse we have y = g⁻¹(y) + f(g⁻¹(y)). Because f is a contraction, Banach's Fixed Point Theorem guarantees that this fixed point exists and is unique for all y, and that the recursion converges for all initial values x_0 (often set to y in practice) exponentially fast. Hence the inverse can be computed to arbitrary accuracy (up to numerical precision in practice) by the above fixed-point iteration. Note that a composition of such invertible residual blocks is also invertible. Behrmann et al. (2019) use this observation to design invertible ResNets: they take f to be a CNN normalised by an upper bound on Lip(f) given by Corollary 2.1, making the resulting function contractive. For the 2-norm ||·||₂, a hyperparameter c < 1 is chosen, and each linear map (convolution) W in the CNN is multiplied by c / ||W||₂ if c < ||W||₂, where ||W||₂ is estimated by power iteration (c.f. Appendix B). This multiplicative factor determines the scale of the Lipschitz constant of the normalised function.

The standard use case of self-attention is with a skip connection inside the Transformer. A Transformer block is composed of residual blocks of multihead self-attention (MHA) and fully-connected (FCN) layers (Figure 1). Hence, similarly to invertible ResNets, we can normalise L2-MHA by the upper bounds given in Theorem 3.2 to obtain Contractive-L2-MHA f, with which we can obtain invertible self-attention g(x) = x + f(x).
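The fixed-point inversion described above can be sketched in a few lines (our toy example: f = 0.5·tanh(Wx) with W normalised to unit spectral norm, so Lip(f) ≤ 0.5 < 1 and the iteration contracts at rate at most 0.5).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W = W / np.linalg.norm(W, 2)       # ||W||_2 = 1, hence Lip(f) <= 0.5

def f(x):
    return 0.5 * np.tanh(W @ x)    # contraction: tanh is 1-Lipschitz

def g(x):
    return x + f(x)                # residual map

def g_inv(y, n_iters=50):
    """Invert g by the fixed-point recursion x_{i+1} = y - f(x_i)."""
    x = y.copy()                   # common initialisation x_0 = y
    for _ in range(n_iters):
        x = y - f(x)
    return x

x_true = rng.standard_normal(4)
y = g(x_true)
x_rec = g_inv(y)
assert np.linalg.norm(x_rec - x_true) < 1e-10
```

With contraction rate 0.5, the reconstruction error shrinks by at least half per iteration, so 50 iterations drive it to numerical precision.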
In the next section, we investigate the properties of invertible self-attention and how it compares with the standard dot-product self-attention; we replace DP-MHA in the Transformer with Contractive-L2-MHA, hence replacing the residual self-attention module with invertible self-attention. We are not interested in the modified Transformer per se, but rather in comparing the properties of invertible self-attention to standard self-attention; we only use the Transformer as a testbed for this purpose, since self-attention is commonly used in a Transformer. Given the theoretical focus of the paper, we believe that a more challenging application of invertible self-attention, such as normalising flow-based modelling, would be more suitable as a separate paper focused on that particular application. In Appendix H, we show that Dropout in the residual branch is also contractive.

5. EXPERIMENTS

A tight bound on the Lipschitz constant of self-attention is desirable for all the applications listed in Section 1; it leads to tighter generalisation bounds, lighter constraints for provable robustness, and better expressiveness in residual flow models. Hence we investigate the tightness of our bound on the Lipschitz constant of L2-MHA. The Lipschitz constant is a supremum over the space of inputs X ∈ R^{N×D} (c.f. Equation (2)), and approximating it requires solving an intractable optimisation problem. Hence it is infeasible to estimate accurately in general, especially when X is high-dimensional. However, we may compute a lower bound on the Lipschitz constant by maximising the norm of the Jacobian ||J_f(X)|| with respect to X until convergence. This local optimum will form a lower bound by Theorem 2.1, and we can expect this lower bound to be fairly tight in the low-dimensional case, provided the optimisation is thorough.

5.1. ASYMPTOTIC TIGHTNESS OF THE UPPER BOUND

We use this observation to provide empirical evidence for the asymptotic tightness of the upper bound on Lip_∞(f) in Theorem 3.2. In Figure 2, we show the upper bound as well as the lower bound on Lip_∞(f) obtained by optimising ||J_f(X)||_∞ with respect to X for L2-MHA f, with 50 different random initialisations of X, with H = D = 1 and N varying between 100 and 1000. See Appendix I for further details. Note that we use a log-scale for the x-axis, and recall that the upper bound is O(log N - log log N), dominated by the O(log N) term for large N. Hence the plot for the upper bound shows a linear trend. We also observe that the slope of the lower bound is very similar, providing empirical evidence that the O(log N - log log N) upper bound is asymptotically tight. There are at least two possible explanations for the gap between the upper and lower bounds. (1) The lower bound is only a local optimum; the true Lipschitz constant is a global optimum across inputs, which can be difficult to attain, especially for high values of N. (2) The multiplicative constant of the upper bound may be loose. Assuming asymptotic tightness, it remains an open question whether the multiplicative constant can be tightened. We show the analogous plot for Lip_2(F) and discuss the results in Appendix K. Additionally, in Appendix L we show that optimising ||J_f(X)||_∞ with respect to X for DP-MHA f causes the norm to diverge, providing empirical verification of Theorem 3.1, i.e. that DP-MHA is indeed not Lipschitz.

5.2. NUMERICAL INVERTIBILITY OF MHA RESIDUAL MAP

Recall from Section 4.1 that g(x) = x + f(x) is invertible if f is contractive. Hence if f is Contractive-L2-MHA, g is necessarily invertible. However, technically we do not disprove the invertibility of DP-MHA, since the converse does not hold in general, i.e. if f is DP-MHA, which we have shown is not Lipschitz and hence not contractive, it may still be the case that g is invertible.
To verify that DP-MHA (with the skip connection) is not invertible in practice, we compare the numerical invertibility of the residual map g(x) = x + c f(x) between the cases where f is L2-MHA and DP-MHA in Figure 3. For each, we take MHA with 8 heads and randomly initialised weights, and quantify the maximum reconstruction error across a batch of 128 inputs whose outputs are inverted via the fixed-point iteration described in Section 4.1. We use N = 64, D = 64, and c ∈ {0.5, 0.7, 0.9} (see Appendix J for analogous results for a wider range of N and D and for DP-MHA with trained weights). To highlight the difference between the two types of self-attention, recall from the proof of Theorem 3.1 (showing that DP-MHA is not Lipschitz) that when one of the inputs x_i is 0, some terms of the Jacobian grow with the sample variance of x_{≠i}. Hence we check numerical invertibility at a set of N inputs where x_i = 0 and the x_{≠i} are chosen uniformly at random. In Figure 3, we see that DP-MHA is not invertible, whereas L2-MHA is invertible for sufficiently small c. This shows how lacking the theoretical guarantee that f is contractive can cost us invertibility in practice. We note that the figure shows local invertibility at the sampled inputs, as opposed to global invertibility across the whole input space, yet this clearly highlights the difference between the two choices of self-attention. Experiments with the globally invertible self-attention obtained by normalising with the Lipschitz upper bound are provided in the next section.

5.3. EXPRESSIVENESS OF L2-MHA AND INVERTIBLE SELF-ATTENTION

A natural question to ask is: how does the expressiveness of L2-MHA and Contractive-L2-MHA (which leads to invertible self-attention with the skip connection) compare with the original DP-MHA? We expect that the Lipschitz constraint will limit the expressiveness of the Transformer, and would like to find out by how much. We investigate this by comparing the performance of the original Transformer and the Transformer with invertible self-attention (c.f. Figure 1) at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993). We compare the test negative log-likelihood (NLL) of a baseline LSTM, the original Transformer (DP-MHA), and a series of models between the original Transformer and the Transformer with invertible self-attention (Contractive-L2-MHA), making one change at a time and tuning the hyperparameters on a validation set. For Contractive-L2-MHA, we normalise L2-MHA by the bound on Lip_∞(F), as it is tighter than the bound on Lip_2(F). See Appendix I for experimental details.

The results are shown in Figure 4. The first plot shows the best-performing LSTM reaching a test NLL of around 1.0, and the second plot shows the best-performing Transformer reaching a slightly improved performance for 3-5 layers of Transformer blocks. We observe instabilities in training for a higher number of layers, requiring careful tuning of the learning rate schedule for stability at the cost of performance, a commonly observed phenomenon in the literature on deep Transformer architectures (Bapna et al., 2018; Parisotto et al., 2020). The third plot shows results for the Transformer with DP-MHA replaced by L2-MHA but without tying W^Q and W^K, and we observe a very similar test performance. The fourth plot shows the change when we further tie the query and key weights (making W^Q = W^K); we see that there is a small degradation in performance.
Here the number of trainable parameters has been reduced, but in Appendix M we show that matching parameter count does not help performance, suggesting that the reduction in performance when tying queries and keys is not solely due to having fewer parameters. We note that performance saturates at around 5 layers for each Transformer model so far. On the rightmost plot we show results when further dividing self-attention in each block by the upper bound on Lip ∞ (F ), to obtain invertible self-attention. This does give reduced performance for the same number of layers, but we can attain similar performance with more layers, no longer saturating at 5 layers. Thus we conclude the following. (1) Replacing the dot-product with the L2 distance incurs hardly any loss in expressiveness. (2) Tying the query and key weights to obtain Lipschitz self-attention incurs a small loss in expressiveness. (3) Dividing by the upper bound on Lip ∞ (F ) to obtain invertible self-attention incurs a noticeable loss in expressiveness, but also has a stabilising effect on the optimisation of the Transformer, thus allowing one to compensate for the apparent loss in expressiveness by increasing the number of layers. We show further experimental results that compare the training stability of DP-MHA and (Contractive)-L2-MHA in Appendix N.

6. CONCLUSION AND DISCUSSION

We have shown that the widely used dot-product self-attention is not Lipschitz, and that the proposed L2 self-attention is Lipschitz, by deriving an O(log N - log log N) Lipschitz bound for p = ∞ and an O(√N (log N - log log N)) bound for p = 2, where N is the input sequence length. We also provided empirical evidence of the asymptotic tightness of the bound for p = ∞. Finally, we demonstrated that Lipschitz-constrained self-attention can be used to formulate invertible self-attention, which we experimentally evaluated on a character-level language modelling task.

Our approach to Lipschitz self-attention has been to replace the dot-product kernel with an L2 kernel. An alternative would be to constrain the inputs of self-attention to be bounded: if the input space is compact, e.g. [0, 1]^{N×D}, any continuously differentiable function is Lipschitz, including dot-product self-attention. However, while being simple to implement, this solution has its own difficulties. First, it makes the Lipschitz constant depend on the range of the input, and thus obtaining a tight bound would require non-trivial mathematical work. We stress that a guarantee that the function is Lipschitz does not tell us anything about its Lipschitz constant; without a tight Lipschitz bound, the true Lipschitz constant can be very large, at which point it is unhelpful that the function is Lipschitz. Second, since self-attention is typically applied at multiple layers within a model (e.g. the Transformer), the input to each self-attention layer will live in a different compact set that depends on the parameters of the previous layers, complicating the analysis for subsequent layers. A solution is to constrain the inputs of each layer to be in the same compact set, e.g. by passing them through a sigmoid non-linearity. This however can have undesirable side effects, such as vanishing gradients when the sigmoids are saturated (Hochreiter, 1998).
Despite these difficulties, this could be a worthwhile alternative route for obtaining Lipschitz self-attention to explore in the future. Having a provably Lipschitz self-attention module at our disposal makes it possible to use Transformer-based architectures in applications requiring Lipschitz constraints, while enjoying theoretical guarantees. A natural application of Lipschitz self-attention is for residual flows (Behrmann et al., 2019), and for parametrising Neural ODEs (Chen et al., 2018), where a Lipschitz vector field guarantees the existence of a unique solution to the ODE for all times. These models can be used for density estimation and generative modelling of sets. Another interesting direction for future work would be to analyse different variants of self-attention based on kernels other than dot-product and L2, as Tsai et al. (2019) do from an experimental perspective, for which we believe the mathematical tools developed in this paper may aid the analysis.

A CHAIN RULE FOR VECTOR VALUED FUNCTIONS

In this section, we list some useful identities for deriving the Jacobians of the expressions in the paper. Suppose λ is a scalar, u, v, x are column vectors, and f(u) is a vector-valued function. We use the standard convention that for a ∈ R^m, b ∈ R^n, we have ∂a/∂b ∈ R^{m×n}. Then we have the following chain rule identities:

• ∂/∂x [λu] = λ ∂u/∂x + u ∂λ/∂x
• ∂f(u)/∂x = (∂f(u)/∂u)(∂u/∂x)
• ∂/∂x [u^⊤ v] = u^⊤ ∂v/∂x + v^⊤ ∂u/∂x

Note that ∂λ/∂x is a row vector, so u ∂λ/∂x is a matrix.
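The product-rule identity for u^⊤v can be checked by finite differences. The sketch below is ours, with the hypothetical choices u = Ax, v = Bx (so the analytic gradient is u^⊤B + v^⊤A).

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
x = rng.standard_normal(4)

def s(x):
    """Scalar s(x) = u(x)^T v(x) with u = Ax, v = Bx."""
    return (A @ x) @ (B @ x)

# analytic gradient (a row vector): u^T dv/dx + v^T du/dx = u^T B + v^T A
grad = (A @ x) @ B + (B @ x) @ A

# central finite differences along each coordinate direction
eps = 1e-6
num = np.array([(s(x + eps * e) - s(x - eps * e)) / (2 * eps)
                for e in np.eye(4)])
assert np.allclose(num, grad, atol=1e-6)
```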

B POWER ITERATION

Although ‖W‖_∞ can be computed efficiently in O(nm) time for W ∈ R^{m×n}, naïvely computing ‖W‖_2 = σ_max(W) := √(λ_max(W^⊤W)) requires O(n³) operations. (By λ_max(A) we denote the greatest eigenvalue of a symmetric matrix A.) We can however obtain an underestimate σ̃(W) via power iteration:

b_{k+1} = W^⊤W b_k / ‖W^⊤W b_k‖_2,   σ̃_k(W) = √( b_k^⊤ W^⊤W b_k / b_k^⊤ b_k ),

with each iteration taking O(n²) time. Using K ≪ n iterations thus gives us an underestimate σ̃_K in O(Kn²) time. Since this is an underestimate, the resulting approximation to the Lipschitz constant of the linear map will not be an upper bound. However, the number of power iterations is usually chosen so that σ̃ is accurate enough; K = 5 is shown to be sufficient in the context of the fully connected networks and convolutions considered by Behrmann et al. (2019). The iteration converges if W^⊤W has an eigenvalue that is strictly greater in magnitude than its other eigenvalues, and the starting vector b_0 has a nonzero component in the direction of an eigenvector associated with the dominant eigenvalue. This happens with probability 1 if b_0 is chosen at random, and the convergence is geometric with ratio |λ_2/λ_max|, where λ_2 is the eigenvalue with second largest magnitude (Mises & Pollaczek-Geiringer, 1929).

C PROOF OF THEOREM 3.1 FOR GENERAL D

Theorem 3.1. DP-MHA is not Lipschitz for any vector p-norm ‖·‖_p with p ∈ [1, ∞].

Proof. The map f can be written as

f(X) = PX = softmax(XAX^⊤) X = [f_1(X)^⊤; ... ; f_N(X)^⊤] ∈ R^{N×D},

where A = W^Q W^{K⊤}/√(D/H) ∈ R^{D×D} and f_i(X) = Σ_{j=1}^N P_ij x_j with P_{i:} = softmax(x_i^⊤ A X^⊤). Hence f can be interpreted as a map of each x_i to a point in the convex hull of x_1, ..., x_N. Since f is a map from R^{N×D} to R^{N×D}, its Jacobian is

J_f = [J_11 ... J_1N; ... ; J_N1 ... J_NN] ∈ R^{ND×ND},

where J_ij = ∂f_i(X)/∂x_j ∈ R^{D×D}.
By taking partial derivatives we can show that

J_ij = X^⊤ P^{(i)} [E_{ji} XA + δ_ij XA^⊤] + P_ij I,    (14)

where E_{ij} ∈ R^{N×N} is a binary matrix with zeros everywhere except the (i,j)th entry, δ_ij is the Kronecker delta, and P^{(i)} := diag(P_{i:}) − P_{i:}^⊤ P_{i:}. So for i = j:

J_ii = X^⊤ P^{(i)} E_{ii} XA + X^⊤ P^{(i)} XA^⊤ + P_ii I = P_ii (x_i − Σ_k P_ik x_k) x_i^⊤ A + X^⊤ P^{(i)} XA^⊤ + P_ii I.

For the last equality, note that E_{ii}X has all rows equal to zero except for the ith row, given by x_i^⊤. We can then verify that X^⊤ P^{(i)} E_{ii} X simplifies to P_ii (x_i − Σ_k P_ik x_k) x_i^⊤.

For vector p-norms, ‖J_f‖_p is bounded if and only if its entries are bounded, by definition of the operator norm. The entries of X^⊤ P^{(i)} XA^⊤ are bounded for arbitrary A only if the entries of X^⊤ P^{(i)} X are bounded. So let us investigate the entries of this D×D matrix. Writing out each term of the matrix, we observe that it is in fact the covariance matrix of a discrete distribution. Specifically:

[X^⊤ P^{(i)} X]_{lm} = Σ_k P_ik x_kl x_km − (Σ_k P_ik x_kl)(Σ_k P_ik x_km) = Cov(X_l, X_m),    (15)

where X is a discrete distribution with support at the inputs {x_1, ..., x_N} and probability mass function given by their softmax probabilities P(X = x_j) = P_ij. A consequence of this interpretation is that P^{(i)} is positive semi-definite (PSD), since for D = 1 Equation (15) becomes X^⊤ P^{(i)} X = Var(X) ≥ 0, with equality if and only if the x_j are all equal.

We use this observation to show that the entries of J_ii are unbounded, and hence that DP-MHA is not Lipschitz. Consider the case x_i = 0. Then the logits x_i^⊤ A x_j are all zero, so P_{i:} = (1/N) 1^⊤, i.e. we have uniform attention regardless of x_{≠i}. The first term of J_ii in Equation (14) vanishes since x_i = 0, and the last term becomes (1/N) I. For the second term, the entries [X^⊤ P^{(i)} X]_{ll} = Var(X_l) are unbounded, since Var(X_l) is the sample variance of x_{1l}, ..., x_{Nl}, which can be arbitrarily large.
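The unboundedness argument can be sanity-checked numerically. A minimal sketch (single head, D = 1, and A = I, i.e. identity query/key maps, all purely illustrative choices): fixing x_1 = 0 and scaling the remaining inputs, a finite-difference estimate of ‖J_f‖_∞ for dot-product self-attention grows without bound.

```python
import numpy as np

def dp_attention(X):
    """Single-head dot-product self-attention f(X) = softmax(X A X^T) X with A = I."""
    logits = X @ X.T
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    P = np.exp(logits)
    P = P / P.sum(axis=1, keepdims=True)
    return P @ X

def jac_norm_inf(X, eps=1e-5):
    """Finite-difference estimate of the max absolute row sum of the ND x ND Jacobian."""
    x0 = X.ravel()
    f0 = dp_attention(X).ravel()
    J = np.zeros((f0.size, x0.size))
    for i in range(x0.size):
        x = x0.copy()
        x[i] += eps
        J[:, i] = (dp_attention(x.reshape(X.shape)).ravel() - f0) / eps
    return np.abs(J).sum(axis=1).max()

N, D = 4, 1
base = np.linspace(-1.0, 1.0, N).reshape(N, D)
base[0] = 0.0   # x_1 = 0 gives uniform attention in the first block row
norms = [jac_norm_inf(c * base) for c in (1.0, 10.0, 100.0)]
assert norms[0] < norms[1] < norms[2]   # Jacobian norm keeps growing with input scale
```

The growth is driven by the sample-variance term identified above: with uniform attention in row 1, that term scales quadratically with the input scale.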

D BIAS TERM IN DP SELF-ATTENTION

A natural question to ask is whether we can add bias terms b^Q to x_i^⊤W^Q and b^K to x_j^⊤W^K to resolve the issue of the attention weights P_{i:} becoming uniform when x_i = 0. The answer is no in general. It can again be shown that J_ii is unbounded when x_i is chosen such that x_i^⊤W^Q + b^Q = 0 (such a choice is possible assuming W^Q is full rank, a dense set in R^{D×D/H}). Then P_{i:} = (1/N)1^⊤ again, and the diagonal entries of X^⊤P^{(i)}X are unbounded.

E EFFICIENT COMPUTATION OF L2 SELF-ATTENTION

Dot-product self-attention only requires a few matrix multiplications to compute the logits (i.e. the inputs to the softmax) between all pairs of inputs, without having to loop over pairs, hence it can be computed efficiently. Similarly, we can show that L2 self-attention can also be computed in an efficient manner. Using the identity ‖a − b‖_2² = ‖a‖_2² − 2a^⊤b + ‖b‖_2², we can compute the logits of L2 attention between all pairs via matrix multiplications and computation of row-wise L2 norms, with negligible overhead compared to dot-product self-attention. Specifically, for L2 self-attention we can show that

P = softmax( −[ ‖XW^Q‖²_row 1^⊤ − 2 XW^Q (XW^K)^⊤ + 1 (‖XW^K‖²_row)^⊤ ] / √(D/H) ),    (16)

where ‖A‖²_row applies the squared L2 norm to each row of A, so if A ∈ R^{m×n} then ‖A‖²_row ∈ R^m. In Table 1 we show the wall-clock training times for the Transformer models with different attention functions and a varying number of layers. It is evident that the differences between the models are rather small.

F PROOF OF THEOREM 3.2

Recall the formulation of L2-MHA:

F : R^{N×D} → R^{N×D},
F(X) = [f¹(X)W^{V,1}, ..., f^H(X)W^{V,H}] W^O,
f^h(X) = P^h X A^h,
P^h_ij ∝ exp(L^h_ij) := exp( −‖x_i^⊤W^{Q,h} − x_j^⊤W^{K,h}‖_2² / √(D/H) ),   Σ_j P^h_ij = 1,

where W^{Q,h}, W^{K,h}, W^{V,h} ∈ R^{D×D/H}, W^O ∈ R^{D×D}, P^h ∈ R^{N×N} and A^h := W^{Q,h} W^{Q,h⊤}/√(D/H) ∈ R^{D×D}, and the softmax is applied to each row of its input matrix. Recall Equation (16):

P^h = softmax( −[ ‖XW^{Q,h}‖²_row 1^⊤ − 2 XW^{Q,h}(XW^{K,h})^⊤ + 1 (‖XW^{K,h}‖²_row)^⊤ ] / √(D/H) ).

F.1 L2 SELF-ATTENTION IS NOT LIPSCHITZ FOR GENERAL W^Q, W^K

Let us first look at the case H = 1 and suppress the index h to reduce clutter. Consider the map f̃(X) := PX, so that f(X) = f̃(X)A. We need f̃ to be Lipschitz for f, and hence F, to be Lipschitz. Note that P is defined by

P_ij ∝ exp(L_ij) := exp( −‖x_i^⊤W^Q − x_j^⊤W^K‖_2² / √(D/H) ),

and the normalisation constant satisfies Σ_j P_ij = 1, for P ∈ R^{N×N}, X ∈ R^{N×D}.
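As a concrete check of the vectorised logit computation in Equation (16), the following NumPy sketch (dimensions and random weights are arbitrary illustrative choices) computes the pairwise L2 logits via matrix multiplications and row-wise norms, and compares against an explicit loop over all pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 5, 8, 2
X = rng.normal(size=(N, D))
WQ = rng.normal(size=(D, D // H))
WK = rng.normal(size=(D, D // H))

Q, K = X @ WQ, X @ WK                        # rows: x_i^T W^Q and x_j^T W^K

# Vectorised logits: -(||q_i||^2 - 2 q_i . k_j + ||k_j||^2) / sqrt(D/H)
sq_q = (Q ** 2).sum(axis=1, keepdims=True)   # shape (N, 1), row-wise squared norms
sq_k = (K ** 2).sum(axis=1, keepdims=True).T # shape (1, N)
logits = -(sq_q - 2.0 * Q @ K.T + sq_k) / np.sqrt(D / H)

# Reference: explicit loop over all pairs.
ref = np.array([[-np.sum((Q[i] - K[j]) ** 2) / np.sqrt(D / H)
                 for j in range(N)] for i in range(N)])
assert np.allclose(logits, ref)

# Row-wise softmax gives the attention matrix P.
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P = P / P.sum(axis=1, keepdims=True)
assert np.allclose(P.sum(axis=1), 1.0)
```

The vectorised path performs the same few matrix multiplications as dot-product attention plus two row-norm computations, which is the negligible overhead noted above.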
For L2 self-attention, we may take partial derivatives and use the chain rule to show that the Jacobian of f̃ is:

J_f̃ = [J̃_11 ... J̃_1N; ... ; J̃_N1 ... J̃_NN] ∈ R^{ND×ND}    (17)

with

J̃_ij = X^⊤ P^{(i)} ∂L_{i:}/∂x_j + P_ij I ∈ R^{D×D}    (18)

where

∂L_{i:}/∂x_j = (2/√(D/H)) [ (XW^K − 1 x_i^⊤W^Q) W^{Q⊤} δ_ij + (E_{ji} XW^Q − E_{jj} XW^K) W^{K⊤} ].

Continuing the proof of Lemma F.1 (whose setup appears in the annex): with the choice y_j = −y_i for all j ≠ i, the ‖y_j‖_2² are equal for all j, so

P_ij := exp(−‖y_j‖_2²) / Σ_k exp(−‖y_k‖_2²) = 1/N for all j.

Then for i ≠ j, K̃_ij simplifies to

K̃_ij = −(1/N) [ −y_i − (1/N)(N−2)(−y_i) ] (−y_i)^⊤ = −(2/N²) y_i y_i^⊤,

whose entries are unbounded, since y_i can be any vector in R^{D/H} (note that we assume N ≥ 2 for self-attention to be well-defined, hence 2/N² ≠ 0). The intuition for this result is as follows: a reason for DP-MHA not being Lipschitz is that for x_i = 0 the attention weights P_ij become uniform regardless of the values of x_j for j ≠ i. A similar issue arises for L2-MHA with W^Q ≠ W^K and full-rank W^K, as shown above: given any x_i, we can choose x_j such that the P_ij become uniform.

F.2 L2 SELF-ATTENTION IS LIPSCHITZ FOR W^Q = W^K

Hence we impose the restriction W^K = W^Q. With this assumption we have

P_ij ∝ exp( −‖(x_i − x_j)^⊤ √A‖_2² ),

where A = W^Q W^{Q⊤}/√(D/H) ∈ R^{D×D} and √A is chosen such that A = √A √A^⊤, in particular √A := W^Q/(D/H)^{1/4}. The terms in the Jacobian of f̃ simplify to:

J̃_ii = 2 X^⊤ P^{(i)} XA + P_ii I (note P^{(i)}1 = 0),
J̃_ij = 2 P_ij (x_j − Σ_k P_ik x_k)(x_i − x_j)^⊤ A + P_ij I for i ≠ j.

Let the Jacobian of f(X) be:

J_f = [J_11 ... J_1N; ... ; J_N1 ... J_NN] ∈ R^{ND×ND}.

Since f(X) = f̃(X)A, we have f_i(X) = A^⊤ f̃_i(X), and by the chain rule ∂/∂x_j [A^⊤ f̃_i(X)] = A^⊤ ∂f̃_i(X)/∂x_j = A ∂f̃_i(X)/∂x_j

(by symmetry of A), we have that J_ij = A J̃_ij. Hence:

J_ii = 2A X^⊤ P^{(i)} X A + P_ii A (note P^{(i)}1 = 0),
J_ij = 2 P_ij A (x_j − Σ_k P_ik x_k)(x_i − x_j)^⊤ A + P_ij A for i ≠ j.

Noting that Lip_p(f) = sup_X ‖J_f(X)‖_p, we would like to upper bound ‖J_f‖_p.

F.2.1 UPPER BOUND ON Lip_∞(F) FOR L2-MHA

Consider the choice p = ∞, where ‖J_f‖_∞ is the maximum absolute row sum of J_f. A key observation is that if we can bound the ∞-norm of the Jacobian of f_i, a single output of f (i.e. a single block row ‖[J_i1, ..., J_iN]‖_∞ of J_f), then this also bounds ‖J_f‖_∞, due to permutation equivariance of self-attention: all block rows attain the same maximal ‖·‖_∞ when each is optimised over the input X. Using this, we can prove that ‖J_f‖_∞ admits an upper bound that is O(log N − log log N). Below we state and prove the lemmas that lead to the proof of this upper bound.

First we analyse the term √A^⊤ X^⊤ P^{(i)} X √A, which appears in the first term of J_ii. Note that for Y := X√A, so that the rows of Y are y_i^⊤ := x_i^⊤ √A, we have

√A^⊤ X^⊤ P^{(i)} X √A = Y^⊤ P^{(i)} Y = Cov(Y),

where P(Y = y_j) = P_ij = exp(−‖y_j − y_i‖_2²)/Σ_k exp(−‖y_k − y_i‖_2²). The last equality uses the observation in Equation (9). The central inequality used throughout the proof of the main theorem is the following:

Lemma F.2. Tr(Cov(Y)) = Σ_j P_ij ‖y_j − Σ_k P_ik y_k‖_2² ≤ Σ_j P_ij ‖y_j − y_i‖_2² ≤ φ^{-1}(N−1), where φ(c) = c exp(c + 1) is a one-dimensional invertible function on R_{≥0}.

Proof. The first equality holds since Tr(Cov(Y)) = Σ_j Cov(Y)_{jj} = Σ_j Var(Y_j) = Σ_j E[(Y_j − E[Y_j])²]. The next inequality holds since Var(Y_j) = Var(Ỹ_j) = E[Ỹ_j²] − E[Ỹ_j]² ≤ E[Ỹ_j²], where Ỹ := Y − y_i. The final inequality can be proved as follows. We would like to bound

Σ_j P_ij ‖y_j − y_i‖_2² = Σ_j ‖y_j − y_i‖_2² exp(−‖y_j − y_i‖_2²) / Σ_k exp(−‖y_k − y_i‖_2²) = Σ_j z_j² exp(−z_j²) / Σ_k exp(−z_k²),

where z_j := ‖y_j − y_i‖_2 (hence z_i = 0). Define:

g(z) := Σ_j z_j² exp(−z_j²) / Σ_k exp(−z_k²) = Σ_{j≠i} z_j² exp(−z_j²) / (1 + Σ_{k≠i} exp(−z_k²)).
First note that as z_j → ∞, exp(−z_j²) → 0 exponentially fast, causing the product z_j² exp(−z_j²) → 0. Hence we expect the above quantity to be bounded and to attain its maximum. Let h(z_j) := exp(−z_j²) for notational conciseness, and note h(z_j) > 0. By taking partial derivatives with the chain rule, we have that for j ≠ i

∂g(z)/∂z_j = ( 2 z_j h(z_j) / (Σ_k h(z_k))² ) [ (1 − z_j²) Σ_k h(z_k) + Σ_k h(z_k) z_k² ].

Hence the derivative is 0 if and only if z_j = 0 or (1 − z_j²) Σ_k h(z_k) + Σ_k h(z_k) z_k² = 0, the latter being equivalent to

z_j² = 1 + Σ_k h(z_k) z_k² / Σ_k h(z_k) = 1 + g(z).

Hence at the maximum, the non-zero values among {z_j}_{j=1}^N must be equal to one another. It is clear now that the maximum value c is attained when z_j² = 1 + c for j ≠ i (and recall z_i = 0). So h(z_j) = exp(−1 − c) for j ≠ i. Substituting this into g(z), and rearranging, we obtain c exp(c + 1) = N − 1. Note that φ(x) := x exp(x + 1) is increasing for x > 0, hence c = φ^{-1}(N − 1). Also φ(log N) = (log N) exp(log N + 1) ≥ N log N ≥ N − 1 for N ≥ 3; since φ is increasing, we have φ^{-1}(N − 1) ≤ log N for N ≥ 3. In fact, it is known that φ^{-1}(N − 1) = O(log N − log log N) (Corless et al., 1996).

Note that the A term in f(X) = f̃(X)A allows us to use the above inequality, since Y^⊤ P^{(i)} Y = Cov(Y) now appears in the terms of J_f:

J_ii = 2 √A [Y^⊤ P^{(i)} Y] √A^⊤ + P_ii A,
J_ij = 2 √A P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤ √A^⊤ + P_ij A for i ≠ j.

Using the inequalities ‖BC‖ ≤ ‖B‖‖C‖, ‖B + C‖ ≤ ‖B‖ + ‖C‖ and ‖[A_1, ..., A_N]‖ ≤ Σ_i ‖A_i‖, we have:

‖[J_i1, ..., J_iN]‖_∞ ≤ ‖J_ii‖_∞ + Σ_{j≠i} ‖J_ij‖_∞
≤ 2 ‖√A‖_∞ ‖Y^⊤ P^{(i)} Y‖_∞ ‖√A^⊤‖_∞ + P_ii ‖A‖_∞ + 2 Σ_{j≠i} [ ‖√A‖_∞ ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_∞ ‖√A^⊤‖_∞ + P_ij ‖A‖_∞ ]
= 2 ‖√A‖_∞ ‖√A^⊤‖_∞ [ ‖Y^⊤ P^{(i)} Y‖_∞ + Σ_{j≠i} ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_∞ ] + ‖A‖_∞
= (2 ‖W^Q‖_∞ ‖W^{Q⊤}‖_∞ / √(D/H)) [ ‖Y^⊤ P^{(i)} Y‖_∞ + Σ_j ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_∞ ] + ‖W^Q W^{Q⊤}‖_∞ / √(D/H).

For the first equality, note that Σ_j P_ij = 1. For the second equality, note that the summand for j = i is 0 because of the factor y_i − y_j = 0.
Each of the terms in the brackets is bounded by the following lemmas:

Lemma F.3. ‖Y^⊤ P^{(i)} Y‖_∞ ≤ φ^{-1}(N−1) √(D/H) (with φ defined as in Lemma F.2).

Proof. Recall that Y^⊤ P^{(i)} Y = Cov(Y). Let σ(Y_m) denote the standard deviation of Y_m. Then [Cov(Y)]_{lm} ≤ σ(Y_l)σ(Y_m), and hence

‖Cov(Y)‖_∞ = max_l Σ_m |[Cov(Y)]_{lm}| ≤ max_l σ(Y_l) Σ_m σ(Y_m) ≤ √(D/H) Σ_m σ²(Y_m) = √(D/H) Tr(Cov(Y)) ≤ √(D/H) φ^{-1}(N−1),

since Σ_m σ(Y_m) ≤ √(D/H) √(Σ_m σ²(Y_m)) (by e.g. the Cauchy–Schwarz inequality on [σ(Y_1), ..., σ(Y_{D/H})] and 1) and max_l σ(Y_l) ≤ √(Σ_m σ²(Y_m)); the last inequality is from Lemma F.2.

Lemma F.4. Σ_j ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_∞ ≤ φ^{-1}(N−1) √(D/H).

Proof. Note ‖ab^⊤‖_∞ = ‖a‖_∞ ‖b‖_1 for real vectors a, b. Hence

Σ_j ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_∞ = Σ_j [√P_ij ‖y_j − Σ_k P_ik y_k‖_∞][√P_ij ‖y_i − y_j‖_1] = a^⊤ b ≤ ‖a‖_2 ‖b‖_2,

where a_j := √P_ij ‖y_j − Σ_k P_ik y_k‖_∞ and b_j := √P_ij ‖y_i − y_j‖_1. Note a_j ≤ c_j := √P_ij ‖y_j − Σ_k P_ik y_k‖_2, since ‖x‖_∞ ≤ ‖x‖_2 for any vector x; hence ‖a‖_2 ≤ ‖c‖_2, where ‖c‖_2² = Σ_j P_ij ‖y_j − Σ_k P_ik y_k‖_2² = Tr(Cov(Y)) ≤ φ^{-1}(N−1) from Lemma F.2. Also b_j ≤ √(D/H) d_j := √(D/H) √P_ij ‖y_i − y_j‖_2, since ‖x‖_1 ≤ √(D/H) ‖x‖_2 for x ∈ R^{D/H} (by e.g. the Cauchy–Schwarz inequality on [|x_1|, ..., |x_{D/H}|] and 1); hence ‖b‖_2 ≤ √(D/H) ‖d‖_2, where ‖d‖_2² = Σ_j P_ij ‖y_i − y_j‖_2² ≤ φ^{-1}(N−1), also from Lemma F.2. Hence ‖a‖_2 ‖b‖_2 ≤ √(D/H) ‖c‖_2 ‖d‖_2 ≤ √(D/H) φ^{-1}(N−1).

Putting the above lemmas together, with the observation sup_X ‖J_f(X)‖_∞ = sup_X ‖[J_i1(X), ..., J_iN(X)]‖_∞ by permutation invariance of ‖J_f‖_∞ (since f is permutation equivariant and ‖·‖_∞ is the maximum absolute row sum), we have

‖J_f‖_∞ ≤ 4 ‖W^Q‖_∞ ‖W^{Q⊤}‖_∞ φ^{-1}(N−1) + ‖W^Q W^{Q⊤}‖_∞ / √(D/H)
≤ ‖W^Q‖_∞ ‖W^{Q⊤}‖_∞ ( 4 φ^{-1}(N−1) + 1/√(D/H) )    (35)
≤ ‖W^Q‖_∞ ‖W^{Q⊤}‖_∞ ( 4 log N + 1/√(D/H) ),

where the last inequality holds for N ≥ 3. The full multihead attention map that combines the heads f^h(X) is:

F : X ↦ [f¹(X)W^{V,1}, ..., f^H(X)W^{V,H}] W^O = g(X) W^V W^O,

where g : X ↦ [f¹(X), ..., f^H(X)], W^O ∈ R^{D×D} and W^V := blockdiag(W^{V,1}, ..., W^{V,H}) ∈ R^{DH×D}. Note that the Jacobian J_g is a block matrix whose block rows are the J_{f^h}, hence ‖J_g‖_∞ = max_h ‖J_{f^h}‖_∞, and similarly ‖W^V‖_∞ = max_h ‖W^{V,h}‖_∞. Hence we have

Lip_∞(F) ≤ max_h ‖J_{f^h}‖_∞ · max_h ‖W^{V,h}‖_∞ · ‖W^O‖_∞.
Combining this with Inequality (35), we have:

Lip_∞(F) ≤ ( 4 φ^{-1}(N−1) + 1/√(D/H) ) max_h [ ‖W^{Q,h}‖_∞ ‖W^{Q,h⊤}‖_∞ ] max_h ‖W^{V,h}‖_∞ ‖W^O‖_∞.

F.2.2 UPPER BOUND ON Lip_2(F) FOR L2-MHA

For p = 2, we use the following lemma:

Lemma F.5. Let A be a block matrix with block rows A_1, ..., A_N. Then ‖A‖_2 ≤ √(Σ_i ‖A_i‖_2²), and equality holds if and only if the first right singular vectors of the A_i align.

Proof.

‖A‖_2² = sup_{‖x‖_2=1} ‖[A_1; ...; A_N] x‖_2² = sup_{‖x‖_2=1} Σ_i ‖A_i x‖_2² ≤ Σ_i sup_{‖x‖_2=1} ‖A_i x‖_2² = Σ_i ‖A_i‖_2².

Note that equality holds if and only if the first right singular vectors of the A_i align.

Hence a bound on the spectral norm of each block row of J_f can give us an O(√N) bound on ‖J_f‖_2, which may be loose, and it remains an open question whether this bound can be tightened. To bound the ‖·‖_2 norm of each block row of J_f, we use the following lemmas:

Lemma F.6. ‖Y^⊤ P^{(i)} Y‖_2 ≤ φ^{-1}(N−1).

Proof. ‖Y^⊤ P^{(i)} Y‖_2 = ‖Cov(Y)‖_2 = λ_max(Cov(Y)) ≤ Tr(Cov(Y)) ≤ φ^{-1}(N−1), where the first equality holds by symmetry of Cov(Y), and the next holds because Cov(Y) is positive semi-definite, so all its eigenvalues are non-negative, and hence the maximal eigenvalue is bounded by the sum of the eigenvalues, which equals the trace. The final inequality is from Lemma F.2.

Lemma F.7. Σ_j ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_2 ≤ φ^{-1}(N−1).

Proof. Directly use Cauchy–Schwarz on c and d in the proof of Lemma F.4.

Again using the inequalities ‖BC‖ ≤ ‖B‖‖C‖, ‖B + C‖ ≤ ‖B‖ + ‖C‖ and ‖[A_1, ..., A_N]‖ ≤ Σ_i ‖A_i‖, with the additional equality ‖B‖_2 = ‖B^⊤‖_2, we have the bound:

‖[J_i1, ..., J_iN]‖_2 ≤ (2 ‖W^Q‖_2 ‖W^{Q⊤}‖_2 / √(D/H)) [ ‖Y^⊤ P^{(i)} Y‖_2 + Σ_j ‖P_ij (y_j − Σ_k P_ik y_k)(y_i − y_j)^⊤‖_2 ] + ‖W^Q W^{Q⊤}‖_2 / √(D/H)
≤ 4 φ^{-1}(N−1) ‖W^Q‖_2² / √(D/H) + ‖W^Q W^{Q⊤}‖_2 / √(D/H)
≤ (‖W^Q‖_2² / √(D/H)) ( 4 φ^{-1}(N−1) + 1 ).

Using Lemma F.5, we have that

‖J_f‖_2 ≤ √N (‖W^Q‖_2² / √(D/H)) ( 4 φ^{-1}(N−1) + 1 )    (36)
≤ √N (‖W^Q‖_2² / √(D/H)) ( 4 log N + 1 ).

To obtain the final result for the full multihead self-attention F, we need a final lemma:

Lemma F.8.
Let A be a block matrix with block columns A_1, ..., A_N. Then ‖A‖_2 ≤ √(Σ_i ‖A_i‖_2²).

Proof.

‖A‖_2 = ‖[A_1, ..., A_N]‖_2 = sup_{Σ_i ‖x_i‖_2² = 1} ‖ Σ_i A_i x_i ‖_2 ≤ sup_{Σ_i ‖x_i‖_2² = 1} Σ_i ‖A_i x_i‖_2 = sup_{‖e_i‖_2 = 1, Σ_i λ_i² = 1} Σ_i λ_i ‖A_i e_i‖_2 = sup_{Σ_i λ_i² = 1} Σ_i λ_i ‖A_i‖_2 ≤ √(Σ_i ‖A_i‖_2²),

where we use the substitution x_i = λ_i e_i, and the last inequality holds by e.g. the Cauchy–Schwarz inequality on [λ_1, ..., λ_N] and [‖A_1‖_2, ..., ‖A_N‖_2].

Recall that F : X ↦ [f¹(X)W^{V,1}, ..., f^H(X)W^{V,H}] W^O. Since Lip_2(X ↦ f^h(X)W^{V,h}) ≤ ‖J_{f^h}‖_2 ‖W^{V,h}‖_2, by Lemma F.8 we have that

Lip_2(X ↦ [f¹(X)W^{V,1}, ..., f^H(X)W^{V,H}]) ≤ √( Σ_h ‖J_{f^h}‖_2² ‖W^{V,h}‖_2² ),

and hence

Lip_2(F) ≤ √( Σ_h ‖J_{f^h}‖_2² ‖W^{V,h}‖_2² ) ‖W^O‖_2.

Combining this with Inequality (36), we have:

Lip_2(F) ≤ (√N / √(D/H)) ( 4 φ^{-1}(N−1) + 1 ) √( Σ_h ‖W^{Q,h}‖_2⁴ ‖W^{V,h}‖_2² ) ‖W^O‖_2.
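The quantities in Lemma F.2 are easy to probe numerically: φ^{-1}(N−1) can be computed by bisection (φ is increasing on R_{≥0}), and Tr(Cov(Y)) for random configurations of y_1, ..., y_N should never exceed it. A minimal sketch (N, the dimension, and the sampling scheme are arbitrary illustrative choices):

```python
import numpy as np

def phi(c):
    return c * np.exp(c + 1.0)

def phi_inv(t, lo=0.0, hi=50.0, iters=100):
    """Invert phi on [0, inf) by bisection (phi is increasing there)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < t else (lo, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
N, d = 16, 3
bound = phi_inv(N - 1)
assert bound <= np.log(N)    # phi^{-1}(N-1) <= log N for N >= 3

# Tr(Cov(Y)) under P_ij ∝ exp(-||y_j - y_i||^2), here with i = 0, for random Y.
worst = 0.0
for _ in range(200):
    Y = rng.normal(scale=rng.uniform(0.1, 5.0), size=(N, d))
    sq = ((Y - Y[0]) ** 2).sum(axis=1)        # squared distances to y_1
    P = np.exp(-(sq - sq.min()))              # stable softmax of -sq
    P = P / P.sum()
    mu = P @ Y                                # mean under the attention distribution
    trace_cov = (P * ((Y - mu) ** 2).sum(axis=1)).sum()
    worst = max(worst, trace_cov)
assert worst <= bound + 1e-9                  # Lemma F.2 holds on every sample
```

Random search only probes the bound from below; the lemma guarantees it can never be crossed.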

G THE CASE WITH MASKING

Since self-attention is often used with masking, a natural question is how masking affects the derived bounds. In self-attention (for any choice of attention function), masking is implemented as follows: given a set of mask indices M ⊂ {1, ..., N} × {1, ..., N}, the logits (i.e. the inputs to the softmax) are set to −∞ at the mask indices. That is,

L_ij = L̃_ij if (i,j) ∉ M, and L_ij = −∞ if (i,j) ∈ M,

where L̃_ij is the original logit (e.g. for L2 self-attention, L̃_ij = −(x_i − x_j)^⊤ A (x_i − x_j)). Masking implies that f_i(X) is not a function of x_j for (i,j) ∈ M, hence J_ij = 0 for (i,j) ∈ M. Thus f_i(X) is equal to the ith output of self-attention with inputs restricted to {x_j : (i,j) ∉ M}, the unmasked inputs with respect to the ith output. Hence J_ij no longer contributes to the bound on ‖[J_i1, ..., J_iN]‖, and the bound for the unmasked case continues to hold as long as (i,i) ∉ M, i.e. x_i attends to itself (this is necessary for the proof of Lemma F.2 to hold). The bound can in fact be tightened by replacing N with |{x_j : (i,j) ∉ M}|, the number of unmasked inputs with respect to the ith output.
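The masking scheme above can be sketched as follows (shown for single-head L2 self-attention with W^Q = W^K = I and a causal mask, purely for illustration). The check confirms that an output is unaffected by perturbing an input that is masked for it.

```python
import numpy as np

def l2_attention(X, mask):
    """L2 self-attention weights with masked logits set to -inf.
    mask[i, j] = True means position i may attend to position j."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||x_i - x_j||^2
    logits = np.where(mask, -sq, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax; diag is unmasked
    P = np.exp(logits)
    P = P / P.sum(axis=1, keepdims=True)
    return P @ X

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))
causal = np.tril(np.ones((N, N), dtype=bool))  # (i, i) unmasked: each x_i attends to itself

out = l2_attention(X, causal)
X2 = X.copy()
X2[4] += 10.0                                  # perturb an input masked for rows 1..4
out2 = l2_attention(X2, causal)
assert np.allclose(out[:4], out2[:4])          # masked input does not affect earlier outputs
assert not np.allclose(out[4], out2[4])        # but it does affect its own row
```

The vanishing of the corresponding Jacobian blocks J_ij is exactly this independence: J_ij = 0 whenever (i,j) ∈ M.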

H DROPOUT IS CONTRACTIVE

At test time, Dropout multiplies inputs by the dropout keep probability p < 1, so it is a contraction with Lipschitz constant p at evaluation time. At training time, Dropout amounts to setting some inputs to zero while keeping the other inputs unchanged. This can be expressed as right multiplication by a diagonal binary matrix M, and for such matrices we can verify that ‖M‖_p := sup_{‖x‖_p = 1} ‖Mx‖_p ≤ 1.
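The training-time claim can be checked directly: zeroing coordinates never increases a vector p-norm. A minimal sketch (the keep probability and sizes are arbitrary illustrative choices; note this is the paper's non-rescaled formulation, not the common "inverted dropout" that divides by the keep probability, which is not a contraction in general):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, keep):
    """Training-time dropout without rescaling: multiply by a binary diagonal mask."""
    m = (rng.uniform(size=x.shape) < keep).astype(x.dtype)
    return m * x   # equivalent to diag(m) @ x

for _ in range(100):
    x = rng.normal(size=20)
    y = dropout_train(x, keep=0.7)
    for p in (1, 2, np.inf):
        # Zeroing coordinates can only shrink any vector p-norm: ||Mx||_p <= ||x||_p.
        assert np.linalg.norm(y, p) <= np.linalg.norm(x, p) + 1e-12
```

The same inequality holds coordinate-wise for every diagonal matrix with entries in [−1, 1], of which binary dropout masks are a special case.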

I EXPERIMENTAL DETAILS

For the experiment in Section 5.1, showing the asymptotic tightness of the upper bound on Lip_∞(F) where F is L2-MHA, we fix all free parameters of F (namely W^Q, W^V) to be the identity, and only optimise the input X. We use 50 random initialisations of X for each N, where X_ij ∼ U[−c, c] for c ∼ U[0, 10] (we observed that having c itself be random improves optimisation). We display the top 5 results for each value of N after optimising each random initialisation until convergence using Adam (Kingma & Ba, 2015) with a learning rate of 0.1.

For the experiments in Section 5.3, we compare the performance of the original Transformer and the Transformer with Lipschitz/invertible self-attention at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993).¹ Each training example is a sentence represented as a variable-length sequence of characters, and examples are batched according to length such that padding is minimised, with the maximum sequence length set to 288. All models are autoregressive, outputting the logits for the categorical likelihood predicting the next character, and are trained using maximum likelihood (cross-entropy loss) with a batch size of 64. The LSTM models have the dimensionality of the hidden state equal to the dimensionality D of the cell state (the usual default implementation). The Transformer models are trained with a varying number of blocks (number of layers), with H = 8 heads and D = 512, tuning the dropout rate in {0, 0.1, 0.2} and the base learning rate γ ∈ {0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0} with number of warmup iterations w ∈ {1000, 2000, 4000, 8000} for the standard custom learning rate schedule of Vaswani et al. (2017):

ε_t = (γ/√D) min(t^{−1/2}, t w^{−3/2}),

where ε_t is the learning rate at training iteration t. Hence the learning rate increases linearly from 0 to γ(Dw)^{−1/2} over w iterations, then decays proportionally to t^{−1/2}.
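The learning rate schedule above can be sketched as follows; the specific D, γ and w values are illustrative picks from the grids just described.

```python
import numpy as np

def lr(t, D=512, warmup=4000, gamma=1.0):
    """Transformer schedule: (gamma / sqrt(D)) * min(t^-0.5, t * warmup^-1.5)."""
    return gamma / np.sqrt(D) * min(t ** -0.5, t * warmup ** -1.5)

# Linear warmup up to the peak gamma * (D * warmup)^(-1/2), then t^(-1/2) decay.
peak = lr(4000)
assert abs(peak - 1.0 / np.sqrt(512 * 4000)) < 1e-12
assert lr(1000) < lr(4000)    # increasing during warmup
assert lr(16000) < lr(4000)   # decaying after warmup
```

The two branches of the `min` intersect exactly at t = w, which is where the peak learning rate is reached.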
We use Glorot Uniform initialisation (Glorot & Bengio, 2010) for all weights, i.e. U(−1/√(d_in + d_out), 1/√(d_in + d_out)), except for the weights in L2-MHA, which are initialised from U(−s/√D, s/√D), where s is a hyperparameter. For D = 512, we used s = 1/2⁴. All experiments were done in TensorFlow 1.14 (Abadi et al., 2016). The code will be released upon de-anonymisation. In Table 2 we show the best Test NLL across training of the Transformer models in Figure 4.

In Figure 7, we show the lower bound on Lip_2(F) obtained by optimising ‖J_F(X)‖_2 using the same optimisation procedure as for Figure 2 of Section 5.1. Here the optimisation is more difficult, evident in the variance of the top 5 values, and the trend is less clear, but it appears that Lip_2(f) grows at a rate of O(log N). The message is less clear here, and there are at least two possibilities: (1) the optimisation is difficult even for small values of N, hence Figure 7 shows a loose lower bound; (2) if the lower bound is tight, this suggests that the O(√N log N) bound in Theorem 3.2 is not asymptotically tight, and could be improved to O(log N) (or O(log N − log log N), as for p = ∞).

L OPTIMISING THE NORM OF THE JACOBIAN OF DP-MHA

In Figure 8, we show how the norm of the Jacobian ‖J_f(X)‖_∞ for DP-MHA f keeps increasing when optimised with respect to X. This is a useful sanity check validating our theoretical result of Theorem 3.1, that DP-MHA is not Lipschitz. The oscillations are likely due to the momentum term of the Adam optimiser used to optimise the norm.



We use the standard training-validation-test split; the dataset can be found at e.g. https://github.com/harvardnlp/TextFlow/tree/master/data/ptb.



Figure 1: A Transformer block.


Figure 2: Lower and upper bound on Lip ∞ (f ) for L2-MHA f , with H = D = 1 and varying N .

Figure 3: Invertibility of g(x) = x + cf (x) where f is L2-MHA (left) and DP-MHA (right).

Figure 4: Test NLL curves during training for various LSTM/Transformer models.


NUMERICAL INVERTIBILITY OF MHA RESIDUAL MAP

Following Section 5.2, Figure 5 confirms that numerical invertibility does not hold with trained weights for dot-product multihead self-attention (DP-MHA) (obtained from the one-layer Transformer (DP) model used for Figure 4), similar to the randomly initialised weight case. Figure 6 shows additional results for different values of N and D.

Figure 5: Invertibility of g(x) = x + cf (x) for trained DP-MHA f .

Figure 6: Numerical invertibility of g(x) = x + cf(x), where f is L2-MHA (left) or DP-MHA (right), for different values of N and D.

Figure 7: Lower bound on Lip_2(F), where F is L2-MHA, with D = 1 and varying N, obtained by optimising ‖J_F(X)‖_2 with respect to X, with 50 random initialisations of X for each N.

Figure 8: Optimising ‖J_f(X)‖_∞ with respect to X for trained DP-MHA f.

Figure 10: Histogram showing distribution of inputs/outputs of L2-MHA and DP-MHA

Table 1: Wall-clock training times for one epoch of training (seconds).

Table 2: Test NLL for Transformer models on PTB character-level language modelling.

Test NLL for Transformer models trained with fixed learning rate on PTB character level language modelling

annex

Recall that E_{ji} ∈ R^{N×N} is a binary matrix with zeros everywhere except the (j,i)th entry. Hence E_{ji}X has all rows equal to zero except for the jth row, given by x_i^⊤. We can then verify that X^⊤ P^{(i)} E_{ji} X = P_ij (x_j − Σ_k P_ik x_k) x_i^⊤. Also note that P^{(i)} is symmetric, and that each of its rows and columns sums to 0, i.e. P^{(i)}1 = 1^⊤P^{(i)} = 0. These identities allow us to simplify the Jacobian terms J̃_ii and, for i ≠ j, J̃_ij. We are now ready to show that f̃ is not Lipschitz for general W^Q, W^K:

Lemma F.1. If W^K ∈ R^{D×D/H} is full rank (i.e. has full column rank), and W^K ≠ W^Q, then J̃_ij has terms that are unbounded for i ≠ j, hence f̃ is not Lipschitz.

Proof. Let us investigate the expression K̃_ij for i ≠ j, which is related to J̃_ij by Equation (22). It suffices to show that K̃_ij is unbounded to show that J̃_ij is unbounded, since W^K is full rank. We have K̃_ij = −P_ij (y_j − Σ_k P_ik y_k) y_j^⊤. Note that y_i can take an arbitrary value in R^{D/H}. For all j ≠ i, let us choose x_j such that y_j = −y_i; this is possible for any value of y_i since W^K is full rank. (Note y_j = −y_i and not y_j = y_i.) We then have that ‖y_j‖_2² is equal for all j, hence the attention weights become uniform, and the proof concludes as in Appendix F.1.

M EXPERIMENT TYING KEYS AND QUERIES OF L2-MHA BUT PRESERVING PARAMETER COUNT

In Figure 4 of Section 5.3, we have shown that there is a clear reduction in performance when tying the keys and queries. To test whether this can be attributed to the reduction in parameter count, we tried doubling the number of columns of W^Q when the keys and queries are shared (i.e. from D/H to 2D/H), so that the shared model has the same number of parameters as the unshared model. In Figure 9, the third column shows results for shared L2-MHA but with the same number of parameters as the unshared L2-MHA, i.e. as when not tying the keys and queries.
The performance is similar to that of the second column (tying with a reduced number of parameters), suggesting that there is an inherent limitation in expressiveness to tying the keys and queries, and that the reduction in the number of parameters is an insufficient explanation for this phenomenon.

N STABILITY EXPERIMENTS

In Figure 10 below, we compare the output variance of trained L2-MHA against trained DP-MHA, with weights from the one-layer Transformer (L2, W^Q = W^K) model and the (DP) model used for Figure 4, respectively. We take the same distribution of inputs as used for the numerical invertibility experiment in Section 5.2, and show the histogram of inputs and outputs after flattening the input/output tensors. We see that the range of outputs remains similar to the range of inputs for Lipschitz L2-MHA, whereas for DP-MHA the outputs have a much wider range, because the Jacobian norm is large for DP-MHA at these inputs.

