THE LIPSCHITZ CONSTANT OF SELF-ATTENTION

Anonymous authors
Paper under double-blind review

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

1. INTRODUCTION

Lipschitz continuity is a strong form of continuity for functions. Loosely speaking, a function is Lipschitz continuous if changing its input by a certain amount cannot change its output by more than $K$ times that amount. The constant $K$ is a hard constraint on how rapidly the function's output can vary, and the smallest such $K$ is known as the function's Lipschitz constant. For example, $f_1(x) = \sqrt{|x|}$ and $f_2(x) = \exp(x)$ for $x \in \mathbb{R}$ are not Lipschitz continuous, because their output can change arbitrarily fast as $x$ approaches $0$ and $+\infty$ respectively. On the other hand, $g_1(x) = \tanh(x)$ and $g_2(x) = \alpha x$ are Lipschitz continuous, because their rate of change (derivative) is bounded.

In deep learning, we often use Lipschitz continuity as a constraint for neural networks, to control how much a network's output can change relative to its input. Such Lipschitz constraints are useful in several contexts. For example, Lipschitz constraints can endow models with provable robustness against adversarial perturbations (Cisse et al., 2017; Tsuzuku et al., 2018; Anil et al., 2019) and guaranteed generalisation bounds (Sokolić et al., 2017). Moreover, the dual form of the Wasserstein distance is defined as a supremum over Lipschitz functions with a given Lipschitz constant, hence Lipschitz-constrained networks are used for estimating Wasserstein distances (Peyré & Cuturi, 2019). Further, Lipschitz-constrained networks can stabilise training for GANs, an example being spectral normalisation (Miyato et al., 2018). Finally, Lipschitz-constrained networks are also used to construct invertible models and normalising flows: for example, they can serve as a building block for invertible residual networks and hence flow-based generative models (Behrmann et al., 2019; Chen et al., 2019).
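The contrast between these examples can be checked numerically: by Definition 2.1 (below), the ratio $\|f(x) - f(x')\| / \|x - x'\|$ is bounded by $\mathrm{Lip}(f)$ for a Lipschitz function, but unbounded otherwise. The following sketch (illustrative only; the probe points and step size are our choices, not from the paper) demonstrates this for $\tanh$ versus $\exp$:

```python
import numpy as np

def lipschitz_ratio(f, x, y):
    """Difference quotient |f(x) - f(y)| / |x - y|, a lower bound on Lip(f)."""
    return abs(f(x) - f(y)) / abs(x - y)

# tanh is 1-Lipschitz: the ratio never exceeds 1, wherever we probe.
xs = np.linspace(-50.0, 50.0, 1000)
tanh_ratios = [lipschitz_ratio(np.tanh, x, x + 1e-3) for x in xs]
assert max(tanh_ratios) <= 1.0

# exp is not Lipschitz: the same ratio grows without bound as x grows,
# tracking the unbounded derivative exp'(x) = exp(x).
exp_ratios = [lipschitz_ratio(np.exp, float(x), float(x) + 1e-3) for x in (0, 10, 20)]
assert exp_ratios[0] < exp_ratios[1] < exp_ratios[2]
```

Any finite collection of difference quotients only ever certifies a *lower* bound on the Lipschitz constant; proving an upper bound (or non-Lipschitzness) requires an argument over the whole domain, as in the sections that follow.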
Additionally, Neural ODEs (Chen et al., 2018; Grathwohl et al., 2019) are typically defined using vector fields parameterised by Lipschitz networks, so that the flow generated by the vector field is guaranteed to exist for all times.

Nonetheless, designing Lipschitz-continuous neural networks and computing (or even upper-bounding) their Lipschitz constant is a hard problem. Previous work has mostly focused on fully-connected and convolutional networks, not only because they are common in deep learning, but also because they are relatively simple to analyse, being compositions of linear maps and pointwise non-linearities. Even in this case, however, exact evaluation of the Lipschitz constant of fully-connected and convolutional networks is NP-hard (Virmaux & Scaman, 2018), and obtaining a tight upper bound remains a challenging task (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).

Fully-connected and convolutional networks are not the only neural networks worthy of interest. Recently, self-attention (Vaswani et al., 2017) has become a popular alternative to recurrent neural networks. Self-attention is a key component of the Transformer (Vaswani et al., 2017), which has found success as a building block in models of various data modalities, starting with natural-language processing (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) and extending to computer vision (Zhang et al., 2019; Parmar et al., 2019), audio generation (Huang et al., 2019), and reinforcement learning (Parisotto et al., 2020). However, no previous work has analysed the Lipschitz properties of self-attention, and thus it has been unclear whether self-attention is a viable option in applications that require Lipschitz constraints. In this work, we address this gap in the theory of self-attention by providing a thorough analysis of its Lipschitz properties.
In particular, we make the following contributions:

• We prove that the widely used dot-product self-attention is not Lipschitz, and is therefore not suitable for use in applications requiring Lipschitz constraints.
• We formulate L2 self-attention as an alternative, and show that it is Lipschitz.
• We derive a theoretical upper bound on the Lipschitz constant of L2 self-attention, and provide empirical evidence of the asymptotic tightness of the bound.
• As a practical demonstration of the theory, we use this bound to formulate invertible self-attention, and explore its use in a Transformer architecture for character-level language modelling.

2. LIPSCHITZ CONSTANT OF FULLY-CONNECTED/CONVOLUTIONAL LAYERS

We first define the notion of Lipschitz continuity, and proceed to define the Lipschitz constant.

Definition 2.1. Given two metric spaces $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$, a function $f: \mathcal{X} \to \mathcal{Y}$ is called Lipschitz continuous (or $K$-Lipschitz) if there exists a constant $K \geq 0$ such that

$$d_{\mathcal{Y}}(f(x), f(x')) \leq K \, d_{\mathcal{X}}(x, x') \quad \text{for all } x, x' \in \mathcal{X}. \tag{1}$$

The smallest such $K$ is the Lipschitz constant of $f$, denoted $\mathrm{Lip}(f)$.

In this paper, we focus on the common case where $\mathcal{X} = \mathbb{R}^n$, $\mathcal{Y} = \mathbb{R}^m$, and $d_{\mathcal{X}}$, $d_{\mathcal{Y}}$ are induced by a $p$-norm $\|x\|_p := \left(\sum_i |x_i|^p\right)^{1/p}$. We will primarily consider the cases $p = 2$ and $p = \infty$, where $\|x\|_\infty := \max_i |x_i|$. To emphasise the dependence of the Lipschitz constant on the choice of $p$-norm, we will often denote it by $\mathrm{Lip}_p(f)$. In this case, it follows directly from Definition 2.1 that the Lipschitz constant is given by

$$\mathrm{Lip}_p(f) = \sup_{x \neq x' \in \mathbb{R}^n} \frac{\|f(x) - f(x')\|_p}{\|x - x'\|_p}. \tag{2}$$

Next, we outline some basic results that are useful for estimating Lipschitz constants, also covered in related works (Virmaux & Scaman, 2018; Behrmann et al., 2019). We describe how these results are used to provide bounds on the Lipschitz constant of fully-connected networks (FCN) and convolutional neural networks (CNN), using the fact that both are compositions of linear maps and pointwise non-linearities. To begin with, the following theorem suggests a way to bound $\mathrm{Lip}_p(f)$ for a differentiable Lipschitz function $f$:

Theorem 2.1 (Federer, 1969). Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be differentiable and Lipschitz continuous under a choice of $p$-norm $\|\cdot\|_p$. Let $J_f(x)$ denote its total derivative (Jacobian) at $x$. Then $\mathrm{Lip}_p(f) = \sup_{x \in \mathbb{R}^n} \|J_f(x)\|_p$, where $\|J_f(x)\|_p$ is the induced operator norm on $J_f(x)$.

Hence if $f$ is a linear map represented by a matrix $W$, then

$$\mathrm{Lip}_p(f) = \|W\|_p := \sup_{\|x\|_p = 1} \|Wx\|_p = \begin{cases} \sigma_{\max}(W) & \text{if } p = 2 \\ \max_i \sum_j |W_{ij}| & \text{if } p = \infty \end{cases} \tag{3}$$

where $\|W\|_p$ is the operator norm on matrices induced by the vector $p$-norm, and $\sigma_{\max}(W)$ is the largest singular value of $W$.
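Equation (3) can be verified numerically for a concrete matrix. The sketch below (our own illustration, with an arbitrary random $W$) computes both operator norms in closed form and confirms that no difference quotient of the map $x \mapsto Wx$ exceeds them:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))

# p = 2: the operator norm is the largest singular value of W.
lip2 = np.linalg.svd(W, compute_uv=False)[0]
# p = inf: the operator norm is the maximum absolute row sum of W.
lipinf = np.abs(W).sum(axis=1).max()

# Empirical check: difference quotients of x -> Wx never exceed the bounds.
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    r2 = np.linalg.norm(W @ x - W @ y) / np.linalg.norm(x - y)
    rinf = np.abs(W @ x - W @ y).max() / np.abs(x - y).max()
    assert r2 <= lip2 + 1e-9 and rinf <= lipinf + 1e-9
```

For a linear map the bound is tight: the supremum in equation (2) is attained in the direction of the leading right singular vector (for $p = 2$), so random probing approaches but never exceeds it.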
Under these choices of norm, many common non-linearities (including ReLU, sigmoid, tanh and ELU) are 1-Lipschitz. The quantity $\|W\|_2 = \sigma_{\max}(W)$ is usually estimated via power iteration; we provide details on how this is done in Appendix B.

Since we now know the Lipschitz constants of the components of both FCN and CNN, we can bound their Lipschitz constants by applying the following lemma:

Lemma 2.1 (Federer, 1969). Let $g$, $h$ be two composable Lipschitz functions. Then $g \circ h$ is also Lipschitz with $\mathrm{Lip}(g \circ h) \leq \mathrm{Lip}(g)\,\mathrm{Lip}(h)$.
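A standard power-iteration estimate of $\sigma_{\max}(W)$ can be sketched as follows (this is a generic textbook implementation, not the paper's Appendix B; the iteration count and seeding are our choices). It alternates multiplications by $W$ and $W^\top$, which converges to the leading singular pair:

```python
import numpy as np

def power_iteration_sigma_max(W, n_iters=100, seed=0):
    """Estimate sigma_max(W) by alternating power iteration on W and W^T.

    Converges geometrically at rate (sigma_2 / sigma_1)^2 per iteration,
    so a modest number of iterations suffices when the top singular
    values are well separated.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)   # normalise the left iterate
        v = W.T @ u
        v /= np.linalg.norm(v)   # normalise the right iterate
    return float(np.linalg.norm(W @ v))

W = np.random.default_rng(1).standard_normal((64, 32))
est = power_iteration_sigma_max(W)
exact = np.linalg.svd(W, compute_uv=False)[0]
assert abs(est - exact) < 1e-4
```

This is the same estimate that spectral normalisation (Miyato et al., 2018) maintains during training, where a single iteration per gradient step is typically enough because the weights change slowly. Combined with Lemma 2.1, the per-layer norms multiply to give an upper bound on the Lipschitz constant of the whole FCN or CNN.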

