THE LIPSCHITZ CONSTANT OF SELF-ATTENTION

Anonymous authors
Paper under double-blind review

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

1. INTRODUCTION

Lipschitz continuity is a strong form of continuity for functions. Loosely speaking, a function is Lipschitz continuous if changing its input by a certain amount cannot change its output by more than K times that amount. The constant K is a hard constraint on how rapidly the function's output can vary, and the smallest such K is known as the function's Lipschitz constant. For example, f1(x) = √x for x ≥ 0 and f2(x) = exp(x) for x ∈ R are not Lipschitz continuous, because their output can change arbitrarily fast as x approaches 0 and +∞ respectively. On the other hand, g1(x) = tanh(x) and g2(x) = αx are Lipschitz continuous, because their rate of change (derivative) is bounded.

In deep learning, we often use Lipschitz continuity as a constraint for neural networks, to control how much a network's output can change relative to its input. Such Lipschitz constraints are useful in several contexts. For example, they can endow models with provable robustness against adversarial perturbations (Cisse et al., 2017; Tsuzuku et al., 2018; Anil et al., 2019) and guaranteed generalisation bounds (Sokolić et al., 2017). Moreover, the dual form of the Wasserstein distance is defined as a supremum over Lipschitz functions with a given Lipschitz constant, hence Lipschitz-constrained networks are used for estimating Wasserstein distances (Peyré & Cuturi, 2019). Further, Lipschitz-constrained networks can stabilise the training of GANs, an example being spectral normalisation (Miyato et al., 2018). Finally, Lipschitz-constrained networks are also used to construct invertible models and normalising flows; for example, they can serve as a building block for invertible residual networks and hence flow-based generative models (Behrmann et al., 2019; Chen et al., 2019).
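The definition above can be checked numerically: the ratio |f(x) − f(y)|/|x − y| stays below 1 for tanh but blows up for the square root near zero. A minimal NumPy sketch (the helper name `max_slope` is ours, introduced only for illustration):

```python
import numpy as np

# Empirical check of the Lipschitz definition: for a K-Lipschitz f,
# |f(x) - f(y)| / |x - y| <= K for all x != y.  Sampling pairs and
# taking the maximum observed slope gives a lower bound on the true
# Lipschitz constant.

def max_slope(f, xs, ys):
    """Largest |f(x) - f(y)| / |x - y| over the given pairs of points."""
    return float(np.max(np.abs(f(xs) - f(ys)) / np.abs(xs - ys)))

rng = np.random.default_rng(0)

# tanh is 1-Lipschitz: its derivative 1 - tanh(x)^2 is bounded by 1,
# so the observed slope never exceeds 1.
xs = rng.uniform(-5.0, 5.0, size=100_000)
ys = rng.uniform(-5.0, 5.0, size=100_000)
print("tanh:", max_slope(np.tanh, xs, ys))  # close to, never above, 1

# sqrt is not Lipschitz on [0, inf): its slope 1/(2*sqrt(x)) blows up
# near 0, so the observed maximum grows as we sample closer to 0.
for eps in [1e-2, 1e-4, 1e-6]:
    xs = rng.uniform(eps, 1.0, size=100_000)
    ys = rng.uniform(eps, 1.0, size=100_000)
    print("sqrt near", eps, ":", max_slope(np.sqrt, xs, ys))
```

The observed maximum slope is only a lower bound on the Lipschitz constant; for tanh it approaches the true constant 1, while for √x it diverges as the samples approach the origin.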
Additionally, Neural ODEs (Chen et al., 2018; Grathwohl et al., 2019) are typically defined using vector fields parameterised by Lipschitz networks, so that the flow generated by the vector field is guaranteed to exist for all time.

Nonetheless, designing Lipschitz-continuous neural networks and computing (or even upper-bounding) their Lipschitz constant is a hard problem. Previous work has mostly focused on fully-connected and convolutional networks, not only because they are common in deep learning, but also because they are relatively simple to analyse, as compositions of linear maps and pointwise non-linearities. Even in this case, however, exact evaluation of the Lipschitz constant of fully-connected and convolutional networks is NP-hard (Virmaux & Scaman, 2018), and obtaining a tight upper bound remains a challenging task (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).

Fully-connected and convolutional networks are not the only neural networks worthy of interest. Recently, self-attention (Vaswani et al., 2017) has become a popular alternative to recurrent neural networks.
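For fully-connected networks with 1-Lipschitz activations such as tanh, a standard (though generally loose) upper bound on the Lipschitz constant is the product of the spectral norms of the weight matrices. A minimal sketch under that assumption, estimating each spectral norm by power iteration (the helper names are ours, not from the paper):

```python
import numpy as np

# Upper bound on the Lipschitz constant of a tanh MLP: the Lipschitz
# constant of a composition is at most the product of the constants of
# its layers, and a linear layer's constant (w.r.t. the 2-norm) is the
# spectral norm of its weight matrix.  Exact evaluation for the whole
# network is NP-hard (Virmaux & Scaman, 2018); this bound is cheap but
# generally not tight.

def spectral_norm(W, n_iters=100):
    """Largest singular value of W, estimated by power iteration."""
    v = np.random.default_rng(0).normal(size=W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def lipschitz_upper_bound(weights):
    """Product of per-layer spectral norms for an MLP with 1-Lipschitz
    activations (e.g. tanh)."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound

rng = np.random.default_rng(1)
weights = [rng.normal(size=(64, 32)), rng.normal(size=(32, 16))]
print(lipschitz_upper_bound(weights))
```

Spectral normalisation (Miyato et al., 2018), mentioned above, enforces a Lipschitz constraint by dividing each weight matrix by exactly such a power-iteration estimate of its spectral norm.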

