LIPSFORMER: INTRODUCING LIPSCHITZ CONTINUITY TO VISION TRANSFORMERS

Abstract

We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability through learning rate warmup, layer normalization, attention reformulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains a top-1 accuracy of 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.

1. INTRODUCTION

Transformer [49] has been widely adopted in natural language processing (NLP) [6, 27, 40] for its great capability of capturing long-range dependencies with self-attention. Motivated by its success in NLP, Dosovitskiy et al. [17] introduced the Vision Transformer (ViT) as a general backbone for computer vision tasks such as image classification [35, 53, 16], object detection [9, 59], and segmentation [12]. Today, Transformer [49] remains the dominant architecture for NLP [5, 6, 40], computer vision [58, 35, 53, 16], and many other AI applications [42, 41, 31].

Despite its success, training Transformers remains challenging [33, 14] for practitioners: the training process can be prohibitively unstable, especially at the beginning of training. To address the root cause of training instability, we examine the Lipschitz continuity of Transformer components. Intuitively, a Lipschitz continuous network has a finite rate of change, and its Lipschitz constant is a useful indicator of training stability. As shown in [8, 7, 44], Lipschitz properties reveal intriguing behaviors of neural networks, such as robustness and generalization. In this work, we focus on the trainability of Transformer architectures by explicitly enforcing Lipschitz continuity at network initialization.

Previous works on overcoming Transformer training instability usually focus on one or a combination of its components, and can be divided into four categories: (1) improving normalization [54, 33, 51]; (2) designing more stable attention [28, 13]; (3) re-weighting residual shortcuts [3]; and (4) careful weight initialization [60].

In this paper, we conduct a thorough analysis of Transformer architectures and propose a Lipschitz continuous Transformer called LipsFormer. In contrast to previous practical tricks that address training instability, we show that Lipschitz continuity is a more essential property to ensure training stability. We focus our investigation on the following Transformer components: LayerNorm, dot-product self-attention, the residual shortcut, and weight initialization.
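The role of the Lipschitz constant as a stability indicator can be made concrete: the Lipschitz constant of a composition of layers is upper-bounded by the product of the per-layer constants, so per-layer constants even slightly above 1 compound exponentially with depth. A minimal NumPy illustration with linear layers (whose 2-norm Lipschitz constant is the largest singular value); the depth, width, and initialization scale below are arbitrary choices for the demo, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 24, 16

# Random linear layers; each layer's Lipschitz constant under the 2-norm
# is its spectral norm (largest singular value).
layers = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
per_layer = [np.linalg.svd(w, compute_uv=False)[0] for w in layers]

# The product of per-layer constants upper-bounds the composed network's
# Lipschitz constant -- and grows exponentially when each factor exceeds 1.
bound = float(np.prod(per_layer))

# Empirical check: the distance ratio for one input pair stays below the bound.
x1, x2 = rng.standard_normal(dim), rng.standard_normal(dim)
y1, y2 = x1, x2
for w in layers:
    y1, y2 = w.T @ y1, w.T @ y2
ratio = float(np.linalg.norm(y1 - y2) / np.linalg.norm(x1 - x2))
assert ratio <= bound
```

This is why instability worsens with depth, and why bounding the constant at initialization, as LipsFormer does, matters most for deep stacks.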
For each analyzed module, we propose a Lipschitz continuous variant as a new building block of LipsFormer. The final LipsFormer network has an upper-bounded Lipschitz constant at the initial stages of training. Such a Lipschitz guarantee has two implications: 1) we can train LipsFormer without the common trick of learning rate warmup, yielding faster convergence and better generalization; 2) Transformer is more unstable at the beginning of training, so by ensuring initial network stability we drastically increase the trainability of Transformer. Note that we could also enforce Lipschitz continuity during the whole training process by simply constraining updates of certain scaling parameters.

Our main contributions can be summarized as follows:

• We give a thorough analysis of key Transformer components: LayerNorm, self-attention, the residual shortcut, and weight initialization. More importantly, we identify potential instability problems each module brings to training and propose their Lipschitz continuous counterparts: CenterNorm, scaled cosine similarity attention, a scaled residual shortcut, and spectral-based initialization. The proposed Lipschitz continuous modules can serve as drop-in replacements in standard Transformers such as Swin Transformer [35] and CSwin [16].

• We propose a Lipschitz continuous Transformer (LipsFormer) that can be trained stably without carefully tuning the learning rate schedule. We derive theoretical upper bounds on the Lipschitz constants of both the scaled cosine similarity attention and the full LipsFormer; the derivation provides principled guidance for designing LipsFormer networks. We build LipsFormer-Swin and LipsFormer-CSwin based on Swin Transformer and CSwin, respectively.

• We validate the efficacy of LipsFormer on ImageNet classification. We show empirically that LipsFormer can be trained smoothly without learning rate warmup.
As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny trained for 300 epochs obtains a top-1 accuracy of 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny trained for 300 epochs achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.
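The exact formulas of the four proposed modules are given later in the paper; the following is only a minimal NumPy sketch of the four ideas. The temperature `tau`, the shortcut weight `alpha`, and the target spectral norm are assumed illustrative values, and any additional fixed rescaling the paper applies (e.g., inside CenterNorm) is omitted here:

```python
import numpy as np

def center_norm(x, gamma, beta):
    """Sketch of a CenterNorm-style operation: LayerNorm without the variance
    division. Dividing by the standard deviation makes LayerNorm's Lipschitz
    constant unbounded near constant inputs; subtracting the mean alone keeps
    the map affine and hence Lipschitz."""
    return gamma * (x - x.mean(axis=-1, keepdims=True)) + beta

def cosine_attention(q, k, v, tau=10.0, eps=1e-6):
    """Sketch of scaled cosine similarity attention: queries and keys are
    L2-normalized so every similarity is a cosine in [-1, 1], bounded before
    the softmax; `tau` plays the role of a temperature (learnable in the
    paper, fixed here)."""
    qn = q / np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), eps)
    kn = k / np.maximum(np.linalg.norm(k, axis=-1, keepdims=True), eps)
    logits = tau * (qn @ kn.T)                    # entries bounded in [-tau, tau]
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def weighted_shortcut(x, f, alpha=0.1):
    """Sketch of a scaled residual shortcut: a small alpha keeps each block
    close to the identity, so its Lipschitz constant stays near 1 early in
    training."""
    return x + alpha * f(x)

def spectral_init(w, target=1.0):
    """Sketch of spectral initialization: rescale a standard initialization so
    the largest singular value equals `target`, bounding the Lipschitz
    constant of the corresponding linear map."""
    return target * w / np.linalg.svd(w, compute_uv=False)[0]
```

Note that `cosine_attention` is invariant to the magnitude of its queries and keys, which is exactly what removes the unbounded growth of dot-product logits.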

Xiong et al. [54] have shown that, for a Transformer architecture, Pre-LayerNorm (Pre-LN) is more stable than Post-LayerNorm (Post-LN). Liu et al. [33] identified that Post-LN negatively influences training stability by amplifying parameter perturbations, and introduced adaptive model initialization (Admin) to mitigate the amplification effect; likewise, Wang et al. [51] introduced DeepNorm and a depth-specific initialization to stabilize Post-LN. However, even with normalization improvements such as Admin and DeepNorm, learning rate warmup [20] is still necessary to stabilize training. For attention, several works seek more stable formulations [28, 13]; in particular, Kim et al. [28] proved that the standard dot-product attention is not Lipschitz continuous and introduced an alternative L2 attention. For residual shortcuts, Bachlechner et al. [3] showed that a simple architectural change, gating each residual shortcut with a learnable zero-initialized parameter, substantially stabilizes training; with this ReZero scheme, they were able to train extremely deep Transformers of 120 layers. For weight initialization, Zhang et al. [60] proposed fixed-update initialization (Fixup), which rescales a standard initialization to avoid exploding or vanishing gradients at the beginning of training, and proved that Fixup enables stable training of residual networks without normalization.

2. PRELIMINARIES

In this section, we first define Lipschitz continuity and the Lipschitz constant, and then discuss several Lipschitz properties of a neural network. We use the denominator-layout notation throughout this paper. A sequence of N elements is denoted as X = [x_1; . . . ; x_N]^⊤ ∈ R^{N×D}, where each vector x_i ∈ R^D, i ∈ {1, . . . , N}. A function transformation is parameterized by an associated weight matrix W; for example, an affine transformation is denoted as f(x) = W^⊤x, where W ∈ R^{D×M}.

Definition 1. A function f(x, W): R^D → R^M is Lipschitz continuous (L-Lipschitz) under a choice of p-norm ∥·∥_p in the variable x if there exists a constant L such that for all (x_1, W) and (x_2, W) in the domain of f,

∥f(x_1, W) − f(x_2, W)∥_p ≤ L ∥x_1 − x_2∥_p,

where the smallest value of L that satisfies the inequality is called the Lipschitz constant of f. To emphasize that the Lipschitz constant with respect to x depends on W and the choice of p, we denote L as Lip_p(f_x(W)). A function is generally referred to as expansive, non-expansive, or contractive in the variable x for Lip_p(f_x(W)) > 1, Lip_p(f_x(W)) ≤ 1, and Lip_p(f_x(W)) < 1, respectively.
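Definition 1 can be checked numerically for the affine map f(x) = W^⊤x, whose Lipschitz constant under the 2-norm is the largest singular value (spectral norm) of W. A small NumPy check over random input pairs (the dimensions and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 8, 4
W = rng.standard_normal((D, M))

# Lip_2 of f(x) = W^T x equals the spectral norm of W.
lip = float(np.linalg.svd(W, compute_uv=False)[0])

# The defining inequality holds for every sampled pair: the distance ratio
# never exceeds the Lipschitz constant.
ratios = []
for _ in range(2000):
    x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
    ratios.append(np.linalg.norm(W.T @ (x1 - x2)) / np.linalg.norm(x1 - x2))
assert max(ratios) <= lip + 1e-9

# Rescaling W by 1/lip yields a non-expansive map (Lip_2 <= 1) in the sense
# defined above.
W_ne = W / lip
assert np.linalg.svd(W_ne, compute_uv=False)[0] <= 1 + 1e-9
```

The same reasoning underlies spectral-based initialization: dividing a weight by its largest singular value directly controls the Lipschitz constant of the corresponding linear layer.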

