LIPSFORMER: INTRODUCING LIPSCHITZ CONTINUITY TO VISION TRANSFORMERS

Abstract

We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability via learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property for ensuring training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer enables stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains a top-1 accuracy of 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.
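To make the component swaps concrete, below is a minimal NumPy sketch of two of the replacements named above: a centering-only CenterNorm and a scaled cosine similarity attention. The `d / (d - 1)` scaling factor and the fixed temperature `tau` are illustrative assumptions; the paper's exact formulations (including learnable scales) may differ.

```python
import numpy as np

def center_norm(x, gamma=1.0, beta=0.0):
    # CenterNorm: keep LayerNorm's centering but drop the division by
    # the standard deviation, whose Jacobian blows up as the variance
    # of x approaches zero. Centering is linear, hence Lipschitz.
    # The d / (d - 1) factor (an assumption here) compensates for the
    # degree of freedom removed by mean subtraction.
    d = x.shape[-1]
    mu = x.mean(axis=-1, keepdims=True)
    return gamma * (d / (d - 1)) * (x - mu) + beta

def scaled_cosine_attention(q, k, v, tau=15.0, eps=1e-6):
    # Scaled cosine similarity attention: l2-normalize queries and keys
    # so each attention logit is bounded in [-tau, tau]; unnormalized
    # dot-product logits have no such bound. A fixed tau is used here
    # purely for illustration.
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = tau * (qn @ kn.T)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because CenterNorm omits the variance division, its output is still zero-mean per token, but its scale follows the input, keeping the map globally Lipschitz.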

1. INTRODUCTION

Transformer [49] has been widely adopted in natural language processing (NLP) [6, 27, 40] for its great capability of capturing long-range dependencies with self-attention. Motivated by its success in NLP, Dosovitskiy et al. [17] introduced the Vision Transformer (ViT) as a general backbone for computer vision tasks such as image classification [35, 53, 16], object detection [9, 59], and segmentation [12]. Nowadays, Transformer [49] remains the dominant architecture for NLP [5, 6, 40], computer vision [58, 35, 53, 16], and many other AI applications [42, 41, 31]. Despite its success, training Transformers remains challenging [33, 14] for practitioners: the training process can be prohibitively unstable, especially at the beginning of training.

To address the root cause of training instability, we examine the Lipschitz continuity of Transformer components. Intuitively, a Lipschitz continuous network has a bounded rate of change, and its Lipschitz constant is a useful indicator of training stability. As shown in [8, 7, 44], Lipschitz properties reveal intriguing behaviors of neural networks, such as robustness and generalization. In this work, we focus on the trainability of Transformer architectures by explicitly enforcing Lipschitz continuity at network initialization.

Previous works on overcoming Transformer training instability usually focus on one or a combination of its components, which can be divided into four categories: (1) improving normalization [54, 33, 51]. Xiong et al. [54] showed that, for a Transformer architecture, Pre-LayerNorm (Pre-LN) is more stable than Post-LayerNorm (Post-LN). Liu et al. [33] identified that Post-LN negatively influences training stability by amplifying parameter perturbations, and introduced adaptive model initialization (Admin) to mitigate the amplification effect. Likewise, Wang et al. [51] introduced DeepNorm and a depth-specific initialization to stabilize Post-LN. However, even with normalization improvements
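The Pre-LN versus Post-LN distinction discussed above reduces to where normalization sits relative to the residual connection. A schematic sketch in plain NumPy, with a parameter-free LayerNorm and a fixed linear map standing in for the real sub-layers:

```python
import numpy as np

def norm(x, eps=1e-6):
    # Parameter-free LayerNorm over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def sublayer(x, W):
    # Stand-in for an attention or feed-forward sub-layer.
    return x @ W

def post_ln_block(x, W):
    # Post-LN (original Transformer): normalization wraps the residual
    # sum, so the identity path is rescaled at every block -- the
    # amplification effect Liu et al. [33] associate with instability.
    return norm(x + sublayer(x, W))

def pre_ln_block(x, W):
    # Pre-LN: only the sub-layer input is normalized; the identity
    # path passes through untouched, which Xiong et al. [54] found to
    # be more stable in deep stacks.
    return x + sublayer(norm(x), W)
```

Stacking many `pre_ln_block`s leaves the signal on an unnormalized identity path, which is why Pre-LN tolerates larger learning rates without warmup.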

