LIPSFORMER: INTRODUCING LIPSCHITZ CONTINUITY TO VISION TRANSFORMERS

Abstract

We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability through learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property for ensuring training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without careful learning rate tuning such as warmup, yielding faster convergence and better generalization. On the ImageNet-1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains a top-1 accuracy of 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters.

1. INTRODUCTION

Transformer [49] has been widely adopted in natural language processing (NLP) [6, 27, 40] for its great capability of capturing long-range dependencies with self-attention. Motivated by its success in NLP, Dosovitskiy et al. [17] introduced Vision Transformer (ViT) as a general backbone for computer vision tasks such as image classification [35, 53, 16], object detection [9, 59], and segmentation [12]. Nowadays, Transformer [49] remains the dominant architecture for NLP [5, 6, 40], computer vision [58, 35, 53, 16], and many other AI applications [42, 41, 31]. Despite its success, training Transformer remains challenging [33, 14] for practitioners: the training process can be prohibitively unstable, especially at the beginning of training.

To address the root cause of training instability, we examine the Lipschitz continuity of Transformer components. Intuitively, a Lipschitz continuous network has a bounded rate of change, and its Lipschitz constant is a useful indicator of training stability. As shown in [8, 7, 44], Lipschitz properties reveal intriguing behaviours of neural networks, such as robustness and generalization. In this work, we focus on the trainability issue of Transformer architectures by explicitly enforcing Lipschitz continuity at network initialization.

Previous works on overcoming Transformer training instability usually focus on one or a combination of its components, and can be divided into four categories.

(1) Improving normalization [54, 33, 51]. Xiong et al. [54] have shown that, for a Transformer architecture, Pre-LayerNorm (Pre-LN) is more stable than Post-LayerNorm (Post-LN). Liu et al. [33] identified that Post-LN negatively influences training stability by amplifying parameter perturbations. They introduced adaptive model initialization (Admin) to mitigate the amplification effect. Likewise, Wang et al. [51] introduced DeepNorm and a depth-specific initialization to stabilize Post-LN.
However, even with normalization improvements such as Admin and DeepNorm, learning rate warmup [20] is still necessary to stabilize training.

(2) More stable attention [28, 13]. Kim et al. [28] proved that the standard dot-product attention is not Lipschitz continuous and introduced an alternative L2 attention.

(3) Re-weighted residual shortcut. Bachlechner et al. [3] showed that a simple architecture change of gating each residual shortcut with a learnable zero-initialized parameter substantially stabilizes training. With ReZero, they were able to train extremely deep Transformers of 120 layers.

(4) Careful weight initialization. To avoid exploding or vanishing gradients at the beginning of training, Zhang et al. [60] proposed fixed-update initialization (Fixup) by rescaling a standard initialization. They also proved that Fixup could enable stable training of residual networks without normalization.

In this paper, we conduct a thorough analysis of Transformer architectures and propose a Lipschitz continuous Transformer called LipsFormer. In contrast to previous practical tricks that address training instability, we show that Lipschitz continuity is a more essential property for ensuring training stability. We focus our investigation on the following Transformer components: LayerNorm, dot-product self-attention, residual shortcut, and weight initialization. For each analyzed module, we propose a Lipschitz continuous variant as a new building block for LipsFormer. The final LipsFormer network has an upper-bounded Lipschitz constant at the initial stages of training. Such a Lipschitz guarantee has two implications: 1) we can train LipsFormer without using the common trick of learning rate warmup, yielding faster convergence and better generalization; 2) Transformer is more unstable at the beginning of training, so by ensuring initial network stability, we drastically increase the trainability of Transformer.
Note that we could also enforce Lipschitz continuity during the whole training process by simply constraining updates on certain scaling parameters.

Our main contributions can be summarized as follows:

• We give a thorough analysis of key Transformer components: LayerNorm, self-attention, residual shortcut, and weight initialization. More importantly, we identify the potential instability problems each module brings to training and propose their Lipschitz continuous counterparts: CenterNorm, scaled cosine similarity attention, scaled residual shortcut, and spectral-based initialization. The proposed Lipschitz continuous modules can serve as drop-in replacements for a standard Transformer, such as Swin Transformer [35] and CSwin [16].

• We propose a Lipschitz continuous Transformer (LipsFormer) that can be stably trained without the need to carefully tune the learning rate schedule. We derive theoretical Lipschitz constant upper bounds for both scaled cosine similarity attention and LipsFormer.

2. PRELIMINARIES

In this section, we first define Lipschitz continuity and the Lipschitz constant and then discuss several Lipschitz properties of a neural network. We use the denominator-layout notation throughout this paper. A sequence of $N$ elements is denoted as $X = [x_1; \ldots; x_N]^\top \in \mathbb{R}^{N \times D}$, where each vector $x_i \in \mathbb{R}^D$, $i \in \{1, \ldots, N\}$. A function transformation is parameterized by an associated weight matrix $W$, and an affine transformation is denoted as $f(x) = W^\top x$, where $W \in \mathbb{R}^{D \times M}$.

Definition 1. A function $f(x, W): \mathbb{R}^D \to \mathbb{R}^M$ is Lipschitz continuous ($L$-Lipschitz) under a choice of $p$-norm $\|\cdot\|_p$ in the variable $x$ if there exists a constant $L$ such that for all $(x_1, W)$ and $(x_2, W)$ in the domain of $f$,

$$\|f(x_1, W) - f(x_2, W)\|_p \le L \|x_1 - x_2\|_p,$$

where the smallest value of $L$ that satisfies the inequality is called the Lipschitz constant of $f$. To emphasize that the Lipschitz constant with respect to $x$ depends on $W$ and the choice of $p$, we denote $L$ as $\mathrm{Lip}_p(f_x(W))$. A function is generally referred to as expansive, non-expansive, and contractive in the variable $x$ for $\mathrm{Lip}_p(f_x(W)) > 1$, $\mathrm{Lip}_p(f_x(W)) \le 1$, and $\mathrm{Lip}_p(f_x(W)) < 1$, respectively, exhibiting characteristic differences in the change rate of its output.

Contemporary neural networks are rarely Lipschitz continuous, because a single non-Lipschitz constituent module is enough to break the property. Even if a network is Lipschitz continuous, calculating its Lipschitz constant exactly is a challenging task [50]. According to Definition 1, the Lipschitz constant of $f(x, W)$ with respect to $x$ can be expressed as

$$\mathrm{Lip}_p(f_x(W)) = \sup_{x_1 \ne x_2 \in \mathbb{R}^D} \frac{\|f(x_1, W) - f(x_2, W)\|_p}{\|x_1 - x_2\|_p}.$$

Exact computation of the above equation is an NP-hard problem. For subsequent analyses, we use $p = 2$ by default unless specified and suppress $p$ to reduce clutter, but our conclusions can be easily extended to other choices of $p$.

Lemma 1.
Given $W$, let $f(x, W): \mathbb{R}^D \to \mathbb{R}$ be a continuously differentiable function and $\mathrm{Lip}(f_x(W))$ be its Lipschitz constant with respect to $x$. According to the mean value theorem, we have the following inequality,

$$\|\nabla_x f(x, W)\| \le \mathrm{Lip}(f_x(W)), \quad \forall x \in \mathbb{R}^D,$$

where $\|\nabla_x f(x, W)\|$ is the gradient norm of $f(x, W)$ with respect to $x$.

From Lemma 1, we can see that a practical method to compute the Lipschitz constant of a continuously differentiable function is to compute its maximum gradient norm. To prove that a function is not Lipschitz, it is sufficient to show that its gradient norm is unbounded. For example, $f(x) = \frac{1}{x}$ and $f(x) = x^2$ are not Lipschitz continuous for $x \in (0, \infty)$, because their gradients can be arbitrarily large as $x$ approaches $0$ and $\infty$, respectively.

Definition 2. Let $F(x, \{W_l, l = 1, \ldots, L\}): \mathbb{R}^D \to \mathbb{R}$ be an $L$-layer neural network defined as a composite function with $L$ transformation functions:

$$F(x, \{W_l, l = 1, \ldots, L\}) = f^L\big(f^{L-1}\big(\ldots f^1(x, W_1), W_2 \ldots\big), W_L\big),$$

where $\{W_l, l = 1, \ldots, L\}$ is the parameter set, and $f^l$ is the transformation function of the $l$-th layer.

For an affine transformation $f(x, W) = W^\top x$, its Lipschitz constant is

$$\mathrm{Lip}_p(f_x(W)) = \sup_{\|x\|_p = 1} \|W^\top x\|_p = \begin{cases} \sigma_{\max}(W), & \text{if } p = 2 \\ \max_i \sum_j |W_{ij}|, & \text{if } p = \infty \end{cases}$$

where $\sigma_{\max}(W)$ is the largest singular value of $W$. Many common activation functions such as Sigmoid, Tanh, ReLU, and GELU are 1-Lipschitz or close to it (Appendix A.1 reports a constant of about 1.0998 for GELU). See Appendix A.1 for a simple illustration.

Lemma 2. Given the Lipschitz constant of each transformation function in a network $F$, the following inequality holds

$$\mathrm{Lip}(F_x(\{W_l, l = 1, \ldots, L\})) \le \prod_{l=1}^{L} \mathrm{Lip}(f^l_x(W_l)).$$

From Lemma 2, the Lipschitz constant of a network is upper bounded by the product of each layer's Lipschitz constant.
This multiplicative nature gives us an insight into understanding why deeper networks usually suffer more severe training instability: if a network's constituent layers are expansive, the upper bound of its Lipschitz constant increases monotonically with its network depth. We refer the interested readers to [18, 30] for estimating tighter bounds of deep neural networks.
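Lemma 2 can be checked numerically on a toy network. The following is a minimal sketch, not part of the original method: the three-layer ReLU network, the layer widths, the random weights, and the helper names (`spectral_norm`, `network`) are assumptions chosen for illustration. It estimates each layer's largest singular value by power iteration, multiplies the estimates, and compares the product with the largest output-to-input change ratio observed over random input pairs.

```python
import math
import random

rng = random.Random(0)

def norm2(v):
    return math.sqrt(sum(c * c for c in v))

def apply_layer(W, x):
    """Affine map W^T x for W of shape D x M, stored as D rows of length M."""
    return [sum(W[d][m] * x[d] for d in range(len(W))) for m in range(len(W[0]))]

def spectral_norm(W, iters=1000):
    """Estimate sigma_max(W) by power iteration on W W^T."""
    D, M = len(W), len(W[0])
    x = [rng.gauss(0, 1) for _ in range(D)]
    for _ in range(iters):
        u = apply_layer(W, x)                                          # W^T x
        x = [sum(W[d][m] * u[m] for m in range(M)) for d in range(D)]  # W u
        n = norm2(x)
        x = [c / n for c in x]
    return norm2(apply_layer(W, x))

def network(weights, x):
    """Linear layers with ReLU in between (ReLU is 1-Lipschitz)."""
    for i, W in enumerate(weights):
        x = apply_layer(W, x)
        if i < len(weights) - 1:
            x = [max(0.0, c) for c in x]
    return x

sizes = [6, 8, 8, 4]  # hypothetical layer widths
weights = [[[rng.gauss(0, 1) for _ in range(m)] for _ in range(d)]
           for d, m in zip(sizes[:-1], sizes[1:])]

# Lemma 2: the product of per-layer constants upper-bounds the network's constant.
bound = math.prod(spectral_norm(W) for W in weights)

# Largest change ratio seen over random input pairs (a lower estimate of Lip(F)).
empirical = 0.0
for _ in range(2000):
    x1 = [rng.gauss(0, 1) for _ in range(sizes[0])]
    x2 = [rng.gauss(0, 1) for _ in range(sizes[0])]
    diff_out = [a - b for a, b in zip(network(weights, x1), network(weights, x2))]
    diff_in = [a - b for a, b in zip(x1, x2)]
    empirical = max(empirical, norm2(diff_out) / norm2(diff_in))

print(f"empirical ratio {empirical:.3f} <= product bound {bound:.3f}")
```

The sampled ratio is only a lower estimate of the true Lipschitz constant, so a large gap to the product bound is expected; the sketch illustrates the direction of the inequality, not its tightness.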

3. AN ASSUMPTION FOR TRAINING STABILITY

Our design philosophy for LipsFormer is based on the following assumption.

Assumption 1. A network should satisfy the following Lipschitz conditions for training stability,

1. $\|f(x_1, W) - f(x_2, W)\| \le \mathrm{Lip}(f_x(W)) \|x_1 - x_2\|$,
2. $\|f(x, W_1) - f(x, W_2)\| \le \mathrm{Lip}(f_W(x)) \|W_1 - W_2\|$.

The first inequality focuses on the forward process and assumes that a stable network should satisfy Lipschitz continuity with respect to its input $x$: a small perturbation of its input should not lead to a drastic change of its output. Guaranteeing smoothness is vital for guarding a network's generalization ability.

For the second inequality, recall that the forward process of a typical neural network propagates computation as $x^{l+1} = (W^{l+1})^\top x^l$, where $x^l$ and $W^{l+1}$ are the input and weight matrix of Layer $l+1$. Since common non-linearities are 1-Lipschitz, we drop non-linear activations here for simplicity. To backpropagate the network loss $\mathcal{L}$, we have

$$\frac{\partial \mathcal{L}}{\partial x^l} = W^{l+1} \frac{\partial \mathcal{L}}{\partial x^{l+1}}, \qquad \frac{\partial \mathcal{L}}{\partial W^{l+1}} = x^l \left(\frac{\partial \mathcal{L}}{\partial x^{l+1}}\right)^\top.$$

Gradient descent updates network weights according to $W \leftarrow W - \mathrm{lr} \times \frac{\partial \mathcal{L}}{\partial W}$. As demonstrated above, any value explosion will propagate through the chain rule: if $\frac{\partial \mathcal{L}}{\partial x^{l+1}}$ is unbounded, $\frac{\partial \mathcal{L}}{\partial x^l}$ and $\frac{\partial \mathcal{L}}{\partial W^{l+1}}$ will consequently be unbounded. Meanwhile, if $\frac{\partial \mathcal{L}}{\partial W^{l+1}}$ is not bounded, it will largely influence the back-propagation chain in the next iteration. This justifies the second inequality for the purpose of training stability.

Intuitively, guaranteeing that a network's output does not change too much under small perturbations of either its input or network weights induces a more stable training process. In this work, we focus on satisfying the first inequality in Assumption 1 for Transformer architectures.
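The two backpropagation identities above can be verified on a single linear layer. This is a small sketch under stated assumptions: the toy loss $0.5\|W^\top x\|^2$, the 3-by-2 weight shape, and the function names are illustrative choices, not taken from the paper. It compares the analytic weight gradient with a finite-difference estimate and shows that scaling the upstream gradient scales both downstream gradients by the same factor, which is how an explosion propagates.

```python
import random

rng = random.Random(0)

def forward(W, x):
    """x_next = W^T x for W of shape D x M (D rows of length M)."""
    return [sum(W[d][m] * x[d] for d in range(len(W))) for m in range(len(W[0]))]

def loss(W, x):
    """Toy loss 0.5 * ||W^T x||^2, chosen only so that dL/dx_next = x_next."""
    return 0.5 * sum(v * v for v in forward(W, x))

def backward(W, x, g_next):
    """Given g_next = dL/dx_next, return dL/dx = W g_next and dL/dW = x g_next^T."""
    D, M = len(W), len(W[0])
    g_x = [sum(W[d][m] * g_next[m] for m in range(M)) for d in range(D)]
    g_W = [[x[d] * g_next[m] for m in range(M)] for d in range(D)]
    return g_x, g_W

D, M = 3, 2  # hypothetical layer shape
W = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(D)]
x = [rng.gauss(0, 1) for _ in range(D)]

g_next = forward(W, x)  # upstream gradient of the toy loss
g_x, g_W = backward(W, x, g_next)

# Central finite difference on one weight entry as a sanity check.
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus[0][1] -= eps
fd = (loss(W_plus, x) - loss(W_minus, x)) / (2 * eps)
print("analytic:", g_W[0][1], "finite difference:", fd)

# A large upstream gradient makes both downstream gradients equally large.
big_x, big_W = backward(W, x, [1e6 * g for g in g_next])
```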

4. LIPSFORMER

A Lipschitz continuous Transformer (LipsFormer) requires all of its constituent modules to be Lipschitz continuous according to Lemma 2. In this section, we analyze key Transformer components and introduce their Lipschitz continuous counterparts when any Lipschitz continuity is violated.

4.1.1. CENTERNORM INSTEAD OF LAYERNORM

LayerNorm [2] is the most widely used normalization method in Transformer. It is defined as

$$\mathrm{LN}(x) = \gamma \odot z + \beta, \quad \text{where } z = \frac{y}{\mathrm{Std}(y)} \text{ and } y = \left(I - \frac{1}{D}\mathbf{1}\mathbf{1}^\top\right)x,$$

where $x, y \in \mathbb{R}^D$, $\mathrm{Std}(y)$ is the standard deviation of the mean-subtracted input $y$, and $\odot$ is an element-wise product. $\gamma$ and $\beta$ are initialized to 1 and 0, respectively. For simplicity, we drop $\gamma$ and $\beta$ from the analysis because they can be explicitly constrained within any pre-defined range. By taking partial derivatives, the Jacobian matrix of $z$ with respect to $x$ is

$$J_z(x) = \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x} = \frac{1}{\mathrm{Std}(y)}\left(I - \frac{1}{D}\mathbf{1}\mathbf{1}^\top\right)\left(I - \frac{yy^\top}{\|y\|_2^2}\right).$$

The equation above shows that LayerNorm is not Lipschitz continuous: when $\mathrm{Std}(y)$ approaches 0, the values in the Jacobian matrix approach $\infty$, causing severe training instability. On the other end, when $\mathrm{Std}(y)$ is very large, training will be hindered by LayerNorm as gradients become extremely small. Also note that backpropagating through LayerNorm is slow due to poor parallelization when computing the Jacobian matrix, especially for the term $I - \frac{yy^\top}{\|y\|_2^2}$.

In practice, we notice that a single LayerNorm operation could cause severe training instability without learning rate warmup. The underlying reason is that LayerNorm is not Lipschitz continuous, and an ill-defined input with zero variance will lead to a Jacobian matrix filled with infinite entries. To stabilize training by enforcing Lipschitz continuity, we introduce CenterNorm as

$$\mathrm{CN}(x) = \gamma \odot \frac{D}{D-1}\left(I - \frac{1}{D}\mathbf{1}\mathbf{1}^\top\right)x + \beta,$$

where $D$ is the dimension of $x$. The Jacobian matrix $\frac{\partial\, \mathrm{CN}(x)}{\partial x}$ contains a term $\frac{D}{D-1}\left(I - \frac{1}{D}\mathbf{1}\mathbf{1}^\top\right)$, where $\frac{D}{D-1}$ is a heuristic to avoid the eigenvalue contraction from $I - \frac{1}{D}\mathbf{1}\mathbf{1}^\top$. It is easy to verify that

$$\|\mathrm{CN}(x_1) - \mathrm{CN}(x_2)\| \le \mathrm{Lip}(\mathrm{CN}_x)\|x_1 - x_2\|,$$

where $\mathrm{Lip}(\mathrm{CN}_x) = \frac{D}{D-1}$ for $\gamma = 1$ and $\beta = 0$. As most deep neural networks deal with high-dimensional data with $D \gg 1$, we make the simplification that CenterNorm is 1-Lipschitz for later discussions. CenterNorm is by design Lipschitz continuous at initialization. To guarantee its Lipschitz continuity throughout training, we could simply constrain $\gamma$ and $\beta$ to a pre-defined range.
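The contrast between the two normalizations can be made concrete with a small numerical sketch. It drops $\gamma$ and $\beta$ as in the analysis above; the four-dimensional inputs, the spread `delta`, and the helper names are assumptions made for illustration. Two nearly constant inputs that differ by a tiny amount are pushed far apart by LayerNorm, while CenterNorm changes their distance by exactly the factor $\frac{D}{D-1}$.

```python
import math

def norm2(v):
    return math.sqrt(sum(c * c for c in v))

def layer_norm(x):
    """LayerNorm without gamma/beta: center, then divide by the standard deviation."""
    D = len(x)
    mean = sum(x) / D
    y = [c - mean for c in x]
    std = math.sqrt(sum(c * c for c in y) / D)
    return [c / std for c in y]

def center_norm(x):
    """CenterNorm without gamma/beta: center, then scale by D / (D - 1)."""
    D = len(x)
    mean = sum(x) / D
    return [D / (D - 1) * (c - mean) for c in x]

def change_ratio(f, x1, x2):
    """||f(x1) - f(x2)|| / ||x1 - x2||, a lower estimate of the Lipschitz constant."""
    out = [a - b for a, b in zip(f(x1), f(x2))]
    inp = [a - b for a, b in zip(x1, x2)]
    return norm2(out) / norm2(inp)

delta = 1e-6  # hypothetical tiny spread around a constant vector
x1 = [1.0, 1.0 + delta, 1.0 - delta, 1.0]
x2 = [1.0, 1.0 - delta, 1.0 + delta, 1.0]

ln_ratio = change_ratio(layer_norm, x1, x2)
cn_ratio = change_ratio(center_norm, x1, x2)
print(f"LayerNorm ratio:  {ln_ratio:.3e}")
print(f"CenterNorm ratio: {cn_ratio:.6f}")
```

Shrinking `delta` further makes the LayerNorm ratio grow without bound, whereas the CenterNorm ratio stays fixed because the map is linear.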

4.1.2. SCALED COSINE SIMILARITY ATTENTION

Self-attention [49] is a key component of Transformer, helping capture long-range relationships within data. In practice, people use multi-head attention to effectively capture such relationships under different contexts. Since multi-head attention is a linear combination of multiple single-head attention outputs, for simplicity, we focus our analysis on single-head attention, which is defined as

$$\mathrm{Attn}(X, W^Q, W^K, W^V) = \mathrm{softmax}\left(\frac{XW^Q (XW^K)^\top}{\sqrt{D}}\right) XW^V,$$

where $W^Q, W^K, W^V$ are the projection matrices that transform $X$ into query, key, and value matrices, respectively. Intuitively, every token aggregates information from all the visible tokens by computing a weighted sum of the values of the visible tokens according to the similarity between its query and each visible token's key. The similarity between the $i$-th query $q_i$ and $j$-th key $k_j$ is denoted as $P_{ij} \propto x_i^\top W^Q (W^K)^\top x_j$.

In [28], Kim et al. proved that the standard dot-product self-attention is not Lipschitz continuous and introduced an alternative L2 self-attention that is Lipschitz continuous. Here we use a scaled cosine similarity attention, which is defined as

$$\mathrm{SCSA}(X, W^Q, W^K, W^V, \nu, \tau) = \nu P V, \quad \text{where } P = \mathrm{softmax}\left(\tau Q K^\top\right),$$

$$Q = \begin{bmatrix} - q_1^\top - \\ \vdots \\ - q_N^\top - \end{bmatrix}, \quad K = \begin{bmatrix} - k_1^\top - \\ \vdots \\ - k_N^\top - \end{bmatrix}, \quad V = \begin{bmatrix} - v_1^\top - \\ \vdots \\ - v_N^\top - \end{bmatrix},$$

where $\nu$ and $\tau$ are predefined or learnable scalars, and $Q, K, V$ are $\ell_2$ row-normalized:

$$q_i, k_i, v_i = \frac{(x_i^\top W^Q)^\top}{\sqrt{\|x_i^\top W^Q\|^2 + \epsilon}}, \; \frac{(x_i^\top W^K)^\top}{\sqrt{\|x_i^\top W^K\|^2 + \epsilon}}, \; \frac{(x_i^\top W^V)^\top}{\sqrt{\|x_i^\top W^V\|^2 + \epsilon}};$$

$\epsilon$ is a smoothing factor that guarantees the validity of the cosine similarity computation even when $\|x_i^\top W^Q\| = 0$. For an arbitrary pair of rows of $Q$ and $K$, denoted as $q_i$ and $k_j$, the cosine similarity of their $\ell_2$-normalized vectors is proportional to their dot product. The upper bound of SCSA's Lipschitz constant with respect to $\|\cdot\|_2$ and $\|\cdot\|_\infty$ is given by the following theorem.

Theorem 1.
Single-head scaled cosine similarity attention is Lipschitz continuous, and its $\mathrm{Lip}_\infty$ and $\mathrm{Lip}_2$ are upper bounded by the following inequalities,

$$\mathrm{Lip}(\mathrm{SCSA})_\infty \le N^2\sqrt{D}\,\nu\tau\epsilon^{-\frac{1}{2}}\|W^K\|_\infty + N\sqrt{D}\,\nu\tau\epsilon^{-\frac{1}{2}}\|W^Q\|_\infty + 2N\nu\epsilon^{-\frac{1}{2}}\|W^{V\top}\|_\infty,$$

$$\mathrm{Lip}(\mathrm{SCSA})_2 \le 2N(N-1)\nu\tau\epsilon^{-\frac{1}{2}}\|W^K\|_2 + 2(N-1)\nu\tau\epsilon^{-\frac{1}{2}}\|W^Q\|_2 + 2N\nu\epsilon^{-\frac{1}{2}}\|W^{V\top}\|_2.$$

The proof of Theorem 1 can be found in Appendix H. For multi-head attention, we heuristically scale the head feature concatenation by $\frac{1}{K}$, where $K$ is the number of heads. Please refer to Appendix A.2 for more details.
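A single-head forward pass of the attention defined above can be sketched in a few lines. This is an illustrative reading of the definition, not the authors' implementation: the values of `nu`, `tau`, and `eps`, the sequence and feature sizes, and the helper names are assumptions, and multi-head concatenation with its $\frac{1}{K}$ scaling is left out. Because every row of $V$ has norm below 1 and every row of $P$ is a probability vector, each output row has norm at most $\nu$, and an all-zero input stays finite thanks to $\epsilon$.

```python
import math
import random

rng = random.Random(0)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_normalize(A, eps):
    """Divide each row by sqrt(||row||^2 + eps)."""
    out = []
    for row in A:
        n = math.sqrt(sum(v * v for v in row) + eps)
        out.append([v / n for v in row])
    return out

def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def scsa(X, WQ, WK, WV, nu=1.0, tau=12.0, eps=1e-6):
    """nu * softmax(tau * Q K^T) V with row-normalized Q, K, V (nu, tau, eps are assumed values)."""
    Q = row_normalize(matmul(X, WQ), eps)
    K = row_normalize(matmul(X, WK), eps)
    V = row_normalize(matmul(X, WV), eps)
    K_T = [list(col) for col in zip(*K)]
    scores = [[tau * s for s in row] for row in matmul(Q, K_T)]  # each entry lies in [-tau, tau]
    P = softmax_rows(scores)
    return [[nu * v for v in row] for row in matmul(P, V)]

N, D = 5, 4  # hypothetical sequence length and feature size
X = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
WQ, WK, WV = ([[rng.gauss(0, 1) for _ in range(D)] for _ in range(D)] for _ in range(3))

out = scsa(X, WQ, WK, WV)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in out]

# A degenerate all-zero input still gives a finite output.
out_zero = scsa([[0.0] * D for _ in range(N)], WQ, WK, WV)
print("max output row norm:", max(row_norms))
```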

4.1.3. WEIGHTED RESIDUAL SHORTCUT

Residual block [24] is a common component of contemporary neural networks [35, 53, 49]. It has been proven effective in avoiding gradient vanishing, especially when training deep networks. A standard residual shortcut block is defined as

$$\mathrm{RS}(x, W) = x + f(x, W).$$

The Lipschitz constant of a residual shortcut block with respect to $x$ is $\mathrm{Lip}(\mathrm{RS}_x(W)) = 1 + \mathrm{Lip}(f_x(W))$. For any non-degenerate Lipschitz continuous function $f(x, W)$, its Lipschitz constant is greater than 0, hence a residual block is strictly expansive. According to Lemma 2, stacking $L$ identical residual blocks alone will grow the upper bound of a network's Lipschitz constant exponentially to $\mathrm{Lip}(\mathrm{RS}_x(W))^L$, causing substantial vulnerability to forward value explosion. One way to mitigate such an instability is to constrain the Lipschitz constant of the residual path to be much smaller than 1, especially at the beginning of training when the network is undergoing fast changes via learning. In this paper, we explicitly multiply the residual path by a scale factor initialized to a small value such as 0.1 or 0.2. We define the weighted residual shortcut as

$$\mathrm{WRS}(x, W) = x + \alpha \odot f(x, W),$$

where $\alpha$ is a learnable parameter vector with the same dimension as the channel size of $x$. It is easy to verify that

$$\|\mathrm{WRS}(x_1, W) - \mathrm{WRS}(x_2, W)\| \le \mathrm{Lip}(\mathrm{WRS}_x(W))\|x_1 - x_2\|,$$

where $\mathrm{Lip}(\mathrm{WRS}_x(W)) = 1 + \max(\alpha)$ when $\mathrm{Lip}(f_x(W)) = 1$. As training progresses, $\alpha$ changes as part of the learning process. We could easily constrain $\alpha$ to a pre-defined range to ensure the Lipschitz continuity of a network during the whole training process. Note that re-weighting the shortcut and residual path has been explored before: in [51, 33], the authors redefine a residual block as $\alpha \odot x + f(x, W)$ to alleviate the LayerNorm instability; ReZero [3] uses a formulation similar to Equation 4 to speed up convergence, where $\alpha$ is a scalar instead of a vector.
Our formulation is motivated by decreasing the Lipschitz constant of a network, instead of being a practical trick. It provides more principled guidance for network design. For example, when training a very deep network, a smaller $\alpha$ would be justified for the purpose of training stabilization.

Note that a careful initialization is important for successfully training a neural network. Many initialization methods have been proposed before, such as Xavier [19] and Kaiming [23] initialization. Inspired by spectral norm regularization [56], we introduce a 1-Lipschitz initialization called spectral initialization,

$$W_{si} = \frac{W}{\sigma_{\max}(W)},$$

where $W$ is Xavier-norm initialized and $\sigma_{\max}(W)$ is its largest singular value. For the affine transformation $f(x, W_{si}) = W_{si}^\top x$, its Lipschitz constant satisfies the following inequality,

$$\|W_{si}^\top x_1 - W_{si}^\top x_2\| \le \mathrm{Lip}(f_x(W_{si}))\|x_1 - x_2\|,$$

where $\mathrm{Lip}(f_x(W_{si})) = 1$ at initialization. We use spectral initialization on all convolutions and feed-forward connections.

TABLE 1: Various forms of residual blocks for Transformer architectures.

Post-Norm     $x_{i+1} = \mathrm{LayerNorm}(x_i + f(x_i))$
Pre-Norm      $x_{i+1} = x_i + f(\mathrm{LayerNorm}(x_i))$
LipsFormer    $x_{i+1} = \mathrm{CenterNorm}(x_i + \mathrm{DropPath}_{p_i}(\alpha_i f(x_i)))$

As illustrated in Figure 1, $f$ represents a transformation function $\in$ {self-attention, feed-forward}. For LipsFormer, $f \in$ {scaled cosine similarity attention, feed-forward, convolution blocks}. In Table 1 we compare the LipsFormer residual block with the commonly used Post-Norm and Pre-Norm residual blocks. CenterNorm and scaled cosine similarity attention are Lipschitz continuous counterparts of LayerNorm and dot-product attention. Weighted residual connection and DropPath [29] are used to constrain the Lipschitz constant of a deep LipsFormer network.
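Spectral initialization and the weighted residual shortcut can be sketched together. The code below is an illustration under stated assumptions rather than the paper's implementation: "Xavier-norm" is read here as a Gaussian draw with standard deviation $\sqrt{2/(D+M)}$, the largest singular value is estimated by power iteration, the residual branch is a single affine map, and the depth of 24 blocks is hypothetical. CenterNorm and DropPath from Table 1 are left out to keep the sketch short.

```python
import math
import random

rng = random.Random(0)

def norm2(v):
    return math.sqrt(sum(c * c for c in v))

def apply_layer(W, x):
    """Affine map W^T x for W of shape D x M, stored as D rows of length M."""
    return [sum(W[d][m] * x[d] for d in range(len(W))) for m in range(len(W[0]))]

def spectral_norm(W, iters=1000):
    """Estimate sigma_max(W) by power iteration from a fixed start vector."""
    D, M = len(W), len(W[0])
    x = [1.0 / math.sqrt(D)] * D
    for _ in range(iters):
        u = apply_layer(W, x)
        x = [sum(W[d][m] * u[m] for m in range(M)) for d in range(D)]
        n = norm2(x)
        x = [c / n for c in x]
    return norm2(apply_layer(W, x))

def spectral_init(D, M):
    """Xavier-style Gaussian draw (assumed reading), rescaled by its largest singular value."""
    std = math.sqrt(2.0 / (D + M))
    W = [[rng.gauss(0, std) for _ in range(M)] for _ in range(D)]
    sigma = spectral_norm(W)
    return [[w / sigma for w in row] for row in W]

def weighted_residual(x, W, alpha):
    """WRS(x) = x + alpha * f(x) with f(x) = W^T x and an element-wise alpha."""
    return [xi + a * fi for xi, a, fi in zip(x, alpha, apply_layer(W, x))]

D = 8
W_si = spectral_init(D, D)
alpha = [0.1] * D  # small initial residual weight, as suggested in the text

# Largest change ratio over random pairs; it should not exceed 1 + max(alpha).
worst = 0.0
for _ in range(2000):
    x1 = [rng.gauss(0, 1) for _ in range(D)]
    x2 = [rng.gauss(0, 1) for _ in range(D)]
    out = [a - b for a, b in zip(weighted_residual(x1, W_si, alpha),
                                 weighted_residual(x2, W_si, alpha))]
    worst = max(worst, norm2(out) / norm2([a - b for a, b in zip(x1, x2)]))

depth = 24  # hypothetical number of stacked residual blocks
bound_plain = (1 + 1.0) ** depth            # unweighted blocks with Lip(f) = 1
bound_weighted = (1 + max(alpha)) ** depth  # weighted blocks with alpha = 0.1
print(f"sigma_max(W_si) ~ {spectral_norm(W_si):.6f}, worst ratio {worst:.4f}")
print(f"depth-{depth} bounds: plain {bound_plain:.0f}, weighted {bound_weighted:.2f}")
```

With these assumed numbers, the product bound for 24 unweighted blocks is $2^{24}$, while the weighted blocks stay below 10, which is the effect the small initial $\alpha$ is meant to achieve.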

4.2.2. OVERALL ARCHITECTURE OF LIPSFORMER

In general, LipsFormer follows the architecture of Swin Transformer v1. We start by processing an input image with non-overlapped convolutional token embedding (4 × 4 convolution with stride 4) to obtain a feature representation with resolution $\frac{H}{4} \times \frac{W}{4}$. Then the main computation passes through four stages, where each stage consists of a pre-defined number of LipsFormer blocks as shown in Figure 1. Between consecutive stages, we reduce the output resolution by 2 and double the number of output channels with a 2 × 2 non-overlapped convolution with stride 2. We build three variants of LipsFormer in correspondence with CSwin Transformer [16], as detailed in Appendix Table 4: LipsFormer-CSwin-Tiny (LipsFormer-CSwin-T), LipsFormer-CSwin-Small (LipsFormer-CSwin-S), and LipsFormer-CSwin-Base (LipsFormer-CSwin-B). The numbers of LipsFormer blocks within the four computation stages are [1, 2, 21, 1] for LipsFormer-CSwin-T and [2, 4, 32, 2] for LipsFormer-CSwin-S and LipsFormer-CSwin-B. The overall architecture of LipsFormer is illustrated in Figure 3 of Appendix B. We can also build LipsFormer on Swin Transformer; more experiments on LipsFormer-Swin can be found in the Appendix.

4.2.3. LIPSCHITZ CONSTANT OF LIPSFORMER

As illustrated in Figure 3, LipsFormer includes four computation stages, each starting with patch merging followed by a pre-defined number of LipsFormer blocks. Feed-forward connection, convolution, and patch merging are Lipschitz continuous operators. With spectral initialization, these affine transformations are 1-Lipschitz at the beginning of training, hence they are dropped from the analysis. For the Lipschitz constant of LipsFormer, we have the following theorem.

Theorem 2. For a LipsFormer with $S$ stages where the $s$-th stage has $M_s$ residual blocks, when $\alpha$ is set to

Proof of Theorem 2 is in Appendix I. Theorem 2 suggests that 1) deeper networks with more residual blocks should be initialized with a smaller $\alpha$ to avoid exponential growth of their Lipschitz constant; 2) to control the Lipschitz constant of LipsFormer, we should focus on constraining the Lipschitz constant $\mathrm{Lip}(f_i)$ of each constituent layer, especially the one with the largest Lipschitz constant.

5.1. DATASET AND TRAINING SETUP

We evaluate LipsFormer-CSwin on the standard ImageNet-1K [15] dataset, which consists of 1.28M images and 1,000 classes. We adopt a training strategy similar to that of CSwin Transformer [16] for a fair comparison. Specifically, we use the AdamW [38] optimizer with weight decay 0.05 for LipsFormer-CSwin-T/S and 0.1 for LipsFormer-CSwin-B. By default, all our models are trained for 300 epochs with an input image size of 224 × 224. For LipsFormer-CSwin, the training batch size is 2048 and the initial learning rate is 0.002 with a standard cosine learning rate decay [37] without learning rate warmup [37]. We apply stochastic depth [26] for LipsFormer-CSwin-T, LipsFormer-CSwin-S, and LipsFormer-CSwin-B, with a maximum DropPath rate of 0.2, 0.4, and 0.5, respectively. For the ablation study, we train each model for 100 epochs for efficiency. See Appendix C for more details.

Compared with previous state-of-the-art Vision Transformer models, LipsFormer-CSwin attains a higher classification accuracy on all its model variants. For instance, LipsFormer-CSwin-T obtains an 83.5% top-1 accuracy, outperforming CSwin-T by 0.8%, ViT-S by 2.1%, and NAT-T by 0.3%. LipsFormer-CSwin-T also outperforms recently improved CNN architectures, such as ConvNeXt-T and EffNet-B4, by 1.4% and 0.6%, respectively. LipsFormer-CSwin-T has fewer parameters than NAT-T and ConvNeXt-T. LipsFormer-CSwin-B also outperforms its counterparts, including Swin-B, CrossFormer-L, ConvNeXt-B, DeiT-B, ViT-B, and NAT-B, with fewer parameters. Also note that all the other Transformer models use learning rate warmup, but LipsFormer-CSwin does not.
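The warmup-free schedule from the training setup can be written down directly. The sketch below assumes a per-epoch cosine decay from the stated initial learning rate of 0.002 over 300 epochs down to a final value of 0; the final value, the per-epoch granularity, and the function name `cosine_lr` are assumptions, since the text does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=300, base_lr=2e-3, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs, with no warmup phase."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The schedule starts at the full learning rate instead of ramping up from a small value.
for epoch in (0, 75, 150, 225, 300):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

A schedule with warmup would instead start near zero and ramp up over the first few epochs; the ablation in Section 5.3 compares against 5 such epochs.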

5.3. ABLATION STUDY

We conduct an extensive ablation study on each key component of LipsFormer-CSwin, as shown in Table 3. We use LipsFormer-CSwin-T for the ablation study, and all models in this comparison are trained for 100 epochs without learning rate warmup, except for the ablation on warmup.

Warmup.

In previous experiments we do not use learning rate warmup when training LipsFormer-CSwin. Theoretically, warmup is not needed given LipsFormer's stability guarantee. According to the results in Table 3, 5 epochs of warmup do not bring further improvement.

CenterNorm. We compare CenterNorm against no-Norm (as in ReZero [3]) and the standard LayerNorm. Results show that:

Spectral Initialization. We compare LipsFormer-CSwin results with spectral initialization against truncated normal and Xavier initialization. We find that LipsFormer-CSwin converges with any of the three initializations. Spectral initialization and Xavier initialization slightly outperform truncated normal initialization, but spectral initialization has a better Lipschitz interpretability than Xavier initialization.

Scaled Cosine Similarity Attention. To validate the effectiveness of scaled cosine similarity attention, we compare it with the standard dot-product attention and the L2 distance attention [28]. We find that the standard dot-product self-attention leads to forward value explosion, whereas the scaled cosine similarity attention works well under its Lipschitz guarantee. Meanwhile, SCSA works better than the L2 distance attention.

Impact of the Residual Weight α. As detailed in 4.1.3, the weight of the residual path α has a substantial influence on the upper bound of LipsFormer's Lipschitz constant. We evaluate different choices of α and find that with a large initial value of α, the network either fails to converge or diverges quickly. This validates that deeper networks need a smaller α.

Convolution Blocks. In LipsFormer-CSwin, we use two depth-wise convolutions (dwc) and one pointwise convolution (pwc). We evaluate four different convolution configurations: A) no convolution; B) one dwc; C) dwc + pwc; and D) dwc + pwc + dwc.
Table 3 shows that one dwc increases LipsFormer's accuracy by 0.5% and one dwc + one pwc further improves its performance by 0.4%, while adding more convolutions saturates the performance gains.

DropPath Ratio. In Appendix J, we show that DropPath effectively decreases the upper bound of a network's Lipschitz constant, making the training process more stable. The results in Table 3 show that a reasonable DropPath ratio can effectively improve training performance.

To summarize, CenterNorm, scaled cosine similarity attention, and convolution blocks all contribute positively to LipsFormer-CSwin's superior performance. Weighted residual shortcut with small α, a reasonable DropPath ratio p, and spectral initialization are effective in stabilizing LipsFormer-CSwin by constraining its Lipschitz constant.

6. CONCLUSION

In this paper, we present a Lipschitz continuous Transformer, called LipsFormer, to pursue a more stable training process by enforcing the Lipschitz continuity of the whole network. We analyze key components of Transformer and replace the ones violating Lipschitz continuity by introducing CenterNorm, scaled cosine similarity attention, and spectral initialization. LipsFormer also uses a weighted residual shortcut and DropPath to further decrease the upper bound of its Lipschitz constant. Finally, we derive an upper bound of the Lipschitz constant of a LipsFormer network architecture. We empirically validate the effectiveness of LipsFormer-Swin and LipsFormer-CSwin, based on Swin Transformer and CSwin respectively, on ImageNet-1K classification with state-of-the-art performance for model variants of different parameter sizes. The analysis of the Lipschitz continuity of a network is general. We look forward to extending it to a broader class of models and application areas, including multi-modal models and natural language processing. We also hope future works will discuss the Lipschitz continuity of LipsFormer in the backward process in depth.

A APPENDIX

A.1 LIPSCHITZ CONSTANT OF COMMON ACTIVATION FUNCTIONS

In Figure 2 we plot common non-linear activation functions in neural networks: Sigmoid, Tanh, ReLU, and GELU. According to [25], GELU can be approximated by $\mathrm{GELU}(x) \approx x\,\mathrm{sigmoid}(1.702x)$. According to Lemma 1, the Lipschitz constants of Sigmoid, Tanh, ReLU, and GELU are $\frac{1}{4}$, 1, 1, and 1.0998, respectively.
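These constants can be reproduced by following Lemma 1 and maximizing the absolute derivative of each activation over a grid. The sketch below is illustrative: the grid range of $[-10, 10]$, the step of 0.001, and the dictionary names are assumptions, and GELU is represented by the sigmoid approximation quoted above rather than its exact form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Analytic derivatives of each activation.
derivatives = {
    "sigmoid": lambda x: sigmoid(x) * (1.0 - sigmoid(x)),
    "tanh": lambda x: 1.0 - math.tanh(x) ** 2,
    "relu": lambda x: 1.0 if x > 0 else 0.0,
    # Derivative of the approximation x * sigmoid(1.702 x).
    "gelu_approx": lambda x: sigmoid(1.702 * x)
    + 1.702 * x * sigmoid(1.702 * x) * (1.0 - sigmoid(1.702 * x)),
}

# Lemma 1: the Lipschitz constant is the maximum gradient norm; here, a grid search.
grid = [i / 1000.0 for i in range(-10000, 10001)]
lipschitz = {name: max(abs(d(x)) for x in grid) for name, d in derivatives.items()}

for name, value in lipschitz.items():
    print(f"{name}: {value:.4f}")
```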

A.2 MULTI-HEAD ATTENTION

For a $K$-head attention, the $i$-th attention head, $i \in \{1, \ldots, K\}$, is defined as

$$h_i(x, W_i) = \mathrm{Attn}_i(X, W^Q_i, W^K_i, W^V_i),$$

where $W_i$ is the set of projection weight matrices $(W^Q_i, W^K_i, W^V_i)$. Multi-head attention simply concatenates the different attention results,

$$h(x, W) = [h_1(x, W_1); h_2(x, W_2); \ldots; h_K(x, W_K)].$$

According to the Lipschitz definition, we have

$$\|h(x_1, W) - h(x_2, W)\| \le \big(\mathrm{Lip}(h_1(W_1)) + \mathrm{Lip}(h_2(W_2)) + \ldots + \mathrm{Lip}(h_K(W_K))\big)\|x_1 - x_2\|.$$

B NETWORK ARCHITECTURE AND CONFIGURATIONS

The overall architecture of LipsFormer-CSwin is shown in Figure 3. For patch embedding and patch merging, we use non-overlapped convolution as in Swin Transformer. Following CSwin Transformer, we use the same cross-shaped window when computing attention results and also the same Locally enhanced Positional Encoding (LePE). The configurations of LipsFormer-CSwin are based on CSwin Transformer, and Table 4 summarizes three variants of LipsFormer-CSwin. LipsFormer-CSwin-T and LipsFormer-CSwin-S vary only in the number of LipsFormer-CSwin blocks. LipsFormer-CSwin-S and LipsFormer-CSwin-B share the same depth configuration but vary in hidden layer channel size. Similar to LipsFormer-CSwin, we also build LipsFormer based on Swin Transformer [35], which we term LipsFormer-Swin. We create five versions of LipsFormer-Swin, and the detailed configurations are shown in Table 5.

D EXPERIMENTS OF LIPSFORMER-SWIN

We evaluate the Tiny, Small, Base, and Large versions of LipsFormer-Swin on ImageNet-1K and compare our results with their Swin Transformer counterparts. The results are shown in Table 7. We have the following two findings from Table 7:

• The proposed LipsFormer-Swin consistently outperforms its counterpart Swin Transformer. Specifically, LipsFormer-Swin-T improves on Swin-T by 1.5%.

• LipsFormer-Swin-L shows obvious overfitting on ImageNet-1K and performs worse than LipsFormer-Swin-B. According to our observation of the training process, the training loss of LipsFormer-Swin-L (around 2.2) is much smaller than that of LipsFormer-Swin-B (around 2.5), but its test accuracy is lower. We also observe that, in some GitHub discussion issues, users report that the original Swin-L cannot outperform Swin-B when trained only on ImageNet-1K.

Since LipsFormer-Swin-L has shown overfitting on ImageNet-1K, we do not report the performance of LipsFormer-Swin-L++ in the table. In the future, we will train it on a larger scale of data to test its fitting ability. On a single A100-40GB GPU, with a batch size fixed to 256 and a mixed precision,

G PARAMETER VARIATIONS ALONG WITH TRAINING EPOCHS

In Figure 5, we show how α varies over the training epochs. Our statistics are based on the LipsFormer-Swin-T model. We compute the mean and standard deviation of the absolute value of α, selecting one set of α from each stage. From Figure 5, we find that the mean absolute value of α first grows and then tends to stabilize.

H PROOF OF THEOREM 1

In this subsection, we derive the Lipschitz constant upper bound for the scaled cosine similarity attention (SCSA). First, we list some useful notations and identities for deriving the Jacobians of the attention computation. The input is

$$X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D}.$$

For column vectors $u, z \in \mathbb{R}^N$, the product rule gives

$$\frac{\partial}{\partial x}\, u^\top z = u^\top \frac{\partial z}{\partial x} + z^\top \frac{\partial u}{\partial x}.$$

The standard dot-product attention is defined as

$$\mathrm{DP}(X, W^Q, W^K, W^V) := \mathrm{softmax}\!\left(\frac{X W^Q (X W^K)^\top}{\sqrt{D/H}}\right) X W^V = P X W^V.$$

In [28], Kim et al. proved that the standard dot-product attention is not Lipschitz continuous, and proposed an L2-distance attention that is Lipschitz continuous conditioned on $W^Q = W^K$. But enforcing the equality of $W^Q$ and $W^K$ limits the expressiveness of the Transformer and empirically degrades training performance. Our scaled cosine similarity attention is defined as

$$\mathrm{SCSA}(X, W^Q, W^K, W^V, \nu, \tau) = \nu P V, \qquad P = \mathrm{softmax}\!\left(\tau Q K^\top\right), \tag{6}$$

where $\nu$ and $\tau$ are predefined or learnable scalars. The matrices $Q$, $K$, $V$ are defined as

$$Q = \begin{bmatrix} q_1^\top \\ \vdots \\ q_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D}, \quad K = \begin{bmatrix} k_1^\top \\ \vdots \\ k_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D}, \quad V = \begin{bmatrix} v_1^\top \\ \vdots \\ v_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D}.$$

For each input $x_i$, the projected $q_i$, $k_i$, $v_i$ are defined as

$$q_i = \frac{(x_i^\top W^Q)^\top}{\sqrt{\|x_i^\top W^Q\|^2 + \epsilon}}, \quad k_j = \frac{(x_j^\top W^K)^\top}{\sqrt{\|x_j^\top W^K\|^2 + \epsilon}}, \quad v_j = \frac{(x_j^\top W^V)^\top}{\sqrt{\|x_j^\top W^V\|^2 + \epsilon}},$$

where $\epsilon$ is a small smoothing factor that guarantees the cosine similarity is well defined everywhere. Taking partial derivatives, we obtain the following Jacobian matrices:

$$\tilde{Q}_i = \frac{\partial q_i}{\partial x_i} = \frac{1}{\sqrt{\|x_i^\top W^Q\|^2 + \epsilon}} \left(I - \frac{W^{Q\top} x_i x_i^\top W^Q}{\|x_i^\top W^Q\|^2 + \epsilon}\right) W^{Q\top}, \tag{7}$$

$$\tilde{K}_j = \frac{\partial k_j}{\partial x_j} = \frac{1}{\sqrt{\|x_j^\top W^K\|^2 + \epsilon}} \left(I - \frac{W^{K\top} x_j x_j^\top W^K}{\|x_j^\top W^K\|^2 + \epsilon}\right) W^{K\top}, \tag{8}$$

$$\tilde{V}_j = \frac{\partial v_j}{\partial x_j} = \frac{1}{\sqrt{\|x_j^\top W^V\|^2 + \epsilon}} \left(I - \frac{W^{V\top} x_j x_j^\top W^V}{\|x_j^\top W^V\|^2 + \epsilon}\right) W^{V\top}. \tag{9}$$

The attention matrix $P$ is defined as

$$P := \mathrm{softmax}\begin{bmatrix} \tau q_1^\top k_1 & \tau q_1^\top k_2 & \cdots & \tau q_1^\top k_N \\ \tau q_2^\top k_1 & \tau q_2^\top k_2 & \cdots & \tau q_2^\top k_N \\ \vdots & \vdots & \ddots & \vdots \\ \tau q_N^\top k_1 & \tau q_N^\top k_2 & \cdots & \tau q_N^\top k_N \end{bmatrix}.$$

We can rewrite our SCSA attention in Eq. 6 as

$$f(X) = \nu P V = \nu\, \mathrm{softmax}\!\left(\tau Q K^\top\right) V = \begin{bmatrix} f_1(X)^\top \\ \vdots \\ f_N(X)^\top \end{bmatrix} \in \mathbb{R}^{N \times D}.$$

For simplicity we focus on the derivation for single-head attention; multi-head attention requires only minor modifications to concatenate the attention results of each head, as discussed in A.2. The Jacobian matrix of SCSA can be written as

$$J_f = \begin{bmatrix} J_{11} & \cdots & J_{1N} \\ \vdots & \ddots & \vdots \\ J_{N1} & \cdots & J_{NN} \end{bmatrix} \in \mathbb{R}^{ND \times ND}, \qquad J_{ij} = \frac{\partial f_i(X)}{\partial x_j} \in \mathbb{R}^{D \times D}.$$

By taking partial derivatives we can show that

$$J_{ij} = \nu\tau V^\top P^{(i)} \left[ E_{ji} Q \tilde{K}_j^\top + \delta_{ij} K \tilde{Q}_j^\top \right] + \nu P_{ij} \tilde{V}_j,$$

where $E_{ij} \in \mathbb{R}^{N \times N}$ is a binary matrix with zeros everywhere except the $(i, j)$-th entry, $\delta_{ij}$ is the Kronecker delta, and $P^{(i)}$ is the well-known Jacobian of the softmax,

$$P^{(i)} := \mathrm{diag}(P_{i:}) - P_{i:}^\top P_{i:} = \begin{bmatrix} P_{i1}(1 - P_{i1}) & -P_{i1}P_{i2} & \cdots & -P_{i1}P_{iN} \\ -P_{i2}P_{i1} & P_{i2}(1 - P_{i2}) & \cdots & -P_{i2}P_{iN} \\ \vdots & \vdots & \ddots & \vdots \\ -P_{iN}P_{i1} & -P_{iN}P_{i2} & \cdots & P_{iN}(1 - P_{iN}) \end{bmatrix}. \tag{12}$$

When $i = j$, we have

$$J_{ii} = \nu\tau V^\top P^{(i)} \left[ E_{ii} Q \tilde{K}_i^\top + K \tilde{Q}_i^\top \right] + \nu P_{ii} \tilde{V}_i.$$

When $i \neq j$, we have

$$J_{ij} = \nu\tau V^\top P^{(i)} E_{ji} Q \tilde{K}_j^\top + \nu P_{ij} \tilde{V}_j.$$

Theorem 1. The scaled cosine similarity attention (SCSA) is Lipschitz continuous if and only if $W^Q$, $W^K$, $W^V$ have bounded norm.

Sketch of proof. Our key observation is that most of the terms in $J_{ii}$ and $J_{ij}$ have bounded norm: $\nu$ and $\tau$ are scalars; the rows of $Q$, $K$, $V$ are normalized, so all of their elements have absolute value at most 1; $E_{ij}$ has zeros everywhere except the $(i, j)$-th entry; $P$ is an attention matrix with all elements in $[0, 1]$, so all elements of $P^{(i)}$ lie in $[-0.25, 0.25]$. Taking a closer look at $\tilde{Q}_i$, $\tilde{K}_i$, $\tilde{V}_i$ in Eq. 7, Eq. 8 and Eq. 9, they are bounded as long as $W^Q$, $W^K$, $W^V$ are bounded. Consequently, $J_{ii}$ and $J_{ij}$ have bounded norm if $W^Q$, $W^K$, $W^V$ have bounded norm.
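To make the SCSA definition above concrete, the following is a minimal single-head NumPy sketch of the forward pass $\nu\,\mathrm{softmax}(\tau Q K^\top) V$. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize_rows`, `softmax_rows`, `scsa`), the default values of `nu`, `tau`, `eps`, and the random test shapes are all hypothetical.

```python
import numpy as np

def l2_normalize_rows(M, eps):
    # Row-wise m / sqrt(||m||^2 + eps); eps keeps the division valid for zero rows.
    return M / np.sqrt(np.sum(M * M, axis=1, keepdims=True) + eps)

def softmax_rows(S):
    # Subtract the row maximum before exponentiating for numerical stability.
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def scsa(X, W_q, W_k, W_v, nu=1.0, tau=1.0, eps=1e-6):
    # Single-head scaled cosine similarity attention (illustrative sketch).
    Q = l2_normalize_rows(X @ W_q, eps)
    K = l2_normalize_rows(X @ W_k, eps)
    V = l2_normalize_rows(X @ W_v, eps)
    P = softmax_rows(tau * (Q @ K.T))  # rows of P sum to 1
    return nu * (P @ V), P

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
out, P = scsa(X, W_q, W_k, W_v, nu=2.0, tau=10.0)
out_zero, _ = scsa(np.zeros((N, D)), W_q, W_k, W_v)
```

Because every row of $V$ has norm below 1 and every row of $P$ is a convex combination, each output row has norm at most $\nu$; the all-zero input shows why the smoothing factor $\epsilon$ is needed for the normalization to stay finite.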

H.1 UPPER BOUND ON LIP∞ FOR SCSA

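Let us review some basic definitions for matrix norms. Suppose we have matrices $A \in \mathbb{R}^{N \times D}$ and $B \in \mathbb{R}^{N \times D}$. Then we have

$$\|A\|_\infty = \max_{1 \le i \le N} \sum_{j=1}^{D} |A_{ij}|, \qquad \|A\|_2 = \sqrt{\lambda_{\max}(A^* A)} = \sigma_{\max}(A).$$

We also have the following inequalities:

$$\|AB^\top\| \le \|A\|\,\|B^\top\|, \qquad \|A + B\| \le \|A\| + \|B\|, \qquad \|[A_1, \ldots, A_N]\| \le \sum_i \|A_i\|,$$

$$\|A\|_2 = \sigma_{\max}(A) \le \|A\|_F = \Bigl(\sum_{i=1}^{N} \sum_{j=1}^{D} |A_{ij}|^2\Bigr)^{\frac{1}{2}} = \Bigl(\sum_{k=1}^{\min(N,D)} \sigma_k^2\Bigr)^{\frac{1}{2}}, \tag{15}$$

where $\|\cdot\|_F$ is the Frobenius norm. Equality holds if and only if $A$ is a rank-one matrix or a zero matrix. According to the above inequalities, we have

$$\begin{aligned} \|[J_{i1}, \ldots, J_{iN}]\|_\infty &\le \|J_{ii}\|_\infty + \sum_{j \ne i} \|J_{ij}\|_\infty \\ &\le \nu\tau \|V^\top\|_\infty \|P^{(i)}\|_\infty \bigl( \|E_{ii}\|_\infty \|Q\|_\infty \|\tilde{K}_i^\top\|_\infty + \|K\|_\infty \|\tilde{Q}_i^\top\|_\infty \bigr) + \nu \|P_{ii}\|_\infty \|\tilde{V}_i\|_\infty \\ &\quad + \sum_{j \ne i} \bigl( \nu\tau \|V^\top\|_\infty \|P^{(i)}\|_\infty \|E_{ji}\|_\infty \|Q\|_\infty \|\tilde{K}_j^\top\|_\infty + \nu \|P_{ij}\|_\infty \|\tilde{V}_j\|_\infty \bigr). \end{aligned} \tag{16}$$

We can compute the $L_2$-norm Lipschitz constant by replacing the $L_\infty$ norm in the above equation with the $L_2$ norm. With simple derivations we list $\|\cdot\|_\infty$ for each term in Eq. 16:

$$\|V^\top\|_\infty = \max_{1 \le i \le D} \sum_{j=1}^{N} |(V^\top)_{ij}| \le N,$$

$$\|P^{(i)}\|_\infty = \max_{1 \le k \le N} \sum_{j=1}^{N} |P^{(i)}_{kj}| = \max_{1 \le k \le N} 2\,(P_{ik} - P_{ik}^2) \le \frac{1}{2},$$

$$\|E_{ii}\|_\infty = 1, \qquad \|Q\|_\infty = \max_{1 \le i \le N} \sum_{j=1}^{D} |Q_{ij}| \le \sqrt{D},$$

$$\|\tilde{K}_j^\top\|_\infty \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^K\|_\infty. \tag{17}$$

Proof of Equation 17.

$$\begin{aligned} \|\tilde{K}_j^\top\|_\infty &= \left\| \left[ \frac{1}{\sqrt{\|x_j^\top W^K\|^2 + \epsilon}} \left(I - \frac{W^{K\top} x_j x_j^\top W^K}{\|x_j^\top W^K\|^2 + \epsilon}\right) W^{K\top} \right]^\top \right\|_\infty \\ &\le \epsilon^{-\frac{1}{2}}\, \|W^K\|_\infty \left\| I - \frac{W^{K\top} x_j x_j^\top W^K}{\|x_j^\top W^K\|^2 + \epsilon} \right\|_\infty \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^K\|_\infty. \end{aligned}$$

The remaining terms are bounded as follows:

$$\|\tilde{K}_i^\top\|_\infty = \|\tilde{K}_j^\top\|_\infty \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^K\|_\infty, \qquad \|K\|_\infty \le \sqrt{D},$$

$$\|\tilde{Q}_i^\top\|_\infty \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^Q\|_\infty \quad \text{(similar to Equation 17)},$$

$$\|P_{ii}\|_\infty \le 1, \qquad \|P_{ij}\|_\infty \le 1, \qquad \|E_{ji}\|_\infty = 1,$$

$$\|\tilde{V}_j\|_\infty \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^{V\top}\|_\infty \quad \text{(similar to Equation 17)}.$$

According to Eq. 16, the $\mathrm{Lip}_\infty$ constant of the scaled cosine similarity attention (SCSA) satisfies

$$\begin{aligned} \mathrm{Lip}(\mathrm{SCSA})_\infty &\le \nu \tau N \cdot \tfrac{1}{2} \bigl( 1 \cdot \sqrt{D} \cdot 2\epsilon^{-\frac{1}{2}} \|W^K\|_\infty + \sqrt{D} \cdot 2\epsilon^{-\frac{1}{2}} \|W^Q\|_\infty \bigr) + \nu \cdot 1 \cdot 2\epsilon^{-\frac{1}{2}} \|W^{V\top}\|_\infty \\ &\quad + (N - 1) \bigl( \nu \tau N \cdot \tfrac{1}{2} \cdot 1 \cdot \sqrt{D} \cdot 2\epsilon^{-\frac{1}{2}} \|W^K\|_\infty + \nu \cdot 1 \cdot 2\epsilon^{-\frac{1}{2}} \|W^{V\top}\|_\infty \bigr). \end{aligned}$$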
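After merging and rearranging the terms,

$$\begin{aligned} \mathrm{Lip}(\mathrm{SCSA})_\infty &\le \nu\tau N \sqrt{D}\, \epsilon^{-\frac{1}{2}} \bigl( \|W^K\|_\infty + \|W^Q\|_\infty \bigr) + 2\nu \epsilon^{-\frac{1}{2}} \|W^{V\top}\|_\infty + (N - 1) \bigl( \nu\tau N \sqrt{D}\, \epsilon^{-\frac{1}{2}} \|W^K\|_\infty + 2\nu \epsilon^{-\frac{1}{2}} \|W^{V\top}\|_\infty \bigr) \\ &= N^2 \sqrt{D}\, \nu\tau \epsilon^{-\frac{1}{2}} \|W^K\|_\infty + N \sqrt{D}\, \nu\tau \epsilon^{-\frac{1}{2}} \|W^Q\|_\infty + 2N \nu \epsilon^{-\frac{1}{2}} \|W^{V\top}\|_\infty. \end{aligned}$$

H.2 UPPER BOUND ON LIP₂ FOR SCSA

Correspondingly, we list $\|\cdot\|_2$ for each term in Eq. 16:

$$\|V^\top\|_2 \le \Bigl( \sum_{i=1}^{N} \sum_{j=1}^{D} |V_{ij}|^2 \Bigr)^{\frac{1}{2}} = \Bigl( \sum_{j=1}^{N} 1 \Bigr)^{\frac{1}{2}} = \sqrt{N},$$

$$\|P^{(i)}\|_2 \le \frac{N - 1}{N}. \tag{18}$$

Proof of Equation 18. According to Eq. 12, $P^{(i)}$ is a positive semi-definite matrix, thus its ordered eigenvalues satisfy $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_N \ge 0$, and

$$\sum_{i=1}^{N} \lambda_i = \mathrm{tr}(P^{(i)}) = \sum_{j=1}^{N} P^{(i)}_{jj} \le \sum_{j=1}^{N} \frac{1}{N} \cdot \frac{N - 1}{N} = \frac{N - 1}{N}.$$

According to Eq. 15,

$$\|P^{(i)}\|_2 \le \Bigl( \sum_{i=1}^{N} \lambda_i^2 \Bigr)^{\frac{1}{2}} \le \Bigl( \sum_{i=1}^{N} \lambda_i \Bigr)^{2 \times \frac{1}{2}} \le \frac{N - 1}{N}.$$

The remaining terms are bounded as follows:

$$\|E_{ii}\|_2 = 1, \qquad \|Q\|_2 \le \sqrt{N}, \qquad \|\tilde{K}_j^\top\|_2 \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^K\|_2, \qquad \|K\|_2 \le \sqrt{N},$$

$$\|\tilde{Q}_i^\top\|_2 \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^Q\|_2, \qquad \|P_{ii}\|_2 \le 1, \qquad \|E_{ji}\|_2 = 1, \qquad \|\tilde{V}_j\|_2 \le 2\,\epsilon^{-\frac{1}{2}}\, \|W^{V\top}\|_2.$$

Substituting the above results into Eq. 16 and changing the $L_\infty$ norm to the $L_2$ norm, we have

$$\begin{aligned} \mathrm{Lip}(\mathrm{SCSA})_2 &\le \nu\tau \sqrt{N} \sqrt{N}\, \frac{N - 1}{N}\, 2\epsilon^{-\frac{1}{2}} \bigl( \|W^K\|_2 + \|W^Q\|_2 \bigr) + 2\nu \epsilon^{-\frac{1}{2}} \|W^{V\top}\|_2 \\ &\quad + (N - 1) \Bigl( \frac{N - 1}{N}\, \nu\tau \sqrt{N} \sqrt{N}\, 2\epsilon^{-\frac{1}{2}} \|W^K\|_2 + 2\nu \epsilon^{-\frac{1}{2}} \|W^{V\top}\|_2 \Bigr) \\ &= 2N(N - 1)\, \nu\tau \epsilon^{-\frac{1}{2}} \|W^K\|_2 + 2(N - 1)\, \nu\tau \epsilon^{-\frac{1}{2}} \|W^Q\|_2 + 2N \nu \epsilon^{-\frac{1}{2}} \|W^{V\top}\|_2. \end{aligned}$$

From the upper bound above, we highlight the following observations: 1) $\epsilon$ guarantees the validity of the cosine similarity computation when any participating vector is equal to zero; 2) in $\mathrm{Lip}(\mathrm{SCSA})_2$, the scale factor of the first term is $2N(N - 1)\nu\tau\epsilon^{-\frac{1}{2}}$, which carries an extra factor of $\sim N$ compared to the other terms, meaning that $\|W^K\|_2$ plays a more significant role in the Lipschitz constant $\mathrm{Lip}(\mathrm{SCSA})_2$. Unlike the L2-distance attention detailed in [28], which is Lipschitz continuous only when $W^Q$ and $W^K$ are the same, the scaled cosine similarity attention places no such constraint on the weight matrices.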
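As a sanity check on the "merging and rearranging" steps in H.1 and H.2, the sketch below evaluates the expanded and the merged forms of both bounds and confirms they agree numerically. It only verifies the algebra of the closed-form expressions, not the tightness or validity of the individual norm bounds. All function names and the example values of $N$, $D$, $\nu$, $\tau$, $\epsilon$ and the weight norms are hypothetical.

```python
import math

def lip_inf_expanded(N, D, nu, tau, eps, wq, wk, wv):
    # wq, wk, wv stand for the infinity norms of W^Q, W^K and (W^V)^T.
    s = eps ** -0.5
    diag = nu * tau * N * 0.5 * (math.sqrt(D) * s * 2 * wk + math.sqrt(D) * s * 2 * wq)
    diag += nu * s * 2 * wv
    off = nu * tau * N * 0.5 * math.sqrt(D) * s * 2 * wk + nu * s * 2 * wv
    return diag + (N - 1) * off

def lip_inf_merged(N, D, nu, tau, eps, wq, wk, wv):
    s = eps ** -0.5
    return (N ** 2 * math.sqrt(D) * nu * tau * s * wk
            + N * math.sqrt(D) * nu * tau * s * wq
            + 2 * N * nu * s * wv)

def lip_2_expanded(N, nu, tau, eps, wq, wk, wv):
    # wq, wk, wv stand for the spectral norms of W^Q, W^K and (W^V)^T.
    s = eps ** -0.5
    c = math.sqrt(N) * math.sqrt(N) * (N - 1) / N  # ||V^T|| ||Q|| ||P^(i)|| = N - 1
    diag = nu * tau * c * 2 * s * (wk + wq) + 2 * nu * s * wv
    off = nu * tau * c * 2 * s * wk + 2 * nu * s * wv
    return diag + (N - 1) * off

def lip_2_merged(N, nu, tau, eps, wq, wk, wv):
    s = eps ** -0.5
    return (2 * N * (N - 1) * nu * tau * s * wk
            + 2 * (N - 1) * nu * tau * s * wq
            + 2 * N * nu * s * wv)

# Hypothetical example: a 7x7 window (N = 49 tokens), head dimension 32.
args_inf = dict(N=49, D=32, nu=1.0, tau=12.0, eps=1e-6, wq=1.0, wk=1.5, wv=0.5)
args_2 = dict(N=49, nu=1.0, tau=12.0, eps=1e-6, wq=1.0, wk=1.5, wv=0.5)
inf_a, inf_b = lip_inf_expanded(**args_inf), lip_inf_merged(**args_inf)
two_a, two_b = lip_2_expanded(**args_2), lip_2_merged(**args_2)
```

The values are large because of the $\epsilon^{-\frac{1}{2}}$ factor; the bounds are worst-case guarantees rather than estimates of typical behaviour.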
[49] AND L2-ATTENTION [28]

As proved in [28], the standard dot-product attention is not Lipschitz continuous. The L2-attention proposed there is also not Lipschitz continuous for general $W^Q$ and $W^K$; it is Lipschitz continuous only when $W^Q = W^K$. However, enforcing $W^Q = W^K$ degrades model performance, as shown in [28]. As proved above, the scaled cosine similarity attention (SCSA) is Lipschitz continuous in general, requiring only that $W^Q$, $W^K$, $W^V$ have bounded norm and that the computation of cosine similarity is valid. The latter is easily guaranteed by introducing a small smoothing factor $\epsilon$.

I PROOF OF THEOREM 2

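In this section, we give the upper bound on LipsFormer's Lipschitz constant. For a LipsFormer with $S$ stages where the $s$-th stage has $M_s$ residual blocks, its Lipschitz constant is upper bounded by

$$\mathrm{Lip}(F) \le \prod_{s=1}^{S} \prod_{m=1}^{M_s} \bigl(1 + \alpha_{s,m}\, \mathrm{Lip}(f_{s,m})\bigr). \tag{19}$$

Here, we define $\kappa = \max(\{\mathrm{Lip}(f_i) : i = 1, \ldots, \sum_{s=1}^{S} M_s\})$. When each $\alpha_{s,m}$ is set to $1 / \sum_{s=1}^{S} M_s$, the above inequality can be rewritten as

$$\mathrm{Lip}(F) \le \prod_{s=1}^{S} \prod_{m=1}^{M_s} \Bigl(1 + \frac{\mathrm{Lip}(f_{s,m})}{\sum_{s=1}^{S} M_s}\Bigr) \le \Bigl(1 + \frac{\kappa}{\sum_{s=1}^{S} M_s}\Bigr)^{\sum_{s=1}^{S} M_s} \le \exp(\kappa).$$

When DropPath [29] with drop probability $p$ is used within each residual block, the term $\alpha_{s,m}\, \mathrm{Lip}(f_{s,m})$ in the bound is replaced by $\mathrm{droppath}(\alpha_{s,m}\, \mathrm{Lip}(f_{s,m}), p)$, where

$$\mathrm{droppath}(\alpha_{s,m}\, \mathrm{Lip}(f_{s,m}), p) = \begin{cases} 0, & \text{with probability } p, \\ \alpha_{s,m}\, \mathrm{Lip}(f_{s,m}), & \text{with probability } 1 - p. \end{cases}$$

We can see that DropPath effectively decreases the upper bound of a network's Lipschitz constant by randomly dropping the contributions of residual paths.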
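The product bound of this section can be evaluated directly. The sketch below multiplies the per-block factors $1 + \alpha_{s,m}\mathrm{Lip}(f_{s,m})$ and compares the result with $\exp(\kappa)$. It assumes that every $\alpha_{s,m}$ equals the reciprocal of the total number of residual blocks, which is our reading of the condition in the text; the stage depths and per-block Lipschitz constants are made-up numbers, and `lipsformer_bound` is a hypothetical helper name.

```python
import math

def lipsformer_bound(alphas, lips):
    # alphas[s][m] and lips[s][m] hold alpha_{s,m} and Lip(f_{s,m}) for stage s, block m.
    bound = 1.0
    for stage_alphas, stage_lips in zip(alphas, lips):
        for a, l in zip(stage_alphas, stage_lips):
            bound *= 1.0 + a * l
    return bound

# Hypothetical four-stage network; depths and Lipschitz constants are illustrative only.
depths = [2, 2, 6, 2]
total_blocks = sum(depths)
lips = [[1.0 + 0.1 * m for m in range(d)] for d in depths]
kappa = max(max(stage) for stage in lips)

# Assumption: alpha_{s,m} = 1 / (total number of residual blocks).
alphas = [[1.0 / total_blocks] * d for d in depths]
bound = lipsformer_bound(alphas, lips)
```

With this choice each factor is at most $1 + \kappa / \sum_s M_s$, so the product cannot exceed $\exp(\kappa)$ no matter how deep the network is, which is the practical appeal of scaling the residual weights by depth.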

K COMPARISON WITH EXISTING WORKS

In this section, to clarify our contribution, we provide a detailed comparison of our method with existing works, including Admin [33], ReZero [3], Swin-V2 [34], DeepNorm [51], L2 self-attention [28] and Spectral Normalization [56]. Admin [33] identifies that within a residual block, the residual branch amplifies the network output and that this amplification makes training unstable. They propose to initialize the weight contributions of a residual branch according to the variance of its previous layer. ReZero [3] introduces an effective strategy to improve training stability. They notice that initializing the residual branch with 0 satisfies initial dynamical isometry and thus stabilizes model training. ReZero demonstrates that very deep transformers can be trained without warmup, but it requires removing Layer Normalization. According to Equation 19, initializing the residual contribution to 0 trivially constrains the network's Lipschitz constant. However, with Layer Normalization put back into the network, ReZero is likely to encounter training instability again. DeepNorm [51] shares a similar motivation with Admin [33] and analyzes the influence of the residual block and initialization. They introduce a new normalization function to modify the residual connection in Transformer and propose a new initialization method. However, we observe that learning rate warmup is still necessary in DeepNorm [51]. In summary, training with Admin [33] and DeepNorm [51] requires learning rate warmup, while ReZero [3] can train without warmup only when LayerNorm is absent from the network. None of the analyses of Admin, ReZero, and DeepNorm is made from the perspective of Lipschitz continuity. In [56], Yuichi et al. introduce a simple and effective spectral norm regularization, which penalizes high spectral norms of weight matrices in neural networks. This work focuses on regularization without considering the residual block or the self-attention block.
In [28], Kim et al. prove that the standard dot-product self-attention is not Lipschitz continuous. They introduce an alternative L2 self-attention that is Lipschitz continuous under the constraint that $W^Q = W^K$. Such a constraint limits the expressiveness of the attention block and empirically degrades training performance. Also, L2 self-attention focuses only on the Lipschitz continuity of the self-attention block. Swin-V2 [34] introduces two strategies to improve the training stability of transformer models: replacing pre-norm with a residual post-norm configuration, and replacing the original dot-product attention with a scaled cosine attention. The introduced scaled cosine attention is defined as $\mathrm{Sim}(q_i, k_j) = \cos(q_i, k_j)/\tau + B_{ij}$. It should be noted that there is a difference between $\cos(q_i, k_j)/\tau$ and $\tau \cos(q_i, k_j)$: the former is not a Lipschitz continuous function with respect to the variable $\tau$, but the latter is. According to our derivation, self-attention based on the scaled cosine attention as defined in Swin-V2 is not Lipschitz continuous if $V$ is not normalized. The above-mentioned works deal with only one or several standard neural computation modules. Our LipsFormer gives a holistic Lipschitz analysis of a typical transformer network instead of focusing exclusively on a single or a few constituent modules. To derive the Lipschitz constant of LipsFormer, we analyze each constituent module of a standard transformer, including convolutions, fully-connected layers, self-attention, normalization and the residual block. In this work we propose a Lipschitz continuous self-attention and construct a Lipschitz continuous transformer network by bounding each constituent computation layer. The resultant LipsFormer induces stable training and does not require learning rate warmup. We summarize our contributions from both theoretical and empirical perspectives as follows.

Theoretically,
• We derive a theoretical Lipschitz constant upper bound for scaled cosine similarity attention. Meanwhile, we give a thorough analysis of key Transformer components: LayerNorm, self-attention, residual shortcut, and weight initialization.
• We propose a Lipschitz continuous Transformer (LipsFormer), and derive a theoretical Lipschitz constant upper bound for LipsFormer. The derivation provides principled guidance for designing LipsFormer networks.

Empirically,
• We make an assumption about the Lipschitz continuity of the network, and experimentally validate this assumption.
• We build LipsFormer on CSwin and Swin Transformer. We validate the efficacy of the different versions (Tiny, Small, Base, Large and Large++) of LipsFormer on the ImageNet, ImageNet-v2 and ImageNet-real data sets.
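The remark in this section about $\cos(q_i, k_j)/\tau$ versus $\tau \cos(q_i, k_j)$ can be made explicit with a one-line derivative. The following is a sketch under the assumption that $\tau$ ranges over $(0, \infty)$ and that the cosine value $c = \cos(q_i, k_j) \in [-1, 1]$ is held fixed; the symbol $c$ is introduced here only for illustration.

```latex
\left| \frac{\partial}{\partial \tau} \frac{c}{\tau} \right| = \frac{|c|}{\tau^{2}}
\;\longrightarrow\; \infty \quad \text{as } \tau \to 0^{+} \ (c \neq 0),
\qquad
\left| \frac{\partial}{\partial \tau} (\tau c) \right| = |c| \le 1 .
```

The first derivative is unbounded near $\tau = 0$, so no finite Lipschitz constant with respect to $\tau$ exists on $(0, \infty)$, while the second is bounded by 1 for every $\tau$.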



.4 SPECTRAL INITIALIZATION FOR CONVOLUTION AND FEED-FORWARD CONNECTION

Both convolution and feed-forward connection are compositions of affine transformations. As shown in Equation 1, an affine transformation is Lipschitz continuous; hence, by Lemma 2, both convolution and feed-forward connection are Lipschitz continuous.
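The exact spectral initialization procedure is not spelled out in this passage. As a hedged illustration of the idea, the sketch below assumes it amounts to dividing a Xavier-initialized weight matrix by its largest singular value, so that the resulting linear map is 1-Lipschitz in the $L_2$ norm. The function name `spectral_init` and the layer sizes are hypothetical.

```python
import numpy as np

def spectral_init(W):
    # Assumed procedure: rescale W so that its largest singular value equals 1.
    sigma_max = np.linalg.norm(W, ord=2)  # spectral norm = largest singular value
    return W / sigma_max

rng = np.random.default_rng(0)
fan_in, fan_out = 64, 128

# Xavier-uniform starting point (illustrative sizes).
limit = np.sqrt(6.0 / (fan_in + fan_out))
W0 = rng.uniform(-limit, limit, size=(fan_in, fan_out))
W = spectral_init(W0)

# Empirical Lipschitz ratio ||xW - yW|| / ||x - y|| for one random pair of inputs.
x, y = rng.normal(size=fan_in), rng.normal(size=fan_in)
ratio = np.linalg.norm(x @ W - y @ W) / np.linalg.norm(x - y)
```

After rescaling, the ratio of output distance to input distance cannot exceed 1 for any pair of inputs, which is what makes the layer's Lipschitz constant known at initialization.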

Ms , the Lipschitz constant of the LipsFormer is upper bounded by $\exp(\kappa)$, where $\kappa = \max(\{\mathrm{Lip}(f_i) : i = 1, \ldots, \sum_{s=1}^{S} M_s\})$.

FIGURE 2: Sigmoid, Tanh, ReLU and GELU activation function.

FIGURE 4: Training curves of LipsFormer-Swin-T and LipsFormer-Swin-B. Left: training loss along with epochs. Right: classification accuracy along with epochs.

FIGURE 5: Variation of α over the training epochs. Each line denotes one set of α in one stage. We show the mean and standard deviation of the absolute value of α.

Theorem 1. The scaled cosine similarity attention (SCSA) is Lipschitz continuous if and only if $W^Q$, $W^K$, $W^V$ have bounded norm.

Ms , the above inequality can be rewritten as,

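AN EFFICIENT WAY TO CONSTRAIN THE LIPSCHITZ CONSTANT

DropPath [29] is another effective technique for training deep transformers, where

$$y = \begin{cases} x, & \text{if the residual path is dropped,} \\ x + \alpha \cdot f(x), & \text{otherwise.} \end{cases}$$

When using DropPath with drop probability $p$ within each residual block, the Lipschitz constant of LipsFormer is refined as

$$\mathrm{Lip}(F) \le \prod_{s=1}^{S} \prod_{m=1}^{M_s} \bigl(1 + \mathrm{droppath}(\alpha_{s,m}\, \mathrm{Lip}(f_{s,m}), p)\bigr),$$

where $\mathrm{droppath}(a, p)$ equals $0$ with probability $p$ and $a$ with probability $1 - p$.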
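The DropPath rule and its effect on the bound can be sketched as follows. This is an illustrative NumPy version that follows the two-case equation in the text literally: it drops the whole residual branch for one input and does not rescale kept paths, which some library implementations do. The names `droppath_residual` and `sampled_bound`, and the example $\alpha$ and Lipschitz values, are hypothetical.

```python
import numpy as np

def droppath_residual(x, f, alpha, p, rng, training=True):
    # y = x if the residual path is dropped, else x + alpha * f(x).
    if training and rng.random() < p:
        return x
    return x + alpha * f(x)

def sampled_bound(alphas, lips, p, rng):
    # One random draw of prod(1 + droppath(alpha * Lip, p)) over all blocks.
    alphas = np.asarray(alphas, dtype=float)
    lips = np.asarray(lips, dtype=float)
    keep = rng.random(alphas.shape[0]) >= p  # True where the path is kept
    return float(np.prod(1.0 + keep * alphas * lips))

rng = np.random.default_rng(0)
x = np.ones(4)

# Made-up per-block values for a 12-block network.
alphas = [1.0 / 12] * 12
lips = [1.0 + 0.05 * m for m in range(12)]

full = sampled_bound(alphas, lips, p=0.0, rng=rng)     # no path dropped
dropped = sampled_bound(alphas, lips, p=0.3, rng=rng)  # some paths dropped
all_dropped = sampled_bound(alphas, lips, p=1.0, rng=rng)
```

Every dropped block contributes a factor of exactly 1, so any sampled bound lies between 1 and the bound without DropPath.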



reports the LipsFormer-CSwin results compared with state-of-the-art CNN and Transformer models. We evaluate all three variants of LipsFormer-CSwin against state-of-the-art models of similar sizes: Tiny (< 32M parameters), Small (31-64M parameters), and Base (56-96M parameters).

Comparison of different models with input resolution $224^2$ on ImageNet-1K classification. Red indicates the best result and blue indicates the second best result.

Ablation study on key components of LipsFormer. "Not converge" means training loss oscillates without converging, and "Diverged" means the loss explodes because of "NaN" or "Inf".

Details of LipsFormer-CSwin model variants.

In Table 6 we provide the ImageNet-1K training details used for producing the main results in Table 2. All LipsFormer variants use the same training hyperparameters, except for DropPath ratio, weight decay, learning rate and EMA. All the models are implemented with PyTorch and trained on NVIDIA Tesla A100 GPUs. We do not use learning rate warmup in any experiment.

Details of LipsFormer-Swin model variants.

Details of LipsFormer-CSwin model variants. All results except our LipsFormer-CSwin are taken from DeiT III [47].

E OVERFITTING EVALUATION

Following DeiT III [47], we also evaluate our method on the ImageNet-v2 [43] and ImageNet-real [4] data sets. As pointed out by [48], testing how a method performs in a nearby setting without any finetuning is a good way to assess overfitting. We directly apply the models trained on the original ImageNet data set to these two data sets. The results are shown in Table 8. We can see from Table 8 that our method, which works well on the original ImageNet data set, consistently performs well on the ImageNet-v2 and ImageNet-real data sets. This observation supports the generalization ability of the proposed method.

F TRAINING CURVES

In Figure 4, we show the training loss and classification accuracy curves of LipsFormer-Swin-T and LipsFormer-Swin-B. LipsFormer-Swin-B fits the training data better than LipsFormer-Swin-T, as its training loss is much lower.

