CERTIFIABLY ROBUST TRANSFORMERS WITH 1-LIPSCHITZ SELF-ATTENTION

Abstract

Recent works have shown that neural networks with Lipschitz constraints achieve high adversarial robustness. In this work, we propose the first One-Lipschitz Self-Attention (OLSA) mechanism for Transformer models. In particular, we first orthogonalize all the linear operations in the self-attention mechanism. We then bound the overall Lipschitz constant by aggregating the Lipschitz constants of the elements in the softmax weighted sum. Based on the proposed self-attention mechanism, we construct an OLSA Transformer that achieves deterministic certified robustness. We evaluate our model on multiple natural language processing (NLP) tasks and show that it outperforms existing certification methods for Transformers, especially on models with multiple layers. For example, for 3-layer Transformers we achieve ℓ_2 deterministic certified robustness radii of 1.733 and 0.979 in the word embedding space on the Yelp and SST datasets, while the existing SOTA certification baseline on the same embedding space only achieves 0.061 and 0.110. In addition, our certification is significantly more efficient than previous works, since we only need the output logits and the Lipschitz constant for certification. We also fine-tune our OLSA Transformer as a downstream classifier of a pre-trained BERT model and show that it achieves significantly higher certified robustness in the BERT embedding space than previous works (e.g., from 0.071 to 0.368 on the QQP dataset).

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied in different domains in recent years, including face recognition (He et al., 2016), machine translation (Bahdanau et al., 2014), and recommendation systems (Zhang et al., 2019b). In natural language processing (NLP) in particular, Transformer models (Vaswani et al., 2017) have achieved outstanding performance on a variety of tasks. Despite their impressive performance, these NLP models have been shown to suffer from adversarial attacks (Zhang et al., 2020), where an adversary intentionally injects unnoticeable perturbations into the inputs to fool the model into providing incorrect predictions. Several works have been proposed to improve the empirical robustness of Transformers (Alzantot et al., 2018), but few have studied their certified robustness, i.e., theoretical guarantees that the model cannot be attacked under certain conditions (e.g., within some perturbation range). Recently, Shi et al. (2020) proposed to rely on bound-propagation techniques to derive certified robustness for Transformers; this leads to a relatively loose bound and cannot certify deep models, given the looseness induced by propagating through each component of the attention. In this work, we propose a One-Lipschitz Self-Attention (OLSA) algorithm which provides a robustness certificate for Transformers by bounding the Lipschitz constant of the model. The Lipschitz constant of a model is naturally related to its robustness, as both require that the model's output not change much when the input changes slightly. Previous works (Tsuzuku et al., 2018; Singla and Feizi, 2021) have investigated the 1-Lipschitz property for fully-connected and convolutional neural networks, but 1-Lipschitz Transformers remain unexplored, as the complicated non-linear self-attention mechanism is difficult to analyze and constrain.
Thus, in this work, we propose the first 1-Lipschitz Transformer network, which allows us to achieve tighter deterministic certified robustness against adversarial attacks under different settings (e.g., training from scratch and fine-tuning). In order to bound the Lipschitz constant of a self-attention layer, we first enforce all the linear operations (keys, queries, values) to be orthogonal via re-parametrization techniques (Huang et al., 2020). Next, we upper bound the input norm by normalizing the word embedding layer. As a result, we can bound the overall Lipschitz constant by aggregating the change in each component of the softmax weighted sum. Finally, we add scaling factors to ensure the 1-Lipschitzness of the OLSA layer. In addition, we also bound the Lipschitz constant of the pooling layer and aggregate the components to obtain the final OLSA Transformer classification model. We evaluate our OLSA Transformer under both train-from-scratch and fine-tuning scenarios. In both settings, OLSA achieves significantly higher certified robustness than existing bound-propagation-based methods (Shi et al., 2020), and the improvement is especially large on deeper models. For example, a 3-layer OLSA Transformer achieves an average certified radius of 1.733 on Yelp, while previous works only achieve 0.061 under the train-from-scratch setting. When fine-tuning over a pre-trained BERT model, the OLSA Transformer achieves a radius of 0.368 on the QQP dataset while previous works only achieve 0.071. In addition, our certification is 10,000× faster than previous approaches, since we do not need a complicated bound propagation process and only need one forward pass to perform the certification. Finally, we also evaluate different methods under adversarial attacks and show that OLSA achieves much higher empirical robustness than the baselines as well.
Meanwhile, we acknowledge a 1% to 2% drop in clean accuracy for OLSA, as the extra 1-Lipschitz constraint slightly limits model expressiveness.

Technical contributions. We summarize our contributions as follows:
• We propose the first One-Lipschitz Self-Attention mechanism (OLSA) and prove its Lipschitz bound with corresponding analysis.
• We evaluate the proposed OLSA Transformer on various NLP tasks and observe that it outperforms state-of-the-art baselines. In particular, the performance gap grows on deeper models (e.g., on 3-layer Transformers on Yelp, we achieve over a 25× larger average certified radius than previous works).
• The OLSA model requires significantly less time to certify the robustness radius, as it only requires a forward pass to calculate the prediction gap.

2. RELATED WORK

Adversarial Robustness for NLP Models Existing works have shown that NLP models suffer from adversarial attacks (Zhang et al., 2019c; 2020). To provide certified robustness, prior defenses rely on randomized smoothing (Ye et al., 2020; Wang et al., 2021a) or bound-propagation techniques (Jia et al., 2019; Shi et al., 2020). However, the smoothing techniques cannot provide deterministic certification, while the bound-propagation techniques are relatively loose and cannot certify deep models. Recently, Xu et al. (2020) also proposed a bound-propagation-based technique for NLP models; their certification targets word substitution attacks and does not directly apply to our scenario.

Lipschitz-constrained Models and Certified Robustness Lipschitz-constrained models have been studied for their smoothness and robustness; however, existing works all focus on constraining the Lipschitz constant of fully-connected and convolutional neural networks. Tsuzuku et al. (2018) first bridge the Lipschitz constant of a network with its robustness and propose a Lipschitz-related loss to improve model robustness. In order to achieve 1-Lipschitzness, multiple works (Cisse et al., 2017; Miyato et al., 2018; Qian and Wegman, 2018) propose to regularize the spectral norm of the linear matrices of fully-connected layers so that the Lipschitz constant is smaller than 1. For convolutional neural networks, a simple approach is to unroll the convolution into an equivalent linear layer, but this is shown to yield a loose Lipschitz bound (Wang et al., 2020). Recent works (Li et al., 2019; Trockman and Kolter, 2021; Singla and Feizi, 2021) directly parametrize 1-Lipschitz convolutional neural networks, which achieve good certified robustness on vision tasks.
In addition, existing activation functions such as ReLU are shown to be unsuitable for 1-Lipschitz models, and better activation functions such as GroupSort (Anil et al., 2019) and Householder (Singla et al., 2021) have been proposed.

3. PRELIMINARIES

3.1. LIPSCHITZ CONSTANT AND CERTIFIED ROBUSTNESS

The Lipschitz constant of a function f : R^m → R^n (with respect to the ℓ_2 norm) is defined as:

Lip(f) = sup_{x ≠ x'} ||f(x) - f(x')||_2 / ||x - x'||_2, ∀x, x' ∈ R^m.

We can observe that the Lipschitz property of a neural network is naturally related to its robustness: both require that the model output not change much when the input changes by a certain magnitude. Specifically, if we define the prediction margin of f on an input x as

M_f(x) = max_i f(x)_i - max_{j ≠ arg max_i f(x)_i} f(x)_j,

where f(x)_i denotes the i-th output logit, then we can guarantee that the predicted class of f will not change within the radius ||x - x'||_2 < M_f(x) / (√2 Lip(f)). Therefore, people have proposed to enforce 1-Lipschitzness on models to achieve robustness. Note that the Lipschitz constant of a composed function f ∘ g satisfies Lip(f ∘ g) ≤ Lip(f) Lip(g). As a result, we can upper bound the Lipschitz constant of a model by bounding the Lipschitz constant of each layer of the neural network.
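The margin-based certificate above is cheap to compute once the Lipschitz constant is known. The following is an illustrative NumPy sketch (the function name and toy numbers are ours, not from the paper) of the radius M_f(x) / (√2 · Lip(f)):

```python
import numpy as np

def certified_radius(logits, lip):
    """Radius within which the prediction provably cannot change:
    M_f(x) / (sqrt(2) * Lip(f)), where M_f(x) is the top-1 vs runner-up margin."""
    top2 = np.sort(logits)[-2:]
    margin = top2[1] - top2[0]
    return margin / (np.sqrt(2) * lip)

# toy numbers (illustrative, not from the paper): margin 1.5, Lipschitz constant 1
r = certified_radius(np.array([2.0, 0.5, -1.0]), lip=1.0)
```

With a 1-Lipschitz model, the certificate reduces to the prediction margin divided by √2, which is why only one forward pass is needed.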

3.2. ORTHOGONAL LINEAR LAYER IN DNNS

Consider a linear layer y = W x where x ∈ R^m, W ∈ R^{n×m}, and y ∈ R^n. People have proposed a stronger constraint to ensure the 1-Lipschitz property: requiring W to be orthogonal. Orthogonality not only guarantees that the layer is 1-Lipschitz but also ensures that the gradient norm is preserved during the backward pass, which helps training stability (Anil et al., 2019). Several works have proposed re-parametrization techniques to obtain an orthogonal linear layer. For instance, Huang et al. (2020) propose to parametrize the orthogonal matrix W with an unconstrained matrix V ∈ R^{n×m} by W = (V V^⊺)^{-1/2} V, where the inverse square root can be calculated by Newton's iteration. In practice, we observe that the following coupled Newton's iteration (Lin and Maji, 2017) computes the inverse square root more stably. Given Y_0 = V V^⊺ and Z_0 = I:

Y_{k+1} = (1/2) Y_k (3I - Z_k Y_k),  Z_{k+1} = (1/2) (3I - Z_k Y_k) Z_k.

Z_k converges quadratically to (V V^⊺)^{-1/2} when ||V V^⊺ - I||_2 < 1, which in practice is achieved by scaling the parameter V to be small. Existing works focus on constraining the Lipschitz constant of such linear operations or of convolutions (a type of linear operator in a more compact form). However, no one has studied the Lipschitz property of the self-attention mechanism, which is a non-linear function. The focus of this paper is to construct the first 1-Lipschitz Transformer model, which we introduce next.
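The re-parametrization can be sketched in a few lines of NumPy; this is an illustrative implementation under our own scaling choice (dividing V V^⊺ by its trace so the convergence condition holds), not the paper's exact training code:

```python
import numpy as np

def orthogonalize(V, iters=25):
    """W = (V V^T)^{-1/2} V via the coupled Newton iteration of Lin & Maji (2017).
    V V^T is first scaled so that ||V V^T / s - I||_2 < 1, as convergence requires."""
    n = V.shape[0]
    A = V @ V.T
    s = np.trace(A)              # the trace upper-bounds the spectral norm of A
    Y, Z = A / s, np.eye(n)
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y, Z = Y @ T, T @ Z      # Y_k -> (A/s)^{1/2}, Z_k -> (A/s)^{-1/2}
    return (Z / np.sqrt(s)) @ V  # undo the scaling: A^{-1/2} = Z / sqrt(s)

rng = np.random.default_rng(0)
W = orthogonalize(rng.standard_normal((4, 8)))
err = np.abs(W @ W.T - np.eye(4)).max()   # ~0: rows of W are orthonormal
```

Since W W^⊺ = (V V^⊺)^{-1/2} (V V^⊺) (V V^⊺)^{-1/2} = I, the resulting layer is 1-Lipschitz by construction.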

4. ONE-LIPSCHITZ SELF-ATTENTION (OLSA) TRANSFORMER

As we can see, the property of 1-Lipschitz for neural networks can improve model smoothness and provide certified robustness. However, it is often challenging to enforce the Lipschitz constraint while maintaining a good model capacity. In this section, we will introduce the first 1-Lipschitz Transformer -One-Lipschitz Self-Attention (OLSA) Transformer. We will first introduce the Lipschitz property in a self-attention layer with sequential input and output. Then we introduce how we achieve 1-Lipschitzness for self-attention layers and pooling layers. Finally, we compose these layers and construct the 1-Lipschitz Transformer model.

4.1. LIPSCHITZ PROPERTY IN SELF-ATTENTION PIPELINE

First, we describe the setting of the self-attention pipeline. Suppose we have a sequence of inputs X ≜ [x_1, x_2, . . . , x_N] where x_i ∈ R^d. The self-attention pipeline can be formalized as a function F : R^{N×d} → R^{N×d} with output Y ≜ [y_1, y_2, . . . , y_N] = F(X). Under this setting, we define the Lipschitz constant of the pipeline based on the overall change in the input sequence:

Definition 4.1. Given X ≜ [x_1, x_2, . . . , x_N] and Y ≜ [y_1, y_2, . . . , y_N] = F(X), we define the Lipschitz constant of the function as:

Lip(F) = sup_{X,X'} √(Σ_i ||y_i - y'_i||_2^2) / √(Σ_i ||x_i - x'_i||_2^2) = sup_{X,X'} ||Y - Y'||_F / ||X - X'||_F,

where || · ||_F denotes the Frobenius norm of a matrix. This definition captures the overall potential change within the input sequence. We aim to bound this Lipschitz constant for the overall model so that we can provide certified robustness against perturbations on the input sequence.
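The two forms of the ratio in Definition 4.1 are the same quantity, which a quick NumPy check (our own toy map, not from the paper) makes concrete:

```python
import numpy as np

# The root-sum-of-squares of per-token l2 distances equals the Frobenius
# distance of the stacked matrices, so both forms in Definition 4.1 coincide.
rng = np.random.default_rng(1)
X, Xp = rng.standard_normal((5, 3)), rng.standard_normal((5, 3))
Y, Yp = np.tanh(X), np.tanh(Xp)        # any sequence-to-sequence map F

per_token = np.sqrt((np.linalg.norm(Y - Yp, axis=1) ** 2).sum()) / \
            np.sqrt((np.linalg.norm(X - Xp, axis=1) ** 2).sum())
frobenius = np.linalg.norm(Y - Yp) / np.linalg.norm(X - Xp)
```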

4.2. ONE-LIPSCHITZ SELF-ATTENTION (OLSA) LAYER

The standard self-attention mechanism Y = F(X) in a Transformer, with inputs x_i and outputs y_i, can be formalized as:

s_ij = (W^Q x_i)^⊺ (W^K x_j) / √d,  p_i = softmax([s_i1, s_i2, . . . , s_in]),  y_i = Σ_j p_ij (W^V x_j).

In order to achieve a tight bound on the Lipschitz constant, we make two changes to this pipeline. First, we use additive attention instead of dot-product attention, which admits a tighter bound; as shown in (Vaswani et al., 2017), the two mechanisms do not differ much in performance. Second, we add two scaling factors α_1 and α_2 to control the Lipschitz constant of the model. The modified self-attention mechanism Y = F(X) is:

s_ij = (1/α_1) W^S σ((W^Q x_i + W^K x_j) / 2),  p_i = softmax([s_i1, s_i2, . . . , s_in]),  y_i = (1/α_2) Σ_j p_ij (W^V x_j),

where σ(·) denotes a non-linear activation function. Incorporating the multi-head mechanism with H heads, the final OLSA layer is defined as follows:

Definition 4.2 (OLSA layer). Given X ∈ R^{N×d}, the OLSA layer F : R^{N×d} → R^{N×d} with H heads is calculated by:

s^h_ij = (1/α_1) W^S_h σ((W^Q_h x_i + W^K_h x_j) / 2),  p^h_i = softmax([s^h_i1, s^h_i2, . . . , s^h_in]),  y^h_i = (1/α_2) Σ_j p^h_ij (W^V_h x_j),  y_i = Concat([y^1_i, . . . , y^H_i]),

where {(W^S_h, W^Q_h, W^K_h, W^V_h)}_{h=1}^H are model parameters with W^S_h ∈ R^{1×(d/H)} and W^Q_h, W^K_h, W^V_h ∈ R^{(d/H)×d}.

For this mechanism, we can provide the following Lipschitz bound under the assumptions that the linear transformations are orthogonal and the input norm is bounded.

Theorem 4.1 (Lipschitz bound of the OLSA layer). Let W^Q = Concat([W^Q_1, . . . , W^Q_H]) ∈ R^{d×d} denote the concatenated parameters of the W^Q_h, and define W^K and W^V similarly. Assume 1) W^Q, W^K, W^V are orthogonal matrices; 2) ||W^S_h||_2 = 1 for all h ∈ {1, . . . , H}; 3) σ is a 1-Lipschitz activation function; 4) the overall input norm is bounded by ||X||_F ≤ c. Then the Lipschitz constant of the function is bounded by:

Lip(F) ≤ (1/α_2) (1 + c√n / (4α_1)),

where n is the length of the input sequence.

The main idea of the proof is to assume a perturbation δ on X and gradually bound the perturbation from the s^h_ij to the y_i. The main non-linearity comes from the calculation of y^h_i, which multiplies the calculated attention score p^h_ij with the original input x_j. This can be bounded by considering the perturbation on each part individually, noticing that both parts are norm-bounded (p^h_ij is the output of a softmax, so it is norm-bounded; the norm of x_j is bounded by assumption). The full proof of the theorem is given in Appendix A. We observe that the Lipschitz bound of the layer depends on the input sequence length and the input norm bound; we discuss how we control the input norm for each layer in Section 4.4.

Remark. (1) We train the orthogonal matrices parametrized as W = (V V^⊺)^{-1/2} V via Newton's iteration (Huang et al., 2020). (2) To obtain a 1-Lipschitz layer, so that the overall Lipschitz constant of the model is 1, we set α_2 = 1 + c√N / (4α_1). (3) We set α_1 to be a trainable parameter. Intuitively, α_1 controls the trade-off between self-attention expressiveness and linearity: when α_1 is very large, α_2 is close to 1, so the expressiveness of the final output is preserved, but all s_ij are close to 0 and the attention degenerates into simple averaging; when α_1 is very small, the attention mechanism works well, but the final output is divided by a large α_2, so the expressiveness of the final output is limited.
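The layer and its scaling can be sketched numerically. Below is a minimal single-head NumPy version (H = 1; tanh stands in for the 1-Lipschitz activation σ, and all names are ours); a small finite-difference check is consistent with the 1-Lipschitz guarantee of Theorem 4.1 once α_2 is set as in Remark (2):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def olsa_head(X, WQ, WK, WV, WS, alpha1, c):
    """Single-head OLSA forward pass (Definition 4.2 with H = 1).
    tanh stands in for the 1-Lipschitz activation sigma."""
    N = X.shape[0]
    alpha2 = 1.0 + c * np.sqrt(N) / (4.0 * alpha1)     # Remark (2)
    # additive scores s_ij = (1/a1) * WS @ sigma((WQ x_i + WK x_j) / 2)
    pre = (X @ WQ.T)[:, None, :] + (X @ WK.T)[None, :, :]
    S = np.tanh(pre / 2.0) @ WS / alpha1               # (N, N) score matrix
    P = softmax(S, axis=-1)
    return P @ (X @ WV.T) / alpha2

rng = np.random.default_rng(0)
d, N, c = 8, 6, 2.0
WQ, _ = np.linalg.qr(rng.standard_normal((d, d)))      # orthogonal weights
WK, _ = np.linalg.qr(rng.standard_normal((d, d)))
WV, _ = np.linalg.qr(rng.standard_normal((d, d)))
WS = rng.standard_normal(d)
WS /= np.linalg.norm(WS)                               # ||WS||_2 = 1
X = rng.standard_normal((N, d))
X *= c / np.linalg.norm(X)                             # ||X||_F = c
dX = 1e-4 * rng.standard_normal((N, d))
ratio = np.linalg.norm(olsa_head(X + dX, WQ, WK, WV, WS, 1.0, c)
                       - olsa_head(X, WQ, WK, WV, WS, 1.0, c)) / np.linalg.norm(dX)
# ratio stays below 1, consistent with Theorem 4.1
```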
Comparison with the Lipschitz bound in (Kim et al., 2021). In (Kim et al., 2021), the authors also propose a Transformer variant, L2-MHA, and upper bound its Lipschitz constant by α = √N (4ϕ^{-1}(N-1) + 1) · J(W), where J(W) is a term related to the spectral norms of the weight matrices W and ϕ^{-1}(N) grows slower than O(log N). Note that they only provide an upper bound on the Lipschitz constant and do not constrain it to be small. One could also obtain a 1-Lipschitz model from their bound by orthogonalizing the weight matrices (so that J(W) = 1) and re-scaling the output by 1/α. Theoretically, their bound is O(√N log N), which is asymptotically slightly better than our bound O(N).^3 (Their bound also involves an extra output scaling, but this factor comes from the fact that they divide the output F(X) by d/H; for a fair comparison, we do not consider this scaling factor on the output, which corresponds to α_2 in our case.) However, our bound has significantly smaller constant factors than theirs: note the factor of 4 in their bound versus the factor of 1/4 in ours. For example, in the common case N = 20 with input norm bounded by ||x_i||_2 = 1, their bound is √N (4ϕ^{-1}(N-1) + 1) = 32.28, while our bound is 1 + c√N/4 = 6.0. Therefore, we can provide a tighter Lipschitz guarantee and robustness certification in practice. We include a comparison using their bound to construct 1-Lipschitz Transformers in Appendix D and observe that OLSA indeed achieves higher certified robustness than directly adapting (Kim et al., 2021) for certification.

^3 Our bound 1 + c√N/4 is O(N) because we view the embedding norm ||x_i||_2 at each position as similar, so the input bound c = ||X||_F is also O(√N).

4.3. LIPSCHITZ OF POOLING LAYER

In many tasks, we need a pooling layer after several self-attention layers in order to obtain a final embedding vector for the entire sequence: G : R^{n×d} → R^d. The popular approach in standard self-attention models is to add an extra output token to the sequence and use that token's embedding to represent the overall sequence. We find this approach a "waste of resources" in the 1-Lipschitz case: only one token's value is used for pooling while the others are dropped, although that information is usually important for achieving a tight bound. Instead, we propose to average all the embeddings in the pooling layer, i.e., G(X) = (1/N) Σ_i x_i. We show that this approach has good Lipschitz properties:

Theorem 4.2. For the pooling function G(X) = (1/N) Σ_i x_i, we have Lip(G) ≤ 1/√N.

See Appendix B for the proof. Thus, we can further multiply by a factor of √N to obtain a 1-Lipschitz pooling layer and maximize the output expressiveness. The resulting pooling layer is: G(X) = (1/√N) Σ_{i=1}^N x_i.
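A quick numerical check (illustrative NumPy, our own toy data) confirms that the √N-scaled average is 1-Lipschitz, with equality attained when every token receives the same perturbation:

```python
import numpy as np

def pool(X):
    """Scaled average pooling G(X) = (1/sqrt(N)) * sum_i x_i, which is 1-Lipschitz."""
    return X.sum(axis=0) / np.sqrt(X.shape[0])

rng = np.random.default_rng(0)
X, dX = rng.standard_normal((10, 4)), rng.standard_normal((10, 4))
ratio = np.linalg.norm(pool(X + dX) - pool(X)) / np.linalg.norm(dX)

# equality case: the same perturbation applied to every token
dX_eq = np.tile(rng.standard_normal(4), (10, 1))
ratio_eq = np.linalg.norm(pool(X + dX_eq) - pool(X)) / np.linalg.norm(dX_eq)
```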

4.4. OVERALL OLSA TRANSFORMER

Aggregating the layers introduced above, we construct the overall OLSA Transformer, which consists of one word embedding layer, several OLSA layers, one pooling layer, and a final linear layer f : R^d → R^c for prediction, where c is the number of classes. However, there are still some challenges in constructing the OLSA Transformer. The first challenge is how to bound the input norm for each self-attention layer. Note that the output norm of a self-attention layer does not increase, since the output is a weighted average of the processed inputs divided by a factor α_2 > 1. Therefore, the input norm of each layer is bounded as long as the input to the first layer is bounded. Thus, we normalize the embedding vector of each token to norm c, so that the input to each self-attention layer satisfies ||X||_F ≤ √N c. The second challenge is how to emulate the other components of a standard Transformer under the 1-Lipschitz constraint. We remove the LayerNorm and Dropout layers and replace the residual connection with the average y = 0.5x + 0.5f(x). We use GroupSort (Anil et al., 2019) as the activation function, which is shown to work better than ReLU in 1-Lipschitz networks. Let T : R^{N×d} → R^c denote the OLSA Transformer classification model. We can calculate the certified radius on a given input X from T(X) and Lip(T). In particular, we can certify that the model prediction will not change within ||X - X'||_F < r, where:

r = (max_i T(X)_i - max_{j ≠ arg max_i T(X)_i} T(X)_j) / (√2 Lip(T)).

Comparison with the certification in (Shi et al., 2020). As the best existing work on robustness certification for Transformers, (Shi et al., 2020) provides certification under the assumption that only one or two words are perturbed by the adversary.
In particular, their ℓ_2 certification guarantees that the prediction will not change within ||X - X'||_F < r, but with the restriction that X' differs from X in only one or two word-embedding positions. Our work therefore provides a more general ℓ_2 certification. In addition, we explicitly impose the 1-Lipschitz constraint during training, so the model is optimized to have a larger certified radius; by comparison, they certify a vanilla-trained model directly, so their certification may not be tight, especially for relatively deep models.

5. EVALUATION

In this section, we compare our OLSA Transformer with the state-of-the-art baseline on different datasets under different settings. Our model achieves much higher certified robustness under both train-from-scratch and fine-tuning scenarios, and OLSA is much more efficient during the certification process. We also evaluate our model against empirical attacks and observe that OLSA enhances empirical robustness as well.

5.1. EXPERIMENTAL SETUP

Datasets We consider three datasets in our evaluation: Yelp (Zhang et al., 2015), SST (Socher et al., 2013), and QQP (Wang et al., 2018).

Implementation Details We train and evaluate our 1-Lipschitz OLSA Transformer under both train-from-scratch and fine-tuning scenarios. In the train-from-scratch scenario, we randomly initialize the model and word embeddings and train from scratch. In the fine-tuning scenario, we use a pre-trained BERT model (Devlin et al., 2018) and feed its output to a downstream OLSA Transformer model; the BERT model is kept frozen and we only train the downstream model. In the train-from-scratch setting, we use the Yelp and SST datasets, which are also adopted by the baseline (Shi et al., 2020). We consider N-layer models (N ≤ 3) and normalize the word embedding and position embedding to a norm of 2, so that the overall input norm is bounded by ||x_i||_2 ≤ 4. For the fine-tuning setting, we evaluate on the Yelp, SST, and QQP datasets. We normalize the output of BERT to a norm of 2 and use our OLSA Transformer as the downstream model. For both settings, we use 8 attention heads and a hidden dimension of 256. We train the model with batch size 32 for 50 epochs on SST and 10 epochs on Yelp; for QQP we use the Adam optimizer with learning rate 10^{-5}, decayed by 0.1 at the 40-th epoch. We include a certificate regularizer loss (Singla et al., 2021), which adds a term -γ ReLU((T(x)_y - max_{i≠y} T(x)_i) / √2) to maximize the prediction margin, with a gradually increasing γ reaching a final value of γ = 2.0 in the train-from-scratch case. We use γ = 0.0 in the fine-tuning scenario, as the margin is already large there. We do not orthogonalize the final prediction layer; instead, we calculate its Lipschitz constant and include it in the final certification. For the evaluation time, we compare the time to certify one batch on an RTX 3090 GPU.
Baselines There are few baselines for certified robustness of Transformer models; (Shi et al., 2020) achieve the state-of-the-art certification results. Their model is trained with the standard architecture and training algorithm; we fix the word embeddings to be the same as in our model for a fair comparison. To certify robustness within a region, they propose a bound-propagation-based method that tightly bounds the cross-position dependency in the attention mechanism. They evaluate the certification on the Yelp and SST datasets, on which we make the comparison, and we additionally evaluate on the QQP dataset. Note that since the certification time of their approach is long, we only evaluate on a sampled test set of 5% of the instances for Yelp and QQP. The comparison with the Lipschitz bound of Kim et al. (2021) is shown in Appendix D, where we adapt their bound to construct a 1-Lipschitz model and observe that our bound provides a better robustness guarantee.

Evaluation Metrics To evaluate certified robustness, we use the certified radius as the metric, following the setting in (Shi et al., 2020). Given a model T and an input X ∈ R^{N×d} in the word embedding space, the certified radius is defined as the maximum perturbation radius within which we can guarantee that the model prediction does not change:

Rad(T, X) = sup { r : T(X) = T(X') ∀ ||X - X'||_F ≤ r },

and we report the average certified radius over the test dataset.

5.2. TRAINING OLSA FROM SCRATCH

We show the results of our certification for the train-from-scratch models in Table 1. Our OLSA model indeed achieves higher certified robustness than Shi et al., especially with deeper models. We attribute this to the fact that the OLSA network is trained to achieve a tight Lipschitz bound and is therefore more robust; by comparison, Shi et al. verify a vanilla model and therefore cannot guarantee a tight bound for each layer. Note that we cannot directly apply our certification method to their models: their models are not 1-Lipschitz, so the resulting bound would be very loose. As a cost of the improved certified robustness, our models suffer a 1% to 2% vanilla accuracy drop compared with the vanilla (no regularization) model. We view this as an inevitable performance drop, since previous works (Zhang et al., 2019a) show that there exists a trade-off between vanilla accuracy and robustness when comparing vanilla and adversarially robust models. As for the certification time, our certification is over 10,000 times faster than previous works. This is because we only need one forward pass to calculate the prediction gap for certification; by comparison, Shi et al. (2020) perform a binary search and bound propagation at each input position, which requires a large number of forward passes.

Empirical robustness Besides the certified accuracy, we also perform an ℓ_2-PGD attack (Madry et al., 2017) against the models in the word embedding space to check their empirical robustness, as shown in Table 1. OLSA indeed achieves much higher empirical robustness than the vanilla models, confirming that enforcing the 1-Lipschitz constraint helps model robustness.
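For reference, the attack we run follows the standard ℓ_2-PGD recipe. The sketch below (our own minimal NumPy version with a toy linear "model"; step size and toy weights are our assumptions, not the paper's settings) shows the normalized-gradient ascent step and the projection onto the ε-ball around the clean embedding:

```python
import numpy as np

def l2_pgd(x, grad_fn, eps, steps=10, lr=None):
    """Minimal l2-PGD sketch (after Madry et al., 2017) on a continuous input
    such as the word-embedding space. grad_fn returns the loss gradient w.r.t. x."""
    lr = lr if lr is not None else 2.5 * eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + lr * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
        delta = x_adv - x
        n = np.linalg.norm(delta)
        if n > eps:                                           # project onto eps-ball
            x_adv = x + delta * (eps / n)
    return x_adv

# toy linear "model": score = w @ x, attacking the positive class (y = +1)
w = np.array([1.0, -2.0, 0.5])
grad_fn = lambda x: -w            # gradient of the margin loss -y * (w @ x)
x_adv = l2_pgd(np.zeros(3), grad_fn, eps=0.5)
```

On a real model, `grad_fn` would backpropagate the classification loss through the network to the embedding input.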
Ablation studies - Different γ The γ in the certificate regularizer loss controls the trade-off between vanilla accuracy and certified radius: with a larger γ, the model is trained to have a larger output prediction gap and thus a larger certified radius, at a cost in vanilla accuracy. In Table 2, we show the performance for varying values of γ, from which we can indeed observe this trade-off. In particular, even with a small γ we can still achieve a large certified radius, while a larger γ yields a larger one. We set γ = 2.0 as a reasonable choice for the trade-off. Interestingly, the best empirical robustness is achieved at an intermediate value of γ. This may be because empirical robustness is not always aligned with the certified radius and may be affected by the drop in vanilla accuracy.

5.3. FINE-TUNING OLSA OVER PRETRAINED BERT

We show the certified robustness in the fine-tuning scenario in Table 3 . The certified robustness is computed over the BERT output embedding space. We mainly evaluate the 1-layer case as it is uncommon to use multi-layer Transformers on top of BERT for downstream tasks (although we expect a larger performance gap for those models). We can see that our OLSA model again achieves a larger certified radius on all three tasks at a small cost of vanilla accuracy. Also, OLSA achieves higher empirical robustness compared with vanilla models. These results show that our OLSA model can be applied in both train-from-scratch and fine-tuning scenarios to enhance model robustness. We provide the training curve in Appendix F. In addition, we evaluate the model robustness on PAWS-QQP (Zhang et al., 2019c) , an adversarial dataset of QQP, and also observe that OLSA achieves high empirical robustness, as we show in Appendix C.

6. CONCLUSION

In this paper, we propose OLSA, the first 1-Lipschitz self-attention mechanism for sequential input. Based on the OLSA layer and 1-Lipschitz pooling, we propose the first 1-Lipschitz Transformer model for NLP classification tasks. We show that the OLSA Transformer achieves state-of-the-art certified robustness on various tasks under both train-from-scratch and fine-tuning scenarios. Our certification is also significantly more efficient. We believe this work provides a new research direction toward certifiably robust language models.

A PROOF OF THEOREM 4.1

Proof. Consider two inputs X and X + ∆X, where ||X||_F ≤ c, ||X + ∆X||_F ≤ c, and ||∆X||_F ≤ δ. We use the symbol ∆ to denote the change between X and X + ∆X when this causes no confusion; for example, ∆Y = F(X + ∆X) - F(X). Our goal is to bound the difference at the output layer, ||∆Y||_F. First, we define symbols for the intermediate results:

Q = Concat([W^Q_1 X, . . . , W^Q_H X]) ∈ R^{N×d}
K = Concat([W^K_1 X, . . . , W^K_H X]) ∈ R^{N×d}
S^h = [s^h_ij] ∈ R^{N×N}
P^h = [p^h_ij] ∈ R^{N×N}
V^h = W^V_h X ∈ R^{N×(d/H)}
V = Concat([W^V_1 X, . . . , W^V_H X]) ∈ R^{N×d}
Ỹ^h = P^h V^h ∈ R^{N×(d/H)}

We now bound ||∆Y||_F by bounding the perturbations ∆Q, ∆K, ∆S^h, and ∆P^h step by step.

Step 1: ||∆Q||_F ≤ δ, ||∆K||_F ≤ δ, and ||∆V||_F ≤ δ. Observe that Q = W^Q X, K = W^K X, and V = W^V X. Since all these matrices are orthogonal, the linear operations are 1-Lipschitz. Thus ||∆Q||_F ≤ ||∆X||_F ≤ δ, and the same holds for ∆K and ∆V.

Step 2: √(Σ_h ||∆S^h||_F^2) ≤ (√N / α_1) δ. First, notice that the operation W^S_h σ(·) is 1-Lipschitz, since ||W^S_h||_2 = 1 and σ is 1-Lipschitz. Next, notice that every entry of Q and K appears N times in the calculation of all the s^h_ij. Therefore, the overall change Σ_{i,j,h} (∆s^h_ij)^2 is at most N / (2α_1)^2 times the overall squared change of Q and K. So:

Σ_h ||∆S^h||_F^2 ≤ (N / (2α_1)^2) (||∆Q||_F + ||∆K||_F)^2, hence √(Σ_h ||∆S^h||_F^2) ≤ (√N / α_1) δ.

Step 3: √(Σ_h ||∆P^h||_F^2) ≤ (√N / (4α_1)) δ. This follows from the fact that the Lipschitz constant of the softmax function is 1/4.

Step 4: ||∆Ỹ^h||_F ≤ ||∆V^h||_F + c ||∆P^h||_F. The change in Ỹ^h comes from both V^h and P^h, and the inequalities ||AB||_F ≤ ||A||_2 ||B||_F ≤ ||A||_F ||B||_F hold. Therefore:

||∆Ỹ^h||_F = ||P^h ∆V^h + ∆P^h (V^h + ∆V^h)||_F ≤ ||P^h||_2 ||∆V^h||_F + ||∆P^h||_F ||V^h + ∆V^h||_F ≤ ||∆V^h||_F + c ||∆P^h||_F.

The last inequality holds because (1) P^h is a matrix of softmax outputs, so its spectral norm is at most 1; and (2) ||V^h + ∆V^h||_F ≤ ||X + ∆X||_F ≤ c, since V comes from an orthogonal transformation of X.

Step 5: ||∆Y||_F ≤ (1/α_2)(1 + c√N / (4α_1)) δ. Notice that Y = (1/α_2) Concat([Ỹ^1, . . . , Ỹ^H]). In addition, we have the inequality √(Σ_h (a_h + b_h)^2) ≤ √(Σ_h a_h^2) + √(Σ_h b_h^2) from Cauchy's inequality. Therefore:

||∆Y||_F ≤ (1/α_2) √(Σ_h ||∆Ỹ^h||_F^2)
        ≤ (1/α_2) √(Σ_h (||∆V^h||_F + c ||∆P^h||_F)^2)
        ≤ (1/α_2) (√(Σ_h ||∆V^h||_F^2) + √(Σ_h c^2 ||∆P^h||_F^2))
        ≤ (1/α_2) (1 + c√N / (4α_1)) δ.

B PROOF OF THEOREM 4.2

Proof. Let δ = ||∆X||_F = √(Σ_{i=1}^N ||∆x_i||_2^2) be the perturbation of the input. We have:

||∆G(X)||_2 = (1/N) ||Σ_{i=1}^N ∆x_i||_2 ≤ (1/N) Σ_{i=1}^N ||∆x_i||_2 ≤ (1/N) √(N Σ_{i=1}^N ||∆x_i||_2^2) = (1/√N) δ,

where the last inequality is derived from Cauchy's inequality.

C EVALUATION ON PAWS-QQP

We show the evaluation results where we fine-tune the model on QQP and evaluate on the adversarial dataset PAWS-QQP (Zhang et al., 2019c) in Table 4. We observe that our OLSA model still achieves better performance for any choice of γ. In particular, a larger γ provides better adversarial robustness, but it may hurt the vanilla accuracy.
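The two elementary inequalities used in Steps 4 and 5, ||AB||_F ≤ ||A||_2 ||B||_F and √(Σ(a+b)^2) ≤ √(Σ a^2) + √(Σ b^2), can be sanity-checked numerically (illustrative NumPy with our own random data):

```python
import numpy as np

rng = np.random.default_rng(0)
fro = np.linalg.norm

# Step 4 ingredient: ||A B||_F <= ||A||_2 ||B||_F
A, B = rng.standard_normal((5, 5)), rng.standard_normal((5, 7))
lhs_4, rhs_4 = fro(A @ B), np.linalg.norm(A, 2) * fro(B)

# Step 5 ingredient: sqrt(sum (a_h + b_h)^2) <= sqrt(sum a_h^2) + sqrt(sum b_h^2)
a, b = rng.random(8), rng.random(8)
lhs_5 = np.sqrt(((a + b) ** 2).sum())
rhs_5 = np.sqrt((a ** 2).sum()) + np.sqrt((b ** 2).sum())
```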

D COMPARISON WITH (KIM ET AL., 2021)

We compare our bound with the one in (Kim et al., 2021) in Table 5 for the train-from-scratch scenario and Table 6 for the fine-tuning scenario. In particular, we follow their L2-MHA pipeline and (1) orthogonalize all the linear matrices and (2) divide the output by their Lipschitz constant bound, in order to obtain a 1-Lipschitz self-attention layer. All other settings are the same as for OLSA. We can observe that our model achieves both a larger certified radius and better vanilla accuracy. This is understandable since, as discussed, our bound is tighter in practice.
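The two normalization steps of this baseline pipeline can be sketched as follows. This is a minimal illustration, not the released L2-MHA code; `orthogonalize` and `normalize_layer` are illustrative names, and the SVD projection is one standard way to obtain the nearest (semi-)orthogonal matrix:

```python
import numpy as np

def orthogonalize(W: np.ndarray) -> np.ndarray:
    """Project W onto the (semi-)orthogonal matrices via SVD:
    W = U S V^T  ->  U V^T (drop the singular values)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def normalize_layer(output: np.ndarray, lip_bound: float) -> np.ndarray:
    """Divide a layer's output by an upper bound on its Lipschitz
    constant, so the rescaled layer is guaranteed 1-Lipschitz."""
    return output / lip_bound

# The projected matrix is orthogonal up to floating-point error.
W_orth = orthogonalize(np.random.default_rng(1).normal(size=(8, 8)))
assert np.allclose(W_orth @ W_orth.T, np.eye(8), atol=1e-8)
```

Dividing by the Lipschitz bound makes the layer 1-Lipschitz by construction, but a loose bound shrinks the useful signal, which is one source of the smaller certified radius observed for this baseline.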

E PERFORMANCE WITH DIFFERENT LAYERS OF BERT

In the previous evaluation, we use the standard 12-layer BERT architecture followed by a downstream self-attention layer for fine-tuning. In Table 7, we show the results of using the output of the 11-th layer of BERT and concatenating it with the downstream layer for fine-tuning. We view this architecture as a fair analogue of BERT, since it also consists of 12 layers in total. We can observe a slight performance drop compared with the original version, which is understandable since we only use part of the pre-trained BERT. Nevertheless, our model still performs well under these circumstances.

F TRAINING AND CERTIFICATION CURVE

We show the curves of vanilla accuracy and certified radius for the OLSA model on Yelp during training under both scenarios in Figure 1. We make two observations. First, the model achieves good vanilla accuracy in the early epochs, while the certified radius gradually increases over the course of training. Note that γ is gradually increased during training, which may hurt vanilla accuracy, so the relatively stable accuracy indicates that the model trains well. Second, the training and test performance follow a similar trend: the 1-Lipschitz constraint is a strong regularizer, so the model does not overfit the training set, which is desirable.

G ABLATION STUDY

In Table 8, we show the performance of the OLSA network with different features of traditional Transformers, including Dropout, ReLU and LayerNorm. In particular, 1) for Dropout, we use the PyTorch implementation, which randomly drops some neurons and scales up the others during training, and performs the identity transformation during evaluation; 2) for ReLU, we replace all GroupSort activations with ReLU activations; 3) for LayerNorm, we only include the bias term and do not use the variance term, which is also the choice in Shi et al. (2020). We can observe that these features have little impact on the final model performance.
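For reference, GroupSort (the activation our model uses in place of ReLU) simply sorts values within fixed-size groups of coordinates; since this is a permutation of the input within each group, it is 1-Lipschitz and norm-preserving. The sketch below is an illustrative NumPy implementation, not our training code:

```python
import numpy as np

def groupsort(x: np.ndarray, group_size: int = 2) -> np.ndarray:
    """Sort entries within consecutive groups along the last axis.
    With group_size=2 this is the MaxMin activation. Because the output
    is a within-group permutation of the input, the Euclidean norm is
    preserved exactly, so the activation is 1-Lipschitz."""
    *batch, d = x.shape
    assert d % group_size == 0
    groups = x.reshape(*batch, d // group_size, group_size)
    return np.sort(groups, axis=-1).reshape(*batch, d)

x = np.array([[3.0, -1.0, 2.0, 5.0]])
y = groupsort(x)  # groups (3, -1) and (2, 5) become (-1, 3) and (2, 5)
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```

Unlike ReLU, which discards the magnitude of negative inputs, GroupSort keeps all gradient norm, which is why it is the standard choice in Lipschitz-constrained networks.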

H DEEPER MODELS

We provide results with 4/5/6-layer Transformers on the SST-2 dataset in Table 9. We can observe that our model still outperforms Shi et al. (2020). In addition, the performance drops slightly with depth, since training deeper models on a single task usually does not yield better performance.

I EVALUATION ON ADVGLUE

To validate the empirical performance of our work, we evaluate our approach on the AdvGLUE (Wang et al., 2021b) dataset, which consists of adversarial examples on SST and QQP generated with different existing attacks. We show the results in Table 10. We can observe that our approach indeed performs better on these adversarially attacked datasets, which shows that the Lipschitz constraint is helpful in improving model robustness. As a recent baseline, Bonaert et al. (2021) improves over Shi et al. (2020) in certifying the robustness of a Transformer with bound propagation techniques. We evaluate their certification approach and provide the comparison of average certified radius in Table 11. Note that the models in that work are the same as in Shi et al. (2020), so their vanilla accuracy is the same as shown in Table 1. We can observe that our approach still outperforms this work, since their approach certifies a vanilla-trained model, which intrinsically does not have good robustness.
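For a Lipschitz-bounded classifier, the certified radius can be computed from the output logits and the Lipschitz constant alone. The sketch below uses the standard margin-based certificate in the style of Tsuzuku et al. (2018); the exact constant may differ from the formula used in our implementation, so treat this as an illustrative assumption:

```python
import numpy as np

def certified_radius(logits: np.ndarray, label: int, lip_const: float) -> float:
    """Margin-based l2 certificate for an L-Lipschitz classifier:
    r = (f_y - max_{j != y} f_j) / (sqrt(2) * L).
    Any perturbation with ||dx||_2 < r cannot change the predicted class,
    because each logit moves by at most L * ||dx||_2."""
    others = np.delete(logits, label)
    margin = logits[label] - others.max()
    return max(margin, 0.0) / (np.sqrt(2.0) * lip_const)

# Margin of 1.5 with Lipschitz constant 1 certifies radius 1.5 / sqrt(2).
r = certified_radius(np.array([2.0, 0.5]), label=0, lip_const=1.0)
assert np.isclose(r, 1.5 / np.sqrt(2.0))
```

This is why certification with a 1-Lipschitz model costs only one forward pass, in contrast to bound-propagation approaches that must re-derive bounds per input.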



Strictly speaking, we require $W$ to be semi-orthogonal when $n \neq m$, i.e. either $WW^\top = I$ or $W^\top W = I$. In (Kim et al., 2021, Theorem 3.2), the bound has a factor of $\frac{1}{\sqrt{d/H}}$.



Figure 1: The curves of vanilla accuracy and certified radius of a 1-layer OLSA on the Yelp dataset under the train-from-scratch and fine-tuning settings.

Kim et al. (2021) indeed focuses on the Lipschitz constant of Transformer models. They propose to use the ℓ2 distance instead of multiplication in the attention mechanism and show that the Lipschitz constant of such a variant can be bounded.

Yelp consists of 560,000/38,000 examples in the training/test sets; SST consists of 67,349/872/1,821 examples in the training/development/test sets; QQP consists of 363,846/40,430 examples in the training/test sets. Each example in Yelp and SST is a sentence labelled with a binary class for its sentiment; each example in QQP consists of two Quora questions labelled with a binary value indicating whether the two questions are equivalent.

Table 1: Certified radius of OLSA and the previous state of the art in the train-from-scratch scenario. The certification time is evaluated on a batch of test data.

Table 2: Ablation study of a 1-layer OLSA Transformer with different γ in the certificate regularizer.

Table 3: Certified radius of 1-layer OLSA and the previous state of the art in the fine-tuning scenario, where we keep the pre-trained BERT model unchanged and only tune the downstream model.

Table 4: The results when we fine-tune on QQP and evaluate on the PAWS-QQP dataset.

Table 5: Comparison of certification on a 1-Lipschitz model using Kim et al. (2021).

Table 6: Comparison of certification on a 1-Lipschitz pre-trained model using Kim et al. (2021).

Table 7: Certified radius of 1-layer OLSA and the previous state of the art in the fine-tuning scenario, where we use the 11-layer BERT output for OLSA.

Table 8: The results of the ablation study on the SST-2 dataset.

Table 9: The results of deeper models on the SST-2 dataset (columns: Depth, Approach, Vanilla Accuracy, Certified Radius).

Table 10: The accuracy on the AdvGLUE dataset.

Table 11: The certified radius comparison with Bonaert et al. (2021) on the SST and Yelp datasets.

