CERTIFIABLY ROBUST TRANSFORMERS WITH 1-LIPSCHITZ SELF-ATTENTION

Abstract

Recent works have shown that neural networks with Lipschitz constraints lead to high adversarial robustness. In this work, we propose the first One-Lipschitz Self-Attention (OLSA) mechanism for Transformer models. In particular, we first orthogonalize all the linear operations in the self-attention mechanism. We then bound the overall Lipschitz constant by aggregating the Lipschitz constant of each element in the softmax weighted sum. Based on the proposed self-attention mechanism, we construct an OLSA Transformer that achieves deterministic certified robustness. We evaluate our model on multiple natural language processing (NLP) tasks and show that it outperforms existing certification methods for Transformers, especially for models with multiple layers. As an example, for 3-layer Transformers we achieve ℓ2 deterministic certified robustness radii of 1.733 and 0.979 on the word embedding space for the Yelp and SST datasets, while the existing SOTA certification baseline on the same embedding space can only achieve 0.061 and 0.110. In addition, our certification is significantly more efficient than previous works, since we only need the output logits and the Lipschitz constant for certification. We also fine-tune our OLSA Transformer as a downstream classifier of a pre-trained BERT model and show that it achieves significantly higher certified robustness on the BERT embedding space compared with previous works (e.g., from 0.071 to 0.368 on the QQP dataset).

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied in different domains in recent years, including face recognition (He et al., 2016), machine translation (Bahdanau et al., 2014), and recommendation systems (Zhang et al., 2019b). On natural language processing (NLP) tasks in particular, Transformer models (Vaswani et al., 2017) have been proposed and have achieved outstanding performance on a variety of tasks. Despite this impressive performance, these NLP models have been shown to suffer from adversarial attacks (Zhang et al., 2020), where an adversary can intentionally inject unnoticeable perturbations into the inputs to fool the model into providing incorrect predictions. Several works have been proposed to improve the empirical robustness of Transformers (Alzantot et al., 2018), but few have studied their certified robustness, i.e., a theoretical guarantee that the model cannot be attacked under certain conditions (e.g., within some perturbation range). Recently, Shi et al. (2020) propose to rely on bound-propagation techniques to derive certified robustness for Transformers, which leads to a relatively loose bound and cannot certify deep models, given the looseness induced by propagating through each component of the attention.

In this work, we propose a One-Lipschitz Self-Attention (OLSA) algorithm which provides a robustness certificate for Transformers by bounding the Lipschitz constant of the model. The Lipschitz constant of a model is naturally related to its robustness, as both require that the model's output should not change much when the input changes slightly. Previous works (Tsuzuku et al., 2018; Singla and Feizi, 2021) have investigated the 1-Lipschitz property of fully-connected and convolutional neural networks, but 1-Lipschitz Transformers remain unexplored, as the complicated non-linear self-attention mechanism is difficult to analyze and constrain.
Thus, in this work, we propose the first 1-Lipschitz Transformer network, which allows us to achieve tighter deterministic certified robustness against adversarial attacks under different settings (e.g., training from scratch and fine-tuning). In order to bound the Lipschitz constant of a self-attention layer, we first enforce all the linear operations (keys, queries, values) to be orthogonal via re-parametrization techniques (Huang et al., 2020). Next, we upper bound the input norm by normalizing the word embedding layer. As a result, we are able to bound the overall Lipschitz constant by aggregating the change in each component of the softmax weighted sum. Finally, we add scaling factors to ensure 1-Lipschitzness of the OLSA layer. In addition, we also bound the Lipschitz constant of the pooling layer and aggregate the components to obtain the final OLSA Transformer classification model. We evaluate our OLSA Transformer model under both train-from-scratch and fine-tuning scenarios. In both settings, we show that OLSA achieves significantly higher certified robustness compared with existing bound-propagation-based methods (Shi et al., 2020). The improvement is especially large on deeper models. For example, a 3-layer OLSA Transformer achieves an average certified radius of 1.733 on Yelp, while previous works can only achieve 0.061 under the train-from-scratch setting. When fine-tuning over a pre-trained BERT model, the OLSA Transformer achieves a radius of 0.368 on the QQP dataset while previous works can only achieve 0.071. In addition, we show that our certification is 10,000× faster than previous approaches, since we do not need complicated bound propagation processes and only need one forward pass to perform the certification. Finally, we also evaluate different methods under adversarial attacks and show that OLSA achieves much higher empirical robustness than the baselines as well.
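The orthogonalization step above can be sketched as follows. This is a NumPy-only illustration using the Cayley map, a standard way to re-parametrize an unconstrained matrix as an orthogonal one; the paper follows Huang et al. (2020), whose exact construction may differ, so treat this as an assumption-laden sketch rather than the paper's implementation.

```python
import numpy as np

def cayley_orthogonalize(w: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix to an orthogonal one.

    A = W - W^T is skew-symmetric, so Q = (I - A)^{-1}(I + A) is orthogonal.
    The map is differentiable, so gradients can flow through it during training.
    """
    n = w.shape[0]
    a = w - w.T                        # skew-symmetric part
    i = np.eye(n)
    return np.linalg.solve(i - a, i + a)

rng = np.random.default_rng(0)
q = cayley_orthogonalize(rng.normal(size=(8, 8)))
# Orthogonality implies spectral norm 1, so x -> Qx is exactly 1-Lipschitz.
print(np.allclose(q.T @ q, np.eye(8)))   # True
```

Because `Q` is orthogonal by construction, applying it to the key, query, and value projections keeps each linear operation 1-Lipschitz without any projection step after gradient updates.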
Meanwhile, we acknowledge a 1% to 2% drop in clean accuracy for OLSA, as the extra 1-Lipschitz constraint slightly limits model expressiveness.

Technical contributions. We summarize our contributions as follows:

• We propose the first One-Lipschitz Self-Attention mechanism (OLSA) and prove its Lipschitz bound with corresponding analysis.

• We evaluate the proposed OLSA Transformer model on various NLP tasks and observe that it outperforms the state-of-the-art baselines. In particular, the performance gap is larger on deeper models (e.g., on 3-layer Transformers on Yelp, we achieve over 25× the average certified radius of previous works).

• The OLSA model requires significantly less time to certify the robustness radius, as it only requires a forward pass to calculate the prediction gap.
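The one-forward-pass certification in the last contribution can be sketched as follows. The sketch assumes the standard margin-based certificate for Lipschitz classifiers (Tsuzuku et al., 2018): for an L-Lipschitz model, an ℓ2 perturbation of norm ε shifts each logit by at most L·ε, so the prediction is stable whenever the gap between the top two logits exceeds √2·L·ε. The function name and toy logits are illustrative, not from the paper.

```python
import numpy as np

def certified_radius(logits: np.ndarray, lipschitz: float) -> float:
    """Largest l2 perturbation radius for which the argmax cannot change."""
    runner_up, top = np.sort(logits)[-2:]    # two largest logits
    margin = top - runner_up                 # prediction gap
    return margin / (np.sqrt(2.0) * lipschitz)

logits = np.array([2.5, -0.3, 0.7])          # toy output of a 1-Lipschitz model
r = certified_radius(logits, lipschitz=1.0)
print(round(r, 3))                           # → 1.273
```

Since the certificate needs only the output logits and the (precomputed) Lipschitz constant, it avoids the per-input bound propagation that makes prior certification methods expensive.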

2. RELATED WORK

Adversarial Robustness for NLP Models Existing works have shown that NLP models suffer from adversarial attacks (Zhang et al., 2019c; 2020). Adversarial training-based approaches have been proposed to enhance model robustness during training (Alzantot et al., 2018). In particular, Ren et al. propose to generate adversarial examples with word saliency information, and Wang et al. propose a fast gradient projection method to improve the efficiency of adversarial training. Besides these empirical robustness algorithms, different approaches have been proposed to provide certified robustness for NLP models with smoothing techniques (Ye et al., 2020; Wang et al., 2021a) or bound-propagation techniques (Jia et al., 2019; Shi et al., 2020). However, the smoothing techniques cannot provide deterministic certification, while the bound-propagation techniques are relatively loose and cannot certify deep models. Recently, Xu et al. (2020) also propose a bound-propagation-based technique for NLP models; their certification is against word substitution attacks and does not directly apply to our scenario.

Lipschitz-constrained Models and Certified Robustness Lipschitz-constrained models have been studied for their smoothness and robustness; however, existing works all focus on constraining the Lipschitz constant of fully-connected and convolutional neural networks. Tsuzuku et al. (2018) first bridge the Lipschitz constant of a network with its robustness and propose a Lipschitz-related loss to improve model robustness. In order to achieve 1-Lipschitzness, multiple works (Cisse et al., 2017; Miyato et al., 2018; Qian and Wegman, 2018) propose to regularize the spectral norm of the weight matrices of fully-connected layers so that the Lipschitz constant is smaller than 1. For convolutional neural networks, a simple approach unrolls the convolution into an equivalent linear layer, but this is shown to yield a loose Lipschitz bound (Wang et al., 2020). Recent works (Li et al., 2019; Trockman and Kolter, 2021; Singla and Feizi, 2021) have proposed to directly parametrize orthogonal convolution layers, which are 1-Lipschitz by construction.
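The spectral-norm regularization idea cited above (Cisse et al., 2017; Miyato et al., 2018) can be sketched as follows: estimate the largest singular value of a weight matrix by power iteration, then rescale the matrix so the corresponding linear layer is at most 1-Lipschitz. This is a NumPy-only illustration with illustrative function names, not any cited work's exact implementation.

```python
import numpy as np

def spectral_norm(w: np.ndarray, n_iter: int = 50) -> float:
    """Estimate ||W||_2 (largest singular value) via power iteration."""
    u = np.ones(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    return float(u @ w @ v)            # u^T W v -> sigma_max at convergence

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))
w_1lip = w / max(1.0, spectral_norm(w))   # rescale only if sigma_max > 1
print(np.linalg.norm(w_1lip, 2) <= 1.0 + 1e-6)   # True
```

Rescaling by the spectral norm bounds each layer's Lipschitz constant by 1, but (unlike the orthogonal parametrizations discussed next to it) it can shrink all singular values and thus lose gradient signal, which is one motivation for the orthogonal approaches.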

