CERTIFIABLY ROBUST TRANSFORMERS WITH 1-LIPSCHITZ SELF-ATTENTION

Abstract

Recent works have shown that neural networks with Lipschitz constraints can achieve high adversarial robustness. In this work, we propose the first One-Lipschitz Self-Attention (OLSA) mechanism for Transformer models. In particular, we first orthogonalize all the linear operations in the self-attention mechanism. We then bound the overall Lipschitz constant by aggregating the Lipschitz constant of each element in the softmax via a weighted sum. Based on the proposed self-attention mechanism, we construct an OLSA Transformer that achieves deterministic certified robustness. We evaluate our model on multiple natural language processing (NLP) tasks and show that it outperforms existing certification methods for Transformers, especially for models with multiple layers. For example, for 3-layer Transformers we achieve ℓ2 deterministic certified robustness radii of 1.733 and 0.979 on the word embedding space for the Yelp and SST datasets, while the existing SOTA certification baseline on the same embedding space achieves only 0.061 and 0.110. In addition, our certification is significantly more efficient than previous works, since we only need the output logits and the Lipschitz constant to certify. We also fine-tune our OLSA Transformer as a downstream classifier of a pre-trained BERT model and show that it achieves significantly higher certified robustness on the BERT embedding space compared with previous works (e.g., from 0.071 to 0.368 on the QQP dataset).

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied in different domains in recent years, including face recognition (He et al., 2016), machine translation (Bahdanau et al., 2014), and recommendation systems (Zhang et al., 2019b). On natural language processing (NLP) tasks in particular, Transformer models (Vaswani et al., 2017) have achieved outstanding performance across a variety of tasks. Despite their impressive performance, these NLP models have been shown to suffer from adversarial attacks (Zhang et al., 2020), where an adversary intentionally injects unnoticeable perturbations into the inputs to fool the model into making incorrect predictions. Several works have been proposed to improve the empirical robustness of Transformers (Alzantot et al., 2018), but few have studied their certified robustness, i.e., a theoretical guarantee that the model cannot be attacked under certain conditions (e.g., within some perturbation range). Recently, Shi et al. (2020) proposed to rely on bound-propagation techniques to derive certified robustness for Transformers; however, this approach yields a relatively loose bound and cannot certify deep models, given the looseness accumulated by propagating through each component of the attention.

In this work, we propose a One-Lipschitz Self-Attention (OLSA) algorithm that provides a robustness certificate for Transformers by bounding the Lipschitz constant of the model. The Lipschitz constant of a model is naturally related to its robustness, as both require that the model's output not change much when the input changes slightly. Previous works (Tsuzuku et al., 2018; Singla and Feizi, 2021) have investigated the 1-Lipschitz property of fully-connected and convolutional neural networks, but 1-Lipschitz Transformers remain unexplored, as their complicated non-linear self-attention mechanism is difficult to analyze and constrain.
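To make the connection between the Lipschitz constant and certification concrete, the standard margin-based argument (in the spirit of Tsuzuku et al., 2018) can be sketched as follows. This is an illustrative sketch, not the paper's exact procedure; the function name and the √2 factor are assumptions taken from the common formulation:

```python
import numpy as np

def certified_radius(logits, lip_const):
    """Deterministic l2 certified radius from a global Lipschitz bound.

    If the network is L-Lipschitz in the l2 norm, an input perturbation of
    norm r can change each logit by at most L * r, so the predicted class
    cannot flip as long as r < (top1 - top2) / (sqrt(2) * L). The sqrt(2)
    factor covers the worst case of the top two logits moving in opposite
    directions (some formulations use 2 * L instead).
    """
    sorted_logits = np.sort(logits)[::-1]       # descending order
    margin = sorted_logits[0] - sorted_logits[1]
    return margin / (np.sqrt(2.0) * lip_const)

# With margin 2.0 and Lipschitz constant 1, the radius is 2 / sqrt(2).
r = certified_radius(np.array([3.0, 1.0, 0.5]), 1.0)
```

Note that the certificate only requires the output logits and the Lipschitz constant, which is why certification in this style is cheap compared with per-input bound propagation.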
Thus, in this work, we propose the first 1-Lipschitz Transformer network, which allows us to achieve tighter deterministic certified robustness against adversarial attacks under different settings (e.g., training from scratch and fine-tuning).
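As a minimal illustration of the "orthogonalize all linear operations" ingredient mentioned above, one standard way to make a linear map exactly 1-Lipschitz in the ℓ2 norm is to project its weight matrix onto the nearest orthogonal matrix via the polar decomposition. This is a sketch of one possible parameterization, not necessarily the one used in OLSA:

```python
import numpy as np

def orthogonalize(w):
    """Project a square weight matrix onto the nearest orthogonal matrix
    (polar decomposition computed via SVD). An orthogonal matrix has
    spectral norm exactly 1, so the resulting linear layer is 1-Lipschitz
    in the l2 norm.
    """
    u, _, vt = np.linalg.svd(w)
    return u @ vt

rng = np.random.default_rng(0)
w = orthogonalize(rng.normal(size=(4, 4)))
# w.T @ w is (numerically) the identity, and ||w||_2 = 1.
spec_norm = np.linalg.norm(w, ord=2)
```

Composing such 1-Lipschitz layers keeps the product of layer-wise Lipschitz constants at 1, which is what prevents the bound from exploding with depth; the harder part, addressed by OLSA, is constraining the non-linear softmax attention itself.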

