ROBUSTIFY TRANSFORMERS WITH ROBUST KERNEL DENSITY ESTIMATION

Abstract

Recent advances in Transformer architectures have empowered their empirical success in various tasks across different domains. However, existing works mainly focus on improving the standard accuracy and computational cost, without considering robustness to contaminated samples. Existing work (Nguyen et al., 2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on the well-known kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods in the self-attention mechanism, to alleviate the issue of data contamination by down-weighting the contribution of bad samples in the estimation process. The modified self-attention mechanism can be incorporated into different Transformer variants. Empirical results on language modeling and image classification tasks demonstrate the effectiveness of this approach.

1. INTRODUCTION

Attention mechanisms and transformers (Vaswani et al., 2017) have been widely used in the machine learning community (Lin et al., 2021; Tay et al., 2020; Khan et al., 2021). Transformer-based models are now among the best deep learning architectures on a variety of applications, including those in natural language processing (Devlin et al., 2019; Al-Rfou et al., 2019; Dai et al., 2019; Child et al., 2019; Raffel et al., 2020; Baevski & Auli, 2019; Brown et al., 2020; Dehghani et al., 2019), computer vision (Dosovitskiy et al., 2021; Liu et al., 2021; Touvron et al., 2021a; Ramesh et al., 2021; Radford et al., 2021; Fan et al., 2021; Liu et al., 2022), and reinforcement learning (Chen et al., 2021; Janner et al., 2021). Transformers have also been well-known for their effectiveness in transferring knowledge from pretraining tasks to downstream applications with weak supervision or no supervision (Radford et al., 2018; 2019; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019).

Contribution. Despite their appealing performance, the robustness of the conventional attention module still remains an open question in the literature. In this paper, to robustify the attention mechanism and transformer models, we first revisit the interpretation of self-attention in the transformer as the Nadaraya-Watson (NW) estimator (Nadaraya, 1964) in a non-parametric regression problem, following the recent work of Nguyen et al. (2022). Put in the context of transformers, the NW estimator is constructed mainly from kernel density estimators (KDE) of the keys and queries. However, the KDE is not robust to outliers (Kim & Scott, 2012), which leads to a robustness issue for the NW estimator and for self-attention in transformers when there are outliers in the data. To improve the robustness of the KDE, we first show that the KDE can be viewed as an optimal solution of a kernel regression problem in a reproducing kernel Hilbert space (RKHS). Then, to robustify the KDE, we can either robustify the loss function of the kernel regression problem via a robust loss function, such as the well-known Huber loss (Huber, 1992), or reweight the contaminated densities via scaling and projecting the original densities. The resulting family of robust KDEs can be used to construct a set of novel robust attentions in the transformer, which also alleviates the robustness issue of the transformer. In summary, our contribution is two-fold:

• By connecting the dot-product self-attention mechanism in transformers with a nonparametric kernel regression problem in a reproducing kernel Hilbert space (RKHS), we propose a novel robust transformer framework, based on replacing the dot-product attention with an attention arising from a set of robust kernel density estimators associated with the robust kernel regression problem. Compared to the standard softmax transformer, the family of robustified transformers only requires computing an extra set of weights.

• Extensive experiments on both vision and language modeling tasks demonstrate that our proposed framework performs favorably under various attacks. Furthermore, the proposed robust transformer framework is flexible and can be incorporated into different Transformer variants.

Organization. The paper is organized as follows. In Section 2, we provide background on the self-attention mechanism in the Transformer and its connection to the Nadaraya-Watson (NW) estimator in the nonparametric regression problem, which can be constructed via KDE. In Section 3, we first connect the KDE to a kernel regression problem in the RKHS and demonstrate that it is not robust to outliers. Then, we construct the robust self-attention mechanism for the Transformer by leveraging a set of robust KDE methods. We empirically validate the advantage of the proposed robust self-attention mechanism over the standard softmax transformer and other baselines on both language modeling and image classification tasks in Section 4. Finally, we discuss related work in Section 5 and conclude the paper in Section 6.
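To make the reweighting idea above concrete, the following is a minimal NumPy sketch of an iteratively reweighted robust KDE in the spirit of Kim & Scott (2012): each sample's weight is shrunk by a Huber-type function of its RKHS distance to the current weighted kernel mean, so outliers contribute less to the density estimate. All function names, the bandwidth `sigma`, and the threshold `delta` are our own illustrative choices, not the paper's exact estimator.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def robust_kde_weights(X, sigma=1.0, delta=0.5, n_iter=30):
    """Iteratively reweighted KDE weights (a Kim & Scott-style sketch).

    The squared RKHS distance of Phi(x_j) to the weighted mean sum_i w_i Phi(x_i)
    is k(x_j, x_j) - 2 (K w)_j + w^T K w; samples far from the mean get a
    Huber-damped weight delta / d instead of 1.
    """
    N = len(X)
    K = gaussian_kernel(X, X, sigma)
    w = np.full(N, 1.0 / N)                     # start from the standard KDE
    for _ in range(n_iter):
        d2 = np.diag(K) - 2 * K @ w + w @ K @ w
        d = np.sqrt(np.maximum(d2, 1e-12))
        u = np.where(d <= delta, 1.0, delta / d)  # Huber psi(d) / d
        w = u / u.sum()
    return w

def kde(x_query, X, w, sigma=1.0):
    """Weighted KDE evaluated at query points: f(x) = sum_j w_j k(x, x_j)."""
    return gaussian_kernel(x_query, X, sigma) @ w
```

On a toy cluster with a single far-away point, the outlier's final weight falls well below the uniform value 1/N, which is exactly the down-weighting effect the robust attention construction relies on.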

2. BACKGROUND: SELF-ATTENTION MECHANISM FROM A NON-PARAMETRIC REGRESSION PERSPECTIVE

In this section, we first provide background on the self-attention mechanism in the transformer in Section 2.1. We then revisit the connection between self-attention and the Nadaraya-Watson estimator in a nonparametric regression problem in Section 2.2.

2.1. SELF-ATTENTION MECHANISM

Given an input sequence $X = [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors, self-attention transforms it into another sequence $H := [h_1, \dots, h_N]^\top \in \mathbb{R}^{N \times D_v}$ as follows:

$$h_i = \sum_{j \in [N]} \mathrm{softmax}\Big(\frac{q_i^\top k_j}{\sqrt{D}}\Big) v_j, \quad \text{for } i = 1, \dots, N, \qquad (1)$$

where the scalar $\mathrm{softmax}((q_i^\top k_j)/\sqrt{D})$ can be understood as the attention $h_i$ pays to the input feature $x_j$. The vectors $q_i$, $k_j$, and $v_j$ are the query, key, and value vectors, respectively, and are computed as follows:

$$[q_1, q_2, \dots, q_N]^\top := Q = XW_Q^\top \in \mathbb{R}^{N \times D}, \quad [k_1, k_2, \dots, k_N]^\top := K = XW_K^\top \in \mathbb{R}^{N \times D}, \quad [v_1, v_2, \dots, v_N]^\top := V = XW_V^\top \in \mathbb{R}^{N \times D_v}, \qquad (2)$$

where $W_Q, W_K \in \mathbb{R}^{D \times D_x}$ and $W_V \in \mathbb{R}^{D_v \times D_x}$ are the weight matrices. Equation 1 can be written as

$$H = \mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{D}}\Big) V, \qquad (3)$$

where the softmax function is applied to each row of the matrix $(QK^\top)/\sqrt{D}$. Equation 3 is also called the "softmax attention". For each query vector $q_i$, $i = 1, \dots, N$, an equivalent form of equation 3 to compute the output vector $h_i$ is given by

$$h_i = \sum_{j \in [N]} \mathrm{softmax}\Big(\frac{q_i^\top k_j}{\sqrt{D}}\Big) v_j := \sum_{j \in [N]} a_{ij} v_j. \qquad (4)$$

In this paper, we call a transformer built with softmax attention the standard transformer, or simply the transformer.
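The computation in equations 1-4 can be sketched in a few lines of NumPy; the shapes and names below are our own illustrative choices, not those of any specific implementation. The block computes the matrix form of equation 3 and checks that it agrees with the per-query weighted sum of values in equation 4.

```python
import numpy as np

def softmax(Z, axis=-1):
    """Numerically stable softmax along the given axis."""
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Softmax self-attention: H = softmax(Q K^T / sqrt(D)) V (equation 3)."""
    Q = X @ W_Q.T            # (N, D)   queries
    K = X @ W_K.T            # (N, D)   keys
    V = X @ W_V.T            # (N, D_v) values
    D = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D))   # (N, N) attention weights a_ij
    return A @ V, A

# Toy dimensions: N tokens of size D_x, projected to D (queries/keys) and D_v (values).
rng = np.random.default_rng(0)
N, D_x, D, D_v = 5, 4, 3, 2
X = rng.normal(size=(N, D_x))
W_Q = rng.normal(size=(D, D_x))
W_K = rng.normal(size=(D, D_x))
W_V = rng.normal(size=(D_v, D_x))

H, A = self_attention(X, W_Q, W_K, W_V)

# Equation 4: each output h_i is the attention-weighted sum of value vectors.
V = X @ W_V.T
h_0 = sum(A[0, j] * V[j] for j in range(N))
assert np.allclose(h_0, H[0])
assert np.allclose(A.sum(axis=1), 1.0)   # each row of weights sums to 1
```

Note that each row of $A$ is a probability vector over the keys; it is this normalized, kernel-like weighting that the nonparametric regression view in the next subsection makes precise.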

2.2. A NON-PARAMETRIC REGRESSION PERSPECTIVE OF SELF-ATTENTION

We now review the connection between the self-attention mechanism in equation 4 and nonparametric regression, which has been discussed in the recent work of Nguyen et al. (2022). Assume

