A PRIMAL-DUAL FRAMEWORK FOR TRANSFORMERS AND NEURAL NETWORKS

Abstract

Self-attention is key to the remarkable success of transformers in sequence modeling tasks, including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN), derived from the batch normalization layer, and 2) the Attention with Scaled Heads (Attention-SH), derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications, including image and time-series classification.

1. INTRODUCTION

Transformer models (Vaswani et al., 2017) have achieved impressive success with state-of-the-art performance in a myriad of sequence processing tasks, including those in computer vision (Dosovitskiy et al., 2021; Liu et al., 2021; Touvron et al., 2020; Ramesh et al., 2021; Radford et al., 2021; Arnab et al., 2021; Liu et al., 2022; Zhao et al., 2021; Guo et al., 2021), natural language processing (Devlin et al., 2018; Al-Rfou et al., 2019; Dai et al., 2019; Child et al., 2019; Raffel et al., 2020; Baevski & Auli, 2019; Brown et al., 2020; Dehghani et al., 2018), reinforcement learning (Chen et al., 2021; Janner et al., 2021), and other important applications (Rives et al., 2021; Jumper et al., 2021; Zhang et al., 2019; Gulati et al., 2020; Wang & Sun, 2022). Transformers can also effectively transfer knowledge from pre-trained models to new tasks with limited supervision (Radford et al., 2018; 2019; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019). The driving force behind the success of transformers is the self-attention mechanism (Cho et al., 2014; Parikh et al., 2016; Lin et al., 2017), which computes a weighted average of the feature representations of the tokens in a sequence, with weights proportional to similarity scores between pairs of representations. The weights calculated by the self-attention determine the relative importance between tokens and thus capture the contextual representations of the sequence (Bahdanau et al., 2014; Vaswani et al., 2017; Kim et al., 2017). It has been argued that this flexibility in capturing diverse syntactic and semantic relationships is critical for the success of transformers (Tenney et al., 2019; Vig & Belinkov, 2019; Clark et al., 2019).

1.1. BACKGROUND: SELF-ATTENTION

For a given input sequence $X := [x_1, \cdots, x_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors, self-attention transforms $X$ into the output sequence $H$ in the following two steps:

Step 1.
The input sequence $X$ is projected into the query matrix $Q$, the key matrix $K$, and the value matrix $V$ via three linear transformations

$Q = XW_Q^\top; \quad K = XW_K^\top; \quad V = XW_V^\top$,

where $W_Q, W_K \in \mathbb{R}^{D \times D_x}$ and $W_V \in \mathbb{R}^{D_v \times D_x}$ are the weight matrices. We denote $Q := [q_1, \cdots, q_N]^\top$, $K := [k_1, \cdots, k_N]^\top$, and $V := [v_1, \cdots, v_N]^\top$, where the vectors $q_i, k_i, v_i$ for $i = 1, \cdots, N$ are the query, key, and value vectors, respectively.

Step 2. The output sequence $H := [h_1, \cdots, h_N]^\top$ is then computed as

$H = \mathrm{softmax}(QK^\top/\sqrt{D})\,V := AV$,   (1)

where the softmax function is applied to each row of the matrix $QK^\top/\sqrt{D}$. The matrix $A := \mathrm{softmax}(QK^\top/\sqrt{D}) \in \mathbb{R}^{N \times N}$ and its components $a_{ij}$ for $i, j = 1, \cdots, N$ are called the attention matrix and attention scores, respectively. For each query vector $q_i$, $i = 1, \cdots, N$, an equivalent form of Eqn. (1) to compute the output vector $h_i$ is given by

$h_i = \sum_{j=1}^N \mathrm{softmax}(q_i^\top k_j/\sqrt{D})\, v_j$.   (2)

The self-attention computed by Eqns. (1) and (2) is called the scaled dot-product or softmax attention. In our paper, we call a transformer that uses this attention the softmax transformer. The structure that the attention matrix $A$ learns from training determines the ability of the self-attention to capture contextual representations for each token. Additionally, a residual connection can be added to the output of the self-attention layer, $h_i = x_i + \sum_{j=1}^N \mathrm{softmax}(q_i^\top k_j/\sqrt{D})\, v_j$.

Multi-head Attention (MHA). In MHA, multiple heads are concatenated to compute the final output. This mechanism allows transformers to capture more diverse attention patterns and increases the capacity of the model. Let $H$ be the number of heads and $W_O^{\mathrm{multi}} = [W_O^1, \dots, W_O^H] \in \mathbb{R}^{D_v \times HD_v}$ be the projection matrix for the output, where $W_O^1, \dots, W_O^H \in \mathbb{R}^{D_v \times D_v}$. The MHA is defined as

$\mathrm{MultiHead}(\{H^s\}_{s=1}^H) = \mathrm{Concat}(H^1, \dots, H^H)\, W_O^{\mathrm{multi}\top} = \sum_{s=1}^H H^s W_O^{s\top} = \sum_{s=1}^H A^s V^s W_O^{s\top}$.   (3)
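As a concrete reference, the two steps above can be sketched in NumPy. This is a minimal, unbatched sketch; the function and variable names are illustrative and not taken from any official implementation.

```python
import numpy as np

def softmax_attention(X, W_Q, W_K, W_V):
    """X: (N, D_x) input sequence; returns H = softmax(Q K^T / sqrt(D)) V."""
    # Step 1: project the input into queries, keys, and values.
    Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T      # (N, D), (N, D), (N, D_v)
    D = Q.shape[1]
    # Step 2: scaled dot-product scores and row-wise softmax.
    scores = Q @ K.T / np.sqrt(D)                   # (N, N)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)               # attention matrix, rows sum to 1
    return A @ V                                    # (N, D_v) output sequence

rng = np.random.default_rng(0)
N, D_x, D, D_v = 4, 6, 8, 5
X = rng.standard_normal((N, D_x))
W_Q, W_K = rng.standard_normal((D, D_x)), rng.standard_normal((D, D_x))
W_V = rng.standard_normal((D_v, D_x))
H = softmax_attention(X, W_Q, W_K, W_V)
assert H.shape == (N, D_v)
```

Subtracting the row maximum before exponentiating is a standard numerical-stability trick and does not change the softmax output.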
Despite their remarkable success, most attention layers are developed based on heuristic approaches, and a coherent principled framework for synthesizing attention layers has remained elusive.

1.2. CONTRIBUTION

We derive self-attention as the support vector expansion of a given support vector regression (SVR) problem. The primal representation of the regression function has the form of a neural network layer. Thus, we establish a primal-dual connection between an attention layer in transformers and a neural network layer in deep neural networks. Our framework suggests a principled approach to developing an attention mechanism: starting from a neural network layer and a support vector regression problem, we derive the dual as a support vector expansion to attain the corresponding attention layer. We then employ this principled approach to invent two novel classes of attentions: the Batch Normalized Attention (Attention-BN), derived from the batch normalization layer in deep neural networks, and the Attention with Scaled Heads (Attention-SH), resulting from solving the support vector regression model with less training data. Our contribution is three-fold:

1. We develop a primal-dual framework that derives self-attention as the support vector expansion of a support vector regression problem, whose primal formulation has the form of a neural network layer.

2. We derive popular attention mechanisms, including the linear attention, the sparse attention, and the multi-head attention, as support vector expansions under our framework.

3. We develop two new attention mechanisms, the Batch Normalized Attention (Attention-BN) and the Attention with Scaled Heads (Attention-SH), using our proposed framework. We empirically demonstrate that 1) the Attention-BN significantly outperforms the baseline softmax and linear attentions and 2) the Attention-SH performs better while being more efficient than the same baselines on a variety of practical tasks, including image and time-series classification.

2. PRIMAL-DUAL INTERPRETATION OF SELF-ATTENTION

We first provide a primal-dual interpretation of self-attention as a support vector regression problem in Section 2.1. Based on that primal-dual framework, we derive popular attention mechanisms as the support vector expansion in Section 2.2. Finally, we introduce two new attention mechanisms in Section 2.3, the Attention-BN and Attention-SH.

2.1. ATTENTION AS A SUPPORT VECTOR REGRESSION MODEL

In this section, we derive self-attention from a support vector regression problem. Suppose we are given training data $\{(k_1, y_1), \dots, (k_N, y_N)\} \subset \mathcal{K} \times \mathcal{Y}$, where $\mathcal{K} = \mathbb{R}^D$ and $\mathcal{Y} = \mathbb{R}^{D_v}$. Here, $k_1, \dots, k_N$ are attention keys in self-attention, and $y_1, \dots, y_N$ are the training targets. We consider the function $f$ of the form

$y = f(x) := \frac{W \Phi(x)}{h(x)} + b$,   (4)

where $x \in \mathcal{K} = \mathbb{R}^D$, $\Phi(x) = [\phi_1(x), \dots, \phi_{D_\phi}(x)]^\top \in \mathbb{R}^{D_\phi}$, $W = [w_1, \dots, w_{D_v}]^\top \in \mathbb{R}^{D_v \times D_\phi}$, $b \in \mathbb{R}^{D_v}$, and $h(x)$ is a vector-to-scalar function. We fit the function $f$ to the training data $\{(k_1, y_1), \dots, (k_N, y_N)\}$ with an $L_2$ regularization on $W$, i.e., a ridge regression, by solving the following convex optimization problem:

$\min_{W,\, \xi_j,\, \bar{\xi}_j,\, j=1,\dots,N} \;\; \frac{1}{2}\|W\|_F^2 + C \sum_{j=1}^N \sum_{d=1}^{D_v} \big(\xi_j(d) + \bar{\xi}_j(d)\big) = \frac{1}{2}\sum_{d=1}^{D_v} \|w_d\|^2 + C \sum_{j=1}^N \sum_{d=1}^{D_v} \big(\xi_j(d) + \bar{\xi}_j(d)\big)$

subject to

$y_j(d) - w_d^\top \Phi(k_j)/h(k_j) - b(d) \le \epsilon + \xi_j(d)$,
$w_d^\top \Phi(k_j)/h(k_j) + b(d) - y_j(d) \le \epsilon + \bar{\xi}_j(d)$,
$\xi_j(d),\, \bar{\xi}_j(d) \ge 0$, for $j = 1, \dots, N$, $d = 1, \dots, D_v$.   (5)

Eqn. 5 assumes that there exists a function $f$ that approximates all pairs $(k_j, y_j)$ with $\epsilon$ precision. The additional slack variables $\xi_j, \bar{\xi}_j$ relax this assumption and allow some of the training points to have errors greater than $\epsilon$, just as in the soft-margin SVM (Cortes & Vapnik, 1995; Schölkopf et al., 2002). The constant $C > 0$ determines the trade-off between the complexity penalizer $\sum_{d=1}^{D_v} \|w_d\|^2$, i.e., the flatness of $f$, and the amount up to which deviations larger than $\epsilon$ are tolerated. In order to derive self-attention from the support vector regression defined by the optimization problem 5, the key idea is to construct the Lagrangian from Eqn. 5 and find a representation of the $w_d$, $d = 1, \dots, D_v$, in terms of the dual variables.
We define the Lagrangian function as follows:

$\mathcal{L} := \frac{1}{2}\sum_{d=1}^{D_v}\|w_d\|^2 + C\sum_{j=1}^N\sum_{d=1}^{D_v}\big(\xi_j(d) + \bar{\xi}_j(d)\big) - \sum_{j=1}^N\sum_{d=1}^{D_v}\big(\eta_j(d)\xi_j(d) + \bar{\eta}_j(d)\bar{\xi}_j(d)\big)$
$\quad - \sum_{j=1}^N\sum_{d=1}^{D_v} \alpha_j(d)\Big(\epsilon + \xi_j(d) - y_j(d) + \frac{w_d^\top \Phi(k_j)}{h(k_j)} + b(d)\Big) - \sum_{j=1}^N\sum_{d=1}^{D_v} \bar{\alpha}_j(d)\Big(\epsilon + \bar{\xi}_j(d) + y_j(d) - \frac{w_d^\top \Phi(k_j)}{h(k_j)} - b(d)\Big)$,   (6)

where $\eta_j, \bar{\eta}_j, \alpha_j$, and $\bar{\alpha}_j$ are Lagrange multipliers. Setting the derivatives of $\mathcal{L}$ with respect to the primal variables to zero yields

$\partial_{b(d)} \mathcal{L} = \sum_{j=1}^N \big(\bar{\alpha}_j(d) - \alpha_j(d)\big) = 0 \;\Rightarrow\; \sum_{j=1}^N \big(\alpha_j(d) - \bar{\alpha}_j(d)\big) = 0$,   (7)

$\partial_{w_d} \mathcal{L} = w_d - \sum_{j=1}^N \big(\alpha_j(d) - \bar{\alpha}_j(d)\big) \frac{\Phi(k_j)}{h(k_j)} = 0 \;\Rightarrow\; w_d = \sum_{j=1}^N \big(\alpha_j(d) - \bar{\alpha}_j(d)\big) \frac{\Phi(k_j)}{h(k_j)}$,   (8)

$\partial_{\xi_j(d)} \mathcal{L} = C - \alpha_j(d) - \eta_j(d) = 0, \qquad \partial_{\bar{\xi}_j(d)} \mathcal{L} = C - \bar{\alpha}_j(d) - \bar{\eta}_j(d) = 0$.   (9)

Let $v_j = \Big[\frac{\alpha_j(1) - \bar{\alpha}_j(1)}{h(k_j)}, \dots, \frac{\alpha_j(D_v) - \bar{\alpha}_j(D_v)}{h(k_j)}\Big]^\top$, $j = 1, \dots, N$. Substituting Eqn. 8 into Eqn. 4, we obtain the following support vector expansion of the linear basis function $f$:

$f(x) = \Big[\sum_{j=1}^N \frac{\alpha_j(1) - \bar{\alpha}_j(1)}{h(k_j)} \frac{\Phi(x)^\top \Phi(k_j)}{h(x)}, \dots, \sum_{j=1}^N \frac{\alpha_j(D_v) - \bar{\alpha}_j(D_v)}{h(k_j)} \frac{\Phi(x)^\top \Phi(k_j)}{h(x)}\Big]^\top + b = \sum_{j=1}^N \frac{\Phi(x)^\top \Phi(k_j)}{h(x)}\, v_j + b$.   (10)

Remark 1 Notice that from Eqn. 9 and the conditions $\eta_j(d), \bar{\eta}_j(d), \alpha_j(d), \bar{\alpha}_j(d) \ge 0$, we can prove that $\alpha_j(d), \bar{\alpha}_j(d) \in [0, C]$. Furthermore, we can show that $\alpha_j(d) \cdot \bar{\alpha}_j(d) = 0$ (Smola & Schölkopf, 2004; Schölkopf et al., 2002). As a result, $v_j(d) \in \big[-\frac{C}{h(k_j)}, \frac{C}{h(k_j)}\big]$, $d = 1, \dots, D_v$.

Deriving Softmax Attention. Choosing appropriate $h(x)$ and $\Phi(x)$ allows us to derive the popular softmax attention given in Eqns. 1 and 2. In particular, if we choose $h(x) := \sum_{j'=1}^N \Phi(x)^\top \Phi(k_{j'})$, Eqn. 10 becomes

$f(x) = \sum_{j=1}^N \frac{\Phi(x)^\top \Phi(k_j)}{\sum_{j'=1}^N \Phi(x)^\top \Phi(k_{j'})}\, v_j + b$.   (11)

We then select $\Phi(x)$ to be the infinite-dimensional feature map whose entries, indexed by the multi-indices $(n_1, \dots, n_D)$ with $n_1 + \cdots + n_D = t$, $t = 0, 1, 2, \dots$, are

$\frac{(x_1/\sqrt[4]{D})^{n_1} \cdots (x_D/\sqrt[4]{D})^{n_D}}{\sqrt{n_1! \cdots n_D!}}$.   (12)

Since $\exp(x^\top y) = \sum_{t=0}^\infty \frac{(x^\top y)^t}{t!} = \sum_{t=0}^\infty \sum_{n_1 + \cdots + n_D = t} \frac{x_1^{n_1} \cdots x_D^{n_D}}{\sqrt{n_1! \cdots n_D!}} \cdot \frac{y_1^{n_1} \cdots y_D^{n_D}}{\sqrt{n_1! \cdots n_D!}}$, this choice yields $\Phi(x)^\top \Phi(k_j) = \exp(x^\top k_j/\sqrt{D})$, and Eqn. 11 becomes

$f(x) = \sum_{j=1}^N \frac{\exp(x^\top k_j/\sqrt{D})}{\sum_{j'=1}^N \exp(x^\top k_{j'}/\sqrt{D})}\, v_j + b = \sum_{j=1}^N \mathrm{softmax}(x^\top k_j/\sqrt{D})\, v_j + b$.   (13)

Let $x = q_i$, $b = 0$, and relax the boundedness constraint on $v_j$ in Remark 1. Eqn. 13 then becomes Eqn. 2 of the softmax attention (Vaswani et al., 2017). We summarize our results in the following theorem.

Theorem 1 (Softmax Attention as a Support Vector Expansion) Given the function $f$ defined in Eqn. 4 with $h(x) := \sum_{j'=1}^N \Phi(x)^\top \Phi(k_{j'})$ and the support vector regression problem defined in Eqn. 5, we set $b = 0$, choose $\Phi(x)$ as in Eqn. 12, and relax the boundedness constraint on the variables $v_j = \big[\frac{\alpha_j(1) - \bar{\alpha}_j(1)}{h(k_j)}, \dots, \frac{\alpha_j(D_v) - \bar{\alpha}_j(D_v)}{h(k_j)}\big]^\top$, where $\alpha_j$ and $\bar{\alpha}_j$ are dual variables of Eqn. 5, $j = 1, \dots, N$. Then, the support vector expansion of $f$ derived from Eqn. 5 has the form of a softmax attention:

$f(x) = \sum_{j=1}^N \mathrm{softmax}(x^\top k_j/\sqrt{D})\, v_j$.   (14)

Remark 2 Since $b$ is set to 0, the centering constraint on $\alpha_j$ and $\bar{\alpha}_j$ in Eqn. 7 can be ignored.

Remark 3 Theorem 1 and its derivation can be easily extended to capture the full form of the softmax attention with the residual connection, the query matrix projection $W_Q$, the key matrix projection $W_K$, and the value matrix projection $W_V$. We include this result in Appendix F.
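The exponential-kernel feature map used in this derivation can be checked numerically: truncating the multi-index expansion at a finite total degree recovers $\exp(x^\top y)$ up to a small Taylor remainder. The sketch below omits the $\sqrt[4]{D}$ scaling for simplicity; the function name `phi` is illustrative.

```python
import numpy as np
from itertools import product
from math import factorial, sqrt

def phi(x, max_degree):
    """Truncated feature map with entries x_1^{n_1}...x_D^{n_D} / sqrt(n_1!...n_D!)
    over all multi-indices (n_1, ..., n_D) with n_1 + ... + n_D <= max_degree."""
    D = len(x)
    feats = []
    for ns in product(range(max_degree + 1), repeat=D):
        if sum(ns) <= max_degree:
            num = np.prod([x[d] ** ns[d] for d in range(D)])
            den = sqrt(np.prod([factorial(n) for n in ns]))
            feats.append(num / den)
    return np.array(feats)

x = np.array([0.3, -0.2, 0.5])
y = np.array([0.1, 0.4, -0.3])
# Phi(x)^T Phi(y) equals the degree-8 Taylor partial sum of exp(x^T y),
# so for a small dot product the truncation error is negligible.
approx = phi(x, 8) @ phi(y, 8)
exact = np.exp(x @ y)
assert abs(approx - exact) < 1e-8
```

This makes concrete why the softmax numerator $\exp(x^\top k_j/\sqrt{D})$ can be written as an (infinite-dimensional) inner product $\Phi(x)^\top \Phi(k_j)$.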

Remark 4

The primal representation of the function $f$ in Eqn. 4 has the form of a neural network layer, where $W$ is the weight, $b$ is the bias term, $\Phi(x)$ is the input, and $h(x)$ is the normalization term. Thus, an attention layer and a neural network layer are primal-dual representations of each other.

A principled approach to developing an attention mechanism. The observation in Remark 4 suggests a principled way to construct an attention layer: starting from a neural network layer and a support vector regression problem, we derive the dual as a support vector expansion to attain the corresponding attention layer. Using this approach, we derive popular attention mechanisms in Section 2.2 and propose our new attention mechanisms in Section 2.3.

2.2. DERIVING POPULAR ATTENTION MECHANISMS AS THE SUPPORT VECTOR EXPANSION

In this section, we derive popular attentions such as the linear attention (Katharopoulos et al., 2020) , the sparse attention (Child et al., 2019) , and the multi-head attention (Vaswani et al., 2017) .

2.2.1. LINEAR ATTENTION

The expansion in Section 2.1 obtained by choosing $h(x) := \sum_{j'=1}^N \Phi(x)^\top \Phi(k_{j'})$ already matches the formula of the linear attention. Here, we can let $b = 0$ as above and select a function $\Phi$ that results in a positive similarity function, e.g., $\Phi(x) = \mathrm{elu}(x) + 1$, as in (Katharopoulos et al., 2020).
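A minimal sketch of this linear attention with $\Phi(x) = \mathrm{elu}(x) + 1$, following Katharopoulos et al. (2020). It also verifies that reordering the computation to cost $O(N)$ in sequence length matches the quadratic form; names are illustrative.

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1 = x + 1 for x > 0, exp(x) otherwise; always positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)   # (N, D) positive feature maps
    KV = Kp.T @ V                               # (D, D_v): aggregate keys and values first
    Z = Qp @ Kp.sum(axis=0)                     # (N,) normalizers h(q_i)
    return (Qp @ KV) / Z[:, None]               # (N, D_v), linear in N

rng = np.random.default_rng(1)
Q, K = rng.standard_normal((5, 4)), rng.standard_normal((5, 4))
V = rng.standard_normal((5, 3))
H_fast = linear_attention(Q, K, V)

# Same output computed the quadratic O(N^2) way: A_ij proportional to Phi(q_i)^T Phi(k_j)
A = elu_plus_one(Q) @ elu_plus_one(K).T
H_slow = (A / A.sum(axis=1, keepdims=True)) @ V
assert np.allclose(H_fast, H_slow)
```

The associativity trick (computing $K^\top V$ before multiplying by the query features) is what makes the linear attention efficient for long sequences.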

2.2.2. SPARSE ATTENTION

The sparse attention (Child et al., 2019) can be derived by fitting the function $f$ in Eqn. 4 using a different subset $\{(k_{m_x(1)}, y_{m_x(1)}), \dots, (k_{m_x(M)}, y_{m_x(M)})\}$ of the training data $\{(k_1, y_1), \dots, (k_N, y_N)\}$ for each input $x$, where $M_x = \{m_x(1), \dots, m_x(M)\} \subset \{1, \dots, N\}$. The support vector expansion of $f$ is then given by

$f(x) = \sum_{j=1}^N \mathbb{1}_{M_x}(j)\, \frac{\Phi(x)^\top \Phi(k_j)}{h(x)}\, v_j + b$, where $\mathbb{1}_{M_x}(j) := 1$ if $j \in M_x$ and $0$ otherwise.   (15)

Note that the subsets $M_x$ are different for different $x$. Letting $x = q_i$, where $q_i$, $i = 1, \dots, N$, are the query vectors, and choosing $\Phi$, $h$, $b$ as in Section 2.1, we obtain the sparse attention in (Child et al., 2019), where the binary matrix $M = \big[\mathbb{1}_{M_{q_i}}(j)\big]_{i,j=1}^N$ becomes the sparse masking matrix.
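The masked expansion above can be sketched directly: masked-out keys receive score $-\infty$ before the softmax, so the attention is renormalized over each query's support set only. The causal (lower-triangular) mask below is just one example pattern; names are illustrative.

```python
import numpy as np

def sparse_softmax_attention(Q, K, V, M):
    """M: (N, N) binary mask; M[i, j] = 1 iff key j is in the support set of query i."""
    D = Q.shape[1]
    scores = Q @ K.T / np.sqrt(D)
    scores = np.where(M.astype(bool), scores, -np.inf)   # drop masked keys
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                    # renormalize over the support set
    return A @ V

rng = np.random.default_rng(2)
N, D, D_v = 4, 3, 2
Q, K = rng.standard_normal((N, D)), rng.standard_normal((N, D))
V = rng.standard_normal((N, D_v))
M = np.tril(np.ones((N, N)))            # example: causal sparsity pattern
H = sparse_softmax_attention(Q, K, V, M)
# Query 0 may only attend to key 0, so its output is exactly v_0.
assert np.allclose(H[0], V[0])
```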

2.2.3. MULTI-HEAD ATTENTION (MHA)

The MHA can be derived by solving multiple support vector regression problems and then linearly combining their outputs. In particular, suppose we are given $H$ training datasets $\{(k_1^1, y_1^1), \dots, (k_N^1, y_N^1)\}, \dots, \{(k_1^H, y_1^H), \dots, (k_N^H, y_N^H)\} \subset \mathcal{K} \times \mathcal{Y}$, where $\mathcal{K} = \mathbb{R}^D$ and $\mathcal{Y} = \mathbb{R}^{D_v}$. We define the function $f$ applied to the input vector $x = [x^1, \dots, x^H]$ as follows:

$y = f(x) := \sum_{s=1}^H W_O^s y^s = \sum_{s=1}^H W_O^s f^s(x^s) = \sum_{s=1}^H W_O^s \Big(\frac{W^s \Phi^s(x^s)}{h^s(x^s)} + b^s\Big)$,   (17)

where each function $f^s(x^s) = W^s \Phi^s(x^s)/h^s(x^s) + b^s$ is fitted to the training dataset $\{(k_1^s, y_1^s), \dots, (k_N^s, y_N^s)\}$. Following the same derivation and choosing $\{\Phi^s, h^s, b^s\}_{s=1}^H$ as in Section 2.1, we can rewrite $f(x)$ in terms of the support vector expansions of the individual functions $f^s(x^s)$, which are the individual softmax attentions:

$f(x) = \sum_{s=1}^H W_O^s \Big(\sum_{j=1}^N \frac{\Phi^s(x^s)^\top \Phi^s(k_j^s)}{h^s(x^s)}\, v_j^s + b^s\Big) = \sum_{s=1}^H W_O^s \sum_{j=1}^N \mathrm{softmax}(x^{s\top} k_j^s/\sqrt{D})\, v_j^s$.   (18)

Comparing Eqn. 18 and Eqn. 3, we see that Eqn. 18 computes the MHA when choosing $x^s = q_i^s$, where $q_i^s$, $i = 1, \dots, N$, are the query vectors at the $s$-th head.
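The per-head sum in Eqn. 18 can be sketched as follows: each head runs its own softmax attention, and the head outputs are linearly combined through the per-head output projections. A minimal sketch with illustrative names:

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def multi_head(Qs, Ks, Vs, W_Os):
    """Computes sum_s A^s V^s W_O^{s,T} over heads s; inputs are per-head lists."""
    D = Qs[0].shape[1]
    out = 0.0
    for Q, K, V, W_O in zip(Qs, Ks, Vs, W_Os):
        A = softmax_rows(Q @ K.T / np.sqrt(D))   # head-specific attention matrix
        out = out + A @ V @ W_O.T                # project and accumulate
    return out

rng = np.random.default_rng(3)
n_heads, N, D, D_v = 2, 4, 3, 3
Qs = [rng.standard_normal((N, D)) for _ in range(n_heads)]
Ks = [rng.standard_normal((N, D)) for _ in range(n_heads)]
Vs = [rng.standard_normal((N, D_v)) for _ in range(n_heads)]
W_Os = [rng.standard_normal((D_v, D_v)) for _ in range(n_heads)]
out = multi_head(Qs, Ks, Vs, W_Os)
assert out.shape == (N, D_v)
```

Summing the projected head outputs is equivalent to the usual concatenate-then-project formulation in Eqn. 3.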

2.3. DERIVING NEW ATTENTION MECHANISMS: BATCH NORMALIZED ATTENTION AND ATTENTION WITH SCALED HEADS

In this section, we employ our primal-dual framework to develop new attention mechanisms. In particular, we derive: 1) the Batch Normalized Attention from employing the batch normalization (Ioffe & Szegedy, 2015) ; and 2) the Attention with Scaled Heads from using different amounts of training data. By 1) and 2), we demonstrate that new attentions can be invented by modifying the primal neural network layer and the support vector regression problem in our framework, respectively.

2.3.1. BATCH NORMALIZED ATTENTION

We incorporate batch normalization into the primal form of the function $f$ in Eqn. 4. Given training data $\{(k_1, y_1), \dots, (k_N, y_N)\} \subset \mathcal{K} \times \mathcal{Y}$, where $\mathcal{K} = \mathbb{R}^D$ and $\mathcal{Y} = \mathbb{R}^{D_v}$ as in Section 2.1, the resultant $f$ is defined as follows:

$f(x) := \frac{W \Phi((x - \mu) \odot s^{-1})}{h((x - \mu) \odot s^{-1})} + b$,   (19)

where

$\mu = \frac{1}{N}\sum_{j=1}^N k_j, \quad s^{-1} = \Big[\frac{1}{\sqrt{\sigma_1^2 + \epsilon}}, \dots, \frac{1}{\sqrt{\sigma_D^2 + \epsilon}}\Big]^\top, \quad \sigma_d^2 = \frac{1}{N}\sum_{j=1}^N \big(k_j(d) - \mu(d)\big)^2$.   (20)

Here, $d = 1, \dots, D$, and the mean subtraction and division by the standard deviation are performed element-wise along the feature dimension of $x$. Following the same derivation as in Section 2.1, we derive the following support vector expansion of $f$:

$f(x) = \sum_{j=1}^N \frac{\Phi((x - \mu) \odot s^{-1})^\top \Phi((k_j - \mu) \odot s^{-1})}{h((x - \mu) \odot s^{-1})}\, v_j + b$.   (21)

Here, $v_j = \Big[\frac{\alpha_j(1) - \bar{\alpha}_j(1)}{h((k_j - \mu) \odot s^{-1})}, \dots, \frac{\alpha_j(D_v) - \bar{\alpha}_j(D_v)}{h((k_j - \mu) \odot s^{-1})}\Big]^\top$, where $\alpha_j$ and $\bar{\alpha}_j$ are the dual variables, $j = 1, \dots, N$. As in Section 2.1, in Eqn. 21 we choose $\Phi$ as the feature map inducing the exponential kernel, $h(x) := \sum_{j'=1}^N \Phi(x)^\top \Phi(k_{j'})$, and $b = 0$ to obtain the Batch Normalized Attention, which is defined as follows.

Definition 1 (Batch Normalized Attention) Given a set of key and value vectors $\{k_j, v_j\}_{j=1}^N$, for each query vector $q_i$, $i = 1, \dots, N$, the Batch Normalized Attention (Attention-BN) computes the corresponding output vector $h_i$ of the query $q_i$ by the following attention formula:

$h_i = \sum_{j=1}^N \mathrm{softmax}\Big(\big((q_i - \mu) \odot s^{-1}\big)^\top \big((k_j - \mu) \odot s^{-1}\big)/\sqrt{D}\Big)\, v_j$,   (22)

where $\mu = \frac{1}{N}\sum_{j=1}^N k_j$, $s^{-1} = \big[\frac{1}{\sqrt{\sigma_1^2 + \epsilon}}, \dots, \frac{1}{\sqrt{\sigma_D^2 + \epsilon}}\big]^\top$, and $\sigma_d^2 = \frac{1}{N}\sum_{j=1}^N (k_j(d) - \mu(d))^2$.

The Effect of Normalization. Expanding the dot product in the Attention-BN (see Appendix E), Eqn. 22 becomes

$h_i = \sum_{j=1}^N \mathrm{softmax}\Big(\sum_{d=1}^D \frac{q_i(d) k_j(d) - \frac{1}{N}\sum_{j'=1}^N k_{j'}(d) k_j(d)}{\sqrt{D}\,(\sigma_d^2 + \epsilon)}\Big)\, v_j$.   (24)

Eqn. 24 implies that in the Attention-BN, the similarity between the query $q_i$ and the key $k_j$ is adjusted by the similarity between the key $k_j$ and all the keys $k_{j'}$, $j' = 1, \dots, N$. In particular, if the key $k_j$ is too similar to the other keys, the query $q_i$ will attend to it less, and vice versa.
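A sketch of Attention-BN per Definition 1: queries and keys are recentered by the key mean and rescaled by the per-dimension key standard deviation before the usual softmax attention. Names are illustrative; `eps` plays the role of $\epsilon$ in Eqn. 22.

```python
import numpy as np

def attention_bn(Q, K, V, eps=1e-5):
    mu = K.mean(axis=0)                        # (D,) mean over the keys
    s_inv = 1.0 / np.sqrt(K.var(axis=0) + eps) # (D,) inverse std of the keys
    Qn, Kn = (Q - mu) * s_inv, (K - mu) * s_inv
    D = Q.shape[1]
    S = Qn @ Kn.T / np.sqrt(D)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(5)
Q, K = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
H = attention_bn(Q, K, V)
assert H.shape == (6, 3)
```

One sanity check consistent with the "effect of normalization" discussion: if all keys are identical, every centered key is zero, all scores tie, and each output reduces to the mean of the value vectors.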

2.3.2. ATTENTION WITH SCALED HEADS

The Attention with Scaled Heads, named Attention-SH, is derived from the derivation of the MHA in Section 2.2.3. The key idea underlying the Attention-SH is to train multiple support vector regression problems using different amounts of training data. In particular, the Attention-SH follows Eqn. 17 in Section 2.2.3 and defines the same regression function $f$ as the MHA. However, the Attention-SH fits the functions $f^s$, $s = 1, \dots, H$, in Eqn. 17 with training sets $\{(k_1^1, y_1^1), \dots, (k_{N_1}^1, y_{N_1}^1)\}, \dots, \{(k_1^H, y_1^H), \dots, (k_{N_H}^H, y_{N_H}^H)\} \subset \mathcal{K} \times \mathcal{Y}$ of different sizes $N_1, \dots, N_H$, where $\mathcal{K} = \mathbb{R}^D$ and $\mathcal{Y} = \mathbb{R}^{D_v}$. The resultant support vector expansion yields the formula of the Attention-SH in the following definition.

Definition 2 (Attention with Scaled Heads) Given $H$ sets of key and value vectors $\{k_j^1, v_j^1\}_{j=1}^{N_1}, \dots, \{k_j^H, v_j^H\}_{j=1}^{N_H}$, for each set of $H$ query vectors $q_i^1, \dots, q_i^H$, $i = 1, \dots, N$, the Attention with Scaled Heads (Attention-SH) computes the corresponding output vector $h_i$ of the queries $q_i^1, \dots, q_i^H$ by the following attention formula:

$h_i = \sum_{s=1}^H W_O^s \sum_{j=1}^{N_s} \mathrm{softmax}(q_i^{s\top} k_j^s/\sqrt{D})\, v_j^s$.   (25)

Table 1: Test accuracy (%) of the Attention-BN/SH/BN+SH vs. the baseline softmax attention on a subset of the UEA Time Series Classification Archive benchmark (Bagnall et al., 2018). Our proposed attentions significantly outperform the baseline. We also include the reported results from (Zerveas et al., 2021) and (Wu et al., 2022).

Remark 5 For a given input sequence $X := [x_1, \cdots, x_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors in self-attention, in order to generate the sets $\{k_j^s, v_j^s\}_{j=1}^{N_s}$ at the $s$-th scale, we can downsample the input $X$ before projecting it into the key matrix $K$ and the value matrix $V$. There are multiple approaches to downsampling $X$, such as average-pooling, max-pooling, 1-D convolution, or K-means clustering.
In this paper, we employ average-pooling to downsample $X$.

Linear Attention with Batch Normalization and Scaled Heads. The Attention-BN/SH can be extended for use with the linear attention. In particular, in the Linear Attention-BN/SH, we replace the softmax kernel in Eqn. 22 and Eqn. 25, respectively, with the linear kernel.
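A sketch of one scaled head in Attention-SH: keys and values are built from an average-pooled copy of $X$, so head $s$ uses $N_s < N$ key-value pairs while the queries stay at full length. Function names and the pooling factor are illustrative.

```python
import numpy as np

def avg_pool_tokens(X, pool):
    """Average-pool the token dimension by a factor of `pool` (trailing remainder dropped)."""
    N, Dx = X.shape
    return X[: N - N % pool].reshape(-1, pool, Dx).mean(axis=1)

def scaled_head(X, W_Q, W_K, W_V, pool):
    Q = X @ W_Q.T                         # queries stay at full length N
    Xp = avg_pool_tokens(X, pool)         # (N_s, D_x) downsampled tokens
    K, V = Xp @ W_K.T, Xp @ W_V.T         # only N_s keys and values
    D = Q.shape[1]
    S = Q @ K.T / np.sqrt(D)              # (N, N_s): cheaper than (N, N)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V                          # (N, D_v)

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 6))
W_Q = rng.standard_normal((4, 6)); W_K = rng.standard_normal((4, 6))
W_V = rng.standard_normal((3, 6))
H = scaled_head(X, W_Q, W_K, W_V, pool=2)   # attends over 4 pooled key-value pairs
assert H.shape == (8, 3)
```

The attention matrix here is $N \times N_s$ rather than $N \times N$, which is the source of the FLOP and memory savings analyzed in Section 4.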

3. EXPERIMENTAL RESULTS

In this section, we empirically demonstrate the advantages of our Attention-BN, Attention-SH, and their combination (Attention-BN+SH) over the baseline softmax attention on the UEA time-series classification benchmark (Bagnall et al., 2018), the Long Range Arena benchmark (Tay et al., 2021), and the image classification task on the ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015). We aim to show that: (i) Attention-BN significantly outperforms the softmax baseline across tasks; (ii) Attention-SH achieves better or comparable accuracy while saving computation and memory compared to the baseline; (iii) Attention-BN+SH, which combines Attention-BN and Attention-SH, results in the best model performance in terms of accuracy and efficiency; (iv) all our proposed models help reduce the redundancy in multi-head attention and benefit the learning of long-term dependencies in long input sequences; (v) Attention-BN and Attention-SH can be applied to attention mechanisms beyond the softmax attention. When combined with the linear attention (Katharopoulos et al., 2020), the resultant Linear Attention-BN and Linear Attention-SH yield advantages similar to (i), (ii), (iii), and (iv) over the baseline linear attention. In our experiments, we compare the proposed models with baseline softmax and linear attentions of the same configuration. For the Attention-BN and Attention-BN+SH, we observe that recentering the queries and keys alone is sufficient for improving model performance. In addition, weighting $\mu$ with a constant $\beta$, as in Eqn. 26 in the Appendix, enables the Attention-BN/BN+SH to adjust the effect of normalization on the attention scores and helps increase accuracy. Our results are averaged over 5 runs. Details on datasets, models, and training are provided in Appendix A.

UEA Time Series Classification.
Table 1 compares the accuracy of the Attention-BN and Attention-SH with the baseline softmax attention on 10 tasks in the UEA Time Series Classification benchmark (Bagnall et al., 2018). Both Attention-BN and Attention-SH significantly outperform the softmax baseline on most tasks and on average across all tasks. When the two models are combined, the resulting Attention-BN+SH yields the best accuracy, with more than a 1% overall improvement over the softmax baseline. Notably, the Attention-SH and Attention-BN+SH are much more efficient than the baseline since they need far fewer keys and values to compute the attention output. The efficiency advantage of the Attention-SH/BN+SH is discussed and analyzed in detail in Section 4.

Long Range Arena (LRA) benchmark. In this experiment, we verify the advantage of our methods over the softmax baseline on tasks that involve very long sequences (e.g., the sequence length can be up to 4K) in the LRA benchmark (Tay et al., 2021). These tasks require the model to capture long-range dependencies in the input sequence. Similar to the Time Series experiment, on this LRA benchmark Attention-BN and Attention-SH both outperform the softmax attention on most of the five tasks. Moreover, Attention-BN+SH, which combines these two attention mechanisms, results in the most accurate models on average across tasks. Specifically, on the retrieval task, the most challenging task with the largest sequence length in the LRA benchmark, Attention-BN+SH achieves a remarkable improvement of more than 1.5% over the baseline.

Image Classification on ImageNet. We corroborate the advantage of our proposed attention over the baseline softmax attention when scaled up for the large-scale ImageNet image classification task. We summarize the results in Table 3. The DeiT model (Touvron et al., 2021) equipped with the Attention-BN yields better performance than the softmax baseline.
Meanwhile, the Attention-SH/BN+SH DeiT models perform on par with the baseline while being more efficient. These results, together with the other results above, justify the benefits of our proposed methods across various tasks and data modalities, demonstrating the effectiveness of our primal-dual approach to developing new attentions.

4. EMPIRICAL ANALYSIS

Efficiency Analysis. The Attention-BN+SH not only improves the accuracy of the model remarkably but also reduces the computational and memory cost significantly. Fig. 1 presents the efficiency benefits of our Attention-BN+SH trained on the retrieval task as the model dimension D and the sequence length N grow. The efficiency advantage of our model increases as N increases. In addition, the scaled-up models (with large D) remain significantly more efficient than the baseline. When the model dimension is 64 and the sequence length is 4096, the standard configuration for this task, the model's FLOPs, in both training and inference, are reduced by almost 25%, whereas the reductions in memory usage during training and testing are 31.9% and 47.3%, respectively. Notably, this efficient model also outperforms the baseline by more than 1.5% in accuracy. These results demonstrate the benefits of applying the Attention-BN+SH to long-sequence tasks and large-scale models.

New Attentions Help Reduce Head Redundancy. We compute the average $L_2$ distances between heads to analyze attention diversity. For our trained models on the retrieval task, the layer-averaged mean and standard deviation of distances between heads are reported in Table 4. All of our introduced attentions attain greater $L_2$ distances than the baseline, reducing the risk of learning redundant heads. In particular, Attention-SH has the highest head difference, indicating that the model's attention patterns are most spread out across heads.

Combining Attention-BN and Attention-SH with Other Attentions. Our methods can be extended to combine with other attention mechanisms. We study the Linear Attention-BN/SH/BN+SH, which combine the Attention-BN/SH/BN+SH with the linear attention (Katharopoulos et al., 2020), as explained at the end of Section 2.3. We summarize our results in Table 5 in Appendix B.1.
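The head-redundancy metric used above can be sketched as the average pairwise $L_2$ distance between flattened per-head outputs; larger distances indicate more diverse heads. This is an illustrative reconstruction of the metric (the exact quantities compared in Table 4 may differ in detail).

```python
import numpy as np
from itertools import combinations

def mean_head_distance(head_outputs):
    """head_outputs: list of (N, D_v) arrays, one per head.
    Returns the mean L2 distance over all head pairs."""
    flat = [h.ravel() for h in head_outputs]
    dists = [np.linalg.norm(a - b) for a, b in combinations(flat, 2)]
    return float(np.mean(dists))

rng = np.random.default_rng(6)
heads = [rng.standard_normal((4, 3)) for _ in range(8)]
d = mean_head_distance(heads)
assert d > 0.0
# Identical heads are fully redundant: their pairwise distance is zero.
assert mean_head_distance([heads[0], heads[0]]) == 0.0
```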

5. RELATED WORK

Interpretation of Attention Mechanism. Recent works have focused on understanding the attention mechanism in transformers from different perspectives. (Tsai et al., 2019) considers attention as a weighted moving average over the inputs via a smoothing kernel. (Nguyen et al., 2022) draws a connection between self-attention and nonparametric kernel regression; with this understanding, the work explores better regression estimators, e.g., the generalized Fourier nonparametric regression estimator, to improve transformers. In addition, (Cao, 2021) shows that the linear transformer (Katharopoulos et al., 2020) corresponds to a Petrov-Galerkin projection (Reddy, 2004) and proves that the softmax normalization in the softmax attention is sufficient but not necessary. Other works that employ ordinary/partial differential equations to provide an interpretation of attention include (Lu et al., 2019; Sander et al., 2022). Attention has also been interpreted from a probabilistic perspective.

Efficient Transformers. Recently, efficient transformers have been studied extensively (Roy et al., 2021). Among them are sparse transformers, which incorporate sparse structures into the attention matrix (Parmar et al., 2018; Liu et al., 2018; Qiu et al., 2019; Child et al., 2019; Beltagy et al., 2020). Another class of efficient transformers comprises models that aim for better coverage by integrating different access patterns (Child et al., 2019; Ho et al., 2019), which can also be learned from the data (Kitaev et al., 2020; Roy et al., 2021; Tay et al., 2020). An emerging body of work distills and prunes the model, including (Sanh et al., 2019; Sun et al., 2019; Voita et al., 2019; Sajjad et al., 2020). In other works, a side memory module is utilized to access multiple tokens simultaneously (Lee et al., 2019; Sukhbaatar et al., 2019; Asai & Choi, 2020; Beltagy et al., 2020).
Low-rank and kernelization methods have been proposed to improve the efficiency of self-attention calculation (Tsai et al., 2019; Wang et al., 2020; Katharopoulos et al., 2020; Choromanski et al., 2021; Shen et al., 2021; Peng et al., 2021) . Our Attention-SH/BN+SH is orthogonal to these methods.

6. CONCLUDING REMARKS

In this paper, we derive self-attention as a support vector expansion that solves a support vector regression (SVR) problem and provide a principled primal-dual framework to analyze and synthesize attention mechanisms. We then use our framework to invent two new attention mechanisms, the Batch Normalized Attention (Attention-BN) and the Attention with Scaled Heads (Attention-SH), which improve the accuracy and efficiency of the baseline softmax attention. In our work, we approximate and learn the dual variables $\alpha_j$ and $\bar{\alpha}_j$ through the value vectors $v_j$, $j = 1, \dots, N$, in self-attention. It is natural to incorporate into the value vectors $v_j$ additional inductive biases and structure on these dual variables, obtained from solving the dual optimization problem of the SVR. Furthermore, extending our framework to explain attention modules that compute attention using neural network layers or convolutional layers applied to the input features, such as the Convolutional Block Attention Module (Woo et al., 2018), is an interesting research direction. We leave these exciting research ideas as future work.

Supplement to "A Primal-Dual Framework for Transformers and Neural Networks"

A ADDITIONAL DETAILS ON THE EXPERIMENTS

This section provides dataset, model, and training details for the experiments in Section 3. As mentioned in Section 3, for Attention-BN models, recentering queries and keys alone is sufficient for accuracy improvement, and we weight the mean $\mu$ in Eqn. 22 with a constant $\beta$. Hence, Eqn. 22 simplifies to:

$h_i = \sum_{j=1}^N \mathrm{softmax}\big((q_i - \beta\mu)^\top (k_j - \beta\mu)/\sqrt{D}\big)\, v_j$.   (26)

In our experiments, we treat the constant $\beta$ in Attention-BN/BN+SH and the downsampling scales in Attention-SH/SH+BN as hyperparameters to tune. All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs.

A.1 UEA TIME SERIES CLASSIFICATION

Datasets and metrics The benchmark (Bagnall et al., 2018) consists of 30 datasets. Following (Wu et al., 2022), we choose 10 datasets, which vary in input sequence length, number of classes, and dimensionality, to evaluate our models on temporal sequences. We report test accuracy as the evaluation metric for this benchmark.

Models and baselines

The experimental setups and configurations for the softmax/linear baselines and our models are the same as in (Wu et al., 2022) (for the PEMS-SF, SelfRegulationSCP2, and UWaveGestureLibrary datasets) and (Zerveas et al., 2021) (for the other tasks). In all models, the number of heads is 8, while the model dimension and the number of transformer layers vary. For Attention-SH/SH+BN, we downsample keys and values by a factor of 2 after every two successive heads.

A.2 LONG RANGE ARENA BENCHMARK

Datasets and metrics We adopt the following tasks in the LRA benchmark for our experiments: ListOps (Nangia & Bowman, 2018), byte-level IMDb review text classification (Maas et al., 2011), byte-level document retrieval (Radev et al., 2013), CIFAR-10 image classification (Krizhevsky et al., 2009), and the Pathfinder challenge (Linsley et al., 2018). They consist of long sequences of length 2K, 4K, 4K, 1K, and 1K, respectively. The evaluation protocol and metrics are the same as in (Tay et al., 2021).

Models and baselines All our models and the softmax/linear baselines follow the same architecture and configuration as in (Zhu et al., 2021). Each model consists of two layers and 64 embedding dimensions. While one head at each layer remains intact, the keys and values of the other heads are halved in our Attention-SH/SH+BN experiments.

A.3 IMAGE CLASSIFICATION ON IMAGENET

Datasets and metrics The ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015) consists of 1.28M training images and 50K validation images. The task is to classify 1000 categories. Top-1 and top-5 accuracies are reported.

Models and baselines

Our baseline is the DeiT-tiny model (Touvron et al., 2021).

B.1 LINEAR ATTENTION

Table 5 summarizes the comparison between the Linear Attention-BN/SH/BN+SH and the baseline Linear Attention on the UEA Time Series Classification task. The Linear Attention-BN/SH/BN+SH achieve better accuracy than the Linear Attention baseline while being more efficient.

B.2 CONVOLUTION ATTENTION

Table 6 demonstrates the advantage of Attention-Conv2D (Def. 3, Section G) over the softmax DeiT on the ImageNet image classification task. Furthermore, as shown in Table 7, the Attention-Conv1D (Def. 4, Section H) outperforms the baseline softmax attention on 5 tasks of the LRA benchmark (Tay et al., 2021).

B.3 ADDITIONAL EXPERIMENTS ON THE UEA TIMESERIES CLASSIFICATION BENCHMARK AND THE UCR TIME SERIES REGRESSION ARCHIVE

In this section, we further demonstrate the advantage of our Attention-BN/SH/BN+SH on an additional 15 tasks in the UEA Time Series Classification benchmark and on 6 tasks in the UCR Time Series Regression archive. The results in Tables 8 and 9 show that our Attention-BN and Attention-SH+BN significantly outperform the baseline softmax transformers on both benchmarks, while the Attention-SH performs comparably to the baseline while being more efficient.

B.4 UEA TIME SERIES CLASSIFICATION USING THE SPARSE ATTENTION-BN/SH/BN+SH

Table 10 summarizes the comparison between the Sparse Attention-BN/SH/BN+SH and the Sparse Attention baseline on a subset of the UEA Time Series Classification benchmark. When combined with Sparse Attention, our models achieve significantly better accuracy than the Sparse Attention baseline, while the Sparse Attention-SH/BN+SH are also more efficient (see Fig. 3 and Fig. 4 in Appendix C). We also experiment with our Attention-BN/BN+SH with a learnable β on the retrieval task. Table 11 shows that learning β does not improve much over setting β as a hyperparameter.

C ADDITIONAL RESULTS ON EFFICIENCY ANALYSIS

This section provides further efficiency analysis of our models.

Attention-SH. Fig. 2 shows the efficiency benefits of our Attention-SH when trained on the retrieval task. As in the case of Attention-SH+BN, the efficiency benefits of our Attention-SH over the baseline softmax attention grow as N and D increase.

Sparse Attention-SH/BN+SH. Fig. 3 and Fig. 4 show that the efficiency advantages of our Sparse Attention-BN+SH and Sparse Attention-SH, respectively, increase as the model dimension D and the sequence length N grow. All models are trained on the LRA retrieval task. In addition to the efficiency advantage, the Sparse Attention-BN+SH also significantly outperforms the Sparse Attention baseline in terms of accuracy on this task (79.86% vs. 78.20%), while the Sparse Attention-SH achieves a comparable result to the baseline. Further accuracy advantages of the Sparse Attention-BN/SH/BN+SH over the Sparse Attention baseline are given in Table 10.
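A back-of-the-envelope model clarifies why these savings grow with sequence length: the dominant per-head cost is proportional to N·M·D, with M the number of keys, so downsampling keys and values by a factor f shrinks that term by f. The counter below is our own simplification (it ignores projections, softmax, and normalization terms):

```python
def attention_flops(n_queries, n_keys, d):
    """Rough FLOPs for one attention head: the N x M score matrix
    (n_queries * n_keys * d multiply-adds) plus the weighted sum
    of values (another n_queries * n_keys * d)."""
    return 2 * n_queries * n_keys * d

def sh_ratio(n, d, factors):
    """FLOPs ratio of Attention-SH (per-head key/value downsampling
    factors) relative to full softmax attention with the same heads."""
    full = len(factors) * attention_flops(n, n, d)
    sh = sum(attention_flops(n, n // f, d) for f in factors)
    return sh / full
```

Under this crude model, the UEA schedule s = [1, 1, 2, 2, 4, 4, 8, 8] predicts roughly a 53% reduction in score/value FLOPs independent of N and D; the measured ratios in Figs. 2-4 differ because projections and memory traffic are not modeled.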

D DERIVING SOFTMAX ATTENTION

Choosing the appropriate $h(x)$ and $\Phi(x)$ allows us to derive the popular softmax attention given in Eqns. 1 and 2. In particular, if we choose $h(x) := \sum_{j=1}^{N} \Phi(x)^\top \Phi(k_j)$, Eqn. 10 becomes

$$f(x) = \sum_{j=1}^{N} \frac{\Phi(x)^\top \Phi(k_j)}{\sum_{j'=1}^{N} \Phi(x)^\top \Phi(k_{j'})}\, v_j + b = \sum_{j=1}^{N} \frac{\Phi(x)^\top \Phi(k_j)\, v_j}{\sum_{j'=1}^{N} \Phi(x)^\top \Phi(k_{j'})} + b. \qquad (27)$$

We then select $\Phi$ to be the feature map of the exponential kernel,

$$\Phi(x) = \big[a^{(0)}_{l_0}, a^{(1)}_{l_1}, \dots\big], \qquad a^{(t)}_{l} = \frac{(x_1/\sqrt[4]{D})^{n_1} \cdots (x_D/\sqrt[4]{D})^{n_D}}{\sqrt{n_1! \cdots n_D!}}, \quad n_1 + \cdots + n_D = t, \; 1 \le l \le l_t. \qquad (28)$$

Since

$$\exp(x^\top y) = \sum_{t=0}^{\infty} \frac{(x^\top y)^t}{t!} = \sum_{t=0}^{\infty} \sum_{n_1+\cdots+n_D = t} \frac{x_1^{n_1} \cdots x_D^{n_D}}{\sqrt{n_1! \cdots n_D!}} \cdot \frac{y_1^{n_1} \cdots y_D^{n_D}}{\sqrt{n_1! \cdots n_D!}},$$

Eqn. 27 becomes

$$f(x) = \sum_{j=1}^{N} \frac{\exp\big(x^\top k_j/\sqrt{D}\big)}{\sum_{j'=1}^{N} \exp\big(x^\top k_{j'}/\sqrt{D}\big)}\, v_j + b. \qquad (30)$$

Let $x = q_i$, $b = 0$, and relax the boundedness constraint on $v_j$ in Remark 1. Eqn. 30 then becomes Eqn. 2 of the softmax attention (Vaswani et al., 2017).

E BATCH NORMALIZED ATTENTION: DERIVATION OF EQN. 24

$$\begin{aligned}
h_i &= \sum_{j=1}^{N} \mathrm{softmax}\Big(\sum_{d=1}^{D} \frac{(q_i(d)-\mu(d))(k_j(d)-\mu(d))}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)\, v_j \\
&= \sum_{j=1}^{N} \mathrm{softmax}\Big(\sum_{d=1}^{D} \frac{q_i(d)k_j(d) - q_i(d)\mu(d) - \mu(d)k_j(d) + \mu(d)\mu(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)\, v_j \\
&= \sum_{j=1}^{N} \frac{\exp\Big(\sum_{d=1}^{D} \frac{q_i(d)k_j(d) - \mu(d)k_j(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)\exp\Big(\sum_{d=1}^{D} \frac{\mu(d)\mu(d) - q_i(d)\mu(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)}{\sum_{j'=1}^{N}\exp\Big(\sum_{d=1}^{D} \frac{q_i(d)k_{j'}(d) - \mu(d)k_{j'}(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)\exp\Big(\sum_{d=1}^{D} \frac{\mu(d)\mu(d) - q_i(d)\mu(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)}\, v_j \\
&= \sum_{j=1}^{N} \mathrm{softmax}\Big(\sum_{d=1}^{D} \frac{q_i(d)k_j(d) - \mu(d)k_j(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)\, v_j \\
&= \sum_{j=1}^{N} \mathrm{softmax}\Big(\sum_{d=1}^{D} \frac{q_i(d)k_j(d) - \frac{1}{N}\sum_{j'=1}^{N} k_{j'}(d)\, k_j(d)}{\sqrt{D}(\sigma_d^2+\epsilon)}\Big)\, v_j.
\end{aligned}$$

Here the factors $\exp\big(\sum_d (\mu(d)\mu(d) - q_i(d)\mu(d))/(\sqrt{D}(\sigma_d^2+\epsilon))\big)$ do not depend on $j$ and cancel between numerator and denominator, leaving only the recentered scores in the last two lines.
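The cancellation step in this derivation (terms independent of j drop out of the softmax) can be checked numerically. The sketch below omits the per-dimension variance scaling for brevity, since it multiplies both forms identically and does not affect the cancellation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bn_scores_match(q, K):
    """Verify softmax_j((q - mu)^T (k_j - mu) / sqrt(D)) equals
    softmax_j((q^T k_j - mu^T k_j) / sqrt(D)): the terms -q^T mu and
    mu^T mu are constant in j, so they cancel inside the softmax."""
    mu, c = K.mean(axis=0), np.sqrt(K.shape[1])
    full = softmax((K - mu) @ (q - mu) / c)
    reduced = softmax((K @ q - K @ mu) / c)
    return np.allclose(full, reduced)
```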

F ATTENTION WITH THE RESIDUAL CONNECTION AND MATRIX PROJECTIONS

In this supplement, we first discuss attention with the residual connection and matrix projections. Suppose we are given training data $\{(x_1, y_1), \dots, (x_N, y_N)\} \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^{D_x}$ and $\mathcal{Y} = \mathbb{R}^{D_v}$. Here, $x_1, \dots, x_N$ are the training inputs, and $y_1, \dots, y_N$ are the training targets. In order to derive the attention with the residual connection and the query, key, and value matrix projections, we define the function $f$ as follows:

$$y = f(x) := W \frac{\Phi(W^{proj} x)}{h(x)} + x + b,$$

where $x \in \mathcal{X} = \mathbb{R}^{D_x}$, $W^{proj} = [w^{proj}_1, \dots, w^{proj}_D]^\top \in \mathbb{R}^{D \times D_x}$, $\Phi(\cdot) = [\phi_1(\cdot), \dots, \phi_{D_\phi}(\cdot)]^\top : \mathbb{R}^{D} \to \mathbb{R}^{D_\phi}$, $W = [w_1, \dots, w_{D_v}]^\top \in \mathbb{R}^{D_v \times D_\phi}$, $b \in \mathbb{R}^{D_v}$, and $h(x)$ is a vector-scalar function. We fit the function $f$ to the training data with an $L_2$ regularization on $W$ and $W^{proj}$ by solving the following convex optimization problem:

$$\underset{W,\, W^{proj},\, \xi_j, \hat{\xi}_j,\, j=1,\dots,N}{\text{minimize}} \quad \frac{1}{2}\sum_{d=1}^{D_v}\|w_d\|^2 + \frac{1}{2}\sum_{d=1}^{D}\|w^{proj}_d\|^2 + C\sum_{j=1}^{N}\sum_{d=1}^{D_v}\big(\xi_j(d)+\hat{\xi}_j(d)\big)$$

$$\text{subject to} \quad
\begin{cases}
y_j(d) - w_d^\top \Phi(W^{proj} x_j)/h(x_j) - x_j(d) - b(d) \le \epsilon + \xi_j(d) \\
w_d^\top \Phi(W^{proj} x_j)/h(x_j) + x_j(d) + b(d) - y_j(d) \le \epsilon + \hat{\xi}_j(d) \\
\xi_j(d), \hat{\xi}_j(d) \ge 0
\end{cases}, \quad j = 1,\dots,N, \; d = 1,\dots,D_v. \qquad (33)$$

The Lagrangian of the optimization problem (33) is given by

$$\begin{aligned}
\mathcal{L}_1 := {} & \frac{1}{2}\sum_{d=1}^{D_v}\|w_d\|^2 + \frac{1}{2}\sum_{d=1}^{D}\|w^{proj}_d\|^2 + C\sum_{j=1}^{N}\sum_{d=1}^{D_v}\big(\xi_j(d)+\hat{\xi}_j(d)\big) - \sum_{j=1}^{N}\sum_{d=1}^{D_v}\big(\eta_j(d)\xi_j(d) + \hat{\eta}_j(d)\hat{\xi}_j(d)\big) \\
& - \sum_{j=1}^{N}\sum_{d=1}^{D_v}\alpha_j(d)\Big(\epsilon + \xi_j(d) - y_j(d) + \frac{w_d^\top \Phi(W^{proj} x_j)}{h(x_j)} + x_j(d) + b(d)\Big) \\
& - \sum_{j=1}^{N}\sum_{d=1}^{D_v}\hat{\alpha}_j(d)\Big(\epsilon + \hat{\xi}_j(d) + y_j(d) - \frac{w_d^\top \Phi(W^{proj} x_j)}{h(x_j)} - x_j(d) - b(d)\Big).
\end{aligned}$$

Similar to the derivation in Section 2.1, the partial derivatives of $\mathcal{L}_1$ with respect to the primal variables $w_d$, $d = 1,\dots,D_v$, have to vanish for optimality, which leads to

$$\partial_{w_d} \mathcal{L}_1 = w_d - \sum_{j=1}^{N}(\alpha_j(d) - \hat{\alpha}_j(d))\frac{\Phi(W^{proj} x_j)}{h(x_j)} = 0 \;\Rightarrow\; w_d = \sum_{j=1}^{N}(\alpha_j(d) - \hat{\alpha}_j(d))\frac{\Phi(W^{proj} x_j)}{h(x_j)}. \qquad (35)$$

Note that here we only find the form of the optimal solution for $W = [w_1, \dots, w_{D_v}]^\top$. The optimal value of $W^{proj}$ can then be found by an optimization algorithm such as (stochastic) gradient descent when training the transformer. Let $v_j = \big[\frac{\alpha_j(1)-\hat{\alpha}_j(1)}{h(x_j)}, \dots, \frac{\alpha_j(D_v)-\hat{\alpha}_j(D_v)}{h(x_j)}\big]^\top$, $j = 1,\dots,N$; we obtain the following support vector expansion of the function $f$:

$$f(x) = \sum_{j=1}^{N} \frac{\Phi(W^{proj} x)^\top \Phi(W^{proj} x_j)}{h(x)}\, v_j + \underbrace{x}_{\text{Residual connection}} + b. \qquad (36)$$

Here, the support vector expansion of $f$ already includes a residual connection. The softmax attention can then be derived by selecting $h(x) := \sum_{j=1}^{N} \Phi(W^{proj} x)^\top \Phi(W^{proj} x_j)$ and choosing $\Phi$ as in Eqn. 28 in Section 2.1. Note that in Eqn. 36, $\{x_j\}_{j=1}^{N}$ and $x$ are the training samples and a test sample, respectively. In order to derive the key, query, and value matrix projections in attention, we can then relax Eqn. 36 by letting $W^{proj} x_j = W^K x_j$, $W^{proj} x = W^Q x$, and $v_j = W^V x_j$, and choosing the test sample $x$ among the training samples $\{x_j\}_{j=1}^{N}$.

Remark 6 Here, for self-attention, we choose the test sample $x$ among the training samples $\{x_j\}_{j=1}^{N}$ to compute the attention scores of a token with respect to the other tokens in the same sequence. For cross-attention, where a token in one sequence attends to tokens in another sequence, this constraint can be removed.
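After the softmax choices of h and Φ, the support vector expansion in Eqn. 36 is ordinary softmax attention plus a skip connection. A minimal NumPy sketch under our own illustrative assumptions (a value projection with D_x = D_v so the residual adds cleanly; names are not from the authors' code):

```python
import numpy as np

def attention_with_residual(X, Wq, Wk, Wv):
    """Softmax attention with the residual connection that emerges from
    the primal formulation: h_i = sum_j softmax(q_i^T k_j / sqrt(D)) v_j + x_i.
    X: (N, Dx); Wq, Wk: (D, Dx); Wv: (Dx, Dx) so shapes match the skip."""
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V + X    # the residual term from Eqn. 36
```

Note that a zero value projection reduces the layer to the identity map, which is exactly the skip path of the expansion.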

G 2D-CONVOLUTION ATTENTION

In this section, we discuss attention with 2D convolution. Suppose we are given training data $\{(x^{train}_1, y^{train}_1), \dots, (x^{train}_{N_H \times N_W}, y^{train}_{N_H \times N_W})\} \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^{D_x}$ and $\mathcal{Y} = \mathbb{R}^{D_v}$; here $x^{train}_1, \dots, x^{train}_{N_H \times N_W}$ are the training inputs, and $y^{train}_1, \dots, y^{train}_{N_H \times N_W}$ are the training targets. Let $\mathbf{X}^{train} \in \mathbb{R}^{N_H \times N_W \times D_x}$ be the 3D tensor of training inputs, where $\mathbf{X}^{train}(h, w, d) = x^{train}_{N_W(h-1)+w}(d)$. Given a new set of inputs $\{x_1, \dots, x_{N_H \times N_W}\} \subset \mathcal{X}$ and the corresponding 3D tensor $\mathbf{X} \in \mathbb{R}^{N_H \times N_W \times D_x}$ of these inputs, where $\mathbf{X}(h, w, d) = x_{N_W(h-1)+w}(d)$, we consider the function $f$ applied to the 3D tensor $\mathbf{X}$ and taking the following form:

$$f(x_i) = W \frac{\Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}, s))(i))}{h(x_i)}, \quad i = 1, \dots, N_H \times N_W,$$

where Conv2D is the depth-wise 2D convolution (Howard et al., 2017), with kernel size $s \times s$ and identical kernel channels, applied to the input tensor $\mathbf{X}$. Here, the last dimension of $\mathbf{X}$, i.e., $D_x$, is the depth. Also, $\Phi(x) = [\phi_1(x), \dots, \phi_{D_\phi}(x)]^\top \in \mathbb{R}^{D_\phi}$.

J HYPERPARAMETERS

In this section, we provide the hyper-parameters for our best models.

J.1 UEA TIME SERIES CLASSIFICATION AND REGRESSION

For these two benchmarks, we use the set of downsampling factors s = [1, 1, 2, 2, 4, 4, 8, 8] for the Attention-SH/BN+SH and Linear/Sparse Attention-SH/BN+SH models trained on the UEA benchmark. Tables 12 and 13 provide the values of β used for our best Attention-BN/BN+SH, Linear Attention-BN/BN+SH, and Sparse Attention-BN/BN+SH models trained on subsets of the two benchmarks.



foot_0: Implementation available at https://github.com/thuml/Flowformer.
foot_1: Implementation available at https://github.com/gzerveas/mvts_transformer.
foot_2: Implementation available at https://github.com/NVIDIA/transformer-ls.
foot_3: Implementation available at https://github.com/facebookresearch/deit.



These dual variables have to satisfy positivity constraints, i.e., $\eta_j(d), \hat{\eta}_j(d), \alpha_j(d), \hat{\alpha}_j(d) \ge 0$, $\forall j = 1,\dots,N$, $\forall d = 1,\dots,D_v$. It follows from the saddle point condition that the partial derivatives of the Lagrangian function $\mathcal{L}$ with respect to the primal variables $w_d$, $b(d)$, $\{\xi_j(d), \hat{\xi}_j(d)\}_{j=1}^{N}$, $d = 1,\dots,D_v$, have to vanish for optimality.

Figure 1: (Left) FLOPS ratios and (Right) memory usage ratios between the Attention-BN+SH and the softmax baseline trained on retrieval task for different model dimensions and sequence lengths. The reduction in computation and memory when using our models improves with sequence length. When scaling up the model, our methods remain significantly more beneficial than the baseline.

Figure 2: (Left) FLOPS ratios and (Right) memory usage ratios between the Attention-SH and the softmax attention baseline trained on the LRA retrieval task for different model dimensions and sequence lengths.

Figure 3: (Left) FLOPS ratios and (Right) memory usage ratios between the Sparse Attention-BN+SH and the Sparse Attention baseline trained on the LRA retrieval task for different model dimensions and sequence lengths. When using our models, the reduction in computation and memory improves with sequence length. When scaling up the model with greater model dimension, our methods remain significantly more efficient than the baseline.

Figure 4: (Left) FLOPS ratios and (Right) memory usage ratios between the Sparse Attention-SH and the Sparse Attention baseline trained on the LRA retrieval task for different model dimensions and sequence lengths. When using our models, the reduction in computation and memory improves with sequence length. When scaling up the model with greater model dimension, our methods remain significantly more efficient than the baseline.


Furthermore, $W = [w_1, \dots, w_{D_v}]^\top \in \mathbb{R}^{D_v \times D_\phi}$, $b \in \mathbb{R}^{D_v}$, and $h$ is a vector-scalar function. We fit the function $f$ to the training data $\{(x^{train}_1, y^{train}_1), \dots, (x^{train}_{N_H \times N_W}, y^{train}_{N_H \times N_W})\}$ with an $L_2$ regularization on $W$, i.e., a ridge regression, by solving the following convex optimization problem:

$$\underset{W,\, \xi_j, \hat{\xi}_j,\, j=1,\dots,N_H \times N_W}{\text{minimize}} \quad \frac{1}{2}\sum_{d=1}^{D_v}\|w_d\|^2 + C\sum_{j=1}^{N_H \times N_W}\sum_{d=1}^{D_v}\big(\xi_j(d)+\hat{\xi}_j(d)\big)$$

$$\text{subject to} \quad
\begin{cases}
y^{train}_j(d) - w_d^\top \Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}^{train}, s))(j))/h(x^{train}_j) - b(d) \le \epsilon + \xi_j(d) \\
w_d^\top \Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}^{train}, s))(j))/h(x^{train}_j) + b(d) - y^{train}_j(d) \le \epsilon + \hat{\xi}_j(d) \\
\xi_j(d), \hat{\xi}_j(d) \ge 0
\end{cases}, \quad j = 1,\dots,N_H \times N_W, \; d = 1,\dots,D_v. \qquad (38)$$

The Lagrangian of the optimization problem (38) is formed as in the derivation in Section 2.1 in the main text; its partial derivatives with respect to the primal variables $w_d$, $d = 1,\dots,D_v$, have to vanish for optimality, which leads to

$$\partial_{w_d} \mathcal{L} = w_d - \sum_{j=1}^{N_H \times N_W}(\alpha_j(d) - \hat{\alpha}_j(d))\frac{\Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}^{train}, s))(j))}{h(x^{train}_j)} = 0 \qquad (40)$$

$$\Rightarrow\; w_d = \sum_{j=1}^{N_H \times N_W}(\alpha_j(d) - \hat{\alpha}_j(d))\frac{\Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}^{train}, s))(j))}{h(x^{train}_j)}. \qquad (41)$$


The results in Table 2 indicate significant improvements of Attention-BN/SH/BN+SH over the baseline softmax attention, as in the UEA experiments.

Test Accuracy (%) of the Attention-BN/SH/BN+SH vs. the baseline softmax attention on the LRA benchmark (Tay et al., 2021). Our models significantly outperform the softmax baseline.

Layer-average mean and standard deviation of L2 distances between heads of Attention-BN/SH/BN+SH versus dot-product attention transformer trained on the retrieval task. Our attentions attain greater L2 distances between heads than the baseline, suggesting that they capture more diverse attention patterns.

Test Accuracy (%) of the Linear Attention-BN/SH/BN+SH vs. the baseline Linear Attention (Katharopoulos et al., 2020) on the UEA Time Series Classification Archive benchmark (Bagnall et al., 2018). Our proposed attentions outperform the baseline.

Top-1 and top-5 accuracy (%) of the Attention-Conv2D Deit vs. the baseline Deit with the softmax attention on the ImageNet image classification task. The Attention-Conv2D Deit significantly outperforms the baseline in both top-1 and top-5 accuracy.

Test Accuracy (%) of the Attention-Conv1D vs. the baseline softmax attention on 5 tasks of the LRA benchmark (Tay et al., 2021). Our models outperform the softmax baseline.

Root mean square error (RMSE) of the Attention-BN/SH/BN+SH vs. the baseline softmax attention on 6 UCR Time Series Regression tasks (Tan et al., 2020). Smaller RMSE indicates better performance.

Accuracy (%) of the Attention-BN/SH/BN+SH vs. the baseline softmax attention on the other 15 UEA Time Series classification tasks (Bagnall et al., 2018).

Test Accuracy (%) of the Sparse Attention-BN/SH/BN+SH vs. the baseline Sparse Attention (Child et al., 2019) on a subset of the UEA Time Series Classification Archive benchmark (Bagnall et al., 2018). Our proposed attentions outperform the baseline.

Test Accuracy (%) of the Attention-BN/BN+SH with β learnable or set as a hyperparameter, on the retrieval task (Tay et al., 2021).

The values of β for Linear Attention-BN/BN+SH and Sparse Attention-BN/BN+SH trained on the selected 10 UEA tasks.

The values of β for Attention-BN/BN+SH trained on 25 UEA Time Series classification tasks (Bagnall et al., 2018) and 6 UCR Time Series Regression tasks.

The values of β for Attention-BN/BN+SH trained on the 5 tasks of the LRA benchmark (Tay et al., 2021).

ACKNOWLEDGEMENTS

This material is based on research sponsored by the NSF under Grant# 2030859 to the Computing Research Association for the CIFellows Project (CIF2020-UCLA-38). SJO acknowledges support from the ONR N00014-20-1-2093 and N00014-20-1-2787 and the NSF DMS 2208272 and 1952339. RGB acknowledges support from the NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571, N00014-20-1-2534, and MURI N00014-20-1-2787; AFOSR grant FA9550-22-1-0060; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047. ALB acknowledges support from the NSF grants DMS-2152717 and DMS-1952339. NH acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning.


Let $v_j = \big[\frac{\alpha_j(1)-\hat{\alpha}_j(1)}{h(x^{train}_j)}, \dots, \frac{\alpha_j(D_v)-\hat{\alpha}_j(D_v)}{h(x^{train}_j)}\big]^\top$, $j = 1,\dots,N_H \times N_W$, and substitute Eqn. 41 into Eqn. 38; we obtain the following support vector expansion of the linear basis function $f$:

$$f(x_i) = \sum_{j=1}^{N_H \times N_W} \frac{A_{ij}}{h(x_i)}\, v_j + b, \quad \text{where } A_{ij} := \Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}, s))(i))^\top \Phi(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{X}^{train}, s))(j)). \qquad (42)$$

As in Section 2.1, we set $b_s = 0$. To derive the softmax normalization in attention, we choose $h(x_i) := \sum_{j=1}^{N_H \times N_W} A_{ij}$ and select $\Phi$ as in Eqn. 28. Letting the training inputs $\{x^{train}_1, \dots, x^{train}_{N_H \times N_W}\}$ play the role of the keys, we define the 2D-Convolution Attention (Attention-Conv2D) as follows:

Definition 3 (2D-Convolution Attention) Given a set of key and value vectors $\{k_j, v_j\}_{j=1}^{N_H \times N_W}$ and a set of query vectors $\{q_i\}_{i=1}^{N_H \times N_W}$, denote the key tensor and the query tensor by $\mathbf{K}, \mathbf{Q} \in \mathbb{R}^{N_H \times N_W \times D}$, where $\mathbf{K}(h, w, d) = k_{N_W(h-1)+w}(d)$ and $\mathbf{Q}(h, w, d) = q_{N_W(h-1)+w}(d)$. The 2D-Convolution Attention (Attention-Conv2D) computes the corresponding output vector $h_i$ of the query $q_i$ by the following attention formula:

$$h_i = \sum_{j=1}^{N_H \times N_W} \mathrm{softmax}\Big(\mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{Q}, s))(i)^\top \mathrm{Flatten}(\mathrm{Conv2D}(\mathbf{K}, s))(j)/\sqrt{D}\Big)\, v_j,$$

where $\mathrm{Conv2D}(\cdot, s)$ is the depth-wise 2D convolution (Howard et al., 2017) with kernel size $s \times s$ and identical kernel channels.
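Definition 3 can be sketched in NumPy under our own simplifications: a single head, a single scalar kernel shared across channels ("identical kernel channels"), and zero 'same' padding; the padding and stride details are assumptions, not taken from the paper:

```python
import numpy as np

def depthwise_conv2d(X, kernel):
    """X: (H, W, D) tensor with channels last; kernel: (s, s) shared
    across all D channels. Zero 'same' padding, stride 1."""
    s = kernel.shape[0]
    p = s // 2
    Xp = np.pad(X, ((p, p), (p, p), (0, 0)))
    H, W, D = X.shape
    out = np.zeros_like(X, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = Xp[i:i + s, j:j + s]          # (s, s, D) window
            out[i, j] = np.tensordot(kernel, patch, axes=([0, 1], [0, 1]))
    return out

def attention_conv2d(Q, K, V, kernel):
    """Attention-Conv2D sketch: convolve the query/key tensors depth-wise,
    flatten to (H*W, D), then apply scaled softmax attention.
    Q, K: (H, W, D); V: (H*W, Dv)."""
    D = Q.shape[-1]
    q = depthwise_conv2d(Q, kernel).reshape(-1, D)
    k = depthwise_conv2d(K, kernel).reshape(-1, D)
    scores = q @ k.T / np.sqrt(D)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V
```

The explicit loops keep the depth-wise operation transparent; a practical implementation would instead use a grouped convolution (e.g., groups equal to the channel count).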

Remark 7 (Convolutional Projection for Attention in the Convolutional vision Transformer)

The convolutional projections used in the Convolutional vision Transformer (CvT) (Wu et al., 2021) can be derived from Eqn. 42 by letting the training input tensor $\mathbf{X}^{train}$ be the 2D input matrix of size $N \times D_x$ of the self-attention layer (see Section 1.1 in the main text) reshaped into a 3D tensor. Here, to avoid confusion, we denote the input of the self-attention layer by $\mathbf{X}^{input}$ and its reshaped version by $\mathrm{Reshape2D}(\mathbf{X}^{input})$. We then replace the depth-wise 2D convolution by the depth-wise separable 2D convolution used in (Wu et al., 2021) and remove the constraint that the kernels have identical channels. In order to derive the convolutional projections for the keys, queries, and values in CvT, for $i, j = 1, \dots, N$, we let the projections take the form in Eqn. 44. Here, we write the kernels/filters as $W^K$, $W^Q$, and $W^V$ to emphasize that the convolutional projections in CvT use different kernels to compute the keys, queries, and values in self-attention. Eqn. 44 matches the convolutional projections in CvT. By choosing $h$ and $\Phi$ similarly to the above, we can derive the convolutional attention in CvT.

H 1D-CONVOLUTION ATTENTION

Following the derivation of the Attention-Conv2D in Appendix G above, we can derive the 1D-Convolution Attention (Attention-Conv1D) analogously by letting $\mathbf{X}^{train} \in \mathbb{R}^{N \times D_x}$ and $\mathbf{X} \in \mathbb{R}^{N \times D_x}$ be 2D matrices of training inputs and new inputs, respectively, and by replacing Conv2D with Conv1D, the depth-wise 1D convolution with kernel size $s \times 1$ and identical kernel channels, applied to the input $\mathbf{X}$. Here, the last dimension of $\mathbf{X}$, i.e., $D_x$, is the depth. We define the 1D-Convolution Attention (Attention-Conv1D) as follows:

Definition 4 (1D-Convolution Attention) Given a set of key and value vectors $\{k_j, v_j\}_{j=1}^{N}$ and a set of query vectors $\{q_i\}_{i=1}^{N}$, denote the key matrix and the query matrix by $\mathbf{K} := [k_1, \dots, k_N]^\top \in \mathbb{R}^{N \times D}$ and $\mathbf{Q} := [q_1, \dots, q_N]^\top \in \mathbb{R}^{N \times D}$, respectively. The 1D-Convolution Attention (Attention-Conv1D) computes the corresponding output vector $h_i$ of the query $q_i$ by the following attention formula:

$$h_i = \sum_{j=1}^{N} \mathrm{softmax}\Big(\mathrm{Conv1D}(\mathbf{Q}, s)(i)^\top \mathrm{Conv1D}(\mathbf{K}, s)(j)/\sqrt{D}\Big)\, v_j,$$

where $\mathrm{Conv1D}(\cdot, s)$ is the depth-wise 1D convolution with kernel size $s \times 1$ and identical kernel channels.

I ATTENTION WITH BATCH NORMALIZATION AND SCALED HEADS

The Attention-BN+SH combines the Attention-BN and the Attention-SH. The Attention-BN+SH fits the functions $f_s$, $s = 1,\dots,H$, in Eqn. 17 on per-head training sets whose keys are batch-normalized, i.e., recentered by the mean $\mu^s$ and rescaled by the standard deviation $\sigma^s$, and downsampled to $N_s$ tokens. Following the same derivation as in Section 2.1, we derive the corresponding support vector expansion of $f_s$ (Eqn. 48). Here,

$$v^s_j = \Big[\frac{\alpha^s_j(1)-\hat{\alpha}^s_j(1)}{h^s\big((k^s_j-\mu^s)\odot(\sigma^s)^{-1}\big)}, \dots, \frac{\alpha^s_j(D_v)-\hat{\alpha}^s_j(D_v)}{h^s\big((k^s_j-\mu^s)\odot(\sigma^s)^{-1}\big)}\Big]^\top,$$

where $\alpha^s_j$ and $\hat{\alpha}^s_j$ are the dual variables, $j = 1,\dots,N_s$. Same as in Section 2.1, in Eqn. 48 we choose $\Phi$ as in Eqn. 28, $h^s(x) := \sum_{j=1}^{N_s} \Phi(x)^\top \Phi(k^s_j)$, and $b_s = 0$ to obtain the Batch Normalized Attention with Scaled Heads (Attention-BN+SH), which is defined as follows:

Definition 5 (Batch Normalized Attention with Scaled Heads) Given $H$ sets of key and value vectors $\{k^1_j, v^1_j\}_{j=1}^{N_1}, \dots, \{k^H_j, v^H_j\}_{j=1}^{N_H}$, for each set of $H$ query vectors $q^1_i, \dots, q^H_i$, $i = 1,\dots,N$, the Batch Normalized Attention with Scaled Heads (Attention-BN+SH) computes the corresponding output vector $h_i$ of the queries $q^1_i, \dots, q^H_i$ by the attention formula in Eqn. 49.

Following Remark 5 in Section 2.3.2, given the input sequence $\mathbf{X}$ of $N$ feature vectors in self-attention, in order to generate the sets $\{k^s_j, v^s_j\}_{j=1}^{N_s}$ at scale $s$, we can downsample the input $\mathbf{X}$ before projecting it into the key matrix $\mathbf{K}$ and the value matrix $\mathbf{V}$. In this paper, we use average pooling to downsample $\mathbf{X}$. As in the case of Attention-BN, for Attention-BN+SH, recentering the queries and keys alone is sufficient for accuracy improvement, and we weight the mean $\mu$ in Eqn. 49 with a constant $\beta$. Hence, Eqn. 49 simplifies to

$$h^s_i = \sum_{j=1}^{N_s} \mathrm{softmax}\Big((q^s_i - \beta\mu^s)^\top (k^s_j - \beta\mu^s)/\sqrt{D}\Big)\, v^s_j.$$

Published as a conference paper at ICLR 2023

J.2 LONG RANGE ARENA BENCHMARK

For all 5 tasks of the LRA benchmark, we set the downsampling factors s of the Attention-SH/BN+SH and Linear/Sparse Attention-SH/BN+SH models to [1, 2], and the kernel size of the Attention-Conv1D models to 5. In addition, Table 14 provides the values of β for the Attention-BN/BN+SH models trained on the benchmark.
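Putting the two components together, a single Attention-BN+SH head in the simplified (β-weighted recentering) form can be sketched as follows. This is our own minimal NumPy sketch: pooling is applied directly to the keys and values for brevity, whereas the paper pools the input X before the projections:

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def avg_pool(X, factor):
    """Average-pool a (N, D) sequence along tokens by `factor`."""
    n = X.shape[0] // factor
    return X[:n * factor].reshape(n, factor, -1).mean(axis=1)

def attention_bn_sh_head(Q, K, V, factor, beta):
    """One Attention-BN+SH head: downsample keys/values by `factor`
    (scaled head), recenter queries and keys by the beta-weighted
    key mean (batch normalization), then attend."""
    K, V = avg_pool(K, factor), avg_pool(V, factor)
    mu = K.mean(axis=0)
    D = Q.shape[1]
    S = (Q - beta * mu) @ (K - beta * mu).T / np.sqrt(D)
    return softmax_rows(S) @ V
```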

J.3 IMAGENET CLASSIFICATION

For this task, the β of Attention-BN/BN+SH is 1, Attention-SH/BN+SH uses the downsampling factors [1, 1, 2, 4], and the kernel size of Attention-Conv2D is (2, 2).

