ANALYZING ATTENTION MECHANISMS THROUGH THE LENS OF SAMPLE COMPLEXITY AND LOSS LANDSCAPE

Abstract

Attention mechanisms have advanced state-of-the-art deep learning models for many machine learning tasks. Despite significant empirical gains, there is a lack of theoretical analysis of their effectiveness. In this paper, we address this problem by studying the sample complexity and loss landscape of attention-based neural networks. Our results show that, under mild assumptions, every local minimum of the attention model has low prediction error, and attention models require lower sample complexity than models without attention. Besides revealing why popular self-attention works, our theoretical results also provide guidelines for designing future attention models. Experiments on various datasets validate our theoretical findings.

1. INTRODUCTION

Significant research in machine learning has focused on designing network architectures for superior performance, faster convergence, and better generalization. Attention mechanisms are one such design choice that is widely used in many natural language processing and computer vision tasks. Inspired by human cognition, attention mechanisms advocate focusing on the relevant regions of input data to solve a desired task rather than ingesting the entire input.



We summarize our work as follows. We carefully analyze the loss landscape of attention mechanisms in Sections 3 and 4. In Section 3, we show that, under mild assumptions, every stationary point of attention models achieves a low generalization error. Section 4 studies other properties of attention models on the loss landscape. After the loss landscape analyses, we discuss how our theoretical results can guide practitioners in designing better attention models in Section 5. We then validate our theoretical findings with experiments on various datasets in Section 6. Section 7 includes a few concluding remarks. Proofs and more technical details are presented in the appendix.

2. ATTENTION MODELS

Attention mechanisms are modules that help neural networks focus only on the relevant regions of input data when making predictions. To compare attention models with non-attention models, we first introduce a two-layer non-attention model as the baseline. The network architecture consists of a linear layer, a rectified linear unit (ReLU) non-linear activation, and a second linear layer. Denote the weights of the first layer by $w^{(1)} \in \mathbb{R}^{p\times d}$, the weights of the second layer by $w^{(2)} \in \mathbb{R}^{d}$, and the ReLU function by $\phi(\cdot)$. Then the response function for an input $x \in \mathbb{R}^p$ can be written as $y = w^{(2)T}\phi(\langle w^{(1)}, x\rangle)$. We call this function the "baseline model" since it does not employ any attention.

To study attention mechanisms, we mainly focus on analyzing the most popular self-attention model. In this paper, we consider two types of self-attention models. For the first type, we consider attention weights that are determined by a function $f(x)$:
$$y = w^{(2)T}\phi(\langle w^{(1)}, x \odot f(x)\rangle),$$
where $f(\cdot)$ is a known mapping from $\mathbb{R}^p$ to $\mathbb{R}^p$ representing the attention weight of each feature for any given $x$. This model is a prototype version of the transformer model (Vaswani et al., 2017), with a pre-determined function as attention weights.

Second, we introduce a more practical self-attention setup, namely the transformer model proposed in Vaswani et al. (2017). To mimic the NLP task, we set the input $x_i = (x_i^1, \ldots, x_i^p) \in \mathbb{R}^{t\times p}$, where the $x_i^j \in \mathbb{R}^t$ are $t$-dimensional vectors. Intuitively, each $x_i$ corresponds to an independent sentence for $i = 1, \ldots, n$, and the $x_i^j$ are fixed-dimensional vector embeddings of each word in sentence $x_i$. $w^Q, w^K \in \mathbb{R}^{d_q\times t}$ are the query and key weight matrices, and $w^V \in \mathbb{R}^{d_v\times t}$ is the value matrix. For each input $x_i$, the key is calculated as $K_i = (w^K x_i)^T \in \mathbb{R}^{p\times d_q}$. For the $z$-th vector in the input, the query vector is computed as $Q_i^z = (w^Q x_i^z)^T \in \mathbb{R}^{1\times d_q}$ for $z = 1, \ldots, p$. The value matrix is $V = w^V x_i \in \mathbb{R}^{d_v\times p}$. Then the self-attention with respect to the $z$-th vector in the input $x_i$ is computed as
$$a_i^{self(z)}(x_i^z, w^Q, w^K) = \mathrm{softmax}\Big(\frac{Q_i^z K_i^T}{\sqrt{d_q}}\Big), \quad z = 1, \ldots, p, \qquad (1)$$
and $a_i^{self} = (a_i^{self(1)}, \ldots, a_i^{self(p)})$. This self-attention matrix represents the interaction between different words in each sentence. The value vector for the $z$-th word in the sentence can then be calculated as $V_i^z = V a_i^{self(z)}$, and stacking and flattening these attended values gives the two-layer self-attention model $y = w^{(2)T}\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)$ (2).

For our analysis, we replace the ReLU activation with a smooth softplus surrogate $\phi_{\tau_0}(x)$; the theoretical results we derive with $\phi_{\tau_0}(x)$ hold for any arbitrarily large $\tau_0$. For ease of notation, we still use $\phi(x)$ to denote the softplus function. In this paper, we focus on the regression task, which minimizes the loss $L = \mathbb{E}_{(x,y)\sim D}\|h(x) - y\|_2^2$, where $h(x)$ is the baseline/attention model defined above. Our theory can be easily extended to classification tasks as well.
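To make the model definitions concrete, the following NumPy sketch implements the baseline model, the known-weight-function attention model, and the transformer-style self-attention defined above. It is a minimal illustration rather than the exact training code: the dimension names, the softmax helper, and the random example values are ours.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def baseline(x, w1, w2):
    # y = w2^T phi(<w1, x>); x: (p,), w1: (p, d), w2: (d,)
    return w2 @ np.maximum(w1.T @ x, 0.0)

def known_attention(x, w1, w2, f):
    # y = w2^T phi(<w1, x (*) f(x)>) with a known attention function f
    return w2 @ np.maximum(w1.T @ (x * f(x)), 0.0)

def self_attention(x, wQ, wK, wV):
    # x: (t, p); wQ, wK: (dq, t); wV: (dv, t)
    K = (wK @ x).T                              # (p, dq)
    Q = (wQ @ x).T                              # (p, dq); row z is the query of word z
    V = wV @ x                                  # (dv, p)
    dq = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(dq), axis=1)  # (p, p); row z = a_i^{self(z)}
    return V @ A.T                              # attended values, column z = V a_i^{self(z)}

def transformer_two_layer(x, w1, w2, wQ, wK, wV):
    # y = w2^T phi(<w1, vec(wV x a_self)>), as in equation (2)
    f_att = self_attention(x, wQ, wK, wV)       # (dv, p)
    return w2 @ np.maximum(w1.T @ f_att.reshape(-1), 0.0)

rng = np.random.default_rng(0)
t, p, dq, dv, d = 8, 5, 4, 6, 16
x = rng.normal(size=(t, p))
y = transformer_two_layer(x, rng.normal(size=(dv * p, d)), rng.normal(size=d),
                          rng.normal(size=(dq, t)), rng.normal(size=(dq, t)),
                          rng.normal(size=(dv, t)))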

3.1. ASYMPTOTIC PROPERTY OF SELF-ATTENTION MODEL WITH KNOWN WEIGHT FUNCTION

We start with a self-attention model with a known attention function $f(\cdot)$. The objective function is
$$\min_w \frac{1}{2n}\sum_{i=1}^n \big(w^{(2)T}\phi(\langle w^{(1)}, x_i \odot f(x_i)\rangle) - y_i\big)^2, \qquad (3)$$
where $\|f(x)\|_2 = 1$, normalizing the attention weights. Before proceeding, we introduce three major assumptions for our analysis: (A1) the model output $y$ can be specified by a two-layer neural network with the attention model structure; (A2) the attention weights are mainly contributed by the top several masks; (A3) the hidden layers of the network are sufficiently overparameterized such that they have enough expressiveness to predict $y$. For space considerations, we provide detailed justifications of these assumptions and explain why they can be intuitively summarized as (A1)-(A3) in Appendix A.1. Specifically, (A1) is supported by previous works. For (A2) and (A3), we verify them both theoretically and empirically in Appendix A.1 and in our experiments in Section 6. In short, (A2) naturally holds since the softmax function in the attention model yields concentrated attention weights instead of evenly spread-out weights; (A3) naturally holds for overparameterized neural networks. More details can be found in Appendix A.1. In the following, we present their mathematical statements.

Let $\phi(\langle w^{(1)}, x \odot f(x)\rangle)$ denote the $d$-dimensional random vector of hidden units with network weights $w^{(1)}$. For any vectors $z_1$ and $z_2$, we denote by $\mathrm{var}(z_1|z_2)$ the covariance matrix of the residual of $z_1$ after linear regression on $z_2$. Its explicit form is $\mathrm{var}(z_1|z_2) = \Sigma_{z_1z_1} - \Sigma_{z_1z_2}\Sigma_{z_2z_2}^{-1}\Sigma_{z_2z_1}$, where $\Sigma_{z_1z_2}$ denotes the cross-covariance matrix of $z_1$ and $z_2$; the derivation can be found on page 176 of Izenman (2013). In our setting, $z_1 = \phi(\langle w^{(1)*}, x \odot f(x)\rangle)$, the hidden units at the true parameters, and $z_2 = \phi(\langle w^{(1)}, x \odot f(x)\rangle)$.

(A1) There exists a set of parameters $(w^{(1)*}, w^{(2)*})$ such that $y_i = w^{(2)*T}\phi(\langle w^{(1)*}, x_i \odot f(x_i)\rangle) + \epsilon_i$ for $i = 1, 2, \ldots, n$, where the $\epsilon_i$ follow the sub-Gaussian distribution $\mathrm{subG}(0, C_3^2)$ and $x_i \perp\!\!\!\perp \epsilon_i$.

(A2) For any $x_i$, order the attention weights $f(x_i)$ in descending order as $f(x_i)_{(1)}, \ldots, f(x_i)_{(p)}$. There exist a positive integer $0 < s_0 < p$ and $0 < \tau < 1$ such that the largest $s_0$ attention weights $f_{s_0} = \{f(x_i)_{(1)}, \ldots, f(x_i)_{(s_0)}\}$ satisfy $\|f_{s_0}\|_2 \ge 1 - \tau$.

(A3) For $\gamma > 0$, $\lambda_{\max}\big(\mathrm{Var}(\phi(\langle w^{(1)*}, x \odot f(x)\rangle)\,|\,\phi(\langle w^{(1)}, x \odot f(x)\rangle))\big) = o(\gamma^2)$ as $n \gtrsim \frac{k^4}{\gamma^2}\log(\frac{k}{\gamma})(pd + d)$ and $n \to \infty$, for a constant $k$ satisfying $k \lesssim \sqrt{s_0} + \tau\sqrt{p}$.

Other than these three major assumptions, we also assume the regularity assumption (B1): the samples $x_i$ are i.i.d. with bounded entries; the network weights $w^{(1)}$ and $w^{(2)}$ have bounded $\ell_2$ norms, denoted by $C_1$ and $C_2$, respectively; and the output $y$ is centered. The regularity assumption (B1) is standard in the literature and is justified in Appendix A.2. Given these assumptions, we obtain the sample complexity bound of the attention model.

Theorem 1. (Sample complexity for attention model) Under (A1)-(A3) and regularity assumption (B1), for any $\gamma > 0$ and $s_0$ such that $k \lesssim \sqrt{s_0} + \tau\sqrt{p}$, given the sample size
$$n \gtrsim \frac{(\sqrt{s_0} + \tau\sqrt{p})^4\eta^2}{\gamma^2}\log\Big(\frac{(\sqrt{s_0} + \tau\sqrt{p})\eta}{\gamma}\Big)(pd + d), \quad \text{where } \eta = C_1^2 C_2,$$
with probability converging to 1, any stationary point $(\hat{w}^{(1)}, \hat{w}^{(2)})$ of the objective function (3) satisfies
$$E\big(\hat{w}^{(2)T}\phi(\langle \hat{w}^{(1)}, x \odot f(x)\rangle) - E(y|x)\big)^2 \lesssim \gamma^2.$$
The proof sketch and complete proof are provided in Appendix Section E.
To explicitly compare the sample complexity bound of our attention model with that of the non-attention model, we provide the following corollary for the baseline model. Setting $s_0 = 0$ and $\tau = 1$ in Theorem 1, i.e., fixing all attention weights to be equal, recovers the baseline model.

Corollary 1. (Sample complexity for non-attention model) Under (A1)-(A3) and regularity assumption (B1), for the non-attention model with $f(x_j) = 1/\sqrt{p}$ for all features $x_j \in x$, given the sample size
$$n \gtrsim \frac{p^2\eta^2}{\gamma^2}\log\Big(\frac{p\eta}{\gamma}\Big)(pd + d), \quad \text{where } \eta = C_1^2 C_2,$$
with probability converging to 1, any stationary point $(\hat{w}^{(1)}, \hat{w}^{(2)})$ of the objective function (3) satisfies
$$E\big(\hat{w}^{(2)T}\phi(\langle \hat{w}^{(1)}, x \odot f(x)\rangle) - E(y|x)\big)^2 \lesssim \gamma^2.$$

The comparison of Theorem 1 and Corollary 1 carries an important message: the sample complexity bound from Corollary 1 is of higher order than the one in Theorem 1. This can be seen from the effect of concentration in attention models. In Theorem 1, the sample complexity bound is proportional to $(\sqrt{s_0} + \tau\sqrt{p})^4$. When the attention is sufficiently concentrated, i.e., $s_0 \ll p$ and $\tau \to 0$, the sample complexity is significantly reduced. Also, from the sample complexity bound of Theorem 1, up to a log term, the prediction error $\gamma$ is proportional to $n^{-1/2}$, which is the optimal rate of convergence in regression. Imaizumi & Fukumizu (2018) and similar works show that the generalization error convergence rate is $O(n^{-t})$ with $0 < t < 1/2$. These facts imply that the bound is tight with respect to sample size compared with existing works.
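As a rough numerical illustration of the gap between Theorem 1 and Corollary 1, one can compare the leading factors $(\sqrt{s_0} + \tau\sqrt{p})^4$ and $p^2$ directly, ignoring constants and the shared $\log$ and $(pd + d)$ factors. The example values of $p$, $s_0$ and $\tau$ below are ours.

import math

p = 1024                       # input dimension (example value)
s0, tau = 32, 0.05             # concentrated attention: s0 << p, small tau

attention_factor = (math.sqrt(s0) + tau * math.sqrt(p)) ** 4   # Theorem 1 leading factor
baseline_factor = p ** 2                                        # Corollary 1: s0 -> 0, tau -> 1

print(attention_factor)                    # ~ 2.8e3
print(baseline_factor)                     # ~ 1.0e6
print(baseline_factor / attention_factor)  # reduction of roughly 370x for these values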

3.2. TRANSFORMER SELF-ATTENTION MODEL WITH LEARNED ATTENTION FUNCTION

In this section, we extend our analyses to the transformer self-attention model introduced in Section 2. Theorem 1 implies that if the self-attention mask can be precisely computed with $f(x)$, we can derive its sample complexity bound. However, the function $f(\cdot)$ is not necessarily known and needs to be learned in real-world applications. Therefore, the transformer self-attention setup is more desirable in practice: it provides a concrete model for learning the parameters of the attention weight function. Denoting $w = (w^{(1)}, w^{(2)}, w^Q, w^K, w^V)$, the two-layer self-attention model can be formulated as
$$\min_w \frac{1}{2n}\sum_{i=1}^n \big(w^{(2)T}\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle) - y_i\big)^2. \qquad (4)$$
We now introduce the necessary assumptions.

(A4) For each column of $a_i^{self}$, order the entries in descending order as $a_{i(1)}^{self}, \ldots, a_{i(p)}^{self}$. There exist a positive integer $0 < s_0 < p$ and $0 < \tau < 1$ such that the largest $s_0$ attention weights $a_{i(\mathrm{leading})}^{self} = \{a_{i(1)}^{self}, \ldots, a_{i(s_0)}^{self}\}$ satisfy $\|a_{i(\mathrm{leading})}^{self}\|_2 \ge 1 - \tau$.

(A5) Denote $\Phi^* = \phi(\langle w^{(1)*}, \mathrm{vec}(w^{V*} x_i\, a_i^{self*})\rangle)$ and $\Phi = \phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)$. For any $\gamma > 0$ we have $\lambda_{\max}(\mathrm{Var}(\Phi^*|\Phi)) = o(\gamma^2)$ as $n \gtrsim \frac{k^2 d_v d}{\gamma^2}\log(\frac{k + \sqrt{pt}}{\gamma})$ and $n \to \infty$, where, similar to (A1), $(w^{(1)*}, w^{(2)*}, w^{V*}, a^{self*})$ is the true parameter set such that $y_i = w^{(2)*T}\phi(\langle w^{(1)*}, \mathrm{vec}(w^{V*} x_i\, a_i^{self*})\rangle) + \epsilon_i$.

(A4) and (A5) are parallel to (A2) and (A3), assuming that the attention weights are concentrated on $s_0$ items and that $\Phi$ has sufficient expressive power. Furthermore, the sparse transformer (Child et al., 2019), sparsemax attention (Martins & Astudillo, 2016), and local attention (Luong et al., 2015) can be regarded as special cases of this assumption with $\tau = 0$, in which attention weights are only considered between a subset of locations instead of all locations. We further verify in our experiments that $s_0 \ll p$ and $\tau \to 0$. The other regularity assumptions on the feature and parameter space are similar to those of Theorem 1. We denote them as (B2): the $\ell_2$ norm bounds for $w^{(1)}$, $w^{(2)}$, $w^K$, and $w^Q$ are, respectively, $C_1$, $C_2$, $C_5$, and $C_6$, and the $\ell_1$ column norm of $w^V$ is $C_8$. They are presented in Appendix A.3. We now provide the sample complexity bound for transformer self-attention models.

Theorem 2. (Sample complexity for transformer self-attention model) Under (A4), (A5), and regularity assumption (B2), for any $\gamma > 0$ and $s_0$ such that $k \lesssim s_0$, given the sample size
$$n \gtrsim \frac{\big((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt}\big)^2 p d_v d\, \eta_a^2}{\gamma^2}\log\Big(\frac{\big((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt}\big)\eta_a\eta_b}{\gamma}\Big),$$
where $\eta_a = C_1^2 C_2 C_8$ and $\eta_b = C_5 + C_6$, with probability tending to 1, any stationary point $(\hat{w}^{(1)}, \hat{w}^{(2)}, \hat{w}^Q, \hat{w}^K)$ of the objective function (4) satisfies
$$E\big(\hat{w}^{(2)T}\phi(\langle \hat{w}^{(1)}, \mathrm{vec}(\hat{w}^V x_i\, \hat{a}_i^{self})\rangle) - E(y|x)\big)^2 \lesssim \gamma^2.$$

There are two important messages from Theorem 2. First, Theorem 2 shows that, with the help of self-attention, we can achieve consistent predictions under a more expressive class of models (equation 5), which is analyzed in Yun et al. (2019). The non-attention model is not consistent for this class of models; to train a non-attention model on data with self-attention structure, more layers and a larger network parameter size are required to reduce this bias. Second, we see the benefit of concentrating the attention when designing transformer models. Similar to Theorem 1, the sample complexity bound is proportional to $(\sqrt{s_0} + \tau\sqrt{p})^4$.
Therefore, a suitably small $s_0$ and $\tau$ can significantly reduce the sample complexity of self-attention models. Later in our experiments, we show that this is exactly what happens in a real transformer model. Our theorem thus answers why the self-attention design with the softmax function effectively helps us achieve better prediction results. Moreover, it also explains the effectiveness of sparse designs in attention models: sparse attention is a special case of our concentration condition with $\tau = 0$. Prior work has shown that sparse attention weights can significantly reduce computational cost and improve performance, as verified in the sparse transformer, local attention, and sparsemax attention models (Child et al., 2019; Luong et al., 2015; Martins & Astudillo, 2016).
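As an illustration of the $\tau = 0$ special case discussed above, the sketch below keeps only the top-$s_0$ attention logits before the softmax, in the spirit of sparse and local attention variants; the helper name and example values are ours, not taken from any of the cited implementations.

import numpy as np

def topk_softmax(logits, s0):
    # Keep the s0 largest logits, mask the rest, then renormalize.
    # The resulting weights satisfy the concentration condition with tau = 0.
    drop = np.argsort(logits)[:-s0]      # indices of the (p - s0) smallest logits
    masked = logits.copy()
    masked[drop] = -np.inf
    masked = masked - masked.max()
    e = np.exp(masked)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
print(topk_softmax(logits, s0=2))        # only entries 0 and 3 receive mass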

3.3. MULTI-LAYER SELF-ATTENTION MODELS AND RECURRENT ATTENTION MODELS

Theorems 1 and 2 can be extended to multi-layer neural networks. Due to the page limit, the rigorous definitions, notation, assumptions, and statement of Theorem 3 are deferred to Appendix Section A.4. Here we provide a plain statement of it as follows.

Theorem 3. Given the overparameterization and regularity conditions, for any given generalization error level $\gamma$, with high probability, a multi-layer self-attention model achieves a generalization error smaller than $\gamma$, given a sufficiently large sample size.

Our analyses can also be extended to recurrent attention models, following the recurrent attention setup in Luong et al. (2015). The analysis of recurrent attention models is deferred to Appendix Section B due to page limits.

4. NON-LINEARITY, FLATNESS OF MINIMA AND SMALL SAMPLE SIZE

In this section, we investigate several additional properties of how attention mechanisms improve the loss landscape of neural networks while retaining the desirable properties of baseline models: they reduce unnecessary non-linear regions, they preserve the behavior of the landscape with respect to the sharpness of local minima, and they do not affect the loss landscape in the small-sample regime.

4.1. ON THE NUMBER OF LINEAR REGIONS

We first study how attention mechanisms affect the number of linear regions (Montufar et al., 2014) of a wide two-layer neural network with attention and a known attention weight function, when the number of hidden units is larger than the number of non-zero weights in $f(x)$. This result shows how the sparsity/concentration of attention weights affects the non-linearity of the loss landscape.

Theorem 4. Assume $\|f(x)\|_0 = s_0$, the sparsity of the attention mask, and that the number of units in the hidden layer satisfies $n_1 > s_0$. Then the maximal number of linear regions of the function computed by a two-layer attention model with ReLU activation is lower bounded by $\lfloor n_1/s_0\rfloor^{s_0}$.

In general, this bound for the attention model is smaller than that of the baseline model. The corresponding plots of the bounds can be found in Appendix Section C.1. The result implies that, when an appropriate attention mechanism is used, the reduction in the number of linear regions leads to a simpler landscape, yet the approximation error remains small.
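To make the bound concrete, the snippet below evaluates the lower bound $\lfloor n_1/s_0\rfloor^{s_0}$ from Theorem 4 for a few sparsity levels, with $s_0 = p$ standing in for the baseline case; the specific values of $n_1$ and $p$ are ours (Figure 3 in the appendix plots the same quantity).

import math

n1, p = 1000, 100                        # hidden units and input dimension (example values)

def region_lower_bound(n_units, s):
    return (n_units // s) ** s           # Theorem 4: floor(n1/s0)^{s0}

for s0 in (5, 10, 20, p):                # s0 = p corresponds to no sparsity (baseline)
    print(s0, math.log10(region_lower_bound(n1, s0)))   # bound in log10 scale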

4.2. ON FLATNESS/SHARPNESS OF MINIMA

Many recent works, such as Keskar et al. (2016), argue that flatter local minima tend to generalize better. However, Dinh et al. (2017) observe that, by a scale transformation, minima that are observationally equivalent can be made arbitrarily sharp, and the operator norm of the Hessian matrix can be made arbitrarily large. We show that this fact also holds for the self-attention mechanism if no $\ell_2$ norm bound on the parameters $(\hat{w}^{(1)}, \hat{w}^{(2)})$ is imposed. Here we introduce the definition of $\epsilon$-flatness as in Hochreiter & Schmidhuber (1997).

Definition 1. Given $\epsilon > 0$, a minimum $\theta$, and loss $L$, $C(L, \theta, \epsilon)$ is the largest connected set containing $\theta$ such that $\forall \theta' \in C(L, \theta, \epsilon)$, $L(\theta') \le L(\theta) + \epsilon$. Its volume is called the $\epsilon$-flatness.

In the following theorem, we analyze the flatness of stationary points of the self-attention model.

Theorem 5. Consider the two-layer ReLU neural network with the self-attention mechanism as stated in Section 3.2, $y_i = w^{(2)T}\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)$, and a minimum $\theta = (\hat{w}^{(1)}, \hat{w}^{(2)}, \hat{w}^V, \hat{w}^Q, \hat{w}^K)$ such that $\hat{w}^i \neq 0$ for $i = (1), (2), V, Q, K$. For any $\epsilon > 0$, $C(L, \theta, \epsilon)$ has infinite volume, and for any $M > 0$, we can find a stationary point such that the largest eigenvalue of $\nabla^2 L(\theta)$ is larger than $M$.

Theorem 5 indicates that the flatness properties of minima are maintained when the attention mechanism is applied. Furthermore, an $\ell_2$ norm bound helps remove sharp minima that generalize poorly. This also coincides with the theoretical and empirical observation that flat minima are expected to generalize better in general (Keskar et al., 2016). We also discuss the loss landscape of attention models under small sample sizes; the results are deferred to Appendix Section C.3.
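The reparameterization invariance underlying Theorem 5 (and Dinh et al., 2017) is easy to check numerically: scaling the first-layer weights by $\alpha$ and the second layer by $1/\alpha$ leaves a ReLU network's output, and hence its loss, unchanged, while the curvature along the rescaled directions changes. A minimal check on a random two-layer ReLU network, with all shapes and values ours:

import numpy as np

rng = np.random.default_rng(0)
p, d = 8, 16
w1, w2 = rng.normal(size=(p, d)), rng.normal(size=d)
x = rng.normal(size=p)

def predict(w1, w2, x):
    return w2 @ np.maximum(w1.T @ x, 0.0)

alpha = 100.0
y0 = predict(w1, w2, x)
y1 = predict(alpha * w1, w2 / alpha, x)   # T_alpha: (w1, w2) -> (alpha*w1, w2/alpha)
print(np.allclose(y0, y1))                # True: the loss value is unchanged, yet the
                                          # Hessian block w.r.t. w2 scales by alpha^2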

5. GUIDANCE ON IMPROVING THE ARCHITECTURE OF ATTENTION MODELS

In this section, we provide insights into future attention model design based on our analyses. Regularization: Our analyses suggest that proper regularization is helpful in training an attention model. The $\ell_2$ norm bounds $C_1, C_2$ play an important role in the sample complexity bound, which implies that $\ell_1$ and $\ell_2$ regularization on the network weights $w^{(1)}$ and $w^{(2)}$ are effective in reducing the sample complexity. In Theorem 5, we also find that imposing constraints and regularization on the network weights helps remove sharp minima and keep flat minima with good generalization.
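In practice, the $\ell_1$ and $\ell_2$ penalties suggested here are one-line additions to the training loss; the penalty strengths below are illustrative, not tuned values from our experiments.

import numpy as np

def regularized_loss(data_loss, w1, w2, l2=1e-4, l1=1e-4):
    # data_loss: the data-fitting term (e.g. mean squared error); w1, w2: layer weights
    penalty = l2 * (np.sum(w1 ** 2) + np.sum(w2 ** 2)) \
            + l1 * (np.abs(w1).sum() + np.abs(w2).sum())
    return data_loss + penalty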

Concentration on attention:

From the discussion of Theorem 1, Theorem 2, and Corollary 1, we conclude that a proper concentration design with small $s_0$ and $\tau$ can significantly reduce sample complexity. Our analyses show that the softmax design concentrates attention on a limited number of entries, which helps reduce the sample complexity. In different problems, we can further concentrate the attention by adjusting the temperature of the softmax function: the smaller the temperature, the more concentrated the attention weights are.

Overparameterization in query/key weights: From our analyses, we can see that as $d_q$, the dimension of the key and query matrices, increases, the sample complexity does not increase significantly. This indicates that we can obtain high expressive power in attention models through overparameterization of the query and key matrices, without hurting sample complexity.
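A tempered softmax only requires dividing the attention logits by the temperature before normalizing; the sketch below (example logits and temperatures are ours) shows how a lower temperature concentrates the weights.

import numpy as np

def tempered_softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.2, 0.8, 0.1, -0.5])
print(tempered_softmax(logits, 1.0))   # ~ [0.46, 0.31, 0.15, 0.08]: fairly spread out
print(tempered_softmax(logits, 0.2))   # ~ [0.88, 0.12, 0.00, 0.00]: concentrated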

6. EXPERIMENTS

In this section, we validate our theoretical findings empirically. This section is divided into the following parts: (1) verification of the concentration assumptions (A2) and (A4) on a Portuguese-to-English translation task; (2) an ablation study on the IMDB reviews dataset, showing the effectiveness of attention, regularization, and attention concentration on self-attention and recurrent attention models; (3) experiments on a constructed noisy-MNIST dataset using self-attention models.

Experiment 1: Concentration of top attention weights

To verify that our key assumptions/observations (A2) and (A4) hold in a real-world task, we investigate the distribution of attention weights in the transformer model on the task of translating Portuguese to English, as proposed in Vaswani et al. (2017). We randomly select the trained attention weight vectors for 100,000 different words in the training samples. For each attention weight vector with sentence length $p_i$, we calculate the $\ell_2$ norm of the largest $\sqrt{p_i}$ attention weights. We find that the top $\sqrt{p_i}$ weights contribute on average 90.9% of the attention mass, very close to 1. The histogram is shown in Figure 1; most of the summed weights are close to 1. The result indicates that the largest $\sqrt{p_i}$ attention weights contribute most of the attention mass, and that our assumptions (A2) and (A4) hold with reasonably small $s_0 \ll p$ and $\tau$, such as $s_0 = \sqrt{p}$.

Experiment 2: Ablation study on the IMDB reviews dataset

We consider the problem of sentiment classification on the IMDB reviews dataset (Maas et al., 2011). The task is to classify the sentiment of a sentence as either positive or negative. We zero-pad all sentences to a common length of 130. For every input word, we train an embedding of dimension 100 with random initialization, which is then passed to the neural network. Hence, the dimension of the input is 130 × 100. We consider a baseline 2-layer MLP, and then add self-attention, weight regularization, and a tempered softmax function to the model to verify our theoretical analysis and the corresponding guidance in Section 5.

Baseline model: To train the baseline model, we first flatten the input into one large vector of dimension 130 × 100 and pass it to a 1-hidden-layer MLP with $h$ hidden units. The model is trained using binary cross-entropy loss.

Self-attention model: For the self-attention model, the dimensions of the query, key, and value matrices are $w^Q \in \mathbb{R}^{100\times100}$, $w^K \in \mathbb{R}^{100\times100}$, and $w^V \in \mathbb{R}^{100\times130}$, respectively. We first compute the attention mask $a_i^{self}$ as per equation 1 ($a_i^{self} \in \mathbb{R}^{130\times130}$). Using this attention mask, the attended feature is computed as $f_{att} = w^V x_i\, a_i^{self}$, with $f_{att} \in \mathbb{R}^{100\times130}$. We then flatten this attended feature and pass it through a 1-hidden-layer MLP with $h$ hidden units.

Regularization: We impose a $10^{-4}$ $\ell_1$ regularization on both $w^{(1)}$ and $w^{(2)}$.

Tempered softmax: We calculate the attention weights as $a_i^{self(z)}(x_i^z, w^Q, w^K) = \mathrm{softmax}\big(5\,Q_i^z K_i^T/\sqrt{d_q}\big)$, i.e., we multiply the inner product by 5 (equivalently, use a temperature of 1/5). The softmax operator then pushes the small attention weights close to zero while retaining the large attention weights. In this way, it achieves a higher level of concentration of the attention weights compared with the standard softmax-based attention.

Optimization: All models were initialized randomly with Xavier initialization and trained with the binary cross-entropy loss using the Adam optimizer with learning rate $10^{-3}$. To test the sample complexity, we vary the number of training samples in each experiment, train all models, and compute the performance on the test set. We varied the number of training samples from 1k to 10k. Each experiment was repeated for 10 replications, and the mean and standard deviation are reported in Table 1.
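The concentration statistic used in Experiment 1 can be computed directly from any attention weight vector: normalize it to unit $\ell_2$ norm and take the $\ell_2$ norm of its largest $\lceil\sqrt{p}\rceil$ entries. A minimal sketch of that measurement, with a synthetic weight vector standing in for a trained one (the logit scale is ours):

import numpy as np

def top_mass(attn, k=None):
    # attn: softmax attention weights for one query position, shape (p,)
    p = attn.shape[0]
    k = k or int(np.ceil(np.sqrt(p)))
    w = attn / np.linalg.norm(attn)           # normalize to unit l2 norm, as in (A2)/(A4)
    top = np.sort(w)[::-1][:k]
    return np.linalg.norm(top)                # close to 1 when attention is concentrated

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=64)       # peaked logits mimic trained attention
attn = np.exp(logits - logits.max()); attn /= attn.sum()
print(top_mass(attn))                         # typically > 0.9 for concentrated weights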
Recurrent attention experiments: Beyond self-attention, we also provide an ablation study on recurrent attention models, verifying the superiority of the recurrent attention model, as analyzed in Appendix B. (1) RNN baseline: The baseline is a bi-directional RNN with LSTM cell size 32; we feed the final output of the RNN into a 2-hidden-layer MLP with 64 and 20 hidden units, respectively, to produce the final prediction. (2) Recurrent attention: For recurrent attention, we use the same structure as Luong et al. (2015). (3) Regularization: We impose an $\ell_1$ regularization of $10^{-4}$ on the network weights in the attention layer. The test accuracy results are reported in Table 2. From the experimental results, we see a significant effect of regularization and softmax temperature in the self-attention model, and regularization also helps in the recurrent attention model. These observations all coincide with our theoretical findings.

Experiment 3: Self-attention on Noisy-MNIST dataset

To demonstrate the applicability of our analyses, we further verify the effectiveness of attention on an image classification task. We construct a noisy MNIST dataset based on the original MNIST dataset. For each original 28 × 28 image, we divide a 56 × 56 canvas into 16 square grids, place the whole digit image randomly into 4 neighboring grids, and fill all other grids with uniform random noise. Examples of the generated dataset are provided in Figure 2. We consider three models. (1) CNN: a standard 2-layer convolutional neural network with filter size 64, kernel size 3 × 3, and max pooling size 2 × 2, followed by a fully-connected hidden layer of size 128. (2) Self-attention-CNN: we fit a 2-layer convolutional network with filter size 64, kernel size 3 × 3, and max pooling size 2 × 2, obtaining a 1600-dimensional embedding for each of the 16 square grids. We then treat these embeddings of the 16 grids as "16 word embeddings" in a sentence and feed them into equation 2, as in Experiment 2. (3) Equal-weight self-attention-CNN: the model structure is the same as model 2, with the attention weights between all grids fixed to be equal. Model (3) is designed as an alternative baseline to guarantee that the improvement of model 2 comes from the attention block rather than the model structure. The testing accuracy of the three models is reported in Table 3. Model 1 has 5558k parameters, while model 2 has only 670k parameters; this also guarantees that the superiority of our model does not come from training a bigger model. From the table, we see that the Attention-CNN model achieves almost perfect test accuracy, comparable to the original MNIST dataset. Although the equal-attention CNN has the same expressive power, its performance is much worse than the attention-CNN model. Again, this shows the usefulness of properly concentrating self-attention weights in the task.
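The Noisy-MNIST construction described above amounts to pasting the 28 × 28 digit into one of the nine possible 2 × 2 blocks of a 4 × 4 grid of 14 × 14 cells and filling the rest with uniform noise. A sketch of the generator (array names and the noise range are ours):

import numpy as np

def make_noisy_mnist(digit, rng):
    # digit: (28, 28) array in [0, 1]; returns a (56, 56) noisy canvas
    canvas = rng.uniform(0.0, 1.0, size=(56, 56))   # all 16 grids start as uniform noise
    r, c = rng.integers(0, 3, size=2)               # top-left grid of the 2x2 block (9 choices)
    canvas[14 * r:14 * r + 28, 14 * c:14 * c + 28] = digit
    return canvas

rng = np.random.default_rng(0)
digit = rng.uniform(size=(28, 28))                  # stand-in for a real MNIST digit
print(make_noisy_mnist(digit, rng).shape)           # (56, 56)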

7. CONCLUSIONS

In this paper, we study the loss landscape of attention-based neural networks and show that attention mechanisms help reduce the sample complexity and achieve consistent predictions in the large-sample regime. Besides the theoretical analyses of the loss landscape, empirical studies validate our theoretical findings. Based on our analyses, we discuss how regularization, concentration of attention, and overparameterization in the attention weight matrices can further improve attention models.

A.2 REGULARITY ASSUMPTIONS OF THEOREM 1

In Section 3, we described the regularity assumptions of Theorem 1 in words. Here we provide rigorous mathematical statements of these assumptions in (A1.1) to (A1.3).

(A1.1) The $x_i$ are i.i.d. and $\|x_i\|_\infty < C_x$ for any $i = 1, \ldots, n$.

(A1.2) There exist $C_1$, $C_2$, $C_\lambda$ such that $\lambda_{\max}(\Sigma_\phi) \le C_\lambda$, $\|w^{(1)}\|_F < C_1$, and $\|w^{(2)}\|_2 < C_2$ for any $w^{(1)}, w^{(2)} \in S$.

(A1.3) $E(w^{(2)T}\phi(\langle w^{(1)}, x \odot f(x)\rangle)) = E(y) = 0$.

(A1.1) to (A1.3) are standard assumptions on the parameter and feature space. (A1.1) and (A1.2) require upper bounds on the input $x_i$ and $\ell_2$ bounds on the network weights. These are standard assumptions in landscape analysis (Mei et al., 2018a;b), and they are also crucial for removing sharp minima that may not generalize well (Keskar et al., 2016; Dinh et al., 2017) (see the discussion after Theorem 5). These assumptions can be achieved through regularization. (A1.3) can always be achieved by centering the data.

A.3 ASSUMPTIONS OF THEOREM 2

Similarly, for transformer-type self-attention, we provide analogous assumptions (A2.1) and (A2.2):

(A2.1) (A1.2) holds with $\Sigma_\phi = \mathrm{cov}(\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle))$. Further, there exist $C_5$, $C_6$, $C_7$ such that $\|w^Q\|_F \le C_5$, $\|w^K\|_F \le C_6$, $\|w^V\|_F \le C_7$, and $\|w^V\|_1 = \max_{i=1,\ldots,d_v}\sum_{j=1}^p |w^V_{ij}| \le C_8$. Moreover, $\|Q_i^z K_i^T\|_2 \ge C_9$ for $i = 1, \ldots, n$.

(A2.2) There exists a set of parameters $(a^*, w^{(1)*}, w^{(2)*})$ such that $y_i = w^{(2)*T}\phi(\langle w^{(1)*}, \mathrm{vec}(w^{V*} x_i\, a_i^{self*})\rangle) + \epsilon_i$, where $a^{self*}$ is calculated by (1), $\epsilon_i \sim \mathrm{subG}(0, C_4^2)$ for $i = 1, 2, \ldots, n$, and $x_i \perp\!\!\!\perp \epsilon_i$; the output $y$ is centered. These assumptions are parallel to the assumptions of Theorem 1.

A.4 DETAILS ABOUT THEOREM 3

We consider a $D$-layer network with a self-attention structure. The $k$-th self-attention layer computes $g_k(x_g^{k-1}) = w^{k2}\phi(\langle w^{k1}, w^V x_g^{k-1}\, a^{self}\rangle)$, where $x_g^{k-1}$ is the output of the $(k-1)$-th self-attention layer, with $w^V \in \mathbb{R}^{d_v\times t}$, $x_g^{k-1} \in \mathbb{R}^{t\times d_{k-1}}$, $a^{self} \in \mathbb{R}^{d_{k-1}\times d_{k-1}}$, $w^{k1} \in \mathbb{R}^{d_v\times q_k}$, and $w^{k2} \in \mathbb{R}^{d_k\times q_k}$. Here $a^{self}$ is calculated in the same way as in the two-layer self-attention network. The final output is $h(x) = w^{D2}\phi(\langle w^{D1}, \mathrm{vec}(w^V x_g^{D-1}\, a^{self})\rangle)$, where $x_g^{D-1} = g_{D-1}(\cdots g_1(x))$, $w^{D1} \in \mathbb{R}^{1\times d_v}$, and $w^{D2} \in \mathbb{R}^{d_k}$. In this way, the network computes self-attention $D$ times and produces the final prediction. It is worth mentioning that, to obtain a scalar prediction in the regression model, we flatten the value matrix of the last layer as in the two-layer model. We denote $u = w^{D2}\phi(\langle w^{D1}, \mathrm{vec}(w^V x_g^{D-1}\, a^{self})\rangle) - E(y|x)$. The assumption parallel to (A2) is as follows.

(A6) There exist integers $k \in \{1, \ldots, D\}$ and $r \in \{1, 2\}$ such that $\|\mathrm{cov}(\nabla h(w^{kr}), u)\| \ge c\gamma^2$ for some constant $c$, and such that $\|\nabla h(w^{kr})\|_2 \le c_k$.

This theorem also requires the following mild regularity conditions:

(A3.1) All weights $w^{kj}$ for $k = 1, \ldots, D$ and $j = 1, 2$ satisfy $\|w^{kj}\|_2 \le C_{10}$, and the prediction is centered, i.e., $E(u) = 0$.

(A3.2) There exists a set of parameters $(w^{(1)*}, w^{(2)*})$ such that $y_i = w^{D2}\phi(\langle w^{D1}, \mathrm{vec}(w^V x_g^{D-1}\, a^{self})\rangle) + \epsilon_i$ as defined above, where $\epsilon_i \sim \mathrm{subG}(0, C_4^2)$ for $i = 1, 2, \ldots, n$ and $x_i \perp\!\!\!\perp \epsilon_i$, and the output $y$ is centered.

The following theorem provides the sample complexity bound for multi-layer self-attention models.

Theorem 3. Under (A6) and regularity assumptions (A3.1) and (A3.2), let $d_{self}$ be the total number of parameters in all value, query, and key matrices. Then for any $\gamma > 0$, given the sample size $n \gtrsim \log(\frac{c_k}{\gamma})(d_{self} + D(d_k + d_v)q_k)$, where $c_k$ is the Lipschitz constant of $\nabla h(w^{kr})$, with probability tending to 1, any stationary point $(\hat{w}^{(1)}, \hat{w}^{(2)}, \hat{w}^Q, \hat{w}^K)$ satisfies $E(h(x) - E(y|x))^2 \lesssim \gamma^2$.

Because multi-layer self-attention models involve a large parameter set with complicated gradients, the assumptions are not as intuitive as in the two-layer model. But the main assumptions are parallel: in an overparameterized network, $\nabla h(w)$ spans almost all directions, and some of them are correlated with the bias term $u$. The result shows that the sample complexity bound is mainly determined by the number of network parameters, which implies that an efficient architecture design is critical for reducing the sample complexity. For the multi-layer case, the regularity assumptions are stated above as (A3.1) and (A3.2).

B RECURRENT ATTENTION MODEL

In this section, we analyze the sample complexity bound for the representative recurrent attention framework of Bahdanau et al. (2014). In the recurrent attention network, we follow the setting of the self-attention model: $x_i = (x_i^1, \ldots, x_i^p) \in \mathbb{R}^{t\times p}$, corresponding to $p$ words with $t$-dimensional embeddings, and $x$ is the population version of $x_i$. The generative model can be represented as
$$y_i = w^{(2)T}\phi\Big(\big\langle w^{(1)}, \sum_{j=1}^p a(x_i)_j x_i^j\big\rangle\Big) + \epsilon_i.$$
Analogous to the NLP setting, $a(x_i)$ is an unknown function mapping $x_i$ to a vector of attention weights, where $a(x_i)_j$ represents the effect of the $j$-th word in sentence $i$. Following the RNN setup in Bahdanau et al. (2014), using the data features themselves as their annotations, the recurrent attention model computes, for each time stamp $k = 1, \ldots, T$,
$$s_k = f(s_{k-1}, c_{k-1}); \quad e_{kj} = \mathrm{score}(s_{k-1}, x_i^j); \quad \alpha_{kj} = \frac{e_{kj}}{\sum_{j'=1}^p e_{kj'}}; \quad c_k = \sum_{j=1}^p \alpha_{kj} x_i^j; \quad y_i^{(k)} = w^{(2)T}\phi(\langle w^{(1)}, c_k\rangle),$$
where $\mathrm{score}(\cdot)$ is the score function representing how well the inputs around position $j$ and the output at position $i$ match; it can be either a dot product or an MLP. $y_i^{(k)}$ is the prediction at the $k$-th time stamp, and we denote $a(x_i)_j = \alpha_{Tj}$, the attention mask at the final time stamp. $f(\cdot)$ is the function that updates $s_k$. Suppose the parameter sets inside these two functions are $w_a$ and $w_f$, with $d_a$ and $d_f$ parameters, respectively. Here we show that when these two functions are sufficiently expressive, recurrent attention networks also have a sample complexity bound parallel to that of self-attention models. We introduce the necessary assumptions.

(A7) When $T$ is sufficiently large, the output $y$ can be predicted by the two-layer network with an independent sub-Gaussian error, i.e., there exists a set of parameters $(w^{(1)*}, w^{(2)*})$ such that $y_i = w^{(2)*T}\phi(\langle w^{(1)*}, \sum_{j=1}^p a(x_i)_j x_i^j\rangle) + \epsilon_i$, where $\epsilon_i \sim \mathrm{subG}(0, C_4^2)$ for $i = 1, 2, \ldots, n$ and $x_i \perp\!\!\!\perp \epsilon_i$.

(A8) (A2) holds when we substitute $x \odot f(x)$ with $\sum_{j=1}^p a(x_i)_j x_i^j$.

(A9) $\|w_a\|_2 \le C_8$ and $\|w_f\|_2 \le C_9$.

(A7) to (A9) are parallel to the assumptions in the self-attention case and can be justified similarly. When $T$ is large, recurrent attention models can represent a wide class of attention weights; thus we assume $y_i$ can be expressed by such a recurrent attention model. We can now provide the following sample complexity bound.

Theorem 6. Under (A7) to (A9), for any $\gamma > 0$, suppose
$$n \gtrsim \frac{C_1^2 C_x^2\eta^2}{\gamma^2}\log\Big(\frac{\eta C_8 C_9}{\gamma}\Big)(td + d + d_a + d_f), \quad \text{where } \eta = C_1 C_2 C_x.$$
Then, if stationary points exist, with probability tending to 1, any stationary point $(\hat{w}^{(1)}, \hat{w}^{(2)}, \hat{w}_f, \hat{w}_a)$ satisfies the prediction error bound
$$E\Big(\hat{w}^{(2)T}\phi\big(\big\langle \hat{w}^{(1)}, \sum_{j=1}^p a(x_i)_j x_i^j\big\rangle\big) - E(y|x)\Big)^2 \lesssim \gamma^2.$$

Theorem 6 provides a sample complexity bound for recurrent attention networks. The bound holds under the expressiveness assumption (A7), and it reveals a trade-off between expressiveness and sample complexity: when the number of parameters is too large, the sample complexity becomes large; when the number of parameters is too small, the model lacks the expressiveness needed for consistency. If $f(\cdot)$ and $a(\cdot)$ are properly chosen, they are sufficiently expressive to admit good stationary points, and the numbers of parameters $d_a$ and $d_f$ remain moderate.
In this way, an ideal sample complexity bound for these good stationary points can be achieved, as Theorem 6 states. However, with an over-complicated design of these functions, the sample complexity bound becomes large, while with an overly simple design, such good stationary points do not exist. This parallels the trade-off between approximation error and estimation error in learning theory. The theory implies that a good design of the recurrent structure helps achieve an optimal sample complexity for recurrent attention models.
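The recurrence defined in this section can be sketched in a few lines. In the sketch below, the score function is taken to be a dot product and the state update $f$ a simple affine map followed by tanh; both are illustrative choices of ours rather than the exact design in Bahdanau et al. (2014), and all shapes and values are ours.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def recurrent_attention(x, w_f, w1, w2, T=5):
    # x: (t, p) word embeddings; w_f: (t, 2t) state-update weights (illustrative)
    t, p = x.shape
    s, c = np.zeros(t), np.zeros(t)
    for _ in range(T):
        s = np.tanh(w_f @ np.concatenate([s, c]))   # s_k = f(s_{k-1}, c_{k-1})
        e = x.T @ s                                  # e_kj = score(s_{k-1}, x_i^j)
        alpha = softmax(e)                           # alpha_kj
        c = x @ alpha                                # c_k = sum_j alpha_kj x_i^j
    y = w2 @ np.maximum(w1.T @ c, 0.0)               # y_i^{(k)} = w2^T phi(<w1, c_k>)
    return y, alpha                                  # alpha at the final step is a(x_i)

rng = np.random.default_rng(0)
t, p, d = 16, 10, 32
x = rng.normal(size=(t, p))
y, a = recurrent_attention(x, rng.normal(size=(t, 2 * t)) / 4,
                           rng.normal(size=(t, d)), rng.normal(size=d))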

C APPENDIX FOR SECTION 4

C.1 DISCUSSION OF THEOREM 4

Given $\frac{p\log p - s_0\log s_0}{p - s_0} \le \frac{p}{p - s_0}\log p$, and since $p - s_0$ is close to $p$ when $s_0$ is relatively small, the result still holds when $n_1$ is larger than the order of $p$. For illustration, the linear region bounds under different sparsity levels are plotted in Figure 3. The top red line is for the baseline model with $p = 100$, and the other lines are bounds for the attention model with different sparsity levels $s_0$. In general, the bound for the attention model is smaller than that of the baseline model. The result implies that, when an appropriate attention mechanism is used, the reduction in the number of linear regions leads to a simpler landscape, yet the approximation error remains small. We interpret this as attention mechanisms helping to reduce unnecessary non-linearity in the landscape.

C.2 DISCUSSION OF THEOREM 5

Theorem 5 indicates that the flatness properties of minima are maintained when the attention mechanism is applied, and that there exist good sharp minima, coinciding with the observation in Dinh et al. (2017). However, there is no guarantee that all sharp minima generalize well. Revisiting our analysis in Section 3, the restriction on the parameter space helps remove such sharp minima. Specifically, we impose upper bounds on the $\ell_2$ norms of $(w^{(1)}, w^{(2)})$. These constraints restrict the parameter space and remove all the sharp minima constructed in the proof of Theorem 5, in which $\alpha_1$ or $\alpha_2$ goes to infinity; the $\ell_2$ norm bound guarantees that this cannot happen. Practically, $\ell_2$ norm bounds can be achieved through proper $\ell_2$ regularization, as discussed in Section 5. This also coincides with the theoretical and empirical observation that flat minima are expected to generalize better in general (Keskar et al., 2016).

C.3 SMALL SAMPLE SIZE

In this section, we study the local minima of wide neural networks in the small-sample regime. Nguyen & Hein (2017b) prove that a two-layer neural network can always achieve zero empirical error when the sample size is small. Here, we extend this result to the self-attention model.

Theorem 7. For the self-attention model in Section 3.2, if $\mathrm{rank}\big(\{\phi(\langle \hat{w}^{(1)}, \mathrm{vec}(\hat{w}^V x_i\, \hat{a}_i^{self})\rangle)\}_{i=1,\ldots,n}\big) = n$, then every stationary point $(\hat{w}^{(1)}, \hat{w}^{(2)}, \hat{w}^V, \hat{w}^Q, \hat{w}^K)$ of the objective function (4) is a global minimum.

The condition $\mathrm{rank}\big(\{\phi(\langle \hat{w}^{(1)}, x_i \odot \hat{a}\rangle)\}_{i=1,\ldots,n}\big) = n$ is a mild assumption for a wide, overparameterized network. As long as the number of units $d$ is chosen larger than $n$, the set of weights for which $\{\langle w^{(1)}, x_i \odot a\rangle\}_{i=1,\ldots,n}$ are linearly dependent has measure zero; in other words, this matrix has full column rank $n$ almost surely, and after the nonlinear activation the full rank property still holds almost surely. This assumption is similar to the condition in Theorem 3.8 of Nguyen & Hein (2017b), where the number of units in some layer is assumed to be larger than the sample size. When the sample size is smaller than the number of units in the network, this theorem also holds for the network without attention, as proved by Nguyen & Hein (2017b) and Soudry & Carmon (2016) under different conditions.

D LEMMA: MODELS WITH A FIXED ATTENTION MASK

In this section, we introduce an attention model with a fixed attention mask, as a fundamental building block of attention mechanisms. In the fixed-mask attention model, we consider a dataset $D = \{x_i, y_i\}_{i=1}^N$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, where the output $y_i$ depends only on certain regions of the input $x_i$, i.e., $y_i = f(a \odot x_i)$, where $a \in [0, 1]^p$ is an unknown fixed attention mask and $f(\cdot)$ is the ground-truth function used to generate the dataset. The set of entries $\{a_i \mid a_i \neq 0\}$ corresponds to the relevant region of the input, while the complementary set $\{a_i \mid a_i = 0\}$ corresponds to the irrelevant region. The following Lemma 1 on the sample complexity of the attention model with a fixed attention mask is not only an important building block of the proof of Theorem 1 but also provides helpful insights for understanding attention mechanisms and the main idea of our proofs. With the sparsity structure of the attention mask $a$, attention mechanisms constrain the parameters to a smaller space; they thus reduce the variability of the empirical landscape and also reduce the covering number of the parameter space. These results lead to a lower sample complexity compared with the baseline model that does not employ attention. Similar to Corollary 1, it is straightforward to calculate the sample complexity bound for the baseline model (not employing attention): to achieve the same error bound, we substitute $s_0$ with $p$ in the bound, which results in a much larger sample complexity.

The attention model we use can be written as $f(x) = w^{(2)T}\phi(\langle w^{(1)}, x \odot a\rangle)$. The assumptions for analyzing this model are the same as those of Theorem 1 ((A1) to (A3) and (A1.1) to (A1.3)), with $f(x)$ replaced by $a$ and $\|a\|_0 \le s_0$. For clarity, we state them here explicitly:

(A.L.1) There exists a set of parameters $(w^{(1)*}, w^{(2)*})$ such that $y_i = w^{(2)*T}\phi(\langle w^{(1)*}, x_i \odot a\rangle) + \epsilon_i$ for $i = 1, 2, \ldots, n$, where the $\epsilon_i$ follow the sub-Gaussian distribution $\mathrm{subG}(0, C_3^2)$ and $x_i \perp\!\!\!\perp \epsilon_i$.

(A.L.2) $\|a\|_0 \le s_0$, i.e., $a$ has at most $s_0$ non-zero weights.

(A.L.3) For any $\gamma > 0$, $\lambda_{\max}\big(\mathrm{Var}(\phi(\langle w^{(1)*}, x \odot a\rangle)\,|\,\phi(\langle w^{(1)}, x \odot a\rangle))\big) = o(\gamma^2)$ as $n \gtrsim \frac{k^2}{\gamma^2}\log(\frac{k}{\gamma})(pd + d)$ and $n \to \infty$.

(A.L.4) The $x_i$ are i.i.d. and $\|x_i\|_\infty < C_x$ for any $i = 1, \ldots, n$.

(A.L.5) There exist $C_1$, $C_2$, $C_\lambda$ such that $\lambda_{\max}(\Sigma_\phi) \le C_\lambda$, $\|w^{(1)}\|_F < C_1$, and $\|w^{(2)}\|_2 < C_2$ for any $w^{(1)}, w^{(2)} \in S$.

(A.L.6) $E(w^{(2)T}\phi(\langle w^{(1)}, x \odot a\rangle)) = E(y) = 0$.

Lemma 1. Under (A.L.1) to (A.L.6), for any $\gamma > 0$ and $s_0$ such that $k \lesssim O(s_0)$, suppose
$$n \gtrsim \frac{s_0^2 C_1^2 C_x^2\eta^2}{\gamma^2}\log\Big(\frac{s_0\eta}{\gamma}\Big)(pd + p + d), \quad \text{where } \eta = C_1 C_2 C_x.$$
Then, with probability tending to 1, any stationary point $(\tilde{a}, \hat{w}^{(1)}, \hat{w}^{(2)})$ of the corresponding objective function satisfies the prediction error bound $E\big(\hat{w}^{(2)T}\phi(\langle \hat{w}^{(1)}, x \odot \tilde{a}\rangle) - E(y|x)\big)^2 \lesssim \gamma^2$.

E PROOFS

In this section, we first provide a proof sketch of Theorem 1 so that readers can understand the high-level idea; we then provide the proofs of all our theorems. Lemma 1, Theorem 1, Theorem 2, Theorem 3, and Theorem 7 are proved in a similar manner. For Theorems 1, 2, and 3, we omit the parts of the proofs that are identical to the proof of Lemma 1.

E.1 PROOF SKETCH OF THEOREM 1

Proof sketch: We aim to prove that, under our assumptions and the required sample size, every local minimum has prediction error as good as the global minimum, up to $\gamma^2$. The proof has two major steps. First, we show that, for any parameter set with bad prediction, the hidden features $\phi(\langle w^{(1)}, x_i \odot a\rangle)$ must be correlated with the bias $E(y|x_i) - w^{(2)T}\phi(\langle w^{(1)}, x_i \odot a\rangle)$, which leads to a large population gradient with respect to $w^{(2)}$. Second, we construct an $\epsilon$-cover of the parameter sets $(w^{(1)}, w^{(2)})$ to show that the sample gradients converge to the population gradients, so that the sample gradient with respect to $w^{(2)}$ is also bounded away from zero. Hence parameter sets with bad prediction cannot be local minima, and we conclude that every local minimum has prediction error within $O(\gamma^2)$ of the global minimum. The complete proof is provided below.

E.2 PROOF OF LEMMA 1

Proof. As described in the proof sketch, the proof is divided into two steps. First, we study the lower bound of the population risk gradient $\|E_{x,y}(\nabla R_n(w^{(2)}))\|_2$; in the second step, we study the convergence of $\|\nabla R_n(w^{(2)})\|_2$ to the population gradient $\|E_{x,y}(\nabla R_n(w^{(2)}))\|_2$. We further separate these two steps into three lemmas. Lemma 1.1 proves the first step: we study the landscape of the population risk, showing that, with high probability, the population risk gradient $\|E_{x,y}(\nabla R_n(w^{(2)}))\|_2$ is large. Lemmas 1.2 and 1.3 prove the second step. Specifically, in part (b) we consider the convergence to the population risk gradient in expectation over $y$ only, $E_y(\nabla R_n(w^{(2)}))$, and in part (c) we consider the convergence of the empirical risk gradient $\nabla R_n(w^{(2)})$ to $E_y(\nabla R_n(w^{(2)}))$.

We introduce the necessary notation beforehand. To emphasize the roles of $x$ and $y$ separately, we denote $R(w^{(1)}, w^{(2)}, a) = E_y(R_n(w^{(1)}, w^{(2)}, a))$, the expectation of the empirical loss with respect to $y$ treating $x$ as random, and $\nabla R(w)$ the corresponding derivatives. We denote $E_x(\nabla R(w^{(1)}, w^{(2)}, a)) = E_{x,y}(\nabla R_n(w^{(1)}, w^{(2)}, a))$, the expectation of the empirical loss gradient with respect to both $x$ and $y$. In the proof we may write $o(\gamma)$ for vectors/matrices, meaning that every element of the vector/matrix is $o(\gamma)$.

Lemma 1.1. Under the assumptions of Lemma 1, when equation 7 is violated, with probability going to 1, the population risk gradient with respect to $w^{(2)}$ satisfies $\|E_{x,y}(\nabla R_n(w^{(2)}))\|_2 \ge O(\gamma)$.

Here $\Sigma_\phi - \Sigma_c\Sigma_\phi^{-1}\Sigma_c$ is exactly the residual covariance matrix after regression on $\{\phi_{1\ldots d}(\langle w^{(1)}, x \odot f(x)\rangle)\}$, as defined in (A.L.3) (page 176 of Izenman (2013)), and $w^{(2)T}(\Sigma_\phi - \Sigma_c\Sigma_\phi^{-1}\Sigma_c)w^{(2)}$ is a quadratic form of that conditional variance matrix. By (A.L.5), its largest eigenvalue satisfies $\lambda_{\max}(\Sigma_\phi - \Sigma_c\Sigma_\phi^{-1}\Sigma_c) = o(\gamma)$. Finally, we obtain
$$E(u^2) = E(\nabla R(w^{(2)}))^2 + c_\lambda o(\gamma^2) \le \|w^{(2)}\|_2^2\,\lambda_{\max}(\Sigma_\phi - \Sigma_c\Sigma_\phi^{-1}\Sigma_c) + c_\lambda o(\gamma^2) = o(\gamma^2).$$
By contradiction, we conclude that if $E(u^2) \ge \gamma^2$, then $\|E(\nabla R(w^{(2)}))\|_2 \ge O(\gamma)$. This finishes the proof of Lemma 1.1.

Lemma 1.2. Under the assumptions of Lemma 1, when equation 7 is violated, the risk gradient of $w^{(2)}$ in expectation over $y$ satisfies $\|E_y(\nabla R_n(w^{(2)}))\|_2 \ge O(\gamma)$.

Proof: In Lemma 1.1 we have shown that, at the population level, every parameter set with bad prediction has a large $\|E(\nabla R(w^{(2)}))\|_2$.
In Lemma 1.2, we use an $\epsilon$-cover technique to bound the gap between $\nabla R(w^{(2)})$ (recall that $\nabla R(w^{(2)})$ is short for $E_y(\nabla R_n(w^{(2)}))$) and $E(\nabla R(w^{(2)}))$. Our parameters $w^{(1)}$, $w^{(2)}$, and $a$ lie inside the $\ell_2$ balls $B_d(0, C_1)$, $B_{p\times d}(0, C_2)$, and $B_p(0, s_0^2)$, where $B_p(c, r)$ denotes a $p$-dimensional $\ell_2$ ball with center $c$ and radius $r$. By Lemma 5.2 in Vershynin (2010), the $\epsilon$-covering numbers $N_1$, $N_2$, $N_3$ of these three balls are upper bounded by
$$N_1 \le (3C_1/\epsilon)^d, \quad N_2 \le (3C_2/\epsilon)^{pd}, \quad N_3 \le (3s_0^2/\epsilon)^p.$$
Then the joint $3\epsilon$-covering number for the union of all three parameters, $N_{3\epsilon}$, satisfies $N_{3\epsilon} \le N_1 N_2 N_3$. For ease of notation, we denote $\theta = (a, w^{(1)}, w^{(2)})$. Let $\Theta = \{\theta_1, \ldots, \theta_N\}$ be a corresponding cover with $N_{3\epsilon}$ elements. Following the $\epsilon$-cover constructed for $(w^{(1)}, w^{(2)}, a)$ separately, we can always find $\Theta$ such that, for any feasible $\theta$, there exists $j \in [N]$ such that $\max(\|w^{(1)}_{(j)} - w^{(1)}\|_2, \|w^{(2)}_{(j)} - w^{(2)}\|_2, \|a_{(j)} - a\|_2) \le \epsilon$. In this proof, we use the parenthesized subscript $(j)$ to represent the $j$-th element in the cover, to distinguish it from other subscripts. By the triangle inequality,
$$\|\nabla R(w^{(2)})\|_2 \ge \|\nabla R(w^{(2)}_{(j)})\|_2 - \|\nabla R(w^{(2)}) - \nabla R(w^{(2)}_{(j)})\|_2. \qquad (9)$$
Therefore, we only need to bound both terms on the right-hand side of equation 9. We start with the first term $\|\nabla R(w^{(2)}_{(j)})\|_2$. To do so, we first bound the gradient term $v_i = u_i\phi(\langle w^{(1)}, x_i \odot a\rangle)$. For any fixed parameter set,
$$\|v_i\|_2^2 \lesssim \big(C_2\|\phi(\langle w^{(1)}, x_i \odot a\rangle)\|_2\big)^2\,\|\phi(\langle w^{(1)}, x_i \odot a\rangle)\|_2^2.$$
Here we denote by $\|w^{(1)}\|_{1,\mathrm{active}}$ the $\ell_1$ norm of $w^{(1)}$ on the active features, i.e., the features whose attention weight in $a$ is non-zero; by the sparsity condition there are at most $s_0$ non-zero elements in $x \odot a$. Combining with the $\ell_2$ bound on $w^{(2)}$, we apply the Cauchy-Schwarz inequality:
$$\|\phi(\langle w^{(1)}, x_i \odot a\rangle)\|_2 \le \max\{|x \odot a|\}\,\|w^{(1)}\|_{1,\mathrm{active}} \le \sqrt{s_0}\,C_x C_1.$$
This implies
$$\|v_i\|_2^2 \lesssim \big(C_2\|\phi(\langle w^{(1)}, x_i \odot a\rangle)\|_2\big)^2\,\|\phi(\langle w^{(1)}, x_i \odot a\rangle)\|_2^2 = O(s_0^2 C_1^4 C_2^2 C_x^4).$$
From Lemma 1.1, we know there exists a constant $c$ such that $\|E_x(\nabla R(w^{(2)}))\|_2 \ge c\gamma$. We denote $\xi^2 = s_0^2 C_1^4 C_2^2 C_x^4$ and apply a Hoeffding bound to the $\ell_2$ norm of the empirical gradient over the cover. In other words, under our assumptions, all stationary points $(\tilde{a}, \hat{w}^{(1)}, \hat{w}^{(2)})$ satisfy the prediction error bound of rate $\gamma$ w.h.p. when the sample size satisfies
$$n \gtrsim \frac{s_0^2\eta^2}{c^2\gamma^2}\log\Big(\frac{s_0\eta}{c\gamma}\Big)(pd + p + d).$$

E.3 PROOF OF THEOREM 1

Proof. Theorem 1 can be proved in the same manner as Lemma 1, substituting $a$ by $f(x)$; only two bounds change. The first is the bound on $\|\phi(\langle w^{(1)}, x_i \odot f(x)\rangle)\|_2$ in Lemma 1.2, and the second is the $\epsilon$-covering number in Lemma 1.2. By assumption (A2), we denote $f(x)_{\mathrm{lead}} = \{f(x)_{(1)}, \ldots, f(x)_{(s_0)}\}$, the top $s_0$ leading attention weights, which satisfy $\|f(x)_{\mathrm{lead}}\|_2 \ge 1 - \tau$, and $f(x)_{\mathrm{sub}}$ the remaining $p - s_0$ attention weights, which satisfy $\|f(x)_{\mathrm{sub}}\|_2 \le \tau$ since $\|f(x)\|_2 = \|f(x)_{\mathrm{lead}}\|_2 + \|f(x)_{\mathrm{sub}}\|_2 = 1$. We denote by $x_{\mathrm{lead}}$ and $w^{(1)}_{\mathrm{lead}}$ the features and network weights corresponding to the leading attention weights, and by $x_{\mathrm{sub}}$, $w^{(1)}_{\mathrm{sub}}$ those corresponding to the other attention weights.
Parallel to Lemma 1.2, we only need to adjust the bound on the $\phi(\langle w^{(1)}, x_i \odot f(x)\rangle)$ term:
$$\|\phi(\langle w^{(1)}, x_i \odot f(x)\rangle)\|_2 = \|\phi(\langle w^{(1)}_{\mathrm{lead}}, x_{\mathrm{lead}} \odot f(x)_{\mathrm{lead}}\rangle + \langle w^{(1)}_{\mathrm{sub}}, x_{\mathrm{sub}} \odot f(x)_{\mathrm{sub}}\rangle)\|_2 \le \sqrt{2}\big(\|\phi(\langle w^{(1)}_{\mathrm{lead}}, x_{\mathrm{lead}} \odot f(x)_{\mathrm{lead}}\rangle)\|_2 + \|\phi(\langle w^{(1)}_{\mathrm{sub}}, x_{\mathrm{sub}} \odot f(x)_{\mathrm{sub}}\rangle)\|_2\big).$$
The two terms in the last inequality can be bounded separately by the Cauchy-Schwarz inequality:
$$\|\phi(\langle w^{(1)}_{\mathrm{lead}}, x_{\mathrm{lead}} \odot f(x)_{\mathrm{lead}}\rangle)\|_2 \le \|w^{(1)}\|_2\,\|x_{\mathrm{lead}}\|_2\,\|f(x)_{\mathrm{lead}}\|_2 \le C_1\sqrt{s_0}\,C_x(1 - \tau) \le \sqrt{s_0}\,C_1 C_x,$$
and
$$\|\phi(\langle w^{(1)}_{\mathrm{sub}}, x_{\mathrm{sub}} \odot f(x)_{\mathrm{sub}}\rangle)\|_2 \le \|w^{(1)}\|_2\,\|x_{\mathrm{sub}}\|_2\,\|f(x)_{\mathrm{sub}}\|_2 \le C_1\sqrt{p}\,C_x\tau = \tau\sqrt{p}\,C_1 C_x.$$
Combining both inequalities, we have $\|\phi(\langle w^{(1)}, x_i \odot f(x)\rangle)\|_2 \le \sqrt{2}(\sqrt{s_0} + \tau\sqrt{p})C_1 C_x$, and further
$$\|v_i\|_2^2 \lesssim \big(C_2\|\phi(\langle w^{(1)}, x_i \odot f(x)\rangle)\|_2\big)^2\,\|\phi(\langle w^{(1)}, x_i \odot f(x)\rangle)\|_2^2 = (\sqrt{s_0} + \tau\sqrt{p})^4 C_1^4 C_2^2 C_x^4. \qquad (14)$$
Second, since $f(x)$ is not optimized jointly but calculated from $x$, we do not need to consider an $\epsilon$-cover for $a$ in the maximum operator. Therefore, the new $\epsilon$-covering numbers for $w^{(1)}$ and $w^{(2)}$ are upper bounded as $N_1 \le (3C_1/\epsilon)^d$, $N_2 \le (3C_2/\epsilon)^{pd}$. Substituting the new $\epsilon$-cover bound into the argument, we obtain the final sample complexity bound
$$n \gtrsim \frac{(\sqrt{s_0} + \tau\sqrt{p})^4\eta^2}{\gamma^2}\log\Big(\frac{s_0\eta}{\gamma}\Big)(pd + d).$$

E.4 PROOF OF THEOREM 2

The proof follows the same manner as Lemma 1 and Theorem 1; we only present the bounds that differ. First, the hidden-feature norm can be bounded as
$$\|\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)\|_2 \le \sqrt{2}\,C_1 C_8\big(\|x_{\mathrm{lead}}\, a^{self}_{\mathrm{lead}}\|_2 + \|x_{\mathrm{sub}}\, a^{self}_{\mathrm{sub}}\|_2\big) \le \sqrt{2}(\sqrt{s_0} + \tau\sqrt{p})C_1 C_8.$$
Finally, we obtain
$$\|v_i\|_2^2 \lesssim \big(C_2\|\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)\|_2\big)^2\,\|\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)\|_2^2 = 4(\sqrt{s_0} + \tau\sqrt{p})^4 C_1^4 C_2^2 C_8^4 C_x^4.$$
Denoting $\xi^2 = (\sqrt{s_0} + \tau\sqrt{p})^4 C_1^2 C_8^2 C_x^2$ and applying the Hoeffding bound and a union bound as above,
$$P\big(\exists j \in [N],\ \|v_j\|_2 \le \tfrac{2c\gamma}{3}\big) \lesssim N\exp\Big(-n\frac{c^2\gamma^2}{\xi^2}\Big).$$
Second, we bound the $\|\nabla R(w^{(2)}) - \nabla R(w^{(2)}_{(j)})\|_2$ term. The attention logit gap is bounded as
$$\|Q_i^z K_i^T - Q_{i(j)}^z K_{i(j)}^T\|_{\max} = \|(x_i^z)^T((w^Q)^T w^K)x_i - (x_i^z)^T((w^Q_{(j)})^T w^K_{(j)})x_i\|_{\max} \lesssim t(C_5 + C_6)C_x^2\,\epsilon.$$
With this bound, we know $a_i^{self(z)}$ is a Lipschitz function under assumption (A2.1), which lower bounds $\|Q_i^z K_i^T\|_2$. Recall that the softmax function for a vector $\beta$ is defined as $\mathrm{softmax}(\beta) = \exp(\beta)/\sum_i \exp(\beta_i)$; it is Lipschitz continuous in $\beta$ when $\sum_i \exp(\beta_i)$ has a lower bound, which is satisfied by (A2.1). Thus,
$$\|a_i^{self(z)} - a_{i(j)}^{self(z)}\|_2 \lesssim \Big\|\mathrm{softmax}\Big(\frac{Q_i^z K_i^T}{\sqrt{d_k}}\Big) - \mathrm{softmax}\Big(\frac{Q_{i(j)}^z K_{i(j)}^T}{\sqrt{d_k}}\Big)\Big\|_2 \lesssim \sqrt{pt}\,(C_5 + C_6)C_x^2\,\epsilon.$$
With these bounds, we have
$$\|\nabla R(w^{(2)}) - \nabla R(w^{(2)})_{(j)}\|_2 \le \frac{1}{n}\sum_{i=1}^n\big\|u_i\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle) - u_{i(j)}\phi(\langle w^{(1)}_{(j)}, \mathrm{vec}(w^V_{(j)} x_i\, a_{i(j)}^{self})\rangle)\big\|_2$$
$$\le \frac{1}{n}\Big(\sum_{i=1}^n\big\|(u_i - u_{i(j)})\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle)\big\|_2 + \sum_{i=1}^n\big\|u_{i(j)}\big(\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i\, a_i^{self})\rangle) - \phi(\langle w^{(1)}_{(j)}, \mathrm{vec}(w^V_{(j)} x_i\, a_{i(j)}^{self})\rangle)\big)\big\|_2\Big)$$
$$\lesssim \big((\sqrt{s_0} + \tau\sqrt{p})^2 t C_1^2 C_2^2 C_8^2 C_x^2 + \sqrt{pt}\,C_1 C_2 C_8(C_5 + C_6)C_x^2\big)\epsilon \lesssim \big((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt}\big)\big(C_1 C_2 C_8(C_1 C_2 C_8 + (C_5 + C_6))\big)\epsilon.$$
Recall that $u_{i(j)}$ and $a_{i(j)}$ correspond to the $j$-th element of the $\epsilon$-cover. Denote $\xi' = ((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt})(C_1 C_2 C_8(C_1 C_2 C_8 + (C_5 + C_6)))$ and choose $\epsilon = c\gamma/\xi'$. Combining the above results, with probability at least $1 - O(N\exp(-n\,c^2\gamma^2/\xi^2))$, we have $\|\nabla R(w^{(2)})\|_2 > \gamma/3$. Therefore we can choose
$$n \gtrsim \frac{\xi^2}{c^2\gamma^2}\log\Big(\frac{((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt})C_1 C_2 C_5 C_6 C_8 C_x}{c_3\gamma}\Big)(pd_v d + d + 2pd_q)$$
so that $N\exp(-n\,c^2\gamma^2/\xi^2) = o(1)$. Thus, with this required sample complexity, we have $\|\nabla R(w^{(2)}) - E(\nabla R(w^{(2)}))\|_2 \le \frac{2c\gamma}{3}$.
Finally, we conclude that, with high probability, for any parameter set $(w^{(1)}, w^{(2)}, w^V, w^K, w^Q)$ with $E(u^2) \ge \gamma$, we have $\|\nabla R(w^{(2)})\|_2 > \gamma/3$. Then, following the same empirical risk convergence argument as in Lemma 1.2, with high probability such parameters cannot be stationary points. We conclude the sample complexity bound
$$n \gtrsim \frac{\big((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt}\big)^2 p d_v d\,\eta_a^2}{\gamma^2}\log\Big(\frac{\big((\sqrt{s_0} + \tau\sqrt{p})^2 + \sqrt{pt}\big)\eta_a\eta_b}{\gamma}\Big).$$

E.5 PROOF OF THEOREM 3

Similar to Theorems 1 and 2, we construct an $\epsilon$-cover over all parameters $\theta := (w^{k1}, w^{k2}, w^V, w^Q, w^K)$, denoted $\{\theta_1, \ldots, \theta_N\}$, such that for any feasible parameter there exists $j \in [N]$ whose maximum $\ell_2$ distance to $\theta_j$ is smaller than $\epsilon$. By counting the number of parameters in all matrices in $\theta$, we have $\log(N) = O\big(d_{self} + \sum_{k=1}^{D}(d_k + d_v)q_k\big)\log(1/\epsilon)$. Denoting by $\nabla R(w^{kr}_{(j)})$ the gradient with respect to the $j$-th parameter set in the $\epsilon$-cover for $j \in \{1, \ldots, N\}$, we have
$$P\big(\|\nabla R(w^{kr}_{(j)}) - E_x(\nabla R(w^{kr}_{(j)}))\|_2 \ge \tfrac{c\gamma}{3}\big) = o(1).$$
Finally, we can conclude that, with probability $1 - o_n(1)$, for any parameter set such that $E(\hat{w}^{(2)T}\phi(\langle \hat{w}^{(1)}, x \odot \tilde{a}\rangle) - E(y|x))^2 \ge \gamma$, we have $\|\nabla R(w^{(2)})\|_2 > c\gamma/3$. Then, following the empirical risk convergence procedure of Lemma 1.2, we show that with probability going to 1, $\|\nabla R_n(w^{(2)})\|_2 > 0$, and all parameters with prediction error worse than $O(\gamma)$ cannot be stationary points, as long as $n \gtrsim \log(\frac{k}{\gamma})(d_{self} + D(d_k + d_v q_k))$. This completes the proof.

E.6 PROOF OF THEOREM 4

Proof. First, since $\|f(x)\|_0 = s_0$, all the input coordinates with zero attention weight are inactive in the network, and we can omit them. We then split the $n_1$ hidden units into $s_0$ groups with $\lfloor n_1/s_0\rfloor$ units in each group, discarding the leftover units. The $s_0$ groups correspond to the $s_0$ active inputs with non-zero attention weight. Inside each group, for example the $j$-th group, denoting $q = \lfloor n_1/s_0\rfloor$, we choose the input weights and biases for $i = 1, 2, \ldots, q$ as
$$h_1(x) = \max\{0, w_j x\},\; h_2(x) = \max\{0, 2w_j x - 1\},\; \ldots,\; h_q(x) = \max\{0, 2w_j x - (q - 1)\},$$
where $w_j$ is a row vector with the $j$-th entry equal to 1 and all other entries 0. In the second layer, we choose $w^{(2)} = (w_3, \ldots, w_3)$, where $w_3 = (1, -1, 1, \ldots, (-1)^{q+1})$, corresponding to $h_1$ through $h_q$ in each group. The designed network then has $q$ linear regions inside each group, given by the intervals
$$(-\infty, 0],\; (0, 1],\; (1, 2],\; \ldots,\; [q - 1, \infty).$$
Each of these intervals has a subset that is mapped by $w_3 h(x)$ onto the interval $(0, 1)$, following the argument of Montufar et al. (2014). Therefore, the total number of linear regions is lower bounded by $\lfloor n_1/s_0\rfloor^{s_0}$.

E.7 PROOF OF THEOREM 5

Proof. We define an $\alpha$-scale transformation $T_\alpha: (w^{(1)}, w^{(2)}) \to (\alpha w^{(1)}, \alpha^{-1} w^{(2)})$, with all value, query, and key matrices unchanged. The Jacobian determinant of $T_\alpha$ is $\alpha^{(pd_v - 1)d}$. Since $pd_v d \ge d$, letting $\alpha \to \infty$ makes the Jacobian determinant go to infinity, and hence the volume of $C(L, \theta, \epsilon)$ goes to infinity. For the Hessian matrix, assume a positive diagonal element $\delta > 0$ in the $w^{(1)}$ block. Then the Frobenius norm of
$$\nabla^2 L(T_\alpha(\theta)) = \begin{pmatrix}\alpha^{-1}I & 0\\ 0 & \alpha I\end{pmatrix}\nabla^2 L(\theta)\begin{pmatrix}\alpha^{-1}I & 0\\ 0 & \alpha I\end{pmatrix}$$
is lower bounded by $\alpha^{-2}\delta$. Choosing $\alpha$ sufficiently small, the largest eigenvalue of $\nabla^2 L(T_\alpha(\theta))$ becomes larger than any constant $M$. Therefore, there exists a stationary point whose Hessian operator norm is arbitrarily large.



Figure 1: Distribution of the $\ell_2$ norm of the top $\sqrt{p_i}$ attention weights.

Figure 2: Examples from the Noisy-MNIST dataset.

Figure 3: Number of linear regions (log scale) vs. sparsity level $s_0$.

Under assumption (A3.1), all input features and weights are bounded. Therefore, $\|\nabla h(w^{kr})\|_2$ is a Lipschitz continuous function of all parameters, and we denote its Lipschitz constant by $c_k$. Under (A6), if $E(h(x) - E(y|x))^2 \ge \gamma^2$, then
$$\|E(\nabla R(w^{kr}))\|_2 = \|\mathrm{cov}(\nabla h(w^{kr}), u)\|_2 \ge O(\gamma).$$

Second, we analyze the $\|\nabla R(w^{kr}) - \nabla R(w^{kr}_{(j)})\|_2$ term. Since the gradient is Lipschitz continuous, we have $\|\nabla R(w^{kr}) - \nabla R(w^{kr}_{(j)})\|_2 \le c_k\epsilon$. We choose $\epsilon = \frac{c\gamma}{3c_k}$; then, with probability at least $1 - O(N\exp(-n\,c^2\gamma^2/c_k))$, we have $\|\nabla R(w^{(2)})\|_2 > \frac{c\gamma}{3}$. Therefore, we can choose $n \gtrsim \log(\frac{c_k}{\gamma})(d_{self} + D(d_k + d_v q_k))$ such that $N\exp(-n\,c^2\gamma^2/c_k) = o(1)$.

Table 1: Testing accuracy of the self-attention ablation study on the IMDB reviews dataset.

Table 2: Testing accuracy of the recurrent attention ablation study on the IMDB reviews dataset.

Testing accuracy of self-attention on the Noisy MNIST dataset

APPENDIX

In this appendix, Section A presents detailed theorem assumptions and their justifications; Section B presents the theorem on sample complexity bounds for recurrent attention models; Section C presents details on the results in Section 4 of the main paper; Section D presents a key lemma on the sample complexity bound for attention models with fixed attention masks for all samples; finally, we provide the proofs in Section E.

A THEOREM ASSUMPTIONS

In this section, we present the standard assumptions of Theorems 1-3 more rigorously and further justify them.

A.1 JUSTIFICATION OF ASSUMPTIONS

Here we justify (A1) to (A3) for Theorem 1 and explain why they can be intuitively summarized as in the main paper.

(A1) assumes that the expectation of the output $y$ can be specified by a two-layer neural network with the attention model structure. It has been shown that general bounded functions with a Fourier representation on $[-1, 1]$ can be well approximated by the defined two-layer network (Barron & Klusowski, 2018). Specific to attention models, Yun et al. (2019) prove the strong expressiveness of transformer-type self-attention models given a sufficiently large network. Therefore it is very mild to assume that the attention model can identify the mean of the output $y$.

(A2) assumes that the attention mask is mostly concentrated on the largest $s_0$ attention weights. This assumption is naturally satisfied due to the softmax function in the self-attention equation. The softmax makes each attention weight proportional to the exponential of the inner product of the query and key vectors. This significantly enlarges the differences between the inner products, so the attention weights do not spread out evenly over all entries but concentrate on only a few of them (a small numerical illustration is given at the end of this subsection). Empirically, we verify that this assumption is satisfied in a real-world transformer translation task; see experiment 1 in Section 6 for more details on the distribution of attention weights in Portuguese-to-English translation, which shows that this assumption is well satisfied.

(A3) assumes that $\{\phi_{1\dots d}(\langle w^{(1)}, x_i \circ f(x)\rangle)\}$ have sufficient expressiveness to predict the output $y$ when the sample size is sufficiently large. In an overparameterized network with large $d$, $\{\phi_{1\dots d}(\langle w^{(1)}, x_i \circ f(x)\rangle)\}$ have up to $2^d$ linear regions over the $n$ samples, and (A3) assumes these linear regions span all directions of $\phi(\langle w^{(1)}, x_i \circ f(x)\rangle)$ up to an $o(\gamma)$ term. That is, given the $2^d$ linear regions and the strong expressiveness of $\{\phi_{1\dots d}(\langle w^{(1)}, x \circ f(x)\rangle)\}$, the residual vector is an $o(\gamma)$ term with respect to all directions in $\mathbb{R}^d$. This assumption is parallel to the full column rank condition in Nguyen & Hein (2017b), where one essentially assumes that, in an overparameterized network, the linear combination spans all directions in $\mathbb{R}^p$. Allen-Zhu et al. (2019) also show, in their Lemma B.1 and Corollary B.2, that in overparameterized networks every small region of parameter space contains a set of parameters with good prediction, which leads to a good landscape in each neighborhood. These results all indicate that it is reasonable to assume sufficient expressiveness of $\phi(\langle w^{(1)}, x \circ a\rangle)$ in overparameterized networks as $n \to \infty$. Moreover, with the required large sample size, we can also directly evaluate this assumption by checking whether the $\phi(\langle w^{(1)}, x \circ a\rangle)$ spread out over the whole space and whether their linear combinations estimate $y$ well.

We also empirically validate (A3) by computing the largest eigenvalue of the conditional covariance in (A3). We generate a two-layer network under the self-attention model (4) with $n = 500$, $d = 256$, $p = 2142$, $d_v = 100$. We choose the same set of parameters as in our experimental setup in Section 6, and the hidden layers are sufficiently overparameterized with $p = 2142$. We then compute the empirical conditional covariance and obtain $\lambda_{\max}\big(\mathrm{Var}(\phi(\langle w^{(1)}, x_i \circ f(x)\rangle) \mid \phi(\langle w^{(1)}, x \circ f(x)\rangle))\big) = 3.13 \times 10^{-8}$. This result indicates that the largest eigenvalue term is small enough to be assumed $o(\gamma^2)$, as required by (A3).
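As referenced above, the following minimal numpy sketch (our own illustration with toy dimensions and standard-normal queries and keys, not the trained transformer of Section 6) shows how the softmax already concentrates most of the $\ell_2$ mass of an attention row on its few largest entries, which is the behavior that (A2) formalizes.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d_q, trials = 64, 32, 1000        # sentence length and query/key dim (toy values)
s0 = 5                               # number of top weights kept

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

mass = []
for _ in range(trials):
    q = rng.normal(size=d_q)                     # query vector of one position
    K = rng.normal(size=(p, d_q))                # key vectors of all p positions
    a = softmax(K @ q / np.sqrt(d_q))            # self-attention weights of that position
    top = np.sort(a)[-s0:]                       # largest s0 attention weights
    mass.append(np.linalg.norm(top) / np.linalg.norm(a))

print(f"average fraction of l2 mass in top {s0} of {p} weights: {np.mean(mass):.3f}")
```

Even with untrained, random queries and keys the top few weights carry most of the $\ell_2$ mass; trained attention weights concentrate further, as shown in Figure 1.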
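For orientation, the gradient identity that drives these steps can be written as follows (our own restatement under the squared loss and the definition of $u$ in equation 8, not a verbatim reproduction of the omitted displays):
$$\nabla_{w^{(2)}} R \;=\; 2\,E\Big[\big(w^{(2)T}\phi(\langle w^{(1)}, x \circ a\rangle) - E(y\,|\,x)\big)\,\phi(\langle w^{(1)}, x \circ a\rangle)\Big] \;=\; 2\,E\big[u\,\phi(\langle w^{(1)}, x \circ a\rangle)\big],$$
so a lower bound on $\|E[u\,\phi(\langle w^{(1)}, x \circ a\rangle)]\|_2$ for poorly predicting parameters translates directly into the gradient lower bound used above.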

E.8 PROOF OF THEOREM 6

Proof. First, we obtain a new $\epsilon$-covering bound for the parameter set $(w^{(2)}, w^{(1)}, w^f, w^a)$. The derivatives of the population risk, taking expectation over $y$, can be presented as before, and the new $v_i$ term is now built from the attention-weighted input $\sum_{j=1}^{p} a(x_i)\, x_i^j$. Following (A15) and the same procedure as Lemma 1.1, we have $\|E(\nabla R(w^{(2)}))\|_2 \geq c\gamma$, and the new bound on $\|v\|_2^2$ with respect to $x$ is upper bounded in terms of $X$ with normalized attention weights. The same argument applies to the case $\|E(\nabla R(w^{(1)}))_k\|_2 \geq c\gamma$. Then, following the same approach as in Theorems 1 and 2, we obtain the sample complexity bound.

E.9 PROOF OF THEOREM 7

Proof. Here we only consider the derivatives with respect to $w^{(1)}$ and $w^{(2)}$. By assumption, $\mathrm{rank}\big(\{\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i a_i^{self})\rangle)\}_{i=1,\dots,n}\big) = n$. Solving the linear system, we must have $u_i = 0$ for every $i = 1, 2, \dots, n$ in order to satisfy $\nabla R_n(w^{(2)}) = 0$. Thus the in-sample loss is exactly zero, so it must be a global minimum.
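To illustrate the linear-system step, here is a minimal numpy sketch (our own toy example, with random features standing in for $\phi(\langle w^{(1)}, \mathrm{vec}(w^V x_i a_i^{self})\rangle)$): when the $n \times d$ hidden feature matrix has rank $n$, stationarity with respect to the second layer forces every in-sample residual to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 64                        # n samples, d hidden units (overparameterized: d > n)
Phi = rng.normal(size=(n, d))        # stand-in for the hidden feature matrix, rank n a.s.
y = rng.normal(size=n)

# Stationarity in w2 for the squared loss means Phi^T (Phi w2 - y) = 0.
# With rank(Phi) = n, the map u -> Phi^T u has trivial kernel, so Phi w2 = y exactly.
w2, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # a w2 satisfying the stationarity condition
residuals = Phi @ w2 - y                        # u_i for each sample

print(np.linalg.matrix_rank(Phi))               # n
print(np.abs(residuals).max())                  # ~1e-12: in-sample loss is exactly zero
```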

