APPROXIMATION ABILITY OF TRANSFORMER NETWORKS FOR FUNCTIONS WITH VARIOUS SMOOTHNESS OF BESOV SPACES: ERROR ANALYSIS AND TOKEN EXTRACTION

Abstract

Although Transformer networks have achieved outstanding performance on various natural language processing tasks, many aspects of their theoretical nature are still unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their approximation and estimation capability when the target function belongs to function classes such as the Hölder class and the Besov class. Besov spaces play an important role in several fields such as wavelet analysis, nonparametric statistical inference, and approximation theory. In this paper, we study the approximation and estimation error of Transformer networks in a setting where the target function takes a fixed-length sentence as input and belongs to one of two variants of Besov spaces, known as anisotropic Besov spaces and mixed smooth Besov spaces, in which Transformer networks are shown to avoid the curse of dimensionality. By overcoming the difficulty that interactions among tokens are limited, we prove that Transformer networks attain the minimax optimal rate. Our result also shows that token-wise parameter sharing in Transformer networks reduces the dependence of the network width on the input length. Moreover, we prove that, in suitable situations, Transformer networks dynamically select the tokens to pay careful attention to. This phenomenon matches the attention mechanism, on which Transformer networks are based. Our analyses give strong theoretical support for why Transformer networks perform so well on various natural language processing tasks.

1. INTRODUCTION

Transformer networks, which were proposed in Vaswani et al. (2017), have achieved outstanding performance on various natural language processing (NLP) tasks, including text classification (Shaheen et al., 2020), machine translation (Vaswani et al., 2017), language modeling (Radford et al.; Devlin et al., 2018), and question answering (Devlin et al., 2018; Yang et al., 2019). Transformer networks make it feasible to approximate functions that take a sequence of tokens (i.e., text) as input, owing to their specific architecture: a stack of blocks, each consisting of a self-attention layer and token-wise feed-forward layers. However, despite these great successes on various NLP tasks, many aspects of their theoretical nature are still unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their function approximation and estimation capability. A remarkable property of neural networks is their universal approximation capability, which means that any continuous function with compact support can be approximated with arbitrary accuracy by two fully connected layers (Cybenko, 1989). However, Cybenko (1989) did not state anything about an upper bound on the network size. Therefore, the relation between properties of the target function and the required network size is a natural next question. By imposing certain properties such as smoothness on target functions, the representability of neural networks can be studied more precisely. Barron (1993) developed an approximation theory for functions whose capacity, measured by the integrability of their Fourier transform, is limited. Deep neural networks with ReLU activation (Nair & Hinton, 2010; Glorot et al., 2011) have also been extensively studied from the viewpoint of approximation and estimation ability. For example, Yarotsky (2016) derived the approximation error of fully connected networks with ReLU activation for functions in Sobolev spaces.
Schmidt-Hieber (2017) derived an estimation error bound for the regularized least squares estimator realized by a deep ReLU network, based on an approximation error analysis in a regression setting. Suzuki (2019) derived approximation and estimation error rates of fully connected networks with ReLU activation for the Besov space, which were also shown to be almost minimax optimal. Although the derived rates of convergence are almost optimal, they suffer from the curse of dimensionality, which is one of the main issues in machine learning. A typical consequence of the curse of dimensionality is that, as the dimension of the data increases, the approximation accuracy (and estimation accuracy) deteriorates exponentially in the dimension. However, under certain structural assumptions on the data and the target function, this issue can be avoided. Indeed, Suzuki (2019) and Suzuki & Nitanda (2021) showed that, by assuming that the target function has mixed smoothness or anisotropic smoothness, we can avoid the curse of dimensionality. Okumoto & Suzuki (2022) derived approximation and estimation errors in a severe setting in which the input data are infinite-dimensional. Although much research on the representation ability of fully connected layers and convolution layers has been developed, relatively little exists on that of Transformer networks. Kratsios et al. (2021) proved that there exists a pair of an input sequence and output particles which minimizes a given proper loss function under a given constraint set. Vuckovic (2020) proved that, when attention layers are regarded as maps from measures to measures, they are Lipschitz continuous with respect to Wasserstein distances. Both Kratsios et al. (2021) and Vuckovic (2020) regard an input sentence as a measure, that is, as particles or a bag of words, which is an interesting viewpoint.
However, these papers do not specify how accurately Transformer networks can approximate a given function from an input sequence to an output. Therefore, their results differ from this paper's main purpose, which is to explain why Transformer networks can perform well on various NLP tasks represented by target functions in various function spaces. Yun et al. (2020), Zaheer et al. (2020) and Shi et al. (2021) proved that Transformer networks are universal approximators of sequence-to-sequence functions. However, since these papers did not assume smoothness of the target function, their results do not specify an upper bound on the depth of the Transformer network, corresponding to the fact that the universal approximation capability of neural networks says nothing about an upper bound on the network width. Thus, this paper studies the question that naturally arises: how are properties of the target function related to the required network size and precision? In this paper, we study the approximation and estimation error of the Transformer architecture in a setting where the target function takes a fixed-length sentence as input and belongs to a mixed smooth Besov space or an anisotropic Besov space. We prove that Transformer networks attain the almost minimax optimal rate by analyzing the Transformer network architecture and the approximation-theoretic properties of the two function spaces. Moreover, we prove that, in suitable situations, Transformer networks can dynamically select the tokens to pay careful attention to. The essence of the proof strategy is as follows. First, for a given target function, we obtain a sum of piecewise polynomial functions that approximates the target function at a certain rate. Next, one constructs a neural network that approximates each piecewise polynomial function. Finally, one constructs a neural network that approximates the sum.
The difficulty lies in the second phase, in which one constructs a neural network approximating a cardinal B-spline function. The proof of this phase is based on the fully connected layers of Yarotsky (2016) that approximate the product xy. However, Transformer networks permit only limited interactions among tokens. In this paper, we show how to construct an attention layer that exchanges values between different tokens. Using the attention layers constructed in this way, we can construct a Transformer network that approximates a cardinal B-spline function. This difficulty is common to previous papers (Yun et al., 2020; Zaheer et al., 2020; Shi et al., 2021), though their strategies for obtaining a piecewise constant approximation differ from ours in their exploitation of function smoothness. Our contributions can be summarized as follows: 1. We consider a situation in which the target function takes a fixed-length sentence as input and belongs to a mixed smooth Besov space or an anisotropic Besov space, in which we show that Transformer networks can avoid the curse of dimensionality and attain the almost minimax optimal rate. We also show that token-wise parameter sharing in Transformer networks reduces the dependence of the network width on the input length. 2. We prove that, in suitable situations, Transformer networks dynamically select the tokens to pay careful attention to. Moreover, we show that the number of tokens to pay careful attention to is determined by the NLP task and the accuracy required. This phenomenon matches the attention mechanism, on which Transformer networks are based.
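As a concrete illustration of the fully connected building block mentioned above, the following sketch implements Yarotsky's ReLU approximation of the square function and, via a polarization identity, of the product xy on [0, 1]; function names are ours, not the paper's, and this is an illustration of the technique rather than the paper's exact construction.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    """Sawtooth g(x) = 2x on [0, 1/2], 2(1 - x) on [1/2, 1], built from ReLUs."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, s):
    """Yarotsky's depth-s ReLU approximation of x**2 on [0, 1]:
    x - sum_{k<=s} g_k(x) / 4**k, where g_k is the k-fold composition of hat."""
    out = np.asarray(x, dtype=float).copy()
    g = out.copy()
    for k in range(1, s + 1):
        g = hat(g)
        out -= g / 4.0 ** k
    return out

def product_approx(x, y, s):
    """Approximate x*y on [0, 1]^2 via the polarization identity
    x*y = 2*((x+y)/2)**2 - x**2/2 - y**2/2, squares replaced by square_approx."""
    return (2 * square_approx((x + y) / 2, s)
            - 0.5 * square_approx(x, s) - 0.5 * square_approx(y, s))

x, y = np.meshgrid(np.linspace(0, 1, 201), np.linspace(0, 1, 201))
err = np.abs(product_approx(x, y, 8) - x * y).max()
print(err)  # the error shrinks by a factor of 4 with each extra sawtooth layer
```

The point of the construction is that s layers of piecewise linear (ReLU-expressible) units give error O(4^{-s}) for the square, hence also for the product; the paper's contribution is to realize such interactions across tokens within the attention mechanism.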

2. NOTATIONS AND PROBLEM SETTINGS

In this section, we define the notations and introduce the problem setting. Throughout this paper, we use the following notations. Let $X$ be a finite set. Then we write $\sharp X$ for the cardinality of $X$. Write $\mathbb{Z}$ for the ring of rational integers and $\mathbb{R}$ for the field of real numbers. Let $\Lambda \in \{\mathbb{Z}, \mathbb{R}\}$ and $a \in \Lambda$. Then we define $\Lambda_{>a} := \{a' \in \Lambda \mid a' > a\}$ and $\Lambda_{\geq a} := \{a' \in \Lambda \mid a' \geq a\}$. Let $\Omega \subseteq \mathbb{R}^d$ be the domain of the functions. For a function $f : \Omega \to \mathbb{R}$, let $\|f\|_p := \|f\|_{L^p(\Omega)} := (\int_\Omega |f|^p \, dx)^{1/p}$ for $0 < p < \infty$ and $\|f\|_\infty := \|f\|_{L^\infty(\Omega)} := \sup_{x \in \Omega} |f(x)|$ for $p = \infty$. For $\alpha \in \mathbb{R}^d$ and $p \in \mathbb{R}_{>0}$, we define $\|\alpha\|_p := (\sum_{i=1}^d |\alpha_i|^p)^{1/p}$, $\|\alpha\|_\infty := \max_{1 \leq i \leq d} |\alpha_i|$, and $\|\alpha\|_0 := \sharp\{i \in \mathbb{Z} \mid 1 \leq i \leq d,\ \alpha_i \neq 0\}$. For $\alpha \in \mathbb{R}^d_{>0}$, we define $\bar{\alpha} := \max_i \alpha_i$, $\underline{\alpha} := \min_i \alpha_i$, and $\tilde{\alpha} := (\sum_{i=1}^d \frac{1}{\alpha_i})^{-1}$. For $\alpha \in \mathbb{Z}^d_{\geq 0}$, we write $D^\alpha f(x) := \frac{\partial^{\|\alpha\|_1}}{\partial^{\alpha_1} x_1 \cdots \partial^{\alpha_d} x_d} f(x)$. We also define some utility functions as follows. Let $x, y \in \mathbb{R}$. Then we write $x_+ := \max(x, 0)$, $x \vee y := \max(x, y)$, $\lfloor x \rfloor := \max\{n \in \mathbb{Z} \mid n \leq x\}$, and $\lceil x \rceil := \min\{n \in \mathbb{Z} \mid n \geq x\}$. In the following subsections, we introduce the function classes for which we develop error bounds, and the set of Transformer networks with given hyper-parameters.

2.1. MIXED SMOOTH BESOV SPACE

In this section, we define the mixed smooth Besov space, one of the function classes that we discuss. To define the mixed smooth Besov space, we first introduce the modulus of smoothness.

Definition 1 (r-th modulus of smoothness). Let $\Omega$ be a measurable subset of $\mathbb{R}^D$, $p \in \mathbb{R}_{>0} \cup \{\infty\}$ and $r \in \mathbb{Z}_{\geq 1}$. For a function $f \in L^p(\Omega)$ and $t \in \mathbb{R}^D_{>0}$, the $r$-th modulus of smoothness of $f$ is defined by $w_{r,p}(f, t) := \sup_{|h_i| \leq t_i} \|\Delta^r_h(f)\|_p$, where
$$\Delta^r_h(f)(x) := \begin{cases} \sum_{j=0}^r \binom{r}{j} (-1)^{r-j} f(x + jh) & (x \in \Omega,\ x + rh \in \Omega), \\ 0 & (\text{otherwise}). \end{cases}$$

Next, based on the modulus of smoothness, we introduce the notion of mixed modulus of smoothness.

Definition 2 (Mixed modulus of smoothness). Let $d \in \mathbb{Z}_{\geq 1}$, $\Omega$ be a measurable subset of $\mathbb{R}^d$, $p \in \mathbb{R}_{>0} \cup \{\infty\}$, $r \in \mathbb{Z}^d_{\geq 1}$, $h \in \mathbb{R}^d$, and $f \in L^p(\Omega)$ a function. Then we define the coordinate difference operator by $\Delta^{r,i}_h(f)(x) := \Delta^{r_i}_{h_i}(f(x_1, \ldots, x_{i-1}, \cdot, x_{i+1}, \ldots, x_d))(x_i)$. Accordingly, for $e \subseteq \{1, \ldots, d\}$, we define the mixed difference operator
$$\Delta^{r,e}_h(f) := \begin{cases} \prod_{i \in e} \Delta^{r,i}_{h}(f) & (e \neq \emptyset), \\ f & (e = \emptyset) \end{cases}$$
(note that, for any $i \neq j$, $\Delta^{r,i}_{h} \circ \Delta^{r,j}_{h} = \Delta^{r,j}_{h} \circ \Delta^{r,i}_{h}$) and the $r$-th mixed modulus of smoothness of $f$ by $w^e_{r,p}(f, t) := \sup_{|h_i| \leq t_i} \|\Delta^{r,e}_h(f)\|_p$.

Finally, based on the mixed modulus of smoothness, the mixed smooth Besov space is defined as in the following definition.

Definition 3 (Mixed smooth Besov space). Let $d \in \mathbb{Z}_{\geq 1}$, $\Omega$ be a measurable subset of $\mathbb{R}^d$, $p, q \in \mathbb{R}_{>0} \cup \{\infty\}$, $\alpha \in \mathbb{R}^d_{>0}$, $r := \lfloor \alpha \rfloor + 1$ (element-wise), and $e \subseteq \{1, \ldots, d\}$. Then, for $e \subseteq \{1, \ldots, d\}$, we define the seminorm $|\cdot|_{MB^{\alpha,e}_{p,q}}$ as follows:
$$|f|_{MB^{\alpha,e}_{p,q}} := \begin{cases} \left( \int_\Omega \left( \Big( \prod_{i \in e} t_i^{-\alpha_i} \Big) w^e_{r,p}(f, t) \right)^q \frac{dt}{\prod_{i \in e} t_i} \right)^{1/q} & (q < \infty), \\ \sup_{t \in \Omega} \left( \prod_{i \in e} t_i^{-\alpha_i} \right) w^e_{r,p}(f, t) & (q = \infty). \end{cases}$$
Note that $|f|_{MB^{\alpha,\emptyset}_{p,q}} = \|f\|_{L^p(\Omega)}$.
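The modulus of smoothness defined above can be estimated numerically on a grid; the following sketch (all names are ours) illustrates that for a smooth function the second-order modulus decays like $t^2$, whereas a function with a kink only attains a slower decay, which is exactly the kind of distinction the Besov seminorms measure.

```python
import numpy as np
from math import comb

def finite_difference(f, x, h, r):
    """r-th forward difference Delta_h^r f(x); set to zero where x + r*h
    leaves the domain [0, 1], as in Definition 1."""
    inside = (x >= 0.0) & (x + r * h <= 1.0)
    vals = sum((-1) ** (r - j) * comb(r, j) * f(x + j * h) for j in range(r + 1))
    return np.where(inside, vals, 0.0)

def modulus_of_smoothness(f, t, r, p, n_grid=2000, n_steps=50):
    """Crude grid estimate of w_{r,p}(f, t) = sup_{|h| <= t} ||Delta_h^r f||_p
    on the domain [0, 1]."""
    x = np.linspace(0.0, 1.0, n_grid)
    best = 0.0
    for h in np.linspace(t / n_steps, t, n_steps):
        d = finite_difference(f, x, h, r)
        best = max(best, np.mean(np.abs(d) ** p) ** (1.0 / p))
    return best

# A C^2 function has w_{2,p}(f, t) = O(t^2); |x - 1/2| only gives a slower rate.
smooth = modulus_of_smoothness(np.sin, 0.01, r=2, p=2)
kink = modulus_of_smoothness(lambda x: np.abs(x - 0.5), 0.01, r=2, p=2)
print(smooth, kink)  # the kinked function has the larger modulus
```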
The norm of the mixed smooth Besov space $MB^\alpha_{p,q}(\Omega)$ can be defined as the sum of the seminorms over the choices of $e$ by $\|f\|_{MB^\alpha_{p,q}} := \sum_{e \subseteq \{1,\ldots,d\}} |f|_{MB^{\alpha,e}_{p,q}}$, and we define $MB^\alpha_{p,q}(\Omega) := \{f \in L^p(\Omega) \mid \|f\|_{MB^\alpha_{p,q}} < \infty\}$ and $MU^\alpha_{p,q}(\Omega) := \{f \in MB^\alpha_{p,q}(\Omega) \mid \|f\|_{MB^\alpha_{p,q}} \leq 1\}$. The mixed smooth Besov space was originally introduced by Schmeisser (1987); Sickel & Ullrich (2009). Various studies have shown that an appropriate estimator for such models can avoid the curse of dimensionality (Meier et al., 2009; Raskutti et al., 2012; Kanagawa et al., 2016; Suzuki et al., 2016). The relation between mixed smooth Besov spaces and ordinary Besov spaces, whose definition can be found in Gine & Nickl (2015); Suzuki (2019), can be informally explained as follows. A mixed smooth Besov space consists of functions whose derivatives $D^\beta f$ are controlled for all multi-indices $\beta$ with the "maximum" order $\max_i \beta_i$ bounded, while an ordinary Besov space consists of functions whose derivatives $D^\beta f$ are controlled for all $\beta$ with the "sum" of the orders $\|\beta\|_1$ bounded. This difference directly affects the rate of convergence of the approximation accuracy (Düng et al. (2016); Suzuki (2019)), but, for this reason, mixed smooth Besov spaces do not include ordinary Besov spaces in general.

2.2. ANISOTROPIC BESOV SPACE

In this section, we define the anisotropic Besov space, the other function class that we discuss.

Definition 4 (Anisotropic Besov space). Let $d \in \mathbb{Z}_{\geq 1}$, $\Omega$ be a measurable subset of $\mathbb{R}^d$, $p, q \in \mathbb{R}_{>0} \cup \{\infty\}$, $\alpha \in \mathbb{R}^d_{>0}$, and $r := \max_i(\lfloor \alpha_i \rfloor) + 1$. Then we define the seminorm $|\cdot|_{AB^\alpha_{p,q}}$ as follows:
$$|f|_{AB^\alpha_{p,q}} := \begin{cases} \left( \sum_{k=0}^\infty \left( 2^k w_{r,p}(f, (2^{-k/\alpha_1}, \ldots, 2^{-k/\alpha_d})) \right)^q \right)^{1/q} & (q < \infty), \\ \sup_{k \geq 0} \, 2^k w_{r,p}(f, (2^{-k/\alpha_1}, \ldots, 2^{-k/\alpha_d})) & (q = \infty). \end{cases}$$
The norm of the anisotropic Besov space $AB^\alpha_{p,q}(\Omega)$ can be defined as $\|f\|_{AB^\alpha_{p,q}} := \|f\|_{L^p} + |f|_{AB^\alpha_{p,q}}$, and we define $AB^\alpha_{p,q}(\Omega) := \{f \in L^p(\Omega) \mid \|f\|_{AB^\alpha_{p,q}} < \infty\}$ and $AU^\alpha_{p,q}(\Omega) := \{f \in AB^\alpha_{p,q}(\Omega) \mid \|f\|_{AB^\alpha_{p,q}} \leq 1\}$. The statistical analysis on anisotropic Besov spaces dates back to Ibragimov & Khas'minskii (1984), who considered the estimation of a density function assumed to lie in an anisotropic Sobolev space with $p \leq 2$. Afterwards, several studies have been conducted from the viewpoint of non-parametric statistics, such as nonlinear kernel estimators (Kerkyacharian et al., 2001) and kernel ridge regression (Hang & Steinwart, 2018). Here, we present some relations between anisotropic Besov spaces and other function classes. First, if $\alpha_1 = \cdots = \alpha_d$, then the anisotropic Besov space reduces to the ordinary (isotropic) Besov space. Next, for $\alpha_0 \in \mathbb{R}_{>0}$ and $m := \lfloor \alpha_0 \rfloor$, the Hölder norm is defined by
$$\|f\|_{C^{\alpha_0}} := \max_{\|\beta\|_1 \leq m} \|\partial^\beta f\|_\infty + \max_{\|\beta\|_1 = m} \sup_{x, y \in \Omega} \frac{|\partial^\beta f(x) - \partial^\beta f(y)|}{\|x - y\|^{\alpha_0 - m}},$$
and the ($\alpha_0$-)Hölder space $C^{\alpha_0}$ is defined as $C^{\alpha_0}(\Omega) := \{f \mid \|f\|_{C^{\alpha_0}} < \infty\}$. Let $p, q \in \mathbb{R}_{>0} \cup \{\infty\}$, $\alpha \in \mathbb{R}^d_{>0}$, and $\alpha_0 \in \mathbb{R}_{>0}$ such that $\tilde{\alpha} > \frac{1}{p}$, and write $\alpha' := (\alpha_0, \ldots, \alpha_0)^\top$. Then, Triebel (2011) shows that $AB^{\alpha'}_{\infty,\infty} = C^{\alpha_0}$ and $AB^\alpha_{p,q} \hookrightarrow C^0$. This result shows that, if the average smoothness is sufficiently large ($\tilde{\alpha} > \frac{1}{p}$), then the functions in $AB^\alpha_{p,q}$ are continuous. However, it can be shown that, if it is small ($\tilde{\alpha} < \frac{1}{p}$), then they are no longer continuous.
Actually, there exist functions in which spikes and jumps appear (see Donoho & Johnstone (1998) for this perspective, from the viewpoint of wavelet analysis).

2.3. TRANSFORMER NETWORKS

In this section, we define the set of Transformer networks with given hyper-parameters such as the number of fully connected layers, the number of Transformer blocks, and the layer width. We denote by $\mathrm{Mat}(\mathbb{R}^d, \mathbb{R}^{d'})$ the set of linear transformations from $\mathbb{R}^d$ to $\mathbb{R}^{d'}$; for any $f : \mathbb{R}^d \to \mathbb{R}^{d'}$, the token-wise application $\Pi(f) : (\mathbb{R}^d)^l \to (\mathbb{R}^{d'})^l$ by $\Pi(f)(x) := (f(x_i))_i$; for any $v, k, q \in (\mathbb{R}^d)^l$, the attention function $\mathrm{Attn} : (\mathbb{R}^d)^l \times (\mathbb{R}^d)^l \times (\mathbb{R}^d)^l \to (\mathbb{R}^d)^l$ by
$$\mathrm{Attn}(v, k, q) := \left( \sum_{j=1}^l v_j \frac{\exp(\langle k_j, q_i \rangle)}{\sum_{k'=1}^l \exp(\langle k_{k'}, q_i \rangle)} \right)_i;$$
$$\mathrm{DAttn}(x; M_K, M_Q, M_V, M_O) := \Pi(M_O) \circ \mathrm{Attn}(\Pi(M_V)(x), \Pi(M_K)(x), \Pi(M_Q)(x));$$
and, for any $PE \in (\mathbb{R}^E)^l$, the concatenation function $\mathrm{Concat}[PE] : (\mathbb{R}^d)^l \to (\mathbb{R}^{d+E})^l$ by $\mathrm{Concat}[PE](x) := ((x_i^\top, PE_i^\top)^\top)_i$. Then we define:
$$\mathrm{FL}(W, S, B) := \left\{ f(x) := x + (M(x_+) + b) \,\middle|\, M \in \mathrm{Mat}(\mathbb{R}^W, \mathbb{R}^W),\ b \in \mathbb{R}^W,\ \|M\|_0 + \|b\|_0 \leq S,\ \|M\|_\infty \vee \|b\|_\infty \leq B \right\},$$
$$\mathrm{FN}(L, W, S, B) := \left\{ f^{(L)} \circ \cdots \circ f^{(1)} \,\middle|\, f^{(l')} \in \mathrm{FL}(W, S_{l'}, B_{l'}),\ \sum_{l'=1}^L S_{l'} \leq S,\ \max_{1 \leq l' \leq L} B_{l'} \leq B \right\},$$
$$\mathrm{AL}(W, H, S, B) := \left\{ f(x) := x + \sum_{h=1}^H \mathrm{DAttn}(x; M^{(h)}_K, M^{(h)}_Q, M^{(h)}_V, M^{(h)}_O) \,\middle|\, M^{(h)}_s \in \mathrm{Mat}(\mathbb{R}^W, \mathbb{R}^W),\ \sum_{h=1}^H \sum_{s \in \{K,Q,V,O\}} \|M^{(h)}_s\|_0 \leq S,\ \max_{1 \leq h \leq H} \max_{s \in \{K,Q,V,O\}} \|M^{(h)}_s\|_\infty \leq B \right\},$$
$$\mathrm{TL}(L, W, H, S, B) := \left\{ \Pi(g) \circ f \,\middle|\, f \in \mathrm{AL}(W, H, S_1, B_1),\ g \in \mathrm{FN}(L, W, S_2, B_2),\ \sum_{i=1}^2 S_i \leq S,\ \max_{i=1,2} B_i \leq B \right\},$$
$$\mathrm{STL}(L, T, W, H, S, B) := \left\{ f^{(T)} \circ \cdots \circ f^{(1)} \,\middle|\, f^{(t)} \in \mathrm{TL}(L_t, W, H, S_t, B_t),\ \sum_{t=1}^T L_t \leq L,\ \sum_{t=1}^T S_t \leq S,\ \max_{1 \leq t \leq T} B_t \leq B \right\},$$
$$\mathrm{TN}(L, T, E, W, H, S, B) := \left\{ \mathrm{Head} \circ f \circ \mathrm{Concat}[PE] \,\middle|\, PE \in (\mathbb{R}^E)^l,\ f \in \mathrm{STL}(L, T, W, H, S_1, B),\ S_1 + \|PE\|_0 \leq S,\ \|PE\|_\infty \leq B \right\}.$$
We incorporate the architecture proposed in Vaswani et al. (2017) into our definition of Transformer networks.
We denote by FL the set of single fully connected layers, by FN the set of stacks of fully connected layers, by AL the set of multi-head attention layers, by TL the set of Transformer blocks, each consisting of a multi-head attention layer and a stack of fully connected layers, by STL the set of stacks of Transformer blocks, and by TN the set of overall Transformer networks with positional encoding. Note that, in order to emphasize the number of interactions among tokens, we define a Transformer block as the composition of a multi-head attention layer with a stack of fully connected layers, not with a single fully connected layer.
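A minimal numpy sketch of the network class defined above may help fix ideas. The residual forms of FL, AL, and TL and the absence of a $1/\sqrt{d}$ attention scaling follow the definitions in this section; the parameter values and scaling constants are our own arbitrary choices, and the Head map is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(x, MK, MQ, MV, MO):
    """One residual attention head as in AL; x has shape (l, W), tokens as rows."""
    k, q, v = x @ MK.T, x @ MQ.T, x @ MV.T
    scores = softmax(q @ k.T, axis=-1)        # row i attends over all tokens j
    return x + (scores @ v) @ MO.T

def feedforward_layer(x, M, b):
    """One residual token-wise layer as in FL: f(x) = x + (M relu(x) + b).
    The same (M, b) is applied to every token (token-wise parameter sharing)."""
    return x + np.maximum(x, 0.0) @ M.T + b

def transformer_block(x, attn_params, ff_params):
    """TL: an attention layer followed by a stack of token-wise FF layers."""
    x = attention_layer(x, *attn_params)
    for M, b in ff_params:
        x = feedforward_layer(x, M, b)
    return x

rng = np.random.default_rng(0)
l, W = 4, 8                                   # input length, network width
x = rng.standard_normal((l, W))
attn = tuple(rng.standard_normal((W, W)) * 0.1 for _ in range(4))
ff = [(rng.standard_normal((W, W)) * 0.1, np.zeros(W)) for _ in range(2)]
y = transformer_block(x, attn, ff)
print(y.shape)  # (4, 8): shape preserved; parameters are shared across positions
```

Note how the feed-forward weights do not depend on the token index, which is the token-wise parameter sharing exploited in the width bounds of Section 3.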

3. APPROXIMATION ERROR ANALYSIS

In this section, we evaluate how well the functions in mixed smooth Besov and anisotropic Besov spaces can be approximated by Transformer networks. To evaluate the accuracy of the deep neural network model in approximating target functions, we first define the worst-case approximation error.

Definition 7 (Worst-case approximation error). Let $d \in \mathbb{Z}_{\geq 1}$, $r \in \mathbb{R}_{>0} \cup \{\infty\}$ and $F, H$ be subsets of measurable functions on $\Omega (\subseteq \mathbb{R}^d)$. Then we define the worst-case approximation error as $R_r(F, H) := \sup_{f^\circ \in H} \inf_{f \in F} \|f^\circ - f\|_{L^r(\Omega)}$. Note that $F$ is the set of functions used for approximation, and $H$ is the set of target functions. Here, we present the results on the approximation ability.

Theorem 1 (Approximation ability for mixed smooth Besov spaces). Suppose that $p, q, r \in \mathbb{R}_{>0} \cup \{\infty\}$, $\alpha \in (\mathbb{R}^d_{>0})^l$, $m \in \mathbb{Z}_{\geq 1}$. Let $\delta := (\frac{1}{p} - \frac{1}{r})_+$ (note that $\delta > 0$ is equivalent to $p < r$) and assume that $\delta < \underline{\alpha}$ and $\bar{\alpha} < \min(m, m - 1 + \frac{1}{p})$. Then, for $K \in \mathbb{Z}_{\geq 1}$, there exist an absolute constant $C \in \mathbb{R}_{>0}$, constants which define hyper-parameters
$$D_{k,d'} := \left(1 + \frac{d'-1}{k}\right)^k \left(1 + \frac{k}{d'-1}\right)^{d'-1}, \quad \eta := \begin{cases} \left(\frac{1}{\min(r,1)} - \frac{1}{q}\right)_+ & (r \leq p), \\ \left(\frac{1}{r} - \frac{1}{q}\right)_+ & (p < r \text{ and } r < \infty), \\ \left(1 - \frac{1}{q}\right)_+ & (p < r \text{ and } r = \infty), \end{cases}$$
$$\nu := \frac{\underline{\alpha} - \delta}{2\delta}, \quad K^* := \lceil K(1 + \nu^{-1}) \rceil, \quad N := \lceil (2 + (1 - 2^{-\nu})^{-1}) 2^K D_{K^*, dl} \rceil, \quad \epsilon := 2^{-\left(\underline{\alpha} + (1 + \nu^{-1})\left(\frac{1}{p} - \underline{\alpha}\right)_+\right) K},$$
$$T_0 := \lceil \log_2 l \rceil, \quad L_1 = C \log \frac{4dl}{\epsilon}, \quad L_2 = 3 + 2 \left\lceil \log_2 \frac{4l \cdot 3^{d \vee m}}{\epsilon\, c_{(d,m)}} \right\rceil + 5 \lceil \log_2(d \vee m) \rceil, \quad L_3 = C \log \frac{6 T_0}{\epsilon}, \quad W_0 := 6dm(m+2) + 2d,$$
and hyper-parameters of the set of Transformer networks
$$T := T_0, \quad E := l, \quad H := 1, \quad L := L_1 + L_2 + T(L_3 + 2) + 1, \quad W := W_0 N + E,$$
$$S := C(Nl + L_1) + L_2 W_0^2 N + C T_0((N + E) + L_3) + N, \quad B \lesssim N^{(1 + \frac{1}{\nu})\left(\left(\frac{1}{p} - \underline{\alpha}\right)_+ \vee 1\right)},$$
such that $R_r(\mathrm{TN}(L, T, E, W, H, S, B), MU^\alpha_{p,q}([0,1]^{dl})) \lesssim 2^{-\underline{\alpha} K} D^\eta_{K,dl}$.

Theorem 2 (Approximation ability for anisotropic Besov spaces). Suppose that $p, q, r \in \mathbb{R}_{>0} \cup \{\infty\}$, $\alpha \in (\mathbb{R}^d_{>0})^l$, $m \in \mathbb{Z}_{\geq 1}$. Let $\delta := (\frac{1}{p} - \frac{1}{r})_+$ (note that $\delta > 0$ is equivalent to $p < r$) and assume that $\delta < \tilde{\alpha}$ and $\bar{\alpha} < \min(m, m - 1 + \frac{1}{p})$.
Then, for $K \in \mathbb{Z}_{\geq 1}$, there exist an absolute constant $C \in \mathbb{R}_{>0}$, constants which define hyper-parameters
$$\nu := \frac{\tilde{\alpha} - \delta}{2\delta}, \quad K^* := \lceil K(1 + \nu^{-1}) \rceil, \quad N := \lceil (2 + (1 - 2^{-\nu})^{-1}) \tilde{N} \rceil, \quad \epsilon := \tilde{N}^{-\left(\tilde{\alpha} + (1 + \nu^{-1})\left(\frac{dl}{\bar{\alpha} p} - \tilde{\alpha}\right)_+\right)},$$
$$T_0 := \lceil \log_2 l \rceil, \quad L_1 = C \log \frac{4dl}{\epsilon}, \quad L_2 = 3 + 2 \left\lceil \log_2 \frac{4l \cdot 3^{d \vee m}}{\epsilon\, c_{(d,m)}} \right\rceil + 5 \lceil \log_2(d \vee m) \rceil, \quad L_3 = C \log \frac{6 T_0}{\epsilon}, \quad W_0 := 6dm(m+2) + 2d,$$
and hyper-parameters of the set of Transformer networks
$$T := T_0, \quad E := l, \quad H := 1, \quad L := L_1 + L_2 + T(L_3 + 2) + 1, \quad W := W_0 N + E,$$
$$S := C(Nl + L_1) + L_2 W_0^2 N + C T_0((N + E) + L_3) + N, \quad B \lesssim \tilde{N}^{(1 + \nu^{-1})\left(\left(\frac{dl}{\bar{\alpha} p} - \tilde{\alpha}\right)_+ \vee \bar{\alpha}\right)},$$
such that $R_r(\mathrm{TN}(L, T, E, W, H, S, B), AU^\alpha_{p,q}([0,1]^{dl})) \lesssim \tilde{N}^{-\tilde{\alpha}}$.

The proofs of these theorems are provided in Appendix D. Note that the upper bound in the inequality of Theorem 1 depends on $d$ and $l$ only through $D_{K,dl}$ (which depends on them only mildly), and the upper bound in that of Theorem 2 does not depend on $d$ or $l$ at all. This means that if the target function is included in these function classes, we can ease the curse of dimensionality. Moreover, thanks to the token-wise parameter sharing in Transformer networks, the width of the network architecture does not depend on the input length $l$, but only on the feature dimension $d$. Thus, our result also shows that the extent to which the network width and the approximation error upper bounds depend on the input length can be relaxed. Hence, Transformer networks are more efficient in network size than fully connected networks (see Suzuki (2019) and Suzuki & Nitanda (2021)). Remark 1. We obtain the above approximation bounds by using the adaptive sampling recovery method developed by Dũng (2011a). The key point of this technique is that, instead of using the whole set of basis functions, we adaptively select a much smaller subset to approximate the target function. If the target function belongs to a mixed smooth or anisotropic Besov space, we can use this adaptive technique. This is why we deal with these variants of Besov spaces in this paper.
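As a sanity check on the combinatorial factor $D_{k,d'}$ appearing in Theorem 1, the following script (names are ours) verifies that $D_{k,d'}$ upper-bounds the binomial coefficient $\binom{k + d' - 1}{d' - 1}$ and grows only sub-exponentially in $k$, so that the $2^{-\underline{\alpha} K}$ factor eventually dominates the error bound.

```python
from math import comb

def D(k, dp):
    """D_{k,d'} := (1 + (d'-1)/k)^k * (1 + k/(d'-1))^(d'-1),
    a standard upper bound on C(k + d' - 1, d' - 1)."""
    return (1 + (dp - 1) / k) ** k * (1 + k / (dp - 1)) ** (dp - 1)

dl = 16  # a hypothetical total input dimension d*l
for K in (5, 10, 20, 40):
    # D_{K,dl} dominates the number of "levels" used at resolution K.
    assert comb(K + dl - 1, dl - 1) <= D(K, dl)

# D_{K,dl} grows only polynomially in K (roughly like K**(dl-1)), so the
# exponentially decaying factor in Theorem 1 wins for large K.
print(D(40, dl) / D(20, dl) < 2.0 ** 20)  # True: growth is sub-exponential in K
```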

4. ESTIMATION ERROR ANALYSIS

In this section, we connect the approximation theory to the estimation error analysis. First, we define the setting of the non-parametric regression model.

Definition 8 (Non-parametric regression model for statistical analysis). Let $f^\circ : ([0,1]^d)^l \to \mathbb{R}$ be a measurable function and $y_i := f^\circ(x_i) + \xi_i$, where $x_i \sim P_X$ with density function $0 \leq p(x) < R$ on $([0,1]^d)^l$, and $\xi_i \sim N(0, \sigma^2)$. We denote by $D_n := (x_i, y_i)_{i=1}^n$ the training data, which are independently and identically distributed. Here, we define the regularized least squares estimator as
$$\hat{f} := \bar{f}, \quad f \in \mathop{\mathrm{argmin}}_{f \in \mathrm{TN}(L,T,E,W,H,S,B)} \sum_{i=1}^n (y_i - f(x_i))^2,$$
where $\bar{f}$ is the clipping of $f$ defined by $\bar{f} := \min\{\max\{f, -F\}, F\}$ for $F > 0$, which can be realized by ReLU units. In practice, it is hard to compute $\hat{f}$ in the definition above exactly. Therefore, there is a large body of research studying how to compute $\hat{f}$ approximately, for example by applying sparse regularization such as $L_1$ regularization or by hyper-parameter search through Bayesian optimization. In this study, we assume that the optimal solution $\hat{f}$ is computable, so that $\hat{f}$ in the definition above is valid. Here, we provide the estimation error rate of deep learning for estimating functions in Besov spaces by using the approximation error bounds given in the previous section.

Theorem 3. Suppose that $p, q \in \mathbb{R}_{>0} \cup \{\infty\}$ and $\alpha \in (\mathbb{R}^d_{>0})^l$. If $f^\circ \in MB^\alpha_{p,q} \cap L^\infty(\Omega)$, $\|f^\circ\|_{MB^\alpha_{p,q}} \leq 1$ and $\|f^\circ\|_{L^\infty} \leq F$, then, letting $(L, T, E, W, H, S, B)$ be as in Theorem 1, we obtain
$$\mathbb{E}_{D_n} \|f^\circ - \hat{f}\|^2_{L^2(P_X)} \lesssim n^{-\frac{2\underline{\alpha}}{2\underline{\alpha}+1}} \log(n)^{\frac{2(dl-1)(\eta + \underline{\alpha}) + 6\underline{\alpha}}{1 + 2\underline{\alpha}}},$$
where $\eta$ is as in the notation of Theorem 1.

Theorem 4. Suppose that $p, q \in \mathbb{R}_{>0} \cup \{\infty\}$ and $\alpha \in (\mathbb{R}^d_{>0})^l$. If $f^\circ \in AB^\alpha_{p,q} \cap L^\infty(\Omega)$, $\|f^\circ\|_{AB^\alpha_{p,q}} \leq 1$ and $\|f^\circ\|_{L^\infty} \leq F$, then, letting $(L, T, E, W, H, S, B)$ be as in Theorem 2, we obtain
$$\mathbb{E}_{D_n} \|f^\circ - \hat{f}\|^2_{L^2(P_X)} \lesssim n^{-\frac{2\tilde{\alpha}}{2\tilde{\alpha}+1}} \log(n)^{\frac{6\tilde{\alpha}}{1 + 2\tilde{\alpha}}}.$$
The proofs are given in Appendix D.
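The clipped least squares estimator of Definition 8 can be sketched concretely. In the following illustration (all names are ours), an ordinary least squares fit over a small feature map stands in for the intractable empirical risk minimization over the Transformer class, and the fitted function is clipped at $F$, exactly as in the definition.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):                      # a smooth target playing the role of f°
    return np.sin(2 * np.pi * x[:, 0]) * x[:, 1]

n, d, F, sigma = 500, 2, 1.0, 0.1
X = rng.uniform(0.0, 1.0, (n, d))   # x_i ~ P_X (uniform as an example)
y = f_true(X) + sigma * rng.standard_normal(n)   # y_i = f°(x_i) + xi_i

# Least squares over a simple feature map (a stand-in for the ERM over TN(...)).
def features(X):
    return np.column_stack([np.ones(len(X)), X, np.sin(2 * np.pi * X),
                            np.cos(2 * np.pi * X), X[:, [0]] * X[:, [1]]])

theta, *_ = np.linalg.lstsq(features(X), y, rcond=None)
f_hat = lambda X: np.clip(features(X) @ theta, -F, F)   # clipping at F

X_test = rng.uniform(0.0, 1.0, (2000, d))
mse = np.mean((f_true(X_test) - f_hat(X_test)) ** 2)
print(mse)
```

The clipping step is what allows the empirical $L^2$-norm to be related to the population $L^2$-norm under the boundedness condition $\|f^\circ\|_\infty \leq F$.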
The condition $\|f^\circ\|_\infty \leq F$ is used to fill the gap between the empirical $L^2$-norm and the population $L^2$-norm. A key factor behind these results is the fact that the complexity of Transformer networks is no higher than that of fully connected networks. By combining this fact with the approximation error analysis in Section 3, the above estimation error bounds follow. Note that the dimensional parameters $d$ and $l$ do not appear in the exponent of $n$ in the upper bounds, but only in the exponent of the $\log(n)$ term. Thus, the risk bounds (Theorems 3 and 4) indicate that the curse of dimensionality can be relaxed in the two variants of Besov spaces: $d$ and $l$ do not appear directly in the exponent of the convergence rate (although they appear in the poly-log term in the mixed smooth case). Instead, the rate is mainly characterized by the smoothness parameters $\underline{\alpha}$ and $\tilde{\alpha}$. This means that the curse of dimensionality is eased by utilizing the smoothness structure of the true function $f^\circ$. Remark 2. According to Suzuki (2019) and Suzuki & Nitanda (2021), the minimax lower bound
$$\inf_{\hat{f}} \sup_{f^\circ \in U} \mathbb{E}_{D_n} \|f^\circ - \hat{f}\|^2_{L^2(P_X)} \gtrsim n^{-\frac{2\tilde{\alpha}}{2\tilde{\alpha}+1}} \log(n)^{\frac{6\tilde{\alpha}}{1 + 2\tilde{\alpha}}}$$
holds in the case of mixed smooth Besov spaces and anisotropic Besov spaces. Thus, combining this with Theorems 3 and 4, we see that Transformer networks attain the almost minimax optimal rate up to a poly-$\log(n)$ factor and, in particular, under the conditions $p < 2$ and $1/2 - 1/q > 0$, attain the almost minimax optimal rate up to a $\log(n)^3$ factor. Thus, Transformer networks have the potential to best fit the target function in either mixed smooth Besov spaces or anisotropic Besov spaces among all estimators. Note that it has already been shown in Suzuki (2019) and Suzuki & Nitanda (2021) that fully connected networks achieve the almost minimax optimal rate. Therefore, by Definition 6, it is intuitively true that Transformer networks also achieve the almost minimax optimal rate.
The important point is that we prove Transformer networks to be more efficient than fully connected networks in a setting where this intuition holds. For instance, the extent to which the network width and the approximation error upper bounds depend on the input length can be relaxed, as we show in Section 3.

5. TOKEN EXTRACTION

In this section, we discuss the token extraction property of Transformer networks. First, we introduce a new function class, a variant of mixed smooth Besov spaces, to express a situation in which Transformer networks dynamically select important tokens from an input sequence.

Definition 9. Let $\Omega, \Omega_i \subseteq \mathbb{R}^d$ with $\Omega = \bigcup_{i=1}^n \Omega_i$, and $\alpha_i \in \mathbb{R}^d_{>0}$, where $\Omega$ and $\Omega_i$ are written as $\prod_{i'} I_{i'}$ with $I_{i'} := [a_{i'}, b_{i'}]$, $[a_{i'}, b_{i'})$, $(a_{i'}, b_{i'}]$, or $(a_{i'}, b_{i'})$. We denote a partition of $\Omega$ by $\pi := (\Omega_i)_i$ and the piecewise smoothness by $\alpha := (\alpha_i)_i$. Then, the norm of the variable mixed smooth Besov space $VB^{\alpha,\pi}_{p,q}(\Omega)$ can be defined as $\|f\|_{VB^{\alpha,\pi}_{p,q}} := \sum_{i=1}^n \|f\|_{MB^{\alpha_i}_{p,q}(\Omega_i)}$, and we define $VB^{\alpha,\pi}_{p,q}(\Omega) := \{f \in L^p(\Omega) \mid \|f\|_{VB^{\alpha,\pi}_{p,q}} < \infty\}$ and $VU^{\alpha,\pi}_{p,q}(\Omega) := \{f \in VB^{\alpha,\pi}_{p,q}(\Omega) \mid \|f\|_{VB^{\alpha,\pi}_{p,q}} \leq 1\}$.

Intuitively, the target function in the variable mixed smooth Besov space changes which directions it regards as important and which as noise, according to the input. For each region $\Omega_i$, the corresponding smoothness parameter $\alpha_i$ decides whether a direction is important or noise. By regarding each direction as a token, we can express that the target function decides, for an input sequence (or a set of input sequences), which tokens to pay attention to. Next, we introduce input quantization masks. Input quantization masks are used to cut off information of masked tokens, in order to express that Transformer networks extract much more information from non-masked tokens and much less from masked tokens.

Definition 10. Let $t, u \in \mathbb{Z}_{\geq 1}$. Then we denote by $Q_{t,u}$ the set of input quantization masks $f$ for which there exist a partition $\pi := (\Omega_k)_k$ and subsets $S_k \subseteq \{1, 2, \ldots, l\}$ with $\sharp S_k \leq t$ such that
$$f(x)_{ij} := \begin{cases} x_{ij} & (x \in \Omega_k,\ j \in S_k), \\ \lceil x_{ij} u \rceil / u & (x \in \Omega_k,\ j \notin S_k). \end{cases}$$
By using this definition of the set of input masks, we define the set of Transformer networks with input masks as $\mathrm{MTN}_{t,u}(L, T, E, W, H, S, B) := \{f \circ q \mid f \in \mathrm{TN}(L, T, E, W, H, S, B),\ q \in Q_{t,u}\}$. Intuitively, masked tokens carry much less information than non-masked tokens because masked tokens are rounded up to multiples of $1/u$. For example, when $u = 2$, $x_{ij} \in (0, 1/2]$ is rounded up to $1/2$, and $x_{ij} \in (1/2, 1]$ is rounded up to $1$. Hence, this round-up quantization cuts off much of the information of the original tokens. Note that the smaller the parameter $u$ is, the more roughly an input value is rounded up (or quantized). The parameter $u$ in the definition is needed for a technical reason: in this paper, we use cardinal B-splines (which are not constant functions but piecewise polynomials) to approximate the target function. By using the definitions above, we can present the main result of this section.

Theorem 5 (Token extraction property of Transformer networks). Let $s \in \mathbb{R}_{> 1 \vee \frac{1}{p}}$, $\pi := (\Omega_k)_k$, $\alpha := (\alpha_k)_k$, and $\sigma := (\sigma_k)_k$, where $\alpha_k \in (\mathbb{R}^d_{>0})^l$ and $\sigma_k$ is a permutation on $\{1, 2, \ldots, l\}$. Moreover, we assume that $r \geq 1$ and $(\alpha_k)_{ij} \geq s^{\sigma_k(j) - 1}$. Then, for $K \in \mathbb{Z}_{\geq 1}$, letting $(L, T, E, W, H, S, B)$ be as in Theorem 1, there exist constants $t := \left\lceil \frac{\log(\frac{1}{p} + K)}{\log s} \right\rceil$ and $u := 2^K$ such that the following estimation holds:
$$R_r(\mathrm{MTN}_{t,u}(L, T, E, W, H, S, B), VU^{\alpha,\pi}_{p,q}([0,1]^{dl})) \lesssim 2^{-K} D^{\eta \vee 1}_{K,dl}.$$
The proof is given in Appendix E. First, we explain the role of variable mixed smooth Besov spaces. Variable mixed smooth Besov spaces can be regarded as the set of target functions which decide, for an input sequence (or a set of input sequences), which tokens to pay attention to. The smoothness parameters $\alpha_i \in (\mathbb{R}^d_{>0})^l$ control this token selection process. For example, let us consider a text classification task.
When classifying the text "FRB is stepping up its battle on inflation" into a finance category, we pay attention to the first word "FRB" and the eighth word "inflation", and, when classifying the text "Moreover, borrowing costs are going sharply higher" into a finance category, we pay attention to the second word "borrowing" and the third word "costs". Therefore, variable mixed smooth Besov spaces can capture how a task decides which tokens to pay attention to. Under the settings of Theorem 5, the smoothness parameters $(\alpha_k)_{ij}$ increase with respect to $\sigma_k(j)$ at an exponential order. This means that the importance of token $j$ for token $i$ decays exponentially under an appropriate permutation $\sigma_k$, which can depend on the input $x$. Then, Theorem 5 shows that the Transformer can detect this input-dependent importance between tokens and achieves an adaptive rate which cannot be obtained by imposing a fixed smoothness over the entire input $x$. Next, we explain the role of input quantization masks. Note that the range of masked token feature values is $\{0, \frac{1}{u}, \frac{2}{u}, \ldots, 1 - \frac{1}{u}, 1\}$, while the range of non-masked token features is $[0, 1]$. Thus, the cardinality of the range of masked token feature values is finite, while that of non-masked token feature values is uncountably infinite. Hence, masked tokens carry much less information than non-masked tokens. Thus, input masks express a situation in which Transformer networks extract much more information from non-masked tokens and much less from masked tokens. As mentioned above, the parameter $u$ in the definition is needed for a technical reason: in this paper we use cardinal B-splines (which are not constant functions but piecewise polynomials) to approximate the target function, and, since piecewise constant functions are needed to approximate cardinal B-splines, input masks need quantization.
Actually, since Okumoto & Suzuki (2022) considered a space of functions which are (possibly infinite) sums of finite products of trigonometric functions, this technical problem did not occur there. Consequently, $\mathrm{MTN}_{t,u}$ in Definition 10 can be regarded as the set of Transformer networks which, for an input sequence (or a set of input sequences), fully exploit the features of at most t tokens of the input sequence. Thus, Theorem 5 shows that, for a general NLP task and a required accuracy, Transformer networks can dynamically select t tokens to pay careful attention to (the value t is determined by the target function representing the NLP task and by the accuracy required). This token selection property matches the attention mechanism, on which Transformer networks are based.
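The round-up quantization of Definition 10 is simple to state in code. The following sketch (function names are ours, with a single region $\Omega_k$ for simplicity) shows how non-selected tokens lose precision while selected tokens pass through unchanged:

```python
import numpy as np

def quantization_mask(x, selected, u):
    """Round token features up to multiples of 1/u, except for the `selected`
    token indices, which are passed through unchanged (Definition 10,
    specialized to a single region Omega_k)."""
    x = np.asarray(x, dtype=float)
    out = np.ceil(x * u) / u           # round-up quantization of every entry
    out[:, selected] = x[:, selected]  # selected tokens keep full precision
    return out

# One feature per token (d = 1), input length l = 4; only token 1 is selected.
x = np.array([[0.30, 0.55, 0.71, 0.99]])
masked = quantization_mask(x, selected=[1], u=2)
print(masked)  # [[0.5, 0.55, 1.0, 1.0]]: only token 1 keeps full information
```

With u = 2 the non-selected entries take values only in {0, 1/2, 1}, matching the finite-range argument given above.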

6. CONCLUSION

This paper investigated the learning ability of Transformer networks when the target function is in mixed smooth Besov spaces or anisotropic Besov spaces. By overcoming the difficulties caused by the limited interactions among tokens, we showed that Transformer networks can adaptively avoid the curse of dimensionality and achieve the minimax optimal rate. Our result also shows that the dependence of the network width on the input length and the approximation error upper bounds can be relaxed thanks to token-wise parameter sharing in Transformer networks. Moreover, we proved that, when the smoothness parameters $\alpha_{ij}$ increase exponentially in a permutation of the token location $j$, Transformer networks dynamically select tokens to pay careful attention to. This phenomenon matches the attention mechanism, and the result suggests that this favorable property derives from the architecture of Transformer networks. Our analyses strongly support, from a theoretical perspective, the reason why Transformer networks have outperformed various natural language processing tasks. This paper did not discuss the optimization aspect of networks; we assumed that the optimal solution of the regularized least squares problem is computable. For future work, it would be interesting to incorporate non-convex optimization techniques into our study.

A CARDINAL B-SPLINES

For a function in a Besov-type space (DeVore & Popov, 1988; DeVore et al., 1993; Dũng, 2011a), we can obtain its B-spline interpolant representation. Thus, the approximation of a function in the various Besov spaces reduces to the approximation of the cardinal B-spline.

Definition 11 (Cardinal B-spline). We define the cardinal B-spline of order $m$ as follows: $N(x) := 1$ if $x \in [0,1]$ and $N(x) := 0$ otherwise, and $N^0(x) := N(x)$, $N^{m+1}(x) := (N^m * N)(x)$, where $(f * g)(x) := \int_{\mathbb{R}} f(x-t)g(t)\,dt$ denotes the convolution of $f$ and $g$. Let $d \in \mathbb{Z}_{\geq 1}$ and $k, j \in \mathbb{Z}^d_{\geq 0}$. Then we define $M^d_{0,0}(x) := \prod_{i=1}^{d} N^m(x_i)$, $M^d_{k,j}(x) := \prod_{i=1}^{d} N^m(2^{k_i} x_i - j_i)$, and $J^d_m(k) := \{-m, -(m-1), \ldots, 2^{k_1}-1, 2^{k_1}\} \times \cdots \times \{-m, -(m-1), \ldots, 2^{k_d}-1, 2^{k_d}\}$.

Lemma 1 (Property of B-splines). $\|N^m\|_{L^\infty} \leq 1$ and, if $m \geq 1$, $N^m$ is 1-Lipschitz. $\|M^d_{0,0}\|_{L^\infty} \leq 1$ and, if $m \geq 1$, $M^d_{0,0}$ is $d$-Lipschitz with respect to $\|\cdot\|_\infty$.

Proof. If $m = 0$, it clearly follows from the definition of $N$ that $\|N^m\|_{L^\infty} \leq 1$. If $m \geq 1$, it follows that $\|N^m\|_{L^\infty} = \sup_{x\in\mathbb{R}} \int_{\mathbb{R}} N^{m-1}(x-t)N(t)\,dt \leq \sup_{x\in\mathbb{R}} \int_{\mathbb{R}} N(t)\,dt = 1$, and $|N^m(x_1) - N^m(x_2)| = \big|\int_{\mathbb{R}} \big(N^{m-1}(x_1-t) - N^{m-1}(x_2-t)\big)N(t)\,dt\big| \leq |x_1 - x_2| \cdot \int_{\mathbb{R}} N(t)\,dt = |x_1 - x_2|$. Thus, it clearly follows that $\|M^d_{0,0}\|_{L^\infty} \leq 1$ and $|M^d_{0,0}(x_1) - M^d_{0,0}(x_2)| \leq \sum_{i=1}^{d} \Big(\prod_{j=1}^{i-1} N^m((x_1)_j)\Big)\,|N^m((x_1)_i) - N^m((x_2)_i)|\,\Big(\prod_{j=i+1}^{d} N^m((x_2)_j)\Big) \leq \sum_{i=1}^{d} |(x_1)_i - (x_2)_i| \leq d\,\|x_1 - x_2\|_\infty$.

Lemma 2 ($L^p$ norm of a linear combination of B-splines). Let $d \in \mathbb{Z}_{\geq 1}$, $k \in \mathbb{Z}^d_{\geq 0}$, and let $f := \sum_{j \in J^d_m(k)} c_j M^d_{k,j}$ be a linear combination of B-splines. Then the $L^p$ norm of the linear combination $f$ satisfies $\|f\|_{L^p} \simeq 2^{-\frac{\|k\|_1}{p}} \big(\sum_{j \in J^d_m(k)} |c_j|^p\big)^{1/p}$. The proof is found in Suzuki (2019).
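Lemma 1 can be checked numerically. The sketch below (function names ours) evaluates $N^m$ via the Cox–de Boor recursion on integer knots, which agrees with the convolution definition above (up to the measure-zero choice of a half-open indicator at the knots), and tests the sup-norm and Lipschitz bounds on a grid:

```python
def bspline(m, x):
    """Cardinal B-spline N^m of order m, supported on [0, m+1].

    Cox-de Boor recursion on integer knots; N^0 is taken as the half-open
    indicator of [0, 1) to avoid double counting at knot points.
    """
    if m == 0:
        return 1.0 if 0.0 <= x < 1.0 else 0.0
    return (x / m) * bspline(m - 1, x) + ((m + 1 - x) / m) * bspline(m - 1, x - 1)
```

For instance, $N^1$ is the hat function peaking at $1$, and $N^2$ peaks at $3/2$ with value $3/4$.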

B SUB-NETWORKS

Here, we present auxiliary lemmas used to construct the sub-networks of which Transformer networks are composed. A key step in showing the approximation accuracy is to construct a ReLU neural network which approximates the cardinal B-spline with high accuracy. By using the technique developed by Yarotsky (2016), we can construct fully connected layers with ReLU activation functions that approximate the cardinal B-spline. By combining the B-spline approximation results with the results in this section, we obtain the optimal approximation error bound for Transformer networks.

Lemma 3 (Approximation of $x^2$). Let $\epsilon \in \mathbb{R}_{>0}$. Then, there exist constants $L_1 := \lceil \log_2 \frac{1}{\epsilon} \rceil$, $W_1 := 4$, $S_1 := 8\lceil \log_2 \frac{1}{\epsilon} \rceil$, $B_1 := 1$, and a neural network $M_1 \in \Phi_2(L_1, W_1, S_1, B_1)$ such that $\sup_{x\in[0,1]} |x^2 - M_1(x)| \leq \epsilon$. Moreover, if $R \in \mathbb{R}_{\geq 1}$, then there exist constants $L_2 := \lceil \log_2 \frac{R^2}{\epsilon} \rceil + 3$, $W_2 := 4$, $S_2 := 8\lceil \log_2 \frac{R^2}{\epsilon} \rceil + 3$, $B_2 := R$, and a neural network $M_2 \in \Phi_2(L_2, W_2, S_2, B_2)$ such that $\sup_{x\in[0,R]} |x^2 - M_2(x)| \leq \epsilon$.

Proof. If $R = 1$, the proof is found in Proposition 2 of Yarotsky (2016). If $R > 1$, we can obtain the desired network by rescaling the input and the output: $x \xrightarrow{\times \frac{1}{R}} \cdot \xrightarrow{M_1} \cdot \xrightarrow{\times R} \cdot \xrightarrow{\times R} \cdot$. The proof strategy for the approximation of the product $xy$ is found in Proposition 3 of Yarotsky (2016).

Lemma 5 (Approximation of cardinal B-spline basis by the ReLU activation). Let $d \in \mathbb{Z}_{\geq 1}$. Then, there exists a constant $c(d,m)$ depending only on $d$ and $m$ such that, for all $\epsilon > 0$, there exist constants $L_0 := 3 + 2\big\lceil \log_2 \frac{3^{d\vee m}}{\epsilon c(d,m)} \big\rceil + 5\lceil \log_2 (d \vee m)\rceil$, $W_0 := 6dm(m+2) + 2d$, $S_0 := L_0 W_0^2$, $B_0 := 2(m+1)^m$, and a neural network $M \in \Phi_2(L_0, W_0, S_0, B_0)$ such that $\|M^d_{0,0} - M\|_{L^\infty(\mathbb{R}^d)} \leq \epsilon$ and $M(x) = 0$ for all $x \notin [0, m+1]^d$. Proof. The proof is found in Lemma 1 of Suzuki (2019).
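Lemma 3 rests on Yarotsky's sawtooth construction: composing a three-ReLU "tooth" with itself and forming a telescoping sum approximates $x^2$ on $[0,1]$ with sup-error $2^{-2m-2}$ after $m$ compositions. A minimal sketch of that construction (function names ours):

```python
def relu(z):
    return max(z, 0.0)

def tooth(z):
    """Hat function g: [0,1] -> [0,1] realized by three ReLU units."""
    return 2 * relu(z) - 4 * relu(z - 0.5) + 2 * relu(z - 1.0)

def approx_square(x, m):
    """f_m(x) = x - sum_{s=1}^m g_s(x)/4^s approximates x^2 on [0, 1];
    the sup-error is 2^(-2m-2) (Yarotsky, 2016, Proposition 2)."""
    out, g = x, x
    for s in range(1, m + 1):
        g = tooth(g)  # g_s = g composed s times
        out -= g / 4 ** s
    return out
```

Each extra composition adds only constant depth and width, which is where the $\log_2 \frac{1}{\epsilon}$ depth in Lemma 3 comes from.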

C PROOF OF THE STATEMENTS OF SECTION 3

Here, we give the technical details behind the approximation bound. We use the so-called sparse grid technique, which Smolyak (1963) introduced to the field of function approximation theory. The key point of this technique is that, instead of the whole regular grid, we put the basis functions on a sparse grid, i.e., a subset of the regular grid with much smaller cardinality. Applications to approximation algorithms were developed by Dũng (1990; 1991; 1992) and Temlyakov (1982; 1993a;b). Afterwards, the sparse grid technique was developed into an optimal adaptive sampling recovery method by Dũng (2011b), and we apply this method to the cardinal B-spline bases. We follow the proof strategy of Suzuki (2019) and Suzuki & Nitanda (2021).

Definition 12. Let $d \in \mathbb{Z}_{\geq 1}$, let $p_k$ be functions for $k \in \mathbb{Z}^d_{\geq 0}$, and let $c_{k,j} \in \mathbb{R}$ for $k \in \mathbb{Z}^d_{\geq 0}$ and $j \in J^d_m(k)$. Then we define a quasi-norm over a set of functions by
$\|(p_k)\|_{b^\alpha_q(L^p)} := \Big(\sum_{k\in\mathbb{Z}^d_{\geq 0}}\big(2^{\langle\alpha,k\rangle}\|p_k\|_p\big)^q\Big)^{1/q}$,
and a quasi-norm over a set of coefficients by
$\|(c_{k,j})\|_{mb^\alpha_{p,q}} := \Big(\sum_{k\in\mathbb{Z}^d_{\geq 0}}\Big(2^{\langle\alpha,k\rangle-\frac{\|k\|_1}{p}}\Big(\sum_{j\in J^d_m(k)}|c_{k,j}|^p\Big)^{1/p}\Big)^q\Big)^{1/q}$.

Theorem 6 (Cardinal B-spline approximation for mixed smooth Besov spaces). Suppose that $p, q, r \in \mathbb{R}_{>0} \cup \{\infty\}$ and $\alpha \in \mathbb{R}^d_{>0}$. Let $\delta := (\frac{1}{p} - \frac{1}{r})_+$ (note that $\delta > 0$ is equivalent to $p < r$) and assume that $m \in \mathbb{Z}_{\geq 1}$, $\delta < \underline{\alpha}$, and $\bar{\alpha} < \min(m, m-1+\frac{1}{p})$.
Then, for any $f \in MB^\alpha_{p,q}$ and $K \in \mathbb{Z}_{\geq 1}$, there exist constants
$\eta := \begin{cases} (\frac{1}{\min(r,1)} - \frac{1}{q})_+ & (r \leq p), \\ (\frac{1}{r} - \frac{1}{q})_+ & (p < r \text{ and } r < \infty), \\ (1 - \frac{1}{q})_+ & (p < r \text{ and } r = \infty), \end{cases}$
$\nu := \frac{\underline{\alpha}-\delta}{2\delta}$, $K^* := K(1+\nu^{-1})$, $n_k := 2^{K-\nu(\|k\|_1-K)}$, and index sets $S(k) \subseteq J^d_m(k)$ with $\sharp(S(k)) = n_k$, and
$R_K(f) := \sum_{k\in\mathbb{Z}^d_{\geq 0},\,\|k\|_1\leq K}\,\sum_{j\in J^d_m(k)} c_{k,j} M^d_{k,j}(x) + \sum_{k\in\mathbb{Z}^d_{\geq 0},\, K<\|k\|_1\leq K^*}\,\sum_{j\in S(k)} c_{k,j} M^d_{k,j}(x)$
such that $\|f - R_K(f)\|_r \lesssim 2^{-\underline{\alpha}K} D^{\eta_{p,q,r}}_{K,d}\|f\|_{MB^\alpha_{p,q}}$, $\sharp E(K) := \sharp\{(k,j)\in\mathbb{Z}^d_{\geq 0}\times\mathbb{Z}^d_{\geq 0}\mid c_{k,j}\neq 0\} \leq \big(2+(1-2^{-\nu})^{-1}\big)2^K D_{K^*,d}$, $k_{\max} := \max_{1\leq i\leq d,\,k\in E(K)} k_i \leq K^*$, and $c_{\max} := \max_{(k,j)\in E(K)}|c_{k,j}| \lesssim 2^{K^*(\frac{1}{p}-\underline{\alpha})_+}\|f\|_{MB^\alpha_{p,q}}$.

Proof of Theorem 6. According to Suzuki (2019) (see also Dũng (2011a)), there exists a collection of maps $(P_k)_{k\in\mathbb{Z}^d_{\geq 0}}$ from $MB^\alpha_{p,q}$ to $MB^\alpha_{p,q}$ such that $\|f\|_{MB^\alpha_{p,q}} \simeq \|(p_k)\|_{b^\alpha_q(L^p)} \simeq \|(c_{k,j})\|_{mb^\alpha_{p,q}}$, where $p_k := P_k(f) = \sum_{j\in J^d_m(k)} c_{k,j} M^d_{k,j}(x)$. (1) The case of $r \leq p$: the assertion can be shown in the same manner as Theorem 3.1 of Dũng (2011a). (2) The case of $p < r$: we need an adaptive approximation method. In the following, we assume $p < r$. For a given $K$, choosing $K^*$ appropriately later, we set $R_K(f) := \sum_{\|k\|_1\leq K} p_k + \sum_{K<\|k\|_1\leq K^*} G_k(p_k)$, where $G_k(p_k) := \sum_{i=1}^{n_k} c_{k,j_i} M^d_{k,j_i}(x)$ and $(c_{k,j_i})_i$ are the coefficients sorted in decreasing order of their absolute value: $|c_{k,j_1}| \geq |c_{k,j_2}| \geq \cdots \geq |c_{k,j_{\sharp J^d_m(k)}}|$. Then, it holds that $\|p_k - G_k(p_k)\|_r \leq \|p_k\|_p\, 2^{\delta\|k\|_1} n_k^{-\delta}$, where $\delta := 1/p - 1/r$ (see also the proofs in Dũng (2011a) and Dũng (2011b)). Here, we set $\nu := \frac{\underline{\alpha}-\delta}{2\delta}$, $K^* := K(1+\nu^{-1})$, and $n_k := 2^{K-\nu(\|k\|_1-K)}$. Then, by Lemma 5.3 of Dũng (2011a), it follows that
$\|f - R_K(f)\|_{L^r}^r \lesssim \sum_{K<\|k\|_1\leq K^*}\big(2^{\delta\|k\|_1} n_k^{-\delta}\|p_k\|_{L^p}\big)^r + \sum_{K^*<\|k\|_1}\big(2^{\delta\|k\|_1}\|p_k\|_{L^p}\big)^r$.
(2-1) The case of $r < \infty$ and $q \leq r$:
$\|f - R_K(f)\|_{L^r}^q \lesssim \Big(\sum_{K<\|k\|_1\leq K^*}\big(2^{\delta\|k\|_1}n_k^{-\delta}\|p_k\|_{L^p}\big)^r + \sum_{K^*<\|k\|_1}\big(2^{\delta\|k\|_1}\|p_k\|_{L^p}\big)^r\Big)^{q/r} \leq \sum_{K<\|k\|_1\leq K^*}\big(2^{\delta\|k\|_1}n_k^{-\delta}\|p_k\|_{L^p}\big)^q + \sum_{K^*<\|k\|_1}\big(2^{\delta\|k\|_1}\|p_k\|_{L^p}\big)^q$ (since $\frac{q}{r} \leq 1$)
$\leq 2^{-\underline{\alpha}Kq}\sum_{K<\|k\|_1\leq K^*} 2^{-(\underline{\alpha}-\delta-\delta\nu)(\|k\|_1-K)q}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^q + 2^{-q(\underline{\alpha}-\delta)K^*}\sum_{K^*<\|k\|_1}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^q \lesssim 2^{-\underline{\alpha}Kq}\|f\|^q_{MB^\alpha_{p,q}}$.

(2-2) The case of $r < \infty$ and $q > r$: since $\frac{r}{q} + \frac{q-r}{q} = 1$, applying Hölder's inequality yields
$\|f - R_K(f)\|^r_{L^r} \lesssim \sum_{K<\|k\|_1\leq K^*}\big(2^{\delta\|k\|_1}n_k^{-\delta}\|p_k\|_{L^p}\big)^r + \sum_{K^*<\|k\|_1}\big(2^{\delta\|k\|_1}\|p_k\|_{L^p}\big)^r$
$\lesssim 2^{-\underline{\alpha}Kr}\Big(\sum_{K<\|k\|_1\leq K^*} 2^{-(\underline{\alpha}-\delta-\delta\nu)(\|k\|_1-K)r}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^r + \sum_{K^*<\|k\|_1} 2^{-(\underline{\alpha}-\delta)(\|k\|_1-K^*)r}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^r\Big)$
$\leq 2^{-\underline{\alpha}Kr}\Big(\sum_{K<\|k\|_1\leq K^*}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^q + \sum_{K^*<\|k\|_1}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^q\Big)^{r/q}\Big(\sum_{K<\|k\|_1\leq K^*}\big(2^{-(\underline{\alpha}-\delta-\delta\nu)(\|k\|_1-K)}\big)^{\frac{qr}{q-r}} + \sum_{K^*<\|k\|_1}\big(2^{-(\underline{\alpha}-\delta)(\|k\|_1-K^*)}\big)^{\frac{qr}{q-r}}\Big)^{\frac{q-r}{q}} \lesssim 2^{-\underline{\alpha}Kr}\|f\|^r_{MB^\alpha_{p,q}} D^{r(\frac{1}{r}-\frac{1}{q})}_{K,d}$.

(2-3) The case of $r = \infty$: we can carry out the same analysis as in the case $q > r$. Since $\frac{1}{q} + \frac{1}{q/(q-1)} = 1$, applying Hölder's inequality yields
$\|f - R_K(f)\|_{L^\infty} \lesssim \sum_{K<\|k\|_1\leq K^*} 2^{\delta\|k\|_1}n_k^{-\delta}\|p_k\|_{L^p} + \sum_{K^*<\|k\|_1} 2^{\delta\|k\|_1}\|p_k\|_{L^p}$
$\lesssim 2^{-\underline{\alpha}K}\Big(\sum_{K<\|k\|_1\leq K^*} 2^{-(\underline{\alpha}-\delta-\delta\nu)(\|k\|_1-K)}\,2^{\langle\alpha,k\rangle}\|p_k\|_{L^p} + \sum_{K^*<\|k\|_1} 2^{-(\underline{\alpha}-\delta)(\|k\|_1-K^*)}\,2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\Big)$
$\leq 2^{-\underline{\alpha}K}\Big(\sum_{K<\|k\|_1\leq K^*}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^q + \sum_{K^*<\|k\|_1}\big(2^{\langle\alpha,k\rangle}\|p_k\|_{L^p}\big)^q\Big)^{1/q}\Big(\sum_{K<\|k\|_1\leq K^*}\big(2^{-(\underline{\alpha}-\delta-\delta\nu)(\|k\|_1-K)}\big)^{\frac{q}{q-1}} + \sum_{K^*<\|k\|_1}\big(2^{-(\underline{\alpha}-\delta)(\|k\|_1-K^*)}\big)^{\frac{q}{q-1}}\Big)^{\frac{q-1}{q}} \lesssim 2^{-\underline{\alpha}K}\|f\|_{MB^\alpha_{p,q}} D^{1-\frac{1}{q}}_{K,d}$.

Estimation of $\sharp E(K)$, $k_{\max}$, and $c_{\max}$: first, we estimate the cardinality of $E(K)$.
It follows from an easy calculation that
$\sharp E(K) = \sum_{\kappa=0}^{K} 2^\kappa\binom{\kappa+d-1}{d-1} + \sum_{k:\,K<\|k\|_1\leq K^*} n_k \leq 2^{K+1}\binom{K+d-1}{d-1} + \sum_{K<\kappa\leq K^*} 2^{K-\nu(\kappa-K)}\binom{\kappa+d-1}{d-1} \leq 2^{K+1}D_{K,d} + 2^K(1-2^{-\nu})^{-1}D_{K^*,d} \leq \big(2+(1-2^{-\nu})^{-1}\big)2^K D_{K^*,d}$.
Next, it clearly follows that $k_{\max} \leq \|k\|_1 \leq K^*$. Finally, we estimate $c_{\max}$. Since the inequality $2^{(\underline{\alpha}-\frac{1}{p})\|k\|_1}|c_{k,j}| \leq 2^{\langle\alpha,k\rangle-\frac{\|k\|_1}{p}}|c_{k,j}| \lesssim \|f\|_{MB^\alpha_{p,q}}$ holds, it follows that $c_{\max} = \max_{(k,j)\in E(K)}|c_{k,j}| \lesssim 2^{(\frac{1}{p}-\underline{\alpha})\|k\|_1}\|f\|_{MB^\alpha_{p,q}} \leq 2^{K^*(\frac{1}{p}-\underline{\alpha})_+}\|f\|_{MB^\alpha_{p,q}}$. This completes the proof.

Theorem 7 (Cardinal B-spline approximation for anisotropic Besov spaces). Suppose that $p, q, r \in \mathbb{R}_{>0}\cup\{\infty\}$ and $\alpha \in (\mathbb{R}^d_{>0})^l$. Let $\delta := (\frac{1}{p}-\frac{1}{r})_+$ (note that $\delta > 0$ is equivalent to $p < r$) and assume that $m \in \mathbb{Z}_{>0}$, $\delta < \underline{\alpha}$, and $\bar{\alpha} < \min(m, m-1+\frac{1}{p})$. Then, for any $f \in AB^\alpha_{p,q}$ and $K \in \mathbb{Z}_{\geq 1}$, there exist constants $\nu := \frac{\tilde\alpha-\delta}{2\delta}$, $\tilde N := 2^{\|K\|_\alpha}$, $n_k := 2^{\|K\|_\alpha - \nu(\|k\|_\alpha - \|K\|_\alpha)}$, $K^* := K(1+\nu^{-1})$, and $S(k) \subseteq J^d_m(k)$ with $\sharp(S(k)) = 2^{K-\nu(\|k\|_1-K^*)}$, where $\|k\|_\alpha := \sum_{i=1}^{d}\frac{k_i\tilde\alpha}{\alpha_i}$, and $R_K(f) := \sum_{k=0}^{K}\sum_{j\in J^d(k)} c_{k,j}M^d_{k,j}(x) + \sum_{k=K+1}^{K^*}\sum_{j\in S(k)} c_{k,j}M^d_{k,j}(x)$ such that $\|f - R_K(f)\|_r \lesssim \tilde N^{-\tilde\alpha}\|f\|_{AB^\alpha_{p,q}}$, $\sharp E(K) := \sharp\{(k,j)\in\mathbb{Z}^d_{\geq 0}\times\mathbb{Z}^d_{\geq 0}\mid c_{k,j}\neq 0\} \leq \big(2+(1-2^{-\nu})^{-1}\big)\tilde N$, $k_{\max} := \max_{1\leq i\leq d,\,k\in E(K)} k_i \leq K^*$, and $c_{\max} \lesssim 2^{K^*(\frac{d}{\bar{\alpha}p}-\underline{\alpha})_+}\|f\|_{AB^\alpha_{p,q}}$. Moreover, it follows that $2^K \leq \tilde N$.

Proof. For the existence of $R_K(f)$, see the proof in Suzuki & Nitanda (2021). Estimation of $\sharp E(K)$, $k_{\max}$, and $c_{\max}$: first, we estimate the cardinality of $E(K)$. It follows from an easy calculation that $\sharp E(K) = \sum_{k=0}^{K} 2^k + \sum_{k:\,K<\|k\|_1\leq K^*} n_k \leq 2^{K+1} + \sum_{K<\kappa\leq K^*} 2^{K-\nu(\kappa-K)} \leq 2^{K+1} + 2^K(1-2^{-\nu})^{-1} \leq \big(2+(1-2^{-\nu})^{-1}\big)\tilde N$. Next, since it clearly follows that $k_{\max} \leq K^*$, we estimate $c_{\max}$. Since $2^{k\tilde\alpha - \sum_{i=1}^{d}\frac{\lfloor k\tilde\alpha/\alpha_i\rfloor}{p}}|c_{k,j}| \lesssim \|f\|_{AB^\alpha_{p,q}}$, it follows that $c_{\max} \lesssim 2^{k\big(\sum_{i=1}^{d}\frac{\lfloor k\tilde\alpha/\alpha_i\rfloor}{kp}-\underline{\alpha}\big)_+}\|f\|_{AB^\alpha_{p,q}} \leq 2^{K^*(\frac{d}{\bar{\alpha}p}-\underline{\alpha})_+}\|f\|_{AB^\alpha_{p,q}}$.
Finally, it clearly follows that $2^K \leq 2^{\|K\|_\alpha} \leq \tilde N$. This completes the proof.

By using the results of Theorem 6 and Theorem 7, we can prove Theorem 1, which shows the approximation ability of Transformer networks when the target function is in a mixed smooth Besov space or an anisotropic Besov space.

Proof of Theorem 1. Let $g := \sum_{n=1}^{N} c_n M^{dl}_{k_n,j_n}(x)$, and let $\tilde M_{\epsilon,R}$ denote the multiplication network satisfying $\sup_{x,y\in[0,R]}|xy - \tilde M_{\epsilon,R}((x,y))| \leq \epsilon$.

Positional encoding and B-spline coefficients: let $\varepsilon_1 := \frac{\varepsilon}{4dlc_{\max}}$, $PE := I$, $K := (k_1^\top \cdots k_N^\top)^\top$, $J := (j_1^\top \cdots j_N^\top)^\top$, $M_{1,1} := \begin{pmatrix} I_d & O \\ O_d & K \\ O & I \end{pmatrix}$, and $M_{1,2} := \begin{pmatrix} I_{Nd} & -J \\ O & I \end{pmatrix}$. Let $M_1 := M_{1,2}\circ\tilde M_{\varepsilon_1,2^{k_{\max}}}\circ M_{1,1}$ and $\tilde M_1 := \Pi(M_1)\circ\Pi(M)\circ\mathrm{Concat}[PE]$. Then, for $k, j \in \mathbb{Z}^d_{\geq 0}$ and $x \in \mathbb{R}^d$, it follows that $\big\|\big(2^{k_1}x-j_1, \ldots, 2^{k_h}x-j_h\big)^\top - \tilde M_1(x)\big\|_{L^\infty} \leq \frac{\varepsilon}{4dlc_{\max}}$, where $2^k x - j := (2^{k_1}x_1-j_1, \ldots, 2^{k_d}x_d-j_d)^\top$.

Token-wise B-splines: let $y_1,\ldots,y_H \in \mathbb{R}^d$ and $\varepsilon_2 := \frac{\varepsilon}{4lc_{\max}}$. It follows from Lemma 5 that there exist constants $L_{2,\varepsilon_2} := 3 + 2\big\lceil\log_2\frac{3^{d\vee m}}{\varepsilon_2 c(d,m)}\big\rceil + 5\lceil\log_2(d\vee m)\rceil$, $W_{2,\varepsilon_2} := 6dm(m+2)+2d$, $S_{2,\varepsilon_2} := L_{2,\varepsilon_2}W_{2,\varepsilon_2}^2$, $B_{2,\varepsilon_2} := 2(m+1)^m$, and a neural network $M_{2,1} \in \Phi_2(L_{2,\varepsilon_2},W_{2,\varepsilon_2},S_{2,\varepsilon_2},B_{2,\varepsilon_2})$ such that $\|M^d_{0,0}-M_{2,1}\|_{L^\infty(\mathbb{R}^d)} \leq \varepsilon_2$. Then we define $M_2\big((y_1^\top,\ldots,y_H^\top,e^\top)^\top\big) := \big(M_{2,1}(y_1)^\top,\ldots,M_{2,1}(y_H)^\top,e^\top\big)^\top$ and $\tilde M_2 := \Pi(M_2)\circ\tilde M_1$. Since it holds from Lemma 1 that $N^m$ is 1-Lipschitz and $\|N^m\|_{L^\infty} \leq 1$, it follows that the $H\times l$ matrix whose $(h,i)$ entry is $M^d_{(k_h)_i,(j_h)_i}(x_i)$ satisfies
$\big\|\big(M^d_{(k_h)_i,(j_h)_i}(x_i)\big)_{h,i} - \tilde M_2(x)\big\|_{L^\infty(([0,1]^d)^l)} \leq \sup_{1\leq h\leq H}\big\|M^d_{0,0}(2^{k_h}x-j_h) - M^d_{0,0}(\tilde M_{1,h}(x))\big\|_{L^\infty(([0,1]^d)^l)} + \sup_{1\leq h\leq H}\big\|M^d_{0,0}(\tilde M_{1,h}(x)) - M_{2,1}(\tilde M_{1,h}(x))\big\|_{L^\infty(\mathbb{R}^d)} \leq \frac{\varepsilon}{4lc_{\max}} + \frac{\varepsilon}{4lc_{\max}} \leq \frac{\varepsilon}{2lc_{\max}}$.

Multiplication between tokens: let $T_3 := \lceil\log_2 l\rceil$, $\varepsilon_3 := \frac{\varepsilon}{6T_3 c_{\max}}$, $\gamma := \log\frac{l}{\varepsilon_3}$, and $\delta := \frac{e^\gamma}{(l-1)+e^\gamma}$.
Note that 1 ≥ δ = e γ (l -1) + e γ ≥ 1 -le -γ = 1 -ε 3 . W 3,1 :=   -I O -I O 1 o • • • o 1 • • • O I   , M 3,1 (x) := x + W 3,1 ((x) + ) If all the elements x ij of x i are 0 ≤ x ij ≤ 1, then it follows from easy calculations that M 3,1     x 1 0 x 2 0 • • • e 1 e 2 • • •     :=   1 -x 1 -x 1 -x 2 1 -x 2 • • • e 1 e 2 • • •   , (M 3,1 • M 3,1 )     x 1 0 x 2 0 • • • e 1 e 2 • • •     :=   x 1 0 0 x 2 • • • e 1 e 2 • • •   . Let H := 1, M K :=   O    B B • • • B B • • • . . . . . . . . .       , M Q = (O I) , M V = I O O O , M O := I where B := γ 0 0 1 0 . Since

$\Pi(M^{(1)}_V)(x) = \begin{pmatrix} x_1 & x_2 & \cdots \\ 0 & 0 & \cdots \end{pmatrix}$, $\Pi(M^{(1)}_K)(x) = \gamma\begin{pmatrix} B & O & \cdots \\ O & B & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$, and $\Pi(M^{(1)}_Q)(x) = I$, it follows that
$\mathrm{Attn}\big(\Pi(M^{(1)}_V)(x), \Pi(M^{(1)}_K)(x), \Pi(M^{(1)}_Q)(x)\big) = \delta\begin{pmatrix} x_2 & \frac{1}{l}\sum_{i=1}^{l}x_i & x_4 & \frac{1}{l}\sum_{i=1}^{l}x_i & \cdots \\ 0 & 0 & 0 & 0 & \cdots \end{pmatrix} + R$, where $R = \begin{pmatrix} \frac{1-\delta}{l-1}\big(\sum_{i}x_i - x_2\big) & 0 & \frac{1-\delta}{l-1}\big(\sum_{i}x_i - x_4\big) & 0 & \cdots \\ 0 & 0 & 0 & 0 & \cdots \end{pmatrix}$.
Thus, it holds that
$M_{3,2}\begin{pmatrix} x_1 & 0 & 0 & x_2 & \cdots \\ e_1 & e_2 & \cdots \end{pmatrix} = \begin{pmatrix} x_1 & 0 & 0 & x_2 & \cdots \\ e_1 & e_2 & \cdots \end{pmatrix} + \Pi(M^{(1)}_O)\big(\mathrm{Attn}(\Pi(M^{(1)}_V)(x),\Pi(M^{(1)}_K)(x),\Pi(M^{(1)}_Q)(x))\big) = \begin{pmatrix} x_1 & \delta x_{2*} & x_3 & \delta x_{4*} & \cdots \\ 0 & 0 & 0 & 0 & \cdots \end{pmatrix} + R$.
Let $M_{3,3} := \tilde M_{\varepsilon_3,1}$ and $M^1_3 := \Pi(M_{3,3})\circ M_{3,2}\circ\Pi(M_{3,1}\circ M_{3,1})$. We denote by $m_{h,i}$ the $i$-th element of $M^d_{0,0}(2^{k_h}x-j_h)$, by $M_{2,h,i}$ the $i$-th element of $\tilde M_2(x)$, and by $M_{3,h,i}$ the $i$-th element of $M^1_3\circ\tilde M_2(x)$. Then, the following inequality holds:
$\sup_{x\in[0,1]^{dl}}|m_{h,2i-1}(x)m_{h,2i}(x) - M_{3,h,2i-1}(x)| \leq \sup_{1\leq h\leq H}|m_{h,2i-1}(x)m_{h,2i}(x) - M_{2,h,2i-1}(x)m_{h,2i}(x)| + \sup_{1\leq h\leq H}|M_{2,h,2i-1}(x)m_{h,2i}(x) - M_{2,h,2i-1}(x)M_{2,h,2i}(x)| + \sup_{1\leq h\leq H}|M_{2,h,2i-1}(x)M_{2,h,2i}(x) - \delta M_{2,h,2i-1}(x)M_{2,h,2i}(x)| + \sup_{1\leq h\leq H}|\delta M_{2,h,2i-1}(x)M_{2,h,2i}(x) - M_{3,h,2i-1}(x)| \leq \sup_{1\leq h\leq H}|m_{h,2i-1}(x) - M_{2,h,2i-1}(x)| + \sup_{1\leq h\leq H}|m_{h,2i}(x) - M_{2,h,2i}(x)| + (1-\delta) + \big(\varepsilon_3 + (1-\delta)\big) \leq \frac{2}{l}\cdot\frac{\varepsilon}{2c_{\max}} + \frac{1}{T_3}\cdot\frac{\varepsilon}{2c_{\max}}$.
Next, in the same way as $M^1_3$, we can sequentially construct $M^2_3, \ldots, M^{T_3}_3$ (these constructions are carried out by replacing $B$ with the correspondingly shifted matrices), and define $M_3 := M^{T_3}_3\circ\cdots\circ M^1_3$ and $\tilde M_3 := M_3\circ\tilde M_2$ such that
$\big\|M^{dl}_{k,j}(x) - \tilde M_3(x)\big\|_{L^\infty(([0,1]^d)^l)} = \Big\|\prod_{i=1}^{l} m_{h,i}(x) - \tilde M_3(x)\Big\|_{L^\infty(([0,1]^d)^l)} \leq \frac{\varepsilon}{c_{\max}}$.
Linear combination of B-splines: let $M' := (c_1 \cdots c_{N'})$. Finally, we define an approximate Transformer network $T[g]$ of $g$ by $T[g] := \mathrm{Head}\circ\Pi(M')\circ\tilde M_3$.
Then, it follows from the estimations above that $\sup_{x\in[0,1]^{dl}}|g(x) - T[g](x)| \leq \varepsilon$.

Estimation of the bounds of the hyper-parameters: let $f \in MU^\alpha_{p,q}$ (resp. $AU^\alpha_{p,q}$). Then, it follows from Theorem 6 (resp. Theorem 7) that there exist constants $D := D_{K,dl}$ (resp. $1$), $D^* := D_{K^*,dl}$ (resp. $1$), $N := 2^K$ (resp. $\tilde N$), $\tilde\alpha := \underline{\alpha}$ (resp. $\tilde\alpha$), $N^* := \big(2+(1-2^{-\nu})^{-1}\big)ND^*$, $\gamma := (\frac{1}{p}-\underline{\alpha})_+$ (resp. $(\frac{dl}{\bar{\alpha}p}-\underline{\alpha})_+$), $\eta := (\frac{1}{\min(r,1)}-\frac{1}{q})_+$ if $r \leq p$, $(\frac{1}{r}-\frac{1}{q})_+$ if $p < r$ and $r < \infty$, $(1-\frac{1}{q})_+$ if $p < r$ and $r = \infty$, and an approximation function $R_K(f)(x) := \sum_{n=1}^{N} c_n M^{dl}_{k_n,j_n}(x)$ such that $\|f - R_K(f)\|_r \lesssim N^{-\tilde\alpha}D^\eta$, $N \leq N^*$, $k_{\max} := \max_{1\leq n\leq N} k_n \leq K^*$, $c_{\max} := \max_{1\leq n\leq N} c_n \lesssim 2^{K^*\gamma}$, and $2^K \leq N$. Let $\varepsilon := N^{-\tilde\alpha}$. Then, from the above estimations, it immediately follows that $\|f - T[R_K(f)]\|_r \lesssim \|f - R_K(f)\|_r + \|R_K(f) - T[R_K(f)]\|_r \lesssim N^{-\tilde\alpha}D^\eta$. Second, it follows that there exists an absolute constant $C \in \mathbb{R}_{>0}$ such that the following holds. Third, we define the constants which determine the hyper-parameters as follows: $\epsilon := N^{-(\tilde\alpha+(1+\nu^{-1})\gamma)}$, $T_0 := \lceil\log_2 l\rceil$, $L_1 = C\log\frac{4dl}{\epsilon}$, $L_2 = 3+2\big\lceil\log_2\frac{4l\cdot 3^{d\vee m}}{\epsilon c(d,m)}\big\rceil+5\lceil\log_2(d\vee m)\rceil$, $L_3 = C\log\frac{6T_0}{\epsilon}$, $W_0 := 6dm(m+2)+2d$, and $\zeta := 1$ (resp. $\bar{\alpha}$). Note that $\epsilon \lesssim \frac{\varepsilon}{c_{\max}}$. Finally, we estimate the hyper-parameters of the Transformer network. Because of the above estimations, it immediately follows that $T = T_0$, $E = l$, $H = 1$, $L = L_1 + L_2 + T(L_3+2) + 1$, $W = W_0 N^* + E$, $S \leq C(N^*l + L_1) + L_2 W_0^2 N^* + CT\big((N^*+E)+L_3\big) + N^*$, and $B \lesssim N^{(1+\nu^{-1})(\gamma\vee\zeta)}$. This completes the proof.
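The cardinality bounds on $E(K)$ in Theorems 6 and 7, which drive the network width above, reflect the sparse-grid effect: summing $2^{\|k\|_1}$ basis functions over levels $\|k\|_1 \leq K$ gives roughly $2^K K^{d-1}$ terms instead of the $2^{Kd}$ of a full tensor grid. A quick numerical check (function names ours; the boundary index shifts of $J^d_m(k)$ are ignored):

```python
from math import comb

def sparse_grid_size(K, d):
    """Count pairs (k, j) with ||k||_1 <= K, charging 2^||k||_1 indices j per level k.
    There are comb(kappa + d - 1, d - 1) levels k with ||k||_1 = kappa."""
    return sum(2 ** kappa * comb(kappa + d - 1, d - 1) for kappa in range(K + 1))

def full_grid_size(K, d):
    """Full tensor-product grid with 2^K points per coordinate direction."""
    return 2 ** (K * d)
```

For $K = 10$ and $d = 3$, the sparse grid needs about $1.1 \times 10^5$ basis functions versus $2^{30} \approx 10^9$ for the full grid.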

D PROOF OF THE STATEMENTS OF SECTION 4

To prove Theorem 3, we define the covering number with respect to a given metric space.

Definition 13 (Covering number). Let $\epsilon \in \mathbb{R}_{>0}$, let $(V,d)$ be a metric space, and let $F$ be a subset of $V$. Then, we define the covering number $N(\epsilon, F, (V,d))$ of $F$ with respect to $(V,d)$ as follows: $N(\epsilon, F, (V,d)) := \min\big\{N \mid N = \sharp(K),\ K \subseteq F,\ F \subseteq \bigcup_{f_i\in K}\{f \in V \mid d(f, f_i) \leq \epsilon\}\big\}$.

Before proving Theorem 3, we have to estimate the covering number of the set of Transformer networks $\mathrm{TN}(L,T,E,W,H,S,B)$ with respect to $\|\cdot\|_{L^\infty}$.

Lemma 6 (Covering number estimation of the set of Transformer networks). The covering number of $\mathrm{TN}(L,T,E,W,H,S,B)$ can be bounded by
$\log N(\delta, \mathrm{TN}(L,T,E,W,H,S,B), \|\cdot\|_{L^\infty}) \leq S\log\big(4\delta^{-1}(L+T+1)(W+1)^{2L+2T}(B+1)^{L+2T+1}H^T\big)$.

Proof. First, we define $F[M,b](x) := x + M(x)_+ + b$. Then, the following estimations hold:
$\|F[M,b](x)\|_{L^\infty} \leq \|x\|_\infty + \max_j\|M_{j,:}\|_1\|x\|_\infty + |b| \leq (WB+1)\|x\|_\infty + B \leq (W+1)(B+1)(\|x\|_\infty\vee 1)$,
$\|F[M,b](x_1) - F[M,b](x_2)\|_{L^\infty} \leq \|x_1-x_2\|_\infty + \max_j\|M_{j,:}\|_1\|(x_1)_+-(x_2)_+\|_\infty \leq (WB+1)\|x_1-x_2\|_\infty$,
and, if $\delta' := \|M-M'\|_\infty\vee\|b-b'\|_\infty$, then it follows that $\|F[M,b](x) - F[M',b'](x)\|_{L^\infty} \leq \delta'(W\|x\|_\infty+1)$.
Next, we define $A[M^h_*](x) := x + \sum_{h=1}^{H}\Pi(M^h_O)\big(\mathrm{Attn}(\Pi(M^h_V)(x),\Pi(M^h_K)(x),\Pi(M^h_Q)(x))\big)$. Then, the following estimations hold:
$\|A[M^h_*](x)\|_{L^\infty} \leq \|x\|_\infty + H\max_j\|(M^h_O)_{j,:}\|_1\max_j\|(M^h_V)_{j,:}\|_1\|x\|_\infty \leq (HW^2B^2+1)\|x\|_\infty$,
$\|A[M^h_*](x_1) - A[M^h_*](x_2)\|_{L^\infty} \leq \|x_1-x_2\|_\infty + HW^2B^2\|x_1-x_2\|_\infty \leq (HW^2B^2+1)\|x_1-x_2\|_\infty$,
and, if $\delta' := \max_{1\leq h\leq H,\,*\in\{O,V,K,Q\}}\|M^h_* - (M')^h_*\|_\infty$, then it follows that $\|A[M^h_*](x) - A[(M')^h_*](x)\|_{L^\infty} \leq 2\delta' HW^2B\|x\|_\infty$.
Next, we define $F_k(x) := (L_k\circ\cdots\circ L_0)(x)$, $F'_k(x) := (L'_k\circ\cdots\circ L'_0)(x)$, $B_k(x) := (L_{L+T}\circ\cdots\circ L_k)(x)$, and $B'_k(x) := (L'_{L+T}\circ\cdots\circ L'_k)(x)$, where, for $k = 0$, $L_k(x) := \mathrm{Concat}[PE](x)$ and $L'_k(x) := \mathrm{Concat}[PE'](x)$, and, for $k \geq 1$, either one of the following cases is true: 1. $L_k(x) := \Pi(F[M_k, b_k])(x)$ and $L'_k(x) := \Pi(F[M'_k, b'_k])(x)$; 2. $L_k(x) := A[(M_k)^h_*](x)$ and $L'_k(x) := A[(M'_k)^h_*](x)$. Moreover, for $k \geq 1$, we denote $f_k := \sharp\{i\in\mathbb{Z}\mid 1\leq i\leq k,\ L_i(x) := A[(M_i)^h_*](x)\}$, $l_k := \sharp\{i\in\mathbb{Z}\mid i = k,\ L_i(x) := A[(M_i)^h_*](x)\}$, and $b_k := \sharp\{i\in\mathbb{Z}\mid k\leq i\leq L+T,\ L_i(x) := A[(M_i)^h_*](x)\}$. If $f = F_{L+T}(x)$ and $f' = F'_{L+T}(x)$ with $f, f' \in \mathrm{TN}(L,T,E,W,H,S,B)$, and $\delta' := \max_{1\leq k\leq L+T}\|PE-PE'\|_\infty\vee\|M_k-M'_k\|_\infty\vee\|b_k-b'_k\|_\infty\vee\max_{1\leq h\leq H,\,*\in\{O,V,K,Q\}}\|(M_k)^h_*-(M'_k)^h_*\|_\infty$, then it follows that
$\|f - f'\|_{L^\infty} \leq \sum_{k=0}^{L+T}\|B'_{k+1}\circ L_k\circ F_{k-1} - B'_{k+1}\circ L'_k\circ F_{k-1}\|_{L^\infty} \leq \sum_{k=0}^{L+T}(WB+1)^{L+T-k-b_{k+1}}(HW^2B^2+1)^{b_{k+1}}\|L_k\circ F_{k-1} - L'_k\circ F_{k-1}\|_{L^\infty}$
$\leq (WB+1)^L(HW^2B^2+1)^T\delta' + \sum_{k=1}^{L+T}(WB+1)^{L+T-k-b_{k+1}}(HW^2B^2+1)^{b_{k+1}}\delta'(W+1)^{l_k}(2HW^2B)^{l_k}(\|F_{k-1}\|_{L^\infty}\vee 1)$
$\leq (WB+1)^L(HW^2B^2+1)^T\delta' + \sum_{k=1}^{L+T}(WB+1)^{L+T-k-b_{k+1}}(HW^2B^2+1)^{b_{k+1}}\delta'(W+1)^{l_k}(2HW^2B)^{l_k}(W+1)^{k-1-f_{k-1}}(B+1)^{k-1-f_{k-1}}(HW^2B^2+1)^{f_{k-1}}(\|L_0\|_\infty\vee 1)$
$\leq \sum_{k=0}^{L+T} 2\delta' H^T(W+1)^{L+2T}(B+1)^{L+2T-1}(B\vee 1) \leq 2\delta'(L+T+1)H^T(W+1)^{L+2T}(B+1)^{L+2T}$.
Thus, for a fixed sparsity pattern (the locations of the non-zero parameters), the covering number is bounded by $\big(2B\cdot 2(L+T+1)H^T(W+1)^{L+2T}(B+1)^{L+2T}/\delta\big)^S$. Thus, the covering number of the whole space $\mathrm{TN}(L,T,E,W,H,S,B)$ is bounded as $\big((W+1)^L\big)^S\big(2B\cdot 2(L+T+1)H^T(W+1)^{L+2T}(B+1)^{L+2T}/\delta\big)^S \leq \big(4\delta^{-1}H^T(L+T+1)(W+1)^{2L+2T}(B+1)^{L+2T+1}\big)^S$. This completes the proof. Next, to prove Theorem 3, we need the following result, which connects the approximation theory to the generalization error analysis.
Proposition 1 (Schmidt-Hieber (2017)). Let $\mathcal F$ be a set of functions and let $\hat f$ be any estimator in $\mathcal F$. Define $\Delta_n := \mathbb{E}_{D_n}\big[\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat f(x_i))^2 - \inf_{f\in\mathcal F}\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2\big]$. Assume that $\|f^\circ\|_{L^\infty} \leq F$ and all $f \in \mathcal F$ satisfy $\|f\|_{L^\infty} \leq F$ for some $F \geq 1$. If $0 < \delta < 1$, then there exists a universal constant $C$ such that
$\mathbb{E}_{D_n}\|f^\circ - \hat f\|^2_{L^2(P_X)} \leq C\Big[(1+\epsilon)^2\inf_{f\in\mathcal F}\|f - f^\circ\|^2_{L^2(P_X)} + F^2\frac{\log N(\delta, \mathcal F, \|\cdot\|_{L^\infty}) - \log\delta}{n\epsilon} + \delta F^2 + \Delta_n\Big]$
for any $\epsilon \in (0, 1]$. Proof. See the proofs in Schmidt-Hieber (2017) and Suzuki (2019).

Finally, based on the definition and the results above, we can prove Theorem 3. It follows from Theorem 1 that $\|f - R_K(f)\|_{L^2} \lesssim N^{-\tilde\alpha}D^\eta$, where $\eta := \eta_{p,q,r}$. Since $P_X$ has a density function $0 \leq p(x) < R$ on $([0,1]^d)^l$, it holds that $\|f - R_K(f)\|_{L^2(P_X)} \lesssim \|f - R_K(f)\|_{L^2}$ for any $f: ([0,1]^d)^l \to \mathbb{R}$, and, by applying Proposition 1 with $\delta = \frac{1}{n}$, it follows that
$\mathbb{E}_{D_n}\|f^\circ - \hat f\|^2_{L^2(P_X)} \lesssim N^{-2\tilde\alpha}D^{2\eta} + \frac{ND\log N\big((\log N)^2 + \log n\big)}{n} + \frac{1}{n}$.
The case of mixed smooth Besov spaces: note that $N = 2^K$ and $D = D_{K,dl}$. Since $D_{K,dl} \leq \big(1+\frac{dl-1}{K}\big)^K\big(1+\frac{K}{dl-1}\big)^{dl-1} \lesssim K^{dl-1}$, we obtain the following upper bound estimation:
$\mathbb{E}_{D_n}\|f^\circ - \hat f\|^2_{L^2(P_X)} \lesssim 2^{-2\underline{\alpha}K}K^{2\eta(dl-1)} + \frac{K^{dl}2^K(K^2+\log n)}{n}$.
Then, the right-hand side is minimized by $K = \frac{1}{1+2\underline{\alpha}}\log_2 n + \frac{(2\eta-1)(dl-1)+3}{1+2\underline{\alpha}}\log_2\log n$ up to $\log\log n$ order. This completes the proof.
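The final balancing step can be checked numerically: minimizing a bound of the form $2^{-2\alpha K} + 2^K(K^2+\log n)/n$ over $K$ lands near $\log_2 n/(1+2\alpha)$, the leading term of the stated choice of $K$. A toy sketch with scalar smoothness $\alpha$ (function names ours; the polylog factors on the bias term are dropped for simplicity):

```python
from math import log, log2

def risk_bound(K, n, alpha):
    """Squared-bias plus variance proxy mirroring the proof of Theorem 3."""
    return 2.0 ** (-2 * alpha * K) + 2.0 ** K * (K ** 2 + log(n)) / n

def best_level(n, alpha, K_max=40):
    """Grid-search the resolution level K minimizing the bound."""
    return min(range(1, K_max + 1), key=lambda K: risk_bound(K, n, alpha))
```

With $n = 10^6$ and $\alpha = 1$, the minimizer is $K = 5$, close to $\log_2(10^6)/3 \approx 6.6$; the gap corresponds to the $\log_2\log n$ correction.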

E PROOF OF THE STATEMENTS OF SECTION 5

Proof of Theorem 5. In the proof of Theorem 6, we can easily check that the construction of $R_K(f)$ is independent of the smoothness parameter $\alpha$. Since $\underline{\alpha} = s$, as in the proof of Theorem 6, we obtain the estimation $\|f - R_K(f)\|_r \lesssim 2^{-sK}D^{\eta_{p,q,r}}_{K,d}\|f\|_{VB^{\alpha,\pi}_{p,q}}$, where $R_K(f) := \sum_{(k,j)\in E(K)} c_{k,j}M^d_{k,j}(x)$ and $\sharp E(K) := \sharp\{(k,j)\in\mathbb{Z}^d_{\geq 0}\times\mathbb{Z}^d_{\geq 0}\mid c_{k,j}\neq 0\} \leq \big(2+(1-2^{-\nu})^{-1}\big)2^K D_{K^*,d}$. From now on, we consider the case in which $\pi = \{\Omega\}$; for a general $\pi$, we can adapt the same proof strategy. Let $t_0 := \frac{\log(2K+\frac{1}{p})}{\log s}$ (so that $s^{t_0} - \frac{1}{p} \geq 2K$) and $\bar R_K(f) := \sum_{(k,j)\in E(K):\, k_{\cdot,i'}\neq 0 \Rightarrow \sigma(i')<t_0} c_{k,j}M^d_{k,j}(x)$. If $k_{\cdot,i'} \neq 0$ and $\sigma(i') \geq t_0$, then, since $s \geq \frac{1}{p}$, the following inequality holds: $2^{2K}|c_{k,j}| \leq 2^{s^{t_0}-\frac{1}{p}}|c_{k,j}| \leq 2^{\langle\alpha,k\rangle-\frac{\|k\|_1}{p}}|c_{k,j}| \lesssim \|f\|_{MB^\alpha_{p,q}}$. Thus, since $r \geq 1$, it follows that
$\|R_K(f) - \bar R_K(f)\|_r \leq \sum_{(k,j)\in E(K):\, k_{\cdot,i'}\neq 0,\,\sigma(i')\geq t_0}|c_{k,j}|\,\|M^d_{k,j}\|_r \leq \sharp E(K)\,2^{-2K}\|f\|_{VB^{\alpha,\pi}_{p,q}} \leq \big(2+(1-2^{-\nu})^{-1}\big)2^K D_{K^*,d}\,2^{-2K}\|f\|_{VB^{\alpha,\pi}_{p,q}} \lesssim 2^{-K}D_{K^*,d}\|f\|_{VB^{\alpha,\pi}_{p,q}}$.
From the estimations above, it follows that $\|f - \bar R_K(f)\|_r \lesssim 2^{-K}D^{\eta_{p,q,r}\vee 1}_{K,d}\|f\|_{VB^{\alpha,\pi}_{p,q}}$. Let $\bar N^m(x) := \sum_{i=1}^{2^K(m+1)} 1_{[2^{-K}(i-1),\,2^{-K}i)}(x)\,N^m\big(2^{-K}\lfloor 2^K x\rfloor\big)$ (the expressions $2^{-K}\lfloor 2^K x\rfloor$ correspond to the input quantization masks). Then, it follows from Lemma 1 that $\|N^m(x) - \bar N^m(x)\|_\infty \leq 2^{-K}$. If $k_{\cdot,i'} \neq 0$ and $\sigma(i') \geq t_0$, we denote by $\bar{\bar R}_K(f)$ the expression obtained by replacing $N^m$ in $\bar R_K(f)$ with $\bar N^m$. Then the following estimation holds: $\|\bar R_K(f) - \bar{\bar R}_K(f)\|_r \lesssim 2^{-K}$. Consequently, as in the proof of Theorem 1, it follows that $R_r\big(\mathrm{MTN}_{t,2^K}(L,T,E,W,H,S,B), VU^\alpha_{p,q}([0,1]^{dl})\big) \lesssim 2^{-K}D^{\eta\vee 1}_{K,dl}$. For a general $\pi$, we execute the same proof strategy. For each $\Omega_i$, we set $A_i := \{i'\mid k_{\cdot,i'}\neq 0,\ \sigma_i(i')\geq t_0\}$. Since $\sigma_i \neq \sigma_{i'}$ for $i \neq i'$ in general, it also holds that $A_i \neq A_{i'}$ for $i \neq i'$ in general.
For each $A_i$, we can construct the corresponding $\bar R_K(f)_i$, and this expression corresponds to the respective quantization mask. This completes the proof.
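The replacement of $N^m$ by its input-quantized version costs at most $2^{-K}$ in sup norm precisely because $N^m$ is 1-Lipschitz (Lemma 1). The order-1 case can be checked numerically (function names ours):

```python
from math import floor

def hat(x):
    """Cardinal B-spline of order 1 (hat function on [0, 2]); 1-Lipschitz."""
    return max(0.0, 1.0 - abs(x - 1.0))

def quantized_hat(x, K):
    """Evaluate the hat function at the dyadic grid point floor(2^K x)/2^K,
    mimicking the input quantization mask 2^(-K) * floor(2^K x)."""
    return hat(floor(x * 2 ** K) / 2 ** K)
```

Since the input moves by less than $2^{-K}$ and the function is 1-Lipschitz, the pointwise error never exceeds $2^{-K}$.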



Lemma 4 (Approximation of $xy$). Let $\epsilon \in \mathbb{R}_{>0}$ and $R \in \mathbb{R}_{\geq 1}$. Then, there exist constants $L$, $W$, $S$, and $B := R$, and a neural network $M \in \Phi_2(L, W, S, B)$ such that $\sup_{x,y\in[0,R]}|xy - M((x,y))| \leq \epsilon$.

$\sum_{n=1}^{N} c_n M^{dl}_{k_n,j_n}(x)$, and we define $k_{\max} := \max_{1\leq n\leq N} k_n$ and $c_{\max} := \max_{1\leq n\leq N} c_n$. Fix $\varepsilon \in \mathbb{R}_{>0}$. Now, we construct an approximate Transformer network $T[g]$ of $g$. It follows from Lemma 3 that, for any $\epsilon, R \in \mathbb{R}_{>0}$, there exist constants $\tilde L_{\epsilon,R} := \lceil\log_2\frac{R}{\epsilon}\rceil$, $\tilde W_{\epsilon,R} := 3$, $\tilde S_{\epsilon,R} := 5\tilde L_{\epsilon,R}$, $\tilde B_{\epsilon,R} := 1$, and a neural network $\tilde M_{\epsilon,R} \in \Phi_2(\tilde L_{\epsilon,R}, \tilde W_{\epsilon,R}, \tilde S_{\epsilon,R}, \tilde B_{\epsilon,R})$ such that $\sup_{x,y\in[0,R]}|xy - \tilde M_{\epsilon,R}((x,y))| \leq \epsilon$.

Proof of Theorem 3 and Theorem 4. Note that the estimations below hold: $D \lesssim D^*$, $L \lesssim \log N$, $T \lesssim 1$, $W \lesssim ND$, $H \lesssim 1$, $S \lesssim ND\log N$, and $B \lesssim N^{(1+\nu^{-1})(\gamma\vee\zeta)}$. Lemma 6 gives an upper bound of the covering number as
$\log N(\delta, \mathrm{TN}(L,T,E,W,H,S,B), \|\cdot\|_{L^\infty}) \leq S\log\big(4\delta^{-1}(L+T+1)(W+1)^{2L+2T}(B+1)^{L+2T+1}H^T\big) \lesssim ND\log N\big((\log N)^2 + \log\delta^{-1}\big)$.

$K = \frac{1}{1+2\alpha}\log_2 n + \frac{(2\eta-1)(dl-1)+3}{1+2\alpha}\log_2\log n$ up to $\log\log n$ order. Then we obtain the following result: $\lesssim n^{-\frac{2\alpha}{2\alpha+1}}(\log n)^{\frac{2(dl-1)(\eta+\alpha)+6\alpha}{1+2\alpha}}$. The case of anisotropic Besov spaces: note that $N = \tilde N$ and $D = 1$. The right-hand side is minimized by $\tilde N = n^{\frac{1}{1+2\tilde\alpha}}(\log n)^{-\frac{3}{1+2\tilde\alpha}}$ up to $\log\log n$ order. Then we obtain the following result:

It follows from the definition of anisotropic Besov spaces that anisotropic Besov spaces are equal to ordinary Besov spaces with smoothness parameter $\alpha$. Hence, the definition of anisotropic Besov spaces includes that of ordinary Besov spaces as a special case, while the definition of mixed smooth Besov spaces does not include that of ordinary Besov spaces in general. Moreover, anisotropic Besov spaces are closely related to Hölder spaces. We present the definition of Hölder spaces as follows. Definition 5 (Hölder space). Let $\alpha \in \mathbb{R}_{>0}$ with $\alpha \notin \mathbb{N}$, and denote $m := \lfloor\alpha\rfloor$. For an $m$-times differentiable function $f: \mathbb{R}^d \to \mathbb{R}$, let the norm of the Hölder space $C^\alpha(\Omega)$ be $\|f\|_{C^\alpha}$

$T, E, W, H, S, B)$ with the number of dense layers $L$, the number of Transformer blocks $T$, width $W$, the number of heads $H$, sparsity constraint $S$, and norm constraint $B$, recursively as follows:

D. Yarotsky. Error bounds for approximations with deep ReLU networks. CoRR, abs/1610.01145, 2016.
C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are transformers universal approximators of sequence-to-sequence functions? In Proceedings of the International Conference on Learning Representations, 2020.

