APPROXIMATION ABILITY OF TRANSFORMER NETWORKS FOR FUNCTIONS WITH VARIOUS SMOOTHNESS OF BESOV SPACES: ERROR ANALYSIS AND TOKEN EXTRACTION

Abstract

Although Transformer networks achieve outstanding performance on various natural language processing tasks, many aspects of their theoretical nature are still unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their approximation and estimation capability when the target function belongs to function classes such as the Hölder class and the Besov class. Besov spaces play an important role in several fields such as wavelet analysis, nonparametric statistical inference, and approximation theory. In this paper, we study the approximation and estimation error of Transformer networks in a setting where the target function takes a fixed-length sentence as input and belongs to one of two variants of Besov spaces, the anisotropic Besov space and the mixed smooth Besov space, in which Transformer networks are shown to avoid the curse of dimensionality. By overcoming the difficulty that tokens interact with each other only in a limited way, we prove that Transformer networks achieve the minimax optimal rate. Our result also shows that token-wise parameter sharing in Transformer networks reduces the dependence of the network width on the input length. Moreover, we prove that, in suitable situations, Transformer networks dynamically select the tokens to which they should pay attention. This phenomenon matches the attention mechanism, on which Transformer networks are based. Our analyses give strong theoretical support for why Transformer networks have performed so well on various natural language processing tasks.

1. INTRODUCTION

Transformer networks, which were proposed in Vaswani et al. (2017), have achieved outstanding performance on various natural language processing (NLP) tasks, including text classification (Shaheen et al., 2020), machine translation (Vaswani et al., 2017), language modeling (Radford et al.; Devlin et al., 2018), and question answering (Devlin et al., 2018; Yang et al., 2019). Transformer networks make it feasible to approximate functions that take a sequence of tokens (i.e., text) as input, thanks to their specific architecture: a stack of blocks consisting of self-attention layers and token-wise feed-forward layers. However, despite these great successes in various NLP tasks, many aspects of their theoretical nature are still unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their function approximation and estimation capability. A remarkable property of neural networks is their universal approximation capability, which means that any continuous function with compact support can be approximated with arbitrary accuracy by two fully connected layers (Cybenko, 1989). However, Cybenko (1989) said nothing about an upper bound on the network size. Therefore, the relation between properties of the target function and the required network size is a natural next question. By imposing certain properties such as smoothness on the target function, the representability of neural networks can be studied more precisely. Barron (1993) developed an approximation theory for functions with limited capacity, measured by the integrability of their Fourier transform. Deep neural networks with the ReLU activation (Nair & Hinton, 2010; Glorot et al., 2011) have also been extensively studied from the viewpoint of approximation and estimation ability. For example, Yarotsky (2016) derived approximation error bounds of fully connected networks with the ReLU activation for functions in Sobolev spaces.
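As an illustration of the kind of construction used in Yarotsky (2016), the following minimal NumPy sketch (the function names `tooth` and `approx_square` are ours, not from the cited work) composes a single ReLU "tooth" function to approximate x^2 on [0, 1]; the product xy, which appears later in the proof strategy, can then be obtained via the polarization identity xy = ((x + y)^2 - x^2 - y^2) / 2.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    # Hat function g(x) = 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1):
    # a single hidden ReLU layer mapping [0, 1] onto [0, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, m):
    # f_m(x) = x - sum_{s=1}^{m} g^{(s)}(x) / 4^s, where g^{(s)} is the
    # s-fold composition of tooth.  This is the piecewise-linear
    # interpolation of x^2 on the grid k / 2^m, so the uniform error on
    # [0, 1] is at most 2^(-2m-2), while the depth grows only linearly in m.
    g = np.asarray(x, dtype=float).copy()
    out = np.asarray(x, dtype=float).copy()
    for s in range(1, m + 1):
        g = tooth(g)
        out = out - g / 4.0 ** s
    return out

xs = np.linspace(0.0, 1.0, 1001)
err = np.max(np.abs(approx_square(xs, 6) - xs ** 2))  # about 2^-14
```

The exponential error decay against a linear growth in depth is exactly what makes this construction a useful building block for approximating polynomials and, eventually, B-splines.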
Schmidt-Hieber (2017) derived an estimation error bound of a regularized least squares estimator realized by deep ReLU networks, based on an approximation error analysis in a regression setting. Suzuki (2019) derived approximation and estimation error rates of fully connected networks with the ReLU activation for the Besov space, which were also shown to be almost minimax optimal. Although the derived convergence rates are almost optimal, they suffer from the curse of dimensionality, which is one of the main issues in machine learning. A typical consequence of the curse of dimensionality is that, as the dimension of the data increases, the approximation accuracy (and estimation accuracy) deteriorates exponentially in the dimension. However, under specific structural assumptions on the data and the target function, this issue can be avoided. Indeed, Suzuki (2019) and Suzuki & Nitanda (2021) showed that, by assuming that the target function has mixed smoothness or anisotropic smoothness, we can avoid the curse of dimensionality. Okumoto & Suzuki (2022) derived approximation and estimation errors in a more severe setting in which the input data are infinite-dimensional. Although much research has been conducted on the representation ability of fully connected layers and convolutional layers, relatively little is known about that of Transformer networks. Kratsios et al. (2021) proved that there exists a pair of an input sequence and output particles that minimizes a given proper loss function under a given constraint set. Vuckovic (2020) proved that, when attention layers are regarded as maps from measures to measures, they are Lipschitz continuous with respect to the Wasserstein distance. Both Kratsios et al. (2021) and Vuckovic (2020) regard an input sentence as a measure, that is, as particles or a bag of words, which is an interesting viewpoint.
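To make the exponential deterioration concrete, consider a back-of-the-envelope count (our own illustration, not an argument from the cited works): to approximate a function on [0, 1]^d that is 1-Lipschitz in the sup-norm within uniform error eps by a piecewise-constant function, one needs a grid of cells of side roughly 2*eps, i.e. on the order of (1 / (2*eps))^d cells.

```python
def n_cells(d, eps):
    # Number of cells of side 2*eps needed to cover [0, 1]^d so that a
    # function that is 1-Lipschitz in the sup-norm is within eps of its
    # value at each cell center: (1 / (2*eps))^d, exponential in d.
    return (1.0 / (2.0 * eps)) ** d

# With eps = 0.05 (10 cells per axis), the cell count explodes with d:
# d = 1 -> 1e1,  d = 5 -> 1e5,  d = 20 -> 1e20.
counts = [n_cells(d, 0.05) for d in (1, 5, 20)]
```

Structural assumptions such as mixed or anisotropic smoothness serve precisely to replace the exponent d by a much smaller effective dimension.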
However, these papers do not quantify how accurately Transformer networks can approximate a given function from an input sequence to an output. Therefore, their results differ from this paper's main purpose, which is to explain why Transformer networks can perform well on various NLP tasks represented by target functions in various function spaces. Yun et al. (2020), Zaheer et al. (2020) and Shi et al. (2021) proved that Transformer networks are universal approximators of sequence-to-sequence functions. However, since these papers did not assume smoothness of the target function, their results do not specify an upper bound on the depth of the Transformer network, just as the universal approximation theorem for neural networks says nothing about an upper bound on the network width. Thus, this paper studies the question that naturally arises: how are the properties of the target function related to the required network size and precision? In this paper, we study the approximation and estimation error of the Transformer architecture in a setting where the target function takes a fixed-length sentence as input and belongs to a mixed smooth Besov space or an anisotropic Besov space. We prove that Transformer networks achieve the almost minimax optimal rate by analyzing the Transformer architecture together with the approximation-theoretic properties of the two function spaces. Moreover, we prove that, in suitable situations, Transformer networks can dynamically select the tokens to which they should pay attention. The essence of the proof strategy is as follows: first, for a given target function, we obtain a sum of piecewise polynomial functions that approximates the target function at a certain rate; next, we construct a neural network that approximates each piecewise polynomial function; finally, we construct a neural network that approximates the sum.
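To make the first phase of this strategy concrete, here is a small NumPy sketch (our own illustration, not the paper's construction; the names `cardinal_bspline` and `spline_approx` are ours) of a cardinal B-spline built by the Cox-de Boor recursion, together with a piecewise-linear approximation of a smooth function by shifts and dilations of the order-2 B-spline (the hat function).

```python
import numpy as np

def cardinal_bspline(x, m):
    # Cardinal B-spline of order m (degree m - 1), supported on [0, m],
    # defined by the Cox-de Boor recursion:
    #   N_1 = 1_{[0,1)},
    #   N_m(x) = (x * N_{m-1}(x) + (m - x) * N_{m-1}(x - 1)) / (m - 1).
    if m == 1:
        return np.where((0.0 <= x) & (x < 1.0), 1.0, 0.0)
    return (x * cardinal_bspline(x, m - 1)
            + (m - x) * cardinal_bspline(x - 1.0, m - 1)) / (m - 1)

def spline_approx(f, x, level):
    # Piecewise-linear interpolation of f on [0, 1] at resolution 2^-level,
    # written as a sum of shifted and dilated hat functions N_2:
    #   f(x) ~ sum_k f(k * h) * N_2(x / h - k + 1),  h = 2^-level,
    # since N_2 equals 1 at its center and 0 at the other grid points.
    h = 2.0 ** (-level)
    out = np.zeros_like(x, dtype=float)
    for k in range(0, 2 ** level + 1):
        out += f(k * h) * cardinal_bspline(x / h - k + 1.0, 2)
    return out

xs = np.linspace(0.0, 1.0, 257)
err = np.max(np.abs(spline_approx(np.sin, xs, 6) - np.sin(xs)))  # O(2^-12)
```

Higher-order B-splines and adaptive (non-uniform) placements of their shifts are what allow the rate to be matched to the Besov smoothness of the target function.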
The difficult part is the second phase, in which one constructs a neural network that approximates a cardinal B-spline function. The proof of this phase is based on the fully connected layers approximating the product xy in Yarotsky (2016). However, Transformer networks permit only limited interactions among tokens. In this paper, we propose a construction of an attention layer that exchanges values between different tokens. By using the attention layers constructed in this way, we can construct a Transformer network that approximates a cardinal B-spline function. This difficulty is shared with previous papers (Yun et al., 2020; Zaheer et al., 2020; Shi et al., 2021), although their strategies for obtaining a piecewise constant approximation differ from ours in how they exploit the smoothness of the target function. Our contributions can be summarized as follows:

1. We consider a situation in which the target function takes a fixed-length sentence as input and belongs to a mixed smooth Besov space or an anisotropic Besov space, and we show that Transformer networks can avoid the curse of dimensionality and achieve the almost minimax optimal rate. We also show that token-wise parameter sharing in Transformer networks reduces the dependence of the network width on the input length.
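To illustrate how a single self-attention layer both lets tokens exchange values and can act as a near-hard token selector, here is a minimal NumPy sketch (our own illustration with hypothetical weight choices, not the construction used in the proofs): scaling up the query weights makes each softmax row concentrate on a single token, so the layer effectively copies the selected token's value vector.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d).  Each output row is a convex combination of the
    # value vectors of ALL tokens; this is the only place in a Transformer
    # block where information moves between tokens.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (n, n) weights
    return A @ V, A

X = np.eye(3)          # three distinct one-hot tokens
c = 50.0               # large scale -> sharply peaked softmax
out, A = self_attention(X, c * np.eye(3), np.eye(3), np.eye(3))
# A is numerically close to the identity: each token attends almost only
# to itself, i.e. the attention weights perform a near-hard selection.
```

With a moderate scale c the same layer instead produces a smooth mixture of token values, which is the regime used when attention must carry out the limited arithmetic between tokens needed for the B-spline construction.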

