APPROXIMATION ABILITY OF TRANSFORMER NETWORKS FOR FUNCTIONS WITH VARIOUS SMOOTHNESS OF BESOV SPACES: ERROR ANALYSIS AND TOKEN EXTRACTION

Abstract

Although Transformer networks achieve outstanding performance on various natural language processing tasks, many aspects of their theoretical nature remain unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their approximation and estimation capability when the target function belongs to such function classes as the Hölder class and the Besov class. Besov spaces play an important role in several fields, including wavelet analysis, nonparametric statistical inference, and approximation theory. In this paper, we study the approximation and estimation error of Transformer networks in a setting where the target function takes a fixed-length sentence as input and belongs to one of two variants of Besov spaces, known as anisotropic Besov spaces and mixed smooth Besov spaces, for which Transformer networks are shown to avoid the curse of dimensionality. By overcoming the difficulty posed by the limited interactions among tokens, we prove that Transformer networks can achieve the minimax optimal rate. Our result also shows that token-wise parameter sharing in Transformer networks reduces the dependence of the network width on the input length. Moreover, we prove that, in suitable situations, Transformer networks dynamically select the tokens to which they pay careful attention. This phenomenon matches the attention mechanism, on which Transformer networks are based. Our analyses provide strong theoretical support for why Transformer networks perform so well on various natural language processing tasks.

1. INTRODUCTION

Transformer networks, which were proposed in Vaswani et al. (2017), have achieved state-of-the-art performance on various natural language processing (NLP) tasks, including text classification (Shaheen et al., 2020), machine translation (Vaswani et al., 2017), language modeling (Radford et al.; Devlin et al., 2018), and question answering (Devlin et al., 2018; Yang et al., 2019). Transformer networks make it feasible to approximate functions that take a sequence of tokens (i.e., text) as input, owing to their specific architecture: a stack of blocks, each consisting of a self-attention layer and a token-wise feed-forward layer. However, despite these great successes in various NLP tasks, many aspects of their theoretical nature remain unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their function approximation and estimation capability. A remarkable property of neural networks is their universal approximation capability: any continuous function with compact support can be approximated with arbitrary accuracy by two fully connected layers (Cybenko, 1989). However, Cybenko (1989) gave no upper bound on the network size. The relation between properties of the target function and the required network size is therefore the next question. By imposing certain properties, such as smoothness, on the target function, the representability of neural networks can be studied more precisely. Barron (1993) developed an approximation theory for functions with limited capacity, measured by the integrability of their Fourier transform. Deep neural networks with the ReLU activation (Nair & Hinton, 2010; Glorot et al., 2011) have also been extensively studied from the viewpoint of approximation and estimation ability. For example,
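To make the architecture described above concrete, the following is a minimal sketch of a single Transformer block in NumPy: single-head self-attention followed by a token-wise ReLU feed-forward layer. All parameter names and dimensions are illustrative assumptions; residual connections, layer normalization, and multi-head attention are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """One simplified Transformer block.
    X: (n_tokens, d) input sequence of token embeddings.
    Self-attention lets tokens interact; the feed-forward part
    applies the SAME weights (W1, b1, W2, b2) to every token --
    the token-wise parameter sharing discussed in the paper."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Wq.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n_tokens, n_tokens) attention weights
    H = A @ V                                     # the only place tokens mix information
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2 # token-wise ReLU feed-forward

# hypothetical sizes: sentence length n, embedding dim d, hidden dim d_ff
rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
X = rng.standard_normal((n, d))
shapes = [(d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]
params = [0.1 * rng.standard_normal(s) for s in shapes]
Y = transformer_block(X, *params)
print(Y.shape)  # one output vector per input token: (4, 8)
```

Note that the feed-forward weights do not depend on the token position, so the parameter count of that part is independent of the input length — the structural property whose theoretical benefit this paper quantifies.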

