ATTENTION ENABLES ZERO APPROXIMATION ERROR

Abstract

Attention-based architectures have become the core backbone of many state-of-the-art models for various tasks, including language translation and image classification. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that with suitable adaptations, a single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error, where the number of transformer encoder blocks equals the degree of the target polynomial. Even more strikingly, the transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters is universal. We also show that our proposed model can avoid the classical trade-off between approximation error and sample error in the mean squared error analysis for the regression task when the target function is a polynomial. We conduct various experiments and ablation studies to verify our theoretical results.

1. INTRODUCTION

By imitating the structure of brain neurons, deep learning models have replaced traditional statistical models in almost every aspect of applications, becoming the most widely used machine learning tools LeCun et al. (2015); Goodfellow et al. (2016). Structures of deep learning models are also constantly evolving, from fully connected networks to many variants such as convolutional networks Krizhevsky et al. (2012), recurrent networks Mikolov et al. (2010), and the attention-based transformer model Dosovitskiy et al. (2020). Attention-based architectures were first introduced in the areas of natural language processing and neural machine translation Bahdanau et al. (2014); Vaswani et al. (2017); Ott et al. (2018), and an attention-based transformer model has now also become state-of-the-art in image classification Dosovitskiy et al. (2020). However, compared with the significant achievements and developments in practical applications, the theoretical properties of attention-based transformer models are not well understood. Let us briefly describe some current theoretical progress on attention-based architectures. The universality of a sequence-to-sequence transformer model was first established in Yun et al. (2019). After that, a sparse attention mechanism, BIGBIRD, was proposed by Zaheer et al. (2020), and the authors further showed that the proposed transformer model is universal if its attention structure contains the star graph. Later, Yun et al. (2020) provided a unified framework to analyze sparse transformer models. Recently, Shi et al. (2021) studied the significance of different positions in the attention matrix during pre-training and showed that the diagonal elements of the attention map are the least important compared with other attention positions. From a statistical machine learning point of view, Gurevych et al. (2021) propose a classifier based on a transformer model and show that this classifier can circumvent the curse of dimensionality.

The models considered in the above works all contain attention-based transformer encoder blocks. It is worth noting that the biggest difference between a transformer encoder block and a traditional neural network layer is that the former introduces an inner product operation, which not only improves practical performance but also provides more room for theoretical derivations. In this paper, we study the theoretical properties of the single-head self-attention transformer with suitable adaptations. Different from segmenting the input x into small pieces as in Dosovitskiy et al. (2020), we exploit the information obtained from data pre-processing: following the idea of positional encoding, we place a one-hot vector to represent different features and a zero vector to store the output values after each transformer encoder block. With this special design, all transformer encoder blocks can be fixed so that no training is needed for them, and the model is able to realize the multiplication operation and store values in the zero positions. By applying a well-known result in approximation theory Zhou (2018), which states that any polynomial Q of degree at most q on R^d can be represented by a linear combination of powers of ridge forms ξ_k · x of x ∈ R^d, we prove that the proposed model can generate any polynomial of degree q with q transformer encoder blocks and a fixed number of free parameters. As a direct consequence, we show that the proposed model is universal if we let the number of free parameters and transformer encoder blocks go to infinity.

Our theoretical results are also verified by experiments on synthetic data. In summary, the contributions of our work are as follows:

• We propose a new pre-processing method that captures global information and a new structure for the input vectors of the transformer encoder blocks.

• With this special input structure, we can design all the transformer encoder blocks by hand in a sparse way and prove that the single-head self-attention transformer with q transformer encoder blocks and a fixed number of free parameters is able to generate any desired polynomial of degree q of the input with no error.

• As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters and transformer encoder blocks is universal.

• We conduct a mean squared error analysis for the regression task with our proposed model. We show that if the target function is a polynomial, our proposed model can avoid the classical trade-off between approximation error and sample error, and the convergence rate is controlled only by the number of samples if we treat d and q as constants.

• We apply our model to noisy regression tasks with synthetic data and a real-world data set. Our experiments show that the proposed model performs much better than traditional fully connected neural networks with a comparable number of free parameters.

• We apply our model, with suitable adaptations, to the image classification task and achieve better performance than the Vision Transformer on the CIFAR-10 data set.

2. TRANSFORMER STRUCTURES

In this section, we formally introduce the single-head self-attention transformer considered in this paper. The overall architecture is shown in Figure 1.
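Before the formal definitions, the input construction described in the introduction (a one-hot vector marking each feature's position plus a zero slot reserved for stored outputs) can be sketched as follows. This is our illustrative reading of that description; the function name and exact layout are assumptions, not the paper's specification.

```python
import numpy as np

def build_tokens(x):
    """Hypothetical token layout: for x in R^d, token i carries the
    feature value x_i, a one-hot positional code of length d, and a
    trailing zero slot where later encoder blocks could store outputs.
    (Illustrative assumption; the paper's exact layout may differ.)"""
    d = len(x)
    tokens = np.zeros((d, d + 2))
    for i, xi in enumerate(x):
        tokens[i, 0] = xi        # feature value
        tokens[i, 1 + i] = 1.0   # one-hot positional encoding
        # tokens[i, -1] stays 0: slot reserved for stored outputs
    return tokens

T = build_tokens(np.array([0.5, -1.0, 2.0]))
```

Keeping the positional code one-hot rather than learned is what allows fixed (untrained) encoder blocks to address individual features deterministically.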

Figure 1: The architecture of the single-head self-attention transformer. W_Q, W_K, and W_V stand for the query matrix, the key matrix, and the value matrix, respectively. MatMul stands for matrix multiplication.
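The computation in Figure 1 can be sketched as a generic single-head self-attention block; this is our minimal illustration of the standard scaled dot-product form, not the exact adapted model analyzed in this paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention on n tokens of dimension d (X is n x d):
    queries, keys, and values are linear maps of X; the attention map is
    softmax of the scaled inner products, then multiplied with V (MatMul)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])  # inner products of queries and keys
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                            # 4 tokens, dim 8
W = [rng.standard_normal((8, 8)) for _ in range(3)]        # W_Q, W_K, W_V
out = single_head_attention(X, *W)
```

The `Q @ K.T` step is the inner product operation the introduction highlights as the key structural difference from a traditional feed-forward layer.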

