ATTENTION ENABLES ZERO APPROXIMATION ERROR

Abstract

Attention-based architectures have become the core backbone of many state-of-the-art models for a variety of tasks, including language translation and image classification. However, the theoretical properties of attention-based models are seldom studied. In this work, we show that, with suitable adaptations, a single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters can generate any desired polynomial of the input with no error. The number of transformer encoder blocks equals the degree of the target polynomial. More surprisingly, we find that the transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with an increasing number of free parameters is universal. We also show that our proposed model can avoid the classical trade-off between approximation error and sample error in the mean squared error analysis for the regression task when the target function is a polynomial. We conduct various experiments and ablation studies to verify our theoretical results.

1. INTRODUCTION

By imitating the structure of brain neurons, deep learning models have replaced traditional statistical models in almost every aspect of applications, becoming the most widely used machine learning tools LeCun et al. (2015); Goodfellow et al. (2016). Structures of deep learning are also constantly evolving, from fully connected networks to many variants such as convolutional networks Krizhevsky et al. (2012), recurrent networks Mikolov et al. (2010), and the attention-based transformer model Dosovitskiy et al. (2020). Attention-based architectures were first introduced in the areas of natural language processing and neural machine translation Bahdanau et al. (2014); Vaswani et al. (2017); Ott et al. (2018), and an attention-based transformer model has now also become state-of-the-art in image classification Dosovitskiy et al. (2020). However, compared with the significant achievements and developments in practical applications, the theoretical properties of attention-based transformer models are not well understood.

Let us briefly describe some current theoretical progress on attention-based architectures. The universality of a sequence-to-sequence transformer model is first established in Yun et al. (2019). After that, a sparse attention mechanism, BIGBIRD, is proposed by Zaheer et al. (2020), and the authors further show that the proposed transformer model is universal if its attention structure contains the star graph. Later, Yun et al. (2020) provides a unified framework to analyze sparse transformer models. Recently, Shi et al. (2021) studies the significance of different positions in the attention matrix during pre-training and shows that the diagonal elements of the attention map are the least important compared with other attention positions. From a statistical machine learning point of view, the authors of Gurevych et al. (2021) propose a classifier based on a transformer model and show that this classifier can circumvent the curse of dimensionality.

The models considered in the above works all contain attention-based transformer encoder blocks. It is worth noting that the biggest difference between a transformer encoder block and a traditional neural network layer is that the former introduces an inner product operation, which not only improves its practical performance but also provides more room for theoretical derivations. In this paper, we consider the theoretical properties of the single-head self-attention transformer with suitable adaptations. Instead of segmenting x into small pieces Dosovitskiy et al. (2020) and capturing local information, we consider a global pre-processing of x and propose a new vector structure for the inputs of the transformer encoder blocks. In this structure, in addition to the global
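To make the inner product operation mentioned above concrete, the following is a minimal sketch of single-head self-attention on a length-n sequence of d-dimensional tokens; it is an illustration of the standard mechanism, not the adapted construction analyzed in this paper, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention on a sequence X of shape (n, d).

    The score matrix Q @ K.T collects pairwise inner products of the
    (projected) tokens -- the operation that distinguishes a
    transformer encoder block from a plain feed-forward layer.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])   # pairwise inner products
    A = softmax(scores, axis=-1)             # each row sums to 1
    return A @ V                             # shape (n, d)

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = single_head_self_attention(X, W_q, W_k, W_v)
```

Because each output token is a convex combination of the value vectors, the attention weights couple every pair of input positions, which is the source of the extra expressive power exploited in our polynomial construction.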

