WHY IS SELF-ATTENTION NATURAL FOR SEQUENCE-TO-SEQUENCE PROBLEMS? A PERSPECTIVE FROM SYMMETRIES

Abstract

In this paper, we show from the perspective of symmetry that structures similar to self-attention arise naturally when learning many sequence-to-sequence problems. Inspired by language processing applications, we study the orthogonal equivariance of seq2seq functions with knowledge: functions that take two inputs, an input sequence and a "knowledge", and output another sequence. The knowledge consists of a set of vectors in the same embedding space as the input sequence, encoding information about the language used to process the input sequence. We show that orthogonal equivariance in the embedding space is natural for seq2seq functions with knowledge, and that under such equivariance the function must take a form close to self-attention. This shows that network structures similar to self-attention are the right structures for representing the target functions of many seq2seq problems. The representation can be further refined if a "finite information principle" is considered, or if a permutation equivariance holds among the elements of the input sequence.

1. INTRODUCTION

Neural network models using self-attention, such as Transformers (Vaswani et al., 2017), have become the new benchmark in fields such as natural language processing and protein folding. However, the design of self-attention is largely heuristic, and a theoretical understanding of its success is still lacking. In this paper, we provide a perspective on this problem from the symmetries of sequence-to-sequence (seq2seq) learning problems. By identifying and studying appropriate symmetries for seq2seq problems of practical interest, we demonstrate that structures like self-attention are natural for representing these problems.

Symmetries in a learning problem can inspire the invention of simple and efficient neural network structures. This is because symmetries reduce the complexity of the problem, and a network with matching symmetries can learn the problem more efficiently. For instance, convolutional neural networks (CNNs) have seen great success on vision problems, with the translation invariance/equivariance of these problems being one of the main reasons. This is not only observed in practice but also justified theoretically (Li et al., 2020b). Many other symmetries have been studied and exploited in the design of neural network models. Examples include permutation equivariance (Zaheer et al., 2017) and rotational invariance (Kim et al., 2020; Chidester et al., 2019), with various applications to learning physical problems. See Section 2.1 for more related works.

In this work, we start by studying the symmetry of seq2seq functions in the embedding space, the space in which each element of the input and output sequences lies. In a language processing problem, for example, words or tokens are usually vectorized by a one-hot embedding based on a dictionary. In this process, the order of words in the dictionary should not influence the meaning of the input and output sentences. Thus, if a permutation is applied to the dimensions of the embedding space, the input and output sequences should experience the same permutation, without any other change. This implies a permutation equivariance in the embedding space. In our analysis, we consider equivariance under the orthogonal group, which is slightly larger than the permutation group.

We show that if a function f is orthogonal equivariant in the embedding space, then its output can be expressed as linear combinations of the elements of the input sequence, with the coefficients only depending
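The orthogonal equivariance discussed above can be checked numerically. The following is a minimal NumPy sketch, in which `toy_attention` is an illustrative attention-like map (not the paper's exact construction): its mixing coefficients depend only on inner products between sequence elements and knowledge vectors, so applying the same orthogonal transformation Q to both inputs transforms the output by Q as well. Since permutation matrices are orthogonal, the dictionary-permutation symmetry is covered as a special case.

```python
import numpy as np

def toy_attention(X, K):
    """X: (n, d) input sequence; K: (m, d) knowledge vectors.
    Coefficients depend only on the inner products X @ K.T,
    so the map is equivariant under orthogonal changes of basis."""
    scores = X @ K.T                                # (n, m) inner products
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ K                              # (n, d) output sequence

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
K = rng.standard_normal((7, 4))

# Random orthogonal matrix Q via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

# Equivariance: transforming both inputs by Q transforms the output by Q,
# because (X @ Q) @ (K @ Q).T = X @ K.T leaves the coefficients unchanged.
lhs = toy_attention(X @ Q, K @ Q)
rhs = toy_attention(X, K) @ Q
assert np.allclose(lhs, rhs)
```

The key design point mirrors the argument in the text: because the coefficients are built from inner products, they are invariant under any orthogonal change of the embedding basis, and the output inherits the transformation of the inputs.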

