SYNTHESIZER: RETHINKING SELF-ATTENTION FOR TRANSFORMER MODELS

Abstract

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose SYNTHESIZER, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation, and the GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that the simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.

1. INTRODUCTION

Transformer models (Vaswani et al., 2017) have demonstrated success across a wide range of tasks. This has resulted in Transformers largely displacing once popular auto-regressive and recurrent models in recent years. At the heart of Transformer models lies the query-key-value dot product attention. The success of Transformer models is widely attributed to this self-attention mechanism, since fully connected token graphs, which are able to model long-range dependencies, provide a robust inductive bias. But is the dot product self-attention really so important? Do we need it? Is it necessary to learn attention weights via pairwise dot products? This paper seeks to develop a deeper understanding of the role that the dot product self-attention mechanism plays in Transformer models.

The fundamental role of dot product self-attention is to learn self-alignment, i.e., to determine the relative importance of a single token with respect to all other tokens in the sequence. To this end, memory metaphors and analogies have been constructed to support this claim. Indeed, the terms query, keys, and values imply that self-attention emulates a content-based retrieval process that leverages pairwise interactions at its very core. Moving against convention, this paper postulates that we can do without not only dot product self-attention but content-based, memory-like self-attention altogether.

Traditionally, attention weights are learned at the instance or sample level, where weights are produced by instance-level pairwise interactions. As a result, these instance-specific interactions often fluctuate freely across different instances, as they lack a consistent global context. This paper proposes SYNTHESIZER, a new model that learns to synthesize the self-alignment matrix instead of manually computing pairwise dot products. We propose a diverse suite of synthesizing functions and extensively evaluate them.
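To make the contrast concrete, the following is a minimal NumPy sketch (not the paper's implementation; all function and variable names are our own) of a single standard dot-product attention head next to the simplest synthetic variant, in which the alignment matrix is a freely trainable parameter that never looks at token content:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(X, Wq, Wk, Wv):
    # Standard Transformer self-attention: the (l, l) alignment matrix is
    # computed from pairwise query-key dot products, so it depends on the
    # content of every token pair in this particular instance.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

def random_synthesizer_attention(X, R, Wv):
    # Synthetic attention (random variant): R is a trainable (l, l) parameter
    # shared across all instances; the alignment is independent of X entirely.
    A = softmax(R)
    return A @ (X @ Wv)
```

Both heads map a length-l sequence of d-dimensional tokens to an output of the same shape; the only difference is where the row-stochastic alignment matrix comes from.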
We characterize the source information that these synthesizing functions receive, i.e., whether they receive information from individual tokens, token-token interactions, and/or global task information. Intuitively, different source inputs to the synthesizing functions should capture diverse views, which may be useful when employed in conjunction.

Aside from generalizing the standard Transformer model, we show that it is possible to achieve competitive results with fully global attention weights that do not consider token-token interactions or any instance-level (local) information at all. More specifically, a random matrix SYNTHESIZER model achieves a 27.27 BLEU score on WMT 2014 English-German¹. Via a set of rigorous experiments, we observe that the popular and well-established dot-product content-based attention can, in some cases, be approximated with simpler variants such as random matrices or dense layers without sacrificing much performance. In our experiments, we also show that our relatively simple Synthesizer models outperform Dynamic Convolutions (Wu et al., 2019) with a +3.5% relative improvement in perplexity while being 60% faster. On encoding tasks, our factorized Synthesizers can outperform other low-rank efficient Transformer models such as Linformers (Wang et al., 2020). While simple Synthesizer models are able to perform competitively, our experiments show that the pairwise dot product is still ultimately helpful. When composing our synthesizing functions with dot products, we find that they consistently improve the performance of Transformers. In general, we believe our findings will spur further investigation and discussion about the true role and utility of the self-attention mechanism in Transformer models.

Our Contributions. Our key contributions are as follows:

• We propose Synthetic Attention, a new way of learning to attend without explicitly attending (i.e., without dot product attention or content-based attention). Instead, we generate the alignment matrix independently of token-token dependencies and explore a potpourri of parameterized functions for synthesizing attention matrices.

• We propose SYNTHESIZER, a new model that leverages Synthetic Attention. The model performs competitively with state-of-the-art Transformer models on a wide range of language tasks, including machine translation and language modeling.

• Moreover, we show that (1) random learnable alignment matrices perform competitively and (2) token-token dependencies are not necessary to achieve good performance with Transformer models on certain tasks.

• On large-scale masked language modeling on the C4 dataset (Raffel et al., 2019) and finetuning on the SuperGLUE and GLUE benchmarks, we show that simple random Synthesizers can outperform or match Lightweight Dynamic Convolutions (Wu et al., 2019), while also outperforming Transformers and Universal Transformers (Dehghani et al., 2018). On two encoding tasks, factorized random Synthesizers outperform low-rank Linformers (Wang et al., 2020).
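Two of the variants referenced above can also be sketched in a few lines of NumPy. These are illustrative sketches under our own naming, not the paper's code: a "dense" synthesizing function in which each token predicts its own row of attention logits from its content alone (here assumed to be a two-layer MLP with a hidden width h), and a factorized random synthesizer that composes the alignment from two low-rank factors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense_synthesizer_attention(X, W1, b1, W2, b2, Wv):
    # Each token maps its own d-dimensional content to length-l attention
    # logits via an MLP -- per-token information only, no pairwise dot products.
    logits = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2   # (l, l)
    return softmax(logits) @ (X @ Wv)

def factorized_random_synthesizer(X, R1, R2, Wv):
    # Low-rank synthetic attention: the (l, l) alignment is built from two
    # skinny (l, k) factors, cutting parameters from l*l down to 2*l*k.
    A = softmax(R1 @ R2.T)
    return A @ (X @ Wv)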

2. RELATED WORK

Attention-based models are used across a wide spectrum of problem domains. Such models are especially popular, due to their effectiveness, in the language and vision domains. Attention models can be traced back to the machine translation models of Bahdanau et al. (2014) and Luong et al. (2015), where attention is employed to learn soft word alignments between language pairs. The intuition behind the attention mechanism is deeply rooted in the notion of memory-based retrieval (Graves et al., 2014; Weston et al., 2014), in which soft differentiable addressing of memory was initially proposed. The paradigm of learning self-alignments, also known as self-attention, has been largely popularized by Transformer models (Vaswani et al., 2017). This technical narrative has also been explored by a number of other recent studies, including those on intra-attention (Parikh et al., 2016), self-matching networks (Wang et al., 2017), and LSTMN (Cheng et al., 2016). To this end, Transformer models, which function primarily based on self-attention and feed-forward layers, generally serve as a reliable replacement for autoregressive recurrent models.



¹ The originally reported result is 27.30.

