THINKING LIKE TRANSFORMERS

Abstract

What is the computational model behind a transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder (attention and feed-forward computation) into the simple primitives select, aggregate, and zipmap, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a transformer, augmenting it with tools we discover along the way. In particular, we provide RASP programs for histograms, sorting, and even logical inference similar to that of Clark et al. (2020). We further use our model to compare the difficulty of these tasks in terms of the number of layers and attention heads they require. Finally, we show how insights gained from our abstraction might be used to explain phenomena observed in recent works.

1. INTRODUCTION

While Yun et al. (2019) show that sufficiently large transformers can approximate any constant-length sequence-to-sequence function, and Hahn (2019) provides theoretical limitations on their ability to compute functions of unbounded input length, neither of these provides insight into how a transformer may achieve a specific task. Orthogonally, Bhattamishra et al. (2020) provide transformer constructions for several counting languages, but this also does not direct us towards a general model. This is in stark contrast to other neural network architectures, which do have clear computational models. For example, convolutional networks are seen as a sequence of filters (Zhang et al., 2018), and finite-state automata and their variants have been used extensively both for extraction from and theoretical analysis of recurrent neural networks (RNNs) (Omlin & Giles, 1996; Weiss et al., 2018; Rabusseau et al., 2018; Merrill et al., 2020), even inspiring new RNN variants (Joulin & Mikolov, 2015).

In this work we propose a computational model for the transformer-encoder, in the form of a simple sequence-processing language which we dub RASP (Restricted Access Sequence Processing Language). Much like automata describe the token-by-token processing behavior of an RNN, our language captures the unique information-flow constraints under which a transformer (Vaswani et al., 2017) operates as it processes input sequences. Considering computational problems and their implementation in the RASP language allows us to "think like a transformer" while abstracting away the technical details of a neural network in favor of symbolic programs.

A RASP program operates on sequences of values of uniform atomic types, and transforms them by composing a restricted set of sequence processors. One pair of processors is used to select inputs for aggregation, and then aggregate the selected items. Another processor performs arbitrary but local (position-wise) computation over its input.
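To make the three primitives concrete, the following is a minimal Python sketch of how they might operate over plain lists. The function names follow the primitives above, but the signatures and the mean-pooling behavior of aggregate (mirroring attention's weighted average) are our own simplification, not the actual RASP implementation.

```python
# Illustrative sketch of the select / aggregate / zipmap primitives.
# Signatures are simplified for exposition; not the official RASP API.

def select(keys, queries, predicate):
    """For each query position, mark which key positions are selected.
    Returns a boolean selection matrix (one row per query position)."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selection, values):
    """Average the selected values at each position, mirroring how
    attention reduces a stream of numbers to one value per position."""
    out = []
    for row in selection:
        chosen = [v for sel, v in zip(row, values) if sel]
        out.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return out

def zipmap(sequences, f):
    """Apply an arbitrary but local (position-wise) function, like a
    feed-forward layer acting independently at each position."""
    return [f(*args) for args in zip(*sequences)]

# Example: a running mean, composed from the primitives.
vals = [3, 1, 4, 1, 5]
idx = list(range(len(vals)))
prefix = select(idx, idx, lambda k, q: k <= q)   # attend to positions <= mine
running_mean = aggregate(prefix, vals)            # mean of the prefix at each position
deviation = zipmap([vals, running_mean], lambda v, m: v - m)
```

Note that each position only sees the rest of the sequence through the single averaged value that aggregate returns; any further processing of that value must be local, which is exactly the information-flow restriction discussed below.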
However, access to the complete sequence is available only through aggregate operations that reduce a stream of numbers to a scalar. The key to performing complex global computations under this model is to compose the aggregations such that they gather the correct information, which can then be locally processed for a final output. Given a RASP program, we can analyze it to infer the minimal number of layers and maximum number of heads required to implement it as a transformer. We show several examples of expressive programs written in the RASP language, showing how complex operations can be

