

Abstract

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) have dominated natural language processing (NLP) in recent years, from large-scale machine translation (Ott et al., 2018) to pre-trained (masked) language modeling (Devlin et al., 2018; Radford et al., 2018), and are becoming more popular in other fields as well, from reinforcement learning (Vinyals et al., 2019) to speech recognition (Baevski et al., 2019) and computer vision (Carion et al., 2020). Their success is enabled in part by ever-increasing computational demands, which has naturally led to an increased interest in improving their efficiency. Scalability gains in transformers could facilitate bigger, deeper networks with longer contexts (Kitaev et al., 2020; Wang et al., 2020; Beltagy et al., 2020; Kaplan et al., 2020; Tay et al., 2020b). Conversely, improved efficiency could reduce environmental costs (Strubell et al., 2019) and hopefully help democratize the technology.

In this work, we explore a simple question: if some layers of the transformer are kept frozen (i.e., never updated after random initialization), can we match the performance of fully learned transformers, while being more efficient? Surprisingly, the answer is resoundingly yes; and what is more, we find that freezing layers may actually improve performance.

Beyond desirable efficiency gains, random layers are interesting for several additional reasons. Fixed randomly initialized networks (Gallicchio & Scardapane, 2020) converge to Gaussian processes in the limit of infinite width (Daniely et al., 2016), have intriguing interpretations in metric learning (Rosenfeld & Tsotsos, 2019; Giryes et al., 2016), and have been shown to provide excellent "priors" either for subsequent learning (Ulyanov et al., 2018) or pruning (Frankle & Carbin, 2018).
Fixed layers allow for efficient low-cost hardware implementations (Schrauwen et al., 2007) and can be characterized using only a random number generator and its seed, which might have repercussions for distributed training and enables highly efficient deployment to edge devices. The strong performance of networks with fixed layers also sheds new light on the inner workings of BERT (Devlin et al., 2018) and on layer-wise interpretations of such models (Rogers et al., 2020; Tenney et al., 2019). It appears that "not all layers are created equal" (Zhang et al., 2019) is true to such an extent that some layers can simply remain random and fixed.

These ideas have a long history in machine learning. By Cover's theorem (Cover, 1965), any high-dimensional non-linear transformation is more likely to be linearly separable than its lower- or equal-dimensional input space. By the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), random projections distort Euclidean distances very little under mild assumptions, which is useful e.g. for dimensionality reduction and random indexing (Sahlgren, 2005). Fixed random layers in neural networks pre-date deep learning by far (Gamba et al., 1961; Baum, 1988). Indeed, random kernel methods have been an impactful idea in machine learning (Rahimi & Recht, 2008; 2009).

One way to think of such layers is as "reservoirs" (Lukoševičius & Jaeger, 2009), where a highly non-linear, high-dimensional black-box representation is provided to a lightweight "readout" network, as in echo state networks (Jaeger, 2003) and liquid state machines (Maass et al., 2002). The benefit of such an approach is that the reservoir has fixed parameters and is computationally efficient, as it can be pre-computed and does not (necessarily) require backpropagation.
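The Johnson-Lindenstrauss property invoked above is easy to check empirically. The following sketch projects random high-dimensional points through a scaled Gaussian matrix and verifies that pairwise Euclidean distances are roughly preserved; all dimensions and tolerances here are arbitrary illustrative choices, not values from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_points = 2_000, 1_024, 20
X = rng.normal(size=(n_points, d_in))

# Random projection, scaled so squared norms are preserved in expectation.
R = rng.normal(size=(d_in, d_out)) / np.sqrt(d_out)
Y = X @ R

def pairwise_dists(Z):
    # Euclidean distance between every pair of rows.
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

orig = pairwise_dists(X)
proj = pairwise_dists(Y)

# Off-diagonal distortion ratios should be close to 1.
mask = ~np.eye(n_points, dtype=bool)
ratios = proj[mask] / orig[mask]
print(ratios.min(), ratios.max())  # typically within a few percent of 1.0
```

With d_out = 1024, the typical distortion per pair is on the order of 1/sqrt(d_out), i.e. a few percent, consistent with the lemma's guarantee.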
In NLP, Wieting & Kiela (2019) showed that random sentence encoders present a strong baseline for text classification, with subsequent work showing applications in a variety of NLP tasks (Enguehard et al., 2019; Garg et al., 2020; Pilault et al., 2020). To our knowledge, this work is the first to examine this phenomenon in transformers, and the first to recursively alternate reservoirs with subsequent transformer layers acting as readout functions.

We introduce "reservoir transformers", wherein fixed random reservoir layers are interspersed with regular updateable transformer layers. The goal of this work is not necessarily to set a new state of the art, but to put our understanding of transformer models on a more solid footing by providing empirical evidence of their capabilities even when some of their parameters are fixed. Our contributions are as follows:

• We introduce a new area under the convergence curve metric for measuring performance-efficiency trade-offs, and show that replacing regular transformer layers with reservoir layers leads to better results on that metric.

• We show that the addition of reservoir layers in fact leads to improved test set generalization on a variety of tasks in a variety of settings.

• We show that pre-trained masked language modelling architectures like BERT and RoBERTa (Liu et al., 2019) can benefit from having some of their layers frozen, both during pre-training as well as when fine-tuning on downstream tasks.

• In addition, we experiment with different types of reservoir layers, including convolutional and recurrent neural network-based ones. We also show empirical evidence that the backward pass can be entirely skipped by approximating top-layer gradients using an approach we call backskipping, with a relatively small sacrifice in performance.
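An "area under the convergence curve" style metric can be sketched as a trapezoidal integral of validation performance against wall-clock time, so that a model reaching good performance sooner scores higher. The curve values below are invented for illustration, and the exact formulation used in this work may differ.

```python
import numpy as np

def aucc(times, scores):
    """Trapezoidal area under a (wall-clock time, validation score) curve."""
    dt = np.diff(times)
    mids = (scores[1:] + scores[:-1]) / 2
    return float((dt * mids).sum())

# Two hypothetical runs over the same time budget; model B converges faster
# to the same final score, so it gets a larger area.
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
model_a = np.array([0.0, 10.0, 20.0, 25.0, 26.0])  # e.g. BLEU over time
model_b = np.array([0.0, 18.0, 24.0, 26.0, 26.0])

print(aucc(times, model_a))  # → 68.0
print(aucc(times, model_b))  # → 81.0
```

The metric rewards both final quality and speed of convergence in a single number, which is the trade-off the first contribution targets.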

2. APPROACH

This paper is based on a very simple idea. Neural networks are trained via backpropagation, which involves consecutive steps of matrix addition and multiplication, i.e.,

θ_{t+1} ← θ_t − η ∂J/∂θ_t;    ∂J/∂θ_t = (∂J/∂L_n)(∂L_n/∂L_{n−1}) ⋯ (∂L_1/∂L_0)(∂L_0/∂x)    (1)

for some objective J, parameterization θ, and learning rate η, with the gradient computed via the chain rule, where L_i is the i-th layer of the neural network and x is the input. Let L = Transformer(X) be a single layer in a Transformer network (Vaswani et al., 2017), i.e.,

H = MultiHeadSelfAttn(LayerNorm(X)) + X
L = FFN(LayerNorm(H)) + H

During every "backward pass", we compute the Jacobian for the parameters θ_L of layer L, which is used both to update θ_L and to compute the next layer's Jacobian, thus back-propagating the gradients. In this work, however, for some of the layers we still backpropagate through them to compute gradients for earlier layers, but we never update their parameters. As a result, these layers stay fixed at their random initialization, saving computational resources.
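The mechanism of backpropagating through a fixed layer without ever updating it can be sketched in a few lines of numpy. The toy network below (two trainable linear layers around one frozen "reservoir" layer) is our own illustrative construction, not the transformer architecture from the paper: gradients flow through the frozen layer via the chain rule, as in Eq. (1), but its parameters are simply excluded from the update step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, lr = 8, 16, 5e-3

W0 = rng.normal(scale=0.5, size=(d, h))        # trainable input layer
W_frozen = rng.normal(scale=0.5, size=(h, h))  # fixed "reservoir" layer
W2 = rng.normal(scale=0.5, size=(h, 1))        # trainable readout

x = rng.normal(size=(32, d))
y = rng.normal(size=(32, 1))
W_frozen_before = W_frozen.copy()
losses = []

for _ in range(100):
    # Forward pass.
    a0 = np.maximum(x @ W0, 0.0)           # ReLU
    a1 = np.maximum(a0 @ W_frozen, 0.0)    # frozen layer, still non-linear
    pred = a1 @ W2
    losses.append(((pred - y) ** 2).mean())

    # Backward pass (chain rule).
    g_pred = 2 * (pred - y) / len(x)
    g_W2 = a1.T @ g_pred
    g_a1 = g_pred @ W2.T
    g_a0 = (g_a1 * (a1 > 0)) @ W_frozen.T  # backprop THROUGH the frozen layer
    g_W0 = x.T @ (g_a0 * (a0 > 0))

    # Update only the trainable parameters; skip W_frozen entirely.
    W0 -= lr * g_W0
    W2 -= lr * g_W2

print(losses[0], losses[-1])  # loss decreases; W_frozen is untouched
```

In a framework like PyTorch the same effect is obtained by setting `requires_grad=False` on (or excluding from the optimizer) the frozen layer's parameters, while autograd still propagates gradients through it.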

2.1. BACKGROUND

Naturally, never updating some of the parameters is computationally more efficient, as some matrix addition operations (the parameter updates) can be skipped in the backward pass; but why is this not detrimental to the performance of the network?

In the early days of neural networks, the bottom layers were often kept fixed as "associators" (Block, 1962), or what Minsky & Papert (2017) called the Gamba perceptron (Gamba et al., 1961; Borsellino & Gamba, 1961). Fixed random networks (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) have

