

Abstract

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as in overall performance, on various machine translation and (masked) language modeling tasks.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) have dominated natural language processing (NLP) in recent years, from large-scale machine translation (Ott et al., 2018) to pre-trained (masked) language modeling (Devlin et al., 2018; Radford et al., 2018), and are becoming more popular in other fields as well, from reinforcement learning (Vinyals et al., 2019) to speech recognition (Baevski et al., 2019) and computer vision (Carion et al., 2020). Their success is enabled in part by ever-increasing computational demands, which has naturally led to an increased interest in improving their efficiency. Scalability gains in transformers could facilitate bigger, deeper networks with longer contexts (Kitaev et al., 2020; Wang et al., 2020; Beltagy et al., 2020; Kaplan et al., 2020; Tay et al., 2020b). Conversely, improved efficiency could reduce environmental costs (Strubell et al., 2019) and hopefully help democratize the technology.

In this work, we explore a simple question: if some layers of the transformer are kept frozen, i.e., never updated after random initialization, can we match the performance of fully learned transformers while being more efficient? Surprisingly, the answer is a resounding yes; what is more, we find that freezing layers may actually improve performance.

Beyond desirable efficiency gains, random layers are interesting for several additional reasons. Fixed randomly initialized networks (Gallicchio & Scardapane, 2020) converge to Gaussian processes in the limit of infinite width (Daniely et al., 2016), have intriguing interpretations in metric learning (Rosenfeld & Tsotsos, 2019; Giryes et al., 2016), and have been shown to provide excellent "priors" either for subsequent learning (Ulyanov et al., 2018) or for pruning (Frankle & Carbin, 2018).
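As a minimal sketch of the core question, the following PyTorch snippet freezes alternating layers of a small transformer encoder at their random initialization, so the optimizer only ever updates the remaining layers. The layer sizes and the alternating pattern are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

# A small stack of standard transformer encoder layers (sizes are illustrative).
d_model, nhead, num_layers = 64, 4, 4
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
     for _ in range(num_layers)]
)

# Freeze the odd-indexed layers: their randomly initialized parameters receive
# no gradients and are never updated during training.
for i, layer in enumerate(layers):
    if i % 2 == 1:
        for p in layer.parameters():
            p.requires_grad_(False)

trainable = sum(p.numel() for l in layers for p in l.parameters() if p.requires_grad)
total = sum(p.numel() for l in layers for p in l.parameters())
```

Since all layers share the same architecture, freezing every other layer halves the number of trainable parameters, while the frozen layers can be reconstructed from the random seed alone.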
Fixed layers allow for efficient low-cost hardware implementations (Schrauwen et al., 2007) and can be characterized using only a random number generator and its seed, which might have repercussions for distributed training and enables highly efficient deployment to edge devices. The strong performance of networks with fixed layers also sheds new light on the inner workings of BERT (Devlin et al., 2018) and on layer-wise interpretations of such models (Rogers et al., 2020; Tenney et al., 2019). It appears that "not all layers are created equal" (Zhang et al., 2019) is true to such an extent that some layers can simply remain random and fixed.

These ideas have a long history in machine learning. By Cover's theorem (Cover, 1965), any high-dimensional non-linear transformation is more likely to be linearly separable than its lower-or-equal-dimensional input space. By the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), random projections distort Euclidean distances very little under mild assumptions, which is useful, e.g., for dimensionality reduction and random indexing (Sahlgren, 2005). Fixed random layers in neural networks pre-date deep learning by far (Gamba et al., 1961; Baum, 1988). Indeed, random kernel methods have been an impactful idea in machine learning (Rahimi & Recht, 2008; 2009). One way to think of such layers is as "reservoirs" (Lukoševičius & Jaeger, 2009), where a highly non-linear, high-dimensional black-box representation is provided to a lightweight "readout" network, as in echo state networks (Jaeger, 2003) and liquid state machines (Maass et al., 2002). The
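The reservoir view can be sketched in a few lines of NumPy: a fixed random non-linear layer lifts the inputs into a high-dimensional space, where, in the spirit of Cover's theorem, they become (almost) linearly separable, and only a small linear readout is trained. The XOR task, dimensions, and ridge penalty below are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-linearly-separable task: XOR of the input signs.
X = rng.standard_normal((500, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)

# Linear readout on the raw 2-D inputs: XOR is not linearly separable here,
# so this baseline should hover near chance.
Xb = np.hstack([X, np.ones((500, 1))])
w_lin, *_ = np.linalg.lstsq(Xb, y, rcond=None)
lin_acc = (((Xb @ w_lin) > 0.5) == (y > 0.5)).mean()

# "Reservoir": a fixed random non-linear expansion that is never trained.
d_hidden = 256
W = rng.standard_normal((2, d_hidden))
b = rng.standard_normal(d_hidden)
H = np.tanh(X @ W + b)

# Lightweight trained "readout": ridge regression on the frozen features.
lam = 1e-3
w = np.linalg.solve(H.T @ H + lam * np.eye(d_hidden), H.T @ y)
acc = (((H @ w) > 0.5) == (y > 0.5)).mean()
```

The readout's accuracy on the lifted features should far exceed the linear baseline's, even though the expansion itself is random and frozen.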

