

Abstract

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) have dominated natural language processing (NLP) in recent years, from large scale machine translation (Ott et al., 2018) to pre-trained (masked) language modeling (Devlin et al., 2018; Radford et al., 2018), and are becoming more popular in other fields as well, from reinforcement learning (Vinyals et al., 2019) to speech recognition (Baevski et al., 2019) and computer vision (Carion et al., 2020). Their success is enabled in part by ever-increasing computational demands, which has naturally led to an increased interest in improving their efficiency. Scalability gains in transformers could facilitate bigger, deeper networks with longer contexts (Kitaev et al., 2020; Wang et al., 2020; Beltagy et al., 2020; Kaplan et al., 2020; Tay et al., 2020b). Conversely, improved efficiency could reduce environmental costs (Strubell et al., 2019) and hopefully help democratize the technology. In this work, we explore a simple question: if some layers of the transformer are kept frozen (i.e., never updated after random initialization), can we match the performance of fully learned transformers, while being more efficient? Surprisingly, the answer is resoundingly yes; and what is more, we find that freezing layers may actually improve performance. Beyond desirable efficiency gains, random layers are interesting for several additional reasons. Fixed randomly initialized networks (Gallicchio & Scardapane, 2020) converge to Gaussian processes in the limit of infinite width (Daniely et al., 2016), have intriguing interpretations in metric learning (Rosenfeld & Tsotsos, 2019; Giryes et al., 2016), and have been shown to provide excellent "priors" either for subsequent learning (Ulyanov et al., 2018) or pruning (Frankle & Carbin, 2018). 
Fixed layers allow for efficient low-cost hardware implementations (Schrauwen et al., 2007) and can be characterized using only a random number generator and its seed, which might have repercussions for distributed training and enables highly efficient deployment to edge devices. The strong performance of networks with fixed layers also sheds new light on the inner workings of BERT (Devlin et al., 2018), and layer-wise interpretations of such models (Rogers et al., 2020; Tenney et al., 2019). It appears that "not all layers are created equal" (Zhang et al., 2019) is true to such an extent that some layers can simply remain random and fixed. These ideas have a long history in machine learning. By Cover's theorem (Cover, 1965), any high-dimensional non-linear transformation is more likely to be linearly separable than its lower-or-equal-dimensional input space. By Johnson-Lindenstrauss (Johnson & Lindenstrauss, 1984), random projections distort Euclidean distances very little under mild assumptions, which is useful e.g. for dimensionality reduction and random indexing (Sahlgren, 2005). Fixed random layers in neural networks pre-date deep learning by far (Gamba et al., 1961; Baum, 1988). Indeed, random kernel methods have been an impactful idea in machine learning (Rahimi & Recht, 2008; 2009). One way to think of such layers is as "reservoirs" (Lukoševičius & Jaeger, 2009), where a highly non-linear high-dimensional black box representation is provided to a lightweight "readout" network, as in echo state networks (Jaeger, 2003) and liquid state machines (Maass et al., 2002). The benefit of such an approach is that the reservoir has fixed parameters and is computationally efficient, as it can be pre-computed and does not (necessarily) require backpropagation. 
In NLP, Wieting & Kiela (2019) showed that random sentence encoders present a strong baseline for text classification, with subsequent work showing applications in a variety of NLP tasks (Enguehard et al., 2019; Garg et al., 2020; Pilault et al., 2020). To our knowledge, this work is the first to examine this phenomenon in transformers, and the first to recursively alternate reservoirs with subsequent transformer layers acting as readout functions. We introduce "reservoir transformers", wherein fixed random reservoir layers are interspersed with regular updateable transformer layers. The goal of this work is not necessarily to set a new state of the art, but to put our understanding of transformer models on a more solid footing by providing empirical evidence of their capabilities even when some of their parameters are fixed. Our contributions are as follows:
• We introduce a new area under the convergence curve metric for measuring performance-efficiency trade-offs, and show that replacing regular transformer layers with reservoir layers leads to better results on that metric.
• We show that the addition of reservoir layers in fact leads to improved test set generalization on a variety of tasks in a variety of settings.
• We show that pre-trained masked language modelling architectures like BERT and RoBERTa (Liu et al., 2019) can benefit from having some of their layers frozen, both during pre-training as well as when fine-tuning on downstream tasks.
• In addition, we experiment with different types of reservoir layers, including convolutional and recurrent neural network-based ones. We also show empirical evidence that the backward pass can be entirely skipped by approximating top-layer gradients using an approach we call backskipping, with a relatively small sacrifice in performance.

2. APPROACH

This paper is based on a very simple idea. Neural networks are trained via backpropagation, which involves consecutive steps of matrix addition and multiplication, i.e.,

$$\theta_{t+1} \leftarrow \theta_t - \eta \frac{\partial J}{\partial \theta_t}; \qquad \frac{\partial J}{\partial \theta_t} = \frac{\partial J}{\partial L_n} \frac{\partial L_n}{\partial L_{n-1}} \cdots \frac{\partial L_1}{\partial L_0} \frac{\partial L_0}{\partial x} \tag{1}$$

for some objective $J$, parameterization $\theta$ and learning rate $\eta$, with the gradient computed via the chain rule, where $L_i$ is the $i$-th layer of the neural network and $x$ is the input. Let $L = \mathrm{Transformer}(X)$ be a single layer in a Transformer network (Vaswani et al., 2017), i.e.,

$$H = \mathrm{MultiHeadSelfAttn}(\mathrm{LayerNorm}(X)) + X, \qquad L = \mathrm{FFN}(\mathrm{LayerNorm}(H)) + H \tag{2}$$

Now, during every "backward pass", we compute the Jacobian for parameters $\theta^L$ at layer $L$, which is used to update the parameters of $L$, $\theta^L_t$, as well as to compute the next layer's Jacobian, thus back-propagating the gradients. In this work however, for some of the layers, we still backpropagate through them to compute gradients for earlier layers, but we never update their parameters. As a result, these layers stay fixed at their random initialization, saving computational resources.
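To make the distinction between back-propagating through a layer and updating it concrete, the following is a minimal sketch on a toy three-layer scalar network. It is an illustration only, not the implementation used in the experiments: the network, the function name `train`, the `frozen` argument and all numeric values are assumptions introduced here.

```python
import math

def train(weights, frozen, x, target, lr=0.1, steps=200):
    """Gradient descent on a toy 3-layer scalar chain.

    `frozen` is a set of layer indices whose weights are never updated
    (hypothetical interface, for illustration only).
    """
    w = list(weights)
    losses = []
    for _ in range(steps):
        # Forward pass: two tanh layers followed by a linear output layer.
        h1 = math.tanh(w[0] * x)
        h2 = math.tanh(w[1] * h1)
        out = w[2] * h2
        losses.append((out - target) ** 2)

        # Backward pass: the chain rule runs through every layer,
        # including the frozen one, so earlier layers still get gradients.
        d_out = 2.0 * (out - target)
        g3 = d_out * h2
        d_z2 = d_out * w[2] * (1.0 - h2 ** 2)
        g2 = d_z2 * h1  # computed here for symmetry; unused if layer 1 is frozen
        d_z1 = d_z2 * w[1] * (1.0 - h1 ** 2)
        g1 = d_z1 * x

        # Parameter update: skipped for frozen layers.
        for i, g in enumerate((g1, g2, g3)):
            if i not in frozen:
                w[i] -= lr * g
    return w, losses

init = [0.5, 0.8, 0.3]
final, losses = train(init, frozen={1}, x=1.0, target=0.5)
print(final, losses[0], losses[-1])
```

In this toy setting the middle weight stays at its initial value while the layers around it adapt, which mirrors the role that the surrounding trainable layers play for a reservoir layer.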

2.1. BACKGROUND

Naturally, never updating some of the parameters is computationally more efficient, as some matrix addition operations can be skipped in the backward pass, but why is this not detrimental to the performance of the network? In the early days of neural networks, the bottom layers were often kept fixed as "associators" (Block, 1962) , or what Minsky & Papert (2017) called the Gamba perceptron (Gamba et al., 1961; Borsellino & Gamba, 1961) . Fixed random networks (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) have been explored from many angles, including as "random kitchen sink" kernel machines (Rahimi & Recht, 2008; 2009) , "extreme learning machines" (Huang et al., 2006) and reservoir computing (Jaeger, 2003; Maass et al., 2002; Lukoševičius & Jaeger, 2009) . In reservoir computing, input data are represented through fixed random high-dimensional non-linear representations, called "reservoirs", which are followed by a regular (often but not necessarily linear) "readout" network to make the final classification decision. The theoretical justification for these approaches lies in two well-known results in machine learning: Cover's theorem (Cover, 1965) on the separability of patterns states that high-dimensional non-linear transformations are more likely to be linearly separable; and the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984) shows that random projections distort Euclidean distances very little under mild assumptions. Practically, random layers can be seen as a cheap way to increase network depth. There are interesting advantages to this approach. Fixed layers are known to have particularly low-cost hardware requirements and can be easily implemented on high-bandwidth FPGAs with low power consumption (Hadaeghi et al., 2017; Tanaka et al., 2019) , or on optical devices (Hicke et al., 2013) . 
This might yield interesting possibilities for training in a distributed fashion across multiple devices, as well as for neuromorphic hardware (Neftci et al., 2017). This approach also facilitates lower-latency deployment of neural networks to edge devices, since weights can be shared simply by sending the seed number, assuming the random number generator is known on both ends.
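The seed-sharing argument can be illustrated with a small sketch. The function name `reservoir_weights`, the Gaussian initialization and its scale are assumptions made for this example; the experiments in this paper use orthogonal initialization instead.

```python
import random

def reservoir_weights(seed, rows, cols, scale=0.02):
    """Regenerate a fixed random weight matrix from a seed alone.

    Hypothetical helper: a Gaussian init is used purely for illustration.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

# Both ends only need to agree on the generator and the seed.
sender = reservoir_weights(seed=1234, rows=4, cols=3)
receiver = reservoir_weights(seed=1234, rows=4, cols=3)
other = reservoir_weights(seed=4321, rows=4, cols=3)
print(sender == receiver, sender == other)
```

Only the integer seed has to be transmitted; the matrix itself is rebuilt locally, provided both ends run the same generator.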

2.2. RESERVOIR TRANSFORMERS

This work explores inserting random non-linear transformations, or what we call reservoir layers, into transformer networks. Specifically, we experiment with a variety of reservoir layers: • Transformer Reservoir: The standard transformer layer as described above, but with all parameters fixed after initialization, including the self-attention module. (Wu et al., 2019) , which are known to be competitive with transformers in sequence-to-sequence tasks. We find that all these approaches work well, to a certain extent. For clarity, we focus primarily on the first two reservoir layers, but include a broader comparison in Appendix A. In each case, contrary to traditional reservoir computing, our reservoir layers are interspersed throughout a regular transformer network, or what we call a reservoir transformer. A good justification for this approach is that while random projections are not learned and might introduce noise, subsequent normal transformer "readout" layers might allow us to recover from any adverse effects of randomness. For example, previous work has shown that ResNets, with all of their parameters fixed except for the scale and shift parameters of batch normalization, can still achieve high performance, simply by scaling and shifting random features (Frankle et al., 2020) . Adding noise to the parameters of neural networks is also known to help convergence and generalization (Jim et al., 1995; 1996; Gulcehre et al., 2016; Noh et al., 2017) .

3. EVALUATION

We evaluate the proposed approach on a variety of well-known tasks in natural language processing, namely: machine translation, language modelling and masked language model pre-training. In this work, we are not necessarily interested in obtaining the state of the art on any task or even in improving overall task performance via this method. The main objective is to examine efficiency, i.e. the relationship between compute time and task performance. This is closely related to efforts in Green AI, which are concerned with the trade-offs between compute, data, and performance (Schwartz et al., 2019). We propose a new metric for our purposes, the area under the convergence curve (AUCC): similarly to how the area under the receiver operating characteristic (AUC-ROC; Bradley, 1997) measures a classifier's performance independent of the classification threshold, AUCC measures a model's performance independent of the specific compute budget. Specifically, AUCC is computed as follows:

$$\int_{t=0}^{T} \sum_{x,y \in D} g_t(f(x), y)$$

where $f$ is the network and $g$ is the evaluation metric, measured until convergence time $T$, which is the maximum convergence time of all models included in the comparison. Note that time here is wall-clock time, not iterations. By convergence, we mean that validation performance has stopped improving, and hence the convergence curve whose area we measure plots the desired metric over time. Runs are averaged over multiple seeds and reported with standard deviation. We normalize raw AUCC scores by their maximum score to ensure a more easily interpretable [0, 1] range. One potential downside of this approach is that the AUCC metric could lead to higher scores for a model that converges quickly but to ultimately worse performance, if measured in a small window. We account for this by making sure that $T$ is set sufficiently high. We include the raw validation curves in the appendix and also report test set generalization in each experiment.
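As a rough illustration of how such a score could be computed from logged validation checkpoints, the sketch below integrates a (wall-clock time, metric) curve with the trapezoidal rule and normalizes by the best score in the comparison. The function names, the trapezoidal approximation, and the choice to hold the last logged value until $T$ are assumptions of this sketch, not a description of the exact procedure used for the reported numbers. It also assumes a higher-is-better metric such as BLEU.

```python
def aucc(curve, horizon):
    """Trapezoidal area under a (time, metric) curve on [0, horizon].

    `curve` is a list of (wall_clock_hours, validation_metric) pairs,
    assumed to start at time 0.
    """
    pts = sorted(curve)
    area = 0.0
    for (t0, v0), (t1, v1) in zip(pts, pts[1:]):
        if t0 >= horizon:
            break
        if t1 > horizon:
            # Linearly interpolate the metric at the horizon and clip.
            v1 = v0 + (v1 - v0) * (horizon - t0) / (t1 - t0)
            t1 = horizon
        area += 0.5 * (v0 + v1) * (t1 - t0)
    last_t, last_v = pts[-1]
    if last_t < horizon:
        # Assumption: a converged model keeps its last value until the horizon.
        area += last_v * (horizon - last_t)
    return area

def normalized_aucc(curves, horizon):
    """Divide each raw AUCC by the maximum so that scores fall in [0, 1]."""
    raw = {name: aucc(curve, horizon) for name, curve in curves.items()}
    best = max(raw.values())
    return {name: area / best for name, area in raw.items()}

# Made-up curves: "fast" converges early, "slow" ends slightly higher.
curves = {
    "fast": [(0, 0), (1, 30), (4, 30)],
    "slow": [(0, 0), (3, 32), (4, 32)],
}
scores = normalized_aucc(curves, horizon=4)
print(scores)
```

With these made-up curves, the model that converges early obtains the higher score even though its final metric is slightly lower, which is exactly the downside discussed above and the reason $T$ must be set sufficiently high.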

3.1. EXPERIMENTAL SETTINGS AND IMPLEMENTATION DETAILS

We evaluate on IWSLT de-en (Cettolo et al., 2015) and WMT en-de (Bojar et al., 2014) for machine translation; enwik8 (LLC, 2009) for language modelling; and experiment with RoBERTa (Liu et al., 2019) in our pretraining experiments. For IWSLT, we follow the pre-processing steps in Edunov et al. (2018). The train/val/test split is 129k/10k/6.8k sentences. For WMT, we follow the pre-processing steps in Ott et al. (2018). The train/val/test split is 4.5M/16.5k/3k sentences. For enwik8, we follow the pre-processing steps in Dai et al. (2019). The train/val/test split is 1M/54k/56k sentences. For RoBERTa pretraining, we follow the pre-processing steps in Liu et al. (2019). We use 8 Volta V100 GPUs for WMT and enwik8, 32 V100 GPUs for RoBERTa and a single V100 for IWSLT. The hyperparameters for IWSLT14 and WMT16 were set to the best-performing values from Ott et al. (2018) and Kasai et al. (2020) respectively. The enwik8 experiment settings followed Bachlechner et al. (2020) and the RoBERTa experiments followed Liu et al. (2019). All experiments were conducted using fairseq (Ott et al., 2019). All the experiments in this paper were run with 3 random seeds and the mean and standard deviation are reported. For the relatively small IWSLT, the T value in the AUCC metric was set to 4 hours. For WMT, which is larger, we set it to 20 hours. For enwik8, it was 30 hours; and for the RoBERTa pre-training experiments, it was set to 60 hours. The projection weights in random layers were initialized using orthogonal initialization (Saxe et al., 2013), which makes sense since random orthogonal projections should be most information-preserving, and which was found to work well empirically for initializing fixed random representations in previous work (Wieting & Kiela, 2019). Biases and layer norm parameters were initialized using their respective PyTorch defaults (based on Xavier init; Glorot & Bengio, 2010). 
We intersperse reservoir layers in alternating fashion starting from the middle. Specifically, we alternate one reservoir layer with one transformer layer, and place the alternating block in the middle. For example: a 7-layer encoder LLLLLLL in which we replace three layers with reservoirs becomes LRLRLRL, and with two becomes LLRLRLL. See Appendix C for a study comparing this strategy to alternative approaches (e.g., freezing in the bottom, middle or top).
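The placement rule can be written down as a short sketch. The function name `reservoir_layout` is hypothetical, and the tie-breaking used when the alternating block cannot be centred exactly (the extra regular layer goes after the block) is an assumption of this sketch rather than something specified above.

```python
def reservoir_layout(num_layers, num_reservoir):
    """Return a string of 'L' (regular) and 'R' (reservoir) layers.

    The reservoirs alternate with regular layers in a block placed
    in the middle of the stack.
    """
    if num_reservoir == 0:
        return "L" * num_layers
    # Alternating block, e.g. 3 reservoirs -> "RLRLR".
    block = "RL" * (num_reservoir - 1) + "R"
    if len(block) > num_layers:
        raise ValueError("too many reservoir layers for this depth")
    start = (num_layers - len(block)) // 2
    tail = num_layers - start - len(block)
    return "L" * start + block + "L" * tail

print(reservoir_layout(7, 3))  # LRLRLRL
print(reservoir_layout(7, 2))  # LLRLRLL
```

Both printed layouts reproduce the 7-layer examples given in the text.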

4. EXPERIMENTS

In what follows, we first show our main result: reservoir transformers often have better AUCC metrics, less training time per epoch, less convergence time until the best validation performance is achieved, and even improved test set generalization metrics, on a variety of tasks. As a strong baseline method, we compare to LayerDrop (Fan et al., 2019). LayerDrop can also be seen as a method that dynamically bypasses parts of the computation during Transformer training in an attempt to improve efficiency, and is therefore a suitable point of comparison for our methods. We also examine whether we can minimize the expectation over the gradients of upper layers in the transformer network such that we do not have to pass the true gradients through the reservoir for further efficiency.

4.1. MACHINE TRANSLATION

Machine translation (MT) is one of the core tasks of NLP. We demonstrate on two well-known MT datasets, IWSLT'14 German-English and WMT'16 English-German, that reservoir transformers obtain a better AUCC. For the raw validation plots over time that were used to calculate the AUCC, please refer to Appendix F. Following Kasai et al. (2020), the architecture of the network is an N-layer reservoir transformer encoder, followed by a regular shallow one- or two-layer decoder. This design choice has been shown to lead to very good speed and efficiency trade-offs, and serves as a good baseline for our experiments. Moreover, shallow decoders make it easier to decide where to place reservoir layers (in the encoder) and make it more straightforward to identify where performance gains come from. Figure 1 shows the results for IWSLT. On the y-axis we show validation AUCC for the BLEU metric; on the x-axis we show the number of updatable layers in the encoder. The performance of a regular transformer encoder with 6 layers and a reservoir transformer encoder with 6 layers plus N additional reservoir layers are plotted for the same x-axis value to show the total number of updated layers. Plots for the total number of layers (updatable plus not-updatable, so essentially shifted versions) are shown in Appendix E. Table 1 shows the time it took to achieve the maximum validation BLEU score and how that relates to the regular transformer, demonstrating that reservoir transformers consistently converge faster in terms of wall-clock time, by up to 22% with the same number of updateable layers. We save as much as 27% of the time until convergence for a 24-layer model on WMT, as shown in Table 3. One other noticeable point is that the T Reservoir achieves similar performance to LayerDrop on IWSLT and WMT in terms of wall-clock time per epoch and wall-clock time to the best performance. 
However, on both tasks, FFN Reservoir performs much better than LayerDrop in terms of efficiency per epoch and achieves better or similar performance in less time in each case. As a point of reference, a half hour gain on IWSLT translates to a gain of several days in the training of bigger transformer models like GPT-3 (Brown et al., 2020). We observe that reservoir transformers consistently perform better than, or are competitive with, regular transformers, both in terms of validation BLEU AUCC as well as test time BLEU, for all examined encoder depths. Figure 2 shows a similar trend for WMT. WMT is much larger and requires a much deeper encoder, as illustrated by the fact that a certain minimum depth is required for reservoir transformers to achieve a comparable validation AUCC. At test time, reservoir transformers outperform regular transformers for almost all encoder depths. The FFN reservoir transformer seems to work best in both cases, which is surprising because it does not have any self-attention component at all. This finding suggests that self-attention, or the mechanism to summarize context information, should be learned if present. Once the context features have been gathered, a random projection via a fixed FFN module appears to be beneficial, at least for MT.

4.2. LANGUAGE MODELLING

To examine whether the same findings hold for other tasks, we evaluate on the enwik8 (LLC, 2009) language modelling task. We examine the BPC (bits per character) rate for a variety of network depths (since the task is language modelling, these layers are in the decoder). The results show that we obtain consistently better BPC at lower depths; the exception is the 64-layer regular transformer, which appears to be particularly well suited to this task. We observe similar trends at test time.

4.3. MASKED LANGUAGE MODEL PRETRAINING

We train RoBERTa (Liu et al., 2019) models from scratch at a variety of depths, both in the normal and reservoir setting. We find that these networks show minor differences in their best perplexity and similar AUCC perplexity (see Appendix D). We then examine the performance of these models when fine-tuned on downstream tasks, specifically the well known SST-2 (Socher et al., 2013) and MultiNLI (Williams et al., 2017) tasks. When fine-tuning the reservoir models, we keep the reservoir layers fixed (including them in fine-tuning did not work very well, see Appendix D). Figure 4 shows the results of fine-tuning. We observe that the reservoir transformer outperforms normal RoBERTa at all depths in both tasks. At lower depth, the improvements are substantial. As a sanity check, we also experiment with freezing some of the layers in normal RoBERTa during fine-tuning (Transformer frozen finetuned) and show that this helps a little but is still outperformed by the reservoir transformer. These findings suggest that you can train a RoBERTa model without updating all of the layers, achieve similar perplexity at a similar computational cost, but with better downstream performance. The fact that some layers can be kept random and entirely fixed during training, without sacrificing any performance, raises intriguing questions for "BERTology" (Rogers et al., 2020) and for the study of what different layers in transformers learn.

With the reservoir transformers as described above, we obtain better efficiency by skipping the "gradient application" matrix addition step in some of the layers (i.e., updating the weights). One step further would be to investigate skipping the entire backward pass for reservoirs altogether, which would save us from having to do the much more expensive matrix multiplication for these layers that is required for the propagation of gradients. 
We report on preliminary experiments where in the backward pass we replace the gradients for the layer $L_i$ going into the reservoir $L_{i+1}$ with a noisy estimate (Jaderberg et al., 2017; Czarnecki et al., 2017). Promisingly, Oktay et al. (2020) recently asked "why spend resources on exact gradients when we're going to use stochastic optimization?" and show that you can do randomized auto-differentiation quite successfully. Here, rather than minimizing the actual gradients $\frac{\partial L_i}{\partial \theta^{L_i}}$, we minimize their expectation and train via continuous-action REINFORCE (Williams, 1992). That is, $L_i$ becomes a policy $\pi_a : s \to \mu$ where we sample actions $a \sim \mathcal{N}(\mu, 1)$. We train to minimize the gradient prediction loss via MSE, i.e., $\frac{1}{n} \sum_{i=0}^{n} (R_i - V_i(a))^2$, and the REINFORCE loss $\mathbb{E}_a[\log(a)(R - V(a))]$, where the value network $V$ acts as the baseline. $R$ is defined as the mean of the gradients of the top layer $L_{i+2}$, with the sign flipped. Thus, simply put, we train to minimize the expectation of the true gradients at the layer directly following the reservoir. We employ an annealing scheme where we first train the value network and propagate the true gradients during warmup. Afterwards, we anneal the probability of backskipping rather than performing a true backward pass (multiplying the probability by 0.99 every iteration until we only backskip). We experimented with setting $R$ to the negation of the total loss as well but found the current reward to work better. We call this approach backskipping. Figure 5 shows the results as validation BLEU over time. We observe that this approach helps especially during the earlier stages of training. Although it does not match the performance of the approach with true gradients quite yet, it actually performs competitively. 
Backskipping looks promising as an approach to further reduce computational costs, and would be even more efficient from a hardware perspective since the circuitry for such layers (which do not need to propagate gradients) can effectively be hardwired entirely.
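The annealing scheme used for backskipping can be sketched as a simple schedule. This is one possible reading of the description above, under stated assumptions: the probability of performing a true backward pass starts at 1 during warmup and is then multiplied by 0.99 at every iteration, so the probability of backskipping approaches 1. The function names and the `warmup_steps` parameter are hypothetical.

```python
import random

def backskip_probability(step, warmup_steps, decay=0.99):
    """Probability of backskipping (instead of a true backward pass) at `step`."""
    if step < warmup_steps:
        return 0.0  # warmup: always propagate the true gradients
    # Assumed reading: the true-pass probability decays geometrically.
    true_pass_prob = decay ** (step - warmup_steps + 1)
    return 1.0 - true_pass_prob

def should_backskip(step, warmup_steps, rng):
    """Sample whether this iteration uses the approximate gradients."""
    return rng.random() < backskip_probability(step, warmup_steps)

rng = random.Random(0)
decisions = [should_backskip(s, warmup_steps=100, rng=rng) for s in range(2000)]
print(backskip_probability(100, 100), backskip_probability(1999, 100))
```

Under this reading, training uses only true gradients during warmup and almost exclusively approximate gradients a few hundred iterations later.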

5. RELATED WORK

Recent work has shown that modern NLP models are able to function with different numbers of layers for different examples (Elbayad et al., 2019; Fan et al., 2019) ; that different layers specialize for different purposes (Zhang et al., 2019) ; that layers can be compressed (Li et al., 2020) ; and, that layers can be reordered (Press et al., 2019) . There is a growing body of work in efficient self-attention networks (Tay et al., 2020b) , such as linear attention (Wang et al., 2020) , on how to process long context information (Beltagy et al., 2020) and on approximations to make transformers more scalable (Kitaev et al., 2020; Katharopoulos et al., 2020) . BigBIRD (Zaheer et al., 2020) provides random keys as additional inputs to its attention mechanism. Locality sensitive hashing (LSH) as employed e.g. in Reformer (Kitaev et al., 2020) utilizes a fixed random projection. Performer (Choromanski et al., 2020) computes the transformer's multi-head attention weights as a fixed orthogonal random projection. Closely related to this work, Tay et al. (2020a) showed that randomized alignment matrices in their "Synthesizer" architecture are sufficient for many NLP tasks. While these works focus on random attention, we show that entire layers can be random and fixed. We also show that entire layers can be replaced by fixed random projections that do not have any attention whatsoever. Beyond transformers, random features have been extensively explored. Examples of this include FreezeOut (Brock et al., 2017) 

A HYBRID NETWORKS AND NON-TRANSFORMER RESERVOIRS

We investigate whether reservoir layers need to be transformer-based (or transformers-without-attention, i.e., FFN). We examine two different alternatives: bidirectional Gated Recurrent Units (Cho et al., 2014) and Convolutional Neural Networks (LeCun et al., 1998; Kim, 2014), specifically light dynamical convolutions (Wu et al., 2019). Figure 6 shows the results for these hybrids: depending on the setting, they may obtain a better AUCC than the regular transformer, but this is less consistent than with the other reservoir layers, most likely because these layers have different computational properties. It is possible that these hybrids simply require further tuning, as we found e.g. up-projecting to help for BiGRUs, but studying this is outside of the scope of the current work.

B DEEP DECODERS

We show that the same results hold for a 6-layer decoder on IWSLT (although less pronounced for AUCC, probably because the decoder is computationally heavier). See Figure 7 and Table 2 .

C FREEZING STRATEGY

We explored different strategies for the placement of reservoir layers and found the "alternating" strategy reported in the main body of the paper to work best. Generally, we found repetitive application of reservoirs to yield diminishing returns, as might be expected. See Figure 8 . 

D ROBERTA RESULTS

Here we present additional RoBERTa results: the convergence plot and AUCC for various decoder depth settings are shown in Figure 10. As stated in the main paper, the differences in AUCC and in the convergence plots between RoBERTa models with and without reservoir layers are limited. Moreover, we plot the downstream task performance for SST-2 and MNLI against the pretraining wall-clock time in Figure 9. It can be seen that the FFN Reservoir can achieve up to 25% and 10% pretraining time savings while matching the best performance of vanilla transformers for MNLI-m and SST-2, respectively.

E RESERVOIR RESULTS FOR TOTAL LAYERS

Here we present the shifted reservoir results for IWSLT14, WMT16, enwik8 and RoBERTa fine-tuning in Figures 11, 12, 13 and 14, respectively. We show that the same results also hold when normal transformer blocks are replaced with reservoir blocks, at least for MT. 

G ROBERTA PROBING

We follow Jawahar et al. (2019) and investigate what the frozen layers in the Reservoir Transformer have actually "learned" (while being frozen) as measured by probing tasks, reported in Table 6. The results are gathered over 3 random seeds, and we report the mean and standard deviation. From the table, we can see that probing performance is generally quite similar between the Transformer and the T Reservoir model. We also noticed that the representations collected after the frozen layers (3, 5, 7, 9) in the T Reservoir actually perform significantly better than the regular Transformer representations across all the probing tasks. This has interesting repercussions for the study of "BERTology", as it shows, somewhat confusingly, that even completely random and frozen layers represent linguistic phenomena. 



We report results for MultiNLI-Matched.

Gallicchio & Micheli, 2017), as well as applications in domains as varied as text classification (Conneau et al., 2017; Zhang & Bowman, 2018; Wieting & Kiela, 2019) or music classification (Pons & Serra, 2019). It is well known that randomly initialized networks can display impressive performance on their own (Ulyanov et al., 2018; Rosenfeld & Tsotsos, 2019; Ramanujan et al., 2020), which underlies, for example, the recently popularized lottery ticket hypothesis (Frankle & Carbin, 2018; Zhou et al., 2019). We know that learning deep overparameterized networks appears to help in general (Li & Liang, 2018; Du et al., 2019). Our method represents an easy and cheap way to add both depth and parameters to transformer networks.

6. CONCLUSION

This work demonstrated that state-of-the-art transformer architectures can be trained without updating all of the layers. This complements a long history in machine learning of harnessing the power of random features. In most cases, "reservoir transformers" achieve better performance-efficiency trade-offs as measured by our newly introduced AUCC metric, and better test set generalization, on a variety of tasks and in a variety of settings. Future work includes further investigating hybrid networks and backskipping architectures, as well as utilizing pruning strategies at inference time, in order to try to obtain even better performance/efficiency trade-offs.



Figure 1: Validation BLEU AUCC and test BLEU for IWSLT (high is good). Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added.

Figure 3: Validation BPC AUCC and test BPC on the enwik8 language modelling task (low is good). Comparison of regular and reservoir transformers for varying depths.

Figure 4: Downstream RoBERTa performance on SST-2 (left) and MultiNLI-matched (right).

Figure 5: IWSLT comparison of normal vs. frozen vs. backskipped.

Figure 7: IWSLT validation AUCC and test BLEU with 6-layer decoder.

Figure 8: IWSLT with 2-layer decoder using different freezing strategy.

Figure 9: RoBERTa Reservoir Results, Pre-training versus downstream task plot for 12 layer RoBERTa. MNLI-m (left). SST-2 (right).

Figure 10: RoBERTa Reservoir Results, Training plot for 12 layer RoBERTa (left). AUCC result (right).

Figure 11: Validation BLEU AUCC and test BLEU for IWSLT (high is good). Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added.

Figure 14: Downstream RoBERTa performance on SST-2 (left) and MultiNLI-matched (right).

Figure 15: IWSLT with 2-layer decoder validation plot (upper left). WMT with 24-layer decoder validation plot (upper right). Enwik8 with 48-layer decoder validation plot (lower left). RoBERTa with 12-layer decoder validation plot (lower right).

Wall-clock time (averaged over multiple runs) saved for IWSLT for different model types and encoder depths. Max BLEU is for validation. Number of layers is for encoder, decoder depth is kept fixed at 2. Ratio is computed compared to comparable number of layers in the normal case.

Validation BLEU AUCC and test BLEU for WMT (high is good). Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added.

, deep reservoir computing networks(Scardapane & Wang, 2017;

Wall-clock time (averaged over multiple runs) saved for IWSLT for different model types and encoder depths. Max BLEU is for validation. Number of layers is for encoder, decoder depth is kept fixed at 6. Ratio is computed compared to comparable number of layers in the normal case.

Wall-clock time (averaged over multiple runs) saved for WMT for different model types and encoder depths. Max BLEU is for validation. Number of layers is for encoder, decoder depth is kept fixed at 1. Ratio is computed compared to comparable number of layers in the normal case.

Wall-clock time (averaged over multiple runs) saved for IWSLT/WMT for different model types and encoder depths. 95% Max BLEU is for validation.

Wall-clock time (averaged over multiple runs) saved for IWSLT/WMT for different model types and encoder depths. 99% Max BLEU is for validation.

-layer decoder model for RoBERTa for detailed steps to calculate the AUCC. It can be clearly observed that given the configurations from Section 3.1, all the models have converged. So when we compute the area under the convergence curve, this depicts the training efficiency of the model (basically time x performance) until convergence. Specifically, we set T sufficiently high for computing the AUCC, which is 4h for IWSLT, 20h for WMT, 30h for enwik8 and 60h for RoBERTa pretraining. From the training plot in the appendix, we can see that each model has converged at that point. The Reservoir model in Figure 15 has 2 layers frozen for IWSLT14, 8 layers frozen for enwik8, and 4 layers frozen for WMT14 and RoBERTa.

