ADDING RECURRENCE TO PRETRAINED TRANSFORMERS

Abstract

Fine-tuning a pretrained transformer for a downstream task has become a standard method in NLP in recent years. While the results from these models are impressive, applying them can be extremely computationally expensive, as is pretraining new models with the latest architectures. We present a novel method for applying pretrained transformer language models which lowers their memory requirement at both training and inference time. An additional benefit is that our method removes the fixed context size constraint that most transformer models have, allowing for more flexible use. When applied to the GPT-2 language model, we find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora, for a given amount of computation or memory.

1. INTRODUCTION

Recent progress in NLP has been dominated by large pretrained transformer neural networks (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019). However, these models have a memory footprint that is quadratic in input sequence length. Although architectural innovations such as those of Kitaev et al. (2019) and Rae et al. (2019) mitigate this and the issue of a predetermined maximum context size, large pretrained models applying these techniques are not available at this time. Even if large pretrained models of this kind are released in the future, they will likely not cover the wide range of domains that BERT-family models have been published for. For example, there have been BERT-based models trained for other languages such as French (Le et al., 2020; Martin et al., 2020), Italian (Polignano et al., 2019), and many other languages (see Nozza et al. (2020) for an overview), as well as for specific domains such as scientific papers (Beltagy et al., 2019), biomedical papers (Lee et al., 2020), and health records (Rasmy et al., 2020). Individuals working with these models may not have the resources to train new models from scratch using the latest tricks, as the computation requirements for pretraining are extremely high. As such, identifying ways that already existing models can be improved could be widely impactful.

Another drawback of this family of models is that they have an a priori fixed maximum context size (typically 512 or 1024 tokens for the currently available pretrained models). A typical application of pretrained language models is producing contextual embeddings for a document. If the document is simply chunked into disjoint segments of 512 tokens, tokens at the boundary of a window will have less contextual information than tokens in the center of a window.
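The boundary effect of disjoint chunking can be seen with a small toy calculation (a sketch, not from the paper; the window size of 4 is an illustrative stand-in for the 512-token windows discussed above):

```python
# Toy illustration: with disjoint chunks, the left context available to a
# token resets to zero at every window boundary.

def left_context_lengths(num_tokens: int, window: int) -> list[int]:
    """Number of preceding tokens visible to each position when the
    sequence is split into disjoint windows of size `window`."""
    return [i % window for i in range(num_tokens)]

# With a window of 4, the first token of each chunk sees no context at
# all, while the last token of each chunk sees 3 preceding tokens.
print(left_context_lengths(8, 4))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

The same pattern holds at scale: with 512-token chunks, every 512th token is embedded with no left context at all.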
This can be mitigated by striding the evaluation of the model and only keeping the embedding for a token when it has the largest context, but this adds quite a bit of wasted computation. In this paper, we propose a method for augmenting and fine-tuning pretrained transformer language models to use context without directly attending to it. Our method simultaneously allows for increasing the context size a transformer processes, while allowing a controllable trade-off between computation and perplexity. We accomplish this by adding a small recurrence module that computes a fixed-size representation from the transformer hidden states in a window of text. The representation for that window is then used during processing of the next window. Shrinking the window size is then a way to reduce the memory footprint of the model, with less loss of performance than would occur with a standard transformer. Our experiments add recurrence to GPT-2 language models and fine-tune them on the PG-19 (Rae et al., 2019) and WikiText-103 (Merity et al., 2016) corpora, requiring only the same amount of memory used for standard fine-tuning of a pretrained model.
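The windowed recurrence described above can be sketched in miniature as follows. This is purely illustrative: the real model uses GPT-2 hidden states and a learned recurrence module, whereas here the transformer and the recurrence module are replaced by toy stand-in functions (`toy_transformer`, `summarize`, and the scalar mean are assumptions of this sketch, not the paper's architecture):

```python
# Illustrative sketch only: toy stand-ins for the transformer and the
# recurrence module, showing how a fixed-size summary of one window
# conditions the processing of the next.

def toy_transformer(window, summary):
    # Stand-in for the transformer: each "hidden state" is the token
    # value shifted by the summary carried over from the previous window.
    return [tok + summary for tok in window]

def summarize(hidden_states):
    # Stand-in recurrence module: compress a window's hidden states into
    # a fixed-size representation (here, a single scalar mean).
    return sum(hidden_states) / len(hidden_states)

def process(tokens, window_size):
    summary = 0.0  # no context exists before the first window
    outputs = []
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        hidden = toy_transformer(window, summary)
        outputs.extend(hidden)
        summary = summarize(hidden)  # carried into the next window
    return outputs

# Later windows are conditioned on a compressed view of earlier ones, so
# context flows across window boundaries without attending to past tokens.
out = process([1.0, 2.0, 3.0, 4.0], window_size=2)
print(out)  # [1.0, 2.0, 4.5, 5.5]
```

Note that memory usage is governed by `window_size` plus the fixed-size summary, rather than by the full sequence length, which is what makes the window size a knob for trading computation against perplexity.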

