COPY IS ALL YOU NEED

Abstract

The dominant text generation models compose the output by sequentially selecting words from a fixed vocabulary. In this paper, we formulate text generation as progressively copying text segments (e.g., words or phrases) from an existing text collection. We compute contextualized representations of meaningful text segments and index them using efficient vector search toolkits. The task of text generation is then decomposed into a series of copy-and-paste operations: at each time step, we seek suitable text spans in the text collection rather than selecting from a standalone vocabulary. Experiments on the standard language modeling benchmark (WikiText-103) show that our approach achieves better generation quality according to both automatic and human evaluations. Moreover, its inference efficiency is comparable to that of token-level autoregressive models thanks to the reduced number of decoding steps. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific text collection without extra training. Finally, we observe that our approach attains additional performance gains by simply scaling up to larger text collections, again without further training.

1. INTRODUCTION

Most neural language models (LMs) perform text generation by making a series of next-token predictions in an autoregressive manner (Radford et al., 2019; Dai et al., 2019; Khandelwal et al., 2020; Shi et al., 2022). Specifically, LMs produce a next-token distribution over a fixed vocabulary for any given prefix. The next token is then selected by a chosen decoding method, such as greedy search or nucleus sampling (Holtzman et al., 2020). This process continues until some stop condition is reached, for example, a special end-of-generation token is emitted or the generated text reaches the maximum length limit.

Unlike traditional neural language models, we reformulate text generation as copying text segments from existing text collections. The text segments can be of variable lengths, including single words and multi-word phrases. For clarity, we will use the term "phrase" to refer to any contiguous text segment, and a single word can also be seen as a phrase of length 1. We compute a contextualized vector representation for each phrase and pack them into an offline index. At each decoding step, a suitable phrase is retrieved from the offline index and appended to the current prefix. In other words, the next-token predictions in traditional neural language models are replaced by a series of copy-and-paste operations.

Our proposed model, named COG (short for COPY-GENERATOR), enjoys the following advantages. First, our method selects phrases in specific contexts rather than standalone tokens from a fixed vocabulary, which potentially allows for more accurate candidate representation and selection. Second, our method allows training-free adaptation to new knowledge sources because the text collection can be updated in a plug-and-play fashion. This could benefit application scenarios such as domain adaptation and data expansion/filtering. Third, our method allows a sequence of multiple tokens (i.e., a multi-word phrase) to be generated in a single step, which can reduce the total number of decoding steps and thus improve inference efficiency.

We conduct extensive experiments to verify the effectiveness of our proposed COG. On the standard language modeling benchmark (WikiText-103), COG substantially outperforms standard baselines on automatic metrics (26.14 vs. 23.43 MAUVE (Pillutla et al., 2021)) and human evaluation (48% vs. 28% human preference). Moreover, when we directly switch the text collection from the WikiText-103 corpus to a domain-specific corpus, Law-MT (Koehn & Knowles, 2017), COG outperforms strong baselines in this domain adaptation setting (28.14 vs. 26.85 MAUVE and 52% vs. 36% human preference) without any domain-specific training. Furthermore, when we scale up the text collection of COG to a larger one, the En-Wiki dataset, we obtain additional gains (26.97 vs. 23.43 MAUVE), again without any further training.

Our contributions can be summarized as follows:
• We propose COG, a method that reformulates text generation tasks as a series of copy-and-paste operations from existing text collections.
• We show that COG can outperform standard neural language model baselines on existing language modeling benchmarks.
• We demonstrate that COG allows for training-free adaptation to larger text collections and domain-specific text collections.
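To make the copy-and-paste formulation concrete, the snippet below is a minimal sketch rather than the actual COG implementation: the phrase vectors and the prefix encoder are random placeholders standing in for contextualized Transformer representations, and brute-force inner-product search stands in for an efficient vector search toolkit such as FAISS.

```python
# Minimal sketch of the copy-and-paste decoding loop described above.
# All encoders are random placeholders; in COG they would be contextualized
# Transformer representations, and the brute-force dot-product search would be
# replaced by an efficient vector search toolkit (e.g., FAISS).
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Offline index: phrases copied from the source text collection, one vector each.
phrases = ["machine learning", "is a field of", "artificial intelligence", "."]
phrase_vecs = rng.normal(size=(len(phrases), d))   # placeholder for a phrase encoder

def prefix_encoder(prefix: str) -> np.ndarray:
    """Placeholder for the prefix encoder: maps the current prefix to a query vector."""
    return rng.normal(size=d)

# Online decoding: at each step, retrieve the best-matching phrase and append it.
prefix = "Deep learning"
for _ in range(3):
    query = prefix_encoder(prefix)
    scores = phrase_vecs @ query          # maximum inner product search over the index
    best = int(np.argmax(scores))         # greedy selection of the next phrase
    prefix = prefix + " " + phrases[best]
print(prefix)
```

Because the phrase representations are precomputed offline, swapping in a different text collection only requires rebuilding this index, with no change to the model itself.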

2. BACKGROUND: NEURAL TEXT GENERATION

Neural text generation can be divided into two categories: (1) unconditional text generation and (2) conditional text generation. Unconditional text generation (or language modeling) aims to generate a coherent text continuation given a prefix. In this case, language models perform generation via density estimation over sequences, $p_\theta(x)$. Conditional text generation aims to generate text given some condition $c$ and instead estimates the probability $p_\theta(x \mid c)$. Its typical applications include machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and summarization (See et al., 2017). Throughout this paper, our discussion is focused on unconditional text generation; however, our approach can be readily adapted to conditional text generation as well.

The canonical approach to language modeling factorizes the generation in an autoregressive, left-to-right manner: $p_\theta(x_{0:n}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$. In this case, text generation is reduced to the task of repeatedly predicting the next token conditioned on the partial sequence (i.e., prefix) generated so far, $p_\theta(x_i \mid x_{<i})$. The model often consists of two parts: (1) a prefix encoder and (2) a set of token embeddings. The prefix encoder is often parameterized by the Transformer architecture (Vaswani et al., 2017), which transforms any prefix into a fixed-sized vector representation $h_i = \mathrm{PrefixEncoder}(x_{<i}) \in \mathbb{R}^d$. Then, the probability of the next token being $w$ is calculated as
$$p_\theta(x_i = w \mid x_{<i}) = \frac{\exp(v_w \cdot h_i)}{\sum_{w' \in V} \exp(v_{w'} \cdot h_i)},$$
where $v_w$ is the context-independent token embedding representing the token $w$, and $V$ is the predefined vocabulary consisting of all possible tokens. Based on the chosen decoding method, such as greedy search or nucleus sampling (Holtzman et al., 2020), the next token is selected according to the probability distribution over the fixed vocabulary $V$. This process is repeated in an autoregressive manner until some stop condition is reached, e.g., the maximum generation length is exceeded.
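For reference, the following is a minimal sketch of this token-level loop with placeholder components: the prefix encoder is a random projection rather than a Transformer, the vocabulary and embeddings are toy-sized, and greedy search is used as the decoding method.

```python
# Minimal sketch of token-level autoregressive decoding over a fixed vocabulary.
# prefix_encoder is a random placeholder standing in for a Transformer prefix encoder.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                                # hidden size (illustrative)
vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]
V = np.array([rng.normal(size=d) for _ in vocab])    # context-independent embeddings v_w

def prefix_encoder(tokens):
    """Placeholder for PrefixEncoder(x_{<i}): maps a prefix to a vector h_i in R^d."""
    return rng.normal(size=d)

tokens = ["the"]
for _ in range(10):
    h = prefix_encoder(tokens)
    logits = V @ h                                   # v_w . h_i for every token w in V
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax over the fixed vocabulary
    nxt = vocab[int(np.argmax(probs))]               # greedy search (nucleus sampling also possible)
    if nxt == "<eos>":                               # stop condition: end-of-generation token
        break
    tokens.append(nxt)
print(" ".join(tokens))
```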

3. COPY-GENERATOR

Unlike traditional language models that compute the next-token distribution over a fixed vocabulary usually composed of words or sub-words (Sennrich et al., 2016; Kudo & Richardson, 2018), our proposed COG has a dynamic "vocabulary" that depends on the available source text collections. Each item in the "vocabulary" corresponds to a text segment (termed a phrase in this paper) in the source text collection. Importantly, all phrases are context-sensitive; that is, the same phrase appearing in different contexts is treated as a different item. The overall framework is depicted in Figure 1. Formally, our approach assumes a set of source documents $\{D^1, \ldots, D^n\}$ is available. For each document $D^i$, a phrase $k = D^i_{s:e}$ of length $e - s + 1$ can be extracted, where $s$ and $e$ mark the start and end positions of the phrase in the document, respectively.
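As an illustration of this dynamic "vocabulary", the hypothetical sketch below enumerates all phrases $D^i_{s:e}$ up to a small maximum length and keys each occurrence by (document, start, end), so identical surface phrases drawn from different contexts remain distinct entries. The function name extract_phrases and the max_len parameter are illustrative and not part of the paper.

```python
# Minimal sketch of the dynamic, context-sensitive phrase "vocabulary":
# every occurrence D^i_{s:e} is a separate entry, so identical surface phrases
# from different contexts remain distinct items.
from typing import List, Tuple

def extract_phrases(docs: List[List[str]], max_len: int = 3) -> List[Tuple[int, int, int, str]]:
    """Enumerate all phrases D^i_{s:e} up to max_len tokens, keyed by (doc id, start, end)."""
    entries = []
    for i, doc in enumerate(docs):
        for s in range(len(doc)):
            for e in range(s, min(s + max_len, len(doc))):   # inclusive end, length e - s + 1
                entries.append((i, s, e, " ".join(doc[s:e + 1])))
    return entries

docs = [["the", "cat", "sat"], ["the", "cat", "ran"]]
vocab = extract_phrases(docs)
# "the cat" appears once per document; each occurrence keeps its own key and
# would therefore receive its own contextualized vector in the index.
print([entry for entry in vocab if entry[3] == "the cat"])
```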

