MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY

Abstract

Neural text decoding algorithms strongly influence the quality of texts generated using language models, but popular algorithms like top-k, top-p (nucleus), and temperature-based sampling may yield texts that have objectionable repetition or incoherence. Although these methods generate high-quality text after ad hoc parameter tuning that depends on the language model and the length of generated text, not much is known about the control they provide over the statistics of the output. This is important, however, since recent reports show that humans prefer text whose perplexity is neither too high nor too low, and since we experimentally show that cross-entropy (the log of perplexity) has a near-linear relation with repetition. First, we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling under Zipfian statistics. Then, we use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined target value of perplexity without any tuning. Experiments show that for low values of k and p, perplexity drops significantly with generated text length, leading to excessive repetition (the boredom trap). Conversely, for large values of k and p, perplexity increases with generated text length, leading to incoherence (the confusion trap). Mirostat avoids both traps. Specifically, we show that setting the target perplexity above a threshold yields negligible sentence-level repetition. Experiments with human raters for fluency, coherence, and quality further verify our findings.

1. INTRODUCTION

Large-scale generative language models (LMs) have received recent attention due to their high-quality open-ended text generation ability (Brown et al., 2020; Radford et al., 2019). Generating texts from these LMs usually relies on some form of random sampling. Pure sampling often leads to incoherent and low-quality texts (Holtzman et al., 2018), whereas greedy decoding leads to excessive repetitions, another form of low quality. The right decoding algorithm is needed to generate high-quality texts with controlled attributes (Ippolito et al., 2020; Zhang et al., 2020; Ippolito et al., 2019).

We introduce mirostat,¹ a neural text decoding algorithm that actively controls the generative process to maintain the perplexity of generated text at a certain desired value. Mirostat uses an adaptive top-k sampling algorithm that actively tunes the value of k to maintain the overall perplexity of the text; recall that in top-k sampling (Holtzman et al., 2018; Fan et al., 2018) the next word is sampled from the k most probable choices. Top-k sampling and several other recent sampling methods are motivated by suppressing an unreliable tail in the probability distribution of trained LMs. Another sampling method is top-p, also known as nucleus sampling, where the next word is chosen from the top x probable choices, where x is the smallest integer such that their cumulative probability mass is at least p (Holtzman et al., 2020). While top-k sampling involves a fixed number of most probable choices, top-p sampling involves a dynamic number of choices based on a fixed p value and shows better statistical and human-evaluated performance.

For small values of k and p, these sampling methods unfortunately repeat phrases in the generated text. This can be handled by penalizing repetitions and using appropriate temperature values (Keskar et al., 2019) or by adding diversity to the generated text (Zhang et al., 2020; Vijayakumar et al., 2018). On the other hand, large values of k and p can lead to incoherent texts similar to pure sampling. Although choosing appropriate values of p or k can avoid repetition and incoherence, this involves ad hoc tuning of parameters. Even for a fixed value of p or k, the generated text can have varying statistical properties.

Intriguingly, as we demonstrate via Example 1 in Appendix A, small values of a certain perplexity statistic of generated texts called surprise (Def. 1) are closely linked to repetition, and large values of surprise are linked to incoherence. Perplexity is a statistical metric used to evaluate the quality of neural text generation; it is closely related to average surprise, as shown in Fig. 7 in Appendix A and formalized in Sec. 2. A large-scale human subject experiment by Zhang et al. (2020) showed that human-evaluated quality is closely related to the likelihood of the generated text for a fixed number of tokens. In particular, reducing perplexity increases quality up to a point, after which quality starts dropping. This implies that good control over the perplexity of the generated text would give direct control over the quality of generated text (as evaluated by humans).

¹ The word mirostat is derived from mirum, Latin for surprise, and stat, meaning control. This work was funded in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network, and by National Science Foundation Grant CCF-1717530.
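To make the two truncation rules above concrete, the following is a minimal NumPy sketch of top-k and top-p sampling over a toy next-token distribution. The function names and the toy `probs` array are our own illustrations, not from the paper; a real decoder would apply the same truncation to an LM's softmax output at every step.

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int) -> int:
    """Sample a token index from the k most probable choices."""
    top = np.argsort(probs)[::-1][:k]          # indices of the k largest probabilities
    q = probs[top] / probs[top].sum()          # renormalize over the truncated set
    return int(np.random.choice(top, p=q))

def top_p_sample(probs: np.ndarray, p: float) -> int:
    """Sample from the smallest set of top tokens whose total mass is >= p."""
    order = np.argsort(probs)[::-1]            # tokens sorted by probability
    cdf = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cdf, p)) + 1  # smallest x with cumulative mass >= p
    top = order[:cutoff]
    q = probs[top] / probs[top].sum()
    return int(np.random.choice(top, p=q))

# Toy usage: a 5-token vocabulary with a heavy head.
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
```

Note how top-k always keeps exactly k candidates, whereas top-p keeps a variable number depending on how concentrated the distribution is, which is why the two methods behave differently as the LM's confidence changes from step to step.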
Generating texts with an appropriately chosen target perplexity value may therefore maximize the quality of the generated text. Ergo mirostat. We now summarize our key contributions. Sec. 3 shows theoretically how cross-entropy, and hence perplexity, grows in top-k and top-p sampling as a function of k and p respectively, which was previously unknown. Sec. 4 introduces mirostat sampling, which outputs text with a predetermined target perplexity value. Although perplexity may not fully capture the quality of text (Hashimoto et al., 2019), much literature discusses its correlation with quality (Zhang et al., 2020); hence, our algorithm to control perplexity helps generate high-quality text. Sec. 5.1 experimentally shows large fluctuations in the cross-entropy rates of top-k and top-p sampling as functions of their input parameters, demonstrating that these methods cannot directly control the perplexity of the output text. Sec. 5.2 shows that repetition is closely related to the perplexity of the generated text, mostly independent of the sampling method but slightly dependent on the LM used. Sec. 5.3 experimentally shows that mirostat sampling avoids both the boredom and confusion traps for a wide range of target perplexity values. Sec. 5.4 provides our own experiments with human raters that demonstrate mirostat's efficacy for fluency, coherence, and overall quality.
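As a preview of Sec. 4, the sketch below illustrates the feedback principle behind mirostat: after each sampled token, compare its observed surprise (negative log probability) to the target and adjust the truncation accordingly. This is a simplified illustration of the control loop only; the paper's Algorithm 1 instead estimates a Zipf exponent from the sorted probabilities to choose k directly, and the function name and learning rate here are our own assumptions.

```python
import numpy as np

def mirostat_like_step(probs: np.ndarray, mu: float, target: float,
                       learning_rate: float = 0.1) -> tuple[int, float]:
    """One decoding step: sample below surprise threshold mu, then update mu."""
    order = np.argsort(probs)[::-1]            # tokens sorted by probability
    surprises = -np.log2(probs[order])         # candidate surprises, in ascending order
    k = max(1, int(np.sum(surprises < mu)))    # keep candidates below the threshold
    top = order[:k]
    q = probs[top] / probs[top].sum()          # renormalize over the kept set
    idx = int(np.random.choice(top, p=q))
    observed = -np.log2(probs[idx])            # surprise of the sampled token
    mu -= learning_rate * (observed - target)  # feedback: steer surprise toward target
    return idx, mu

# Toy loop over a fixed distribution: mu adapts so that the average
# observed surprise tracks the target value.
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
mu = 2 * 2.0                                   # start the threshold above the target
for _ in range(100):
    _, mu = mirostat_like_step(probs, mu, target=2.0)
```

The key design choice is that the truncation is no longer a fixed k or p but a state variable driven by an error signal, which is what lets the generated text hold a predetermined perplexity regardless of its length.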

1.1. RELATED WORK

Sampling from distorted probability distributions

Pure sampling from LMs often leads to incoherent text, whereas greedy decoding leads to repetition. Distorting the probability distribution, as in top-k, top-p, or temperature sampling, helps improve the quality of generated texts if the parameters are properly tuned (Holtzman et al., 2018; Fan et al., 2018; Holtzman et al., 2020). Tuning these methods, however, is ad hoc and does not provide good control over the statistics of the output. Our method uses statistics of previously generated tokens as input when generating the next token, distorting the probability distribution so as to control the overall statistics of the generated text. This ability to control the perplexity of the output is a key advantage of our method over previous work. Combined with the relation between perplexity and human-evaluated quality observed by Zhang et al. (2020), it can yield text with better quality control.
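For completeness, here is a small sketch of the temperature distortion mentioned above: logits are divided by a temperature T before the softmax, so T < 1 sharpens the distribution (toward greedy decoding) and T > 1 flattens it (toward pure sampling). The function name and toy logits are illustrative, not from the paper.

```python
import numpy as np

def temperature_probs(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax of logits / T, with the usual max-subtraction for numerical stability."""
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(temperature_probs(logits, T=0.7))   # sharper than T = 1
print(temperature_probs(logits, T=1.5))   # flatter than T = 1
```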

Controllable text generation

Controllable text generation has often focused on the semantics of the output text, as in LMs like CTRL (Keskar et al., 2019) and sampling algorithms like plug-and-play LM (Dathathri et al., 2020) and constrained sentence generation by Metropolis-Hastings (Miao et al., 2019). In contrast, our approach is purely statistical: it guides the decoder along a desired statistical path that addresses the issues with pure sampling and greedy decoding.

Quality-diversity tradeoff

Top-k, top-p, and low-temperature sampling improve the quality of generated text, but at the cost of reduced diversity. Applications like question answering demand only high-quality generation, but open-ended tasks such as story generation demand diversity too. Li et al. (2016), Vijayakumar et al. (2018), and Kulikov et al. (2019) propose variants of beam search to induce diversity in generated text. However, Zhang et al. (2020) observe a tradeoff between quality and diversity.

