PENALIZING THE HIGH-LIKELIHOOD: A NOVEL SAMPLING METHOD FOR OPEN-ENDED NEURAL TEXT GENERATION VIA INVERSE PROBABILITY WEIGHTING

Abstract

Traditional stochastic sampling methods for open-ended neural text generation focus on truncating the low-likelihood part of the predicted distribution. They do not directly manipulate the high-likelihood part, which leads to the likelihood trap that induces repetition and boredom. Nor do they directly exploit the observation that humans do not always favor high-likelihood texts. Motivated by these observations, we propose a novel sampling method that rescales the high-likelihood part of the distribution with inverse probability weighting. It increases diversity by rescaling and penalizing high-likelihood words, and preserves fluency by applying multi-filtering truncation to low-likelihood words. Using pre-trained language models, we compare our algorithm with traditional sampling methods. Results show that our algorithm significantly increases the diversity and novelty of the generated texts without corrupting their fluency.

1. INTRODUCTION

Open-ended neural text generation is greatly affected by the choice of decoding method. Counter-intuitively, quality-oriented decoding methods such as beam search, which maximize the likelihood of the decoded text, induce the well-known text degeneration (Holtzman et al., 2020; Welleck et al., 2020) and likelihood trap (Zhang et al., 2021; Basu et al., 2021): high-likelihood texts are prone to be repetitive, boring, and of low quality. As a result, many works have focused on stochastic sampling methods such as top-k sampling (Fan et al., 2018; Holtzman et al., 2018) or nucleus sampling (top-p sampling; Holtzman et al., 2020). These methods first truncate the low-likelihood part of the language model's predicted distribution, then sample stochastically from the truncated distribution at every decoding time step. Other methods, such as temperature sampling, rescale the log-likelihood of all words to control the quality of the generated text. Recent works (Caccia et al., 2020; Nadeem et al., 2020; Zhang et al., 2021) reveal that these methods achieve on-par performance in terms of their quality-diversity trade-off. Still, properties of the relationship between stochastic sampling algorithms and open-ended neural text generation remain to be better understood (Nadeem et al., 2020). We note that none of the traditional sampling algorithms directly manipulate the high-likelihood part of the distribution, since high-likelihood words are always considered "trustworthy". Yet the quality-likelihood curve observed from human judgments is inversely proportional to likelihood in the high-likelihood region (Zhang et al., 2021), confirming the intuition that humans do not always favor high-likelihood words (Holtzman et al., 2020; Welleck et al., 2020). Inspired by these observations, we propose a novel sampling method, namely the interquartile range inverse probability (IQR-IP) sampling algorithm.
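For concreteness, the traditional methods discussed above can be sketched as follows. This is a minimal NumPy illustration of the standard formulations of top-k, top-p (nucleus), and temperature sampling, not code from any of the cited works; the function names are our own.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    out = np.zeros_like(probs)
    out[kept] = probs[kept]
    return out / out.sum()

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose mass >= p."""
    order = np.argsort(probs)[::-1]        # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest prefix reaching mass p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

def temperature_rescale(logits, t):
    """Temperature sampling: rescale log-likelihoods before the softmax."""
    z = logits / t
    e = np.exp(z - z.max())                # subtract max for numerical stability
    return e / e.sum()
```

Note that all three operate on the low-likelihood tail (truncation) or on all words uniformly (temperature); none of them selectively penalizes the high-likelihood head.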
It increases the diversity of the generated text by rescaling and penalizing the high-likelihood part ("head") of the predicted distribution with inverse probability weighting, and preserves fluency by applying multi-filtering truncation to the low-likelihood part. The rescaled distribution achieves a closer resemblance to the quality-likelihood curve (such as the human judgment curve of Figure 1 by Zhang et al., 2021), as illustrated in Figure 1. Empirical results show that our algorithm can increase the diversity and novelty of the generated text without corrupting its fluency.
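The core idea can be sketched in a few lines. The snippet below is an illustrative simplification only: the head/tail split by fixed probability thresholds is an assumption for exposition, not the IQR-based multi-filtering rule the algorithm actually uses, which is specified later in the paper.

```python
import numpy as np

def inverse_probability_rescale(probs, head_threshold=0.1, tail_threshold=1e-3):
    """Illustrative sketch of the head-penalizing idea.

    Head tokens (probability >= head_threshold) are reweighted by 1/p, so the
    most likely words are penalized the most; tail tokens below tail_threshold
    are truncated to preserve fluency. Both thresholds are assumptions for
    illustration, not the paper's IQR-based rule.
    """
    out = probs.copy()
    head = probs >= head_threshold
    tail = probs < tail_threshold
    out[head] = 1.0 / probs[head]                      # inverse probability weighting
    out[head] *= probs[head].sum() / out[head].sum()   # keep the head's total mass
    out[tail] = 0.0                                    # truncate the low-likelihood tail
    return out / out.sum()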


Figure 1: Illustration of our algorithm. The high-likelihood part of the language model's predicted distribution at each sampling step is rescaled by inverse probability weighting to penalize the high-likelihood words. The rescaled distribution (colored in red) achieves a closer resemblance to the quality-likelihood curve (see the human judgment curve of Figure 1 by Zhang et al., 2021).



Figure 2: The trajectory of the predicted probability ("o" marker) and predicted distribution (heatmap box beside each marker in "word-likelihood" format, with the sampled word marked by "*") for the first three repetition loops. The generated text contains infinite repetitive loops of "She walks in beauty." (with a generated period). The trajectory of the repetitive word "She" is highlighted with shading, showing the increase in its predicted probability and the increasingly peaked predicted distribution.

Figure 3: Trajectories of repetitive words extracted from samples that contain repetition loops. Repetitive words that appear more than 30 times are extracted and aligned to form their trajectories. A few appearances of a repetitive word quickly lead the model to an extreme distribution that causes repetition loops.

