A NON-MONOTONIC SELF-TERMINATING LANGUAGE MODEL

Abstract

Recent large-scale neural autoregressive sequence models have shown impressive performance on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination when generated with decoding algorithms such as greedy search, beam search, top-k sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define incomplete probable decoding algorithms, a class that includes greedy search, top-k sampling, and nucleus sampling and goes beyond the incomplete decoding algorithms originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the self-terminating language model originally proposed by Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences under not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.

1. INTRODUCTION

Autoregressive neural sequence models (Bengio et al., 2000) have been widely used for various natural language generation tasks such as language modeling (Brown et al., 2020; Chowdhery et al., 2022), machine translation (Bahdanau et al., 2014), and conversational dialogue modeling (Vinyals & Le, 2015). Furthermore, large-scale autoregressive neural sequence models have shown an unprecedented ability to generate fluent, human-like text (Vaswani et al., 2017; Brown et al., 2020). Despite their success, autoregressive neural sequence models exhibit undesirable behaviors: non-termination (Welleck et al., 2020), degenerate repetition (Welleck et al., 2019; Holtzman et al., 2020), and premature termination (Koehn & Knowles, 2017; Stahlberg & Byrne, 2019). In this paper, we focus on how to prevent non-termination when using a given decoding algorithm. Non-termination occurs when a language model generates infinitely long sequences with positive probability under a given decoding algorithm. Welleck et al. (2020) pointed out that this issue comes from a discrepancy between the distribution of a language model and the distribution induced from it by an incomplete decoding algorithm. They formalized this disparity with the notion of inconsistency, under which a language model generates non-terminating sequences with positive probability under the decoding algorithm. To avoid this inconsistency, they proposed a self-terminating (ST) language model that uses a new parametrization for its classifier rather than the usual softmax parametrization. They proved that the ST language model is consistent with respect to greedy search, beam search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020). The ST language model increases the termination probability of each sequence monotonically to 1, but we argue that this parametrization is not appropriate for modeling natural language.
As an illustrative example, suppose there are two sequences in our dataset: "I am a boy" vs. "I am a boy, and you are a girl.". A language model trained on this dataset may or may not terminate after the former. Once the model decides not to end, it should dramatically reduce the termination probability to continue. The ST language model, which monotonically increases the termination probability, cannot capture such a case, where one sequence is a prefix of another. We thus propose a non-monotonic self-terminating (NMST) language model which guarantees consistency with respect to greedy search, beam search, top-k sampling, and nucleus sampling without monotonically increasing the termination probability. The NMST language model encourages the termination probability of each sequence to converge to 1 through its NMST parametrization, but without requiring monotonicity. Even under this relaxation, the proposed NMST language model provably prevents any non-terminating sequence resulting from greedy search, top-k sampling, and nucleus sampling (which we refer to as incomplete probable decoding algorithms) as well as beam search. We conduct experiments validating the effectiveness of our NMST language models on sequence completion tasks, as was done in earlier studies. We test NMST parametrization with various architectures. Specifically, we train an RNN (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2 (Merity et al., 2016), and we additionally finetune GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). Across all these setups, NMST parametrization effectively prevents non-terminating sequences, especially when compared to softmax parametrization. Furthermore, our NMST parametrization attains better (lower) perplexities than ST parametrization, confirming the importance of relaxing the monotonic termination probability.

2.1. NOTATIONS FOR AUTOREGRESSIVE NEURAL SEQUENCE MODELS

Sequences, vocabulary, and ⟨eos⟩. We view an instance (e.g., a sentence or a paragraph) as a sequence y = (y_1, y_2, ..., y_T), where each y_t is an element from a pre-defined finite set of discrete tokens, referred to as a vocabulary V. V includes a special symbol ⟨eos⟩ that only appears at the end of a sequence, and every sequence y must end with ⟨eos⟩. We write the length of y as |y|, with y_|y| = ⟨eos⟩. We call y a non-terminating sequence, |y| = ∞, if y_t ≠ ⟨eos⟩ for all t.

Embedding vectors. Since each token v ∈ V is not itself a numerical vector, we represent it by an embedding vector u_v ∈ R^m, which projects v into a continuous space and captures the notion of similarity between discrete tokens efficiently (Bengio et al., 2000; Mikolov et al., 2013b;a; Levy & Goldberg, 2014).

Autoregressive neural sequence models. Bengio et al. (2000) proposed an autoregressive neural sequence model parametrized by θ ∈ R^k. They factorized p_θ(y|x) into a product of the conditional probabilities of each token given all the previous tokens and an input, in a predefined order, as follows:


p_θ(y|x) = ∏_{t=1}^{T} p_θ(y_t | y_{<t}, x),   (1)

where y_{<t} is a t-prefix of y and x is an input referred to as a context. For example, x represents either a prompt in sequence completion or a source-side sequence in machine translation. There are several popular architectures for p_θ such as RNN (Elman, 1990), LSTM (Hochreiter & Schmidhuber, 1997), GRU (Cho et al., 2014), and Transformer (Vaswani et al., 2017). As shown in equation 2, all these models utilize softmax classifiers. In this paper, we modify the parametrization of their softmax classifiers to prevent non-terminating sequences. We thus write a vanilla language model, regardless of its choice of architecture, that uses the original softmax parametrization as p^va_θ, defined in Definition 1.

Definition 1. A vanilla language model p^va_θ computes the conditional probability of each token given a t-prefix y_{<t} and a context x at each time step t as follows:

p^va_θ(y_t = v | y_{<t}, x) = exp(u_v^⊤ h_t) / Σ_{v′∈V} exp(u_{v′}^⊤ h_t),   (2)

where h_t = f_θ(y_t, h_{t−1}) with h_0 = 0.

Training. For a given dataset D = {(x^(n), y^(n))}_{n=1}^{N}, we maximize the joint probability assigned to the sequences in our training dataset to find an optimal parameter configuration θ⋆ as follows:

θ⋆ = arg max_θ Σ_{n=1}^{N} Σ_{t=1}^{T^(n)} log p_θ(y^(n)_t | y^(n)_{<t}, x^(n)).   (3)
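The softmax classifier of Definition 1 can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the embedding matrix, hidden size, and vocabulary size are invented, and the random hidden state stands in for the output of f_θ.

```python
import numpy as np

def softmax_lm_step(U, h_t):
    """Vanilla (VA+) classifier from Definition 1: p(y_t = v | y_<t, x) is a
    softmax over inner products between token embeddings u_v and the hidden
    state h_t produced by f_theta."""
    logits = U @ h_t           # one score u_v^T h_t per token v
    logits -= logits.max()     # stabilize exp() without changing the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy vocabulary of 5 tokens (index 0 playing the role of <eos>).
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 8))    # embedding matrix, one row per token
h = rng.normal(size=8)         # hidden state at step t
p = softmax_lm_step(U, h)      # a valid distribution over the 5 tokens
```

The output is a proper probability distribution over V, which is what every decoding algorithm in §2.2 consumes at each step.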

2.2. INCOMPLETE PROBABLE DECODING ALGORITHMS

An autoregressive language model p_θ predicts the likelihood of a sequence y given a context x. Its autoregressive factorization in equation 1 requires a recursive process over every t at inference. Hence, at inference time, we use a decoding algorithm, defined below, to generate sequences from p_θ.

Definition 2. Let Y be the collection of all y such that y = (y_1, y_2, ..., y_T) where T ∈ {1, 2, ...} and y_t ∈ V. A decoding algorithm S is a function that maps p_θ to q_{S(p_θ)}, a probability distribution over Y. A decoded sequence ŷ given x by S from p_θ is a random sample from q_{S(p_θ)}(y|x).

To generate a high-quality sequence from p_θ, a decoding algorithm assumes that a higher-quality sequence has a higher probability under p_θ than others. For instance, the maximum a posteriori (MAP) decoding algorithm S_map returns the most probable sequence y⋆ given a context x from p_θ,

y⋆ = arg max_{y∈Y} p_θ(y|x),   (4)

by setting q_{S_map(p_θ)}(y = y⋆|x) = 1 and q_{S_map(p_θ)}(y = y′|x) = 0 for y′ ∈ Y \ {y⋆}. Unfortunately, S_map is intractable since equation 4 requires an exhaustive search over the sequence space Y. Hence, in practice, we utilize incomplete probable decoding algorithms, defined as follows:

Definition 3. A decoding algorithm S is incomplete and probable if, for each t, there exists V_t ⊊ V such that Σ_{v∈V_t} q_{S(p_θ)}(y_t = v|y_{<t}, x) = 1 and min_{v∈V_t} p_θ(y_t = v|y_{<t}, x) ≥ max_{v∈V\V_t} p_θ(y_t = v|y_{<t}, x). Furthermore, for every v ∈ V_t, S satisfies q_{S(p_θ)}(y_t = v|y_{<t}, x) ≥ p_θ(y_t = v|y_{<t}, x).

At each t, an incomplete probable decoding algorithm S considers only a set of highly probable tokens, V_t. S generates ŷ given x by recursively sampling ŷ_t from q_{S(p_θ)}(y_t|ŷ_{<t}, x) supported on V_t. This reduces the exponential complexity of S_map, O(|V|^{|ŷ|}), down to a linear level, O(|ŷ| · |V|). Greedy search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020) are incomplete and probable.
For example, greedy search S_gr generates the t-th token of a sequence by ŷ_t = arg max_{v∈V} p_θ(y_t = v|ŷ_{<t}, x). In other words, S_gr sets V_t = {v_t^(1)} where v_t^(1) = arg max_{v∈V} p_θ(y_t = v|ŷ_{<t}, x). Moreover, we have p_θ(y_t = v_t^(1)|ŷ_{<t}, x) ≤ q_{S_gr(p_θ)}(y_t = v_t^(1)|ŷ_{<t}, x) = 1, and q_{S_gr(p_θ)}(y_t = v′|ŷ_{<t}, x) = 0 holds for v′ ∈ V \ V_t. Thus, S_gr is incomplete and probable. Unlike S_gr, top-k sampling takes the k most probable tokens in V as V_t, while nucleus sampling sets V_t to the smallest subset of most probable tokens in V whose total probability is higher than a given threshold µ. In §A.1 and §A.2, we show that top-k sampling and nucleus sampling are also incomplete and probable. Beam search is a heuristic algorithm that operates on the level of prefixes; we describe it further in §A.3. Although beam search is not an incomplete probable decoding algorithm, it also selects a proper subset V_t of V to expand each prefix at each step t. Due to this, our main theoretical finding for the incomplete probable decoding algorithms in §3 is applicable to beam search as well.
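The support sets V_t of these three algorithms follow directly from Definition 3. The NumPy sketch below is ours (the function names and the toy 5-token distribution are invented); the point is that each returned V_t is a proper subset of V, which is exactly what makes the algorithm incomplete.

```python
import numpy as np

def greedy_support(p):
    """V_t for greedy search: the single most probable token."""
    return {int(np.argmax(p))}

def topk_support(p, k):
    """V_t for top-k sampling: the k most probable tokens."""
    return set(np.argsort(p)[::-1][:k].tolist())

def nucleus_support(p, mu):
    """V_t for nucleus sampling: the smallest set of most probable tokens
    whose total probability reaches the threshold mu."""
    order = np.argsort(p)[::-1]          # tokens sorted by decreasing probability
    cum = np.cumsum(p[order])
    cut = int(np.searchsorted(cum, mu)) + 1
    return set(order[:cut].tolist())

# Toy next-token distribution p_theta(y_t = . | y_<t, x) over |V| = 5 tokens.
p = np.array([0.02, 0.5, 0.3, 0.1, 0.08])
V = set(range(len(p)))
for Vt in (greedy_support(p), topk_support(p, 2), nucleus_support(p, 0.7)):
    assert Vt < V  # proper subset of V: the algorithm is incomplete
```

For this distribution, greedy search keeps only token 1, while top-2 and nucleus-0.7 both keep tokens {1, 2}; none of them ever keeps all of V.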

2.3. CONSISTENCY WITH RESPECT TO INCOMPLETE PROBABLE DECODING ALGORITHMS AND SELF-TERMINATING (ST) LANGUAGE MODELS

Incomplete probable decoding algorithms greatly reduce the computational overhead of generating sequences from our model. However, Welleck et al. (2020) showed that a language model can be inconsistent with respect to such algorithms: it may generate non-terminating sequences with positive probability. [Only fragments of the remainder of §2.3 and of §3 survive here: (t) = 1 − (1 − ϵ)^t, f_ub(t) = 1, λ(t′) = σ(u_⟨eos⟩^⊤ h_{t′}), and g(t) = p^nmst ...]

4. EXPERIMENTS

We empirically validate the effectiveness of the proposed non-monotonic self-terminating (NMST) language model by evaluating it on sequence completion tasks. We test three variants of a given architecture: (i) a vanilla (VA+) language model using the common softmax parametrization in equation 2, (ii) a self-terminating (ST+) language model using the ST parametrization proposed by Welleck et al. (2020), and (iii) our non-monotonic self-terminating (NMST+) language model using the NMST parametrization in equation 10. We use the following evaluation metrics for comparison:
• Perplexity: Given an autoregressive language model p_θ and D = {(x^(n), y^(n))}_{n=1}^{N}, the perplexity of p_θ over D is exp(−(1/N) Σ_{n=1}^{N} Σ_{t=1}^{T^(n)} log p_θ(y^(n)_t | y^(n)_{<t}, x^(n))).
• Non-termination ratio (r_nt): To assess the consistency of p_θ with respect to a given decoding algorithm S, we need to compute r_nt = q_{S(p_θ)}(|y| = ∞). Instead, based on

r_nt = q_{S(p_θ)}(|y| = ∞) = lim_{L→∞} q_{S(p_θ)}(|y| > L),   (11)

we use r_nt(L) = q_{S(p_θ)}(|y| > L) with a sufficiently large threshold L to estimate r_nt.

Sequence completion is the task of predicting a continuation ŷ given a c-length context x = (x_1, x_2, ..., x_c) by using a decoding algorithm S from a language model p_θ (i.e., ŷ ∼ q_{S(p_θ)}(y|x)).

Figure 2: Non-termination ratios r_nt(L) for (a) RNN and (b) LSTM trained on WikiText-2 when using greedy search. We report mean (curve) ± st.dev. (shaded area) across 10 random experiments. For all configurations, both ST+ (non-red dashed) proposed by Welleck et al. (2020) and our NMST+ (non-red solid) are consistent with respect to greedy search since r_nt(L) goes to 0 as L increases.
However, softmax parametrization (VA+, red dotted) is inconsistent with respect to greedy search since its r_nt(L) does not converge to 0 as L → ∞. [Figure panels: (a) RNN and (b) LSTM; x-axis L from 10^0 to 10^5, y-axis r_nt(L); curves for VA+ and for NMST+ and ST+ with ϵ ∈ {5.0 × 10^-4, 1.0 × 10^-4, 5.0 × 10^-5, 1.0 × 10^-5}.]

In this section, we use greedy search, defined in equation 8, to generate ŷ given x. Our main theoretical finding, Theorem 3, is that the proposed NMST language model is consistent with respect to not only greedy search but also top-k sampling, nucleus sampling, and beam search. We thus present results when using decoding algorithms other than greedy search at the end, in §5 and §F.
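A termination probability with the properties the NMST model requires, bounded between f_lb(t) = 1 − (1 − ϵ)^t and f_ub(t) = 1 so that it converges to 1 without having to be monotonic, can be sketched as follows. The convex-combination form below is our assumption for illustration (the exact form is given by equation 10 of the paper), and the ⟨eos⟩ logits are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nmst_eos_prob(eos_logit, t, eps):
    """Assumed NMST-style <eos> probability: a convex combination of the
    lower bound f_lb(t) = 1 - (1 - eps)^t and the upper bound f_ub(t) = 1,
    weighted by lambda = sigma(eos_logit).  The lower bound forces
    convergence to 1 as t grows, yet the value need not be monotonic in t."""
    f_lb = 1.0 - (1.0 - eps) ** t
    f_ub = 1.0
    lam = sigmoid(eos_logit)  # plays the role of lambda(t) = sigma(u_eos^T h_t)
    return f_lb + lam * (f_ub - f_lb)

eps = 1.0e-5
# Invented oscillating <eos> logits: the probabilities are non-monotonic in t,
probs = [nmst_eos_prob(z, t, eps) for t, z in enumerate([2.0, -3.0, 1.0], start=1)]
assert not (probs[0] <= probs[1] <= probs[2])
# yet the lower bound still forces convergence toward 1 for large t.
assert nmst_eos_prob(-50.0, 2_000_000, eps) > 0.999
```

This is precisely the relaxation the paper argues for: unlike the ST head, the sigmoid term is free to dip after a plausible sentence ending while the ϵ-controlled bound still rules out non-termination.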

4.1. WIKITEXT-2

WikiText-2 (Merity et al., 2016) consists of 2 million words from 600 Wikipedia articles. With word tokenization, we regard the first 10 tokens of each sequence as a context x and its remaining part as a ground truth y. We train an RNN with tanh (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both RNN and LSTM have 2 layers, with 256 and 512 hidden units at each layer, respectively. We perform 10 random runs with a batch size of 32 for 70 epochs. We use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, β_1 = 0.9, β_2 = 0.99, weight decay of 0.01, learning rate decay, and early stopping. We further describe our models and training strategies for the WikiText-2 experiments in §D. Unlike VA+{RNN, LSTM}, ST+{RNN, LSTM} and NMST+{RNN, LSTM} need an additional hyperparameter ϵ, which we explore in {1.0 × 10^-5, 5.0 × 10^-5, 1.0 × 10^-4, 5.0 × 10^-4}. We present the average (±st.dev.) non-termination ratios r_nt(L) across 10 random runs as a function of L for all considered WikiText-2 setups in Figure 2, using greedy search. From equation 11, a language model is consistent with respect to greedy search if lim_{L→∞} r_nt(L) = 0. As L increases, we observe that r_nt(L) of VA+{RNN, LSTM} fails to converge toward 0 while r_nt(L) of ST+{RNN, LSTM} and NMST+{RNN, LSTM} all reach 0. In other words, RNN and LSTM become consistent with respect to greedy search after replacing the original softmax parametrization with either the proposed NMST parametrization or ST parametrization. Table 1 shows the average (±st.dev.) validation perplexities across 10 random experiments for all variants of RNN and LSTM trained on WikiText-2. We observe that NMST+{RNN, LSTM} have better validation perplexities than ST+{RNN, LSTM} for every ϵ. We demonstrate this more clearly in §E.1 by plotting the evolution of the mean validation perplexities as we vary ϵ.
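The estimator r_nt(L) ≈ q_{S(p_θ)}(|y| > L) used throughout this section amounts to simple Monte-Carlo counting: decode up to L + 1 steps and record the fraction of continuations that have not emitted ⟨eos⟩ within L steps. A sketch (the helper name and the toy geometric length model are ours, purely for illustration; a real run would decode from the trained model):

```python
import numpy as np

def estimate_rnt(decode_len, num_samples, L):
    """Monte-Carlo estimate of r_nt(L) = q_S(|y| > L): the fraction of
    decoded continuations still lacking <eos> after L steps."""
    hits = sum(1 for _ in range(num_samples) if decode_len(max_steps=L + 1) > L)
    return hits / num_samples

# Toy stand-in for "decode with S and return the length": geometric
# lengths with mean 100, truncated at the decoding cap.
rng = np.random.default_rng(0)
def toy_decode_len(max_steps):
    return min(int(rng.geometric(0.01)), max_steps)

r = estimate_rnt(toy_decode_len, num_samples=2000, L=1000)
assert 0.0 <= r <= 1.0
```

Plotting this estimate for increasing L (as in Figure 2) reveals whether the estimated mass of never-terminating continuations vanishes.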
Although our NMST+ guarantees the consistency of RNN and LSTM with respect to greedy search with a better validation perplexity than ST+, we need to select ϵ for NMST+ carefully. As ϵ increases, the lower bound of p^nmst_θ(y_t = ⟨eos⟩|y_{<t}, x) grows faster, yielding prematurely terminated sequences when ϵ is too large. Indeed, the average validation perplexities of NMST+RNN and NMST+LSTM with ϵ = 5.0 × 10^-4 are 184.2 and 105.6, which degrade by 5.6 and 4.0 from those of VA+RNN and VA+LSTM, 178.6 and 101.6, respectively. We however emphasize that the optimal ϵ = 1.0 × 10^-5 makes NMST+{RNN, LSTM} attain validation perplexities similar to those of VA+{RNN, LSTM}. In short, both NMST+ and ST+ prevent non-termination when using greedy search, but only NMST+ has a competitive validation perplexity against VA+. In §G, we further observe that the length distribution of predicted sequences from NMST+LSTM is closer to the length distribution of ground truth sequences than those of {VA, ST}+LSTM.

Table 1: Mean (±st.dev.) validation perplexities across 10 random runs on WikiText-2 for various model configurations. Lower is better. Bold marks the best of each architecture. For all ϵ, the validation perplexities of our NMST+{RNN, LSTM} are better than those of ST+{RNN, LSTM} proposed by Welleck et al. (2020). Moreover, with a proper choice of ϵ = 1.0 × 10^-5, NMST+{RNN, LSTM} have competitive validation perplexities against those of VA+{RNN, LSTM}.

Table 2: We present the average (±st.dev.) validation perplexities across 10 random runs for all variants of GPT-2 finetuned on WikiText-103. We also report their non-termination ratios (mean ±st.dev.), r_nt(L), when using greedy search. We set L to 1,000 since the maximum length of generated sequences from GPT-2 is 1,024. For perplexity, lower is better. Bold marks the best validation perplexity in all setups.
For every ϵ, NMST+GPT-2 outperforms ST+GPT-2 in terms of average validation perplexity. In terms of r_nt(L), NMST+GPT-2 effectively prevents non-terminating sequences compared to VA+GPT-2 for every ϵ, while ST+GPT-2 with small ϵ fails to avoid them. With a proper choice of ϵ (e.g., ϵ = 1.0 × 10^-5), NMST+GPT-2 improves the validation perplexity.

4.2. WIKITEXT-103

WikiText-103 (Merity et al., 2016) consists of 103 million words constructed from 28,000 articles. We use BPE tokenization (Sennrich et al., 2015) and consider the first 10 tokens of each sequence as a context. Since WikiText-103 is substantially larger than WikiText-2, we finetune a pretrained GPT-2, a transformer language model with 124 million parameters (Radford et al., 2019), for 500,000 steps. For computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens. We use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 5.0 × 10^-5, β_1 = 0.9, β_2 = 0.99, weight decay of 0.01, linear learning rate decay, and early stopping. We present a more detailed description in §D. We select ϵ from {1.0 × 10^-5, 5.0 × 10^-5, 1.0 × 10^-4, 5.0 × 10^-4} for ST+GPT-2 and NMST+GPT-2. We report the mean (±st.dev.) validation perplexities and non-termination ratios r_nt(L) resulting from greedy search across 10 random runs for all GPT-2 setups finetuned on WikiText-103 in Table 2. Since GPT-2 can handle up to 1,024 tokens, we use L = 1,000. As shown in Figure 2, we need a sufficiently large L, such as L = 10^5, to determine whether a language model is consistent with respect to greedy search. Although L = 1,000 is not sufficiently large, we observe that r_nt(L) of NMST+GPT-2 decreases compared to r_nt(L) of VA+GPT-2 as ϵ increases. That is, NMST+ reduces the number of non-terminating continuations within 1,000 steps. Fewer non-terminating sequences do not necessarily imply better quality. We thus show sample continuations from NMST+GPT-2 in Table 3, given a context that leads to non-termination with VA+GPT-2 using greedy search. We observe that the quality of the generated sequence tends to improve with NMST+ by avoiding repetitions of similar phrases and ending with ⟨eos⟩. We present more example continuations in §E.3.
Figure 3: Termination probability p_θ(y_t = ⟨eos⟩|y_{<t}, x) for {VA, ST, NMST}+GPT-2. We choose ϵ = 1.0 × 10^-5 for {ST, NMST}+GPT-2 because it is optimal in terms of validation perplexities in Table 2. Instead of t, we tag the t-th ground truth token. We report their mean (curve) ± st.dev. (shaded area) across 10 random runs. Unlike ST+GPT-2, NMST+GPT-2 can model non-monotonic behaviors of p_θ(y_t = ⟨eos⟩|y_{<t}, x) with respect to t. Both plots show that the non-monotonic behaviors occur where the sequences could end (e.g., after red-marked tokens such as periods).

Similar to the results in §4.1, Table 2 shows that the validation perplexities of both ST+GPT-2 proposed by Welleck et al. (2020) and our NMST+GPT-2 degrade compared to VA+GPT-2 as ϵ increases. NMST+GPT-2 with the optimal ϵ = 1.0 × 10^-5 has a competitive validation perplexity of 20.69 against that of VA+GPT-2, 20.72. On the other hand, we cannot find an ϵ that makes the validation perplexity of ST+GPT-2 competitive against that of VA+GPT-2. Moreover, if ϵ ≠ 5.0 × 10^-4, then r_nt(L) of ST+GPT-2 blows up, unlike r_nt(L) of VA+GPT-2. §E.2 demonstrates the inevitable perplexity degradation and exploding r_nt(L) of ST+GPT-2. We suspect that this is due to p_θ(y_t = ⟨eos⟩|y_{<t}, x) increasing monotonically with respect to t. We investigate the behaviors of p_θ(y_t = ⟨eos⟩|y_{<t}, x) for {VA, ST, NMST}+GPT-2 in Figure 3. Based on Table 2, we select the optimal ϵ = 1.0 × 10^-5 in terms of validation perplexities for {ST, NMST}+GPT-2. In Figure 3, {VA, NMST}+GPT-2 capture well whether a sequence might end (e.g., after periods) by showing non-monotonic behaviors at those seemingly terminating steps, but ST+GPT-2 cannot model such non-monotonic behaviors because it assumes that p_θ(y_t = ⟨eos⟩|y_{<t}, x) is a monotonic function of t. This constraint makes ST+GPT-2 generate often finite but unnecessarily long sequences with greedy search (i.e., a higher r_nt(L) than VA+GPT-2 for small L, but r_nt(L) = 0 for sufficiently large L). We demonstrate more behaviors in §E.4.

5. CONSISTENCY WITH RESPECT TO OTHER DECODING ALGORITHMS

We explore the effectiveness of our proposed non-monotonic self-terminating (NMST) language model when using decoding algorithms other than greedy search, such as top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and beam search. All experimental setups and notations are the same as in §4. According to Theorem 3, the NMST language model is consistent with respect to any incomplete probable decoding algorithm (e.g., greedy search, top-k sampling, and nucleus sampling) and beam search for all ϵ > 0. To validate this, we use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+GPT-2 finetuned on WikiText-103 with ϵ = 1.0 × 10^-5, chosen based on the validation perplexities in Table 2. Since the validation perplexity does not depend on the decoding algorithm, we focus on the average (±st.dev.) non-termination ratios r_nt(L) across 10 random runs with L = 1,000 for each decoding algorithm in Table 4. We also present r_nt(L)'s of VA+GPT-2 and ST+GPT-2 with ϵ = 1.0 × 10^-5 as baselines.

Table 4: Mean (±st.dev.) non-termination ratios r_nt(L) across 10 random runs for the variants of GPT-2 finetuned on WikiText-103 with various decoding algorithms. We set L to 1,000 due to GPT-2's context window size of 1,024. We use the optimal ϵ = 1.0 × 10^-5 in terms of average validation perplexities in Table 2 for both NMST+GPT-2 and ST+GPT-2. Bold marks the lowest r_nt(L) within each decoding algorithm (column). Similar to greedy search in Table 2, for all decoding algorithms, r_nt(L)'s of NMST+GPT-2 are lower than those of ST+GPT-2 and VA+GPT-2, meaning that NMST+ reduces the number of non-terminating sequences within 1,000 decoding steps.

      | top-2      | top-4      | nucleus-0.2  | nucleus-0.4  | beam-2       | beam-4
VA+   | 0.0 ± 0.0  | 0.0 ± 0.0  | 0.25 ± 0.08  | 0.14 ± 0.05  | 0.05 ± 0.02  | 0.03 ± 0.01
ST+   | 0.0 ± 0.0  | 0.0 ± 0.0  | 0.73 ± 0.11  | 0.55 ± 0.15  | 0.29 ± 0.10  | 0.15 ± 0.07
NMST+ | 0.0 ± 0.0  | 0.0 ± 0.0  | 0.21 ± 0.10  | 0.10 ± 0.06  | 0.03 ± 0.02  | 0.01 ± 0.01

Table 4 shows that our NMST+GPT-2 has the lowest r_nt(L) with L = 1,000 for all decoding algorithms compared to VA+GPT-2 and ST+GPT-2 proposed by Welleck et al. (2020). In other words, NMST+ effectively prevents non-terminating sequences within 1,000 time steps regardless of the decoding algorithm. Compared with greedy search in Table 2 (r_nt(L) when ϵ = 1.0 × 10^-5), we observe that r_nt(L) decreases for all setups. As discussed in §2.3, non-terminating sequences originate from having ⟨eos⟩ ∉ V_t ⊊ V for all t, where V is the vocabulary and V_t is the proper subset of V considered by a decoding algorithm at step t. Decoding algorithms other than greedy search are more likely to include ⟨eos⟩ in V_t, and hence have lower r_nt(L), since their |V_t| are greater than or equal to the |V_t| = 1 of greedy search for all t. In the case of top-{2, 4} sampling, we obtain r_nt(L) = 0.0 even for VA+GPT-2: even without NMST+, VA+ can avoid non-terminating sequences if we choose a proper decoding algorithm. We however emphasize that NMST+GPT-2 with ϵ = 1.0 × 10^-5 has a competitive validation perplexity against VA+GPT-2 in Table 2 and is guaranteed to terminate regardless of the choice of decoding algorithm. We also empirically demonstrate the consistency of NMST+{RNN, LSTM} trained on WikiText-2 with respect to other decoding algorithms in §F.

6. CONCLUSION

Non-termination is a degenerate behavior we often observe when generating text from a well-trained language model. To prevent it, Welleck et al. (2020) proposed a self-terminating language model that encourages the termination probability of each sequence, the conditional probability of ⟨eos⟩ given a t-prefix and a context, to increase monotonically toward 1 as t increases. In this paper, we theoretically demonstrate that a monotonically increasing termination probability is not a necessary condition for avoiding non-terminating sequences. We then propose a non-monotonic self-terminating language model whose termination probability for each sequence converges to 1, but not monotonically. Our non-monotonic self-terminating language models successfully address the issue of non-termination and achieve perplexities that are comparable to vanilla language models and better than the original self-terminating language models.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our paper, we provide our code at https://github.com/nyu-dl/non-monotonic-self-terminating-lm.

A.3 BEAM SEARCH

Beam search with width k maintains a set of k prefixes, P_t = {ρ^(1)(t), ρ^(2)(t), ..., ρ^(k)(t)}, at each time step t, where ρ^(i)(0) is an empty prefix for all i. At each step t ∈ {1, 2, ...}, beam search forms a set of k × k prefixes,

P̃_t = ∪_{ρ∈P_{t−1}} {ρ ∘ v | v ∈ V_t(ρ)},   (17)

where ρ ∘ v denotes concatenation and V_t(ρ) = arg top-k_{v∈V} p_θ(y_t = v|ρ, x). After forming P̃_t, beam search selects the set of the k highest-scoring prefixes in P̃_t, P_t = arg top-k_{ρ∈P̃_t} s(ρ), where s(ρ) = Σ_{τ=1}^{t} log p_θ(y_τ = ρ_τ|ρ_{<τ}, x). If ρ ∈ P_t ends with ⟨eos⟩, it does not expand further and is added to the final set P. Beam search continues until P contains k sequences ending with ⟨eos⟩, and then returns the highest-scoring sequence ŷ = arg max_{ρ∈P} s(ρ).

Unlike greedy search, top-k sampling, and nucleus sampling, beam search recursively expands k sequences with at most k different prefixes. Therefore, we cannot formalize beam search at the token level via q_{S_beam-k}(y_t = v|y_{<t}, x). However, in equation 17, the number of possible tokens at step t is at most k × k, which means that S_beam-k may exclude ⟨eos⟩ at time t if k ≤ |V| − 1. Using this, Welleck et al. (2020) proved that a vanilla language model p^va_θ is inconsistent with respect to beam search, as shown in Theorem 1.
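The prefix-expansion procedure above can be sketched compactly. This is a simplified illustration, not the paper's implementation: step_logprobs is a hypothetical callback returning log p_θ(· | ρ, x), ties break by dictionary order, and finished prefixes are simply moved to the set P as described.

```python
import math

def beam_search(step_logprobs, k, eos, max_steps=50):
    """Simplified beam search: keep k prefixes, expand each with its top-k
    next tokens V_t(rho), retain the k best by cumulative log-prob s(rho),
    and move prefixes ending in eos to the finished set P."""
    beams = [((), 0.0)]                        # (prefix rho, score s(rho))
    finished = []                              # the final set P
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            dist = step_logprobs(prefix)       # log p(v | rho, x) for v in V
            for v in sorted(dist, key=dist.get, reverse=True)[:k]:
                candidates.append((prefix + (v,), score + dist[v]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        finished += [b for b in beams if b[0][-1] == eos]
        beams = [b for b in beams if b[0][-1] != eos]
        if len(finished) >= k or not beams:
            break
    return max(finished, key=lambda c: c[1])[0] if finished else None

# Hypothetical 3-token model: prefers "a" early, then strongly prefers <eos>.
def toy_model(prefix):
    if len(prefix) < 2:
        return {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}
    return {"a": math.log(0.1), "b": math.log(0.1), "<eos>": math.log(0.8)}

y_hat = beam_search(toy_model, k=2, eos="<eos>")  # -> ("a", "a", "<eos>")
```

Note that each expansion considers only the top-k tokens per prefix, which is exactly why ⟨eos⟩ can be excluded at a step and why the inconsistency argument above applies.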

B PROOFS FOR §2.3

Remark 1. Let D = {(x^(1), y^(1)), (x^(2), y^(2))} be a two-instance training dataset. Assume that there exists t_0 such that y_{<t_0} = y^(1)_{<t_0} = y^(2)_{<t_0}. Suppose further that t_0 = |y^(1)| < |y^(2)| − 1 and x = x^(1) = x^(2). If θ⋆ is an optimal parameter configuration in equation 3 over D, then p_{θ⋆}(y^(2)_t = ⟨eos⟩|y^(2)_{<t}, x) is non-monotonic with respect to t.

Proof. Since θ⋆ is an optimal parameter configuration that perfectly maximizes equation 3 and t_0 < |y^(2)| − 1, we have

p_{θ⋆}(y^(2)_t = ⟨eos⟩|y^(2)_{<t}, x) = 0  for t < t_0.

Note that t_0 = |y^(1)| implies y^(1)_{t_0} = ⟨eos⟩, while t_0 < |y^(2)| − 1 implies y^(2)_{t_0} ≠ ⟨eos⟩. From x = x^(1) = x^(2) and y_{<t_0} = y^(1)_{<t_0} = y^(2)_{<t_0}, the optimal model splits its probability mass between the two observed continuations, so we obtain

p_{θ⋆}(y^(2)_{t_0} = ⟨eos⟩|y^(2)_{<t_0}, x) = 1/2.

Moreover, t_0 < |y^(2)| − 1 implies y^(2)_{t_0+1} ≠ ⟨eos⟩, which is equivalent to

p_{θ⋆}(y^(2)_{t_0+1} = ⟨eos⟩|y^(2)_{<t_0+1}, x) = 0.

Combining the three equalities above, p_{θ⋆}(y^(2)_t = ⟨eos⟩|y^(2)_{<t}, x) rises from 0 to 1/2 at t_0 and falls back to 0 at t_0 + 1, so it is non-monotonic with respect to t.
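Remark 1 can be checked numerically with the two-sequence example from the introduction. The count-based estimator below plays the role of the optimal θ⋆ (for a finite dataset, the maximum-likelihood next-token distribution is the empirical one); the token lists and helper function are ours, purely for illustration.

```python
from collections import Counter

# The two-sequence dataset from the introduction: y1 is a prefix of y2.
y1 = ["I", "am", "a", "boy", "<eos>"]
y2 = ["I", "am", "a", "boy", ",", "and", "you", "are", "a", "girl", ".", "<eos>"]

def optimal_eos_prob(prefix, dataset):
    """Count-based stand-in for the optimal model: among training sequences
    sharing this prefix, the fraction whose next token is <eos>."""
    nexts = [y[len(prefix)] for y in dataset
             if len(y) > len(prefix) and y[:len(prefix)] == prefix]
    return Counter(nexts)["<eos>"] / len(nexts) if nexts else 0.0

# Termination probability along y2: 0 before t0, 1/2 at t0, 0 right after.
probs = [optimal_eos_prob(y2[:t], [y1, y2]) for t in range(len(y2))]
assert probs[4] == 0.5 and probs[5] == 0.0  # rises then falls: non-monotonic
```

The resulting sequence of termination probabilities is exactly the 0, ..., 1/2, 0, ..., 1 pattern the proof derives, which no monotonic parametrization can fit.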

C PROOFS FOR §3

Theorem 3. A non-monotonic self-terminating (NMST) language model defined in Definition 6 is consistent with respect to any incomplete probable decoding algorithm and beam search.

D EXPERIMENTAL DETAILS

In this section, we describe our models and the optimization processes used in §4.

RNN and LSTM on WikiText-2. We use word tokenization for WikiText-2. We train an RNN with tanh activations (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both RNN and LSTM have 2 layers; each layer has 256 hidden units for the RNN and 512 hidden units for the LSTM. The sizes of the input and output embedding layers are 256 and 512 for RNN and LSTM, respectively. We use weight tying to share the weights between the input and output embedding layers for both models. We apply dropout (Srivastava et al., 2014) with drop probabilities of 0.3 and 0.5 to RNN and LSTM, respectively. For each model, we perform 10 random runs with a batch size of 32 for 70 epochs. To maximize the log-likelihood in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, β_1 = 0.9, β_2 = 0.99, weight decay of 0.01, and a learning rate decay schedule that halves the learning rate if the validation perplexity does not improve for a training epoch. To avoid overfitting, we additionally use early stopping, which terminates training if the validation perplexity does not improve upon the best score attained so far for 10 consecutive epochs. In most cases, training ends within 50 epochs.
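The halving-on-plateau and early-stopping rules just described can be sketched in a few lines. This is our own illustration of the stated schedule (halve when validation perplexity does not improve for an epoch, stop after 10 consecutive non-improving epochs); the function name and the toy perplexity trace are invented.

```python
def run_schedule(val_perplexities, init_lr=1e-3, patience=10):
    """Halve the learning rate after any epoch whose validation perplexity
    does not improve, and stop once the best score has not been beaten for
    `patience` consecutive epochs.  Returns the lr in effect after each epoch."""
    lr, best, since_best = init_lr, float("inf"), 0
    history = []
    for ppl in val_perplexities:
        if ppl < best:
            best, since_best = ppl, 0
        else:
            since_best += 1
            lr /= 2.0                  # halve on a non-improving epoch
        history.append(lr)
        if since_best >= patience:     # early stopping
            break
    return history

# Invented trace: two improvements, one blip, then a long plateau that
# triggers early stopping after 10 non-improving epochs.
lrs = run_schedule([120.0, 110.0, 112.0, 108.0] + [109.0] * 12)
```

With patience 10, the plateau halves the learning rate ten times and then training stops, mirroring the behavior described above.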

GPT-2 on WikiText-103

We use BPE tokenization (Sennrich et al., 2015) and the pretrained GPT-2 (Radford et al., 2019) with 124 million parameters, provided by HuggingFace. GPT-2 can handle up to 1,024 tokens. We apply dropout (Srivastava et al., 2014) with a drop probability of 0.1 to GPT-2. We finetune GPT-2 for 300,000 steps while ensuring that all runs continue for at least 250,000 steps. To minimize the number of padding tokens in every batch for computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens. To maximize the log-likelihood in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 5.0 × 10^-5, β_1 = 0.9, β_2 = 0.99, weight decay of 0.01, and linear learning rate decay over 500,000 steps.
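The length-bucketed batching described above (similar-length sequences grouped so each padded batch stays within 1,024 total tokens) can be sketched as follows; the function name, the greedy packing rule, and the toy lengths are our own assumptions, not the paper's exact batching code.

```python
def bucket_batches(sequences, max_tokens=1024):
    """Sort sequences by length (so batches hold similar lengths and padding
    is minimal), then greedily pack batches whose padded size, the longest
    member times the batch size, stays within `max_tokens`."""
    batches, batch = [], []
    for seq in sorted(sequences, key=len):
        longest = max(len(seq), max(map(len, batch), default=0))
        if batch and longest * (len(batch) + 1) > max_tokens:
            batches.append(batch)      # adding seq would exceed the budget
            batch = []
        batch.append(seq)
    if batch:
        batches.append(batch)
    return batches

# Toy "token id" sequences of various lengths.
seqs = [[0] * n for n in (5, 700, 12, 300, 8, 650)]
batches = bucket_batches(seqs)
# every batch respects the token budget after padding to its longest member
assert all(max(map(len, b)) * len(b) <= 1024 for b in batches)
```

Short sequences pack many to a batch while long ones end up nearly alone, which is what keeps the padded token count per batch roughly constant.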

E ADDITIONAL PLOTS AND TABLES FOR §4

In this section, we present additional plots and tables for §4.

[Table caption fragment from §E.2:] ... because GPT-2 has a context window size of 1,024. For all ϵ, NMST+GPT-2 outperforms ST+GPT-2 in terms of average validation perplexity. When ϵ is small, r_nt(L) of ST+GPT-2 explodes, meaning that ST+GPT-2 with small ϵ cannot prevent non-terminating sequences. However, our NMST+GPT-2 effectively reduces r_nt(L) compared to VA+GPT-2 for every ϵ, and its validation perplexity degradation is smaller than that of ST+GPT-2 proposed by Welleck et al. (2020).

E.3 ADDITIONAL TABLES FOR TABLE 3

Table 5: Given a context from a validation instance of WikiText-103, we present example continuations of {VA, ST, NMST}+GPT-2 when using greedy search. We select ϵ = 1.0 × 10^-5 for {ST, NMST}+GPT-2 because it is optimal in terms of validation perplexities in Table 2. Unlike {VA, ST}+GPT-2, NMST+GPT-2 improves the quality of the sequence by avoiding repetitive tokens and ending with ⟨eos⟩ when the given context leads VA+GPT-2 to non-termination within 1,000 steps.

Context: The single made its Irish Singles Chart debut at The single was certified gold by the British Phonographic Industry ( BPI ) for shipments of over 15 @,@ 000 copies.

VA+ The single debuted at number two on the Irish Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the Irish Recorded Music Association ( IRMA ) for shipments of over 15 @,@ 000 copies. The single was ...

NMST+ number twenty @-@ seven on the week ending March 26, 2010, and peaked at number three on the week ending March 27, 2010. It was certified gold by the Recording Industry Association of Ireland ( RIAA ) for shipment of 500 @,@ 000 copies of the single. The single was certified gold by the Recording Industry Association of Ireland ( RIANZ ) for shipment of 500 @,@ 000 copies of the single.⟨eos⟩

Context: Despite the expensive reconstructions, both vessels were considered

VA+ to be of sufficient quality to be considered for use in the Grand Fleet. The first, the British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, ...

ST+ to be of sufficient quality to be considered a part of the Royal Navy, and were assigned to the Channel Fleet.

NMST+ the Naktong River rises to a height of 1 @,@ 000 metres ( 3 @,@ 300 ft ) above the surrounding terrain.

F CONSISTENCY WITH RESPECT TO OTHER DECODING ALGORITHMS FOR RNN AND LSTM

We validate the consistency of our proposed non-monotonic self-terminating (NMST) language model when using decoding algorithms other than greedy search: top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and beam search. All experimental setups and notations are the same as in §4. We use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+{RNN, LSTM} trained on WikiText-2 with ϵ = 1.0 × 10^-5. We choose ϵ = 1.0 × 10^-5 based on the validation perplexities in Table 1. Since the validation perplexity does not change with the decoding algorithm, we focus on the average (± st.dev.) non-termination ratios, r_nt(L), across 10 random runs as a function of L, for each decoding algorithm in Figure 7. We also plot the evolution of r_nt(L) for VA+{RNN, LSTM} and for ST+{RNN, LSTM} with ϵ = 1.0 × 10^-5 as we vary L.
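The non-termination ratio r_nt(L) used throughout this section is simply the fraction of decoded sequences containing no ⟨eos⟩ among their first L tokens. A minimal sketch of this measurement (the token-id lists and the ⟨eos⟩ id below are illustrative assumptions, not our actual pipeline):

```python
def non_termination_ratio(sequences, L, eos=0):
    """r_nt(L): fraction of decoded sequences (lists of token ids) that fail
    to emit the eos token within their first L tokens."""
    not_terminated = sum(1 for seq in sequences if eos not in seq[:L])
    return not_terminated / len(sequences)
```

A consistent model drives this ratio to 0 as L grows, which is exactly what the curves in Figure 7 visualize.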

G ANALYSIS OF PREDICTED SEQUENCE LENGTH DISTRIBUTIONS IN §4.1

We investigate whether our proposed non-monotonic self-terminating (NMST+) language model matches the data length distribution better than the baselines: i) a vanilla (VA+) language model and ii) a self-terminating (ST+) language model. To this end, we compare the length distributions of predicted sequences from {VA, ST, NMST}+LSTM trained on WikiText-2 with the length distribution of ground truth sequences in the WikiText-2 validation dataset, D_val, when using greedy search. All experimental setups and notations are the same as in §4.1. Figure 8 shows the length distributions of {VA, ST, NMST}+LSTM and of D_val. For {ST, NMST}+LSTM, we use ϵ = 1.0 × 10^-5 because this choice is optimal in terms of validation perplexities in Table 1. We observe that the length distribution of predicted sequences from NMST+LSTM is closer to the data length distribution of D_val than those of VA+LSTM and ST+LSTM. Furthermore, we can tune ϵ to make the predicted length distribution of NMST+LSTM agree with the ground truth length distribution of D_val. In Figure 9, we compare NMST+LSTM's predicted length distribution with ϵ = 5.0 × 10^-4 against that with ϵ = 1.0 × 10^-5. We see that ϵ = 5.0 × 10^-4 better models the data length distribution than ϵ = 1.0 × 10^-5. However, in this case, the average validation perplexity of NMST+LSTM degrades from 101.5 (ϵ = 1.0 × 10^-5) to 105.6 (ϵ = 5.0 × 10^-4), as shown in Table 1.
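The visual comparison in Figures 8 and 9 can be quantified. The sketch below uses the total-variation distance between two empirical length distributions; the metric and helper name are our illustrative choice for this appendix, assuming plain lists of sequence lengths:

```python
from collections import Counter

def length_tv_distance(lengths_a, lengths_b):
    """Total-variation distance between two empirical length distributions:
    0 means identical distributions, 1 means disjoint supports. Smaller means
    the predicted lengths match the data lengths more closely."""
    pa, pb = Counter(lengths_a), Counter(lengths_b)
    na, nb = len(lengths_a), len(lengths_b)
    support = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[l] / na - pb[l] / nb) for l in support)
```

Comparing this distance for each model's generated lengths against the lengths in D_val gives a single-number summary of the histogram overlap.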



Footnotes

- This definition stands for RNN, LSTM, and GRU. For Transformer, h_t = f_θ(y_t, h_1:(t-1)).
- We provide the proof in §C.
- If there is no such ρ, all k sequences in P_t^(1/2) end with ⟨eos⟩. It means that S_beam-k returns a finite sequence, so that p_θ^nmst is consistent with respect to beam search.
- https://github.com/huggingface/tokenizers
- https://github.com/huggingface/transformers



Figure 2: Non-termination ratios, r_nt(L), as a function of L in log-log scale for (a) RNN and (b) LSTM trained on WikiText-2 when using greedy search. We report the mean (curve) ± st.dev. (shaded area) across 10 random experiments. For all configurations, both ST+ (non-red dashed), proposed by Welleck et al. (2020), and our NMST+ (non-red solid) are consistent with respect to greedy search since r_nt(L) goes to 0 as L increases. However, softmax parametrization (VA+, red dotted) is inconsistent with respect to greedy search since its r_nt(L) does not converge to 0 as L → ∞.

ϵ              ST+RNN           NMST+RNN         ST+LSTM          NMST+LSTM
5.0 × 10^-4    186.1 ± (6.2)    184.2 ± (6.5)    106.1 ± (1.0)    105.6 ± (1.2)
1.0 × 10^-4    181.0 ± (3.8)    177.4 ± (7.0)    104.6 ± (1.4)    102.5 ± (1.0)
5.0 × 10^-5    182.6 ± (8.0)    179.6 ± (5.7)    104.7 ± (1.6)    102.1 ± (1.0)
1.0 × 10^-5    180.4 ± (3.3)    177.4 ± (4.5)    104.5 ± (1.4)    101.5 ± (0…)

ϵ              Perplexity (ST+)   Perplexity (NMST+)   r_nt (ST+)       r_nt (NMST+)
5.0 × 10^-4    21.80 ± (0.02)     21.63 ± (0.02)       0.05 ± (0.03)    0.07 ± (0.03)
1.0 × 10^-4    21.21 ± (0.02)     20.86 ± (0.02)       0.72 ± (0.11)    0.22 ± (0.10)
5.0 × 10^-5    21.19 ± (0.03)     20.76 ± (0.02)       0.72 ± (0.11)    0.24 ± (0.10)
1.0 × 10^-5    21.16 ± (0.03)     20.69 ± (0.03)       0.75 ± (0.10)    0.23 ± (0.10)
VA+ baseline: perplexity 20.72 ± (0.03), r_nt 0.27 ± (0.08)

Figure 3: We present p_θ(y_t = ⟨eos⟩|y_<t, x) as a function of t for validation instances of WikiText-103, where p_θ are {VA, ST, NMST}+GPT-2. For {ST, NMST}+GPT-2, we choose ϵ = 1.0 × 10^-5 because it is optimal in terms of validation perplexities in Table 2. Instead of t, we tag the t-th ground truth token. We report the mean (curve) ± st.dev. (shaded area) across 10 random runs. Unlike ST+GPT-2, NMST+GPT-2 can model non-monotonic behaviors of p_θ(y_t = ⟨eos⟩|y_<t, x) with respect to t. Both plots show that the non-monotonic behaviors occur where the sequences could end (e.g., after red marked tokens such as periods).

Figure 4: Validation perplexities as a function of ϵ in log-linear scale for all configurations of RNN (left) and LSTM (right) trained on WikiText-2. We present the average (curve) ± st.dev. (shaded area) across 10 random experiments. For all ϵ and architectures, NMST+ has better validation perplexities than ST+. As ϵ increases, the validation perplexities of both NMST+RNN and NMST+LSTM degrade compared to those of VA+RNN and VA+LSTM. We thus need to search for an optimal ϵ to avoid degradation of validation perplexity when applying NMST+ to our language model.

Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated ...

NMST+ Triple H defeated Randy Orton to win the WWE Championship for the third time.⟨eos⟩

Context: On the west side of the Naktong,

VA+ the NK 6th Division was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK ...

ST+ the 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been ...

Figure 6: Additional plots of p_θ(y_t = ⟨eos⟩|y_<t, x) as a function of t for validation instances of WikiText-103, where p_θ are {VA, ST, NMST}+GPT-2. For {ST, NMST}+GPT-2, we choose ϵ = 1.0 × 10^-5 because it is optimal in terms of validation perplexities in Table 2. Instead of t, we tag the t-th ground truth token. We report the mean (curve) ± st.dev. (shaded area) across 10 random runs. Unlike ST+GPT-2, NMST+GPT-2 exhibits non-monotonic behaviors at plausibly terminating steps (e.g., after red marked tokens such as periods).

Figure 7: Non-termination ratios, r_nt(L), of sequences generated from all variants of RNN (top) and LSTM (bottom), trained on WikiText-2, when using top-k sampling (left), nucleus sampling (middle), and beam search (right), as a function of L in log-log scale. We use the first 10 tokens of every WikiText-2 validation instance as a context. We present the average (curve) with the min-max range (shaded area) across 10 random experiments. VA+ (orange) displays inconsistency (lim_{L→∞} r_nt(L) ↛ 0) for all combinations of model architectures and decoding algorithms, except VA+RNN using top-4 (orange dashed in top left) and VA+LSTM using top-{2, 4} (orange solid and dashed in bottom left, respectively). On the other hand, NMST+ (blue) and ST+ (green) show consistency (lim_{L→∞} r_nt(L) → 0) across all configurations. By using decoding algorithms other than greedy search, VA+LSTM can avoid non-terminating sequences (e.g., top-{2, 4}). However, as shown in Table 1, NMST+{RNN, LSTM} not only have better validation perplexities than VA+{RNN, LSTM} and ST+{RNN, LSTM} but also are consistent with respect to all decoding algorithms.

Figure 8: Length distributions of generated sequences from {VA, ST, NMST}+LSTM trained on WikiText-2 and the data length distribution of ground truth sequences in the WikiText-2 validation dataset, D_val. For {ST, NMST}+LSTM, we select ϵ = 1.0 × 10^-5 since it is optimal in terms of validation perplexities in Table 1. NMST+LSTM better models the length distribution of D_val than both VA+LSTM and ST+LSTM.

Figure 9: Length distributions of predicted sequences from NMST+LSTM trained on WikiText-2 for various ϵ's and the data length distribution of ground truth sequences in the WikiText-2 validation dataset, D_val. The length distribution of NMST+LSTM using ϵ = 5.0 × 10^-5 matches the data length distribution of D_val better than that of NMST+LSTM using ϵ = 1.0 × 10^-4. We can choose ϵ to make the predicted length distribution of NMST+LSTM agree with the ground truth length distribution.

VA+ It was the first building in the United States to be built in concrete, and the first to be built in wood. It was also the first building in the United States to be built in steel. It was the first building in ...

ST+ building is constructed of steel and concrete. The building's exterior is made of steel and concrete. The building's interior is made of wood, and the building's exterior is made of concrete. The building's exterior is made of concrete, and the building's ...

NMST+ building was designed by the architectural firm of Bowers & Wainwright, and was completed in 1892. The building is the largest of its kind in the United States.⟨eos⟩

number three on the issue date of November 18, 2005. It debuted at number one on the chart for the issue date of December 7, 2005. The single debuted at number two on the chart for the issue date of January 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date of March 4, 2006. The single debuted at number two on the Irish Singles Chart for the issue date of June 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date ...

ST+ number three on the chart issue dated March 16, 2010. It debuted at number two on the UK Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010.

The two ships were transferred to the Mediterranean Fleet in early 1915, and were transferred to the North America and West Indies Station in early 1916. They were transferred to the Mediterranean Fleet in early 1917, and were transferred to the North America and West Indies Station in early 1918. They were transferred to the North America and West Indies Station in early 1918, and were transferred to the Mediterranean Fleet in early 1918. They were transferred to the North America and West Indies Station in late 1918, and ...

NMST+ obsolescent by the time of the Second World War, and were eventually scrapped in favour of the much more powerful British battleship HMS Dreadnought.⟨eos⟩

Context: On the July 21, 2014 episode of Raw,

VA+ Triple H and Randy Orton were involved in a tag team match, in which Triple H and Orton were accompanied by Triple H and Randy Orton's manager, Mr. Kennedy. Orton was accompanied by Orton's manager, Mr. Kennedy, and Orton's tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's manager, Mr. Kennedy. Triple H ...

ST+ Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 27 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 30 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. On the August 2 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship.

The Naktong River is the longest river in North and South Korea, and the longest in the whole of North Korea. The Naktong is the longest river in North Korea, and the longest in the whole of North Korea. The river is the longest in the entire country, and the longest in the whole of North Korea.⟨eos⟩

E.4 ADDITIONAL PLOTS FOR FIGURE 3

[Figure: the ground-truth tokens tagged along the horizontal axis read: "( c. 24 – 79 ) did not survive. Still, there are several references to Nero in Pliny's Natural Histories. Pliny has one of the worst opinions of Nero and calls him an 'enemy of mankind.'"]

ACKNOWLEDGMENTS

This work was supported by 42dot, Hyundai Motor Company (under the project Uncertainty in Neural Sequence Modeling), Samsung Advanced Institute of Technology (under the project Next Generation Deep Learning: From Pattern Recognition to AI), and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.

APPENDIX A DEFINITIONS OF COMMON DECODING ALGORITHMS AND THEIR CHARACTERISTICS

In this section, we present mathematical definitions of top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), greedy search, and beam search. We then show whether each of them is an incomplete probable decoding algorithm.

A.1 TOP-K SAMPLING

At each step t, top-k sampling selects the subset of the k most probable tokens in the vocabulary V. Top-k sampling generates decoded sequences from a language model p_θ as follows:

Definition A.1 (Top-k sampling (Fan et al., 2018)). Top-k sampling S_top-k generates a sequence from a language model p_θ given a context x by recursively sampling ŷ_t from

  q(ŷ_t = v | ŷ_<t, x) ∝ p_θ(y_t = v | ŷ_<t, x) · 1(v ∈ V_t),   (12)

where

  V_t = argmax_{V' ⊆ V, |V'| = k} Σ_{v ∈ V'} p_θ(y_t = v | ŷ_<t, x).   (13)

Except in the trivial case k = |V|, we have ∅ ⊊ V_t ⊊ V for all t. By equation 13, equation 6 holds. From equation 12, we see that top-k sampling satisfies equation 5 and equation 7. Therefore, top-k sampling is an incomplete probable decoding algorithm.

A.2 NUCLEUS SAMPLING

At each step t, nucleus sampling selects the smallest subset of most probable tokens in the vocabulary V whose total probability exceeds a given threshold µ. Nucleus sampling generates decoded sequences from a language model p_θ as follows:

Definition A.2 (Nucleus sampling (Holtzman et al., 2020)). Nucleus sampling S_nuc-µ generates a sequence from a language model p_θ given a context x by recursively sampling ŷ_t from

  q(ŷ_t = v | ŷ_<t, x) ∝ p_θ(y_t = v | ŷ_<t, x) · 1(v ∈ V_t),   (14)

where V_t is the smallest subset of the most probable tokens in V such that

  Σ_{v ∈ V_t} p_θ(y_t = v | ŷ_<t, x) ≥ µ.   (15)

If min_{v ∈ V} p_θ(y_t = v | y_<t, x) ≤ 1 − µ for any context x and any t-prefix y_<t, then we have ∅ ⊊ V_t ⊊ V for all t. Suppose that equation 6 does not hold for nucleus sampling. Then this contradicts the fact that V_t is the smallest subset of V satisfying equation 15. From equation 14, we see that nucleus sampling satisfies equation 5 and equation 7. Therefore, nucleus sampling is incomplete and probable.
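Both truncation rules can be sketched in a few lines. The helper below (a hypothetical name, not code from this paper) returns the kept set V_t and the renormalized distribution for either top-k (k given) or nucleus (µ given) truncation; exactly one of the two arguments should be supplied:

```python
def truncate(probs, k=None, mu=None):
    """Return the index set V_t kept by top-k (k given) or nucleus (mu given)
    truncation over a probability vector, plus the distribution renormalized
    on V_t, mirroring equations 12-15."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if k is not None:
        keep = order[:k]  # the k most probable tokens
    else:
        keep, total = [], 0.0
        for i in order:
            keep.append(i)
            total += probs[i]
            if total >= mu:  # smallest most-probable prefix reaching mass mu
                break
    z = sum(probs[i] for i in keep)
    return set(keep), {i: probs[i] / z for i in keep}
```

For a vocabulary distribution (0.5, 0.3, 0.1, 0.1), top-2 keeps {0, 1} and renormalizes to (0.625, 0.375), while nucleus with µ = 0.4 keeps only token 0: both yield a proper subset of V, which is what makes these algorithms incomplete.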

A.3 BEAM SEARCH

Beam search is a heuristic algorithm that operates on the level of prefixes. We use the definition of beam search from Welleck et al. (2020).

, Kinsler was placed on the 15 @-@ day disabled list with a strained right hamstring. He returned to the active roster on April 7, and was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 4, and was placed on the 15 @-@ day DL on May 5. On May 6, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 9. On May 10, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 11. He was placed on the 15 @-@ day DL on May 13. He was placed on the 15 @-@ day DL on May 20. He was placed on the 15 @-@ day DL on May 24. He was placed on the 15 @-@ day DL on May 28. He was placed on the 15 @-@ day DL on June 2. He was placed on ...

, the Yankees signed Rivera to a one @-@ year, $ 2 @.@ 5 million contract. He made his major league debut on April 21, 2010, against the Boston Red Sox. He pitched a scoreless inning in the first inning of the first game of the 2010 World Series against the New York Mets. On May 1, 2010, Rivera was traded to the Pittsburgh Pirates in exchange for J. J. Hardy.⟨eos⟩
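The prefix-level expansion that beam search performs can be illustrated with a minimal sketch over a toy conditional table. Here `step_probs` is a hypothetical callable mapping a prefix to a next-token distribution, and a maximum length guards against the very non-termination this paper studies:

```python
import math

def beam_search(step_probs, k, eos, max_len=100):
    """Minimal beam search sketch: keep the k highest log-probability prefixes
    at each step; prefixes already ending in eos are carried over unchanged."""
    beams = [((), 0.0)]  # list of (prefix, cumulative log-prob)
    for _ in range(max_len):
        if all(b and b[-1] == eos for b, _ in beams):
            break  # every kept prefix has terminated
        cand = []
        for prefix, lp in beams:
            if prefix and prefix[-1] == eos:
                cand.append((prefix, lp))  # finished: keep as-is
            else:
                for tok, p in step_probs(prefix).items():
                    cand.append((prefix + (tok,), lp + math.log(p)))
        beams = sorted(cand, key=lambda c: -c[1])[:k]  # prune to width k
    return [list(prefix) for prefix, _ in beams]
```

With a two-token vocabulary where token 1 has probability 0.6 and ⟨eos⟩ (id 0) has 0.4 at the first step, and ⟨eos⟩ is certain afterwards, beam-2 returns the two finished hypotheses ranked by log-probability.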

