A NON-MONOTONIC SELF-TERMINATING LANGUAGE MODEL

Abstract

Recent large-scale neural autoregressive sequence models have achieved impressive performance on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties, such as non-termination, undesirable repetition, and premature termination, when decoded with algorithms such as greedy search, beam search, top-k sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define an incomplete probable decoding algorithm, which includes greedy search, top-k sampling, and nucleus sampling, going beyond the incomplete decoding algorithm originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability imposed by the self-terminating language model of Welleck et al. (2020), to address the issue of non-terminating sequences under incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences not only under incomplete probable decoding algorithms but also under beam search. We empirically validate our model on sequence completion tasks with various architectures.

1. INTRODUCTION

Autoregressive neural sequence models (Bengio et al., 2000) have been widely used for various natural language generation tasks such as language modeling (Brown et al., 2020; Chowdhery et al., 2022), machine translation (Bahdanau et al., 2014), and conversational dialogue modeling (Vinyals & Le, 2015). Furthermore, large-scale autoregressive neural sequence models have shown an unprecedented ability to generate fluent, human-like text (Vaswani et al., 2017; Brown et al., 2020). Despite their success, these models exhibit undesirable behaviors: non-termination (Welleck et al., 2020), degenerate repetition (Welleck et al., 2019; Holtzman et al., 2020), and premature termination (Koehn & Knowles, 2017; Stahlberg & Byrne, 2019). In this paper, we focus on how to prevent non-termination when using a given decoding algorithm.

Non-termination is the problem that a language model generates infinitely long sequences with positive probability under a given decoding algorithm. Welleck et al. (2020) pointed out that this issue arises from a discrepancy between the distribution of a language model and the distribution induced by an incomplete decoding algorithm. They formalized this disparity with the notion of inconsistency, where a language model generates non-terminating sequences with positive probability under the decoding algorithm. To avoid this inconsistency, they proposed a self-terminating (ST) language model that replaces the usual softmax parametrization of the classifier with a new one. They proved that the ST language model is consistent with respect to greedy search, beam search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020). The ST language model forces the termination probability of each sequence to increase monotonically to 1, but this parametrization is ill-suited to modeling natural language.
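To see how an incomplete decoding algorithm can cause non-termination, consider top-k sampling, which truncates the model's distribution to its k most probable tokens at each step. If ⟨eos⟩ falls outside the top k at every step, the decoder can never terminate, even though the model itself assigns ⟨eos⟩ positive probability. A minimal sketch with a toy distribution (names and numbers are made up for illustration):

```python
# Toy next-token distribution over a 4-token vocabulary. The model assigns
# <eos> positive probability, but <eos> is not among the top-2 tokens here,
# so top-2 sampling cannot emit it at this step.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "<eos>": 0.05}

def topk_support(probs, k):
    """Tokens kept after top-k truncation (the decoder samples only from these)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])

support = topk_support(probs, k=2)
# support == {"a", "b"}: the truncated distribution assigns <eos> zero
# probability, so if this happens at every step, decoding never terminates.
```

The same argument applies to greedy search (k = 1) and to nucleus sampling, which truncates by cumulative probability mass instead of rank.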
(† New York University; ‡ Prescient Design, Genentech; § CIFAR Fellow; * Corresponding author.)

As an illustrative example, suppose there are two sequences in our dataset: "I am a boy" vs. "I am a boy, and you are a girl.". A language model trained on this dataset may or may not terminate after the former. Once the model decides not to end, it should dramatically reduce the termination probability in order to continue. The ST language model, which monotonically increases the termination probability, cannot capture such a case, where one sequence is a prefix of another. We thus propose a non-monotonic self-terminating (NMST) language model, which guarantees consistency with respect to greedy search, beam search, top-k sampling, and nucleus sampling without monotonically increasing the termination probability. The NMST language model encourages the termination probability of each sequence to converge to 1 through the NMST parametrization, but without monotonicity. Even under this relaxation, the proposed NMST language model provably prevents any non-terminating sequence resulting from greedy search, beam search, top-k sampling, and nucleus sampling, which we collectively refer to as incomplete probable decoding algorithms.

We conduct experiments validating the effectiveness of NMST language models on sequence completion tasks, as was done in earlier studies. We test the NMST parametrization with various architectures. Specifically, we train an RNN (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2 (Merity et al., 2016), and we additionally fine-tune GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). Across all these setups, the NMST parametrization effectively prevents non-terminating sequences, especially when compared to the softmax parametrization. Furthermore, the NMST parametrization achieves better (lower) perplexities than the ST parametrization, confirming the importance of relaxing the monotonicity of the termination probability.
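The contrast between the two parametrizations can be illustrated numerically. The following sketch uses made-up curves, not the actual ST or NMST parametrizations: a monotone termination-probability schedule that can only grow, versus a non-monotone one that can dip (as after "I am a boy," above) yet still converges to 1, which is the property that rules out non-termination.

```python
import math

def st_termination(t, eps=0.05):
    # A monotone schedule in the spirit of the ST model: the termination
    # probability 1 - (1 - eps)^t can only grow toward 1. (Illustrative only.)
    return 1.0 - (1.0 - eps) ** t

def nmst_termination(t):
    # A non-monotone schedule that still converges to 1: a damped oscillation
    # above the lower bound 1 - exp(-0.1 * t). (Hypothetical curve, not the
    # parametrization proposed in the paper.)
    return 1.0 - math.exp(-0.1 * t) * (1.0 + 0.5 * math.sin(t)) / 1.5

# st_termination never decreases; nmst_termination can dip (e.g. between
# t = 5 and t = 6), yet both approach 1 as t grows.
```

Only the limit matters for consistency; the NMST model exploits this by letting the termination probability fall when a sequence continues past a plausible ending.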

2.1. NOTATIONS FOR AUTOREGRESSIVE NEURAL SEQUENCE MODELS

Sequences, vocabulary, and ⟨eos⟩. We view an instance (e.g., a sentence or a paragraph) as a sequence y = (y_1, y_2, ..., y_T), where each y_t is an element of a pre-defined finite set of discrete tokens, referred to as a vocabulary V. V includes a special symbol ⟨eos⟩ that appears only at the end of a sequence. Every sequence y must end with ⟨eos⟩; we write the length of y as |y|, so that y_{|y|} = ⟨eos⟩. We call y a non-terminating sequence, |y| = ∞, if y_t ≠ ⟨eos⟩ for all t.

Embedding vectors. Each token v ∈ V is a discrete symbol rather than a numerical vector. To capture the notion of similarity between discrete tokens efficiently, we use an embedding vector u_v ∈ R^m to project v into a continuous embedding space (Bengio et al., 2000; Mikolov et al., 2013b;a; Levy & Goldberg, 2014).

Autoregressive neural sequence models. Bengio et al. (2000) proposed an autoregressive neural sequence model parametrized by θ ∈ R^k. It factorizes p_θ(y|x) into a product of the conditional probabilities of each token given all the previous tokens and an input, in a predefined order:

p_θ(y|x) = ∏_{t=1}^{T} p_θ(y_t | y_{<t}, x),

where y_{<t} is a t-prefix of y and x is an input referred to as a context. For example, x represents either a prompt in sequence completion or a source-side sequence in machine translation. There are several popular architectures for p_θ, such as RNN (Elman, 1990), LSTM (Hochreiter & Schmidhuber, 1997), GRU (Cho et al., 2014), and Transformer (Vaswani et al., 2017). As shown in equation 2, all these models utilize softmax classifiers. In this paper, we modify the parametrization of their softmax classifiers to prevent non-terminating sequences. We thus write a vanilla language model, regardless of its choice of architecture, that uses the original softmax parametrization as p^va_θ in Definition 1.

Definition 1.
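The factorization above can be sketched in code; `cond_prob` below is a hypothetical stand-in for a trained model's p_θ(y_t | y_{<t}, x):

```python
import math

def sequence_log_prob(cond_prob, y, x):
    """log p(y|x) = sum_t log p(y_t | y_<t, x); y must end with <eos>."""
    assert y[-1] == "<eos>"
    total = 0.0
    for t in range(len(y)):
        # y[:t] is the t-prefix y_<t; the <eos> token contributes a factor too.
        total += math.log(cond_prob(y[t], y[:t], x))
    return total

# Toy conditional: uniform over a 4-token vocabulary, ignoring the prefix
# and the context (purely for illustration).
uniform = lambda token, prefix, context: 0.25

lp = sequence_log_prob(uniform, ["I", "am", "a", "<eos>"], x="some prompt")
# lp == 4 * log(0.25): one factor per token, including the final <eos>
```

Note that a sequence's probability includes the factor for emitting ⟨eos⟩, which is exactly the termination probability the paper manipulates.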
A vanilla language model p^va_θ computes the conditional probability of each token v, given a t-prefix y_{<t} and a context x, at each time step t as follows:

p^va_θ(y_t = v | y_{<t}, x) = exp(u_v^⊤ h_t) / Σ_{v′ ∈ V} exp(u_{v′}^⊤ h_t),

where h_t = f_θ(y_t, h_{t−1}) with h_0 = 0.¹



¹ This definition stands for RNN, LSTM, and GRU. For Transformer, h_t = f_θ(y_t, h_{1:(t−1)}).
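Definition 1's softmax classifier can be sketched as a per-step computation. The embeddings and hidden state below are toy values and the recurrence f_θ is omitted; everything here is hypothetical scaffolding around the softmax itself:

```python
import math
import random

# Toy embeddings u_v in R^m for a 3-token vocabulary (values are arbitrary).
random.seed(0)
m = 4
vocab = ["a", "b", "<eos>"]
emb = {v: [random.gauss(0.0, 1.0) for _ in range(m)] for v in vocab}

def softmax_step(h_t):
    """p(y_t = v | y_<t, x) = exp(u_v . h_t) / sum_{v'} exp(u_{v'} . h_t)."""
    logits = {v: sum(u * h for u, h in zip(emb[v], h_t)) for v in vocab}
    zmax = max(logits.values())  # subtract the max for numerical stability
    exps = {v: math.exp(z - zmax) for v, z in logits.items()}
    total = sum(exps.values())
    return {v: e / total for v, e in exps.items()}

p = softmax_step([0.1, -0.2, 0.3, 0.0])
# p is a proper distribution over V: every token, including <eos>, gets
# strictly positive mass, so the softmax alone never forces termination.
```

The last comment is the crux: because softmax probabilities are always strictly positive but never exactly 1, nothing in this parametrization drives the ⟨eos⟩ probability to 1, which is what the ST and NMST parametrizations change.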

