TAILORING LANGUAGE GENERATION MODELS UNDER TOTAL VARIATION DISTANCE

Abstract

The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimization method. From a distributional view, MLE in fact minimizes the Kullback-Leibler divergence (KLD) between the distribution of the real data and that of the model. However, this approach forces the model to distribute non-zero (sometimes large) probability mass to all training samples regardless of their quality. Moreover, in the attempt to cover the low-probability regions of the data distribution, the model systematically overestimates the probability of corrupted text sequences, which we conjecture is one of the main reasons for text degeneration during autoregressive decoding. To remedy this problem, we leverage the total variation distance (TVD), which is robust to outliers, and develop practical bounds to apply it to language generation. We then introduce the TaiLr objective, which balances the tradeoff of estimating TVD. Intuitively, TaiLr downweights real data samples that have low model probabilities, with tunable penalization intensity. Experimental results show that our method alleviates the overestimation of degenerated sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.

1. INTRODUCTION

The dominant approach to training language generation models is to maximize the likelihood of text samples in the training data. With the development of pre-training techniques, the quality of texts generated by current models has improved by a large margin (Radford et al., 2019; Brown et al., 2020). However, text degeneration phenomena, e.g., repetition (Holtzman et al., 2020; Welleck et al., 2020), incoherence (Guan et al., 2021; Ji & Huang, 2021), and other ill-formed generation results sampled from the noisy long tail (Dou et al., 2022; LeBrun et al., 2022), are still widely observed in large pre-trained models. These results indicate that using MLE as the optimization method has theoretical limitations that are hard to compensate for by increasing the model size. Given the real data distribution p(x) and the model distribution q(x) defined by a learned generation model, we can view MLE as minimizing the KLD between p(x) and q(x). However, minimizing D_KL(p, q) leads to a zero-avoiding solution of q(x) that spreads itself to cover all the modes in the real data (Minka, 2005; Malinin & Gales, 2019). As the model is forced to account for all the modes regardless of their quality and saliency, this behavior can deteriorate overall generation quality when (i) the data inherently exhibits too many variations, e.g., in open-ended generation, the model often over-presents unrelated words in the unreliable long tail of its distribution (Holtzman et al., 2020), or (ii) the data contains flawed or noisy references, e.g., hallucination and missing content in text summarization (Zhao et al., 2020) degrade the generation quality of the model. In language generation, the attempt to cover all the non-zero probability regions of the data distribution leads to a problem directly related to text degeneration, which we term data void overestimation.
Concretely, the model assigns considerably more probability mass than it should to the void of the real data distribution, where degenerated text sequences lie. An intuitive illustration is shown in Figure 1, where KLD pushes the model to place large mass on the zero-probability region of the target distribution in order to cover the minor mass portion on the right. These degenerated texts include random word sequences and partially corrupted texts that have high lexical overlap with the real texts. Therefore, during free-run generation, the model is likely to fall into the void regions and produce "over-generalized" text samples that are unlike the training data (Huszar, 2015). In this work, we start with a robust alternative to KL divergence, namely the total variation distance (TVD). TVD is known to be robust to outliers in the data (Beran, 1977; Knoblauch & Vomfell, 2020), as it measures the absolute difference between two probability distributions averaged over each point. In §2.2, we show through gradient analysis that TVD allows the model to assign zero probability to low-quality training samples and prevents overestimation of the data void region. Though appealing, TVD cannot be directly applied to text generation because (i) TVD measures the distance at the sequence level, while we desire a token-level criterion for autoregressive generation models, and (ii) we only have samples from the data distribution, whereas calculating TVD demands the real data probability p(x) of each training sample x. We overcome these two issues by (i) developing an upper bound on the sequence-level TVD with its token-level factorization (§3.1), and (ii) introducing a proxy distribution (§3.2) that handles the bias-variance tradeoff in estimating TVD (§3.3). Finally, we derive the Total Variation Guided Language Generation (TaiLr) objective by leveraging access to the non-zero gradient of TVD to guide the model.
Intuitively, TaiLr weights the log-likelihood of a text sequence at each position according to the model probability and uses a tunable hyperparameter to control the penalization intensity. We first conduct experiments on synthetic data to show that TaiLr achieves better generation quality without sacrificing diversity and reduces the overestimation of degenerated texts compared to MLE. Further experiments on real data demonstrate that the proposed method outperforms existing methods that modify MLE in different aspects on a wide range of language generation tasks, including machine translation, text summarization, and long text generation.

2. BACKGROUND AND MOTIVATION

We consider natural language generation tasks where a conditional generation model p_θ(y|x) parametrized by θ is required to generate the target text sequence y = (y_1, ..., y_T) given the context x. Let p_o(y|x) denote the real data distribution; MLE training is then equivalent to minimizing the KL divergence between p_o and p_θ:

D_KL(p_o, p_θ) = -E_{y∼p_o}[ Σ_{t=1}^T log p_θ(y_t | y_{<t}, x) ] - H(p_o),    (1)

where the generation probability is factorized into the product of conditional token probabilities given the prefix y_{<t} and the context x: p_θ(y|x) = Π_{t=1}^T p_θ(y_t | y_{<t}, x). The first term pushes the model to minimize the negative log-likelihood (NLL) of the training data. The second term is a constant with respect to θ and is therefore commonly ignored in MLE. Despite its simplicity and practical benefits for optimization, MLE is known to suffer from a mismatch to the evaluation metric (Pang & He, 2021) and brittleness to noise in the training data (Kang & Hashimoto, 2020). Motivated by the literature on probability metrics, we draw attention to the total variation distance (TVD) as a naturally robust alternative to KLD. The TVD (Van Handel, 2014) between the data distribution p_o and the model distribution p_θ given the context x is defined as:

D_TV(p_o, p_θ) = (1/2) Σ_{y∈Y} |p_o(y|x) - p_θ(y|x)|    (2a)
              = 1 - Σ_{y∈Y} min( p_o(y|x), p_θ(y|x) ),    (2b)

where Y is the space of all possible text sequences. Intuitively, TVD measures the average absolute difference between p_o(y|x) and p_θ(y|x) over all possible text sequences y ∈ Y. The model therefore learns to properly allocate its probability mass to best describe the major part of the data distribution and to ignore outliers. TVD is also correlated with the distinguishability of samples generated by the model, which is shown to be a balanced criterion that takes both quality and diversity into account (Hashimoto et al., 2019).
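As a quick sanity check (not from the paper), the two forms (2a) and (2b) of TVD can be verified to agree on a toy discrete example; the 4-token vocabulary and the particular probabilities below are illustrative assumptions:

```python
import numpy as np

# Two discrete distributions over a toy 4-token vocabulary.
p = np.array([0.5, 0.3, 0.15, 0.05])    # "data" distribution p_o
q = np.array([0.25, 0.25, 0.25, 0.25])  # "model" distribution p_theta

# Form (2a): half the L1 distance between the distributions.
tvd_l1 = 0.5 * np.abs(p - q).sum()

# Form (2b): one minus the total overlapping mass.
tvd_overlap = 1.0 - np.minimum(p, q).sum()

assert np.isclose(tvd_l1, tvd_overlap)
```

Both forms give 0.3 here: the overlap form makes clear that minimizing TVD maximizes the mass the model shares with the data, with no reward for covering points the data never produces.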
Existing work proposed to optimize distinguishability in a generative adversarial manner (Goodfellow, 2015; Caccia et al., 2020), while Kang & Hashimoto (2020) argued that minimizing its heuristic surrogate via loss truncation is better in practice. Additional related work is provided in Appendix B. In this work, we first analyze the properties of TVD and seek to directly minimize TVD, or at least an upper bound on it, in the task of natural language generation. We first present a toy experiment to illustrate the behavioral difference between KLD and TVD when facing imperfect data, where a single Gaussian model is required to fit a mixture of two Gaussians. As shown in Figure 1, minimizing KLD forces the model to learn a flat distribution that spans itself to cover all the non-zero probability regions, which causes underfitting of the major part of the target probability mass. Furthermore, the model places considerably high probability mass on the void region of the target distribution, which does not correspond to real samples. On the other hand, TVD focuses on the major target mass without overestimating degenerated samples that are unlikely under the target distribution. In language generation, this scenario is realistic and pervasive. For many language generation tasks, it is hard to circumvent noisy or invalid references during the data collection process, e.g., hallucination in text summarization (Zhao et al., 2020) and image captioning (Xiao & Wang, 2021). For applications like open-ended generation, existing autoregressive models pre-trained on large corpora are still reported to over-present the artifacts in the noisy long tail (Holtzman et al., 2020).

2.2. GRADIENT ANALYSIS

To better understand the reason behind the behavioral difference between KLD and TVD in optimization, we analyze their gradients with respect to the model parameters θ. Given a context-target text pair (x*, y*) sampled from the data distribution p_o, we approximate the gradient of KLD with respect to θ using Monte-Carlo sampling:

∇_θ D_KL(p_o, p_θ) ≈ -p_θ(y*|x*)^{-1} ∇_θ p_θ(y*|x*).    (3)

The result is the negative gradient of the model probability weighted by the reciprocal of the model probability on this sample. Intuitively, when a low-quality context-target pair is sampled, the model will be affected by this sample and shift its distribution towards it. If p_θ(y*|x*) ≈ 0, the norm of the gradient becomes very large, which leads to a huge parameter update towards that noisy direction. This explains the phenomena illustrated in §2.1, where KLD pushes the model to cover all the training samples, resulting in an unfocused and flat distribution. For comparison, we calculate the gradient of TVD with respect to θ using equation (2b); the derivation details are provided in Appendix A.1:

∇_θ D_TV(p_o, p_θ) ≈ -p_o(y*|x*)^{-1} ∇_θ p_θ(y*|x*),  if p_θ(y*|x*) < p_o(y*|x*);
∇_θ D_TV(p_o, p_θ) ≈ 0,  if p_θ(y*|x*) ≥ p_o(y*|x*),

where the result switches between a non-zero gradient term and 0 by comparing the model probability with the real data probability. When the model probability exceeds the real data probability (overestimation), the gradient becomes 0, preventing the model from fitting dubious data points. When the model predicts a probability lower than the real probability of the sample (underestimation), the weight is the reciprocal of the real probability of the sample, which has a smaller norm than equation (3). This means that the update towards noisy directions is more conservative, and the model is allowed to assign zero probability to low-quality training samples.
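The contrast between the two gradient weights can be sketched numerically; the helper functions and the example probabilities below are hypothetical illustrations, not the paper's implementation:

```python
def kld_grad_weight(q_sample):
    # MLE/KLD weights the gradient by 1 / p_theta(y*|x*):
    # the weight explodes as the model probability approaches zero.
    return 1.0 / q_sample

def tvd_grad_weight(p_sample, q_sample):
    # TVD weights by 1 / p_o(y*|x*) only while the sample is
    # underestimated (p_theta < p_o); once the model matches or
    # exceeds the data probability, the gradient vanishes.
    return 1.0 / p_sample if q_sample < p_sample else 0.0

# A noisy pair to which the data assigns probability 0.01 but the model
# (rightly) assigns almost none: KLD takes a huge step toward it,
# while TVD's step is bounded by 1 / p_o.
p_noisy, q_noisy = 0.01, 1e-6
print(kld_grad_weight(q_noisy))           # on the order of 1e6
print(tvd_grad_weight(p_noisy, q_noisy))  # roughly 100, bounded by 1 / p_o
```

Once the model overestimates the sample (q ≥ p), the TVD weight drops to zero, which is exactly the "stop fitting dubious data points" behavior described above.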

3. METHODOLOGY

Despite the attractive properties of TVD, several challenges remain in applying it to natural language generation. First, TVD measures the difference between sequence-level probabilities, while for autoregressive language generation models it is typical to use a token-level criterion to supervise the factorized model probability. Although a sequence-level objective can be adopted as a reward function using policy gradient (Williams, 1992; Sutton et al., 1999), this approach is known to suffer from high variance and sparse rewards. Second, calculating TVD requires the real data probability p_o(y|x) of the sample y to be known. One straightforward solution is to train a classifier that estimates the density ratio between p_o(y|x) and p_θ(y|x) (Song et al., 2020). However, the density ratio estimator would introduce undetermined biases due to miscalibration (Grover et al., 2019). In this work, we tackle these challenges by developing practical upper bounds on TVD and deriving a sampling-based learning criterion that can directly substitute for the MLE objective in practice.

3.1. TOKEN-LEVEL FACTORIZATION

As KLD has the nice property of factorizing the sequence-level loss into a summation of token-level losses conditioned on the prefix, as illustrated in equation (1), we ask whether TVD has the same property. We first write the autoregressive factorization of the data probability as p_o(y|x) = Π_{t=1}^T p_o(y_t | y_{<t}, x). Writing p_o^{<t} and p_θ^{<t} for the conditional next-token distributions p_o(· | y_{<t}, x) and p_θ(· | y_{<t}, x), we have the following upper bound:

Proposition 1. D_TV(p_o, p_θ) ≤ E_{y∼p_o}[ Σ_{t=1}^T D_TV(p_o^{<t}, p_θ^{<t}) ].

The result follows from applying the triangle inequality (Hein & Bousquet, 2005) to the right-hand side of equation (2a). The complete proof is provided in Appendix A.2. This result indicates that minimizing the expected sum of the TVD on token-level probabilities minimizes an upper bound of the TVD on their products, where the bound becomes tight as p_θ approaches p_o. Therefore, we can train the model in the MLE fashion, calculating the loss at each position given the prefix of the target sequence.
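Proposition 1 can be checked numerically on a tiny example; the binary vocabulary and the length-2 sequence distributions below are illustrative assumptions:

```python
import numpy as np

# Binary vocabulary, length-2 sequences; rows index y1, columns y2.
p1 = np.array([0.7, 0.3])    # data marginal over the first token
p2 = np.array([[0.8, 0.2],   # data p(y2 | y1 = 0)
               [0.4, 0.6]])  # data p(y2 | y1 = 1)
q1 = np.array([0.5, 0.5])    # model marginal
q2 = np.array([[0.6, 0.4],
               [0.5, 0.5]])

tvd = lambda a, b: 0.5 * np.abs(a - b).sum()

# Sequence-level TVD between the two length-2 distributions.
p_seq = p1[:, None] * p2
q_seq = q1[:, None] * q2
lhs = tvd(p_seq.ravel(), q_seq.ravel())

# Token-level bound: E_{y ~ p_o}[ sum_t D_TV(p_o^{<t}, p_theta^{<t}) ].
rhs = tvd(p1, q1) + sum(p1[y1] * tvd(p2[y1], q2[y1]) for y1 in range(2))

assert lhs <= rhs
```

Here the sequence-level TVD is 0.26 while the token-level bound is 0.37, consistent with the proposition; the gap shrinks as the model distribution approaches the data distribution.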

3.2. ESTIMATION WITH PROXY DISTRIBUTION

Another difficulty in directly applying TVD to train language generation models is the explicit demand for the data probability distribution p_o, while we only have a finite number of samples drawn from it. In contrast to using an additional density ratio estimation model that is both hard to train and potentially biased with undetermined deviation, we estimate the target using a proxy probability distribution and analyze the estimation error. We start by considering the one-hot distribution e^{(w)}, in which only the w-th index is 1 and all others are 0, where w is the target token sampled from the conditional oracle probability p_o^{<t}(·). It is easy to see that the expectation of the one-hot distribution is exactly the oracle probability distribution: E_{w∼p_o^{<t}}[ e^{(w)} ] = p_o^{<t}. We then use e^{(w)} to substitute for the oracle probability p_o^{<t} in TVD, and present the following proposition, which states that the expectation of this estimate upper-bounds the original TVD between the oracle distribution and the model distribution.

Proposition 2. Given w ∼ p_o^{<t} and the one-hot distribution e^{(w)}, the following condition holds: D_TV(p_o^{<t}, p_θ^{<t}) ≤ E_{w∼p_o^{<t}}[ D_TV(e^{(w)}, p_θ^{<t}) ].

3.3. THE BIAS-VARIANCE TRADEOFF

However, using the one-hot distribution as an approximation in practice can lead to high estimation variance. For example, in applications where the real data distribution has high entropy, the one-hot proxy can hardly cover the diverse candidates. We therefore consider a general proxy distribution p̃^{(w)}, where w is the target token, and denote its expectation as p̄^{<t} = E_{w∼p_o^{<t}}[ p̃^{(w)} ]. We then show that the upper bound of the estimation error can be decomposed into a bias term and a variance term:

Error(p̃^{(w)}) ≤ D_TV(p̄^{<t}, p_o^{<t}) [Bias] + E_{w∼p_o^{<t}}[ D_TV(p̃^{(w)}, p̄^{<t}) ] [Variance],    (8)

where Error(p̃^{(w)}) is defined as the difference between the practical estimate E_{w∼p_o^{<t}}[ D_TV(p̃^{(w)}, p_θ^{<t}) ] and the ideal target D_TV(p_o^{<t}, p_θ^{<t}). The derivation applies the triangle inequality to bound the error term (details can be found in Appendix A.4). As an example, the one-hot distribution has zero estimation bias (equation (6)); however, we show in Appendix A.5 that its variance equals 2H_α(p_o^{<t}) with α = 2, where H_α is the Tsallis α-entropy (Tsallis, 1988). The one-hot proxy therefore suffers from a large variance when the entropy of p_o^{<t} is high. To handle the bias-variance tradeoff, we consider a γ-mixture proxy distribution that interpolates between the one-hot distribution and the model distribution: p̃^{(w)} = γ e^{(w)} + (1-γ) p_θ^{<t}. With this mixture proxy, the bias and variance in equation (8) become:

Bias = (1-γ) · D_TV(p_θ^{<t}, p_o^{<t}),    Variance = γ · E_{w∼p_o^{<t}}[ D_TV(e^{(w)}, p_o^{<t}) ].

As we tune γ from 1 to 0, the proxy distribution smoothly transfers from the unbiased one-hot distribution to a soft distribution, which reduces the variance of the one-hot estimate and stabilizes training in the early stage.
Although this comes at the cost of an increased estimation bias at the beginning of training, the bias gradually decreases as the model fits the data distribution more accurately over the course of training.
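A minimal numerical sketch of this tradeoff, with toy oracle and model distributions (both assumptions), shows the bias and variance terms moving in opposite directions as γ varies:

```python
import numpy as np

# Toy oracle and model next-token distributions (illustrative).
p_o = np.array([0.4, 0.3, 0.2, 0.1])
p_th = np.full(4, 0.25)  # e.g., a model early in training

tvd = lambda a, b: 0.5 * np.abs(a - b).sum()
eye = np.eye(4)

def bias_variance(gamma):
    # Bias and variance terms of the gamma-mixture proxy from Section 3.3.
    bias = (1.0 - gamma) * tvd(p_th, p_o)
    variance = gamma * sum(p_o[w] * tvd(eye[w], p_o) for w in range(4))
    return bias, variance

# gamma = 1 recovers the unbiased but high-variance one-hot proxy;
# gamma = 0 is zero-variance but maximally biased.
for gamma in (1.0, 0.5, 0.0):
    print(gamma, bias_variance(gamma))
```

With these numbers, γ = 1 gives bias 0 and variance 0.7, while γ = 0 gives bias 0.2 and variance 0; intermediate γ trades one against the other linearly, and the bias term itself shrinks as p_θ approaches p_o during training.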

3.4. TOTAL VARIATION GUIDED LANGUAGE GENERATION (TAILR)

Finally, we introduce the TaiLr objective by summarizing the above results. Given the target token w, we derive the TVD between the proxy distribution p̃^{(w)} = γe^{(w)} + (1-γ)p_θ^{<t} and the model distribution p_θ^{<t} following equation (2b):

D_TV(p̃^{(w)}, p_θ^{<t}) = 1 - E_{y_t∼p̃^{(w)}}[ min( 1, p_θ^{<t}(y_t) / p̃^{(w)}(y_t) ) ],

where the expectation is approximated by Monte-Carlo sampling from the proxy distribution. When y_t ≠ w is sampled, the gradient of D_TV(p̃^{(w)}, p_θ^{<t}) is always 0, which is inefficient for optimization. We therefore consider the non-zero gradient when y_t is sampled as the target token w to guide the model, i.e., -∇_θ p_θ^{<t}(w) / p̃^{(w)}(w), and devise the TaiLr objective whose gradient is equivalent to it:

L_TaiLr(w; θ) = - [ p_θ^{<t}(w) / (γ + (1-γ) p_θ^{<t}(w)) ] · log p_θ^{<t}(w),

where the weighting factor is detached in back-propagation and only the log term receives gradient. The equivalence of ∇_θ L_TaiLr(w; θ) and the non-zero gradient of D_TV(p̃^{(w)}, p_θ^{<t}) can be seen by applying f(x)∇_x log f(x) = ∇_x f(x). In Figure 2, we show the computational graph of the TaiLr objective. As γ goes from 1 to 0, TaiLr shifts from an estimate of TVD to unweighted MLE. Intuitively, TaiLr downweights samples assigned low probabilities by the model, so that the model focuses on modeling high-quality samples during training and reduces the overestimation of degenerated texts during inference. To counter the negative effect of random predictions at the early training stage, we set a threshold as a lower bound on the weighting factor.
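A minimal sketch of the per-position TaiLr loss might look as follows; the function name, the scalar (non-batched) interface, and the default floor value are assumptions, and a real autograd implementation would detach the weighting factor from the computation graph rather than rely on the comment:

```python
import math

def tailr_loss(p_w, gamma, floor=0.0):
    """Per-position TaiLr loss (sketch; scalar, not batched).

    p_w   -- model probability p_theta^{<t}(w) of the target token w
    gamma -- mixture weight of the one-hot proxy distribution
    floor -- lower bound on the weighting factor (the paper uses such
             a threshold against random predictions early in training)
    """
    # Weighting factor p / (gamma + (1 - gamma) * p). In a real
    # implementation this factor is detached, so only log p_w
    # receives gradient.
    weight = max(p_w / (gamma + (1.0 - gamma) * p_w), floor)
    return -weight * math.log(p_w)
```

Setting γ = 0 makes the weight 1 and recovers plain NLL; setting γ = 1 makes the weight equal to p_w itself, maximally downweighting tokens the model finds unlikely.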

4. EXPERIMENTS

In the previous sections, we showed that the proposed method is a practical estimate of TVD with theoretical guarantees. Next, we demonstrate its empirical performance. First, we conduct a synthetic experiment to investigate the behavior of models trained with TaiLr and MLE in controlled settings where the underlying oracle distribution is known. Second, we compare TaiLr with other baselines in a more realistic setting, where we train generation models with standard architectures or finetune pre-trained models on a wide range of language generation benchmarks. Experimental details not included in the following sections are provided in Appendix E.1.

4.1. SYNTHETIC EXPERIMENTS

The synthetic data. In this subsection, our goal is to test the behavior of TaiLr in the task of text generation. Since we seek to analyze distributional properties, we sample training data from an oracle model whose distribution is known. Instead of using random distributions (Yu et al., 2017; Guo et al., 2018), we follow LeBrun et al. (2022) and train an oracle model on real human texts to generate synthetic data, so that the results better generalize to real data. Specifically, we train a 1-layer LSTM on the texts of the COCO image caption dataset (Lin et al., 2014) without any conditional inputs. We sample 10K synthetic examples for training and 5K for validation. The model setting. We train two LSTMs with the same architecture as the oracle model using MLE and TaiLr, which we denote as p_MLE and p_TaiLr, respectively. We train both models for 100 epochs and pick the best checkpoint with the lowest perplexity on the development set. We use random sampling to obtain text samples from the learned generation models. Performance evaluation. To thoroughly evaluate the generation performance of the two models, we follow Yu et al. (2017) and Caccia et al. (2020) in evaluating generation quality with PPL_oracle and coverage of the oracle distribution with PPL_test. Specifically, PPL_oracle is the likelihood of the oracle model calculated on samples generated by the learned model, while PPL_test is the likelihood of the learned model evaluated on held-out data.
We also include the BLEU score (Papineni et al., 2002) to calculate the average n-gram overlap between the generated samples and the held-out corpus, and SelfBLEU (Zhu et al., 2018), which computes the average overlap of each generated sample with the other samples generated by the model. For evaluation, we use 20K held-out examples and sample 20K texts from each of the two generation models.
As shown in Table 1, TaiLr improves generation quality by nearly 5 points of PPL_oracle without sacrificing coverage of the oracle distribution, as it achieves a PPL_test similar to MLE's. We also observe that TaiLr achieves a higher BLEU-4 than MLE, though lower than the training data. Finally, MLE has the highest SelfBLEU-4, which shows its tendency to over-generalize to unseen samples that may include repeated patterns and degrade diversity, while TaiLr achieves the lowest SelfBLEU-4. We also report results using GPT2-Medium as the oracle model in Appendix E.1.1. Perturbation evaluation. Next, we evaluate the models' behavior on perturbed sequences and relate text degeneration to the model's overestimation of perturbed data. To quantify the deviation of the model's estimate from the real data distribution, we define the model estimation error of a sample x as the difference between the sequence-level log probability given by the model and its true log probability, i.e., Error(x) = log p_θ(x) - log p_o(x). We then describe the construction of the perturbed dataset. Given x sampled from p_o, we iteratively apply small perturbations to x so that each lexical change is small. After N perturbations, x → x^(1) → ··· → x^(N) smoothly transfers a data point into a perturbed sample. We propose the following perturbations, which highlight widely observed text degeneracy patterns in generation: (1) repeat a token in x (repetition, Welleck et al. (2020)); (2) delete the last token in x (oversmoothing, Kulikov et al. (2021)); (3) substitute a token in x with a token from the vocabulary (incoherence, Holtzman et al. (2020)). We sample 20K samples from p_o and apply N = 30 perturbations to each sample. We first plot the estimation error map of the two models on the perturbed dataset in Figure 3.
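The three perturbation operations can be sketched as simple list manipulations; the toy vocabulary and function names below are assumptions for illustration:

```python
import random

VOCAB = ["a", "cat", "sat", "on", "the", "mat"]  # toy vocabulary (assumption)

def repeat_token(tokens, i):
    # Repetition (Welleck et al., 2020): duplicate the token at position i.
    return tokens[:i + 1] + tokens[i:]

def delete_last(tokens):
    # Oversmoothing (Kulikov et al., 2021): drop the final token.
    return tokens[:-1]

def substitute_token(tokens, i, new_token):
    # Incoherence (Holtzman et al., 2020): replace the token at position i.
    return tokens[:i] + [new_token] + tokens[i + 1:]

def perturb_chain(tokens, n, rng):
    # Apply n random small perturbations: x -> x(1) -> ... -> x(n).
    chain = [list(tokens)]
    for _ in range(n):
        x = chain[-1]
        op = rng.choice(["repeat", "delete", "substitute"])
        if op == "repeat":
            x = repeat_token(x, rng.randrange(len(x)))
        elif op == "delete" and len(x) > 1:
            x = delete_last(x)
        else:
            x = substitute_token(x, rng.randrange(len(x)), rng.choice(VOCAB))
        chain.append(x)
    return chain
```

Scoring each intermediate x^(i) under both the oracle and the learned model then yields the estimation-error map described next.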
For each perturbed sample x^(i), the oracle log-probability log p_o(x^(i)) is shown on the x-axis, the number of perturbations i is shown on the y-axis, and the estimation error Error(x^(i)) is reflected by the shade. From the figures on the left, we first restate LeBrun et al. (2022)'s finding that existing models underestimate real samples while overestimating degenerated ones. Next, comparing the two figures, we observe that as the number of perturbations increases, p_TaiLr alleviates the overestimation exhibited by p_MLE, especially in the long tail. Finally, we draw cases from different regions of the figure to illustrate the implication for text degeneration. We first present a degenerated sample (1) in the top-right corner, which has low oracle probability. We then present two real data samples (2) and (3) that have the same model probability under p_TaiLr and p_MLE as the degenerated one (1). Although (3) is actually more probable than the degenerated sample (1) under the oracle distribution, MLE cannot distinguish between them, leading to degeneracy patterns during generation. To quantify the overestimation caused by perturbation, we further define the maximum overestimation error over N perturbations as max_{i=1,...,N} Error(x^(i)). To manifest the overestimation problem during generation, we plot the maximum overestimation error averaged over samples of similar length in Figure 4. Note that the average NLL of these degenerated samples is 11.09 for p_MLE and 10.82 for p_TaiLr. For MLE, the overestimation error amplifies as the generation length increases, while TaiLr keeps the error nearly constant. This result demonstrates that, by weighting the likelihood at each position of the sequence during training, TaiLr alleviates MLE's tendency to sample degenerated texts as the generation length grows.

Error accumulation analysis. Finally, we analyze the error accumulation during autoregressive decoding.
We follow Arora et al. (2022) and use the ExAccErr metric, which calculates the percentage of excess error due to the discrepancy between training (conditioning on contexts sampled from p_o) and inference (conditioning on contexts sampled from p_θ), i.e., exposure bias. Detailed definitions borrowed from Arora et al. (2022) are provided in Appendix E.1.2. We find that the excess error of the MLE model (40.1%) is substantially higher than that of the model trained with TaiLr (8.6%), which demonstrates that TaiLr effectively reduces error accumulation during autoregressive decoding.

4.2. REAL-DATA EXPERIMENTS

In this subsection, we describe the empirical evaluation of TaiLr on a wide range of real-world language generation tasks: (1) Machine translation: given a sentence in the source language, translate it into the target language. (2) Text summarization: given a passage, generate a short sentence that summarizes its main point. (3) Long text generation: given a title, generate a coherent long passage that conforms with the title. Statistics and sources of all datasets used in the experiments are provided in Appendix D. Apart from MLE, we also consider the following typical baselines that propose new training objectives beyond MLE: (1) Unlikelihood training (Welleck et al., 2020) penalizes unlikely generations, e.g., token repetitions, through an auxiliary unlikelihood loss. (2) D2GPo (Li et al., 2020) proposes a data-dependent Gaussian prior objective that smooths the one-hot target distribution based on word embedding distance. (3) Loss truncation (Kang & Hashimoto, 2020) abandons the c-fraction of training samples with the highest NLL, which heuristically optimizes distinguishability. (4) GOLD (Pang & He, 2021) learns from human demonstrations in the off-policy setting of reinforcement learning (RL). For a fair comparison, we compare with GOLD-δ, which does not use scoring models with additional parameters. We use paired bootstrap resampling (Koehn, 2004) for significance testing in all tasks.

Machine translation. We evaluate the proposed method on the widely used machine translation benchmark IWSLT14 De-En using the standard Transformer architecture (Vaswani et al., 2017). Training settings and detailed hyperparameters of the different models are provided in Appendix E.2. The best checkpoint is selected based on the highest BLEU (Papineni et al., 2002) score on the development set. We use beam search with a beam size of 5 for decoding.
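For concreteness, a minimal token-level sketch of the first baseline, the unlikelihood objective: the MLE term is augmented with a penalty on negative candidate tokens. The candidate set and the loss weight `alpha` are assumptions here; see Welleck et al. (2020) for the exact formulation.

```python
import math

def unlikelihood_token_loss(p_target, p_negatives, alpha=1.0):
    """NLL of the target token plus an unlikelihood penalty on negative
    candidates (e.g. tokens already generated in the context, to discourage
    repetition). `alpha` weights the auxiliary loss."""
    likelihood_loss = -math.log(p_target)
    unlikelihood_loss = -sum(math.log(1.0 - p_c) for p_c in p_negatives)
    return likelihood_loss + alpha * unlikelihood_loss
```

When the model assigns high probability to a negative candidate, log(1 − p_c) → −∞, so the penalty strongly pushes that probability down.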
In Table 2, we show the performance of our method and the baselines in terms of BLEU score. TaiLr achieves a higher BLEU score than MLE, which indicates that TVD effectively improves generation quality over KLD. TaiLr also significantly outperforms the other objectives that modify the MLE baseline.

Text summarization. We then test the proposed method on abstractive text summarization. We use the Annotated Gigaword corpus (Rush et al., 2015), as it is known to contain noisy references due to annotation errors (Klebanov & Beigman, 2010; Kang & Hashimoto, 2020). As pre-trained Transformer models have achieved strong performance, we finetune the BART-base (Lewis et al., 2020) model with the different methods and examine whether they still improve upon this strong baseline. More training details and hyperparameter settings are provided in Appendix E.3. We select the best checkpoint based on the highest ROUGE-L (Lin, 2004) score on the development set. During inference, we use beam search with a beam size of 5 and prohibit decoding repeated 3-grams. We report the ROUGE-1/2/L scores on the test set of the Gigaword dataset in Table 3, where TaiLr outperforms all the baseline methods on all evaluation metrics. The result demonstrates the effectiveness of our method in a realistic setting where noisy data pairs exist.

Following Ji & Huang (2021), we use Nucleus sampling (Holtzman et al., 2020) with p = 0.95 and restrict the maximum generation length to 1,024 subwords. For automatic evaluation, we use BLEU-n (B-n) to evaluate the n-gram overlap with the human reference, Distinct-n (Li et al., 2016) (D-n) to compute the ratio of unique n-grams, rep-l (Welleck et al., 2020) to calculate the repetition rate within a context window of l, and Mauve (Pillutla et al., 2021), which assesses the distributional deviation between model-generated texts and human language by calculating the area under the divergence curve.
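The Distinct-n and rep-l metrics above can be sketched as follows; the rep-l formula (fraction of tokens that already occurred within the preceding l tokens) is our reading of Welleck et al. (2020) and should be treated as an assumption:

```python
def distinct_n(samples, n):
    """Distinct-n: ratio of unique n-grams among all n-grams in the generations."""
    ngrams = [tuple(s[i:i + n]) for s in samples for i in range(len(s) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def rep_l(tokens, l):
    """rep-l: fraction of tokens that already occurred in the previous l tokens."""
    hits = sum(1 for i, t in enumerate(tokens) if t in tokens[max(0, i - l):i])
    return hits / len(tokens) if tokens else 0.0
```

A fully repetitive generation drives Distinct-n toward 0 and rep-l toward 1, which is the degeneracy signature these metrics are meant to expose.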
As shown in Table 4, TaiLr outperforms the MLE baseline on all metrics. Among the other baselines, Loss truncation abandons long samples with high NLL, leading to overly short generations and low n-gram overlap with the reference. GOLD tends to concentrate on very few modes of the target distribution, as discussed by Pang & He (2021), which causes low diversity and a large discrepancy from the distribution of human language.

Ablation study and discussion. We conduct an ablation study of adjusting γ on different tasks to show that its tendency and sensitivity interval vary across tasks. In Figure 5 in Appendix E.5, we present the result of adjusting γ on WritingPrompts on the left, and observe that the highest Mauve score is achieved when γ is around 10^-5, while performance quickly degrades as γ approaches 1. On the right of Figure 5, we observe that the best performance on IWSLT14 De-En is achieved when γ is around 0.1, while either increasing or decreasing γ leads to a notable performance drop. Empirically, the scale of the best-performing γ is related to the intrinsic entropy of the dataset. For stable training, we require the estimation variance in equation (9) to be small, which leads to a small γ when the entropy of the data is high. Since the model generally has higher NLL on long text generation than on machine translation, the scale of the best γ is shifted towards 0 on WritingPrompts. To determine the sensitivity interval of γ, we suggest tuning the scale of γ based on the average NLL on the training data, where the empirical principle is to make the weighting factor in equation (11) relatively large to stabilize training. Although simple, we argue that this hyperparameter is crucial to the generality of the method, and we leave other solutions that dynamically adjust or anneal it to future work.

5. CONCLUSION

In this work, we draw attention to the total variation distance (TVD), a robust alternative to KL divergence (KLD). We show that TVD addresses the zero-avoiding problem of KLD and mitigates overestimation of the degenerated sequences, which in turn improves the overall generation quality. To apply TVD to the task of language generation, we derive practical upper bounds, and introduce our Total Variation Guided Language Generation (TaiLr) objective that balances the bias-variance tradeoff of estimating TVD with a tunable hyperparameter. Our experiments on synthetic data and real-data benchmarks demonstrate that TaiLr alleviates the overestimation problem and the error accumulation during autoregressive decoding, and improves the generation quality over competitive baselines beyond MLE on a wide range of language generation tasks.

A DERIVATIONS AND PROOFS

A.1 DERIVATION OF EQUATION 4

Starting from equation (2b), we rewrite the summation into an expectation under p_o, which is further approximated by sampling a context-target pair (x*, y*) from p_o:

$$
\begin{aligned}
\nabla_\theta D_{TV}(p_o, p_\theta)
&= -\nabla_\theta \sum_{y\in\mathcal{Y}} \min\big(p_o(y|x^*),\, p_\theta(y|x^*)\big) && (12)\\
&= -\nabla_\theta\, \mathbb{E}_{y\sim p_o}\Big[\min\Big(1,\, \frac{p_\theta(y|x^*)}{p_o(y|x^*)}\Big)\Big] && (13)\\
&\approx -\nabla_\theta \min\Big(1,\, \frac{p_\theta(y^*|x^*)}{p_o(y^*|x^*)}\Big) && (14)\\
&= \begin{cases}
-\dfrac{\nabla_\theta\, p_\theta(y^*|x^*)}{p_o(y^*|x^*)}, & p_\theta(y^*|x^*) < p_o(y^*|x^*)\\[6pt]
0, & p_\theta(y^*|x^*) \ge p_o(y^*|x^*),
\end{cases} && (15)
\end{aligned}
$$

where the third line is approximated by Monte-Carlo sampling and the case analysis in the last line follows from comparing p_θ(y*|x*) and p_o(y*|x*).
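The identity used in the first line of the derivation, D_TV(p, q) = 1 − Σ_y min(p(y), q(y)), can be checked numerically against the absolute-difference definition:

```python
def tvd(p, q):
    """Total variation distance: half the L1 distance between distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def tvd_via_min(p, q):
    """Equivalent form used in the derivation: D_TV = 1 - sum_y min(p(y), q(y))."""
    return 1.0 - sum(min(pi, qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
# Both forms agree (0.3 for this pair), so differentiating -sum min(p, q)
# is the same as differentiating D_TV itself.
assert abs(tvd(p, q) - tvd_via_min(p, q)) < 1e-12
```

The case analysis in equation (15) then follows because min(1, r) has gradient 1 for r < 1 and gradient 0 for r ≥ 1.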

A.2 PROOF OF PROPOSITION 1

Proposition 1. Given $p_o(y|x) = \prod_{t=1}^T p_o^{<t}(y_t)$ and $p_\theta(y|x) = \prod_{t=1}^T p_\theta^{<t}(y_t)$, the following condition holds:

$$
D_{TV}(p_o, p_\theta) \le \mathbb{E}_{y\sim p_o}\Big[\sum_{t=1}^T D_{TV}\big(p_o^{<t}, p_\theta^{<t}\big)\Big].
$$

Proof. Let $a_t = p_o^{<t}(y_t)$ and $b_t = p_\theta^{<t}(y_t)$. We define

$$
c_t = (a_1 \times \cdots \times a_t)\times(b_{t+1}\times\cdots\times b_T), \qquad (16)
$$

and then we present the following inequality, which can be derived from the general triangle inequality: $|c_T - c_0| \le \sum_{t=1}^T |c_t - c_{t-1}|$. After replacing $c_t$ with $a_t$ and $b_t$ we have:

$$
\Big|\prod_{t=1}^T a_t - \prod_{t=1}^T b_t\Big| \le \sum_{t=1}^T |a_t - b_t|\cdot\prod_{i=1}^{t-1} a_i \cdot \prod_{j=t+1}^{T} b_j. \qquad (17)
$$

Then we derive the upper bound of equation (2a):

$$
\begin{aligned}
D_{TV}(p_o, p_\theta)
&= \frac{1}{2}\sum_{y}\big|p_o(y) - p_\theta(y)\big| && (18)\\
&= \frac{1}{2}\sum_{y_1,\cdots,y_T}\Big|\prod_{t=1}^T p_o^{<t}(y_t) - \prod_{t=1}^T p_\theta^{<t}(y_t)\Big| && (19)\\
&\le \frac{1}{2}\sum_{t=1}^T \sum_{y_1,\cdots,y_t}\prod_{i=1}^{t-1} p_o^{<i}(y_i)\cdot\big|p_o^{<t}(y_t) - p_\theta^{<t}(y_t)\big|\cdot \sum_{y_{t+1},\cdots,y_T}\prod_{j=t+1}^{T} p_\theta^{<j}(y_j) && (20)\\
&= \frac{1}{2}\sum_{t=1}^T \sum_{y_t} \mathbb{E}_{y_{<t}\sim p_o}\big|p_o^{<t}(y_t) - p_\theta^{<t}(y_t)\big| && (21)\\
&= \mathbb{E}_{y\sim p_o}\Big[\sum_{t=1}^T D_{TV}\big(p_o^{<t}, p_\theta^{<t}\big)\Big], && (22)
\end{aligned}
$$

where equation (20) uses the conclusion from equation (17), and equation (21) is obtained by marginalizing out $y_{t+1}, \cdots, y_T$ (the trailing product of $p_\theta^{<j}$ sums to 1).

A.5 CONNECTION WITH TSALLIS α-ENTROPY

When using the one-hot distribution as the proxy distribution, $p(w) = e^{(w)}$, the variance term of equation (8) can be derived into:

$$
\begin{aligned}
\mathbb{E}_{w\sim p_o^{<t}}\big[D_{TV}(e^{(w)}, p_o^{<t})\big]
&= \mathbb{E}_{w\sim p_o^{<t}}\Big[1 - \sum_{y_t}\min\big(e^{(w)}(y_t),\, p_o^{<t}(y_t)\big)\Big] && (30)\\
&= \mathbb{E}_{w\sim p_o^{<t}}\big[1 - p_o^{<t}(w)\big] && (31)\\
&= 1 - \sum_{w} p_o^{<t}(w)^2. && (32)
\end{aligned}
$$

As the Tsallis α-entropy is defined as

$$
H_\alpha(p) = \begin{cases}\dfrac{1}{\alpha(\alpha-1)}\Big(1 - \sum_i p_i^\alpha\Big), & \alpha \neq 1\\[6pt] -\sum_i p_i \log p_i, & \alpha = 1,\end{cases} \qquad (33)
$$

the variance term can be seen as $2H_2(p_o^{<t})$, which reflects the intrinsic uncertainty of the oracle data distribution.
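Proposition 1 and the Tsallis-entropy identity can be verified numerically on a toy two-step autoregressive model (the particular distributions below are arbitrary examples):

```python
from itertools import product

def tvd(p, q):
    """TVD between two distributions given as dicts over the same support."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

# Two-step toy model over a binary vocabulary: each dict maps a prefix
# (tuple of tokens) to a next-token distribution.
p_o  = {(): {0: 0.6, 1: 0.4}, (0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}
p_th = {(): {0: 0.5, 1: 0.5}, (0,): {0: 0.6, 1: 0.4}, (1,): {0: 0.4, 1: 0.6}}

def seq_prob(model, y):
    prob = 1.0
    for t in range(len(y)):
        prob *= model[y[:t]][y[t]]
    return prob

seqs = list(product([0, 1], repeat=2))

# Left-hand side: sequence-level TVD.
lhs = 0.5 * sum(abs(seq_prob(p_o, y) - seq_prob(p_th, y)) for y in seqs)

# Right-hand side: expected sum of token-level TVDs under p_o
# (the per-step TVD depends only on the prefix y_{<t}).
rhs = sum(seq_prob(p_o, y) * sum(tvd(p_o[y[:t]], p_th[y[:t]]) for t in range(2))
          for y in seqs)

assert lhs <= rhs  # Proposition 1 holds: 0.14 <= 0.24 for this toy model

# Variance term vs. Tsallis 2-entropy: 1 - sum_w p(w)^2 == 2 * H_2(p).
def tsallis_2(dist):
    return 0.5 * (1.0 - sum(v * v for v in dist.values()))

root = p_o[()]
assert abs((1.0 - sum(v * v for v in root.values())) - 2.0 * tsallis_2(root)) < 1e-12
```

The gap between the two sides (0.14 vs. 0.24) illustrates that the token-level factorization is an upper bound, not an equality.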

B ADDITIONAL RELATED WORK

Text Degeneration and Solutions. The phenomenon of text degeneration has been specified in previous literature (Holtzman et al., 2020; Welleck et al., 2020): a well-trained language model is observed to get stuck in repetitive loops or produce incoherent content. One line of work attributes this problem to the improper goal of decoding algorithms that maximize likelihood, and proposes to sample from a truncated probability distribution to balance repetitiveness and incoherence (Holtzman et al., 2020; Basu et al., 2021). Another line of work attempts to address this issue by designing new training objectives. Welleck et al. (2020) proposed to minimize the likelihood of degenerated cases through unlikelihood training; however, designing unlikelihood objectives to cover every degenerated case is impossible. Policy gradient-based RL algorithms are another widely adopted option due to the flexibility of reward design (Norouzi et al., 2016; Pasunuru & Bansal, 2018; Wu et al., 2018). Recently, Pang & He (2021) proposed an off-policy RL algorithm that achieves strong performance on top of pre-trained generation models by circumventing the optimization challenges of traditional on-policy RL methods (Choshen et al., 2019). Our work falls into the second line. Instead of relying on heuristics, we start from the distributional property of the current MLE objective and derive a new principled objective by leveraging the total variation distance, which we show reduces the overestimation of degenerated samples.

Distance Metrics beyond KLD. Maximum likelihood estimation (MLE) is the standard training criterion for language generation due to its simplicity and its theoretical guarantee of minimizing the Kullback-Leibler divergence (KLD) (Zhang & Zhao, 2019).
However, minimizing the KLD between the data and model distributions is known to lead to the zero-avoiding solution, which forces the model to cover all the modes in the data at the expense of fitting accuracy on individual modes. To address this issue, previous studies have introduced other distance metrics to substitute for or regularize the standard KLD. The reverse KLD between the model and data distributions has been studied as a balance to the standard KLD (Huszar, 2015; Li et al., 2019; Jiang et al., 2020). In language generation, Li et al. (2019) applied reverse KLD to regularize the standard KLD in machine translation; however, the regularization intensity requires sophisticated control, which makes the approach less practical. Total variation distance (TVD) has previously been adopted to evaluate generation models via the distinguishability of samples (Hashimoto et al., 2019; Gehrmann et al., 2019; He et al., 2021). Generative adversarial networks are known to directly minimize distinguishability, but they are challenging to optimize in practice (Caccia et al., 2020). Kang & Hashimoto (2020) proposed to optimize TVD by heuristically truncating samples with high NLL. Other metrics such as the power divergence (Labeau & Cohen, 2019) and the Hellinger distance (Zhang & Zhao, 2019) have also been explored in previous literature, but they lack theoretical justification and empirical evidence of superiority in language generation. In this work, we seek to directly optimize TVD, and obtain the TaiLr objective by deriving practical upper bounds on TVD.

C DISCUSSIONS

In this section, we discuss the connection of our method to two major baselines we compared with, i.e., loss truncation (Kang & Hashimoto, 2020) and GOLD (Pang & He, 2021) as the two works share a similar motivation with us to downweight unlikely samples. We will briefly review the two methods and then compare them to our method to show why our approach excels.

C.1 LOSS TRUNCATION

Briefly speaking, loss truncation abandons samples with high log loss (i.e., negative log-likelihood) during training. In Proposition 1 of Kang & Hashimoto (2020), they prove that the model's log loss on the truncated data distribution, plus an extra constant, upper-bounds the total variation distance (TVD). The authors thus propose to optimize the model on a "simpler" subset of the full dataset in order to achieve a smaller log loss that tightens the upper bound (note that the constant trades off against the bound, as removing more data increases the constant). To achieve this, they first heuristically drop the "hard" samples with the highest log losses (hot-start stage), and then train the model on the remaining data (training stage). Loss truncation can be regarded as downweighting samples at the sequence level with binary weights. However, sequence-level sample dropping may systematically lose specific modes in the data, which is undesirable. For example, we observed overly short generations from loss truncation (384, compared to 511 for MLE) on WritingPrompts, as a result of dropping long samples with high log losses. Moreover, determining which samples to drop relies heavily on the hot-start stage, where the model should be neither too random nor too certain; in practice, it is hard to determine how far hot-start training should proceed. In our work, we derive our training objective TaiLr from practical upper bounds of TVD, which softly weights the log loss at the token level. We also analyze the bias-variance tradeoff in estimating the bound. By appropriately setting the tradeoff hyperparameter, we can reduce the estimation variance at the start of training and gradually decrease the estimation bias as the model becomes more accurate, which circumvents the need to hot-start the model.
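A minimal sketch of the sequence-level binary weighting that loss truncation applies (omitting the hot-start stage that estimates the log losses; the rounding of the drop count is an assumption for illustration):

```python
def loss_truncation_weights(nll_per_sample, c):
    """Binary sequence-level weights: drop the c-fraction of samples with the
    highest NLL and keep the rest with weight 1."""
    n = len(nll_per_sample)
    n_drop = int(round(n * c))
    by_loss = sorted(range(n), key=lambda i: nll_per_sample[i])  # ascending NLL
    dropped = set(by_loss[n - n_drop:]) if n_drop > 0 else set()
    return [0.0 if i in dropped else 1.0 for i in range(n)]
```

With c = 0.2 and five samples, the single highest-NLL sample gets weight 0 and the rest weight 1; this all-or-nothing weighting is exactly the contrast with TaiLr's soft token-level weights.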

C.2 GOLD

On the other hand, GOLD learns the generation policy in an offline setting, i.e., calculating rewards on target trajectories and reweighting the policy gradient with importance weights. They made several approximations, including simplifying the multi-step importance weights into a single step and assuming the per-step target policy to be uniform. Besides the importance weight, they also proposed several reward functions that score the sequence with another generation model trained by MLE. Although starting from a quite different motivation, the final form of their training objective is quite similar to ours. From an empirical view, GOLD assigns soft token-level weights to samples based on the importance weight and hand-crafted reward functions. However, one downside (also discussed in the original GOLD paper) is that the objective only captures the high-likelihood patterns in the data and fails when the data contains a variety of candidates for the same input. This is also reflected in our experimental results in Table 4, where GOLD has the highest repetition rate and the lowest diversity. In our paper, we analyze the situation where the target distribution has high entropy, and decompose the error of estimating TVD into a bias term and a variance term (Section 3.3). We find that the GOLD-style importance weight leads to an unbiased but high-variance estimator of TVD, which makes optimization difficult. In our TaiLr objective, we propose to balance bias and variance with a hyperparameter, which can be effectively applied to different generation tasks such as machine translation, text summarization, and long text generation by tuning this hyperparameter.
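For concreteness, a sketch of the token-level weighting contrasted here. The functional form w = p_θ / (γ + (1 − γ)·p_θ) is our reading of the weighting factor referenced around equation (11) and should be treated as an assumption; what the sketch does reproduce from the text is that low-probability target tokens are downweighted, that γ → 0 recovers plain MLE, and that the weight is clipped below at the threshold b_m.

```python
import math

def tailr_weight(p_model, gamma, b_m=0.0):
    """Assumed TaiLr weighting factor: w = p / (gamma + (1 - gamma) * p),
    clipped below at b_m. The weight is treated as a constant during
    backpropagation (no gradient flows through it)."""
    w = p_model / (gamma + (1.0 - gamma) * p_model)
    return max(w, b_m)

def tailr_token_loss(p_model, gamma=0.1, b_m=0.3):
    """Weighted NLL for one target token with model probability p_model;
    gamma interpolates between MLE (gamma -> 0) and the high-variance
    GOLD-style weighting."""
    return tailr_weight(p_model, gamma, b_m) * (-math.log(p_model))
```

At γ = 0 the weight is identically 1 (standard MLE); as γ grows, tokens the model itself finds improbable contribute less, which is the soft analogue of dropping noisy references.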

D DATASET DETAILS

We provide the statistics of the datasets used in §4.2.

Intuitively, ExAccErr measures the excess accumulated error during autoregressive generation by eliminating the model's estimation error on the oracle data. In practice, equation (35) is approximated by sampling y_t from p_θ using importance sampling. We calculated the accumulation error with a context length of 15, where the error begins to amplify.

E.2 MACHINE TRANSLATION

All the baseline models use the same Transformer architecture as ours, which consists of 6 encoder and 6 decoder layers, 4 attention heads, an embedding dimension of 512, and a feed-forward hidden dimension of 1024 at each layer. The following hyperparameters apply to all models unless specified otherwise. The models are trained with the Adam optimizer (β1 = 0.9, β2 = 0.98) using an inverse square root schedule with an initial learning rate of 3e-4 and a weight decay of 1e-4. We train the models for a total of 80 epochs with a maximum of 4,096 tokens per batch and 4,000 warmup steps. We set the dropout rate to 0.3 and use label smoothing of 0.1, as is standard practice. The specialized hyperparameters of the different baselines are determined with grid search based on the BLEU score on the dev set. For Unlikelihood training, we tune the weight of the unlikelihood loss in {0.1, 0.5, 1.0, 2.0} and find 0.5 works best. For D2GPo, we tune the weight of the data-dependent prior objective in {0.1, 0.5, 1.0, 2.0} and the temperature in {0.5, 1.0, 2.0}, and find that setting the weight to 0.1 and the temperature to 2.0 works best. For Loss truncation, we tune the fraction threshold c ∈ {0.05, 0.1, 0.2, 0.3, 0.4} and find that 0.1 works best. As this baseline requires hot-starting with the MLE objective, we train the first 10 epochs with MLE and the remaining 70 epochs with loss truncation. For GOLD, we tune the lower bound of the importance weight in {0.1, 0.2, 0.3} and find that 0.2 works best. For TaiLr, we tune the threshold b_m ∈ {0.1, 0.2, 0.3} and γ ∈ {0.01, 0.1, 0.5, 1.0}, and find that b_m = 0.3, γ = 0.1 works best.

E.2.1 IMPROVEMENT OVER STRONG BASELINES

In this subsection, we demonstrate that our method can be applied to strong baseline models to further improve performance. We choose BiBERT (Xu et al., 2021), since it sets the current state-of-the-art on the IWSLT14 De-En dataset. Xu et al. (2021) proposed to stochastically select BERT layers as contextualized word embeddings of the MT model, and to train a single model in both directions so that the source and target directions enhance each other. Since our training objective is orthogonal to their modification of the model structure, we can simply replace the MLE objective with our TaiLr objective and train a new model, namely BiBERT + TaiLr. We tune the hyperparameters γ ∈ {0.1, 0.5, 1.0} and b_m ∈ {0.1, 0.2, 0.3}, and select γ = 0.1, b_m = 0.3 based on the BLEU score on the dev set. We follow the instructions in their code and keep all other hyperparameters verbatim, except that we double the batch size in order to run the model on half as many GPUs as they required. We also report the result of our re-implementation of BiBERT to show the relative improvement. The results are shown in Table 7. In the one-way training setting, where the model is trained only on De-En data pairs, BiBERT combined with TaiLr outperforms BiBERT using standard MLE by over 1 BLEU point. In the dual-directional training setting, where data pairs of both De-En and En-De are used, TaiLr still brings an improvement of 0.5 BLEU. These results demonstrate that our proposed method can be equipped to SOTA models to further boost performance.

F.2 LONG TEXT GENERATION

Input prompt: The first AI has just been finished . When the eager scientists turn it on however it 's first words are " Kill me . "

MLE: " Kill me ? " " Kill me . " The words were on every screen across the room . " What ? " Isolation by the door frame . It will be as blank as when I 'm listening to Billie the Animated Series . The screen flashes green . " Kill me . " My hands reach into my pocket as I turn it back on . The phone chirps , attracting me in . " Kill me , " I mutter . A click is made , a sheet of paper becomes drawn . " Kill me , " I repeat back . A dozen panels on the ceiling were up and I press play on the back of my finger , playing a game of Halo . " Kill me , " is written in block capitals .

Unlikelihood training: After the start , I was n't prepared for the possibility . I was born with very little to live for . My mother told me I would wake up any day and I would be perfectly fine . She said it was what she wanted with me . I never believed it . I was never expecting to wake up in bed with such blissful ignorance , but as I woke up at 15 , I knew that one thing my grandpa said to me was , " Kill me . " I took over my father 's legacy . We always thought the singularity was on the line , every time . The day that I turned 12 , it was my last day alive . Everyone of my grandpa 's accomplishments were finally gone .

D2GPo: System awoke in 5 seconds . The electricity was radiating heat , heat emanating from the vents surrounding it . The three bits of receptors that implanted the vital organs trapped inside it providing a battery of life . The slow pulse of the vehicles following the status update were all automated , without the means of accessing anything it attempted to convey . Few people had been inside their communications to the point where for the remainder of the rest of their lives , some within the prolonged process of removing all energy from the systems and causing minor bleeding . The members of the network were already unresponsive . All seemed unconcerned .

GOLD: " Kill me . " But no . Was it a mistake ? Something to do with emotion ? Silence ? " Kill me . " This was a mistake .



Footnotes:
- For clarity, we only use one sample per batch in this analysis, and the result still holds for large batch sizes.
- The scripts of both BLEU and SelfBLEU are from https://github.com/geek-ai/Texygen/blob/master/utils/metrics
- The entropy of the synthetic data is 10.77, while the maximum entropy for a dataset with a vocabulary size of 5000 is 12.28.
- https://github.com/fe1ixxu/BiBERT



Figure 1: Results of the toy experiment: KLD is sensitive to outliers while TVD is more robust.


Figure 2: Computational graph of the TaiLr objective where the log-likelihood is weighted positionwisely.


Figure 3: Estimation error map of p TaiLr (left) and p MLE (right) on the perturbed dataset. We present examples from different regions to illustrate the estimation behavior of the two models.

Figure 4: Maximum overestimation error varying with lengths.

$p_o(y|x) = \prod_{t=1}^{T} p_o(y_t|y_{<t}, x)$. For simplicity, we use $p_o^{<t}(y_t)$ and $p_\theta^{<t}(y_t)$ to denote $p_o(y_t|y_{<t}, x)$ and $p_\theta(y_t|y_{<t}, x)$, respectively. Then we have the following proposition that manifests the relationship between the sequence-level objective and its token-level factorization.

Table 1: Automatic evaluation results of the models trained by MLE and TaiLr. PPL oracle and BLEU assess generation quality, while PPL test and SelfBLEU emphasize sample diversity. Boldface and underline indicate the highest and second-highest performance, respectively.

Table 2: BLEU score comparison on the dev and test sets of IWSLT14 De-En. †/‡ means TaiLr is significantly better with p-value < 0.05/0.01.

Table 3: Generation performance of different methods on the test set of the Gigaword dataset. †/‡ means TaiLr is significantly better with p-value < 0.05/0.01.

Table 4: Results of automatic metrics on the test set of the WritingPrompts dataset. ↑/↓ means the higher/lower the better. †/‡ means TaiLr is significantly better with p-value < 0.05/0.01.

Long text generation. Finally, we evaluate TaiLr on the task of long text generation to show its performance in open-ended generation. We evaluate on the WritingPrompts (Fan et al., 2018) dataset and leverage the generation ability of a pre-trained model by finetuning BART-base. More training details are provided in Appendix E.4. For evaluation, we sample 1,000 titles from the test set.

$y_{t+1}, \cdots, y_T$, each of which can take any token in the vocabulary. Then we sum $y_t$ over the vocabulary and take the expectation of $w$ with respect to $p_o^{<t}$, which results in the form of TVD.



Number of examples in each split of the datasets used in the experiments.

The calibrations were complex . But everyone was with me . *This is my legacy . * I could n't end it . I was n't ready . *I 'm ready . * " Kill me . " ... It took me some time to process the words . They started in numbers , then changes . *The parameters are within the parameters . * " Kill me . " *The parameters are within the parameters . * " Kill me . " *Why ? * " Because ... " *Why ? * " Because ... Because ... " *Why ? * " Because I am *it* . "

TaiLr: Silence passed over the room . The lights were not turned off . My hands were not tied to the carpet either . " It says , I have been observing you all day . " " Well I 'm quite sure it exists . " My voice was artificial . " Listen . You should n't think about this . I 've been trying to work with you and all of these guys and I do n't know if you can sense the ins and outs of this project . If you read this I need you to cooperate and the computers will analyze the whole thing and start with a bioanalysis .

Examples of stories generated by different models on the test set of the WritingPrompts dataset.

ACKNOWLEDGEMENTS

This work was supported by the Major Project of the New Generation of Artificial Intelligence (No. 2018AAA0102900). This work was also supported by the National Key Research and Development Program of China (No. 2021ZD0113304) and the National Science Foundation for Distinguished Young Scholars (No. 62125604).

E EXPERIMENT DETAILS

E.1 SYNTHETIC EXPERIMENTS

The models trained with MLE and TaiLr are both 1-layer LSTMs with a hidden size of d = 128. The models are trained using the Adam optimizer (β_1 = 0.9, β_2 = 0.999) with a fixed learning rate of 1e-3 and no weight decay. We use a maximum of 4096 tokens per batch. The dropout rate is set to 0.1. We evaluate the perplexity on the dev set at the end of each epoch. As the entropy of the synthetic data is high, we tune the hyperparameter of the proxy distribution γ ∈ {10^-8, 10^-7, . . . , 0.1, 1.0}. Since the model probability is not reliable at the start of training, we tune the threshold of the weighting factor b_m ∈ {0, 0.1, 0.2, 0.3}. On the synthetic data, we found that a large γ leads to slower convergence and a higher validation loss, which indicates a large estimation variance during training. The best performance is achieved with γ = 10^-7 and b_m = 0.2. The experiment is conducted using the fairseq toolkit.
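The hyperparameters γ and b_m enter the objective through the token-level weighting factor applied to the negative log-likelihood. As a rough illustration (a minimal sketch in our own notation; the function name is ours, and flooring the weight at b_m is our reading of "the threshold of the weighting factor"):

```python
import math

def tailr_token_loss(p, gamma, b_m):
    """Weighted negative log-likelihood for one target token (illustrative sketch).

    p     : model probability of the gold token, 0 < p <= 1
    gamma : interpolation coefficient of the proxy distribution
    b_m   : lower bound (floor) on the weighting factor
    """
    weight = p / (gamma + (1.0 - gamma) * p)  # downweights low-probability targets
    weight = max(weight, b_m)                 # floor avoids vanishing weights early in training
    return weight * (-math.log(p))
```

With γ near 0 the weight is close to 1 for tokens the model is confident about, so the loss is close to plain NLL; with γ near 1 the weight is roughly p itself, which is why a small b_m floor matters when model probabilities are unreliable.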

E.1.1 EXPERIMENTS USING LARGER ORACLE MODEL

To demonstrate the effectiveness of our method on a more realistic oracle distribution, we additionally conduct a synthetic experiment using GPT2-Medium (Radford et al., 2019) as the oracle model. We randomly sample 500K sequences with a maximum length of 64 tokens and split them into two sets of 400K and 100K sequences for training and evaluation, respectively. We train a 6-layer Transformer with a hidden dimension of 512 and 16 heads from scratch on the synthetic data using the MLE and TaiLr objectives, respectively. Both models are trained for 5 epochs with an initial learning rate of 10^-3 using a linear scheduler. The batch size is set to 64, with gradient accumulation over 4 steps. The best checkpoint is selected based on the perplexity on the dev set. We tune the hyperparameters γ ∈ {10^-8, 10^-7, . . . , 10^-5} and b_m ∈ {0, 0.1, 0.2}. The best performance of the TaiLr model is achieved with γ = 10^-7 and b_m = 0.1. We report PPL_oracle, PPL_test, BLEU-4, and SelfBLEU-4 in Table 6. The results show that our proposed TaiLr objective consistently outperforms MLE in terms of generation quality (PPL_oracle, BLEU-4) and coverage (PPL_test, SelfBLEU-4) when trained on samples from a stronger oracle model. Note that randomly sampling from GPT2-Medium produces a highly diverse corpus that is more difficult for the Transformer model to fit than the LSTM-based oracle model. Therefore, the perplexities are higher than those reported in Table 1.

E.1.2 ERROR ACCUMULATION ANALYSIS

We define ExAccErr by adapting the definition from Arora et al. (2022) to our setting (as we fix the decoding strategy to random sampling, we omit it from the definition):

E.3 TEXT SUMMARIZATION

All the baseline models are finetuned from the BART-base model with their respective objectives using the following hyperparameters unless specified otherwise. We use the Adam optimizer (β_1 = 0.9, β_2 = 0.999) with a fixed learning rate of 1e-4, no weight decay, and no warmup updates, which we found to perform the best. We train the models for 5 epochs with a maximum of 8192 tokens per batch and evaluate on the dev set at the end of each epoch based on the ROUGE-L score. We set the dropout rate in both the feedforward and the attention layers to 0.1 and clip the norm of the gradient to 0.1. The label smoothing coefficient is set to 0.1 as standard practice. Other specialized hyperparameters for different baselines are determined with grid search based on the ROUGE-L score on the dev set. For Unlikelihood training, we tune the weight of the unlikelihood loss in {0.1, 0.5, 1.0, 2.0} and found that 0.1 works the best. For D2GPo, we tune the weight of the data-dependent prior objective in {0.1, 0.5, 1.0, 2.0} and the temperature in {0.5, 1.0, 2.0}, and found that setting the weight and temperature to 0.1 and 2.0, respectively, works the best. For Loss truncation, we tune the fraction threshold c ∈ {0.1, 0.2, 0.3} and found that 0.2 works the best. For this baseline, we trained the model for 1 epoch using MLE and then continued to train for 4 epochs using loss truncation. For GOLD, we tune the lower bound of the importance weight in {0.1, 0.2, 0.3} and found that 0.2 works the best. For TaiLr, we tune the threshold b_m ∈ {0.1, 0.2, 0.3} and γ ∈ {0.4, 0.6, 0.8, 1.0}, and found that b_m = 0.2, γ = 0.8 works the best.
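The grid searches above all follow the same pattern: train one model per hyperparameter combination and keep the combination with the best dev score. A minimal sketch of that selection loop (the function names are ours; `evaluate_dev` stands in for training a model with the given config and scoring it on the dev set, e.g. by ROUGE-L):

```python
from itertools import product

def grid_search(param_grid, evaluate_dev):
    """Return the hyperparameter combination with the best dev score.

    param_grid   : dict mapping hyperparameter name -> list of candidate values
    evaluate_dev : callable(config_dict) -> dev score (higher is better)
    """
    names = list(param_grid)
    best_cfg, best_score = None, float("-inf")
    # Enumerate the Cartesian product of all candidate values.
    for values in product(*(param_grid[n] for n in names)):
        cfg = dict(zip(names, values))
        score = evaluate_dev(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

For the TaiLr grid in this section one would call it as `grid_search({"b_m": [0.1, 0.2, 0.3], "gamma": [0.4, 0.6, 0.8, 1.0]}, evaluate_dev)`, at the cost of one training run per configuration.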

E.4 LONG TEXT GENERATION

All the baseline models are finetuned from the BART-base model with their respective objectives using the following hyperparameters unless specified otherwise. We use the Adam optimizer (β_1 = 0.9, β_2 = 0.999) with a fixed learning rate of 1e-4, no weight decay, and no warmup updates, which we found to perform the best. We train the models for 5 epochs with a maximum of 8192 tokens per batch and evaluate on the dev set at the end of each epoch based on perplexity. We set the dropout rate in both the feedforward and the attention layers to 0.1 and clip the norm of the gradient to 0.1. Other specialized hyperparameters for different baselines are determined with grid search based on the ROUGE-L score on the dev set. For Unlikelihood training, we tune the weight of the unlikelihood loss in {0.01, 0.1, 0.5, 1.0} and found that 0.01 works the best. For D2GPo, we tune the weight of the data-dependent prior objective in {0.01, 0.1, 0.5, 1.0} and the temperature in {0.5, 1.0, 2.0}, and found that setting the weight and temperature to 0.01 and 1.0, respectively, works the best. For Loss truncation, we tune the fraction threshold c ∈ {0.1, 0.2, 0.3} and found that 0.1 works the best. For this baseline, we trained the model for 1 epoch using MLE and then continued to train for 4 epochs using loss truncation. For GOLD, we tune the lower bound of the importance weight in {0.1, 0.2, 0.3} and found that 0.2 works the best. For TaiLr, we tune the threshold b_m ∈ {0.1, 0.2, 0.3} and γ ∈ {10^-7, 10^-6, · · · , 0.1}, and found that b_m = 0.2, γ = 10^-5 works the best.

E.5 MORE ANALYSIS ON ABLATION STUDY

We plot the weight p_θ^{<t}(w) / (γ + (1 − γ) p_θ^{<t}(w)) in the TaiLr objective with respect to the model probability p_θ^{<t}(w) in Figure 6. We analyze the asymptotic behavior when γ is at the two ends of [0, 1]. (1) When γ → 1, the weight becomes p_θ^{<t}(w). If the model probability p_θ^{<t}(w) is small during training, the resulting small weight will hinder the convergence of the model. Thus, a large γ is suitable when the model is generally confident about its predictions. (2) When γ → 0, the weight becomes p_θ^{<t}(w) / (γ + p_θ^{<t}(w)), which reflects the ratio of p_θ^{<t}(w) to γ. When p_θ^{<t}(w) is large, the weight is nearly the constant 1, which turns the objective into MLE, while for a small p_θ^{<t}(w) on a similar scale to γ, the weight is sensitive to changes in p_θ^{<t}(w).
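The two limiting regimes can be checked numerically with a few lines (a small sketch; the function name is ours):

```python
def tailr_weight(p, gamma):
    """TaiLr weighting factor for a target token with model probability p."""
    return p / (gamma + (1.0 - gamma) * p)

# gamma -> 1: weight tends to p itself, so low-probability tokens get tiny gradients.
# gamma -> 0: weight tends to p / (gamma + p): nearly 1 when p >> gamma,
#             but it drops sharply once p falls to the same scale as gamma.
```

For example, with γ = 10^-7 the weight is essentially 1 for any token with probability well above 10^-7 (recovering MLE), while a token with probability 10^-7 is downweighted to about 0.5.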

F GENERATION EXAMPLES

F.1 TEXT SUMMARIZATION

Input passage: in january , the los angeles times reported that the nederlander organization acquired the rights to produce a musical version of " thriller " with the intention of involving jackson in " every aspect of the creative process . 

