CALIBRATING SEQUENCE LIKELIHOOD IMPROVES CONDITIONAL LANGUAGE GENERATION

Abstract

Conditional language models are predominantly trained with maximum likelihood estimation (MLE), which assigns probability mass to sparsely observed target sequences. While MLE-trained models assign high probability to plausible sequences given the context, the model probabilities often do not accurately rank-order generated sequences by quality. This has been empirically observed in beam search decoding, where output quality degrades with large beam sizes and decoding strategies benefit from heuristics such as length normalization and repetition blocking. In this work, we introduce sequence likelihood calibration (SLiC), in which the likelihood of model-generated sequences is calibrated to better align with reference sequences in the model's latent space. With SLiC, decoding heuristics become unnecessary, and the quality of decoding candidates improves significantly regardless of the decoding method. Furthermore, SLiC shows no sign of diminishing returns with model scale, and it offers alternative ways to improve quality under limited training and inference budgets. With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models.
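As a concrete illustration of the length-normalization heuristic mentioned above, a common formulation rescores each beam hypothesis by dividing its log-probability by its length raised to a power α. The function name and the default α = 0.6 here are illustrative assumptions, not values specified in this work:

```python
def length_normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Rescore a beam hypothesis as log P(y|x) / |y|**alpha.

    With alpha > 0, longer hypotheses are penalized less per token,
    counteracting the bias of raw log-probability toward short outputs.
    alpha = 0.6 is a commonly used default, assumed here for illustration.
    """
    return log_prob / (length ** alpha)


# A longer hypothesis with a lower raw log-probability can win after
# normalization (alpha = 1.0 for easy mental arithmetic):
short = length_normalized_score(-10.0, length=1, alpha=1.0)   # -10.0
long = length_normalized_score(-20.0, length=10, alpha=1.0)   # -2.0
```

SLiC removes the need for this kind of post-hoc rescoring by calibrating the model's sequence likelihoods directly.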

1. INTRODUCTION

Conditional language generation aims to generate text based on an input context, and includes many useful and hard tasks such as abstractive summarization (Mani, 2001; Nenkova and McKeown, 2011), generative question answering (Bajaj et al., 2016), question generation (Zhou et al., 2017) and data-to-text (Wiseman et al., 2017; Gardent et al., 2017). Pretraining large Transformer encoder-decoder models and fine-tuning them on downstream tasks is the common paradigm for addressing these tasks (Raffel et al., 2020; Lewis et al., 2019; Tay et al., 2022; Zhang et al., 2019a).

Conditional language generation tasks are modeled by learning the probability of a target sequence y given a context sequence x. Since directly modeling the sequence probability P(y|x) over all possible generated text sequences is intractable, the canonical solution is to factor the probability auto-regressively and share the parameters across all token prediction steps as $P_\theta(y|x) = \prod_{t=0}^{l} P_\theta(y^t \mid y^0 \dots y^{t-1}, x)$, where $l$ is the sequence length. These models are often trained with maximum likelihood estimation (MLE) over observed target sequences. The learning objective thus becomes $L = \sum_{i}^{N} -\log P_\theta(y_i \mid x_i) = \sum_{i}^{N} \sum_{t=0}^{l} -\log P_\theta(y_i^t \mid y_i^0 \dots y_i^{t-1}, x_i)$, where $N$ is the number of training instances. This objective is also referred to as the next-token prediction loss.

In the ideal setting of MLE training, a large number of target sequences would be observed for each context, and the relative frequencies of output sequences could calibrate the assigned model probabilities. However, in practice most language generation training datasets contain only a single target sequence per context. While the resulting MLE-trained models learn to assign relatively high probability to plausible sequences, they lack the direct supervision to compare such sequences, and solely
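The MLE objective above can be sketched numerically. The following is a minimal illustration (not the paper's implementation): given hand-set per-step distributions standing in for a model's predictions $P_\theta(y^t \mid y^0 \dots y^{t-1}, x)$, it sums the negative log-probability of each target token. The function name and the toy vocabulary are assumptions for illustration only:

```python
import math

def sequence_nll(step_probs, target_ids):
    """Next-token prediction loss for one training instance:
    sum over t of -log P(y^t | y^0..y^{t-1}, x).

    step_probs: one probability distribution over the vocabulary per
                target position (the model's per-step predictions).
    target_ids: the observed target token ids y^0..y^l.
    """
    assert len(step_probs) == len(target_ids)
    return sum(-math.log(p[t]) for p, t in zip(step_probs, target_ids))


# Toy example: vocabulary of size 3, target sequence [2, 0].
probs = [
    [0.1, 0.2, 0.7],  # P(y^0 | x)
    [0.8, 0.1, 0.1],  # P(y^1 | y^0, x)
]
loss = sequence_nll(probs, [2, 0])
# exp(-loss) recovers the sequence probability P(y|x) = 0.7 * 0.8
```

Summing this quantity over all $N$ training instances gives the full MLE objective $L$; note that the loss only ever sees the single observed target, which is exactly the limitation the surrounding text describes.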

