CALIBRATING SEQUENCE LIKELIHOOD IMPROVES CONDITIONAL LANGUAGE GENERATION

Abstract

Conditional language models are predominantly trained with maximum likelihood estimation (MLE), which assigns probability mass to sparsely observed target sequences. While MLE-trained models assign high probability to plausible sequences given the context, the model probabilities often do not accurately rank-order generated sequences by quality. This has been empirically observed in beam search decoding, where output quality degrades with large beam sizes and decoding strategies benefit from heuristics such as length normalization and repetition blocking. In this work, we introduce sequence likelihood calibration (SLiC), in which the likelihoods of model-generated sequences are calibrated to better align with reference sequences in the model's latent space. With SLiC, decoding heuristics become unnecessary and the quality of decoding candidates improves significantly regardless of the decoding method. Furthermore, SLiC shows no sign of diminishing returns with model scale, and presents alternative ways to improve quality with limited training and inference budgets. With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models.

1. INTRODUCTION

Conditional language generation aims to generate text based on input context, and includes many useful and challenging tasks such as abstractive summarization (Mani, 2001; Nenkova and McKeown, 2011), generative question answering (Bajaj et al., 2016), question generation (Zhou et al., 2017) and data-to-text generation (Wiseman et al., 2017; Gardent et al., 2017). Pretraining large Transformer encoder-decoder models and fine-tuning them on downstream tasks is the common paradigm for addressing these tasks (Raffel et al., 2020; Lewis et al., 2019; Tay et al., 2022; Zhang et al., 2019a). Conditional language generation tasks are modeled by learning the probability of a target sequence $y$ given a context sequence $x$. Since directly modeling the sequence probability $P(y|x)$ over all possible generated text sequences is intractable, the canonical solution is to factor the probability auto-regressively, sharing parameters across all token prediction steps: $P_\theta(y|x) = \prod_{t=0}^{l} P_\theta(y_t \mid y_0 \ldots y_{t-1}, x)$, where $l$ is the sequence length. These models are often trained with maximum likelihood estimation (MLE) over observed target sequences. The learning objective thus becomes $L = \sum_{i=1}^{N} -\log P_\theta(y^i|x^i) = \sum_{i=1}^{N} \sum_{t=0}^{l} -\log P_\theta(y_t^i \mid y_0^i \ldots y_{t-1}^i, x^i)$, where $N$ is the number of training instances. This is also referred to as the next-token prediction loss. In the ideal setting of MLE training, a large number of target sequences would be observed for each context, and their relative frequencies would calibrate the assigned model probabilities. In practice, however, most language generation training datasets contain only a single target sequence per context. While the resulting MLE-trained models learn to assign relatively high probability to plausible sequences, they lack the direct supervision needed to compare the quality of such sequences.
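As a concrete illustration, the autoregressive MLE objective above can be sketched in plain Python. This is a minimal sketch, not a real training loop: `token_probs` stands in for the probabilities a trained model would assign to each gold token, and the function names are illustrative.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one target sequence under the
    autoregressive factorization: -log P(y|x) = -sum_t log P(y_t | y_<t, x).
    token_probs[t] is the model probability assigned to the gold token y_t."""
    return -sum(math.log(p) for p in token_probs)

def mle_loss(batch_token_probs):
    """MLE objective L: sum of per-sequence NLLs over the N training instances."""
    return sum(sequence_nll(probs) for probs in batch_token_probs)
```

For example, a sequence whose gold tokens each receive probability 0.5 contributes a loss of 2·log 2; assigning probability 1.0 everywhere drives the loss to zero.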
Since the lack of observed target sequences in MLE training is the root problem, solutions that learn from multiple sequence candidates have been proposed to address it directly. They fall loosely into three categories: (1) reinforcement learning with sequence-level rewards (Paulus et al., 2018; Ziegler et al., 2019; Stiennon et al., 2020); (2) two-stage systems that generate and rerank candidates (Liu and Liu, 2021; Ravaut et al., 2022b; Liu et al., 2022); and (3) multi-task learning with sequence-level losses (Edunov et al., 2018; Liu et al., 2022). Refer to Related Work (Section 4) for a more comprehensive discussion. In this paper, we propose to first decode candidates from a fine-tuned model on its own training dataset, and then continue training the model with a new objective. The new objective aims to align the candidates' sequence likelihoods with their similarities to the target sequence in the model's latent space. We refer to this process as sequence likelihood calibration (SLiC). Our approach is related to multi-task learning with sequence-level losses in Liu et al. (2022). However, we propose a simple yet effective recipe that eliminates decoding heuristics and does not risk directly optimizing the same metrics used to report text generation quality. Unlike reinforcement learning, it is a one-time offline process that avoids costly online decoding. Compared to two-stage reranking systems, it does not require a separate reranking model, which would incur additional complexity and compute. As depicted in Figure 1, our calibration stage naturally extends the current paradigm of pretraining and fine-tuning, and we show that calibrated models improve strongly over fine-tuned-only models across model sizes.
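To make the calibration idea concrete, the sketch below shows one plausible form such an objective could take: a pairwise hinge loss that pushes the model to assign higher sequence log-likelihood to decoded candidates that are more similar to the target. The function name, margin value, and exact loss form are illustrative assumptions for this sketch, not the paper's precise formulation.

```python
def rank_calibration_loss(candidates, margin=1.0):
    """candidates: list of (log_likelihood, similarity_to_target) pairs for
    sequences decoded from the fine-tuned model.
    For every ordered pair where candidate i is more similar to the target
    than candidate j, penalize the model unless it assigns i a log-likelihood
    at least `margin` higher than j (a standard pairwise hinge loss)."""
    loss = 0.0
    for ll_pos, sim_pos in candidates:
        for ll_neg, sim_neg in candidates:
            if sim_pos > sim_neg:
                loss += max(0.0, margin - (ll_pos - ll_neg))
    return loss
```

When likelihoods already rank candidates in the same order as their target similarities (with enough margin), the loss is zero; a miscalibrated ordering produces a positive penalty that gradient descent can reduce.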
Our main contributions include:
• Proposing a sequence likelihood calibration (SLiC) stage that consistently improves model quality, exceeding or matching state-of-the-art results on abstractive summarization, generative question answering, question generation and data-to-text generation tasks.
• Proposing a novel calibration similarity metric between model decodes and targets, measured in the model's latent space rather than resorting to external metrics or human feedback.
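As a hedged sketch of what a latent-space similarity might look like, the snippet below mean-pools per-token hidden states for a decoded candidate and a target, then compares the pooled vectors with cosine similarity. The mean-pooling and cosine choices here are assumptions made for illustration, not necessarily the paper's exact metric.

```python
import math

def mean_pool(states):
    """states: list of per-token hidden vectors (each a list of floats),
    e.g. decoder hidden states for one sequence. Returns the average vector."""
    dim = len(states[0])
    return [sum(vec[d] for vec in states) / len(states) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def latent_similarity(cand_states, target_states):
    """Similarity between a candidate and the target, computed entirely in the
    model's latent space (no external metric such as ROUGE is needed)."""
    return cosine(mean_pool(cand_states), mean_pool(target_states))
```

A candidate whose pooled representation matches the target's yields a similarity of 1.0, and such similarity scores are what the calibration objective uses to rank candidates.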



Figure 1: Calibrating sequence likelihood improves language generation across model scales. Scores are averaged ROUGE across 4 datasets ($R_m$ in subsection 3.2).

