CALIBRATING SEQUENCE LIKELIHOOD IMPROVES CONDITIONAL LANGUAGE GENERATION

Abstract

Conditional language models are predominantly trained with maximum likelihood estimation (MLE), giving probability mass to sparsely observed target sequences. While MLE-trained models assign high probability to plausible sequences given the context, the model probabilities often do not accurately rank-order generated sequences by quality. This has been empirically observed in beam search decoding as output quality degrading with large beam sizes, and in decoding strategies benefiting from heuristics such as length normalization and repetition blocking. In this work, we introduce sequence likelihood calibration (SLiC), where the likelihoods of model-generated sequences are calibrated to better align with reference sequences in the model's latent space. With SLiC, decoding heuristics become unnecessary and the quality of decoding candidates significantly improves regardless of the decoding method. Furthermore, SLiC shows no sign of diminishing returns with model scale, and presents alternative ways to improve quality with limited training and inference budgets. With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models.

1. INTRODUCTION

Conditional language generation aims to generate text based on an input context, and includes many useful and hard tasks such as abstractive summarization (Mani, 2001; Nenkova and McKeown, 2011), generative question answering (Bajaj et al., 2016), question generation (Zhou et al., 2017) and data-to-text (Wiseman et al., 2017; Gardent et al., 2017). Pretraining large Transformer encoder-decoder models and fine-tuning them on downstream tasks is the common paradigm to address these tasks (Raffel et al., 2020; Lewis et al., 2019; Tay et al., 2022; Zhang et al., 2019a). Conditional language generation tasks are modeled by learning the probability of a target sequence y given a context sequence x. Since directly modeling the sequence probability P(y|x) over all possible generated text sequences is intractable, the canonical solution is to auto-regressively factor the probability and share the parameters at all token prediction steps:

P_θ(y|x) = ∏_{t=0}^{l} P_θ(y^t | y^0 ... y^{t-1}, x),

where l is the sequence length. These models are often trained with maximum likelihood estimation (MLE) over observed target sequences, so the learning objective becomes

L = Σ_{i=1}^{N} -log P_θ(y_i|x_i) = Σ_{i=1}^{N} Σ_{t=0}^{l} -log P_θ(y_i^t | y_i^0 ... y_i^{t-1}, x_i),

where N is the number of training instances. This is also referred to as the next-token prediction loss. In the ideal setting of MLE training, a large number of target sequences would be observed for each context, and the relative frequencies of output sequences could calibrate the assigned model probabilities. In practice, however, most language generation training datasets have only a single target sequence per context. While the resulting MLE-trained models learn to assign relatively high probability to plausible sequences, they lack direct supervision to compare such sequences, and rely solely on the models' generalization capability. We refer to this phenomenon as the models' sequence likelihood not being calibrated.
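The next-token prediction loss above can be sketched in a few lines; the per-step probabilities used in the example are hypothetical illustrative values, not outputs of any real model:

```python
import math

def mle_loss(token_log_probs):
    """Next-token-prediction (MLE) loss for one target sequence:
    the negative sum of log P(y_t | y_<t, x) over all decoding steps."""
    return -sum(token_log_probs)

# Toy example: a 3-token target whose per-step model probabilities
# are 0.5, 0.25 and 0.125 (hypothetical values).
loss = mle_loss([math.log(0.5), math.log(0.25), math.log(0.125)])
```

Because the loss sums per-token log-probabilities, it equals the negative log of the full sequence probability, here -log(0.5 × 0.25 × 0.125).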
Prior works (Liu and Liu, 2021; Liu et al., 2022) have shown that the correlation between sequence probability and sequence quality can be low for MLE-trained models. Liu et al. (2022) attributed this to the same deterministic (one-point) target distribution problem. Exposure bias (Ranzato et al., 2016) further aggravates the problem, as sequence likelihood estimation becomes noisier when the model's decoded sequences shift away from the training data distribution it was exposed to. Many effective heuristics have been proposed during training and decoding to combat the problem of uncalibrated sequence likelihood. Label smoothing (Szegedy et al., 2016) prevents the network from becoming over-confident towards the observed target. This is particularly necessary in language generation, since the gold target represents just one of many possibilities. It has been observed that increasing the number of decoding candidates past a certain point leads to worse quality for beam search decoding (Yang et al., 2018; Koehn and Knowles, 2017) and sampling (Adiwardana et al., 2020). An optimal number of decoding candidates is often determined empirically by decoding models on the validation set and measuring their performance. Length normalization is also essential for beam search decoding (Wu et al., 2016) and sampling (Adiwardana et al., 2020), as models tend to underestimate the sequence likelihood of longer sentences. Repetition is another common failure mode, occurring when models overestimate the probability of repeated sequences (Holtzman et al., 2019). Trigram blocking (Paulus et al., 2018) and nucleus sampling (Holtzman et al., 2020) have been used to interrupt repeating sequences. These techniques are pervasive and often the default in modern Transformer libraries (Wolf et al., 2020; Lewis et al., 2019; Raffel et al., 2020; Zhang et al., 2019a).
Since the lack of observed target sequences in MLE training is the root problem, solutions involving learning with multiple sequence candidates have been proposed to address it directly. They can be loosely put into three categories: (1) reinforcement learning with sequence-level rewards (Paulus et al., 2018; Ziegler et al., 2019; Stiennon et al., 2020); (2) two-stage systems that generate and rerank candidates (Liu and Liu, 2021; Ravaut et al., 2022b; Liu et al., 2022); and (3) multi-task learning with sequence-level losses (Edunov et al., 2018; Liu et al., 2022). Refer to Related Works (section 4) for a more comprehensive discussion. In this paper, we propose to first decode candidates from a fine-tuned model on its own training dataset, and then continue training the model with a new objective. The new objective aims to align the candidates' sequence likelihoods with their similarities to the target sequence in the model's latent space. We refer to this process as sequence likelihood calibration (SLiC). Our approach is related to the multi-task learning with sequence-level losses of Liu et al. (2022). However, we propose a simple yet effective recipe that eliminates decoding heuristics and avoids the risk of directly optimizing the same metrics that are used to report text generation quality. Unlike reinforcement learning, it is a one-time offline process that avoids costly online decoding. Compared to two-stage reranking systems, it requires no separate reranking model, with its additional complexity and compute. As depicted in Figure 1, our calibration stage naturally extends the current paradigm of pretraining and fine-tuning, and we show that calibrated models improve strongly over fine-tuned-only models across model sizes.
Our main contributions include:
• Proposed a sequence likelihood calibration (SLiC) stage that consistently improves model quality, exceeding or matching state-of-the-art results on abstractive summarization, generative question answering, question generation and data-to-text generation tasks.
• Proposed a novel calibration similarity metric between model decodes and targets measured in the model's latent space rather than resorting to external metrics or human feedback.
• Demonstrated that SLiC eliminates the need for popular decoding heuristics, such as beam size optimization, length normalization and repetition prevention, for the calibrated models.
• Demonstrated that SLiC has persistent, significant benefits on model performance even as the number of model parameters scales up. Under the same inference budget, smaller calibrated models might outperform larger counterparts by decoding more candidates.

2. CALIBRATING SEQUENCE LIKELIHOOD

We extend the common paradigm of pretraining and fine-tuning by introducing a third calibration stage, SLiC. As shown in Algorithm 1, we first decode m candidates {ŷ}_m from a fine-tuned model P_θft(y|x) on the fine-tuning dataset {x, ȳ}_n, and then calibrate the fine-tuned model by continuing training on our proposed loss:

L(θ) = Σ_b [ L^cal(θ, s; x, ȳ, {ŷ}_m) + λ L^reg(θ, θ_ft; x, ȳ) ],

where θ and θ_ft are the current and fine-tuned-only model weights, and L^cal and L^reg are the calibration and regularization losses. s = s(ŷ, ȳ; x) measures the similarity between the candidate ŷ and the target ȳ conditioned on the context x. We discuss choices of s, L^cal, L^reg and decoding strategies ŷ ∼ P_θ(y|x) in the following sections.

Algorithm 1: Calibrating Sequence Likelihood
  for (x, ȳ) ∈ {x, ȳ}_n do                      ▷ sample m candidates from the fine-tuned model
    {ŷ ∼ P_θft(y|x)}_m
  θ ← θ_ft                                      ▷ initialize from the fine-tuned model
  for (x, ȳ, {ŷ}_m)_b ∼ {x, ȳ, {ŷ}_m}_n do      ▷ train with calibration and regularization loss
    θ ← θ − lr ∇_θ L(θ)
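The structure of Algorithm 1 can be sketched as a short function. The callables `decode`, `calib_loss` and `reg_loss` are hypothetical stand-ins for the fine-tuned model's decoder and the losses defined in the following sections; a real implementation would take gradient steps on the combined loss, whereas this sketch only evaluates it:

```python
def slic_stage(decode, calib_loss, reg_loss, dataset, m, lam):
    """Sketch of the SLiC stage (Algorithm 1).

    decode(x, m)            -> m candidate sequences from the fine-tuned model
    calib_loss(x, y, cands) -> calibration loss L_cal for one example
    reg_loss(x, y)          -> regularization loss L_reg for one example
    Returns the average total loss L = L_cal + lam * L_reg over the dataset.
    """
    # Step 1: decode m candidates per context with the fine-tuned model,
    # as a one-time offline pass over the fine-tuning dataset.
    decoded = [(x, y, decode(x, m)) for x, y in dataset]
    # Step 2: continue training on the combined loss (evaluated here only).
    total = 0.0
    for x, y, cands in decoded:
        total += calib_loss(x, y, cands) + lam * reg_loss(x, y)
    return total / len(decoded)
```

In practice, the inner loop would be replaced by mini-batch gradient updates θ ← θ − lr ∇_θ L(θ) starting from the fine-tuned weights.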

2.1. SIMILARITY FUNCTION: s

For a given output sequence y, we take the decoder output hidden states e ∈ R^{L×D} = emb(y, x) as its representation, where L is the number of tokens and D is the hidden state dimension. Between a candidate ŷ's representation ê and the target ȳ's representation ē, we calculate cosine similarities on spans of n tokens and aggregate them across the sequences with an F-measure-based function F_n. The notation for F_n, P_n, R_n is the same as in BERTScore (Zhang et al., 2019b):

s_θ(ŷ, ȳ; x) = Σ_n F_n(ê, ē) = Σ_n F_n(emb(ŷ, x), emb(ȳ, x)),   F_n = 2 P_n R_n / (P_n + R_n),

where P_n(ê, ē) averages, over the n-token spans of ê, the maximum cosine similarity to any n-token span of ē, and R_n is defined symmetrically over the spans of ē. Among the advantages of this similarity function: (2) it differs from the metrics that we evaluate the generation systems with, which mitigates the risk of directly optimizing towards those imperfect metrics (Paulus et al., 2018; Stiennon et al., 2020); (3) it is conditioned on the context, s(ŷ, ȳ; x), as opposed to metrics of the form s(ŷ, ȳ).
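A minimal sketch of this similarity, assuming BERTScore-style greedy matching over mean-pooled n-token span representations; the helper names are ours, and real decoder hidden states would replace the toy arrays:

```python
import numpy as np

def span_reps(e, n):
    """Mean-pool hidden states over every span of n consecutive tokens."""
    L = e.shape[0]
    return np.stack([e[i:i + n].mean(axis=0) for i in range(L - n + 1)])

def f_n(e_hat, e_bar, n):
    """F-measure over span cosine similarities: greedy best match per
    candidate span gives precision, per target span gives recall."""
    a, b = span_reps(e_hat, n), span_reps(e_bar, n)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T                  # pairwise cosine similarities
    p = sim.max(axis=1).mean()     # precision: best target match per candidate span
    r = sim.max(axis=0).mean()     # recall: best candidate match per target span
    return 2 * p * r / (p + r)

def similarity(e_hat, e_bar, ns=(1, 2)):
    """s(ŷ, ȳ; x): sum of F_n over the chosen span sizes n."""
    return sum(f_n(e_hat, e_bar, n) for n in ns)
```

When candidate and target representations coincide, each F_n is 1, so `similarity` returns the number of span sizes summed over.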

2.2. CALIBRATION LOSS: L cal

The calibration loss L^cal(θ, s; x, ȳ, {ŷ}_m) aims to align the sequence likelihoods P_θ(ŷ|x) of the model's decoded candidates with their similarity to the target sequence s(ŷ, ȳ; x). Given the context x, target ȳ and a set of candidates {ŷ}_m, we consider the following four loss types to answer two questions: (1) does the absolute difference in similarities matter? (2) is there a benefit of list-wise over pair-wise comparisons?

Rank loss optimizes the ranking order of positive and negative candidate pairs ŷ+, ŷ- uniformly sampled from {ŷ}_m such that s(ŷ+, ȳ; x) > s(ŷ-, ȳ; x). Margin loss maximizes the sequence probability gap of positive and negative candidate pairs. List-wise rank loss optimizes the ranking order of a list of candidates, where i, j are the positions of ŷ_i, ŷ_j in the set {ŷ}_m sorted by s(ŷ, ȳ; x); it is the contrastive loss used in BRIO (Liu et al., 2022). Expected reward loss (or expected minimum risk) maximizes the expected similarity of a list of candidates (Edunov et al., 2018). The pair-wise losses (rank, margin) have a smaller training memory footprint than the list-wise rank and expected reward losses.

L^cal_rank = max(0, β − log P_θ(ŷ+|x) + log P_θ(ŷ-|x))
L^cal_margin = max(0, β (s(ŷ+, ȳ; x) − s(ŷ-, ȳ; x)) − log P_θ(ŷ+|x) + log P_θ(ŷ-|x))
L^cal_list_rank = Σ_{i<j} max(0, β |i − j| − log P_θ(ŷ_i|x) + log P_θ(ŷ_j|x))
L^cal_reward = − Σ_i s(ŷ_i, ȳ; x) P_θ(ŷ_i|x) / Σ_i P_θ(ŷ_i|x)

The β values are chosen empirically for each loss type in subsection 3.3.
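The four losses can be sketched directly from the formulas above; the function names and scalar log-probability inputs are illustrative simplifications of batched tensor code:

```python
import math

def rank_loss(logp_pos, logp_neg, beta):
    """L_rank: hinge on the log-probability gap of a positive/negative pair."""
    return max(0.0, beta - logp_pos + logp_neg)

def margin_loss(logp_pos, logp_neg, s_pos, s_neg, beta):
    """L_margin: hinge whose margin scales with the similarity gap."""
    return max(0.0, beta * (s_pos - s_neg) - logp_pos + logp_neg)

def list_rank_loss(logps, beta):
    """L_list_rank: pairwise hinges over a list sorted by similarity
    (index 0 = most similar), with margin beta * |i - j|."""
    n = len(logps)
    return sum(max(0.0, beta * (j - i) - logps[i] + logps[j])
               for i in range(n) for j in range(i + 1, n))

def expected_reward_loss(logps, sims):
    """L_reward: negative expected similarity under the candidate
    distribution renormalized over the decoded set."""
    ps = [math.exp(lp) for lp in logps]
    z = sum(ps)
    return -sum(s * p for s, p in zip(sims, ps)) / z
```

For example, `rank_loss` is zero once the positive candidate's log-probability exceeds the negative's by at least β, and grows linearly otherwise.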

2.3. REGULARIZATION LOSS: L reg

We consider two alternative types of regularization loss L^reg to prevent the model from deviating significantly from its fine-tuned MLE objective. Cross entropy is the standard fine-tuning MLE objective, as used in (Liu et al., 2022). KL divergence directly minimizes the distance between the token-level probability distributions of the calibrated model and the fine-tuned model on the observed target sequence. The main difference is that cross entropy regularizes the model toward the gold reference, while KL divergence regularizes it toward the fine-tuned-only model. Both regularization losses operate at the token level:

L^reg_ce = Σ_t −log P_θ(ȳ^t | ȳ^0 ... ȳ^{t-1}, x)
L^reg_kl = Σ_t P_θ(ȳ^t | ȳ^0 ... ȳ^{t-1}, x) log [ P_θ(ȳ^t | ȳ^0 ... ȳ^{t-1}, x) / P_θft(ȳ^t | ȳ^0 ... ȳ^{t-1}, x) ]
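A sketch of the two regularizers, assuming per-token probability distributions are available as plain lists (in practice these would be softmax outputs of the calibrated model and the frozen fine-tuned model); the KL variant here sums over the whole vocabulary at each target position:

```python
import math

def ce_reg(target_log_probs):
    """L_ce: token-level cross entropy of the gold target under the
    calibrated model, i.e. the standard MLE objective."""
    return -sum(target_log_probs)

def kl_reg(p_calibrated, p_finetuned):
    """L_kl: token-level KL(P_theta || P_theta_ft), summed over target
    positions; each argument is a list of per-token distributions."""
    total = 0.0
    for pt, qt in zip(p_calibrated, p_finetuned):
        total += sum(p * math.log(p / q) for p, q in zip(pt, qt) if p > 0)
    return total
```

When the calibrated and fine-tuned distributions coincide, `kl_reg` is exactly zero, so the regularizer only penalizes drift away from the fine-tuned model.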

2.4. CANDIDATES DECODING METHODS

We consider the following decoding methods for SLiC. Beam Search is the standard best-first algorithm for approximately solving the intractable maximum likelihood optimization in sequence-to-sequence models (Tillmann and Ney, 2003; Li et al., 2016; Wiseman et al., 2017; Chen et al., 2018). Diverse Beam Search (DBS; Vijayakumar et al., 2016) generates a list of diverse outputs by dividing the beam search budget into groups and enforcing dissimilarity between groups of beams. It strikes a balance between quality and diversity and is often the best strategy for two-stage reranking systems (Liu and Liu, 2021; Ravaut et al., 2022b; Liu et al., 2022). Nucleus Sampling (Holtzman et al., 2020) samples only from the highest-probability tokens within cumulative probability p at each step of decoding. It produces diverse candidates while avoiding very low-quality samples.
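As an illustration of the last method, a minimal single-step nucleus sampler over an explicit token distribution; this is a sketch, since production decoders operate on logits with batch dimensions:

```python
import random

def nucleus_sample(probs, p, rng=None):
    """One decoding step of nucleus (top-p) sampling: keep the smallest
    set of highest-probability tokens whose cumulative mass reaches p,
    renormalize over that set, and sample a token index from it."""
    rng = rng or random.Random(0)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    z = sum(probs[i] for i in nucleus)
    r, acc = rng.random() * z, 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

Truncating to the nucleus is what prevents very low-probability (often degenerate) tokens from ever being sampled.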

3.1. TASKS AND DATASETS

For abstractive summarization, we choose CNN/DailyMail (Hermann et al., 2015; See et al., 2017), XSUM (Narayan et al., 2018), RedditTIFU-long (Kim et al., 2019) and SAMSum (Gliwa et al., 2019) for their diversity in domain, style, abstractiveness, and summary length. For question answering related tasks, we choose generative question answering given a context (MSMARCO NLG; Bajaj et al., 2016) and its reverse problem, question generation (SQuAD QG; Zhou et al., 2017; Du et al., 2017). For data-to-text tasks, we choose text generation from structured data (WebNLG-en; Gardent et al., 2017) and reasoning over common concepts (CommonGen; Lin et al., 2020). More details of the datasets, along with their statistics, can be found in Appendix A.

3.2. MODEL TRAINING AND EVALUATION DETAILS

We follow the PEGASUS pretraining setup (Zhang et al., 2019a) and extend the transformer model sizes to PEGASUS SMALL (50M), PEGASUS BASE (200M), PEGASUS LARGE (500M) and PEGASUS 2B (2B). Details are reported in Appendix B. Different from the original paper, we use a sentencepiece 96k vocabulary with byte-fallback (Kudo, 2018) and a pretraining batch size of 4096 across all models. See Appendix B for model dimensions. In all experiments, we use learning rate lr = 10^-4, and batch sizes of 512 for fine-tuning and 64 for calibration. We use beam search to generate calibration candidates and to evaluate the calibrated models, unless specified otherwise. All fine-tuned-only models use heuristics such as beam size optimization and sweeping the beam α for length normalization, unless specified otherwise. In our ablation studies (subsection 3.3), benefits analysis (subsection 3.4), and scaling experiments (subsection 3.5), we use models pretrained for 500,000 steps and conduct experiments on 4 datasets (CNN/DailyMail, XSUM, RedditTIFU-long and SAMSum). For ablation studies and benefits analysis, we use PEGASUS LARGE. We report ROUGE 1/2/L (Lin, 2004) for each dataset on the validation splits, along with an overall score R_m defined as the geometric mean of ROUGE 1/2/L averaged across the datasets:

R_m = (1/4) Σ_d (R_1 R_2 R_L)^(1/3)

For the final results (subsection 3.6), we pretrain the PEGASUS 2B model for 2.5M steps, fine-tune it on all 8 datasets, calibrate with the same recipe, and report numbers on the test splits (unless specified otherwise). We use the corresponding standard evaluation script for each dataset (details in Appendix D).
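The overall score R_m defined above can be computed with a small helper; the input tuples in the example are hypothetical ROUGE values, not reported results:

```python
def overall_rouge(per_dataset_scores):
    """R_m: geometric mean of ROUGE-1/2/L per dataset, averaged across
    datasets. per_dataset_scores: list of (R1, R2, RL) tuples."""
    return sum((r1 * r2 * rl) ** (1.0 / 3.0)
               for r1, r2, rl in per_dataset_scores) / len(per_dataset_scores)
```

Using a geometric mean within each dataset keeps any single ROUGE variant from dominating the aggregate.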

3.3. ABLATION STUDIES OF CALIBRATION

Ablation experiment results discussed below can be found in Table 1.

Similarity Function. We compare our proposed similarity function, which uses the model's latent states at the decoder output, s_θ(ŷ, ȳ; x) (subsection 2.1), to directly optimizing the evaluation metric ROUGE. They perform similarly on all datasets, even though the evaluation metrics are ROUGE scores. We also test a variant of our similarity function that replaces the decoder representation emb(y, x) with token embeddings. This variant performs worse, which suggests benefits from contextualized, input-dependent representations.

Calibration Loss. Calibrated models with all loss types significantly improve over fine-tuned-only models. Rank loss performs best, followed by margin, list-wise rank and then reward. Reward maximization has the advantage of no hyper-parameter β (Equation 1) to sweep, while the rank and margin losses have smaller training memory footprints. Rank loss showing the best gain indicates that the relative ordering of candidates matters more than the absolute value of their similarity to the target.

Regularization Loss. Cross entropy and KL divergence regularization perform similarly. About 85% of the calibration gain remains if regularization is removed.

Calibration Candidates Decoding Method. We choose hyper-parameters for the candidate decoding methods on the validation set. The optimal decoding method is dataset dependent; however, the differences between methods are small, and the worst method still achieves 90% of the gains of the best one. Beam search yields the highest average quality. This is the opposite of the findings for two-stage reranking systems (Liu and Liu, 2021; Ravaut et al., 2022b; Liu et al., 2022), where more diverse decoding strategies are preferred.

Checkpoint Selection for Fine-tuned Model. We compare ROUGE-selected and perplexity-selected checkpoints.
The experiments show that starting calibration from the perplexity-selected checkpoint yields the same or better performance, with the biggest gap on the CNN/DailyMail dataset.

TL;DR: We recommend a simple recipe: select the fine-tuned model's checkpoint by its validation set perplexity; decode candidates using beam search; calibrate the model with rank loss (based on decoder-state similarities) and KL divergence regularization.

3.4. BENEFITS OF CALIBRATION

Calibrated models' quality monotonically improves as the number of decoding candidates increases, regardless of the calibration-decoding and evaluation-decoding methods, as shown in Figure 2. Fine-tuned-only models, on the other hand, suffer decreased quality when the number of decodes exceeds an optimal value. Once a model is calibrated with either decoding method, it performs well with both at evaluation time. Decoding with beam search yields higher scores, verified up to 20 decodes. When the calibration-decoding and evaluation-decoding methods align, the final quality is slightly better than in the mismatched settings. CNN/DailyMail, XSUM, and SAMSum work best with beam search, whereas RedditTIFU-long works better with nucleus sampling, and decoding it with a larger number of candidates may achieve better results.

Calibrated models do not require length normalization. As shown in Table 2, length normalization (commonly implemented as α for beam search) is essential for fine-tuned-only models, which are biased towards longer sequences at decoding time. In contrast, length normalization has minimal effect on calibrated models.

Calibrated models suffer from far fewer repetitions. The repetition rate (rep%) measures a common mode of model failure: it is defined as the percentage of examples that contain any consecutively repeated word n-gram. While length normalization helps general quality for fine-tuned-only models, it has the side-effect of higher repetition.
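The repetition rate (rep%) can be computed with a simple scan for consecutively repeated word n-grams; checking up to trigrams is our illustrative choice here, since the text does not pin down the maximum n:

```python
def has_consecutive_repeat(text, max_n=3):
    """True if any word n-gram (n <= max_n) is immediately repeated."""
    words = text.split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - 2 * n + 1):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                return True
    return False

def repetition_rate(texts):
    """rep%: percentage of examples containing a consecutive repeat."""
    return 100.0 * sum(has_consecutive_repeat(t) for t in texts) / len(texts)
```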
Calibrated models, with or without length normalization, have a much lower repetition rate. Compared with the repetition rate of the gold references (repetition may occur naturally), calibrated models without length normalization have a similar or lower repetition rate.

TL;DR: Calibrated models do not require decoding heuristics such as beam size optimization, length normalization and repetition blocking.

3.5. SCALING

Scaling properties are important for projecting a technique's future relevance as models scale up (Kaplan et al., 2020a). In Figure 3, we compare generation quality versus inference compute at different model sizes and numbers of decoding candidates, using beam search. Appendix H describes the method used to estimate inference FLOPs. As mentioned in subsection 3.4, fine-tuned-only models have an optimal decoding beam size, while calibrated models' performance increases monotonically with larger beam sizes. Even with greedy decoding (beam size of 1), the calibrated models exceed the fine-tuned-only models, by a large margin on some datasets (CNN/DailyMail and RedditTIFU-long). The gaps grow with increasing beam size.

Published as a conference paper at ICLR 2023

The magnitude of the quality improvement from calibration persists over model sizes spanning 50M to 2B parameters; there is no obvious sign of diminishing returns as model size scales up. Inference compute may be spent on decoding rather than on larger models. A calibrated model, once trained, can improve its performance by decoding more candidates, usually most effectively at first, although returns diminish beyond 10 candidates. In some cases (SAMSum and especially CNN/DailyMail), a smaller model decoding more candidates can beat a larger one on both quality and efficiency.

TL;DR: Calibration benefits persist as model sizes scale up. Smaller calibrated models can outperform larger ones under the same inference compute budget.

We calibrate the fine-tuned PEGASUS 2B models on 8 language generation tasks using the simple recipe identified in subsection 3.3 and evaluate them with beam search without decoding heuristics (subsection 3.4). The only hyper-parameter we optimize for SLiC is the learning rate lr (Appendix J). We use beam size 5 for fine-tuned-only models and 10 for calibrated models.

3.6. FINAL RESULTS

As shown in Table 3 , calibrated models show consistent improvement over fine-tuned-only models across datasets and tasks. Overall, our calibrated models exceed or match the SOTA models on all datasets. On XSUM, SAMSum, WebNLG-en and CommonGen, our calibrated 2B models are ten to a hundred times smaller than the SOTA models. TL;DR: PEGASUS 2B achieves SOTA results on a wide range of language generation tasks using a simple SLiC recipe while eliminating decoding heuristics.

4. RELATED WORKS

In classification, model calibration often refers to matching output probabilities with expected accuracy. In our case of sequence generation, how to generalize this notion is unclear. Kuleshov and Liang (2015) explore generalizing probabilistic calibration to structured prediction, but we focus only on aligning sequence likelihood with target sequence similarity. Other approaches in this vein are described below.

4.1. RL APPROACHES

Paulus et al. (2018) directly optimize the evaluation metric ROUGE in an RL fine-tuning stage. One issue is that the ROUGE metric does not enforce fluency; the authors found the summaries were not always readable and proposed a mixed training objective that works better. Ziegler et al. (2019) and Stiennon et al. (2020) collect human judgements on fine-tuned models' decodes to train a reward model that ranks candidates according to human preferences. The supervised policy is then fine-tuned against the reward model using PPO. The authors found that optimizing their reward model results in better-quality summaries than directly optimizing ROUGE.

4.2. TWO-STAGE RERANKING APPROACHES

SimCLS (Liu and Liu, 2021) proposes formulating text generation as a reference-free quality estimation problem assisted by contrastive learning. The first stage decodes candidates with diverse beam search and a RoBERTa based model is used to rank them in the second stage. SummaReRanker (Ravaut et al., 2022a) observes improved performance when training the generation and the reranking models on two non-overlapping halves of the fine-tuning data compared to training two models on the same data. BRIO (Liu et al., 2022) includes a two-stage reranking system that uses sequence-to-sequence generation models. It is shown that the sequence-to-sequence reranker has better performance than encoder-only models in providing ranking scores.

4.3. MULTI TASK LEARNING WITH SEQUENCE-LEVEL LOSS

Edunov et al. (2018) survey a range of classical objective functions for structured prediction and apply them to sequence-to-sequence models. Their experiments showed that combining sequence-level objectives with token-level objectives yields improved performance on translation and summarization datasets. Sun and Li (2021) combine a contrastive learning objective with negative log-likelihood to decrease the likelihood of model-generated "silver" summaries while increasing the likelihood of the "gold" references. Wieting et al. (2019) introduce an alternative reward function for optimizing neural machine translation systems that is based on semantic similarity. BRIO (Liu et al., 2022) demonstrates that multi-task learning over sequence candidates, combining contrastive reranking with token-level generation, outperforms a two-stage reranking system. The ranking order is determined by similarity to the target using external metrics (ROUGE, BERTScore); models trained to rank by ROUGE also perform well when measured by BERTScore, and vice versa. Lukasik et al. (2020) extend label smoothing from classification to semantic label smoothing for sequence-to-sequence learning: their technique adds sequence-level losses that smooth over well-formed, relevant sequences that are similar to the target both semantically and at the n-gram level.

5. CONCLUSION

We propose adding a third stage of sequence likelihood calibration (SLiC) after the pretraining and fine-tuning stages for conditional language generation. A simple yet effective SLiC recipe uses decoder-state similarity, selects the fine-tuned model's checkpoint by perplexity, decodes candidates with beam search, and calibrates with rank loss and KL divergence regularization. We are able to eliminate all decoding heuristics for calibrated models. The benefits of calibration persist as models scale up in size, and smaller calibrated models might outperform larger ones under the same inference compute budget. By calibrating a PEGASUS 2B model, we exceed or match state-of-the-art results on 8 datasets spanning abstractive summarization, generative question answering, question generation and data-to-text tasks. In this work we focus on the setting where ground-truth output sequences are provided; however, this presupposes high-quality labels that are not always available. In the future, we plan to extend SLiC to general language modeling and to explore more types of latent similarities.

A DATASETS PROPERTIES

A.1 DATASETS AND TASKS

CNN/DailyMail (Hermann et al., 2015; See et al., 2017) is a summarization dataset containing 313k articles from the CNN and Daily Mail newspapers with bullet-point summaries. The summaries are on average 3-4 sentences long and relatively extractive.

XSUM (Narayan et al., 2018) is a summarization dataset of 227k BBC articles from 2010 to 2017, each with a single-sentence, highly abstractive summary. Sometimes the summary contains information not present in the article.

RedditTIFU-long (Kim et al., 2019) is a summarization dataset of 42k informal story posts from the sub-reddit TIFU, from Jan 2013 to Mar 2018, with author-written summaries. The style and length of the summaries are very diverse.

SAMSum (Gliwa et al., 2019) is a summarization dataset of 16k high-quality chat dialogues with summaries written by linguists.

SQuAD QG (Zhou et al., 2017; Du et al., 2017) is the task of generating a question from a passage-answer pair extracted from the SQuAD dataset (Rajpurkar et al., 2016). In particular, we use the split of Du et al. (2017), consisting of 75,722, 10,570, and 11,877 examples for training, validation, and testing, respectively.

MSMARCO NLG (Bajaj et al., 2016) is a large-scale dataset focused on machine reading comprehension and question answering. The original QA dataset consists of 1,010,916 queries; we work on the NLGEN subset of 182,669 queries, each with a well-formed answer. The task is to generate a well-formed answer from an input query and a set of answering passages.

WebNLG-en (Gardent et al., 2017) consists of 16,095 data inputs in the form of sets of RDF triples extracted from DBpedia. Each data input was verbalized by humans into one or more natural-language texts, leading to a total of 38,872 data-text pairs.

CommonGen (Lin et al., 2020)

B MODEL ARCHITECTURE

Model sizes and their configurations are reported in Table 5.

C HYPER-PARAMETER NOTATIONS

D EVALUATION SCRIPTS

For summarization tasks, we use pypi package rouge-score to report ROUGE numbers. We report rougeLsum for ROUGE-L. For SQuAD QG and MSMARCO NLG, we use the original evaluation scripts provided by Du et al. (2017) and Bajaj et al. (2016) , respectively. For WebNLG-en and CommonGen, we use the versions from the GEM benchmark (Gehrmann et al., 2021) and report using the GEM evaluation framework. Those scripts mainly differ in text tokenization methods.

E ABLATION STUDY

SLiC methods for ablation study are reported in Table 7 . 

H MODEL FLOPS ESTIMATION

We extend the formulation in Table 1 of Kaplan et al. (2020b) to estimate the FLOPs of our transformer encoder-decoder models with the formula:

C_total = C_enc × n_enc-ctx + C_dec × n_dec-ctx × m
C_enc = 2 N_enc + 2 n_enc-layer × n_enc-ctx × d_enc-attn
C_dec = 2 N_dec + n_dec-layer × n_dec-ctx × d_dec-attn

where m is the number of decoded candidates; the other notation follows Table 1 of Kaplan et al. (2020b). Because of upper-triangular attention masking, the effective decoder attention context length is half of the sequence length, rather than the full sequence length as in the encoder. The extra computation incurred by different decoding methods is omitted, as it is much smaller.
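A sketch of this FLOPs estimate, with parameter names mirroring the formula above; it is an analytical estimate following the per-token costs of Kaplan et al. (2020b), not a measured count:

```python
def inference_flops(n_enc_params, n_dec_params, n_enc_layer, n_dec_layer,
                    d_attn, enc_ctx, dec_ctx, m):
    """Per-example inference FLOPs for an encoder-decoder transformer
    decoding m candidates, following C_total = C_enc * n_enc_ctx
    + C_dec * n_dec_ctx * m."""
    c_enc = 2 * n_enc_params + 2 * n_enc_layer * enc_ctx * d_attn
    # Causal masking: each decoder token attends to ~half the context,
    # so the attention term carries a coefficient of 1 instead of 2.
    c_dec = 2 * n_dec_params + n_dec_layer * dec_ctx * d_attn
    return c_enc * enc_ctx + c_dec * dec_ctx * m
```

Note that the encoder cost is paid once per example, while the decoder cost scales linearly with the number of decoded candidates m.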

I SCALING

The SLiC method for the scaling curves is reported in Table 11. At evaluation time, models are decoded with 1, 2, 5, 10 and, for some datasets, 15 and 20 candidates. The ROUGE numbers in Figure 3 are reported in Table 12.



Footnotes:
- At evaluation-decoding time, the candidate with the highest sequence probability is selected to compute quality for both beam search and nucleus sampling.
- Dataset catalogs:
  https://www.tensorflow.org/datasets/catalog/cnn_dailymail
  https://www.tensorflow.org/datasets/catalog/xsum
  https://www.tensorflow.org/datasets/catalog/reddit_tifu
  https://www.tensorflow.org/datasets/catalog/samsum
  https://www.tensorflow.org/datasets/catalog/squad_question_generation
  https://huggingface.co/datasets/ms_marco/viewer/v2.1
  https://www.tensorflow.org/datasets/catalog/gem#gemweb_nlg_en
  https://www.tensorflow.org/datasets/catalog/gem#gemcommon_gen_default_config



Figure 1: Calibrating sequence likelihood improves language generation across model scales. Scores are averaged ROUGE across 4 datasets (R m in subsection 3.2)

Figure 2: Effect of decoding methods on calibrated and fine-tuned only models. Colors indicate calibration method. Markers indicate evaluation decoding method. Hyper-parameters at Appendix F.

Figure 3: Quality and inference compute trade-off comparison between fine-tuned only and calibrated models. Inference compute is scaled by increasing model parameters (different colors) and number of decoding candidates (dots on the same line). Hyper-parameters at Appendix I.

Bhattacharyya et al. (2021) train an energy-based model to mimic the behavior of a task measure such as BLEU. Lee et al. (2021) and Fernandes et al. (2022) train rerankers for neural machine translation (NMT) that predict the observed distribution of desired automatic metrics (BLEU, COMET and BLEURT) over the n-best list.



Comparison between fine-tuned-only models and calibrated models, with or without brevity penalty α, on overall quality (R1 / R2 / RL) and repetition occurrence percentage (rep%). Hyper-parameters in Appendix G.

Calibrated PEGASUS 2B compared with prior SOTA results: BRIO^a (Liu et al., 2022), UL2^b (Tay et al., 2022), ST-MoE^c (Zoph et al., 2022), UniLMv2^d (Bao et al., 2020), Masque^e (Nishida et al., 2019), and BART+R3F^f (Aghajanyan et al., 2021). † is on the validation set. * is on an unknown split. See hyper-parameters in Appendix J.

Statistics of datasets.

Model sizes.

Summary of hyper-parameters notations.

Experimental settings for ablation studies.

Methods for decoding calibrated models are reported in Table 8. At evaluation time, models are decoded with 1, 2, 5, 10 and 20 candidates. ROUGE numbers in Figure 2 are reported in Table 9.

Experimental settings for calibrated models' decoding analysis.

ROUGE (R1 / R2 / RL) numbers of the decoding curves.

Experimental settings for the length normalization analysis are reported in Table 10. The brevity penalty α is chosen as the best value for fine-tuned models' ROUGE performance on the validation dataset, or disabled.

Experimental settings for length normalization study.

Experimental settings for scaling.

ROUGE (R1 / R2 / RL) numbers of the scaling curves.

ACKNOWLEDGEMENT

We thank David Grangier for early and engaging discussions, and Noah Fiedel for feedback on the paper.

J FINAL RESULTS

The SLiC method for the final results is reported in Table 13. We choose the best SLiC recipe based on subsection 3.3. There are in total 3 hyper-parameters: learning rate lr (Algorithm 1), ranking constant β (Equation 1), and regularization strength λ (Equation 2). We fix two of them: β is set to 10, and lr × λ is set to 1e-5. The best learning rate lr is determined by hyper-parameter tuning on the validation set and reported in Table 14.

