QUANTIFYING EXPOSURE BIAS FOR OPEN-ENDED LANGUAGE GENERATION

Anonymous authors
Paper under double-blind review

Abstract

The exposure bias problem refers to the incrementally distorted generation induced by the training-generation discrepancy in teacher-forcing training of auto-regressive neural network language models (LMs). It has been regarded as a central problem for LMs trained for open-ended language generation. Although many algorithms have been proposed to avoid teacher forcing and thereby alleviate exposure bias, there is little work showing how serious the exposure bias problem actually is. In this work, we propose novel metrics to quantify the impact of exposure bias in the generation of MLE-trained LMs. Our key intuition is that if we feed ground-truth data prefixes (instead of prefixes generated by the model itself) into the model and ask it to continue the generation, the performance should become much better because the training-generation discrepancy in the prefix is removed. We conduct both automatic and human evaluation in our experiments, and our observations are two-fold: (1) We confirm that the prefix discrepancy indeed induces some level of performance loss. (2) However, the induced distortion seems to be limited and is not incremental during the generation, which contradicts the claim of exposure bias.

1. INTRODUCTION

The language model (LM) is a central module for natural language generation (NLG) tasks (Young et al., 2017) such as open-ended language generation (Radford et al., 2018; Nadeem et al., 2020), machine translation (Wu et al., 2017), dialogue response generation (Li et al., 2017), image captioning (Lin et al., 2014), etc. For decades, maximum likelihood estimation (MLE) has been the most widely used objective for LM training. However, there is a popular belief in the natural language processing (NLP) community that standard MLE training suffers from the exposure bias problem, which leads to an incremental performance degradation during test-time generation. The claim of the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016) originates from the following discrepancy between MLE training and test-time generation for auto-regressive language models: During training, the model is trained to predict the next word conditioned on prefix (or history) words sampled from the ground-truth data distribution, while during generation, the model generates words conditioned on prefix sequences generated by the model itself. Hence, due to the exposure to real data during training, the language model could potentially be biased to only perform well with ground-truth data prefixes. Therefore, it is claimed (and widely believed among researchers) that during generation the errors accumulate along the generated sequence, and the distribution generated by the model becomes incrementally distorted. The forced exposure to ground-truth data during training is also referred to as teacher forcing.
In order to avoid teacher forcing, many training algorithms (Bengio et al., 2015; Lamb et al., 2016; Ranzato et al., 2016; Yu et al., 2016; Zhu et al., 2018; Lu et al., 2018; Lin et al., 2017; Guo et al., 2017; Rajeswar et al., 2017; Wiseman & Rush, 2016; Nie et al., 2019; Shi et al., 2018; de Masson d'Autume et al., 2019; Rennie et al., 2016) have been proposed as alternatives to MLE training for open-ended language generation. Most of these works utilize techniques from generative adversarial networks (GANs) (Goodfellow et al., 2014) or reinforcement learning (RL) (Sutton & Barto, 1998). In this paper, we refer to these algorithms as non-MLE methods. Despite the huge research effort devoted to alleviating exposure bias, interestingly, the existence or significance of exposure bias itself is much less studied. On the other hand, despite the criticism, MLE (teacher forcing) has remained the dominant objective for LM training (Radford et al., 2018; Keskar et al., 2019). To make the situation more curious, multiple recent works show that the proposed non-MLE methods actually have inferior generation performance compared to the MLE baseline (Caccia et al., 2018; de Masson d'Autume et al., 2019). These negative results lead us to question: Is exposure bias truly a serious problem for MLE training?

In this work we seek a direct answer to the above question. Here we briefly summarize our contributions: We conduct controlled experiments in which we remove the training-generation discrepancy in the prefix, and design various metrics to quantify the performance improvement of the generation as the prefix length grows. Contrary to our expectation, our measurements consistently show that the performance gain is limited, and the incremental distortion claimed by exposure bias is not observed (the performance gap does not become larger with longer prefixes). In the next section, we begin by introducing notations and background.
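The core measurement behind our contribution, tracking the gap between performance with ground-truth prefixes and with model-generated prefixes as the prefix length grows, can be sketched as follows. This is a minimal illustration: the function names are ours, and the per-length quality scores are assumed to be computed elsewhere (e.g., by automatic metrics or human evaluation).

```python
def gap_by_prefix_length(quality_data_prefix, quality_model_prefix):
    """Per prefix length l, the gap between generation quality with
    ground-truth prefixes and with the model's own prefixes."""
    assert quality_data_prefix.keys() == quality_model_prefix.keys()
    return {l: quality_data_prefix[l] - quality_model_prefix[l]
            for l in quality_data_prefix}

def is_incremental(gap):
    """True iff the gap strictly grows with prefix length -- the
    incremental-distortion pattern that the exposure bias claim predicts."""
    vals = [gap[l] for l in sorted(gap)]
    return all(a < b for a, b in zip(vals, vals[1:]))
```

If `is_incremental` returns False on measured gaps, the incremental distortion predicted by exposure bias is not observed, which is the pattern our experiments report.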

2. PRELIMINARIES

The task of auto-regressive language modelling is to learn the probability distribution of the (l+1)-th word (or token) W_{l+1} in a sentence W conditioned on the prefix W_{1:l} := (W_1, ..., W_l). We use W_i ∈ V to denote a discrete random variable distributed over the vocabulary V. For simplicity, we assume all sentences are of length L in the formulations. Denoting the ground-truth data distribution as P_D, the standard MLE training aims to minimize the negative log-likelihood (NLL) loss below:

    L_NLL(θ) = E_{W ∼ P_D} [ − Σ_{l=0}^{L−1} log P_θ(W_{l+1} | W_{1:l}) ],

where P_θ(· | W_{1:l}) denotes the conditional distribution of W_{l+1} under P_θ given a prefix W_{1:l}, and θ stands for the set of parameters to be trained. Note that the concept of "sentence" (W) can be naturally generalized to paragraphs or even articles, depending on the target task.

We denote the distribution of an MLE-trained LM as P_M, which is the major subject of this study. We will experiment with two popular model architectures: the LSTM LM (Hochreiter & Schmidhuber, 1997; Sundermeyer et al., 2012) and the transformer LM (Baevski & Auli, 2018; Dai et al., 2019). For generation, we perform classical ancestral sampling without invoking truncated sampling algorithms such as top-k sampling (Fan et al., 2018), for the following reasons: (1) These sampling algorithms are known to trade diversity for quality (Nadeem et al., 2020; Caccia et al., 2018), so invoking them could "hide" the exposure bias problem because the prefixes from the model would be of higher quality. (2) These sampling algorithms require tuning of hyper-parameters, which would complicate the comparison.

In addition to popular measures in natural language generation (NLG) such as BLEU (Papineni et al., 2002) or METEOR (Denkowski & Lavie, 2014), our quantification approaches also rely on measurements of the divergence between two distributions. Let P denote the set of probability distributions on the vocabulary V, and let f_div : P × P →
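As a concrete illustration, the teacher-forcing NLL and classical ancestral sampling above can be sketched with a toy bigram model standing in for the LSTM/transformer LMs; the vocabulary, the probability table, and all function names here are illustrative assumptions, not the paper's models.

```python
import math
import random

# Toy bigram "LM": P(w_{l+1} | w_l) over a tiny vocabulary (illustrative only).
BIGRAM = {
    "<bos>": {"the": 1.0},
    "the":   {"cat": 0.9, "sat": 0.1},
    "cat":   {"sat": 0.8, "<eos>": 0.2},
    "sat":   {"<eos>": 1.0},
}

def nll(sentence):
    """Teacher-forcing NLL: each step is conditioned on the ground-truth prefix."""
    return -sum(math.log(BIGRAM[prev][nxt])
                for prev, nxt in zip(sentence, sentence[1:]))

def ancestral_sample(max_len=10, seed=0):
    """Classical ancestral sampling: each token is drawn from the full
    conditional distribution, with no top-k truncation."""
    rng = random.Random(seed)
    seq = ["<bos>"]
    for _ in range(max_len):
        dist = BIGRAM[seq[-1]]
        r, acc, tok = rng.random(), 0.0, None
        for w, p in sorted(dist.items()):
            acc += p
            if r <= acc:
                tok = w
                break
        seq.append(tok if tok is not None else w)  # guard against float round-off
        if seq[-1] == "<eos>":
            break
    return seq
```

During generation the prefix inside the loop comes from the model itself rather than from P_D, which is exactly the training-generation discrepancy under study.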
ℝ_{≥0} be a divergence function between two distributions (e.g., the total variation distance). We will adopt two popular probability divergence functions: the total variation distance (denoted d_TV) and the Jensen-Shannon divergence (denoted d_JS). We provide definitions of d_TV and d_JS in Appendix A.

Our experiments will focus on the task of open-ended language generation, which is arguably a good test bed for exposure bias for the following reasons: (1) The generation length is long. (2) Unlike typical seq2seq tasks such as machine translation, the generation space is only weakly constrained and the topics can be very diverse, which means the training-generation discrepancy could be large.
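For concreteness, the two divergences can be sketched as follows for distributions represented as token-to-probability dictionaries; a minimal sketch assuming finite support, with d_JS using base-2 logarithms (one common convention) so that its value lies in [0, 1].

```python
import math

def d_tv(p, q):
    """Total variation distance: half the L1 distance between the distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)

def d_js(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m,
    with base-2 logs so the value lies in [0, 1]."""
    support = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in support}
    def kl_to_m(a):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)
```

For example, two point masses on different tokens give d_TV = 1 and d_JS = 1, while identical distributions give 0 under both.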

3. A QUALITATIVE ATTEMPT

We begin with a qualitative attempt to verify the significance of exposure bias. We design a prefix-switching experiment as follows: We feed an MLE-trained transformer LM on the wiki-103 dataset with four types of prefixes of the same length: (1) test-data samples, (2) the model's own samples, (3) test-data samples shuffled at the word level, or (4) samples from a uniformly random distribution on V. Then we let the model continue the generation given these prefixes and compare the quality of the samples in a qualitative manner. We defer details of the model and dataset to Section 5.

The intuition behind the prefix-switching experiment follows immediately from the original claim of exposure bias: During generation, if we set the prefix distribution to be the ground-truth data distribution instead of the model's own distribution, then the discrepancy between training and

