E-FORCING: IMPROVING AUTOREGRESSIVE MODELS BY TREATING THEM AS ENERGY-BASED ONES

Anonymous

Abstract

Autoregressive generative models are commonly used to solve tasks involving sequential data. However, the intrinsic characteristics of chain-style conditional modeling give rise to a slew of inherent flaws (e.g., exposure bias and a lack of long-range coherence) that severely limit these models' ability to fit distributions properly. In this paper, we propose a method termed E-Forcing for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective. By leveraging the extra degree of freedom of the softmax operation, E-Forcing makes the autoregressive model itself an energy-based model for measuring the likelihood of the input, without introducing any extra parameters. We further show that E-Forcing alleviates the flaws above for autoregressive models. Extensive empirical results covering numerous benchmarks demonstrate the effectiveness of the proposed approach.

1. INTRODUCTION

By factorizing the joint distribution into a product of conditional distributions, autoregressive generative models (abbr. ARGMs) (Vaswani et al., 2017; Dai et al., 2019; van den Oord et al., 2016a;b; Salimans et al., 2017; Chen et al., 2018) simplify the difficult challenge of modeling high-dimensional joint distributions. They can be trained efficiently via maximum likelihood and generate samples of exceptional quality, making them popular for modeling distributions, especially over sequential data. Nonetheless, despite their potency, flexibility, and huge success, ARGMs still have inherent weaknesses stemming from the intrinsic characteristics of chain-style conditional modeling, especially when the training data is less diverse¹. For example, ARGMs usually suffer from a discrepancy between the distributions of input contexts at the training and inference stages, which causes a consequent performance drop, i.e., exposure bias (Ranzato et al., 2016; Bengio et al., 2015). Besides, due to the greedy nature of beam-search approximations, the decoded results from ARGMs may also lack long-range coherence (Deng et al., 2020). Earlier work, both heuristic and theoretical, has been proposed to address these concerns. For instance, the exposure bias of ARGMs can be alleviated to some extent with scheduled sampling (Bengio et al., 2015; Mihaylova & Martins, 2019), which mixes input contexts from both real data and autoregressive generation during the training stage. However, this scheme introduces new problems such as over-correction (Zhang et al., 2019). In addition, at the inference stage, sampling methods such as beam search are employed to generate diverse candidates with high likelihoods, improving the quality of generated sequences. Nevertheless, these approaches yield only marginal improvements in temporal coherence.
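The scheduled-sampling scheme discussed above can be sketched as follows. This is a minimal illustration, not the exact procedure of Bengio et al. (2015): `model_predict` is a hypothetical hook standing in for the ARGM's next-token prediction, and `eps` is the probability of teacher forcing at each step.

```python
import random

def scheduled_sampling_inputs(bos, gold, model_predict, eps):
    """Build the decoder input sequence for one training step.

    With probability `eps`, the ground-truth previous token is fed
    (teacher forcing); otherwise the model's own prediction for the
    current prefix is fed, so the model also sees inference-like
    input contexts during training.
    """
    inputs = [bos]
    for t in range(1, len(gold)):
        if random.random() < eps:
            inputs.append(gold[t - 1])            # teacher forcing
        else:
            inputs.append(model_predict(inputs))  # model's own sample
    return inputs
```

With `eps = 1.0` this reduces to standard teacher forcing; annealing `eps` toward 0 gradually shifts the input distribution toward the one seen at inference time.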
In this paper, we propose an elegant solution, E-Forcing, to the above problems of ARGMs by leveraging a deep connection between ARGMs and energy-based models (EBMs). EBMs are a popular class of generative models that have demonstrated their effectiveness in modeling high-dimensional distributions in a variety of machine learning applications, without requiring the transformation of the target distribution into a product of conditional distributions (Zhao et al., 2017; Arbel et al., 2021; Gao et al., 2021). As a result, several studies (Deng et al., 2020; Bakhtin et al., 2021; Durkan & Nash, 2019) have attempted to bring the advantages of EBMs to ARGMs. However, although some positive results were obtained, existing works preferred a two-stage optimization: first obtaining a well-trained ARGM, then training an additional EBM on top of it. Such a strategy not only imposes a heavy training process for the EBM, but also prevents the ARGM itself from benefiting from the EBM's ability to model the joint distribution in a temporally more coherent way, and it requires extra parameters to estimate energy scores, increasing the complexity of the learning task. Our method of combining ARGMs and EBMs takes a different approach, seamlessly integrating the energy-based model into the autoregressive model by exploiting the extra degree of freedom within the model's final softmax layer. We show that, in this way, the ARGM can be trained with an energy-based learning objective, which allows it to avoid intrinsic concerns such as exposure bias with the help of energy-based modeling, as in prior work (Deng et al., 2020; Bakhtin et al., 2021), while leaving the learning model's complexity unchanged. This property makes E-Forcing easy to apply in the training process of any ARGM for any specific task, as no structural changes are required.
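To make the "extra degree of freedom" concrete, the following is one illustrative construction (a hedged sketch, not necessarily the paper's exact parameterization): since softmax is invariant to an additive shift of the logits, the ARGM likelihood never constrains their absolute scale, and an energy-based view can spend that freedom by reading a sequence-level energy directly off the raw logits. `logit_fn` is a hypothetical hook standing in for the ARGM's final layer.

```python
import numpy as np

def sequence_energy(logit_fn, tokens):
    """Illustrative sequence-level energy reusing an ARGM's raw logits.

    `logit_fn(prefix)` returns the vector of unnormalized next-token
    logits for a prefix. The softmax normalization discards the
    logits' absolute scale; here that free scalar information is used
    directly, defining
        E(x) = - sum_t logits(x_<t)[x_t],
    so that pi(x) ∝ exp(-E(x)) without any extra parameters.
    """
    energy = 0.0
    for t, tok in enumerate(tokens):
        logits = logit_fn(tokens[:t])   # scores for the next token
        energy -= float(logits[tok])
    return energy
```

The resulting EBM differs from the ARGM's factorized likelihood only through the (intractable) partition function, which is precisely why the two views can share all parameters.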
Besides, we follow the predominant approach for training explicit-density generative models: minimizing the KL divergence between the (empirical) data distribution and the model distribution, which gives rise to gradient-based contrastive divergence (CD) methods (Hinton, 2002; Kim & Bengio, 2016) for energy-based models. Typically, these methods require a Markov chain Monte Carlo (MCMC) process to draw samples from the EBM for the "negative phase" gradient estimation, which is extremely time-consuming and, moreover, inapplicable to discrete data such as text. To solve this, we present a way to estimate the "negative phase" gradients using samples generated from the network's autoregressive view instead of its EBM view, making training feasible. Since our method combines the EBM and the ARGM seamlessly as a whole, i.e., the ARGM is itself an EBM, the exposure bias problem is mitigated because autoregressively sampled data is involved in the "negative phase" of the CD objective. Moreover, unlike ARGMs, which factor the joint distribution into a product of conditional distributions, EBMs model the joint distribution directly and score each input at the sequence level instead of the token level, which makes them capable of modeling long-range coherence.
In summary, this paper makes the following contributions: i) we introduce a novel scheme that seamlessly integrates the EBM view into autoregressive generative models; ii) we propose a novel method, named E-Forcing, for efficiently optimizing the energy-based autoregressive model via contrastive divergence based on importance sampling rather than MCMC; iii) we mitigate inherent flaws of autoregressive models, namely exposure bias and weak temporal coherence, by leveraging E-Forcing's two-phase optimization, which makes use of both real and generated data; iv) we demonstrate clear improvements of the proposed method on various tasks such as language modeling, machine translation, and image generation.

2.1. ENERGY-BASED MODELS

Let p_d denote the data distribution. Energy-based models (LeCun et al., 2006) aim to learn an unnormalized energy function E_θ(x) that defines the density (mass) function π_θ(x) as

π_θ(x) = exp(-E_θ(x)) / Z_θ,

where E_θ : X → R is an energy function mapping a data sample from the data space X to a scalar energy, and Z_θ = Σ_{x∈X} exp(-E_θ(x)) (an integral for continuous X) denotes the normalizing constant, also known as the partition function, which is typically intractable to estimate. Any function can be used as an energy function to represent an EBM as long as it produces a single scalar given some input x and the normalizing constant is finite². Contrastive divergence algorithms are commonly used to optimize EBMs via maximum log-likelihood (Hinton, 2002; Kim & Bengio, 2016; Grathwohl et al., 2020).
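On a toy discrete space small enough for Z_θ to be tractable, the maximum-likelihood gradient behind contrastive divergence (a "positive phase" on data minus a "negative phase" under the model) can be computed exactly. A hedged sketch with the hypothetical energy E_θ(x) = -θ_x:

```python
import numpy as np

def loglik_gradient(theta, data):
    """Exact log-likelihood gradient for a toy EBM E_theta(x) = -theta[x]
    over the finite space {0, ..., len(theta)-1}, where Z is tractable.

    d/dtheta of the mean log-likelihood decomposes as
        (empirical frequency of each state)    # positive phase, on data
      - (model probability of each state)      # negative phase, under pi
    """
    probs = np.exp(theta) / np.exp(theta).sum()                # pi_theta
    pos = np.bincount(data, minlength=len(theta)) / len(data)  # data stats
    return pos - probs
```

In realistic EBMs the negative phase cannot be computed in closed form and must be estimated by sampling, which is where MCMC-based CD, or the importance-sampling estimator used here, comes in.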



¹ When trained on massive datasets whose underlying distribution is diverse enough, as in large language models, this problem is relieved, because the training data covers many corner cases, making it much harder for the model to go off-distribution.
² Without constraining the parametrization of E_θ, this can be achieved by bounding the region of space in which x takes its allowed values.

