E-FORCING: IMPROVING AUTOREGRESSIVE MODELS BY TREATING THEM AS ENERGY-BASED ONES

Anonymous

Abstract

Autoregressive generative models are commonly used to solve tasks involving sequential data. They have, however, been plagued by a slew of inherent flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), which severely limit their ability to model distributions properly. In this paper, we propose a unique method, termed E-Forcing, for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective. By leveraging the extra degree of freedom of the softmax operation, we can turn the autoregressive model itself into an energy-based model that measures the likelihood of the input, without introducing any extra parameters. We further show that E-Forcing alleviates the above flaws of autoregressive models. Extensive empirical results covering numerous benchmarks demonstrate the effectiveness of the proposed approach.

1. INTRODUCTION

By factorizing the joint distribution into the product of a series of conditional distributions, autoregressive generative models (abbr. ARGMs) (Vaswani et al., 2017; Dai et al., 2019; van den Oord et al., 2016a;b; Salimans et al., 2017; Chen et al., 2018) simplify the difficult challenge of modeling high-dimensional joint distributions. They can be trained efficiently via maximum likelihood and generate samples of exceptional quality, making this technique popular for modeling distributions, especially for sequential data. Nonetheless, despite their potency, flexibility, and huge success, ARGMs still have inherent weaknesses due to the intrinsic characteristics of chain-style conditional modeling, especially when the training data is less diverse. For example, ARGMs usually suffer from a discrepancy between the distributions of input contexts at the training and inference stages, which causes a consequent performance drop, i.e., exposure bias (Ranzato et al., 2016; Bengio et al., 2015). Besides, due to the greedy selection inherent in beam-search approximations, the decoded results from ARGMs may also lack long-range coherence (Deng et al., 2020). Earlier work, both heuristic and theoretical, has been proposed to address these concerns. For instance, the exposure bias problem of ARGMs can be alleviated to some extent with scheduled sampling (Bengio et al., 2015; Mihaylova & Martins, 2019), which mixes input contexts from both real data and autoregressive generation during the training stage. However, this scheme introduces new problems such as over-correction (Zhang et al., 2019). In addition, at the inference stage, decoding methods such as beam search are employed to generate diverse candidates with high likelihoods, improving the quality of generated sequences. Nevertheless, these approaches yield only marginal improvements in temporal coherence.
In this paper, we propose an elegant solution, i.e., E-Forcing, for the above problems of ARGMs by leveraging a deep connection between ARGMs and energy-based models (EBMs). EBMs are a popular class of generative models that have demonstrated their effectiveness in modeling high-dimensional distributions in a variety of machine learning applications, without requiring the transformation of the target distribution into a product of conditional distributions (Zhao et al., 2017; Arbel et al., 2021; Gao et al., 2021). As a result, several studies (Deng et al., 2020; Bakhtin et al., 2021; Durkan & Nash, 2019) have attempted to let ARGMs benefit from the advantages of EBMs. However, although some positive results were obtained, existing works preferred a two-stage optimization, which first obtains a well-trained ARGM and then trains an additional EBM on top of it. Such an optimization strategy not only entails a heavy training process for the EBM but also prevents the ARGM itself from benefiting from the EBM's ability to model the joint distribution in a temporally more coherent way, and it requires additional parameters to estimate energy scores, increasing the intricacy of the learning task. Our method of combining ARGMs and EBMs takes a different approach, seamlessly integrating the energy-based model into the autoregressive model by exploiting the extra degree of freedom within the final softmax layer. We show that in this way the ARGM can be trained with an energy-based learning objective, which allows it to avoid intrinsic concerns such as exposure bias with the help of energy-based modeling, as in former work (Deng et al., 2020; Bakhtin et al., 2021), while leaving the learning model's complexity unchanged. This property makes E-Forcing easy to apply in the training process of any ARGM for any specific task, as no structural changes are required.
Besides, we follow the predominant approach for training explicit-density generative models: minimizing the KL divergence between the (empirical) data distribution and the model distribution, which gives rise to gradient-based contrastive divergence (CD) methods (Hinton, 2002; Kim & Bengio, 2016) for energy-based models. Typically, these methods require a Markov chain Monte Carlo (MCMC) process to sample data from the EBM for the "negative phase" gradient estimation, which is extremely time-consuming and, moreover, inapplicable to discrete data such as text. To solve this, we present a way to estimate the "negative phase" gradients with samples generated from the network's autoregressive view instead of its EBM view, making training feasible. Since our method combines the EBM and the ARGM seamlessly as a whole, i.e., the ARGM is itself also an EBM, the exposure bias problem can be mitigated because autoregressively sampled data is involved in the "negative phase" of the CD methods. Moreover, unlike ARGMs, which factor the joint distribution into a product of conditional distributions, EBMs can model the joint distribution directly and score each input at the sequence level instead of the token level, which makes them capable of modeling long-range coherence.
In summary, this paper makes the following contributions: i) we introduce a novel scheme that seamlessly integrates the EBM view into autoregressive generative models; ii) we propose a novel method, named E-Forcing, for efficiently optimizing the energy-based autoregressive model via contrastive divergence based on importance sampling rather than MCMC; iii) we mitigate inherent flaws of autoregressive models (exposure bias and weak temporal coherence) by leveraging E-Forcing's two-phase optimization, which makes use of both real and generated data; iv) we demonstrate clear improvements of the proposed method on various tasks such as language modeling, machine translation, and image generation.

2.1. ENERGY-BASED MODELS

Let p_d denote the data distribution. Energy-based models (LeCun et al., 2006) aim to learn an unnormalized energy function E_θ(x) that defines the density (mass) function π_θ(x) = exp(−E_θ(x)) / Z_θ, where E_θ : X → R maps a data sample from the data space X to an energy scalar, and Z_θ = Σ_x exp(−E_θ(x)) denotes the normalizing constant, also known as the partition function, which is generally intractable. Any function can be used as an energy function to represent an EBM as long as it produces a single scalar given some input x and the normalizing constant is finite. Contrastive divergence algorithms are commonly used to optimize EBMs via maximum log-likelihood (Hinton, 2002; Kim & Bengio, 2016; Grathwohl et al., 2020). Correspondingly, the gradient of the log-likelihood to be maximized with respect to θ can be expressed as ∇_θ E_{p_d(x)}[log π_θ(x)] = E_{π_θ(x)}[∇_θ E_θ(x)] − E_{p_d(x)}[∇_θ E_θ(x)]. (2) The first term on the right-hand side of Eq. 2 is usually called the "negative phase" term, while the second term is called the "positive phase" term. In general, due to the challenge of sampling from EBMs, training EBMs with contrastive divergence methods (Hinton, 2002; Kim & Bengio, 2016; Grathwohl et al., 2020) is difficult, especially on high-dimensional data. MCMC methods (Nijkamp et al., 2019; Du & Mordatch, 2019; Grathwohl et al., 2020) are usually adopted for data sampling. However, these methods incur enormous extra computing overhead and are not applicable when the input is discrete, such as text sequences (Deng et al., 2020). As a result, a variety of recent works explore strategies for training an EBM without MCMC. In particular, Bakhtin et al. (2021); Xu et al. (2021); Gao et al. (2020) optimize EBMs using noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010; Ma & Collins, 2018).
Durkan & Nash (2019) estimate the intractable normalization constant by utilizing ARGMs and importance sampling. Bengio et al.; Che et al. (2020); Wang et al. (2021) skirt the challenge of sampling in the high-dimensional data space by sampling in a carefully crafted latent space, which improves sampling efficiency.
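To make the contrastive-divergence gradient above concrete, the following is a toy, tabular illustration (not the paper's model): with E_θ(x) = θ[x] over a small discrete space, the exact log-likelihood gradient per entry reduces to π_θ(x) − p_d(x), so gradient ascent drives π_θ toward p_d. All numbers are made up.

```python
import numpy as np

# Toy tabular EBM trained with the exact CD gradient
#   d/dθ E_{p_d}[log π_θ(x)] = E_{π_θ}[∇θ E_θ(x)] - E_{p_d}[∇θ E_θ(x)],
# which for E_θ(x) = θ[x] reduces per entry to π_θ(x) - p_d(x).

p_d = np.array([0.5, 0.3, 0.15, 0.05])   # target data distribution
theta = np.zeros(4)                       # one energy value per state

for _ in range(2000):
    pi = np.exp(-theta) / np.exp(-theta).sum()   # π_θ = e^{-E} / Z
    theta += 0.5 * (pi - p_d)                    # ascend the log-likelihood

pi = np.exp(-theta) / np.exp(-theta).sum()
assert np.allclose(pi, p_d, atol=1e-3)           # model matches the data
```

Here the partition sum is tractable, so no MCMC is needed; the example only illustrates the gradient's two phases.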

2.2. MODELING DISTRIBUTIONS AUTOREGRESSIVELY

Modeling high-dimensional data distributions directly is usually a rather challenging task due to "the curse of dimensionality" (Bellman, 1954). One alternative is to impose a sequential order on the random variables and then factorize the joint probability distribution into a product of conditionals based on the sequence structure, which is the core idea of autoregressive generative models (ARGMs). ARGMs have been very successful, in particular for sequential data. For example, ARGMs have been widely used in language modeling (Vaswani et al., 2017; Dai et al., 2019; Radford et al., 2019), audio synthesis (van den Oord et al., 2016a), and even image generation (van den Oord et al., 2016c;b; Salimans et al., 2017). However, the advantages of ARGMs are balanced to some extent by issues of (1) exposure bias (Ranzato et al., 2016; Bengio et al., 2015; Song et al., 2020), due to the discrepancy in input context distributions between the training and inference stages, and (2) weak long-range coherence (Deng et al., 2020), due to the inherent greedy selection of one token at a time without look-ahead.
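The chain-rule factorization above can be sketched numerically; the conditional probabilities below are toy values, not a trained model:

```python
import numpy as np

# The autoregressive factorization: the joint probability of a sequence is
# the product of per-step conditionals, i.e.
#   log q(x_1..x_K) = sum_k log q(x_k | x_<k).

cond_probs = [0.5, 0.8, 0.25]                    # q(x_k | x_<k) for each step
log_joint = sum(np.log(p) for p in cond_probs)   # chain-rule factorization

assert np.isclose(np.exp(log_joint), 0.5 * 0.8 * 0.25)
```

Working in log space, as above, is the standard way to avoid underflow for long sequences.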

2.3. THE MIXTURE OF EBMS AND GENERATIVE MODELS

The seminal idea of combining a generative model and an energy-based model has been explored by a plethora of works (Pang et al., 2020; Durkan & Nash, 2019; Xie et al., 2019; 2020; Xiao et al., 2021; Bakhtin et al., 2021; Che et al., 2020; Arbel et al., 2021; Deng et al., 2020). In particular, Pang et al. (2020) learn an energy-based model (EBM) in the latent space of a generator model, so that the EBM can act as a prior on the generator's top-down network. VAEBM, a symbiotic composition of a variational auto-encoder and an EBM, was proposed by Xiao et al. (2021). Arbel et al. (2021) proposed a novel training method for a combined GAN/EBM model by leveraging the Donsker-Varadhan representation of the KL divergence. Among these works, Residual EBM (Deng et al., 2020; Bakhtin et al., 2021; Durkan & Nash, 2019) and EBR (Naskar et al., 2020) may be the most closely related to our paper. The authors of these works attempt to let ARGMs benefit from the advantages of EBMs. However, different from our work, they adopt a two-stage optimization scheme, which first obtains a well-trained generative model and then trains an additional EBM on top of it. Such an optimization strategy does not enable ARGMs themselves to benefit from the EBM's properties in modeling the joint distribution. Besides, in order to benefit from the EBM, complicated re-sampling or re-ranking schemes are needed at inference time. It also increases the parameter count, since independent networks represent the ARGM and the EBM, adding to the complexity of the learning task. In contrast, we introduce the EBM inside the ARGM, treating the ARGM directly as an EBM itself.

3. TREATING THE ARGM AS AN EBM

In this section, we present the overall framework of our E-Forcing method for training better autoregressive models. Let (x_1, . . . , x_K) be a random sequence of length K drawn from the real data distribution p_d, let x_k denote the random variable at time step k, and let x_<k represent the random subsequence before time step k, i.e., x_<k = (x_1, x_2, . . . , x_{k−1}). The general spirit of our design is to model the joint distribution p_d(x_k, x_<k) by integrating an EBM inside the autoregressive model q_θ. Formally, given an autoregressive model q_θ(x_1, . . . , x_K) = Π_{k=1}^{K} q_θ(x_k | x_<k) parameterized by θ, we introduce K energy-based models p_θ(x_k, x_<k), one for each time step k ≤ K, with the formulation p_θ(x_k, x_<k) = q_θ(x_<k) · e^{−φ_θ(x_k, x_<k)} / Z_θ, where Z_θ = E_{q_θ}[Σ_{x_k} e^{−φ_θ(x_k, x_<k)}] is the normalization constant and φ_θ(·) represents the energy function. Essentially, p_θ(x_k, x_<k) is a product EBM, defined as the product of q_θ and another EBM with energy φ_θ.

3.1. DEFINITION OF THE ENERGY FUNCTION

We define the energy function φ_θ(x_k, x_<k) as x_k's corresponding component of the network's output logits given the input context x_<k (e.g., given the sequence "This is Friday." and assuming the index of the token "Friday" in the vocabulary is i, the value of −φ_θ("Friday", "This is") is the i-th component of the output logits, namely, the input tensor of the final softmax layer). The rationale behind this design stems from the extra degree of freedom concealed inside the softmax transformation S : R^M → (0, 1)^M, which converts an unnormalized vector of size M into a probability distribution over M outcomes: S([z_1, . . . , z_M]) = [e^{z_1} / Σ_{i=1}^{M} e^{z_i}, . . . , e^{z_M} / Σ_{i=1}^{M} e^{z_i}]. It is easy to observe that the softmax operation is unaffected by the input vector's overall magnitude, that is, S([z_1, . . . , z_M]) = S([z_1, . . . , z_M] + C), ∀C ∈ R. This property allows us to model the energy function with the ARGM itself instead of introducing a new network.
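The shift invariance exploited above can be checked in a few lines; the logit values and the token index below are made up for illustration:

```python
import numpy as np

# Softmax is invariant to a constant shift of the logits, so the absolute
# logit values are a free degree of freedom that E-Forcing reads as
# negative energies.

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])   # hypothetical output logits at step k
shifted = logits + 7.3                # add an arbitrary constant C

assert np.allclose(softmax(logits), softmax(shifted))   # S(z) = S(z + C)

# The energy of token i given the context is read off the same logits:
phi_token_1 = -logits[1]              # φ_θ(x_k, x_<k) = -logit[x_k]
assert phi_token_1 == 1.0
```

The stability shift inside `softmax` is itself an application of the same invariance.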

3.2. ENERGY-BASED LEARNING OBJECTIVE

In addition to making q_θ match p_d, E-Forcing has a further training objective: making the K parametric distributions p_θ(x_k, x_<k) match the real data distribution p_d(x_k, x_<k) at every time step k ≤ K. This can be achieved by minimizing the Kullback-Leibler (KL) divergence between the two distributions at each time step of a sequence, D_KL(p_d(x_k, x_<k) || p_θ(x_k, x_<k)), ∀k ∈ [1, K]. We use contrastive divergence methods (Hinton et al., 1995; Kim & Bengio, 2016) to minimize objective 5 by descending the gradient with respect to θ according to Eq. 2 for each time step. Specifically, for an arbitrary time step k, the gradient of objective 5 with respect to θ is ∇_θ L_EBM-CD = E_{p_d}[∇_θ E_θ(x_k, x_<k)] (positive phase gradient) − E_{p_θ}[∇_θ E_θ(x_k, x_<k)] (negative phase gradient), (6) where E_θ(x_k, x_<k) = φ_θ(x_k, x_<k) − log q_θ(x_<k). Optimization via Eq. 6 involves sampling data from the model distribution p_θ and can thus lead to the discovery of non-data-like samples, whose likelihood is then explicitly reduced as the corresponding energy increases during training. E-Forcing is therefore naturally free of the exposure bias problem. Besides, because we model the joint distribution at each time step, E-Forcing can assess the sequence up to the current time step as a whole and generate more coherent data using energy sampling (Deng et al., 2020). However, the negative phase gradient is difficult to compute, especially for discrete data (e.g., text), where common MCMC methods (Welling & Teh, 2011) cannot even be applied. Therefore, we propose a novel variant of contrastive divergence methods for E-Forcing's optimization in Section 4.

4. OPTIMIZATION

The key obstacle to optimizing objective 5 via contrastive divergence methods (Hinton, 2002), i.e., descending the gradient of Eq. 6, is sampling data from the model distribution p_θ to estimate the negative phase gradient. Common MCMC algorithms are undesirable for generating "negative" samples because they are rather time-consuming and not applicable to discrete data. To make the optimization process both efficient and feasible, we modify the original CD methods by means of the importance sampling technique (Horvitz & Thompson, 1952), which splits the gradient estimation into two parts.

4.1. POSITIVE PHASE GRADIENTS

Since the training set consists of i.i.d. samples drawn from the real distribution p_d, computing the positive phase gradients is straightforward. Specifically, by replacing E_θ(x_k, x_<k) with the form φ_θ(x_k, x_<k) − log q_θ(x_<k) in Eq. 6, the positive phase gradient with respect to the parameters θ can be written as G_+^{(k)}(θ) = E_{p_d}[∇_θ φ_θ(x_k, x_<k) − ∇_θ log q_θ(x_<k)]. Since sample estimation of the expectation over the data distribution p_d is viable, and the score φ_θ(x_k, x_<k) can be acquired by simply reading off the output logits of the ARGM (according to the definition of φ_θ in Sec. 3), the first term of the positive phase gradient G_+^{(k)} can be readily computed. Moreover, the second term E_{p_d}[−∇_θ log q_θ(x_<k)] of G_+^{(k)}(θ) is the negative gradient of the logarithm of the likelihood q_θ(x_<k), i.e., exactly the gradient used when maximizing the log-likelihood of the autoregressive generative model q_θ.
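The positive-phase term can be sketched as a scalar loss whose gradient is G_+^{(k)}; the logits, tokens, and shapes below are toy stand-ins, not a real model:

```python
import numpy as np

# Positive-phase surrogate loss: φ is read off the logits of the observed
# token, and log q_θ(x_<k) is the usual teacher-forcing log-likelihood of
# the context.

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# toy sequence of length 3 over a vocabulary of 4; one row of logits per step
logits = np.array([[1.0, 0.2, -0.5, 0.1],
                   [0.3, 2.0, -1.0, 0.0],
                   [0.5, 0.1, 0.9, -0.2]])
tokens = [0, 1, 2]
k = 2                                 # time step under consideration (0-indexed)

phi = -logits[k, tokens[k]]                                     # energy of x_k
log_q_ctx = sum(log_softmax(logits[l])[tokens[l]] for l in range(k))
positive_phase_loss = phi - log_q_ctx   # minimizing this realizes G+^(k)

assert np.isclose(phi, -0.9)
assert positive_phase_loss > phi        # -log q_θ(x_<k) is positive
```

In a real framework the same quantity would be computed in batch with automatic differentiation; the sketch only makes the two terms of G_+^{(k)} explicit.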

4.2. NEGATIVE PHASE GRADIENTS

The estimation of the negative phase gradients G_−^{(k)}(θ) = E_{p_θ}[∇_θ φ_θ(x_k, x_<k) − ∇_θ log q_θ(x_<k)], on the other hand, is more involved. Estimating the expectation E_{p_θ} requires sampling data from p_θ, whereas p_θ is the introduced energy-based autoregressive model, an explicit density estimator whose density (mass) function we can evaluate but from which we cannot easily draw samples. Inspired by the idea of importance sampling, we substitute the troublesome estimation of the expectation over the distribution p_θ with an expectation over the distribution q_θ, i.e., the underlying autoregressive model, from which samples can be generated considerably more easily. Accordingly, the negative phase gradient E_{x_k, x_<k ∼ p_θ}[∇_θ E_θ(x_k, x_<k)] takes the following form (see the detailed derivation in Appendix B): G_−^{(k)}(θ) = E_{q_θ}[w(x_<k)(∇_θ φ_θ(x_k, x_<k) − ∇_θ log q_θ(x_<k))], (8) where w(x_<k) = Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / E_{q_θ(x'_<k)}[Σ_{x_k} e^{−φ_θ(x_k, x'_<k)}]. (9) According to Eq. 8, all the estimated expectations only require sampling from the autoregressive model q_θ rather than from the distribution p_θ, and the reweighing weight w in Eq. 9 does not involve any expectation over p_θ either. Generally speaking, producing data from an autoregressive model is a simple ancestral sampling process that is naturally suitable for discrete data, in contrast to sampling directly from an explicit generative density estimator, which requires MCMC approaches (Durkan & Nash, 2019). Besides, the term E_{x_<k ∼ q_θ(x_<k)}[w(x_<k) ∇_θ log q_θ(x_<k)] in Eq. 8 can be regarded as a re-weighted gradient of q_θ's information entropy with respect to θ. This term can be optimized similarly to the teacher-forcing training of the autoregressive model, with the "teacher" sequence generated autoregressively by the model itself.
Scheduled sampling methods (Bengio et al., 2015; Ranzato et al., 2016; Mihaylova & Martins, 2019) resemble this term but lack the re-weighting factor. Moreover, the reweighing weight w of Eq. 9 can be further refined (see the derivation in Appendix B.3): w(x_<k) = µ(x_<k) / E_{x'_<k}[µ(x'_<k)], where µ(x_<k) = p_θ(x_<k) / q_θ(x_<k) is a likelihood ratio indicating which distribution (p_θ or q_θ) the input context x_<k is more likely to come from. Correspondingly, w(x_<k) reflects the relative magnitude of µ(x_<k) for the context x_<k compared with the average over all potential contexts: the larger the value of w(x_<k), the more likely the context x_<k is to come from p_θ, which is modeled by the product of the autoregressive model and the EBM. During training, input sequences whose contexts are more likely to come from p_θ than from q_θ are assigned larger weights w, while the others are assigned smaller weights.
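A minibatch estimate of the weights in Eq. 9 can be sketched as follows; the batch shape and logit values are hypothetical. Since −φ equals the output logit, the numerator Σ_{x_k} e^{−φ} is a logsumexp over the vocabulary, and the expectation in the denominator is replaced by a batch average over contexts sampled from q_θ (i.e., self-normalized importance sampling):

```python
import numpy as np

def importance_weights(logits):           # logits: (batch, vocab) at step k
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))  # log Σ e^{-φ}
    u = np.exp(lse - lse.max())           # stabilize before exponentiating
    return u / u.mean()                   # batch average estimates E_q denominator

w = importance_weights(np.array([[1.0, 0.0],
                                 [0.0, 0.0],
                                 [2.0, 1.0]]))
assert np.isclose(w.mean(), 1.0)          # weights average to one by construction
assert w[2] == w.max()                    # largest logsumexp gets largest weight
```

Normalizing by the batch mean makes the estimator self-normalizing, so the unknown constant Z_θ never has to be computed.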

4.3. FINAL OPTIMIZATION OF E-FORCING

Algorithm 1 Optimizing ARGMs with E-Forcing
Given: a training dataset E ∼ p_d, a randomly initialized autoregressive model q_θ, and the generation length K ∈ N
for iteration i = 1; i ≤ max iterations; i ← i + 1 do
    Sample minibatch B = {(c_i, s_i)}_{i=1}^{n} ∼ E    ▷ s_i is of length K, c_i is the context of s_i
    if i ≤ N then    ▷ After N iterations, we start applying E-Forcing
        ∇_θ L_total ← Σ_{k=1}^{K} ∇_θ L_AR^{(k)}(B)
    else
        Autoregressively generate |B| samples from q_θ conditioned on c_i, denoted as B′
        ∇_θ L_total ← Σ_{k=1}^{K} ∇_θ L_AR^{(k)}(B) + λ_k ∇_θ L_EBM-CD^{(k)}(B, B′)
    end if
    Update θ ← θ − η ∇_θ L_total    ▷ η denotes the learning rate
end for

Finally, with the above estimates of the two phases of Eq. 6, we can optimize the product EBM p_θ by descending the estimated gradient of the contrastive divergence loss for any time step k: ∇_θ L_EBM-CD^{(k)}(θ) = G_+^{(k)}(θ) − G_−^{(k)}(θ). (10) Eq. 10 can easily be estimated using "positive" samples from the given training dataset and autoregressively generated "negative" samples from q_θ. Nevertheless, training the model from scratch with the energy-based learning objective alone does not work well in practice. The reason is simple: at the initial stage of training, we only have a randomly initialized network that can barely generate anything meaningful, which implies disjoint supports between the real distribution p_d and the modeled distribution p_θ; importance sampling fails in this case. Hence, to make the optimization feasible, we pre-train the entire model with an autoregressive loss L_AR by teacher forcing for a few epochs before introducing the energy-based learning objective. In sum, the final gradient with respect to θ at each update iteration is ∇_θ L_total(θ) = Σ_{k=1}^{K} ∇_θ L_AR^{(k)}(θ) + λ_k ∇_θ L_EBM-CD^{(k)}(θ), where λ_k adjusts the ratio between the two objectives at time step k. The complete optimization procedure is shown in Algorithm 1.
We found that an exponentially decaying configuration of the coefficients λ_k over the time steps works well. One possible reason is that such coefficients remedy the imbalance, across time steps, of the training signal from the negative phase gradients in Eq. 8.
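The update schedule of Algorithm 1 can be sketched as follows. Here `ar_grads` and `cd_grads` are hypothetical per-time-step gradient estimates standing in for ∇_θ L_AR^{(k)} and ∇_θ L_EBM-CD^{(k)}, and the decay rate 0.5 for λ_k is an assumed value chosen only to illustrate the exponentially descending schedule:

```python
import numpy as np

K = 8
lambdas = np.array([0.5 ** k for k in range(K)])     # exponentially decaying λ_k

def total_grad(ar_grads, cd_grads, apply_eforcing):
    # ar_grads, cd_grads: arrays of shape (K, dim)
    g = ar_grads.sum(axis=0)                         # Σ_k ∇ L_AR^(k)
    if apply_eforcing:                               # after the warm-up phase
        g = g + (lambdas[:, None] * cd_grads).sum(axis=0)
    return g
```

During the first N warm-up iterations only the autoregressive term is applied; afterwards the λ-weighted CD term is added on top.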

5. EXPERIMENTS

To empirically corroborate the effectiveness of E-Forcing and show its broad applicability, we conduct extensive experiments on applications such as language modeling and machine translation. In this section, we first introduce the experimental setups and analyze the obtained results. Besides, we also carry out a series of experiments to further show E-Forcing's ability to resolve ARGMs' inherent flaws (i.e., exposure bias and incoherent generation). More experimental settings as well as analytical experiments are given in Appendices A and C.

5.1. APPLICATION TO LANGUAGE MODELING

For the language modeling task, three different datasets, WikiText-103 (Merity et al., 2017), Toronto Book Corpus (Zhu et al., 2015; Kiros et al., 2015), and CC-News (Mackenzie et al., 2020), are chosen as testbeds; two autoregressive network structures are used to evaluate the effectiveness: the vanilla Transformer (Vaswani et al., 2017) ("Tr-Base" for short) and Transformer-XL (Dai et al., 2019) ("Tr-XL" for short). We regard vanilla training with the teacher-forcing technique as the baseline method. Besides, we also compare E-Forcing with the residual EBM method (Deng et al., 2020), a typical method for improving the performance of language models by utilizing EBMs. In order to benefit from the EBM, the residual EBM method requires a new network to estimate the energy scores and imposes a Top-K energy resampling scheme during inference. The final results are reported in Table 1.

5.2. APPLICATION TO NEURAL MACHINE TRANSLATION

We further evaluate E-Forcing's effectiveness on neural machine translation (NMT), which can be regarded as a conditional generation task. We mainly analyze E-Forcing on the IWSLT14 dataset, which includes six different language pairs ({German, Spanish, Italian} → English and English → {German, Spanish, Italian}); hereafter we abbreviate English, German, Spanish, and Italian as "EN", "DE", "ES", and "IT". In addition, we also report the result of E-Forcing on the WMT16 (English → German) benchmark, a relatively larger dataset, in Table 3.

Table 2: Comparison of BLEU scores between our approach E-Forcing and the base ARGM trained just with cross-entropy loss on six translation pairs of the IWSLT14 dataset. We use "-" to denote that the training trick is not used, while "✔" indicates we use it. "5 B" means we use beam search with 5 beams.

5.3. EFFECT ON THE EXPOSURE BIAS

We follow the analytic experiments in Zhang et al. (2019) to show that E-Forcing is capable of alleviating the exposure bias problem. Concretely, for each translation pair we randomly select 1K pairs from the training data, use the autoregressive model trained with E-Forcing to decode the source sentences, and then count the ground-truth words whose probabilities under the predicted distributions produced by E-Forcing are greater than those produced by the baseline, denoting this number by N. We then compute the ratio of N to the total number of words tested. The detailed results are shown in Table 4. We find that the results on all tasks are greater than 50%, which demonstrates the ability of E-Forcing to alleviate the exposure bias problem to some extent. We also attempt to quantitatively validate that E-Forcing can benefit ARGMs by improving the coherence of generation. Table 5 shows the BLEU scores of generated translations on the IWSLT14 test set with respect to different lengths of the source sentences. Intuitively, due to the cumulative effect of greedy selection at each time step, samples with longer sentences ought to be more plagued by the incoherent generation problem. Our approach outperforms vanilla training in all three length intervals, especially in the long-sentence interval [50, ∞], indicating that E-Forcing can mitigate the incoherence problem to a degree.
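The exposure-bias probe described above reduces to a simple win-ratio computation; the probability values below are made up for illustration:

```python
# Given the probabilities two models assign to the same ground-truth tokens,
# report the fraction of tokens where the E-Forcing model assigns the higher
# probability (a ratio above 50% favors E-Forcing).

def win_ratio(p_eforcing, p_baseline):
    wins = sum(pe > pb for pe, pb in zip(p_eforcing, p_baseline))
    return wins / len(p_baseline)

ratio = win_ratio([0.9, 0.4, 0.7, 0.6], [0.8, 0.5, 0.6, 0.2])
assert ratio == 0.75
```

In the actual experiment the two probability lists would be gathered over all ground-truth tokens of the 1K decoded sentence pairs.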

6. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a novel training method dubbed E-Forcing for ARGMs by treating them as EBMs. This is achieved by defining the energy function using the softmax operation's extra degree of freedom within an autoregressive network. We further design a unique way to improve the training of E-Forcing using importance sampling. Experimental results demonstrate the effectiveness of E-Forcing in alleviating the exposure bias and incoherence problems of ARGMs. In the future, we expect to extend E-Forcing to other sequential generation tasks (e.g., text summarization, audio generation) and to incorporate the proposed methodology into other advanced autoregressive architectures.

A EXPERIMENTAL SETTINGS

In this section, we introduce the detailed setups of the different benchmarks as well as information on the corresponding datasets. We uniformly use the Adam optimizer. Training is stopped once the model has not improved on the validation set for 20 epochs. For translation tasks, the length of the generated fake sentences, used for computing the negative phase in Eq. 10, depends on the source sequence, whilst for language modeling tasks we fix the length of generated fake sentences to 50 during training. The model structures for the language modeling and machine translation tasks are shown in Table 6. As for the model structures of the image generation task, we use the official structures reported for PixelCNN (van den Oord et al., 2016c) and Gated PixelCNN (van den Oord et al., 2016b) without modification. The source code will be released upon acceptance. We use the same batch of autoregressively generated samples to approximate both the expectations in Eq. 10 and the weight w (i.e., they are shared), so we do not need to sample twice. The number of samples in a batch is dynamic, while the maximum number of total tokens in a batch is fixed (4096 in our experiments); if the sequences in a batch have length 32, the batch includes 4096 / 32 = 128 samples in total. This is a common strategy in language generation tasks and has been used in many frameworks (e.g., Fairseq (Ott et al., 2019)). At each update iteration, we generate as many samples autoregressively as there are sequences in the current batch.
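The token-budget batching arithmetic above can be sketched in one helper (the function name is ours, not Fairseq's):

```python
# Dynamic batching by a fixed token budget: the number of sequences in a
# batch is the budget divided by the (padded) sequence length.

def sequences_per_batch(max_tokens, seq_len):
    return max_tokens // seq_len

assert sequences_per_batch(4096, 32) == 128   # the example from the text
```
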

B DERIVATION OF THE NEGATIVE PHASE GRADIENT

In this section, we show the detailed derivation of Eq. 8. Formally, as shown in Sec. 3, given an autoregressive model q_θ(x_<k) = Π_{l=1}^{k−1} q_θ(x_l | x_<l) (k denotes the time step) with parameters θ, we define a product of the autoregressive model and an EBM as p_θ(x_k, x_<k) = q_θ(x_<k) · e^{−φ_θ(x_k, x_<k)} / Z_θ, where Z_θ is the normalization term, equal to E_{x'_<k ∼ q_θ}[Σ_{x_k} e^{−φ_θ(x_k, x'_<k)}]. The optimization of p_θ(x_k, x_<k) includes two phases, and the "negative phase" gradient with respect to θ is −E_{x_<k ∼ p_θ}[∇_θ log q_θ(x_<k)] + E_{x_k, x_<k ∼ p_θ}[∇_θ φ_θ(x_k, x_<k)]. (13) Next, we show the specific derivation of how to transform Eq. 13 into Eq. 8.

B.1 DERIVATION OF THE FIRST TERM

The first term E_{x_<k ∼ p_θ}[∇_θ log q_θ(x_<k)] can be processed as follows:

E_{x_<k ∼ p_θ}[∇_θ log q_θ(x_<k)]
= Σ_{x_<k} p_θ(x_<k) ∇_θ log q_θ(x_<k)
= Σ_{x_<k} Σ_{x_k} p_θ(x_k, x_<k) ∇_θ log q_θ(x_<k)
= Σ_{x_<k} q_θ(x_<k) (Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / Z_θ) ∇_θ log q_θ(x_<k)
= E_{x_<k ∼ q_θ(x_<k)}[w(x_<k) ∇_θ log q_θ(x_<k)], (14)

where w(x_<k) = Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / E_{x'_<k ∼ q_θ(x_<k)}[Σ_{x_k} e^{−φ_θ(x_k, x'_<k)}], because

w(x_<k) = Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / Z_θ
= Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / (Σ_{x_<k} Σ_{x_k} q_θ(x_<k) e^{−φ_θ(x_k, x_<k)})
= Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / (Σ_{x_<k} q_θ(x_<k) Σ_{x_k} e^{−φ_θ(x_k, x_<k)})
= Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / E_{x_<k ∼ q_θ(x_<k)}[Σ_{x_k} e^{−φ_θ(x_k, x_<k)}]. (15)

B.2 DERIVATION OF THE SECOND TERM

The second term E_{x_k, x_<k ∼ p_θ}[∇_θ φ_θ(x_k, x_<k)] can be processed as follows:

E_{x_k, x_<k ∼ p_θ}[∇_θ φ_θ(x_k, x_<k)]
= Σ_{x_<k} Σ_{x_k} q_θ(x_<k) e^{−φ_θ(x_k, x_<k)} · (1 / Z_θ) ∇_θ φ_θ(x_k, x_<k)
= E_{q_θ(x_<k)}[Σ_{x_k} (e^{−φ_θ(x_k, x_<k)} / Z_θ) ∇_θ φ_θ(x_k, x_<k)]
= E_{q_θ(x_<k)}[Σ_{x_k} (e^{−φ_θ(x_k, x_<k)} / Σ_{x_k} e^{−φ_θ(x_k, x_<k)}) · (Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / Z_θ) ∇_θ φ_θ(x_k, x_<k)]
= E_{q_θ(x_<k)}[Σ_{x_k} q_θ(x_k | x_<k) w(x_<k) ∇_θ φ_θ(x_k, x_<k)]
= E_{q_θ(x_<k)}[E_{x_k ∼ q_θ(x_k | x_<k)}[w(x_<k) ∇_θ φ_θ(x_k, x_<k)]]
= E_{x_k, x_<k ∼ q_θ(x_k, x_<k)}[w(x_<k) ∇_θ φ_θ(x_k, x_<k)], (16)

where w(x_<k) is again equal to Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / Z_θ. Combining Eq. 14 and Eq. 16, we obtain an equivalent form of the negative phase gradient without any expectation over p_θ:

−E_{x_<k ∼ q_θ(x_<k)}[w(x_<k) ∇_θ log q_θ(x_<k)] + E_{x_k, x_<k ∼ q_θ(x_k, x_<k)}[w(x_<k) ∇_θ φ_θ(x_k, x_<k)], (17)

where w(x_<k) = Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / E_{x'_<k ∼ q_θ(x_<k)}[Σ_{x_k} e^{−φ_θ(x_k, x'_<k)}].
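The change of expectation derived above can be verified numerically on a toy instance of the product model; all numbers below are made up for the check:

```python
import numpy as np

# For p_θ(x_k, x_<k) = q_θ(x_<k) e^{-φ} / Z_θ, an expectation over the
# marginal p_θ(x_<k) equals a w-reweighted expectation over q_θ, with w as
# in Eq. 15.

q = np.array([0.2, 0.5, 0.3])                    # q_θ over three toy contexts
neg_phi = np.array([[1.0, 0.3],
                    [0.2, -0.1],
                    [0.0, 0.5]])                 # logits = -φ, vocab size 2
f = np.array([2.0, -1.0, 0.5])                   # arbitrary test function f(x_<k)

s = np.exp(neg_phi).sum(axis=1)                  # Σ_{x_k} e^{-φ(x_k, x_<k)}
Z = (q * s).sum()                                # Z_θ = E_q[Σ e^{-φ}]
p_ctx = q * s / Z                                # marginal p_θ(x_<k)
w = s / Z                                        # reweighing weight, Eq. 15

assert np.isclose((p_ctx * f).sum(), (q * w * f).sum())   # E_p[f] = E_q[w f]
assert np.isclose(p_ctx.sum(), 1.0)
```

The same identity applied to f = ∇_θ log q_θ and f = ∇_θ φ_θ gives exactly Eqs. 14 and 16.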

B.3 FURTHER REFINEMENT OF w

The reweighing weight w can be further deduced as

w(x_<k) = Σ_{x_k} e^{−φ_θ(x_k, x_<k)} / E_{x'_<k ∼ q_θ(x_<k)}[Σ_{x_k} e^{−φ_θ(x_k, x'_<k)}]
= (Σ_{x_k} p_θ(x_k, x_<k) / q_θ(x_<k)) / E_{x'_<k ∼ q_θ(x_<k)}[Σ_{x_k} p_θ(x_k, x'_<k) / q_θ(x'_<k)]
= (p_θ(x_<k) / q_θ(x_<k)) / E_{x'_<k ∼ q_θ(x_<k)}[p_θ(x'_<k) / q_θ(x'_<k)]
= µ(x_<k) / E_{x'_<k}[µ(x'_<k)],

where µ(x_<k) is defined as p_θ(x_<k) / q_θ(x_<k). (In the second equality, e^{−φ_θ} = Z_θ p_θ(x_k, x_<k) / q_θ(x_<k), and the factor Z_θ cancels between numerator and denominator.)

Table 7: The effect of Top-K correction in the inference stage. We test BLEU scores for different K on different translation pairs of the IWSLT14 dataset.

C MORE EXPERIMENTAL ANALYSIS

C.1 ABLATION STUDY ON TOP-K ENERGY RE-SAMPLING

During inference, the model first generates multiple candidates autoregressively and then re-samples from them depending on their energy scores estimated by the network. To measure the contribution of Top-K energy re-sampling to our method, we conduct an ablation study on the neural machine translation task with K = {0, 5, 10}. The results are shown in Table 7; performance is evaluated by the BLEU score. From Table 7, we observe that the benefit brought by Top-K sampling (K = {5, 10}) is minor compared with the model without it (K = 0). Together with the results shown in Table 1, this shows that E-Forcing can considerably benefit the base autoregressive model even without the energy re-sampling technique.
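The re-sampling step itself is straightforward: draw K candidate sequences from the AR model, score each with the energy network, and re-sample one with probability proportional to $e^{-E}$. A minimal sketch follows; candidate generation and energy scoring are stubbed out as inputs, and the function name is ours, not the paper's:

```python
import math
import random

def topk_energy_resample(candidates, energies, rng=random):
    """Pick one of K candidates with probability proportional to exp(-energy).

    candidates: list of K decoded sequences (any hashable/printable objects).
    energies:   list of K energy scores; lower energy = more likely pick.
    """
    m = min(energies)
    probs = [math.exp(-(e - m)) for e in energies]   # shift by min for stability
    total = sum(probs)
    r = rng.random() * total
    acc = 0.0
    for cand, p in zip(candidates, probs):
        acc += p
        if r <= acc:
            return cand
    return candidates[-1]                            # guard against rounding
```

Setting K = 0 in the ablation corresponds to skipping this step entirely and keeping the AR model's own output.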

C.2 APPLICATION TO IMAGE GENERATION

To illustrate the effectiveness and generality of our method on tasks of a different modality, we further apply E-Forcing to image generation in this section. We apply E-Forcing to Pixel-CNN (Van Oord et al., 2016) and its variant Gated Pixel-CNN (Oord et al., 2016). Experiments are carried out on the MNIST and CIFAR-10 datasets.

Table 8: Performance of E-Forcing with different base networks on MNIST and CIFAR-10 in bits/dim (lower is better); training performance in brackets.

Model                            Test (Train) NLL ↓
                                 MNIST         CIFAR-10
Pixel-CNN                        0.17 (0.13)   3.14 (3.08)
Pixel-CNN (w/ E-Forcing)         0.15 (0.12)   3.07 (2.98)
Gated Pixel-CNN                  0.14 (0.11)   3.03 (2.90)
Gated Pixel-CNN (w/ E-Forcing)   0.12 (0.10)   2.97 (2.87)

Table 8 summarizes the quantitative results measured by per-pixel negative log-likelihood (NLL). With the help of E-Forcing, both Pixel-CNN and Gated Pixel-CNN obtain improvements on both datasets (0.17 → 0.15 on MNIST and 3.14 → 3.07 on CIFAR-10 for Pixel-CNN; 0.14 → 0.12 and 3.03 → 2.97 for Gated Pixel-CNN). This is further evidence in favor of the energy-based learning objective for improving autoregressive models.

C.3 EFFECT OF THE START EPOCH OF E-FORCING

In addition, we study how the epoch at which E-Forcing training starts affects language-modeling performance; the results are shown in Table 9. We find that starting E-Forcing at the 15th and 10th epoch yields the best results for Transformer-Base and Transformer-XL respectively, whereas starting earlier or later degrades performance. This is reasonable: if E-Forcing is introduced too early, the autoregressive model is not yet well optimized, so the generations used for the "negative phase" are of poor quality, making energy-based training unstable. On the other hand, if E-Forcing is introduced when ARGM training is virtually complete, the underlying autoregressive model can only be modified marginally.
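The NLL numbers in Table 8 are reported in bits per dimension; converting a cross-entropy measured in nats (the natural output of a log-likelihood loss) into bits/dim is a simple change of logarithm base. A small illustrative sketch (helper names are ours):

```python
import math

def nats_to_bits_per_dim(nll_nats):
    """Convert an average per-pixel NLL from nats to bits per dimension."""
    return nll_nats / math.log(2)

def bits_per_dim(total_nll_nats, num_dims):
    """Total NLL in nats over an image, normalized to bits/dim.

    num_dims: number of predicted dimensions, e.g. H * W * C for an image.
    """
    return total_nll_nats / (num_dims * math.log(2))
```

For example, a per-pixel loss of ln(2) ≈ 0.693 nats corresponds to exactly 1 bit/dim.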

C.4 ANALYSIS OF THE MODEL'S CONVERGENCE

In this section, we investigate the convergence of E-Forcing. We first train a base Transformer model (the "Tr-Base" architecture shown in Table 6) on the IWSLT14 Spanish-to-English training set for both the baseline and the E-Forcing method, and record the training loss and test loss (in cross-entropy) at the end of each epoch. The loss curves are plotted in Figure 1. From Figure 1, we can see that (1) at the start of training, E-Forcing converges slightly faster than the baseline; and (2) as training progresses, the training cross-entropy of the baseline decreases faster than that of E-Forcing, while the test loss of the baseline falls initially and then slowly rises after 50 epochs, whereas E-Forcing maintains stable convergence. This phenomenon shows that E-Forcing can effectively prevent over-fitting and thus obtain better generalization.

C.5 GENERALIZATION OVER DIFFERENT AR ARCHITECTURES

In this section, we conduct an ablation study to investigate the generalization ability of E-Forcing over different sequential models. We test six different sequential models: GRU (Chung et al., 2014), LSTM (Hochreiter & Schmidhuber, 1997), ENAS (Pham et al., 2018), TrellisNet (TNET for short) (Bai et al., 2019b), DEQ (Bai et al., 2019a), and Transformer-XL (Dai et al., 2019), on the Penn Treebank (Marcus et al., 1993) dataset, a relatively small dataset widely used in NLP (Natural Language Processing) research. In general, we observe that E-Forcing achieves improvements over all base AR models applied, which indicates that it is a universally applicable training method for AR models.

C.6 CASE STUDIES

To better understand the advantages of our method in correcting erroneous tokens, we present several translation cases from the IWSLT14 German → English test set, as shown in Table 11.

Source (German): wenn ich ihnen 600 zeitschriften zeige und sie in 10 kategorien aufteile oder ich ihnen 400 zeitschriften zeige, und diese in 20 kategorien aufteile, dann glauben sie, dass ich ihnen mehr auswahl und eine bessere auswahlerfahrung gegeben habe, als ich ihnen die 400 gegeben hätte gegenüber dem, wenn ich ihnen die 600 gegeben hätte.

GroundTruth: and i definitely know that, in my case -- in my situation -- it would be very dangerous for me to start sort of leaking down that dark path of assumption, particularly given the circumstance that i'm in right now in my career.
Baseline: and i know definitely, for me, it would be very dangerous to begin to do this dark path of suspect -- especially in the circumstance that i'm in my career right now.
Baseline + S.S.: and i know definitely it would be -- in my situation -- very dangerous to start, to kind of settle down this dark path of presumption -- especially in the circumstance in which i'm in my career right now.
Ours: and i definitely know that it's for me -- in my situation -- very dangerous to start to sickle down this dark path of suspection, in particular, in the circumstance of where i'm in my career right now.

Source (German): wir haben das licht ausgeschaltet, legten es in ein vakuum und saugten die ganze luft aus und kühlten es bis fast zum jetzt, ganz alleine im aufzug, war das stück metall frei, sich zu verhalten wie immer es wollte.

GroundTruth: we turned off the lights, and then we put it in a vacuum and sucked out all the air, and then we cooled it down now, all alone in the elevator, the little chunk of metal is free to act however it wanted.
Baseline: we turned the light off, put it in a vacuum and sucked it out all the air and cooled it up until almost now, all the way alone, the piece of metal was open to behave as it was.
Baseline + S.S.: we turned the lights off, we put it into a vacuum, and we sucked all the air, and we cooled it all the way up to now, all over the place, the piece of metal was free to behave whatever it wanted.
Ours: we turned off the lights, we put it into a vacuum and we sucked all the air out, and we cooled it up until almost now, all alone in the elevator, the piece of metal was free to behave whatever it wanted.

Source (German): und im grunde können sie das betrachten, wissen sie, als eine tyrannei des erinnernden selbst, und sie können sich das erinnernde selbst denken als eins, das sozusagen das erlebende selbst schleppt durch erfahrungen, die das erlebende selbst nicht braucht.

GroundTruth: and basically you can look at this, you know, as a tyranny of the remembering self, and you can think of the remembering self sort of dragging the experiencing self through experiences that the experiencing self doesn't need.
Baseline: and basically, you can think of this, you know, as a tyranny of self, and you can think of the memorable self as one that kind of weaves the living self through experiences that don't need the life itself.
Baseline + S.S.: and basically, you can look at this, you know, as a tyrannei of memorial self, and you can think of the memorial self as one that kind of sucks the living self through experiences that don't need the living self.
Ours: and basically, you can look at that, you know, as a tyranny of the remembering self, and you can think of the memory itself as one, which is sort of dragging the living self through experiences that the living self doesn't need.

Source (German): wir sind an der schwelle zu erstaunlichen, erstaunlichen ereignissen auf vielen gebieten. und doch denke ich wirklich, dass wir hunderte, 300 jahre vor die aufklärung zurück gehen müssten, um eine zeit zu finden, in der wir fortschritt bekämpft haben, in der wir über diese dinge heftiger getritten haben, an mehr fronten als jetzt.

GroundTruth: we're on the verge of amazing, amazing events in many fields, and yet i actually think we'd have to go back hundreds, 300 years, before the enlightenment, to find a time when we battled progress, when we fought about these things more vigorously, on more fronts, than we do now.
Baseline: we are at the threshold of amazing, amazing events in many areas, and yet i really think that we have to go back hundreds and 300 years before the enlightenment to find a time when we have fought progress in which we have driven more of these things than now.
Baseline + S.S.: we're at the threshold of amazing, amazing events in many areas. and yet, i really think that we have to go back hundreds and hundreds of years before the enlightenment to find a time when we have struggled with progress in which we have driven on these things more powerful, more fronts than now.
Ours: we're at the threshold to amazing, amazing events in many areas, and yet i really think that we have to go back hundreds and 300 years before the enlightenment to find a time when we fought progress, where we've been fighting about these things to more fronts than we have now.

To further evaluate the effectiveness of the proposed E-Forcing, we also evaluate our method using other metrics for neural machine translation, such as ROUGE (Lin, 2004) and METEOR (Banerjee & Lavie, 2005). The results are shown in Table 12.
In Table 12, the improvements of E-Forcing on the different metrics are consistent with the conclusions of Table 2, which further proves the effectiveness of our E-Forcing method.
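For reference, ROUGE-1 is essentially a unigram precision/recall F-measure between a hypothesis and a reference. A stdlib-only sketch of the F1 variant is shown below; this is an illustration of the metric's definition, not the official ROUGE scoring script used for Table 12:

```python
from collections import Counter

def rouge1_f(hyp_tokens, ref_tokens):
    """Unigram F1 between hypothesis and reference token lists.

    Overlap counts are clipped by the reference counts (via Counter
    intersection), matching the standard ROUGE-1 treatment of repeats.
    """
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

An identical hypothesis and reference scores 1.0, while disjoint token sets score 0.0.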



foot_0: When trained on massive datasets under which the underlying distribution is diverse enough, such as in large language models, this problem can be relieved, because the training data covers many corner cases, making it much harder for the model to go off-distribution.

Footnote: Without constraining the parametrization of E_θ, this can be achieved by bounding the region of space in which x takes its allowed values.

Footnote: We take K to be the segment length of the transformer.

Footnote: It is worth noting that Top-K energy re-sampling cannot yield the PPL directly. Bakhtin et al. (2021) provide a way to approximate PPL, which leads to an estimated interval of PPL.



Figure 1: (a) Cross entropy loss curves on IWSLT14 Spanish to English translation task on training set. The blue and orange colors represent base model and E-Forcing respectively; (b) Cross entropy loss curves on IWSLT14 Spanish → English translation task on test set.

We can see from the results that E-Forcing outperforms the two pure autoregressive models by clear margins on all three benchmarks. Specifically, on the Wikitext-103 benchmark, E-Forcing improves the performance of the Transformer-Base and Transformer-XL models by 0.62 PPL points (from 30.56 to 29.94) and 0.30 PPL points (from 24.20 to 23.90) respectively; on the CC-news and Toronto Book Corpus benchmarks, our method obtains 0.51 and 0.47 PPL gains respectively, and achieves further improvement once the energy re-sampling technique is applied. Besides, although the residual EBM has twice as many learnable parameters as ours and cannot directly benefit autoregressive models without Top-K energy re-sampling, our E-Forcing achieves comparable results, and even slightly better ones on the Toronto Book Corpus and Wikitext-103 benchmarks.
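Perplexity relates to the average token-level cross-entropy by simple exponentiation, so the reported PPL gains can be read as small cross-entropy reductions. A quick illustrative computation (using the Wikitext-103 numbers quoted above):

```python
import math

def ppl(cross_entropy_nats):
    """Perplexity from average per-token cross-entropy in nats."""
    return math.exp(cross_entropy_nats)

def cross_entropy(ppl_value):
    """Inverse mapping: average per-token cross-entropy in nats from PPL."""
    return math.log(ppl_value)

# The reported Transformer-Base improvement, 30.56 -> 29.94 PPL,
# corresponds to a cross-entropy drop of roughly 0.02 nats per token.
delta = cross_entropy(30.56) - cross_entropy(29.94)
```

This is why seemingly small PPL differences on large benchmarks are still meaningful: they compound multiplicatively over every token of the test set.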

Language modeling performance of different models on three benchmarks. Evaluation is conducted using perplexity (PPL). E.R. is the abbreviation of the Energy Resampling technique (Bakhtin et al., 2021), which serves as a necessary module of the Residual EBM.




Translation performance of the proposed E-Forcing.

The effect of E-Forcing on the exposure bias problem. Each test set of the translation tasks contains 1K randomly selected sentences. N denotes the number of ground-truth words whose probabilities in the predicted distributions produced by E-Forcing are greater than those produced by the baseline.

Performance comparison on the IWSLT14 test set for different sentence lengths on three translation tasks (German to English, Italian to English, and Spanish to English). Performance is evaluated by the BLEU score.

3. CC-news is a de-duplicated subset of the English portion of the CommonCrawl news dataset, totaling around 16 billion words.
4. IWSLT14 contains about 170k training sentence pairs, 7k validation pairs, and 7k test pairs. It covers several languages, and any two of them can form a translation pair.
5. WikiText-103 contains 103M training tokens from 28K articles, with an average length of 3.6K tokens per article, which allows testing the ability of long-term dependency modeling.
6. MNIST is a large collection of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples.
7. CIFAR-10 is a subset of the Tiny Images dataset and consists of 60,000 32x32 color images, labeled with one of 10 mutually exclusive classes.

Table 6: Hyperparameters of different model structures and datasets. "Tr-Base", "Tr-Large", and "Tr-XL" indicate Transformer-Base, Transformer-Large, and Transformer-XL respectively.

Trans. Pairs DE→ EN EN→ DE EN→ IT IT→ EN ES→ EN EN→ ES

The ablation study of E-Forcing over different choices of AR model architecture, compared with vanilla teacher-forcing training. We test PPL scores of different AR models on the Penn Treebank dataset.

GroundTruth: if i show you 600 magazines and i divide them up into 10 categories, versus i show you 400 magazines and divide them up into 20 categories, you believe that i have given you more choice and a better choosing experience if i gave you the 400 than if i gave you the 600.
Baseline: if i show you 600 magazines and i split them in 10 categories, or i'm showing them 400 magazines, and i'm going to split them up into 20 categories, you think i've given them more choices and better choice than i would have given them the 400 over the time that i gave them the 600.
Baseline + S.S.: if i show you 600 magazines and i give you 400 magazines in 10 categories, and i give you 400 magazines, and i can split them up in 20 categories, then you think i've given you more choice and a better selection than i would have given you the 400 of which if i gave you the 600.
Ours: if i show you 600 magazines and i divide them into 10 categories, or i show you 400 magazines, and i divide them into 20 categories, you think i've given you more choices and better selection experience than i gave you the 400 of whom if i gave you the 600.

Source (German): und ich weiß definitiv, dass es für mich -- in meiner situation -- sehr gefährlich wäre, anzufangen, diesen dunklen pfad der vermutung sozusagen herunterzusickern -- besonders in dem umstand, in dem ich mich in meiner karriere gerade befinde.

Table 11: Translation cases on the IWSLT14 De→En test set, generated by the baseline method, the baseline with scheduled sampling, and our E-Forcing. Italic font marks mismatched translations.

C.7 EVALUATION WITH OTHER METRICS

Table 12: Comparison of ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BLEU scores between our approach E-Forcing and the base ARGM trained only with the cross-entropy loss, on three translation pairs of the IWSLT14 dataset. Values are expressed in percentages. We use "Tr-Base" as the network architecture.

