NO MCMC FOR ME: AMORTIZED SAMPLING FOR FAST AND STABLE TRAINING OF ENERGY-BASED MODELS

Abstract

Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains challenging, as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and more stable training. This allows us to extend JEM to semi-supervised classification on tabular data from a variety of continuous domains.

1. INTRODUCTION

Energy-Based Models (EBMs) have recently regained popularity within machine learning, partly inspired by the impressive results of Du & Mordatch (2019) and Song & Ermon (2020) on large-scale image generation. Beyond image generation, EBMs have also been successfully applied to a wide variety of applications including: out-of-distribution detection (Grathwohl et al., 2019; Du & Mordatch, 2019; Song & Ou, 2018), adversarial robustness (Grathwohl et al., 2019; Hill et al., 2020; Du & Mordatch, 2019), reliable classification (Grathwohl et al., 2019; Liu & Abbeel, 2020), and semi-supervised learning (Song & Ou, 2018; Zhao et al.). Strikingly, these EBM approaches outperform alternative classes of generative models and rival hand-tailored solutions on each task. Despite this progress, training EBMs is still a challenging task. As shown in Table 1, existing training methods are all deficient in at least one important practical aspect. Markov chain Monte Carlo (MCMC) methods are slow and unstable during training (Nijkamp et al., 2019a; Grathwohl et al., 2020). Score matching mechanisms, which minimize alternative divergences, are also unstable, and most methods cannot work with discontinuous nonlinearities such as ReLU (Song & Ermon, 2019b; Hyvärinen, 2005; Song et al., 2020; Pang et al., 2020b; Grathwohl et al., 2020; Vincent, 2011). Noise contrastive approaches, which learn energy functions through density ratio estimation, typically don't scale well to high-dimensional data (Gao et al., 2020; Rhodes et al., 2020; Gutmann & Hyvärinen, 2010; Ceylan & Gutmann, 2018). Trade-offs must be made when training unnormalized models, and no approach to date satisfies all of these properties.

Figure 1: Comparison of EBMs trained with VERA and PCD. We see that as the entropy regularization weight goes to 1, the density becomes more accurate. For PCD, all samplers produce high-quality samples but low-quality density models, as the distribution of MCMC samples may be arbitrarily far from the model density.

In this work, we present a simple method for training EBMs which performs as well as previous methods while being faster and substantially easier to tune. Our method is based on reinterpreting maximum likelihood as a bi-level variational optimization problem, which has been explored in the past for EBM training (Dai et al., 2019). This perspective allows us to amortize away MCMC sampling into a GAN-style generator which is encouraged to have high entropy. We accomplish this with a novel approach to entropy regularization based on a fast variational approximation. This leads to the method we call Variational Entropy Regularized Approximate maximum likelihood (VERA). Concretely, we make the following contributions:

• We improve the MCMC-based entropy regularizer of Dieng et al. (2019) with a parallelizable variational approximation.
• We show that an entropy-regularized generator can be used to produce a variational bound on the EBM likelihood which can be optimized more easily than MCMC-based estimators.
• We demonstrate that models trained in this way achieve much higher likelihoods than models trained with alternative EBM training procedures.
• We show that our approach stabilizes and accelerates the training of recently proposed Joint Energy Models (Grathwohl et al., 2019).
• We show that the stabilization of our approach allows us to use JEM for semi-supervised learning, outperforming virtual adversarial training when little prior domain knowledge is available (e.g., for tabular data).

2. ENERGY BASED MODELS

An energy-based model (EBM) is any model which parameterizes a density as p_θ(x) = e^{f_θ(x)} / Z(θ), where f_θ : R^D → R and Z(θ) = ∫ e^{f_θ(x)} dx is the normalizing constant, which is not explicitly modeled. Any probability distribution can be represented in this way for some f_θ. The energy-based parameterization has been used widely for its flexibility, ease of incorporating known structure, and relationship to physical systems common in chemistry, biology, and physics (Ingraham et al., 2019; Du et al., 2020; Noé et al., 2019). The above properties make EBMs an appealing model class, but because they are unnormalized, many tasks which are simple for alternative model classes become challenging for EBMs. For example, exact samples cannot be drawn and likelihoods cannot be exactly computed (or even lower-bounded). This makes training EBMs challenging, as we cannot simply train them to maximize likelihood. The most popular approach to train EBMs is to approximate the gradient of the maximum likelihood objective. This gradient can be written as:

∇_θ log p_θ(x) = ∇_θ f_θ(x) − E_{p_θ(x')}[∇_θ f_θ(x')].

MCMC techniques are used to approximately generate samples from p_θ(x) to estimate the second term (Tieleman, 2008). Practically, this approach suffers from poor stability and the computational cost of sequential sampling. Many tricks have been developed to overcome these issues (Du & Mordatch, 2019), but they largely persist. Alternative estimators have been proposed to circumvent these challenges, including score matching (Hyvärinen, 2005), noise contrastive estimation (Gutmann & Hyvärinen, 2010), and variants thereof. These suffer from their own challenges in scaling to high-dimensional data, and sacrifice the statistical efficiency of maximum likelihood. In Figure 1 we visualize densities learned with our approach and with Persistent Contrastive Divergence (PCD) training (Tieleman, 2008).

As we see, the sample quality of the PCD models is quite high, but the learned densities do not match the true density. This is due to accrued bias in the gradient estimator from approximate MCMC sampling (Grathwohl et al., 2020). Prior work (Nijkamp et al., 2019b) has argued that this objective actually encourages the approximate MCMC samples, rather than the density model, to match the data. Conversely, we see that our approach (with proper entropy regularization) recovers a high-quality model.
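The two-term maximum-likelihood gradient above can be sketched numerically on an exponential-family toy model where exact model samples are available (a minimal numpy sketch with illustrative names, not the paper's code); in real EBM training, the second expectation is exactly what MCMC must approximate.

```python
import numpy as np

def ml_grad_estimate(data, model_samples):
    """Estimate grad_theta log p_theta(x) = E_data[grad f] - E_model[grad f].

    For f_theta(x) = theta1*x + theta2*x**2, grad_theta f is the sufficient
    statistic T(x) = (x, x**2), so each expectation is a simple average.
    """
    T = lambda x: np.stack([x, x ** 2], axis=-1)
    return T(data).mean(axis=0) - T(model_samples).mean(axis=0)

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100_000)

# theta = (1.0, -0.5) makes p_theta exactly N(1, 1): f(x) = x - x^2/2.
# We can therefore sample the model directly here; EBM training replaces
# this line with (approximate) MCMC sampling.
model_samples = rng.normal(1.0, 1.0, size=100_000)

grad = ml_grad_estimate(data, model_samples)
# At the maximum-likelihood parameters the two terms cancel, so grad is near 0.
```

When the model-sample term is biased (as with truncated MCMC chains), this cancellation fails, which is the failure mode the paper attributes to PCD.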

3. VARIATIONAL MAXIMUM LIKELIHOOD

We seek the energy function which maximizes the likelihood given in Equation 1. We examine the intractable component of the log-likelihood, the log partition function log Z(θ) = log ∫ e^{f_θ(x)} dx. We can re-write this quantity as the optimum of

log Z(θ) = max_q E_{q(x)}[f_θ(x)] + H(q)    (3)

where q is a distribution and H(q) = −E_{q(x)}[log q(x)] denotes its entropy (see Appendix A.1 for the derivation). Plugging this into our original maximum likelihood objective we obtain:

θ* = argmax_θ E_{p_data(x)}[f_θ(x)] − max_q { E_{q(x)}[f_θ(x)] + H(q) }    (4)

which gives us an alternative method for training EBMs. We introduce an auxiliary sampler q_φ which we train online to optimize the inner loop of Equation 4. This objective was used for EBM training in Kumar et al. (2019); Abbasnejad et al. (2019); Dai et al. (2017); Dai et al. (2019) (motivated by Fenchel duality (Wainwright & Jordan, 2008)). Abbasnejad et al. (2019) use an implicit generative model, and Dai et al. (2019) propose a sampler which is inspired by MCMC sampling from p_θ(x) and whose entropy can be computed exactly. Below we describe our approach, which utilizes the same objective with a simpler sampler and a new approach to encourage high entropy. We note that when training p_θ(x) and q(x) online together, the inner maximization will not be fully optimized. This leads our training objective for p_θ(x) to be an upper bound on log p_θ(x). In Section 5.1 we explore the impact of this fact and find that the bound is tight enough to train models that achieve high likelihood on high-dimensional data.
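The identity in Equation 3 can be verified in closed form on a 1-D example (our own toy setup, not from the paper): take f_θ(x) = −x²/2, so Z(θ) = √(2π), and restrict q to mean-zero Gaussians N(0, s²). The inner objective peaks at s = 1, where it equals log Z(θ).

```python
import numpy as np

def inner_objective(s):
    """E_q[f] + H(q) for f(x) = -x^2/2 and q = N(0, s^2), in closed form.

    E_q[f] = -s^2 / 2 and H(q) = 0.5 * log(2*pi*e*s^2).
    """
    return -0.5 * s ** 2 + 0.5 * np.log(2 * np.pi * np.e * s ** 2)

log_Z = 0.5 * np.log(2 * np.pi)  # Z = integral of e^{-x^2/2} dx = sqrt(2*pi)

# The maximizer is q = p_theta (here s = 1), where the objective equals log Z;
# any other q gives a strictly smaller value, which is why an under-optimized
# inner loop makes Equation 4 an upper bound on the log-likelihood.
values = {s: inner_objective(s) for s in (0.5, 1.0, 2.0)}
```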

4. METHOD

We now present a method for training an EBM p_θ(x) = e^{f_θ(x)} / Z(θ) to optimize Equation 4. We introduce a generator distribution of the form q_φ(x) = ∫ q_φ(x|z) q(z) dz such that:

q_φ(x|z) = N(g_ψ(z), σ²I),    q(z) = N(0, I)

where g_ψ is a neural network with parameters ψ and thus φ = {ψ, σ²}. This is similar to the decoder of a variational autoencoder (Kingma & Welling, 2013). With this architecture it is easy to optimize the first and second terms of Equation 4 with reparameterization, but the entropy term requires more care.
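The generator just described can be sketched in a few lines (a toy numpy sketch; the one-layer network `g_psi` and all dimensions are illustrative stand-ins, not the paper's architecture). Sampling is z₀ ~ N(0, I) followed by x = g_ψ(z₀) + σε, i.e. the reparameterization used throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_x, sigma = 8, 32, 0.1

# A stand-in network g_psi with random fixed weights; any architecture works.
W1 = rng.normal(size=(dim_z, 64))
W2 = rng.normal(size=(64, dim_x))
g_psi = lambda z: np.tanh(z @ W1) @ W2

def sample_generator(n):
    """Draw (x, z0) pairs from q_phi(x|z) q(z) = N(x; g_psi(z), sigma^2 I) N(z; 0, I)."""
    z0 = rng.normal(size=(n, dim_z))                     # z0 ~ N(0, I)
    x = g_psi(z0) + sigma * rng.normal(size=(n, dim_x))  # reparameterized sample
    return x, z0

x, z0 = sample_generator(128)
```

Keeping the z₀ that produced each x is the key design choice: Section 4.2 reuses it as the center of the variational posterior approximation.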

4.1. ENTROPY REGULARIZATION

Estimating entropy or its gradients is a challenging task. Multiple, distinct approaches have been proposed in recent years based on Mutual Information estimation (Kumar et al., 2019), variational upper bounds (Ranganath et al., 2016), and MCMC sampling (Dieng et al., 2019). Following Dieng et al. (2019), we write the entropy gradient using the reparameterization trick:

∇_φ H(q_φ) = −∇_φ E_{q_φ(x)}[log q_φ(x)]
           = −∇_φ E_{p(z)p(ε)}[log q_φ(x(z, ε))]                          (reparameterize sampling)
           = −E_{p(z)p(ε)}[∇_φ log q_φ(x(z, ε))]
           = −E_{p(z)p(ε)}[∇_x log q_φ(x(z, ε))^T ∇_φ x(z, ε)]            (chain rule)    (6)

where we have written x(z, ε) = g_ψ(z) + σε, and the direct dependence of log q_φ on φ drops out because its expectation is zero. All quantities in Equation 6 can be easily computed except for the score function ∇_x log q_φ(x). The following estimator for this quantity can be easily derived (see Appendix A.2):

∇_x log q_φ(x) = E_{q_φ(z|x)}[∇_x log q_φ(x|z)]    (7)

which requires samples from the posterior q_φ(z|x) to estimate.
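Equation 6 can be sanity-checked on a family with known entropy (our own toy example, not from the paper): for the scale family q_φ = N(0, φ²) sampled as x = φz, the entropy is H = ½ log(2πeφ²) with exact gradient 1/φ, and the score-times-path-derivative estimator recovers it by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 2.0
z = rng.normal(size=500_000)

x = phi * z              # reparameterized sample, so dx/dphi = z
score = -x / phi ** 2    # exact score of q_phi = N(0, phi^2)

# Eq. 6 specialized to this family: grad_phi H = -E[score * dx/dphi].
grad_H_est = -np.mean(score * z)
grad_H_true = 1.0 / phi  # d/dphi of 0.5 * log(2*pi*e*phi^2)
```

Here the score is known in closed form; the point of Section 4.2 is estimating it when q_φ is a deep latent-variable model.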

4.2. VARIATIONAL APPROXIMATION WITH IMPORTANCE SAMPLING

We propose to replace HMC sampling of q_φ(z|x) with a variational approximation ξ(z | z₀) ≈ q_φ(z | x), where z₀ is a conditioning variable we will define shortly. We can use this approximation with self-normalized importance sampling to estimate

∇_x log q_φ(x) = E_{q_φ(z|x)}[∇_x log q_φ(x | z)]
 = E_{p(z₀) q_φ(z|x)}[∇_x log q_φ(x | z)]
 = E_{p(z₀) ξ(z|z₀)}[ (q_φ(z | x) / ξ(z | z₀)) ∇_x log q_φ(x | z) ]
 = E_{p(z₀) ξ(z|z₀)}[ (q_φ(z, x) / (q_φ(x) ξ(z | z₀))) ∇_x log q_φ(x | z) ]
 ≈ Σ_{i=1}^k (w_i / Σ_{j=1}^k w_j) ∇_x log q_φ(x | z_i) ≡ ∇̂_x log q_φ(x; {z_i}_{i=1}^k, z₀)    (8)

where {z_i}_{i=1}^k ∼ ξ(z | z₀) and w_i ≡ q_φ(z_i, x) / ξ(z_i | z₀). We use k = 20 importance samples for all experiments in this work. This approximation holds for any conditioning information we would like to use. To choose it, let us consider how samples x ∼ q_φ(x) are drawn. We first sample z₀ ∼ N(0, I) and then x ∼ q_φ(x | z₀). In our estimator we want a variational approximation to q_φ(z | x), and by construction, z₀ is a sample from this distribution. For this reason we let our variational approximation be ξ_η(z | z₀) = N(z | z₀, η²I), a diagonal Gaussian centered at the z₀ which generated x. For this approximation to be useful we must tune the variance η². We do this by maximizing the standard evidence lower bound at every training iteration:

L_ELBO(η; z₀, x) = E_{ξ_η(z|z₀)}[log q_φ(x | z) + log q_φ(z)] + H(ξ_η(z | z₀)).    (10)

We then use ξ_η(z | z₀) to approximate ∇_x log q_φ(x), which we plug into Equation 6 to estimate ∇_φ H(q_φ) for training our generator. A full derivation and discussion can be found in Appendix A.3. Combining the tools presented above, we arrive at our proposed method, which we call Variational Entropy Regularized Approximate maximum likelihood (VERA), outlined in Algorithm 1. We found it helpful to further add an ℓ2-regularizer with weight 0.1 on the gradient of our model's likelihood, as in Kumar et al. (2019).
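The self-normalized importance-sampling score estimator (Eq. 8) can be sketched in numpy using a linear generator, for which the exact marginal score is available for comparison (the same trick as the bias study in Appendix B.4). The dimensions, η, and k below are illustrative; in the paper, η is tuned by maximizing the ELBO rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, dx, sigma, eta, k = 2, 3, 0.1, 0.2, 5000
W = rng.normal(size=(dx, dz))
mu = rng.normal(size=dx)

# Draw x together with the z0 that generated it.
z0 = rng.normal(size=dz)
x = W @ z0 + mu + sigma * rng.normal(size=dx)

# Proposal xi(z | z0) = N(z0, eta^2 I); draw k importance samples.
zs = z0 + eta * rng.normal(size=(k, dz))

# log q(z, x) = log N(z; 0, I) + log N(x; W z + mu, sigma^2 I), up to
# constants that cancel under self-normalization.
resid = x - zs @ W.T - mu
log_joint = -0.5 * (zs ** 2).sum(1) - 0.5 * ((resid / sigma) ** 2).sum(1)
log_prop = -0.5 * (((zs - z0) / eta) ** 2).sum(1)
logw = log_joint - log_prop
w = np.exp(logw - logw.max())
w /= w.sum()                       # self-normalized weights

# grad_x log q(x | z) = -(x - g(z)) / sigma^2, averaged under the weights.
score_est = (w[:, None] * (-resid / sigma ** 2)).sum(0)

# Exact score of the marginal q(x) = N(mu, W W^T + sigma^2 I), for comparison.
score_true = -np.linalg.solve(W @ W.T + sigma ** 2 * np.eye(dx), x - mu)
```

For a deep generator the exact score is unavailable, but the estimator only needs densities of the Gaussian conditional, prior, and proposal, so it parallelizes over the k samples.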
In some of our larger-scale experiments we reduced the weight of the entropy regularizer, as in Dieng et al. (2019). We refer to the entropy regularizer weight as λ.

Algorithm 1: VERA Training
Input: EBM p_θ(x) ∝ e^{f_θ(x)}, generator q_φ(x, z), approximate posterior ξ_η(z|z₀), entropy weight λ, gradient penalty γ = 0.1
Output: Parameters θ such that p_θ ≈ p
while True do
    Sample mini-batch x, and generate mini-batch x_g, z₀ ∼ q_φ(x, z)
    Compute L_ELBO(η; z₀, x_g) and update η                          // Update posterior
    Compute f_θ(x) − f_θ(x_g) + γ‖∇_x f_θ(x)‖² and update θ          // Update EBM
    Sample {z_i}_{i=1}^k ∼ ξ_η(z|z₀)
    Compute s = ∇̂_x log q_φ(x_g; {z_i}_{i=1}^k, z₀)                  // Estimate score fn (Eq. 8)
    Compute g = −s^T ∇_φ x_g                                         // Estimate entropy gradient (Eq. 6)
    Update φ with ∇_φ f_θ(x_g) + λg                                  // Update generator
end

5. EBM TRAINING EXPERIMENTS

We present results from training various models with VERA and related approaches. In Figure 1 we visualize the impact of our generator's entropy on the learned density model and compare this with the MCMC sampling used in PCD learning. In Section 5.1 we explore this quantitatively by training tractable models and evaluating test likelihood. In Section 5.2 we explore the bias of our entropy gradient estimator and the estimator's effect on capturing modes.

5.1. FITTING TRACTABLE MODELS

Optimizing the generator in VERA training minimizes a variational upper bound on the likelihood of data under our model. If this bound is not sufficiently tight, then training the model to maximize it will not actually improve likelihood. To demonstrate that the VERA bound is tight enough to train large-scale models, we train NICE models (Dinh et al., 2014) on the MNIST dataset. NICE is a normalizing flow (Rezende & Mohamed, 2015), a flexible density estimator which enables both exact likelihood computation and exact sampling. We can train this model with VERA (which requires neither of these abilities), evaluate the learned model using likelihood, and generate exact samples from the trained models. Full experimental details can be found in Appendix B. We compare the performance of VERA with maximum likelihood training as well as a number of approaches for training unnormalized models: Maximum Entropy Generators (MEG) (Kumar et al., 2019), Persistent Contrastive Divergence (PCD), Sliced Score Matching (SSM) (Song et al., 2020), Denoising Score Matching (DSM) (Vincent, 2011), Curvature Propagation (CP-SM) (Martens et al., 2012), and CoopNets (Xie et al., 2018). As an ablation we also train VERA with the HMC-based entropy regularizer of Dieng et al. (2019), denoted VERA (HMC). Table 2 shows that VERA outperforms all approaches that do not require a normalized model. Figure 2 shows exact samples from our NICE models. For PCD we can see (as observed in Figure 1) that while the approximate MCMC samples resemble the data distribution, the true samples from the model do not. This is further reflected in the reported likelihood value, which falls behind all methods besides DSM. CoopNets perform better than PCD, but exhibit the same behavior of generator samples resembling the data while not matching true samples. We attribute this behavior to the method's reliance on MCMC sampling.
Conversely, models trained with VERA generate coherent and diverse samples which reasonably capture the data distribution. We also see that the samples from the learned generator much more closely match true samples from the NICE model than those of PCD and MEG. When we remove the entropy regularizer from the generator (λ = 0.0), we observe a considerable decrease in likelihood, and we find that the generator samples are far less diverse and do not match exact samples at all. Intriguingly, entropy-free VERA still outperforms most other methods. We believe this is because even without the entropy regularizer we are still optimizing a (weak) bound on likelihood, whereas the score-matching methods minimize an alternative divergence which will not necessarily correlate well with likelihood. Further, Figure 2 shows that MEG performs on par with entropy-free VERA, indicating that the Mutual Information-based entropy estimator may not be accurate enough in high dimensions to encourage high-entropy generators.

Maximum Likelihood | VERA (λ=1.0) | VERA (λ=0.0) | VERA (HMC, λ=1.0) | MEG   | PCD   | SSM   | DSM   | CP-SM | CoopNet
-791               | -1138        | -1214        | -1165             | -1219 | -4207 | -2039 | -4363 | -1517 | -1465

Table 2: Fitting NICE models using various learning approaches for unnormalized models. Results for SSM, DSM, CP-SM taken from Song et al. (2020).

5.2. UNDERSTANDING OUR ENTROPY REGULARIZER

In Figure 3, we explore the quality of our score function estimator on a PCA (Tipping & Bishop, 1999) model fit to MNIST, a setting where we can compute the score function exactly (see Appendix B.4 for details). The importance sampling estimator (with 20 importance samples) has somewhat larger variance than the HMC-based estimator, but has a notably lower bias of .12. The HMC estimator using 2 burn-in steps (recommended in Dieng et al. (2019)) has a bias of .48. Increasing the number of burn-in steps to 500 reduces the bias to .20 while increasing variance. We find the additional variance of our estimator is remedied by mini-batch averaging, and the reduced bias helps explain the improved performance in Table 2. Further, we compute the effective sample size (ESS) (Kong, 1992) of our importance sampling proposal on our CIFAR10 and MNIST models and achieve an ESS of 1.32 and 1.29, respectively, using 20 importance samples. When an uninformed proposal (N(0, I)) is used, the ESS is 1.0 for both models. This indicates our gradient estimates are informative for training. More details can be found in Appendix B.6. Next, we count the number of modes captured on a dataset with 1,000 modes consisting of 3 MNIST digits stacked on top of one another (Dieng et al., 2019; Kumar et al., 2019). We find that both VERA and VERA (HMC) recover 999 modes, while training with no entropy regularization recovers 998 modes. We conclude that entropy regularization is unnecessary for preventing mode collapse in this setting.

6. APPLICATIONS

Joint Energy Models (JEM) (Grathwohl et al., 2019) are an exciting application of EBMs. They reinterpret standard classifiers as EBMs and train them as such to create powerful hybrid generative/discriminative models which improve upon purely discriminative models at out-of-distribution detection, calibration, and adversarial robustness. Traditionally, classification tasks are solved with a function f_θ : R^D → R^K which maps from the data to K unconstrained real-valued outputs. This function parameterizes a distribution over labels y given data x: p_θ(y|x) = e^{f_θ(x)[y]} / Σ_{y'} e^{f_θ(x)[y']}. The same function f_θ can be used to define an EBM for the joint distribution over x and y: p_θ(x, y) = e^{f_θ(x)[y]} / Z(θ). The label y can be marginalized out to give an unconditional model p_θ(x) = Σ_y e^{f_θ(x)[y]} / Z(θ). JEM models are trained to maximize the factorized likelihood log p_θ(x, y) = α log p_θ(y|x) + log p_θ(x), where α is a scalar which weights the two terms. The first term is optimized with cross-entropy, and the second term is optimized using EBM training methods; in Grathwohl et al. (2019), PCD was used to train the second term. We train JEM models on CIFAR10, CIFAR100, and SVHN using VERA instead of PCD. We examine how this change impacts accuracy, generation, training speed, and stability. Full experimental details can be found in Appendix B.7.
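The classifier-as-EBM reinterpretation is a one-liner: given the K logits of any classifier, p(y|x) is the usual softmax, while logsumexp over the logits gives the unnormalized log p_θ(x). A minimal sketch (the toy logits are arbitrary, not model outputs):

```python
import numpy as np

def log_p_x_unnorm(logits):
    """log p_theta(x) + log Z(theta) = logsumexp_y f_theta(x)[y]."""
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def p_y_given_x(logits):
    """Standard softmax: the intractable Z(theta) cancels in the conditional."""
    return np.exp(logits - log_p_x_unnorm(logits)[..., None])

logits = np.array([[2.0, 0.5, -1.0]])  # f_theta(x) for K = 3 classes
probs = p_y_given_x(logits)
# probs is the classifier's softmax; log_p_x_unnorm(logits) scores x itself,
# which is the quantity the EBM term of the JEM objective trains.
```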

Speed and Stability

While the results presented in Grathwohl et al. (2019) are promising, training models as presented in that work is challenging. MCMC sampling can be slow and training can easily diverge. Our CIFAR10 models train 2.8x faster than the official JEM implementation with its default hyper-parameters. With these default parameters, JEM models would regularly diverge; to train for the reported 200 epochs, training needed to be restarted multiple times and the number of MCMC steps needed to be quadrupled, greatly increasing run-time. Conversely, we find that VERA was much more stable, and our models never diverged. This allowed us to remove the additive Gaussian noise added to the data, which is very important for stabilizing MCMC training (Grathwohl et al., 2019; Nijkamp et al., 2019a; Du & Mordatch, 2019).

Hybrid Modeling  In Tables 3 and 4 we compare the discriminative and generative performance of JEM models trained with VERA, PCD (JEM), and HDGE (Liu & Abbeel, 2020). With α = 1 we find that VERA leads to models with poor classification performance but strong generation performance. With α = 100, VERA obtains stronger classification performance than the original JEM model while still having improved image generation over JEM and HDGE (evaluated with FID (Heusel et al., 2017)). Unconditional samples from our CIFAR10 and CIFAR100 models can be seen in Figure 4. Samples are refined through a simple iterative procedure using the latent space of our generator, explained in Appendix B.7.1. Additional conditional samples can be found in Appendix C.5.

Figure 4: Unconditional samples on CIFAR10 (left) and CIFAR100 (right).

Out-of-Distribution Detection  JEM is a powerful approach for out-of-distribution detection (OOD), greatly outperforming tractable likelihood models like VAEs and flows (Nalisnick et al., 2018). In Table 5, reporting AUROC (Hendrycks & Gimpel, 2016), we see that for all but one dataset, VERA outperforms JEM with PCD training but under-performs contrastive training (HDGE). Intriguingly, VERA performs poorly on CelebA. This result, along with the unreliable performance of DSM models at this task (Li et al., 2019), raises questions about special benefits of MCMC training that are lost in our method as well as in DSM. We leave this to future work.

6.1. TABULAR DATA

Training with VERA is much more stable and easier to apply to domains beyond images, on which existing EBM training has been extensively tuned. To demonstrate this, we show that JEM models trained with VERA can benefit semi-supervised classification on datasets from a variety of domains. Considerable progress has been made in semi-supervised learning, but the most impressive results require considerable domain knowledge (Chen et al., 2020). In domains like images, text, and audio such knowledge exists, but for data from particle accelerators, gyroscopes, and satellites, such intuition may not be available and these techniques cannot be applied. In these settings there are far fewer options for semi-supervised learning. We present VERA as one such option. We train semi-supervised JEM models on data from a variety of continuous domains. We perform no data augmentation beyond removing redundant features and standardizing the remaining features. To further demonstrate the versatility of VERA, we use an identical network for each dataset and method. Full experimental details can be found in Appendix B.8. In Table 6, we find that on each dataset tested, VERA outperforms the supervised baseline as well as VAT, the strongest domain-agnostic semi-supervised learning method we are aware of.

The approach of Abbasnejad et al. (2019) differs from ours in the architecture of the generator and the method for encouraging high entropy. Their generator defines an implicit density (unlike ours, which defines a latent-variable model), and its entropy is maximized using a series approximation to the generator function's Jacobian log-determinant, which approximates the change of variables for injective functions. There also exist CoopNets (Xie et al., 2018), which cooperatively train an EBM and a generator network. Architecturally, they are similar to VERA but are trained quite differently. In CoopNets, the generator is trained via maximum likelihood on its own samples refined using MCMC on the EBM. This maximum likelihood step also requires MCMC to generate posterior samples, as in Pang et al. (2020a). In contrast, the generator in VERA is trained to minimize the reverse KL-divergence. Our method requires no MCMC and was specifically developed to alleviate the difficulties of MCMC sampling.

7. RELATED WORK

The estimator of Dieng et al. (2019) was very influential to our work. Their work focused on applications to GANs. Our estimator could easily be applied in that setting, and to implicit variational inference (Titsias & Ruiz, 2019) as well, but we leave this for future work.

8. CONCLUSION

In this work we have presented VERA, a simple and easy-to-tune approach for training unnormalized density models. We have demonstrated that our approach learns high-quality energy functions and models with high likelihood (when available for evaluation). We have further demonstrated the superior stability and speed of VERA compared to PCD training, enabling much faster training of JEM (Grathwohl et al., 2019) while retaining the performance of the original work. We have shown that VERA can train models from multiple data domains with no additional tuning. This enables the application of JEM to semi-supervised classification on tabular data, outperforming a strong baseline method for this task and greatly outperforming JEM with PCD training.

A KEY DERIVATIONS

A.1 DERIVATION OF VARIATIONAL LOG-PARTITION FUNCTION

Here we show that the variational optimization given in Equation 3 recovers log Z(θ):

max_q E_{q(x)}[f_θ(x)] + H(q)
 = max_q ∫ q(x) f_θ(x) dx − ∫ q(x) log q(x) dx
 = max_q ∫ q(x) log [ e^{f_θ(x)} / q(x) ] dx
 = max_q ∫ q(x) log [ e^{f_θ(x)} / q(x) ] dx − log Z(θ) + log Z(θ)
 = max_q ∫ q(x) log [ (e^{f_θ(x)} / Z(θ)) / q(x) ] dx + log Z(θ)
 = max_q −KL(q(x) ‖ p_θ(x)) + log Z(θ)
 = log Z(θ),

where the final step follows because the KL-divergence is non-negative and equals zero exactly when q = p_θ.

A.2 SCORE FUNCTION ESTIMATOR

Here we derive the equivalent expression for ∇_x log q_φ(x) given in Equation 7:

∇_x log q_φ(x) = ∇_x q_φ(x) / q_φ(x)
 = ∇_x ∫ q_φ(x, z) dz / q_φ(x)
 = ∫ ∇_x q_φ(x, z) dz / q_φ(x)
 = ∫ [ ∇_x q_φ(x, z) / q_φ(x) ] dz
 = ∫ [ ∇_x q_φ(x | z) ] q_φ(z) / q_φ(x) dz
 = ∫ [ ∇_x log q_φ(x | z) ] q_φ(x | z) q_φ(z) / q_φ(x) dz
 = E_{q_φ(z|x)}[∇_x log q_φ(x | z)],

using ∇_x q_φ(x | z) = [∇_x log q_φ(x | z)] q_φ(x | z) and q_φ(x | z) q_φ(z) / q_φ(x) = q_φ(z | x).

A.3 ENTROPY GRADIENT ESTIMATOR

From Equation 6 we have ∇_φ H(q_φ) = −E_{p(z₀)p(ε)}[∇_x log q_φ(x)^T ∇_φ x(z₀, ε)]. Plugging in our score function estimator gives:

∇_φ H(q_φ) = −E_{p(z₀)p(ε)}[∇_x log q_φ(x)^T ∇_φ x(z₀, ε)]    (12)
 = −E_{p(z₀)p(ε)}[ E_{q_φ(z|x)}[∇_x log q_φ(x | z)]^T ∇_φ x(z₀, ε) ]
 = −E_{p(z₀)p(ε)}[ E_{ξ_η(z|z₀)}[ (q_φ(z | x) / ξ_η(z | z₀)) ∇_x log q_φ(x | z) ]^T ∇_φ x(z₀, ε) ]
 = −E_{p(z₀)p(ε)}[ E_{ξ_η(z|z₀)}[ (q_φ(x, z) / (q_φ(x) ξ_η(z | z₀))) ∇_x log q_φ(x | z) ]^T ∇_φ x(z₀, ε) ]
 ≈ −E_{p(z₀)p(ε)}[ ( Σ_{i=1}^k (w_i / Σ_{j=1}^k w_j) ∇_x log q_φ(x | z_i) )^T ∇_φ x(z₀, ε) ]    (13)

where {z_i}_{i=1}^k ∼ ξ_η(z | z₀) and w_i ≡ q_φ(z_i, x) / ξ_η(z_i | z₀).

A.3.1 DISCUSSION

We discuss when the approximations in Equation 13 will hold. The importance sampling estimator will be biased when q_φ(z | x) differs greatly from ξ_η(z | z₀). Since the generator function g_ψ(z) is a smooth, Lipschitz function (as are most neural networks) and the output Gaussian noise is small, the space of z values which could have generated x should be concentrated near z₀. In these settings, z₀ should be useful for predicting q_φ(z | x). The accuracy of this approximation depends on the dimension of z compared to x and on the Lipschitz constant of g_ψ. In all settings we tested, dim(z) ≪ dim(x), where this approximation should hold. If dim(z) approached dim(x), the curse of dimensionality would take effect and z₀ would become less and less informative about q_φ(z | x). In settings such as these, we do not believe our approach would be as effective. Thankfully, almost all generator architectures we are aware of have dim(z) ≪ dim(x). The approximation could also break down if the Lipschitz constant blew up. We find this does not happen in practice, but it can be addressed with many forms of regularization and normalization.

Footnotes for Table 7: *When α > 1, learning rates were divided by α. †We found λ = 10^-4 to work best on large image datasets, but in general we recommend starting with λ = 1 and trying successively smaller values of λ until training is stable.

B EXPERIMENTAL DETAILS

We give some general tips on how to set hyperparameters when training VERA in Table 7. In all VERA experiments, we use the gradient-norm penalty with weight 0.1; this was not tuned during our experiments. When using VERA and MEG we train with Adam (Kingma & Ba, 2014) and set β1 = 0, β2 = .9, as is standard in the GAN literature (Miyato et al., 2018). In general, we recommend setting the learning rate for the generator to twice the learning rate of the energy function and equal to the learning rate of the approximate posterior sampler.

B.1.1 IMPACT OF λ

Let us rewrite the generator's training objective:

L(q; λ) = E_{q(x)}[f_θ(x)] + λH(q).    (13)

We can easily see that this objective is equivalent (up to a multiplicative constant) to

E_{q(x)}[f_θ(x)/λ] + H(q).    (14)

From this, it is clear that maximizing Equation 14 is the same as minimizing the KL-divergence between q and a tempered version of p_θ(x) defined as e^{f_θ(x)/λ} / Z. Tempering like this is standard practice in EBM training and is done in many recent works. Tempering has the effect of increasing the weight of the gradient signal in SGLD sampling relative to the added Gaussian noise. In all of Du & Mordatch (2019); Grathwohl et al. (2019); Nijkamp et al. (2019a;b), the SGLD samplers used a temperature of 1/λ = 20,000. This value is near the value of 10,000 that we use in this work. Thus we can see that our most important hyper-parameter is actually the temperature of the sampler's target distribution. Ideally this temperature would be set to 1, but this can lead to unstable training (as it does with SGLD). To train high-quality models, we recommend setting λ as close to 1 as possible, decreasing it if training becomes unstable.
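The tempering interpretation can be checked on a 1-D Gaussian (our own sketch, not the paper's experiment): with f(x) = −x²/2 and q = N(0, s²), the λ-weighted objective E_q[f/λ] + H(q) is maximized at s² = λ, i.e. q fits the tempered density e^{f(x)/λ}.

```python
import numpy as np

def tempered_objective(s2, lam):
    """E_q[f]/lam + H(q) for f(x) = -x^2/2 and q = N(0, s2), in closed form."""
    return -0.5 * s2 / lam + 0.5 * np.log(2 * np.pi * np.e * s2)

lam = 0.25
s2_grid = np.linspace(0.05, 2.0, 400)
best_s2 = s2_grid[np.argmax(tempered_objective(s2_grid, lam))]
# best_s2 is (up to grid resolution) equal to lam: a smaller entropy weight
# means the generator fits a sharper, tempered version of p_theta.
```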

B.2 TOY DATA VISUALIZATIONS

We train simple energy functions on 2D toy data. The EBM is a 2-layer MLP with 100 hidden units per layer using the Leaky-ReLU nonlinearity with negative slope .2. The generator is a 2-layer MLP with 100 hidden units per layer and uses batch normalization (Ioffe & Szegedy, 2015) with ReLU nonlinearities. All models were trained for 100,000 iterations with all learning rates set to .001, using the Adam optimizer (Kingma & Ba, 2014). The PCD models were trained using an SGLD sampler and a replay buffer with 10,000 examples, reinitialized every iteration with 5% probability. We used 20 steps of SGLD per training iteration to make runtime consistent with VERA. We tested σ values outside of the presented range, but smaller values did not produce decent samples or energy functions, and for larger values training diverged.

B.3 TRAINING NICE MODELS

The NICE models were exactly as in Song et al. (2020). They have 4 coupling layers, and each coupling layer has 5 hidden layers. Each hidden layer has 1000 units and uses the Softplus nonlinearity. We preprocessed the data as in Song et al. (2020) by scaling the data to the range [0, 1], adding uniform noise in the range [−1/512, 1/512], clipping to the range [.001, .999], and applying the logit transform log(x) − log(1 − x). All models were trained for 400 epochs with the Adam optimizer (Kingma & Ba, 2014) with β1 = 0 and β2 = .9. We use a batch size of 128 for all models. We re-ran the score matching model of Song et al. (2020) for 400 epochs as well and found this did not improve performance, as its best test performance occurs very early in training. For all generator-based training methods we use the same fixed generator architecture. The generator has a latent dimension of 100 and 2 hidden layers with 500 units each, using the Softplus nonlinearity and batch normalization (Ioffe & Szegedy, 2015), as is common for generator networks. For VERA, the hyper-parameters we searched over were the learning rates for the NICE model and for the generator. Compared to Song et al. (2020) we needed to use much lower learning rates. We searched over learning rates in {.0003, .00003, .000003} for both the generator and energy function. We found .000003 to work best for the energy function and .0003 to work best for the generator. This makes intuitive sense, since the generator needs to be fully optimized for the bound on likelihood to be tight. When equal learning rates were used (.0003, .0003), we observed high sample quality from the generator, but exact samples and likelihoods from the NICE model were very poor. For PCD we searched over learning rates in {.0003, .00003, .000003}, the number of MCMC steps in {20, 40}, and the SGLD noise standard deviation in {1.0, 0.1}. All models with 20 steps and SGLD standard deviation 1.0 quickly diverged.
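The preprocessing pipeline described above can be sketched in numpy (constants as stated; uint8 image input is an assumption on our part):

```python
import numpy as np

def preprocess(x_uint8, rng):
    """Dequantize and logit-transform images as described in the text."""
    x = x_uint8.astype(np.float64) / 255.0             # scale to [0, 1]
    x = x + rng.uniform(-1 / 512, 1 / 512, size=x.shape)  # dequantization noise
    x = np.clip(x, 0.001, 0.999)                       # keep the logit finite
    return np.log(x) - np.log(1.0 - x)                 # logit transform

rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(4, 784), dtype=np.uint8)  # fake MNIST batch
z = preprocess(imgs, rng)
```

The clipping bounds mean every output lies in [logit(.001), logit(.999)] ≈ [−6.91, 6.91], so the transformed data is finite and roughly unconstrained, as a flow expects.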
Our best PCD model used learning rate .000003, step-size 0.1, and 40 steps. We tested the gradient-norm regularizer from Kumar et al. (2019) and found it decreased performance for PCD-trained models. Most models with 20 MCMC steps diverged early in training. For reference, the MCMC sampler we use is stochastic gradient Langevin dynamics (Welling & Teh, 2011), which updates its samples by

x_t = x_{t-1} + (σ²/2) ∇_x f_θ(x_{t-1}) + σε_t,  ε_t ∼ N(0, I),

where the noise standard deviation σ is a parameter of the sampler.

For Maximum Entropy Generators (MEG) (Kumar et al., 2019) we must choose a mutual-information estimation network. We follow their work and use an MLP with Leaky-ReLU nonlinearities with negative slope .2. This network mirrors the generator and has 3 hidden layers with 500 units each. We searched over the same hyper-parameters as for VERA. We found MEG to perform almost identically to training with no entropy regularization at all. We believe this has to do with the challenges of estimating MI in high dimensions (Song & Ermon, 2019a).

For CoopNets (Xie et al., 2018) we use the same flow and generator architectures as for VERA. Following the MNIST experiments in Xie et al. (2018) we train using 10 SGLD steps per iteration. We tried the recommended learning rates of .007 and .0001 for the flow and generator, respectively, but found this led to quick divergence. For this reason, we searched over learning rates for the flow and generator in {.0003, .00003, .000003}, as we did for VERA, and found the best combination to be .000003 for the flow and .0003 for the generator. Other combinations resulted in higher-quality generator samples but much worse likelihood values. We tested the recommended SGLD step-size of .002 and found this led to divergence as well in this setup. Thus, we searched over larger step-sizes in {.002, .02, .1} and found .1 to perform best, as with PCD.
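The SGLD update above translates directly into code. A minimal NumPy sketch, where `grad_f` is a stand-in for ∇_x f_θ (here the score of a standard Gaussian rather than a learned energy):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Stand-in for the energy gradient ∇_x f_θ(x); here f(x) = -||x||²/2.
    return -x

def sgld_step(x, sigma):
    """One SGLD update: x_t = x_{t-1} + (σ²/2) ∇_x f_θ(x_{t-1}) + σ ε_t."""
    return x + 0.5 * sigma**2 * grad_f(x) + sigma * rng.normal(size=x.shape)

# Run 20 steps on a batch of chains, matching the PCD baseline's step count.
x = rng.normal(size=(64, 2))
for _ in range(20):
    x = sgld_step(x, sigma=0.1)
```

Note that the drift term scales with σ²/2 while the noise scales with σ, so the step-size and noise level are coupled through the single parameter σ.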

B.4 ESTIMATION OF BIAS OF ENTROPY REGULARIZER

If we restrict the form of our generator to a linear function x = Wz + μ + σε with z, ε ∼ N(0, I), then we have q(x|z) = N(Wz + μ, σ²I) and q(x) = N(μ, WWᵀ + σ²I), meaning we can exactly compute log q(x) and ∇_x log q(x), the quantity that VERA (HMC) approximates with the HMC estimator from Dieng et al. (2019) and that we approximate with VERA. To explore this, we fit a PCA model on MNIST and recover the parameters W, μ of the linear generator and the noise parameter σ. Samples from this model can be seen in Figure 5.

Both VERA and VERA (HMC) have some parameters which are tuned automatically to improve the estimator throughout training. For VERA this is the posterior variance, which is optimized according to Equation 10 with Adam using default hyperparameters. For VERA (HMC) this is the HMC step-size, which is tuned automatically as outlined in Dieng et al. (2019). Both estimators were trained with a learning rate of .01 for 500 iterations with a batch size of 5000. Samples from the estimators during training were taken with the default parameters; in particular, the number of burn-in steps and the number of importance samples were not varied as they were during evaluation.

The bias of the estimators was evaluated on a batch of 10 samples from the generator. For each example in the batch, 5000 estimates were drawn from the estimator and averaged to form an estimate of the score function for that example. This estimate was subtracted from the true score function and then averaged over all dimensions and examples in the batch, giving an estimate of the bias per dimension. For VERA (HMC) we varied the number of burn-in steps used for samples to evaluate the bias of the estimator. We also tried increasing the number of posterior samples taken from the chain, but found that this did not clearly reduce the bias of the estimator as the number of samples increased.
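Because q(x) is Gaussian in the linear case, the exact score is available in closed form. A small NumPy sketch with toy dimensions and random W, μ (rather than the PCA fit on MNIST):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3                       # toy data / latent dimensions (assumptions)
W = rng.normal(size=(d, k))
mu = rng.normal(size=d)
sigma = 0.1

# x = W z + mu + sigma * eps with z, eps ~ N(0, I) implies the marginal
# q(x) = N(mu, W Wᵀ + σ² I).
cov = W @ W.T + sigma**2 * np.eye(d)

def exact_score(x):
    """∇_x log q(x) = -(W Wᵀ + σ² I)⁻¹ (x - μ) for a Gaussian marginal."""
    return -np.linalg.solve(cov, x - mu)

# Draw one sample from the linear generator and evaluate its exact score.
x = W @ rng.normal(size=k) + mu + sigma * rng.normal(size=d)
score = exact_score(x)
```

This closed-form score is what makes the linear-generator setting a useful ground truth for measuring estimator bias.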
For VERA we computed the bias with the default of 20 importance samples.

B.5 STACKED MNIST

We use the DCGAN (Radford et al., 2015) architecture for both the energy function and the generator, with a 100-dimensional latent code. We train for 17 epochs with a learning rate of .001 and a batch size of 100. We estimate the number of modes captured by taking S = 10,000 samples as in Dieng et al. (2019) and classifying each digit of the 3 stacked images separately with a classifier pre-trained on MNIST.

B.6 EFFECTIVE SAMPLE SIZE

When performing importance sampling, the quality of the proposal distribution has a large impact. If the proposal is chosen poorly, then typically one sample will dominate the expectation. This can be quantified using the effective sample size (Kong, 1992), discussed below.

B.7.1 MCMC SAMPLE REFINEMENT

Our generator q_φ(x) is trained to approximate our EBM p_θ(x). After training, the samples from the generator are of high quality (see Figures 9 and 10, left), but they are not exactly samples from p_θ(x). We can use MCMC sampling to improve the quality of these samples. We use a simple refinement procedure based on the Metropolis-Adjusted Langevin Algorithm (Besag, 1994), applied to an expanded state-space defined by our generator, with the accept/reject step performed in the data space. We can reparameterize a generator sample x ∼ q_φ(x) as a function x(z, ε) = g_ψ(z) + σε with z, ε ∼ N(0, I), and define an unnormalized density over {z, ε},

log h(z, ε) ≡ f_θ(g_ψ(z) + σε) − log Z(θ, φ),

which is the density (under p_θ(x)) of the generator sample. Starting from an initial sample z₀, ε₀ ∼ N(0, I), we define the proposal distribution

p(z_t | z_{t−1}, ε_{t−1}) = N(z_{t−1} + (δ²/2) ∇_{z_{t−1}} log h(z_{t−1}, ε_{t−1}), δ²I)
p(ε_t | z_{t−1}, ε_{t−1}) = N(ε_{t−1} + (δ²/2) ∇_{ε_{t−1}} log h(z_{t−1}, ε_{t−1}), δ²I)
p(z_t, ε_t | z_{t−1}, ε_{t−1}) = p(z_t | z_{t−1}, ε_{t−1}) p(ε_t | z_{t−1}, ε_{t−1})

and accept a new sample with probability

min( [h(z_t, ε_t) p(z_{t−1}, ε_{t−1} | z_t, ε_t)] / [h(z_{t−1}, ε_{t−1}) p(z_t, ε_t | z_{t−1}, ε_{t−1})], 1 ).

Here δ is the step-size and is a parameter of the sampler; we tune δ during a burn-in period to target an acceptance rate of 0.57. We clarify that this procedure is not a valid MCMC sampler for p_θ(x), due to the augmented variables and the change of density induced by g_ψ, which are not corrected for; the density of the samples will be a combination of p_θ(x) and q_φ(x). As the focus of this work was training and not sampling/generation, we leave the development of more correct generator MCMC sampling to future work. Regardless, we find this procedure improves visual sample quality.
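A minimal NumPy sketch of this latent-space refinement, with a stand-in quadratic energy and linear generator in place of f_θ and g_ψ, and δ fixed rather than tuned to the target acceptance rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):            # stand-in energy f_θ(x) = -||x||²/2 (an assumption)
    return -0.5 * np.sum(x**2)

def grad_f(x):
    return -x

A = rng.normal(size=(4, 2))    # stand-in linear generator g_ψ(z) = A z
sigma = 0.1

def log_h(z, eps):
    # log h(z, ε) = f_θ(g_ψ(z) + σε), up to the constant log Z.
    return f(A @ z + sigma * eps)

def grad_log_h(z, eps):
    g = grad_f(A @ z + sigma * eps)
    return A.T @ g, sigma * g          # chain rule through x = A z + σ ε

def mala_step(z, eps, delta):
    gz, ge = grad_log_h(z, eps)
    z_new = z + 0.5 * delta**2 * gz + delta * rng.normal(size=z.shape)
    e_new = eps + 0.5 * delta**2 * ge + delta * rng.normal(size=eps.shape)

    def log_q(za, ea, zb, eb):         # log p(z_a, ε_a | z_b, ε_b)
        gzb, geb = grad_log_h(zb, eb)
        mz = zb + 0.5 * delta**2 * gzb
        me = eb + 0.5 * delta**2 * geb
        return -(np.sum((za - mz)**2) + np.sum((ea - me)**2)) / (2 * delta**2)

    # Metropolis-Hastings accept/reject on the expanded state {z, ε}.
    log_alpha = (log_h(z_new, e_new) + log_q(z, eps, z_new, e_new)
                 - log_h(z, eps) - log_q(z_new, e_new, z, eps))
    if np.log(rng.random()) < log_alpha:
        return z_new, e_new
    return z, eps

z, eps = rng.normal(size=2), rng.normal(size=4)
for _ in range(100):
    z, eps = mala_step(z, eps, delta=0.1)
x = A @ z + sigma * eps        # refined sample mapped back to data space
```

As the text notes, this chain targets h over the augmented variables, not p_θ(x) itself; the sketch only illustrates the mechanics of the proposal and acceptance step.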
In Figure 7 we visualize a sampling chain using the above method applied to our JEM model trained on SVHN. We present results on Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017). These metrics are notoriously fickle and different repositories are known to give very different results (Grathwohl et al., 2019). For these evaluations we generate 12,800 samples from the model and (unless otherwise stated) refine the samples with 100 steps of our latent-space MALA procedure (Appendix B.7.1). The code used to compute our reported FID and Inception Score comes from publicly available repositories.

For HEPMASS and HUMAN data we remove features which repeat the exact same value more than 5 times. For CROP data we remove features which have covariation greater than 1.01. For MNIST we linearly standardize features to the interval [-1, 1]. We take a random 10% subset of the data to use as a validation set.

B.8.2 TRAINING

We use the same architecture for all experiments and baselines. It has 6 layers of hidden units with dimensions [1000, 500, 500, 250, 250, 250] and a Leaky-ReLU nonlinearity with negative slope .2 between each layer of hidden units. The only layers which change between datasets are the input layer and the output layer, which change according to the number of features and the number of classes, respectively. The training process for semi-supervised learning is similar to JEM, with an additional entropy term commonly used in semi-supervised learning. Using the factorization log p_θ(x, y) = log p_θ(y|x) + log p_θ(x), we optimize

α log p_θ(y|x) + log p_θ(x) + β H(p_θ(y|x)),

where H(p_θ(y|x)) is the entropy of the predictive distribution over the labels. For all models we report the accuracy the model converged to on the held-out validation set, averaged over three training runs with different seeds. We use equal learning rates for the energy model, generator, and entropy estimator. We tune the learning rate and decay schedule for supervised models on the full set of labels and on 10 labels per class.
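The combined objective can be sketched as follows (a NumPy illustration; the logits and log p_θ(x) values are stand-ins for network outputs, and the function names are ours):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class dimension."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def ssl_objective(logits, labels, log_px, alpha=1.0, beta=1.0):
    """α·log p_θ(y|x) + log p_θ(x) + β·H(p_θ(y|x)), averaged over the batch."""
    lsm = log_softmax(logits)
    log_py_x = lsm[np.arange(len(labels)), labels].mean()   # log p_θ(y|x)
    entropy = -(np.exp(lsm) * lsm).sum(axis=1).mean()       # H(p_θ(y|x))
    return alpha * log_py_x + log_px.mean() + beta * entropy

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))     # 8 examples, 5 classes (toy sizes)
labels = rng.integers(0, 5, size=8)
log_px = rng.normal(size=8)          # stand-in log p_θ(x) values
value = ssl_objective(logits, labels, log_px)
```

In practice each term would come from the same network's outputs, with α and β set as described in the hyperparameter search above.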



For continuous spaces, this would be the differential entropy, but we simply use entropy here for brevity.
This experiment follows the NICE experiment in Song et al. (2020) and was based on their implementation.
https://github.com/wgrathwohl/JEM
We treat MNIST as a tabular dataset since we do not use convolutional architectures.
http://archive.ics.uci.edu/ml/datasets/HEPMASS
https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+optical-radar+data+set
https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones



Figure 2: Left: Exact samples from NICE model trained with various methods. Right: Approximate samples used for training. For VERA, MEG, and CoopNet, these come from the generator, for PCD these are approximate MCMC samples.


Figure 3: Bias (top) and standard deviation (bottom), both per dimension, of the score function estimator using HMC and our proposed importance sampling scheme.

Gao et al. (2020) train EBMs using Noise Contrastive Estimation where the noise distribution is a normalizing flow. Their training objective differs from ours, and their generator is restricted to a normalizing flow architecture. Such architectures do not scale to large image datasets as well as the GAN-style architectures we use.

Figure 5: Samples from linear model trained with PCA on MNIST.

A similar sampling procedure was proposed in Kumar et al. (2019) and Che et al. (2020), and in both works it was found to improve sample quality.

Figure 7: Visualization of our MALA-inspired sample refinement procedure. Samples come from JEM model trained on SVHN. Chains progress to the right and down. Each image is a consecutive step, no sub-sampling is done.

Figure 11: Class-conditional samples from CIFAR10

Features of EBM training approaches.

Classification on image datasets.

FID on CIFAR10.

Out-of-distribution Detection. Model trained on CIFAR10. Values are AUROC (↑).

Accuracy of semi-supervised learning on tabular data with 10 labeled examples per class.

Kumar et al. (2019) utilize a Mutual Information estimator to approximate the generator's entropy, whereas we approximate the gradients of the entropy directly. The method of Kumar et al. (2019) requires training an additional MI-estimation network, while our approach only requires optimizing the posterior variance, which has considerably fewer parameters. As demonstrated in Section 5.1, their approach does not perform as well as VERA for training NICE models, and their generator collapses to a single point. This is likely due to the notorious difficulty of estimating MI in high dimensions and the unreliability of current approaches for this task (McAllester & Stratos, 2020; Song & Ermon, 2019a).

Hyperparameters for VERA.

B.8 SEMI-SUPERVISED LEARNING ON TABULAR DATA

B.8.1 DATA

We provide details about each of the datasets used for the experiments in Section 6.1. HEPMASS is a dataset obtained from a particle accelerator where we must distinguish signal from background noise. CROP is a dataset for classifying crop types from optical and radar sensor data. HUMAN is a dataset for human activity classification from gyroscope data. MNIST is a dataset of handwritten digit images, treated here as tabular data.

Basic information about each tabular dataset.


The quality of a proposal can be quantified using the effective sample size (ESS) (Kong, 1992), defined as

ESS = 1 / Σᵢ w̄ᵢ²,  where w̄ᵢ = wᵢ / Σⱼ wⱼ and wᵢ = p(xᵢ)/q(xᵢ),

where p(x) is the target distribution, q(x) is the proposal distribution, and N is the number of samples x₁, …, x_N drawn from the proposal. If the self-normalized importance weights are dominated by one weight close to 1, the ESS will be 1. If the proposal distribution is identical to the target, so that the self-normalized importance weights are uniform, the ESS will be N; when ESS = N, importance sampling is as efficient as sampling from the target distribution directly.

To understand the effect of the proposal distribution on the ESS, we plot the ESS when doing importance sampling with 20 importance samples from a 128-dimensional Gaussian target distribution with μ = 0 and Σ = I. We use a proposal which is a 128-dimensional Gaussian with mean increasing from 0 to 5. Results can be seen in Figure 6. We see that when the means differ by more than 2, the ESS is approximately 1.0 and importance sampling has effectively failed.

For the JEM experiments we set the learning rate for the energy function to 0.0001 and the learning rate for the generator to 0.0002. We train for 200 epochs using the Adam optimizer with β₁ = 0 and β₂ = .9, with a batch size of 64. Results presented are from the models after 200 epochs with no early stopping; we believe better results could be obtained with further training. We trained models with α ∈ {1, 30, 100} and found classification to be best with α = 100 and generation to be best with α = 1.

Prior work on PCD EBM training (Grathwohl et al., 2019; Du & Mordatch, 2019; Nijkamp et al., 2019a;b) recommends adding Gaussian noise to the data to stabilize training; without this, PCD training of JEM models very quickly diverges. Early in our experiments we found training with VERA was stable without the addition of Gaussian noise, so we do not use it.

As mentioned in Dieng et al. (2019), when the strength of the entropy regularizer λ is too high, the generator may fall into a degenerate optimum where it simply outputs high-entropy Gaussian noise. To combat this, as suggested in Dieng et al. (2019), we decrease λ to .0001 for all JEM experiments. This value was chosen by decreasing λ from 1.0 by factors of 10 until learning took place (as quantified by classification accuracy).

For VERA we tune the learning rate and decay schedule, the weighting λ of the entropy regularization, and the weighting β of the entropy of the classification outputs. For VAT we tune the perturbation size ε ∈ {.01, .1, 1, 3, 10}; all other hyperparameters were fixed according to the tuning on VERA. For MEG we use the hyperparameters tuned for VERA. For JEM we tune the number of MCMC steps κ ∈ {20, 40, 80}, and we generate samples using SGLD with step-size 1 and noise standard deviation 0.01 as in Grathwohl et al. (2019).
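The effective sample size discussed in Appendix B.6 is simple to compute. A NumPy sketch reproducing the Gaussian-proposal experiment, working with log-space weights for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_sample_size(log_w):
    """ESS = 1 / Σᵢ w̄ᵢ² for self-normalized importance weights w̄ᵢ."""
    w = np.exp(log_w - log_w.max())    # subtract the max for stability
    w_bar = w / w.sum()
    return 1.0 / np.sum(w_bar**2)

# Target N(0, I) and proposal N(mu·1, I) in 128 dimensions, 20 samples,
# matching the setup described in the text.
d, n = 128, 20
ess = {}
for mu in [0.0, 2.0, 5.0]:
    x = rng.normal(size=(n, d)) + mu                                 # x ~ q
    log_w = -0.5 * (x**2).sum(axis=1) + 0.5 * ((x - mu)**2).sum(axis=1)
    ess[mu] = effective_sample_size(log_w)
# With mu = 0 the weights are uniform and ESS = n; as mu grows,
# one weight dominates and the ESS collapses toward 1.
```

This reproduces the qualitative behavior in Figure 6: a proposal matching the target gives ESS = N, while even modest per-dimension mean shifts in high dimensions drive the ESS to 1.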

C.2 SAMPLES FROM SSL MODELS

We present some samples from our semi-supervised MNIST models in Figure 8 . 

C.3 HYBRID MODELING

We present an extended comparison of hybrid modeling results on CIFAR10 (FID, lower is better; IS, higher is better):

Model                             FID     IS
SNGAN (Miyato et al., 2018)       25.50   8.59
NCSN (Song & Ermon, 2019b)        23.52   8.91
ADE (Dai et al., 2019)            N/A     7.55
IGEBM (Du & Mordatch, 2019)       37.9    8.30
Glow (Kingma & Dhariwal, 2018)    48.9    3.92
FCE (Gao et al., 2020)            37.

