NO MCMC FOR ME: AMORTIZED SAMPLING FOR FAST AND STABLE TRAINING OF ENERGY-BASED MODELS

Abstract

Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem, as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work we present a simple method for training EBMs at scale, which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and more stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains.

1. INTRODUCTION

Energy-Based Models (EBMs) have recently regained popularity within machine learning, partly inspired by the impressive results of Du & Mordatch (2019) and Song & Ermon (2020) on large-scale image generation. Beyond image generation, EBMs have also been successfully applied to a wide variety of tasks, including out-of-distribution detection (Grathwohl et al., 2019; Du & Mordatch, 2019; Song & Ou, 2018), adversarial robustness (Grathwohl et al., 2019; Hill et al., 2020; Du & Mordatch, 2019), reliable classification (Grathwohl et al., 2019; Liu & Abbeel, 2020), and semi-supervised learning (Song & Ou, 2018; Zhao et al.). Strikingly, these EBM approaches outperform alternative classes of generative models and rival hand-tailored solutions on each task. Despite this progress, training EBMs remains challenging. As shown in Table 1, existing training methods are all deficient in at least one important practical aspect. Markov chain Monte Carlo (MCMC) methods are slow and unstable during training (Nijkamp et al., 2019a; Grathwohl et al., 2020). Score matching methods, which minimize alternative divergences, are also unstable, and most cannot be used with discontinuous nonlinearities such as ReLU (Song & Ermon, 2019b; Hyvärinen, 2005; Song et al., 2020; Pang et al., 2020b; Grathwohl et al., 2020; Vincent, 2011). Noise contrastive approaches, which learn energy functions through density ratio estimation, typically do not scale well to high-dimensional data (Gao et al., 2020; Rhodes et al., 2020; Gutmann & Hyvärinen, 2010; Ceylan & Gutmann, 2018). Trade-offs must be made when training unnormalized models, and no approach to date satisfies all of these properties.

Figure 1: Comparison of EBMs trained with VERA and PCD. As the entropy regularization weight goes to 1, the VERA density becomes more accurate. For PCD, all samplers produce high-quality samples but low-quality density models, as the distribution of MCMC samples may be arbitrarily far from the model density.

In this work, we present a simple method for training EBMs which performs as well as previous methods while being faster and substantially easier to tune. Our method is based on reinterpreting maximum likelihood as a bi-level variational optimization problem, a perspective previously explored for EBM training (Dai et al., 2019). This perspective allows us to amortize away MCMC sampling into a GAN-style generator which is encouraged to have high entropy. We accomplish this with a novel approach to entropy regularization based on a fast variational approximation. This leads to the method we call Variational Entropy Regularized Approximate maximum likelihood (VERA). Concretely, we make the following contributions:

• We improve the MCMC-based entropy regularizer of Dieng et al. (2019) with a parallelizable variational approximation.

• We show that an entropy-regularized generator can be used to produce a variational bound on the EBM likelihood which can be optimized more easily than MCMC-based estimators.

• We demonstrate that models trained in this way achieve much higher likelihoods than models trained with alternative EBM training procedures.

• We show that our approach stabilizes and accelerates the training of recently proposed Joint Energy Models (Grathwohl et al., 2019).

• We show that this stabilization allows us to use JEM for semi-supervised learning, outperforming virtual adversarial training when little prior domain knowledge is available (e.g., for tabular data).
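The variational bound referred to in the second contribution can be checked numerically in a tractable one-dimensional Gaussian case. The sketch below is our own illustration (none of these names appear in the paper): for a quadratic $f$, the quantity $\mathbb{E}_q[f] + H(q)$ lower-bounds $\log Z(\theta)$ for any Gaussian generator $q$, and the bound is tight exactly when $q$ matches the model density.

```python
import numpy as np

# Toy EBM: f(x) = -x^2 / (2 * sigma^2), so p(x) is N(0, sigma^2)
# and log Z(theta) = 0.5 * log(2 * pi * sigma^2) in closed form.
sigma = 1.5
log_Z = 0.5 * np.log(2 * np.pi * sigma**2)

def elbo(s):
    """E_q[f] + H(q) for a Gaussian generator q = N(0, s^2).

    E_q[-x^2 / (2 sigma^2)] = -s^2 / (2 sigma^2), and the Gaussian
    entropy is H(q) = 0.5 * log(2 * pi * e * s^2); both are closed form.
    """
    return -s**2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * np.e * s**2)

scales = np.linspace(0.2, 4.0, 200)
bounds = np.array([elbo(s) for s in scales])

# The bound holds for every generator scale and is tight at s = sigma,
# i.e. when the generator q equals the model density p.
assert np.all(bounds <= log_Z + 1e-9)
print(f"log Z = {log_Z:.4f}, gap at s = sigma: {log_Z - elbo(sigma):.2e}")
```

In VERA, the generator plays the role of $q$ here, and the entropy term $H(q)$ is the piece that is intractable for a deep generator; estimating it efficiently is what the paper's variational approximation addresses.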

2. ENERGY-BASED MODELS

An energy-based model (EBM) is any model which parameterizes a density as $p_\theta(x) = \frac{e^{f_\theta(x)}}{Z(\theta)}$, where $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$ and $Z(\theta) = \int e^{f_\theta(x)}\, dx$ is the normalizing constant, which is not explicitly modeled. Any probability distribution can be represented in this way for some $f_\theta$. The energy-based



Table 1: Features of EBM training approaches.
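The parameterization above can be made concrete in one dimension, where the normalizing constant is cheap to estimate on a grid. The following is our own minimal sketch, not code from the paper: it defines an unnormalized energy function, recovers $Z(\theta)$ numerically, and checks that the resulting density integrates to 1; in high dimensions this integral is intractable, which is why EBM training methods work with the unnormalized $f_\theta$ alone.

```python
import numpy as np

def f(x, theta=(1.0, 0.5)):
    # A simple parametric stand-in for the "energy network" f_theta(x).
    a, b = theta
    return -a * x**2 + b * x

# Z(theta) = integral of exp(f_theta(x)) dx, estimated with a Riemann
# sum on a fine 1-D grid (the tails are negligible well before |x| = 10).
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
Z = np.sum(np.exp(f(grid))) * dx

def log_p(x):
    # Normalized model log-density: log p_theta(x) = f_theta(x) - log Z(theta).
    return f(x) - np.log(Z)

# Sanity check: the normalized density integrates to 1 on the same grid.
total_mass = np.sum(np.exp(log_p(grid))) * dx
print(f"Z = {Z:.4f}, total probability mass = {total_mass:.6f}")
```

For this choice of $f$, the integral also has a closed form, $\int e^{-a x^2 + b x}\, dx = \sqrt{\pi / a}\, e^{b^2 / 4a}$, so the grid estimate can be checked against it directly.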

