GUIDING ENERGY-BASED MODELS VIA CONTRASTIVE LATENT VARIABLES

Abstract

Energy-based models (EBMs) are a popular generative framework offering both explicit densities and architectural flexibility, but training them is difficult since it is often unstable and time-consuming. In recent years, various training techniques have been developed, e.g., better divergence measures and stabilized MCMC sampling, but a large gap often remains between EBMs and other generative frameworks like GANs in terms of generation quality. In this paper, we propose a novel and effective framework for improving EBMs via contrastive representation learning (CRL). Specifically, we treat the representations learned by contrastive methods as the true underlying latent variable. This contrastive latent variable guides EBMs to better understand the data structure, thereby significantly improving and accelerating EBM training. To enable joint training of the EBM and CRL, we also design a new class of latent-variable EBMs that learn the joint density of data and the contrastive latent variable. Our experimental results demonstrate that our scheme achieves lower FID scores than prior EBM methods (e.g., those additionally using variational autoencoders or diffusion techniques), even with significantly faster and more memory-efficient training. We also show that our latent-variable EBMs offer conditional and compositional generation as additional benefits, even without explicit conditional training. The code is available at https://github.com/hankook/CLEL.

1. INTRODUCTION

Generative modeling is a fundamental machine learning task for learning complex high-dimensional data distributions p_data(x). Among a number of generative frameworks, energy-based models (EBMs; LeCun et al., 2006; Salakhutdinov et al., 2007), whose density is proportional to the exponentiated negative energy, i.e., p_θ(x) ∝ exp(−E_θ(x)), have recently gained much attention due to their attractive properties. For example, EBMs naturally provide an explicit (unnormalized) density, unlike generative adversarial networks (GANs; Goodfellow et al., 2014). Furthermore, they are much less restrictive in architectural design than other explicit density models such as autoregressive (Oord et al., 2016b;a) and flow-based models (Rezende & Mohamed, 2015; Dinh et al., 2017). Hence, EBMs have found wide applications, including image inpainting (Du & Mordatch, 2019), hybrid discriminative-generative models (Grathwohl et al., 2019; Yang & Ji, 2021), protein design (Ingraham et al., 2019; Du et al., 2020b), and text generation (Deng et al., 2020).

Despite these attractive properties, training EBMs remains challenging; e.g., it often suffers from instability due to the intractable sampling and the absence of the normalizing constant. Recently, various techniques have been developed for improving training stability and the quality of generated samples, for example, gradient clipping (Du & Mordatch, 2019), short MCMC runs (Nijkamp et al., 2019), data augmentations in MCMC sampling (Du et al., 2021), and better divergence measures (Yu et al., 2020; 2021; Du et al., 2021). To further improve EBMs, there are several recent attempts to incorporate other generative models into EBM training, e.g., variational autoencoders (VAEs) (Xiao et al., 2021), flow models (Gao et al., 2020; Xie et al., 2022), or diffusion techniques (Gao et al., 2021). However, they often require a high computational cost for training such an extra generative model, and a large gap still exists between EBMs and state-of-the-art generative frameworks like GANs (Kang et al., 2021) or score-based models (Vahdat et al., 2021).
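Since p_θ(x) ∝ exp(−E_θ(x)) is only known up to a normalizing constant, EBM training typically draws negative samples with MCMC, most commonly unadjusted Langevin dynamics. The following is a minimal, hypothetical sketch (not the paper's implementation): it uses plain NumPy and a hand-coded quadratic energy in place of a learned neural network E_θ, whose gradient would normally come from automatic differentiation.

```python
import numpy as np

def energy(x):
    # Hypothetical quadratic energy E(x) = 0.5 * ||x||^2,
    # standing in for a learned neural-network energy E_theta.
    return 0.5 * np.sum(x ** 2)

def energy_grad(x):
    # Analytic gradient of the quadratic energy above; for a
    # neural EBM this would come from automatic differentiation.
    return x

def langevin_sample(x0, step=0.1, n_steps=500, rng=None):
    # Unadjusted Langevin dynamics: descend the energy gradient
    # while injecting Gaussian noise, so iterates approximately
    # follow p(x) ∝ exp(-E(x)).
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * energy_grad(x) + np.sqrt(2.0 * step) * noise
    return x

sample = langevin_sample(np.full(2, 5.0))
print(sample)  # roughly a draw from a standard 2-D Gaussian
```

"Short-run" MCMC corresponds to using a small `n_steps` from a fixed or persistent initialization, trading sample quality for training speed.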

