GUIDING ENERGY-BASED MODELS VIA CONTRASTIVE LATENT VARIABLES

Abstract

An energy-based model (EBM) is a popular generative framework that offers both an explicit density and architectural flexibility, but training EBMs is difficult since it is often unstable and time-consuming. In recent years, various training techniques have been developed, e.g., better divergence measures or stabilization in MCMC sampling, but there often exists a large gap between EBMs and other generative frameworks like GANs in terms of generation quality. In this paper, we propose a novel and effective framework for improving EBMs via contrastive representation learning (CRL). To be specific, we consider representations learned by contrastive methods as the true underlying latent variable. This contrastive latent variable can guide EBMs to better understand the data structure, thereby significantly improving and accelerating EBM training. To enable the joint training of the EBM and CRL, we also design a new class of latent-variable EBMs for learning the joint density of data and the contrastive latent variable. Our experimental results demonstrate that our scheme achieves lower FID scores than prior EBM methods (e.g., those additionally using variational autoencoders or diffusion techniques), even with significantly faster and more memory-efficient training. We also show conditional and compositional generation abilities of our latent-variable EBMs as additional benefits, even without explicit conditional training.

1. INTRODUCTION

Generative modeling is a fundamental machine learning task for learning complex high-dimensional data distributions p_data(x). Among a number of generative frameworks, energy-based models (EBMs; LeCun et al., 2006; Salakhutdinov et al., 2007), whose density is proportional to the exponentiated negative energy, i.e., p_θ(x) ∝ exp(−E_θ(x)), have recently gained much attention due to their attractive properties. For example, EBMs can naturally provide an explicit (unnormalized) density, unlike generative adversarial networks (GANs; Goodfellow et al., 2014). Furthermore, they are much less restrictive in architectural design than other explicit-density models such as autoregressive (Oord et al., 2016b;a) and flow-based models (Rezende & Mohamed, 2015; Dinh et al., 2017). Hence, EBMs have found wide applications, including image inpainting (Du & Mordatch, 2019), hybrid discriminative-generative models (Grathwohl et al., 2019; Yang & Ji, 2021), protein design (Ingraham et al., 2019; Du et al., 2020b), and text generation (Deng et al., 2020).

Despite these attractive properties, training EBMs has remained challenging; e.g., it often suffers from instability due to intractable sampling and the absence of the normalizing constant. Recently, various techniques have been developed for improving training stability and sample quality, for example, gradient clipping (Du & Mordatch, 2019), short MCMC runs (Nijkamp et al., 2019), data augmentations in MCMC sampling (Du et al., 2021), and better divergence measures (Yu et al., 2020; 2021; Du et al., 2021). To further improve EBMs, several recent attempts incorporate other generative models into EBM training, e.g., variational autoencoders (VAEs) (Xiao et al., 2021), flow models (Gao et al., 2020; Xie et al., 2022), or diffusion techniques (Gao et al., 2021).
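As background for the training difficulties discussed above, EBM training typically draws negative samples from p_θ(x) ∝ exp(−E_θ(x)) via MCMC, most commonly unadjusted Langevin dynamics. The following is a minimal illustrative sketch (not the paper's training code) on a toy quadratic energy whose Boltzmann density is a standard Gaussian; the step size and number of steps are arbitrary choices for this example.

```python
import numpy as np

def grad_energy(x):
    # Gradient of the toy energy E(x) = 0.5 * ||x||^2, whose density
    # exp(-E(x)) is (proportional to) a standard Gaussian.
    return x

def langevin_sample(x0, step=0.01, n_steps=2000, rng=None):
    # Unadjusted Langevin dynamics:
    #   x <- x - (step / 2) * grad E(x) + sqrt(step) * noise.
    rng = np.random.default_rng(0) if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step * grad_energy(x) + np.sqrt(step) * noise
    return x

# Start far from the mode; the chain should mix toward N(0, I).
samples = langevin_sample(np.full((5000, 2), 3.0))
print(samples.mean(), samples.std())
```

In practice E_θ is a neural network and its gradient is obtained by backpropagation, which is what makes each sampling step, and hence EBM training, expensive.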
However, they often require a high computational cost for training such an extra generative model, or there still exists a large gap between EBMs and state-of-the-art generative frameworks like GANs (Kang et al., 2021) or score-based models (Vahdat et al., 2021). Instead of utilizing extra expensive generative models, in this paper, we ask whether EBMs can be improved by other low-cost unsupervised techniques. To this end, we are inspired by recent advances in the unsupervised representation learning literature (Chen et al., 2020; Grill et al., 2020; He et al., 2021), especially by the fact that discriminative representations can be obtained much more easily than generative models. Interestingly, such representations have been used to detect out-of-distribution samples (Hendrycks et al., 2019a;b), so we expect that training EBMs can benefit from good representations. In particular, we primarily focus on contrastive representation learning (Oord et al., 2018; Chen et al., 2020; He et al., 2020) since it can learn instance discriminability, which has been shown to be effective not only in representation learning, but also in training GANs (Jeong & Shin, 2021; Kang et al., 2021) and out-of-distribution detection (Tack et al., 2020).

In this paper, we propose Contrastive Latent-guided Energy Learning (CLEL), a simple yet effective framework for improving EBMs via contrastive representation learning (CRL). Our CLEL consists of two components, which are illustrated in Figure 1.

• Contrastive latent encoder. Our key idea is to consider representations learned by CRL as samples from the true underlying latent-variable distribution p_data(z|x). Specifically, we train an encoder h_ϕ via CRL and treat the encoded representation z := h_ϕ(x) as the true latent variable given data x, i.e., z ∼ p_data(·|x). This latent variable can guide EBMs to understand the underlying data structure more quickly and accelerate training, since it contains semantic information about the data thanks to CRL. Here, we assume the latent variables are spherical, i.e., ∥z∥₂ = 1, since recent CRL methods (He et al., 2020; Chen et al., 2020) use the cosine distance on the latent space.

• Spherical latent-variable EBM. We introduce a new class of latent-variable EBMs p_θ(x, z) for modeling the joint distribution p_data(x, z) generated by the contrastive latent encoder. Since the latent variables are spherical, we separate the output vector f := f_θ(x) into its norm ∥f∥₂ and direction f/∥f∥₂ for modeling p_θ(x) and p_θ(z|x), respectively. We found that this separation reduces the conflict between the p_θ(x) and p_θ(z|x) optimizations, which stabilizes training. In addition, we treat the latent variables drawn from our EBM, z ∼ p_θ(z), as additional negatives in CRL, which further improves CLEL. Namely, CRL guides the EBM and vice versa.¹

We demonstrate the effectiveness of the proposed framework through extensive experiments. For example, our EBM achieves an 8.61 FID under unconditional CIFAR-10 generation, which is lower than those of existing EBM models. Here, we remark that incorporating CRL into our EBM training increases training time by only 10% in our experiments (e.g., 38→41 GPU hours). This enables us to achieve the lower FID score even with significantly fewer computational resources (e.g., we use a single RTX 3090 GPU) than the prior EBMs that utilize VAEs (Xiao et al., 2021) or diffusion-based recovery likelihood (Gao et al., 2021). Furthermore, even without explicit conditional training, our latent-variable EBMs also enable conditional and compositional generation.
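The norm/direction split described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact parameterization: the function name `split_energy_head`, the weight `alpha`, and the additive combination of the two terms are all assumptions made for this example.

```python
import numpy as np

def split_energy_head(f, z, alpha=1.0):
    # Split the EBM output f = f_theta(x) into its norm and direction:
    # the norm plays the role of the marginal energy for p_theta(x),
    # while the direction is matched against the unit-norm latent
    # z ~ p_data(z|x) via cosine similarity, modeling p_theta(z|x).
    norm = np.linalg.norm(f, axis=-1, keepdims=True)
    direction = f / norm
    energy_x = norm.squeeze(-1)            # marginal term, p_theta(x)
    cos_sim = (direction * z).sum(-1)      # conditional term, p_theta(z|x)
    # Lower joint energy = higher joint density p_theta(x, z).
    return energy_x - alpha * cos_sim

f = np.array([[3.0, 4.0]])
z = np.array([[0.6, 0.8]])  # unit vector aligned with f's direction
print(split_energy_head(f, z))
```

Because the two terms depend on disjoint components of f (magnitude vs. direction), gradients on the conditional term cannot directly fight gradients on the marginal term, which is one plausible reading of why the separation stabilizes training.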



¹ The representation quality of CRL on classification tasks did not improve much in our experiments under the joint training of CRL and EBM. Hence, we only report the performance of the EBM, not that of CRL.



Figure 1: Illustration of the proposed Contrastive Latent-guided Energy Learning (CLEL) framework. (a) Our spherical latent-variable EBM (f_θ, g_θ) learns the joint data distribution p_data(x, z) generated by our contrastive latent encoder h_ϕ. (b) The encoder h_ϕ is trained by contrastive learning with additional negative variables z ∼ p_θ(z). Here, z = h_ϕ(t_i(x))/∥h_ϕ(t_i(x))∥₂, where t_i ∼ T denotes a random augmentation and sg(·) denotes the stop-gradient operation.
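The contrastive objective with extra EBM-sampled negatives, as depicted in panel (b), can be sketched with a standard InfoNCE-style loss. This is an illustrative stand-in under assumed conventions (the function name `info_nce`, the temperature `tau`, and the way extra negatives are concatenated are not taken from the paper); stop-gradient on the EBM negatives would be applied upstream.

```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit sphere (cosine-similarity space).
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def info_nce(z1, z2, z_neg, tau=0.1):
    # InfoNCE-style loss over unit-norm embeddings.
    # z1[i], z2[i]: two augmented views of image i (the positive pair).
    # z_neg: extra negatives, e.g., latents z ~ p_theta(z) from the EBM.
    logits = z1 @ np.concatenate([z2, z_neg]).T / tau   # (N, N + M)
    m = logits.max(axis=1, keepdims=True)               # stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(1, keepdims=True)))
    n = len(z1)
    return -log_probs[np.arange(n), np.arange(n)].mean()

rng = np.random.default_rng(0)
anchor = normalize(rng.standard_normal((8, 16)))
views = normalize(anchor + 0.05 * rng.standard_normal((8, 16)))  # nearby views
negatives = normalize(rng.standard_normal((32, 16)))             # EBM stand-ins
print(info_nce(anchor, views, negatives))
```

Adding EBM samples to the negative pool pushes the encoder to distinguish real data from the model's current samples, which is how CRL and the EBM can guide each other in this framework.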

Code is available at https://github.com/hankook/CLEL.

