GUIDING ENERGY-BASED MODELS VIA CONTRASTIVE LATENT VARIABLES

Abstract

An energy-based model (EBM) is a popular generative framework that offers both an explicit density and architectural flexibility, but training EBMs is difficult since it is often unstable and time-consuming. In recent years, various training techniques have been developed, e.g., better divergence measures or stabilization in MCMC sampling, but there often remains a large gap between EBMs and other generative frameworks like GANs in terms of generation quality. In this paper, we propose a novel and effective framework for improving EBMs via contrastive representation learning (CRL). To be specific, we consider representations learned by contrastive methods as the true underlying latent variable. This contrastive latent variable can guide EBMs to better understand the data structure, thereby significantly improving and accelerating EBM training. To enable the joint training of the EBM and CRL, we also design a new class of latent-variable EBMs for learning the joint density of data and the contrastive latent variable. Our experimental results demonstrate that our scheme achieves lower FID scores than prior-art EBM methods (e.g., those additionally using variational autoencoders or diffusion techniques), even with significantly faster and more memory-efficient training. We also show conditional and compositional generation abilities of our latent-variable EBMs as additional benefits, even without explicit conditional training.

1. INTRODUCTION

Generative modeling is a fundamental machine learning task for learning complex high-dimensional data distributions p data (x). Among a number of generative frameworks, energy-based models (EBMs, LeCun et al., 2006; Salakhutdinov et al., 2007) , whose density is proportional to the exponential negative energy, i.e., p θ (x) ∝ exp(-E θ (x)), have recently gained much attention due to their attractive properties. For example, EBMs can naturally provide the explicit (unnormalized) density, unlike generative adversarial networks (GANs, Goodfellow et al., 2014) . Furthermore, they are much less restrictive in architectural designs than other explicit density models such as autoregressive (Oord et al., 2016b; a) and flow-based models (Rezende & Mohamed, 2015; Dinh et al., 2017) . Hence, EBMs have found wide applications, including image inpainting (Du & Mordatch, 2019) , hybrid discriminative-generative models (Grathwohl et al., 2019; Yang & Ji, 2021) , protein design (Ingraham et al., 2019; Du et al., 2020b) , and text generation (Deng et al., 2020) . Despite the attractive properties, training EBMs has remained challenging; e.g., it often suffers from the training instability due to the intractable sampling and the absence of the normalizing constant. Recently, various techniques have been developed for improving the training stability and the quality of generated samples, for example, gradient clipping (Du & Mordatch, 2019) , short MCMC runs (Nijkamp et al., 2019) , data augmentations in MCMC sampling (Du et al., 2021) , and better divergence measures (Yu et al., 2020; 2021; Du et al., 2021) . To further improve EBMs, there are several recent attempts to incorporate other generative models into EBM training, e.g., variational autoencoders (VAEs) (Xiao et al., 2021) , flow models (Gao et al., 2020; Xie et al., 2022) , or diffusion techniques (Gao et al., 2021) . 
However, they often require a high computational cost for training such an extra generative model, or there still exists a large gap between EBMs and state-of-the-art generative frameworks like GANs (Kang et al., 2021) or score-based models (Vahdat et al., 2021).

Figure 1: (a) Our spherical latent-variable EBM (f_θ, g_θ) learns the joint data distribution p_data(x, z) generated by our contrastive latent encoder h_ϕ, by minimizing D_KL(p_data(x, z) ∥ p_θ(x, z)). (b) The encoder h_ϕ is trained by contrastive learning with additional negative variables z ∼ p_θ(z). Here, z_i = h_ϕ(t_i(x)) / ∥h_ϕ(t_i(x))∥₂, where t_i ∼ T denotes a random augmentation and sg(·) denotes the stop-gradient operation.

Instead of utilizing extra expensive generative models, in this paper, we ask whether EBMs can be improved by other unsupervised techniques of low cost. To this end, we are inspired by recent advances in the unsupervised representation learning literature (Chen et al., 2020; Grill et al., 2020; He et al., 2021), especially by the fact that discriminative representations can be obtained much more easily than through generative modeling. Interestingly, such representations have been used to detect out-of-distribution samples (Hendrycks et al., 2019a;b), so we expect that training EBMs can benefit from good representations.
In particular, we primarily focus on contrastive representation learning (Oord et al., 2018; Chen et al., 2020; He et al., 2020) since it can learn instance discriminability, which has been shown to be effective not only in representation learning, but also in training GANs (Jeong & Shin, 2021; Kang et al., 2021) and out-of-distribution detection (Tack et al., 2020). In this paper, we propose Contrastive Latent-guided Energy Learning (CLEL), a simple yet effective framework for improving EBMs via contrastive representation learning (CRL). Our CLEL consists of two components, which are illustrated in Figure 1.

• Contrastive latent encoder. Our key idea is to consider representations learned by CRL as an underlying latent variable distribution p_data(z|x). Specifically, we train an encoder h_ϕ via CRL, and treat the encoded representation z := h_ϕ(x) as the true latent variable given data x, i.e., z ∼ p_data(·|x). This latent variable can guide EBMs to understand the underlying data structure and accelerate training, since it contains semantic information of the data thanks to CRL. Here, we assume the latent variables are spherical, i.e., ∥z∥₂ = 1, since recent CRL methods (He et al., 2020; Chen et al., 2020) use the cosine distance on the latent space.

• Spherical latent-variable EBM. We introduce a new class of latent-variable EBMs p_θ(x, z) for modeling the joint distribution p_data(x, z) generated by the contrastive latent encoder. Since the latent variables are spherical, we separate the output vector f := f_θ(x) into its norm ∥f∥₂ and direction f/∥f∥₂ for modeling p_θ(x) and p_θ(z|x), respectively. We found that this separation technique reduces the conflict between the p_θ(x) and p_θ(z|x) optimizations, which stabilizes training. In addition, we treat the latent variables drawn from our EBM, z ∼ p_θ(z), as additional negatives in CRL, which further improves our CLEL. Namely, CRL guides the EBM and vice versa.
We demonstrate the effectiveness of the proposed framework through extensive experiments. For example, our EBM achieves 8.61 FID for unconditional CIFAR-10 generation, which is lower than those of existing EBM models. Here, we remark that incorporating CRL into our EBM training increases the training time by only 10% in our experiments (e.g., 38→41 GPU hours). This enables us to achieve the lower FID score even with significantly fewer computational resources (e.g., we use only a single RTX 3090 GPU) than the prior EBMs that utilize VAEs (Xiao et al., 2021) or diffusion-based recovery likelihood (Gao et al., 2021). Furthermore, even without explicit conditional training, our latent-variable EBMs can naturally provide the latent-conditional density p_θ(x|z); we verify its effectiveness in various applications: out-of-distribution (OOD) detection, conditional sampling, and compositional sampling. For example, OOD detection using the conditional density shows superiority over various likelihood-based models. Finally, we remark that our idea is not limited to contrastive representation learning: we show EBMs can also be improved by other representation learning methods like BYOL (Grill et al., 2020) or MAE (He et al., 2021) (see Section 4.5).

2. PRELIMINARIES

In this work, we mainly consider unconditional generative modeling: given a set of i.i.d. samples {x^(i)}_{i=1}^N drawn from an unknown data distribution p_data(x), our goal is to learn a model distribution p_θ(x), parameterized by θ, that approximates p_data(x). To this end, we parameterize p_θ(x) using energy-based models (EBMs) and incorporate contrastive representation learning (CRL) into EBMs to improve them. We briefly describe the concepts of EBMs and CRL in Section 2.1 and Section 2.2, respectively, and then introduce our framework in Section 3.

2.1. ENERGY-BASED MODELS

An energy-based model (EBM) is a probability distribution on R^{d_x}, defined as follows: for x ∈ R^{d_x},

p_θ(x) = exp(−E_θ(x)) / Z_θ,  Z_θ = ∫_{R^{d_x}} exp(−E_θ(x)) dx,  (1)

where E_θ(x) is the energy function parameterized by θ and Z_θ denotes the normalizing constant, called the partition function. An important application of EBMs is to find a parameter θ such that p_θ is close to p_data. A popular method for finding such a θ is to minimize the Kullback-Leibler (KL) divergence between p_data and p_θ via gradient descent:

D_KL(p_data ∥ p_θ) = −E_{x∼p_data}[log p_θ(x)] + Constant,  (2)
∇_θ D_KL(p_data ∥ p_θ) = E_{x∼p_data}[∇_θ E_θ(x)] − E_{x∼p_θ}[∇_θ E_θ(x)].  (3)

Since this gradient computation (3) is NP-hard in general (Jerrum & Sinclair, 1993), it is often approximated via Markov chain Monte Carlo (MCMC) methods. In this work, we use stochastic gradient Langevin dynamics (SGLD, Welling & Teh, 2011), a gradient-based MCMC method for approximate sampling. Specifically, at the (t+1)-th iteration, SGLD updates the current sample x^(t) to x^(t+1) using the following procedure:

x^(t+1) ← x^(t) + (ε²/2) ∇_x log p_θ(x^(t)) + ε δ^(t),  δ^(t) ∼ N(0, I),  (4)

where ∇_x log p_θ(x^(t)) = −∇_x E_θ(x^(t)), ε > 0 is a predefined step size, x^(0) denotes an initial state, and N denotes the multivariate normal distribution. It is known that the distribution of x^(T) (weakly) converges to p_θ for small enough ε and large enough T under various assumptions (Vollmer et al., 2016; Raginsky et al., 2017; Xu et al., 2018; Zou et al., 2021).

Latent-variable energy-based models. EBMs can naturally incorporate a latent variable by specifying the joint density p_θ(x, z) ∝ exp(−E_θ(x, z)) of observed data x and the latent variable z. This class includes a number of EBMs, e.g., deep Boltzmann machines (Salakhutdinov & Hinton, 2009) and conjugate EBMs (Wu et al., 2021).
Similar to standard EBMs, these latent-variable EBMs can be trained by minimizing the KL divergence between p_data(x) and p_θ(x) as described in (3):

∇_θ D_KL(p_data ∥ p_θ) = E_{x∼p_data(x)}[∇_θ E_θ(x)] − E_{x∼p_θ(x)}[∇_θ E_θ(x)]  (5)
= E_{x∼p_data(x), z∼p_θ(z|x)}[∇_θ E_θ(x, z)] − E_{x∼p_θ(x), z∼p_θ(z|x)}[∇_θ E_θ(x, z)],  (6)

where E_θ(x) is the marginal energy, i.e., E_θ(x) = −log ∫ exp(−E_θ(x, z)) dz.
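As a concrete illustration, the SGLD sampler above can be sketched in a few lines. This is our own minimal numpy sketch, not the paper's implementation; the quadratic toy energy is a stand-in for a learned E_θ.

```python
import numpy as np

def grad_energy(x):
    # Toy energy E(x) = 0.5 * ||x||^2, whose gradient is x.
    # Stand-in for a learned E_theta; the SGLD drift is -grad E.
    return x

def sgld(x0, grad_energy, eps=0.3, n_steps=300, seed=0):
    """SGLD update (eq. 4): x <- x + (eps^2 / 2) * (-grad E(x)) + eps * N(0, I)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - 0.5 * eps**2 * grad_energy(x) + eps * rng.standard_normal(x.shape)
    return x

# For E(x) = 0.5||x||^2, the stationary distribution is approximately N(0, I).
samples = sgld(np.zeros((2000, 2)), grad_energy)
```

In practice, EBMs run such chains on images with a small step size and additional tricks (replay buffers, augmentation transitions) discussed in Section 3.3.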

2.2. CONTRASTIVE REPRESENTATION LEARNING

Generally speaking, contrastive learning aims to learn a meaningful representation by minimizing the distance between similar (i.e., positive) samples and maximizing the distance between dissimilar (i.e., negative) samples in the representation space. Formally, let h_ϕ : R^{d_x} → R^{d_z} be a ϕ-parameterized encoder, and let (x, x⁺) and (x, x⁻) be positive and negative pairs, respectively. Contrastive learning then maximizes sim(h_ϕ(x), h_ϕ(x⁺)) and minimizes sim(h_ϕ(x), h_ϕ(x⁻)), where sim(·, ·) is a similarity metric defined on the representation space R^{d_z}. Under the unsupervised setup, various methods for constructing positive and negative pairs have been proposed, e.g., data augmentations (He et al., 2020; Chen et al., 2020; Tian et al., 2020), spatial or temporal co-occurrence (Oord et al., 2018), and image channels (Tian et al., 2019).

In this work, we mainly focus on a popular contrastive learning framework, SimCLR (Chen et al., 2020), which constructs positive and negative pairs via various data augmentations such as cropping and color jittering. Specifically, given a mini-batch B = {x^(i)}_{i=1}^n, SimCLR first constructs two augmented views v_j^(i) := t_j^(i)(x^(i)), j ∈ {1, 2}, of each sample x^(i) via random augmentations t_j^(i) ∼ T. It then considers (v_1^(i), v_2^(i)) as a positive pair and (v_1^(i), v_2^(k)) as a negative pair for all k ≠ i. The SimCLR objective L_SimCLR is defined as follows:

L_SimCLR(B; ϕ, τ) = (1/2n) Σ_{i=1}^n Σ_{j=1,2} L_NT-Xent(h_ϕ(v_j^(i)), h_ϕ(v_{3−j}^(i)), {h_ϕ(v_l^(k))}_{k≠i, l∈{1,2}}; τ),  (7)

L_NT-Xent(z, z⁺, {z⁻}; τ) = −log [ exp(sim(z, z⁺)/τ) / ( exp(sim(z, z⁺)/τ) + Σ_{z⁻} exp(sim(z, z⁻)/τ) ) ],  (8)

where sim(u, v) = u⊤v / (∥u∥₂∥v∥₂) is the cosine similarity, τ is a hyperparameter for temperature scaling, and L_NT-Xent denotes the normalized temperature-scaled cross entropy (Chen et al., 2020).
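As a minimal sketch (ours, not the SimCLR authors' code), the NT-Xent loss above can be written as:

```python
import numpy as np

def cos_sim(u, v):
    # Cosine similarity sim(u, v) = u^T v / (||u||_2 ||v||_2).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nt_xent(z, z_pos, z_negs, tau=0.5):
    """L_NT-Xent(z, z+, {z-}; tau): softmax cross entropy of the positive
    pair against the positive plus all negatives, at temperature tau."""
    pos = np.exp(cos_sim(z, z_pos) / tau)
    negs = sum(np.exp(cos_sim(z, zn) / tau) for zn in z_negs)
    return -np.log(pos / (pos + negs))

# An aligned positive with an orthogonal negative yields a small loss.
loss = nt_xent(np.array([1.0, 0.0]), np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
```

Swapping the positive and negative directions makes the loss grow, which is exactly the instance-discrimination signal exploited later in our framework.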

3. METHOD

Recall that our goal is to learn an energy-based model (EBM) p_θ(x) ∝ exp(−E_θ(x)) that approximates a complex underlying data distribution p_data(x). In this work, we propose Contrastive Latent-guided Energy Learning (CLEL), a simple yet effective framework for improving EBMs via contrastive representation learning. Our key idea is that directly incorporating semantically meaningful contexts of data could improve EBMs. To this end, we consider the (random) representation z ∼ p_data(z|x) of x, generated by contrastive learning, as the underlying latent variable. Namely, we model the joint distribution p_data(x, z) = p_data(x) p_data(z|x) via a latent-variable EBM p_θ(x, z). Our intuition on the benefit of modeling p_data(x, z) is two-fold: (i) conditional generative modeling of p_data(x|z) given some good contexts (e.g., labels) of data is much easier than unconditional modeling of p_data(x) (Mirza & Osindero, 2014; Van den Oord et al., 2016; Reed et al., 2016), and (ii) the mode collapse problem of generation can be resolved by predicting the contexts p_data(z|x) (Odena et al., 2017; Bang & Shim, 2021). The detailed implementations of p_data(z|x), called the contrastive latent encoder, and the latent-variable EBM are described in Sections 3.1 and 3.2, respectively, while Section 3.3 presents how to train them in detail. Our overall framework is illustrated in Figure 1.

3.1. CONTRASTIVE LATENT ENCODER

To construct a meaningful latent distribution p_data(z|x) for improving EBMs, we use contrastive representation learning. To be specific, we first train a latent encoder h_ϕ : R^{d_x} → R^{d_z}, a deep neural network (DNN) parameterized by ϕ, using a variant of the SimCLR objective L_SimCLR (7) (we describe its details in Section 3.3) with a random augmentation distribution T. Since our objective only measures the cosine similarity between distinct representations, one can view the encoder h_ϕ as mapping a randomly augmented sample to a unit vector. We define the latent sampling procedure z ∼ p_data(z|x) as follows:

z ∼ p_data(z|x)  ⇔  z = h_ϕ(t(x)) / ∥h_ϕ(t(x))∥₂,  t ∼ T.  (9)
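A sketch of this sampling procedure, with a toy linear map and an additive-noise "augmentation" as our own stand-ins for h_ϕ and T:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # toy linear "encoder" h_phi (stand-in)

def augment(x):
    # Toy random augmentation t ~ T: small additive noise (stand-in).
    return x + 0.1 * rng.standard_normal(x.shape)

def sample_latent(x):
    """z ~ p_data(z|x): encode a randomly augmented view and l2-normalize,
    i.e., z = h_phi(t(x)) / ||h_phi(t(x))||_2."""
    h = W @ augment(x)
    return h / np.linalg.norm(h)

z = sample_latent(np.ones(16))
```

The returned latent is always a unit vector, matching the spherical assumption ∥z∥₂ = 1 above.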

3.2. SPHERICAL LATENT-VARIABLE ENERGY-BASED MODELS

We use a DNN f_θ : R^{d_x} → R^{d_z} parameterized by θ for modeling p_θ(x, z). Since the latent variable z ∼ p_data(z|x) lies on the unit sphere, we utilize the directional information f_θ(x)/∥f_θ(x)∥₂ for modeling p_θ(z|x), while the remaining information ∥f_θ(x)∥₂ is used for modeling p_θ(x). We empirically found that this norm-direction separation stabilizes the latent-variable EBM training. For better modeling of p_θ(z|x), we additionally apply a directional projector g_θ : S^{d_z−1} → S^{d_z−1} to f_θ(x)/∥f_θ(x)∥₂, constructed as a two-layer MLP followed by ℓ₂ normalization. We found that it is useful for narrowing the gap between the distribution of the direction f_θ(x)/∥f_θ(x)∥₂ and the uniformly-distributed latent variable p_data(z) (see Section 4.5 and Appendix E for a detailed discussion). Overall, we define the joint energy E_θ(x, z) as follows:

E_θ(x, z) = (1/2)∥f_θ(x)∥₂² − β · g_θ(f_θ(x)/∥f_θ(x)∥₂)⊤ z,  (10)
E_θ(x) = −log ∫_{S^{d_z−1}} exp(−E_θ(x, z)) dz = (1/2)∥f_θ(x)∥₂² + Constant,  (11)

where β ≥ 0 is a hyperparameter. Note that the marginal energy E_θ(x) depends only on ∥f_θ(x)∥₂, since ∫_{S^{d_z−1}} exp(β g_θ(f_θ(x)/∥f_θ(x)∥₂)⊤ z) dz is independent of x by symmetry. Also, the norm-based design does not sacrifice the flexibility of energy modeling (see Appendix F for details).
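The norm-direction separation above can be sketched as follows; this is our own illustration in which g_θ defaults to the identity (ℓ₂-normalized) rather than the paper's two-layer MLP:

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def joint_energy(f, z, beta=0.01, g=l2_normalize):
    """E_theta(x, z) = 0.5 * ||f||_2^2 - beta * g(f / ||f||_2)^T z,
    where f = f_theta(x): the norm models p(x), the direction models p(z|x)."""
    direction = g(l2_normalize(f))
    return 0.5 * np.dot(f, f) - beta * np.dot(direction, z)

def marginal_energy(f):
    """E_theta(x) = 0.5 * ||f||_2^2 (up to an additive constant)."""
    return 0.5 * np.dot(f, f)

f = np.array([3.0, 4.0])        # toy value for f_theta(x)
z_aligned = l2_normalize(f)     # latent matching the output direction
```

Note that the joint energy is lowest when z aligns with the projected direction, while the marginal energy ignores z entirely, mirroring (10) and (11).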

3.3. TRAINING

Remark that Sections 3.1 and 3.2 define p_data(x, z) and p_θ(x, z), respectively. We now describe how to train the contrastive latent encoder h_ϕ and the spherical latent-variable EBM p_θ via mini-batch stochastic optimization (see Appendix A for the pseudo-code). Let {x^(i)}_{i=1}^n be real samples randomly drawn from the training dataset. We first generate n samples {x̃^(i)}_{i=1}^n ∼ p_θ(x) from the current EBM via stochastic gradient Langevin dynamics (SGLD) (4). Here, to reduce the computational complexity and improve the generation quality of SGLD, we use two techniques: a replay buffer to maintain Markov chains persistently (Du & Mordatch, 2019), and periodic data augmentation transitions to encourage exploration (Du et al., 2021). We then draw latent variables from p_data and p_θ: z^(i) ∼ p_data(z|x^(i)) and z̃^(i) ∼ p_θ(z|x̃^(i)) for all i. For the latter, we simply use the mode of p_θ(z|x̃^(i)) instead of sampling, namely, z̃^(i) := g_θ(f_θ(x̃^(i))/∥f_θ(x̃^(i))∥₂). Let B := {(x^(i), z^(i))}_{i=1}^n and B̃ := {(x̃^(i), z̃^(i))}_{i=1}^n be the real and generated mini-batches, respectively. Under this setup, we define the objective L_EBM for the EBM parameter θ as follows:

L_EBM(B, B̃; θ, α, β) = (1/n) Σ_{i=1}^n [ E_θ(x^(i), z^(i)) − E_θ(x̃^(i), z̃^(i)) + α · (E_θ(x^(i))² + E_θ(x̃^(i))²) ],  (12)

where the first two terms correspond to the empirical average of D_KL(p_data ∥ p_θ) and α is a hyperparameter for energy regularization to prevent divergence, following Du & Mordatch (2019). When training the latent encoder h_ϕ via contrastive learning, we use the SimCLR (Chen et al., 2020) loss L_SimCLR (7) with additional negative latent variables {z̃^(i)}_{i=1}^n.
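A minimal numpy sketch (ours, not the released implementation) of the EBM objective L_EBM above, with the networks replaced by toy feature vectors and the projector g_θ omitted:

```python
import numpy as np

def l2n(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def joint_E(f, z, beta=0.01):
    # E_theta(x, z) = 0.5*||f||^2 - beta * (f/||f||)^T z  (projector omitted).
    return 0.5 * np.sum(f * f, -1) - beta * np.sum(l2n(f) * z, -1)

def marginal_E(f):
    # E_theta(x) = 0.5*||f||^2 up to an additive constant.
    return 0.5 * np.sum(f * f, -1)

def loss_ebm(f_real, z_real, f_gen, z_gen, alpha=1.0, beta=0.01):
    """L_EBM: contrastive divergence between joint energies of real and
    generated pairs, plus alpha-weighted squared-energy regularization."""
    cd = joint_E(f_real, z_real, beta) - joint_E(f_gen, z_gen, beta)
    reg = alpha * (marginal_E(f_real) ** 2 + marginal_E(f_gen) ** 2)
    return float(np.mean(cd + reg))
```

Minimizing this loss pushes real energies down and generated energies up, while the α-term keeps both energies from drifting to extreme magnitudes.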
To be specific, we define the objective L_LE for the latent encoder parameter ϕ as follows:

L_LE(B, B̃; ϕ, τ) = (1/2n) Σ_{i=1}^n Σ_{j=1,2} L_NT-Xent(z_j^(i), z_{3−j}^(i), {z_l^(k)}_{k≠i, l∈{1,2}} ∪ {z̃^(i)}_{i=1}^n; τ),  (13)

where L_NT-Xent is the normalized temperature-scaled cross entropy defined in (8), τ is a hyperparameter for temperature scaling, z_j^(i) := h_ϕ(t_j^(i)(x^(i))), and t_j^(i) ∼ T are random augmentations. We found that using {z̃^(i)}_{i=1}^n as additional negative representations for contrastive learning increases the latent diversity, which further improves the generation quality in our CLEL framework. To sum up, our CLEL jointly optimizes the latent encoder h_ϕ and the latent-variable EBM (f_θ, g_θ) from scratch via the following optimization:

min_{ϕ,θ} E_{B,B̃}[ L_EBM(B, B̃; θ, α, β) + L_LE(B, B̃; ϕ, τ) ].

After training, we only utilize our latent-variable EBM (f_θ, g_θ) when generating samples; the latent encoder h_ϕ is used only for extracting the representation of a specific sample during training.

Table 1: FID scores for unconditional generation on CIFAR-10.

EBM-based models
IGEBM (Du & Mordatch, 2019) 38.20
FlowCE† (Gao et al., 2020) 37.30
VERA†‡ (Grathwohl et al., 2021) 27.50
Improved CD (Du et al., 2021) 25.10
BiDVL (Kan et al., 2022) 20.75
GEBM† (Arbel et al., 2021) 19.31
CF-EBM (Zhao et al., 2021) 16.71
CoopFlow† (Xie et al., 2022) 15.80
CLEL-Base (Ours) 15.27
VAEBM† (Xiao et al., 2021) 12.19
EBM-Diffusion (Gao et al., 2021) 9.58
CLEL-Large (Ours) 8.61

Method FID

Other likelihood models
PixelCNN (Oord et al., 2016b) 65.93
NVAE (Vahdat & Kautz, 2020) 51.67
Glow (Kingma & Dhariwal, 2018) 48.90
NCP-VAE (Aneja et al., 2021) 24.08

Score-based models
NCSN (Song & Ermon, 2019) 25.30
NCSNv2 (Song & Ermon, 2020) 10.87
DDPM (Ho et al., 2020) 3.17
NCSN++ (Song et al., 2021) 2.20

GAN-based models
StyleGAN2-DiffAugment (Zhao et al., 2020) 5.79
StyleGAN2-ADA (Karras et al., 2020) 2.92

4. EXPERIMENTS

4.1. UNCONDITIONAL IMAGE GENERATION

An important application of EBMs is to generate images using the energy function E_θ(x). To this end, we train our CLEL framework on CIFAR-10 (Krizhevsky et al., 2009) and ImageNet 32×32 (Deng et al., 2009; Chrabaszcz et al., 2017) under the unsupervised setting. We then generate 50k samples using SGLD and evaluate their quality using Fréchet Inception Distance (FID) scores (Heusel et al., 2017; Seitzer, 2020). The unconditionally generated samples are provided in Figure 2.

Tables 1 and 2 show the FID scores of our CLEL and other generative models for unconditional generation on CIFAR-10 and ImageNet 32×32, respectively. We first find that CLEL outperforms previous EBMs on both CIFAR-10 and ImageNet 32×32. As shown in Table 3, our method further benefits from a multi-scale architecture as in Du et al. (2021), contrastive representation learning (CRL) with a larger batch size, and more channels at the lower layers of our EBM f_θ. As a result, we achieve 8.61 FID on CIFAR-10, which is lower than that of the prior-art EBM based on diffusion recovery likelihood, EBM-Diffusion (Gao et al., 2021), even with 5× faster and 4× more memory-efficient training (when using a similar number of parameters for the EBMs). We thereby narrow the gap between EBMs and state-of-the-art frameworks like GANs without help from other generative models. We believe our CLEL can be further improved by incorporating an auxiliary generator (Arbel et al., 2021; Xiao et al., 2021) or diffusion (Gao et al., 2021), and we leave this for future work.

4.2. OUT-OF-DISTRIBUTION DETECTION

EBMs can also be used for detecting out-of-distribution (OOD) samples. For OOD detection, previous EBM-based approaches often use the (marginal) unnormalized likelihood p_θ(x) ∝ exp(−E_θ(x)). In contrast, our CLEL is capable of modeling the joint density p_θ(x, z) ∝ exp(−E_θ(x, z)). Using this capability, we propose an energy-based OOD detection score: given x,

s(x) := (1/2)∥f_θ(x)∥₂² − β · g_θ(f_θ(x)/∥f_θ(x)∥₂)⊤ (h_ϕ(x)/∥h_ϕ(x)∥₂).  (14)

We found that the second term in (14) helps to detect the semantic difference between in-distribution and out-of-distribution samples. Table 4 shows our CLEL's superiority over other explicit density models in OOD detection, especially when OOD samples are drawn from different domains, e.g., the SVHN (Netzer et al., 2011) and Texture (Cimpoi et al., 2014) datasets.
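The score s(x) above can be sketched as follows (our own illustration: g_θ is taken as the identity, and f_θ(x), h_ϕ(x) are toy vectors):

```python
import numpy as np

def l2n(v):
    return v / np.linalg.norm(v)

def ood_score(f_x, h_x, beta=0.01):
    """s(x) = 0.5*||f(x)||^2 - beta * (f/||f||)^T (h(x)/||h(x)||);
    higher scores indicate more OOD-like inputs (projector g_theta omitted)."""
    return 0.5 * np.dot(f_x, f_x) - beta * np.dot(l2n(f_x), l2n(h_x))

f = np.array([1.0, 0.0])
s_in = ood_score(f, np.array([2.0, 0.0]))   # aligned encoder view -> lower score
s_out = ood_score(f, np.array([0.0, 2.0]))  # orthogonal encoder view -> higher score
```

When the EBM's direction agrees with the contrastive representation, the second term lowers the score, which is the semantic-agreement effect described above.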

4.3. CONDITIONAL SAMPLING

One advantage of latent-variable EBMs is that they can offer the latent-conditional density p_θ(x|z) ∝ exp(−E_θ(x, z)). Hence, our EBMs can enjoy this advantage even though CLEL does not explicitly train conditional models. To verify this, we first test instance-conditional sampling: given a real sample x, we draw the underlying latent variable z ∼ p_data(z|x) using our latent encoder h_ϕ, and then perform SGLD sampling with our joint energy E_θ(x, z) defined in (10). We use our CIFAR-10 model here. As shown in Figure 3a, the instance-conditionally generated samples contain similar information (e.g., color, shape, and background) to the given instance. This successful result motivates us to extend the sampling procedure: given a set of instances {x^(i)}, can we generate samples that contain the information shared across {x^(i)}? To this end, we first draw latent variables z^(i) ∼ p_data(·|x^(i)) for all i, and then aggregate them by summation and normalization: z̄ := Σ_i z^(i) / ∥Σ_i z^(i)∥₂. To demonstrate that samples generated from p_θ(x|z̄) contain the shared information in {x^(i)}, we collect a set of instances {x^(i)_y} for each label y in CIFAR-10, and check whether x̄_y ∼ p_θ(·|z̄_y) has the same label y. Figure 3b shows the class-conditionally generated samples {x̄_y}, and Figure 3c presents the confusion matrix of predictions for {x̄_y} computed by an external classifier c; formally, each (i, j)-th entry equals P_{x̄_i}(c(x̄_i) = j). We found that x̄_y is likely to be predicted as the label y, except when y is dog: the generated dog images sometimes look like a semantically similar class, cat. These results verify that our EBM can generate samples conditioned on an instance or class label, even without explicit conditional training.
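The aggregation step z̄ = Σ_i z^(i) / ∥Σ_i z^(i)∥₂ used above is a one-liner:

```python
import numpy as np

def aggregate_latents(zs):
    """Aggregate unit latents by summation then l2-normalization:
    z_bar = sum_i z_i / ||sum_i z_i||_2."""
    s = np.sum(zs, axis=0)
    return s / np.linalg.norm(s)

# The aggregated direction of two orthogonal unit vectors lies between them.
zs = np.array([[1.0, 0.0], [0.0, 1.0]])
z_bar = aggregate_latents(zs)
```

The result stays on the unit sphere, so it can be plugged back into the joint energy E_θ(x, z̄) for conditional SGLD sampling.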

4.4. COMPOSITIONALITY VIA LATENT VARIABLES

An intriguing property of EBMs is compositionality (Du et al., 2020a) : given two EBMs E(x|c 1 ) and E(x|c 2 ) that are conditional energies on concepts c 1 and c 2 , respectively, one can construct a new energy conditioning on both concepts: p θ (x|c 1 and c 2 ) ∝ exp(-E(x|c 1 ) -E(x|c 2 )). As shown in Section 4.3, our CLEL implicitly learns E(x|z), and a latent variable z can be considered as a concept, e.g., instance or class. Hence, in this section, we test compositionality of our model. To this end, we additionally train our CLEL in CelebA 64×64 (Liu et al., 2015) . For compositional sampling, we first acquire three attribute vectors za for a ∈ A := {Young, Female, Smiling} as we did in Section 4.3, then generate samples from a composition of conditional energies as follows: E θ (x|A) := 1 2 ∥f θ (x)∥ 2 -β a∈A sim(g θ (f θ (x)/∥f θ (x)∥ 2 ), za ), where sim(•, •) is the cosine similarity. Figure 4a and 4b show the generated samples conditioning on multiple attributes and their attribute prediction results computed by an external classifier, respectively. They verify our compositionality qualitatively and quantitatively. For example, almost generated faces conditioned by {Young, Female} look young and female (see the third row in Figure 4 .) (Netzer et al., 2011) as the OOD dataset. Table 5 demonstrates the effectiveness of CLEL's components. First, we observe that learning p θ (z|x) to approximate p data (z|x) plays a crucial role for improving generation (see (a) vs. (b)). In addition, using generated latent variables z ∼ p θ (•) as negatives for contrastive learning further improves not only generation, but also OOD detection performance (see (b) vs. (c)). We also empirically found that using an additional projection head is critical; without projection g θ (i.e., (d)), our EBM failed to approximate p data (x), but an additional projection head (i.e., (c) or (e)) makes learning feasible. 
Hence, we use a 2-layer MLP (c) in all experiments, since it performs better than a simple linear projection (e). We also test various β ∈ {0.1, 0.01, 0.001} under this evaluation setup (see Table 6) and find that β = 0.01 is the best.

Compatibility with other self-supervised representation learning methods. While we have mainly focused on utilizing contrastive representation learning (CRL), our framework CLEL is not limited to CRL for learning the latent encoder h_ϕ. To verify this compatibility, we replace SimCLR with other self-supervised representation learning (SSRL) methods, BYOL (Grill et al., 2020) and MAE (He et al., 2021); see Appendix C for implementation details. Note that these methods have several advantages over SimCLR: e.g., BYOL does not require negative pairs, and MAE does not require heavy data augmentations. Table 7 shows that any of these SSRL methods can be used to improve EBMs under our framework, with the CRL method, SimCLR (Chen et al., 2020), performing best.
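The composed energy of Section 4.4 can be sketched as below (a minimal NumPy sketch; `f`, `g`, and `attr_latents` are hypothetical stand-ins for the backbone f_θ, the projector g_θ, and the attribute vectors z̄_a):

```python
import numpy as np

def cosine_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def composed_energy(x, f, g, attr_latents, beta=0.01):
    """E(x|A) = 0.5*||f(x)||^2 - beta * sum_a sim(g(f(x)/||f(x)||), z_a).
    Lower energy means x is more compatible with all attributes in A."""
    feat = f(x)
    direction = feat / np.linalg.norm(feat)  # unit-norm feature direction
    norm_term = 0.5 * np.dot(feat, feat)     # unconditional energy term
    sim_term = sum(cosine_sim(g(direction), z_a) for z_a in attr_latents)
    return norm_term - beta * sim_term

# Toy check with identity backbone/projector and a single attribute latent.
f = lambda x: x
g = lambda u: u
x = np.array([1.0, 0.0])
z_a = np.array([1.0, 0.0])  # attribute latent aligned with x's direction
e_aligned = composed_energy(x, f, g, [z_a])
e_opposed = composed_energy(x, f, g, [-z_a])
assert e_aligned < e_opposed  # alignment with the attribute lowers the energy
```

Each additional attribute simply contributes one more similarity term, which is what makes composing concepts at sampling time possible without retraining.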

5. RELATED WORKS

Energy-based models (EBMs) offer an explicit density and are less restrictive in architecture design, but training them has been challenging; for example, it often suffers from instability due to the time-consuming and unstable MCMC sampling procedure (e.g., a large number of SGLD steps). To reduce the computational complexity and improve the quality of generated samples, various techniques have been proposed: a replay buffer (Du & Mordatch, 2019), short-run MCMC (Nijkamp et al., 2019), and augmentation-based MCMC transitions (Du et al., 2021). Recently, researchers have also attempted to incorporate other generative frameworks into EBM training, e.g., adversarial training (Kumar et al., 2019; Arbel et al., 2021; Grathwohl et al., 2021), flow-based models (Gao et al., 2020; Nijkamp et al., 2022; Xie et al., 2022), variational autoencoders (Xiao et al., 2021), and diffusion techniques (Gao et al., 2021). Another direction is developing better divergence measures, e.g., f-divergence (Yu et al., 2020), the pseudo-spherical scoring rule (Yu et al., 2021), and improved contrastive divergence (Du et al., 2021). Compared to these recent advances in the EBM literature, we focus on an orthogonal research direction: how to incorporate discriminative representations, especially those of contrastive learning, into EBM training.

6. CONCLUSION

The early advances in deep learning were initiated by pioneering energy-based model (EBM) works, e.g., restricted and deep Boltzmann machines (Salakhutdinov et al., 2007; Salakhutdinov & Hinton, 2009); however, the recent accomplishments rather rely on other generative frameworks such as diffusion models (Ho et al., 2020; Song et al., 2021). To narrow this gap, we suggest utilizing discriminative representations for improving EBMs, and we achieve significant improvements. We hope that our work sheds light again on the potential of EBMs and guides further research directions for EBMs.

A TRAINING PROCEDURE OF CLEL

Algorithm 1 Contrastive Latent-guided Energy Learning (CLEL)
Require: a latent-variable EBM (f_θ, g_θ), a latent encoder h_ϕ, an augmentation distribution T, hyperparameters α, β, τ > 0, and the stop-gradient operation sg(·).
1: for each training iteration do
2:   // Sample real data and SGLD negatives, and compute their latent variables
3:   Sample {x^(i)}_{i=1}^n ∼ p_data(x)
4:   Sample {x̃^(i)}_{i=1}^n ∼ p_θ(x) using stochastic gradient Langevin dynamics (SGLD)
5:   z^(i) ← sg(h_ϕ(t^(i)(x^(i))) / ∥h_ϕ(t^(i)(x^(i)))∥₂), t^(i) ∼ T
6:   z̃^(i) ← sg(g_θ(f_θ(x̃^(i)) / ∥f_θ(x̃^(i))∥₂))
7:   // Compute the EBM loss, L_EBM
8:   L_EBM ← (1/n) Σ_{i=1}^n [E_θ(x^(i), z^(i)) − E_θ(x̃^(i)) + α·(E_θ(x^(i))² + E_θ(x̃^(i))²)]
9:   // Compute the encoder loss, L_LE
10:  z^(i)_j ← h_ϕ(t^(i)_j(x^(i))), t^(i)_j ∼ T for j ∈ {1, 2}
11:  L_LE ← (1/2n) Σ_{i=1}^n Σ_{j=1,2} L_NT-Xent(z^(i)_j, z^(i)_{3−j}, {z^(k)_l}_{k≠i, l∈{1,2}} ∪ {z̃^(i)}_{i=1}^n; τ)
12:  Update θ and ϕ to minimize L_EBM + L_LE
13: end for
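The EBM loss in line 8 of Algorithm 1 can be sketched as follows (a minimal NumPy sketch over toy energy values; `energy_real` and `energy_fake` are hypothetical stand-ins for the energies E_θ(x^(i), z^(i)) of real pairs and E_θ(x̃^(i)) of SGLD-generated samples):

```python
import numpy as np

def ebm_loss(energy_real, energy_fake, alpha=1.0):
    """Contrastive-divergence-style EBM loss: pull down the energy of
    real (x, z) pairs, push up the energy of SGLD-generated samples,
    and regularize squared energy magnitudes with weight alpha."""
    cd_term = np.mean(energy_real - energy_fake)
    reg_term = alpha * np.mean(energy_real ** 2 + energy_fake ** 2)
    return cd_term + reg_term

energy_real = np.array([0.2, 0.1, 0.3])  # toy E(x, z) on real pairs
energy_fake = np.array([0.5, 0.4, 0.6])  # toy E(x~) on generated samples
loss = ebm_loss(energy_real, energy_fake, alpha=1.0)
```

The regularizer (α-term) keeps energy magnitudes bounded, which is one of the stabilization choices in the training procedure described above.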

B TRAINING DETAILS

Architectures. For the spherical latent-variable energy-based model (EBM) f_θ, we use 8-block ResNet (He et al., 2016) architectures following Du & Mordatch (2019). The details of the (a) small, (b) base, and (c) large ResNets are described in Table 8. We append a 2-layer MLP with an output dimension of 128 to the ResNet, i.e., f_θ : R^{3×32×32} → R^{128}. Note that we use the small model for the ablation experiments in Section 4.5. To stabilize training, we apply spectral normalization (Miyato et al., 2018) to all convolutional layers. For the projection g_θ, we use a 2-layer MLP with an output dimension of 128, the leaky-ReLU activation, and no bias, i.e., g_θ(u) = W₂σ(W₁u) ∈ R^{128}. For the latent encoder h_ϕ, we simply use the CIFAR variant of ResNet-18 (He et al., 2016), followed by a 2-layer MLP with an output dimension of 128.

Table 8: Our EBM f_θ architectures. For our large model, we build three independent ResNets and resize an input image x ∈ R^{3×32×32} to three resolutions: 32×32, 16×16, and 8×8. We use one ResNet per resolution, concatenate their output features, and then compute the final output feature f_θ(x) ∈ R^{128} using a single MLP.

Small: input (3, 32, 32); Conv(3×3, 64) → ResBlock(64)×1 → AvgPool(2×2) → ResBlock(64)×1 → AvgPool(2×2) → ResBlock(128)×1 → AvgPool(2×2) → ResBlock(128)×1 → GlobalAvgPool → MLP(128, 2048, 128)

Base: input (3, 32, 32); Conv(3×3, 128) → ResBlock(128)×2 → AvgPool(2×2) → ResBlock(128)×2 → AvgPool(2×2) → ResBlock(256)×2 → AvgPool(2×2) → ResBlock(256)×2 → GlobalAvgPool → MLP(256, 2048, 128)

Large: inputs (3, 32, 32), (3, 16, 16), (3, 8, 8); [Conv(3×3, 256) → ResBlock(256)×2 → AvgPool(2×2) → ResBlock(256)×2 → AvgPool(2×2) → ResBlock(256)×2 → AvgPool(2×2) → ResBlock(256)×2 → GlobalAvgPool] ×3 → Concat → MLP(768, 2048, 128)

Training. For the EBM parameter θ, we use the Adam optimizer (Kingma & Ba, 2015) with β₁ = 0, β₂ = 0.999, and a learning rate of 10⁻⁴. We use linear learning rate warmup for the first 2k training iterations. For the encoder parameter ϕ, we use the SGD optimizer with a learning rate of 3 × 10⁻², a weight decay of 5 × 10⁻⁴, and a momentum of 0.9, as described in Chen & He (2020). For all experiments, we train our models for up to 100k iterations with a batch size of 64, unless otherwise stated. For the data augmentation distribution T, we follow Chen et al. (2020), i.e., T includes random cropping, flipping, color jittering, and color dropping. For hyperparameters, we use α = 1 following Du & Mordatch (2019) and β = 0.01 (see Section 4.5 for β-sensitivity experiments). For our large model, we use a larger batch size of 256 only for learning the contrastive encoder h_ϕ. After training, we use exponential moving average (EMA) models for evaluation.

SGLD sampling. For each training iteration, we use 60 SGLD steps with a step size of 100 for sampling x̃ ∼ p_θ. Following Du et al. (2021), we apply a random augmentation t ∼ T every 60 steps. We also use a replay buffer with a size of 10,000 and a resampling rate of 0.1% to maintain diverse samples (Du & Mordatch, 2019). For evaluation, we run 600 and 1,200 SGLD steps from uniform noise for our base and large models, respectively.
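The SGLD update itself can be sketched as below (a minimal NumPy sketch on a toy quadratic energy; the step size, step count, and energy function here are illustrative only, not our actual training configuration):

```python
import numpy as np

def sgld_sample(grad_energy, x_init, n_steps=60, step_size=0.1, rng=None):
    """Stochastic gradient Langevin dynamics:
    x <- x - (step_size / 2) * grad E(x) + sqrt(step_size) * noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - 0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
    return x

# Toy energy E(x) = 0.5 * ||x||^2, so grad E(x) = x; chains started far
# from the origin should drift back toward the low-energy region.
rng = np.random.default_rng(0)
samples = np.array([
    sgld_sample(lambda x: x, rng.normal(size=2) * 10.0, rng=rng)
    for _ in range(200)
])
assert np.abs(samples.mean(axis=0)).max() < 1.0  # centered near the origin
```

In practice, EBM training adds the stabilization tricks listed above (replay buffer, augmentations during sampling, gradient clipping) on top of this basic update.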

C IMPLEMENTATION DETAILS FOR BYOL AND MAE

We here provide implementation details for replacing SimCLR (Chen et al., 2020) with BYOL (Grill et al., 2020) and MAE (He et al., 2021) under our CLEL framework, as shown in Section 4.5.

BYOL. Since BYOL also learns its representations on the unit sphere, the method can be directly incorporated into our CLEL framework.

MAE. Since MAE's representations do not lie on the unit sphere, we incorporate MAE into our CLEL framework by the following procedure:
1. Pretrain an MAE framework and remove its MAE decoder. To this end, we simply use a publicly available checkpoint of the ViT-tiny architecture.
2. Freeze the MAE encoder parameters and construct a learnable 2-layer MLP on top of the encoder.
3. Train only the MLP via contrastive representation learning without data augmentations, using our objective (13) for the latent encoder.

D TRAINING STABILITY WITH NORM-DIRECTION SEPARATION

At an early stage of our research, we first tested a multi-head architecture for modeling p(x) and p(z|x). To be specific, E(x) = g′_θ(f_θ(x)) ∈ R and E(z|x) = z^⊤ g_θ(f_θ(x)), where f_θ is a shared backbone and g_θ and g′_θ are separate 2-layer MLPs, as shown in Figure 5a. We found that, with this choice, learning p(x) causes a mode collapse for f_θ(x) because all samples should be aligned with a specific direction, while learning p(z|x) encourages f_θ(x) to be diverse due to contrastive learning. Namely, modeling p(x) and p(z|x) with the multi-head architecture creates a conflict during optimization, and we empirically observe that the multi-head architecture is unstable in EBM training, as shown in Figure 5c. To remove this conflict, we design the norm-direction separation for modeling p(x) and p(z|x) simultaneously (i.e., Figure 5b), which leads to training stability, as shown in Figure 5c.

E DESIGN OF THE DIRECTIONAL PROJECTOR

Our directional projector g_θ is designed to narrow the gap between the EBM feature direction f_θ(x)/∥f_θ(x)∥₂ and the "uniformly-distributed" latent variable p_data(z) (i.e., h_ϕ(x)/∥h_ϕ(x)∥₂). Specifically, the contrastive latent variable p_data(z) is known to be uniformly distributed (Wang & Isola, 2020), but we observed that it is difficult to optimize the feature direction f_θ(x)/∥f_θ(x)∥₂ to be uniform while also learning our norm-based EBM p_θ(x) ∝ exp(−∥f_θ(x)∥₂²). As empirical support, we analyze the cosine similarity distributions of the f_θ(x), g_θ(f_θ(x)), and h_ϕ(x) features on CIFAR-10, as shown in Figure 6. The figure shows that f_θ(x) tends to learn similar directions (see Figure 6a), while h_ϕ(x) tends to be uniformly distributed (see Figure 6c). Hence, it is necessary to employ a projection between them. We found that our projector g_θ successfully narrows the gap, as shown in Figure 6b, which significantly improves EBM training.
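The norm-direction separation can be sketched as follows (a minimal NumPy sketch; `projector` and the identity backbone are hypothetical stand-ins for g_θ and f_θ, and the minus sign in the conditional term mirrors the similarity-based conditional energy of Section 4.4): the sample energy depends only on the feature norm, while the latent-conditional energy depends only on the feature direction, so the two objectives no longer compete for the same degrees of freedom.

```python
import numpy as np

def energies(x_feat, z, projector):
    """Norm-direction separation: E(x) uses only ||f(x)||, while E(z|x)
    uses only the direction f(x)/||f(x)||, avoiding the conflict of a
    multi-head design where both heads shape the same feature vector."""
    norm = np.linalg.norm(x_feat)
    direction = x_feat / norm
    e_x = 0.5 * norm ** 2                            # energy from the norm only
    e_z_given_x = -np.dot(z, projector(direction))   # energy from the direction only
    return e_x, e_z_given_x

projector = lambda u: u  # identity projector for illustration
feat = np.array([3.0, 4.0])
z = np.array([0.6, 0.8])  # unit latent aligned with feat's direction
e_x, e_zx = energies(feat, z, projector)
# Rescaling the feature changes E(x) but leaves E(z|x) unchanged.
e_x2, e_zx2 = energies(2.0 * feat, z, projector)
assert e_x2 != e_x and np.isclose(e_zx, e_zx2)
```

The final assertion is the point of the design: the norm and the direction can be optimized independently, which is what removes the mode-collapse conflict described above.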


F FLEXIBILITY OF NORM-BASED ENERGY FUNCTION

Our norm-based energy parametrization does not sacrifice flexibility compared to the vanilla parametrization of EBMs. We here show that any vanilla EBM p₁(x) ∝ exp(f₁(x)), f₁(x) ∈ R, can be formulated as a norm-based EBM p₂(x) ∝ exp(−∥f₂(x)∥₂²), f₂(x) ∈ R^d, on a compact input space X (e.g., an image x lies in the continuous pixel space X = [0, 255]^{HWC}). Let b := max_{x∈X} f₁(x) be the maximum value of f₁. Then, p₃(x) ∝ exp(f₁(x) − b) is identical to p₁ because the constant shift is absorbed into the normalizing constant. Furthermore, p₃ can be formulated as a special case of the norm-based EBM p₂: for example, if the first component of f₂(x) equals √(b − f₁(x)) and the other components are zero, then ∥f₂(x)∥₂² = b − f₁(x), so p₁ and p₂ model exactly the same distribution. Therefore, energy modeling with our norm-based design is not less expressive than that with the vanilla form.
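The construction can be written out as a short derivation. (This is our reconstruction of the argument: we take the norm-based density as p₂(x) ∝ exp(−∥f₂(x)∥₂²), matching the sign convention p_θ(x) ∝ exp(−∥f_θ(x)∥₂²) used elsewhere in the paper, so the shift constant must be the maximum of f₁ for the square root to be well defined.)

```latex
\begin{align*}
  b &:= \max_{x \in \mathcal{X}} f_1(x), \qquad
  f_2(x) := \Big(\sqrt{b - f_1(x)},\; 0, \dots, 0\Big) \in \mathbb{R}^d,\\
  \|f_2(x)\|_2^2 &= b - f_1(x) \;\ge\; 0,\\
  p_2(x) &\propto \exp\!\big(-\|f_2(x)\|_2^2\big)
          = e^{-b}\,\exp\!\big(f_1(x)\big)
          \;\propto\; p_1(x).
\end{align*}
```

Compactness of X is what guarantees that the maximum b exists, so the construction applies to image spaces such as [0, 255]^{HWC}.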



Footnotes. (1) The representation quality of CRL for classification tasks is not much improved in our experiments under the joint training of CRL and EBM; hence, we only report the performance of the EBM, not that of CRL. (2) Chen et al. (2020) show that contrastive representations contain such contexts across various tasks. (3) For example, we found that a multi-head architecture for modeling p_θ(x) and p_θ(z|x) makes training unstable; we provide a detailed discussion and supporting experiments in Appendix D. (4) Here, z̃ is unnecessary since E_{z∼p_θ(z|x)}[∇_θ E_θ(x, z)] = ∇_θ E_θ(x).



Figure 1: Illustration of the proposed Contrastive Latent-guided Energy Learning (CLEL) framework. (a) Our spherical latent-variable EBM (f_θ, g_θ) learns the joint data distribution p_data(x, z) generated by our contrastive latent encoder h_ϕ. (b) The encoder h_ϕ is trained by contrastive learning with additional negative variables z̃ ∼ p_θ(z). Here, z_i = h_ϕ(t_i(x))/∥h_ϕ(t_i(x))∥₂, where t_i ∼ T denotes a random augmentation and sg(·) denotes the stop-gradient operation.

Figure 2: Unconditionally generated samples from our EBMs on CIFAR-10 (Krizhevsky et al., 2009) and ImageNet 32×32 (Deng et al., 2009; Chrabaszcz et al., 2017).

Figure 3: (a, b) Instance- and class-conditionally generated samples using our CLEL on CIFAR-10. (c) Confusion matrix for the class-conditionally generated samples computed by an external classifier.

Figure 4: Compositional generation results in CelebA. (a) Samples are generated by conditioning on checked attributes. (b) Attribute predictions of generated samples computed by an external classifier.

Figure 5: (a) A multi-head architecture design and (b) our separation scheme for modeling p(x) and p(z|x). (c) FID scores with various energy design choices.

Figure 6: Cosine similarity distributions using (a) f θ (x), (b) g θ (f θ (x)), and (c) h ϕ (x) features.

FID scores for unconditional generation on CIFAR-10 (Krizhevsky et al., 2009). † denotes EBMs that utilize auxiliary generators, and ‡ denotes hybrid discriminative-generative models.



FID improvements via different configurations, with training time and GPU memory footprint on a single RTX 3090 GPU with 24GB memory. Underlined values are based on our estimation, as the model cannot be trained on a single GPU.

AUROC scores in OOD detection using explicit density models on CIFAR-10. Bold and underlined entries indicate the best and second best results, respectively, among unsupervised methods; JEM and VERA are supervised methods.

Component ablation experiments.

β sensitivity.

Compatibility.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

This work was mainly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This work was partly supported by KAIST-NAVER Hypercreative AI Center.

AVAILABILITY

Our code is available at https://github.com/hankook/CLEL.

ETHICS STATEMENT

In recent years, generative models have been successful in synthesizing diverse high-fidelity images. However, high-quality image generation techniques risk being abused for unethical purposes, e.g., creating sexual photos of an individual. These significant threats call for future efforts by researchers to develop techniques for detecting such misuses; namely, we should learn the data distribution p(x) more directly. In this respect, learning explicit density models like VAEs and EBMs could be an effective solution. As we show superior performance in both image generation and out-of-distribution detection, we believe that energy-based models, especially with discriminative representations, would be an important research direction for reliable generative modeling.

REPRODUCIBILITY STATEMENT

We provide all the details needed to reproduce our experimental results in Appendix B. Our code is available at https://github.com/hankook/CLEL. In our experiments, we mainly use NVIDIA RTX 3090 GPUs.

