THINKING FOURTH DIMENSIONALLY: TREATING TIME AS A RANDOM VARIABLE IN EBMS

Abstract

Recent years have seen significant progress in techniques for learning high-dimensional distributions. Many modern methods, from diffusion models to energy-based models (EBMs), adopt a coarse-to-fine approach. This is often done by introducing a series of auxiliary distributions that gradually transition from the data distribution to some simple distribution (e.g., white Gaussian noise). Methods in this category separately learn each auxiliary distribution (or the transition between pairs of consecutive distributions) and then use the learned models sequentially to generate samples. In this paper, we offer a simple way to generalize this idea by treating the "time" index of the series as a random variable and framing the problem as that of learning a single joint distribution of "time" and samples. We show that this joint distribution can be learned using any existing EBM method and that it leads to improved results. As an example, we demonstrate this approach using contrastive divergence (CD) in its most basic form. On CIFAR-10 and CelebA (32 × 32), our method outperforms previous CD-based methods in terms of Inception and FID scores.

1. INTRODUCTION

Probability density estimation is among the most fundamental tasks in unsupervised learning. It is used in a wide array of applications, from image restoration and manipulation (Nichol et al., 2021; Du et al., 2021; Lugmayr et al., 2020; Kawar et al., 2021; 2022) to out-of-distribution detection (Du & Mordatch, 2019; Grathwohl et al., 2019; Zisselman & Tamar, 2020). However, directly fitting an explicit probability model to high-dimensional data is a hard task, particularly when the data samples concentrate around a low-dimensional manifold, as is often the case with visual data. One way to circumvent this obstacle is by using coarse-to-fine approaches. In fact, in one form or another, coarse-to-fine strategies have been used with great success in most types of generative models (both implicit and explicit), including generative adversarial networks (GANs) (Karras et al., 2018), variational autoencoders (VAEs) (Vahdat & Kautz, 2020), energy-based models (EBMs) (Gao et al., 2018; Zhao et al., 2020), score matching (Song & Ermon, 2019; Li et al., 2019) and diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020). The coarse-to-fine idea is commonly implemented through the introduction of a series of auxiliary distributions that gradually transition from the data distribution to some simple known distribution that is smoothly spread in space (e.g., a standard normal distribution). This construction is illustrated in Fig. 1a for two-dimensional data. The index running over the series of distributions is typically referred to as "time". This is to reflect either the diffusion-like sequential manner in which samples are generated for training (from fine to coarse) (Sohl-Dickstein et al., 2015; Ho et al., 2020) or the annealing-like sequential order in which samples are generated from the model at test time (from coarse to fine) (Song & Ermon, 2019).
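The construction of such a series is straightforward when the auxiliary distributions are obtained by noising the data. The following is a minimal sketch under that assumption (the function name, the toy circle dataset, and the linear noise schedule are illustrative choices, not taken from the paper):

```python
import numpy as np

def auxiliary_series(data, num_levels=5, sigma_max=1.0, seed=None):
    """Build a coarse-to-fine series of sample sets by adding increasing
    amounts of white Gaussian noise to the data. Level 0 is exactly the
    data distribution; the last level is close to a broad Gaussian."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(0.0, sigma_max, num_levels)
    return [data + s * rng.standard_normal(data.shape) for s in sigmas]

# Toy 2D "data manifold": points on a circle (a stand-in for the spiral in Fig. 1a)
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=(1000, 1))
data = np.hstack([np.cos(theta), np.sin(theta)])

series = auxiliary_series(data, num_levels=5, sigma_max=1.0, seed=1)
```

Each element of `series` plays the role of one auxiliary distribution p_{z|t}(z|t); the spread of the samples grows with the level index.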
Methods that use this construction attempt to learn each of the distributions in the series (or each transition rule between pairs of consecutive distributions) separately from the other distributions.¹ In this paper, we explore a more general approach for exploiting the coarse-to-fine structure, which can be used in conjunction with almost any explicit distribution learning algorithm and leads to improved results. The key idea is to gather the samples from all auxiliary distributions and view them as coming from a single joint distribution. More specifically, we treat the "time" index of the series as a random variable, t, and the samples from all auxiliary distributions as samples of a random vector z. This allows learning a single model for the joint distribution p_{z,t}(z,t), using pairs of samples (z,t) (see Fig. 1b). To understand the benefit of this joint modeling, it is important to note that many of the individual distributions p_{z|t}(z|t) commonly occupy only small regions of the space. Thus, when training a separate model for each t, each model is accurate over a different region in space, which can lead to inaccuracies at test time when switching between models. In contrast, here we learn the joint distribution p_{z,t}(z,t), either directly (Sec. 3.4) or by factoring the problem the other way around and learning p_z(z) and p_{t|z}(t|z) (Sec. 3.3). Thus, during training, our unified model is exposed to samples from the entire space, leading to better stitching of the different parts.

Once a model is trained using our approach, it can be used similarly to existing methods, by extracting the auxiliary distributions p_{z|t}(z|t) and sampling from them one after the other, from coarse to fine. It can also be used in alternative ways, as we discuss in Sec. 3.5. To illustrate the strength of our approach, we apply it together with the vanilla contrastive divergence (CD) method (Hinton, 2002) on the CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015) (32 × 32) datasets. It is important to note that although the vanilla CD method is theoretically justified (Yair & Michaeli, 2020), it fails when directly applied to high-dimensional visual data (Gao et al., 2018). This is because it provides good estimates only near the data manifold. To date, good results have been obtained only with persistent contrastive divergence (PCD) (Tieleman, 2008; Du & Mordatch, 2019), which maintains a buffer of past samples. With our approach, on the other hand, plain CD not only succeeds in learning the distribution, but also improves upon all previous PCD-based techniques in terms of Inception Score (IS) and Fréchet Inception Distance (FID).

Figure 1: (a) Coarse-to-fine distribution learning methods introduce a series of auxiliary distributions p_{v_0}(z), ..., p_{v_4}(z) that gradually transition from the data distribution (a 2D spiral in this example) to some simple distribution (a Gaussian here). These methods learn each auxiliary distribution (or pair of consecutive distributions) separately. (b) Here we treat the "time" index of the series as a random variable, t, and the samples from all distributions as samples of a single random vector z. We then train the model to learn the joint distribution p_{z,t}(z,t) using samples (z,t).
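The data pipeline implied by this joint view can be sketched in a few lines: every training batch mixes all "times", so a single model sees samples from the entire space at once. This is a hedged illustration, assuming a Gaussian-noise series; the function name and toy data are ours, not the paper's:

```python
import numpy as np

def sample_joint_batch(data, sigmas, batch_size, rng):
    """Draw training pairs (z, t) from the joint distribution p_{z,t}:
    pick a random "time" index t for each example, then corrupt the
    corresponding data point with that level's noise."""
    idx = rng.integers(0, len(data), size=batch_size)
    t = rng.integers(0, len(sigmas), size=batch_size)       # random "time" per example
    noise = rng.standard_normal((batch_size, data.shape[1]))
    z = data[idx] + sigmas[t][:, None] * noise
    return z, t

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 2))        # stand-in for training images
sigmas = np.linspace(0.01, 1.0, 5)
z, t = sample_joint_batch(data, sigmas, 256, rng)
# A single model, e.g. an energy network E(z, t), would now be trained on
# these mixed-"time" pairs, rather than one model per noise level.
```

In contrast, the standard approach would train on batches drawn from a single fixed level t at a time.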

2. RELATED WORK

The idea of learning an explicit generative model by using an auxiliary coarse-to-fine series of distributions has been used in many works; we briefly mention its use within popular models. Song & Ermon (2019) constructed a series of distributions by adding increasing amounts of white Gaussian noise to the training samples. They learned the gradients (scores) of the distributions using denoising score matching (Vincent, 2011), and used the trained model to solve various generative tasks via gradient-based annealed sampling.
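The annealed, gradient-based sampling procedure can be sketched as follows. This is a toy illustration, not the authors' implementation: the closed-form score of a Gaussian-smoothed Gaussian target stands in for a learned score network, and the step-size schedule is a simplified common choice:

```python
import numpy as np

def annealed_langevin(score, sigmas, n_steps=50, eps=0.1, n=500, seed=None):
    """Annealed Langevin dynamics: run Langevin updates with the score of
    each noise level, moving from the coarsest (largest sigma) level to
    the finest. `score(x, sigma)` approximates grad log p_sigma(x)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 1)) * sigmas[0]   # init from a broad Gaussian
    for sigma in sigmas:                          # coarse -> fine
        step = eps * sigma**2                     # step proportional to noise variance
        for _ in range(n_steps):
            x = (x + 0.5 * step * score(x, sigma)
                   + np.sqrt(step) * rng.standard_normal(x.shape))
    return x

# Toy target: N(mu, tau^2). Smoothed with noise level sigma it is
# N(mu, tau^2 + sigma^2), whose score is available in closed form.
mu, tau = 3.0, 0.5
score = lambda x, sigma: -(x - mu) / (tau**2 + sigma**2)
sigmas = np.geomspace(2.0, 0.05, 8)               # decreasing noise levels
samples = annealed_langevin(score, sigmas, seed=0)
```

Running the coarse levels first lets the chain find the high-density region quickly; the fine levels then sharpen the samples toward the target.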



¹ It is common to represent all the models by a single neural network that accepts the "time" index as an additional input. However, at each "time" step, the network is exposed only to samples from the corresponding distribution.

