GAUSSIAN-BERNOULLI RBMS WITHOUT TEARS

Abstract

We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. First, we propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. Second, we propose a modified contrastive divergence (CD) algorithm so that one can generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. Moreover, we show that modified CD and gradient clipping are enough to robustly train GRBMs with large learning rates, thus removing the necessity of various tricks in the literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture.

1. INTRODUCTION

Restricted Boltzmann machines (RBMs) (Smolensky, 1986; Freund & Haussler, 1991; Hinton, 2002) are generative energy-based models (EBMs) with stochastic binary units. A variant of Boltzmann machines (Ackley et al., 1985), they have a bipartite graphical structure that enables efficient probabilistic inference, and they can be stacked to form deep belief networks (DBNs) (Hinton & Salakhutdinov, 2006; Bengio et al., 2006; Hinton et al., 2006) and deep Boltzmann machines (DBMs) (Salakhutdinov & Hinton, 2009; Cho et al., 2013). Gaussian-Bernoulli RBMs (GRBMs) (Welling et al., 2004; Hinton & Salakhutdinov, 2006) extend RBMs to model continuous data by replacing the binary visible units of the RBM with Gaussian random variables. GRBMs remain challenging to learn, however, despite many proposed modifications to the model or training algorithm. For instance, Lee et al. (2007) add a regularization term to encourage sparsely activated binary hidden units. Krizhevsky et al. (2009) attribute the difficulties in learning to high-frequency noise present in natural images. Factorized high-order terms were introduced in (Ranzato & Hinton, 2010; Ranzato et al., 2010) to allow GRBMs to explicitly learn the covariance structure among pixels. Nair & Hinton (2010) suggest that binary hidden units are problematic, and propose model variants with real-valued hidden units. Cho et al. (2011a; 2013) advocate the use of parallel tempering sampling (Earl & Deem, 2005), an adaptive learning rate, and the enhanced gradient (Cho et al., 2011b) to improve GRBM learning. Melchior et al. (2017) conclude that difficulties in GRBM training are due to training algorithms rather than the model itself; they advocate the use of gradient clipping, specialized weight initialization, and contrastive divergence (CD) (Hinton, 2002) rather than persistent CD (Tieleman, 2008). Tramel et al.
(2018) propose truncated Gaussian visible units and employ the Thouless-Anderson-Palmer (TAP) mean-field approximation for inference and learning. Upadhya & Sastry (2021) propose a stochastic difference of convex functions programming (S-DCP) algorithm to replace CD in training GRBMs.

An important motivation for improving GRBM learning is so that it can be used as a front-end to convert real-valued data to stochastic binary data. This would enable research on modelling real-valued data via DBMs/DBNs, which are more expressive due to their deep architectures. This class of models is of special interest: their learning algorithm involves only local computation, and thus they are more biologically plausible than EBMs trained using backprop. As GRBMs are perhaps the simplest hybrid EBMs (including both continuous and discrete random variables), investigating the inference and learning algorithms of GRBMs would lay the foundation for, and inspire, future research on deep hybrid EBMs, which are useful for many applications like generating (continuous) images and their (discrete) scene graphs. Finally, RBMs and GRBMs are actively studied in quantum computing and physics (Melko et al., 2019; Ajagekar & You, 2020) since they naturally fit the problem formulation (e.g., Ising models) required by many quantum computing devices. Progress on RBMs/GRBMs could thus benefit such interdisciplinary research.

To this end, we propose improved GRBM learning methods for image data. First, we propose a hybrid Gibbs-Langevin sampling algorithm that outperforms the predominant Gibbs sampling. To the best of our knowledge this is the first use of Langevin sampling for GRBM training (with or without Metropolis adjustment). Second, we propose a modified CD algorithm so that one can generate images with learned GRBMs starting from Gaussian noise.
This enables a fair and direct comparison of GRBMs with deep generative models, something beyond the reach of existing GRBM learning methods. Third, we show that the modified CD with gradient clipping is sufficient to train GRBMs, thus removing the need for heuristics that have been crucial for existing approaches. Finally, we empirically show that GRBMs can generate good samples on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA, despite having only a single hidden layer.

2. RELATED WORK

Learning the variances There are two variances to be estimated in GRBM modelling. One is the intrinsic variance of the data, e.g., the variance of image intensities, which is fixed once the data is observed. The other is the (extrinsic) variance parameter in GRBMs, which governs the level of additional Gaussian noise added to visible units. Learning the extrinsic variance is thus necessary for generating sharp and realistic images. But small variance parameters tend to cause the energy function and its gradient to have large values, thus making the stochastic gradient estimates returned by CD numerically unstable. Most existing methods fix the variance (e.g., to one) to avoid this issue. Krizhevsky et al. (2009); Cho et al. (2011a) consider learning the variance using a smaller learning rate than for other parameters, obtaining much better reconstruction and thus supporting the importance of learning variances. However, many of the learned filters are still noisy and point-like. Melchior et al. (2017) learn a shared variance across all visible units, yielding improved performance, especially with large numbers of hidden units. In this work, we learn one variance parameter per visible unit and achieve much lower learned variances than existing methods, e.g., approximately 1e-5 on MNIST.

Stochastic gradient estimation and learning rate Due to the intractable log partition function of GRBMs, one often estimates the gradients of the log likelihood w.r.t. parameters via Monte Carlo. Gibbs sampling is predominant in CD learning due to its simplicity, but it mixes slowly in practice. One can refer to (Decelle et al., 2021) for a detailed study on the mixing time of CD for RBMs. Slow mixing yields noisy gradient estimates, which often cause training instabilities and prohibit large learning rates. Cho et al. (2011a) explore parallel tempering with adaptive learning rates to obtain better reconstruction. Cho et al.
(2013) propose enhanced gradients that are invariant to bit-flipping in hidden units. Melchior et al. (2017) show that gradient clipping and special weight initialization support robust CD learning with large learning rates. We advocate Langevin MC to improve gradients, and validate that gradient clipping does enable training with large learning rates.

Model capacity Theis et al. (2011) show that GRBMs are outperformed even by simple mixture models in estimating likelihoods for image data. Wang et al. (2012); Melchior et al. (2017) demonstrate that GRBMs can be expressed as either a product of experts or a constrained Gaussian mixture in the visible domain, hinting that GRBMs need more hidden units than the true number of components to fit additive mixture densities well. Krause et al. (2013); Gu et al. (2022) provide theoretical guarantees on GRBMs for universal approximation of mixtures and smooth densities. Although these results show that GRBMs are expressive, they do not lead directly to practical GRBM learning algorithms.


Model Evaluation Like many deep generative models, evaluating GRBMs is difficult, as the log likelihood is intractable. To date, GRBMs have been evaluated by visually inspecting reconstructed images, filters, hidden activations (i.e., features), and sampled images during CD training. Quantitative metrics include reconstruction errors and error rates of post-hoc trained classifiers on learned features. However, these metrics do not necessarily indicate whether GRBMs are good generative models (Melchior et al., 2017). Unlike existing work, we sample from learned GRBMs, starting from Gaussian noise, enabling direct comparisons with other generative models, both qualitatively (visually inspecting samples) and quantitatively (e.g., the Frechet Inception distance (FID) (Heusel et al., 2017)). Note that a similar noise-initialization strategy has been studied in EBMs (Nijkamp et al., 2019).

Algorithm 1 Langevin Sampling for GRBMs
1: Input: v^(0), step size α_0, total steps T, burn-in steps T̄, adjust step η
2: For t = 1, ..., T
3:   α_t = CosineScheduler(t, T, α_0)
4:   v' = v^(t−1) − α_t ∂Ẽ(v^(t−1))/∂v + √(2α_t) ξ_t, ξ_t ∼ N(0, I)   ▷ use the marginal energy in Eq. (5)
5:   If t ≤ η, or t > η and u ∼ U(0, 1) < A(v', v^(t−1))
6:     v^(t) = v'
7:   Else
8:     v^(t) = v^(t−1)
9: Return: {v^(T̄+1:T)}   ▷ i:j indexes consecutive samples from the i-th to the j-th

3. GAUSSIAN-BERNOULLI RESTRICTED BOLTZMANN MACHINES

A Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) (Welling et al., 2004; Krizhevsky et al., 2009; Cho et al., 2011a; Melchior et al., 2017) is a Markov Random Field (MRF) with continuous stochastic visible units and binary stochastic hidden units. Denoting N visible units as v ∈ R^N and M hidden units as h ∈ {0, 1}^M, the energy function associated with a GRBM is defined to be

E_θ(v, h) = (1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ) − (v/σ²)^⊤ W h − b^⊤ h,   (1)

with weight matrix W ∈ R^{N×M}, bias b ∈ R^M, mean µ ∈ R^N, and variance σ² ∈ R^N_+, where, unless stated otherwise, x/y denotes element-wise division between vectors x and y, as is convention in the GRBM literature. We denote the set of learnable parameters as θ = {W, b, µ, σ²}. To ensure the variance remains non-negative during learning, we adopt a reparameterization, directly learning log σ² rather than σ² or σ. Finally, given the energy function, one can define the Boltzmann distribution over visible and hidden states as

p_θ(v, h) = (1/Z) exp(−E_θ(v, h)),  where  Z = ∫_{−∞}^{+∞} Σ_h exp(−E_θ(v, h)) dv   (2)

is the normalization constant, which is intractable for even moderately large M. The underlying graphical model, like an RBM, is a bipartite graph with edges only connecting visible units to hidden units. This entails conditional independence of the form p(v|h) = ∏_i p(v_i|h) and p(h|v) = ∏_j p(h_j|v). One can also derive the following conditional distributions for GRBMs,

p(v|h) = N(v | W h + µ, diag(σ²))   (3)
p(h_j = 1|v) = Sigmoid([W^⊤ (v/σ²) + b]_j),   (4)

where N(v | W h + µ, diag(σ²)) is the multivariate Gaussian distribution with mean W h + µ and diagonal covariance matrix diag(σ²). Here, Sigmoid(x) = 1/(1 + exp(−x)) is applied to the vector x in an element-wise manner, and [•]_j denotes the j-th element of the corresponding vector.
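To make the conditionals concrete, here is a minimal NumPy sketch of Eqs. (3) and (4); the dimensions and parameter values below are illustrative only, not the ones used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b, sigma2, rng):
    # Eq. (4): p(h_j = 1 | v) = Sigmoid([W^T (v / sigma^2) + b]_j)
    p = sigmoid(W.T @ (v / sigma2) + b)
    return (rng.random(p.shape) < p).astype(np.float64), p

def sample_v_given_h(h, W, mu, sigma2, rng):
    # Eq. (3): p(v | h) = N(v | W h + mu, diag(sigma^2))
    mean = W @ h + mu
    return mean + np.sqrt(sigma2) * rng.standard_normal(mean.shape)

# Toy dimensions: N = 4 visible units, M = 3 hidden units.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))
b, mu, sigma2 = np.zeros(3), np.zeros(4), np.ones(4)
v = rng.standard_normal(4)
h, p = sample_h_given_v(v, W, b, sigma2, rng)
v_next = sample_v_given_h(h, W, mu, sigma2, rng)
```

Alternating these two functions already gives the blocked Gibbs sampler of Appendix A.2.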
The marginal distribution over visible units is p(v) = exp(−Ẽ_θ(v))/Z, where

Ẽ_θ(v) = (1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ) − Softplus(W^⊤ (v/σ²) + b)^⊤ 1,   (5)

Softplus(x) = log(1 + exp(x)) is applied in an element-wise manner, and 1 is the all-one vector of size M. We call Ẽ_θ(v) the marginal energy to distinguish it from the GRBM energy in Eq. (1). We leave the derivation to Appendix A.1. As shown in Melchior et al. (2017), one can also rewrite the marginal distribution p(v) as a constrained Gaussian mixture.
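The marginal energy in Eq. (5) is a one-liner; the sketch below uses a numerically stable softplus, and the sanity check exploits the fact that with W = 0 and b = 0 each hidden unit contributes Softplus(0) = log 2 (parameter values are illustrative).

```python
import numpy as np

def marginal_energy(v, W, b, mu, sigma2):
    # Eq. (5): 0.5 * ||(v - mu)/sigma||^2 - 1^T Softplus(W^T (v/sigma^2) + b)
    quad = 0.5 * np.sum((v - mu) ** 2 / sigma2)
    softplus = np.logaddexp(0.0, W.T @ (v / sigma2) + b)  # stable log(1 + e^x)
    return quad - np.sum(softplus)

# With W = 0, b = 0: energy = Gaussian quadratic minus M * log(2).
v, mu, sigma2 = np.array([1.0, -1.0]), np.zeros(2), np.ones(2)
E = marginal_energy(v, np.zeros((2, 3)), np.zeros(3), mu, sigma2)
```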

3.1. INFERENCE

When performing probabilistic inference, one often chooses between variational inference (Hinton & Van Camp, 1993; Jordan et al., 1999) and Markov chain Monte Carlo (MCMC) methods (Neal, 1993; Andrieu et al., 2003). We focus on MCMC, as common variational methods have been less effective with RBMs and GRBMs (Gabrié et al., 2015; Takahashi & Yasuda, 2016). From the generative modelling perspective, we wish to draw samples of visible units during inference. There are two natural approaches to this: 1) sample from the joint distribution in Eq. (2) and discard the samples of hidden units, or 2) directly sample from the marginal distribution.

Gibbs Sampling. Gibbs sampling (Geman & Geman, 1984) is perhaps the predominant approach, due to its simplicity. In the context of GRBMs, one alternates between sampling hidden units given visible units, and sampling visible units given hidden units. This produces samples from the joint distribution in Eq. (2). The detailed Gibbs sampling algorithm is given in Appendix A.2.

Langevin Sampling. Langevin Monte Carlo (Grenander & Miller, 1994; Roberts & Tweedie, 1996; Welling & Teh, 2011) is a class of MCMC methods that generate samples from a distribution of continuous random variables by simulating Langevin dynamics. Since GRBMs are hybrid EBMs, i.e., comprising continuous and discrete random variables, we have at least two ways to leverage Langevin sampling. One is to directly apply Langevin sampling to the marginal distribution of visible units in Eq. (5). Suppose at time step t−1 we have a sample v^(t−1) and want to draw a new sample v^(t). The proposal distribution corresponding to one-step Langevin dynamics is given by

q(v|v^(t−1)) = N(v | v^(t−1) − α_t ∂Ẽ(v^(t−1))/∂v, 2α_t I),   (6)

where the gradient of the marginal energy Ẽ w.r.t. the visible units is given in Appendix A.3.
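A minimal NumPy sketch of Alg. 1, combining this Langevin proposal with the cosine step-size schedule and the Metropolis adjustment of Appendix A.3; the marginal-energy gradient is Eq. (21). The small step-size floor is our own safeguard against a zero step at the final iteration of the cosine schedule, and all sizes here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def marginal_energy(v, W, b, mu, sigma2):
    return 0.5 * np.sum((v - mu) ** 2 / sigma2) \
        - np.sum(np.logaddexp(0.0, W.T @ (v / sigma2) + b))

def grad_marginal_energy(v, W, b, mu, sigma2):
    # Eq. (21): (v - mu)/sigma^2 - W Sigmoid(W^T (v/sigma^2) + b) / sigma^2
    return (v - mu) / sigma2 - (W @ sigmoid(W.T @ (v / sigma2) + b)) / sigma2

def mala(v, W, b, mu, sigma2, alpha0=0.05, T=100, eta=0, seed=0):
    rng = np.random.default_rng(seed)
    E = lambda x: marginal_energy(x, W, b, mu, sigma2)
    g = lambda x: grad_marginal_energy(x, W, b, mu, sigma2)
    for t in range(1, T + 1):
        # Cosine schedule, floored so the last step size is not exactly zero.
        a = max(0.5 * alpha0 * (1 + np.cos(np.pi * t / T)), 1e-8)
        prop = v - a * g(v) + np.sqrt(2 * a) * rng.standard_normal(v.shape)
        if t <= eta:
            v = prop  # skip the adjustment for the first eta steps (Alg. 1)
        else:
            # Asymmetric Langevin proposal densities (Eq. 7), in log space.
            back = -np.sum((v - prop + a * g(prop)) ** 2) / (4 * a)
            fwd = -np.sum((prop - v + a * g(v)) ** 2) / (4 * a)
            if np.log(rng.random()) < (-E(prop) + back) - (-E(v) + fwd):
                v = prop
    return v

# Smoke test: with W = 0 the marginal is just N(mu, diag(sigma2)).
v = mala(np.zeros(2), np.zeros((2, 3)), np.zeros(3),
         np.array([1.0, -1.0]), np.ones(2))
```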
If we use the Metropolis-Hastings algorithm to accept or reject proposed samples, the acceptance probability of a proposal v^(t), given the previous state v^(t−1), is (see Appendix A.3 for the derivation):

A(v^(t), v^(t−1)) = min{1, exp(−Ẽ_θ(v^(t)) − (1/4α_t)‖v^(t−1) − v^(t) + α_t ∂Ẽ(v^(t))/∂v‖²) / exp(−Ẽ_θ(v^(t−1)) − (1/4α_t)‖v^(t) − v^(t−1) + α_t ∂Ẽ(v^(t−1))/∂v‖²)}.   (7)

Alg. 1 shows the Metropolis-adjusted Langevin Algorithm (MALA) for the marginal GRBM. Compared to generic MALA, it also includes an extra hyperparameter, namely, the adjust step η. If η is set to 0, then we perform a Metropolis adjustment at every sampling step, as prescribed in generic MALA. If η is set to K > 0, then we skip the Metropolis adjustment for the first K steps. The adjust step effectively controls a trade-off between sampling accuracy and computational efficiency. Since we do not hope to see Gaussian noise in our final sampled images (i.e., beyond the level of intrinsic noise in the observations), it is beneficial to decay the noise level, as in score-based models (Song & Ermon, 2019). For certain step-size-annealing schedules and energy functions, there are theoretical guarantees on the convergence of Langevin sampling (Durmus & Moulines, 2019). For simplicity, we use the cosine scheduler and find it works well in practice. More details about the scheduler are provided in Appendix A.3.

Gibbs-Langevin Sampling. We also introduce a new hybrid sampler for GRBMs (see Alg. 2). Like the Gibbs sampler, it alternates between sampling hidden units conditioned on visible units, and sampling visible units given the hidden units. Unlike generic Gibbs, which directly samples from the Gaussian p(v|h^(t)), we instead use Langevin MC to sample the continuous visible units given the hidden state. The use of Langevin MC may seem unnecessary because the Gaussian conditional permits a one-step sampling algorithm.
The subtlety comes from the fact that the finite-step Langevin sampler explicitly depends on the initial sample. Specifically, the proposal distribution of one complete outer-loop step in Alg. 2, e.g., at iteration t−1, can be expressed as

q(v, h|v^(t−1), h^(t−1)) = q(h|v) q(v|v^(t−1), h^(t−1)),   (8)

where q(h|v) is given by Eq. (4), and q(v|v^(t−1), h^(t−1)) is the proposal distribution of a K-step Langevin sampler (i.e., from the inner loop). This proposal distribution explicitly depends on the initial visible sample v^(t−1) from iteration t−1. By contrast, the generic Gibbs sampler does not have such dependence, i.e., q(v|v^(t−1), h^(t−1)) = q(v|h^(t−1)). This dependence allows us to construct a persistent Markov chain in the space of visible units. Moreover, the Langevin sampler leverages the informative gradient of the log density, whereas the Gibbs sampler does not. We find that our new sampler performs significantly better than the vanilla Gibbs sampler in practice.

Algorithm 2 Gibbs-Langevin Sampling for GRBMs
1: Input: v^(0), h^(0), step size α_0, total steps T, burn-in steps T̄, adjust step η, Langevin steps K
2: Function Langevin(ṽ^(0), h, α_0, K):
3:   For k = 1, ..., K
4:     α_k = CosineScheduler(k, K, α_0)
5:     ṽ^(k) = ṽ^(k−1) − α_k ∂E(ṽ^(k−1), h)/∂v + √(2α_k) ξ_k, ξ_k ∼ N(0, I)
6:   Return ṽ^(K)
7:
8: For t = 1, ..., T
9:   v' = Langevin(v^(t−1), h^(t−1), α_0, K)
10:  h' ∼ p(h|v')
11:  If t ≤ η, or t > η and u ∼ U(0, 1) < Ã((v', h'), (v^(t−1), h^(t−1)))
12:    v^(t), h^(t) = v', h'
13:  Else
14:    v^(t), h^(t) = v^(t−1), h^(t−1)
15: Return: {(v^(T̄+1:T), h^(T̄+1:T))}

The Metropolis adjustment for these Gibbs-Langevin proposals is, however, somewhat more involved. Following Alg. 2, with ṽ^(0) = v^(t−1) and ṽ^(K) = v, and marginalizing out the intermediate states of the Markov chain, we obtain the proposal

q(ṽ^(K)|ṽ^(0), h^(t−1)) = ∫ ··· ∫ ∏_{k=1}^{K} q(ṽ^(k)|ṽ^(k−1), h^(t−1)) dṽ^(1) ··· dṽ^(K−1).   (9)

The integrand in Eq.
(9) comprises K one-step Langevin updates, each of which is given by

q(ṽ^(k)|ṽ^(k−1), h^(t−1)) = N(ṽ^(k) | ṽ^(k−1) − α_k ∂E(ṽ^(k−1), h^(t−1))/∂v, 2α_k I),   (10)

for which the energy gradient is given in Appendix A.4. Although the multiple integral in Eq. (9) appears intractable, one can use reparameterization to derive the following analytical form,

q(ṽ^(K)|ṽ^(0), h^(t−1)) = N(β_0 ṽ^(0) + Σ_{k=1}^{K} β_k α_k (µ + W h^(t−1))/σ², diag(Σ_{k=1}^{K} 2α_k β_k²)),   (11)

where β_k = ∏_{j=k+1}^{K} (1 − α_j/σ²) for all k ∈ {0, ..., K−1}, and β_K = 1. Based on this result, one can show that the acceptance probability for the Metropolis adjustment is

Ã((v^(t), h^(t)), (v^(t−1), h^(t−1))) = min{1, [exp(−E_θ(v^(t), h^(t)) − ‖(v^(t−1) − β_0 v^(t) − a(µ + W h^(t)))/(√2 σ̃)‖²) q(h^(t−1)|v^(t−1))] / [exp(−E_θ(v^(t−1), h^(t−1)) − ‖(v^(t) − β_0 v^(t−1) − a(µ + W h^(t−1)))/(√2 σ̃)‖²) q(h^(t)|v^(t))]},   (12)

where q(h_j = 1|v) = Sigmoid([W^⊤ (v/σ²) + b]_j), a = Σ_{k=1}^{K} β_k α_k / σ², and σ̃² = Σ_{k=1}^{K} 2α_k β_k². We leave the derivations to Appendix A.4.
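The outer/inner loop structure of Alg. 2 can be sketched as follows. For brevity this version skips the Metropolis adjustment (i.e., η = T); the joint-energy gradient is Eq. (28) in Appendix A.4, and all dimensions and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_langevin(v, W, b, mu, sigma2, alpha0=0.05, T=20, K=10, seed=0):
    rng = np.random.default_rng(seed)
    h = (rng.random(W.shape[1]) < sigmoid(W.T @ (v / sigma2) + b)).astype(np.float64)
    for t in range(T):
        # Inner loop: K Langevin steps on v given the current h, using the
        # joint-energy gradient dE/dv = (v - mu - W h) / sigma^2 (Eq. 28).
        for k in range(1, K + 1):
            a = 0.5 * alpha0 * (1 + np.cos(np.pi * k / K))  # cosine schedule
            grad = (v - mu - W @ h) / sigma2
            v = v - a * grad + np.sqrt(2 * a) * rng.standard_normal(v.shape)
        # Outer Gibbs step for the binary hidden units (Eq. 4).
        p = sigmoid(W.T @ (v / sigma2) + b)
        h = (rng.random(p.shape) < p).astype(np.float64)
    return v, h

rng = np.random.default_rng(1)
W = 0.1 * rng.standard_normal((4, 3))
v, h = gibbs_langevin(np.zeros(4), W, np.zeros(3), np.zeros(4), np.ones(4))
```

Note that the chain over v is persistent across outer-loop iterations: each inner loop starts from the previous visible state rather than a fresh Gaussian draw, which is exactly the dependence on v^(t−1) discussed above.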

3.2. LEARNING

To learn GRBMs, we maximize the log likelihood of the observed data using stochastic gradient-based methods, e.g., contrastive divergence (CD). Depending on whether we use the joint (Eq. (2)) or the marginal (Eq. (5)) distribution, we have two possible gradient estimators.

Learning with the Joint Distribution. When optimizing the GRBM with the joint distribution, one can express the general form of the gradient of the log likelihood w.r.t. the parameters θ as

∇θ = ⟨−∂E_θ(v, h)/∂θ⟩_d − ⟨−∂E_θ(v, h)/∂θ⟩_m.   (13)

Here, following the notation in the RBM literature, we denote the expectation under the data distribution, i.e., p_θ(h|v) p_data(v), as ⟨•⟩_d = E_{p_θ(h|v) p_data(v)}[•]. Similarly, we denote the expectation under the model distribution p_θ(v, h) as ⟨•⟩_m = E_{p_θ(v,h)}[•]. The expected gradients under the data and model distributions are called the positive and negative gradients, respectively. Based on Eq. (13), we can formulate the gradients of specific parameters as follows,

∇W_ij = ⟨(v_i/σ_i²) h_j⟩_d − ⟨(v_i/σ_i²) h_j⟩_m   (14)
∇µ_i = ⟨(v_i − µ_i)/σ_i²⟩_d − ⟨(v_i − µ_i)/σ_i²⟩_m   (15)
∇log σ_i² = ⟨(v_i − µ_i)²/(2σ_i²) − Σ_j v_i W_ij h_j / σ_i²⟩_d − ⟨(v_i − µ_i)²/(2σ_i²) − Σ_j v_i W_ij h_j / σ_i²⟩_m   (16)
∇b_i = ⟨h_i⟩_d − ⟨h_i⟩_m.   (17)

Algorithm 3 Modified CD Learning Algorithm for GRBMs with Joint Density
1: Input: CD steps K, burn-in steps M, learning rate η, Langevin step size α_0, SGD steps T
2: For t = 1, ..., T
3:   v⁺ = v_data
4:   h⁺ ∼ p(h|v⁺)
5:   ∇θ⁺ = ⟨∂E(v⁺, h⁺)/∂θ⟩_d   ▷ Compute Positive Gradient
6:   v⁻_0 ∼ N(0, I), h⁻_0 ∼ p(h|v⁻_0)   ▷ Modified CD starts from noise
7:   {v⁻_{M:K}, h⁻_{M:K}} ∼ Sampler(v⁻_0, h⁻_0, α_0 σ̄)   ▷ Alg. 5 or Alg. 2; σ̄ is the current mean variance
8:   ∇θ⁻ = ⟨∂E(v⁻, h⁻)/∂θ⟩_m   ▷ Compute Negative Gradient
9:   θ = θ − η(∇θ⁺ − ∇θ⁻)   ▷ Compute Update
10: Return θ

Since the expectations in these gradients are generally intractable, we use Monte Carlo methods to approximate them.
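As a sketch, the per-sample quantities inside the brackets of Eqs. (14)–(17) can be written directly from the energy function; shapes below are illustrative.

```python
import numpy as np

def energy_grads(v, h, W, mu, sigma2):
    """Per-sample terms of Eqs. (14)-(17); the CD update subtracts the
    model-sample average of these from the data-sample average."""
    dW = np.outer(v / sigma2, h)                                   # Eq. (14)
    dmu = (v - mu) / sigma2                                        # Eq. (15)
    dlog_s2 = (v - mu) ** 2 / (2 * sigma2) - v * (W @ h) / sigma2  # Eq. (16)
    db = h                                                         # Eq. (17)
    return dW, dmu, dlog_s2, db

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))
dW, dmu, dls2, db = energy_grads(rng.standard_normal(4),
                                 np.array([1.0, 0.0, 1.0]),
                                 W, np.zeros(4), np.ones(4))
```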
To sample from the joint density, we can use the Gibbs or Gibbs-Langevin samplers described in Sec. 3.1. The overall learning algorithm is outlined in Alg. 3. An important detail is that we multiply the initial Langevin step size by the average variance at each gradient update step before feeding it to the sampler. Since the variance decreases (and the energy function and its gradient increase) as learning goes on, keeping the step size roughly invariant to this scaling makes the sampling more effective.

Learning with the Marginal Distribution. Now we turn to learning the model under the marginal distribution in Eq. (5). Since we have the marginal distribution of visible units, we can directly obtain the gradients of the log likelihood w.r.t. the model parameters θ as

∇θ = ⟨−∂Ẽ_θ(v)/∂θ⟩_d − ⟨−∂Ẽ_θ(v)/∂θ⟩_m.   (18)

Since the gradient ∂Ẽ_θ(v)/∂θ no longer depends on h, we have

⟨−∂Ẽ_θ(v)/∂θ⟩_d = E_{p_data(v)}[−∂Ẽ_θ(v)/∂θ],  ⟨−∂Ẽ_θ(v)/∂θ⟩_m = E_{p_θ(v)}[−∂Ẽ_θ(v)/∂θ].   (19)

Based on the above results, we can work out the detailed gradients, which are the same as those in Eq. (14) to Eq. (17) but with h replaced by Sigmoid(W^⊤ (v/σ²) + b). More details are left to Appendix A.5. We use the Langevin sampler in Sec. 3.1 to sample from the marginal density to approximate the intractable expectation. The overall learning algorithm is outlined in Alg. 4.

Modified Contrastive Divergence. The above two learning algorithms resemble CD if one ignores the specific sampler used. There exists a subtle yet important difference, however. For most deep generative models, one generates samples starting from noise. But this does not work well for models trained with CD, where sampling starts from observed data.
This discrepancy in the starting sample between training and testing would be a significant issue if the Markov chain does not mix sufficiently quickly. We therefore modify CD by running two Markov chains to collect samples for the positive and negative gradients respectively. The positive Markov chain is the same as in CD, i.e., starting from observed data. The negative Markov chain now starts from a sample of standard Normal noise rather than the reconstructed data. Since the positive chain starting from data will usually stay close to the data distribution, this modification pushes the negative Markov chain, starting from noise, toward the data distribution. Moreover, the discrepancy between training and testing ceases to be important, as we can start from standard Normal noise while sampling from the learned model.

Algorithm 4 Modified CD Learning Algorithm for GRBMs with Marginal Density
1: Input: CD steps K, burn-in steps M, learning rate η, Langevin step size α_0, SGD steps T
2: For t = 1, ..., T
3:   v⁺ = v_data
4:   ∇θ⁺ = ⟨∂Ẽ(v⁺)/∂θ⟩_d   ▷ Compute Positive Gradient
5:   v⁻_0 ∼ N(0, I)   ▷ Modified CD to start with noise
6:   {v⁻_i | i = M, ..., K} ∼ Sampler(v⁻_0, α_0 σ̄)   ▷ Alg. 1; σ̄ is the current mean variance
7:   ∇θ⁻ = ⟨∂Ẽ(v⁻)/∂θ⟩_m   ▷ Compute Negative Gradient
8:   θ = θ − η(∇θ⁺ − ∇θ⁻)   ▷ Compute Update
9: Return θ
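The modified CD procedure can be sketched end-to-end as follows. For brevity this toy version uses plain block Gibbs for the noise-initialized negative chain (Alg. 3 would use Alg. 5 or Alg. 2), expected hidden activations in place of hidden samples in the gradients, and per-parameter gradient-norm clipping; all hyperparameters and the toy data are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modified_cd_step(V, W, b, mu, log_sigma2, lr=0.01, k=10, clip=10.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    sigma2 = np.exp(log_sigma2)   # learn log sigma^2 for non-negativity
    B = V.shape[0]
    # Positive phase: hidden expectations given the data batch.
    Ph = sigmoid((V / sigma2) @ W + b)
    # Negative phase: chain initialized from standard-normal noise (modified CD).
    Vn = rng.standard_normal(V.shape)
    for _ in range(k):
        Hn = (rng.random(Ph.shape) < sigmoid((Vn / sigma2) @ W + b)).astype(np.float64)
        Vn = Hn @ W.T + mu + np.sqrt(sigma2) * rng.standard_normal(V.shape)
    Pn = sigmoid((Vn / sigma2) @ W + b)
    # Log-likelihood gradients, Eqs. (14)-(17): data average minus model average.
    gW = (V / sigma2).T @ Ph / B - (Vn / sigma2).T @ Pn / B
    gmu = ((V - mu) / sigma2).mean(0) - ((Vn - mu) / sigma2).mean(0)
    gls = ((V - mu) ** 2 / (2 * sigma2) - V * (Ph @ W.T) / sigma2).mean(0) \
        - ((Vn - mu) ** 2 / (2 * sigma2) - Vn * (Pn @ W.T) / sigma2).mean(0)
    gb = Ph.mean(0) - Pn.mean(0)
    # Gradient-norm clipping, then ascend the log likelihood.
    for g in (gW, gmu, gls, gb):
        n = np.linalg.norm(g)
        if n > clip:
            g *= clip / n
    return W + lr * gW, b + lr * gb, mu + lr * gmu, log_sigma2 + lr * gls

rng = np.random.default_rng(0)
V = rng.standard_normal((32, 4)) + 2.0   # toy data batch centered at 2
W, b = 0.01 * rng.standard_normal((4, 3)), np.zeros(3)
mu, log_s2 = np.zeros(4), np.zeros(4)
for _ in range(20):
    W, b, mu, log_s2 = modified_cd_step(V, W, b, mu, log_s2, rng=rng)
```

On this toy batch the mean parameter µ drifts toward the data mean within a few updates, which is the behavior the positive-minus-negative gradient is meant to produce.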

4. EXPERIMENTS

We examine the empirical behavior of our new GRBM algorithms on benchmark image datasets, namely MNIST, Fashion-MNIST (Xiao et al., 2017), and CelebA (Liu et al., 2015).

Implementation Details. We found that training with modified CD alone occasionally diverges, necessitating careful tuning of the learning rate. However, adding gradient clipping (e.g., clipping the gradient norm to 10) enables stable training with all aforementioned sampling methods. We therefore set the learning rate to 0.01 for all experiments. A learning rate this large almost never works in the literature. Melchior et al. (2017) used gradient clipping and similarly large learning rates, but they had to set the learning rate for the variances 100 times smaller than that for the weights and biases during CD training. Thanks to the modified CD and gradient clipping, we found this special treatment of the variances to be unnecessary. We do not use momentum, weight decay, PCD, or other tricks.

4.1. MODELING GAUSSIAN MIXTURE DENSITIES

We first evaluate density modelling by GRBMs when the data density is known, i.e., Gaussian mixture models (GMMs) in our case. This is challenging for GRBMs, as the marginal distribution of visible units of GRBMs is essentially a constrained Gaussian mixture, i.e., the weights of mixture components depend on one another (Melchior et al., 2017). As such, the mixture components in GRBMs cannot be freely placed in the visible domain, so one actually needs more hidden units than the log of the number of mixture components to fit GMMs well. We consider the 2D case for simplicity and ease of visualization. We generate 1,000 samples from two types (isotropic and anisotropic variances) of GMMs with 3 components, as shown in Fig. 1, and learn GRBMs using our modified CD with different sampling algorithms, from which we can draw samples. Here all samplers run for 100 steps during both CD training and testing (see Appendix B.1 for more detail). Density plots and samples are shown in Fig. 1. Notice that Gibbs manages to recover the three modes in the isotropic case but fails in the anisotropic case. Both Langevin and Gibbs-Langevin sampling collapse when the adjustment is absent. We believe the cosine step size schedule contributes to the collapse, as it removes much of the stochasticity of the Langevin dynamics at small step sizes, thus making sampling more similar to gradient descent. But as we will see later, in image modelling this may not be so severe; there are more modes, so the sampling may collapse to different modes, and the diversity of images remains acceptable. Finally, both Langevin and Gibbs-Langevin do recover all three modes with the adjustment, which shows that the adjustment helps mixing in this synthetic case.
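A synthetic setup of this kind can be reproduced with a sketch like the following; the exact means, covariances, and weights used in Fig. 1 are not specified in the text, so the values below are illustrative stand-ins for the isotropic and anisotropic variants.

```python
import numpy as np

def sample_gmm(n, means, covs, weights, rng):
    # Draw n points from a 2-D Gaussian mixture: pick a component, then sample it.
    comps = rng.choice(len(weights), size=n, p=weights)
    x = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in comps])
    return x, comps

rng = np.random.default_rng(0)
means = [np.array([-2.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
covs_iso = [0.05 * np.eye(2)] * 3            # isotropic-variance variant
covs_aniso = [np.diag([0.2, 0.01])] * 3      # anisotropic-variance variant
x, comps = sample_gmm(1000, means, covs_iso, [1 / 3, 1 / 3, 1 / 3], rng)
```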

4.2. IMAGE GENERATION

We learn GRBMs to fit image datasets including MNIST, FashionMNIST, and CelebA. To the best of our knowledge, this is the first time that GRBMs have been shown to (unconditionally) generate good images. We provide an ablation study in Appendix B.2 and more results in Appendix B.3.

[Table 1: FID scores on MNIST — VAE 16.13; 2sVAE (Dai & Wipf, 2019) 12.60; PixelCNN++ (Salimans et al.) 11.38; WGAN (Arjovsky et al., 2017) 10.28; NVAE (Vahdat & Kautz, 2020) —]

MNIST. We train GRBMs with hidden size 4096 and 100 sampling steps on MNIST. We compare FID scores of GRBMs with other deep generative models in Table 1. We can see that the Gibbs-Langevin family works significantly better than the Langevin family. The Metropolis adjustment ("w. Adjust" in Table 1) improves Langevin slightly but degrades Gibbs-Langevin slightly, which differs from what we observed on synthetic data. This is likely because the image distribution is so complicated (e.g., having significantly more modes) that the adjustment rejects proposed moves more frequently than before. Some more sophisticated strategy may be needed to increase the acceptance probability. Nevertheless, GRBMs trained with Gibbs-Langevin without adjustment achieve FID scores comparable to other generative models, which is impressive given the single-hidden-layer architecture. The learning curve of the (natural) log variance is shown in Fig. 3a.

FashionMNIST. We then train GRBMs on FashionMNIST, which is more challenging than MNIST. We set the hidden size to 10,000 and the number of sampling steps to 100. Samples drawn from learned GRBMs are shown in Fig. 4a. GRBMs successfully learn the shapes of clothes, shoes, bags, and so on. However, they fail to capture fine textures. Since many images in this dataset look similar in shape but differ in texture, the resulting samples look similar to each other.

CelebA. Finally, we consider the even more challenging CelebA dataset.
In particular, we explore two versions of this dataset: 1) CelebA-32, where we center-crop (140×140) and downsample images to 32×32; and 2) CelebA-2K-64, where we randomly select 2,000 images from the original CelebA and apply the same center crop, downsampling to 64×64. We set the hidden size to 10,000 and explore 100 and 200 sampling steps. Generated samples are shown in Fig. 4b and 4c. From the figure, we can see that GRBMs can learn to generate reasonably good face images.

5. CONCLUSION

In this paper, we revisit learning Gaussian-Bernoulli restricted Boltzmann machines. We investigate Langevin Monte Carlo and propose a novel Gibbs-Langevin sampling method. Furthermore, we modify the contrastive divergence (CD) algorithm so that one can sample data from learned GRBMs starting from noise. Modified CD along with gradient clipping enables robust training of GRBMs with large learning rates. Finally, we show that GRBMs can unconditionally generate images of good quality, despite their single-hidden-layer architecture. In the future, it would be beneficial to extend the current GRBMs to convolutional GRBMs, which should be able to learn better localized filters. Meanwhile, it would be interesting to explore Gaussian deep belief networks (GDBNs), which are deeper than GRBMs and should be superior. Finally, investigating our Gibbs-Langevin sampling for hybrid deep energy-based models could be a fruitful direction.

A DERIVATIONS

A.1 MARGINAL PROBABILITY DISTRIBUTION OF VISIBLE UNITS OF GRBMS

We derive the marginal distribution of visible units as follows,

p(v) = Σ_h p(v, h) = (1/Z) Σ_h exp(−E_θ(v, h))
= (1/Z) exp(−(1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ)) Σ_h exp((v/σ²)^⊤ W h + b^⊤ h)
= (1/Z) exp(−(1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ)) ∏_i (1 + exp((v/σ²)^⊤ W_i + b_i))
= (1/Z) exp(−(1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ)) ∏_i exp(Softplus((v/σ²)^⊤ W_i + b_i))
= (1/Z) exp(−(1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ)) exp(Softplus(W^⊤ (v/σ²) + b)^⊤ 1)
= (1/Z) exp(−(1/2) ((v − µ)/σ)^⊤ ((v − µ)/σ) + Softplus(W^⊤ (v/σ²) + b)^⊤ 1).

A.2 GIBBS SAMPLING

Gibbs sampling (Geman & Geman, 1984) is perhaps the most popular approach due to its simplicity. In the context of GRBMs, we can alternate between sampling hidden units given visible units and sampling visible units given hidden units. Alg. 5 is a blocked Gibbs sampler; it samples all visible units (a block of random variables) at once given all hidden units (the other block) and vice versa. Given the conditional independence in the bipartite graphical model, this blocked Gibbs sampler is equivalent to a univariate Gibbs sampler that updates one variable at a time given the others following some schedule. In fact, any schedule comprising a sequence of all hidden units followed by all visible units, or vice versa, would make the equivalence hold. In other words, it preserves the convergence of the original univariate Gibbs sampler and runs as fast as a blocked Gibbs sampler. Relying on this Gibbs sampler, we can get samples of visible and hidden units from the joint distribution in Eq. (2). We can then discard the samples within the burn-in stage and treat the remaining ones as the final set of samples.

Algorithm 5 Gibbs Sampling for GRBMs
1: Input: number of steps T, burn-in steps T̄
2: v^(0) ∼ N(0, I)
3: For t = 1, ..., T
4:   h^(t) ∼ p(h|v^(t−1))   ▷ following Eq. (4)
5:   v^(t) ∼ p(v|h^(t))   ▷ following Eq. (3)
6: Return: {(v^(t), h^(t)) | t = T̄+1, ..., T}

A.3 LANGEVIN SAMPLING

The gradient of the marginal energy w.r.t. the visible units is

∂Ẽ(v)/∂v = (v − µ)/σ² − W Sigmoid(W^⊤ (v/σ²) + b)/σ².   (21)

The cosine scheduler for annealing the step size is

α_k = CosineScheduler(k, K, α_0) = (1/2) α_0 (1 + cos(kπ/K)),   (22)

where α_k is the k-th step size, α_0 is the initial step size, and K is the total number of steps. The derivation of the Metropolis adjustment for Langevin sampling is as follows,

A(v^(t), v^(t−1)) = min{1, p(v^(t)) q(v^(t−1)|v^(t)) / (p(v^(t−1)) q(v^(t)|v^(t−1)))}
= min{1, exp(−Ẽ_θ(v^(t))) exp(−(1/4α_t)‖v^(t−1) − v^(t) + α_t ∂Ẽ(v^(t))/∂v‖²) / [exp(−Ẽ_θ(v^(t−1))) exp(−(1/4α_t)‖v^(t) − v^(t−1) + α_t ∂Ẽ(v^(t−1))/∂v‖²)]}
= min{1, exp(−Ẽ_θ(v^(t)) − (1/4α_t)‖v^(t−1) − v^(t) + α_t ∂Ẽ(v^(t))/∂v‖²) / exp(−Ẽ_θ(v^(t−1)) − (1/4α_t)‖v^(t) − v^(t−1) + α_t ∂Ẽ(v^(t−1))/∂v‖²)}.   (23)

A.4 GIBBS-LANGEVIN SAMPLING

We now derive the Metropolis adjustment for Gibbs-Langevin sampling. At time step t−1, the proposal distribution in Alg. 2 is

q(v, h|v^(t−1), h^(t−1)) = q(h|v) q(v|v^(t−1), h^(t−1)),   (24)
q(h_j = 1|v) = Sigmoid([W^⊤ (v/σ²) + b]_j).   (25)

Denoting v^(t−1) = ṽ^(0) and v = ṽ^(K), we have

q(v|v^(t−1), h^(t−1)) = q(ṽ^(K)|ṽ^(0), h^(t−1)) = ∫ ··· ∫ ∏_{k=1}^{K} q(ṽ^(k)|ṽ^(k−1), h^(t−1)) dṽ^(1) ··· dṽ^(K−1),   (26)

where

q(ṽ^(k)|ṽ^(k−1), h^(t−1)) = N(ṽ^(k) | ṽ^(k−1) − α_k ∂E(ṽ^(k−1), h^(t−1))/∂v, 2α_k I),   (27)
∂E(v, h)/∂v = (v − µ − W h)/σ².   (28)

The key question is how to derive the analytical form of q(ṽ^(K)|ṽ^(0), h^(t−1)). The most straightforward way is to compute the multiple integral directly. By fixing all variables except ṽ^(k) in Eq. (26), we can integrate out ṽ^(k) analytically via the Gaussian integral trick, i.e., ∫_{−∞}^{∞} exp(−ax² + bx + c) dx = √(π/a) exp(b²/(4a) + c). Then, by applying the same trick recursively, one can in principle integrate out all of ṽ^(1), ..., ṽ^(K−1) analytically. However, this process is quite involved, because the integral over ṽ^(k) depends on both ṽ^(k+1) and ṽ^(k−1). We instead resort to the reparameterization trick.
In particular, at the outer loop step $t$, the $k$-th inner loop step of Langevin sampling is as follows,
$$\tilde{v}^{(k)} = \tilde{v}^{(k-1)} - \alpha_k \frac{\partial E(\tilde{v}^{(k-1)}, h^{(t-1)})}{\partial v} + \sqrt{2\alpha_k}\, \xi_k = \tilde{v}^{(k-1)} - \alpha_k \frac{\tilde{v}^{(k-1)} - \mu - W h^{(t-1)}}{\sigma^2} + \sqrt{2\alpha_k}\, \xi_k = \left(1 - \frac{\alpha_k}{\sigma^2}\right) \tilde{v}^{(k-1)} + \alpha_k \frac{\mu + W h^{(t-1)}}{\sigma^2} + \sqrt{2\alpha_k}\, \xi_k,$$
where $\xi_k \sim \mathcal{N}(0, I)$ for all $k \in \{1, \dots, K\}$. This discretization of Langevin dynamics gives a sample path of the distribution $q(\tilde{v}^{(K)} \mid \tilde{v}^{(0)}, h^{(t-1)})$. We now show that this sample path can be reparameterized as a simpler one that gives the desired analytical form of $q(\tilde{v}^{(K)} \mid \tilde{v}^{(0)}, h^{(t-1)})$. To simplify the derivation, we introduce $\beta_k = \prod_{j=k+1}^{K} \left(1 - \frac{\alpha_j}{\sigma^2}\right)$ for all $k \in \{0, \dots, K-1\}$, and $\beta_K = 1$. Therefore, after $K$ steps, we have,
$$\tilde{v}^{(K)} = \left(1 - \frac{\alpha_K}{\sigma^2}\right) \tilde{v}^{(K-1)} + \alpha_K \frac{\mu + W h^{(t-1)}}{\sigma^2} + \sqrt{2\alpha_K}\, \xi_K = \prod_{k=1}^{K} \left(1 - \frac{\alpha_k}{\sigma^2}\right) \tilde{v}^{(0)} + \sum_{k=1}^{K} \left[\prod_{j=k+1}^{K} \left(1 - \frac{\alpha_j}{\sigma^2}\right)\right] \left(\alpha_k \frac{\mu + W h^{(t-1)}}{\sigma^2} + \sqrt{2\alpha_k}\, \xi_k\right) = \beta_0 \tilde{v}^{(0)} + \sum_{k=1}^{K} \beta_k \alpha_k \frac{\mu + W h^{(t-1)}}{\sigma^2} + \sum_{k=1}^{K} \beta_k \sqrt{2\alpha_k}\, \xi_k. \tag{30}$$
Here $\{\xi_k \mid k = 1, \dots, K\}$ are independent random variables from the standard Normal distribution. Since a linear combination of independent Gaussian random variables is again Gaussian, we have
$$q(\tilde{v}^{(K)} \mid \tilde{v}^{(0)}, h^{(t-1)}) = \mathcal{N}\left(\beta_0 \tilde{v}^{(0)} + \sum_{k=1}^{K} \beta_k \alpha_k \frac{\mu + W h^{(t-1)}}{\sigma^2},\; \left(\sum_{k=1}^{K} 2\alpha_k \beta_k^2\right) I\right).$$
We can then compute the acceptance probability,
$$A\left((v^{(t)}, h^{(t)}), (v^{(t-1)}, h^{(t-1)})\right) = \min\left(1, \frac{p(v^{(t)}, h^{(t)})\, q(v^{(t-1)}, h^{(t-1)} \mid v^{(t)}, h^{(t)})}{p(v^{(t-1)}, h^{(t-1)})\, q(v^{(t)}, h^{(t)} \mid v^{(t-1)}, h^{(t-1)})}\right) = \min\left(1, \frac{p(v^{(t)}, h^{(t)})\, q(h^{(t-1)} \mid v^{(t-1)})\, q(v^{(t-1)} \mid v^{(t)}, h^{(t)})}{p(v^{(t-1)}, h^{(t-1)})\, q(h^{(t)} \mid v^{(t)})\, q(v^{(t)} \mid v^{(t-1)}, h^{(t-1)})}\right)$$
$$= \min\left(1, \frac{\exp\left(-E_{\theta}(v^{(t)}, h^{(t)}) - \frac{\left\| v^{(t-1)} - \beta_0 v^{(t)} - a(\mu + W h^{(t)}) \right\|^2}{2\bar{\sigma}^2}\right) q(h^{(t-1)} \mid v^{(t-1)})}{\exp\left(-E_{\theta}(v^{(t-1)}, h^{(t-1)}) - \frac{\left\| v^{(t)} - \beta_0 v^{(t-1)} - a(\mu + W h^{(t-1)}) \right\|^2}{2\bar{\sigma}^2}\right) q(h^{(t)} \mid v^{(t)})}\right),$$
where $q(h_j = 1 \mid v) = \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j$, $a = \sum_{k=1}^{K} \frac{\beta_k \alpha_k}{\sigma^2}$, and $\bar{\sigma}^2 = \sum_{k=1}^{K} 2\alpha_k \beta_k^2$.
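The reparameterization above is easy to check numerically. The sketch below is our own illustration (not the paper's code): it fixes the noises $\xi_k$, runs the Langevin recursion directly, and compares the result to the closed form of Eq. (30); the vector `Wh` stands in for $W h^{(t-1)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 4
sigma2 = 1.5
mu = rng.normal(size=D)
Wh = rng.normal(size=D)                  # stands in for W @ h^(t-1)
alphas = [0.05, 0.04, 0.03, 0.02]        # step sizes alpha_1..alpha_K
v0 = rng.normal(size=D)
xis = rng.normal(size=(K, D))            # the same noises xi_k for both paths

# (1) Run the Langevin recursion directly.
v = v0.copy()
for k in range(K):
    v = (1 - alphas[k] / sigma2) * v + alphas[k] * (mu + Wh) / sigma2 \
        + np.sqrt(2 * alphas[k]) * xis[k]

# (2) Closed form of Eq. (30): beta_m = prod_{j=m+1}^{K} (1 - alpha_j / sigma2),
#     with 0-based alphas so betas[m] multiplies over indices m..K-1.
betas = [np.prod([1 - alphas[i] / sigma2 for i in range(m, K)])
         for m in range(K + 1)]          # betas[K] = 1 (empty product)
v_closed = (betas[0] * v0
            + sum(betas[k + 1] * alphas[k] * (mu + Wh) / sigma2 for k in range(K))
            + sum(betas[k + 1] * np.sqrt(2 * alphas[k]) * xis[k] for k in range(K)))

assert np.allclose(v, v_closed)
```

The two computations agree exactly up to floating-point error, which is the content of the reparameterization argument: the sample path is an affine function of the independent noises, hence Gaussian with the stated mean and variance.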

A.5 LEARNING

We derive the detailed gradients of the marginalized log likelihood of visible units w.r.t. model parameters as below,
$$\nabla W_{ij} = \left\langle \frac{v_i}{\sigma_i^2}\, \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j \right\rangle_d - \left\langle \frac{v_i}{\sigma_i^2}\, \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j \right\rangle_m \tag{33}$$
$$\nabla \mu_i = \left\langle \frac{v_i - \mu_i}{\sigma_i^2} \right\rangle_d - \left\langle \frac{v_i - \mu_i}{\sigma_i^2} \right\rangle_m \tag{34}$$
$$\nabla \log \sigma_i^2 = \left\langle \frac{(v_i - \mu_i)^2}{2\sigma_i^2} - \sum_j \frac{v_i W_{ij}\, \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j}{\sigma_i^2} \right\rangle_d - \left\langle \frac{(v_i - \mu_i)^2}{2\sigma_i^2} - \sum_j \frac{v_i W_{ij}\, \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j}{\sigma_i^2} \right\rangle_m$$
$$\nabla b_j = \left\langle \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j \right\rangle_d - \left\langle \mathrm{Sigmoid}\left(\frac{W^{\top} v}{\sigma^2} + b\right)_j \right\rangle_m,$$
where $\langle \cdot \rangle_d$ and $\langle \cdot \rangle_m$ denote expectations w.r.t. the data and model distributions respectively.
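As an illustrative sketch (not the authors' implementation), the four gradients above can be estimated in a few lines of numpy, with $\langle \cdot \rangle_d$ and $\langle \cdot \rangle_m$ approximated by batch averages over data and model (negative-phase) samples; here `W` is $D \times H$ and batches are $(B, D)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grbm_grads(v_d, v_m, W, b, mu, sigma2):
    """CD estimates of the four gradients above: data-phase minus model-phase
    batch averages over v_d (data) and v_m (model samples)."""
    def phase(v):
        h = sigmoid(v / sigma2 @ W + b)               # (B, H): Sigmoid(W^T v / sigma^2 + b)
        gW = (v / sigma2).T @ h / len(v)              # average of outer products, for W_ij
        gmu = ((v - mu) / sigma2).mean(0)             # for mu_i
        glogs2 = ((v - mu) ** 2 / (2 * sigma2)
                  - v * (h @ W.T) / sigma2).mean(0)   # for log sigma_i^2
        gb = h.mean(0)                                # for b_j
        return gW, gmu, glogs2, gb
    return [d - m for d, m in zip(phase(v_d), phase(v_m))]

# Sanity check: if the model samples equal the data, every gradient vanishes.
rng = np.random.default_rng(1)
B, D, H = 8, 6, 4
W, b = rng.normal(size=(D, H)), rng.normal(size=H)
mu, sigma2 = rng.normal(size=D), np.full(D, 2.0)
v = rng.normal(size=(B, D))
grads = grbm_grads(v, v, W, b, mu, sigma2)
assert all(np.allclose(g, 0) for g in grads)
```

The vanishing-gradient check reflects the fixed point of maximum likelihood: learning stops when the model distribution matches the data distribution in these sufficient statistics.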

B MORE EXPERIMENTAL RESULTS

For all experiments, we set the initial variances of GRBMs to 1, clip the gradient norm to be no larger than 10, and use SGD with neither momentum nor weight decay. We divide the total energy of a mini-batch by the batch size so that we are minimizing the average negative log likelihood. We also decay the learning rate of SGD from the initial value 0.01 to 0 using the same cosine scheduler as described in Eq. (22). The burn-in step in CD learning is set to 0, i.e., we do not discard any samples from any Markov chains. For all experiments involving images, we standardize the input image by subtracting the pixel-wise mean and dividing by the pixel-wise standard deviation. For color images, the subtraction and division are performed channel-wise. The details of all baselines on MNIST are as follows.
• VAE. We use an encoder with 4 convolutional blocks (3 × 3 Conv+BN+ReLU) along with a 2-layer MLP. The decoder contains a 2-layer MLP followed by 4 convolutional blocks (3 × 3 Conv+BN+ReLU) and a 1-hidden-layer MLP. The total number of parameters is around 3.98M.
• 2sVAE. For the encoder, we use 4 convolutional blocks (3 × 3 Conv+BN+ReLU). For the decoder, we use 2 convolutional blocks (3 × 3 Conv+BN+ReLU) followed by another 2 convolutional blocks (3 × 3 ConvTranspose+BN+ReLU). The total number of parameters is around 13.58M.
• PixelCNN++. We use 8 masked convolutional blocks (3 × 3 Conv+BN+ReLU), which results in 1.41M parameters.
• WGAN. For the discriminator, we use 4 convolutional blocks (3 × 3 Conv+BN+ReLU). For the generator, we use 4 convolutional blocks (3 × 3 ConvTranspose+BN+ReLU). The total number of parameters is around 1.73M.
• NVAE. We use 44 ResBlocks and 42 ResBlocks for the encoder and decoder respectively, which results in around 33.36M parameters.
Our GRBMs instead have a single hidden layer and about 3.21M parameters.
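The optimization recipe described at the start of this section (gradient-norm clipping at 10 plus SGD with a cosine-decayed learning rate from 0.01) can be sketched as below. This is our own stand-alone illustration, not the authors' code.

```python
import math

def clip_by_norm(grads, max_norm=10.0):
    """Rescale a flat list of gradient entries so their L2 norm is <= max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads

def sgd_lr(epoch, total_epochs, lr0=0.01):
    """Cosine-decay the learning rate from lr0 to 0, matching the scheduler of Eq. (22)."""
    return 0.5 * lr0 * (1.0 + math.cos(epoch / total_epochs * math.pi))

# A norm-50 gradient is rescaled to norm 10; a small gradient passes through unchanged.
assert clip_by_norm([30.0, 40.0]) == [6.0, 8.0]
assert clip_by_norm([3.0, 4.0]) == [3.0, 4.0]
```

Clipping by the global norm (rather than element-wise) preserves the gradient direction while bounding the update magnitude, which is what makes large learning rates usable.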

B.1 GAUSSIAN MIXTURE DENSITIES

The batch size and the hidden size are set to 100 and 256 respectively. Whenever Metropolis adjustment is used, we adjust at every step, since the experiments with Gaussian mixture densities are fast. Although a smaller hidden size could work, we found this size makes learning converge stably for all sampling algorithms. To ensure a fair comparison, we train all GRBMs for 50K epochs and use the last model to draw the density plots and samples, even though learning with most inference algorithms converges within 5K to 10K epochs.

B.2 ABLATION STUDY ON MNIST

In this part, we perform an ablation study on MNIST to investigate the effect of several important factors. In all experiments on MNIST, we set the batch size to 512 and the number of epochs to 3000. First, we vary the number of CD steps and the hidden size while fixing the other hyperparameters. The results are shown in Table 2. We found that a hidden size of 4096 and 100 CD steps work best on MNIST. More CD steps could potentially be better but would take longer to train. We then study the number of Langevin sampling steps, the Metropolis adjust step, the initial Langevin step size, and its annealing. Here annealing means we decay the initial Langevin step size to 0 following the cosine scheduler as training goes on. The results are shown in Table 3. We can see that the larger the initial Langevin step size, the better the performance, although values larger than 0.04 sometimes make the sampling fail numerically. Similarly, more Langevin steps improve performance, at the price of more computation. We also find that it may not be necessary to adjust at every step, and that annealing the initial step size slightly improves the performance of Gibbs-Langevin with adjustment. We further study the effect of standard Normal noise vs. data in the initialization of negative Markov chains in CD. In particular, given a mini-batch of data, we randomly draw a portion of the examples so that their negative Markov chains start from samples drawn from the standard Normal distribution, as in our modified CD. For the remaining portion, we start the negative Markov chains from the data, as we do for the positive Markov chains. This helps us understand the importance of noise in the initialization of modified CD. The results are shown in Table 4. It is clear that sample quality decreases as more data are used for initialization.
This is expected since the discrepancy between the initial distributions of the negative Markov chain during training and the Markov chain during testing is increasing. Note that if we use data as initialization for all samples, then the underlying learning method reduces to the original CD.
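The mixed initialization used in this ablation can be sketched as below; this is our own illustration, where the hypothetical parameter `noise_frac` is the portion of negative chains started from noise.

```python
import numpy as np

def init_negative_chains(data, noise_frac, rng):
    """Start a fraction of the negative Markov chains from N(0, I)
    (modified CD) and the rest from the data (original CD)."""
    B, D = data.shape
    from_noise = rng.random(B) < noise_frac   # per-example choice
    init = data.copy()
    init[from_noise] = rng.normal(size=(int(from_noise.sum()), D))
    return init

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 3))
# noise_frac = 0.0 reduces to the original CD initialization.
assert np.array_equal(init_negative_chains(data, 0.0, rng), data)
```

Setting `noise_frac = 1.0` recovers the modified CD of the paper, where every negative chain starts from standard Normal noise, matching the chains used at test time.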

B.3 MORE VISUAL RESULTS

We train for 3K epochs in the experiments on both the FashionMNIST and CelebA-32 datasets. For CelebA-2K-64, we train for 4K epochs. The batch size on FashionMNIST and CelebA-32 is 512, whereas the batch size on CelebA-2K-64 is 100. We show the samples drawn from the best GRBMs learned with different sampling methods in Fig. 5. It is clear that samples corresponding to Gibbs-Langevin have better visual quality than those from Langevin and Gibbs. We also show more results of GRBMs learned with Gibbs-Langevin in Fig. 6, Fig. 7, Fig. 8, and Fig. 9.



Here sampling accuracy means the closeness between the underlying distribution of samples and the target distribution, measured in, e.g., total variation or Wasserstein distance. The reconstructed data is typically obtained by running one complete step of the Gibbs sampler from the observed data, and is thus likely to be close to the observed data.



Density modelling using GRBMs on data from Gaussian mixtures with isotropic (rows 1 and 2) and anisotropic variances (rows 3 and 4). Rows 1 and 3 show normalized GMM densities and (unnormalized) negative energy values for GRBMs. Rows 2 and 4 show samples drawn under different models and methods: (a) Ground Truth; (b) Gibbs; (c) Langevin wo. Adjust; (d) Langevin w. Adjust; (e) Gibbs-Langevin wo. Adjust; (f) Gibbs-Langevin w. Adjust.

Figure 2: Intermediate samples from Gibbs-Langevin sampling.

Figure 3: (a) Learning curve of (natural) log variances, (b) learned filters, and (c) samples on MNIST.

Figure 5: Samples from GRBMs learned with different sampling algorithms on MNIST.

Figure 6: More samples from the learned GRBM (Gibbs-Langevin) on MNIST.

Figure 7: More samples from the learned GRBM (Gibbs-Langevin) on FashionMNIST.

Figure 8: More samples from the learned GRBM (Gibbs-Langevin) on CelebA-32.

Figure 9: More samples from the learned GRBM (Gibbs-Langevin) on CelebA-2K-64.

Results on the MNIST dataset.

Ablation study of the hidden size and the number of CD steps on the MNIST dataset.

Ablation study of the number of Langevin steps K, the initial Langevin step size α0, annealing of the initial Langevin step size, and the Metropolis adjust step η on the MNIST dataset. All runs use 100 CD steps.

Ablation study of the initialization of negative Markov chains.

