GAUSSIAN-BERNOULLI RBMS WITHOUT TEARS

Abstract

We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. First, we propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods such as Gibbs sampling. Second, we propose a modified contrastive divergence (CD) algorithm that makes it possible to generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. Moreover, we show that modified CD and gradient clipping suffice to robustly train GRBMs with large learning rates, removing the need for various tricks used in the literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show that GRBMs can generate good samples, despite their single-hidden-layer architecture.

1. INTRODUCTION

Restricted Boltzmann machines (RBMs) (Smolensky, 1986; Freund & Haussler, 1991; Hinton, 2002) are generative energy-based models (EBMs) with stochastic binary units. A variant of Boltzmann machines (Ackley et al., 1985), they have a bipartite graphical structure that enables efficient probabilistic inference, and they can be stacked to form deep belief networks (DBNs) (Hinton & Salakhutdinov, 2006; Bengio et al., 2006; Hinton et al., 2006) and deep Boltzmann machines (DBMs) (Salakhutdinov & Hinton, 2009; Cho et al., 2013). Gaussian-Bernoulli RBMs (GRBMs) (Welling et al., 2004; Hinton & Salakhutdinov, 2006) extend RBMs to model continuous data by replacing the binary visible units of the RBM with Gaussian random variables. GRBMs remain challenging to learn, however, despite many proposed modifications to the model or training algorithm. For instance, Lee et al. (2007) add a regularization term to encourage sparsely activated binary hidden units. Krizhevsky et al. (2009) attribute the difficulties in learning to high-frequency noise present in natural images. Factorized high-order terms were introduced in (Ranzato & Hinton, 2010; Ranzato et al., 2010) to allow GRBMs to explicitly learn the covariance structure among pixels. Nair & Hinton (2010) suggest that binary hidden units are problematic, and propose model variants with real-valued hidden units. Cho et al. (2011a; 2013) advocate the use of parallel tempering sampling (Earl & Deem, 2005), an adaptive learning rate, and the enhanced gradient (Cho et al., 2011b) to improve GRBM learning. Melchior et al. (2017) conclude that difficulties in GRBM training are due to training algorithms rather than the model itself; they advocate the use of gradient clipping, specialized weight initialization, and contrastive divergence (CD) (Hinton, 2002) rather than persistent CD (Tieleman, 2008). Tramel et al. (2018) propose truncated Gaussian visible units and employ the Thouless-Anderson-Palmer (TAP) mean-field approximation for inference and learning. Upadhya & Sastry (2021) propose a stochastic difference of convex functions programming (S-DCP) algorithm to replace CD in training GRBMs.

An important motivation for improving GRBM learning is so that GRBMs can be used as a front-end to convert real-valued data into stochastic binary data. This would enable research on modelling real-valued data via DBMs/DBNs, which are more expressive due to their deep architectures. This class of models is of special interest: their learning algorithms involve only local computation, and thus they are more biologically plausible than EBMs trained using backprop. As GRBMs are perhaps the simplest hybrid EBMs (including both continuous and discrete random variables), investigating their inference and learning algorithms lays a foundation for, and may inspire, future research on deep hybrid EBMs, which are useful for many applications such as generating (continuous) images together with their (discrete) scene graphs. Finally, RBMs and GRBMs are actively studied in quantum computing and physics (Melko et al., 2019; Ajagekar & You, 2020), since they naturally fit the problem formulation (e.g., Ising models) required by many quantum computing devices. Progress on RBMs/GRBMs could thus benefit such interdisciplinary research.

To this end, we propose improved GRBM learning methods for image data. First, we propose a hybrid Gibbs-Langevin sampling algorithm that outperforms the predominant Gibbs sampling. To the best of our knowledge, this is the first use of Langevin sampling (with or without Metropolis adjustment) for GRBM training. Second, we propose a modified CD algorithm so that one can generate images with learned GRBMs starting from Gaussian noise.
This enables a fair and direct comparison of GRBMs with deep generative models, something beyond the reach of existing GRBM learning methods. Third, we show that the modified CD with gradient clipping is sufficient to train GRBMs, removing the need for heuristics that have been crucial to existing approaches. Finally, we show empirically that GRBMs can generate good samples on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA, despite having only a single hidden layer.
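As a concrete illustration of the first contribution, the hybrid sampler alternates an exact Gibbs update of the binary hidden units (tractable thanks to the bipartite structure) with Langevin updates of the continuous visible units. The following NumPy sketch uses hypothetical parameter names and a standard GRBM energy parameterization; it shows the unadjusted variant, and the actual algorithm may differ in details such as the Metropolis adjustment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_langevin_step(v, W, b_h, b_v, sigma2, eta=0.01, n_langevin=10):
    """One hybrid sampling step: Gibbs update of the binary hiddens,
    then unadjusted Langevin updates of the continuous visibles given
    the hiddens. v has shape (batch, n_visible)."""
    # Gibbs: sample h | v exactly from its Bernoulli conditional.
    p_h = sigmoid(v @ W / sigma2 + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Langevin: noisy gradient steps on the conditional energy E(v | h),
    # whose gradient is (v - b_v - W h) / sigma^2 for Gaussian visibles.
    for _ in range(n_langevin):
        grad_v = (v - b_v - h @ W.T) / sigma2
        v = v - eta * grad_v + np.sqrt(2.0 * eta) * rng.standard_normal(v.shape)
    return v, h
```

Note that for a GRBM the conditional p(v | h) is Gaussian and could be sampled exactly; the Langevin inner loop is what generalizes and, per the paper's argument, mixes better than plain Gibbs when chained.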

2. RELATED WORK

Learning the variances There are two variances to be estimated in GRBM modelling. One is the intrinsic variance of the data, e.g., the variance of image intensities, which is fixed once the data are observed. The other is the (extrinsic) variance parameter of the GRBM, which governs the level of additional Gaussian noise added to the visible units. Learning the extrinsic variance is thus necessary for generating sharp and realistic images. But small variance parameters tend to cause the energy function and its gradient to take large values, making the stochastic gradient estimates returned by CD numerically unstable. Most existing methods fix the variance (e.g., to one) to avoid this issue. Krizhevsky et al. (2009); Cho et al. (2011a) consider learning the variance using a smaller learning rate than for other parameters, obtaining much better reconstructions and thus supporting the importance of learning variances; however, many of the learned filters are still noisy and point-like. Melchior et al. (2017) learn a shared variance across all visible units, yielding improved performance, especially with large numbers of hidden units. In this work, we learn one variance parameter per visible unit and achieve much lower learned variances than existing methods, e.g., approximately 1e-5 on MNIST.

Stochastic gradient estimation and learning rate Due to the intractable log partition function of GRBMs, one often estimates the gradients of the log likelihood w.r.t. the parameters via Monte Carlo. Gibbs sampling is predominant in CD learning due to its simplicity, but it mixes slowly in practice; see (Decelle et al., 2021) for a detailed study of the mixing time of CD for RBMs. Slow mixing yields noisy gradient estimates, which often cause training instabilities and prohibit large learning rates. Cho et al. (2011a) explore parallel tempering with adaptive learning rates to obtain better reconstructions. Cho et al. (2013) propose enhanced gradients that are invariant to bit-flipping in hidden units. Melchior et al. (2017) show that gradient clipping and special weight initialization support robust CD learning with large learning rates. We advocate Langevin MC to improve gradient estimates, and validate that gradient clipping does enable training with large learning rates.

Model capacity Theis et al. (2011) show that GRBMs are outperformed even by simple mixture models in estimating likelihoods for image data. Wang et al. (2012); Melchior et al. (2017) demonstrate that GRBMs can be expressed as either a product of experts or a constrained Gaussian mixture in the visible domain, hinting that GRBMs need more hidden units than the true number of mixture components to fit additive mixture densities well. Krause et al. (2013); Gu et al. (2022) provide theoretical guarantees on GRBMs for universal approximation of mixtures and smooth densities. Although these results show that GRBMs are expressive, they do not directly yield practical GRBM learning algorithms.
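To make the variance and clipping ingredients above concrete, the sketch below (hypothetical names; a standard GRBM energy parameterization) learns one variance per visible unit through its logarithm, which keeps each variance positive and tempers the large energy gradients that arise as variances shrink, and applies global-norm gradient clipping of the kind advocated for robust CD training:

```python
import numpy as np

def grbm_energy(v, h, W, b_v, b_h, log_sigma2):
    """GRBM energy with one learnable variance per visible unit.
    Parameterizing log(sigma^2) keeps the variance positive and is
    numerically safer as learned variances become very small."""
    sigma2 = np.exp(log_sigma2)                    # per-visible-unit variances
    quad = 0.5 * np.sum((v - b_v) ** 2 / sigma2)   # Gaussian visible term
    inter = (v / sigma2) @ W @ h                   # visible-hidden coupling
    return quad - inter - b_h @ h

def clip_gradients(grads, max_norm=10.0):
    """Global-norm clipping of a dict of parameter gradients: rescale all
    gradients jointly so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = min(1.0, max_norm / (total + 1e-12))
    return {name: g * scale for name, g in grads.items()}
```

Clipping the whole CD gradient estimate this way bounds the parameter update regardless of how extreme the sampled gradients become, which is what permits the large learning rates discussed above.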


Model evaluation Like many deep generative models, GRBMs are difficult to evaluate because the log likelihood is intractable. To date, GRBMs have been evaluated by visually inspecting reconstructed images, filters, hidden activations (i.e., features), and images sampled during CD training. Quantitative metrics include reconstruction errors and the error rates of classifiers trained post hoc on learned features. However, these metrics do not necessarily indicate whether GRBMs are good generative models (Melchior et al., 2017). Unlike existing work, we sample from learned GRBMs starting from Gaussian noise, enabling direct comparisons with other generative models, both qualitatively (visually inspecting samples) and quantitatively (e.g., via the Frechet Inception distance (FID) (Heusel et al., 2017)). Note that a similar noise-initialization strategy has been studied for EBMs (Nijkamp et al., 2019).
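As a sketch of this evaluation protocol (hypothetical names; plain Gibbs sweeps stand in for the modified-CD chain used in the paper), sampling starts from Gaussian noise rather than from training data, so the generated images can be compared with those of other generative models:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_from_noise(W, b_v, b_h, sigma2, n_steps=100, n_samples=16):
    """Draw samples from a learned GRBM starting from Gaussian noise,
    mirroring the noise-initialized evaluation protocol: alternate exact
    Gibbs sweeps over the binary hiddens and Gaussian visibles."""
    v = rng.standard_normal((n_samples, b_v.shape[0]))  # noise init, not data
    for _ in range(n_steps):
        p_h = sigmoid(v @ W / sigma2 + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        mean_v = b_v + h @ W.T                          # Gaussian conditional mean
        v = mean_v + np.sqrt(sigma2) * rng.standard_normal(v.shape)
    return v
```

The resulting samples can then be scored with standard sample-based metrics such as FID, alongside samples from other generative model families.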

