GAUSSIAN-BERNOULLI RBMS WITHOUT TEARS

Abstract

We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. First, we propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods such as plain Gibbs sampling. Second, we propose a modified contrastive divergence (CD) algorithm that allows GRBMs to generate images starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. Moreover, we show that the modified CD together with gradient clipping suffices to robustly train GRBMs with large learning rates, removing the need for the various tricks found in the literature. Experiments on Gaussian mixtures, MNIST, FashionMNIST, and CelebA show that GRBMs can generate good samples despite their single-hidden-layer architecture.

1. INTRODUCTION

Restricted Boltzmann machines (RBMs) (Smolensky, 1986; Freund & Haussler, 1991; Hinton, 2002) are generative energy-based models (EBMs) with stochastic binary units. A variant of Boltzmann machines (Ackley et al., 1985), they have a bipartite graphical structure that enables efficient probabilistic inference, and they can be stacked to form deep belief networks (DBNs) (Hinton & Salakhutdinov, 2006; Bengio et al., 2006; Hinton et al., 2006) and deep Boltzmann machines (DBMs) (Salakhutdinov & Hinton, 2009; Cho et al., 2013). Gaussian-Bernoulli RBMs (GRBMs) (Welling et al., 2004; Hinton & Salakhutdinov, 2006) extend RBMs to model continuous data by replacing the binary visible units of the RBM with Gaussian random variables. GRBMs remain challenging to learn, however, despite many proposed modifications to the model or training algorithm. For instance, Lee et al. (2007) add a regularization term to encourage sparsely activated binary hidden units. Krizhevsky et al. (2009) attribute the difficulties in learning to high-frequency noise present in natural images. Factorized high-order terms were introduced in (Ranzato & Hinton, 2010; Ranzato et al., 2010) to allow GRBMs to explicitly learn the covariance structure among pixels. Nair & Hinton (2010) suggest that binary hidden units are problematic, and propose model variants with real-valued hidden units. Cho et al. (2011a; 2013) advocate the use of parallel tempering sampling (Earl & Deem, 2005), an adaptive learning rate, and the enhanced gradient (Cho et al., 2011b) to improve GRBM learning. Melchior et al. (2017) conclude that the difficulties in GRBM training are due to training algorithms rather than the model itself; they advocate the use of gradient clipping, specialized weight initialization, and contrastive divergence (CD) (Hinton, 2002) rather than persistent CD (Tieleman, 2008). Tramel et al. (2018) propose truncated Gaussian visible units and employ the Thouless-Anderson-Palmer (TAP) mean-field approximation for inference and learning. Upadhya & Sastry (2021) propose a stochastic difference of convex functions programming (S-DCP) algorithm to replace CD in training GRBMs.

An important motivation for improving GRBM learning is so that it can be used as a front-end to convert real-valued data to stochastic binary data. This would enable research on modelling real-valued data via DBMs/DBNs, which are more expressive due to their deep architectures. This class of models is of special interest: their learning algorithm involves only local computation, and thus they are more biologically plausible than EBMs trained using backprop.

As GRBMs are perhaps the simplest hybrid EBMs (comprising both continuous and discrete random variables), investigating the inference and learning algorithms of GRBMs would lay the foundation for, and inspire, future research on deep hybrid EBMs, which are useful for many applications such as generating (continuous) images and their (discrete) scene graphs. Finally, RBMs and GRBMs are actively studied in quantum computing and physics (Melko et al., 2019; Ajagekar & You, 2020) since they naturally fit the
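To make the model under discussion concrete, the following is a minimal sketch of the standard GRBM conditionals and one Gibbs sweep, using the common parameterization in which hidden units given the visibles are Bernoulli and visibles given the hiddens are diagonal Gaussians. The toy dimensions and variable names here are illustrative, not taken from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy GRBM parameters (hypothetical sizes, for illustration only).
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))  # visible-hidden weights
b = np.zeros(n_vis)     # visible biases
c = np.zeros(n_hid)     # hidden biases
sigma = np.ones(n_vis)  # per-dimension std of the Gaussian visible units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # Bernoulli hiddens: p(h_j = 1 | v) = sigmoid(c_j + sum_i (v_i / sigma_i^2) W_ij)
    p = sigmoid(c + (v / sigma**2) @ W)
    return (rng.random(n_hid) < p).astype(float), p

def sample_v_given_h(h):
    # Gaussian visibles: p(v | h) = N(b + W h, diag(sigma^2))
    mean = b + W @ h
    return mean + sigma * rng.standard_normal(n_vis), mean

# One Gibbs sweep starting from a data vector (as in CD training).
v0 = rng.standard_normal(n_vis)
h, _ = sample_h_given_v(v0)
v1, _ = sample_v_given_h(h)
```

The bipartite structure is what makes this sweep cheap: all hidden units are conditionally independent given the visibles and vice versa, so each half-step is a single matrix-vector product.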

