QUANTITATIVE UNIVERSAL APPROXIMATION BOUNDS FOR DEEP BELIEF NETWORKS

Abstract

We show that deep belief networks with binary hidden units can approximate any multivariate probability density under very mild integrability requirements on the parental density of the visible nodes. The approximation is measured in the $L^q$-norm for $q \in [1, \infty]$ ($q = \infty$ corresponding to the supremum norm) and in Kullback-Leibler divergence. Furthermore, we establish sharp quantitative bounds on the approximation error in terms of the number of hidden units.

1. INTRODUCTION

Deep belief networks (DBNs) are a class of generative probabilistic models obtained by stacking several restricted Boltzmann machines (RBMs, Smolensky (1986)). For a brief introduction to RBMs and DBNs we refer the reader to the survey articles Fischer & Igel (2012; 2014); Montúfar (2016); Ghojogh et al. (2021). Since their introduction, see Hinton et al. (2006); Hinton & Salakhutdinov (2006), DBNs have been successfully applied to a variety of problems in the domains of natural language processing Hinton (2009); Jiang et al. (2018), bioinformatics Wang & Zeng (2013); Liang et al. (2014); Cao et al. (2016); Luo et al. (2019), financial markets Shen et al. (2015) and computer vision Abdel-Zaher & Eldeib (2016); Kamada & Ichimura (2016; 2019); Huang et al. (2019). However, our theoretical understanding of the class of continuous probability distributions that can be approximated by them is limited. The ability to approximate a broad class of probability distributions, usually referred to as the universal approximation property, is still an open problem for DBNs with real-valued visible units. As a measure of proximity between two real-valued probability density functions, one typically considers the $L^q$-distance or the Kullback-Leibler divergence.

Contributions.

In this article we study the approximation properties of deep belief networks for multivariate continuous probability distributions which have a density with respect to the Lebesgue measure. We show that, as $m \to \infty$, the universal approximation property holds for binary-binary DBNs with two hidden layers of sizes $m$ and $m + 1$, respectively. Furthermore, we provide an explicit quantitative bound on the approximation error in terms of $m$. More specifically, the main contributions of this article are:

• For each $q \in [1, \infty)$ we show that DBNs with two binary hidden layers and parental density $\phi \colon \mathbb{R}^d \to \mathbb{R}_+$ can approximate any probability density $f \colon \mathbb{R}^d \to \mathbb{R}_+$ in the $L^q$-norm, solely under the condition that $f, \phi \in L^q(\mathbb{R}^d)$, where
$$L^q(\mathbb{R}^d) = \bigg\{ f \colon \mathbb{R}^d \to \mathbb{R} : \|f\|_{L^q} = \Big( \int_{\mathbb{R}^d} |f(x)|^q \, \mathrm{d}x \Big)^{1/q} < \infty \bigg\}.$$
In addition, we prove that the error admits a bound of order $O\big(m^{\frac{1}{\min(q,2)} - 1}\big)$ for each $q \in (1, \infty)$, where $m$ is the number of hidden neurons.

• If the target density $f$ is uniformly continuous and the parental density $\phi$ is bounded, we provide an approximation result in the $L^\infty$-norm (also known as supremum or uniform norm), where $L^\infty(\mathbb{R}^d) = \big\{ f \colon \mathbb{R}^d \to \mathbb{R} : \|f\|_{L^\infty} = \sup_{x \in \mathbb{R}^d} |f(x)| < \infty \big\}$.

• Finally, we show that continuous target densities supported on a compact subset of $\mathbb{R}^d$ and uniformly bounded away from zero can be approximated by deep belief networks with bounded parental density in Kullback-Leibler divergence. The approximation error in this case is of order $O(m^{-1})$.

Related works. One of the first approximation results for deep belief networks is due to Sutskever & Hinton (2008) and states that any probability distribution on $\{0, 1\}^d$ can be learnt by a DBN with $3 \times 2^d$ hidden layers of size $d + 1$ each. This result was improved by Le Roux & Bengio (2010); Montúfar & Ay (2011), who reduced the number of layers to $\frac{2^{d-1}}{d - \log(d)}$ with $d$ hidden units each. These results, however, are limited to discrete probability distributions. Since most applications involve continuous probability distributions, Krause et al. (2013) considered Gaussian-binary DBNs and analyzed their approximation capabilities in Kullback-Leibler divergence, albeit without a rate. In addition, they only allow for target densities that can be written as an infinite mixture of a set of probability densities satisfying certain conditions, which appear to be hard to check in practice.

Similar questions have been studied for a variety of neural network architectures: The famous results of Cybenko (1989); Hornik et al. (1989) state that deterministic multi-layer feed-forward networks are universal approximators for a large class of Borel measurable functions, provided that they have at least one sufficiently large hidden layer. See also the articles Leshno et al. (1993); Chen & Chen (1995); Barron (1993); Burger & Neubauer (2001). Le Roux & Bengio (2008) proved the universal approximation property for RBMs and discrete target distributions. Montúfar & Morton (2015) established the universal approximation property for discrete restricted Boltzmann machines. Montúfar (2014) showed the universal approximation property for deep narrow Boltzmann machines. Montúfar (2015) showed that Markov kernels can be approximated by shallow stochastic feed-forward networks with exponentially many hidden units. Bengio & Delalleau (2011); Pascanu et al. (2014) studied the approximation properties of so-called deep architectures. Merkh & Montúfar (2019) investigated the approximation properties of stochastic feed-forward networks. The recent work Johnson (2018) nicely complements the aforementioned results by obtaining an illustrative negative result: deep narrow networks with hidden layer width at most equal to the input dimension do not possess the universal approximation property. Since our methodology involves an approximation by a convex combination of probability densities, we refer the reader to the related works of Nguyen & McLachlan (2019); Nguyen et al. (2020) and the references therein for an overview of the wide range of universal approximation results in the context of mixture models. See also Everitt & Hand (1981); Titterington et al. (1985); McLachlan & Basford (1988); McLachlan & Peel (2000); Robert & Mengersen (2011); Celeux (2019) for in-depth treatments of mixture models. The recent articles Bailey & Telgarsky (2018); Perekrestenko et al. (2020) in the context of generative networks show that deep neural networks can transform a one-dimensional uniform distribution so as to approximate any two-dimensional Lipschitz continuous target density. Another strand of research related to the questions of this article comprises works on quantile (or distribution) regression; see Koenker (2005) as well as Dabney et al. (2018); Tagasovska & Lopez-Paz (2019); Fakoor et al. (2021) for recent methods involving neural networks.
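Before turning to the formal setup, the following minimal numerical sketch (in Python) illustrates the mixture-based mechanism underlying our bounds: a target density is approximated by a convex combination of $m$ translates of a parental density, and the empirical $L^2$ error decreases as $m$ grows. The Gaussian parental density, the Laplace target, the grid placement of the components and the bandwidth choice are ad-hoc assumptions made for this illustration, not the construction used in our proofs.

import numpy as np

def phi(x, mu, s):
    # Gaussian parental density, translated to mu and scaled by s (assumed choice).
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def f(x):
    # Laplace target density on R.
    return 0.5 * np.exp(-np.abs(x))

x = np.linspace(-8.0, 8.0, 4001)        # fine evaluation grid
dx = x[1] - x[0]

for m in (8, 16, 32, 64, 128):
    mus = np.linspace(-6.0, 6.0, m)     # component locations on a grid
    s = mus[1] - mus[0]                 # bandwidth tied to the grid spacing
    w = f(mus)
    w /= w.sum()                        # convex weights summing to one
    fm = sum(wi * phi(x, mu, s) for wi, mu in zip(w, mus))
    l2 = np.sqrt(np.sum((f(x) - fm) ** 2) * dx)  # discretized L^2 distance
    print(f"m = {m:4d}   L2 error = {l2:.4f}")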

2. DEEP BELIEF NETWORKS

A restricted Boltzmann machine (RBM) is an undirected probabilistic graphical model whose vertices split into two classes, with each vertex fully connected to the opposite class. To be more precise, we consider a simple graph $G = (\mathcal{V}, E)$ whose vertex set $\mathcal{V}$ can be partitioned into sets $V$ and $H$ such that the edge set is given by $E = \big\{ \{s, t\} : s \in V, \, t \in H \big\}$. We call the vertices in $V$ visible units; $H$ contains the hidden units. To each of the visible units we associate the state space $\Omega_V$ and to the hidden ones we associate $\Omega_H$. We equip $G$ with the Gibbs probability measure
$$\pi(v, h) = \frac{e^{-\mathcal{H}(v, h)}}{Z}, \qquad v \in (\Omega_V)^V, \ h \in (\Omega_H)^H,$$
where $\mathcal{H}$ denotes the energy function and $Z$ is the normalizing constant, also known as the partition function.
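For readers who prefer code, here is a minimal sketch (in Python) of the Gibbs measure above for a binary-binary RBM, i.e. $\Omega_V = \Omega_H = \{0, 1\}$. It computes $\pi(v, h)$ exactly by brute-force enumeration; the bilinear energy $\mathcal{H}(v, h) = -a^\top v - b^\top h - v^\top W h$ and the parameter names a, b, W are the standard RBM parametrisation assumed for this illustration, not a definition taken from the text.

import itertools
import numpy as np

# A toy binary-binary RBM: |V| = 3 visible and |H| = 2 hidden units.
rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
a = rng.normal(size=n_visible)              # visible biases (assumed parametrisation)
b = rng.normal(size=n_hidden)               # hidden biases
W = rng.normal(size=(n_visible, n_hidden))  # weights on the edges E

def energy(v, h):
    # Assumed bilinear energy H(v, h) = -a.v - b.h - v^T W h.
    return -(a @ v + b @ h + v @ W @ h)

# Enumerate all states in (Omega_V)^V x (Omega_H)^H = {0,1}^3 x {0,1}^2.
states = [(np.array(v), np.array(h))
          for v in itertools.product([0, 1], repeat=n_visible)
          for h in itertools.product([0, 1], repeat=n_hidden)]
weights = np.array([np.exp(-energy(v, h)) for v, h in states])
Z = weights.sum()    # partition function
pi = weights / Z     # Gibbs measure pi(v, h) = exp(-H(v, h)) / Z

assert np.isclose(pi.sum(), 1.0)  # pi is a probability measure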






