QUANTITATIVE UNIVERSAL APPROXIMATION BOUNDS FOR DEEP BELIEF NETWORKS

Abstract

We show that deep belief networks with binary hidden units can approximate any multivariate probability density under very mild integrability requirements on the parental density of the visible nodes. The approximation is measured in the $L^q$-norm for $q \in [1, \infty]$ ($q = \infty$ corresponding to the supremum norm) and in Kullback-Leibler divergence. Furthermore, we establish sharp quantitative bounds on the approximation error in terms of the number of hidden units.

1. INTRODUCTION

Deep belief networks (DBNs) are a class of generative probabilistic models obtained by stacking several restricted Boltzmann machines (RBMs, Smolensky (1986)). For a brief introduction to RBMs and DBNs we refer the reader to the survey articles Fischer & Igel (2012; 2014); Montúfar (2016); Ghojogh et al. (2021). Since their introduction, see Hinton et al. (2006); Hinton & Salakhutdinov (2006), DBNs have been successfully applied to a variety of problems in the domains of natural language processing Hinton (2009); Jiang et al. (2018), bioinformatics Wang & Zeng (2013); Liang et al. (2014); Cao et al. (2016); Luo et al. (2019), financial markets Shen et al. (2015), and computer vision Abdel-Zaher & Eldeib (2016); Kamada & Ichimura (2016; 2019); Huang et al. (2019). However, our theoretical understanding of the class of continuous probability distributions that can be approximated by them is limited. The ability to approximate a broad class of probability distributions, usually referred to as the universal approximation property, has remained an open problem for DBNs with real-valued visible units. As a measure of proximity between two real-valued probability density functions, one typically considers the $L^q$-distance or the Kullback-Leibler divergence.

Contributions. In this article we study the approximation properties of deep belief networks for multivariate continuous probability distributions that admit a density with respect to the Lebesgue measure. We show that, as $m \to \infty$, the universal approximation property holds for binary-binary DBNs with two hidden layers of sizes $m$ and $m + 1$, respectively. Furthermore, we provide an explicit quantitative bound on the approximation error in terms of $m$.
More specifically, the main contributions of this article are:

• For each $q \in [1, \infty)$ we show that DBNs with two binary hidden layers and parental density $\varphi : \mathbb{R}^d \to \mathbb{R}_+$ can approximate any probability density $f : \mathbb{R}^d \to \mathbb{R}_+$ in the $L^q$-norm, solely under the condition that $f, \varphi \in L^q(\mathbb{R}^d)$, where
$$L^q(\mathbb{R}^d) = \Big\{ f : \mathbb{R}^d \to \mathbb{R} : \|f\|_{L^q} = \Big( \int_{\mathbb{R}^d} |f(x)|^q \, dx \Big)^{\frac{1}{q}} < \infty \Big\}.$$
In addition, we prove that the error admits a bound of order $O\big(m^{\frac{1}{\min(q,2)} - 1}\big)$ for each $q \in (1, \infty)$, where $m$ is the number of hidden neurons.

• If the target density $f$ is uniformly continuous and the parental density $\varphi$ is bounded, we provide an approximation result in the $L^\infty$-norm (also known as supremum or uniform norm), where $L^\infty(\mathbb{R}^d) = \big\{ f : \mathbb{R}^d \to \mathbb{R} : \|f\|_{L^\infty} = \sup_{x \in \mathbb{R}^d} |f(x)| < \infty \big\}$.

• Finally, we show that continuous target densities supported on a compact subset of $\mathbb{R}^d$ and uniformly bounded away from zero can be approximated by deep belief networks with bounded parental density in Kullback-Leibler divergence. The approximation error in this case is of order $O(m^{-1})$.

Related works. One of the first approximation results for deep belief networks is due to Sutskever & Hinton (2008) and states that any probability distribution on $\{0,1\}^d$ can be learnt by a DBN with $3 \times 2^d$ hidden layers of size $d + 1$ each. This result was improved by Le Roux & Bengio (2010); Montúfar & Ay (2011) by reducing the number of layers to $\frac{2^{d-1}}{d - \log(d)}$, each with $d$ hidden units. These results, however, are limited to discrete probability distributions. Since most applications involve continuous probability distributions, Krause et al. (2013) considered Gaussian-binary DBNs and analyzed their approximation capabilities in Kullback-Leibler divergence, albeit without a rate. In addition, they only allow for target densities that can be written as an infinite mixture of a set of probability densities satisfying certain conditions, which appear to be hard to check in practice.
Similar questions have been studied for a variety of neural network architectures: The famous results of Cybenko (1989); Hornik et al. (1989) state that deterministic multi-layer feed-forward networks are universal approximators for a large class of Borel measurable functions, provided that they have at least one sufficiently large hidden layer. See also the articles Leshno et al. (1993); Chen & Chen (1995); Barron (1993); Burger & Neubauer (2001). Le Roux & Bengio (2008) proved the universal approximation property for RBMs and discrete target distributions. Montúfar & Morton (2015) established the universal approximation property for discrete restricted Boltzmann machines. Montúfar (2014) showed the universal approximation property for deep narrow Boltzmann machines, and Montúfar (2015) showed that Markov kernels can be approximated by shallow stochastic feed-forward networks with exponentially many hidden units. Bengio & Delalleau (2011); Pascanu et al. (2014) studied the approximation properties of so-called deep architectures, and Merkh & Montúfar (2019) investigated the approximation properties of stochastic feed-forward networks. The recent work Johnson (2018) nicely complements the aforementioned results with an illustrative negative result: Deep narrow networks with hidden layer width at most equal to the input dimension do not possess the universal approximation property. Since our methodology involves an approximation by a convex combination of probability densities, we refer the reader to the related works of Nguyen & McLachlan (2019); Nguyen et al. (2020) and the references therein for an overview of the wide range of universal approximation results in the context of mixture models. See also Everitt & Hand (1981); Titterington et al. (1985); McLachlan & Basford (1988); McLachlan & Peel (2000); Robert & Mengersen (2011); Celeux (2019) for in-depth treatments of mixture models.
The recent articles Bailey & Telgarsky (2018); Perekrestenko et al. (2020) in the context of generative networks show that deep neural networks can transform a one-dimensional uniform distribution so as to approximate any two-dimensional Lipschitz continuous target density. Another strand of research related to the questions of this article is the work on quantile (or distribution) regression, see Koenker (2005) as well as Dabney et al. (2018); Tagasovska & Lopez-Paz (2019); Fakoor et al. (2021) for recent methods involving neural networks.

2. DEEP BELIEF NETWORKS

A restricted Boltzmann machine (RBM) is an undirected probabilistic graphical model on a bipartite vertex set in which every vertex is connected to all vertices of the opposite class. To be more precise, we consider a simple graph $G = (\mathcal{V}, E)$ whose vertex set $\mathcal{V}$ can be partitioned into sets $V$ and $H$ such that the edge set is given by $E = \big\{ \{s, t\} : s \in V,\ t \in H \big\}$. We call the vertices in $V$ visible units; $H$ contains the hidden units. To each of the visible units we associate the state space $\Omega_V$, and to the hidden ones we associate $\Omega_H$. We equip $G$ with a Gibbs probability measure
$$\pi(v, h) = \frac{e^{-H(v, h)}}{Z}, \qquad v \in (\Omega_V)^V, \ h \in (\Omega_H)^H, \tag{1}$$
where $H : (\Omega_V)^V \times (\Omega_H)^H \to \mathbb{R}$ is chosen such that $Z = \iint e^{-H(v, h)} \, dv \, dh < \infty$. Notice that the integral becomes a sum if $\Omega_V$ (resp. $\Omega_H$) is a discrete set. It is customary to identify the RBM with the probability measure $\pi$.

An important example are binary-binary RBMs. These are obtained by choosing $\Omega_V = \Omega_H = \{0, 1\}$ and
$$H = \langle v, W h \rangle + \langle v, b \rangle + \langle h, c \rangle, \qquad v \in \{0, 1\}^V, \ h \in \{0, 1\}^H,$$
where $b \in \mathbb{R}^V$ and $c \in \mathbb{R}^H$ are called biases, and $W \in \mathbb{R}^{V \times H}$ is called the weight matrix. For $m, n \in \mathbb{N}$, we shall write
$$\text{B-RBM}(m, n) = \{\pi : \pi \text{ is a binary-binary RBM with } m \text{ visible and } n \text{ hidden units}\} \tag{2}$$
for the set of binary-binary RBMs with fixed layer sizes. The following discrete approximation result is well known, see also Montúfar & Ay (2011):

Proposition 1 (Le Roux & Bengio (2008), Theorem 2). Let $m \in \mathbb{N}$ and $\mu$ be a probability distribution on $\{0, 1\}^m$. Let $\mathrm{supp}(\mu) = \{v \in \{0, 1\}^m : \mu(v) > 0\}$ be the support of $\mu$ and set $n = |\mathrm{supp}(\mu)| + 1$. Then, for each $\varepsilon > 0$, there is a $\pi \in \text{B-RBM}(m, n)$ such that
$$\Big| \mu(v) - \sum_{h \in \{0, 1\}^n} \pi(v, h) \Big| \leq \varepsilon \qquad \forall\, v \in \{0, 1\}^m.$$

A deep belief network (DBN) is constructed by stacking two RBMs. To be more precise, we now consider a tripartite graph with hidden layers $H_1$ and $H_2$ and visible units $V$. We assume that the edge set is now given by $E = \big\{ \{s, t_1\}, \{t_1, t_2\} : s \in V,\ t_1 \in H_1,\ t_2 \in H_2 \big\}$. The state spaces are now $\Omega_V = \mathbb{R}$ and $\Omega_{H_1} = \Omega_{H_2} = \{0, 1\}$.
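Before continuing with the DBN construction, the Gibbs measure (1) can be sanity-checked by brute force for a tiny binary-binary RBM. The following sketch is our own illustration, not part of the paper; the function name `rbm_joint` and the random parameters are hypothetical choices.

```python
import itertools
import numpy as np

def rbm_joint(W, b, c):
    """Tabulate pi(v, h) = exp(-H(v, h)) / Z for a binary-binary RBM with
    energy H(v, h) = <v, W h> + <v, b> + <h, c>, by exhaustive enumeration."""
    m, n = W.shape  # m visible units, n hidden units
    vs = list(itertools.product([0, 1], repeat=m))
    hs = list(itertools.product([0, 1], repeat=n))
    # Unnormalised weights exp(-H(v, h)) for every joint configuration.
    weights = np.array([[np.exp(-(np.dot(v, W @ np.array(h)) + np.dot(v, b) + np.dot(h, c)))
                         for h in hs] for v in vs])
    Z = weights.sum()  # partition function; a finite sum in the binary case
    return weights / Z

rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
pi = rbm_joint(W, b, c)
print(pi.sum())        # the Gibbs measure is a probability distribution
print(pi.sum(axis=1))  # marginal distribution of the three visible units
```

The enumeration has cost $2^{m+n}$ and is of course only feasible for toy sizes; it is meant to make the normalisation in (1) concrete.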
We think of the edges of the graph as expressing dependence of the neurons (in the probabilistic sense). The topology of the graph hence shows that the vertices in $V$ and $H_2$ shall be conditionally independent, that is, we require that
$$p(v, h_1, h_2) = p(v \mid h_1)\, p(h_1, h_2). \tag{3}$$
The joint density $p(h_1, h_2)$ of the hidden units will be chosen as a binary-binary RBM. Let $\mathcal{D}(\mathbb{R}^d) = \big\{ f : \mathbb{R}^d \to \mathbb{R}_+ : \int_{\mathbb{R}^d} f(x) \, dx = 1 \big\}$ be the set of probability densities on $\mathbb{R}^d$. For $\varphi \in \mathcal{D}(\mathbb{R}^d)$ and $\sigma > 0$ we set
$$V^\sigma_\varphi = \big\{ \varphi_{\mu, \sigma} : \mu \in \mathbb{R}^d \big\}, \qquad \varphi_{\mu, \sigma}(x) = \sigma^{-d} \varphi\Big( \frac{x - \mu}{\sigma} \Big). \tag{4}$$
Notice that all elements of $V^\sigma_\varphi$ are themselves probability densities. We fix a parental density $\varphi \in \mathcal{D}(\mathbb{R}^{|V|})$ and choose the conditional density in (3) as $p(\cdot \mid h_1) \in V^\sigma_\varphi$ for each $h_1 \in H_1$.

Example 2. The most popular choice of the parental density $\varphi$ in (4) is the $d$-dimensional standard Gaussian density
$$\varphi(x) = \frac{1}{(2\pi)^{d/2}} \exp\Big( -\frac{|x|^2}{2} \Big), \qquad x \in \mathbb{R}^d. \tag{5}$$
Another density considered in previous works is the truncated exponential distribution
$$\varphi(x) = \prod_{i=1}^d \frac{\lambda_i e^{-\lambda_i x_i}}{1 - e^{-b_i \lambda_i}} \mathbf{1}_{[0, b_i]}(x_i), \qquad x = (x_1, \dots, x_d) \in \mathbb{R}^d, \tag{6}$$
where $b_i, \lambda_i > 0$ for each $i = 1, \dots, d$.

Similar to (2), we collect all DBNs in the set
$$\text{DBN}_\varphi(d, m, n) = \big\{ p : p \text{ is a DBN with parental density } \varphi, \ d \text{ visible units, } m \text{ hidden units on the first level, and } n \text{ hidden units on the second level} \big\},$$
where $\varphi \in \mathcal{D}(\mathbb{R}^d)$ and $d, m, n \in \mathbb{N}$. We shall not distinguish between the whole DBN and the marginal density of the visible nodes, which is the object we are ultimately interested in, that is, we write
$$p(v) = \sum_{h_1 \in H_1} \sum_{h_2 \in H_2} p(v, h_1, h_2). \tag{7}$$
If $p \in \text{DBN}_\varphi(d, m, n)$ with $\varphi \in L^q(\mathbb{R}^{|V|})$, then the marginal (7) also belongs to $L^q(\mathbb{R}^{|V|})$.

After their introduction in Hinton & Salakhutdinov (2006), deep belief networks rose to prominence due to a training algorithm developed in Hinton et al. (2006), which addressed the vanishing gradient problem by pre-training deep networks.
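Once $p(\cdot \mid h_1)$ is drawn from $V^\sigma_\varphi$, the visible marginal (7) is a finite location mixture of rescaled copies of the parental density. The sketch below is our own illustration (with a Gaussian parental density as in Example 2; the helper names are hypothetical), not the paper's code.

```python
import numpy as np

def parental_gaussian(x, mu, sigma):
    """phi_{mu, sigma}(x) = sigma^(-d) * phi((x - mu) / sigma), with phi the
    d-dimensional standard Gaussian density (5)."""
    d = len(mu)
    z = (x - mu) / sigma
    return sigma**(-d) * (2.0 * np.pi)**(-d / 2.0) * np.exp(-0.5 * np.dot(z, z))

def dbn_marginal(x, alphas, mus, sigma):
    """Visible-unit marginal (7): a finite location mixture
    sum_i alpha_i * phi_{mu_i, sigma}(x), where alpha is the law of the first
    hidden layer restricted to the unit vectors."""
    return sum(a * parental_gaussian(x, mu, sigma) for a, mu in zip(alphas, mus))

alphas = [0.2, 0.5, 0.3]  # mixture weights summing to one
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
print(dbn_marginal(np.array([0.5, 0.5]), alphas, mus, sigma=1.0))
```

This mixture viewpoint is exactly what the approximation arguments of Section 4 exploit.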
Instead of naïvely stacking two RBMs, the authors considered several such stacked layers and greedily pre-trained the weights layer by layer on a contrastive divergence loss. To be more precise, let $M$ denote the number of hidden layers. First, the visible layer and the first hidden layer are treated as a classical RBM and the weights of the first hidden layer are learnt. In the second step, the weights of the second hidden layer are learnt based on the first hidden layer using Gibbs sampling. This procedure repeats iteratively until all $M$ hidden layers are trained. For more details we refer to Fischer & Igel (2014); Ghojogh et al. (2021).
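The greedy scheme just described can be sketched in a few lines. This is a minimal illustration of ours, not the paper's algorithm: it uses plain CD-1 updates, a single training pattern, and the common sign convention $E(v, h) = -\langle v, W h \rangle - \langle v, b \rangle - \langle h, c \rangle$, which differs from $H$ in (1) by flipping the signs of the parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b, c, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary-binary RBM
    (common sign convention, see the lead-in above)."""
    ph0 = sigmoid(v0 @ W + c)                 # P(h_j = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sampled hidden state
    pv1 = sigmoid(h0 @ W.T + b)               # reconstruction P(v_i = 1 | h0)
    ph1 = sigmoid(pv1 @ W + c)
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    b += lr * (v0 - pv1)
    c += lr * (ph0 - ph1)

# Greedy pre-training of a two-layer stack on one binary pattern:
v = np.array([1.0, 0.0, 1.0, 0.0])
W1, b1, c1 = 0.01 * rng.normal(size=(4, 3)), np.zeros(4), np.zeros(3)
W0 = W1.copy()
for _ in range(100):
    cd1_step(W1, b1, c1, v)                   # layer 1: visible <-> hidden 1
h = (rng.random(3) < sigmoid(v @ W1 + c1)) * 1.0
W2, b2, c2 = 0.01 * rng.normal(size=(3, 2)), np.zeros(3), np.zeros(2)
for _ in range(100):
    cd1_step(W2, b2, c2, h)                   # layer 2: hidden 1 <-> hidden 2
print(np.round(sigmoid(h @ W1.T + b1), 2))    # reconstruction of v from its code
```

In practice one would of course train on mini-batches of data rather than a single pattern; the point here is only the layer-by-layer structure of the procedure.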

3. MAIN RESULTS

To state the results of this article, we need a few pieces of additional notation. Let $q \in [1, \infty]$. We declare $\mathcal{D}^q(\mathbb{R}^d) = \mathcal{D}(\mathbb{R}^d) \cap L^q(\mathbb{R}^d)$. Finally, for $q \in [1, \infty)$, let us abbreviate the constant
$$\Upsilon_q = \max\bigg( 1, \Big( \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty |x|^q e^{-\frac{x^2}{2}} \, dx \Big)^{\frac{1}{q}} \bigg) = \begin{cases} 1, & q \leq 2, \\ \sqrt{2}\, \pi^{-\frac{1}{2q}}\, \Gamma\big( \frac{q+1}{2} \big)^{\frac{1}{q}}, & q > 2, \end{cases} \tag{8}$$
with the Gamma function $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, dt$, $x > 0$. The main results of this paper are stated in the following two theorems:

Theorem 3. Let $q \in [1, \infty)$ and $f, \varphi \in \mathcal{D}^q(\mathbb{R}^d)$. Then, for each $m \in \mathbb{N}$, the following quantitative bound holds:
$$\inf_{p \in \text{DBN}_\varphi(d, m, m+1)} \| f - p \|_{L^q} \leq \frac{2 \Upsilon_q \|\varphi\|_{L^q}}{m^{1 - \frac{1}{\min(q,2)}}}, \tag{9}$$
where the constant $\Upsilon_q$ is defined in (8). While this bound becomes trivial if $q = 1$, the following qualitative approximation result still holds in that case: For any $\varepsilon > 0$, there is an $M \in \mathbb{N}$ such that, for each $m \geq M$, we can find a $p \in \text{DBN}_\varphi(d, m, m+1)$ satisfying $\| f - p \|_{L^q} \leq \varepsilon$.

Remark 4. Returning to Example 2, we find that $\|\varphi\|_{L^q} = (2\pi)^{-\frac{d(q-1)}{2q}}\, q^{-\frac{d}{2q}}$ for the $d$-dimensional standard normal distribution (5) and
$$\|\varphi\|_{L^q} = \prod_{i=1}^d \frac{\lambda_i^{1 - \frac{1}{q}}}{q^{\frac{1}{q}}} \cdot \frac{\big( 1 - e^{-q \lambda_i b_i} \big)^{\frac{1}{q}}}{1 - e^{-b_i \lambda_i}}$$
for the truncated exponential distribution (6). Our bound (9) thus shows that deep belief networks with truncated exponential parental density (for a suitable choice of the parameters $b$ and $\lambda$) better approximate the target density $f$. This effect is especially pronounced for small $q$, which is the primary case of interest, see Corollary 7 below. For a detailed review of the exponential family's properties we refer to Brown (1986).

To state the approximation in the $L^\infty$-norm, we need to introduce the space of bounded and uniformly continuous functions:
$$C_u(\mathbb{R}^d) = \Big\{ f \in L^\infty(\mathbb{R}^d) : \lim_{\delta \downarrow 0} \sup_{|x - y| \leq \delta} |f(x) - f(y)| = 0 \Big\}.$$
Notice that any probability density $f \in \mathcal{D}(\mathbb{R}^d)$ which is differentiable with a bounded derivative belongs to $C_u(\mathbb{R}^d)$, since any uniformly continuous and integrable function is bounded.

Theorem 5. Let $f \in \mathcal{D}(\mathbb{R}^d) \cap C_u(\mathbb{R}^d)$ and $\varphi \in \mathcal{D}^\infty(\mathbb{R}^d)$.
Then, for any $\varepsilon > 0$, there is an $M \in \mathbb{N}$ such that, for each $m \geq M$, we can find a $p \in \text{DBN}_\varphi(d, m, m+1)$ satisfying $\| f - p \|_{L^\infty} \leq \varepsilon$.

Remark 6. The uniform continuity requirement on $f$ in Theorem 5 can actually be relaxed to essential uniform continuity, that is, $f$ is uniformly continuous except on a set of zero Lebesgue measure. The most notable example of such a function is the uniform density $f = \mathbf{1}_{[0,1]}$.

Another important metric between probability densities $p, q : \mathbb{R}^d \to \mathbb{R}_+$ is the Kullback-Leibler divergence (or relative entropy) defined by
$$\mathrm{KL}(f \,\|\, g) = \int_{\mathbb{R}^d} f(x) \log \frac{f(x)}{g(x)} \, dx$$
if $\{x \in \mathbb{R}^d : g(x) = 0\} \subset \{x \in \mathbb{R}^d : f(x) = 0\}$, and $\mathrm{KL}(f \,\|\, g) = \infty$ otherwise. From Theorems 3 and 5 we can deduce the following quantitative approximation bound in the Kullback-Leibler divergence:

Corollary 7. Let $\varphi \in \mathcal{D}^\infty(\mathbb{R}^d)$. Let $\Omega \subset \mathbb{R}^d$ be a compact set and $f : \Omega \to \mathbb{R}_+$ be a continuous probability density. Suppose that there is an $\eta > 0$ such that both $f \geq \eta$ and $\varphi \geq \eta$ on $\Omega$. Then there is a constant $M > 0$ such that, for each $m \in \mathbb{N}$, it holds that
$$\inf_{p \in \text{DBN}_\varphi(d, m, m+1)} \mathrm{KL}(f \,\|\, p) \leq \frac{M}{\eta m} \Big( 8 \|\varphi\|_{L^2}^2 + \| f - \varphi \|_{L^2(\Omega)}^2 \Big), \tag{10}$$
where $\| f - \varphi \|_{L^2(\Omega)}^2 = \int_\Omega |f(x) - \varphi(x)|^2 \, dx$. Let us note that any $\varphi \in \mathcal{D}^\infty(\mathbb{R}^d)$ is square-integrable, so that the right-hand side of the bound (10) is actually finite. This follows from the interpolation inequality $\|\varphi\|_{L^2}^2 \leq \|\varphi\|_{L^1} \|\varphi\|_{L^\infty} = \|\varphi\|_{L^\infty}$, see (Brezis, 2011, Exercise 4.4).

Remark 8. The first assertion of Theorem 3 as well as Theorem 5 generalize to deep belief networks with additional hidden layers; however, it is still an open question whether (9) can be improved by adding more depth, see also Jalali et al. (2019) for an analysis of this question in the context of Gaussian mixture models. Corollary 7 considerably generalizes the results of (Krause et al., 2013, Theorem 7).
There, the authors only prove that deep belief networks can approximate any density in the closure of the convex hull of a set of probability densities satisfying certain conditions, which appear to be difficult to check in practice. That work also does not contain a convergence rate. In comparison, our results directly describe the class of admissible target densities and do not rely on the indirect description through the convex hull. Finally, there is an unjustified step in the argument of Krause et al. (2013), which appears hard to repair; see Remark 16 below for details.
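The constant $\Upsilon_q$ of (8) is easy to check numerically. The sketch below is our own illustration; it assumes the Gaussian absolute-moment identity $\mathbb{E}|X|^q = 2^{q/2}\,\Gamma\big(\tfrac{q+1}{2}\big)/\sqrt{\pi}$ for $X \sim N(0,1)$ and compares the closed form against a direct grid evaluation of the defining integral.

```python
import math
import numpy as np

def upsilon_closed_form(q):
    """Upsilon_q = max(1, E|X|^q)^(1/q) for X ~ N(0, 1), via the moment identity
    E|X|^q = 2^(q/2) * Gamma((q + 1) / 2) / sqrt(pi)."""
    moment = 2.0**(q / 2.0) * math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)
    return max(1.0, moment**(1.0 / q))

def upsilon_numeric(q, R=12.0, n=200001):
    """Grid evaluation of max(1, ((2 pi)^(-1/2) int |x|^q exp(-x^2 / 2) dx)^(1/q))."""
    x = np.linspace(-R, R, n)
    dx = x[1] - x[0]
    moment = np.sum(np.abs(x)**q * np.exp(-x**2 / 2.0)) * dx / math.sqrt(2.0 * math.pi)
    return max(1.0, moment**(1.0 / q))

for q in (1.5, 2.0, 3.0, 4.0):
    print(q, upsilon_closed_form(q), upsilon_numeric(q))
```

For $q \leq 2$ the normalised moment is at most one, so the maximum is attained at $1$, matching the first case of (8).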

4. PROOFS

This section presents the proofs of Theorems 3 and 5 and of Corollary 7. As a first step, we establish a couple of preliminary results in the next two subsections.

4.1. L^q-APPROXIMATION OF FINITE MIXTURES

Given a set $A \subset L^q(\mathbb{R}^d)$, the convex hull of $A$ is by definition the smallest convex set containing $A$; in symbols $\mathrm{conv}(A)$. It can be shown that
$$\mathrm{conv}(A) = \Big\{ \sum_{i=1}^n \alpha_i a_i : \alpha = (\alpha_1, \dots, \alpha_n) \in \triangle_n,\ a_1, \dots, a_n \in A,\ n \in \mathbb{N} \Big\}$$
with $\triangle_n = \{ x \in [0, 1]^n : \sum_{i=1}^n x_i = 1 \}$, the $n$-dimensional standard simplex. It is also convenient to introduce the truncated convex hull
$$\mathrm{conv}_m(A) = \Big\{ \sum_{i=1}^m \alpha_i a_i : \alpha = (\alpha_1, \dots, \alpha_m) \in \triangle_m,\ a_1, \dots, a_m \in A \Big\}$$
for $m \in \mathbb{N}$, so that $\mathrm{conv}(A) = \bigcup_{m \in \mathbb{N}} \mathrm{conv}_m(A)$. The closed convex hull $\overline{\mathrm{conv}}(A)$ is the smallest closed convex set containing $A$; it is straightforward to check that it coincides with the closure of $\mathrm{conv}(A)$ in the topology of $L^q(\mathbb{R}^d)$. The next result shows that we can approximate any probability density in the truncated convex hull of the set (4) arbitrarily well by a DBN with a fixed number of hidden units:

Lemma 9. Let $q \in [1, \infty]$, $\varphi \in \mathcal{D}^q(\mathbb{R}^d)$, $\sigma > 0$, and $m \in \mathbb{N}$. Then, for every $f \in \mathrm{conv}_m(V^\sigma_\varphi)$ and every $\varepsilon > 0$, there is a deep belief network $p \in \text{DBN}_\varphi(d, m, m+1)$ such that $\| f - p \|_{L^q} \leq \varepsilon$.

Proof. Since $f \in \mathrm{conv}_m(V^\sigma_\varphi)$, there are by definition $(\alpha_1, \dots, \alpha_m) \in \triangle_m$ and $(\mu_1, \dots, \mu_m) \in (\mathbb{R}^d)^m$ such that $f = \sum_{i=1}^m \alpha_i \varphi_{\mu_i, \sigma}$. We can think of $\alpha = (\alpha_1, \dots, \alpha_m)$ as a probability distribution on $\{0, 1\}^m$ by declaring
$$\alpha(h_1) = \begin{cases} \alpha_i, & h_1 = e_i, \\ 0, & \text{else}, \end{cases} \qquad h_1 \in \{0, 1\}^m,$$
where $(e_i)_j = \delta_{i,j}$, $j = 1, \dots, m$, is the $i$-th unit vector. Let us fix $q \in [1, \infty]$ and $\sigma > 0$. By Proposition 1 there is a $\pi \in \text{B-RBM}(m, m+1)$ such that
$$\Big| \alpha(h_1) - \sum_{h_2 \in \{0, 1\}^{m+1}} \pi(h_1, h_2) \Big| \leq \frac{\varepsilon}{m\, \sigma^{-d(1 - \frac{1}{q})} \|\varphi\|_{L^q}} \qquad \forall\, h_1 \in \{0, 1\}^m. \tag{11}$$
We set
$$p(v \mid h_1) = \begin{cases} \varphi_{\mu_i, \sigma}(v), & h_1 = e_i, \\ 0, & \text{else}, \end{cases}$$
and $p(v, h_1, h_2) = p(v \mid h_1)\, \pi(h_1, h_2) \in \text{DBN}_\varphi(d, m, m+1)$.
This is the desired approximation since
$$\| f - p \|_{L^q} \leq \sum_{i=1}^m \Big| \alpha_i - \sum_{h_2 \in \{0, 1\}^{m+1}} \pi(e_i, h_2) \Big|\, \|\varphi_{\mu_i, \sigma}\|_{L^q} \leq \varepsilon,$$
where we used that $\|\varphi_{\mu, \sigma}\|_{L^q} = \sigma^{-d(1 - \frac{1}{q})} \|\varphi\|_{L^q}$ for each $\mu \in \mathbb{R}^d$ and each $\sigma > 0$.

4.2. APPROXIMATION BY CONVOLUTION

Let $f \in L^q(\mathbb{R}^d)$, $q \in [1, \infty]$, and $\varphi \in \mathcal{D}(\mathbb{R}^d)$. We denote the convolution of $f$ and $\varphi_\sigma$ by
$$f * \varphi_\sigma(x) = \int_{\mathbb{R}^d} f(\mu)\, \varphi_\sigma(x - \mu) \, d\mu = \int_{\mathbb{R}^d} f(\mu)\, \varphi_{\mu, \sigma}(x) \, d\mu.$$
Young's convolution inequality, Young (1912), implies that $f * \varphi_\sigma \in L^q(\mathbb{R}^d)$. In addition, the following approximation result holds, see Appendix A.1 for the proof:

Proposition 10. Let $\varphi \in \mathcal{D}(\mathbb{R}^d)$. Then both of the following hold true:
1. For each $q \in [1, \infty)$ and each $f \in L^q(\mathbb{R}^d)$, we have $\lim_{\sigma \downarrow 0} \| f - f * \varphi_\sigma \|_{L^q} = 0$.
2. If $f \in L^\infty(\mathbb{R}^d) \cap C_u(\mathbb{R}^d)$, then $\lim_{\sigma \downarrow 0} \| f - f * \varphi_\sigma \|_{L^\infty} = 0$.
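Proposition 10 can be observed numerically. The following sketch, our own illustration rather than anything from the paper, mollifies the uniform density $f = \mathbf{1}_{[0,1]}$ with a Gaussian $\varphi_\sigma$ on a grid and watches the $L^2$ error shrink as $\sigma \downarrow 0$.

```python
import numpy as np

def l2_mollification_error(sigma, n=8001, L=6.0):
    """Grid estimate of || f - f * phi_sigma ||_{L^2} for f = 1_[0,1] and phi
    the standard Gaussian, discretising the convolution on [-L, L]."""
    x = np.linspace(-L, L, n)  # odd n so that x = 0 lies exactly on the grid
    dx = x[1] - x[0]
    f = ((x >= 0.0) & (x <= 1.0)).astype(float)
    phi = np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    conv = np.convolve(f, phi, mode="same") * dx  # (f * phi_sigma) on the grid
    return np.sqrt(np.sum((f - conv)**2) * dx)

errs = [l2_mollification_error(s) for s in (0.5, 0.1, 0.02)]
print(errs)  # decreasing as sigma -> 0
```

For a jump discontinuity the $L^2$ mollification error scales like $\sqrt{\sigma}$, which the printed values reflect; for smoother $f$ the decay is faster.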

4.3. APPROXIMATION THEORY IN BANACH SPACES

The second ingredient needed in the proof of Theorem 3 is an abstract result from the geometric theory of Banach spaces. To formulate it, we need the following notion: The Rademacher type of a Banach space $(\mathcal{X}, \|\cdot\|_\mathcal{X})$ is the largest number $t \geq 1$ for which there is a constant $C > 0$ such that, for each $k \in \mathbb{N}$ and each $f_1, \dots, f_k \in \mathcal{X}$,
$$\mathbb{E}\Big\| \sum_{i=1}^k \epsilon_i f_i \Big\|_\mathcal{X}^t \leq C \sum_{i=1}^k \| f_i \|_\mathcal{X}^t$$
holds, where $\epsilon_1, \dots, \epsilon_k$ are i.i.d. Rademacher random variables, that is, $\mathbb{P}(\epsilon_1 = \pm 1) = \frac{1}{2}$. It can be shown that $t \leq 2$ for every Banach space.

Example 11. The space $L^q(\mathbb{R}^d)$ has Rademacher type $t = \min(q, 2)$ for $q \in [1, \infty)$. The space $L^\infty(\mathbb{R}^d)$, on the other hand, has only trivial type $t = 1$. A good reference for the above results on the Rademacher type is (Ledoux & Talagrand, 1991, Section 9.2).

The next approximation result and its application to $L^q(\mathbb{R}^d)$ will be important below:

Proposition 12 (Donahue et al. (1997), Theorem 2.5). Let $(\mathcal{X}, \|\cdot\|_\mathcal{X})$ be a Banach space of Rademacher type $t \in [1, 2]$. Let $A \subset \mathcal{X}$ and $f \in \overline{\mathrm{conv}}(A)$. Suppose that $\xi = \sup_{g \in A} \| f - g \|_\mathcal{X} < \infty$. Then there is a constant $C > 0$, only depending on the Banach space $(\mathcal{X}, \|\cdot\|_\mathcal{X})$, such that, for each $m \in \mathbb{N}$, we can find an element $h \in \mathrm{conv}_m(A)$ satisfying
$$\| f - h \|_\mathcal{X} \leq \frac{C \xi}{m^{1 - \frac{1}{t}}}. \tag{13}$$
Notice that the bound (13) is of course trivial for $t = 1$. Moreover, in Appendix A.2 we provide an example which shows that the convergence rate $m^{\frac{1}{t} - 1}$ is optimal.

Corollary 13. Let $A \subset L^q(\mathbb{R}^d)$, $1 \leq q < \infty$, and suppose that $f \in \overline{\mathrm{conv}}(A)$. If $\xi = \sup_{g \in A} \| f - g \|_{L^q} < \infty$, then, for all $m \in \mathbb{N}$, there is an $h \in \mathrm{conv}_m(A)$ such that
$$\| f - h \|_{L^q} \leq \frac{\Upsilon_q\, \xi}{m^{1 - \frac{1}{\min(q, 2)}}},$$
where $\Upsilon_q$ is the constant defined in (8).

Proof. Owing to Example 11, we are in the regime of Proposition 12. The sharp constant $C = \Upsilon_q$ was derived in Haagerup (1981).
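In the Hilbert-space case $t = 2$, the $m^{-1/2}$ rate of Proposition 12 can be illustrated by the classical sampling argument: draw $m$ i.i.d. elements of $A$ according to the convex weights of $f$ and average them. The sketch below is our own illustration in $\mathcal{X} = \mathbb{R}^{50}$ (all names and parameters are our choices); the error ratio between $m = 4$ and $m = 64$ should be roughly $\sqrt{64/4} = 4$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 50, 200
A = rng.normal(size=(N, D))          # a finite set A of points in R^D
alpha = rng.dirichlet(np.ones(N))    # convex weights, so f lies in conv(A)
f = alpha @ A

def maurey_error(m, trials=200):
    """Average distance between f and a random m-term convex combination:
    equal weights on m i.i.d. draws from alpha, an element of conv_m(A)."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(N, size=m, p=alpha)
        h = A[idx].mean(axis=0)
        errs.append(np.linalg.norm(f - h))
    return float(np.mean(errs))

e4, e64 = maurey_error(4), maurey_error(64)
print(e4, e64, e4 / e64)
```

The expectation computation $\mathbb{E}\|f - \frac{1}{m}\sum_i a_i\|^2 = \frac{1}{m}\mathbb{E}\|f - a_1\|^2 \leq \frac{\xi^2}{m}$ is exactly the mechanism behind Proposition 17 in Appendix A.2.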

4.4. PROOF OF THEOREMS 3 AND 5

Before giving the technical details of the proofs, let us provide an overview of the strategy:
1. By Proposition 10 we can approximate the density $f \in \mathcal{D}^q(\mathbb{R}^d)$ by $f * \varphi_\sigma$ up to an error which vanishes as $\sigma \downarrow 0$.
2. Upon showing that $f * \varphi_\sigma \in \overline{\mathrm{conv}}(V^\sigma_\varphi)$, Corollary 13 allows us to show that, for each $\varepsilon > 0$ and each $m \in \mathbb{N}$, we can pick $\sigma > 0$ such that
$$\inf_{g \in \mathrm{conv}_m(V^\sigma_\varphi)} \| f - g \|_{L^q} \leq \varepsilon + \frac{2 \Upsilon_q \|\varphi\|_{L^q}}{m^{1 - \frac{1}{\min(q, 2)}}}.$$
3. Finally, we employ Lemma 9 to conclude the desired estimate (9).

Lemma 14. Let $q \in [1, \infty]$, $f \in \mathcal{D}^q(\mathbb{R}^d)$, and $\varphi \in \mathcal{D}(\mathbb{R}^d)$. Then, for each $\sigma > 0$, we have $f * \varphi_\sigma \in \overline{\mathrm{conv}}(V^\sigma_\varphi)$, with the closure understood with respect to the norm $\|\cdot\|_{L^q}$.

Proof. Let us abbreviate $g = f * \varphi_\sigma$. We argue by contradiction. Suppose that $g \notin \overline{\mathrm{conv}}(V^\sigma_\varphi)$. As a consequence of the Hahn-Banach theorem, $g$ is separated from $\overline{\mathrm{conv}}(V^\sigma_\varphi)$ by a hyperplane. More precisely, there is a continuous linear functional $\rho : L^q(\mathbb{R}^d) \to \mathbb{R}$ such that $\rho(h) < \rho(g)$ for all $h \in \overline{\mathrm{conv}}(V^\sigma_\varphi)$, see (Brezis, 2011, Theorem 1.7). On the other hand, we however have
$$\rho(g) = \rho\Big( \int_{\mathbb{R}^d} f(\mu)\, \varphi_{\mu, \sigma} \, d\mu \Big) = \int_{\mathbb{R}^d} f(\mu)\, \rho(\varphi_{\mu, \sigma}) \, d\mu < \rho(g) \int_{\mathbb{R}^d} f(\mu) \, d\mu = \rho(g),$$
which is the desired contradiction.

We can now establish the main results of this article:

Proof of Theorems 3 and 5. Let us first assume that $q \in (1, \infty)$ and prove the quantitative bound (9). To this end, fix $\varepsilon > 0$ and $m \in \mathbb{N}$. We first observe that, by Proposition 10, we can choose $\sigma > 0$ sufficiently small such that $\| f - f * \varphi_\sigma \|_{L^q} \leq \frac{\varepsilon}{2}$. Employing Lemma 14 and Corollary 13 with $A = V^\sigma_\varphi$, we can find a $g_m \in \mathrm{conv}_m(V^\sigma_\varphi)$ such that
$$\| f - g_m \|_{L^q} \leq \| f - f * \varphi_\sigma \|_{L^q} + \| f * \varphi_\sigma - g_m \|_{L^q} \leq \frac{\varepsilon}{2} + \frac{\Upsilon_q}{m^{1 - \frac{1}{\min(q, 2)}}} \sup_{\mu \in \mathbb{R}^d} \| f * \varphi_\sigma - \varphi_{\mu, \sigma} \|_{L^q}.$$
For the last term we bound
$$\sup_{\mu \in \mathbb{R}^d} \| f * \varphi_\sigma - \varphi_{\mu, \sigma} \|_{L^q} = \sup_{\mu \in \mathbb{R}^d} \bigg( \int_{\mathbb{R}^d} \Big| \int_{\mathbb{R}^d} f(x) \big( \varphi_\sigma(y - x) - \varphi_\sigma(y - \mu) \big) \, dx \Big|^q \, dy \bigg)^{\frac{1}{q}} \leq \int_{\mathbb{R}^d} f(x) \sup_{\mu \in \mathbb{R}^d} \bigg( \int_{\mathbb{R}^d} \big| \varphi_\sigma(y - x) - \varphi_\sigma(y - \mu) \big|^q \, dy \bigg)^{\frac{1}{q}} dx = \sup_{\mu \in \mathbb{R}^d} \| \varphi - \varphi_{\mu, 1} \|_{L^q} \leq 2 \|\varphi\|_{L^q},$$
whence
$$\| f - g_m \|_{L^q} \leq \frac{\varepsilon}{2} + \frac{2 \Upsilon_q \|\varphi\|_{L^q}}{m^{1 - \frac{1}{\min(q, 2)}}}.$$
Finally, Lemma 9 allows us to choose $p \in \text{DBN}_\varphi(d, m, m+1)$ such that $\| g_m - p \|_{L^q} \leq \frac{\varepsilon}{2}$. Therefore, we conclude
$$\| f - p \|_{L^q} \leq \varepsilon + \frac{2 \Upsilon_q \|\varphi\|_{L^q}}{m^{1 - \frac{1}{\min(q, 2)}}}.$$
Since $\varepsilon > 0$ was arbitrary, the bound (9) follows.

If $q = 1$ or $q = \infty$, we use the fact that $\mathrm{conv}(A) = \bigcup_{m \in \mathbb{N}} \mathrm{conv}_m(A)$ for any subset $A$ of either $L^1(\mathbb{R}^d)$ or $L^\infty(\mathbb{R}^d)$, respectively. This implies that, for each $\varepsilon > 0$, we can find $m \in \mathbb{N}$ and $g_m \in \mathrm{conv}_m(V^\sigma_\varphi)$ such that $\| f * \varphi_\sigma - g_m \|_{L^q} \leq \frac{\varepsilon}{3}$. If $q = \infty$, we note that a uniformly continuous and integrable function is always bounded. Hence, in either case we can apply Proposition 10 to find a $\sigma > 0$ for which $\| f - f * \varphi_\sigma \|_{L^q} \leq \frac{\varepsilon}{3}$. Finally, employing Lemma 9 as above, there is a $p \in \text{DBN}_\varphi(d, m, m+1)$ such that
$$\| f - p \|_{L^q} \leq \| f - f * \varphi_\sigma \|_{L^q} + \| f * \varphi_\sigma - g_m \|_{L^q} + \| g_m - p \|_{L^q} \leq \varepsilon.$$

4.5. KULLBACK-LEIBLER APPROXIMATION ON COMPACTS

Let us begin by bounding the Kullback-Leibler divergence in terms of the $L^2$-norm:

Lemma 15 (Zeevi & Meir (1997), Lemma 3.3). Let $\Omega \subset \mathbb{R}^d$, and let $f : \Omega \to \mathbb{R}_+$ and $g : \mathbb{R}^d \to \mathbb{R}_+$ be probability densities. If there is an $\eta > 0$ such that both $f, g \geq \eta$ on $\Omega$, then
$$\mathrm{KL}(f \,\|\, g) \leq \frac{1}{\eta} \| f - g \|_{L^2(\Omega)}^2.$$

Proof. We use Jensen's inequality and the elementary fact $\log x \leq x - 1$, $x > 0$, to obtain
$$\mathrm{KL}(f \,\|\, g) = \int_\Omega \log\Big( \frac{f(x)}{g(x)} \Big) f(x) \, dx \leq \log \int_\Omega \frac{f(x)^2}{g(x)} \, dx \leq \int_\Omega \frac{f(x)^2}{g(x)} \, dx - 1 = \int_\Omega \frac{(f(x) - g(x))^2}{g(x)} \, dx \leq \frac{1}{\eta} \| f - g \|_{L^2(\Omega)}^2,$$
where the penultimate equality uses that $f$ and $g$ both integrate to one.

Finally, we can prove the approximation bound in Kullback-Leibler divergence:

Proof of Corollary 7. Extending the target density $f$ by zero on $\mathbb{R}^d \setminus \Omega$, the corollary follows from Theorem 3 upon showing that, for each $m \in \mathbb{N}$, we can choose the approximation $p \in \text{DBN}_\varphi(d, m, m+1)$ in such a way that $p \geq \frac{\eta}{2}$ on $\Omega$. To see this, we notice that $f$ is uniformly continuous since $\Omega$ is compact. Hence, Theorem 5 allows us to pick an $M \in \mathbb{N}$ such that, for each $m \geq M$, there is a $p_m \in \text{DBN}_\varphi(d, m, m+1)$ with $\| f - p_m \|_{L^\infty} \leq \frac{\eta}{2}$. In particular, each of these DBNs satisfies $p_m \geq \frac{\eta}{2}$ on $\Omega$. Consequently, by Lemma 15 (applied with the lower bound $\frac{\eta}{2}$) together with Theorem 3 for $q = 2$ (for which $\Upsilon_2 = 1$), we obtain
$$\inf_{p \in \text{DBN}_\varphi(d, m, m+1)} \mathrm{KL}(f \,\|\, p) \leq \frac{8 \|\varphi\|_{L^2}^2}{\eta m} \qquad \forall\, m \geq M. \tag{14}$$
A crude upper bound on $\inf_{p} \mathrm{KL}(f \,\|\, p)$ for $m < M$ can be obtained by choosing both zero weights and biases in (1), as well as $p(v \mid h_1) = \varphi$ for each $h_1 \in \{0, 1\}^m$ in (3). Hence, the visible units of the DBN have density $\varphi$. This gives
$$\inf_{p \in \text{DBN}_\varphi(d, m, m+1)} \mathrm{KL}(f \,\|\, p) \leq \mathrm{KL}(f \,\|\, \varphi) \leq \frac{1}{\eta} \| f - \varphi \|_{L^2(\Omega)}^2 \qquad \forall\, m = 1, \dots, M - 1, \tag{15}$$
again by Lemma 15. Finally, combining (14) and (15), we get the required estimate:
$$\inf_{p \in \text{DBN}_\varphi(d, m, m+1)} \mathrm{KL}(f \,\|\, p) \leq \frac{M}{\eta m} \Big( 8 \|\varphi\|_{L^2}^2 + \| f - \varphi \|_{L^2(\Omega)}^2 \Big).$$

Remark 16. Our strategy of proving the Kullback-Leibler approximation in Corollary 7 through Lemma 15 differs from the one employed in (Krause et al., 2013, Theorem 7). There, the authors built on the results of Li & Barron (1999) and, in the course of their argument, claim that the following statement holds true: Let $f_m, f : \Omega \to \mathbb{R}_+$, $m \in \mathbb{N}$, be probability densities on a compact set $\Omega \subset \mathbb{R}^d$ with $f_m, f \geq \eta > 0$. If $\mathrm{KL}(f \,\|\, f_m) \to 0$ as $m \to \infty$, then $f_m \to f$ in the norm $\|\cdot\|_{L^\infty}$.
This, however, does not hold as we illustrate by a simple counterexample presented in Appendix A.3.
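The $L^2$-bound on the Kullback-Leibler divergence from Lemma 15 is straightforward to test numerically. The sketch below is our own illustration; the particular densities are arbitrary choices bounded below by $\eta = 0.6$ on $\Omega = [0, 1]$.

```python
import numpy as np

def kl_vs_l2(f, g, dx, eta):
    """Return (KL(f || g), (1 / eta) * || f - g ||_{L^2}^2) on a grid, so that the
    first value should never exceed the second when f, g >= eta (Lemma 15)."""
    kl = np.sum(f * np.log(f / g)) * dx
    l2sq = np.sum((f - g)**2) * dx
    return kl, l2sq / eta

x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
f = 1.0 + 0.4 * np.sin(2.0 * np.pi * x)  # density on [0, 1], bounded below by 0.6
g = np.full_like(x, 1.0)                 # uniform density on [0, 1]
kl, bound = kl_vs_l2(f, g, dx, eta=0.6)
print(kl, bound)
```

Both quantities are plain Riemann sums; refining the grid changes them only marginally, and the inequality holds with room to spare for this pair.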

5. CONCLUSION

We investigated the approximation capabilities of deep belief networks with two binary hidden layers of sizes $m$ and $m + 1$, respectively, and real-valued visible units. We showed that, under minimal regularity requirements on the parental density $\varphi$ as well as the target density $f$, these networks are universal approximators in the strong $L^q$ and Kullback-Leibler distances as $m \to \infty$. Moreover, we gave sharp quantitative bounds on the approximation error. We emphasize that the convergence rate in the number of hidden units is independent of the choice of the parental density. Our results apply to virtually all practically relevant examples, thereby theoretically underpinning the tremendous empirical success of DBN architectures over the last couple of years. As we alluded to in Remark 4, the frequently made choice of a Gaussian parental density does not provide the theoretically optimal DBN approximation of a given target density. Since, in practice, the choice of parental density cannot be determined from the approximation standpoint alone, but must also account for the difficulty of training the resulting networks, it would be interesting to further study the choice of parental density empirically on both artificial and real-world datasets.

A DETAILS OF THE MATHEMATICAL RESULTS

This appendix provides further details of the mathematical results used in the main text. More specifically, we provide 1. the proof of Proposition 10, 2. a detailed proof of Proposition 12 for Hilbert spaces as well as an example showing that its approximation rate is optimal in general, and 3. the construction of an explicit counterexample to the statement discussed in Remark 16.

A.1 PROOF OF PROPOSITION 10

Proof. Item 1 is well known, see e.g. (Folland, 1999, Theorem 8.14). For item 2, fix $\varepsilon > 0$. By uniform continuity of $f$, we can find a $\delta > 0$ such that
$$\sup_{|\mu| \leq \delta} | f(x) - f(x - \mu) | \leq \frac{\varepsilon}{2} \qquad \forall\, x \in \mathbb{R}^d. \tag{16}$$
In particular, we obtain
$$| f(x) - f * \varphi_\sigma(x) | \leq \int_{\mathbb{R}^d} \varphi_\sigma(\mu) | f(x) - f(x - \mu) | \, d\mu \leq \int_{\{|\mu| > \delta\}} \varphi_\sigma(\mu) | f(x) - f(x - \mu) | \, d\mu + \int_{\{|\mu| \leq \delta\}} \varphi_\sigma(\mu) | f(x) - f(x - \mu) | \, d\mu \leq 2 \| f \|_{L^\infty} \int_{\{|\mu| > \delta\}} \varphi_\sigma(\mu) \, d\mu + \frac{\varepsilon}{2},$$
where we applied the uniform continuity estimate (16) to the second integral. Since
$$\int_{\{|\mu| > \delta\}} \varphi_\sigma(\mu) \, d\mu = \int_{\{|\mu| > \frac{\delta}{\sigma}\}} \varphi(\mu) \, d\mu \to 0 \quad \text{as } \sigma \downarrow 0,$$
we can choose $\sigma_0 > 0$ such that $\| f - f * \varphi_\sigma \|_{L^\infty} \leq \varepsilon$ for all $\sigma \in (0, \sigma_0)$. This completes the proof.

A.2 DETAILS ON PROPOSITION 12

While the proof of Proposition 12 for a general Banach space is rather technical, we find it instructive to present the simplified argument for a Hilbert space. Our proof is inspired by Jones (1992), see also Barron (1994).

Proposition 17. Let $(\mathcal{X}, \|\cdot\|_\mathcal{X})$ be a Hilbert space. Let $A \subset \mathcal{X}$ and $f \in \overline{\mathrm{conv}}(A)$. Suppose that $\xi = \sup_{g \in A} \| f - g \|_\mathcal{X} < \infty$. Then, for each $m \in \mathbb{N}$, we can find an element $g \in \mathrm{conv}_m(A)$ satisfying
$$\| f - g \|_\mathcal{X} \leq \frac{\xi}{\sqrt{m}}. \tag{17}$$

Proof. We proceed by induction on $m \in \mathbb{N}$. The base case $m = 1$ is trivial, so we can assume that the statement holds for some $m \geq 1$. Let us declare $\Xi_{m+1} = \inf_{g \in \mathrm{conv}_{m+1}(A)} \| f - g \|_\mathcal{X}$. By the induction hypothesis, we may assume that $\Xi_m \leq \frac{\xi}{\sqrt{m}}$ and we can find an $h \in \mathrm{conv}_m(A)$ attaining this bound. Consequently, we get
$$\Xi_{m+1}^2 \leq \inf_{\lambda \in [0,1],\, g \in A} \big\| \lambda (f - g) + (1 - \lambda)(f - h) \big\|_\mathcal{X}^2 = \inf_{\lambda \in [0,1],\, g \in A} \Big( \lambda^2 \| f - g \|_\mathcal{X}^2 + 2 \lambda (1 - \lambda) \langle f - g, f - h \rangle_\mathcal{X} + (1 - \lambda)^2 \| f - h \|_\mathcal{X}^2 \Big) \leq \inf_{\lambda \in [0,1]} \Big( \lambda^2 \xi^2 + 2 \lambda (1 - \lambda) \inf_{g \in A} \langle f - g, f - h \rangle_\mathcal{X} + (1 - \lambda)^2 \Xi_m^2 \Big). \tag{18}$$
We claim that
$$\inf_{g \in A} \langle f - g, f - h \rangle_\mathcal{X} \leq 0. \tag{19}$$
To see this, let us fix an $\varepsilon > 0$ and observe that, since $f \in \overline{\mathrm{conv}}(A)$, the Cauchy-Schwarz inequality implies that there must be a finite convex combination of elements of $A$ satisfying
$$\sum_{i=1}^k \alpha_i \langle f - a_i, f - h \rangle_\mathcal{X} = \Big\langle f - \sum_{i=1}^k \alpha_i a_i, f - h \Big\rangle_\mathcal{X} \leq \varepsilon.$$
In particular, the inequality $\langle f - a_i, f - h \rangle_\mathcal{X} \leq \varepsilon$ holds for at least one vector $a_i \in A$. Since $\varepsilon > 0$ was arbitrary, we have established (19). Inserting (19) into (18), we arrive at
$$\Xi_{m+1}^2 \leq \inf_{\lambda \in [0,1]} \Big( \lambda^2 \xi^2 + (1 - \lambda)^2 \Xi_m^2 \Big) = \frac{\xi^2 \Xi_m^2}{\xi^2 + \Xi_m^2} \leq \frac{\xi^2}{m + 1},$$
where the middle step follows by choosing $\lambda = \frac{\Xi_m^2}{\xi^2 + \Xi_m^2}$ and the last step uses $\Xi_m^2 \leq \frac{\xi^2}{m}$. This establishes (17) for $m + 1$ and the induction is complete.

Returning to the original statement of Proposition 12 for a general Banach space, the next example shows that its convergence rate is optimal in general:

Example 18. For $p \in (1, 2]$, let us consider the Banach space $\ell^p(\mathbb{R})$ of $p$-summable real-valued sequences, that is, $(a_n)_{n \in \mathbb{N}} \subset \mathbb{R}$ belongs to $\ell^p(\mathbb{R})$ if and only if $\| a \|_p = \big( \sum_{n=1}^\infty |a_n|^p \big)^{\frac{1}{p}} < \infty$. It can be shown that this Banach space has Rademacher type $t = p$.
Let $A$ be the set formed of the standard basis vectors:
$$A = \big\{ (1, 0, 0, 0, \dots),\ (0, 1, 0, 0, \dots),\ (0, 0, 1, 0, \dots),\ \dots \big\}.$$
Taking $f = 0$, we note that $f \in \overline{\mathrm{conv}}(A)$, since $\big\| \frac{1}{m} \sum_{i=1}^m e_i \big\|_p = m^{\frac{1}{p} - 1} \to 0$ as $m \to \infty$, and that $\xi = \sup_{g \in A} \| f - g \|_p = 1$. Writing an element of $\mathrm{conv}_m(A)$ as $\sum_{i=1}^m \alpha_i e_{n_i}$ with distinct indices $n_1, \dots, n_m$, we find that
$$\inf_{h \in \mathrm{conv}_m(A)} \| f - h \|_p = \inf_{(\alpha_1, \dots, \alpha_m) \in \triangle_m} \Big( \sum_{i=1}^m \alpha_i^p \Big)^{\frac{1}{p}}.$$
The optimum on the right-hand side is attained by choosing $\alpha_1 = \dots = \alpha_m = \frac{1}{m}$, so that
$$\inf_{h \in \mathrm{conv}_m(A)} \| f - h \|_p = m^{\frac{1}{p} - 1} = m^{\frac{1}{t} - 1},$$
which shows that the rate in (13) cannot be improved.
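The identity $\| (\tfrac{1}{m}, \dots, \tfrac{1}{m}) \|_p = m^{1/p - 1}$, which drives the optimality example, can be replayed numerically. The sketch below is our own illustration; it takes $f = 0$ as the target and evaluates the uniform $m$-term convex combination of basis vectors.

```python
import numpy as np

def convm_error(m, p):
    """l^p-norm of the uniform m-term convex combination of standard basis
    vectors, i.e. of the weight vector (1/m, ..., 1/m); equals m^(1/p - 1)."""
    h = np.full(m, 1.0 / m)
    return np.sum(h**p)**(1.0 / p)

for p in (1.5, 2.0):
    print([round(convm_error(m, p), 4) for m in (1, 4, 16)],
          [round(m**(1.0 / p - 1.0), 4) for m in (1, 4, 16)])
```

For $p = 1$ the exponent vanishes and the error stays at $1$, matching the triviality of the bound (13) for type $t = 1$.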

A.3 CONSTRUCTION OF THE COUNTEREXAMPLE IN REMARK 16

Let $\Omega = [0, 1]$ and consider $f = \mathbf{1}_{[0,1]}$ together with the sequence of probability densities given by $f_m = f + g_m$, where
$$g_m = \tfrac{1}{2} \big( \mathbf{1}_{[0, \frac{1}{m}]} - \mathbf{1}_{[\frac{1}{m}, \frac{2}{m}]} \big), \qquad m \geq 2.$$
Since $\| f_m - f \|_{L^\infty} = \frac{1}{2}$ for every $m$, the sequence $f_m$ does not converge uniformly to $f$. Nevertheless, it is straightforward to check that $\| f_m - \mathbf{1}_{[0,1]} \|_{L^2} \to 0$ and, since $f_m, f \geq \frac{1}{2}$ on $\Omega$, we have $\mathrm{KL}(f \,\|\, f_m) \to 0$ as $m \to \infty$ by Lemma 15.
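A perturbation sequence of this kind is easy to check on a grid. The sketch below is our own illustration; the specific perturbation `gm` is one possible choice of the counterexample's ingredient and may differ from the paper's.

```python
import numpy as np

def gm(m, x):
    """A mean-zero perturbation of height 1/2 supported on [0, 2/m], so that
    f_m = 1_[0,1] + g_m remains a probability density bounded below by 1/2."""
    return np.where(x < 1.0 / m, 0.5, np.where(x < 2.0 / m, -0.5, 0.0))

x = np.linspace(0.0, 1.0, 400001)
dx = x[1] - x[0]
l2s, sups = [], []
for m in (10, 100, 1000):
    g = gm(m, x)
    l2s.append(np.sqrt(np.sum(g**2) * dx))  # || f_m - f ||_{L^2} -> 0
    sups.append(np.max(np.abs(g)))          # || f_m - f ||_{L^inf} stays at 1/2
print(l2s, sups)
```

The $L^2$ distances shrink like $\sqrt{1/(2m)}$, while the supremum distance is constant, which is exactly the gap between the two notions of convergence exploited in Remark 16.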

