LARGE-WIDTH FUNCTIONAL ASYMPTOTICS FOR DEEP GAUSSIAN NEURAL NETWORKS

Abstract

In this paper, we consider fully-connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019), we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space R^I. Under suitable assumptions on the activation function we show that: i) a network defines a continuous stochastic process on the input space R^I; ii) a network with re-scaled weights converges weakly to a continuous Gaussian Process in the large-width limit; iii) the limiting Gaussian Process has almost surely locally γ-Hölder continuous paths, for 0 < γ < 1. Our results contribute to recent theoretical studies on the interplay between infinitely-wide deep neural networks and Gaussian Processes by establishing weak convergence in function space with respect to a stronger metric.

1. INTRODUCTION

The interplay between infinitely-wide deep neural networks and classes of Gaussian Processes has its origins in the seminal work of Neal (1995), and it has been the subject of several theoretical studies. See, e.g., Der & Lee (2006), Lee et al. (2018), Matthews et al. (2018a;b), Yang (2019) and references therein. Consider a fully-connected feed-forward neural network with re-scaled weights composed of L ≥ 1 layers of widths n_1, ..., n_L, i.e.

f^{(1)}_i(x) = Σ_{j=1}^{I} w^{(1)}_{i,j} x_j + b^{(1)}_i,  i = 1, ..., n_1,
f^{(l)}_i(x) = (1/√{n_{l−1}}) Σ_{j=1}^{n_{l−1}} w^{(l)}_{i,j} φ(f^{(l−1)}_j(x)) + b^{(l)}_i,  l = 2, ..., L, i = 1, ..., n_l,    (1)

where φ is a non-linearity and x ∈ R^I is a real-valued input of dimension I ∈ N. Neal (1995) considered the case L = 2, a finite number k ∈ N of fixed distinct inputs (x^{(1)}, ..., x^{(k)}), with each x^{(r)} ∈ R^I, and weights w^{(l)}_{i,j} and biases b^{(l)}_i independently and identically distributed (iid) as Gaussian distributions. Under appropriate assumptions on the activation φ, Neal (1995) showed that: i) for a fixed unit i, the k-dimensional random vector (f^{(2)}_i(x^{(1)}), ..., f^{(2)}_i(x^{(k)})) converges in distribution, as the width n_1 goes to infinity, to a k-dimensional Gaussian random vector; ii) the large-width convergence holds jointly over finite collections of i's, and the limiting k-dimensional Gaussian random vectors are independent across the index i. These results concern neural networks with a single hidden layer, but Neal (1995) also includes preliminary considerations on infinitely-wide deep neural networks. More recent works, such as Lee et al. (2018), established convergence results corresponding to Neal (1995)'s results i) and ii) for deep neural networks, under the assumption that the widths n_1, ..., n_L go to infinity sequentially over network layers. Matthews et al. (2018a;b) extended the works of Neal (1995) and Lee et al. (2018) by assuming that the width n grows to infinity jointly over network layers, instead of sequentially, and by establishing joint convergence over all i and countable distinct inputs. The joint growth over the layers is certainly more realistic than the sequential growth, since the infinite Gaussian limit is considered as an approximation of a very wide network. We operate in the same setting of Matthews et al. (2018b), hence from here onward n ≥ 1 denotes the common layer width, i.e. n_1 = ··· = n_L = n. Finally, similar large-width limits have been established for a great variety of neural network architectures; see, for instance, Yang (2019). The assumption of a countable number of fixed distinct inputs is the common trait of the literature on large-width asymptotics for deep neural networks. Under this assumption, the large-width limit of a network boils down to the study of the large-width asymptotic behavior of the k-dimensional random vector (f^{(l)}_i(x^{(1)}), ..., f^{(l)}_i(x^{(k)})) over i ≥ 1 for finite k. Such limiting finite-dimensional distributions describe the large-width distribution of a neural network a priori over any dataset, which is finite by definition. When the limiting distribution is Gaussian, as it often is, this immediately paves the way to Bayesian inference for the limiting network. Such an approach is competitive with the more standard stochastic gradient descent training for the fully-connected architectures that are the object of our study (Lee et al., 2020). However, knowledge of the limiting finite-dimensional distributions is not enough to infer properties of the limiting neural network which are inherently uncountable, such as the continuity of the limiting neural network, or the distribution of its maximum over a bounded interval. Results in this direction give a more complete understanding of the assumptions being made a priori, and hence whether a given model is appropriate for a specific application.
For instance, Van Der Vaart & Van Zanten (2011) show that, for Gaussian Processes, the function smoothness under the prior should match the smoothness of the target function for satisfactory inference performance. In this paper we thus consider a novel, and more natural, perspective on the study of large-width limits of deep neural networks. This is an infinite-dimensional perspective where, instead of fixing a countable number of distinct inputs, we look at f^{(l)}_i(x, n) as a stochastic process over the input space R^I. Under this perspective, establishing large-width limits requires considerable care and, in addition, it requires showing the existence of both the stochastic process induced by the neural network and its large-width limit. We start by proving the existence of: i) a continuous stochastic process, indexed by the network width n, corresponding to the fully-connected feed-forward deep neural network; ii) a continuous Gaussian Process corresponding to the infinitely-wide limit of the deep neural network. Then, we prove that the stochastic process i) converges weakly, as the width n goes to infinity, to the Gaussian Process ii), jointly over all units i. As a by-product of our results, we show that the limiting Gaussian Process has almost surely locally γ-Hölder continuous paths, for 0 < γ < 1. To make the exposition self-contained we include an alternative proof of the main result of Matthews et al. (2018a;b), i.e. the finite-dimensional limit for fully-connected neural networks. The major difference between our proof and that of Matthews et al. (2018b) is the use of the characteristic function to establish convergence in distribution, instead of relying on a CLT for exchangeable sequences (Blum et al., 1958). The paper is structured as follows. In Section 2 we introduce the setting under which we operate, whereas in Section 3 we present a high-level overview of the approach taken to establish our results.
Section 4 contains the core arguments of the proof of our large-width functional limit for deep Gaussian neural networks, which are spelled out in detail in the supplementary material (SM). We conclude in Section 5.
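Before turning to the formal setting, the finite-width network in (1) is straightforward to simulate. The sketch below is an illustration we add here (not code from the paper): it draws one realization of the network with iid Gaussian weights and biases and the 1/√{n_{l−1}} rescaling, evaluated at k inputs; all widths and variance values are hypothetical choices.

```python
import numpy as np

def sample_network(X, widths, sigma_w=1.0, sigma_b=1.0, phi=np.tanh, seed=None):
    """One random realization of the network in (1) at the columns of X.

    X: array of shape (I, k), one input x^(r) per column.
    widths: [n_1, ..., n_L]; the first layer is unscaled, layers
    l >= 2 carry the 1/sqrt(n_{l-1}) rescaling of (1)."""
    rng = np.random.default_rng(seed)
    I = X.shape[0]
    # First layer: f^(1) = W^(1) X + b^(1), no rescaling.
    f = rng.normal(0, sigma_w, (widths[0], I)) @ X + rng.normal(0, sigma_b, (widths[0], 1))
    # Layers l = 2, ..., L with the 1/sqrt(n_{l-1}) factor.
    for n_l in widths[1:]:
        n_prev = f.shape[0]
        W = rng.normal(0, sigma_w, (n_l, n_prev))
        b = rng.normal(0, sigma_b, (n_l, 1))
        f = W @ phi(f) / np.sqrt(n_prev) + b
    return f  # shape (n_L, k): entry (i, r) is f_i^(L)(x^(r))

out = sample_network(np.array([[0.5, 1.0], [-1.0, 0.2]]), widths=[1000, 1000, 3], seed=0)
print(out.shape)  # (3, 2): 3 output units at k = 2 inputs
```

Repeated calls give iid draws of the random vector studied throughout the paper; widening the hidden layers makes its law approach the Gaussian limit.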

2. SETTING

Let (Ω, H, P) be the probability space on which all random elements of interest are defined. Furthermore, let N(µ, σ²) denote a Gaussian distribution with mean µ ∈ R and strictly positive variance σ² ∈ R_+, and let N_k(m, Σ) be a k-dimensional Gaussian distribution with mean m ∈ R^k and covariance matrix Σ ∈ R^{k×k}. In particular, R^k is equipped with ‖·‖_{R^k}, the Euclidean norm induced by the inner product ⟨·,·⟩_{R^k}, and R^∞ = ×_{i=1}^∞ R is equipped with the distance d_∞(a, b) = Σ_{i≥1} ξ(|a_i − b_i|)/2^i for a, b ∈ R^∞ (Theorem 3.38 of Aliprantis & Border (2006)), where ξ(t) = t/(1 + t) for all real values t ≥ 0. Note that (R, |·|) and (R^∞, d_∞) are Polish spaces, i.e. separable and complete metric spaces (Corollary 3.39 of Aliprantis & Border (2006)). We choose d_∞ since it generates a topology that coincides with the product topology (line 5 of the proof of Theorem 3.36 of Aliprantis & Border (2006)). The space (S, d) will indicate a generic Polish space, such as R or R^∞ with the associated distance. We indicate with S^{R^I} the space of functions from R^I into S, and with C(R^I; S) ⊂ S^{R^I} the space of continuous functions from R^I into S. Let ω^{(l)}_{i,j} be the random weights of the l-th layer, and assume that they are iid as N(0, σ_ω²), i.e.

φ_{ω^{(l)}_{i,j}}(t) = E[e^{it ω^{(l)}_{i,j}}] = e^{−σ_ω² t²/2}    (2)

is the characteristic function of ω^{(l)}_{i,j}, for i ≥ 1, j = 1, ..., n and l ≥ 1. Let b^{(l)}_i be the random biases of the l-th layer, and assume that they are iid as N(0, σ_b²), i.e.

φ_{b^{(l)}_i}(t) = E[e^{it b^{(l)}_i}] = e^{−σ_b² t²/2}    (3)

is the characteristic function of b^{(l)}_i, for i ≥ 1 and l ≥ 1. Weights ω^{(l)}_{i,j} are independent of biases b^{(l)}_i, for any i ≥ 1, j = 1, ..., n and l ≥ 1. Let φ : R → R denote a continuous non-linearity. For the finite-dimensional limit we will assume the polynomial envelope condition

|φ(s)| ≤ a + b|s|^m,    (4)

for any s ∈ R and some real values a, b > 0 and m ≥ 1.
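The polynomial envelope (4) is mild: for example, tanh and ReLU satisfy it with a = b = m = 1. A quick numerical sanity check (our illustration, not part of the paper):

```python
import numpy as np

# Check |phi(s)| <= a + b|s|^m from (4) with a = b = m = 1
# for two common activations (tanh and ReLU) on a wide grid.
s = np.linspace(-50, 50, 100001)
envelope = 1.0 + np.abs(s)

relu = lambda t: np.maximum(t, 0.0)
ok = {name: bool(np.all(np.abs(phi(s)) <= envelope))
      for name, phi in [("tanh", np.tanh), ("relu", relu)]}
print(ok)  # {'tanh': True, 'relu': True}
```

Both activations are also Lipschitz with constant 1, so they meet the stronger assumption used below for the functional limit as well.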
For the functional limit we will use a stronger assumption on φ, assuming φ to be Lipschitz on R with Lipschitz constant L_φ. Let Z be a stochastic process on R^I, i.e. for each x ∈ R^I, Z(x) is defined on (Ω, H, P) and it takes values in S. For any k ∈ N and x_1, ..., x_k ∈ R^I, let P^Z_{x_1,...,x_k} = P(Z(x_1) ∈ A_1, ..., Z(x_k) ∈ A_k), with A_1, ..., A_k ∈ B(S). Then, the family of finite-dimensional distributions of Z is defined as the family of distributions {P^Z_{x_1,...,x_k} : x_1, ..., x_k ∈ R^I and k ∈ N}. See, e.g., Billingsley (1995). In Definition 1 and Definition 2 we look at the deep neural network (1) as a stochastic process on the input space R^I, that is a stochastic process whose finite-dimensional distributions are determined by a finite number k ∈ N of fixed distinct inputs (x^{(1)}, ..., x^{(k)}), with each x^{(r)} ∈ R^I. The existence of the stochastic processes of Definition 1 and Definition 2 will be thoroughly discussed in Section 3.

Definition 1. For any fixed l ≥ 2 and i ≥ 1, let (f^{(l)}_i(n))_{n≥1} be a sequence of stochastic processes on R^I. That is, f^{(l)}_i(n) : R^I → R, with x ↦ f^{(l)}_i(x, n), is a stochastic process on R^I whose finite-dimensional distributions are the laws, for any k ∈ N and x^{(1)}, ..., x^{(k)} ∈ R^I, of the k-dimensional random vectors

f^{(1)}_i(X, n) = f^{(1)}_i(X) = [f^{(1)}_i(x^{(1)}, n), ..., f^{(1)}_i(x^{(k)}, n)]^T = Σ_{j=1}^I ω^{(1)}_{i,j} x_j + b^{(1)}_i 1    (5)

f^{(l)}_i(X, n) = [f^{(l)}_i(x^{(1)}, n), ..., f^{(l)}_i(x^{(k)}, n)]^T = (1/√n) Σ_{j=1}^n ω^{(l)}_{i,j} (φ ∘ f^{(l−1)}_j(X, n)) + b^{(l)}_i 1    (6)

where X = [x^{(1)}, ..., x^{(k)}] ∈ R^{I×k} is an I × k input matrix of k distinct inputs x^{(r)} ∈ R^I, 1 denotes a k × 1 vector of 1's, x_j denotes the j-th row of the input matrix, and φ ∘ X is the element-wise application of φ to the matrix X.
Let f^{(l)}_{r,i}(X, n) = 1_r^T f^{(l)}_i(X, n) = f^{(l)}_i(x^{(r)}, n) denote the r-th component of the k × 1 vector f^{(l)}_i(X, n), where 1_r is a k × 1 vector with 1 in the r-th entry and 0 elsewhere. Remark: in contrast to (1), we have defined (5)-(6) over an infinite number of units i ≥ 1 in each layer l, but the dependency on each previous layer l − 1 remains limited to the first n components.

Definition 2. For any fixed l ≥ 2, let (F^{(l)}(n))_{n≥1} be a sequence of stochastic processes on R^I. That is, F^{(l)}(n) : R^I → R^∞, with x ↦ F^{(l)}(x, n), is a stochastic process on R^I whose finite-dimensional distributions are the laws, for any k ∈ N and x^{(1)}, ..., x^{(k)} ∈ R^I, of the random arrays

F^{(1)}(X) = [f^{(1)}_1(X), f^{(1)}_2(X), ...]^T,
F^{(l)}(X, n) = [f^{(l)}_1(X, n), f^{(l)}_2(X, n), ...]^T.

Remark: for k inputs, the array F^{(l)}(X, n) is ∞ × k, and for a single input x^{(r)}, F^{(l)}(x^{(r)}, n) can be written as [f^{(l)}_1(x^{(r)}, n), f^{(l)}_2(x^{(r)}, n), ...]^T ∈ R^{∞×1}. We define F^{(l)}_r(X, n) = F^{(l)}(x^{(r)}, n), the r-th column of F^{(l)}(X, n). When we write ⟨F^{(l−1)}(x, n), F^{(l−1)}(y, n)⟩_{R^n} (see (8)) we treat F^{(l−1)}(x, n) and F^{(l−1)}(y, n) as elements of R^n and not of R^∞, i.e. we consider only the first n components of F^{(l−1)}(x, n) and F^{(l−1)}(y, n).

3. PLAN SKETCH

We start by recalling the notion of convergence in law, also referred to as convergence in distribution or weak convergence, for a sequence of stochastic processes. See Billingsley (1995) for a comprehensive account.

Definition 3 (convergence in distribution). Suppose that f and (f(n))_{n≥1} are random elements in a topological space C. Then, (f(n))_{n≥1} is said to converge in distribution to f if E[h(f(n))] → E[h(f)] as n → ∞ for every bounded and continuous function h : C → R. In that case we write f(n) →d f.

In this paper, we deal with continuous and real-valued stochastic processes. More precisely, we consider random elements defined on C(R^I; S), with (S, d) a Polish space. Our aim is to study in C(R^I; S) the convergence in distribution, as the width n goes to infinity, of: i) the sequence (f^{(l)}_i(n))_{n≥1} for fixed l ≥ 2 and i ≥ 1, with (S, d) = (R, |·|), i.e. the neural network process for a single unit; ii) the sequence (F^{(l)}(n))_{n≥1} for fixed l ≥ 2, with (S, d) = (R^∞, d_∞), i.e. the neural network process for all units. Since applying Definition 3 in a function space is not easy, we need the following proposition, proved in SM F.

Proposition 1 (convergence in distribution in C(R^I; S), (S, d) Polish). Suppose that f and (f(n))_{n≥1} are random elements in C(R^I; S) with (S, d) a Polish space. Then, f(n) →d f if: i) f(n) →fd f, and ii) the sequence (f(n))_{n≥1} is uniformly tight.

We denoted with →fd the convergence in law of the finite-dimensional distributions of a sequence of stochastic processes. The notion of tightness formalizes the concept that the probability mass is not allowed to "escape at infinity": a single random element f in a topological space C is said to be tight if for each ε > 0 there exists a compact T ⊂ C such that P[f ∈ C \ T] < ε. If a metric space (C, ρ) is Polish, any random element on the Borel σ-algebra of C is tight. A sequence of random elements (f(n))_{n≥1} in a topological space C is said to be uniformly tight¹ if for every ε > 0 there exists a compact T ⊂ C such that P[f(n) ∈ C \ T] < ε for all n. According to Proposition 1, to achieve convergence in distribution in function spaces we need the following Steps A-D.

Step A) Establish the existence of the finite-dimensional weak limit f on R^I. We will rely on Theorem 5.3 of Kallenberg (2002), known as Lévy's theorem.

Step B) Establish the existence of the stochastic processes f and (f(n))_{n≥1} as elements of S^{R^I}, the space of functions from R^I into S. We make use of the Daniell-Kolmogorov criterion (Kallenberg, 2002, Theorem 6.16): given a family of distributions {P_I : P_I probability measure on R^{dim(I)}, I ⊂ {x^{(1)}, ..., x^{(k)}}, x^{(z)} ∈ R^I, k ∈ N}, there exists a stochastic process with {P_I} as finite-dimensional distributions if {P_I} satisfies the projective property: P_J(· × R^{J\I}) = P_I(·) for all I ⊂ J. That is, consistency with respect to marginalization over arbitrary components is required.
In this step we also suppose, for a moment, that the stochastic processes (f(n))_{n≥1} and f belong to C(R^I; S), and we establish the existence of such stochastic processes in C(R^I; S) endowed with a σ-algebra and a probability measure that will be defined.

Step C) Show that the stochastic processes (f(n))_{n≥1} and f belong to C(R^I; S) ⊂ S^{R^I}. With regards to (f(n))_{n≥1}, this is a direct consequence of (5)-(6) and the continuity of φ. With regards to the limiting process f, with an additional Lipschitz assumption on φ, we rely on the following Kolmogorov-Chentsov criterion (Kallenberg, 2002, Theorem 3.23):

Proposition 2 (continuous version and local Hölderianity, (S, d) complete). Let f be a process on R^I with values in a complete metric space (S, d), and assume that there exist a, b, H > 0 such that

E[d(f(x), f(y))^a] ≤ H ‖x − y‖^{I+b}_{R^I},  x, y ∈ R^I.

Then f has a continuous version (i.e. f belongs to C(R^I; S)), and the latter is a.s. locally Hölder continuous with exponent c for any c ∈ (0, b/a).

Step D) Establish the uniform tightness of (f(n))_{n≥1} in C(R^I; S). We rely on an extension of the Kolmogorov-Chentsov criterion (Kallenberg, 2002, Corollary 16.9), which is stated in the following proposition.

Proposition 3 (uniform tightness in C(R^I; S), (S, d) Polish). Suppose that (f(n))_{n≥1} are random elements in C(R^I; S) with (S, d) a Polish space. Assume that (f(0_{R^I}, n))_{n≥1} (i.e. f(n) evaluated at the origin) is uniformly tight in S and that there exist a, b, H > 0 such that

E[d(f(x, n), f(y, n))^a] ≤ H ‖x − y‖^{I+b}_{R^I},  x, y ∈ R^I, n ∈ N,

uniformly in n. Then (f(n))_{n≥1} is uniformly tight in C(R^I; S).

4. LARGE-WIDTH FUNCTIONAL LIMITS

4.1 LIMIT ON C(R^I; S), WITH (S, d) = (R, |·|), FOR A FIXED UNIT i ≥ 1 AND LAYER l

Lemma 1 (finite-dimensional limit). If φ satisfies (4) then there exists a stochastic process f^{(l)}_i : R^I → R such that f^{(l)}_i(n) →fd f^{(l)}_i as n → ∞.

Proof. Fix l ≥ 2 and i ≥ 1. Given k inputs X = [x^{(1)}, ..., x^{(k)}], we show that, as n → +∞,

f^{(l)}_i(X, n) →d N_k(0, Σ(l)),    (7)

where Σ(l) denotes the k × k covariance matrix, which can be computed through the recursion

Σ(1)_{i,j} = σ_b² + σ_ω² ⟨x^{(i)}, x^{(j)}⟩_{R^I},  Σ(l)_{i,j} = σ_b² + σ_ω² ∫ φ(f_i) φ(f_j) q^{(l−1)}(df),

where q^{(l−1)} = N_k(0, Σ(l − 1)). By means of (2), (3), (5) and (6),

f^{(1)}_i(X) =d N_k(0, Σ(1)),  Σ(1)_{i,j} = σ_b² + σ_ω² ⟨x^{(i)}, x^{(j)}⟩_{R^I},
f^{(l)}_i(X, n) | f^{(l−1)}_{1,...,n} =d N_k(0, Σ(l, n)),  for l ≥ 2,  Σ(l, n)_{i,j} = σ_b² + (σ_ω²/n) ⟨φ ∘ F^{(l−1)}_i(X, n), φ ∘ F^{(l−1)}_j(X, n)⟩_{R^n}.    (8)

We prove (7) using Lévy's theorem, that is, the point-wise convergence of the sequence of characteristic functions obtained from (8). We defer to SM A for the complete proof.
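The covariance recursion for Σ(l) is easy to evaluate numerically. The sketch below (an illustrative check we add, with φ = tanh, k = 2 inputs and hypothetical parameter values) approximates Σ(2) by Monte Carlo integration against q^(1) and compares it with the empirical covariance of a width-n second-layer unit:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, sw2, sb2 = np.tanh, 1.0, 1.0
X = np.array([[0.3, 1.0], [-0.7, 0.2]])      # I x k input matrix, k = 2 inputs in R^2

# Sigma(1): exact, sigma_b^2 + sigma_w^2 <x^(r), x^(s)>.
S1 = sb2 + sw2 * (X.T @ X)

# Sigma(2): Monte Carlo integration against q^(1) = N_k(0, Sigma(1)).
F = rng.multivariate_normal(np.zeros(2), S1, size=200_000)
S2 = sb2 + sw2 * (phi(F).T @ phi(F)) / len(F)

# Empirical covariance of one width-n second-layer unit over repeated draws.
n, draws, samples = 1000, 2000, []
for _ in range(draws):
    f1 = rng.normal(0, np.sqrt(sw2), (n, 2)) @ X + rng.normal(0, np.sqrt(sb2), (n, 1))
    f2 = phi(f1).T @ rng.normal(0, np.sqrt(sw2), n) / np.sqrt(n) + rng.normal(0, np.sqrt(sb2))
    samples.append(f2)
emp = np.cov(np.array(samples), rowvar=False)
print(np.max(np.abs(emp - S2)))  # small: the two estimates agree up to Monte Carlo error
```

The agreement illustrates (7)-(8) at l = 2: the randomness of Σ(l, n) washes out as n grows, leaving the deterministic kernel Σ(l).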

Lemma 1 proves Step A. Our proof gives an alternative and self-contained proof of the main result of Matthews et al. (2018b), under the more general assumption that the activation function φ satisfies the polynomial envelope (4). Now we prove Step B, i.e. the existence of the stochastic processes f^{(l)}_i(n) and f^{(l)}_i on the space R^{R^I}, for each layer l ≥ 1, unit i ≥ 1 and n ∈ N. In SM E.1 we show that the finite-dimensional distributions of f^{(l)}_i(n) satisfy the Daniell-Kolmogorov criterion (Kallenberg, 2002, Theorem 6.16), and hence the stochastic process f^{(l)}_i(n) exists. In SM E.2 we prove a similar result for the finite-dimensional distributions of the limiting process f^{(l)}_i. In SM E.3 we prove that, if these stochastic processes are continuous, they are naturally defined in C(R^I; R). In order to prove the continuity, i.e. Step C, note that f^{(1)}_i(x) = Σ_{j=1}^I ω^{(1)}_{i,j} x_j + b^{(1)}_i is continuous by construction; thus, by induction on l, if the f^{(l−1)}_j(n) are continuous for each j ≥ 1 and n, then f^{(l)}_i(x, n) = (1/√n) Σ_{j=1}^n ω^{(l)}_{i,j} φ(f^{(l−1)}_j(x, n)) + b^{(l)}_i is continuous, being a composition of continuous functions. For the limiting process f^{(l)}_i we assume φ to be Lipschitz with Lipschitz constant L_φ. In particular we have the following:

Lemma 2 (continuity). If φ is Lipschitz on R then f^{(l)}_i(1), f^{(l)}_i(2), ... are P-a.s. Lipschitz on R^I, while the limiting process f^{(l)}_i is P-a.s. continuous on R^I and locally γ-Hölder continuous for each 0 < γ < 1.

Proof. Here we present a sketch of the proof, and we defer to SM B.1 and SM B.2 for the complete proof. For (f^{(l)}_i(n))_{n≥1} it is easy to show that, for each n,

|f^{(l)}_i(x, n) − f^{(l)}_i(y, n)| ≤ H^{(l)}_i(n) ‖x − y‖_{R^I},  x, y ∈ R^I,  P-a.s.,    (9)

where H^{(l)}_i(n) denotes a suitable random variable, which is defined by the following recursion over l:

H^{(1)}_i(n) = Σ_{j=1}^I |ω^{(1)}_{i,j}|,  H^{(l)}_i(n) = (L_φ/√n) Σ_{j=1}^n |ω^{(l)}_{i,j}| H^{(l−1)}_j(n).    (10)

To establish the continuity of the limiting process f^{(l)}_i we rely on Proposition 2. Take two inputs x, y ∈ R^I. From (7) we get that [f^{(l)}_i(x), f^{(l)}_i(y)] ∼ N_2(0, Σ(l)), where

Σ(1) = σ_b² (1, 1; 1, 1) + σ_ω² (‖x‖²_{R^I}, ⟨x, y⟩_{R^I}; ⟨x, y⟩_{R^I}, ‖y‖²_{R^I}),
Σ(l) = σ_b² (1, 1; 1, 1) + σ_ω² ∫ (|φ(u)|², φ(u)φ(v); φ(u)φ(v), |φ(v)|²) q^{(l−1)}(du, dv),

where q^{(l−1)} = N_2(0, Σ(l − 1)). Defining a^T = [1, −1], from (7) we know that f^{(l)}_i(y) − f^{(l)}_i(x) ∼ N(a^T 0, a^T Σ(l) a). Thus

|f^{(l)}_i(y) − f^{(l)}_i(x)|^{2θ} ∼ |√(a^T Σ(l) a) N(0, 1)|^{2θ} ∼ (a^T Σ(l) a)^θ |N(0, 1)|^{2θ}.

We proceed by induction over the layers. For l = 1,

E[|f^{(1)}_i(y) − f^{(1)}_i(x)|^{2θ}] = C_θ (a^T Σ(1) a)^θ = C_θ (σ_ω² ‖y‖²_{R^I} − 2σ_ω² ⟨y, x⟩_{R^I} + σ_ω² ‖x‖²_{R^I})^θ = C_θ (σ_ω²)^θ (‖y‖²_{R^I} − 2⟨y, x⟩_{R^I} + ‖x‖²_{R^I})^θ = C_θ (σ_ω²)^θ ‖y − x‖^{2θ}_{R^I},

where C_θ = E[|N(0, 1)|^{2θ}]. By the induction hypothesis, there exists a constant H^{(l−1)} > 0 such that ∫ |u − v|^{2θ} q^{(l−1)}(du, dv) ≤ H^{(l−1)} ‖y − x‖^{2θ}_{R^I}. Then,

|f^{(l)}_i(y) − f^{(l)}_i(x)|^{2θ} ∼ |N(0, 1)|^{2θ} (a^T Σ(l) a)^θ = |N(0, 1)|^{2θ} (σ_ω² ∫ [|φ(u)|² − 2φ(u)φ(v) + |φ(v)|²] q^{(l−1)}(du, dv))^θ ≤ |N(0, 1)|^{2θ} (σ_ω² L²_φ)^θ ∫ |u − v|^{2θ} q^{(l−1)}(du, dv) ≤ |N(0, 1)|^{2θ} (σ_ω² L²_φ)^θ H^{(l−1)} ‖y − x‖^{2θ}_{R^I},

where we used |φ(u)|² − 2φ(u)φ(v) + |φ(v)|² = |φ(u) − φ(v)|² ≤ L²_φ |u − v|² and Jensen's inequality. Thus,

E[|f^{(l)}_i(y) − f^{(l)}_i(x)|^{2θ}] ≤ H^{(l)} ‖y − x‖^{2θ}_{R^I},    (11)

where the constant H^{(l)} can be explicitly derived by solving the following system:

H^{(1)} = C_θ (σ_ω²)^θ,  H^{(l)} = C_θ (σ_ω² L²_φ)^θ H^{(l−1)}.    (12)

It is easy to get H^{(l)} = C^l_θ (σ_ω²)^{lθ} (L²_φ)^{(l−1)θ}. Observe that H^{(l)} does not depend on i (this will be helpful in establishing the uniform tightness of (f^{(l)}_i(n))_{n≥1} and the continuity of F^{(l)}). By Proposition 2, setting α = 2θ and β = 2θ − I (since β needs to be positive, it is sufficient to choose θ > I/2), we get that f^{(l)}_i has a continuous version, and the latter is P-a.s. locally γ-Hölder continuous for every 0 < γ < 1 − I/(2θ), for each θ > I/2. Taking the limit as θ → +∞ we conclude the proof.

Lemma 3 (uniform tightness). If φ is Lipschitz on R then (f^{(l)}_i(n))_{n≥1} is uniformly tight in C(R^I; R).

Proof. We defer to SM B.3 for details. Fix i ≥ 1 and l ≥ 1. We apply Proposition 3 to show the uniform tightness of the sequence (f^{(l)}_i(n))_{n≥1} in C(R^I; R). By Lemma 2, f^{(l)}_i(1), f^{(l)}_i(2), ... are random elements in C(R^I; R). Since (R, |·|) is Polish, every probability measure is tight, hence f^{(l)}_i(0_{R^I}, n) is tight in R for every n. Moreover, by Lemma 1, (f^{(l)}_i(0_{R^I}, n))_{n≥1} →d f^{(l)}_i(0_{R^I}); therefore, by (Dudley, 2002, Theorem 11.5.3), (f^{(l)}_i(0_{R^I}, n))_{n≥1} is uniformly tight in R. It remains to show that there exist two values α > 0 and β > 0, and a constant H^{(l)} > 0, such that

E[|f^{(l)}_i(y, n) − f^{(l)}_i(x, n)|^α] ≤ H^{(l)} ‖y − x‖^{I+β}_{R^I},  x, y ∈ R^I, n ∈ N,

uniformly in n. Take two points x, y ∈ R^I.
From (8) we know that f^{(l)}_i(y, n) | f^{(l−1)}_{1,...,n} ∼ N(0, σ²_y(l, n)) and f^{(l)}_i(x, n) | f^{(l−1)}_{1,...,n} ∼ N(0, σ²_x(l, n)), with joint distribution N_2(0, Σ(l, n)), where

Σ(1) = (σ²_x(1), Σ(1)_{x,y}; Σ(1)_{x,y}, σ²_y(1)),  Σ(l, n) = (σ²_x(l, n), Σ(l, n)_{x,y}; Σ(l, n)_{x,y}, σ²_y(l, n)),

with

σ²_x(1) = σ_b² + σ_ω² ‖x‖²_{R^I},  σ²_y(1) = σ_b² + σ_ω² ‖y‖²_{R^I},  Σ(1)_{x,y} = σ_b² + σ_ω² ⟨x, y⟩_{R^I},
σ²_x(l, n) = σ_b² + (σ_ω²/n) Σ_{j=1}^n |φ ∘ f^{(l−1)}_j(x, n)|²,  σ²_y(l, n) = σ_b² + (σ_ω²/n) Σ_{j=1}^n |φ ∘ f^{(l−1)}_j(y, n)|²,
Σ(l, n)_{x,y} = σ_b² + (σ_ω²/n) Σ_{j=1}^n φ(f^{(l−1)}_j(x, n)) φ(f^{(l−1)}_j(y, n)).

Defining a^T = [1, −1], we have that f^{(l)}_i(y, n) | f^{(l−1)}_{1,...,n} − f^{(l)}_i(x, n) | f^{(l−1)}_{1,...,n} is distributed as N(a^T 0, a^T Σ(l, n) a), where a^T Σ(l, n) a = σ²_y(l, n) − 2Σ(l, n)_{x,y} + σ²_x(l, n). Consider α = 2θ with θ an integer. Thus

|f^{(l)}_i(y, n) − f^{(l)}_i(x, n)|^{2θ} | f^{(l−1)}_{1,...,n} ∼ |√(a^T Σ(l, n) a) N(0, 1)|^{2θ} ∼ (a^T Σ(l, n) a)^θ |N(0, 1)|^{2θ}.

As in the previous theorem, for l = 1 we get E[|f^{(1)}_i(y, n) − f^{(1)}_i(x, n)|^{2θ}] = C_θ (σ_ω²)^θ ‖y − x‖^{2θ}_{R^I}, where C_θ = E[|N(0, 1)|^{2θ}]. Set H^{(1)} = C_θ (σ_ω²)^θ and, by the induction hypothesis, suppose that for every j ≥ 1, E[|f^{(l−1)}_j(y, n) − f^{(l−1)}_j(x, n)|^{2θ}] ≤ H^{(l−1)} ‖y − x‖^{2θ}_{R^I}. By hypothesis φ is Lipschitz; then

E[|f^{(l)}_i(y, n) − f^{(l)}_i(x, n)|^{2θ} | f^{(l−1)}_{1,...,n}] = C_θ (a^T Σ(l, n) a)^θ
= C_θ (σ²_y(l, n) − 2Σ(l, n)_{x,y} + σ²_x(l, n))^θ
= C_θ ((σ_ω²/n) Σ_{j=1}^n (φ ∘ f^{(l−1)}_j(y, n) − φ ∘ f^{(l−1)}_j(x, n))²)^θ
≤ C_θ ((σ_ω² L²_φ/n) Σ_{j=1}^n (f^{(l−1)}_j(y, n) − f^{(l−1)}_j(x, n))²)^θ
= (C_θ (σ_ω² L²_φ)^θ / n^θ) (Σ_{j=1}^n (f^{(l−1)}_j(y, n) − f^{(l−1)}_j(x, n))²)^θ
≤ (C_θ (σ_ω² L²_φ)^θ / n) Σ_{j=1}^n |f^{(l−1)}_j(y, n) − f^{(l−1)}_j(x, n)|^{2θ}.
Published as a conference paper at ICLR 2021

Using the induction hypothesis,

E[|f^{(l)}_i(y, n) − f^{(l)}_i(x, n)|^{2θ}] = E[E[|f^{(l)}_i(y, n) − f^{(l)}_i(x, n)|^{2θ} | f^{(l−1)}_{1,...,n}]] ≤ (C_θ (σ_ω² L²_φ)^θ / n) Σ_{j=1}^n E[|f^{(l−1)}_j(y, n) − f^{(l−1)}_j(x, n)|^{2θ}] ≤ C_θ (σ_ω² L²_φ)^θ H^{(l−1)} ‖y − x‖^{2θ}_{R^I}.

We can get the constant H^{(l)} by solving the same system as (12), obtaining H^{(l)} = C^l_θ (σ_ω²)^{lθ} (L²_φ)^{(l−1)θ}, which does not depend on n. By Proposition 3, setting α = 2θ and β = 2θ − I, since β must be a positive constant it is sufficient to take θ > I/2, and this concludes the proof.

Note that Lemma 3 provides the last Step D, which allows us to prove the desired result, stated in the theorem that follows.

Theorem 1 (functional limit). If φ is Lipschitz on R then f^{(l)}_i(n) →d f^{(l)}_i on C(R^I; R).

Proof. We apply Proposition 1 to (f^{(l)}_i(n))_{n≥1}. By Lemma 2, f^{(l)}_i and (f^{(l)}_i(n))_{n≥1} belong to C(R^I; R). From Lemma 1 we have the convergence of the finite-dimensional distributions of (f^{(l)}_i(n))_{n≥1}, and from Lemma 3 we have the uniform tightness of (f^{(l)}_i(n))_{n≥1}.

4.2 LIMIT ON C(R^I; S), WITH (S, d) = (R^∞, d_∞), FOR A FIXED LAYER l

As in the previous section, we prove Steps A-D for the sequence (F^{(l)}(n))_{n≥1}. Remark that each stochastic process F^{(l)}, F^{(l)}(1), F^{(l)}(2), ... defines on C(R^I; R^∞) a joint measure whose i-th marginal is the measure induced, respectively, by f^{(l)}_i, f^{(l)}_i(1), f^{(l)}_i(2), ... (see SM E.1-SM E.4). Let F^{(l)} =d ⊗_{i=1}^∞ f^{(l)}_i, where ⊗ denotes the product measure.

Lemma 4 (finite-dimensional limit). If φ satisfies (4) then F^{(l)}(n) →fd F^{(l)} as n → ∞.

Proof. The proof follows by Lemma 1 and the Cramér-Wold theorem applied to finite-dimensional projections of F^{(l)}(n): it is sufficient to establish the large-n asymptotics of linear combinations of the f^{(l)}_i(X, n)'s for i ∈ L ⊂ N.
In particular, we show that for any choice of input matrix X, as n → +∞,

F^{(l)}(X, n) →d ⊗_{i=1}^∞ N_k(0, Σ(l)),

where Σ(l) is defined in (7). The proof is reported in SM C.

Lemma 5 (continuity). If φ is Lipschitz on R then F^{(l)} and (F^{(l)}(n))_{n≥1} belong to C(R^I; R^∞). More precisely, F^{(l)}(1), F^{(l)}(2), ... are P-a.s. Lipschitz on R^I, while the limiting process F^{(l)} is P-a.s. continuous on R^I and locally γ-Hölder continuous for each 0 < γ < 1.

Proof. It derives immediately from Lemma 2. We defer to SM D.1 and SM D.2 for details. The continuity of the sequence of processes immediately follows from the Lipschitzianity of each component in (9), while the continuity of the limiting process F^{(l)} is proved by applying Proposition 2. Take two inputs x, y ∈ R^I and fix an even integer α ≥ 1. Since ξ(t) ≤ t for all t ≥ 0, and by Jensen's inequality,

d_∞(F^{(l)}(x), F^{(l)}(y))^α ≤ (Σ_{i=1}^∞ (1/2^i) |f^{(l)}_i(x) − f^{(l)}_i(y)|)^α ≤ Σ_{i=1}^∞ (1/2^i) |f^{(l)}_i(x) − f^{(l)}_i(y)|^α.

Thus, by applying the monotone convergence theorem to the positive increasing sequence g(N) = Σ_{i=1}^N (1/2^i) |f^{(l)}_i(x) − f^{(l)}_i(y)|^α (which allows us to exchange E and Σ_{i=1}^∞), we get

E[d_∞(F^{(l)}(x), F^{(l)}(y))^α] ≤ E[Σ_{i=1}^∞ (1/2^i) |f^{(l)}_i(x) − f^{(l)}_i(y)|^α] = lim_{N→∞} E[Σ_{i=1}^N (1/2^i) |f^{(l)}_i(x) − f^{(l)}_i(y)|^α] = Σ_{i=1}^∞ (1/2^i) E[|f^{(l)}_i(x) − f^{(l)}_i(y)|^α] = Σ_{i=1}^∞ (1/2^i) H^{(l)} ‖x − y‖^α_{R^I} = H^{(l)} ‖x − y‖^α_{R^I},

where we used (11) and the fact that H^{(l)} does not depend on i (see (12)). Therefore, by Proposition 2, setting β = α − I (since β needs to be positive, it is sufficient to choose α > I), F^{(l)} has a continuous version which is P-a.s. locally γ-Hölder continuous for every 0 < γ < 1 − I/α. Letting α → ∞ we conclude.

Theorem 2 (functional limit). If φ is Lipschitz on R then F^{(l)}(n) →d F^{(l)} as n → ∞ on C(R^I; R^∞).

Proof. This is Proposition 1 applied to (F^{(l)}(n))_{n≥1}. Given Lemma 4 and Lemma 5, it remains to show the uniform tightness of the sequence (F^{(l)}(n))_{n≥1} in C(R^I; R^∞).
Let ε > 0 and let (ε_i)_{i≥1} be a positive sequence such that Σ_{i=1}^∞ ε_i = ε/2. We have established the uniform tightness of each component (Lemma 3). Therefore, for each i ∈ N there exists a compact K_i ⊂ C(R^I; R) such that P[f^{(l)}_i(n) ∈ C(R^I; R) \ K_i] < ε_i for each n ∈ N (such a compact set depends on ε_i). Set K = ×_{i=1}^∞ K_i, which is compact by Tychonoff's theorem. Note that this is a compact set in the product space ×_{i=1}^∞ C(R^I; R) with the associated product topology, and it is also a compact set in C(R^I; R^∞) (see SM E.4). Then

P[F^{(l)}(n) ∈ C(R^I; R^∞) \ K] = P[∪_{i=1}^∞ {f^{(l)}_i(n) ∈ C(R^I; R) \ K_i}] ≤ Σ_{i=1}^∞ P[f^{(l)}_i(n) ∈ C(R^I; R) \ K_i] ≤ Σ_{i=1}^∞ ε_i < ε,

which concludes the proof.
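To make the limiting object concrete, one can evaluate the covariance recursion of (7) on a grid of inputs and sample a path of the limiting Gaussian Process; consistent with Lemma 2, sampled paths are continuous-looking. The sketch below is our illustration (I = 1, φ = tanh, depth L = 3, hypothetical parameter values), with Monte Carlo integration standing in for the exact integral in the recursion:

```python
import numpy as np

rng = np.random.default_rng(2)
phi, sw2, sb2, L = np.tanh, 1.0, 0.1, 3
xs = np.linspace(-2.0, 2.0, 50)[None, :]          # 1 x m grid of inputs, I = 1
m = xs.shape[1]
jit = 1e-9 * np.eye(m)                            # numerical jitter for Cholesky

# Kernel recursion: Sigma(1) exact, Sigma(l) by Monte Carlo over q^(l-1).
S = sb2 + sw2 * (xs.T @ xs)
for _ in range(2, L + 1):
    F = rng.normal(size=(100_000, m)) @ np.linalg.cholesky(S + jit).T
    S = sb2 + sw2 * (phi(F).T @ phi(F)) / len(F)

# One sample path of the limiting GP f_i^(L) restricted to the grid.
path = np.linalg.cholesky(S + jit) @ rng.normal(size=m)
print(path.shape)  # (50,)
```

Refining the grid and plotting `path` against `xs` gives a visual counterpart of the a.s. locally γ-Hölder continuous paths established above.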

5. DISCUSSION

We looked at deep Gaussian neural networks as stochastic processes, i.e. infinite-dimensional random elements, on the input space R^I, and we showed that: i) a network defines a stochastic process on the input space R^I; ii) under suitable assumptions on the activation function, a network with re-scaled weights converges weakly to a Gaussian Process in the large-width limit. These results extend previous works (Neal, 1995; Der & Lee, 2006; Lee et al., 2018; Matthews et al., 2018a;b; Yang, 2019) that investigate the limiting distribution of neural networks over a countable number of distinct inputs. From the point of view of applications, convergence in distribution is the starting point for the convergence of expectations. Consider a continuous function g : C(R^I; R^∞) → R. By the continuous mapping theorem (Billingsley, 1999, Theorem 2.7), we have g(F^{(l)}(n)) →d g(F^{(l)}) as n → +∞, and under uniform integrability (Billingsley, 1999, Section 3), we have E[g(F^{(l)}(n))] → E[g(F^{(l)})] as n → +∞ (Billingsley, 1999, Theorem 3.5). See also Dudley (2002) and references therein. As a by-product of our results we showed that, under a Lipschitz activation function, the limiting Gaussian Process has almost surely locally γ-Hölder continuous paths, for 0 < γ < 1. This raises the question of whether it is possible to strengthen our results to cover the case γ = 1, or even the case of local Lipschitzianity of the paths of the limiting process. In addition, if the activation function is differentiable, does this property transfer to the limiting process? We leave these questions to future research. Finally, while fully-connected deep neural networks represent an ideal starting point for theoretical analysis, modern neural network architectures are composed of a much richer class of layers, which includes convolutional, residual, recurrent and attention components.
The technical arguments followed in this paper are amenable to extensions to more complex network architectures. Providing a mathematical formulation of network architectures and convergence results in a way that allows for extensions to arbitrary architectures, instead of providing an ad-hoc proof for each specific case, is a fundamental research problem. Greg Yang's work on Tensor Programs (Yang, 2019) constitutes an important step in this direction.
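As an illustration of the convergence of expectations discussed above, take g(f) = max of |f| over a finite grid for a single unit of a two-layer network, a continuous functional of the path. The sketch below is ours (hypothetical parameter values; Monte Carlo throughout): it compares E[g] at finite widths with its value under the limiting Gaussian Process.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, sw2, sb2 = np.tanh, 1.0, 0.1
xs = np.linspace(-1.0, 1.0, 20)[None, :]       # grid on a bounded interval, I = 1
m = xs.shape[1]
g = lambda f: np.max(np.abs(f), axis=-1)       # continuous functional of the path

def mean_g(n, draws=2000):
    # Monte Carlo estimate of E[g(f_i^(2)(n))] at width n, on the grid.
    vals = []
    for _ in range(draws):
        f1 = rng.normal(0, np.sqrt(sw2), (n, 1)) @ xs + rng.normal(0, np.sqrt(sb2), (n, 1))
        f2 = phi(f1).T @ rng.normal(0, np.sqrt(sw2), n) / np.sqrt(n) + rng.normal(0, np.sqrt(sb2))
        vals.append(g(f2))
    return float(np.mean(vals))

# Limiting value: g under GP draws with kernel Sigma(2) from the recursion (7).
jit = 1e-9 * np.eye(m)
S1 = sb2 + sw2 * (xs.T @ xs)
F = rng.normal(size=(100_000, m)) @ np.linalg.cholesky(S1 + jit).T
S2 = sb2 + sw2 * (phi(F).T @ phi(F)) / len(F)
gp = rng.normal(size=(100_000, m)) @ np.linalg.cholesky(S2 + jit).T
lim = float(np.mean(g(gp)))

m10, m1000 = mean_g(10), mean_g(1000)
print(m10, m1000, lim)  # E[g] at widths 10 and 1000 vs the limiting GP value
```

Here g is bounded on bounded sets and the moment bounds of Section 4 give the uniform integrability needed for E[g(f^{(2)}_i(n))] → E[g(f^{(2)}_i)]; the width-1000 estimate should already sit within Monte Carlo error of the limiting value.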



¹ Kallenberg (2002) uses the same term "tightness" for both cases, that of a single random element and that of a sequence of random elements; we find that the introduction of "uniform tightness" brings more clarity.

