LARGE-WIDTH FUNCTIONAL ASYMPTOTICS FOR DEEP GAUSSIAN NEURAL NETWORKS

Abstract

In this paper, we consider fully-connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019), we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space $\mathbb{R}^I$. Under suitable assumptions on the activation function we show that: i) a network defines a continuous stochastic process on the input space $\mathbb{R}^I$; ii) a network with re-scaled weights converges weakly to a continuous Gaussian Process in the large-width limit; iii) the limiting Gaussian Process has almost surely locally $\gamma$-Hölder continuous paths, for $0 < \gamma < 1$. Our results contribute to recent theoretical studies on the interplay between infinitely-wide deep neural networks and Gaussian Processes by establishing weak convergence in function space with respect to a stronger metric.

1. INTRODUCTION

The interplay between infinitely-wide deep neural networks and classes of Gaussian Processes has its origins in the seminal work of Neal (1995), and it has been the subject of several theoretical studies. See, e.g., Der & Lee (2006), Lee et al. (2018), Matthews et al. (2018a;b), Yang (2019) and references therein. Consider a fully-connected feed-forward neural network with re-scaled weights composed of $L \geq 1$ layers of widths $n_1, \ldots, n_L$, i.e.
$$
\begin{aligned}
f_i^{(1)}(x) &= \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)}, && i = 1, \ldots, n_1, \\
f_i^{(l)}(x) &= \frac{1}{\sqrt{n_{l-1}}} \sum_{j=1}^{n_{l-1}} w_{i,j}^{(l)} \, \phi\big(f_j^{(l-1)}(x)\big) + b_i^{(l)}, && l = 2, \ldots, L, \ \ i = 1, \ldots, n_l,
\end{aligned}
\tag{1}
$$
where $\phi$ is a non-linearity and $x \in \mathbb{R}^I$ is a real-valued input of dimension $I \in \mathbb{N}$. Neal (1995) considered the case $L = 2$, a finite number $k \in \mathbb{N}$ of fixed distinct inputs $(x^{(1)}, \ldots, x^{(k)})$, with each $x^{(r)} \in \mathbb{R}^I$, and weights $w_{i,j}^{(l)}$ and biases $b_i^{(l)}$ independently and identically distributed (iid) according to Gaussian distributions. Under appropriate assumptions on the activation $\phi$, Neal (1995) showed that: i) for a fixed unit $i$, the $k$-dimensional random vector $(f_i^{(2)}(x^{(1)}), \ldots, f_i^{(2)}(x^{(k)}))$ converges in distribution, as the width $n_1$ goes to infinity, to a $k$-dimensional Gaussian random vector; ii) the large-width convergence holds jointly over finite collections of $i$'s, and the limiting $k$-dimensional Gaussian random vectors are independent across the index $i$. These results concern neural networks with a single hidden layer, but Neal (1995) also includes preliminary considerations on infinitely-wide deep neural networks. More recent works, such as Lee et al. (2018), established convergence results corresponding to Neal's (1995) results i) and ii) for deep neural networks under the assumption that the widths $n_1, \ldots, n_L$ go to infinity sequentially over the network layers. Matthews et al. (2018a;b) extended the work of Neal (1995) and Lee et al. (2018) by assuming that the width $n$ grows to infinity jointly over the network layers, instead of sequentially, and by establishing joint convergence over all $i$ and countably many distinct inputs. The joint growth over the layers is certainly more realistic than the sequential growth, since the infinite Gaussian limit is considered as an approximation of a very wide network. We operate in the same setting as Matthews et al. (2018b), hence from here onward $n \geq 1$ denotes the common layer width, i.e. $n_1 = \cdots = n_L = n$. Finally, similar large-width limits have been established for a great variety of neural network architectures; see, for instance, Yang (2019).

The assumption of a countable number of fixed distinct inputs is the common trait of the literature on large-width asymptotics for deep neural networks. Under this assumption, the large-width limit of a network boils down to the study of the large-width asymptotic behavior of the $k$-dimensional random vector $(f_i^{(l)}(x^{(1)}), \ldots, f_i^{(l)}(x^{(k)}))$ over $i \geq 1$ for finite $k$. Such limiting finite-dimensional distributions describe the large-width distribution of a neural network a priori over any dataset, which is finite by definition. When the limiting distribution is Gaussian, as it often is, this immediately paves the way to Bayesian inference for the limiting network. Such an approach is competitive with the more standard stochastic gradient descent training for the fully-connected architectures that are the object of our study (Lee et al., 2020).
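For concreteness, the following is a minimal NumPy sketch of the finite-width network in Eq. (1): weights and biases are drawn iid from Gaussian distributions and hidden layers are re-scaled by $1/\sqrt{n_{l-1}}$. The specific numerical choices (unit weight and bias variances, the $\tanh$ non-linearity, the depth and width) are illustrative assumptions, not quantities fixed by the paper.

```python
# Minimal sketch of the network in Eq. (1), assuming iid N(0, sigma_w^2) weights
# and N(0, sigma_b^2) biases; sigma_w, sigma_b, tanh, L and n are illustrative.
import numpy as np

def sample_network(x, L=3, n=512, sigma_w=1.0, sigma_b=1.0, phi=np.tanh, seed=0):
    """Draw one network f^(L) at the k inputs in x (shape (k, I)); returns (n, k)."""
    rng = np.random.default_rng(seed)
    k, I = x.shape
    # Layer 1: f^(1)_i(x) = sum_j w^(1)_{i,j} x_j + b^(1)_i  (no re-scaling).
    W = rng.normal(0.0, sigma_w, size=(n, I))
    b = rng.normal(0.0, sigma_b, size=(n, 1))
    f = W @ x.T + b                                   # shape (n, k)
    # Layers l = 2, ..., L: re-scaled by 1/sqrt(n_{l-1}) = 1/sqrt(n).
    for _ in range(2, L + 1):
        W = rng.normal(0.0, sigma_w, size=(n, n))
        b = rng.normal(0.0, sigma_b, size=(n, 1))
        f = (W @ phi(f)) / np.sqrt(n) + b
    return f

x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)          # k = 5 inputs with I = 1
print(sample_network(x).shape)                        # (512, 5)
```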
However, knowledge of the limiting finite-dimensional distributions is not enough to infer properties of the limiting neural network that are inherently uncountable, such as the continuity of the limiting neural network or the distribution of its maximum over a bounded interval. Results in this direction give a more complete understanding of the assumptions being made a priori, and hence of whether a given model is appropriate for a specific application. For instance, Van Der Vaart & Van Zanten (2011) show that, for Gaussian Processes, the smoothness of the functions under the prior should match the smoothness of the target function for satisfactory inference performance. In this paper we thus consider a novel, and more natural, perspective on the study of large-width limits of deep neural networks. This is an infinite-dimensional perspective where, instead of fixing a countable number of distinct inputs, we look at $f_i^{(l)}(x, n)$ as a stochastic process over the input space $\mathbb{R}^I$. Under this perspective, establishing large-width limits requires considerable care and, in addition, it requires showing the existence of both the stochastic process induced by the neural network and its large-width limit. We start by proving the existence of: i) a continuous stochastic process, indexed by the network width $n$, corresponding to the fully-connected feed-forward deep neural network; ii) a continuous Gaussian Process corresponding to the infinitely-wide limit of the deep neural network. Then, we prove that the stochastic process in i) converges weakly, as the width $n$ goes to infinity, to the Gaussian Process in ii), jointly over all units $i$. As a by-product of our results, we show that the limiting Gaussian Process has almost surely locally $\gamma$-Hölder continuous paths, for $0 < \gamma < 1$. To make the exposition self-contained, we include an alternative proof of the main result of Matthews et al. (2018a;b), i.e. the finite-dimensional limit for fully-connected neural networks. The major difference between our proof and that of Matthews et al. (2018b) is the use of characteristic functions to establish convergence in distribution, instead of relying on a CLT for exchangeable sequences (Blum et al., 1958).

The paper is structured as follows. In Section 2 we introduce the setting under which we operate, whereas in Section 3 we present a high-level overview of the approach taken to establish our results. Section 4 contains the core arguments of the proof of our large-width functional limit for deep Gaussian neural networks, which are spelled out in detail in the supplementary material (SM). We conclude in Section 5.
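As a rough numerical illustration of the finite-dimensional part of the limit (not a substitute for the proofs), one may draw many independent networks at a fixed input and check that a last-layer unit looks increasingly Gaussian as the common width $n$ grows, e.g. via its excess kurtosis, which vanishes for a Gaussian. All numerical choices below (widths, weight and bias scales, the $\tanh$ non-linearity, the number of repetitions) are arbitrary.

```python
# Hedged empirical check: for a fixed input, the law of a last-layer unit over
# repeated iid draws of weights and biases should approach a Gaussian as the
# common width n grows. Excess kurtosis near 0 is used as a crude diagnostic.
import numpy as np

def draw_unit(n, L=3, sigma_w=1.0, sigma_b=0.1, reps=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array([0.7, -0.3])                        # one fixed input, I = 2
    out = np.empty(reps)
    for r in range(reps):
        f = rng.normal(0.0, sigma_w, (n, 2)) @ x + rng.normal(0.0, sigma_b, n)
        for _ in range(2, L + 1):
            f = rng.normal(0.0, sigma_w, (n, n)) @ np.tanh(f) / np.sqrt(n) \
                + rng.normal(0.0, sigma_b, n)
        out[r] = f[0]                                # first unit of the last layer
    return out

for n in (4, 32, 128):
    s = draw_unit(n)
    z = (s - s.mean()) / s.std()
    print(f"n = {n:3d}  excess kurtosis ~ {np.mean(z ** 4) - 3.0:+.2f}")
```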

2. SETTING

Let $(\Omega, \mathcal{H}, \mathbb{P})$ be the probability space on which all random elements of interest are defined. Furthermore, let $N(\mu, \sigma^2)$ denote a Gaussian distribution with mean $\mu \in \mathbb{R}$ and strictly positive variance $\sigma^2 \in \mathbb{R}_+$, and let $N_k(m, \Sigma)$ be a $k$-dimensional Gaussian distribution with mean $m \in \mathbb{R}^k$ and covariance matrix $\Sigma \in \mathbb{R}^{k \times k}$. In particular, $\mathbb{R}^k$ is equipped with $\|\cdot\|_{\mathbb{R}^k}$, the Euclidean norm induced by the inner product $\langle \cdot, \cdot \rangle_{\mathbb{R}^k}$, and $\mathbb{R}^\infty = \times_{i=1}^{\infty} \mathbb{R}$ is equipped with $\|\cdot\|_{\mathbb{R}^\infty}$, the norm induced by the distance $d_\infty(a, b) = \sum_{i \geq 1} \xi(|a_i - b_i|)/2^i$ for $a, b \in \mathbb{R}^\infty$ (Theorem 3.38 of Aliprantis & Border (2006)), where $\xi(t) = t/(1+t)$ for all real values $t \geq 0$. Note that $(\mathbb{R}, |\cdot|)$ and $(\mathbb{R}^\infty, \|\cdot\|_{\mathbb{R}^\infty})$ are Polish spaces, i.e. separable and complete metric spaces (Corollary 3.39 of Aliprantis & Border (2006)). We choose $d_\infty$ since it generates a topology that coincides with the product topology (line 5 of the proof of Theorem 3.36 of Aliprantis & Border (2006)). The space $(S, d)$ will indicate a generic Polish space, such as $\mathbb{R}$ or $\mathbb{R}^\infty$, with the associated distance. We denote by $S^{\mathbb{R}^I}$ the space of functions from $\mathbb{R}^I$ into $S$, and by $C(\mathbb{R}^I; S) \subset S^{\mathbb{R}^I}$ the space of continuous functions from $\mathbb{R}^I$ into $S$.
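As a small aside (not part of the formal development), $d_\infty$ can be evaluated numerically by truncating the series: since $\xi(t) < 1$, the coordinates beyond the first $k$ contribute at most $\sum_{i > k} 2^{-i} = 2^{-k}$, which is why $d_\infty$ metrizes coordinate-wise (product-topology) convergence. The truncation depth below is an arbitrary choice.

```python
# Sketch of the metric on R^infty from Section 2, truncated at `depth` terms:
# d_infty(a, b) = sum_{i >= 1} xi(|a_i - b_i|) / 2^i,  with xi(t) = t / (1 + t).
import numpy as np

def d_infty(a, b, depth=60):
    a = np.asarray(a, dtype=float)[:depth]
    b = np.asarray(b, dtype=float)[:depth]
    i = np.arange(1, len(a) + 1)
    t = np.abs(a - b)
    return float(np.sum((t / (1.0 + t)) / 2.0 ** i))

# Each term is bounded by 2^{-i}, so large discrepancies in far-out coordinates
# barely move the distance: the sequences below differ by 1e-3 in the first 10
# coordinates and by 1e6 afterwards, yet d_infty remains small.
a = np.zeros(60)
b = np.concatenate([1e-3 * np.ones(10), 1e6 * np.ones(50)])
print(d_infty(a, b))   # ~ 0.002
```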

