LARGE-WIDTH FUNCTIONAL ASYMPTOTICS FOR DEEP GAUSSIAN NEURAL NETWORKS

Abstract

In this paper, we consider fully-connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019), we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space R^I. Under suitable assumptions on the activation function we show that: i) a network defines a continuous stochastic process on the input space R^I; ii) a network with re-scaled weights converges weakly to a continuous Gaussian Process in the large-width limit; iii) the limiting Gaussian Process has almost surely locally γ-Hölder continuous paths, for 0 < γ < 1. Our results contribute to recent theoretical studies on the interplay between infinitely-wide deep neural networks and Gaussian Processes by establishing weak convergence in function-space with respect to a stronger metric.

1. INTRODUCTION

The interplay between infinitely-wide deep neural networks and classes of Gaussian Processes has its origins in the seminal work of Neal (1995), and it has been the subject of several theoretical studies. See, e.g., Der & Lee (2006), Lee et al. (2018), Matthews et al. (2018a;b), Yang (2019) and references therein. Consider a fully-connected feed-forward neural network with re-scaled weights composed of L ≥ 1 layers of widths n_1, . . . , n_L, i.e.

$$
\begin{aligned}
f_i^{(1)}(x) &= \sum_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)}, && i = 1, \ldots, n_1, \\
f_i^{(l)}(x) &= \frac{1}{\sqrt{n_{l-1}}} \sum_{j=1}^{n_{l-1}} w_{i,j}^{(l)} \, \phi\big(f_j^{(l-1)}(x)\big) + b_i^{(l)}, && l = 2, \ldots, L, \; i = 1, \ldots, n_l,
\end{aligned}
\tag{1}
$$

where φ is a non-linearity and x ∈ R^I is a real-valued input of dimension I ∈ N. Neal (1995) considered the case L = 2, a finite number k ∈ N of fixed distinct inputs (x^{(1)}, . . . , x^{(k)}), with each x^{(r)} ∈ R^I, and weights w^{(l)}_{i,j} and biases b^{(l)}_i independently and identically distributed (iid) as Gaussian distributions. Under appropriate assumptions on the activation φ, Neal (1995) showed that: i) for a fixed unit i, the k-dimensional random vector (f_i^{(2)}(x^{(1)}), . . . , f_i^{(2)}(x^{(k)})) converges in distribution, as the width n_1 goes to infinity, to a k-dimensional Gaussian random vector; ii) the large-width convergence holds jointly over finite collections of i's, and the limiting k-dimensional Gaussian random vectors are independent across the index i. These results concern neural networks with a single hidden layer, but Neal (1995) also includes preliminary considerations on infinitely-wide deep neural networks. More recent works, such as Lee et al. (2018), established convergence results corresponding to Neal's (1995) results i) and ii) for deep neural networks under the assumption that the widths n_1, . . . , n_L go to infinity sequentially over network layers. Matthews et al. (2018a;b) extended the work of Neal (1995) and Lee et al. (2018) by assuming that the width n grows to infinity jointly over network layers, instead of sequentially, and by establishing joint convergence over all i and countably many distinct inputs. The joint growth over the layers is certainly more realistic than the sequential growth, since the infinite Gaussian limit is considered as an approximation of a very wide network. We operate in the same setting of Matthews et al. (2018b), hence from here onward n ≥ 1 denotes the
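To make the finite-width model in equation (1) concrete, the following is a minimal simulation sketch (not part of the paper): it draws one realization of a network with iid Gaussian weights and biases and the 1/√n_{l-1} re-scaling at a finite collection of inputs. The function name sample_network, the standard-deviation parameters sigma_w and sigma_b, and the tanh non-linearity are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def sample_network(x, widths, sigma_w=1.0, sigma_b=1.0, phi=np.tanh, rng=None):
    """Draw one realization of f^(L) in equation (1) at the k inputs in x (shape (k, I))."""
    rng = np.random.default_rng() if rng is None else rng
    k, I = x.shape
    # First layer: f^(1)_i(x) = sum_j w^(1)_{i,j} x_j + b^(1)_i  (no re-scaling, I is fixed)
    W = rng.normal(0.0, sigma_w, size=(widths[0], I))
    b = rng.normal(0.0, sigma_b, size=(widths[0], 1))
    f = W @ x.T + b                                   # shape (n_1, k)
    # Deeper layers: f^(l)_i(x) = (1/sqrt(n_{l-1})) sum_j w^(l)_{i,j} phi(f^(l-1)_j(x)) + b^(l)_i
    for n_prev, n_cur in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w, size=(n_cur, n_prev))
        b = rng.normal(0.0, sigma_b, size=(n_cur, 1))
        f = (W @ phi(f)) / np.sqrt(n_prev) + b
    return f                                          # shape (n_L, k)

# Illustrative check of Neal's observations i) and ii): at a large common width,
# repeated draws of (f^(L)_1(x^(1)), ..., f^(L)_1(x^(k))) for a fixed unit look
# approximately jointly Gaussian, and distinct units are approximately independent.
x = np.array([[0.3, -1.2], [0.7, 0.4], [-0.5, 1.1]])  # k = 3 inputs in R^2
samples = np.stack([sample_network(x, widths=[1000, 1000, 2])[0]
                    for _ in range(2000)])
print(samples.shape, np.cov(samples.T))               # empirical 3x3 output covariance
```

This is only a finite-width Monte Carlo illustration of the setting; the paper's contribution concerns the law of the whole random function on R^I and its weak convergence, not finite-dimensional marginals alone.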

