PREDICTING THE OUTPUTS OF FINITE NETWORKS TRAINED WITH NOISY GRADIENTS

Anonymous authors
Paper under double-blind review

Abstract

A recent line of works studied wide deep neural networks (DNNs) by approximating them as Gaussian Processes (GPs). A DNN trained with gradient flow was shown to map to a GP governed by the Neural Tangent Kernel (NTK), whereas earlier works showed that a DNN with an i.i.d. prior over its weights maps to the so-called Neural Network Gaussian Process (NNGP). Here we consider a DNN training protocol, involving noise, weight decay and finite width, whose outcome corresponds to a certain non-Gaussian stochastic process. An analytical framework is then introduced to analyze this non-Gaussian process, whose deviation from a GP is controlled by the finite width. Our contribution is three-fold: (i) In the infinite width limit, we establish a correspondence between DNNs trained with noisy gradients and the NNGP, not the NTK. (ii) We provide a general analytical form for the finite width correction (FWC) for DNNs with arbitrary activation functions and depth, and use it to predict the outputs of empirical finite networks with high accuracy. Analyzing the FWC behavior as a function of n, the training set size, we find that it is negligible both in the very small n regime and, surprisingly, in the large n regime (where the GP error scales as O(1/n)). (iii) We flesh out algebraically how these FWCs can improve the performance of finite convolutional neural networks (CNNs) relative to their GP counterparts on image classification tasks.

Another recent paper (Yaida, 2020) studied Bayesian inference with weakly non-Gaussian priors induced by finite-N DNNs. Unlike here, there was no attempt to establish a correspondence with trained DNNs. The formulation presented here has the conceptual advantage of representing a distribution over function space for arbitrary training and test data, rather than over specific draws of data sets. This is useful for studying the large n behavior of learning curves, where analytical insights into generalization can be gained (Cohen et al., 2019).
A somewhat related line of work studied the mean field regime of shallow NNs (Mei et al., 2018; Chen et al., 2020; Tzen & Raginsky, 2020). We point out the main differences from our work: (a) The NN output is scaled differently with width. (b) In the mean field regime one is interested in the dynamics (finite t) of the distribution over the NN parameters in the form of a PDE of the Fokker-Planck type. In contrast, in our framework we are interested in the distribution over function space at equilibrium, i.e. for t → ∞. (c) The mean field analysis appears tailored to two-layer fully-connected NNs and is hard to generalize to deeper nets or to CNNs. In contrast, our formalism generalizes to deeper fully-connected NNs and to CNNs as well, as we show in §4.2.

1. INTRODUCTION

Deep neural networks (DNNs) have been rapidly advancing the state-of-the-art in machine learning, yet a complete analytic theory remains elusive. Recently, several exact results were obtained in the highly over-parameterized regime (N → ∞ where N denotes the width or number of channels for fully connected networks (FCNs) and convolutional neural networks (CNNs), respectively) (Daniely et al., 2016) . This facilitated the derivation of an exact correspondence with Gaussian Processes (GPs) known as the Neural Tangent Kernel (NTK) (Jacot et al., 2018) . The latter holds when highly over-parameterized DNNs are trained by gradient flow, namely with vanishing learning rate and involving no stochasticity. The NTK result has provided the first example of a DNN to GP correspondence valid after end-to-end DNN training. This theoretical breakthrough allows one to think of DNNs as inference problems with underlying GPs (Rasmussen & Williams, 2005) . For instance, it provides a quantitative description of the generalization properties (Cohen et al., 2019; Rahaman et al., 2018) and training dynamics (Jacot et al., 2018; Basri et al., 2019) of DNNs. Roughly speaking, highly over-parameterized DNNs generalize well because they have a strong implicit bias to simple functions, and train well because low-error solutions in weight space can be reached by making a small change to the random values of the weights at initialization. Despite its novelty and importance, the NTK correspondence suffers from a few shortcomings: (a) Its deterministic training is qualitatively different from the stochastic one used in practice, which may lead to poorer performance when combined with a small learning rate (Keskar et al., 2016) . (b) It under-performs, often by a large margin, convolutional neural networks (CNNs) trained with SGD (Arora et al., 2019) . 
(c) Deriving explicit finite width corrections (FWCs) is challenging, as it requires solving a set of coupled ODEs (Dyer & Gur-Ari, 2020; Huang & Yau, 2019). Thus, there is a need for an extended theory for end-to-end trained deep networks which is valid for finite width DNNs. Our contribution is three-fold. First, we prove a correspondence between a DNN trained with noisy gradients and a Stochastic Process (SP) which at N → ∞ tends to the Neural Network Gaussian Process (NNGP) (Lee et al., 2018; Matthews et al., 2018). In those works, the NNGP kernel is determined by the distribution of the DNN weights at initialization, which are i.i.d. random variables, whereas in our correspondence the weights are sampled across the stochastic training dynamics, drifting far away from their initial values. We call ours the NNSP correspondence, and show that it holds when the training dynamics in output space exhibit ergodicity. Second, we predict the outputs of trained finite-width DNNs, significantly improving upon the corresponding GP predictions. This is done by deriving leading FWCs which are found to scale with width as 1/N. The accuracy at which we can predict the empirical DNNs' outputs serves as a strong verification for our aforementioned ergodicity assumption. In the regime where the GP RMSE error scales as 1/n, we find that the leading FWC is a decaying function of n, and thus overall negligible. In the small n regime we find that the FWC is small and grows with n. We thus conclude that finite-width corrections are important for intermediate values of n (Fig. 1). Third, we propose an explanation for why finite CNNs trained on image classification tasks can outperform their infinite-width counterparts, as observed by Novak et al. (2018). The key difference is that weight sharing matters only at finite width: it is washed out of the corresponding GP. Our theory, which accounts for the finite width, quantifies this difference (§4.2).
Overall, the NNSP correspondence provides a rich analytical and numerical framework for exploring the theory of deep learning, unique in its ability to incorporate finite over-parameterization, stochasticity, and depth. We note that there are several factors that make finite SGD-trained DNNs used in practice different from their GP counterparts, e.g. large learning rates and early stopping (Lee et al., 2020). Importantly, our framework quantifies the contribution of finite-width effects to this difference, distilling it from the contribution of these other factors.

1.1. RELATED WORK

The idea of leveraging the dynamics of the gradient descent algorithm for approximating Bayesian inference has been considered in various works (Welling & Teh, 2011; Mandt et al., 2017; Teh et al., 2016; Maddox et al., 2019; Ye et al., 2017). However, to the best of our knowledge, a correspondence with a concrete SP or a non-parametric model was not established, nor was a comparison made of the DNN's outputs with analytical predictions. Finite width corrections were studied recently in the context of the NTK correspondence by several authors. Hanin & Nica (2019) study the NTK of finite DNNs, but where the depth scales together with the width, whereas we keep the depth fixed. Dyer & Gur-Ari (2020) obtained a finite N correction to the linear integral equation governing the evolution of the predictions on the training set. Our work differs in several aspects: (a) We describe a different correspondence under a different training protocol, with qualitatively different behavior. (b) We derive relatively simple formulae for the outputs which become entirely explicit at large n. (c) We account for all sources of finite N corrections, whereas finite N NTK randomness remained an empirical source of corrections not accounted for by Dyer & Gur-Ari (2020). (d) Our formalism differs considerably: its statistical mechanical nature enables one to import various standard tools for treating randomness, ergodicity breaking, and taking into account non-perturbative effects. (e) We have no smoothness limitation on our activation functions and provide FWCs on a generic data point and not just on the training set.

2. THE NNSP CORRESPONDENCE

In this section we show that finite-width DNNs, trained in a specific manner, correspond to Bayesian inference using a non-parametric model which tends to the NNGP as N → ∞. We first give a short review of Langevin dynamics in weight space as described by Neal et al. (2011); Welling & Teh (2011), which we use to generate samples from the posterior over weights. We then shift our perspective and consider the corresponding distribution over functions induced by the DNN, which characterizes the non-parametric model.

Recap of Langevin-type dynamics - Consider a DNN trained with full-batch gradient descent while injecting white Gaussian noise and including a weight decay term, so that the discrete-time dynamics of the weights read

∆w_t := w_{t+1} − w_t = −(γ w_t + ∇_w L(z_w)) dt + √(2T dt) ξ_t    (1)

where w_t is the vector of all network weights at time step t, γ is the strength of the weight decay, L(z_w) is the loss as a function of the output z_w, T is the temperature (the magnitude of the noise), dt is the learning rate and ξ_t ∼ N(0, I). As dt → 0 these discrete-time dynamics converge to the continuous-time Langevin equation ẇ(t) = −∇_w [ (γ/2)‖w(t)‖² + L(z_w) ] + √(2T) ξ(t) with ⟨ξ_i(t) ξ_j(t′)⟩ = δ_{ij} δ(t − t′), so that as t → ∞ the weights are sampled from the equilibrium distribution in weight space, given by (Risken & Frank, 1996)

P(w) ∝ exp( −(1/T) [ (γ/2)‖w‖² + L(z_w) ] ) = exp( −[ ‖w‖²/(2σ_w²) + L(z_w)/(2σ²) ] )    (2)

The last equality holds since the equilibrium distribution of the Langevin dynamics is also the posterior distribution of a Bayesian neural network (BNN) with an i.i.d. Gaussian prior on the weights, w ∼ N(0, σ_w² I). Thus we can map the hyper-parameters of the training to those of the BNN: σ_w² = T/γ and σ² = T/2. Notice that a sensible scaling for the weight variance at layer ℓ is σ²_{w,ℓ} ∼ O(1/N_{ℓ−1}), thus the weight decay needs to scale as γ_ℓ ∼ O(N_{ℓ−1}).
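As a concrete illustration of Eq. 1 (a minimal sketch, not the code used in the paper), the update can be written in a few lines of numpy. On a single weight with the hypothetical loss L(w) = ½(w − y)², the equilibrium distribution of Eq. 2 is a Gaussian with mean y/(1+γ) and variance T/(1+γ), which the sampler should reproduce up to O(dt) discretization bias:

```python
import numpy as np

def langevin_step(w, grad_loss, gamma, T, dt, rng):
    """One discrete Langevin update, Eq. (1):
    w_{t+1} = w_t - (gamma*w_t + grad L) * dt + sqrt(2*T*dt) * xi."""
    xi = rng.standard_normal(w.shape)
    return w - (gamma * w + grad_loss(w)) * dt + np.sqrt(2.0 * T * dt) * xi

# Toy check: one weight, loss L(w) = 0.5*(w - y)^2.  The equilibrium density
# P(w) ∝ exp(-(gamma*w^2/2 + L(w))/T) is Gaussian with mean y/(1+gamma)
# and variance T/(1+gamma).
rng = np.random.default_rng(0)
gamma, T, dt, y = 1.0, 0.5, 0.01, 2.0
grad = lambda w: (w - y)
w = np.zeros(1)
samples = []
for t in range(300_000):
    w = langevin_step(w, grad, gamma, T, dt, rng)
    if t > 50_000:                 # discard burn-in, then record the trajectory
        samples.append(w[0])
samples = np.array(samples)
# samples.mean() ≈ 1.0 and samples.var() ≈ 0.25, up to O(dt) bias and sampling error
```

Time-averages over such a trajectory are exactly the quantities that the NNSP correspondence equates with posterior expectations.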
A transition from weight space to function space - We aim to move from the distribution over weight space, Eq. 2, to one over function space. Namely, we consider the distribution of z_w(x) implied by the above P(w), where for concreteness we consider a DNN with a single scalar output z_w(x) ∈ ℝ on a regression task with data {(x_α, y_α)}_{α=1}^n ⊂ ℝ^d × ℝ. Denoting by P[f] the induced measure on function space, we formally write

P[f] = ∫ dw δ[f − z_w] P(w) ∝ e^{−L[f]/(2σ²)} ∫ dw e^{−‖w‖²/(2σ_w²)} δ[f − z_w]    (3)

where ∫dw denotes an integral over all weights and δ[f − z_w] is a delta-function in function space. As common in the path-integral or field-theory formalism (Schulman, 2012), such a delta function is understood as a limit procedure: one chooses a suitable basis for function space, trims it to a finite subset, treats δ[f − z_w] as a product of regular delta-functions, and at the end of the computation takes the size of the subset to infinity. To proceed, we decompose the posterior over functions, Eq. 3, as P[f] ∝ e^{−L[f]/(2σ²)} P_0[f], where the prior over functions is P_0[f] ∝ ∫ dw e^{−‖w‖²/(2σ_w²)} δ[f − z_w]. The integration over weights now obtains a clear meaning: it yields the distribution over functions induced by a DNN with i.i.d. random weights chosen according to the prior P_0(w) ∝ e^{−‖w‖²/(2σ_w²)}. Thus, we can relate any correlation function in function space and weight space, for instance (Df is the integration measure over function space)

∫ Df P_0[f] f(x) f(x′) = ∫ Df ∫ dw P_0(w) δ[f − z_w] f(x) f(x′) = ∫ dw P_0(w) z_w(x) z_w(x′)    (4)

As noted by Cho & Saul (2009), for highly over-parameterized DNNs the r.h.s. of Eq. 4 equals the kernel of the NNGP associated with this DNN, K(x, x′). Moreover, P_0[f] tends to a Gaussian and can be written as

P_0[f] ∝ exp( −(1/2) ∫ dµ(x) dµ(x′) f(x) K⁻¹(x, x′) f(x′) + O(1/N) )    (5)

where µ(x) is the measure of the input space, and the O(1/N) scaling of the finite-N correction will be explained in §3.
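Eq. 4 can be checked numerically for a concrete architecture. The sketch below (our illustration; the input vectors are arbitrary) Monte Carlo estimates the single-hidden-layer ReLU covariance ⟨φ(w·x) φ(w·x′)⟩ with w ∼ N(0, I/d), and compares it against the closed-form arc-cosine kernel of Cho & Saul (2009):

```python
import numpy as np

def arccos_kernel(x1, x2, d):
    """Analytic ReLU kernel (Cho & Saul, 2009) for w ~ N(0, I/d):
    E[relu(u) relu(u')] with (u, u') jointly Gaussian."""
    s11, s22, s12 = x1 @ x1 / d, x2 @ x2 / d, x1 @ x2 / d
    theta = np.arccos(np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0))
    return np.sqrt(s11 * s22) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

rng = np.random.default_rng(1)
d, M = 3, 400_000
x1 = np.array([1.0, 0.5, -0.2])
x2 = np.array([0.3, -1.0, 0.8])
relu = lambda u: np.maximum(u, 0.0)
W = rng.standard_normal((M, d)) / np.sqrt(d)      # M weight draws, rows ~ N(0, I/d)
K_mc = np.mean(relu(W @ x1) * relu(W @ x2))       # Monte Carlo l.h.s. of Eq. (4)
K_an = arccos_kernel(x1, x2, d)                   # analytic NNGP kernel
```

With the readout variance scaled as σ_a² = 1/N, the full-network covariance in Eq. 4 reduces to exactly this hidden-layer average.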
If we now plug Eq. 5 into Eq. 3, take the loss to be the total square error L[f] = Σ_{α=1}^n (y_α − f(x_α))², and take N → ∞, we find that the posterior P[f] is that of a GP. Assuming ergodicity, the training-time averaged output of the DNN is then given by the posterior mean of a GP, with measurement noise equal to σ² = T/2 and a kernel given by the NNGP of that DNN. We refer to the above expressions for P_0[f] and P[f], describing the distribution of outputs of a DNN trained according to our protocol, as the NNSP correspondence. Unlike the NTK correspondence, the kernel which appears here is different, and no additional initialization dependent terms appear (as should be the case since we assumed ergodicity). Furthermore, given knowledge of P_0[f] at finite N, one can predict the DNN's outputs at finite N. Henceforth, we refer to P_0[f] as the prior distribution, as it is the prior distribution of a DNN with random weights drawn from P_0(w).

Evidence supporting ergodicity - Our derivation relies on the ergodicity of the dynamics. Ergodicity is in general hard to prove rigorously in non-convex settings, and thus we must resort to heuristics. The most robust evidence of ergodicity in function space is the high level of accuracy of our analytical expressions w.r.t. our numerical results. This is a self-consistency argument: we assume ergodicity in order to derive our analytical results, and then indeed find that they agree very well with the experiment, thus validating our original assumption. Another indicator of ergodicity is a small auto-correlation time (ACT) of the dynamics. Although a short ACT does not logically imply ergodicity (in fact, only the converse is true: an exponentially long ACT implies non-ergodic dynamics), the empirical ACT gives a lower bound on the true correlation time of the dynamics.
In our framework, it is sufficient that the dynamics of the outputs z_w be ergodic, even if the dynamics of the weights converge much more slowly to an equilibrium distribution. Indeed, we have found that the ACTs of the outputs are considerably smaller than those of the weights (see Fig. 2b). Full ergodicity may be too strong a condition for our purposes, since we are mainly interested in collecting statistics that allow us to accurately compute the posterior mean of the distribution in function space. A weaker condition which is sufficient here is ergodicity in the mean (see App. F), and we believe our self-consistency argument above demonstrates that it holds. In a related manner, optimizing the train loss can be seen as an attempt to satisfy n constraints using far more variables (roughly M · N², where M is the number of layers). From a different angle, in a statistical mechanical description of satisfiability problems, one typically expects ergodic behavior when the ratio of the number of variables to the number of constraints becomes much larger than one (Gardner & Derrida, 1988).
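A simple way to probe the ACT in practice (a generic estimator, not necessarily the paper's exact procedure) is the integrated autocorrelation time. On an AR(1) surrogate series with correlation ρ it should recover τ ≈ (1+ρ)/(1−ρ), while a fast-mixing (white) series gives τ ≈ 1:

```python
import numpy as np

def integrated_act(x, window=200):
    """Integrated autocorrelation time tau = 1 + 2*sum_k acf(k),
    truncated at a fixed window (a common simple estimator)."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    acf = np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x) for k in range(window)])
    return 1.0 + 2.0 * np.sum(acf[1:])

rng = np.random.default_rng(2)
n = 100_000
white = rng.standard_normal(n)            # uncorrelated series: tau ~ 1
rho = 0.8
eps = rng.standard_normal(n)
ar1 = np.empty(n)
ar1[0] = 0.0
for t in range(1, n):                     # AR(1): exact tau = (1+rho)/(1-rho) = 9
    ar1[t] = rho * ar1[t - 1] + eps[t]
tau_white = integrated_act(white)
tau_ar1 = integrated_act(ar1)
```

Comparing such estimates for the outputs versus the weights is what underlies the observation in Fig. 2b that the output dynamics equilibrate much faster.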

3. INFERENCE ON THE RESULTING NNSP

Having mapped the time-averaged outputs of a DNN to inference on the above NNSP, we turn to analyze the predictions of this NNSP in the case where N is large but finite, such that the NNSP is only weakly non-Gaussian (i.e. its deviation from a GP is O(1/N)). The main result of this section is a derivation of the leading FWCs to the standard GP regression results for the posterior mean f̄_GP(x*) and variance Σ_GP(x*) on an unseen test point x*, given a training set {(x_α, y_α)}_{α=1}^n ⊂ ℝ^d × ℝ, namely (Rasmussen & Williams, 2005)

f̄_GP(x*) = Σ_{α,β} y_α K̃⁻¹_{αβ} K_{*β} ;  Σ_GP(x*) = K_{**} − Σ_{α,β} K_{*α} K̃⁻¹_{αβ} K_{*β}    (6)

where K̃_{αβ} := K(x_α, x_β) + σ² δ_{αβ} ;  K_{*α} := K(x*, x_α) ;  K_{**} := K(x*, x*).
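Eq. 6 is standard GP regression and can be sketched directly in numpy (with an illustrative RBF kernel standing in for an NNGP kernel; the data points are arbitrary). With negligible observation noise, the posterior mean interpolates the training targets and the posterior variance vanishes there:

```python
import numpy as np

def gp_posterior(K_train, K_star, K_starstar, y, sigma2):
    """GP posterior mean and variance at a test point, Eq. (6)."""
    Kt = K_train + sigma2 * np.eye(len(y))            # K~ = K + sigma^2 I
    mean = K_star @ np.linalg.solve(Kt, y)
    var = K_starstar - K_star @ np.linalg.solve(Kt, K_star)
    return mean, var

# Toy check with an RBF kernel on scalar inputs.
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
x = np.array([-1.0, 0.0, 1.5])
y = np.array([0.2, -0.3, 0.7])
K = rbf(x[:, None], x[None, :])
# Predicting at the training point x = 0 with tiny noise recovers y = -0.3.
m, v = gp_posterior(K, rbf(0.0, x), 1.0, y, sigma2=1e-8)
```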

3.1. EDGEWORTH EXPANSION AND PERTURBATION THEORY

Our first task is to find how P[f] changes compared to the Gaussian (N → ∞) scenario. As the data-dependent part e^{−L[f]/(2σ²)} is independent of the DNN, this amounts to obtaining finite width corrections to the prior P_0[f]. One way to characterize these is to perform an Edgeworth expansion of P_0[f] (Mccullagh, 2017; Sellentin et al., 2017). We give a short recap of the Edgeworth expansion to elucidate our derivation, beginning with a scalar-valued RV. Consider continuous i.i.d. RVs {Z_i} and assume WLOG ⟨Z_i⟩ = 0, ⟨Z_i²⟩ = 1, with higher cumulants κ_r^Z for r ≥ 3. Now consider their normalized sum Y_N = (1/√N) Σ_{i=1}^N Z_i. From additivity and homogeneity of cumulants we have κ_{r≥2} := κ^Y_{r≥2} = N κ_r^Z / (√N)^r = κ_r^Z / N^{r/2−1}. Now, let ϕ(y) := (2π)^{−1/2} e^{−y²/2}. The characteristic function of Y_N is f̂(t) = exp( Σ_{r≥1} κ_r (it)^r / r! ) = exp( Σ_{r≥3} κ_r (it)^r / r! ) ϕ̂(t). Taking the inverse Fourier transform has the effect of mapping it → −∂_y, thus we get f(y) = exp( Σ_{r≥3} κ_r (−∂_y)^r / r! ) ϕ(y) = ϕ(y) [ 1 + Σ_{r≥3} (κ_r / r!) H_r(y) + ⋯ ], where H_r(y) is the r-th (probabilists') Hermite polynomial, e.g. H_4(y) = y⁴ − 6y² + 3. If we were to consider vector-valued RVs, the r-th order cumulant would become a tensor with r indices, and the Hermite polynomials would become multi-variate polynomials. In our case, we are considering random functions defined by our stochastic process (the NNSP), thus the cumulants are functional tensors, i.e. are continuously indexed by the inputs x_α. This is especially convenient here since for all DNNs whose last layer is fully-connected, all odd cumulants vanish and the 2r-th cumulant scales as 1/N^{r−1}. Consequently, at large N we can characterize P_0[f] up to O(N⁻²) by its second and fourth cumulants, K(x₁, x₂) and U(x₁, x₂, x₃, x₄), respectively.
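The scalar Edgeworth expansion above can be validated on a case where everything is known exactly: the normalized sum of N Uniform(0,1) RVs, whose density is the (standardized) Irwin-Hall distribution and whose fourth cumulant after normalization is κ₄ = −(6/5)/N. A minimal sketch, evaluated at y = 0 where H₄(0) = 3:

```python
from math import comb, factorial, sqrt, pi

def irwin_hall_pdf(s, N):
    """Exact density of the sum of N iid Uniform(0,1) variables."""
    return sum((-1) ** k * comb(N, k) * max(s - k, 0.0) ** (N - 1)
               for k in range(N + 1)) / factorial(N - 1)

N = 6
scale = sqrt(N / 12.0)                        # std of the sum of N uniforms
f_true = scale * irwin_hall_pdf(N / 2.0, N)   # exact density of Y_N at y = 0
phi0 = 1.0 / sqrt(2.0 * pi)                   # Gaussian density at y = 0
kappa4 = -1.2 / N                             # uniform: kappa_4^Z = -6/5
f_edge = phi0 * (1.0 + (kappa4 / 24.0) * 3.0) # phi(0)*(1 + kappa4/4! * H4(0))
```

The leading Edgeworth term shrinks the error at the origin from O(1/N) (plain Gaussian) to O(1/N²), as the assertions below confirm.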
Thus the leading order correction to P_0[f] reads

P_0[f] ∝ e^{−S_GP[f]} ( 1 − (1/N) S_U[f] + O(1/N²) )    (7)

where the GP action S_GP and the first FWC action S_U are given by

S_GP[f] = (1/2) ∫ dµ_{1:2} f_{x₁} K⁻¹_{x₁,x₂} f_{x₂} ;  S_U[f] = −(1/4!) ∫ dµ_{1:4} U_{x₁,x₂,x₃,x₄} H_{x₁,x₂,x₃,x₄}[f]    (8)

Here, H is the 4th functional Hermite polynomial (see App. A) and U is the 4th order functional cumulant of the NN output, which depends on the choice of the activation function φ:

U_{x₁,x₂,x₃,x₄} = ς_a⁴ ( ⟨φ_α φ_β φ_γ φ_δ⟩ − ⟨φ_α φ_β⟩⟨φ_γ φ_δ⟩ − the 2 other pairings of (α, β, γ, δ) ∈ {1, ..., 4} )    (9)

where φ_α := φ(z^{ℓ−1}_i(x_α)) and the pre-activations are z^ℓ_i(x) = b^ℓ_i + Σ_{j=1}^N W^ℓ_{ij} φ(z^{ℓ−1}_j(x)). Here we distinguished between the scaled and non-scaled weight variances, σ_a² = ς_a²/N, where the a's are the weights of the last layer. Our shorthand notation for the integration measure over inputs means e.g. dµ_{1:4} := dµ(x₁) ⋯ dµ(x₄). Using perturbation theory, in App. B we compute the leading FWC to the posterior mean f̄(x*) and variance ⟨(δf(x*))²⟩ on a test point x*:

f̄(x*) = f̄_GP(x*) + N⁻¹ f̄_U(x*) + O(N⁻²) ;  ⟨(δf(x*))²⟩ = Σ_GP(x*) + N⁻¹ Σ_U(x*) + O(N⁻²)    (10)

with Σ_U(x*) = ⟨(f(x*))²⟩_U − 2 f̄_GP(x*) f̄_U(x*) and

f̄_U(x*) = (1/6) Ũ_{*α₁α₂α₃} ( ỹ_{α₁} ỹ_{α₂} ỹ_{α₃} − 3 K̃⁻¹_{α₁α₂} ỹ_{α₃} ) ;  ⟨(f(x*))²⟩_U = (1/2) Ũ_{**α₁α₂} ( ỹ_{α₁} ỹ_{α₂} − K̃⁻¹_{α₁α₂} )    (11)

where all repeating indices are summed over the training set (i.e. range over {1, ..., n}), denoting ỹ_α := K̃⁻¹_{αβ} y_β, and defining

Ũ_{*α₁α₂α₃} := U_{*α₁α₂α₃} − U_{α₁α₂α₃α₄} K̃⁻¹_{α₄β} K_{*β} ;  Ũ_{**α₁α₂} := U_{**α₁α₂} − ( U_{*α₁α₂α₃} + Ũ_{*α₁α₂α₃} ) K̃⁻¹_{α₃β} K_{*β}    (12)

Equations 11, 12 are among our key analytical results, and they are qualitatively different from the corresponding GP expressions in Eq. 6. The correction to the predictive mean f̄_U(x*) has a term linear in y, which can be viewed as a correction to the GP kernel, but also a cubic term in y, unlike f̄_GP(x*) which is purely linear.
The correction to the predictive variance Σ_U(x*) has quartic and quadratic terms in y, unlike Σ_GP(x*) which is y-independent. Ũ_{*α₁α₂α₃} has a clear interpretation in terms of GP regression: if we consider the indices α₁, α₂, α₃ as fixed, then U_{*α₁α₂α₃} can be thought of as the ground truth value of a target function (analogous to y*), and the second term on the r.h.s., U_{α₁α₂α₃α₄} K̃⁻¹_{α₄β} K_{*β}, is then the GP prediction of U_{*α₁α₂α₃} with the kernel K, where α₄ runs over the training set (compare with f̄_GP(x*) in Eq. 6). Thus Ũ_{*α₁α₂α₃} is the discrepancy in predicting U_{*α₁α₂α₃} using a GP with kernel K. In §3.2 we study the behavior of f̄_U(x*) as a function of n.
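Once K and U are tabulated on the n training points plus the test point, Eqs. 11-12 reduce to finite tensor contractions. A numpy sketch of the mean correction f̄_U (with random placeholder K and a rank-one symmetric U, purely to exhibit the index structure; in the paper U is computed from the network's activations, Eq. 9):

```python
import numpy as np

def fwc_mean_correction(K, U, y, sigma2, s):
    """Leading 1/N correction to the posterior mean, per Eqs. (11)-(12).
    K: (n+1, n+1) kernel on n training points plus the test point (index s);
    U: (n+1,)*4 fourth cumulant on the same points; y: (n,) targets."""
    tr = [i for i in range(K.shape[0]) if i != s]
    Kt_inv = np.linalg.inv(K[np.ix_(tr, tr)] + sigma2 * np.eye(len(tr)))
    y_t = Kt_inv @ y                                   # \tilde{y}_alpha
    kinvk = Kt_inv @ K[tr, s]                          # K~^{-1}_{a b} K_{* b}
    # Eq. (12): U_{* a b c} minus its GP prediction from U_{a b c d}
    U_tilde = U[s][np.ix_(tr, tr, tr)] - np.einsum(
        'abcd,d->abc', U[np.ix_(tr, tr, tr, tr)], kinvk)
    # Eq. (11): cubic-in-y term minus the linear counterterm
    cubic = np.einsum('abc,a,b,c->', U_tilde, y_t, y_t, y_t)
    linear = 3.0 * np.einsum('abc,ab,c->', U_tilde, Kt_inv, y_t)
    return (cubic - linear) / 6.0

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n + 1, n + 1))
K = A @ A.T + (n + 1) * np.eye(n + 1)                  # positive definite placeholder
v = rng.standard_normal(n + 1)
U = np.einsum('a,b,c,d->abcd', v, v, v, v)             # fully symmetric placeholder
y = rng.standard_normal(n)
f1 = fwc_mean_correction(K, U, y, 0.1, s=n)            # test point = last index
f0 = fwc_mean_correction(K, np.zeros_like(U), y, 0.1, s=n)  # vanishing cumulant
f2 = fwc_mean_correction(K, 2 * U, y, 0.1, s=n)        # f_U is linear in U
```

Two sanity checks follow directly from Eq. 11: the correction vanishes when U = 0, and it is linear in U.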

The posterior variance Σ(x) = ⟨(δf(x))²⟩ has a clear interpretation in our correspondence: it is a measure of how much we can decrease the test loss by averaging. Our procedure for generating empirical network outputs involves time-averaging over the training dynamics after reaching equilibrium, and also averaging over different realizations of the noise and of the initial conditions (see App. F). This allows for a reliable comparison with our FWC theory for the mean. In principle, one could use the network outputs at the end of training without this averaging, in which case there will be fluctuations that scale with Σ(x_α). Following this, one finds that the expected MSE test loss after training saturates is n_*⁻¹ Σ_{α=1}^{n_*} [ ( f̄(x_α) − y(x_α) )² + Σ(x_α) ], where n_* is the size of the test set.
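The loss decomposition above is the usual bias-variance split of noisy predictions around their average, and it holds exactly for empirical moments, as a quick numpy check (with synthetic outputs standing in for network samples) confirms:

```python
import numpy as np

# Per test point: mean over draws of (f - y)^2 = (f_bar - y)^2 + Var(f),
# exactly, when Var is the empirical (ddof=0) variance of the draws.
rng = np.random.default_rng(4)
y = rng.standard_normal(5)                                   # targets on 5 test points
f_samples = y + 0.1 + 0.3 * rng.standard_normal((1000, 5))   # synthetic noisy outputs
f_bar = f_samples.mean(axis=0)                               # time/ensemble average
Sigma = f_samples.var(axis=0)                                # plays the role of Sigma(x)
lhs = ((f_samples - y) ** 2).mean()                          # expected single-draw loss
rhs = ((f_bar - y) ** 2 + Sigma).mean()                      # loss of the average + variance
```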

3.2. FINITE WIDTH CORRECTIONS FOR SMALL AND LARGE DATA SETS

The expressions in Eqs. 6, 11 for the GP prediction and the leading FWC are explicit, but only up to a potentially large matrix inversion, K̃⁻¹. These matrices also have a random component related to the largely arbitrary choice of the particular n training points used to characterize the target function. An insightful tool, used in the context of GPs, which solves both these issues is the Equivalent Kernel (EK) (Rasmussen & Williams, 2005; Sollich & Williams, 2004). The EK approximates the GP predictions at large n, after averaging over all draws of (roughly) n training points representing the target function. Even if one is interested in a particular dataset, the EK result captures the behavior of that specific dataset up to small corrections. Essentially, the discrete sums over the training set appearing in Eq. 6 are replaced by integrals over all of input space, which together with a spectral decomposition of the kernel function, K(x, x′) = Σ_i λ_i ψ_i(x) ψ_i(x′), yields the well known result

f̄^EK_GP(x*) = ∫ dµ(x′) Σ_i [ λ_i ψ_i(x*) ψ_i(x′) / (λ_i + σ²/n) ] g(x′)    (13)

Here we develop an extension of Eq. 13 for the NNSPs we find at large but finite N. In particular, we find the leading non-linear correction to the EK result, i.e. the "EK analogue" of Eq. 11. To this end, we consider the average predictions of an NNSP trained on an ensemble of data sets of size n, corresponding to n independent draws from a distribution µ(x) over all possible inputs x. Following the steps in App. J we find

f̄^EK_U(x*) = (1/6) δ_{x* x₁} U_{x₁,x₂,x₃,x₄} { (n³/σ⁶) δ_{x₂ x₂′} g(x₂′) δ_{x₃ x₃′} g(x₃′) δ_{x₄ x₄′} g(x₄′) − (3n²/σ⁴) δ_{x₂,x₃} δ_{x₄ x₄′} g(x₄′) }    (14)

where an integral ∫dµ(x) is implicit for every pair of repeated x coordinates. We introduced the discrepancy operator δ_{xx′}, which acts on a function ϕ as ∫ dµ(x′) δ_{xx′} ϕ(x′) := δ_{xx′} ϕ(x′) = ϕ(x) − f̄^EK_GP(x), i.e. it yields the discrepancy between ϕ and its EK prediction. Essentially, Eq. 14 is derived from Eq. 11 by replacing each K̃⁻¹ by (n/σ²) δ and noticing that in this regime Ũ_{* x₂,x₃,x₄} in Eq. 12 becomes δ_{x* x₁} U_{x₁,x₂,x₃,x₄}. Interestingly, f̄^EK_U(x*) is written explicitly in terms of meaningful quantities: δ_{xx′} g(x′) and δ_{x* x₁} U_{x₁,x₂,x₃,x₄}. Equations 13, 14 are valid for any weakly non-Gaussian process, including ones related to CNNs (where N corresponds to the number of channels). They can also be systematically extended to smaller values of n by taking into account higher order terms in 1/n, as in Cohen et al. (2019).

At N → ∞, we obtain the standard EK result, Eq. 13. It is essentially a spectral linear filter which suppresses those features of g that have support on eigenfunctions ψ_i associated with eigenvalues λ_i small relative to σ²/n. We stress that the ψ_i, λ_i's are independent of any particular size-n dataset, but rather are a property of the average dataset. In particular, no computationally costly data-dependent matrix inversion is needed to evaluate Eq. 13. Turning to our FWC result, Eq. 14: first, it depends on g(x) only via the discrepancy operator δ_{xx′}, so these FWCs are proportional to the error of the DNN at N → ∞; in particular, perfect performance at N → ∞ implies no FWC. Second, the DNN's average predictions act on the target function as a linear transformation combined with a cubic non-linearity. Third, for g(x) having support only on some finite set of eigenfunctions ψ_i of K, δ_{xx′} g(x′) would scale as σ²/n at very large n, so the above cubic term loses its explicit dependence on n. The scaling with n of the second term is less obvious, but numerical results suggest that δ_{x₂x₃} also scales as σ²/n, so that the whole expression inside the {⋯} has no scaling with n. In addition, some decreasing behavior with n is expected due to the δ_{x* x₁} U_{x₁,x₂,x₃,x₄} factor, which can be viewed as the discrepancy in predicting U_{x,x₂,x₃,x₄}, at fixed x₂, x₃, x₄, based on n random samples (x_α's) of U_{x_α,x₂,x₃,x₄}. In Fig. 1 we illustrate this behavior at large n, and also find that for small n the FWC is small but increasing with n, implying that at large N, FWCs are only important at intermediate values of n.
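In the kernel eigenbasis, the EK prediction of Eq. 13 is a diagonal filter: each eigen-coefficient of the target g is multiplied by λ_i/(λ_i + σ²/n). A minimal sketch with an assumed toy spectrum:

```python
import numpy as np

def ek_prediction(lams, g_coeffs, sigma2, n):
    """Equivalent Kernel filter of Eq. (13) in the kernel eigenbasis:
    if g = sum_i g_i psi_i, then f_EK = sum_i [lam_i/(lam_i + sigma2/n)] g_i psi_i,
    i.e. each eigen-coefficient is simply rescaled."""
    lams = np.asarray(lams, float)
    return lams / (lams + sigma2 / n) * np.asarray(g_coeffs, float)

lams = np.array([1.0, 1e-1, 1e-6])    # assumed toy kernel spectrum
g = np.array([1.0, 1.0, 1.0])         # target coefficients in the eigenbasis
sigma2, n = 0.1, 1000                 # so sigma2/n = 1e-4
f_ek = ek_prediction(lams, g, sigma2, n)
# Modes with lam_i >> sigma2/n pass through almost unchanged,
# while modes with lam_i << sigma2/n are filtered out.
```

Increasing n moves the threshold σ²/n downward, which is how the EK error (and with it, via the discrepancy operator, the FWC of Eq. 14) decays with n.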

4. NUMERICAL EXPERIMENTS

In this section we numerically test our analytical results. We first demonstrate that in the limit N → ∞ the outputs of an FCN trained in the regime of the NNSP correspondence converge to a GP with a known kernel, and that the MSE between them scales as ∼ 1/N², which is the scaling of the leading FWC squared. Second, we show that introducing the leading FWC term N⁻¹ f̄_U(x*), Eq. 11, further reduces this MSE by more than an order of magnitude. Third, we study the performance gap between finite CNNs and their corresponding NNGPs on CIFAR-10.

4.1. TOY EXAMPLE: FULLY CONNECTED NETWORKS ON SYNTHETIC DATA

We trained a 2-layer FCN f(x) = Σ_{i=1}^N a_i φ(w^{(i)} · x) on a quadratic target y(x) = xᵀAx, where the x's are sampled with a uniform measure from the hyper-sphere S^{d−1}(√d); see App. G.1 for more details. Our settings are such that there are not enough training points to fully learn the target: Fig. 2a shows that the time-averaged output (after reaching equilibrium) f̄_DNN(x*) is much closer to the GP prediction f̄_GP(x*) than to the ground truth y*. Otherwise, the convergence of the network output to the corresponding NNGP as N grows (shown in Fig. 2c) would be trivial, since all reasonable estimators would be close to the target and hence close to each other. In Fig. 2c we plot in log-log scale (with base 10) the MSE (normalized by (f̄_DNN)²) between the predictions of the network f̄_DNN and the corresponding GP and FWC predictions, for quadratic and ReLU activations. We find that indeed for sufficiently large widths (N ≳ 500) the slope of the GP-DNN MSE approaches −2.0 (for both ReLU and quadratic activations), which is expected from our theory, since the leading FWC scales as 1/N. For smaller widths, higher order terms (in 1/N) in the Edgeworth series, Eq. 7, come into play. For quadratic activation, we find that our FWC result further reduces the MSE by more than an order of magnitude relative to the GP theory. We recognize a regime where the GP and FWC MSEs intersect at N ≈ 100, below which our FWC actually increases the MSE, which suggests how large N needs to be for our leading FWC theory to hold.

4.2. PERFORMANCE GAP BETWEEN FINITE CNNS AND THEIR NNGP

Several papers have shown that the performance on image classification tasks of SGD-trained finite CNNs can surpass that of the corresponding GPs, be it NTK (Arora et al., 2019) or NNGP (Novak et al., 2018). More recently, Lee et al. (2020) emphasized that this performance gap depends on the procedure used to collapse the spatial dimensions of image-shaped data before the final readout layer: flattening the image into a one-dimensional vector (CNN-VEC) or applying global average pooling to the spatial dimensions (CNN-GAP). It was observed that while infinite FCN and CNN-VEC outperform their respective finite networks, infinite CNN-GAP networks under-perform their finite-width counterparts, i.e. there exists a finite optimal width. One notable margin, of about 15% accuracy on CIFAR-10, was shown in Novak et al. (2018) for the case of CNN-GAP. It was further pointed out there that the NNGPs associated with CNN-VEC coincide with those of the corresponding Locally Connected Networks (LCNs), namely CNNs without weight sharing between spatial locations. Furthermore, the performance of SGD-trained LCNs was found to be on par with that of their NNGPs. We argue that our framework can account for this observation. The priors P_0[f] of an LCN and a CNN-VEC agree on their second cumulant (the covariance), which is the only cumulant that does not vanish as N → ∞, but they need not agree on their higher order cumulants, which come into play at finite N. In App. I we show that U, appearing in our leading FWC, already differentiates between CNNs and LCNs. Common practice strongly suggests that the prior over functions induced by CNNs is better suited than that of LCNs for classification of natural images. As a result, we expect that the test loss of a finite-width CNN trained using our protocol will initially decrease with N but then increase beyond some optimal width N_opt, tending towards the loss of the corresponding GP as N → ∞.
This is in contrast to the SGD behavior reported in some works, where the CNN performance seems to saturate as a function of N to some value better than the NNGP (Novak et al., 2018; Neyshabur et al., 2018). Notably, those works used maxima over architecture scans, high learning rates, and early stopping, all of which are absent from our training protocol. To test the above conjecture we trained, according to our protocol, a CNN with six convolutional layers and two fully connected layers on CIFAR-10, and used CNN-VEC for the readout. We used the MSE loss with a one-hot encoding of the categorical label into a 10-dimensional vector; further details and additional settings are given in App. G. Fig. 3 demonstrates that, using our training protocol, a finite CNN can outperform its corresponding GP, and approaches its GP as the number of channels increases. This phenomenon was observed in previous studies under realistic training settings (Novak et al., 2018), and here we show that it appears also under our training protocol. We note that a similar yet more pronounced trend in performance appears here also when one considers the averaged MSE loss rather than the MSE loss of the averaged outputs.

5. CONCLUSION

In this work we presented a correspondence between finite-width DNNs trained using Langevin dynamics (i.e. with small learning rates, weight decay and noisy gradients) and inference on a stochastic process (the NNSP), which approaches the NNGP as N → ∞. We derived finite width corrections that improve upon the accuracy of the NNGP approximation for predicting the DNN outputs on unseen test points, as well as the expected fluctuations around these. In the limit of a large number of training points, n → ∞, explicit expressions for the DNNs' outputs were given, involving no costly matrix inversions. In this regime, the FWC can be written in terms of the discrepancy of the GP predictions, so that when the GP has a small test error the FWC will be small, and vice versa. In the small n regime, the FWC is small but grows with n, which implies that at large N, FWCs are only important at intermediate values of n. For no-pooling CNNs, we build on an observation made by Novak et al. (2018) that finite CNNs outperform their corresponding NNGPs, and show that this is because the leading FWCs reflect the weight-sharing property of CNNs, which is ignored at the level of the NNGP. This constitutes one real-world example where the FWC is well suited to the structure of the data distribution, and thus improves performance relative to the corresponding GP. In a future study, it would be interesting to consider well-controlled toy models that can elucidate under what conditions on the architecture and data distribution the FWC improves performance relative to the GP.

A EDGEWORTH SERIES

The Central Limit Theorem (CLT) tells us that the distribution of a sum of $N$ independent RVs tends to a Gaussian as $N \to \infty$. Its relevance for wide fully-connected DNNs (or CNNs with many channels) comes from the fact that every pre-activation averages over $N$ uncorrelated random variables, thereby generating a Gaussian distribution at large $N$ (Cho & Saul, 2009), augmented by higher order cumulants which decay as $1/N^{r/2-1}$, where $r$ is the order of the cumulant. When higher order cumulants are small, an Edgeworth series (see e.g. Mccullagh (2017); Sellentin et al. (2017)) is a useful practical tool for obtaining the probability distribution from these cumulants. Having the probability distribution and interpreting its logarithm as our action places us closer to the standard field-theory formalism. For simplicity we focus on a 2-layer network, but the derivation generalizes straightforwardly to networks of any depth. We are interested in the finite-$N$ corrections to the prior distribution $P_0[f]$, i.e. the distribution of the DNN output $f(x) = \sum_{i=1}^{N} a_i \phi(w_i^T x)$, with $a_i \sim \mathcal{N}(0, \varsigma_a^2/N)$ and $w_i \sim \mathcal{N}(0, (\varsigma_w^2/d) I)$. Because $a$ has zero mean and a variance that scales as $1/N$, all odd cumulants are zero and the $2r$'th cumulant scales as $1/N^{r-1}$. This holds true for any DNN having a fully-connected last layer with variance scaling as $1/N$. The derivation of the multivariate Edgeworth series can be found in e.g. Mccullagh (2017); Sellentin et al. (2017); our case is similar, except that instead of a vector-valued RV we have the functional RV $f(x)$, so the cumulants become "functional tensors", i.e. multivariate functions of the input $x$. Thus, the leading FWC to the prior $P_0[f]$ is
$$
P_0[f] = \frac{1}{Z} e^{-S_{\mathrm{GP}}[f]} \left[ 1 + \frac{1}{4!} \int d\mu(x_1) \cdots d\mu(x_4)\, U(x_1, x_2, x_3, x_4)\, H[f; x_1, x_2, x_3, x_4] \right] + O(1/N^2) \quad (A.1)
$$
where $S_{\mathrm{GP}}[f]$ is as in the main text Eq.
8, and the 4th Hermite functional tensor is
$$
H[f; x_1, \dots, x_4] = \int d\mu(x_1') \cdots d\mu(x_4')\, K^{-1}(x_1, x_1') \cdots K^{-1}(x_4, x_4')\, f(x_1') \cdots f(x_4')
$$
$$
-\, K^{-1}(x_\alpha, x_\beta) \int d\mu(x_\mu')\, d\mu(x_\nu')\, K^{-1}(x_\mu, x_\mu') K^{-1}(x_\nu, x_\nu')\, f(x_\mu') f(x_\nu')\,[6] \; + \; K^{-1}(x_\alpha, x_\beta) K^{-1}(x_\mu, x_\nu)\,[3] \quad (A.2)
$$
where by the integers in $[\cdot]$ we mean the sum over all possible index combinations of this form, e.g.
$$
K^{-1}_{\alpha\beta} K^{-1}_{\mu\nu}\,[3] = K^{-1}_{12} K^{-1}_{34} + K^{-1}_{13} K^{-1}_{24} + K^{-1}_{14} K^{-1}_{23} \quad (A.3)
$$
$H[f]$ is the functional analogue of the fourth Hermite polynomial $H_4(x) = x^4 - 6x^2 + 3$, which appears in the scalar Edgeworth series expanded about a standard Gaussian.
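As a concrete scalar sanity check (ours, not part of the derivation above), Gauss-Hermite quadrature for the probabilists' weight verifies the defining properties of $H_4$ under a standard Gaussian: zero mean, squared norm $4! = 24$, and the vanishing fourth cumulant of a Gaussian.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes/weights for the probabilists' weight exp(-x**2 / 2):
# E[g(Z)] for Z ~ N(0, 1) is sum(w * g(x)) / sqrt(2 * pi).
x, w = hermegauss(20)
E = lambda g: np.sum(w * g(x)) / np.sqrt(2.0 * np.pi)

He4 = lambda z: z**4 - 6.0 * z**2 + 3.0  # fourth probabilists' Hermite polynomial

print(E(He4))                   # orthogonal to 1 under the Gaussian measure
print(E(lambda z: He4(z)**2))   # squared norm: 4! = 24
print(E(lambda z: z**4) - 3.0)  # fourth cumulant of a Gaussian vanishes
```

With 20 nodes the quadrature is exact for polynomials up to degree 39, so all three checks hold to floating-point precision.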

B FIRST ORDER CORRECTION TO POSTERIOR MEAN AND VARIANCE

B.1 POSTERIOR MEAN

The posterior mean with the leading FWC action is given by
$$
\langle f(x_*) \rangle = \frac{\int Df\, e^{-S[f]} f(x_*)}{\int Df\, e^{-S[f]}} + O(1/N^2) \quad (B.1)
$$
where
$$
S[f] = S_{\mathrm{GP}}[f] + S_{\mathrm{Data}}[f] + S_U[f]; \qquad S_{\mathrm{Data}}[f] = \frac{1}{2\sigma^2} \sum_{\alpha=1}^{n} \left( f(x_\alpha) - y_\alpha \right)^2 \quad (B.2)
$$
where the $O(1/N^2)$ implies that we only treat the first order Taylor expansion of $S[f]$, and where $S_{\mathrm{GP}}[f]$, $S_U[f]$ are as in the main text Eq. 8. The general strategy is to bring the path integral $\int Df$ to the front, so that we are left with correlation functions $\langle \cdots \rangle_0$ w.r.t. the Gaussian theory (including the data term $S_{\mathrm{Data}}[f]$), namely the well known results (Rasmussen & Williams, 2005) for $\bar{f}_{\mathrm{GP}}(x_*) = \langle f(x_*) \rangle_0$ and $\Sigma_{\mathrm{GP}}(x_*) = \langle (\delta f(x_*))^2 \rangle_0$, and then finally perform the integrals over input space. Expanding both the numerator and the denominator of Eq. B.1, the leading finite width correction to the posterior mean reads
$$
\bar{f}_U(x_*) = \frac{1}{4!} \int d\mu_{1:4}\, U(x_1, x_2, x_3, x_4) \left[ \langle f(x_*)\, H[f] \rangle_0 - \langle f(x_*) \rangle_0 \langle H[f] \rangle_0 \right] \quad (B.3)
$$
This, as is standard in field theory, amounts to omitting all terms corresponding to bubble diagrams: we keep only terms with a factor of $\langle f(x_*) f(x_\alpha) \rangle_0$ and ignore terms with a factor of $\langle f(x_*) \rangle_0$, since these cancel out. This is a standard result in perturbative field theory (see e.g. Zee (2003)). We now write down the contributions of the quartic, quadratic and constant terms in $H[f]$:

1. For the quartic term in $H[f]$, we have
$$
\langle f(x_*) f(x_1) f(x_2) f(x_3) f(x_4) \rangle_0 - \langle f(x_*) \rangle_0 \langle f(x_1) f(x_2) f(x_3) f(x_4) \rangle_0 = \Sigma(x_*, x_\alpha) \Sigma(x_\beta, x_\gamma) \langle f(x_\delta) \rangle_0\,[12] + \Sigma(x_*, x_\alpha) \langle f(x_\beta) \rangle_0 \langle f(x_\gamma) \rangle_0 \langle f(x_\delta) \rangle_0\,[4] \quad (B.4)
$$
We dub these terms $\bar{f}\Sigma\Sigma_*$ and $\bar{f}\bar{f}\bar{f}\Sigma_*$, to be referenced shortly. We mention here that they are the source of the linear and cubic terms in the target $y$ appearing in Eq. 11 in the main text.

2. For the quadratic term in $H[f]$, we have
$$
\langle f(x_*) f(x_\mu) f(x_\nu) \rangle_0 - \langle f(x_*) \rangle_0 \langle f(x_\mu) f(x_\nu) \rangle_0 = \Sigma(x_*, x_\mu) \langle f(x_\nu) \rangle_0\,[2] \quad (B.5)
$$
We note in passing that these cancel out exactly against similar but opposite-sign terms/diagrams in the quartic contribution, which is a reflection of measure invariance. This is elaborated on in §B.3.

3. For the constant terms in $H[f]$, we are left only with bubble-diagram terms $\propto \int Df\, f(x_*)$, which cancel out at leading order in $1/N$.
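The Gaussian averages $\langle \cdots \rangle_0$ entering these expressions are the standard GP posterior mean and covariance (Rasmussen & Williams, 2005). A minimal numpy sketch, with an RBF kernel chosen purely for illustration (the paper's kernels are the network-induced ones):

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    # Squared-exponential kernel on 1D inputs, for illustration only.
    d2 = (X1[:, None] - X2[None, :])**2
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(Xtr, y, Xte, sigma2, ell=1.0):
    """Standard GP regression:
    mean = K_*a (K + sigma2 I)^-1 y
    cov  = K_** - K_*a (K + sigma2 I)^-1 K_a*"""
    K = rbf(Xtr, Xtr, ell) + sigma2 * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr, ell)
    mean = Ks @ np.linalg.solve(K, y)
    cov = rbf(Xte, Xte, ell) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

Xtr = np.linspace(-2, 2, 8)
y = np.sin(Xtr)
mean, cov = gp_posterior(Xtr, y, Xtr, sigma2=1e-6)
# With vanishing observation noise the posterior mean interpolates the targets
# and the posterior variance collapses at the training points.
```

These are exactly the quantities $\bar{f}_{\mathrm{GP}}$ and $\Sigma_{\mathrm{GP}}$ that the FWC terms below correct at order $1/N$.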

B.2 POSTERIOR VARIANCE

The posterior variance is given by
$$
\Sigma(x_*) = \langle f(x_*) f(x_*) \rangle - \langle f \rangle^2 = \langle f(x_*) f(x_*) \rangle_0 + \langle f(x_*) f(x_*) \rangle_U - \bar{f}_{\mathrm{GP}}^2 - 2 \bar{f}_{\mathrm{GP}} \bar{f}_U + O(1/N^2) = \Sigma_{\mathrm{GP}}(x_*) + \langle f(x_*) f(x_*) \rangle_U - 2 \bar{f}_{\mathrm{GP}} \bar{f}_U + O(1/N^2) \quad (B.6)
$$
Following similar steps as for the posterior mean, the leading finite width correction to the posterior second moment at $x_*$ reads
$$
\langle f(x_*) f(x_*) \rangle_U = \frac{1}{4!} \int d\mu_{1:4}\, U(x_1, x_2, x_3, x_4) \left[ \langle f(x_*) f(x_*)\, H[f] \rangle_0 - \langle f(x_*) f(x_*) \rangle_0 \langle H[f] \rangle_0 \right] \quad (B.7)
$$
As for the posterior mean, the constant terms in $H[f]$ cancel out, and the contributions of the quartic and quadratic terms are
$$
\text{quartic terms} = \Sigma_{*\alpha} \Sigma_{*\beta} \bar{f}_\gamma \bar{f}_\delta\,[12] + \Sigma_{*\alpha} \Sigma_{*\beta} \Sigma_{\gamma\delta}\,[12] \quad (B.8)
$$
$$
\text{quadratic terms} = \Sigma_{*\mu} \Sigma_{*\nu}\,[2] \quad (B.9)
$$

B.3 MEASURE INVARIANCE OF THE RESULT

The expressions derived above may seem formidable, since they contain many terms and involve integrals over input space which seemingly depend on the measure $\mu(x)$. Here we show how they may in fact be simplified to the compact expressions in the main text Eq. 11, which involve only discrete sums over the training set and no integrals, and are thus manifestly measure-invariant. For simplicity, we show the derivation for the FWC of the mean $\bar{f}_U(x_*)$; a similar derivation can be done for $\Sigma_U(x_*)$. In the following, we carry out the $x$ integrals by plugging in the expressions from Eq. 6 and coupling them to $U$. As in the main text, we use the Einstein summation convention, i.e. repeated indices are summed over the training set. The contribution of the quadratic terms is
$$
A_{\alpha_1, *} \tilde{K}^{-1}_{\alpha_1 \beta_1} y_{\beta_1} - A_{\alpha_1 \alpha_2} \tilde{K}^{-1}_{\alpha_1 \beta_1} \tilde{K}^{-1}_{\alpha_2 \beta_2} y_{\beta_1} K_{\beta_2, *} \quad (B.10)
$$
where we defined
$$
A(x_3, x_4) := \int d\mu(x_1)\, d\mu(x_2)\, U(x_1, x_2, x_3, x_4)\, K^{-1}(x_1, x_2) \quad (B.11)
$$
Fortunately, this seemingly measure-dependent expression cancels out against one of the terms coming from the $\bar{f}\Sigma\Sigma_*$ contribution of the quartic terms in $H[f]$. This is not a coincidence but a general feature of the Hermite polynomials appearing in the Edgeworth series; thus, at any order in $1/N$ of the Edgeworth series we are always left only with measure-invariant terms. Collecting all terms that survive, we have
$$
\frac{1}{4!} \left[ 4\, \tilde{U}_{* \alpha_1 \alpha_2 \alpha_3} \tilde{K}^{-1}_{\alpha_1 \beta_1} \tilde{K}^{-1}_{\alpha_2 \beta_2} \tilde{K}^{-1}_{\alpha_3 \beta_3} y_{\beta_1} y_{\beta_2} y_{\beta_3} - 12\, \tilde{U}_{* \alpha_1 \alpha_2 \alpha_3} \tilde{K}^{-1}_{\alpha_2 \alpha_3} \tilde{K}^{-1}_{\alpha_1 \beta_1} y_{\beta_1} \right] \quad (B.12)
$$
where we defined
$$
\tilde{U}_{* \alpha_1 \alpha_2 \alpha_3} := U_{* \alpha_1 \alpha_2 \alpha_3} - U_{\alpha_1 \alpha_2 \alpha_3 \alpha_4} \tilde{K}^{-1}_{\alpha_4 \beta_4} K_{\beta_4, *} \quad (B.13)
$$
This is a more explicit form of the result reported in the main text, Eq. 11.
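A sketch of how Eqs. B.12-B.13 can be evaluated numerically with einsum contractions. The function name and the random test tensors below are ours, and the linear-in-$y$ term follows our reading of Eq. B.12 as reconstructed above:

```python
import numpy as np

def fwc_mean(U_star, U4, K, Kstar, y, sigma2):
    """Leading 1/N correction to the posterior mean, Eqs. (B.12)-(B.13).
    U_star : (n, n, n)    -- U(x*, x_a, x_b, x_c) on the training set
    U4     : (n, n, n, n) -- U on four training points
    K      : (n, n) kernel on the training set; Kstar : (n,) = K(x_b, x*)."""
    Kt_inv = np.linalg.inv(K + sigma2 * np.eye(len(y)))
    # \tilde U_{*abc} = U_{*abc} - U_{abcd} Kt^{-1}_{de} K_{e*}   (Eq. B.13)
    Ut = U_star - np.einsum('abcd,de,e->abc', U4, Kt_inv, Kstar)
    Ky = Kt_inv @ y
    cubic = np.einsum('abc,a,b,c->', Ut, Ky, Ky, Ky)
    # Linear-in-y term per our reconstruction of Eq. (B.12).
    linear = np.einsum('abc,bc,a->', Ut, Kt_inv, Ky)
    return (4.0 * cubic - 12.0 * linear) / 24.0

# Demo on random data (n = 3 training points, all tensors hypothetical).
rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)
y, Kstar = rng.normal(size=n), rng.normal(size=n)
U4 = rng.normal(size=(n, n, n, n))
U_star = rng.normal(size=(n, n, n))
correction = fwc_mean(U_star, U4, K, Kstar, y, sigma2=0.1)
```

Two structural checks follow directly from the formula: the correction vanishes when $U = 0$, and it is linear in $U$.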

C FINITE WIDTH CORRECTIONS FOR MORE THAN ONE HIDDEN LAYER

For simplicity, consider a fully connected network with two hidden layers, both of width $N$, and no biases; the pre-activations $h(x)$ and output $z(x)$ are then given by
$$
h(x) = \frac{\sigma_{w_2}}{\sqrt{N}} W^{(2)} \phi\!\left( \frac{\sigma_{w_1}}{\sqrt{d}} W^{(1)} x \right), \qquad z(x) = \frac{\sigma_a}{\sqrt{N}}\, a^T \phi(h(x)) \quad (C.1)
$$
We want to find the 2nd and 4th cumulants of $z(x)$. Recall that we found that the leading order Edgeworth expansion for the functional distribution of $h$ is
$$
P_{K,U}[h] \propto e^{-\frac{1}{2} h(x_1) K^{-1}(x_1, x_2) h(x_2)} \left( 1 + \frac{1}{N}\, U(x_1, x_2, x_3, x_4)\, H[h; x_1, x_2, x_3, x_4] \right) \quad (C.2)
$$
where $K^{-1}(x_1, x_2)$ and $U(x_1, x_2, x_3, x_4)$ are known from the previous layer (integration over repeated $x$'s is implied). So we are looking for two maps:
$$
K_\phi(K, U)(x, x') = \left\langle \phi(h(x))\, \phi(h(x')) \right\rangle_{P_{K,U}[h]}, \qquad U_\phi(K, U)(x_1, \dots, x_4) = \left\langle \phi(h(x_1)) \phi(h(x_2)) \phi(h(x_3)) \phi(h(x_4)) \right\rangle_{P_{K,U}[h]} \quad (C.3)
$$
so that the mapping between the first two cumulants $K$ and $U$ of two consecutive layers is (assuming no biases)
$$
\frac{K^{(\ell+1)}(x, x')}{\sigma^2_{w^{(\ell+1)}}} = K_\phi\!\left( K^{(\ell)}, U^{(\ell)} \right)(x, x')
$$
$$
\frac{U^{(\ell+1)}(x_1, \dots, x_4)}{\sigma^4_{w^{(\ell+1)}}} = U_\phi\!\left( K^{(\ell)}, U^{(\ell)} \right)(x_1, \dots, x_4) - K_\phi\!\left( K^{(\ell)}, U^{(\ell)} \right)(x_{\alpha_1}, x_{\alpha_2})\, K_\phi\!\left( K^{(\ell)}, U^{(\ell)} \right)(x_{\alpha_3}, x_{\alpha_4})\,[3] \quad (C.4)
$$
where the starting point is the first layer ($N^{(0)} \equiv d$):
$$
K^{(1)}(x, x') = \frac{\sigma^2_{w^{(1)}}}{N^{(0)}}\, x \cdot x', \qquad U^{(1)}(x_1, x_2, x_3, x_4) = 0 \quad (C.5)
$$
The important point to note is that these functional integrals reduce to ordinary finite dimensional integrals.
For example, for the second layer, denote
$$
h := \begin{pmatrix} h_1 \\ h_2 \end{pmatrix}, \qquad K_{(1)} = \begin{pmatrix} K^{(1)}(x_1, x_1) & K^{(1)}(x_1, x_2) \\ K^{(1)}(x_1, x_2) & K^{(1)}(x_2, x_2) \end{pmatrix} \quad (C.6)
$$
so that we find for $K^{(2)}$
$$
\frac{K^{(2)}(x_1, x_2)}{\sigma^2_{w^{(2)}}} = \frac{1}{\sqrt{(2\pi)^2 \det K_{(1)}}} \int dh\, e^{-\frac{1}{2} h^T K_{(1)}^{-1} h}\, \phi(h_1) \phi(h_2) \quad (C.7)
$$
and for $U^{(2)}$ we denote
$$
h := \begin{pmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{pmatrix}, \qquad K_{(1)} = \begin{pmatrix} K^{(1)}(x_1, x_1) & K^{(1)}(x_1, x_2) & K^{(1)}(x_1, x_3) & K^{(1)}(x_1, x_4) \\ K^{(1)}(x_1, x_2) & K^{(1)}(x_2, x_2) & K^{(1)}(x_2, x_3) & K^{(1)}(x_2, x_4) \\ K^{(1)}(x_1, x_3) & K^{(1)}(x_2, x_3) & K^{(1)}(x_3, x_3) & K^{(1)}(x_3, x_4) \\ K^{(1)}(x_1, x_4) & K^{(1)}(x_2, x_4) & K^{(1)}(x_3, x_4) & K^{(1)}(x_4, x_4) \end{pmatrix} \quad (C.8)
$$
so that
$$
U_\phi\!\left( K^{(1)}, U^{(1)} \right)(x_1, x_2, x_3, x_4) = \frac{1}{\sqrt{(2\pi)^4 \det K_{(1)}}} \int dh\, e^{-\frac{1}{2} h^T K_{(1)}^{-1} h}\, \phi(h_1) \phi(h_2) \phi(h_3) \phi(h_4) \quad (C.9)
$$
This iterative process can be repeated for an arbitrary number of layers.
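For instance, at $U = 0$ the map $K_\phi$ of Eq. C.7 is an ordinary 2D Gaussian expectation, which for $\phi = \mathrm{ReLU}$ has the closed form of Cho & Saul (2009). A Monte Carlo sketch (sample sizes and function names are ours):

```python
import numpy as np

def relu_kernel_exact(K11, K12, K22):
    # Arc-cosine kernel of order 1 (Cho & Saul, 2009) for phi = ReLU:
    # E[phi(h1) phi(h2)] = sqrt(K11 K22) / (2 pi) * (sin t + (pi - t) cos t).
    c = np.clip(K12 / np.sqrt(K11 * K22), -1.0, 1.0)
    theta = np.arccos(c)
    return np.sqrt(K11 * K22) / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * c)

def relu_kernel_mc(K11, K12, K22, n=400_000, seed=0):
    # Monte Carlo evaluation of the 2D Gaussian expectation in Eq. (C.7) at U = 0.
    rng = np.random.default_rng(seed)
    cov = np.array([[K11, K12], [K12, K22]])
    h = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    phi = np.maximum(h, 0.0)
    return (phi[:, 0] * phi[:, 1]).mean()

print(relu_kernel_exact(1.0, 0.3, 1.0), relu_kernel_mc(1.0, 0.3, 1.0))
```

The same sampling scheme, with four correlated Gaussians, evaluates the 4D integral of Eq. C.9 for $U_\phi$.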

D FOURTH CUMULANT FOR THRESHOLD POWER-LAW ACTIVATION FUNCTIONS

D.1 FOURTH CUMULANT FOR RELU ACTIVATION FUNCTION

The $U$'s appearing in our FWC results can be derived for several activation functions; in our numerical experiments we use a quadratic activation $\phi(z) = z^2$ and ReLU. Here we give the result for ReLU, which is similar for any other threshold power-law activation (see derivation in App. D.2), and give the result for quadratic activation in App. E. For simplicity, in this section we focus on the case of a 2-layer FCN with no biases, input dimension $d$ and $N$ neurons in the hidden layer, such that $\phi^i_\alpha := \phi(w^{(i)} \cdot x_\alpha)$ is the activation of the $i$th hidden unit, with the input $x_\alpha$ sampled with a uniform measure from $S^{d-1}(\sqrt{d})$, and where $w^{(i)}$ is a vector of weights of the first layer. This can be generalized to the more realistic settings of deeper nets and un-normalized inputs: in the former case the linear kernel $L$ is replaced by the kernel of the layer preceding the output, and the latter amounts to introducing some scaling factors. For $\phi = \mathrm{ReLU}$, Cho & Saul (2009) give a closed form expression for the kernel which corresponds to the GP. Here we find the $U$ corresponding to the leading FWC by first finding the fourth moment of the hidden layer $\mu_4 := \langle \phi_1 \phi_2 \phi_3 \phi_4 \rangle$ (see Eq. 9), taking for simplicity $\varsigma_w^2 = 1$:
$$
\mu_4 = \frac{\sqrt{\det(L^{-1})}}{(2\pi)^2} \int_0^\infty dz\, e^{-\frac{1}{2} z^T L^{-1} z}\, z_1 z_2 z_3 z_4 \quad (D.1)
$$
where $L^{-1}$ is the matrix inverse of the $4 \times 4$ matrix with elements $L_{\alpha\beta} = (x_\alpha \cdot x_\beta)/d$, which is the kernel of the previous layer (the linear kernel in the 2-layer case) evaluated on pairs of data points. In App. D.2 we follow the derivation in Moran (1948), which yields (with a slight modification noted therein) the following series in the off-diagonal elements of the matrix $L$:
$$
\mu_4 = \sum_{\ell, m, n, p, q, r = 0}^{\infty} A_{\ell m n p q r}\, L_{12}^{\ell} L_{13}^{m} L_{14}^{n} L_{23}^{p} L_{24}^{q} L_{34}^{r} \quad (D.2)
$$
where the coefficients $A_{\ell m n p q r}$ are
$$
A_{\ell m n p q r} = \frac{(-)^{\ell+m+n+p+q+r}\, G_{\ell+m+n}\, G_{\ell+p+q}\, G_{m+p+r}\, G_{n+q+r}}{\ell!\, m!\, n!\, p!\, q!\, r!} \quad (D.3)
$$
For ReLU activation, these $G$'s read
$$
G^{\mathrm{ReLU}}_s = \begin{cases} \frac{1}{\sqrt{2\pi}} & s = 0 \\ -\frac{i}{2} & s = 1 \\ 0 & s \ge 3 \text{ and odd} \\ \frac{(-)^k (2k)!}{\sqrt{2\pi}\, 2^k k!} & s = 2k+2, \; k = 0, 1, 2, \dots \end{cases} \quad (D.4)
$$
and similar expressions can be derived for other threshold power-law activations of the form $\phi(z) = \Theta(z) z^\nu$. The series Eq. D.2 is expected to converge for sufficiently large input dimension $d$, since the overlap between random normalized inputs scales as $O(1/\sqrt{d})$ and consequently $L(x, x') \sim O(1/\sqrt{d})$ for two random points from the data set. However, when we sum over $U_{\alpha_1 \dots \alpha_4}$ we also encounter terms with repeating indices, for which some $L_{\alpha\beta}$'s are equal to 1. The above Taylor expansion diverges whenever the $4 \times 4$ matrix $L_{\alpha\beta} - \delta_{\alpha\beta}$ has eigenvalues larger than 1. Notably, this divergence does not reflect a true divergence of $U$, but rather the failure of representing it by the above expansion. Therefore, at large $n$, one can opt to neglect the elements of $U$ with repeating indices, since there are far fewer of these. Alternatively, this can be dealt with by a re-parameterization of the $z$'s, leading to a similar but slightly more involved Taylor series.

D.2 DERIVATION OF THE PREVIOUS SUBSECTION

In this section we derive the expression for the fourth moment $\langle f_1 f_2 f_3 f_4 \rangle$ of a two-layer fully connected network with threshold power-law activations with exponent $\nu$: $\phi(z) = \Theta(z) z^\nu$; $\nu = 0$ corresponds to a step function, $\nu = 1$ to ReLU, $\nu = 2$ to ReQU (rectified quadratic unit), and so forth. When the inputs are normalized to lie on the hypersphere, the matrix $L$ is
$$
L = \begin{pmatrix} 1 & L_{12} & L_{13} & L_{14} \\ L_{12} & 1 & L_{23} & L_{24} \\ L_{13} & L_{23} & 1 & L_{34} \\ L_{14} & L_{24} & L_{34} & 1 \end{pmatrix} \quad (D.5)
$$
where the off-diagonal elements are $L_{\alpha\beta} = O(1/\sqrt{d})$. We follow the derivation in Moran (1948), which computes the probability mass of the positive orthant for a quadrivariate Gaussian distribution with covariance matrix $L$:
$$
P_+ = \frac{\sqrt{\det(L^{-1})}}{(2\pi)^2} \int_0^\infty dz\, e^{-\frac{1}{2} z^T L^{-1} z} \quad (D.6)
$$
The characteristic function (Fourier transform) of this distribution is
$$
\varphi(t_1, t_2, t_3, t_4) = \exp\!\left( -\frac{1}{2} t^T L t \right) = \exp\!\left( -\frac{1}{2} \sum_{\alpha=1}^{4} t_\alpha^2 \right) \exp\!\left( -\sum_{\alpha < \beta} L_{\alpha\beta} t_\alpha t_\beta \right)
$$
$$
= \exp\!\left( -\frac{1}{2} \sum_{\alpha=1}^{4} t_\alpha^2 \right) \sum_{\ell, m, n, p, q, r = 0}^{\infty} \frac{(-)^{\ell+m+n+p+q+r}\, L_{12}^{\ell} L_{13}^{m} L_{14}^{n} L_{23}^{p} L_{24}^{q} L_{34}^{r}}{\ell!\, m!\, n!\, p!\, q!\, r!}\, t_1^{\ell+m+n} t_2^{\ell+p+q} t_3^{m+p+r} t_4^{n+q+r} \quad (D.7)
$$
Performing an inverse Fourier transform, we may now write the positive orthant probability as
$$
P_+ = \frac{1}{(2\pi)^4} \int_{\mathbb{R}_+^4} dz \int_{\mathbb{R}^4} dt\, \varphi(t_1, t_2, t_3, t_4)\, e^{-i \sum_{\alpha=1}^{4} z_\alpha t_\alpha} = \sum_{\ell, m, n, p, q, r = 0}^{\infty} A_{\ell m n p q r}\, L_{12}^{\ell} L_{13}^{m} L_{14}^{n} L_{23}^{p} L_{24}^{q} L_{34}^{r} \quad (D.8)
$$
where the coefficients $A_{\ell m n p q r}$ are
$$
A_{\ell m n p q r} = \frac{(-)^{\ell+m+n+p+q+r}\, G_{\ell+m+n}\, G_{\ell+p+q}\, G_{m+p+r}\, G_{n+q+r}}{\ell!\, m!\, n!\, p!\, q!\, r!} \quad (D.9)
$$
and the one-dimensional integral is
$$
G^{(\nu=0)}_s = \frac{1}{2\pi} \int_0^\infty dz \int_{-\infty}^{\infty} t^s \exp\!\left( -\frac{1}{2} t^2 - i t z \right) dt \quad (D.10)
$$
Evaluating the integral over $t$ gives
$$
G^{(\nu=0)}_s = \frac{1}{(-i)^s (2\pi)^{1/2}} \int_0^\infty \left( \frac{d}{dz} \right)^{\!s} e^{-z^2/2}\, dz \quad (D.11)
$$
and performing the integral over $z$ yields
$$
G^{(\nu=0)}_s = \begin{cases} \frac{1}{2} & s = 0 \\ 0 & s \text{ even and } s \ge 2 \\ \frac{(2k)!}{i (2\pi)^{1/2}\, 2^k k!} & s = 2k+1, \; k = 0, 1, 2, \dots \end{cases} \quad (D.12)
$$
We can now obtain the result for any integer $\nu$ by inserting $z^\nu$ inside the $z$ integral:
$$
G^{(\nu)}_s = \frac{1}{2\pi} \int_0^\infty dz\, z^\nu \int_{-\infty}^{\infty} t^s \exp\!\left( -\frac{1}{2} t^2 - i t z \right) dt = \frac{1}{(-i)^s (2\pi)^{1/2}} \int_0^\infty z^\nu \left( \frac{d}{dz} \right)^{\!s} e^{-z^2/2}\, dz \quad (D.13)
$$
Using integration by parts we arrive at the result Eq. D.4 reported above:
$$
G^{\mathrm{ReLU}}_s = G^{(\nu=1)}_s = \begin{cases} \frac{1}{\sqrt{2\pi}} & s = 0 \\ -\frac{i}{2} & s = 1 \\ 0 & s \ge 3 \text{ and odd} \\ \frac{(-)^k (2k)!}{\sqrt{2\pi}\, 2^k k!} & s = 2k+2, \; k = 0, 1, 2, \dots \end{cases} \quad (D.14)
$$
Similar expressions can be derived for other threshold power-law activations of the form $\phi(z) = \Theta(z) z^\nu$ for arbitrary integer $\nu$. In a more realistic setting the inputs $x$ may not be perfectly normalized, in which case the diagonal elements of $L$ are not unity. This amounts to introducing a scaling factor for each of the four $z$'s; it makes the expressions a little less neat but poses no real obstacle.

E FOURTH CUMULANT FOR QUADRATIC ACTIVATION FUNCTION

For a two-layer network, we may write $U$, the 4th cumulant of the output $f(x) = \sum_{i=1}^{N} a_i \phi(w_i^T x)$ with $a_i \sim \mathcal{N}(0, \varsigma_a^2/N)$ and $w_i \sim \mathcal{N}(0, (\varsigma_w^2/d) I)$, for a general activation function $\phi$ as
$$
U_{\alpha_1, \alpha_2, \alpha_3, \alpha_4} = \frac{\varsigma_a^4}{N} \left[ V_{(\alpha_1, \alpha_2), (\alpha_3, \alpha_4)} + V_{(\alpha_1, \alpha_3), (\alpha_2, \alpha_4)} + V_{(\alpha_1, \alpha_4), (\alpha_2, \alpha_3)} \right] \quad (E.1)
$$
with
$$
V_{(\alpha_1, \alpha_2), (\alpha_3, \alpha_4)} = \left\langle \phi_{\alpha_1} \phi_{\alpha_2} \phi_{\alpha_3} \phi_{\alpha_4} \right\rangle_w - \left\langle \phi_{\alpha_1} \phi_{\alpha_2} \right\rangle_w \left\langle \phi_{\alpha_3} \phi_{\alpha_4} \right\rangle_w \quad (E.2)
$$
For the case of a quadratic activation function $\phi(z) = z^2$, the $V$'s read
$$
V_{(\alpha_1, \alpha_2), (\alpha_3, \alpha_4)} = 2 \left[ L_{11} L_{33} (L_{24})^2 + L_{11} L_{44} (L_{23})^2 + L_{22} L_{33} (L_{14})^2 + L_{22} L_{44} (L_{13})^2 \right]
$$
$$
+\, 4 \left[ (L_{13})^2 (L_{24})^2 + (L_{14})^2 (L_{23})^2 \right] + 8 \left[ L_{11} L_{23} L_{34} L_{24} + L_{22} L_{34} L_{14} L_{13} + L_{33} L_{12} L_{14} L_{24} + L_{44} L_{12} L_{13} L_{23} \right]
$$
$$
+\, 16 \left[ L_{12} L_{13} L_{24} L_{34} + L_{12} L_{14} L_{23} L_{34} + L_{13} L_{14} L_{23} L_{24} \right] \quad (E.3)
$$
where the linear kernel from the first layer is $L(x, x') = \frac{\varsigma_w^2}{d}\, x \cdot x'$. Notice that we distinguish between the scaled and non-scaled variances:
$$
\sigma_a^2 = \frac{\varsigma_a^2}{N}; \qquad \sigma_w^2 = \frac{\varsigma_w^2}{d} \quad (E.4)
$$
These formulae were used when comparing the outputs of the empirical two-layer network with our FWC theory, Eq. 11. One can generalize them straightforwardly to a network with $M$ layers by recursively computing $K^{(M-1)}$, the kernel of the $(M-1)$th layer (see e.g. Cho & Saul (2009)), and replacing $L$ with $K^{(M-1)}$.
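Since $\phi(z) = z^2$ makes all the averages in Eq. E.2 Gaussian moments, Eq. E.3 can be verified exactly against a brute-force application of Isserlis' (Wick's) theorem. The sketch below (0-based indices, random test covariance ours) does so:

```python
import numpy as np

def wick_moment(cov, idx):
    """E[h_{i1} ... h_{i2k}] for zero-mean Gaussians via Isserlis' theorem:
    sum over all perfect pairings of the product of covariances."""
    if not idx:
        return 1.0
    i, rest = idx[0], idx[1:]
    total = 0.0
    for j in range(len(rest)):
        total += cov[i, rest[j]] * wick_moment(cov, rest[:j] + rest[j + 1:])
    return total

def V_closed_form(L):
    # Eq. (E.3) for phi(z) = z**2, pairing (1,2),(3,4); indices 0-based here.
    return (2 * (L[0,0]*L[2,2]*L[1,3]**2 + L[0,0]*L[3,3]*L[1,2]**2
               + L[1,1]*L[2,2]*L[0,3]**2 + L[1,1]*L[3,3]*L[0,2]**2)
          + 4 * (L[0,2]**2*L[1,3]**2 + L[0,3]**2*L[1,2]**2)
          + 8 * (L[0,0]*L[1,2]*L[2,3]*L[1,3] + L[1,1]*L[2,3]*L[0,3]*L[0,2]
               + L[2,2]*L[0,1]*L[0,3]*L[1,3] + L[3,3]*L[0,1]*L[0,2]*L[1,2])
          + 16 * (L[0,1]*L[0,2]*L[1,3]*L[2,3] + L[0,1]*L[0,3]*L[1,2]*L[2,3]
                + L[0,2]*L[0,3]*L[1,2]*L[1,3]))

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
L = B @ B.T                      # a random covariance matrix
# V = <h1^2 h2^2 h3^2 h4^2> - <h1^2 h2^2><h3^2 h4^2>, evaluated exactly by Wick.
V_wick = (wick_moment(L, (0, 0, 1, 1, 2, 2, 3, 3))
          - wick_moment(L, (0, 0, 1, 1)) * wick_moment(L, (2, 2, 3, 3)))
print(V_wick, V_closed_form(L))
```

The 105 pairings of the eighth moment reproduce Eq. E.3 coefficient by coefficient (2, 4, 8 and 16 count the loop, double-edge, triangle and 4-cycle pairings, respectively).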

F AUTO-CORRELATION TIME AND ERGODICITY

As mentioned in the main text, the network outputs $\bar{f}_{\mathrm{DNN}}(x_*)$ are the result of averaging across many realizations (seeds) of the initial conditions and the noisy training dynamics, and across time (epochs) after the training loss levels off. Our NNSP correspondence relies on the fact that our stochastic training dynamics are ergodic, namely that averages across time equal ensemble averages. In fact, for our purposes it suffices that the dynamics are ergodic in the mean, namely that the time-average estimate of the mean obtained from a single sample realization of the process converges, both in the mean and in the mean-square sense, to the ensemble mean:
$$
\lim_{T \to \infty} \mathbb{E}\left[ \langle f_{\mathrm{DNN}}(x_*; t) \rangle_T - \mu(x_*) \right] = 0, \qquad \lim_{T \to \infty} \mathbb{E}\left[ \left( \langle f_{\mathrm{DNN}}(x_*; t) \rangle_T - \mu(x_*) \right)^2 \right] = 0 \quad (F.1)
$$
where $\mu(x_*)$ is the ensemble mean at the test point $x_*$ and the time-average estimate of the mean over a time window $T$ is
$$
\langle f_{\mathrm{DNN}}(x_*; t) \rangle_T := \frac{1}{T} \int_0^T f_{\mathrm{DNN}}(x_*; t)\, dt \approx \frac{1}{T} \sum_{t_j = 0}^{t_j = T} f_{\mathrm{DNN}}(x_*; t_j) \quad (F.2)
$$
This is hard to prove rigorously, but we can perform a numerical consistency check using the following procedure. Consider the time series of the network output on the test point $x_*$ for the $i$'th realization as a row vector, and stack these row vectors for all different realizations into a matrix $F$, such that $F_{ij} = f^{\mathrm{DNN}}_i(x_*; t_j)$. (1) Divide the time series data in the matrix $F$ into non-overlapping sub-matrices, each of dimension $n_{\mathrm{seeds}} \times n_{\mathrm{epochs}}$; (2) for each of these sub-matrices, find $\bar{f}(x_*)$, i.e. the empirical dynamical average across that time window and across the chosen seeds; (3) find the empirical variance $\sigma^2_{\mathrm{emp}}(x_*)$ across these $\bar{f}(x_*)$; and (4) repeat (1)-(3) for other combinations of $n_{\mathrm{epochs}}$ and $n_{\mathrm{seeds}}$. The expected scaling relation and the outcome of this check are given in Eq. F.3 and Fig. F.1.

G.1 FCN EXPERIMENT DETAILS

We trained a 2-layer FCN on a quadratic target $y(x) = x^T A x$, where the $x$'s are sampled with a uniform measure from the hypersphere $S^{d-1}(\sqrt{d})$ with $d = 16$, and the matrix elements are sampled as $A_{ij} \sim \mathcal{N}(0, 1)$ and fixed for all $x$'s. For both activation functions, we used a training noise level of $\sigma^2 = 0.2$, a training set of size $n = 110$, and a weight decay of the first layer $\gamma_w = 0.05$.
Notice that for any activation $\phi$, $K$ scales linearly with $\varsigma_a^2 = \sigma_a^2 N = (T/\gamma_a) N$; thus, in order to keep $K$ constant as we vary $N$, we need to scale the weight decay of the last layer as $\gamma_a \sim O(N)$. This is done in order to keep the prior distribution in accord with the typical values of the target as $N$ varies, so that the comparison is fair. We ran each experiment for $2 \cdot 10^6$ epochs, which includes the time it takes for the training loss to level off, usually on the order of $10^4$ epochs. In the main text we showed GP and FWC results for a learning rate of $dt = 0.001$. Here we report in Fig. G.1 the results using $dt \in \{0.003, 0.001, 0.0005\}$. For a learning rate of $dt = 0.003$ and width $N \ge 1000$ the dynamics become unstable and oscillate strongly, so the general trend is broken, as seen in the blue markers in Fig. G.1; one therefore cannot take $dt$ too large. The dynamics with the two smaller learning rates are stable and converge to very similar values up to the expected statistical error (recall this is a log scale), demonstrating that the learning rate is sufficiently small for the discrete-time dynamics to be a good approximation of the continuous-time dynamics.

G.2 CNN EXPERIMENT DETAILS AND ADDITIONAL SETTINGS

The CNN experiment reported in the main text was carried out as follows. Dataset: in the main text Fig. 3 we used a random sample of 10 training points and 2000 test points from the CIFAR10 dataset; in App. H we report results on 1000 training points and 1000 test points, balanced in terms of labels. To use the MSE loss, the ten categorical labels were one-hot encoded into 10-dimensional vectors. Architecture: we used 6 convolutional layers with ReLU non-linearity, kernels of size 5 × 5, stride 1, no padding, no pooling. The number of input channels was 3 for the input layer and C for the subsequent 5 CNN layers. We then vectorized the outputs of the final convolutional layer and fed them into a ReLU-activated fully-connected layer with 25C outputs, which were fed into a linear layer with 10 outputs corresponding to the ten categories. Training: training was carried out using full-batch SGD (i.e. GD) at varying learning rates around 5 · 10⁻⁴; Gaussian white noise was added to the gradients to generate σ² = 0.2 in the NNGP correspondence, together with layer-dependent weight decay and bias decay, implying a (width-normalized) weight variance and bias variance of σ²_w = 2 and σ²_b = 1, respectively, when trained with no data. During training we saved, every 1000 epochs, the outputs of the CNN on every test point. We note in passing that the standard deviation of the test outputs around their training-time-averaged value was about 0.1 per CNN output. Training was carried out for around half a million epochs, which enabled us to reach a statistical error of about 2 · 10⁻⁴ in estimating the mean squared discrepancy between the training-time-averaged CNN outputs and our NNGP predictions. Our best agreement between the DNN and the GP occurred at 112 channels, where the MSE was about 7 · 10⁻³. Notably, the variance of the CNN outputs (the average of the squared outputs) with no data was about 25. Statistics:
To train our CNN within the regime of the NNSP correspondence, sufficient training time (namely, epochs) was needed to obtain estimates of the average outputs $\bar{f}_E(x_\alpha) = \bar{f}(x_\alpha) + \delta f_\alpha$, since the estimator fluctuations $\delta f_\alpha$ scale as $(t_{\mathrm{training}}/\tau)^{-1/2}$, where $\tau$ is an auto-correlation time scale. Notably, apart from random noise in estimating the relative MSE between the averaged CNN outputs and the GP, a bias term appears, equal to the variance of $\delta f_\alpha$ averaged over all $\alpha$'s; indeed
$$
\sum_{\alpha=1}^{n_{\mathrm{test}}} \left( \bar{f}_E(x_\alpha) - f_{\mathrm{GP}}(x_\alpha) \right)^2 = \sum_{\alpha=1}^{n_{\mathrm{test}}} \left( \bar{f}(x_\alpha) - f_{\mathrm{GP}}(x_\alpha) \right)^2 + 2 \sum_{\alpha=1}^{n_{\mathrm{test}}} \left( \bar{f}(x_\alpha) - f_{\mathrm{GP}}(x_\alpha) \right) \delta f_\alpha + \sum_{\alpha=1}^{n_{\mathrm{test}}} (\delta f_\alpha)^2 \quad (G.1)
$$
where the cross term averages to zero while the last term remains as a bias. In all our experiments this bias was the dominant source of statistical error. One can estimate it roughly, given the number of uncorrelated samples entering $\bar{f}_E(x_\alpha)$, and correct the estimator; we did not do so in the main text, to keep the data analysis more transparent. Since the relative MSEs go down to $7 \cdot 10^{-3}$ and the fluctuations of the outputs, quantified by $\Sigma_\alpha = \langle (\delta f_\alpha)^2 \rangle$, are of order $0.1^2$, the number of uncorrelated samples of the CNN outputs we require should be much larger than $0.1^2 / (7 \cdot 10^{-3}) \approx 1.43$. To estimate this bias in practice, we repeated the experiment with 3-7 different initialization seeds and deduced the bias from the variance of the results. For comparison with the NNGP (our DNN-GP plots), the error bars are proportional to the variance of $\delta f_\alpha$. For comparison with the target, we took much larger error bars, equal to the uncertainty in estimating the expected loss from a test set of size 1000. These latter error bars were estimated empirically by measuring the variance across ten smaller test sets of size 100. Lastly, we discarded the initial "burn-in" epochs, where the network has not yet reached equilibrium; we took the burn-in time to be the time it takes the training loss to reach within 5% of its stationary value at large times.
We estimated the stationary values by waiting until the DNN's training loss remained constant (up to trends much smaller than the fluctuations) for about $5 \cdot 10^5$ epochs; this also coincided well with having a more or less stationary test loss. Learning rate: to be in the regime of the NNSP correspondence, the learning rate must be taken small enough that the discrepancy resulting from discretization corrections to the continuum Langevin dynamics falls well below that coming from finite width. We find that higher $C$ requires lower learning rates, potentially due to the weight decay term being large at large width. Notably, since we did not have pooling layers, the NNGP kernel could be computed straightforwardly without any approximations; the NNGP predictions were then obtained in a standard manner (Rasmussen & Williams, 2005).
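A toy illustration (synthetic numbers, ours) of the bias term in Eq. G.1 and of the seed-based correction described above: the naive MSE between averaged outputs and GP predictions overshoots by the variance of the averaging estimator, which can be estimated from the seed-to-seed spread of per-seed time averages.

```python
import numpy as np

rng = np.random.default_rng(1)
n_test, n_seeds, n_epochs = 50, 8, 400
f_true = rng.normal(size=n_test)                  # stand-in for the exact NNSP mean
f_gp = f_true + 0.02 * rng.normal(size=n_test)    # stand-in GP predictions
# Simulated training time series: true mean + white output fluctuations.
samples = f_true[None, None, :] + 0.5 * rng.normal(size=(n_seeds, n_epochs, n_test))

f_avg = samples.mean(axis=(0, 1))                 # empirical average of the outputs
naive_mse = np.mean((f_avg - f_gp)**2)
# The estimator bias equals the variance of f_avg, estimated here from the
# seed-to-seed spread of the per-seed time averages.
per_seed = samples.mean(axis=1)
bias = np.mean(per_seed.var(axis=0, ddof=1) / n_seeds)
corrected_mse = naive_mse - bias
true_mse = np.mean((f_true - f_gp)**2)
print(naive_mse, corrected_mse, true_mse)
```

In this toy the theoretical bias is $0.5^2/(n_{\mathrm{seeds}} n_{\mathrm{epochs}})$, and the seed-based estimate recovers it; real output time series are auto-correlated, which inflates the bias by the factor $\tau$ of Eq. F.3.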

H FURTHER NUMERICAL RESULTS ON CNNS

Here we report two additional numerical results following the CNN experiment we carried out (for details see App. G). Concerning the experiment with 10 training points: we used the same CNN as in the previous experiment. The noise level was again the same and led to an effective $\sigma^2 = 0.1$ for the GP. The weight decay on the biases was taken to be ten times larger, leading to $\sigma^2_b = 0.1$ instead of $\sigma^2_b = 1.0$ as before. For $C \le 80$ we used a learning rate of $dt = 5 \cdot 10^{-5}$, after verifying that reducing it further had no appreciable effect; for $C > 80$ we used $dt = 2.5 \cdot 10^{-5}$. For $C \le 80$ we used $6 \cdot 10^5$ training epochs and averaged over 4 different initialization seeds; for $C > 80$ we used between 10 and 16 different initialization seeds. We subtracted the aforementioned statistical bias in estimating the MSE from all our MSEs. This bias, equal to the variance of the averaged outputs, was estimated based on our different seeds. The error bars equal this estimated variance, which was the dominant source of error.

I THE FOURTH CUMULANT CAN DIFFERENTIATE CNNS FROM LCNS

Here we show that while the NNGP kernel $K$ of a CNN without pooling cannot distinguish a CNN from an LCN (a locally connected network, i.e. a CNN without weight sharing), the fourth cumulant $U$ can. For simplicity, consider the simplest CNN without pooling, consisting of the following parts: (1) a 1D image with one color channel, $X_i$, as input, with $i \in \{0, \dots, L-1\}$; (2) a single convolutional layer with some activation $\phi$, acting with stride 1 and no padding, using the conv-kernel $T^c_{\bar{x}}$, where $c \in \{1, \dots, C\}$ is a channel index and $\bar{x} \in \{0, \dots, 2l\}$ is the relative position within the receptive field; and (3) a linear readout with weights $W^o_{c x}$ acting on the activations at channel $c$ and location $x$. Notably, in an LCN the conv-kernel receives an additional dependence on $x$, the location on $X_i$ on which it acts, written $T^c_{\bar{x}}(x)$. The NNGP kernel of an LCN is the same as that of a CNN. This stems from the fact that $\langle W^o_{c x} W^o_{c' x'} \rangle$ yields a Kronecker delta on the $c, x$ indices. Consequently, the difference between LCN and CNN, which amounts to whether $T^c_{\bar{x}}(x)$ is the same (CNN) or a different (LCN) random variable than $T^c_{\bar{x}}(x' \ne x)$, becomes irrelevant, as these two are never averaged together. We therefore turn to the fourth cumulant of the same output, given by
$$
\langle z_o(x_1) \cdots z_o(x_4) \rangle - \langle z_o(x_\alpha) z_o(x_\beta) \rangle \langle z_o(x_\gamma) z_o(x_\delta) \rangle\,[3] = \langle z_o(x_1) \cdots z_o(x_4) \rangle - K(x_\alpha, x_\beta) K(x_\gamma, x_\delta)\,[3] \quad (I.2)
$$
with the second term on the LHS implying all pair-wise averages of $z_o(x_1), \dots, z_o(x_4)$. Note that the first term on the LHS is not directly related to the kernel, and thus has a chance of differentiating a CNN from an LCN. Explicitly, averaging over the readout weights splits it into three types of contributions, according to the channel and location indices carried by the four $W^o$'s. The type 1 contribution (all four $W^o$'s sharing the same channel and location) cannot differentiate an LCN from a CNN since, as in the NNGP case, it only ever involves a single $x$.
The type 2 contribution (two distinct channels) also cannot differentiate, since it yields
$$
\sum_{c \ne c';\, x \ne x'} \left\langle W^o_{c x} W^o_{c x} W^o_{c' x'} W^o_{c' x'} \right\rangle \left\langle \phi\big(T^c_{\bar{x}}(x) X_{x+\bar{x}-l}\big)\, \phi\big(T^c_{\bar{x}}(x) X_{x+\bar{x}-l}\big)\, \phi\big(T^{c'}_{\bar{x}'}(x') X_{x'+\bar{x}'-l}\big)\, \phi\big(T^{c'}_{\bar{x}'}(x') X_{x'+\bar{x}'-l}\big) \right\rangle \quad (I.4)
$$
(the summations over the relative positions $\bar{x}, \bar{x}'$ inside each $\phi$ are kept implicit). Examining the average involving the four $T$'s, one finds that since $T^c_{\bar{x}}(x)$ is uncorrelated with $T^{c'}_{\bar{x}'}(x')$ for both LCNs and CNNs, it splits into
$$
\sum_{c \ne c';\, x \ne x'} \left\langle W^o_{c x} W^o_{c x} W^o_{c' x'} W^o_{c' x'} \right\rangle \left\langle \phi\big(T^c_{\bar{x}}(x) X_{x+\bar{x}-l}\big)\, \phi\big(T^c_{\bar{x}}(x) X_{x+\bar{x}-l}\big) \right\rangle \left\langle \phi\big(T^{c'}_{\bar{x}'}(x') X_{x'+\bar{x}'-l}\big)\, \phi\big(T^{c'}_{\bar{x}'}(x') X_{x'+\bar{x}'-l}\big) \right\rangle \quad (I.5)
$$
where, as in the NNGP, two $T$'s with different $x$ are never averaged together, and we only get a contribution proportional to products of two $K$'s. We note in passing that these type 2 terms yield a contribution that largely cancels that of $K(x_\alpha, x_\beta) K(x_\gamma, x_\delta)\,[3]$, apart from a "diagonal" contribution ($x = x'$). We turn our attention to the type 3 term (one channel, two distinct locations), given by
$$
\sum_{c;\, x \ne x'} \left\langle W^o_{c x} W^o_{c x} W^o_{c x'} W^o_{c x'} \right\rangle \left\langle \phi\big(T^c_{\bar{x}}(x) X_{x+\bar{x}-l}\big)\, \phi\big(T^c_{\bar{x}}(x) X_{x+\bar{x}-l}\big)\, \phi\big(T^c_{\bar{x}'}(x') X_{x'+\bar{x}'-l}\big)\, \phi\big(T^c_{\bar{x}'}(x') X_{x'+\bar{x}'-l}\big) \right\rangle \quad (I.6)
$$
Examining the average involving the four $T$'s, one now finds a sharp difference between an LCN and a CNN. For an LCN, this average would split into a product of two $K$'s, since $T^c_{\bar{x}}(x)$ would be uncorrelated with $T^c_{\bar{x}'}(x')$. For a CNN, however, $T^c_{\bar{x}}(x)$ is the same random variable as $T^c_{\bar{x}}(x')$, and therefore the average does not split, giving rise to a distinct contribution that differentiates a CNN from an LCN. Notably, this contribution is small by a factor of $1/C$, owing to the fact that it contains one redundant summation over a $c$-index while the averages over the four $W$'s contribute a $1/C^2$ factor when properly normalized.
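The sharp difference in the type 3 term can be made quantitative in a toy setting: with a linear activation and a single channel (our simplifying assumptions, not the setting of the experiments), both the kernel and the fourth cumulant of the output have closed forms, and only the latter separates weight sharing from its absence.

```python
import numpy as np

# Toy check that U separates CNNs from LCNs while K does not: one channel,
# linear activation phi(z) = z, readout variance absorbed into sigma_T2.
# Let s := sum_x (T(x) . v_x)^2 with v_x the input patch at location x.
# CNN: one shared T  -> s = T^T M T,  M = sum_x v_x v_x^T  (Gaussian quadratic form)
#      E[s] = sigma_T2 tr(M),  Var[s] = 2 sigma_T2^2 tr(M^2)
# LCN: independent T(x) -> same mean, but Var[s] = 2 sigma_T2^2 sum_x (v_x . v_x)^2
# The fourth cumulant of the output is proportional to Var[s], so it differs
# whenever overlapping patches are correlated, i.e. tr(M^2) > sum_x (v_x . v_x)^2.

X = np.array([0.5, -1.0, 2.0, 1.0, -0.5, 0.3])    # a fixed 1D "image"
ksize, sigma_T2 = 3, 1.0
patches = np.array([X[i:i + ksize] for i in range(len(X) - ksize + 1)])
M = patches.T @ patches

kernel_cnn = sigma_T2 * np.trace(M)
kernel_lcn = sigma_T2 * np.sum(patches**2)        # = sigma_T2 * sum_x |v_x|^2
var_cnn = 2 * sigma_T2**2 * np.trace(M @ M)
var_lcn = 2 * sigma_T2**2 * np.sum((patches**2).sum(axis=1)**2)
print(kernel_cnn, kernel_lcn)   # equal: K cannot tell CNN from LCN
print(var_cnn, var_lcn)         # different: U can
```

The CNN's larger value reflects exactly the unsplit type 3 averages above: overlapping patches share the same $T$, so their contributions to the fourth cumulant are correlated.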

J CORRECTIONS TO EK

Here we derive the finite-$N$ correction to the Equivalent Kernel (EK) result. Using the tools developed by Cohen et al. (2019), the replicated partition function relevant for estimating the predictions of the network, $f(x_*)$, averaged ($\langle \cdots \rangle_n$) over all draws of datasets of size $n$, with $n$ taken from a Poisson distribution with mean $\bar{n}$, is given by
$$
Z_n = \int Df\, e^{-S_{\mathrm{GP}}[f] - \frac{n}{2\sigma^2} \int d\mu_x \left( f(x) - y(x) \right)^2} \left( 1 + S_U[f] \right) + O(1/N^2) \quad (J.1)
$$
with $S_{\mathrm{GP}}[f]$ and $S_U[f]$ given in Eq. 8. We comment that the above expression is only valid for obtaining the leading order asymptotics in $n$; handling generic $n$ requires introducing replicas explicitly (see Cohen et al. (2019)). Notably, the above expression coincides with that used for a finite dataset, with two main differences: all sums over the training set are replaced by integrals with respect to the measure $\mu_x$ from which data points are drawn, and $\sigma^2$ is now accompanied by $n$. Following this, all the diagrammatic and combinatorial aspects shown in the derivation for a finite dataset hold here as well. For instance, let us examine a specific contribution coming from the quartic term in $H[f]$, namely $U_{x_1 \dots x_4} K^{-1}_{x_1 x_1'} \cdots K^{-1}_{x_4 x_4'} f(x_1') \cdots f(x_4')$, and from the diagram/Wick-contraction where we take the expectation value of 3 of the 4 $f$'s in this quartic term, to arrive at an expression which is ultimately cubic in the targets $y$:
$$
U_{x_1, x_2, x_3, x_4}\, K^{-1}_{x_1 x_1'} \langle f(x_1') \rangle_\infty\, K^{-1}_{x_2 x_2'} \langle f(x_2') \rangle_\infty\, K^{-1}_{x_3 x_3'} \langle f(x_3') \rangle_\infty\, K^{-1}_{x_4 x_4'} \Sigma_\infty(x_4', x_*) \quad (J.2)
$$
where we recall that $\langle f(x) \rangle_\infty = K_{x x'} \tilde{K}^{-1}_{x' x''} y(x'')$, that $\Sigma_\infty(x_1, x_2) = K_{x_1, x_2} - K_{x_1, x'} \tilde{K}^{-1}_{x', x''} K_{x'', x_2}$ is the posterior covariance in the EK limit, and that $\tilde{K}_{x x'} f(x') = K_{x x'} f(x') + (\sigma^2/n) f(x)$, with repeated continuum indices integrated over $\mu_x$. Using the fact that $K^{-1}_{x x'} K_{x' x''}$ gives a delta function with respect to the measure $\mu_x$, these expressions collapse to the compact form quoted in the main text. Together with the additional $1/4!$ factor times the combinatorial factor of 4, related to choosing the "partner" of $f(x_*)$ in the Wick contraction, this yields the overall factor of $1/6$ as in the main text, Eq. 14.
The other term therein, which is linear in $y$, results from following similar steps with the $\bar{f}\Sigma\Sigma_*$ contributions that are not canceled by the quadratic part of $H[f]$.



Footnotes: (1) We take the total error, i.e. we do not divide by $n$, so that $L[f]$ becomes more dominant for larger $n$. (2) Here $\sigma^2$ is a property of the training protocol and not of the data itself, or of our prior on it. (3) Here we take $U \sim O(1)$ to emphasize the scaling with $N$ in Eqs. 7, 10.



(a) FWC vs. n for d = 8 (b) FWC vs. n for d = 4 (c) GP RMSE vs. n for d = 4

Figure 1: Leading FWC to the mean $|\bar{f}_U(x_*)|$ (Eq. 11) and GP discrepancy (RMSE) as a function of training set size $n$ for varying training noise $\sigma^2$. The target is quadratic, $g(x) = x^T A x = O(1)$ with $x \in S^{d-1}(\sqrt{d})$, so the number of parameters to be learnt is $d(d+1)/2$ (vertical grey dashed line). The GP discrepancy decreases monotonically with $n$, whereas $|\bar{f}_U(x_*)|$ increases linearly for small $n$ (dashed-dotted lines in (a)) before it decays (best illustrated for larger $d$ and $\sigma^2$). For sufficiently large $n$, both the GP discrepancy and $|\bar{f}_U(x_*)|$ scale as $1/n$ (diagonal dashed black lines in (b), (c)). This verifies our prediction for the scaling of FWCs with $n$ (Eq. 14) in the large-$n$ regime. Notably, it implies that at large $N$, FWCs are only important at intermediate values of $n$.

(a) Predictions (b) Auto-correlation functions (c) MSE scaling

Figure 2: Fully connected 2-layer network trained on a regression task. (a) Network output on a test point, $f(x_*, t)$, vs. normalized time: the time-averaged DNN output $\bar{f}(x_*)$ (dashed line) is much closer to the GP prediction $\bar{f}_{\mathrm{GP}}(x_*)$ (dotted line) than to the ground truth $y_*$ (dashed-dotted line). (b) ACFs of the time series of the 1st- and 2nd-layer weights and of the outputs: the outputs converge to equilibrium faster than the weights. (c) Relative MSE between the network outputs and the labels $y$ (triangles), the GP predictions $\bar{f}_{\mathrm{GP}}(x_*)$ of Eq. 6 (dots), and the FWC predictions of Eq. 10 (x's), shown vs. width for quadratic (blue) and ReLU (red) activations. For sufficiently large widths ($N \gtrsim 500$) the slope of the GP-DNN MSE approaches $-2.0$, and the FWC-DNN MSE improves on it by more than an order of magnitude.

Figure 3: The DNN-GP MSE demonstrates convergence to a slope of $-2.0$, validating the theoretically expected scaling. The DNN-ground-truth ($y$) MSE shows that a finite CNN can outperform the corresponding GP.

the empirical variance $\sigma^2_{\mathrm{emp}}(x_*)$ across these $f(x_*)$; (4) repeat (1)-(3) for other combinations of $n_{\mathrm{epochs}}, n_{\mathrm{seeds}}$. If ergodicity holds, we expect the relation
$$\sigma^2_{\mathrm{emp}}(x_*) = \frac{\sigma^2_m \, \tau}{n_{\mathrm{epochs}} \, n_{\mathrm{seeds}}} \qquad \mathrm{(F.3)}$$
where $\tau$ is the auto-correlation time of the outputs and $\sigma^2_m$ is the macroscopic variance. The results of this procedure are shown in Fig. F.1, where we plot on a log-log scale the empirical variance $\sigma^2_{\mathrm{emp}}$ vs. the number of epochs $n_{\mathrm{epochs}}$ used for time averaging in each set (using all 500 seeds in this case). A linear fit on the average across test points (black x's in the figure) yields a slope of approximately $-1$, which is strong evidence for ergodic dynamics.
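This $1/n_{\mathrm{epochs}}$ scaling can be reproduced on synthetic data. The sketch below is illustrative only: it substitutes an AR(1) surrogate with autocorrelation time $\tau$ for the actual output time series (all function names and parameter values are our own), and estimates the variance of the time-averaged output across independent chains:

```python
import numpy as np

def ar1_ensemble(n_seeds, n_steps, tau, rng):
    # Independent AR(1) chains with autocorrelation time ~tau and unit
    # stationary variance: x_t = a x_{t-1} + sqrt(1 - a^2) eps_t, a = exp(-1/tau).
    a = np.exp(-1.0 / tau)
    noise = rng.normal(size=(n_seeds, n_steps)) * np.sqrt(1.0 - a ** 2)
    x = np.zeros((n_seeds, n_steps))
    x[:, 0] = rng.normal(size=n_seeds)  # start in the stationary state
    for t in range(1, n_steps):
        x[:, t] = a * x[:, t - 1] + noise[:, t]
    return x

def variance_of_time_average(n_epochs, n_seeds=400, tau=10.0, burn=200, seed=0):
    # Empirical variance (across seeds) of the output time-averaged over n_epochs.
    rng = np.random.default_rng(seed)
    x = ar1_ensemble(n_seeds, burn + n_epochs, tau, rng)
    return np.var(x[:, burn:].mean(axis=1))
```

Fitting log-variance against log $n_{\mathrm{epochs}}$, as done for Fig. F.1, gives a slope near $-1$ once $n_{\mathrm{epochs}} \gg \tau$.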

Figure F.1: Ergodicity check. Empirical variance $\sigma^2_{\mathrm{emp}}(x_*)$ vs. the number of epochs used for time averaging, on a (base 10) log-log scale, with $dt = 0.003$ and $N = 200$. The colored circles represent different test points $x_*$ and the black x's are averages across these.

Figure G.1: Regression task with a fully connected network: (un-normalized) MSE vs. width on a log-log scale (base 10) for quadratic activation and different learning rates. The learning rates $dt = 0.001, 0.0005$ converge to very similar values (recall this is a log scale), demonstrating that the learning rate is small enough for the discrete-time dynamics to be a good approximation of the continuous-time dynamics. For a learning rate of $dt = 0.003$ (blue) and width $N \geq 1000$ the dynamics become unstable and the general trend is broken, so $dt$ cannot be taken too large.

In Fig. G.2 we report the relative MSE between the NNGP and the CNN at learning rates of 0.002, 0.001, and 0.0005 with $C = 48$, showing good convergence already at 0.001. Following this, we used learning rates of 0.0005 for $C \leq 48$ and 0.00025 for $C > 48$ in the main figure.

Figure G.2: MSE between our CNN with $C = 48$ and its NNGP as a function of three learning rates.

Comparison with the NNGP. Following Novak et al. (2018), we obtained the kernel of our CNN. Notably, since we did not have pooling layers, this can be done straightforwardly without any approximations. The NNGP predictions were then obtained in the standard manner (Rasmussen & Williams, 2005).
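As an illustration of why the no-pooling case is straightforward, here is a minimal numpy sketch of the NNGP kernel of a toy one-hidden-layer 1D CNN with ReLU activation and a dense readout over the vectorized hidden layer (this toy architecture and all names are our own assumptions; our actual CNN and its kernel follow Novak et al. (2018)). The per-site pre-activation covariances are propagated through the exact ReLU expectation (the degree-1 arc-cosine kernel of Cho & Saul) and then averaged by the readout:

```python
import numpy as np

def relu_expect(c11, c22, c12):
    # Closed-form E[relu(u) relu(v)] for (u, v) ~ N(0, [[c11, c12], [c12, c22]])
    # (degree-1 arc-cosine kernel, Cho & Saul 2009).
    norm = np.sqrt(c11 * c22)
    rho = np.clip(c12 / np.maximum(norm, 1e-12), -1.0, 1.0)
    theta = np.arccos(rho)
    return norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def cnn_nngp_kernel(X, k=3, sw2=2.0, sv2=1.0):
    """NNGP kernel of a toy 1-hidden-layer 1D CNN (ReLU, stride 1, no pooling):
    conv filter entries ~ N(0, sw2/k); dense readout over the vectorized hidden
    layer with weights ~ N(0, sv2/(C*P)), where P is the number of sites."""
    n, L = X.shape
    P = L - k + 1
    K = np.zeros((n, n))
    for p in range(P):
        patches = X[:, p:p + k]                 # (n, k) input patches at site p
        C = (sw2 / k) * patches @ patches.T     # pre-activation covariance
        d = np.diag(C)
        K += relu_expect(d[:, None], d[None, :], C)
    return sv2 * K / P
```

Because the dense readout only ever combines the two inputs at the same spatial site, no cross-site covariances are needed and the kernel is exact, with no Monte Carlo or pooling approximations.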

Fig. H.3b is the same as Fig. H.3a, except that we subtracted our estimate of the statistical bias of our MSE estimator, described in App. G.

Figure H.3: CNNs trained on CIFAR-10 in the regime of the NNSP correspondence, compared with NNGPs. MSE test loss normalized by the target variance for a deep CNN (solid green) and its associated NNGP (dashed green), along with the MSE between the NNGP's predictions and the CNN outputs, normalized by the NNGP's MSE test loss (solid blue, on a different scale). We used balanced training and test sets of size 1000 each. For the largest number of channels we reached, the slope of the discrepancy between the CNN's GP and the trained DNN on the log-log scale was $-1.77$, placing us close to the perturbative regime where a slope of $-2.0$ is expected. Error bars reflect statistical errors related only to output averaging and not to the random choice of test set. The performance deteriorates at large $N = \#\mathrm{channels}$ as the NNSP associated with the CNN approaches an NNGP.

(3) A vectorizing operation taking the $C$ outputs of each convolution around a point $x \in \{l, \ldots, L-l\}$ into a single index $y \in \{0, \ldots, C(L-2l)\}$. (4) A linear fully connected layer with weights $W^o_{c\tilde{x}}$, where $o \in \{0, \ldots, \#\mathrm{outputs}\}$ indexes the outputs.

Consider first the NNGP of such a random DNN with weights chosen according to some i.i.d. Gaussian distribution $P_0(w)$, with $w$ including both $W^o_{c\tilde{x}}$ and $T_{c\bar{x}}$. Denoting by $z_o(x)$ the $o$'th output of the CNN for an input $x$ with components $X_y$, and denoting in this section $\langle \cdots \rangle := \langle \cdots \rangle_{P_0(w)}$, we have
$$K^{oo'}(x, x') \equiv \langle z_o(x) z_{o'}(x') \rangle = \delta_{oo'} \sum_{c,c',\tilde{x},\tilde{x}'} \left\langle W^o_{c\tilde{x}} W^{o}_{c'\tilde{x}'} \, \phi\!\left(T_{c\bar{x}} X_{\tilde{x}+\bar{x}-l}\right) \phi\!\left(T_{c'\bar{x}'} X'_{\tilde{x}'+\bar{x}'-l}\right) \right\rangle \qquad \mathrm{(I.1)}$$
where repeated $\bar{x}$ indices inside each $\phi$ are summed over the filter support.
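Expectations of the form of Eq. I.1 can be checked by Monte Carlo over weight draws. The sketch below (a toy single-site, single-channel version with ReLU; the setup and all names are our own assumptions) compares the empirical average of $\phi(T \cdot \mathrm{patch}(x))\, \phi(T \cdot \mathrm{patch}(x'))$ over filter draws with the closed-form Gaussian expectation:

```python
import numpy as np

def relu_expect(c11, c22, c12):
    # Closed-form E[relu(u) relu(v)] for centered bivariate Gaussians
    # (degree-1 arc-cosine kernel, Cho & Saul 2009).
    norm = np.sqrt(c11 * c22)
    rho = np.clip(c12 / np.maximum(norm, 1e-12), -1.0, 1.0)
    th = np.arccos(rho)
    return norm * (np.sin(th) + (np.pi - th) * np.cos(th)) / (2 * np.pi)

def mc_kernel(x, xp, k=3, sw2=2.0, n_draws=200000, seed=0):
    # Monte Carlo estimate of the phi-phi expectation in Eq. I.1 at a single
    # site (p = 0) and channel: average phi(t . patch(x)) phi(t . patch(x'))
    # over filter draws t ~ N(0, (sw2/k) I).
    rng = np.random.default_rng(seed)
    t = rng.normal(scale=np.sqrt(sw2 / k), size=(n_draws, k))
    return np.mean(np.maximum(t @ x[:k], 0.0) * np.maximum(t @ xp[:k], 0.0))
```

The Monte Carlo error shrinks as $1/\sqrt{n_{\mathrm{draws}}}$; the same procedure with two independent filter draws probes the higher-moment (type 2/3) contractions of Eq. I.3.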

$$\sum_{c_1 .. c_4} \sum_{\tilde{x}_1 .. \tilde{x}_4} \left\langle W^o_{c_1\tilde{x}_1} \cdots W^o_{c_4\tilde{x}_4} \, \phi\!\left(T_{c_1\bar{x}_1} X_{\tilde{x}_1+\bar{x}_1-l}\right) \cdots \phi\!\left(T_{c_4\bar{x}_4} X_{\tilde{x}_4+\bar{x}_4-l}\right) \right\rangle \qquad \mathrm{(I.3)}$$
The average over the four $W$'s yields non-zero terms of the type $\langle W^o_{c\tilde{x}} W^o_{c\tilde{x}} W^o_{c'\tilde{x}'} W^o_{c'\tilde{x}'} \rangle$, with either $\tilde{x} = \tilde{x}'$ and $c = c'$ (type 1), $\tilde{x} = \tilde{x}'$ and $c \neq c'$ (type 2), or $\tilde{x} \neq \tilde{x}'$ and $c \neq c'$ (type 3).

Carrying out the integrals against the $K^{-1}_{x_\alpha x'_\alpha}$'s with respect to the measure yields delta functions, giving
$$\delta_{x_*, x_4} \, U_{x_1,x_2,x_3,x_4} \, \delta_{x_1,x'_1} \delta_{x_2,x'_2} \delta_{x_3,x'_3} \, y(x'_1)\, y(x'_2)\, y(x'_3) \qquad \mathrm{(J.4)}$$

