PREDICTING THE OUTPUTS OF FINITE NETWORKS TRAINED WITH NOISY GRADIENTS

Anonymous authors
Paper under double-blind review

Abstract

A recent line of work studied wide deep neural networks (DNNs) by approximating them as Gaussian Processes (GPs). A DNN trained with gradient flow was shown to map to a GP governed by the Neural Tangent Kernel (NTK), whereas earlier works showed that a DNN with an i.i.d. prior over its weights maps to the so-called Neural Network Gaussian Process (NNGP). Here we consider a DNN training protocol, involving noise, weight decay and finite width, whose outcome corresponds to a certain non-Gaussian stochastic process. An analytical framework is then introduced to analyze this non-Gaussian process, whose deviation from a GP is controlled by the finite width. Our contribution is three-fold: (i) In the infinite width limit, we establish a correspondence between DNNs trained with noisy gradients and the NNGP, not the NTK. (ii) We provide a general analytical form for the finite width correction (FWC) for DNNs with arbitrary activation functions and depth, and use it to predict the outputs of empirical finite networks with high accuracy. Analyzing the FWC behavior as a function of n, the training set size, we find that it is negligible both in the very small n regime and, surprisingly, in the large n regime (where the GP error scales as O(1/n)). (iii) We flesh out algebraically how these FWCs can improve the performance of finite convolutional neural networks (CNNs) relative to their GP counterparts on image classification tasks.

1. INTRODUCTION

Deep neural networks (DNNs) have been rapidly advancing the state-of-the-art in machine learning, yet a complete analytic theory remains elusive. Recently, several exact results were obtained in the highly over-parameterized regime (N → ∞, where N denotes the width of fully connected networks (FCNs) or the number of channels of convolutional neural networks (CNNs), respectively) (Daniely et al., 2016). This facilitated the derivation of an exact correspondence with Gaussian Processes (GPs) known as the Neural Tangent Kernel (NTK) (Jacot et al., 2018). The latter holds when highly over-parameterized DNNs are trained by gradient flow, namely with vanishing learning rate and involving no stochasticity. The NTK result has provided the first example of a DNN to GP correspondence valid after end-to-end DNN training. This theoretical breakthrough allows one to think of DNNs as inference problems with underlying GPs (Rasmussen & Williams, 2005). For instance, it provides a quantitative description of the generalization properties (Cohen et al., 2019; Rahaman et al., 2018) and training dynamics (Jacot et al., 2018; Basri et al., 2019) of DNNs. Roughly speaking, highly over-parameterized DNNs generalize well because they have a strong implicit bias towards simple functions, and train well because low-error solutions in weight space can be reached by making a small change to the random values of the weights at initialization. Despite its novelty and importance, the NTK correspondence suffers from a few shortcomings: (a) Its deterministic training is qualitatively different from the stochastic one used in practice, which may lead to poorer performance when combined with a small learning rate (Keskar et al., 2016). (b) It under-performs, often by a large margin, CNNs trained with SGD (Arora et al., 2019).
(c) Deriving explicit finite width corrections (FWCs) is challenging, as it requires solving a set of coupled ODEs (Dyer & Gur-Ari, 2020; Huang & Yau, 2019). Thus, there is a need for an extended theory of end-to-end trained deep networks which is valid for finite width DNNs.

