ON THE EXPLICIT ROLE OF INITIALIZATION ON THE CONVERGENCE AND GENERALIZATION PROPERTIES OF OVERPARAMETRIZED LINEAR NETWORKS

Anonymous

Abstract

Neural networks trained via gradient descent with random initialization and without any regularization enjoy good generalization performance in practice despite being highly overparametrized. A promising direction for explaining this phenomenon is the Neural Tangent Kernel (NTK), which characterizes the implicit regularization effect of gradient flow/descent on infinitely wide neural networks with random initialization. However, a non-asymptotic analysis that connects generalization performance, initialization, and optimization for finite-width networks remains elusive. In this paper, we present a novel analysis of overparametrized single-hidden-layer linear networks that formally connects initialization, optimization, and overparametrization with generalization performance. We exploit the fact that gradient flow preserves a certain matrix that characterizes the imbalance of the network weights to show that the squared loss converges exponentially, at a rate that depends on the level of imbalance of the initialization. Such guarantees on the convergence rate allow us to show that a large hidden-layer width, together with (properly scaled) random initialization, implicitly constrains the dynamics of the network parameters to remain close to a low-dimensional manifold. In turn, minimizing the loss over this manifold leads to solutions with good generalization, which correspond to the min-norm solution in the linear case. Finally, we derive a novel O(h^{-1/2}) upper bound on the operator norm distance between the trained network and the min-norm solution, where h is the hidden-layer width.

1. INTRODUCTION

Neural networks have shown excellent empirical performance in many application domains such as vision (Krizhevsky et al., 2012; Rawat & Wang, 2017), speech (Hinton et al., 2012; Graves et al., 2013), and video games (Silver et al., 2016; Vinyals et al., 2017). Among the many unexplained mysteries behind this success is the fact that gradient descent with random initialization and without explicit regularization enjoys good generalization performance despite being highly overparametrized. A promising attempt to explain this phenomenon is the Neural Tangent Kernel (NTK) (Jacot et al., 2018), which characterizes the implicit regularization effect of gradient flow/descent on infinitely wide neural networks with random initialization. Precisely, under this infinite-width assumption, a proper initialization, together with gradient flow training, can be understood as a kernel gradient flow (NTK flow) of a functional constrained to a manifold that guarantees good generalization performance. The analysis further admits extensions to finite width (Arora et al., 2019b; Buchanan et al., 2020). The core of the argument, illustrated in Figure 1, amounts to showing that (properly scaled) random initialization of networks with sufficiently large width leads to trajectories that are, in some sense, initialized close to the aforementioned manifold. Thus, approximately good initialization, together with acceleration, ensures that the training stays close to the NTK flow. Such analysis, however, leads to bounds on the network width that significantly exceed practical values (Buchanan et al., 2020), and seems to suggest the need for acceleration to achieve generalization.

Figure 1: Comparing our analysis to asymptotic/non-asymptotic NTK analysis.

This motivates the following questions:

• Is the kernel regime, which requires impractical bounds on the network width, necessary to achieve good generalization?

• Does generalization depend explicitly on acceleration?
Or is acceleration required only due to choosing an initialization outside the good-generalization manifold?

For the simplified, yet certainly non-trivial, setting of single-hidden-layer linear networks, this paper answers these questions.

Contributions. We present a novel analysis of the gradient flow dynamics of overparametrized single-hidden-layer linear networks, which provides disentangled conditions on initialization that lead to acceleration and generalization. Specifically, we show that imbalanced initialization ensures acceleration, while orthogonal initialization ensures that trajectories remain close to the generalization manifold. Interestingly, properly scaled random initialization of moderately wide networks is sufficient to ensure that the initialization is approximately imbalanced and orthogonal, yet it is necessary for neither. More specifically, as illustrated in Figure 1, this paper makes the following contributions:

1. We first show that gradient flow on the squared-ℓ2 loss preserves a certain matrix-valued quantity, akin to constants of motion in mechanics or conservation laws in physics, that measures the imbalance of the network weights. Notably, some level of imbalance, measured by a certain singular value of the imbalance matrix and determined at initialization, is sufficient to guarantee an exponential rate of convergence of the loss. Our analysis is non-probabilistic and valid under very mild assumptions, satisfied by moderately wide single-hidden-layer linear networks.

2. We characterize the existence of a low-dimensional manifold, defined by a specific orthogonality condition on the parameters, which is invariant under the gradient flow. All trajectories within this manifold lead to a unique (w.r.t. the end-to-end function) minimizer with good generalization performance, which corresponds to the min-norm solution. As a result, initializing the network within this manifold guarantees good generalization.

3. We further show that by randomly initializing the network weights using N(0, 1/h) (where h is the hidden-layer width), an initialization setting related to the kernel regime (see Appendix E), one can approximately satisfy both our sufficient imbalance and orthogonality conditions with high probability. Notably, the inaccuracy of the initialization relative to the good-generalization manifold requires acceleration to control the generalization error. In the context of the NTK, our result further provides, for linear networks, a novel O(h^{-1/2}) upper bound on the operator norm distance between the trained network and the min-norm solution. To the best of our knowledge, this is the
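The two phenomena above are easy to observe numerically. The sketch below, a minimal illustration rather than the paper's formal setup, trains a single-hidden-layer linear network f(x) = W2 W1 x by plain gradient descent on the squared loss for an underdetermined linear regression problem. It checks (i) that the imbalance matrix D = W1 W1ᵀ − W2ᵀ W2 is (approximately) conserved — exactly so under gradient flow, and up to O(η²) discretization error per step under gradient descent — and (ii) that, with N(0, 1/h) initialization and moderate width, the trained end-to-end map lands close to the min-norm interpolant. All dimensions, the seed, the step size, and the iteration count are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined linear regression: n < d, so many interpolating maps exist.
d, h, m, n = 10, 400, 1, 5        # input dim, hidden width, output dim, samples
X = rng.standard_normal((d, n))
Y = rng.standard_normal((m, n))

# N(0, 1/h) initialization (the kernel-regime scaling discussed above).
W1 = rng.standard_normal((h, d)) / np.sqrt(h)
W2 = rng.standard_normal((m, h)) / np.sqrt(h)

D0 = W1 @ W1.T - W2.T @ W2        # imbalance matrix at initialization

lr = 1e-3
for _ in range(50_000):
    E = Y - W2 @ W1 @ X           # residual of the end-to-end map
    g1 = W2.T @ E @ X.T           # negative gradient of 0.5*||E||_F^2 w.r.t. W1
    g2 = E @ X.T @ W1.T           # negative gradient w.r.t. W2
    W1 += lr * g1
    W2 += lr * g2

D = W1 @ W1.T - W2.T @ W2
theta = W2 @ W1                   # trained end-to-end linear map
theta_star = Y @ np.linalg.pinv(X)  # min-norm interpolant of the data

drift = np.linalg.norm(D - D0) / np.linalg.norm(D0)       # conservation error
fit = np.linalg.norm(Y - theta @ X) / np.linalg.norm(Y)   # training residual
dist = np.linalg.norm(theta - theta_star) / np.linalg.norm(theta_star)
print(drift, fit, dist)
```

In runs of this kind one typically sees the imbalance drift stay negligible, the loss driven essentially to zero, and a distance to the min-norm solution that shrinks as the width h grows, consistent with the O(h^{-1/2}) behavior claimed above.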

