IS STOCHASTIC GRADIENT DESCENT NEAR OPTIMAL?

Abstract

The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (2022) demonstrate that, when data is generated by a ReLU teacher network with W parameters, an optimal learner needs only Õ(W/ϵ) samples to attain expected error ϵ. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, attaining small error across all such networks at this sample complexity requires intractable computation. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (2022) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error of a stochastic algorithm.

1. INTRODUCTION

Over the past decade, deep neural networks have produced groundbreaking results. To name a few, they have demonstrated impressive performance on visual classification tasks (He et al., 2016), parsing and synthesizing natural language (Devlin et al., 2018; Brown et al., 2020), and super-human performance in various games (Mnih et al., 2013). These achievements establish neural networks as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, when data is generated by a ReLU teacher network with W parameters, Jeon & Van Roy (2022) demonstrate that the sample complexity of an optimal learner is Õ(W). However, existing computational theory suggests that, even for single-hidden-layer teacher networks, the computation required to achieve this sample complexity is intractable. For example, Goel et al. (2020) and Diakonikolas et al. (2020) establish that, for batched stochastic gradient descent with respect to squared or logistic loss to achieve small generalization error for all single-hidden-layer teacher networks, the number of samples or the number of gradient steps must be superpolynomial in the input dimension or network width. Furthermore, current theoretical guarantees for computationally tractable algorithms that fit single-hidden-layer teacher networks with parameters drawn from natural distributions only bound sample complexity by high-order polynomial (Janzamin et al., 2015; Ge et al., 2017) or exponential (Zhong et al., 2017; Fu et al., 2020) functions of input dimension or width.

In this work, we aim to reconcile the gap between these negative theoretical results and the apparent practical success of stochastic gradient descent (SGD) in training performant neural networks. To do so, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution.
We demonstrate that SGD with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width.
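The teacher-student setup described above can be sketched in a few lines. The following is a minimal illustration, not the paper's experimental protocol: the teacher's parameters are drawn from standard Gaussians (one possible "natural" distribution; the paper's exact choice may differ), the student width is fixed rather than automatically selected, and training uses plain single-sample SGD on squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's settings):
d, k, n = 8, 4, 500  # input dimension, teacher width, number of samples

# Teacher: single hidden layer of ReLUs, parameters drawn from Gaussians.
W_teacher = rng.normal(size=(k, d)) / np.sqrt(d)
a_teacher = rng.normal(size=k)

def teacher(X):
    return np.maximum(X @ W_teacher.T, 0.0) @ a_teacher

X = rng.normal(size=(n, d))
y = teacher(X)

# Student: same architecture; a fixed width stands in for the paper's
# automated width selection.
m = 16
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)
lr = 0.02

for epoch in range(40):
    for i in rng.permutation(n):
        x, target = X[i], y[i]
        h = np.maximum(W @ x, 0.0)        # hidden activations
        err = h @ a - target              # residual
        # Gradients of 0.5 * err**2 w.r.t. student parameters
        grad_a = err * h
        grad_W = err * np.outer(a * (h > 0), x)
        a -= lr * grad_a
        W -= lr * grad_W

mse = np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2)
```

With an overparameterized student (m > k), SGD on this synthetic task typically drives the training error well below the variance of the teacher's outputs, consistent with the favorable scaling the paper reports.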

