IS STOCHASTIC GRADIENT DESCENT NEAR OPTIMAL?

Abstract

The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (2022) demonstrate that, when data is generated by a ReLU teacher network with W parameters, an optimal learner needs only Õ(W/ϵ) samples to attain expected error ϵ. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, achieving this sample complexity while attaining small error for all such networks requires intractable computation. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (2022) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on the expected error of a stochastic algorithm.

1. INTRODUCTION

Over the past decade, deep neural networks have produced groundbreaking results. To name a few, they have demonstrated impressive performance on visual classification tasks (He et al., 2016), parsing and synthesizing natural language (Devlin et al., 2018; Brown et al., 2020), and super-human performance in various games (Mnih et al., 2013). These achievements establish neural networks as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, when the data is generated by a ReLU teacher network with W parameters, Jeon & Van Roy (2022) demonstrate that the sample complexity of an optimal learner is Õ(W). However, existing computational theory suggests that, even for single-hidden-layer teacher networks, the computation required to achieve this sample complexity is intractable. For example, Goel et al. (2020); Diakonikolas et al. (2020) establish that, for batched stochastic gradient descent with respect to squared or logistic loss to achieve small generalization error for all single-hidden-layer teacher networks, the number of samples or number of gradient steps must be superpolynomial in the input dimension or network width. Furthermore, current theoretical guarantees for all computationally tractable algorithms proposed for fitting single-hidden-layer teacher networks with parameters drawn from natural distributions only bound sample complexity by high-order polynomial (Janzamin et al., 2015; Ge et al., 2017) or exponential (Zhong et al., 2017; Fu et al., 2020) functions of the input dimension or width. In this work, we aim to reconcile the gap between these negative theoretical results and the apparent practical success of stochastic gradient descent (SGD) in training performant neural networks. To do so, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution.
We demonstrate that SGD with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds established in Jeon & Van Roy (2022); Bartlett et al. (2019) in a computationally efficient manner. An important difference between our empirical results and the negative theoretical results of Goel et al. (2020); Diakonikolas et al. (2020) is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error. The focus on expected error is more in line with the information-theoretic sample complexity bounds of Jeon & Van Roy (2022). Our results suggest that such expected-error analyses may be better suited for understanding empirical properties of neural network learning.
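To make the experimental setup concrete, the following is a minimal sketch of the teacher-student protocol described above: a single-hidden-layer ReLU teacher with Gaussian parameters generates labels, and an overparameterized student is fit by online SGD on squared loss. All hyperparameters (dimensions, widths, learning rate, Gaussian prior) are illustrative assumptions, not the paper's actual settings, and width selection is omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sgd_fit_student(d=8, teacher_width=4, student_width=8,
                    n_steps=5000, lr=0.02, seed=0):
    """Fit a one-hidden-layer ReLU student to data from a random
    one-hidden-layer ReLU teacher via online SGD on squared loss.
    Returns the mean squared error on fresh test samples."""
    rng = np.random.default_rng(seed)
    # Teacher parameters drawn from a Gaussian prior (our assumed
    # "natural" distribution over teacher networks).
    Wt = rng.normal(size=(teacher_width, d)) / np.sqrt(d)
    at = rng.normal(size=teacher_width) / np.sqrt(teacher_width)
    # Student initialization at the same scale.
    Ws = rng.normal(size=(student_width, d)) / np.sqrt(d)
    a = rng.normal(size=student_width) / np.sqrt(student_width)
    for _ in range(n_steps):
        x = rng.normal(size=d)
        y = at @ relu(Wt @ x)        # noiseless teacher label
        z = Ws @ x
        h = relu(z)
        err = a @ h - y              # residual for squared loss
        # One SGD step on 0.5 * err**2.
        grad_a = err * h
        grad_W = err * np.outer(a * (z > 0), x)
        a -= lr * grad_a
        Ws -= lr * grad_W
    # Estimate expected squared error on fresh samples.
    X = rng.normal(size=(1000, d))
    Y = relu(X @ Wt.T) @ at
    P = relu(X @ Ws.T) @ a
    return float(np.mean((P - Y) ** 2))
```

One fresh sample per gradient step keeps the number of samples equal to the number of queries, matching the nearly-linear scaling claim studied in the paper.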

2. RELATED WORK

Our work contributes to the literature on the sample and computational complexity of single-hidden-layer networks. To put our work in context, we review related work in this area, grouped into several categories.

2.1. STOCHASTIC QUERY LOWER BOUNDS

Most lower bounds on the sample and computational complexity of single-hidden-layer neural networks have been established through the stochastic query framework (Goel et al., 2020; Diakonikolas et al., 2020; Song et al., 2017). A stochastic query algorithm accesses an oracle that returns the expectation of a query function within some tolerance. The literature focuses on query functions that enable gradient descent with respect to common loss functions, with one query per gradient descent step. Aside from the results of Goel et al. (2020); Diakonikolas et al. (2020), which were discussed in the introduction, Song et al. (2017) show that in a setting where the number of samples is less than the product of the input dimension and the width, exponentially many stochastic queries are required.
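The oracle at the heart of this framework can be sketched as follows: it answers a query function q with an estimate of E[q(x, y)] that is only guaranteed to be accurate within some tolerance. The emulation below (Monte Carlo estimate plus a bounded perturbation standing in for the adversarial error) and all names and parameters are illustrative assumptions, not a construction from the cited papers.

```python
import numpy as np

def make_sq_oracle(sample_fn, tolerance, n_mc=10000, seed=0):
    """Emulate a stochastic query oracle. Given a query function
    q(x, y), return an estimate of E[q(x, y)] accurate to within
    `tolerance` (here: Monte Carlo mean plus a bounded perturbation)."""
    rng = np.random.default_rng(seed)

    def oracle(q):
        xs, ys = sample_fn(n_mc, rng)
        estimate = np.mean([q(x, y) for x, y in zip(xs, ys)])
        # The oracle may perturb its answer by up to the tolerance.
        return estimate + rng.uniform(-tolerance, tolerance)

    return oracle

# Usage: query the correlation E[x * y] for data with y = 2x.
def sample_fn(n, rng):
    x = rng.normal(size=n)
    return x, 2.0 * x

oracle = make_sq_oracle(sample_fn, tolerance=0.05)
estimate = oracle(lambda x, y: x * y)   # close to E[2 x^2] = 2
```

A gradient-based learner in this framework issues one such query per gradient step, with q returning a coordinate of the loss gradient; the lower bounds count how many queries any such learner must make.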

2.2. SAMPLE COMPLEXITY UPPER BOUNDS

Jeon & Van Roy (2022); Bartlett et al. (2019) study the sample complexity of optimal learning from data generated by teacher networks, without addressing algorithms or computational complexity. Bartlett et al. (2019) establish upper and lower bounds on the VC dimension (see Vapnik & Chervonenkis (1971)) of noiseless neural networks. For piecewise-linear activation functions, their work shows that the VC dimension of a network with W parameters and L layers is upper bounded by O(WL log W) and that there exist networks with W parameters and L layers whose VC dimension is lower bounded by Ω(WL log(W/L)). These bounds on the VC dimension translate to both upper and lower bounds on the sample complexity of any probably approximately correct (PAC) learning algorithm (Valiant, 1984). Results in Hanneke (2016) show that, for a PAC algorithm that learns to within tolerance ϵ with failure rate at most δ, the sample complexity is Θ((VC + log(1/δ))/ϵ). In our context, this implies a sample complexity of O((WL log W + log(1/δ))/ϵ) for all teacher networks with W weights and L layers, and of Ω((WL log(W/L) + log(1/δ))/ϵ) for some of these teacher networks. Jeon & Van Roy (2022) use information theory to study the number of samples required to learn from a noisy teacher network such that the expected error is small. Instead of relying on VC dimension, their bounds scale linearly in the rate-distortion function of the neural network. For networks with ReLU or sign activations, their results imply an Õ(W/ϵ) sample complexity bound, where W is the total number of parameters and ϵ is the expected error. For single-hidden-layer ReLU teacher networks, both works suggest an upper bound on sample complexity that is linear in the number of parameters, up to logarithmic factors. However, no practical algorithm is given. The VC dimension upper bound implies PAC-learnability, and Jeon & Van Roy study the expected performance of an optimal Bayesian learner.
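These bounds are simple enough to instantiate numerically, up to constants. The sketch below assumes L = 2 for a single hidden layer and a parameter count of W = width × (d + 1) + width (hidden weights and biases plus output weights); both the parameter-count convention and the dropped constants are illustrative assumptions.

```python
import math

def pac_sample_bound(vc_dim, eps, delta):
    """Hanneke-style PAC sample complexity, up to constant factors:
    Theta((VC + log(1/delta)) / eps)."""
    return (vc_dim + math.log(1.0 / delta)) / eps

def vc_upper_single_hidden_layer(d, width):
    """Bartlett et al.'s O(W L log W) upper bound specialized to a
    single hidden layer (L = 2), again ignoring constant factors."""
    W = width * (d + 1) + width  # hidden weights + biases, output weights
    L = 2
    return W * L * math.log(W)

# Usage: sample-complexity scale for a width-50 network on 100-dim inputs.
vc = vc_upper_single_hidden_layer(d=100, width=50)
n = pac_sample_bound(vc, eps=0.1, delta=0.01)
```

The point of the comparison in the text is visible here: the VC-based bound carries an extra L log W factor relative to the Õ(W/ϵ) rate-distortion bound, but both are linear in W up to logarithms.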
An important difference between these results and the negative stochastic query results is that the latter analyze worst-case performance.




