IS STOCHASTIC GRADIENT DESCENT NEAR OPTIMAL?

Abstract

The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (2022) demonstrate that, when data is generated by a ReLU teacher network with W parameters, an optimal learner needs only Õ(W/ϵ) samples to attain expected error ϵ. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (2022) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error of a stochastic algorithm.

1. INTRODUCTION

Over the past decade, deep neural networks have produced groundbreaking results. To name a few, they have demonstrated impressive performance on visual classification tasks (He et al., 2016), in parsing and synthesizing natural language (Devlin et al., 2018; Brown et al., 2020), and super-human performance in various games (Mnih et al., 2013). These achievements establish neural networks as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, when the data is generated by a ReLU teacher network with W parameters, Jeon & Van Roy (2022) demonstrate that the sample complexity of an optimal learner is Õ(W). However, existing computational theory suggests that, even for single-hidden-layer teacher networks, the computation required to achieve this sample complexity is intractable. For example, Goel et al. (2020); Diakonikolas et al. (2020) establish that, for batched stochastic gradient descent with respect to squared or logistic loss to achieve small generalization error for all single-hidden-layer teacher networks, the number of samples or number of gradient steps must be superpolynomial in input dimension or network width. Furthermore, current theoretical guarantees for all computationally tractable algorithms proposed for fitting single-hidden-layer teacher networks with parameters drawn from natural distributions only bound sample complexity by high-order polynomial (Janzamin et al., 2015; Ge et al., 2017) or exponential (Zhong et al., 2017; Fu et al., 2020) functions of input dimension or width.

In this work, we aim to reconcile the gap between these negative theoretical results and the apparent practical success of stochastic gradient descent (SGD) in training performant neural networks. To do so, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that SGD with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds established in Jeon & Van Roy (2022); Bartlett et al. (2019) in a computationally efficient manner. An important difference between our empirical results and the negative theoretical results of Goel et al. (2020); Diakonikolas et al. (2020) is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error. The focus on expected error is more in line with the information-theoretic sample complexity bounds of Jeon & Van Roy (2022). Our results suggest that such expected-error analyses may be better suited for understanding empirical properties of neural network learning.

2. RELATED WORK

Our work contributes to the literature on the sample and computational complexity of single-hidden-layer networks. To put our work in context, we review related work in this area, grouped into several categories.

2.1. STOCHASTIC QUERY LOWER BOUNDS

Most lower bounds on the sample and computational complexity of single-hidden-layer neural networks have been established through the stochastic query framework (Goel et al., 2020; Diakonikolas et al., 2020; Song et al., 2017). A stochastic query algorithm accesses an oracle that returns the expectation of a query function within some tolerance. The literature focuses on query functions that enable gradient descent with respect to common loss functions, with one query per gradient descent step. Aside from the results of Goel et al. (2020); Diakonikolas et al. (2020), which were discussed in the introduction, Song et al. (2017) show that in a setting where the number of samples is less than the product of the input dimension and the width, exponentially many stochastic queries are required.

2.2. SAMPLE COMPLEXITY UPPER BOUNDS

Jeon & Van Roy (2022); Bartlett et al. (2019) study the sample complexity of optimal learning from data generated by teacher networks, without addressing algorithms or computational complexity. Bartlett et al. (2019) establish upper and lower bounds on the VC dimension (see Vapnik & Chervonenkis (1971)) of noiseless neural networks. For piece-wise linear activation functions, their work shows that the VC dimension of a network with W parameters and L layers is upper bounded by O(WL log W) and that there exist networks with W parameters and L layers with VC dimension lower bounded by Ω(WL log(W/L)). These bounds on the VC dimension translate to both upper and lower bounds on the sample complexity of any probably approximately correct (PAC) learning algorithm (Valiant, 1984). Results in Hanneke (2016) show that for a PAC algorithm that learns up to within tolerance ϵ and failure rate at most δ, the sample complexity is Θ((1/ϵ)(VC + log(1/δ))). In our context, this implies O((1/ϵ)(WL log W + log(1/δ))) sample complexity for all teacher networks with W weights and L layers and Ω((1/ϵ)(WL log(W/L) + log(1/δ))) for some of these teacher networks.

Jeon & Van Roy (2022) use information theory to study the number of samples required to learn from a noisy teacher network such that the expected error is small. Instead of relying on VC dimension, their bounds scale linearly in the rate-distortion function of the neural network. For networks with ReLU or sign activations, their results imply an Õ(W/ϵ) sample complexity bound, where W is the total number of parameters, and ϵ is the expected error.

For single-hidden-layer ReLU teacher networks, both works suggest an upper bound on sample complexity that is linear in the number of parameters, up to logarithmic factors. However, no practical algorithm is given. The VC dimension upper bound implies PAC-learnability, and Jeon & Van Roy (2022) study the expected performance of an optimal Bayesian learner. An important difference between these results and the negative stochastic query results is that the latter analyze worst-case performance.

2.3. CONCRETE ALGORITHMS

A segment of the literature offers concrete algorithms for learning from single-hidden-layer teacher networks (Zhong et al., 2017; Fu et al., 2020; Janzamin et al., 2015; Ge et al., 2017). In Zhong et al. (2017), to fit the data generated by a noiseless single-hidden-layer ReLU network, the weights are first initialized by a tensor method, which guarantees linear convergence under gradient descent with high probability. However, the sample complexity is exponential in the input dimension and the number of hidden neurons when the weights are i.i.d. Gaussians, as we assume in this paper (see Appendix D for details). Fu et al. (2020) adapt the tensor initialization of Zhong et al. (2017) and provide similar results for cross entropy loss, instead of L2 loss. Ge et al. (2017) design an alternate objective function G such that using SGD to minimize G can recover the parameters of the single-hidden-layer teacher network with high probability. The sample and computational complexity are high-order polynomials in the input dimension and width. Janzamin et al. (2015) use tensor factorization, Fourier analysis, and ridge regression to fit the data generated by single-hidden-layer teacher networks with high probability. In the case of Gaussian inputs, the sample and computational complexity are high-order polynomials in input dimension and width. Note that the results in Ge et al. (2017); Janzamin et al. (2015) do not contradict the results of Goel et al. (2020); Diakonikolas et al. (2020), since the former construct algorithms that work with high probability.

Results from a couple of papers that focus on networks with multiple hidden layers bear additional implications if specialized to single-hidden-layer networks. Arora et al. (2014) propose an algorithm that learns a distribution generated by a sparse neural network with sign activation units and random edge weights. When specialized to a single hidden layer, this gives rise to an Õ(M^3) sample complexity bound, where M is the width. Zhang et al. (2016) propose an algorithm whose sample complexity depends exponentially on the maximum, over neurons, of the L1 norm of the incoming weights, for particular activation units.

3. PRELIMINARIES

In this section we give the definitions necessary for our experiments, many of which are directly adapted from Jeon & Van Roy (2022).

3.1. TEACHER NETWORK

We assume that the training algorithm is given a set $S$ of $N$ i.i.d. samples $S = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subset \mathbb{R}^d \times \mathbb{R}$. The input is $X \sim \mathcal{N}(0, I_d)$, and the output $Y$ is produced by a random single-hidden-layer teacher network $g : \mathbb{R}^d \to \mathbb{R}$ with noise $W$: $Y = g(X) + W$. The random single-hidden-layer teacher network $g$ is parametrized by $(a, b, \theta)$:

$$g(X) = \sum_{i=1}^{M} \theta_i \, \mathrm{relu}(a_i^\top X + b_i),$$

where $M$ is the width of the hidden layer and $\mathrm{relu}(x) = \max(0, x)$. For the learnable parameters, we assume that for all $i \in [M]$, $a_i \overset{iid}{\sim} \mathcal{N}(0, \frac{1}{d+1} I_d)$, $b_i \overset{iid}{\sim} \mathcal{N}(0, \frac{1}{d+1})$, and $\theta_i \overset{iid}{\sim} \mathcal{N}(0, \frac{1}{M})$. The choice of variances keeps the variance of $g(x)$ relatively fixed across different $d$ and $M$. We further assume that the noise $W \sim \mathcal{N}(0, \sigma^2)$. We denote the hyperparameters for this teacher network by $\gamma := (d, M, \sigma)$. Note that this is a special case of the ReLU data generating process from Jeon & Van Roy (2022).
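The data generating process above can be illustrated with a minimal sketch in plain Python. The function names (`sample_teacher`, `teacher_output`, `sample_dataset`) and the use of the standard library `random` module are assumptions made for this illustration; they are not taken from the released code.

```python
import math
import random


def sample_teacher(d, M, rng):
    """Draw (a, b, theta) with the variances 1/(d+1), 1/(d+1), and 1/M."""
    std_ab = math.sqrt(1.0 / (d + 1))
    std_theta = math.sqrt(1.0 / M)
    a = [[rng.gauss(0.0, std_ab) for _ in range(d)] for _ in range(M)]
    b = [rng.gauss(0.0, std_ab) for _ in range(M)]
    theta = [rng.gauss(0.0, std_theta) for _ in range(M)]
    return a, b, theta


def teacher_output(params, x):
    """Noiseless output g(x) = sum_i theta_i * relu(a_i . x + b_i)."""
    a, b, theta = params
    total = 0.0
    for a_i, b_i, theta_i in zip(a, b, theta):
        pre_activation = sum(w * v for w, v in zip(a_i, x)) + b_i
        total += theta_i * max(0.0, pre_activation)
    return total


def sample_dataset(params, d, N, sigma, rng):
    """Draw N pairs (x, y) with x ~ N(0, I_d) and y = g(x) + N(0, sigma^2) noise."""
    data = []
    for _ in range(N):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = teacher_output(params, x) + rng.gauss(0.0, sigma)
        data.append((x, y))
    return data
```

For example, `sample_dataset(sample_teacher(8, 4, rng), 8, 128, 0.1, rng)` with `rng = random.Random(0)` would produce one training set for γ = (8, 4, 0.1).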

3.2. ERROR

We define test error as $d_{KL}(P_Y \| P_{\hat{Y}})$, the KL-divergence from the predictive distribution $P_{\hat{Y}}$ of $Y$ to its true distribution $P_Y$. We assume that the predictive distribution of $Y$ is Gaussian with the same variance as the real distribution of $Y$, i.e., $\hat{Y} \sim \mathcal{N}(\hat{g}_S(X), \sigma^2)$, where $\hat{g}_S$ is the model trained on $S$. Then, the KL-divergence simplifies to L2 error with respect to the noiseless teacher network, scaled inversely by the noise variance (see section 2.6 of Jeon & Van Roy (2022)):

$$d_{KL}(P_Y \| P_{\hat{Y}}) = \frac{E\left[(\hat{g}_S(X) - g(X))^2 \mid g, S\right]}{2\sigma^2}. \quad (1)$$
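One way to approximate the expectation in equation 1 is Monte Carlo sampling over fresh inputs X ∼ N(0, I_d). The sketch below assumes this approach; the function name `estimate_test_error`, the number of evaluation points, and the seed are hypothetical choices for illustration, and the released code may evaluate the error differently.

```python
import random


def estimate_test_error(g, g_hat, d, sigma, num_points=10000, seed=0):
    """Monte Carlo estimate of E[(g_hat(X) - g(X))^2 | g, S] / (2 sigma^2).

    g and g_hat are callables mapping a length-d list of floats to a float.
    """
    rng = random.Random(seed)
    squared_error_sum = 0.0
    for _ in range(num_points):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # X ~ N(0, I_d)
        squared_error_sum += (g_hat(x) - g(x)) ** 2
    mean_squared_error = squared_error_sum / num_points
    return mean_squared_error / (2.0 * sigma ** 2)
```

As a sanity check, a model that is off by a constant 0.2 everywhere has mean squared error 0.04, so with σ = 0.1 the estimate is 0.04 / 0.02 = 2.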

3.3. SAMPLE COMPLEXITY

Our definition of sample complexity is adapted from Definition 4 in Jeon & Van Roy (2022). For any $\epsilon > 0$, we define the sample complexity $N_\epsilon$ of a training procedure as the minimal number of samples $N$ such that after training on $N$ samples, the expected error is at most $\epsilon$:

$$N_\epsilon = \min\left\{ N : \frac{E\left[(\hat{g}_S(X) - g(X))^2\right]}{2\sigma^2} \le \epsilon \right\},$$

where $S$ is an i.i.d. set of $N$ training samples, and $\hat{g}_S$ is the model trained on this set. Here the expectation is taken over both $X$ and $g$, so this definition of sample complexity captures the expected performance of a training algorithm.

3.4. COMPUTATIONAL COMPLEXITY

We use the total number of queries to the training data points as a proxy for computational complexity, which we denote by T. More concretely, if the algorithm is trained on m batches of size n, then the number of queries to the training data points is nm. When each data point is queried, it generates a forward pass and a backward pass. So the actual computational complexity of the algorithm is the product of T and a scaling factor that depends on the size of the fitting model.

4. EXPERIMENT SETUP

In this section we describe how the experiments are conducted. We first describe the experiment pipeline and then discuss the various components.

4.1. EXPERIMENT PIPELINE

The experiment pipeline is outlined in Algorithm 1, and the corresponding code is available online (Appendix A). The definitions of the various parameters and the respective values chosen for the experiments are summarized in Table 1.

Algorithm 1
 2: for each sample number $N \in \mathcal{N}$ do
 3:     for $i \in [\text{num\_trials}]$ do
 4:         $g \leftarrow \text{sample\_g}(\gamma)$
 5:         Sample $N$ i.i.d. inputs $(x_1, x_2, \ldots, x_N)$ according to $\mathcal{N}(0, I_d)$
 6:         $\forall j \in [N]$, calculate $y_j^{\text{noiseless}} \leftarrow g(x_j)$
 7:         $\forall j \in [N]$, calculate $y_j \leftarrow y_j^{\text{noiseless}} + w_j$, where $w_j \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$
 8:         Set $S \leftarrow \{(x_j, y_j) \mid j \in [N]\}$
 9:         $\hat{g}_S \leftarrow \text{train}(\gamma, S)$, logging the number of queries to data points $T_{\gamma,N,i}$
10:         Evaluate error according to equation 1: $\text{error}_{\gamma,N,i} \leftarrow E\left[(\hat{g}_S(X) - g(X))^2 \mid g, S\right] / (2\sigma^2)$

We run the training algorithm on samples $S$ of increasing size $N$ until the test error is below the specified $\epsilon$, and set the smallest such $N$ as $N_\epsilon$. By choosing $N$ to double each time, we estimate $N_\epsilon$ within a factor of 2. The above procedure is performed for noise σ = 0.1 and σ = 0.2, and for each configuration, at least 32 trials are performed to reduce the noise in the gathered data.
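The doubling schedule for estimating the sample complexity can be sketched as follows. This is an illustration under stated assumptions: `expected_error` is a hypothetical callable that returns the trial-averaged test error after training on N samples, and `n_start` and `n_max` are hypothetical bounds on the search that do not come from the paper.

```python
def estimate_sample_complexity(expected_error, epsilon, n_start=1, n_max=2 ** 20):
    """Return the smallest N in {n_start, 2*n_start, 4*n_start, ...} with error <= epsilon.

    Because N doubles at each step, the returned value overestimates the
    true sample complexity by less than a factor of 2 when the error is
    non-increasing in N. Returns None if the target is not reached by n_max.
    """
    n = n_start
    while n <= n_max:
        if expected_error(n) <= epsilon:
            return n
        n *= 2
    return None
```

For instance, if the averaged error happened to decay as 100/N, the true threshold for ϵ = 1 would be N = 100 and the doubling schedule would report 128.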

4.2. TRAINING

We split the samples $S$ into an internal training set $S_t$ and a validation set $S_v$ using an 80/20 ratio. We train single-hidden-layer neural networks of different widths on $S_t$, choosing the widths by golden-section search, and select the model with the best performance on the validation set $S_v$. Various details are described below.

4.2.1. ARCHITECTURE OF FITTING NETWORK

The fitting network is a single-hidden-layer ReLU network, the same as the teacher network, but with different widths (number of hidden neurons). No explicit form of regularization like dropout or weight decay is used. To find the best width, we perform golden-section search (scipy.optimize.golden) on widths ranging from 2 to $32 + 8 \cdot \max(N, \sqrt{dM} + \max(d, M))$. This maximum width is chosen to allow ample over-fitting, considering either the number of provided samples, or the architecture of the teacher network. Golden-section search is performed on the logarithm (base 2) of the width, with tolerance set to 0.25. The motivation behind this scheme is to get close to a good width by searching few points. For example, at most 8 steps are needed to search through widths from 2 to 1000 in this scheme (the number of steps is at most $\ln(\text{initial range}/\text{tolerance}) / \ln(\phi)$, where $\phi$ is the golden ratio). We believe that model performance should roughly be a unimodal function of width, so golden-section search should find widths near the optimum. The number of queries T is the sum of the number of queries for each searched width.
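The step bound and the search on the log2 scale can be illustrated with a small hand-written golden-section search. This is a sketch for illustration only and not the `scipy.optimize.golden` routine used in the experiments; the function names and the caching of evaluated widths are assumptions, and `loss_at_width` stands in for "train a network of this width and return its validation loss".

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def max_golden_steps(lo_width, hi_width, tol=0.25):
    """Bound ln(initial range / tolerance) / ln(phi), rounded up, on the log2 scale."""
    initial_range = math.log2(hi_width) - math.log2(lo_width)
    return math.ceil(math.log(initial_range / tol) / math.log(PHI))


def golden_section_width(loss_at_width, lo_width, hi_width, tol=0.25):
    """Minimize loss_at_width over integer widths by searching log2(width)."""
    a, b = math.log2(lo_width), math.log2(hi_width)
    evaluated = {}  # width -> loss, so each width is trained at most once

    def f(log_width):
        width = max(lo_width, min(hi_width, round(2 ** log_width)))
        if width not in evaluated:
            evaluated[width] = loss_at_width(width)
        return evaluated[width]

    c = b - (b - a) / PHI
    d = a + (b - a) / PHI
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - (b - a) / PHI
            fc = f(c)
        else:  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + (b - a) / PHI
            fd = f(d)
    best_width = min(evaluated, key=evaluated.get)
    return best_width, evaluated
```

For widths from 2 to 1000, `max_golden_steps(2, 1000)` evaluates to 8, matching the figure quoted in the text. Each shrinking step costs one new width, which is why the total number of queries T stays modest even though the search range is wide.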

4.2.2. OPTIMIZATION

To train the network, we use Adam (Kingma & Ba, 2015) with respect to L2 loss. Aside from the learning rate, we use the default parameters from the PyTorch implementation (β_1 = 0.9, β_2 = 0.999). As empirical evidence suggests that small batch sizes generalize better (for example, see Keskar et al. (2017)), we set the batch size to 64 for a balance between model performance and training speed. To automatically set the initial learning rate, we adapt the method first proposed in Smith (2017). We start with a very small learning rate (1e-8) and exponentially increase it until the model starts to diverge. We adapt three methods implemented in the fastai library (footnote 1) to estimate the best learning rate (footnote 2), and use their median as the initial learning rate. The queries to the data points in this phase are included in the calculation of T. During training, we reduce the learning rate by a factor of 10 when the validation loss plateaus, using ReduceLROnPlateau from PyTorch (mode='min' and patience=12). We stop training whenever the best validation loss fails to decrease by more than 1% in relative terms within 24 epochs, and use the model corresponding to the best validation loss. For each fitting network, there is a hard cap of 1500 epochs of training, which is typically not reached.
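The stopping rule admits more than one reading; the sketch below shows one plausible interpretation, in which an epoch counts as progress only if its validation loss is more than 1% below the last loss that counted as progress. The function name `run_early_stopping` and this bookkeeping are assumptions for illustration, not the authors' implementation, and learning rate scheduling is omitted.

```python
import math


def run_early_stopping(val_losses, patience=24, rel_tol=0.01, max_epochs=1500):
    """Replay a sequence of validation losses and return (stop_epoch, best_epoch).

    stop_epoch is the 0-based epoch at which training halts; best_epoch is
    the epoch whose model would be kept (lowest validation loss seen).
    """
    best_loss, best_epoch = math.inf, None
    reference_loss = math.inf  # last loss that improved by more than rel_tol
    stale_epochs = 0
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        if loss < reference_loss * (1.0 - rel_tol):
            reference_loss, stale_epochs = loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return last_epoch, best_epoch
```

Under this reading, a run whose loss drops to 0.5 at epoch 1 and then sits at 0.499 stops at epoch 25 and keeps the model from epoch 2, since 0.499 is the lowest loss seen but never beats 0.5 by more than 1%.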

5.1. SAMPLE COMPLEXITY

In Figure 1, we plot $\epsilon N_\epsilon$ against $dM$ (left) and $N_\epsilon/(dM)$ against $\epsilon^{-1}$ (right) for the different choices of noise (σ = 0.1, 0.2). In these plots, $\epsilon$ is the average test error, and $N_\epsilon$ is the corresponding number of samples provided. Both the horizontal axis and the vertical axis are drawn in log scale, with equal aspect ratio. In all plots, we include a scatter plot of the points and a reference line of unit slope in the log plot, which corresponds to a linear fit of the data. In the plots on the left, we also plot lines corresponding to the median, the mean, and the 95th and 5th percentiles. In the plots on the right, we use locally weighted smoothing (Cleveland, 1979) to estimate the trend, and the 95% confidence interval is produced by bootstrap resampling two-thirds of the data. As we can see in the plots, $\epsilon N_\epsilon$ is almost proportional to $dM$ for a wide range of $d$, $M$, and $\epsilon$. Both Bartlett et al. (2019) and Jeon & Van Roy (2022) predict the theoretical sample complexity to be proportional to $dM/\epsilon$, up to log factors. So our results indicate that SGD on neural networks (with automatic width selection) can achieve the theoretical sample complexity of "optimal" learners in the case of single-hidden-layer teacher networks. We note that while the dependence of $N_\epsilon$ on $dM$ is very close to linear for large $dM$, the dependence of $N_\epsilon$ on $\epsilon^{-1}$ is noticeably worse than linear for very small $\epsilon$. Additional plots of the dependence of $N_\epsilon$ on $d$ and $M$ for fixed $\epsilon$ can be found in Appendix B.

5.2. COMPUTATIONAL COMPLEXITY

We plot the number of queries $T$ against the number of samples $N$ in Figure 2. As in the previous plots, both the horizontal and vertical axes are in log scale and have equal aspect ratio. We include a reference line of unit slope in the log plot, which corresponds to a linear fit of the data. From Figure 2 we can see that the dependence of $T$ on $N$ is slightly less than linear, and so $T = O(N)$. In the previous section, we demonstrated that $N_\epsilon$, the number of samples necessary to achieve test error within $\epsilon$ tolerance, appears to be $O(dM/\epsilon)$. Therefore, $T_\epsilon$, the total number of queries to data points to achieve $\epsilon$ tolerance, is also approximately proportional to $dM/\epsilon$. This implies that for all $N$, the average number of times each single data point is queried is bounded above by a constant. In our experiments, the width of the fitting network is $O(d + M)$. Since each query of a data point corresponds to at most one forward pass and one backward pass, the overall computational complexity is $O(Nd(d + M)) = \tilde{O}(d^2 M(d + M))$ for fixed $\epsilon$. We hypothesize that by tightening the upper bound on the fitting network's width to $O(M)$, the current results would still hold, and the corresponding computational complexity could be improved to $\tilde{O}(d^2 M^2)$.

6. COMPARISON WITH EXISTING RESULTS

6.1. COMPARISON WITH THEORETICAL RESULTS

In the case of single-hidden-layer neural networks, both Bartlett et al. (2019) and Jeon & Van Roy (2022) give theoretical upper bounds on the sample complexity of $\tilde{O}(dM/\epsilon)$. As for lower bounds on sample complexity, the result in Bartlett et al. (1998) implies the existence of single-hidden-layer neural networks (footnote 3) with sample complexity at least linear in the total number of weights. However, to the best of our knowledge, there is no tight theoretical lower bound on sample complexity when the teacher network is assumed to be drawn from a distribution, and even for single-hidden-layer teacher networks this seems to remain an open problem. We empirically demonstrate that for single-hidden-layer teacher networks, running SGD on neural networks with adequate hyper-parameters achieves the best known theoretical bounds on sample complexity, with very manageable run time: the average number of queries per data point is constant.

SGD empirically works well in terms of sample and computational complexity in spite of negative theoretical results in the stochastic query framework (Goel et al., 2020; Diakonikolas et al., 2020). The discrepancy between theory and practice is best explained by the analysis framework. While Goel et al. (2020); Diakonikolas et al. (2020) analyze the worst-case performance of algorithms and prove that either sample or computational complexity must be super-polynomial, our empirical work studies the average performance of SGD. The focus on average-case performance is also more in line with the actual uses of neural networks: in practice, people do not necessarily need guarantees that SGD on neural networks works for all datasets, as long as practical algorithms succeed with high probability.

6.2. COMPARISON WITH OTHER ALGORITHMS

Zhong et al. (2017); Fu et al. (2020); Janzamin et al. (2015); Ge et al. (2017) all construct algorithms to fit single-hidden-layer teacher networks with provable guarantees on sample complexity, computational complexity, and error. Here we highlight some differences between their works and ours:

• While our results indicate sample complexity linear in the number of parameters, the mentioned works have either high-order polynomial (Janzamin et al., 2015; Ge et al., 2017) or exponential (Zhong et al. (2017); Fu et al. (2020), see Appendix D for details) sample complexity.
• Our work uses standard machine learning tools (Adam, random weight initialization, early stopping, learning rate decay), while the mentioned works use algorithms not commonly found in practice.

7. CONCLUSIONS

In this work, we empirically demonstrate that to reach a small expected error for single-hidden-layer teacher networks, SGD with automatic width tuning can nearly achieve theoretical sample complexity bounds in a computationally efficient manner. This helps bridge a gap that previously existed between theoretical sample complexity upper bounds and the absence of algorithms that achieve these upper bounds in a computationally efficient way. In addition, the near-optimal sample and computational complexity of SGD on neural networks opens up the possibility of modelling it as an optimal Bayesian learner. We hope that this new perspective contributes to the general understanding of the performance of SGD on neural networks. Investigating whether our results extend to multiple-hidden-layer teacher networks remains an interesting question for future research.

Reproducibility Statement

We provide the source code for reproducing the experiments (Appendix A). 

A CODE FOR RUNNING EXPERIMENTS

We anonymously uploaded the source code to https://anonymous.4open.science/r/sample-complexity-4B45/README.md, and will share the git repository upon publication.

B ADDITIONAL PLOTS ON SAMPLE COMPLEXITY

$E[(\hat{g}_S(X) - g(X))^2] / (2\sigma^2) > \epsilon$ to $\min\{N : E[(\hat{g}_S(X) - g(X))^2] / (2\sigma^2) \le \epsilon\}$. Since we only run experiments where the sample size $N$ is a power of 2, these two values of $N$ always differ by a factor of 2. From the plots we can see that the dependence of $N_\epsilon$ on $d$ eventually becomes linear (unit slope in our plots) for large $M$. For large $d$, the dependence of $N_\epsilon$ on $M$ eventually becomes slightly worse than linear, but no worse than quadratic (which corresponds to a slope of 2 in our plots). In addition, these observations hold for $\epsilon$ spanning more than two orders of magnitude.

C ARCHITECTURE OF FITTING NETWORK

In this section we study how different fitting network architectures influence performance.

C.0.1 NUMBER OF LAYERS

We study the performance of the fitting algorithm when the number of hidden layers in the fitting network is 1, 2, and 3. When the number of hidden layers is 2 or 3, we set the number of neurons in each hidden layer to be the same. In all cases, we use golden-section search to find the best width. The minimal width is 2, and the maximum widths are given in Table 2. Again, the maximum widths are chosen to allow ample over-fitting, considering either the number of provided samples, or the architecture of the teacher network. The results are plotted in Figure 5, and summary statistics are shown in Table 3. We see that on average, having only one hidden layer in the fitting network gives slightly better performance than having two hidden layers, which in turn gives slightly better performance than having three hidden layers. This corresponds well with the idea that the fitting network should have an architecture similar to that of the teacher network.

C.0.2 WIDTH OF HIDDEN LAYER

In this part, we fix the fitting network to have only one hidden layer and study the performance of the fitting algorithm for different widths (number of hidden neurons). We use four different schemes to select the width of the fitting network, which we describe in Table 4.

Name    Description
same    The width of the fitting network is M, the same as in the teacher network.
4M      The width of the fitting network is 4M, corresponding to 4x overparametrization.
tune    The width of the fitting network is tuned using golden-section search on the logarithm of the width. The range of widths searched is $[2, 32 + 8 \cdot \max(N, \sqrt{dM} + \max(d, M))]$.
best    Use the median of the widths found by the tune method across trials.

Figure 5: Fitting networks with a single hidden layer perform better than those with multiple hidden layers. We plot the inverse of the test error when the fitting network has multiple hidden layers ($\epsilon_2^{-1}$ for 2 hidden layers, and $\epsilon_3^{-1}$ for 3) against the inverse of the test error when the fitting network has one hidden layer ($\epsilon_1^{-1}$). All axes are in log scale, with equal aspect ratio. A reference line corresponding to equal error is plotted. The region below the line corresponds to single-hidden-layer fitting networks having superior performance. The confidence intervals are generated by bootstrap resampling of two-thirds of the data. In all cases multiple-hidden-layer fitting networks perform slightly worse than single-hidden-layer fitting networks, especially when the test error is small.

Livni et al., 2014; Neyshabur et al., 2018). Perhaps surprisingly, the performance difference between tune and best also indicates that for optimal performance, the architecture of the fitting network needs to be tuned to the particular instantiation of the teacher network, not just to its architecture and the number of samples.

$\lambda$ is defined as $\left(\prod_{i=1}^{k} \sigma_i\right) / \sigma_k^k$, where $\sigma_i(W)$ is the $i$-th singular value of the weight matrix $W \in \mathbb{R}^{d \times M}$ of the teacher network. Since Zhong et al. (2017) consider teacher networks where the outer coefficients (our $\theta_i$) are either 1 or -1, we need to multiply the outer coefficients inside the activation functions. So with our teacher network, $W$ would be a $d \times M$ Gaussian, with variance $1/d$, multiplied (with broadcasting) with a $1 \times M$ Gaussian, with variance $1/M$. We set $d = 2M$ and plot $\lambda$ versus $M$ for 1000 trials in Figure 7. The vertical axis is in log scale, and we can clearly see that $\lambda$ depends exponentially on $M$.



Footnote 1: https://docs.fast.ai/
Footnote 2: steep, where the loss has the steepest descent; minimum, a learning rate 1/20 of that at which the loss is the smallest; and valley, where the loss is in the middle of its longest valley.
Footnote 3: Mild constraints are imposed on the activation functions. Sigmoid, for example, satisfies the constraints.
Footnote 4: Zhong et al. (2017) mention in Remark 4.3 that in the worst case λ depends exponentially on the number of hidden units.



Figure 1: Sample complexity is almost linear in $dM/\epsilon$ for a wide range of $d$, $M$, and $\epsilon$. $\epsilon$ is the average test error, and $N_\epsilon$ is the corresponding sample size. All vertical and horizontal axes are in log scale, with equal aspect ratio. A unit slope reference is provided to indicate a linear relationship in the log scale. For σ = 0.1 (top), the reference lines correspond to $\epsilon N_\epsilon = 1.79\,dM$; for σ = 0.2 (bottom), the reference lines correspond to $\epsilon N_\epsilon = 1.11\,dM$. The confidence intervals on the right are generated by bootstrap resampling of two-thirds of the data.

Figure 2: Total number of queries to data points is sublinear in sample size. All vertical and horizontal axes are in log scale, with equal aspect ratio. A unit slope reference is provided to indicate a linear relationship in the log scale. For σ = 0.1 (left), the reference line corresponds to T = 1940N; for σ = 0.2 (right), the reference line corresponds to T = 1622N. The sublinear relationship indicates that the average number of times each single data point is queried is O(1) for all N.

COMPARISON WITH OTHER ALGORITHMS Zhong et al. (2017); Fu et al. (2020); Janzamin et al. (2015); Ge et al. (

Zhong et al. (2017); Fu et al. (2020) use a tensor method to initialize the weights before applying SGD; Janzamin et al. (2015) use tensor factorization, Fourier analysis, and ridge regression instead of SGD; and Ge et al. (2017) design an alternate objective function.

For σ = 0.1 and σ = 0.2, we plot the dependence of $N_\epsilon$ on $d$ and $M$ for $\epsilon$ = 1, 0.1, 0.01 (see Figure 3 and Figure 4, respectively). Both the horizontal axis and the vertical axis are drawn in log scale, with equal aspect ratio. The dependence on $M$ for different $d$ is shown on the left, and the dependence on $d$ for different $M$ is shown on the right. We use error bars to indicate the range from max N :

Figure 3: Sample complexity $N_\epsilon$ is almost linear in $d$ and $M$ for different $\epsilon$ when σ = 0.1. All vertical and horizontal axes are in log scale, with equal aspect ratio. The error bars indicate that the estimate of the sample complexity $N_\epsilon$ is within a factor of 2.

Figure 5 panels: (a) Effect of number of hidden layers in fitting network when σ = 0.1. (b) Effect of number of hidden layers in fitting network when σ = 0.2. Axis labels: 2-hidden-layer test error; 3-hidden-layer test error.

Figure 6 panels: (a) Effect of width of fitting network when σ = 0.1. (b) Effect of width of fitting network when σ = 0.2. Axis labels: same scheme test error; 4M scheme test error; best scheme test error.

Figure 6: The tune scheme has the best performance, followed by the best and 4M schemes. The same scheme has the worst performance. We plot the inverse of the test error when using different schemes to select the width of the fitting network ($\epsilon_{\text{same}}^{-1}$, $\epsilon_{4M}^{-1}$, $\epsilon_{\text{best}}^{-1}$) against the inverse of the test error when the fitting network automatically tunes its width ($\epsilon_{\text{tune}}^{-1}$). All axes are in log scale, with equal aspect ratio. A reference line corresponding to equal error is plotted. The region below the line corresponds to the width tuning scheme having superior performance. The confidence intervals are generated by bootstrap resampling of two-thirds of the data.

Table 1: Summary of parameters in experiment.

Table 2: Maximum widths for different numbers of hidden layers.

Table 3: Geometric mean and median of test error ratio.

Table 4: Different schemes of selecting the fitting network width.

Table 5: Effect of different width tuning schemes.


We set the width tuning scheme as the baseline, and plot the relative performance of the other schemes in Figure 6, with summary statistics given in Table 5. The width tuning scheme consistently has the best performance, followed by the 4M and best schemes. The same scheme has the worst performance. These results are consistent with empirical observations that over-parametrization is essential in training neural networks (Ge et al., 2017;

