AGNOSTIC LEARNING OF GENERAL RELU ACTIVATION USING GRADIENT DESCENT

Abstract

We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error that is within a constant factor of the optimal, i.e., it is guaranteed to achieve an error of $O(\mathrm{OPT})$, where $\mathrm{OPT}$ is the error of the best ReLU function. This is a significant improvement over existing guarantees for gradient descent, which only guarantee error of $O(\sqrt{d \cdot \mathrm{OPT}})$ even in the zero-bias case (Frei et al., 2020). We also provide finite sample guarantees, and obtain similar guarantees for a broader class of marginal distributions beyond Gaussians.

1. INTRODUCTION

Gradient descent forms the bedrock of modern optimization algorithms for machine learning. Despite a long line of work in understanding and analyzing the gradient descent iterates, there remain several outstanding questions on whether they can provably learn important classes of problems. In this work we study one of the simplest learning problems where the properties of gradient descent are not well understood, namely agnostic learning of a single ReLU function.

More formally, let $\tilde{D}$ be a distribution over $\mathbb{R}^d \times \mathbb{R}$. A ReLU function is parameterized by $w = (\tilde{w}, b_w)$ where $\tilde{w} \in \mathbb{R}^d$ and $b_w \in \mathbb{R}$. For notational convenience, we consider the points to be in $\mathbb{R}^{d+1}$ by appending a fixed coordinate 1 to $\tilde{x}$, i.e., $x = (\tilde{x}, 1)$. Let $D$ be the distribution over $\mathbb{R}^{d+1} \times \mathbb{R}$ induced by $\tilde{D}$. We define the loss incurred at $w = (\tilde{w}, b_w)$ to be

$$L(w) = \frac{1}{2}\, \mathbb{E}_{(\tilde{x},y)\sim \tilde{D}}\left[ (\sigma(\tilde{w}^\top \tilde{x} + b_w) - y)^2 \right] = \frac{1}{2}\, \mathbb{E}_{(x,y)\sim D}\left[ (\sigma(w^\top x) - y)^2 \right].$$

Here $\sigma(x) = \max(x, 0)$ is the standard rectified linear unit popularly used in deep learning. The goal in agnostic learning of a ReLU function (or agnostic ReLU regression) is to design a polynomial time learning algorithm that takes as input i.i.d. samples from $D$ and outputs $w = (\tilde{w}, b_w)$ such that $L(w)$ compares favorably with $\mathrm{OPT}$, which is given by

$$\mathrm{OPT} := \min_{w = (\tilde{w}, b_w) \in H} \frac{1}{2}\, \mathbb{E}_{(x,y)\sim D}\left[ (\sigma(w^\top x) - y)^2 \right].$$

Here the hypothesis set $H$ that the algorithm competes with is the set of ReLU units with parameters $w = (\tilde{w}, b_w)$ whose relative bias $|b_w| / \|\tilde{w}\|_2$ is bounded. This is a non-trivial and interesting regime; when the bias is too large in magnitude, the optimal ReLU function fitting the data is either the constant zero function almost everywhere, or a linear function almost everywhere.

This agnostic learning problem has been extensively studied, and polynomial time learning algorithms exist for a variety of settings. This includes the noisy teacher setting where $\mathbb{E}[y \mid x]$ is given by a ReLU function (Kakade et al., 2011; Mukherjee & Muthukumar, 2020) and the fully agnostic setting where no assumption on $y$ is made (Goel & Klivans, 2019; Diakonikolas et al., 2020). In a recent work, Frei et al. (2020) analyzed the properties of gradient descent for the above agnostic learning problem when the bias term is assumed to be zero. The gradient descent based learning algorithm corresponds to the following sequence of updates starting from a suitable initializer $w_0$:

$$w_{t+1} = w_t - \eta \nabla L(w_t).$$

Frei et al. (2020) proved that starting from zero initialization, and for distributions where the marginal of $x$ satisfies some mild assumptions, gradient descent iterates produce, in polynomial time, a point $w_T$ such that $L(w_T) = O(\sqrt{\mathrm{OPT}})$ when the domain for $x$ is bounded (for this bound it is instructive to think of $\mathrm{OPT} < 1$; the general expression is more complicated, with some additive terms and dependencies on problem-dependent quantities).

While the above provides the first non-trivial learning guarantees for gradient descent in the case of agnostic ReLU learning, it suffers from a few key limitations. The result of Frei et al. (2020) only applies when the distribution has a bounded domain and when the bias terms are zero. When the distribution is not bounded, the error of $O(\sqrt{\mathrm{OPT}})$ also includes some dimension-dependent terms; e.g., when the marginal of $\tilde{x}$ is a standard Gaussian $N(0, I_{d \times d})$, it gives an $O(\sqrt{d \cdot \mathrm{OPT}})$ error. Moreover, there is a natural question of improving the bound of $O(\sqrt{\mathrm{OPT}})$ on the error of gradient descent (since the most interesting regime of parameters is when $\mathrm{OPT} \ll 1$). This is particularly intriguing given the recent result of Diakonikolas et al. (2020), which shows that, assuming zero bias, gradient descent on a convex surrogate for $L(w)$ achieves $O(\mathrm{OPT})$ error. This raises the question of whether the same holds for gradient descent on $L(w)$ itself. In another recent work, Vardi et al. (2021) provide convergence guarantees for gradient descent in the presence of bias terms, but under the strong realizability assumption, i.e., assuming that $\mathrm{OPT} = 0$.

To summarize the existing guarantees, to the best of our knowledge, (i) there are no existing guarantees for any polynomial time algorithm (including gradient descent) for agnostic learning of a ReLU function with bias, and (ii) even in the zero bias case, there is no existing guarantee for gradient descent (on the standard squared loss) that achieves $O(\mathrm{OPT})$ error.

1.1. OUR RESULTS

In this work we make progress on both these fronts, by improving the state of the art of guarantees for gradient descent for agnostic ReLU regression. In particular, we show that when the marginal of $x$ is a Gaussian, gradient descent on $L(w)$ achieves an error of $O(\mathrm{OPT})$, even in the presence of bias terms that are bounded. The $O(\mathrm{OPT})$ guarantee that we get even in the zero bias case answers an open question raised in the work of Frei et al. (2020). There are also no additional dependencies on the dimension. Given the recent statistical query lower bound of Goel & Klivans (2019) that rules out an additive guarantee of $\mathrm{OPT} + \varepsilon$ for agnostic ReLU regression, our result shows that vanilla gradient descent on the target loss already achieves near optimal error guarantees. Below we state our main theorem. For convenience we assume that $\|\tilde{v}\|_2$ is a constant, where $v = (\tilde{v}, b_v) \in H$ is the optimal weight, i.e., $L(v) = \mathrm{OPT}$; Appendix C shows why this is without loss of generality.

Theorem 1.1. Let $C_1 \ge 1$, $C_2 > 0$, $c_3 > 0$ be absolute constants. Let $\tilde{D}$ be a distribution over $(\tilde{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ where the marginal over $\tilde{x}$ is the standard Gaussian $N(0, I)$. Let $H = \{w = (\tilde{w}, b_w) : \|\tilde{w}\|_2 \in [1/C_1, C_1],\ |b_w| \le C_2\}$, and consider population gradient descent iterates $w_{t+1} = w_t - \eta \nabla L(w_t)$. For a suitable constant learning rate $\eta$, when starting from $w_0 = (\tilde{w}_0, 0)$ where $\tilde{w}_0$ is randomly initialized from a radially symmetric distribution, with at least constant probability $c_3 > 0$ one of the iterates $w_T$ of gradient descent after $\mathrm{poly}(d, 1/\varepsilon)$ steps satisfies $L(w_T) = O(\mathrm{OPT}) + \varepsilon$.

Please see Section 4 for the more formal statement and proof. Note that the above guarantee applies to one of the intermediate iterates produced by gradient descent within the first $\mathrm{poly}(d, 1/\varepsilon)$ iterations. This is consistent with other convergence guarantees for gradient descent in non-realizable settings, where last iterate guarantees typically do not exist (Frei et al., 2020). One can always pick the iterate among the first $\mathrm{poly}(d, 1/\varepsilon)$ steps that has the smallest loss on an independent sample from the distribution $D$.

The above theorem proves that gradient descent obtains a bound of $O(\mathrm{OPT})$ when the relative bias of the optimal ReLU function is bounded (recall that $\|\tilde{v}\|_2 = \Theta(1)$ for the optimal classifier without loss of generality, by Proposition C.1). Note that we do not constrain the gradient updates to remain in the set $H$. This result significantly improves upon the existing state-of-the-art guarantee of $O(\sqrt{d \cdot \mathrm{OPT}})$ for gradient descent (Frei et al., 2020), even when specialized to the case of ReLU activations with no bias. Further, this gives the first provable guarantees in the setting with non-zero bias.

Our improved bound of $O(\mathrm{OPT})$ error even with non-zero bias involves several new ideas. At a high level there are two main ingredients that allow us to go beyond the previous work: (1) an improved analysis for gradient descent in the agnostic case that in particular avoids any dimension-dependent factors, and (2) a new "multiscale" random initialization scheme with a stronger guarantee for the initializer. We outline these in more detail in Section 4 and Section 5 respectively.

We remark that some of the assumptions in Theorem 1.1 are made with a view towards a clearer exposition, and similar guarantees hold in more general settings. While the above theorem gives guarantees for gradient descent on the population loss function $L(w)$ (as in Vardi et al. (2021)), we also prove guarantees for the empirical loss function in Section D. Moreover, while Theorem 1.1 assumes Gaussian marginals (as this already illustrates the improved guarantees in a basic and well-studied setting), these techniques extend to a broader class of distributions that we describe next.
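The iterate-selection step mentioned after Theorem 1.1 (pick, among the stored iterates, the one with the smallest loss on an independent sample) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure analyzed in the paper: the names `empirical_loss` and `pick_best_iterate`, the toy target `v`, and the candidate iterates are all hypothetical, and labels are generated noiselessly only to keep the example short.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def empirical_loss(w, sample):
    """Average of 0.5 * (relu(<w, x>) - y)^2 over (x, y) pairs; x ends with a 1."""
    total = 0.0
    for x, y in sample:
        pred = relu(sum(wi * xi for wi, xi in zip(w, x)))
        total += 0.5 * (pred - y) ** 2
    return total / len(sample)


def pick_best_iterate(iterates, holdout):
    """Return (index, iterate) minimizing the loss on the held-out sample."""
    losses = [empirical_loss(w, holdout) for w in iterates]
    best = min(range(len(iterates)), key=lambda t: losses[t])
    return best, iterates[best]


# Toy usage: Gaussian features with an appended constant coordinate.
rng = random.Random(0)
d = 3
v = [1.0, 0.0, 0.0, 0.5]  # hypothetical target (tilde v, b_v)
holdout = []
for _ in range(500):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)] + [1.0]
    y = relu(sum(vi * xi for vi, xi in zip(v, x)))
    holdout.append((x, y))

# Stand-ins for iterates w_0, ..., w_T produced by gradient descent.
iterates = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.2], v]
best_index, best_w = pick_best_iterate(iterates, holdout)
```

The held-out sample must be independent of the data used to run gradient descent, so that the empirical comparison between iterates is not biased by the optimization itself.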

1.2. GUARANTEES BEYOND GAUSSIAN MARGINALS

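The above algorithmic result can be generalized to a broader class of marginals than Gaussians, which we call $O(1)$-regular marginals.

$O(1)$-regular marginals (assumptions about the marginal over $\tilde{x}$). We make the following assumptions about the marginal distribution $\tilde{D}_x$ over $\tilde{x} \in \mathbb{R}^d$: there exist absolute constants $\beta_1, \beta_2', \beta_2, \beta_3, \beta_4, \beta_5 > 0$ and a function $\beta_0 : \mathbb{R}_+ \to \mathbb{R}_+$ such that the following hold.

(i) Approximate isotropicity and bounded fourth moments: for every unit vector $u \in \mathbb{R}^d$,
$$\mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\left[\langle u, \tilde{x}\rangle^2\right] \in [1/\beta_2', \beta_2], \quad \text{and} \quad \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\left[\langle u, \tilde{x}\rangle^4\right] \le \beta_4.$$

(ii) Anti-concentration: there exists an absolute constant $\beta_3 > 0$ such that for every unit vector $\tilde{u} \in \mathbb{R}^d$ and $\delta > 0$,
$$\sup_{t \in \mathbb{R}} \mathbb{P}_{\tilde{x} \sim \tilde{D}_x}\left[\langle \tilde{u}, \tilde{x}\rangle \in (t - \delta, t + \delta)\right] \le \min\{\beta_3 \delta, 1\}.$$

(iii) Spread out: there exists $\beta_0 : \mathbb{R}_+ \to \mathbb{R}_+$ such that $\beta_0(|b_v|) > 0$ is a constant when $|b_v|$ is a constant, and
$$\forall \tilde{v} \in S^{d-1}, \quad \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\left[\sigma(\tilde{v}^\top \tilde{x} + b_v)\right] \ge \beta_0(|b_v|).$$

(iv) 2-D projections: in every 2-dimensional subspace of $\mathbb{R}^d$ spanned by orthonormal unit vectors $\tilde{u}_1, \tilde{u}_2 \in \mathbb{R}^d$, there is a set $G_{\tilde{u}_1, \tilde{u}_2} \subset \mathbb{R}$ such that $\mathbb{P}_{\tilde{x} \sim \tilde{D}_x}[\tilde{u}_2^\top \tilde{x} \in G_{\tilde{u}_1, \tilde{u}_2}] = 1 - o(1)$, and
$$\forall t \in G_{\tilde{u}_1, \tilde{u}_2}, \quad \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\left[\sigma(\tilde{u}_1^\top \tilde{x}) \,\middle|\, \tilde{u}_2^\top \tilde{x} = t\right] \ge \beta_5 \cdot \mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\left[\sigma(\tilde{u}_1^\top \tilde{x})\right].$$
In other words, the conditional expectation of $\sigma(\tilde{u}_1^\top \tilde{x})$ is not much smaller after conditioning on the projection in an orthogonal direction $\tilde{u}_2$, for most values of $\tilde{u}_2^\top \tilde{x}$. Note that for a Gaussian $N(0, I)$, the random variables $\tilde{u}_1^\top \tilde{x}$ and $\tilde{u}_2^\top \tilde{x}$ are independent, so this condition holds with $\beta_5 = 1$ and $G_{\tilde{u}_1, \tilde{u}_2} = \mathbb{R}$.

We remark that the Gaussian distribution $N(0, I)$ is $O(1)$-regular, i.e., all the constants $\beta_1, \beta_2, \beta_2', \beta_5 = 1$, $\beta_3 \le 2$, and $\beta_0(b_v) = \mathbb{E}_{g \sim N(0,1)}[\sigma(g + b_v)] > 0$ for all $b_v \in (-\infty, \infty)$; in fact $\beta_0$ is an increasing function that is 0 only at $-\infty$. We also note that assumptions of this flavor have been used in prior works including Vardi et al. (2021), which inspired parts of our analysis. In particular, Vardi et al. (2021) assume a lower bound on the density for any 2-dimensional marginal; our assumption (iv) on the 2-dimensional marginals is qualitatively weaker (it is potentially satisfied even by discrete distributions), and moreover we only need the condition to be satisfied for a large fraction of values of $\tilde{u}_2^\top \tilde{x}$ (and not all). See Section B for the generalized version of our main theorem.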
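For the Gaussian case, the spread-out quantity $\mathbb{E}_{g \sim N(0,1)}[\sigma(g + b)]$ mentioned above has the standard closed form $b\,\Phi(b) + \varphi(b)$, where $\Phi$ and $\varphi$ are the standard normal CDF and density. The sketch below evaluates this closed form and compares it with a Monte Carlo estimate. The function names `beta0_gaussian` and `beta0_monte_carlo` are hypothetical and chosen only for this illustration.

```python
import math
import random


def beta0_gaussian(b):
    """Closed form of E[max(g + b, 0)] for g ~ N(0, 1): b * Phi(b) + phi(b)."""
    pdf = math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))
    return b * cdf + pdf


def beta0_monte_carlo(b, n=200_000, seed=0):
    """Monte Carlo estimate of the same expectation, for comparison."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(0.0, 1.0) + b, 0.0) for _ in range(n)) / n


for b in (-1.0, 0.0, 0.5, 1.0):
    print(b, beta0_gaussian(b), beta0_monte_carlo(b))
```

Since the derivative of $b\,\Phi(b) + \varphi(b)$ with respect to $b$ is $\Phi(b) > 0$, the closed form is strictly increasing and positive for every finite $b$, which matches the remark that the Gaussian spread-out function vanishes only at $-\infty$.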

2. RELATED WORK

The agnostic ReLU regression problem that we consider has been studied in a variety of settings. In the realizable setting, or when the noise is stochastic with zero mean, i.e., $\mathbb{E}[y \mid x]$ is a ReLU function, the learning problem is known as isotonic regression and can be solved efficiently via the GLM-tron algorithm (Kakade et al., 2011; Kalai & Sastry, 2009). Distributions generated by a 1-layer ReLU neural network under the realizable setting can also be learned efficiently (Wu et al., 2019). In the absence of any assumptions on the distribution of $y \mid x$, the work of Goel & Klivans (2019) provided an efficient algorithm that achieves $O(\mathrm{OPT}^{2/3}) + \varepsilon$ error under Gaussian and log-concave marginals in the zero-bias setting. The authors also show that it is hard to achieve an additive bound of $\mathrm{OPT} + \varepsilon$ via statistical query (SQ) algorithms (Kearns & Valiant, 1994). For the case of zero bias and any marginal over the unit sphere, the work of Goel et al. (2017) provides agnostic learning algorithms for the ReLU regression problem that run in time exponential in $1/\varepsilon$ and achieve an error bound of $\mathrm{OPT} + \varepsilon$. The recent work of Diakonikolas et al. (2020) improved the upper bound of Goel & Klivans (2019) to $O(\mathrm{OPT}) + \varepsilon$ by designing an efficient algorithm that performs gradient descent on a convex surrogate for the loss $L(w)$; very recently they also obtained near optimal sample complexity with a regularized loss (Diakonikolas et al., 2022). Note that all of the above works that study the fully agnostic setting consider the setting where the bias terms are not present. Vardi et al. (2021) provide a tighter analysis of gradient descent that also extends to the case of non-zero bias. However, that analysis only applies in the realizable setting, i.e., when $\mathrm{OPT}$ is zero. Our main result improves over these works by providing a dimension-independent error bound that applies to the case of non-zero bias as well.

Recent works of

There is also a long line of work analyzing gradient descent for broader settings. The works of Ge et al. (2015; 2018), Jin et al. (2017), Anandkumar & Ge (2016), and Soltanolkotabi (2017) show convergence of gradient descent updates to approximate stationary points in non-convex settings under suitable assumptions on the function being optimized. Another line of work considers the global convergence properties of gradient descent. These works establish that gradient descent on highly overparameterized neural networks converges to the global optimum of the empirical loss over a finite set of data points (Allen-Zhu et al., 2019; Du et al., 2019; Jacot et al., 2021; Zhong et al., 2017; Chizat & Bach, 2018; Lee et al., 2019; Arora et al., 2019). Yet another line of work considers the realizable setting where data is generated from an unknown neural network of small depth and width. These works analyze the local convergence properties of gradient descent when starting from a suitably close initial point (Bartlett et al., 2018; Zou et al., 2020).

3. PRELIMINARIES

We consider agnostically learning a single ReLU neuron with bias through gradient descent in the supervised learning setting. We assume we are given data $(x, y)$, where $x \in \mathbb{R}^{d+1}$ follows the standard Gaussian distribution $N(0, I)$ in the first $d$ coordinates and the $(d+1)$-th coordinate is the constant 1. The labels $y \in \mathbb{R}$ may be arbitrarily correlated with $x$ and $\sigma(w^\top x)$. Throughout the paper, we use $\tilde{w}, \tilde{v}, \tilde{x}$ to denote the first $d$ coordinates of $w, v, x$ respectively, with the last coordinate of $w$ being $b_w \in \mathbb{R}$ (similarly $b_v \in \mathbb{R}$ for $v$). Therefore, $w^\top x$ is in fact $\tilde{w}^\top \tilde{x} + b_w$. In the analysis, we compare the current iterate $w$ to an optimizer $v$ of the loss $L(w)$:

$$v := \arg\min_{w \in H} L(w), \quad \text{where } L(w) = \frac{1}{2}\, \mathbb{E}_{(x,y)\sim D}\left[(\sigma(w^\top x) - y)^2\right],$$

and the hypothesis set is $H = \{w = (\tilde{w}, b_w) : \|\tilde{w}\|_2 \in [1/C_1, C_1],\ |b_w| \le C_2\}$, where $C_1$ and $C_2$ are absolute constants. This ensures that the relative bias $|b_w| / \|\tilde{w}\|_2$ is bounded; as described earlier, Appendix C allows us to assume $\|\tilde{w}\|_2 \in [1/C_1, C_1]$ without loss of generality. As we are in the agnostic setting, there may be no $w$ that achieves zero loss. We can split the loss function $L(w)$ into two components, one of which is $F(w)$, defined by

$$F(w) := \frac{1}{2}\, \mathbb{E}\left[(\sigma(w^\top x) - \sigma(v^\top x))^2\right], \qquad \nabla F(w) := \mathbb{E}\left[(\sigma(w^\top x) - \sigma(v^\top x))\, \sigma'(w^\top x)\, x\right]. \tag{4}$$

We will often refer to $F(w)$ as the realizable loss, since it captures the difference between $w$ and $v$; in the realizable setting $L(w) = F(w)$. Note that $F(v) = 0$.

Gradient of the Loss. The gradient of $L(w)$ with respect to $w$ is

$$\nabla L(w) = \mathbb{E}\left[(\sigma(w^\top x) - y)\, \sigma'(w^\top x)\, x\right], \tag{5}$$

where $\sigma'(\cdot)$ is the derivative of $\sigma(\cdot)$, defined as $\sigma'(z) = \mathbb{1}\{z \ge 0\}$. Note that the ReLU function $\sigma(z)$ is differentiable everywhere except at $z = 0$. Following standard convention in this literature, we define $\sigma'(0) = 1$. The exact value of $\sigma'(0)$ has no effect on our results. We can also decompose $\nabla L(w)$ as

$$\nabla L(w) = \mathbb{E}\left[(\sigma(w^\top x) - \sigma(v^\top x))\, \sigma'(w^\top x)\, x\right] + \mathbb{E}\left[(\sigma(v^\top x) - y)\, \sigma'(w^\top x)\, x\right]. \tag{6}$$

Therefore,

$$\nabla L(w) = \nabla F(w) + \mathbb{E}\left[(\sigma(v^\top x) - y)\, \sigma'(w^\top x)\, x\right]. \tag{7}$$

Gradient Descent. Our paper focuses on the standard gradient descent algorithm with a fixed learning rate $\eta > 0$. We initialize at some point $w_0 \in \mathbb{R}^{d+1}$, and at each iteration $t \in \mathbb{N}$ we set $w_{t+1} = w_t - \eta \nabla L(w_t)$. We do not optimize the iteration count in this paper; hence it is instructive to think of $\eta$ as a non-negligible parameter that can be set to be sufficiently small (e.g., an inverse polynomial for polynomial time guarantees).

Simplification. For the sake of exposition we assume that $\|\tilde{v}\|_2 = 1$; the same analysis goes through when $\|\tilde{v}\|_2 \in [1/C_1, C_1]$ as well. Moreover, Proposition C.1 shows that assuming that $\|\tilde{v}\|_2$ is normalized is without loss of generality. Note that we cannot make such a simplifying assumption about the vectors $w_t = (\tilde{w}_t, b_{w_t})$ in the intermediate iterations. Finally, please see Section B for the weaker distributional assumptions and the corresponding guarantees.
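The gradient formula (5) and the update rule above translate directly into an empirical procedure. The following is a minimal sketch of gradient descent on the empirical squared loss, assuming noiseless labels from a hypothetical target `v`, a hand-picked starting point `w0` (not the random initialization analyzed in Section 5), and illustrative values for the step size and iteration count. It is not the population gradient descent analyzed in the paper.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def empirical_loss(w, sample):
    """Empirical version of L(w) = 0.5 * E[(relu(w.x) - y)^2]."""
    return sum(0.5 * (relu(dot(w, x)) - y) ** 2 for x, y in sample) / len(sample)


def empirical_gradient(w, sample):
    """Empirical version of grad L(w) = E[(relu(w.x) - y) * 1{w.x >= 0} * x]."""
    grad = [0.0] * len(w)
    for x, y in sample:
        z = dot(w, x)
        if z >= 0.0:  # sigma'(z) = 1{z >= 0}, using the convention sigma'(0) = 1
            residual = z - y
            for i, xi in enumerate(x):
                grad[i] += residual * xi
    return [g / len(sample) for g in grad]


def gradient_descent(w0, sample, eta, steps):
    """Return the list of iterates w_0, ..., w_steps of fixed-step gradient descent."""
    w = list(w0)
    iterates = [list(w)]
    for _ in range(steps):
        g = empirical_gradient(w, sample)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        iterates.append(w)
    return iterates


rng = random.Random(1)
d = 2
v = [1.0, 0.0, 0.5]  # hypothetical target (tilde v, b_v)
sample = []
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)] + [1.0]  # append constant coordinate
    sample.append((x, relu(dot(v, x))))

w0 = [0.5, 0.5, 0.0]  # illustrative start with zero bias
iterates = gradient_descent(w0, sample, eta=0.1, steps=200)
initial_loss = empirical_loss(iterates[0], sample)
final_loss = empirical_loss(iterates[-1], sample)
```

All iterates are kept because the guarantee of Theorem 1.1 is for one of the intermediate iterates rather than the last one.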

4. OVERVIEW OF THE ANALYSIS (PROOF OF THEOREM 1.1)

We now provide an overview of our analysis. For complete proofs of the lemmas and propositions, please refer to the supplementary material (Appendix A). Recall that our goal throughout the learning process is to find a $w \in \mathbb{R}^{d+1}$ such that $L(w)$ is comparable to $\mathrm{OPT} = L(v)$. To accomplish this, we aim to find $w$ that is close to $v$, i.e., such that $\|w - v\|$ is small. Approximating $v$ suffices to achieve an error close to $\mathrm{OPT}$, since we can upper-bound $L(w)$ as

$$L(w) = \frac{1}{2}\,\mathbb{E}\left[(\sigma(w^\top x) - y)^2\right] = \frac{1}{2}\,\mathbb{E}\left[(\sigma(w^\top x) - \sigma(v^\top x) + \sigma(v^\top x) - y)^2\right] \le 2 \cdot \frac{1}{2}\,\mathbb{E}\left[(\sigma(w^\top x) - \sigma(v^\top x))^2\right] + 2 \cdot \frac{1}{2}\,\mathbb{E}\left[(\sigma(v^\top x) - y)^2\right] = 2F(w) + 2\,\mathrm{OPT}.$$

To formalize our intuition above, we adopt a proof strategy similar to the one used in Frei et al. (2020). Namely, we argue that when optimizing with respect to the agnostic loss $L(w_t)$, we always make some non-trivial progress due to a decrease in $\|w_t - v\|$ and due to a decrease in $F(w_t)$ (which is just the realizable portion of the loss). Moreover, whenever we stop making progress, we argue that at this point either $\|w_t - v\| \le O(\sqrt{\mathrm{OPT}})$ or $\|\nabla F(w_t)\| \le O(\sqrt{\mathrm{OPT}})$; in both cases, this iterate already achieves an error of $O(\mathrm{OPT})$ due to Lemma 4.4 and Lemma 4.3.

Challenges in arguing progress.

At a high level, the analysis of gradient descent follows a similar approach to Frei et al. (2020), which only handles zero bias. Yet there are several new ideas needed to obtain the stronger $O(\mathrm{OPT})$ guarantee even for the zero-bias case. Moreover, allowing non-zero bias terms imposes extra technical challenges. For example, the probability measure of $\{w^\top x \ge 0,\ v^\top x \ge 0\}$ under Gaussian distributions, which is vital to deriving the gain in each gradient descent step, does not have a closed-form expression when bias is present. Furthermore, we cannot afford to lose any dimension-dependent factors or assume boundedness. To address these difficulties, more detailed analyses (e.g., Lemmas 4.1 and 4.2) are needed.

Tackling non-zero bias terms also requires additional assumptions when initializing $w_0$. The initializer finds a $w_0$ such that $F(w_0)$ is strictly less than $F(0)$ by a constant amount $\delta > 0$ (this is inspired by Vardi et al. (2021); however, in their case $\delta$ can have an inverse-polynomial dependence on the dimension). In fact, our multiscale random initialization and the improved analysis are crucial to obtaining a dimension-independent bound on the error. The high-level intuition behind why this property is useful is that it ensures that gradient descent does not get trapped around a highly non-smooth region (e.g., when $w = 0$) by making it start somewhere better, so that $w$ keeps moving closer to $v$. Moreover, in our case the analysis is more challenging to implement than in Vardi et al. (2021) because of the agnostic setting: Vardi et al. (2021) rely heavily on the realizability assumption to simplify the analysis.

We also highlight our improvements in the dependence on the dimension $d$. In previous works, the guarantees of the algorithm have a dependence on $d$ either explicitly or implicitly. For instance, in Frei et al. (2020) the $O(\sqrt{\mathrm{OPT}})$ guarantee for ReLU neurons includes a coefficient in terms of $B_X$ (the upper bound for $\|x\|$), which for Gaussian inputs is in fact $\sqrt{d}$; and in Vardi et al. (2021), the gain for each gradient descent iteration comes with a dependency of $c^8$ on $c$ (the upper bound for $\|x\|$), which for Gaussian inputs is $d^4$. In contrast, we avoid such dependencies on the dimension $d$ in order to obtain our guarantees.

We first establish two important lemmas that we later use to prove progress in each iteration. As stated in the preliminaries, we assume in the rest of the section that $\|\tilde{v}\|_2 = 1$. The first lemma gives a lower bound on the measure of the region where both $\sigma(v^\top x)$ and $\sigma(w_t^\top x)$ are non-zero. Our inductive hypotheses will ensure that this lower bound is a constant (if $|b_v|$ is a constant).

Lemma 4.1 (Lower bound on the measure of the intersection). Suppose the marginal distribution $\tilde{D}_x$ over $\tilde{x}$ is $O(1)$-regular. There exists an absolute constant $c > 0$ such that for all $\delta > 0$, if $F(w) \le F(0) - \delta$ then

$$\mathbb{P}\left[w^\top x \ge 0,\ v^\top x \ge 0\right] \ge \frac{\delta^2}{c\, \|w\|_2^4\, \|v\|_2^4} = \frac{\delta^2}{c\, \|w\|_2^4\, (1 + |b_v|^2)^2}.$$

With Lemma 4.1, the following lemma allows us to get an improvement on the realizable portion of the loss function as long as the gradient is non-negligible. We state and prove this lemma for the general case of $O(1)$-regular marginal distributions.

Lemma 4.2 (Improvement from the first order term). Suppose the marginal over $\tilde{x}$ is $O(1)$-regular. There exist absolute constants $c_1, c_2 > 0$ such that for any $\delta > 0$, if $\|v\|_2, \|w\|_2 \le B$ and $F(w) \le F(0) - \delta$, then

$$\langle \nabla F(w), w - v \rangle \ge \gamma \|w - v\|^2, \quad \text{where } \gamma = \frac{c_1 \delta^9}{B^{28}}.$$

The constants $c_1, c_2$ depend on the constants $\beta_1, \beta_2', \beta_2, \beta_4$, etc. in the regularity assumption on $\tilde{D}_x$. We remark that in our setting of parameters $\delta = \Omega(1)$ and $B = O(1)$, and hence we conclude that $\langle \nabla F(w), w - v \rangle \ge \Omega(\|w - v\|_2^2)$. Please refer to Appendix A for all the complete proofs.

4.1. MAIN PROOF STRATEGY

With these two key lemmas, we are now ready to discuss the proof overview of the main theorem (Theorem 1.1). We inductively maintain two invariants in every iteration of the algorithm:

(A) $\|w_t - v\|_2 \le O(1)$, and
(B) $F(0) - F(w_t) = \Omega(1)$.

These two invariants are true at $t = 0$ due to our initialization $w_0$: Lemma B.3 guarantees that, with at least constant probability $\Omega(1)$, both invariants hold for $w_0$. The proof that both invariants continue to hold follows from the progress made by the algorithm due to a decrease in both $\|w_t - v\|_2$ and $F(w_t)$ (note that we only need to show that they do not increase to maintain the invariants).

The argument consists of two parts. First, assuming $F(w_t) \le F(0) - \delta$ holds (for some constant $\delta > 0$), we establish that whenever $\|w_t - v\|^2$ exceeds a sufficiently large constant multiple of $\mathrm{OPT}$, gradient descent always makes progress, i.e., $\|w_t - v\|^2 - \|w_{t+1} - v\|^2$ is lower bounded. Next, we argue that if $w_0$ is initialized such that $F(w_0) \le F(0) - \delta$ for some constant $\delta > 0$, then throughout gradient descent $F(w_t)$ always decreases, i.e., the inequality $F(w_t) \le F(w_0) \le F(0) - \delta$ always holds.

However, unlike Vardi et al. (2021), who focus on the realizable setting, analyzing gradient descent on the agnostic loss $L(w)$ is more challenging, since the update depends on $\nabla L(w)$ and not $\nabla F(w)$. In fact, the additional term from the "non-realizable" portion of the loss $L(w)$ can overwhelm the contribution from the realizable loss when either $\|\nabla F(w_t)\|_2 \le O(\sqrt{\mathrm{OPT}})$ or $\|w_t - v\|_2 \le O(\sqrt{\mathrm{OPT}})$. The following two lemmas argue that in both of these cases, the current iterate already achieves $O(\mathrm{OPT})$ error (and this iterate will be the $w_T$ that satisfies the guarantee of Theorem 1.1).

Lemma 4.3 (Success if $\|\nabla F\| \le O(\sqrt{\mathrm{OPT}})$). Suppose $B, \delta > 0$ are constants such that $\|v\|_2, \|w\|_2 \le B$ and $F(w) \le F(0) - \delta$. Then there exists a constant $C_G > 0$ such that if $\|\nabla F(w)\| \le C_G \sqrt{\mathrm{OPT}}$ then $\|w - v\|_2 \le O(\sqrt{\mathrm{OPT}})$.

Proof.
We first apply Lemma 4.2 to conclude that $\langle \nabla F(w), w - v \rangle \ge \gamma \|w - v\|^2$ for some constant $\gamma > 0$ (since $B, \delta > 0$ are constants). Hence, by the Cauchy-Schwarz inequality,

$$\|\nabla F(w)\|\, \|w - v\| \ge \langle \nabla F(w), w - v \rangle \ge \gamma \|w - v\|^2.$$

Thus $\|w - v\|_2 \le \|\nabla F(w)\| / \gamma = O(\sqrt{\mathrm{OPT}})$, which implies the lemma.

We now argue that if $\|w_t - v\| \le O(\sqrt{\mathrm{OPT}})$, then $F(w_t) \le O(\mathrm{OPT})$ through the following lemma; this is stated and proven for $O(1)$-regular distributions.

Lemma 4.4 (Small $\|w_t - v\|$ implies small $F(w_t)$). Assume $\tilde{D}_x$ is $O(1)$-regular with parameters defined above. If $\|w_t - v\|_2 \le O(\sqrt{\mathrm{OPT} + \varepsilon})$ for some $\varepsilon > 0$, then $F(w_t) \le O(\mathrm{OPT} + \varepsilon)$.

Proof. Since the ReLU function is 1-Lipschitz (i.e., $|\sigma(z) - \sigma(z')| \le |z - z'|$),

$$F(w_t) = \frac{1}{2}\,\mathbb{E}\left[(\sigma(w_t^\top x) - \sigma(v^\top x))^2\right] \le \frac{1}{2}\,\mathbb{E}\left[(w_t^\top x - v^\top x)^2\right] = \frac{\|w_t - v\|^2}{2}\, \mathbb{E}\left[(u^\top x)^2\right],$$

where $u = \frac{w_t - v}{\|w_t - v\|}$, which gives the last equality. Now, using Young's inequality, we get

$$\mathbb{E}\left[(u^\top x)^2\right] = \mathbb{E}\left[(\tilde{u}^\top \tilde{x} + b_u)^2\right] \le 2\,\mathbb{E}\left[(\tilde{u}^\top \tilde{x})^2\right] + 2 b_u^2 \le 2\beta_2 + 2 b_u^2 \le O(1)$$

due to the regularity assumption on $\tilde{D}_x$. Hence

$$F(w_t) \le \frac{\|w_t - v\|^2}{2} \cdot O(1) \le O(\|w_t - v\|_2^2) \le O(\mathrm{OPT} + \varepsilon),$$

which concludes the proof.

Proving progress in $\|w_t - v\|$ and $F(w_t)$. To show that $\|w_t - v\|$ decreases, we establish the following lemma.

Lemma 4.5 (Decrease in $\|w_t - v\|$). Assume that at time $t$, $F(w_t) \le F(0) - \delta$ where $\delta > 0$ is a constant, and that $\tilde{D}_x$ is $O(1)$-regular. For suitable constants $\eta, C_p, C' > 0$ that depend on $\gamma$ (defined as in Lemma 4.2) and on the problem parameters, if for some $\varepsilon > 0$

$$\|w_t - v\|^2 > \frac{1}{C_p^2}\,(\mathrm{OPT} + \varepsilon), \quad \text{then} \quad \|w_{t+1} - v\|^2 \le \|w_t - v\|^2 - \eta C' (\mathrm{OPT} + \varepsilon).$$

As a direct consequence of Lemma 4.5, we obtain the following inductive statement: for every $t$, either (a) $\|w_t - v\|^2 - \|w_{t+1} - v\|^2 \ge \eta C (\mathrm{OPT} + \varepsilon)$ for some constant $C > 0$, or (b) $\|w_t - v\|^2 \le O(\mathrm{OPT} + \varepsilon)$ holds. Observe that when (b) holds, Lemma 4.4 implies that the loss is $O(\mathrm{OPT} + \varepsilon)$; hence we may assume that (b) does not yet hold at time $t$, and it suffices to show that (a) is true.
Additionally, note that at each timestep $t$,

$$\|w_t - v\|^2 - \|w_{t+1} - v\|^2 = 2\eta\, \langle \nabla L(w_t), w_t - v \rangle - \eta^2 \|\nabla L(w_t)\|^2.$$

Therefore, to lower-bound $\|w_t - v\|^2 - \|w_{t+1} - v\|^2$, we give a lower bound for $\langle \nabla L(w_t), w_t - v \rangle$ and an upper bound for $\|\nabla L(w_t)\|^2$. To show that $F(w_t)$ decreases, we show that at time $t$, if gradient descent continues to make progress towards $v$, then $F(w_{t+1}) \le F(w_t) \le F(0) - \delta$. The progress in $F(w)$ relies crucially on Lemma 4.2. Please see Appendix A in the supplementary material for the detailed proofs.
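For completeness, the per-step identity for $\|w_t - v\|^2 - \|w_{t+1} - v\|^2$ used above follows from expanding the square after substituting the gradient descent update. The short derivation below is a worked restatement under the update rule $w_{t+1} = w_t - \eta \nabla L(w_t)$ and introduces no new assumptions.

```latex
\begin{align*}
w_{t+1} - v &= (w_t - v) - \eta \nabla L(w_t), \\
\|w_{t+1} - v\|^2
  &= \|w_t - v\|^2
   - 2\eta \langle \nabla L(w_t),\, w_t - v \rangle
   + \eta^2 \|\nabla L(w_t)\|^2, \\
\|w_t - v\|^2 - \|w_{t+1} - v\|^2
  &= 2\eta \langle \nabla L(w_t),\, w_t - v \rangle
   - \eta^2 \|\nabla L(w_t)\|^2 .
\end{align*}
```

In particular, the distance to $v$ strictly decreases in a step exactly when $\eta \|\nabla L(w_t)\|^2 < 2 \langle \nabla L(w_t), w_t - v \rangle$, which is why a lower bound on the inner product and an upper bound on the gradient norm together suffice.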

5. RANDOM INITIALIZATION

We now prove the initialization lemma assuming weak conditions on the marginal distribution $\tilde{D}_x$ over $\tilde{x} \in \mathbb{R}^d$ (recall that the standard Gaussian $N(0, I)$ also satisfies all of these properties). We initialize $w = (\tilde{w}, b_w)$ with $b_w = 0$ and $\tilde{w}$ drawn from a spherically symmetric distribution $D_w$. The length is chosen from a distribution $D_\rho$ so that it has a non-negligible probability in any constant length interval $(a_1 \|v\|_2, a_2 \|v\|_2)$, where $a_2 > a_1 > 0$ are constants: our specific choice picks the correct length scale with non-negligible probability, and is reasonably spread out.

Our new random initialization and the improved analysis are crucial in obtaining the $O(\mathrm{OPT})$ guarantee even with non-zero bias. Our multiscale random initialization scheme tries out different length scales and ensures that with non-negligible probability we get an initializer that satisfies the required property. For the correct guess of the length scale of $\|\tilde{v}\|_2$ (up to a factor of 2), our improved analysis (see (10)) shows that the random spherically symmetric initialization produces, with constant probability, an initializer $w$ with $F(w) - F(0) = -\Omega(\|\tilde{v}\|_2^2)$. When the length scale $\|\tilde{v}\|_2 \in [1/M, M]$ is unknown, the random initialization can try out the different length scales in geometric progression, i.e., the length scale $\tau$ is chosen uniformly at random from $\{2^j : j \in \mathbb{Z},\ -\log M \le j \le \log M\}$.

Multiscale random initialization

We are given a parameter $M$ such that $\|v\|_2 \in [2^{-\log M}, 2^{\log M}]$ (note that $M$ can have large dependencies on $d$ and other parameters; our guarantees will be polynomial in $\log M$). A random initializer $w = (\tilde{w}, 0)$ is drawn from $D_{\mathrm{unknown}}(M)$ as follows:

1. Pick $j$ uniformly at random from $\{-\lceil \log M \rceil, -\lceil \log M \rceil + 1, \dots, -1, 0, 1, \dots, \lceil \log M \rceil\}$.
2. Draw $\rho \in \mathbb{R}_+$ according to $D_\rho$ as follows: first pick $g \sim N(0, 1)$ and set $\rho = 2^j |g|$.
3. Draw a uniformly random unit vector $\hat{w} \in \mathbb{R}^d$ and output $\tilde{w} = \rho \hat{w}$. The initializer is $(\tilde{w}, 0)$.

We prove the following claim about the multiscale random initializer (Lemma 5.1): with probability at least $c_2(v) / \log M$,

$$F(w) \le F(0) - c_1(v)^2 \|\tilde{v}\|_2^2, \quad \text{and} \quad \|w - v\| \le c_3(v) \|\tilde{v}\|_2.$$

In the above lemma, if $\tilde{D}_x$ is a standard Gaussian $N(0, I)$, the descriptions of these constants become much simpler, as described in Section B.3.

The guarantees for the multiscale random initialization scheme follow from the analysis of random initialization when the length scale $\|\tilde{v}\|_2 = 1$ is known. Without loss of generality (see Section C), we can assume that $\|\tilde{v}\|_2 = 1$ (or $\Theta(1)$). For convenience, we set $D_\rho$ to be the absolute value of a standard Gaussian $N(0, 1)$ (or of $N(0, \sigma^2)$ with $\sigma \in [1, 2]$). In this setting, we can show that for constants $c_1(v), c_2(v), c_3(v) > 0$ (these are constants when $|b_v| / \|\tilde{v}\|_2$ is bounded), with probability at least $c_2(v) > 0$ we have

$$F(w) \le F(0) - c_1(v)^2 \|\tilde{v}\|_2^2, \quad \text{and} \quad \|w - v\| \le c_3(v) \|\tilde{v}\|_2. \tag{10}$$

We remark that for random initialization to work, we only need the success probability $c_2(v) > 0$ to be non-negligible (e.g., at least an inverse polynomial). We can try $O(1/c_2(v))$ many random initializers and amplify the success probability to be very close to 1. The multiscale random initialization finds the correct length scale with probability at least $1/(\log M)$. For the rest of the overview we assume that the length $\|\tilde{v}\|_2 = 1$ is known, which is without loss of generality (see Section C).
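The three sampling steps of the multiscale initializer described above can be sketched as follows. This is an illustrative sketch only: the function name `sample_multiscale_initializer` is hypothetical, the logarithm is assumed to be base 2 (to match the scales $2^j$), and the uniform direction is obtained by normalizing a standard Gaussian vector, which is a standard way to sample uniformly from the sphere.

```python
import math
import random


def sample_multiscale_initializer(d, M, rng):
    """Draw (tilde w, 0): random scale 2^j, half-normal length, uniform direction."""
    k = math.ceil(math.log2(M))  # assumption: log is taken base 2
    # Step 1: pick j uniformly from {-k, ..., -1, 0, 1, ..., k}.
    j = rng.randint(-k, k)
    # Step 2: rho = 2^j * |g| with g ~ N(0, 1).
    rho = (2.0 ** j) * abs(rng.gauss(0.0, 1.0))
    # Step 3: uniformly random unit vector, via a normalized Gaussian vector.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    direction = [gi / norm for gi in g]
    # The bias coordinate of the initializer is set to 0.
    return [rho * u for u in direction] + [0.0]


rng = random.Random(0)
w0 = sample_multiscale_initializer(d=5, M=8.0, rng=rng)
```

Because the success probability of a single draw only needs to be non-negligible, one would typically call this sampler several times and run gradient descent from each draw.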
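By definition, the distribution of $\tilde{w} \in \mathbb{R}^d$ is spherically symmetric. Writing $\tilde{w} = \rho \|\tilde{v}\|_2 \hat{w}$, where $\hat{w}$ is the unit vector along $\tilde{w}$, we have for $x \sim \tilde{D}_x$

$$F(w) - F(0) = \frac{1}{2}\,\mathbb{E}_x\left[(\sigma(\tilde{w}^\top x) - \sigma(\tilde{v}^\top x + b_v))^2\right] - \frac{1}{2}\,\mathbb{E}_x\left[\sigma(\tilde{v}^\top x + b_v)^2\right] = \frac{\rho^2 \|\tilde{v}\|_2^2}{2}\,\mathbb{E}_x\left[\sigma(\hat{w}^\top x)^2\right] - \rho \|\tilde{v}\|_2^2\, \mathbb{E}_x\left[\sigma(\hat{w}^\top x)\, \sigma(\hat{v}^\top x + \hat{b}_v)\right] = \|\tilde{v}\|_2^2 \left(c_0 \rho^2 - 2 c_3(v) \rho\right), \tag{11}$$

where $c_0 > 0$ is a universal constant based on our assumptions about $\tilde{D}_x$ ($c_0 = 0.5$ for $x \sim N(0, I)$). One technical portion of the argument is to derive an expression for $c_3(v)$, and prove that it is a constant independent of the dimension. This forms the bulk of the argument and requires symmetrization and careful use of anti-concentration bounds. Once we establish this, we need to prove that the first part of (10) holds with non-negligible probability. From (11), we note that for any $\rho \in$ ...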

6. CONCLUSION

In this paper, we provided a convergence analysis of gradient descent for learning a single neuron with general ReLU activations (with non-zero bias terms) and gave improved guarantees under comparable assumptions also made in previous works. We addressed multiple challenges for analyzing general ReLU activations with non-zero bias terms throughout our analyses that may lead to better understanding of the dynamics of gradient descent when learning ReLU neurons. However, our analysis does not apply to modern neural networks that have multiple nodes and layers. The major open direction is to generalize current performance guarantees for networks of multiple neurons and higher depth.



One can pick many other spread out distributions in place of the absolute value of a Gaussian.



and $\mathbb{E}_{\tilde{x} \sim \tilde{D}_x}\left[\langle u, \tilde{x}\rangle^4\right] \le \beta_4$.

Frei et al. (2020) and Vardi et al. (2021) analyze gradient descent for the ReLU regression problem. Frei et al. (2020) provide an $O(\sqrt{\mathrm{OPT}})$ guarantee (along with some additional problem-dependent terms) for the case of zero bias and bounded distributions. When considering distributions such as the standard Gaussian $N(0, I)$, the bound of Frei et al. (2020) incurs a dimension-dependent term of the form $O(\sqrt{d} \cdot \sqrt{\mathrm{OPT}})$ in the error bound.

In short, $L(w) \le 2F(w) + 2\,\mathrm{OPT}$ through Young's inequality. The realizable portion of the loss $F(w)$ becomes $O(\mathrm{OPT})$ when $\|w - v\| \le O(\sqrt{\mathrm{OPT}})$ (see Lemma 4.4 for a proof), and as a consequence we get $O(\mathrm{OPT})$ error in total.

Lemma 5.1. There exist $c_1(v), c_2(v), c_3(v) > 0$ which only depend on $b_v / \|\tilde{v}\|_2$ (and not on the dimension), and are all absolute constants when $|b_v| / \|\tilde{v}\|_2 = O(1)$, such that the following holds. Suppose $w = (\tilde{w}, b_w = 0)$ is drawn according to the distribution $D_{\mathrm{unknown}}(M)$ described above for some given $M \ge 1$ satisfying $\|v\|_2 \in [1/M, M]$. Then with probability at least $c_2(v) / \log M$, the initializer $w$ satisfies both bounds in (10).

Overview of the proof of Lemma 5.1. We now outline the argument of Lemma 5.1. Please refer to Section B.3 and Section B.4 for the full proofs. For convenience we define $\hat{b}_v := b_v / \|\tilde{v}\|_2$ and $\hat{v} := \tilde{v} / \|\tilde{v}\|_2$, so that they are normalized with respect to the length of $\tilde{v}$. The conditions of the lemma assume that $|\hat{b}_v| = O(1)$.

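We write $\tilde{w} = \rho \|\tilde{v}\|_2 \hat{w}$ with $\hat{w}$ being the unit vector along $\tilde{w}$. For a fixed $\rho \in \mathbb{R}_+$, $\hat{w}$ (and hence $\tilde{w}$) is picked along a uniformly random direction, i.e., $\hat{w} \sim U(S^{d-1})$. Hence, for $x \sim \tilde{D}_x$, the difference $F(w) - F(0)$ expands as in (11).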

Moreover, $\rho$ is distributed as the absolute value of a mean-zero normal with variance in $[1, 4]$, so we get from anti-concentration bounds that $\rho$ is in the right interval with probability at least $c_5(v) > 0$, which is a constant when $|\hat{b}_v|$ is a constant. Now, conditioned on the event that $\rho$ lies in this interval, let $Z$ be a random variable that captures the distribution of $F((\rho \|\tilde{v}\|_2 \hat{w}, 0)) - F(0)$ as $\hat{w}$ is drawn uniformly from the unit sphere $S^{d-1}$. Note that

$$\mathbb{E}[Z] \le -\frac{\|\tilde{v}\|_2^2\, c_3(v)^2}{2 c_0}, \qquad \mathrm{Var}[Z] \le \mathbb{E}\left[F((\rho \|\tilde{v}\|_2 \hat{w}, 0))^2\right] \le O(1) \cdot \|\tilde{v}\|_2^4,$$

where the $O(1)$ factor depends only on the moment bounds of $\tilde{D}_x$ and on $\hat{b}_v^4$. Further, from the Cantelli-Chebyshev one-sided tail inequality, we have $Z \le \mathbb{E}[Z]/2$ with probability at least $c_6(v) > 0$, where $c_6(v)$ is a constant when $\hat{b}_v$ is a constant. This allows us to conclude that $F(w) < F(0) - \Omega(\|\tilde{v}\|_2^2)$ with probability at least $c_5(v) \cdot c_6(v)$, which is a constant when $\hat{b}_v$ is a constant. Finally, $\|w - v\|_2 \le \|w\|_2 + \|v\|_2$ is upper bounded because of our choice of $\rho$ and because $\|\tilde{v}\|_2$ and $\hat{b}_v$ are bounded by assumption. See Sections B.3 and B.4 for the full proofs.

funding

* The last two authors are supported by the National Science Foundation (NSF) under Grant No. CCF-1652491 and CCF 1934931. The last author was also funded by a Google Research Scholar award.

