AGNOSTIC LEARNING OF GENERAL RELU ACTIVATION USING GRADIENT DESCENT

Abstract

We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario where the bias of the ReLU function is non-zero. Our main result establishes that, starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error within a constant factor of the optimal; i.e., it is guaranteed to achieve an error of $O(OPT)$, where $OPT$ is the error of the best ReLU function. This is a significant improvement over existing guarantees for gradient descent, which only guarantee an error of $O(\sqrt{d \cdot OPT})$ even in the zero-bias case (Frei et al., 2020). We also provide finite sample guarantees, and we obtain similar guarantees for a broader class of marginal distributions beyond Gaussians.

1. INTRODUCTION

Gradient descent forms the bedrock of modern optimization algorithms for machine learning. Despite a long line of work on understanding and analyzing gradient descent iterates, several outstanding questions remain on whether they can provably learn important classes of problems. In this work we study one of the simplest learning problems where the properties of gradient descent are not well understood, namely agnostic learning of a single ReLU function. More formally, let $\tilde{\mathcal{D}}$ be a distribution over $\mathbb{R}^d \times \mathbb{R}$. A ReLU function is parameterized by $w = (\tilde{w}, b_w)$ where $\tilde{w} \in \mathbb{R}^d$ and $b_w \in \mathbb{R}$. For notational convenience, we will consider the points to be in $\mathbb{R}^{d+1}$ by appending $\tilde{x}$ with a fixed coordinate $1$ as $x = (\tilde{x}, 1)$. Let $\mathcal{D}$ be the distribution over $\mathbb{R}^{d+1} \times \mathbb{R}$ induced by $\tilde{\mathcal{D}}$. We define the loss incurred at $w = (\tilde{w}, b_w)$ to be
$$L(w) \;=\; \frac{1}{2}\,\mathbb{E}_{(\tilde{x},y)\sim \tilde{\mathcal{D}}}\Big[\big(\sigma(\tilde{w}^\top \tilde{x} + b_w) - y\big)^2\Big] \;=\; \frac{1}{2}\,\mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\big(\sigma(w^\top x) - y\big)^2\Big].$$
Here $\sigma(x) = \max(x, 0)$ is the standard rectified linear unit popularly used in deep learning. The goal in agnostic learning of a ReLU function (or agnostic ReLU regression) is to design a polynomial-time learning algorithm that takes as input i.i.d. samples from $\mathcal{D}$ and outputs $w = (\tilde{w}, b_w)$ such that $L(w)$ compares favorably with $OPT$, given by
$$OPT \;:=\; \min_{w = (\tilde{w}, b_w) \in \mathcal{H}} \; \frac{1}{2}\,\mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\big(\sigma(w^\top x) - y\big)^2\Big].$$
Here the hypothesis set $\mathcal{H}$ that the algorithm competes with is the set of ReLU units with parameters $w = (\tilde{w}, b_w)$ whose relative bias $|b_w|/\|\tilde{w}\|_2$ is bounded. This is a non-trivial and interesting regime; when the bias is too large in magnitude, the optimal ReLU function fitting the data is either the constant zero function almost everywhere, or a linear function almost everywhere.

This agnostic learning problem has been extensively studied, and polynomial-time learning algorithms exist for a variety of settings. These include the noisy teacher setting, where $\mathbb{E}[y \mid x]$ is given by a ReLU function (Kakade et al., 2011; Mukherjee & Muthukumar, 2020), and the fully agnostic setting, where no assumption on $y$ is made (Goel & Klivans, 2019; Diakonikolas et al., 2020). In a recent work, Frei et al. (2020) analyzed the properties of gradient descent for the above agnostic learning problem when the bias term is assumed to be zero. The gradient descent based learning algorithm corresponds to the following sequence of updates starting from a suitable initializer $w_0$:
$$w_{t+1} = w_t - \eta \nabla L(w_t).$$
Frei et al. (2020) proved that, starting from zero initialization and for distributions where the marginal of $x$ satisfies some mild assumptions, gradient descent iterates produce, in polynomial time, a point $w_T$ such that $L(w_T) = O(\sqrt{OPT})$ when the domain for $x$ is bounded (for this bound it is instructive to think of $OPT < 1$; the general expression is more complicated, with some additive terms and dependencies on problem-dependent quantities).

While the above provides the first non-trivial learning guarantees for gradient descent in the case of agnostic ReLU learning, it suffers from a few key limitations. The result of Frei et al. (2020) only applies in the setting where the distribution has a bounded domain and the bias terms are zero. When the distribution is not bounded, the error of $O(\sqrt{OPT})$ also includes some dimension-dependent terms; e.g., when the marginal of $\tilde{x}$ is a standard Gaussian $N(0, I_{d\times d})$, it gives an $O(\sqrt{d \cdot OPT})$ error. Moreover, there is a natural question of improving the $O(\sqrt{OPT})$ bound on the error of gradient descent (since the most interesting regime of parameters is when $OPT \ll 1$). This is particularly intriguing given the recent result of Diakonikolas et al. (2020), which shows that, assuming zero bias, gradient descent on a convex surrogate for $L(w)$ achieves $O(OPT)$ error. This raises the question of whether the same holds for gradient descent on $L(w)$ itself. In another recent work, Vardi et al. (2021) provide convergence guarantees for gradient descent in the presence of bias terms, but under the strong realizability assumption, i.e., assuming that $OPT = 0$. To summarize the existing guarantees, to the best of our knowledge: (i) there are no existing guarantees for any polynomial-time algorithm (including gradient descent) for agnostic learning of a ReLU function with bias, and (ii) even in the zero-bias case, there is no existing guarantee for gradient descent (on the standard squared loss) that achieves $O(OPT)$ error.
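To make the objective and the update concrete, the following sketch estimates the squared loss and its gradient from samples and runs the gradient descent recursion on the lifted points $x = (\tilde{x}, 1)$. This is only an illustrative simulation, not the paper's algorithm: the helper names are ours, the teacher is a planted ReLU (so the instance is realizable rather than agnostic), and the step size and iteration count are arbitrary choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(w, X, y):
    # L(w) = (1/2) E[(sigma(w^T x) - y)^2], estimated on samples
    return 0.5 * np.mean((relu(X @ w) - y) ** 2)

def grad(w, X, y):
    # gradient of the empirical squared loss; sigma'(z) = 1{z > 0} a.e.
    z = X @ w
    return ((relu(z) - y) * (z > 0)) @ X / len(y)

rng = np.random.default_rng(0)
d, n = 5, 20000
X_tilde = rng.standard_normal((n, d))      # marginal of x~ is N(0, I_d)
X = np.hstack([X_tilde, np.ones((n, 1))])  # lift: x = (x~, 1)

# planted ReLU teacher with a non-zero bias (illustrative choice)
v = np.append(rng.standard_normal(d) / np.sqrt(d), 0.3)
y = relu(X @ v)

# random weight initialization, bias initialized to zero
w = np.append(rng.standard_normal(d) / np.sqrt(d), 0.0)
w0 = w.copy()
eta = 0.5
for t in range(500):
    w = w - eta * grad(w, X, y)            # w_{t+1} = w_t - eta * grad L(w_t)
```

On this realizable instance the empirical loss decreases from its value at the random initializer; in the agnostic setting analyzed in the paper, the same recursion is run on the population loss $L(w)$.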

1.1. OUR RESULTS

In this work we make progress on both these fronts, improving the state of the art of guarantees for gradient descent for agnostic ReLU regression. In particular, we show that when the marginal of $x$ is a Gaussian, gradient descent on $L(w)$ achieves an error of $O(OPT)$, even in the presence of bias terms that are bounded. The $O(OPT)$ guarantee that we obtain even in the zero-bias case answers an open question raised in the work of Frei et al. (2020). There are also no additional dependencies on the dimension. Given the recent statistical query lower bound of Goel & Klivans (2019), which rules out an additive guarantee of $OPT + \varepsilon$ for agnostic ReLU regression, our result shows that vanilla gradient descent on the target loss already achieves near-optimal error guarantees. Below we state our main theorem. For convenience we assume that $\|\tilde{v}\|_2$ (the norm of the optimal weight, i.e., of $v = (\tilde{v}, b_v) \in \mathcal{H}$ such that $L(v) = OPT$) is a constant; Appendix C shows why this is without loss of generality.

Theorem 1.1. Let $C_1 \ge 1$, $C_2 > 0$, $c_3 > 0$ be absolute constants. Let $\tilde{\mathcal{D}}$ be a distribution over $(\tilde{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ where the marginal over $\tilde{x}$ is the standard Gaussian $N(0, I)$. Let $\mathcal{H} = \{w = (\tilde{w}, b_w) : \|\tilde{w}\|_2 \in [1/C_1, C_1],\ |b_w| \le C_2\}$, and consider population gradient descent iterates $w_{t+1} = w_t - \eta \nabla L(w_t)$. For a suitable constant learning rate $\eta$, when starting from $w_0 = (\tilde{w}_0, 0)$ where $\tilde{w}_0$ is randomly initialized from a radially symmetric distribution, with at least constant probability $c_3 > 0$ one of the iterates $w_T$ of gradient descent after $\mathrm{poly}(d, 1/\varepsilon)$ steps satisfies $L(w_T) = O(OPT) + \varepsilon$.

Please see Section 4 for the more formal statement and proof. Note that the above guarantee applies to one of the intermediate iterates produced by gradient descent within the first $\mathrm{poly}(d, 1/\varepsilon)$ iterations. This is consistent with other convergence guarantees for gradient descent in non-realizable settings, where last-iterate guarantees typically do not exist (Frei et al., 2020).
One can always pick the iterate among the first $\mathrm{poly}(d, 1/\varepsilon)$ steps that has the smallest loss on an independent sample from the distribution $\mathcal{D}$. The above theorem proves that gradient descent obtains a bound of $O(OPT)$ when the relative bias of the optimal ReLU function is bounded (recall that $\|\tilde{v}\|_2 = \Theta(1)$ for the optimal classifier, without loss of generality, from Proposition C.1). Note that we do not constrain the gradient updates to remain in the set $\mathcal{H}$. This result significantly improves upon the existing state-of-the-art guarantees of Frei et al. (2020).
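The selection rule described above, picking the iterate with the smallest loss on an independent sample, can be sketched as follows. The helper names are ours, not from the paper; `iterates` stands for the stored gradient descent iterates and `(X_val, y_val)` for a fresh sample from the distribution.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def holdout_loss(w, X_val, y_val):
    # empirical squared loss (1/2) * mean((sigma(w^T x) - y)^2) on a holdout sample
    return 0.5 * np.mean((relu(X_val @ w) - y_val) ** 2)

def best_iterate(iterates, X_val, y_val):
    # return the stored iterate with the smallest holdout loss; this converts
    # a "some iterate is good" guarantee into an explicit output
    losses = [holdout_loss(w, X_val, y_val) for w in iterates]
    return iterates[int(np.argmin(losses))]
```

Since the validation sample is independent of the iterates, standard concentration arguments let the selected iterate inherit the guarantee of the best one up to a small additive term.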



* The last two authors are supported by the National Science Foundation (NSF) under Grants No. CCF-1652491 and CCF-1934931. The last author was also funded by a Google Research Scholar award.

