AGNOSTIC LEARNING OF GENERAL RELU ACTIVATION USING GRADIENT DESCENT

Abstract

We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario where the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error within a constant factor of the optimal, i.e., it is guaranteed to achieve an error of $O(\mathrm{OPT})$, where $\mathrm{OPT}$ is the error of the best-fitting ReLU function. This is a significant improvement over existing guarantees for gradient descent, which only guarantee an error of $O(\sqrt{d} \cdot \mathrm{OPT})$ even in the zero-bias case (Frei et al., 2020). We also provide finite-sample guarantees, and obtain similar guarantees for a broader class of marginal distributions beyond Gaussians.

1. INTRODUCTION

Gradient descent forms the bedrock of modern optimization algorithms for machine learning. Despite a long line of work on understanding and analyzing the gradient descent iterates, there remain several outstanding questions on whether they can provably learn important classes of problems. In this work we study one of the simplest learning problems where the properties of gradient descent are not well understood, namely agnostic learning of a single ReLU function.

More formally, let $\widetilde{\mathcal{D}}$ be a distribution over $\mathbb{R}^d \times \mathbb{R}$. A ReLU function is parameterized by $w = (\widetilde{w}, b_w)$ where $\widetilde{w} \in \mathbb{R}^d$ and $b_w \in \mathbb{R}$. For notational convenience, we will consider the points to be in $\mathbb{R}^{d+1}$ by appending $\widetilde{x}$ with a fixed coordinate $1$ as $x = (\widetilde{x}, 1)$. Let $\mathcal{D}$ be the distribution over $\mathbb{R}^{d+1} \times \mathbb{R}$ induced by $\widetilde{\mathcal{D}}$. We define the loss incurred at $w = (\widetilde{w}, b_w)$ to be
$$L(w) = \frac{1}{2}\, \mathbb{E}_{(\widetilde{x}, y) \sim \widetilde{\mathcal{D}}}\left[\left(\sigma(\widetilde{w}^\top \widetilde{x} + b_w) - y\right)^2\right] = \frac{1}{2}\, \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\left(\sigma(w^\top x) - y\right)^2\right].$$
Here $\sigma(x) = \max(x, 0)$ is the standard rectified linear unit popularly used in deep learning. The goal in agnostic learning of a ReLU function (or agnostic ReLU regression) is to design a polynomial-time learning algorithm that takes as input i.i.d. samples from $\mathcal{D}$ and outputs $w = (\widetilde{w}, b_w)$ such that $L(w)$ compares favorably with $\mathrm{OPT}$, given by
$$\mathrm{OPT} := \min_{w = (\widetilde{w}, b_w) \in \mathcal{H}} \frac{1}{2}\, \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\left(\sigma(w^\top x) - y\right)^2\right].$$
Here the hypothesis set $\mathcal{H}$ that the algorithm competes with is the set of ReLU units with parameters $w = (\widetilde{w}, b_w)$ whose relative bias $|b_w|/\|\widetilde{w}\|_2$ is bounded. This is a non-trivial and interesting regime: when the bias is too large in magnitude, the optimal ReLU function fitting the data is either the constant zero function almost everywhere, or a linear function almost everywhere.

This agnostic learning problem has been extensively studied, and polynomial-time learning algorithms exist for a variety of settings. This includes the noisy teacher setting where $\mathbb{E}[y|x]$ is given by
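To make the setup concrete, the following is a minimal sketch (not the paper's algorithm or analysis) of gradient descent on the empirical version of the loss $L(w)$, with the bias folded in by appending a constant coordinate $1$ to each input as described above. The synthetic data, step size, and initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 20_000

# Illustrative synthetic data: Gaussian marginals, labels from an (unknown)
# target ReLU with non-zero bias, plus additive label noise.
w_star = rng.normal(size=d)
b_star = 0.3 * np.linalg.norm(w_star)           # non-zero relative bias
X_tilde = rng.normal(size=(n, d))
X = np.hstack([X_tilde, np.ones((n, 1))])       # append fixed coordinate 1: x = (x_tilde, 1)
w_star_full = np.append(w_star, b_star)
y = np.maximum(X @ w_star_full, 0.0) + 0.1 * rng.normal(size=n)

def loss(w):
    # Empirical estimate of L(w) = (1/2) E[(sigma(w^T x) - y)^2]
    return 0.5 * np.mean((np.maximum(X @ w, 0.0) - y) ** 2)

def grad(w):
    # Gradient of the empirical loss; sigma'(z) = 1{z > 0}
    # (subgradient at z = 0 taken to be 0)
    z = X @ w
    residual = np.maximum(z, 0.0) - y
    return (residual * (z > 0)) @ X / n

# Small random initialization; the last coordinate plays the role of the bias.
w0 = 0.1 * rng.normal(size=d + 1)
w0[-1] = 0.1

w, eta = w0.copy(), 0.5
for _ in range(500):
    w = w - eta * grad(w)
```

This sketch omits everything that makes the analysis non-trivial: the agnostic (rather than realizable) labels, the probability that random initialization lands in a region where the ReLU is active, and the finite-sample arguments developed in the paper.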



* The last two authors are supported by the National Science Foundation (NSF) under Grants No. CCF-1652491 and CCF-1934931. The last author was also funded by a Google Research Scholar award.

