THE INDUCTIVE BIAS OF RELU NETWORKS ON ORTHOGONALLY SEPARABLE DATA

Abstract

We study the inductive bias of two-layer ReLU networks trained by gradient flow. We identify a class of easy-to-learn ('orthogonally separable') datasets, and characterise the solution that ReLU networks trained on such datasets converge to. Irrespective of network width, the solution turns out to be a combination of two max-margin classifiers: one corresponding to the positive data subset and one corresponding to the negative data subset. The proof is based on the little-known concept of extremal sectors, for which we prove a number of properties in the context of orthogonal separability. In particular, we prove stationarity of activation patterns from some time T onwards, which enables a reduction of the ReLU network to an ensemble of linear subnetworks.

1. INTRODUCTION

This paper is motivated by the problem of understanding the inductive bias of ReLU networks, or to put it plainly, understanding what it is that neural networks learn. This is a fundamental open question in neural network theory; it is also a crucial part of understanding how neural networks behave on previously unseen data (generalisation), and it could ultimately lead to rigorous a priori guarantees on neural nets' behaviour.

For a long time, the dominant way of thinking about machine learning systems was as minimisers of the empirical risk (Vapnik, 1998; Shalev-Shwartz & Ben-David, 2014). However, this paradigm has turned out to be insufficient for understanding deep learning, where many empirical risk minimisers exist, often with vastly different generalisation properties. To understand deep networks, we therefore need a more fine-grained notion of 'what the model learns'. This has motivated the study of the implicit bias of the training procedure: the ways in which the training algorithm influences which of the empirical risk minimisers is attained.

This is a productive research area, and the implicit bias has already been worked out for many linear models.[1] Notably, Soudry et al. (2018) consider a logistic regression classifier trained on linearly separable data, and show that the normalised weight vector converges to the max-margin direction. Building on their work, Ji & Telgarsky (2019a) consider deep linear networks, also trained on linearly separable data, and show that the normalised end-to-end weight vector converges to the max-margin direction. They in fact show that all first-layer neurons converge to the same 'canonical neuron' (which points in the max-margin direction).

Although such impressive progress on linear models has spurred attempts at nonlinear extensions, the problem is much harder, and analogous nonlinear results have been elusive. In this work, we provide the first such inductive-bias result for ReLU networks trained on 'easy' datasets.
Specifically, we

• propose orthogonal separability of datasets as a stronger form of linear separability that facilitates the study of ReLU network training,
• prove that a two-layer ReLU network trained on an orthogonally separable dataset learns a function with two distinct groups of neurons, where all neurons in each group converge to the same 'canonical neuron',
• characterise the directions of the canonical neurons, which turn out to be the max-margin directions for the positive and the negative data subset.

The proof is based on the recently introduced concept of extremal sectors (Maennel et al., 2018), which govern the early phase of training. Our main technical contributions are a precise characterisation of extremal sectors for orthogonally separable datasets, and an invariance property which ensures that the network's activation pattern becomes fixed at some point during training. The latter allows us to treat ReLU networks late in training as ensembles of linear networks, which are much better understood. We hope that a similar proof strategy could be useful in other contexts as well.

2. SETTING AND ASSUMPTIONS

In this section, we introduce the learning scenario, including the assumptions we make about the dataset, the model, and the training procedure. We consider binary classification. Denote the training data {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {±1} for all i ∈ [n]. We denote by X ∈ R^{d×n} the matrix with {x_i} as columns and by y ∈ R^n the vector with {y_i} as entries.

Orthogonally separable data. A binary classification dataset (X, y) is called orthogonally separable if for all i, j ∈ [n],

    x_i^T x_j > 0   if y_i = y_j,
    x_i^T x_j ≤ 0   if y_i ≠ y_j.

In other words, a dataset is orthogonally separable iff it is linearly separable and any training example x_i can serve as a linear separator. Geometrically, this means that examples with y_i = 1 ('positive examples') and examples with y_i = -1 ('negative examples') lie in opposite orthants.

Two-layer ReLU networks. We define two-layer width-p fully-connected ReLU networks, parameterised by θ = {W, a}, as f_θ : R^d → R,

    f_θ(x) = a^T ρ(Wx),

where W = [w_1, ..., w_p]^T ∈ R^{p×d} and a = [a_1, ..., a_p]^T ∈ R^p are the first- and second-layer weights of the network, and ρ is the element-wise ReLU function, ρ(z)_i = max{0, z_i}. We will often view the network as a collection of neurons, {(a_j, w_j)}_{j=1}^p.

Cross-entropy loss. We assume a training loss of the form

    ℓ(θ) = Σ_{i=1}^n ℓ_i(f_θ(x_i)),   ℓ_i(u) = log(1 + exp(-y_i u));

this is the standard empirical cross-entropy loss. More generally, our results hold whenever each ℓ_i is differentiable, its derivative ℓ_i' is bounded and Lipschitz continuous, and -y_i ℓ_i'(u) > 0 for all u ∈ R.
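The definitions above are straightforward to state in code. The following NumPy sketch (function names and the toy checks are ours, not the paper's) verifies orthogonal separability via the Gram matrix X^T X, and evaluates the two-layer ReLU network and the cross-entropy loss as defined above.

```python
import numpy as np

def is_orthogonally_separable(X, y):
    """Check the definition: x_i^T x_j > 0 when labels agree,
    x_i^T x_j <= 0 when they disagree. Examples are columns of X."""
    G = X.T @ X                  # Gram matrix of pairwise inner products
    same = np.equal.outer(y, y)  # True where y_i == y_j
    return bool(np.all(G[same] > 0) and np.all(G[~same] <= 0))

def relu_net(theta, x):
    """Two-layer width-p ReLU network f_theta(x) = a^T rho(W x)."""
    W, a = theta
    return a @ np.maximum(W @ x, 0.0)

def loss(theta, X, y):
    """Empirical cross-entropy loss: sum_i log(1 + exp(-y_i f_theta(x_i)))."""
    f = np.array([relu_net(theta, X[:, i]) for i in range(X.shape[1])])
    return np.sum(np.log1p(np.exp(-y * f)))
```

For instance, positives (1,1), (2,1) and negatives (-1,-1), (-1,-2) lie in opposite orthants and pass the check, while relabelling the same points breaks it.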

Gradient flow training.

We assume the loss is optimised by gradient descent with an infinitesimally small step size, also known as gradient flow. Under the gradient flow dynamics, the parameter trajectory is an absolutely continuous curve {θ(t) | t ≥ 0} satisfying the differential inclusion

    dθ(t)/dt ∈ -∂ℓ(θ(t))   for almost all t ∈ [0, ∞),

where ∂ℓ denotes the Clarke subdifferential (Clarke, 1975; Clarke et al., 2008) of ℓ, an extension of the gradient to not-everywhere-differentiable functions,

    ∂ℓ(θ) = conv { lim_{k→∞} ∇ℓ(θ_k) : θ_k → θ }.

Here θ(t) is the value of the parameters at time t, and we will use the suffix (t) more generally to denote the value of some function of θ at time t.
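In practice, gradient flow can be approximated by explicit Euler steps with a small step size. The sketch below (step size, initialisation scale, and helper names are our assumptions, not part of the paper) implements one such step for the two-layer ReLU network, using the derivative ℓ_i'(u) = -y_i σ(-y_i u) of the cross-entropy loss and the subgradient convention 1[z > 0] at the ReLU kink.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_flow_step(W, a, X, y, eta):
    """One Euler step of size eta approximating dtheta/dt = -grad loss(theta)
    for f_theta(x) = a^T rho(W x) with cross-entropy loss."""
    gW = np.zeros_like(W)
    ga = np.zeros_like(a)
    for i in range(X.shape[1]):
        x = X[:, i]
        z = W @ x                            # pre-activations
        h = np.maximum(z, 0.0)               # hidden-layer activations
        u = a @ h                            # network output f_theta(x_i)
        dldu = -y[i] * sigmoid(-y[i] * u)    # l_i'(u) for l_i(u) = log(1+exp(-y_i u))
        ga += dldu * h                       # df/da = rho(Wx)
        gW += np.outer(dldu * a * (z > 0), x)  # df/dW via the active-neuron mask
    return W - eta * gW, a - eta * ga
```

Iterating this step with a small eta on an orthogonally separable toy dataset drives the loss down, giving a cheap way to observe the dynamics studied in the paper.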



[1] A more thorough overview of related work can be found in Section 6.

