THE INDUCTIVE BIAS OF RELU NETWORKS ON ORTHOGONALLY SEPARABLE DATA

Abstract

We study the inductive bias of two-layer ReLU networks trained by gradient flow. We identify a class of easy-to-learn ('orthogonally separable') datasets, and characterise the solution that ReLU networks trained on such datasets converge to. Irrespective of network width, the solution turns out to be a combination of two max-margin classifiers: one corresponding to the positive data subset and one corresponding to the negative data subset. The proof is based on the little-known concept of extremal sectors, for which we prove a number of properties in the context of orthogonal separability. In particular, we prove stationarity of activation patterns from some time T onwards, which enables a reduction of the ReLU network to an ensemble of linear subnetworks.

1. INTRODUCTION

This paper is motivated by the problem of understanding the inductive bias of ReLU networks, or to put it plainly, understanding what it is that neural networks learn. This is a fundamental open question in neural network theory; it is also a crucial part of understanding how neural networks behave on previously unseen data (generalisation), and it could ultimately lead to rigorous a priori guarantees on neural nets' behaviour.

For a long time, the dominant way of thinking about machine learning systems was as minimisers of the empirical risk (Vapnik, 1998; Shalev-Shwartz & Ben-David, 2014). However, this paradigm has turned out to be insufficient for understanding deep learning, where many empirical risk minimisers exist, often with vastly different generalisation properties. To understand deep networks, we therefore need a more fine-grained notion of 'what the model learns'. This has motivated the study of the implicit bias of the training procedure: the ways in which the training algorithm influences which of the empirical risk minimisers is attained.

This is a productive research area, and the implicit bias has already been worked out for many linear models. Notably, Soudry et al. (2018) consider a logistic regression classifier trained on linearly separable data, and show that the normalised weight vector converges to the max-margin direction. Building on their work, Ji & Telgarsky (2019a) consider deep linear networks, also trained on linearly separable data, and show that the normalised end-to-end weight vector converges to the max-margin direction. They in fact show that all first-layer neurons converge to the same 'canonical neuron' (which points in the max-margin direction). Although such impressive progress on linear models has spurred attempts at nonlinear extensions, the problem is much harder and analogous nonlinear results have been elusive. In this work, we provide the first such inductive-bias result for ReLU networks trained on 'easy' datasets.
Specifically, we
• propose orthogonal separability of datasets as a stronger form of linear separability that facilitates the study of ReLU network training,
• prove that a two-layer ReLU network trained on an orthogonally separable dataset learns a function with two distinct groups of neurons, where all neurons in each group converge to the same 'canonical neuron',
• characterise the directions of the canonical neurons, which turn out to be the max-margin directions for the positive and the negative data subsets.
The proof is based on the recently introduced concept of extremal sectors (Maennel et al., 2018), which govern the early phase of training. Our main technical contributions are a precise characterisation of extremal sectors for orthogonally separable datasets, and an invariance property which ensures that the network's activation pattern becomes fixed at some point during training. The latter allows us to treat ReLU networks late in training as ensembles of linear networks, which are much better understood. We hope that a similar proof strategy could be useful in other contexts as well.

2. SETTING AND ASSUMPTIONS

In this section, we introduce the learning scenario, including the assumptions we make about the dataset, the model, and the training procedure. We consider binary classification. Denote the training data {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {±1} for all i ∈ [n]. We denote by X ∈ R^{d×n} the matrix with {x_i} as columns and by y ∈ R^n the vector with {y_i} as entries.

Orthogonally separable data. A binary classification dataset (X, y) is called orthogonally separable if for all i, j ∈ [n],

    x_i^T x_j > 0,  if y_i = y_j,
    x_i^T x_j ≤ 0,  if y_i ≠ y_j.    (1)

In other words, a dataset is orthogonally separable iff it is linearly separable and any training example x_i can serve as a linear separator. Geometrically, this means that examples with y_i = 1 ('positive examples') and examples with y_i = -1 ('negative examples') lie in opposite orthants.

Two-layer ReLU networks. We consider two-layer width-p fully-connected ReLU networks, parameterised by θ := {W, a}, of the form f_θ : R^d → R, f_θ(x) := a^T ρ(Wx), where W := [w_1, ..., w_p]^T ∈ R^{p×d} and a := [a_1, ..., a_p]^T ∈ R^p are the first- and second-layer weights of the network, and ρ is the element-wise ReLU function, ρ(z)_i = max{0, z_i}. We will often view the network as a collection of neurons, {(a_j, w_j)}_{j=1}^p.

Cross-entropy loss. We assume a training loss of the form ℓ(θ) := Σ_{i=1}^n ℓ_i(f_θ(x_i)) with ℓ_i(u) := log(1 + exp(-y_i u)); this is the standard empirical cross-entropy loss. More generally, our results hold whenever the loss is differentiable, ℓ_i' is bounded and Lipschitz continuous, and -y_i ℓ_i'(u) > 0 for all u ∈ R.
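The separability condition (1) is easy to check numerically from the Gram matrix of the examples. A minimal sketch in NumPy (the function name is ours):

```python
import numpy as np

def is_orthogonally_separable(X, y):
    """Check eq. (1): x_i^T x_j > 0 for same-label pairs (including i = j),
    and x_i^T x_j <= 0 for opposite-label pairs."""
    G = X.T @ X                     # Gram matrix; X has examples as columns
    same = np.outer(y, y) > 0       # True where y_i = y_j
    if not np.all(G[same] > 0):
        return False
    return bool(np.all(G[~same] <= 0))

# toy dataset in R^2: positives in the first quadrant, negatives opposite
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [2.0, 1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(is_orthogonally_separable(X, y))   # True
```

Note that the same-label check includes i = j, so every example must be nonzero.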

Gradient flow training.

We assume the loss is optimised by gradient descent with infinitesimally small step size, also known as gradient flow. Under the gradient flow dynamics, the parameter trajectory is an absolutely continuous curve {θ(t) | t ≥ 0} satisfying the differential inclusion

    dθ(t)/dt ∈ -∂ℓ(θ(t)),   for almost all t ∈ [0, ∞),    (4)

where ∂ℓ denotes the Clarke subdifferential (Clarke, 1975; Clarke et al., 2008) of ℓ, an extension of the gradient to not-everywhere-differentiable functions,

    ∂ℓ(θ) := conv { lim_{k→∞} ∇ℓ(θ_k) | θ_k → θ }.    (5)

Here θ(t) is the value of the parameters at time t, and we will use the suffix (t) more generally to denote the value of some function of θ at time t.

Near-zero balanced initialisation. We assume that the neurons {w_j} are initialised iid from the Gaussian distribution and then rescaled such that ||w_j|| ≤ λ, where λ > 0 is a small constant. That is, w_j = λ_j v_j / ||v_j|| for v_j ~iid N(0, I) and arbitrary λ_j satisfying λ_j ∈ (0, λ]. We also assume that a_j ∈ {±λ_j}. These technical conditions ensure that the neurons are balanced and small in norm, ||w_j|| = |a_j| ≈ 0, which simplifies the calculations involved in the analysis of gradient flow.

Support examples span the full space. We assume that the support examples of the positive data subset {x_i | y_i = 1} span the entire R^d, and similarly that the support examples of the negative data subset {x_i | y_i = -1} span R^d. (We formally define support examples after introducing some more notation below.)
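The initialisation scheme can be written out explicitly. A sketch (our function name; drawing λ_j uniformly is one admissible choice):

```python
import numpy as np

def init_balanced(p, d, lam=0.01, seed=None):
    """Near-zero balanced initialisation: w_j = lam_j * v_j / ||v_j|| with
    v_j ~ N(0, I) iid and lam_j in (0, lam]; a_j in {+lam_j, -lam_j}."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((p, d))
    lam_j = lam * rng.uniform(size=p)              # any lam_j in (0, lam]
    W = lam_j[:, None] * V / np.linalg.norm(V, axis=1, keepdims=True)
    a = rng.choice([-1.0, 1.0], size=p) * lam_j    # balanced: |a_j| = ||w_j||
    return W, a

W, a = init_balanced(p=100, d=20, lam=0.01, seed=0)
print(np.allclose(np.abs(a), np.linalg.norm(W, axis=1)))   # True: balancedness
```

Balancedness holds exactly at initialisation, and Lemma A.4 below guarantees that gradient flow preserves it.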

3. MAIN RESULT

Under the assumptions of Section 2, the network converges to a linear combination of two max-margin neurons. Specifically, given a dataset (X, y), define the positive and the negative max-margin vectors w_+, w_- ∈ R^d as

    w_+ = argmin_w ||w||^2  subject to  w^T x_i ≥ 1 for i : y_i = 1,    (6)
    w_- = argmin_w ||w||^2  subject to  w^T x_i ≥ 1 for i : y_i = -1.   (7)

We call examples which attain equality in eqs. (6) and (7) positive support examples and negative support examples respectively. We now state the main result.

Theorem 1. Let f_θ be a two-layer width-p ReLU network trained by gradient flow with the cross-entropy loss, initialised near-zero and balanced. Consider an orthogonally separable dataset (X, y) such that its positive support examples span R^d, and its negative support examples also span R^d. For almost all such datasets and with probability 1 - 1/2^p over the random initialisation,

    || W(t)/||W(t)||_F - (u w_+^T + z w_-^T) ||_F → 0,   as t → ∞,    (8)

for some u, z ∈ R^p_+ such that either u_i = 0 or z_i = 0 for all i ∈ [p]. Also,

    || a(t)/||a(t)|| - (u ||w_+|| - z ||w_-||) || → 0,   as t → ∞.    (9)

The theorem says that each neuron (row of W), properly normalised, converges either to a scalar multiple of the positive max-margin direction, u_i w_+, or to a scalar multiple of the negative max-margin direction, z_i w_-. In other words, there are asymptotically only two distinct types of neurons, and the network could in principle be pruned down to a width of just two. These two 'canonical neurons' moreover have an explicit characterisation, given by eqs. (6) and (7). As for the second-layer weights, the magnitude of each a_j equals the norm of the respective w_j, and the sign of a_j is +1 if w_j approaches w_+ and -1 if w_j approaches w_-. The following corollary summarises the above in terms of the function learnt by the network.

Corollary 1. Under the conditions of Theorem 1, there exist constants u, z ≥ 0 such that

    f_{θ(t)}(x) / ||θ(t)||^2 → u ρ(w_+^T x) - z ρ(w_-^T x),   as t → ∞.
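The two canonical neurons of eqs. (6) and (7) can be approximated numerically. The sketch below replaces the constrained problem by a hinge-penalty surrogate and runs plain subgradient descent; it is an approximation under our choice of penalty weight and step size, not an exact QP solver, and the function name is ours. The negative vector w_- is obtained by calling it on the negative subset.

```python
import numpy as np

def max_margin_vector(X_sub, steps=20000, lr=0.001, C=1.0):
    """Approximate argmin ||w||^2 s.t. w^T x_i >= 1 for every column x_i of
    X_sub, via subgradient descent on 0.5*||w||^2 + C*sum_i max(0, 1 - w^T x_i).
    The hinge penalty is exact for C large enough, so the minimiser is close
    to the true max-margin vector."""
    w = np.zeros(X_sub.shape[0])
    for _ in range(steps):
        viol = (w @ X_sub) < 1.0                       # violated constraints
        w -= lr * (w - C * X_sub[:, viol].sum(axis=1)) # subgradient step
    return w

# positive subset of a toy dataset in R^2; by symmetry w_+ = (1/3, 1/3)
X_pos = np.array([[1.0, 2.0],
                  [2.0, 1.0]])
w_plus = max_margin_vector(X_pos)
```

Here the true solution is w_+ = (1/3, 1/3), since both constraints are active and symmetric; the iterate oscillates within a small neighbourhood of it.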
3.1 DISCUSSION OF ASSUMPTIONS

Many of our assumptions are technical, serving to simplify the analysis while detracting little from the result's relevance. These include infinitesimal step size (gradient flow), balancedness at initialisation, and the condition on support span. The first two could potentially be relaxed to their approximate counterparts, i.e. gradient descent with a small constant step size and approximate balancedness (Arora et al., 2019).

4. PROOF SKETCH

In the analysis, we distinguish between two phases of training. The first phase takes place close to the origin, θ ≈ 0. In this phase, while neurons move little in the absolute sense, they converge in direction to certain regions of the weight space called extremal sectors.

4.1. CONVERGENCE TO EXTREMAL SECTORS

(All definitions and results in this subsection are due to Maennel et al. (2018); we need them later on.) Sectors are regions in weight space corresponding to different activation patterns. They are important for understanding neuron dynamics: roughly speaking, neurons in the same sector move in the same direction.

Definition 1 (Sector). The sector associated to a sequence of signs σ ∈ {-1, 0, 1}^n is the region in weight space defined as

    S_σ := { w ∈ R^d | sign(w^T x_i) = σ_i, i ∈ [n] }.

We may also refer to the sign sequence σ itself as a sector.

Some sectors are attractors early in training, i.e. neurons tend to converge to them. Such attracting sectors are called extremal sectors. To give a formal definition, we first introduce the function G : S^{d-1} → R,

    G(w) := - Σ_{i=1}^n ℓ_i'(0) · ρ(w^T x_i).    (12)

Intuitively, (normalised) neurons early in training behave as if they were locally optimising G; they therefore tend to cluster around the local optima of G. We formally define extremal sectors as sectors containing these local optima.

Definition 2 (Extremal directions and sectors). We say that w ∈ S^{d-1} is a positive extremal direction if it is a strict local maximum of G. We say that w is a negative extremal direction if it is a strict local minimum of G. A sector is called (positive/negative) extremal if it contains a (positive/negative) extremal direction.

The following lemma (Maennel et al., 2018, Lemma 5) shows that all neurons either turn off, i.e. become deactivated for all training examples and stop updating, or converge to extremal sectors.

Lemma 1. Let a two-layer ReLU network f_θ be balanced at initialisation and trained by gradient flow. Assume that the loss derivative ℓ_i' is Lipschitz continuous.
Then, for almost all datasets and almost all initialisations with λ small enough, there exists a time T such that each neuron satisfies one of these three conditions:
• w_j(T) ∈ S_σ where σ ≤ 0, and so w_j remains constant for t ≥ T, or
• a_j(T) > 0 and w_j(T) ∈ S_σ where σ is a positive extremal sector, or
• a_j(T) < 0 and w_j(T) ∈ S_σ where σ is a negative extremal sector.
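Definitions 1 and 2 are easy to evaluate on a concrete dataset. A sketch (our function names; for the cross-entropy loss, ℓ_i'(0) = -y_i/2):

```python
import numpy as np

def sector(w, X, tol=1e-12):
    """Sign pattern sigma of Definition 1: sigma_i = sign(w^T x_i)."""
    s = X.T @ w
    return np.where(np.abs(s) <= tol, 0, np.sign(s)).astype(int)

def G(w, X, y):
    """G(w) = -sum_i l_i'(0) relu(w^T x_i); for cross-entropy l_i'(0) = -y_i/2,
    so G(w) = sum_i (y_i/2) relu(w^T x_i)."""
    return float(np.sum((y / 2.0) * np.maximum(X.T @ w, 0.0)))

# toy orthogonally separable data in R^2 (examples as columns)
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [2.0, 1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0]) / np.sqrt(2)   # a unit vector in S^{d-1}
print(sector(w, X))                      # this w activates only on positives
print(G(w, X, y))
```

On this data, sector(w, X) is [1, 1, -1, -1] and G(w) = 3/√2: the neuron is active exactly on the positive examples, which (as Lemma 2 below shows) is the unique positive extremal sector.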

4.2. ORTHOGONAL SEPARABILITY: TWO ABSORBING EXTREMAL SECTORS

Lemma 1 shows that by the end of the early phase of training, neurons have converged to extremal sectors. Although eq. (12) shows that the number of extremal sectors depends only on the data (i.e. is independent of model expressivity), it is a priori unclear how many extremal sectors there are for a given dataset, or what happens once neurons have converged to extremal sectors. We now answer both of these questions for orthogonally separable datasets.

First, we claim that for orthogonally separable datasets, there are only two extremal sectors: one corresponding to the positive data subset and one corresponding to the negative data subset. That is, by converging to an extremal sector, neurons 'choose' whether to activate for positive examples or for negative examples. They thus naturally form two groups of similar neurons.

Lemma 2. In the setting of Theorem 1, there is exactly one positive extremal direction and exactly one negative extremal direction. The positive extremal sector σ_+ is given by

    (σ_+)_j = { 1,   if y_j = 1,
              { -1,  if y_j = -1 and x_j^T x_i < 0 for some i with y_i = 1,    (13)
              { 0,   if y_j = -1 and x_j^T x_i = 0 for all i with y_i = 1,

and the negative extremal sector σ_- is given by

    (σ_-)_j = { 1,   if y_j = -1,
              { -1,  if y_j = 1 and x_j^T x_i < 0 for some i with y_i = -1,    (14)
              { 0,   if y_j = 1 and x_j^T x_i = 0 for all i with y_i = -1.

Second, we show that once a neuron reaches an extremal sector, it remains in the sector forever, i.e. its activation pattern remains fixed for the rest of training.

Lemma 3. Assume the setting of Theorem 1. If at time T the neuron (a_j, w_j) satisfies a_j(T) > 0 and w_j(T) ∈ S_σ, where σ is the positive extremal sector (eq. (13)), then for all t ≥ T, w_j(t) ∈ S_σ. The same holds if a_j(T) < 0 and σ is the negative extremal sector (eq. (14)).
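The characterisation in Lemma 2 is directly computable for a given dataset. A sketch (our function names), where the negative sector is obtained from the same construction with labels flipped:

```python
import numpy as np

def positive_extremal_sector(X, y):
    """sigma_+ of eq. (13): 1 on positive examples; for a negative example x_j,
    -1 if x_j^T x_i < 0 for some positive x_i, else 0 (x_j orthogonal to all
    positive examples)."""
    sigma = np.ones(len(y), dtype=int)
    Xpos = X[:, y == 1]
    for j in np.where(y == -1)[0]:
        sigma[j] = -1 if np.any(Xpos.T @ X[:, j] < 0) else 0
    return sigma

def negative_extremal_sector(X, y):
    """sigma_- of eq. (14): the same construction with y replaced by -y."""
    return positive_extremal_sector(X, -y)

X = np.array([[1.0, 2.0, -1.0, -2.0],
              [2.0, 1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(positive_extremal_sector(X, y))   # sigma_+ = [1, 1, -1, -1] here
print(negative_extremal_sector(X, y))   # sigma_- = [-1, -1, 1, 1] here
```

On this toy data no example is orthogonal to the opposite class, so neither sector contains zeros: each neuron ends up strictly active on one class and strictly inactive on the other.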

4.3. PROOF OF THEOREM 1

Once neurons enter their respective absorbing sectors, the second phase of training begins. In this phase, the network's activation patterns are fixed: some neurons are active and update on the positive examples, while the others are active and update on the negative examples. The network thus behaves like an ensemble of independent linear subnetworks trained on subsets of the data. Once this happens, it becomes possible to apply existing results for linear networks; in particular, each subnetwork converges to its respective max-margin classifier. We give more details in the proof below.

Proof of Theorem 1. By Lemmas 1 and 2, there exists a time T such that each neuron satisfies either
• w_j(T) ∈ S_σ where σ ≤ 0, and w_j remains constant for t ≥ T, or
• a_j(T) > 0 and w_j(T) ∈ S_{σ_+}, or
• a_j(T) < 0 and w_j(T) ∈ S_{σ_-},
where σ_+, σ_- are the unique positive and negative extremal sectors given by eqs. (13) and (14). Denote by J_0, J_+, J_- the sets of neurons satisfying the first, the second, and the third condition respectively. By Lemma 3, if j ∈ J_+ then w_j(t) ∈ S_{σ_+} for all t ≥ T, and if j ∈ J_- then w_j(t) ∈ S_{σ_-} for t ≥ T. Hence, for t ≥ T, if x_i is such that y_i = 1 then

    f_θ(x_i) = Σ_{j∈[p]} a_j ρ(w_j^T x_i) = Σ_{j∈J_+} a_j w_j^T x_i.    (15)

Combined with Lemma A.3, this implies that for k ∈ J_+,

    ∂a_k/∂t = - Σ_{i: y_i=1} ℓ_i'( Σ_{j∈J_+} a_j w_j^T x_i ) · w_k^T x_i,
    ∂w_k/∂t = - Σ_{i: y_i=1} ℓ_i'( Σ_{j∈J_+} a_j w_j^T x_i ) · a_k x_i,    (16)

(where we have used that P_{w_k} x_i = x_i for i with y_i = 1, due to positive extremality). From eq. (16) it follows that the evolution of neurons in J_+ depends only on positive examples and other neurons in J_+. The neurons behave linearly on the positive data subset while ignoring the negative subset. The same argument shows that the evolution of neurons in J_- depends only on other neurons in J_- and the negative data subset, on which the neurons act linearly.
In other words, from time T onwards the ReLU network decomposes into a constant part and two independent linear networks, one trained on the positive data subset and the other trained on the negative data subset. We can therefore apply existing max-margin convergence results for linear networks to each of the linear subnetworks. Denote by W = [W_0, W_+, W_-] the three parts of the weight matrix. Then by (Ji & Telgarsky, 2019a, Theorems 2.2 and 2.8) and (Ji & Telgarsky, 2020, Theorem 3.1), there exist vectors ū, z̄ such that

    || W_+(t)/||W_+(t)||_F - ū w_+^T ||_F → 0,   as t → ∞,    (17)
    || W_-(t)/||W_-(t)||_F - z̄ w_-^T ||_F → 0,   as t → ∞.    (18)

(We allow ū, z̄ ∈ R^0 to account for the fact that J_+, J_- may be empty.) We now need to relate ||W_+||_F and ||W_-||_F to ||W||_F. In particular, it will be useful to show that ||W_+(t)||_F^2 / log t has a limit as t → ∞; the same is true for ||W_-(t)||_F^2 / log t (by the same argument). If J_+ or J_- is empty, this is trivially true and the limit is 0. Otherwise, consider the learning of the positive linear subnetwork, whose objective is effectively ℓ_+(θ) := Σ_{i: y_i=1} ℓ_i(f_θ(x_i)). By (Ji & Telgarsky, 2019a, Theorem 2.2), we know that ℓ_+(θ(t)) → 0 as t → ∞. Following (Lyu & Li, 2020, Definition A.3), define

    γ(θ) := g(log 1/ℓ_+(θ)) / (2 ||W_+||_F^2),    (19)

where g(q) := -log(exp(exp(-q)) - 1) for the cross-entropy loss. Then

    ||W_+(t)||_F^2 / log t = g(log 1/ℓ_+(θ(t))) / (2 γ(t) log t)
                           = -log(exp(ℓ_+(θ(t))) - 1) / (2 γ(t) log t).    (20)

Using the Taylor expansion exp(u) = 1 + Θ(u) for u → 0 and (Lyu & Li, 2020, Corollary A.11), we obtain

    ||W_+(t)||_F^2 / log t = -log Θ(ℓ_+(θ(t))) / (2 γ(t) log t)
                           = log Θ(t log t) / (2 γ(t) log t)
                           = (1 / (2 γ(t))) · ( (Θ(1) + log log t) / log t + 1 ).    (21)

By (Lyu & Li, 2020, Theorem A.7), γ is increasing in t and hence converges; it follows that ||W_+(t)||_F^2 / log t has a limit. By (Lyu & Li, 2020, Corollary A.11), ||W_+(t)||_F^2 = Θ(log t), implying that the limit is finite and strictly positive. We will denote it by ν_+ and the analogous quantity for W_- by ν_-. We now return to the main thread of the proof.
We analyse the convergence of W(t)/||W(t)||_F by analysing W_0/||W(t)||_F, W_+(t)/||W(t)||_F and W_-(t)/||W(t)||_F in turn. Since ||W(t)||_F^2 = ||W_0||_F^2 + ||W_+(t)||_F^2 + ||W_-(t)||_F^2,

    lim_{t→∞} ||W(t)||_F^2 / log t = ν_+ + ν_-.    (22)

Now observe that with probability at least 1 - 1/2^p over the random initialisation, ν_+ + ν_- > 0 (or equivalently, J_+ ∪ J_- ≠ ∅). To prove this, let x_{i+} be any training example with y_{i+} = 1 and let x_{i-} be any training example with y_{i-} = -1. Then by Lemma B.1, if a neuron (a_j, w_j) is initialised such that a_j(0) > 0 and w_j(0)^T x_{i+} > 0, then w_j(t)^T x_{i+} > 0 for all t ≥ 0. This holds in particular at time T. The neuron j thus cannot be in J_0 nor J_-, implying j ∈ J_+. Similarly, if the neuron is initialised such that a_j(0) < 0 and w_j(0)^T x_{i-} > 0, then j ∈ J_-. The probability that one of the two initialisations occurs for a single neuron j is 1/2, as P[w_j(0)^T x > 0] = 1/2 for any fixed x ≠ 0. Hence, the probability that j ∈ J_0 is at most 1 - 1/2 = 1/2, and the probability that [p] ⊆ J_0 is at most 1/2^p. It follows that with probability at least 1 - 1/2^p,

    W_0 / ||W(t)||_F → 0,   as t → ∞.    (23)

Also, by eqs. (17) and (22),

    W_+(t)/||W(t)||_F = (W_+(t)/||W_+(t)||_F) · (||W_+(t)||_F / √(log t)) / (||W(t)||_F / √(log t))
                      → (√ν_+ / √(ν_+ + ν_-)) ū w_+^T,    (24)

and similarly W_-(t)/||W(t)||_F → (√ν_- / √(ν_+ + ν_-)) z̄ w_-^T. For j ∈ J_+ and t ≥ T we moreover know that if y_i = 1 then w_j(t)^T x_i > 0, because w_j(t) ∈ S_{σ_+}. As the same property holds for w_+, it follows that ū_j ≥ 0. By a similar argument, z̄_j ≥ 0. Combining the last three equations then proves eq. (8). As for eq. (9), we know by Lemma A.4 that a_j(t) = s_j ||w_j(t)|| for some s_j ∈ {±1}, implying ||a(t)|| = ||W(t)||_F. Hence, for j ∈ J_+,

    a_j(t) / ||a(t)|| = s_j ||w_j(t)|| / ||W(t)||_F → s_j u_j ||w_+||

by eq. (24), where u_j ≥ 0. For j ∈ J_+ we also know that a_j(t) ≥ 0, so s_j = 1 and

    a_j(t) / ||a(t)|| → u_j ||w_+||.    (25)

By a similar argument, we obtain that for j ∈ J_-, a_j(t)/||a(t)|| → -z_j ||w_-||. Finally, for j ∈ J_0, a_j(t) is constant and so a_j(t)/||a(t)|| → 0.
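The phase-2 dynamics of eq. (16) can be simulated directly once the activation pattern is fixed. A forward-Euler sketch for the positive subnetwork (our function names; we assume all given neurons lie in J_+, all columns of X are positive examples, and the activations w_k^T x_i stay positive, so the ReLU acts as the identity):

```python
import numpy as np

def lprime(u):
    """Derivative of l_i(u) = log(1 + exp(-u)) for a positive example (y_i = 1)."""
    return -1.0 / (1.0 + np.exp(u))

def phase2_step(a, W, X, lr=0.1):
    """One forward-Euler step of eq. (16). Assumes every neuron (a_k, w_k) is
    in J_+ and every column of X is a positive example on which the neurons
    stay active, so the subnetwork is linear on this data."""
    margins = W @ X                       # margins[k, i] = w_k^T x_i
    g = lprime(a @ margins)               # g[i] = l_i'(sum_j a_j w_j^T x_i)
    a_new = a - lr * margins @ g          # da_k/dt = -sum_i l_i' * w_k^T x_i
    W_new = W - lr * np.outer(a, X @ g)   # dw_k/dt = -sum_i l_i' * a_k x_i
    return a_new, W_new

# two positive examples as columns; small positive initialisation
X = np.array([[1.0, 2.0],
              [2.0, 1.0]])
a = np.array([0.1, 0.1])
W = np.full((2, 2), 0.1)
for _ in range(200):
    a, W = phase2_step(a, W, X)
loss = np.sum(np.log1p(np.exp(-(a @ (W @ X)))))   # positive-subset loss
```

Iterating the step drives the positive-subset loss towards zero while all activations remain positive, consistent with the claim that the subnetwork trains like a linear network on the positive examples.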

5. EXPERIMENTS

In this section, we first verify that the theoretical result (Theorem 1) is predictive of experimental outcomes, even when some technical assumptions are violated. Second, we present evidence that a similar result may hold for deeper networks as well, although this goes beyond Theorem 1.

5.1. TWO-LAYER NETWORKS

To see how well the theory holds up, we train a two-layer ReLU network with 100 neurons on a synthetic orthogonally separable dataset consisting of 500 examples in R^20. The dataset is constructed from an iid Gaussian dataset by filtering, to ensure orthogonal separability and w_+ ≈ -w_- (for visualisation purposes). Specifically, let z := [1, -1, ..., 1, -1]. A Gaussian-sampled point x is included with label +1 if it lies in the first orthant and x^T z ≥ 0, included with label -1 if it lies in the orthant opposite to the first and x^T z ≥ 0, and discarded otherwise. We train by stochastic gradient descent with batch size 50 and a learning rate of 0.1 for 500 epochs. At initialisation, we multiply all weights by 0.05. This reflects a setting where both key assumptions of Theorem 1 (orthogonal separability and small initialisation) hold, while the other assumptions are relaxed to approach real-life practice.

Figure 1 shows the results. Figure 1a shows the top 10 singular values of the first-layer weight matrix W ∈ R^{100×20} after training. We see that despite its size, the matrix has rank only two: all singular values except the first two are effectively zero. This is exactly as predicted by the theorem. Furthermore, when we project the neurons on the positive-variance dimensions (Figure 1b), we see that they align along two main directions. To see how well these directions align with the predicted max-margin directions, we compute the correlation (normalised inner product) of each neuron with its respective max-margin direction. Figure 1c shows the histogram of these correlations. We see that the correlation is generally high, above 0.9 for most neurons. Overall we find very good agreement with theory.

[Figure 1: a) top 10 singular values of W after training; b) neurons projected on the positive-variance dimensions, with orange (blue) dots representing neurons with a_j > 0 (a_j < 0); c) histogram of correlations between each of the 100 neurons and its respective max-margin direction.]
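The dataset construction above can be reproduced with a small sampler. A sketch (our function name): rather than rejecting the overwhelming majority of Gaussian draws (a Gaussian point lands in a fixed orthant of R^20 with probability 2^-20), we place the point in the first orthant or its opposite directly via componentwise absolute values, which yields the same conditional distribution, and then apply the x^T z ≥ 0 filter.

```python
import numpy as np

def sample_dataset(n, d=20, seed=None):
    """Sample n points of the Section 5.1 dataset: label +1 in the first
    orthant, label -1 in the opposite orthant, both filtered by x^T z >= 0."""
    rng = np.random.default_rng(seed)
    z = np.tile([1.0, -1.0], d // 2)         # z = [1, -1, ..., 1, -1]
    cols, labels = [], []
    while len(labels) < n:
        s = 1 if rng.random() < 0.5 else -1   # pick the orthant / label
        x = s * np.abs(rng.standard_normal(d))
        if x @ z >= 0:                        # keep points with x^T z >= 0
            cols.append(x); labels.append(s)
    return np.stack(cols, axis=1), np.array(labels)  # examples as columns

X, y = sample_dataset(500, d=20, seed=0)
```

The result is orthogonally separable by construction: same-orthant pairs have strictly positive inner products and opposite-orthant pairs have negative ones.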

5.2. DEEPER NETWORKS

We now explore the behaviour of deeper networks on orthogonally separable data. We train a residual network rather than a fully-connected one. The reason for this is that fully-connected networks with small initialisation are hard to train: early in training, the gradients are vanishingly small but then grow very quickly. We therefore found setting a numerically stable learning rate rather delicate. We consider a residual network f_θ : R^d → R parameterised by θ := {W_1, ..., W_L}, of the form

    f_θ^1(x) = W_1 x,
    f_θ^l(x) = f_θ^{l-1}(x) + W_l ρ(f_θ^{l-1}(x)),   for l ∈ [2, L-1],
    f_θ(x)   = W_L ρ(f_θ^{L-1}(x)),

where p is the network's width, and W_1 ∈ R^{p×d}, W_l ∈ R^{p×p} and W_L ∈ R^{1×p} are its weights. We train such a four-layer residual net with width 100 on the same dataset and using the same optimiser and hyper-parameters as in Section 5.1.

Figure 2 shows the results. The results are very similar to what we observe for two-layer nets: the weight matrices are all rank two (Figure 2a-c), and the weight matrices' rows align in two main directions (Figure 2d-f). It is unclear what these directions are for the intermediate layers of the network, but for the first layer, we conjecture it is again the max-margin directions, as suggested by Figure 2g.
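The forward pass of this architecture is a few lines. A sketch with our variable names, instantiated at L = 4 and p = 100 as in the experiment (weights scaled by 0.05, as in Section 5.1):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def resnet_forward(x, Ws):
    """Forward pass: f^1 = W_1 x; f^l = f^{l-1} + W_l relu(f^{l-1}) for
    l = 2, ..., L-1; output f = W_L relu(f^{L-1})."""
    W1, *mid, WL = Ws
    f = W1 @ x
    for Wl in mid:                  # residual blocks
        f = f + Wl @ relu(f)
    return (WL @ relu(f)).item()    # scalar output

rng = np.random.default_rng(0)
d, p, L = 20, 100, 4
Ws = ([0.05 * rng.standard_normal((p, d))]                          # W_1
      + [0.05 * rng.standard_normal((p, p)) for _ in range(L - 2)]  # W_2..W_{L-1}
      + [0.05 * rng.standard_normal((1, p))])                       # W_L
out = resnet_forward(rng.standard_normal(d), Ws)
```

Note that the identity skip connection lets the signal pass through even when the small weights make W_l ρ(·) nearly negligible early in training, which is what makes this parameterisation easier to optimise than a plain fully-connected net.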

6. RELATED WORK

There is a lot of prior work on the implicit bias of gradient descent for various linear models. For logistic regression, Soudry et al. (2018) show that, assuming an exponentially-tailed loss and linearly separable data, the normalised weight vector converges to the max-margin direction. Ji & Telgarsky (2019b) show that in deep linear networks, the end-to-end predictor converges to the max-margin solution and consecutive weight matrices align. Gunasekar et al. (2018b) consider linear convolutional nets and prove convergence to a predictor related to the 2/L bridge penalty.

A few papers have started addressing the implicit bias problem for nonlinear (homogeneous or ReLU) networks. The problem is much harder and hence requires stronger assumptions. Lyu & Li (2020) and Ji & Telgarsky (2020) assume that at some point during training, the network attains perfect classification accuracy. Training from this point onward, Ji & Telgarsky (2020) show that the network parameters converge in direction. Lyu & Li (2020) show that this direction is a KKT point of the (nonlinear) max-margin problem. A complementary approach is taken by Maennel et al. (2018), who analyse the very early phase of training, when the weights are close to the origin. For two-layer networks, they show convergence of neurons to extremal sectors. Our work can be seen as a first step towards bridging the very early and the very late phase of training.

Zooming out a bit, there is also work motivated by similar questions but taking a different approach. For example, Li & Liang (2018) show that two-layer ReLU nets trained on structured data converge to a solution that generalises well. Like ours, their analysis requires that the network's activation patterns change little, but they achieve it by containing training in the neighbourhood of the (relatively large) initialisation; this is the standard lazy training argument (Chizat et al., 2019). In contrast, we initialise much closer to zero, allowing the neurons to move more.
Another related paper is Chizat & Bach (2020) . Using a mean-field analysis, the authors show that infinite-width two-layer ReLU nets converge to max-margin classifiers in a certain non-Hilbertian function space.

7. CONCLUSION

In this work, we prove that two-layer ReLU nets trained by gradient flow on orthogonally separable data converge to a combination of the positive and the negative max-margin classifier. To our knowledge, this is the first result characterising the inductive bias of training neural networks with ReLU nonlinearities that does not require infinite width or huge overparameterisation. The proof rests on a distinction between two phases of learning: an early phase, in which neurons specialise, and a late phase, in which the network's activation pattern is fixed and hence it behaves like an ensemble of linear subnetworks. This approach enables us to understand nonlinear ReLU networks in terms of the much better understood linear networks. Our hope is that a similar strategy will prove fruitful in the context of deeper networks and more complicated datasets as well.

A BASIC LEMMAS

This section collects a few lemmas useful for the proofs. We assume the same setting and notation as Sections 2 and 4. In addition, we denote by P_w the orthogonal projection onto span{x_i | w^T x_i = 0}^⊥, and define g : R^d → R^d,

    g(w) := - Σ_{i=1}^n ℓ_i'(0) · 1{w^T x_i > 0} P_w x_i.    (31)

A.1 LEMMAS ABOUT SECTORS

The following lemma gives a necessary condition for a vector to be an extremal direction.

Lemma A.1. If w ∈ S^{d-1} is an extremal direction, then g(w) = Cw for some constant C.

Proof. Let ŵ ∈ S^{d-1} be a positive extremal direction (the negative case is analogous), and let ŵ ∈ S_σ̂. A sector is called open if σ_i ≠ 0 for all i ∈ [n]. Denote by A(σ̂) the set of all open sectors adjacent to σ̂,

    A(σ̂) := { σ ∈ {±1}^n | max_{i∈[n]} |σ_i - σ̂_i| ≤ 1 }.    (32)

Since ŵ is a local maximum of G and G is sector-wise linear, ŵ maximises G when constrained to (the closure of) any adjacent sector, i.e. for any σ ∈ A(σ̂),

    ŵ = argmax_w G(w),  subject to  ||w||^2 = 1,  σ_i w^T x_i ≥ 0 for all i ∈ [n].    (33)

For w in the feasible region, G can be treated as a linear function with ∇G(w) = g(w_σ), where w_σ is any vector such that w_σ ∈ S_σ.
Hence, the necessary first-order KKT conditions for the problem (33) are

    g(w_σ) = C ŵ - Σ_{i=1}^n λ_i σ_i x_i,    (34)

where λ_i ≥ 0 for all i, but λ_i ≠ 0 requires that the corresponding constraint is tight, σ_i ŵ^T x_i = 0. It follows that P_ŵ λ_i σ_i x_i = 0. Multiplying eq. (34) from the left by P_ŵ therefore yields

    C ŵ = P_ŵ g(w_σ) = - Σ_{i=1}^n ℓ_i'(0) · 1{σ_i = 1} P_ŵ x_i.    (35)

By adjacency, σ_i = σ̂_i whenever σ̂_i ∈ {±1}, so they can differ only when σ̂_i = 0, i.e. when P_ŵ x_i = 0. It follows that

    C ŵ = - Σ_{i=1}^n ℓ_i'(0) · 1{σ̂_i = 1} P_ŵ x_i = g(ŵ).    (36)

The following lemma describes the local behaviour of the function G (defined in eq. (12)).

Lemma A.2. For w ∈ S^{d-1} and v ∈ R^d, there exists ε_max > 0 such that for ε ∈ [0, ε_max],

    G( (w + εv) / ||w + εv|| ) = (1 / ||w + εv||) [ G(w) - ε Σ_{i=1}^n ℓ_i'(0) 1{(w + εv)^T x_i > 0} v^T x_i ].    (37)

Proof. Let g be defined as in eq. (31); then

    G( (w + εv) / ||w + εv|| ) = (1 / ||w + εv||) (w + εv)^T g(w + εv).    (38)

Published as a conference paper at ICLR 2021

We now analyse w^T g(w + εv) and v^T g(w + εv) separately, starting with the former. Denote I_i^ε := 1{(w + εv)^T x_i > 0}. Then g(w + εv) can be written as

    g(w + εv) = - Σ_{i=1}^n ℓ_i'(0) I_i^ε I_i^0 P_w x_i - Σ_{i=1}^n ℓ_i'(0) I_i^ε (1 - I_i^0) P_w x_i
                - Σ_{i=1}^n ℓ_i'(0) I_i^ε (P_{w+εv} - P_w) x_i.    (39)

Define

    ε_max := (1/2) max ε > 0,  subject to:  sign{(w + εv)^T x_i} · sign{w^T x_i} ≥ 0 for all i.    (40)

For ε ∈ [0, ε_max], I_i^0 = 1 implies I_i^ε = 1, so the first term in eq. (39) equals g(w). Regarding the second term, I_i^ε (1 - I_i^0) is nonzero only if I_i^ε = 1 and I_i^0 = 0. For ε ∈ [0, ε_max], this can only happen if (w + εv)^T x_i > 0 and w^T x_i = 0. In this scenario, however, P_w x_i = 0, so the second term in eq. (39) is zero. Regarding the third term, as long as ε ∈ [0, ε_max], (w + εv)^T x_i = 0 implies w^T x_i = 0, so

    w ∈ span{x_j | w^T x_j = 0}^⊥ ⊆ span{x_j | (w + εv)^T x_j = 0}^⊥    (41)

and w^T (P_{w+εv} - P_w) = w^T - w^T = 0. It follows that

    w^T g(w + εv) = w^T g(w) = G(w).    (42)

Turning to v^T g(w + εv), we have

    v^T g(w + εv) = - Σ_{i=1}^n ℓ_i'(0) · 1{(w + εv)^T x_i > 0} v^T P_{w+εv} x_i,
    where  εv^T P_{w+εv} = (w + εv)^T P_{w+εv} - w^T P_{w+εv} = (w + εv)^T - w^T = εv^T.    (43)

Plugging eqs. (42) and (43) into eq. (38) yields the result.

A.2 TRAINING DYNAMICS

In the following lemma, we prove a formula for the evolution of the parameters of a two-layer network trained by gradient flow (eq. (4)). The formula has appeared before in Maennel et al. (2018), but without a proof.

Lemma A.3. Assume that the training inputs together with the zero vector, {x_i}_i ∪ {0}, are in general position. Then a two-layer ReLU network trained by gradient flow on (X, y) satisfies, for all j ∈ [p] and almost all t ≥ 0,

    ∂a_j/∂t = - Σ_{i=1}^n ℓ_i'(t) · ρ(w_j^T x_i),    (45)
    ∂w_j/∂t = - Σ_{i=1}^n ℓ_i'(t) · 1{w_j^T x_i > 0} a_j P_{w_j} x_i.    (46)

Proof. Fix θ, and denote by Σ_θ ∈ {-1, 0, 1}^{p×n} the activation matrix for f_θ, Σ_θ[j, i] := sign(w_j^T x_i). Then for any sequence θ_k → θ such that {∇ℓ(θ_k)} exists and has a limit,

    lim_{k→∞} Σ_{θ_k} ∈ { Σ ∈ {±1}^{p×n} | Σ[j, i] = Σ_θ[j, i] if Σ_θ[j, i] ≠ 0 }.    (47)

Conversely, for any Σ in the set above, there exists a sequence θ_k → θ such that lim_{k→∞} Σ_{θ_k} = Σ. To see this, observe that each w_j can be approached separately. Let A be the matrix whose rows are formed by those x_i for which w_j^T x_i = 0. Then by the general position of the inputs, A is a wide full-rank matrix and A w_j = 0. It follows that for any ε, Aw = ε has a solution, which can be chosen convergent to w_j as ε → 0. We deduce that

    ∂ℓ(θ(t)) = conv { g(Σ) | Σ[j, i] ∈ {±1}, Σ[j, i] = Σ_{θ(t)}[j, i] if Σ_{θ(t)}[j, i] ≠ 0 },    (48)

where we define g(Σ) coordinate-wise as

    g(Σ)[a_j] := Σ_{i=1}^n ℓ_i'(t) 1{Σ[j, i] = 1} w_j^T x_i,
    g(Σ)[w_j] := Σ_{i=1}^n ℓ_i'(t) 1{Σ[j, i] = 1} a_j x_i.    (49)

Since the value of g(Σ)[a_j] is independent of Σ, this proves eq. (45). The proof of eq. (46) is slightly more complicated, as we need to pin down a single member of ∂ℓ(θ(t)). To do that, we recall a result by Davis et al. (2020), who show that for a large class of deep learning scenarios (which includes ours), the objective admits a chain rule, i.e.

    ℓ'(t) = ⟨∂ℓ(θ(t)), ∂θ/∂t⟩,   for almost all t ≥ 0,    (50)

where the right-hand side should be interpreted as the only element of the set {⟨h, ∂θ/∂t⟩ | h ∈ ∂ℓ(θ(t))}. For t such that both eq.
(4) and eq. (50) hold,

    0 = ⟨∂ℓ(θ(t)) - ∂ℓ(θ(t)), ∂θ/∂t⟩,    (51)

implying that

    ∂θ/∂t ∈ span{∂ℓ(θ(t)) - ∂ℓ(θ(t))}^⊥.    (52)

Suppose h_1, h_2 satisfy both eq. (4) and eq. (52) (taking the role of ∂θ/∂t). Then

    h_1 - h_2 ∈ (∂ℓ(θ(t)) - ∂ℓ(θ(t))) ∩ span{∂ℓ(θ(t)) - ∂ℓ(θ(t))}^⊥ = {0}.    (53)

It follows that ∂θ/∂t is the unique member of both span{∂ℓ(θ(t)) - ∂ℓ(θ(t))}^⊥ and -∂ℓ(θ(t)). By eqs. (48) and (49),

    span{∂ℓ(θ(t)) - ∂ℓ(θ(t))} ⊇ span{ ξ_ij | w_j^T x_i = 0 },    (54)

where ξ_ij[w_j] = x_i and all other elements of ξ_ij are zero. Since

    ∂θ/∂t ∈ span{∂ℓ(θ(t)) - ∂ℓ(θ(t))}^⊥ ⊆ span{ ξ_ij | w_j^T x_i = 0 }^⊥,    (55)

we obtain that for all (i, j) with w_j^T x_i = 0, ∂w_j/∂t ⊥ x_i. In other words,

    ∂w_j/∂t = P_{w_j} ∂w_j/∂t ∈ conv_Σ { - Σ_{i=1}^n ℓ_i'(t) 1{Σ[j, i] = 1} a_j P_{w_j} x_i },    (56)

where the inclusion follows from ∂θ/∂t ∈ -∂ℓ(θ(t)) and eqs. (48) and (49). Now observe that by definition, P_{w_j} x_i = 0 for all (i, j) with Σ_θ[j, i] = 0; hence the set in eq. (56) is a singleton whose only element equals eq. (46).

The following lemma shows that a balanced two-layer network remains balanced and neurons keep their signs.

Lemma A.4. If a two-layer neural network is balanced at initialisation and trained by gradient flow with a loss whose derivative is bounded, then for t ≥ 0, a_j(t) = sign(a_j(0)) · ||w_j(t)||.

Proof of Lemma 2 (continued). From σ̂_j = sign(w^T x_j) = 0 it further follows that ||w + εv|| = √(1 + ε²). Hence, (G(w) + εα)/||w + εv|| = (G(w) + εα)/(1 + O(ε²)), which strictly exceeds G(w) for ε small enough.

We have thus shown that if σ̂ is positive extremal, then necessarily σ̂_i = 1 for all examples with y_i = 1, and σ̂_i ∈ {0, -1} for examples with y_i = -1. Suppose now that σ̂_j = 0 for an example with y_j = -1 that satisfies x_j^T x_k < 0 for some k with y_k = 1. Taking v := -x_j/||x_j|| will make α strictly positive, as the term corresponding to i = k in eq. (60) will be strictly positive (this term's indicator equals 1, as we know from the above that σ̂_k = 1). Like in the previous paragraph, ||w + εv|| = 1 + O(ε²), which suffices to show that G(w) is locally submaximal.
Finally, let $x_j$ be such that $y_j = -1$ and $x_j^\top x_i = 0$ for all $i$ with $y_i = 1$, and suppose that $\sigma_j = -1$. With $v := P_w x_j / \|P_w x_j\|$ (we know that $P_w x_j \neq 0$ because $w^\top x_j \neq 0$ by $\sigma_j = -1$), we have

$$\alpha = -\frac{1}{\|P_w x_j\|} \sum_{i=1}^n \ell_i'(0)\, \mathbf{1}\{(w + \epsilon v)^\top x_i > 0\}\, x_j^\top P_w x_i. \tag{62}$$

We claim that each term in eq. (62) is zero. For terms with $\sigma_i = -1$, the indicator $\mathbf{1}\{(w + \epsilon v)^\top x_i > 0\}$ is zero. For terms with $\sigma_i = 0$, $P_w x_i = 0$. (Also notice that such terms necessarily satisfy $y_i = -1$ and $x_i^\top x_k = 0$ for all $k$ with $y_k = 1$, which we will need shortly.) Lastly, for terms with $\sigma_i = 1$, we know $y_i = 1$, and hence $x_i^\top x_l = 0$ for all $l$ with $\sigma_l = 0$. In other words, $x_i \perp \operatorname{span}\{x_l \mid w^\top x_l = 0\}$, implying $P_w x_i = x_i$. In the context of eq. (62), we obtain $x_j^\top P_w x_i = x_j^\top x_i = 0$, concluding the proof that $\alpha = 0$. Since $\sigma_j = -1$, $\|w + \epsilon v\| < \|w\| = 1$ for $\epsilon$ small enough, and $(G(w) + \epsilon\alpha)/\|w + \epsilon v\| > G(w)$. We have thus ruled out all sectors except $\sigma^+$, proving that for orthogonally separable datasets there is a unique positive extremal sector. ∎

Lemma 3. Assume the setting of Theorem 1. If at time $T$ the neuron $(a_j, w_j)$ satisfies $a_j(T) > 0$ and $w_j(T) \in S_\sigma$, where $\sigma$ is the positive extremal sector (eq. (13)), then $w_j(t) \in S_\sigma$ for all $t \geq T$. The same holds if $a_j(T) < 0$ and $\sigma$ is the negative extremal sector (eq. (14)).

Proof. We omit the neuron index and only prove the positive case; the negative case is analogous. Denote $\sigma(t) := \operatorname{sign}(X w(t))$. We proceed by contradiction. Suppose there exists a time $T_1 > T$ such that $\sigma(T_1) \neq \sigma(T)$. Wlog, take $T_1$ such that $\sigma(t)$ is constant on $(T, T_1)$ and denote this constant sector $\bar\sigma$; by continuity, $\bar\sigma_k = \sigma_k(T)$ if $\sigma_k(T) \neq 0$. Now consider $k$ with $\sigma_k(T) = 0$.
By the gradient flow differential inclusion, for almost all $t \in (T, T_1)$,

$$\frac{\partial\, w^\top x_k}{\partial t} \in \operatorname{conv}_{\sigma'} \Big\{ -\sum_{i=1}^n \ell_i'(t)\, \mathbf{1}\{\sigma'_i = 1\}\, a\, x_i^\top x_k \Big\}, \tag{63}$$

where each $\sigma'$ in the definition of the convex hull satisfies $\sigma'_i = \sigma_i(T)$ if $\sigma_i(T) \neq 0$, implying

$$\{i \mid \sigma'_i = 1\} \subseteq \{i \mid \sigma_i(T) = 1\} \cup \{i \mid \sigma_i(T) = 0\} = \{i \mid y_i = 1\} \cup \{i \mid y_i = -1 \text{ and } x_i^\top x_j = 0 \text{ for all } j \text{ with } y_j = 1\}.$$

Denote the two sets in the last expression $I_+$ and $I_0$, and consider the gradient corresponding to some $\sigma'$ in eq. (63). The gradient terms corresponding to $i \in I_+$ are zero (because $k \in I_0$ and so $x_i^\top x_k = 0$), and the terms corresponding to $i \in I_0$ (if there are any) are negative. The total gradient for $\sigma'$ is therefore non-positive, which is preserved under taking convex hulls, and so we obtain $\partial\, w^\top x_k / \partial t \leq 0$ for almost all $t \in (T, T_1)$. Since $w(T)^\top x_k = 0$, it follows that $\bar\sigma_k \neq 1$. By Lemma A.3, for almost all $t \in (T, T_1)$ and any $k \in [n]$,

$$\frac{\partial\, w^\top x_k}{\partial t} = -\sum_{i=1}^n \ell_i'(t)\, \mathbf{1}\{\bar\sigma_i = 1\}\, a\, x_i^\top P_w x_k,$$

where $\mathbf{1}\{\bar\sigma_i = 1\} = \mathbf{1}\{y_i = 1\}$ as we have shown above. Observe that for $x_i$ with $y_i = 1$, we have $P_w x_i = x_i$. This is because $P_w$ projects onto

$$\operatorname{span}\{x_l \mid \bar\sigma_l = 0\}^\perp \supseteq \operatorname{span}\{x_l \mid \sigma_l(T) = 0\}^\perp$$

and $x_i$ lies in the right-hand side by the definition of the positive extremal sector (eq. (13)). Therefore

$$\frac{\partial\, w^\top x_k}{\partial t} = -\sum_{i=1}^n \ell_i'(t)\, \mathbf{1}\{y_i = 1\}\, a\, x_i^\top x_k.$$

One can easily check that if $\sigma_k(T) = 1$ then $\partial\, w^\top x_k / \partial t > 0$, if $\sigma_k(T) = -1$ then $\partial\, w^\top x_k / \partial t < 0$, and if $\sigma_k(T) = 0$ then $\partial\, w^\top x_k / \partial t = 0$. It follows that $\sigma(T_1) = \sigma(T)$, which is a contradiction. ∎

Lemma B.1. Assume the setting of Theorem 1. If at time $T$ the neuron $(a_j, w_j)$ satisfies $a_j(T) > 0$ and $w_j(T)^\top x_k > 0$ for some $k \in [n]$ with $y_k = 1$, then $w_j(t)^\top x_k > 0$ for all $t \geq T$. The same holds if instead $a_j(T) < 0$ and $y_k = -1$.

Proof. We omit the neuron index and only prove the positive case; the negative case is analogous. We will show that $\partial\, w^\top x_k / \partial t \geq 0$ for almost all $t \in (T, \infty)$. By the gradient flow differential inclusion, for almost all $t \in (T, \infty)$,

$$\frac{\partial\, w^\top x_k}{\partial t} \in \operatorname{conv}_{\sigma'} \Big\{ -\sum_{i=1}^n \ell_i'(t)\, \mathbf{1}\{\sigma'_i = 1\}\, a\, x_i^\top x_k \Big\}.$$

Fix any $\sigma'$ and consider the summand corresponding to example $i$.
If $y_i = 1$, then $\ell_i'(t) < 0$ and $x_i^\top x_k > 0$, so the summand is non-negative. If $y_i = -1$, then $\ell_i'(t) > 0$ and $x_i^\top x_k \leq 0$, so the summand is again non-negative. It follows that the sum is non-negative irrespective of $\sigma'$, hence $\partial\, w^\top x_k / \partial t \geq 0$. ∎

Corollary 1. Under the conditions of Theorem 1, there exist constants $u, z \geq 0$ such that

$$\frac{f_{\theta(t)}(x)}{\|\theta(t)\|^2} \to u\, \rho(w_+^\top x) - z\, \rho(w_-^\top x) \quad \text{as } t \to \infty.$$

Proof. (With a slight abuse of notation, $u, z \in \mathbb{R}_+^p$ below denote the vectors from Theorem 1; the scalar constants of the statement are recovered at the end of the proof.) By Lemma A.4, $\|\theta\|^2 = \|a\|^2 + \|W\|_F^2 = 2\|a\|^2 = 2\|W\|_F^2$. Then for any $x \in \mathbb{R}^d$,

$$\frac{2 f_{\theta(t)}(x)}{\|\theta(t)\|^2} = \frac{a(t)^\top}{\|a(t)\|} \frac{\rho(W(t) x)}{\|W(t)\|_F}.$$

Denote $\tilde a := \lim_{t \to \infty} a(t)/\|a(t)\|$. Then by Theorem 1, as $t \to \infty$,

$$\frac{2 f_{\theta(t)}(x)}{\|\theta(t)\|^2} \to \tilde a^\top \rho\big(u\, w_+^\top x + z\, w_-^\top x\big) \tag{70}$$

$$= \tilde a^\top \rho(u\, w_+^\top x) + \tilde a^\top \rho(z\, w_-^\top x) \tag{71}$$

$$= \tilde a^\top u\, \rho(w_+^\top x) + \tilde a^\top z\, \rho(w_-^\top x), \tag{72}$$

where in eq. (71) we used the fact that for all $i \in [p]$, either $u_i = 0$ or $z_i = 0$, and in eq. (72) we used $u, z \geq 0$. Finally, since $\tilde a = \|w_+\|\, u - \|w_-\|\, z$, we have

$$\tilde a^\top u = \|w_+\|\, u^\top u \geq 0, \tag{73}$$

$$\tilde a^\top z = -\|w_-\|\, z^\top z \leq 0, \tag{74}$$

which completes the proof. ∎
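Corollary 1's limit classifier can be illustrated numerically. The sketch below is our illustration (not code from the paper): it builds a toy orthogonally separable dataset, writes down the per-class max-margin vectors $w_+, w_-$ (in closed form, which holds here by the symmetric placement of the points), and checks that $u\,\rho(w_+^\top x) - z\,\rho(w_-^\top x)$ separates the data for arbitrary positive constants $u, z$.

```python
import numpy as np

# toy orthogonally separable dataset: cross-class inner products <= 0,
# within-class inner products > 0
X_pos = np.array([[2.0, 0.5], [2.0, -0.5]])
X_neg = np.array([[-2.0, 1.0], [-2.0, -1.0]])
assert (X_pos @ X_neg.T <= 0).all()
assert (X_pos @ X_pos.T > 0).all() and (X_neg @ X_neg.T > 0).all()

# per-class max-margin vectors: minimal-norm w with w^T x >= 1 on the
# respective subset (closed form for this symmetric example)
w_plus = np.array([0.5, 0.0])
w_minus = np.array([-0.5, 0.0])
assert (X_pos @ w_plus >= 1 - 1e-9).all() and (X_neg @ w_minus >= 1 - 1e-9).all()

relu = lambda v: np.maximum(v, 0.0)

# the limit classifier of Corollary 1, for arbitrary positive constants u, z
u, z = 0.7, 1.3
g = lambda x: u * relu(w_plus @ x) - z * relu(w_minus @ x)

# it classifies the whole toy dataset correctly
assert all(g(x) > 0 for x in X_pos) and all(g(x) < 0 for x in X_neg)
```

Note that the sign of $g$ does not depend on the magnitudes of $u$ and $z$: on positive examples the $w_-$ branch is inactive, and vice versa, which is exactly the two-max-margin structure the corollary describes.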

C RELATIONSHIP TO NONLINEAR MAX-MARGIN

Lemma C.1. Let $(X, y)$ be an orthogonally separable dataset, let $w_+, w_-$ be defined as in eqs. (6) and (7), and let

$$\tilde W := u\, w_+^\top + z\, w_-^\top, \qquad \tilde a := \|w_+\|\, u - \|w_-\|\, z,$$

for some $u, z \in \mathbb{R}_+^p$ such that $u_i = 0$ or $z_i = 0$ for all $i \in [p]$. Also let $u, z$ be normalised such that $\|u\| = \|w_+\|^{-1/2}$ and $\|z\| = \|w_-\|^{-1/2}$. Then $\tilde\theta := (\tilde W, \tilde a)$ is a KKT point of the following constrained optimisation problem:

$$\min \tfrac{1}{2}\|\theta\|^2, \quad \text{s.t.} \quad y_i f_\theta(x_i) \geq 1, \; i \in [n]. \tag{77}$$

Proof. By standard linear max-margin considerations (e.g. Hastie et al. (2008, Section 4.5.2)), we know that

$$w_+ = \sum_{i:\, y_i = 1} \alpha_i x_i, \qquad w_- = \sum_{i:\, y_i = -1} \alpha_i x_i,$$

for some $\alpha_i \geq 0$ such that $\alpha_i = 0$ if $x_i$ is a non-support vector. It follows by orthogonal separability that for $i$ with $y_i = 1$, $w_+^\top x_i > 0$ and $w_-^\top x_i \leq 0$, and for $i$ with $y_i = -1$, $w_+^\top x_i \leq 0$ and $w_-^\top x_i > 0$; we will need these properties shortly.

Let us now turn to checking the KKT status of $\tilde\theta$ wrt. eq. (77). We start by showing that $\tilde\theta$ is feasible. Let $x_i$ be such that $y_i = 1$; then

$$y_i f_{\tilde\theta}(x_i) = \tilde a^\top \rho(\tilde W x_i) \tag{81}$$

$$= \tilde a^\top \rho(u\, w_+^\top x_i + z\, w_-^\top x_i) \tag{82}$$

$$= \tilde a^\top u\, w_+^\top x_i \tag{83}$$

$$= \|w_+\|\, \|u\|^2\, w_+^\top x_i \tag{84}$$

$$= w_+^\top x_i \geq 1, \tag{85}$$

where the last inequality follows from the definition of $w_+$, eq. (6). Similarly, if $x_i$ is such that $y_i = -1$, then

$$y_i f_{\tilde\theta}(x_i) = -\tilde a^\top z\, w_-^\top x_i \tag{86}$$

$$= \|w_-\|\, \|z\|^2\, w_-^\top x_i \tag{87}$$

$$= w_-^\top x_i \geq 1. \tag{88}$$

This shows that $\tilde\theta$ is feasible. Next, we show that $\tilde\theta$ is a KKT point, i.e. we show that there exist $\lambda_1, \dots, \lambda_n \geq 0$ such that

1. for all $i \in [n]$, $\lambda_i \big(y_i f_{\tilde\theta}(x_i) - 1\big) = 0$, and
2. $\tilde\theta \in \sum_{i=1}^n \lambda_i y_i\, \partial_\theta f_{\tilde\theta}(x_i)$,

where $\partial_\theta f_{\tilde\theta}(x)$ denotes the Clarke subdifferential of $f_\theta(x)$ wrt. $\theta$, evaluated at $\tilde\theta$. Specifically, we show that the choice

$$\lambda_i = \begin{cases} \alpha_i / \|w_+\|, & \text{if } y_i = 1, \\ \alpha_i / \|w_-\|, & \text{if } y_i = -1, \end{cases}$$

satisfies both conditions above. As for the first condition, observe that if $i$ is a non-support example, then $\lambda_i = \alpha_i = 0$ and the condition holds. If $i$ is a support example, then $y_i f_{\tilde\theta}(x_i) = 1$ by eqs. (85) and (88), and the condition holds as well.
As for the second condition, denote $g_i(\theta) := \big(I_\theta(x_i)\, a\, x_i^\top,\; \rho(W x_i)\big)$, where $I_\theta(x) = \operatorname{diag}[\mathbf{1}\{W x > 0\}] \in \mathbb{R}^{p \times p}$ is the diagonal matrix whose $(j, j)$-th element is one if $w_j^\top x > 0$ and zero otherwise. It holds that $g_i(\tilde\theta) \in \partial_\theta f_{\tilde\theta}(x_i)$, and

$$\sum_{i=1}^n \lambda_i y_i\, g_i(\tilde\theta) = \sum_{i:\, y_i = 1} \frac{\alpha_i}{\|w_+\|} \Big( \operatorname{diag}[\mathbf{1}\{u > 0\}]\, \tilde a\, x_i^\top,\; \rho(u\, w_+^\top x_i + z\, w_-^\top x_i) \Big) - \sum_{i:\, y_i = -1} \frac{\alpha_i}{\|w_-\|} \Big( \operatorname{diag}[\mathbf{1}\{z > 0\}]\, \tilde a\, x_i^\top,\; \rho(u\, w_+^\top x_i + z\, w_-^\top x_i) \Big)$$

$$= \sum_{i:\, y_i = 1} \frac{\alpha_i}{\|w_+\|} \Big( \|w_+\|\, u\, x_i^\top,\; u\, w_+^\top x_i \Big) - \sum_{i:\, y_i = -1} \frac{\alpha_i}{\|w_-\|} \Big( -\|w_-\|\, z\, x_i^\top,\; z\, w_-^\top x_i \Big)$$

$$= \frac{1}{\|w_+\|} \Big( \|w_+\|\, u\, w_+^\top,\; u\, w_+^\top w_+ \Big) - \frac{1}{\|w_-\|} \Big( -\|w_-\|\, z\, w_-^\top,\; z\, w_-^\top w_- \Big)$$

$$= \Big( u\, w_+^\top + z\, w_-^\top,\; \|w_+\|\, u - \|w_-\|\, z \Big) = \tilde\theta.$$

This proves the second condition, and shows that $\tilde\theta$ is a KKT point of eq. (77). ∎
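Both KKT conditions of Lemma C.1 can be verified numerically on a toy instance. In the sketch below (our illustration; the dataset, the dual coefficients $\alpha_i$, and the closed-form $w_\pm$ are assumptions that hold for this symmetric example), we build $\tilde\theta = (\tilde W, \tilde a)$ with the stated normalisation and check feasibility and stationarity.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)
nw = np.linalg.norm

# toy orthogonally separable data and per-class max-margin vectors;
# all four points are support vectors here, with dual coefficients alpha
X = np.array([[2.0, 0.5], [2.0, -0.5], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_plus, w_minus = np.array([0.5, 0.0]), np.array([-0.5, 0.0])
alpha = np.array([0.125, 0.125, 0.125, 0.125])  # w_+/- = sum_i alpha_i x_i
assert np.allclose(alpha[:2] @ X[:2], w_plus)
assert np.allclose(alpha[2:] @ X[2:], w_minus)

# parameters of Lemma C.1 with p = 2 neurons and disjoint supports of u, z,
# normalised so that ||u|| = ||w_+||^{-1/2} and ||z|| = ||w_-||^{-1/2}
u = np.array([nw(w_plus) ** -0.5, 0.0])
z = np.array([0.0, nw(w_minus) ** -0.5])
W = np.outer(u, w_plus) + np.outer(z, w_minus)
a = nw(w_plus) * u - nw(w_minus) * z

f = lambda x: a @ relu(W @ x)

# feasibility: margins are at least one (tight on support vectors)
margins = np.array([y[i] * f(X[i]) for i in range(4)])
assert (margins >= 1 - 1e-9).all()

# stationarity: (W, a) = sum_i lambda_i y_i g_i, with g_i the Clarke
# subgradient of f at x_i and lambda_i = alpha_i / ||w_+|| or alpha_i / ||w_-||
lam = np.where(y > 0, alpha / nw(w_plus), alpha / nw(w_minus))
gW = sum(lam[i] * y[i] * np.outer((W @ X[i] > 0) * a, X[i]) for i in range(4))
ga = sum(lam[i] * y[i] * relu(W @ X[i]) for i in range(4))
assert np.allclose(gW, W) and np.allclose(ga, a)
```

For this toy instance every example is a support vector, so the margins come out exactly one and all complementary-slackness terms vanish.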

D EXPERIMENTS ON REAL DATA

In this section, we explore the applicability of our result to real-world datasets and architectures (which lie outside the scope formally covered by our assumptions). We experiment on the MNIST dataset restricted to two classes, the digit 0 and the digit 1. We train a network consisting of six convolutional layers followed by two fully-connected layers. We view the six convolutional layers as a 'feature extractor' and the two fully-connected layers as a two-layer fully-connected network of the kind we analyse in this paper. The details of the architecture are given in Table 1. We train the network by Adam with the binary cross-entropy loss and a batch size of 128, for 50 epochs. Prior to training, we multiply the weights of the fully-connected layers by 0.05, to approximate the small-norm initialisation assumed by the theory.

We conduct two sub-studies. First, we demonstrate that the network learns orthogonally separable representations all by itself, in the course of training. This is shown in Figure 3. The first subplot shows three distributions: the blue distribution is that of $x_i^\top x_j$ where $x_i$ is sampled from class 0 and $x_j$ from class 1; the orange (or green) distribution is that of $x_i^\top x_j$ where both $x_i, x_j$ are sampled from class 0 (or class 1). The other subplots show the analogous inner-product distributions for the intermediate representations, or learned features, of the data, i.e. $f_\theta^l(x_i)^\top f_\theta^l(x_j)$ instead of $x_i^\top x_j$. What we see is that the network learns representations such that examples of the same class are more similar to each other than examples of different classes: the orange and green distributions lie generally further to the right than the blue distribution. Moreover, higher-layer representations
are generally more strongly separated: as we move up the layer hierarchy, the orange and green distributions keep shifting rightward, whereas the blue distribution shifts leftward. Remarkably, the 7th-layer representations are orthogonally separable.

In the second sub-study, we explore properties of the weight matrix learnt by the first linear layer of the network, in analogy to the first-layer weight matrix in a two-layer net. Figure 4a shows the top ten singular values of the weight matrix $W^7 \in \mathbb{R}^{128 \times 3136}$. We see that despite its size, it has very few (perhaps five to ten) significantly non-zero singular values. This is similar to what we observed for synthetic data in Section 5, though the separation between small and large singular values is less crisp and there are more than two non-zero values. Figure 4b shows the rows (neurons) of $W^7$ projected onto the top two singular dimensions (note that unlike in Section 5, the projection is lossy). The neurons roughly form three clusters: a mixed cluster close to the origin and two clusters corresponding to positive and negative outer-layer weights. Compared to our observations from Section 5, there is less variation in the neurons' norms, leading them to form clusters rather than rays. This deviation from the theoretical prediction could be due to a number of reasons, e.g. the use of biases, convolutional layers, or the large dimensionality of the layer. We leave a detailed investigation of this question to future work.
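The inner-product statistics behind Figure 3 are straightforward to compute. The following sketch (our illustration, using synthetic stand-in features rather than the trained network's actual activations) collects the three distributions for one layer and tests the orthogonal-separability criterion: non-positive cross-class and strictly positive within-class inner products.

```python
import numpy as np

def similarity_distributions(F, labels):
    """Within- and cross-class inner products of feature vectors,
    one call per layer's features (cf. Figure 3)."""
    pos, neg = F[labels == 1], F[labels == 0]
    cross = (pos @ neg.T).ravel()                          # blue distribution
    within0 = (neg @ neg.T)[np.triu_indices(len(neg), 1)]  # orange
    within1 = (pos @ pos.T)[np.triu_indices(len(pos), 1)]  # green
    return cross, within0, within1

def orthogonally_separable(F, labels):
    cross, within0, within1 = similarity_distributions(F, labels)
    return cross.max() <= 0 and min(within0.min(), within1.min()) > 0

# synthetic stand-in for learned layer-7 features: the two classes occupy
# orthogonal coordinate blocks, mimicking what we observe after training
rng = np.random.default_rng(0)
F = np.abs(rng.normal(size=(100, 8)))
F[:50, 4:] = 0.0   # class 0 lives on the first four coordinates
F[50:, :4] = 0.0   # class 1 lives on the last four
labels = np.repeat([0, 1], 50)
assert orthogonally_separable(F, labels)
```

On real activations one would pass the flattened layer-$l$ features and the class labels; the three returned arrays are exactly the populations histogrammed in the figure.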



Footnotes:
1. A more thorough overview of related work can be found in Section 6.
2. Formally, this means that if $\{x_i\}$ are sampled from any distribution with a density wrt. the Lebesgue measure, then the theorem (treated as an implication) holds with probability one wrt. the data.
3. We verify experimentally in Section 5.1 that these assumptions are indeed not crucial.
4. That is, no $k$ of these points lie on a $(k-2)$-dimensional hyperplane, for all $k \geq 2$.



Figure 1: a) The 10 largest singular values of the first-layer weight matrix $W$ after training. Each dot represents one singular value. b) Neurons (rows of $W$) projected onto the top two singular dimensions. Orange (or blue) dots represent neurons with $a_j > 0$ (or $a_j < 0$). c) Histogram of correlations between each neuron and its respective max-margin direction. (There are 100 neurons in total.)

Figure 2: a-c) The 10 largest singular values of the first-, second- and third-layer weight matrix $W^l$ after training. Each dot represents one singular value. d-f) Neurons (rows of $W^l$) projected onto the respective top two singular dimensions. g) Histogram of correlations between each first-layer neuron and the closest max-margin direction. (There are 100 neurons in total.)


Figure 3: Distributions of feature similarity, where examples are sampled from the specified classes. Specifically, the $l$-th subplot shows the distribution of $f_\theta^l(x_i)^\top f_\theta^l(x_j)$ for $x_i, x_j$ sampled from different classes (blue), both from class 0 (orange), or both from class 1 (green).

Figure 4: a) The 10 largest singular values of the first linear layer's weight matrix $W^7$ after training. Each dot represents one singular value. b) Neurons (rows of $W^7$) projected onto the top two singular dimensions. Orange (or blue) dots represent neurons with $W^8[j] > 0$ (or $W^8[j] < 0$).

The assumption that the support vectors span $\mathbb{R}^d$ comes from Ji & Telgarsky (2019a) and Soudry et al. (2018). It seems to us that it could be lifted, though we have not investigated this possibility in depth.

Table 1: Architecture of the studied network. By conv(n, k, s, p) we denote a convolutional layer with n kernels of size k × k and stride s, where the input to the layer is padded by p rows or columns on each margin. By fc(n, o) we denote a fully-connected layer whose input dimension is n and whose output dimension is o.


Proof. By Lemma A.3, for almost all $t \geq 0$,

$$\frac{\partial a_j^2}{\partial t} = -2 a_j \sum_{i=1}^n \ell_i'(t)\, \rho(w_j^\top x_i) = -2 \sum_{i=1}^n \ell_i'(t)\, \mathbf{1}\{w_j^\top x_i > 0\}\, a_j\, w_j^\top x_i = \frac{\partial \|w_j\|^2}{\partial t},$$

i.e. $a_j$ and $w_j$ grow equally fast. Since $|a_j(0)| = \|w_j(0)\|$ at initialisation, $|a_j(t)| = \|w_j(t)\|$ throughout training. Next, denote by $B, V > 0$ some scalars such that $|\ell_i'(u)| \leq B$ for all $i \in [n]$ and $u \in \mathbb{R}$, and $2\|x_i\| \leq V$ for all $i \in [n]$. Then $|\partial a_j^2 / \partial t| \leq n B V a_j^2$, or equivalently $|\partial \log a_j^2 / \partial t| \leq n B V$. It follows that $a_j^2(t)$ lies between $a_j^2(0) \exp(-nBVt)$ and $a_j^2(0) \exp(nBVt)$, and hence $a_j$ cannot cross zero in finite time, proving eq. (57). ∎
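The formulas of Lemma A.3 and the conserved quantity of Lemma A.4 can be sanity-checked numerically. The sketch below is our illustration (assuming a logistic loss and a generic random parameter point, at which no input lies on an activation boundary, so the projection $P_{w_j}$ reduces to the identity): it compares the right-hand sides of eqs. (45)-(46) against a finite-difference gradient of the loss, and then checks that $a_j^2$ and $\|w_j\|^2$ change at the same rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 5, 3, 4
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
a = rng.normal(size=p)
W = rng.normal(size=(p, d))
relu = lambda v: np.maximum(v, 0.0)

def loss(a, W):
    # logistic loss: sum_i log(1 + exp(-y_i f(x_i)))
    return sum(np.log1p(np.exp(-y[i] * (a @ relu(W @ X[i])))) for i in range(n))

# l'_i(t): derivative of the i-th loss term wrt the network output
lp = np.array([-y[i] / (1.0 + np.exp(y[i] * (a @ relu(W @ X[i])))) for i in range(n)])

# right-hand sides of eqs. (45)-(46); at a generic point P_{w_j} = identity
da = np.array([-sum(lp[i] * relu(W[j] @ X[i]) for i in range(n)) for j in range(p)])
dW = np.array([-sum(lp[i] * (W[j] @ X[i] > 0) * a[j] * X[i] for i in range(n)) for j in range(p)])

# compare against a central finite-difference gradient of the loss
eps = 1e-6
da_fd = np.zeros(p)
dW_fd = np.zeros((p, d))
for j in range(p):
    e = np.zeros(p); e[j] = eps
    da_fd[j] = -(loss(a + e, W) - loss(a - e, W)) / (2 * eps)
    for k in range(d):
        E = np.zeros((p, d)); E[j, k] = eps
        dW_fd[j, k] = -(loss(a, W + E) - loss(a, W - E)) / (2 * eps)
assert np.allclose(da, da_fd, atol=1e-4) and np.allclose(dW, dW_fd, atol=1e-4)

# Lemma A.4's conserved quantity: the rates of a_j^2 and ||w_j||^2 coincide,
# so a_j^2 - ||w_j||^2 stays constant along the gradient flow
da2 = 2 * a * da
dw2 = 2 * np.array([W[j] @ dW[j] for j in range(p)])
assert np.allclose(da2, dw2)
```

The equality of the two rates is the differential form of balancedness: it reduces to the identity $\rho(s) = \mathbf{1}\{s > 0\}\, s$ applied term by term.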

B PROOFS OF MAIN RESULTS

Lemma 2. In the setting of Theorem 1, there is exactly one positive extremal direction and exactly one negative extremal direction. The positive extremal sector $\sigma^+$ is given by

$$\sigma_i^+ = \begin{cases} 1, & \text{if } y_i = 1, \\ -1, & \text{if } y_i = -1 \text{ and } x_i^\top x_k < 0 \text{ for some } k \text{ with } y_k = 1, \\ 0, & \text{if } y_i = -1 \text{ and } x_i^\top x_k = 0 \text{ for all } k \text{ with } y_k = 1, \end{cases} \tag{13}$$

and the negative extremal sector $\sigma^-$ is given by the analogous pattern with all labels inverted:

$$\sigma_i^- = \begin{cases} 1, & \text{if } y_i = -1, \\ -1, & \text{if } y_i = 1 \text{ and } x_i^\top x_k < 0 \text{ for some } k \text{ with } y_k = -1, \\ 0, & \text{if } y_i = 1 \text{ and } x_i^\top x_k = 0 \text{ for all } k \text{ with } y_k = -1. \end{cases} \tag{14}$$

Proof. We will prove the positive case; the negative case follows by inverting all labels. Because $G$ is a continuous function on a compact domain, it has a maximum. At least one maximum must moreover be strict, for otherwise $G$ would have to be constant. This shows that a positive extremal direction exists; we now show that there is no more than one such direction. By Lemma A.1, there cannot be more than one extremal direction per sector; it therefore suffices to show that no sector except one, $\sigma^+$, admits a positive extremal direction. We will show that if $w \in S^{d-1}$ lies in any sector other than $S_{\sigma^+}$, then $w$ is not positive extremal; in particular, we show that $G(w)$ can be locally increased.

Let $\sigma \neq \sigma^+$ and let $w \in S_\sigma \cap S^{d-1}$. By Lemma A.2, for any $v \in \mathbb{R}^d$ there exists $\epsilon_{\max} > 0$ such that for $\epsilon \in (0, \epsilon_{\max}]$,

$$G\Big(\frac{w + \epsilon v}{\|w + \epsilon v\|}\Big) = \frac{G(w) + \epsilon \alpha}{\|w + \epsilon v\|},$$

where

$$\alpha = -\sum_{i=1}^n \ell_i'(0)\, \mathbf{1}\{(w + \epsilon v)^\top x_i > 0\}\, v^\top x_i. \tag{60}$$

We now analyse the different possible realisations of $\sigma$, and for each we find $v \in \mathbb{R}^d$ such that $(G(w) + \epsilon\alpha)/\|w + \epsilon v\| > G(w)$ for small $\epsilon$.

Suppose first that $\sigma_j = -1$ for some example with $y_j = 1$, or that $\sigma_j = 1$ for some example with $y_j = -1$. Then set $v := y_j x_j / \|x_j\|$. By orthogonal separability, we have that $\alpha \geq 0$. Also, $\sigma_j = \operatorname{sign}(w^\top x_j) = -y_j$ implies $w^\top v < 0$, therefore $\|w + \epsilon v\| < \|w\| = 1$ for $\epsilon$ small enough. It follows that $(G(w) + \epsilon\alpha)/\|w + \epsilon v\| > G(w)$.

Next, suppose that $\sigma_j = 0$ for some example with $y_j = 1$, and set $v := x_j / \|x_j\|$. Then $\alpha > 0$, because each term in eq. (60) is non-negative by orthogonal separability, and the term corresponding to $i = j$ is strictly positive:

$$-\ell_j'(0)\, \mathbf{1}\{(w + \epsilon v)^\top x_j > 0\}\, v^\top x_j = -\ell_j'(0)\, \mathbf{1}\{\epsilon \|x_j\| > 0\}\, \|x_j\| > 0. \tag{61}$$
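The case analysis above can be previewed numerically. In the sketch below (our illustration; we use $G(w) \propto \sum_i y_i\, \rho(w^\top x_i)$ as a stand-in, which assumes the loss derivatives at initialisation satisfy $-\ell_i'(0) \propto y_i$, as for the logistic loss at zero network output), a grid search over the unit circle recovers a maximiser whose sector matches $\sigma^+$ of eq. (13).

```python
import numpy as np

# toy orthogonally separable dataset (cross-class inner products <= 0)
X = np.array([[2.0, 0.5], [2.0, -0.5], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# stand-in for G, up to a positive constant (see the assumption above)
def G(w):
    return float(y @ np.maximum(X @ w, 0.0))

# grid search over the unit circle; read off the maximiser's sector
thetas = np.linspace(0.0, 2 * np.pi, 10000, endpoint=False)
ws = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
w_star = ws[int(np.argmax([G(w) for w in ws]))]
sector = np.sign(X @ w_star)

# the unique positive extremal sector sigma+ of eq. (13): +1 on the positive
# class, -1 on the negatives (none of which is orthogonal to all positives,
# so the pattern has no zero entries here)
assert (sector == np.array([1.0, 1.0, -1.0, -1.0])).all()
```

Repeating the search with the labels inverted recovers the negative extremal sector in the same way, mirroring the label-inversion argument in the proof.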

