INDUCTIVE BIAS OF GRADIENT DESCENT FOR EXPONENTIALLY WEIGHT NORMALIZED SMOOTH HOMOGENEOUS NEURAL NETS

Abstract

We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. Our analysis focuses on exponential weight normalization (EWN), which encourages weight updates along the radial direction. This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate, and hence causes the weights to be updated in a way that prefers asymptotic relative sparsity. These results can be extended to hold for gradient descent via an appropriate adaptive learning rate. The asymptotic convergence rate of the loss in this setting is Θ(1/(t(log t)²)), and is independent of the depth of the network. We contrast these results with the inductive bias of standard weight normalization (SWN) and unnormalized architectures, and demonstrate their implications on synthetic datasets. Experimental results on simple datasets and architectures support our claim of sparse EWN solutions, even with SGD. This demonstrates its potential application in learning prunable neural networks.

1. INTRODUCTION

The prevailing hypothesis for explaining the generalization ability of deep neural nets, despite their ability to fit even random labels (Zhang et al., 2017), is that the optimisation/training algorithms such as gradient descent have a 'bias' towards 'simple' solutions. This property is often called inductive bias, and it has been an active research area over the past few years. It has been shown that gradient descent does indeed seem to prefer 'simpler' solutions over more 'complex' solutions, where the notion of complexity is often problem/architecture specific. The predominant line of work typically shows that gradient descent prefers a least norm solution in some variant of the L₂ norm. This is satisfying, as gradient descent over the parameters abides by the rules of L₂ geometry, i.e., the weight vector moves along the direction of steepest descent, with length measured using the Euclidean norm. However, there is nothing special about the Euclidean norm in the parameter space, and hence several other notions of 'length' and 'steepness' are equally valid. In recent years, several alternative parameterizations of the weight vector, such as batch normalization and weight normalization, have seen immense success, and these do not seem to respect L₂ geometry in the weight space. We pose the question of the inductive bias of gradient descent for some of these parameterizations, and demonstrate interesting inductive biases. In particular, it can still be argued that gradient descent with these reparameterizations prefers simpler solutions, but the notion of complexity is different.

1.1. CONTRIBUTIONS

The three main contributions of this paper are as follows:
• We establish that the gradient flow path with exponential weight normalization is equal to the gradient flow path of an unnormalized network using an adaptive, neuron-dependent learning rate. This provides a crisp description of the difference between exponentially weight normalized networks and unnormalized networks.
• We establish the inductive bias of gradient descent on standard weight normalized and exponentially weight normalized networks, and show that exponential weight normalization is likely to lead to asymptotic sparsity in the weights.
• We provide tight asymptotic convergence rates for exponentially weight normalized networks.

2. RELATED WORK

2.1. INDUCTIVE BIAS

Soudry et al. (2018) showed that gradient descent (GD) on the logistic loss with linearly separable data converges to the L₂ maximum margin solution for almost all datasets. These results were extended to loss functions with super-polynomial tails in Nacson et al. (2019b). Nacson et al. (2019c) extended these results to hold for stochastic gradient descent (SGD), and Gunasekar et al. (2018a) extended them to other optimization geometries. Ji & Telgarsky (2019b) provided tight convergence bounds in terms of dataset size as well as training time. Ji & Telgarsky (2019a) provided similar results when the data is not linearly separable. Ji & Telgarsky (2019c) showed that for deep linear nets, under certain conditions on the initialization, for almost all linearly separable datasets, the network, in function space, converges to the maximum margin solution. Gunasekar et al. (2018b) established that for linear convolutional nets, under certain assumptions regarding convergence of gradients etc., the function converges to a KKT point of the maximum margin problem in Fourier space. Nacson et al. (2019a) showed that for smooth homogeneous nets, the network converges to a KKT point of the maximum margin problem in parameter space. Lyu & Li (2020) established these results under weaker assumptions and also provided asymptotic convergence rates for the loss. Chizat & Bach (2020) explored the inductive bias for a 2-layer infinitely wide ReLU neural net in function space and showed that the function learnt is a max-margin classifier for the variation norm.

2.2. NORMALIZATION

Salimans & Kingma (2016) introduced weight normalization and demonstrated that it replicates the convergence speedup of BatchNorm. Other normalization techniques have been proposed as well (Ba et al., 2016; Qiao et al., 2020; Li et al., 2019), but only a few have been theoretically explored. Santurkar et al. (2018) demonstrated that batch normalization makes the loss surface smoother, and that the L₂ normalization in batchnorm can even be replaced by L₁ and L∞ normalizations. Kohler et al. (2019) showed that for GD, batchnorm speeds up convergence in the case of GLMs by splitting the optimization problem into learning the direction and the norm. Cai et al. (2019) analyzed GD on BN for the squared loss and showed that it converges for a wide range of learning rates. Bjorck et al. (2018) showed that the primary reason BN allows networks to achieve higher accuracy is by enabling higher learning rates. Arora et al. (2019) showed that in the case of GD or SGD with batchnorm, the learning rate for scale-invariant parameters does not affect the convergence rate towards stationary points. Du et al. (2018) showed that for GD over a one-hidden-layer weight normalized CNN, with constant probability over initialization, the iterates converge to a global minimum. Qiao et al. (2019) compared different normalization techniques from the perspective of whether they lead to points where neurons are consistently deactivated. Wu et al. (2019) established the inductive bias of gradient flow with weight normalization for overparameterized least squares, and showed that for a wider range of initializations as compared to the normal parameterization, it converges to the minimum L₂ norm solution. Dukler et al. (2020) analyzed weight normalization for a multilayer ReLU net in the infinite width regime and showed that it may speed up convergence. Other papers (Luo et al., 2019; Roburin et al., 2020) provide further perspectives on normalization techniques.

3. PROBLEM SETUP

We use a standard view of neural networks as a collection of nodes/neurons grouped by layers. Each node u is associated with a weight vector w_u, which represents the incoming weight vector for that node. In the case of CNNs, weights can be shared across different nodes. w represents all the parameters of the network arranged as a single vector. The dataset is represented in terms of (x_i, y_i) pairs, and m denotes the number of points in the dataset. The function represented by the neural network is denoted by Φ(w, ·). The loss for a single data point x_i is given by ℓ(y_i, Φ(w, x_i)) and the loss vector is represented by ℓ. The overall loss is represented by L(w) and is given by L(w) = Σ_{i=1}^m ℓ(y_i, Φ(w, x_i)). We sometimes abbreviate L(w(t)) as L when the context is clear.

In standard weight normalisation (SWN), each weight vector w_u is reparameterized as γ_u v_u/‖v_u‖. This was proposed by Salimans & Kingma (2016) as a substitute for Batch Normalization, and has been practically used in multiple papers such as Sokolic et al. (2017), Dauphin et al. (2017), Kim et al. (2018) and Hieber et al. (2018). The corresponding update equations for gradient descent are given by

γ_u(t+1) = γ_u(t) − η(t) (v_u(t)^⊤∇_{w_u}L)/‖v_u(t)‖   (1)
v_u(t+1) = v_u(t) − η(t) (γ_u(t)/‖v_u(t)‖) (I − v_u(t)v_u(t)^⊤/‖v_u(t)‖²) ∇_{w_u}L   (2)

In exponential weight normalisation (EWN), each weight vector w_u is reparameterized as e^{α_u} v_u/‖v_u‖. This was mentioned in Salimans & Kingma (2016), but to the best of our knowledge has not been widely used. The corresponding update equations for gradient descent with learning rate η(t) are given by

α_u(t+1) = α_u(t) − η(t) e^{α_u(t)} (v_u(t)^⊤∇_{w_u}L)/‖v_u(t)‖   (3)
v_u(t+1) = v_u(t) − η(t) (e^{α_u(t)}/‖v_u(t)‖) (I − v_u(t)v_u(t)^⊤/‖v_u(t)‖²) ∇_{w_u}L   (4)

The update equations for gradient flow are the continuous counterparts of the same. In the case of gradient flow, for both SWN and EWN, we assume ‖v_u(0)‖ = 1 to simplify the update equations.
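As a concrete illustration (our own sketch, not part of the paper), the SWN and EWN update equations above can be written in NumPy for a single node; `grad_w` is a stand-in for ∇_{w_u}L, which would come from backpropagation in a real network:

```python
import numpy as np

def swn_step(gamma, v, grad_w, lr):
    """One SWN gradient-descent step for a node with w_u = gamma * v/||v||."""
    v_norm = np.linalg.norm(v)
    proj = np.eye(len(v)) - np.outer(v, v) / v_norm**2   # I - vv^T/||v||^2
    gamma_new = gamma - lr * (v @ grad_w) / v_norm
    v_new = v - lr * (gamma / v_norm) * proj @ grad_w
    return gamma_new, v_new

def ewn_step(alpha, v, grad_w, lr):
    """One EWN gradient-descent step for a node with w_u = exp(alpha) * v/||v||."""
    v_norm = np.linalg.norm(v)
    proj = np.eye(len(v)) - np.outer(v, v) / v_norm**2
    alpha_new = alpha - lr * np.exp(alpha) * (v @ grad_w) / v_norm
    v_new = v - lr * (np.exp(alpha) / v_norm) * proj @ grad_w
    return alpha_new, v_new
```

Note that the v-update moves v only orthogonally to itself (the projection kills the radial component), so ‖v_u‖ changes only at second order in the step size.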

4. INDUCTIVE BIAS OF WEIGHT NORMALIZATION

In this section, we state our main results for weight normalized smooth homogeneous models on the exponential loss (ℓ(y_i, Φ(w, x_i)) = e^{−y_iΦ(w,x_i)}). The results for cross-entropy loss and the proofs have been deferred to the appendix due to space constraints. First, we state the main result that helps in establishing these properties for EWN.

Theorem 1. The gradient flow paths with learning rate η(t) for EWN and SWN are given as follows:

EWN: dw_u(t)/dt = −η(t) ‖w_u(t)‖² ∇_{w_u}L   (5)
SWN: dw_u(t)/dt = −η(t) ( ‖w_u(t)‖² ∇_{w_u}L + ((1 − ‖w_u(t)‖²)/‖w_u(t)‖²) (w_u(t)^⊤∇_{w_u}L) w_u(t) )   (6)

Thus, the gradient flow path of EWN can be replicated by an adaptive learning rate given by η(t)‖w_u(t)‖² on an unnormalized network (Unnorm). These parameterizations also induce different neighborhoods in the parameter space, as shown in Figure 1.
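The equivalence in Theorem 1 can be checked numerically. Below is a small sketch of our own (not one of the paper's experiments) comparing discrete EWN updates against unnormalized gradient descent with the adaptive learning rate η‖w‖², on the illustrative toy loss L(w) = e^{−a^⊤w}; the two trajectories agree up to discretization error:

```python
import numpy as np

a = np.array([1.0, 2.0])                       # fixed data direction (toy choice)
def grad_L(w):
    return -np.exp(-a @ w) * a                 # gradient of L(w) = exp(-a.w)

lr, steps = 1e-3, 200

# EWN parameterization: w = exp(alpha) * v/||v||, chosen so w(0) = [0.5, 0.5]
alpha = np.log(np.sqrt(0.5))
v = np.array([1.0, 1.0]) / np.sqrt(2.0)
# Unnormalized weights with the adaptive per-node learning rate lr * ||w||^2
w = np.array([0.5, 0.5])

for _ in range(steps):
    vn = np.linalg.norm(v)
    g = grad_L(np.exp(alpha) * v / vn)
    ea = np.exp(alpha)
    alpha = alpha - lr * ea * (v @ g) / vn
    v = v - lr * (ea / vn) * (np.eye(2) - np.outer(v, v) / vn**2) @ g
    w = w - lr * (w @ w) * grad_L(w)           # adaptive-LR unnormalized step

w_ewn = np.exp(alpha) * v / np.linalg.norm(v)
# Both discretize the same flow, so the trajectories match up to O(lr) error
assert np.allclose(w_ewn, w, atol=1e-3)
```

With a step size this small, the two Euler discretizations of the common flow dw_u/dt = −η‖w_u‖²∇_{w_u}L stay within discretization error of each other.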

4.1. ASSUMPTIONS

The assumptions in the paper can be broadly divided into loss function/architecture based assumptions and trajectory based assumptions. The loss function/architecture based assumptions are shared across both gradient flow and gradient descent.

Loss function/Architecture based assumptions:
1. ℓ(y_i, Φ(w, x_i)) = e^{−y_iΦ(w,x_i)}
2. Φ(·, x) is a C² function, for a fixed x
3. Φ(λw, x) = λ^L Φ(w, x) for all λ > 0, for some L > 0

Gradient flow. For gradient flow, we make the following trajectory based assumptions:
(A1) lim_{t→∞} L(w(t)) = 0
(A2) lim_{t→∞} w(t)/‖w(t)‖ := w̃ exists
(A3) lim_{t→∞} ℓ(w(t))/‖ℓ(w(t))‖ := ℓ̃ exists
(A4) Let ρ = min_i y_iΦ(w̃, x_i). Then ρ > 0.

The first assumption is typically satisfied in scenarios where a positively homogeneous network achieves 100% training accuracy. This is not a completely unreasonable assumption, given recent papers demonstrating that neural networks with sufficient overparameterization can fit even random labels (Zhang et al. (2017), Jacot et al. (2018)), and it is a standard assumption when the purpose is to find the inductive bias. The second assumption states that the network converges in direction; this has recently been shown in Ji & Telgarsky (2020) to hold for gradient flow on homogeneous neural nets without normalization, under some regularity assumptions. The third and fourth assumptions are required to show convergence of the gradients in direction. The fourth assumption is indeed true for SWN as shown in Lyu & Li (2020).

Gradient descent. For gradient descent, we also require the learning rate η(t) to not grow too fast:
(A5) lim_{t→∞} η(t)‖w_u(t)‖‖∇_{w_u}L(w(t))‖ = 0 for all u in the network

Proposition 1. Under assumptions (A1)-(A4), lim_{t→∞} η(t)‖w_u(t)‖‖∇_{w_u}L(w(t))‖ = 0 holds for every u in the network with η(t) = O(1/L^c), where c < 1.

This proposition establishes that assumption (A5) is mild and holds for the constant η(t) that is generally used in practice.
While some of these assumptions are non-standard, we believe they do generally hold, and we demonstrate their viability in a toy experiment which we call Lin-Sep. In this experiment, a 2-layer EWN neural network, with 8 neurons in the hidden layer and a ReLU-squared activation function, is trained on a linearly separable dataset. The learning rate schedule used was O(1/L^{0.97}) and the network was trained till a loss of e^{−300}. The corresponding graphs for EWN are shown in Figure 2. Similar results for SWN have been deferred to Figure 8 in the appendix.
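As a toy sanity check on the trajectory assumptions (our own sketch, using a linear 1-homogeneous predictor rather than the 2-layer Lin-Sep net, with hypothetical data), the loss on linearly separable data is driven towards zero, in line with (A1), while ‖w(t)‖ grows:

```python
import numpy as np

# Toy linearly separable dataset (a hypothetical stand-in for Lin-Sep)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def loss_and_grad(w):
    """Exponential loss of the 1-homogeneous predictor Phi(w, x) = w.x."""
    margins = y * (X @ w)
    losses = np.exp(-margins)
    return losses.sum(), -(losses * y) @ X

w = np.array([0.1, 0.3])
hist = []
for _ in range(5000):
    L, g = loss_and_grad(w)
    hist.append(L)
    # EWN-equivalent adaptive step (Theorem 1): learning rate scaled by ||w||^2
    w -= 0.01 * np.linalg.norm(w)**2 * g
```

After training, the loss is far below its initial value while ‖w‖ has grown past 1, and the direction w/‖w‖ changes less and less, which is the behaviour (A2) formalizes.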

4.2. EFFECT OF NORMALISATION ON WEIGHT AND GRADIENT NORMS

This section contains the main theorems and the difference between EWN and SWN that makes EWN asymptotically relatively sparse as compared to SWN. First, we state a common proposition for both SWN and EWN.

Proposition 2. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, the following hold:
(i) lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i) := g̃, where μ > 0.
(ii) Let w̃_u = lim_{t→∞} w_u(t)/‖w(t)‖ and g̃_u = lim_{t→∞} −∇_{w_u}L(w(t))/‖∇_wL(w(t))‖. Then w̃_u = λg̃_u for some λ ≥ 0.

[Figure 2: (b) shows that only weights 5, 7 and 8 keep growing in norm, so only for these ‖w̃_u‖ > 0; (c) shows the components of the unit vector w/‖w‖ for weights 5, 7 and 8, whose contributions eventually become constant; (d) shows the components of the loss vector, which also eventually become constant; (e) shows the normalized parameter margin converging to a value greater than 0.]

The two parts state that, under the given assumptions, for both SWN and EWN, the gradients converge in direction, and the weights that contribute to the final direction of w converge in the direction opposite to the gradients. Now, we provide the main theorem that distinguishes SWN and EWN.

Theorem 2. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, the following hold:
(i) for EWN, ‖w̃_u‖ > 0, ‖w̃_v‖ > 0 ⟹ lim_{t→∞} (‖w_u(t)‖‖∇_{w_u}L(w(t))‖)/(‖w_v(t)‖‖∇_{w_v}L(w(t))‖) = (‖w̃_u‖‖g̃_u‖)/(‖w̃_v‖‖g̃_v‖) = 1
(ii) for SWN, ‖w̃_u‖ > 0, ‖w̃_v‖ > 0 ⟹ lim_{t→∞} (‖w_u(t)‖‖∇_{w_v}L(w(t))‖)/(‖w_v(t)‖‖∇_{w_u}L(w(t))‖) = (‖w̃_u‖‖g̃_v‖)/(‖w̃_v‖‖g̃_u‖) = 1

Thus, asymptotically, for EWN, ‖w_u(t)‖ = k₁(t)/‖∇_{w_u}L(w(t))‖, while for SWN, ‖w_u(t)‖ = k₂(t)‖∇_{w_u}L(w(t))‖, where k₁(t) and k₂(t) are independent of the neuron u. We demonstrate this property of EWN on the Lin-Sep experiment in Figure 3. The results for SWN have been deferred to Figure 9 in the appendix. Now, we provide a corollary for the case of multilayer linear nets.

Corollary 1.
Consider a weight normalized (SWN or EWN) multilayer linear net, represented by y = W_n W_{n−1}...W_1 x. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, if the dataset is sampled from a distribution continuous w.r.t. the Lebesgue measure on R^d, then, with probability 1, θ = W_n W_{n−1}...W_1 converges in direction to the maximum margin separator for all linearly separable datasets.

4.3. SPARSITY INDUCTIVE BIAS FOR EXPONENTIAL WEIGHT NORMALISATION

The inverse relation between ‖w_u(t)‖ and ‖∇_{w_u}L(w(t))‖ along the EWN trajectory results in an interesting inductive bias that favours movement along sparse directions.

Proposition 3. Let assumptions (A1)-(A5) be satisfied. Consider two nodes u and v in the network such that ‖g̃_v‖ ≥ ‖g̃_u‖ > 0 and ‖w_u(t)‖, ‖w_v(t)‖ → ∞. Denote ‖g̃_u‖/‖g̃_v‖ by c, and let ε, δ be such that 0 < ε < c and 0 < δ < π/2. Then there exists a time t₁ such that for all t > t₁, both the SWN and EWN trajectories have the following properties:
(a) ‖∇_{w_u}L(w(t))‖/‖∇_{w_v}L(w(t))‖ ∈ [c − ε, c + ε]
(b) (w_u(t)/‖w_u(t)‖)^⊤(−∇_{w_u}L(w(t))/‖∇_{w_u}L(w(t))‖) ≥ cos(δ)
(c) (w_v(t)/‖w_v(t)‖)^⊤(−∇_{w_v}L(w(t))/‖∇_{w_v}L(w(t))‖) ≥ cos(δ)

The above proposition shows that the limit property of the weights in Theorem 2 makes a non-sparse w̃ an unstable convergent direction for EWN, but this is not the case for SWN. We demonstrate the relative sparsity between EWN, SWN and Unnorm through two toy experiments: Simple-Traj and XOR.


In the Simple-Traj experiment, we have a single data point at (2, 1), labelled positive, and train a network with linear activations. The architecture is shown in Figure 4a, where the weights in blue and red are frozen to the values 1 and 0 respectively. Thus, there are effectively only two scalar parameters, w₁ and w₂. The network is trained till a loss value of e^{−50}, starting from 5 different initialization points. The weight trajectories in Figure 4b show that EWN prefers to converge along either the x or y axis, and hence has an asymptotic relative sparsity property. In the XOR experiment, we train a 2-layer ReLU network with 20 hidden neurons on the XOR dataset (shown in Figure 5a). The second layer is fixed to the values 1 or −1 randomly. Attaining 100% accuracy on this dataset with this architecture requires at least 4 hidden units. As can be seen in Figure 5, EWN asymptotically uses exactly 4 neurons out of 20, while Unnorm uses all 20 neurons. The results for SWN have been deferred to Figure 10 in the appendix.

5. CONVERGENCE RATES

Under Assumption (A2), w can be represented as w(t) = g(t)w̃ + r(t), where lim_{t→∞} ‖r(t)‖/g(t) = 0. Let d : N → R, given by d(t) = Σ_{τ=0}^t η(τ), denote the total step size.

For multilayer linear nets, the variation of the convergence rate with the number of layers for a linearly separable dataset is illustrated in Figure 6. All of these networks were explicitly initialized to represent the same point in function space. It can be seen that EWN, SWN and unnormalized networks all converge faster with more layers, but the effect is much less pronounced for EWN.

6. MNIST PRUNING EXPERIMENTS

As EWN leads to asymptotically sparse solutions, it is likely that a sufficiently trained EWN network would be comparatively robust to pruning. In this section, we compare the pruning efficacy of EWN, SWN and Unnorm on a 2-layer ReLU network trained on the MNIST dataset. In the case of EWN and SWN, only the first layer is weight normalized, as only this layer needs to be pruned. The pruning criterion used is the difference between the final and initial weight norm, i.e., the weights that grow the least in norm are pruned first. The corresponding pruning graphs at different loss values are shown in Figure 7. It can be seen that when the loss levels are sufficiently low, the EWN network becomes better adapted to pruning, significantly outperforming SWN and the unnormalized network in terms of test accuracy for a given level of pruning. The variation of the norm of the weight vectors with gradient descent steps for neurons in the first layer has been deferred to Figure 11 in the appendix.
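The pruning criterion can be sketched as follows; the weight matrices here are synthetic stand-ins for a trained first layer (the neuron indices and growth pattern are hypothetical, chosen only to make the mechanics visible):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-layer weights before/after training (10 hidden neurons);
# we pretend neurons 2, 5 and 7 grew in norm during training.
W_init = rng.normal(scale=0.1, size=(10, 784))
scale_up = np.ones(10)
scale_up[[2, 5, 7]] = 4.0
W_final = W_init * scale_up[:, None]

def prune_least_grown(W_init, W_final, keep):
    """Zero out the neurons whose weight norm grew the least during training."""
    growth = np.linalg.norm(W_final, axis=1) - np.linalg.norm(W_init, axis=1)
    order = np.argsort(growth)                 # ascending: least growth first
    W_pruned = W_final.copy()
    W_pruned[order[:len(order) - keep]] = 0.0
    return W_pruned

W_pruned = prune_least_grown(W_init, W_final, keep=3)
survivors = np.nonzero(np.linalg.norm(W_pruned, axis=1) > 0)[0]
```

Under the sparsity bias of Theorem 2, the neurons whose norms keep growing are exactly the ones contributing to the final direction w̃, so this criterion keeps the "active" neurons and discards the rest.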

7. CONCLUSION

In this paper, we analyze the inductive bias of weight normalization for smooth homogeneous neural nets and show that exponential weight normalization is likely to lead to asymptotically sparse solutions and has a faster convergence rate than unnormalized or standard weight normalized networks. 

A PROOF OF THEOREM 1

Theorem. The gradient flow paths with learning rate η(t) for EWN and SWN are given as follows:

EWN: dw_u(t)/dt = −η(t) ‖w_u(t)‖² ∇_{w_u}L
SWN: dw_u(t)/dt = −η(t) ( ‖w_u(t)‖² ∇_{w_u}L + ((1 − ‖w_u(t)‖²)/‖w_u(t)‖²) (w_u(t)^⊤∇_{w_u}L) w_u(t) )

The proofs of the two parts are provided in separate subsections, where the corresponding part is restated for the ease of the reader.

A.1 EXPONENTIAL WEIGHT NORMALIZATION

Theorem. The gradient flow path with learning rate η(t) for EWN is given by:

dw_u(t)/dt = −η(t) ‖w_u(t)‖² ∇_{w_u}L

Proof. In the case of EWN, weights are reparameterized as w_u = e^{α_u} v_u/‖v_u‖. Then

∇_{α_u}L = e^{α_u} (v_u^⊤∇_{w_u}L)/‖v_u‖
∇_{v_u}L = (e^{α_u}/‖v_u‖) (I − v_uv_u^⊤/‖v_u‖²) ∇_{w_u}L

Now, in the case of gradient flow with learning rate η(t), we can say

dα_u(t)/dt = −η(t)∇_{α_u}L = −η(t) e^{α_u(t)} (v_u(t)^⊤∇_{w_u}L)/‖v_u(t)‖
dv_u(t)/dt = −η(t)∇_{v_u}L = −η(t) (e^{α_u(t)}/‖v_u(t)‖) (I − v_u(t)v_u(t)^⊤/‖v_u(t)‖²) ∇_{w_u}L

Now, using these equations, we can say

d‖v_u(t)‖²/dt = 2 v_u(t)^⊤ dv_u(t)/dt = 0

Thus, ‖v_u(t)‖ does not change with time. As we assumed ‖v_u(0)‖ = 1, we have ‖v_u(t)‖ = 1 for all t. Using this simplification, we can write

dw_u(t)/dt = d(e^{α_u(t)} v_u(t))/dt = e^{α_u(t)} ( (dα_u(t)/dt) v_u(t) + dv_u(t)/dt )
 = −η(t) e^{2α_u(t)} (v_u(t)^⊤∇_{w_u}L) v_u(t) − η(t) e^{2α_u(t)} (I − v_u(t)v_u(t)^⊤) ∇_{w_u}L
 = −η(t) e^{2α_u(t)} ∇_{w_u}L

Thus, the gradient flow path with exponential weight normalization can be replicated by an adaptive learning rate given by η(t)‖w_u(t)‖² on an unnormalized network.
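The final step of this derivation can be checked numerically (our own sanity check): with ‖v_u‖ = 1, the radial (α) and tangential (v) updates combine exactly into −η e^{2α_u}∇_{w_u}L. Here `g` is a random stand-in for ∇_{w_u}L:

```python
import numpy as np

rng = np.random.default_rng(1)
d, eta, alpha = 5, 0.1, 0.7
v = rng.normal(size=d)
v /= np.linalg.norm(v)                       # unit direction, ||v|| = 1
g = rng.normal(size=d)                       # stand-in for grad_{w_u} L

dalpha = -eta * np.exp(alpha) * (v @ g)                        # radial part
dv = -eta * np.exp(alpha) * (np.eye(d) - np.outer(v, v)) @ g   # tangential part

# dw = d(e^alpha v) = e^alpha (dalpha * v + dv) collapses to -eta e^{2 alpha} g,
# since (v.g) v + (I - vv^T) g = g
dw = np.exp(alpha) * (dalpha * v + dv)
assert np.allclose(dw, -eta * np.exp(2 * alpha) * g)
```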

A.2 STANDARD WEIGHT NORMALIZATION

Theorem. The gradient flow path with learning rate η(t) for SWN is given by:

dw_u(t)/dt = −η(t) ( ‖w_u(t)‖² ∇_{w_u}L + ((1 − ‖w_u(t)‖²)/‖w_u(t)‖²) (w_u(t)^⊤∇_{w_u}L) w_u(t) )

Proof. In the case of SWN, weights are reparameterized as w_u = γ_u v_u/‖v_u‖. Then

∇_{γ_u}L = (v_u^⊤∇_{w_u}L)/‖v_u‖
∇_{v_u}L = (γ_u/‖v_u‖) (I − v_uv_u^⊤/‖v_u‖²) ∇_{w_u}L

Now, in the case of gradient flow with learning rate η(t), we can say

dγ_u(t)/dt = −η(t)∇_{γ_u}L = −η(t) (v_u(t)^⊤∇_{w_u}L)/‖v_u(t)‖
dv_u(t)/dt = −η(t)∇_{v_u}L = −η(t) (γ_u(t)/‖v_u(t)‖) (I − v_u(t)v_u(t)^⊤/‖v_u(t)‖²) ∇_{w_u}L

Now, similar to EWN, ‖v_u(t)‖ does not change with time. Using the fact that ‖v_u(t)‖ = 1 for all t, we can say

dw_u(t)/dt = d(γ_u(t)v_u(t))/dt = (dγ_u(t)/dt) v_u(t) + γ_u(t) dv_u(t)/dt
 = −η(t) (v_u(t)^⊤∇_{w_u}L) v_u(t) − η(t) γ_u(t)² (I − v_u(t)v_u(t)^⊤) ∇_{w_u}L
 = −η(t) ( γ_u(t)² ∇_{w_u}L + (1 − γ_u(t)²) (v_u(t)^⊤∇_{w_u}L) v_u(t) )
 = −η(t) ( γ_u(t)² ∇_{w_u}L + ((1 − γ_u(t)²)/γ_u(t)²) (w_u(t)^⊤∇_{w_u}L) w_u(t) )

Replacing γ_u(t) by ‖w_u(t)‖ gives the required expression.
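The analogous algebra for SWN can be checked the same way (again our own sanity check, with `g` a random stand-in for ∇_{w_u}L): combining the γ- and v-updates with ‖v_u‖ = 1 yields the stated flow exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta, gamma = 5, 0.1, 1.7
v = rng.normal(size=d)
v /= np.linalg.norm(v)                       # unit direction, ||v|| = 1
g = rng.normal(size=d)                       # stand-in for grad_{w_u} L

dgamma = -eta * (v @ g)                                   # radial part
dv = -eta * gamma * (np.eye(d) - np.outer(v, v)) @ g      # tangential part

# dw = d(gamma v) = dgamma * v + gamma * dv
dw = dgamma * v + gamma * dv
target = -eta * (gamma**2 * g + (1 - gamma**2) * (v @ g) * v)
assert np.allclose(dw, target)
```

Unlike EWN, the radial and tangential parts carry different powers of γ_u, which is why the SWN flow does not reduce to a single adaptive learning rate on ∇_{w_u}L.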

B PROOF OF PROPOSITION 1

Proposition. Under assumptions (A1)-(A4), lim_{t→∞} η(t)‖w_u(t)‖‖∇_{w_u}L(w(t))‖ = 0 holds for every u in the network with η(t) = O(1/L^c), where c < 1.

Proof. Under assumptions (A1) and (A2), w can be represented as w(t) = g(t)w̃ + r(t), where lim_{t→∞} ‖r(t)‖/g(t) = 0. Now, for the exponential loss,

−∇_wL(w(t)) = Σ_{i=1}^m ℓ_i(w(t)) (y_i ∇_wΦ(w(t), x_i))
ℓ_i(w(t)) = e^{−y_iΦ(w(t),x_i)} = e^{−g(t)^L y_iΦ(w̃ + r(t)/g(t), x_i)}
∇_wΦ(w(t), x_i) = g(t)^{L−1} ∇_wΦ(w̃ + r(t)/g(t), x_i)

From assumption (A4), we know y_iΦ(w̃, x_i) ≥ ρ for all i. Now, using Euler's homogeneous function theorem, we can say

w̃^⊤∇_wΦ(w̃, x_i) = LΦ(w̃, x_i)

Thus, ‖∇_wΦ(w̃, x_i)‖ > 0 for all i. Now, using the equations above and assumption (A3), we can say

lim_{t→∞} ‖∇_wL(w(t))‖ / ( e^{−ρg(t)^L} g(t)^{L−1} ‖Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i)‖ ) = k₁

where k₁ is some constant. Now, if η(t) = O(1/L(w(t))^c) with c < 1, then using the fact that L goes down at the rate of e^{−ρg(t)^L} and ‖w(t)‖ goes up at the rate of g(t), we can say

lim_{t→∞} η(t)‖w(t)‖‖∇_wL(w(t))‖ ≤ lim_{t→∞} k₂ ‖w(t)‖‖∇_wL(w(t))‖/L(w(t))^c = lim_{t→∞} k₁k₂ e^{−(1−c)ρg(t)^L} g(t)^L ‖Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i)‖ = 0

since the decaying exponential dominates the polynomial growth of g(t)^L.

C PROOF OF PROPOSITION 2

Proposition. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, the following hold:
(i) lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i) := g̃, where μ > 0.
(ii) Let w̃_u = lim_{t→∞} w_u(t)/‖w(t)‖ and g̃_u = lim_{t→∞} −∇_{w_u}L(w(t))/‖∇_wL(w(t))‖. Then w̃_u = λg̃_u for some λ ≥ 0.

The proofs for the different cases are split across the following subsections, where the corresponding proposition is restated for the ease of the reader. The proofs depend on the Stolz-Cesaro theorem (stated in Appendix J), the integral form of the Stolz-Cesaro theorem (stated and proved in Appendix J), and the following lemmas, which are proved in Appendix I.

Lemma 1. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, w̃_u^⊤g̃_u ≥ 0 for all nodes u in the network.

Lemma 2.
Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, there exists at least one node u in the network satisfying ‖w̃_u‖ > 0 and ‖g̃_u‖ > 0.

Lemma 3. Consider two unit vectors a and b satisfying a^⊤b ≥ 0 and a^⊤b < 1. Then there exists a small enough ε > 0 such that for any unit vector c satisfying c^⊤a ≥ cos(ε) and any unit vector d satisfying d^⊤b ≥ cos(ε), b^⊤(I − cc^⊤)d ≥ ε.

Lemma 4. Consider a sequence a satisfying the following properties:
1. a_k > 0
2. Σ_{k=0}^∞ a_k = ∞
3. lim_{k→∞} a_k = 0
Then Σ_{k=0}^∞ a_k/√(Σ_{j=0}^k a_j²) = ∞.

Lemma 5. Consider two sequences a and b satisfying the following properties:
1. a_k > 0, Σ_{k=0}^∞ a_k = ∞ and lim_{k→∞} a_k = 0
2. b_0 > 0, b is increasing and b_{k+1}² ≤ b_k² + (a_k/b_k)²
Then Σ_{k=0}^∞ a_k/b_k = ∞.

Lemma 6. Consider two sequences a and b satisfying the following properties:
1. a_k > 0 and Σ_{k=0}^∞ a_k = ∞
2. b_k > 0 and Σ_{k=0}^∞ b_k = ∞
3. Σ_{k=0}^∞ (a_k − b_k) converges to a finite value
4. lim_{k→∞} a_k/b_k exists
Then lim_{k→∞} a_k/b_k = 1.
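As a quick numerical illustration (our own sanity check, not part of the paper), take a_k = 1/(k+1): the sum of a_k diverges while the sum of a_k² converges, so the partial sums of the series Σ_k a_k/√(Σ_{j≤k} a_j²) grow without bound, roughly like log k:

```python
import numpy as np

N = 200_000
a = 1.0 / np.arange(1, N + 1)           # a_k > 0, sum diverges, a_k -> 0
denom = np.sqrt(np.cumsum(a**2))        # sqrt of partial sums of a_j^2
S = np.cumsum(a / denom)                # partial sums of the series in Lemma 4
# S keeps growing: the denominator stabilizes near sqrt(pi^2/6) while the
# numerator inherits the divergence of the harmonic series
```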

C.1 EXPONENTIAL WEIGHT NORMALIZATION

In this section, we will use e^{α_u(t)} and ‖w_u(t)‖ interchangeably.

C.1.1 GRADIENT FLOW

Proposition. Under assumptions (A1)-(A4) for gradient flow, for EWN, the following hold:
(i) lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i) := g̃, where μ > 0.
(ii) Let w̃_u = lim_{t→∞} w_u(t)/‖w(t)‖ and g̃_u = lim_{t→∞} −∇_{w_u}L(w(t))/‖∇_wL(w(t))‖. Then w̃_u = λg̃_u for some λ ≥ 0.

Update equations:

dα_u(t)/dt = −η(t) e^{α_u(t)} (v_u(t)^⊤∇_{w_u}L(w(t)))   (7)
dv_u(t)/dt = −η(t) e^{α_u(t)} (I − v_u(t)v_u(t)^⊤) ∇_{w_u}L(w(t))   (8)

(i) lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i) := g̃, where μ > 0

Proof. Using the fact that ρ > 0 and Euler's homogeneous function theorem, we can say

w̃^⊤∇_wΦ(w̃, x_i) = LΦ(w̃, x_i) > 0

Thus, ‖∇_wΦ(w̃, x_i)‖ > 0 for all i. Let w(t) = g(t)w̃ + r(t), where lim_{t→∞} ‖r(t)‖/g(t) = 0. Now, by Taylor's theorem, we can say

∇_wΦ(w(t), x_i) = ∇_wΦ(g(t)w̃, x_i) + ∫₀¹ ∇²Φ(g(t)w̃ + kr(t), x_i) r(t) dk
 = g(t)^{L−1} ∇_wΦ(w̃, x_i) + g(t)^{L−1} ∫₀¹ ∇²Φ(w̃ + k r(t)/g(t), x_i) (r(t)/g(t)) dk

Now, ‖∇²Φ(w̃ + k r(t)/g(t), x_i)‖ can be bounded by a constant and lim_{t→∞} ‖r(t)‖/g(t) = 0. Thus, lim_{t→∞} ∫₀¹ ∇²Φ(w̃ + k r(t)/g(t), x_i)(r(t)/g(t)) dk = 0, and hence, for each i,

lim_{t→∞} ∇_wΦ(w(t), x_i)/g(t)^{L−1} = ∇_wΦ(w̃, x_i)   (9)

Now,

−∇_wL(w(t)) = Σ_{i=1}^m ℓ_i(w(t)) (y_i ∇_wΦ(w(t), x_i))

Let S = {i : y_iΦ(w̃, x_i) = min_j y_jΦ(w̃, x_j)} and let ε denote min_{j∉S} y_jΦ(w̃, x_j) − ρ. Consider a ∈ S and b ∉ S; then

lim_{t→∞} ℓ_b(w(t))/ℓ_a(w(t)) = lim_{t→∞} e^{−g(t)^L ( y_bΦ(w̃ + r(t)/g(t), x_b) − y_aΦ(w̃ + r(t)/g(t), x_a) )}

Now, as the minimum difference is ε > 0, lim_{t→∞} ℓ_b(w(t))/ℓ_a(w(t)) = 0. Thus, ℓ̃_j = 0 for all j ∉ S. Now, using Equation (9) and the expression for −∇_wL(w(t)) above, we can say

lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i∈S} ℓ̃_i y_i ∇_wΦ(w̃, x_i)

where μ = 1/‖Σ_{i∈S} ℓ̃_i y_i ∇_wΦ(w̃, x_i)‖.

(ii) ‖w̃_u‖ > 0 ⟹ w̃_u = λg̃_u for some λ > 0

Proof. Consider a node u having ‖w̃_u‖ > 0. The proof is split into two parts, depending on whether ‖g̃_u‖ > 0 or ‖g̃_u‖ = 0.

Case 1: ‖g̃_u‖ > 0. Let the angle between w̃_u and g̃_u be denoted by ∆.
Using Lemma 1, we can say ∆ ≤ π/2. We will prove the statement by contradiction, so let's assume ∆ > 0. Now, we know −∇_{w_u}L(w(t))/‖∇_{w_u}L(w(t))‖ converges to g̃_u and v_u(t) converges in the direction of w̃_u. Taking the dot product with g̃_u/‖g̃_u‖ on both sides of Equation (8) and using Lemma 3, we can say there exist a time t₁ and a small enough ε such that for any t > t₁,

(g̃_u/‖g̃_u‖)^⊤ dv_u(t)/dt ≥ ε η(t) e^{α_u(t)} ‖∇_{w_u}L(w(t))‖   (10)

Now, using the fact that α_u → ∞ and Equation (7), we can say

∫_{t₁}^∞ η(t) e^{α_u(t)} ‖∇_{w_u}L(w(t))‖ dt = ∞

Integrating Equation (10) on both sides from t₁ to ∞, we get

(g̃_u/‖g̃_u‖)^⊤ ( w̃_u/‖w̃_u‖ − v_u(t₁) ) ≥ ∞

This is not possible, as the vectors on the LHS have bounded norm. This is a contradiction; hence ∆ = 0.

Case 2: ‖g̃_u‖ = 0. We are going to show that it is not possible to have ‖w̃_u‖ > 0 and ‖g̃_u‖ = 0. Using Lemma 2, we can say there exists at least one node v satisfying ‖w̃_v‖ > 0 and ‖g̃_v‖ > 0. Now, using Equation (7), we can say

‖w_u(t)‖ ≤ ‖w_u(0)‖ + ∫₀^t η(k)‖w_u(k)‖²‖∇_{w_u}L(w(k))‖ dk

From Case 1, we know that for any ε > 0 there exists a time t₁ such that for t > t₁, (w_v(t)/‖w_v(t)‖)^⊤(−∇_{w_v}L(w(t))/‖∇_{w_v}L(w(t))‖) ≥ cos(ε). Now, using Equation (7), we can say

‖w_v(t)‖ ≥ ‖w_v(t₁)‖ + cos(ε) ∫_{t₁}^t η(k)‖w_v(k)‖²‖∇_{w_v}L(w(k))‖ dk

Thus, we can say, for t > t₁,

‖w_u(t)‖/‖w_v(t)‖ ≤ ( ‖w_u(0)‖ + ∫₀^t η(k)‖w_u(k)‖²‖∇_{w_u}L(w(k))‖ dk ) / ( ‖w_v(t₁)‖ + cos(ε) ∫_{t₁}^t η(k)‖w_v(k)‖²‖∇_{w_v}L(w(k))‖ dk )

Now, as ‖w̃_u‖ > 0 and ‖w̃_v‖ > 0, both integrals diverge, and the ratio of the integrands converges to 0 as ‖g̃_u‖ = 0 and ‖g̃_v‖ > 0. Thus, taking the limit t → ∞ on both sides and using the integral form of the Stolz-Cesaro theorem, we can say

lim_{t→∞} ‖w_u(t)‖/‖w_v(t)‖ = 0

However, this is not possible as ‖w̃_u‖ > 0 and ‖w̃_v‖ > 0. This is a contradiction; therefore, such a case is not possible.

C.1.2 GRADIENT DESCENT

Proposition. Under assumptions (A1)-(A5) for gradient descent, for EWN, the following hold:
(i) lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i) := g̃, where μ > 0.
(ii) Let w̃_u = lim_{t→∞} w_u(t)/‖w(t)‖ and g̃_u = lim_{t→∞} −∇_{w_u}L(w(t))/‖∇_wL(w(t))‖. Then w̃_u = λg̃_u for some λ ≥ 0.

Update equations:

α_u(t+1) = α_u(t) − η(t) e^{α_u(t)} (v_u(t)^⊤∇_{w_u}L(w(t)))/‖v_u(t)‖   (11)
v_u(t+1) = v_u(t) − η(t) (e^{α_u(t)}/‖v_u(t)‖) (I − v_u(t)v_u(t)^⊤/‖v_u(t)‖²) ∇_{w_u}L(w(t))   (12)

(ii) ‖w̃_u‖ > 0 ⟹ w̃_u = λg̃_u for some λ > 0

Proof. Consider a node u having ‖w̃_u‖ > 0. The proof is split into two parts, depending on whether ‖g̃_u‖ > 0 or ‖g̃_u‖ = 0.

Case 1: ‖g̃_u‖ > 0. Let the angle between w̃_u and g̃_u be denoted by ∆. Using Lemma 1, we can say ∆ ≤ π/2. We will prove the statement by contradiction, so let's assume ∆ > 0. Now, we know −∇_{w_u}L(w(t))/‖∇_{w_u}L(w(t))‖ converges to g̃_u and v_u(t) converges in the direction of w̃_u. Taking the dot product with g̃_u/‖g̃_u‖ on both sides of Equation (12) and using Lemma 3, we can say there exist a time t₁ and a small enough ε such that for any t > t₁,

v_u(t+1)^⊤ g̃_u/‖g̃_u‖ ≥ v_u(t)^⊤ g̃_u/‖g̃_u‖ + ε η(t) (e^{α_u(t)}/‖v_u(t)‖) ‖∇_{w_u}L(w(t))‖

However, in this case, ‖v_u(t)‖ does not stay constant, and thus an increase in the dot product does not directly correspond to an increase in alignment.
Now, using Equation (12), we can say

‖v_u(t+1)‖² ≤ ‖v_u(t)‖² + ( η(t) (e^{α_u(t)}/‖v_u(t)‖) ‖∇_{w_u}L(w(t))‖ )²   (13)

Using the above two equations, we can say, for time t > t₁,

v_u(t+1)^⊤g̃_u / (‖v_u(t+1)‖‖g̃_u‖) ≥ ( v_u(t)^⊤g̃_u/‖g̃_u‖ + ε η(t)(e^{α_u(t)}/‖v_u(t)‖)‖∇_{w_u}L(w(t))‖ ) / √( ‖v_u(t)‖² + (η(t)(e^{α_u(t)}/‖v_u(t)‖)‖∇_{w_u}L(w(t))‖)² )

Unrolling the equation above from t₁, we get

v_u(t+1)^⊤g̃_u / (‖v_u(t+1)‖‖g̃_u‖) ≥ ( v_u(t₁)^⊤g̃_u/‖g̃_u‖ + ε Σ_{k=t₁}^t η(k)(e^{α_u(k)}/‖v_u(k)‖)‖∇_{w_u}L(w(k))‖ ) / √( ‖v_u(t₁)‖² + Σ_{k=t₁}^t (η(k)(e^{α_u(k)}/‖v_u(k)‖)‖∇_{w_u}L(w(k))‖)² )   (14)

Now, as α_u(t) → ∞, therefore, using Equation (11), we can say

Σ_{k=t₁}^∞ η(k) e^{α_u(k)} ‖∇_{w_u}L(w(k))‖ = ∞

Now, using this identity, along with assumption (A5), Equation (13) and Lemma 5, we can say

Σ_{k=t₁}^∞ η(k) (e^{α_u(k)}/‖v_u(k)‖) ‖∇_{w_u}L(w(k))‖ = ∞

Using this along with Equation (14) and Lemma 4, we can say

lim_{t→∞} v_u(t+1)^⊤g̃_u / (‖v_u(t+1)‖‖g̃_u‖) = ∞

However, this is not possible as the vectors on the LHS have bounded norm. This is a contradiction; thus ∆ = 0.

Case 2: ‖g̃_u‖ = 0. We are going to show that it is not possible to have ‖w̃_u‖ > 0 and ‖g̃_u‖ = 0. By Lemma 2, we know there exists at least one node s satisfying ‖w̃_s‖ > 0 and ‖g̃_s‖ > 0. Now, from Equation (11), we can say

α_u(t) = α_u(0) − Σ_{k=0}^{t−1} η(k) e^{α_u(k)} (v_u(k)^⊤∇_{w_u}L(w(k)))/‖v_u(k)‖
α_s(t) = α_s(0) − Σ_{k=0}^{t−1} η(k) e^{α_s(k)} (v_s(k)^⊤∇_{w_s}L(w(k)))/‖v_s(k)‖

Thus,

α_u(t) − α_s(t) = (α_u(0) − α_s(0)) + Σ_{k=0}^{t−1} ( η(k) e^{α_s(k)} (v_s(k)^⊤∇_{w_s}L(w(k)))/‖v_s(k)‖ − η(k) e^{α_u(k)} (v_u(k)^⊤∇_{w_u}L(w(k)))/‖v_u(k)‖ )   (15)

Now, we know α_u(t) and α_s(t) → ∞. Also, as ‖w̃_u‖ > 0 and ‖w̃_s‖ > 0, α_u − α_s converges. Therefore, the RHS of Equation (15) converges as well. However, the RHS is the difference of two diverging series. Also, as v_s(t) and −∇_{w_s}L(w(t)) eventually get aligned and lim_{t→∞} ‖∇_{w_u}L(w(t))‖/‖∇_{w_s}L(w(t))‖ = 0, we can say

lim_{t→∞} ( e^{α_u(t)} (v_u(t)^⊤∇_{w_u}L(w(t)))/‖v_u(t)‖ ) / ( e^{α_s(t)} (v_s(t)^⊤∇_{w_s}L(w(t)))/‖v_s(t)‖ ) = 0

However, this contradicts Lemma 6, as the ratio must converge to 1 if the limit exists. Therefore, this case is not possible.

C.2 STANDARD WEIGHT NORMALIZATION

C.2.1 GRADIENT FLOW

Proposition. Under assumptions (A1)-(A4) for gradient flow, for SWN, the following hold:
(i) lim_{t→∞} −∇_wL(w(t))/‖∇_wL(w(t))‖ = μ Σ_{i=1}^m ℓ̃_i y_i ∇_wΦ(w̃, x_i) := g̃, where μ > 0.
(ii) Let w̃_u = lim_{t→∞} w_u(t)/‖w(t)‖ and g̃_u = lim_{t→∞} −∇_{w_u}L(w(t))/‖∇_wL(w(t))‖. Then w̃_u = λg̃_u for some λ ≥ 0.

Update equations:

dγ_u(t)/dt = −η(t) (v_u(t)^⊤∇_{w_u}L(w(t)))/‖v_u(t)‖   (16)
dv_u(t)/dt = −η(t) (γ_u(t)/‖v_u(t)‖) (I − v_u(t)v_u(t)^⊤/‖v_u(t)‖²) ∇_{w_u}L(w(t))   (17)

(ii) ‖w̃_u‖ > 0 ⟹ w̃_u = λg̃_u for some λ > 0

Proof. Consider a node u having ‖w̃_u‖ > 0. In this case, γ_u(t) can tend to either ∞ or −∞. We will consider the case γ_u(t) → ∞; the other case can be handled similarly. The proof is split into two parts, depending on whether ‖g̃_u‖ > 0 or ‖g̃_u‖ = 0.

Case 1: ‖g̃_u‖ > 0. Let the angle between w̃_u and g̃_u be denoted by ∆. Using Lemma 1, we can say ∆ ≤ π/2. We will prove the statement by contradiction, so let's assume ∆ > 0. Now, we know −∇_{w_u}L(w(t))/‖∇_{w_u}L(w(t))‖ converges to g̃_u and v_u(t) converges in the direction of w̃_u. Taking the dot product with g̃_u/‖g̃_u‖ on both sides of Equation (17) and using Lemma 3, we can say there exist a time t₁ and a small enough ε such that for any t > t₁,

(g̃_u/‖g̃_u‖)^⊤ dv_u(t)/dt ≥ ε η(t) γ_u(t) ‖∇_{w_u}L(w(t))‖   (18)

Now, using the fact that γ_u → ∞ and Equation (16), we can say

∫_{t₁}^∞ η(t) ‖∇_{w_u}L(w(t))‖ dt = ∞

Integrating Equation (18) on both sides from t₁ to ∞, we get

(g̃_u/‖g̃_u‖)^⊤ ( w̃_u/‖w̃_u‖ − v_u(t₁) ) ≥ ∞

This is not possible, as the vectors on the LHS have bounded norm. This is a contradiction; hence ∆ = 0.

Case 2: ‖g̃_u‖ = 0. We are going to show that it is not possible to have ‖w̃_u‖ > 0 and ‖g̃_u‖ = 0. Using Lemma 2, we can say there exists at least one node s satisfying ‖w̃_s‖ > 0 and ‖g̃_s‖ > 0. Now, using Equation (16), we can say

‖w_u(t)‖ ≤ ‖w_u(0)‖ + ∫₀^t η(k)‖∇_{w_u}L(w(k))‖ dk

From Case 1, we know that for any ε > 0 there exists a time t₁ such that for t > t₁, (w_s(t)/‖w_s(t)‖)^⊤(−∇_{w_s}L(w(t))/‖∇_{w_s}L(w(t))‖) ≥ cos(ε).
Now, using Equation (16), we can say

$$\|w_s(t)\| \ge \|w_s(t_1)\| + \cos(\epsilon)\int_{t_1}^t \eta(k)\,\|\nabla_{w_s}L(w(k))\|\,dk$$

Thus, for $t > t_1$,

$$\frac{\|w_u(t)\|}{\|w_s(t)\|} \le \frac{\|w_u(0)\| + \int_0^t \eta(k)\|\nabla_{w_u}L(w(k))\|\,dk}{\|w_s(t_1)\| + \cos(\epsilon)\int_{t_1}^t \eta(k)\|\nabla_{w_s}L(w(k))\|\,dk}$$

Now, as $\|\tilde w_u\| > 0$ and $\|\tilde w_s\| > 0$, both integrals diverge. Also, the integrands converge in ratio to $0$, since $\tilde g_u = 0$ and $\|\tilde g_s\| > 0$. Thus, taking the limit $t \to \infty$ on both sides and using the integral form of the Stolz–Cesàro theorem, we get $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_s(t)\|} = 0$. However, this is impossible, as $\|\tilde w_u\| > 0$ and $\|\tilde w_s\| > 0$. This is a contradiction; therefore such a case is not possible.
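The last step above uses the integral form of the Stolz–Cesàro theorem: if $f(t) = \int_0^t u(k)\,dk$ and $g(t) = \int_0^t v(k)\,dk$ with $g$ diverging and $u(t)/v(t) \to \ell$, then $f(t)/g(t) \to \ell$. A minimal numerical illustration (the integrands `u` and `v` below are illustrative choices, not quantities from the proof):

```python
import math

def stolz_cesaro_ratio(u, v, T, dt=1e-2):
    """Numerically integrate f = int u, g = int v on [0, T] and return f(T)/g(T)."""
    f = g = 0.0
    t = 0.0
    while t < T:
        f += u(t) * dt
        g += v(t) * dt
        t += dt
    return f / g

# u(t)/v(t) -> 0 while both integrals diverge, so f/g should also tend to 0
u = lambda t: 1.0 / (t * math.log(t + 2.0) + 1.0)
v = lambda t: 1.0 / math.sqrt(t + 1.0)

for T in (10.0, 100.0, 1000.0):
    print(T, stolz_cesaro_ratio(u, v, T))
```

The printed ratios shrink toward zero as the horizon $T$ grows, mirroring how the ratio of the two diverging integrals in the proof inherits the limit of their integrands.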

C.2.2 GRADIENT DESCENT

Proposition. Under assumptions (A1)-(A5) for gradient descent, for SWN, the following hold:
(i) $\lim_{t\to\infty} \frac{-\nabla_w L(w(t))}{\|\nabla_w L(w(t))\|} = \mu \sum_{i=1}^m \ell_i y_i \nabla_w\Phi(\tilde w, x_i) = \tilde g$, where $\mu > 0$.
(ii) Let $\tilde w_u = \lim_{t\to\infty} \frac{w_u(t)}{\|w(t)\|}$ and $\tilde g_u = \lim_{t\to\infty} \frac{-\nabla_{w_u}L(w(t))}{\|\nabla_w L(w(t))\|}$. Then $\|\tilde w_u\| > 0 \implies \tilde w_u = \lambda \tilde g_u$ for some $\lambda > 0$.

Update equations:

$$\gamma_u(t+1) = \gamma_u(t) - \eta(t)\,\frac{v_u(t)^\top\nabla_{w_u}L(w(t))}{\|v_u(t)\|} \qquad (19)$$

$$v_u(t+1) = v_u(t) - \eta(t)\,\frac{\gamma_u(t)}{\|v_u(t)\|}\Big(I - \frac{v_u(t)v_u(t)^\top}{\|v_u(t)\|^2}\Big)\nabla_{w_u}L(w(t)) \qquad (20)$$

Proof. Consider a node $u$ having $\|\tilde w_u\| > 0$. In this case $\gamma_u(t)$ tends to either $\infty$ or $-\infty$; we consider the case $\gamma_u(t) \to \infty$, the other case being similar. The proof splits into two parts, depending on whether $\|\tilde g_u\| > 0$ or $\tilde g_u = 0$.

Case 1: $\|\tilde g_u\| > 0$. Let $\Delta$ denote the angle between $\tilde w_u$ and $\tilde g_u$. Using Lemma 1, $\Delta \le \frac{\pi}{2}$. We prove the statement by contradiction, so assume $\Delta > 0$. We know that $-\frac{\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|}$ converges to $\frac{\tilde g_u}{\|\tilde g_u\|}$ and that $v_u(t)$ converges in the direction of $\tilde w_u$. Taking the dot product of Equation (20) with $\frac{\tilde g_u}{\|\tilde g_u\|}$ and using Lemma 3, there exist a time $t_1$ and a small enough $\epsilon > 0$ such that for any $t > t_1$,

$$\frac{v_u(t+1)^\top\tilde g_u}{\|\tilde g_u\|} \ge \frac{v_u(t)^\top\tilde g_u}{\|\tilde g_u\|} + \epsilon\,\eta(t)\,\frac{\gamma_u(t)}{\|v_u(t)\|}\,\|\nabla_{w_u}L(w(t))\|$$

However, in this case $\|v_u(t)\|$ does not stay constant, so an increase in the dot product does not directly translate into a decrease in the angle.
Now, using Equation (20), we can say

$$\|v_u(t+1)\|^2 \le \|v_u(t)\|^2 + \Big(\eta(t)\,\frac{\gamma_u(t)}{\|v_u(t)\|}\,\|\nabla_{w_u}L(w(t))\|\Big)^2 \qquad (21)$$

Using the above two equations, we can say, for time $t > t_1$,

$$\frac{v_u(t+1)^\top\tilde g_u}{\|v_u(t+1)\|\,\|\tilde g_u\|} \ge \frac{\frac{v_u(t)^\top\tilde g_u}{\|\tilde g_u\|} + \epsilon\,\eta(t)\frac{\gamma_u(t)}{\|v_u(t)\|}\|\nabla_{w_u}L(w(t))\|}{\sqrt{\|v_u(t)\|^2 + \big(\eta(t)\frac{\gamma_u(t)}{\|v_u(t)\|}\|\nabla_{w_u}L(w(t))\|\big)^2}}$$

Unrolling the inequality above, we get

$$\frac{v_u(t+1)^\top\tilde g_u}{\|v_u(t+1)\|\,\|\tilde g_u\|} \ge \frac{\frac{v_u(t_1)^\top\tilde g_u}{\|\tilde g_u\|} + \epsilon\sum_{k=t_1}^{t}\eta(k)\frac{\gamma_u(k)}{\|v_u(k)\|}\|\nabla_{w_u}L(w(k))\|}{\sqrt{\|v_u(t_1)\|^2 + \sum_{k=t_1}^{t}\big(\eta(k)\frac{\gamma_u(k)}{\|v_u(k)\|}\|\nabla_{w_u}L(w(k))\|\big)^2}} \qquad (22)$$

Now, as $\gamma_u(t) \to \infty$, using Equation (19) we can say $\sum_{k=t_1}^{\infty}\eta(k)\|\nabla_{w_u}L(w(k))\| = \infty$. Using this identity along with Assumption (A5), Equation (21) and Lemma 5, we can say

$$\sum_{k=t_1}^{\infty}\eta(k)\,\frac{\gamma_u(k)}{\|v_u(k)\|}\,\|\nabla_{w_u}L(w(k))\| = \infty$$

Using this along with Equation (22) and Lemma 4, we get that $\frac{v_u(t+1)^\top\tilde g_u}{\|v_u(t+1)\|\,\|\tilde g_u\|}$ diverges to $\infty$. However, this is impossible, as it is a dot product of unit vectors and hence bounded. This is a contradiction; thus $\Delta = 0$.

Case 2: $\tilde g_u = 0$. We show that $\|\tilde w_u\| > 0$ and $\tilde g_u = 0$ cannot hold together. By Lemma 2, there exists at least one node $s$ satisfying $\|\tilde w_s\| > 0$ and $\|\tilde g_s\| > 0$. From Equation (19),

$$\gamma_u(t) = \gamma_u(0) - \sum_{k=0}^{t-1}\eta(k)\,\frac{v_u(k)^\top\nabla_{w_u}L(w(k))}{\|v_u(k)\|},\qquad \gamma_s(t) = \gamma_s(0) - \sum_{k=0}^{t-1}\eta(k)\,\frac{v_s(k)^\top\nabla_{w_s}L(w(k))}{\|v_s(k)\|}$$

Now, $\gamma_s(t)$ diverges to either $\infty$ or $-\infty$; in both cases it is a strictly monotonic sequence for large enough $t$. Also,

$$\lim_{t\to\infty}\frac{\gamma_u(t+1)-\gamma_u(t)}{\gamma_s(t+1)-\gamma_s(t)} = 0$$

as $\tilde g_u = 0$, $\|\tilde g_s\| > 0$, and, from Case 1, $w_s(t)$ and $-\nabla_{w_s}L(w(t))$ eventually get aligned. Thus, using the Stolz–Cesàro theorem, we can say $\lim_{t\to\infty}\frac{\gamma_u(t)}{\gamma_s(t)} = 0$. However, this is impossible, as $\|\tilde w_u\| > 0$ and $\|\tilde w_s\| > 0$. This is a contradiction; therefore this case is not possible.
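For concreteness, the SWN updates (19)-(20) and the analogous EWN updates (Equations (11)-(12)) can be written as code. The sketch below is a minimal single-node implementation under our own naming (`grad_w` stands for $\nabla_{w_u}L(w(t))$); it is an illustration of the update rules, not code from the paper:

```python
import numpy as np

def swn_step(gamma, v, grad_w, eta):
    """One SWN gradient-descent step: w = gamma * v/||v||, Eqs. (19)-(20)."""
    n = np.linalg.norm(v)
    gamma_new = gamma - eta * (v @ grad_w) / n
    proj = np.eye(len(v)) - np.outer(v, v) / n**2   # I - vv^T/||v||^2
    v_new = v - eta * (gamma / n) * proj @ grad_w
    return gamma_new, v_new

def ewn_step(alpha, v, grad_w, eta):
    """One EWN gradient-descent step: w = exp(alpha) * v/||v||, Eqs. (11)-(12)."""
    n = np.linalg.norm(v)
    alpha_new = alpha - eta * np.exp(alpha) * (v @ grad_w) / n
    proj = np.eye(len(v)) - np.outer(v, v) / n**2
    v_new = v - eta * (np.exp(alpha) / n) * proj @ grad_w
    return alpha_new, v_new
```

Both parameterizations move $v_u$ only orthogonally to itself, via the projection $I - v_uv_u^\top/\|v_u\|^2$; the difference is the $e^{\alpha_u}$ factor in EWN, which acts as a norm-dependent, adaptive effective step size.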

D PROOF OF THEOREM 2

Theorem. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, the following hold:
(i) for EWN, $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_v}L(w(t))\|} = 1$
(ii) for SWN, $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_v}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_u}L(w(t))\|} = 1$
The proof of the different cases is split into different subsections, with the corresponding case restated in each for ease of the reader.

D.1 EXPONENTIAL WEIGHT NORMALIZATION

D.1.1 GRADIENT FLOW

Theorem. Under assumptions (A1)-(A4) for gradient flow, the following holds for EWN:
(i) $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_v}L(w(t))\|} = 1$

Proof. Consider $u$ and $v$ such that $\|\tilde w_u\| > 0$ and $\|\tilde w_v\| > 0$. Using Proposition 2, we can say $\|\tilde g_u\| > 0$ and $\|\tilde g_v\| > 0$, and that for both $u$ and $v$ the weights and gradients converge to opposite directions. Hence, for any $\epsilon > 0$, there exists a time $t_2$ such that for any $t > t_2$:
• $-\nabla_{w_u}L(w(t))$ and $w_u(t)$ make an angle of $\epsilon$ or less with each other
• $-\nabla_{w_v}L(w(t))$ and $w_v(t)$ make an angle of $\epsilon$ or less with each other
Then, using Equation (7), for any $t > t_2$,

$$\|w_u(t_2)\| + \cos(\epsilon)\int_{t_2}^t \eta(k)\|w_u(k)\|^2\|\nabla_{w_u}L(w(k))\|\,dk \;\le\; \|w_u(t)\| \;\le\; \|w_u(t_2)\| + \int_{t_2}^t \eta(k)\|w_u(k)\|^2\|\nabla_{w_u}L(w(k))\|\,dk$$

and analogously for $v$. Using these bounds, for $t > t_2$,

$$\frac{\|w_u(t_2)\| + \cos(\epsilon)\int_{t_2}^t \eta(k)\|w_u(k)\|^2\|\nabla_{w_u}L(w(k))\|\,dk}{\|w_v(t_2)\| + \int_{t_2}^t \eta(k)\|w_v(k)\|^2\|\nabla_{w_v}L(w(k))\|\,dk} \;\le\; \frac{\|w_u(t)\|}{\|w_v(t)\|} \;\le\; \frac{\|w_u(t_2)\| + \int_{t_2}^t \eta(k)\|w_u(k)\|^2\|\nabla_{w_u}L(w(k))\|\,dk}{\|w_v(t_2)\| + \cos(\epsilon)\int_{t_2}^t \eta(k)\|w_v(k)\|^2\|\nabla_{w_v}L(w(k))\|\,dk}$$

Both integrals diverge, as $\|w_u(t)\|, \|w_v(t)\| \to \infty$, and $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|}$ and $\lim_{t\to\infty}\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|}$ exist. Taking the limit $t\to\infty$ in both inequalities and using the integral form of the Stolz–Cesàro theorem, we get

$$\cos(\epsilon)\,\lim_{t\to\infty}\frac{\|w_u(t)\|^2\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|^2\|\nabla_{w_v}L(w(t))\|} \;\le\; \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} \;\le\; \frac{1}{\cos(\epsilon)}\,\lim_{t\to\infty}\frac{\|w_u(t)\|^2\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|^2\|\nabla_{w_v}L(w(t))\|}$$

As these limits exist and the above holds for every $\epsilon > 0$,

$$\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \lim_{t\to\infty}\frac{\|w_u(t)\|^2\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|^2\|\nabla_{w_v}L(w(t))\|}$$

Simplifying further, we get

$$\lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_v}L(w(t))\|} = 1$$

D.1.2 GRADIENT DESCENT

Theorem.
Under assumptions (A1)-(A5) for gradient descent, the following holds for EWN:
(i) $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_v}L(w(t))\|} = 1$

Proof. Consider two nodes $u$ and $s$ such that $\|\tilde w_u\| > 0$ and $\|\tilde w_s\| > 0$. From Proposition 2, we know $\|\tilde g_u\| > 0$, $\|\tilde g_s\| > 0$, and that for both $u$ and $s$ the weights and gradients eventually get aligned opposite to each other. Using Equation (11), we can say

$$\alpha_u(t) = \alpha_u(0) - \sum_{k=0}^{t-1}\eta(k)\,e^{\alpha_u(k)}\,\frac{v_u(k)^\top\nabla_{w_u}L(w(k))}{\|v_u(k)\|},\qquad \alpha_s(t) = \alpha_s(0) - \sum_{k=0}^{t-1}\eta(k)\,e^{\alpha_s(k)}\,\frac{v_s(k)^\top\nabla_{w_s}L(w(k))}{\|v_s(k)\|}$$

Thus,

$$\alpha_u(t)-\alpha_s(t) = (\alpha_u(0)-\alpha_s(0)) + \sum_{k=0}^{t-1}\Big(\eta(k)e^{\alpha_s(k)}\frac{v_s(k)^\top\nabla_{w_s}L(w(k))}{\|v_s(k)\|} - \eta(k)e^{\alpha_u(k)}\frac{v_u(k)^\top\nabla_{w_u}L(w(k))}{\|v_u(k)\|}\Big) \qquad (23)$$

Now, $\alpha_u(t)$ and $\alpha_s(t) \to \infty$, and since $\|\tilde w_u\| > 0$ and $\|\tilde w_s\| > 0$, $\alpha_u(t)-\alpha_s(t)$ converges. Therefore the RHS of Equation (23) converges as well, even though it is the difference of two diverging series. Also,

$$\lim_{t\to\infty}\frac{e^{\alpha_u(t)}\,v_u(t)^\top\nabla_{w_u}L(w(t))/\|v_u(t)\|}{e^{\alpha_s(t)}\,v_s(t)^\top\nabla_{w_s}L(w(t))/\|v_s(t)\|}$$

exists. Therefore, using Lemma 6, we can say

$$\lim_{t\to\infty}\frac{e^{\alpha_u(t)}\,\|\nabla_{w_u}L(w(t))\|}{e^{\alpha_s(t)}\,\|\nabla_{w_s}L(w(t))\|} = 1$$
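Lemma 6, as used above, concerns two diverging series whose difference converges: if the ratio of their terms has a limit, that limit must be 1. A quick numerical illustration with the hypothetical terms $a_k = 1/k$ and $b_k = 1/k + 1/k^2$ (both series diverge, their difference converges, and $a_k/b_k \to 1$):

```python
def partial_sum(term, n):
    """Partial sum of term(k) for k = 1..n."""
    return sum(term(k) for k in range(1, n + 1))

a = lambda k: 1.0 / k
b = lambda k: 1.0 / k + 1.0 / k ** 2

for n in (10, 1000, 100000):
    diff = partial_sum(lambda k: a(k) - b(k), n)   # converges (to -pi^2/6)
    ratio = a(n) / b(n)                            # tends to 1
    print(n, partial_sum(a, n), diff, ratio)
```

The partial sums of $a_k$ grow without bound while the difference series stabilizes, and the term ratio approaches 1, exactly the configuration that Lemma 6 rules in.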

D.2 STANDARD WEIGHT NORMALIZATION

D.2.1 GRADIENT FLOW

Theorem. Under assumptions (A1)-(A4) for gradient flow, the following holds for SWN:
(i) $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_v}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_u}L(w(t))\|} = 1$

Proof. Consider $u$ and $v$ such that $\|\tilde w_u\| > 0$ and $\|\tilde w_v\| > 0$. Using Proposition 2, we can say $\|\tilde g_u\| > 0$ and $\|\tilde g_v\| > 0$, and that for both $u$ and $v$ the weights and gradients converge to opposite directions. Consider a time $t_2$ such that for any $t > t_2$:
• $-\nabla_{w_u}L(w(t))$ and $w_u(t)$ make an angle of at most $\epsilon$ with each other
• $-\nabla_{w_v}L(w(t))$ and $w_v(t)$ make an angle of at most $\epsilon$ with each other
Then, using Equation (16), for any $t > t_2$,

$$\|w_u(t_2)\| + \cos(\epsilon)\int_{t_2}^t \eta(k)\|\nabla_{w_u}L(w(k))\|\,dk \;\le\; \|w_u(t)\| \;\le\; \|w_u(t_2)\| + \int_{t_2}^t \eta(k)\|\nabla_{w_u}L(w(k))\|\,dk$$

and analogously for $v$. Using these bounds, for $t > t_2$,

$$\frac{\|w_u(t_2)\| + \cos(\epsilon)\int_{t_2}^t \eta(k)\|\nabla_{w_u}L(w(k))\|\,dk}{\|w_v(t_2)\| + \int_{t_2}^t \eta(k)\|\nabla_{w_v}L(w(k))\|\,dk} \;\le\; \frac{\|w_u(t)\|}{\|w_v(t)\|} \;\le\; \frac{\|w_u(t_2)\| + \int_{t_2}^t \eta(k)\|\nabla_{w_u}L(w(k))\|\,dk}{\|w_v(t_2)\| + \cos(\epsilon)\int_{t_2}^t \eta(k)\|\nabla_{w_v}L(w(k))\|\,dk}$$

Both integrals diverge, as $\|w_u(t)\|, \|w_v(t)\| \to \infty$, and $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|}$ and $\lim_{t\to\infty}\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|}$ exist. Taking the limit $t\to\infty$ in both inequalities and using the integral form of the Stolz–Cesàro theorem, we get

$$\cos(\epsilon)\,\lim_{t\to\infty}\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|} \;\le\; \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} \;\le\; \frac{1}{\cos(\epsilon)}\,\lim_{t\to\infty}\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|}$$

As these limits exist and this holds for every $\epsilon > 0$,

$$\lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_v}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_u}L(w(t))\|} = 1$$

D.2.2 GRADIENT DESCENT

Theorem. Under assumptions (A1)-(A5) for gradient descent, the following holds for SWN:
(i) $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_v}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_u}L(w(t))\|} = 1$

Proof. Consider $u$ and $s$ such that $\|\tilde w_u\| > 0$ and $\|\tilde w_s\| > 0$. Using Proposition 2, we can say $\|\tilde g_u\| > 0$ and $\|\tilde g_s\| > 0$, and that for both $u$ and $s$ the weights and gradients converge to opposite directions. Now, from Equation (19), we can say

$$\gamma_u(t) = \gamma_u(0) - \sum_{k=0}^{t-1}\eta(k)\,\frac{v_u(k)^\top\nabla_{w_u}L(w(k))}{\|v_u(k)\|},\qquad \gamma_s(t) = \gamma_s(0) - \sum_{k=0}^{t-1}\eta(k)\,\frac{v_s(k)^\top\nabla_{w_s}L(w(k))}{\|v_s(k)\|}$$

Now, $\gamma_s(t)$ diverges to either $\infty$ or $-\infty$; in both cases it is a strictly monotonic sequence for large enough $t$. Also, $\lim_{t\to\infty}\frac{\gamma_u(t+1)-\gamma_u(t)}{\gamma_s(t+1)-\gamma_s(t)}$ exists. Therefore, using the Stolz–Cesàro theorem, we can say

$$\lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_s}L(w(t))\|}{\|w_s(t)\|\,\|\nabla_{w_u}L(w(t))\|} = 1$$

E PROOF OF PROPOSITION 3

Proposition. Let assumptions (A1)-(A5) be satisfied. Consider two nodes $u$ and $v$ in the network such that $\|\tilde g_v\| \ge \|\tilde g_u\| > 0$ and $\|w_u(t)\|, \|w_v(t)\| \to \infty$. Denote $\frac{\|\tilde g_u\|}{\|\tilde g_v\|}$ by $c$, and let $\epsilon, \delta$ be such that $0 < \epsilon < c$ and $0 < \delta < \frac{\pi}{2}$. Then the following hold:
1. There exists a time $t_1$ such that for all $t > t_1$ both the SWN and EWN trajectories have the following properties:
(a) $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|} \in [c-\epsilon, c+\epsilon]$
(b) $\frac{w_u(t)^\top}{\|w_u(t)\|}\,\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|} \ge \cos(\delta)$
(c) $\frac{w_v(t)^\top}{\|w_v(t)\|}\,\frac{-\nabla_{w_v}L(w(t))}{\|\nabla_{w_v}L(w(t))\|} \ge \cos(\delta)$
2. For SWN, $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = c$.
3. For EWN, if at some time $t_2 > t_1$,
(a) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} > \frac{1}{(c-\epsilon)\cos(\delta)} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$
(b) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} < \frac{\cos(\delta)}{c+\epsilon} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$
The proof of the different cases is split into multiple subsections, with the corresponding proposition restated in each for ease of the reader.

E.1 STANDARD WEIGHT NORMALIZATION

E.1.1 GRADIENT FLOW

Proposition. Let assumptions (A1)-(A4) be satisfied. Consider two nodes $u$ and $v$ in the network such that $\|\tilde g_v\| \ge \|\tilde g_u\| > 0$ and $\|w_u(t)\|, \|w_v(t)\| \to \infty$. Denote $\frac{\|\tilde g_u\|}{\|\tilde g_v\|}$ by $c$, and let $\epsilon, \delta$ be such that $0 < \epsilon < c$ and $0 < \delta < \frac{\pi}{2}$. Then the following hold:
1. There exists a time $t_1$ such that for all $t > t_1$ the SWN trajectory has the following properties:
(a) $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|} \in [c-\epsilon, c+\epsilon]$
(b) $\frac{w_u(t)^\top}{\|w_u(t)\|}\,\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|} \ge \cos(\delta)$
(c) $\frac{w_v(t)^\top}{\|w_v(t)\|}\,\frac{-\nabla_{w_v}L(w(t))}{\|\nabla_{w_v}L(w(t))\|} \ge \cos(\delta)$
2. $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = c$.

Proof. The proof of part 1a, i.e., $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|} \in [c-\epsilon, c+\epsilon]$, follows from the definition of the limit, as $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|}$ tends to $c$.

We now move to the proof of part 1b. The assumptions in this proposition differ slightly from Proposition 2, so the proof is slightly more involved, as we also need to show that $w_u(t)$ converges in direction. We give the proof for $\gamma_u \to \infty$; the case $\gamma_u \to -\infty$ can be handled similarly. As $\|\tilde g_u\| > 0$, $\nabla_{w_u}L(w(t))$ converges in direction. Therefore, for every $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$, there exists a time $t_3$ such that for $t > t_3$, $\frac{-\nabla_{w_u}L(w(t))^\top}{\|\nabla_{w_u}L(w(t))\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} \ge \cos(\tau)$. Now, assume $w_u(t)$ does not converge in the direction of $\tilde g_u$. Then there must exist a $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$ such that, for this $\tau$, there exists a time $t_4 > t_3$ satisfying $\frac{v_u(t_4)^\top}{\|v_u(t_4)\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} = \cos(\Delta)$, where $\Delta > \tau$. We now show that for any $\kappa$ satisfying $\tau < \kappa < \Delta$, there exists a time $t_5 > t_4$ such that $\frac{v_u(t_5)^\top}{\|v_u(t_5)\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} > \cos(\kappa)$. Suppose that for a given $\kappa$ no such $t_5$ exists. Then, taking the dot product of Equation (17) with $\frac{\tilde g_u}{\|\tilde g_u\|}$, we can say

$$\frac{\tilde g_u^\top}{\|\tilde g_u\|}\,\frac{dv_u(t)}{dt} = \eta(t)\,\frac{\gamma_u(t)}{\|v_u(t)\|}\,\|\nabla_{w_u}L(w(t))\|\;\frac{\tilde g_u^\top}{\|\tilde g_u\|}\Big(I - \frac{v_u(t)v_u(t)^\top}{\|v_u(t)\|^2}\Big)\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|}$$

Now, as $\frac{\tilde g_u^\top}{\|\tilde g_u\|}\,\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|} \ge \cos(\tau)$ and $\frac{\tilde g_u^\top}{\|\tilde g_u\|}\,\frac{v_u(t)}{\|v_u(t)\|} \le \cos(\kappa)$, we can say

$$\frac{\tilde g_u^\top}{\|\tilde g_u\|}\,\frac{dv_u(t)}{dt} \ge \eta(t)\,\frac{\gamma_u(t)}{\|v_u(t)\|}\,\|\nabla_{w_u}L(w(t))\|\,(\cos(\tau) - \cos(\kappa)) \qquad (24)$$

Now, using the fact that $\gamma_u \to \infty$ and Equation (16), we can say $\int_{t_4}^{\infty}\eta(t)\|\nabla_{w_u}L(w(t))\|\,dt = \infty$. Integrating Equation (24) from $t_4$ to $\infty$ then gives a contradiction, as the LHS integrates to a quantity of bounded magnitude while the RHS tends to $\infty$.
Thus, for every $\kappa$ between $\tau$ and $\Delta$, there must exist a $t_5$ such that $\frac{v_u(t_5)^\top}{\|v_u(t_5)\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} > \cos(\kappa)$. For gradient descent, whenever $\frac{v_u(t)^\top}{\|v_u(t)\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} < \cos(\beta)$, then, similar to Equation (26), we can say

$$\frac{v_u(t+1)^\top\tilde g_u}{\|\tilde g_u\|} \ge \frac{v_u(t)^\top\tilde g_u}{\|\tilde g_u\|} + (\cos(\tau)-\cos(\beta))\,\eta(t)\,\frac{\gamma_u(t)}{\|v_u(t)\|}\,\|\nabla_{w_u}L(w(t))\|$$

Using the upper bound on $\|v_u(t+1)\|$ from Equation (21), we can say

$$\frac{v_u(t+1)^\top\tilde g_u}{\|v_u(t+1)\|\,\|\tilde g_u\|} \ge \frac{\frac{v_u(t)^\top\tilde g_u}{\|\tilde g_u\|} + (\cos(\tau)-\cos(\beta))\,\eta(t)\frac{\gamma_u(t)}{\|v_u(t)\|}\|\nabla_{w_u}L(w(t))\|}{\sqrt{\|v_u(t)\|^2 + \big(\eta(t)\frac{\gamma_u(t)}{\|v_u(t)\|}\|\nabla_{w_u}L(w(t))\|\big)^2}} \qquad (27)$$

Let $\chi(t)$ denote $\eta(t)\frac{\gamma_u(t)}{\|v_u(t)\|}\|\nabla_{w_u}L(w(t))\|$. Then the above can be rewritten as

$$\frac{v_u(t+1)^\top\tilde g_u}{\|v_u(t+1)\|\,\|\tilde g_u\|} \ge \frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\,\|\tilde g_u\|}\cdot\frac{\|v_u(t)\|}{\sqrt{\|v_u(t)\|^2+\chi(t)^2}} + (\cos(\tau)-\cos(\beta))\,\frac{\chi(t)}{\sqrt{\|v_u(t)\|^2+\chi(t)^2}}$$

Now, we show that for a small enough $\chi(t)$ the RHS is greater than $\frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}$:

$$\frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}\cdot\frac{\|v_u(t)\|}{\sqrt{\|v_u(t)\|^2+\chi(t)^2}} + (\cos(\tau)-\cos(\beta))\frac{\chi(t)}{\sqrt{\|v_u(t)\|^2+\chi(t)^2}} > \frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}$$
$$\iff (\cos(\tau)-\cos(\beta))\,\frac{\chi(t)}{\sqrt{\|v_u(t)\|^2+\chi(t)^2}} > \frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}\Big(1 - \frac{\|v_u(t)\|}{\sqrt{\|v_u(t)\|^2+\chi(t)^2}}\Big)$$
$$\iff \cos(\tau)-\cos(\beta) > \frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}\cdot\frac{\sqrt{\|v_u(t)\|^2+\chi(t)^2} - \|v_u(t)\|}{\chi(t)}$$

Clearly, as $\chi(t) \to 0$ the RHS tends to $0$, so the inequality is satisfied. Thus, for a small enough $\chi(t)$, the RHS of Equation (27) is greater than $\frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}$. As $\|v_u(t)\|$ keeps increasing and, by Assumption (A5), $\lim_{t\to\infty}\eta(t)\gamma_u(t)\|\nabla_{w_u}L(w(t))\| = 0$, there exists a time $t_7$ such that for any $t > t_7$, $\frac{v_u(t)^\top\tilde g_u}{\|v_u(t)\|\|\tilde g_u\|}$ goes up whenever it exceeds $\cos(\kappa)$. As the above argument holds for any $\kappa$ between $\tau$ and $\Delta$, and for any $\tau > 0$, we can say that $w_u(t)$ converges in the direction of $\tilde g_u$.

The proof of part 1c, i.e., $\frac{w_v(t)^\top}{\|w_v(t)\|}\,\frac{-\nabla_{w_v}L(w(t))}{\|\nabla_{w_v}L(w(t))\|} \ge \cos(\delta)$, follows exactly the same steps as part 1b. The proof of part 2, i.e., $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = c$, can be shown in the same way as the proof of Theorem 2 for SWN gradient descent in Appendix D.2.2.

Suppose now that $\frac{\|w_u(t)\|}{\|w_v(t)\|} > \Delta$ for all $t$, for some $\Delta > 0$, and denote $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|}$ by $\beta$.
Then we can say

$$\frac{d}{dt}\frac{\|w_u(t)\|}{\|w_v(t)\|} \le -\Delta\,\big(\cos(\delta) - \beta(c+\epsilon)\big)\,\eta(t)\,\|w_v(t)\|\,\|\nabla_{w_v}L(w(t))\|$$

As $\alpha_v \to \infty$, using Equation (7) we can say $\int_{t_2}^{\infty}\eta(t)\|w_v(t)\|\|\nabla_{w_v}L(w(t))\|\,dt = \infty$. Thus, integrating both sides of the inequality above from $t_2$ to $\infty$, we get

$$\int_{t_2}^{\infty}\frac{d}{dt}\frac{\|w_u(t)\|}{\|w_v(t)\|}\,dt \le -\infty$$

This is not possible, as $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ is lower bounded by $0$. Thus $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$.

E.2.2 GRADIENT DESCENT

Proposition. Let assumptions (A1)-(A5) be satisfied. Consider two nodes $u$ and $v$ in the network such that $\|\tilde g_v\| \ge \|\tilde g_u\| > 0$ and $\|w_u(t)\|, \|w_v(t)\| \to \infty$. Denote $\frac{\|\tilde g_u\|}{\|\tilde g_v\|}$ by $c$, and let $\epsilon, \delta$ be such that $0 < \epsilon < c$ and $0 < \delta < \frac{\pi}{2}$. Then the following hold:
1. There exists a time $t_1$ such that for all $t > t_1$ the EWN trajectory has the following properties:
(a) $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|} \in [c-\epsilon, c+\epsilon]$
(b) $\frac{w_u(t)^\top}{\|w_u(t)\|}\,\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|} \ge \cos(\delta)$
(c) $\frac{w_v(t)^\top}{\|w_v(t)\|}\,\frac{-\nabla_{w_v}L(w(t))}{\|\nabla_{w_v}L(w(t))\|} \ge \cos(\delta)$
2. If at some time $t_2 > t_1$,
(a) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} > \frac{1}{(c-\epsilon)\cos(\delta)} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$
(b) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} < \frac{\cos(\delta)}{c+\epsilon} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$

Proof. The proof of part 1a follows from the definition of the limit, as $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|}$ tends to $c$. We now move to the proof of part 1b. The assumptions in this proposition differ slightly from Proposition 2, so the proof is slightly more involved, as we also need to show that $w_u(t)$ converges in direction. As $\|\tilde g_u\| > 0$, $\nabla_{w_u}L(w(t))$ converges in direction. Therefore, for every $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$, there exists a time $t_3$ such that for $t > t_3$, $\frac{-\nabla_{w_u}L(w(t))^\top}{\|\nabla_{w_u}L(w(t))\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} \ge \cos(\tau)$. Now, assume $w_u(t)$ does not converge in the direction of $\tilde g_u$. Then there must exist a $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$ such that, for this $\tau$, there exists a time $t_4 > t_3$ satisfying $\frac{v_u(t_4)^\top}{\|v_u(t_4)\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} = \cos(\Delta)$, where $\Delta > \tau$. We now show that for any $\kappa$ satisfying $\tau < \kappa < \Delta$, there exists a time $t_5 > t_4$ such that $\frac{v_u(t_5)^\top}{\|v_u(t_5)\|}\,\frac{\tilde g_u}{\|\tilde g_u\|} > \cos(\kappa)$. Suppose that for a given $\kappa$ no such $t_5$ exists.
Then, taking the dot product of Equation (12) with $\frac{\tilde g_u}{\|\tilde g_u\|}$, we can say

$$\frac{v_u(t+1)^\top\tilde g_u}{\|\tilde g_u\|} = \frac{v_u(t)^\top\tilde g_u}{\|\tilde g_u\|} + \eta(t)\,\frac{e^{\alpha_u(t)}}{\|v_u(t)\|}\,\|\nabla_{w_u}L(w(t))\|\;\frac{\tilde g_u^\top}{\|\tilde g_u\|}\Big(I - \frac{v_u(t)v_u(t)^\top}{\|v_u(t)\|^2}\Big)\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|}$$

Proceeding as in the SWN gradient descent case, $w_u(t)$ converges in the direction of $\tilde g_u$. The proof of part 1c follows exactly the same steps as part 1b.

We now move to the proof of part 2a, i.e., $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} > \frac{1}{(c-\epsilon)\cos(\delta)} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$. Using Equation (11) and part 1 of the proposition, we can say

$$\frac{\|w_u(t_2+1)\|}{\|w_v(t_2+1)\|} \ge \frac{\|w_u(t_2)\| + \eta(t_2)\cos(\delta)\|w_u(t_2)\|^2\|\nabla_{w_u}L(w(t_2))\|}{\|w_v(t_2)\| + \eta(t_2)\|w_v(t_2)\|^2\|\nabla_{w_v}L(w(t_2))\|} = \frac{\|w_u(t_2)\|}{\|w_v(t_2)\|}\cdot\frac{1 + \cos(\delta)\,\eta(t_2)\|w_u(t_2)\|\|\nabla_{w_u}L(w(t_2))\|}{1 + \eta(t_2)\|w_v(t_2)\|\|\nabla_{w_v}L(w(t_2))\|} \ge \frac{\|w_u(t_2)\|}{\|w_v(t_2)\|}$$

Thus $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ keeps increasing for $t > t_2$. It can either diverge to infinity or converge to a finite value. If it converged to a finite value, then by the Stolz–Cesàro theorem,

$$\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \lim_{t\to\infty}\frac{\|w_u(t)\|^2\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|^2\|\nabla_{w_v}L(w(t))\|}$$

However, this is not possible: the equality would force the limit to lie in $\big[\frac{1}{c+\epsilon}, \frac{1}{c-\epsilon}\big]$, while $\frac{\|w_u(t)\|}{\|w_v(t)\|} > \frac{1}{(c-\epsilon)\cos(\delta)} > \frac{1}{c-\epsilon}$ for every $t > t_2$. Thus $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ diverges to infinity.

We now move to the proof of part 2b, i.e., $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} < \frac{\cos(\delta)}{c+\epsilon} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$. Using Equation (11) and part 1 of the proposition, we can say

$$\frac{\|w_u(t_2+1)\|}{\|w_v(t_2+1)\|} \le \frac{\|w_u(t_2)\| + \eta(t_2)\|w_u(t_2)\|^2\|\nabla_{w_u}L(w(t_2))\|}{\|w_v(t_2)\| + \eta(t_2)\cos(\delta)\|w_v(t_2)\|^2\|\nabla_{w_v}L(w(t_2))\|} = \frac{\|w_u(t_2)\|}{\|w_v(t_2)\|}\cdot\frac{1 + \eta(t_2)\|w_u(t_2)\|\|\nabla_{w_u}L(w(t_2))\|}{1 + \eta(t_2)\cos(\delta)\|w_v(t_2)\|\|\nabla_{w_v}L(w(t_2))\|} \le \frac{\|w_u(t_2)\|}{\|w_v(t_2)\|}$$

Thus $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ keeps decreasing for $t > t_2$. As it is always greater than zero, it must converge. Therefore, by the Stolz–Cesàro theorem,

$$\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \lim_{t\to\infty}\frac{\|w_u(t)\|^2\|\nabla_{w_u}L(w(t))\|}{\|w_v(t)\|^2\|\nabla_{w_v}L(w(t))\|}$$

Since $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ stays below $\frac{1}{c+\epsilon}$, this can only be satisfied when $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$.
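The dichotomy in parts 2a-2b can be reproduced in a toy simulation. Below we track only the ratio $r(t) = \|w_u(t)\|/\|w_v(t)\|$ under an idealized version of the dynamics, in which the gradient-norm ratio is frozen at a constant $c$ and the weights are perfectly aligned with their gradients ($\delta = 0$); after rescaling time, the EWN norm dynamics give $r' \propto r(rc - 1)$, so the balanced ratio $1/c$ is an unstable fixed point, while SWN gives $r' \propto (c - r)$, for which $c$ is attracting. This is a sketch of the qualitative picture only, not of the full network dynamics:

```python
def ewn_ratio_step(r, c, eta):
    # time-rescaled EWN dynamics of r = ||w_u||/||w_v||: r' proportional to r(rc - 1)
    return r + eta * r * (r * c - 1.0)

def swn_ratio_step(r, c, eta):
    # time-rescaled SWN dynamics of the same ratio: r' proportional to (c - r)
    return r + eta * (c - r)

def run(step, r0, c=0.5, eta=1e-3, steps=20000):
    r = r0
    for _ in range(steps):
        r = step(r, c, eta)
    return r

# For EWN the balanced ratio 1/c = 2 is an unstable fixed point:
above = run(ewn_ratio_step, 2.1, steps=2500)   # climbs away from 2
below = run(ewn_ratio_step, 1.9)               # collapses toward 0
# For SWN the ratio c = 0.5 is attracting from any start:
attracted = run(swn_ratio_step, 5.0)
```

Starts on opposite sides of the EWN threshold separate — one norm ratio blows up while the other collapses toward $0$ (relative sparsity) — whereas SWN pulls every start to the same ratio $c$.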

F PROOF OF COROLLARY 1

Corollary. Consider a weight normalized (SWN or EWN) multilayer linear net, represented by $y = W_n W_{n-1}\cdots W_1 x$. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, if the dataset is sampled from a distribution continuous with respect to the Lebesgue measure on $\mathbb{R}^d$, then, with probability 1, $\theta = W_1^\top W_2^\top \cdots W_n^\top$ converges in direction to the maximum margin separator for all linearly separable datasets.

Proof. Consider a linear net given by $f = W_n W_{n-1}\cdots W_1 x$, where $f$ is a scalar, as we are considering a binary classification problem with exponential loss. Then

$$\frac{\partial L(w(t))}{\partial W_1(t)} = -\sum_{i=1}^m \ell_i(t)\, y_i\, (W_n(t)W_{n-1}(t)\cdots W_2(t))^\top x_i^\top$$

Now, at least one of the neurons in every layer must have a non-zero component in $\tilde w$, as otherwise $\Phi(\tilde w, x_i) = 0$ for all $i$, which implies $\rho = 0$. Let $u$ be one of the nodes of $W_1$ that has a non-zero component $\tilde w_u$, and let $w_u$ be the $k$-th row of the matrix $W_1$. Now, from Proposition 2, we know $\tilde w_u = \lambda \tilde g_u$. Denoting the component of $\tilde w$ along matrix $W_k$ by $\tilde W_k$ and using Equation (35), we get, for some $\lambda_k > 0$,

$$\tilde w_u = \lambda_k \big((\tilde W_n \tilde W_{n-1}\cdots \tilde W_2)^\top\big)[k]\,\sum_{i=1}^m \tilde \ell_i y_i x_i$$

where $\big((\tilde W_n \tilde W_{n-1}\cdots \tilde W_2)^\top\big)[k]$ denotes the $k$-th component of the product column vector. Let $S$ denote the set of rows of $W_1$ that have a non-zero component in $\tilde w$, and let $\tilde \theta$ denote the final convergent direction of $\theta$. Then, for some $\mu > 0$, we can say

$$\tilde\theta = \mu\, \tilde W_1^\top \tilde W_2^\top \cdots \tilde W_n^\top = \mu \sum_{j\in S}\lambda_j\Big(\big((\tilde W_n\tilde W_{n-1}\cdots \tilde W_2)^\top\big)[j]\Big)^2 \sum_{i=1}^m \tilde\ell_i y_i x_i$$

Thus $\tilde\theta$ satisfies the KKT conditions of the maximum margin problem, and if the data is sampled from a continuous distribution on $\mathbb{R}^d$, then with probability 1 (Soudry et al., 2018) it converges to the maximum margin separator.

G PROOF OF THEOREM 3

Theorem. For EWN, under Assumptions (A1)-(A5) and $\lim_{t\to\infty}\frac{r(t+1)-r(t)}{g(t+1)-g(t)} = 0$, the following hold:
1. $\|w(t)\|$ asymptotically grows as $\Theta\big((\log d(t))^{1/L}\big)$.
2. $L(w(t))$ asymptotically goes down at the rate $\Theta\big(\frac{1}{d(t)(\log d(t))^2}\big)$.
We first establish the rates for gradient flow, and then move to the case of gradient descent.

G.1 GRADIENT FLOW

Although the asymptotic convergence rates for smooth homogeneous neural nets have already been established in Lyu & Li (2020), we first present the proof technique for smooth homogeneous nets without weight normalization, where it is easier to understand.

G.1.1 UNNORMALIZED NETWORK

Theorem. For Unnorm, under Assumptions (A1)-(A4) for gradient flow and $\lim_{t\to\infty}\frac{r'(t)}{g'(t)} = 0$, the following hold:
1. $\|w(t)\|$ asymptotically grows as $\Theta\big((\log t)^{1/L}\big)$.
2. $L(w(t))$ asymptotically goes down at the rate $\Theta\big(\frac{1}{t(\log t)^{2-2/L}}\big)$.

Proof. Write $w = g(t)\tilde w + r(t)$, where $\lim_{t\to\infty}\frac{\|r(t)\|}{g(t)} = 0$ and $r(t)^\top\tilde w = 0$. We make the additional assumption that $\lim_{t\to\infty}\frac{r'(t)}{g'(t)} = 0$. This rules out oscillations of $r(t)$ at large $t$, where the derivative could be large even though the value remains bounded. Now, we know

$$\frac{dw(t)}{dt} = \sum_{i=1}^m e^{-y_i\Phi(w(t),x_i)}\, y_i\, \nabla_w\Phi(w(t),x_i)$$

Also, $\frac{dw(t)}{dt} \ne 0$ for any finite $t$, as otherwise $w$ would stop changing and $L$ could not converge to $0$. Thus, for all $t$,

$$\frac{\|dw(t)/dt\|}{\big\|\sum_{i=1}^m e^{-y_i\Phi(w(t),x_i)} y_i \nabla_w\Phi(w(t),x_i)\big\|} = 1$$

Taking the limit $t\to\infty$ on both sides, we get

$$\lim_{t\to\infty}\frac{\|dw(t)/dt\|}{\big\|\sum_{i=1}^m e^{-y_i\Phi(w(t),x_i)} y_i \nabla_w\Phi(w(t),x_i)\big\|} = 1 \qquad (36)$$

Now, $\frac{dw(t)}{dt} = g'(t)\tilde w + r'(t)$, and, by homogeneity,

$$\sum_{i=1}^m e^{-y_i\Phi(w(t),x_i)} y_i \nabla_w\Phi(w(t),x_i) = \sum_{i=1}^m e^{-y_i g(t)^L \Phi(\tilde w + \frac{r(t)}{g(t)},\,x_i)}\, y_i\, g(t)^{L-1}\,\nabla_w\Phi\Big(\tilde w + \frac{r(t)}{g(t)},\,x_i\Big)$$

Let $S = \{i : \Phi(\tilde w, x_i) = \min_j \Phi(\tilde w, x_j)\}$. Then, as $\rho > 0$, we can say

$$\lim_{t\to\infty}\frac{\big\|\sum_{i=1}^m e^{-y_i g(t)^L\Phi(\tilde w + r(t)/g(t),x_i)}\, y_i\, g(t)^{L-1}\nabla_w\Phi(\tilde w + r(t)/g(t), x_i)\big\|}{e^{-\rho g(t)^L}\, g(t)^{L-1}\,\big\|\sum_{i\in S}\tilde\ell_i y_i \nabla_w\Phi(\tilde w, x_i)\big\|} = k$$

where $k$ is some constant. Also, by the assumption, $\lim_{t\to\infty}\frac{\|dw(t)/dt\|}{g'(t)} = 1$. Substituting the above two limits in Equation (36), we get

$$\lim_{t\to\infty}\frac{g'(t)}{e^{-\rho g(t)^L}\, g(t)^{L-1}\,\big\|\sum_{i\in S}\tilde\ell_i y_i \nabla_w\Phi(\tilde w, x_i)\big\|} = \frac{1}{k}$$

Now, as the loss goes down at the rate $e^{-\rho g(t)^L}$, multiplying numerator and denominator by $\rho L g(t)^{L-1}$ and writing $h(t) = \rho g(t)^L$, we get

$$\lim_{t\to\infty}\frac{\rho^{1-2/L}\, h'(t)}{L\, e^{-h(t)}\, h(t)^{2-2/L}\,\big\|\sum_{i\in S}\tilde\ell_i y_i \nabla_w\Phi(\tilde w, x_i)\big\|} = \frac{1}{k}$$

Thus, asymptotically $h(t)$ grows as $\Theta\big(\log t + (2-\tfrac{2}{L})\log\log t\big)$, and hence the loss goes down at the rate $\Theta\big(\frac{1}{t(\log t)^{2-2/L}}\big)$.

G.1.2 EXPONENTIAL WEIGHT NORMALIZATION

Consider a vector $a(t)$ of the same dimension as $w$, whose components corresponding to a node $u$ are given by $a_u(t) = -\|\tilde w_u\|^2\,\frac{\partial L(w(t))}{\partial w_u}$. Now, as we know $w$ converges in direction to $\tilde w$, using the EWN update equation (7) we can say

$$\lim_{t\to\infty}\frac{\|dw(t)/dt\|}{g(t)^2\,\|a(t)\|} = 1$$

Using the expression for $-\frac{\partial L(w(t))}{\partial w}$ from the previous subsection, we can say

$$\lim_{t\to\infty}\frac{\|\partial L(w(t))/\partial w\|}{e^{-\rho g(t)^L}\, g(t)^{L-1}} = k_1 \qquad (37)$$

where $k_1$ is some constant. Now, using the expression for $a(t)$, we can say

$$\lim_{t\to\infty}\frac{\|a(t)\|}{e^{-\rho g(t)^L}\, g(t)^{L-1}} = k$$

where $k$ is some constant. Using the equations above, we can say

$$\lim_{t\to\infty}\frac{g'(t)}{e^{-\rho g(t)^L}\, g(t)^{L+1}} = \frac{1}{k}$$

Now, as the loss goes down at the rate $e^{-\rho g(t)^L}$, multiplying numerator and denominator by $\rho L g(t)^{L-1}$ and writing $h(t) = \rho g(t)^L$, we get

$$\lim_{t\to\infty}\frac{\rho\, h'(t)}{L\, e^{-h(t)}\, h(t)^2} = \frac{1}{k}$$

Thus, asymptotically $h(t)$ grows as $\Theta(\log t + 2\log\log t)$, and hence the loss goes down at the rate $\Theta\big(\frac{1}{t(\log t)^2}\big)$.
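The final step solves $h'(t) \propto e^{-h(t)}h(t)^2$ asymptotically. This can be sanity-checked numerically: if the loss $e^{-h(t)}$ really decays as $\Theta(1/(t(\log t)^2))$, then the product $e^{-h(T)}\,T(\log T)^2$ should settle near a constant. A small sketch (all constants set to 1, step sizes grown geometrically so that large $T$ is reachable):

```python
import math

def integrate_h(T, h0=1.0):
    """Euler-integrate h'(t) = exp(-h) * h^2 with geometrically growing steps."""
    h, t, dt = h0, 0.0, 1e-3
    while t < T:
        h += dt * math.exp(-h) * h * h
        t += dt
        dt = max(1e-3, 1e-3 * t)   # keep the relative step near 0.1%
    return h

# loss ~ e^{-h}; if the rate is Theta(1/(t (log t)^2)), the product below
# should hover near a constant as T grows
for T in (1e3, 1e5, 1e7):
    h = integrate_h(T)
    print(T, math.exp(-h) * T * math.log(T) ** 2)
```

The analogous check with exponent $2 - 2/L$ applies to the unnormalized rate derived above.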

G.2 GRADIENT DESCENT

Theorem. For Exponential Weight Normalization, under Assumptions (A1)-(A5) and $\lim_{t\to\infty}\frac{r(t+1)-r(t)}{g(t+1)-g(t)} = 0$, the following hold:
1. $\|w(t)\|$ asymptotically grows as $\Theta\big((\log d(t))^{1/L}\big)$.
2. $L(w(t))$ asymptotically goes down at the rate $\Theta\big(\frac{1}{d(t)(\log d(t))^2}\big)$.

Proof. Write $w = g(t)\tilde w + r(t)$, where $\lim_{t\to\infty}\frac{\|r(t)\|}{g(t)} = 0$ and $r(t)^\top\tilde w = 0$. We make the additional assumption that $\lim_{t\to\infty}\frac{r(t+1)-r(t)}{g(t+1)-g(t)} = 0$. Consider a node $u$ in the network with $\|\tilde w_u\| > 0$. The update equations for $v_u(t)$ and $\alpha_u(t)$ are

$$\alpha_u(t+1) = \alpha_u(t) - \eta(t)\,e^{\alpha_u(t)}\,\frac{v_u(t)^\top\nabla_{w_u}L(w(t))}{\|v_u(t)\|}$$

$$v_u(t+1) = v_u(t) - \eta(t)\,\frac{e^{\alpha_u(t)}}{\|v_u(t)\|}\Big(I - \frac{v_u(t)v_u(t)^\top}{\|v_u(t)\|^2}\Big)\nabla_{w_u}L(w(t))$$

We first estimate $\big\|e^{\alpha_u(t+1)}\frac{v_u(t+1)}{\|v_u(t+1)\|} - e^{\alpha_u(t)}\frac{v_u(t)}{\|v_u(t)\|}\big\|$. Let $\delta_u(t)$ denote $\eta(t)e^{\alpha_u(t)}\|\nabla_{w_u}L(w(t))\|$ and $\epsilon_u(t)$ denote the angle between $v_u(t)$ and $-\nabla_{w_u}L(w(t))$. We know $\lim_{t\to\infty}\delta_u(t) = 0$ and $\lim_{t\to\infty}\epsilon_u(t) = 0$. Rewriting the update equations in terms of these symbols, we get

$$e^{\alpha_u(t+1)} = e^{\alpha_u(t)}\,e^{\delta_u(t)\cos(\epsilon_u(t))}$$

$$v_u(t+1) = v_u(t) + \frac{\delta_u(t)\sin(\epsilon_u(t))}{\|v_u(t)\|}\,p_u(t) \qquad (38)$$

where $p_u(t)$ is a unit vector orthogonal to $v_u(t)$. Now, as $\|v_u(t)\|$ keeps increasing along the gradient descent trajectory, we can say $\frac{\|v_u(t+1)\|}{\|v_u(t)\|} \le k$, where $k > 0$ is some constant. Dividing both sides of Equation (38) by $e^{\alpha_u(t)}\delta_u(t)\cos(\epsilon_u(t))$ and analyzing the coefficient of the second term on the RHS, we get

$$\lim_{t\to\infty}\; e^{\delta_u(t)\cos(\epsilon_u(t))}\,\frac{\sin(\epsilon_u(t))}{\cos(\epsilon_u(t))}\,\frac{\|v_u(t)\|}{\|v_u(t+1)\|} = 0$$

Taking norms on both sides of Equation (38), using the Pythagoras theorem and the limits established above, we can say

$$\lim_{t\to\infty}\frac{\big\|e^{\alpha_u(t+1)}\frac{v_u(t+1)}{\|v_u(t+1)\|} - e^{\alpha_u(t)}\frac{v_u(t)}{\|v_u(t)\|}\big\|}{e^{\alpha_u(t)}\,\delta_u(t)} = 1$$

We also know

$$\lim_{t\to\infty}\frac{\big\|e^{\alpha_u(t+1)}\frac{v_u(t+1)}{\|v_u(t+1)\|} - e^{\alpha_u(t)}\frac{v_u(t)}{\|v_u(t)\|}\big\|}{g(t+1)-g(t)} = \|\tilde w_u\|$$

Using the equations above together with Equation (37), we can say

$$\lim_{t\to\infty}\frac{g(t+1)-g(t)}{\eta(t)\,e^{-\rho g(t)^L}\,g(t)^{L+1}} = c$$

where $c$ is some constant. This determines the asymptotic rate of $g(t)$.
To get a better closed form, define a map $d : \mathbb{N} \to \mathbb{R}$, given by $d(t) = $

(ii) for SWN, $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty}\frac{\|w_u(t)\|\,\|\nabla_{w_v}L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_u}L(w(t))\|} = 1$

Proof. The proof follows from Appendix D.

H.4 SPARSITY INDUCTIVE BIAS FOR EXPONENTIAL WEIGHT NORMALISATION

The following proposition shows why EWN is likely to converge to relatively sparse solutions.

Proposition 6. Let assumptions (A1)-(A5) be satisfied. Consider two nodes $u$ and $v$ in the network such that $\|\tilde g_v\| > 0$, $\|\tilde g_u\| > 0$, $\|w_u(t)\| \to \infty$ and $\|w_v(t)\| \to \infty$. Denote $\frac{\|\tilde g_u\|}{\|\tilde g_v\|}$ by $c$, and let $\epsilon, \delta$ be such that $0 < \epsilon < c$ and $0 < \delta < \frac{\pi}{2}$. Then the following hold:
1. There exists a time $t_1$ such that for all $t > t_1$ both the SWN and EWN trajectories have the following properties:
(a) $\frac{\|\nabla_{w_u}L(w(t))\|}{\|\nabla_{w_v}L(w(t))\|} \in [c-\epsilon, c+\epsilon]$
(b) $\frac{w_u(t)^\top}{\|w_u(t)\|}\,\frac{-\nabla_{w_u}L(w(t))}{\|\nabla_{w_u}L(w(t))\|} \ge \cos(\delta)$
(c) $\frac{w_v(t)^\top}{\|w_v(t)\|}\,\frac{-\nabla_{w_v}L(w(t))}{\|\nabla_{w_v}L(w(t))\|} \ge \cos(\delta)$
2. For SWN, $\lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = c$.
3. For EWN, if at some time $t_2 > t_1$,
(a) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} > \frac{1}{(c-\epsilon)\cos(\delta)} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$
(b) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} < \frac{\cos(\delta)}{c+\epsilon} \implies \lim_{t\to\infty}\frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$

The above proposition shows that the limit property of the weights in Theorem 4 makes a non-sparse $\tilde w$ an unstable convergent direction for EWN, while that is not the case for SWN.

Proof. The proof follows from Appendix E.

H.5 CONVERGENCE RATES

Theorem 5. For EWN, under Assumptions (A1)-(A5) and $\lim_{t\to\infty}\frac{r(t+1)-r(t)}{g(t+1)-g(t)} = 0$, the following hold:
1. $\|w(t)\|$ asymptotically grows as $\Theta\big((\log d(t))^{1/L}\big)$.
2. $L(w(t))$ asymptotically goes down at the rate $\Theta\big(\frac{1}{d(t)(\log d(t))^2}\big)$.

Proof. The proof follows Appendix G.2; the only difference is in the gradient update. Let $w$ be represented as $w = g(t)\tilde w + r(t)$, where $\lim_{t\to\infty}\frac{\|r(t)\|}{g(t)} = 0$. Using Equation (41), we can say

$$\lim_{t\to\infty}\frac{\|\nabla_w L(w(t))\|}{e^{-\rho g(t)^L}\, g(t)^{L-1}} = k$$

As the order remains the same as in the proof for the exponential loss, the proof follows from Appendix G.2.

I LEMMA PROOFS

Lemma 1. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, $\tilde w_u^\top \tilde g_u \ge 0$ for all nodes $u$ in the network.

Proof. We give the proof only for the exponential parameterization and gradient descent; the other cases can be handled similarly. We only need to consider nodes having $\|\tilde w_u\| > 0$ and $\|\tilde g_u\| > 0$, as for other nodes $\tilde w_u^\top \tilde g_u = 0$. Consider such a node $u$, and suppose $\tilde w_u^\top \tilde g_u < 0$. Then there exists a time $t_1$ such that for any $t > t_1$, $w_u(t)^\top(-\nabla_{w_u}L(w(t))) < 0$. Using Equation (11), this implies that for $t > t_1$, $\alpha_u(t+1) < \alpha_u(t)$. But this contradicts the fact that $\|w_u(t)\| \to \infty$. Thus $\tilde w_u^\top \tilde g_u \ge 0$.

Lemma 2. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, there exists at least one node $u$ in the network satisfying $\|\tilde w_u\| > 0$ and $\|\tilde g_u\| > 0$.

Proof. Under the assumption $\rho > 0$, using Euler's theorem for homogeneous functions together with Proposition 2, we can say

$$\tilde w^\top \tilde g = \mu\sum_{i=1}^m \tilde\ell_i y_i\, \tilde w^\top\nabla_w\Phi(\tilde w, x_i) = L\mu\sum_{i=1}^m \tilde\ell_i y_i\, \Phi(\tilde w, x_i) > 0$$

Thus there must be at least one node $s$ satisfying $\|\tilde w_s\| > 0$ and $\|\tilde g_s\| > 0$. The same can be shown for the cross-entropy loss as well.

$$b^\top(I - cc^\top)d \;\ge\; \cos(\epsilon) - \big(a^\top b + \sqrt{2-2\cos(\epsilon)}\big)\big(a^\top b + 2\sqrt{2-2\cos(\epsilon)} + (2-2\cos(\epsilon))\big)$$

Now, we need to show that there exists an $\epsilon > 0$ such that

$$\cos(\epsilon) - \big(a^\top b + \sqrt{2-2\cos(\epsilon)}\big)\big(a^\top b + 2\sqrt{2-2\cos(\epsilon)} + (2-2\cos(\epsilon))\big) > \epsilon$$

At $\epsilon = 0$, the LHS takes the value $1 - (a^\top b)^2$, while the RHS takes the value $0$. Thus, by continuity with respect to $\epsilon$ of the functions involved, we can say that there exists an $\epsilon > 0$ for which the condition is satisfied.

Lemma 4. Consider a sequence $a$ satisfying the following properties:
1. $a_k > 0$
2. $\sum_{k=0}^{\infty} a_k = \infty$
3. $\lim_{k\to\infty} a_k = 0$
Then

$$\sum_{k=0}^{\infty}\frac{a_k}{\sum_{j=0}^{k} a_j^2} = \infty$$

Proof. If $\sum_{k=0}^{\infty} a_k^2$ is bounded, then the statement is obvious. Consider the case when $\sum_{k=0}^{\infty} a_k^2$ diverges. Fix any $\epsilon > 0$. As $\lim_{k\to\infty} a_k = 0$, there must be an index $k_1$ such that for $k \ge k_1$, $a_k \le \epsilon$, and hence $a_k^2 \le \epsilon\, a_k$.
Now, as $\sum_{k=0}^{\infty} a_k^2$ diverges, there must be an index $k_2 > k_1$ such that for any $k > k_2$, $\sum_{j=k_1}^{k} a_j^2 \ge \sum_{j=0}^{k_1-1} a_j^2$. Then, for $k > k_2$, we can say

$$\sum_{j=k_1}^{k}\frac{a_j}{\sum_{l=0}^{j} a_l^2} \;\ge\; \frac{1}{2}\sum_{j=k_1}^{k}\frac{a_j}{\sum_{l=k_1}^{j} a_l^2} \;\ge\; \frac{1}{2\epsilon}\sum_{j=k_1}^{k}\frac{a_j}{\sum_{l=k_1}^{j} a_l} \;\ge\; \frac{1}{2\epsilon}\sum_{j=k_1}^{k}\frac{a_j}{\sum_{l=k_1}^{k} a_l} \;=\; \frac{1}{2\epsilon}$$

Since the same argument applies starting from any index, every tail of the series contributes at least $\frac{1}{2\epsilon}$, and therefore $\sum_{k=0}^{\infty}\frac{a_k}{\sum_{j=0}^{k} a_j^2}$ diverges as well.

Lemma 5. Consider two sequences $a$ and $b$ satisfying the following properties:
1. $a_k > 0$, $\sum_{k=0}^{\infty} a_k = \infty$ and $\lim_{k\to\infty} a_k = 0$
2. $b_0 > 0$, $b$ is increasing, and $b_{k+1}^2 \le b_k^2 + \big(\frac{a_k}{b_k}\big)^2$
Then $\sum_{k=0}^{\infty}\frac{a_k}{b_k} = \infty$.

Proof. As $b$ is increasing and $b_{k+1}^2 \le b_k^2 + \big(\frac{a_k}{b_k}\big)^2$, we get

$$b_k \le \sqrt{b_0^2 + \sum_{j=0}^{k-1}\Big(\frac{a_j}{b_j}\Big)^2} \le \sqrt{b_0^2 + \frac{1}{b_0^2}\sum_{j=0}^{k-1} a_j^2}$$

Using this, we can say

$$\sum_{j=0}^{k}\frac{a_j}{b_j} \ge \sum_{j=0}^{k}\frac{a_j}{\sqrt{b_0^2 + \frac{1}{b_0^2}\sum_{l=0}^{j-1} a_l^2}} \ge \sum_{j=0}^{k}\frac{a_j}{\sqrt{b_0^2 + \frac{1}{b_0^2}\sum_{l=0}^{k-1} a_l^2}}$$

Now, if $\sum_{k=0}^{\infty} a_k^2$ does not diverge to infinity, then $b$ remains bounded by the bound above, and it is then trivial to establish that $\sum_{k=0}^{\infty}\frac{a_k}{b_k}$ diverges. If $\sum_{k=0}^{\infty} a_k^2$ diverges to infinity, then there must be an index $k_1$ such that for any $k > k_1$, $\sum_{j=0}^{k-1} a_j^2 \ge b_0^4$. So, for $k > k_1$, we can say

$$\sum_{j=0}^{k}\frac{a_j}{b_j} \ge \sum_{j=0}^{k}\frac{b_0}{\sqrt{2}}\,\frac{a_j}{\sqrt{\sum_{l=0}^{k-1} a_l^2}}$$

Now, fix $\epsilon > 0$. As $a$ tends to zero, there must be an index $k_2$ such that for any $k > k_2$, $a_k \le \epsilon$. Also, as $\sum_{j=0}^{\infty} a_j^2$ diverges, there must be an index $k_3 > k_2$ such that for $k > k_3$, $\sum_{j=k_2}^{k} a_j^2 \ge \sum_{j=0}^{k_2} a_j^2$. Using these facts, and that $a_j \le \epsilon$ implies $a_j^2 \le \epsilon\, a_j$, we can say for $k > k_3$,

$$\sum_{j=k_3}^{k}\frac{a_j}{b_j} \;\ge\; \sum_{j=k_3}^{k}\frac{b_0}{2}\,\frac{a_j}{\sqrt{\sum_{l=k_3}^{k-1} a_l^2}} \;\ge\; \sum_{j=k_3}^{k}\frac{b_0}{2}\,\frac{a_j}{\sqrt{\epsilon\sum_{l=k_3}^{k-1} a_l}} \;\ge\; \frac{b_0}{2\sqrt{\epsilon}}\,\sqrt{\sum_{j=k_3}^{k-1} a_j}$$

Now, as $\sum_{k=0}^{\infty} a_k$ diverges, the RHS tends to $\infty$ with $k$, and hence $\sum_{k=0}^{\infty}\frac{a_k}{b_k} = \infty$.

Proof. Case 1: $L = 0$ or $\infty$. We will prove the case $L = \infty$; the case $L = 0$ can be handled similarly. For any $M > 0$, there must exist a time $t_1 > c$, such that
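Lemma 4 can be spot-checked numerically. With the illustrative choice $a_k = 1/\sqrt{k+1}$ (positive, summing to $\infty$, tending to $0$), the partial sums of $\sum_k a_k / \sum_{j\le k} a_j^2$ keep growing without bound, though slowly:

```python
import math

def lemma4_partial_sums(n):
    """Partial sums of sum_k a_k / (sum_{j<=k} a_j^2) for a_k = 1/sqrt(k+1)."""
    total, sq = 0.0, 0.0
    sums = []
    for k in range(n):
        a = 1.0 / math.sqrt(k + 1)
        sq += a * a          # sum of squares so far (diverges like log k)
        total += a / sq      # running value of the series from Lemma 4
        sums.append(total)
    return sums

s = lemma4_partial_sums(10 ** 6)
# the partial sums keep increasing with no plateau, consistent with divergence
```

Here $\sum_{j\le k} a_j^2$ grows only logarithmically, so the terms $a_k/\sum_{j\le k}a_j^2$ still sum like $\sqrt{k}/\log k$, matching the lemma's conclusion.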

K SWN IN (γ, v) PARAMETERS

In this section, we denote w by θ so as to be consistent with the notation in Lyu & Li (2020). SWN (in its parameters γ and v) is also a homogeneous network, so the results from Lyu & Li (2020) should apply directly to SWN as well. However, a crucial point is that the network is not even locally Lipschitz around $v_u = 0$, so the assumptions of Lyu & Li (2020) do not hold. Nevertheless, during gradient descent or gradient flow, if started from a finite $\|v_u\| > 0$ for all $u$, $\|v_u\|$ cannot decrease along the trajectory, so the network is still locally Lipschitz along the trajectory it takes. Examining the proofs in Lyu & Li (2020), it is clear that the results on monotonicity of the margin and on convergence rates depend only on the path that gradient descent/flow takes, and thus those proofs continue to hold. However, the result regarding the limit points of $\frac{\theta}{\|\theta\|}$ does not. One of the crucial theorems that its proof relies on is stated below.

Theorem. Let $\{x_k \in \mathbb{R}^d : k \in \mathbb{N}\}$ be a sequence of feasible points of an optimization problem (P), and let $\{\epsilon_k > 0 : k \in \mathbb{N}\}$ and $\{\delta_k > 0 : k \in \mathbb{N}\}$ be two sequences such that $x_k$ is an $(\epsilon_k, \delta_k)$-KKT point for every $k$, with $\epsilon_k \to 0$ and $\delta_k \to 0$. If $x_k \to x$ as $k \to \infty$ and MFCQ holds at $x$, then $x$ is a KKT point of (P).

The above statement requires MFCQ to be satisfied at $x$, which was shown in Lyu & Li (2020) assuming local Lipschitzness/smoothness at $x$. In our case, however, for gradient flow, $\|v_u\|$ does not grow while $|\gamma_u| \to \infty$; therefore the convergent point of $\frac{\theta}{\|\theta\|}$ always has the component corresponding to $v_u$ equal to $0$. Thus the network is not locally Lipschitz at $x$, and the proof that MFCQ holds breaks down. Similarly, for gradient descent, it cannot be guaranteed that $v_u$ has a non-zero component in $\frac{\theta}{\|\theta\|}$. Thus, the proof does not go through.

L EXPERIMENT DETAILS

In all the experiments, techniques for handling numerical underflow were used as described in Lyu & Li (2020). However, the learning rate they used was of order $O(\frac{1}{L})$, whereas in our case we generally modify it to $O(\frac{1}{L^c})$, where $c < 1$.

L.1 LI N-SE P

The learning rate used was $\frac{k(t)}{L^{0.97}}$, so that training speeds up at the beginning, but slows down as the loss approaches $e^{-300}$. The constant $k(t)$ was initialized at 0.01, and was increased by a factor of 1.1 every time the loss went down and decreased by a factor of 1.1 every time the loss went up after a gradient step. Its value was capped at 0.01 for EWN and SWN.
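The multiplicative adaptation of $k(t)$ described above can be sketched as follows (a minimal sketch with hypothetical function names; the loss values below are made up for illustration):

```python
# Sketch of the step-size schedule described above: eta(t) = k(t) / L^0.97,
# with k(t) multiplied or divided by 1.1 after each step depending on whether
# the loss went down or up, and capped at k_max.

def update_k(k, loss_prev, loss_curr, k_max=0.01, factor=1.1):
    """Multiplicative adaptation of the constant k(t)."""
    if loss_curr < loss_prev:
        k *= factor          # loss went down: speed up
    else:
        k /= factor          # loss went up: back off
    return min(k, k_max)     # cap at k_max

def learning_rate(k, loss, power=0.97):
    """eta = k / L^power; grows as the loss L shrinks."""
    return k / (loss ** power)

# Toy usage on a made-up loss sequence:
k = 0.01
losses = [1.0, 0.5, 0.7, 0.1]
for prev, curr in zip(losses, losses[1:]):
    k = update_k(k, prev, curr)
eta = learning_rate(k, losses[-1])
```

As the loss shrinks toward $e^{-300}$, the $1/L^{0.97}$ factor blows up, which is what compensates for the vanishing gradients of the exponential tail.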

L.2 SI M P L E-TR A J

The learning rate used was $\frac{k(t)}{L^{0.9}}$, so that training speeds up at the beginning, but slows down as the loss approaches $e^{-50}$. The constant $k(t)$ was initialized at 0.01, and was increased by a factor of 1.1 every time the loss went down and decreased by a factor of 1.1 every time the loss went up after a gradient step. Its value was capped at 0.1 for EWN and Unnorm.

L.3 XOR

The learning rate used was $\frac{k(t)}{L^{0.93}}$, so that training speeds up at the beginning, but slows down as the loss approaches $e^{-50}$. The constant $k(t)$ was initialized at 0.01, and was increased by a factor of 1.1 every time the loss went down and decreased by a factor of 1.1 every time the loss went up after a gradient step. Its value was capped at 0.01 for EWN and 0.1 for the other cases.

L.4 CONVERGENCE RATE EXPERIMENT

For SWN, EWN and Unnorm, the learning rate was a constant $\eta = 0.01$, and the networks were trained for 5000 steps. All the networks were explicitly initialized to the same point in function space.

L.5 MNIST PRUNING EXPERIMENT

The learning rate used was $\frac{k(t)}{L}$. The constant $k(t)$ was initialized at 0.01, and was increased by a factor of 1.1 every time the loss went down and decreased by a factor of 1.1 every time the loss went up after an epoch. Its value was capped at 0.01 for all the cases.



Homogeneous networks in the $w$ space are also homogeneous in the $(\gamma, v)$ space. Therefore, the results regarding convergence rates and monotonic margin from Lyu & Li (2020) hold. However, the results on convergence to a KKT point of the max-margin problem do not hold. For details, refer to Appendix K.
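The correspondence between the reparameterized and $w$-space dynamics can be illustrated numerically. The sketch below (ours, on a hypothetical two-parameter toy loss) checks, to first order in the step size, the equivalence stated in the main text: a gradient step on the exponential parameterization $w = e^{\gamma}\, v/\|v\|$ (with $\|v\| = 1$) matches a $w$-space gradient step with adaptive learning rate proportional to $\|w\|^2$.

```python
import numpy as np

# Toy loss L(w) = exp(-w1 * w2) on a single 2-d "node" w.
def grad(w):
    e = np.exp(-w[0] * w[1])
    return np.array([-w[1] * e, -w[0] * e])

# EWN parameterization of the node: w = exp(gamma) * v / ||v||, with ||v|| = 1.
gamma = 0.3
v = np.array([0.6, 0.8])                  # unit norm
w = np.exp(gamma) * v

eta = 1e-6
g = grad(w)

# One Euler step of gradient flow in (gamma, v) space (chain rule, ||v|| = 1):
dgamma = -(w @ g)                          # -dL/dgamma = -w . g
dv = -np.exp(gamma) * (g - (v @ g) * v)    # -dL/dv = -e^gamma (I - v v^T) g
w_ewn = np.exp(gamma + eta * dgamma) * (v + eta * dv) / np.linalg.norm(v + eta * dv)

# One Euler step of the claimed equivalent w-space flow, adaptive LR ||w||^2:
w_adapt = w - eta * (w @ w) * g

diff = np.linalg.norm(w_ewn - w_adapt)     # agrees up to O(eta^2)
```

The two updates agree up to $O(\eta^2)$, since the radial part contributes $ww^\top g$ and the tangential part contributes $e^{2\gamma}(I - vv^\top)g$, which sum to $\|w\|^2 g$ when $\|v\| = 1$.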



Figure 1: $L_2$ neighborhoods of radius $\epsilon = 0.5$ in parameter space for different parameterizations. For EWN on the left (resp. SWN in the middle), the parameter $[\alpha, v]$ (resp. $[\gamma, v]$) is restricted to a 3-d ball of radius $\epsilon$, and the values that the 2-d weight vector $w$ takes are illustrated for 6 different centers.

Figure 2: Verification of assumptions for EWN in the Lin-Sep experiment: (a) shows the dataset. In (b), it can be seen that only weights 5, 7 and 8 keep growing in norm; so, only for these, $\|\tilde w_u\| > 0$. (c) shows the components of the unit vector $\frac{w}{\|w\|}$ for the weights 5, 7 and 8 as they evolve with time; eventually their contribution to the unit vector becomes constant. (d) shows the components of the loss vector, which also become constant eventually. (e) shows the normalized parameter margin converging to a value greater than 0.

Figure 4: (a) Network architecture for the Simple-Traj experiment. (b) Trajectories of the two weights for EWN and Unnorm, starting from 5 different initialization points.

where $\mu > 0$. Proof. Follows exactly as shown for gradient flow in Appendix C.1.1.


Let $d(t) = \sum_{\tau=0}^{t-1} \eta(\tau)$, and choose a real analytic function $f$ satisfying $f(d(t)) = g(t)$ for all $t \in \mathbb{N}$ and $\lim_{t\to\infty} \frac{\eta(t) f''(d(t))}{f'(d(t))} = 0$. Substituting this $f$ in the equation above, we can say
$$\lim_{t\to\infty} \frac{f'(d(t))}{e^{-\rho f(d(t))^L}\, f(d(t))^{L+1}} = c.$$
Thus $f(d(t))$ grows as $\Theta\big((\log d(t))^{\frac{1}{L}}\big)$. Now, to get the convergence rate of the loss, multiply and divide the equation by $\rho f(d(t))^{L-1}$, and denote $h(d(t)) = \rho f(d(t))^L$, to get
$$\lim_{t\to\infty} \frac{\rho^2\, h'(d(t))}{e^{-h(d(t))}\, h(d(t))^2} = c.$$
Thus, $h(d(t))$ grows as $\Theta\big(\log d(t) + 2\log\log d(t)\big)$. Now, as $g(t) = f(d(t))$ and $\rho g(t)^L = h(d(t))$, $g(t)$ asymptotically grows as $\Theta\big((\log d(t))^{\frac{1}{L}}\big)$ and the loss goes down asymptotically as $\Theta\big(\frac{1}{d(t)(\log d(t))^2}\big)$.
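A quick numerical check (ours, not from the paper) of these asymptotics: taking $h(d) = \log d + 2\log\log d$, the loss $e^{-h(d)}$ equals $\frac{1}{d(\log d)^2}$ exactly, and the normalized quantity $h'(d)\,e^{h(d)}/h(d)^2$ slowly approaches a constant as $d$ grows, consistent with the limit above (up to the constant factor).

```python
import math

# h(d) = log d + 2 log log d, the claimed growth rate of h(d(t)).
def h(d):
    return math.log(d) + 2.0 * math.log(math.log(d))

def h_prime(d):
    return 1.0 / d + 2.0 / (d * math.log(d))

# Loss e^{-h(d)} equals 1 / (d (log d)^2) by construction:
d = 1e6
lhs = math.exp(-h(d))
rhs = 1.0 / (d * math.log(d) ** 2)

# Normalized quantity h'(d) e^{h(d)} / h(d)^2 at two scales; it slowly
# approaches a constant (convergence is only at a log log / log rate):
q1 = h_prime(1e6) * math.exp(h(1e6)) / h(1e6) ** 2
q2 = h_prime(1e12) * math.exp(h(1e12)) / h(1e12) ** 2
```

The very slow (doubly logarithmic) convergence of the normalized quantity is exactly why the extra $2\log\log d$ correction term matters in the rate.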

Consider two unit vectors $a$ and $b$ satisfying $a^\top b \ge 0$ and $a^\top b < 1$. Then, there exists a small enough $\epsilon > 0$ such that for any unit vector $c$ satisfying $c^\top a \ge \cos(\epsilon)$ and any unit vector $d$ satisfying $d^\top b \ge \cos(\epsilon)$, $b^\top(I - cc^\top)d \ge \epsilon$.

Proof. First, we will find bounds on $b^\top c$ and $c^\top d$:
$$b^\top c = b^\top\big(a + (c - a)\big) = b^\top a + b^\top (c - a)$$
$$c^\top d = \big(a + (c - a)\big)^\top \big(b + (d - b)\big) = a^\top b + a^\top(d - b) + b^\top(c - a) + (c - a)^\top(d - b).$$
Now, using the fact that $c^\top a \ge \cos(\epsilon)$ and $d^\top b \ge \cos(\epsilon)$, we can say $\|c - a\| \le \sqrt{2 - 2\cos(\epsilon)}$ and $\|d - b\| \le \sqrt{2 - 2\cos(\epsilon)}$. Using these bounds and the equations above, we can say
$$b^\top c \le a^\top b + \sqrt{2 - 2\cos(\epsilon)}$$
$$c^\top d \le a^\top b + 2\sqrt{2 - 2\cos(\epsilon)} + \big(2 - 2\cos(\epsilon)\big).$$
Using these, we can say
$$b^\top(I - cc^\top)d = b^\top d - (b^\top c)(c^\top d) \ge \cos(\epsilon) - \Big(a^\top b + \sqrt{2 - 2\cos(\epsilon)}\Big)\Big(a^\top b + 2\sqrt{2 - 2\cos(\epsilon)} + \big(2 - 2\cos(\epsilon)\big)\Big),$$
which tends to $1 - (a^\top b)^2 > 0$ as $\epsilon \to 0$; hence the claim holds for a small enough $\epsilon$.
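The lemma can be sanity-checked numerically. The sketch below (ours; vectors and $\epsilon$ chosen for illustration) samples unit vectors $c$ near $a$ and $d$ near $b$ and verifies that $b^\top(I - cc^\top)d$ stays bounded away from 0, close to the limiting value $1 - (a^\top b)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_near(x, eps):
    """A random unit vector within angle eps of the unit vector x."""
    while True:
        y = x + 0.1 * eps * rng.standard_normal(x.shape)
        y /= np.linalg.norm(y)
        if y @ x >= np.cos(eps):
            return y

a = np.array([1.0, 0.0, 0.0])
b = np.array([np.cos(1.0), np.sin(1.0), 0.0])   # a.b = cos(1), in [0, 1)
eps = 0.05
# b^T (I - c c^T) d for many (c, d) pairs close to (a, b):
vals = [b @ (np.eye(3) - np.outer(c, c)) @ d
        for c, d in ((unit_near(a, eps), unit_near(b, eps)) for _ in range(200))]
```

For this choice, $1 - (a^\top b)^2 \approx 0.71$, so the sampled values stay well above $\epsilon$.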

Let's say $\lim_{k\to\infty} \frac{a_k}{b_k} = c > 1$; the other case can be handled similarly. Choose an $\epsilon > 0$ such that $c - \epsilon > 1$. Then, there exists an index $k_1$ such that for $k > k_1$, we can say
$$c - \epsilon \le \frac{a_k}{b_k} \le c + \epsilon.$$
Using this, we can say, for $k > k_1$,
$$b_k(c - \epsilon - 1) \le a_k - b_k \le b_k(c + \epsilon - 1).$$
Summing the equation above from $k_1$ to $\infty$ and recognizing that $\sum_{k=0}^{\infty} b_k$ diverges, we get $\sum_{k=k_1}^{\infty}(a_k - b_k) = \infty$. This is a contradiction. Therefore, $\lim_{k\to\infty} \frac{a_k}{b_k} = 1$.

J INTEGRAL FORM OF STOLZ-CESARO THEOREM

We first state the Stolz-Cesaro Theorem.

Theorem (Muresan, 2015). Assume that $\{a_k\}_{k=1}^{\infty}$ and $\{b_k\}_{k=1}^{\infty}$ are two sequences of real numbers such that $\{b_k\}_{k=1}^{\infty}$ is strictly monotonic and divergent. Additionally, if $\lim_{k\to\infty} \frac{a_{k+1} - a_k}{b_{k+1} - b_k} = L$ exists, then $\lim_{k\to\infty} \frac{a_k}{b_k}$ exists and is equal to $L$.

Now, we state and prove the integral form of the Stolz-Cesaro Theorem.

Theorem. Consider two positive functions $f(t)$ and $g(t)$ satisfying $\int_a^b f(t)\,dt < \infty$ and $\int_a^b g(t)\,dt < \infty$ for every finite $a, b$, and $\int_t^{\infty} f(\tau)\,d\tau = \infty$, $\int_t^{\infty} g(\tau)\,d\tau = \infty$ for any time $t$. If $\lim_{t\to\infty} \frac{f(t)}{g(t)}$ exists and is equal to $L$, then $\lim_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau}$ exists for any $c$ and is equal to $L$.

Taking the left inequality, adding $(L - \epsilon)\int_c^{t_1} g(\tau)\,d\tau$ to both sides, dividing both sides by $\int_c^t g(\tau)\,d\tau$ and taking $\liminf_{t\to\infty}$, we get
$$L - \epsilon \le \liminf_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau}.$$

Similarly, taking the right inequality, adding $(L + \epsilon)\int_c^{t_1} g(\tau)\,d\tau$ to both sides, dividing both sides by $\int_c^t g(\tau)\,d\tau$ and taking $\limsup_{t\to\infty}$, we get
$$\limsup_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau} \le L + \epsilon.$$

Using the two inequalities, we get, for any $\epsilon > 0$,
$$L - \epsilon \le \liminf_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau} \le \limsup_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau} \le L + \epsilon.$$
Thus, $\lim_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau}$ exists and is equal to $L$.
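The integral form of the theorem can be illustrated numerically. The sketch below (ours; the functions are made up to satisfy the hypotheses) takes $f(t)/g(t) \to L = 2$ with $\int^\infty g = \infty$, and checks that the ratio of the integrals from $c = 1$ to $T$ approaches 2 as $T$ grows.

```python
import math

# f(t)/g(t) -> L = 2, and both tail integrals diverge.
f = lambda t: 2.0 + math.sin(t) / t
g = lambda t: 1.0

def integral(fun, a, b, n=200_000):
    """Midpoint-rule quadrature of fun over [a, b]."""
    h = (b - a) / n
    return h * sum(fun(a + (i + 0.5) * h) for i in range(n))

r_small = integral(f, 1.0, 10.0) / integral(g, 1.0, 10.0)
r_large = integral(f, 1.0, 1000.0) / integral(g, 1.0, 1000.0)
```

The ratio of the running integrals gets closer to $L = 2$ as the upper limit grows, even though $f(t)$ itself keeps oscillating around $2g(t)$.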

Figure 8: Verification of assumptions for SWN in the Lin-Sep experiment: (a) shows the dataset. In (b), it can be seen that only weights 5, 6 and 8 keep growing in norm; so, only for these, $\|\tilde w_u\| > 0$. (c) shows the components of the unit vector $\frac{w}{\|w\|}$ for the weights 5, 6 and 8 as they evolve with time; eventually their contribution to the unit vector becomes constant. (d) shows the components of the loss vector, which also become constant eventually. (e) shows the normalized parameter margin converging to a value greater than 0.

Figure 9: Demonstration of results for SWN in the Lin-Sep experiment: (a) demonstrates part 1 of Proposition 2, where $g$ is approximated using $w$ from the last point of the trajectory; clearly, $\nabla_{w_u} L$ stops oscillating and converges to $g$. (b) demonstrates part 2 of Proposition 2 and shows that, for weight vectors 5, 6 and 8, $w_u(t)$ converges in the direction opposite to $\nabla_{w_u} L(w(t))$. (c), (d) and (e) demonstrate Theorem 2 for SWN for weight vectors 5, 6 and 8; the three graphs are plotted at loss values of $e^{-200}$, $e^{-250}$ and $e^{-300}$ respectively. At each loss value, for the 3 weights, $\log \|\nabla_{w_u} L\| - \log \|w_u\|$ is approximately the same.

Figure 7: Variation of test accuracy vs. percentage of neurons pruned in the first layer at different loss values for the MNIST experiment.

Using a similar argument as in Equation (24), we can bound $\|v_u(t_6)\|$ for any $t_6 > t_5$. Now, using arguments similar to the proof of Proposition 2 for SWN gradient descent in Appendix C.2.2, we can show that the above statement leads to a contradiction, and thus there must exist such a $t_5$.

$$\sin(\theta_u(t)) = \frac{\big\|\big(-\nabla_{w_u} L(w(t))\big)^{\perp v_u(t)}\big\|}{\big\|\nabla_{w_u} L(w(t))\big\|},$$
where $\big(-\nabla_{w_u} L(w(t))\big)^{\perp v_u(t)}$ denotes the component of $-\nabla_{w_u} L(w(t))$ perpendicular to $v_u(t)$.

$\frac{f(t)}{g(t)} > M$ for $t > t_1$. Thus, we can say, for $t > t_1$,
$$\limsup_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau} \ge \limsup_{t\to\infty} \frac{\int_c^{t_1} f(\tau)\,d\tau + M\int_{t_1}^{t} g(\tau)\,d\tau}{\int_c^{t} g(\tau)\,d\tau} = M.$$
Similarly, the inequality holds for the $\liminf$ as well. Thus, both the $\liminf$ and the $\limsup$ are greater than $M$ for any $M$; hence $\lim_{t\to\infty} \frac{\int_c^t f(\tau)\,d\tau}{\int_c^t g(\tau)\,d\tau} = \infty$.

Case 2: $L$ is finite. In this case, for any $\epsilon > 0$, there must exist some time $t_1 > c$ such that $L - \epsilon < \frac{f(t)}{g(t)} < L + \epsilon$. Thus, we can say, for $t > t_1$,
$$(L - \epsilon)\int_{t_1}^{t} g(\tau)\,d\tau \le \int_{t_1}^{t} f(\tau)\,d\tau \le (L + \epsilon)\int_{t_1}^{t} g(\tau)\,d\tau.$$


arbitrarily chosen between $\tau$ and $\Delta$, and the argument holds for any $\epsilon > 0$, $w_u(t)$ converges in the direction of $g_u$. The proof of part 1c, i.e.,
$$\Big(\frac{w_v(t)}{\|w_v(t)\|}\Big)^{\!\top}\Big(\frac{-\nabla_{w_v} L(w(t))}{\|\nabla_{w_v} L(w(t))\|}\Big) \ge \cos(\delta),$$
can be shown in the same way as 1b. The proof of part 2, i.e., $\lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = c$, can be shown in the same way as Theorem 2 for SWN gradient flow from Appendix D.2.1.

E.1.2 GRADIENT DESCENT

Proposition. Let assumptions (A1)-(A5) be satisfied. Consider two nodes $u$ and $v$ in the network such that $\|g_v\| \ge \|g_u\| > 0$ and $\|w_u(t)\|, \|w_v(t)\| \to \infty$. Let $\frac{\|g_u\|}{\|g_v\|}$ be denoted by $c$. Let $\epsilon, \delta$ be such that $0 < \epsilon < c$ and $0 < \delta < \frac{\pi}{2}$. Then, the following holds:

1. There exists a time $t_1$ such that for all $t > t_1$, the SWN trajectory has the following properties:
(a) $\frac{\|\nabla_{w_u} L(w(t))\|}{\|\nabla_{w_v} L(w(t))\|} \in [c - \epsilon, c + \epsilon]$;
(b) $\big(\frac{w_u(t)}{\|w_u(t)\|}\big)^\top \big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big) \ge \cos(\delta)$;
(c) $\big(\frac{w_v(t)}{\|w_v(t)\|}\big)^\top \big(\frac{-\nabla_{w_v} L(w(t))}{\|\nabla_{w_v} L(w(t))\|}\big) \ge \cos(\delta)$.

2. $\lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = c$.

Proof. The proof of part 1a, i.e., $\frac{\|\nabla_{w_u} L(w(t))\|}{\|\nabla_{w_v} L(w(t))\|} \in [c - \epsilon, c + \epsilon]$, follows from the definition of the limit, as $\frac{\|\nabla_{w_u} L(w(t))\|}{\|\nabla_{w_v} L(w(t))\|}$ tends to $c$.

Now, we move to the proof of part 1b, i.e., $\big(\frac{w_u(t)}{\|w_u(t)\|}\big)^\top \big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big) \ge \cos(\delta)$. The assumptions in this Proposition differ slightly from Proposition 2, and thus the proof is slightly more involved, as we also need to show that $w_u(t)$ converges in direction. The proof is given for $\gamma_u \to \infty$; the case $\gamma_u \to -\infty$ can be handled similarly.

As $\|g_u\| > 0$, $\nabla_{w_u} L(w(t))$ converges in direction. Therefore, for every $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$, there exists a time $t_3$ such that for $t > t_3$, $\big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big)^\top \frac{g_u}{\|g_u\|} \ge \cos(\tau)$. Now, let us assume that $w_u(t)$ does not converge in the direction of $g_u$. Then, there must exist a $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$ such that, for this $\tau$, there exists a time $t_4 > t_3$ satisfying $\big(\frac{v_u(t_4)}{\|v_u(t_4)\|}\big)^\top \frac{g_u}{\|g_u\|} = \cos(\Delta)$, where $\Delta > \tau$. Now, we show that for any $\kappa$ satisfying $\tau < \kappa < \Delta$, there exists a time $t_5 > t_4$ such that $\big(\frac{v_u(t_5)}{\|v_u(t_5)\|}\big)^\top \frac{g_u}{\|g_u\|} > \cos(\kappa)$. Suppose, for a given $\kappa$, no such $t_5$ exists. Taking the dot product with $\frac{g_u}{\|g_u\|}$ on both sides of Equation (20), and using $\big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big)^\top \frac{g_u}{\|g_u\|} \ge \cos(\tau)$ together with $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|} \le \cos(\kappa)$, integrating the resulting inequality leads to a contradiction, as in the EWN argument below.

Proposition. Let assumptions (A1)-(A4) be satisfied. Consider two nodes $u$ and $v$ in the network such that $\|g_v\| \ge \|g_u\| > 0$ and $\|w_u(t)\|, \|w_v(t)\| \to \infty$. Let $\frac{\|g_u\|}{\|g_v\|}$ be denoted by $c$. Let $\epsilon, \delta$ be such that $0 < \epsilon < c$ and $0 < \delta < \frac{\pi}{2}$. Then, the following holds:

1. There exists a time $t_1$ such that for all $t > t_1$, the EWN trajectory has the following properties:
(a) $\frac{\|\nabla_{w_u} L(w(t))\|}{\|\nabla_{w_v} L(w(t))\|} \in [c - \epsilon, c + \epsilon]$;
(b) $\big(\frac{w_u(t)}{\|w_u(t)\|}\big)^\top \big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big) \ge \cos(\delta)$;
(c) $\big(\frac{w_v(t)}{\|w_v(t)\|}\big)^\top \big(\frac{-\nabla_{w_v} L(w(t))}{\|\nabla_{w_v} L(w(t))\|}\big) \ge \cos(\delta)$.

2. If at some time $t_2 > t_1$,
(a) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} > \frac{1}{(c - \epsilon)\cos(\delta)}$, then $\lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$;
(b) $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} < \frac{\cos(\delta)}{c + \epsilon}$, then $\lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$.

Proof. The proof of part 1a, i.e., $\frac{\|\nabla_{w_u} L(w(t))\|}{\|\nabla_{w_v} L(w(t))\|} \in [c - \epsilon, c + \epsilon]$, follows from the definition of the limit, as $\frac{\|\nabla_{w_u} L(w(t))\|}{\|\nabla_{w_v} L(w(t))\|}$ tends to $c$. Now, we move to the proof of part 1b, i.e., $\big(\frac{w_u(t)}{\|w_u(t)\|}\big)^\top \big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big) \ge \cos(\delta)$.

The assumptions in this Proposition differ slightly from Proposition 2, and thus the proof is slightly more involved, as we also need to show that $w_u(t)$ converges in direction. As $\|g_u\| > 0$, $\nabla_{w_u} L(w(t))$ converges in direction. Therefore, for every $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$, there exists a time $t_3$ such that for $t > t_3$, $\big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big)^\top \frac{g_u}{\|g_u\|} \ge \cos(\tau)$. Now, let us assume that $w_u(t)$ does not converge in the direction of $g_u$. Then, there must exist a $\tau$ satisfying $0 < \tau < \frac{\pi}{2}$ such that, for this $\tau$, there exists a time $t_4 > t_3$ satisfying $\big(\frac{v_u(t_4)}{\|v_u(t_4)\|}\big)^\top \frac{g_u}{\|g_u\|} = \cos(\Delta)$, where $\Delta > \tau$. Now, we show that for any $\kappa$ satisfying $\tau < \kappa < \Delta$, there exists a time $t_5 > t_4$ such that $\big(\frac{v_u(t_5)}{\|v_u(t_5)\|}\big)^\top \frac{g_u}{\|g_u\|} > \cos(\kappa)$. Suppose, for a given $\kappa$, no such $t_5$ exists. Taking the dot product with $\frac{g_u}{\|g_u\|}$ on both sides of Equation (8), and using $\big(\frac{-\nabla_{w_u} L(w(t))}{\|\nabla_{w_u} L(w(t))\|}\big)^\top \frac{g_u}{\|g_u\|} \ge \cos(\tau)$ and $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|} \le \cos(\kappa)$, we obtain a lower bound on the growth of $v_u(t)^\top \frac{g_u}{\|g_u\|}$. Now, using the fact that $\alpha_u \to \infty$ and Equation (7), and integrating Equation (28) on both sides from $t_4$ to $\infty$, we get a contradiction, as the vectors on the LHS have a finite norm while the RHS tends to $\infty$. Thus, for every $\kappa$ between $\tau$ and $\Delta$, there must exist a $t_5$ such that $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|} \ge \cos(\kappa)$ for any $t > t_5$. As $\kappa$ can be arbitrarily chosen between $\tau$ and $\Delta$, and the argument holds for any $\tau > 0$, $w_u(t)$ converges in the direction of $g_u$.

The proof of part 1c, i.e., $\big(\frac{w_v(t)}{\|w_v(t)\|}\big)^\top \big(\frac{-\nabla_{w_v} L(w(t))}{\|\nabla_{w_v} L(w(t))\|}\big) \ge \cos(\delta)$, follows exactly the same steps as part 1b.

Now, we move to the proof of part 2a, i.e., $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} > \frac{1}{(c - \epsilon)\cos(\delta)} \implies \lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$. Using the equation above and part 1 of the Proposition, we can bound the evolution of $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ for $t > t_1$. In this case, using Equation (31), we can see that $\frac{d}{dt}\frac{\|w_u(t)\|}{\|w_v(t)\|} > 0$ at $t_2$. Thus, $\frac{\|w_u(t)\|}{\|w_v(t)\|}$ always remains greater than $\frac{1}{(c - \epsilon)\cos(\delta)}$ and keeps increasing. Let us denote $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|}$ by $\Delta$. As $\alpha_u \to \infty$, using Equation (7), we can say $\int_{t_2}^{\infty} \eta(t)\,\|w_u(t)\|\,\|\nabla_{w_u} L(w(t))\|\,dt \to \infty$. Thus, integrating both sides of the resulting inequality from $t_2$ to $\infty$, we get $\lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = \infty$.

Now, we move to the proof of part 2b, i.e., $\frac{\|w_u(t_2)\|}{\|w_v(t_2)\|} < \frac{\cos(\delta)}{c + \epsilon} \implies \lim_{t\to\infty} \frac{\|w_u(t)\|}{\|w_v(t)\|} = 0$. Using Equation (30) and part 1 of the Proposition, the analogous bound holds for $t > t_1$, and using Equation (32), the same monotonicity argument applies.

For gradient descent, if $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|} < \cos(\beta)$, then, similar to Equation (33), using the upper bound on $\|v_u(t+1)\|$ from Equation (13), the update can be rewritten in terms of $\chi(t)$, where $\chi(t)$ denotes $\eta(t)\,\frac{e^{\alpha_u(t)}}{\|v_u(t)\|}\,\|\nabla_{w_u} L(w(t))\|$. Now, we show that for a small enough $\chi(t)$, the RHS is greater than $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|}$. Clearly, as $\chi(t) \to 0$, the required inequality is satisfied. Thus, for a small enough $\chi(t)$, the RHS of Equation (34) is greater than $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|}$. As $\|v_u(t)\|$ keeps increasing and, by Assumption (A5), $\lim_{t\to\infty} \eta(t)\,\gamma_u(t)\,\|\nabla_{w_u} L(w(t))\| = 0$, there exists a time $t_7$ such that for any $t > t_7$, $\big(\frac{v_u(t)}{\|v_u(t)\|}\big)^\top \frac{g_u}{\|g_u\|}$ increases whenever it is at most $\cos(\kappa)$. Now, as the above argument holds for any $\kappa$ between $\tau$ and $\Delta$, and for any $\tau > 0$, we can say that $w_u(t)$ converges in the direction of $g_u$.

H CROSS-ENTROPY LOSS

In this section, we will provide the corresponding assumptions and theorems, along with their proofs, for cross-entropy loss.

H.1 NOTATIONS

Let $k$ denote the total number of classes. As $\Phi(w, x_i)$ is a multi-dimensional function for multi-class classification, let us denote the $j$-th component of the output by $\Phi_j(w, x_i)$. Also, denote the margin of the $j$-th class ($j \ne y_i$) for the $i$-th data point by $\rho_{i,j}$, i.e., $\rho_{i,j} = \Phi_{y_i}(\tilde w, x_i) - \Phi_j(\tilde w, x_i)$. The margin for a data point $i$ is defined as $\rho_i = \min_{j \ne y_i} \rho_{i,j}$, and the margin for the entire network is defined as $\rho = \min_i \rho_i$. Also, define a matrix $M(w)$ of dimensions $(m, k)$. For a matrix $A$, $\mathrm{vec}(A)$ represents the matrix vectorized column-wise.
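The margin definitions above can be computed directly from the network outputs. The following sketch (ours; the logits are hypothetical placeholders for $\Phi_j(\cdot, x_i)$) computes $\rho_i = \min_{j\ne y_i}\rho_{i,j}$ per example and the network margin $\rho = \min_i \rho_i$.

```python
import numpy as np

# Hypothetical outputs Phi_j for 2 examples and k = 3 classes.
logits = np.array([[2.0, 0.5, 1.0],    # example 0, true class 0
                   [0.2, 1.5, 1.4]])   # example 1, true class 1
y = np.array([0, 1])

def margins(logits, y):
    n, k = logits.shape
    rho_i = np.empty(n)
    for i in range(n):
        true_score = logits[i, y[i]]
        others = np.delete(logits[i], y[i])
        rho_i[i] = np.min(true_score - others)   # rho_i = min_{j != y_i} rho_{i,j}
    return rho_i, rho_i.min()                    # network margin rho = min_i rho_i

rho_i, rho = margins(logits, y)
```

Here example 0 has margin $2.0 - 1.0 = 1.0$, example 1 has margin $1.5 - 1.4 = 0.1$, and the network margin is the smaller of the two.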

H.2 ASSUMPTIONS

The assumptions can be broadly divided into loss-function/architecture-based assumptions and trajectory-based assumptions. The loss-function/architecture-based assumptions are shared across both gradient flow and gradient descent, and all of them are exactly the same as for the exponential loss, except for (A3). Using assumption (A1), we can say
$$\lim_{t\to\infty} \sum_{j\ne y_i} e^{-(\Phi_{y_i}(w(t), x_i) - \Phi_j(w(t), x_i))} = 0 \qquad (39)$$
$$\lim_{t\to\infty} \frac{\log\Big(1 + \sum_{j\ne y_i} e^{-(\Phi_{y_i}(w(t), x_i) - \Phi_j(w(t), x_i))}\Big)}{\sum_{j\ne y_i} e^{-(\Phi_{y_i}(w(t), x_i) - \Phi_j(w(t), x_i))}} = 1 \qquad (40)$$
Thus, we can say that, for large enough $t$, $\ell_i \approx \sum_{j\ne y_i} e^{-(\Phi_{y_i}(w(t), x_i) - \Phi_j(w(t), x_i))}$. Assumption (A3) thus states that not just the loss vector, but also its components corresponding to the various classes, converge in direction. This is required to show that the gradients converge in direction in the case of multi-class classification.

Gradient Descent. For gradient descent, we also require the learning rate $\eta(t)$ to not grow too fast:

(A5) $\lim_{t\to\infty} \eta(t)\,\|w_u(t)\|\,\|\nabla_{w_u} L(w(t))\| = 0$ for all $u$ in the network.

Proposition 4. Under assumptions (A1)-(A4), $\lim_{t\to\infty} \eta(t)\,\|w_u(t)\|\,\|\nabla_{w_u} L(w(t))\| = 0$ holds for every $u$ in the network with $\eta(t) = O(\frac{1}{L^c})$, where $c < 1$.

Proof. Under Assumption (A1), $w(t)$ can be represented as $w(t) = g(t)\tilde w + r(t)$, where $\lim_{t\to\infty} \frac{\|r(t)\|}{g(t)} = 0$ and $r(t)^\top \tilde w = 0$. Substituting this into the gradient of the loss function, the order of $\|\nabla_{w_u} L(w(t))\|$ remains the same as in the proof for the exponential loss, and the proof follows from Appendix B.

This proposition establishes that Assumption (A5) is mild and holds for constant $\eta(t)$, which is what is generally used in practice.
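The approximation behind Equations (39)-(40) is easy to verify numerically. The sketch below (ours; the margins are made-up values scaled up to mimic training progress) checks that as the margins grow, $s_i = \sum_{j\ne y_i} e^{-\rho_{i,j}} \to 0$ while $\log(1 + s_i)/s_i \to 1$, so the cross-entropy loss $\ell_i = \log(1 + s_i)$ is asymptotically $s_i$.

```python
import math

# Hypothetical per-class margins rho_{i,j}, scaled up to mimic margin growth.
base_margins = (0.5, 1.0, 2.0)
ratios = []
for scale in (1.0, 5.0, 20.0):
    s = sum(math.exp(-scale * m) for m in base_margins)  # s_i in Eq. (39)
    ratios.append(math.log(1.0 + s) / s)                 # ratio in Eq. (40)
```

The ratio is always below 1 (since $\log(1+s) < s$ for $s > 0$) and approaches 1 monotonically as the margins scale up.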

H.3 EFFECT OF NORMALISATION ON WEIGHT AND GRADIENT NORMS

This section contains the main theorems and the difference between EWN and SWN that makes EWN asymptotically relatively sparse as compared to SWN. First, we state a common proposition for both SWN and EWN.

Proposition 5. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, for both SWN and EWN, the following hold:
(i) the gradients $\nabla_{w_u} L(w(t))$ converge in direction;
(ii) $\|\tilde w_u\| > 0 \implies \tilde w_u = \lambda g_u$ for some $\lambda > 0$.

Proof. Using Assumption (A1), $w(t)$ can be represented as $w(t) = g(t)\tilde w + r(t)$, where $\lim_{t\to\infty} \frac{\|r(t)\|}{g(t)} = 0$. Then,
$$e^{-(\Phi_{y_i}(w, x_i) - \Phi_j(w, x_i))} = e^{-g(t)^L \big(\Phi_{y_i}(\tilde w + \frac{r(t)}{g(t)}, x_i) - \Phi_j(\tilde w + \frac{r(t)}{g(t)}, x_i)\big)}$$
$$\nabla_w \Phi_j(w, x_i) - \nabla_w \Phi_{y_i}(w, x_i) = g(t)^{L-1}\Big(\nabla_w \Phi_j\big(\tilde w + \tfrac{r(t)}{g(t)}, x_i\big) - \nabla_w \Phi_{y_i}\big(\tilde w + \tfrac{r(t)}{g(t)}, x_i\big)\Big)$$
Now, using $\rho > 0$ and Euler's homogeneity theorem, we can say
$$w^\top\big(\nabla_w \Phi_{y_i}(w, x_i) - \nabla_w \Phi_j(w, x_i)\big) = L\big(\Phi_{y_i}(w, x_i) - \Phi_j(w, x_i)\big) > 0.$$
Thus, $\|\nabla_w \Phi_{y_i}(w, x_i) - \nabla_w \Phi_j(w, x_i)\| > 0$ for all $i, j$. Using these facts, Equation (39), Equation (40) and Equation (41), part (i) follows; for part (ii), the proof follows from Appendix C.

The first and second parts state that, under the given assumptions, for both SWN and EWN, the gradients converge in direction, and the weights that contribute to the final direction of $w$ converge in the direction opposite to the gradients. Now, we provide the main theorem that distinguishes SWN and EWN.

Theorem 4. Under assumptions (A1)-(A4) for gradient flow and (A1)-(A5) for gradient descent, the following hold:
(i) for EWN, $\|\tilde w_u\| > 0, \|\tilde w_v\| > 0 \implies \lim_{t\to\infty} \frac{\|w_u(t)\|\,\|\nabla_{w_u} L(w(t))\|}{\|w_v(t)\|\,\|\nabla_{w_v} L(w(t))\|} = 1$

