INDUCTIVE BIAS OF GRADIENT DESCENT FOR EXPONENTIALLY WEIGHT NORMALIZED SMOOTH HOMOGENEOUS NEURAL NETS

Abstract

We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. Our analysis focuses on exponential weight normalization (EWN), which encourages weight updates along the radial direction. This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate, and hence causes the weights to be updated in a way that prefers asymptotic relative sparsity. These results can be extended to hold for gradient descent via an appropriate adaptive learning rate. The asymptotic convergence rate of the loss in this setting is given by Θ(1/(t(log t)^2)), and is independent of the depth of the network. We contrast these results with the inductive bias of standard weight normalization (SWN) and unnormalized architectures, and demonstrate their implications on synthetic datasets. Experimental results on simple datasets and architectures support our claim of sparse EWN solutions, even with SGD, demonstrating its potential application in learning prunable neural networks.

1. INTRODUCTION

The prevailing hypothesis for explaining the generalization ability of deep neural nets, despite their ability to fit even random labels (Zhang et al., 2017), is that optimization/training algorithms such as gradient descent have a 'bias' towards 'simple' solutions. This property is often called inductive bias, and has been an active research area over the past few years. It has been shown that gradient descent does indeed seem to prefer 'simpler' solutions over more 'complex' ones, where the notion of complexity is often problem/architecture specific. The predominant line of work shows that gradient descent prefers a least norm solution in some variant of the L2 norm. This is satisfying, as gradient descent over the parameters abides by the rules of L2 geometry, i.e., the weight vector moves along the direction of steepest descent, with length measured using the Euclidean norm. However, there is nothing special about the Euclidean norm in the parameter space, and hence several other notions of 'length' and 'steepness' are equally valid. In recent years, several alternative parameterizations of the weight vector, such as batch normalization and weight normalization, have seen immense success, and these do not seem to respect L2 geometry in the 'weight space'. We pose the question of the inductive bias of gradient descent for some of these parameterizations, and demonstrate interesting inductive biases. In particular, it can still be argued that gradient descent with these reparameterizations prefers simpler solutions, but the notion of complexity is different.
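The dependence of the steepest-descent direction on the choice of norm can be made concrete. The following sketch is our own illustration (the gradient vector is arbitrary, not from the paper): for a fixed gradient g, it computes the direction d minimizing g·d over unit-norm d, under three different norms.

```python
import numpy as np

# Example gradient at some point in parameter space (arbitrary choice).
g = np.array([3.0, -1.0, 0.5])

# L2 geometry (plain gradient descent): the best unit-L2 step is the
# negative normalized gradient.
d_l2 = -g / np.linalg.norm(g)

# L-infinity geometry: the best unit-L_inf step moves every coordinate
# by -sign(g_i), the direction that sign-based methods follow.
d_linf = -np.sign(g)

# L1 geometry: the best unit-L1 step puts all its mass on the single
# coordinate with largest |g_i| (coordinate-descent-like behavior).
i = np.argmax(np.abs(g))
d_l1 = np.zeros_like(g)
d_l1[i] = -np.sign(g[i])

print(d_l2, d_linf, d_l1)
```

All three are valid 'steepest descent' directions; they simply measure step length differently, which is the sense in which reparameterizations that break L2 geometry can change the inductive bias.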

1.1. CONTRIBUTIONS

The three main contributions of the paper are as follows.

• We establish that the gradient flow path with exponential weight normalization is equal to the gradient flow path of an unnormalized network using an adaptive, neuron-dependent learning rate. This provides a crisp description of the difference between exponentially weight normalized networks and unnormalized networks.

• We establish the inductive bias of gradient descent on standard weight normalized and exponentially weight normalized networks, and show that exponential weight normalization is likely to lead to asymptotic sparsity in the weights.

• We provide tight asymptotic convergence rates for exponentially weight normalized networks.

2. RELATED WORK

3. PROBLEM SETUP

We use a standard view of neural networks as a collection of nodes/neurons grouped by layers. Each node u is associated with a weight vector w_u, representing the incoming weight vector for that node. In the case of CNNs, weights can be shared across different nodes. w represents all

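The networks considered in this setting are homogeneous in their parameters: for a bias-free net with a homogeneous activation, scaling every weight by c > 0 scales the output by a fixed power of c. The toy example below is our own construction (not the paper's setup); it uses the smooth activation h(z) = z^2, for which a two-layer bias-free net is homogeneous of degree 3 in its weights (c^2 from the squared first layer, times c from the second layer).

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))  # first-layer weights (hidden x input)
w2 = rng.normal(size=5)       # second-layer weights

def f(W1, w2, x):
    # Bias-free two-layer net with smooth homogeneous activation h(z) = z^2.
    return w2 @ (W1 @ x) ** 2

x = rng.normal(size=3)
c = 2.5

# Scaling all weights by c scales the output by c^3.
lhs = f(c * W1, c * w2, x)
rhs = c ** 3 * f(W1, w2, x)
print(np.isclose(lhs, rhs))
```

The same degree-counting works layer by layer for deeper bias-free homogeneous nets.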


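The claimed equivalence between EWN gradient flow and unnormalized gradient flow with an adaptive learning rate can be checked numerically on a toy example. The sketch below is our own construction (the loss, data point, and initialization are arbitrary choices, not from the paper): under the exponential parameterization w = e^a · v/‖v‖ with ‖v‖ initialized to 1 (and ‖v‖ stays constant under gradient flow, since dv/dt is orthogonal to v), the chain-rule velocity of w works out to the unnormalized gradient scaled per neuron by ‖w‖^2.

```python
import numpy as np

rng = np.random.default_rng(0)
w_dim = 4
v = rng.normal(size=w_dim)
v /= np.linalg.norm(v)   # ||v|| = 1, so w = exp(a) * v
a = 0.3

def grad_L(w):
    # Gradient of a simple exponential loss L(w) = exp(-y * x.w)
    # at a single datapoint (our arbitrary choice).
    x = np.array([1.0, -2.0, 0.5, 1.5])
    y = 1.0
    return -y * x * np.exp(-y * (x @ w))

w = np.exp(a) * v
g = grad_L(w)

# EWN gradient-flow velocities of the parameters (a, v) via the chain rule:
a_dot = -(g @ w)                         # since dw/da = w
P = np.eye(w_dim) - np.outer(v, v)       # projection orthogonal to v
v_dot = -np.exp(a) * (P @ g)             # Jacobian of w wrt v at ||v|| = 1

# Resulting velocity of w = exp(a) * v:
w_dot_ewn = np.exp(a) * (a_dot * v + v_dot)

# Unnormalized gradient flow with adaptive per-neuron rate ||w||^2:
w_dot_adaptive = -(np.linalg.norm(w) ** 2) * g

print(np.allclose(w_dot_ewn, w_dot_adaptive))
```

Since ‖w‖ = e^a, neurons with larger weight norm take proportionally larger steps under EWN, which is consistent with the preference for relatively sparse solutions discussed in the abstract.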
2.1. INDUCTIVE BIAS

Soudry et al. (2018) showed that gradient descent (GD) on the logistic loss with linearly separable data converges to the L2 maximum margin solution for almost all datasets. These results were extended to loss functions with super-polynomial tails in Nacson et al. (2019b). Nacson et al. (2019c) extended these results to hold for stochastic gradient descent (SGD), and Gunasekar et al. (2018a) extended them to other optimization geometries. Ji & Telgarsky (2019b) provided tight convergence bounds in terms of dataset size as well as training time. Ji & Telgarsky (2019a) provide similar results when the data is not linearly separable.

Many normalization techniques have been proposed (Ba et al., 2016; Qiao et al., 2020; Li et al., 2019), but only a few have been theoretically explored. Santurkar et al. (2018) demonstrated that batch normalization makes the loss surface smoother, and that the L2 normalization in batchnorm can even be replaced by L1 and L∞ normalizations. Kohler et al. (2019) showed that, for GD, batchnorm speeds up convergence in the case of GLMs by splitting the optimization problem into learning the direction and the norm. Cai et al. (2019) analyzed GD on batchnorm for the squared loss and showed that it converges for a wide range of learning rates. Bjorck et al. (2018) showed that the primary reason batchnorm allows networks to achieve higher accuracy is by enabling higher learning rates. Arora et al. (2019) showed that, for GD or SGD with batchnorm, the learning rate for scale-invariant parameters does not affect the convergence rate towards stationary points. Du et al. (2018) showed that for GD over a one-hidden-layer weight normalized CNN, with constant probability over initialization, the iterates converge to a global minimum. Qiao et al. (2019) compared different normalization techniques from the perspective of whether they lead to points where neurons are consistently deactivated. Wu et al. (2019) established the inductive bias of gradient flow with weight normalization for overparameterized least squares, and showed that it converges to the minimum L2 norm solution for a wider range of initializations than the standard parameterization. Dukler et al. (2020) analyzed weight normalization for multilayer ReLU nets in the infinite-width regime and showed that it may speed up convergence. Other works (Luo et al., 2019; Roburin et al., 2020) provide additional perspectives on normalization techniques.

