THE ASYMMETRIC MAXIMUM MARGIN BIAS OF QUASI-HOMOGENEOUS NEURAL NETWORKS

Abstract

In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-rate parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.

1. INTRODUCTION

Modern neural networks trained with (stochastic) gradient descent generalize remarkably well despite being trained well past the point at which they interpolate the training data and despite having the functional capacity to memorize random labels (Zhang et al., 2021). This apparent paradox has led to the hypothesis that there must exist an implicit process biasing the network to learn a "good" generalizing solution, when one exists, rather than one of the many more "bad" interpolating ones. While much research has been devoted to identifying the origin of this implicit bias, much of the theory is developed for models that are far simpler than modern neural networks. In this work, we extend and generalize a long line of literature studying the maximum-margin bias of gradient descent in quasi-homogeneous networks, a class of models we define that encompasses nearly all modern feedforward neural network architectures. Quasi-homogeneous networks include feedforward networks with homogeneous nonlinearities, bias parameters, residual connections, pooling layers, and normalization layers. For example, the ResNet-18 convolutional network introduced by He et al. (2016) is quasi-homogeneous. We prove that after surpassing a certain threshold in training, gradient flow on an exponential loss, such as cross-entropy, drives the network to a maximum-margin solution under a norm constraint on the parameters. Our work is a direct generalization of the results discussed for homogeneous networks in Lyu & Li (2019). However, unlike in the homogeneous setting, the norm constraint only involves a subset of the parameters. For example, in the case of a ResNet-18 network, only the last layer's weight and bias parameters are constrained. This asymmetric norm can have non-trivial implications on the robustness and optimization of quasi-homogeneous models, which we explore in Sections 5 and 6.
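To build intuition for how quasi-homogeneity relaxes homogeneity, consider a toy two-layer ReLU network with biases. The following sketch is illustrative only: the network, the parameter-scaling exponents, and the variable names are our own constructions, not taken from the formal definition in this paper. It checks numerically that scaling each parameter group by a suitable power of a common factor rescales the output by a fixed power of that factor, even though scaling all parameters uniformly (the homogeneous requirement) does not.

```python
import numpy as np

# Hypothetical toy model: f(x) = w2 . relu(w1 * x + b1) + b2.
# Scaling (w1, b1, w2) by alpha and b2 by alpha**2 rescales the output by
# alpha**2, so this model satisfies a quasi-homogeneous scaling law with
# exponents (1, 1, 1, 2) even though it is not homogeneous (uniform scaling
# of all parameters fails because of the output bias b2).

rng = np.random.default_rng(0)
w1 = rng.normal(size=3)   # first-layer weights
b1 = rng.normal(size=3)   # first-layer biases
w2 = rng.normal(size=3)   # second-layer weights
b2 = rng.normal()         # output bias

def f(w1, b1, w2, b2, x):
    return w2 @ np.maximum(w1 * x + b1, 0.0) + b2

x, alpha = 0.7, 1.9
out = f(w1, b1, w2, b2, x)

# Quasi-homogeneous scaling: b2 is scaled with a higher exponent.
quasi = f(alpha * w1, alpha * b1, alpha * w2, alpha**2 * b2, x)
print(np.isclose(quasi, alpha**2 * out))   # scaling law holds

# Homogeneous (uniform) scaling: the b2 term picks up the wrong power.
uniform = f(alpha * w1, alpha * b1, alpha * w2, alpha * b2, x)
print(np.isclose(uniform, alpha**2 * out))  # scaling law fails
```

The mismatch in the uniform case comes entirely from the output bias term, which is exactly the kind of parameter (biases, normalization scales, and the like) that motivates assigning different scaling exponents to different parameter groups.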

