THE ASYMMETRIC MAXIMUM MARGIN BIAS OF QUASI-HOMOGENEOUS NEURAL NETWORKS

Abstract

In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss past the point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, yet structured enough to enable a geometric analysis of its gradient dynamics. Using this analysis, we generalize existing results on the maximum-margin bias of homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model, where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm minimization discards, when possible, unnecessary higher-rate parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.

1. INTRODUCTION

Modern neural networks trained with (stochastic) gradient descent generalize remarkably well despite being trained well past the point at which they interpolate the training data and despite having the functional capacity to memorize random labels (Zhang et al., 2021). This apparent paradox has led to the hypothesis that there must exist an implicit process biasing the network to learn a "good" generalizing solution, when one exists, rather than one of the many more "bad" interpolating ones. While much research has been devoted to identifying the origin of this implicit bias, much of the theory is developed for models that are far simpler than modern neural networks. In this work, we extend and generalize a long line of literature studying the maximum-margin bias of gradient descent in quasi-homogeneous networks, a class of models we define that encompasses nearly all modern feedforward neural network architectures. Quasi-homogeneous networks include feedforward networks with homogeneous nonlinearities, bias parameters, residual connections, pooling layers, and normalization layers. For example, the ResNet-18 convolutional network introduced by He et al. (2016) is quasi-homogeneous. We prove that after surpassing a certain threshold in training, gradient flow on an exponential loss, such as cross-entropy, drives the network to a maximum-margin solution under a norm constraint on the parameters. Our work is a direct generalization of the results for homogeneous networks in Lyu & Li (2019). However, unlike in the homogeneous setting, the norm constraint involves only a subset of the parameters. For example, in the case of a ResNet-18 network, only the last layer's weight and bias parameters are constrained. This asymmetric norm can have non-trivial implications for the robustness and optimization of quasi-homogeneous models, which we explore in Sections 5 and 6.
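To make the homogeneous/quasi-homogeneous distinction concrete, here is a minimal NumPy sketch (illustrative only, not from the paper): a bias-free two-layer ReLU network is homogeneous — scaling every parameter by α rescales the output by α², while the same network with an output bias is only quasi-homogeneous, since the bias must be scaled at a different rate (α² rather than α) for an exact scaling law to hold.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def f_homog(x, W1, W2):
    # two-layer ReLU network, no biases: homogeneous of degree 2
    return W2 @ relu(W1 @ x)

def f_quasi(x, W1, W2, b):
    # same network with an output bias: quasi-homogeneous
    return W2 @ relu(W1 @ x) + b

x  = rng.standard_normal(4)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((1, 8))
b  = np.array([0.7])  # fixed nonzero bias for a clean demonstration
a  = 3.0

# Homogeneous: scaling all parameters by a scales the output by a**2.
assert np.allclose(f_homog(x, a * W1, a * W2), a**2 * f_homog(x, W1, W2))

# Quasi-homogeneous: uniform scaling mixes degrees (the weight term picks up
# a**2, the bias term only a), so no single power of a rescales the output ...
print(np.allclose(f_quasi(x, a * W1, a * W2, a * b),
                  a**2 * f_quasi(x, W1, W2, b)))  # False
# ... but scaling each parameter at its own rate (weights by a, bias by a**2)
# rescales the output exactly by a**2.
print(np.allclose(f_quasi(x, a * W1, a * W2, a**2 * b),
                  a**2 * f_quasi(x, W1, W2, b)))  # True
```

The different scaling rates attached to different parameters are exactly the structure the quasi-homogeneous class is built to capture.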

2. BACKGROUND AND RELATED WORK

Early works studying the maximum-margin bias of gradient descent focused on the simple, yet insightful, setting of logistic regression (Rosset et al., 2003; Soudry et al., 2018). Consider a binary classification problem with a linearly separable[1] training dataset {x_i, y_i}, where x_i ∈ R^d and y_i ∈ {−1, 1}, a linear model f(x; β) = β⊺x, and the exponential loss L(β) = Σ_i e^(−y_i f(x_i; β)). As shown in Soudry et al. (2018), the loss attains its infimum in β only as the norm of β becomes infinite. Thus, even after the network correctly classifies the training data, gradient descent continues to decrease the loss by growing the norm of β without bound, yielding a slow alignment of β with the direction of the maximum ℓ2-margin solution: the configuration of β that minimizes ∥β∥ while keeping the margin min_i y_i f(x_i; β) at least 1. But what if we parameterize the regression coefficients differently? As shown in Fig. 1, different parameterizations, while not changing the space of learnable functions, can lead to classifiers with very different properties.

[1] Linearly separable means there exists a w ∈ R^d such that y_i w⊺x_i ≥ 1 for all i ∈ [n].

Figure 1: Maximum-margin bias changes with parameterization. Logistic regression, f(x) = β⊺x, trained with gradient descent on a homogeneous (left) and quasi-homogeneous (right) parameterization of the regression coefficients β. The dashed black line is the maximum ℓ2-margin solution and the solid black line is the gradient-descent-trained classifier after 1e5 steps. Existing theory predicts the homogeneous model will converge to the maximum ℓ2-margin solution. In this work we will show that the quasi-homogeneous model is driven by a different maximum-margin problem.

Linear networks. An early line of work exploring the influence of the parameterization on the maximum-margin bias studied the same setting as logistic regression, but with the regression coefficients β given by multilinear functions of parameters θ. Ji & Telgarsky (2018) showed that for deep linear networks, β = Π_i W_i, the weight matrices asymptotically align to rank-1 matrices, while their product converges to the maximum ℓ2-margin solution. Gunasekar et al. (2018) showed that linear diagonal networks, β = w_1 ⊙ ⋯ ⊙ w_D, converge to the maximum ℓ_{2/D}-margin solution, demonstrating that increasing depth drives the network to sparser solutions. They also show that an analogous result holds in the frequency domain for full-width linear convolutional networks. Many other works have advanced this line of literature, expanding to settings where the data is not linearly separable (Ji & Telgarsky, 2019), generalizing the analysis to other loss functions with exponential tails (Nacson et al., 2019b), considering the effect of randomness introduced by stochastic gradient descent (Nacson et al., 2019c), and unifying these results under a tensor formulation (Yun et al., 2020).

Homogeneous networks. While linear networks allow for a simple and interpretable analysis of the implicit bias in both parameter space (the space of θ) and function space (the space of β), it is unclear how these results relate to the behavior of the highly non-linear networks used in practice. Wei et al. (2019) and Xu et al. (2021) made progress toward the analysis of non-linear networks by considering shallow, one- or two-layer, networks with positive-homogeneous activations, i.e., functions for which there exists an L ∈ R+ such that f(αx) = α^L f(x) for all α ∈ R+. More recently, two concurrent works generalized this idea by expanding the analysis to all positive-homogeneous networks. Nacson et al. (2019a) used vanishing regularization to show that as long as the training error converges to zero and the parameters converge in direction, the rescaled parameters of a homogeneous model converge to a first-order Karush-Kuhn-Tucker (KKT) point of a maximum-margin optimization problem. Lyu & Li (2019) defined a normalized margin and showed that once the training loss drops below a certain threshold, a smoothed version of the normalized margin converges monotonically, allowing them to conclude that all limit points of the normalized parameters are first-order KKT points of the same optimization problem.
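The norm-growth and alignment dynamics described for logistic regression are easy to reproduce; below is a minimal NumPy sketch (the toy dataset, learning rate, and step count are illustrative choices, not from the paper):

```python
import numpy as np

# Linearly separable toy data in R^2. By the symmetry of the first four
# points, the maximum l2-margin direction is (1, 1)/sqrt(2); the fifth
# point (3.0, 0.2) lies well inside the margin and is not a support vector.
X = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 0.2],
              [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

beta = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    margins = y * (X @ beta)
    # gradient of L(beta) = sum_i exp(-y_i * x_i^T beta)
    grad = -(X.T * y) @ np.exp(-margins)
    beta -= lr * grad

print(np.linalg.norm(beta))         # keeps growing with more steps
print(beta / np.linalg.norm(beta))  # slowly aligns with (1, 1)/sqrt(2)
```

Even though the non-support point perturbs the early trajectory, its exponentially decaying loss term loses influence as ∥β∥ grows, and the direction of β drifts toward the maximum ℓ2-margin solution determined by the support vectors alone.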

A follow-up work, Ji & Telgarsky (2020), developed a theory of unbounded, nonsmooth Kurdyka-Łojasiewicz inequalities to prove a stronger result: directional convergence of the parameters and alignment of the gradient with the parameters along the gradient flow path. Lyu & Li (2019) and Ji & Telgarsky (2020) also empirically explored non-homogeneous models with bias parameters, and Nacson et al. (2019a) considered

