THE ASYMMETRIC MAXIMUM MARGIN BIAS OF QUASI-HOMOGENEOUS NEURAL NETWORKS

Abstract

In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-rate parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.

1. INTRODUCTION

Modern neural networks trained with (stochastic) gradient descent generalize remarkably well despite being trained well past the point at which they interpolate the training data and despite having the functional capacity to memorize random labels Zhang et al. (2021) . This apparent paradox has led to the hypothesis that there must exist an implicit process biasing the network to learn a "good" generalizing solution, when one exists, rather than one of the many more "bad" interpolating ones. While much research has been devoted to identifying the origin of this implicit bias, much of the theory is developed for models that are far simpler than modern neural networks. In this work, we extend and generalize a long line of literature studying the maximum-margin bias of gradient descent in quasi-homogeneous networks, a class of models we define that encompasses nearly all modern feedforward neural network architectures. Quasi-homogeneous networks include feedforward networks with homogeneous nonlinearities, bias parameters, residual connections, pooling layers, and normalization layers. For example, the ResNet-18 convolutional network introduced by He et al. (2016) is quasi-homogeneous. We prove that after surpassing a certain threshold in training, gradient flow on an exponential loss, such as cross-entropy, drives the network to a maximum-margin solution under a norm constraint on the parameters. Our work is a direct generalization of the results discussed for homogeneous networks in Lyu & Li (2019) . However, unlike in the homogeneous setting, the norm constraint only involves a subset of the parameters. For example, in the case of a ResNet-18 network, only the last layer's weight and bias parameters are constrained. This asymmetric norm can have non-trivial implications on the robustness and optimization of quasi-homogeneous models, which we explore in sections 5 and 6.

2. BACKGROUND AND RELATED WORK

Early works studying the maximum-margin bias of gradient descent focused on the simple, yet insightful, setting of logistic regression Rosset et al. (2003); Soudry et al. (2018). Consider a binary classification problem with a linearly separable training dataset $\{(x_i, y_i)\}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, a linear model $f(x; \beta) = \beta^\top x$, and the exponential loss $L(\beta) = \sum_i e^{-y_i f(x_i; \beta)}$. As shown in Soudry et al. (2018), the loss has no finite minimizer: it approaches its infimum only as the norm of $\beta$ becomes infinite. Thus, even after the model correctly classifies the training data, gradient descent decreases the loss by forcing the norm of $\beta$ to grow in an unbounded manner, yielding a slow alignment of $\beta$ in the direction of the maximum $\ell_2$-margin solution, which is the configuration of $\beta$ that minimizes $\|\beta\|$ while keeping the margin $\min_i y_i f(x_i; \beta)$ at least 1. But what if we parameterize the regression coefficients differently? As shown in Fig. 1, different parameterizations, while not changing the space of learnable functions, can lead to classifiers with very different properties. In Fig. 1, the dashed black line is the maximum $\ell_2$-margin solution and the solid black line is the classifier trained with gradient descent for 1e5 steps. Existing theory predicts the homogeneous model will converge to the maximum $\ell_2$-margin solution. In this work we will show that the quasi-homogeneous model is driven by a different maximum-margin problem.

Linear networks. An early line of works exploring the influence of the parameterization on the maximum-margin bias studied the same setting as logistic regression, but where the regression coefficients $\beta$ are multilinear functions of parameters $\theta$. Ji & Telgarsky (2018) showed that for deep linear networks, $\beta = \prod_i W_i$, the weight matrices asymptotically align to a rank-1 matrix, while their product converges to the maximum $\ell_2$-margin solution. Gunasekar et al.
(2018) showed that linear diagonal networks, $\beta = w_1 \odot \cdots \odot w_D$, converge to the maximum $\ell_{2/D}$-margin solution, demonstrating that increasing depth drives the network to sparser solutions. They also show an analogous result holds in the frequency domain for full-width linear convolutional networks. Many other works have advanced this line of literature, expanding to settings where the data is not linearly separable Ji & Telgarsky (2019), generalizing the analysis to other loss functions with exponential tails Nacson et al. (2019b), considering the effect of randomness introduced by stochastic gradient descent Nacson et al. (2019c), and unifying these results under a tensor formulation Yun et al. (2020).

Homogeneous networks. While linear networks allowed for simple and interpretable analysis of the implicit bias in both the space of $\theta$ (parameter space) and the space of $\beta$ (function space), it is unclear how these results on linear networks relate to the behavior of the highly non-linear networks used in practice. Wei et al. (2019) and Xu et al. (2021) made progress towards analysis of non-linear networks by considering shallow, one- or two-layer, networks with positive-homogeneous activations, i.e., there exists $L \in \mathbb{R}_+$ such that $f(\alpha x) = \alpha^L f(x)$ for all $\alpha \in \mathbb{R}_+$. More recently, two concurrent works generalized this idea by expanding their analysis to all positive-homogeneous networks. Nacson et al. (2019a) used vanishing regularization to show that as long as the training error converges to zero and the parameters converge in direction, the rescaled parameters of a homogeneous model converge to a first-order Karush-Kuhn-Tucker (KKT) point of a maximum-margin optimization problem. 
Lyu & Li (2019) defined a normalized margin and showed that once the training loss drops below a certain threshold, a smoothed version of the normalized margin monotonically converges, allowing them to conclude that all rescaled limit points of the normalized parameters are first-order KKT points of the same optimization problem. A follow up work, Ji & Telgarsky (2020) , developed a theory of unbounded, nonsmooth Kurdyka-Lojasiewicz inequalities to prove a stronger result of directional convergence of the parameters and alignment of the gradient with the parameters along the gradient flow path. Lyu & Li (2019) and Ji & Telgarsky (2020) also explored empirically non-homogeneous models with bias parameters and Nacson et al. (2019a) considered theoretically non-homogeneous models defined as an ensemble of homogeneous models of different orders. While these works have significantly narrowed the gap between theory and practice, all three works have also highlighted the limitation in applying their analysis to architectures with bias parameters, residual connections, and normalization layers, a limitation we alleviate in this work. In a parallel literature studying low-rank biases in deep learning, Le & Jegelka (2022) analyzed non-homogeneous models where the nonlinearities are restricted to the first few layers.

3. DEFINING THE CLASS OF QUASI-HOMOGENEOUS MODELS

Here we introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with positive-homogeneous activations, while structured enough to enable geometric analysis of its gradient dynamics. Throughout this work, we will consider a binary classifier $f(x; \theta) : \mathbb{R}^d \to \mathbb{R}$, where $\theta \in \mathbb{R}^m$ is the vector concatenating all the parameters of the model. We assume the dynamics of $\theta(t)$ over time $t$ are governed by gradient flow $\frac{d\theta}{dt} = -\frac{\partial L}{\partial \theta}$ on an exponential loss $L(\theta) = \frac{1}{n} \sum_i e^{-y_i f(x_i; \theta)}$ computed over a training dataset $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. In App. H we generalize our results to multi-class classification with the cross-entropy loss.

Definition 3.1 ($\Lambda$-Quasi-Homogeneous). For a (non-zero) positive semi-definite matrix $\Lambda \in \mathbb{R}^{m \times m}$, a model $f(x; \theta)$ is $\Lambda$-quasi-homogeneous if under the parameter transformation

$$\psi_\alpha(\theta) := e^{\alpha \Lambda} \theta, \qquad (1)$$

the output of the model scales as $f(x; \psi_\alpha(\theta)) = e^{\alpha} f(x; \theta)$ for all $\alpha \in \mathbb{R}$ and inputs $x$.

In this work, we assume $\Lambda$ is diagonal and let $\lambda_i = (\Lambda)_{ii}$ and $\lambda_{\max} = \max_i \lambda_i$ be the maximum diagonal element, which must be positive. Definition 3.1 generalizes the notion of positive-homogeneous functions, allowing different scaling rates for different parameters to yield the same scaling of the output. Given two parameters with different values of $\lambda$, we refer to the parameter with larger $\lambda$ as higher-rate and the other as lower-rate.

Examples. We consider some simple quasi-homogeneous networks that are not homogeneous.

Unbalanced linear diagonal network. Consider a diagonal network as described in Gunasekar et al. (2018), but with a varying depth for different dimensions of the data. The regression coefficient $\beta_i$ for input component $x_i$ is parameterized as the product of $D_i \in \mathbb{N}$ parameters, yielding $f(x; \theta) = \sum_i \left( \prod_{j=1}^{D_i} \theta_{ij} \right) x_i$. 
When the $D_i$ are equal, the network is homogeneous; otherwise, the network is quasi-homogeneous, where the choice of $\lambda$ can be $D_i^{-1}$ for $\theta_{ij}$.

Fully connected network with biases. One of the simplest quasi-homogeneous models is a multilayer, fully-connected network with bias parameters, such as the two-layer network $f(x; \theta) = w^2 \sigma\left( \sum_i w^1_i x_i + b^1 \right) + b^2$, where $\sigma(\cdot)$ is a Rectified Linear Unit (ReLU) and superscripts index the layer. Without biases this network would be homogeneous, but their inclusion requires a quasi-homogeneous scaling of parameters to uniformly scale the output of the model. For example, the choice of $\lambda$ can be 1 for $b^2$ and 1/2 for all other parameters.

Networks with residual connections. Similar to networks with biases, residual connections result in a computational path that requires a quasi-homogeneous scaling of the parameters. For example, the model $f(x; \theta) = \sum_j w^2_j \sigma\left( \sum_i w^1_{ji} x_i + x_j \right)$ is quasi-homogeneous, where the choice of $\lambda$ can be 1 for $w^2$ and 0 for $w^1$.

Networks with normalization layers. As discussed in Kunin et al. (2020), when normalization layers, such as batch normalization, are introduced into a homogeneous network, they induce scale invariance in the parameters of the preceding layer. However, as long as the last layer is positive homogeneous, a network with normalization layers is quasi-homogeneous. For example, the network $f(x; \theta) = \sum_i w_i h_i(\theta', x) + b$ is quasi-homogeneous, where $w$ is the weight of the last layer, $b$ is the bias, $\theta'$ is the set of parameters in earlier layers, and $h(\theta', x)$ is the activation of the last hidden layer after normalization. The choice of $\lambda$ can be 1 for $w$ and $b$ and 0 for $\theta'$. See App. A for more examples of quasi-homogeneous models and their relationship to ensembles of homogeneous networks of different orders, as discussed in Nacson et al. (2019a).

Geometric properties. Like homogeneous functions, quasi-homogeneous functions have certain geometric properties of their derivatives. 
Analogous to Euler's Homogeneous Function Theorem, for a quasi-homogeneous $f(x; \theta)$, we have $\langle \nabla_\theta f(x; \theta), \Lambda \theta \rangle = f(x; \theta)$, which is easily derived by evaluating the derivative $\nabla_\alpha f(x; \psi_\alpha(\theta))$ at $\alpha = 0$, the identity element of the transformation. Analogous to how the derivative of a homogeneous function of order $L$ is a homogeneous function of order $L - 1$, the derivative of a quasi-homogeneous function under the same transformation respects the following property: $\nabla_\theta f(x; \psi_\alpha(\theta)) = e^{\alpha(I - \Lambda)} \nabla_\theta f(x; \theta)$. See App. A for a derivation of the geometric properties of quasi-homogeneous functions.

Figure 2: A natural coordinate system for quasi-homogeneous models. Panels: (a) $\lambda_1 = \lambda_2 = 1$; (b) $\lambda_1 = 1$, $\lambda_2 = 0.5$. A useful coordinate system for studying the gradient dynamics of quasi-homogeneous models is the decomposition of parameter space into characteristic curves (solid lines) and level sets of the $\Lambda$-seminorm (dashed lines). For a homogeneous function (left), this decomposition is equivalent to a polar decomposition. For a quasi-homogeneous function (right), the directions of the characteristic curves are eventually dominated by the highest-rate parameters and the level sets of the $\Lambda$-seminorm are concentric ellipsoids.

Characteristic curves. Throughout this work we consider the partition of parameter space into the family of one-dimensional characteristic curves mapped out by the parameter transformation in Eq. 1. The vector field generating the transformation, $\left.\frac{\partial \psi_\alpha}{\partial \alpha}\right|_{\alpha=0} = \Lambda \theta$, is tangent to the characteristic curve and thus we will refer to this vector as the tangent vector. We define the angle $\omega$ between the velocity $\frac{d\theta}{dt}$ and the tangent vector such that the cosine similarity between these two vectors is

$$\beta := \cos(\omega) = \frac{\left\langle \Lambda \theta, \frac{d\theta}{dt} \right\rangle}{\|\Lambda \theta\| \left\| \frac{d\theta}{dt} \right\|}.$$

Λ-Seminorm. 
The characteristic curves perpendicularly intersect a family of concentric ellipsoids defined by the $\Lambda$-seminorm, $\|\theta\|_\Lambda^2 := \sum_i \lambda_i \theta_i^2$. Together, the intersection of a given characteristic curve with an ellipsoid of given $\Lambda$-seminorm uniquely defines a single point in parameter space. In the setting of homogeneous networks, this geometric structure is equivalent to a polar decomposition of parameter space. We also define the $\Lambda$-normalized parameters $\hat\theta = \psi_{-\tau}(\theta)$, where $\tau(\theta)$ is implicitly defined such that $\|\hat\theta\|_\Lambda^2 = 1$. This corresponds to a unique projection of the parameter $\theta$ onto the unit $\Lambda$-seminorm ellipsoid by moving along a characteristic curve. As shown in Fig. 2, for a homogeneous function, the characteristics are rays and the $\Lambda$-seminorm is proportional to the Euclidean norm $\|\theta\|$. For a quasi-homogeneous function, the directions of the characteristic curves and the $\Lambda$-seminorm are eventually dominated by the highest-rate parameters. Thus, we will also find it helpful to define the $\Lambda_{\max}$-seminorm as $\|\theta\|_{\Lambda_{\max}}^2 := \sum_{i : \lambda_i = \lambda_{\max}} \lambda_i \theta_i^2$.
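The $\Lambda$-normalization above has no closed form in general, but it can be computed numerically. The following is a minimal sketch, assuming a diagonal $\Lambda$ given as a list of rates; the function names (`lambda_seminorm_sq`, `psi`, `lambda_normalize`) and the use of bisection are illustrative choices, not taken from the paper. It relies on the fact that $\|\psi_{-\tau}(\theta)\|_\Lambda^2$ is strictly decreasing in $\tau$ whenever $\|\theta\|_\Lambda > 0$.

```python
import math

def lambda_seminorm_sq(theta, lam):
    """Squared Lambda-seminorm: sum_i lam_i * theta_i^2."""
    return sum(l * t * t for l, t in zip(lam, theta))

def psi(alpha, theta, lam):
    """psi_alpha(theta) = exp(alpha * Lambda) theta for diagonal Lambda."""
    return [math.exp(alpha * l) * t for l, t in zip(lam, theta)]

def lambda_normalize(theta, lam, iters=200):
    """Return (tau, theta_hat) with theta_hat = psi_{-tau}(theta) of unit seminorm.

    g(tau) = ||psi_{-tau}(theta)||_Lambda^2 - 1 is strictly decreasing in tau,
    so bisection (an assumed, illustrative solver) finds its unique root.
    """
    if lambda_seminorm_sq(theta, lam) <= 0:
        raise ValueError("the Lambda-seminorm of theta must be positive")

    def g(tau):
        return lambda_seminorm_sq(psi(-tau, theta, lam), lam) - 1.0

    lo, hi = -1.0, 1.0
    while g(lo) < 0:      # expand left until g(lo) >= 0
        lo *= 2
    while g(hi) > 0:      # expand right until g(hi) <= 0
        hi *= 2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return tau, psi(-tau, theta, lam)

# Homogeneous case (Lambda = I): tau should equal log(||theta||).
tau_hom, theta_hat_hom = lambda_normalize([3.0, 4.0], [1.0, 1.0])
# Quasi-homogeneous case with rates (1, 1/2), as in Fig. 2(b).
tau_q, theta_hat_q = lambda_normalize([2.0, 3.0], [1.0, 0.5])
```

In the homogeneous case the sketch reduces to ordinary normalization by the Euclidean norm; in the quasi-homogeneous case each coordinate is shrunk at its own rate until the point lands on the unit ellipsoid.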

4. QUASI-HOMOGENEOUS MAXIMUM-MARGIN BIAS

Having defined the class of quasi-homogeneous models and identified a natural coordinate system to explore their gradient dynamics, we now generalize the maximum-margin bias theory developed in Lyu & Li (2019) for homogeneous models to a general quasi-homogeneous model $f(x; \theta)$. Following the analysis strategy of Lyu & Li (2019), we make the following assumptions:

• A1 (Quasi-Homogeneous). There exists a non-zero diagonal positive semi-definite matrix $\Lambda$ such that the model $f(x; \theta)$ is $\Lambda$-quasi-homogeneous.
• A2 (Regularity). For any fixed $x$, $f(x; \theta)$ is locally Lipschitz and admits a chain rule.
• A3 (Exponential Loss). $L(\theta) = \frac{1}{n} \sum_i \ell_i$ where $\ell_i = e^{-y_i f(x_i; \theta)}$.
• A4 (Gradient Flow). Learning dynamics are governed by $\frac{d\theta}{dt} \in -\partial^\circ_\theta L$ for all $t > 0$, where $\partial^\circ_\theta$ denotes Clarke's subdifferential.
• A5 (Strong Separability). There exists a time $t_0$ such that $L(\theta(t_0)) < n^{-1}$.

We also make two additional assumptions not presented in Lyu & Li (2019), A6 and A7. A6 implies the convergence of the decision boundary and A7 (conditional separability) implies that the $\lambda_{\max}$ parameters play a role in the classification task. A6 is necessary for a technical reason, but we expect that this assumption can be weakened by exploiting the argument in Ji & Telgarsky (2020). A7 is trivially true for a homogeneous model, where $\|\hat\theta\|_{\Lambda_{\max}} = \|\hat\theta\| = 1$, but not for a quasi-homogeneous model. In section 5 we will consider what happens when we remove this assumption. We now state our main theoretical result:

Theorem 4.1 (Quasi-Homogeneous Maximum-Margin). Under assumptions A1 to A7, there exists an $\alpha \in \mathbb{R}$ such that $\psi_\alpha(\lim_{t \to \infty} \hat\theta(t))$ is a first-order KKT point of the optimization problem:

$$\text{minimize } \tfrac{1}{2} \|\theta\|_{\Lambda_{\max}}^2 \quad \text{subject to } y_i f(x_i; \theta) \ge 1 \;\; \forall i \in [n]. \qquad (P)$$

Significance. Theorem 4.1 implies that after interpolating the training data, the learning dynamics of the model are driven by a competition between maximizing the margin in function space and minimizing the $\Lambda_{\max}$-seminorm in parameter space. 
At first glance, this might seem like a straightforward generalization of the result discussed in Lyu & Li (2019) for homogeneous networks, but crucially, whenever the model is quasi-homogeneous rather than homogeneous, which is the case for nearly all realistic networks, the optimization problems are different, as $\|\theta\|_{\Lambda_{\max}} \neq \|\theta\|$. In the quasi-homogeneous setting, the $\Lambda_{\max}$-seminorm will only depend on a subset of the parameters, and potentially an unexpected subset, such as just the last-layer bias parameters for a standard fully-connected network. In sections 5 and 6 we will further discuss the implications of this result.

Intuition. The heart of the argument proving Theorem 4.1 relies on showing that once all the assumptions are satisfied, as $t \to \infty$ the $\Lambda$-seminorm diverges, $\|\theta\|_\Lambda \to \infty$, and the angle $\omega$ converges, $\omega \to 0$. The convergence of $\omega$ implies that the training trajectory converges to a certain characteristic curve and the divergence of $\|\theta\|_\Lambda$ implies that the trajectory diverges along this curve away from the origin. In the homogeneous setting the characteristic curves are rays, implying that as $t \to \infty$ the velocity $\frac{d\theta}{dt}$ aligns in direction with $\theta$. This alignment of the velocity with $\theta = \nabla \tfrac{1}{2} \|\theta\|^2$ is the key property allowing previous works to derive $\tfrac{1}{2} \|\theta\|^2$ as the objective function of the implicit optimization problem. However, in the quasi-homogeneous setting, the directions of the characteristic curves are eventually dominated by the $\lambda_{\max}$ parameters, which is what gives rise to the asymmetric objective function $\tfrac{1}{2} \|\theta\|_{\Lambda_{\max}}^2$ in our work.

Proof sketch. We defer most of the technical details of the proof of Theorem 4.1 to App. E, but state the central lemma and the overall logical structure below. As in Lyu & Li (2019), the key mathematical object of our analysis is a normalized margin. The margin, defined as $q_{\min}(\theta) := \min_i y_i f(x_i; \theta)$, is non-differentiable and unbounded, making it difficult to study. 
Thus, we define the normalized margin,

$$\gamma(\theta) := \frac{q_{\min}(\theta)}{\|\theta\|_\Lambda^{\lambda_{\max}^{-1}}},$$

and the smooth normalized margin,

$$\tilde\gamma(\theta) := \frac{\log\left((nL)^{-1}\right)}{\|\theta\|_\Lambda^{\lambda_{\max}^{-1}}},$$

which is a smooth approximation of $\gamma$. We then prove the following key lemma lower bounding changes in the $\Lambda$-seminorm $\|\theta\|_\Lambda$ and the smooth normalized margin $\tilde\gamma$. This lemma holds throughout training, even before separability is achieved, and we believe it could be of independent interest for understanding the learning dynamics.

Lemma 4.1 (Dynamics of $\|\theta\|_\Lambda$ and $\tilde\gamma$). Under assumptions A1, A2, A3, and A4, the dynamics of the $\Lambda$-seminorm and smooth normalized margin are governed by the following inequalities,

$$\frac{1}{2} \frac{d}{dt} \|\theta\|_\Lambda^2 \ge L \log\left((nL)^{-1}\right), \qquad \frac{d}{dt} \log(\tilde\gamma) \ge \lambda_{\max}^{-1} \frac{d}{dt} \log(\|\theta\|_\Lambda) \tan(\omega)^2,$$

for all $t > 0$ for the first inequality, and for almost every $t > 0$ for the second inequality.

Notice that once the separability assumption is met, the lower bound on the time-derivative of $\|\theta\|_\Lambda^2$ is strictly positive. This allows us to conclude that the $\Lambda$-seminorm diverges and the loss converges, $L \to 0$ (Lemma E.2). We then seek to prove the directional convergence of the parameters to the tangent vector $\Lambda \theta$ generating the characteristic curves. We first prove that $\tilde\gamma$ is upper bounded using the definition of the margin and A7 (Lemma E.3). Combining this upper bound with the monotonicity of $\tilde\gamma$ proved in Lemma 4.1, we can conclude by a monotone convergence argument that $\tilde\gamma$ will converge. Taken together, the convergence of $\tilde\gamma$ and the divergence of $\|\theta\|_\Lambda^2$ imply that the angle $\omega \to 0$ along a specific sequence of times (Lemma E.4). Finally, we use the divergence $\|\theta\|_\Lambda \to \infty$ and the convergence $\omega \to 0$ to prove there exists a scaling of the normalized parameters that converges to a first-order KKT point of the optimization problem (P) in Theorem 4.1.

Non-uniqueness of Λ. For a quasi-homogeneous function $f$, the value of $\Lambda$, and hence the $\lambda_{\max}$ parameter set, is not necessarily unique, and therefore one may think Theorem 4.1 looks inconsistent. 
However, the conditional separability assumption (A7), which is required to apply Theorem 4.1, removes this possibility. See App. B for a discussion on how to determine the highest-rate $\lambda_{\max}$ parameter set.

5. QUASI-HOMOGENEOUS MAXIMUM-MARGIN CAN DEGRADE ROBUSTNESS

In section 4 we showed how gradient flow on a quasi-homogeneous model will implicitly minimize the norm of only the highest-rate parameters. To explore the implications that this bias has on function space, we will consider a simple problem where analytic solutions exist. We will analyze the binary classification task of learning a linear classifier $w$ that separates two balls in $\mathbb{R}^d$. Consider a dataset that forms two disjoint dense balls $B(\pm\mu, r)$ with centers at $\pm\mu \in \mathbb{R}^d$ and radii $r \in \mathbb{R}_+$. The label $y_i$ of a data point $x_i$ is determined by which ball it belongs to, such that $y_i = 1$ if $x_i \in B(\mu, r)$ and $y_i = -1$ if $x_i \in B(-\mu, r)$. We assume $\|\mu\| = 1$ and $r < 1$ to ensure linear separability. We measure the quality of a classifier by its robustness $l(w)$, the minimum Euclidean distance between the decision boundary $\{x \in \mathbb{R}^d : \langle w, x \rangle = 0\}$ and the balls $B(\pm\mu, r)$. See Fig. 3 for a depiction of the problem setup.

[Figure 3: the problem setup, showing the balls $B(+\mu, r)$ and $B(-\mu, r)$, the classifier $w$, its robustness $l(w)$, and the decision boundary $\{x : \langle w, x \rangle = 0\}$.]

We will consider two parameterizations of a linear classifier, one that is homogeneous, $f_{\text{hom}}(x; \theta) = \sum_i \theta_i x_i$, and one that is quasi-homogeneous, $f_{\text{quasi-hom}}(x; \theta) = \sum_i \left( \prod_{j=1}^{D_i} \theta_{ij} \right) x_i$, where $D_i = 1$ for the first $m$ coordinates and $D_i > 1$ for the last $(d - m)$ coordinates. For the quasi-homogeneous model, the parameters associated with the first $m$ coordinates are the $\lambda_{\max}$ parameters. Let $P \in \mathbb{R}^{d \times d}$ be the projection matrix onto the subspace spanned by the first $m$ coordinates, $P_\perp = I - P$ be the projection onto the last $(d - m)$ coordinates, and $\rho_\mu := \|P_\perp \mu\|$ be the norm of $\mu$ projected onto this latter subspace. As long as the radius $r > \rho_\mu$, the conditional separability assumption of Theorem 4.1 is satisfied. Applying Theorem 4.1, we can conclude that for appropriate initializations, $f_{\text{hom}}$ and $f_{\text{quasi-hom}}$ converge to the linear classifiers defined by the following optimization problems respectively,

$$\min_{w \in \mathbb{R}^d} \|w\| \quad \text{s.t. } y(x) \langle w, x \rangle \ge 1 \;\; \forall x \in B(\pm\mu, r), \qquad (3)$$

$$\min_{w \in \mathbb{R}^d} \|P w\| \quad \text{s.t. } y(x) \langle w, x \rangle \ge 1 \;\; \forall x \in B(\pm\mu, r). \qquad (4)$$

Each of these two optimization problems is convex and has a unique minimizer, for which we can derive exact expressions by considering the subspace spanned by the vectors $P\mu$ and $P_\perp \mu$.

Lemma 5.1. If separability ($r < 1$) and conditional separability ($r > \rho_\mu$) hold, then Eq. 3 and Eq. 4 have unique minimizers, $w_{\text{hom}}$ and $w_{\text{quasi-hom}}$ respectively, which satisfy

$$w_{\text{hom}} \propto \mu, \qquad w_{\text{quasi-hom}} \propto \sqrt{\frac{1 - r^{-2} \rho_\mu^2}{1 - \rho_\mu^2}} \, P\mu + r^{-1} P_\perp \mu,$$

such that the robustness of these optimal classifiers is

$$l(w_{\text{hom}}) = 1 - r, \qquad l(w_{\text{quasi-hom}}) = \sqrt{1 - r^{-2} \rho_\mu^2} \left( \sqrt{1 - \rho_\mu^2} - \sqrt{r^2 - \rho_\mu^2} \right).$$

From these expressions it is easy to confirm that $l(w_{\text{quasi-hom}}) \le l(w_{\text{hom}})$ for all $\rho_\mu < r < 1$. For a fixed $\rho_\mu$, the gap in robustness between the homogeneous and quasi-homogeneous models increases as $r \downarrow \rho_\mu$, where $l(w_{\text{quasi-hom}}) \to 0$. These expressions demonstrate that the quasi-homogeneous maximum-margin bias can lead to a solution with vanishing robustness in function space.

To confirm this conclusion, we train $f_{\text{hom}}$ and $f_{\text{quasi-hom}}$ with gradient flow and keep track of the classifier $w$ and robustness $l(w)$ for the two models, while sweeping the radius from $\rho_\mu$ to 1. As shown in Fig. 4, we see a sharp drop in the highest-rate parameters ($w_1$) and the robustness of the quasi-homogeneous model as $r \downarrow \rho_\mu$ ($= 0.5$), while for the homogeneous model, the parameters are stable and the robustness is linear in $r$, as expected from Lemma 5.1.

So far we have restricted our analysis to the setting $r > \rho_\mu$, such that we can be certain the conditional separability assumption is met. But what happens to the performance of the quasi-homogeneous model below this threshold, $r \le \rho_\mu$? As shown in Fig. 4, it appears that the model learns to discard the highest-rate parameters once they are unnecessary, and the maximum-margin bias continues on the resulting sub-model. In Fig. 4, when $r \le 0.5$, the second highest-rate parameters ($w_2$) of the quasi-homogeneous model begin to collapse and the robustness curve repeats another swell, eventually collapsing again when $r = 0.25$. Based on this, we conjecture a stronger version of Theorem 4.1 without the conditional separability assumption. This conjecture is very similar to an informal conjecture discussed in Nacson et al. (2019a) for ensembles of homogeneous models.

Conjecture 5.1 (Cascading Minimization). Under assumptions A1 to A6, there exists a $\lambda \in \mathbb{R}_+$ and an $\alpha \in \mathbb{R}$ such that $\psi_\alpha(\lim_{t \to \infty} \hat\theta(t))$ is a first-order KKT point of the optimization problem:

$$\text{minimize } \tfrac{1}{2} \|\theta\|_{\Lambda I_\lambda}^2 \quad \text{subject to } y_i f(x_i; \theta) \ge 1 \;\; \forall i \in [n], \quad \theta_l = 0 \;\; \forall \lambda_l > \lambda,$$

where $I_\lambda$ is a diagonal matrix whose entry $(I_\lambda)_{ii}$ is 1 if $\lambda_i = \lambda$ and 0 otherwise.

As shown in Fig. 4, we find evidence of a cascading minimization of the first and then second highest-rate parameters as the radius drops below the respective thresholds that make these parameters necessary.
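The closed-form robustness expressions of Lemma 5.1 can be sanity-checked numerically. The sketch below is illustrative only: it assumes the square-root form of the expressions (with $\rho_\mu < r < 1$), writes a classifier $w$ in the orthonormal basis $(P\mu / \|P\mu\|,\; P_\perp\mu / \rho_\mu)$, and uses hypothetical function names that do not come from the paper. It does not reproduce the gradient-flow experiment of Fig. 4.

```python
import math

def robustness_hom(r):
    """Robustness of the homogeneous max-margin classifier: 1 - r."""
    return 1.0 - r

def robustness_quasi_hom(r, rho):
    """Closed-form robustness of the quasi-homogeneous solution (rho < r < 1)."""
    if not (rho < r < 1.0):
        raise ValueError("requires rho < r < 1")
    return math.sqrt(1.0 - (rho / r) ** 2) * (
        math.sqrt(1.0 - rho ** 2) - math.sqrt(r ** 2 - rho ** 2)
    )

def robustness_of_direction(a, b, r, rho):
    """Distance from the ball B(mu, r) to the hyperplane <w, x> = 0.

    w has coordinates (a, b) in the orthonormal basis
    (P mu / ||P mu||, P_perp mu / rho), so that
    <w, mu> = a * sqrt(1 - rho^2) + b * rho and ||w|| = sqrt(a^2 + b^2).
    """
    return (a * math.sqrt(1.0 - rho ** 2) + b * rho) / math.hypot(a, b) - r

def quasi_hom_direction(r, rho):
    """Coordinates (a, b) of w_quasi-hom from Lemma 5.1 in the basis above."""
    return math.sqrt(1.0 - (rho / r) ** 2), rho / r

rho = 0.5
radii = [0.51, 0.6, 0.75, 0.9, 0.99]
# Largest mismatch between the closed form and the direct geometric computation.
max_gap = max(
    abs(robustness_of_direction(*quasi_hom_direction(r, rho), r, rho)
        - robustness_quasi_hom(r, rho))
    for r in radii
)
```

Sweeping `r` toward `rho` in `robustness_quasi_hom` shows the robustness shrinking to zero, while `robustness_hom` stays linear in `r`.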

6. A MECHANISM BEHIND NEURAL COLLAPSE

In this section, we move away from linear models and consider the implications the quasi-homogeneous maximum-margin bias has in the setting of highly-expressive neural networks used in practice. We identify that for sufficiently expressive neural networks with normalization layers, the asymmetric norm minimization drives the network to Neural Collapse, an intriguing empirical phenomenon of the last-layer parameters and features recently reported by Papyan et al. (2020). In their paper, they demonstrate that the following four properties can be universally observed in the learning trajectories of deep neural networks once the training error converges to zero: (1) The last-hidden-layer feature vector converges to a single point for all the training data with the same class label. (2) The convex hull of the convergent feature vectors forms a regular $(C - 1)$-simplex centered at the origin, where $C$ is the number of possible class labels. (3) The last-layer weight vector for each class label converges to the corresponding feature vector up to re-scaling. (4) For a new input, the neural network classifies it as the class whose convergent feature vector is closest to the feature vector of the given input. A considerable amount of effort has been made to theoretically understand this mysterious phenomenon. Han et al. ( 2021 2021), showed how gradient dynamics on the space of the last-hidden-layer feature vectors and last-layer weights, without any explicit regularization, would lead to Neural Collapse as a result of the implicit maximum-margin bias. However, the real gradient dynamics of neural networks happen in the space of all parameters of the model, and hence it is not clear how an implicit bias that leads the model to Neural Collapse can be induced by the parameter gradient dynamics. 
In this section, we show that the parameter gradient dynamics of present-day neural networks can universally exhibit Neural Collapse as long as they are sufficiently expressive, apply normalization to the last hidden layer, and are trained with the cross-entropy loss. Our theoretical analysis is based on the regularization by normalization and the quasi-homogeneous maximum-margin bias. We consider a model whose last layer, with class weights $w_c$ and biases $b_c$, acts on last-hidden-layer features $h(x, \theta') \in \mathbb{R}^d$ that are normalized such that

$$\sum_{j=1}^{d} h_j(x_i, \theta') = 0, \qquad \sum_{j=1}^{d} h_j^2(x_i, \theta') = 1 \quad \forall i \in [n],$$

where $\{(x_i, y_i)\}_{i \in [n]}$ is the training data. This model is quasi-homogeneous with $\lambda = 1$ for the $w_c$ and $b_c$, and $\lambda = 0$ for the parameters in the earlier layers $\theta'$. Thanks to this quasi-homogeneity, our result for multi-class classification tasks (see App. H) reveals that the rescaled parameters converge to a first-order KKT point of the following optimization problem:

$$\min_{(w, b, \theta')} \sum_{c \in [C]} |w_c|^2 + |b|^2 \quad \text{s.t. } \min_{i \in [n]} \left( w_{y_i}^\top h(x_i, \theta') + b_{y_i} - \max_{c \neq y_i} \left( w_c^\top h(x_i, \theta') + b_c \right) \right) \ge 1. \qquad (8)$$

We further make the following assumptions on expressivity and data distribution:

• A8 (Sufficient Expressivity). For any $\{(x'_i, h'_i)\}_{i \in [n]}$ satisfying $\sum_j (h'_i)_j = 0$ and $\sum_j (h'_i)_j^2 = 1$ for all $i \in [n]$, there exists $\theta'$ satisfying $h(x'_i, \theta') = h'_i$ for every $i \in [n]$.
• A9 (Existence of All Labels). For each class $c \in [C]$, there exists at least one data point in $\{(x_i, y_i)\}_{i \in [n]}$ whose label $y_i$ is $c$.

The first assumption eliminates the possibility that no parameter configuration $\theta'$ can realize Neural Collapse. Under these assumptions, the global minimum satisfies Neural Collapse:

Theorem 6.1 (Neural Collapse, short version). Under assumptions A8, A9, and $d \ge C$, any global optimum of Eq. 8 satisfies the four properties of Neural Collapse.

Figure 5: Geometric intuition. An illustration of Eq. 9. The black circle represents the origin, the solid lines represent the class vectors $w_c$, and the dotted lines represent the distance $L_c$. Intuitively, minimizing the lengths of the solid lines while maintaining a minimum length of the dotted lines will result in a regular simplex centered at the origin.

Note that we do not exclude the possibility that Eq. 8 has saddles or local minima. Therefore, depending on the initialization, the learning dynamics may end up at one of these sub-optimal first-order KKT points, which may not show Neural Collapse. Essentially, the proof of Theorem 6.1 relies on first relaxing Eq. 8 to the optimization problem

$$\min_{w} \sum_c |w_c|^2 \quad \text{s.t. } \min_{c \in [C]} L_c \ge 1, \qquad (9)$$

where $L_c$ is the minimum distance from $w_c$ to the $(C - 2)$-simplex formed by the convex hull of $\{w_{c'}\}_{c' \in [C] \setminus \{c\}}$. In accordance with our geometric intuition, the minimizer of this optimization problem is a regular $(C - 1)$-simplex. See Fig. 5 for a visual depiction of this relaxed optimization problem and App. G for the details of the proof.
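To make the target geometry concrete, the following sketch constructs the vertices of a regular $(C - 1)$-simplex centered at the origin and computes the quantities that characterize it: unit norms, zero centroid, and pairwise cosine similarity $-1/(C - 1)$. The construction (centering and rescaling the standard basis of $\mathbb{R}^C$) and the function names are illustrative assumptions; this only describes the configuration named in property (2), and does not solve Eq. 8 or Eq. 9.

```python
import math

def regular_simplex(C):
    """Unit-norm vertices of a regular (C-1)-simplex centered at the origin.

    Built by centering the standard basis vectors of R^C and rescaling;
    this embedding in R^C is an illustrative choice.
    """
    scale = math.sqrt(C / (C - 1))
    return [
        [scale * ((1.0 if j == c else 0.0) - 1.0 / C) for j in range(C)]
        for c in range(C)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

C = 4
W = regular_simplex(C)
norms = [math.sqrt(dot(w, w)) for w in W]
# Cosine similarity between every pair of distinct vertices (norms are 1).
cosines = [dot(W[a], W[b]) for a in range(C) for b in range(a + 1, C)]
# Coordinates of the centroid of the vertices, expected to be the origin.
centroid = [sum(w[j] for w in W) / C for j in range(C)]
```

For `C = 4` the vertices form a regular tetrahedron: all pairwise cosines equal $-1/3$, the most spread-out arrangement of four equal-norm vectors.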

7. CONCLUSION

In this work, we extend and generalize a long line of literature studying the maximum-margin bias of gradient descent to quasi-homogeneous networks. We show that after reaching a point of separability, the gradient flow dynamics are driven by a competition between maximizing the margin in function space and minimizing the Λ max -seminorm in parameter space. We demonstrate, with a simple linear example, how this strong favoritism for the highest-rate parameters can degrade the robustness of quasi-homogeneous models and conjecture that this process, when possible, will reduce the model to a sparser parameterization. Additionally, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind Neural Collapse. Here we propose some future directions for this work. Discretization effect. In this work, we only considered gradient flow, but generalizing the theoretical results to (stochastic) gradient descent is an important future step. In particular, it is well understood that the discretization effect introduced by a finite learning rate has empirically measurable effects for parameters that are scale-invariant, such as those before normalization layers. While gradient flow would predict the norm of these parameters to be constant through training, gradient descent predicts that they monotonically diverge, as demonstrated by Kunin et al. (2020) . Thus, extending our results to the setting of gradient descent could reshape Theorem 4.1. Optimality of convergence points. We are only able to guarantee by Theorem 4.1 that the learning dynamics will converge to a first-order KKT point of the constrained optimization problem, but not whether this point is locally or globally optimal. Better understanding the landscape of this optimization problem and determining when stronger statements can be made is a promising direction for future work. 
2020) studied the gradient flow trajectories for diagonal linear networks and showed that there is a transition from a "kernel" regime to a "rich" regime controlled by the scale of the initialization and the final training loss level. Extending this analysis to quasi-homogeneous networks would be a valuable future direction.

Impact on performance. An important takeaway from our work is that the maximum-margin bias can actually degrade the performance of a quasi-homogeneous model. Whether the bias helps or hurts depends on the parameterization of a model and its relationship to the geometry of the data. Better understanding this interaction could be essential for diagnosing performance gaps of modern neural networks and could provide a route towards designing robust architectures.

Here we provide more details on Λ-normalization as discussed in section 3. The Λ-normalized parameters $\hat{\theta}$ are given by
$$\hat{\theta} = \left(e^{-\tau\lambda_1}\theta_1, \dots, e^{-\tau\lambda_m}\theta_m\right) \quad \text{s.t.} \quad \|\hat{\theta}\|_\Lambda^2 = 1.$$
The value of $\tau$ is implicitly defined through the constraint $\|\hat{\theta}\|_\Lambda^2 = 1$. Only in a select number of cases does an explicit expression for $\tau$ exist. For example, in the homogeneous setting when $\Lambda = I$, $\tau = \log(\|\theta\|)$ and $\hat{\theta} = \theta / \|\theta\|$, as would be expected.

Lemma A.1. For all $\theta \in \mathbb{R}^m$ such that $\|\theta\|_\Lambda > 0$, the Λ-normalized parameters $\hat{\theta}$ are unique.

Proof. Proving uniqueness of $\hat{\theta}$ is equivalent to proving uniqueness of $\tau$. For a given $\theta$ and $\Lambda$, $\tau = \log(1/\sqrt{z})$ where $z$ is the positive root of the polynomial
$$\sum_i \lambda_i \theta_i^2 z^{\lambda_i} - 1 = 0.$$
The coefficients $\lambda_i \theta_i^2 \ge 0$ are non-negative, and because $\|\theta\|_\Lambda > 0$, we know there exists at least one positive coefficient $\lambda_i \theta_i^2 > 0$. Thus, there is exactly one sign change in the coefficients of this polynomial, which by Descartes' rule of signs implies the polynomial has exactly one positive root, and thus $\tau$ is unique.

Locally Lipschitz Quasi-homogeneous models. 
To apply our analysis and Theorem 4.1 to many deep neural network settings including those with non-smooth ReLU activations, we here consider quasi-homogeneous functions with local Lipschitz property. For such functions f (θ) :  R d → R, Clarke's subdifferential ∂ • θ is defined as follows Clarke et al. (2008). Definition A.1 (Clarke's subdifferential). ∂ • θ f (θ) := conv lim k→∞ ∇ θ f (θ k ) : lim k→∞ θ k = θ, f is differentiable at θ k . ∂ • θ f (ψ α (θ)) = e α(I-Λ) h : h ∈ ∂ • θ f (θ) for any α > 0 and θ ∈ R d . Proof. For any sequence {θ k } k∈N on which f is differentiable and converging to θ, {ψ α (θ k )} k∈N converges to ψ α (θ) and f is differentiable on this new sequence, whose derivative is given by ∇ θ f (ψ α (θ k )) = e α ∂(e -α f (θ)) ∂θ ψα(θ k ) = e α ∂f (e -αΛ θ) ∂θ ψα(θ k ) = e α(I-Λ) ∂f (e -αΛ θ) ∂e -αΛ θ ψα(θ k ) , where the last expression is equivalent to e α(I-Λ) ∇ θ f (θ k ). Conversely, for any sequence {ψ α (θ k )} k∈N on which f is differentiable and converging to ψ α (θ), {θ k } k∈N converges to θ and f is differentiable on it as well, with the above scaling property. Hence, lim k→∞ ∇ θ f (θ k ) : lim k→∞ θ k = ψ α (θ), f is differentiable at θ k = e α(I-Λ) lim k→∞ ∇ θ f (θ k ) : lim k→∞ θ k = θ, f is differentiable at θ k . Thus taking the convex hulls of both sets, and by the commutativity between conv and the linear operation e α(I-Λ) , we conclude that Eq.11 holds. By further assuming that f (θ) admits a chain rule (See Davis et al. (2020)  ∈ R d , ⟨h, Λθ⟩ = f (θ) for any h ∈ ∂ • θ f (θ). Proof. Since f admits a chain rule, there exists α > 0 such that f (e αΛ θ) is differentiable with respect to α and d dα f (e αΛ θ) = g, de αΛ θ dα = g, e αΛ Λθ , for any g ∈ ∂ • θ f (e αΛ θ). Therefore f (θ) = e -α d dα (e α f (θ)) = e -α d dα f (e αΛ θ) = e -α(I-Λ) g, Λθ . By Lemma A.2, for any h ∈ ∂ • θ f (θ), we can find g ∈ ∂ • θ f (e αΛ θ), such that ⟨h, Λθ⟩ = e -α(I-Λ) g, Λθ = f (θ). 
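Returning briefly to the Λ-normalization from the start of this appendix (Lemma A.1): since $\tau$ rarely has a closed form, it can be found numerically. The sketch below is one possible approach, assuming a diagonal $\Lambda$ stored as a vector `lam`; the function name `lambda_normalize` and the use of bisection are our own choices rather than anything prescribed by the paper. It relies on the fact that $\sum_i \lambda_i e^{-2\tau\lambda_i}\theta_i^2$ is strictly decreasing in $\tau$ whenever $\|\theta\|_\Lambda > 0$, which is another way to see the uniqueness claimed in Lemma A.1.

```python
import numpy as np

def lambda_normalize(theta, lam, tol=1e-12, max_iter=200):
    """Find tau with sum_i lam_i * exp(-2*tau*lam_i) * theta_i**2 == 1 by
    bisection and return (tau, theta_hat). Hypothetical helper for illustration."""
    theta = np.asarray(theta, dtype=float)
    lam = np.asarray(lam, dtype=float)

    def sq_norm(tau):
        # Squared Lambda-seminorm of the rescaled parameters exp(-tau * lam) * theta.
        return np.sum(lam * np.exp(-2.0 * tau * lam) * theta ** 2)

    if sq_norm(0.0) <= 0.0:
        raise ValueError("Lambda-normalization requires ||theta||_Lambda > 0")

    # sq_norm is strictly decreasing in tau, so expand a bracket around the root.
    lo, hi = -1.0, 1.0
    while sq_norm(lo) < 1.0:
        lo *= 2.0
    while sq_norm(hi) > 1.0:
        hi *= 2.0

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if sq_norm(mid) > 1.0:
            lo = mid  # norm still too large: tau must increase
        else:
            hi = mid
        if hi - lo < tol:
            break

    tau = 0.5 * (lo + hi)
    return tau, np.exp(-tau * lam) * theta

# Homogeneous sanity check (Lambda = I): tau = log||theta||, theta_hat = theta/||theta||.
tau_h, theta_hat_h = lambda_normalize([3.0, 4.0], [1.0, 1.0])

# A mixed-rate example with an illustrative choice of lam.
lam_q = np.array([0.5, 0.5, 0.5, 1.0])
tau_q, theta_hat_q = lambda_normalize([1.0, -2.0, 0.5, 3.0], lam_q)
```

In the homogeneous check, `tau_h` should agree with $\log 5$ and `theta_hat_h` with $(0.6, 0.8)$, matching the closed form stated above for $\Lambda = I$.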
The properties above immediately imply that any feasible point of optimization problem (P) satisfies the Mangasarian-Fromovitz constraint qualification (MFCQ), which implies that the first-order KKT condition is necessary for optimality.

Lemma A.4. Any feasible point $\theta$ of optimization problem (P) satisfies the Mangasarian-Fromovitz constraint qualification (MFCQ), i.e., there exists $v \in \mathbb{R}^d$ such that for all $i \in [n]$ with $y_i f(x_i, \theta) = 1$, $\langle v, h \rangle > 0$ for any $h \in \partial^\circ_\theta (y_i f(x_i, \theta) - 1)$.

Proof. Notice that for any $h \in \partial^\circ_\theta (y_i f(x_i, \theta) - 1)$, there exists $h' \in \partial^\circ_\theta f(x_i, \theta)$ such that $\langle \Lambda\theta, h \rangle = y_i \langle \Lambda\theta, h' \rangle$. Hence, choosing $v = \Lambda\theta$, by Lemma A.3,
$$\langle \Lambda\theta, h \rangle = y_i \langle \Lambda\theta, h' \rangle = y_i f(x_i, \theta) = 1 > 0.$$

Ensembles of homogeneous models. Nacson et al. (2019a) considered the maximum-margin bias of gradient descent for non-homogeneous models that can be expressed as finite sums of positive-homogeneous models of different orders. In particular, for some $K \in \mathbb{N}$, they consider functions $f(x; \theta)$ that can be expressed as
$$f(x; \theta) = \sum_{k=1}^{K} f^{(k)}(x; \theta_k),$$
where $\theta = [\theta_1, \dots, \theta_K]$ and $f^{(k)}(x; \theta_k)$ is $\alpha_k$-positive homogeneous such that $0 < \alpha_1 < \cdots < \alpha_K$. While this class of models is not homogeneous because of the varying orders of the sub-models, it is quasi-homogeneous. If we choose $\Lambda$ such that $\lambda = \alpha_k^{-1}$ for all parameters in $\theta_k$, then $f(x; \theta)$ is Λ-quasi-homogeneous. Therefore, the theoretical results discussed in this work should align with the results discussed in Nacson et al. (2019b) for the setting of ensembles of positive-homogeneous models. Indeed, Theorem 4.1 and Conjecture 5.1 agree with the analysis stated in their work that "an ensemble on neural networks will aim to discard the shallowest network in the ensemble", which is the sub-model with the highest-rate parameters. While all ensembles of positive-homogeneous models are quasi-homogeneous, not all quasi-homogeneous models are ensembles. 
Here we provide a short list of quasi-homogeneous models that cannot be written in the form of Eq. 13.

Deep fully connected network with biases. Consider again the two-layer fully connected network with biases discussed in section 3,
$$f(x; \theta) = w^2 \sigma\Big(\sum_i w^1_i x_i + b^1\Big) + b^2.$$
If we arrange terms such that $f^{(1)}(x; b^2) = b^2$ and $f^{(2)}(x; w^1, b^1, w^2) = w^2 \sigma\big(\sum_i w^1_i x_i + b^1\big)$, then we can express $f(x; \theta) = f^{(1)}(x; b^2) + f^{(2)}(x; w^1, b^1, w^2)$, which is an ensemble of two positive-homogeneous models with $\alpha_1 = 1$ and $\alpha_2 = 2$. However, notice that if we consider a third layer with parameters $w^3$ and $b^3$, then this decoupling of the network is not possible unless some of the sub-models share parameters, preventing us from expressing $f(x; \theta)$ in the form of Eq. 13. All fully connected networks with biases and a depth greater than two are quasi-homogeneous models, but not ensembles of positive-homogeneous models.

Networks with degenerate Λ. As presented earlier, for quasi-homogeneous networks with residual connections or normalization layers, we can choose $\Lambda$ to have zero values. Thus, even if these networks could be decoupled into a sum of sub-models that don't share parameters, the sub-models associated with the zero-$\lambda$ parameters would not be positive-homogeneous.

In summary, the results presented in this work coincide with the results presented in Nacson et al. (2019a) for ensembles of positive-homogeneous models, but also apply to a far more general class of non-homogeneous models.
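The two-layer network with biases above gives a convenient numerical check of quasi-homogeneity and of the Euler-type identity $\langle h, \Lambda\theta \rangle = f(\theta)$ from Lemma A.3. The sketch below assumes a ReLU activation, a scalar output, and the rates $\lambda = 1/2$ for $(w^1, b^1, w^2)$ and $\lambda = 1$ for $b^2$, which follow from $\lambda = \alpha_k^{-1}$ with $\alpha_1 = 1$, $\alpha_2 = 2$. The parameter values, the input `x`, and the helper names are illustrative assumptions, and the gradient is approximated by central differences at a point away from the ReLU kink.

```python
import numpy as np

def two_layer(theta, x):
    """f(x; theta) = w2 * relu(w1 . x + b1) + b2 with theta = [w1 (d entries), b1, w2, b2]."""
    d = x.size
    w1, b1, w2, b2 = theta[:d], theta[d], theta[d + 1], theta[d + 2]
    return w2 * max(w1 @ x + b1, 0.0) + b2

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference gradient (illustrative stand-in for an element of the subdifferential)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

x = np.array([1.0, -2.0, 0.5])                        # illustrative input
theta = np.array([0.3, -0.4, 1.2, 0.2, -0.7, 0.5])     # [w1, b1, w2, b2]; pre-activation is 1.9 > 0
lam = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 1.0])         # rates: 1/2 for w1, b1, w2 and 1 for b2
f = lambda th: two_layer(th, x)

# Quasi-homogeneity: f(exp(alpha * Lambda) theta) = exp(alpha) * f(theta).
scaling_ok = all(
    np.isclose(f(np.exp(a * lam) * theta), np.exp(a) * f(theta))
    for a in (-1.0, 0.3, 2.0)
)

# Euler-type identity of Lemma A.3: <grad f, Lambda theta> = f(theta).
euler_gap = numerical_gradient(f, theta) @ (lam * theta) - f(theta)

# Ordinary homogeneity fails: f(2 theta) is neither 2 f(theta) nor 4 f(theta).
homogeneous_ok = any(np.isclose(f(2 * theta), 2 ** k * f(theta)) for k in (1, 2))

print(scaling_ok, abs(euler_gap) < 1e-6, homogeneous_ok)
```

The last check illustrates why the bias $b^2$ prevents the network from being homogeneous of any single order, even though it is quasi-homogeneous with the asymmetric rates above.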

B THE CONSISTENCY OF THEOREM 4.1 AND THE PROPER CHOICE OF Λ

As briefly discussed in section 4, for a quasi-homogeneous function $f$, the value of $\Lambda$, and hence the $\lambda_{\max}$ parameter set, is not necessarily unique, and therefore one may think Theorem 4.1 is inconsistent. However, the conditional separability assumption (A7), which is required to apply Theorem 4.1, removes this possibility. Here we provide some illustrative examples and then provide a complete proof.

Examples of quasi-homogeneous models with non-unique Λ. We here clarify through examples how our theorem remains consistent in cases where the model $f(\theta)$ is quasi-homogeneous with multiple choices of $\Lambda$ due to additional symmetry.

A linear model with two parameters. We consider the following model, $f(x; \theta_1, \theta_2) = \theta_1 \theta_2^2 x$. This is quasi-homogeneous with $(\lambda_1, \lambda_2) = (1 - 2\xi, \xi)$ for any $\xi \in [0, 1/2]$. There are three possible sets of parameters with largest $\lambda$ value:
• If $\xi < 1/3$, $\theta_1$ has the largest $\lambda$ value, since $\lambda_1 = 1 - 2\xi > \xi = \lambda_2$.
• If $\xi > 1/3$, $\theta_2$ has the largest $\lambda$ value.
• If $\xi = 1/3$, $\theta_1$ and $\theta_2$ have the same $\lambda$ value.
If one naively applies the theorem to these cases, one might think that the learning process converges to a separable solution minimizing $\theta_1^2$ for the first case, $\theta_2^2$ for the second case, and $\theta_1^2 + \theta_2^2$ for the last case, which is inconsistent. However, the first two cases do not satisfy the conditional separability assumption. This is because we can make $|\theta_1|$ or $|\theta_2|$, respectively, as small as we like while fixing the function itself. Therefore the correct choice of $\lambda$ should be $(\lambda_1, \lambda_2) = (1/3, 1/3)$.

Two-layer quadratic activation with biases. For the sake of simplicity, we assume that all the layer widths are one, i.e., the model is given by four scalar parameters as follows: $f(x; \theta) = \theta_3 (\theta_1 x + \theta_2)^2 + \theta_4$. We can easily generalize our argument to wider networks. This model is quasi-homogeneous with the following choices of $\lambda$: $(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = (\xi, \xi, 1 - 2\xi, 1)$ for any $\xi \in [0, 1/2]$. Again, there are three possibilities. 
• If $\xi = 0$, $\theta_3$ and $\theta_4$ have the largest $\lambda$ value.
• If $\xi \in (0, 1/2)$, $\theta_4$ has the largest $\lambda$ value.
• If $\xi = 1/2$, $\theta_4$ again has the largest $\lambda$ value, since $\lambda_4 = 1 > 1/2 = \lambda_1 = \lambda_2$ and $\lambda_3 = 0$.
All of the cases can satisfy the conditional separability condition. For the first case, our theorem tells us that the gradient dynamics minimize $\theta_3^2 + \theta_4^2$. However, by making $\theta_1$ and $\theta_2$ large, we can make $\theta_3$ arbitrarily small without changing the output, and hence it is equivalent to minimizing $\theta_4^2$ alone. In the second and third cases, the theorem gives $\theta_4^2$ as the objective directly. Thus, for all three cases $\theta_4^2$ is the objective function for the minimization.

A neural network with normalization. We consider the following model,
$$f(x; \theta) = \big[w_c^\top F_{\text{norm}}(h(\theta'; x))\big]_{c \in [C]} + b,$$
where $w_c \in \mathbb{R}^d$ and $b \in \mathbb{R}^C$ are the weights and bias of the last layer, $\theta'$ is the set of parameters in the earlier layers, and $h(\theta'; x) \in \mathbb{R}^d$ is the feature vector on the last hidden layer, which we assume is homogeneous, i.e., $e^{\alpha} h(\theta'; x) = h(e^{\alpha\lambda'}\theta'; x)$ for any $\alpha \in \mathbb{R}$ with a certain $\lambda' > 0$. $F_{\text{norm}}(\cdot)$ is a normalizer of the feature vector $h(\theta'; x)$, so that the normalized feature vector is invariant under the scaling transformation of $\theta'$, i.e., $F_{\text{norm}}(h(\theta'; x)) = F_{\text{norm}}(h(e^{\alpha\lambda'}\theta'; x))$ for any $\alpha \in \mathbb{R}$. In this setting, possible choices of $\lambda$ values are 1 for the last-layer parameters and $\xi$ for each parameter $\theta'_i$, where $\xi$ is an arbitrary non-negative number. Thus there are at least the following three possible sets of parameters with largest $\lambda$:
• If $\xi > 1$, the parameters $\theta'$ in the earlier layers have the largest $\lambda$.
• If $\xi = 1$, all the parameters have the same value of $\lambda$.
• If $\xi < 1$, the last-layer weights and biases have the largest $\lambda$.
In the first case, by the scale invariance of $F_{\text{norm}}(h(\theta'; x))$, we can make $\|\theta'\|$ as small as we like while not changing $f(x; \theta)$, which implies that it does not satisfy the conditional separability condition. 
On the other hand, in the second case, it satisfies the conditional separability, since there need to be non-zero last-layer weights or bias to correctly classify data points. By applying the theorem, we can conclude that the learning process converges to a minimizer of c ∥w c ∥ 2 + ∥b∥ 2 + ∥θ ′ ∥ 2 . However, we can minimize ∥θ ′ ∥ as much as we want, while fixing f (θ; x), and hence it is equivalent to minimizing c ∥w c ∥ 2 + ∥b∥ 2 . In the third case, we can apply the theorem as well, which means that the learning process converges to a minimizer of sum of c ∥w c ∥ 2 + ∥b∥ 2 . In summary, while the choice of Λ is not necessarily unique because of intrinsic symmetries in the parameterization of the model, the set of highest-rate parameters is well defined by the constraints imposed by the conditional separability assumption. This makes Theorem 4.1 well defined. A complete proof of uniqueness of the resulting optimization problem As we discussed above going through three examples, we can identify which λ we should choose, or which λ max parameters we should choose, solely by analyzing the model, independent of the data set. In this section, we generalize the previous discussions on examples and prove that the resulting first-order KKT condition derived from Theorem 4.1 is unique, even if the model itself is quasi-homogeneous with multiple choices of scaling parameters Λ. Note that the following argument is independent of the data and properties derived here in this section is solely the properties of the architecture of the model itself. In the following argument, to simplify our proof, we assume f (s; θ) is differentiable with respect to θ, in addition to its continuity. Let S ⊂ R d be the set of {λ i } i∈ [d] with which the model satisfies the quasi-homogeneity, i.e., S := {λ ∈ R d ≥0 : f (x; θ) is λ-quasihomogeneous}. Definition B.1. 
λ ∈ S is called proper if there exists a separable data set {(x i , y i )} i∈ [n] with which the model satisfies A7 and f (x; θ) is bounded in {θ ∈ R d : ∥θ∥ Λmax = 1}. Definition B.2. Let λ 1 , λ 2 ∈ S be proper. We say they are equivalent in terms of first-order KKT conditions, if for any data set {(x i , y i )} i∈ [n] , the sets of first-order KKT points for the following two optimization problems minimize 1 2 ∥θ∥ 2 Λ k max subject to y i f (x i ; θ) ≥ 1 ∀i ∈ [n] are equivalent with k = 1, 2. We are going to prove the following theorem in this section. Theorem B.1. All proper points in S are equivalent in terms of their first-order KKT conditions. Before going to the proof of this theorem, we prove several lemmas for preparation. Lemma B.1. S is convex. Proof. Let λ 1 , λ 2 ∈ S. It suffices to show that for any α ∈ [0, 1], αλ 1 + (1 -α)λ 2 ∈ S. By the quasi-homogeneity of f (x; θ) with respect to λ 1 and λ 2 , for any β ∈ R, f (x; e β(αλ 1 +(1-α)λ 2 ) θ) = e β-αβ f (x; e αβλ 1 θ) = e β f (x; θ). Hence, f (x; θ) is αλ 1 + (1 -α)λ 2 -quasihomogeneous, i.e., αλ 1 + (1 -α)λ 2 ∈ S. Consider a non-empty line segment in S L := {y ∈ R d ≥0 : y = ζt + λ 0 with some t ∈ R} ⊂ S, where ζ, λ 0 ∈ R d . This set is connected since S is convex. Moreover, it has at least a single end point, because L ∈ R d ≥0 . Hence, without loss of generality, we can assume that λ 0 is the end point and t takes non-negative values. We define a set of indexes I ⊂ [d] by I := i ∈ [d] : λ 0 i = 0, ∃λ ∈ L, λ i > 0 . Furthermore, for λ ∈ S, we define M λ by M λ := {i ∈ [d] : λ i = max j∈[d] λ j }. Lemma B.2. I is non-empty, if L contains at least two different points. Proof. Suppose I is empty. In the following, we will show that there exists t < 0 such that y(t) = ζt + λ 0 ∈ S. Since this contradicts with the fact that λ 0 is an end point of L, we conclude I is non-empty. Since L contains at least two different points, there exists t 1 > 0 such that λ 1 := y(t 1 ) ∈ L. 
By quasi-homogeneity of the model with respect to λ 0 and λ 1 , for any α, t ∈ R, f (x; e αy(t) θ) = f (x; e α( t t 1 (λ 1 -λ 0 )+λ 0 ) θ) = f (x; e α t t 1 λ 1 e α t 1 -t t 1 λ 0 θ) = e α t t 1 e α t 1 -t t 1 f (x; θ) = e α f (x; θ). Since we assume I is empty, for any i ∈ [d] such that ζ i ̸ = 0, λ 0 i > 0. By the continuity of y(t) with respect to t, this means that there exists an open neighborhood of t = 0 where y i (t) > 0 for any i ∈ [d] such that ζ i ̸ = 0. For the other indexes, i.e. i ∈ [d] such that ζ i = 0, clearly y i (t) ≥ 0 in the open neighborhood. Therefore, in the neighborhood, y i (t) ≥ 0 for all i ∈ [d]. In particular, there exists t < 0 such that y(t) i ≥ 0 for any i ∈ [d] . Combining this fact with Eq.16, we conclude that there exists t < 0 such that y(t) ∈ S. By contradiction, I is non-empty.
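As a small worked instance of Lemmas B.1 and B.2, consider the two-parameter model $f(x; \theta_1, \theta_2) = \theta_1\theta_2^2 x$ from the examples above, for which $S = \{(1 - 2\xi, \xi) : \xi \in [0, 1/2]\}$. The sketch below tests membership in $S$ by random sampling (a numerical heuristic, not a proof), walks along the segment $y(t) = \lambda^0 + \zeta t$ with the assumed end point $\lambda^0 = (1, 0)$ and direction $\zeta = (-2, 1)$, and computes the index set $I$. The helper names `in_S` and `y` and the fixed input `x=1.7` are our own illustrative choices.

```python
import numpy as np

def f(theta, x=1.7):
    # f(x; theta_1, theta_2) = theta_1 * theta_2**2 * x, with an arbitrary fixed input x.
    return theta[0] * theta[1] ** 2 * x

def in_S(lam, trials=20, seed=0):
    """Heuristic test of lam in S: non-negative entries and
    f(exp(alpha * lam) * theta) == exp(alpha) * f(theta) on random samples."""
    lam = np.asarray(lam, dtype=float)
    if np.any(lam < 0):
        return False
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        theta = rng.normal(size=2)
        alpha = rng.normal()
        if not np.isclose(f(np.exp(alpha * lam) * theta), np.exp(alpha) * f(theta)):
            return False
    return True

lam0 = np.array([1.0, 0.0])    # assumed end point of the segment (xi = 0)
zeta = np.array([-2.0, 1.0])   # direction of the segment

def y(t):
    return lam0 + t * zeta

ts = np.linspace(0.0, 0.5, 11)

# Every point on the segment, i.e. every convex combination of its end points, lies in S.
segment_in_S = all(in_S(y(t)) for t in ts)
# Past the other end point a rate becomes negative, so the point leaves S.
beyond_endpoint = in_S(y(0.6))
# A point off the line violates the scaling identity.
off_line = in_S([0.5, 0.5])
# Index set I: coordinates that vanish at lam0 but are positive somewhere on the segment.
index_set_I = [i for i in range(2) if lam0[i] == 0.0 and any(y(t)[i] > 0 for t in ts)]

print(segment_in_S, beyond_endpoint, off_line, index_set_I)
```

Here `index_set_I` contains only the (zero-based) index of $\theta_2$, so $I$ is non-empty, consistent with Lemma B.2.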

Lemma B.3. For any proper element $\lambda^* \in L/\{\lambda^0\}$, $M_{\lambda^*} \cap I \neq \emptyset$, and hence $\max_{i \in [d]} \lambda^*_i = t^* \max_{i \in I} \zeta_i$.

Proof. Since $\lambda^*$ is proper, there exists a data set with which the model satisfies conditional separability, i.e., there exists $\kappa > 0$ such that all parameter values $\{\theta_i\}_{i \in [d]}$ which separate the data satisfy $\|\theta\|_{\Lambda^*_{\max}} > \kappa$. Let $\{\theta^*_i\}_{i \in [d]}$ be parameter values separating the data. By the continuity of the model, without loss of generality, we can assume that $\theta^*_i \neq 0$ for any $i \in [d]$.

In the following, we show $M_{\lambda^*} \cap I \neq \emptyset$ by contradiction. Suppose $M_{\lambda^*} \cap I = \emptyset$. We derive the contradiction by showing that there exists a parameter value $\theta'$ which correctly separates the data, but breaks the conditional separability condition with $\lambda^*$. This clearly contradicts the fact that $\lambda^*$ is proper.

We consider the following transformation $\theta^* \to e^{-\alpha y(0)}\theta^*$, with some $\alpha > 0$. This transformation does not change $\{\theta^*_i\}_{i \in I}$, but the other parameters are scaled down to $\theta^*_i \to e^{-\alpha\lambda^0_i}\theta^*_i$. Then $\sum_{i \in [d]/I} \lambda^*_i (e^{-\alpha\lambda^0_i}\theta^*_i)^2$ can be made arbitrarily small by taking $\alpha \to \infty$. (Notice that here we exploited the fact that if $\lambda^0_i = 0$, then $\lambda^*_i = 0$ for any $i \in [d]/I$.) On the other hand, since $\lambda^*_i > 0$ and $|\theta^*_i| > 0$ for any $i \in I$,
$$\|e^{-\alpha y(0)}\theta^*\|^2_{\Lambda^*} = \sum_{i \in [d]} \lambda^*_i \big(e^{-\alpha y(0)}\theta^*\big)_i^2 \ge \sum_{i \in I} \lambda^*_i \big(e^{-\alpha y(0)}\theta^*\big)_i^2$$
is lower bounded by a positive constant. (Here we exploit the fact that $I$ is non-empty by Lemma B.2.) Hence by further transforming the parameter $e^{-\alpha y(0)}\theta^* \to \theta' := e^{-\beta\Lambda^*} e^{-\alpha y(0)}\theta^*$, with a proper choice of $\beta > 0$, we can normalize the $\Lambda^*$-seminorm, $\|\theta'\|_{\Lambda^*} = 1$, while keeping $\sum_{i \in [d]/I} \lambda^*_i (\theta'_i)^2$ arbitrarily small. In particular, it is smaller than $\kappa$ for large enough $\alpha > 0$. By assumption, we know $[d]/I \supset \operatorname{argmax}_{i \in [d]} \lambda^*_i$, and hence
$$\|\theta'\|^2_{\Lambda^*_{\max}} \le \sum_{i \in [d]/I} \lambda^*_i (\theta'_i)^2 < \kappa.$$
By the quasi-homogeneity of the model, the model classifies the data set correctly even with this transformed parameter $\theta'$. 
However, this means that it breaks the conditional separability condition, which contradicts with the fact that λ * is proper. Therefore, M λ * ∩ I ̸ = ∅. Lastly max i∈[d] λ * i = t * max i∈I ζ i can be derived as follows: max i∈[d] λ * i = max i∈I λ * i = t * max i∈I ζ i . Lemma B.4. For any proper element λ * ∈ L/{λ 0 }, M λ * ∩ ([d]/I) ̸ = ∅. Proof. Suppose M λ * ∩ ([d]/I) = ∅, i.e., M λ * ⊂ I. Let {θ * i } i∈[d] be a parameter values separating the data. Notice that there exists i ∈ [d]/I such that λ 0 i ̸ = 0 and θ * i ̸ = 0. Otherwise, ef (x; θ * ) = f (x; e y(t * ) θ * ) = f (x; e 2y(t * /2) θ * ) = e 2 f (x; θ * ), which is a contradiction. We denote such an index by j. We consider the following transformation θ * → θ ′ := e -βy(t * ) e αy(0) θ * , for some α > 0. β > 0 here is chosen to satisfy ∥θ ′ ∥ Λ * max = 1. By the quasi-homogeneity of the model, the model still correctly classifies the data with this transformed parameters. By taking α arbitrarily large, (e αy(0) θ * ) j = e αλ 0 j θ * j becomes arbitrarily large, and thus β needs to be arbitrarily large to renormalize the Λ * -seminorm. Therefore for any i ∈ I, θ ′ i = e -βζi θ * i can be arbitrarily small. Therefore, we can find a large enough α such that ∥θ ′ ∥ 2 Λ * max = i∈argmin λ * i λ * i (θ * i ) 2 ≤ i∈I λ * i (θ * i ) 2 < κ This contradicts with the fact that λ * is proper. Therefore, M λ * ∩ ([d]/I) ̸ = ∅. Lemma B.5. If there exists a proper λ ∈ Int S, it is unique, where Int S is the interior of S. Proof. Suppose there exists two different proper point λ 1 , λ 2 ∈ Int S. In the following argument, we will show that λ 1 = λ 2 , which is a contradiction. We consider a line including the two points λ 1 and λ 2 . Without loss of generality it can be represented as L := {y ∈ R d ≥0 : y = ζt + λ 0 with some t ≥ 0}, where ζ = λ 2 -λ 1 and λ 0 is an end point of this line. Let λ * = y(t * ) be a proper point in the interior of the line. λ * can be either λ 1 or λ 2 . 
In the following, we will derive an explicit formula which uniquely determines t * , implying λ 1 = λ 2 . By applying Lemma B.3, we obtain max i∈[d] λ * i = t * max i∈I ζ i . Suppose the line has the other end point t = t max > 0. We can apply Lemma B.3 again with this other end point, and we can derive the corresponding equality λ * max = (t max -t * ) max i∈J ζ i , where J ⊂ [d] is given by J := {j ∈ [d] : y(t max ) j = 0, ∃λ ∈ L, λ j > 0} . By comparing these two equality, we obtain t * = max i∈J ζ i max i∈J ζ i + max i∈I ζ i t max . This uniquely determines t * and hence λ 1 = λ 2 . Next, we consider the other case where L does not have end point other than λ 0 . Notice that max i∈I ζ i > max i∈[d]/I ζ i . ( ) This is due to the following reason: Suppose there exists j ∈ [d]/I such that ζ j ≥ max i∈I ζ i . Since ζ j > 0, j ̸ ∈ I implies that λ 0 j > 0. Hence, y(t * ) j = t * ζ j + λ 0 j > λ * max , which is clearly a contradiction. The inequality Eq.20 means that for any j ∈ [d]/I, {t ≥ 0 : max i∈I y(t) i > y(t) j } = [t ′ j , ∞) with some t ′ j ≥ 0. Hence t * ∈ {t ≥ 0 : max i∈I y(t) i = max i∈[d] y(t) i } = ∩ j∈[d]/J {t ≥ 0 : y(t) J > y(t) j } = [t ′ , ∞), where t ′ = max j∈[d]/I t ′ j . In the region (t ′ , ∞), argmax i∈[d] y(t) i ⊂ I, and hence by Lemma B.4, t * ̸ ∈ (t ′ , ∞). Hence t * = t ′ . The uniqueness of t ′ implies that λ 1 = λ 2 . The argument above shows that in any case λ 1 = λ 2 , which contradicts with the assumption that they are different. Therefore, the claim follows. Lemma B.6. For any proper element λ * ∈ S and any λ ∈ S, min i∈M λ * (λ i -λ * i ) ≤ 0, max i∈M λ * (λ i -λ * i ) ≥ 0. Proof. we show this by contradiction. Suppose there exist λ, λ * ∈ S such that min i∈M λ * (λ i -λ * i ) > 0 or max i∈M λ * (λ i -λ * i ) < 0. Let θ * be parameter values with which the model can separate a data set. We consider a transformation θ → e αy(t) θ with y(t) = t(λ * -λ) + λ with some α, t ∈ R. 
By quasihomogeneity of the model with respect to λ and λ * , we know that f (x; e αy(t) θ) = e α f (x; θ). By assumption, there exists t ∈ R such that y i (t) ≤ 0 for any i ∈ M λ * . This implies that {f (x; e αy(t) θ) : ∥e αy(t) θ∥ Λ * max ≤ ∥θ∥ Λ * max , α, t ∈ R} is unbounded, which contradicts with our assumption. Hence, the claim follows. Lemma B.7. Let λ 1 , λ 2 ∈ S be proper. If the following two conditions are met, λ 1 and λ 2 are equivalent in terms of the first-order KKT condition. Either min i∈M λ k (λ 1 i -λ 2 i ) = 0 or max i∈M λ k (λ 1 i -λ 2 i ) = 0 holds for both k = 1, 2 {i ∈ M λ 1 : λ 1 i = λ 2 i } = {i ∈ M λ 2 : λ 1 i = λ 2 i } Proof. Let L = {i ∈ M λ 1 : λ 1 i = λ 2 i }(= {i ∈ M λ 2 : λ 1 i = λ 2 i }). It suffices to show that set of the first-order KKT points of the following problem minimize 1 2 ∥θ∥ 2 Λ k max subject to y i f (x i ; θ) ≥ 1 ∀i ∈ [n] is equivalent to the set of the first-order KKT points of the following problem minimize 1 2 i∈L λ k max θ 2 i subject to y i f (x i ; θ) ≥ 1 ∀i ∈ [n] θ i = 0 ∀i ∈ (M λ 1 ∪ M λ 2 )/L. ( ) for any data set {(x i , y i )} i∈ [n] for k = 1, 2. Since the second optimization problem is clearly a restriction of the first optimization problem with some additional constraints, it suffices to show that any first-order KKT point of the first problem satisfies the constraint θ i = 0 for any i ∈ (M λ 1 ∪ M λ 2 )/L. To prove this statement, first we show that for k = 1, 2, and any θ ∈ R d , a one-parameter family of parameter values {Θ(α ) := e -α(Λ 1 -Λ 2 ) θ : α ∈ R} satisfies • d dα f (x; Θ(α)) = 0 • If d dα α=0 ∥Θ(α)∥ Λ k max = 0, θ i = 0 for any i ∈ M λ k /L. The first point can be easily seen by the quasi-homogeneity of the model with respect to λ 1 and λ 2 . Indeed, for any α ∈ R, f (x; Θ(α)) = f (x; e -αΛ 1 e αΛ 2 θ) = e -α f (x; e αΛ 2 θ) = f (x; θ). 
Regarding the second point, first notice that for any i ∈ L, Θ i (α) = θ i for any α ∈ R d and hence d dα α=0 ∥Θ(α)∥ Λ k max = λ k max d dα α=0 i∈M λ k /L Θ 2 i (α). If min i∈M λ 1 (λ 1 i -λ 2 i ) = 0, λ 1 i -λ 2 i < 0 for any i ∈ M λ k /L , and thus, unless θ i = 0 for any i ∈ M λ k /L, d dα α=0 i∈M λ k /L Θ 2 i (α) = - i∈M λ k /L α(λ 1 i -λ 2 i )θ 2 i > 0 . On the other hand, if max i∈M λ 1 (λ 1 i -λ 2 i ) = 0, λ 1 i -λ 2 i > 0 for any i ∈ M λ k /L, and thus, unless θ i = 0 for any i ∈ M λ k /L, d dα α=0 i∈M λ k /L Θ 2 i (α) < 0. Therefore, in either case, the second claim holds. Let θ be a first-order KKT point of Eq.21 with a KKT multiplier µ ∈ R n . The stationary condition along the one-parameter family Θ(α) is d dα α=0 1 2 ∥Θ(α)∥ 2 Λ k max + j∈[n] µ j y j d dα α=0 f (x j ; Θ(α)) = 0. Hence by the two properties above with both k = 1, 2, we obtain θ i = 0 for any i ∈ (M λ 1 ∪M λ 2 )/L. Therefore θ is a first-order KKT point of Eq.22. By exploiting all of the lemmas above, we are now going to prove Theorem B.1. proof of Theorem B.1. Suppose there exists two different proper points λ 1 , λ 2 ∈ S. By Lemma B.5, at least either of λ 1 or λ 2 is on the boundary ∂S of S. Without loss of generality, we assume that λ 1 ∈ ∂S. We consider a line segment in S L := {y(t) ∈ R d ≥0 : y(t) = ζt + λ 1 , t > 0}, where ζ = λ 2 -λ 1 ∈ R d . We first consider the case where L has a single end point λ 1 . We will show that λ 1 and λ 2 are equivalent in terms of the resulting optimization problem, by applying Lemma B.7. Hence we are going to verify each condition required in the lemma. Notice that the fact that L has a single end point implies that ζ i ≥ 0 for any i ∈ [d]. Combined with Lemma B.6, this means that min i∈M λ k (λ 2 i -λ 1 i ) = 0 for k = 1, 2. Furthermore, clearly {i ∈ M λ 2 : λ 1 i = λ 2 i } = M λ 1 . 
This is because for any i ∈ {i ∈ M λ 2 : λ 1 i = λ 2 i } and j ∈ [d], λ 1 i = λ 2 i ≥ λ 2 j ≥ λ 2 j -ζ j t = λ 1 j , where the inequality above hold as an equality if and only if j ∈ {i ∈ M λ 2 : λ 1 i = λ 2 i }. {i ∈ M λ 2 : λ 1 i = λ 2 i } = {i ∈ M λ 1 : λ 1 i = λ 2 i } immediately follows from {i ∈ M λ 2 : λ 1 i = λ 2 i } = M λ 1 . Hence by applying Lemma B.7, we obtain that λ 1 and λ 2 are equivalent. Next, we consider the other case where L has two end points. Let y(t max ) denote the other end point. Suppose λ 2 is an interior point, i.e., λ 2 ̸ = y(t max ). By exploiting Lemma B.3, we obtain λ 1 max = t max max i∈J ζ i > (t max -t 2 ) max i∈J ζ i = λ 2 max , where t 2 is given by y(t 2 ) = λ 2 and J ⊂ [d] is given by J := {j ∈ [d] : y(t max ) j = 0, ∃λ ∈ L, λ j > 0} . On the other hand, by applying Lemma B.6 at λ 1 , we know that there exists i ∈ M λ 1 such that ζ ≥ 0. Hence λ 2 i ≥ λ 1 i = λ 1 max . This clearly contradicts with Eq.24. Hence, λ 2 = y(t max ). Lastly, we show that λ 1 and λ 2 = y(t max ) are equivalent in terms of the resulting optimization problem by applying Lemma B.7. Hence, we are going to verify the conditions required in the lemma. Since min i∈M λ 2 ζ i ≤ 0 by Lemma B.6, there exists j ∈ M λ 2 such that ζ j ≤ 0, and hence λ 2 max = λ 2 j = λ 1 j + ζ j t max < λ 1 j ≤ λ 1 max . By applying Lemma B.6 at λ 1 , similarly we obtain λ 1 max ≤ λ 2 max . Therefore λ 1 max = λ 2 max . Suppose there exists i ∈ M λ 1 such that ζ i > 0. Then, λ 1 max = λ 1 i = λ 2 i -ζ i t max = λ 2 i < λ 2 max . This is a contradiction, and hence max i∈M λ 1 ζ i ≤ 0. Combining this with Lemma B.6, max i∈M λ 1 ζ i = 0. Similarly, suppose there exists i ∈ M λ 2 such that ζ i < 0. Then, λ 2 max = λ 2 i = λ 1 i + ζ i t max < λ 1 max . This is a contradiction, and hence min i∈M λ 2 ζ i = 0. Let k ∈ {i ∈ M λ 1 : λ 1 i = λ 2 i }. Then λ 2 k = λ 1 k = λ 1 max = λ 2 max . Hence k ∈ {i ∈ M λ 2 : λ 1 i = λ 2 i }. 
Similarly if k ∈ {i ∈ M λ 2 : λ 1 i = λ 2 i }, λ 1 k = λ 2 k = λ 2 max = λ 1 max . Hence k ∈ {i ∈ M λ 1 : λ 1 i = λ 2 i }. Therefore {i ∈ M λ 1 : λ 1 i = λ 2 i } = {i ∈ M λ 2 : λ 1 i = λ 2 i } . Now, all the conditions in Lemma B.7 are satisfied, and hence λ 1 and λ 2 are equivalent. In summary, regardless of the finiteness of the line L, λ 1 and λ 2 are equivalent in terms of the first-order KKT condition.
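The mechanism behind Lemma B.7 can be seen on the two-layer quadratic example $f(x; \theta) = \theta_3(\theta_1 x + \theta_2)^2 + \theta_4$ from earlier in this appendix. The sketch below takes two valid rate vectors ($\xi = 0$ and $\xi = 1/4$, chosen for illustration), forms the one-parameter family $\Theta(\alpha) = e^{-\alpha(\Lambda^1 - \Lambda^2)}\theta$, and checks that the output is unchanged while the $\Lambda^1_{\max}$-seminorm of the $\xi = 0$ choice decreases toward $\theta_4^2$. The parameter values and helper names are assumptions made for this example only.

```python
import numpy as np

def f(theta, x):
    t1, t2, t3, t4 = theta
    return t3 * (t1 * x + t2) ** 2 + t4

lam_a = np.array([0.0, 0.0, 1.0, 1.0])     # rates for xi = 0
lam_b = np.array([0.25, 0.25, 0.5, 1.0])   # rates for xi = 1/4

def Theta(theta, alpha):
    # One-parameter family exp(-alpha * (Lambda^1 - Lambda^2)) theta from Lemma B.7.
    return np.exp(-alpha * (lam_a - lam_b)) * theta

def seminorm_sq_a(theta):
    # Squared Lambda_max-seminorm for lam_a: only the highest-rate coordinates count.
    top = lam_a == lam_a.max()
    return lam_a.max() * np.sum(theta[top] ** 2)

theta = np.array([0.8, -0.3, 1.5, 0.6])   # illustrative parameters
x = 2.0                                   # illustrative input
alphas = [0.0, 1.0, 5.0, 20.0]

outputs = [f(Theta(theta, a), x) for a in alphas]
norms = [seminorm_sq_a(Theta(theta, a)) for a in alphas]

print(outputs)  # the function value is the same for every alpha
print(norms)    # decreases toward theta_4**2 as alpha grows
```

Because the family leaves the output fixed but strictly shrinks the $\theta_3$ contribution, no point with $\theta_3 \neq 0$ can be stationary for the $\xi = 0$ problem, which is why all choices of $\xi$ end up minimizing $\theta_4^2$ alone.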

C LIMITATIONS OF OUR CURRENT ASSUMPTIONS

A1. While many modern neural network architectures are Λ-quasi-homogeneous, this class of models cannot represent models with non-homogeneous activations such as the hyperbolic tangent function or the sigmoid function. While these activation functions are less common in practice, it would be interesting to find a direction towards generalizing our analysis to models using these functions. This could be of interest to the computational biology community, as these monotonic, saturating activations are much more biologically plausible than non-saturating homogeneous activations. One route could be studying the properties of homothetic functions, which are monotonic transformations of homogeneous functions. This class of functions has the same ordinal properties as homogeneous functions and is used extensively in economics Simon et al. (1994).

A2 - A5. These assumptions are equivalent, up to order, to the assumptions presented in Lyu & Li (2019) and are all quite standard in the literature studying the maximum-margin bias of gradient descent. The strongest of these assumptions is A4, which assumes the training dynamics of the model are governed by the first-order ODE gradient flow. As discussed in section 7, an important future step would be to generalize our results to stochastic gradient descent (SGD). It is very possible that the hyperparameters of SGD, such as the learning rate and batch size, play an important role in determining the forces driving the limiting dynamics.

A6. This assumption implies the convergence of the decision boundary and is equivalent to directional convergence for a homogeneous model. This assumption is necessary to show that the model's prediction is bounded on the normalized training trajectory (Lemma E.1) and, for a technical reason, to show the alignment of $\frac{d\theta}{dt}$ and $\Lambda\theta$ (Lemma E.4). 
While a necessary assumption, we expect that this assumption can be weakened by exploiting the argument in Ji & Telgarsky (2020) , which was applied for homogeneous functions. This could be an important step for future work as it is possible to construct settings where gradient flow will violate this assumption. See Appendix J of Lyu & Li (2019) for an example of a smooth homogeneous function where the limiting dynamics of gradient flow provably don't converge after normalization, but move along a circle. Another, more practical example, occurs for models where the residual block diverges. Because these parameters necessarily have a λ value of 0, then the normalized parameter θ diverge as well. The divergence of a residual block essentially means that the skip connection of the model is effectively negligible and does not play an important role in the classification. Thus, we believe that if the skip connections play an important role for the classification task, then residual block does not diverge, and this assumption is satisfied. Additionally, like A7, this assumption restricts the possible proper values of Λ as discussed in App. B. A7. This assumption implies that λ max parameters play a role in the classification task. This is the strongest of the assumptions we introduce in our work. This assumption is a restriction on the interaction of the parameterization of a model and the dataset, and thus it is difficult to assert its validity for general settings, without directly modeling the data as in section 5. That said, there are some standard settings where we can assert that this assumption is always true. First, it is trivially true that for a homogeneous model where ∥ θ∥ Λmax = ∥ θ∥ that A7 is always true. Additionally, for all models with batch normalization on the last hidden layer, such as a ResNet-18 model, then A7 is also true, as the last layer parameters are Λ max parameters, and thus necessary for classification. 
However, there are other settings where it is less clear that A7 is satisfied. For example, for a fullyconnected network with bias parameters the validity of A7 will depend on the data. This limitation is why we introduced Conjecture 5.1, which does not involve A7, and provided empirical evidence supporting its claim in section 5. The challenge to proving Conjecture 5.1 will be showing that a version of Lemma 4.1 still holds once the parameter space is restricted such that the Λ max parameters are zero. It is likely the case that an argument in this direction will require a proof by induction on the highest-rate parameters and their importance to the classification task.

D PROOF OF LEMMA 4.1

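Throughout this section, we assume A1 to A4. To simplify notation in our proof, we define $\nu := \frac{1}{2}\frac{d}{dt}\|\theta\|_\Lambda^2$. We make use of the following two simple statements:

Lemma D.1. The derivative of the loss is given by $\frac{dL}{dt} = -\left\|\frac{d\theta}{dt}\right\|^2$ for almost every $t > 0$. Hence, the loss is non-increasing for $t > 0$.

Proof. Since $L$ is a locally Lipschitz function admitting a chain rule, applying Lemma 5.2 in Davis et al. (2020) to $L(\theta(t))$ immediately gives $\frac{dL}{dt} = -\left\|\frac{d\theta}{dt}\right\|^2$ for almost every $t > 0$.

Lemma D.2. For all $\theta \in \mathbb{R}^m$, $\|\Lambda\theta\|^2 \le \lambda_{\max}\|\theta\|_\Lambda^2$.

Proof. $\|\Lambda\theta\|^2 = \sum_i \lambda_i^2\theta_i^2 \le \lambda_{\max}\sum_i \lambda_i\theta_i^2 = \lambda_{\max}\|\theta\|_\Lambda^2$.

We will now prove the main lemmas described in section 4.

Proof of Lemma 4.1. We first prove $\nu \ge L\log((nL)^{-1})$. By Lemma A.3, we can express $\nu$ as
$$\nu = \left\langle \frac{d\theta}{dt}, \Lambda\theta \right\rangle = n^{-1}\sum_a e^{-y_a f_a(\theta)}\, y_a f_a(\theta).$$
Using this equality, the statement follows from the inequality
$$L^{-1}\nu - \log((nL)^{-1}) = \Big(\sum_a e^{-y_a f_a(\theta)}\Big)^{-1}\sum_a y_a f_a(\theta)\, e^{-y_a f_a(\theta)} - \log((nL)^{-1}) = -\sum_a p_a \log p_a \ge 0,$$
where $p_a := \big(\sum_{a'} e^{-y_{a'} f_{a'}(\theta)}\big)^{-1} e^{-y_a f_a(\theta)}$.

We now prove $\frac{d}{dt}\log(\gamma) \ge \lambda_{\max}^{-1}\frac{d}{dt}\log(\|\theta\|_\Lambda)\tan(\omega)^2$. We have
$$\frac{d}{dt}\log(\gamma) = \frac{d}{dt}\Big[\log\log((nL)^{-1}) - \lambda_{\max}^{-1}\log\|\theta\|_\Lambda\Big] = \big(\log((nL)^{-1})\big)^{-1} L^{-1}\Big(-\frac{dL}{dt}\Big) - \frac{\langle \Lambda\theta, \frac{d\theta}{dt}\rangle}{\lambda_{\max}\|\theta\|_\Lambda^2} \ge \nu^{-1}\left(-\frac{dL}{dt} - \frac{\langle \Lambda\theta, \frac{d\theta}{dt}\rangle^2}{\|\Lambda\theta\|^2}\right),$$
where the last inequality applies the bound $\nu \ge L\log((nL)^{-1})$ proved above and Lemma D.2. Since $-\frac{dL}{dt} = \left\|\frac{d\theta}{dt}\right\|^2$ for almost every $t > 0$, we can further simplify the RHS,
$$\nu^{-1}\left(\left\|\frac{d\theta}{dt}\right\|^2 - \frac{\langle \Lambda\theta, \frac{d\theta}{dt}\rangle^2}{\|\Lambda\theta\|^2}\right) = \nu^{-1}\left\|\left(I - \frac{\Lambda\theta\theta^\top\Lambda}{\|\Lambda\theta\|^2}\right)\frac{d\theta}{dt}\right\|^2 = \frac{\|v\|^2}{\nu}\,\frac{\|u\|^2}{\|v\|^2} = \frac{\nu}{\|\Lambda\theta\|^2}\tan(\omega)^2,$$
where the normal component $v$ and the tangent component $u$ are defined as
$$v := \frac{\Lambda\theta\theta^\top\Lambda}{\|\Lambda\theta\|^2}\frac{d\theta}{dt}, \qquad u := \left(I - \frac{\Lambda\theta\theta^\top\Lambda}{\|\Lambda\theta\|^2}\right)\frac{d\theta}{dt}.$$
Applying Lemma D.2 once more gives the final result,
$$\frac{d}{dt}\log(\gamma) \ge \lambda_{\max}^{-1}\frac{\nu}{\|\theta\|_\Lambda^2}\tan(\omega)^2 = \lambda_{\max}^{-1}\frac{d}{dt}\log(\|\theta\|_\Lambda)\tan(\omega)^2.$$

Here is an important direct consequence of Lemma 4.1.

Corollary D.1. $\gamma(t)$ is non-decreasing for $t \ge t_0$.

Proof. Notice the following inequality, derived using Lemma 4.1, Lemma D.1, and A5,
$$\frac{d}{dt}\log(\|\theta\|_\Lambda) = \frac{\nu}{\|\theta\|_\Lambda^2} \ge \frac{L\log((nL)^{-1})}{\|\theta\|_\Lambda^2} \ge 0.$$
It follows by Lemma 4.1 that $\frac{d}{dt}\log(\gamma) \ge 0$, and thus $\gamma$ is non-decreasing.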
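The two elementary inequalities used above can be checked numerically. The following is a minimal sketch, assuming the exponential loss L = n^{-1} Σ_a exp(-q_a) with margins q_a = y_a f_a(θ) and non-negative λ-values; the function names are hypothetical and are not taken from the attached notebook.

```python
import math


def nu_and_bound(margins):
    """Return (nu, L * log((n L)^{-1})) for margins q_a = y_a * f_a(theta)."""
    n = len(margins)
    loss = sum(math.exp(-q) for q in margins) / n        # L = n^{-1} sum_a exp(-q_a)
    nu = sum(math.exp(-q) * q for q in margins) / n      # nu = n^{-1} sum_a exp(-q_a) q_a
    bound = loss * math.log(1.0 / (n * loss))            # L log((nL)^{-1})
    return nu, bound


def entropy_identity(margins):
    """Return (L^{-1} nu - log((nL)^{-1}), entropy of p_a); the two should agree."""
    n = len(margins)
    weights = [math.exp(-q) for q in margins]
    total = sum(weights)
    probs = [w / total for w in weights]                 # p_a as defined in the proof
    entropy = -sum(p * math.log(p) for p in probs)
    nu, bound = nu_and_bound(margins)
    loss = total / n
    return (nu - bound) / loss, entropy


def lemma_d2_sides(lambdas, theta):
    """Return (||Lambda theta||^2, lambda_max * ||theta||_Lambda^2) for diagonal Lambda."""
    lhs = sum((lam * t) ** 2 for lam, t in zip(lambdas, theta))
    rhs = max(lambdas) * sum(lam * t * t for lam, t in zip(lambdas, theta))
    return lhs, rhs
```

Because the gap between ν and its lower bound is L times the entropy of p, the bound is tight exactly when a single data point dominates the loss.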

E PROOF OF THEOREM 4.1

We first prove the following four lemmas by utilizing Lemma 4.1.

Lemma E.1 (Bounded $f(x;\hat\theta)$ on $\hat\theta(t)$). Under assumptions A1, A2, A4, and A6, for all $i \in [n]$, $f(x_i;\hat\theta)$ is bounded on the normalized training trajectory $\hat\theta(t)$ (i.e., $\exists k > 0$ such that $|f(x_i;\hat\theta(t))| \le k$ for all $t \ge 0$ and all $i \in [n]$).

Lemma E.2 (Divergence of $L^{-1}$, $q_{\min}$, $\|\theta\|_\Lambda^2$). Under assumptions A1 to A5, as $t \to \infty$ the quantities $L^{-1}, q_{\min}, \|\theta\|_\Lambda^2 \to \infty$.

Lemma E.3 (Upper bound on $\gamma$). Under assumptions A1 to A7, the normalized margin $\gamma$ is bounded above by a constant.

Lemma E.4 (Alignment of $\frac{d\theta}{dt}$ and $\Lambda\theta$). Under assumptions A1 to A7, there exists a sequence of times $\{t_k\}_{k\in\mathbb{N}}$ on which the cosine similarity $\beta(t_k) \to 1$.

Proof of Lemma E.1. By A4 and the uniqueness of the normalization procedure (Lemma A.1), we know the normalized training trajectory $\hat\theta(t)$ is continuous. Combined with the convergence of the normalized parameters (A6), this implies that $\{\hat\theta(t) : t \ge 0\}$ is bounded. By the continuity of $f$, this further implies that for any fixed $x \in \mathbb{R}^d$, $\{f(x;\hat\theta(t)) : t \ge 0\}$ is bounded. Thus, there exists a $k > 0$ such that for all $i \in [n]$, $|f(x_i;\hat\theta(t))| \le k$ for all $t \ge 0$.

Proof of Lemma E.2. We first prove that $L^{-1} \to \infty$ as $t \to \infty$. By Lemma D.1,
$$-\frac{dL}{dt} = \left\|\frac{d\theta}{dt}\right\|^2 \ge \left\|\frac{\Lambda\theta\theta^\top\Lambda}{\|\Lambda\theta\|^2}\frac{d\theta}{dt}\right\|^2 \ge \lambda_{\max}^{-1}\frac{\nu^2}{\|\theta\|_\Lambda^2},$$
where the first equality holds in the almost-everywhere sense, and for the last inequality we use the definition of $\nu$ and Lemma D.2. Applying Lemma 4.1, we can lower bound $\nu$, giving the lower bound
$$-\frac{dL}{dt} \ge \lambda_{\max}^{-1}\frac{L^2\log((nL)^{-1})^2}{\|\theta\|_\Lambda^2} = \lambda_{\max}^{-1}L^2\log((nL)^{-1})^{2-2\lambda_{\max}}\left(\frac{\log((nL)^{-1})}{\|\theta\|_\Lambda^{\lambda_{\max}^{-1}}}\right)^{2\lambda_{\max}} \ge \lambda_{\max}^{-1}L^2\log((nL)^{-1})^{2-2\lambda_{\max}}\,\gamma(t_0)^{2\lambda_{\max}},$$
where the last inequality holds in the almost-everywhere sense via Corollary D.1. Rearranging terms on both sides of the inequality gives
$$-\frac{dL}{dt}\,L^{-2}\log((nL)^{-1})^{-2(1-\lambda_{\max})} \ge \lambda_{\max}^{-1}\gamma(t_0)^{2\lambda_{\max}}.$$
Integration from $t_0$ to $t$, with the substitution $-\frac{dL}{dt}L^{-2} = \frac{d}{dt}L^{-1}$, gives
$$\int_{L^{-1}(t_0)}^{L^{-1}(t)}\big(\log(n^{-1}z)\big)^{-2(1-\lambda_{\max})}\,dz \ge \lambda_{\max}^{-1}\gamma(t_0)^{2\lambda_{\max}}(t - t_0).$$
The RHS diverges as $t \to \infty$, and the LHS as a function of $t$ is non-decreasing for $z \ge n$, which is true for all $t \ge t_0$ by Lemma D.1 and A5. Thus we can conclude that $L^{-1} \to \infty$. This implies $q_{\min} \to \infty$ as $t \to \infty$, because $q_{\min}$ is lower bounded by $\log(L^{-1}) \le q_{\min}$.

We now show $\|\theta\|_\Lambda \to \infty$ as $t \to \infty$. We can upper bound $q_{\min}$ by $q_{\min} = e^{\tau(\theta)}\hat q_{\min} \le e^{\tau_{\max}(\|\theta\|_\Lambda)}\sup$ …

… $\|\theta\|_\Lambda^{\lambda_{\max}^{-1}}$ … $\le \dfrac{\log(L^{-1})}{\|\theta\|_{\Lambda_{\max}}^{\lambda_{\max}^{-1}}}$. Notice that $\|\theta\|_{\Lambda_{\max}}^{\lambda_{\max}^{-1}} = e^{\tau}\|\hat\theta\|_{\Lambda_{\max}}^{\lambda_{\max}^{-1}}$, and by A7 we know that $\|\hat\theta\|_{\Lambda_{\max}} \ge \kappa$. Therefore, we can further upper bound $\gamma$ as
$$\gamma \le \frac{\log(L^{-1})}{e^{\tau}\kappa^{\lambda_{\max}^{-1}}} \le \frac{e^{-\tau}q_{\min}}{\kappa^{\lambda_{\max}^{-1}}},$$
where the last inequality used $\log(L^{-1}) \le q_{\min}$. By Lemma E.1, $e^{-\tau}q_{\min} \le \sup_{i\in[n],\,t\ge 0}|f(x_i;\hat\theta(t))| \le k$, and therefore $\gamma$ is upper bounded by a constant, $\gamma \le k/\kappa^{\lambda_{\max}^{-1}}$.

Proof of Lemma E.4. We will inductively construct a sequence $\{t_k\}_{k\in\mathbb{N}}$ such that $|\beta(t_k) - 1| < k^{-1}$ for any $k \in \mathbb{N}$. Assume that a sequence satisfies this condition for any $k < l$ with a positive integer $l$. We show that we can find $t_l > t_{l-1}$ such that the condition above is met at $t_l$ as well. By Lemma E.3 and Corollary D.1, we can find $s > t_{l-1} + \epsilon$ such that $\log\gamma_{\max}(\infty) - \log\gamma_{\max}(s) < l^{-1}$. Here $\epsilon > 0$ is a constant that ensures $\{t_k\}_{k\in\mathbb{N}}$ goes to infinity. Furthermore, by the fact that $\rho \to \infty$, we can find $s' > s$ such that $\log\rho(s') - \log\rho(s) = 1$. This choice of $s$ and $s'$ satisfies
$$D := \frac{\log\gamma_{\max}(s') - \log\gamma_{\max}(s)}{\log\rho(s') - \log\rho(s)} < l^{-1}.$$
Assume that for any $t \in (s, s')$, $\beta(t)^{-2} - 1 > D$. Then
$$\log\gamma_{\max}(s') - \log\gamma_{\max}(s) < \int_s^{s'}(\beta^{-2} - 1)\frac{d\log\rho}{dt}\,dt.$$
On the other hand, by Lemma 4.1, the right-hand side can be upper bounded as follows:
$$\int_s^{s'}(\beta^{-2} - 1)\frac{d\log\rho}{dt}\,dt \le \int_s^{s'}\frac{d\log\gamma_{\max}}{dt}\,dt = \log\gamma_{\max}(s') - \log\gamma_{\max}(s).$$
This is a contradiction, implying that there exists $t \in (s, s')$ such that $\beta(t)^{-2} - 1 < D$. Thus, $|1 - \beta(t)| < 1 - 1/\sqrt{D + 1} < 1 - 1/\sqrt{l^{-1} + 1} < l^{-1}$, i.e., $t_l = t$ satisfies the condition. Note that $\lim_{l\to\infty}t_l = \infty$ since $t_l \ge s > t_{l-1} + \epsilon$ for any $l$.

Equipped with these convergence results, we can now prove Theorem 4.1 by exploiting the argument on the approximate KKT condition introduced in Dutta et al. (2013). That paper defines a notion of $(\epsilon, \delta)$-KKT point, which can be stated for our optimization problem P as follows: a point $\theta \in \mathbb{R}^m$ is a first-order $(\epsilon, \delta)$-KKT point of P if there exist multipliers $\mu = (\mu_1, \dots, \mu_n)$ such that the following four conditions are satisfied:

1. Primal Feasibility: $y_i f(x_i;\theta) \ge 1$ for all $i \in [n]$.
2. Dual Feasibility: $\mu_i \ge 0$ for all $i \in [n]$.
3. Approximate Stationarity: $\left\|\nabla_\theta\tfrac{1}{2}\|\theta\|_{\Lambda_{\max}}^2 - \sum_i \mu_i y_i h_i\right\| \le \epsilon$ with some $h_i \in \partial^\circ_\theta f(x_i;\theta)$ for each $i \in [n]$.
4. Approximate Complementarity: $\sum_i \mu_i\,(y_i f(x_i;\theta) - 1) \le \delta$.

We call this set of four conditions the first-order $(\epsilon, \delta)$-KKT condition. In the proof of Theorem 4.1, we use the following fact.

Lemma E.5. Under assumptions A1 to A5 and A7, $\psi_\alpha(\theta(t))$ with $\alpha = -\log(q_{\min})$ satisfies the first-order $(\epsilon(t), \delta(t))$-KKT condition with a multiplier $\mu(t)$, where $\epsilon(t), \delta(t), \mu(t)$ are given as follows:
$$\epsilon(t) = \sqrt{\lambda_{\max}}\,\gamma^{-\lambda_{\max}}\Big(2(1 - \beta) + m\max_{i\in[m]:\,\lambda_i \ne \lambda_{\max}} q_{\min}^{2(\lambda_i - \lambda_{\max})}\Big)^{1/2},$$
$$\delta(t) = e^{-1}\,n\,\lambda_{\max}\,\gamma^{-2\lambda_{\max}}\log((nL)^{-1})^{-1},$$
$$\mu_i(t) = c^{-1}q_{\min}^{(1 - 2\lambda_{\max})}e^{-q_i} \quad \forall i \in [n].$$

• There exist a sequence of times $\{t_k\}_{k\in\mathbb{N}}$ and a sequence of real numbers $\{\alpha_k\}_{k\in\mathbb{N}}$ such that $\psi_{\alpha_k}(\theta(t_k))$ satisfies the first-order $(\epsilon_k, \delta_k)$-KKT condition, where $\epsilon_k, \delta_k \to 0$ and $\psi_{\alpha_k}(\theta(t_k))$ converges.
• $\lim_{k\to\infty}\psi_{\alpha_k}(\theta(t_k))$ satisfies the MFCQ condition.

The second point directly follows from Lemma A.4. We prove the first statement with the result of Lemma E.5. By Lemma E.4, we can find a sequence of times $\{t_k\}_{k\in\mathbb{N}}$ on which $\beta \to 1$.
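Furthermore, because $\gamma$ is bounded below (it is non-decreasing by Corollary D.1), $q_{\min} \to \infty$ by Lemma E.2, and $L \to 0$ by Lemma E.2, the quantities $\epsilon(t), \delta(t)$ converge to 0 on this sequence: $\epsilon(t_k), \delta(t_k) \to 0$. Lastly, we show that $\psi_{\alpha(t_k)}(\theta(t_k))$ converges. Notice that $\psi_{\alpha(t_k)}(\theta(t_k)) = \hat q_{\min}^{-\Lambda}(t_k)\,\hat\theta(t_k)$, where $\hat q_{\min}(t_k)$ is the minimum margin of the model $f(x;\hat\theta(t_k))$. By A6, $\hat\theta(t_k)$ converges, and by the continuity of $f(x;\theta)$, $\hat q_{\min}(t_k)$ converges. Therefore, to show the convergence of $\psi_{\alpha(t_k)}(\theta(t_k))$, it suffices to show that $\hat q_{\min}(t)$ is lower bounded by a positive value. This can be seen as follows:
$$\hat q_{\min} = \left(\frac{\|\hat\theta\|_{\Lambda_{\max}}}{\|\theta\|_{\Lambda_{\max}}}\right)^{\lambda_{\max}^{-1}} q_{\min} \ge \frac{q_{\min}}{\|\theta\|_{\Lambda_{\max}}^{\lambda_{\max}^{-1}}} \ge \frac{\log L^{-1}}{\|\theta\|_{\Lambda_{\max}}^{\lambda_{\max}^{-1}}} = \gamma,$$
where $\gamma$ is non-decreasing by Corollary D.1 and hence bounded below by its value at $t_0$.

Proof of Lemma E.5. We verify each of the four conditions one by one.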

1. Primal Feasibility:

It is straightforward to check that this choice of $\alpha$ implies that $\psi_\alpha(\theta)$ satisfies primal feasibility, as for all $i \in [n]$, $y_i f(x_i;\psi_\alpha(\theta)) = e^{\alpha}y_i f(x_i;\theta) \ge e^{\alpha}q_{\min} = 1$.
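The rescaling step can be illustrated on a toy model. The sketch below assumes a hypothetical one-hidden-unit ReLU network f(x; θ) = w2·relu(w1·x + b1) + b2 with λ-values (1/2, 1/2, 1/2, 1) for (w1, b1, w2, b2); this model, its λ-values, and the function names are illustrative choices, not taken from the paper. It applies ψ_α(θ)_i = e^{αλ_i}θ_i with α = -log(q_min) and returns parameters whose minimum margin is 1.

```python
import math

# Assumed lambda-values for theta = (w1, b1, w2, b2); with these,
# f(x; psi_alpha(theta)) = e^alpha * f(x; theta) for the toy model below.
LAMBDAS = (0.5, 0.5, 0.5, 1.0)


def model(x, theta):
    """Toy quasi-homogeneous model: w2 * relu(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = theta
    return w2 * max(0.0, w1 * x + b1) + b2


def psi(alpha, theta, lambdas=LAMBDAS):
    """Apply the scaling psi_alpha(theta)_i = exp(alpha * lambda_i) * theta_i."""
    return tuple(math.exp(alpha * lam) * t for lam, t in zip(lambdas, theta))


def margins(theta, data):
    """Margins q_i = y_i * f(x_i; theta) for data given as (x, y) pairs."""
    return [y * model(x, theta) for x, y in data]


def rescale_to_unit_margin(theta, data):
    """Return psi_alpha(theta) with alpha = -log(q_min), so the minimum margin is 1."""
    q_min = min(margins(theta, data))
    if q_min <= 0:
        raise ValueError("theta does not separate the data, so q_min <= 0")
    return psi(-math.log(q_min), theta)
```

For example, with data [(1.0, 1), (2.0, 1), (-1.0, -1)] and θ = (2, 1, 3, -3), the margins are (6, 12, 3), so α = -log 3 and the rescaled margins become (2, 4, 1).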

2. Dual Feasibility:

It is clear that this choice of µ satisfies dual feasibility as q min > 0.

3. Approximate Stationarity:

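By Lemma A.2, for any $h \in \partial^\circ_\theta f(x_i, \theta)$, we can expand the sum as follows:
$$\sum_i \mu_i y_i\,\partial^\circ_\theta f(x_i;\psi_\alpha(\theta)) = \sum_i \mu_i y_i\,e^{\alpha(I - \Lambda)}\partial^\circ_\theta f(x_i;\theta) = c^{-1}q_{\min}^{(1 - 2\lambda_{\max})}q_{\min}^{(-I + \Lambda)}\sum_i e^{-q_i}y_i\,\partial^\circ_\theta f(x_i;\theta) = c^{-1}q_{\min}^{(\Lambda - 2\lambda_{\max})}\left(-\partial^\circ_\theta L\right).$$
Hence, there exists a combination of $h_i \in \partial^\circ_\theta f(x_i, \psi_\alpha(\theta))$ such that $\sum_i \mu_i y_i h_i = c^{-1}q_{\min}^{(\Lambda - 2\lambda_{\max})}\frac{d\theta}{dt}$. By substituting the expression for $c$, we obtain
$$\sum_i \mu_i y_i h_i = q_{\min}^{-\lambda_{\max}}\|\Lambda\theta\|\,Q\,\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|},$$
where $Q$ is a diagonal matrix such that $Q_{ii} = q_{\min}^{(\lambda_i - \lambda_{\max})}$. Now consider the derivative
$$\nabla_\theta\tfrac{1}{2}\|\psi_\alpha(\theta)\|_{\Lambda_{\max}}^2 = D\Lambda\psi_\alpha(\theta) = q_{\min}^{-\lambda_{\max}}D\Lambda\theta,$$
where $D$ is a diagonal indicator matrix such that $D_{ii} = 1$ iff $\lambda_i = \lambda_{\max}$. Combining these last two expressions, we can now bound the squared norm,
$$\left\|\nabla_\theta\tfrac{1}{2}\|\psi_\alpha(\theta)\|_{\Lambda_{\max}}^2 - \sum_i \mu_i y_i h_i\right\|^2 = \left\|q_{\min}^{-\lambda_{\max}}D\Lambda\theta - q_{\min}^{-\lambda_{\max}}\|\Lambda\theta\|\,Q\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2 = q_{\min}^{-2\lambda_{\max}}\|\Lambda\theta\|^2\left\|D\frac{\Lambda\theta}{\|\Lambda\theta\|} - Q\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2$$
$$\le \lambda_{\max}\left(\frac{q_{\min}}{\|\theta\|_\Lambda^{\lambda_{\max}^{-1}}}\right)^{-2\lambda_{\max}}\left\|D\frac{\Lambda\theta}{\|\Lambda\theta\|} - Q\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2 \le \lambda_{\max}\gamma^{-2\lambda_{\max}}\left\|D\frac{\Lambda\theta}{\|\Lambda\theta\|} - Q\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2,$$
where the second to last inequality applies Lemma D.2, and the last inequality applies the definition of the normalized margin. We now bound the squared norm in the RHS using the definition of $\beta$,
$$\left\|D\frac{\Lambda\theta}{\|\Lambda\theta\|} - Q\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2 = \left\|D\left(\frac{\Lambda\theta}{\|\Lambda\theta\|} - \frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right) + (D - Q)\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2 \le \left\|D\left(\frac{\Lambda\theta}{\|\Lambda\theta\|} - \frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right)\right\|^2 + \left\|(D - Q)\frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2$$
$$\le \left\|\frac{\Lambda\theta}{\|\Lambda\theta\|} - \frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2 + \|D - Q\|^2 \le \left\|\frac{\Lambda\theta}{\|\Lambda\theta\|} - \frac{\frac{d\theta}{dt}}{\|\frac{d\theta}{dt}\|}\right\|^2 + \sum_{\lambda_i \ne \lambda_{\max}}q_{\min}^{2(\lambda_i - \lambda_{\max})} \le 2(1 - \beta) + m\max_{\lambda_i \ne \lambda_{\max}}q_{\min}^{2(\lambda_i - \lambda_{\max})}.$$
Combining the upper bounds, we get
$$\left\|\nabla_\theta\tfrac{1}{2}\|\psi_\alpha(\theta)\|_{\Lambda_{\max}}^2 - \sum_i \mu_i y_i h_i\right\|^2 \le \lambda_{\max}\gamma^{-2\lambda_{\max}}\Big(2(1 - \beta) + m\max_{\lambda_i \ne \lambda_{\max}}q_{\min}^{2(\lambda_i - \lambda_{\max})}\Big) = \epsilon^2(t).$$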

4. Approximate Complementarity:

Expand the following sum with the defined values for $\alpha$ and $\mu_i$,
$$\sum_i \mu_i\big(y_i f(x_i;\psi_\alpha(\theta)) - 1\big) = \sum_i \mu_i\big(q_{\min}^{-1}y_i f(x_i;\theta) - 1\big) = \sum_i \frac{\mu_i}{q_{\min}}(q_i - q_{\min}) = c^{-1}q_{\min}^{-2\lambda_{\max}}\sum_i e^{-q_i}(q_i - q_{\min}).$$
Notice the lower bound on the scalar $c$,
$$c = \frac{\|\frac{d\theta}{dt}\|}{\|\Lambda\theta\|} \ge \frac{\left|\left\langle\frac{d\theta}{dt}, \frac{\Lambda\theta}{\|\Lambda\theta\|}\right\rangle\right|}{\|\Lambda\theta\|} = \frac{\nu}{\|\Lambda\theta\|^2} \ge \frac{L\log((nL)^{-1})}{\|\Lambda\theta\|^2} \ge \frac{e^{-q_{\min}}\log((nL)^{-1})}{\|\Lambda\theta\|^2},$$
where the last inequality follows from $L \ge e^{-q_{\min}}$. Using this lower bound for $c$, we can upper bound the previous expression as
$$\sum_i \mu_i\big(y_i f(x_i;\psi_\alpha(\theta)) - 1\big) \le q_{\min}^{-2\lambda_{\max}}\|\Lambda\theta\|^2\log((nL)^{-1})^{-1}\sum_i e^{-(q_i - q_{\min})}(q_i - q_{\min}).$$
The function $z \mapsto e^{-z}z$ attains its global maximum at $z = 1$, implying we can further upper bound this quantity as
$$\sum_i \mu_i\big(y_i f(x_i;\psi_\alpha(\theta)) - 1\big) \le e^{-1}n\,q_{\min}^{-2\lambda_{\max}}\|\Lambda\theta\|^2\log((nL)^{-1})^{-1} \le e^{-1}n\left(\frac{q_{\min}}{\|\theta\|_\Lambda^{\lambda_{\max}^{-1}}}\right)^{-2\lambda_{\max}}\log((nL)^{-1})^{-1}\lambda_{\max} \le e^{-1}n\,\gamma^{-2\lambda_{\max}}\log((nL)^{-1})^{-1}\lambda_{\max} = \delta(t),$$
where the second to last inequality applies Lemma D.2.

G PROOF OF THEOREM 6.1

We first state the rigorous version of Theorem 6.1.

Theorem G.1 (Neural Collapse). Under assumptions A8, A9, and $d \ge C$, any global optimum of the optimization problem Eq.8 satisfies Neural Collapse, i.e.,

• For any class $c \in [C]$, there exists a vector $\bar h_c$ such that $h_i = \bar h_c$ for any data point $i \in [n]$ with $y_i = c$.
• The convex hull of $\{w_c\}_{c\in[C]}$ forms a regular $(C-1)$-simplex, centered at the origin.
• For any $c \in [C]$, $\bar h_c$ and $w_c$ are equivalent up to re-scaling.
• $\operatorname{argmax}_{c\in[C]}\,w_c^T h + b_c = \operatorname{argmin}_{c\in[C]}\|h - \bar h_c\|$ for any $h \in \mathbb{R}^d$, i.e., any feature vector is classified to the class $c$ with the nearest class mean $\bar h_c$.

Proof sketch for Theorem G.1. To prove Theorem G.1, we study a sequence of three relaxed optimization problems, starting from Eq.8, and introduce five lemmas (Lemma G.1 to G.5) characterizing the minimizers of these relaxed problems. By A8, the optimization problem Eq.8 can be reduced to the following:
$$\min_{(w,b,h)}\sum_{c\in[C]}\|w_c\|^2 + \|b\|^2 \quad\text{s.t.}\quad \min_{i\in[n]}q_i \ge 1,\quad \sum_j (h_i)_j = 0,\quad \sum_j (h_i)_j^2 = 1\ \ \forall i \in [n], \qquad (33)$$
where $q_i \in \mathbb{R}$ is defined as $q_i := (w_{y_i})^T h_i + b_{y_i} - \max_{c \ne y_i}\big((w_c)^T h_i + b_c\big)$. The minimizers of this optimization problem are characterized in Lemma G.5. To prove Lemma G.5, we will consider the following further relaxed problem,
$$\min_{(w,b,h)}\sum_{c\in[C]}\|w_c\|^2 + \|b\|^2 \quad\text{s.t.}\quad \min_{i\in[n]}q_i \ge 1,\quad \sum_j (h_i)_j^2 = 1\ \ \forall i \in [n]. \qquad (35)$$
This problem is analyzed in Lemma G.4. This optimization problem is equivalent to our last relaxed problem,
$$\min_{(w,b)}\sum_{c\in[C]}\|w_c\|^2 + \|b\|^2 \quad\text{s.t.}\quad \max_{\{h_i\}_{i\in[n]}}\Big\{\min_{i\in[n]}q_i : \sum_j (h_i)_j^2 = 1\ \ \forall i \in [n]\Big\} \ge 1. \qquad (36)$$
The minimizers of this problem are analyzed in Lemma G.3, with the help of Lemma G.2, which analyzes a further relaxed problem, and Lemma G.1, which analyzes the constraint in Eq.36. We will now introduce and prove Lemmas G.1 through G.5.

Let $H$ denote the set of $\{h_i\}_{i\in[n]}$ satisfying the constraints $\sum_j (h_i)_j^2 = 1$ for any $i \in [n]$, $\Delta_c$ denote the $(C-2)$-simplex formed by $\{w_{c'}\}_{c'\in[C]/\{c\}}$, and $\Delta$ denote the standard $(C-2)$-simplex.

Lemma G.1. Under assumption A9, the following equality holds:
$$\max_{\{h_i\}\in H}\min_{i\in[n]}q_i = \min_{c\in[C]}L_c, \qquad (37)$$
with
$$L_c := \min_{\alpha\in\Delta}\|w_c - w'_c(\alpha)\| + (b_c - b'_c(\alpha)), \qquad (38)$$
where $w'_c(\alpha) : \Delta \to \Delta_c$ and $b'_c(\alpha) : \Delta \to \mathbb{R}$ are defined as $w'_c(\alpha) := \sum_{c'\in[C]/\{c\}}\alpha_{i_c(c')}w_{c'}$ and $b'_c(\alpha) := \sum_{c'\in[C]/\{c\}}\alpha_{i_c(c')}b_{c'}$, and where $i_c(\cdot) : [C]/\{c\} \to [C-1]$ is given by $i_c(c') = c'$ if $c' < c$ and $i_c(c') = c' - 1$ otherwise. Any maximizer of the quantity above is given by $h_i = h^*_{y_i}$, where $h^*_c = \frac{w_c - w^*_c}{\|w_c - w^*_c\|}$ and $w^*_c = w'_c(\alpha)$ with $\alpha$ minimizing Eq.38.

Proof.
$$\max_{\{h_i\}\in H}\min_{i\in[n]}q_i = \min_{i\in[n]}\max_{h:\|h\|=1}\Big[(w_{y_i})^T h + b_{y_i} - \max_{c'\in[C]/\{y_i\}}\big((w_{c'})^T h + b_{c'}\big)\Big] = \min_{c\in[C]}\max_{h:\|h\|=1}\Big[(w_c)^T h + b_c - \max_{c'\in[C]/\{c\}}\big((w_{c'})^T h + b_{c'}\big)\Big],$$
where the maximization over $\{h_i\}_{i\in[n]}$ is achieved when
$$h_i \in \arg\max_{h:\|h\|=1}\Big[(w_{y_i})^T h + b_{y_i} - \max_{c'\in[C]/\{y_i\}}\big((w_{c'})^T h + b_{c'}\big)\Big].$$
The quantity inside the brackets can be reduced as follows:
$$(w_c)^T h + b_c - \max_{c'\in[C]/\{c\}}\big((w_{c'})^T h + b_{c'}\big) = \min_{c'\in[C]/\{c\}}\big[(w_c - w_{c'})^T h + (b_c - b_{c'})\big] = \min_{\alpha\in\Delta}\big[(w_c - w'_c(\alpha))^T h + (b_c - b'_c(\alpha))\big].$$
Therefore, by the minimax theorem,
$$\max_{h:\|h\|=1}\Big[(w_c)^T h + b_c - \max_{c'\in[C]/\{c\}}\big((w_{c'})^T h + b_{c'}\big)\Big] = \max_{h:\|h\|=1}\min_{\alpha\in\Delta}\big[(w_c - w'_c(\alpha))^T h + (b_c - b'_c(\alpha))\big] = \min_{\alpha\in\Delta}\max_{h:\|h\|=1}\big[(w_c - w'_c(\alpha))^T h + (b_c - b'_c(\alpha))\big] = \min_{\alpha\in\Delta}\|w_c - w'_c(\alpha)\| + (b_c - b'_c(\alpha)) = L_c,$$
where the maximization over $h$ is achieved if and only if $h = \frac{w_c - w^*_c}{\|w_c - w^*_c\|}$. Hence Eq.37 holds. Clearly, the maximizer is $h_i = \frac{w_{y_i} - w^*_{y_i}}{\|w_{y_i} - w^*_{y_i}\|}$.

By Lemma G.1, the optimization problem Eq.36 is reduced to
$$\min_{(w,b)}\sum_c\|w_c\|^2 + \|b\|^2 \quad\text{s.t.}\quad \min_{c\in[C]}L_c \ge 1. \qquad (40)$$
We will later show that this minimization is achieved when the convex hull of $\{w_c\}_{c\in[C]}$ forms a regular simplex. However, before dealing with Eq.40, we first solve the minimization of the following easier problem.

Lemma G.2.
Consider Z c := ∥w c -w ′ c (ᾱ)∥ + (b -b ′ c ( ᾱ)), where ᾱ := ((C -1) -1 , (C -1) -1 , • • • , (C -1) -1 ) is the bary-center of simplex ∆. The minimizer of the following optimization problem min ({wc} c∈[C] ,b) c |w c | 2 + |b| 2 s.t. min c∈[C] Z c ≥ 1, is achieved if and only if the following conditions are met: (44)      ∥w c ∥ = C-1 C c∈[C] w c = 0 b = 0. In the following, we consider the optimization with this new constraint Eq.44. Next, we relax the constraint min c Z c ≥ 1 to C -1 c Z c ≥ 1 (clearly C -1 c Z c ≥ min c∈[C] Z c ) . By the help of Eq.44, this averaged value is calculated as C -1 c∈[C] Z c = C -1 c∈[C] [∥w c -w ′ c (ᾱ)∥ + (b c -b ′ c (ᾱ))] = C -1 c∈[C]   w c - -1 C -1 w c +   (b c -(C -1) -1 c ′ ∈[C]/{c} b c ′     = C -1 c∈[C]   C C -1 ∥w c ∥ +   (b c -(C -1) -1 c ′ ∈[C]/{c} b c ′     = (C -1) -1 c∈[C] ∥w c ∥ . Hence, the relaxed problem can be stated as follows: min (w,b) c ∥w c ∥ 2 + ∥b∥ 2 s.t. c∈[C] ∥w c ∥ ≥ C -1. Clearly, this can be achieved if and only if b = 0 and ∥w c ∥ = C-1 C . Notice that this configuration satisfies C -1 c Z c = min c∈[C] Z c , and hence it is also the global optimum of the original problem Eq.42. Lastly, it is easy to see that these minimizers satisfies Z c = C C -1 ∥w c ∥ +   b -(C -1) -1 c ′ ∈[C]/{c} b c ′   = 1. for any c ∈ [C]. Lemma G.3. Under assumptions A9 and d ≥ C, the minimization Eq.36 is achieved if and only if the following conditions are met:          The convex hull of {w c } c∈[C] forms a regular (C -1)-simplex ∥w c ∥ = C-1 C ∀c ∈ [C] c∈[C] w c = 0 b = 0. (45) Proof. First we show that any point satisfying Eq.45 is a minimizer of Eq.36. Notice that the set of Therefore, a point satisfying Eq.45 is a minimizer Eq.36. Next we prove the inverse. We assume that ({w c } c∈[C] , b) is a minimizer of Eq.36. This point must also be a minimizer of Eq.42, because as we showed previously, there exists a minimizer of Eq.42 satisfying min c∈[C] L c ≥ 1. 
By Lemma G.2, this point satisfies Eq.43. Hence, this point satisfies Eq.45 if the convex hull {w c } c∈C forms a regular (C -1)-simplex. Since Z c = 1 for any c ∈ [C] by Lemma G.2 and Z c ≥ L c ≥ 1, L c = Z c = 1. This implies that the bary-center w ′ c (ᾱ) is the point in ∆ c closest to w c . In the following, we argue that this property implies that ∥w c1 -w c2 ∥ is independent of the choice of distinct pair c 1 , c 2 ∈ [C], implying that the convex hull {w c } c∈[C] forms a regular simplex. For any c 1 ∈ [C], w ′ c1 (ᾱ) = ∥w c1 ∥ /(C -1) = C -1 . Since L c > 0 for any c ∈ [C], all the vector w c are distinct, and hence, w ′ c1 (ᾱ) is perpendicular to w c2 -w ′ c1 (ᾱ). Therefore, w c2 -w ′ c1 ( ᾱ) 2 = ∥w c2 ∥ 2 -w ′ c1 (ᾱ) 2 = (C -1) 2 C 2 -C -2 = C -2 C . Thus, ∥w c2 -w c1 ∥ 2 = w c2 -w ′ c1 (ᾱ) 2 + w c1 -w ′ c1 (ᾱ) 2 = (C -2)/C + 1 = 2C -1 (C -1). This is independent of c 1 and c 2 , implying that the simplex is regular. Lemma G.4. Under assumption A9 and d ≥ C, the minimization Eq.35 is achieved if and only if the following conditions are met:              The convex hull of {w c } c∈[C] forms a regular (C -1)-simplex ∥w c ∥ = C-1 C ∀c ∈ [C] c∈[C] w c = 0 b = 0 h i = C C-1 w yi ∀i ∈ [n]. Proof. We first show that a minimizer of Eq.35 satisfies Eq.46. L c = 1. Therefore, to satisfy the constraint min i∈[n] q i ≥ 1 in Eq.35, {h i } i∈[n] needs to be a maximizer of the equation above. Again by Lemma G.1, this maximizer is given by h i = w yi -w * yi w yi -w * yi = w yi -w ′ yi ( ᾱ) w yi -w ′ yi ( ᾱ) = C C -1 w yi , which is the last condition. We now prove the inverse. We assume that ({w c }  q i = min i∈[n] min c∈[C]/{yi} (w yi -w c ) T C C -1 w yi = C -1 C - -1 C = 1 j (h i ) 2 j = j ( C C -1 w yi ) 2 j = 1. Lemma G.5. Under assumption A9 and d ≥ C, the minimization Eq.33 is achieved if and only if the following conditions are met:                  The ) satisfies Eq.46 and Eq.48 if and only if Eq.47 is met. 
• There exists ({w c } c∈[C] , b, {h i } i∈[n] ) satisfying Eq.47. First we show the first statement. We assume Eq.46 and Eq.48. Then the last equality in Eq.47 holds as follows: j∈[d] (w c ) j = j∈[d] C -1 C h i j = 0, where i ∈ [n] is such that y i = c. The existence is assured by A9. The other equalities in Eq.47 trivially holds since they are equivalent to Eq.46. Conversely, if Eq.47 holds, Eq.46 trivially holds and Eq.48 holds as well since proof of Theorem G.1. By A8, the optimization problem Eq.8 can be reduced to Eq.33. Hence, by Lemma G.5, the minimizer's condition is given by Eq.47, which implies the first three properties of Neural Collapse hold with hc = h * c . The last property can be proved as follows. The distance between h and h * c is given by ∥h -h * c ∥ 2 = ∥h∥ 2 + ∥h * c ∥ 2 -2h T h * c = ∥h∥ 2 + 1 - 2C C -1 h T w c . Hence, argmin c∈[C] ∥h -h * c ∥ = argmin c∈[C] ∥h∥ 2 + 1 - 2C C -1 h T w c = argmax c∈[C] h T w c = argmax c∈[C] h T w c + b c . Since qj can be uniformly lower bounded by L as follows L ≥ n -1 log(1 + e -qj ), i.e., qj ≥ -log e nL -1 , j∈[n] qj e -qj (1 + e -qj ) log(1 + e -qj ) ≥ -log e nL -1 e log(e nL -1) (1 + e log(e nL -1) ) log(1 + e log(e nL -1) ) = -e nL -1 log e nL -1 nLe nL . Here for the inequality, we exploit the fact that function ≥ λ -1 max (e nL -1) 2 log(e nL -1) (2-2λmax) γ(t 0 ) 2λmax . where the last inequality relies on a version of Corollary D.1. Rearranging terms on both sides of the inequality gives, -dL dt (e nL -1) -2 log(e nL -1) -2(1-λmax) ≥ λ -1 max γ(t 0 ) 2λmax . Integration from t 0 to t, with the substitution -n dL dt (e nL -1) -2 = d(e nL -1) -1 dt , gives n -1 (e nL(t) -1) -1 (e nL(t 0 ) -1) -1 (-log z) -2(1-λmax) dz ≥ λ -1 max γ(t 0 ) 2λmax (t -t 0 ). The RHS diverges as t → ∞, and the LHS as a function of t is non-decreasing for z < 1, which is true for all t ≥ t 0 by Lemma D.1 and A10. Thus we can conclude that L -1 → ∞. 
Exploiting this fact, we can easily show ∥θ∥ Λ → ∞ as we discussed in the proof of Lemma E.2.

I EXPERIMENT DETAILS

All empirical figures in this work were generated by the attached notebook. Here we briefly summarize the experimental conditions used to generate these figures.

Logistic Regression (Fig. 1). This plot was generated by sampling 100 samples from two Gaussian distributions N(±µ, σI) in R^2, where µ = [1/√2, 1/√2] and σ = 0.25. The parameters for both the homogeneous and quasi-homogeneous models were initialized as standard random Gaussian vectors. The parameters were trained with full-batch gradient descent with a learning rate η = 0.5 for 1e5 steps. The maximum ℓ2-margin solution was computed using scikit-learn's SVM package (Pedregosa et al., 2011).

Ball Classification (Fig. 4). This plot was generated by sampling 1e4 random samples from the surface of two balls B(±µ, r) in R^3 for 100 linearly spaced radii r ∈ [0, 1]. The mean µ = [0.8660254, 0.4330127, 0.25] was chosen such that ρ µ = 0.5 and ρ P⊥µ = 0.25. The quasi-homogeneous model was defined such that D1 = 1, D2 = 5, and D3 = 10, leading to λ-values Λ = [1, 0.2, 0.1]. The parameters for the homogeneous and quasi-homogeneous models were initialized with the all-ones vector. Using these initializations and SciPy's initial value problem ODE solver (Virtanen et al., 2020), we then simulated gradient flow until T = 1e5. The final value of the classifier for both models and their respective robustness was recorded and used to generate the final plots.
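The constants quoted for the ball classification experiment can be sanity-checked with a few lines of Python. The sketch below is not the attached notebook: it assumes that the λ-values are the reciprocals 1/D_i (which reproduces Λ = [1, 0.2, 0.1]) and that the classifier is a hyperplane through the origin, so that the robustness of w against B(±µ, r) is wᵀµ/∥w∥ - r. The function names are hypothetical.

```python
import math

D = (1, 5, 10)                        # degrees D_1, D_2, D_3 from the text
MU = (0.8660254, 0.4330127, 0.25)     # ball center mu from the text


def lambda_values(degrees):
    """Assumed relation lambda_i = 1 / D_i between degrees and lambda-values."""
    return [1.0 / d for d in degrees]


def norm(v):
    """Euclidean norm of a vector given as a sequence of floats."""
    return math.sqrt(sum(c * c for c in v))


def robustness(w, mu, r):
    """Distance from the hyperplane {x : w.x = 0} to the ball B(mu, r).

    Assumes a bias-free linear classifier; a negative value means the
    hyperplane cuts through the ball.
    """
    return sum(wi * mi for wi, mi in zip(w, mu)) / norm(w) - r
```

Under these assumptions, µ has unit norm, so a classifier aligned with µ has robustness 1 - r, while the axis-aligned classifier w = (1, 0, 0) only reaches about 0.866 - r.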



Linearly separable implies there exists a $w \in \mathbb{R}^d$ such that for all $i \in [n]$, $y_i w^\top x_i \ge 1$.
When Λ is not diagonal, by reparameterizing the model θ → Oθ with a proper orthogonal matrix O, we can diagonalize Λ.
Nearly all neural networks have this property, including those with ReLU activations. For details, see Davis et al. (2020) or Lyu & Li (2019).
The Clarke subdifferential $\partial^\circ_\theta L$ is a generalization of $\nabla_\theta L$ for locally Lipschitz functions. For details, see App. A.
This KKT condition is necessary for optimality since every feasible point satisfies the Mangasarian-Fromovitz constraint qualification (MFCQ) condition (Lemma A.4).
For all $w \in \mathbb{R}^d$ that separate the two balls B(±µ, r), ∥P w∥ > 0.
This problem does not have local minima, but it does have saddle points.
If we consider higher-order homogeneous models, such as a deep linear network, then the resulting maximum-margin bias would prefer sparse solutions, which could erode the robustness.
A regular (C − 1)-simplex is the convex hull of C points where the distance between any pair is the same. Papyan et al. (2020) refer to this simplex centered at the origin as a general simplex Equiangular Tight Frame.
Here we use layer normalization, but similar theorems would hold for other normalization schemes, such as batch normalization.
Our discussion here works with quasi-homogeneity, but we assume homogeneity here for simplicity.



Figure 1: Maximum-margin bias changes with parameterization. Logistic regression, f (x) = β ⊺ x, trained with gradient descent on a homogeneous (left) and quasi-homogeneous (right) parameterization of the regression coefficients β. The dashed black line is the maximum ℓ 2 -margin solution and the solid black line is the gradient descent trained classifier after 1e5 steps. Existing theory predicts the homogeneous model will converge to the maximum ℓ 2 -margin solution. In this work we will show that the quasi-homogeneous model is driven by a different maximum-margin problem.

Figure 3: An illustrative example. A 2D depiction of the binary classification task of learning a linear classifier w to separate two balls B(±µ, r). The robustness l(w) is measured by the minimum Euclidean distance between the decision boundary and the balls.

); Poggio & Liao (2019); Mixon et al. (2022); Rangamani & Banburski-Fahey (2022) studied Neural Collapse in the setting of mean-squared loss and Fang et al. (2021); Tirer & Bruna (2022); Weinan & Wojtowytsch (2022); Zhu et al. (2021); Ji et al. (2021) introduced toy models to explain Neural Collapse in the setting of cross-entropy loss. These toy models are optimization problems over the last-hidden-layer feature vectors and the last-layer parameters, but not including parameters in the earlier layers. Many of these works introduced unjustified explicit regularizations or constraints on the feature vectors in their model. A recent work, Ji et al. (

Note that in Papyan et al. (2020), all the neural networks showing Neural Collapse are trained with the cross-entropy loss and have normalization. Specifically, we consider the C-class classification model f c (x) = w T c h(x, θ ′ ) + b c where the last layer weights w c ∈ R d and bias b c for c ∈ [C] and the last-layer feature h(x, θ ′ ) ∈ R d . The feature vector h(x, θ ′ ) is obtained with layer normalization 10 , and therefore it satisfies

Works such as Chizat & Bach (2020); Ji & Telgarsky (2020); Vardi et al. (2021); Lyu et al. (2021) have made progress in this direction for simple homogeneous networks and could provide a strategy for investigating more complex quasi-homogeneous models.

Influence of initialization. A major limitation of analyzing the maximum-margin bias of gradient flow is that the dynamics in this terminal phase of training are slow to converge or only become evident at extremely impractical training loss levels. Motivated by this limitation, Woodworth et al. (2020) and Moroshko et al. (

Class of Quasi-Homogeneous Models
4 Quasi-Homogeneous Maximum-Margin Bias
5 Quasi-Homogeneous Maximum-Margin can Degrade Robustness
6 A Mechanism Behind Neural Collapse
7 Conclusion
A More Details on Quasi-homogeneous Models
B The Consistency of Theorem 4.1 and the Proper Choice of Λ
C Limitations of our Current Assumptions
D Proof of Lemma 4.1
E Proof of Theorem 4.1
F Proof of Lemma 5.1
G Proof of Theorem 6.1
H Extension to Multi-class Classification
I Experiment Details

A MORE DETAILS ON QUASI-HOMOGENEOUS MODELS

Λ-normalization.

Similar to Theorem B.2 in Lyu & Li (2019), we can show that ∂ • θ f (θ) satisfies a scaling property and a version of Euler's theorem. Lemma A.2. Let f (θ) : R d → R be locally Lipschitz and Λ-quasi-homogeneous. ∂ • θ f satisfies the following scaling property:

Here c = ∥ dθ dt ∥/∥Λθ∥. We prove this lemma right after showing the proof of Theorem 4.1. Proof of Theorem 4.1. By Corollary of Theorem 3.6 in Dutta et al. (2013) (or Theorem C.4 in Lyu & Li (2019)), it suffices to show the following two statements:

{c} , and ∆ denote the standard (C -2)-simplex. Lemma G.1. Under assumption A9, the following equality holds max {hi}∈H min i∈[n]

Furthermore, at these minimizers Z c = 1 for any c ∈ [C]. Proof. Notice that the constraint of this optimization problem is translationally invariant, i.e., for any w ∈ R d and any {w c } c∈[C] satisfying min c∈[C] Z c ≥ 1, {w c + w} c∈[C] also satisfies the constraint. Hence, the minimizer of Eq.42 should satisfy the stationary condition with respect to the derivative of wi for all i ∈ [d], i.e., c∈[C] w c = 0.

({w c } c∈[C] , b) satisfying Eq.45 is non-empty since d ≥ C. All elements of this set satisfy Z c = L c for any c ∈ [C] by the first condition in Eq.45, and are clearly minimizers of Eq.42 by the other conditions in Eq.45 as shown in Lemma G.2. Thus, the elements satisfy min c∈[C] L c ≥ 1. For any c ∈ [C], Z c ≥ L c , and hence ({w c } c∈[C] , b) : min c∈[C] Z c ≥ 1 ⊇ ({w c } c∈[C] , b) : min c∈[C] L c ≥ 1 .

For a minimizer({w c } c∈[C] , b, {h i } i∈[n] ) of Eq.35, ({w c } c∈[C] , b) is a minimizer of Eq.36. Thus, by Lemma G.3, the minimizer satisfies the first four properties. Additionally, by Lemma G.1,

convex hull of $\{w_c\}_{c\in[C]}$ forms a regular $(C-1)$-simplex, $\|w_c\| = \frac{C-1}{C}\ \ \forall c \in [C]$, $\sum_{c\in[C]}w_c = 0$, $b = 0$, $h_i = \frac{C}{C-1}w_{y_i}\ \ \forall i \in [n]$, and $\sum_{i\in[d]}(w_c)_i = 0\ \ \forall c \in [C]$. (47)

Proof. Since Eq.33 has the additional constraint
$$\sum_j (h_i)_j = 0 \quad \forall i \in [n], \qquad (48)$$
compared to Eq.35, by Lemma G.4 it suffices to show

• $(\{w_c\}_{c\in[C]}, b, \{h_i\}_{i\in[n]}$

The existence of $(\{w_c\}_{c\in[C]}, b, \{h_i\}_{i\in[n]})$ satisfying Eq.47 can be seen from the fact that a regular $(C-1)$-simplex lies in a $(C-1)$-dimensional subspace, and we can rotate the simplex such that it is inside the $(d-1)$-dimensional subspace constrained by $\sum_{i\in[d]}(w_c)_i = 0\ \ \forall c \in [C]$, without violating the other equalities in Eq.47.
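One explicit configuration of this kind can be written down directly when d = C. The sketch below is an illustrative construction, not the one used in the paper: it takes w_c proportional to e_c - (1/C)·1, scaled so that ∥w_c∥ = (C-1)/C, sets b = 0 and h_c = (C/(C-1))·w_c, and exposes helpers to check the equalities in Eq.47 and the unit-margin constraint. All function names are hypothetical.

```python
import math


def simplex_weights(C):
    """Rows w_c = s * (e_c - 1/C), with s chosen so that ||w_c|| = (C - 1) / C."""
    s = math.sqrt((C - 1) / C)
    return [[s * ((1.0 if j == c else 0.0) - 1.0 / C) for j in range(C)]
            for c in range(C)]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def class_features(weights):
    """Features h_c = C / (C - 1) * w_c, one per class."""
    C = len(weights)
    return [[C / (C - 1) * x for x in w] for w in weights]


def min_margin(weights, feats):
    """Smallest value of (w_c - w_k)^T h_c over all c != k, with zero bias."""
    C = len(weights)
    return min(dot(weights[c], feats[c]) - dot(weights[k], feats[c])
               for c in range(C) for k in range(C) if k != c)
```

For any C ≥ 2, each row sums to zero across coordinates, the rows sum to the zero vector, every pair of rows is at squared distance 2(C-1)/C, the features have unit norm, and the minimum margin equals 1.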

x ) log(1+e -x ) is an increasing function. With the help of this inequality, we get e -qj ) qj e -qj (1 + e -qj ) log(1 + e -qj )≥ -e nL -1 log e nL -1 nLe nL j∈[n] log(1 + e -qj ) = (1 -e -nL) log e nL -1 .Next, we show a variant of Lemma E.2. By utilizing the lower bound of 1 2 d dt ∥θ∥ 2 Λ and introducing a newly defined smoothed normalized margin γ := -e nL -e -nL ) 2 log e nL -nL -1) 2 log(e nL -1)(2-2λmax) -

• A6 (Normalized Convergence). $\lim_{t\to\infty}\hat\theta(t)$ exists.
• A7 (Conditional Separability). There exists a $\kappa > 0$ such that only $\theta$ with $\|\hat\theta\|_{\Lambda_{\max}} \ge \kappa$ can separate the training data.

Lemma A.3. If f (θ) : R d → R is locally Lipschitz admitting a chain rule and Λ-quasihomogeneous, then it satisfies a version of Euler's theorem, i.e., for all θ

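where $\tau_{\max}(\rho)$ is defined as $\tau_{\max}(\rho) = \max\{\tau(\theta) : \|\theta\|_\Lambda = \rho\} = \lambda_{\min+}^{-1}\log\rho$, and $\lambda_{\min+} = \min_{\lambda_i > 0}\lambda_i$. By Lemma E.1, $\sup_{i\in[n],\,t\ge 0}|f(x_i;\hat\theta(t))| \le k$, implying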

c∈[C] , b, {h i } i∈[n] ) satisfies Eq.46. By Lemma G.3, ({w c } c∈[C] , b) is a minimizer of Eq.36. Since the minimum value of Eq.36 is equivalent to the minimum value of Eq.35, it suffices to show that({w c } c∈[C] , b, {h i } i∈[n]) satisfies the constraints in Eq.35, which can be seen as follows:

ACKNOWLEDGMENTS

We thank Kaifeng Lyu, Ben Sorscher, Daniel Soudry, and Hidenori Tanaka for helpful discussions. D.K. thanks the Open Philanthropy AI Fellowship for support. A.Y. thanks the Masason Foundation for support. S.G. thanks the James S. McDonnell and Simons Foundations, NTT Research, and an NSF CAREER Award for support while at Stanford.


F PROOF OF LEMMA 5.1

In this section, we provide a proof of Lemma 5.1.

Proof of Lemma 5.1. Notice that in the homogeneous case, all the parameters have the largest λ, and therefore P = I and P ⊥ = 0. By substituting these into the expressions of w quasi-hom and l(w quasi-hom ) in Eq.5.1, we can obtain those of w hom and l(w hom ). Therefore, in the following, we focus on proving the expressions for the quasi-homogeneous case.

By symmetry, Eq.4 can be reduced to the following optimization problem over a single ball, Eq.4 = min

The minimization over x ∈ B(µ, r) above can be further reduced as follows:

Hence, the optimization problem Eq.4 can be reduced as follows: Eq.4 = min

Here we split the optimization over w ∈ R d by considering the orthogonal vectors w 1 = P w and w 2 = P ⊥ w separately. Because the objective function ∥w 1 ∥ is independent of w 2 , the last expression above is equivalent to the following:

The maximization over w 2 ∈ R d-m is achieved if w 2 is parallel to P ⊥ µ, and hence, by letting ρ w denote ∥w 2 ∥,

where the maximization over ρ µ ∈ R ≥0 on the second line of the equation above is achieved if and only if µ /r 2 ∥w 1 ∥, and hence the maximization over w 2 ∈ R d-m on the first line is achieved if and only if

Therefore, by substituting Eq.30, we obtain Eq.4 = min w1∈R m ∥w 1 ∥ : .

Note that this quantity is positive since r < 1 by assumption, and the minimization over w 1 ∈ R m is achieved if and only if w 1 is parallel to P µ,

Notice that the optimizers of Eq.29, and hence those of Eq.4, need to satisfy both Eq.32 and Eq.31. This means that the optimizer is unique and is given as follows:

By normalizing this, we obtain the expression in Eq.5. At this minimizer w quasi-hom , the robustness l is obtained as

H EXTENSION TO MULTI-CLASS CLASSIFICATION

In this section, we extend Theorem 4.1 to the case of multi-class classification tasks with cross-entropy loss. The analysis here largely relies on Appendix G in Lyu & Li (2019).

We consider a classification task on data {x i , y i } i∈[n] whose label y i now takes values in where C ∈ N is the number of classes. Our model's output is given by f (x; θ) ∈ R C , and the cross-entropy loss with this model is defined as

where qj := -log c∈[C]/{yj } e -sjc , and s jc := f yj (x j ; θ) -f c (x j ; θ). Notice that s jc is a quasi-homogeneous function. Hence, if Lemma E.2 holds and θ goes to infinity as t → ∞, which we will show later in this section, s jc goes to infinity. Therefore, when t ≫ 1, qj ∼ -log max

and by Taylor expansion of the logarithm log(1 + e -qj ) ∼ e -qj ,

This expression is now equivalent to that of binary classification tasks with the exponential loss. Note that the effective margin is now given by min c∈[C]/{yj } s jc , which implies that its asymptotic behavior at the later stage of training is similar to that with the exponential loss. Being aware of this fact, we can show a variant of Theorem 4.1 with a modified version of the separability condition:

A10 (Strong Separability for CE Loss). There exists a time t 0 such that L(θ(t 0 )) < n -1 log 2.

Under A1, A2, A4, A6, A7, and A10 with cross-entropy loss Eq.49, there exists an α ∈ R such that ψ α (lim t→∞ θ(t)) is a first-order KKT point of the following optimization problem

The modification of our proof of Theorem 4.1 is quite similar to the extension done in Lyu & Li (2019) and is straightforward except for the part where we show the lower bound of d dt log ∥θ∥ Λ and the divergence ∥θ∥ Λ → ∞. Hence, here we will focus on this non-trivial part and ask readers to refer to Lyu & Li (2019)

