MEMORIZATION-DILATION: MODELING NEURAL COLLAPSE UNDER LABEL NOISE

Abstract

The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. During the terminal phase of training a deep neural network, the feature embeddings of all examples of the same class tend to collapse to a single representation, and the features of different classes tend to separate as much as possible. Neural collapse is often studied through a simplified model, called the layer-peeled model, in which the network is assumed to have "infinite expressivity" and can map each data point to any arbitrary representation. In this work we study a more realistic variant of the layer-peeled model, which takes the non-negativity of the features into account. Furthermore, we extend this model to also incorporate the limited expressivity of the network. Empirical evidence suggests that the memorization of noisy data points leads to a degradation (dilation) of the neural collapse. Using a model of the memorization-dilation (M-D) phenomenon, we show one mechanism by which different losses lead to different performance of the trained network on noisy data. Our proofs reveal why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.

1. INTRODUCTION

The empirical success of deep neural networks has accelerated the introduction of new learning algorithms and triggered new applications, at a pace that makes it hard to keep up with profound theoretical foundations and insightful explanations. As one of the few yet particularly appealing theoretical characterizations of overparameterized models trained for canonical classification tasks, Neural Collapse (NC) provides a mathematically elegant formalization of learned feature representations Papyan et al. (2020). To explain NC, consider the following setting. Suppose we are given a balanced dataset $D = \{(x_n^{(k)}, y_n)\}_{k \in [K], n \in [N]} \subset X \times Y$ with instance space $X = \mathbb{R}^d$ and label space $Y = [N] := \{1, \dots, N\}$, i.e., each class $n \in [N]$ has exactly $K$ samples $x_n^{(1)}, \dots, x_n^{(K)}$. We consider network architectures commonly used in classification tasks that are composed of a feature engineering part $g : X \to \mathbb{R}^M$ (which maps an input signal $x \in X$ to its feature representation $g(x) \in \mathbb{R}^M$) and a linear classifier $W(\cdot) + b$ given by a weight matrix $W \in \mathbb{R}^{N \times M}$ as well as a bias vector $b \in \mathbb{R}^N$. Let $w_n$ denote the row vector of $W$ associated with class $n \in [N]$. During training, both classifier components are simultaneously optimized by minimizing the cross-entropy loss. Denoting the feature representation $g(x_n^{(k)})$ of the sample $x_n^{(k)}$ by $h_n^{(k)}$, and the class means and the global mean of the features by

$$\bar{h}_n := \frac{1}{K} \sum_{k=1}^{K} h_n^{(k)}, \qquad \bar{h} := \frac{1}{N} \sum_{n=1}^{N} \bar{h}_n,$$

NC consists of the following interconnected phenomena (where the limits take place as training progresses):

(NC1) Variability collapse. For each class $n \in [N]$, we have $\frac{1}{K} \sum_{k=1}^{K} \big\| h_n^{(k)} - \bar{h}_n \big\|_2 \to 0$.

(NC2) Convergence to simplex equiangular tight frame (ETF) structure. For any $m, n \in [N]$ with $m \neq n$, we have $\big\| \bar{h}_n - \bar{h} \big\|_2 - \big\| \bar{h}_m - \bar{h} \big\|_2 \to 0$ and $\Big\langle \frac{\bar{h}_n - \bar{h}}{\| \bar{h}_n - \bar{h} \|_2}, \frac{\bar{h}_m - \bar{h}}{\| \bar{h}_m - \bar{h} \|_2} \Big\rangle \to -\frac{1}{N-1}$.

(NC3) Convergence to self-duality. For any $n \in [N]$, it holds that $\frac{\bar{h}_n - \bar{h}}{\| \bar{h}_n - \bar{h} \|_2} - \frac{w_n}{\| w_n \|_2} \to 0$.

(NC4) Simplification to nearest-class-center behavior. For any feature representation $u \in \mathbb{R}^M$, it holds that $\arg\max_{n \in [N]} \langle w_n, u \rangle + b_n \to \arg\min_{n \in [N]} \| u - \bar{h}_n \|_2$.

In this paper, we consider a well-known simplified model in which the features $h_n^{(k)}$ are not parameterized by the feature engineering network $g$ but are rather free variables. This model is often referred to as the layer-peeled model or unconstrained features model, see e.g. Lu & Steinerberger (2020); Fang et al. (2021); Zhu et al. (2021). However, as opposed to those contributions, in which the features $h_n^{(k)}$ can take any value in $\mathbb{R}^M$, we consider here the case $h_n^{(k)} \geq 0$ (understood componentwise). This is motivated by the fact that features are typically the outcome of some non-negative activation function, like the Rectified Linear Unit (ReLU) or sigmoid. Moreover, by incorporating the limited expressivity of the network into the layer-peeled model, we propose a new model, called memorization-dilation (MD). Given such model assumptions, we formally prove advantageous effects of the so-called label smoothing (LS) technique Szegedy et al. (2015) (training with a modification of the cross-entropy (CE) loss) in terms of generalization performance. This is further confirmed empirically.

2. RELATED WORK

Studying the nature of neural network optimization is challenging. In the past, a plethora of theoretical models has been proposed to do so Sun (2020). These range from analyses of simple linear networks Kunin et al. (2019); Zhu et al. (2020); Laurent & von Brecht (2018) to non-linear deep neural networks Saxe et al. (2014); Yun et al. (2018). As one prominent framework among others, Neural Tangent Kernels Jacot et al. (2018); Roberts et al. (2021), where neural networks are considered as linear models on top of randomized features, have been broadly leveraged for studying deep neural networks and their learning properties. Many of the theoretical properties of deep neural networks in the regime of overparameterization are still unexplained. Nevertheless, certain peculiarities have emerged recently. Among those, so-called "benign overfitting" Bartlett et al. (2019); Li et al. (2021), where deep models are capable of perfectly fitting potentially noisy data while retaining accurate predictions, has recently attracted attention. Memorization has been identified as one significant factor contributing to this effect Arpit et al. (2017); Sanyal et al. (2021), which also relates to our studies. No less interesting, the learning risk of highly overparameterized models shows a double-descent behavior when varying the model complexity.
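To make the limiting geometry of (NC2) concrete, the following NumPy sketch (an illustration added for this text, not part of the paper's experiments) constructs a simplex ETF of $N$ class means and verifies the two defining properties: equal norms of the centered means, and pairwise cosines equal to $-1/(N-1)$.

```python
import numpy as np

def simplex_etf(n_classes: int) -> np.ndarray:
    """Rows are the N vertices of a simplex ETF in R^N (up to rotation/scaling).
    Standard construction: sqrt(N/(N-1)) * (I - (1/N) * ones)."""
    eye = np.eye(n_classes)
    ones = np.full((n_classes, n_classes), 1.0 / n_classes)
    return np.sqrt(n_classes / (n_classes - 1)) * (eye - ones)

N = 5
means = simplex_etf(N)                 # centered class means (rows sum to zero)
norms = np.linalg.norm(means, axis=1)

# (NC2) equinorm: all centered class means have the same length
assert np.allclose(norms, norms[0])

# (NC2) equiangular: every pairwise cosine equals -1/(N-1)
unit = means / norms[:, None]
cos = unit @ unit.T
off_diag = cos[~np.eye(N, dtype=bool)]
assert np.allclose(off_diag, -1.0 / (N - 1))
```

The same checks can be run on the empirical class means of a trained network to measure how far training has progressed toward the ETF configuration.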


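Since the label smoothing technique analyzed in this paper is a small modification of cross-entropy, a minimal sketch may help fix notation (function and variable names here are ours, not from the paper): with smoothing parameter alpha, the one-hot target is replaced by the mixture (1 - alpha) * one_hot + alpha / N, and CE is computed against this softened target; alpha = 0 recovers the plain CE loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def label_smoothing_ce(logits, labels, alpha=0.1):
    """Cross-entropy against targets softened as (1 - alpha) * one_hot + alpha / N.
    alpha = 0 gives the standard cross-entropy loss."""
    n_classes = logits.shape[-1]
    one_hot = np.eye(n_classes)[labels]
    target = (1.0 - alpha) * one_hot + alpha / n_classes
    log_p = np.log(softmax(logits))
    return -(target * log_p).sum(axis=-1).mean()

logits = np.array([[4.0, 0.0, 0.0], [0.5, 2.0, 0.1]])
labels = np.array([0, 1])
ce = label_smoothing_ce(logits, labels, alpha=0.0)   # plain cross-entropy
ls = label_smoothing_ce(logits, labels, alpha=0.1)
# For confident correct predictions, the smoothed target keeps mass on the
# other classes, so the LS loss exceeds plain CE and discourages overconfidence.
assert ls > ce
```

This regularizing pull away from extreme logits is the mechanism whose interaction with memorized noisy labels the MD model makes precise.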