MEMORIZATION-DILATION: MODELING NEURAL COLLAPSE UNDER LABEL NOISE

Abstract

The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. During the terminal phase of training a deep neural network, the feature embeddings of all examples of the same class tend to collapse to a single representation, and the features of different classes tend to separate as much as possible. Neural collapse is often studied through a simplified model, called the layer-peeled model, in which the network is assumed to have "infinite expressivity" and can map each data point to any arbitrary representation. In this work we study a more realistic variant of the layer-peeled model, which takes the positivity of the features into account. Furthermore, we extend this model to also incorporate the limited expressivity of the network. Empirical evidence suggests that the memorization of noisy data points leads to a degradation (dilation) of the neural collapse. Using a model of the memorization-dilation (M-D) phenomenon, we show one mechanism by which different losses lead to different performances of the trained network on noisy data. Our proofs reveal why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.

1. INTRODUCTION

The empirical success of deep neural networks has accelerated the introduction of new learning algorithms and triggered new applications, at a pace that makes it hard for profound theoretical foundations and insightful explanations to keep up. As one of the few yet particularly appealing theoretical characterizations of overparameterized models trained for canonical classification tasks, Neural Collapse (NC) provides a mathematically elegant formalization of learned feature representations (Papyan et al., 2020). To explain NC, consider the following setting. Suppose we are given a balanced dataset D = (x_n^(k), y_n)_{k∈[K], n∈[N]} ⊂ X × Y in the instance space X = R^d and label space Y = [N] := {1, . . . , N}, i.e., each class n ∈ [N] has exactly K samples x_n^(k), k ∈ [K]. We consider network architectures commonly used in classification tasks that are composed of a feature engineering part g : X → R^M (which maps an input signal x ∈ X to its feature representation g(x) ∈ R^M) and a linear classifier W(·) + b given by a weight matrix W ∈ R^{N×M} as well as a bias vector b ∈ R^N. Let w_n denote the row vector of W associated with class n ∈ [N]. During training, both classifier components are simultaneously optimized by minimizing the cross-entropy loss.

* These authors contributed equally to this work.
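To make the setup concrete, the following is a minimal sketch (not the paper's code) of the classification head described above: a linear classifier W(·) + b acting on feature vectors h = g(x), trained by gradient descent on the cross-entropy loss. The shapes N, M, K, the random features, and the learning rate are illustrative assumptions; a `smoothing` parameter is included to show the label-smoothing variant of cross-entropy mentioned in the abstract.

```python
import numpy as np

# Illustrative dimensions: N classes, M feature dimensions, K samples per class.
N, M, K = 3, 5, 4
rng = np.random.default_rng(0)

# Stand-in features h = g(x) for all N*K samples (rows) and their labels;
# in the layer-peeled spirit, the features are treated as given inputs here.
H = rng.normal(size=(N * K, M))
y = np.repeat(np.arange(N), K)

W = np.zeros((N, M))   # weight matrix, one row w_n per class n
b = np.zeros(N)        # bias vector

def cross_entropy(H, y, W, b, smoothing=0.0):
    """Mean cross-entropy of softmax(W h + b) against labels y.

    smoothing > 0 gives label smoothing: target (1 - eps) on the true
    class and eps/(N-1) spread over the remaining classes.
    """
    logits = H @ W.T + b
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n_classes = W.shape[0]
    targets = np.full((len(y), n_classes), smoothing / (n_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - smoothing
    return float(-(targets * log_probs).sum(axis=1).mean())

# A few steps of plain gradient descent on the classifier (W, b) only.
for _ in range(300):
    logits = H @ W.T + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad_logits = (probs - np.eye(N)[y]) / len(y)   # d(mean CE)/d logits
    W -= 0.1 * (grad_logits.T @ H)
    b -= 0.1 * grad_logits.sum(axis=0)
```

At initialization (W = 0, b = 0) the loss equals log N; training drives it below that value as the classifier fits the features. Plugging `smoothing > 0` into the loss is the modification whose regularization effect is analyzed later in the paper.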

