SEPARATION AND CONCENTRATION IN DEEP NETWORKS

Abstract

Numerical experiments demonstrate that deep neural network classifiers progressively separate class distributions around their mean, achieving linear separability on the training set and increasing the Fisher discriminant ratio. We explain this mechanism with two types of operators. We prove that a rectifier without biases, applied to sign-invariant tight frames, can separate class means and increase Fisher ratios. In contrast, a soft-thresholding on tight frames can reduce within-class variabilities while preserving class means. Variance reduction bounds are proved for Gaussian mixture models. For image classification, we show that separation of class means can be achieved with rectified wavelet tight frames that are not learned; this defines a scattering transform. Learning 1 × 1 convolutional tight frames along scattering channels and applying a soft-thresholding reduces within-class variabilities. The resulting scattering network reaches the classification accuracy of ResNet-18 on CIFAR-10 and ImageNet, with fewer layers and no learned biases.

1. INTRODUCTION

Several numerical works (Oyallon, 2017; Papyan, 2020; Papyan et al., 2020) have shown that deep neural network classifiers (LeCun et al., 2015) progressively concentrate each class around separated means, until the last layer, where within-class variability may nearly "collapse" (Papyan et al., 2020). The linear separability of a class mixture is characterized by the Fisher discriminant ratio (Fisher, 1936; Rao, 1948), which measures the separation of class means relative to the variability within each class, as measured by their covariances. The neural collapse appears through a considerable increase of the Fisher discriminant ratio during training (Papyan et al., 2020). No mathematical mechanism has yet been provided to explain this separation and concentration of probability measures.

Linear separability and Fisher ratios can be increased by separating class means without increasing the variability of each class, or by concentrating each class around its mean while preserving the mean separation. This paper shows that these separation and concentration properties can be achieved with one-layer network operators using different pointwise non-linearities. We cascade these operators to define structured deep neural networks which reach high classification accuracies and can be analyzed mathematically.

Section 2 studies two-layer networks computed with a linear classifier applied to ρF, where F is linear and ρ is a pointwise non-linearity. First, we show that ρF can separate class means with a ReLU ρ_r(u) = max(u, 0) and a sign-invariant F. We prove that ρ_r F then increases the Fisher ratio. As in Parseval networks (Cisse et al., 2017), F is normalized by imposing that it is a tight frame, which satisfies F^T F = Id. Second, to concentrate the variability of each class around its mean, we use a shrinking non-linearity implemented by a soft-thresholding ρ_t.
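As a minimal numerical sketch (ours, not taken from the paper's released code), the separation mechanism can be checked on two class distributions with identical means and identical covariances: they are linearly inseparable, with a Fisher ratio near zero, yet the rectified tight-frame coefficients ρ_r(Fx) have distinct class means. We take the sign-invariant tight frame F = [Id; -Id]/√2; the helper `fisher_ratio`, which estimates trace(Σ_W^{-1} Σ_B), is our own naming.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_ratio(classes):
    """Empirical trace(Sigma_W^{-1} Sigma_B) for equiprobable classes."""
    means = np.stack([s.mean(axis=0) for s in classes])
    centered = means - means.mean(axis=0)
    sigma_b = centered.T @ centered / len(classes)
    sigma_w = np.mean([np.cov(s, rowvar=False) for s in classes], axis=0)
    return np.trace(np.linalg.solve(sigma_w, sigma_b))

d, n = 8, 50000
# Sign-invariant tight frame: rows come in +/- pairs, and F^T F = Id
F = np.concatenate([np.eye(d), -np.eye(d)]) / np.sqrt(2)
relu = lambda u: np.maximum(u, 0.0)

# Two classes with equal means (0) and equal covariances (Id):
# linearly inseparable, but the distributions differ coordinate-wise.
x1 = rng.standard_normal((n, d))                     # Gaussian class
x2 = rng.laplace(scale=1 / np.sqrt(2), size=(n, d))  # Laplace class, unit variance

# Sign invariance implies no information loss: F^T rho_r(F x) = x / 2
x = rng.standard_normal(d)
assert np.allclose(F.T @ relu(F @ x), x / 2)

before = fisher_ratio([x1, x2])
after = fisher_ratio([relu(x1 @ F.T), relu(x2 @ F.T)])
print(f"Fisher ratio before: {before:.5f}, after rho_r F: {after:.5f}")
```

The rectifier turns a difference of distributions into a difference of means, since E(ρ_r(f·x)) depends on the full distribution of f·x and not only on its mean, so the Fisher ratio rises from near zero to a clearly positive value.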
For Gaussian mixture models, we prove that ρ_t F concentrates within-class variabilities while nearly preserving class means, under appropriate sparsity hypotheses. A linear classifier applied to these ρF defines two-layer neural networks with no learned bias parameters in the hidden layer, whose properties are studied mathematically and numerically.

Cascading several convolutional tight frames with ReLUs or soft-thresholdings defines a deep neural network which progressively separates class means and concentrates their variability. One may wonder whether we can avoid learning these frames by using prior information on the geometry of images. Section 3 shows that the class mean separation can be computed with wavelet tight frames, which are not learned. They separate scales, directions and phases, which are known groups of transformations. A cascade of wavelet filters and rectifiers defines a scattering transform (Mallat, 2012), which has previously been applied to image classification (Bruna & Mallat, 2013; Oyallon & Mallat, 2015). However, such networks do not reach state-of-the-art classification results. We show that important improvements are obtained by learning 1 × 1 convolutional projectors and tight frames, which concentrate within-class variabilities with soft-thresholdings. This defines a bias-free deep scattering network whose classification accuracy reaches that of ResNet-18 (He et al., 2016) on CIFAR-10 and ImageNet. Code to reproduce all experiments of the paper is available at https://github.com/j-zarka/separation_concentration_deepnets.

The main contributions of this paper are:

• A double mathematical mechanism to separate and concentrate distinct probability measures, with a rectifier and a soft-thresholding applied to tight frames. The increase of the Fisher ratio is proved for tight-frame separation with a rectifier. Bounds on within-class covariance reduction are proved for a soft-thresholding on Gaussian mixture models.

• The introduction of a bias-free scattering network which reaches ResNet-18 accuracy on CIFAR-10 and ImageNet. Learning is reduced to 1 × 1 convolutional tight frames which concentrate variabilities along scattering channels.
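The concentration mechanism can be sketched in the same way (again our own illustration, not the paper's code): taking F = Id as a trivial tight frame and class means that are sparse in the frame domain, as the sparsity hypothesis requires, a soft-thresholding sharply shrinks the within-class variance while only mildly reducing the distance between class means. All constants below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(u, lam):
    """rho_t(u) = sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

d, n, lam = 64, 20000, 1.0
# Sparse class means: a few large coefficients in the (trivial) frame domain
mu = np.zeros(d)
mu[:4] = 5.0
x1 = mu + rng.standard_normal((n, d))   # class 1: mean +mu, white Gaussian noise
x2 = -mu + rng.standard_normal((n, d))  # class 2: mean -mu

def within_variance(s1, s2):
    """Trace of the average within-class covariance."""
    return 0.5 * (s1.var(axis=0).sum() + s2.var(axis=0).sum())

def mean_separation(s1, s2):
    return np.linalg.norm(s1.mean(axis=0) - s2.mean(axis=0))

t1, t2 = soft_threshold(x1, lam), soft_threshold(x2, lam)
print("within-class variance:", within_variance(x1, x2), "->", within_variance(t1, t2))
print("class-mean separation:", mean_separation(x1, x2), "->", mean_separation(t1, t2))
```

On the coordinates where both class means vanish, the thresholding suppresses most of the noise; on the few large-mean coordinates, it only shifts the means by λ, which is small relative to their amplitude.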

2. CLASSIFICATION BY SEPARATION AND CONCENTRATION

The last hidden layer of a neural network defines a representation Φ(x), to which a linear classifier is applied. This section studies the separation of class means and the concentration of class variabilities for Φ = ρF in a two-layer network.

2.1. TIGHT FRAME RECTIFICATION AND THRESHOLDING

We begin by briefly reviewing the properties of linear classifiers and Fisher discriminant ratios. We then analyze the separation and concentration of Φ = ρF , when ρ is a rectifier or a soft-thresholding and F is a tight frame.

Linear classification and Fisher ratio

We consider a random data vector x ∈ R^d whose class labels are y(x) ∈ {1, ..., C}. Let x_c be a random vector representing the class c, whose probability distribution is the distribution of x conditioned on y(x) = c. We suppose for simplicity that all classes are equiprobable. Ave_c denotes the average C^{-1} Σ_{c=1}^{C}. We compute a representation of x with an operator Φ which is standardized, so that E(Φ(x)) = 0 and each coefficient of Φ(x) has unit variance. The class means μ_c = E(Φ(x_c)) thus satisfy Ave_c μ_c = 0.

A linear classifier (W, b) on Φ(x) returns the index of the maximum coordinate of W Φ(x) + b ∈ R^C. An optimal linear classifier (W, b) minimizes the probability of a classification error. Optimal linear classifiers are estimated by minimizing a regularized loss function on the training data. Neural networks often use logistic linear classifiers, which minimize a cross-entropy loss. The standardization of the last layer Φ(x) is implemented with a batch normalization (Ioffe & Szegedy, 2015).

A linear classifier can have a small error if the typical sets of each Φ(x_c) have little overlap, and in particular if the class means μ_c = E(Φ(x_c)) are sufficiently separated relative to the variability of each class. Under the Gaussian hypothesis, the variability of each class is measured by the covariance Σ_c of Φ(x_c). Let Σ_W = Ave_c Σ_c be the average within-class covariance and Σ_B = Ave_c μ_c μ_c^T be the between-class covariance of the means. The within-class covariance can be whitened and normalized to Id by transforming Φ(x) with the square root Σ_W^{-1/2} of Σ_W^{-1}. All classes

