SEPARATION AND CONCENTRATION IN DEEP NETWORKS

Abstract

Numerical experiments demonstrate that deep neural network classifiers progressively separate class distributions around their mean, achieving linear separability on the training set and increasing the Fisher discriminant ratio. We explain this mechanism with two types of operators. We prove that a rectifier without biases applied to sign-invariant tight frames can separate class means and increase Fisher ratios. In contrast, a soft-thresholding on tight frames can reduce within-class variabilities while preserving class means. Variance reduction bounds are proved for Gaussian mixture models. For image classification, we show that separation of class means can be achieved with rectified wavelet tight frames that are not learned; this defines a scattering transform. Learning 1 × 1 convolutional tight frames along scattering channels and applying a soft-thresholding reduces within-class variabilities. The resulting scattering network reaches the classification accuracy of ResNet-18 on CIFAR-10 and ImageNet, with fewer layers and no learned biases.

1. INTRODUCTION

Several numerical works (Oyallon, 2017; Papyan, 2020; Papyan et al., 2020) have shown that deep neural network classifiers (LeCun et al., 2015) progressively concentrate each class around separated means, until the last layer, where within-class variability may nearly "collapse" (Papyan et al., 2020). The linear separability of a class mixture is characterized by the Fisher discriminant ratio (Fisher, 1936; Rao, 1948), which measures the separation of class means relative to the variability within each class, as measured by their covariances. The neural collapse appears through a considerable increase of the Fisher discriminant ratio during training (Papyan et al., 2020). No mathematical mechanism has yet been provided to explain this separation and concentration of probability measures. Linear separability and Fisher ratios can be increased by separating class means without increasing the variability of each class, or by concentrating each class around its mean while preserving the mean separation. This paper shows that these separation or concentration properties can be achieved with one-layer network operators using different pointwise non-linearities. We cascade these operators to define structured deep neural networks which reach high classification accuracies and can be analyzed mathematically.

Section 2 studies two-layer networks computed with a linear classifier applied to ρF, where F is linear and ρ is a pointwise non-linearity. First, we show that ρF can separate class means with a ReLU ρ_r(u) = max(u, 0) and a sign-invariant F. We prove that ρ_r F then increases the Fisher ratio. As in Parseval networks (Cisse et al., 2017), F is normalized by imposing that it is a tight frame satisfying F^T F = Id. Second, to concentrate the variability of each class around its mean, we use a shrinking non-linearity implemented by a soft-thresholding ρ_t.
For Gaussian mixture models, we prove that ρ_t F concentrates within-class variabilities while nearly preserving class means, under appropriate sparsity hypotheses. A linear classifier applied to these ρF defines two-layer neural networks with no learned bias parameters in the hidden layer, whose properties are studied mathematically and numerically. Cascading several convolutional tight frames with ReLUs or soft-thresholdings defines a deep neural network which progressively separates class means and concentrates their variability. One may wonder if we can avoid learning these frames by using prior information on the geometry of images. Section 3 shows that the class mean separation can be computed with wavelet tight frames, which are not learned. They separate scales, directions and phases, which are known groups of transformations. A cascade of wavelet filters and rectifiers defines a scattering transform (Mallat, 2012), which has previously been applied to image classification (Bruna & Mallat, 2013; Oyallon & Mallat, 2015). However, such networks do not reach state-of-the-art classification results. We show that important improvements are obtained by learning 1 × 1 convolutional projectors and tight frames, which concentrate within-class variabilities with soft-thresholdings. It defines a bias-free deep scattering network whose classification accuracy reaches that of ResNet-18 (He et al., 2016) on CIFAR-10 and ImageNet. Code to reproduce all experiments of the paper is available at https://github.com/j-zarka/separation_concentration_deepnets.

The main contributions of this paper are:

• A double mathematical mechanism to separate and concentrate distinct probability measures, with a rectifier and a soft-thresholding applied to tight frames. The increase of the Fisher ratio is proved for tight frame separation with a rectifier. Bounds on within-class covariance reduction are proved for a soft-thresholding on Gaussian mixture models.
• The introduction of a bias-free scattering network which reaches ResNet-18 accuracy on CIFAR-10 and ImageNet. Learning is reduced to 1 × 1 convolutional tight frames which concentrate variabilities along scattering channels.

2. CLASSIFICATION BY SEPARATION AND CONCENTRATION

The last hidden layer of a neural network defines a representation Φ(x), to which is applied a linear classifier. This section studies the separation of class means and class variability concentration for Φ = ρF in a two-layer network.

2.1. TIGHT FRAME RECTIFICATION AND THRESHOLDING

We begin by briefly reviewing the properties of linear classifiers and Fisher discriminant ratios. We then analyze the separation and concentration of Φ = ρF , when ρ is a rectifier or a soft-thresholding and F is a tight frame.

Linear classification and Fisher ratio

We consider a random data vector x ∈ R^d whose class labels are y(x) ∈ {1, ..., C}. Let x_c be a random vector representing the class c, whose probability distribution is the distribution of x conditioned on y(x) = c. We suppose for simplicity that all classes are equiprobable. Ave_c denotes the average C^{-1} Σ_{c=1}^C. We compute a representation of x with an operator Φ which is standardized, so that E(Φ(x)) = 0 and each coefficient of Φ(x) has a unit variance (Ioffe & Szegedy, 2015). A linear classifier can have a small error if the typical sets of each Φ(x_c) have little overlap, and in particular if the class means µ_c = E(Φ(x_c)) are sufficiently separated relative to the variability of each class. Under the Gaussian hypothesis, the variability of each class is measured by the covariance Σ_c of Φ(x_c). Let Σ_W = Ave_c Σ_c be the average within-class covariance and Σ_B = Ave_c µ_c µ_c^T be the between-class covariance of the means. The within-class covariance can be whitened and normalized to Id by transforming Φ(x) with the square root Σ_W^{-1/2} of Σ_W^{-1}. All pairs of classes c, c' are highly separated if ||Σ_W^{-1/2} µ_c − Σ_W^{-1/2} µ_{c'}|| ≫ 1. This separation is captured by the Fisher discriminant ratio Σ_W^{-1} Σ_B. We shall measure its trace:

C^{-1} Tr(Σ_W^{-1} Σ_B) = Ave_c ||Σ_W^{-1/2} µ_c||^2.    (1)

Fisher ratios have been used to train deep neural networks as a replacement for the cross-entropy loss (Dorfer et al., 2015; Stuhlsatz et al., 2012; Sun et al., 2019; Wu et al., 2017; Sultana et al., 2018; Li et al., 2016). In this paper, we use their analytic expression to analyze the improvement of linear classifiers. Linear classification obviously cannot be improved with a linear representation Φ. The following proposition gives a simple condition under which a non-linear representation improves (or maintains) the error of linear classifiers.

Proposition 2.1. If Φ has a linear inverse, then it decreases (or maintains) the error of the optimal linear classifier, and it increases (or maintains) the Fisher ratio (1).
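As an illustration, the Fisher ratio trace (1) can be estimated from samples. The following sketch is my own helper, not taken from the paper's code; it adds a small ridge regularization for numerical stability, which is an implementation choice:

```python
import numpy as np

def fisher_ratio_trace(features, labels):
    """Estimate C^{-1} Tr(Sigma_W^{-1} Sigma_B) = Ave_c ||Sigma_W^{-1/2} mu_c||^2
    from samples. Assumes the representation is standardized (zero global mean)."""
    classes = np.unique(labels)
    d = features.shape[1]
    mus, Sigma_W = [], np.zeros((d, d))
    for c in classes:
        xc = features[labels == c]
        mus.append(xc.mean(axis=0))
        Sigma_W += np.cov(xc, rowvar=False)   # within-class covariance of class c
    Sigma_W /= len(classes)                   # Sigma_W = Ave_c Sigma_c
    # Small ridge term avoids numerical issues when Sigma_W is near-singular
    Sigma_W_inv = np.linalg.inv(Sigma_W + 1e-8 * np.eye(d))
    return np.mean([mu @ Sigma_W_inv @ mu for mu in mus])
```

On two well-separated Gaussian classes this ratio is large, while classes with identical means give a ratio close to 0.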
To prove this result, observe that if Φ has a linear inverse Φ^{-1}, then W x = W' Φ(x) with W' = W Φ^{-1}. The minimum classification error obtained by optimizing W is thus above the error obtained by optimizing W'. Appendix A proves that the Fisher ratio (1) is also increased or preserved. There are qualitatively two types of non-linear operators that increase the Fisher ratio Σ_W^{-1} Σ_B. Separation operators typically increase the distance between the class means without increasing the within-class covariance Σ_W. We first study such operators having a linear inverse, which guarantees through Proposition 2.1 that they increase the Fisher ratio. We then study concentration operators, which reduce the variability Σ_W with non-linear shrinking operators. These are not invertible, which requires a finer analysis of their properties.

Separation by tight frame rectification. Let Φ = ρF be an operator which computes the first layer of a neural network, where ρ is a pointwise non-linearity and F is linear. We first study separation operators computed with a ReLU ρ_r(u) = max(u, 0) applied to an invertible sign-invariant matrix. Such a matrix has rows that can be regrouped in pairs of opposite signs. It can thus be written F = [−F̃^T, F̃^T]^T where F̃ is invertible. The operator ρF separates coefficients according to their sign. Since ρ_r(u) − ρ_r(−u) = u, it results that Φ = ρ_r F is linearly invertible. According to Proposition 2.1, it increases (or maintains) the Fisher ratio, and we want to choose F to maximize this increase. Observe that ρ_r(αu) = α ρ_r(u) if α ≥ 0. We can thus normalize the rows f_m of F without affecting linear classification performance. To ensure that F ∈ R^{p×d} is invertible with a stable inverse, we impose that it is a normalized tight frame of R^d satisfying F^T F = Id and ||f_m||^2 = d/p for 1 ≤ m ≤ p.
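A minimal sketch of the linear invertibility argument, using an orthogonal F̃ as the simplest invertible matrix and the normalization ||f_m||^2 = d/p with p = 2d (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# F_tilde: a random orthogonal matrix (an invertible, trivially stable choice)
F_tilde, _ = np.linalg.qr(rng.normal(size=(d, d)))
# Sign-invariant normalized tight frame F = [F_tilde; -F_tilde] / sqrt(2):
# F^T F = Id, and each row has norm sqrt(d/p) with p = 2d
F = np.vstack([F_tilde, -F_tilde]) / np.sqrt(2)
relu = lambda u: np.maximum(u, 0.0)

x = rng.normal(size=d)
z = relu(F @ x)                               # Phi(x) = rho_r F x
# Linear inverse: rho_r(u) - rho_r(-u) = u, so z[:d] - z[d:] = F_tilde x / sqrt(2)
x_rec = np.sqrt(2) * F_tilde.T @ (z[:d] - z[d:])
assert np.allclose(F.T @ F, np.eye(d))        # tight frame property
assert np.allclose(x_rec, x)                  # Phi is linearly invertible
```

The recombination z[:d] − z[d:] is the linear map that inverts the rectifier, which is what Proposition 2.1 requires.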
The tight frame can be interpreted as a rotation operator in a higher-dimensional space, which aligns the axes and the directions along which ρ_r performs the sign separation. This rotation must be adapted in order to optimize the separation of class means. The tight frame property of F can be interpreted as a normalization which simplifies the mathematical analysis. Suppose that all classes x_c of x have a Gaussian distribution with the same zero mean µ_c = 0, but different covariances Σ_c. These classes are not linearly separable because they have the same mean, and the Fisher ratio is 0. Applying ρ_r F can separate these classes and improve the Fisher ratio. Indeed, if z is a zero-mean Gaussian random variable, then E(max(z, 0)) = (2π)^{-1/2} E(z^2)^{1/2}, so we verify that for F = [−F̃^T, F̃^T]^T,

E(ρ_r F x_c) = (2π)^{-1/2} ( diag(F̃ Σ_c F̃^T)^{1/2}, diag(F̃ Σ_c F̃^T)^{1/2} ).

The Fisher ratio can then be optimized by maximizing the covariance Σ_B between the mean vectors diag(F̃ Σ_c F̃^T)^{1/2} over all classes c. If we know a priori that x_c and −x_c have the same probability distribution, as in the Gaussian example, then we can replace ρ_r by the absolute value ρ_a(u) = |u| = ρ_r(u) + ρ_r(−u), and ρ_r F by ρ_a F̃, which halves the frame size.

Concentration by tight frame soft-thresholding. If the class means of x are already separated, then we can increase the Fisher ratio with a non-linear Φ that concentrates each class around its mean. The operator Φ must reduce the within-class variance while preserving the class separation. This can be interpreted as a non-linear noise removal if we consider the within-class variability as an additive noise relative to the class mean. It can be done with the soft-thresholding estimators introduced in Donoho & Johnstone (1994). A soft-thresholding ρ_t(u) = sign(u) max(|u| − λ, 0) shrinks the amplitude of u by λ in order to reduce its variance, while introducing a bias that depends on λ.
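The half-Gaussian mean identity E(max(z, 0)) = (2π)^{-1/2} E(z^2)^{1/2}, which drives this separation of zero-mean classes, is easy to check numerically (a Monte Carlo sketch with an arbitrary σ):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.5                                   # arbitrary standard deviation
z = rng.normal(0.0, sigma, size=1_000_000)    # zero-mean Gaussian samples
empirical = np.maximum(z, 0.0).mean()         # E(max(z, 0)) by Monte Carlo
predicted = sigma / np.sqrt(2 * np.pi)        # (2*pi)^(-1/2) * E(z^2)^(1/2)
assert abs(empirical - predicted) < 0.01
```

Two zero-mean classes with different variances thus acquire different means after the rectifier, which is exactly what makes them linearly separable.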
Donoho & Johnstone (1994) proved that soft-thresholding estimators are highly effective for estimating signals that have a sparse representation in a tight frame F. To evaluate more easily the effect of a tight frame soft-thresholding on the class means, we apply the linear reconstruction F^T to ρ_t F x, which defines a representation Φ(x) = F^T ρ_t F x. For a strictly positive threshold, this operator is not invertible, so we cannot apply Proposition 2.1 to prove that the Fisher ratio increases. We study directly the impact of Φ on the mean and covariance of each class. Let x_c be the vector representing the class c. The mean µ_c = E(x_c) is transformed into µ̃_c = E(Φ(x_c)), and the covariance Σ_c of x_c into the covariance Σ̃_c of Φ(x_c). The average covariances are Σ_W = Ave_c Σ_c and Σ̃_W = Ave_c Σ̃_c. Suppose that each x_c is a Gaussian mixture, with a potentially large number of Gaussian components centered at µ_{c,k}, all with a fixed covariance σ^2 Id:

p_c = Σ_k π_{c,k} N(µ_{c,k}, σ^2 Id).    (2)

This model is quite general, since it amounts to covering the typical set of realizations of x_c with a union of balls of radius σ, centered at the (µ_{c,k})_k. The following theorem relates the reduction of within-class covariance to the sparsity of F µ_{c,k}. It relies on the soft-thresholding estimation results of Donoho & Johnstone (1994). For simplicity, we suppose that the tight frame is an orthogonal basis, but the result can be extended to general normalized tight frames. The sparsity is expressed through the decay of sorted basis coefficients. For a vector z ∈ R^d, we denote by z_{(r)} a coefficient of rank r: |z_{(r)}| ≥ |z_{(r+1)}| for 1 ≤ r < d. The theorem imposes a condition on the amplitude decay of (F µ_{c,k})_{(r)} as r increases, which is a sparsity measure. We write a(r) ∼ b(r) if C_1 a(r) ≤ b(r) ≤ C_2 a(r), where C_1 and C_2 depend neither on d nor on σ.
The theorem derives upper bounds on the reduction of within-class covariances and on the displacements of class means. The constants do not depend on d as it increases to ∞, nor on σ as it decreases to 0.

Theorem 2.2. Under the mixture model hypothesis (2), we have:

Tr(Σ_W) = Tr(Σ_M) + σ^2 d, with Tr(Σ_M) = C^{-1} Σ_{c,k} π_{c,k} ||µ_c − µ_{c,k}||^2.    (3)

If there exists s > 1/2 such that |(F µ_{c,k})_{(r)}| ∼ r^{-s}, then a tight frame soft-thresholding with threshold λ = σ √(2 log d) satisfies:

Tr(Σ̃_W) ≤ 2 Tr(Σ_M) + O(σ^{2−1/s} log d),    (4)

and all class means satisfy:

||µ̃_c − µ_c||^2 = O(σ^{2−1/s} log d).    (5)

Under appropriate sparsity hypotheses, the theorem proves that applying Φ = F^T ρ_t F considerably reduces the trace of the within-class covariance. The Gaussian variance σ^2 d is dominant in (3) and is reduced to O(σ^{2−1/s} log d) in (4). The upper bound (5) also proves that F^T ρ_t F creates a relatively small displacement of class means, which is proportional to log d. This is important to ensure that all class means remain well separated. These bounds qualitatively explain the increase of Fisher ratios, but they are not sufficient to prove a precise bound on these ratios. In numerical experiments, the threshold value of the theorem is adjusted as follows. Non-asymptotic optimal threshold values have been tabulated as a function of d by Donoho & Johnstone (1994). For the range of d used in our applications, a nearly optimal threshold is λ = 1.5 σ. We rescale the frame variance σ^2 by standardizing the input x so that it has a zero mean and each coefficient has a unit variance. In high dimension d, the within-class variance typically dominates the variance between class means. Under the unit variance assumption we have Tr(Σ_W) ≈ d. If F ∈ R^{p×d} is a normalized tight frame, then we also verify as in (3) that Tr(Σ_W) ≈ σ^2 p, so σ^2 ≈ d/p. It results that we choose λ = 1.5 √(d/p).
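The variance reduction and the small mean displacement of Theorem 2.2 can be observed on a toy class with a single mixture component. This is a sketch of my own, not the paper's experiment: it takes F = Id as the orthonormal frame, an assumed sparse mean, and the tabulated threshold λ = 1.5σ discussed above.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

rng = np.random.default_rng(0)
d, n, sigma = 256, 2000, 0.1
# A class mean that is sparse in the (orthonormal) frame F = Id
mu = np.zeros(d)
mu[:5] = 1.0
X = mu + sigma * rng.normal(size=(n, d))      # x_c = mu + Gaussian within-class noise
lam = 1.5 * sigma                             # nearly optimal threshold (Sec. 2.1)
Phi = soft_threshold(X, lam)                  # Phi(x) = F^T rho_t F x with F = Id

var_before = ((X - X.mean(0)) ** 2).sum(1).mean()    # ~ sigma^2 d, as in (3)
var_after = ((Phi - Phi.mean(0)) ** 2).sum(1).mean()
shift = np.linalg.norm(Phi.mean(0) - mu) ** 2        # class-mean displacement
assert var_after < 0.1 * var_before                  # strong variance reduction
assert shift < 0.1 * np.linalg.norm(mu) ** 2         # mean nearly preserved
```

The dominant σ^2 d term of the within-class variance is crushed by the thresholding, while the mean only shifts by about λ on each of its few active coefficients.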
A soft-thresholding can also be computed from a thresholded ReLU ρ_rt(u) = max(u − λ, 0), because ρ_t(u) = ρ_rt(u) − ρ_rt(−u). It results that [F^T, −F^T] ρ_rt([F^T, −F^T]^T x) = F^T ρ_t(F x). However, a thresholded rectifier has more flexibility than a soft-thresholding, because it may recombine ρ_rt(F x) and ρ_rt(−F x) differently, so as to also separate class means, as explained previously. The choice of threshold then becomes a trade-off between separation of class means and concentration of class variability. In numerical experiments, we choose a lower threshold λ = √(d/p) for a ReLU with a threshold.
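Both identities relating the soft-thresholding to a pair of thresholded ReLUs can be checked directly (the frame here is a random tight frame, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
relu_t = lambda u: np.maximum(u - lam, 0.0)                  # thresholded ReLU
soft = lambda u: np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# Pointwise identity: rho_t(u) = rho_rt(u) - rho_rt(-u)
u = rng.normal(size=1000)
assert np.allclose(soft(u), relu_t(u) - relu_t(-u))

# Frame identity: [F^T, -F^T] rho_rt([F; -F] x) = F^T rho_t(F x)
d, p = 16, 32
F = np.linalg.qr(rng.normal(size=(p, d)))[0]   # random tight frame: F^T F = Id
x = rng.normal(size=d)
G = np.vstack([F, -F])                         # sign-invariant doubled frame
assert np.allclose(F.T @ soft(F @ x), G.T @ relu_t(G @ x))
```

The thresholded rectifier on the doubled frame thus contains the soft-thresholding as the special case where the two halves are recombined with weights (1, −1).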

2.2. TWO-LAYER NETWORKS WITHOUT BIAS

We study two-layer bias-free networks that implement a linear classification on ρF, where F is a normalized tight frame and ρ may be a rectifier, an absolute value or a soft-thresholding, with no learned bias parameter. Bias-free networks have been introduced for denoising in Mohan et al. (2019), as opposed to classification or regression. We show that such bias-free networks have a limited expressivity and do not satisfy universal approximation theorems (Pinkus, 1999; Bach, 2017). However, numerical results indicate that their separation and concentration capabilities are sufficient to reach classification results similar to those of two-layer networks with biases on standard image datasets. Applying a linear classifier to Φ(x) computes:

W Φ(x) + b = W ρ F x + b.

This two-layer neural network has no learned bias parameters in the hidden layer, and we impose that F^T F = Id with frame rows (f_m)_m having constant norms. As a result, the following theorem proves that it does not satisfy the universal approximation theorem. We define a binary classification problem for which the probability of error remains above 1/4 for any number p of neurons in the hidden layer. The proof is provided in Appendix C for a thresholded ReLU ρ_rt with any threshold. The theorem remains valid with an absolute value ρ_a or a soft-thresholding ρ_t, because they are linear combinations of ρ_rt.

Theorem 2.3. Let λ ≥ 0 be a fixed threshold and ρ_rt(u) = max(u − λ, 0). Let F be the set of matrices F ∈ R^{p×d} with bounded rows ||f_m|| ≤ 1. There exists a random vector x ∈ R^d which admits a probability density supported in the unit ball, and a C^∞ function h : R^d → R such that, for all p ≥ d:

inf_{W ∈ R^{1×p}, F ∈ F, b ∈ R} P[sgn(W ρ_rt(F x) + b) ≠ sgn(h(x))] ≥ 1/4.

Optimization. The parameters W, F and b are optimized with a stochastic gradient descent that minimizes a logistic cross-entropy loss on the output.
To impose F^T F = Id, following the optimization of Parseval networks (Cisse et al., 2017), after each gradient update of all network parameters, we insert a second gradient step to minimize α/2 ||F^T F − Id||^2. This gradient update is:

F ← (1 + α) F − α F F^T F.    (6)

After every Parseval step, we also make sure that each tight frame row f_m keeps a constant norm ||f_m|| = √(d/p) by applying a spherical projection: f_m ← √(d/p) f_m / ||f_m||. These steps are performed across all experiments described in the paper, and ensure that all singular values of every learned tight frame remain between 0.99 and 1.01. To reduce the number of parameters of the classification matrix W ∈ R^{C×p}, we factorize W = W' F^T with W' ∈ R^{C×d}. This amounts to reprojecting ρF x into R^d with the semi-orthogonal frame synthesis F^T, and thus defines: Φ(x) = F^T ρ F x. A batch normalization is introduced after Φ to stabilize the learning of W'.

On each dataset, applying F^T ρ F to x greatly reduces the classification error, which also appears as an increase of the Fisher ratio. For MNIST, all non-linearities produce nearly the same classification accuracy, but on CIFAR, the soft-thresholding has a higher error. Indeed, the class means of MNIST are distinct averaged digits, which are well separated because all digits are centered in the image. Concentrating variability with a soft-thresholding is then sufficient. In contrast, the classes of CIFAR images define nearly stationary random vectors because of arbitrary translations. As a consequence, the class means µ_c are nearly constant images, which are only discriminated by their average color. Separating these class means is then important for improving classification. As explained in Section 2.1, this is done with a ReLU ρ_r, or in this case an absolute value ρ_a, which reduces the error. The ReLU with threshold ρ_rt can interpolate between mean separation and variability concentration, and thus usually performs at least as well as the other non-linearities.
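The Parseval update (6) and the spherical projection can be sketched as follows. This is my own illustration run in isolation from any classification loss; the step size and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

def parseval_step(F, alpha=0.01):
    """One gradient step on alpha/2 * ||F^T F - Id||^2 (update (6)),
    followed by the spherical projection restoring ||f_m|| = sqrt(d/p)."""
    p, d = F.shape
    F = (1 + alpha) * F - alpha * F @ F.T @ F
    F *= np.sqrt(d / p) / np.linalg.norm(F, axis=1, keepdims=True)
    return F

rng = np.random.default_rng(0)
p, d = 64, 32
F = rng.normal(size=(p, d)) / np.sqrt(d)      # arbitrary initialization
for _ in range(5000):                          # iterated here in isolation
    F = parseval_step(F, alpha=0.01)
s = np.linalg.svd(F, compute_uv=False)
assert s.max() < 1.01 and s.min() > 0.99       # F is (numerically) a tight frame
```

In training, a single such step is interleaved with each SGD update, which is enough to keep the singular values pinned near 1 throughout learning.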

Image classification by separation and concentration

The errors of the bias-free networks with a ReLU and an absolute value are similar to the errors obtained by training two-layer networks of similar sizes but with bias parameters: 1.6% error on MNIST (Simard et al., 2003) and 25% on CIFAR-10 (Krizhevsky, 2010). It indicates that the elimination of bias parameters does not affect performance, despite the existence of the counter-examples from Theorem 2.3 that cannot be well approximated by such architectures. This means that image classification problems have more structure than is captured by these counter-examples, and that completeness in high-dimensional linear functional spaces may not be the key mathematical property explaining the performance of neural networks. Figure 1 in Appendix D shows that the learned convolutional tight frames include oriented oscillatory filters, which is also often the case for the first layer of deeper networks (Krizhevsky et al., 2012). They resemble the wavelet frames studied in the next section.

3. DEEP LEARNING BY SCATTERING AND CONCENTRATING

To improve classification accuracy, we cascade mean separation and variability concentration operators, implemented by ReLUs and soft-thresholdings on tight frames. This defines deep convolutional networks. However, we show that some tight frames do not need to be learned. Section 3.1 reviews scattering trees, which perform mean separation by cascading ReLUs on wavelet tight frames. Section 3.2 shows that we reach high classification accuracies by learning projectors and tight frame soft-thresholdings, which concentrate within-class variabilities along scattering channels.

3.1. SCATTERING CASCADE OF WAVELET FRAME SEPARATIONS

Scattering transforms have been introduced to classify images by cascading predefined wavelet filters with a modulus or a rectifier non-linearity (Bruna & Mallat, 2013). We write it as a product of wavelet tight frame rectifications, which progressively separate class means.

Wavelet frame. A wavelet frame separates image variations at different scales, directions and phases, with a cascade of filterings and subsamplings. We use steerable wavelets (Simoncelli & Freeman, 1995) computed with Morlet filters (Bruna & Mallat, 2013). There is one low-pass filter g_0 and L complex band-pass filters g_ℓ having an angular direction θ_ℓ = ℓπ/L for 0 < ℓ ≤ L. These filters can be adjusted (Selesnick et al., 2005) so that the filtering and subsampling

F̃_w x(n, ℓ) = x ⋆ g_ℓ(2n)

defines a complex tight frame F̃_w. Fast multiscale wavelet transforms are computed by cascading the filter bank F̃_w on the output of the low-pass filter g_0 (Mallat, 2008). Each complex filter g_ℓ is analytic, and thus has a real part and an imaginary part whose phases are shifted by α = π/2. This property is important to preserve equivariance to translation despite the subsampling with a stride of 2 (Selesnick et al., 2005). To define a sign-invariant frame as in Section 2.1, we must incorporate filters of opposite signs, which amounts to shifting their phase by π. We thus associate to F̃_w a real sign-invariant tight frame F_w by considering separately the four phases α = 0, π/2, π, 3π/2. It is defined by

F_w x(n, ℓ, α) = x ⋆ g_{ℓ,α}(2n), with g_{ℓ,0} = 2^{-1/2} Real(g_ℓ), g_{ℓ,π/2} = 2^{-1/2} Imag(g_ℓ) and g_{ℓ,α+π} = −g_{ℓ,α}.

We apply a rectifier ρ_r to the output of all real band-pass filters g_{ℓ,α} but not to the low-pass filter:

ρ_r F_w x = ( x ⋆ g_0(2n), ρ_r(x ⋆ g_{ℓ,α}(2n)) )_{n,α,ℓ}.

The use of wavelet phase parameters with rectifiers is studied in Mallat et al. (2019). The operator ρ_r F_w is linearly invertible because F_w is a tight frame and the ReLU is applied to band-pass filters, which come in pairs of opposite signs.
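A small sketch of the four-phase construction above, using a random complex vector as a stand-in for a band-pass filter g_ℓ: the phases α and α + π give filters of opposite sign (the sign-invariance of Section 2.1), and the 2^{-1/2} scaling preserves the energy of g_ℓ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one complex band-pass filter g_l (not an actual Morlet filter)
g = rng.normal(size=16) + 1j * rng.normal(size=16)

# Four phases alpha = 0, pi/2, pi, 3*pi/2 define a real sign-invariant family
g_phase = {
    0.0:           np.real(g) / np.sqrt(2),
    np.pi / 2:     np.imag(g) / np.sqrt(2),
    np.pi:        -np.real(g) / np.sqrt(2),
    3 * np.pi / 2: -np.imag(g) / np.sqrt(2),
}
# Sign-invariance: phases shifted by pi are filters of opposite sign
assert np.allclose(g_phase[np.pi], -g_phase[0.0])
assert np.allclose(g_phase[3 * np.pi / 2], -g_phase[np.pi / 2])
# Energy preservation: the four real filters carry exactly the energy of g
total = sum(np.sum(h ** 2) for h in g_phase.values())
assert np.isclose(total, np.sum(np.abs(g) ** 2))
```

Applying ρ_r to these four phase channels and recombining opposite-sign pairs recovers the real and imaginary parts of x ⋆ g_ℓ, which is why ρ_r F_w remains linearly invertible.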
Since there are 4 phases per direction, one low-pass channel, and a subsampling with a stride of 2 along each spatial dimension, F_w x is (4L + 1)/4 = L + 1/4 times larger than x.

Scattering tree. A full scattering tree of depth J is computed by iterating J times over ρ_r F_w. Since each ρ_r F_w has a linear inverse, Proposition 2.1 proves that this separation can only increase the Fisher ratio. However, it also increases the signal size by a factor (L + 1/4)^J, which is typically much too large. This is avoided with orthogonal projectors, which perform a dimension reduction after applying each ρ_r F_w. A pruned scattering tree S_T of depth J and order o is defined in Bruna & Mallat (2013) as a convolutional tree which cascades J rectified wavelet filter banks, and at each depth prunes the branches with P_j to prevent an exponential growth:

S_T = Π_{j=1}^J P_j ρ_r F_w.    (7)

After the ReLU, the pruning operator P_j eliminates the branches of the scattering which cascade more than o band-pass filters and rectifiers, where o is the scattering order (Bruna & Mallat, 2013). After J cascades, the remaining channels have thus been filtered by at least J − o successive low-pass filters g_0. We shall use a scattering transform of order o = 2. The operator P_j also averages the rectified outputs of the filters g_{ℓ,α} along the phase α, for ℓ fixed. This averaging eliminates the phase. It approximately computes a complex modulus and produces a localized translation invariance. The resulting pruning and phase averaging operator P_j is a 1 × 1 convolutional operator, which reduces the dimension of scattering channels with an orthogonal projection. If x has d pixels, then S_T(x)[n, k] is an array of images having 2^{-2J} d pixels in each channel k, because of the J subsamplings with a stride of 2. The total number of channels K is 1 + JL + J(J − 1)L^2/2. Numerical experiments are performed with wavelet filters which approximate Gabor wavelets (Bruna & Mallat, 2013), with L = 8 directions. The number of scales J depends upon the image size.
It is J = 3 for MNIST and CIFAR, and J = 4 for ImageNet, resulting respectively in K = 217, 651 and 1251 channels (the CIFAR and ImageNet counts include the three color channels). Each ρ_r F_w can only improve the Fisher ratio and the linear classification accuracy, but it is not guaranteed that this remains valid after applying P_j. Table 1 gives the classification error of a logistic classifier applied to S_T(x), after a 1 × 1 orthogonal projection to reduce the number of channels, and a spatial normalization. This error is almost twice smaller than that of a two-layer neural network on MNIST, given in Table 1, but it does not improve the error on CIFAR. On CIFAR, the error obtained by a ResNet-20 is 3 times lower than that of a classifier on S_T(x). The main issue is now to understand where this inefficiency comes from. A scattering tree iteratively separates class means with wavelet filters. Its dimension is reduced by predefined projection operators, which may decrease the Fisher ratio and linear separability. To avoid this source of inefficiency, we define a scattering network which learns these projections. A second step introduces tight frame thresholdings along scattering channels, to concentrate within-class variabilities. Image classification results are evaluated on the CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) datasets.

Learned scattering projections. Beyond scattering trees, the projections P_j of a scattering transform (7) can be redefined as arbitrary orthogonal 1 × 1 convolutional operators, which reduce the number of scattering channels: P_j P_j^T = Id. Orthogonal projectors acting along the direction index of wavelet filters can improve classification (Oyallon & Mallat, 2015). We are now going to learn these linear operators together with the final linear classifier. Before computing this projection, the mean and variance of each scattering channel are standardized with a batch normalization B_N, with affine coefficients set to γ = 1 and β = 0.
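As a sanity check, the channel counts quoted above are consistent with the formula K = 1 + JL + J(J − 1)L^2/2 applied per input color channel; the reading that the CIFAR and ImageNet counts include a factor of 3 for RGB channels is my own arithmetic, verified below:

```python
def scattering_channels(J, L, color_channels):
    # Order-2 pruned scattering: 1 low-pass path, J*L order-1 paths,
    # and J*(J-1)/2 * L^2 order-2 paths, per input color channel
    return color_channels * (1 + J * L + J * (J - 1) * L ** 2 // 2)

assert scattering_channels(J=3, L=8, color_channels=1) == 217   # MNIST (grayscale)
assert scattering_channels(J=3, L=8, color_channels=3) == 651   # CIFAR-10 (RGB)
assert scattering_channels(J=4, L=8, color_channels=3) == 1251  # ImageNet (RGB)
```
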
This projected scattering operator can be written:

S_P = Π_{j=1}^J P_j B_N ρ_r F_w.

Applying a linear classifier to S_P(x) defines a deep convolutional network whose parameters are the 1 × 1 convolutional operators P_j and the classifier weights W, b. The wavelet convolution filters in F_w are not learned. The orthogonality of P_j is imposed through the gradient steps (6) applied to F = P_j^T. Table 2 shows that learning the projectors P_j more than halves the scattering classification error of S_P relative to S_T on CIFAR-10 and ImageNet, reaching AlexNet accuracy on ImageNet, while achieving a higher Fisher ratio. The learned orthogonal projections P_j create invariants to families of linear transformations along scattering channels, which depend upon scales, directions and phases. They correspond to image transformations which have been linearized by the scattering transform. Small diffeomorphisms which deform the image are examples of operators which are linearized by a scattering transform (Mallat, 2012). The learned projector eliminates within-class variabilities which are not discriminative across classes. Since it is linear, it does not by itself improve linear separability or the Fisher ratio; it takes advantage of the non-linear separation produced by the previous scattering layers. The operator P_j is a projection on a family of orthogonal directions which define new scattering channels, and is followed by a wavelet convolution F_w along spatial variables. It defines separable convolutional filters F_w P_j along space and channels. Learning P_j amounts to choosing orthogonal directions so that ρ_r F_w P_j optimizes the separation of class means. If the class distributions are invariant under rotations, the separation can be achieved with wavelet convolutions along the direction index (Oyallon & Mallat, 2015), but better results are obtained by learning these filters.
This separable scattering architecture is different from separable approximations of deep network filters in discrete cosine bases (Ulicny et al., 2019) or in Fourier-Bessel bases (Qiu et al., 2018). A wavelet scattering computes ρ_r F_w P_j as opposed to a separable decomposition ρ_r P_j F_w, so the ReLU is applied in a domain indexed by the wavelet variables produced by F_w. This provides explicit coordinates to analyze the mathematical properties, but it also increases the number of learned parameters, as shown in Table 4, Appendix D.

Concentration along scattering channels. A projected scattering transform can separate class means, but does not concentrate class variabilities. To further reduce classification errors, following Section 2.1, a concentration is computed with a tight frame soft-thresholding F_j^T ρ_t F_j applied to scattering channels. It increases the dimension of scattering channels with a 1 × 1 convolutional tight frame F_j, applies a soft-thresholding ρ_t, and reduces the number of channels with the 1 × 1 convolutional operator F_j^T. The resulting concentrated scattering operator is:

S_C = Π_{j=1}^J (F_j^T ρ_t F_j)(P_j B_N ρ_r F_w).

It has 2J layers, with odd layers computed by separating means with a ReLU ρ_r and even layers computed by concentrating class variabilities with a soft-thresholding ρ_t. According to Section 2.1, the soft-threshold is λ = 1.5 √(d/p). This soft-thresholding may be replaced by a thresholded rectifier ρ_rt(u) = max(u − λ, 0) with a lower threshold λ = √(d/p). A logistic classifier is applied to S_C(x). The resulting deep network does not include any learned bias parameter, except in the final linear classification layer. Learning is reduced to the 1 × 1 convolutional operators P_j and F_j along scattering channels, and the linear classification parameters. Table 2 gives the classification errors of this concentrated scattering on CIFAR for J = 4 (8 layers) and ImageNet for J = 6 (12 layers). The layer dimensions are specified in Appendix D. The number of parameters of the scattering networks is given in Table 4, Appendix D. This concentration step reduces the error of S_C by about 40% relative to a projected scattering S_P.
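A hypothetical sketch of the even-layer concentration operator F_j^T ρ_t F_j, acting as a 1 × 1 convolution along the channels of a feature map. The function name, shapes and the random tight frame are illustrative; only the structure (channel expansion, soft-thresholding with λ = 1.5 √(d/p), channel reduction) follows the text.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def channel_concentration(x, F):
    """Even-layer operator F^T rho_t F: a 1x1 convolution along scattering
    channels. x has shape (channels, height, width); F in R^{p x d} is a
    tight frame (F^T F = Id) expanding d channels to p > d channels."""
    p, d = F.shape
    lam = 1.5 * np.sqrt(d / p)                 # soft-threshold from Section 2.1
    c, h, w = x.shape
    z = F @ x.reshape(c, h * w)                # 1x1 conv = matrix along channels
    return (F.T @ soft_threshold(z, lam)).reshape(c, h, w)

rng = np.random.default_rng(0)
d, p = 32, 64
F = np.linalg.qr(rng.normal(size=(p, d)))[0]   # random tight frame, F^T F = Id
x = rng.normal(size=(d, 8, 8))                 # toy stand-in for scattering channels
y = channel_concentration(x, F)
assert y.shape == x.shape                      # channel count is restored by F^T
assert np.sum(y ** 2) <= np.sum(x ** 2)        # thresholding shrinks energy
```

In the network, F would be learned under the Parseval constraint rather than drawn at random, and the operator is applied at every even layer of S_C.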
A thresholded rectifier ρ_rt produces an error slightly below that of a soft-thresholding ρ_t on both CIFAR-10 and ImageNet, and this error is also below the errors of ResNet-20 on CIFAR and ResNet-18 on ImageNet. These errors are also nearly half the classification errors previously obtained by cascading a scattering tree S_T with several 1 × 1 convolutional layers and large MLP classifiers (Zarka et al., 2020; Oyallon et al., 2017). It shows that the separation and concentration learning must be done at each scale, rather than only at the largest-scale output. Table 3 shows the progressive improvement of the Fisher ratio measured at each layer of S_C on CIFAR-10. The transition from an odd layer 2j − 1 to an even layer 2j results from F_j^T ρ_t F_j, which always improves the Fisher ratio by concentrating class variabilities. The transition from layer 2j to 2j + 1 is done by P_{j+1} ρ_r F_w, which may decrease the Fisher ratio because of the projection P_{j+1}, but globally brings an important improvement.

4. CONCLUSION

We proved that separation and concentration of probability measures can be achieved with rectifiers and thresholdings applied to appropriate tight frames F. We also showed that the separation of class means can be achieved by cascading wavelet frames that are not learned, which defines a scattering transform. By concentrating variabilities with a thresholding along scattering channels, we reach ResNet-18 classification accuracy on CIFAR-10 and ImageNet. A major open issue is to understand the mathematical properties of the learned projectors and tight frames along scattering channels. This is necessary to characterize the types of classification problems that are well approximated by such architectures, and to prove lower bounds on the evolution of Fisher ratios across layers.



The class means µ_c = E(Φ(x_c)) of a standardized representation satisfy Σ_c µ_c = 0. A linear classifier (W, b) on Φ(x) returns the index of the maximum coordinate of W Φ(x) + b ∈ R^C. An optimal linear classifier (W, b) minimizes the probability of a classification error. Optimal linear classifiers are estimated by minimizing a regularized loss function on the training data. Neural networks often use logistic linear classifiers, which minimize a cross-entropy loss. The standardization of the last layer Φ(x) is implemented with a batch normalization.

Table 1: For MNIST and CIFAR-10, the first row gives the logistic classification error and the second row the Fisher ratio (1), for logistic classifiers applied to the input signal x and to Φ(x) = F^T ρ F x with three different non-linearities ρ: absolute value ρ_a, soft-thresholding ρ_t, and ReLU with threshold ρ_rt. The tight frame F is a convolution on patches of size k × k with a stride of k/2, with k = 14 for MNIST and k = 8 for CIFAR. The tight frame F maps each patch to a vector of larger dimension, specified in Appendix D. Figure 1 in Appendix D shows examples of learned tight frame filters.

Table 2: Linear classification error and Fisher ratios (1) of several scattering representations, on CIFAR-10 and ImageNet. For S_C, results are evaluated with a soft-thresholding ρ_t and a thresholded rectifier ρ_rt. The last column gives the error of ResNet-20 for CIFAR-10 (He et al., 2016) and ResNet-18 for ImageNet, taken from https://pytorch.org/docs/stable/torchvision/models.html.

Table 3: Evolution of the Fisher ratio across layers for the scattering concentration network S_C with a ReLU with threshold ρ_rt, on the CIFAR dataset.

ACKNOWLEDGMENTS

This work was supported by grants from Région Ile-de-France and the PRAIRIE 3IA Institute of the French ANR-19-P3IA-0001 program. We would like to thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources.

