DEEP NETWORKS FROM THE PRINCIPLE OF RATE REDUCTION

Anonymous authors
Paper under double-blind review

Abstract

This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification. We show that the basic iterative gradient ascent scheme for maximizing the rate reduction of learned features naturally leads to a deep network, one iteration per layer. The architectures, operators (linear or nonlinear), and parameters of the network are all explicitly constructed layer-by-layer in a forward propagation fashion. All components of this "white box" network have precise optimization, statistical, and geometric interpretation. Our preliminary experiments indicate that such a network can already learn a good discriminative deep representation without any back propagation training. Moreover, all linear operators of the so-derived network naturally become multi-channel convolutions when we enforce classification to be rigorously shift-invariant. The derivation also indicates that such a convolutional network is significantly more efficient to learn and construct in the spectral domain.

1. INTRODUCTION AND MOTIVATION

In recent years, various deep (convolutional) network architectures such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), Recurrent CNNs, LSTMs (Hochreiter & Schmidhuber, 1997), Capsule Networks (Hinton et al., 2011), etc., have demonstrated very good performance on classification tasks for real-world data such as speech and images. Nevertheless, almost all such networks have been developed through years of empirical trial and error, in both their architectures/operators and the ways they are effectively trained. Some recent practice even goes to the extreme of searching for effective network structures and training strategies through extensive random search, such as Neural Architecture Search (Zoph & Le, 2017; Baker et al., 2017), AutoML (Hutter et al., 2019), and Learning to Learn (Andrychowicz et al., 2016). Despite tremendous empirical advances, there is still a lack of rigorous theoretical justification of the need for "deep" network architectures and a lack of fundamental understanding of the associated operators (e.g., multi-channel convolution and nonlinear activation) in each layer. As a result, deep networks are often designed and trained heuristically and then used as a "black box." There has been a severe lack of guiding principles for each of the design stages: For a given task, how wide or deep should the network be? What are the roles of, and relationships among, the multiple (convolution) channels? Which parts of the network need to be learned and trained, and which can be determined in advance? How can we evaluate the optimality of the resulting network?
As a consequence, beyond empirical evaluation, it is usually impossible to offer rigorous guarantees for certain properties of a trained network, such as invariance to transformations (Azulay & Weiss, 2018; Engstrom et al., 2017) or overfitting to noisy or even arbitrary labels (Zhang et al., 2017). In this paper, we do not intend to address all of these questions, but we attempt to offer a plausible interpretation of deep (convolutional) neural networks by deriving a class of deep networks from first principles. We contend that all key features and structures of modern deep (convolutional) neural networks can be naturally derived from optimizing a principled objective, namely the rate reduction recently proposed by Yu et al. (2020), which seeks a compact, discriminative (invariant) representation of the data. More specifically, the basic iterative gradient ascent scheme for optimizing this objective naturally takes the form of a deep neural network, one layer per iteration. This principled approach brings a couple of nice surprises. First, the architectures, operators, and parameters of the network can be constructed explicitly layer-by-layer in a forward propagation fashion, and all inherit precise optimization, statistical, and geometric interpretations. As a result, the so-constructed "white box" deep network already gives a good discriminative representation (and achieves good classification performance) without any back propagation for training. Second, in the case of seeking a representation rigorously invariant to shift or translation, the network naturally lends itself to a multi-channel convolutional network. Moreover, the derivation indicates that such a convolutional network is computationally more efficient to learn and construct in the spectral (Fourier) domain, analogous to how neurons in the visual cortex encode and transmit information with their spiking frequencies (Eliasmith & Anderson, 2003; Belitski et al., 2008).

2. TECHNICAL APPROACH

Consider a basic classification task: given a set of m samples X ≐ [x_1, . . . , x_m] ∈ R^{n×m} and their associated memberships π(x_i) ∈ [k] in k different classes, a deep network is typically used to model a direct mapping f(x, θ) : x → y ∈ R^k from the input data x ∈ R^n to its class label, where y is typically a "one-hot" vector encoding the membership information π(x): the j-th entry of y is 1 iff π(x) = j. The parameters θ of the network are typically learned to minimize a certain prediction loss, say the cross-entropy loss, via gradient-descent-type back propagation. Although this popular approach provides a direct and effective way to train a network that predicts the class information, the learned representation is implicit and lacks a clear interpretation.
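As a small illustration (with our own toy labels, not taken from the paper), the one-hot label vectors y and the diagonal membership matrices Π_j used later can be built as follows:

```python
import numpy as np

# Hypothetical toy setup: m = 5 samples in k = 3 classes.
pi = np.array([0, 2, 1, 0, 2])   # class label pi(x_i) for each sample
k, m = 3, len(pi)

# One-hot label vectors y_i as columns: Y[j, i] = 1 iff pi(x_i) = j.
Y = np.eye(k)[:, pi]             # shape (k, m)

# Diagonal membership matrices Pi_j: Pi_j[i, i] = 1 iff sample i is in class j.
Pi = [np.diag((pi == j).astype(float)) for j in range(k)]  # each (m, m)

# tr(Pi_j) counts the samples in class j.
assert all(np.trace(Pi[j]) == np.sum(pi == j) for j in range(k))
```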

2.1. PRINCIPLE OF RATE REDUCTION AND GROUP INVARIANCE

The Principle of Maximal Coding Rate Reduction. To help better understand the features learned by a deep network, the recent work of Yu et al. (2020) has argued that the goal of (deep) learning is to learn a compact, discriminative, and diverse feature representation z = f(x) ∈ R^n of the data x before any subsequent task such as classification: x → f(x) → z → h(z) → y. More precisely, instead of directly fitting the class label y, a principled objective is to learn a feature map f(x) : x → z which transforms the data x onto a set of maximally discriminative low-dimensional linear subspaces {S_j}_{j=1}^k ⊂ R^n, one subspace S_j per class j ∈ [k]. Let Z ≐ [z_1, . . . , z_m] = [f(x_1), . . . , f(x_m)] be the features of the given samples X. Without loss of generality, we may assume all features z_i are normalized to unit norm: z_i ∈ S^{n−1}. For convenience, let Π_j ∈ R^{m×m} be a diagonal matrix whose diagonal entries encode the membership of the samples/features in the j-th class: Π_j(i, i) = 1 iff π(x_i) = π(z_i) = j, and 0 otherwise. Then, based on principles from lossy data compression (Ma et al., 2007), Yu et al. (2020) suggested that the optimal representation Z ⊂ S^{n−1} should maximize the following coding rate reduction objective, known as the MCR² principle:

∆R(Z, Π, ε) ≐ R(Z, ε) − R_c(Z, ε | Π) = (1/2) log det(I + α Z Z*) − Σ_{j=1}^k (γ_j/2) log det(I + α_j Z Π_j Z*), (1)

where α = n/(m ε²), α_j = n/(tr(Π_j) ε²), and γ_j = tr(Π_j)/m for j = 1, . . . , k. Given a prescribed quantization error ε, the first term R of ∆R(Z) measures the total coding length of all the features Z, and the second term R_c is the sum of the coding lengths of the features in each of the k classes. In Yu et al. (2020), the authors have shown that the optimal representation Z that maximizes the above objective indeed has desirable properties. Nevertheless, they adopted a conventional deep network (e.g., the ResNet) as a black box to model and parameterize the feature mapping z = f(x, θ).
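A direct NumPy sketch of the objective (1) (our own minimal implementation, not the authors' code; `slogdet` is used for numerical stability):

```python
import numpy as np

def rate_reduction(Z, Pi, eps=0.1):
    """Coding rate reduction Delta R(Z) = R(Z) - R_c(Z) of Eq. (1)."""
    n, m = Z.shape
    alpha = n / (m * eps**2)
    # R: total coding rate of all features.
    R = 0.5 * np.linalg.slogdet(np.eye(n) + alpha * Z @ Z.T)[1]
    # R_c: sum of coding rates of the features within each class.
    Rc = 0.0
    for Pij in Pi:               # Pi: list of diagonal membership matrices
        tr = np.trace(Pij)
        if tr == 0:
            continue             # skip empty classes
        alpha_j = n / (tr * eps**2)
        gamma_j = tr / m
        Rc += 0.5 * gamma_j * np.linalg.slogdet(
            np.eye(n) + alpha_j * Z @ Pij @ Z.T)[1]
    return R - Rc
```

As a sanity check, features of different classes lying on orthogonal axes yield a larger ∆R than features collapsed onto a single direction.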
It has been empirically shown that, with such a choice, one can effectively optimize the MCR² objective and obtain discriminative and diverse representations for classifying real image datasets. However, several problems remain unanswered. Although the resulting feature representation is more interpretable, the network itself still is not. It is not clear why any chosen network is able to optimize the desired MCR² objective: would there be any potential limitations? The good empirical results (say, with a ResNet) do not necessarily justify the particular choice of architectures and operators: why is a layered model necessary, how wide and deep is adequate, and is there any rigorous justification for the convolutions and nonlinear operators used? In Section 2.2, we show that using gradient ascent to maximize the rate reduction ∆R(Z) naturally leads to a "white box" deep network that represents such a mapping. All linear/nonlinear operators and parameters of the network are explicitly constructed in a purely forward propagation fashion.

Group Invariant Rate Reduction. So far, we have considered the data and features as vectors. In many applications, such as serial data or imagery data, the semantic meaning (labels) of the data and their features is invariant to certain transformations g ∈ G (for some group G) (Cohen & Welling, 2016). For example, the meaning of an audio signal is invariant to shifts in time, and the identity of an object in an image is invariant to translation in the image plane. Hence, we prefer the feature mapping f(x, θ) to be rigorously invariant to such transformations:

Group Invariance: f(x ∘ g, θ) ∼ f(x, θ), ∀g ∈ G, (2)

where "∼" indicates that the two features belong to the same equivalence class. The recent works of Zaheer et al. (2017) and Maron et al. (2020) characterize properties of networks and operators for set permutation groups.
Nevertheless, it remains challenging to learn features via a deep network that are guaranteed to be invariant even to simple transformations such as translation and rotation (Azulay & Weiss, 2018; Engstrom et al., 2017). In Section 2.3, we show that the MCR² principle is compatible with invariance in a very natural and precise way: we only need to assign all transformed versions {x ∘ g | g ∈ G} to the same class as x and map them all to the same subspace S. We will rigorously show (in the Appendices) that, when the group G is (discrete) circular 1D shifting or 2D translation, the resulting deep network naturally becomes a multi-channel convolutional network!

2.2. DEEP NETWORKS FROM MAXIMIZING RATE REDUCTION

Gradient Ascent for Rate Reduction on the Training Samples. First, let us directly try to maximize the objective ∆R(Z) as a function of the training features Z ⊂ S^{n−1}. To this end, we may adopt a (projected) gradient ascent scheme, for some step size η > 0:

Z_{ℓ+1} ∝ Z_ℓ + η · ∂∆R/∂Z |_{Z_ℓ}   subject to   Z_{ℓ+1} ⊂ S^{n−1}. (3)

This scheme can be interpreted as prescribing how to incrementally adjust the locations of the current features Z_ℓ in order for the resulting Z_{ℓ+1} to improve the rate reduction ∆R(Z). A simple calculation shows that the gradient ∂∆R/∂Z entails evaluating the following derivatives of the terms in (1):

(1/2) ∂ log det(I + α Z Z*)/∂Z |_{Z_ℓ} = α (I + α Z_ℓ Z_ℓ*)^{−1} Z_ℓ = E_ℓ Z_ℓ ∈ R^{n×m}, with E_ℓ ≐ α (I + α Z_ℓ Z_ℓ*)^{−1} ∈ R^{n×n}; (4)

(1/2) ∂ (γ_j log det(I + α_j Z Π_j Z*))/∂Z |_{Z_ℓ} = γ_j α_j (I + α_j Z_ℓ Π_j Z_ℓ*)^{−1} Z_ℓ Π_j = γ_j C_j^ℓ Z_ℓ Π_j ∈ R^{n×m}, with C_j^ℓ ≐ α_j (I + α_j Z_ℓ Π_j Z_ℓ*)^{−1} ∈ R^{n×n}. (5)

Notice that the matrix E_ℓ depends only on Z_ℓ; it aims to expand all the features so as to increase the overall coding rate. Each matrix C_j^ℓ depends on the features of one class and aims to compress them so as to reduce the coding rate of that class. See Remark 1 in Appendix A for the geometric and statistical meaning of E_ℓ and C_j^ℓ. The complete gradient ∂∆R/∂Z |_{Z_ℓ} is then of the form:

∂∆R/∂Z |_{Z_ℓ} = E_ℓ Z_ℓ (expansion) − Σ_{j=1}^k γ_j C_j^ℓ Z_ℓ Π_j (compression) ∈ R^{n×m}. (6)

Gradient-Guided Feature Map Increment.
Notice that in the above, the gradient ascent treats all the features Z_ℓ = [z_ℓ^1, . . . , z_ℓ^m] as free variables. The increment Z_{ℓ+1} − Z_ℓ = η · ∂∆R/∂Z |_{Z_ℓ} does not yet give a transform defined on the entire feature domain z ∈ R^n. Hence, in order to find the optimal f(x, θ) explicitly, we may construct a small increment transform g(·, θ_ℓ) on the ℓ-th layer feature z_ℓ that emulates the above (projected) gradient scheme:

z_{ℓ+1} ∝ z_ℓ + η · g(z_ℓ, θ_ℓ)   subject to   z_{ℓ+1} ∈ S^{n−1}, (7)

such that [g(z_ℓ^1, θ_ℓ), . . . , g(z_ℓ^m, θ_ℓ)] ≈ ∂∆R/∂Z |_{Z_ℓ}. That is, we need to approximate the gradient flow ∂∆R/∂Z, which locally deforms each (training) feature {z_ℓ^i}_{i=1}^m, with a continuous mapping g(z) defined on the entire feature space z ∈ R^n. See Remark 2 in Appendix A for the conceptual connection to, and difference from, the Neural ODE framework proposed by Chen et al. (2018). Inspecting the structure of the gradient (6) suggests a natural candidate for the increment transform g(z_ℓ, θ_ℓ):

g(z_ℓ, θ_ℓ) ≐ E_ℓ z_ℓ − Σ_{j=1}^k γ_j C_j^ℓ z_ℓ π_j(z_ℓ) ∈ R^n, (8)

where π_j(z_ℓ) ∈ [0, 1] indicates the probability of z_ℓ belonging to the j-th class. Notice that the increment depends on: 1) a linear map represented by E_ℓ, which depends only on the statistics of all features from the preceding layer; and 2) a set of linear maps {C_j^ℓ}_{j=1}^k together with the memberships {π_j(z_ℓ)}_{j=1}^k of the features. Since we only have the memberships π_j for the training samples, the function g defined in (8) can only be evaluated on the training samples. To extrapolate g to the entire feature space, we need to estimate π_j(z_ℓ) in its second term. In conventional deep learning, this map is typically modeled as a deep network and learned from the training data, say via back propagation. Nevertheless, our goal here is not yet to learn a precise classifier π_j(z_ℓ). Instead, we only need a good enough estimate of the class information in order for g to approximate the gradient ∂∆R/∂Z well.
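The operators E_ℓ, C_j^ℓ of (4)-(5) and the increment g of (8) can be sketched as follows (a minimal NumPy sketch with our own variable names; on a training sample with its true one-hot membership, g reproduces the corresponding column of the gradient (6)):

```python
import numpy as np

def operators(Z, Pi, eps=0.1):
    """Expansion E and compression C_j operators (Eqs. 4-5) built from Z.
    Pi is a list of diagonal membership matrices; classes assumed non-empty."""
    n, m = Z.shape
    alpha = n / (m * eps**2)
    E = alpha * np.linalg.inv(np.eye(n) + alpha * Z @ Z.T)
    Cs, gammas = [], []
    for Pij in Pi:
        tr = np.trace(Pij)
        alpha_j = n / (tr * eps**2)
        Cs.append(alpha_j * np.linalg.inv(np.eye(n) + alpha_j * Z @ Pij @ Z.T))
        gammas.append(tr / m)
    return E, Cs, gammas

def g_increment(z, pi_z, E, Cs, gammas):
    """Increment g(z) = E z - sum_j gamma_j * pi_j(z) * C_j z of Eq. (8)."""
    return E @ z - sum(g * p * (C @ z) for g, p, C in zip(gammas, pi_z, Cs))
```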
From the geometric interpretation of the linear maps E_ℓ and C_j^ℓ given by Remark 1 in Appendix A, the term p_j^ℓ ≐ C_j^ℓ z_ℓ can be viewed as the projection of z_ℓ onto the orthogonal complement of the subspace of class j. Therefore, ‖p_j^ℓ‖_2 is small if z_ℓ is in class j and large otherwise. This motivates estimating the membership with the following softmax function:

π̂_j(z_ℓ) ≐ exp(−λ‖C_j^ℓ z_ℓ‖) / Σ_{j=1}^k exp(−λ‖C_j^ℓ z_ℓ‖) ∈ [0, 1]. (9)

Hence the second term of (8) can be approximated using this estimated membership:

Σ_{j=1}^k γ_j C_j^ℓ z_ℓ π_j(z_ℓ) ≈ Σ_{j=1}^k γ_j C_j^ℓ z_ℓ · π̂_j(z_ℓ) ≐ σ([C_1^ℓ z_ℓ, . . . , C_k^ℓ z_ℓ]) ∈ R^n,

which we denote as a nonlinear operator σ(·) on the outputs of the feature z_ℓ through k banks of filters [C_1^ℓ, . . . , C_k^ℓ]. Notice that the nonlinearity arises from a "soft" assignment of class membership based on the feature responses to those filters. Overall, combining (7), (8), and (9), the incremental feature transform from z_ℓ to z_{ℓ+1} becomes:

z_{ℓ+1} ∝ z_ℓ + η · E_ℓ z_ℓ − η · σ([C_1^ℓ z_ℓ, . . . , C_k^ℓ z_ℓ])   subject to   z_{ℓ+1} ∈ S^{n−1}, (10)

with the nonlinear function σ(·) defined above and θ_ℓ collecting all the layer-wise parameters E_ℓ, C_j^ℓ, γ_j, and λ; the features at each layer are always "normalized" onto the sphere S^{n−1}, an operation denoted P_{S^{n−1}}. The form of the increment (10) is illustrated by the diagram in Figure 1.

Deep Network from Rate Reduction. Notice that the increment is constructed to emulate gradient ascent for the rate reduction ∆R. Hence, by transforming the features iteratively via the above process, we expect the rate reduction to increase, as we will see in the experimental section. This iterative process, once converged, say after L iterations, gives the desired feature map f(x, θ) on the input z_0 = x, precisely in the form of a deep network, in which each layer has the structure shown in Figure 1:

f(x, θ) = φ_L ∘ φ_{L−1} ∘ · · · ∘ φ_0(x), with φ_ℓ(z_ℓ, θ_ℓ) ≐ P_{S^{n−1}}[z_ℓ + η · g(z_ℓ, θ_ℓ)]. (11)
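Putting (7)-(10) together, one layer of the so-derived network can be sketched as follows (our own minimal implementation; E and Cs are assumed to be the operators of (4)-(5) built at this layer):

```python
import numpy as np

def redunet_layer(z, E, Cs, gammas, eta=0.5, lam=1.0):
    """One layer of Eq. (10): expand by E, compress by the soft-assigned
    C_j's, then project the feature back onto the unit sphere."""
    p = [C @ z for C in Cs]                        # residuals p_j = C_j z
    scores = np.array([-lam * np.linalg.norm(pj) for pj in p])
    pi_hat = np.exp(scores - scores.max())
    pi_hat /= pi_hat.sum()                         # softmax estimate, Eq. (9)
    sigma = sum(g * w * pj for g, w, pj in zip(gammas, pi_hat, p))
    z_next = z + eta * (E @ z) - eta * sigma
    return z_next / np.linalg.norm(z_next)         # projection P_{S^{n-1}}
```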
As this deep network is derived from maximizing the rate reduction, we call it the ReduNet. Notice that all parameters of the network are explicitly constructed layer by layer in a forward propagation fashion. Once the network is constructed, there is no need for any additional supervised learning, say via back propagation. As suggested in Yu et al. (2020), the so-learned features can be directly used for classification via a nearest subspace classifier.

Comparison with Other Approaches and Architectures. Structural similarities between deep networks and iterative optimization schemes, especially those for solving sparse coding, have long been noticed. In particular, Gregor & LeCun (2010) argued that algorithms for sparse coding, such as the FISTA algorithm (Beck & Teboulle, 2009), can be viewed as a deep network and trained for better coding performance, known as LISTA. Later, Monga et al. (2019) and Sun et al. (2020) proposed similar interpretations of deep networks as unrolled sparse coding algorithms. Like all networks inspired by unfolding iterative optimization schemes, the structure of the ReduNet naturally contains a skip connection between adjacent layers, as in the ResNet (He et al., 2016). Remark 4 in Appendix A discusses possible improvements to the basic gradient scheme that may introduce additional skip connections beyond adjacent layers. The remaining k + 1 parallel channels E_ℓ, C_j^ℓ of the ReduNet actually resemble the parallel structures that people have later found empirically beneficial for deep networks, e.g., ResNeXt (Xie et al., 2017) or the mixture-of-experts (MoE) module adopted in Shazeer et al. (2017). A major difference, however, is that all components (layers, channels, and operators) of the ReduNet are constructed explicitly from first principles, and they all have precise optimization, statistical, and geometric interpretations.
Furthermore, there is no need to learn them via back propagation, although in principle one still could if further fine-tuning of the network is needed (see Remark 3 of Appendix A for more discussion).

2.3. DEEP CONVOLUTION NETWORKS FROM SHIFT-INVARIANT RATE REDUCTION

We next examine the ReduNet from the perspective of invariance to transformations. Using the basic and important case of shift/translation invariance as an example, we will show that for data compatible with an invariant classifier, the ReduNet construction automatically takes the form of a (multi-channel) convolutional neural network, rather than having this structure heuristically imposed upon it.

1D Serial Data and Shift Invariance. For one-dimensional data x ∈ R^n under shift symmetry, we take G to be the group of circular shifts. Each observation x_i generates a family {x_i ∘ g | g ∈ G} of shifted copies, which are the columns of the circulant matrix circ(x_i) ∈ R^{n×n} (see Appendix B.1 or Kra & Simanca (2012) for properties of circulant matrices). What happens if we construct the ReduNet from these families Z_1 = [circ(x_1), . . . , circ(x_m)]? The data covariance matrix

Z_1 Z_1* = [circ(x_1), . . . , circ(x_m)] [circ(x_1), . . . , circ(x_m)]* = Σ_{i=1}^m circ(x_i) circ(x_i)* ∈ R^{n×n}

associated with this family of samples is automatically a (symmetric) circulant matrix. Moreover, because the circulant property is preserved under sums, inverses, and products, the matrices E_1 and C_j^1 are also automatically circulant, and their application to a feature vector z can be implemented via cyclic convolution "⊛" (see Proposition B.1 of Appendix B):

z_2 ∝ z_1 + η · g(z_1, θ_1) = z_1 + η · e_1 ⊛ z_1 − η · σ([c_1^1 ⊛ z_1, . . . , c_k^1 ⊛ z_1]). (12)

Because g(·, θ_1) consists only of operations that co-vary with cyclic shifts, the features Z_2 at the next layer again consist of families of shifts: Z_2 = [circ(x_1 + η g(x_1, θ_1)), . . . , circ(x_m + η g(x_m, θ_1))]. Continuing inductively, all matrices E_ℓ and C_j^ℓ based on such Z_ℓ are circulant. By virtue of this property of the data, the ReduNet has taken the form of a convolutional network, with no need to explicitly choose this structure!
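The key circulant facts used above can be checked numerically (a small self-contained sketch; `circ` is our own helper building the matrix of all cyclic shifts):

```python
import numpy as np

def circ(x):
    """Circulant matrix whose columns are all cyclic shifts of x."""
    n = len(x)
    return np.stack([np.roll(x, s) for s in range(n)], axis=1)

rng = np.random.default_rng(0)
x, z = rng.standard_normal(8), rng.standard_normal(8)

# Multiplication by circ(x) is cyclic convolution, computable via the FFT.
direct = circ(x) @ z
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(z)))
assert np.allclose(direct, via_fft)

# Sums, products, and inverses of circulants remain circulant, so the
# resulting E and C_j matrices are circulant too: the inverse below is
# fully determined by its first column.
M = np.linalg.inv(np.eye(8) + circ(x) @ circ(x).T)
assert np.allclose(M, circ(M[:, 0]))
```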
The Role of Multiple Channels and Sparsity. There is one problem though: in general, the set of all circular permutations of a vector z gives a full-rank matrix. That is, the n "augmented" features associated with each sample (hence each class) typically already span the entire space R^n. The MCR² objective (1) would then be unable to distinguish the classes as different subspaces. One natural remedy is to improve the separability of the data by "lifting" the features to a higher-dimensional space, e.g., by taking their responses to multiple filters k_1, . . . , k_C ∈ R^n:

z[c] = k_c ⊛ z = circ(k_c) z ∈ R^n, c = 1, . . . , C.

The filters can be pre-designed invariance-promoting filters, adaptively learned from the data, or randomly selected, as we do in our experiments. This operation lifts each original feature vector z ∈ R^n to a C-channel feature, denoted z̄ ≐ [z[1], . . . , z[C]]* ∈ R^{C×n}. If we stack the multiple channels of a feature z̄ as a column vector vec(z̄) ∈ R^{nC}, the associated circulant version circ(z̄) and the data covariance matrix Σ of all its shifted versions are given by:

circ(z̄) ≐ [circ(z[1]); . . . ; circ(z[C])] ∈ R^{nC×n},  (13)
Σ ≐ [circ(z[1]); . . . ; circ(z[C])] [circ(z[1])*, . . . , circ(z[C])*] ∈ R^{nC×nC},  (14)

where circ(z[c]) ∈ R^{n×n}, c ∈ [C], is the circulant version of the c-th channel of the feature z̄. The columns of circ(z̄) then span at most an n-dimensional proper subspace of R^{nC}. However, this operation alone does not yet render the classes separable: features associated with other classes will span the same n-dimensional subspace. This reflects a fundamental conflict between linear (subspace) modeling and invariance. One way of resolving this conflict is to leverage additional structure within each class, in the form of sparsity: signals within each class can be assumed to be generated not as arbitrary linear combinations of basis vectors, but as sparse combinations of atoms of different (incoherent) dictionaries D_j.
Under this assumption, if the convolution kernels {k_c} match the sparsifying dictionaries well, 7 the multi-channel responses should be sparse. Hence we may apply an entry-wise sparsity-promoting nonlinear thresholding τ(·) to the filter outputs, setting small (say, absolute value below ε) or negative responses to zero: 8

z̄ = τ([circ(k_1) z, . . . , circ(k_C) z]) ∈ R^{n×C}.

These features can be assumed to lie on a lower-dimensional (nonlinear) submanifold of R^{n×C}, which can be linearized and separated from the other classes by subsequent ReduNet layers. This multi-channel ReduNet retains the good invariance properties described above: all Ē and C̄_j matrices are block circulant and represent multi-channel 1D circular convolutions (see Proposition B.2 of Appendix B for a rigorous statement and proof):

Ē(z̄) = ē ⊛ z̄, C̄_j(z̄) = c̄_j ⊛ z̄ ∈ R^{n×C}, j = 1, . . . , k, where ē, c̄_j ∈ R^{C×C×n}.

Hence, by construction, the resulting ReduNet is a deep convolutional network for multi-channel 1D signals. Unlike Xception nets (Chollet, 2017), these multi-channel convolutions are in general not depthwise separable. 9

Fast Computation in the Spectral Domain. Since all circulant matrices are simultaneously diagonalized by the discrete Fourier transform matrix 10 F, i.e., circ(z) = F* D F (see Fact 5 in Appendix B.1), every Σ of the form (14) can be converted to a standard "blocks of diagonals" form:

Σ = diag(F*, . . . , F*) · [D_11 · · · D_1C; . . . ; D_C1 · · · D_CC] · diag(F, . . . , F) ∈ R^{nC×nC}, (17)

where each block D_kl is an n × n diagonal matrix. The middle factor on the right-hand side of (17) becomes block diagonal after a permutation of rows and columns. Hence, to compute Ē and C̄_j ∈ R^{nC×nC}, we only need to invert, in the frequency domain, C × C blocks n times; the overall complexity is O(nC³) instead of the O((nC)³) required to invert a generic nC × nC matrix.
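This frequency-domain shortcut can be sketched as follows (our own toy dimensions; the unitary scaling of F is ignored, which only rescales the covariance):

```python
import numpy as np

# Invert (I + alpha * Sigma) frequency-by-frequency instead of as a full
# nC x nC matrix. Z_ch holds the C channels of one n-point feature.
rng = np.random.default_rng(0)
n, C, alpha = 16, 3, 0.5
Z_ch = rng.standard_normal((C, n))

# One FFT per channel; at each frequency f, the covariance block is only C x C.
Zf = np.fft.fft(Z_ch, axis=1)                      # (C, n)
inv_blocks = np.empty((n, C, C), dtype=complex)
for f in range(n):
    cov_f = np.outer(Zf[:, f], Zf[:, f].conj())    # C x C covariance at freq f
    inv_blocks[f] = np.linalg.inv(np.eye(C) + alpha * cov_f)
# Total cost: n inversions of C x C blocks, i.e. O(n C^3) vs O((n C)^3).
```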
More details on implementing the network in the spectral domain can be found in Appendix B.3 (see Theorem B.3 for a rigorous statement and Algorithm 1 for implementation details). 11

Connections to Recurrent and Convolutional Sparse Coding. The sparse coding perspective of Gregor & LeCun (2010) was later extended to recurrent and convolutional networks for serial data, e.g., Wisdom et al. (2016); Papyan et al. (2016); Sulam et al. (2018); Monga et al. (2019). Although both sparsity and convolution are advocated as desirable characteristics of deep networks, these works do not explicitly justify the necessity of sparsity and convolutions from the objective of the network, say classification. In our framework, we see how the multi-channel convolutions (Ē, C̄_j), the different nonlinear activations (π̂_j, τ), and the sparsity requirement are derived from, rather than heuristically proposed for, the objective of maximizing the rate reduction of the features while enforcing shift invariance.

Footnote 7: There is a vast literature on how to learn the most compact and optimal sparsifying dictionaries from sample data, e.g., Li & Bresler (2019); Qu et al. (2019). Nevertheless, in practice, a sufficient number of random filters often suffices to ensure that features of different classes are separable (Chan et al., 2015).
Footnote 8: Here the nonlinear operator τ can be chosen to be a soft thresholding or a ReLU.
Footnote 9: It remains open what additional structures on the data would lead to depthwise separable convolutions.
Footnote 10: Here we have scaled the matrix F to be unitary; hence it differs from the conventional DFT matrix by a factor of 1/√n.

2D Images and Translation Invariance. In the case of classifying images invariant to arbitrary 2D translations, we may view the image (feature) z ∈ R^{(W×H)×C} as a function defined on a torus T² (discretized as a W × H grid) and consider G to be the (Abelian) group of all 2D (circular)
translations on the torus. As we show in Appendix C, the associated linear operators Ē and C̄_j act on the image feature z as multi-channel 2D circular convolutions. The resulting network is a deep convolutional network that shares the same multi-channel convolution structure as conventional CNNs for 2D images (LeCun et al., 1995; Krizhevsky et al., 2012). The difference is that, again, the architecture and parameters of our network are derived from the rate reduction objective, and so are the nonlinear activations π̂_j and τ. Our derivation in Appendix C also shows that this multi-channel 2D convolutional network can be constructed more efficiently in the spectral domain (see Theorem C.1 of Appendix C for a rigorous statement and justification).

Footnote 11: There is strong scientific evidence that neurons in the visual cortex encode and transmit information in their rate of spiking, hence the so-called spiking neurons (Softky & Koch, 1993; Eliasmith & Anderson, 2003). Nature might be exploiting the computational efficiency of the frequency domain for achieving shift invariance.

[Figure 2: Original samples and learned representations for 2D and 3D mixtures of Gaussians. We visualize the data points X (before the mapping) and the features Z (after the mapping) by scatter plots; in each scatter plot, each color represents one class of samples. We also plot the progression of the values of the objective functions (R and R_c on training and test data).]

3. EXPERIMENTS

We now verify whether the so-constructed ReduNet achieves its design objectives through experiments on synthetic data and real images. The datasets and experiments are chosen to clearly demonstrate the behaviors of the network obtained by our algorithm, in terms of learning the correct discriminative representation and truly achieving invariance. It is not the purpose of this work to push the state of the art on any real datasets with highly engineered networks and systems, although we believe this framework has that potential in the future. All code is implemented in Python, mainly using NumPy, and all experiments are conducted on a computer with a 2.8 GHz Intel i7 CPU and 16 GB of memory. Implementation details and more experiments can be found in Appendix D.

Learning Mixtures of Gaussians in S¹ and S². Consider a mixture of two Gaussian distributions in R² projected onto S¹. We first generate data points from the two distributions: X_1 = [x_1^1, . . . , x_1^m] ∈ R^{2×m} with x_1^i ∼ N(μ_1, σ_1 I) and π(x_1^i) = 1, and X_2 = [x_2^1, . . . , x_2^m] ∈ R^{2×m} with x_2^i ∼ N(μ_2, σ_2 I) and π(x_2^i) = 2. We set m = 500, σ_1 = σ_2 = 0.1, and μ_1, μ_2 ∈ S¹. We then project all the data points onto S¹, i.e., x_j^i / ‖x_j^i‖_2. To construct the network (computing E_ℓ, C_j^ℓ for each layer), we set the number of iterations/layers L = 2,000, the step size η = 0.5, and the precision ε = 0.1. As shown in Figures 2a-2b, after the mapping f(·, θ), samples from the same class converge to a single cluster, and the angle between the two clusters is approximately π/2, well aligned with the optimal solution Z* of the MCR² loss in S¹. The MCR² loss of the features at different layers can be found in Figure 2c. Empirically, we find that our constructed network maximizes the MCR² loss and converges stably. Similarly, we consider a mixture of three Gaussian distributions in R³ with means μ_1, μ_2, μ_3 drawn uniformly on S², variances σ_1 = σ_2 = σ_3 = 0.1, and all data points projected onto S² (see Figures 2d-2f).
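The data-generation step above can be sketched as follows (the means μ_1, μ_2 are our own arbitrary choices; the text only specifies that they lie on S¹):

```python
import numpy as np

# Sketch of the 2D mixture-of-Gaussians setup (m and sigma as in the text).
rng = np.random.default_rng(0)
m, sigma = 500, 0.1
mu1 = np.array([1.0, 0.0])                  # assumed mean on S^1
mu2 = np.array([0.0, 1.0])                  # assumed mean on S^1
X1 = rng.normal(mu1, sigma, size=(m, 2)).T  # class 1, shape (2, m)
X2 = rng.normal(mu2, sigma, size=(m, 2)).T  # class 2, shape (2, m)
X = np.concatenate([X1, X2], axis=1)
X /= np.linalg.norm(X, axis=0)              # project every sample onto S^1
pi = np.repeat([1, 2], m)                   # memberships pi(x_i)
```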
We observe behavior similar to the S¹ case: samples from the same class converge to one cluster, and different clusters are orthogonal to each other. Moreover, we sample new data points from the same distributions for both cases and find that new samples from the same class consistently converge to the same cluster as the training samples. More examples and details can be found in Appendix D.

Learning Shift-Invariant Features. As described in § 2.3, by maximizing the rate reduction via Eq. (12), we are able to explicitly construct operators that are invariant to (circular) shifts. To verify the effectiveness of the proposed network on shift-invariance tasks, we apply it to classify signals sampled from two different 1D functions. The underlying function of the first class is the sinusoid h_1(t) = sin(t) + ε, and that of the second class is a composition of the sign and sin functions, h_2(t) = sign(sin(t)) + ε, where ε ∼ N(0, 0.1) (see Figure 7 in Appendix D). Each sample is generated by first picking t_0 ∈ [0, 10π] and then taking n equidistant points within [t_0, t_0 + 2π], with i.i.d. Gaussian noise. Detailed implementations for sampling from h_1 and h_2 can be found in Appendix D.3. We generate a dataset containing m samples, with m/2 samples in each class, i.e., X = [X_1, X_2] ∈ R^{n×m}. Each sample is then lifted to a C-channel feature (see Figure 3d). We find that the proposed network maps different classes of signals (including all shifted augmentations) to orthogonal subspaces, increasing the MCR² loss (shown in Figure 3c).

Rotational Invariance on MNIST Digits. We study the ReduNet on learning rotation-invariant features on the MNIST dataset (LeCun, 1998). We impose a polar grid on the image x ∈ R^{H×W}, with its geometric center at the center of the 2D polar grid. For each radius r_i, i ∈ …

4. CONCLUSIONS AND FUTURE WORK

This work offers an interpretation of deep (convolutional) networks by constructing them from first principles. It provides a rigorous explanation of the deep architecture and its components from the perspective of optimizing the rate reduction objective. Simulations and experiments on basic datasets clearly verify that the so-constructed ReduNet achieves the desired functionality and objectives. Although in this work the ReduNet is constructed purely forward, one may study how to effectively fine-tune it via back propagation. We believe rate reduction provides a principled framework for designing new networks with interpretable architectures and operators that can scale up to real-world datasets and problems, with better performance guarantees. This framework can also be naturally extended to online or unsupervised learning settings, where Π is only partially known, or unknown and to be optimized.

A ADDITIONAL REMARKS AND EXTENSIONS

Remark 1 (Interpretation of the Two Linear Operators) For any z_ℓ we have

(I + α Z_ℓ Z_ℓ*)^{−1} z_ℓ = z_ℓ − Z_ℓ q*, where q* ≐ argmin_q α‖z_ℓ − Z_ℓ q‖_2² + ‖q‖_2². (18)

Notice that q* is exactly the solution to the ridge regression over all the data points in Z_ℓ. Therefore, E_ℓ (and similarly C_j^ℓ) is approximately (i.e., when m is large enough) the projection onto the orthogonal complement of the subspace spanned by the columns of Z_ℓ. Another way to interpret the matrix E_ℓ is through the eigenvalue decomposition of the covariance matrix Z_ℓ Z_ℓ*. Assuming Z_ℓ Z_ℓ* ≐ U_ℓ Λ_ℓ U_ℓ* with Λ_ℓ ≐ diag{σ_1, . . . , σ_d}, we have

E_ℓ = α U_ℓ diag(1/(1 + ασ_1), . . . , 1/(1 + ασ_d)) U_ℓ*. (19)

Therefore, the matrix E_ℓ operates on a vector z_ℓ by stretching it in such a way that directions of large variance are shrunk while directions of vanishing variance are kept. These are exactly the directions (4) along which we move the features so that the overall volume expands and the coding rate increases; hence the positive sign. To the opposite effect, the directions associated with (5) are exactly the "residuals" by which the features of each class deviate from the subspace to which they are supposed to belong. These are exactly the directions along which the features need to be compressed back onto their respective subspaces; hence the negative sign. Essentially, all linear operations in the ReduNet are determined by the data conducting "auto-regressions" among themselves. The recent renewed understanding of ridge regression in the over-parameterized setting (Yang et al., 2020) indicates that using seemingly redundantly sampled data (from each subspace) as regressors does not lead to overfitting.

Remark 2 (Connection and Difference from Neural ODE) Notice that one may interpret the increment (7) as a discretized version of a continuous ordinary differential equation (ODE):

ż = g(z, θ). (20)

Hence the (deep) network so constructed can be interpreted as a certain neural ODE (Chen et al., 2018).
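Returning to Remark 1, the ridge-regression identity (18) is easy to verify numerically (a sketch with arbitrary dimensions; the ridge solution solves (α Z*Z + I) q = α Z* z):

```python
import numpy as np

# Numerical check of Eq. (18): (I + alpha Z Z*)^{-1} z = z - Z q*, where
# q* = argmin_q alpha ||z - Z q||^2 + ||q||^2 is the ridge-regression solution.
rng = np.random.default_rng(0)
n, m, alpha = 8, 20, 2.0
Z = rng.standard_normal((n, m))
z = rng.standard_normal(n)

lhs = np.linalg.solve(np.eye(n) + alpha * Z @ Z.T, z)
q_star = np.linalg.solve(alpha * Z.T @ Z + np.eye(m), alpha * Z.T @ z)
assert np.allclose(lhs, z - Z @ q_star)
```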
Nevertheless, unlike neural ODEs, where the flow g is chosen to have some generic structure, here our g(z, θ) emulates the gradient flow of the rate reduction on the feature set, Ż = ∂∆R/∂Z, and its structure is entirely derived and fully determined by this objective, without any other priors or heuristics.

Remark 3 (Approximate with a ReLU Network) In practice, there are many other simpler nonlinear activation functions that one can use to approximate the membership π(•) and subsequently the nonlinear operation σ in (9). Notice that the geometric meaning of σ in (9) is to compute the "residual" of each feature against the subspace to which it belongs. So when we restrict all our features to be in the first (positive) quadrant of the feature space (footnote 7), one may approximate this residual using the rectified linear unit operation, ReLU, applied to p^j = C^j z or its orthogonal complement:

$$\sigma(z_\ell) \propto z_\ell - \sum_{j=1}^{k} \mathrm{ReLU}\big(P^j z_\ell\big),$$

where P^j = (C^j)^⊥ is the projection onto the j-th class (footnote 8) and ReLU(x) = max(0, x). The above approximation is good under the more restrictive assumption that the projection of z_ℓ onto the correct class via P^j is mostly large and positive, yet small or negative for the other classes. The resulting ReduNet will be a network primarily involving ReLU operations and feature normalization (onto S^{n-1}) between layers. Although in this work we have argued that the forward-constructed ReduNet already works to a large extent, in practice one can certainly conduct back propagation to further fine-tune the so-obtained network, say to correct remaining errors in predicting labels of the training data. Empirically, people have found that deep networks with ReLU activations are easier to train via back propagation (Krizhevsky et al., 2012).

Remark 4 (Accelerated Optimization via Additional Skip Connections) Empirically, people have found that additional skip connections across multiple layers may improve network performance, e.g.
the DenseNet (Huang et al., 2017). In our framework, the role of each layer is precisely interpreted as one iterative gradient ascent step for the objective function ∆R. In our experiments (see Section 3), we have observed that the basic gradient scheme sometimes converges slowly, resulting in deep networks with thousands of layers (iterations)! To improve the efficiency of the basic ReduNet, one may consider in the future accelerated gradient methods such as Nesterov acceleration (Nesterov, 1983) or perturbed accelerated gradient descent (Jin et al., 2018). Say, to minimize or maximize a function h(z), such accelerated methods usually take the form:

$$p_{\ell+1} = z_\ell + \beta \cdot (z_\ell - z_{\ell-1}), \qquad z_{\ell+1} = p_{\ell+1} + \eta \cdot \nabla h(p_{\ell+1}).$$

Hence they require introducing additional skip connections among the three layers ℓ-1, ℓ, and ℓ+1. For typical convex or nonconvex programs, such accelerated schemes can often reduce the number of iterations by an order of magnitude.
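As an illustration of the accelerated scheme (on a toy quadratic objective, not the rate reduction objective itself; the objective h and all constants below are illustrative):

```python
import numpy as np

# Toy concave objective h(z) = -0.5 * z^T A z + b^T z, maximized by gradient ascent.
# The momentum term beta * (z_l - z_{l-1}) is the "skip connection" across layers
# l-1, l, and l+1 described in Remark 4.
rng = np.random.default_rng(0)
A = np.diag(np.linspace(0.1, 10.0, 20))        # ill-conditioned quadratic
b = rng.standard_normal(20)
grad = lambda z: -A @ z + b                    # gradient of h

def ascend(beta, eta=0.05, L=200):
    z_prev = z = np.zeros(20)
    for _ in range(L):
        p = z + beta * (z - z_prev)            # skip connection (momentum)
        z_prev, z = z, p + eta * grad(p)       # gradient step at the look-ahead point
    return z

z_star = np.linalg.solve(A, b)                 # the unique maximizer of h
plain = np.linalg.norm(ascend(beta=0.0) - z_star)   # basic gradient scheme
accel = np.linalg.norm(ascend(beta=0.9) - z_star)   # Nesterov-style acceleration
```

With the same number of steps, the accelerated iterates end up far closer to the optimum than the basic scheme, which is the effect Remark 4 suggests exploiting to shorten the ReduNet.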

B 1D CIRCULAR SHIFT INVARIANCE

It has long been known that, to implement a convolutional neural network, one can achieve higher computational efficiency by implementing the network in the spectral domain via the fast Fourier transform (Mathieu et al., 2013; Lavin & Gray, 2016; Vasilache et al., 2015). However, our purpose here is different: we want to show that the linear operators E and C^j derived from the gradient flow of MCR² are naturally convolutions when we enforce shift-invariance rigorously. Their convolution structure is derived from the rate reduction objective, rather than heuristically imposed upon the network. Furthermore, the computation involved in constructing these linear operators has a naturally efficient implementation in the spectral domain via the fast Fourier transform. Arguably this work is the first to show that multi-channel convolutions, together with other convolution-preserving nonlinear operations in the ReduNet, are both necessary and sufficient to ensure shift invariance. To be somewhat self-contained and self-consistent, in this section we first introduce our notation and review some key properties of circulant matrices, which will be used to characterize the properties of the linear operators E and C^j and to compute them efficiently. The reader may refer to Kra & Simanca (2012) for a more rigorous exposition of circulant matrices.

B.1 PROPERTIES OF CIRCULANT MATRIX AND CIRCULAR CONVOLUTION

Given a vector z = [z_0, z_1, ..., z_{n-1}]* ∈ R^n (footnote 9), we may arrange all its circularly shifted versions into a circulant matrix:

$$\operatorname{circ}(z) \doteq \begin{bmatrix} z_0 & z_{n-1} & \cdots & z_2 & z_1 \\ z_1 & z_0 & z_{n-1} & \cdots & z_2 \\ \vdots & z_1 & z_0 & \ddots & \vdots \\ z_{n-2} & \vdots & \ddots & \ddots & z_{n-1} \\ z_{n-1} & z_{n-2} & \cdots & z_1 & z_0 \end{bmatrix} \in \mathbb{R}^{n\times n}.$$

Fact 1 (Convolution as matrix multiplication via circulant matrix) The multiplication of a circulant matrix circ(z) with a vector x ∈ R^n gives a circular (or cyclic) convolution, i.e.,

$$\operatorname{circ}(z) \cdot x = z \circledast x, \qquad (25)$$

where

$$(z \circledast x)_i = \sum_{j=0}^{n-1} x_j\, z_{(i+n-j) \bmod n}.$$

Fact 2 (Properties of circulant matrices) Circulant matrices have the following properties:
• The transpose of a circulant matrix, say circ(z)*, is circulant;
• The product of two circulant matrices, for example circ(z) circ(z)*, is circulant;
• The inverse of a non-singular circulant matrix is also circulant (hence it represents a circular convolution).

These properties of circulant matrices are used extensively in this work to characterize the convolution structures of the operators E and C^j. Given a set of vectors [z^1, ..., z^m] ∈ R^{n×m}, let circ(z^i) ∈ R^{n×n} be the circulant matrix for z^i. Then we have the following:

Proposition B.1 (Convolution structures of E and C^j) Given a set of vectors Z = [z^1, ..., z^m], the matrix

$$E \doteq \alpha\Big( I + \alpha \sum_{i=1}^{m} \operatorname{circ}(z^i)\operatorname{circ}(z^i)^* \Big)^{-1}$$

is a circulant matrix and represents a circular convolution: Ez = e ⊛ z, where e is the first column vector of E. Similarly, the matrices C^j associated with any subsets of Z are also circular convolutions.
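Proposition B.1 is easy to check numerically. The following NumPy sketch (sizes and α are illustrative) builds E from a few random signals and verifies both its circulant structure and its action as a circular convolution computed via the FFT:

```python
import numpy as np

def circ(z):
    """Circulant matrix whose k-th column is z circularly shifted by k."""
    return np.stack([np.roll(z, k) for k in range(len(z))], axis=1)

rng = np.random.default_rng(0)
n, m, alpha = 16, 4, 5.0
Z = [rng.standard_normal(n) for _ in range(m)]

# E = alpha * (I + alpha * sum_i circ(z_i) circ(z_i)^*)^{-1}
A = np.eye(n) + alpha * sum(circ(zi) @ circ(zi).T for zi in Z)
E = alpha * np.linalg.inv(A)

# E is itself circulant: every column is a shift of its first column ...
assert np.allclose(E, circ(E[:, 0]))

# ... so applying E is a circular convolution with its first column e,
# which can be computed in the spectral domain via the FFT:
x = rng.standard_normal(n)
e = E[:, 0]
conv = np.real(np.fft.ifft(np.fft.fft(e) * np.fft.fft(x)))
assert np.allclose(E @ x, conv)
```

The two assertions check, respectively, the circulant structure of E and the equivalence of matrix multiplication by E with circular convolution by its first column.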

B.2 CIRCULANT MATRIX AND CIRCULANT CONVOLUTION FOR MULTI-CHANNEL SIGNALS

In the remainder of this section, we view z as a 1D signal such as an audio signal. Since we will deal with the more general case of multi-channel signals, we use the traditional notation T to denote the temporal length of the signal and C the number of channels. Conceptually, the "dimension" n of such a multi-channel signal, viewed as a vector, is n = CT. (Footnote 16: Notice that in the main paper, for simplicity, we have used n to indicate both the 1D "temporal" and 2D "spatial" dimension of a signal, to be consistent with the vector case; it corresponds to T here. All notation should be clear within context.) As we will also reveal additional interesting structures of the operators Ē and C̄^j in the spectral domain, we use t as the index for time, p as the index for frequency, and c as the index for channels. Given a multi-channel 1D signal z̄ ∈ R^{C×T}, we denote

$$\bar z = \begin{bmatrix} \bar z[1]^* \\ \vdots \\ \bar z[C]^* \end{bmatrix} = [\bar z(0), \bar z(1), \ldots, \bar z(T-1)] = \{\bar z[c](t)\}_{c=1,\,t=0}^{c=C,\,t=T-1}.$$

To compute the coding rate reduction for a collection of such multi-channel 1D signals, we may flatten the matrix representation into a vector representation by stacking the multiple channels of z̄ as a column vector. In particular, we let

$$\operatorname{vec}(\bar z) = \big[\bar z[1](0), \bar z[1](1), \ldots, \bar z[1](T-1), \bar z[2](0), \ldots\big]^* \in \mathbb{R}^{C\cdot T}. \qquad (28)$$

Furthermore, to obtain shift invariance for the coding rate reduction, we may generate a collection of shifted copies of z̄ (along the temporal dimension). Stacking the vector representations of these shifted copies as column vectors, we obtain

$$\operatorname{circ}(\bar z) \doteq \begin{bmatrix} \operatorname{circ}(\bar z[1]) \\ \vdots \\ \operatorname{circ}(\bar z[C]) \end{bmatrix} \in \mathbb{R}^{(C\cdot T)\times T},$$

where we overload the notation "circ(•)" defined in (24). We now consider a collection of m multi-channel 1D signals {z̄_i ∈ R^{C×T}}_{i=1}^m. Compactly representing the data by Z̄ ∈ R^{C×T×m}, in which the i-th slice on the last dimension is z̄_i, we accordingly write circ(Z̄) ≐ [circ(z̄_1), ..., circ(z̄_m)] ∈ R^{(C·T)×(T·m)}, and we denote

$$\bar Z[c] = [\bar z_1[c], \ldots, \bar z_m[c]] \in \mathbb{R}^{T\times m}, \qquad \bar Z(t) = [\bar z_1(t), \ldots, \bar z_m(t)] \in \mathbb{R}^{C\times m}. \qquad (31)$$

Then, we define the shift-invariant coding rate reduction for Z̄ ∈ R^{C×T×m} as

$$\Delta R_{\operatorname{circ}}(\bar Z, \Pi) \doteq \frac{1}{T}\Delta R(\operatorname{circ}(\bar Z), \Pi) = \frac{1}{2T}\log\det\!\Big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\Big) - \sum_{j=1}^{k}\frac{\gamma_j}{2T}\log\det\!\Big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\Big), \qquad (32)$$

where α = CT/(mT ε²) = C/(m ε²), α_j = CT/(tr(Π_j) T ε²) = C/(tr(Π_j) ε²), γ_j = tr(Π_j)/m, and Π̄_j is the membership matrix Π_j augmented in the obvious way. Note that we introduce the normalization factor T in (32) because the circulant matrix circ(Z̄) contains T (shifted) copies of each signal. By applying (4) and (5), we obtain the derivative of ∆R_circ(Z̄, Π) as

$$\frac{1}{2T}\frac{\partial \log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\operatorname{vec}(\bar Z)} = \frac{1}{2T}\frac{\partial \log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\operatorname{circ}(\bar Z)}\frac{\partial\operatorname{circ}(\bar Z)}{\partial\operatorname{vec}(\bar Z)} = \underbrace{\alpha\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)^{-1}}_{\bar E\,\in\,\mathbb{R}^{(C\cdot T)\times(C\cdot T)}}\operatorname{vec}(\bar Z), \qquad (33)$$

$$\frac{\gamma_j}{2T}\frac{\partial \log\det\big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\big)}{\partial\operatorname{vec}(\bar Z)} = \gamma_j\underbrace{\alpha_j\big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\big)^{-1}}_{\bar C^j\,\in\,\mathbb{R}^{(C\cdot T)\times(C\cdot T)}}\operatorname{vec}(\bar Z)\,\Pi_j. \qquad (34)$$

In the following, we show that Ē · vec(z̄) represents a multi-channel circular convolution. Note that

$$\bar E = \alpha\Bigg(I + \alpha\sum_{i=1}^{m}\begin{bmatrix}\operatorname{circ}(\bar z_i[1])\operatorname{circ}(\bar z_i[1])^* & \cdots & \operatorname{circ}(\bar z_i[1])\operatorname{circ}(\bar z_i[C])^* \\ \vdots & \ddots & \vdots \\ \operatorname{circ}(\bar z_i[C])\operatorname{circ}(\bar z_i[1])^* & \cdots & \operatorname{circ}(\bar z_i[C])\operatorname{circ}(\bar z_i[C])^*\end{bmatrix}\Bigg)^{-1}. \qquad (35)$$

By Fact 2, the matrix inside the inverse above is a block circulant matrix, i.e., a block matrix in which each block is a circulant matrix. A useful fact about the inverse of such a matrix is the following.

Fact 3 (Inverse of block circulant matrices) The inverse of a block circulant matrix is block circulant (with respect to the same block partition).

The main result of this subsection is the following.

Proposition B.2 (Convolution structures of Ē and C̄^j) Given a collection of multi-channel 1D signals {z̄_i ∈ R^{C×T}}_{i=1}^m, the matrix Ē is a block circulant matrix, i.e.,

$$\bar E \doteq \begin{bmatrix}\bar E_{1,1} & \cdots & \bar E_{1,C} \\ \vdots & \ddots & \vdots \\ \bar E_{C,1} & \cdots & \bar E_{C,C}\end{bmatrix},$$

where each Ē_{c,c'} ∈ R^{T×T} is a circulant matrix. Moreover, Ē represents a multi-channel circular convolution, i.e., for any multi-channel signal z̄ ∈ R^{C×T} we have Ē · vec(z̄) = vec(ē ⊛ z̄). Here ē ∈ R^{C×C×T} is a multi-channel convolution kernel with ē[c, c'] ∈ R^T the first column vector of Ē_{c,c'}, and ē ⊛ z̄ ∈ R^{C×T} is the multi-channel circular convolution (with "⊛" overloading the notation from Eq. (26)) defined as

$$(\bar e \circledast \bar z)[c] \doteq \sum_{c'=1}^{C} \bar e[c, c'] \circledast \bar z[c'], \quad \forall c = 1, \ldots, C.$$

Similarly, the matrices C̄^j associated with any subsets of Z̄ are also multi-channel circular convolutions. Note that the calculation of Ē in (35) requires inverting a matrix of size (C·T) × (C·T). In the following, we show that this computation can be accelerated by working in the frequency domain.

B.3 FAST COMPUTATION IN SPECTRAL DOMAIN

Circulant matrix and Discrete Fourier Transform. A remarkable property of circulant matrices is that they all share the same set of eigenvectors, which form a unitary matrix. We define the matrix

$$F_T \doteq \frac{1}{\sqrt{T}}\begin{bmatrix} \omega_T^0 & \omega_T^0 & \cdots & \omega_T^0 & \omega_T^0 \\ \omega_T^0 & \omega_T^1 & \cdots & \omega_T^{T-2} & \omega_T^{T-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \omega_T^0 & \omega_T^{T-2} & \cdots & \omega_T^{(T-2)^2} & \omega_T^{(T-2)(T-1)} \\ \omega_T^0 & \omega_T^{T-1} & \cdots & \omega_T^{(T-2)(T-1)} & \omega_T^{(T-1)^2} \end{bmatrix} \in \mathbb{C}^{T\times T},$$

where ω_T ≐ exp(-2π√-1/T) is the T-th root of unity (so that ω_T^T = 1). The matrix F_T is a unitary Vandermonde matrix: F_T F_T* = I. Multiplying a vector by F_T is known as the discrete Fourier transform (DFT). Be aware that the conventional DFT matrix differs from our definition of F_T by a scale: it does not have the factor 1/√T in front. Here, for simplicity, we scale it so that F_T is unitary and its inverse is simply its conjugate transpose F_T*, whose columns are the eigenvectors of any circulant matrix (Abidi et al., 2016).

Fact 4 (DFT as matrix-vector multiplication) The DFT of a vector z ∈ R^T can be computed as DFT(z) ≐ F_T · z ∈ C^T, where

$$\mathrm{DFT}(z)(p) = \frac{1}{\sqrt{T}}\sum_{t=0}^{T-1} z(t)\,\omega_T^{p\cdot t}, \quad \forall p = 0, 1, \ldots, T-1.$$

The inverse discrete Fourier transform (IDFT) of a signal v ∈ C^T can be computed as IDFT(v) ≐ F_T* · v ∈ C^T (41), where

$$\mathrm{IDFT}(v)(t) = \frac{1}{\sqrt{T}}\sum_{p=0}^{T-1} v(p)\,\omega_T^{-p\cdot t}, \quad \forall t = 0, 1, \ldots, T-1.$$

Regarding the relationship between a circulant matrix (convolution) and the discrete Fourier transform, we have:

Fact 5 An n × n matrix M ∈ C^{n×n} is circulant if and only if it is diagonalized by the unitary matrix F_n:

$$F_n M F_n^* = D \quad \text{or} \quad M = F_n^* D F_n,$$

where D is a diagonal matrix of eigenvalues.

Fact 6 (DFT gives the eigenvalues of the circulant matrix) Given a vector z ∈ C^T, we have

$$F_T \cdot \operatorname{circ}(z) \cdot F_T^* = \operatorname{diag}(\mathrm{DFT}(z)) \quad \text{or} \quad \operatorname{circ}(z) = F_T^* \cdot \operatorname{diag}(\mathrm{DFT}(z)) \cdot F_T.$$

That is, the eigenvalues of the circulant matrix associated with a vector are given by its DFT.

Fact 7 (Parseval's theorem) For any z ∈ C^T, we have ‖z‖² = ‖DFT(z)‖²; more precisely,

$$\sum_{t=0}^{T-1} |z(t)|^2 = \sum_{p=0}^{T-1} |\mathrm{DFT}(z)(p)|^2.$$

This property allows us to easily "normalize" features after each layer onto the sphere S^{n-1} directly in the spectral domain (see Eq. (10) and (62)).

Circulant matrix and Discrete Fourier Transform for multi-channel signals. We now consider multi-channel 1D signals z̄ ∈ R^{C×T}. Let DFT(z̄) ∈ C^{C×T} be the matrix whose c-th row is the DFT of the corresponding channel z̄[c], i.e.,

$$\mathrm{DFT}(\bar z) \doteq \begin{bmatrix}\mathrm{DFT}(\bar z[1])^* \\ \vdots \\ \mathrm{DFT}(\bar z[C])^*\end{bmatrix} \in \mathbb{C}^{C\times T}.$$

Similar to the notation in (27), we denote

$$\mathrm{DFT}(\bar z) = [\mathrm{DFT}(\bar z)(0), \mathrm{DFT}(\bar z)(1), \ldots, \mathrm{DFT}(\bar z)(T-1)] = \{\mathrm{DFT}(\bar z)[c](t)\}_{c=1,\,t=0}^{c=C,\,t=T-1}.$$

As such, we have DFT(z̄[c]) = DFT(z̄)[c]. By using Fact 6, circ(z̄) and DFT(z̄) are related as follows:

$$\operatorname{circ}(\bar z) = \begin{bmatrix} F_T^*\operatorname{diag}(\mathrm{DFT}(\bar z[1]))F_T \\ \vdots \\ F_T^*\operatorname{diag}(\mathrm{DFT}(\bar z[C]))F_T \end{bmatrix} = \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix}\begin{bmatrix}\operatorname{diag}(\mathrm{DFT}(\bar z[1])) \\ \vdots \\ \operatorname{diag}(\mathrm{DFT}(\bar z[C]))\end{bmatrix} F_T. \qquad (48)$$

We now explain how this relationship can be leveraged to produce a fast computation of Ē defined in (33). First, there exists a permutation matrix P such that

$$\begin{bmatrix}\operatorname{diag}(\mathrm{DFT}(\bar z[1])) \\ \vdots \\ \operatorname{diag}(\mathrm{DFT}(\bar z[C]))\end{bmatrix} = P\begin{bmatrix}\mathrm{DFT}(\bar z)(0) & & \\ & \ddots & \\ & & \mathrm{DFT}(\bar z)(T-1)\end{bmatrix}. \qquad (49)$$

Combining (48) and (49), we have

$$\operatorname{circ}(\bar z)\operatorname{circ}(\bar z)^* = \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix} P\, D(\bar z)\, P^* \begin{bmatrix} F_T & & \\ & \ddots & \\ & & F_T \end{bmatrix}, \qquad (50)$$

where

$$D(\bar z) \doteq \begin{bmatrix}\mathrm{DFT}(\bar z)(0)\,\mathrm{DFT}(\bar z)(0)^* & & \\ & \ddots & \\ & & \mathrm{DFT}(\bar z)(T-1)\,\mathrm{DFT}(\bar z)(T-1)^*\end{bmatrix}.$$

Now, consider a collection of m multi-channel 1D signals Z̄ ∈ R^{C×T×m}. Similar to the notation in (30), we denote

$$\mathrm{DFT}(\bar Z)[c] = [\mathrm{DFT}(\bar z_1)[c], \ldots, \mathrm{DFT}(\bar z_m)[c]] \in \mathbb{C}^{T\times m}, \qquad \mathrm{DFT}(\bar Z)(p) = [\mathrm{DFT}(\bar z_1)(p), \ldots, \mathrm{DFT}(\bar z_m)(p)] \in \mathbb{C}^{C\times m}.$$

By using (50), we have

$$\bar E = \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix} P\, \alpha\Big(I + \alpha\sum_{i=1}^{m} D(\bar z_i)\Big)^{-1} P^* \begin{bmatrix} F_T & & \\ & \ddots & \\ & & F_T \end{bmatrix}. \qquad (53)$$

Note that α(I + α Σ_{i=1}^m D(z̄_i))^{-1} is equal to

$$\begin{bmatrix}\alpha\big(I + \alpha\,\mathrm{DFT}(\bar Z)(0)\mathrm{DFT}(\bar Z)(0)^*\big)^{-1} & & \\ & \ddots & \\ & & \alpha\big(I + \alpha\,\mathrm{DFT}(\bar Z)(T{-}1)\mathrm{DFT}(\bar Z)(T{-}1)^*\big)^{-1}\end{bmatrix}. \qquad (54)$$

Therefore, the calculation of Ē only requires inverting T matrices of size C × C.
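The reduction from one (C·T) × (C·T) inverse to T small C × C inverses can be checked numerically. A NumPy sketch (sizes illustrative; note that NumPy's `fft` uses the unnormalized DFT convention, for which the per-frequency formula holds verbatim since the 1/√T factors cancel):

```python
import numpy as np

def circ(z):
    """Circulant matrix whose k-th column is z circularly shifted by k."""
    return np.stack([np.roll(z, k) for k in range(len(z))], axis=1)

rng = np.random.default_rng(0)
C, T, m, alpha = 3, 8, 5, 2.0
Z = rng.standard_normal((C, T, m))        # m multi-channel 1D signals

# Direct route: stack per-channel circulant matrices into circ(Z),
# then invert one big (C*T) x (C*T) matrix.
circZ = np.concatenate(
    [np.vstack([circ(Z[c, :, i]) for c in range(C)]) for i in range(m)], axis=1)
E_big = alpha * np.linalg.inv(np.eye(C * T) + alpha * circZ @ circZ.T)

# Spectral route: the same operator is block-diagonal in frequency,
# so it only requires T small C x C inverses.
W = np.fft.fft(Z, axis=1)                 # DFT along the temporal axis
E_spec = [alpha * np.linalg.inv(np.eye(C) + alpha * W[:, p, :] @ W[:, p, :].conj().T)
          for p in range(T)]

# The two routes act identically on any multi-channel signal:
z = rng.standard_normal((C, T))
v = np.fft.fft(z, axis=1)
out_freq = np.stack([E_spec[p] @ v[:, p] for p in range(T)], axis=1)
out_time = np.fft.ifft(out_freq, axis=1).real
assert np.allclose(E_big @ z.reshape(-1), out_time.reshape(-1))
```

For realistic sizes (say C in the hundreds and T in the thousands), inverting T matrices of size C × C is dramatically cheaper than one dense (C·T) × (C·T) inverse.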
This motivates us to construct the ReduNet in the spectral domain for the purpose of accelerating the computation, as we explain next.

Shift-invariant ReduNet in the Spectral Domain. Motivated by the result in (54), we introduce the notations Ē(p) ∈ C^{C×C} and C̄^j(p) ∈ C^{C×C}, the p-th slices of Ē ∈ C^{C×C×T} and C̄^j on the last dimension, given by

$$\bar E(p) \doteq \alpha\big(I + \alpha\,\mathrm{DFT}(\bar Z)(p)\,\mathrm{DFT}(\bar Z)(p)^*\big)^{-1} \in \mathbb{C}^{C\times C}, \qquad (55)$$

$$\bar C^j(p) \doteq \alpha_j\big(I + \alpha_j\,\mathrm{DFT}(\bar Z)(p)\,\Pi_j\,\mathrm{DFT}(\bar Z)(p)^*\big)^{-1} \in \mathbb{C}^{C\times C}.$$

Then, the gradient of ∆R_circ(Z̄, Π) with respect to Z̄ can be calculated by the following result.

Theorem B.3 (Computing multi-channel convolutions Ē and C̄^j) Let Ū ∈ C^{C×T×m} and W̄^j ∈ C^{C×T×m}, j = 1, ..., k, be given by

$$\bar U(p) \doteq \bar E(p)\cdot\mathrm{DFT}(\bar Z)(p), \qquad \bar W^j(p) \doteq \bar C^j(p)\cdot\mathrm{DFT}(\bar Z)(p), \quad j = 1, \ldots, k,$$

for each p ∈ {0, ..., T-1}. Then we have

$$\frac{1}{2T}\frac{\partial\log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\bar Z} = \mathrm{IDFT}(\bar U), \qquad \frac{\gamma_j}{2T}\frac{\partial\log\det\big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\big)}{\partial\bar Z} = \gamma_j\cdot\mathrm{IDFT}(\bar W^j\Pi_j).$$

By this result, the gradient-ascent update in (3) (when applied to ∆R_circ(Z̄, Π)) can be equivalently expressed as an update in the frequency domain on V̄_ℓ ≐ DFT(Z̄_ℓ):

$$\bar V_{\ell+1}(p) \propto \bar V_\ell(p) + \eta\Big(\bar E_\ell(p)\,\bar V_\ell(p) - \sum_{j=1}^{k}\gamma_j\,\bar C^j_\ell(p)\,\bar V_\ell(p)\,\Pi_j\Big), \quad p = 0, \ldots, T-1. \qquad (61)$$

Similarly, the gradient-guided feature map increment in (10) can be equivalently expressed as an update in the frequency domain on v̄_ℓ ≐ DFT(z̄_ℓ):

$$\bar v_{\ell+1}(p) \propto \bar v_\ell(p) + \eta\,\bar E_\ell(p)\,\bar v_\ell(p) - \eta\,\sigma\big([\bar C^1_\ell(p)\,\bar v_\ell(p), \ldots, \bar C^k_\ell(p)\,\bar v_\ell(p)]\big), \quad p = 0, \ldots, T-1,$$

subject to the constraint ‖v̄_{ℓ+1}‖_F = ‖z̄_{ℓ+1}‖_F = 1 (the first equality follows from Fact 7). We summarize the training — or, to be more precise, the construction — of the ReduNet in the spectral domain in Algorithm 1.

Proof: [Proof of Theorem B.3] From (4), (53) and (48), we have

$$\frac{1}{2}\frac{\partial\log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\operatorname{circ}(\bar z_i)} = \bar E\operatorname{circ}(\bar z_i) = \bar E\begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix}\begin{bmatrix}\operatorname{diag}(\mathrm{DFT}(\bar z_i[1])) \\ \vdots \\ \operatorname{diag}(\mathrm{DFT}(\bar z_i[C]))\end{bmatrix}F_T \qquad (63)$$

$$= \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix} P\,\alpha\Big(I+\alpha\sum_i D(\bar z_i)\Big)^{-1}\begin{bmatrix}\mathrm{DFT}(\bar z_i)(0) & & \\ & \ddots & \\ & & \mathrm{DFT}(\bar z_i)(T{-}1)\end{bmatrix}F_T \qquad (64)$$

$$= \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix} P\begin{bmatrix}\bar E(0)\,\mathrm{DFT}(\bar z_i)(0) & & \\ & \ddots & \\ & & \bar E(T{-}1)\,\mathrm{DFT}(\bar z_i)(T{-}1)\end{bmatrix}F_T \qquad (65)$$

$$= \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix} P\begin{bmatrix}\bar u_i(0) & & \\ & \ddots & \\ & & \bar u_i(T{-}1)\end{bmatrix}F_T = \begin{bmatrix} F_T^* & & \\ & \ddots & \\ & & F_T^* \end{bmatrix}\begin{bmatrix}\operatorname{diag}(\bar u_i[1]) \\ \vdots \\ \operatorname{diag}(\bar u_i[C])\end{bmatrix}F_T = \operatorname{circ}(\mathrm{IDFT}(\bar u_i)). \qquad (67)$$

Therefore, we have

$$\frac{1}{2}\frac{\partial\log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\bar z_i} = \frac{1}{2}\frac{\partial\log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\operatorname{circ}(\bar z_i)}\cdot\frac{\partial\operatorname{circ}(\bar z_i)}{\partial\bar z_i} = T\cdot\mathrm{IDFT}(\bar u_i). \qquad (68)$$

By collecting the results for all i, we have

$$\frac{\partial\,\frac{1}{2T}\log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\bar Z} = \mathrm{IDFT}(\bar U).$$

In a similar fashion, we get

$$\frac{\partial\,\frac{\gamma_j}{2T}\log\det\big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\big)}{\partial\bar Z} = \gamma_j\cdot\mathrm{IDFT}(\bar W^j\Pi_j).$$

Algorithm 1 Training Algorithm (1D Signal, Shift Invariance, Spectral Domain)
Input: Z̄ ∈ R^{C×T×m}, Π, ε > 0, λ, and a learning rate η.
1: Set α = C/(m ε²), {α_j = C/(tr(Π_j) ε²)}_{j=1}^k, {γ_j = tr(Π_j)/m}_{j=1}^k.
2: Set V̄_0 = {v̄^i_0(p) ∈ C^C}_{p=0,i=1}^{T-1,m} ≐ DFT(Z̄) ∈ C^{C×T×m}.
3: for ℓ = 1, 2, ..., L do
4:   # Step 1: Compute E and C.
5:   for p = 0, 1, ..., T-1 do
6:     Compute Ē_ℓ(p) ∈ C^{C×C} and {C̄^j_ℓ(p) ∈ C^{C×C}}_{j=1}^k as Ē_ℓ(p) ≐ α(I + α V̄_{ℓ-1}(p) V̄_{ℓ-1}(p)*)^{-1}, C̄^j_ℓ(p) ≐ α_j(I + α_j V̄_{ℓ-1}(p) Π_j V̄_{ℓ-1}(p)*)^{-1};
7:   end for
8:   # Step 2: Update v̄^i for each i.
9:   for i = 1, ..., m do
10:    # Compute projection at each frequency p.
11:    for p = 0, 1, ..., T-1 do
12:      Compute {p̄^{ij}_ℓ(p) ≐ C̄^j_ℓ(p) · v̄^i_{ℓ-1}(p) ∈ C^{C×1}}_{j=1}^k;
13:    end for
14:    # Compute overall projection by aggregating over frequency p.
15:    Let {P̄^{ij}_ℓ = [p̄^{ij}_ℓ(0), ..., p̄^{ij}_ℓ(T-1)] ∈ C^{C×T}}_{j=1}^k;
16:    # Compute soft assignment from projection.
17:    Compute {π̂^{ij}_ℓ = exp(-λ‖P̄^{ij}_ℓ‖_F) / Σ_{j'=1}^k exp(-λ‖P̄^{ij'}_ℓ‖_F)}_{j=1}^k;
18:    # Compute update at each frequency p.
19:    for p = 0, 1, ..., T-1 do
20:      v̄^i_ℓ(p) = v̄^i_{ℓ-1}(p) + η (Ē_ℓ(p) v̄^i_{ℓ-1}(p) - Σ_{j=1}^k γ_j · π̂^{ij}_ℓ · p̄^{ij}_ℓ(p));
21:    end for
22:    Normalize v̄^i_ℓ so that ‖v̄^i_ℓ‖_F = 1;
23:  end for
24:  Set Z̄_ℓ = IDFT(V̄_ℓ) as the feature at the ℓ-th layer;
25:  # Evaluate the objective value.
26:  Compute (1/2T) Σ_{p=0}^{T-1} [log det(I + α V̄_ℓ(p) V̄_ℓ(p)*) - Σ_j (tr(Π_j)/m) log det(I + α_j V̄_ℓ(p) Π_j V̄_ℓ(p)*)];
27: end for
Output: features Z̄_L, the learned filters {Ē_ℓ(p)}_{ℓ,p} and {C̄^j_ℓ(p)}_{j,ℓ,p}.
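For concreteness, one layer of the spectral-domain construction in Algorithm 1 can be sketched as follows (NumPy; a simplified sketch in which the feature normalization onto the sphere and the objective evaluation are omitted, and all parameter values are illustrative):

```python
import numpy as np

def redunet_layer_spectral(V, Pi, eps=0.1, lam=1.0, eta=0.5):
    """One spectral-domain ReduNet layer (sketch of Algorithm 1's inner loop).
    V : complex array of shape (C, T, m), DFT of the current features along time.
    Pi: (k, m) binary membership matrix (Pi[j, i] = 1 iff sample i is in class j)."""
    C, T, m = V.shape
    k = Pi.shape[0]
    alpha = C / (m * eps ** 2)
    alpha_j = C / (Pi.sum(axis=1) * eps ** 2)        # shape (k,)
    gamma_j = Pi.sum(axis=1) / m                     # shape (k,)

    # Step 1: expansion / compression operators, one C x C inverse per frequency.
    E = np.stack([alpha * np.linalg.inv(np.eye(C) + alpha * V[:, p] @ V[:, p].conj().T)
                  for p in range(T)])                # (T, C, C)
    Cj = np.stack([[alpha_j[j] * np.linalg.inv(
                        np.eye(C) + alpha_j[j] * (V[:, p] * Pi[j]) @ V[:, p].conj().T)
                    for p in range(T)] for j in range(k)])   # (k, T, C, C)

    # Step 2: soft assignments from the Frobenius norms of the class projections.
    proj = np.einsum('jpab,bpi->jpai', Cj, V)        # Cj(p) @ v_i(p), (k, T, C, m)
    norms = np.sqrt((np.abs(proj) ** 2).sum(axis=(1, 2)))    # ||P_ij||_F, (k, m)
    pi_soft = np.exp(-lam * norms)
    pi_soft /= pi_soft.sum(axis=0, keepdims=True)

    # Step 3: gradient-ascent update at every frequency.
    expand = np.einsum('pab,bpi->api', E, V)
    compress = np.einsum('j,jpai,ji->api', gamma_j, proj, pi_soft)
    return V + eta * (expand - compress)
```

The learned layer parameters are the per-frequency matrices E and Cj; stacking many such layers (with normalization after each) is the forward "training" of the shift-invariant ReduNet.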

C 2D CIRCULAR TRANSLATION INVARIANCE

To a large degree, both conceptually and technically, the 2D case is very similar to the 1D case that we studied carefully in the previous Appendix B. For the sake of consistency and completeness, we give a brief account here.

C.1 DOUBLY BLOCK CIRCULANT MATRIX

In this section, we consider z as a 2D signal such as an image, and use H and W to denote its "height" and "width", respectively. It will be convenient to work with both a matrix representation

$$z = \begin{bmatrix} z(0,0) & z(0,1) & \cdots & z(0,W-1) \\ z(1,0) & z(1,1) & \cdots & z(1,W-1) \\ \vdots & \vdots & \ddots & \vdots \\ z(H-1,0) & z(H-1,1) & \cdots & z(H-1,W-1) \end{bmatrix} \in \mathbb{R}^{H\times W},$$

and a vector representation

$$\operatorname{vec}(z) \doteq \big[z(0,0), \ldots, z(0,W{-}1),\; z(1,0), \ldots, z(1,W{-}1),\; \ldots,\; z(H{-}1,0), \ldots, z(H{-}1,W{-}1)\big]^* \in \mathbb{R}^{H\cdot W}. \qquad (72)$$

We denote by trans_{p,q}(z) ∈ R^{H×W} the circularly translated version of z by an amount p in the vertical direction and q in the horizontal direction. That is, we let

$$\operatorname{trans}_{p,q}(z)(h, w) \doteq z(h - p \bmod H,\; w - q \bmod W), \quad \forall (h,w) \in \{0,\ldots,H{-}1\}\times\{0,\ldots,W{-}1\}.$$

It is obvious that trans_{0,0}(z) = z. Moreover, there is a total of H × W distinct translations, {trans_{p,q}(z), (p,q) ∈ {0,...,H-1} × {0,...,W-1}}. We may arrange their vector representations into a matrix and obtain

$$\operatorname{circ}(z) \doteq \big[\operatorname{vec}(\operatorname{trans}_{0,0}(z)), \ldots, \operatorname{vec}(\operatorname{trans}_{0,W-1}(z)),\; \operatorname{vec}(\operatorname{trans}_{1,0}(z)), \ldots, \operatorname{vec}(\operatorname{trans}_{1,W-1}(z)),\; \ldots,\; \operatorname{vec}(\operatorname{trans}_{H-1,0}(z)), \ldots, \operatorname{vec}(\operatorname{trans}_{H-1,W-1}(z))\big] \in \mathbb{R}^{(H\cdot W)\times(H\cdot W)}. \qquad (74)$$

The matrix circ(z) is known as the doubly block circulant matrix associated with z (see, e.g., Abidi et al. (2016); Sedghi et al. (2018)). We now consider a multi-channel 2D signal represented as a tensor z̄ ∈ R^{C×H×W}, where C is the number of channels. The c-th channel of z̄ is represented as z̄[c] ∈ R^{H×W}, and the (h,w)-th pixel as z̄(h,w) ∈ R^C. To compute the coding rate reduction for a collection of such multi-channel 2D signals, we may flatten the tensor representation into a vector representation by concatenating the vector representations of the channels, i.e., we let

$$\operatorname{vec}(\bar z) = \big[\operatorname{vec}(\bar z[1])^*, \ldots, \operatorname{vec}(\bar z[C])^*\big]^* \in \mathbb{R}^{C\cdot H\cdot W}.$$

Furthermore, to obtain shift invariance for the coding rate reduction, we may generate a collection of translated versions of z̄ (along the two spatial dimensions). Stacking the vector representations of these translated copies as column vectors, we obtain

$$\operatorname{circ}(\bar z) \doteq \begin{bmatrix}\operatorname{circ}(\bar z[1]) \\ \vdots \\ \operatorname{circ}(\bar z[C])\end{bmatrix} \in \mathbb{R}^{(C\cdot H\cdot W)\times(H\cdot W)}.$$

We can now define a translation-invariant coding rate reduction for multi-channel 2D signals. Consider a collection of m multi-channel 2D signals {z̄_i ∈ R^{C×H×W}}_{i=1}^m. Compactly representing the data by Z̄ ∈ R^{C×H×W×m}, where the i-th slice on the last dimension is z̄_i, we denote

$$\operatorname{circ}(\bar Z) = [\operatorname{circ}(\bar z_1), \ldots, \operatorname{circ}(\bar z_m)] \in \mathbb{R}^{(C\cdot H\cdot W)\times(H\cdot W\cdot m)}. \qquad (77)$$

Then, we define

$$\Delta R_{\operatorname{circ}}(\bar Z, \Pi) \doteq \frac{1}{HW}\Delta R(\operatorname{circ}(\bar Z), \Pi) = \frac{1}{2HW}\log\det\!\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big) - \sum_{j=1}^{k}\frac{\gamma_j}{2HW}\log\det\!\big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\big),$$

where α = CHW/(m HW ε²) = C/(m ε²), α_j = CHW/(tr(Π_j) HW ε²) = C/(tr(Π_j) ε²), γ_j = tr(Π_j)/m, and Π̄_j is the membership matrix Π_j augmented in the obvious way. By following an argument analogous to the 1D case, one can show that the ReduNet for multi-channel 2D signals naturally gives rise to multi-channel 2D circular convolution operations. We omit the details, and focus on the construction of the ReduNet in the frequency domain.
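The doubly block circulant structure can be checked numerically: applying circ(z) to a vectorized image is a 2D circular convolution, which the 2D FFT diagonalizes. A NumPy sketch (sizes illustrative):

```python
import numpy as np

def circ2(z):
    """Doubly block circulant matrix of a 2D signal z (H x W): column p*W + q is
    the vectorization of z circularly translated by (p, q)."""
    H, W = z.shape
    cols = [np.roll(z, (p, q), axis=(0, 1)).reshape(-1)
            for p in range(H) for q in range(W)]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
H, W = 4, 5
z = rng.standard_normal((H, W))
x = rng.standard_normal((H, W))

# Applying the doubly block circulant matrix to vec(x) is a 2D circular
# convolution of z with x, computable via the 2D FFT:
direct = circ2(z) @ x.reshape(-1)
spectral = np.fft.ifft2(np.fft.fft2(z) * np.fft.fft2(x)).real.reshape(-1)
assert np.allclose(direct, spectral)
```

This is the 2D analogue of Facts 1 and 6, and is what makes the per-frequency C × C computation of the next subsection possible.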

C.2 FAST COMPUTATION IN SPECTRAL DOMAIN

Doubly block circulant matrix and 2D-DFT. Similar to the case of circulant matrices for 1D signals, all doubly block circulant matrices share the same set of eigenvectors, and these eigenvectors form the unitary matrix

$$F \doteq F_H \otimes F_W \in \mathbb{C}^{(H\cdot W)\times(H\cdot W)},$$

where ⊗ denotes the Kronecker product and F_H, F_W are defined as in (38). Analogous to Fact 4, F defines the 2D-DFT as follows.

Fact 8 (2D-DFT as matrix-vector multiplication) The 2D-DFT of a signal z ∈ R^{H×W} can be computed as vec(DFT(z)) ≐ F · vec(z) ∈ C^{H·W}, where

$$\mathrm{DFT}(z)(p, q) = \frac{1}{\sqrt{HW}}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} z(h,w)\,\omega_H^{p\cdot h}\,\omega_W^{q\cdot w}, \quad \forall (p,q) \in \{0,\ldots,H{-}1\}\times\{0,\ldots,W{-}1\}. \qquad (81)$$

The 2D-IDFT of a signal v ∈ C^{H×W} can be computed as vec(IDFT(v)) ≐ F* · vec(v) ∈ C^{H·W}, where

$$\mathrm{IDFT}(v)(h, w) = \frac{1}{\sqrt{HW}}\sum_{p=0}^{H-1}\sum_{q=0}^{W-1} v(p,q)\,\omega_H^{-p\cdot h}\,\omega_W^{-q\cdot w}, \quad \forall (h,w) \in \{0,\ldots,H{-}1\}\times\{0,\ldots,W{-}1\}.$$

Analogous to Fact 6, F relates DFT(z) and circ(z) as follows.

Fact 9 (2D-DFT gives the eigenvalues of the doubly block circulant matrix) Given a signal z ∈ C^{H×W}, we have

$$F\cdot\operatorname{circ}(z)\cdot F^* = \operatorname{diag}(\operatorname{vec}(\mathrm{DFT}(z))) \quad \text{or} \quad \operatorname{circ}(z) = F^*\cdot\operatorname{diag}(\operatorname{vec}(\mathrm{DFT}(z)))\cdot F.$$

By using Fact 9, circ(z̄) and DFT(z̄) are related as follows:

$$\operatorname{circ}(\bar z) = \begin{bmatrix}F^*\operatorname{diag}(\operatorname{vec}(\mathrm{DFT}(\bar z[1])))F \\ \vdots \\ F^*\operatorname{diag}(\operatorname{vec}(\mathrm{DFT}(\bar z[C])))F\end{bmatrix} = \begin{bmatrix}F^* & & \\ & \ddots & \\ & & F^*\end{bmatrix}\begin{bmatrix}\operatorname{diag}(\operatorname{vec}(\mathrm{DFT}(\bar z[1]))) \\ \vdots \\ \operatorname{diag}(\operatorname{vec}(\mathrm{DFT}(\bar z[C])))\end{bmatrix}F. \qquad (85)$$

Similar to the 1D case, this relation can be leveraged to produce a fast implementation of the ReduNet in the spectral domain.

Translation-invariant ReduNet in the Spectral Domain. Given a collection of multi-channel 2D signals Z̄ ∈ R^{C×H×W×m}, we denote

$$\mathrm{DFT}(\bar Z)(p, q) \doteq [\mathrm{DFT}(\bar z_1)(p,q), \ldots, \mathrm{DFT}(\bar z_m)(p,q)] \in \mathbb{C}^{C\times m}.$$

We introduce the notations Ē(p, q) ∈ C^{C×C} and C̄^j(p, q) ∈ C^{C×C}, the (p,q)-th slices of Ē ∈ C^{C×C×H×W} and C̄^j on the last two dimensions, given by

$$\bar E(p,q) \doteq \alpha\big(I + \alpha\,\mathrm{DFT}(\bar Z)(p,q)\,\mathrm{DFT}(\bar Z)(p,q)^*\big)^{-1} \in \mathbb{C}^{C\times C}, \qquad (87)$$

$$\bar C^j(p,q) \doteq \alpha_j\big(I + \alpha_j\,\mathrm{DFT}(\bar Z)(p,q)\,\Pi_j\,\mathrm{DFT}(\bar Z)(p,q)^*\big)^{-1} \in \mathbb{C}^{C\times C}.$$

Then, the gradient of ∆R_circ(Z̄, Π) with respect to Z̄ can be calculated by the following result.

Theorem C.1 (Computing multi-channel 2D convolutions Ē and C̄^j) Let Ū ∈ C^{C×H×W×m} and W̄^j ∈ C^{C×H×W×m}, j = 1, ..., k, be given by

$$\bar U(p,q) \doteq \bar E(p,q)\cdot\mathrm{DFT}(\bar Z)(p,q), \qquad (89)$$

$$\bar W^j(p,q) \doteq \bar C^j(p,q)\cdot\mathrm{DFT}(\bar Z)(p,q), \quad j = 1, \ldots, k,$$

for each (p,q) ∈ {0,...,H-1} × {0,...,W-1}. Then we have

$$\frac{1}{2HW}\frac{\partial\log\det\big(I + \alpha\operatorname{circ}(\bar Z)\operatorname{circ}(\bar Z)^*\big)}{\partial\bar Z} = \mathrm{IDFT}(\bar U), \qquad \frac{\gamma_j}{2HW}\frac{\partial\log\det\big(I + \alpha_j\operatorname{circ}(\bar Z)\bar\Pi_j\operatorname{circ}(\bar Z)^*\big)}{\partial\bar Z} = \gamma_j\cdot\mathrm{IDFT}(\bar W^j\Pi_j).$$

This result shows that the calculation of the derivatives for the 2D case is analogous to that of the 1D case. Therefore, the construction of the ReduNet for 2D translation invariance can be performed using Algorithm 1 with straightforward extensions.

D IMPLEMENTATION DETAILS AND ADDITIONAL EXPERIMENTS

Code for reproducing the results in this work will be made publicly available upon publication of this paper. Disclaimer: in this work we do not particularly optimize any of the hyperparameters, such as the number of initial channels, kernel sizes, or learning rate, for the best performance. The choices are mostly for convenience and just minimally adequate to verify the concept, due to limited computational resources. We provide the cosine similarity results for the experiments described in Figure 2; the results are shown in Figure 4. We observe that the network can map the data points to orthogonal subspaces.

Additional experiments on S¹ and S². We also provide additional experiments on learning mixtures of Gaussians on S¹ and S² in Figure 5. We observe similar behavior of the proposed ReduNet: the network maps data points from different classes to orthogonal subspaces.

Additional experiments on S¹ with more than 2 classes. We also apply the ReduNet to learn mixtures of Gaussian distributions on S¹ with more than 2 classes. Notice that these are cases to which the existing theory about MCR² (Yu et al., 2020) no longer applies. These experiments suggest that MCR² still promotes between-class discriminativeness with the so-constructed ReduNet. In particular, the case on the left of Figure 6 indicates that the ReduNet has "merged" two linearly correlated clusters into one on the same line. This is consistent with the objective of rate reduction, which groups data as linear subspaces.

D.2 EXPERIMENTS ON UCI DATASETS

We evaluate the proposed ReduNet on real data, namely two UCI tasks (Dua & Graff, 2017): iris and mice. The iris dataset has 3 classes and 4 features; the mice dataset has 8 classes and 82 features. We randomly select 70% of the data for training and use the rest for evaluation. The results are summarized in Table 1. We compare our method with logistic regression, SVM, and random forest, using the implementations in sklearn (Pedregosa et al., 2011). From Table 1, we find that the forward-constructed ReduNet achieves performance comparable to these classic methods.

Figure 6: Learning mixtures of Gaussian distributions with more than 2 classes. For both cases, we use step size η = 0.5 and precision ε = 0.1. For (a), we set the number of iterations L = 2,500; for (b), we set L = 4,000.

D.3 ADDITIONAL EXPERIMENTS ON LEARNING SHIFT INVARIANT FEATURES

We provide additional experiments on learning shift-invariant features in §3. The procedure for sampling from h₁(t) = sin(t) + ε and h₂(t) = sign(sin(t)) + ε is described in Algorithm 2; we sample training and test signals using the same procedure. We also provide cosine similarities between samples in Figure 8: we visualize the cosine similarities for the inputs X_train, X_test as well as the learned representations Z_train, Z_test. The cosine similarities between sample pairs selected from different classes are shown in Figure 9. We observe that the original data are not orthogonal across classes, and that the ReduNet is able to learn discriminative (orthogonal) representations.

Algorithm 2 Sampling Shifted Signals
1: for j = 1, ..., k do
2:   for i = 1, ..., m do
3:     sample a time grid t with a random circular shift;
4:     sample noise ε;
5:     x^i_j = h_j(t) + ε; # broadcast over vector t
6:   end for
7:   X_j = [x^1_j, x^2_j, ..., x^m_j];
8: end for
9: X = [X_1, X_2, ..., X_k];
10: shuffle X.
Output: X.
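A NumPy sketch of this sampling procedure (the function name and constants are illustrative, not the exact script used for the experiments):

```python
import numpy as np

def sample_shift_classes(m=400, T=150, noise=0.1, seed=0):
    """Sample m signals per class from h1(t) = sin(t) and h2(t) = sign(sin(t)),
    each with a random circular shift and additive Gaussian noise (a sketch of
    the sampling procedure of Algorithm 2)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 2 * np.pi, T, endpoint=False)
    h = [np.sin, lambda s: np.sign(np.sin(s))]       # the two signal classes
    X, y = [], []
    for j, hj in enumerate(h):
        for _ in range(m):
            shift = rng.integers(T)                  # random circular shift
            x = np.roll(hj(t), shift) + noise * rng.standard_normal(T)
            X.append(x)
            y.append(j)
    perm = rng.permutation(len(X))                   # shuffle
    return np.stack(X)[perm], np.array(y)[perm]

X, y = sample_shift_classes(m=10)
```

Training and test sets are drawn by calling the same procedure with different seeds.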

D.4 ADDITIONAL EXPERIMENTS ON LEARNING ROTATIONAL INVARIANCE ON MNIST

We provide additional experiments on learning rotational invariance on MNIST in §3. Examples of rotated images are shown in Figure 12. We compare the accuracy (on both the original test data and the shifted test data) of the ReduNet (without considering invariance) and the shift-invariant ReduNet. For the ReduNet without invariance, we use the same training dataset as for the shift-invariant ReduNet, and we set the number of iterations L = 3,500, step size η = 0.5, and precision ε = 0.1. The results are summarized in Table 2. With the invariant design, we see from Table 2 that the shift-invariant ReduNet achieves better performance in terms of invariance on the MNIST binary classification task. We also provide cosine similarities between samples in Figure 10: we visualize the cosine similarities for the inputs X_train, X_test as well as the learned representations Z_train, Z_test. The cosine similarities between sample pairs selected from different classes are shown in Figure 11. We observe that the constructed ReduNet is able to learn discriminative (orthogonal) and invariant representations for MNIST digits.

In this part, we provide experimental results verifying the invariance property of the ReduNet under 2D translations. We construct 1) a ReduNet without considering invariance and 2) a 2D translation-invariant ReduNet for classifying digit '0' versus digit '1' on the MNIST dataset. We use m = 1,000 samples (500 from each class) for training the models, and another 500 samples (250 from each class) for evaluation. To evaluate the 2D translational invariance, for each test image x_test ∈ R^{H×W} we consider all translation augmentations of the test image with stride 7. More specifically, for the MNIST dataset we have H = W = 28, so for each image the total number of cyclic translation augmentations (with stride 7) is 4 × 4 = 16. Examples of translated images are shown in Figure 13. Notice that such translations are considerably larger than those normally considered in the literature, since we consider invariance to the entire group of cyclic translations on the H × W grid as a torus. See Figure 13 for some representative test samples.



Footnotes:
• To simplify the presentation, we assume for now that the features z and x have the same dimension n. In general they can be different, as we will soon see, say when z is a multi-channel feature extracted from x.
• Hence, any subsequent classifier defined on the resulting set of subspaces will be automatically invariant to such transformations.
• Notice that on the training samples Z_ℓ, for which the memberships Π_j are known, the so-defined g(z_ℓ, θ_ℓ) gives exactly the values of the gradient ∂∆R/∂Z(Z_ℓ).
• The choice of the softmax is mostly for its simplicity, as it is widely used in other (forward components of) deep networks for purposes such as selection, gating (Shazeer et al., 2017), and routing (Sabour et al., 2017). In principle, this term can be approximated by other operators, say ReLU, which is more amenable to training with back propagation; see Remark 3 in Appendix A.
• For 1D signals like audio, one may consider the conventional short-time Fourier transform (STFT); for 2D images, one may consider 2D wavelets as in the ScatteringNet (Bruna & Mallat, 2013). For learned filters, one can learn filters as the principal components of samples as in the PCANet (Chan et al., 2015) or via convolutional dictionary learning (Li & Bresler, 2019; Qu et al., 2019).
• It is remarkable to see how easily our framework leads to working deep networks with thousands of layers! But this also indicates that the efficiency of each layer is not very high; Remark 4 provides possible ways to improve it.
• Most current neural networks seem to adopt this regime.
• P^j can be viewed as the orthogonal complement of C^j.
• We use the superscript * to indicate the (conjugate) transpose of a vector or matrix.




Figure 1: Layer structure of the ReduNet: from one iteration of gradient ascent for rate reduction.

Figure 3: Heatmaps of cosine similarity between data X_shift / learned features Z_shift, the MCR^2 loss, and distances between shifted samples and subspaces. For (a), (b), (e), (f), we pick one sample from each class, augment it with all of its possible shifts, and calculate the cosine similarity between these augmented samples. For (d), (h), we first augment each sample in the dataset with all of its possible shifts, then evaluate the cosine similarity (in absolute value) between pairs across classes: for each pair, one sample is from training and one from test, belonging to different classes.

The resulting multi-channel feature, as defined in (13), gives X ∈ R^{(n·C)×m}. For the training data, we set the number of features n = 150, samples m = 400, channels C = 7, iterations/layers L = 2,000, step size η = 0.1, and precision ε = 0.1. We sample the same number of test data points by the same procedure. As shown in Figure 8, we observe that the network maps the two classes of signals to orthogonal subspaces on both the training and test datasets. To verify the invariance property of the network, we first pick 5 signal samples from each class (from the test dataset) and obtain their augmented samples by shifting. We then have m = 1,500 augmented samples in total, X_shift ∈ R^{150×1,500}, and we visualize the pairwise inner products of X_shift and of their representations Z_shift ∈ R^{(150·7)×1,500} in Figures 3a-3b. Moreover, we augment every sample (from the test dataset) with all of its possible shifted versions and calculate the cosine similarity between their representations and the representations of all training samples from the other class (Figure 3d). We find that the proposed network maps different classes of signals (including all shifted augmentations) to orthogonal subspaces while increasing the MCR^2 loss (Figure 3c).

Rotational Invariance on MNIST Digits.
We study the ReduNet on learning rotation-invariant features on the MNIST dataset (LeCun, 1998). We impose a polar grid on the image x ∈ R^{H×W}, with its geometric center being the center of the 2D polar grid. For each radius r_i, i ∈ [C], we sample Γ pixels, one for each angle γ_l = l · (2π/Γ) with l ∈ [Γ]. Given an image sample x from the dataset, we thus represent the image in a polar coordinate representation x(p) = (γ_{l,i}, r_{l,i}) ∈ R^{Γ×C}. Our goal is to learn rotation-invariant features, i.e., we expect to learn f(·, θ) such that {f(x(p) ◦ g, θ)}_{g∈G} lie in the same subspace, where g is the shift transformation in polar angle. By performing the polar coordinate transformation on images of digit '0' and digit '1' from the training dataset, we obtain the data matrix X(p) ∈ R^{(Γ·C)×m}. We use m = 2,000 training samples, set Γ = 200 and C = 5 for the polar transformation, and set the number of iterations (layers) L = 3,500, precision ε = 0.1, and step size η = 0.5. We generate 1,000 test samples by the same procedure. In Figure 10, we see that our proposed ReduNet is able to map most samples from different classes to orthogonal subspaces (w.r.t. class) on the test dataset. Meanwhile, in Figures 3e, 3f, and 3h, we observe that the learned features are invariant to shifts in polar angle (i.e., to arbitrary rotations of x).
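The polar resampling above can be sketched as follows. This is a minimal illustration assuming nearest-neighbor interpolation (the paper does not specify the interpolation scheme); the helper name `polar_representation` and the radius spacing are our own choices:

```python
import numpy as np

def polar_representation(x, Gamma=200, C=5, r_max=None):
    """Sample an H x W image on a polar grid centered at the image center:
    Gamma angles gamma_l = l * (2*pi/Gamma) and C radii r_i, using
    nearest-neighbor interpolation. A rotation of x then acts (approximately)
    as a cyclic shift along the angle axis of the output."""
    H, W = x.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    if r_max is None:
        r_max = min(H, W) / 2.0 - 1
    radii = np.linspace(r_max / C, r_max, C)
    angles = np.arange(Gamma) * (2 * np.pi / Gamma)
    rr, aa = np.meshgrid(radii, angles)   # both of shape (Gamma, C)
    rows = np.clip(np.round(cy + rr * np.sin(aa)).astype(int), 0, H - 1)
    cols = np.clip(np.round(cx + rr * np.cos(aa)).astype(int), 0, W - 1)
    return x[rows, cols]                   # shape (Gamma, C)

xp = polar_representation(np.random.rand(28, 28))
```

Flattening the (Γ, C) output column-wise for all m training images then yields the data matrix X(p) ∈ R^{(Γ·C)×m} used above.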


In addition, we denote vec(Z) = [vec(z_1), ..., vec(z_m)] ∈ R^{(C·T)×m} and circ(Z) = [circ(z_1), ..., circ(z_m)] ∈ R^{(C·T)×(T·m)}.
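For a single multi-channel 1D signal z ∈ R^{C×T}, the vec(·) and circ(·) operators can be sketched as follows (a sketch assuming circ(z) stacks the per-channel T × T circulant blocks, which matches the stated dimensions; the helper names are ours):

```python
import numpy as np

def circ1d(v):
    """T x T circulant matrix whose t-th column is v cyclically shifted by t."""
    T = len(v)
    return np.stack([np.roll(v, t) for t in range(T)], axis=1)

def vec(z):
    """Stack the C channels of z in R^{C x T} into a single vector in R^{C*T}."""
    return z.reshape(-1)

def circ(z):
    """Stack the per-channel circulant blocks vertically: shape (C*T, T)."""
    return np.vstack([circ1d(z[c]) for c in range(z.shape[0])])

z = np.random.randn(3, 8)   # C = 3 channels, T = 8 samples
V, M = vec(z), circ(z)
```

Note that the zero-shift column of circ(z) recovers vec(z), consistent with the two operators acting on the same underlying signal.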

Doubly block circulant matrix and 2D-DFT for multi-channel signals. We now consider multi-channel 2D signals z ∈ R^{C×H×W}. Let DFT(z) ∈ C^{C×H×W} be the array whose c-th slice along the first dimension is the 2D DFT of the corresponding channel z[c], that is, DFT(z)[c] = DFT(z[c]) ∈ C^{H×W}. We use DFT(z)(p, q) ∈ C^C to denote the slice of DFT(z) at frequency (p, q).
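In NumPy, the per-channel 2D DFT and the frequency slicing above amount to (a sketch; the array sizes are arbitrary examples):

```python
import numpy as np

C, H, W = 3, 8, 8
z = np.random.randn(C, H, W)

# Per-channel 2D DFT: DFT(z)[c] is the 2D DFT of channel z[c].
Z = np.fft.fft2(z, axes=(-2, -1))   # complex array of shape (C, H, W)

# Frequency slicing: DFT(z)(p, q) collects entry (p, q) from every channel.
p, q = 2, 5
v = Z[:, p, q]                       # a vector in C^C
```

Working frequency-by-frequency in this way is what makes the spectral-domain construction efficient: each (p, q) slice couples only C values instead of the full C·H·W signal.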

ADDITIONAL EXPERIMENTS ON LEARNING MIXTURES OF GAUSSIANS IN S^1 AND S^2

Figure 4: Cosine similarity (absolute value) for 2D and 3D mixtures of Gaussians. Lighter color indicates that samples are more orthogonal.

Figure 5: Learning mixtures of Gaussians in S^1 and S^2. (Top) For S^1, we set σ_1 = σ_2 = 0.1; (Bottom) for S^2, we set σ_1 = σ_2 = σ_3 = 0.1.

import numpy as np

samples, time = 200, 150  # samples per class and signal length n (illustrative values)

# Random phase offset, one per sample.
t0 = np.random.uniform(low=0, high=10 * np.pi, size=samples)
# Each row spans one period [t0, t0 + 2*pi), sampled at `time` points.
x = np.linspace(t0, t0 + 2 * np.pi, time).T
# Class 1: noisy sine waves.
noise1 = np.random.normal(0, 0.1, size=(samples, time))
X1 = np.sin(x) + noise1
# Class 2: noisy square waves.
noise2 = np.random.normal(0, 0.1, size=(samples, time))
X2 = np.sign(np.sin(x)) + noise2
data = np.vstack([X1, X2])
labels = np.hstack([np.ones(samples) * 1, np.ones(samples) * 2]).astype(np.int32)

Figure 7: Visualization of signals in 1D. Blue dots represent the sampled signal used for training with dimension n = 150. Red curves represent the underlying 1D function (noiseless). (Left) One sample from class 1; (Right) One sample from class 2.

Figure 8: Cosine similarity (absolute value) of training/test data as well as training/test representations for learning 1D functions.
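The cosine-similarity heatmaps in Figures 3, 4, 8, and 10 can be reproduced with a short helper (a sketch, assuming features are stored as columns of the matrix; the helper name is ours):

```python
import numpy as np

def abs_cosine_similarity(Z):
    """|cosine| similarity between all column pairs of a (features x samples) matrix."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)  # normalize each column
    return np.abs(Zn.T @ Zn)                            # (samples x samples) heatmap

S = abs_cosine_similarity(np.random.randn(150, 20))
```

For features of two classes lying in orthogonal subspaces, the off-diagonal cross-class block of this matrix is near zero, which is the "lighter color" pattern in the figures.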

Performance (Accuracy) on the iris and mice UCI datasets.

Pseudocode for sampling signals from 1D functions.
Input: number of samples m, number of classes k, number of features n, functions {h_1, ..., h_k}.
1: for j = 1, 2, ..., k do
2:   x = [t_0, t_0 + 2π/n, t_0 + (2π/n)·2, t_0 + (2π/n)·3, ..., t_0 + (2π/n)·(n-1)]

Comparing network performance on learning rotational-invariant representations on MNIST.

Comparing network performance on learning 2D translation-invariant representations on MNIST.

