DEEP NETWORKS FROM THE PRINCIPLE OF RATE REDUCTION
Anonymous authors
Paper under double-blind review

Abstract

This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification. We show that the basic iterative gradient ascent scheme for maximizing the rate reduction of learned features naturally leads to a deep network, one iteration per layer. The architectures, operators (linear or nonlinear), and parameters of the network are all explicitly constructed layer-by-layer in a forward propagation fashion. All components of this "white box" network have precise optimization, statistical, and geometric interpretation. Our preliminary experiments indicate that such a network can already learn a good discriminative deep representation without any back propagation training. Moreover, all linear operators of the so-derived network naturally become multi-channel convolutions when we enforce classification to be rigorously shift-invariant. The derivation also indicates that such a convolutional network is significantly more efficient to learn and construct in the spectral domain.

1. INTRODUCTION AND MOTIVATION

In recent years, various deep (convolutional) network architectures, such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), recurrent CNNs, LSTM (Hochreiter & Schmidhuber, 1997), and Capsule Networks (Hinton et al., 2011), have demonstrated very good performance on classification tasks for real-world data such as speech and images. Nevertheless, almost all such networks have been developed through years of empirical trial and error, covering both their architectures/operators and the ways they are effectively trained. Some recent practice even takes this to the extreme, searching for effective network structures and training strategies through extensive random search techniques such as Neural Architecture Search (Zoph & Le, 2017; Baker et al., 2017), AutoML (Hutter et al., 2019), and Learning to Learn (Andrychowicz et al., 2016). Despite tremendous empirical advances, there is still a lack of rigorous theoretical justification of the need for "deep" network architectures and a lack of fundamental understanding of the associated operators (e.g., multi-channel convolution and nonlinear activation) in each layer. As a result, deep networks are often designed and trained heuristically and then used as a "black box." There has been a severe lack of guiding principles at each stage: for a given task, how wide or deep should the network be? What are the roles of, and relationships among, the multiple (convolution) channels? Which parts of the network need to be learned and trained, and which can be determined in advance? How do we evaluate the optimality of the resulting network?
As a consequence, beyond empirical evaluation, it is usually impossible to offer rigorous guarantees on the behavior of a trained network, such as its invariance to transformations (Azulay & Weiss, 2018; Engstrom et al., 2017) or its tendency to overfit noisy or even arbitrary labels (Zhang et al., 2017). In this paper, we do not intend to address all of these questions; rather, we attempt to offer a plausible interpretation of deep (convolutional) neural networks by deriving a class of deep networks from first principles. We contend that all key features and structures of modern deep (convolutional) neural networks can be naturally derived from optimizing a principled objective, namely the rate reduction recently proposed by Yu et al. (2020), which seeks a compact, discriminative (invariant) representation of the data. More specifically, the basic iterative gradient ascent scheme for optimizing this objective naturally takes the form of a deep neural network, one layer per iteration. This principled approach brings a couple of nice surprises. First, the architectures, operators, and parameters of the network can be constructed explicitly layer-by-layer in a forward-propagation fashion, and all inherit precise optimization, statistical, and geometric interpretations. As a result, the so-constructed "white box" deep network already gives a good discriminative representation (and achieves good classification performance) without any back propagation training. Second, in the case of seeking a representation rigorously invariant to shift or translation, the network naturally lends itself to a multi-channel convolutional network. Moreover, the derivation indicates that such a convolutional network is computationally more efficient to learn and construct in the spectral (Fourier) domain, analogous to how neurons in the visual cortex encode and transmit information with their spiking frequencies (Eliasmith & Anderson, 2003; Belitski et al., 2008).

2. TECHNICAL APPROACH

Consider a basic classification task: given a set of m samples X ≐ [x_1, . . . , x_m] ∈ R^{n×m} and their associated memberships π(x_i) ∈ [k] in k different classes, a deep network is typically used to model a direct mapping f(x, θ) : x → y ∈ R^k from the input data x ∈ R^n to its class label, where y is typically a "one-hot" vector encoding the membership information π(x): the j-th entry of y is 1 iff π(x) = j. The parameters θ of the network are typically learned to minimize a certain prediction loss, say the cross-entropy loss, via gradient-descent type back propagation. Although this popular approach provides a direct and effective way to train a network that predicts the class information, the representation so learned is implicit and lacks a clear interpretation.
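As a quick illustration of the labeling convention above, the following NumPy snippet (a toy example of our own; the variable names are not from the paper) builds the one-hot label vectors y from the class memberships π(x_i):

```python
import numpy as np

# Hypothetical toy data: m = 5 samples in k = 3 classes.
# pi[i] plays the role of the membership pi(x_i) in [k].
k = 3
pi = np.array([0, 2, 1, 0, 2])

# One-hot labels: column i of Y has a 1 in row pi[i], 0 elsewhere,
# so Y[j, i] = 1 iff sample i belongs to class j.
Y = np.zeros((k, len(pi)))
Y[pi, np.arange(len(pi))] = 1.0
```

Each column of Y is then the target y that a conventionally trained network f(x, θ) would fit, say under the cross-entropy loss.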

2.1. PRINCIPLE OF RATE REDUCTION AND GROUP INVARIANCE

The Principle of Maximal Coding Rate Reduction. To help better understand features learned in a deep network, the recent work of Yu et al. (2020) has argued that the goal of (deep) learning is to learn a compact, discriminative, and diverse feature representation 1 z = f(x) ∈ R^n of the data x before any subsequent tasks such as classification: x --f(x)--> z --h(z)--> y. To be more precise, instead of directly fitting the class label y, a principled objective is to learn a feature map f(x) : x → z which transforms the data x onto a set of maximally discriminative low-dimensional linear subspaces {S_j}_{j=1}^k ⊂ R^n, one subspace S_j per class j ∈ [k]. Let Z ≐ [z_1, . . . , z_m] = [f(x_1), . . . , f(x_m)] be the features of the given samples X. Without loss of generality, we may assume all features z_i are normalized to unit norm: z_i ∈ S^{n-1}. For convenience, let Π_j ∈ R^{m×m} be a diagonal matrix whose diagonal entries encode the membership of the samples/features in the j-th class: Π_j(i, i) = 1 iff π(x_i) = π(z_i) = j, and 0 otherwise. Then, based on principles from lossy data compression (Ma et al., 2007), Yu et al. (2020) suggested that the optimal representation Z ⊂ S^{n-1} should maximize the following coding rate reduction objective, known as the MCR^2 principle:

Rate Reduction:  ∆R(Z) ≐ (1/2) log det(I + α Z Z*) − Σ_{j=1}^{k} (γ_j/2) log det(I + α_j Z Π_j Z*),

where the first term is the coding rate R(Z), the second is the class-wise rate R_c(Z, Π), and α = n/(m ε^2), α_j = n/(tr(Π_j) ε^2), γ_j = tr(Π_j)/m for j = 1, . . . , k. Given a prescribed quantization error ε, the first term R of ∆R(Z) measures the total coding length of all the features Z, and the second term R_c is the sum of the coding lengths of the features in each of the k classes. In Yu et al. (2020), the authors have shown that the optimal representation Z maximizing the above objective indeed has desirable properties. Nevertheless, they adopted a conventional deep network (e.g., a ResNet) as a black box to model and parameterize the feature mapping z = f(x, θ).
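The objective ∆R(Z) above can be evaluated directly from its definition. The following NumPy sketch (our own illustration; the function name and the value of ε are our choices) computes R, R_c, and their difference for unit-norm features Z and diagonal membership matrices Π_j:

```python
import numpy as np

def rate_reduction(Z, Pis, eps=0.1):
    """Delta R(Z) for unit-norm features Z (n x m) and a list Pis of
    diagonal m x m membership matrices, one per class."""
    n, m = Z.shape
    alpha = n / (m * eps**2)
    # R(Z): total coding rate of all features together.
    R = 0.5 * np.linalg.slogdet(np.eye(n) + alpha * Z @ Z.T)[1]
    # R_c(Z, Pi): class-wise coding rates, weighted by class proportion.
    Rc = 0.0
    for Pi_j in Pis:
        tr_j = np.trace(Pi_j)
        alpha_j = n / (tr_j * eps**2)
        gamma_j = tr_j / m
        Rc += 0.5 * gamma_j * np.linalg.slogdet(
            np.eye(n) + alpha_j * Z @ Pi_j @ Z.T)[1]
    return R - Rc

# Two classes on orthogonal coordinate subspaces: Delta R is strictly
# positive, as MCR^2 encourages discriminative representations.
Z = np.eye(4)                       # 4 unit-norm features in R^4
Pis = [np.diag([1., 1., 0., 0.]), np.diag([0., 0., 1., 1.])]
assert rate_reduction(Z, Pis) > 0
# Degenerate single-class case: R coincides with R_c, so Delta R = 0.
assert abs(rate_reduction(Z, [np.eye(4)])) < 1e-9
```

Note that the single-class check follows directly from the definitions: with one class, tr(Π_1) = m gives α_1 = α and γ_1 = 1, so the two terms cancel.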
It has been empirically shown that, with such a choice, one can effectively optimize the MCR^2 objective and obtain discriminative and diverse representations for classifying real image datasets. However, several problems remain unanswered. Although the resulting feature representation is more interpretable, the network itself still is not. It is not clear why any chosen network should be able to optimize the desired MCR^2 objective: would there be any potential limitations? The good empirical results (say, with a ResNet) do not necessarily justify the particular choice of network architectures and operators: why is a layered model necessary; how wide and deep is adequate; and is there any rigorous justification for the convolutions and nonlinear operators used? In Section 2.2, we show that using gradient ascent to maximize the rate reduction ∆R(Z) naturally leads to a "white box" deep network that represents such a mapping. All linear/nonlinear operators and parameters of the network are explicitly constructed in a purely forward-propagation fashion.

Group Invariant Rate Reduction. So far, we have considered the data and features as vectors. In many applications, such as serial data or imagery data, the semantic meaning (labels) of the data and their features is invariant to certain transformations g ∈ G, for some group G (Cohen & Welling, 2016). For example, the meaning of an audio signal is invariant to shifts in time, and the identity of an object in an image is invariant to translation in the image plane. Hence, we prefer that the feature mapping f(x, θ) be rigorously invariant to such transformations:
Group Invariance:  f(g ∘ x, θ) ∼ f(x, θ), ∀ g ∈ G,

where "∼" indicates that the two features belong to the same equivalence class, i.e., lie in the same subspace.
1. To simplify the presentation, we assume for now that the feature z and the data x have the same dimension n. In general their dimensions can be different, as we will soon see, say in the case where z consists of multiple channels extracted from x.
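As a concrete (and much simplified) illustration of how one gradient-ascent iteration on ∆R(Z) yields one "layer," the following NumPy sketch applies the two operators obtained by differentiating R and R_c; the function name, step size η, and the final normalization back to the sphere are our own choices, not the paper's full construction:

```python
import numpy as np

def ascent_step(Z, Pis, eps=0.1, eta=0.5):
    """One gradient-ascent iteration on Delta R(Z): in the derivation
    sketched in the text, each such iteration becomes one layer."""
    n, m = Z.shape
    alpha = n / (m * eps**2)
    # E Z is the gradient of R(Z) = (1/2) log det(I + alpha Z Z*).
    E = alpha * np.linalg.inv(np.eye(n) + alpha * Z @ Z.T)
    grad = E @ Z
    for Pi_j in Pis:
        tr_j = np.trace(Pi_j)
        alpha_j = n / (tr_j * eps**2)
        gamma_j = tr_j / m
        # gamma_j C_j Z Pi_j is the gradient of the j-th term of R_c.
        C_j = alpha_j * np.linalg.inv(np.eye(n) + alpha_j * Z @ Pi_j @ Z.T)
        grad -= gamma_j * C_j @ Z @ Pi_j
    Z_new = Z + eta * grad
    return Z_new / np.linalg.norm(Z_new, axis=0)  # project back to S^{n-1}

# One "layer" applied to random unit-norm features in two classes.
rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 6))
Z /= np.linalg.norm(Z, axis=0)
Pis = [np.diag([1.] * 3 + [0.] * 3), np.diag([0.] * 3 + [1.] * 3)]
Z1 = ascent_step(Z, Pis)
```

The expansion operator E and the compression operators C_j depend only on the current features, which is why the network's parameters can be constructed layer-by-layer in a forward fashion, without back propagation.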


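Finally, the claim that rigorous shift invariance turns the linear operators into (circular) convolutions that are cheapest to handle in the spectral domain rests on a standard linear-algebra fact: a linear operator that commutes with circular shifts is a circulant matrix, and circulant matrices are diagonalized by the discrete Fourier transform. A small self-contained NumPy check of this fact (illustrative only; not the paper's multi-channel construction):

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
a = rng.standard_normal(n)   # first column of the circulant operator
x = rng.standard_normal(n)

# Circulant matrix C with first column a: C[i, j] = a[(i - j) mod n].
C = np.array([[a[(i - j) % n] for j in range(n)] for i in range(n)])

# Applying C is circular convolution with a, i.e., a pointwise
# product in the Fourier (spectral) domain.
y_dense = C @ x
y_fft = np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(x)))
assert np.allclose(y_dense, y_fft)

# C commutes with circular shifts, the equivariance underlying
# shift-invariant classification.
assert np.allclose(C @ np.roll(x, 3), np.roll(C @ x, 3))
```

The FFT route costs O(n log n) per application instead of O(n^2) for the dense multiply, which is the sense in which the derived convolutional network is more efficient to learn and construct in the spectral domain.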