RETHINKING COMPRESSED CONVOLUTIONAL NEURAL NETWORKS FROM A STATISTICAL PERSPECTIVE

Abstract

Many designs have recently been proposed to improve the model efficiency of convolutional neural networks (CNNs) at a fixed resource budget, while there is a lack of theoretical analysis to justify them. This paper first formulates CNNs with high-order inputs into statistical models, which have a special "Tucker-like" formulation. This makes it possible to conduct the sample complexity analysis of CNNs, as well as of compressed CNNs obtained via tensor decomposition. Tucker and CP decompositions are commonly adopted to compress CNNs in the literature. The low-rank assumption is usually imposed on the output channels, which, according to our study, may not be beneficial: a computationally more efficient model with similar accuracy can be obtained without it. Our finding is further supported by ablation studies on the CIFAR-10, SVHN and UCF101 datasets.

1. INTRODUCTION

The introduction of AlexNet (Krizhevsky et al., 2012) spurred a line of research in 2D CNNs, which progressively achieved high levels of accuracy in the domain of image recognition (Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017). The current state-of-the-art CNNs leave little room for significant improvement in accuracy on still images, and attention has hence been diverted towards two directions. The first is to deploy deep CNNs on mobile devices by removing redundancy from the over-parametrized network; representative models include MobileNetV1 & V2 (Howard et al., 2017; Sandler et al., 2018). The second direction is to utilize CNNs to learn from higher-order inputs, for instance, video clips (Tran et al., 2018; Hara et al., 2017) or electronic health records (Cheng et al., 2016; Suo et al., 2017). This area has not yet seen a widely-accepted state-of-the-art network. High-order kernel tensors are usually required to account for the multiway dependence of the input, which notoriously leads to a heavy computational burden, as the number of parameters to be trained grows exponentially with the dimension of the inputs. Consequently, model compression becomes the critical juncture to guarantee the successful training and deployment of tensor CNNs.

Tensor methods for compressing CNNs. Denil et al. (2013) showed that there is huge redundancy in network weights, such that the entire network can be approximately recovered from a small fraction of its parameters. Tensor decomposition has recently been widely used to compress the weights in a CNN (Lebedev et al., 2015; Kim et al., 2016; Kossaifi et al., 2020b; Hayashi et al., 2019). Specifically, the weights at each layer are first summarized into a tensor, and then a tensor decomposition, CP or Tucker, is applied to reduce the number of parameters.
Applying different tensor decompositions to convolution layers leads to a variety of compressed CNN block designs. For instance, the bottleneck block in ResNet (He et al., 2016) corresponds to a convolution kernel with a special Tucker low-rank structure, while the depthwise separable block in MobileNetV1 (Howard et al., 2017) and the inverted residual block in MobileNetV2 (Sandler et al., 2018) correspond to convolution kernels with special CP forms. All of the above are for 2D CNNs; Kossaifi et al. (2020b) and Su et al. (2018) considered tensor decomposition to factorize convolution kernels for higher-order tensor inputs. Tensor decomposition can also be applied to fully-connected layers, since they may introduce a large number of parameters (Kossaifi et al., 2017; 2020a); see also the discussions in Section 5. Moreover, Kossaifi et al. (2019) summarized all weights of a network into one single high-order tensor, and then directly imposed a low-rank structure to achieve full network compression. While the idea is highly motivating, the proposed structure of the high-order tensor is heuristic and can be further improved; see the discussions in Section 2.4. Parameter efficiency of the above architectures was justified heuristically, by methods such as FLOPs counting, naive parameter counting and/or empirical running time. However, there is still a lack of theoretical study to understand the mechanism by which tensor decomposition compresses CNNs. This paper attempts to fill this gap from a statistical perspective.

Sample complexity analysis. Du et al. (2018a) first characterized the statistical sample complexity of a CNN; see also Wang et al. (2019) for compact autoregressive nets. Specifically, consider a CNN model, $y = F_{\mathrm{CNN}}(x, \mathcal{W}) + \xi$, where $y$ and $x$ are the output and input, respectively, $\mathcal{W}$ contains all weights, and $\xi$ is an additive error.
Given the trained and true underlying networks $F_{\mathrm{CNN}}(x, \widehat{\mathcal{W}})$ and $F_{\mathrm{CNN}}(x, \mathcal{W}^*)$, the root-mean-square prediction error is defined as

$\mathcal{E}(\widehat{\mathcal{W}}) = \sqrt{E_x |F_{\mathrm{CNN}}(x, \widehat{\mathcal{W}}) - F_{\mathrm{CNN}}(x, \mathcal{W}^*)|^2}$,   (1)

where $\widehat{\mathcal{W}}$ and $\mathcal{W}^*$ are the trained and true underlying weights, respectively, and $E_x$ is the expectation over $x$. The sample complexity analysis investigates how many samples are needed to guarantee a given tolerance on the prediction error. It can also be used to detect model redundancy. Consider two nested CNNs, where $F_1$ is more compressed than $F_2$. Given the same true underlying network, if the prediction errors from the trained $F_1$ and $F_2$ are comparable, we can argue that $F_2$ has redundant weights compared with $F_1$. As a result, conducting sample complexity analysis for CNNs with higher-order inputs sheds light on the compressing mechanism of popular compressed CNNs via tensor decomposition. The study in Du et al. (2018a) is limited to 1-dimensional convolution with a single kernel, followed by a weighted summation, and its theoretical analysis cannot be generalized to CNNs with compressed layers. In comparison, our paper presents a more realistic model of a CNN by introducing a general $N$-dimensional convolution with multiple kernels, followed by an average pooling layer and a fully-connected layer. The convolution kernels and fully-connected weights are in tensor form, which allows us to explicitly model compressed CNNs by imposing low-rank assumptions on the weight tensors. Moreover, we use an alternative technical tool, and a sharper upper bound on the sample complexity is obtained. Our paper makes three main contributions:
1. We formulate CNNs with high-order inputs into statistical models, and show that they have an explicit "Tucker-like" form.
2. The sample complexity analysis can then be conducted for CNNs as well as compressed CNNs via tensor decomposition, under weak conditions allowing for time-dependent inputs such as video data.
3. From the theoretical analysis, we draw an interesting finding: forcing low dimensionality on the output channels may introduce unnecessary parameter redundancy into a compressed network.

1.1. COMPARISON WITH OTHER EXISTING WORKS

Deep neural networks are usually over-parametrized, yet empirically they generalize well. Theoretically studying the generalization ability of deep neural networks, including deep CNNs, is an important topic in the literature (Li et al., 2020; Arora et al., 2018). The generalization error, defined as the difference between test and training errors, is commonly used to evaluate this ability, and many techniques have been developed to control its bound; see, for example, the VC dimension (Vapnik, 2013), the Rademacher complexity and covering number (Bartlett & Mendelson, 2002), norm-based capacity control (Neyshabur et al., 2017; Golowich et al., 2018; Bartlett et al., 2017; Neyshabur et al., 2015) and low-rank compression based methods (Li et al., 2020; Zhou & Feng, 2018; Arora et al., 2018). These works use a model-agnostic framework, and hence rely heavily on explicit regularization, such as weight decay, dropout or data augmentation, as well as on algorithm-based implicit regularization, to remove the redundancy in the network. We, in contrast, attempt to theoretically explain how much compressibility is achieved by a compressed network architecture. Specifically, we compare a CNN with its compressed version, and make theoretically-supported modifications to the latter to further increase efficiency. Our analysis therefore requires an explicit formulation of the network architecture, which is provided in Section 2, and the prediction error at (1) is adopted as our evaluation criterion. We notice that Li et al. (2020) also propose to use CP layers to compress the weights in each convolution layer. But their study is still model-agnostic, since the ranks of the underlying CP layers depend on the trained weights. In detail, their approach imposes regularization assumptions on the weights, and hence their derived theoretical bound is influenced by training and is not suitable for analyzing the network design exclusively.
Other existing works that aim to provide theoretical understanding of neural networks include the study of parameter recovery with gradient-based algorithms for deep neural networks (Zhong et al., 2017b; Fu et al., 2020; Goel et al., 2018; Zhong et al., 2017a); the development of other provably efficient algorithms (Cao & Gu, 2019; Du & Goel, 2018); and the investigation of convergence in the over-parameterized regime (Allen-Zhu et al., 2019; Li & Liang, 2018; Du et al., 2018b). Our work differs greatly from these in both target and methodology. We do not consider computational complexity or algorithm convergence. Instead, we focus on the statistical sample complexity to depict the mechanism of compressed block designs for CNNs.

2.1. NOTATION

Tensor notations. We follow the notations in Kolda & Bader (2009): vectors are denoted by lowercase boldface letters, e.g. $a$; matrices by capital boldface letters, e.g. $A$; and tensors of order 3 or higher by Euler script letters, e.g. $\mathcal{A}$. For $N$th-order tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$, their inner product is defined as $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1=1}^{l_1} \cdots \sum_{i_N=1}^{l_N} \mathcal{A}(i_1, \dots, i_N)\mathcal{B}(i_1, \dots, i_N)$, and the Frobenius norm is $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. The mode-$n$ multiplication $\times_n$ of a tensor $\mathcal{A} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$ and a matrix $B \in \mathbb{R}^{p_n \times l_n}$ is defined as $(\mathcal{A} \times_n B)(i_1, \dots, j_n, \dots, i_N) = \sum_{i_n=1}^{l_n} \mathcal{A}(i_1, \dots, i_n, \dots, i_N) B(j_n, i_n)$, for $1 \le n \le N$. The mode-$n$ multiplication $\bar{\times}_n$ of a tensor $\mathcal{A} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$ and a vector $b \in \mathbb{R}^{l_n}$ is defined as $(\mathcal{A} \,\bar{\times}_n\, b)(i_1, \dots, i_{n-1}, i_{n+1}, \dots, i_N) = \sum_{i_n=1}^{l_n} \mathcal{A}(i_1, \dots, i_n, \dots, i_N) b(i_n)$, for $1 \le n \le N$. The symbol "$\otimes$" denotes the Kronecker product and "$\circ$" the outer product. We extend the definition of the Khatri-Rao product to tensors: given tensors $\mathcal{A} \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_N \times K}$ and $\mathcal{B} \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N \times K}$, their Khatri-Rao product is a tensor of size $l_1 p_1 \times l_2 p_2 \times \cdots \times l_N p_N \times K$, denoted by $\mathcal{C} = \mathcal{A} \odot \mathcal{B}$, where $\mathcal{C}(:, \cdots, k) = \mathcal{A}(:, \cdots, k) \otimes \mathcal{B}(:, \cdots, k)$, $1 \le k \le K$.
CP decomposition. The Canonical Polyadic (CP) decomposition (Kolda & Bader, 2009) factorizes a tensor $\mathcal{A} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$ into a sum of rank-1 tensors, i.e. $\mathcal{A} = \sum_{r=1}^{R} \alpha_r\, h_r^{(1)} \circ h_r^{(2)} \circ \cdots \circ h_r^{(N)}$, where $h_r^{(j)} \in \mathbb{R}^{l_j}$ is a unit-norm vector for all $1 \le j \le N$. The CP rank is the number of rank-1 tensors, $R$.
Tucker decomposition. The Tucker ranks of an $N$th-order tensor $\mathcal{A} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$ are defined as the matrix ranks of the unfoldings of $\mathcal{A}$ along all modes. If the Tucker ranks of $\mathcal{A}$ are $(R_1, \dots, R_N)$, then there exist a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$ and factor matrices $H^{(i)} \in \mathbb{R}^{l_i \times R_i}$ such that $\mathcal{A} = \mathcal{G} \times_1 H^{(1)} \times_2 H^{(2)} \cdots \times_N H^{(N)}$, known as the Tucker decomposition (Tucker, 1966).
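Both decompositions can be sketched in a few lines of numpy. The helper functions below are an illustrative, plain-numpy rendering (a dedicated tensor library would normally be used); they reconstruct a tensor from CP factors and from a Tucker core via mode-$n$ products.

```python
import numpy as np

def cp_tensor(weights, factors):
    """Sum of R rank-1 terms: sum_r weights[r] * h_r^(1) o h_r^(2) o ... o h_r^(N)."""
    shape = tuple(f.shape[0] for f in factors)
    out = np.zeros(shape)
    for r in range(len(weights)):
        term = factors[0][:, r]
        for f in factors[1:]:
            term = np.multiply.outer(term, f[:, r])  # outer product, one mode at a time
        out += weights[r] * term
    return out

def mode_n_product(tensor, matrix, mode):
    """A x_n B: contract mode `mode` of the tensor with the columns of `matrix`."""
    t = np.moveaxis(tensor, mode, 0)
    t = np.tensordot(matrix, t, axes=([1], [0]))  # (p_n, l_n) x (l_n, ...) -> (p_n, ...)
    return np.moveaxis(t, 0, mode)

def tucker_tensor(core, factors):
    """G x_1 H^(1) x_2 H^(2) ... x_N H^(N)."""
    out = core
    for n, H in enumerate(factors):
        out = mode_n_product(out, H, n)
    return out
```

With a superdiagonal core, `tucker_tensor` reduces to `cp_tensor`, which is the usual way the two formats are related.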

2.2. BASIC THREE-LAYER CNNS

Consider a three-layer CNN with one convolution layer, one average pooling layer and one fully-connected layer, and assume linear activations for simplicity. Specifically, for a general tensor-structured input $\mathcal{X} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N}$, we first perform its convolution with an $N$th-order kernel tensor $\mathcal{A} \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_N}$ to get an intermediate output $\mathcal{X}_c \in \mathbb{R}^{m_1 \times m_2 \times \cdots \times m_N}$, and then apply average pooling with pooling sizes $(q_1, \dots, q_N)$ to derive another intermediate output $\mathcal{X}_{cp} \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N}$. Finally, $\mathcal{X}_{cp}$ goes through a fully-connected layer, with weight tensor $\mathcal{B} \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N}$, to produce a scalar output; see Figure 1(1). We consider a convolution layer with stride size $s_c$ along each dimension. Assume that $m_j = (d_j - l_j)/s_c + 1$ are integers for $1 \le j \le N$; otherwise zero-padding is needed. Let
$U^{(j)}_{i_j} = (0_{l_j \times (i_j-1)s_c},\ I_{l_j},\ 0_{l_j \times (d_j-(i_j-1)s_c-l_j)})^\top \in \mathbb{R}^{d_j \times l_j}$, for $1 \le i_j \le m_j$, $1 \le j \le N$,
which act as positioning factors that stretch the kernel tensor $\mathcal{A}$ into a tensor of the same size as $\mathcal{X}$, with the rest of the entries filled with zeros; see Figure 1(2). As a result, $\mathcal{X}_c$ has entries $\mathcal{X}_c(i_1, i_2, \dots, i_N) = \langle \mathcal{X}, \mathcal{A} \times_1 U^{(1)}_{i_1} \times_2 U^{(2)}_{i_2} \times_3 \cdots \times_N U^{(N)}_{i_N} \rangle$. For the pooling layer with stride size $s_p$ along each dimension, we assume the pooling sizes $\{q_j\}_{j=1}^N$ satisfy $m_j = q_j + (p_j - 1)s_p$, so the sliding windows may overlap. For ease of notation, however, we simply take $q_j = m_j / p_j$. The average pooling operation is equivalent to forming $p_1 p_2 \cdots p_N$ consecutive blocks within $\mathcal{X}_c$, each of size $q_1 \times q_2 \times \cdots \times q_N$, and taking the average per block. The resulting tensor $\mathcal{X}_{cp}$ has entries $\mathcal{X}_{cp}(i_1, i_2, \dots, i_N) = (\prod_j q_j)^{-1} \sum_{k_1=(i_1-1)q_1+1}^{i_1 q_1} \cdots \sum_{k_N=(i_N-1)q_N+1}^{i_N q_N} \mathcal{X}_c(k_1, k_2, \dots, k_N)$, for $1 \le i_j \le p_j$ and $1 \le j \le N$. If we denote $U^{(j)}_{F,i_j} = q_j^{-1} \sum_{k=(i_j-1)q_j+1}^{i_j q_j} U^{(j)}_k$, then equivalently, $\mathcal{X}_{cp}(i_1, i_2, \dots, i_N) = \langle \mathcal{X}, \mathcal{A} \times_1 U^{(1)}_{F,i_1} \times_2 U^{(2)}_{F,i_2} \times_3 \cdots \times_N U^{(N)}_{F,i_N} \rangle$.
The fully-connected layer performs a weighted summation over $\mathcal{X}_{cp}$, with weights given by the entries of the tensor $\mathcal{B} \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N}$. Denote $U^{(j)}_F = (U^{(j)}_{F,1}, \cdots, U^{(j)}_{F,p_j})$ for $1 \le j \le N$; the predicted output then has the form $\hat{y} = \langle \mathcal{X}_{cp}, \mathcal{B} \rangle = \langle \mathcal{X}, (\mathcal{B} \otimes \mathcal{A}) \times_1 U^{(1)}_F \times_2 U^{(2)}_F \times_3 \cdots \times_N U^{(N)}_F \rangle$. Similarly, for a CNN with $K$ kernels, denote by $\{\mathcal{A}_k, \mathcal{B}_k\}_{k=1}^K$ the set of kernels and fully-connected weights, where $\mathcal{B}_k \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N}$ and $\mathcal{A}_k \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_N}$. The model can then be explicitly represented as $y_i = \hat{y}_i + \xi_i = \langle \mathcal{X}_i, \mathcal{W}_X \rangle + \xi_i$, $1 \le i \le n$, where $\xi_i$ is the additive error, $\mathcal{X}_i \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N}$, and the composite weight tensor is $\mathcal{W}_X = (\sum_{k=1}^K \mathcal{B}_k \otimes \mathcal{A}_k) \times_1 U^{(1)}_F \times_2 U^{(2)}_F \times_3 \cdots \times_N U^{(N)}_F$. The innate weight-sharing compactness of a CNN can thus be equivalently represented as the "Tucker-like" form at (4). The factor matrices $\{U^{(j)}_F \in \mathbb{R}^{d_j \times l_j p_j}\}_{j=1}^N$ are fixed and solely determined by the CNN operations on the inputs. They have full column ranks, given that $p_j l_j \le d_j$ always holds. The core tensor is a special Kronecker product that depicts the layer-wise interaction between weights.
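This composite-weight identity can be checked numerically in the one-dimensional case (the general derivation is in Appendix A.1): a strided convolution, average pooling and fully-connected summation give the same output as the single inner product $\langle x, U_F^{(1)}(b \otimes a) \rangle$. A minimal numpy sketch, with all sizes chosen purely for illustration:

```python
import numpy as np

d, l, s_c = 12, 3, 1          # input length, kernel length, conv stride (illustrative)
m = (d - l) // s_c + 1        # convolution output length
q = 2                         # pooling size, so there are p = m // q pooled entries
p = m // q

rng = np.random.default_rng(1)
x, a, b = rng.standard_normal(d), rng.standard_normal(l), rng.standard_normal(p)

# --- layer-by-layer forward pass (linear activations) ---
x_c = np.array([x[i * s_c : i * s_c + l] @ a for i in range(m)])  # convolution
x_cp = x_c.reshape(p, q).mean(axis=1)                             # average pooling
y_layers = b @ x_cp                                               # fully-connected

# --- composite-weight form: y = <x, U_F (b kron a)> ---
def U(i):
    """Positioning matrix U_i in R^{d x l}: stretches kernel a to input length."""
    M = np.zeros((d, l))
    M[i * s_c : i * s_c + l, :] = np.eye(l)
    return M

U_F = np.hstack([sum(U(k) for k in range(i * q, (i + 1) * q)) / q for i in range(p)])
y_weight = x @ (U_F @ np.kron(b, a))

assert np.allclose(y_layers, y_weight)
```

The block of $U_F$ belonging to the $i$th pooled entry is the average of the $q$ positioning matrices in that pooling window, exactly as in the definition of $U^{(j)}_{F,i_j}$ above.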

2.3. SAMPLE COMPLEXITY ANALYSIS

We can now derive the non-asymptotic error bound for the CNN model. Let $\mathcal{Z}_i = \mathcal{X}_i \times_1 U^{(1)}_F \times_2 U^{(2)}_F \times_3 \cdots \times_N U^{(N)}_F \in \mathbb{R}^{l_1 p_1 \times l_2 p_2 \times \cdots \times l_N p_N}$. Model (3) is equivalent to $y_i = \langle \mathcal{Z}_i, \mathcal{W} \rangle + \xi_i = \sum_{k=1}^K \langle \mathcal{Z}_i, \mathcal{B}_k \otimes \mathcal{A}_k \rangle + \xi_i$, where $\mathcal{W} = \sum_{k=1}^K \mathcal{B}_k \otimes \mathcal{A}_k$. The trained weights have the form $\widehat{\mathcal{W}} = \sum_{k=1}^K \widehat{\mathcal{B}}_k \otimes \widehat{\mathcal{A}}_k$, where
$\{\widehat{\mathcal{B}}_k, \widehat{\mathcal{A}}_k\}_{1 \le k \le K} = \arg\min_{\mathcal{B}_k, \mathcal{A}_k, 1 \le k \le K} \frac{1}{n} \sum_{i=1}^n \Big( y_i - \sum_{k=1}^K \langle \mathcal{Z}_i, \mathcal{B}_k \otimes \mathcal{A}_k \rangle \Big)^2$.   (6)
Denote $x_i = \mathrm{vec}(\mathcal{X}_i)$ and $z_i = \mathrm{vec}(\mathcal{Z}_i)$; it can be verified that $z_i = U_G x_i$, where $U_G = U^{(1)}_F \otimes \{U^{(N)}_F \otimes [U^{(N-1)}_F \otimes \cdots \otimes (U^{(3)}_F \otimes U^{(2)}_F)]\}$ represents the CNN operations on the inputs. Let $\underline{x} = (x_1^\top, x_2^\top, \dots, x_n^\top)^\top$, and we make the following technical assumptions.
Assumption 1 (Time-dependent inputs). $\underline{x}$ is normally distributed with mean zero and covariance matrix $\Sigma = E(\underline{x}\,\underline{x}^\top)$, where $c_x I \le \Sigma \le C_x I$ for some $0 < c_x < C_x$. Denote by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ the maximum and minimum eigenvalues of a symmetric matrix $A$, respectively. When $\{x_i\}$ is a stationary time series with spectral density function $f_X(\theta)$, we can take $c_x = \inf_{-\pi \le \theta \le \pi} \lambda_{\min}(f_X(\theta))$ and $C_x = \sup_{-\pi \le \theta \le \pi} \lambda_{\max}(f_X(\theta))$; see Basu & Michailidis (2015). For independent inputs, $\Sigma = \mathrm{diag}\{\Sigma_{11}, \dots, \Sigma_{nn}\}$ is block diagonal, and it holds that $c_x = \min_{1 \le j \le n} \lambda_{\min}(\Sigma_{jj})$ and $C_x = \max_{1 \le j \le n} \lambda_{\max}(\Sigma_{jj})$, where $\Sigma_{jj} = E(x_j x_j^\top)$.
Assumption 2 (Sub-Gaussian errors). $\{\xi_i\}$ are independent $\sigma^2$-sub-Gaussian random variables with mean zero, and $\xi_i$ is independent of $\{\mathcal{X}_j, 1 \le j \le i\}$ for all $1 \le i \le n$.
Assumption 3 (Restricted isometry property). $c_u I \le U_G^\top U_G \le C_u I$ for some $0 < c_u < C_u$.
Denote $\kappa_U = C_x C_u$ and $\kappa_L = c_x c_u$. Let $\mathcal{W}^* = \sum_{k=1}^K \mathcal{B}^*_k \otimes \mathcal{A}^*_k$ be the true weight. The model complexity of the CNN at (5) is $d_M = K(P + L + 1)$, where $P = p_1 p_2 \cdots p_N$ and $L = l_1 l_2 \cdots l_N$.
Theorem 1 (CNN). Suppose that Assumptions 1-3 hold and $n \gtrsim (\kappa_U/\kappa_L)^2 d_M$. Then,
$\|\widehat{\mathcal{W}} - \mathcal{W}^*\|_F \le \dfrac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}} \sqrt{\dfrac{d_M}{n} + \dfrac{\delta}{n}}$ and $\mathcal{E}(\widehat{\mathcal{W}}) \le \sqrt{\kappa_U}\,\|\widehat{\mathcal{W}} - \mathcal{W}^*\|_F$,
with probability at least $1 - 4\exp\{-[c_H(0.25\kappa_L/\kappa_U)^2 n - 9d_M]\} - \exp\{-d_M - 8\delta\}$, where $\delta = O_p(1)$, $c_H$ is a positive constant, $\alpha_{\mathrm{RSC}} = \kappa_L/2$, and $\alpha_{\mathrm{RSM}} = 3\kappa_U/2$.
It can be seen that the prediction error $\mathcal{E}(\widehat{\mathcal{W}})$ is $O_p(\sqrt{d_M/n})$, and hence the sample complexity is of order $O(d_M/\varepsilon^2)$ to achieve a prediction error $\varepsilon$. Technical proofs of all theorems and the corollary in this paper are deferred to Appendix A.3. Note that, for simplicity, we assume a simple regression model at (5); the theoretical analysis can also be established for classification problems. In detail, we provide the corresponding theorem and corollaries for both binary and multiclass classification problems in Appendix A.4.
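The reduction from tensor inputs to the vectorized form $z_i = U_G x_i$ rests on the standard Kronecker-vectorization identity; in the matrix case, $\mathrm{vec}(U_1 X U_2^\top) = (U_2 \otimes U_1)\,\mathrm{vec}(X)$ under column-major vectorization (the exact Kronecker ordering in $U_G$ depends on the vectorization convention used). A quick numpy check of the matrix-case identity:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5))
U1 = rng.standard_normal((3, 4))   # acts on mode 1 (rows)
U2 = rng.standard_normal((2, 5))   # acts on mode 2 (columns)

vec = lambda M: M.reshape(-1, order="F")   # column-major vectorization

lhs = vec(U1 @ X @ U2.T)
rhs = np.kron(U2, U1) @ vec(X)
assert np.allclose(lhs, rhs)
```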

2.4. CNNS WITH MORE LAYERS

Consider a 5-layer CNN with "convolution → pooling → convolution → pooling → fully-connected" layers, where the settings of the first "convolution → pooling" pair are the same as in Section 2.2. Denote the kernel of the second convolution by $\tilde{\mathcal{A}} \in \mathbb{R}^{\tilde{l}_1 \times \cdots \times \tilde{l}_N}$, and define the positioning matrices $\tilde{U}^{(j)}_{i_j}$ and their pooled averages $\tilde{U}^{(j)}_{F,i_j}$, both of size $p_j \times \tilde{l}_j$ with $1 \le i_j \le \tilde{p}_j$ and $1 \le j \le N$, which represent the second convolution and pooling operations, respectively. Note that the output of the first pooling layer, $\mathcal{X}_{cp}$, is the input to the second convolution layer, and hence the output of the second pooling layer has entries $\tilde{\mathcal{X}}_{cp}(i_1, i_2, \dots, i_N) = \langle \mathcal{X}_{cp}, \tilde{\mathcal{A}} \times_1 \tilde{U}^{(1)}_{F,i_1} \times_2 \tilde{U}^{(2)}_{F,i_2} \times_3 \cdots \times_N \tilde{U}^{(N)}_{F,i_N} \rangle$. Stack the matrices $\{U^{(j)}_{F,i_j} \in \mathbb{R}^{d_j \times l_j}\}_{1 \le i_j \le p_j}$ and $\{\tilde{U}^{(j)}_{F,i_j} \in \mathbb{R}^{p_j \times \tilde{l}_j}\}_{1 \le i_j \le \tilde{p}_j}$ into 3D tensors $\mathcal{U}^{(j)} \in \mathbb{R}^{d_j \times l_j \times p_j}$ and $\tilde{\mathcal{U}}^{(j)} \in \mathbb{R}^{p_j \times \tilde{l}_j \times \tilde{p}_j}$, respectively. Let $U^{(j)}_{DF} = (\mathcal{U}^{(j)} \times_3 \tilde{\mathcal{U}}^{(j)})_{(1)}$ for $1 \le j \le N$; by some algebra in Appendix A.2, we can show that the predicted output has the form
$\hat{y} = \langle \tilde{\mathcal{X}}_{cp}, \mathcal{B} \rangle = \langle \mathcal{X}, (\mathcal{B} \otimes \tilde{\mathcal{A}} \otimes \mathcal{A}) \times_1 U^{(1)}_{DF} \times_2 U^{(2)}_{DF} \times_3 \cdots \times_N U^{(N)}_{DF} \rangle$,   (7)
where $\mathcal{U}^{(j)} \times_3 \tilde{\mathcal{U}}^{(j)}$ yields a tensor of size $d_j \times l_j \times \tilde{l}_j \times \tilde{p}_j$, with $\sum_{i=1}^{p_j} \mathcal{U}^{(j)}(k_1, k_2, i)\,\tilde{\mathcal{U}}^{(j)}(i, k_3, k_4)$ as its $(k_1, k_2, k_3, k_4)$th entry. The case with multiple kernels has a form similar to (4). Following the same logic, this can be generalized to even deeper CNNs by adding more "convolution → pooling" pairs. Since the composite weight tensor always has a "Tucker-like" form, the techniques used to prove Theorem 1 can be adopted to conduct the sample complexity analysis for deep CNNs. Note that, in model (7), the kernels at different convolution layers appear in the form $\tilde{\mathcal{A}} \otimes \mathcal{A}$, which actually provides a theoretical justification for exploring the layer-wise low-dimensional structure in the literature. On the other hand, we summarize all weights of a network into one single tensor, akin to Kossaifi et al. (2019). But our summarized tensor has an explicit "nested doll" structure, where the weight structure of the previous layer is nested within that of the current layer. In practice, zero-padding can be used before each convolution layer to preserve the input size. This is compatible with our framework, since we can propagate the padding of zeros back through the previous layers, all the way to the input tensor $\mathcal{X}$.
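The Kronecker-product form of the kernels at (7) reflects a simple fact that can be checked directly: with linear activations, stride 1 and no pooling, two stacked cross-correlation layers collapse into a single layer whose kernel is the full convolution of the two kernels. A 1D sketch, with sizes chosen for illustration:

```python
import numpy as np

def corr1d(x, a):
    """Valid cross-correlation with stride 1, as used in CNN layers."""
    l = len(a)
    return np.array([x[i : i + l] @ a for i in range(len(x) - l + 1)])

rng = np.random.default_rng(3)
x = rng.standard_normal(20)
a = rng.standard_normal(3)        # first-layer kernel
a2 = rng.standard_normal(4)       # second-layer kernel

two_layers = corr1d(corr1d(x, a), a2)
composed = corr1d(x, np.convolve(a, a2))   # single layer with the composed kernel
assert np.allclose(two_layers, composed)
```

The composed kernel has length $l + \tilde{l} - 1$; the factorized two-layer form stores only $l + \tilde{l}$ parameters, which is the layer-wise low-dimensional structure referred to above.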

3. COMPRESSED CNNS

For a high-order input, a deep CNN with a large number of kernels may involve heavy computation, which renders its training difficult on portable devices with limited resources. In real applications, many compressed CNN block designs have been proposed to improve efficiency, and most of them are based on either matrix factorization or tensor decomposition; see Lebedev et al. (2015); Kim et al. (2016); Astrid & Lee (2017); Kossaifi et al. (2020b). Tucker decomposition can be used to compress CNNs, which is the same as introducing a multilayer CNN block; see Figure 3 in Kim et al. (2016). Specifically, we stack the kernels $\mathcal{A}_k \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_N}$ with $1 \le k \le K$ into a higher-order tensor $\mathcal{A}_{\mathrm{stack}} \in \mathbb{R}^{l_1 \times l_2 \times \cdots \times l_N \times K}$, and assume that $\mathcal{A}_{\mathrm{stack}}$ has Tucker ranks $(R_1, R_2, \cdots, R_N, R_{N+1})$. As a result, $\mathcal{A}_{\mathrm{stack}} = \mathcal{G} \times_1 H^{(1)} \times_2 H^{(2)} \cdots \times_{N+1} H^{(N+1)}$, where $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N \times R_{N+1}}$ is the core tensor, and $H^{(j)} \in \mathbb{R}^{l_j \times R_j}$ are factor matrices for $1 \le j \le N+1$. From Section 2, training such a compressed CNN block with linear activations is equivalent to searching for the least-squares estimators at (6) under the corresponding low-rank constraint; denote the resulting model complexity by $d_M^{TU}$ for Tucker decomposition and by $d_M^{CP}$ for CP decomposition with rank $R$.
Theorem 2 (Compressed CNN). Let $(\widehat{\mathcal{W}}, d_M)$ be $(\widehat{\mathcal{W}}^{TU}, d_M^{TU})$ for Tucker decomposition, or $(\widehat{\mathcal{W}}^{CP}, d_M^{CP})$ for CP decomposition. Suppose that Assumptions 1-3 hold and $n \gtrsim (\kappa_U/\kappa_L)^2 c_N d_M$. Then,
$\|\widehat{\mathcal{W}} - \mathcal{W}^*\|_F \le \dfrac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}} \sqrt{\dfrac{c_N d_M}{n} + \dfrac{\delta}{n}}$ and $\mathcal{E}(\widehat{\mathcal{W}}) \le \sqrt{\kappa_U}\,\|\widehat{\mathcal{W}} - \mathcal{W}^*\|_F$,
with probability at least $1 - 4\exp\{-[c_H(0.25\kappa_L/\kappa_U)^2 n - 3c_N d_M]\} - \exp\{-d_M - 8\delta\}$, where $\delta = O_p(1)$, $c_N = 2^{N+1}\log(N+2)$, $c_H$ is a positive constant, and $\alpha_{\mathrm{RSC}}, \alpha_{\mathrm{RSM}}$ are defined in Theorem 1.
Theorem 2 shows that, as expected, the sample complexity is proportional to the model complexity. Compared with Theorem 1, the sample complexity of the compressed CNN depends on $\sum_{i=1}^N l_i$ instead of $\prod_{i=1}^N l_i$. When the input dimension $N$ is large, a compressed CNN can thus remove a large number of parameters. However, we notice that the sample complexity of a compressed CNN depends on the rank $R_{N+1}$ or $R$, rather than on the number of kernels $K$. This differs from naive parameter counting, and we hence provide the rationale behind this counter-intuitive observation in the corollary below.
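The sum-versus-product effect can be made concrete by naive parameter counting for the stacked kernel $\mathcal{A}_{\mathrm{stack}}$ alone. The counts below are an illustrative accounting of the factorizations (core plus factor matrices), not the paper's exact $d_M^{TU}$ and $d_M^{CP}$, which also include the fully-connected weights:

```python
from math import prod

def full_params(ls, K):
    """Uncompressed stacked kernel of size l_1 x ... x l_N x K."""
    return prod(ls) * K

def tucker_params(ls, K, ranks, rank_out):
    """Core tensor + one factor matrix per input mode + the output-mode factor."""
    return prod(ranks) * rank_out + sum(l * r for l, r in zip(ls, ranks)) + K * rank_out

def cp_params(ls, K, R):
    """R rank-1 terms: one length-l_j factor per mode plus the output-mode factor."""
    return R * (sum(ls) + K)

ls, K = (3, 3, 3, 3), 64            # a 4D kernel with 64 output channels (illustrative)
n_full = full_params(ls, K)         # grows with prod(l_i)
n_tucker = tucker_params(ls, K, (2, 2, 2, 2), 16)
n_cp = cp_params(ls, K, 16)         # grows with sum(l_i)
```

For these illustrative sizes, the factorized counts are an order of magnitude below the full count, and the gap widens as $N$ grows, mirroring the $\sum l_i$ versus $\prod l_i$ contrast in Theorems 1 and 2.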

Corollary 1. (a) If $\mathcal{A}_{\mathrm{stack}}$ has a Tucker decomposition with ranks $(R_1, R_2, \cdots, R_{N+1})$, the CNN is equivalent to one with $R_{N+1}$ kernels, where each kernel $\mathcal{A}_r$, $1 \le r \le R_{N+1}$, has a Tucker decomposition with ranks $(R_1, R_2, \cdots, R_N)$. Moreover, (b) if the stacked kernel tensor $\mathcal{A}_{\mathrm{stack}}$ has a CP decomposition with rank $R$, the corresponding CNN can be reparameterized into one with $R$ kernels, where each kernel $\mathcal{A}_r$, $1 \le r \le R$, has a CP decomposition with rank $R$.

Define the K/R ratio to be the value of $K/R_{N+1}$ for Tucker decomposition or $K/R$ for CP decomposition; in practice, it always holds that $K \ge R$. Corollary 1 essentially states that when $K/R > 1$, there exists model redundancy in the CNN model with linear activations. In other words, we can obtain a more compressed network by setting $K = R$ such that it has the same sample complexity as the original one. The K/R ratio appears in various block designs (see Figure 2), and we can use this finding to evaluate their model efficiency.
(T1) Standard bottleneck block. The basic building block in ResNet (He et al., 2016) can be exactly replicated by a Tucker decomposition on $\mathcal{A}_{\mathrm{stack}} \in \mathbb{R}^{l_1 \times l_2 \times C \times K}$, with ranks $(l_1, l_2, R, R)$, where $C$ is the number of input channels (Kossaifi et al., 2020b). As shown by the ablation studies in Section 4, when $K \gg R$, this design may not be the most parameter efficient.
(T2) Funnel block. As a straightforward revision of the standard bottleneck block, the funnel block maintains an output channel size of $K = R$, and is hence an efficient block design. Moreover, when $R = C$, we obtain a normalized funnel block, with the same number of input and output channels.
(C1) Depthwise separable block. This block is the basic module in MobileNetV1 (Howard et al., 2017), with a pair of depthwise and pointwise separable convolution layers. It is equivalent to assuming a CP decomposition on $\mathcal{A}_{\mathrm{stack}} \in \mathbb{R}^{l_1 \times l_2 \times C \times K}$ with rank $C$ (Kossaifi et al., 2020b). The 1 × 1 pointwise convolution expands the number of channels from $C$ to $K$.
When $K \gg C$, this design is parameter inefficient from a computational viewpoint.
(C2) Inverted residual block. Sandler et al. (2018) later proposed this design in MobileNetV2. It includes expansive layers between the input and output layers, with channel size $x \cdot C$ ($x \ge 1$), where $x$ is the expansion factor. As discussed by Kossaifi et al. (2020b), it heuristically corresponds to a CP decomposition on $\mathcal{A}_{\mathrm{stack}}$ with CP rank equal to $x \cdot C$. Since the rank along the output channel dimension can be at most $x \cdot C$, as long as $K \le x \cdot C$ holds, the design is theoretically efficient and provides leeway for exploring thicker layers within blocks.
With nonlinear activations, the model redundancy may bring some benefit, but such benefit is not guaranteed, and when $K \gg R$ it will undoubtedly hinder computational efficiency. We will show, in the ablation studies of the next section, that under realistic settings with nonlinear activations, our finding on the K/R ratio still applies: networks with $K/R = 1$ maintain performance comparable to networks with $K/R > 1$, while using far fewer parameters.
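The depthwise-separable correspondence in (C1) can be checked directly in a simplified 1D setting: a per-channel depthwise convolution followed by a pointwise channel-mixing convolution produces the same output as a full convolution whose stacked kernel factorizes as $W[k, c, :] = P[k, c]\, D[c, :]$. This sketch ignores strides, padding and nonlinearities, and all sizes are illustrative:

```python
import numpy as np

C, K, l, d = 3, 8, 3, 16       # in-channels, out-channels, kernel length, input length
rng = np.random.default_rng(4)
x = rng.standard_normal((C, d))
D = rng.standard_normal((C, l))    # depthwise: one length-l filter per input channel
P = rng.standard_normal((K, C))    # pointwise 1x1 convolution: mixes channels

def corr1d(v, a):
    return np.array([v[i : i + len(a)] @ a for i in range(len(v) - len(a) + 1)])

# depthwise convolution, then pointwise channel mixing
depthwise = np.stack([corr1d(x[c], D[c]) for c in range(C)])   # (C, d-l+1)
y_sep = P @ depthwise                                          # (K, d-l+1)

# equivalent full convolution with the factorized stacked kernel
W = P[:, :, None] * D[None, :, :]                              # (K, C, l)
y_full = np.stack([sum(corr1d(x[c], W[k, c]) for c in range(C)) for k in range(K)])
assert np.allclose(y_sep, y_full)

n_sep, n_full = C * l + K * C, K * C * l   # parameter counts of the two forms
```

The factorized form stores $C\,l + K\,C$ parameters against $K\,C\,l$ for the full kernel, which is where the compression of the depthwise separable block comes from.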

4. EXPERIMENTS

This section first verifies the theoretical results with synthetic datasets, and then conducts ablation studies on the bottleneck blocks in ResNet (He et al., 2016) with different K/R ratios.

4.1. NUMERICAL ANALYSIS FOR THEORETICAL RESULTS

We choose four settings (S1-S4) to verify the sample complexity in Theorem 1; the configurations of input sizes, kernel sizes, pooling sizes and numbers of kernels are listed in Table 1. For Theorem 2, we also adopt four settings with 4D input tensors; see Table 1. The stacked kernel $\mathcal{A}_{\mathrm{stack}}$ is generated by (8), where the core tensor has standard normal entries and the factor matrices are generated to have orthonormal columns. The number of training samples $n$ is chosen such that $d_M^{TU}/n$ is equally spaced. A linear trend between the estimation error and the square root of $d_M^{TU}/n$ can be observed, which verifies Theorem 2. For the implementation, we employ gradient descent with a learning rate of 0.01 and a momentum of 0.9; the procedure is deemed to have converged once the target function drops by less than $10^{-8}$.
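The optimization recipe used here (gradient descent with learning rate 0.01, momentum 0.9, stopping once the objective decrease falls below $10^{-8}$) can be sketched as follows, on a generic least-squares objective standing in for (6); the data, the periodic stopping check, and the iteration cap are illustrative choices, not from the paper:

```python
import numpy as np

def momentum_gd(grad, loss, w0, lr=0.01, momentum=0.9, tol=1e-8, max_iter=100_000):
    """Gradient descent with heavy-ball momentum. For stability against the
    oscillations momentum introduces, the stopping rule is checked every 100
    iterations: stop when the objective dropped by less than `tol` over a window."""
    w, v = w0.copy(), np.zeros_like(w0)
    prev = loss(w)
    for t in range(1, max_iter + 1):
        v = momentum * v - lr * grad(w)
        w = w + v
        if t % 100 == 0:
            cur = loss(w)
            if prev - cur < tol:
                break
            prev = cur
    return w

# illustrative noiseless least-squares problem
rng = np.random.default_rng(5)
Z = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = Z @ w_true
loss = lambda w: np.mean((y - Z @ w) ** 2)
grad = lambda w: -2.0 * Z.T @ (y - Z @ w) / len(y)
w_hat = momentum_gd(grad, loss, np.zeros(10))
```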

4.2. ABLATION STUDIES ON COMPUTATIONALLY-EFFICIENT BLOCK DESIGNS

Following the notation in Section 3, we denote by $K$ the number of output channels of a bottleneck block, and by $R$ the Tucker rank of $\mathcal{A}_{\mathrm{stack}}$ along the output channel dimension. Since $K \ge R$, we let $t = K/R$ be the expansion ratio of the block. As $t$ increases, the number of parameters within the block increases. Here, ablation studies are conducted to show that increasing $t$ does not necessarily increase test accuracy. We analyze two image recognition datasets, CIFAR-10 (Krizhevsky et al., 2009) and Street View House Numbers (SVHN) (Netzer et al., 2011), and an action recognition dataset, UCF101 (Soomro et al., 2012). Standard data pre-processing and augmentation techniques are adopted for all three datasets (He et al., 2016; Hara et al., 2017).
Network architecture. Two network architectures are considered in our study, as shown in Figure 4. For the image recognition datasets, we adopt a 41-layer residual network: it consists of a 3 × 3 convolution layer and a max pooling layer, followed by 3 groups of A + x × B residual blocks with different $R$, and ends with an average pooling and a fully-connected layer. Block B represents the bottleneck structure that we are interested in. The standard bottleneck block corresponds to $t = 4$; when $t = 1$, it corresponds to the normalized funnel block in Section 3 (T2). For the sake of comparison, we also take $t = 8$ or $16$. When $t > 1$, Block A is the standard downsampling block that contains convolution with stride size 2, which is inefficient according to our study; so when $t = 1$, we propose minor revisions to simultaneously increase the channels while performing the stride-2 3 × 3 convolution. For the UCF101 dataset, we use 3D ResNet-18 as the basic framework and insert 4 stacks of Block B with $R = 256$ before the average pooling and fully-connected layer.
Results. The results in Table 2 provide empirical support for our theoretical finding.
Though the number of parameters in the network increases with larger $t$, the overall test accuracy remains roughly comparable. For CIFAR-10, the test accuracy at $t = 16$ may appear slightly better than at $t = 1$, but it uses 5 times as many parameters.
Implementation details. All experiments are conducted in PyTorch on a Tesla V100-DGXS. Following the practice in He et al. (2016), we adopt batch normalization (BN) (Ioffe & Szegedy, 2015) right after each convolution and before the ReLU activation. For network (1), we initialize the weights as in He et al. (2015). We use stochastic gradient descent with weight decay $10^{-4}$, momentum 0.9 and mini-batch size 128. The learning rate starts from 0.1 and is divided by 10 every 100 epochs. We stop training after 300 epochs, since the training accuracy hardly changes; it is approximately 93% for CIFAR-10 and 96% for SVHN. We set seeds 1-5 and report the worst-case test accuracy in Table 2. For network (2), we follow Hara et al. (2018) and use the Kinetics-pretrained 3D ResNet-18. The weights are fixed for the first few layers, and backward propagation is only applied to the added Block B layers and the fully-connected layer. We use stochastic gradient descent with weight decay $10^{-5}$, momentum 0.9 and mini-batch size 64, with 16 frames per clip. The learning rate starts from 0.001 and is divided by 10 every 30 epochs. Training is stopped after 80 epochs, and the top-1 clip accuracy is around 79% for UCF101.
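The step-decay schedule described above (start at 0.1 and divide by 10 every 100 epochs for network (1); start at 0.001 and divide by 10 every 30 epochs for network (2)) is simple to state in code; a small sketch:

```python
def step_decay_lr(epoch, base_lr=0.1, drop=10.0, every=100):
    """Learning rate for a given (0-indexed) epoch under step decay:
    the base rate is divided by `drop` once per `every` epochs."""
    return base_lr / drop ** (epoch // every)
```

For network (2), the same function would be called with `base_lr=0.001` and `every=30`.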

5. CONCLUSION AND DISCUSSION

Our paper proposes a unified theoretical framework that adopts sample complexity analysis as a tool to study the effective number of parameters in a tensor-decomposed CNN block. The main practical takeaways are: (i) for a large kernel tensor $\mathcal{A}$ (Peng et al., 2017), it is always effective to impose a low-rank structure on its input dimensions, such as height, width or time-length; and (ii) it is essential to maintain $K/R = 1$ in a CNN block in order to increase accuracy in a more parameter-efficient way. In this regard, one can either choose a small $R$ with deeper networks or a larger $R$ with shallower networks. In fact, when $K/R = 1$, the block corresponds exactly to the "straightened" bottleneck in Zagoruyko & Komodakis (2016), whose empirical study showed that increasing $R$ in shallow networks can indeed improve the test accuracy quite significantly. Since the low-dimensional structure of convolution layers is our focus, tensor decomposition has been applied only to the kernel weights $\mathcal{A}_{\mathrm{stack}}$ in Section 3. Kossaifi et al. (2017) applied tensor decomposition to fully-connected layers, since they may contain a large number of parameters as well. Along this line, we could stack the fully-connected weights $\mathcal{B}_k \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N}$ with $1 \le k \le K$ into a higher-order tensor $\mathcal{B}_{\mathrm{stack}} \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_N \times K}$, and further consider a Tensor Contraction Layer (TCL) or Tensor Regression Layer (TRL) as in Kossaifi et al. (2020a). Similar theoretical analysis can be obtained, and we leave it for future research.

A APPENDIX

This appendix contains five sections. In the first section, we provide details of our mathematical formulation of a high-order 3-layer CNN. In the second section, we discuss the formulation of a 5-layer CNN as an illustrating example of how to add more layers to our framework. A complete version of Theorems 1 and 2, together with the technical proofs of Theorems 1 and 2 and Corollary 1, is provided in the third section. In the fourth section, we extend our setting to classification and establish the corresponding theorem and corollaries for binary and multiclass classification problems. More implementation details and results of additional experiments are presented in the last section.

A.1 CNN FORMULATION

For a tensor input $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$, the network first convolutes it with an $N$-dimensional kernel tensor $\mathcal{A} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$ with stride sizes equal to $s_c$, and then performs average pooling with pooling sizes equal to $(q_1, \cdots, q_N)$. It ends with a fully-connected layer with weight tensor $\mathcal{B}$, producing a scalar output. We assume that $m_j = (d_j - l_j)/s_c + 1$ are integers for $1 \le j \le N$; otherwise zero-padding is needed. For ease of notation, we take the pooling sizes $\{q_j\}_{j=1}^N$ to satisfy $m_j = p_j q_j$. To replicate the operation of the convolution layer with a simple mathematical expression, we first define a set of matrices $\{U^{(j)}_{i_j} \in \mathbb{R}^{d_j \times l_j}\}$, where
$U^{(j)}_{i_j} = (0_{l_j \times (i_j-1)s_c},\ I_{l_j},\ 0_{l_j \times (d_j-(i_j-1)s_c-l_j)})^\top \in \mathbb{R}^{d_j \times l_j}$, for $1 \le i_j \le m_j$, $1 \le j \le N$.   (9)
$U^{(j)}_{i_j}$ acts as a positioning factor that transforms the kernel tensor $\mathcal{A}$ into a tensor of the same size as the input $\mathcal{X}$, with the rest of the entries equal to zero. We are now ready to construct our main formulation for a 3-layer tensor CNN. To begin with, we illustrate the process using a vector input $x \in \mathbb{R}^d$ with a kernel vector $a \in \mathbb{R}^l$; see Figure 5.
Using the $i$-th positioning matrix $U^{(1)}_i$, we can propagate the small kernel vector $a$ into a vector $U^{(1)}_ia\in\mathbb{R}^d$ by filling the remaining entries with zeros. The intermediate output vector has entries $x_c(i) = \langle x, U^{(1)}_ia\rangle$ for $1\le i\le m$. The average pooling operation is equivalent to forming $p$ consecutive subvectors within $x_c$, each of length $q$, and taking their averages. This results in a vector $x_{cp}\in\mathbb{R}^p$ with $i$-th entry
$$x_{cp}(i) = q^{-1}\sum_{k=(i-1)q+1}^{iq}x_c(k) = \Big\langle x,\ q^{-1}\sum_{k=(i-1)q+1}^{iq}U^{(1)}_ka\Big\rangle.$$
The fully-connected layer performs a weighted summation over the $p$ entries, with weights given by the entries of the vector $b\in\mathbb{R}^p$. This gives the predicted output $y = \langle b, x_{cp}\rangle = \langle x, w_X\rangle$, where
$$w_X = \sum_{i=1}^pb_i\,\frac{1}{q}\sum_{k=(i-1)q+1}^{iq}U^{(1)}_ka = U^{(1)}_F(b\otimes a) = (b\otimes a)\times_1U^{(1)}_F,$$
$U^{(1)}_F = q^{-1}\big(\sum_{k=1}^qU^{(1)}_k,\cdots,\sum_{k=m-q+1}^mU^{(1)}_k\big)$, and "$\times_1$" represents the mode-1 product.

For a matrix input $X$ with matrix kernel $A$, however, we need two sets of positioning matrices, $\{U^{(1)}_{i_1}\}_{i_1=1}^{m_1}$ and $\{U^{(2)}_{i_2}\}_{i_2=1}^{m_2}$, one for the height dimension and the other for the width. Then, the intermediate output from the convolution has entries $X_c(i_1,i_2) = \langle X, U^{(1)}_{i_1}AU^{(2)\top}_{i_2}\rangle$ for $1\le i_1\le m_1$ and $1\le i_2\le m_2$. For the average pooling, we form $p_1p_2$ consecutive blocks from $X_c$, each of size $q_1\times q_2$, and take the averages. This results in $X_{cp}\in\mathbb{R}^{p_1\times p_2}$ with
$$X_{cp}(i_1,i_2) = \Big\langle X,\ \Big(q_1^{-1}\sum_{k_1=(i_1-1)q_1+1}^{i_1q_1}U^{(1)}_{k_1}\Big)A\Big(q_2^{-1}\sum_{k_2=(i_2-1)q_2+1}^{i_2q_2}U^{(2)}_{k_2}\Big)^{\!\top}\Big\rangle.$$
The output $X_{cp}$ then goes through the fully-connected layer with weight matrix $B$ and gives the predicted output $y = \langle X, W_X\rangle$, where
$$W_X = \sum_{i_2=1}^{p_2}\sum_{i_1=1}^{p_1}B(i_1,i_2)\,\frac{1}{q_1q_2}\Big(\sum_{k_1=(i_1-1)q_1+1}^{i_1q_1}U^{(1)}_{k_1}\Big)A\Big(\sum_{k_2=(i_2-1)q_2+1}^{i_2q_2}U^{(2)}_{k_2}\Big)^{\!\top} = (B\otimes A)\times_1U^{(1)}_F\times_2U^{(2)}_F,$$
with $U^{(j)}_F = q_j^{-1}\big(\sum_{k=1}^{q_j}U^{(j)}_k,\cdots,\sum_{k=m_j-q_j+1}^{m_j}U^{(j)}_k\big)$ for $j = 1$ or $2$, and "$\times_1$", "$\times_2$" representing the mode-1 and mode-2 products.

Figure 5: Formulating the process with an input vector $x$.
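The 1-D identity above can be checked numerically; the following sketch (sizes assumed for illustration) verifies $y = \langle b, x_{cp}\rangle = \langle x, w_X\rangle$ with $w_X = U^{(1)}_F(b\otimes a)$:

```python
# Sketch: convolution -> average pooling -> fully connected on a vector
# input equals a single inner product with the composite weight
# w_X = U_F (b kron a), where U_F concatenates averaged positioning matrices.

import numpy as np

rng = np.random.default_rng(0)
d, l, s_c, q = 12, 3, 1, 2
m = (d - l) // s_c + 1          # m = 10 convolution windows
p = m // q                      # p = 5 pooled outputs

x = rng.standard_normal(d)
a = rng.standard_normal(l)
b = rng.standard_normal(p)

def U(i):                        # positioning matrix, 1-based index
    M = np.zeros((d, l)); M[(i-1)*s_c:(i-1)*s_c + l, :] = np.eye(l); return M

# forward pass
x_c = np.array([x @ (U(i) @ a) for i in range(1, m + 1)])
x_cp = x_c.reshape(p, q).mean(axis=1)
y = b @ x_cp

# composite weight: U_F is d x (l*p), one averaged block per pooling window
U_F = np.hstack([sum(U(k) for k in range(i*q + 1, (i+1)*q + 1)) / q
                 for i in range(p)])
w_X = U_F @ np.kron(b, a)
print(np.isclose(y, x @ w_X))  # True
```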
Note that we combine the convolution, pooling and fully-connected layers into the composite weight vector $w_X$, where $w_X = \sum_{i=1}^pb_i\,q^{-1}\sum_{k=(i-1)q+1}^{iq}U^{(1)}_ka$; here "$*$" represents the convolution operation. From here, together with the case of a high-order tensor input discussed in Section 2.2, we can then derive the form of the predicted outcome as $y = \langle\mathcal{X}, \mathcal{W}^{\mathrm{single}}_X\rangle$, where $\mathcal{W}^{\mathrm{single}}_X = (\mathcal{B}\otimes\mathcal{A})\times_1U^{(1)}_F\times_2U^{(2)}_F\times\cdots\times_NU^{(N)}_F$, with $U^{(j)}_F = q_j^{-1}\big(\sum_{k=1}^{q_j}U^{(j)}_k,\cdots,\sum_{k=m_j-q_j+1}^{m_j}U^{(j)}_k\big)$, and "$\times_j$" representing the mode-$j$ product for $1\le j\le N$. With $K$ kernels, we denote the set of kernels and the corresponding fully-connected weight tensors by $\{\mathcal{A}_k,\mathcal{B}_k\}_{k=1}^K$. Since the convolution and pooling operations are identical across kernels, we can use a summation over kernels to derive the weight tensor for multiple kernels, which is $\mathcal{W}_X = \big(\sum_{k=1}^K\mathcal{B}_k\otimes\mathcal{A}_k\big)\times_1U^{(1)}_F\times_2U^{(2)}_F\times\cdots\times_NU^{(N)}_F$, and we arrive at the formulation for the 3-layer tensor CNN.
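The 2-D analogue of the identity above can also be checked numerically (a sketch with assumed sizes; `U` and `U_F` are our helper names):

```python
# Sketch: for a matrix input, convolution + average pooling + fully-connected
# layer equals <X, W_X> with W_X = (B kron A) x_1 U_F^(1) x_2 U_F^(2).

import numpy as np

rng = np.random.default_rng(1)
d1, d2, l1, l2, q1, q2, s = 8, 8, 3, 3, 2, 2, 1
m1, m2 = (d1 - l1)//s + 1, (d2 - l2)//s + 1   # 6 x 6 convolution output
p1, p2 = m1//q1, m2//q2                        # 3 x 3 pooled output

X = rng.standard_normal((d1, d2))
A = rng.standard_normal((l1, l2))
B = rng.standard_normal((p1, p2))

def U(d, l, i):
    M = np.zeros((d, l)); M[(i-1)*s:(i-1)*s + l, :] = np.eye(l); return M

def U_F(d, l, m, q):
    return np.hstack([sum(U(d, l, k) for k in range(i*q+1, (i+1)*q+1)) / q
                      for i in range(m//q)])

# forward pass
Xc = np.array([[np.sum(X * (U(d1, l1, i1) @ A @ U(d2, l2, i2).T))
                for i2 in range(1, m2+1)] for i1 in range(1, m1+1)])
Xcp = Xc.reshape(p1, q1, p2, q2).mean(axis=(1, 3))
y = np.sum(B * Xcp)

# composite weight via the two mode products with U_F matrices
W = U_F(d1, l1, m1, q1) @ np.kron(B, A) @ U_F(d2, l2, m2, q2).T
print(np.isclose(y, np.sum(X * W)))  # True
```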

A.2 FIVE-LAYER CNN FORMULATION

Consider a 5-layer CNN with "convolution → pooling → convolution → pooling → fully-connected" layers and a 3D tensor input $\mathcal{X}\in\mathbb{R}^{d_1\times d_2\times d_3}$. Here, we denote the intermediate output from the first convolution by $\mathcal{X}_c\in\mathbb{R}^{m_1\times m_2\times m_3}$, that from the second convolution by $\widetilde{\mathcal{X}}_c\in\mathbb{R}^{\widetilde m_1\times\widetilde m_2\times\widetilde m_3}$, the output from the first pooling by $\mathcal{X}_{cp}\in\mathbb{R}^{p_1\times p_2\times p_3}$, and that from the second pooling by $\widetilde{\mathcal{X}}_{cp}\in\mathbb{R}^{\widetilde p_1\times\widetilde p_2\times\widetilde p_3}$. We can first see that the predicted output of the 5-layer CNN is the same as directly feeding $\mathcal{X}_{cp}$ to the second convolution layer, followed by an average pooling layer and a fully-connected layer. Denote the first convolution kernel tensor by $\mathcal{A}\in\mathbb{R}^{l_1\times l_2\times l_3}$, the second convolution kernel tensor by $\widetilde{\mathcal{A}}\in\mathbb{R}^{\widetilde l_1\times\widetilde l_2\times\widetilde l_3}$, and the fully-connected weight tensor by $\widetilde{\mathcal{B}}\in\mathbb{R}^{\widetilde p_1\times\widetilde p_2\times\widetilde p_3}$. Define the set of matrices $\{U^{(j)}_{i_j}\in\mathbb{R}^{d_j\times l_j}\}$ for $1\le i_j\le m_j$, $1\le j\le N$ as in (9), and similarly define $\{\widetilde U^{(j)}_{i_j}\in\mathbb{R}^{p_j\times\widetilde l_j}\}$ with
$$\widetilde U^{(j)}_{i_j} = \begin{pmatrix} 0_{(i_j-1)\widetilde s_c} \\ I_{\widetilde l_j} \\ 0_{p_j-(i_j-1)\widetilde s_c-\widetilde l_j} \end{pmatrix},$$
where $\widetilde s_c$ is the stride size for the second convolution. Let $\widetilde U^{(j)}_F = (\widetilde U^{(j)}_{F,1},\cdots,\widetilde U^{(j)}_{F,\widetilde p_j}) = \big(\widetilde q_j^{-1}\sum_{k=1}^{\widetilde q_j}\widetilde U^{(j)}_k,\cdots,\widetilde q_j^{-1}\sum_{k=\widetilde m_j-\widetilde q_j+1}^{\widetilde m_j}\widetilde U^{(j)}_k\big)$. We further stack the matrices $\{U^{(j)}_{F,i_j}\in\mathbb{R}^{d_j\times l_j}\}_{1\le i_j\le p_j}$ and $\{\widetilde U^{(j)}_{F,i_j}\in\mathbb{R}^{p_j\times\widetilde l_j}\}_{1\le i_j\le\widetilde p_j}$ into 3D tensors $\mathcal{U}^{(j)}\in\mathbb{R}^{d_j\times l_j\times p_j}$ and $\widetilde{\mathcal{U}}^{(j)}\in\mathbb{R}^{p_j\times\widetilde l_j\times\widetilde p_j}$.

Now we proceed to provide a complete version of the sample complexity of compressed CNNs. Let $d^{\mathrm{TU}}_M = \prod_{j=1}^{N+1}R_j + \sum_{i=1}^Nl_iR_i + R_{N+1}P$ and $d^{\mathrm{CP}}_M = R^{N+1} + R\big(\sum_{i=1}^Nl_i + P\big)$.

Theorem 2 (Complete Version). Let $(\widehat{\mathcal{W}}, d_M)$ be $(\widehat{\mathcal{W}}^{\mathrm{TU}}, d^{\mathrm{TU}}_M)$ for Tucker decomposition, or $(\widehat{\mathcal{W}}^{\mathrm{CP}}, d^{\mathrm{CP}}_M)$ for CP decomposition. Suppose that Assumptions 1-3 hold and $n\gtrsim(\kappa_U/\kappa_L)^2c_Nd_M$.
Then, for some $\delta>0$,
(a) $\|\widehat{\mathcal{W}}-\mathcal{W}^*\|_F \le \dfrac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}}\sqrt{\dfrac{c_Nd_M}{n}+\dfrac{\delta}{n}}$,
(b) $\|\widehat{\mathcal{W}}-\mathcal{W}^*\|_n \le 16\sigma\sqrt{\dfrac{\alpha_{\mathrm{RSM}}}{\alpha_{\mathrm{RSC}}}}\sqrt{\dfrac{c_Nd_M}{n}+\dfrac{\delta}{n}}$,
(c) $\mathrm{err}^2(\widehat{\mathcal{W}}) \le \Big(\dfrac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}}\Big)^2\Big(\dfrac{c_Nd_M}{n}+\dfrac{\delta}{n}\Big)+\sigma^2$,
with probability $1-4\exp\{-[c_H(0.25\kappa_L/\kappa_U)^2n-3c_Nd_M]\}-\exp\{-d_M-8\delta\}$, where $c_N = 2^{N+1}\log(N+2)$, $c_H$ is a positive constant, and $\alpha_{\mathrm{RSC}}$, $\alpha_{\mathrm{RSM}}$ are defined in Theorem 1. Because CP decomposition can be considered as a special case of Tucker decomposition with $R_j = R$ for $1\le j\le N+1$, we only provide the proof for the Tucker CNN.

A.3.1 PROOF OF THEOREM 1

Denote the sets $\mathcal{S}_K = \{\sum_{k=1}^K\mathcal{B}_k\otimes\mathcal{A}_k : \mathcal{A}_k\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_N}\text{ and }\mathcal{B}_k\in\mathbb{R}^{p_1\times p_2\times\cdots\times p_N}\}$ and $\bar{\mathcal{S}}_K = \{\mathcal{W}\in\mathcal{S}_K : \|\mathcal{W}\|_F = 1\}$. Let $\Delta = \widehat{\mathcal{W}}-\mathcal{W}^*$, and then $\frac{1}{n}\sum_{i=1}^n(y_i-\langle\mathcal{Z}_i,\widehat{\mathcal{W}}\rangle)^2 \le \frac{1}{n}\sum_{i=1}^n(y_i-\langle\mathcal{Z}_i,\mathcal{W}^*\rangle)^2$, which implies that
$$\|\Delta\|_n^2 \le \frac{2}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle \le 2\|\Delta\|_F\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle, \tag{11}$$
where $\|\Delta\|_n^2 := n^{-1}\sum_{i=1}^n\langle\mathcal{Z}_i,\Delta\rangle^2$ is the empirical norm with respect to $\Delta$, and $\Delta\in\mathcal{S}_{2K}$.

Consider an $\varepsilon$-net $\bar{\mathcal{S}}^\sharp_{2K}$, with cardinality $N(2K,\varepsilon)$, for the set $\bar{\mathcal{S}}_{2K}$. For any $\Delta\in\bar{\mathcal{S}}_{2K}$, there exists a $\bar\Delta_j\in\bar{\mathcal{S}}^\sharp_{2K}$ such that $\|\Delta-\bar\Delta_j\|_F\le\varepsilon$. Note that $\Delta-\bar\Delta_j\in\mathcal{S}_{4K}$ and, from Lemma 1(a), we can further find $\Delta_1,\Delta_2\in\mathcal{S}_{2K}$ such that $\langle\Delta_1,\Delta_2\rangle = 0$ and $\Delta-\bar\Delta_j = \Delta_1+\Delta_2$. It then holds that $\|\Delta_1\|_F+\|\Delta_2\|_F\le\sqrt2\|\Delta-\bar\Delta_j\|_F\le\sqrt2\varepsilon$ since $\|\Delta-\bar\Delta_j\|_F^2 = \|\Delta_1\|_F^2+\|\Delta_2\|_F^2$. As a result,
$$\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle = \frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle + \frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta_1\rangle + \frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta_2\rangle \le \max_{1\le j\le N(2K,\varepsilon)}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle + \sqrt2\varepsilon\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle,$$
which leads to
$$\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle \le (1-\sqrt2\varepsilon)^{-1}\max_{1\le j\le N(2K,\varepsilon)}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle. \tag{12}$$
Note that, from Lemma 1(b), $\log N(2K,\varepsilon)\le 2d_M\log(9/\varepsilon)$, where $d_M = K(P+L+1)$. Let $\varepsilon = (2\sqrt2)^{-1}$, and then $8-2\log(9/\varepsilon)>1$. As a result, by (12) and Lemma 3,
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle \ge 8\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}},\ \sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSM}}\Big) \le \mathbb{P}\Big(\max_{1\le j\le N(2K,\varepsilon)}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle \ge 4\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}},\ \sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSM}}\Big) \le \sum_{j=1}^{N(2K,\varepsilon)}\mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle \ge 4\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}},\ \|\bar\Delta_j\|_n^2\le\alpha_{\mathrm{RSM}}\Big) \le \exp\{-d_M-8\delta\}. \tag{13}$$
Note that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle \ge 8\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}}\Big) \le \mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle \ge 8\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}},\ \sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSM}}\Big) + \mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\ge\alpha_{\mathrm{RSM}}\Big).$$
From (13) and Lemma 2, we then have that, with probability $1-4\exp\{-[c_H(0.25\kappa_L/\kappa_U)^2n-9d_M]\}-\exp\{-d_M-8\delta\}$,
$$\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle \le 8\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}} \quad\text{and}\quad \|\Delta\|_n^2\ge\alpha_{\mathrm{RSC}}\|\Delta\|_F^2,$$
which, together with (11), leads to
$$\|\Delta\|_F \le \frac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}} \quad\text{and}\quad \|\Delta\|_n \le 16\sigma\sqrt{\frac{\alpha_{\mathrm{RSM}}}{\alpha_{\mathrm{RSC}}}}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}}.$$
Finally, we prove (c). Given a test sample $(\mathcal{X},y)$, let $\mathcal{Z} = \mathcal{X}\times_1U^{(1)}_F\times_2U^{(2)}_F\times_3\cdots\times_NU^{(N)}_F$. It holds that $\mathbb{E}(y-\langle\mathcal{X},\widehat{\mathcal{W}}_X\rangle)^2 = \mathbb{E}(y-\langle\mathcal{Z},\widehat{\mathcal{W}}\rangle)^2 = \mathbb{E}\langle\mathcal{Z},\Delta\rangle^2+\sigma^2$, and $\mathbb{E}\langle\mathcal{Z},\Delta\rangle^2 = \Delta^\top\mathbb{E}(zz^\top)\Delta = \Delta^\top U_G^\top\mathbb{E}(xx^\top)U_G\Delta \le \kappa_U\|\Delta\|_2^2$. As a result, from part (a) of this theorem, we have
$$\mathrm{err}^2(\widehat{\mathcal{W}}) \le \Big(\frac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}}\Big)^2\Big(\frac{d_M}{n}+\frac{\delta}{n}\Big)+\sigma^2.$$
This completes the proof.

A.3.2 PROOF OF THEOREM 2

Denote the sets $\mathcal{S}^{\mathrm{TU}}(R_1,\cdots,R_{N+1}) = \{\sum_{k=1}^K\mathcal{B}_k\otimes\mathcal{A}_k : \text{the stacked kernel }\mathcal{A}_{\mathrm{stack}}\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_N\times K}\text{ has multilinear ranks }(R_1,\ldots,R_N,R_{N+1})\text{ and }\mathcal{B}_k\in\mathbb{R}^{p_1\times p_2\times\cdots\times p_N}\}$, $\mathcal{S}^{\mathrm{TU}}_{2K} = \{\mathcal{W}_1+\mathcal{W}_2 : \mathcal{W}_1,\mathcal{W}_2\in\mathcal{S}^{\mathrm{TU}}(R_1,\cdots,R_{N+1})\}$, and $\bar{\mathcal{S}}^{\mathrm{TU}}_{2K} = \{\mathcal{W}\in\mathcal{S}^{\mathrm{TU}}_{2K} : \|\mathcal{W}\|_F = 1\}$. Note that $\widehat{\mathcal{W}}^{\mathrm{TU}},\mathcal{W}^*\in\mathcal{S}^{\mathrm{TU}}(R_1,\cdots,R_{N+1})$, and $\Delta = \widehat{\mathcal{W}}^{\mathrm{TU}}-\mathcal{W}^*\in\mathcal{S}^{\mathrm{TU}}_{2K}$.

We first consider an $\varepsilon$-net for $\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}$. For each $1\le k\le K$, let $b_k = (b_{k1},\ldots,b_{kP})^\top = \mathrm{vec}(\mathcal{B}_k)$, and we can rearrange $\mathcal{B}_k\otimes\mathcal{A}_k$ into the form of $\mathcal{A}_k\circ b_k$, which is a tensor of size $l_1\times l_2\times\cdots\times l_N\times P$, where $P = p_1p_2\cdots p_N$. Denote $B = (b_{kj})^\top\in\mathbb{R}^{P\times K}$, and it holds that
$$\sum_{k=1}^K\mathcal{A}_k\circ b_k = \mathcal{A}\times_{N+1}B = \mathcal{G}\times_1H^{(1)}\times_2H^{(2)}\cdots\times_{N+1}\widetilde B,$$
which is a tensor of size $l_1\times l_2\times\cdots\times l_N\times P$ with multilinear ranks $(R_1,\ldots,R_N,R_{N+1})$, where $\widetilde B = BH^{(N+1)}\in\mathbb{R}^{P\times R_{N+1}}$. Essentially, in this step, we rewrite the model into one with $R_{N+1}$ kernels instead. Specifically, we now have $\mathcal{W} = \sum_{r=1}^{R_{N+1}}\widetilde{\mathcal{B}}_r\otimes\widetilde{\mathcal{A}}_r$, where $\widetilde{\mathcal{A}}_r = \mathcal{G}_r\times_1H^{(1)}\times_2H^{(2)}\cdots\times_NH^{(N)}$ with $\mathcal{G}_r = \mathcal{G}(:,:,\cdots,r)$, and the $r$-th column of $\widetilde B$ is the vectorization of $\widetilde{\mathcal{B}}_r$ for $1\le r\le R_{N+1}$. As a result, $\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}$ consists of tensors with multilinear ranks at most $(2R_1,\ldots,2R_N,2R_{N+1})$.

Denote $\mathcal{S}_{\mathrm{Tucker}}(r_1,\cdots,r_N,r_{N+1}) = \{\mathcal{T}\in\mathbb{R}^{l_1\times\cdots\times l_N\times l_{N+1}} : \|\mathcal{T}\|_F = 1,\ \mathcal{T}\text{ has Tucker ranks }(r_1,\cdots,r_{N+1})\}$, where $l_{N+1} = P = p_1p_2\cdots p_N$. Then the $\varepsilon$-covering number for $\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}$ is at most that of $\mathcal{S}_{\mathrm{Tucker}}(2R_1,\cdots,2R_N,2R_{N+1})$. For each $\mathcal{T}\in\mathcal{S}_{\mathrm{Tucker}}(r_1,\cdots,r_{N+1})$, we have $\mathcal{T} = \mathcal{G}\times_1U^{(1)}\times_2\cdots\times_{N+1}U^{(N+1)}$, where $\mathcal{G}\in\mathbb{R}^{r_1\times\cdots\times r_{N+1}}$ with $\|\mathcal{G}\|_F = 1$, and the $U^{(i)}\in\mathbb{R}^{l_i\times r_i}$, $1\le i\le N+1$, are orthonormal matrices. We now construct an $\varepsilon$-net for $\mathcal{S}_{\mathrm{Tucker}}(r_1,\cdots,r_{N+1})$ by covering the sets of $\mathcal{G}$ and all $U^{(i)}$'s; the proof hinges on the covering number of low-multilinear-rank tensors in Wang et al. (2019). Treating $\mathcal{G}$ as a $\prod_{j=1}^{N+1}r_j$-dimensional vector with $\|\mathcal{G}\|_F = 1$, we can find an $\varepsilon/(N+2)$-net for it, denoted by $\bar{\mathcal{G}}$, with cardinality $|\bar{\mathcal{G}}|\le(3(N+2)/\varepsilon)^{\prod_{j=1}^{N+1}r_j}$. Next, let $O_{n,r} = \{U\in\mathbb{R}^{n\times r} : U^\top U = I_r\}$. To cover $O_{n,r}$, it is convenient to use the $\|\cdot\|_{1,2}$ norm, defined as $\|X\|_{1,2} = \max_i\|X_i\|_2$, where $X_i$ denotes the $i$-th column of $X$. Let $Q_{n,r} = \{X\in\mathbb{R}^{n\times r} : \|X\|_{1,2}\le1\}$. One can easily check that $O_{n,r}\subset Q_{n,r}$, and then an $\varepsilon/(N+2)$-net $\bar O_{n,r}$ for $O_{n,r}$ has cardinality $|\bar O_{n,r}|\le(3(N+2)/\varepsilon)^{nr}$. Denote
$$\bar{\mathcal{S}}_{\mathrm{Tucker}}(r_1,\cdots,r_{N+1}) = \{\mathcal{G}\times_1U^{(1)}\times_2\cdots\times_{N+1}U^{(N+1)} : \mathcal{G}\in\bar{\mathcal{G}},\ U^{(i)}\in\bar O_{l_i,r_i},\ 1\le i\le N+1\}.$$
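The rewriting of $K$ kernels into $R_{N+1}$ "effective" kernels only uses bilinearity of $\otimes$; a small numerical sketch (with the Kronecker product standing in for the paper's $\otimes$, and assumed small sizes):

```python
# Sketch: a stacked kernel with Tucker structure along the kernel mode lets
# K kernels A_k = H x_{N+1} h_k be rewritten as R "effective" kernels H_r
# with mixed FC weights: sum_k B_k (x) A_k = sum_r (sum_k h_{kr} B_k) (x) H_r.

import numpy as np

rng = np.random.default_rng(2)
l1, l2, p1, p2, K, R = 3, 3, 2, 2, 6, 2

H = rng.standard_normal((l1, l2, R))          # effective kernels H_r
Hmat = rng.standard_normal((K, R))            # rows are h_k^{(N+1)}
A = np.einsum('abr,kr->abk', H, Hmat)         # A_k = H x_{N+1} h_k
B = rng.standard_normal((p1, p2, K))          # FC weight matrices B_k

lhs = sum(np.kron(B[:, :, k], A[:, :, k]) for k in range(K))
rhs = sum(np.kron(np.einsum('abk,k->ab', B, Hmat[:, r]), H[:, :, r])
          for r in range(R))
print(np.allclose(lhs, rhs))  # True
```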
By an argument similar to that presented in Lemma A.1 of Wang et al. (2019), we can show that $\bar{\mathcal{S}}_{\mathrm{Tucker}}(r_1,\cdots,r_{N+1})$ is an $\varepsilon$-net for the set $\mathcal{S}_{\mathrm{Tucker}}(r_1,\cdots,r_{N+1})$ with cardinality at most
$$\Big(\frac{3N+6}{\varepsilon}\Big)^{\prod_{j=1}^{N+1}r_j+\sum_{j=1}^{N+1}l_jr_j},$$
where $l_{N+1} = P$. Thus, the $\varepsilon$-covering number of $\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}$ is
$$N^{\mathrm{TU}}(\varepsilon) = |\bar{\mathcal{S}}_{\mathrm{Tucker}}(2R_1,\cdots,2R_{N+1})| \le \Big(\frac{3N+6}{\varepsilon}\Big)^{2^{N+1}\prod_{j=1}^{N+1}R_j+2\sum_{i=1}^Nl_iR_i+2R_{N+1}P}.$$
Let $\varepsilon = 1/2$. It then holds that $\log(3(N+2)/\varepsilon)<3\log(N+2)$ and $\log N^{\mathrm{TU}}(\varepsilon)\le2^{N+1}d^{\mathrm{TU}}_M\log[(3N+6)/\varepsilon]$, where $d^{\mathrm{TU}}_M = \prod_{j=1}^{N+1}R_j+\sum_{i=1}^Nl_iR_i+R_{N+1}P$. By a method similar to the proof of Lemma 2, we can show that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}}\|\Delta\|_n^2\ge\alpha_{\mathrm{RSM}}\Big)\le2\exp\Big\{-c_Hn\Big(\frac{\kappa_L}{4\kappa_U}\Big)^2+3c_Nd^{\mathrm{TU}}_M\Big\} \quad\text{and}\quad \mathbb{P}\Big(\inf_{\Delta\in\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSC}}\Big)\le2\exp\Big\{-c_Hn\Big(\frac{\kappa_L}{4\kappa_U}\Big)^2+3c_Nd^{\mathrm{TU}}_M\Big\},$$
where $c_N = 2^{N+1}\log(N+2)$. Moreover, similar to equation (13), by applying Lemma 3, we can show that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle\ge8\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}},\ \sup_{\Delta\in\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSM}}\Big) \le \mathbb{P}\Big(\max_{1\le j\le N^{\mathrm{TU}}(\varepsilon)}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle\ge4\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}},\ \sup_{\Delta\in\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSM}}\Big) \le \sum_{j=1}^{N^{\mathrm{TU}}(\varepsilon)}\mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\bar\Delta_j\rangle\ge4\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}},\ \|\bar\Delta_j\|_n^2\le\alpha_{\mathrm{RSM}}\Big) \le \exp\{-d^{\mathrm{TU}}_M-8\delta\},$$
where $\varepsilon = (2\sqrt2)^{-1}$ and $2^{N+4}\log(N+2)-2^{N+1}\log(3(N+1)/\varepsilon)>1$. By a method similar to the proof of Theorem 1, we can show that, with probability $1-4\exp\{-[c_H(0.25\kappa_L/\kappa_U)^2n-3c_Nd^{\mathrm{TU}}_M]\}-\exp\{-d^{\mathrm{TU}}_M-8\delta\}$,
$$\sup_{\Delta\in\bar{\mathcal{S}}^{\mathrm{TU}}_{2K}}\frac{1}{n}\sum_{i=1}^n\xi_i\langle\mathcal{Z}_i,\Delta\rangle\le8\sigma\sqrt{\alpha_{\mathrm{RSM}}}\sqrt{\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}} \quad\text{and}\quad \|\Delta\|_n^2\ge\alpha_{\mathrm{RSC}}\|\Delta\|_F^2,$$
which, together with equation (11), leads to
$$\|\Delta\|_F\le\frac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}}\sqrt{\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}}, \quad \|\Delta\|_n\le16\sigma\sqrt{\frac{\alpha_{\mathrm{RSM}}}{\alpha_{\mathrm{RSC}}}}\sqrt{\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}}, \quad \mathrm{err}^2(\widehat{\mathcal{W}}^{\mathrm{TU}})\le\Big(\frac{16\sigma\sqrt{\alpha_{\mathrm{RSM}}}}{\alpha_{\mathrm{RSC}}}\Big)^2\Big(\frac{c_Nd^{\mathrm{TU}}_M}{n}+\frac{\delta}{n}\Big)+\sigma^2.$$
This completes the proof.
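The counting step can be sanity-checked numerically; the sketch below (sizes assumed, helper names ours, and the doubled ranks used directly for simplicity) compares the exact log-cardinality of the constructed net with the simplified bound $c_Nd^{\mathrm{TU}}_M$:

```python
# Sketch of the covering-number count: the eps-net above has log-cardinality
# at most (prod_j r_j + sum_j l_j r_j) * log(3(N+2)/eps), which is dominated
# by c_N * d_M^TU with c_N = 2^{N+1} log(N+2).

import math

def log_net_size(l_dims, ranks, eps):
    """l_dims, ranks: length-(N+1) lists; the last mode has l_{N+1} = P."""
    N = len(l_dims) - 1
    exponent = math.prod(ranks) + sum(l * r for l, r in zip(l_dims, ranks))
    return exponent * math.log(3 * (N + 2) / eps)

l_dims = [3, 3, 16, 64]      # hypothetical (l_1, l_2, l_3, P)
ranks = [2, 2, 4, 4]         # doubled ranks (2R_1, ..., 2R_{N+1})
N = len(l_dims) - 1
c_N = 2 ** (N + 1) * math.log(N + 2)
d_TU = (math.prod(ranks) + sum(l * r for l, r in zip(l_dims[:-1], ranks[:-1]))
        + ranks[-1] * l_dims[-1])
print(log_net_size(l_dims, ranks, 0.5) <= c_N * d_TU)  # True
```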

A.3.3 PROOF OF COROLLARY 1

When the kernel tensor $\mathcal{A}$ has the Tucker decomposition form $\mathcal{A} = \mathcal{G}\times_1H^{(1)}\times_2H^{(2)}\times_3\cdots\times_{N+1}H^{(N+1)}$ with multilinear ranks $(R_1,R_2,\cdots,R_{N+1})$, let $\mathcal{H} = \mathcal{G}\times_1H^{(1)}\times_2H^{(2)}\times_3\cdots\times_NH^{(N)}$, which is a tensor of size $l_1\times\cdots\times l_N\times R_{N+1}$. The mode-$(N+1)$ unfolding of $\mathcal{H}$ is a matrix with $R_{N+1}$ row vectors, each of size $L$, which we denote by $g_1,\cdots,g_{R_{N+1}}$. Fold the $g_r$'s back into tensors, i.e. $g_r = \mathrm{vec}(\mathcal{H}_r)$, where $\mathcal{H}_r\in\mathbb{R}^{l_1\times\cdots\times l_N}$ and $1\le r\le R_{N+1}$. Moreover, let $H^{(N+1)} = (h^{(N+1)}_1,\cdots,h^{(N+1)}_K)^\top$, where $h^{(N+1)}_k$ is a vector of size $R_{N+1}$ whose $r$-th entry we denote by $h^{(N+1)}_{k,r}$, for $1\le r\le R_{N+1}$. It can be verified that, for each $1\le k\le K$,
$$\mathcal{A}_k = \mathcal{G}\times_1H^{(1)}\times_2H^{(2)}\cdots\times_NH^{(N)}\times_{N+1}h^{(N+1)\top}_k = \mathcal{H}\times_{N+1}h^{(N+1)\top}_k,$$
and hence
$$\sum_{k=1}^K\mathcal{B}_k\otimes\mathcal{A}_k = \sum_{r=1}^{R_{N+1}}\sum_{k=1}^K\mathcal{B}_k\otimes\big(h^{(N+1)}_{k,r}\mathcal{H}_r\big) = \sum_{r=1}^{R_{N+1}}\Big(\sum_{k=1}^Kh^{(N+1)}_{k,r}\mathcal{B}_k\Big)\otimes\mathcal{H}_r.$$
By letting $\widetilde{\mathcal{B}}_r = \sum_{k=1}^Kh^{(N+1)}_{k,r}\mathcal{B}_k$ and $\widetilde{\mathcal{A}}_r = \mathcal{H}_r$ for $1\le r\le R_{N+1}$, we can reformulate the model into $y_i = \langle\sum_{r=1}^{R_{N+1}}\widetilde{\mathcal{B}}_r\otimes\widetilde{\mathcal{A}}_r,\ \mathcal{Z}_i\rangle+\xi_i$, and the proof of (a) is then accomplished.

Lemma 2. Suppose that Assumptions 1-3 hold and $n\gtrsim d_M$. Then, with probability at least $1-4\exp\{-[c_H(0.25\kappa_L/\kappa_U)^2n-9d_M]\}$,
$$\alpha_{\mathrm{RSC}}\|\Delta\|_F^2 \le \frac{1}{n}\sum_{i=1}^n\langle\mathcal{Z}_i,\Delta\rangle^2 \le \alpha_{\mathrm{RSM}}\|\Delta\|_F^2 \quad\text{for all }\Delta\in\mathcal{S}_{2K}, \tag{16}$$
where $\mathcal{S}_{2K}$ is defined in Lemma 1, $c_H$ is a positive constant from the Hanson-Wright inequality, $\alpha_{\mathrm{RSC}} = \kappa_L/2$ and $\alpha_{\mathrm{RSM}} = 3\kappa_U/2$.

Proof. It is sufficient to show that $\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSM}}$ and $\inf_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\ge\alpha_{\mathrm{RSC}}$ hold with high probability, where $\|\Delta\|_n^2 := n^{-1}\sum_{i=1}^n\langle\mathcal{Z}_i,\Delta\rangle^2$ is the empirical norm and $\bar{\mathcal{S}}_{2K}$ is defined in Lemma 1. Without confusion, in this proof we will also use the notation $\Delta$ for its vectorized version, $\mathrm{vec}(\Delta)$. Note that
$$\|\Delta\|_n^2 = \frac{1}{n}\sum_{i=1}^n\langle\mathcal{Z}_i,\Delta\rangle^2 = \frac{1}{n}\sum_{i=1}^nz_i^\top\Delta\Delta^\top z_i = \frac{1}{n}\sum_{i=1}^nx_i^\top U_G\Delta\Delta^\top U_G^\top x_i = \frac{1}{n}\tilde x^\top\big[I_n\otimes(U_G\Delta\Delta^\top U_G^\top)\big]\tilde x = w^\top Qw, \tag{17}$$
where $Q = n^{-1}\Sigma^{1/2}[I_n\otimes(U_G\Delta\Delta^\top U_G^\top)]\Sigma^{1/2}$, $w\in\mathbb{R}^{nD}$ is a random vector with i.i.d. standard normal entries, $z_i = U_G^\top x_i$, $x_i = \mathrm{vec}(\mathcal{X}_i)$, $\tilde x = (x_1^\top,x_2^\top,\ldots,x_n^\top)^\top$, and $\Sigma = \mathbb{E}(\tilde x\tilde x^\top)\in\mathbb{R}^{nD\times nD}$. Denote $w_G = U_G\Delta\in\mathbb{R}^D$ and $S^{n-1} = \{v\in\mathbb{R}^n : \|v\|_2 = 1\}$. Let $\lambda_{\max}(\Sigma_{ii})$ be the maximum eigenvalue of $\Sigma_{ii} = \mathbb{E}(\mathrm{vec}(\mathcal{X}_i)\mathrm{vec}(\mathcal{X}_i)^\top)$ for $1\le i\le n$; it holds that $\lambda_{\max}(\Sigma_{ii})\le C_x$ for all $1\le i\le n$. For the matrix $Q$, we have
$$\mathrm{tr}(Q) = \frac{1}{n}\mathrm{tr}\big(\Sigma^{1/2}[I_n\otimes w_G][I_n\otimes w_G]^\top\Sigma^{1/2}\big) = \frac{1}{n}\mathrm{tr}\big([I_n\otimes w_G]^\top\Sigma[I_n\otimes w_G]\big) = \frac{1}{n}\sum_{i=1}^nw_G^\top\Sigma_{ii}w_G \le \frac{1}{n}\sum_{i=1}^n\lambda_{\max}(\Sigma_{ii})C_u \le \kappa_U, \tag{18}$$
and similarly we can show that $\mathrm{tr}(Q)\ge\kappa_L$, where $\kappa_L = c_xc_u$, $\kappa_U = C_xC_u$ and $\kappa_L\le\kappa_U$. Thus,
$$\kappa_L \le \mathbb{E}\|\Delta\|_n^2 = \mathrm{tr}(Q) \le \kappa_U. \tag{19}$$
Moreover, denote $Q_1 = n^{-1/2}\Sigma^{1/2}[I_n\otimes w_G]$ and note that $Q = Q_1Q_1^\top$. To bound the operator norm of $Q$, we have
$$\|Q\|_{\mathrm{op}} \le \|Q_1\|_{\mathrm{op}}^2 = \sup_{u\in S^{n-1}}u^\top Q_1^\top Q_1u = \frac{1}{n}\sup_{u\in S^{n-1}}u^\top[I_n\otimes w_G]^\top\Sigma[I_n\otimes w_G]u \le \frac{C_x}{n}\|w_G\|_2^2 \le \frac{C_x}{n}\|U_G\|_{\mathrm{op}}^2\|\Delta\|_2^2 \le \frac{\kappa_U}{n}. \tag{20}$$
Finally, we can use (18) and (20) to bound the Frobenius norm of $Q$. By some algebra, $\|AB\|_F^2\le\|A\|_{\mathrm{op}}^2\|B\|_F^2$ holds for any square matrices $A,B\in\mathbb{R}^{n\times n}$. Hence, $\|Q\|_F^2 = \|Q_1Q_1^\top\|_F^2 \le \|Q_1\|_{\mathrm{op}}^2\|Q_1\|_F^2 = \|Q_1\|_{\mathrm{op}}^2\,\mathrm{tr}(Q) \le \kappa_U^2/n$. This, together with (17), (20) and the Hanson-Wright inequality (Vershynin, 2018, Chapter 6), leads to
$$\mathbb{P}\big(\big|\|\Delta\|_n^2-\mathbb{E}\|\Delta\|_n^2\big|\ge t\big) \le 2\exp\Big\{-c_H\min\Big(\frac{t}{\|Q\|_{\mathrm{op}}},\frac{t^2}{\|Q\|_F^2}\Big)\Big\} \le 2\exp\big\{-c_Hn\min\big(t/\kappa_U,(t/\kappa_U)^2\big)\big\}, \tag{21}$$
where $c_H$ is a positive constant. On the other hand, $\|\Delta\|_n^2-\mathbb{E}\|\Delta\|_n^2 = \frac{1}{n}\sum_{i=1}^n\big[\Delta^\top z_iz_i^\top\Delta-\mathbb{E}(\Delta^\top z_iz_i^\top\Delta)\big] = \Delta^\top\Gamma\Delta$, where $\Gamma = n^{-1}\sum_{i=1}^n[z_iz_i^\top-\mathbb{E}(z_iz_i^\top)]$ is a symmetric matrix. Consider an $\varepsilon$-net $\bar{\mathcal{S}}^\sharp_{2K}$, with cardinality $N(2K,\varepsilon)$, for the set $\bar{\mathcal{S}}_{2K}$. For any $\Delta\in\bar{\mathcal{S}}_{2K}$, there exists a $\bar\Delta_j\in\bar{\mathcal{S}}^\sharp_{2K}$ such that $\|\Delta-\bar\Delta_j\|_F\le\varepsilon$. Note that $\Delta-\bar\Delta_j\in\mathcal{S}_{4K}$ and, from Lemma 1, we can further find $\Delta_1,\Delta_2\in\mathcal{S}_{2K}$ such that $\langle\Delta_1,\Delta_2\rangle = 0$ and $\Delta-\bar\Delta_j = \Delta_1+\Delta_2$; it then holds that $\|\Delta_1\|_F+\|\Delta_2\|_F\le\sqrt2\|\Delta-\bar\Delta_j\|_F\le\sqrt2\varepsilon$.
Moreover, for a general real symmetric matrix $A\in\mathbb{R}^{d\times d}$ and $u,v\in\mathbb{R}^d$, $\sup_{\|u\|_2=\|v\|_2=1}u^\top Av = \sup_{\|u\|_2=1}|u^\top Au|$. As a result,
$$\Delta^\top\Gamma\Delta = \bar\Delta_j^\top\Gamma\bar\Delta_j + (\Delta_1+\Delta_2)^\top\Gamma(\Delta_1+\Delta_2+2\bar\Delta_j) \le \max_{1\le j\le N(2K,\varepsilon)}\bar\Delta_j^\top\Gamma\bar\Delta_j + 5\varepsilon\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\Delta^\top\Gamma\Delta|,$$
where $(\|\Delta_1\|_F+\|\Delta_2\|_F)^2+2(\|\Delta_1\|_F+\|\Delta_2\|_F)\|\bar\Delta_j\|_F\le2(\varepsilon+2)\varepsilon\le5\varepsilon$ as $\varepsilon\le1$, and this leads to
$$\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\Delta^\top\Gamma\Delta \le (1-5\varepsilon)^{-1}\max_{1\le j\le N(2K,\varepsilon)}\bar\Delta_j^\top\Gamma\bar\Delta_j. \tag{22}$$
Note that, from Lemma 1(b), $\log N(2K,\varepsilon)\le2d_M\log(9/\varepsilon)$, where $d_M = K(L+P+1)$. Let $\varepsilon = 1/10$, and then $2\log(9/\varepsilon)<9$. Combining (21) and (22) and letting $t = \kappa_L/2$, we have
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\big(\|\Delta\|_n^2-\mathbb{E}\|\Delta\|_n^2\big)\ge\frac{\kappa_L}{2}\Big) \le 2\exp\Big\{-c_Hn\Big(\frac{\kappa_L}{4\kappa_U}\Big)^2+9d_M\Big\},$$
which, together with (19) and the fact that $\kappa_L\le\kappa_U$, implies that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\ge\alpha_{\mathrm{RSM}}\Big) \le 2\exp\Big\{-c_Hn\Big(\frac{\kappa_L}{4\kappa_U}\Big)^2+9d_M\Big\}, \tag{23}$$
where $\alpha_{\mathrm{RSM}} = 1.5\kappa_U\ge\kappa_U+\kappa_L/2$. By a method similar to (22), we can also show that
$$\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\big(\mathbb{E}\|\Delta\|_n^2-\|\Delta\|_n^2\big) \le (1-5\varepsilon)^{-1}\max_{1\le j\le N(2K,\varepsilon)}\big(\mathbb{E}\|\bar\Delta_j\|_n^2-\|\bar\Delta_j\|_n^2\big),$$
which, together with (21), leads to
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\big(\mathbb{E}\|\Delta\|_n^2-\|\Delta\|_n^2\big)\ge\frac{\kappa_L}{2}\Big) \le 2\exp\Big\{-c_Hn\Big(\frac{\kappa_L}{4\kappa_U}\Big)^2+9d_M\Big\},$$
where $t = \kappa_L/2$ and $\varepsilon = 1/10$. As a result,
$$\mathbb{P}\Big(\inf_{\Delta\in\bar{\mathcal{S}}_{2K}}\|\Delta\|_n^2\le\alpha_{\mathrm{RSC}}\Big) \le 2\exp\Big\{-c_Hn\Big(\frac{\kappa_L}{4\kappa_U}\Big)^2+9d_M\Big\},$$
where $\alpha_{\mathrm{RSC}} = \kappa_L-\kappa_L/2$. This, together with (23), completes the proof.

Lemma 3 (Concentration bound for martingales). Let $\{\xi_i, 1\le i\le n\}$ be independent $\sigma^2$-sub-Gaussian random variables with mean zero, and let $\{z_i, 1\le i\le n\}$ be another sequence of random variables. Suppose that $\xi_i$ is independent of $\{z_i,z_{i-1},\ldots,z_1\}$ for all $1\le i\le n$. It then holds that, for any real numbers $\alpha,\beta>0$,
$$\mathbb{P}\Big(\Big\{\frac{1}{n}\sum_{i=1}^n\xi_iz_i\ge\alpha\Big\}\cap\Big\{\frac{1}{n}\sum_{i=1}^nz_i^2\le\beta\Big\}\Big) \le \exp\Big(-\frac{n\alpha^2}{2\sigma^2\beta}\Big).$$
Proof. The lemma can be proved by a method similar to Lemma 4.2 in Simchowitz et al. (2018).

A.4 CLASSIFICATION PROBLEMS

Starting from Section 2.2, we know that, for each input tensor $\mathcal{X}$, the intermediate scalar output after convolution and pooling has the form
$$\mathrm{output} = \Big\langle\mathcal{X},\ \sum_{k=1}^K(\mathcal{B}_k\otimes\mathcal{A}_k)\times_1U^{(1)}_F\times_2U^{(2)}_F\times_3\cdots\times_NU^{(N)}_F\Big\rangle = \langle\mathcal{Z},\mathcal{W}\rangle,$$
where $\mathcal{W} = \sum_{k=1}^K\mathcal{B}_k\otimes\mathcal{A}_k$ and $\mathcal{Z} = \mathcal{X}\times_1U^{(1)}_F\times_2U^{(2)}_F\times_3\cdots\times_NU^{(N)}_F$, $\mathcal{A}_k\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_N}$ is the $k$-th kernel and $\mathcal{B}_k\in\mathbb{R}^{p_1\times p_2\times\cdots\times p_N}$ is the corresponding $k$-th fully-connected weight tensor.

Consider a binary classification problem. We have the binary label $y\in\{0,1\}$ with
$$p(y|\mathcal{Z}) = \Big(\frac{1}{1+\exp(\langle\mathcal{Z},\mathcal{W}\rangle)}\Big)^{1-y}\Big(\frac{\exp(\langle\mathcal{Z},\mathcal{W}\rangle)}{1+\exp(\langle\mathcal{Z},\mathcal{W}\rangle)}\Big)^y = \exp\{y\langle\mathcal{Z},\mathcal{W}\rangle-\log[1+\exp(\langle\mathcal{Z},\mathcal{W}\rangle)]\}.$$
Given samples $\{\mathcal{Z}_i,y_i\}_{i=1}^n$, we use the negative log-likelihood as our loss function. Up to a scaling of $n^{-1}$, it is given by
$$L_n(\mathcal{W}) = -\frac{1}{n}\sum_{i=1}^ny_i\langle\mathcal{Z}_i,\mathcal{W}\rangle+\frac{1}{n}\sum_{i=1}^n\phi(\langle\mathcal{Z}_i,\mathcal{W}\rangle), \tag{24}$$
where $\phi(z) = \log(1+e^z)$, and its gradient and Hessian matrix are
$$\nabla L_n(\mathcal{W}) = \frac{\partial L_n(\mathcal{W})}{\partial\,\mathrm{vec}(\mathcal{W})} = -\frac{1}{n}\sum_{i=1}^ny_i\,\mathrm{vec}(\mathcal{Z}_i)+\frac{1}{n}\sum_{i=1}^n\phi'(\langle\mathcal{Z}_i,\mathcal{W}\rangle)\,\mathrm{vec}(\mathcal{Z}_i) \tag{25}$$
and
$$H_n(\mathcal{W}) = \frac{\partial^2L_n(\mathcal{W})}{\partial\,\mathrm{vec}(\mathcal{W})\,\partial\,\mathrm{vec}(\mathcal{W})^\top} = \frac{1}{n}\sum_{i=1}^n\phi''(\langle\mathcal{Z}_i,\mathcal{W}\rangle)\,\mathrm{vec}(\mathcal{Z}_i)\,\mathrm{vec}(\mathcal{Z}_i)^\top,$$
with $\phi'(z) = 1/(1+e^{-z})\in(0,1)$ and $\phi''(z) = e^z/(1+e^z)^2 = 1/(e^{-z}+2+e^z)\in(0,0.25)$ [because $e^{-z}+e^z\ge2$]. Since $H_n(\mathcal{W})$ is a positive semi-definite matrix, the loss function in (24) is convex. Suppose $\widehat{\mathcal{W}}$ is a minimizer of the loss function, $\widehat{\mathcal{W}}\in\arg\min_{\mathcal{W}\in\mathcal{S}_K\cap B(R)}L_n(\mathcal{W})$, where $\mathcal{S}_K = \{\sum_{k=1}^K\mathcal{B}_k\otimes\mathcal{A}_k : \mathcal{A}_k\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_N}\text{ and }\mathcal{B}_k\in\mathbb{R}^{p_1\times p_2\times\cdots\times p_N}\}$, and $B(R)$ is a Frobenius ball of some fixed radius $R$ centered at the underlying true parameter. Similar to Fan et al. (2019), we need to make two additional assumptions to guarantee the locally strong convexity condition.

Assumption 4 (Classification). Apart from Assumptions 2-3, we additionally assume that (C1) $\{\mathrm{vec}(\mathcal{X}_i)\}_{i=1}^n$ are i.i.d. Gaussian vectors with mean zero and covariance $\Sigma$, where $\Sigma\le C_xI$;
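A minimal sketch of the loss, gradient, and Hessian above with vectorized weights (not the paper's code; shapes assumed), which also checks the positive semi-definiteness that makes $L_n$ convex:

```python
# Sketch of the logistic loss in (24): L_n(W), its gradient (25), and the
# Hessian, with phi(z) = log(1 + e^z), phi'(z) = 1/(1 + e^{-z}), and
# phi''(z) = 1/(e^{-z} + 2 + e^z) in (0, 0.25).

import numpy as np

def loss_grad_hess(Z, y, w):
    """Z: n x D design with rows vec(Z_i); y in {0,1}^n; w = vec(W)."""
    n = len(y)
    eta = Z @ w
    p = 1.0 / (1.0 + np.exp(-eta))           # phi'(eta)
    s = p * (1.0 - p)                         # phi''(eta), bounded by 0.25
    L = (-y @ eta + np.log1p(np.exp(eta)).sum()) / n
    g = Z.T @ (p - y) / n
    H = (Z * s[:, None]).T @ Z / n            # positive semi-definite
    return L, g, H

rng = np.random.default_rng(3)
n, D = 200, 5
Z = rng.standard_normal((n, D))
w_true = rng.standard_normal(D)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-Z @ w_true))).astype(float)

L, g, H = loss_grad_hess(Z, y, np.zeros(D))
print(np.all(np.linalg.eigvalsh(H) >= -1e-10))  # True: H_n is PSD
```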
(C2) the Hessian matrix at the underlying true parameter $\mathcal{W}^*$ is positive definite, and there exists some $\kappa_0>\kappa_U>0$ such that $H(\mathcal{W}^*) = \mathbb{E}(H_n(\mathcal{W}^*))\ge\kappa_0I$, where $\kappa_U = C_xC_u$; (C3) $\|\mathcal{W}^*\|_F\ge\alpha\sqrt{d_M}$ for some constant $\alpha$, where $d_M = K(P+L+1)$. Notice that we can relax (C1) into Assumption 1, but it will require more technical details; also, $\{\mathrm{vec}(\mathcal{X}_i)\}_{i=1}^n$ can be sub-Gaussian rather than Gaussian random vectors. Denote $d_M = K(P+L+1)$.

Theorem 3 (Classification: CNN). Suppose that Assumptions 2 & 3 and Assumption 4 hold and $n\gtrsim d_M$. Then, for some $\delta>0$,
$$\|\widehat{\mathcal{W}}-\mathcal{W}^*\|_F \le \frac{2\sqrt{\kappa_U}}{\kappa_1}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}},$$
with probability $1-4\exp\{-0.25cn+9d_M\}-2\exp\{-c_\gamma d_M-c\delta\}$, where $\delta = O_p(1)$, and $c$ and $c_\gamma$ are some positive constants.

Denote $d^{\mathrm{TU}}_M = \prod_{j=1}^{N+1}R_j+\sum_{i=1}^Nl_iR_i+R_{N+1}P$ and $d^{\mathrm{CP}}_M = R^{N+1}+R\big(\sum_{i=1}^Nl_i+P\big)$.

Corollary 2 (Classification: Compressed CNN). Let $(\widehat{\mathcal{W}},d_M)$ be $(\widehat{\mathcal{W}}^{\mathrm{TU}},d^{\mathrm{TU}}_M)$ for Tucker decomposition, or $(\widehat{\mathcal{W}}^{\mathrm{CP}},d^{\mathrm{CP}}_M)$ for CP decomposition. Suppose that the assumptions in Theorem 3 hold and $n\gtrsim c_Nd_M$. Then, for some $\delta>0$,
$$\|\widehat{\mathcal{W}}-\mathcal{W}^*\|_F \le \frac{2\sqrt{\kappa_U}}{\kappa_1}\sqrt{\frac{3c_Nd_M}{n}+\frac{\delta}{n}},$$
with probability $1-4\exp\{-0.25cn+3c_Nd_M\}-2\exp\{-c_\gamma d_M-c\delta\}$, where $\delta = O_p(1)$, $c$ and $c_\gamma$ are some positive constants, and $c_N$ is defined as in Theorem 2.

We considered a binary classification problem as a simple illustration; in fact, the analysis framework can easily be extended to a multiclass classification problem. Here, we consider an $M$-class classification problem. Because we need the intermediate output after convolution and pooling to be a vector of length $M$ instead of a scalar, we need to introduce one additional dimension to the fully-connected weight tensor. Hence, we introduce another subscript $m$ to represent the class label, and the set of fully-connected weights is represented as $\{\mathcal{B}_{k,m}\}_{1\le k\le K,1\le m\le M}$, where each $\mathcal{B}_{k,m}$ is of size $p_1\times p_2\times\cdots\times p_N$.
Then, for each input tensor $\mathcal{X}$, the intermediate output is a vector of length $M$, whose $m$-th entry is $\mathrm{output}_m = \langle\mathcal{Z},\mathcal{W}_m\rangle$, where $\mathcal{W}_m = \sum_{k=1}^K(\mathcal{B}_{k,m}\otimes\mathcal{A}_k)$ is an $N$-th order tensor of size $l_1p_1\times l_2p_2\times\cdots\times l_Np_N$. For an $M$-class classification problem, we have the vector label output $y\in\{0,1\}^M$. Essentially, each entry of $y$ comes from a different binary classification problem, with $M$ problems in total. We can model it as
$$p(y_m|\mathcal{Z}) = \Big(\frac{1}{1+\exp(\langle\mathcal{Z},\mathcal{W}_m\rangle)}\Big)^{1-y_m}\Big(\frac{\exp(\langle\mathcal{Z},\mathcal{W}_m\rangle)}{1+\exp(\langle\mathcal{Z},\mathcal{W}_m\rangle)}\Big)^{y_m} = \exp\{y_m\langle\mathcal{Z},\mathcal{W}_m\rangle-\log[1+\exp(\langle\mathcal{Z},\mathcal{W}_m\rangle)]\}.$$
We stack $\{\mathcal{W}_m\}_{m=1}^M$ into a tensor $\mathcal{W}_{\mathrm{stack}}$, which is an $(N+1)$-th order tensor of size $l_1p_1\times l_2p_2\times\cdots\times l_Np_N\times M$.
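A small numerical sketch of this stacking (sizes assumed), including the identity $\langle\mathcal{Z},\mathcal{W}_m\rangle = \langle\mathcal{Z}\circ e_m,\mathcal{W}_{\mathrm{stack}}\rangle$ that is used below to recast the $M$-class problem as $nM$ binary-type samples:

```python
# Sketch: stacking {W_m} along a new class mode and checking the identity
# <Z, W_m> = <Z o e_m, W_stack>, where o denotes the outer product.

import numpy as np

rng = np.random.default_rng(4)
d1, d2, M = 4, 5, 3
Z = rng.standard_normal((d1, d2))
W_stack = rng.standard_normal((d1, d2, M))    # W_m = W_stack[:, :, m]

def e(m):                                     # natural basis vector e_m
    v = np.zeros(M); v[m] = 1.0; return v

ok = all(np.isclose(np.sum(Z * W_stack[:, :, m]),
                    np.sum(np.einsum('ij,k->ijk', Z, e(m)) * W_stack))
         for m in range(M))
print(ok)  # True
```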

We further introduce the natural basis vectors $\{e_m\in\mathbb{R}^M\}_{m=1}^M$. It can be shown that $\langle\mathcal{Z}_i,\mathcal{W}_m\rangle = \langle\mathcal{Z}_i\circ e_m,\mathcal{W}_{\mathrm{stack}}\rangle =: \langle\mathcal{Z}_i^m,\mathcal{W}_{\mathrm{stack}}\rangle$, where $\circ$ is the outer product. We can then recast this model into one with $nM$ samples $\{\mathcal{Z}_i^m,y_i^m : 1\le i\le n, 1\le m\le M\}$. The corresponding loss function is
$$L_n(\mathcal{W}_{\mathrm{stack}}) = -\frac{1}{nM}\sum_{m=1}^M\sum_{i=1}^ny_i^m\langle\mathcal{Z}_i^m,\mathcal{W}_{\mathrm{stack}}\rangle+\frac{1}{nM}\sum_{m=1}^M\sum_{i=1}^n\phi(\langle\mathcal{Z}_i^m,\mathcal{W}_{\mathrm{stack}}\rangle).$$
We now prove several lemmas to be used in Theorem 3. For simplicity of notation, denote $x_i = \mathrm{vec}(\mathcal{X}_i)$ and $z_i = \mathrm{vec}(\mathcal{Z}_i)$; it holds that $z_i = U_G^\top x_i$ for $1\le i\le n$.

Lemma 4 (Deviation bound). Under Assumption 4(C3), suppose that $n\gtrsim d_M$. Then, for some $\delta>0$,
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle|\ge0.5\sqrt{\kappa_U}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}}\Big) \le 2\exp\{-c_\gamma d_M\},$$
where $\delta = O_p(1)$, $d_M = K(P+L+1)$, $\kappa_U = C_xC_u$ and $c_\gamma$ is some positive constant.

Proof. Let $\eta_i = \langle\mathcal{Z}_i,\mathcal{W}^*\rangle$; from (25), $\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle = \frac{1}{n}\sum_{i=1}^n[\phi'(\eta_i)-y_i]\langle z_i,\Delta\rangle$. We can observe (i) the independence between $\phi'(\eta_i)-y_i$ and $\langle z_i,\Delta\rangle$: from Lemma 6, this independence leads to $\mathbb{E}\{[\phi'(\eta_i)-y_i]\langle z_i,\Delta\rangle\} = \mathbb{E}\{\mathbb{E}[\phi'(\eta_i)-y_i|z]\langle z_i,\Delta\rangle\} = 0$, which implies $\mathbb{E}\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle = 0$; and (ii) $\|[\phi'(\eta_i)-y_i]\langle z_i,\Delta\rangle\|_{\psi_1}\le\|\phi'(\eta_i)-y_i\|_{\psi_2}\|\langle z_i,\Delta\rangle\|_{\psi_2}$. Denote $\kappa_U = C_xC_u$. For any fixed $\Delta$ such that $\|\Delta\|_2 = 1$, $\|\langle z_i,\Delta\rangle\|_{\psi_2} = \|\langle w_i,\Sigma^{1/2}U_G\Delta\rangle\|_{\psi_2}\le\sqrt{\kappa_U}$. This, together with $\|\phi'(\eta_i)-y_i\|_{\psi_2}\le0.25$ from Lemma 7, gives $\|[\phi'(\eta_i)-y_i]\langle z_i,\Delta\rangle\|_{\psi_1}\le0.25\sqrt{\kappa_U}$. Then, we can use a Bernstein-type inequality, namely Corollary 5.17 in Vershynin (2010), to derive that, for any fixed $\Delta$ with unit $\ell_2$-norm,
$$\mathbb{P}\{|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle|\ge t\} = \mathbb{P}\Big\{\Big|\frac{1}{n}\sum_{i=1}^n[\phi'(\eta_i)-y_i]\langle z_i,\Delta\rangle\Big|\ge t\Big\} \le 2\exp\Big\{-cn\min\Big(\frac{4t}{\sqrt{\kappa_U}},\frac{16t^2}{\kappa_U}\Big)\Big\}. \tag{28}$$
Consider an $\varepsilon$-net $\bar{\mathcal{S}}^\sharp_{2K}$, with cardinality $N(2K,\varepsilon)$, for the set $\bar{\mathcal{S}}_{2K}$. For any $\Delta\in\bar{\mathcal{S}}_{2K}$, there exists a $\bar\Delta_j\in\bar{\mathcal{S}}^\sharp_{2K}$ such that $\|\Delta-\bar\Delta_j\|_F\le\varepsilon$. Note that $\Delta-\bar\Delta_j\in\mathcal{S}_{4K}$ and, from Lemma 1(a), we can further find $\Delta_1,\Delta_2\in\mathcal{S}_{2K}$ such that $\langle\Delta_1,\Delta_2\rangle = 0$ and $\Delta-\bar\Delta_j = \Delta_1+\Delta_2$.
It then holds that $\|\Delta_1\|_F+\|\Delta_2\|_F\le\sqrt2\|\Delta-\bar\Delta_j\|_F\le\sqrt2\varepsilon$ since $\|\Delta-\bar\Delta_j\|_F^2 = \|\Delta_1\|_F^2+\|\Delta_2\|_F^2$. As a result,
$$|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle| \le |\langle\nabla L_n(\mathcal{W}^*),\bar\Delta_j\rangle|+|\langle\nabla L_n(\mathcal{W}^*),\Delta_1\rangle|+|\langle\nabla L_n(\mathcal{W}^*),\Delta_2\rangle| \le \max_{1\le j\le N(2K,\varepsilon)}|\langle\nabla L_n(\mathcal{W}^*),\bar\Delta_j\rangle|+\sqrt2\varepsilon\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle|,$$
which leads to
$$\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle| \le (1-\sqrt2\varepsilon)^{-1}\max_{1\le j\le N(2K,\varepsilon)}|\langle\nabla L_n(\mathcal{W}^*),\bar\Delta_j\rangle|.$$
Note that, from Lemma 1(b), $\log N(2K,\varepsilon)\le2d_M\log(9/\varepsilon)$, where $d_M = K(P+L+1)$. Let $\varepsilon = (2\sqrt2)^{-1}$, and then $2\log(9/\varepsilon)<7$. With (28), we can show that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle|\ge2t\Big) \le 2\exp\Big\{-cn\min\Big(\frac{4t}{\sqrt{\kappa_U}},\frac{16t^2}{\kappa_U}\Big)+7d_M\Big\}.$$
Take $t = 0.25\sqrt{\kappa_U}\sqrt{d_M/n+\delta/n}$, where $\delta = O_p(1)$, and there exists some $\gamma$ such that $d_M/n\le\gamma$ holds. We can finally show that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\nabla L_n(\mathcal{W}^*),\Delta\rangle|\ge0.5\sqrt{\kappa_U}\sqrt{\frac{d_M}{n}+\frac{\delta}{n}}\Big) \le 2\exp\{-c_\gamma d_M-c\delta\},$$
where $c_\gamma$ is some positive constant related to $\gamma$.

Lemma 5 (LRSC). Suppose that $n\gtrsim d_M$. Under Assumption 4, there exists some constant $R>0$ such that, for any $\mathcal{W}\in\mathcal{S}_{2K}$ satisfying $\|\mathcal{W}-\mathcal{W}^*\|_F\le R$,
$$\inf_{\Delta\in\bar{\mathcal{S}}_{2K}}\Delta^\top H_n(\mathcal{W})\Delta \ge \frac{\tilde\kappa_1}{2}$$
holds with probability $1-4\exp\{-0.25cn+9d_M\}$, where $\tilde\kappa_1 = \kappa_1-\kappa_U$, and $\kappa_1$ is defined in Lemma 8.

Proof. We divide this proof into two parts.

1. RSC of $L_n(\mathcal{W})$ at $\mathcal{W} = \mathcal{W}^*$. We first show that, for all $\Delta\in\bar{\mathcal{S}}_{2K}$, the following holds with probability at least $1-2\exp\{-0.25c(\kappa_L/\kappa_U)^2n+9d_M\}$: $\Delta^\top H_n(\mathcal{W}^*)\Delta\ge\kappa$, where $\kappa = \kappa_0-\kappa_U>0$. Let $\eta_i = \langle\mathcal{Z}_i,\mathcal{W}^*\rangle$ and denote $\tilde z_i = \sqrt{\phi''(\eta_i)}\,z_i$, so that $H_n(\mathcal{W}^*) = \frac{1}{n}\sum_{i=1}^n\tilde z_i\tilde z_i^\top$ and $H(\mathcal{W}^*) = \mathbb{E}H_n(\mathcal{W}^*)$. Denote $x_i = \mathrm{vec}(\mathcal{X}_i)$. Here, for simplicity, we assume $\{x_i\}_{i=1}^n$ to be independent Gaussian vectors with mean zero and covariance matrix $\Sigma$, where $c_xI\le\Sigma\le C_xI$ for some $0<c_x<C_x$. We will also use the notation $\Delta$ for its vectorized version, $\mathrm{vec}(\Delta)$, and we consider $\Delta$ with unit $\ell_2$-norm.
Since $\|\langle\Delta,\tilde z_i\rangle\|_{\psi_2} = \|\sqrt{\phi''(\eta_i)}\langle\Delta,U_G^\top\Sigma^{1/2}w_i\rangle\|_{\psi_2}\le0.5\sqrt{\kappa_U}$, where $w_i$ is a standard Gaussian vector and $\kappa_U = C_xC_u$, we can show that
$$\big\|\langle\Delta,\tilde z_i\rangle^2-\mathbb{E}\langle\Delta,\tilde z\rangle^2\big\|_{\psi_1} \le 2\big\|\langle\Delta,\tilde z_i\rangle^2\big\|_{\psi_1} \le 4\big\|\langle\Delta,\tilde z_i\rangle\big\|_{\psi_2}^2 \le \kappa_U,$$
where the first inequality comes from Remark 5.18 and the second from Lemma 5.14 in Vershynin (2010). Hence, by the Bernstein-type inequality in Corollary 5.17 of Vershynin (2010), for any fixed $\Delta$ such that $\|\Delta\|_2 = 1$, we have
$$\mathbb{P}\{|\Delta^\top(H_n(\mathcal{W}^*)-H(\mathcal{W}^*))\Delta|\ge t\} = \mathbb{P}\Big\{\Big|\frac{1}{n}\sum_{i=1}^n\big[\langle\Delta,\tilde z_i\rangle^2-\mathbb{E}\langle\Delta,\tilde z\rangle^2\big]\Big|\ge t\Big\} \le 2\exp\Big\{-cn\min\Big(\frac{t}{\kappa_U},\frac{t^2}{\kappa_U^2}\Big)\Big\}.$$
With a covering-number argument similar to that presented in Lemma 2 of our paper, we can show that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\Delta^\top(H_n(\mathcal{W}^*)-H(\mathcal{W}^*))\Delta|\ge2t\Big) \le 2\exp\Big\{-cn\min\Big(\frac{t}{\kappa_U},\frac{t^2}{\kappa_U^2}\Big)+9d_M\Big\},$$
where $d_M = K(P+L+1)$. Let $t = 0.5\kappa_U$. By Assumption 4(C1), we can obtain that, when $n\gtrsim d_M$, the claim stated at the beginning of this part follows, where $\tau$ is some positive constant to be selected according to Lemma 8. Since the difference between $h_n(\cdot)$ and $H_n(\cdot)$ is the indicator function, it holds that $H_n(\cdot)\ge h_n(\cdot)$. We will finish the proof of LRSC in two steps. First, we show that, with high probability, $h_n(\mathcal{W}^*)$ is positive definite on the restricted set $\bar{\mathcal{S}}_{2K}$. Second, we bound the difference between $\Delta^\top h_n(\mathcal{W})\Delta$ and $\Delta^\top h_n(\mathcal{W}^*)\Delta$, and hence show that $h_n(\mathcal{W})$ is locally positive definite around $\mathcal{W}^*$. This naturally leads to the LRSC of $L_n(\mathcal{W})$ around $\mathcal{W}^*$. From Lemma 8, we can select $\tau$ such that $h(\mathcal{W}^*)\ge\kappa_1I$. Following arguments similar to those in the first part, we can show that for all $\Delta\in\bar{\mathcal{S}}_{2K}$, the following holds with probability at least $1-2\exp\{-0.25cn+9d_M\}$:
$$\Delta^\top h_n(\mathcal{W}^*)\Delta \ge \tilde\kappa_1, \tag{29}$$
where $\tilde\kappa_1 = \kappa_1-\kappa_U>0$.
Meanwhile, for any $\mathcal{W}\in\mathcal{S}_K$ such that $\|\mathcal{W}-\mathcal{W}^*\|_F\le R$, where $R$ will be specified later to satisfy some conditions,
$$|h_n(\mathcal{W})-h_n(\mathcal{W}^*)| = \Big|\frac{1}{n}\sum_{i=1}^n\phi''(\langle\mathcal{Z}_i,\mathcal{W}\rangle)I_Az_iz_i^\top-\frac{1}{n}\sum_{i=1}^n\phi''(\langle\mathcal{Z}_i,\mathcal{W}^*\rangle)I_Az_iz_i^\top\Big| \le \frac{1}{n}\sum_{i=1}^n\big|\phi''(\langle\mathcal{Z}_i,\mathcal{W}\rangle)-\phi''(\langle\mathcal{Z}_i,\mathcal{W}^*\rangle)\big|I_Az_iz_i^\top = \frac{1}{n}\sum_{i=1}^n\big|\phi'''(\langle\mathcal{Z}_i,\bar{\mathcal{W}}\rangle)\langle\mathcal{Z}_i,\mathcal{W}-\mathcal{W}^*\rangle\big|I_Az_iz_i^\top,$$
where $\bar{\mathcal{W}}$ lies between $\mathcal{W}$ and $\mathcal{W}^*$, and $\phi'''(z) = e^z(1-e^z)/(1+e^z)^3$. Given that the event $A$ holds, and choosing $R<\tau$, we can lower bound the term
$$|\langle\mathcal{Z}_i,\bar{\mathcal{W}}\rangle| \ge |\langle\mathcal{Z}_i,\mathcal{W}^*\rangle|-\sup_{\Delta\in\mathcal{S}_{2K},\|\Delta\|_F\le R}|\langle\mathcal{Z}_i,\Delta\rangle| \ge (\tau-R)\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\mathcal{Z}_i,\Delta\rangle|.$$
Notice that, for all $z\in\mathbb{R}$, the third-order derivative of $\phi(z)$ is upper bounded as $|\phi'''(z)|\le1/|z|$. This relationship helps us further bound the term
$$\big|\phi'''(\langle\mathcal{Z}_i,\bar{\mathcal{W}}\rangle)\langle\mathcal{Z}_i,\mathcal{W}-\mathcal{W}^*\rangle\big| \le \frac{|\langle\mathcal{Z}_i,\mathcal{W}-\mathcal{W}^*\rangle|}{|\langle\mathcal{Z}_i,\bar{\mathcal{W}}\rangle|} \le \frac{\sup_{\Delta\in\mathcal{S}_{2K},\|\Delta\|_F\le R}|\langle\mathcal{Z}_i,\Delta\rangle|}{(\tau-R)\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\mathcal{Z}_i,\Delta\rangle|} \le \frac{R\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\mathcal{Z}_i,\Delta\rangle|}{(\tau-R)\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\langle\mathcal{Z}_i,\Delta\rangle|} = \frac{R}{\tau-R}.$$
Hence, we can show that
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\Delta^\top[h_n(\mathcal{W})-h_n(\mathcal{W}^*)]\Delta|\ge t\Big) \le \mathbb{P}\Big(\frac{R}{\tau-R}\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\frac{1}{n}\sum_{i=1}^n\Delta^\top z_iz_i^\top\Delta\ge t\Big).$$
By setting $t = \alpha_{\mathrm{RSM}}R/(\tau-R)$, where $\alpha_{\mathrm{RSM}} = 3\kappa_U/2$, we can use equation (16) in Lemma 2 to obtain, as long as $n\gtrsim d_M$,
$$\mathbb{P}\Big(\sup_{\Delta\in\bar{\mathcal{S}}_{2K}}|\Delta^\top[h_n(\mathcal{W})-h_n(\mathcal{W}^*)]\Delta|\ge\frac{\alpha_{\mathrm{RSM}}R}{\tau-R}\Big) \le 2\exp\{-c_Hn+9d_M\}.$$
By rearranging terms, this is equivalent to
$$\mathbb{P}\Big(\underbrace{\inf_{\Delta\in\bar{\mathcal{S}}_{2K}}\Delta^\top h_n(\mathcal{W})\Delta \le \sup_{\Delta\in\bar{\mathcal{S}}_{2K}}\Delta^\top h_n(\mathcal{W}^*)\Delta-\frac{\alpha_{\mathrm{RSM}}R}{\tau-R}}_{\text{denoted as the event }B_1}\Big) \le 2\exp\{-c_Hn+9d_M\}.$$
If we define the event $B_2 = \{\inf_{\Delta\in\bar{\mathcal{S}}_{2K}}\Delta^\top h_n(\mathcal{W}^*)\Delta\le\tilde\kappa_1\}$ and denote its complement by $B_2^c$, then from (29) we know that $\mathbb{P}(B_2)\le2\exp\{-0.25cn+9d_M\}$. It can be seen that
$$\mathbb{P}\Big(\Big\{\inf_{\Delta\in\bar{\mathcal{S}}_{2K}}\Delta^\top h_n(\mathcal{W})\Delta\le\tilde\kappa_1-\frac{\alpha_{\mathrm{RSM}}R}{\tau-R}\Big\}\cap B_2^c\Big) \le \mathbb{P}(B_1\cap B_2^c) \le \mathbb{P}(B_1).$$

A.5.1 EFFICIENT NUMBER OF KERNELS

Next, we conduct extra experiments on the efficient number of kernels for a CP block design. This study uses $32\times32$ inputs with 16 channels, and we set the stride to 1 and the pooling sizes to $(5,5)$.
We generate the orthonormal factor matrices $\{H^{(j)}, 1\le j\le4\}$, where $H^{(1)}$ is of size $8\times R$, $H^{(2)}$ is of size $8\times R$, $H^{(3)}$ is of size $16\times R$, and $H^{(4)}$ is of size $K\times R$, with $R = 8$ and $K\in\{8,16,24,32\}$. Denoting the orthonormal column vectors of $H^{(j)}$ by $h^{(j)}_r$, where $1\le r\le R$ and $1\le j\le4$, the stacked kernel tensor $\mathcal{A}$ can then be generated as $\mathcal{A} = \sum_{r=1}^Rh^{(1)}_r\circ h^{(2)}_r\circ h^{(3)}_r\circ h^{(4)}_r$, and it can be seen that $\mathcal{A}$ has a CP rank of $R = 8$. We split the stacked kernel tensor $\mathcal{A}$ along the kernel dimension to obtain $\{\mathcal{A}_k, 1\le k\le K\}$ and generate the corresponding fully-connected weight tensors $\{\mathcal{B}_k, 1\le k\le K\}$ with standard normal entries. The parameter tensor $\mathcal{W}$ is hence obtained, and we further normalize it to have unit Frobenius norm to ensure the comparability of estimation errors across different values of $K$. The block structure in Figure 7(1) is employed to train the network; it is equivalent to the bottleneck structure with a CP decomposition on $\mathcal{A}$, see Kossaifi et al. (2020b). We can see that, as $K$ increases, there is more redundancy in the network parameters. Fifty training sets are generated for each training size, and we stop training when the target function drops by less than $10^{-5}$. From Figure 7(2), the estimation error increases as $K$ gets larger, and the difference is more pronounced when the training size is small.
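The data-generating scheme above can be sketched as follows (one of the stated settings, $R = 8$ and $K = 16$; the QR-based construction is one way to obtain the orthonormal factors):

```python
# Sketch of the experiment's kernel generation: orthonormal factor matrices
# from QR decompositions give a stacked kernel A with CP rank at most R.

import numpy as np

rng = np.random.default_rng(5)
R, K = 8, 16
dims = (8, 8, 16, K)

# orthonormal factor matrices H^(1), ..., H^(4)
H = [np.linalg.qr(rng.standard_normal((d, R)))[0] for d in dims]

# A = sum_r h^(1)_r o h^(2)_r o h^(3)_r o h^(4)_r  (outer products)
A = np.einsum('ar,br,cr,dr->abcd', *H)

# CP rank <= R: every unfolding of A has matrix rank at most R
unfold = A.transpose(3, 0, 1, 2).reshape(K, -1)
print(A.shape, np.linalg.matrix_rank(unfold) <= R)
```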

A.5.2 ABLATION STUDIES ON MORE BLOCK DESIGNS

In this section, we present the results of the studies of $K/R$ ratios on two popular network structures, namely ResNeXt (Xie et al., 2017) and ShuffleNet v2 (Ma et al., 2018). To start with, we show the bottleneck block structure and its respective $K/R$ ratio in ResNeXt and ShuffleNet. In Figure 8, we present the bottleneck block structure of ResNeXt, consisting of 3 layers: a $1\times1$ convolution layer and a $3\times3$ group convolution layer, followed by another $1\times1$ convolution layer. During the group convolution phase, there are $g$ parallel paths, each of which has a bottleneck size of $r$, which
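To make the $K/R$ trade-off concrete, here is a small sketch (the helper and the sizes are hypothetical; it counts only the three convolution layers' weights, ignoring biases and batch normalization) of how the parameter count of such a ResNeXt-style block grows with the expansion ratio $t = K/R$:

```python
# Sketch: parameter counts of a ResNeXt-style bottleneck
# (1x1 conv -> 3x3 group conv with g groups -> 1x1 conv)
# for different expansion ratios t = K/R at fixed bottleneck width R = r*g.

def resnext_block_params(K, R, g, k=3):
    """K: block in/out channels, R: bottleneck width, g: number of groups."""
    reduce_ = K * R                              # 1x1 conv, K -> R channels
    group = (R // g) * (R // g) * k * k * g      # 3x3 group convolution
    expand = R * K                               # 1x1 conv, R -> K channels
    return reduce_ + group + expand

R, g = 64, 32
for t in (1, 2, 4):                              # the ratios studied here
    K = t * R
    print(t, resnext_block_params(K, R, g))
```

The two $1\times1$ layers dominate, so the count grows roughly linearly in $t$, which is why the $K/R$ ratio is the relevant knob.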



Figure 1: (1) 3-layer CNN for a 3D input $\mathcal{X}$ with one kernel tensor $\mathcal{A}$, average pooling, and fully-connected weights $\mathcal{B}$. (2) The $U^{(j)}_{i_j}$'s act as positioning factors. The white spaces indicate zero entries.

with $\mathcal{A}_{\mathrm{stack}}$ being constrained to have the form of (8). Denote the estimators by $\{\widehat{\mathcal{B}}^{\mathrm{TU}}_k,\widehat{\mathcal{A}}^{\mathrm{TU}}_k\}_{1\le k\le K}$. The trained weights then have the form $\widehat{\mathcal{W}}^{\mathrm{TU}} = \sum_{k=1}^K\widehat{\mathcal{B}}^{\mathrm{TU}}_k\otimes\widehat{\mathcal{A}}^{\mathrm{TU}}_k$, and the model complexity is $d^{\mathrm{TU}}_M = \prod_{j=1}^{N+1}R_j+\sum_{i=1}^Nl_iR_i+R_{N+1}P$. CP decomposition is more popular in compressing CNNs; see Kossaifi et al. (2020b); Lebedev et al. (2015); Astrid & Lee (2017). Let $\{\widehat{\mathcal{B}}^{\mathrm{CP}}_k,\widehat{\mathcal{A}}^{\mathrm{CP}}_k\}_{1\le k\le K}$ be the least-squares estimators at (6) with $\mathcal{A}_{\mathrm{stack}}$ having a CP decomposition with a rank of $R$. Then, the trained weights are $\widehat{\mathcal{W}}^{\mathrm{CP}} = \sum_{k=1}^K\widehat{\mathcal{B}}^{\mathrm{CP}}_k\otimes\widehat{\mathcal{A}}^{\mathrm{CP}}_k$, and the model complexity is $d^{\mathrm{CP}}_M = R^{N+1}+R\big(\sum_{i=1}^Nl_i+P\big)$.

Figure 2: We use the thickness of each layer to indicate its relative number of channels. The last layer indicates the output of the block. The dashed layer represents the "bottleneck" of the block.

Figure 3: (1)-(2) are the experimental results for Theorem 1 with independent and dependent inputs, respectively. (3) is the experimental result for the Tucker CNN in Theorem 2.

Figure 4: (1) 41-layer residual network for CIFAR-10 & SVHN. (2) 30-layer 3D residual network for UCF101. Here, t = K/R represents the expansion ratio and takes values 1, 4, 8, and 16.

Figure 6: Splitting matrix T (∆) based on its singular value decomposition.

…∆⊤H_n(W*)∆ ≤ κ holds with probability at most 2 exp{−0.25cn + 9d_M}, where κ = κ_0 − κ_U > 0.

2. LRSC of L_n(W) around W*. Define the event A = {|⟨W*, Z_i⟩| > τ sup_{∆∈S_2K} |⟨∆, z_i⟩|} and construct the functions h_n(W) = (1/n) Σ_{i=1}^n φ''(⟨Z_i, W⟩) I_A z_i z_i⊤ and h(W) = E h_n(W).

Figure 7: Additional experiments for efficient number of kernels.

Figure 8: Equivalent ResNeXt building blocks. Both (1) and (2) represent a block of ResNeXt with cardinality g, where R = r · g. Here, the expansion ratio t = K/R takes values 1, 2, and 4.

The kernel tensor is of size R^{l_1×⋯×l_N}, with a core tensor of size R^{R_1×⋯×R_N} and factor matrices. The input is of size R^{m_1×m_2×⋯×m_N}, and we then use average pooling with pooling sizes (q_1, …, q_N). The fully-connected weight tensor B is of size R^{p_1×p_2×⋯×p_N}. We similarly define matrices U^(j).

Different settings for verifying Theorem 1 (left) and Theorem 2 (right).

Test accuracy (%) on CIFAR-10, SVHN, and UCF101. For UCF101, we only count the number of parameters and FLOPs for the added blocks.

A APPENDIX

We have that the predicted output can be written in terms of u_{i_j} = vec(U^(1)(i_j, :, :)) ∈ R^{l_j p_j} and U^(j)_DF = (U^(j) ×_3 U^(j))_(1), for 1 ≤ i_j ≤ p_j and 1 ≤ j ≤ 3.

A.3 THEORETICAL RESULTS AND TECHNICAL PROOFS

Given a test sample (X, y), where y = ⟨X, W*⟩ + ξ and (X, ξ) satisfies Assumptions 1 and 2, the mean-squared training and test errors are defined with respect to the true weights W* and the empirical norm ‖·‖_n. The model complexity of the CNN at equation (1) or (2) is d_M = K(P + L + 1). Theorem 1 (Complete Version) supposes that Assumptions 1-3 hold and that n is sufficiently large relative to d_M; it can then be seen how the estimation error scales.

Suppose that the kernel tensor A has a CP decomposition of the form A = Σ_{r=1}^R h_r^(1) ∘ ⋯ ∘ h_r^(N+1), for all 1 ≤ k ≤ K. Hence, by letting B_r = … for all 1 ≤ r ≤ R, we can reparameterize the model.

Proof. For each K, we first define a map T_K whose image consists of matrices of rank at most K. For any ∆ ∈ S_2K, the rank of the matrix T_K(∆) is at most 2K. As shown in Figure 6, we can split the singular value decomposition (SVD) of T_K(∆) into two parts, and the desired decomposition can then be verified. Thus, we accomplish the proof of (a).

Denote by S_matrix ⊂ R^{P×L} the set of matrices with unit Frobenius norm and rank at most K. Note that T(S_K) ⊂ S_matrix, while the ε-covering number for S_matrix is known; see Candes & Plan (2011). This accomplishes the proof of (b).

We can now use the techniques in Theorem 3 to show the following corollaries for the multiclass classification problem. Denote d_M^MC = K(MP + L + 1).

Corollary 3 (Multiclass Classification: CNN). Under similar assumptions as in Theorem 3, suppose that n ≳ d_M^MC. Then the stated bound holds for some δ > 0, where δ = O_p(1), and c and c_γ are some positive constants.

Corollary (Multiclass Classification: CP decomposition). Suppose that the assumptions in Theorem 3 hold and n ≳ c_N d_M^multi. Then the corresponding bound holds for some δ > 0, where c and c_γ are some positive constants and c_N is defined as in Theorem 2.

Proof of Theorem 3. Denote the relevant sets, where W* is the underlying true parameter and W, W* ∈ S_K, and define the first-order Taylor error. Suppose Ŵ is the minimizer of the loss function, and denote ∆ = vec(Ŵ − W*). Then, for some W̄ between Ŵ and W*, and from Lemma 4 and Lemma 5, when n ≳ d_M, we obtain the stated inequality for some δ > 0. So, if we choose R to be sufficiently small, such that α_RSM R/(τ − R) ≤ κ_1/2, the curvature bound holds. This, together with H_n(·) ≥ h_n(·), leads us to conclude that, when n ≳ d_M, there exists some R > 0 such that the bound holds with the stated probability. We have accomplished the proof of the LRSC of L_n(W) around W*.

Lemma 6. For two sub-Gaussian random variables X and Y, when …

Proof.
Firstly, we observe the bound on the conditional moment generating function. It then holds that E exp{λ(φ'(η_i) − y_i)} = E[E(exp{λ(φ'(η_i) − y_i)} | z_i)] ≤ exp{0.125λ^2}, and this implies that ‖φ'(η_i) − y_i‖_ψ2 ≤ 0.25.

Lemma 8. Under Assumption 4, there exists a universal constant τ > 0 such that h(W*) ≥ κ_1 I, where κ_1 is a positive constant.

Proof. We first show that, for any p_0 ∈ (0, 1), there exists a constant τ for which the claimed probability bound holds. We separately show that the first probability is at least (p_0 + 1)/2 and bound the supremum probability, for some positive constants c_1 and c_2; combining the two yields the claim with τ = c_1/c_2.

Since x_i is a Gaussian vector with mean zero and covariance Σ, z_i = U_G x_i is a zero-mean Gaussian vector with covariance U_G Σ U_G⊤, and ⟨W*, Z_i⟩ = ⟨vec(W*), z_i⟩ also follows a normal distribution with mean zero and variance (also its sub-Gaussian norm) bounded above. We can take c_1 to be sufficiently small, where x is a Gaussian variable with variance upper bounded by 1.

Then, we can also observe that, for any fixed ∆ ∈ S_2K, ⟨∆, z_i⟩ is a Gaussian variable with zero mean and variance upper bounded by κ_U. We can use the concentration inequality for Gaussian random variables, for all t ∈ R, and further use the union bound. We can then choose c_2 large enough, and the probability at (30) is hence shown.

Now we examine the matrix h(W*) and show that it is positive definite. As in Lemma 5, we denote the event, where (i) follows from Assumption 4(C1). Since ⟨∆, z_i⟩ is a Gaussian variable with mean zero and variance bounded by κ_U, its fourth moment is bounded by 3κ_U^2; also, since φ''(z) ∈ (0, 0.25) for all z ∈ R, (ii) can be shown. Here, we can take p_0 to be small enough, with κ_1 = 0.5κ_0. We hence accomplish the proof of the lemma.

A.5 ADDITIONAL EXPERIMENTS

A.5.1 MORE EXPERIMENTS FOR THEORETICAL RESULTS

We first provide some additional implementation details for Section 4.1 in the paper. We consider two types of inputs for the verification of the sample complexity in Theorem 1. For the time-dependent inputs, we generate a sequence of vectors {x_i} in R^d from a stationary VAR(1) process, where d = 245 or 192, and the VAR(1) process has a transition matrix with spectral norm less than 1.

The bottleneck size r is also the rank of the output channel in each individual path; see also Figure 1 in Xie et al. (2017). Since the outputs from all paths are aggregated via summation, the rank of the output channel of the entire bottleneck block is equal to R = r · g. Subsequently, we can define the expansion ratio of this block to be t = K/R.

In Figure 9, we present the bottleneck block structure of ShuffleNet. It consists of three layers: a 1 × 1 convolution layer and a 3 × 3 depthwise convolution layer, followed by another 1 × 1 convolution layer. The only difference between this block structure and that in ResNet (He et al., 2016) is that the full 3 × 3 convolution is replaced by the depthwise separable convolution. It follows that the expansion ratio is again t = K/R.

We conducted the experiments on the SVHN dataset and took t = 1, 2, 4, since t = 2 and, in particular, t = 4 are commonly used in practice. We followed the design of ResNeXt-50 (Xie et al., 2017), only removing the three convolution blocks in "conv5" to avoid overfitting, and we set R = 24, 58, 116, 232 as the bottleneck ranks for "conv1"-"conv4", respectively, with K = t · R taken according to the different values of t. The implementations are similar to before. Following the practice in He et al. (2016), we adopt batch normalization (BN) (Ioffe & Szegedy, 2015) right after each convolution and before the ReLU activation. For ResNeXt, the cardinality of the group convolution is set to 32, and we initialize the weights as in He et al. (2015). We use stochastic gradient descent with weight decay 10^{-4}, momentum 0.9, and mini-batch size 128.
The learning rate starts from 0.1 and is divided by 10 every 100 epochs. We stop training after 350 epochs, since the training accuracy hardly changes by then; the training accuracy is approximately 96%. We set seeds 1-5 and report the worst-case test accuracy in Table 3. We can see that, when t = 1, the test accuracy is comparable to that when t = 2 or 4, while the number of parameters is substantially reduced. In fact, the ratio of the number of parameters at a single bottleneck block for t = 1, 2, and 4 is roughly 1 : 2 : 4 for both networks. This implies that the number of parameters will increase dramatically for a deeper CNN, for instance, ResNeXt-101. Hence, it is always recommended to keep t = 1 to achieve greater parameter efficiency without sacrificing test accuracy.
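The rough 1 : 2 : 4 ratio can be reproduced with a back-of-the-envelope count. The sketch below assumes a ShuffleNet-style depthwise-separable bottleneck (1 × 1 conv, 3 × 3 depthwise conv, 1 × 1 conv) and ignores biases and BN parameters; the ResNeXt count behaves analogously.

```python
# Rough per-block parameter count for a depthwise-separable bottleneck:
# 1x1 conv K->R, 3x3 depthwise on R channels, 1x1 conv R->K, with
# expansion ratio t = K / R. Biases and BN parameters are ignored.
def block_params(R, t):
    K = t * R
    return K * R + 9 * R + R * K  # two 1x1 convs + depthwise 3x3

R = 116                           # one of the bottleneck ranks used above
counts = {t: block_params(R, t) for t in (1, 2, 4)}
ratios = {t: counts[t] / counts[1] for t in counts}
print(counts)
print(ratios)                     # roughly 1 : 2 : 4, as stated in the text
```

The two 1 × 1 convolutions dominate the count, and each scales linearly in K = tR, which is exactly why doubling t roughly doubles the block's parameters.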

