SEKRON: A DECOMPOSITION METHOD SUPPORTING MANY FACTORIZATION STRUCTURES

Anonymous

Abstract

While convolutional neural networks (CNNs) have become the de facto standard for most image processing and computer vision applications, their deployment on edge devices remains challenging. Tensor decomposition methods provide a means of compressing CNNs to meet the wide range of device constraints by imposing certain factorization structures on their convolution tensors. However, being limited to the small set of factorization structures presented by state-of-the-art decomposition approaches can lead to sub-optimal performance. We propose SeKron, a novel tensor decomposition method that offers a wide variety of factorization structures, using sequences of Kronecker products. The flexibility of SeKron leads to many compression rates and also allows it to cover commonly used factorizations such as Tensor-Train (TT), Tensor-Ring (TR), Canonical Polyadic (CP) and Tucker. Crucially, we derive an efficient convolution projection algorithm shared by all SeKron structures, leading to seamless compression of CNN models. We validate our approach for model compression on both high-level and low-level computer vision tasks and find that it outperforms state-of-the-art decomposition methods.

1. INTRODUCTION

Deep learning models have introduced new state-of-the-art solutions to both high-level computer vision problems (He et al. 2016; Ren et al. 2015) and low-level image processing tasks (Wang et al. 2018b; Schuler et al. 2015; Kokkinos & Lefkimmiatis 2018) through convolutional neural networks (CNNs). These gains come at the expense of the millions of training parameters that accompany deep CNNs, making them computationally intensive. As a result, many of these models are of limited use, as they are challenging to deploy on resource-constrained edge devices. Compared with neural networks for high-level computer vision tasks (e.g., ResNet-50 (He et al. 2016)), models for low-level imaging problems such as single image super-resolution have a much higher computational complexity due to their larger feature map sizes. Moreover, they are typically infeasible to run on cloud computing servers, so their deployment on edge devices is even more critical. In recent years, there has been an increasing trend toward reducing the size of state-of-the-art CNN backbones through efficient architecture designs such as Xception (Chollet 2017), MobileNet (Howard et al. 2019), ShuffleNet (Zhang et al. 2018c), and EfficientNet (Tan & Le 2019), to name a few. On the other hand, studies have demonstrated significant redundancy in the parameters of large CNN models, implying that, in theory, the number of model parameters can be reduced while maintaining performance (Denil et al. 2013). These studies provide the basis for the development of many model compression techniques such as pruning (He et al. 2020), quantization (Hubara et al. 2017), knowledge distillation (Hinton et al. 2015), and tensor decomposition (Phan et al. 2020). Tensor decomposition methods such as Tucker (Kim et al. 2016), Canonical Polyadic (CP) (Lebedev et al. 2015), Tensor-Train (TT) (Novikov et al. 2015) and Tensor-Ring (TR) (Wang et al. 2018a) rely on finding low-rank approximations of tensors under some imposed factorization structure, as illustrated in Figure 1a. In practice, some structures are more suitable than others when decomposing tensors. Choosing from a limited set of factorization structures can lead to sub-optimal compression as well as lengthy runtimes, depending on the hardware. This limitation can be alleviated by reshaping tensors prior to compression, as shown in (Garipov et al. 2016). However, this approach requires time-consuming development of customized convolution algorithms. We propose SeKron, a novel tensor decomposition method offering a wide range of factorization structures that all share the same efficient convolution algorithm. Our method is inspired by approaches based on the Kronecker Product Decomposition (Thakker et al. 2019; Hameed et al. 2022). Unlike other decomposition methods, Kronecker Product Decomposition generalizes the product of smaller factors from vectors and matrices to a range of tensor shapes, thereby exploiting local redundancy between arbitrary slices of multi-dimensional weight tensors. SeKron represents tensors using sequences of Kronecker products to compress convolution tensors in CNNs. Using sequences of Kronecker products leads to a wide range of factorization structures, including commonly used ones such as TT, TR, CP and Tucker. Sequences of Kronecker products also have the potential to exploit local redundancies using far fewer parameters, as illustrated in the example in Figure 1b. By performing the convolution operation using each of the Kronecker factors independently, the number of parameters, computational intensity, and runtime are all reduced simultaneously.
Leveraging the flexibility of SeKron, we find efficient factorization structures that outperform existing decomposition methods on various image classification (high-level) and single image super-resolution (low-level) tasks. In summary, our contributions are:

• Introducing SeKron, a novel tensor decomposition method based on sequences of Kronecker products that allows for a wide range of factorization structures.
• Providing a solution to the problem of finding the summation of sequences of Kronecker products between factor tensors that well approximates the original tensor.
• Deriving a single convolution algorithm shared by all factorization structures achievable by SeKron, utilized as compressed convolutional layers in CNNs.
• Improving the state-of-the-art of low-rank model compression on image classification (high-level vision) benchmarks such as ImageNet and CIFAR-10, as well as super-resolution (low-level vision) benchmarks such as Set5, Set14, B100 and Urban100.

2. RELATED WORK ON DNN MODEL COMPRESSION

Sparsification. Different components of DNNs, such as weights (Han et al. 2015b;a), convolutional filters (He et al. 2018; Luo et al. 2017) and feature maps (He et al. 2017; Zhuang et al. 2018), can be sparsified. The sparsity can be enforced using sparsity-aware regularization (Liu et al. 2015; Zhou et al. 2016) or pruning techniques (Luo et al. 2017; Han et al. 2015b). Many pruning methods (Luo et al. 2017; Zhang et al. 2018b) aim for a high compression ratio and accuracy regardless of the structure of the sparsity; thus, they often suffer from imbalanced workloads caused by irregular memory access. Hence, several works aim at zeroing out structured groups of DNN components through more hardware-friendly approaches (Wen et al. 2016).

Quantization. The computation and memory complexity of DNNs can be reduced by quantizing model parameters into lower bit-widths, with the majority of research works using fixed-bit quantization. For instance, the methods proposed in (Gysel et al. 2018; Louizos et al. 2018) use fixed 4- or 8-bit quantization. Model parameters have been quantized even further into ternary (Li et al. 2016; Zhu et al. 2016) and binary (Courbariaux et al. 2015; Rastegari et al. 2016; Courbariaux et al. 2016) representations. These methods often achieve low performance even with unquantized activations (Li et al. 2016). Mixed-precision approaches, however, achieve more competitive performance, as shown in (Uhlich et al. 2019), where the bit-width for each layer is determined in an adaptive manner. Also, choosing a uniform (Jacob et al. 2018) or nonuniform (Han et al. 2015a; Tang et al. 2017; Zhang et al. 2018a) quantization interval has important effects on the compression rate and the acceleration.

Tensor Decomposition. Tensor decomposition approaches factorize weight tensors into smaller tensors to reduce model sizes (Yin et al. 2021).
Singular value decomposition (SVD), applied to matrices as a 2-dimensional instance of tensor decomposition, was one of the pioneering approaches to model compression (Jaderberg et al. 2014). Other classical high-dimensional tensor decomposition methods, such as Tucker (Tucker 1963) and CP decomposition (Harshman et al. 1970), have also been adopted for model compression. However, using these methods often leads to significant accuracy drops (Kim et al. 2015; Lebedev et al. 2015; Phan et al. 2020). The idea of reshaping the weights of fully-connected layers into high-dimensional tensors and representing them in TT format (Oseledets 2011) was extended to CNNs in (Garipov et al. 2016). For multidimensional tensors, TR decomposition (Wang et al. 2018a) has become a more popular option than TT (Wang et al. 2017). Subsequent filter basis decomposition works refined these approaches using a shared filter basis and have been applied to low-level computer vision tasks such as single image super-resolution (Li et al. 2019). Kronecker factorization is another approach for replacing the weight tensors in fully-connected and convolution layers (Zhou et al. 2015). Its rank-1 Kronecker product representation limitation is alleviated in (Hameed et al. 2022), where the compression rate is determined by both the rank and the factor dimensions. For a fixed rank, the maximum compression is achieved by selecting dimensions for each factor that are closest to the square root of the original tensor's dimensions. This leads to representations with more parameters than those achieved using sequences of Kronecker products, as shown in Figure 1b. There has also been extensive research on tensor decomposition through characterizing the global correlation of tensors (Zheng et al. 2021), extending CP to non-Gaussian data (Hong et al. 2020), employing augmented decomposition loss functions (Afshar et al. 2021), etc., for different applications.
Our main focus in this paper is on the methods used for NN compression.

Other Methods. NNs can also be compressed using Knowledge Distillation (KD), where a large pretrained network, known as the teacher, is used to train a smaller student network (Mirzadeh et al. 2020; Heo et al. 2019). Sharing weights in a more structured manner is another model compression approach, as in FSNet (Yang et al. 2020), which shares filter weights across spatial locations, or ShaResNet (Boulch 2018), which reuses convolutional mappings within the same scale level. Designing lightweight CNNs (Sandler et al. 2018; Iandola et al. 2016; Chollet 2017; Howard et al. 2019; Zhang et al. 2018c; Tan & Le 2019) is another direction, orthogonal to the aforementioned approaches.

3. METHOD

In this section, we introduce SeKron and show how it can be used to compress tensors in deep learning models. We start by providing background on Kronecker Product Decomposition in Section 3.1. Then, we introduce our decomposition method in Section 3.2. In Section 3.3, we provide an algorithm for computing the convolution operation using each of the factors directly (avoiding reconstruction) at runtime. Finally, we discuss the computational complexity of the proposed method in Section 3.4.

3.1. PRELIMINARIES

Convolutional layers prevalent in CNNs transform an input tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$ using a weight tensor $\mathcal{W} \in \mathbb{R}^{F \times C \times K_h \times K_w}$ via the multi-linear map

$$Y_{f,x,y} = \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \sum_{c=1}^{C} W_{f,c,i,j}\, X_{c,i+x,j+y}, \quad (1)$$

where $C$ and $F$ denote the number of input and output channels, respectively, and $K_h \times K_w$ denotes the spatial size of the weight (filter). Tensor decomposition seeks an approximation to replace $\mathcal{W}$, typically by finding lower-rank tensors using SVD. One such approximation comes from the fact that any tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be written as a sum of Kronecker products (Hameed et al. 2022), i.e., $\mathcal{W} = \sum_{r=1}^{R} \mathcal{A}_r \otimes \mathcal{B}_r$, where $\mathcal{A}_r \in \mathbb{R}^{a_1 \times \cdots \times a_N}$, $\mathcal{B}_r \in \mathbb{R}^{b_1 \times \cdots \times b_N}$ and $a_j b_j = w_j$ for $j \in \{1, \ldots, N\}$. Thus, a lower-rank approximation can be obtained by solving

$$\min_{\{\mathcal{A}_r\},\{\mathcal{B}_r\}} \Big\| \mathcal{W} - \sum_{r=1}^{\widetilde{R}} \mathcal{A}_r \otimes \mathcal{B}_r \Big\|_F^2, \quad (2)$$

for $\widetilde{R} \leq R$ sums of Kronecker products, using the SVD of a particular reshaping (unfolding) of $\mathcal{W}$, where $\|\cdot\|_F$ denotes the Frobenius norm.
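The reduction of equation 2 to an SVD problem can be made concrete in the simplest (2-D, two-factor) setting. Below is a minimal numpy sketch of the Van Loan–Pitsianis rearrangement; the helper name `kron_sum_approx` and the shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def kron_sum_approx(W, a_shape, b_shape, R):
    """Approximate a matrix W as sum_{r=1}^R A_r kron B_r by taking the
    SVD of a rearrangement of W (Van Loan-Pitsianis). Requires
    a_shape[i] * b_shape[i] == W.shape[i]."""
    (a1, a2), (b1, b2) = a_shape, b_shape
    assert W.shape == (a1 * b1, a2 * b2)
    # Rearrange so every (b1, b2) block of W becomes one row; the
    # Kronecker rank of W is then the ordinary matrix rank of `blocks`.
    blocks = (W.reshape(a1, b1, a2, b2)
               .transpose(0, 2, 1, 3)
               .reshape(a1 * a2, b1 * b2))
    U, s, Vt = np.linalg.svd(blocks, full_matrices=False)
    A = [np.sqrt(s[r]) * U[:, r].reshape(a1, a2) for r in range(R)]
    B = [np.sqrt(s[r]) * Vt[r].reshape(b1, b2) for r in range(R)]
    return A, B

# With R equal to the full Kronecker rank, reconstruction is exact:
W = np.random.randn(6, 6)
A, B = kron_sum_approx(W, (2, 2), (3, 3), R=4)
W_hat = sum(np.kron(a, b) for a, b in zip(A, B))
```

Truncating R below the full Kronecker rank gives the best rank-R approximation of `blocks` in Frobenius norm, which is exactly the minimization in equation 2 for the matrix case.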

3.2. SEKRON TENSOR DECOMPOSITION

The Kronecker decomposition in equation 2 can be extended to finding an approximating sequence of Kronecker factors $\mathcal{A}^{(k)} \in \mathbb{R}^{R_1 \times \cdots \times R_k \times a^{(k)}_1 \times \cdots \times a^{(k)}_N}$ as follows:

$$\min_{\{\mathcal{A}^{(k)}\}_{k=1}^{S}} \Big\| \mathcal{W} - \sum_{r_1=1}^{R_1} \mathcal{A}^{(1)}_{r_1} \otimes \sum_{r_2=1}^{R_2} \mathcal{A}^{(2)}_{r_1 r_2} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}} \Big\|_F^2. \quad (3)$$

Although this is a non-convex objective, a quasi-optimal solution based on recursive application of SVD is given in Theorem 1. Note that alternative expansion directions to equation 3 are viable (see Appendix B).

Theorem 1 (Tensor Decomposition using a Sequence of Kronecker Products). Any tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be represented by a sequence of Kronecker products between $S \in \mathbb{N}$ factors:

$$\mathcal{W} = \sum_{r_1=1}^{R_1} \mathcal{A}^{(1)}_{r_1} \otimes \sum_{r_2=1}^{R_2} \mathcal{A}^{(2)}_{r_1 r_2} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}}, \quad (4)$$

where $R_i \in \mathbb{N}$ and $\mathcal{A}^{(k)} \in \mathbb{R}^{R_1 \times \cdots \times R_k \times a^{(k)}_1 \times \cdots \times a^{(k)}_N}$.

Proof. See Appendix C.

Our approach to solving equation 3 involves finding two approximating Kronecker factors that minimize the reconstruction error with respect to the original tensor, then recursively applying this procedure on the latter factor found. More precisely, we define intermediate tensors

$$\mathcal{B}^{(k)}_{r_1 \cdots r_k} \triangleq \sum_{r_{k+1}=1}^{R_{k+1}} \mathcal{A}^{(k+1)}_{r_1 \cdots r_{k+1}} \otimes \sum_{r_{k+2}=1}^{R_{k+2}} \mathcal{A}^{(k+2)}_{r_1 \cdots r_{k+2}} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}}, \quad (5)$$

allowing us to re-write the reconstruction error in equation 3, for the $k$-th iteration, as

$$\min_{\{\mathcal{A}^{(k)}_{r_1 \cdots r_k},\, \mathcal{B}^{(k)}_{r_1 \cdots r_k}\}_{r_j = 1, \ldots, R_j,\ j = 1, \ldots, k}} \Big\| \mathcal{W}^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{R_k} \mathcal{A}^{(k)}_{r_1 \cdots r_k} \otimes \mathcal{B}^{(k)}_{r_1 \cdots r_k} \Big\|_F^2. \quad (6)$$

Algorithm 1: SeKron Tensor Decomposition
  Input: tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$; Kronecker factor shapes $\{d^{(i)}\}_{i=1}^{S}$
  Output: Kronecker factors $\{\mathcal{A}^{(i)}\}_{i=1}^{S}$
  for $i \leftarrow 1, 2, \ldots, S-1$ do
    $d^{(a)} \leftarrow d^{(i)}$;  $d^{(b)} \leftarrow \prod_{k=i+1}^{S} d^{(k)}$
    $\mathcal{W} \leftarrow \mathrm{UNFOLD}(\mathcal{W}, \text{shape} = d^{(b)})$   // $\in \mathbb{R}^{B \times L \times \prod_k d^{(b)}_k}$
    $U, s, V \leftarrow \mathrm{BATCHSVD}(\mathcal{W})$   // $U \in \mathbb{R}^{B \times L \times R}$, where $R = \min(L, \prod_k d^{(b)}_k)$
    $\mathcal{A}^{(i)} \leftarrow \mathrm{STACK}((\mathrm{RESHAPE}(U_{b,:,r}, \text{shape} = d^{(a)}) \mid b = 1, \ldots, B,\ r = 1, \ldots, R))$
    $\mathcal{B}^{(i)} \leftarrow \mathrm{STACK}((\mathrm{RESHAPE}(s_r V^{\top}_{b,:,r}, \text{shape} = d^{(b)}) \mid b = 1, \ldots, B,\ r = 1, \ldots, R))$
    $\mathcal{W} \leftarrow \mathcal{B}^{(i)}$
  end
  $\mathcal{A}^{(S)} \leftarrow \mathcal{B}^{(S-1)}$
  return $\{\mathcal{A}^{(i)}\}_{i=1}^{S}$

In the first iteration, the tensor being decomposed is the original tensor (i.e., $\mathcal{W}^{(1)} \leftarrow \mathcal{W}$), whereas in subsequent iterations, intermediate tensors are decomposed. At each iteration, we can convert the problem in equation 6 to the low-rank matrix approximation problem

$$\min_{\{a^{(k)}_{r_1 \cdots r_k},\, b^{(k)}_{r_1 \cdots r_k}\}_{r_j = 1, \ldots, R_j,\ j = 1, \ldots, k}} \Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{R_k} a^{(k)}_{r_1 \cdots r_k} b^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2, \quad (7)$$

through reshaping, such that the overall sum of squares is preserved between equation 6 and equation 7. The problem in equation 7 can be readily solved, as it has a well-known solution using SVD. The reshaping operations that facilitate this transformation are

$$W^{(k)}_{r_1 \cdots r_{k-1}} = \mathrm{MAT}(\mathrm{UNFOLD}(\mathcal{W}^{(k)}_{r_1 \cdots r_{k-1}}, d_{\mathcal{B}^{(k)}_{r_1 \cdots r_k}})), \qquad a^{(k)}_{r_1 \cdots r_k} = \mathrm{VEC}(\mathrm{UNFOLD}(\mathcal{A}^{(k)}_{r_1 \cdots r_k}, d_{I_{\mathcal{A}^{(k)}}})), \quad (8)$$

$$b^{(k)}_{r_1 \cdots r_k} = \mathrm{VEC}(\mathcal{B}^{(k)}_{r_1 \cdots r_k}), \quad (9)$$

where UNFOLD reshapes tensor $\mathcal{W}^{(k)}_{r_1 \cdots r_{k-1}}$ by extracting multidimensional patches of shape $d_{\mathcal{B}^{(k)}_{r_1 \cdots r_k}}$ (in any order) and stacking them along a new first dimension. The vector $d_{\mathcal{B}}$ describes the shape of a tensor $\mathcal{B}$, $\mathrm{VEC}: \mathbb{R}^{d_1 \times \cdots \times d_N} \to \mathbb{R}^{d_1 \cdots d_N}$ flattens a tensor, $\mathrm{MAT}: \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N} \to \mathbb{R}^{d_1 \times d_2 \cdots d_N}$ matricizes a tensor, and $I_{\mathcal{A}}$ denotes an identity tensor with the same number of modes as $\mathcal{A}$ and each dimension set to one. Once each $\mathcal{B}^{(k)}_{r_1 \cdots r_k}$ is obtained by solving equation 7 (and using the inverse of the VEC operation in equation 9), we proceed recursively by setting $\mathcal{W}^{(k+1)}_{r_1 \cdots r_k} \leftarrow \mathcal{B}^{(k)}_{r_1 \cdots r_k}$ and solving the $(k+1)$-th instance of equation 7. In other words, at the $k$-th iteration, we find Kronecker factors $\mathcal{A}^{(k)}$ and $\mathcal{B}^{(k)}$, the latter of which is used in the following iteration, except in the final iteration (i.e., $k = S-1$), where the intermediate tensor $\mathcal{B}^{(S-1)}$ is taken as the last Kronecker factor $\mathcal{A}^{(S)}$.
(See Algorithm 1.) By virtue of the connectivity between all of the Kronecker factors, as illustrated in Figure 1a, SeKron can achieve many other commonly used structures, as stated in the following theorem:

Theorem 2. The factorization structure imposed by CP, Tucker, TT and TR when decomposing a given tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be achieved using SeKron.

Proof. See Appendix C.
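The recursion at the heart of this procedure can be sketched in the simplest rank-1, 2-D setting: decompose, keep the left factor, and recurse on the right one. The helpers below (`nearest_kron`, `sekron_rank1`) are hypothetical illustrations of the idea, not Algorithm 1's batched implementation, which additionally tracks the rank dimensions $R_k$:

```python
import numpy as np

def nearest_kron(W, a_shape):
    """Rank-1 nearest Kronecker factorization W ~= A kron B (2-D case),
    via the SVD of the block rearrangement of W."""
    a1, a2 = a_shape
    b1, b2 = W.shape[0] // a1, W.shape[1] // a2
    blocks = (W.reshape(a1, b1, a2, b2)
               .transpose(0, 2, 1, 3)
               .reshape(a1 * a2, b1 * b2))
    U, s, Vt = np.linalg.svd(blocks, full_matrices=False)
    A = U[:, 0].reshape(a1, a2)
    B = (s[0] * Vt[0]).reshape(b1, b2)
    return A, B

def sekron_rank1(W, shapes):
    """Greedy rank-1 SeKron: W ~= A1 kron A2 kron ... kron AS.
    The last entry of `shapes` is implied by the leftover dimensions."""
    factors = []
    for a_shape in shapes[:-1]:
        A, W = nearest_kron(W, a_shape)   # recurse on the right factor
        factors.append(A)
    factors.append(W)
    return factors

# A tensor that is exactly a Kronecker sequence is recovered exactly:
A1, A2, A3 = (np.random.randn(2, 2) for _ in range(3))
W = np.kron(np.kron(A1, A2), A3)
F = sekron_rank1(W, [(2, 2), (2, 2), (2, 2)])
W_hat = np.kron(np.kron(F[0], F[1]), F[2])
```

When the input is exactly a sequence of Kronecker products, the greedy recursion recovers it exactly (up to per-factor scalings that cancel in the product), mirroring the exact-representation claim of Theorem 1.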

3.3. CONVOLUTION WITH SEKRON STRUCTURES

In this section, we provide an efficient algorithm for performing a convolution operation using a tensor represented by a sequence of Kronecker factors. Assuming $\mathcal{W}$ is approximated by a SeKron sequence $\widetilde{\mathcal{W}}$, i.e., $\mathcal{W} \approx \widetilde{\mathcal{W}}$ with

$$\widetilde{\mathcal{W}} = \sum_{r_1=1}^{R_1} \mathcal{A}^{(1)}_{r_1} \otimes \sum_{r_2=1}^{R_2} \mathcal{A}^{(2)}_{r_1 r_2} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}}, \quad (10)$$

the convolution operation in equation 1 can be re-written as

$$Y_{fxy} = \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \sum_{c=1}^{C} \Big[ \sum_{r_1=1}^{R_1} \mathcal{A}^{(1)}_{r_1} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}} \Big]_{fcij} X_{c,i+x,j+y}. \quad (11)$$

Due to the factorization structure of tensor $\widetilde{\mathcal{W}}$, the computation in equation 11 can be carried out without its explicit reconstruction. Instead, the projection can be performed using each of the Kronecker factors independently. This property is essential to performing efficient convolution operations using SeKron factorizations, and it reduces both memory and FLOPs at runtime. In practice, this amounts to replacing one large convolution operation (i.e., one with a large convolution tensor) with a sequence of smaller grouped 3D convolutions, as summarized in Algorithm 2.

Algorithm 2: Convolution operation using a sequence of Kronecker factors
  Input: $\{\mathcal{A}^{(i)}\}_{i=1}^{S}$, $\mathcal{A}^{(i)} \in \mathbb{R}^{r_i \times f_i \times c_i \times K_{h_i} \times K_{w_i}}$; $\mathcal{X} \in \mathbb{R}^{N \times C \times H \times W}$
  Output: $\mathcal{X} \in \mathbb{R}^{N \times \prod_{k=1}^{S} f_k \times H \times W}$
  for $i \leftarrow S, S-1, \ldots, 1$ do
    if $i = S$ then
      $\mathcal{X} \leftarrow \mathrm{CONV3D}(\mathrm{UNSQUEEZE}(\mathcal{X}, 1), \mathrm{UNSQUEEZE}(\mathcal{A}^{(i)}, 1))$
      $\mathcal{X} \leftarrow \mathrm{RESHAPE}_1(\mathcal{X})$
    else
      $\mathcal{X} \leftarrow \mathrm{CONV3D}(\mathcal{X}, \mathcal{A}^{(i)}, \text{groups} = r_i)$   /* $\mathbb{R}^{N \times \prod_{k=i+1}^{S} f_k \times r_i f_i \times \prod_{k=1}^{i-1} c_k \times H \times W} \to \mathbb{R}^{N \times \prod_{k=i}^{S} f_k \times r_i \times \prod_{k=1}^{i-1} c_k \times H \times W}$ */
      $\mathcal{X} \leftarrow \mathrm{RESHAPE}_2(\mathcal{X})$
    end
  end
  return $\mathcal{X}$

The ability to avoid reconstruction at runtime when performing a convolution using any SeKron factorization is the result of the following theorem:

Theorem 3 (Linear Mappings with Sequences of Kronecker Products). Any linear mapping using a given tensor $\mathcal{W}$ can be written directly in terms of its Kronecker factors $\mathcal{A}^{(k)} \in \mathbb{R}^{R_1 \times \cdots \times R_k \times a^{(k)}_1 \times \cdots \times a^{(k)}_N}$. That is,

$$\sum_{i_1, \ldots, i_N} W_{i_1 \cdots i_N} X_{i_1+z_1, \ldots, i_N+z_N} = \sum_{r_1, \ldots, r_{S-1}} A^{(1)}_{r_1 j^{(1)}_1 \cdots j^{(1)}_N} \cdots A^{(S)}_{r_1 \cdots r_{S-1} j^{(S)}_1 \cdots j^{(S)}_N}\, X_{f(j_1)+z_1, \ldots, f(j_N)+z_N}, \quad (12)$$

where $j^{(k)}_n \in \mathbb{N}$ is a function of the input indices (see Appendix A) and $f(j_n) = \sum_{k=1}^{S} j^{(k)}_n \prod_{l=k+1}^{S} a^{(l)}_n$.

Proof. See Appendix C.

Using Theorem 3, we re-write the projection in equation 11 directly in terms of the Kronecker factors:

$$Y_{fxy} = \sum_{\mathbf{i}, \mathbf{j}, \mathbf{c}, r_1} A^{(1)}_{r_1 f_1 c_1 i_1 j_1} \sum_{r_2} A^{(2)}_{r_1 r_2 f_2 c_2 i_2 j_2} \cdots \sum_{r_{S-1}} A^{(S-1)}_{r_1 \cdots r_{S-1} f_{S-1} c_{S-1} i_{S-1} j_{S-1}} A^{(S)}_{r_1 \cdots r_{S-1} f_S c_S i_S j_S}\, X_{f(\mathbf{c}), f(\mathbf{i})+x, f(\mathbf{j})+y},$$

where $\mathbf{i} = (i_1, \ldots, i_S)$, $\mathbf{j} = (j_1, \ldots, j_S)$, $\mathbf{c} = (c_1, \ldots, c_S)$ denote vectors containing the indices $i_k, j_k, c_k$ that enumerate over positions in the tensors $\mathcal{A}^{(k)}$. Finally, exchanging the order of summation separates the convolution as follows:

$$Y_{fxy} = \sum_{i_1, j_1, c_1, r_1} A^{(1)}_{r_1 f_1 c_1 i_1 j_1} \cdots \sum_{i_S, j_S, c_S} A^{(S)}_{r_1 \cdots r_{S-1} f_S c_S i_S j_S}\, X_{f(\mathbf{c}), f(\mathbf{i})+x, f(\mathbf{j})+y}. \quad (13)$$

Overall, the projection in equation 13 can be carried out efficiently using a sequence of grouped 3D convolutions with intermediate reshaping operations, as described in Algorithm 2. Refer to Appendix C for universal approximation properties of neural networks when using SeKron.
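The key property exploited by Algorithm 2, that a Kronecker-factorized map never needs to be reconstructed, is easiest to verify in the S = 2 matrix case, where it is the classical identity (A kron B) vec(X) = vec(A X B^T) under the row-major convention. A minimal numpy check, offered as an illustration rather than the paper's grouped-convolution code:

```python
import numpy as np

# Applying A kron B never requires materializing the Kronecker product.
a1, a2, b1, b2 = 3, 4, 5, 6
A = np.random.randn(a1, a2)
B = np.random.randn(b1, b2)
x = np.random.randn(a2 * b2)

# Naive: build the (a1*b1) x (a2*b2) matrix explicitly.
y_naive = np.kron(A, B) @ x

# Factorized: two small matrix products on a reshaped input, using
# O(a1*a2 + b1*b2) parameters instead of O(a1*a2*b1*b2).
X = x.reshape(a2, b2)          # row-major reshape of the input vector
y_fact = (A @ X @ B.T).reshape(-1)
```

The same exchange of summation order generalizes to S > 2 factors and to convolution indices, which is what equation 13 and Algorithm 2 implement with a sequence of grouped 3D convolutions.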

3.4. COMPUTATIONAL COMPLEXITY

In order to decompose a given tensor using our method, the sequence length and the Kronecker factor shapes must be specified. Different selections lead to different FLOPs, parameter counts, and latencies. Specifically, for the decomposition in equation 10 of $\mathcal{W} \in \mathbb{R}^{f \times c \times h \times w}$ using factors $\mathcal{A}^{(i)}_{r_1 \cdots r_i} \in \mathbb{R}^{f_i \times c_i \times h_i \times w_i}$, the compression ratio (CR) and FLOPs reduction ratio (FR) are given by

$$\mathrm{CR} = \frac{\prod_{i=1}^{S} f_i c_i h_i w_i}{\sum_{i=1}^{S} \prod_{k=1}^{i} R_k\, f_i c_i h_i w_i}, \qquad \mathrm{FR} = \frac{\prod_{i=1}^{S} f_i c_i h_i w_i}{\sum_{i=1}^{S} \prod_{k=i}^{S} f_k \prod_{k=1}^{i} R_k \prod_{k=1}^{i} c_k\, h_i w_i}. \quad (14)$$

Applying SeKron to compress DNN models requires a strategy for selecting sequence lengths and factor shapes for each layer in a network. We adopt a simple approach: we select configurations that best match a desired CR while also having a lower latency than the original layer being compressed, since FR may not be a good indicator of runtime speedup in practice.
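The CR count can be sanity-checked with a few lines of Python. The function below and the example shapes are hypothetical illustrations of the parameter counting behind CR, under the convention that the trailing factor carries no extra rank dimension:

```python
import math

def sekron_cr(factor_shapes, ranks):
    """Compression ratio of a SeKron factorization of a 4-D conv tensor:
    original parameter count over the sum of factor parameter counts.
    factor_shapes[i] = (f_i, c_i, h_i, w_i); factor i is replicated
    prod_{k<=i} ranks[k] times."""
    # f = prod f_i, c = prod c_i, etc., so the original count is the
    # product of all per-factor counts.
    original = math.prod(math.prod(s) for s in factor_shapes)
    compressed = 0
    for i, shape in enumerate(factor_shapes):
        copies = math.prod(ranks[: i + 1])
        compressed += copies * math.prod(shape)
    return original / compressed

# Hypothetical S = 2 split of a 512x512x3x3 tensor with rank R1 = 4
# (the trailing factor has no extra rank dimension, so ranks = [4, 1]):
cr = sekron_cr([(32, 32, 3, 1), (16, 16, 1, 3)], ranks=[4, 1])
```

FR can be tabulated the same way; as noted above, though, measured latency rather than FR is what ranks candidate configurations in practice.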

4. EXPERIMENTAL RESULTS

To demonstrate the effectiveness of SeKron for model compression, we evaluate different CNN models on both high-level and low-level computer vision tasks. For image classification, we evaluate WideResNet16 (Zagoruyko & Komodakis 2016) and ResNet50 (He et al. 2016) models on CIFAR-10 (Krizhevsky 2009) and ImageNet (Krizhevsky et al. 2012), respectively. For single image super-resolution, we evaluate EDSR-8-128 and SRResNet16 trained on DIV2K (Agustsson & Timofte 2017). Lastly, we discuss the latency of our proposed decomposition method. In all experiments, we compress the convolution layers of pre-trained networks using various compression approaches and then re-train the resulting compressed models. We provide implementation details in Appendix D.

4.1. IMAGE CLASSIFICATION EXPERIMENTS

First, we evaluate SeKron by compressing WideResNet16-8 (Zagoruyko & Komodakis 2016) for image classification on CIFAR-10 and compare against various approaches: PCA (Zhang et al. 2016), which imposes that filter responses lie approximately on a low-rank subspace; SVD-Energy (Alvarez & Salzmann 2017), which incorporates low-rank regularization into the training procedure; L-Rank (learned rank selection) (Idelbayev & Carreira-Perpinan 2020), which jointly optimizes over matrix elements and ranks; ALDS (Liebenwein et al. 2021), which provides a global compression framework that finds optimal layer-wise compressions leading to an overall desired global compression rate; TR (Wang et al. 2018a); TT (Novikov et al. 2015); as well as two pruning approaches, FT (Li et al. 2017) and PFP (Liebenwein et al. 2020). Figure 2 shows the CIFAR-10 classification performance drop (i.e., ∆ Top-1) versus compression rate for the different methods. SeKron outperforms all other decomposition and pruning methods at a variety of compression rates. In Table 1, we highlight that at a compression rate of 4×, SeKron outperforms all other methods with a small accuracy drop of −0.51, whereas the next best decomposition method (omitting rank selection approaches) suffers a −1.27 drop in accuracy. Next, we use SeKron to compress ResNet50 for image classification on ImageNet. Table 2 compares our method to other compression approaches. Most notably, SeKron outperforms all decomposition methods, achieving 74.94% Top-1 accuracy, which is ∼1.1% higher than the second-best accuracy, achieved using TT decomposition. At the same time, SeKron is 3× faster than TT on a single CPU.

4.2. SUPER-RESOLUTION EXPERIMENTS

In this section, we use SeKron to compress SRResNet (Ledig et al. 2017) and EDSR-8-128 (Li et al. 2019). Both networks were trained on DIV2K (Agustsson & Timofte 2017) and evaluated on Set5 (Bevilacqua et al. 2012), Set14 (Zeyde et al. 2012), B100 (Martin et al. 2001) and Urban100 (Huang et al. 2015). Table 3 reports PSNR on the test sets for the models compressed using SeKron, alongside the original uncompressed models. Among model compression methods, Filter Basis Decomposition (FBD) (Li et al. 2019) has previously been shown to achieve state-of-the-art compression on super-resolution CNNs; therefore, we compare our results with those obtained using FBD, as shown in Table 3. Our approach outperforms FBD on all test datasets when compressing SRResNet16 at similar compression rates. As the table suggests, when the compression rate is increased, FBD results in much lower PSNRs for both EDSR-8-128 and SRResNet16 compared to SeKron.

4.3. CONFIGURING SEKRON CONSIDERING LATENCY AND COMPRESSION RATE

Using the configuration selection strategy proposed in Section 3.4, we find that a small sequence length (S) limits the number of achievable candidate configurations (and consequently compression rates) that do not sacrifice latency. This is illustrated in Figure 3 for S = 2, where targeting a CPU latency of less than 5 ms and a compression ratio of less than 10× leaves only 3 options for compression. In contrast, increasing the sequence length to S = 3 leads to a much wider range of achievable compression rates (i.e., 129 configurations). Despite the flexibility they provide, large sequence lengths lead to an exponentially larger number of candidate configurations and time-consuming generation of all of their runtimes. For this reason, unless otherwise stated, we use S = 3 in all the above-mentioned experiments, as it provides a suitable range of compression rates and a manageable search space. As an example, in Table 4 we compress EDSR-8-128 at a compression rate of CR = 2.5× by selecting configurations for each layer that satisfy the desired CR while simultaneously yielding a speedup. This leads to an overall model latency of 124 ms (compressed) vs. 151 ms (uncompressed).
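The selection strategy can be sketched as a brute-force enumeration over factor shapes. The rank-1 assumption, the placement of the spatial dimensions in the first factor, and the CR window below are all illustrative simplifications, not the paper's exact search space:

```python
import itertools

def factor_triples(n):
    """All ordered triples (a, b, c) of positive ints with a*b*c == n."""
    out = []
    for a in range(1, n + 1):
        if n % a:
            continue
        for b in range(1, n // a + 1):
            if (n // a) % b == 0:
                out.append((a, b, n // (a * b)))
    return out

# Enumerate rank-1, S = 3 candidate shapes for a 512x512x3x3 tensor
# (spatial dims kept in the first factor) and keep those near a target
# compression rate; each kept candidate would then be benchmarked.
orig = 512 * 512 * 3 * 3
candidates = []
for fs, cs in itertools.product(factor_triples(512), repeat=2):
    # factor shapes: (f1, c1, 3, 3), (f2, c2, 1, 1), (f3, c3, 1, 1)
    params = fs[0] * cs[0] * 9 + fs[1] * cs[1] + fs[2] * cs[2]
    cr = orig / params
    if 10.0 <= cr <= 20.0:      # illustrative target CR window
        candidates.append((fs, cs, cr))
```

In practice, each surviving candidate would additionally be timed on the target hardware, and only configurations faster than the original layer would be kept, mirroring the latency constraint described above.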

5. CONCLUSIONS

We introduced SeKron, a tensor decomposition approach using sequences of Kronecker products. SeKron allows a wide variety of factorization structures to be achieved while, crucially, sharing the same compression and convolution algorithms. Moreover, SeKron has been shown to generalize popular decomposition methods such as TT, TR, CP and Tucker. Thus, it mitigates the need for time-consuming development of customized convolution algorithms. Unlike other decomposition methods, SeKron is not limited to a single factorization structure, which leads to improved compression and reduced runtimes on different hardware. Leveraging SeKron's flexibility, we find efficient factorization structures that outperform previous decomposition methods on various image classification and super-resolution tasks.

Figure 4: Illustration of alternative expansion directions using sequences of Kronecker products. SeKron structures are those which are leftmost on each level of the tree. Each node is obtained through the decomposition of a single tensor present in its parent node.

Proof. First, we define intermediate tensors

$$\mathcal{B}^{(k)}_{r_1 \cdots r_k} \triangleq \sum_{r_{k+1}=1}^{R_{k+1}} \mathcal{A}^{(k+1)}_{r_1 \cdots r_{k+1}} \otimes \sum_{r_{k+2}=1}^{R_{k+2}} \mathcal{A}^{(k+2)}_{r_1 \cdots r_{k+2}} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}}. \quad \text{(5 revisited)}$$

Then the reconstruction error can be written as

$$\Big\| \mathcal{W}^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{R_k} \mathcal{A}^{(k)}_{r_1 \cdots r_k} \otimes \mathcal{B}^{(k)}_{r_1 \cdots r_k} \Big\|_F^2,$$

where $\mathcal{W}^{(1)}$ is the initial tensor being decomposed. As described in Section 3.2, using the reshaping operations

$$W^{(k)}_{r_1 \cdots r_{k-1}} = \mathrm{MAT}(\mathrm{UNFOLD}(\mathcal{W}^{(k)}_{r_1 \cdots r_{k-1}}, d_{\mathcal{B}^{(k)}_{r_1 \cdots r_k}})), \qquad a^{(k)}_{r_1 \cdots r_k} = \mathrm{VEC}(\mathrm{UNFOLD}(\mathcal{A}^{(k)}_{r_1 \cdots r_k}, d_{I_{\mathcal{A}^{(k)}}})), \qquad b^{(k)}_{r_1 \cdots r_k} = \mathrm{VEC}(\mathcal{B}^{(k)}_{r_1 \cdots r_k}),$$

which preserve the sum of squares, we can equivalently write the reconstruction error as

$$\Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{R_k} a^{(k)}_{r_1 \cdots r_k} b^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2. \quad (18)$$
Now consider the singular value decomposition of the matrix $W^{(k)}_{r_1 \cdots r_{k-1}}$ and let $u^{(k)}_{r_1 \cdots r_k}, v^{(k)}_{r_1 \cdots r_k}$ denote its left and right singular vectors, respectively (with the right singular vector scaled by its corresponding singular value). Set $a^{(k)}_{r_1 \cdots r_k} = u^{(k)}_{r_1 \cdots r_k}$ and define the error terms

$$\delta^{(k)}_{r_1 \cdots r_k} = v^{(k)}_{r_1 \cdots r_k} - b^{(k)}_{r_1 \cdots r_k}, \qquad \epsilon^{(k)}_{r_1 \cdots r_k} = \big\| \delta^{(k)}_{r_1 \cdots r_k} \big\|^2.$$

Expanding equation 18 reveals its recursive form:

$$\Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} b^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2 = \Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} \big( v^{(k)}_{r_1 \cdots r_k} - \delta^{(k)}_{r_1 \cdots r_k} \big)^{\top} \Big\|_F^2 \quad (20)$$

$$= \Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} v^{(k)\top}_{r_1 \cdots r_k} + \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} \delta^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2 \quad (21)$$

$$\leq \Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} v^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2 + \Big\| \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} \delta^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2 \quad (22)$$

$$\leq \Big\| W^{(k)}_{r_1 \cdots r_{k-1}} - \sum_{r_k=1}^{\widetilde{R}_k} a^{(k)}_{r_1 \cdots r_k} v^{(k)\top}_{r_1 \cdots r_k} \Big\|_F^2 + \sum_{r_k=1}^{\widetilde{R}_k} d^{(k)} \epsilon^{(k)}_{r_1 \cdots r_k} \quad (23)$$

$$= \sum_{r_k=\widetilde{R}_k+1}^{R_k} \sigma^2_{r_k}\big( W^{(k)}_{r_1 \cdots r_{k-1}} \big) + \sum_{r_k=1}^{\widetilde{R}_k} d^{(k)} \epsilon^{(k)}_{r_1 \cdots r_k} \quad (24)$$

$$= \sum_{r_k=\widetilde{R}_k+1}^{R_k} \sigma^2_{r_k}\big( W^{(k)}_{r_1 \cdots r_{k-1}} \big) + \sum_{r_k=1}^{\widetilde{R}_k} d^{(k)} \big\| v^{(k)}_{r_1 \cdots r_k} - b^{(k)}_{r_1 \cdots r_k} \big\|_F^2, \quad (25)$$

where $d^{(k)} \in \mathbb{N}$ is the number of dimensions of the vector $a^{(k)}_{r_1 \cdots r_k}$, $R_k$ is the rank of the matrix $W^{(k)}_{r_1 \cdots r_{k-1}}$, and $\sigma_{r_k}(W^{(k)}_{r_1 \cdots r_{k-1}})$ denotes its $r_k$-th singular value. By reshaping the vectors $v^{(k)}_{r_1 \cdots r_k}, b^{(k)}_{r_1 \cdots r_k}$ into matrices according to

$$V^{(k)}_{r_1 \cdots r_k} = \mathrm{MAT}\Big( \mathrm{UNFOLD}\Big( \mathrm{VEC}^{-1}\Big( v^{(k)}_{r_1 \cdots r_k}, \prod_{s=k+1}^{S} d^{(s)} \Big), \prod_{s=k+2}^{S} d^{(s)} \Big) \Big),$$

and similarly for $B^{(k)}_{r_1 \cdots r_k}$, where $d^{(s)} \in \mathbb{N}^{N}$ describes the dimensions of the $s$-th factor, we can re-write equation 25 as

$$\sum_{r_k=\widetilde{R}_k+1}^{R_k} \sigma^2_{r_k}\big( W^{(k)}_{r_1 \cdots r_{k-1}} \big) + \sum_{r_k=1}^{\widetilde{R}_k} d^{(k)} \big\| v^{(k)}_{r_1 \cdots r_k} - b^{(k)}_{r_1 \cdots r_k} \big\|_F^2 \quad (28)$$

$$= \sum_{r_k=\widetilde{R}_k+1}^{R_k} \sigma^2_{r_k}\big( W^{(k)}_{r_1 \cdots r_{k-1}} \big) + \sum_{r_k=1}^{\widetilde{R}_k} d^{(k)} \big\| V^{(k)}_{r_1 \cdots r_k} - B^{(k)}_{r_1 \cdots r_k} \big\|_F^2 \quad (29)$$

$$= \sum_{r_k=\widetilde{R}_k+1}^{R_k} \sigma^2_{r_k}\big( W^{(k)}_{r_1 \cdots r_{k-1}} \big) + \sum_{r_k=1}^{\widetilde{R}_k} d^{(k)} \Big\| V^{(k)}_{r_1 \cdots r_k} - \sum_{r_{k+1}=1}^{R_{k+1}} a^{(k+1)}_{r_1 \cdots r_{k+1}} b^{(k+1)\top}_{r_1 \cdots r_{k+1}} \Big\|_F^2. \quad (30)$$
The last line reveals the recursive nature of the formula (compare with equation 20). Unrolling the recursion for $k = 1, \ldots, S-1$ by setting $\mathcal{W}^{(k+1)}_{r_1 \cdots r_k} \leftarrow \mathcal{B}^{(k)}_{r_1 \cdots r_k}$ leads to the formula for the overall reconstruction error given in equation 31.



Figure 1: (a): Tensor network diagrams of various decomposition methods for a 4D convolution tensor W ∈ IR F ×C×K h ×Kw . Unlike all other decomposition methods where f, c, h, w index over fixed dimensions (i.e., dimensions of W ), SeKron is flexible in its factor dimensions, with f k , c k , h k , w k , ∀k ∈ {1, ..., S} indexing over variable dimension choices, as well as its sequence length S. Thus, it allows for a wide range of factorization structures to be achieved. (b): Example of a 16 × 16 tensor W that can be more efficiently represented using a sequence of four Kronecker factors (requiring 16 parameters) in contrast to using a sequence length of two (requiring 32 parameters).

Figure 2: Performance drop of WideResNet16-8 at various compression rates achieved by different methods on CIFAR-10.

Figure 3: CPU latency for candidate configurations obtained using SeKron on a tensor W ∈ IR 512×512×3×3 with S = 2 (red) and S = 3 (blue), aiming for a speedup (e.g., < 5 ms) and a typical compression rate (e.g., < 10×).

Unrolling the recursion yields the following formula for the reconstruction error:

$$\varepsilon_{\mathrm{SeKron}}(\mathcal{W}, r, D) = \sum_{r_1=\widetilde{R}_1+1}^{R_1} \sigma^2_{r_1}\big( W^{(1)} \big) + d^{(1)} \sum_{r_1=1}^{\widetilde{R}_1} \sum_{r_2=\widetilde{R}_2+1}^{R_2} \sigma^2_{r_2}\big( W^{(2)}_{r_1} \big) + \cdots + d^{(1)} d^{(2)} \cdots d^{(S-2)} \sum_{r_1, \ldots, r_{S-2}=1}^{\widetilde{R}_1, \ldots, \widetilde{R}_{S-2}} \sum_{r_{S-1}=\widetilde{R}_{S-1}+1}^{R_{S-1}} \sigma^2_{r_{S-1}}\big( W^{(S-1)}_{r_1 \cdots r_{S-2}} \big). \quad (31)$$

Table 1: Performance of compressed WideResNet16-8 using various methods on CIFAR-10.

Table 2: Performance of ResNet50 using various compression methods measured on ImageNet. † indicates models obtained by compressing baselines with different accuracies; for this reason, we also report the accuracy drop of each model with respect to its own baseline. The baselines compared are FSNet (Yang et al. 2020), ThiNet (Luo et al. 2017), CP (He et al. 2017), MP (Liu et al. 2019) and Binary Kronecker (Hameed et al. 2022).

Table 3: PSNR (dB) performance of compressed SRResNet16 and EDSR-8-128 (×4 scaling factor) models using FBD (with basis-64-16) (Li et al. 2019) and our SeKron.

Table 4: CPU latency (ms) for uncompressed (baseline) and compressed SRResNet16 and EDSR-8-128 models using SeKron.

APPENDIX A SEQUENCE OF KRONECKER PRODUCTS

The Kronecker product between a sequence of factor tensors can be written in scalar form as

$$\big( \mathcal{A}^{(1)} \otimes \cdots \otimes \mathcal{A}^{(S)} \big)_{i_1 \cdots i_N} = \prod_{k=1}^{S} A^{(k)}_{j^{(k)}_1 \cdots j^{(k)}_N},$$

where each factor index $j^{(k)}_n$ is recovered from $i_n$ via

$$j^{(k)}_n = \Big\lfloor i_n \Big/ \prod_{l=k+1}^{S} a^{(l)}_n \Big\rfloor \bmod a^{(k)}_n, \quad (16)$$

for $n = 1, \ldots, N$, consistent with the mixed-radix expansion $i_n = f(j_n) = \sum_{k=1}^{S} j^{(k)}_n \prod_{l=k+1}^{S} a^{(l)}_n$ used in Theorem 3.

B ALTERNATIVE EXPANSION DIRECTIONS OF SEKRON

The proposed SeKron structure represents a given tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ using a sequence of Kronecker products (Theorem 1). While this decomposition structure is obtained by recursively finding the Kronecker decomposition of the right-most tensor, many alternative sequential Kronecker structures can be obtained, as illustrated in Figure 4. However, such alternative structures do not fall within our SeKron framework, as they cannot make use of our convolution algorithm (Algorithm 2).

C THEOREM PROOFS

Theorem 1 (Tensor Decomposition using a Sequence of Kronecker Products). Any tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be represented by a sequence of Kronecker products between $S \in \mathbb{N}$ factors, where $r = (\widetilde{R}_1, \ldots, \widetilde{R}_{S-1})$ contains the rank values and $D_s = d^{(s)}$ contains the Kronecker factor shapes; $\varepsilon_{\mathrm{SeKron}}(\mathcal{W}, r, D)$ is referred to as the Dr-SeKron approximation error (the dependency of the intermediate matrices on $r$ and $D$ is left implicit).

Theorem 2. The factorization structure imposed by CP, Tucker, TT and TR when decomposing a given tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be achieved using SeKron.

Proof. The SeKron decomposition of tensor $\mathcal{W}$ is given in scalar form by equation 35, and the CP decomposition of $\mathcal{W}$ can be written in the same scalar form. The Tucker decomposition of $\mathcal{W}$ is equivalent to equation 35 in the special case where there are nullity constraints on some elements of the Kronecker factors for $k = 2, \ldots, N$, for any choice of $R$. The Tensor-Ring (TR) decomposition of $\mathcal{W}$ is equivalent to equation 38 in the special case where some elements of the Kronecker factors are constrained, such that all elements of tensor $\mathcal{A}^{(1)}$ are constrained to one, for any choice of $R$.

Theorem 3 (Linear Mappings with Sequences of Kronecker Products). Any linear mapping using a given tensor $\mathcal{W}$ can be written directly in terms of its Kronecker factors $\mathcal{A}^{(k)} \in \mathbb{R}^{R_1 \times \cdots \times R_k \times a^{(k)}_1 \times \cdots \times a^{(k)}_N}$.

Proof. First, we bring out the summations in the SeKron representation of $\mathcal{W}$ (equation 42). Then, using the scalar-form definition of sequences of Kronecker products in equation 16, we re-write equation 42 in scalar form. As the $j_n$ terms decompose $i_n$ into an integer-weighted sum, we can recover $i_n$ using $i_n = f(j_n) = \sum_{k=1}^{S} j^{(k)}_n \prod_{l=k+1}^{S} a^{(l)}_n$, where $j_n = (j^{(1)}_n, \ldots, j^{(S)}_n)$. Finally, combining equations 43 and 45 leads to the result.

Proof. Let $f$ denote a shallow neural network with $f \in C(X)$. According to Hornik (1991), equation 47 is dense in $C(X)$; therefore, it suffices to show that equation 48 is bounded as well, where $\varepsilon$ denotes the Dr-SeKron approximation error as in equation 31, with the matrix $D$ and the vector $r$ describing the shapes of the Kronecker factors and the ranks used in the SeKron decomposition of $\mathcal{W}$, respectively.

D IMPLEMENTATION DETAILS

In all of our experiments, we use 4 NVIDIA Tesla V100 SXM2 32 GB GPUs during training and evaluate runtime on a single core of an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.

D.1 IMAGENET EXPERIMENTS

We train all models using stochastic gradient descent for 100 epochs with a batch size of 256. The learning rate is initially set to 0.1 and reduced by a factor of 10× at epochs 30, 60 and 90. We also use a weight decay of 0.0001.

D.2 CIFAR-10 EXPERIMENTS

We train all models using stochastic gradient descent for 200 epochs with a batch size of 128. The learning rate is initially set to 0.1 and is reduced by a factor of 5× at epochs 60, 120 and 160. We use Nesterov momentum of 0.9 and a weight decay of 0.0005.

D.3 DIV2K

We train all models using the ADAM optimizer for 300 epochs with a batch size of 16. The learning rate is set to 0.0001, and β1, β2 are set to 0.9 and 0.999, respectively.

