SEKRON: A DECOMPOSITION METHOD SUPPORTING MANY FACTORIZATION STRUCTURES

Anonymous

Abstract

While convolutional neural networks (CNNs) have become the de facto standard for most image processing and computer vision applications, their deployment on edge devices remains challenging. Tensor decomposition methods provide a means of compressing CNNs to meet the wide range of device constraints by imposing certain factorization structures on their convolution tensors. However, being limited to the small set of factorization structures offered by state-of-the-art decomposition approaches can lead to sub-optimal performance. We propose SeKron, a novel tensor decomposition method that offers a wide variety of factorization structures using sequences of Kronecker products. This flexibility enables many compression rates and also allows SeKron to cover commonly used factorizations such as Tensor-Train (TT), Tensor-Ring (TR), Canonical Polyadic (CP) and Tucker. Crucially, we derive an efficient convolution projection algorithm shared by all SeKron structures, leading to seamless compression of CNN models. We validate our approach for model compression on both high-level and low-level computer vision tasks and find that it outperforms state-of-the-art decomposition methods.

1. INTRODUCTION

Deep learning models have introduced new state-of-the-art solutions to both high-level computer vision problems (He et al. 2016; Ren et al. 2015) and low-level image processing tasks (Wang et al. 2018b; Schuler et al. 2015; Kokkinos & Lefkimmiatis 2018) through convolutional neural networks (CNNs). These gains come at the expense of the millions of training parameters that accompany deep CNNs, making them computationally intensive. As a result, many of these models are challenging to deploy on resource-constrained edge devices. Compared with networks for high-level computer vision tasks (e.g., ResNet-50 (He et al. 2016)), models for low-level imaging problems such as single image super-resolution have a much higher computational complexity due to their larger feature map sizes. Moreover, they are typically infeasible to offload to cloud computing servers, making their deployment on edge devices even more critical. In recent years, an increasing trend has emerged of reducing the size of state-of-the-art CNN backbones through efficient architecture designs such as Xception (Chollet 2017), MobileNet (Howard et al. 2019), ShuffleNet (Zhang et al. 2018c), and EfficientNet (Tan & Le 2019), to name a few. On the other hand, studies have demonstrated significant redundancy in the parameters of large CNN models, implying that, in theory, the number of model parameters can be reduced while maintaining performance (Denil et al. 2013). These studies provide the basis for the development of many model compression techniques such as pruning (He et al. 2020), quantization (Hubara et al. 2017), knowledge distillation (Hinton et al. 2015), and tensor decomposition (Phan et al. 2020). Tensor decomposition methods such as Tucker (Kim et al. 2016), Canonical Polyadic (CP) (Lebedev et al. 2015), Tensor-Train (TT) (Novikov et al. 2015) and Tensor-Ring (TR) (Wang et al. 2018a) rely on finding low-rank approximations of tensors under some imposed factorization structure, as illustrated in Figure 1a. In practice, some structures are more suitable than others when decomposing a given tensor, and choosing from a limited set of factorization structures can lead to sub-optimal compression as well as lengthy runtimes, depending on the hardware.

This limitation can be alleviated by reshaping tensors prior to their compression, as shown in (Garipov et al. 2016). However, this approach requires time-consuming development of customized convolution algorithms. We propose SeKron, a novel tensor decomposition method offering a wide range of factorization structures that all share the same efficient convolution algorithm. Our method is inspired by approaches based on the Kronecker Product Decomposition (Thakker et al. 2019; Hameed et al. 2022). Unlike other decomposition methods, Kronecker Product Decomposition generalizes the product of smaller factors from vectors and matrices to a range of tensor shapes, thereby exploiting local redundancy between arbitrary slices of multi-dimensional weight tensors. SeKron represents tensors using sequences of Kronecker products to compress convolution tensors in CNNs. Using sequences of Kronecker products leads to a wide range of factorization structures, including commonly used ones such as Tensor-Train (TT), Tensor-Ring (TR), Canonical Polyadic (CP) and Tucker. Sequences of Kronecker products can also exploit local redundancies using far fewer parameters, as illustrated in the example in Figure 1b. By performing the convolution operation using each of the Kronecker factors independently, the number of parameters, the computational intensity, and the runtime are all reduced simultaneously. Leveraging the flexibility of SeKron, we find efficient factorization structures that outperform existing decomposition methods on various image classification and super-resolution tasks.

In summary, our contributions are:
• Introducing SeKron, a novel tensor decomposition method based on sequences of Kronecker products that allows for a wide range of factorization structures.
• Providing a solution to the problem of finding a summation of sequences of Kronecker products between factor tensors that well approximates the original tensor.
• Deriving a single convolution algorithm shared by all factorization structures achievable by SeKron, utilized as compressed convolutional layers in CNNs.
• Improving the state-of-the-art of low-rank model compression on image classification (high-level vision) benchmarks such as ImageNet and CIFAR-10, as well as super-resolution (low-level vision) benchmarks such as Set5, Set14, B100 and Urban100.
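The efficiency claim above — that one can compute with the Kronecker factors independently, without ever materializing their full product — can be illustrated with the classic "vec trick" for a single Kronecker product. This is a simplified stand-in of our own, not the paper's convolution projection algorithm, and all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

A = rng.standard_normal((8, 8))    # m x n Kronecker factor
B = rng.standard_normal((16, 16))  # p x q Kronecker factor
x = rng.standard_normal(8 * 16)    # input vector of length n*q

# Naive: materialize the 128x128 Kronecker product, then multiply.
y_naive = np.kron(A, B) @ x

# Factorized: use the row-major vec identity (A kron B) vec(X) = vec(A X B^T),
# touching only the small factors. The multiply-add count drops from
# (m*p)*(n*q) = 16384 to m*n*q + m*q*p = 3072.
X = x.reshape(8, 16)               # n x q
y_fact = (A @ X @ B.T).reshape(-1)

print(np.allclose(y_naive, y_fact))  # True
```

The same principle, applied to each factor in a sequence of Kronecker products, is what makes a shared efficient convolution routine plausible for every SeKron structure.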

Figure 1: (a): Tensor network diagrams of various decomposition methods for a 4D convolution tensor W ∈ ℝ^(F×C×K_h×K_w). Unlike all other decomposition methods, where f, c, h, w index over fixed dimensions (i.e., the dimensions of W), SeKron is flexible both in its factor dimensions, with f_k, c_k, h_k, w_k, ∀k ∈ {1, ..., S}, indexing over variable dimension choices, and in its sequence length S. Thus, it allows a wide range of factorization structures to be achieved. (b): Example of a 16 × 16 tensor W that can be represented more efficiently using a sequence of four Kronecker factors (requiring 16 parameters) than using a sequence of length two (requiring 32 parameters).
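The parameter counts in Figure 1b can be checked with a short NumPy sketch (our own illustration, not the authors' code): chaining `np.kron` over a sequence of four 2×2 factors yields a 16×16 matrix from only 16 parameters, whereas a sequence of two 4×4 factors needs 32.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sequence of four 2x2 Kronecker factors: 4 * (2*2) = 16 parameters.
factors_len4 = [rng.standard_normal((2, 2)) for _ in range(4)]
W4 = factors_len4[0]
for A in factors_len4[1:]:
    W4 = np.kron(W4, A)   # 2 -> 4 -> 8 -> 16 along each axis

# Sequence of two 4x4 factors: 2 * (4*4) = 32 parameters.
factors_len2 = [rng.standard_normal((4, 4)) for _ in range(2)]
W2 = np.kron(factors_len2[0], factors_len2[1])

print(W4.shape, sum(A.size for A in factors_len4))  # (16, 16) 16
print(W2.shape, sum(A.size for A in factors_len2))  # (16, 16) 32
```

Both sequences parameterize a full 16×16 (256-entry) matrix, but longer sequences of smaller factors can capture local redundancy with fewer parameters, which is the flexibility SeKron exploits.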

2. RELATED WORK

Sparsification. Different components of DNNs, such as weights (Han et al. 2015b;a), convolutional filters (He et al. 2018; Luo et al. 2017) and feature maps (He et al. 2017; Zhuang et al. 2018), can be sparse. The sparsity can be enforced using sparsity-aware regularization (Liu et al. 2015; Zhou et al. 2016) or pruning techniques (Luo et al. 2017; Han et al. 2015b). Many pruning methods (Luo

