Sparse matrix products for neural network compression

Abstract

Over-parameterization of neural networks is a well-known issue that accompanies their strong performance. Among the many approaches proposed to tackle this problem, low-rank tensor decompositions are widely investigated to compress deep neural networks. Such techniques rely on a low-rank assumption on the layer weight tensors that does not always hold in practice. Following this observation, this paper studies sparsity-inducing techniques to build new sparse matrix product layers for high-rate neural network compression. Specifically, we leverage recent advances in sparse optimization to replace each layer's weight matrix, either convolutional or fully connected, by a product of sparse matrices. Our experiments show that our approach provides a better compression-accuracy trade-off than the most popular low-rank-based compression techniques.

1. Introduction

The success of neural networks in the processing of structured data is in part due to their over-parameterization, which plays a key role in their ability to learn rich features from the data (Neyshabur et al., 2018). Unfortunately, this also makes most state-of-the-art models so large that they are expensive to store and impossible to operate on devices with limited resources (memory, computing capacity) or that cannot integrate GPUs (Cheng et al., 2017). This problem has led to a popular line of research on "neural network compression", which aims at building models with few parameters while preserving their accuracy.

State of the art techniques for neural network compression. Popular matrix or tensor decomposition methods, including Singular Value Decomposition (SVD), CANDECOMP/PARAFAC (CP) and Tucker, have been used to address the problem of model compression by a low-rank approximation of the neural network's weights after learning. Sainath et al. (2013) describe a method based on SVD to compress weight matrices in fully connected layers. Denton et al. (2014); Lebedev et al. (2015); Kim et al. (2016) generalize this idea to convolutional layers and reduce the memory footprint of convolution kernels by using higher-order low-rank decompositions such as CP or Tucker decompositions.
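To make the SVD-based approach concrete, the sketch below (NumPy, with hypothetical shapes and a hypothetical helper name) compresses a fully connected layer's weight matrix into two low-rank factors and measures the approximation error; on a random full-rank matrix, the error stays large, which anticipates the limitation of low-rank methods discussed later.

```python
import numpy as np

def svd_compress(W, rank):
    """Hypothetical helper: approximate W (m x n) by two factors
    A (m x rank) and B (rank x n), so W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))     # a random matrix is full rank a.s.
A, B = svd_compress(W, rank=32)

# Storage drops from 512*512 = 262144 values to 32*(512+512) = 32768.
compression_ratio = W.size / (A.size + B.size)
# The relative error remains large because W has no low-rank structure.
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

Such a factorized layer is applied as `(x @ A) @ B`, reducing both memory and multiply-adds when the rank is small.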

Besides, the Tensor-Train (TT) decomposition has been explored to compress both dense and convolutional layers after a pre-training step (Novikov et al., 2015). This approach may achieve extreme compression rates, but it also has practical downsides that we now describe. In the TT format, every element of an M-order tensor is expressed as a product of M matrices whose dimensions are determined by the TT-ranks (R_0, R_1, ..., R_M). For each of the M dimensions of the initial tensor, the corresponding matrices can be stacked into an order-3 tensor called a "core" of the decomposition. Hence, the layer weight is decomposed into a set of M cores of small dimensions. Novikov et al. (2015) use this tensor representation to factorize fully connected layers: they first reshape the weight matrix into an M-order tensor, then apply the TT decomposition. By choosing sufficiently small values R_m, this technique yields high compression ratios on extremely wide ad hoc neural architectures. Garipov et al. (2016) adapted this idea to convolutional layers. However, the current formulation of such TT convolutional layers involves multiplying all input values by a matrix of dimension 1 × R_1, thus inflating the input by a factor of R_1 in memory. This makes the available implementation (Garipov, 2020) unusable for recent wide convolutional networks at inference time. Other compression methods include unstructured pruning techniques, which we review in more detail in Section 2.3, and structured pruning techniques, which reduce the inner hidden dimensions of the network by completely removing neurons (Anwar et al., 2017). According to the recent paper of Liu et al. (2018), however, these techniques are more akin to neural architecture search than to actual network compression.
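To make the TT parameter counting concrete, here is a minimal NumPy sketch of the format described above (the mode sizes and TT-ranks are illustrative choices, not values from the paper): each tensor element is recovered as a product of M matrices, one slice per core, while storage drops from the full tensor size to the sum of the core sizes.

```python
import numpy as np

# An M-order tensor with mode sizes n_m is stored as M cores
# G_m of shape (R_{m-1}, n_m, R_m), with boundary ranks R_0 = R_M = 1.
rng = np.random.default_rng(0)
modes = (4, 8, 8, 4)             # M = 4; full tensor has 4*8*8*4 = 1024 entries
ranks = (1, 3, 3, 3, 1)          # illustrative TT-ranks (R_0, ..., R_4)

cores = [rng.standard_normal((ranks[m], modes[m], ranks[m + 1]))
         for m in range(len(modes))]

def tt_entry(cores, index):
    """Each element is a product of M matrices, one slice per core."""
    out = np.eye(1)
    for G, i in zip(cores, index):
        out = out @ G[:, i, :]
    return out.item()            # the product is 1 x 1 since R_0 = R_M = 1

n_params_tt = sum(G.size for G in cores)    # 12 + 72 + 72 + 12 = 168
n_params_full = int(np.prod(modes))         # 1024
```

With these toy dimensions the TT format already stores about 6x fewer values than the dense tensor, and the gap widens with larger modes and small TT-ranks.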
Finally, quantization-based compression maps the columns of the network's weight matrices to a subset of reference columns with lower memory footprint (Guo, 2018).

Sparse matrix products for full-rank decompositions. We are specifically interested in high-rate compression of neural networks via the efficient factorization of the layer weight matrices. Most known approaches to layer decomposition make a low-rank assumption on the layer weight tensors, which does not always hold in practice. As we will show in the experiments, this prevents the Tucker- and SVD-based techniques from reaching high compression rates on standard architectures that combine convolutional and fully connected layers, such as VGG19 or ResNet50, whose weight matrices usually exhibit full rank. In this paper, we propose instead to express the weight matrices of fully connected or convolutional layers as a product of sparse factors, which involves very few parameters but can still represent high-rank matrices. Moreover, products of matrices with a given total sparsity budget are strictly more expressive than single matrices with that sparsity (Dao et al., 2019), which motivates our interest in products of multiple matrices. In general, a linear operator (a matrix) from R^D to R^D has time and space complexity O(D^2). But some well-known operators, like the Hadamard or Fourier transforms, can be expressed as a product of log D sparse matrices, each having O(D) non-zero values (Dao et al., 2019; Magoarou & Gribonval, 2016). These linear operators, called fast operators, thus have time and space complexity lowered to O(D log D).
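The Hadamard transform gives a concrete instance of a fast operator. The sketch below (NumPy, using Sylvester's construction) expresses the D × D Hadamard matrix, here with D = 16, as a product of log2 D sparse factors, each holding only 2D non-zero values.

```python
import numpy as np

# The D x D Hadamard matrix (D = 2^n) factorizes into n = log2(D)
# sparse matrices, each with 2 non-zeros per row (2D non-zeros in total).
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])

def hadamard_factors(n):
    """Return log2(D) sparse factors whose product is the D x D Hadamard."""
    factors = []
    for k in range(n):
        # I_{2^k} kron H2 kron I_{2^{n-k-1}}: a D x D matrix, 2D non-zeros.
        F = np.kron(np.kron(np.eye(2 ** k), H2), np.eye(2 ** (n - k - 1)))
        factors.append(F)
    return factors

n = 4                                    # D = 16
factors = hadamard_factors(n)
product = np.linalg.multi_dot(factors)

dense = H2                               # Sylvester construction: H2 kron ... kron H2
for _ in range(n - 1):
    dense = np.kron(dense, H2)
```

The sparse factorization stores n * 2D = 128 values instead of D^2 = 256; the gap grows as O(D log D) versus O(D^2) for larger D.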
This interesting property of fast operators has inspired the design of new algorithms that learn sparse matrix product representations of existing fast transforms (Dao et al., 2019), or even compute sparse product approximations of any matrix in order to accelerate learning and inference (Magoarou & Gribonval, 2016; Giffon et al., 2019). Even though these methods were initially designed to recover the log D factors corresponding to a fast transform, they are more general and can actually be used to find a factorization with Q < log D sparse matrices.

Contributions. We introduce a general framework for neural network compression based on the factorization of layers into sparse matrix products. We explore the use of the recently proposed palm4MSA algorithm (Magoarou & Gribonval, 2016) on every layer of a pre-trained neural network to express it as a product of sparse matrices. The obtained sparse matrices are then refined by gradient descent to best fit the final prediction task. When there is only one sparse matrix in the decomposition, our approach recovers the simple procedure of hard thresholding the weights of a matrix after pre-training. We evaluate the effect of different hyper-parameters on our method and show that layers can be factorized into two or three sparse matrices to obtain high compression rates while preserving good performance, compared to several state-of-the-art methods for neural network compression.
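The Q = 1 special case can be made explicit: with a single sparse factor, the factorization reduces to hard thresholding (magnitude pruning) of the pre-trained weights. A minimal sketch, with an illustrative helper name and dimensions:

```python
import numpy as np

def hard_threshold(W, n_keep):
    """Keep the n_keep largest-magnitude entries of W, zero out the rest.

    Illustrative helper: this is the Q = 1 case, a single sparse factor
    obtained by hard thresholding a pre-trained weight matrix.
    """
    flat = np.abs(W).ravel()
    if n_keep >= flat.size:
        return W.copy()
    cutoff = np.partition(flat, -n_keep)[-n_keep]   # n_keep-th largest |w|
    return np.where(np.abs(W) >= cutoff, W, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
S = hard_threshold(W, n_keep=256)        # keep ~6% of the 4096 weights
```

In a compression pipeline, the sparse factor would then be fine-tuned by gradient descent with its zero pattern held fixed.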

2. Learning sparse matrix products for network compression

We describe how to compress neural network weight matrices by sparse matrix factorization. We call our procedure PSM, for Product of Sparse Matrices. A product of sparse matrices with a given sparsity budget can recover a full-rank matrix, or a matrix with more non-zero values than the initial sparsity budget. This observation motivates the use of sparse matrix factorization in place of the usual low-rank decompositions and sparsity-inducing techniques for neural network compression. We first recall the linear transform operations in fully connected and convolutional layers. Then, inspired by recent work on learning linear operators with fast-transform structures, we propose to use a product of sparse matrices to replace linear transforms in neural networks. We also introduce a procedure to learn such a factorization for every layer in a deep architecture. Finally, we review some known neural network compression techniques that appear as particular cases of our framework.
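This observation can be checked numerically. The sketch below (illustrative butterfly-style supports, not the supports learned by PSM) builds a product of sparse factors whose total number of non-zeros is well below that of the resulting matrix, which is nevertheless full rank.

```python
import numpy as np

# A product of sparse factors can be dense and full-rank even though the
# total number of stored parameters is small.
rng = np.random.default_rng(0)
D, n = 16, 4                              # D = 2^n

factors = []
for k in range(n):
    # Butterfly-style support: 2 non-zeros per row, random values.
    block = rng.standard_normal((2, 2)) + 2 * np.eye(2)   # well-conditioned
    F = np.kron(np.kron(np.eye(2 ** k), block), np.eye(2 ** (n - k - 1)))
    factors.append(F)

P = np.linalg.multi_dot(factors)          # the compressed layer's matrix
budget = sum(np.count_nonzero(F) for F in factors)        # 4 * 32 = 128
```

Here `P` has up to D^2 = 256 non-zeros and rank D, yet only `budget` = 128 values need to be stored, which is exactly the regime a low-rank factorization with the same budget cannot reach.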


