Sparse matrix products for neural network compression

Abstract

Over-parameterization of neural networks is a well-known issue that comes along with their strong performance. Among the many approaches proposed to tackle this problem, low-rank tensor decompositions have been widely investigated to compress deep neural networks. Such techniques rely on a low-rank assumption on the layer weight tensors that does not always hold in practice. Following this observation, this paper studies sparsity-inducing techniques to build new sparse matrix product layers for high-rate neural network compression. Specifically, we explore recent advances in sparse optimization to replace each layer's weight matrix, either convolutional or fully connected, with a product of sparse matrices. Our experiments validate that our approach provides a better compression-accuracy trade-off than the most popular low-rank-based compression techniques.
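To give a rough feel for the idea (a minimal sketch of our own, not the paper's learned factorization: a real method would optimize the factors with sparsity-inducing penalties rather than sample them at random), a dense weight matrix can be replaced by a product of a few sparse factors whose combined number of nonzeros is far smaller than the dense parameter count:

```python
import numpy as np
import scipy.sparse as sp

def random_sparse_factors(shape, num_factors, density, seed=0):
    """Build random sparse factors whose product has the requested shape.
    Illustrative only: in an actual compression method these factors
    would be learned, not sampled."""
    rng = np.random.default_rng(seed)
    rows, cols = shape
    inner = min(rows, cols)
    dims = [rows] + [inner] * (num_factors - 1) + [cols]
    return [sp.random(dims[i], dims[i + 1], density=density, random_state=rng)
            for i in range(num_factors)]

factors = random_sparse_factors((512, 256), num_factors=3, density=0.05)

# The product of the sparse factors plays the role of the dense weight matrix.
W_approx = factors[0]
for f in factors[1:]:
    W_approx = W_approx @ f

dense_params = 512 * 256
sparse_params = sum(f.nnz for f in factors)  # stored values across all factors
print(f"dense params: {dense_params}, sparse params: {sparse_params}")
print(f"compression ratio: {dense_params / sparse_params:.1f}x")
```

At inference time, the layer applies the factors one after another, so both storage and multiplication cost scale with the total number of nonzeros rather than with the dense matrix size.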

1. Introduction

The success of neural networks in processing structured data is partly due to their over-parametrization, which plays a key role in their ability to learn rich features from the data (Neyshabur et al., 2018). Unfortunately, this also makes most state-of-the-art models so large that they are expensive to store and impossible to run on devices with limited resources (memory, computing capacity) or that cannot integrate GPUs (Cheng et al., 2017). This problem has led to a popular line of research on "neural network compression", which aims at building models with few parameters while preserving their accuracy.

State of the art techniques

The Tensor-Train (TT) decomposition has also been explored to compress both dense and convolutional layers after a pre-training step (Novikov et al., 2015). This approach may achieve extreme compression rates, but it also has impractical downsides, which we demonstrate now. In the TT format, every element of an M-order tensor is expressed as a product of M matrices whose dimensions are determined by the TT-ranks (R_0, R_1, ..., R_M). For each of the M dimensions of the initial tensor, the corresponding matrices can be stacked into an order-3 tensor called a "core" of the decomposition. Hence, the layer weight is decomposed into a set of M cores of small dimensions. Novikov et al. (2015) use this tensor representation to factorize fully connected layers: they first reshape the weight matrix into an M-order tensor, then apply the TT decomposition. By choosing sufficiently small values of R_m, this technique makes it possible to obtain a high compression ratio on extremely wide ad hoc neural architectures. Garipov et al. (2016) have adapted this idea to convolutional layers. However, the current formulation of such a TT convolutional layer involves the multiplication of all input values by a matrix of dimension 1 × R_1, thus inflating the intermediate representation by a factor of R_1.
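The TT format described above can be sketched as follows (a self-contained illustration with assumed small dimensions, not the paper's experimental setting): each core m is an order-3 array of shape (R_{m-1}, n_m, R_m), and slicing core m at index i_m yields the R_{m-1} × R_m matrix entering the product.

```python
import numpy as np

rng = np.random.default_rng(0)

# TT-ranks (R_0, ..., R_M) with R_0 = R_M = 1, for an order-3 tensor
# of shape (4, 5, 6); ranks and shape are illustrative choices.
shape = (4, 5, 6)
ranks = (1, 3, 2, 1)

# One order-3 core per tensor dimension: core m has shape (R_{m-1}, n_m, R_m).
cores = [rng.standard_normal((ranks[m], shape[m], ranks[m + 1]))
         for m in range(len(shape))]

def tt_element(cores, index):
    """Element T[i_1, ..., i_M] as the product of M matrices, one per core."""
    out = np.eye(1)
    for core, i in zip(cores, index):
        out = out @ core[:, i, :]   # R_{m-1} x R_m slice of core m
    return out.item()               # the final product is a 1 x 1 matrix

# Reconstruct the full tensor and compare parameter counts.
full = np.array([[[tt_element(cores, (i, j, k)) for k in range(shape[2])]
                  for j in range(shape[1])] for i in range(shape[0])])
tt_params = sum(c.size for c in cores)
print(f"full tensor params: {full.size}, TT params: {tt_params}")
```

Storing the cores costs sum_m R_{m-1} * n_m * R_m values instead of prod_m n_m, which is where the compression comes from when the ranks are small.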



Popular matrix and tensor decomposition methods, including Singular Value Decomposition (SVD), CANDECOMP/PARAFAC (CP), and Tucker decomposition, have been used to address the problem of model compression through a low-rank approximation of the neural network's weights after learning. Sainath et al. (2013) describe a method based on SVD to compress weight matrices in fully connected layers. Denton et al. (2014), Lebedev et al. (2015), and Kim et al. (2016) generalize this idea to convolutional layers and reduce the memory footprint of convolution kernels by using higher-order low-rank decompositions such as CP or Tucker decompositions.
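To make the SVD-based approach concrete, here is a minimal sketch (our own illustration, not Sainath et al.'s exact procedure) of compressing a fully connected layer's weight matrix with a truncated SVD: the n_out × n_in matrix W is replaced by two factors whose storage cost is rank × (n_out + n_in).

```python
import numpy as np

def truncated_svd_factors(W, rank):
    """Approximate W (n_out x n_in) by U_r @ V_r of the given rank.
    Storing the two factors needs rank * (n_out + n_in) values instead
    of n_out * n_in."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # absorb the singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
# A synthetic weight matrix that is approximately rank 20, plus small noise;
# real layer weights only follow such a low-rank assumption approximately.
W = rng.standard_normal((256, 20)) @ rng.standard_normal((20, 512))
W += 0.01 * rng.standard_normal((256, 512))

U_r, V_r = truncated_svd_factors(W, rank=20)
rel_err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
compression = W.size / (U_r.size + V_r.size)
print(f"relative error: {rel_err:.3f}, compression: {compression:.1f}x")
```

When the low-rank assumption fails, the truncation error grows quickly as the rank is reduced, which is precisely the limitation that motivates the sparse matrix products studied in this paper.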

