HOW INFORMATIVE IS THE APPROXIMATION ERROR FROM TENSOR DECOMPOSITION FOR NEURAL NETWORK COMPRESSION?

Abstract

Tensor decompositions have been successfully applied to compress neural networks. The compression algorithms using tensor decompositions commonly minimize the approximation error on the weights. Recent work assumes the approximation error on the weights is a proxy for the performance of the model when compressing multiple layers and fine-tuning the compressed model. Surprisingly, little research has systematically evaluated which approximation errors can be used to make choices regarding the layer, tensor decomposition method, and level of compression. To close this gap, we perform an experimental study to test whether this assumption holds across different layers and types of decompositions, and what the effect of fine-tuning is. We include in our analysis the approximation error on the features resulting from a compressed layer, to test whether it provides a better proxy, as it explicitly takes the data into account. We find that the approximation error on the weights has a positive correlation with the performance error, both before and after fine-tuning. Basing the approximation error on the features does not improve the correlation significantly. While scaling the approximation error is commonly used to account for the different sizes of layers, the average correlation across layers is smaller than across all choices (i.e., layers, decompositions, and levels of compression) before fine-tuning. When calculating the correlation across the different decompositions, the average rank correlation is larger than across all choices. This means multiple decompositions can be considered for compression and the approximation error can be used to choose between them.

1. INTRODUCTION

Tensor Decompositions (TD) have shown potential for compressing pre-trained models, such as convolutional neural networks, by replacing the optimized weight tensor with a low-rank multi-linear approximation that has fewer parameters (Jaderberg et al., 2014; Lebedev et al., 2015; Kim et al., 2016; Garipov et al., 2016; Kossaifi et al., 2019a). Common compression procedures (Lebedev et al., 2015; Garipov et al., 2016; Hawkins et al., 2021) work by iteratively applying TD to a selected weight tensor, where each time several decomposition choices have to be made regarding (i) the layer to compress, (ii) the type of decomposition, and (iii) the compression level. Selecting the best hyperparameters for these choices at a given iteration requires costly re-evaluation of the full model for each option. Recently, Liebenwein et al. (2021) suggested comparing the approximation errors on the decomposed weights as a more efficient alternative, though they only considered matrix decompositions for which analytical bounds on the resulting performance exist. These bounds rely on the Eckart-Young-Mirsky theorem; for TD, no equivalent theorem is possible (Vannieuwenhoven et al., 2014). While theoretical bounds are not available for more general TD methods, the same concept could still be practical when considering TDs too. We summarize this as the following general assumption:

Assumption 1. A lower TD approximation error on a model's weight tensor indicates better overall model performance after compression.

While this assumption appears intuitive and reasonable, we observe several gaps in the existing literature. First, most existing TD compression literature focuses on only a few decomposition choices, e.g. fixing the TD method (Lebedev et al., 2015; Kim et al., 2016). Although various error measures and decomposition choices have been studied in isolation, no prior work systematically compares different decomposition errors across multiple decomposition choices. Second, different decomposition errors with different properties have been used throughout the literature (Jaderberg et al., 2014), and it is unclear whether one error measure should be preferred. Third, a benefit of TD is that no training data is needed for compression; however, if labeled data is available, more recent methods combine TD with a subsequent fine-tuning step. Is the approximation error equally valid as a proxy for the model performance with and without fine-tuning?

Overall, to the best of the authors' knowledge, no prior work investigates if and which decomposition choices for TD network compression can be made using specific approximation errors. This paper studies empirically to what extent a single decomposition error correlates with the compressed model's performance across varied decomposition choices, identifying how existing procedures could be improved and providing support for specific practices. Our contributions are as follows:

• We present a first empirical study on the correlation between the approximation error on the model weights that results from compression with TD and the performance of the compressed model (the code for our experiments is available at https://github.com/JSchuurmans/tddl). Studied decomposition choices include the layer, multiple decomposition methods (CP, Tucker, and Tensor Train), and the level of compression; a minimal sketch of the weight-based error measure for these decompositions follows this list. Measurements are made using several models and datasets. We show that the error is indicative of model performance, even when comparing multiple TD methods, though useful correlation only occurs at the higher compression levels.
• Different formulations of the approximation error measure are compared, including measuring the error on the features, as motivated by the work of Jaderberg et al. (2014) and Denil et al. (2014), which takes the data distribution into account. We further study how using labeled training data for additional fine-tuning affects the correlation.
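To make the decomposition choices and the weight-based approximation error concrete, the following is a minimal sketch, assuming TensorLy is available; the kernel shape and the ranks are illustrative and do not correspond to the settings used in our experiments.

```python
# Minimal sketch (not the paper's experimental code): the relative approximation
# error on a convolutional weight tensor for CP, Tucker, and Tensor Train,
# assuming TensorLy is installed. Shapes and ranks are illustrative only.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac, tucker, tensor_train

rng = np.random.default_rng(0)
W = tl.tensor(rng.standard_normal((128, 64, 3, 3)))  # conv kernel (out, in, kh, kw)

def relative_error(W, W_hat):
    """Relative Frobenius norm of the weight approximation error."""
    return float(tl.norm(W - W_hat) / tl.norm(W))

# CP: sum of rank-one terms.
cp = parafac(W, rank=16)
err_cp = relative_error(W, tl.cp_to_tensor(cp))

# Tucker: small core tensor with a factor matrix per mode.
core, factors = tucker(W, rank=[32, 16, 3, 3])
err_tucker = relative_error(W, tl.tucker_to_tensor((core, factors)))

# Tensor Train: chain of 3-way cores (boundary ranks are 1).
tt = tensor_train(W, rank=[1, 16, 9, 3, 1])
err_tt = relative_error(W, tl.tt_to_tensor(tt))

print(f"CP: {err_cp:.3f}  Tucker: {err_tucker:.3f}  TT: {err_tt:.3f}")
```

Under Assumption 1, such errors, computed for every candidate layer, decomposition method, and compression level, would be used to rank the candidates without re-evaluating the full model for each option.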

2. RELATED WORK

There is currently no systematic study of how well the approximation error relates to a compressed neural network's performance across multiple choices of network layers, TD methods, and compression levels. Here we review the most similar and related studies, distinguishing works with theoretical versus empirical validation, different approximation error measures, and the role of fine-tuning after compression.

The relationship between the approximation error on the weights and the performance of the model has been studied through theoretical analysis for matrix decompositions. Liebenwein et al. (2021) derive bounds on the model performance for SVD-based compression of the convolutional layers, and thus motivate that the SVD approximation error is a good proxy for the compressed model's performance. Arora et al. (2018) derive bounds on the generalization error for convolutional layers based on a compression error from their matrix projection algorithm. Baykal et al. (2019) show how the amount of sparsity introduced in a model's layers relates to its generalization performance. While these works show that some theoretical bounds can be found for specific compression methods, such bounds are not available for TDs in general.

Other works therefore study the relationship for TD empirically. For instance, Lebedev et al. (2015) show how the CP decomposition rank affects the approximation error, and the resulting accuracy drop as the rank is decreased. Hawkins et al. (2022) observe that, for networks with repeated layer blocks, the approximation error depends on the convolutional layers within the block.

When considering the model's final task performance, the approximation error on the weights might not be the most relevant measure. To consider the effect on the actual data distribution, Jaderberg et al. (2014) instead propose to compute an error on the approximated output features of a layer after its weights have been compressed. They found that compressing weights by minimizing the error on the features, rather than the error on the weights, results in a smaller loss in classification accuracy.
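To illustrate the distinction, the sketch below contrasts the weight-based error with the feature-based error for a single convolutional layer; this is not the method of Jaderberg et al. (2014). A truncated SVD of the matricized kernel merely stands in for an arbitrary decomposition, and the layer sizes, rank, and random calibration batch are illustrative assumptions.

```python
# Minimal sketch (illustrative only): contrast the approximation error on the
# weights with the error on the output features of a compressed convolutional
# layer. A truncated SVD of the matricized kernel stands in for an arbitrary
# tensor decomposition.
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
W = conv.weight.data                                  # shape (128, 64, 3, 3)

# Low-rank approximation of the kernel matricized as (out_channels, in*kh*kw).
rank = 32
M = W.reshape(W.shape[0], -1)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
M_hat = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

conv_hat = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
conv_hat.weight.data = M_hat.reshape(W.shape)

# Approximation error on the weights (relative Frobenius norm).
weight_err = (torch.norm(W - conv_hat.weight.data) / torch.norm(W)).item()

# Approximation error on the features, measured on a calibration batch
# (random data here; in practice this would be drawn from the training set).
x = torch.randn(16, 64, 32, 32)
with torch.no_grad():
    y, y_hat = conv(x), conv_hat(x)
feature_err = (torch.norm(y - y_hat) / torch.norm(y)).item()

print(f"weight error: {weight_err:.3f}, feature error: {feature_err:.3f}")
```

The feature-based error depends on the data passed through the layer, which is what makes it a candidate for a proxy that explicitly takes the data distribution into account.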
However, Jaderberg et al. (2014) do not fine-tune the decomposed model, and only use a toy model with few layers. Denil et al. (2014) try to capture the information from the data via the empirical

