SUCCINCT COMPRESSION: LOSSLESS COMPRESSION FOR FAST AND MEMORY-EFFICIENT DEEP NEURAL NETWORK INFERENCE

Abstract

This paper introduces "Succinct Compression", a method that provides lossless compression of Deep Neural Network (DNN) models for fast and memory-efficient inference. The key idea is to leverage Succinct Data Structures, which support fast queries without decompressing the compressed representations. Our method builds on three new insights. First, we introduce two basic building blocks to formulate DNN models, and show how they can be extended to work synergistically with compressed models (e.g. pruned or quantized models). Second, we propose a scheme that enables mixed-formulation inference across different layers, to better extract the benefits of each formulation. Finally, our method exploits a specialized execution pipeline that incorporates the different model formulations for fast inference. We quantitatively demonstrate that our method can: (1) enable faster and more memory-efficient inference on uncompressed models; (2) work synergistically with a variety of structure-altered/unaltered compression schemes, improving both speedup and compression ratio while preserving accuracy; and (3) outperform all other state-of-the-art Model Coding approaches.

1. INTRODUCTION

Efforts toward Pareto improvements of compressed Deep Neural Network (DNN) models, in terms of inference time, space consumption and accuracy, have recently bloomed due to the great success of DNNs in practice. Prior works either aggressively simplify/optimize the structure of DNN models (e.g. Pruning and Neural Architecture Search) or retrench the representation of model parameters (e.g. Quantization and Model Coding), with a major focus on the compression ratio and the accuracy. Despite this variety of methodologies for efficient compression, there is still no general method to further optimize the inference performance and compression ratio without affecting the accuracy of both uncompressed and compressed models. This paper introduces "Succinct Compression", a method that provides lossless compression of DNN models for fast and memory-efficient inference. The emphasis of our method is to enhance the inference performance and compression ratio at the same time, without affecting accuracy, for a variety of classes of uncompressed and compressed models. The unique characteristic of our method is to exploit Succinct Data Structures, which enable fast queries without decompressing the compressed representations. We consolidate three new insights to better incorporate Succinct Data Structures. ➊ We propose two semi-structured formulations that represent DNN models in element-wise or block-wise manners, and provide simple extensions that allow them to be combined with other compression techniques. ➋ We enable mixed formulations across different layers of the model, to better extract the potential of Succinct Data Structures. ➌ We design a specialized execution pipeline that performs inference on the different formulations, by carefully engineering the inner operators of Succinct Data Structures.
Our evaluation shows that our method is very effective for inference efficiency, and generally applicable to uncompressed and compressed models (including ResNet-50, ResNet-101, VGG-16, MobileNet-V2 and DeiT-B). For uncompressed models, our method achieves up to 1.07× speedup and 1.17× compression ratio at the same time, without affecting the accuracy. We then show that our method brings significantly more benefits when combined with other compression schemes, where all models are pre-processed via those methods. For instance, combined with structure-altered compression (such as pruning), our method enables up to 8.8× inference acceleration on ResNet-101, with a 39.90× compression ratio. Similarly, the speedup can be further enhanced to 9.3× by incorporating structure-unaltered methods (such as quantization). We also compare our method with a variety of state-of-the-art Model Coding schemes, and show that it outperforms all of them.
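The central mechanism the method relies on, answering queries on a compressed representation without first decompressing it, can be illustrated with a toy succinct bit-vector over a sparse weight vector. This is only a minimal sketch of the general idea; the class and field names are illustrative assumptions, not the paper's actual data layout.

```python
# Toy sketch of a succinct occupancy bitmap with sampled rank counts:
# any weight of a sparse layer can be fetched directly from the
# compressed form (bitmap + packed nonzeros), with no decompression pass.

class SuccinctSparseVector:
    BLOCK = 64  # sample the rank every 64 positions

    def __init__(self, dense):
        self.n = len(dense)
        self.bits = [1 if x != 0 else 0 for x in dense]   # occupancy bitmap
        self.values = [x for x in dense if x != 0]        # nonzeros only
        # Precomputed ranks: number of 1-bits strictly before each block.
        self.ranks = [0]
        for b in range(0, self.n, self.BLOCK):
            self.ranks.append(self.ranks[-1] + sum(self.bits[b:b + self.BLOCK]))

    def rank1(self, i):
        """Number of 1-bits in bits[0:i], using the sampled counts."""
        b = i // self.BLOCK
        return self.ranks[b] + sum(self.bits[b * self.BLOCK:i])

    def get(self, i):
        """Weight at dense index i, read straight from compressed storage."""
        if self.bits[i] == 0:
            return 0
        return self.values[self.rank1(i)]

dense = [0, 0, 1.5, 0, -2.0, 0, 0, 0.25]
sv = SuccinctSparseVector(dense)
assert [sv.get(i) for i in range(len(dense))] == dense
```

With real succinct structures the bitmap and rank samples add only o(n) bits on top of the entropy-bounded payload, which is what makes the compressed form directly usable during inference.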

2. RELATED WORKS

A large body of relevant work on compressing DNN models falls into two categories, based on the orientation of the methodology: structure-altered and structure-unaltered methods. We outline the key directions in each category, briefly describe their features, and position the novelty of our method.

2.1. STRUCTURE-ALTERED METHODS FOR COMPRESSION

Structure-altered methods compress DNN models by simplifying/optimizing the model architecture; representative methods in this direction include Pruning, Low-Rank Factorization, Neural Architecture Search (NAS) and Knowledge Distillation (KD). We describe each of them briefly as follows. ➊ Pruning removes redundant connections within DNN models without incurring a considerable degradation of accuracy. There are two categories of Pruning. One is Unstructured Pruning (Dong et al. (2017); Lee et al. (2019); XIAO et al. (2019); Park* et al. (2020); Blalock et al. (2020)), which aggressively removes neurons with small relevance whenever possible. Though such an approach can deliver a decent compression ratio with only a marginal degradation of accuracy, inference suffers from inefficient memory usage due to the frequent operations on sparse matrices (Gale et al. (2019)). The other is Structured Pruning (Huang & Wang (2018); Lin et al. (2018); Yu et al. (2018); He et al. (2019); Zhao et al. (2019); Yu et al. (2021)), which only removes irrelevant units of DNN models at the granularity of elementary structures (e.g. weights, filters and layers). Though these methods can benefit the performance/compression ratio by reducing the total computational cost, the accuracy is usually not as good as expected. ➋ Low-Rank Factorization (Mamalet & Garcia (2012); Sainath et al. (2013); Zhao et al. (2017); Li et al. (2018)) uncovers the latent compact structure of the network through low-rank matrix factorization of weight layers. Though these approaches may only incur a marginal degradation of accuracy, they require extra computational cost, and their memory-efficiency benefits are not consistent across models. ➌ NAS (Mellor et al. (2021); Zhao et al. (2021)) automatically generates neural network architectures by applying specific search strategies to a large search space. A huge amount of extra computation is therefore required, and such methods must be performed before deployment of the selected models. ➍ KD (Feng et al. (2021); Wang (2021); Zhu et al. (2021)) trains a large model and then uses it as a teacher to train a more compact model. Similar to NAS, KD demands a large amount of extra computation for training the different models, so it is usually performed offline. In this work, we consider Pruning as the representative method in this direction, to justify the compatibility of our method with structure-altered methods (as described in Section 7.2).

2.2. STRUCTURE-UNALTERED METHODS FOR COMPRESSION

Structure-unaltered methods compress DNN models without altering the model architecture; the two representative methods in this direction are Quantization and Model Coding. ➊ Quantization reduces the bit-width of parameters within DNN models, either via quantization-aware training (Bengio et al. (2013); Alizadeh et al. (2020)) or post-training quantization (Banner et al. (2019); Cai et al. (2020)). Note that it is also feasible to perform extreme quantization (e.g. binarization) for this purpose (Cai et al. (2017); Bulat et al. (2021)), but this usually incurs a significant degradation of accuracy.
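The quantization idea discussed above, shrinking parameter bit-widths while leaving the network topology untouched, can be sketched with a minimal symmetric post-training int8 scheme. This is a generic textbook-style illustration under assumed names, not the scheme evaluated in this paper.

```python
# Minimal sketch of symmetric post-training quantization to int8:
# float weights are mapped to small integers plus one scale factor,
# so storage shrinks while the model structure stays unchanged.

def quantize_int8(weights):
    """Map float weights to int8 codes plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale 0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by about half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

The per-tensor scale here is the simplest choice; practical schemes often use per-channel scales and calibration data to keep the accuracy loss marginal.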

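The contrast between the two pruning families described in Section 2.1 can likewise be made concrete: unstructured pruning zeroes individual small weights and yields a sparse matrix of the same shape, while structured pruning drops whole elementary units (here, rows standing in for filters) and yields a smaller dense matrix. Thresholds, shapes and function names are illustrative assumptions.

```python
# Illustrative contrast of the two pruning families:
# unstructured pruning -> same shape, scattered zeros (sparse kernels needed);
# structured pruning   -> fewer rows, still dense (regular kernels suffice).

def unstructured_prune(matrix, threshold):
    """Zero out individual weights whose magnitude is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]

def structured_prune(matrix, keep):
    """Keep only the `keep` rows with the largest L1 norm (whole-unit removal)."""
    order = sorted(range(len(matrix)), key=lambda i: -sum(abs(w) for w in matrix[i]))
    kept = sorted(order[:keep])
    return [matrix[i] for i in kept]

W = [[0.9, 0.01, -0.8],
     [0.02, 0.03, 0.01],
     [-0.7, 0.6, 0.05]]

sparse = unstructured_prune(W, 0.1)   # 3x3, with zeros scattered inside
small = structured_prune(W, keep=2)   # 2x3, rows with the largest norms
assert len(sparse) == 3 and len(small) == 2
```

The scattered zeros in the unstructured result are exactly what causes the inefficient memory access noted in Section 2.1, and what a succinct representation of the sparse tensor aims to mitigate.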
