SUCCINCT COMPRESSION: LOSSLESS COMPRESSION FOR FAST AND MEMORY-EFFICIENT DEEP NEURAL NETWORK INFERENCE

Abstract

This paper introduces "Succinct Compression", a method that provides lossless compression of Deep Neural Network (DNN) models for fast and memory-efficient inference. The key insight of our method is to leverage Succinct Data Structures, which support fast queries without decompressing the compressed representations. Our method consists of three components. First, we introduce two basic building blocks to formulate DNN models, and show how they can be extended to work synergistically with compressed models (e.g., pruned or quantized models). Second, we propose a scheme that enables mixed-formulation inference across different layers, to better exploit the benefits of each formulation. Finally, our method employs a specialized execution pipeline that incorporates the different model formulations for fast inference. We quantitatively demonstrate that our method (1) enables faster and more memory-efficient inference on uncompressed models; (2) is synergistic with a variety of structure-altering and structure-preserving compression schemes, delivering better speedups and compression ratios while preserving accuracy; and (3) outperforms all other state-of-the-art Model Coding approaches.

1. INTRODUCTION

Efforts toward Pareto improvements of compressed Deep Neural Network (DNN) models, across inference time, space consumption, and accuracy, have bloomed recently due to the great success of DNNs in practice. Prior works either aggressively simplify/optimize the structure of DNN models (e.g., Pruning and Neural Architecture Search) or retrench the representation of model parameters (e.g., Quantization and Model Coding), with a major focus on compression ratio and accuracy. Despite this variety of compression methodologies, a general method is still lacking that further optimizes inference performance and compression ratio without affecting the accuracy of either uncompressed or compressed models. This paper introduces "Succinct Compression", a method that provides lossless compression of DNN models for fast and memory-efficient inference. The emphasis of our method is to enhance inference performance and compression ratio simultaneously, without affecting accuracy, for a variety of classes of uncompressed and compressed models. The unique characteristic of our method is its use of Succinct Data Structures, which enable fast queries without decompressing the compressed representations. We consolidate three new insights to incorporate Succinct Data Structures effectively. ➊ We propose two semi-structured formulations that represent DNN models in element-wise or block-wise manners, and provide simple extensions that allow them to be combined with other compression techniques. ➋ We enable mixed formulations across different layers of a model, to better extract the potential of Succinct Data Structures. ➌ We design a specialized execution pipeline that performs inference on the different formulations, by carefully engineering the inner operators of Succinct Data Structures.
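To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of the basic query primitive behind many succinct data structures: a bit vector with precomputed rank support. A rank query counts set bits in a prefix without ever materializing an uncompressed copy, which is how a bitmap-plus-packed-values encoding of a sparse weight matrix can be queried in place. The class name, block size, and the sparse-matrix usage below are illustrative assumptions.

```python
# Illustrative sketch of a succinct-data-structure query primitive.
# Not the paper's implementation; names and parameters are hypothetical.

class RankBitVector:
    """Bit vector with fast rank queries via block-level prefix counts."""

    BLOCK = 64  # bits per block (illustrative choice)

    def __init__(self, bits):
        self.bits = list(bits)  # sequence of 0/1 ints
        # Precompute cumulative 1-counts at block boundaries.
        self.block_ranks = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            block = self.bits[start:start + self.BLOCK]
            self.block_ranks.append(self.block_ranks[-1] + sum(block))

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.BLOCK
        return self.block_ranks[b] + sum(self.bits[b * self.BLOCK:i])

    def rank0(self, i):
        """Number of 0-bits in bits[0:i]."""
        return i - self.rank1(i)


# Usage: a sparse weight row encoded as (bitmap of nonzeros + packed values).
# rank1 maps a flat index directly to its slot in the packed value array,
# so the nonzero at flat index 4 is fetched without decompression.
bitmap = [0, 1, 0, 0, 1, 1, 0, 1]
values = [3.0, -1.5, 0.25, 2.0]  # one value per 1-bit in the bitmap
bv = RankBitVector(bitmap)
print(values[bv.rank1(4)])  # → -1.5
```

The design point this illustrates is that the auxiliary rank index adds only a small, sublinear overhead on top of the compressed bitmap, yet turns position lookups into constant-time arithmetic rather than a scan over a decompressed array.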
Our evaluation shows that our method is highly effective for inference efficiency, and generally applicable to uncompressed and compressed models (including ResNet-50, ResNet-101, VGG-16, MobileNet-V2, and DeiT-B). For uncompressed models, our method achieves up to a 1.07× speedup and a 1.17× compression ratio at the same time, without affecting accuracy. We then show that our method brings significantly more benefits when combined with other compression schemes.

