HARDWARE-AWARE COMPRESSION WITH RANDOM OPERATION ACCESS SPECIFIC TILE (ROAST) HASHING

Abstract

Advancements in deep learning are often associated with increasing model sizes. Training and deploying large models require sophisticated hardware and incur significantly higher costs. Model compression is therefore a widely explored approach to this problem. However, SOTA techniques fall short in one or more desirable aspects of compression: for instance, pruning does not reduce memory during training, quantization can provide at most 32× compression, and HashedNet is cache-inefficient. This paper proposes a model-agnostic, cache-friendly, and hardware-aware model compression approach: Random Operation Access Specific Tile (ROAST) hashing. ROAST collapses the parameters by clubbing them through a lightweight mapping, and while doing so it exploits cache hierarchies by aligning the memory access pattern with the parameter access pattern. ROAST is up to ∼25× faster to train and ∼50× faster to infer than the popular parameter-sharing method HashedNet. Additionally, ROAST introduces global weight sharing, which is empirically and theoretically superior to the local weight sharing in HashedNet and may be of independent interest. With ROAST, we can efficiently train and deploy models with a much smaller memory footprint (∼10–100× smaller) on text and image classification tasks.

1. INTRODUCTION

Models across different domains, including Natural Language Processing (NLP), Computer Vision (CV), and Information Retrieval (IR), are exploding in size. State-of-the-art (SOTA) results in these domains are obtained at a disproportionate increase in model size, calling the sustainability of deep learning into question (Thompson et al., 2021). For instance, SOTA architectures for vision include VGG (Simonyan & Zisserman, 2014) (150M params, 0.6GB) and ViT (Dosovitskiy et al., 2020) (up to 304M params, 1.2GB). SOTA NLP models range from BERT (Devlin et al., 2018) (340M params, 1.36GB) to GShard (Lepikhin et al., 2020) (600B params, 2.4TB). Similarly, industrial-scale recommendation models such as DLRM (Naumov et al., 2019; Mudigere et al., 2021) can have up to tens of trillions of parameters (50TB).

Large models such as the above come with various challenges. They need high-end distributed hardware for training and deployment, incurring higher costs, and the required model-parallel setup increases inference and training-iteration latency. Model compression is a research direction that aims to resolve these issues by reducing the memory footprint of the model. Compression of the order of 100× can eliminate the need for a model-parallel setup for many SOTA models, such as GPT (Radford et al., 2019), GShard (Lepikhin et al., 2020), and DLRM (Naumov et al., 2019), which can then fit on a single GPU. Furthermore, compressing large models to small sizes comes with immediate latency benefits. For example, Desai et al. (2022) showed that by compressing the DLRM model 1000× and using 1 GPU instead of 8 GPUs, we could get 3× faster inference at a lower cost. A smaller model is also more efficient for CPU inference: Diamos et al. (2016) showed that if a single RNN layer fits in registers, inference is 146× faster. Thus, the ML community has heavily invested in model compression.
A variety of model compression paradigms now exist in the literature, such as pruning (Han et al., 2016b), quantization (Han et al., 2016b), knowledge distillation (Buciluǎ et al., 2006), parameter sharing (Chen et al., 2015; Desai et al., 2022), and low-rank decomposition (Hrinchuk et al., 2020; Yin et al., 2021).

We show that with ROAST, we can train a BERT-2-2 (2 layers, 2 attention heads) model on the largest available text-classification datasets (amazon-polarity, yelp-polarity) using 100× less memory without loss of quality. In cases where the model is over-parameterized, such as using BERT-12-12 on the text classification task above, we can still obtain a similar compression of 100×; ROAST is thus a good alternative to neural architecture search. The results extend to CV datasets as well: we can train a ResNet-9 model with 10× less memory on the CIFAR-10 dataset. Importantly, due to its hardware-aware nature, ROAST is significantly faster than HashedNet: up to ∼25× faster to train and ∼50× faster to infer for large matrix multiplications. Our current implementation of ROAST matrix multiplication is about 1.34× slower than full matrix multiplication in PyTorch, a testament to how optimized the cuBLAS libraries are. We believe that, with further engineering, ROAST-MM can be made comparably efficient to PyTorch-MM as well.

Limitations of ROAST: One of the goals of model compression, apart from reducing memory usage, is to reduce the computational workload for deployment. ROAST is currently not devised to decrease computation; it only decreases the memory footprint of a model. Reducing computation along with memory is left for future work. However, it should be noted that reducing the memory footprint can itself significantly reduce computation latency and power consumption.
As shown in (Han et al., 2016a), accessing memory from RAM is 6400× costlier than a 32-bit integer ADD and 128× more expensive than an on-chip SRAM access in terms of energy consumption. Additionally, a RAM access is generally ∼100× slower than a floating-point operation. Thus, model compression with ROAST is, in our opinion, an important step toward efficient training and inference.
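To make the tile-based mapping concrete, the following is a minimal sketch, not the paper's exact scheme: the tile size, the hash function, and the name `roast_tile_lookup` are our own illustrative choices. The key property it demonstrates is that a whole contiguous tile of a weight matrix maps to a contiguous block of one shared parameter array, so reads stay sequential rather than scattered.

```python
import numpy as np

def roast_tile_lookup(shared, op_id, shape, tile=64, seed=0):
    """Materialize one operation's weight tensor from a shared 1-D array.

    Each contiguous tile of the flattened weight is mapped, via a hash of
    (op_id, tile index), to a contiguous block of `shared`. The hash and
    constants below are illustrative placeholders, not ROAST's actual mapping.
    """
    n = int(np.prod(shape))
    out = np.empty(n, dtype=shared.dtype)
    rng_base = hash((op_id, seed))  # deterministic for integer tuples
    n_tiles = (n + tile - 1) // tile
    for t in range(n_tiles):
        # Hash (op_id, t) to a block start; assumes len(shared) > tile.
        start = (rng_base * 1000003 + t * 7919) % (len(shared) - tile)
        length = min(tile, n - t * tile)
        out[t * tile : t * tile + length] = shared[start : start + length]
    return out.reshape(shape)

# A 10k-float compressed store backing a 32k-parameter weight matrix.
shared = np.random.randn(10_000).astype(np.float32)
W = roast_tile_lookup(shared, op_id=3, shape=(256, 128))
print(W.shape)  # (256, 128)
```

Because every operation draws its tiles from the same pool, this also illustrates the global weight sharing mentioned above: the compressed memory size (here 10,000 floats) is chosen independently of any layer's shape.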

2. RELATED WORK

This section briefly reviews the rich history of model compression paradigms. Model compression can be generally classified into two categories: (1) Compressing a learned model and (2) Learning a compressed model. ROAST lies in the second category.



Table 1 compares these approaches on three considerations: (1) whether model memory is reduced during training, (2) whether the memory size can be controlled independently of the model, and (3) whether the approach considers the underlying hardware.

Table 1: Various compression techniques on three aspects: (1) memory reduction during training (apart from inference), (2) arbitrary control over memory, (3) hardware awareness / cache efficiency. (*Some versions of pruning are tuned to the underlying hardware and are cache-efficient.)

We want techniques to fare positively on all three aspects. However, techniques like pruning, QAT, and knowledge distillation require the memory of the original model during training and only reduce inference-time memory. Additionally, there are limits to the compression obtainable by quantization and pruning, depending on which component is being compressed. For example, we cannot prune an embedding table (N × d) by more than d×, as we do not want any embedding vector to be all zeros. HashedNet provides memory reduction during training and arbitrary control over memory; however, its look-ups are randomly and independently distributed across the total memory, which makes HashedNet cache-inefficient.
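The cache-efficiency gap can be illustrated with a back-of-the-envelope experiment (a sketch with made-up sizes, not the paper's implementation): count how many distinct cache lines each scheme's look-ups touch when materializing one layer's weights from a shared pool.

```python
import numpy as np

MEM = 4096   # shared parameter pool size (floats)
N = 1024     # weights needed by one layer
LINE = 16    # floats per 64-byte cache line (4-byte floats)

# HashedNet: every weight hashes independently -> scattered reads.
hashednet_idx = np.random.default_rng(0).integers(0, MEM, size=N)

# ROAST-style: whole tiles map to contiguous blocks -> sequential reads.
tile = 64
starts = np.random.default_rng(1).integers(0, MEM - tile, size=N // tile)
roast_idx = np.concatenate([np.arange(s, s + tile) for s in starts])

def lines_touched(idx):
    """Number of distinct cache lines covered by these look-ups."""
    return len(np.unique(idx // LINE))

# ROAST touches far fewer distinct cache lines for the same N look-ups.
print(lines_touched(hashednet_idx), lines_touched(roast_idx))
```

Per-element hashing touches close to one cache line per weight, while tile-based hashing amortizes each line over many consecutive weights; this is the access-pattern difference behind the training and inference speedups reported above.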

