HARDWARE-AWARE COMPRESSION WITH RANDOM OPERATION ACCESS SPECIFIC TILE (ROAST) HASHING

Abstract

Advancements in deep learning are often associated with increasing model sizes. Training and deploying large models require sophisticated hardware and incur significantly higher costs. Thus, model compression is a widely explored approach to solving the problem. However, SOTA techniques fall short in one or more desirable aspects of compression: for instance, pruning does not reduce memory for training, quantization can only provide up to 32× compression, HashedNet is cache-inefficient, etc. This paper proposes a model-agnostic, cache-friendly, and hardware-aware model compression approach: Random Operation Access Specific Tile (ROAST) hashing. ROAST collapses the parameters by clubbing them through a lightweight mapping. While clubbing these parameters, ROAST utilizes cache hierarchies by aligning the memory access pattern with the parameter access pattern. ROAST is up to ∼25× faster to train and ∼50× faster to infer than the popular parameter-sharing method HashedNet. Additionally, ROAST introduces global weight sharing, which is empirically and theoretically superior to the local weight sharing in HashedNet and can be of independent interest. With ROAST, we can efficiently train and deploy models using a much smaller memory footprint (∼10-100× smaller) in text and image classification tasks.

Under review as a conference paper at ICLR 2023

Table 1: Various compression techniques on three aspects: (1) memory reduction during training (apart from inference), (2) arbitrary control over memory, (3) hardware awareness / cache efficiency. * Some versions of pruning are tuned to the underlying hardware and are cache-efficient.

                         | Training memory reduction | Arbitrary control on memory | Cache efficient
Pruning                  | No                        | No                          | No*
Low-rank decomposition   | Yes                       | No                          | Yes
Low-precision            |                           |                             |

1. INTRODUCTION

Models across different domains, including Natural Language Processing (NLP), Computer Vision (CV), and Information Retrieval (IR), are exploding in size. State-of-the-art (SOTA) results in these domains are being obtained at a disproportionate increase in model sizes, questioning the sustainability of deep learning (Thompson et al., 2021). For instance, SOTA architectures for vision include VGG (Simonyan & Zisserman, 2014) (150M params, 0.6GB) and ViT (Dosovitskiy et al., 2020) (up to 304M params, 1.2GB). SOTA NLP models range from BERT (Devlin et al., 2018) (340M params, 1.36GB) to GShard (Lepikhin et al., 2020) (600B params, 2.4TB). Similarly, industrial-scale recommendation models such as DLRM (Naumov et al., 2019; Mudigere et al., 2021) can have up to tens of trillions of parameters (50TB).

Large models such as these come with various challenges. They need high-end distributed hardware for training and deployment, incurring higher costs, and the required model-parallel setup has higher inference and training-iteration latency. Model compression is a research direction that aims to resolve these issues by reducing the memory footprint of the model. Compression on the order of 100× can eliminate the need for a model-parallel setup for many SOTA models like GPT (Radford et al., 2019), GShard (Lepikhin et al., 2020), and DLRM (Naumov et al., 2019), which can then fit on a single GPU. Furthermore, compressing large models to small sizes comes with immediate latency benefits. For example, Desai et al. (2022) showed that by compressing the DLRM model 1000× and using 1 GPU instead of 8 GPUs, we can get 3× faster inference at a lower cost. Smaller models are also more efficient for CPU inference: Diamos et al. (2016) showed that if a single RNN layer can fit in registers, inference becomes 146× faster. Thus, the ML community has heavily invested in model compression.
A variety of model compression paradigms now exist in the literature, such as pruning (Han et al., 2016b), quantization (Han et al., 2016b), knowledge distillation (Buciluǎ et al., 2006), parameter sharing (Chen et al., 2015; Desai et al., 2022), and low-rank decomposition (Hrinchuk et al., 2020; Yin et al., 2021). Table 1 compares these approaches on three considerations: (1) whether the model memory is reduced for training, (2) whether the memory size can be controlled independently of the model, and (3) whether the approach considers the underlying hardware.

We show that with ROAST, we can train a BERT-2-2 (2 layers, 2 attention heads) model on the largest available text-classification datasets (amazon-polarity, yelp-polarity) using 100× less memory without loss of quality. In cases where the model is overly parameterized, as in the text-classification task above, we can still obtain a similar compression of 100×; thus ROAST is a good alternative to neural architecture search. The results extend to CV datasets as well: we can train a ResNet-9 model with 10× less memory on the CIFAR-10 dataset. Importantly, we show that ROAST, due to its hardware-aware nature, is significantly faster than HashedNet: up to ∼25× faster to train and ∼50× faster to infer for large matrix multiplications. Our current implementation of ROAST matrix multiplication is about 1.34× slower than full matrix multiplication in PyTorch, a testament to how optimized the cuBLAS libraries are. We believe that, with enough investigation, ROAST-MM can be made comparably efficient to PyTorch-MM as well.

Limitations of ROAST: One of the goals of model compression, apart from reducing memory usage, is to reduce the computational workload for deployment. ROAST is currently not devised to decrease computation; it only decreases the memory footprint of a model. Reducing computation with a small memory is left for future work.
However, it should be noted that reducing the memory footprint can itself significantly reduce latency and power consumption. As shown in (Han et al., 2016a), accessing RAM is 6400× costlier than a 32-bit integer ADD and 128× more expensive than an on-chip SRAM access in terms of energy consumption. Additionally, a RAM access is generally ∼100× slower than a floating-point operation. So model compression with ROAST is, in our opinion, an important step toward efficient training and inference.

2. RELATED WORK

This section briefly reviews the rich history of model compression paradigms. Model compression can be broadly classified into two categories: (1) compressing a learned model and (2) learning a compressed model. ROAST lies in the second category.

Compressing learned models: 1) Pruning: Pruning (Zhu & Gupta, 2017) removes parts of a large model, including weights, blocks, and layers, to make the model lighter. Pruning can be performed as a one-time operation or gradually, interspersed with training. 2) Quantization: Quantization reduces the precision of the parameters of a model; mixed-precision models use different precisions for different weights. KMeans quantization is another variant, where the model's weights are clustered using KMeans and each cluster's centroid is used for all weights in the cluster; compression here comes from reducing the number of distinct weights. 3) Knowledge distillation: Knowledge distillation (Buciluǎ et al., 2006) is widely applied in model compression with a focus on distilled architectures: a teacher model (the large original model) is trained, and then a student model is trained using the logits of the teacher. Empirically, the student trained under this paradigm generalizes better than the same student trained standalone, and many variations exist on this basic idea. While these techniques have successfully reduced memory for inference, a drawback of this line of compression is that memory usage while training the model is not reduced. ROAST, however, reduces the model's memory during both training and inference.

Learning compressed models: 1) Low-rank decomposition: Matrices in the model are decomposed into a product of two low-rank matrices, saving memory per matrix.
A generalization of low-rank decomposition to tensors is called tensor-train decomposition. 2) Parameter sharing: Parameter-sharing approaches such as HashedNet (Chen et al., 2015) are generally used for matrix compression; they randomly share weights among different parameters, reducing the model's memory usage. This line of research reduces the model size even during training. However, low-rank decomposition does not offer arbitrary control over the memory footprint, and HashedNets are inefficient due to heavy cache-thrashing caused by non-local lookups. In contrast, ROAST is a model-agnostic parameter-sharing approach that can arbitrarily reduce the model size without affecting the functional form while keeping model recovery efficient.

3. BACKGROUND

HashedNet: Compressing MLP matrices. Previous work (Chen et al., 2015) introduced a weight-sharing method to compress the weight matrices of MLP models. It maps each matrix parameter to a shared parameter array using the random hash function xxhash (Collet, 2016). In the forward pass, this mapping is used to recover a weight matrix and perform matrix multiplication for each MLP layer. In the backward pass, the gradients of each weight matrix are mapped to the shared compressed array and aggregated using the sum operation. Each MLP layer uses an independent array of parameters. One of the main concerns with HashedNet is that memory accesses on the compressed array are non-coalesced: fetching a compressed matrix via HashedNet requires significantly more memory-read transactions than fetching an uncompressed matrix, for which memory accesses can coalesce. Our evaluation shows that uncoalesced memory accesses lead to high latency, especially for large matrices.

Random Block Offset Embedding Array (ROBE) for embedding compression. In ROBE (Desai et al., 2022), the embedding table is generated from an array of parameters. The embedding of a token is obtained by drawing chunks of the embedding from the ROBE array; the locations of the chunks are decided randomly via lightweight universal hash functions. The authors of ROBE showed that ROBE hashing is theoretically superior to the feature hashing used in HashedNet. Also, the use of chunks causes memory accesses to coalesce, making embedding lookup efficient. ROAST proposes a component-agnostic, global parameter-sharing approach that tunes the hash function to match the memory accesses of the operation's algorithmic implementation on the available hardware, giving a superior parameter-sharing scheme.
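To make the coalescing concern concrete, the following is an illustrative sketch of HashedNet-style element-wise hashing. A seeded RNG stands in for xxhash (an assumption for brevity); the point is that each recovered element lands at an unrelated offset in the shared array, so the reads cannot coalesce.

```python
import numpy as np

def hashednet_lookup(memory, shape, seed=0):
    """Recover a weight matrix by hashing each element independently into a
    shared parameter array (HashedNet-style). The per-element indices are
    scattered across `memory`, which is why such reads do not coalesce."""
    rng = np.random.default_rng(seed)            # stand-in for xxhash
    idx = rng.integers(0, len(memory), size=shape)
    sign = rng.choice([-1.0, 1.0], size=shape)   # sign hash, as in feature hashing
    return sign * memory[idx]

memory = np.arange(8, dtype=np.float64)
W = hashednet_lookup(memory, (4, 4))
```

Every entry of `W` is some (signed) entry of the 8-element shared array; adjacent entries of `W` generally come from non-adjacent memory locations.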

4. RANDOM OPERATION ACCESS SPECIFIC TILE (ROAST) HASHING

Let M be the compressed memory from which parameters will be drawn, f the model (the function we want to run using M), and W the recovered weights used in f. f can be considered a composition of operations {O_i(X_i, W_i)}: by operation, we mean the smaller functions that, composed together, give us the model f. Here X_i is the input to the operation and W_i the weights (i.e., learnable parameters) that O_i uses. Generally, the W_i are distinct and do not share parameters. ROAST has three key aspects: (1) ROAST is a generic technique applicable to all computational modules. (2) ROAST tunes its mapping from W_i to M so that memory accesses coalesce according to how memory is accessed during the operation; this makes ROAST efficient, up to 45× faster than competing approaches like HashedNet. (3) ROAST proposes Global Memory Sharing (GMS) as opposed to the Local Memory Sharing (LMS) used in HashedNet; we show GMS to be theoretically and empirically superior to LMS in Sections 5 and 6.

4.1. ROAST OPERATIONS IN DEEP LEARNING

Any model f can be considered a composition of smaller functions {O_i(X_i, W_i)}. There are multiple ways to perform this decomposition, depending on what we consider a valid (or small enough) operation. In ROAST, we consider three types of operations: (1) L(l, W), a lookup that accesses M and recovers the l-th element of W, say w; by element, we mean some particular part of W identifiable by an integer (an example with embedding tables is given in Figure 1); (2) MM(X, W), a matrix multiplication that multiplies X with W and returns the result; and (3) N(X), operations that act only on the input and do not interact with M. To limit memory usage, ROAST ensures that L is used only on small w and that MM is performed without recovering the entire matrix. We find that most deep learning models, if not all, can be written as a composition of the operations N, MM, and L, where L is applied only to small parameters. We discuss how ROAST implements the L and MM operations in the following paragraphs.

Lookup (L(l, W)): We recover a parameter weight w of any shape in row-major format; thus, we can consider w = W(l) to be a 1D vector without loss of generality. ROAST recovers w from M in a blocked fashion. Consider w to be composed of chunks of size Z. Each chunk c is located in M using a universal hash function h_1 and is recovered from location h_1(c) in M. Let C(i) give the chunk number of index i and O(i) the offset of i within this chunk. Then

w[i] = λ · M[h_1(C(i)) + O(i)],   h_1 : N → {0, ..., |M| − Z}   (1)

The recovered w carries the scaling factor λ discussed in Section 4.2. The hash function hashes to the range {0, ..., |M| − Z} to avoid overflows while reading the memory. Figure 1 (right) illustrates the embedding lookup using L with a chunk size of 2. ROAST uses L to implement computational modules such as embeddings, bias vectors, and so on.
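A minimal NumPy sketch of the chunked lookup in Eq. (1). The hash family `universal_hash` and the chunk size Z are illustrative stand-ins for the lightweight universal hash the paper assumes; the point is that indices sharing a chunk read consecutive memory.

```python
import numpy as np

CHUNK = 4  # chunk size Z

def universal_hash(c, m, a=1000003, b=12345, p=2**61 - 1):
    """Illustrative universal hash h1 : N -> {0, ..., m - Z}."""
    return ((a * c + b) % p) % (m - CHUNK + 1)

def roast_lookup(memory, n, lam=1.0):
    """Recover a length-n parameter w from shared memory M in chunks:
    w[i] = lam * M[h1(C(i)) + O(i)], with C(i) = i // Z and O(i) = i % Z.
    Consecutive indices within a chunk hit consecutive memory locations,
    so the access pattern coalesces."""
    w = np.empty(n)
    for i in range(n):
        c, o = divmod(i, CHUNK)  # chunk number C(i) and offset O(i)
        w[i] = lam * memory[universal_hash(c, len(memory)) + o]
    return w
```

Since the hash range is {0, ..., |M| − Z}, the read `memory[base + o]` never overflows the array, mirroring the overflow guard in Eq. (1).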
We generalize the embedding lookup kernel from ROBE (Desai et al., 2022) to implement our L kernel.

Matrix multiplication (MM(X_i, W_i)): 2D matrix multiplication is one of the most widely used operations in deep learning. We implement our ROAST-MM kernel with parameter sharing performed so that the matrix-multiplication algorithm accesses coalesced pieces of M. An efficient GPU implementation of matrix multiplication follows a block-multiplication algorithm to use the on-chip shared memory efficiently: while computing C = A × B, the matrices A, B, and C are divided into tiles of sizes Z_0 × Z_1, Z_1 × Z_2, and Z_0 × Z_2, respectively. Accordingly, we divide our 2D weight matrix into tiles of size Z_1 × Z_2. The tile (x, y), where x and y are the tile coordinates, is located in M in row-major format via a universal hash function h_2(x, y). Let C_1(i, j) and C_2(i, j) give the x- and y-coordinates of the tile to which (i, j) belongs, and let O_1(i, j) and O_2(i, j) give the x- and y-offsets of location (i, j) within the tile. Then ROAST-MM uses the mapping

W[i, j] = λ · M[h_2(C_1(i, j), C_2(i, j)) + Z_2 · O_1(i, j) + O_2(i, j)],   h_2 : N² → {0, ..., |M| − Z_1 Z_2}

Again, λ is the scaling factor discussed in Section 4.2, and the hash function hashes to the range {0, ..., |M| − Z_1 Z_2} to avoid overflows while reading the tile. Figure 1 (left) illustrates ROAST-MM with a tile size of 2 × 2. The above mapping is used whenever a 2D tile is accessed in the matrix-multiplication algorithm; the pseudocode for ROAST-MM is shown in Algorithm 1. We discuss the implementation of this kernel and its evaluation later in the paper. ROAST uses the ROAST-MM kernel to implement computational modules such as MLP layers, attention blocks, etc. Each module invoking ROAST kernels uses independent hash functions.
Algorithm 1 ROAST-MM(I × H × O)
Require: X ∈ R^{I×H}, M, λ, h : N² → {0, ..., |M| − Z_1 Z_2}
Ensure: output = MM(X, M[h(:, :)])
  value ← TILE(Z_0, Z_2)                                  ▷ allocate a 2D tile of size Z_0 × Z_2 to accumulate results
  for i ∈ {0, 1, ..., ⌈I/Z_0⌉ − 1} do
    for j ∈ {0, 1, ..., ⌈O/Z_2⌉ − 1} do
      value[:, :] ← 0
      for k ∈ {0, 1, ..., ⌈H/Z_1⌉ − 1} do
        value ← value + MM(X[iZ_0 : (i+1)Z_0, kZ_1 : (k+1)Z_1], M(h(k, j)))   ▷ access to the weight tile passes through the hash function
      end for
      output[iZ_0 : (i+1)Z_0, jZ_2 : (j+1)Z_2] ← λ · value
    end for
  end for

Apart from scaling each recovered parameter with the module-specific λ, we can also multiply it with another independent hash function g : N^k → {±1} (k = 1 or k = 2).
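The tiled loop above can be sketched in plain NumPy. This is a reference sketch, not the Triton kernel: it assumes for brevity that I, H, O are multiples of the tile sizes, and `tile_hash` is an illustrative stand-in for the 2D universal hash h_2.

```python
import numpy as np

Z0, Z1, Z2 = 2, 2, 2  # tile sizes

def tile_hash(x, y, m, a=769, b=393241, c=1, p=2**31 - 1):
    """Illustrative 2D universal hash h2 : N^2 -> {0, ..., m - Z1*Z2}."""
    return ((a * x + b * y + c) % p) % (m - Z1 * Z2 + 1)

def roast_mm(X, memory, O, lam=1.0):
    """Block matmul per Algorithm 1: each Z1 x Z2 weight tile is fetched
    from the shared array M (row-major within the tile) via the tile hash.
    Assumes I, H, O are multiples of the tile sizes."""
    I, H = X.shape
    out = np.zeros((I, O))
    for i in range(0, I, Z0):
        for j in range(0, O, Z2):
            acc = np.zeros((Z0, Z2))
            for k in range(0, H, Z1):
                base = tile_hash(k // Z1, j // Z2, len(memory))
                tile = memory[base:base + Z1 * Z2].reshape(Z1, Z2)
                acc += X[i:i + Z0, k:k + Z1] @ tile
            out[i:i + Z0, j:j + Z2] = lam * acc
    return out
```

Each tile read `memory[base:base + Z1*Z2]` is a contiguous slice, which is exactly the coalescing property the hash-to-tile mapping is designed to preserve.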

4.2. GLOBAL MEMORY SHARING (GMS)

HashedNet uses local memory sharing (LMS): each layer has its own independent compressed memory. In contrast, ROAST proposes global memory sharing (GMS), wherein memory is shared across modules. However, modules cannot directly use the parameters stored in M, as each module's weights require initialization and optimization at different scales. For instance, in Xavier initialization (Glorot & Bengio, 2010), weights are initialized from the distribution Uniform(−1/√n, 1/√n), where n is the size of the input to the module. In GMS, we must ensure that each module gets weights at the required scale. To achieve this, we first initialize the entire ROAST parameter array with values from Uniform(−1/C, 1/C) for some constant C. Then, for each module, we scale the weights retrieved from the ROAST array by a factor λ = C/√n.

One can understand the benefit of GMS over LMS in terms of the number of distinct functions f that can be expressed using a fixed M. Consider a family of functions with n parameters. GMS can potentially express |M|^n functions across different random mappings. In LMS, let the separate parameters be of sizes n_1, n_2, ..., n_k, each mapped into its own memory M_1, M_2, ..., M_k, so that n = Σ_i n_i and |M| = Σ_i |M_i|. Then LMS can only express |M_1|^{n_1} |M_2|^{n_2} ··· |M_k|^{n_k} distinct functions. Thus the expressivity of LMS is strictly less than that of GMS, and can be orders of magnitude less depending on the exact values of n_i and |M_i|. We also show that GMS is superior to LMS in terms of dimensionality reduction (feature hashing) in Section 5.

The gradient of the loss w.r.t. a weight w_i in M is the λ-scaled aggregation of the gradients of the loss w.r.t. all the parameters that map to this weight. For simplicity of notation, consider θ as the complete parameter vector, λ(j) as the scaling factor of the module that θ_j belongs to, and h as the mapping from θ to M. See Appendix B.1 for more details.
∇_{w_i} f(w) = Σ_{j : h(j)=i} λ(j) · ∇_{θ_j} f(θ)   (2)
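The aggregation in Eq. (2) amounts to a λ-scaled scatter-add into the shared array. A hedged sketch with a hypothetical layout of two modules, where each module supplies its hash indices h, scale λ, and parameter gradients ∇θ explicitly:

```python
import numpy as np

def aggregate_grads(m_size, mappings):
    """Backward pass under global memory sharing: the gradient of a shared
    weight w_i is the lambda-scaled sum over all module parameters theta_j
    with h(j) = i (Eq. 2). `mappings` holds, per module, a tuple
    (hash indices h, scale lambda, parameter gradients dtheta)."""
    grad_M = np.zeros(m_size)
    for h, lam, dtheta in mappings:
        np.add.at(grad_M, h, lam * dtheta)  # scatter-add handles hash collisions
    return grad_M
```

Here `np.add.at` accumulates repeated indices rather than overwriting them, which is exactly the collision semantics Eq. (2) requires.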

4.4. IMPLEMENTATION OF ROAST-MM

The high-performance computing community has heavily investigated fast implementations of the General Matrix Multiplication (GEMM) kernel, a fundamental operation in many computational workloads, including deep learning. Optimized GEMM kernels are available in vendor libraries such as cuBLAS (NVIDIA Corporation, 2022a) and CUTLASS (NVIDIA Corporation, 2022b). Unfortunately, these implementations do not support custom tile-loading operations, which are the key to ROAST-MM. To implement ROAST-MM to a reasonable level of efficiency, we used Triton (Tillet et al., 2019), an intermediate language for tiled neural-network computations. Triton abstracts away shared-memory management, which makes it helpful for customizing tiled operations with high efficiency. In our implementation of ROAST-MM, the optimal size of the coalesced tiles is a parameter that depends on the shape of the weight matrix: different tile sizes lead to different parallelism, occupancy, and shared-memory efficiency, and hence different execution times. We autotune this parameter to obtain the best performance for particular matrix shapes. We propose two strategies for autotuning each ROAST-MM layer: (1) optimize the inference workload by autotuning the forward kernel and sharing the tile size with the backward kernels; (2) optimize the training workload by autotuning the forward and backward kernels together. Extensive evaluation of this kernel is presented in Appendix C.2.

5. FEATURE HASHING QUALITY: GLOBAL MEMORY SHARING ADVANTAGE OVER LOCAL MEMORY SHARING

We can consider model compression as dimensionality reduction of a parameter vector (a one-dimensional vector of all parameters in a model) of size n into a vector of size |M| = m. The quality of inner-product preservation is used as a metric for the quality of dimensionality reduction. In terms of dimensionality reduction, ROAST uses ROBE hashing, which shows that chunk-based hashing is theoretically better than hashing individual elements. In this section, we compare ROAST's GMS proposal against HashedNet's LMS using a chunk size of one.

Consider two parameter vectors x, y ∈ R^n; we are interested in how the inner product of the parameter vectors is preserved under hashing. Let x = [x_1, x_2, ..., x_k] and y = [y_1, y_2, ..., y_k] be composed of k vectors of sizes n_1, n_2, ..., n_k, where [·] denotes concatenation. In LMS, let piece i map to a memory of size f_i m, where Σ_i f_i = 1. The estimated inner product with GMS is

⟨x, y⟩_{G,m} = Σ_{j=1}^{m} ( Σ_{i=1}^{n} I(h(i)=j) g(i) x[i] ) ( Σ_{i=1}^{n} I(h(i)=j) g(i) y[i] )

The estimated inner product with LMS can be written as

⟨x, y⟩_{L,m,f} = Σ_{l=1}^{k} Σ_{j=1}^{f_l m} ( Σ_{i=1}^{n_l} I(h(i)=j) g(i) x_l[i] ) ( Σ_{i=1}^{n_l} I(h(i)=j) g(i) y_l[i] ) = Σ_{l=1}^{k} ⟨x_l, y_l⟩_{G,(f_l m)}

Theorem 1. Let x, y ∈ R^n be composed of k vectors x = [x_1, x_2, ..., x_k] and y = [y_1, y_2, ..., y_k]. Then the inner-product estimates of global and local weight sharing are unbiased,

E(⟨x, y⟩_{G,m}) = ⟨x, y⟩,   E(⟨x, y⟩_{L,m,f}) = ⟨x, y⟩

and the variances of the inner-product estimates can be written as

V_G(⟨x, y⟩) = Σ_i f_i V_i + (1/m) Σ_{i≠j} ( ||x_i||² ||y_j||² + ⟨x_i, y_i⟩ ⟨x_j, y_j⟩ )   (6)

V_L(⟨x, y⟩) = Σ_i V_i   (7)

where, writing x_l = (a_1, a_2, ..., a_{n_l}) and y_l = (b_1, b_2, ..., b_{n_l}),

V_l = (1/(f_l m)) ( Σ_{i≠j} a_i² b_j² + Σ_{i≠j} a_i b_i a_j b_j )   (8)

Here V_L is the local memory sharing variance and V_G the global memory sharing variance.
Intuition: The two terms in V_G can be understood as follows. The first term is the local variance with individual terms reduced by a factor of f_i: each piece of the vector is distributed in a memory that is (1/f_i)× larger. However, in GMS there is a possibility of more collisions across pieces, which leads to the second term in V_G. Note that for given x, y and a finite m, V_G is always bounded, while V_L is unbounded due to the 0 < f_i < 1 in the denominator: as the number of pieces increases, or a particular f_i grows smaller, V_L increases. While we cannot prove that V_G is strictly less than V_L, we can investigate the equations under some assumptions on the data. Practically, each piece of the parameter vector is a computational block, like a matrix for multiplication or an embedding table lookup. These blocks are initialized at a scale proportional to the square root of their size, so the norms of these pieces are similar. Let us assume each piece has norm √α, and that over random data distributions on x and y, all inner products are β in expectation. Then

V_G ≈ (k²/m)(α² + β²)

V_L ≈ (1/m)(α² + β²)(1/f_1 + 1/f_2 + ... + 1/f_k) ≥ (1/m)(α² + β²) · k²/(Σ_i f_i) = V_G   (9)

where the inequality follows from the AM-HM inequality together with Σ_i f_i = 1. Thus V_L is greater than V_G, and it can be much greater depending on the exact values of f_i. The proof of the theorem and other details are presented in Appendix B.2.
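The gap between V_G and V_L can be checked by simulation. The sketch below, a hedged Monte Carlo comparison under assumed piece sizes and memory fractions, estimates both variances empirically; with a skewed f_l, the LMS variance visibly exceeds the GMS one.

```python
import numpy as np

def hash_inner(x, y, m, rng):
    """One feature-hashing estimate of <x, y>: random signs g, random
    buckets h, then the inner product of the two hashed sketches."""
    h = rng.integers(0, m, size=len(x))
    g = rng.choice([-1.0, 1.0], size=len(x))
    xs = np.zeros(m)
    ys = np.zeros(m)
    np.add.at(xs, h, g * x)
    np.add.at(ys, h, g * y)
    return xs @ ys

def gms_vs_lms_variance(x, y, m, pieces, trials=2000, seed=0):
    """Monte Carlo variances of the GMS estimator (one shared memory of
    size m) and the LMS estimator (independent memories of size f_l * m
    per piece). `pieces` lists (piece_length, f_l) with sum(f_l) = 1."""
    rng = np.random.default_rng(seed)
    gms, lms = [], []
    for _ in range(trials):
        gms.append(hash_inner(x, y, m, rng))
        total, start = 0.0, 0
        for n_l, f_l in pieces:
            total += hash_inner(x[start:start + n_l], y[start:start + n_l],
                                max(1, int(f_l * m)), rng)
            start += n_l
        lms.append(total)
    return np.var(gms), np.var(lms)
```

For example, two equal pieces with f = (0.9, 0.1) give the small-memory piece a 1/f_l ≈ 10 variance blow-up, dominating the cross-piece collision term that GMS pays.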

6. EXPERIMENTAL EVALUATION

Setup: In this section, we evaluate the ROAST compression approach on two types of tasks. The details of the tasks, datasets, and models used are given in Table 2. For image-classification tasks, we choose the CIFAR-10 dataset and the leader of the DAWNBench benchmark (Coleman et al., 2017), a ResNet-9 model, for CIFAR-10. The target accuracy for this benchmark is 94%, hence we perform hyper-parameter tuning to get a test accuracy of ≥ 94% and stop the tuning once we reach it.

Managing excess parameters: It is clear from Table 3 that the BERT-base architecture is highly over-parameterized for the tasks under consideration. However, even in this case, ROAST can be used to control the memory footprint while maintaining the functional form of the larger model.

Pruning and ROAST: We perform unstructured iterative-magnitude pruning (Han et al., 2016b) on the ResNet model and find that pruning gives up to 100× compression. Note, however, that pruning requires training the model using the memory of the original model, whereas compression with ROAST means using less memory even during training. Additionally, pruning can be used in conjunction with ROAST to obtain smaller models using smaller memory: in Table 4, we see that we can prune 90% of the weights in a 10× compressed ROAST array and still achieve the same quality.

Local vs. global memory sharing: In Figure 3, we show that model quality with global memory sharing is indeed better than with local memory sharing, supporting our theoretical observations about the two memory-sharing schemes.

Efficiency of ROAST operators compared to HashedNet: Table 7 shows the inference performance of a simple model using ROAST-MM for matrix multiplication on compressed memory. The model linearly transforms the input vector and computes its norm. We optimized the ROAST-MM kernel for this experiment using the inference-optimal strategy.
We add substantial information and new results to the table. Specifically:
• We add the GMS and LMS results to the table separately, so that readers can assess each method on the task.
• We add unstructured pruning (the best pruning method quality-wise) results for the NLP tasks as well. The pruning results are obtained as follows. In the full-9-1 schedule, we start from the fully trained model, perform iterative pruning during the next 9 epochs, and then tune the final pruned model for 1 more epoch. In the full-1-9 schedule, we again start from the fully trained model, perform pruning in the next 1 epoch, and then tune the model for 9 more epochs. We note the best achieved accuracy with the final model structure and the epoch at which this accuracy is reached.
• For each result, we note the epoch at which the best accuracy was reached.
• We append an additional small table noting the number of epochs required to reach a target accuracy, to compare the convergence of different models.

We make the following observations:
• GMS reaches better accuracy than LMS for the same amount of compression on both datasets. Additionally, GMS reaches the same target accuracy faster than LMS.
• The ROAST approach is more effective than pruning approaches on NLP text-classification tasks for architectures like BERT.
• Interestingly, GMS-10× converges faster than the original model on both datasets. We leave investigating this as future work.

A.2 GMS VS LMS FOR YELP

As can be seen from the two plots in Figure 4, GMS clearly performs better than LMS in both compression settings.

ROAST is a generalized model-compression approach that performs operation-specific, system-friendly lookups with global memory sharing. This raises some interesting theoretical questions.

B.1 BACKWARD PASS FOR MODEL SHARING WEIGHTS ACROSS DIFFERENT COMPONENTS

A general function sharing a weight x across different components can be written as f(x, g(x)): the interpretation is that x is used in g(·) and then again used directly in f (in an MLP, think of x being used in multiple layers). Let f(g_1, g_2), where both g_1 and g_2 are functions of x. Then

∂f(g_1, g_2)/∂x = (∂f(g_1, g_2)/∂g_1) · (∂g_1/∂x) + (∂f(g_1, g_2)/∂g_2) · (∂g_2/∂x)   (10)

With g_1 = x and g_2 = g(x),

∂f(g_1, g_2)/∂x = ∂f(x, g(y))/∂x |_{y=x} + (∂f(y, g(x))/∂g(x)) · (∂g(x)/∂x) |_{y=x}   (11)

∂f(g_1, g_2)/∂x = ∂f(x, g(y))/∂x |_{y=x} + ∂f(y, g(x))/∂x |_{y=x}   (12)

Renaming,

∂f(x, g(x))/∂x = ∂f(z, g(y))/∂z |_{y=x, z=x} + ∂f(z, g(y))/∂y |_{y=x, z=x}

Thus, we can essentially treat each place where x appears as a new variable; the gradient w.r.t. x is then just the sum of the partial derivatives of the function w.r.t. these new variables, which is easy to implement in the backward pass. To ensure that the memory utilization of the backward pass is not of the order of the recovered model size, we do not use the auto-differentiation of TensorFlow/PyTorch; we implement our own backward pass, which can be found in the code.
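The "rename each occurrence and sum the partials" rule can be checked numerically. A small sketch with the hypothetical choices f(u, v) = u·v and g(x) = x², so f(x, g(x)) = x³ and the summed partials should equal 3x²:

```python
def grad_shared(x):
    """Gradient of f(x, g(x)) with f(u, v) = u * v and g(x) = x**2,
    computed by treating each occurrence of x as a fresh variable and
    summing the partials, as in Eq. (12)."""
    u, v = x, x * x            # g1 = x, g2 = g(x)
    df_du = v                  # partial w.r.t. the first occurrence
    df_dv = u                  # partial w.r.t. the second occurrence
    dg_dx = 2 * x
    return df_du * 1.0 + df_dv * dg_dx  # sum over occurrences: x^2 + 2x^2 = 3x^2

def numeric_grad(x, eps=1e-6):
    """Central finite difference of f(x, g(x)) = x**3 as a reference."""
    f = lambda t: t * (t * t)
    return (f(x + eps) - f(x - eps)) / (2 * eps)
```

The analytic sum of partials matches the finite-difference gradient, which is the property the custom backward pass relies on when many parameters alias the same shared weight.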

B.2 GLOBAL FEATURE HASHING VS LOCAL FEATURE HASHING.

We can consider model compression techniques as dimensionality reduction of the parameter vector (a one-dimensional vector of all parameters in a model) of size n into a vector of size |M| = m. The quality of inner-product preservation is used as a metric for the quality of dimensionality reduction. In terms of dimensionality reduction, ROAST uses ROBE hashing (Desai et al., 2022), which showed that chunk-based hashing is theoretically better than hashing individual elements. In this section, we analyse the GMS proposal of ROAST against the LMS of HashedNet; for the purpose of this comparison, we assume a chunk size of 1.

Consider two parameter vectors x, y ∈ R^n. We are interested in how the inner product between these parameter vectors is preserved under hashing. Let x = [x_1 x_2 ... x_k] and y = [y_1 y_2 ... y_k] be composed of k pieces of sizes n_1, n_2, ..., n_k. In LMS, let piece i be mapped into a memory of size f_i m, where Σ_i f_i = 1. The estimator of the inner product in the GMS case can be written as

⟨x, y⟩_{G,m} = Σ_{j=1}^{m} ( Σ_{i=1}^{n} I(h(i)=j) g(i) x[i] ) ( Σ_{i=1}^{n} I(h(i)=j) g(i) y[i] )   (14)

The estimate of the inner product with LMS can be written as

⟨x, y⟩_{L,m,f} = Σ_{l=1}^{k} Σ_{j=1}^{f_l m} ( Σ_{i=1}^{n_l} I(h(i)=j) g(i) x_l[i] ) ( Σ_{i=1}^{n_l} I(h(i)=j) g(i) y_l[i] ) = Σ_{l=1}^{k} ⟨x_l, y_l⟩_{G,(f_l m)}   (15)

The GMS estimator is the standard feature-hashing estimator, and the LMS estimator is essentially a sum of GMS estimators, one per piece. Since E[g(i)] = 0, it is easy to check by linearity of expectation that both are unbiased (the suffix L refers to local hashing and G to global hashing):

E_G = E(⟨x, y⟩_{G,m}) = ⟨x, y⟩,   E_L = E(⟨x, y⟩_{L,m,f}) = ⟨x, y⟩

Let us now look at the variance, using the following notation:
• V_G = V(⟨x, y⟩_{G,m}), the GMS variance for the full vectors;
• V_L = V(⟨x, y⟩_{L,m,f}), the LMS variance for the full vectors;
• V_l = V(⟨x_l, y_l⟩_{G,f_l m}), the variance for each piece.
The following identity is easy to derive and can be found in Lemma 2 of Weinberger et al. (2009):

V_l = (1/(f_l m)) ( Σ_{i≠j} a_i² b_j² + Σ_{i≠j} a_i b_i a_j b_j ),  where x_l = (a_1, a_2, ..., a_{n_l}) and y_l = (b_1, b_2, ..., b_{n_l})

As each piece is hashed independently in LMS, we have

V_L = Σ_{l=1}^{k} V_l

Let us now look at V_G. Again, using Lemma 2 of Weinberger et al. (2009),

V_G = (1/m) ( Σ_{i≠j} x_i² y_j² + Σ_{i≠j} x_i y_i x_j y_j )

The expression can be split into terms that belong to the same piece and terms across pieces:

V_G = (1/m) Σ_{l=1}^{k} ( Σ_{i≠j ∈ piece l} x_i² y_j² + Σ_{i≠j ∈ piece l} x_i y_i x_j y_j )
    + (1/m) Σ_{l1=1}^{k} Σ_{l2=1, l2≠l1}^{k} ( Σ_{i ∈ piece l1, j ∈ piece l2} x_i² y_j² + Σ_{i ∈ piece l1, j ∈ piece l2} x_i y_i x_j y_j )

V_G = Σ_{l=1}^{k} f_l V_l + (1/m) Σ_{l1 ≠ l2} ( ||x_{l1}||² ||y_{l2}||² + ⟨x_{l1}, y_{l1}⟩ ⟨x_{l2}, y_{l2}⟩ )   (22)

Observation 1: V_L contains 1/f_l terms, which make it unbounded. This makes sense: if the number of pieces grows very large, many compressions will not work (for example, if the number of pieces exceeds |M|). V_L is also badly affected when some f_l is very small, which is often the case: embedding tables in the DLRM model are generally much larger than the matrix-multiplication (MLP) modules, which can make f ≈ 0.001 for the MLP components.

Observation 2: Practically, we can assume each piece, no matter the size of the vector, to have the same norm. The reason lies in initialization: according to Xavier initialization, the weights of a particular node are initialized with norm 1. So let us assume the more practical case of all norms being equal to √α. Also, to make comparisons, we need to consider some average case over the data: let us assume that, under an independent randomized-data assumption, the expected value of all inner products is β.
With this, in expectation over randomized data, we have

$$V_G = \sum_{l} f_l V_l + \frac{k(k-1)}{m} (\alpha^2 + \beta^2)$$

Now note that (dropping the subscript $l$ on the right-hand side),

$$V_l = \frac{1}{f_l}\frac{1}{m} \Big( \sum_{i \neq j} a_i^2 b_j^2 + \sum_{i \neq j} a_i b_i a_j b_j \Big) = \frac{1}{f_l}\frac{1}{m} \Big( \|x\|_2^2 \|y\|_2^2 + \langle x, y \rangle^2 - 2 \sum_i x_i^2 y_i^2 \Big) = \frac{1}{f_l}\frac{1}{m} \Big( (\alpha^2 + \beta^2) - 2 \sum_i x_i^2 y_i^2 \Big)$$

Note that for each negative term there are $n_l$ positive terms. To simplify, we disregard the negative term in the equation above; this approximation is practical and is made only to get a sense of the relation between $V_L$ and $V_G$. Then

$$V_L - V_G = \sum_l V_l - \sum_l f_l V_l - \frac{k(k-1)}{m} (\alpha^2 + \beta^2) = \sum_l \frac{1}{m} \Big( \frac{1}{f_l} - 1 \Big) (\alpha^2 + \beta^2) - \frac{k(k-1)}{m} (\alpha^2 + \beta^2)$$

Since $\sum_l f_l = 1$, the harmonic mean of the $f_l$ is at most their arithmetic mean, so $\sum_l \frac{1}{f_l} \geq k^2$ and hence

$$V_L - V_G \geq \frac{k(k-1)}{m} (\alpha^2 + \beta^2) - \frac{k(k-1)}{m} (\alpha^2 + \beta^2) = 0$$

Recall that we ignored a term which reduces $V_L$ slightly; letting that error be $\epsilon$,

$$V_L - V_G \geq -\epsilon \qquad (27)$$

The above shows that even in the best case for LMS, $V_G$ might be only slightly more than $V_L$. However, in the general case, where the harmonic mean is much worse than the arithmetic mean, $V_L$ will be much larger, depending on the exact $f_l$'s.

Table 7: Inference times of different square weight matrices using an input batch of 512. For ROAST, the tile parameters of each matrix multiplication are autotuned. The measurements were taken using TF32 on an NVIDIA A100 GPU (48GB). We used PyTorch's matmul function (MM) for the full uncompressed matrix multiplication.
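The harmonic-mean step used above, $\sum_l \frac{1}{f_l} \geq k^2$ whenever $\sum_l f_l = 1$, follows from Cauchy-Schwarz; a short derivation:

```latex
% Key inequality: for f_1,\dots,f_k > 0 with \sum_l f_l = 1,
% \sum_l 1/f_l \ge k^2, hence \sum_l (1/f_l - 1) \ge k^2 - k = k(k-1).
\begin{align*}
k^2 \;=\; \Big(\sum_{l=1}^{k} 1\Big)^2
    \;=\; \Big(\sum_{l=1}^{k} \sqrt{f_l}\cdot\frac{1}{\sqrt{f_l}}\Big)^2
    \;\le\; \Big(\sum_{l=1}^{k} f_l\Big)\Big(\sum_{l=1}^{k} \frac{1}{f_l}\Big)
    \;=\; \sum_{l=1}^{k} \frac{1}{f_l}.
\end{align*}
```

Equality holds only when all $f_l = 1/k$; any imbalance in the memory fractions strictly widens the gap between $V_L$ and $V_G$.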



https://github.com/apple/ml-cifar-10-faster



Figure 1: Generic model compression with operation-specific blocking, using BERT as an example: (left) how 2D tiles are mapped to M for the MM operation; (right) how 1D tiles are mapped to M for the L operation. λ is the module-specific GMS scaling factor.
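The tile-to-memory mapping of Figure 1 can be sketched in a few lines. The hash function and tile layout below are simplified stand-ins (universal hashing on the tile index with a contiguous chunk read), not the actual ROAST kernel, and all names are ours.

```python
import numpy as np

# Illustrative hash constants for a universal hash on the 2D tile index.
A, B, P = 1000003, 917, 2**31 - 1

def tile_start(ti, tj, m, tile_elems):
    # start offset of tile (ti, tj) inside the shared memory M,
    # leaving room so the whole chunk fits
    return ((A * ti + B * tj) % P) % (m - tile_elems)

def recover_tile(M, ti, tj, th, tw, lam=1.0):
    # recover a th x tw weight tile as one contiguous chunk of M,
    # scaled by the module-specific GMS factor lambda
    s = tile_start(ti, tj, len(M), th * tw)
    return lam * M[s:s + th * tw].reshape(th, tw)

m = 4096
M = np.random.default_rng(0).normal(size=m)  # shared parameter memory
W_tile = recover_tile(M, ti=2, tj=5, th=8, tw=8)
print(W_tile.shape)  # (8, 8)
```

The point of the contiguous chunk read is cache efficiency: unlike per-element hashing in HashedNet, one hash computation fetches an entire tile from consecutive memory locations.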

Figure 2: Local memory sharing: each module compresses its parameters separately. In global memory sharing, all parameters from across the modules share the same memory.

Figure 3: Effect of local and global memory sharing with compression of the BERT-12-12 model on text-classification tasks. For Yelp, a rolling mean of 5 measurements is taken to reduce noise in the plots.

We make the following observations from Table 7: (1) ROAST-MM consistently outperforms the HashedNet kernel across the different multiplication workloads. On average over the different workloads, ROAST-MM is up to 45× faster than HashedNet. (2) ROAST-MM is 1.34× slower than PyTorch-MM. This is expected, as PyTorch-MM uses extremely optimized libraries for matrix multiplication, while the ROAST-MM implementation is comparatively under-optimized. It is still interesting to note that ROAST-MM scales better with increasing workload than PyTorch-MM: as the workload increases 1600× (from 512×512 to 20480×20480), PyTorch-MM takes 39× the time and HashedNet takes 106× the time, whereas ROAST-MM only takes around 16× the time. We present more detailed measurements across different optimizers for the training-optimal strategy in Appendix C.2.

7 CONCLUSION
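The scaling numbers above are hardware- and kernel-specific; a minimal timing harness of the kind behind such measurements might look like the sketch below. This is a CPU/NumPy stand-in (the paper's numbers use TF32 PyTorch kernels on an A100), and `time_mm` is our name, not an API from the paper.

```python
import time
import numpy as np

def time_mm(n, batch=512, reps=3):
    # average wall-clock time of one (batch x n) @ (n x n) multiply
    a = np.random.rand(batch, n).astype(np.float32)
    w = np.random.rand(n, n).astype(np.float32)
    a @ w  # warm-up so allocation and caching do not skew the first run
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ w
    return (time.perf_counter() - t0) / reps

# scaling check: how does runtime grow as the matrix side grows?
for n in (512, 1024, 2048):
    print(n, f"{time_mm(n):.5f}s")
```

On a GPU, wall-clock timing must additionally synchronize the device before reading the clock, since kernel launches are asynchronous.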

Figure 4: Separate plots for 10× and 100× ROAST on the Yelp dataset, for better visibility. A rolling mean of 5 measurements was used to reduce noise in the plots.

B THEORY

Experimental settings: the datasets and models used in the experiments.

Text classification task: (above) ROAST shows up to 100× compression without loss of quality on the BERT-2-2 model. (below) Even for the overparameterized BERT-12-12 model, ROAST maintains quality at similar compression.

Image classification task: (above) The ResNet-9 model can be trained in 10× smaller memory. (below) Pruning gives 100× post-training compression but requires the complete memory for training. We can prune the ROAST-10× model, which uses 10× less memory during training, by a further 10× to obtain a 100× compressed post-training model.


Traditionally, model compression has focused on memory reduction during inference. However, model memory during training is also an important consideration. While some existing methods such as HashedNet and low-rank factorisation provide model reduction during training, these methods either do not provide cache-efficient model recovery or have an implicit cap on memory reduction.

A ADDITIONAL DATA FOR REVIEWERS - PARTS OF WHICH WILL GO IN MAIN PAPER IN FINAL VERSION

A.1 EXTENDED TABLE 3 WITH EPOCH INFORMATION AND MORE BASELINES

Table 3 extended version


Weight update operation (optimizer.step()) for different shapes of square weight matrices with an input batch of 512. The tile parameters of the multiplication are optimized for each function over the forward + backward pass. The measurements were taken with TF32 on an A100 (48GB).

Total training-step time for different shapes of square weight matrices with an input batch of 512. The tile parameters of the multiplication are optimized for each function over the forward + backward pass. The measurements were taken with TF32 on an A100 (48GB).

