CLUSTERING EMBEDDING TABLES, WITHOUT FIRST LEARNING THEM

Abstract

To work with categorical features, machine learning systems employ embedding tables. These tables can become exceedingly large in modern recommendation systems, necessitating the development of new methods for fitting them in memory, even during training. Some of the most successful methods for table compression are Product and Residual Vector Quantization (Gray & Neuhoff, 1998). These methods replace table rows with references to k-means clustered "codewords." Unfortunately, this means they must first know the table before compressing it, so they can only save memory during inference, not training. Recent work has used hashing-based approaches to minimize memory usage during training, but the compression obtained is inferior to that of "post-training" quantization. We show that the best of both worlds may be obtained by combining techniques based on hashing and clustering. By first training a hashing-based "sketch", then clustering it, and then training the clustered quantization, our method achieves compression ratios close to those of post-training quantization with the training-time memory reductions of hashing-based methods. We show experimentally that our method provides better compression and/or accuracy than previous methods, and we prove that our method always converges to the optimal embedding table for least-squares training.

1. INTRODUCTION

Machine learning can model a variety of data types, including continuous, sparse, and sequential features. Categorical features are especially noteworthy since they necessitate an "embedding" of a (typically vast) vocabulary into a smaller vector space in order to facilitate further calculations. IDs of different types, such as user IDs, post IDs on social networks, video IDs, or IP addresses in recommendation systems, are examples of such features. Natural Language Processing is another prominent use for embeddings (usually word embeddings, such as Mikolov et al., 2013); however, in NLP the vocabulary can be significantly reduced by considering "subwords" or "byte pair encodings". In recommendation systems like Matrix Factorization or DLRM (see fig. 2) it is typically not possible to factorize the vocabulary this way, and embedding tables end up very large, requiring hundreds of gigabytes of GPU memory (Naumov et al., 2019). This in effect forces models to be split across many GPUs, which is expensive and creates a communication bottleneck during training and inference. The traditional solution has been to hash the IDs down to a manageable universe size using the Hashing Trick (Weinberger et al., 2009), accepting that unrelated IDs may wind up with the same representation. Overly aggressive hashing naturally hurts the ability of the model to distinguish its inputs, mixing up unrelated concepts and reducing model accuracy. Another option is to quantize the embedding tables. Typically, this entails rounding each individual parameter to 4 or 8 bits. Other quantization methods work in many dimensions at the same time, such as Product Quantization and Residual Vector Quantization. (See Gray & Neuhoff (1998) for a survey of quantization methods.) These multi-dimensional methods typically rely on clustering (like k-means) to find a set of representative "code words" to which each original ID is assigned.
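As a concrete illustration of the hashing trick described above, the sketch below maps a large ID universe into a small table of embedding rows, so that colliding IDs share (and co-train) the same vector. This is a minimal pure-Python sketch; the function names, table sizes, and the multiplicative hash constant are illustrative assumptions, not part of the methods cited above.

```python
# Hashing trick for embeddings: every ID lands in one of `buckets` rows,
# so unrelated IDs may collide and receive identical representations.
import random

def make_table(buckets, dim, seed=0):
    # Small randomly initialized embedding table (buckets x dim).
    rng = random.Random(seed)
    return [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(buckets)]

def hash_bucket(feature_id, buckets, a=0x9E3779B1, m=2**32):
    # Simple multiplicative hash; a production system would use a
    # stronger (e.g. universal) hash family.
    return (feature_id * a % m) % buckets

def embed(feature_id, table):
    return table[hash_bucket(feature_id, len(table))]

table = make_table(buckets=8, dim=4)
# With 100 IDs and only 8 buckets, the pigeonhole principle guarantees
# that some unrelated IDs collide and share an embedding.
collisions = sum(
    hash_bucket(i, 8) == hash_bucket(j, 8)
    for i in range(100) for j in range(i + 1, 100)
)
print(collisions > 0)  # True
```

The collisions demonstrated at the end are exactly the noise source the paper discusses: two unrelated IDs in the same bucket are indistinguishable to the rest of the model.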
For example, vectors representing "red", "orange" and "blue" may be stored as simply "dark orange" and "blue", with the first two concepts pointing to the same average embedding. See fig. 1 for an example. Even in the theoretical literature on optimal vector compression, such clustering plays a crucial role (Indyk & Wagner, 2022). All these quantization methods share one obvious drawback compared to hashing: the model is only quantized after training, so memory utilization during training is unaffected. (Note: while it is common to do some "finetuning" of the model after, say, product quantization, the method remains primarily a "post-training" approach.) Recent authors have considered more advanced ways to use hashing to overcome this problem: Tito Svenstrup et al. (2017); Shi et al. (2020); Desai et al. (2022); Yin et al. (2021); Kang et al. (2021). The common theme has been using multiple hash functions, which allow features to take different representations with high probability while still mapping into a small shared table of parameters. While these methods can work better than the hashing trick in some cases, they still fundamentally mix up completely unrelated concepts in a way that introduces large amounts of noise into the rest of the machine learning model. Clearly there is an essential difference between "post-training" compression methods like Product Quantization, which can exploit similarities between concepts, and "during training" techniques based on hashing, which are forced to randomly mix up concepts.

This paper's key contribution is to bridge that gap: we present a novel compression approach we call "Clustered Compositional Embeddings" (or CQR for short) that combines hashing and clustering while retaining the benefits of both methods. By continuously interleaving clustering with training, we train recommendation models with accuracy matching post-training quantization, while using a fixed parameter count and computational cost throughout training, matching hashing-based methods. In spirit, our effort can be likened to methods like RigL (Evci et al., 2020), which discovers the wiring of a sparse neural network during training rather than pruning a dense network post-training. Our work can also be seen as a form of "Online Product Quantization", though prior work like Xu et al. (2018) focused only on updating code words already assigned to concepts. Our goal is more ambitious: we want to learn which concepts to group together without ever knowing the "true" embedding of the concepts. Why is this hard? Imagine you are training your model and at some point decide to use the same vector for IDs i and j. For the remaining duration of the training, you can never distinguish the two IDs again, and thus any decision you make is permanent. The more you cluster, the smaller your
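The post-training quantization baseline discussed above can be sketched in a few lines: run k-means over the rows of an already-learned table, then store only a small codebook of cluster centers plus one codeword index per ID. The sketch below uses a plain Lloyd's iteration in pure Python for illustration under assumed table sizes; it is not the paper's implementation, and real systems would use product or residual variants via a library.

```python
# "Post-training" quantization of a learned embedding table:
# cluster its rows and keep only (codebook, per-ID index).
import random

def kmeans(rows, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # initialize centers from the rows
    assign = []
    for _ in range(iters):
        # Assign each row to its nearest center (squared L2 distance).
        assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centers[c])))
            for r in rows
        ]
        # Move each center to the mean of its assigned rows.
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

rng = random.Random(1)
table = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]  # stand-in for a learned table
codebook, codes = kmeans(table, k=8)
# Compression: 50 small indices + 8 codewords instead of 50 full vectors.
print(len(codebook), len(codes))
```

Note the drawback the text emphasizes: `table` must already be fully learned (and fully in memory) before this clustering can run, which is exactly what the hashing-based "during training" methods avoid.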

Figure 1: Single iteration of Clustered QR. Starting from a random embedding table, each ID is hashed to a vector in each of 2 small tables (left), and the value (shown in the middle) is taken to be the mean of the two vectors. After training for an epoch, the large (implicit) embedding table is (sub-sampled and) clustered. This leaves a new small table in which similar IDs are represented by the same vector. We can choose to combine the cluster centers with a new random table (and a new hash function), after which the process can be repeated for an increasingly better understanding of which IDs should be combined.
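The lookup described in this caption, where each ID draws one row from each of two small tables and uses their mean, can be sketched as follows. This is an illustrative sketch only: the table sizes, the salted multiplicative hash, and the function names are assumptions, not the paper's code.

```python
# Compositional lookup with 2 small tables: the implicit table has up to
# buckets**2 distinct vectors while only 2 * buckets rows are stored.
import random

def make_table(buckets, dim, seed):
    rng = random.Random(seed)
    return [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(buckets)]

def bucket(feature_id, buckets, salt):
    # Per-table salted multiplicative hash (illustrative, not universal).
    return ((feature_id + salt) * 0x9E3779B1 % 2**32) % buckets

def embed(feature_id, tables):
    # One row from each small table, combined by coordinate-wise mean.
    rows = [t[bucket(feature_id, len(t), salt)] for salt, t in enumerate(tables)]
    return [sum(xs) / len(rows) for xs in zip(*rows)]

tables = [make_table(buckets=16, dim=4, seed=s) for s in range(2)]
v = embed(42, tables)
print(len(v))  # 4
```

Because the two hash functions are independent, two IDs that collide in one table will, with high probability, differ in the other, which is what gives most IDs a distinct (implicit) embedding before any clustering happens.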

