CLUSTERING EMBEDDING TABLES, WITHOUT FIRST LEARNING THEM

Abstract

To work with categorical features, machine learning systems employ embedding tables. These tables can become exceedingly large in modern recommendation systems, necessitating the development of new methods for fitting them in memory, even during training. Some of the most successful methods for table compression are Product and Residual Vector Quantization (Gray & Neuhoff, 1998). These methods replace table rows with references to k-means clustered "codewords." Unfortunately, this means they must first know the table before compressing it, so they can only save memory during inference, not training. Recent work has used hashing-based approaches to minimize memory usage during training, but the compression obtained is inferior to that obtained by "post-training" quantization. We show that the best of both worlds may be obtained by combining techniques based on hashing and clustering. By first training a hashing-based "sketch", then clustering it, and then training the clustered quantization, our method achieves compression ratios close to those of post-training quantization with the training-time memory reductions of hashing-based methods. We show experimentally that our method provides better compression and/or accuracy than previous methods, and we prove that our method always converges to the optimal embedding table for least-squares training.

1. INTRODUCTION

Machine learning can model a variety of data types, including continuous, sparse, and sequential features. Categorical features are especially noteworthy since they necessitate an "embedding" of a (typically vast) vocabulary into a smaller vector space in order to facilitate further calculations. IDs of different types, such as user IDs, post IDs on social networks, video IDs, or IP addresses in recommendation systems, are examples of such features. Natural Language Processing is another prominent use for embeddings (usually word embeddings such as Mikolov et al., 2013); however, in NLP the vocabulary can be significantly reduced by considering "subwords" or "byte pair encodings". In recommendation systems like Matrix Factorization or DLRM (see fig. 2) it is typically not possible to factorize the vocabulary this way, and embedding tables end up very big, requiring hundreds of gigabytes of GPU memory (Naumov et al., 2019). This in effect forces models to be split across many GPUs, which is expensive and creates a communication bottleneck during training and inference. The traditional solution has been to hash the IDs down to a manageable universe size using the Hashing Trick (Weinberger et al., 2009), accepting that unrelated IDs may wind up with the same representation. Too aggressive hashing naturally hurts the ability of the model to distinguish its inputs, mixing up unrelated concepts and reducing model accuracy. Another option is to quantize the embedding tables. Typically, this entails rounding each individual parameter to 4 or 8 bits. Other quantization methods work in many dimensions at the same time, such as Product Quantization and Residual Vector Quantization. (See Gray & Neuhoff (1998) for a survey of quantization methods.) These multi-dimensional methods typically rely on clustering (like k-means) to find a set of representative "code words" to which each original ID is assigned.
For example, vectors representing "red", "orange" and "blue" may be stored as a single "dark orange" codeword.
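To make this concrete, the following is a minimal sketch of post-training vector quantization of an embedding table: a plain Lloyd's k-means pass finds a small codebook of representative rows, and each original row is replaced by a one-byte reference into that codebook. The table sizes, the number of codewords k, and the helper `kmeans` are all illustrative choices, not part of any particular method described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "trained" embedding table: 2,000 IDs, each with an 8-dim vector.
table = rng.normal(size=(2_000, 8)).astype(np.float32)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (codebook, assignments)."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each row to its nearest center (squared Euclidean distance).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # Recompute each center as the mean of its assigned rows.
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = x[mask].mean(0)
    return centers, assign

# Vector quantization: replace each of the 2,000 rows with a reference
# to one of k = 64 shared codewords.
k = 64
codebook, codes = kmeans(table, k)
codes = codes.astype(np.uint8)        # one byte per ID instead of 8 floats

# Lookup at inference time: an ID's embedding is its codeword.
reconstructed = codebook[codes]
```

Storage drops from 2,000 × 8 floats to 2,000 one-byte codes plus a 64 × 8 codebook, at the cost of collapsing similar rows onto the same vector. The catch, as the abstract notes, is that `table` must already be trained before this clustering can run.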

