SOLAR: SPARSE ORTHOGONAL LEARNED AND RANDOM EMBEDDINGS

Abstract

Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing NNS hurts both the query time and the accuracy of these models. In this paper, we argue that high-dimensional and ultra-sparse embeddings are a significantly superior alternative to dense low-dimensional embeddings for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing it with simple lookups, while high dimensionality ensures that the embeddings remain informative even when sparse. However, learning extremely high-dimensional embeddings blows up the model size. To make training feasible, we propose a partitioning algorithm that learns such high-dimensional embeddings across multiple GPUs without any communication. This is facilitated by our novel asymmetric mixture of Sparse, Orthogonal, Learned And Random (SOLAR) embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that this one-sided learning is equivalent to learning both query and label embeddings. With these unique properties, we successfully train 500K-dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public datasets. We achieve superior precision and recall compared to the respective state-of-the-art baselines for each task, with up to 10× faster speed.

1. INTRODUCTION

Embedding models have been the mainstay algorithms for several machine learning applications like Information Retrieval (IR) (8; 2) and Natural Language Processing (NLP) (21; 16; 31; 9) in the last decade. Embedding models are learned spin-offs of the low-rank approximation and matrix factorization techniques that dominated the space of recommendation systems prior to the emergence of Deep Learning (DL). The primary purpose of these models is to project a rather simple and intuitive representation of an input to an abstract low-dimensional dense vector space. This projection enables two things: 1) tailoring the vectors to specific downstream applications, and 2) pre-processing and storing documents or products as vectors, thereby making the retrieval process computationally efficient (often a matrix multiplication followed by sorting, both of which are conducive to modern hardware like GPUs). Besides the computational advantage, embedding models capture the semantic relationship between queries and products. A good example is product prediction for a service like Amazon. A user-typed query has to be matched against millions of products, and the best search results have to be displayed within a fraction of a second. With naive product data, it would be impossible to figure out that products with 'aqua' in their titles are actually relevant to the query 'water'. Rather, if we can project all the products to a dense low-dimensional vector space, a query can also be projected to the same space and an inner product (usually a dot product) can be computed with all the product vectors. We can then display the products with the highest inner product. These projections can be learned to encapsulate semantic information and can be continually updated to reflect temporal changes in customer preference.
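The dense-embedding retrieval pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the vector dimension, catalogue size, and function names are placeholders:

```python
import numpy as np

# Toy dense-embedding retrieval: every product is a pre-computed 128-d
# vector, and a query is scored against the whole catalogue with a single
# matrix multiplication, then sorted to surface the top hits.
rng = np.random.default_rng(0)
num_products, dim = 10_000, 128
product_vecs = rng.standard_normal((num_products, dim)).astype(np.float32)

def top_k_products(query_vec, k=10):
    scores = product_vecs @ query_vec   # inner product with every product
    return np.argsort(-scores)[:k]      # indices of the k best matches

query = rng.standard_normal(dim).astype(np.float32)
hits = top_k_products(query)
```

Note that the score computation touches all `num_products` vectors; this full scan (or the approximate NNS index that replaces it) is precisely the bottleneck that SOLAR's sparse lookups avoid.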
To the best of our knowledge, embedding models are the most prevalent ones in the industry, particularly for product and advertisement recommendations (Amazon's DSSM (23), Facebook's DLRM (22)). However, the scale of these problems has blown out of proportion in the past few years, prompting research in extreme classification tasks, where the number of classes runs into several million. Consequently, approaches like Tree-based Models (26; 15; 1) and Sparse-linear Models (36; 39; 38) have emerged as powerful alternatives. In particular, Tree-based models are much faster to train and evaluate compared to the other methods. However, most real Information Retrieval systems have dynamically changing output classes, and all the extreme classification models fail to generalize to new classes with limited training data (e.g., new products being added to the catalogue every day). This has caused the resurgence of embedding models for large-scale Extreme Classification (5; 29; 3; 7).

Our Contributions: In this paper, we argue that sparse, high-dimensional, orthogonal embeddings are superior to their dense low-dimensional counterparts. In this regard, we make two interesting design choices: 1) We design the label embeddings (e.g., products in the catalogue) to be high-dimensional, super-sparse, and orthogonal vectors. 2) We fix the label embeddings throughout the training process and learn only the input embeddings (one-sided learning), unlike typical dense models, where both the input and label embeddings are learned. Since we use a combination of Sparse, Orthogonal, Learned and Random embeddings, we code-name our method SOLAR. We provide a theoretical premise for SOLAR by showing that one-sided and two-sided learning are mathematically equivalent.
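The first design choice, fixed random sparse label embeddings, can be sketched as follows. This is a minimal NumPy illustration under assumed settings: the dimension `D` and per-label sparsity `K` are placeholders, not the paper's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(42)
D, K = 500_000, 4   # embedding dimension and nonzeros per label (illustrative)

def random_label_embedding():
    # Each label gets K random nonzero coordinates out of D. Storing only
    # these K integer indices suffices (the nonzero values are implicitly 1),
    # which is why the per-label memory footprint is tiny.
    return rng.choice(D, size=K, replace=False)

a, b = random_label_embedding(), random_label_embedding()
# With K << D, two labels almost never share a coordinate, so their dot
# product is almost surely zero: the vectors are near-orthogonal by design.
overlap = len(set(a.tolist()) & set(b.tolist()))
```

Because the labels' nonzero positions are drawn uniformly at random with a fixed count `K`, every coordinate of the index receives roughly the same number of labels, which is the source of the load-balancing property claimed above.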
Our choices manifest in a five-fold advantage over prior methods:

• Matrix Multiplication to Inverted-Index Lookup: Sparse high-dimensional embeddings can obtain a subset of labels using a mere inverted-index (8) lookup and restrict the computation and sorting to those labels. This enhances the inference speed by a large margin.

• Load-balanced Inverted Index: By forcing the label embeddings to be near-orthogonal and equally sparse (and fixing them), we ensure that all buckets in the inverted index are equally filled and we sample approximately the same number of labels for each input. This avoids the well-known imbalanced-buckets issue, where we sub-sample almost all the labels for popular inputs and end up hurting the inference speed.

• Lower Embedding Memory: Dense embedding models need to hold all label embeddings in GPU memory to perform real-time inference. This is not a scalable solution with millions of labels (which is a practical industry requirement). On the contrary, SOLAR needs to store only a few integer indices per label, which is very memory efficient with modern sparse array support on all platforms. These vectors can also be used with Locality Sensitive Hashing based indexing systems like FLASH (34).

• Zero Communication: Our unique construction of label embeddings enables distributed training over multiple GPUs with zero communication. Hence, we can afford to train on a 1.67M book recommendation dataset and the three largest extreme classification datasets, and outperform the respective baselines on all 4 of them in both precision and speed.

• Learning to Hash: An inverted index can be perceived as a hash table where all the output classes are hashed into a few buckets (18; 33). By fixing the label buckets and learning to map the inputs to the corresponding label buckets, we are, in hindsight, doing a 'partial learning to hash' task (more on this in Appendix A).
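The first two advantages above can be demonstrated together in a short sketch: building an inverted index over fixed K-sparse label embeddings and retrieving candidates by lookups alone. All sizes and names here are illustrative assumptions, not the paper's configuration:

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
D, K, num_labels = 100_000, 4, 50_000   # illustrative sizes

# Build the inverted index: bucket d -> labels whose sparse embedding has a
# nonzero at coordinate d. A fixed K nonzeros per label, drawn uniformly,
# keeps the buckets load-balanced by construction.
label_coords = [rng.choice(D, size=K, replace=False) for _ in range(num_labels)]
index = defaultdict(list)
for label, coords in enumerate(label_coords):
    for d in coords.tolist():
        index[d].append(label)

def retrieve(query_nonzeros):
    # A query's predicted nonzero coordinates select candidate labels via
    # plain bucket lookups -- no dense matrix multiplication over all labels.
    candidates = set()
    for d in query_nonzeros:
        candidates.update(index.get(d, ()))
    return candidates

# A label's own coordinates always retrieve that label.
cands = retrieve(label_coords[0].tolist())
```

Scoring and sorting are then restricted to the (small) candidate set, which is what turns the full matrix multiplication of dense models into a handful of hash-table lookups at inference time.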

2. RELATED WORK

SNRM: While there has been a plethora of dense embedding models, there is only one prior work, SNRM (Standalone Neural Ranking Model) (40), that trains sparse embeddings for the task of suggesting documents relevant to an input query (the classic web search problem). In SNRM, the authors propose to learn a high-dimensional output layer and sparsify it using a typical L1 or L2 regularizer. However, imposing sparsity through regularization causes a lopsided inverted index with imbalanced loads and high inference times. As we see in our experiments later, these issues lead to the poor performance of SNRM on our 1.67M product recommendation dataset.

GLaS: Akin to SOLAR's construction of near-orthogonal label embeddings, another recent work from Google (11) also explores the idea of enforcing orthogonality to make the labels distinguishable and thereby easier for the classifier to learn. The authors enforce it in such a way that frequently co-occurring labels have high cosine similarity and the ones that rarely co-occur have low cosine similarity. They call this imposition the Graph Laplacian and Spreadout (GLaS) regularizer. However, this was done entirely in the context of dense embeddings and cannot be extended to our case due to the differentiability issue. We compare SOLAR against dense embedding models with and without the GLaS regularizer in section 5.1.

