ANCHOR & TRANSFORM: LEARNING SPARSE EMBEDDINGS FOR LARGE VOCABULARIES

Abstract

Learning continuous representations of discrete objects such as text, users, movies, and URLs lies at the heart of many applications including language and user modeling. When using discrete objects as input to neural networks, we often ignore the underlying structures (e.g., natural groupings and similarities) and embed the objects independently into individual vectors. As a result, existing methods do not scale to large vocabulary sizes. In this paper, we design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix. We call our method ANCHOR & TRANSFORM (ANT) as the embeddings of discrete objects are a sparse linear combination of the anchors, weighted according to the transformation matrix. ANT is scalable, flexible, and end-to-end trainable. We further provide a statistical interpretation of our algorithm as a Bayesian nonparametric prior for embeddings that encourages sparsity and leverages natural groupings among objects. By deriving an approximate inference algorithm based on Small Variance Asymptotics, we obtain a natural extension that automatically learns the optimal number of anchors instead of having to tune it as a hyperparameter. On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes and demonstrates stronger performance with fewer parameters (up to 40× compression) as compared to existing compression baselines. Code for our experiments can be found at https://github.com/pliang279/ sparse_discrete.

1. INTRODUCTION

Most machine learning models, including neural networks, operate on vector spaces. Therefore, when working with discrete objects such as text, we must define a method of converting objects into vectors. The standard way to map objects to continuous representations involves: 1) defining the vocabulary V = {v 1 , ..., v V } as the set of all objects, and 2) learning a V × d embedding matrix that defines a d dimensional continuous representation for each object. This method has two main shortcomings. Firstly, when V is large (e.g., million of words/users/URLs), this embedding matrix does not scale elegantly and may constitute up to 80% of all trainable parameters (Jozefowicz et al., 2016) . Secondly, despite being discrete, these objects usually have underlying structures such as natural groupings and similarities among them. Assigning each object to an individual vector assumes independence and foregoes opportunities for statistical strength sharing. As a result, there has been a large amount of interest in learning sparse interdependent representations for large vocabularies rather than the full embedding matrix for cheaper training, storage, and inference. In this paper, we propose a simple method to learn sparse representations that uses a global set of vectors, which we call the anchors, and expresses the embeddings of discrete objects as a sparse linear combination of these anchors, as shown in Figure 1 . One can consider these anchors to represent latent topics or concepts. Therefore, we call the resulting method ANCHOR & TRANSFORM (ANT). The approach is reminiscent of low-rank and sparse coding approaches, however, surprisingly in the literature these methods were not elegantly integrated with deep networks. Competitive attempts are often complex (e.g., optimized with RL (Joglekar et al., 2019) ), involve multiple training stages (Ginart et al., 2019; Liu et al., 2017) , or require post-processing (Svenstrup et al., 2017; Guo et al., 2017; Aharon et al., 2006; Awasthi & Vijayaraghavan, 2018) . We derive a simple optimization objective which learns these anchors and sparse transformations in an end-to-end manner. ANT is G 3 v 7 O 6 5 + w c N E 6 e a s j q N R a x b I T F M c M X q w E G w V q I Z k a F g z X B 4 M 6 k 3 7 5 k 2 P F Y 1 G C U s k K S v e M Q p A W t 1 3 e M O s A f Q M i u X 8 H j K h m Y 1 T Z S J Y i 3 H X b f o l b y p 8 D L 4 c y i i u a p d 9 6 v T i 2 k q m Q I q i D F t 3 0 s g y I g G T g U b F z q p Y Q m h Q 9 J n b Y u K S G a C b H r H G J 9 a p 4 f t Y v s U 4 K n 7 e y I j 0 p i R D G 2 n J D A w i 7 W J + V + t n U J 0 H W R c J S k w R W e L o l R g i P E k F N z j m l E Q I w u E a m 7 / i u m A a E L B R l e w I f i L J y 9 D o 1 z y z 0 v e 3 U W x c j m P I 4 + O 0 A k 6 Q z 6 6 Q h V 0 i 6 q o j i h 6 R M / o F b 0 5 T 8 6 L 8 + 5 8 z F p z z n z m E P 2 R 8 / k D 6 e S Z 8 A = = < / l a t e x i t > Figure 1 : ANCHOR & TRANSFORM (ANT) consists of two components: 1) ANCHOR: Learn embeddings A of a small set of anchor vectors A = {a1, ..., a A }, A << V that are representative of all discrete objects, and 2) TRANSFORM: Learn a sparse transformation T from the anchors to the full embedding matrix E. A and T are trained end-to-end for specific tasks. ANT is scalable, flexible, and allows the user to easily incorporate domain knowledge about object relationships. We further derive a Bayesian nonparametric view of ANT that yields an extension NBANT which automatically tunes A to achieve a balance between performance and compression. scalable, flexible, and allows the user flexibility in defining these anchors and adding more constraints on the transformations, possibly in a domain/task specific manner. We find that our proposed method demonstrates stronger performance with fewer parameters (up to 40× compression) on multiple tasks (text classification, language modeling, and recommendation) as compared to existing baselines. We further provide a statistical interpretation of our algorithm as a Bayesian nonparametric (BNP) prior for neural embeddings that encourages sparsity and leverages natural groupings among objects. Specifically, we show its equivalence to Indian Buffet Process (IBP; Griffiths & Ghahramani (2005) ) prior for embedding matrices. While such BNP priors have proven to be a flexible tools in graphical models to encourage hierarchies (Teh & Jordan, 2010) , sparsity (Knowles & Ghahramani, 2011) , and other structural constraints (Roy et al., 2016) , these inference methods are usually complex, hand designed for each setup, and non-differentiable. Our proposed method opens the door towards integrating priors (e.g., IBP) with neural representation learning. These theoretical connections leads to practical insights -by asymptotically analyzing the likelihood of our model in the small variance limit using Small Variance Asymptotics (SVA; Roweis (1998) ), we obtain a natural extension, NBANT, that automatically learns the optimal number of anchors to achieve a balance between performance and compression instead of having to tune it as a hyperparameter.

2. RELATED WORK

Prior work in learning sparse embeddings of discrete structures falls into three categories: Matrix compression techniques such as low rank approximations (Acharya et al., 2019; Grachev et al., 2019; Markovsky, 2011) , quantizing (Han et al., 2016) , pruning (Anwar et al., 2017; Dong et al., 2017; Wen et al., 2016) , or hashing (Chen et al., 2015; Guo et al., 2017; Qi et al., 2017) have been applied to embedding matrices. However, it is not trivial to learn sparse low-rank representations of large matrices, especially in conjunction with neural networks. To the best of our knowledge, we are the first to present the integration of sparse low-rank representations, their non-parametric extension, and demonstrate its effectiveness on many tasks in balancing the tradeoffs between performance & sparsity. We also outperform many baselines based on low-rank compression (Grachev et al., 2019) , sparse coding (Chen et al., 2016b) , and pruning (Liu et al., 2017) . Reducing representation size: These methods reduce the dimension d for different objects. Chen et al. (2016a) divides the embedding into buckets which are assigned to objects in order of importance, Joglekar et al. ( 2019) learns d by solving a discrete optimization problem with RL, and Baevski & Auli (2019) reduces dimensions for rarer words. These methods resort to RL or are difficult to tune with many hyperparameters. Each object is also modeled independently without information sharing. Task specific methods include learning embeddings of only common words for language modeling (Chen et al., 2016b; Luong et al., 2015) , and vocabulary selection for text classification (Chen et al., 2019) . Other methods reconstruct pre-trained embeddings using codebook learning (Chen et al., 2018; Shu & Nakayama, 2018) or low rank tensors (Sedov & Yang, 2018) . However, these methods cannot work for general tasks. For example, methods that only model a subset of objects cannot be used for retrieval because it would never retrieve the dropped objects. Rare objects might be highly relevant to a few users so it might not be ideal to completely ignore them. Similarly, task-specific methods such as subword (Bojanowski et al., 2017) and wordpiece (Wu et al., 2016) embeddings, while useful for text, do not generalize to general applications such as item and query retrieval.

3. ANCHOR & TRANSFORM

Suppose we are presented with data X ∈ V N , Y ∈ R N ×c drawn from some joint distribution p(x, y), where the support of x is over a discrete set V (the vocabulary) and N is the size of the training set. The entries in Y can be either discrete (classification) or continuous (regression). The goal is to learn a d-dimensional representation {e 1 , ..., e V } for each object by learning an embedding matrix E ∈ R V ×d where row i is the representation e i of object i. A model f θ with parameters θ is then used to predict y, i.e., ŷi = f θ (x i ; E) = f θ (E[x i ]). At a high level, to encourage statistical sharing between objects, we assume that the embedding of each object is obtained by linearly superimposing a small set of anchor objects. For example, when the objects considered are words, the anchors may represent latent abstract concepts (of unknown cardinality) and each word is a weighted mixture of different concepts. More generally, the model assumes that there are some unknown number of anchors, A = {a 1 , ..., a A }. The embedding e i for object i is generated by first choosing whether the object possesses each anchor a k ∈ R d . The selected anchors then each contribute some weight to the representation of object i. Therefore, instead of learning the large embedding matrix E directly, ANT consists of two components: Algorithm 1 ANCHOR & TRANSFORM algorithm for learning sparse representations of discrete objects.

ANCHOR & TRANSFORM:

1: Anchor: initialize anchor embeddings A. 2: Transform: initialize T as a sparse matrix. 3: Optionally + domain info: initialize domain sparsity matrix S(G) as a sparse matrix (see Appendix F). 4: for each batch (X, Y) do 5: Compute loss L = ∑ i D φ (y i , f θ (x i ; TA)) 6: A, T, θ = UPDATE (∇L, η). 7: T = max {(T -ηλ 2 ) ⊙ S(G) + T ⊙ (1 -S(G)), 0}. 8: end for 9: return anchor embeddings A and transformations T.

1) ANCHOR: Learn embeddings

A ∈ R A ×d of a small set of anchor objects A = {a 1 , ..., a A }, A << V that are representative of all discrete objects. 2) TRANSFORM: Learn a sparse transformation T from A to E. Each of the discrete objects is induced by some transformation from (a few) anchor objects. To ensure sparsity, we want nnz(T) << V × d. A and T are trained end-to-end for task specific representations. To enforce sparsity, we use an 1 penalty on T and constrain its domain to be non-negative to reduce redundancy in transformations (positive and negative entries canceling out). min T≥0, A,θ i D φ (y i , f θ (x i ; TA)) + λ 2 T 1 , where D φ is a suitable Bregman divergence between predicted and true labels, and T 1 denotes the sum of absolute values. Most deep learning frameworks directly use subgradient descent to solve eq (1), but unfortunately, such an approach will not yield sparsity. Instead, we perform optimization by proximal gradient descent (rather than approximate subgradient methods which have poorer convergence around non-smooth regions, e.g., sparse regions) to ensure exact zero entries in T: A t+1 , T t+1 , θ t+1 = UPDATE ∇ i D φ (y i , f θ (x i ; T t A t )), η , T t+1 = PROX ηλ 2 (T t+1 ) = max T t+1 -ηλ 2 , 0 , where η is the learning rate, and UPDATE is a gradient update rule (e.g., SGD (Lecun et al., 1998) , ADAM (Kingma & Ba, 2015) , YOGI (Zaheer et al., 2018) ). PROX ηλ2 is a composition of two proximal operators: 1) soft-thresholding (Beck & Teboulle, 2009) at ηλ 2 which results from subgradient descent on λ 2 T 1 , and 2) max(⋅, 0) due to the non-negative domain for T. We implement this proximal operator on top of the YOGI optimizer for our experiments. Together, equations ( 2) and (3) give us an iterative process for end-to-end learning of A and T along with θ for specific tasks (Algorithm 1). T is implemented as a sparse matrix by only storing its non-zero entries and indices. Since nnz(T) << V × d, this makes storage of T extremely efficient as compared to traditional approaches of computing the entire V ×d embedding matrix. We also provide implementation tips to further speedup training and ways to incorporate ANT with existing speedup techniques like softmax sampling (Mikolov et al., 2013) or noise-contrastive estimation (Mnih & Teh, 2012) in Appendix H. After training, we only store A × d + nnz(T) << V × d entries that define the complete embedding matrix, thereby using fewer parameters than the traditional V × d matrix. General purpose matrix compression techniques such as hashing (Qi et al., 2017) , pruning (Dong et al., 2017) , and quantizing (Han et al., 2016) are compatible with our method: the matrices A and nnz(T) can be further compressed and stored. We first discuss practical methods for anchor selection ( §3.1). In Appendix F we describe several ways to incorporate domain knowledge into the anchor selection and transform process. We also provide a statistical interpretation of ANT as a sparsity promoting generative process using an IBP prior and derive approximate inference based on SVA ( §3.2). This gives rise to a nonparametric version of ANT that automatically learns the optimal number of anchors.

3.1. ANCHOR: SELECTING THE ANCHORS A

Inspired by research integrating initialization strategies based on clustering (Teh et al., 2007) and Coresets (Bachem et al., 2015) with Bayesian nonparametrics, we describe several practical methods to select anchor objects that are most representative of all objects (refer to Appendix D for a comparison of initialization strategies.). Frequency and TF-IDF: For tasks where frequency or TF-IDF (Ramos, 1999) are useful for prediction, the objects can simply be sorted by frequency and the most common objects selected as the anchor points. While this might make sense for tasks such as language modeling (Luong et al., 2015; Chen et al., 2016b) , choosing the most frequent objects might not cover rare objects that are not well represented by common anchors. Clustering: To ensure that all objects are close to some anchor, we use k-means++ initialization (Arthur & Vassilvitskii, 2007) . Given a feature space representative of the relationships between objects, such as Glove (Pennington et al., 2014) for words or a co-occurrence matrix (Haralick et al., 1973) for more general objects, k-means++ initialization picks cluster centers to span the entire space. This can augment other strategies, such as initializing anchors using frequency followed by clustering to complete remaining anchors (see Figure 2 ).

Random basis vectors:

Initialize A to a set of random basis vectors. This simple yet powerful method captures the case where we have less knowledge about the objects (i.e., without access to any pretrained representation/similarity space).

3.2. STATISTICAL INTERPRETATION AS A BAYESIAN NONPARAMETRIC PRIOR

To provide a statistical interpretation of ANT, we first analyze a generative process for discrete representations that is consistent with our algorithm. Given a set of anchors, A = {a 1 , ..., a A }, we use a binary latent variable z ik ∈ {0, 1} to indicate whether object i possesses anchor k and a positive latent variable w ik ∈ R ≥0 to denote the weight that anchor k contributes towards object i. Therefore, the representation e i is given by e i = ∑ k w ik z ik a k . Ideally, we want the vector z i to be sparse for efficient learning and storage. More formally, suppose there are K ∶= A anchors, then: • Z ∈ R V ×K ∼ IBP(a, b); A ∈ R K×d ∼ P (A) = N (0, 1); W ∈ R V ×K ∼ P (W) = Exp(1) • for i = 1, ⋯, N -ŷi = f θ (x i ; (Z ○ W)A) -y i ∼ p(y i x i ; Z, W, A) = exp {-D φ (y i , ŷi )} b φ (y i ) In this generative process, the selection matrix Z follows a two-parameter Indian Buffet Process (IBP; Griffiths & Ghahramani (2005) ) prior (Ghahramani et al., 2007) . Not only does this BNP prior allow for a potentially infinite number of anchors, but it also encourages each object to only select a small subset of anchors, resulting in a sparse z i (see Appendix A for details). We place a standard Gaussian prior on the continuous anchors embeddings a k and an exponential prior on the weights W which give the actual non-negative transformation weights for the non-zero entries defined in Z. E = (Z ○ W)A is the final embedding learnt by our model which represents a d-dimensional continuous representation {e 1 , ..., e V } for each discrete object where row i is the representation e i of object i. Finally, a neural model f θ with parameters θ is used to predict y i given the embedded representations, i.e., ŷi = f θ (x i ; (Z ○ W)A) = f θ ((Z ○ W)A[x i ]). Likelihood Model/Loss: We assume that the final emission model y i ŷi belongs to the exponential family. Since exponential family distributions have a corresponding Bregman divergence (Banerjee et al. (2005) ; see Appendix C for examples), we choose D φ (y i , ŷi ) as the corresponding Bregman divergence between predicted and true labels. Appropriate choices for D φ recover cross-entropy and MSE losses. b φ (y i ) does not depend on any learnable parameter or variable and can be ignored. Joint likelihood: Under the generative model as defined above, the joint likelihood is given by: log p(Y, Z, W, A X) ∝ i log p(y i x i ; Z, W, A) + log p(Z) + log p(W) + log p(A) = i -D φ (y i , f θ (x i ; (Z ○ W)A)) + log b φ (y i ) + log p(Z) + log p(W) + log p(A). However, calculating the posterior or MAP estimate is hard, especially due to the presence of the non-linear deep network f θ . Approximate inference methods such as MCMC, variational inference, or probabilistic programming would be computationally and statistically inefficient since it would involve sampling, evaluating, or training the model multiple times. To tackle this problem, we perform approximate inference via Small Variance Asymptotics (SVA), which captures the benefits of rich latent-variable models while providing a framework for scalable optimization (Broderick et al., 2013a; Jiang et al., 2012; Roychowdhury et al., 2013) . Approximate Inference via SVA: To use SVA, we introduce a scaling variable β and shrink the variance of the emission probability by taking β → ∞. The scaled probability emission becomes p(y i x i ; Z, W, A) = exp {-βD φ (y i , ŷi )} b βφ (y i ). Following Broderick et al. (2013a) , we modulate the number of features in the large-β limit by choosing constants λ 1 > λ 2 > 0 and setting the IBP hyperparameters a = exp(-βλ 1 ) and b = exp(βλ 2 ). This prevents a limiting objective function that favors a trivial cluster assignment (every data point assigned to its own separate feature). Maximizing the asymptotic joint likelihood (after taking limits, i.e., lim β→∞ 1 β log p(Y, Z, W, A X)) results in the following objective function: min T≥0, A,θ,K i D φ (y i , f θ (x i ; TA)) + λ 2 T 0 + (λ 1 -λ 2 )K, where we have combined the variables Z and W with their constraints into one variable T. The exponential prior for W results in a non-negative domain for T. Please refer to Appendix B for derivations. Note that eq (5) suggests a natural objective function in learning representations that minimize the prediction loss D φ (y i , f θ (x i ; TA)) while ensuring sparsity of T as measured by the 0 -norm and using as few anchors as possible (K). Therefore, optimizing eq (5) gives rise to a nonparametric version of ANT, which we call NBANT, that automatically learns the optimal number of anchors. To perform optimization over the number of anchors, our algorithm starts with a small A = 10 and either adds anchors (i.e., adding a new row to A and a new column to T) or deletes anchors to minimize eq (5) at every epoch depending on the trend of the objective evaluated on validation set. We outline the exact algorithm in Appendix G along with more implementation details. Analogously, we can derive the finite case objective function for a fixed number of anchors K: min T≥0, A,θ i D φ (y i , f θ (x i ; TA)) + λ 2 T 0 , which, together with a 1 penalty on T as a convex relaxation for the 0 penalty, recovers the objective function in eq (1). The solution for this finite version along with K yields the Pareto front. Different values of λ 1 in eq (5) can be used for model selection along the front as elucidated in Appendix L.

4. EXPERIMENTS

To evaluate ANT, we experiment on text classification, language modeling, and movie recommendation tasks. Experimental details are in Appendix J and full results are in Appendix K.

4.1. TEXT CLASSIFICATION

Setup: We follow the setting in Chen et al. (2019) with four datasets: AG-News (V = 62K) (Zhang et al., 2015) , DBPedia (V = 563K) (Lehmann et al., 2015) , Sogou-News (V = 254K) (Zhang et al., 2015) , and Yelp-review (V = 253K) (Zhang et al., 2015) . We use a CNN for classification (Kim, 2014) . ANT is used to replace the input embedding and domain knowledge is derived from WordNet and co-occurrence in the training set. We record test accuracy and number of parameters used in the embedding only. For ANT, num params is computed as A × d + nnz(T). Table 1 : Text classification results on AG-News. Our approach with different initializations achieves within 0.5% accuracy with 40× fewer parameters, outperforming the published compression baselines. Init: initialization method, Acc: accuracy, # Emb: number of (non-zero) embedding parameters. (Zhang et al., 2015) 61, 673 All 91.6 15.87 FREQUENCY (Chen et al., 2019) 5, 000 Frequency 91.0 1.28 TF-IDF (Chen et al., 2019) 5, 000 TF-IDF 91.0 1.28 GL (Chen et al., 2019) 4, 000 Group lasso 91.0 1.02 VVD (Chen et al., 2019) 3, 000 Var dropout 91.0 0.77 SPARSEVD (Chirkova et al., 2018) 5, 700 Mult weights 88.8 1.72 SPARSEVD-VOC (Chirkova et al., 2018) Baselines: On top of the CNN, we compare to the following compression approaches. Vocabulary selection methods: 1) FREQUENCY where only embeddings for most frequent words are learnt (Chen et al., 2016b; Luong et al., 2015) , 2) TF-IDF which only learns embeddings for words with high TF-IDF score (Ramos, 1999) , 3) GL (group lasso) which aims to find underlying sparse structures in the embedding matrix via row-wise 2 regularization (Liu et al., 2015; Park et al., 2016; Wen et al., 2016) , 4) VVD (variational vocabulary dropout) which performs variational dropout for vocabulary selection (Chen et al., 2019) . We also compare to 5) SPARSEVD (sparse variational dropout) which performs variational dropout on all parameters (Chirkova et al., 2018) , 6) SPARSEVD-VOC which uses multiplicative weights for vocabulary sparsification (Chirkova et al., 2018) , and 7) a SPARSE CODE model that learns a sparse code to reconstruct pretrained word representations (Chen et al., 2016b) . All CNN architectures are the same for all baselines with details in Appendix J.1. Method A Init A Sparse T T ≥ 0 Acc (%) # Emb (M) CNN Results on AG-News are in Table 1 and results for other datasets are in Appendix K.1. We observe that restricting T ≥ 0 using an exponential prior is important in reducing redundancy in the entries. Domain knowledge from WordNet and co-occurrence also succeeded in reducing the total (non-zero) embedding parameters to 0.40M, a compression of 40× and outperforming the existing approaches.

4.2. LANGUAGE MODELING

Setup: We perform experiments on word-level Penn Treebank (PTB) (V = 10K) (Marcus et al., 1993) and WikiText-103 (V = 267K) (Merity et al., 2017) with LSTM (Hochreiter & Schmidhuber, 1997) and AWD-LSTM (Merity et al., 2018) . We use ANT as the input embedding tied to the output embedding. Domain knowledge is derived from WordNet and co-occurrence on the training set. We record the test perplexity and the number of (non-zero) embedding parameters. Baselines: We compare to SPARSEVD and SPARSEVD-VOC, as well as low-rank (LR) and tensortrain (TT) model compression techniques (Grachev et al., 2019) . Note that the application of variational vocabulary selection to language modeling with tied weights is non-trivial since one is unable to predict next words when words are dynamically dropped out. We also compare against methods that compress the trained embedding matrix as a post-processing step before evaluation: POST-SPARSE HASH (post-processing using sparse hashing) (Guo et al., 2017) and POST-SPARSE HASH+k-SVD (Awasthi & Vijayaraghavan, 2018; Guo et al., 2017) which uses k-SVD (which is the basis of dictionary learning/sparse coding) (Aharon et al., 2006) to solve for a sparse embedding matrix, instead of adhoc-projection in (Guo et al., 2017) . Comparing to these post-processing methods demonstrates that end-to-end training of sparse embeddings is superior to post-compression. Results: On PTB (Table 2 ), we improve the perplexity and compression as compared to previously proposed methods. We observe that sparsity is important: baseline methods that only perform lowerrank compression with dense factors (e.g., LR LSTM) tend to suffer in performance and use many parameters, while ANT retains performance with much better compression. ANT also outperforms post-processing methods (POST-SPARSE HASH), we hypothesize this is because these post-processing methods accumulate errors in both language modeling as well as embedding reconstruction. Using an anchor size of 500 1, 000 reaches a good perplexity/compression trade-off: we reach within 2 points perplexity with 5× reduction in parameters and within 7 points perplexity with 10× reduction. Using AWD-LSTM, ANT with 1, 000 dynamic basis vectors is able to compress parameters by 10× while achieving 72.0 perplexity. Incorporating domain knowledge allows us to further compress the parameters by another 10× and achieve 70.0 perplexity, which results in 100× total compression. -103 (bottom) . We outperform existing vocabulary selection, low-rank, tensor-train, and post-compression (hashing) baselines on performance and compression metrics. Ppl: perplexity, # Emb: number of (non-zero) embedding parameters.

Method (PTB)

A Init A Sparse T T ≥ 0 Ppl # Emb (M) LSTM (Chirkova et al., 2018) 10, 000 All 70.3 2.56 LR LSTM (Grachev et al., 2019) 10, 000 All 112.1 1.26 TT LSTM (Grachev et al., 2019) 10, 000 All 116.6 1.16 AWD-LSTM (Merity et al., 2018) 10, 000 All 59.0 4.00 SPARSEVD (Chirkova et al., 2018) 9, 985 Mult weights 109.2 1.34 SPARSEVD-VOC (Chirkova et al., 2018) On WikiText-103, we train using sampled softmax (Bengio & Senecal, 2008 ) (due to large vocabulary) for 500, 000 steps. To best of our knowledge, we could not find literature on compressing language models on WikiText-103. We tried general compression techniques like low rank tensor and tensor train factorization (Grachev et al., 2019) , but these did not scale. As an alternative, we consider a HASH EMBED baseline that retains the frequent k words and hashes the remaining words into 1, 000 OOV buckets (Svenstrup et al., 2017) . We vary k ∈ {1×10 5 , 5×10 4 , 1×10 4 } (details in Appendix J.3). From Table 2 (bottom), we reach within 3 perplexity with ∼ 16× reduction in parameters and within 13 perplexity with ∼ 80× reduction, outperforming the frequency and hashing baselines. We observe that ANT's improvement over post-compression methods (POST-SPARSE HASH) is larger on WikiText than PTB, suggesting that ANT is particularly suitable for large vocabularies.

4.3. RECOMMENDER SYSTEMS

Setup: We perform experiments on both movie and product recommendation tasks. For movie recommendations, we follow Ginart et al. (2019) and we experiment on MovieLens 25M (Harper & Konstan, 2015) with 126K users and 59K movies. We also present results for MovieLens 1M in Appendix K.3. On product recommendation, we show that ANT scales to Amazon Product reviews (Ni et al., 2019) , the largest existing dataset for recommender systems with 233M reviews spanning 43.5M users and 15.2M products. Following Wan et al. (2020) , we ensured that the users and products in the test set have appeared in the training data for generalization. Figure 3 : NBANT accurately selects the optimal A to minimize eq (5), thereby achieving a balance between performance and compression. Baselines: We compare to a baseline Matrix Factorization (MF) model (Koren et al., 2009) with full embedding matrices for movies and users and to Mixed Dimension (MIXDIM) embeddings (Ginart et al., 2019) , a compression technique that assigns different dimension to different users/items based on popularity. We also compare to SPARSE CBOW (Sun et al., 2016) which learns sparse E by placing an 1 penalty over all entries of E and optimizing using online subgradient descent, and SLIMMING (Liu et al., 2017) , which performs subgradient descent before pruning small weights by setting them to 0. Such methods learn embeddings for objects independently without statistical strength sharing among related objects. We also test NBANT using the algorithm derived from the Bayesian nonparametric interpretation of ANT. Results: From Table 3 , ANT outperforms standard matrix factorization and dense mixed dimensional embeddings for performance and compression. NBANT is also able to automatically select an optimal number of anchors (6 8) to achieve solutions along the performance-compression Pareto front. In Figure 3 , we plot the value of eq (5) across values of A after a comprehensive hyperparameter sweep on ANT across 1000 settings. In comparison, NBANT optimizes A and reaches a good value of eq (5) in a single run without having to tune A as a hyperparameter, thereby achieving best balance between performance and compression. Please refer to Appendix K.3 for more results and discussion on NBANT. For product recommendation, we first experiment on a commonly used subset of the data, Amazon Electronics (with 9.84M users and 0.76M products), to ensure that our results match published baselines (Wan et al., 2020) , before scaling our experiment to the entire dataset. From Table 4 , we find that ANT compresses embeddings by 25× on Amazon Electronics while maintaining performance, and 10× on the full Amazon reviews dataset. Online NBANT: Since NBANT automatically grows/contracts A during training, we can further extend NBANT to an online version that sees a stream of batches without revisiting previous ones (Bryant & Sudderth, 2012) . We treat each batch as a new set of data coming in and train on that batch until convergence, modify A as in Algorithm 2, before moving onto the next batch. In this significantly more challenging online setting, NBANT is still able to learn well and achieve a MSE of 0.875 with 1.25M non zero parameters. Interestingly this online version of NBANT settled on a similar range of final user (8) and item (8) anchors as compared to the non-online version (see Table 3 ), which confirms the robustness of NBANT in finding relevant anchors automatically. In Appendix K.3 we discuss more observations around online NBANT including ways of learning A .

4.4. DISCUSSION AND OBSERVATIONS

Here we list some general observations regarding the importance of various design decisions in ANT: 1) Sparsity is important: Baselines that compress with dense factors (e.g., LR, TT) suffer in performance while using many parameters, while ANT retains performance with better compression. 2) Choice of A: We provide results on more clustering initializations in Appendix D. In general, performance is robust w.r.t. choice of A. While frequency and clustering work better, using a dynamic basis also performs well. Thus, it is beneficial to use any extra information about the discrete objects (e.g., domain knowledge or having a good representation space like GloVe to perform clustering). 3) Anchors and sparse transformations learned: We visualize the important transformations (large entries) learned between anchors and non-anchors in Table 5 . Left, we show the most associated non-anchors for a given anchor word and find that the induced non-anchors are highly plausible: stock accurately contributes to bonds, certificates, securities, and so on. Right, we show the largest (non-anchor, anchor) pairs learned, where we find related concepts such as (billion, trillion) and (government, administration). On MovieLens, for each anchor, we sort the movies according to the magnitude of their transformation coefficients which automatically discovers movie clusters based on underlying genres. We obtain a genre purity ratio of 61.7% by comparing automatically discovered movie clusters with the true genre tags provided in MovieLens. 4) Zero transformations learned: For MovieLens, we find that ANT assigns 2673 out of 59047 movies to an entire zero row, of which 84% only had 1 rating (i.e., very rare movies). Therefore, compression automatically discovers very rare objects (1 labeled point). On WikiText-103, rare words (e.g., Anarky, Perl, Voorhis, Gaudí, Lat, Bottomley, Nescopeck) are also automatically assigned zero rows when performing high compression (54.2 ppl with 0.4M params). Certain rare words that might be predictive, however, are assigned non-zero rows in T, such as: sociologists, deadlines, indestructible, causeways, outsourced, glacially, heartening, unchallenging, roughest. 5) Choice of λ 1 , λ 2 : Tuning λ 1 allows us to perform model selection by controlling the trade-off between A (model complexity) and performance. By applying eq (5) on our trained models in Table 2 , choosing a small λ 1 = 2 × 10 -5 prefers more anchors ( A = 1, 000) and better performance (ppl = 79.4), while a larger λ 1 = 1 × 10 -1 selects fewer anchors ( A = 100) with a compromise in performance (ppl = 106.6). Tuning λ 2 allows us to control the tradeoff between sparsity and performance (see details in Appendix L). Figure 4 : ANT converges faster and to a better validation loss than the baselines. 6) Convergence: In Figure 4 , we plot the empirical convergence of validation loss across epochs. ANT converges as fast as the (non-sparse) MF baseline, and faster than compression baselines MixDim (Ginart et al., 2019) and Sparse CBOW (Sun et al., 2016) . ANT also converges to the best validation loss. 

5. CONCLUSION

This paper presented ANCHOR & TRANSFORM to learn sparse embeddings of large vocabularies using a small set of anchor embeddings and a sparse transformation from anchors to all objects. We also showed a statistical interpretation via integrating IBP priors with neural representation learning. Asymptotic analysis of the likelihood using SVA yields an extension that automatically learns the optimal number of anchors. On text classification, language modeling, and recommender systems, ANT outperforms existing approaches with respect to accuracy and sparsity.

APPENDIX A INDIAN BUFFET PROCESS WITH TWO PARAMETERS

In this section we provide a more detailed treatment of the Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2005; 2011; Paisley et al., 2010) as well its two-parameter generalization (Ghahramani et al., 2007) . We begin with describing the single parameter version, which essentially defines a probability distribution over sparse binary matrices with a finite number of rows and an unbounded number of columns. Under IBP prior with hyperparameter a, to generate such a sparse binary random matrix Z with V rows, we have the following process: v j ∼ Beta(a, 1) b k = k j=1 v j z ik ∼ Bernoulli(b k ), i = 1, ..., V It can be shown from this construction that a given matrix Z will have non-zero probability under the IBP prior if and only if the number of columns containing non-zero entries is finite, albeit a random quantity (Griffiths & Ghahramani, 2005) . Also note that b k keeps diminishing as 0 < v j < 1, thus most z ik will be 0, thereby giving rise to the desired sparsity. Moreover, it can be shown that number of number of non-empty columns would be O(log V ) in expectation. Like most Bayesian nonparametric models, it is best understood with an analogy. Consider of a set of customers (objects) queued up to take dishes (features/anchors) from a buffet arranged in a line. The first customer starts at beginning of the buffet and takes a serving of all of the first Poisson(a) dishes. The remaining customers are more picky. The ith customer samples dishes in proportion to their popularity (i.e., with probability m k i), where m k is the number of previous customers who have sampled a dish. Having reached the end of all previous sampled dishes, the ith customer then tries Poisson(a i) new dishes. The result of this process for the entire vocabulary V is a binary matrix Z with V rows and infinitely many columns where z ik = 1 if the ith customer sampled the kth dish. Using either description of the IBP, we can find the distribution of the sparse binary matrix Z with V rows and K non-empty columns to be given by (Griffiths & Ghahramani, 2005; Broderick et al., 2013b) : p(Z) = a K ∏ 2 V -1 h=1 K h ! exp(-aH V ) K k=1 ( V -m k )!(m k -1)! V ! , where m k denotes number of customers (objects) who selected dish (anchor) k, H V is the V -th Harmonic number H V = ∑ V j=1 1 j , and K h is the number of occurrences of the non-zero binary vector h among the columns in Z. However, the number of features per object and the total number of features are directly coupled through the single parameter a. The two-parameter generalization of the IBP allows one to independently tune the average number of features for each object and the overall number of features used across all V objects (Ghahramani et al., 2007) . In particular, we now have an additional hyperparameter b along with a. The first customer, as before, samples Poisson(a) dishes. However, the i-th customer now samples in proportion to their popularity with probability m k (i + b), where m k is the number of previous customers who have sampled a dish. Having reached the end of all previously sampled dishes, the ith customer tries Poisson(ab (i + b)) new dishes. The parameter b is introduced in such a way as to preserve the expected number of features per object to be still a, but the expected overall number of features is now O(ab log V ). The total number of features used thus increases as b increases, thus providing more control on sparsity of Z. This two parameter IBP prior for a binary matrices Z with V rows and K non-empty columns is given by: p(Z) = (ab) K ∏ 2 V -1 h=1 K h ! exp(-abH V ) K k=1 Γ(m k )Γ( V -m k + b) Γ( V + b) where m k denotes number of customers (objects) who selected dish (anchor) k and H V = ∑ V j=1 1 b+j-1 . This distribution is suitable for use as a prior for Z in models that represent objects using a sparse but potentially infinite array of features. Historically, IBP has been used as a prior in latent feature models, where the binary entries z ik of a random matrix encode whether feature k is used to explain observation i. The IBP can be further combined with a simple observation model p(y i z i,∶ , θ) from the exponential family like the Gaussian distribution (see e.g. Griffiths & Ghahramani (2011) ). The complexity of inference, using MCMC or variational methods, for such binary factor analysis models has limited the adoption of more complicated observation models. In this work, we break this barrier and, to the best of our knowledge, are the first to integrate IBP with deep representation learning of discrete objects by employing an efficient SVA based inference. Thus, our approach combines the representation capabilities of neural networks with desirable sparsity properties.

B DERIVATION OF OBJECTIVE FUNCTION VIA SVA

In this section we derive our objective function using Small Variance Asymptotics (SVA) (Jiang et al., 2012) . Recall that the generative process in our model is given by: • Z ∈ R V ×K ∼ IBP(a, b) • A ∈ R K×d ∼ P (A) = N (0, 1) • W ∈ R V ×K ∼ P (W) = Exponential(1) • for i = 1, ⋯, N -ŷi = f θ (x i ; (Z ○ W)A) -y i ∼ p(y i x i ; Z, W, A) = exp {-D φ (y i , ŷi )} b φ (y i ) The joint log-likelihood under our generative model above is therefore: log p(Y, Z, W, A X) ∝ i log p(y i x i , Z, W, A) + log p(Z) + log p(W) + log p(A) = i {-D φ (y i , f θ (x i , (Z ○ W)A)) + log b φ (y i )} + log p(Z) + log p(W) + log p(A). To use SVA, an approximate objective function for finding point estimates is obtained by taking the limit of the emission probability variances down to zero. We begin by introducing a scaling variable β and shrinking the variance of the emission probability to 0 by taking β → ∞. The scaled probability emission becomes p(y i x i , Z, W, A) = exp {-βD φ (y i , ŷi )} b βφ (y i ) (11) Following Broderick et al. (2013a) , we modulate the number of features in the large-β limit by choosing constants λ 1 > λ 2 > 0 and setting the IBP hyperparameters with β as follows: a = exp(-βλ 1 ) b = exp(βλ 2 ) (12) This prevents a limiting objective function that favors a trivial cluster assignment (every data point assigned to its own separate feature). We now take the limit of the log-likelihood term by term: lim β→∞ 1 β log p(Y, A, W, Z X) (13) = lim β→∞ 1 β log p(y i x i , Z, W, A) + lim β→∞ 1 β log p(Z) + lim β→∞ 1 β log p(W) + lim β→∞ 1 β log p(A). • lim β→∞ 1 β log p(y i x i , Z, W, A) = lim β→∞ 1 β (-βD φ (y i , ŷi ) + log b βφ (y i )) = -D φ (y i , ŷi ) + O(1). • lim β→∞ 1 β log p(Z) = -λ 2 Z 0 -(λ 1 -λ 2 )K, see box below. • lim β→∞ 1 β log p(W) = 0, if W ≥ 0 else -∞. • lim β→∞ 1 β log p(A) = 0 as log p(A) = O(1). For convenience, we re-write the limit of the IBP prior as lim β→∞ 1 β log p(Z) = lim β→∞ 1 β log (ab) K ∏ 2 V -1 h=1 K h a + lim β→∞ 1 β log exp(-abH V ) b + K k=1 lim β→∞ 1 β log Γ(m k )Γ( V -m k + b) Γ( V + b) c (15) For part a : lim β→∞ 1 β log (ab) K ∏ 2 V -1 h=1 K h = lim β→∞ 1 β log exp(-β(λ 1 -λ 2 )K) ∏ 2 V -1 h=1 K h = lim β→∞ 1 β × -β(λ 1 -λ 2 )K -lim β→∞ 1 β × O(1) = -(λ 1 -λ 2 )K For part b : lim β→∞ 1 β log exp(-abH V ) = lim β→∞ 1 β × -abH V = lim β→∞ -exp(-β(λ 1 -λ 2 )K) β × V j=1 1 exp(βλ 2 ) + j -1 = 0 (17) For part c : lim β→∞ 1 β log Γ(m k )Γ( V -m k + b) Γ( V + b) = lim β→∞ 1 β log Γ(m k ) -lim β→∞ 1 β m k j=1 log( V -j + b) = 0 - m k j=1 lim β→∞ log( V -j + exp(βλ 2 )) β = - m k j=1 λ 2 = -λ 2 m k We know that m k is the number of objects which uses anchor k which counts the number of non-zero entries in the k-th column of Z. When we sum over all k, it just becomes the number of non-zero entries in Z, which is equivalent to the L 0 norm of Z, i.e., Z 0 . Therefore, the MAP estimate under SVA as given by max lim β→∞ 1 β log p(Y, A, W, Z X) is equivalent to optimizing the following objective function: max Z∈0,1 W≥0 A,θ,K i -D φ (y i , f θ (x i , (Z ○ W)A)) -λ 2 Z 0 -(λ 1 -λ 2 )K, where the exponential prior for W resulted in a limiting domain for W to be positive. Note that we can combine the optimizing variables Z and W with their constraints into one variable T ≥ 0. Also we can switch from a maximization problem to a minimization problem by absorbing the negative sign. Finally we arrive at the desired objective: min T≥0 A,θ,K i D φ (y i , f θ (x i , TA)) + λ 2 T 0 + (λ 1 -λ 2 )K. C EXPONENTIAL FAMILY DISTRIBUTIONS AS BREGMAN DIVERGENCES In this section we provide some results that relate exponential families distributions and Bregman divergences. As a result, we can relate likelihood models from Sec. 3.2 to appropriate Bregman divergences. Thus, a probabilistic observation model can be translated to a loss functions minimizing the Bregman divergence, which are more amenable to deep network training using gradient based methods. We begin by defining the Bregman divergence below and stating the relationship formally in Theorem 1. Definition 1. (Bregman, 1967) Let φ ∶ S → R, S = dom(φ) be a strictly convex function defined on a convex set S ⊂ R d such that φ is differentiable on ri(S), assumed to be non-empty. The Bregman divergence D φ ∶ S × ri(S) → [0, ∞) is defined as D φ (x, y) = φ(x) -φ(y) -⟨x -y, ∇φ(y)⟩, where ∇φ(y) represents the gradient vector of φ evaluated at y. Theorem 1. (Banerjee et al., 2005) There is a bijection between regular exponential families and regular Bregman divergences. In particular, for any exponential family distribution p(x θ) = p 0 (x) exp(⟨x, θ⟩ -g(θ)) can be written as p(x µ) = exp(-D φ (x, µ))b φ (x) where φ is the Legendre dual of the log-partition function g(θ) and µ = ∇ θ g(θ). From Theorem 1, we can see that maximizing log-likelihood log p(x θ) is same as minimizing the Bregman divergence D φ (x, µ). Note that we can ignore b φ (x) as it depends only on observed data and does not depend on any parameters. We now illustrate some common examples of exponential families (like Gaussian and categorical), derive their corresponding Bregman divergences, and connect to usual loss functions used in deep networks (like MSE and cross-entropy). Example 1: Gaussian distribution. (Banerjee et al., 2005) We start with the unit variance spherical Gaussian distributions with with mean µ, which have densities of the form: p(x; µ) = 1 (2π) d exp - 1 2 x -µ 2 2 . ( ) Using the log-partition function for Gaussian distribution, we can calculate that φ(x) = 1 2 x 2 , which yields Bregman divergence equal to: D φ (x, µ) = φ(x) -φ(µ) -⟨x -µ, ∇φ(µ)⟩ (24) = 1 2 x 2 2 - 1 2 µ 2 2 -⟨x -µ, µ⟩ = 1 2 x -µ 2 2 mean squared error , ( ) Thus, D φ (x, µ) along with constant b φ (x) given by b φ (x) = 1 (2π) d , ( ) recovers the Gaussian density p(x) = exp(-D φ (x, µ))b φ (x). Therefore, when we assume that labels have a Gaussian emmission model, the corresponding Bregman divergence D φ (x, µ) = 1 2 x -µ 2 2 recovers the squared loss commonly used for regression. Example 2: Multinomial distribution. (Banerjee et al., 2005) Another exponential family that is widely used is the family of multinomial distributions: p(x, q) = N ! ∏ d j=1 x j ! d j=1 q xj j (28) where x j ∈ Z + are frequencies of events, ∑ d j=1 x j = N and q j ≥ 0 are probabilities of events, ∑ d j=1 q j = 1. The multinomial density can be expressed as the density of an exponential distribution in x = {x j } d-1 j=1 with natural parameter θ = log qj q d d-1 j=1 , cumulant function g(θ) = -N log q d , and expectation parameter µ = ∇g(θ) = [N q j ] d-1 j=1 . The Legendre dual φ of g is given by φ(µ) = N d j=1 µ j N log µ j N = N d j=1 q j log q j . ( ) As a result, the multinomial density can be expressed as a Bregman divergence equal to: D φ (x, µ) = d j=1 x j log x j constant - d j=1 x j log µ j cross-entropy loss . ( ) and constant b φ (x) given by b φ (x) = ∏ d j=1 x xj j N N N ! ∏ d j=1 x j ! , ( ) which recovers the multinomial density p(x) = exp(-D φ (x, µ))b φ (x). Therefore, when the labels are generated from a multinomial distribution, the corresponding Bregman divergence D φ (x, µ) = -∑ d j=1 x j log µ j + constant recovers the cross-entropy loss commonly used for classification.

D LEARNING THE ANCHOR EMBEDDINGS A

Here we provide several other strategies for initializing the anchor embeddings: • Sparse lasso and variational dropout (Chen et al., 2019) . Given the strong performance of sparse lasso and variational dropout as vocabulary selection methods, it would be interesting to use sparse lasso/variational dropout to first select the important task-specific words before jointly learning their representations and their transformations to other words. However, sparse lasso and variational dropout require first training a model to completion unlike frequency and clustering based vocabulary selection methods that can be performed during data preprocessing. • Coresets involve constructing a reduced data set which can be used as proxy for the full data set, with provable guarantees such that the same algorithm run on the coreset and the full data set gives approximately similar results (Phillips, 2016; Har-Peled & Mazumdar, 2004) . Coresets can be approximately computed quickly (Bachem et al., 2017) and can be used to initialize the set of anchors A. In general, there is a trade-off between how quickly we can choose the anchor objects and their performance. Randomly picking anchor objects (which is equivalent to initializing the anchor embeddings with dynamic basis vectors) becomes similar to learning a low-rank factorization of the embedding matrix (Sedov & Yang, 2018) , which works well for general cases but can be improved for task-specific applications or with domain knowledge. Stronger vocabulary selection methods like variational dropout and group lasso would perform better but takes significantly longer time to learn. We found that intermediate methods such as frequency, clustering, with WordNet/co-occurrence information works well while ensuring that the preprocessing and training stages are relatively quick. In Appendix K we provide more results for different initialization strategies including those based on clustering initializations. In general, performance is robust with respect to the choice of A among the ones considered (i.e., random, frequency, and clustering). While frequency and clustering work better, using a set of dynamic basis embeddings still gives strong performance, especially when combined with domain knowledge from WordNet and co-occurrence statistics. This implies that when the user has more information about the discrete objects (e.g., having a good representation space to perform clustering), then the user should do so. However, for a completely new set of discrete objects, simply using low-rank basis embeddings with sparsity also work well.

E TRANSFORM: LEARNING A SPARSE T

In addition to a simple sparse linear transformation, we describe some extensions that improve sparsity and expressitivity of the learned representations.

Reducing redundancy in representations:

To further reduce redundancy in our sparse representations, we perform orthogonal regularization of dynamic basis vectors A by adding the loss term L(A) = ∑ i≠j a ⊺ i a j to the loss function in eq (1). This ensures that different basis vectors a i and a j are orthogonal instead of being linear combinations of one another which would lead to redundancies across different learnt entries in T. Mixture of anchors: In general, different initialization strategies may bring about different advantages. For example, using a mixture of random basis vectors has been shown to help model multisense embeddings (Athiwaratkun et al., 2018; Nguyen et al., 2017) . One can define a set of M anchor embeddings A 1 , ..., A M each initialized by different strategies and of possibly different sizes. Nonlinear mixture of transformations: To complement learning multiple sets of anchor embeddings A 1 , ..., A M , the straightforward extension of the TRANSFORM step would be to learn a separate linear transformation for each anchor embedding and summing the result: E = ∑ M m=1 T m A m . However, the expressive power of this linear combination is equivalent to one set of anchor embeddings equal to concatenating A 1 , ..., A M and one linear transformation. To truly exhibit the advantage of multiple anchors, we transform and combine them in a nonlinear fashion, e.g., E = ∑ M m=1 softmax(T m )A m (softmax over the rows of T m , Figure 5 ). Different transformations can be learned for different initializations of anchors. This is connected with the multi-head attention mechanism in the Transformer (Vaswani et al., 2017) , where softmax(T m ) are the softmax-activated (sparse) attention weights and A m the values to attend over. The result is an embedding matrix formed via a nonlinear mixture of anchors (each initialized with different strategies) and sparse transformations.

F INCORPORATING DOMAIN KNOWLEDGE

ANT also allows incorporating domain knowledge about object relationships. Suppose we are given some relationship graph G = (V, E) where each object is a vertex v ∈ V and an edge (u, v) ∈ E exists between objects u and v if they are related. Real-world instantiations of such a graph include 1) WordNet (Miller, 1995) or ConceptNet (Liu & Singh, 2004) for semantic relations between words, 2) word co-occurrence matrices (Haralick et al., 1973) , and 3) Movie Clustering datasets (Leskovec & Krevl, 2014) . From these graphs, we extract related positive pairs P = {(u, v) ∈ E} and unrelated negative pairs N = {(u, v) ∉ E}. We incorporate domain information as follows (see Figure 6 for a visual example): Positive pairs: To incorporate a positive pair (u, v), we do not enforce sparsity on T u,v . This allows ANT to freely learn the transformation between related objects u and v without being penalized for sparsity. On the other hand, transformations between negative pairs will be sparsely penalized. In other words, before computing the 1 -penalty, we element-wise multiply T with a domain sparsity matrix S(G) where S(G) u,v = 0 for (u, v) ∈ P (entries not 1 -penalized) and S(G) u,v = 1 otherwise (entries are 1 -penalized), resulting in the following modified objective: min T≥0, A,θ i D φ (y i , f θ (x i , TA)) + λ 2 T ⊙ S(G) 1 . Since we perform proximal GD, this is equivalent to only soft-thresholding the entries between unrelated objects, i.e., T = max {(T -ηλ 2 ) ⊙ S(G) + T ⊙ (1 -S(G)), 0}. Note that this strategy is applicable when anchors are selected using the frequency method. Negative pairs: For negative pairs, we add an additional constraint that unrelated pairs should not share entries in their linear combination coefficients of the anchor embeddings. In other words, we add the loss term L(T, N) = (u,v)∈N t u ⊺ t v to the loss in eq (1), where each inner sum discourages t u and t v from sharing similar entries. This strategy can used regardless of the way anchors are selected. We acknowledge that there are other ways to incorporate domain knowledge as well into the general ANT framework, and we only serve to give some initial examples of such methods.

G NONPARAMETRIC ANCHOR & TRANSFORM

In this section we provide details for our non-parametric extension of ANT. Recall that our full objective function derived via small variance asymptotics is given by: min T≥0 A,θ,K i D φ (y i , f θ (x i ; TA)) + λ 2 T 0 + (λ 1 -λ 2 )K, which suggests a natural objective function in learning representations that minimize the prediction loss D φ (y i , f θ (x i ; TA)) while ensuring sparsity of T as measured by the 0 -norm and using as few anchors as possible (K). Therefore, optimizing eq (5) gives rise to a nonparametric version of ANT, which we call NBANT, that automatically learns the optimal number of anchors. To perform optimization over the number of anchors, our algorithm starts with a small initial number of anchors K = A = 10 and either adds ∆K anchors (i.e., adding ∆K new rows to A and ∆K new sparse columns to T) or deletes ∆K anchors to minimize eq (34) at every epoch depending on the trend of the objective evaluated on the training set. We detail the full Algorithm 2, and highlight the main changes as compared to ANT. Practically, this algorithm involves the same number of training epochs and batches through each training epoch as the vanilla ANT method. To enable sharing of trained anchors, we change the indices from where A and T are read from so that the partially trained removed anchors are still stored in case more anchors need to be added again.

H EFFICIENT LEARNING AND INFERENCE

The naive method for learning E from anchor embeddings A and the sparse transformations T still scales linearly with V × d. Here we describe some tips on how to perform efficient learning and inference of the anchor embeddings A and the sparse transformations T: • Store T as a sparse matrix by only storing its non-zero entries and indices. From our experiments, we have shown that nnz(T) << V × d which makes storage efficient. • For inference, use sparse matrix multiply as supported in TensorFlow and PyTorch to compute E = TA (or its non-linear extensions). This decreases the running time from scaling by V × d to only scaling as a function of nnz(T). For training, using inbuilt sparse Algorithm 2 NBANT: Nonparametric Bayesian ANT. Differences from ANT are highlighted in red. ANCHOR & TRANSFORM: 1: Anchor: initialize initial K = A and corresponding anchor embeddings A ∈ R K×d . 2: Transform: initialize T ∈ R V ×K as a sparse matrix. for each batch (X, Y) do • Finally, even if we want to use our ANT framework with full softmax in language model, it is possible without blowing up memory requirements. In particular, let g ∈ R 1× V be the incoming gradient from cross-entropy loss and h ∈ R d×1 be the vector coming from layers below, like LSTM. The gradient update is then 5: Compute loss L = ∑ i D φ (y i , f θ (x i ; TA)) 6: T ← PROX ηλ (T -ηg(Ah) T ) The main issue is computing the huge V × A outer product as an intermediate step which will be dense. However, note that incoming gradient g is basically a softmax minus an offset corresponding to correct label. This should only have large values for a small set of words and small for others. If we carefully apply the 1 -prox operator earlier, which is nothing but a soft-thresholding, we can make this incoming gradient sparse very sparse. Thus we need to only calculate a much smaller sized outer product and touch a small number of rows in T. Thus, making the approach feasible.

I GENERALITY OF ANT

We show that under certain structural assumptions on the anchor embeddings and transformation matrices, ANT reduces to the following task-specific methods for learning sparse representations: 1) Frequency (Chen et al., 2016b) , TF-IDF, Group Lasso (Wen et al., 2016) , and variational dropout (Chen et al., 2019) based vocabulary selection, 2) Low-rank factorization (Grachev et al., 2019) , and 3) Compositional code learning (Shu & Nakayama, 2018; Chen et al., 2018) . Hence, ANT is general and unifies some of the work on sparse representation learning done independently in different research areas. Frequency-based vocabulary selection (Luong et al., 2015; Chen et al., 2016b) : Initialize A with the A most frequent objects and set T a,a = 1 for all a ∈ A, T = 0 otherwise. Then E = TA consists of embeddings of the A most frequent objects with zero embeddings for all others. During training, gradients are used to update A but not T (i.e., only embeddings for frequent objects are learned). By changing the selection of A, ANT also reduces to other vocabulary selection methods such as TF-IDF (Ramos, 1999) , Group Lasso (Wen et al., 2016) , and variational dropout (Chen et al., 2019) Low-rank factorization (Acharya et al., 2019; Markovsky, 2011; Grachev et al., 2019) : Initialize A by a mixture of random basis embeddings (just 1 anchor per set) A 1 , ..., A M ∈ R 1×d and do not enforce any sparsity on the transformations T 1 , ..., T M ∈ R V ×1 . If we further restrict ourselves to only linear combinations E = ∑ M m=1 T m A m , this is equivalent to implicitly learning the M low rank factors a 1 , ..., a M , t 1 , ..., t M that reconstruct embedding matrices of rank at most M . T m . Note that applying a softmax activation to the rows of T m makes all entries dense so during training, we store T m as sparse matrices (which is efficient since T m has few non-zero entries) and implicitly reconstruct E. Domain knowledge: When incorporating domain knowledge in ANT, we use both WordNet and cooccurrence statistics. For WordNet, we use the public WordNet interface provided by NLTK http: //www.nltk.org/howto/wordnet.html. For each word we search for its immediate related words among its hypernyms, hyponyms, synonyms, and antonyms. This defines the relationship graph. For co-occurrence statistics, we define a co-occurrence context size of 10 on the training data. Two words are defined to be related if they co-occur within this context size. A note on baselines: Note that the reported results on SPARSEVD and SPARSEVD-VOC (Chirkova et al., 2018) have a different embedding size: 300 instead of 256. This is because they use pre-trained word2vec or GloVe embeddings to initialize their model before compression is performed.

J.2 LANGUAGE MODELING ON PTB

Base LSTM model: Our base model is a 2 layer LSTM with an embedding size of 200 and hidden layer size of 200. The code was adapted from https://github.com/salesforce/ awd-lstm-lm and the full table of hyperparameters is provided in Table 7 . Base AWD-LSTM model: In addition to experiments on an vanilla LSTM model as presented in the main text, we also performed experiments using a 3 layer AWD-LSTM with an embedding size of 400 and hidden layer size of 1, 150. The full hyperparameters used can be found in Table 8 . Anchor: We experiment with dynamic, frequency, and clustering initialization strategies. The number of anchors A is a hyperparameter that is selected using the validation set. The range of A is in {10, 20, 50, 80, 100, 500, 1, 000}. Smaller values of A allows us to control for fewer anchors and smaller transformation matrix T at the expense of performance. Domain knowledge: When incorporating domain knowledge in ANT, we use both WordNet and cooccurrence statistics. For WordNet, we use the public WordNet interface provided by NLTK http: //www.nltk.org/howto/wordnet.html. For each word we search for its immediate related words among its hypernyms, hyponyms, synonyms, and antonyms. This defines the relationship graph. For co-occurrence statistics, we define a co-occurrence context size of 10 on the training data. Two words are defined to be related if they co-occur within this context size. (2019) . Their presented results differ from our computations in two aspects: they include the LSTM parameters on top of the embedding parameters, and they also count the embedding parameters twice since they do not perform weight tying (Press & Wolf, 2017) (see equation ( 6) of Grachev et al. ( 2019)). To account for this, the results of SPARSEVD and SPARSEVD-VOC (Chirkova et al., 2018) , as well as the results of various LR and TT low rank compression methods (Grachev et al., 2019) were modified by subtracting off the LSTM parameters (200 × 200 × 16) . This is derived since each of the 8 weight matrices W i,f,o,c , U i,f,o,c in an LSTM layer is of size 200 × 200, and there are a 2 LSTM layers. We then divide by two to account for weight tying. In the main text, we compared with the strongest baselines as reported in Grachev et al. (2019) : these were the methods that performed low rank decomposition on both the input embedding ( V × d), output embedding (d × V ), and intermediate hidden layers of the model. For full results, please refer to Grachev et al. (2019) .

Loss cross entropy Dropout

Note that the reported results on SPARSEVD and SPARSEVD-VOC (Chirkova et al., 2018) have a different embedding size and hidden layer size of 256 instead of 200, although these numbers are close enough for fair comparison. In our experiments we additionally implemented an LSTM with an embedding size of 256 and hidden layer size of 256 so that we can directly compare with their reported numbers. For baselines that perform post-processing compression of the embedding matrix, POST-SPARSE HASH (post-processing using sparse hashing) (Guo et al., 2017) and POST-SPARSE HASH+k-SVD (improving sparse hashing using k-SVD) (Guo et al., 2017; Awasthi & Vijayaraghavan, 2018) , we choose two settings: the first using 500 anchors and 10 nearest neighbors to these anchor points, and the second using 1, 000 anchors and 20 nearest neighbors. The first model uses 500 × d + V × 10 non-zero embedding parameters while the second model uses 1, 000 × d + V × 20 parameters. For AWD-LSTM on PTB, this is equivalent to 0.3M and 0.6M embedding parameters respectively which is comparable to the number of non-zero parameters used by our method.

J.3 LANGUAGE MODELING ON WIKITEXT-103

Base AWD-LSTM model: Our base model is a 4 layer AWD-LSTM with an embedding size of 400 and hidden layer size of 2, 500. The code was adapted from https://github.com/ salesforce/awd-lstm-lm and the hyperparameters used can be found in Table 9 . For the HASH EMBED baseline that retains the frequent k words and hashes the remaining words into 1, 000 OOV buckets (Svenstrup et al., 2017) , We vary k ∈ {1 × 10 5 , 5 × 10 4 , 1 × 10 4 } to obtain results across various parameter settings.

J.4 MOVIE RECOMMENDATION ON MOVIELENS

Base MF model: We show the hyperparamters used for the MF model in Table 10 . We use the Yogi optimizer (Zaheer et al., 2018) to learn the parameters.

ANT and NBANT:

We build ANT on top of the MF model while keeping the base hyperparamters constant. For ANT, we apply compression to both movie and user embedding matrices individually. NBANT involves defining the starting value of K = A , and a ∆K value which determines the rate of increase or decrease in K. For Movielens 25M we use a larger initial A and ∆K since it is a larger dataset and also takes longer to train, so we wanted the increase and decrease in anchors to be faster (see Table 10 ). Beyond this initial setting, we found that performance is robust with respect to the initial value of K and ∆K, so we did not tune these parameters. In practice, we tie the updates of the number of user anchors and movie anchors instead of optimizing over both independently. Therefore, we start with the same number of initial user and movie anchors before incrementing or decrementing them by the same ∆K at the same time. We found that this simplification did not affect performance and NBANT was still able to find an optimal number of anchors for a good trade-off between performance and compression.

K MORE RESULTS

In the following sections, we provide additional results on learning sparse representations of discrete objects using ANT. Table 11 : More text classification results on (from top to bottom) AG-News, DBPedia, Sogou-News, and Yelp-review. Domain knowledge is derived from WordNet and co-occurrence statistics. Our approach with different initializations and domain knowledge achieves within 1% accuracy with 21× fewer parameters on DBPedia, within 1% accuracy with 10× fewer parameters on Sogou-News, and within 2% accuracy with 22× fewer parameters on Yelp-review. Acc: accuracy, # Emb: # (non-zero) embedding parameters.

Methods on AG-News

A Init A Sparse T T ≥ 0 Domain Acc (%) # Emb (M) CNN (Zhang et al., 2015) 61, 673 All 91.6 15.87 FREQUENCY (Chen et al., 2019) 5, 000 Frequency 91.0 1.28 TF-IDF (Chen et al., 2019) 5, 000 TF-IDF 91.0 1.28 GL (Chen et al., 2019) 4, 000 Group lasso 91.0 1.02 VVD (Chen et al., 2019) 3, 000 Var dropout 91.0 0.77 SPARSEVD (Chirkova et al., 2018) 5, 700 Mult weights 88.8 1.72 SPARSEVD-VOC (Chirkova et al., 2018) 11 . For AG-News, using a mixture of anchors and transformations also achieves stronger performance than the baselines using 5 anchors per mixture, although the larger number of transformations leads to an increase in parameters. Our approach with different initializations and domain knowledge achieves within 1% accuracy with 21× fewer parameters on DBPedia, within 1% accuracy with 10× fewer parameters on Sogou-News, and within 2% accuracy with 22× fewer parameters on Yelp-review. Different initialization strategies: Here we also presented results across different initialization strategies and find that while those based on frequency and clustering work better, using a set of dynamic basis embeddings still gives strong performance, especially when combined with domain knowledge from WordNet and co-occurrence statistics. This implies that when the user has more information about the discrete objects (e.g., having a good representation space to perform clustering), Here we provide several orientation views of the full 3D plot across user anchors (x-axis), movie anchors (y-axis), and objective value of eq (34) (z-axis). NBANT reaches the objective value indicated by the shaded red plane which is close to the optimal objective value as computed over all grid search experiments for user and movie anchors. Therefore, NBANT can efficiently optimize A to achieve a balance between performance and compression, eventually reaching a good value of eq (5) from just one run. the embedding parameters by another 10× and achieve 70.0 test perplexity, which results in 100× total compression as compared to the baseline. We also perform more controlled experiments with different embedding dimension sizes 200 and 250 where we also outperform the baselines. Extra results on WikiText-103: We also report full results using AWD-LSTM on WikiText-103 in Table 13 , where we reach within 3 points perplexity with ∼ 16× reduction and within 13 points perplexity with ∼ 80× reduction, outperforming the frequency (HASH EMBED) and post-processing hashing (SPARSE HASH) baselines.

K.3 MOVIE RECOMMENDATION

Extra results on MovieLens 1M: We also report results on MovieLens 1M in Table 14 , where we also observe improvements in accuracy and compression. We also run NBANT on Movielens 1M. Despite the small size of the dataset, NBANT is able to optimize for A quickly and achieve a good trade-off between performance and compression. Extra results on MovieLens 25M: Finally, we provide a 3D version of the 2D plot shown in Figure 3 where we simplified the plot by showing grid points with an equal number of user and movie anchors. We provide several orientation views of the full 3D plot across user anchors (x-axis), movie anchors (y-axis), and objective value of eq (34) (z-axis) in Figure 7 . NBANT reaches the objective value indicated by the shaded red plane which is close to the optimal objective value as computed over all grid search experiments for user and movie anchors. Therefore, NBANT can efficiently optimize A and reach a good value of eq (5) from just one run. Online NBANT: Since NBANT automatically grows/contracts A during training, we can further extend NBANT to an online version that sees a stream of batches without revisiting previous ones Bryant & Sudderth (2012) . This further enables NBANT to scale to large datasets that cannot all fit in memory. We treat each batch as a new set of data coming in and train on that batch until convergence, modify A as in Algorithm 2, before moving onto the next batch. In this significantly more challenging online setting, NBANT is still able to learn well and achieve a MSE of 0.875 with 1.25M non zero parameters. 3 . From Figure 8 , we found that initially A grew steadily from 10 up to 26 as online batches were seen. As even more batches were seen, the number of clusters decreased steadily from 26 to 10, and oscillated between 8 and 10. This means initially some connections/groupings between online batches were not clear, but with more data, natural clusters merged together. Interestingly this online version of NBANT settled on a similar range of final user (8) and item (8) anchors as compared to the non-online version (see Table 3 ), which confirms the robustness of NBANT in finding relevant anchors automatically. L EFFECT OF λ 1 AND λ 2 In this section we further study the effect of the hyperparameters λ 1 and λ 2 . Recall that our full objective function derived via small variance asymptotics is given by: min The role of hyper-parameter λ 2 is clear: For a fixed λ 1 and integer valued variable K, tuning λ 2 controls the trade-off between sparsity of T and model performance (see Table 13 ). The role of hyper-parameter λ 1 is more subtle. It can be considered as a weighing fraction for scalarizing an underlying multi-objective optimization problem. To elaborate, one can consider our goal as a multi-objective problem of minimizing the predictive loss while simultaneously using a minimal number of anchors (K). Then the hyperparameter λ 1 can be used to select a solution along the Pareto front. In other words, tuning the hyperparameter λ 1 allows us to perform model selection by controlling the trade-off between the number of anchors used and prediction performance. We apply eq (36) on the trained models in Table 2 and report these results in Table 15 . Choosing a small λ 1 = 2 × 10 -5 selects a model with more anchors ( A = 1, 000) and better performance (ppl = 79.4), while a larger λ 1 = 1 × 10 -1 selects the model with fewest anchors ( A = 100) with a compromise in performance (ppl = 106.6). 



t e x i t s h a _ b a s e = " F O U u R c J x E H t T / f w p D l J t B w c = " > A A A C B H i c b Z C T s M w F I Y d r q X c A o x d L C o k p i g B B I x F L I x F o h e p j S r H dV q r j h P Z J g q s D C q A w g B A r D H G + C m G a D l l y x + s O j / k A i u w X W / r a X l l d W d J G e X N r e f X t t v j h V l D V o L G L V D o h m g k v W A A C t R P F S B Q I g p G N p z W N B + O E + R E Z S B y S s B Y P b v S B f Y A K s o B y j S k n Q Y q n P r r q O m w s v g l d A F R W q + y v b j + m a c Q k U E G n h u A n G F H A q K T c T T V L C B R A e s Y l C R i s / y I y b y D h H M b K P A k d P Z C T S e h w F p j M i M N T z t a n X T Q n j p Z w m K T B J Z v C V G C I T Q R O e K U R B j A Q q b v K Z A o Q s H k V j Y h e P M n L L z x P F O H f f r F o L + I o o Q o R M f I Q x e o h m Q H T U Q R Y / o G b i N + v J er H e r Y Z J V z B y g P I + f w A / T h w < / l a t e x i t > 2. Transform < l a t e x i t s h a 1 _ b a s e 6 4 = " f 4 W r d 5 J D L s 5 t N y 5 K U n + T k b E G O 1 o = " > A A A C B 3 i c b Z D L S g M x F I Y z 9 V b r b d S l I M E i u C o z V d R l w Y 3 L C r 1 B O 5 R M m m l D k 8 y Q n B H L 0 J 0 b X 8 W N C 0 X c + g r u f B v T y 0 J b f w h 8 / O c c T s 4 f J o I b 8 L x v J 7 e y u r a + k d 8 s b

Figure 2: An illustration of initialization strategies for anchors combining ideas from frequency and k-means++ clustering initialization. Clustering initialization picks anchors to span the space of all objects after frequent objects have been selected.

MoviesGenre God's Not Dead, Sex and the City, Sex and the City 2, The Twilight Saga: Breaking Dawn -Part 1, The Princess Diaries 2: Royal Engagement, The Last Song, Legally Blonde 2: Red, White & Blonde, The Twilight Saga: Eclipse, Maid in Manhattan, The Twilight Saga: Breaking Dawn -Part 2 romance, comedy Nostalghia, Last Days, Chimes at Midnight, Lessons of Darkness, Sonatine, Band of Outsiders, Gerry, Cyclo, Mishima: A Life in Four Chapters, George Washington drama, indie

Scalability: In addition to fast convergence, ANT also works effectively on large datasets such as Movielens 25M (162K users, 59K movies, 25M examples) and WikiText-103 (267K unique words, 103M tokens). For each epoch on Movielens 25M, standard MF takes 165s on a GTX 980 Ti GPU while ANT takes 176s for A = 5 and 180s for A = 20. ANT also scales to the largest recommendation dataset, Amazon reviews, with 25M users and 9M products.

Figure 5: Generalized nonlinear mixture of anchors A1, ..., A M and transformations T1, ..., T M , E = ∑ M m=1 softmax(Tm)Am (softmax across rows of Tm). Different sparse transformations can be learned for different initializations of anchor embeddings.

Figure 6: Incorporating domain knowledge into ANT given object relationship graphs (left) with extracted positive pairs P and negative pairs N. Transformations between negative (unrelated) pairs are sparsely 1- penalized while those between positive (related) pairs are not. The linear transformation coefficients tu and tv of negative pairs (u, v) are also discouraged from sharing similar entries.

3: for each each epoch do 4:

baselines: We also used some of the baseline results as presented in Grachev et al.

baselines: WhileBaevski & Auli (2019) adapt embedding dimensions according to word frequencies, their goal is not to compress embedding parameters and they use 44.9M (dense) parameters in their adaptive embedding layer, while we use only 2M. Their embedding parameters are calculated by their reported bucket sizes and embedding sizes (three bands of size 20K (d = 1024), 40K (d = 256) and 200K (d = 64)). Their perplexity results are also obtained using a Transformer model with 250M params while our AWD-LSTM model uses 130M params.

Figure7: A 3D version of the 2D plot shown in Figure3where we simplified the plot by only showing grid points with an equal number of user and movie anchors for Movielens 25M. Here we provide several orientation views of the full 3D plot across user anchors (x-axis), movie anchors (y-axis), and objective value of eq (34) (z-axis). NBANT reaches the objective value indicated by the shaded red plane which is close to the optimal objective value as computed over all grid search experiments for user and movie anchors. Therefore, NBANT can efficiently optimize A to achieve a balance between performance and compression, eventually reaching a good value of eq (5) from just one run.

Figure 8: Using NBANT for online learning increases A with more data before natural clusters emerge, and A decreases to a similar value (8) as the non-online version in Table3.

T≥0 A,θ,K i D φ (y i , f θ (x i , TA)) + λ 2 T 0 + (λ 1 -λ 2 )K(36)

Language modeling on PTB (top) and WikiText

On Movielens 25M, ANT outperforms MF and mixed dimensional embeddings. NBANT automatically tunes A ( * denotes A discovered by NBANT) to achieve a balance between performance and compression. Random (λ1 = 0.1, λ2 = 2 × 10 -6 ) 0.637 1.29 Auto → 6 * Auto → 6 * Random (λ1 = 0.01, λ2 = 2 × 10 -6 ) 0.649 1.09

ANT scales to Amazon Product reviews, the largest existing dataset for recommender systems with 233M reviews spanning 43.5M users and 15.2M products, and is able to perform well under compression.

Word association results after training language models with ANT on the word-level PTB dataset. Left: the non-anchor words most induced by a given anchor word. Right: the largest (non-anchor, anchor) entries learnt in T after sparse 1-regularization. Bottom: movie clusters obtained by sorting movies with the highest coefficients with each anchor embedding.

Table of hyperparameters for text classification experiments on AG-News, DBPedia, Sogou-News, and Yelp-review datasets. All text classification experiments use the same base CNN model with the exception of different output dimensions (classes in the dataset): 4 for AG-News, 14 for DBPedia, 5 for Sogou-News, and 5 for Yelp-review.

Table of hyperparameters for language modeling experiments using AWD-LSTM on PTB dataset.

Table of hyperparameters for language modeling experiments using AWD-LSTM on WikiText-103 dataset.

Table of hyperparameters for movie recommendation experiments on Movielens 1M (top) and Movielens 25M (bottom). Initial K and ∆K are used for NBANT experiments.

On Movielens 1M, ANT outperforms MF and mixed dimensional embeddings. NBANT automatically tunes A ( * denotes A discovered by NBANT) to achieve a balance between performance and compression.

An example of model selection on the trained language models using LSTM trained on PTB. Tuning the hyperparameter λ1 and evaluating eq (36) allows us to perform model selection by controlling the trade-off between sparsity as determined by the number of anchors used and prediction performance.

annex

A, T, θ = UPDATE (∇L, η).

7:

T = max {T -ηλ 2 , 0}. Compute eq (34) using current value of K, A, T on the validation set. if eq (34) is on a decreasing trend then 11: K = K + ∆K, add ∆K rows to A and ∆K (sparse) columns to T.12:else if eq (34) is on an increasing trend then 13: K = K -∆K, remove ∆K rows from A and ∆K (sparse) columns from T. keep current values of K, A, T.

16:

end if 17: end for 18: return anchor embeddings A and transformations T.representation of most deep learning frameworks like PyTorch or Tensorflow is not optimal, as they do not support changing non-zero locations in sparse matrix and apriori its not easy to find optimal set of non-zero locations.• During training, instead, implicitly construct E from its anchors and transformations. In fact, we can do better: instead of constructing the entire E matrix to embed a single datapoint x ∈ R 1× V , we can instead first index x into T, i.e., xT ∈ R 1× A before performing a sparse matrix multiplication with A, i.e., (xT)A ∈ R 1×d . We are essentially taking advantage of the associative property of matrix multiplication and the fact that xT is a simple indexing step and (xT)A is an effective sparse matrix multiplication. To enable fast row slicing into sparse matrix, we just storing the matrix in adjacency list or CSOO format. (We move away from CSR as adding/deleting a non-zero location is very expensive.) When gradient comes back, only update the corresponding row in T. The gradient will be sparse as well due to the L1-prox operator. • Above trick solves the problem for tasks where embedding is used only at the input, e.g., classification. For tasks like language model, where embedding is used at output as well one can also use above mentioned trick with speedup techniques like various softmax sampling techniques (Bengio & Senecal, 2008; Mikolov et al., 2013) or noise-contrastive estimation (Gutmann & Hyvarinen, 2010; Mnih & Teh, 2012) , which will be anyway used for large vocabulary sizes. To elaborate, consider the case of sampled softmax (Bengio & Senecal, 2008) . We normally generate the negative sample indices, and then we can first index into T using the true and negative indices before performing sparse matrix multiplication with A. This way we do not have to instantiate entire E by expensive matrix multiplication. • When training is completed, only store the non-zero entries of T or store T as a sparse matrix to reconstruct E for inference. • To save time when initializing the anchor embeddings and incorporating domain knowledge, precompute the necessary statistics such as frequency statistics, co-occurrence statistics, and object relation statistics. We use a small context size of 10 to measure co-occurrence of two words to save time. When using WordNet to discover word relations, we only search for immediate relations between words instead of propagating relations across multiple steps (although this could further improve performance). • In order to incorporate domain knowledge in the sparsity structure, we again store 1 -S(G) using sparse matrices. Recall that S(G) has an entry equal to 1 for entries representing unrelated objects that should be 1 -penalized, which makes S(G) quite dense since most anchor and non-anchor objects are unrelated. Hence we store 1-S(G) instead which consists few non-zero entries only at (non-anchor, anchor) entries for related objects. Element-wise multiplications are also replaced by sparse element-wise multiplications when computing T ⊙ S(G) and T ⊙ (1 -S(G)).Published as a conference paper at ICLR 2021 Compositional code learning (Shu & Nakayama, 2018; Chen et al., 2018) : Initialize A by a mixture of random basis embeddings A 1 , ..., A M , initialize transformations T 1 , ..., T M , and apply a linear combination E = ∑ M m=1 T m A m . For sparsity regularization, set row i of S(G) mi as a reverse one-hot vector with entry d mi = 0 and all else 1. In other words, index d mi of row row T mi is not regularized, and all other entries are 1 -regularized with extremely high λ 2 such that row T mi essentially becomes an one-hot vector with dimension d mi = 1. This results in learning a codebook where each object in V is mapped to only one anchor in each mixture.Therefore, ANT encompasses several popular methods for learning sparse representations, and gives further additional flexibility in defining various initialization strategies, applying nonlinear mixtures of transformations, and incorporating domain knowledge via object relationships.

J EXPERIMENTAL DETAILS

Here we provide more details for our experiments including hyperparameters used, design decisions, and comparison with baseline methods. We also include the anonymized code in the supplementary material.

J.1 TEXT CLASSIFICATION

Base CNN model: For all text classification experiments, the base model is a CNN (Lecun et al., 1998) with layers of 2D convolutions and 2D max pooling, before a dense layer to the output softmax. The code was adapted from https://github.com/wenhuchen/ Variational-Vocabulary-Selection and the architecture hyperparameters are provided in Table 6 . The only differences are the output dimensions which is 4 for AG-News, 14 for DBPedia, 5 for Sogou-News, and 5 for Yelp-review.Anchor: We experiment with dynamic, frequency, and clustering initialization strategies. The number of anchors A is a hyperparameter that is selected using the validation set. The range of A is in {10, 20, 50, 80, 100, 500, 1, 000}. Smaller values of A allows us to control for fewer anchors and smaller transformation matrix T at the expense of performance.Transformation: We experiment with sparse linear transformations for T. λ 2 is a hyperparameter that is selected using the validation set. Larger values of λ 2 allows us to control for more sparse entries in T at the expense of performance. For experiments on dynamic mixtures, we use a softmax-based nonlinear combination E = ∑ M m=1 softmax(T m )A m where softmax is performed over the rows of Published as a conference paper at ICLR 2021 then the user should do so. However, for a completely new set of discrete objects, simply using low-rank basis embeddings with sparsity also work well.Incorporating domain knowledge: We find that from WordNet and co-occurrence helps to further reduce the total embedding parameters while maintaining task performance.

K.2 LANGUAGE MODELING

Extra results on PTB: We report additional language modeling results using AWD-LSTM on PTB in Table 12 . ANT with 1, 000 dynamic basis vectors is able to compress the embedding parameters by 10× while achieving 72.0 test perplexity. By incorporating domain knowledge, we further compress

