LEARNABLE EMBEDDING SIZES FOR RECOMMENDER SYSTEMS

Abstract

Embedding-based representation learning is commonly used in deep learning recommendation models to map raw sparse features to dense vectors. The traditional practice of assigning a uniform embedding size to all features has two issues. First, the sheer number of features inevitably leads to a gigantic embedding table that incurs a high memory cost. Second, a uniform size is likely to cause over-fitting for features that do not require a large representation capacity. Existing works that try to address these problems either cause a significant drop in recommendation performance or suffer from unaffordable training time. In this paper, we propose a novel approach, named PEP (short for Plug-in Embedding Pruning), to reduce the size of the embedding table while avoiding a drop in recommendation accuracy. PEP prunes embedding parameters using pruning threshold(s) that are adaptively learned from data. We can therefore automatically obtain a mixed-dimension embedding scheme by pruning the redundant parameters of each feature. PEP is a general framework that can be plugged into various base recommendation models. Extensive experiments demonstrate that it efficiently cuts down embedding parameters and boosts the base model's performance. Specifically, it achieves strong recommendation performance while reducing parameters by 97-99%. In terms of computation cost, PEP only adds 20-30% training time compared with base models.

1. INTRODUCTION

The success of deep learning-based recommendation models (Zhang et al., 2019) demonstrates their advantage in learning feature representations, especially for the most widely-used categorical features. These models utilize the embedding technique to map sparse categorical features into real-valued dense vectors that capture users' preferences and items' characteristics. The learned vectors are then fed into prediction models, such as the inner product in FM (Rendle, 2010) or self-attention networks in AutoInt (Song et al., 2019), to obtain prediction results. Since there are always a large number of raw features, the embedding table contains a large number of parameters and consumes huge amounts of memory; it therefore accounts for most of the storage cost. A good case in point is the YouTube Recommendation System (Covington et al., 2016), which demands tens of millions of parameters for the embeddings of YouTube video IDs alone. Considering the increasing demand for instant recommendations in today's service providers, the scale of embedding tables becomes the efficiency bottleneck of deep learning recommendation models. On the other hand, a uniform embedding size can hardly handle the heterogeneity among different features. For example, some features are sparser, and assigning them too large an embedding size is likely to result in over-fitting. Consequently, recommendation models tend to be sub-optimal when embedding sizes are uniform for all features. Existing works on this problem can be divided into two categories. Some works (Zhang et al., 2020; Shi et al., 2020; Kang et al., 2020) propose that closely-related features can share parts of their embeddings, reducing the overall cost.
Other works (Joglekar et al., 2020; Zhao et al., 2020b; a; Cheng et al., 2020) propose to assign flexible embedding sizes to different features, relying either on human-designed rules (Ginart et al., 2019) or on neural architecture search (Joglekar et al., 2020; Zhao et al., 2020b; a; Cheng et al., 2020). Despite producing a smaller embedding table, these methods still fall short on the two aspects that matter most, recommendation performance and computation cost: they either obtain poor recommendation performance or spend a lot of time and effort in finding proper embedding sizes. In this paper, to address the limitations of existing works, we propose a simple yet effective pruning-based framework, named Plug-in Embedding Pruning (PEP), which can be plugged into various embedding-based recommendation models. Our method takes a direct approach, pruning unnecessary embedding parameters in one shot, to reduce the parameter count. Specifically, we introduce learnable threshold(s) that can be jointly trained with the embedding parameters via gradient descent. The threshold determines the importance of each parameter automatically: elements of an embedding vector whose magnitude falls below the threshold are pruned. Pruning the whole embedding table in this way gives each feature a suitable, flexible embedding size. After obtaining the pruned embedding table, we retrain the recommendation model, inspired by the Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2018), which shows that a sub-network, trained in isolation, can reach accuracy comparable to or higher than the original network's. Based on flexible embedding sizes and the LTH, our PEP cuts down embedding parameters while maintaining and even boosting the model's recommendation performance.
Finally, while there is always a trade-off between recommendation performance and parameter count, our PEP can obtain multiple pruned embedding tables in a single run. In other words, PEP generates several memory-efficient embedding matrices once-for-all, which handles the varying demands for performance or memory efficiency in real-world applications. We conduct extensive experiments on three public benchmark datasets: Criteo, Avazu, and MovieLens-1M. The results demonstrate that our PEP not only achieves the best performance compared with state-of-the-art baselines but also reduces parameter usage by 97% to 99%. Further studies show that PEP is computationally efficient, requiring only a little additional time for embedding-size learning. Furthermore, visualization and interpretability analysis on the learned embeddings confirm that PEP captures features' intrinsic properties, which provides insights for future research.

2. RELATED WORK

Existing works try to reduce the embedding table size of recommendation models from two perspectives, embedding parameter sharing and embedding size selection.

2.1. EMBEDDING PARAMETER SHARING

The core idea of these methods is to make different features re-use embeddings via parameter sharing. Kang et al. (2020) proposed MGQE, which retrieves embedding fragments from a small set of shared centroid embeddings and then generates the final embedding by concatenating those fragments. Zhang et al. (2020) used the double-hash trick to make low-frequency features share a small embedding table while reducing the likelihood of hash collisions. Shi et al. (2020) yield a unique embedding vector for each feature category from a small embedding table by combining multiple smaller embeddings (called embedding fragments), usually through concatenation, addition, or element-wise multiplication. However, these methods suffer from two limitations. First, engineers must carefully design the parameter-sharing ratio to balance accuracy and memory cost. Second, such coarse embedding-sharing strategies cannot locate the redundant parts of the embedding tables, so they often cause a drop in recommendation performance. In this work, our method automatically chooses suitable embedding sizes by learning from data. Engineers are therefore freed from the massive effort of designing sharing strategies, and model performance can be boosted by removing redundant parameters and alleviating over-fitting.

2.2. EMBEDDING SIZE SELECTION

The embedding-sharing methods assign a uniform embedding size to every feature, which may still fail to deal with the heterogeneity among different features. Recently, several methods proposed a new paradigm of mixed-dimension embedding tables: instead of assigning all features a uniform embedding size, different features can have different embedding sizes. MDE (Ginart et al., 2019) proposed a human-defined rule whereby the embedding size of a feature is proportional to its popularity. However, this rule-based method is too coarse and cannot handle important features with low frequency. Additionally, MDE has plenty of hyper-parameters that require considerable tuning effort. Other works (Joglekar et al., 2020; Zhao et al., 2020b; a; Cheng et al., 2020) assign adaptive embedding sizes to different features by relying on advances in Neural Architecture Search (NAS) (Elsken et al., 2019), a significant research direction of Automated Machine Learning (AutoML) (Hutter et al., 2019). NIS (Joglekar et al., 2020) uses a reinforcement learning-based algorithm to search for embedding sizes within a candidate set predefined by human experts; a controller generates the probability distribution over sizes for specific feature embeddings. DartsEmb (Zhao et al., 2020b) extends this by replacing the reinforcement learning search with differentiable search (Liu et al., 2018). AutoDim (Zhao et al., 2020a) allocates different embedding sizes to different feature fields, rather than individual features, in the same way as DartsEmb. DNIS (Cheng et al., 2020) makes the candidate embedding sizes continuous, without predefined candidate dimensions. However, all these NAS-based methods incur extremely high computation costs in the search procedure; even for methods that adopt differentiable architecture search algorithms, the search cost is still unaffordable.
Moreover, these methods require considerable effort to design proper search spaces. Different from these works, our pruning-based method can be trained efficiently and requires no human effort to determine embedding-size candidates.

3. PROBLEM FORMULATION

Feature-based recommender systems (also known as click-through rate prediction) are commonly used in today's information services. In general, deep learning recommendation models take various raw features, including users' profiles and items' attributes, as input and predict the probability that a user likes an item. Specifically, a model takes the combination of a user's profile and an item's attributes, denoted by x, as its input vector, where x is the concatenation of all fields:

x = [x_1; x_2; ...; x_M],   (1)

where M denotes the number of feature fields and x_i is the feature representation (usually a one-hot vector) of the i-th field. For x_i, embedding-based recommendation models generate the corresponding embedding vector v_i via

v_i = V_i^T x_i,   (2)

where V_i ∈ R^{n_i×d} is the embedding matrix of the i-th field, n_i denotes the number of features in the i-th field, and d denotes the embedding size. The model's embedding matrices for all feature fields can be written as

V = {V_1, V_2, ..., V_M}.   (3)

The prediction score is calculated with V and the model's other parameters Θ (mainly the parameters of the prediction model) as

ŷ = φ(x | V, Θ),   (4)

where ŷ is the predicted probability and φ represents the prediction model, such as FM (Rendle, 2010) or AutoInt (Song et al., 2019). To learn the model parameters, the optimizer minimizes the training loss

min L(V, Θ, D),   (5)

where D = {x, y} represents the data fed into the model, x denotes the input features, y denotes the ground-truth labels, and L is the loss function.

Figure 1: The basic idea of PEP. (The illustration shows elements of the embedding table V whose magnitude falls below the learned threshold, g(s) = 0.15 in the example, being set to zero, producing a sparse embedding table that is fed to the interaction function φ to produce ŷ.)
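The field-wise embedding lookup of Equation (2) can be sketched in a few lines of numpy; the field counts and embedding size below are hypothetical toy values, and the one-hot matrix-vector product is shown to be equivalent to a plain row lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical sizes): M = 2 feature fields with n_1 = 3 and
# n_2 = 4 categories, embedding size d = 5.
d = 5
V = [rng.standard_normal((3, d)), rng.standard_normal((4, d))]  # {V_1, V_2}

def one_hot(index, n):
    x = np.zeros(n)
    x[index] = 1.0
    return x

# x_i is the one-hot vector of field i; v_i = V_i^T x_i selects one row of V_i.
x1, x2 = one_hot(1, 3), one_hot(2, 4)
v1 = V[0].T @ x1          # equivalent to the row lookup V[0][1]
v2 = V[1].T @ x2

assert np.allclose(v1, V[0][1]) and np.allclose(v2, V[1][2])
```

In practice this lookup is what `torch.nn.Embedding` implements; the matrix product is shown only to match the notation of Equation (2).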
Logloss is the most widely-used loss function in recommendation tasks (Rendle, 2010; Guo et al., 2017; Song et al., 2019) and is calculated as

L = - (1/|D|) Σ_{j=1}^{|D|} [ y_j log(ŷ_j) + (1 - y_j) log(1 - ŷ_j) ],   (6)

where |D| is the total number of training samples; regularization terms are omitted for simplicity.
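Equation (6) is straightforward to implement; a minimal pure-Python sketch follows, with predictions clipped away from 0 and 1 to avoid log(0) (the clipping constant is a common convention, not part of the paper).

```python
import math

def logloss(y_true, y_pred, eps=1e-7):
    # L = -(1/|D|) * sum_j [ y_j log(yhat_j) + (1 - y_j) log(1 - yhat_j) ]
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident correct predictions give low loss; confident wrong ones, high loss.
low = logloss([1, 0], [0.9, 0.1])
high = logloss([1, 0], [0.1, 0.9])
assert low < high
# An uninformative predictor (always 0.5) gives exactly log(2).
assert abs(logloss([1, 0], [0.5, 0.5]) - math.log(2)) < 1e-12
```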

4. METHOD

4.1. LEARNABLE EMBEDDING SIZES THROUGH PRUNING

As mentioned above, a feasible route to memory-efficient embedding learning is to automatically assign a different embedding size d_i to each feature embedding v_i, which is our goal. However, learning d_i directly is infeasible due to its discreteness and the extremely large optimization space. To address this, we propose to enforce column-wise sparsity on V, which equivalently shrinks the embedding sizes. For example, as shown in Figure 1, the first value in embedding v_1 is pruned and set to zero, leading to an effective embedding size of d̃_1 = d_1 - 1. Furthermore, some unimportant feature embeddings, such as v_3, are dropped entirely by setting all of their values to zero (our PEP benefits from this kind of reduction, as demonstrated in Sections 5.1, 5.3 and 5.4). Our method can thus significantly cut down embedding parameters, and sparse-matrix storage techniques let us translate this sparsity into real memory savings (Virtanen et al., 2020). In this way, we recast the problem of embedding-size selection into learning column-wise sparsity of the embedding matrix V. To achieve this, we impose a sparsity constraint on V:

min L,  s.t.  ||V||_0 ≤ k,   (7)

where ||·||_0 denotes the L0-norm, i.e., the number of non-zeros, and k is the parameter budget, that is, the constraint on the total number of embedding parameters. However, direct optimization of Equation (7) is NP-hard due to the non-convexity of the L0-norm constraint. The convex relaxation of the L0-norm, the L1-norm, has been studied for a long time (Taheri & Vorobyov, 2011; Beck & Teboulle, 2009; Jain et al., 2014). In particular, Projected Gradient Descent (PGD) (Jain et al., 2014) projects the parameters onto the L1 ball so that the update is computable in almost closed form; this L1-ball projection is also known as soft thresholding (Kusupati et al., 2020). Nevertheless, such methods still face two major issues.
First, projecting the parameters onto the L1 ball requires too much computation, especially when the recommendation model has millions of parameters. Second, the parameter budget k must be set manually by human experts at a global level; given that features have varying importance for recommendation, such an operation is clearly sub-optimal. To tackle these two challenges, inspired by Soft Threshold Reparameterization (Kusupati et al., 2020), we directly optimize the projection of V and adaptively prune V via learnable threshold(s) updated by gradient descent. The re-parameterization of V is formulated as

V̂ = S(V, s) = sign(V) ⊙ ReLU(|V| - g(s)),   (8)

where V̂ ∈ R^{N×d} denotes the re-parameterized embedding matrix and g(s) serves as the pruning threshold, for which the sigmoid function is a simple yet effective choice (more details on choosing a suitable g(s) are provided in Appendix A.1). We set the initial value s_init of the trainable parameter s ∈ R so that the threshold g(s) starts close to zero. The sign(·) function maps positive values to 1 and negative values to -1, leaving zeros unchanged. Since S(V, s) is applied to each element of V, the optimization problem in Equation (5) can be redefined as

min L(S(V, s), Θ, D).   (9)

The trainable pruning parameter s can then be jointly optimized with the parameters of the recommendation model φ through standard back-propagation. Specifically, the gradient-descent update for V at step t is

V^{(t+1)} ← V^{(t)} - η_t ∇_{S(V,s)} L(S(V^{(t)}, s), D) ⊙ ∇_V S(V, s),   (10)

where η_t is the learning rate at step t and ⊙ denotes the Hadamard product. To handle the non-differentiability of S(·), we use the sub-gradient and rewrite the update as

V^{(t+1)} ← V^{(t)} - η_t ∇_{S(V,s)} L(S(V^{(t)}, s), D) ⊙ 1{S(V^{(t)}, s) ≠ 0},   (11)

where 1{·} denotes the indicator function.
Then, as long as we choose a continuous function g in S(·), the loss L(S(V^{(t)}, s), D) is continuous in s, and the sub-gradient of L with respect to s can be used for gradient descent on s as well. Thanks to automatic differentiation frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), we are freed from the above gradient computations; our PEP code can be found in Figure 7 of Appendix A.2. As it shows, PEP is quite simple to incorporate into existing recommendation models, and there is no need to manually implement the back-propagation.
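As a framework-free illustration of Equation (8), the following numpy sketch applies the soft-threshold re-parameterization with g = sigmoid; the embedding values and the learned s are hypothetical, with s chosen so that g(s) ≈ 0.15 as in Figure 1. (In the actual method, autograd handles the sub-gradients of Equations (10)-(11).)

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def soft_threshold(V, s):
    """Eq. (8): S(V, s) = sign(V) * ReLU(|V| - g(s)), with g = sigmoid."""
    g = sigmoid(s)                      # threshold value(s), broadcast over V
    return np.sign(V) * np.maximum(np.abs(V) - g, 0.0)

V = np.array([[0.30, -0.05, -0.40],
              [0.10,  0.20, -0.02]])
s = -1.7346  # hypothetical learned value; sigmoid(-1.7346) ≈ 0.15

V_hat = soft_threshold(V, s)
# Entries with |v| <= 0.15 are pruned to exactly zero; the rest shrink toward 0.
assert V_hat[0, 1] == 0.0 and V_hat[1, 2] == 0.0
assert np.isclose(V_hat[0, 0], 0.30 - sigmoid(s))
```

Note that the shrinkage is exact-zeroing, not mere rounding: pruned entries contribute nothing to memory under a sparse storage format.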

4.2. RETRAIN WITH LOTTERY TICKET HYPOTHESIS

After pruning the embedding matrix V down to the target parameter budget P, we create a binary pruning mask m ∈ {0, 1}^{|V|} that records which parameters remain and which are dropped, and then retrain the base model with the pruned embedding table. The Lottery Ticket Hypothesis (Frankle & Carbin, 2018) states that a sub-network of a randomly-initialized dense network can match the original network when trained in isolation for the same number of iterations; this sub-network is called the winning ticket. Hence, instead of randomly re-initializing the weights, we retrain the base model with the weights reset to their original initial values, masked by m, i.e., m ⊙ V_0. This initialization strategy makes training faster and more stable while keeping performance consistent, as shown in Appendix A.6.
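The mask-and-rewind step above can be sketched as follows; the magnitude cutoff stands in for the learned-threshold pruning of Section 4.1, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

V0 = rng.standard_normal((4, 3))        # initial embedding table (saved before training)
V_pruned = V0.copy()
V_pruned[np.abs(V_pruned) < 0.5] = 0.0  # stand-in for learned-threshold pruning

# Binary mask m in {0,1}^{|V|}: which parameters survived pruning.
m = (V_pruned != 0).astype(V0.dtype)

# Lottery-ticket retraining: rewind surviving weights to their ORIGINAL
# initial values (m ⊙ V0) instead of re-initializing them randomly.
V_retrain = m * V0

assert np.all((V_retrain == 0) | (V_retrain == V0))
assert np.count_nonzero(V_retrain) == np.count_nonzero(m)
```

Retraining then proceeds as usual, with the mask applied after each update so pruned entries stay at zero.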

4.3. PRUNING WITH FINER GRANULARITY

The threshold parameter s in Equation (8) is a scalar, so every dimension shares the same threshold value; we call this version global-wise pruning. However, different dimensions of an embedding vector v_i may have different importance, and different feature fields may also differ greatly in importance. Values in the embedding matrix therefore require different sparsity budgets, and pruning with a single global threshold may be sub-optimal. To better handle the heterogeneity among features and dimensions in V, we design the following threshold variants with finer granularities. (1) Dimension-wise: the threshold parameter is a vector s ∈ R^d, so each embedding dimension is pruned individually. (2) Feature-wise: the threshold parameter is a vector s ∈ R^N, so each feature's embedding is pruned separately. (3) Feature-dimension-wise: this variant combines the two above to obtain the finest-grained pruning, with thresholds forming a matrix s ∈ R^{N×d}.
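The four granularities differ only in the shape of s; under numpy broadcasting they all plug into the same soft-threshold function from Equation (8). A minimal sketch (toy sizes, constant thresholds for illustration):

```python
import numpy as np

N, d = 6, 4
V = np.random.default_rng(2).standard_normal((N, d))

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def soft_threshold(V, s):
    return np.sign(V) * np.maximum(np.abs(V) - sigmoid(s), 0.0)

# The variants differ only in the shape of s, broadcast against V:
s_global  = np.float64(-2.0)          # global-wise: one threshold for the table
s_dim     = np.full((1, d), -2.0)     # dimension-wise: one per embedding column
s_feat    = np.full((N, 1), -2.0)     # feature-wise: one per feature row
s_featdim = np.full((N, d), -2.0)     # feature-dimension-wise: one per entry

for s in (s_global, s_dim, s_feat, s_featdim):
    assert soft_threshold(V, s).shape == (N, d)
```

The vectors s ∈ R^d and s ∈ R^N of the paper are reshaped to (1, d) and (N, 1) here purely so broadcasting aligns them with rows and columns of V.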

5. EXPERIMENTS

Datasets. We use three benchmark datasets in our experiments: MovieLens-1M, Criteo, and Avazu. Metrics. We adopt AUC (Area Under the ROC Curve) and Logloss to measure model performance. Baselines and base recommendation models. We compare our PEP with traditional Uniform Embedding (UE). We also compare with recent advances in flexible embedding sizes: MGQE (Kang et al., 2020), MDE (Ginart et al., 2019), and DartsEmb (Zhao et al., 2020b); we do not compare with NIS (Joglekar et al., 2020) since its code has not been released and its reinforcement-learning-based search is very slow. We deploy PEP and all baseline methods on three representative feature-based recommendation models, FM (Rendle, 2010), DeepFM (Guo et al., 2017), and AutoInt (Song et al., 2019), to compare their performance. More implementation details can be found in Appendix A.4.

5.1. RECOMMENDATION ACCURACY AND PARAMETER NUMBER

We present the curves of recommendation performance versus parameter count for our method and state-of-the-art baselines in Figures 2, 3 and 4. Since there is a trade-off between recommendation performance and parameter count, each curve consists of points with different sparsity demands; we report five points of our method, marked from 0 to 4. • Our method reduces the number of parameters significantly. Our PEP achieves the largest reduction ratio in all experiments, especially on the relatively large datasets (Criteo and Avazu). Specifically, on Criteo and Avazu, our PEP-0 reduces parameter usage by 99.90% compared with the best baseline (from the 10^6 level down to the 10^3 level, a very significant reduction). An embedding matrix with such low parameter usage means that only hundreds of embeddings are non-zero. By setting less-important features' embeddings to zero, our PEP breaks the limitation of existing methods, whose minimum embedding size is one rather than zero. We conduct further analysis on the MovieLens dataset in Sections 5.3 and 5.4 to understand why our method achieves such effective parameter reduction. • Our method achieves strong recommendation performance. Our method consistently outperforms the uniform-embedding-based models and achieves better accuracy than the other methods in most cases. Specifically, for the FM model on the Criteo dataset, the relative AUC improvement of PEP is 0.59% over UE and 0.24% over DartsEmb. Note that improvements of AUC or Logloss at this level are considered substantial for feature-based recommendation tasks (Cheng et al., 2016; Guo et al., 2017), especially given the large reduction in parameters. Similar improvements are observed on the other datasets and recommendation models. It is also worth noting that our method keeps strong AUC performance in the extreme-sparsity regime.
For example, even when the number of parameters is at the 10^3 level (a very small budget), the recommendation performance still remarkably outperforms the Linear Regression model (more details can be found in Appendix A.5). In summary, combining accuracy and parameter-size reduction, PEP forms a frontier curve encompassing all the baselines at all parameter levels, which verifies that our method handles different parameter budgets well.

5.2. EFFICIENCY ANALYSIS OF OUR METHOD

As shown in Section 5.1, learning a suitable parameter budget can yield a higher-accuracy model while reducing the model's parameter count. Nevertheless, it takes additional time to find appropriate sizes for different features. In this section, we study the computational cost by comparing the per-epoch training time of PEP and DartsEmb on the Criteo dataset, implementing both with the same batch size and testing them on the same platform. The training time per epoch for three different base models is given in Table 2. We observe that PEP's additional computation cost is only 20% to 30% over the base model, which is acceptable. DartsEmb, however, requires nearly double the computation time to search for good embedding sizes in its bi-level optimization. Furthermore, DartsEmb must search multiple times to fit different memory budgets, since each budget requires a complete re-run. Different from DartsEmb, our PEP obtains several embedding schemes, applicable to different application scenarios, in a single run. As a result, PEP's time cost for embedding-size search can be reduced even further in real-world systems.

5.3. INTERPRETABLE ANALYSIS ON PRUNED EMBEDDINGS

Feature-based recommendation models usually apply the embedding technique to capture second- or higher-order feature interactions. But how does our method act on feature interactions? Does it improve model performance by reducing noisy ones? In this section, we conduct an interpretability analysis by visualizing the feature interaction matrix, calculated from VV^T. Each value in the matrix is the normalized average of the absolute dot products between the features of two fields; a higher value indicates a stronger correlation between the two fields. Figures 5(a) and 5(b) illustrate the interaction matrix without and with pruning, respectively, and Figure 5(c) shows the change in matrix values. We can see that PEP reduces the parameters devoted to unimportant field interactions while preserving the strength of the meaningful ones. By denoising less-important feature interactions, PEP reduces embedding parameters while maintaining or improving accuracy.
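The field-interaction statistic described above can be sketched as follows; field sizes are hypothetical, and each matrix entry averages |v_a · v_b| over all feature pairs of the two fields before normalizing by the maximum.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: 3 fields whose embedding tables have different row counts.
fields = [rng.standard_normal((5, 4)),
          rng.standard_normal((3, 4)),
          rng.standard_normal((7, 4))]

M = len(fields)
inter = np.zeros((M, M))
for i in range(M):
    for j in range(M):
        dots = fields[i] @ fields[j].T   # all pairwise dot products v_a · v_b
        inter[i, j] = np.abs(dots).mean()  # average absolute interaction strength

inter /= inter.max()                     # normalize to [0, 1]
assert inter.shape == (M, M) and np.isclose(inter.max(), 1.0)
```

Running the same computation on a pruned table (zeros in place of pruned entries) and subtracting gives the change map of Figure 5(c).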

5.4. CORRELATION BETWEEN SPARSITY AND FREQUENCY

As shown in Figure 6(a), feature frequencies are highly diversified, so uniform-size embeddings may not handle this heterogeneity, and frequency plays an important role in embedding-size selection. Hence, some recent works (Zhao et al., 2020b; Ginart et al., 2019; Cheng et al., 2020; Kang et al., 2020; Zhang et al., 2020; Joglekar et al., 2020) explicitly utilize feature frequencies. Different from them, our PEP shrinks the parameters in an end-to-end automatic way, circumventing complex manual intervention. Nevertheless, frequency is one of the factors that determine a feature's importance, so we study whether our method detects the influence of frequency and whether the learned embedding sizes are related to it. We first analyze the sparsity trajectory during training (we define the sparsity of an embedding as the ratio of the number of non-zero values to its original embedding size), shown in Figure 6(b), where colors indicate groups of features divided by popularity. For each group, we compute each feature's sparsity and then average over all features in the group; shaded regions represent the within-group variance. We observe that PEP tends to assign high-frequency features larger sizes, ensuring sufficient representation capacity, while the trend for low-frequency features is the opposite. These results accord with the postulation that high-frequency features deserve more embedding parameters, while a few parameters suffice for low-frequency ones. We then probe the relationship between the sparsity of the pruned embeddings and the frequency of each feature. Figure 6(c) shows that the overall relationship is consistent with the above analysis. However, some low-frequency features are assigned rich parameters, and some highly popular features are assigned small embedding sizes.
This illustrates that simply allocating more parameters to high-frequency features, as most previous works do, cannot capture the complex connection between features and their popularity. Our method prunes based on data, which reflects features' intrinsic properties, and thus cuts down parameters in a more elegant and efficient way.
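The per-feature sparsity statistic used in this analysis can be sketched as below; frequencies and the pruned table are synthetic stand-ins, so the numbers printed are illustrative only, not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 8

freq = rng.integers(1, 1000, size=N)     # hypothetical feature frequencies
V = rng.standard_normal((N, d))
V[np.abs(V) < 0.8] = 0.0                 # stand-in for a pruned embedding table

# Per-feature sparsity: fraction of non-zero entries left in its embedding.
nonzero_ratio = (V != 0).mean(axis=1)

# Simple diagnostic: compare the mean retained size of the least- vs
# most-frequent quartiles (the paper reports a positive but imperfect link).
order = np.argsort(freq)
low_group = nonzero_ratio[order[: N // 4]].mean()
high_group = nonzero_ratio[order[-(N // 4):]].mean()
print(f"low-frequency group keeps {low_group:.2f}, "
      f"high-frequency group keeps {high_group:.2f}")
```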

6. CONCLUSION

In this paper, we address the common problem of fixed-size embedding tables in today's feature-based recommender systems. We propose a general plug-in framework to adaptively learn suitable embedding sizes for different features. The proposed PEP method is efficient and can be easily applied to various recommendation models. Experiments on three state-of-the-art recommendation models and three benchmark datasets verify that PEP achieves strong recommendation performance while significantly reducing the parameter count, and that it can be trained efficiently.

A APPENDIX

A.1 DESCRIPTION OF g(s)

Following Kusupati et al. (2020), a proper threshold function g(s) should have the following three properties:

1. g(s) > 0, lim_{s→-∞} g(s) = 0, and lim_{s→∞} g(s) = ∞.

2. ∃ G ∈ R_{++} such that 0 < g'(s) ≤ G for all s ∈ R.

3. g'(s_init) < 1, which reduces the updating speed of s at the start of pruning.
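The sigmoid used in Equation (8) can be checked against these properties numerically; note that sigmoid saturates at 1 rather than growing unboundedly, so it meets property 1 only up to that cap (positivity and the s → -∞ limit do hold), which the main text treats as acceptable. The s_init below is a hypothetical illustrative value.

```python
import math

def g(s):          # the sigmoid threshold used with Eq. (8)
    return 1.0 / (1.0 + math.exp(-s))

def g_prime(s):    # sigmoid'(s) = g(s) * (1 - g(s))
    return g(s) * (1.0 - g(s))

# Property 1 (partially): g > 0 everywhere and g(s) -> 0 as s -> -inf.
assert g(-30) < 1e-12 and all(g(s) > 0 for s in (-5, 0, 5))

# Property 2: the derivative is bounded, 0 < g'(s) <= G with G = 0.25.
assert all(0 < g_prime(s) <= 0.25 for s in [x / 10 for x in range(-100, 101)])

# Property 3: g'(s_init) < 1 holds for any s_init, since G = 0.25 < 1.
s_init = -10.0     # a large negative init keeps the threshold near zero at first
assert g_prime(s_init) < 1 and g(s_init) < 1e-4
```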

A.2 PYTORCH CODE OF PEP

We present the main code of PEP here since it is easy to use and can be plugged into various embedding-based recommendation models.

A.4.1 DATASETS AND PREPROCESSING

We experiment with three public benchmark datasets: MovieLens-1M, Criteo, and Avazu. Table 3 summarizes the statistics of the datasets.
• MovieLens-1M. A widely-used benchmark dataset containing timestamped user-movie ratings ranging from 1 to 5. Following AutoInt (Song et al., 2019), we treat samples with a rating of 1 or 2 as negative samples and samples with a rating of 4 or 5 as positive samples; the remaining neutral samples are removed.
• Criteo. A benchmark dataset for the feature-based recommendation task, containing 26 categorical feature fields and 13 numerical feature fields. It has about 45 million users' clicking records on displayed ads.
• Avazu. The Avazu dataset contains 11 days of user clicking behavior released for a Kaggle challenge. There are 22 categorical feature fields in the dataset, and some of the fields are anonymous.
Preprocessing. Following the general preprocessing steps (Guo et al., 2017; Song et al., 2019), for the numerical feature fields in Criteo we employ the log transformation log2(x) if x > 2, proposed by the winner of the Criteo competition, to normalize the numerical features. Besides, for the Criteo and Avazu datasets, we treat features whose frequency is less than ten as a single feature "unknown". For each dataset, all samples are randomly divided into training, validation, and testing sets in proportions of 80%, 10%, and 10%.
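The two preprocessing steps above can be sketched as follows; this follows the log2 reading of the transformation as written in the text (the cited winner's write-up should be consulted for the exact variant), and the column data is synthetic.

```python
import math
from collections import Counter

def transform_numeric(x):
    # Normalize Criteo numeric fields: log2-transform values greater than 2.
    return math.log2(x) if x > 2 else x

assert transform_numeric(2) == 2
assert transform_numeric(8) == 3.0

# Rare-category bucketing: map features seen fewer than 10 times to "unknown".
def bucket_rare(column, min_count=10):
    counts = Counter(column)
    return [v if counts[v] >= min_count else "unknown" for v in column]

col = ["a"] * 12 + ["b"] * 3
out = bucket_rare(col)
assert out[:12] == ["a"] * 12 and out[12:] == ["unknown"] * 3
```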

A.4.2 PERFORMANCE MEASURES

We evaluate the performance of PEP with the following two metrics. • AUC. The area under the Receiver Operating Characteristic (ROC) curve is the probability of ranking a randomly chosen positive sample higher than a randomly chosen negative sample; a higher AUC indicates better model performance.

Here we conduct experiments to verify the effectiveness of the LTH-based retraining described in Section 4.2. We compare our method with a variant that uses random re-initialization for retraining, to examine the influence of initialization. We also compare standard PEP with the original base recommendation model, to verify the influence of embedding pruning. To evaluate the importance of retraining, we further test PEP with the pruning stage only. We choose FM as the base recommendation model and use the same settings as in the above experiments. The results are presented in Figures 8 and 9. We observe that the winning ticket with its original initialization trains faster and reaches higher recommendation accuracy than random re-initialization, which demonstrates the effectiveness of our retraining design. Moreover, the randomly re-initialized winning ticket still outperforms the unpruned model: by removing less-important features' embedding parameters, the model benefits from denoising over-parameterized embeddings, which are liable to over-fit when embedding sizes are uniform. Finally, the performance of PEP without retraining degrades slightly but still outperforms the original model, and the margin between no-retraining and the original model is larger than the margin between retraining and no-retraining. These results demonstrate that PEP chiefly benefits from suitable embedding-size selection.
We conjecture the source of retraining's benefit as follows: during the search stage, less-important elements of the embedding matrices are pruned gradually until training reaches convergence. In earlier epochs, before these elements have been pruned, they may have negative effects on the gradient updates of the important elements, making the learning of those elements sub-optimal. A retraining step eliminates such effects and improves performance.



Footnotes
• It is also known as click-through rate prediction.
• Our PEP benefits from such reduction, as demonstrated in Sections 5.1, 5.3 and 5.4.
• More details about how to choose a suitable g(s) are provided in Appendix A.1.
• We do not compare with NIS (Joglekar et al., 2020) since its code has not been released and its reinforcement-learning-based search is really slow.
• More details of the implementation and the above information can be found in Appendix A.4.
• We report five points of our method, marked from 0 to 4.
• We define the sparsity of an embedding as the ratio of the number of non-zero values to its original embedding size.
• MovieLens-1M: https://grouplens.org/datasets/movielens
• Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge
• Avazu: https://www.kaggle.com/c/avazu-ctr-prediction
• Criteo competition solution: https://www.csie.ntu.edu.tw/r01922136/kaggle-2014-criteo.pdf
• We omit the results of AutoInt with LR on the MovieLens-1M dataset because there is no performance drop for the AutoInt model compared with other models.



Figure 2: AUC-# Parameter curve on MovieLens-1M with three base models.

Figure 3: AUC-# Parameter curve on Criteo with three base models.

Figure 5: Interpretable analysis on MovieLens-1M dataset.

Figure 7: PyTorch code of PEP.

Figure 10: Influence of different granularity on MovieLens-1M dataset (Choose FM as base model)

Comparison of our PEP and existing works (AutoInt is a base recommendation model and others are embedding-parameter-reduction methods.)

Runtime per training epoch on Criteo for the base model, DartsEmb, and our PEP.

Statistics of three utilized benchmark datasets.

Performance comparison between PEP-0 and Linear Regression.

It is worth noting that the AutoInt model does not contain an LR component, so PEP-0 with AutoInt on the Criteo and Avazu datasets leads to a large performance drop. We therefore include LR in PEP-0 for AutoInt and test the performance. As we can see, the resulting accuracy on Criteo and Avazu outperforms AutoInt without LR; this can be explained by LR helping our PEP-0 achieve a more stable performance.

A.6 THE LOTTERY TICKET HYPOTHESIS

In the retraining stage in Section 4.2, we rely on the Lottery Ticket Hypothesis to re-initialize the pruned embedding table (called the winning ticket) to its original initial values.

7. ACKNOWLEDGEMENTS

This work was supported in part by The National Key Research and Development Program of China under grant 2020AAA0106000, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, 61861136003. 

APPENDIX

Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1-38, 2019.

Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. Memory-efficient embedding for recommendations. arXiv preprint arXiv:2006.14827, 2020a.

Xiangyu Zhao, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and Jiliang Tang. AutoEmb: Automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252, 2020b.

• Logloss. As a loss function widely used in feature-based recommendation, Logloss on the test data directly evaluates the model's performance. The lower the model's Logloss, the better its performance.
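Both metrics can be computed directly from their definitions. A small pure-Python sketch (the helper names are ours):

```python
import math

def auc(labels, scores):
    # Probability that a randomly chosen positive sample is ranked above
    # a randomly chosen negative one; ties count as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logloss(labels, probs, eps=1e-15):
    # Average negative log-likelihood; predicted probabilities are
    # clipped away from 0 and 1 to keep the logarithm finite.
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

In practice one would use a library implementation (e.g. scikit-learn's `roc_auc_score` and `log_loss`), which these sketches mirror.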

A.4.3 BASELINES

We compare our proposed method with the following state-of-the-art methods:

• UE (short for Uniform Embedding). The uniform-embedding manner is commonly adopted in existing recommender systems: all features have a uniform embedding size.

• MGQE (Kang et al., 2020). This method retrieves embedding fragments from a small set of shared centroid embeddings and then generates the final embedding by concatenating those fragments. MGQE learns embeddings with different capacities for different items. It is the strongest baseline among embedding-parameter-sharing methods.

• MDE (short for Mixed Dimension Embedding (Ginart et al., 2019)). This method is based on a human-crafted rule: the embedding size of a feature is proportional to its popularity, so higher-frequency features are assigned larger embedding sizes. This is the state-of-the-art human-rule-based method.

• DartsEmb (Zhao et al., 2020b). This is the state-of-the-art neural architecture search-based method, which allows features to automatically search for their embedding sizes in a given space.

A.4.4 IMPLEMENTATION DETAILS

Following AutoInt (Song et al., 2019) and DeepFM (Guo et al., 2017), we employ the Adam optimizer with a learning rate of 0.001 to optimize model parameters in both the pruning and retraining stages. For g(s), we apply the sigmoid function g(s) = 1 / (1 + e^{-s}) in all experiments and initialize s to -15, -150 and -150 on the MovieLens-1M, Criteo and Avazu datasets respectively. Moreover, the granularity of PEP is set to Dimension-wise for PEP-2, PEP-3, and PEP-4 on the Criteo and Avazu datasets; all others are set to Feature-Dimension-wise. The base embedding dimension d is set to 64 for all models before pruning. We deploy our method and the other baseline methods on three state-of-the-art models: FM (Rendle, 2010), DeepFM (Guo et al., 2017), and AutoInt (Song et al., 2019), to compare their performance.
Besides, in the retraining stage, we apply early stopping according to the loss on the validation dataset during training. We use PyTorch (Paszke et al., 2019) to implement our method and train it with mini-batch size 1024 on a single 12GB-memory NVIDIA TITAN V GPU.

Implementation of Baselines. For Uniform Embedding, we test embedding sizes in [8, 16, 32, 64] for the MovieLens-1M dataset. For the Criteo and Avazu datasets, we vary the embedding size over [4, 8, 16], because performance starts to drop when d > 16. For the other baseline methods, we first tune the hyper-parameters so that the models attain either the highest recommendation performance or the highest parameter-reduction rate; we then tune those methods to balance the two aspects. We provide the experimental details of our implementation of these baselines below, following the settings of the original papers. For the grid-search space of MDE, we search the baseline dimension d over [4, 8, 16, 32], the number of blocks K over [8, 16], and α over [0.1, 0.2, 0.3]. For MGQE, we search the baseline dimension d over [8, 16, 32], the number of subspaces D over [4, 8, 16], and the number of centroids K over [64, 128, 256, 512]. For DartsEmb, we choose three different candidate embedding spaces to meet different memory budgets: {1, 2, 8}, {2, 4, 16} and {4, 8, 32}.
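The early-stopping logic used in the retraining stage can be sketched as below. The patience value and function names are illustrative, not taken from the paper's code:

```python
def train_with_early_stopping(run_epoch, val_loss, max_epochs=100, patience=3):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch(epoch)            # one pass over the training data
        loss = val_loss(epoch)      # loss on the held-out validation set
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break               # validation loss stopped improving
    return best
```

In practice one would also checkpoint the model at the best validation loss; that bookkeeping is omitted here for brevity.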

A.5 COMPARISON BETWEEN PEP-0 AND LINEAR REGRESSION

The Linear Regression (LR) model is an embedding-free model that makes predictions based only on a linear combination of raw features. Hence, it is worth comparing our method at the extremely-sparse level (PEP-0) with LR. Table 4 shows that our PEP-0 significantly outperforms LR in all cases. This result verifies that PEP-0 does not depend on the LR part in FM and DeepFM to retain strong recommendation performance. Therefore, even at an extremely sparse level, our PEP still has high application value in real-world scenarios.

Dimension-wise pruning can achieve comparable AUC with fewer training epochs. Hence we adopt this granularity for PEP-2, PEP-3, and PEP-4 on the large datasets to save time spent on training.

A.8 ABOUT LEARNABLE g(s)

The pruning threshold g(s) can be learned from training data to reduce parameter usage in the embedding matrix. But why can our PEP learn a suitable g(s) from training data? We deduce that increasing s in g(s) can decrease the training loss; in other words, PEP updates s during optimization to achieve a lower training loss. In Figure 11, we plot FM's training curves with and without PEP on the MovieLens-1M and Criteo datasets to confirm this assumption. Our PEP achieves a much lower training loss when pruning. Besides, this verifies that PEP learns embedding sizes in a stable manner. The stability shown in Figure 11 can be explained as follows: PEP reaches a relatively stable number of embedding parameters at the later stage of pruning (e.g., when the epoch is larger than 30 on the MovieLens dataset), and the embedding parameters are well trained, so the training-loss curve looks relatively stable. Note that the figure shows a sequence of changing thresholds: the point at which we obtain the embedding table for some sparsity level is not a converged point for that exact level, which instead requires retraining with a fixed threshold.
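The soft-threshold re-parameterization with the learnable threshold g(s) = sigmoid(s) can be sketched as follows. This is our own NumPy illustration of the forward pass only (in the actual model, s is a trainable parameter that receives gradients through the loss; the PyTorch code in Figure 7 is the authoritative version):

```python
import numpy as np

def g(s):
    # Learnable pruning threshold g(s) = sigmoid(s). Initializing s to a
    # large negative value (e.g. -15) makes the initial threshold ~0,
    # so no entry is pruned at the start of training.
    return 1.0 / (1.0 + np.exp(-s))

def reparam_embedding(V, s):
    # Soft thresholding: shrink every entry toward zero by g(s); entries
    # whose magnitude falls below g(s) become exactly zero (pruned).
    return np.sign(V) * np.maximum(np.abs(V) - g(s), 0.0)

def sparsity(W):
    # Ratio of non-zero values to the total number of entries.
    return np.count_nonzero(W) / W.size
```

As s grows during training, g(s) rises and more entries cross the threshold, which is how a mixed-dimension embedding scheme emerges per feature without any manually set pruning rate.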

