UMEC: UNIFIED MODEL AND EMBEDDING COMPRESSION FOR EFFICIENT RECOMMENDATION SYSTEMS

Abstract

The recommendation system (RS) plays an important role in content recommendation and retrieval scenarios. The core part of the system is the ranking neural network, which is usually the performance bottleneck of the whole system during online inference. Building an efficient neural network-based recommendation system involves the entangled challenges of compressing both the network parameters and the feature embedding inputs. We propose a unified model and embedding compression (UMEC) framework that jointly learns input feature selection and neural network compression, formulated as a resource-constrained optimization problem and solved with the alternating direction method of multipliers (ADMM) algorithm. Experimental results on public benchmarks show that our UMEC framework notably outperforms other non-integrated baseline methods. The code can be found at https://github.com

1. INTRODUCTION

As the core component of a recommendation system (RS), recommendation models (RMs) based on ranking neural networks are widely adopted in general content recommendation and retrieval applications. In general, an effective recommendation model consists of two components: a feature embedding sub-model and a prediction sub-model, as illustrated in Figure 1. Usually, an RM adopts neural networks for both sub-models. Formally, we denote an RM as f(·; W), where W is the set of learnable parameters of f. At inference time, the model f takes the input feature data x and predicts a confidence score for the content, serving the recommendation application. Specifically, we define the embedding and prediction sub-models as f_e(·; W_e) and f_p(·; W_p) respectively, where W_e and W_p are their own learnable parameters and W = {W_e, W_p}. The embedding feature, v := f_e(x; W_e), is the input of f_p given the input data x. Hence, we can express the RM as f(·; W) := f_p(f_e(·; W_e); W_p). Given a ranking training loss ℓ(·) (e.g., the binary cross-entropy (BCE) loss), the learning goal of the ranking model can be written as

min_W Σ_{(x,y)∈D} ℓ(f(x; W), y),

where y is the ground-truth label and D is the training dataset.

Nowadays, extremely large-scale data are fed into recommendation systems to predict user behavior in many applications. For a neural network-based RM, the heaviest computation during online inference is the layer-wise product between the hidden output vectors and the model parameters W. A slimmed neural network structure would therefore save a great amount of power during inference. Hence, the main idea of RM compression is to reduce the structural complexity of f(·; W) and the dimensions of the hidden output vectors.

To obtain an efficient ranking model (for example, MLP-based) for an RS, one may apply existing model compression methods to MLPs directly. For example, Li et al. (2016) remove entire filters in the network together with their connected feature maps based on weight magnitudes; applied to MLP structures, this removes a specific neuron along with its connections. Molchanov et al. (2019) approximate the importance of a neuron (filter) to the final loss using first- and second-order Taylor expansions, which can likewise be applied to neuron pruning without hassle. There are also compression methods that focus on reducing the dimension of the embedding feature vectors. However, the performance of these compression methods depends heavily on hyper-parameter tuning and on background knowledge of the specific recommendation model. For example, an embedding dimension compression method may require the user to search for the best dimension over multiple rounds of training and search.

We instead solve the RM compression problem with a unified resource-constrained optimization framework. It relies only on the resource consumption of the RM f(·; W) to compress both the MLP and the embedding dimensions, without multi-stage heuristics or expensive hyper-parameter tuning. Our novel unified model and embedding compression (UMEC) framework directly satisfies both the resource-consumption requirement of the ranking neural network and the prediction accuracy goal of the ranking model, with end-to-end gradient-based training. We reformulate the optimization of the training loss with hard constraints as a minimax optimization problem solved by the alternating direction method of multipliers (ADMM) (Boyd et al., 2011).

To summarize, the contributions of UMEC are as follows:

• To the best of our knowledge, UMEC is the first unified optimization framework for the recommendation system scenario. Unlike existing works that treat input feature selection and model compression as two separate problems, UMEC learns both jointly by unifying the original prediction objective with the hard constraints related to model compression.

• We reformulate the joint input feature selection and model compression task as a constrained optimization problem. We convert resource constraints and L0 sparsity constraints into soft optimization energy terms and solve the whole problem with ADMM.

• Extensive experiments on large-scale public benchmarks show that our method largely outperforms previous state-of-the-art input feature selection and model compression methods, endorsing the benefits of the proposed end-to-end optimization.
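To make the formulation above concrete, the following NumPy sketch instantiates the two-sub-model structure f(x) = f_p(f_e(x; W_e); W_p) with a BCE ranking loss. All sizes, the mean-pooling lookup, and the one-layer prediction sub-model are illustrative assumptions for exposition, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Embedding sub-model f_e: looks up dense vectors for sparse feature ids
# and pools them into a single embedding v (hypothetical sizes).
W_e = rng.normal(0.0, 0.1, size=(1000, 16))   # 1000 feature ids, 16-dim

def f_e(x_ids):
    return W_e[x_ids].mean(axis=0)

# Prediction sub-model f_p: a minimal one-layer "MLP" on top of v.
W_p = rng.normal(0.0, 0.1, size=(16, 1))

def f_p(v):
    return sigmoid(v @ W_p).item()

# The full recommendation model: f = f_p composed with f_e.
def f(x_ids):
    return f_p(f_e(x_ids))

def bce(p, y):
    # Binary cross-entropy ranking loss l(f(x; W), y).
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

p = f(np.array([3, 17, 42]))   # a toy interaction with three feature ids
loss = bce(p, 1.0)
```

In this picture, compressing the RM means shrinking the rows/columns of W_e (embedding dimensions) and the weights of the prediction sub-model jointly, which is exactly the coupling the unified formulation targets.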



Figure 1: An overview of the neural network-based recommendation system: the deep learning recommendation model (DLRM), proposed by Naumov et al. (2019).

Ginart et al. (2019) proposes a mixed dimension embedding scheme by designing non-uniform embedding dimensions that scale with the popularity of features. Joglekar et al. (2020) uses Neural Input Search (NIS) to learn the embedding dimensions for the sparse categorical features.
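A hedged sketch of the mixed-dimension idea: popular features receive larger embedding dimensions than rare ones. The frequency counts, base dimension, and power-law schedule below are made-up illustrations of the general principle, not the exact scheme of Ginart et al. (2019).

```python
import numpy as np

# Hypothetical per-feature frequency counts (not from the paper).
counts = np.array([50000, 12000, 900, 40])
base_dim, alpha = 32, 0.3   # illustrative dimension budget and exponent

# Scale each feature's embedding dimension with its relative popularity;
# rare features get much smaller tables, but never less than 1 dimension.
dims = np.maximum(1, (base_dim * (counts / counts.max()) ** alpha).astype(int))

# One embedding table per feature, sized by its assigned dimension.
tables = [np.zeros((100, d)) for d in dims]
```

Such schemes cut embedding memory without uniform truncation, but they require choosing the schedule (here alpha and base_dim) by hand, which is the kind of tuning the unified framework aims to avoid.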

Recommendation Models With the recent development of deep learning techniques, plenty of works have proposed learning-based recommendation systems to grapple with personalized recommendation tasks. Hidasi et al. (2015) applies recurrent neural networks (RNNs) to long session data to model the whole session, achieving better accuracy than traditional matrix factorization approaches. Covington et al. (2016) proposes a high-level recommendation algorithm for YouTube videos that specifies a candidate generation model followed by a separate ranking model, demonstrating a significant performance boost thanks to neural networks. Wu et al. (2019) proposes a recommendation algorithm based on graph neural networks to better capture the complex transitions among items, achieving more accurate item embeddings; their method outperforms traditional sequential approaches. Naumov et al. (2019) exploits categorical features and designs a parallelism scheme for the embedding tables to alleviate the limited-memory problem. Huang et al. (2020) proposes a multi-attention based model to mitigate the low-efficiency problem in group recommendation scenarios.

