UMEC: UNIFIED MODEL AND EMBEDDING COMPRESSION FOR EFFICIENT RECOMMENDATION SYSTEMS

Abstract

Recommendation systems (RS) play an important role in content recommendation and retrieval scenarios. The core of such a system is a ranking neural network, which is typically the performance bottleneck of the whole system during online inference. Building an efficient neural-network-based recommendation system involves the entangled challenges of compressing both the network parameters and the input feature embeddings. We propose a unified model and embedding compression (UMEC) framework that jointly learns input feature selection and neural network compression. We formulate this as a resource-constrained optimization problem and solve it with the alternating direction method of multipliers (ADMM) algorithm. Experimental results on public benchmarks show that our UMEC framework notably outperforms other non-integrated baseline methods. The code can be found at https://github.com

1. INTRODUCTION

As the core component of a recommendation system (RS), recommendation models (RMs) based on ranking neural networks are widely adopted in general content recommendation and retrieval applications. An effective RM typically consists of two components: a feature embedding sub-model and a prediction sub-model, as illustrated in Figure 1; both sub-models are usually realized as neural networks. Formally, we denote an RM as f(·; W), where W is the set of learnable parameters of f. At inference time, the model f takes the input feature data x and predicts a confidence score for the content, serving the recommendation application. Specifically, we further define the embedding and prediction sub-models as f_e(·; W_e) and f_p(·; W_p) respectively, where W_e and W_p are their learnable parameters and W = {W_e, W_p}. The embedding feature v := f_e(x; W_e) is the input of f_p given the input data x. Hence, we can express the RM as f(·; W) := f_p(f_e(·; W_e); W_p). Given a ranking training loss ℓ(·) (e.g., the binary cross-entropy (BCE) loss), the learning goal of the ranking model can be written as

min_W Σ_{(x,y)∈D} ℓ(f(x; W), y),

where y is the ground-truth label and D is the training dataset.

Nowadays, extremely large-scale data are fed into recommendation systems to predict user behavior in many applications. For a neural-network-based RM, the heaviest computational component of online inference is the layer-wise product between the hidden output vectors and the model parameters W. A slimmer network structure would save a great amount of power consumption during inference. Hence, the main idea of RM compression is to reduce the structural complexity of f(·; W) and the dimension of the hidden output vectors.

To obtain an efficient ranking model (e.g., MLP-based) for an RS, one may apply existing model compression methods to the MLPs directly. For example, Li et al. (2016) remove entire filters in the network, together with their connecting feature maps, according to their magnitudes, which can also
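The two-sub-model composition f(·; W) = f_p(f_e(·; W_e); W_p) with a BCE ranking loss can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the field, vocabulary, and layer sizes are arbitrary assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 3 categorical input fields,
# each with a 10-entry vocabulary, mapped to 4-d embedding vectors.
NUM_FIELDS, VOCAB, EMB_DIM, HIDDEN = 3, 10, 4, 8

# Embedding sub-model f_e(x; W_e): per-field lookup tables.
W_e = rng.normal(scale=0.1, size=(NUM_FIELDS, VOCAB, EMB_DIM))

def f_e(x):
    """x: integer field indices, shape (batch, NUM_FIELDS).
    Returns the concatenated embedding feature v, shape (batch, NUM_FIELDS * EMB_DIM)."""
    v = [W_e[fld, x[:, fld]] for fld in range(NUM_FIELDS)]
    return np.concatenate(v, axis=1)

# Prediction sub-model f_p(v; W_p): a small MLP with a sigmoid output.
W1 = rng.normal(scale=0.1, size=(NUM_FIELDS * EMB_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, 1))

def f_p(v):
    h = np.maximum(v @ W1, 0.0)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))   # predicted confidence in (0, 1)

def f(x):
    """Full RM: f(x; W) = f_p(f_e(x; W_e); W_p)."""
    return f_p(f_e(x))

def bce_loss(p, y):
    """Binary cross-entropy, the ranking loss l(f(x; W), y)."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

# A toy batch of 5 samples with binary ground-truth labels.
x = rng.integers(0, VOCAB, size=(5, NUM_FIELDS))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
p = f(x)
loss = bce_loss(p, y)
```

The layer-wise products `v @ W1` and `h @ W2` are exactly the inference-time operations whose cost motivates compressing both W_e (fewer or narrower embeddings) and W_p (slimmer MLP layers).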

