LEARNING TWO-TIME-SCALE REPRESENTATIONS FOR LARGE SCALE RECOMMENDATIONS

Abstract

We propose a surprisingly simple yet effective two-time-scale (2TS) model for learning user representations for recommendation. Our approach partitions users into two sets, active users with many observed interactions and inactive or new users with few observed interactions, and models them with two separate RNNs. Furthermore, we design a two-stage training method for our model: in the first stage, we learn transductive embeddings for users and items; in the second stage, we learn the two RNNs, leveraging the transductive embeddings trained in the first stage. Through the lens of online learning and stochastic optimization, we provide a theoretical analysis that motivates the design of our 2TS model. The 2TS model achieves a favorable bias-variance trade-off while being computationally efficient. On large-scale datasets, our 2TS model achieves significantly better recommendations than the previous state-of-the-art while being much more computationally efficient.

1. INTRODUCTION

A user's interactions with a recommendation system yield diminishing returns in terms of their information value for understanding the user. An active user with a long interaction history is typically already well understood by the recommender, so each new interaction provides relatively little new information. In contrast, for an inactive or new user, every additional interaction provides valuable information for understanding that user. Therefore, the representations of active and inactive users should be updated differently when a new interaction occurs. Figure 1 illustrates this diminishing-information phenomenon: the change in a user embedding from φ_t to φ_{t+1} due to an additional interaction decays over time. One can select a threshold t* on the number of interactions, above which users are categorized as active and below which as inactive. Roughly speaking, active users' embeddings evolve slowly as a function of the number of interactions, while inactive users' embeddings evolve quickly; hence, a two-time-scale embedding evolution.

Apart from this difference in temporal dynamics, the simultaneous presence of active and inactive users presents further modeling and computational challenges. On the one hand, active users lead to long interaction sequences and high-degree nodes in the user-item interaction graph. Existing sequence models, such as RNNs, have limitations when dealing with long-range sequences, due to the difficulty of gradient propagation. Moreover, graph neural network-based models become computationally inefficient due to the intensive message-passing operations through the high-degree nodes introduced by active users. On the other hand, predicting the preferences of inactive or new users (also known as the cold-start problem) is a challenging few-shot learning problem, where a decision must be made given only a small number of observations.
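The activity-based split described above can be sketched in a few lines; the threshold value and user names below are hypothetical, chosen only for illustration:

```python
# Minimal sketch of splitting users by activity level, assuming a
# hypothetical threshold t_star on the number of observed interactions.
T_STAR = 20  # hypothetical threshold t* (illustrative, not from the paper)

def partition_users(interaction_counts):
    """Split users into active (>= T_STAR interactions) and inactive sets."""
    active = {u for u, n in interaction_counts.items() if n >= T_STAR}
    inactive = set(interaction_counts) - active
    return active, inactive
```

For example, with counts `{"alice": 153, "bob": 3, "carol": 20}`, alice and carol would land in the active set and bob in the inactive set.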
To address the challenges imposed by the presence of these two types of users, we leverage their different dynamics and propose (i) a two-time-scale (2TS) model and (ii) a two-stage training algorithm.

2TS model. Based on the number of observed interactions, we partition the users into two sets: active and inactive. Our 2TS model (Fig. 1) updates the embeddings of active and inactive users with two RNNs with independent parameters, in order to respect the two-time-scale nature. Moreover, the initial embeddings of inactive users are represented by a common embedding ψ, shared across all inactive users. Therefore, the overall model for inactive users is inductive, in the sense that the learned model can be applied to unseen users. In contrast, each active user has a user-specific initial embedding φ_u, also called a transductive embedding. Such embeddings are very expressive and can better represent users with a long history.
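A minimal sketch of how the two RNNs and the two kinds of initial embeddings could fit together is given below. All names, dimensions, the vanilla-RNN cell, and the choice to keep only the last five items are our illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8        # embedding dimension (illustrative)
T_STAR = 20  # hypothetical activity threshold t*

def rnn_step(h, x, W, U, b):
    # One vanilla-RNN update of a user embedding h given an item embedding x.
    return np.tanh(W @ h + U @ x + b)

class TwoTimeScaleModel:
    def __init__(self, d=D):
        # Two RNNs with independent parameters, one per user group.
        self.rnn_active = (rng.normal(0, 0.1, (d, d)),
                           rng.normal(0, 0.1, (d, d)), np.zeros(d))
        self.rnn_inactive = (rng.normal(0, 0.1, (d, d)),
                             rng.normal(0, 0.1, (d, d)), np.zeros(d))
        self.psi = np.zeros(d)  # common initial embedding shared by inactive users
        self.phi = {}           # transductive (user-specific) embeddings, active users

    def embed(self, user_id, n_interactions, item_seq):
        if n_interactions >= T_STAR:
            # Active user: start from the transductive embedding phi_u and
            # adapt only on the last few engaged items.
            h = self.phi.setdefault(user_id, rng.normal(0, 0.1, D))
            rnn, item_seq = self.rnn_active, item_seq[-5:]
        else:
            # Inactive/new user: start from the shared embedding psi (inductive).
            h, rnn = self.psi, self.rnn_inactive
        for x in item_seq:
            h = rnn_step(h, x, *rnn)
        return h
```

Note that the inactive-user branch never looks up a per-user parameter, which is what makes that half of the model applicable to users unseen during training.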

Two-stage training.

In stage 1, we first learn transductive user embeddings φ_u and transductive item embeddings x_i using a classical collaborative filtering method. We then fix these embeddings, and in stage 2 we learn the parameters of the two RNNs and a common initialization ψ for inactive users. Notably, the transductive embeddings of inactive users are discarded in stage 2; only those of active users are used in the final 2TS model. Furthermore, for active users we do not use all interaction data to learn the RNN, since their transductive embeddings already encode their history. We use only the last few clicked items to learn the adaptation for active users, which improves the efficiency of training.

The proposed 2TS model and two-stage training algorithm lead to several advantages:

• Bias-variance trade-off. The differential use of transductive and inductive embeddings for the two RNN models allows 2TS to achieve a good overall bias-variance trade-off. We analyze this trade-off theoretically in Section 2 through the lens of the learning-to-learn paradigm for designing online learning (or adaptation) algorithms. Our theory shows that there exists an optimal threshold for splitting users that achieves the best overall excess risk.

• Encoding long-range sequences. The transductive embeddings φ_u of active users are user-specific vectors, so they can memorize a user's long-range history during training without suffering from the difficulty of gradient propagation. The RNN on top of these transductive embeddings is used only for adaptation to recently engaged items.

• Computational efficiency. The efficiency of our method on large-scale problems comes mainly from two design choices. First, stage 1 learns the transductive embeddings of active users and items, which contain a large number of parameters; however, it is fast since it involves no deep neural components and the loss is simply a convex function. Second, stage 2 learns only the RNNs, which contain a small number of parameters, and the RNN for active users is trained only on the last few engaged items, which cuts off the long sequences. Experimentally, our method proves to be much more efficient than the baselines on large-scale datasets.

We summarize the contributions of this paper as follows:

• To explain the intuition and motivation behind the 2TS model, we provide a theoretical analysis in a simplified setting, which rigorously argues for the differential use of transductive and inductive embeddings for active and inactive users (Section 2).

• Motivated by this analysis, we design the 2TS model and a two-stage training method for practical use (Section 3). We apply the proposed method to two large-scale benchmark datasets and compare it comprehensively against baseline models spanning a diverse set of categories, showing that our method is advantageous in terms of both accuracy and efficiency (Section 5).
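The text does not specify which classical collaborative filtering method stage 1 uses. As one hedged illustration, the transductive embeddings φ_u and x_i could be fit by SGD on a regularized squared loss over observed (user, item, rating) triples; all hyperparameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def stage1_factorize(interactions, n_users, n_items, d=8, lam=0.01, lr=0.1, epochs=200):
    """Stage 1 sketch: learn transductive user embeddings phi and item embeddings x
    by SGD on sum over (u, i, r) of (phi_u . x_i - r)^2 plus L2 regularization."""
    phi = rng.normal(0, 0.1, (n_users, d))  # transductive user embeddings
    x = rng.normal(0, 0.1, (n_items, d))    # transductive item embeddings
    for _ in range(epochs):
        for u, i, r in interactions:
            err = phi[u] @ x[i] - r
            phi_u = phi[u].copy()  # snapshot so both updates use the same iterate
            phi[u] -= lr * (err * x[i] + lam * phi[u])
            x[i] -= lr * (err * phi_u + lam * x[i])
    return phi, x
```

In stage 2, the returned `phi` and `x` would be frozen, and only the RNN parameters and the shared initialization ψ would be trained.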

2. THEORETICAL MOTIVATION: WHY TWO-TIME-SCALE MODELS?

We first present the motivation for designing the 2TS model through the lens of online learning and stochastic optimization. Our analysis quantitatively reveals that (i) the embeddings of active and inactive users evolve on different time scales, and (ii) using two different online learning algorithms for active and inactive users, respectively, can lead to a better overall estimation of user embeddings. Our analysis is carried out in a learning-to-learn setting, where online learning algorithms must be designed to tackle a family of tasks, each estimating the embedding vector of a user. Though this idealized setting cannot cover all aspects of real-world recommendation problems, it leads to



Figure 1: Two-time-scale convergence of user embeddings motivates us to design a two-stage method, where the first stage estimates transductive embeddings and the second stage learns two different RNNs for active and inactive users, respectively.

