ONLINE LOW RANK MATRIX COMPLETION

Abstract

We study the problem of online low-rank matrix completion with M users, N items, and T rounds. In each round, the algorithm recommends one item per user, for which it receives a (noisy) reward sampled from a low-rank user-item preference matrix. The goal is to design a method with sub-linear regret (in T) and nearly optimal dependence on M and N. The problem can easily be mapped to the standard multi-armed bandit problem, where each item is an independent arm, but this leads to poor regret because the correlations among arms and users are not exploited. On the other hand, exploiting the low-rank structure of the reward matrix is challenging due to the non-convexity of the low-rank manifold. We first demonstrate that the low-rank structure can be exploited using a simple explore-then-commit (ETC) approach that ensures a regret of O(polylog(M+N)·T^{2/3}). That is, roughly only polylog(M+N) item recommendations are required per user to obtain a non-trivial solution. We then improve our result for the rank-1 setting, which is itself quite challenging and encapsulates some of the key issues. Here, we propose OCTAL (Online Collaborative filTering using iterAtive user cLustering), which guarantees a nearly optimal regret of O(polylog(M+N)·T^{1/2}). OCTAL is based on a novel technique of clustering users that allows iterative elimination of items and leads to a nearly optimal minimax rate.

1. INTRODUCTION

Collaborative filtering based on low-rank matrix completion/factorization techniques is the cornerstone of most modern recommendation systems (Koren, 2008). Such systems model the underlying user-item affinity matrix as a low-rank matrix, use the acquired user-item recommendation data to estimate this low-rank matrix, and subsequently use the estimate to recommend items to each user. Several existing works study this offline setting (Candès & Recht, 2009; Deshpande & Montanari, 2012; Jain et al., 2013; Chen et al., 2019; Abbe et al., 2020). However, typical recommendation systems are naturally online and interactive: they recommend items to users and need to adapt quickly based on users' feedback. The goal of such systems is to quickly identify each user's preferred set of items, so it suffices to identify the best items for each user rather than estimate the entire affinity matrix. Moreover, items/users are routinely added to the system, so it should be able to adapt quickly to new items/users using only a small amount of recommendation feedback. In this work, we study this online recommendation problem. In particular, we study the online version of low-rank matrix completion with the goal of identifying the top few items for each user using, say, only logarithmically many exploratory recommendation rounds per user. In each round (out of T rounds), we predict one item (out of N items) for each user (out of M users) and obtain feedback/reward for each of the predictions, e.g., whether the user viewed the recommended movie. The goal is to design a method whose reward is asymptotically similar to that of a method that can pick the best items for each user. As mentioned earlier, we are specifically interested in the setting where T ≪ N, i.e., the number of recommendation feedback rounds is much smaller than the total number of items. Moreover, we assume that the expected reward matrix is low-rank.
That is, if R^{(t)}_{ij} is the reward obtained in the t-th round for predicting item j for user i, then E[R^{(t)}_{ij}] = P_{ij}, where P ∈ ℝ^{M×N} is a low-rank matrix. A similar low-rank reward matrix setting has been studied in online multi-dimensional learning problems (Katariya et al., 2017b; Kveton et al., 2017; Trinh et al., 2020), but in these problems the goal is to find the single matrix entry with the highest reward. Instead, our goal is to recommend good items to all users, which is a significantly harder challenge. A trivial approach is to ignore the underlying low-rank structure and solve the problem using standard multi-armed bandit methods, i.e., model each (user, item) pair as an arm. Naturally, this implies exploring almost all items for each user, which is reflected in the regret bound (averaged over users) of O(√(NT)) (Remark 1). That is, as expected, the regret bound is vacuous when the number of recommendation rounds T ≪ N. In contrast, most existing online learning techniques that leverage structure amongst arms assume a parametric form for the reward function and require the reward function to be convex in the parameters (Shalev-Shwartz et al., 2011; Bubeck, 2011). Thus, due to the non-convexity of the manifold of low-rank matrices, such techniques do not apply in our case. While there are some exciting recent approaches for non-convex online learning (Agarwal et al., 2019; Suggala & Netrapalli, 2020), they do not apply to the above-mentioned problem.

Our Techniques and Contributions: We first present a method based on the explore-then-commit (ETC) approach (Algorithm 2). For the first few rounds, the algorithm runs a pure exploration strategy, sampling random items for each user. We use the data obtained by pure exploration to learn a good estimate of the underlying reward matrix P; this result requires a slight modification of the standard matrix completion result of Chen et al. (2019).
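As a concrete illustration, the low-rank reward model and the pure-exploration step just described can be sketched as follows; all sizes, the rank, and the Gaussian noise level are hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: M users, N items, rank r.
M, N, r = 50, 200, 2

# Low-rank expected reward matrix P = U V^T, so rank(P) <= r.
U = rng.normal(size=(M, r)) / np.sqrt(r)
V = rng.normal(size=(N, r)) / np.sqrt(r)
P = U @ V.T

def pull(i, j, noise_std=0.1):
    """Recommend item j to user i; return a noisy reward with E[R_ij] = P_ij."""
    return P[i, j] + noise_std * rng.normal()

# One pure-exploration round: recommend a uniformly random item to every user.
items = rng.integers(0, N, size=M)
rewards = np.array([pull(i, items[i]) for i in range(M)])
```

Repeating such rounds gives the randomly sampled entries of P (plus noise) that a matrix completion estimator can consume.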
We then run the exploitation rounds based on the current estimate: for the remaining rounds, the algorithm commits to the arm with the highest estimated reward for each user. With the ETC algorithm, we achieve a regret bound of O(polylog(M+N)·T^{2/3}) (Thm. 1). This bound removes the dependence on N, implying non-trivial guarantees even when T ≪ N; that is, we require only polylog(M+N) exploratory recommendations per user. However, the algorithm's dependence on T is sub-optimal. To address this, we study the special but critical case of a rank-1 reward matrix. The rank-1 setting is itself technically challenging (Katariya et al., 2017b) and encapsulates many of the key issues. We provide a novel algorithm, OCTAL, in Algorithm 3 and a modified version in Algorithm 8 that achieve a nearly optimal regret bound of O(polylog(M+N)·T^{1/2}) (see Theorems 2 and E). The key insight is that in the rank-1 case, we need to cluster users based on their true latent representation to ensure low regret. Our method OCTAL consists of multiple phases of exponentially increasing length. Each phase refines the current estimate of the relevant sub-matrices of the reward matrix using standard matrix completion techniques. Using the latest estimate, we jointly refine the clustering of users and the estimate of the best items for users in each cluster. We show that the regret in each phase decreases quickly, and since the number of phases required to obtain the correct clustering is small, we obtain the desired regret bounds. Finally, we show that our method achieves a regret guarantee that scales as Õ(T^{1/2}) with the number of rounds T, and that this dependence on T is optimal (see Theorem 3). Below we summarize our main contributions (Õ(·) hides logarithmic factors):

• We formulate the online low-rank matrix completion problem and define the appropriate notion of regret to study.
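The explore-then-commit loop can be simulated in a few lines. This is only a minimal sketch: the sizes and noise level are hypothetical, and a crude rank-r truncated SVD on the mean-imputed observation matrix stands in for the matrix completion estimator of Chen et al. (2019) used in Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, r, T = 40, 100, 1, 300           # illustrative sizes only
T_explore = int(T ** (2 / 3))          # ETC explore length ~ T^{2/3}, up to log factors

# Ground-truth rank-1 rewards (simulation only; the algorithm never sees P).
u, v = rng.normal(size=M), rng.normal(size=N)
P = np.outer(u, v)

# --- Explore: one random item per user per round; accumulate observations. ---
sums, counts = np.zeros((M, N)), np.zeros((M, N))
for _ in range(T_explore):
    j = rng.integers(0, N, size=M)                       # one item per user
    obs = P[np.arange(M), j] + 0.1 * rng.normal(size=M)  # noisy rewards
    sums[np.arange(M), j] += obs
    counts[np.arange(M), j] += 1

# --- Estimate: rank-r truncated SVD of the mean-imputed matrix
# (a crude stand-in for the matrix completion step). ---
avg = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
U_, s, Vt = np.linalg.svd(avg, full_matrices=False)
P_hat = (U_[:, :r] * s[:r]) @ Vt[:r]

# --- Commit: for all remaining rounds, play each user's best estimated item. ---
best = P_hat.argmax(axis=1)
```

The explore/commit split above reflects the T^{2/3} scaling of the exploration budget that underlies the Õ(T^{2/3}) regret of the ETC approach.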
• We propose Algorithm 2, based on the explore-then-commit approach, which suffers a regret of Õ(T^{2/3}) (Thm. 1).

• We propose a novel algorithm, OCTAL (Online Collaborative filTering using iterAtive user cLustering), and a modified version (Alg. 3 and Alg. 8, respectively) for the special case of rank-1 reward matrices, and guarantee a regret of Õ(T^{1/2}). Importantly, our algorithms provide non-trivial regret guarantees in the practical regime of M ≫ T and N ≫ T, i.e., when the number of users and items is much larger than the number of recommendation rounds. Moreover, OCTAL does not suffer from a large cold-start time (a possibly long exploration period) as the ETC algorithm does.

• We conduct a detailed empirical study of our proposed algorithms (see Appendix A) on synthetic and multiple real datasets, and demonstrate that our algorithms can achieve significantly lower regret than methods that do not use collaboration between users. Furthermore, we show that for the rank-1 case, while it is critical to tune the exploration period of ETC (as a function of the number of rounds and the sub-optimality gaps), which is difficult in practice (Lattimore & Szepesvári, 2020)[Ch. 6], OCTAL suffers lower regret without such side-information (see Figures 1a and 2b).

Technical Challenges: For the rank-1 case, i.e., P = uv^⊤, one can cluster users into two bins: those with u_i ≥ 0 and those with u_i < 0. Cluster-2 users dislike the best items of Cluster-1 users, and vice versa. Thus, we require algorithms that can learn this cluster structure and exploit the fact that, within the same cluster, the relative ranking of all items remains the same. This is challenging information to

