ONLINE LOW RANK MATRIX COMPLETION

Abstract

We study the problem of online low-rank matrix completion with M users, N items and T rounds. In each round, the algorithm recommends one item per user, for which it receives a (noisy) reward sampled from a low-rank user-item preference matrix. The goal is to design a method with sub-linear regret (in T) and nearly optimal dependence on M and N. The problem can easily be mapped to the standard multi-armed bandit problem where each item is an independent arm, but that leads to poor regret because the correlation across users and items is not exploited. On the other hand, exploiting the low-rank structure of the reward matrix is challenging due to the non-convexity of the low-rank manifold. We first demonstrate that the low-rank structure can be exploited using a simple explore-then-commit (ETC) approach that ensures a regret of O(polylog(M+N) T^{2/3}). That is, roughly only polylog(M+N) item recommendations are required per user to get a non-trivial solution. We then improve our result for the rank-1 setting, which is quite challenging in itself and encapsulates some of the key issues. Here, we propose OCTAL (Online Collaborative filTering using iterAtive user cLustering) that guarantees nearly optimal regret of O(polylog(M+N) T^{1/2}). OCTAL is based on a novel technique of clustering users that allows iterative elimination of items and leads to a nearly optimal minimax rate.

1. INTRODUCTION

Collaborative filtering based on low-rank matrix completion/factorization techniques is the cornerstone of most modern recommendation systems (Koren, 2008). Such systems model the underlying user-item affinity matrix as a low-rank matrix, use the acquired user-item recommendation data to estimate the low-rank matrix, and subsequently use the matrix estimate to recommend items for each user. Several existing works study this offline setting (Candès & Recht, 2009; Deshpande & Montanari, 2012; Jain et al., 2013; Chen et al., 2019; Abbe et al., 2020). However, typical recommendation systems are naturally online and interactive: they recommend items to users and need to adapt quickly based on users' feedback. The goal of such systems is to quickly identify each user's preferred set of items, so it is necessary to identify the best items for each user instead of estimating the entire affinity matrix. Moreover, items/users are routinely added to the system, so it should be able to quickly adapt to new items/users using only a small amount of recommendation feedback. In this work, we study this online recommendation problem. In particular, we study the online version of low-rank matrix completion with the goal of identifying the top few items for each user using, say, only logarithmically many exploratory recommendation rounds per user. In each round (out of T rounds) we predict one item (out of N items) for each user (out of M users) and obtain feedback/reward for each of the predictions, e.g., whether the user viewed the recommended movie. The goal is to design a method whose reward is asymptotically similar to that of a method that can pick the best items for each user. As mentioned earlier, we are specifically interested in the setting where T ≪ N, i.e., the number of recommendation feedback rounds is much smaller than the total number of items. Moreover, we assume that the expected reward matrix is low-rank.
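To make the interaction protocol concrete, the following is a minimal self-contained simulation of the setting described above: a rank-1 expected reward matrix, one recommendation per user per round, and a naive explore-then-commit baseline that treats every (user, item) pair independently, i.e., it ignores the low-rank structure. All problem sizes, noise levels, and the particular rank-1 instance are illustrative assumptions, not parameters from the paper, and this baseline is not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, T = 20, 50, 300      # users, items, rounds (illustrative sizes)
T_explore = 100            # per-user exploration budget (illustrative)

# Rank-1 expected reward matrix P = u v^T (hypothetical instance).
u = rng.uniform(0.5, 1.0, size=M)
v = rng.uniform(0.0, 1.0, size=N)
P = np.outer(u, v)

def noisy_reward(i, j):
    """Reward for recommending item j to user i: mean P[i, j] plus Gaussian noise."""
    return P[i, j] + 0.1 * rng.standard_normal()

# Naive explore-then-commit, one independent bandit per user
# (wasteful: it never shares information across users or items).
sums = np.zeros((M, N))
counts = np.zeros((M, N))
total_reward = 0.0
for t in range(T):
    if t < T_explore:
        arms = rng.integers(0, N, size=M)            # explore: random item per user
    else:
        means = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
        arms = means.argmax(axis=1)                  # commit: best empirical item per user
    for i, j in enumerate(arms):
        r = noisy_reward(i, j)
        sums[i, j] += r
        counts[i, j] += 1
        total_reward += r

# Regret against the oracle that always plays each user's best item.
oracle = T * P.max(axis=1).sum()
regret = oracle - total_reward
print(f"regret per user per round: {regret / (M * T):.3f}")
```

With T ≪ N per user, each item is sampled only a couple of times during exploration, so the per-user commit step is unreliable; exploiting the shared low-rank structure across users is exactly what reduces the required exploration to polylog(M+N) rounds per user.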
That is, if R^{(t)}_{ij} is the reward obtained in the t-th round for predicting item j for user i, then E[R^{(t)}_{ij}] = P_{ij}, where P ∈ ℝ^{M×N} is a low-rank matrix. A similar low-rank reward matrix setting has been studied in online multidimensional learning problems (Katariya et al., 2017b; Kveton et al., 2017; Trinh et al., 2020). But in

