ONLINE LOW RANK MATRIX COMPLETION

Abstract

We study the problem of online low-rank matrix completion with M users, N items and T rounds. In each round, the algorithm recommends one item per user, for which it gets a (noisy) reward sampled from a low-rank user-item preference matrix. The goal is to design a method with sub-linear regret (in T) and nearly optimal dependence on M and N. The problem can be easily mapped to the standard multi-armed bandit problem where each item is an independent arm, but that leads to poor regret because the correlation between arms and users is not exploited. On the other hand, exploiting the low-rank structure of the reward matrix is challenging due to the non-convexity of the low-rank manifold. We first demonstrate that the low-rank structure can be exploited using a simple explore-then-commit (ETC) approach that ensures a regret of O(polylog(M + N) · T^{2/3}). That is, roughly only polylog(M + N) item recommendations are required per user to get a non-trivial solution. We then improve our result for the rank-1 setting, which is itself quite challenging and encapsulates some of the key issues. Here, we propose OCTAL (Online Collaborative filTering using iterAtive user cLustering) that guarantees nearly optimal regret of O(polylog(M + N) · T^{1/2}). OCTAL is based on a novel technique of clustering users that allows iterative elimination of items and leads to a nearly optimal minimax rate.

1. INTRODUCTION

Collaborative filtering based on low-rank matrix completion/factorization techniques is the cornerstone of most modern recommendation systems (Koren, 2008). Such systems model the underlying user-item affinity matrix as a low-rank matrix, use the acquired user-item recommendation data to estimate the low-rank matrix and, subsequently, use the matrix estimate to recommend items for each user. Several existing works study this offline setting (Candès & Recht, 2009; Deshpande & Montanari, 2012; Jain et al., 2013; Chen et al., 2019; Abbe et al., 2020). However, typical recommendation systems are naturally online and interactive: they recommend items to users and need to adapt quickly based on users' feedback. The goal of such systems is to quickly identify each user's preferred set of items, so it is necessary to identify the best items for each user instead of estimating the entire affinity matrix. Moreover, items/users are routinely added to the system, so it should be able to quickly adapt to new items/users by using only a small amount of recommendation feedback. In this work, we study this problem of the online recommendation system. In particular, we study the online version of low-rank matrix completion with the goal of identifying the top few items for each user using, say, only logarithmically many exploratory recommendation rounds per user. In each round (out of T rounds) we predict one item (out of N items) for each user (out of M users) and obtain feedback/reward for each of the predictions, e.g., did the user view the recommended movie. The goal is to design a method that has asymptotically similar reward to a method that can pick the best items for each user. As mentioned earlier, we are specifically interested in the setting where T ≪ N, i.e., the number of recommendation feedback rounds is much smaller than the total number of items. Moreover, we assume that the expected reward matrix is low-rank.
That is, if R^{(t)}_{ij} is the reward obtained in the t-th round for predicting item j for user i, then E[R^{(t)}_{ij}] = P_{ij}, where P ∈ ℝ^{M×N} is a low-rank matrix. A similar low-rank reward matrix setting has been studied in online multi-dimensional learning problems (Katariya et al., 2017b; Kveton et al., 2017; Trinh et al., 2020). But in these problems, the goal is to find the matrix entry with the highest reward. Instead, our goal is to recommend good items to all users, which is a significantly harder challenge. A trivial approach is to ignore the underlying low-rank structure and solve the problem using standard multi-armed bandit methods, i.e., model each (user, item) pair as an arm. Naturally, that would imply exploration of almost all the items for each user, which is also reflected in the regret bound (averaged over users) of O(√(NT)) (Remark 1). That is, as expected, the regret bound is vacuous when the number of recommendation rounds T ≪ N. In contrast, most of the existing online learning techniques that leverage structure amongst arms assume a parametric form for the reward function and require the reward function to be convex in the parameters (Shalev-Shwartz et al., 2011; Bubeck, 2011). Thus, due to the non-convexity of the manifold of low-rank matrices, such techniques do not apply in our case. While there are some exciting recent approaches for non-convex online learning (Agarwal et al., 2019; Suggala & Netrapalli, 2020), they do not apply to the above-mentioned problem.

Our Techniques and Contributions:

We first present a method based on the explore-then-commit (ETC) approach (Algorithm 2). For the first few rounds, the algorithm runs a pure exploration strategy by sampling random items for each user. We use the data obtained by pure exploration to learn a good estimate of the underlying reward matrix P; this result requires a slight modification of the standard matrix completion result in Chen et al. (2019). We then run the exploitation rounds based on the current estimate: for the remaining rounds, the algorithm commits to the arm with the highest estimated reward for each user. With the ETC algorithm, we achieve a regret bound of O(polylog(M + N) · T^{2/3}) (Thm. 1). This bound gets rid of the dependence on N, implying non-trivial guarantees even when T ≪ N. That is, we require only polylog(M + N) exploratory recommendations per user. However, the dependence of the algorithm on T is sub-optimal. To address this, we study the special but critical case of a rank-one reward matrix. The rank-1 setting is itself technically challenging (Katariya et al., 2017b) and encapsulates many key issues. We provide a novel algorithm OCTAL in Algorithm 3, and a modified version in Algorithm 8, that achieves a nearly optimal regret bound of O(polylog(M + N) · T^{1/2}) (see Theorems 2 and E). The key insight is that in the rank-one case, we need to cluster users based on their true latent representation to ensure low regret. Our method OCTAL consists of multiple phases of exponentially increasing numbers of rounds. Each phase refines the current estimate of relevant sub-matrices of the reward matrix using standard matrix completion techniques. Using the latest estimate, we jointly refine the clustering of users and the estimate of the best items for users in each cluster. We can show that the regret in each phase decreases very quickly, and since the number of phases required to obtain the correct clustering is small, we obtain our desired regret bounds.
Finally, we show that our method achieves a regret guarantee that scales as Õ(T^{1/2}) with the number of rounds T. We also show that this dependence on T is optimal (see Theorem 3). Below we summarize our main contributions (Õ(·) hides logarithmic factors):

• We formulate the online low-rank matrix completion problem and define the appropriate notion of regret to study. We propose Algorithm 2, based on the explore-then-commit approach, which suffers a regret of Õ(T^{2/3}) (Thm. 1).

• We propose a novel algorithm OCTAL (Online Collaborative filTering using iterAtive user cLustering) and a modified version (Alg. 3 and Alg. 8 respectively) for the special case of rank-1 reward matrices, and guarantee a regret of Õ(T^{1/2}). Importantly, our algorithms provide non-trivial regret guarantees in the practical regime of M ≫ T and N ≫ T, i.e., when the number of users and items is much larger than the number of recommendation rounds. Moreover, OCTAL does not suffer from a large cold-start time (a possibly long exploration period), unlike the ETC algorithm.

• We conducted a detailed empirical study of our proposed algorithms (see Appendix A) on synthetic and multiple real datasets, and demonstrate that our algorithms achieve significantly lower regret than methods that do not use collaboration between users. Furthermore, we show that for the rank-1 case, while it is critical to tune the exploration period in ETC (as a function of the number of rounds and the sub-optimality gaps), which is difficult in practice (Lattimore & Szepesvári, 2020, Ch. 6), OCTAL suffers lower regret without such side-information (see Figures 1a and 2b).

Technical Challenges: For the rank-1 case, i.e., P = uv^T, one can cluster users into two bins: those with u_i ≥ 0 and those with u_i < 0. Cluster-2 users dislike the best items for Cluster-1 users and vice-versa. Thus, we require algorithms that can learn this cluster structure and exploit the fact that within the same cluster, the relative ranking of all items remains the same.
This information is challenging to exploit, as it requires learning the latent structure. Note that an algorithm that first attempts to cluster all users and only then optimizes the regret will, in the worst case, suffer the same T^{2/3} dependence as the ETC algorithm; this is because the difficulty of clustering varies widely across users. Instead, our proposed OCTAL algorithm (Algorithms 3 and 8) performs two tasks in each iteration/phase: i) it tries to eliminate some of the items, similar to the standard phased elimination method (Lattimore & Szepesvári, 2020, Ch. 6, Ex. 6.8), for users that are already clustered; ii) it simultaneously tries to grow the set of clustered users. For partial clustering, we first apply low-rank matrix completion guarantees over carefully constructed reward sub-matrices, each of which corresponds to a cluster of users and a set of active items that have high rewards for all users in that cluster, and then use the partial reward matrix to eliminate some of the items in each cluster.
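The phased elimination primitive referenced above can be sketched as follows. This is a minimal single-cluster simulation with our own synthetic mean rewards (`mu`) and confidence radii, not the paper's exact schedule:

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 20, 0.5
mu = rng.uniform(0, 1, size=N)   # mean rewards of items within one user cluster

# Phased elimination sketch (cf. Lattimore & Szepesvari, Ch. 6, Ex. 6.8):
# in phase l, pull every active item n times, then drop items whose
# empirical mean falls more than 2 * eps below the best empirical mean.
active = np.arange(N)
for l in range(1, 6):
    eps = 2.0 ** (-l)
    # sample size chosen so every empirical mean is eps-accurate w.h.p.
    n = int(np.ceil(2 * sigma**2 * np.log(1000 * N * l**2) / eps**2))
    means = np.array([
        (mu[i] + sigma * rng.normal(size=n)).mean() for i in active
    ])
    active = active[means >= means.max() - 2 * eps]

# With high probability the true best item survives every phase.
assert mu.argmax() in active
```

With the deviation bound |empirical − true| ≤ eps holding for all active items, the best item's empirical mean can never fall 2·eps below the maximum, which is why it survives each round of elimination.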

1.1. RELATED WORKS

To the best of our knowledge, we provide the first rigorous online matrix completion algorithms. However, there are several closely related results/techniques in the literature, which we briefly survey below. A very similar setting was considered in Sen et al. (2017), where the authors studied a multi-armed bandit problem with L contexts and K arms with context-dependent reward distributions. The authors assumed that the L × K reward matrix is low-rank and can be factorized into non-negative components, which allowed them to use recovery guarantees from non-negative matrix factorization. Moreover, the authors only gave ETC-style algorithms, resulting in T^{2/3} regret guarantees. Our techniques improve upon the guarantees in Sen et al. (2017) in two ways: 1) we remove the assumption that the low-rank components are non-negative, since we use matrix completion with entry-wise error guarantees; 2) the dependence on T improves from T^{2/3} to T^{1/2} when the reward matrix P is rank-1. Multi-dimensional online decision-making problems, namely stochastic rank-1 matrix bandits, were introduced in Katariya et al. (2017b;a); Trinh et al. (2020). In their settings, at each round t ∈ [T] the learning agent chooses one row and one column and observes a reward corresponding to an entry of a rank-1 matrix. Here, the regret is defined in terms of the best (row, column) pair, which corresponds to the best arm. This setting was extended to the rank-r setting (Kveton et al., 2017), rank-1 multi-dimensional tensors (Hao et al., 2020), bilinear bandits (Jun et al., 2019; Huang et al., 2021) and generalized linear bandits (Lu et al., 2021). Although these papers provide tight regret guarantees, their results cannot be translated to our problem. This is because we solve a significantly different problem with an underlying rank-1 reward matrix P, where we need to minimize the regret for all users (rows of P) jointly.
Hence, it is essential to find the entries (columns) of P with large rewards for each user (row) of P; contrast this with the multi-dimensional online learning problem, where it suffices to infer only the single entry ((row, column) pair) of the matrix/tensor with the highest reward. Since the rewards for each user have different gaps, the analysis of our OCTAL algorithm becomes involved. Finally, Dadkhahi & Negahban (2018); Zhou et al. (2020; 2021) studied closely related online collaborative filtering settings. In particular, these papers were the first to motivate and theoretically analyze the collaborative framework with the restriction that the same item cannot be recommended more than once to the same user. There, a significantly stricter cluster structure is assumed over users, where users in the same cluster have similar preferences. Such models are restrictive, as they provide theoretical guarantees only for a very relaxed notion of regret (termed pseudo-regret). In the past decade, several papers have studied the problem of offline low-rank matrix completion on its own (Mazumder et al., 2010; Negahban & Wainwright, 2012; Chen et al., 2019; Deshpande & Montanari, 2012; Abbe et al., 2020; Jain et al., 2013; Jain & Kar, 2017) and also in the presence of side information such as social graphs or similarity graphs (Xu et al., 2013; Ahn et al., 2018; 2021; Elmahdy et al., 2020; Jo & Lee, 2021; Zhang et al., 2022). Some of these results, namely the ones that provide ‖·‖_∞ norm guarantees on the estimated matrix, can be adapted into Explore-Then-Commit (ETC) style algorithms (see Sec. 4). Finally, there is a significant amount of related theoretical work on online non-convex learning (Suggala & Netrapalli, 2020; Yang et al., 2018; Huang et al., 2020) and empirical work on online collaborative filtering (Huang et al., 2020; Lu et al., 2013; Zhang et al., 2015), but these do not study regret in the online matrix completion setting.

2. PROBLEM DEFINITION

Notations: We write [m] to denote the set {1, 2, …, m}. For a vector v ∈ ℝ^m, v_i denotes its i-th element; for any set U ⊆ [m], v_U denotes the vector v restricted to the indices in U. A_i denotes the i-th row of a matrix A, and A_{ij} denotes its (i, j)-th element. For any sets U ⊂ [m], V ⊂ [n], A_{U,V} denotes the matrix A restricted to the rows in U and columns in V. Also, let ‖A‖_{2→∞} be the maximum ℓ₂ norm of the rows of A, and ‖A‖_∞ the absolute value of the largest entry of A. We write E[X] to denote the expectation of a random variable X.

Consider a system with a set of M users and N items. Let P = UV^T ∈ ℝ^{M×N} be the unknown reward matrix of rank r < min(M, N), where U ∈ ℝ^{M×r} and V ∈ ℝ^{N×r} denote the latent embeddings corresponding to users and items respectively. In other words, P_{ij} ≜ ⟨u_i, v_j⟩, where u_i, v_j ∈ ℝ^r denote the r-dimensional embeddings of the i-th user and the j-th item, respectively. Often, we will also use the SVD decomposition P = Ū Σ V̄^T, where Ū ∈ ℝ^{M×r}, V̄ ∈ ℝ^{N×r} have orthonormal columns, i.e., Ū^T Ū = I and V̄^T V̄ = I, and Σ ≜ diag(σ₁, σ₂, …, σ_r) ∈ ℝ^{r×r} is a diagonal matrix. We denote the condition number of the matrix P by κ ≜ (max_i σ_i)(min_i σ_i)⁻¹.

Consider a system that recommends one item to every user in each round t ∈ [T]. Let R^{(t)}_{uρ_u(t)} be the reward for recommending item ρ_u(t) ∈ [N] to user u. Also, let:

R^{(t)}_{uρ_u(t)} = P_{uρ_u(t)} + E^{(t)}_{uρ_u(t)},    (1)

where {E^{(t)}_{uρ_u(t)}}_{u∈[M], t∈[T]} are assumed to be i.i.d. zero-mean sub-Gaussian random variables with variance proxy σ². That is, E[E^{(t)}_{uρ_u(t)}] = 0 and E[exp(s E^{(t)}_{uρ_u(t)})] ≤ exp(σ² s² / 2) for all u ∈ [M], t ∈ [T]. The goal is to minimize the expected regret, where the expectation is over the randomness in the rewards and in the algorithm:

Reg(T) ≜ (T/M) Σ_{u∈[M]} max_{j∈[N]} P_{uj} − E[ Σ_{t∈[T]} (1/M) Σ_{u∈[M]} R^{(t)}_{uρ_u(t)} ].
In this problem, the interesting regime is N, M ≫ T, as is often the case for most practical recommendation systems. Here, treating each user separately leads to vacuous regret bounds, since each item would need to be observed at least once by each user to find that user's best item. However, the low-rank structure of the rewards can help share information about items across users. Remark 1. If T ≥ N, then we can treat each user as a separate multi-armed bandit problem. In that case, in our setting, the well-studied Upper Confidence Bound (UCB) algorithm achieves an expected regret of at most O(√(NT log T)) (Theorem 2.1 in Bubeck & Cesa-Bianchi (2012)).
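As a concrete illustration of this regret notion, the following sketch evaluates Reg(T) for a synthetic low-rank P; all names and dimensions here are our own illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T, r, sigma = 50, 200, 100, 2, 0.1

# Synthetic rank-r reward matrix P = U V^T.
U = rng.normal(size=(M, r)) / np.sqrt(r)
V = rng.normal(size=(N, r)) / np.sqrt(r)
P = U @ V.T

def regret(P, recommendations):
    """Reg(T) = (T/M) * sum_u max_j P_uj - sum_t (1/M) sum_u P_{u, rho_u(t)}.

    `recommendations` has shape (T, M): the item recommended to each user in
    each round. The noise is zero-mean, so the expected reward of a
    recommendation equals its entry of P.
    """
    n_rounds, n_users = recommendations.shape
    oracle = n_rounds * P.max(axis=1).mean()
    earned = sum(P[np.arange(n_users), recommendations[t]].mean()
                 for t in range(n_rounds))
    return oracle - earned

# A policy recommending uniformly random items incurs positive regret ...
rand_policy = rng.integers(0, N, size=(T, M))
# ... while the oracle policy (best item per user) incurs zero regret.
best_policy = np.tile(P.argmax(axis=1), (T, 1))
assert regret(P, best_policy) < 1e-9
assert regret(P, rand_policy) > 0
```

The gap between the two policies is exactly the quantity that the algorithms in this paper aim to keep sub-linear in T.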

3. PRELIMINARIES

Let us introduce a different observation model from (1). Consider an unknown rank-r matrix P ∈ ℝ^{M×N}. For each entry i ∈ [M], j ∈ [N], we observe:

P_{ij} + E_{ij} with probability p, and 0 with probability 1 − p,    (4)

where the E_{ij} are independent zero-mean sub-Gaussian random variables with variance proxy σ² > 0. We now introduce the following result from Chen et al. (2019) (Lemma 1), together with the estimation subroutine, Algorithm 1, that builds on it. The core steps of Algorithm 1 are:

2: For all (i, j) ∈ Ω, set Mask_{ij} = 0.
4: for each user u ∈ U in round t = (ℓ − 1)b + ℓ′ do
5: Recommend an item ρ_u(t) ∈ {j ∈ V | (u, j) ∈ Ω, Mask_{uj} = 0} and set Mask_{uρ_u(t)} = 1. If this is not possible, recommend any item ρ_u(t) ∈ V such that (u, ρ_u(t)) ∉ Ω. Observe R^{(t)}_{uρ_u(t)}.
6: end for
7: end for
8: end for
9: For each (u, j) ∈ Ω, compute Z_{uj} as the average of the ⌊m/b⌋ observations corresponding to user u being recommended item j, i.e., Z_{uj} = avg{R^{(t)}_{uj}}.
10: Set k = ⌈|V|/|U|⌉ and partition V into V^{(1)}, V^{(2)}, …, V^{(k)}, with V^{(q)} = {i ∈ V | ζ_i = q} for each q ∈ [k]. Set Ω^{(q)} = Ω ∩ (U × V^{(q)}) for all q ∈ [k]. # If |U| ≥ |V|, we partition the indices in U instead.
11: for q ∈ [k] do
12: Solve the convex program min_{Q^{(q)} ∈ ℝ^{|U|×|V^{(q)}|}} (1/2) Σ_{(i,j)∈Ω^{(q)}} (Q^{(q)}_{iπ(j)} − Z_{ij})² + λ‖Q^{(q)}‖_*, where ‖Q^{(q)}‖_* denotes the nuclear norm of the matrix Q^{(q)} and π(j) is the index of j in the set V^{(q)}.
13: end for
14: Return Q̃ ∈ ℝ^{M×N} such that Q̃_{U,V^{(q)}} = Q^{(q)} for all q ∈ [k], and Q̃_{ij} = 0 for every (i, j) ∉ U × V.

Note that there are several difficulties in using Lemma 1 directly in our setting, which are discussed below.

Remark 2 (Matrix Completion for Rectangular Matrices). Lemma 1 is stated for square matrices, and the trivial way to apply it to a rectangular matrix with M rows and N columns (say N ≥ M), namely appending N − M zero rows, leads to an undesirable (N/M)^{1/2} factor in the error bound (Lemma 5); the (N/M)^{1/2} factor does not arise if we care about the spectral/Frobenius norm instead of the L_∞ norm. One way to resolve the issue is to partition the columns into N/M groups by assigning each column to one of the groups uniformly at random. This creates N/M matrices that are almost square, and applying Lemma 1 to each recovers an estimate of the entire matrix that is accurate in L_∞ norm up to the desired accuracy, without the undesirable (N/M)^{1/2} factor (Lemma 6 and Steps 10-12 in Algorithm 1).

Remark 3 (Observation Models). The observation model in equation 4 is significantly different from equation 1. In the former, a noisy version of each element of P is observed independently with probability p, while in the latter, in each round t ∈ [T] and for each user u ∈ [M], we observe a noisy version of a chosen element ρ_u(t). Our approach to resolve this discrepancy theoretically is to first sample a set Ω of indices according to equation 4 and subsequently use equation 1 to observe the indices in Ω (see Steps 3-6 in Algorithm 1 and Corollary 1). Of course, this implies obtaining observations corresponding to indices in a super-set of Ω (see Step 5 in Algorithm 1) and only using the observations in Ω to obtain an estimate of the underlying matrix. In practice, this is not necessary and we can use all the observed indices to obtain the estimate in Step 12 of Algorithm 1.

Remark 4 (Repetition and Median Tricks). The smallest error achievable via Lemma 1 is obtained by substituting p = 1, which gives ‖P̂ − P‖_∞ ≤ O((min_i σ_i)⁻¹ · σ√(µd log d) · ‖P‖_∞), with failure probability polynomially small in the dimension d; however, this is insufficient when d is not large enough. Two simple tricks resolve this issue: 1) we can obtain repeated observations of the same entry of the reward matrix and take their average; s repetitions bring the noise variance down to σ²/s; 2) we can use the median trick, where we obtain several independent estimates of the reward matrix and compute their element-wise median to boost the success probability (see proof of Lemma 2).
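The column-partitioning trick of Remark 2 (Steps 10-12 of Algorithm 1) is easy to sanity-check in code; the sketch below uses our own toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 40, 400  # wide matrix: N >> M

# Randomly assign each of the N columns to one of ceil(N/M) groups, giving
# nearly square M x (~M) sub-matrices, as in Steps 10-12 of Algorithm 1.
k = int(np.ceil(N / M))
group = rng.integers(0, k, size=N)            # zeta_i in the paper's notation
blocks = [np.flatnonzero(group == q) for q in range(k)]

# Every column lands in exactly one block, so completing each block with a
# square-matrix routine covers the whole M x N matrix, and each block has
# about M columns in expectation.
assert sorted(np.concatenate(blocks).tolist()) == list(range(N))
assert sum(len(b) for b in blocks) == N
```

Each block can then be fed to the square-matrix completion routine of Lemma 1, avoiding the (N/M)^{1/2} blow-up of the zero-padding approach.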
We address all these issues (see Appendix B for detailed proofs) and arrive at the following lemma, whose error bound takes the form:

‖P − P̂‖_∞ ≤ O( (σ r / √(s d₂)) · √( µ³ log d₂ / p ) ).

Remark 5. Algorithm A repeats the following process O(log(MNδ⁻¹)) times in order to obtain the desired guarantee: 1) sample a subset of indices Ω ⊆ [M] × [N] such that every (i, j) ∈ [M] × [N] is inserted into Ω independently with probability p; 2) set b = max_{i∈[M]} |{j ∈ [N] | (i, j) ∈ Ω}| and invoke Algorithm 1.
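The repetition and median tricks of Remark 4 can be illustrated numerically; the constants below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, s, n_estimates = 1.0, 25, 9
true_value = 3.0

# Repetition trick: averaging s noisy observations of the same entry reduces
# the noise variance from sigma^2 to sigma^2 / s.
obs = true_value + sigma * rng.normal(size=(10000, s))
averaged = obs.mean(axis=1)
assert abs(averaged.var() - sigma**2 / s) < 0.01

# Median trick: the element-wise median of several independent estimates is
# accurate unless at least half of them fail simultaneously, which boosts the
# success probability exponentially in the number of estimates.
estimates = true_value + sigma / np.sqrt(s) * rng.normal(size=n_estimates)
median_estimate = np.median(estimates)
assert abs(median_estimate - true_value) < 1.0
```

In the paper's setting the "entry" is one index of the reward matrix and the "independent estimates" are the outputs of repeated invocations of Algorithm 1.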

4. EXPLORE-THEN-COMMIT (ETC) ALGORITHM

In this section, we present an Explore-Then-Commit (ETC) based algorithm for online low-rank matrix completion. The algorithm has two disjoint phases of exploration and exploitation. We first jointly explore the set of items for all users for a certain number of rounds and compute an estimate P̂ of the reward matrix P. Subsequently, we commit to the estimated best item for each user and sample its reward for the remaining rounds in the exploitation phase. Note that the exploration phase uses a matrix completion estimator to estimate the entire reward matrix P from few observed entries. Our regret guarantee in this framework is derived by carefully balancing the exploration phase length against the matrix estimation error (detailed proof provided in Appendix C).

Theorem 1. Consider the rank-r online matrix completion problem with M users, N items and T recommendation rounds. Set d₂ = min(M, N). Let R^{(t)}_{uρ_u(t)} be the reward in each round, defined as in equation 1. Suppose d₂ = Ω(µr log(rd₂)). Let P ∈ ℝ^{M×N} be the expected reward matrix satisfying the conditions stated in Lemma 2, and let σ² be the noise variance in the rewards. Then Algorithm 2, applied to the online rank-r matrix completion problem, guarantees the regret:

Reg(T) = O( ( T^{2/3} (σ² r² ‖P‖_∞)^{1/3} ( µ³ N log d₂ / d₂ )^{1/3} + N µ² κ² ‖P‖_∞ / d₂ ) · log⁵(MNT) + ‖P‖_∞ T⁻² ).

In Algorithm 2, the exploration phase is set up as follows: for each tuple of indices (i, j) ∈ [M] × [N], independently set δ_{ij} = 1 with probability p and δ_{ij} = 0 with probability 1 − p; denote Ω = {(i, j) ∈ [M] × [N] | δ_{ij} = 1} and b = max_{i∈[M]} |{j ∈ [N] | (i, j) ∈ Ω}|, the maximum number of sampled index tuples in a row, and set the total number of exploration rounds to bs. The k-th estimate is then computed as P̂^{(k)} = ESTIMATE([M], [N], bs, Ω, b, λ).

In the regime N ≫ T and M > N, the regret scales only logarithmically in M and N. This is intuitively satisfying: in each round we obtain M observations, so more users translate to more information, which in turn allows a better estimate of the underlying reward matrix. However, the dependence of the regret on T (namely T^{2/3}) is sub-optimal. In the subsequent section, we provide a novel algorithm that obtains regret guarantees scaling as T^{1/2} for rank-1 P.

Remark 8 (Gap-dependent bounds). Define the minimum gap Δ = min_{u∈[M]} (P_{uπ_u(1)} − P_{uπ_u(2)}), where π_u(1), π_u(2) correspond to the items with the highest and second-highest reward for user u, respectively. If the quantity Δ is known, then it is possible to design ETC algorithms where the length of the exploration phase is tuned accordingly, in order to obtain regret bounds that scale logarithmically with the number of rounds T.
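A minimal end-to-end sketch of the explore-then-commit loop is given below. It replaces the paper's nuclear-norm estimator (Algorithm 1) with a truncated SVD of the averaged observations, so it is an assumption-laden illustration rather than the actual ETC algorithm; all dimensions and constants are our own:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, T, r, sigma = 60, 120, 400, 1, 0.2

U = rng.normal(size=(M, r))
V = rng.normal(size=(N, r))
P = U @ V.T

# --- Explore: recommend random items and tabulate the noisy rewards. ---
T_explore = int(T ** (2 / 3))   # exploration length balancing the two phases
sums = np.zeros((M, N))
counts = np.zeros((M, N))
for _ in range(T_explore):
    items = rng.integers(0, N, size=M)
    rewards = P[np.arange(M), items] + sigma * rng.normal(size=M)
    sums[np.arange(M), items] += rewards
    counts[np.arange(M), items] += 1

# Low-rank denoising stand-in: truncated SVD of the zero-filled average matrix.
Z = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
u_svd, s_svd, vt_svd = np.linalg.svd(Z, full_matrices=False)
P_hat = (u_svd[:, :r] * s_svd[:r]) @ vt_svd[:r]

# --- Commit: recommend the estimated best item to each user from now on. ---
committed = P_hat.argmax(axis=1)
assert P_hat.shape == (M, N)
assert committed.shape == (M,)
```

The T^{2/3} exploration length mirrors the balancing act in Theorem 1: longer exploration improves the estimate P̂ but directly adds to the regret.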

5. OCTAL ALGORITHM

In this section we present our algorithm OCTAL (Algorithm 3) for online matrix completion when the reward matrix P has rank 1. The set of users is described by a latent vector u ∈ ℝ^M and the set of items is described by a latent vector v ∈ ℝ^N. Thus P = uv^T, with SVD decomposition P = σ₁ ū v̄^T.

Algorithm Overview: Our first key observation is that, since P is rank-one, we can partition the set of users into two disjoint clusters C₁, C₂, where C₁ ≡ {i ∈ [M] | u_i ≥ 0} and C₂ ≡ [M] \ C₁. Clearly, for all users u ∈ C₁, the item that yields maximum reward is j_max = argmax_{t∈[N]} v_t. On the other hand, for all users u ∈ C₂, the item that yields maximum reward is j_min = argmin_{t∈[N]} v_t. Thus, if we can identify C₁, C₂ and estimate the items with high reward (identical for users in the same cluster) using few recommendations per user, we can ensure low regret. But initially C₁, C₂ are unknown, so all users are unlabelled, i.e., their cluster is unknown. In each phase (the outer loop indexed by ℓ), Algorithm 3 tries to correctly label at least a few unlabelled users. This is achieved by progressively refining an estimate Q̃ of the reward matrix P restricted to the unlabelled users and all items (Step 12). Subsequently, unlabelled users for which the difference between the maximum and minimum reward (inferred from the estimated reward matrix) is large are labelled (Step 19). At the same time, in Step 13, users labelled in previous phases are partitioned into two clusters (denoted M^{(ℓ,1)} and M^{(ℓ,2)}), and for each of them the algorithm refines an estimate of a distinct sub-matrix of the reward matrix P by recommending items only from a refined set (N^{(ℓ,1)} and N^{(ℓ,2)} respectively) containing the best item (j_max or j_min). We also identify a small set of good items, corresponding to large estimated rewards, for each labelled user (including users labelled in previous phases).

We partition all these users into two clusters (M^{(ℓ+1,1)} and M^{(ℓ+1,2)}) such that the sets of good items for users in different clusters are disjoint. We can prove that such a partitioning is possible: users in the same cluster have the same sign of user embedding.
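The sign-based clustering underlying OCTAL can be checked directly on a synthetic rank-1 matrix (u and v below are our own test vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 30, 50

# Rank-1 reward matrix P = u v^T.
u = rng.normal(size=M)
v = rng.normal(size=N)
P = np.outer(u, v)

# Cluster users by the sign of their latent coordinate u_i.
C1 = np.flatnonzero(u >= 0)
C2 = np.flatnonzero(u < 0)

# Users in C1 share the best item argmax_j v_j; users in C2 share argmin_j v_j.
j_max, j_min = int(v.argmax()), int(v.argmin())
assert all(P[i].argmax() == j_max for i in C1)
assert all(P[i].argmax() == j_min for i in C2)
```

Within a cluster the relative ranking of all items is identical, which is exactly the structure OCTAL exploits when it eliminates items jointly for all users in a cluster.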




Lemma 1 (Theorem 1 in Chen et al. (2019)). Let a rank r = O(1) matrix P ∈ ℝ^{d×d} with SVD decomposition P = Ū Σ V̄^T satisfy ‖Ū‖_{2,∞} ≤ √(µr/d) and ‖V̄‖_{2,∞} ≤ √(µr/d), and suppose we observe noisy entries of P according to the observation model in (4). Then, with probability exceeding 1 − O(d⁻³), we can compute a matrix P̂ by using Algorithm 4 (Appendix B) with parameters (U = [M], V = [N], σ², r, p) satisfying the entry-wise error bound in equation 5.

Algorithm 1 Require: set of users U ⊆ [M], set of items V ⊆ [N], total rounds m, set of indices Ω ⊆ U × V, rounds in each iteration b = max_{u∈U} |{v ∈ V | (u, v) ∈ Ω}|, regularization parameter λ. The index of round t is relative to the first round in which the algorithm is invoked; hence t = 1, 2, …, m.
1: for ℓ = 1, 2, …, ⌈m/b⌉ do

9: For each (u, j) ∈ Ω, compute Z_{uj} as the average of the observations {R^{(t)}_{uj} | t ∈ [m], ρ_u(t) = j}. Discard all other observations, corresponding to indices not in Ω.
10: Without loss of generality, assume |U| ≤ |V|. For each i ∈ V, independently set ζ_i to a value in the set [⌈|V|/|U|⌉] uniformly at random. Partition the indices in V accordingly.

Lemma 2. Let a rank r = O(1) reward matrix P ∈ ℝ^{M×N} with SVD decomposition P = Ū Σ V̄^T satisfy ‖Ū‖_{2,∞} ≤ √(µr/M), ‖V̄‖_{2,∞} ≤ √(µr/N) and condition number κ = O(1). Let d₁ = max(M, N) and d₂ = min(M, N), with d₂ = Ω(µr log(rd₂)) for a sufficiently large constant C > 0. Suppose we observe noisy entries of P according to the observation model in (1). For any positive integer s > 0 satisfying σ/√s = O(√(p d₂ / (µ³ log d₂)) · ‖P‖_∞), there exists an algorithm A with parameters s, p, δ that uses m = O( s log(MNδ⁻¹)(Np + √(Np log(Mδ⁻¹))) ) rounds to compute a matrix P̂ such that, with probability exceeding 1 − O(δ log(MNδ⁻¹)),

Algorithm A invokes Algorithm 1 with total rounds bs, number of rounds in each iteration b, set Ω, set of users [M], set of items [N], and regularization parameter λ = Cσ√(min(M, N) · p) for a suitable constant C > 0, in order to compute an estimate of P. The final estimate P̂ is computed by taking an entry-wise median of the individual estimates obtained as outputs from several invocations of Algorithm 1. Algorithm A is detailed in Algorithm 7 in Appendix B. Note that the total number of noisy observations made from the matrix P is m · M ≈ MN · p · s. Therefore, informally speaking, the average number of observations per index is p · s, which results in an error of Õ(σ/√(sp)), ignoring other terms (contrast this with the error Õ(σ/√p) in equation 5).

Remark 6 (Setting the parameters s, p, δ in Lemma 2). Lemma 2 has three input parameters, namely s ∈ ℤ, 0 ≤ p ≤ 1 and 0 ≤ δ ≤ 1. For any target (η, ν), our goal is to set s, p, δ as functions of the known quantities σ, r, µ, d₂ such that we can recover ‖P − P̂‖_∞ ≤ η with probability 1 − ν, while the conditions on λ and p are satisfied. From (17), we must have √(sp) = c σ r √(µ³ log d₂ / d₂) / η for some appropriate constant c > 0. If r = O(1) and η ≤ ‖P‖_∞, then an appropriate choice of c also satisfies the condition on √s.
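Concretely, equating the error bound of Lemma 2 with the target accuracy η gives the constraint on √(sp) used in Remark 6 (up to constants):

```latex
\left\|\mathbf{P}-\widehat{\mathbf{P}}\right\|_{\infty}
\;\le\; O\!\left(\frac{\sigma r}{\sqrt{s\,d_2}}\sqrt{\frac{\mu^{3}\log d_2}{p}}\right)
\;\le\; \eta
\quad\Longleftrightarrow\quad
\sqrt{sp}\;\ge\;\frac{c\,\sigma r}{\eta}\sqrt{\frac{\mu^{3}\log d_2}{d_2}} .
```

One can then trade off p (sampling density per pass) against s (number of repetitions) subject to this product constraint, which is exactly the freedom exploited by the repetition trick.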

Algorithm 2 ETC ALGORITHM
Require: users M, items N, rounds T, noise σ², rank r of P, upper bound ‖P‖_∞ on the magnitude of the expected rewards, number of estimates f = O(log(MNT)).
1: Set d₂ = min(M, N), v = (N‖P‖_∞ …), s = ⌈v p⁻¹⌉ and λ = Cσ√(d₂ p) for some constants C, C′ > 0.
2: for k = 1, 2, …, f do

Published as a conference paper at ICLR 2023

Algorithm 3 OCTAL (ONLINE COLLABORATIVE FILTERING USING ITERATIVE USER CLUSTERING)
Require: number of users M, items N, rounds T, noise σ², bound ‖P‖_∞ on the entry-wise magnitude of the expected rewards, incoherence µ.
1: Set M^{(1,1)} = M^{(1,2)} = ∅ and B^{(1)} = [M]. Set N^{(1,1)} = N^{(1,2)} = ∅. Set f = O(log(MNT)) and suitable constants a, c, C, C′, C″ > 0.
2: for ℓ = 1, 2, … do

6: Algorithm 1 is used to recommend items to every user for bs rounds.
9: Recommend argmax_{j∈[N]} P̂_{ij} to each user i ∈ [M]. # Number of remaining rounds is T − bsf.
10: end for

Remark 7 (Non-trivial regret bounds). Theorem 1 provides non-trivial regret guarantees in the key regime where N ≫ T and M > N.

4: for k = 1, 2, …, f do
5: for each pair of non-null sets (B^{(ℓ)}, [N]), (M^{(ℓ,1)}, N^{(ℓ,1)}), (M^{(ℓ,2)}, N^{(ℓ,2)}) do
8: Denote by (T^{(1)}, T^{(2)}) the considered pair of sets and by i ∈ {0, 1, 2} its index. For each tuple of indices (u, v) ∈ T^{(1)} × T^{(2)}, independently set δ_{uv} = 1 with probability p_{ℓ,i} and δ_{uv} = 0 with probability 1 − p_{ℓ,i}.
9: Denote Ω^{(i)} = {(u, v) ∈ T^{(1)} × T^{(2)} | δ_{uv} = 1}. Set the total number of rounds to m_{ℓ,i} = b_{ℓ,i} s_{ℓ,i}.
10: end for
11: Set m_ℓ = max_{i∈{0,1,2}} m_{ℓ,i}.
12: Compute Q̃. # Algorithm 1 is used to recommend items to every user in B^{(ℓ)} for m_ℓ rounds.

13: For i ∈ {1, 2}, compute an estimate of P restricted to M^{(ℓ,i)} × N^{(ℓ,i)}. # Algorithm 1 recommends items to every user in M^{(ℓ,i)} for m_ℓ rounds.

14: end for
17: Compute the good-item sets T_u = {…} for all u ∈ B^{(ℓ)} \ B^{(ℓ+1)}.
18: For i ∈ {1, 2}, for all users u ∈ M^{(ℓ,i)}, compute T_u = {…}.

19: Set v to be any user in [M] \ B^{(ℓ+1)}. Set M^{(ℓ+1,1)} = {u ∈ [M] \ B^{(ℓ+1)} | T_u ∩ T_v ≠ ∅} (and M^{(ℓ+1,2)} = ([M] \ B^{(ℓ+1)}) \ M^{(ℓ+1,1)}).

20: Compute N^{(ℓ+1,1)} = ∩_{u∈M^{(ℓ+1,1)}} T_u (and similarly N^{(ℓ+1,2)} from M^{(ℓ+1,2)}).
21: Merge the sets N^{(ℓ+1,i)} and M^{(ℓ+1,i)} as needed.
22: end for

We also prove that the set of good items contains the best item for each labelled user (j_max or j_min). So, after each phase, for each cluster of users, we compute the intersection of the good-item sets over all users in the cluster. This subset of items (the joint good items) must contain the best item for that cluster, and we can therefore discard the other items (Step 20). We can show that all items in the set of joint good items (N^{(ℓ+1,1)} and N^{(ℓ+1,2)}) have rewards close to the reward of the best item. Therefore, the algorithm suffers small regret if, for each group of labelled users, it recommends items from the set of joint good items (Step 13) in the next phase. We can further show that for the set of unlabelled users, the difference in reward between the best and worst items is small, and hence the regret for such users is small irrespective of the recommended item (Step 12). Note that until the number of labelled users is sufficiently large, we do not consider them separately (Step 21). A crucial part of our analysis is to show that for any subset of users and items considered in Step 5, the number of rounds sufficient to recover a good estimate of the expected reward sub-matrix is small, irrespective of the number of considered items (provided the number of users is sufficiently large).

Remark 9 (Practical considerations). In general, OCTAL (Alg. 3) is computationally faster than the ETC algorithm (Alg. 2) with a long exploration phase. This is because OCTAL eliminates large chunks of items in every phase and therefore solves easier optimization problems; ETC, on the other hand, must solve a low-rank matrix completion problem in MN variables, which becomes slower as the exploration length (i.e., the number of data points) grows. Moreover, OCTAL runs in phases, with the initial phases being very short; hence users do not have to wait a long time to get personalized recommendations, unlike in ETC. These features make OCTAL much more practical than ETC.

To summarize, in Algorithm 3 the entire set of rounds [T] is partitioned into phases of exponentially increasing length. In each phase, for the set of unlabelled users, we perform pure exploration and recommend random items from the set of all possible items (Step 12). The set of labelled users is partitioned into two clusters; for each, we follow a semi-exploration strategy where we recommend random items from the set of joint good items (Step 13). We now introduce the following definition: local incoherence of a vector v means that any sub-vector of v of significant size must itself be incoherent. Note that the local incoherence condition is trivially satisfied if the magnitude of every entry of the vector is bounded from below. We are now ready to state our main result.

Theorem 2. Consider the rank-1 online matrix completion problem with T rounds, M users such that M ≥ √T, and N items. Let R^{(t)}_{uρ_u(t)} be the reward in each round, defined as in equation 1. Let σ² be the noise variance in the rewards, and let P ∈ ℝ^{M×N} be the expected reward matrix with SVD decomposition P = σ₁ ū v̄^T. Then, by suitably choosing the parameters {λ_ℓ}_ℓ, positive integers {s^{(ℓ,0)}, s^{(ℓ,1)}, s^{(ℓ,2)}}_ℓ and 1 ≥ {p^{(ℓ,0)}, p^{(ℓ,1)}, p^{(ℓ,2)}}_ℓ ≥ 0 as described in Algorithm 3, we can ensure a regret guarantee of Reg(T) = O(√T ‖P‖_∞ + J √(TV)), where J and V are problem-dependent quantities.

Similar to Algorithm 2, Algorithm 3 yields non-trivial regret guarantees even when N ≫ T, provided the number of users is significantly large as well, i.e., M = Ω̃(N√T). Finally, we show that the above dependence on N, M, T matches the lower bound that we obtain by reduction to the well-known multi-armed bandit problem.

Theorem 3. Let P ∈ [0, 1]^{M×N} be a rank-1 reward matrix and let the noise variance be σ² = 1. Then any algorithm for the online matrix completion problem will suffer a regret of Ω(√(NT M⁻¹)).

6. CONCLUSIONS

We studied the problem of online rank-one matrix completion in the settings of repeated item recommendations and blocked item recommendations, which should be applicable to several practical recommendation systems. We analyzed an explore-then-commit (ETC) style method whose regret, averaged over users, is nearly independent of the number of items. That is, per user, we require only logarithmically many item recommendations to obtain a non-trivial regret bound. However, the dependence on the number of rounds T is sub-optimal. We improved this dependence by proposing OCTAL, which carefully combines exploration, exploitation and clustering for different users/items. Our method iteratively refines its estimate of the underlying reward matrix, while also identifying users for which certain items can be confidently recommended. Our algorithms and proof techniques are significantly different from the existing bandit learning literature. We believe that our work only scratches the surface of an important problem domain with several open problems. For example, Algorithm 3 requires a rank-1 reward matrix; generalizing the result to rank-r reward matrices would be interesting. Furthermore, relaxing the assumptions on the reward matrix and the noise model (e.g., stochastic or additive noise) should be relevant for several important settings. Finally, collaborative filtering can feed users related items, and hence might exacerbate their biases. Our method might actually help mitigate this bias due to its explicit exploration, but further investigation into such challenges is important.

