MULTI-USER REINFORCEMENT LEARNING WITH LOW RANK REWARDS

Abstract

We consider the problem of collaborative multi-user reinforcement learning. In this setting there are multiple users with the same state-action space and transition probabilities but with different rewards. Under the assumption that the reward matrix of the N users has a low-rank structure (a standard and practically successful assumption in the offline collaborative filtering setting), we design algorithms with significantly lower sample complexity than algorithms that learn the MDP individually for each user. Our main contribution is an algorithm which explores rewards collaboratively across the N user-specific MDPs and can learn rewards efficiently in two key settings: tabular MDPs and linear MDPs. When N is large and the rank is constant, the sample complexity per MDP depends logarithmically on the size of the state space, an exponential reduction (in the state-space size) compared to standard "non-collaborative" algorithms.

1. INTRODUCTION

Reinforcement learning has recently seen tremendous empirical and theoretical success Mnih et al. (2015); Sutton et al. (1992); Jin et al. (2020b); Gheshlaghi Azar et al. (2013); Dann & Brunskill (2015). Near-optimal algorithms have been proposed to explore and learn a given MDP with sample access to trajectories. In this work, we consider the problem of learning the optimal policies for multiple MDPs collaboratively, so that the number of trajectories sampled per MDP is smaller than the number required to learn each MDP individually. This combines reinforcement learning with collaborative filtering. We assume that the various users have the same transition matrices but different rewards, and that the rewards have a low-rank structure. The low-rank assumption is popular in the collaborative filtering literature and has been deployed successfully in a variety of tasks Bell & Koren (2007); Gleich & Lim (2011); Hsieh et al. (2012). This can be regarded as an instance of multi-task reinforcement learning, various versions of which have been considered in the literature Brunskill & Li (2013); D'Eramo et al. (2020); Teh et al. (2017); Hessel et al. (2019).

Motivation: Recently, collaborative filtering has been studied in the online learning setting: Bresler & Karzand (2021); Jain & Pal (2022). Here, multiple bandit instances are explored simultaneously under low-rank assumptions. In this work, we extend this setting to stateful models of such systems. In the context of e-commerce, this allows the algorithm to discover temporal patterns like 'the user bought a phone, so they might eventually be interested in a phone cover' or 'the user last bought shoes many years ago, which might be worn out by now, so recommend shoes'. Note that the fact that a user has bought shoes changes their preferences (and hence the reward function). Our setting allows one to model such changes.
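As a toy illustration of the setting described above (a shared transition kernel, user-specific rewards with a shared low-rank factorization), one might set up the problem as follows; all sizes and variable names here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S, A, r = 8, 5, 3, 2  # users, states, actions, reward rank (toy sizes)

# Shared transition kernel: P[s, a] is a distribution over next states,
# common to all N users.
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)

# User rewards share a rank-r structure: R = U @ V.T, with one row per
# user and one column per (state, action) pair.
U = rng.standard_normal((N, r))
V = rng.standard_normal((S * A, r))
R = U @ V.T  # N x (S*A) reward matrix of rank at most r
```

With N large and r constant, R has roughly N·r + |S||A|·r free parameters rather than N·|S||A|, which is the structure the collaborative algorithms exploit.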
While we assume that the users share the same transition matrix, this can be relaxed in practice by clustering users based on side information and modeling each cluster as having a common transition matrix. This approach has been successfully deployed in various multi-agent RL problems in practice, including in sensitive healthcare settings (see Mate et al. (2022) and references therein).

Our Contributions: a) Improved Sample Complexity: We introduce the setting of multi-user collaborative reinforcement learning for tabular and linear MDPs and provide sample-efficient algorithms for both scenarios without access to a generative model. Under the low-rank assumption, the total sample complexity required to learn near-optimal policies for every user scales as Õ(N + |S||A|) instead of O(N|S||A|) in the case of tabular MDPs, and as Õ(N + d) instead of O(Nd^2) in the case of linear MDPs. b) Collaborative Exploration: In order to learn the rewards of all users efficiently under low-rank assumptions, we need to deploy standard low-rank matrix estimation/completion algorithms, which require specific kinds of linear measurements (see Section 1.1). Without access to a generative model, the main challenge in this setting is to obtain these linear measurements by querying trajectories of carefully designed policies. We design such algorithms in Section 3. c) Functional Reward Maximization: In the case of linear MDPs, matrix completion is more challenging, since we observe measurements of the form e_i^⊺ Θ* ψ, where Θ* ∈ R^{N×d}, corresponding to the reward obtained by user i with respect to an embedding ψ. Estimating Θ* under low-rank assumptions requires the distribution of ψ to have certain isotropy properties (see Section 6). Querying such measurements goes beyond the usual reward maximization and is related to mean-field limits of multi-agent reinforcement learning, similar to the setting in Cammardella et al.
(2020), where a functional of the distribution of the states is optimized. We design a procedure which can sample-efficiently estimate policies that lead to these isotropic measurements (Section 5). d) Matrix Completion With Row-Wise Linear Measurements: For the linear MDP setting, the low-rank matrix estimation problem lies somewhere in between matrix completion (Recht, 2011; Jain et al., 2013) and matrix estimation with restricted strong convexity (Negahban et al., 2009). We give a novel active-learning-based algorithm where we estimate Θ* row by row without any assumptions such as incoherence. This is described in Section 6.

1.1 RELATED WORKS

Related Settings: Multi-task reinforcement learning has been studied empirically and theoretically Brunskill & Li (2013); Taylor & Stone (2009); D'Eramo et al. (2020); Teh et al. (2017); Hessel et al. (2019); Sodhani et al. (2021). Modi et al. (2017) consider learning a sequence of MDPs with side information, where the parameters of the MDPs vary smoothly with the context. Shah et al. (2020) introduced the setting where the optimal Q-function Q*(s, a), when represented as an S × A matrix, has low rank. With a generative model, they obtain algorithms which make use of this structure to achieve a smaller sample complexity whenever the discount factor is bounded by a constant. Sam et al. (2022) improve the results in this setting by considering additional assumptions such as low-rank transition kernels. Our setting differs in that we consider multiple users but do not assume access to a generative model; in fact, our main contribution is to efficiently obtain measurements conducive to matrix completion without a generative model. Hu et al. (2021) consider a multi-task RL problem with linear function approximation similar to our setting, but with the assumption of low-rank Bellman closure, under which the application of the Bellman operator retains the low-rank structure. They obtain a bound depending on the quantity N√d instead of (N + d) as in our work. Lei & Li (2019) consider multi-user RL with low-rank assumptions in an experimental context.

Low Rank Matrix Estimation: Low-rank matrix estimation has been studied extensively in the statistics and ML communities for decades, both in the context of supervised learning Candès & Tao (2010); Negahban & Wainwright (2011); Fazel (2002); Chen et al. (2019); Jain et al. (2013; 2017); Recht (2011); Chen et al. (2020); Chi et al. (2019) and in multi-user collaborative filtering settings. The basic question is to estimate a d_1 × d_2 matrix M given linear measurements (x_i^⊺ M y_i)_{i=1}^n when the number of samples is much smaller than d_1 × d_2, using the assumption that M has low rank. a) Matrix Completion: x_i and y_i are standard basis vectors. Typically, x_i and y_i are picked uniformly at random, and recovery guarantees are given whenever the matrix M is incoherent (Recht, 2011). b) Matrix Estimation: x_i and y_i are not restricted to be standard basis vectors. Typically, they are chosen i.i.d. such that restricted strong convexity holds (Negahban et al., 2009). In this work, we consider MDPs associated with N users such that their reward matrix has a low-rank structure. For tabular MDPs we use the matrix completion setting, and for linear MDPs our setting lies somewhere in between settings a) and b), as explained above.

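To make the measurement model concrete, the following sketch recovers a low-rank matrix from bilinear measurements x_i^⊺ M y_i by alternating least squares. This is a generic illustration under our own naming and sizing choices, not the specific algorithm analyzed in this paper:

```python
import numpy as np

def als_lowrank(xs, ys, b, d1, d2, r, n_iters=100, seed=0):
    """Alternating least squares for a rank-r matrix M from measurements
    b_i = x_i^T M y_i. Parametrize M = U V^T; each subproblem is linear."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((d1, r))
    V = rng.standard_normal((d2, r))
    for _ in range(n_iters):
        # Fix V: b_i = <U, outer(x_i, V^T y_i)> is linear in the entries of U.
        A = np.stack([np.outer(x, V.T @ y).ravel() for x, y in zip(xs, ys)])
        U = np.linalg.lstsq(A, b, rcond=None)[0].reshape(d1, r)
        # Fix U: b_i = <V, outer(y_i, U^T x_i)> is linear in the entries of V.
        A = np.stack([np.outer(y, U.T @ x).ravel() for x, y in zip(xs, ys)])
        V = np.linalg.lstsq(A, b, rcond=None)[0].reshape(d2, r)
    return U @ V.T
```

In the matrix completion setting a), the x_i and y_i would be standard basis vectors; in setting b), they are general (e.g. Gaussian) vectors. The same routine covers both, though the recovery guarantees differ as discussed above.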
1.2 NOTATION

By ∥·∥ we denote the Euclidean norm and by e_1, …, e_i, … the standard basis vectors of the space R^m for some m ∈ N. Let S^{d-1} := {x ∈ R^d : ∥x∥ = 1} and B_d(r) := {x ∈ R^d : ∥x∥ ≤ r}. For any m × n matrix A and a set Ω ⊆ [n], by A_Ω we denote the sub-matrix of A formed by the columns


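The column sub-matrix notation A_Ω can be illustrated with a small numpy snippet (the variable names here are ours, for illustration only):

```python
import numpy as np

A = np.arange(12).reshape(3, 4)  # a 3 x 4 matrix
Omega = [0, 2]                   # a subset of the column indices [n]
A_Omega = A[:, Omega]            # sub-matrix keeping only the columns in Omega
```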