MULTI-USER REINFORCEMENT LEARNING WITH LOW RANK REWARDS

Abstract

We consider the problem of collaborative multi-user reinforcement learning. In this setting there are multiple users with the same state-action space and transition probabilities but with different rewards. Under the assumption that the reward matrix of the N users has a low-rank structure (a standard and practically successful assumption in the offline collaborative filtering setting), we design algorithms with significantly lower sample complexity than those that learn the MDP individually for each user. Our main contribution is an algorithm which explores rewards collaboratively across the N user-specific MDPs and can learn the rewards efficiently in two key settings: tabular MDPs and linear MDPs. When N is large and the rank is constant, the sample complexity per MDP depends only logarithmically on the size of the state space, which represents an exponential reduction (in the state-space size) compared to standard "non-collaborative" algorithms.

1. INTRODUCTION

Reinforcement learning has recently seen tremendous empirical and theoretical success Mnih et al. (2015); Sutton et al. (1992); Jin et al. (2020b); Gheshlaghi Azar et al. (2013); Dann & Brunskill (2015). Near-optimal algorithms have been proposed to explore and learn a given MDP with sample access to trajectories. In this work, we consider the problem of learning the optimal policies for multiple MDPs collaboratively, so that the total number of trajectories sampled per MDP is smaller than the number of trajectories required to learn them individually. This combines reinforcement learning with collaborative filtering. We assume that the various users have the same transition matrices but different rewards, and that the rewards have a low-rank structure. The low-rank assumption is popular in the collaborative filtering literature and has been deployed successfully in a variety of tasks Bell & Koren (2007); Gleich & Lim (2011); Hsieh et al. (2012). This can be regarded as an instance of multi-task reinforcement learning, various versions of which have been considered in the literature Brunskill & Li (2013); D'Eramo et al. (2020); Teh et al. (2017); Hessel et al. (2019).

Motivation. Recently, collaborative filtering has been studied in the online learning setting: Bresler & Karzand (2021); Jain & Pal (2022). Here multiple bandit instances are simultaneously explored under low-rank assumptions. In this work, we extend this setting to stateful modeling of such systems. In the context of e-commerce, this allows the algorithm to discover temporal patterns like 'the user bought a phone and hence might eventually be interested in a phone cover' or 'the user last bought shoes many years ago, which might be worn out by now, therefore recommend shoes'. Note that the fact that a user has bought shoes changes their preferences (and hence their reward function); our setting allows one to model such changes. While we assume that the users share the same transition matrix, this can be relaxed in practice by clustering users based on side information and modeling each cluster as having a common transition matrix. This approach has been successfully deployed in various multi-agent RL problems in practice, including in sensitive healthcare settings (see Mate et al. (2022) and references therein).

Our Contributions. a) Improved Sample Complexity: We introduce the setting of multi-user collaborative reinforcement learning for tabular and linear MDPs and provide sample-efficient algorithms for both scenarios without access to a generative model. Under the low-rank assumption, the total sample complexity required to learn near-optimal policies for every user scales as Õ(N + |S||A|) instead of O(N|S||A|) in the case of tabular MDPs, and as Õ(N + d) instead of O(Nd²) in the case of linear MDPs.
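As a concrete illustration of the intuition behind this saving, the sketch below (our own, not from the paper; the sizes, rank, and variable names are illustrative) builds the N × |S||A| reward matrix as a product of two rank-r factors and checks via SVD that it is determined by r(N + |S||A|) parameters rather than N·|S||A| entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N users, |S| states, |A| actions, rank r.
N, num_states, num_actions, r = 100, 20, 5, 3
m = num_states * num_actions  # number of state-action pairs, |S||A|

# Low-rank reward model: user i's reward over state-action pairs is
# U[i] @ V.T, i.e. the N x m reward matrix R = U V^T has rank at most r.
U = rng.standard_normal((N, r))
V = rng.standard_normal((m, r))
R = U @ V.T

# The singular values confirm the numerical rank is r, so R is described
# by r * (N + m) numbers instead of N * m independent entries.
s = np.linalg.svd(R, compute_uv=False)
print("numerical rank:", int(np.sum(s > 1e-8 * s[0])))    # 3
print("low-rank params:", r * (N + m), "vs full:", N * m)  # 600 vs 10000

# Reconstruction from the top-r singular triplets is (numerically) exact.
Uh, sh, Vh = np.linalg.svd(R, full_matrices=False)
R_hat = (Uh[:, :r] * sh[:r]) @ Vh[:r]
print("reconstruction error:", np.linalg.norm(R - R_hat))
```

Matrix-completion-style arguments show such a matrix can be recovered from roughly r(N + m) (up to polylogarithmic factors) sampled entries, which is the source of the Õ(N + |S||A|) scaling rather than O(N|S||A|).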

