IMPROVED SAMPLE COMPLEXITY FOR REWARD-FREE REINFORCEMENT LEARNING UNDER LOW-RANK MDPS

Abstract

In reward-free reinforcement learning (RL), an agent first explores the environment without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper, we focus on reward-free RL under low-rank MDP models, in which both the representation and the linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity remains far from satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show that it can both find an ϵ-optimal policy and achieve ϵ-accurate system identification via reward-free exploration, with a sample complexity that significantly improves upon previous results. This sample complexity matches our lower bound in its dependence on ϵ, as well as on K in the large-d regime, where d and K denote the representation dimension and the action space cardinality, respectively. Finally, we provide a planning algorithm (without further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under this setting.
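For concreteness, the following is a minimal sketch of the transition models and learning goals referred to above. The notation (step index h, features φ*_h and measures μ*_h, learned policy π̂ and learned model P̂, value functions V_r) follows standard conventions in the low-rank MDP literature and is introduced here only for illustration; the exact definitions used in this paper may differ in details.

% Low-rank MDP: the transition kernel factorizes through an UNKNOWN
% d-dimensional representation and an unknown next-state measure.
\[
  P^*_h(s' \mid s, a) \;=\; \big\langle \phi^*_h(s,a),\, \mu^*_h(s') \big\rangle,
  \qquad \phi^*_h(s,a) \in \mathbb{R}^d .
\]
% A linear MDP is the special case in which the feature map \phi^*_h is known to the learner.

% Learning goals after reward-free exploration, for any reward function r
% revealed only in the planning phase:
\[
  \text{($\epsilon$-optimal policy)} \qquad
  V^{*}_{r} \;-\; V^{\hat{\pi}}_{r} \;\le\; \epsilon ,
\]
\[
  \text{($\epsilon$-accurate system identification, one common formalization)} \qquad
  \mathbb{E}_{(s_h,a_h)\sim \pi}\Big[ \big\| \hat{P}_h(\cdot \mid s_h, a_h) - P^*_h(\cdot \mid s_h, a_h) \big\|_{\mathrm{TV}} \Big] \;\le\; \epsilon
  \quad \text{for every policy } \pi \text{ and step } h .
\]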

1. INTRODUCTION

Reward-free reinforcement learning, recently formalized by Jin et al. (2020b), arises as a powerful framework to accommodate diverse demands in sequential learning applications. Under the reward-free RL framework, an agent first explores the environment without reward information during the exploration phase, with the objective of achieving certain learning goals later on for any given reward function during the planning phase. Such a learning goal can be to find an ϵ-optimal policy, to achieve ϵ-accurate system identification, etc. The reward-free RL paradigm may find broad application in many real-world engineering problems. For instance, reward-free exploration can be efficient when various reward functions are considered over a single environment, as in safe RL (Miryoosefi & Jin, 2021; Huang et al., 2022), multi-objective RL (Wu et al., 2021), multi-task RL (Agarwal et al., 2022; Cheng et al., 2022), etc. Theoretical studies of reward-free RL have largely focused on characterizing the sample complexity of achieving a learning goal under various MDP models. Specifically, reward-free tabular RL has been studied in Jin et al. (2020a); Ménard et al. (2021); Kaufmann et al. (2021); Zhang et al. (2020). For reward-free RL with function approximation, Wang et al. (2020) studied linear MDPs introduced by Jin et al. (2020b), where both the transition and the reward are linear functions of a given feature extractor, Zhang et al.

