IMPROVED SAMPLE COMPLEXITY FOR REWARD-FREE REINFORCEMENT LEARNING UNDER LOW-RANK MDPS

Abstract

In reward-free reinforcement learning (RL), an agent first explores the environment without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and the linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show that it can both find an ϵ-optimal policy and achieve an ϵ-accurate system identification via reward-free exploration, with a sample complexity that significantly improves upon previous results. This sample complexity matches our lower bound in the dependence on ϵ, as well as on K in the large-d regime, where d and K respectively denote the representation dimension and the action space cardinality. Finally, we provide a planning algorithm (requiring no further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under this setting.

1. INTRODUCTION

Reward-free reinforcement learning, recently formalized by Jin et al. (2020b), has arisen as a powerful framework to accommodate diverse demands in sequential learning applications. Under the reward-free RL framework, an agent first explores the environment without reward information during an exploration phase, with the objective of achieving certain learning goals later on, for any given reward function, during a planning phase. Such a learning goal can be to find an ϵ-optimal policy, to achieve an ϵ-accurate system identification, etc. The reward-free RL paradigm may find broad application in many real-world engineering problems. For instance, reward-free exploration can be efficient when various reward functions are taken into consideration over a single environment, as in safe RL (Miryoosefi & Jin, 2021; Huang et al., 2022), multi-objective RL (Wu et al., 2021), and multi-task RL (Agarwal et al., 2022; Cheng et al., 2022).

Studies of reward-free RL on the theoretical side have largely focused on characterizing the sample complexity of achieving a learning goal under various MDP models. Specifically, reward-free tabular RL has been studied in Jin et al. (2020a); Ménard et al. (2021); Kaufmann et al. (2021); Zhang et al. (2020). For reward-free RL with function approximation, Wang et al. (2020) studied linear MDPs introduced by Jin et al. (2020b), where both the transition and the reward are linear functions of a given feature extractor; Zhang et al. (2021b) studied linear mixture MDPs introduced by Ayoub et al. (2020); and Zanette et al. (2020b) considered a class of MDPs with low inherent Bellman error introduced by Zanette et al. (2020a).

In this paper, we focus on reward-free RL under low-rank MDPs, where the transition kernel admits a decomposition into two embedding functions that map into low-dimensional spaces. Compared with linear MDPs, the feature functions (i.e., the representation) under low-rank MDPs are unknown; hence the algorithm design further requires representation learning and becomes more challenging. Reward-free RL under low-rank MDPs was first studied by Agarwal et al. (2020), who introduced a provably efficient algorithm, FLAMBE, which achieves the learning goal of system identification with a sample complexity of Õ(H^22 K^9 d^7 / ϵ^10). Here d, H, and K respectively denote the representation dimension, episode horizon, and action space cardinality. Later on, Modi et al. (2021) proposed a model-free algorithm, MOFFLE, for reward-free RL under low-nonnegative-rank MDPs (where the feature functions are non-negative), for which the sample complexity of finding an ϵ-optimal policy scales as Õ(H^5 K^5 d_LV^3 / (ϵ^2 η)) (rescaled under the condition ∑_{h=1}^H r_h ≤ 1 for a fair comparison). Here, d_LV denotes the non-negative rank of the transition kernel, which may be exponentially larger than d as shown in Agarwal et al. (2020), and η denotes the positive reachability probability to all states, where 1/η can be as large as √d_LV as shown in Uehara et al. (2022b). Recently, a reward-free algorithm called RFOLIVE has been proposed for non-linear MDPs with low Bellman Eluder dimension (Chen et al., 2022b), which can be specialized to low-rank MDPs. However, RFOLIVE is computationally more costly and considers a special reward function class, making its complexity result not directly comparable to other studies on reward-free low-rank MDPs.

This paper investigates reward-free RL under low-rank MDPs to address the following important open questions:

• For low-rank MDPs, none of the previous studies establishes a lower bound on the sample complexity, i.e., a necessary sample requirement for finding a near-optimal policy.

• The sample complexity of the previous algorithms in Agarwal et al. (2020); Modi et al. (2021) for reward-free low-rank MDPs is polynomial in the involved parameters, but still much higher than desirable. It is vital to design improved algorithms that further reduce the sample complexity.

• Previous studies on low-rank MDPs did not provide an estimation accuracy guarantee for the learned representation (only for the transition kernels). However, such a representation learning guarantee can be very beneficial for reusing the learned representation in other RL environments.

1.1. MAIN CONTRIBUTIONS

We summarize our main contributions in this work below.

• Lower bound: We provide the first known lower bound Ω(HdK/ϵ^2) on the sample complexity that holds for any algorithm under the low-rank MDP setting. Our proof relies on a novel construction of hard MDP instances that captures the necessity of the action space cardinality in the sample complexity. Interestingly, comparing this lower bound for low-rank MDPs with the upper bound for linear MDPs in Wang et al. (2020) further implies that it is strictly more challenging to find a near-optimal policy under low-rank MDPs than under linear MDPs.

• Algorithm: We propose RAFFLE, a new model-based reward-free RL algorithm for low-rank MDPs. The central idea of RAFFLE lies in the construction of a novel exploration-driven reward, whose corresponding value function serves as an upper bound on the model estimation error. Such a pseudo-reward thus encourages exploration to collect samples from the parts of the state-action space where the model estimation error is large, so that later stages of the algorithm can further reduce this error based on those samples. Such a reward construction is new for low-rank MDPs and serves as the key reason for our improved sample complexity.

• Sample complexity: We show that our algorithm can both find an ϵ-optimal policy and achieve an ϵ-accurate system identification via reward-free exploration, with a sample complexity of Õ(H^3 d^2 K (d^2 + K) / ϵ^2), which matches our lower bound in the dependence on ϵ as well as on K in the large-d regime. Our result significantly improves the Õ(H^22 K^9 d^7 / ϵ^10) complexity of Agarwal et al. (2020) for achieving the same goal. Our result also improves the Õ(H^5 K^5 d_LV^3 / (ϵ^2 η)) complexity of Modi et al. (2021) in three aspects: the order of K is reduced; d can be exponentially smaller than d_LV as shown in Agarwal et al. (2020); and there is no dependence on η, where 1/η can be as large as √d_LV.
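To make the two-phase reward-free paradigm concrete, the following is a minimal toy sketch in Python. It is a schematic illustration only, not the RAFFLE algorithm: the toy MDP (`true_step`, and the constants S, K, H), the count-based exploration rule (greedy with respect to a visitation-count bonus, a hypothetical stand-in for the error-driven pseudo-reward discussed above), and the episode budget are all invented for the demo. The key structural point it shows is that exploration uses no reward, while planning handles any reward function without further environment interaction.

```python
from collections import defaultdict

# Toy deterministic tabular MDP (invented for illustration).
S, K, H = 4, 3, 5  # number of states, number of actions, horizon


def true_step(s, a):
    """Unknown environment: a fixed deterministic transition rule."""
    return (s * 7 + a * 3 + 1) % S


# ---- Phase 1: reward-free exploration (no reward used) ----
# The agent picks the least-visited action in each state, i.e., acts
# greedily w.r.t. a count-based exploration bonus, so it is driven
# toward state-action pairs whose model estimate is least certain.
counts = defaultdict(int)
model = {}  # learned deterministic transition model

for _ in range(200):  # exploration episodes
    s = 0
    for _h in range(H):
        a = min(range(K), key=lambda act: counts[(s, act)])
        s_next = true_step(s, a)
        counts[(s, a)] += 1
        model[(s, a)] = s_next
        s = s_next


# ---- Phase 2: planning for ANY given reward, with no new samples ----
def plan(reward):
    """Finite-horizon value iteration on the learned model."""
    V = [0.0] * S
    for _ in range(H):
        V = [max(reward(s, a) + V[model[(s, a)]] for a in range(K))
             for s in range(S)]
    return V[0]  # optimal value from the start state under the model
```

Because the exploration phase covers every state-action pair, the learned model is exact here, and `plan` can be invoked repeatedly with different reward functions (e.g., `plan(lambda s, a: 1.0 if s == 2 else 0.0)`) without ever touching the environment again, which is precisely the separation the reward-free framework formalizes.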

