TOWARDS MINIMAX OPTIMAL REWARD-FREE REINFORCEMENT LEARNING IN LINEAR MDPS

Abstract

We study reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, an agent first interacts with the environment without accessing the reward function in the exploration phase. In the subsequent planning phase, it is given a reward function and asked to output an $\epsilon$-optimal policy. We propose a novel algorithm, LSVI-RFE, for the linear MDP setting, where the transition probabilities and reward functions are linear in a feature mapping. We prove an $\tilde{O}(H^4 d^2 / \epsilon^2)$ sample complexity upper bound for LSVI-RFE, where $H$ is the episode length and $d$ is the feature dimension. We also establish a sample complexity lower bound of $\Omega(H^3 d^2 / \epsilon^2)$. To the best of our knowledge, LSVI-RFE is the first computationally efficient algorithm that achieves the minimax optimal sample complexity in the linear MDP setting, up to a factor of $H$ and logarithmic factors. LSVI-RFE is based on a novel variance-aware exploration mechanism that avoids the overly conservative exploration of prior works. Our sharp bound relies on decoupling the UCB bonuses of the two phases and on a Bernstein-type self-normalized bound, which remove the extra dependency of the sample complexity on $H$ and $d$, respectively.

1. INTRODUCTION

In reinforcement learning (RL), an agent tries to learn an optimal policy that maximizes the cumulative long-term reward by interacting with an unknown environment. Designing efficient exploration mechanisms is a critical task in RL algorithm design and is of great significance for improving the sample efficiency of RL, both theoretically Azar et al. (2017); Ménard et al. (2021) and empirically Schwarzer et al. (2020); Ye et al. (2021). In particular, efficient exploration of the environment is crucial in scenarios where reward signals are sparse and reward functions must be designed manually, e.g., Nair et al. (2018); Riedmiller et al. (2018), or in multi-task settings where RL agents must accomplish different goals in different stages, e.g., Hessel et al. (2019); Yang et al. (2020): it spares the agent from repeatedly re-learning the environment under different reward functions, which is sample-inefficient and can even be intractable. However, the theoretical understanding of exploration is still limited, especially for MDPs with large (or infinite) state or action spaces.

To understand the exploration mechanism in RL, reward-free exploration (RFE) was first proposed in Jin et al. (2020a) to explore the environment without reward signals. RFE contains two phases: exploration and planning. In the exploration phase, the agent interacts with the environment without accessing the reward function. In the subsequent planning phase, the agent is given a reward function and asked to output an $\epsilon$-optimal policy. RFE is significant in a host of reinforcement learning applications, e.g., multi-task RL Hessel et al. (2019); Yang et al. (2020), RL with sparse rewards Nair et al. (2018); Riedmiller et al. (2018), and systematic generalization of RL Jiang et al. (2019); Mutti et al. (2022). The minimax optimal sample complexity $O(H^3 S^2 A / \epsilon^2)$ of RFE is obtained in Ménard et al. (2021) for the tabular setting, where $S$ and $A$ are the sizes of the state and action spaces, respectively. However, this bound becomes intractable when the state and action spaces are large.
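The two-phase protocol above can be sketched in code. The following is an illustrative tabular sketch under our own naming (the class `RewardFreeAgent` and its methods are not from the paper or any RFE implementation): the agent collects transition counts without ever observing rewards, and only at planning time runs value iteration on the empirical model against a supplied reward function.

```python
import numpy as np

# Schematic two-phase reward-free RL protocol (illustrative only;
# all names here are ours, not from the paper).
class RewardFreeAgent:
    def __init__(self, num_states, num_actions, horizon):
        self.S, self.A, self.H = num_states, num_actions, horizon
        # Transition counts gathered during reward-free exploration.
        self.counts = np.zeros((horizon, num_states, num_actions, num_states))

    def explore(self, env, num_episodes):
        """Exploration phase: interact WITHOUT observing rewards."""
        for _ in range(num_episodes):
            s = env.reset()
            for h in range(self.H):
                a = np.random.randint(self.A)   # placeholder exploration policy
                s_next = env.step(a)            # note: no reward signal is used
                self.counts[h, s, a, s_next] += 1
                s = s_next

    def plan(self, reward):
        """Planning phase: given reward[h, s, a], output a greedy policy
        via value iteration on the empirical transition model."""
        P = self.counts / np.maximum(self.counts.sum(axis=-1, keepdims=True), 1)
        V = np.zeros(self.S)
        policy = np.zeros((self.H, self.S), dtype=int)
        for h in reversed(range(self.H)):
            Q = reward[h] + P[h] @ V            # (S, A) action values
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy
```

The same `counts` can be reused for any number of reward functions in the planning phase, which is exactly what makes the reward-free formulation attractive in multi-task settings.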
In this paper, we consider the RFE problem in the linear MDP setting, where the transition probability and the reward function are linear in a feature mapping $\phi(\cdot, \cdot)$. Despite recent progress, the following fundamental question still remains open: Does there exist a computationally efficient and minimax optimal algorithm for RFE in linear MDPs? We make constructive progress toward minimax optimality by proposing a computationally efficient algorithm, LSVI-RFE, which achieves the state-of-the-art sample complexity upper bound of $\tilde{O}(H^4 d^2 / \epsilon^2)$; this is minimax optimal up to a factor of $H$ and logarithmic factors. LSVI-RFE is based on a novel variance-aware exploration mechanism with weighted linear regression that avoids the overly conservative exploration of prior works. Accordingly, our sharp bound relies on decoupling the UCB bonuses of the two phases and on a Bernstein-type self-normalized bound, which remove the extra dependency of the sample complexity on $H$ and $d$, respectively. We summarize the main contributions of the paper below.
• We propose the LSVI-RFE algorithm for RFE in linear MDPs, based on a novel variance-aware exploration mechanism with weighted linear regression that avoids the overly conservative exploration of prior works. It also establishes the monotonicity of the constructed value functions between the two phases, providing new techniques for reward-free linear RL.
• LSVI-RFE achieves a sample complexity of $\tilde{O}(H^4 d^2 / \epsilon^2)$, which relies on decoupling the UCB bonuses of the two phases. In addition, a Bernstein-type self-normalized bound and the conservatism of elliptical potentials are utilized to reach the optimal dependency of the sample complexity on $d$, which is potentially a general tool for linear reward-free RL.
• We prove a sample complexity lower bound of $\Omega(H^3 d^2 / \epsilon^2)$, showing that LSVI-RFE is the first computationally efficient algorithm to achieve the minimax optimal sample complexity in linear MDPs, up to a factor of $H$ and logarithmic factors.
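As a rough numerical sketch of the variance-aware idea (this is not the paper's exact LSVI-RFE update; the function name, arguments, and constants are ours), the mechanism can be viewed as variance-weighted ridge regression paired with an elliptical UCB bonus $\beta \|\phi(s,a)\|_{\Lambda^{-1}}$: samples with higher estimated variance are down-weighted when forming the Gram matrix $\Lambda$, so the exploration bonus is less conservative where the data are reliable.

```python
import numpy as np

def weighted_ridge_bonus(Phi, y, sigma2, lam=1.0, beta=1.0):
    """Variance-weighted ridge regression with an elliptical UCB bonus.

    Phi    : (n, d) array of features phi(s_k, a_k)
    y      : (n,) regression targets
    sigma2 : (n,) estimated variances (per-sample weight is 1/sigma2)
    Returns the estimate w_hat and a bonus function
    b(phi) = beta * ||phi||_{Lambda^{-1}}.
    """
    n, d = Phi.shape
    w = 1.0 / sigma2                                     # per-sample weights
    Lam = lam * np.eye(d) + (Phi * w[:, None]).T @ Phi   # weighted Gram matrix
    w_hat = np.linalg.solve(Lam, Phi.T @ (w * y))        # weighted ridge solution
    Lam_inv = np.linalg.inv(Lam)

    def bonus(phi):
        # Elliptical bonus: beta * ||phi||_{Lambda^{-1}}
        return beta * np.sqrt(phi @ Lam_inv @ phi)

    return w_hat, bonus
```

With `sigma2` identically one this reduces to ordinary ridge regression with the standard unweighted bonus; the variance weighting is what distinguishes the Bernstein-type analysis from a Hoeffding-type one.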
Notations  Scalars are denoted by lowercase letters, and vectors/matrices by boldface letters. For a vector $x$ and a positive definite matrix $\Lambda$, denote $\|x\|_{\Lambda}^2 = x^\top \Lambda x$, and let $\|\cdot\|_F$ denote the Frobenius norm. Denote $\{1, \dots, n\}$ by $[n]$. We write $a_n = O(b_n)$ if there exists an absolute constant $c > 0$ such that $a_n \le c b_n$ for all $n \ge 1$, and $a_n = \Omega(b_n)$ for the reverse direction. $\tilde{O}(\cdot)$ further suppresses the polylogarithmic factors in $O(\cdot)$. $\lesssim$ denotes less than or equal up to constant factors.
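As a quick numerical illustration of the weighted-norm notation, the following minimal NumPy check of $\|x\|_{\Lambda}^2 = x^\top \Lambda x$ may help (the function name is ours, purely for illustration):

```python
import numpy as np

# Illustration of the weighted norm ||x||_Lambda^2 = x^T Lambda x
# for a positive definite matrix Lambda (function name is ours).
def weighted_sq_norm(x: np.ndarray, Lam: np.ndarray) -> float:
    """Return ||x||_Lambda^2 = x^T Lambda x."""
    return float(x @ Lam @ x)

x = np.array([1.0, 2.0])
Lam = np.array([[2.0, 0.0],
                [0.0, 3.0]])  # positive definite
print(weighted_sq_norm(x, Lam))  # 2*1^2 + 3*2^2 = 14.0
```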




2. RELATED WORK

There are a number of ways to parameterize an MDP linearly. The first sample-efficient algorithm is introduced by Jiang et al. (2017), where MDPs with low Bellman rank are considered. Subsequent works include Dann et al. (2018); Sun et al. (2019). Yang & Wang (2019) develops the first statistically and computationally efficient algorithm for linear MDPs with generative models, while Jin et al. (2020b) considers the online RL setting and proposes the LSVI-UCB algorithm. Concurrently, Zanette et al. (2020a) provides a Thompson-sampling-based algorithm. He et al. (2021); Wagenmaker et al. (2021) provide a gap-dependent regret bound and a first-order regret bound for linear MDPs, respectively. Subsequently, a minimax optimal algorithm is proposed in Hu et al. (2022). More works on RL with linear function approximation include Zanette et al. (2020b) for the low inherent Bellman error setting, Wang et al. (2020c) for the linear Q-function setting, and Wang et al. (2020b) for the bounded eluder dimension setting. Another popular linearly parameterized MDP is the linear mixture MDP, studied in Modi et al. (2020); Yang & Wang (2020); Jia et al. (2020); Ayoub et al. (2020); Cai et al. (2020); Zhou et al. (2021).

Reward-Free Exploration in Tabular MDPs  Reward-free exploration is studied in Jin et al. (2020a); Kaufmann et al. (2021); Ménard et al. (2021); Zhang et al. (2020). Jin et al. (2020a) achieves an $\tilde{O}(H^5 S^2 A / \epsilon^2)$ sample complexity in the tabular setting. Subsequently, the RF-UCRL algorithm in Kaufmann et al. (2021) improves this result by a factor of $H$. Ménard et al. (2021) modifies the upper confidence bound (UCB) bonus of UCRL and achieves a sample complexity of $\tilde{O}(H^3 S^2 A / \epsilon^2)$, matching the minimax lower bound provided in Jin et al. (2020a) up to logarithmic factors. Concurrently, the SSTP algorithm of Zhang et al. (2020) also achieves minimax optimality in the time-homogeneous setting. Besides, Wu et al. (2022) proposes the first gap-dependent bound.

Reward-Free Exploration in Linear MDPs  The linear MDP model is a formulation that provides linear function approximation for general MDP problems with large (or infinite) state and action spaces, and it has received much recent attention in RL studies, e.g., Jin et al. (2020b); Zhou et al. (2021). Existing works Wang et al. (2020a); Zanette et al. (2020c); Chen et al. (2021) also study RFE in linear MDPs. The best known sample complexity upper bound in this setting is $\tilde{O}(H^4 d^3 / \epsilon^2)$, obtained in Chen et al. (2021), and the best known lower bound is $\Omega(\max\{H^3 d, d^2\} / \epsilon^2)$, obtained by combining the results of Zhang et al. (2021) and Wagenmaker et al. (2022).

