UNIFIED ALGORITHMS FOR RL WITH DECISION-ESTIMATION COEFFICIENTS: NO-REGRET, PAC, AND REWARD-FREE LEARNING

Abstract

Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a governing complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL within the DEC framework. First, we propose two new DEC-type complexity measures: the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, respectively, thereby extending the original DEC, which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC that may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing sample-efficient learning results, and obtain new ones, for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy regret bounds similar to those of E2D-TA under structural conditions similar to the DEC.

1. INTRODUCTION

Reinforcement Learning (RL) has achieved immense success in modern artificial intelligence. As RL agents typically require an enormous number of samples to train in practice (Mnih et al., 2015; Silver et al., 2016), sample efficiency has been an important question in RL research. This question has been studied extensively in theory, with provably sample-efficient algorithms established for many concrete RL problems, starting with tabular Markov Decision Processes (MDPs) (Brafman & Tennenholtz, 2002; Azar et al., 2017; Agrawal & Jia, 2017; Jin et al., 2018; Dann et al., 2019; Zhang et al., 2020b), and later MDPs with various types of linear structure (Yang & Wang, 2019; Jin et al., 2020b; Zanette et al., 2020b; Ayoub et al., 2020; Zhou et al., 2021; Wang et al., 2021). Towards a more unified theory, a recent line of work seeks general structural conditions and unified algorithms that encompass as many known sample-efficient RL problems as possible. Many such structural conditions have been identified, such as Bellman rank (Jiang et al., 2017), Witness rank (Sun et al., 2019), Eluder dimension (Russo & Van Roy, 2013; Wang et al., 2020b), Bilinear Classes (Du et al., 2021), and Bellman-Eluder dimension (Jin et al., 2021). The recent work of Foster et al. (2021) proposes the Decision-Estimation Coefficient (DEC) as a quantitative complexity measure that governs the statistical complexity of model-based RL with a model class. Roughly speaking, the DEC measures the best trade-off, achievable by any distribution over policies, between exploration (gaining information) and exploitation (being a near-optimal policy itself) when the true model could be any model within the model class. Foster et al. (2021) establish regret lower bounds for online RL in terms of the DEC, and upper bounds in terms of (a variant of) the DEC and the model class capacity, showing that the DEC is necessary and (in the above sense) sufficient for online RL with low regret. This constitutes a significant step towards a unified understanding of sample-efficient RL.
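For concreteness, the exploration-exploitation trade-off above can be written out as follows. This is a simplified sketch of the DEC of Foster et al. (2021), with notation chosen here for illustration: $\Pi$ is the policy class, $f^M(\pi)$ the value of policy $\pi$ under model $M$, $\pi_M$ an optimal policy for $M$, and $\overline{M}$ a reference (estimated) model.

```latex
% Decision-Estimation Coefficient of a model class \mathcal{M}
% relative to a reference model \overline{M} (Foster et al., 2021; sketch):
% the learner picks a distribution p over policies, nature picks the
% worst-case model M; D_H denotes the Hellinger distance between the
% trajectory distributions induced by playing \pi under M and \overline{M}.
\operatorname{dec}_{\gamma}(\mathcal{M}, \overline{M})
  \,=\, \inf_{p \in \Delta(\Pi)} \, \sup_{M \in \mathcal{M}} \,
    \mathbb{E}_{\pi \sim p}\Big[
      \underbrace{f^{M}(\pi_M) - f^{M}(\pi)}_{\text{suboptimality (exploitation)}}
      \;-\; \gamma \cdot
      \underbrace{D_{\mathrm{H}}^2\big(M(\pi), \overline{M}(\pi)\big)}_{\text{information gained (exploration)}}
    \Big].
```

A small DEC means that for every candidate model, the learner can find a policy distribution that is either near-optimal or reveals a large discrepancy between the candidate and the reference model, with $\gamma$ controlling the exchange rate between the two.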

