UNIFIED ALGORITHMS FOR RL WITH DECISION-ESTIMATION COEFFICIENTS: NO-REGRET, PAC, AND REWARD-FREE LEARNING

Abstract

Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a governing complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL within the DEC framework. First, we propose two new DEC-type complexity measures: the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, respectively, thereby extending the original DEC, which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC that may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms, based on Posterior Sampling and Maximum Likelihood Estimation respectively, showing that they enjoy regret bounds similar to that of E2D-TA under structural conditions similar to the DEC.

1. INTRODUCTION

Reinforcement Learning (RL) has achieved immense success in modern artificial intelligence. As RL agents typically require an enormous number of samples to train in practice (Mnih et al., 2015; Silver et al., 2016), sample-efficiency has been an important question in RL research. This question has been studied extensively in theory, with provably sample-efficient algorithms established for many concrete RL problems, starting with tabular Markov Decision Processes (MDPs) (Brafman & Tennenholtz, 2002; Azar et al., 2017; Agrawal & Jia, 2017; Jin et al., 2018; Dann et al., 2019; Zhang et al., 2020b), and later MDPs with various types of linear structure (Yang & Wang, 2019; Jin et al., 2020b; Zanette et al., 2020b; Ayoub et al., 2020; Zhou et al., 2021; Wang et al., 2021). Towards a more unified theory, a recent line of work seeks general structural conditions and unified algorithms that encompass as many known sample-efficient RL problems as possible. Many such structural conditions have been identified, such as Bellman rank (Jiang et al., 2017), Witness rank (Sun et al., 2019), Eluder dimension (Russo & Van Roy, 2013; Wang et al., 2020b), Bilinear Class (Du et al., 2019), and Bellman-Eluder dimension (Jin et al., 2021). The recent work of Foster et al. (2021) proposes the Decision-Estimation Coefficient (DEC) as a quantitative complexity measure that governs the statistical complexity of model-based RL with a model class. Roughly speaking, the DEC measures the optimal trade-off, achieved by any policy, between exploration (gaining information) and exploitation (being a near-optimal policy itself) when the true model could be any model within the model class. Foster et al. (2021) establish regret lower bounds for online RL in terms of the DEC, and upper bounds in terms of (a variant of) the DEC and the model class capacity, showing that the DEC is necessary and (in the above sense) sufficient for online RL with low regret.
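As a point of reference, the DEC of a model class $\mathcal{M}$ relative to a reference model $\bar{M}$ can be written in the following form, following Foster et al. (2021); the notation here is a sketch, with $f^M(\pi)$ denoting the value of policy $\pi$ under model $M$, $\pi_M$ an optimal policy for $M$, and $D^2_{\mathrm{H}}$ the squared Hellinger distance:

```latex
\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})
  = \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
    \mathbb{E}_{\pi \sim p}\Big[
      \underbrace{f^{M}(\pi_{M}) - f^{M}(\pi)}_{\text{suboptimality under } M}
      \;-\; \gamma \, \underbrace{D^2_{\mathrm{H}}\big(M(\pi), \bar{M}(\pi)\big)}_{\text{information gained about } M}
    \Big].
```

The inner maximization captures the worst-case model consistent with the class, and the parameter $\gamma > 0$ trades off the suboptimality incurred against the estimation error revealed by playing $\pi$.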
This constitutes a significant step towards a unified understanding of sample-efficient RL. Despite this progress, several important questions remain open within the DEC framework. First, in Foster et al. (2021), regret upper bounds for low-DEC problems are achieved by the Estimation-To-Decisions (E2D) meta-algorithm, which requires a subroutine for online model estimation given past observations. However, their instantiations of this algorithm either (1) use a general improper[1] estimation subroutine that works black-box for any model class, but results in a regret bound that scales with a (potentially significantly) larger variant of the DEC that does not admit known polynomial bounds, or (2) require a proper estimation subroutine, which typically requires problem-specific design and is unclear how to construct for general model classes. These additional bottlenecks prevent their instantiations from serving as a unified sample-efficient algorithm for arbitrary low-DEC problems. Second, while the DEC captures the complexity of no-regret learning, there are alternative learning goals that are widely studied in the RL literature, such as PAC learning (Dann et al., 2017) and reward-free learning (Jin et al., 2020a), and it is unclear whether they can be characterized using a similar framework. Finally, several other optimistic model-based algorithms, such as Optimistic Posterior Sampling (Zhang, 2022; Agarwal & Zhang, 2022a) and Optimistic Maximum Likelihood Estimation (Mete et al., 2021; Liu et al., 2022a;b), have been proposed in recent work, whereas the E2D algorithm does not explicitly use optimism in its algorithm design. It is unclear whether E2D actually bears any similarities or connections to the aforementioned optimistic algorithms.

In this paper, we resolve the above open questions positively by developing new complexity measures and unified algorithms for RL with Decision-Estimation Coefficients. Our contributions can be summarized as follows.
• We design E2D-TA, the first unified algorithm that achieves low regret for any problem with bounded DEC and a low-capacity model class (Section 3). E2D-TA instantiates the E2D meta-algorithm with Tempered Aggregation, a general improper online estimation subroutine that achieves stronger guarantees than the variants used in existing work.
• We establish connections between E2D-TA and two existing model-based algorithms: Optimistic Model-Based Posterior Sampling, and Optimistic Maximum-Likelihood Estimation. We show that these two algorithms enjoy regret bounds similar to that of E2D-TA under structural conditions similar to the DEC (Appendix E).
• We extend the DEC framework to two new learning goals: PAC learning and reward-free learning. We define variants of the DEC, which we term the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC), and show that they give upper and lower bounds for sample-efficient learning in the two settings respectively (Section 4).
• We instantiate our results to give sample complexity guarantees for the broad problem class of RL with low-complexity Bellman representations. Our results recover many existing guarantees and yield new ones when specialized to concrete RL problems (Section 5).

Related work

Our work is closely related to the long lines of work on sample-efficient RL (both no-regret/PAC and reward-free), and on problems and algorithms in general interactive decision making. We review this related work in Appendix A due to space constraints.

RL as Decision Making with Structured Observations

We adopt the general framework of Decision Making with Structured Observations (DMSO) (Foster et al., 2021), which captures broad classes of problems such as bandits and reinforcement learning. In DMSO, the environment is described by a model $M = (P^M, R^M)$, where $P^M$ specifies the distribution of the observation $o \in \mathcal{O}$, and $R^M$ specifies the conditional means[2] of the reward vector $r \in [0,1]^H$, where $H$ is the horizon length. The learner interacts with a model using a policy $\pi \in \Pi$. Upon executing $\pi$ in $M$, they observe an (observation, reward) tuple $(o, r) \sim M(\pi)$ as follows:
1. The learner first observes an observation $o \sim P^M(\pi)$ (also denoted as $P^{M,\pi}(\cdot) \in \Delta(\mathcal{O})$).
2. Then, the learner receives a (random) reward vector $r = (r_h)_{h=1}^H$, with conditional mean $R^M(o) = (R^M_h(o))_{h=1}^H := \mathbb{E}_{r \sim R^M(\cdot \mid o)}[r] \in [0,1]^H$ and independent entries conditioned on $o$.
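The two-step interaction above can be sketched as a minimal Python loop. All names here (`sample_observation`, `mean_rewards`, `model_step`) are hypothetical stand-ins for illustration, not part of the DMSO formalism; the reward entries are drawn as Bernoulli variables with the prescribed conditional means, which is one valid choice consistent with means in $[0,1]$.

```python
import random

H = 3  # horizon length (illustrative value)

def sample_observation(pi):
    # Stand-in for o ~ P^M(pi); a real model's observation law depends on pi.
    return ("obs", pi)

def mean_rewards(o):
    # Stand-in for R^M(o) in [0,1]^H: the conditional mean of each reward entry given o.
    return [0.5] * H

def model_step(pi):
    """One interaction (o, r) ~ M(pi): observe o, then draw r with
    independent entries conditioned on o and means R^M(o)."""
    o = sample_observation(pi)
    means = mean_rewards(o)
    r = [1.0 if random.random() < m else 0.0 for m in means]  # Bernoulli(mean) per entry
    return o, r

o, r = model_step(pi="some_policy")
```

Note that only the conditional means $R^M(o)$ are specified by the model; the Bernoulli sampling above is just one reward distribution realizing those means.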



[1] Which outputs, in general, a distribution over models within the class rather than a single model.
[2] Note that $R^M$ (and thus $M$) only specifies the conditional mean rewards instead of the reward distributions.

