

Abstract

With the increasing need for handling large state and action spaces, general function approximation has become a key technique in reinforcement learning (RL). In this paper, we propose a general framework that unifies model-based and model-free RL, and an Admissible Bellman Characterization (ABC) class that subsumes nearly all Markov Decision Process (MDP) models in the literature for tractable RL. We propose a novel estimation function with decomposable structural properties for optimization-based exploration, and use the functional eluder dimension as a complexity measure of the ABC class. Under our framework, we propose a new sample-efficient algorithm, OPtimization-based ExploRation with Approximation (OPERA), which achieves regret bounds that match or improve over the best-known results for a variety of MDP models. In particular, for MDPs with low Witness rank, under a slightly stronger assumption, OPERA improves the state-of-the-art sample complexity results by a factor of dH. Our framework provides a generic interface for designing and analyzing new RL models and algorithms.

1. Introduction

Reinforcement learning (RL) is a decision-making process that seeks to maximize the expected reward when an agent interacts with the environment (Sutton & Barto, 2018). Over the past decade, RL has gained increasing attention due to its successes in a wide range of domains, including Atari games (Mnih et al., 2013), the game of Go (Silver et al., 2016), autonomous driving (Yurtsever et al., 2020), and robotics (Kober et al., 2013). Existing RL algorithms can be categorized into value-based algorithms such as Q-learning (Watkins, 1989) and policy-based algorithms such as policy gradient (Sutton et al., 1999). They can also be categorized as model-free approaches, where one directly models the value function classes, or model-based approaches, where one estimates the transition probability. Due to the intractably large state and action spaces used to model complex real-world environments, function approximation in RL has become prominent in both algorithm design and theoretical analysis. It is a pressing challenge to design sample-efficient RL algorithms with general function approximation. In the special case where the underlying Markov Decision Processes (MDPs) enjoy certain linear structures, several lines of works have achieved polynomial sample complexity and/or sublinear regret guarantees.
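As a minimal illustration of the model-free versus model-based dichotomy mentioned above, the following sketch contrasts a tabular Q-learning update (which estimates the value function directly) with estimating the transition probabilities and then planning under the learned model. This is not from the paper; the toy two-state MDP, constants, and variable names are all assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
# Ground-truth dynamics and rewards of a toy MDP (assumed, for demonstration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 1.0]])    # R[s, a]
gamma = 0.9

def step(s, a):
    """Sample one environment transition."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

# --- Model-free: tabular Q-learning estimates the value function directly. ---
Q = np.zeros((n_states, n_actions))
alpha = 0.1
s = 0
for _ in range(20000):
    a = rng.integers(n_actions)  # uniform exploration, for simplicity
    s_next, r = step(s, a)
    # Temporal-difference update toward the Bellman optimality target.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# --- Model-based: estimate the transition model, then plan against it. ---
counts = np.zeros((n_states, n_actions, n_states))
s = 0
for _ in range(20000):
    a = rng.integers(n_actions)
    s_next, _ = step(s, a)
    counts[s, a, s_next] += 1
    s = s_next
P_hat = counts / counts.sum(axis=2, keepdims=True)  # empirical P[s, a, s']

# Value iteration under the learned model P_hat.
V = np.zeros(n_states)
for _ in range(500):
    V = (R + gamma * P_hat @ V).max(axis=1)
```

Both routes recover (approximately) the same optimal behavior on this toy problem; the distinction that matters for the frameworks discussed below is which object (value function vs. transition model) is being approximated.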



¹ In this paper, we use FLAMBE to refer to both the algorithm and the low-rank MDP with unknown feature mappings.



Figure 1: Venn-diagram visualization of prevailing sample-efficient RL classes. As by far the richest concept, the DEC framework is both a necessary and sufficient condition for sample-efficient interactive learning. The BE dimension is a rich class that subsumes both low Bellman rank and low eluder dimension and covers almost all model-free RL classes. The generalized Bilinear Class captures model-based RL settings including KNRs, linear mixture MDPs, and low Witness rank MDPs, yet precludes some eluder-dimension-based models. Bellman Representability is another unified framework that subsumes the vanilla Bilinear Classes but fails to capture KNRs and low Witness rank MDPs. Our ABC class encloses both the generalized Bilinear Class and Bellman Representability and subsumes almost all known solvable MDP classes, with the exception of the Q* state-action aggregation and deterministic linear Q* MDP models, which neither the Bilinear Class nor our ABC class captures.

For instance, model-based algorithms for linear mixture MDPs rely on value-targeted regression and a Bernstein-type bonus. Other structural MDP models include block MDPs (Du et al., 2019) and FLAMBE (Agarwal et al., 2020b)¹, to mention a few. In a more general setting, however, there is still a gap between the plethora of MDP models and sample-efficient RL algorithms that can learn these models with function approximation. It remains open what constitutes the minimal structural assumptions that admit sample-efficient reinforcement learning. Several lines of work have pursued this question. Russo & Van Roy (2013) and Osband & Van Roy (2014) proposed a structural condition named the eluder dimension, and Wang et al. (2020) extended LSVI-UCB to general function classes with small eluder dimension. Another line of works proposed low-rank structural conditions, including Bellman rank (Jiang et al., 2017; Dong et al., 2020) and Witness rank (Sun et al., 2019). Recently, Jin et al.
(2021) proposed a complexity measure called the Bellman eluder (BE) dimension, which unifies low Bellman rank and low eluder dimension. Concurrently, Du et al. (2021) proposed Bilinear Classes, which can be applied to a variety of loss estimators beyond the vanilla Bellman error. Very recently, Foster et al. (2021) proposed the Decision-Estimation Coefficient (DEC), which is a necessary and sufficient condition for sample-efficient interactive learning. To apply the DEC to RL, they proposed an RL class named Bellman Representability, which can be viewed as a generalization of the Bilinear Class. Nevertheless, Sun et al. (2019) is limited to model-based RL, and Jin et al. (2021) is restricted to model-free RL. The only frameworks that unify both model-based and model-free RL are Du et al. (2021) and Foster et al. (2021), but their sample complexity results, when restricted to special MDP instances, do not always match the best-known results. In view of this gap, we aim to answer the following question:

Is there a unified framework that includes all model-free and model-based RL classes while maintaining sharp sample efficiency?

In this paper, we tackle this challenging question and give a nearly affirmative answer to it. We summarize our contributions as follows:

• We propose a general framework called Admissible Bellman Characterization (ABC) that covers a wide set of structural assumptions in both model-free and model-based RL, such as linear MDPs and linear mixture MDPs.

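For reference, the "vanilla Bellman error" that several of the frameworks above build on and generalize can be written as follows. The notation here is the standard one from the Bellman rank literature, not necessarily the exact symbols used later in this paper.

```latex
% Bellman error of a value-function candidate f at step h:
% Q_{h,f} is the candidate Q-function, r_h the reward, P_h the transition kernel.
\mathcal{E}_h(f)(s,a)
  \;=\; Q_{h,f}(s,a) \;-\; r_h(s,a)
  \;-\; \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}
        \Bigl[\max_{a'} Q_{h+1,f}(s',a')\Bigr].

% A low Bellman rank structure (Jiang et al., 2017) posits that the average
% Bellman error of f under the roll-in policy of another candidate g
% factorizes into a d-dimensional bilinear form:
\mathbb{E}_{(s,a) \sim \pi_g}\bigl[\mathcal{E}_h(f)(s,a)\bigr]
  \;=\; \bigl\langle W_h(f),\, X_h(g) \bigr\rangle,
  \qquad W_h(f),\, X_h(g) \in \mathbb{R}^{d}.
```

Frameworks such as Bilinear Classes and the ABC class of this paper replace the left-hand side with more general loss estimators while retaining a low-complexity decomposition of this kind.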

