

Abstract

With the increasing need to handle large state and action spaces, general function approximation has become a key technique in reinforcement learning (RL). In this paper, we propose a general framework that unifies model-based and model-free RL, together with an Admissible Bellman Characterization (ABC) class that subsumes nearly all Markov Decision Process (MDP) models in the literature for tractable RL. We propose a novel estimation function with decomposable structural properties for optimization-based exploration, and use the functional eluder dimension as a complexity measure of the ABC class. Under our framework, we propose a new sample-efficient algorithm, OPtimization-based ExploRation with Approximation (OPERA), which achieves regret bounds that match or improve over the best-known results for a variety of MDP models. In particular, for MDPs with low Witness rank, under a slightly stronger assumption, OPERA improves the state-of-the-art sample complexity by a factor of $dH$. Our framework provides a generic interface for designing and analyzing new RL models and algorithms.

1. Introduction

Reinforcement learning (RL) is a decision-making process that seeks to maximize the expected reward when an agent interacts with the environment (Sutton & Barto, 2018). Over the past decade, RL has gained increasing attention due to its successes in a wide range of domains, including Atari games (Mnih et al., 2013), the game of Go (Silver et al., 2016), autonomous driving (Yurtsever et al., 2020), robotics (Kober et al., 2013), etc. Existing RL algorithms can be categorized into value-based algorithms such as Q-learning (Watkins, 1989) and policy-based algorithms such as policy gradient (Sutton et al., 1999). They can also be categorized as model-free approaches, where one directly models the value function classes, or model-based approaches, where one estimates the transition probability. Due to the intractably large state and action spaces used to model complex real-world environments, function approximation in RL has become prominent in both algorithm design and theoretical analysis, and it is a pressing challenge to design sample-efficient RL algorithms with general function approximation.

In the special case where the underlying Markov Decision Processes (MDPs) enjoy certain linear structures, several lines of work have achieved polynomial sample complexity and/or $\sqrt{T}$ regret guarantees under either model-free or model-based RL settings. For linear MDPs, where the transition probability and the reward function admit a linear structure, Yang & Wang (2019) developed a variant of Q-learning when granted access to a generative model, Jin et al. (2020) proposed an LSVI-UCB algorithm with an $\tilde{O}(\sqrt{d^3H^3T})$ regret bound, and Zanette et al. (2020a) further extended the MDP model and improved the regret to $\tilde{O}(dH\sqrt{T})$. Another line of work considers linear mixture MDPs (Yang & Wang, 2020; Modi et al., 2020; Jia et al., 2020; Zhou et al., 2021a), where the transition probability can be represented by a mixture of base models. In Zhou et al. (2021a), an $\tilde{O}(dH\sqrt{T})$ minimax optimal regret was achieved with weighted linear
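To make the linear-MDP approach above concrete, the following is a minimal sketch of the per-step estimate used by LSVI-UCB-style algorithms (Jin et al., 2020): a ridge regression on features $\phi(s,a)$ plus an elliptical confidence bonus. The feature dimension, regularizer, bonus coefficient, and the randomly generated dataset below are purely illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): one regression step of
# least-squares value iteration with a UCB bonus, as in LSVI-UCB for linear MDPs.
d, lam, beta = 4, 1.0, 1.0          # hypothetical feature dim, ridge param, bonus scale
rng = np.random.default_rng(0)

# Placeholder data for step h: features phi(s_i, a_i) and regression targets
# r_i + V_{h+1}(s'_i); in a real run these come from collected trajectories.
n = 50
Phi = rng.normal(size=(n, d))        # feature matrix
targets = rng.normal(size=n)         # reward-plus-next-value targets

# Ridge regression: w = (Phi^T Phi + lam I)^{-1} Phi^T targets
Lambda = Phi.T @ Phi + lam * np.eye(d)
w = np.linalg.solve(Lambda, Phi.T @ targets)

def q_ucb(phi):
    """Optimistic Q-value: linear estimate plus elliptical confidence bonus."""
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return w @ phi + bonus
```

The bonus term $\beta\sqrt{\phi^\top \Lambda^{-1}\phi}$ is what drives optimism-based exploration in this line of work; the regret bounds quoted above come from controlling how quickly these bonuses shrink as data accumulates.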

