OFFLINE REINFORCEMENT LEARNING WITH DIFFERENTIABLE FUNCTION APPROXIMATION IS PROVABLY EFFICIENT

Abstract

Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle and achieve better empirical performance. Despite these successes, a more systematic understanding of the statistical complexity of function approximation remains lacking. Towards bridging this gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. We show that offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide a theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style designs. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work could draw interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research.
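As context for the Fitted Q-Iteration style design mentioned above, one backward step of a pessimistic fitted Q-learning scheme can be sketched in generic notation as follows. This is an illustrative schematic, not the paper's exact algorithm: the symbols $\hat{f}_h$, $\bar{f}_h$, and the penalty $\Gamma_h$ are placeholder notation introduced here.

```latex
% Given offline data \{(s_h^i, a_h^i, r_h^i, s_{h+1}^i)\}_{i=1}^n from a
% horizon-H episodic MDP and a differentiable function class \mathcal{F},
% a pessimistic Fitted Q-Iteration step, for h = H, H-1, \dots, 1, is:
\[
  \hat{f}_h \in \operatorname*{argmin}_{f \in \mathcal{F}}
    \sum_{i=1}^{n} \Big( f(s_h^i, a_h^i) - r_h^i
      - \max_{a} \bar{f}_{h+1}(s_{h+1}^i, a) \Big)^2,
\]
\[
  \bar{f}_h(s, a) = \max\big\{ \hat{f}_h(s, a) - \Gamma_h(s, a),\, 0 \big\},
\]
% where \bar{f}_{H+1} \equiv 0 and \Gamma_h is an uncertainty quantifier
% that penalizes state-action pairs poorly covered by the data
% (the "pessimism" principle).
```

The squared-loss regression is the standard fitted Q-iteration backup; the subtraction of $\Gamma_h$ is what distinguishes the pessimistic variant used in offline settings.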

1. INTRODUCTION

Offline reinforcement learning (Lange et al., 2012; Levine et al., 2020) refers to the paradigm of learning a policy in sequential decision-making problems where only logged data, collected from an unknown environment (Markov Decision Process / MDP), are available. Inspired by the success of scalable supervised learning methods, modern reinforcement learning algorithms (e.g. Silver et al. (2017)) incorporate high-capacity function approximators to acquire generalization across large state-action spaces and have achieved excellent performance across a wide range of domains. For instance, there is a huge body of deep RL-based algorithms that tackle challenging problems such as the games of Go and chess (Silver et al., 2017; Schrittwieser et al., 2020), robotics (Gu et al., 2017; Levine et al., 2018), energy control (Degrave et al., 2022), and biology (Mahmud et al., 2018; Popova et al., 2018). Nevertheless, practitioners have also noticed that algorithms with general function approximators can be quite data inefficient, especially for deep neural networks, where the models may require millions of steps for tuning the large number of parameters they contain.¹ On the other hand, statistical analysis has been actively conducted to understand the sample/statistical efficiency of reinforcement learning with function approximation, and fruitful results have been achieved under the respective model representations (Munos, 2003; Chen and Jiang, 2019; Yang and Wang, 2019; Du et al., 2019; Sun et al., 2019; Modi et al., 2020; Jin et al., 2020b; Ayoub et al., 2020; Zanette et al., 2020; Jin et al., 2021a; Du et al., 2021; Jin et al., 2021b; Zhou et al., 2021a; Xie et al., 2021a; Min et al., 2021; Nguyen-Tang et al., 2022; Li et al., 2021; Zanette et al., 2021; Yin et al., 2022; Uehara et al., 2022; Cai et al., 2022). However, most works consider linear model approximators (e.g. linear (mixture) MDPs) or their variants. While the explicit linear



¹ See Arulkumaran et al. (2017) and the references therein for an overview.

