OFFLINE REINFORCEMENT LEARNING WITH DIFFERENTIABLE FUNCTION APPROXIMATION IS PROVABLY EFFICIENT

Abstract

Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle for better empirical performance. Despite the successes, a more systematic understanding of the statistical complexity of function approximation remains lacking. Towards bridging the gap, we take a step forward by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. We show that offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work draws interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research.

1. INTRODUCTION

Offline reinforcement learning (Lange et al., 2012; Levine et al., 2020) refers to the paradigm of learning a policy for sequential decision-making problems where only logged data are available, collected from an unknown environment (Markov Decision Process / MDP). Inspired by the success of scalable supervised learning methods, modern reinforcement learning algorithms (e.g. Silver et al. (2017)) incorporate high-capacity function approximators to acquire generalization across large state-action spaces and have achieved excellent performance across a wide range of domains. For instance, there is a huge body of deep RL-based algorithms that tackle challenging problems such as the games of Go and chess (Silver et al., 2017; Schrittwieser et al., 2020), robotics (Gu et al., 2017; Levine et al., 2018), energy control (Degrave et al., 2022) and biology (Mahmud et al., 2018; Popova et al., 2018). Nevertheless, practitioners have also noticed that algorithms with general function approximators can be quite data inefficient, especially for deep neural networks, where the models may require millions of steps to tune the large number of parameters they contain.¹ On the other hand, statistical analysis has been actively conducted to understand the sample/statistical efficiency of reinforcement learning with function approximation, and fruitful results have been achieved under the respective model representations (Munos, 2003; Chen and Jiang, 2019; Yang and Wang, 2019; Du et al., 2019; Sun et al., 2019; Modi et al., 2020; Jin et al., 2020b; Ayoub et al., 2020; Zanette et al., 2020; Jin et al., 2021a; Du et al., 2021; Jin et al., 2021b; Zhou et al., 2021a; Xie et al., 2021a; Min et al., 2021; Nguyen-Tang et al., 2022; Li et al., 2021; Zanette et al., 2021; Yin et al., 2022; Uehara et al., 2022; Cai et al., 2022). However, most works consider linear model approximators (e.g. linear (mixture) MDPs) or their variants.
While the explicit linear structures make the analysis tractable (linear problems are easier to analyze), they are unable to reveal the sample/statistical complexity behaviors of practical algorithms that apply powerful function approximators (which might have complex structures). In addition, there is an excellent line of works tackling provably efficient offline RL with general function approximation (e.g. Chen and Jiang (2019); Xie et al. (2021a); Zhan et al. (2022)). Due to the generic function approximation classes considered, those complexity bounds are usually expressed in the standard worst-case fashion Õ(V_max · √(1/n)), which lacks a characterization of individual instance behaviors. However, as mentioned in Zanette and Brunskill (2019), practical reinforcement learning algorithms often perform far better than what these problem-independent bounds would suggest. These observations motivate us to consider function approximation schemes that can help address the existing limitations. In particular, in this work we consider offline reinforcement learning with differentiable function class approximation. Its definition is given below.

Definition 1.1 (Parametric Differentiable Function Class). Let S, A be arbitrary state and action spaces, ϕ(·,·) : S × A → Ψ ⊂ R^m a feature map, and Θ ⊂ R^d a parameter space, where both Θ and Ψ are compact. Then the parametric function class (for a model f : R^d × R^m → R) is defined as

F := {f(θ, ϕ(·,·)) : S × A → R, θ ∈ Θ},

which satisfies the following differentiability/smoothness conditions: 1. for any ϕ ∈ R^m, f(θ, ϕ) is third-order differentiable with respect to θ; 2. f, ∂_θ f, ∂²_θθ f, ∂³_θθθ f are jointly continuous. Note that by the compactness of Θ, Ψ and continuity, there exist constants C_Θ, B_F, κ₁, κ₂, κ₃ > 0 that bound: ∥θ∥₂ ≤ C_Θ, |f(θ, ϕ(s, a))| ≤ B_F, ∥∇_θ f(θ, ϕ(s, a))∥₂ ≤ κ₁, ∥∇²_θθ f(θ, ϕ(s, a))∥₂ ≤ κ₂, ∥∇³_θθθ f(θ, ϕ(s, a))∥₂ ≤ κ₃ for all θ ∈ Θ and (s, a) ∈ S × A.² This function class was considered in Zhang et al. (2022a), and we adopt it here for the policy learning task.

Why consider the differentiable function class (Definition 1.1)? There are two main reasons why the differentiable function class is worth studying for reinforcement learning.
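To make the constants in the definition concrete, the following is a minimal numerical sketch of our own (all function and variable names are hypothetical, not from any referenced work). It instantiates the linear member f(θ, ϕ) = ⟨θ, ϕ⟩, for which the smoothness constants are explicit: ∇_θ f = ϕ gives κ₁ = sup ∥ϕ∥₂, while the second and third derivatives vanish, so κ₂ = κ₃ = 0.

```python
import numpy as np

def phi(s, a, m=4):
    """Toy feature map phi(s, a) in R^m (a stand-in for a learned representation).

    Seeded per (s, a) so the map is deterministic; normalized so ||phi||_2 = 1,
    hence kappa_1 = 1 for the linear model below.
    """
    rng = np.random.default_rng(hash((s, a)) % (2**32))
    v = rng.standard_normal(m)
    return v / np.linalg.norm(v)

def f(theta, feat):
    """Linear member of the class: f(theta, phi) = <theta, phi>."""
    return float(theta @ feat)

def grad_f(theta, feat):
    """grad_theta f(theta, phi) = phi, independent of theta (so Hessian = 0)."""
    return feat

theta = np.array([0.5, -0.2, 0.1, 0.3])
feat = phi(s=0, a=1)
# Cauchy-Schwarz: |f| <= ||theta||_2 * ||phi||_2, i.e. B_F <= C_Theta * kappa_1.
assert abs(f(theta, feat)) <= np.linalg.norm(theta)
assert np.isclose(np.linalg.norm(grad_f(theta, feat)), 1.0)
```

Choosing ϕ(s, a) to be a one-hot indicator instead recovers the tabular case, and replacing f by a neural network keeps the same interface while making κ₂, κ₃ nonzero.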
• Due to the limitation of statistical tools, existing analyses in reinforcement learning usually favor basic settings such as tabular MDPs (where the state and action spaces are finite (Azar et al., 2013; 2017; Sidford et al., 2018; Jin et al., 2018; Cui and Yang, 2020; Agarwal et al., 2020; Yin et al., 2021a; b; Li et al., 2020; Ren et al., 2021; Xie et al., 2021b; Li et al., 2022; Zhang et al., 2022b; Qiao et al., 2022; Cui and Du, 2022)) or linear MDPs (Yang and Wang, 2020; Jin et al., 2020b; Wang et al., 2020; Jin et al., 2021b; Ding et al., 2021; Wang et al., 2021a; Min et al., 2021) / linear mixture MDPs (Modi et al., 2020; Cai et al., 2020; Zhang et al., 2021a; Zhou et al., 2021b; a) (where the transition dynamics admit linear structures), so that well-established techniques (e.g. from linear regression) can be applied. In addition, subsequent extensions are often based on linear models (e.g. linear Bellman complete models (Zanette et al., 2020) and Eluder dimension (Russo and Van Roy, 2013; Jin et al., 2021a)). The differentiable function class strictly generalizes these previously popular choices, i.e. by choosing f(θ, ϕ) = ⟨θ, ϕ⟩ or specifying ϕ to be a one-hot representation, and is far more expressive as it encompasses nonlinear approximators.

• Practically speaking, the flexibility of selecting the model f makes it possible to handle a variety of tasks. For instance, when f is specified to be a neural network, θ corresponds to the weights of the network layers and ϕ(·,·) corresponds to the state-action representation (which is induced by the network architecture). When facing easier tasks, we can deploy a simpler model f such as a polynomial. Yet our statistical guarantee is not affected by the specific choice, as we can plug the model f into Theorem 3.2 to obtain the respective bounds (we do not need separate analyses for different tasks).
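Since the abstract's PFQL algorithm builds on Fitted Q-Iteration style design, a toy sketch of plain FQI with a plug-in model f may help fix ideas. This is our own illustration, not the paper's PFQL (which additionally subtracts a pessimism bonus from the targets); we use the one-hot/tabular member of the class so each regression step is a closed-form least-squares solve, whereas a neural f would replace it with gradient steps.

```python
import numpy as np

def one_hot(s, a, n_states, n_actions):
    """Tabular feature map: one-hot phi recovers the tabular MDP special case."""
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

def fitted_q_iteration(data, n_states, n_actions, gamma=0.9, iters=100):
    """Plain FQI on offline transitions data = [(s, a, r, s_next), ...].

    Each iteration regresses f(theta, phi(s, a)) onto the Bellman targets
    r + gamma * max_a' f(theta, phi(s', a')).
    """
    phi = lambda s, a: one_hot(s, a, n_states, n_actions)
    theta = np.zeros(n_states * n_actions)
    X = np.array([phi(s, a) for (s, a, r, s2) in data])
    for _ in range(iters):
        y = np.array([r + gamma * max(phi(s2, a2) @ theta for a2 in range(n_actions))
                      for (s, a, r, s2) in data])
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta  # estimated Q-values: Q(s, a) ~= f(theta, phi(s, a))

# Toy 2-state, 2-action chain: action 1 in state 0 yields reward 1 and stays there,
# so the optimal value is V(0) = 1/(1 - gamma) = 10 and Q(0, 1) -> 10.
data = [(0, 0, 0.0, 1), (0, 1, 1.0, 0), (1, 0, 0.0, 0), (1, 1, 0.0, 1)]
Q = fitted_q_iteration(data, n_states=2, n_actions=2)
```

Swapping `one_hot` for any differentiable f from Definition 1.1 changes only the regression subroutine; the algorithmic skeleton is unchanged, which is why a single analysis can cover the whole class.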

1.1. RELATED WORKS

Reinforcement learning with function approximation. RL with function approximation has a long history that dates back to Bradtke and Barto (1996); Tsitsiklis and Van Roy (1996). Later,



¹ See Arulkumaran et al. (2017) and the references therein for an overview.
² Here ∥∇³_θθθ f(θ, ϕ(s, a))∥₂ is defined as the 2-norm for a 3-d tensor, and in the finite-horizon setting we simply instantiate B_F = H.




