OPTIMISTIC POLICY OPTIMIZATION WITH GENERAL FUNCTION APPROXIMATIONS

Abstract

Although policy optimization with neural networks has a track record of achieving state-of-the-art results in reinforcement learning across various domains, the theoretical understanding of the computational and sample efficiency of policy optimization remains restricted to linear function approximations with finite-dimensional feature representations, which hinders the design of principled, effective, and efficient algorithms. To this end, we propose an optimistic model-based policy optimization algorithm, which allows general function approximations while incorporating exploration. In the episodic setting, we establish a $\sqrt{T}$-regret that scales polynomially in the eluder dimension of the general model class. Here T is the number of steps taken by the agent. In particular, we specialize this regret bound to two nonparametric model classes: one based on reproducing kernel Hilbert spaces and another based on overparameterized neural networks.

1. INTRODUCTION

Reinforcement learning with neural networks has achieved impressive empirical breakthroughs (Mnih et al., 2015; Silver et al., 2016; 2017; Berner et al., 2019; Vinyals et al., 2019). These algorithms are often based on policy optimization (Williams, 1992; Baxter & Bartlett, 2000; Sutton et al., 2000; Kakade, 2002; Schulman et al., 2015; 2017). Compared with value-based approaches, which iteratively estimate the optimal value function, policy-based approaches directly optimize the expected total reward, which leads to steadier policy improvement. In particular, as shown in this paper, policy optimization generates steadily improving stochastic policies and consequently accommodates adversarially chosen reward functions. On the other hand, policy optimization often suffers from a lack of computational and statistical efficiency in practice, which calls for the principled design of efficient algorithms.

Specifically, in terms of computational efficiency, recent progress (Abbasi-Yadkori et al., 2019a; b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) establishes the convergence of policy optimization to a globally optimal policy given sufficiently many data points, even in the presence of neural networks. However, in terms of sample efficiency, it remains less understood how to sequentially acquire the data points used in policy optimization while balancing exploration and exploitation, especially in the presence of neural networks, despite recent progress (Cai et al., 2019; Agarwal et al., 2020). In particular, such a lack of sample efficiency prohibits the principled application of policy optimization in critical domains where data acquisition is expensive, e.g., autonomous driving and dynamic treatment.

In this paper, we aim to provably achieve sample efficiency in model-based policy optimization, which we quantify through the lens of regret. In particular, we focus on the episodic setting with general function approximations of the transition kernel. Such a setting is studied by Russo & Van Roy (2013; 2014); Osband & Van Roy (2014); Ayoub et al. (2020); Wang et al. (2020), which, however, focus on value iteration. In contrast, policy optimization remains less understood, despite its critical role in practice. To this end, we propose an optimistic policy optimization algorithm, which achieves exploration by incorporating optimism into policy evaluation and propagating it through policy improvement. In particular, we establish a $\kappa(\mathcal{P}) \cdot \sqrt{H^3 T}$-regret for the proposed algorithm, which matches that of existing value iteration algorithms while additionally allowing the reward function to vary adversarially across episodes. Here T is the number of steps, H is the length of each episode, and $\kappa(\mathcal{P})$ is a quantity that characterizes the complexity of the general model class $\mathcal{P}$, which scales polynomially in its eluder dimension.
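For concreteness, the episodic regret can be written as follows. This is a sketch of the standard definition, assuming K = T/H episodes, episode-dependent reward functions $r^k$, an executed policy $\pi^k$ in episode k, and a fixed comparator policy $\pi^\ast$; the exact notation in the body of the paper may differ.

\[
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{\pi^\ast,\,k}\bigl(s_1^k\bigr) \;-\; V_1^{\pi^k,\,k}\bigl(s_1^k\bigr) \Bigr), \qquad T = KH,
\]

where $V_1^{\pi,\,k}$ denotes the value of policy $\pi$ at the first step of episode k under reward $r^k$, and $s_1^k$ is the initial state of episode k. Under this definition, a $\kappa(\mathcal{P}) \cdot \sqrt{H^3 T}$-regret implies that the average suboptimality per episode vanishes as T grows.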

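To illustrate the "optimism in policy evaluation, then incremental policy improvement" template described above, the following is a minimal tabular sketch in Python. It is not the paper's algorithm: the confidence set over the general model class $\mathcal{P}$ is replaced by a hypothetical count-based exploration bonus, and policy improvement uses a mirror-descent (multiplicative-weights) update on the optimistic Q-estimates; all constants and names are illustrative.

import numpy as np

# Minimal tabular sketch of optimistic policy optimization (illustrative
# only; the paper's confidence set over a general model class is replaced
# here by a hypothetical count-based exploration bonus).

S, A, H, K = 5, 3, 4, 200                           # states, actions, horizon, episodes
rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(S), size=(H, S, A))  # unknown transitions
r = rng.uniform(size=(H, S, A))                     # rewards in [0, 1]
eta, beta = 0.5, 1.0                                # step size, bonus scale

pi = np.full((H, S, A), 1.0 / A)        # initial uniform stochastic policy
counts = np.ones((H, S, A))             # visit counts (start at 1 to avoid /0)
trans_counts = np.zeros((H, S, A, S))   # empirical transition counts
P_hat = np.full((H, S, A, S), 1.0 / S)  # empirical transition model

for k in range(K):
    # Optimistic policy evaluation: backward pass with an exploration bonus,
    # truncated so that the Q-estimates stay in the valid range [0, H - h].
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        bonus = beta / np.sqrt(counts[h])
        Q[h] = np.clip(r[h] + P_hat[h] @ V[h + 1] + bonus, 0.0, H - h)
        V[h] = (pi[h] * Q[h]).sum(axis=1)

    # Policy improvement: mirror descent on the optimistic Q-estimates,
    # i.e., a multiplicative-weights update that keeps the policy stochastic.
    pi = pi * np.exp(eta * Q)
    pi /= pi.sum(axis=2, keepdims=True)

    # Execute the policy for one episode and update the empirical model.
    s = 0
    for h in range(H):
        a = rng.choice(A, p=pi[h, s])
        s_next = rng.choice(S, p=P_true[h, s, a])
        counts[h, s, a] += 1
        trans_counts[h, s, a, s_next] += 1
        P_hat[h, s, a] = trans_counts[h, s, a] / trans_counts[h, s, a].sum()
        s = s_next

In this sketch, the bonus shrinks as state-action pairs are visited, so early episodes explore while later episodes exploit, and the multiplicative-weights update mirrors the steadily improving stochastic policies highlighted above.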
