OPTIMISTIC POLICY OPTIMIZATION WITH GENERAL FUNCTION APPROXIMATIONS

Abstract

Although policy optimization with neural networks has a track record of achieving state-of-the-art results in reinforcement learning across various domains, the theoretical understanding of the computational and sample efficiency of policy optimization remains restricted to linear function approximations with finite-dimensional feature representations, which hinders the design of principled, effective, and efficient algorithms. To this end, we propose an optimistic model-based policy optimization algorithm, which allows general function approximations while incorporating exploration. In the episodic setting, we establish a $\sqrt{T}$-regret that scales polynomially in the eluder dimension of the general model class. Here $T$ is the number of steps taken by the agent. In particular, we specialize such a regret to handle two nonparametric model classes: one based on reproducing kernel Hilbert spaces and another based on overparameterized neural networks.

1. INTRODUCTION

Reinforcement learning with neural networks has achieved impressive empirical breakthroughs (Mnih et al., 2015; Silver et al., 2016; 2017; Berner et al., 2019; Vinyals et al., 2019). These algorithms are often based on policy optimization (Williams, 1992; Baxter & Bartlett, 2000; Sutton et al., 2000; Kakade, 2002; Schulman et al., 2015; 2017). Compared with value-based approaches, which iteratively estimate the optimal value function, policy-based approaches directly optimize the expected total reward, which leads to steadier policy improvement. In particular, as shown in this paper, policy optimization generates steadily improving stochastic policies and consequently accommodates adversarial environments.

On the other hand, policy optimization often suffers from a lack of computational and statistical efficiency in practice, which calls for the principled design of efficient algorithms. Specifically, in terms of computational efficiency, recent progress (Abbasi-Yadkori et al., 2019a;b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) establishes the convergence of policy optimization to a globally optimal policy given sufficiently many data points, even in the presence of neural networks. However, in terms of sample efficiency, it remains less understood how to sequentially acquire the data points used in policy optimization while balancing exploration and exploitation, especially in the presence of neural networks, despite recent progress (Cai et al., 2019; Agarwal et al., 2020). In particular, such a lack of sample efficiency prohibits the principled application of policy optimization in critical domains, e.g., autonomous driving and dynamic treatment, where data acquisition is expensive. In this paper, we aim to provably achieve sample efficiency in model-based policy optimization, quantified through the lens of regret.
In particular, we focus on the episodic setting with general function approximations of the transition kernel. Such a setting is studied by Russo & Van Roy (2013; 2014); Osband & Van Roy (2014); Ayoub et al. (2020); Wang et al. (2020), which, however, focus on value iteration. In contrast, policy optimization remains less understood, despite its critical role in practice. To this end, we propose an optimistic policy optimization algorithm, which achieves exploration by incorporating optimism into policy evaluation and propagating it through policy improvement. In particular, we establish a $\kappa(\mathcal{P}) \cdot \sqrt{H^3 T}$-regret for the proposed algorithm, which matches that of existing value iteration algorithms but additionally allows the reward function to vary adversarially across episodes. Here $T$ is the number of steps, $H$ is the length of each episode, and $\kappa(\mathcal{P})$ is the model capacity, which is defined based on the eluder dimension. Moreover, we instantiate the proposed algorithm for the special cases of reproducing kernel Hilbert spaces and overparameterized neural networks, both of which are infinite-dimensional model classes. Our work is related to the study of the computational efficiency of policy optimization (Fazel et al., 2018; Yang et al., 2019; Abbasi-Yadkori et al., 2019a;b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019). These works assume either that the transition model is known or that there exists a well-explored behavior policy such that the policy update direction can be estimated accurately. Under such assumptions, the tradeoff between exploration and exploitation is absent, and their focus is solely on the computational aspect. In addition, our work is related to the literature on adversarial MDPs (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010a;b; Zimin & Neu, 2013; Neu et al., 2012; Rosenberg & Mansour, 2019b;a).
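The template of "optimism in policy evaluation, propagated through policy improvement" can be illustrated in a tabular toy setting. The sketch below is only a hedged caricature of this template, not the paper's algorithm: the confidence bonus `beta / sqrt(counts)`, the finite state/action sizes, and the step size `eta` are all illustrative assumptions.

```python
import numpy as np

# A hedged, tabular caricature of optimistic policy optimization:
# (i) optimistic policy evaluation adds an exploration bonus to the
#     empirical Bellman backup, clipped to [0, H] as in the clamp (1.1);
# (ii) policy improvement is a KL mirror-descent (softmax) step on Q.
# All numeric choices here are illustrative assumptions.

def optimistic_evaluation(P_hat, r, counts, pi, beta, H):
    """Backward recursion: Q_h = clip(r + bonus + P_hat V_{h+1}, 0, H)."""
    S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros(S)
    for h in reversed(range(H)):
        bonus = beta / np.sqrt(np.maximum(counts, 1.0))  # optimism term
        Q[h] = np.clip(r + bonus + P_hat @ V, 0.0, H)
        V = (pi * Q[h]).sum(axis=1)  # evaluate the current stochastic policy
    return Q

def mirror_descent_update(log_pi, Q_h, eta):
    """Exponentiated-gradient (KL mirror-descent) policy improvement."""
    log_pi = log_pi + eta * Q_h
    return log_pi - np.logaddexp.reduce(log_pi, axis=1, keepdims=True)
```

The clipping keeps the optimistic estimates within the valid value range, while the multiplicative update keeps the improved policy stochastic, which is what allows the adversarially varying rewards discussed above.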
The algorithms in these works directly estimate the visitation measure and utilize mirror descent to handle adversarial reward functions. Furthermore, our work is closely related to the recent work on the sample complexity of policy optimization methods by Cai et al. (2019), which focuses only on the tabular and linear settings. In contrast, our work considers the general function approximation setting, which is significantly more general. Moreover, our construction of optimistic policy evaluation is related to that of Ayoub et al. (2020), where a similar approach is used to estimate the optimal value function. The theoretical foundation of this type of optimistic estimation originates from Russo & Van Roy (2014) in the bandit setting. In particular, to characterize the optimism and accuracy of the optimistic evaluation, we rely on the notion of the eluder dimension proposed by Russo & Van Roy (2014), which we further instantiate for kernel and neural function approximations.
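The mirror-descent machinery that the adversarial-MDP line of work relies on is, at its core, the classical exponentiated-gradient (Hedge) update over the probability simplex. The following self-contained sketch shows that update achieving small regret against an adversarially chosen reward sequence; the reward array, horizon, and learning rate are our own illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

# Hedge / exponentiated gradient: KL mirror descent on the simplex.
# Given a (T, K) array of (possibly adversarial) rewards in [0, 1],
# it competes with the best fixed action in hindsight.

def hedge(rewards, eta):
    """Return (algorithm's expected total reward, best fixed action's total)."""
    T, K = rewards.shape
    log_w = np.zeros(K)
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - np.logaddexp.reduce(log_w))  # softmax weights
        total += p @ rewards[t]                         # expected reward
        log_w += eta * rewards[t]                       # mirror-descent step
    best_fixed = rewards.sum(axis=0).max()
    return total, best_fixed
```

With a learning rate of order $\sqrt{\log K / T}$, the gap `best_fixed - total` grows only as $O(\sqrt{T \log K})$, which is the mechanism that lets these algorithms tolerate rewards chosen by an adversary.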

1.1. NOTATIONS

We denote by $\|\cdot\|_p$ the $\ell_p$-norm of a vector when $p \in \mathbb{N}$, and the spectral norm of a matrix when $p = 2$. For any two distributions $p_1, p_2$ over a discrete set $\mathcal{A}$, we denote by $D_{\mathrm{KL}}(p_1 \,\|\, p_2)$ the KL-divergence
$$D_{\mathrm{KL}}(p_1 \,\|\, p_2) = \sum_{a \in \mathcal{A}} p_1(a) \log \frac{p_1(a)}{p_2(a)}.$$
For any $a, b, x \in \mathbb{R}$, we define the clamp function
$$\mathrm{clamp}(x, a, b) = \begin{cases} b, & \text{if } x > b, \\ x, & \text{if } a \le x \le b, \\ a, & \text{if } x < a. \end{cases} \tag{1.1}$$
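Both notational objects are straightforward to write out directly; the short sketch below does so, with the standard $0 \log 0 = 0$ convention for the KL-divergence. Names and shapes are our own choices for illustration.

```python
import numpy as np

def kl_divergence(p1, p2):
    """D_KL(p1 || p2) for discrete distributions, with the 0 log 0 = 0 convention."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mask = p1 > 0                      # terms with p1(a) = 0 contribute 0
    return float(np.sum(p1[mask] * np.log(p1[mask] / p2[mask])))

def clamp(x, a, b):
    """clamp(x, a, b) as defined in (1.1): project x onto [a, b]."""
    return min(max(x, a), b)
```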

2.1. ONLINE REINFORCEMENT LEARNING WITH ADVERSARIAL REWARDS

We consider an episodic MDP $(\mathcal{S}, \mathcal{A}, H, \{P_h\}_{h=1}^H, \{r_h\}_{h=1}^H)$, where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a discrete action space, $H$ is the number of steps in each episode, $\{P_h\}_{h=1}^H$ is the unknown transition model, and $\{r_h\}_{h=1}^H$ are the reward functions. In particular, for any $h \in [H]$, $P_h$ is the transition kernel from a state-action pair $(s_h, a_h)$ at the $h$-th step to the next state $s_{h+1}$, while $r_h$ is the reward function at the $h$-th step, which maps a state-action pair to a deterministic reward. Moreover, we allow the reward function to vary across episodes and denote by $r_h^k$ the reward function at the $h$-th step of the $k$-th episode. In particular, $r_h^k$ depends on the trajectories before the $k$-th episode begins, possibly in an adversarial manner, and remains unobservable until the $k$-th episode ends. Without loss of generality, we assume that each episode starts from a fixed state $s_1$ and that all rewards fall in the interval $[0, 1]$. For any $h \in [H]$, a policy $\pi_h$ is the conditional distribution of the action given the state at the $h$-th step. We drop the subscript $h$ to denote the collection of policies at all steps and still refer to such a collection as a policy when it is clear from the context. For any $(k, h) \in \mathbb{N} \times [H]$, given a policy $\pi$ and reward functions $\{r_h^k\}_{h=1}^H$, the value function and Q-function at the $h$-th step of the $k$-th episode are defined as
$$V_h^{\pi,k}(s) = \mathbb{E}_\pi\Bigl[\sum_{h'=h}^{H} r_{h'}^k(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Bigr], \qquad Q_h^{\pi,k}(s, a) = \mathbb{E}_\pi\Bigl[\sum_{h'=h}^{H} r_{h'}^k(s_{h'}, a_{h'}) \,\Big|\, s_h = s, \, a_h = a\Bigr].$$
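For a finite instantiation of this episodic setting (hedged: the paper's state space is continuous, so the finite sizes below are purely for illustration), the value function and Q-function of a fixed policy satisfy the finite-horizon Bellman equations $Q_h = r_h + P_h V_{h+1}$ and $V_h = \mathbb{E}_{a \sim \pi_h} Q_h$, with $V_{H+1} \equiv 0$, which gives a direct backward-recursion sketch:

```python
import numpy as np

# Exact policy evaluation for a tabular episodic MDP.
# P:  (H, S, A, S) transition kernels P_h
# r:  (H, S, A)    reward functions r_h, values in [0, 1]
# pi: (H, S, A)    stochastic policy pi_h(a | s), rows summing to 1
# The sizes H, S, A are illustrative assumptions.

def policy_evaluation(P, r, pi):
    """Return Q[h, s, a] and V[h, s] via backward recursion."""
    H, S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))              # V[H] = 0: no reward after step H
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V[h + 1]     # Bellman: Q_h = r_h + P_h V_{h+1}
        V[h] = (pi[h] * Q[h]).sum(axis=1) # V_h = E_{a ~ pi_h}[Q_h]
    return Q, V
```

A quick sanity check: if every reward equals 1, then $V_h(s) = H - h + 1$ for every state, so the value at the initial step is exactly $H$ regardless of the transition kernel or the policy.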

