OPTIMISM IN REINFORCEMENT LEARNING WITH GENERALIZED LINEAR FUNCTION APPROXIMATION

Abstract

We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, we prove that our algorithm enjoys a regret bound of $\tilde{O}(H\sqrt{d^3 T})$, where $H$ is the horizon, $d$ is the dimensionality of the state-action features, and $T$ is the number of episodes. This is the first statistically and computationally efficient algorithm for reinforcement learning with generalized linear functions.

1. INTRODUCTION

We study episodic reinforcement learning problems with infinitely large state spaces, where the agent must use function approximation to generalize across states while simultaneously engaging in strategic exploration. Such problems form the core of modern empirical/deep RL, but relatively little work focuses on exploration, and even fewer algorithms enjoy strong sample efficiency guarantees. On the theoretical side, classical sample efficiency results from the early 00s focus on "tabular" environments with small finite state spaces (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Strehl et al., 2006), but as these methods scale with the number of states, they do not address problems with infinite or large state spaces. While this classical work has inspired practically effective approaches for large state spaces (Bellemare et al., 2016; Osband et al., 2016; Tang et al., 2017), these methods do not enjoy sample efficiency guarantees. More recent theoretical progress has produced provably sample efficient algorithms for complex environments where function approximation is required, but these algorithms are relatively impractical (Krishnamurthy et al., 2016; Jiang et al., 2017). In particular, these methods are computationally inefficient or rely crucially on strong dynamics assumptions (Du et al., 2019b).

In this paper, with an eye toward practicality, we study a simple variation of Q-learning, where we approximate the optimal Q-function with a generalized linear model. The algorithm is appealingly simple: collect a trajectory by following the greedy policy corresponding to the current model, perform a dynamic programming back-up to update the model, and repeat. The key difference over traditional Q-learning-like algorithms is in the dynamic programming step. Here we ensure that the updated model is optimistic in the sense that it always overestimates the optimal Q-function. This optimism is essential for our guarantees.
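To make the optimistic back-up concrete, here is a minimal sketch with a plain linear model and an elliptical confidence bonus, in the least-squares value iteration style of Jin et al. (2019). All names and numbers below are ours for illustration, not the paper's construction:

```python
import numpy as np

def optimistic_q(phi, w, Lambda, beta):
    """Point estimate <w, phi> plus an elliptical-confidence bonus.

    The bonus beta * sqrt(phi^T Lambda^{-1} phi) is large in feature
    directions with little observed data, so the returned value
    overestimates the optimal Q-value with high probability -- the
    optimism property discussed above.
    """
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return float(phi @ w + bonus)

# Toy check: as the covariance Lambda accumulates more data in a
# direction, the bonus (and hence the optimistic estimate) shrinks.
phi = np.array([1.0, 0.0])
w = np.zeros(2)
little_data = np.eye(2)          # few observations
lots_of_data = np.eye(2) * 100.0  # many observations
assert optimistic_q(phi, w, little_data, beta=1.0) > \
       optimistic_q(phi, w, lots_of_data, beta=1.0)
```

Acting greedily with respect to such an optimistic estimate is what drives exploration: under-explored state-action directions look artificially valuable until they are tried.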
Optimism in the face of uncertainty is a well-understood and powerful algorithmic principle in short-horizon (e.g., bandit) problems, as well as in tabular reinforcement learning (Azar et al., 2017; Dann et al., 2017; Jin et al., 2018). With linear function approximation, Yang & Wang (2019) and Jin et al. (2019) show that the optimism principle can also yield provably sample-efficient algorithms when the environment dynamics satisfy certain linearity properties. Their assumptions are always satisfied in tabular problems, but are somewhat unnatural in settings where function approximation is required. Moreover, as these assumptions are directly on the dynamics, it is unclear how their analysis can accommodate other forms of function approximation, including generalized linear models. In the present paper, we replace explicit dynamics assumptions with expressivity assumptions on the function approximator, and, by analyzing an algorithm similar to that of Jin et al. (2019), we show that the optimism principle succeeds under these strictly weaker assumptions (a point also remarked on by Jin et al. (2019)). More importantly, the relaxed assumption facilitates moving beyond linear models, and we demonstrate this by providing the first practical and provably efficient RL algorithm with generalized linear function approximation.

The paper is organized as follows: in Section 2 we formalize our setting, introduce the optimistic closure assumption, and discuss related assumptions in the literature. In Section 3 we study optimistic closure in detail and verify that it is strictly weaker than the recently proposed Linear MDP assumption. Our main algorithm and results are presented in Section 4, with the main proof in Section A. We close with some final remarks and future directions in Section 5.

2. PRELIMINARIES

We consider episodic reinforcement learning in a finite-horizon Markov decision process (MDP) with a possibly infinitely large state space $\mathcal{S}$, finite action space $\mathcal{A}$, initial distribution $\mu \in \Delta(\mathcal{S})$, transition operator $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, reward function $R : \mathcal{S} \times \mathcal{A} \to \Delta([0, 1])$, and horizon $H$. The agent interacts with the MDP in episodes and, in each episode, a trajectory $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_H, a_H, r_H)$ is generated, where $s_1 \sim \mu$, for $h > 1$ we have $s_h \sim P(\cdot \mid s_{h-1}, a_{h-1})$, $r_h \sim R(s_h, a_h)$, and the actions $a_{1:H}$ are chosen by the agent. For normalization, we assume that $\sum_{h=1}^{H} r_h \in [0, 1]$ almost surely. A (deterministic, nonstationary) policy $\pi = (\pi_1, \ldots, \pi_H)$ consists of $H$ mappings $\pi_h : \mathcal{S} \to \mathcal{A}$, where $\pi_h(s_h)$ denotes the action to be taken at time point $h$ if at state $s_h \in \mathcal{S}$. The value function for a policy $\pi$ is a collection of functions $(V_1^\pi, \ldots, V_H^\pi)$, where $V_h^\pi : \mathcal{S} \to \mathbb{R}$ is the expected future reward the policy collects if it starts in a particular state at time point $h$. Formally, $V_h^\pi(s) \triangleq \mathbb{E}\left[\sum_{h'=h}^{H} r_{h'} \mid s_h = s, a_{h:H} \sim \pi\right]$. The value of a policy $\pi$ is simply $V^\pi \triangleq \mathbb{E}_{s_1 \sim \mu}\left[V_1^\pi(s_1)\right]$, and the optimal value is $V^\star \triangleq \max_\pi V^\pi$, where the maximization is over all nonstationary policies. The typical goal is to find an approximately optimal policy, and in this paper we measure performance by the regret accumulated over $T$ episodes, $\mathrm{Reg}(T) \triangleq T V^\star - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{h=1}^{H} r_{h,t}\right]$, where $r_{h,t}$ is the reward collected by the agent at time point $h$ in the $t$-th episode. We seek algorithms with regret that is sublinear in $T$, which demonstrates the agent's ability to act near-optimally over the long run.
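As a concrete illustration of the regret definition above, the following snippet computes $\mathrm{Reg}(T)$ from per-episode returns; all numbers are hypothetical and chosen only to show the arithmetic:

```python
# Suppose the optimal expected return per episode is V_star, and the
# agent collected these per-episode returns over T = 4 episodes
# (each total return lies in [0, 1] by the normalization above).
V_star = 1.0
episode_returns = [0.2, 0.5, 0.8, 0.9]

T = len(episode_returns)
# Reg(T) = T * V_star - sum of realized returns.
regret = T * V_star - sum(episode_returns)
print(regret)  # approximately 1.6
```

An agent whose per-episode returns approach $V^\star$ as $t$ grows accumulates regret that is sublinear in $T$, which is exactly the guarantee sought here.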

2.1. Q-VALUES AND FUNCTION APPROXIMATION

For any policy $\pi$, the state-action value function, or Q-function, is a sequence of mappings $Q^\pi = (Q_1^\pi, \ldots, Q_H^\pi)$, where $Q_h^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined as $Q_h^\pi(s, a) \triangleq \mathbb{E}\left[\sum_{h'=h}^{H} r_{h'} \mid s_h = s, a_h = a, a_{h+1:H} \sim \pi\right]$. The optimal Q-function is $Q_h^\star \triangleq Q_h^{\pi^\star}$, where $\pi^\star \triangleq \operatorname{argmax}_\pi V^\pi$ is the optimal policy. In the value-based function approximation setting, we use a function class $\mathcal{G}$ to model $Q^\star$. In this paper, we always take $\mathcal{G}$ to be a class of generalized linear models (GLMs), defined as follows: let $d \in \mathbb{N}$ be a dimensionality parameter and let $\mathbb{B}^d \triangleq \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ be the $\ell_2$ ball in $\mathbb{R}^d$.
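To make the GLM function class concrete, here is a minimal sketch of a GLM Q-model of the form $Q(s, a) = f(\langle \theta, \phi(s, a) \rangle)$; the sigmoid link and the two-dimensional features are illustrative choices of ours, not the paper's specific construction:

```python
import numpy as np

def sigmoid(z):
    # An example link function f; GLMs allow other (monotone) links.
    return 1.0 / (1.0 + np.exp(-z))

class GLMQ:
    """Generalized linear Q-model: Q(s, a) = f(<theta, phi(s, a)>)."""

    def __init__(self, theta, link=sigmoid):
        self.theta = np.asarray(theta, dtype=float)
        self.link = link

    def value(self, phi_sa):
        # phi_sa is the feature vector of a state-action pair.
        return float(self.link(self.theta @ phi_sa))

    def greedy_action(self, phis):
        # phis: one feature vector per available action.
        return int(np.argmax([self.value(p) for p in phis]))

q = GLMQ(theta=[1.0, -1.0])
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
assert q.greedy_action(feats) == 0  # sigmoid(1) > sigmoid(-1)
```

Since the link is monotone, the greedy action is also the one maximizing the inner product $\langle \theta, \phi(s, a) \rangle$; the nonlinearity matters for how value estimates, confidence bonuses, and back-ups are computed.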




