OPTIMISM IN REINFORCEMENT LEARNING WITH GENERALIZED LINEAR FUNCTION APPROXIMATION

Abstract

We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, we prove that our algorithm enjoys a regret bound of O(H√(d³T)), where H is the horizon, d is the dimensionality of the state-action features, and T is the number of episodes. This is the first statistically and computationally efficient algorithm for reinforcement learning with generalized linear functions.

1. INTRODUCTION

We study episodic reinforcement learning problems with infinitely large state spaces, where the agent must use function approximation to generalize across states while simultaneously engaging in strategic exploration. Such problems form the core of modern empirical/deep RL, but relatively little work focuses on exploration, and even fewer algorithms enjoy strong sample efficiency guarantees.

On the theoretical side, classical sample efficiency results from the early 2000s focus on "tabular" environments with small finite state spaces (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Strehl et al., 2006), but as these methods scale with the number of states, they do not address problems with infinite or large state spaces. While this classical work has inspired practically effective approaches for large state spaces (Bellemare et al., 2016; Osband et al., 2016; Tang et al., 2017), these methods do not enjoy sample efficiency guarantees. More recent theoretical progress has produced provably sample-efficient algorithms for complex environments where function approximation is required, but these algorithms are relatively impractical (Krishnamurthy et al., 2016; Jiang et al., 2017). In particular, these methods are computationally inefficient or rely crucially on strong dynamics assumptions (Du et al., 2019b).

In this paper, with an eye toward practicality, we study a simple variation of Q-learning, where we approximate the optimal Q-function with a generalized linear model. The algorithm is appealingly simple: collect a trajectory by following the greedy policy corresponding to the current model, perform a dynamic programming back-up to update the model, and repeat. The key difference from traditional Q-learning-like algorithms is in the dynamic programming step: here we ensure that the updated model is optimistic, in the sense that it always overestimates the optimal Q-function. This optimism is essential for our guarantees.
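The loop described above (greedy rollout, optimistic dynamic-programming backup, repeat) can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the toy environment, the feature map phi, the bonus coefficient beta, and the use of a plain linear model (rather than a generalized linear one) are all placeholder assumptions. Optimism is injected by adding an elliptical confidence bonus on top of a regularized least-squares estimate.

```python
import numpy as np

# Hypothetical toy setup: d-dimensional features for (state, action) pairs.
rng = np.random.default_rng(0)
d, n_actions, H, beta, lam = 4, 3, 5, 1.0, 1.0

def phi(s, a):
    # Stand-in feature map for a (state, action) pair.
    v = np.cos(s * np.arange(1, d + 1) + a)
    return v / np.linalg.norm(v)

def step(s, a):
    # Purely illustrative dynamics and reward.
    s_next = np.tanh(s + a - 1) + 0.1 * rng.standard_normal()
    r = float(np.clip(np.sin(s) + 0.1 * a, 0.0, 1.0))
    return s_next, r

# One linear model per time step h: Q_h(s, a) ~ phi(s, a) @ w[h].
w = np.zeros((H, d))
Sigma = [lam * np.eye(d) for _ in range(H)]  # regularized Gram matrices
data = [[] for _ in range(H)]                # (features, reward, next state)

def optimistic_q(s, a, h):
    # Point estimate plus an elliptical confidence bonus -> optimism.
    f = phi(s, a)
    bonus = beta * np.sqrt(f @ np.linalg.solve(Sigma[h], f))
    return f @ w[h] + bonus

for episode in range(20):
    # 1) Collect a trajectory with the greedy policy for the optimistic model.
    s = 0.0
    for h in range(H):
        a = max(range(n_actions), key=lambda act: optimistic_q(s, act, h))
        s_next, r = step(s, a)
        f = phi(s, a)
        Sigma[h] += np.outer(f, f)
        data[h].append((f, r, s_next))
        s = s_next
    # 2) Dynamic-programming backup: regress optimistic targets, last step first.
    for h in reversed(range(H)):
        def target(s2):
            if h == H - 1:
                return 0.0
            return max(optimistic_q(s2, act, h + 1) for act in range(n_actions))
        X = np.array([f for f, _, _ in data[h]])
        y = np.array([r + target(s2) for _, r, s2 in data[h]])
        w[h] = np.linalg.solve(Sigma[h], X.T @ y)
```

Because the bonus shrinks only in directions of feature space that have been visited often, the greedy policy for the optimistic estimates is driven toward under-explored directions, which is what makes the optimism principle double as an exploration strategy.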
Optimism in the face of uncertainty is a well-understood and powerful algorithmic principle in short-horizon (e.g., bandit) problems, as well as in tabular reinforcement learning (Azar et al., 2017; Dann et al., 2017; Jin et al., 2018). With linear function approximation, Yang & Wang (2019) and Jin et al. (2019) show that the optimism principle can also yield provably sample-efficient algorithms when the environment dynamics satisfy certain linearity properties. Their assumptions are always satisfied in tabular problems, but are somewhat unnatural in settings where function approximation

