NEAR-OPTIMAL POLICY IDENTIFICATION IN ACTIVE REINFORCEMENT LEARNING

Abstract

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

1. INTRODUCTION

Reinforcement learning (RL) algorithms are increasingly applied to complex domains such as robotics (Kober et al., 2013), magnetic tokamaks (Seo et al., 2021; Degrave et al., 2022), and molecular search (Simm et al., 2020a;b). A central challenge in such environments is that data acquisition is often a time-consuming and expensive process, or may be infeasible due to safety considerations. A common approach is therefore to train policies offline by interacting with a simulator. However, even when a simulator is available, such applications require algorithms that are capable of learning and planning in large state spaces. Many existing approaches require a large amount of training data to obtain good policies, and efficient active exploration in large state spaces remains an open problem. Moreover, when deploying policies trained on simulators in real-world applications, a crucial requirement is that the policy performs well in any state it might encounter. In particular, at training time, the learning approach has to sufficiently explore the state space. This is especially important when, at test time, the system's state is partly out of the control of the learning algorithm, e.g., for a self-driving car or robot that may be influenced by human actions.

In this work, we formally study the setting of reinforcement learning with a generative model. Our objective is to learn a near-optimal policy by actively querying the simulator with state-action pairs chosen by the learning algorithm. The simulator then returns a new state sampled from the transition model of the (simulated) environment. Inspired by previous works, we make a structural assumption in the kernel setting, which states that the Bellman operator maps any bounded value function to one with a bounded reproducing kernel Hilbert space (RKHS) norm. In particular, this assumption implies that the reward and the optimal Q-function can be represented by RKHS functions.
We propose a novel approach based on least-squares value iteration (LSVI). The algorithm is designed to actively explore uncertain states based on the uncertainty in the Q-estimates, and makes use of optimism for action selection and pessimism for estimating a near-optimal policy.

Contributions. We propose a novel kernelized algorithm for best-policy identification in reinforcement learning with a generative model. Our sampling strategy actively explores (i) states for which the best action is the most uncertain and (ii) the corresponding "optimistic" actions. We prove sample complexity guarantees for finding an ϵ-optimal policy uniformly over any given initial state. Our bounds scale with the maximum information gain of the corresponding reproducing kernel Hilbert space but do not explicitly scale with the number of states or actions. When specialized to the offline contextual Bayesian optimization (BO) setting (Char et al., 2019), we improve upon the sample complexity guarantees of prior work. Finally, we include experimental evaluations on several RL and BO benchmark tasks. The former includes one of the first empirical evaluations of model-free optimistic value iteration algorithms with function approximation (Yang et al., 2020).
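The sampling strategy just described can be sketched as follows. This is a minimal illustrative sketch over finite state and action sets, not the paper's exact procedure; the `ucb` and `lcb` callables (upper and lower confidence bounds on the Q-function, e.g., derived from a kernel ridge regression fit) are assumed to be supplied by the caller.

```python
def select_query(states, actions, ucb, lcb):
    """Pick the state whose best action is most uncertain, then the
    optimistic action there.  ucb/lcb map (state, action) -> a bound on Q."""

    def gap(s):
        # Uncertainty about the best action at s: the optimistic value
        # minus the best value we can pessimistically guarantee.
        return max(ucb(s, a) for a in actions) - max(lcb(s, a) for a in actions)

    s_query = max(states, key=gap)                         # (i) most uncertain state
    a_query = max(actions, key=lambda a: ucb(s_query, a))  # (ii) optimistic action
    return s_query, a_query
```

As the confidence bounds shrink with more data, the gap at every state contracts, which is the intuition behind a stopping rule of the form "stop once the largest gap is below ϵ."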

2. PROBLEM STATEMENT

We consider an episodic MDP $(\mathcal{S}, \mathcal{A}, H, (P_h)_{h\in[H]}, (r_h)_{h\in[H]})$ with state space $\mathcal{S}$, action space $\mathcal{A}$, horizon $H \in \mathbb{N}$, Markov transition kernels $(P_h)_{h\in[H]}$, and deterministic reward functions $(r_h : \mathcal{S} \times \mathcal{A} \to [0,1])_{h\in[H]}$. In particular, for each $h \in [H]$, we let $P_h(\cdot \mid s, a)$ denote the probability transition kernel when action $a$ is taken at state $s \in \mathcal{S}$ in step $h \in [H]$. A policy consists of $H$ functions $\pi = (\pi_h)_{h\in[H]}$ where, for all $h \in [H]$, $\pi_h(\cdot \mid s)$ is a probability distribution over the action set $\mathcal{A}$. In particular, $\pi_h(a \mid s)$ is the probability that the agent takes action $a$ in state $s$ at step $h$.

We assume the generative (or random) access model, in which the agent interacts with the environment in the following way. Let $T$ denote the number of episodes and $H$ the horizon, i.e., the number of steps in each episode. Then for each $t \in [T]$ and $h \in [H]$, the agent chooses $s_h^t \in \mathcal{S}$ and $a_h^t \in \mathcal{A}$, obtains the reward $r_h(s_h^t, a_h^t)$, and observes the new state $s'_{h,t} \sim P_h(\cdot \mid s_h^t, a_h^t)$.

To measure the performance of an agent, we use the value function. For a policy $\pi$, $h \in [H]$, $s \in \mathcal{S}$, and $a \in \mathcal{A}$, the value function $V_h^\pi : \mathcal{S} \to \mathbb{R}$ and the Q-function $Q_h^\pi : \mathcal{S} \times \mathcal{A} \to [0, H]$ are given by
$$V_h^\pi(s) = \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big], \qquad Q_h^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big], \quad (1)$$
where $\mathbb{E}_\pi$ denotes the expectation with respect to the randomness of the trajectory $\{(s_h, a_h)\}_{h=1}^H$ obtained by following the policy $\pi$. We use $\pi^*$ to denote the optimal policy, and we abbreviate $V_h^{\pi^*}$, $Q_h^{\pi^*}$ as $V_h^*$, $Q_h^*$, respectively. We also have $V_h^*(s) = \sup_\pi V_h^\pi(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$.

The goal is to find an ϵ-optimal policy while minimizing the number of necessary episodes $T$. More precisely, for a fixed precision $\epsilon > 0$ and horizon $H \in \mathbb{N}$, the goal of the learner is to output a policy $\pi_T$ after a suitable number of episodes $T > 0$ such that $\|V_1^* - V_1^{\pi_T}\|_{\ell_\infty(\mathcal{S})} \leq \epsilon$.
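To make the generative access model concrete, the following toy sketch lets the learner query an arbitrary $(s, a, h)$ triple, rather than only states reached along a rolled-out trajectory, and estimates $V_1^\pi(s)$ by Monte Carlo rollouts as in Eq. (1). The chain MDP, its reward, and all function names are hypothetical, chosen purely for illustration (and deterministic here, so the Monte Carlo average is exact).

```python
# Toy finite MDP illustrating the generative access model: the learner may
# query the simulator at ANY (s, a, h), not only along a rolled-out trajectory.
S, A, H = [0, 1, 2], [0, 1], 3

def reward(h, s, a):
    """Deterministic r_h(s, a) in [0, 1]."""
    return 1.0 if (s == 2 and a == 1) else 0.0

def step(h, s, a):
    """Next state s' ~ P_h(. | s, a) (deterministic in this toy chain)."""
    return min(s + 1, 2) if a == 1 else s

def generative_query(h, s, a):
    """One query under the generative model: returns (r_h(s, a), s')."""
    return reward(h, s, a), step(h, s, a)

def V_hat(s, episodes=1000):
    """Monte Carlo estimate of V_1^pi(s), Eq. (1), for the 'always act 1' policy."""
    total = 0.0
    for _ in range(episodes):
        cur, ret = s, 0.0
        for h in range(1, H + 1):
            r, cur = generative_query(h, cur, 1)
            ret += r
        total += ret
    return total / episodes
```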
Finally, we also recall the Bellman equations associated with a policy $\pi$:
$$V_{H+1}^\pi = 0, \qquad Q_h^\pi(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}[V_{h+1}^\pi(s')], \qquad V_h^\pi(s) = \mathbb{E}_{a \sim \pi_h(\cdot \mid s)}[Q_h^\pi(s, a)], \quad (2)$$
and the Bellman optimality equations:
$$V_{H+1}^* = 0, \qquad Q_h^*(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}[V_{h+1}^*(s')], \qquad V_h^*(s) = \max_{a \in \mathcal{A}} Q_h^*(s, a).$$
It follows that the optimal policy $\pi^*$ is the greedy policy with respect to $\{Q_h^*\}_{h\in[H]}$, a property that will be useful later on when defining our active exploration strategy.

We use the reproducing kernel Hilbert space (RKHS) function class to represent functions such as the reward functions $\{r_h\}_{h\in[H]}$ and the optimal Q-functions $\{Q_h^*\}_{h\in[H]}$ (see the formal statement in Assumption 1). In particular, we consider a space of well-behaved functions defined on $\mathcal{X} = \mathcal{S} \times \mathcal{A}$, where $\mathcal{H}$ denotes an RKHS defined on $\mathcal{X}$ induced by some continuous, positive definite kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We also assume that (i) $\mathcal{X} \subset \mathbb{R}^d$ is a compact set, (ii) the kernel function is bounded, $k(x, x') \leq 1$ for all $x, x' \in \mathcal{X}$, and (iii) every $f \in \mathcal{H}$ has a bounded RKHS norm, i.e., $\|f\|_{\mathcal{H}} \leq B_Q H$ for some fixed positive constant $B_Q > 0$.
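The Bellman optimality equations can be instantiated directly by backward induction when the MDP is small and known. The sketch below is purely illustrative (the toy chain MDP and all names are hypothetical); it computes $Q_h^*$, $V_h^*$, and the greedy policy for $h = H, \dots, 1$.

```python
# Backward induction on a small finite MDP, mirroring the Bellman optimality
# equations: V*_{H+1} = 0, Q*_h = r_h + E[V*_{h+1}], V*_h = max_a Q*_h.
S, A, H = range(3), range(2), 3

def r(h, s, a):
    """Deterministic reward in [0, 1]."""
    return 1.0 if (s == 2 and a == 1) else 0.0

def P(h, s, a):
    """Transition distribution as a dict {s': prob}."""
    return {min(s + 1, 2): 1.0} if a == 1 else {s: 1.0}

V = {(H + 1, s): 0.0 for s in S}          # V*_{H+1} = 0
Q, pi = {}, {}
for h in range(H, 0, -1):                 # h = H, H-1, ..., 1
    for s in S:
        for a in A:
            Q[h, s, a] = r(h, s, a) + sum(p * V[h + 1, s2]
                                          for s2, p in P(h, s, a).items())
        V[h, s] = max(Q[h, s, a] for a in A)
        pi[h, s] = max(A, key=lambda a: Q[h, s, a])   # greedy w.r.t. Q*_h
```

The greedy policy `pi` extracted this way is optimal, which is exactly the property the active exploration strategy exploits: it suffices to identify $Q_h^*$ accurately at the relevant states.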

3. AE-LSVI ALGORITHM

Our algorithm runs in episodes $t \in [T]$ of horizon $H$. As in kernelized least-squares value iteration (Yang et al., 2020), at the beginning of every episode $t$, it solves a sequence of kernel ridge
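Kernel ridge regression is the basic subproblem solved at each step of LSVI-style algorithms. The following is a minimal illustrative sketch with an RBF kernel, not the paper's exact estimator; it returns a mean predictor together with an uncertainty estimate, the two ingredients from which optimistic (mean + bonus) and pessimistic (mean - bonus) Q-estimates are typically built.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """RBF kernel matrix; bounded by 1, consistent with assumption (ii) above."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def kernel_ridge(X, y, lam=1.0):
    """Fit kernel ridge regression on data (X, y); returns mean and std
    functions evaluated at query points Xq (shape (m, d))."""
    K = rbf(X, X)
    A = K + lam * np.eye(len(X))
    alpha = np.linalg.solve(A, y)

    def mean(Xq):
        return rbf(Xq, X) @ alpha

    def std(Xq):
        Kq = rbf(Xq, X)
        v = np.linalg.solve(A, Kq.T)
        var = rbf(Xq, Xq).diagonal() - (Kq * v.T).sum(-1)
        return np.sqrt(np.clip(var, 0.0, None))  # clip numerical negatives

    return mean, std
```

The `std` term vanishes at observed points and approaches 1 far from the data, which is what drives the exploration bonus toward unexplored state-action pairs.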

