NEAR-OPTIMAL POLICY IDENTIFICATION IN ACTIVE REINFORCEMENT LEARNING

Abstract

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

1. INTRODUCTION

Reinforcement learning (RL) algorithms are increasingly applied to complex domains such as robotics (Kober et al., 2013), magnetic tokamaks (Seo et al., 2021; Degrave et al., 2022), and molecular search (Simm et al., 2020a;b). A central challenge in such environments is that data acquisition is often time-consuming and expensive, or may be infeasible due to safety considerations. A common approach is therefore to train policies offline by interacting with a simulator. However, even when a simulator is available, such applications require algorithms that are capable of learning and planning in large state spaces. Many existing approaches require a large amount of training data to obtain good policies, and efficient active exploration in large state spaces remains an open problem. Moreover, when deploying policies trained on simulators in real-world applications, a crucial requirement is that the policy performs well in any state it might encounter. In particular, at training time, the learning approach has to sufficiently explore the state space. This is especially important when, at test time, the system's state is partly out of the control of the learning algorithm (e.g., a self-driving car or robot that may be influenced by human actions).

In this work, we formally study the setting of reinforcement learning with a generative model. Our objective is to learn a near-optimal policy by actively querying the simulator with a state-action pair chosen by the learning algorithm. The simulator then returns a new state sampled from the transition model of the (simulated) environment. Inspired by previous works, we make a structural assumption in the kernel setting, which states that the Bellman operator maps any bounded value function to one with a bounded reproducing kernel Hilbert space (RKHS) norm. In particular, this assumption implies that the reward and the optimal Q-function can be represented by an RKHS function.
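Under the RKHS assumption, uncertainty about a Q-estimate can be quantified with standard kernel regression machinery: the posterior width shrinks at state-action pairs near previously queried data. The following minimal sketch (names and the RBF kernel choice are illustrative, not taken from the paper) shows how such a mean estimate and uncertainty width could be computed:

```python
import numpy as np

def rbf(X, Y, lengthscale=1.0):
    # RBF kernel matrix between the rows of X and the rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale**2))

def kernel_posterior(X_train, y_train, X_test, lam=1.0):
    """Kernel ridge mean and a GP-style uncertainty width (illustrative).

    Returns (mean, sigma) at the test inputs; sigma is small where the
    training data is dense, mirroring the confidence widths that
    kernelized LSVI-type methods use to drive exploration.
    """
    K = rbf(X_train, X_train) + lam * np.eye(len(X_train))
    K_inv = np.linalg.inv(K)
    k_star = rbf(X_test, X_train)                  # shape (n_test, n_train)
    mean = k_star @ K_inv @ y_train
    var = rbf(X_test, X_test).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    return mean, np.sqrt(np.maximum(var, 0.0))
```

For instance, with training inputs at 0 and 1, the width at a test point near 0 is much smaller than at a far-away point such as 5, which is exactly the signal an active-exploration rule would use to decide where to query next.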
We propose a novel approach based on least-squares value iteration (LSVI). The algorithm actively explores uncertain states based on the uncertainty in the Q-estimates, using optimism for action selection and pessimism for estimating a near-optimal policy.
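The interplay of optimism and pessimism described above can be illustrated with a toy acquisition step. This is a hedged sketch, not the paper's exact rule: given per-pair Q-estimates and uncertainty widths (`q_mean`, `q_sigma` are hypothetical inputs), the optimistic upper bound drives action selection, the gap between optimistic and pessimistic value estimates picks the state to query, and the pessimistic lower bound defines the policy that is ultimately reported.

```python
import numpy as np

def ae_lsvi_query(q_mean, q_sigma, beta=2.0):
    """One illustrative acquisition step in the spirit of AE-LSVI.

    q_mean, q_sigma: arrays of shape (n_states, n_actions) holding the
    current Q-estimates and their uncertainty widths. Returns the index
    of the state to query, the optimistic action at that state, and the
    pessimistic greedy policy.
    """
    ucb = q_mean + beta * q_sigma          # optimistic Q-estimate
    lcb = q_mean - beta * q_sigma          # pessimistic Q-estimate
    # Per-state uncertainty: gap between the best optimistic value and
    # the best pessimistic value achievable at that state.
    gap = ucb.max(axis=1) - lcb.max(axis=1)
    s_idx = int(np.argmax(gap))            # most uncertain state
    a_idx = int(np.argmax(ucb[s_idx]))     # optimistic action there
    policy = lcb.argmax(axis=1)            # pessimistic greedy policy
    return s_idx, a_idx, policy
```

Intuitively, querying the state with the largest optimistic-pessimistic gap shrinks uncertainty where it matters most, which is what allows a guarantee that holds uniformly over the state space rather than only from a fixed initial state.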

