BLESSING FROM EXPERTS: SUPER REINFORCEMENT LEARNING IN CONFOUNDED ENVIRONMENTS

Abstract

We introduce super reinforcement learning in the batch setting, which takes the observed action as an input for enhanced policy learning. In the presence of unmeasured confounders, the recommendations from human experts recorded in the observed data allow us to recover certain unobserved information. By including this information in the policy search, the proposed super reinforcement learning yields a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior policy (e.g., the expert's recommendation). Furthermore, to address the issue of unmeasured confounding in finding super-policies, we establish a number of nonparametric identification results. Finally, we develop two super-policy learning algorithms and derive their corresponding finite-sample regret guarantees.

1. INTRODUCTION

Offline reinforcement learning (RL) aims to find a sequence of optimal policies by leveraging batch data (Sutton & Barto, 2018; Levine et al., 2020). In many high-stakes domains such as medical studies (Kosorok & Laber, 2019), it is costly or dangerous to interact with the environment for online data collection, so learning must rely entirely on pre-collected observational or experimental data. Recently, there has been surging interest in offline RL theory and methods. Most existing solutions rely on the unconfoundedness assumption, which excludes latent variables that confound the action-reward or action-next-observation associations. In practice, however, we often encounter unmeasured confounding, under which most existing RL algorithms yield sub-optimal policies. In this paper, we study offline policy learning in confounded contextual bandits and sequential decision making.

Existing work on policy learning has focused on searching for an optimal policy that depends purely on the past history, ignoring the action recommended by the human expert in the observed data. In many applications, there is a common belief that human decision-makers have access to important information that is not recorded in the observed data when taking an action (Kleinberg et al., 2018). For example, in urgent care, clinicians leverage visual observations or communications with patients to recommend treatments; such unstructured information is hard to quantify and often goes unrecorded (McDonald, 1996). Another motivating example is deep brain stimulation (DBS; Lozano et al., 2019). Thanks to recent advances in DBS technology, it is now feasible to instantly collect electroencephalogram data and, based on it, provide adaptive stimulation to specific regions of the brain to treat patients with neurological disorders, including Parkinson's disease and essential tremor. In these applications, the patient is allowed to determine the behavior policy (e.g., when to turn the stimulation on or off, for how long, etc.) based on information known only to herself (e.g., how she feels), thereby generating batch data with unmeasured confounders.

We notice that despite the challenges of policy learning with latent confounders, human recommendations may capture certain unobserved information, as discussed in the aforementioned applications. Including this information as an input to the policy can enhance policy learning, which is indeed "a blessing from experts". Therefore, in this paper, we ask:

Is it possible to consistently learn an optimal policy that takes both the data history and the human recommendation at the current time as input for better decision making?

We will answer this question affirmatively. Specifically, we first introduce a novel framework called super RL which, compared with standard RL, additionally takes the human's recommendation as an input for policy learning. In confounded environments, super RL can embrace the blessing from experts.
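To make the distinction concrete, the following is a minimal, hypothetical Python sketch, not the paper's algorithm: a standard policy maps the logged state alone to an action, while a super-policy additionally conditions on the expert's recommended action, which may encode an unmeasured confounder the expert observed. All function names, the toy decision rules, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_policy(state):
    """Standard policy: maps the observed state alone to an action."""
    return int(state.sum() > 0)

def super_policy(state, expert_action):
    """Super-policy: additionally conditions on the expert's recommended
    action, which may reflect unmeasured confounders the expert saw."""
    # Defer to the expert when the logged state is ambiguous; otherwise
    # act on the state. The 0.5 threshold is an arbitrary illustration.
    if abs(state.sum()) < 0.5:
        return expert_action
    return int(state.sum() > 0)

# One step of (hypothetical) batch data: the expert observed a latent
# confounder u that is absent from the logged state s.
u = rng.normal()                   # unmeasured confounder
s = rng.normal(size=3)             # logged state (does not include u)
a_expert = int(s.sum() + u > 0)    # expert's recommendation uses u

print("standard policy action:", standard_policy(s))
print("super-policy action:   ", super_policy(s, a_expert))
```

In this toy setup, the super-policy can exploit information about u that is available only through the expert's recommendation, whereas the standard policy cannot; this is the intuition behind the guaranteed improvement claimed in the abstract.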

