BLESSING FROM EXPERTS: SUPER REINFORCEMENT LEARNING IN CONFOUNDED ENVIRONMENTS

Abstract

We introduce super reinforcement learning in the batch setting, which takes the observed action as input for enhanced policy learning. In the presence of unmeasured confounders, the recommendations from human experts recorded in the observed data allow us to recover certain unobserved information. By including this information in the policy search, the proposed super reinforcement learning yields a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior policy (e.g., the expert's recommendation). Furthermore, to address the issue of unmeasured confounding in finding super-policies, a number of non-parametric identification results are established. Finally, we develop two super-policy learning algorithms and derive their corresponding finite-sample regret guarantees.

1. INTRODUCTION

Offline reinforcement learning (RL) aims to find a sequence of optimal policies by leveraging batch data (Sutton & Barto, 2018; Levine et al., 2020). In many high-stakes domains such as medical studies (Kosorok & Laber, 2019), it is very costly or dangerous to interact with the environment for online data collection, and learning must rely entirely on pre-collected observational or experimental data. Recently, there has been a surge of interest in offline RL theory and methods. Most existing solutions rely on the unconfoundedness assumption, which excludes latent variables that confound the action-reward or action-next-observation associations. In practice, however, we often encounter unmeasured confounding, under which most existing RL algorithms lead to sub-optimal policies. In this paper, we study offline policy learning in confounded contextual bandits and sequential decision making. Existing works on policy learning have focused on searching for an optimal policy that depends purely on the past history, ignoring the action recommended by the human expert in the observed data. In many applications, there is a common belief that human decision-makers have access to important information that is not recorded in the observed data when taking an action (Kleinberg et al., 2018). For example, in urgent care, clinicians leverage visual observations or communications with patients to recommend treatments; such unstructured information is hard to quantify and often not recorded (McDonald, 1996). Another motivating example is deep brain stimulation (DBS; Lozano et al., 2019). Due to recent advances in DBS technology, it has become feasible to instantly collect electroencephalogram data, based on which adaptive stimulation can be delivered to specific regions of the brain to treat patients with neurological disorders including Parkinson's disease, essential tremor, etc.
In these applications, the patient is allowed to determine the behavior policy (e.g., when to turn the stimulation on or off, for how long, etc.) based on information known only to herself (e.g., how she feels), thereby generating batch data with unmeasured confounders. We notice that, despite the challenges of policy learning with latent confounders, human recommendations may capture certain unobserved information, as discussed in the aforementioned applications. Including this information as an input to the policy can enhance policy learning, which is indeed "a blessing from experts". Therefore, in this paper, we ask: Is it possible to consistently learn an optimal policy that takes both the data history and the human recommendation at the current time as input for better decision making? We answer this question affirmatively. Specifically, we first introduce a novel framework called super RL, which, compared with standard RL, additionally takes the human's recommendation as input for policy learning. In confounded environments, super RL can embrace the blessing from experts: it leverages human expertise to discover unobserved information for enhanced policy learning. The resulting policy, which we call the super-policy, is guaranteed to outperform both the standard optimal policy learned without using human expertise and the behavior policy that may depend on the hidden state. To implement the proposed super-policy for future decision making, we require the human expert to recommend an action at each time, which is common in practice. The super-policy then takes this action and other observations as input and may override the recommendation produced by the expert.
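To make the distinction concrete, the interface difference between a standard policy and a super-policy can be sketched as follows. This is an illustrative sketch only: the function names and the simple threshold rule are hypothetical and not from the paper; the point is that the super-policy receives the expert's recommended action as an extra argument and may either follow or override it.

```python
def standard_policy(state):
    """Standard policy: depends only on the observed state/history."""
    return 0 if state < 0.5 else 1

def super_policy(state, expert_action):
    """Super-policy: additionally takes the expert's recommended action.

    Because the expert may act on unobserved information, expert_action
    carries a signal about the hidden state that the observed state lacks.
    """
    # Illustrative rule: defer to the expert where the observed state is
    # ambiguous, and override the expert elsewhere.
    if 0.4 <= state <= 0.6:
        return expert_action
    return 0 if state < 0.5 else 1
```

At deployment time the expert still produces a recommendation at each step; the super-policy consumes it together with the observed state, matching the protocol described above.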
Second, to address the challenge of partial observability or unmeasured confounding, we establish several non-parametric identification results for finding these super-policies in various confounded environments, leveraging recent developments in causal inference (Tchetgen Tchetgen et al., 2020). Notably, our identification results show that the super-policy is learnable from the observed data despite the presence of unmeasured confounding. Finally, we develop two super RL algorithms and derive the corresponding finite-sample regret guarantees, which are polynomial in all relevant parameters, for finding a desirable super-policy.

2. RELATED WORK

There is an increasing interest in off-policy evaluation (OPE) and learning in sequential decision making problems with unmeasured confounding. Specifically, Zhang & Bareinboim (2016) introduced the causal RL framework and the confounded Markov decision process (MDP) with memoryless unmeasured confounding, under which the Markov property holds in the observed data. Along this direction, many OPE and learning methods have been proposed using instrumental or mediator variables (Chen & Zhang, 2021; Liao et al., 2021; Li et al., 2021; Wang et al., 2021; Shi et al., 2022; Fu et al., 2022; Yu et al., 2022). In addition, partial identification bounds on the off-policy value have been established based on sensitivity analysis (Namkoong et al., 2020; Kallus & Zhou, 2020; Bruns-Smith, 2021). Another stream of research focuses on general confounded POMDP models that allow for both unmeasured confounding and partial observability, for which several point identification results were established (Tennenholtz et al., 2020; Bennett & Kallus, 2021; Nair & Jiang, 2021; Shi et al., 2021; Ying et al., 2021; Miao et al., 2022). However, none of the aforementioned works study policy learning with the help of human expertise, i.e., taking the recommended action in the observed data as input for decision making. Different from these works, we tackle the policy learning problem from a unique perspective and propose a novel super RL framework that leverages human expertise in discovering certain unobserved information to further improve decision making. We also rigorously establish the super-optimality of the proposed super-policy over the standard optimal policy and the behavior policy.
Our paper is also related to a line of work on policy learning and evaluation with partial observability using spectral decomposition and predictive state representation methods (see, e.g., Littman & Sutton, 2001; Song et al., 2010; Boots et al., 2011; Hsu et al., 2012; Singh et al., 2012; Anandkumar et al., 2014; Jin et al., 2020; Cai et al., 2022; Lu et al., 2022; Uehara et al., 2022a;b). Nonetheless, these methods require the no-unmeasured-confounders assumption. Finally, our proposal is motivated by the work of Stensrud & Sarvet (2022), which introduced the concept of a superoptimal treatment regime in contextual bandits and used an instrumental variable approach for discovering such a regime. However, their method applies only to a restrictive single-stage decision making setting with binary actions. In contrast, our super RL framework is generally applicable to both confounded contextual bandits and sequential decision making with arbitrarily many actions. It is also worth mentioning that the proposed super RL differs from the recently proposed safe RL via human intervention (Saunders et al., 2017), where human intervention overrides bad actions recommended by the intelligent agent. In contrast, we leverage the human expertise in the previously collected data so that intelligent agents can make better decisions.

3. SUPER RL: A CONTEXTUAL BANDIT EXAMPLE

In this section, we introduce the super-policy in confounded contextual bandits (i.e., single-stage decision making with unmeasured confounders). Consider a random tuple (S, U, A, {R(a)}_{a∈A}), where S and U denote the observed and unobserved features respectively, A denotes the action, taking values in a finite set A, and {R(a)}_{a∈A} denotes the set of potential/counterfactual rewards, with R(a) representing the reward the agent would receive had action a been taken. The observed reward, denoted by R, can then be written as R = Σ_{a∈A} R(a) I(A = a).
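The value gap that motivates the super-policy can be illustrated with a toy simulation under this setup. The construction below is a hypothetical example, not the paper's estimator: the super-policy is given oracle access to how the expert's action encodes U, whereas the paper's contribution is to identify such policies from confounded data. Here the correct action is U XOR S; the expert sees U but ignores S, so neither the expert alone nor any policy of S alone can do better than chance, while a policy combining S with the expert's action A is always correct.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
S = rng.integers(0, 2, n)          # observed feature
U = rng.integers(0, 2, n)          # unmeasured confounder, seen only by the expert
A = U.copy()                       # behavior policy: expert acts on U, ignores S

# Potential rewards: the correct action is U XOR S.
R1 = ((U ^ S) == 1).astype(float)  # R(1)
R0 = ((U ^ S) == 0).astype(float)  # R(0)
R = np.where(A == 1, R1, R0)       # observed reward R = sum_a R(a) I(A = a)

behavior_value = R.mean()          # expert's own value: ~0.5 (correct iff S == 0)
# E[R(a) | S = s] = 0.5 for every a and s here, so the best policy that
# depends on S alone (the standard optimal policy) also attains only ~0.5.
standard_value = max(R1.mean(), R0.mean())

A_super = A ^ S                    # super-policy: combines S with the expert's A
R_super = np.where(A_super == 1, R1, R0)
super_value = R_super.mean()       # ~1.0: strictly beats both baselines
```

In this construction the super-policy's advantage comes entirely from treating the expert's action as an informative covariate about U; the identification results in later sections address when this advantage is recoverable without oracle knowledge.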

