BLESSING FROM EXPERTS: SUPER REINFORCEMENT LEARNING IN CONFOUNDED ENVIRONMENTS

Abstract

We introduce super reinforcement learning in the batch setting, which takes the observed action as input for enhanced policy learning. In the presence of unmeasured confounders, the recommendations from human experts recorded in the observed data allow us to recover certain unobserved information. By including this information in the policy search, the proposed super reinforcement learning yields a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior policy (e.g., the expert's recommendation). Furthermore, to address the issue of unmeasured confounding in finding super-policies, we establish a number of non-parametric identification results. Finally, we develop two super-policy learning algorithms and derive their corresponding finite-sample regret guarantees.

1. INTRODUCTION

Offline reinforcement learning (RL) aims to find a sequence of optimal policies by leveraging batch data (Sutton & Barto, 2018; Levine et al., 2020). In many high-stakes domains such as medical studies (Kosorok & Laber, 2019), it is very costly or dangerous to interact with the environment for online data collection, and learning must rely entirely on pre-collected observational or experimental data. Recently, there has been a surge of interest in offline RL theory and methods. Most existing solutions rely on the unconfoundedness assumption, which excludes the existence of latent variables that confound the action-reward or action-next-observation associations. However, in practice we often encounter unmeasured confounding, under which most existing RL algorithms will lead to sub-optimal policies. In this paper, we study offline policy learning in confounded contextual bandits and sequential decision making. Existing works on policy learning have focused on searching for an optimal policy that depends purely on the past history, ignoring the action recommended by the human expert in the observed data. In many applications, there is a common belief that human decision-makers have access to important information that is not recorded in the observed data when taking an action (Kleinberg et al., 2018). For example, in urgent care, clinicians leverage visual observations or communications with patients to recommend treatments, where such unstructured information is hard to quantify and often not recorded (McDonald, 1996). Another motivating example is given by deep brain stimulation (DBS; Lozano et al., 2019). Due to recent advances in DBS technology, it has become feasible to instantly collect electroencephalogram data, based on which we are able to provide adaptive stimulation to specific regions of the brain so as to treat patients with neurological disorders, including Parkinson's disease, essential tremor, etc.
In these applications, the patient is allowed to determine the behavior policy (e.g., when to turn the stimulation on/off, for how long, etc.) based on information only known to herself (e.g., how she feels), therefore generating batch data with unmeasured confounders. We notice that, despite the challenges of policy learning with latent confounders, human recommendations may capture certain unobserved information, as discussed in the aforementioned applications. Including this information as an input to the policy can enhance policy learning, which is indeed "a blessing from experts". Therefore, in this paper, we ask: Is it possible to consistently learn an optimal policy that takes both the data history and the human recommendation at the current time as input for better decision making? We will answer the above question affirmatively. Specifically, we first introduce a novel framework called super RL, which, compared with standard RL, additionally takes the human's recommendation as input for policy learning. In confounded environments, super RL can embrace the blessing from experts. In other words, it leverages human expertise in discovering unobserved information for enhanced policy learning. The resulting policy, which we call the super-policy, is guaranteed to outperform both the standard optimal policy learned without using human expertise and the behavior policy that may depend on the hidden state. To implement the proposed super-policy for decision making in the future, we require the human expert to recommend an action at each time, which is commonly seen in practice. The super-policy then takes this action and other observations as input and overrides the recommendation produced by the expert.
Second, to address the challenge of partial observability or unmeasured confounding, we establish several non-parametric identification results for finding these super-policies in various confounded environments, leveraging recent developments in causal inference (Tchetgen Tchetgen et al., 2020). Notably, our identification results prove that the super-policy is learnable from the observed data despite the presence of unmeasured confounding. Finally, we develop two super RL algorithms and derive the corresponding finite-sample regret guarantees for finding a desirable super-policy, which are polynomial in all relevant parameters.

2. RELATED WORK

There is an increasing interest in studying off-policy evaluation (OPE) and learning in sequential decision making problems with unmeasured confounding. Specifically, Zhang & Bareinboim (2016) introduced the causal RL framework and the confounded Markov decision process (MDP) with memoryless unmeasured confounding, under which the Markov property holds in the observed data. Along this direction, many OPE and learning methods have been proposed using instrumental or mediator variables (Chen & Zhang, 2021; Liao et al., 2021; Li et al., 2021; Wang et al., 2021; Shi et al., 2022; Fu et al., 2022; Yu et al., 2022). In addition, partial identification bounds for the off-policy value have been established based on sensitivity analysis (Namkoong et al., 2020; Kallus & Zhou, 2020; Bruns-Smith, 2021). Another stream of research focuses on general confounded POMDP models that allow for both unmeasured confounding and partial observability, for which several point identification results were established (Tennenholtz et al., 2020; Bennett & Kallus, 2021; Nair & Jiang, 2021; Shi et al., 2021; Ying et al., 2021; Miao et al., 2022). However, none of the aforementioned works study policy learning with the help of human expertise, i.e., taking the recommended action in the observed data as input for decision making. Different from these works, we tackle the policy learning problem from a unique perspective and propose a novel super RL framework that leverages human expertise in discovering certain unobserved information to further improve decision making. We also rigorously establish the super-optimality of the proposed super-policy over the standard optimal policy and the behavior policy.
Our paper is also related to a line of work on policy learning and evaluation under partial observability using spectral decomposition and predictive state representation methods (see e.g., Littman & Sutton, 2001; Song et al., 2010; Boots et al., 2011; Hsu et al., 2012; Singh et al., 2012; Anandkumar et al., 2014; Jin et al., 2020; Cai et al., 2022; Lu et al., 2022; Uehara et al., 2022a;b). Nonetheless, these methods require the no-unmeasured-confounders assumption. Finally, our proposal is motivated by the work of Stensrud & Sarvet (2022), which introduced the concept of a super-optimal treatment regime in contextual bandits and used an instrumental variable approach to discover such a regime. However, their method can only be applied in a restrictive single-stage decision making setting with binary actions. In contrast, our super RL framework is generally applicable to both confounded contextual bandits and sequential decision making, allowing arbitrarily many actions. It is also worth mentioning that the proposed super RL differs from the recently proposed safe RL via human intervention (Saunders et al., 2017), where human intervention is performed to override bad actions recommended by the intelligent agent. We instead aim to leverage the human expertise in the previously collected data for intelligent agents to make better decisions.

3. SUPER RL: A CONTEXTUAL BANDIT EXAMPLE

In this section, we introduce the super-policy in confounded contextual bandits (i.e., single-stage decision making with unmeasured confounders). Consider a random tuple (S, U, A, {R(a)}_{a∈A}), where S and U denote the observed and unobserved features respectively, A denotes the action whose space is given by a finite set A, and {R(a)}_{a∈A} denotes the set of potential/counterfactual rewards, with R(a) representing the reward that the agent would receive had action a been taken. The observed reward, denoted by R, can then be written as R = Σ_{a∈A} R(a)I(A = a).

Table 1: Policy values of π_b, π* and ν* in the toy example under different degrees of confounding ε.

             V(π_b)   V(π*)   V(ν*)
  ε = 0.5      0.0     0.4     0.4
  ε = 0        0.6     0.4     1.0
  ε = 1       -0.6     0.4     1.0

Denote the spaces of S and U by S and U respectively. Let π : S → P(A) denote a policy depending only on the observed information S, where P(A) refers to the class of all probability distributions over A. In particular, π(a | s) refers to the probability of choosing an action a given that S = s. In the batch setting, we are given i.i.d. copies of (S, A, R), where the action A is generated by some behavior policy π_b : S × U → P(A) that depends on both observed and unobserved features. Since U is unobserved, nearly all existing solutions focus on finding an optimal policy π* given by

π*(a* | s) = 1 if a* = argmax_{a∈A} E[R(a) | S = s], ∀s ∈ S, (1)

assuming the uniqueness of the maximizer in equation 1 for every s ∈ S. In addition, notice that U may confound the causal relationship between the action and the reward in the observational data. Ignoring this latent confounder will produce a biased estimator of π*. As discussed earlier, in this paper we aim to find an optimal policy that leverages the input of human expertise, since actions generated by the behavior policy depend on the latent information. In particular, we search for a super-policy ν* in a larger policy class Ω = {ν : S × A → P(A)} such that

ν*(a* | s, a′) = 1 if a* = argmax_{a∈A} E[R(a) | S = s, A = a′], ∀(s, a′) ∈ S × A. (2)
The two optimal policies are equivalent when the unconfoundedness assumption holds. When this condition is violated, E[R(a) | S = s, A = a′] ≠ E[R(a) | S = s] in general. More importantly, it follows from Proposition 1 of Stensrud & Sarvet (2022) that the value under ν* is no worse, and often larger, than that under π*. This yields the super-optimality of ν* over π*. It is also worth mentioning that in the presence of latent confounders, there is no guarantee that the standard optimal policy π* outperforms the behavior policy π_b, because π_b depends on the unobserved information. In contrast, since π_b ∈ Ω, the proposed super-policy is always at least as good as π_b. Specifically, let V(ν) be the value under the intervention of a generic policy ν, i.e., V(ν) = Σ_{a∈A} E[R(a)ν(a | S, A)]. We have the following lemma that demonstrates the super-optimality of ν* over both π* and π_b.

Lemma 3.1 (Super-Optimality). V(ν*) ≥ max{V(π_b), V(π*)}.

Intuitively, the super-optimality of ν* comes from the use of the unobserved information U contained in π_b. We consider the following toy example to elaborate.

Toy Example: Assume S and U independently follow a Bernoulli distribution with success probability 0.5. Suppose the action is binary and the behavior policy satisfies P(A = 1 | S, U = 1) = P(A = 0 | S, U = 0) = 1 − ε for some 0 ≤ ε ≤ 1. Let R = 8(A − 0.5)(S − 0.2)(U − 0.3). In this example, the parameter ε measures the degree of unmeasured confounding. When ε = 0.5, the behavior policy does not depend on U and the no-unmeasured-confounders assumption is automatically satisfied. Otherwise, this condition is violated. In particular, when ε = 0 or 1, we can fully recover the latent confounder based on the recommended action. Table 1 summarizes the policy values of π_b, π* and ν* under different ε, in which the super-optimality holds.
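Because the toy example is fully discrete, the entries of Table 1 can be verified by direct enumeration. The sketch below is our own illustration (the function name `toy_values` is not from the paper): it computes V(π_b), V(π*) and V(ν*) for any ε by averaging over the joint law of (S, U, A).

```python
from itertools import product

def toy_values(eps):
    """Enumerate the toy example: S, U ~ Bernoulli(0.5) independently,
    behavior policy P(A=1|U=1) = P(A=0|U=0) = 1 - eps, and reward
    R(a) = 8(a - 0.5)(S - 0.2)(U - 0.3)."""
    def r(a, s, u):
        return 8 * (a - 0.5) * (s - 0.2) * (u - 0.3)

    def pb(a, u):  # behavior policy; depends only on U here
        return (1 - eps) if a == u else eps

    # V(pi_b): average reward under the behavior policy.
    v_b = sum(0.25 * pb(a, u) * r(a, s, u)
              for s, u, a in product([0, 1], repeat=3))

    # V(pi*): for each s, play argmax_a E[R(a) | S = s].
    v_std = sum(0.5 * max(sum(0.5 * r(a, s, u) for u in [0, 1])
                          for a in [0, 1]) for s in [0, 1])

    # V(nu*): for each (s, a'), play argmax_a E[R(a) | S = s, A = a'].
    v_super = 0.0
    for s, ap in product([0, 1], repeat=2):
        p_sap = sum(0.25 * pb(ap, u) for u in [0, 1])  # P(S = s, A = a')
        if p_sap == 0:
            continue
        best = max(sum(0.25 * pb(ap, u) * r(a, s, u) for u in [0, 1]) / p_sap
                   for a in [0, 1])
        v_super += p_sap * best
    return round(v_b, 10), round(v_std, 10), round(v_super, 10)
```

Running `toy_values` at ε ∈ {0.5, 0, 1} reproduces the three rows of Table 1.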
Despite its appealing property, it is generally impossible to learn the super-policy ν* without further assumptions, since the counterfactual effect E[R(a) | S = s, A = a′] is not identifiable from the observed data due to unmeasured confounding. Toward that end, we adopt the proximal causal inference framework developed by Tchetgen Tchetgen et al. (2020). Specifically, we assume the existence of certain action and reward proxies Z ∈ Z and W ∈ W in addition to (S, A, R). These proxies are required to satisfy the following assumptions (Miao et al., 2018b):

Assumption 1. (a) R ⊥⊥ Z | (S, U, A); (b) W ⊥⊥ (Z, A) | (S, U) and W ⊥̸⊥ U | S; (c) R(a) ⊥⊥ A | (S, U) for a ∈ A; (d) there exists a bridge function q : W × A × S → R such that E[q(W, a, S) | U, S, A = a] = E[R | U, S, A = a].

Assumptions 1(a)-(b) are standard in proximal causal inference, requiring the proxies to meet certain conditional independence conditions. Assumption 1(c), called latent unconfoundedness, is mild as we allow U to be unobserved. The last assumption can be satisfied when some completeness and regularity conditions hold; see Miao et al. (2018a) and also Lemma 3.3 below for more details. The following lemma then allows us to consistently learn the super-policy ν* from the observed data.

Lemma 3.2. Under Assumption 1, we have E[R(a) | S = s, A = a′] = E[q(W, a, S) | S = s, A = a′], which further implies that V(ν) = E[Σ_{a∈A} q(W, a, S)ν(a | S, A)] for any ν ∈ Ω.

In practice, one may want to include as many confounders in the policy as possible to achieve the largest super-optimality. Hence, under this proximal causal inference framework and with some abuse of notation, we further extend the policy class to Ω = {ν : S × Z × A → P(A)} and consider the corresponding super-policy ν* given by

ν*(a* | s, z, a′) = 1 if a* = argmax_{a∈A} E[R(a) | S = s, Z = z, A = a′], (4)

where Z here denotes the subset of the action proxies that remains available when we implement the super-policy.
In applications where the action proxy is no longer available in future decision making, equation 4 reduces to equation 2. We also remark that, different from Z, W is realized after the intervention. As such, it does not make sense to include W in the super-policy. The following corollary allows us to identify ν*.

Corollary 3.1. Under Assumption 1, the policy value under a given ν ∈ Ω is given by V(ν) = E[Σ_{a∈A} q(W, a, S)ν(a | S, A, Z)]. In addition, the optimal policy ν* is given by ν*(a* | s, z, a′) = 1 if a* = argmax_{a∈A} E[q(W, a, S) | S = s, Z = z, A = a′].

It can be seen from Corollary 3.1 that, to identify the super-policy, it remains to estimate the bridge function q defined in Assumption 1(d). One can impose the following completeness condition.

Assumption 2. For any square-integrable function g and for any (s, a) ∈ S × A, E[g(U) | Z, S = s, A = a] = 0 almost surely if and only if g(U) = 0 almost surely.

Lemma 3.3. Under Assumptions 1-2 and some regularity conditions (see Assumption 7 in Appendix E), solving the following linear integral equation with respect to q,

E[q(W, a, S) | Z, S, A = a] = E[R | Z, S, A = a], for every a ∈ A,

gives a valid bridge function that satisfies Assumption 1(d).

Built upon Corollary 3.1 and Lemma 3.3, Algorithm 1 summarizes the procedure to find ν* from a population perspective. A practical procedure that learns ν* from samples can be found in Appendix B.

Algorithm 1: Identification of ν* in confounded contextual bandits.
Input: i.i.d. copies of (S, Z, A, R, W).
1. Compute q by solving E[q(W, a, S) | Z, S, A = a] = E[R | Z, S, A = a] for every a ∈ A.
2. Compute a* = argmax_{a∈A} E[q(W, a, S) | S = s, Z = z, A = a′] for every (s, z, a′) ∈ S × Z × A.
Output: ν* with ν*(a* | s, z, a′) = 1 for every (s, z, a′).
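To make the identification strategy concrete, here is a tabular sketch of Lemmas 3.2-3.3 in the Section 3 toy example, augmented with hypothetical binary proxies of our own choosing: Z and W are conditionally independent noisy copies of U with P(Z = u | U = u) = 0.9 and P(W = u | U = u) = 0.8 (these conditional laws are assumptions for illustration, not from the paper). With finitely many levels, the linear integral equation becomes a small linear system in q(·, a, s).

```python
import numpy as np
from itertools import product

eps = 0.2                                        # degree of confounding
pz = lambda z, u: 0.9 if z == u else 0.1         # hypothetical action proxy Z
pw = lambda w, u: 0.8 if w == u else 0.2         # hypothetical reward proxy W
pb = lambda a, u: (1 - eps) if a == u else eps   # behavior policy
r = lambda a, s, u: 8 * (a - 0.5) * (s - 0.2) * (u - 0.3)

Pw = np.array([[pw(w, u) for w in [0, 1]] for u in [0, 1]])  # rows: u, cols: w

q = {}  # bridge function, q[s, a][w] = q(w, a, s)
for s, a in product([0, 1], repeat=2):
    # P(U = u | Z = z, S = s, A = a): joint proportional to pz(z|u) pb(a|u).
    p_uz = np.array([[pz(z, u) * pb(a, u) for u in [0, 1]] for z in [0, 1]])
    p_uz = p_uz / p_uz.sum(axis=1, keepdims=True)
    # Bridge equation E[q(W,a,S) | Z, S=s, A=a] = E[R | Z, S=s, A=a], with
    # P(W | Z, s, a) = P(U | Z, s, a) P(W | U) since W is independent of (Z, A) given U.
    M = p_uz @ Pw
    rhs = p_uz @ np.array([r(a, s, u) for u in [0, 1]])
    q[s, a] = np.linalg.solve(M, rhs)

# Lemma 3.2: E[q(W, a, S) | S = s, A = a'] recovers E[R(a) | S = s, A = a'].
for s, ap, a in product([0, 1], repeat=3):
    p_u = np.array([pb(ap, u) for u in [0, 1]])
    p_u = p_u / p_u.sum()                        # P(U = u | S = s, A = a')
    identified = (p_u @ Pw) @ q[s, a]            # E[q(W, a, s) | s, a']
    truth = p_u @ np.array([r(a, s, u) for u in [0, 1]])
    assert abs(identified - truth) < 1e-10
```

In this sketch, the completeness condition (Assumption 2) corresponds to invertibility of the matrix P(U | Z, s, a), which holds because Z is informative about U.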

4.1. MODEL SETUP AND SUPER-POLICIES IN SEQUENTIAL DECISION MAKING

In this section, we formally introduce the super-policy in confounded sequential decision making, demonstrate its super-optimality, and present several non-parametric identification results. Consider an episodic confounded POMDP denoted by M = (S, U, A, T, P, r), where S and U denote the spaces of observed and unobserved features respectively, A denotes the action space, T denotes the total length of horizon, P = {P_t}_{t=1}^T where each P_t denotes the transition kernel from S × U × A to S × U at time t, and r = {r_t}_{t=1}^T denotes the set of reward functions over S × U × A. The data following M can be summarized as {S_t, U_t, A_t, R_t}_{t=1}^T, where S_t and U_t correspond to the observed and latent features at time t, and A_t and R_t denote the action and the reward at time t. For simplicity, we assume the action space is discrete and all rewards are uniformly bounded, i.e., |R_t| ≤ R_max. Given an offline dataset, our objective is to learn an (in-class) optimal policy that maximizes the expected cumulative rewards. All existing works consider policies defined as a sequence of functions mapping from the past history (excluding the current action) to a probability mass function over the action space A. Specifically, given a generic policy π = {π_t}_{t=1}^T, one can define its value function as

V^π_t(s, u) = E^π[ Σ_{t′=t}^T R_{t′} | S_t = s, U_t = u ], for every (s, u) ∈ S × U,

where E^π denotes the expectation with respect to the distribution whose action at each time t follows π_t. Existing works aim to leverage the batch data to estimate an optimal policy that maximizes V(π) = E[V^π_1(S_1, U_1)], where we use E to denote the expectation with respect to the initial data distribution. Under unmeasured confounding, the observed action A_t in the batch data is generated by some behavior policy π^b_t : S × U → P(A) for 1 ≤ t ≤ T. Let π_b = {π^b_t}_{t=1}^T.
To handle unmeasured confounding, we similarly assume the existence of certain reward proxies {W_t}_{t=1}^T and action proxies {Z_t}_{t=1}^T that can help identify policy values. In sequential decision making, as shown in Tennenholtz et al. (2020), past and future observations can serve as the two proxies in confounded partially observable Markov decision processes (POMDPs). As such, our method can be applied to most confounded decision-making problems where human agents recommend actions in the future. Concrete examples of these proxies are given in later sections and Appendix A. Previous works such as Lu et al. (2022) focus on finding π* ∈ Π ≡ {π = {π_t}_{t=1}^T | π_t : S × Z_t → P(A)} such that π* = argmax_{π∈Π} V(π). In particular, when the Z_t's are certain current features that can serve as action proxies (see Section 4.2), Π corresponds to the class of stationary policies. When the Z_t's are given by the entire data history (see Section 4.3), Π corresponds to the class of general history-dependent policies. When the Z_t's are given by the most recent k-step observations (see Section 4.4), Π corresponds to the class of k-memory policies. Motivated by the discussions in Section 3, we propose to learn a super-policy ν* ∈ Ω ≡ {ν = {ν_t}_{t=1}^T | ν_t : S × Z_t × A → P(A)} that leverages human expertise for enhanced decision making and maximizes V(ν). Here, A in Ω reflects the action space at the current time point t. Actions recommended by the expert before time t can be included in Z_t; see Section 4.3 for more details. When considering Ω, the policy value V(ν) indeed depends on π_b as well, because to implement the proposed super-policy we require the human agent to produce an action according to π_b and then intervene using ν. However, to ease notation, we omit π_b when referring to V(ν).
As before, since the super-policy additionally uses the expert's recommendation that depends on the unobserved information, we expect the super-policy ν* to be superior to both π* and π_b, as shown below.

Theorem 4.1 (Super-Optimality). V(ν*) ≥ max{V(π*), V(π_b)}.

4.2. IDENTIFICATION OF STATIONARY SUPER-POLICIES VIA Q-BRIDGE FUNCTIONS

Under unmeasured confounding, we apply the proximal causal inference framework to sequential decision making and make the following assumptions to identify the policy value V(ν) for each ν ∈ Ω.

Assumption 3. (a) (Markovianity) The process {S_t, U_t, A_t, R_t}_{t=1}^T satisfies the Markov property, i.e., for any t, (R_t, S_{t+1}, U_{t+1}) depends on the past history only through (S_t, U_t, A_t). (b) (Reward proxy) W_t ⊥⊥ (A_t, U_{t-1}, S_{t-1}) | (U_t, S_t) and W_t ⊥̸⊥ U_t | S_t, for 1 ≤ t ≤ T. (c) (Action proxy) Z_t ⊥⊥ (R_t, W_t, S_{t+1}, U_{t+1}, W_{t+1}) | (U_t, S_t, A_t), for 1 ≤ t ≤ T.

Assumption 3 is satisfied by a wide range of confounded sequential decision making models; see Appendix A for detailed discussions. Specifically, Assumption 3(a) is mild: it essentially requires the data to be Markovian if we were to observe {U_t}_{t=1}^T. Assumptions 3(b) and (c) extend Assumption 1 to sequential decision making. In this section, we require the existence of current features that can serve as the action proxy and focus on learning an optimal stationary policy. Alternatively, one can set the action proxy to past observations, as in Sections 4.3-4.5, and study history-dependent policies. Without loss of generality, we also assume these action proxies continue to be available when making decisions in the future. Otherwise, we can restrict the super-policy to be a function of (S_t, A_t) only. To identify V(ν), and ultimately ν*, under unmeasured confounding, we rely on the existence of a class of Q-bridge functions {q^ν_t}_{t=1}^T defined over W × S × A such that for every t,

E^ν[ Σ_{t′=t}^T R_{t′} | U_t, S_t, A_t ] = E[ Σ_{a∈A} q^ν_t(W_t, S_t, a)ν_t(a | S_t, Z_t, A_t) | U_t, S_t, A_t ]. (9)

Theorem 4.2 (Identification). If there exist {q^ν_t}_{t=1}^T that satisfy equation 9, then the value of policy ν can be identified by V(ν) = E[Σ_{a∈A} q^ν_1(W_1, S_1, a)ν_1(a | S_1, Z_1, A_1)].
The following theorem proves the identifiability of these Q-bridge functions under certain completeness and regularity conditions. Together with Theorem 4.2, it forms the basis for learning the super-policy from the observed data. Let V^ν_t(W_t, S_t, Z_t, A_t) = Σ_{a∈A} q^ν_t(W_t, S_t, a)ν_t(a | S_t, Z_t, A_t).

Theorem 4.3. Under Assumption 3 and certain completeness and regularity conditions (Assumptions 8, 9 and 10 in Appendix F), there always exist Q-bridge functions {q^ν_t}_{t=1}^T satisfying equation 9. In particular, setting q^ν_{T+1} = 0, q^ν_t can be obtained by solving the following linear integral equations for t = T, . . . , 1:

E{ q^ν_t(W_t, S_t, A_t) − R_t − V^ν_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) | Z_t, S_t, A_t } = 0.

4.3. IDENTIFICATION OF GENERAL HISTORY-DEPENDENT SUPER-POLICIES

In this section, we set Z_t = {O_{1:t}, A_{1:(t-1)}}, S_t = ∅, and W_t to certain future features that can serve as a reward proxy satisfying Assumption 3(b) (e.g., conditionally independent of the current action). The corresponding space of Z_t is given by Z_t = ∏_{t′=1}^t O × ∏_{t′=1}^{t-1} A. Alternatively, one may set Z_t = {O_{1:(t-1)}, A_{1:(t-1)}} and W_t to the current observation, as in Tennenholtz et al. (2020); Shi et al. (2021), to meet Assumption 3. The resulting model reduces to a typical POMDP with unmeasured confounding, and we present the corresponding identification results in Section 4.5. We focus on the case where A_{1:(t-1)} in Z_t are generated by the behavior policy instead of the super-policy. The policy class we focus on is given by Ω_history = {ν = {ν_t}_{t=1}^T | ν_t : ∏_{t′=1}^t (O × A) → P(A)}, which includes all actions recommended by the expert for decision making, but not those generated by ν ∈ Ω. We leave the inclusion of these actions in the policy class as future work. To ease notation, we omit "history" in Ω_history when there is no confusion. Let O_0 denote some pre-collected observation before the decision process initiates. We impose the following additional assumption:

Assumption 4. (a) Z_{t+1} ⊥⊥ O_0 | (U_t, Z_t, A_t), for 1 ≤ t ≤ T − 1; (b) W_t ⊥⊥ O_0 | (U_t, Z_t, A_t), for 1 ≤ t ≤ T; (c) O_t ∈ O is generated from U_t by some unknown map H_t : U → O.

Assumptions 4(a)-(b) can be easily satisfied by initializing the decision process at t = 2. Assumption 4(c) is often imposed in POMDPs. Then we have the following identification result.

Theorem 4.4. Assume Assumptions 3 and 4 and certain completeness and regularity conditions in Appendix F hold. Define q^ν_{T+1} = 0, and {q^ν_t}_{t=1}^T over W × ∏_{t′=1}^t (O × A) as the solutions to the following linear integral equations for t = T, T − 1, . . . , 1:

E{ q^ν_t(W_t, Z_t, A_t) − R_t − Σ_{a∈A} q^ν_{t+1}(W_{t+1}, Z_{t+1}, a)ν_{t+1}(a | Z_{t+1}, A_{t+1}) | Z_t, O_0, A_t } = 0.

Then the policy value for any ν ∈ Ω_history can be identified as V(ν) = E[q^ν_1(W_1, Z_1, A_1)]. Theorem 4.4 thus allows us to identify the value of general history-dependent policies.

4.4. IDENTIFICATION OF K-STEP HISTORY-DEPENDENT SUPER-POLICIES

In Section 4.3, we discussed how to identify the value of a history-dependent policy by taking Z_t to be the past observations up to time t. As a result, the dimension of Z_t increases linearly in t, resulting in the curse of dimensionality and history (Pineau et al., 2006). In this section, we consider a more practical class of policies that only use the most recent k-step observations. Policies of this type are widely used in practice (see e.g., Mnih et al., 2015; Berner et al., 2019). To begin with, let W_t be the future reward proxy that satisfies Assumption 3(b). For any t ≥ k + 1, let Z_t ∈ Z_t denote the observed history from time t − k up to time t, i.e., Z_t = (O_{(t-k):t}, A_{(t-k):(t-1)}). We further define Z̄_t = Z_t ∩ Z_{t+1} ∈ Z̄_t as a subset of Z_t. Next, we define the Q-bridge functions {q^ν_t}_{t=k+1}^T over W × Z̄_t × A such that for every (u, a) ∈ U × A and t ≥ k + 1,

E^ν_{t:T}[ Σ_{t′=t}^T R_{t′} | U_t, A_t ] = E[ Σ_{a∈A} q^ν_t(W_t, Z̄_t, a)ν_t(a | Z_t, A_t) | U_t, A_t ]. (13)

Under certain regularity conditions (Assumptions 13 and 14 specified in Appendix F), we are able to identify the Q-bridge functions {q^ν_t}_{t=k+1}^T through the following linear integral equations.

Theorem 4.5. Under Assumptions 3 and 4(c), and Assumptions 13 and 14 in Appendix F, there exist Q-bridge functions {q^ν_t}_{t=k+1}^T satisfying equation 13. In particular, setting q^ν_{T+1} = 0, q^ν_t can be obtained by solving the following linear integral equations for t = T, . . . , k + 1:

E{ q^ν_t(W_t, Z̄_t, A_t) − R_t − Σ_{a∈A} q^ν_{t+1}(W_{t+1}, Z̄_{t+1}, a)ν_{t+1}(a | Z_{t+1}, A_{t+1}) | Z_t, A_t } = 0. (14)

As for 1 ≤ t ≤ k, take Z_t = {O_{1:t}, A_{1:(t-1)}}. If additionally Assumptions 11 and 12 in Appendix F and Assumptions 4(a)-(b) on O_0 hold for 1 ≤ t ≤ k, then there exist {q^ν_t}_{t=1}^k over W × ∏_{t′=1}^t (O × A) as the solutions to the following linear integral equations for t = k, . . . , 1:

E{ q^ν_t(W_t, Z_t, A_t) − R_t − Σ_{a∈A} q^ν_{t+1}(W_{t+1}, Z_{t+1}, a)ν_{t+1}(a | Z_{t+1}, A_{t+1}) | Z_t, O_0, A_t } = 0,

where O_0 denotes the pre-collected observation defined in Section 4.3. Finally, the policy value can be identified as V(ν) = E[q^ν_1(W_1, Z_1, A_1)].

We remark that the requirement on O_0 in Theorem 4.5 is much weaker than that in Theorem 4.4. In particular, here we only need Assumptions 4(a)-(b), 11 and 12 to hold for the first k steps. When t ≥ k + 1, we require the variability of Z_t to cover the variability of (U_t, Z̄_t), which to some extent requires the observation at the k-th lag to have sufficient variability relative to the variability of the unobserved state U_t at the current time. As the lag k increases, this assumption becomes more restrictive.

4.5. ALTERNATIVE IDENTIFICATION OF SUPER-POLICIES

In Section 4.2, we discussed how to identify the policy value via Q-bridge functions, assuming the existence of certain future observations W_t that can serve as a reward proxy and are conditionally independent of the current action. As commented earlier, this condition can be relaxed by setting W_t = O_t, Z_t = {O_{1:(t-1)}, A_{1:(t-1)}} and S_t = ∅. The resulting data generating process reduces to the POMDP model studied in Tennenholtz et al. (2020). However, based on the identification results in Sections 4.3-4.4, this rules out the dependence of the super-policy on the most recent observation, which could be restrictive. In the following, we provide a remedy for this limitation. For simplicity, we focus on identifying the value of a given history-dependent super-policy ν = {ν_t}_{t=1}^T, where ν_t : ∏_{t′=1}^t (O × A) → P(A) depends on all the past observations and recommended actions. We consider a tabular setting where all random variables can only take finitely many values, and use boldface letters r ∈ R^{d_r}, u ∈ R^{d_u}, o ∈ R^{d_o} to represent the vectors consisting of all possible rewards, latent states, and observations. Meanwhile, our results can be extended to general settings as well using value-bridge functions (Shi et al., 2021). Let O_0 denote some pre-collected observation. The following assumption summarizes the model conditions:

Assumption 5. (a) The process {U_t, A_t}_{t=1}^T satisfies the Markov property; (b) for all 1 ≤ t ≤ T, the observation O_t is generated from U_t by some unknown map H_t : U → O; (c) for all 1 ≤ t ≤ T, O_{t-1} ⊥⊥ (R_t, O_t, U_{t+1}) | (U_t, A_t).
We define the following matrices:

[P^{(t,r)}_{o,a}]_{i,j} = Pr(R_t = r_i, O_t = o | A_t = a, O_{t-1} = o_j), with P^{(t,r)}_{o,a} ∈ R^{d_r × d_o};
[P^{(t)}_a]_{i,j} = Pr(O_t = o_i | A_t = a, O_{t-1} = o_j), with P^{(t)}_a ∈ R^{d_o × d_o};
[P^{(t,o)}_{o,a′,a}]_{i,j} = Pr(O_{t+1} = o_i, O_t = o, A_{t+1} = a′ | A_t = a, O_{t-1} = o_j), with P^{(t,o)}_{o,a′,a} ∈ R^{d_o × d_o};
[P^{(t)}_{a,u}]_{i,j} = Pr(U_t = u_i | A_t = a, O_{t-1} = o_j), with P^{(t)}_{a,u} ∈ R^{d_u × d_o}.

Theorem 4.6. Under Assumption 5, as long as P^{(t)}_a and P^{(t)}_{a,u} are invertible for every t = 1, . . . , T and a ∈ A, the value function V(ν) for any ν ∈ Ω is identifiable. In particular,

V(ν) = Σ_{t=1}^T Σ_{o_1,a_1,a′_1,…,o_t,a_t,a′_t} ( ∏_{k=1}^t ν_k(a_k | o_k, a′_k, …, o_1, a′_1) ) r^⊺ ( P^{(t,r)}_{o_t,a_t} [P^{(t)}_{a_t}]^{-1} ) ( ∏_{k=t-1}^{1} P^{(k,o)}_{o_k,a′_{k+1},a_k} [P^{(k)}_{a_k}]^{-1} ) Pr(O_1 = o, A_1 = a_1).
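In the tabular setting, the invertibility conditions of Theorem 4.6 are easy to check numerically. The sketch below is our own toy instance (not from the paper): a two-level system in which O_t = U_t, so that P^{(t)}_{a,u} coincides with P^{(t)}_a, with an action-independent latent transition chosen for brevity.

```python
import numpy as np

# Our own two-level toy instance: U_t, O_t, A_t binary, O_t = U_t (a trivially
# invertible emission map H_t), latent transition P(U_t = 1 | U_{t-1} = u) =
# 0.1 + 0.6u (action-independent, for brevity), and a confounded behavior
# policy P(A_t = u | U_t = u) = 0.8.
trans = lambda u1, u0: (0.1 + 0.6 * u0) if u1 == 1 else (0.9 - 0.6 * u0)
pb = lambda a, u: 0.8 if a == u else 0.2

P_a = {}
for a in [0, 1]:
    # [P_a]_{i,j} = Pr(O_t = o_i | A_t = a, O_{t-1} = o_j) under the behavior
    # law: transition into u_t = i, then select A_t = a via pb, then normalize.
    M = np.empty((2, 2))
    for j in [0, 1]:
        col = np.array([trans(i, j) * pb(a, i) for i in [0, 1]])
        M[:, j] = col / col.sum()
    P_a[a] = M
    # With O_t = U_t, P_{a,u} coincides with P_a, so one rank check covers both.
    assert np.linalg.matrix_rank(M) == 2
```

When the emission map is many-to-one or the latent transition is nearly degenerate, these rank checks can fail, in which case Theorem 4.6 no longer applies.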

5. SUPER-POLICY LEARNING WITH REGRET GUARANTEE

Based on the established identification results, we introduce our super-policy learning algorithms and establish the corresponding finite-sample regret bounds. We focus on the settings described in Sections 3 and 4.2; other settings can be studied similarly, which we leave for future work.

5.1. CONFOUNDED CONTEXTUAL BANDITS: REGRET GUARANTEES

We develop a practical algorithm in Appendix B based on minimax estimation (Dikkala et al., 2020). Let ν̂* denote the output of Algorithm 3 in Appendix B, which relies on the estimator q̂ of the bridge function q given by equation 6. Define the L_2 norm of a generic function f as ∥f∥_2 ≡ (E[f^2])^{1/2}. Let g(S, Z, A; f) ≡ E[f(W, S) | S, Z, A] for any f defined over W × S. For a given projection estimator Ê, let ĝ(S, Z, A; f) ≡ Ê[f(W, S) | S, Z, A] denote the corresponding estimator. Define

p_max = sup_{u,s,z,a′,ν∈Ω} Σ_{a∈A} π_b(A = a | U = u, S = s)ν(A′ = a′ | Z = z, S = s, A = a) / π_b(A′ = a′ | U = u, S = s).

Lemma 5.1. Suppose q belongs to a function class Q of functions defined over W × S × A. Define the projection error as ξ_n := sup_{q∈Q, a∈A} ∥g[·, ·, · ; q(·, ·, a)] − ĝ[·, ·, · ; q(·, ·, a)]∥_2, and the bridge function estimation error as ζ_n := ∥q̂ − q∥_2. Then we obtain the following regret decomposition: V(ν*) − V(ν̂*) ≤ 2(ξ_n + p_max ζ_n).

Suppose q̂ and the projection estimator are computed by the procedure described in Appendix B. When Q (the function space for q) and G (the function space for the projected function) are VC-subgraph classes, we have the following regret guarantee. Results when G and Q are reproducing kernel Hilbert spaces (RKHSs) are provided in Appendix I.3.

Theorem 5.1. Suppose the star-shaped spaces G and Q are VC-subgraph classes with VC dimensions V(G) and V(Q), respectively. Then under the assumptions in Theorems I.2 and I.4, with probability at least 1 − δ,

V(ν*) − V(ν̂*) ≲ n^{-1/2} p_max √(log(1/δ) + max{V(G), V(Q)}),

where for any two positive sequences {a_n}_n, {b_n}_n, a_n ≲ b_n means that there exists some constant C > 0 such that a_n ≤ C b_n for all n.

5.2. CONFOUNDED SEQUENTIAL DECISION MAKING: REGRET GUARANTEES

Now we present our super-policy learning algorithm for the sequential setting introduced in Section 4.2. Given the identification results in Theorems 4.2 and 4.6, one way to obtain the super-policy $\nu^*$ is to search directly over the space of super-policies for the maximizer of the estimated value, i.e., $\hat{\nu} = \arg\max_{\nu \in \Omega} \hat{\mathcal{V}}(\nu)$. However, when $T$ is large and the models imposed for estimating the bridge functions are complex (e.g., deep neural networks), directly optimizing $\hat{\mathcal{V}}(\nu)$ requires extensive computational power. Motivated by Theorem 4.3, we propose a fitted-Q-iteration-type algorithm (Algorithm 2) for practical implementation and establish the regret bound for finding the super-policy $\nu^*$ under memoryless unmeasured confounding.

Algorithm 2: Super RL for the confounded POMDP
Input: Data $\mathcal{D} = \{\mathcal{D}_t\}_{t=1}^T$ with $\mathcal{D}_t = \{(S_{i,t}, Z_{i,t}, A_{i,t}, R_{i,t}, W_{i,t}, S_{i,t+1}, Z_{i,t+1}, W_{i,t+1})\}_{i=1}^n$.
Let $\hat{q}_{T+1} = 0$ and $\hat{\nu}^*_T$ be an arbitrary policy. For $t = T, \ldots, 1$:
1. Obtain an estimator $\hat{q}_t$ of $q_t$ via the min-max estimation method in Appendix I.1 using $\mathcal{D}_t$.
2. Compute $\hat{E}[\hat{q}_t(W_t, S_t, a) \mid S_t = s, Z_t = z, A_t = a']$ for $a \in \mathcal{A}$ using the method in Appendix I.2, and obtain the estimated super-policy $\hat{\nu}^*_t$ as, for every $(a', z, s)$,
$$\hat{\nu}^*_t(a^* \mid s, z, a') = \mathbb{1}\Big(a^* = \arg\max_{a \in \mathcal{A}} \hat{E}[\hat{q}_t(W_t, S_t, a) \mid S_t = s, Z_t = z, A_t = a']\Big).$$
Output: $\hat{\nu}^* = \{\hat{\nu}^*_t\}_{t=1}^T$.

Assumption 6 (Memoryless Unmeasured Confounding). For $2 \le t \le T$, $U_t$ is independent of the data history (including the latent factors) up to time $t - 1$ given $S_t$.

We introduce some notation. Define $p^\nu_t$ and $p^{\pi_b}_t$ as the marginal distributions of all random variables at time $t$ under the policy $\nu$ and the behavior policy $\pi_b$ respectively. Define the constants $p_{t,\max} := \sup_{s,z,a} p^{\nu^*}_t(s, z, a) / p^{\pi_b}_t(s, z, a)$ and $p^\omega_{t,\max} := \sup_{s,z,a,\nu \in \Omega} \omega^\nu_t(s, z, a)$, where $\omega^\nu_t(s, z, a)$ denotes a certain density ratio whose explicit form is given in equation 29 of Appendix H. Let $\mathcal{Q}^{(t)}$ denote the class for modelling $q_t$.
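The backward recursion in Algorithm 2 can be sketched as a small driver function. This is a structural skeleton only: `fit_bridge` and `fit_projection` are our own placeholder names standing in for the min-max conditional-moment estimator of Appendix I.1 and the projection estimator of Appendix I.2, which can be any consistent implementations.

```python
def super_rl_fqi(data, fit_bridge, fit_projection, actions):
    """Backward-iteration skeleton of Algorithm 2 (a sketch, not the paper's code).

    data:            list of per-step batches D_t, t = 1..T (any format the
                     two fitters understand).
    fit_bridge:      placeholder for the min-max estimator of q_t; called as
                     fit_bridge(batch, q_next, nu_next), returns a callable
                     q_t(w, s, a).
    fit_projection:  placeholder for the estimator of
                     E[q_t(W_t, S_t, a) | S_t, Z_t, A_t]; called as
                     fit_projection(batch, q_t, a), returns a callable
                     (s, z, a_prev) -> float.
    """
    T = len(data)
    q_next, nu_next = None, None   # q_{T+1} = 0; nu*_T starts arbitrary
    policies = [None] * T
    for t in reversed(range(T)):
        q_t = fit_bridge(data[t], q_next, nu_next)
        proj = {a: fit_projection(data[t], q_t, a) for a in actions}

        def nu_t(s, z, a_prev, proj=proj):
            # Greedy super-policy: argmax_a Ehat[q_t(W_t, S_t, a) | s, z, a_prev].
            return max(actions, key=lambda a: proj[a](s, z, a_prev))

        policies[t] = nu_t
        q_next, nu_next = q_t, nu_t
    return policies
```

The only structural difference from standard fitted-Q iteration is that the greedy step conditions on the logged action `a_prev` in addition to $(s, z)$, which is exactly where the expert's private information enters.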
Define $g_t[S_t, Z_t, A_t; q(\cdot, \cdot, a)] := E[q(W_t, S_t, a) \mid S_t, Z_t, A_t]$ and $\hat{g}_t[S_t, Z_t, A_t; q(\cdot, \cdot, a)] := \hat{E}[q(W_t, S_t, a) \mid S_t, Z_t, A_t]$ for $q \in \mathcal{Q}^{(t)}$ and $a \in \mathcal{A}$. Consider the projection errors $\xi_{t,n} := \sup_{q \in \mathcal{Q}^{(t)}, a \in \mathcal{A}} \|g_t[\cdot, \cdot, \cdot\,; q(\cdot, \cdot, a)] - \hat{g}_t[\cdot, \cdot, \cdot\,; q(\cdot, \cdot, a)]\|_2$ and the bridge estimation errors $\zeta_{t,n} := \|\hat{q}_t - q_t\|_2$.

Lemma 5.2. Suppose $q_t \in \mathcal{Q}^{(t)}$ for $1 \le t \le T$ and $\hat{\nu}^*$ is computed via Algorithm 2. Then under Assumptions 3, 6, 8, 9 and 10, we obtain the following regret decomposition:
$$\mathcal{V}(\nu^*) - \mathcal{V}(\hat{\nu}^*) \lesssim \sum_{t=1}^T 2\, p_{t,\max}\, \xi_{t,n} + T \sqrt{\sum_{t=1}^T (p^\omega_{t,\max})^2\, \zeta_{t,n}^2}.$$

In Appendix I, we provide a detailed analysis of $\xi_{t,n}$ and $\zeta_{t,n}$ in terms of the critical radii of the local Rademacher complexities of the different spaces, when $\hat{q}_t$ is estimated by the conditional moment method and the projection $E[q(W_t, S_t, a) \mid S_t, Z_t, A_t]$ is estimated by empirical risk minimization. Here we provide a regret bound characterized by VC dimensions. Let $\mathcal{G}^{(t)}$ be the space of test functions in the min-max estimating procedure described in Appendix I.1, and $\mathcal{H}^{(t)}$ be the space of inner products between any policy $\nu \in \Omega$ and $q \in \mathcal{Q}^{(t)}$, with $\mathcal{H}^{(T+1)} = \{0\}$. See Appendix I.1 for the exact definitions of $\mathcal{G}^{(t)}$ and $\mathcal{H}^{(t)}$.

Theorem 5.2. Suppose the star-shaped spaces $\mathcal{G}^{(t)}$, $\mathcal{H}^{(t+1)}$ and $\mathcal{Q}^{(t)}$ are VC-subgraph classes with VC dimensions $\mathbb{V}(\mathcal{G}^{(t)})$, $\mathbb{V}(\mathcal{H}^{(t+1)})$ and $\mathbb{V}(\mathcal{Q}^{(t)})$ respectively for $1 \le t \le T$. Under the assumptions specified in Theorems I.1 and I.3, with probability at least $1 - \delta$,
$$\mathcal{V}(\nu^*) - \mathcal{V}(\hat{\nu}^*) \lesssim \sum_{t=1}^T (p_{t,\max} + p^\omega_{t,\max})(T - t + 1)^{2.5} \sqrt{\frac{\log(T/\delta) + \max\{\mathbb{V}(\mathcal{G}^{(t)}), \mathbb{V}(\mathcal{H}^{(t+1)}), \mathbb{V}(\mathcal{Q}^{(t)})\}}{n}}.$$
When $\mathcal{G}^{(t)}$, $\mathcal{Q}^{(t)}$ and $\mathcal{H}^{(t)}$ are RKHSs, we establish the corresponding results in Appendix I.3.

6. CONCLUSION

In this paper, we introduce super reinforcement learning, which takes the observed action as input for enhanced policy learning. We establish identification results for the super-policy in various confounded environments, propose practical algorithms for super-policy learning, and provide the corresponding finite-sample regret guarantees.

A POMDP STRUCTURES AND PROXY VARIABLES

A.1 POMDP STRUCTURES

In Figure 1, we illustrate the general POMDP structure for the variables $\{U_t, S_t, A_t, R_t\}_{t=1}^T$. Figure 2 provides an example of the POMDP structure under the memoryless assumption (Assumption 6). As Figure 2 shows, all the information from past time steps passes to the next step only through the current observed state $S_t$. Figure 3 illustrates the causal relationships among all the variables involved in the confounded POMDP. At any time step $t$, the reward proxy $W_t$ is related to the action $A_t$ only through $S_t$ and $U_t$; the action proxy $Z_t$ is related to the reward $R_t$ only through $S_t$ and $U_t$. In Section A.2, we provide further illustrations of the relationship of the proxy variables with the other variables.

A.2 PROXY VARIABLES

In this section, we discuss several options for proxy variables $W_t$ and $Z_t$ satisfying the basic assumption (Assumption 3). In Figure 4, we list some plausible causal relationships among $W_t$, $U_t$ and $A_t$. We require that an effect between $U_t$ and $W_t$ exists, whereas an effect between $W_t$ and $R_t$ is optional. For concrete examples of $W_t$, readers can refer to the discussion of type (c) variables in Tchetgen Tchetgen et al. (2020). Once $W_t$ is determined, we can select $Z_t$ accordingly. Figure 5 shows several different relationships of the action proxy $Z_t$ with the other variables. In the left plot of Figure 5, $Z_t$ is one of the causes of $A_t$ and $Z_t \perp\!\!\!\perp (U_t, S_t) \mid A_t$; in this case, $Z_t$ can be regarded as an instrumental variable for $A_t$. In the middle plot, $(U_t, S_t)$ is a direct cause of $Z_t$, while the effect between $Z_t$ and $A_t$ can go in either direction and is optional. In the right plot, $Z_t$ is a direct effect of $U_t$ and $S_t$, and again the effect between $Z_t$ and $A_t$ can go in either direction and is optional. For concrete examples of choices of $Z_t$ in observational studies, readers can refer to the discussion of type (b) variables in Tchetgen Tchetgen et al. (2020). In Sections 4.3 and 4.4, we also discuss the cases where $Z_t$ includes previous history.

B LEARNING ALGORITHM FOR CONTEXTUAL BANDITS

In this section, we present the practical algorithm (Algorithm 3) for finding the super-policy in our contextual bandit example. The key step is to estimate the bridge function $q$ from the linear integral equation stated in Lemma 3.3. When $\mathcal{S}$, $\mathcal{Z}$, $\mathcal{A}$ and $\mathcal{W}$ are all finite and discrete, $q$ can be estimated straightforwardly. In the following, we discuss the estimation over general spaces.

Algorithm 3: Learning algorithm for contextual bandits under unmeasured confounding
Input: Data $\mathcal{D} = \{(S_i, Z_i, A_i, R_i, W_i)\}_{i=1}^n$.
1. Obtain the estimator $\hat{q}$ of the bridge function $q$ by solving the estimating equation (equation 6) using $\mathcal{D}$.
2. Implement any supervised learning method to estimate $E[\hat{q}(W, S, a) \mid S, Z, A]$.
3. Compute $a^* = \arg\max_{a \in \mathcal{A}} \hat{E}[\hat{q}(W, S, a) \mid S = s, Z = z, A = a']$ for every $(s, z, a') \in \mathcal{S} \times \mathcal{Z} \times \mathcal{A}$.
Output: $\hat{\nu}^*$ with $\hat{\nu}^*(a^* \mid s, z, a') = 1$ and $\hat{\nu}^*(\tilde{a} \mid s, z, a') = 0$ for $\tilde{a} \ne a^*$.

We consider the conditional moment estimation procedure of Dikkala et al. (2020) and propose to estimate the Q-bridge function by
$$\hat{q} := \arg\min_{q \in \mathcal{Q}} \sup_{g \in \mathcal{G}} \Psi(q, g) - \lambda\Big(\|g\|^2_{\mathcal{G}} + \frac{U}{\Delta^2}\|g\|^2_{2,n}\Big) + \lambda\mu\|q\|^2_{\mathcal{Q}},$$
where $\Psi(q, g) = \frac{1}{n}\sum_{i=1}^n \{q(W_i, S_i, A_i) - R_i\}\, g(Z_i, S_i, A_i)$, $\mathcal{Q}$ is the function space in which the true bridge function is assumed to lie, $\mathcal{G}$ is the function space from which the test functions $g$ are drawn, and $\lambda, \mu, \Delta, U > 0$ are tuning parameters. As for the projection $\hat{E}[\hat{q}(W, S, a) \mid S = s, Z = z, A = a']$, the conditional moment framework could also be adopted; here we propose to estimate it via empirical risk minimization:
$$\hat{g}(\cdot, \cdot, \cdot\,; \hat{q}(\cdot, \cdot, a)) := \arg\min_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n [g(S_i, Z_i, A_i) - \hat{q}(W_i, S_i, a)]^2 + \mu\|g\|^2_{\mathcal{G}},$$
where $\hat{q}$ is defined in equation 16 and $\mu > 0$ is a tuning parameter.
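In the fully discrete case mentioned above, the estimating equation $E[R - q(W, S, A) \mid Z, S, A] = 0$ reduces, for each $(s, a)$, to a finite linear system: one moment condition per value of $z$, with unknowns $q(w, s, a)$. A minimal sketch of this plug-in solve (the function name and 0-indexed integer coding are our own conventions, not the paper's):

```python
import numpy as np

def estimate_bridge_tabular(S, Z, A, R, W, n_s, n_z, n_a, n_w):
    """Solve E[R - q(W, S, A) | Z, S, A] = 0 in the tabular case.

    For each (s, a), stack one moment condition per z,
        sum_w Phat(W = w | Z = z, S = s, A = a) q(w, s, a)
            = Ehat[R | Z = z, S = s, A = a],
    and solve the resulting n_z-by-n_w linear system by least squares.
    Variables are assumed coded as integers 0..k-1.
    """
    q = np.zeros((n_w, n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            M = np.zeros((n_z, n_w))
            r = np.zeros(n_z)
            for z in range(n_z):
                idx = (S == s) & (A == a) & (Z == z)
                if idx.sum() == 0:
                    continue  # no data for this cell; row stays zero
                for w in range(n_w):
                    M[z, w] = np.mean(W[idx] == w)
                r[z] = np.mean(R[idx])
            q[:, s, a] = np.linalg.lstsq(M, r, rcond=None)[0]
    return q
```

Identification requires the empirical matrix `M` to have full column rank, which is the finite-dimensional analogue of the completeness condition; with continuous variables this solve is replaced by the min-max estimator above.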

C SIMULATIONS C.1 SIMULATION STUDY FOR CONTEXTUAL BANDITS

In this section, we conduct two simulation studies to evaluate the performance of the proposed super-policy. The first is a contextual bandit example with tabular state values, which demonstrates that the super-policy performs better when the behavior policy reveals more information about the unmeasured confounders. The second is a contextual bandit example with a continuous state space, which demonstrates the performance of our algorithm based on the bridge function.

A contextual bandit with tabular state values. Similar to the toy example described in Section 3, we take $S$ and $U$ as independent binary variables with $\Pr(S = 1) = 0.5$ and $\Pr(U = 1) = 0.5$. The binary action $A$ is generated by the conditional probabilities $\Pr(A = 1 \mid U = 0) = \epsilon$ and $\Pr(A = 1 \mid U = 1) = 1 - \epsilon$, with different choices of $\epsilon \in [0, 1]$. The larger $|\epsilon - 0.5|$ is, the more information about $U$ is revealed in the observed action $A$. Both the reward proxy $W$ and the action proxy $Z$ are binary and are generated according to the conditional probabilities
$$\Pr(W = 1 \mid U = 0) = 0.4, \quad \Pr(W = 1 \mid U = 1) = 0.6; \quad \Pr(Z = 1 \mid U = 0) = 0.4, \quad \Pr(Z = 1 \mid U = 1) = 0.6.$$
Moreover, $W$ and $Z$ are conditionally independent given $U$. The observed reward is computed by $R = (U - 0.5)(A - 0.5) + e$, where $e \sim N(0, 0.5)$. Three types of policy classes are considered:

1. SONLY: $\mathcal{S} \to \mathcal{P}(\mathcal{A})$. The policy depends only on the observed state $S$.
2. SZONLY: $\mathcal{S} \times \mathcal{Z} \to \mathcal{P}(\mathcal{A})$. The policy depends on the observed state $S$ and the action proxy $Z$.
3. SUPER: $\mathcal{S} \times \mathcal{Z} \times \mathcal{A} \to \mathcal{P}(\mathcal{A})$. The super-policy class, where the policy depends on the observed state $S$, the action proxy $Z$, and the observed action $A$.

We implement Algorithm 1 to estimate the corresponding optimal policies for the different policy classes. Note that for SONLY and SZONLY, we perform the projection step (line 4) by conditioning on $S$ and $(S, Z)$ respectively. Since this is a tabular setting, we use empirical averages to approximate all the conditional expectations. In this simulation study, we consider the sample size $n = 5000$. As Table 2 shows, the super-policy produces smaller regret as $\epsilon$ deviates further from $0.5$, while the estimated optimal policies SONLY and SZONLY do not change and have larger regrets.
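The pattern in Table 2 can be anticipated in closed form. In this toy model $E[R \mid S, a] = 0$, so any policy ignoring $(Z, A)$ has population value $0$, while a policy that conditions on the logged action recovers information about $U$ from the expert's behavior. A sketch computing the population value of the greedy action-conditioned policy (for brevity we ignore the extra signal in $Z$, so this lower-bounds the full super-policy value; `super_policy_value` is our own name):

```python
def super_policy_value(eps):
    """Population value of the greedy policy conditioning on the logged
    action A only.  Model: Pr(U=1) = 0.5, Pr(A=1 | U=0) = eps,
    Pr(A=1 | U=1) = 1 - eps, and E[R | U, A] = (U - 0.5)(A - 0.5).
    """
    value = 0.0
    for a_obs in (0, 1):
        lik_u0 = eps if a_obs == 1 else 1 - eps        # Pr(a_obs | U = 0)
        lik_u1 = (1 - eps) if a_obs == 1 else eps      # Pr(a_obs | U = 1)
        p_a = 0.5 * lik_u0 + 0.5 * lik_u1              # marginal Pr(A = a_obs)
        p_u1 = 0.5 * lik_u1 / p_a                      # posterior Pr(U=1 | a_obs)
        # The best new action maximizes (p_u1 - 0.5)(a - 0.5), attaining a
        # conditional mean reward of 0.5 * |p_u1 - 0.5|.
        value += p_a * 0.5 * abs(p_u1 - 0.5)
    return value

# Any policy ignoring (Z, A) has value 0 here, since E[R | S, a] = 0.
print(super_policy_value(0.5))  # 0.0  (the expert's action carries no signal)
print(super_policy_value(0.1))  # 0.2  (the action almost reveals U)
```

The value grows linearly in $|\epsilon - 0.5|$, matching the qualitative behavior reported for the SUPER class.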

A contextual bandit with a continuous state

In this setting, we take $S$ and $U$ as independent Gaussian random variables with $S \sim N(0, 1)$ and $U \sim N(0, 1)$. The binary action $A$ is generated by the conditional probabilities $\Pr(A = 1 \mid U > 0) = \epsilon$ and $\Pr(A = 1 \mid U \le 0) = 1 - \epsilon$, with different choices of $\epsilon \in [0, 1]$. The larger $|\epsilon - 0.5|$ is, the more information about $U$ is revealed in the observed action $A$. We generate $W$ and $Z$ according to
$$W \mid (S, U) \sim N(S + 3U, 1); \quad Z \mid (S, U, A) \sim N(3S + U + 0.5A, 1).$$
Moreover, $W$ and $Z$ are conditionally independent given $(S, U)$. The observed reward is computed by $R = (U - 0.5)(A - 0.5) + e$, where $e \sim N(0, 0.5)$. For this continuous setting, we compute the Q-bridge function via the min-max conditional moment estimation described in Appendix I, taking $\mathcal{G}$ and $\mathcal{Q}$ as reproducing kernel Hilbert spaces (RKHSs) equipped with Gaussian kernels. The bandwidths of the Gaussian kernels are selected by the median heuristic, and the tuning parameters of the penalties are selected by cross-validation; computational details can be found in Section E of Dikkala et al. (2020). As for the projection step, we adopt kernel ridge regression (KRR), with the tuning parameter of the penalty selected by cross-validation. In this simulation study, we take the sample size $n = 1000$. Table 3 shows the simulation results over 50 replications. The observations are consistent with those in the tabular setting: the super-policy outperforms the two commonly used optimal policies when $\epsilon$ deviates from $0.5$.
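The projection step described above (kernel ridge regression with a Gaussian kernel and median-heuristic bandwidth) can be sketched in a few lines. This is a generic implementation under our own naming; the paper additionally cross-validates the penalty, for which `lam` is a stand-in.

```python
import numpy as np

def median_heuristic(X):
    """Gaussian-kernel bandwidth as the median pairwise distance."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.median(d[d > 0])

def krr_fit(X, y, lam=1e-4):
    """Kernel ridge regression: one way to implement the projection step
    Ehat[qhat(W, S, a) | S, Z, A].  `lam` stands in for the cross-validated
    penalty used in the paper.
    """
    h = median_heuristic(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * h ** 2))                       # Gaussian Gram matrix
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)

    def predict(Xq):
        sq_q = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_q / (2 * h ** 2)) @ alpha

    return predict
```

Here `X` stacks the conditioning variables $(S, Z, A)$ and `y` holds the evaluations $\hat{q}(W_i, S_i, a)$; one fit is run per candidate action $a$.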

C.2 A SIMULATION STUDY FOR SEQUENTIAL DECISION MAKING

In this section, we perform a simulation study to evaluate the performance of the super-policy in sequential decision making. The data generation process mainly follows that of Section C.1. We take the sample size $n = 1000$ and the length of episode $T = 20$. Note that this setting satisfies the memoryless assumption (Assumption 6). We implement Algorithm 2 to estimate the optimal policies from the three policy classes considered in Section C.1 by adjusting the projection step accordingly. We again use the RKHS modelling to perform the min-max conditional moment estimation for obtaining the sequence of Q-bridge functions, and implement KRR to estimate the projections at every iteration; see the implementation details in the discussion of the continuous setting in Section C.1. To obtain the regret, we estimate the optimal policy that depends on both $S_t$ and $U_t$ and use it to approximate the oracle optimal value. Table 4 summarizes the simulation results over 50 simulated datasets. As we can see, the super-policy performs significantly better than the other two commonly used optimal policies.

D REAL DATA APPLICATIONS D.1 APPLICATION TO RHC DATA

In this section, we evaluate the performance of our method on the dataset from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT; Connors et al., 1996). SUPPORT examined the effectiveness and safety of direct measurement of cardiac function by right heart catheterization (RHC) for certain critically ill patients in intensive care units (ICUs). This dataset has been studied in many existing works (e.g., Qi et al., 2021; Tchetgen Tchetgen et al., 2020). Our goal is to find an optimal policy on the usage of RHC that maximizes the 30-day survival rate of critically ill patients from the day they were admitted or transferred to the ICU. This dataset corresponds to the contextual bandit setting. There are 5735 patients, of whom 2184 were measured by RHC in the first 24 hours ($A = 1$) and the remaining were considered the control group ($A = 0$). If a patient survived or was censored at day 30, we set the response $Y = 1$; otherwise, we set $Y = -1$. Following the data pre-processing steps in Qi et al. (2021), we consider 71 covariates, including demographics, diagnosis, estimated survival probability, comorbidity, vital signs, and physiological status, among others. See the full list of covariates at https://hbiostat.org/data/repo/rhc.html. In particular, we take the action proxy $Z = (\text{pafi1}, \text{paco21})$ and the reward proxy $W = (\text{ph1}, \text{hema1})$. For more details and justifications of the choices of proxy variables, we refer readers to Section 6.1 of Tchetgen Tchetgen et al. (2020). We compare the super-policy with the following two policies considered in Qi et al. (2021): $d_1(L, Z)$ and $d_1(L)$, where $d_1(L, Z)$ corresponds to a policy in the class SZONLY and $d_1(L)$ to a policy in the class SONLY. To make the comparison fair, we use the same estimating procedure for the bridge functions in all three methods.
In addition, the RKHS modelling for the min-max conditional moment estimation is used to obtain the Q-bridge function; see the details of the RKHS modelling in the continuous setting in Section C.1. Since Qi et al. (2021) adopt linear models for the decision functions $d_1(L, Z)$ and $d_1(L)$, we also use linear regression to obtain the projection (line 4 in Algorithm 3). To evaluate the values of the different policies, we randomly hold out 40% of the data as the evaluation set $\mathcal{E}$: after obtaining the estimated optimal policies using 60% of the data, we evaluate the three estimated optimal policies on the remaining 40%. Take $\hat{q}$ as the estimated bridge function computed on $\mathcal{E}$. The evaluation is
$$\hat{\mathcal{V}}(\nu) = \hat{E}\Big\{\sum_{a \in \mathcal{A}} \hat{q}(W, S, a)\, \nu(a \mid S, Z, A)\Big\} \text{ for } \nu \in \text{SUPER}; \quad \hat{\mathcal{V}}(\pi) = \hat{E}\Big\{\sum_{a \in \mathcal{A}} \hat{q}(W, S, a)\, \pi(a \mid S, Z)\Big\} \text{ for } \pi \in \text{SZONLY}; \quad \hat{\mathcal{V}}(\pi) = \hat{E}\Big\{\sum_{a \in \mathcal{A}} \hat{q}(W, S, a)\, \pi(a \mid S)\Big\} \text{ for } \pi \in \text{SONLY},$$
where $\hat{E}$ denotes the average over the evaluation set $\mathcal{E}$. Table 5 shows the evaluation results over 20 random splits. As we can see, the super-policy produces higher policy values than the other two methods.
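The three plug-in value estimates above share a single form, differing only in which arguments the policy reads. A minimal sketch (our own helper; `q_hat` and `policy` are callables, and a SONLY or SZONLY policy simply ignores the unused arguments):

```python
def evaluate_policy(q_hat, policy, W, S, Z, A, actions):
    """Held-out plug-in value estimate
        Vhat = (1/|E|) sum_i sum_a qhat(W_i, S_i, a) * nu(a | S_i, Z_i, A_i),
    where nu returns the probability of action a given the arguments it uses.
    """
    n = len(S)
    total = 0.0
    for i in range(n):
        total += sum(q_hat(W[i], S[i], a) * policy(a, S[i], Z[i], A[i])
                     for a in actions)
    return total / n
```

Passing the full $(S, Z, A)$ signature to every policy keeps the comparison across SUPER, SZONLY and SONLY on identical footing, as in the random-split evaluation described above.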

D.2 APPLICATION TO MIMIC3 DATA

In this section, we use the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC-III) dataset (https://physionet.org/content/mimiciii/1.4/) to demonstrate the performance of the estimated optimal policies from the three policy classes (SONLY, SZONLY and SUPER). This dataset records longitudinal information (including demographics, vitals, labs and scores; see the details in Section 4.3 of Nanayakkara et al. (2022)) of patients who satisfied the sepsis criteria, and the goal is to learn an optimal personalized treatment strategy for sepsis. Despite the richness of the data collected in the ICU, the mapping between true patient states and clinical observations is usually ambiguous (Nanayakkara et al., 2022), which makes this dataset fit the setting of a confounded POMDP. We obtain a clean dataset following the same data pre-processing steps described in Raghu et al. (2017). Based on it, we take (vasopressor administration, fluid administration) as the action variable and $(-1) \times \text{SOFA}$ as the reward. We take (Weight, Temperature) as the reward proxy $W$, since these variables are not directly related to the action. All the remaining variables except the aforementioned ones are treated as observed state variables. The action proxy is taken as (Weight, Temperature) observed at the previous time step, as it is natural to assume that these lagged measurements are not directly related to the response at the current time step. To reduce the complexity of the action space, we discretize the vasopressor and fluid administrations into 2 bins each, instead of 5 as in the previous work (Raghu et al., 2017), which results in a 4-dimensional action space. The episode lengths differ across patients, so we fix the horizon at $T = 2$ and exclude those patients with records of fewer than 2 time steps.
Following the estimation steps described in Section C.2, we estimate the optimal policies under policy classes SONLY, SZONLY and SUPER respectively. We also adopt the idea of "random splitting" described in Section D.1 to evaluate different policies. Basically, we randomly divide the data into two parts with equal sample sizes. We use one part as the training data to learn optimal policies. The other part is used for evaluating the corresponding policies. We implement the off-policy evaluation method proposed by Miao et al. (2022) in the confounded POMDP to calculate the policy values. Table 5 summarizes the evaluation results over 20 random splits. As we can see, the super-policy produces higher policy values compared to the other two methods.

E TECHNICAL PROOFS IN SECTION 3

Proof of Lemma 3.1. We have
$$\mathcal{V}(\pi^*) = E\Big[\sum_{a \in \mathcal{A}} R(a)\, \pi^*(a \mid S)\Big] = E\Big[E\Big\{\sum_{a \in \mathcal{A}} R(a)\, \pi^*(a \mid S) \,\Big|\, S, Z, A\Big\}\Big] \le E\Big[E\Big\{\sum_{a \in \mathcal{A}} R(a)\, \nu^*(a \mid S, Z, A) \,\Big|\, S, Z, A\Big\}\Big] = \mathcal{V}(\nu^*),$$
where the inequality is due to the optimality of $\nu^*$. Similarly, for the behavior policy $\pi_b$, we can show that
$$\mathcal{V}(\pi_b) = E\Big[E\Big\{\sum_{a \in \mathcal{A}} R(a)\, \mathbb{1}(a = A) \,\Big|\, S, Z, A\Big\}\Big] \le E\Big[E\Big\{\sum_{a \in \mathcal{A}} R(a)\, \nu^*(a \mid S, Z, A) \,\Big|\, S, Z, A\Big\}\Big] = \mathcal{V}(\nu^*).$$

Proof of Lemma 3.2. We have
$$E[R(a) \mid S = s, A = a'] = E[E\{R(a) \mid U, S = s, A = a'\} \mid S = s, A = a'] = E[E\{R(a) \mid U, S = s\} \mid S = s, A = a'] \quad (18)$$
$$= E[E\{R \mid U, S = s, A = a\} \mid S = s, A = a'] = E[E\{q(W, a, S) \mid U, S = s, A = a\} \mid S = s, A = a'] \quad (19)$$
$$= E[E\{q(W, a, S) \mid U, S = s, A = a'\} \mid S = s, A = a'] \quad (20)$$
$$= E[q(W, a, S) \mid S = s, A = a'],$$
where equation 18 holds because of Assumption 1(c), equation 19 follows from equation 3 in Assumption 1, and equation 20 is due to Assumption 1(b).

To close this section, we prove Lemma 3.3. The following regularity condition is imposed.

Assumption 7 (Regularity conditions). For a probability measure $\mu$, let $L_2\{\mu(x)\}$ denote the space of all square-integrable functions of $x$ with respect to the measure $\mu(x)$, which is a Hilbert space endowed with the inner product $\langle g_1, g_2 \rangle = \int g_1(x) g_2(x)\, d\mu(x)$. For all $(s, a)$, define the operator
$$K_{s,a}: L_2\{\mu_{W \mid S,A}(w \mid s, a)\} \to L_2\{\mu_{Z \mid S,A}(z \mid s, a)\}, \quad h \mapsto E\{h(W) \mid Z = z, S = s, A = a\},$$
and its adjoint operator
$$K^*_{s,a}: L_2\{\mu_{Z \mid S,A}(z \mid s, a)\} \to L_2\{\mu_{W \mid S,A}(w \mid s, a)\}, \quad g \mapsto E\{g(Z) \mid W = w, S = s, A = a\}.$$
The following conditions hold:
(a) $\int_{\mathcal{W} \times \mathcal{Z}} f_{W \mid Z,S,A}(w \mid z, s, a)\, f_{Z \mid W,S,A}(z \mid w, s, a)\, dw\, dz < \infty$, where $f_{W \mid Z,S,A}$ and $f_{Z \mid W,S,A}$ are conditional density functions;
(b) $\int_{\mathcal{Z}} [E\{R \mid Z = z, S = s, A = a\}]^2 f_{Z \mid S,A}(z \mid s, a)\, dz < \infty$;
(c) there exists a singular decomposition $(\lambda_{s,a;\nu}, \phi_{s,a;\nu}, \psi_{s,a;\nu})_{\nu=1}^\infty$ of $K_{s,a}$ such that $\sum_{\nu=1}^\infty \lambda^{-2}_{s,a;\nu}\, |\langle E\{R \mid Z = z, S = s, A = a\}, \psi_{s,a;\nu}\rangle|^2 < \infty$.

Proof of Lemma 3.3. From equation 6, we have
$$0 = E[R - q(W, A, S) \mid Z, S, A] = E\{E[R - q(W, A, S) \mid U, Z, S, A] \mid Z, S, A\} = E\{E[R - q(W, A, S) \mid U, S, A] \mid Z, S, A\}, \quad (21)$$
where the last equality (equation 21) is due to Assumption 1(b). Then by Assumption 2, we have $E[R - q(W, A, S) \mid U, S, A] = 0$, which is exactly equation 3.
In addition, by Proposition 1 in Miao et al. (2018a), the solution to equation 6 exists under Assumption 7. This completes the proof of Lemma 3.3.

F TECHNICAL PROOFS IN SECTION 4

Proof of Theorem 4.1. First, note that there exist policies in $\Omega$ corresponding to $\pi_b$ and $\pi^*$ respectively. Specifically, for $\{\pi_{b,t}\}_{t=1}^T$, we can let $\nu^{\pi_b}_t(a \mid S_t, a') = \mathbb{1}(a = a')$ almost surely to recover $\pi_b$. For $\pi^*$, we can always choose $\nu^{\pi^*}$ such that $\nu^{\pi^*}(a \mid S_t, A_t) = \pi^*(a \mid S_t)$. Since $\nu^*$ is optimal over $\Omega$, its value is no smaller than that of either policy, which completes the proof that $\nu^*$ achieves super-optimality.

Next, to show Theorem 4.3, we need some additional conditions. For a probability measure $\mu$, let $L_2\{\mu(x)\}$ denote the space of all square-integrable functions of $x$ with respect to the measure $\mu(x)$, which is a Hilbert space endowed with the inner product $\langle g_1, g_2 \rangle = \int g_1(x) g_2(x)\, d\mu(x)$.

Assumption 8. $(Z_{t+1}, A_{t+1}) \perp\!\!\!\perp Z_t \mid (U_t, S_t, A_t)$ for $1 \le t \le T - 1$.

Assumption 9 (Completeness). For any $a \in \mathcal{A}$ and $1 \le t \le T$: (a) for any square-integrable function $g$, $E\{g(U_t, S_t) \mid Z_t, S_t, A_t = a\} = 0$ a.s. if and only if $g = 0$ a.s.; (b) for any square-integrable function $g$, $E\{g(Z_t) \mid W_t, S_t, A_t = a\} = 0$ a.s. if and only if $g = 0$ a.s.

Assumption 10 (Regularity conditions). For all $(s, a)$ and $t$, define the operator
$$K_{s,a;t}: L_2\{\mu_{W_t \mid S_t, A_t}(w \mid s, a)\} \to L_2\{\mu_{Z_t \mid S_t, A_t}(z \mid s, a)\}, \quad h \mapsto E\{h(W_t) \mid Z_t = z, S_t = s, A_t = a\},$$
and take $K^*_{s,a;t}$ as the adjoint operator of $K_{s,a;t}$. For any $Z_t = z$, $S_t = s$, $W_t = w$, $A_t = a$ and $1 \le t \le T$, the following conditions hold:
(a) $\int_{\mathcal{W} \times \mathcal{Z}} f_{W_t \mid Z_t, S_t, A_t}(w \mid z, s, a)\, f_{Z_t \mid W_t, S_t, A_t}(z \mid w, s, a)\, dw\, dz < \infty$, where $f_{W_t \mid Z_t, S_t, A_t}$ and $f_{Z_t \mid W_t, S_t, A_t}$ are conditional density functions;
(b) for any $g \in \mathcal{G}^{(t+1)}$, $\int_{\mathcal{Z}} [E\{R_t + g(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = z, S_t = s, A_t = a\}]^2 f_{Z_t \mid S_t, A_t}(z \mid s, a)\, dz < \infty$;
(c) there exists a singular decomposition $(\lambda_{s,a;t;\nu}, \phi_{s,a;t;\nu}, \psi_{s,a;t;\nu})_{\nu=1}^\infty$ of $K_{s,a;t}$ such that for all $g \in \mathcal{G}^{(t+1)}$, $\sum_{\nu=1}^\infty \lambda^{-2}_{s,a;t;\nu}\, |\langle E\{R_t + g(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = z, S_t = s, A_t = a\}, \psi_{s,a;t;\nu}\rangle|^2 < \infty$;
(d) for all $1 \le t \le T$, $v^\pi_t \in \mathcal{G}^{(t)}$, where $\mathcal{G}^{(t)}$ satisfies the regularity conditions (b) and (c) above.

Now we are ready to prove Theorem 4.3.

Proof of Theorem 4.3. Part I. Suppose there exists $q^\nu_t$ satisfying equation 10 for $1 \le t \le T$. Define $V^\nu_t(W_t, S_t, Z_t, A_t) = \sum_{a \in \mathcal{A}} q^\nu_t(W_t, S_t, a)\, \nu_t(a \mid S_t, Z_t, A_t)$ and $V^\nu_{T+1} = 0$. Then, by Assumption 3,
$$E[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t, S_t, A_t] = E\big[E[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, S_t, A_t] \mid Z_t, S_t, A_t\big] = E\big[E[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, A_t] \mid Z_t, S_t, A_t\big],$$
and similarly
$$E\{q^\nu_t(W_t, S_t, A_t) \mid Z_t, S_t, A_t\} = E\big[E\{q^\nu_t(W_t, S_t, A_t) \mid U_t, Z_t, S_t, A_t\} \mid Z_t, S_t, A_t\big] = E\big[E\{q^\nu_t(W_t, S_t, A_t) \mid U_t, S_t, A_t\} \mid Z_t, S_t, A_t\big].$$
Therefore, by Assumption 9(a), we have
$$E[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, A_t] = E\{q^\nu_t(W_t, S_t, A_t) \mid U_t, S_t, A_t\} \quad a.s.,$$
and for any $a \in \mathcal{A}$,
$$E[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, A_t = a] = E\{q^\nu_t(W_t, S_t, A_t) \mid U_t, S_t, A_t = a\} = E\{q^\nu_t(W_t, S_t, a) \mid U_t, S_t, A_t = a\} = E\{q^\nu_t(W_t, S_t, a) \mid U_t, S_t\}. \quad (22)$$
Next, we prove that
$$E^\nu\big[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, Z_t, A_t\big] = E\Big[\sum_{a \in \mathcal{A}} q^\nu_t(W_t, S_t, a)\, \nu_t(a \mid S_t, Z_t, A_t) \,\Big|\, U_t, S_t, Z_t, A_t\Big] \quad a.s. \quad (23)$$
Take $W_{t+1}(a), S_{t+1}(a), Z_{t+1}(a), U_{t+1}(a)$ as the counterfactual variables had action $a$ been taken at the current time $t$.
For any $a \in \mathcal{A}$,
$$E^\nu\big[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, Z_t, A_t = a\big]$$
$$= \sum_{a' \in \mathcal{A}} E\big[R_t(a') + V^\nu_{t+1}(W_{t+1}(a'), S_{t+1}(a'), Z_{t+1}(a'), \pi_b(S_{t+1}(a'), U_{t+1}(a'))) \mid U_t, S_t, Z_t, A_t = a\big]\, \nu_t(a' \mid S_t, Z_t, A_t = a)$$
$$= \sum_{a' \in \mathcal{A}} E\big[R_t(a') + V^\nu_{t+1}(W_{t+1}(a'), S_{t+1}(a'), Z_{t+1}(a'), \pi_b(S_{t+1}(a'), U_{t+1}(a'))) \mid U_t, S_t, A_t = a\big]\, \nu_t(a' \mid S_t, Z_t, A_t = a) \quad \text{(by Assumption 3)}$$
$$= \sum_{a' \in \mathcal{A}} E\big[R_t(a') + V^\nu_{t+1}(W_{t+1}(a'), S_{t+1}(a'), Z_{t+1}(a'), \pi_b(S_{t+1}(a'), U_{t+1}(a'))) \mid U_t, S_t\big]\, \nu_t(a' \mid S_t, Z_t, A_t = a)$$
$$= \sum_{a' \in \mathcal{A}} E\big[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, A_t = a'\big]\, \nu_t(a' \mid S_t, Z_t, A_t = a)$$
$$= \sum_{a' \in \mathcal{A}} E\big[q^\nu_t(W_t, S_t, a') \mid U_t, S_t\big]\, \nu_t(a' \mid S_t, Z_t, A_t = a) \quad \text{(by equation 22)}$$
$$= \sum_{a' \in \mathcal{A}} E\big[q^\nu_t(W_t, S_t, a') \mid U_t, S_t, Z_t, A_t = a\big]\, \nu_t(a' \mid S_t, Z_t, A_t = a),$$
where the fourth, fifth and last equalities use the unconfoundedness given $U_t$ and the fact that $W_t$ is independent of $(A_t, Z_t)$ given $(U_t, S_t)$. Therefore, equation 23 is verified.

Part II. We use the Bellman-like equation 23 to verify equation 9 and thus establish the identification results. First, at time $T$, by equation 23 and $V^\nu_{T+1} = 0$,
$$E^\nu(R_T \mid U_T, S_T, Z_T, A_T) = E\Big[\sum_{a \in \mathcal{A}} q^\nu_T(W_T, S_T, a)\, \nu_T(a \mid S_T, Z_T, A_T) \,\Big|\, U_T, S_T, Z_T, A_T\Big].$$
By induction, suppose that at time $t + 1$,
$$E^\nu\Big[\sum_{t'=t+1}^T R_{t'} \,\Big|\, S_{t+1}, U_{t+1}, Z_{t+1}, A_{t+1}\Big] = E\big[V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid S_{t+1}, U_{t+1}, Z_{t+1}, A_{t+1}\big].$$
Then at time $t$,
$$E^\nu\Big[\sum_{t'=t}^T R_{t'} \,\Big|\, U_t, S_t, Z_t, A_t\Big] = E^\nu\Big[R_t + E^\nu\Big\{\sum_{t'=t+1}^T R_{t'} \,\Big|\, U_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}, U_t, S_t, Z_t, A_t\Big\} \,\Big|\, U_t, S_t, Z_t, A_t\Big]$$
$$= E^\nu\Big[R_t + E^\nu\Big\{\sum_{t'=t+1}^T R_{t'} \,\Big|\, U_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}\Big\} \,\Big|\, U_t, S_t, Z_t, A_t\Big] \quad \text{(by Assumption 3)}$$
$$= E^\nu\big[R_t + E\{V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}\} \mid U_t, S_t, Z_t, A_t\big]$$
$$= E^\nu\big[R_t + E\{V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}, U_t, S_t, Z_t, A_t\} \mid U_t, S_t, Z_t, A_t\big] \quad \text{(by Assumption 3)}$$
$$= E^\nu\big[R_t + E^\nu\{V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}, U_t, S_t, Z_t, A_t\} \mid U_t, S_t, Z_t, A_t\big]$$
$$= E^\nu\big[R_t + V^\nu_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, S_t, Z_t, A_t\big] \quad \text{(by the law of total expectation)}$$
$$= E\Big[\sum_{a \in \mathcal{A}} q^\nu_t(W_t, S_t, a)\, \nu_t(a \mid S_t, Z_t, A_t) \,\Big|\, U_t, S_t, Z_t, A_t\Big] \quad \text{(by equation 23)}.$$

Part III. Now we prove the existence of the solution to equation 10. For $t = T, \ldots, 1$, by Assumption 10(a), $K_{s,a;t}$ is a compact operator for each $(s, a) \in \mathcal{S} \times \mathcal{A}$ (Carrasco et al., 2007, Example 2.3), so there exists a singular value system as stated in Assumption 10(c). Then by Assumption 9(b), we have $\mathrm{Ker}(K^*_{s,a;t}) = \{0\}$: for any $g \in \mathrm{Ker}(K^*_{s,a;t})$, by the definition of the kernel, $K^*_{s,a;t} g = E[g(Z_t) \mid W_t, S_t = s, A_t = a] = 0$, which implies $g = 0$ a.s. Therefore $\mathrm{Ker}(K^*_{s,a;t}) = \{0\}$ and $\mathrm{Ker}(K^*_{s,a;t})^\perp = L_2\{\mu_{Z_t \mid S_t, A_t}(z \mid s, a)\}$. By Assumption 10(b), $E\{R_t + g(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = \cdot\,, S_t = s, A_t = a\} \in \mathrm{Ker}(K^*_{s,a;t})^\perp$ for any given $(s, a) \in \mathcal{S} \times \mathcal{A}$ and any $g \in \mathcal{G}^{(t+1)}$. Condition (a) in Theorem 15.16 of Kress (1989) is now verified, and condition (b) there is satisfied given Assumption 10(c). Recursively applying the above argument from $t = T$ to $t = 1$ yields the existence of the solution to equation 10.

Next, we show the generalized identification results stated in Section 4.3. Before that, we make the following assumptions.
Assumption 11 (Completeness conditions for history-dependent policies). For any $a \in \mathcal{A}$ and $t = 1, \ldots, T$: (a) for any square-integrable function $g$, $E\{g(U_t, Z_t) \mid Z_t, O_0, A_t = a\} = 0$ a.s. if and only if $g = 0$ a.s.; (b) for any square-integrable function $g$, $E\{g(Z_t, O_0) \mid W_t, Z_t, A_t = a\} = 0$ a.s. if and only if $g = 0$ a.s.

Assumption 12 (Regularity conditions for history-dependent policies). For all $(z, a)$ and $t$, define the operator
$$K_{z,a;t}: L_2\{\mu_{W_t \mid Z_t, A_t}(w \mid z, a)\} \to L_2\{\mu_{O_0 \mid Z_t, A_t}(o \mid z, a)\}, \quad h \mapsto E\{h(W_t) \mid Z_t = z, O_0 = o, A_t = a\},$$
and take $K^*_{z,a;t}$ as the adjoint operator of $K_{z,a;t}$. For any $Z_t = z$, $O_0 = o$, $W_t = w$, $A_t = a$ and $1 \le t \le T$, the following conditions hold:
(a) $\int_{\mathcal{W} \times \mathcal{O}} f_{W_t \mid Z_t, O_0, A_t}(w \mid z, o, a)\, f_{O_0 \mid W_t, Z_t, A_t}(o \mid w, z, a)\, dw\, do < \infty$, where $f_{W_t \mid Z_t, O_0, A_t}$ and $f_{O_0 \mid W_t, Z_t, A_t}$ are conditional density functions;
(b) for any $g \in \mathcal{G}^{(t+1)}$, $\int_{\mathcal{O}} [E\{R_t + g(W_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = z, O_0 = o, A_t = a\}]^2 f_{O_0 \mid Z_t, A_t}(o \mid z, a)\, do < \infty$;
(c) there exists a singular decomposition $(\lambda_{z,a;t;\nu}, \phi_{z,a;t;\nu}, \psi_{z,a;t;\nu})_{\nu=1}^\infty$ of $K_{z,a;t}$ such that for all $g \in \mathcal{G}^{(t+1)}$, $\sum_{\nu=1}^\infty \lambda^{-2}_{z,a;t;\nu}\, |\langle E\{R_t + g(W_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = z, O_0 = o, A_t = a\}, \psi_{z,a;t;\nu}\rangle|^2 < \infty$;
(d) for all $1 \le t \le T$, $v^\pi_t \in \mathcal{G}^{(t)}$, where $\mathcal{G}^{(t)}$ satisfies the regularity conditions (b) and (c) above.

Now we are ready to prove Theorem 4.4.

Proof of Theorem 4.4. The structure of the proof and the related arguments are similar to those of Theorem 4.3. Mainly, we show that the solution of equation 11 satisfies
$$E^\nu\Big[\sum_{t'=t}^T R_{t'} \,\Big|\, U_t, A_t\Big] = E\Big[\sum_{a \in \mathcal{A}} q^\nu_t(W_t, Z_t, a)\, \nu_t(a \mid Z_t, A_t) \,\Big|\, U_t, A_t\Big],$$
where $E^\nu$ denotes the expectation taken with respect to $\{\nu_{t'}\}_{t'=t}^T$. Therefore we only list several key steps in the corresponding three parts of the proof. Take $V^\nu_t(W_t, Z_t, A_t) = \sum_{a \in \mathcal{A}} q^\nu_t(W_t, Z_t, a)\, \nu(a \mid Z_t, A_t)$.

Part I.
By Assumptions 3 and 4, we have
$$E[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, O_0, A_t] = E[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, A_t]$$
and
$$E\{q^\nu_t(W_t, Z_t, A_t) \mid U_t, Z_t, O_0, A_t\} = E\{q^\nu_t(W_t, Z_t, A_t) \mid U_t, Z_t, A_t\}.$$
Then by Assumption 11(a), we have
$$E[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, A_t] = E\{q^\nu_t(W_t, Z_t, A_t) \mid U_t, Z_t, A_t\} \quad a.s.,$$
and therefore
$$E^\nu[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, A_t] = E\Big[\sum_{a \in \mathcal{A}} q^\nu_t(W_t, Z_t, a)\, \nu_t(a \mid Z_t, A_t) \,\Big|\, U_t, Z_t, A_t\Big] \quad a.s., \quad (25)$$
where $E^\nu$ denotes the expectation taken with respect to $\{\nu_{t'}\}_{t'=t}^T$.

Part II. Following the same induction idea, we can show that if
$$E^\nu\Big[\sum_{t'=t+1}^T R_{t'} \,\Big|\, U_{t+1}, Z_{t+1}, A_{t+1}\Big] = E\big[V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_{t+1}, Z_{t+1}, A_{t+1}\big],$$
then by utilizing equation 25, at time $t$ we obtain
$$E^\nu\Big[\sum_{t'=t}^T R_{t'} \,\Big|\, U_t, Z_t, A_t\Big] = E\Big[\sum_{a \in \mathcal{A}} q^\nu_t(W_t, Z_t, a)\, \nu_t(a \mid Z_t, A_t) \,\Big|\, U_t, Z_t, A_t\Big].$$

Part III. The existence of the solution to equation 11 can be verified using Assumption 11(b) and Assumption 12.

Lastly, to show the generalized identification results stated in Section 4.4, we adapt the completeness and regularity assumptions as follows.

Assumption 13 (Completeness conditions for $k$-step history-dependent policies). For any $a \in \mathcal{A}$ and $t = k + 1, \ldots, T$: (a) for any square-integrable function $g$, $E\{g(U_t, \bar{Z}_t) \mid Z_t, A_t = a\} = 0$ a.s. if and only if $g = 0$ a.s.; (b) for any square-integrable function $g$, $E\{g(Z_t) \mid W_t, \bar{Z}_t, A_t = a\} = 0$ a.s. if and only if $g = 0$ a.s.

Assumption 14 (Regularity conditions for $k$-step history-dependent policies). Define the conditional expectation operator
$$K_{s,a;t}: L_2\{\mu_{(W_t, \bar{Z}_t) \mid S_t, A_t}((w, \bar{z}) \mid s, a)\} \to L_2\{\mu_{Z_t \mid S_t, A_t}(z \mid s, a)\}, \quad h \mapsto E\{h(W_t, \bar{Z}_t) \mid Z_t = z, S_t = s, A_t = a\},$$
and take $K^*_{s,a;t}$ as its adjoint operator.
For any $\bar{Z}_t = \bar{z}$, $Z_t = z$, $S_t = s$, $W_t = w$, $A_t = a$ and $k + 1 \le t \le T$, the following conditions hold:
(a) $\int f_{(W_t, \bar{Z}_t) \mid Z_t, S_t, A_t}((w, \bar{z}) \mid z, s, a)\, f_{Z_t \mid W_t, \bar{Z}_t, S_t, A_t}(z \mid w, \bar{z}, s, a)\, dw\, d\bar{z}\, dz < \infty$, where $f_{(W_t, \bar{Z}_t) \mid Z_t, S_t, A_t}$ and $f_{Z_t \mid W_t, \bar{Z}_t, S_t, A_t}$ are conditional density functions;
(b) for any $g \in \mathcal{G}^{(t+1)}$, $\int_{\mathcal{Z}} [E\{R_t + g(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = z, S_t = s, A_t = a\}]^2 f_{Z_t \mid S_t, A_t}(z \mid s, a)\, dz < \infty$;
(c) there exists a singular decomposition $(\lambda_{s,a;t;\nu}, \phi_{s,a;t;\nu}, \psi_{s,a;t;\nu})_{\nu=1}^\infty$ of $K_{s,a;t}$ such that for all $g \in \mathcal{G}^{(t+1)}$, $\sum_{\nu=1}^\infty \lambda^{-2}_{s,a;t;\nu}\, |\langle E\{R_t + g(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) \mid Z_t = z, S_t = s, A_t = a\}, \psi_{s,a;t;\nu}\rangle|^2 < \infty$;
(d) for all $k + 1 \le t \le T$, $v^\pi_t \in \mathcal{G}^{(t)}$, where $\mathcal{G}^{(t)}$ satisfies the regularity conditions (b) and (c) above.

Now we are ready to prove Theorem 4.5.

Proof of Theorem 4.5. The results for $1 \le t \le k$ can be obtained by directly applying the proof of Theorem 4.4. Here we only show the proof for the case $t > k$. The proof structure and arguments are quite similar to those of Theorem 4.3, so we list the key steps in the three parts of the proof. Take $V^\nu_t(W_t, Z_t, A_t) = \sum_{a \in \mathcal{A}} q^\nu_t(W_t, \bar{Z}_t, a)\, \nu(a \mid Z_t, A_t)$.

Part I. By Assumption 3,
$$E[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, A_t] = E[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, \bar{Z}_t, A_t]$$
and
$$E\{q^\nu_t(W_t, \bar{Z}_t, A_t) \mid U_t, Z_t, A_t\} = E\{q^\nu_t(W_t, \bar{Z}_t, A_t) \mid U_t, \bar{Z}_t, A_t\}.$$
Then by Assumption 13(a), we have
$$E[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, \bar{Z}_t, A_t] = E\{q^\nu_t(W_t, \bar{Z}_t, A_t) \mid U_t, \bar{Z}_t, A_t\} \quad a.s.,$$
and therefore
$$E^\nu[R_t + V^\nu_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) \mid U_t, Z_t, A_t] = E\Big[\sum_{a \in \mathcal{A}} q^\nu_t(W_t, \bar{Z}_t, a)\, \nu_t(a \mid Z_t, A_t) \,\Big|\, U_t, Z_t, A_t\Big] \quad a.s.,$$
where $E^\nu$ denotes the expectation taken with respect to $\{\nu_{t'}\}_{t'=t}^T$.

Part II.
Following the same induction idea, we can show that if E^ν[Σ_{t'=t+1}^T R_{t'} | U_{t+1}, Z_{t+1}, A_{t+1}] = E[V^ν_{t+1}(W_{t+1}, Z_{t+1}, A_{t+1}) | U_{t+1}, Z_{t+1}, A_{t+1}], then by utilizing equation 25, at time t, E^ν[Σ_{t'=t}^T R_{t'} | U_t, Z_t, A_t] = E[Σ_{a∈A} q^ν_t(W_t, Z̄_t, a) ν_t(a | Z_t, A_t) | U_t, Z_t, A_t].

By the conditional independence O_{t-1} ⫫ (R_t, O_t, U_{t+1}) | (U_t, A_t), and interpreting each Pr(· | ·) below as the matrix consisting of all the corresponding conditional probabilities,

Pr(R_t = r, O_t = o | A_t = a, U_t = u) Pr(U_t = u | A_t = a, O_{t-1} = ō) = Pr(R_t = r, O_t = o | A_t = a, O_{t-1} = ō) ≜ P^{(t,r)}_{o,a},
Pr(U_{t+1} = u, O_t = o | A_t = a, U_t = u) Pr(U_t = u | A_t = a, O_{t-1} = ō) = Pr(U_{t+1} = u, O_t = o | A_t = a, O_{t-1} = ō) ≜ P^{(t,u)}_{o,a},
Pr(U_{t+1} = u, A_{t+1} = a', O_t = o | A_t = a, U_t = u) Pr(U_t = u | A_t = a, O_{t-1} = ō) = Pr(U_{t+1} = u, A_{t+1} = a', O_t = o | A_t = a, O_{t-1} = ō) ≜ P^{(t,u)}_{o,a',a},
Pr(O_t = o | U_t = u) Pr(U_t = u | A_t = a, O_{t-1} = ō) = Pr(O_t = o | A_t = a, O_{t-1} = ō) ≜ P^{(t)}_a.

Accordingly, when P^{(t)}_a is invertible we can represent Pr(R_t = r, O_t = o | A_t = a, U_t = u), Pr(U_{t+1} = u, O_t = o | A_t = a, U_t = u) and Pr(U_{t+1} = u, A_{t+1} = a', O_t = o | A_t = a, U_t = u) by

Pr(R_t = r, O_t = o | A_t = a, U_t = u) = P^{(t,r)}_{o,a} [P^{(t)}_a]^{-1} Pr(O_t = o | U_t = u),
Pr(U_{t+1} = u, O_t = o | A_t = a, U_t = u) = P^{(t,u)}_{o,a} [P^{(t)}_a]^{-1} Pr(O_t = o | U_t = u),
Pr(U_{t+1} = u, A_{t+1} = a', O_t = o | A_t = a, U_t = u) = P^{(t,u)}_{o,a',a} [P^{(t)}_a]^{-1} Pr(O_t = o | U_t = u),

respectively. We first represent E^ν R_1 using the observed data. Notice that

E^ν R_1 = Σ_{a',u} E[Σ_a R_1(a) ν_1(a | O_1, A_1) | A_1 = a', U_1 = u] Pr(A_1 = a', U_1 = u)
= Σ_{a'} Σ_{a,o} ν_1(a | o, a') r^⊤ Pr(R_1 = r, O_1 = o | A_1 = a, U_1 = u) Pr(A_1 = a', U_1 = u)
= Σ_{a'} Σ_{a,o} ν_1(a | o, a') r^⊤ P^{(1,r)}_{o,a} [P^{(1)}_a]^{-1} Pr(O_1 = o | U_1 = u) Pr(A_1 = a', U_1 = u)
= Σ_{o,a,a'} ν_1(a | o, a') r^⊤ P^{(1,r)}_{o,a} [P^{(1)}_a]^{-1} Pr(O_1 = o, A_1 = a').

Next, consider E^ν R_2. According to the Markov property, R_2 and O_2 are conditionally independent of (A_1, U_0, O_1) given (A_2, U_2).
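Before continuing, note that the matrix-inversion identities above can be checked numerically in a small tabular model. The following is a minimal sketch (ours, with made-up probabilities, a fixed action a, and ad hoc helpers `matmul` and `inv2`): it builds Pr(R_t, O_t | A_t, U_t), marginalizes out the reward to get Pr(O_t | U_t), forms the observable matrices P^{(t,r)}_{o,a} and P^{(t)}_a, and verifies that P^{(t,r)}_{o,a}[P^{(t)}_a]^{-1} Pr(O_t | U_t) recovers the latent-conditioned distribution.

```python
# Numerical check (hypothetical tabular example) of the identification identity
# Pr(R_t=r, O_t=o | A_t=a, U_t=u) = P^(t,r)_{o,a} [P^(t)_a]^{-1} Pr(O_t=o | U_t=u)
# for a fixed action a, with |U| = |O| = 2 and a binary reward.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Model primitives (made-up numbers); columns are indexed by u.
# B[r][o][u] = Pr(R_t=r, O_t=o | A_t=a, U_t=u); each column sums to 1 over (r, o).
B = [[[0.10, 0.30], [0.20, 0.05]],   # r = 0
     [[0.40, 0.15], [0.30, 0.50]]]   # r = 1
# M[o][u] = Pr(O_t=o | U_t=u), obtained by marginalizing out the reward.
M = [[B[0][o][u] + B[1][o][u] for u in range(2)] for o in range(2)]
# D[u][ō] = Pr(U_t=u | A_t=a, O_{t-1}=ō); columns sum to 1 and D is invertible.
D = [[0.7, 0.2], [0.3, 0.8]]

# Observable matrices: P^(t,r)_{o,a} = B_r D and P^(t)_a = M D.
P_r = [matmul(B[r], D) for r in range(2)]
P_a = matmul(M, D)

# Identification: recover the latent-conditioned distribution from observables and M.
B_hat = [matmul(matmul(P_r[r], inv2(P_a)), M) for r in range(2)]

for r in range(2):
    for o in range(2):
        for u in range(2):
            assert abs(B_hat[r][o][u] - B[r][o][u]) < 1e-9
print("identification identity verified")
```

The recovery is exact matrix algebra here (B_hat = B D D^{-1} M^{-1} M = B), which is precisely why the invertibility of the probability matrices matters.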
As such, we have that

E^ν R_2 = Σ_{o_1,a'_1,a_1} ν_1(a_1 | o_1, a'_1) Σ_{u_2,a'_2,o_2} Σ_{a_2} ν_2(a_2 | o_2, o_1, a'_2, a'_1) r^⊤ Pr(R_2 = r, O_2 = o_2 | A_2 = a_2, U_2 = u) Pr(U_2 = u, A_2 = a'_2, O_1 = o_1 | A_1 = a_1, U_1 = u) Pr(U_1 = u, A_1 = a'_1)
= Σ_{o_1,a'_1,a_1} ν_1(a_1 | o_1, a'_1) Σ_{a'_2,o_2} Σ_{a_2} ν_2(a_2 | o_2, o_1, a'_2, a'_1) r^⊤ P^{(2,r)}_{o_2,a_2} [P^{(2)}_{a_2}]^{-1} Pr(O_2 = o | U_2 = u) P^{(1,u)}_{o_1,a'_2,a_1} [P^{(1)}_{a_1}]^{-1} Pr(O_1 = o | U_1 = u) Pr(U_1 = u, A_1 = a'_1)
= Σ_{o_1,a'_1,a_1} ν_1(a_1 | o_1, a'_1) {Σ_{a'_2,o_2,a_2} ν_2(a_2 | o_2, o_1, a'_2, a'_1) r^⊤ P^{(2,r)}_{o_2,a_2} [P^{(2)}_{a_2}]^{-1} P^{(1,o)}_{o_1,a'_2,a_1}} [P^{(1)}_{a_1}]^{-1} Pr(O_1 = o, A_1 = a'_1),

where P^{(t,o)}_{o_t,a'_{t+1},a_t} = Pr(O_{t+1} = o, A_{t+1} = a'_{t+1}, O_t = o_t | A_t = a_t, O_{t-1} = ō).

Continuing the bound in the proof of Lemma 5.1, we have

= 2ξ_n + E[{q(W, S, A') − q̂(W, S, A')} Σ_{a∈A} π_b(a | U, S) ν*(A' | Z, S, a) / π_b(A' | U, S)] + E[{q̂(W, S, A') − q(W, S, A')} Σ_{a∈A} π_b(a | U, S) ν̂*(A' | Z, S, a) / π_b(A' | U, S)] ≤ 2(ξ_n + p_max ζ_n),

where equation 26 is due to the optimality of ν̂* and equation 27 is due to the definition of ξ_n.

Proof of Theorem 5.1. The bound in Theorem 5.1 can be derived by combining the results of Theorem I.2, Theorem I.4 and Lemma I.2.

In the following, we derive the regret bound stated in Section 5.2. Before that, we present the following regret decomposition lemma. Define the function class Q̃^{(t)} over W × S such that Q̃^{(t)} := {q(·, ·, a) : q ∈ Q^{(t)}, a ∈ A}.

Lemma H.1. Suppose f_t ∈ Q^{(t)}, a class of functions over W × S × A, and take the policy ν_f = {ν_{f,t}}_{t=1}^T as the one that is greedy with respect to Ê[f_t(W_t, S_t, a) | S_t, Z_t, A_t]. Take g_t(S_t, Z_t, A_t; q) := E[q(W_t, S_t) | S_t, Z_t, A_t] and ĝ_t(S_t, Z_t, A_t; q) := Ê[q(W_t, S_t) | S_t, Z_t, A_t] for q ∈ Q̃^{(t)}. Define the projection error ξ_{t,n} := sup_{q∈Q̃^{(t)}} ∥g_t(·, ·, ·; q) − ĝ_t(·, ·, ·; q)∥_2, and

(ζ^f_{t,n})² := E[(E[f_t(W_t, S_t, A_t) − {R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1})} | S_t, Z_t, A_t])²].
Define p_{t,max} := sup_{s,z,a} p^{ν*}_t(S_t = s, Z_t = z, A_t = a) / p^{π_b}_t(S_t = s, Z_t = z, A_t = a), and p^ν_{max,t} := sup_{s,z,a} ω^ν_t(S_t = s, Z_t = z, A_t = a), where

ω^ν_t(S_t, Z_t, A_t) := [Σ_{a∈A} {∫_{u∈U} π_b(a | U_t = u, S_t) p^{π_b}_t(u | S_t) du} ν(A_t | S_t, Z_t, a)] / [∫_u π_b(A_t | U_t = u, S_t) p^{π_b}_t(u | S_t, Z_t) du] × p^ν_t(S_t, Z_t) / p^{π_b}_t(S_t, Z_t).   (29)

Then under Assumptions 3, 8, 9 and 10, together with Assumption 6, we can obtain the following regret bound:

V(ν*) − V(ν_f) ≤ Σ_{t=1}^T 2 p_{t,max} ξ_{t,n} + √(T Σ_{t=1}^T [(p^{ν*}_{max,t})² + (p^{ν_f}_{max,t})²] (ζ^f_{t,n})²).

Proof of Lemma H.1. We start from the decomposition

V(ν*) − V(ν_f) ≤ {V(ν*) − E[Σ_{a∈A} f_1(W_1, S_1, a) ν*_1(a | S_1, Z_1, A_1)]}
+ {E[Σ_{a∈A} f_1(W_1, S_1, a) ν*_1(a | S_1, Z_1, A_1)] − E[Ê(Σ_{a∈A} f_1(W_1, S_1, a) ν*_1(a | S_1, Z_1, A_1) | S_1, Z_1, A_1)]}
+ {E[Ê(Σ_{a∈A} f_1(W_1, S_1, a) ν_{f,1}(a | S_1, Z_1, A_1) | S_1, Z_1, A_1)] − E[Σ_{a∈A} f_1(W_1, S_1, a) ν_{f,1}(a | S_1, Z_1, A_1)]}
+ {E[Σ_{a∈A} f_1(W_1, S_1, a) ν_{f,1}(a | S_1, Z_1, A_1)] − V(ν_f)}   (30)
≤ 2ξ_{1,n} + {V(ν*) − E[Σ_{a∈A} f_1(W_1, S_1, a) ν*_1(a | S_1, Z_1, A_1)]} + {E[Σ_{a∈A} f_1(W_1, S_1, a) ν_{f,1}(a | S_1, Z_1, A_1)] − V(ν_f)},

where the first inequality uses the greedy optimality Ê(Σ_a f_1 ν*_1 | S_1, Z_1, A_1) ≤ Ê(Σ_a f_1 ν_{f,1} | S_1, Z_1, A_1). First, we can show that for any policy ν ∈ Ω,

E[Σ_{a∈A} f_1(W_1, S_1, a) ν_1(a | S_1, Z_1, A_1)] − V(ν)
= E[Σ_{a∈A} f_1(W_1, S_1, a) ν_1(a | S_1, Z_1, A_1)] − E^ν[Σ_{t=1}^T R_t]
= Σ_{t=1}^T E^ν[Σ_{a∈A} f_t(W_t, S_t, a) ν_t(a | S_t, Z_t, A_t) − {R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1})}].   (31)

At time t, because of the optimality of ν_{f,t}, we have Ê(Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) | S_t, Z_t, A_t) ≤ Ê(Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) | S_t, Z_t, A_t).
Then

E(Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) | S_t, Z_t, A_t) − E(Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) | S_t, Z_t, A_t)
≤ {E(Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) | S_t, Z_t, A_t) − Ê(Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) | S_t, Z_t, A_t)}
+ {Ê(Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) | S_t, Z_t, A_t) − E(Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) | S_t, Z_t, A_t)},   (32)

and hence

E[Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) − Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t)]
≤ E^{1/2}[(E[Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) − Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) | S_t, Z_t, A_t])²] ≤ 2ξ_{t,n}.   (33)

The last inequality is due to the decomposition equation 32 and the definition of ξ_{t,n}. Note that

E^{ν*}[E^{ν*}(Σ_{a∈A} f_t(W_t, S_t, a){ν*_t(a | S_t, Z_t, A_t) − ν_{f,t}(a | S_t, Z_t, A_t)} | U_t, S_t, Z_t, A_t) | S_t, Z_t, A_t]
= E^{ν*}[E(Σ_{a∈A} f_t(W_t, S_t, a){ν*_t(a | S_t, Z_t, A_t) − ν_{f,t}(a | S_t, Z_t, A_t)} | U_t, S_t, Z_t, A_t) | S_t, Z_t, A_t]
= E[{p^{ν*}_t(U_t | S_t, Z_t, A_t) / p^{π_b}_t(U_t | S_t, Z_t, A_t)} E(Σ_{a∈A} f_t(W_t, S_t, a){ν*_t(a | S_t, Z_t, A_t) − ν_{f,t}(a | S_t, Z_t, A_t)} | U_t, S_t, Z_t, A_t) | S_t, Z_t, A_t].

Due to Assumption 6, we have p^{ν*}_t(U_t | S_t) = p^{π_b}_t(U_t | S_t), and

p^{ν*}_t(U_t | S_t, Z_t, A_t) = p^{ν*}_t(Z_t, A_t | U_t, S_t) p^{ν*}_t(U_t | S_t) / ∫_{u∈U} p^{ν*}_t(Z_t, A_t | U_t = u, S_t) p^{ν*}_t(U_t = u | S_t) du
= p^{π_b}_t(Z_t, A_t | U_t, S_t) p^{π_b}_t(U_t | S_t) / ∫_{u∈U} p^{π_b}_t(Z_t, A_t | U_t = u, S_t) p^{π_b}_t(U_t = u | S_t) du
= p^{π_b}_t(U_t | S_t, Z_t, A_t).
Therefore,

E^{ν*}[E(Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) | S_t, Z_t, A_t) − E(Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) | S_t, Z_t, A_t)]
= E[{p^{ν*}_t(S_t, Z_t, A_t) / p^{π_b}_t(S_t, Z_t, A_t)} E(Σ_{a∈A} f_t(W_t, S_t, a){ν*_t(a | S_t, Z_t, A_t) − ν_{f,t}(a | S_t, Z_t, A_t)} | S_t, Z_t, A_t)]
≤ 2 p_{t,max} ξ_{t,n}.   (34)

The last inequality is due to equation 33 and the definition of p_{t,max}. Now, going back to equation 30, we have

E[Σ_{a∈A} f_1(W_1, S_1, a) ν_{f,1}(a | S_1, Z_1, A_1)] − V(ν_f)
= E^{ν_f}[Σ_{t=1}^T Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) − E^{ν_f}{R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1})}]

because of equation 31, and

E[Σ_{a∈A} f_1(W_1, S_1, a) ν*_1(a | S_1, Z_1, A_1)] − V(ν*)
= E^{ν*}[Σ_{t=1}^T Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) − E^{ν*}{R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν*_{t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1})}]
≥ Σ_{t=1}^T [E^{ν*}{Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) − E^{ν*}(R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1}))} − 2 p_{t+1,max} ξ_{t+1,n}]

because of equation 34.
Then

V(ν*) − V(ν_f) ≤ 2ξ_{1,n}
+ E^{ν_f}[Σ_{t=1}^T Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) − E^{ν_f}{R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1}) | S_t, Z_t, A_t}]
− E^{ν*}[Σ_{t=1}^T Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) − E^{ν*}{R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1}) | S_t, Z_t, A_t}]
+ Σ_{t=2}^T 2 p_{t,max} ξ_{t,n}.

We know that for ν ∈ {ν*, ν_f},

E^ν[R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1}) | U_t, S_t, Z_t, A_t]
= E^ν[Σ_{a∈A} E{R_t + Σ_{a'∈A} f_{t+1}(W_{t+1}, S_{t+1}, a') ν_{f,t+1}(a' | S_{t+1}, Z_{t+1}, A_{t+1}) | U_t, S_t, Z_t, A_t = a} ν_t(a | S_t, Z_t, A_t)].

Recall the weight

ω^ν_t(S_t, Z_t, A_t) = [Σ_{a∈A} {∫_{u∈U} π_b(a | U_t = u, S_t) p^{π_b}_t(u | S_t) du} ν(A_t | S_t, Z_t, a)] / [∫_u π_b(A_t | U_t = u, S_t) p^{π_b}_t(u | S_t, Z_t) du] × p^ν_t(S_t, Z_t) / p^{π_b}_t(S_t, Z_t).

Then at any t,

E^{ν_f}[Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) − E^{ν_f}{R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1}) | S_t, Z_t, A_t}]
= E^{ν_f}[Σ_{a∈A} ν_{f,t}(a | S_t, Z_t, A_t) E{f_t(W_t, S_t, A_t) − R_t − Σ_{a'∈A} f_{t+1}(W_{t+1}, S_{t+1}, a') ν_{f,t+1}(a' | S_{t+1}, Z_{t+1}, A_{t+1}) | U_t, S_t, Z_t, A_t = a}]
= E^{ν_f}[Σ_{a∈A} ν_{f,t}(a | S_t, Z_t, A_t) E{f_t(W_t, S_t, A_t) − R_t − Σ_{a'∈A} f_{t+1}(W_{t+1}, S_{t+1}, a') ν_{f,t+1}(a' | S_{t+1}, Z_{t+1}, A_{t+1}) | S_t, Z_t, A_t = a}]
= E[ω^{ν_f}_t(S_t, Z_t, A_t) {f_t(W_t, S_t, A_t) − R_t − Σ_{a'∈A} f_{t+1}(W_{t+1}, S_{t+1}, a') ν_{f,t+1}(a' | S_{t+1}, Z_{t+1}, A_{t+1})}].

The second equality is due to the fact that p^{π_b}_t(U_t | S_t, Z_t, A_t = a) = p^{ν_f}_t(U_t | S_t, Z_t, A_t = a).
Together with the Cauchy–Schwarz inequality,

Σ_{t=1}^T E^{ν_f}[Σ_{a∈A} f_t(W_t, S_t, a) ν_{f,t}(a | S_t, Z_t, A_t) − {R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1})}]
≤ √T (Σ_{t=1}^T [E{ω^{ν_f}_t(S_t, Z_t, A_t) E(f_t(W_t, S_t, A_t) − R_t − Σ_{a'∈A} f_{t+1}(W_{t+1}, S_{t+1}, a') ν_{f,t+1}(a' | S_{t+1}, Z_{t+1}, A_{t+1}) | S_t, Z_t, A_t)}]²)^{1/2}
≤ √(T Σ_{t=1}^T (p^{ν_f}_{max,t})² (ζ^f_{t,n})²).

Similarly, we have

Σ_{t=1}^T E^{ν*}[Σ_{a∈A} f_t(W_t, S_t, a) ν*_t(a | S_t, Z_t, A_t) − {R_t + Σ_{a∈A} f_{t+1}(W_{t+1}, S_{t+1}, a) ν_{f,t+1}(a | S_{t+1}, Z_{t+1}, A_{t+1})}] ≤ √(T Σ_{t=1}^T (p^{ν*}_{max,t})² (ζ^f_{t,n})²).

Therefore, overall we have

V(ν*) − V(ν_f) ≤ Σ_{t=1}^T 2 p_{t,max} ξ_{t,n} + √(T Σ_{t=1}^T [(p^{ν*}_{max,t})² + (p^{ν_f}_{max,t})²] (ζ^f_{t,n})²).

For the min-max estimation of the Q-bridge functions following Dikkala et al. (2020), for t = T, …, 1 the estimator takes the form

q̂_t ∈ arg min_{q∈Q^{(t)}} sup_{g∈G^{(t)}} Ψ_n(q, V̂_{t+1}, g) − λ{∥g∥²_{G^{(t)}} + (U/∆²)∥g∥²_{2,n}} + λμ∥q∥²_{Q^{(t)}},   (35)

where ∥·∥_{2,n} is the empirical norm, λ, U, ∆ and μ are positive tuning parameters, and

Ψ_n(q, V̂_{t+1}, g) = (1/n) Σ_{i=1}^n {q(W_{i,t}, S_{i,t}, A_{i,t}) − (R_{i,t} + V̂_{t+1}(W_{i,t+1}, S_{i,t+1}, Z_{i,t+1}, A_{i,t+1}))/(T − t + 1)} g(Z_{i,t}, S_{i,t}, A_{i,t}),
V̂_{t+1}(W_{i,t+1}, S_{i,t+1}, Z_{i,t+1}, A_{i,t+1}) = Σ_{a∈A} q̂_{t+1}(W_{i,t+1}, S_{i,t+1}, a) ν*_{t+1}(a | S_{i,t+1}, Z_{i,t+1}, A_{i,t+1}).   (36)

In the following, we utilize a uniform error bound to study ξ_{t,n} and ζ_{t,n}.

(a) For any ν ∈ V and q ∈ Q^{(t)}, ⟨ν, q⟩ ∈ H^{(t)}. For any h ∈ H^{(t+1)}, T_t(h + R_t) ∈ Q^{(t)}.
(b) For any q ∈ (T − t)Q^{(t+1)} and any ν ∈ V, we have ∥T_t{(R_t + ⟨ν, q⟩)/(T − t + 1)}∥²_{Q^{(t)}} ≤ ∥q/(T − t)∥²_{Q^{(t+1)}}.
(c) For any q ∈ Q^{(t)} and ν ∈ V, we have ∥⟨ν, q⟩∥²_{H^{(t)}} ≤ C_v ∥q∥²_{Q^{(t)}} for some constant C_v > 0.
(d) There exists L > 0 such that ∥g* − T̃_t q_t∥_2 ≤ ϱ_{t,n}, where g* ∈ arg min_{g∈G^{(t)}_{L²∥q_t∥²_{Q^{(t)}}}} ∥g − T̃_t q_t∥_2, for all q_t ∈ Q^{(t)}.

Take Q^{(t)}_B, H^{(t)}_D and G^{(t)}_{3U} as balls in Q^{(t)}, H^{(t)} and G^{(t)} respectively, for some fixed constants B, D, U > 0, such that the functions in G^{(t)}_{3U} are uniformly bounded by 1.
Consider the following two spaces:

Ω^{(t)} = {(w_t, s_t, z_t, a_t, w_{t+1}, s_{t+1}, z_{t+1}, a_{t+1}) ↦ r[q*_h(w_t, s_t, a_t) − h(w_{t+1}, s_{t+1}, z_{t+1}, a_{t+1})] g(z_t, s_t, a_t) : h ∈ H^{(t+1)}_D, g ∈ G^{(t)}_{3U}, r ∈ [0, 1]},
Ξ^{(t)} = {(w_t, s_t, z_t, a_t) ↦ r[q − q*_h](w_t, s_t, a_t) g_{L²B}(z_t, s_t, a_t) : q ∈ Q^{(t)}, q − q*_h ∈ Q^{(t)}_B, h ∈ H^{(t+1)}_D, r ∈ [0, 1]},

where q*_h ∈ Q^{(t)} is the solution to E[q(W_t, S_t, A_t) − h(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) | Z_t, S_t, A_t] = 0 and g_{L²B} = arg min_{g∈G^{(t)}_{L²B}} ∥g − T̃_t(q − q*_h)∥_2 for a given L > 0.

We use the Rademacher complexity to characterize the complexity of a function class. For a generic real-valued function space F ⊂ R^X, the local Rademacher complexity with radius r > 0 is defined as R_n(F, r) = E[sup_{f∈F: ∥f∥_2≤r} |(1/n) Σ_{i=1}^n ε_i f(X_i)|], where {X_i}_{i=1}^n are i.i.d. copies of X and {ε_i}_{i=1}^n are i.i.d. Rademacher random variables. Suppose F is star-shaped and ∥f∥_∞ ≤ 1 for all f ∈ F. The critical radius of the local Rademacher complexity R_n(F, r), denoted by r*, is the smallest value of r satisfying r² ≥ R_n(F, r).

Theorem I.1. Suppose G^{(t)}, t = 1, …, T, are symmetric and star-convex sets of test functions and ∥T_T(R_T)∥_{Q^{(T)}} ≤ M_Q. Under Assumption 15, take ∆ = ∆̄_{t,n} + c_0 √(log(c_1 T/δ)/n) for some universal constants c_0, c_1 > 0, where ∆̄_{t,n} is the maximum of the critical radii of G^{(t)}_{3U}, Ω^{(t)} and Ξ^{(t)}, and assume that ϱ_{t,n} in Assumption 15(d) satisfies ϱ_{t,n} ≤ ∆. Then (R_t + V̂_{t+1})/(T − t + 1) ∈ H^{(t+1)}_D. If we further assume the tuning parameters satisfy Uλ ≍ ∆² and μ ≥ O(L² + U/B), then the following holds uniformly for all t = 1, …, T with probability at least 1 − δ:

∥q̂_t/(T − t + 1)∥²_{Q^{(t)}} ≤ (T − t + 2) M_Q,

Proof of Theorem I.1. The proof of Theorem I.1 is a direct adaptation of Theorem 6.2 and Lemma D.2 in Miao et al. (2022).

Remark 1. Under the setting of contextual bandits, the Q-function estimation can be considered a special case of equation 35 by setting t = T.
Then the result for bounding ζ_n can be adapted from Theorem I.1 accordingly, and we have the following theorem.

Theorem I.2. Suppose there exists q* ∈ Q that satisfies E[q* − R | S, Z, A] = 0, the functions in G and Q are uniformly bounded by 1, and |R| ≤ 1. Take ∆ = ∆̄_n + c_0 √(log(c_1/δ)/n) for some positive universal constants c_0 and c_1, with ∆̄_n the maximum of the critical radii of G_{3U} and

Ξ = {(w, s, z, a) ↦ r[q − q*](w, s, a) g_{L²B}(z, s, a) : q − q* ∈ Q_B, r ∈ [0, 1]},

where g_{L²B} = arg min_{g∈G_{L²B}} ∥g − E(q − q* | S, Z, A)∥_2. In addition, we suppose that for any q ∈ Q, ∥g_{L²B} − E(q − q* | S, Z, A)∥_2 ≲ η_n ≲ ∆. By taking the tuning parameters λ ≍ ∆²/U and μ ≳ L² + ∆²/(Bλ), with probability at least 1 − δ, we have ζ_n ≲ ∆̄_n + √(log(c_1/δ)/n).
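To make the critical-radius definition above concrete, here is a small Monte Carlo sketch (ours; the one-parameter star-shaped class F = {α f₀ : α ∈ [0, 1]} and all constants are made up) that estimates the local Rademacher complexity R_n(F, r) and locates the critical radius as the smallest grid value r with R_n(F, r) ≤ r²:

```python
# Monte Carlo sketch (ours, not from the paper) of the critical radius: the
# smallest r with R_n(F, r) <= r^2, for a toy star-shaped class
# F = {alpha * f0 : alpha in [0, 1]} evaluated on n fixed sample points.
import random

random.seed(0)
n = 200
f0 = [random.uniform(-1, 1) for _ in range(n)]        # base function values f0(X_i)
norm_f0 = (sum(v * v for v in f0) / n) ** 0.5         # empirical L2 norm of f0

def local_rademacher(r, n_mc=500):
    """Estimate R_n(F, r) = E sup_{f in F, ||f||_2 <= r} |(1/n) sum eps_i f(X_i)|."""
    total = 0.0
    for _ in range(n_mc):
        eps = [random.choice((-1.0, 1.0)) for _ in range(n)]
        s = sum(e * v for e, v in zip(eps, f0)) / n
        # the sup over alpha with alpha * ||f0|| <= r is attained at the boundary
        alpha = min(1.0, r / norm_f0)
        total += alpha * abs(s)
    return total / n_mc

# Grid search for the critical radius: smallest r on the grid with R_n(F, r) <= r^2.
r_star = next(r / 1000 for r in range(1, 1001)
              if local_rademacher(r / 1000) <= (r / 1000) ** 2)
print("approximate critical radius:", r_star)
```

For this toy class R_n(F, r) grows roughly linearly in r at rate O(1/√n), so the fixed point scales like 1/√n, matching the role the critical radius plays in the bounds above.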

I.2 PROJECTION ESTIMATION

In this section, we discuss how to perform the projection step Ê[q̂_t(W_t, S_t, a) | S_t = s, Z_t = z, A_t = a'] in Algorithm 2. Take Q̃^{(t)} as a space defined over W × S such that Q̃^{(t)} := {q(·, ·, a) : q ∈ Q^{(t)}, a ∈ A}. Take g*_t(s, z, a; q) := E[q(W_t, S_t) | S_t = s, Z_t = z, A_t = a] for q ∈ Q̃^{(t)}. We estimate g*_t by

ĝ_t(·, ·, ·; q) := arg min_{g∈G^{(t)}} (1/n) Σ_{i=1}^n [g(S_{i,t}, Z_{i,t}, A_{i,t}) − q(W_{i,t}, S_{i,t})]² + μ∥g∥²_{G^{(t)}}.   (38)

Take Q̃^{(t)}_B and G^{(t)}_M as balls in Q̃^{(t)} and G^{(t)} respectively, for some fixed constants B and M, such that the functions in Q̃^{(t)}_B and G^{(t)}_M are uniformly bounded by 1. Consider the following space:

Υ^{(t)} = {(w_t, s_t, z_t, a_t) ↦ [g(s_t, z_t, a_t) − q(w_t, s_t)]² − [g*(s_t, z_t, a_t; q) − q(w_t, s_t)]² : g, g* ∈ G^{(t)}_M, q ∈ Q̃^{(t)}_B}.

Theorem I.3. Suppose for any q ∈ Q^{(t)} and a ∈ A, ∥q(·, ·, a)∥²_{Q̃^{(t)}} ≤ C̃_v ∥q∥²_{Q^{(t)}}; and for any q ∈ Q̃^{(t)}, g*(·, ·, ·; q) ∈ G^{(t)} with ∥g*(·, ·, ·; q)∥²_{G^{(t)}} ≤ C_g ∥q∥²_{Q̃^{(t)}}. Take κ_{t,n} = κ̄_{t,n} + c_0 √(log(c_1 T/δ)/n) for some universal positive constants c_0 and c_1, where κ̄_{t,n} is the critical radius of the function space Υ^{(t)}. If we further assume the tuning parameter μ in equation 38 satisfies μ ≳ (κ_{t,n})², then with probability at least 1 − δ, we have, for any t = 1, …, T,

ξ_{t,n} ≲ (T − t + 1) {κ_{t,n} √(1 + ∥q^π_t/(T − t + 1)∥²_{Q^{(t)}}) + μ∥q^π_t/(T − t + 1)∥²_{Q^{(t)}}} ≲ (T − t + 1)^{1.5} M_Q κ_{t,n}.

Remark 2. Under the setting of contextual bandits, the estimation of the projection (equation 17) can be considered a special case of equation 38 by setting t = T. The corresponding result for bounding ξ_n is obtained by taking t = T and Q̃ = {q(·, ·, a) : q ∈ G, a ∈ A} in Theorem I.3, which gives the following.

Theorem I.4. Suppose for any q ∈ Q and a ∈ A, ∥q(·, ·, a)∥²_{Q̃} ≤ C̃_v ∥q∥²_Q; and for any q ∈ Q̃, g*(·, ·, ·; q) ∈ G with ∥g*(·, ·, ·; q)∥²_G ≤ C_g ∥q∥²_{Q̃}.
Take κ_n = κ̄_n + c_0 √(log(c_1/δ)/n) for some universal positive constants c_0 and c_1, where κ̄_n is the critical radius of the function space

Υ = {(w, s, z, a) ↦ [g(s, z, a) − q(w, s)]² − [g*(s, z, a; q) − q(w, s)]² : g, g* ∈ G_M, q ∈ Q̃_B}.

If we further assume the tuning parameter μ in equation 17 satisfies μ ≳ (κ_n)², then with probability at least 1 − δ, we have ξ_n ≲ κ_n √(1 + ∥q^π∥²_{Q̃}) + μ∥q^π∥²_{Q̃} ≲ κ_n.

Proof of Theorem I.3. First, we note that for any g ∈ G^{(t)},

E[g(S_t, Z_t, A_t) − q(W_t, S_t)]² − E[g*(S_t, Z_t, A_t; q) − q(W_t, S_t)]²
= E[{g(S_t, Z_t, A_t) − g*(S_t, Z_t, A_t; q)}{g(S_t, Z_t, A_t) + g*(S_t, Z_t, A_t; q) − 2q(W_t, S_t)}]
= E[{g(S_t, Z_t, A_t) − g*(S_t, Z_t, A_t; q)}{g(S_t, Z_t, A_t) − g*(S_t, Z_t, A_t; q) + 2g*(S_t, Z_t, A_t; q) − 2q(W_t, S_t)}]
= E[{g(S_t, Z_t, A_t) − g*(S_t, Z_t, A_t; q)}²].   (39)

The last equality is due to the fact that E{g(S_t, Z_t, A_t)[g*(S_t, Z_t, A_t; q) − q(W_t, S_t)]} = 0 for any g ∈ G^{(t)}. From the basic inequality, we have

(1/n) Σ_{i=1}^n [ĝ(S_{i,t}, Z_{i,t}, A_{i,t}) − q(W_{i,t}, S_{i,t})]² ≤ (1/n) Σ_{i=1}^n [g*(S_{i,t}, Z_{i,t}, A_{i,t}; q) − q(W_{i,t}, S_{i,t})]² + μ∥g*∥²_{G^{(t)}} − μ∥ĝ∥²_{G^{(t)}}.   (40)
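The projection step in equation 38 is a penalized least-squares regression of q(W, S) on (S, Z, A). The following minimal sketch (ours; the linear feature map `phi`, the data-generating process, and all constants are made up) fits it in closed form for a two-dimensional linear class, which is ridge regression:

```python
# Sketch (ours) of the projection step: estimate
#   g*(s, z, a; q) = E[q(W, S) | S=s, Z=z, A=a]
# by penalized least squares over a linear class g(s, z, a) = theta . phi(s, z, a).
import random

random.seed(1)
n, mu = 2000, 1e-3

def phi(s, z, a):            # hypothetical feature map for g
    return [1.0, s + z + a]

data = []
for _ in range(n):
    s, z, a = random.gauss(0, 1), random.gauss(0, 1), random.choice((0.0, 1.0))
    w = 0.5 * (s + z + a) + random.gauss(0, 0.3)  # W depends on (S, Z, A) plus noise
    data.append((s, z, a, w))                      # regression target: q(W, S) = W, say

# Ridge closed form: theta = (X^T X / n + mu I)^{-1} X^T y / n (a 2x2 system).
G = [[0.0, 0.0], [0.0, 0.0]]; b = [0.0, 0.0]
for s, z, a, y in data:
    x = phi(s, z, a)
    for i in range(2):
        b[i] += x[i] * y / n
        for j in range(2):
            G[i][j] += x[i] * x[j] / n
G[0][0] += mu; G[1][1] += mu
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
theta = [(G[1][1] * b[0] - G[0][1] * b[1]) / det,
         (G[0][0] * b[1] - G[1][0] * b[0]) / det]
print("fitted coefficients:", theta)
```

Here the true conditional mean is 0.5(s + z + a), so the fitted slope should be close to 0.5 and the intercept close to 0; in a nonparametric version, G^{(t)} would be an RKHS ball and μ the penalty in equation 38.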



∥·∥_2, and ζ_{t,n}, which denotes the projected error related to the computation in line 4 of Algorithm 2. The exact definition of ζ_{t,n} is given in equation 37 of Appendix I. The finite-sample regret bound of ν̂* obtained by Algorithm 2 relies on the following regret decomposition.

Figure 1: The data generation process under a typical POMDP.

Figure 2: The data generation process under the memoryless POMDP.

Figure 3: An illustration of the causal relationship of variables involved in the confounded POMDP.

Figure 4: Causal relationship between W t and other variables. Dashed arrows indicate the causal effect is optional.

Figure 5: Different causal relationships between Z t and other variables. Dashed arrows indicate that the causal effect is optional.

Regularity conditions for contextual bandits). For any Z = z, S = s, W = w, A = a: (a) ∫_{W×Z} f_{W|Z,S,A}(w | z, s, a) f_{Z|W,S,A}(z | w, s, a) dw dz < ∞, where f_{W|Z,S,A} and f_{Z|W,S,A} are conditional density functions.

For any (s, a) ∈ S × A and t = 1, …, T: (a) for any square-integrable function g, E{g(U_t) | Z_t, S_t = s, A_t = a} = 0 a.s. if and only if g = 0 a.s.; (b) for any square-integrable function g, E{g(Z_t) | W_t, S_t = s, A_t = a} = 0 a.s. if and only if g = 0 a.s.

Following a similar argument, one can derive the identification formula for t = 3, …, T.

H PROOF IN SECTION 5

Proof of Lemma 5.1. Writing q̂ and ν̂* for the estimated counterparts of q and ν*, we have

V(ν*) − V(ν̂*) = E[E{Σ_{a∈A} q(W, S, a) ν*(a | S, Z, A) | S, Z, A}] − E[E{Σ_{a∈A} q(W, S, a) ν̂*(a | S, Z, A) | S, Z, A}]
≤ E[E{Σ_{a∈A} q(W, S, a) ν*(a | S, Z, A) | S, Z, A} − Ê{Σ_{a∈A} q̂(W, S, a) ν*(a | S, Z, A) | S, Z, A}]
+ E[Ê{Σ_{a∈A} q̂(W, S, a) ν̂*(a | S, Z, A) | S, Z, A} − E{Σ_{a∈A} q(W, S, a) ν̂*(a | S, Z, A) | S, Z, A}]   (26)
≤ 2ξ_n + E[Σ_{a∈A} {q(W, S, a) − q̂(W, S, a)} ν*(a | S, Z, A)] + E[Σ_{a∈A} {q̂(W, S, a) − q(W, S, a)} ν̂*(a | S, Z, A)]   (27)

For a function space F, we define αF = {αf : f ∈ F} and F B = {f ∈ F : ∥f ∥ 2 F ≤ B}. Assumption 15. The following conditions hold for t = 1, . . . , T .

D = C v (T -t + 1)M Q .

where q̂_t is the solution to equation 35; and ζ_{t,n} ≲ M_Q (T − t + 1)² (∆̄_{t,n} + √(log(c_1 T/δ)/n)), where

(ζ_{t,n})² = E[(E[q̂_t(W_t, S_t, A_t) − {R_t + V̂_{t+1}(W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1})} | S_t, Z_t, A_t])²]   (37)

with V̂_{t+1} defined in equation 36.

Policy values under different choices of ϵ in the toy example. In general, V(π_b) = 0.6 − 1.2ϵ, V(π*) = 0.4, and V(ν*) = |0.7 − ϵ| + |ϵ − 0.3|. Bold values are the largest under each setting.
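The closed-form values in this table are easy to sanity-check. The snippet below (ours) verifies that the super-policy value dominates both the behavior policy and the standard optimal policy for every ϵ on a grid over [0, 1]:

```python
# Check of the closed-form policy values quoted for the toy example:
# V(pi_b) = 0.6 - 1.2*eps, V(pi*) = 0.4, V(nu*) = |0.7 - eps| + |eps - 0.3|.
def v_behavior(eps): return 0.6 - 1.2 * eps
def v_standard(eps): return 0.4
def v_super(eps): return abs(0.7 - eps) + abs(eps - 0.3)

for k in range(101):
    eps = k / 100
    # the super-policy should never be worse than either baseline
    assert v_super(eps) >= max(v_behavior(eps), v_standard(eps)) - 1e-12
print("super-policy dominates for all eps in [0, 1]")
```

By the triangle inequality, |0.7 − ϵ| + |ϵ − 0.3| ≥ 0.4 = V(π*) always, which is the claimed blessing of conditioning on the expert's recommendation.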

Simulation results for the tabular setting described in C.1 under different choices of ϵ. We replicate the simulation 50 times. Mean regret values for the estimated optimal policies under different policy classes are provided (a smaller regret value indicates better performance). Values in the parentheses are the standard deviations of the regret values.

Simulation results for the continuous setting described in C.1 under different choices of ϵ. The simulation is performed over 50 simulated datasets. Mean regret values for estimated optimal policies using different policy classes are provided. Smaller regret values indicate better performance. Values in the parentheses are the standard deviations of the regret values.

Simulation results for the sequential decision-making problem described in C.2. The simulation is performed over 50 simulated datasets. Mean regret values for the estimated optimal policies under different policy classes are provided; smaller regret values indicate better performance. Values in the parentheses are the standard deviations of the regret values.

Evaluation results of the optimal policies learned from three different policy classes using the RHC data. The averages of the evaluation values over 20 random splits are presented. Larger values indicate better performance. Values in the parentheses are standard deviations.

Evaluation results of the optimal policies learned from three different policy classes using the MIMIC-III data. The averages of the evaluation values over 20 random splits are presented. Larger values indicate better performance. Values in the parentheses are standard deviations.

S_t, A_t by Assumptions 3 and 8, and

Z_t, A_t, where E^ν refers to the expectation taken with respect to {ν_{t'}}_{t'=t}^T. Part III. The existence of the solution to equation 14 can be verified by utilizing Assumptions 13(b) and 14. Proof of Theorem 4.6. We notice that O_{t-1} ⫫ (R_t, O_t, U_{t+1}) | (U_t, A_t). Consequently, the conditional distributions of (R_t, O_t) and (U_{t+1}, O_t) given (A_t, U_t) shall satisfy

and Pr(O_t = o | U_t = u) correspond to the matrices consisting of all the conditional probabilities. When Pr(O_t = o | U_t = u) and Pr(U_t = u | A_t = a, O_{t-1} = ō) are invertible, this allows us to represent Pr

Proof of Lemma 5.2. The proof of Lemma 5.2 is a direct adaptation of Lemma H.1.

We take the min-max estimation procedure to solve the estimating equation 10. More specifically, we follow the construction in Dikkala et al. (2020) and propose the following estimators for the Q-bridge functions. For the following discussion, without loss of generality, we assume max_t |R_t| ≤ 1 for t = 1, …, T, and the function spaces Q^{(t)}, G^{(t)} and H^{(t)} below are classes of bounded functions whose image is a subset of [−1, 1]. Take q̂_{T+1} = 0. For t = T, …, 1,
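As a toy illustration of this min-max construction (ours; the linear classes, the proxy model, and the noise levels are all made up), take q(w) = θw and a single linear test function g(z) = ωz. The penalized inner maximization then has a closed form, and the outer minimization reduces to the instrumental-variable-type estimator below, which corrects the confounding bias that plain least squares of R on W suffers:

```python
# Minimal sketch (ours) of the min-max / conditional-moment idea behind the
# Q-bridge estimator: with q(w) = theta*w and test function g(z) = omega*z,
#   min_theta max_omega (1/n) sum (theta*W_i - R_i)*omega*Z_i - lam*omega^2
# solves in closed form to theta_hat = sum(Z_i R_i) / sum(Z_i W_i).
import random

random.seed(2)
n = 20000
Z, W, R = [], [], []
for _ in range(n):
    u = random.gauss(0, 1)               # unmeasured confounder
    Z.append(u + random.gauss(0, 1))     # action-side proxy
    W.append(u + random.gauss(0, 1))     # reward-side proxy
    R.append(u + random.gauss(0, 0.2))   # reward driven by the confounder

theta_minimax = sum(z * r for z, r in zip(Z, R)) / sum(z * w for z, w in zip(Z, W))
theta_ols = sum(w * r for w, r in zip(W, R)) / sum(w * w for w in W)
print("min-max estimate:", theta_minimax)  # close to the true bridge coefficient 1
print("naive OLS of R on W:", theta_ols)   # attenuated by proxy noise in W
```

Here q(w) = w satisfies the bridge equation E[q(W) − R | Z] = 0, and the min-max estimate recovers it, while the naive regression is biased toward 1/2 because W is a noisy proxy for the confounder.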

Define [T_t h](S_t, Z_t, A_t) = E[h(R_t, W_{t+1}, S_{t+1}, Z_{t+1}, A_{t+1}) | S_t, Z_t, A_t] for h ∈ L²{R × W × S × Z × A}, and [T̃_t q](S_t, Z_t, A_t) = E[q(W_t, S_t, A_t) | S_t, Z_t, A_t] for q ∈ L²{W × S × A}. Also take [⟨ν, q⟩](W_t, S_t, Z_t, A_t) = Σ_{a∈A} q(W_t, S_t, a) ν(a | S_t, Z_t, A_t).

(W_{i,t}, S_{i,t})]² + μ∥g∥²_{G^{(t)}}; and Ê[q̂_t(W_t, S_t, a) | S_t = s, Z_t = z, A_t = a'] = (T − t + 1) ĝ_t(s, z, a'; q̂_t(·, ·, a)/(T − t + 1)). Take Q̃^{(t)}_B and G^{(t)}_M as balls in Q̃^{(t)} and G^{(t)} respectively, for some fixed constants B and M, such that the functions in Q̃^{(t)}_B and G^{(t)}_M are uniformly bounded by 1.


Next, we will establish the difference between E[g(S_t, Z_t, A_t) − q(W_t, S_t)]² − E[g*(S_t, Z_t, A_t; q) − q(W_t, S_t)]² and its empirical counterpart (1/n) Σ_{i=1}^n {[g(S_{i,t}, Z_{i,t}, A_{i,t}) − q(W_{i,t}, S_{i,t})]² − [g*(S_{i,t}, Z_{i,t}, A_{i,t}; q) − q(W_{i,t}, S_{i,t})]²}, in order to study the bound for E{g(S_t, Z_t, A_t) − g*(S_t, Z_t, A_t; q)}². To begin with, for any g, g* ∈ G^{(t)} and q ∈ Q̃^{(t)}, we bound the variance of the corresponding element of Υ^{(t)}, where the second inequality is due to the uniform boundedness of g and q, and the last equality is from equation 39. Then we apply the corollary of Theorem 3.3 in Bartlett et al. (2005) to the function class Υ^{(t)}: for any function f ∈ Υ^{(t)}, ∥f∥_∞ ≤ 1 and Var(f) ≤ 16 E f. Take the functional T in Theorem 3.3 of Bartlett et al. (2005) as T(f) = E f², and define r* as the fixed point of a sub-root function ψ bounding the local Rademacher complexity for any r ≥ r*. Then, with probability at least 1 − δ, the corresponding inequality holds uniformly for any f ∈ Υ^{(t)}. If we take κ̄_{t,n} = c √(r*) for some universal constant c and choose the sub-root function ψ accordingly, then κ̄_{t,n} is the critical radius of R_n(Υ^{(t)}).

Therefore, for any

hold with probability at least 1 − δ. Then, combined with the basic inequality (equation 40), with probability at least 1 − δ we obtain the desired bound; the last inequality is from the condition on the tuning parameter μ.

I.3 BOUND THE CRITICAL RADIUS

In this section, we characterize the bounds on the critical radii mentioned above.

Lemma I.1. Suppose G^{(t)}, H^{(t+1)} and Q^{(t)} are VC-subgraph classes with VC dimensions V(G^{(t)}), V(H^{(t+1)}) and V(Q^{(t)}) respectively; then we have the stated bounds.

Proof. Note that for any h ∈ H^{(t+1)}, we can bound ∥h∥, and equation 43 is derived directly from Section D.3.1 in Miao et al. (2022). As for equation 44, by an argument similar to the bounding of log N_n(t, Ω^{(t)}) in Section D.4.2 of Miao et al. (2022), we can bound the empirical entropy of Ξ^{(t)}_B, where N_n(ε, G) denotes the smallest empirical ε-covering of G. The bound in equation 44 is then obtained by bounding the local Rademacher complexity by the entropy integral (see Section D.3.1 in Miao et al. (2022)).

Similar results apply to ∆̄_n and κ̄_n, and we get the following.

Lemma I.2. Suppose G and Q are VC-subgraph classes with VC dimensions V(G) and V(Q) respectively; then we have the corresponding bounds.

Lemma I.3. Suppose G^{(t)}, Q^{(t)} and H^{(t+1)} are RKHSs endowed with reproducing kernels K_G, K_Q and K_H with decreasing sorted eigenvalues. Then ∆̄_{t,n} is upper bounded by the δ satisfying the corresponding fixed-point inequality, and κ̄_{t,n} is upper bounded by the δ satisfying its fixed-point inequality.

Proof. The proof follows an argument similar to that of Lemma D.7 in Miao et al. (2022).

