PESSIMISM IN THE FACE OF CONFOUNDERS: PROVABLY EFFICIENT OFFLINE REINFORCEMENT LEARNING IN PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

Abstract

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an n^{-1/2} suboptimality, where n is the number of trajectories in the dataset. To our best knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.

1. INTRODUCTION

Offline reinforcement learning (RL) (Sutton and Barto, 2018) aims to learn an optimal policy for a sequential decision-making problem purely from an offline dataset collected a priori, without any further interaction with the environment. Offline RL is particularly pertinent to applications in critical domains such as precision medicine (Gottesman et al., 2019) and autonomous driving (Shalev-Shwartz et al., 2016). In these scenarios, interacting with the environment via online experiments might be risky, slow, or even unethical. But offline datasets consisting of past interactions, e.g., treatment records for precision medicine (Chakraborty and Moodie, 2013; Chakraborty and Murphy, 2014) and human driving data for autonomous driving (Sun et al., 2020), are often readily available. As a result, offline RL has attracted substantial research interest recently (Levine et al., 2020). Most existing works on offline RL develop algorithms and theory based on the model of Markov decision processes (MDPs). However, in many real-world applications, due to privacy concerns or limitations of the sensor apparatus, the states of the environment cannot be directly stored in the offline datasets. Instead, only partial observations generated from the states of the environment are stored (Dulac-Arnold et al., 2021). For example, in precision medicine, a physician's treatment might consciously or subconsciously depend on the patient's mood and socioeconomic status (Zhang and Bareinboim, 2016), which are not recorded in the data due to privacy concerns. As another example, in autonomous driving, a human driver makes decisions based on multimodal information about the environment that is not limited to visual and auditory inputs, but only the observations captured by LiDARs and cameras are stored in the datasets (Sun et al., 2020).
In light of the partial observations in the datasets, these situations are better modeled as partially observable Markov decision processes (POMDPs) (Lovejoy, 1991). Existing offline RL methods for MDPs, which fail to handle partial observations, are thus not applicable. In this work, we make an initial step towards studying offline RL in POMDPs where the datasets contain only partial observations of the states. In particular, motivated by the aforementioned real-world applications, we consider the case where the behavior policy takes actions based on the states of the environment, which are not part of the dataset and are thus latent variables. Instead, the trajectories in the dataset consist of partial observations emitted from the latent states, as well as the actions and rewards. For such a dataset, our goal is to learn an optimal policy in the context of general function approximation. Offline RL in POMDPs suffers from several challenges. First of all, it is known that both planning and estimation in POMDPs are intractable in the worst case (Papadimitriou and Tsitsiklis, 1987; Burago et al., 1996; Goldsmith and Mundhenk, 1998; Mundhenk et al., 2000; Vlassis et al., 2012). Thus, we have to identify a set of sufficient conditions that warrants efficient offline RL. More importantly, our problem faces the unique challenge of the confounding issue caused by the latent states, which does not appear in online or offline MDPs, nor in online POMDPs. In particular, both the actions and the observations in the offline dataset depend on the unobserved latent states, and thus are confounded (Pearl, 2009).

Figure 1: Causal graph of the data-generating process for offline RL in a POMDP. Here S_h is the state at step h, and A_h, R_h, and O_h are the action, immediate reward, and observation, respectively. Dotted nodes indicate variables that are not stored in the offline dataset, and solid arrows indicate the dependencies among the variables. Specifically, the action A_h is specified by the behavior policy, which is a function of S_h (blue arrows), and both the observation O_h and the reward R_h depend on the state S_h (red arrows). We remark that S_h affects both A_h and (O_h, R_h) and thus serves as an unobserved confounder.

Such a confounding issue is illustrated by the causal graph in Figure 1. As a result, directly applying offline RL methods for MDPs will incur a considerable confounding bias. Besides, since the latent states evolve according to the Markov transition kernel, the causal structure is dynamic, which makes the confounding issue more challenging to handle than in static causal problems. Furthermore, since we aim to learn the optimal policy, our algorithm also needs to handle the distributional shift between the trajectories induced by the behavior policy and those induced by the family of target policies. Finally, to handle large observation spaces, we need to employ powerful function approximators. As a result, the coupled challenges of (i) the confounding bias, (ii) the distributional shift, and (iii) large observation spaces, which are distinctive to our problem, necessitate new algorithm design and theory. To this end, by leveraging tools from proximal causal inference (Lipsitch et al., 2010; Tchetgen et al., 2020; Miao et al., 2018b;a), we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which provably addresses the challenges of confounding bias and distributional shift in the context of general function approximation.
Specifically, we focus on a benign class of POMDPs where the causal structure involving latent states can be captured by the past and current observations, which serve as the negative control action and outcome, respectively (Miao et al., 2018a;b; Cui et al., 2020; Singh, 2020; Kallus et al., 2021; Bennett and Kallus, 2021; Shi et al., 2021). Then the value of each policy can be identified by a set of confounding bridge functions corresponding to that policy, which satisfy a sequence of backward moment equations similar to the celebrated Bellman equations in classical RL (Bellman and Kalaba, 1965). Thus, by estimating these confounding bridge functions from offline data, we can estimate the value of each policy without incurring the confounding bias. More concretely, P3O involves two components: policy evaluation via minimax estimation and policy optimization via pessimism. Specifically, to tackle the distributional shift, P3O returns the policy that maximizes pessimistic estimates of the values obtained by policy evaluation. Meanwhile, in policy evaluation, to ensure pessimism, we construct a coupled sequence of confidence regions for the confounding bridge functions via minimax estimation, using function approximators. Furthermore, under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an Õ(H √(log(N_fun)/n)) suboptimality, where n is the number of trajectories, H is the length of each trajectory, N_fun stands for the complexity of the employed function classes (e.g., the covering number), and Õ(·) hides logarithmic factors. When specialized to linear function classes, the suboptimality of P3O becomes Õ(√(H^3 d/n)), where d is the dimension of the feature mapping. To our best knowledge, we establish the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.

Footnote: In the online setting, the actions do not directly depend on the latent states, and thus these works do not involve the challenge due to confounded data.

Footnote: A third line of research studies OPE in POMDPs, where the goal is to learn the value of a target policy as opposed to learning the optimal policy. As a result, these works do not need to handle the challenge of distributional shift via pessimism.

1.1. OVERVIEW OF TECHNIQUES

To deal with the coupled challenges of confounding bias, distributional shift, and large observation spaces, our algorithm and analysis rely on the following technical ingredients.

Confidence regions based on minimax estimation via proximal causal inference. To handle the confounded offline dataset, we use the proxy variables from proximal causal inference (Lipsitch et al., 2010; Tchetgen et al., 2020; Miao et al., 2018a;b), which allow us to identify the value of each policy by a set of confounding bridge functions. These bridge functions depend only on observed variables and satisfy a set of backward conditional moment equations. We then estimate these bridge functions via minimax estimation (Dikkala et al., 2020; Chernozhukov et al., 2020; Uehara et al., 2021). More importantly, to handle the distributional shift, we propose a sequence of novel confidence regions for the bridge functions, which quantify the uncertainty of minimax estimation based on finite data. Such confidence regions have not been considered in previous works on off-policy evaluation (OPE) in POMDPs (Bennett and Kallus, 2021; Shi et al., 2021), as pessimism seems unnecessary there. Meanwhile, the confidence regions are constructed as level sets w.r.t. the loss functions of the minimax estimation for the bridge functions. Such a construction contrasts with previous works on offline RL with confidence regions via maximum likelihood estimation (Uehara and Sun, 2021; Liu et al., 2022) or least-squares regression (Xie et al., 2021). Furthermore, we develop a novel theoretical analysis to show that any function in the confidence regions enjoys a fast statistical rate of convergence. Finally, leveraging the backwardly inductive nature of the bridge functions, our proposed confidence regions and analysis take the temporal structure into consideration, which might be of independent interest for the study of dynamic causal inference (Friston et al., 2003).
Pessimism principle for learning POMDPs. To learn the optimal policy in the face of distributional shift, we adopt the pessimism principle, which is shown to be effective for offline RL in MDPs (Liu et al., 2020; Jin et al., 2021; Rashidinejad et al., 2021; Uehara and Sun, 2021; Xie et al., 2021; Yin and Wang, 2021; Zanette et al., 2021; Yin et al., 2022; Yan et al., 2022). Specifically, the newly proposed confidence regions, combined with the identification result based on proximal causal inference, allow us to construct a novel pessimistic estimator of the value of each target policy. From a theoretical perspective, the identification result and the backward induction property of the bridge functions provide a way of decomposing the suboptimality of the learned policy in terms of the statistical errors of the bridge functions. Combined with pessimism and the fast statistical rates enjoyed by all functions in the confidence regions, we show that our proposed P3O algorithm efficiently learns the optimal policy under only a partial coverage assumption on the confounded dataset. We highlight that our work is the first to extend the pessimism principle to offline RL in POMDPs with confounded data.

1.2. RELATED WORK

Our work is closely related to the bodies of literature on (i) reinforcement learning in POMDPs, (ii) offline reinforcement learning (in MDPs), and (iii) OPE via causal inference. Compared to this literature, our work simultaneously involves partial observability, confounded data, and offline policy optimization, and thus faces the challenges of (i)-(iii) at once. We summarize and contrast with the most related existing works in Table 1. We defer the detailed discussion to Appendix B.1.

2. PRELIMINARIES

2.1. EPISODIC POMDPS

Different from an MDP, in a POMDP only the observation o, the action a, and the reward r are observable, while the state s is unobservable. In each episode, the environment first samples an initial state S_1 from µ_1(·). At each step h ∈ [H], the environment emits an observation O_h from O_h(·|S_h). If an action A_h is taken, the environment samples the next state S_{h+1} from P_h(·|S_h, A_h) and assigns a reward R_h given by R_h(S_h, A_h). In our setting, we also let O_0 ∈ O denote the prior observation before step h = 1. We assume that O_0 is independent of the other random variables in the episode given the first state S_1.

2.2. OFFLINE DATA GENERATION: CONFOUNDED DATASET

Now we describe the data-generating process. Motivated by real-world examples such as precision medicine and autonomous driving discussed in Section 1, we assume that the offline data is generated by some behavior policy π_b which has access to the latent states. Specifically, we let π_b = {π^b_h}_{h=1}^H denote a collection of policies such that π^b_h(·|s) : S → ∆(A) specifies the probability of taking each action a ∈ A at state s and step h. This behavior policy induces a set of probability distributions P^b = {P^b_h}_{h=1}^H over the trajectories of the POMDP, where P^b_h is the distribution of the variables at step h when following π_b. Formally, the offline data is denoted by D = {(o^k_0, (o^k_1, a^k_1, r^k_1), ..., (o^k_H, a^k_H, r^k_H))}_{k=1}^n, where n is the number of trajectories and, for each k ∈ [n], the trajectory (o^k_0, (o^k_1, a^k_1, r^k_1), ..., (o^k_H, a^k_H, r^k_H)) is independently sampled from P^b. We highlight that such an offline dataset is confounded, since the latent state S_h, which is not stored in the dataset, simultaneously affects the control variable (i.e., the action A_h) and the outcome variables (i.e., the observation O_h and the reward R_h). Such a setting is prohibitive for existing offline RL algorithms for MDPs, as directly applying them will incur a non-negligible confounding bias.
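For concreteness, this data-generating process can be simulated in a few lines. The following is a minimal tabular sketch in which all sizes, kernels, and the behavior policy are randomly made up for illustration; note that the latent state s drives the action choice but is never written into D:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular sizes, chosen only for illustration.
n_S, n_A, n_O, H = 3, 2, 4, 5

# Random tabular POMDP: emission O(.|s), transition P(.|s,a), reward R(s,a).
emission = rng.dirichlet(np.ones(n_O), size=n_S)           # O_h given S_h
transition = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))  # S_{h+1} given S_h, A_h
reward = rng.uniform(size=(n_S, n_A))
mu1 = rng.dirichlet(np.ones(n_S))                          # initial state dist.

# Behavior policy depends on the LATENT state -> confounding.
pi_b = rng.dirichlet(np.ones(n_A), size=n_S)

def sample_trajectory():
    s = rng.choice(n_S, p=mu1)
    o0 = rng.choice(n_O, p=emission[s])    # prior observation O_0
    traj = [o0]
    for _ in range(H):
        o = rng.choice(n_O, p=emission[s])
        a = rng.choice(n_A, p=pi_b[s])     # action chosen from latent s (not stored)
        r = float(reward[s, a])
        traj.append((o, a, r))
        s = rng.choice(n_S, p=transition[s, a])
    return traj

D = [sample_trajectory() for _ in range(100)]
```

Each trajectory has the form (o_0, (o_1, a_1, r_1), ..., (o_H, a_H, r_H)), matching the format of D above, while the latent states never appear in the dataset.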

2.3. LEARNING OBJECTIVE

The goal of offline RL is to learn from the offline dataset an optimal policy which maximizes the expected cumulative reward. For POMDPs, the learned policy can only depend on the observable information mentioned in Section 2.1. To formally define the set of policies of interest, we first define the space of observable histories as H = {H_h}_{h=0}^{H-1}, where each element τ_h ∈ H_h is a (partial) trajectory such that τ_h ⊆ {(o_1, a_1), ..., (o_h, a_h)}. We use Γ_h to denote the corresponding random variable. We denote by Π(H) the class of policies which make decisions based on the current observation o_h ∈ O and the history τ_{h-1} ∈ H_{h-1}. That is, a policy π = {π_h}_{h=1}^H ∈ Π(H) satisfies π_h(·|o, τ) : O × H_{h-1} → ∆(A). The choice of H induces the policy set Π(H) by specifying the input of the policies. We now introduce three examples of H and the corresponding Π(H).

Example 2.1 (Reactive policy (Azizzadenesheli et al., 2018)). The policy depends only on the current observation O_h. Formally, we have H_{h-1} = {∅} and therefore τ_{h-1} = ∅ for each h ∈ [H].

Example 2.2 (Finite-history policy (Efroni et al., 2022)). The policy depends on the current observation and the history of length at most k. Formally, we have H_{h-1} = (O × A)^{⊗ min{k, h-1}} and τ_{h-1} = ((o_l, a_l), ..., (o_{h-1}, a_{h-1})) for some k ∈ N, where the index l = max{1, h - k}.

Example 2.3 (Full-history policy (Liu et al., 2022)). The policy depends on the current observation and the full history. Formally, H_{h-1} = (O × A)^{⊗(h-1)} and τ_{h-1} = ((o_1, a_1), ..., (o_{h-1}, a_{h-1})).

We illustrate these examples with causal graphs and a more detailed discussion in Appendix C.1. Now, given a policy π ∈ Π(H), we denote by J(π) the value of π, which characterizes the expected cumulative reward the agent receives by following π. Formally, J(π) is defined as

J(π) := E_π[ Σ_{h=1}^H γ^{h-1} R_h(S_h, A_h) ],  (2.1)

where γ ∈ (0, 1] is the discount factor and E_π denotes the expectation w.r.t.
P^π = {P^π_h}_{h=1}^H, the distribution of the trajectories induced by π. We define the suboptimality gap of any policy π̂ as

SubOpt(π̂) := J(π⋆) - J(π̂), where π⋆ ∈ argmax_{π∈Π(H)} J(π).  (2.2)

Here π⋆ is the optimal policy within Π(H). Our goal is to find a policy π̂ ∈ Π(H) that minimizes the suboptimality gap in (2.2) based on the offline dataset D.
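The three history spaces in Examples 2.1-2.3 differ only in how much of the past a policy may condition on. A small hypothetical helper (the name and signature are ours, for illustration only) makes the distinction concrete:

```python
def history(traj, h, kind="full", k=2):
    """Return tau_{h-1}, the history a policy may condition on at step h.

    traj: list [(o_1, a_1), ..., (o_H, a_H)] of observation-action pairs;
    kind: "reactive" (Example 2.1), "finite" (Example 2.2), "full" (Example 2.3).
    """
    if kind == "reactive":           # no history at all
        return ()
    if kind == "finite":             # last min(k, h-1) pairs; l = max(1, h-k)
        l = max(0, h - 1 - k)        # 0-based index of (o_l, a_l)
        return tuple(traj[l:h - 1])
    return tuple(traj[:h - 1])       # the full history up to step h-1
```

For instance, at step h = 4 with k = 2, the reactive, finite-history, and full-history variants return histories of length 0, 2, and 3, respectively.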

3. ALGORITHM: PROXY VARIABLE PESSIMISTIC POLICY OPTIMIZATION

As previously discussed, the offline RL problem introduced in Section 2 for POMDPs suffers from three coupled challenges: (i) the confounding bias, (ii) distributional shift, and (iii) large observation spaces. In the sequel, we introduce an algorithm that addresses all three challenges simultaneously. We first introduce the high-level idea for combating these challenges. Offline RL for POMDPs is known to be intractable in the worst case (Krishnamurthy et al., 2016). We therefore first identify a benign class of POMDPs where the causal structure involving latent states can be captured by only the observable variables available in the dataset D. For such a class of POMDPs, by leveraging tools from proximal causal inference (Lipsitch et al., 2010; Tchetgen et al., 2020), we seek to identify the value J(π) of each policy π ∈ Π(H) via confounding bridge functions b^π (Assumption 3.2), which depend only on observable variables and thus can be estimated using D (Theorem 3.3). Identification via proximal causal inference is discussed in Section 3.1. To estimate these confounding bridge functions, we utilize the fact that they satisfy a sequence of conditional moment equations which resemble the Bellman equations in classical MDPs (Bellman and Kalaba, 1965). We then adopt the idea of minimax estimation (Dikkala et al., 2020; Kallus et al., 2021; Uehara et al., 2021; Duan et al., 2021), which formulates the bridge functions as the solution to a series of minimax optimization problems in (3.11). Additionally, the loss function in minimax estimation readily incorporates function approximators and thus addresses the challenge of large observation spaces. To further handle the distributional shift, we extend the pessimism principle (Liu et al., 2020; Jin et al., 2021; Rashidinejad et al., 2021; Uehara and Sun, 2021; Xie et al., 2021; Yin and Wang, 2021; Zanette et al., 2021) to POMDPs with the help of the confounding bridge functions.
Specifically, based on the confounded dataset, we first construct a novel confidence region CR^π(ξ) for b^π based on level sets with respect to the loss functions of the minimax estimation (see (3.12) for details). Our algorithm, Proxy variable Pessimistic Policy Optimization (P3O), outputs the policy that maximizes the pessimistic estimates of the values of the policies within Π(H). The details of P3O are summarized in Algorithm 1 in Section 3.3.

3.1. POLICY VALUE IDENTIFICATION VIA PROXIMAL CAUSAL INFERENCE

To handle the confounded dataset D, we first identify the policy value J(π) for each π ∈ Π(H) using the idea of proxy variables. Following the notions of proximal causal inference (Tennenholtz et al., 2020; Lipsitch et al., 2010), we assume that there exist negative control actions {Z_h}_{h=1}^H and negative control outcomes {W_h}_{h=1}^H satisfying the following independence assumption.

Assumption 3.1 (Negative control). We assume there exist negative control variables {W_h}_{h=1}^H and {Z_h}_{h=1}^H, measurable with respect to the observed trajectories, such that under P^b it holds that

Z_h ⊥ (O_h, R_h, W_h, W_{h+1}) | (A_h, S_h, Γ_{h-1}),  W_h ⊥ (A_h, Γ_{h-1}, S_{h-1}) | S_h.  (3.1)

We explain in detail the existence of such negative control variables for the three choices of history H in Examples 2.1, 2.2, and 2.3 in Appendix C.1. Besides Assumption 3.1, our identification of the policy value also relies on the notion of confounding bridge functions (Kallus et al., 2021; Shi et al., 2021), for which we make the following assumption.

Assumption 3.2 (Confounding bridge functions). For any history-dependent policy π ∈ Π(H), we assume the existence of the value bridge functions {b^π_h : A × W → R}_{h=1}^H and the weight bridge functions {q^π_h : A × Z → R}_{h=1}^H, defined as solutions to the following equations, which hold almost surely with respect to the measure P^b:

E_{π_b}[ b^π_h(A_h, W_h) | A_h, Z_h ] = E_{π_b}[ R_h π_h(A_h | O_h, Γ_{h-1}) + γ Σ_{a′∈A} b^π_{h+1}(a′, W_{h+1}) π_h(A_h | O_h, Γ_{h-1}) | A_h, Z_h ],  (3.2)

E_{π_b}[ q^π_h(A_h, Z_h) | A_h, S_h, Γ_{h-1} ] = µ_h(S_h, Γ_{h-1}) / π^b_h(A_h | S_h).  (3.3)

Here b^π_{H+1} is the zero function, and µ_h(S_h, Γ_{h-1}) in (3.3) is defined as the importance sampling ratio µ_h(S_h, Γ_{h-1}) := P^π_h(S_h, Γ_{h-1}) / P^b_h(S_h, Γ_{h-1}). We use "confounding bridge function" and "bridge function" interchangeably throughout the paper.
We remark that in the proximal causal inference literature, assuming the existence of such bridge functions is more general than assuming certain complicated completeness conditions (Cui et al., 2020). With Assumptions 3.1 and 3.2 in hand, the policy value can be identified as follows.

Theorem 3.3 (Policy value identification). Under Assumptions 3.1 and 3.2, for any policy π ∈ Π(H), it holds that

J(π) = F(b^π), where F(b^π) := E_{π_b}[ Σ_{a∈A} b^π_1(a, W_1) ].  (3.4)

See Appendix E for a detailed proof. Note that although we have assumed the existence of both the value bridge functions in (3.2) and the weight bridge functions in (3.3), Theorem 3.3 represents J(π) using only the value bridge functions. In (3.2), all the random variables involved are observed by the learner and distributed according to the data distribution P^b, which means that the value bridge functions can be estimated from data. This overcomes the confounding issue.

3.2. POLICY EVALUATION VIA MINIMAX ESTIMATION WITH UNCERTAINTY QUANTIFICATION

According to Theorem 3.3 and Assumption 3.2, to estimate the value J(π) of π ∈ Π(H), it suffices to estimate the value bridge functions {b^π_h}_{h=1}^H by solving (3.2), which is a conditional moment equation. To this end, we adopt the method of minimax estimation (Dikkala et al., 2020; Uehara et al., 2021; Duan et al., 2021). Furthermore, in order to handle the distributional shift between the behavior policy and the target policies, we construct a sequence of confidence regions for {b^π_h}_{h=1}^H based on minimax estimation, which allows us to apply the pessimism principle by finding the most pessimistic estimates within the confidence regions. Specifically, minimax estimation involves two function classes, B ⊆ {b : A × W → R} and G ⊆ {g : A × Z → R}, interpreted as the primal and dual function classes, respectively. Theoretical assumptions on B and G are presented in Section 4. In order to find functions that satisfy (3.2), it suffices to find b = (b_1, ..., b_H) ∈ B^{⊗H} such that the following conditional moment function vanishes:

ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) := E_{π_b}[ b_h(A_h, W_h) - R_h π_h(A_h | O_h, Γ_{h-1}) - γ Σ_{a′∈A} b_{h+1}(a′, W_{h+1}) π_h(A_h | O_h, Γ_{h-1}) | A_h, Z_h ].  (3.5)

Equivalently, it suffices to minimize the root-mean-squared error (RMSE)

L^π_h(b_h, b_{h+1}) := E_{π_b}[ ℓ^π_h(b_h, b_{h+1})(A_h, Z_h)^2 ].  (3.6)

It might seem tempting to directly minimize the empirical version of (3.6). However, this is not viable, as one would obtain a biased estimator due to an additional variance term. The reason is that the quantity defined in (3.5) is a conditional expectation, and therefore the RMSE in (3.6) cannot be unbiasedly estimated from data directly (Farahmand et al., 2016). In the sequel, we adopt the technique of minimax estimation to circumvent this issue. In particular, we first use Fenchel duality to write (3.6) as

L^π_h(b_h, b_{h+1}) = 4λ E_{π_b}[ max_{g∈G} { ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) · g(A_h, Z_h) - λ g(A_h, Z_h)^2 } ],  λ > 0,  (3.7)

which holds when the dual function class G is expressive enough that ℓ^π_h(b_h, b_{h+1})/(2λ) ∈ G.
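The Fenchel-duality step behind (3.7) is the pointwise scalar identity ℓ^2 = 4λ max_g {ℓ·g - λg^2}, with maximizer g⋆ = ℓ/(2λ). A quick numerical check of this identity (the values of ℓ and λ below are arbitrary):

```python
import numpy as np

lam = 0.7
ell = np.linspace(-3.0, 3.0, 13)     # arbitrary residual values

# Inner maximization over a scalar g, on a fine grid.
g = np.linspace(-10.0, 10.0, 200001)
inner = ell[:, None] * g[None, :] - lam * g[None, :] ** 2
dual = 4.0 * lam * inner.max(axis=1)       # 4*lam * max_g {ell*g - lam*g^2}
g_star = g[inner.argmax(axis=1)]           # argmax, should equal ell / (2*lam)
```

Since the inner problem is a concave quadratic in g, any dual class G containing ℓ/(2λ) attains this maximum, which is exactly the expressiveness condition stated above.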
Then, thanks to the interchangeability principle (Rockafellar and Wets, 2009; Dai et al., 2017; Shapiro et al., 2021), we can interchange the order of maximization and expectation and derive

L^π_h(b_h, b_{h+1}) = 4λ max_{g∈G} E_{π_b}[ ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) · g(A_h, Z_h) - λ g(A_h, Z_h)^2 ].  (3.8)

The core idea of minimax estimation is to minimize the empirical version of (3.8) instead of (3.6); the benefit of doing so is a fast statistical rate of Õ(n^{-1/2}) (Dikkala et al., 2020; Uehara et al., 2021), as we will see in the sequel. For simplicity, in the following we define Φ^λ_{π,h} : B × B × G → R with parameter λ > 0 as

Φ^λ_{π,h}(b_h, b_{h+1}; g) := E_{π_b}[ ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) · g(A_h, Z_h) - λ g(A_h, Z_h)^2 ],  (3.9)

and we denote by Φ̂^λ_{π,h} : B × B × G → R the empirical version of Φ^λ_{π,h}, i.e.,

Φ̂^λ_{π,h}(b_h, b_{h+1}; g) := Ê_{π_b}[ ( b_h(A_h, W_h) - R_h π_h(A_h | O_h, Γ_{h-1}) - γ Σ_{a′∈A} b_{h+1}(a′, W_{h+1}) π_h(A_h | O_h, Γ_{h-1}) ) · g(A_h, Z_h) - λ g(A_h, Z_h)^2 ].  (3.10)

Given b_{h+1}, the minimax estimator of b^π_h is defined as

b̂_h(b_{h+1}) := argmin_{b∈B} max_{g∈G} Φ̂^λ_{π,h}(b, b_{h+1}; g).  (3.11)

Based on (3.11), we construct the confidence region for b^π := (b^π_1, ..., b^π_H) ∈ B^{⊗H} as

CR^π(ξ) := { b ∈ B^{⊗H} : max_{g∈G} Φ̂^λ_{π,h}(b_h, b_{h+1}; g) - max_{g∈G} Φ̂^λ_{π,h}(b̂_h(b_{h+1}), b_{h+1}; g) ≤ ξ for all h ∈ [H] },  (3.12)

where ξ > 0 is a confidence parameter. From the above definition, one can see that CR^π(ξ) is actually a coupled sequence of H confidence regions, where each single confidence region aims to cover a function b^π_h. For notational simplicity, we use the single notation CR^π(ξ) to denote all H confidence regions. Intuitively, the confidence region CR^π(ξ) contains all b ∈ B^{⊗H} whose RMSE does not exceed that of (b̂_h(b_{h+1}), b_{h+1}) by too much at each h ∈ [H]. The confidence region takes the sequential dependence of the confounding bridge functions into consideration, in the sense that each b ∈ CR^π(ξ) is restricted through the minimax estimation loss between consecutive steps. As we show in Section D, with high probability the confidence region CR^π(ξ) contains the true value bridge functions b^π.
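To make the estimator (3.11) and the level-set construction (3.12) concrete, the following toy sketch solves a single static conditional-moment problem of the same shape, with finite linear primal and dual classes; the data-generating process, the grids, and ξ are all invented for illustration (P3O itself applies this backward over h with the π_h-weighted targets):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n = 1.0, 2000

# Toy conditional-moment problem standing in for one step of (3.10)-(3.12):
# find b with E[b(W) - Y | Z] = 0, where Z is the instrument-like proxy.
Z = rng.normal(size=n)
W = Z + 0.1 * rng.normal(size=n)
Y = 2.0 * W + 0.1 * rng.normal(size=n)      # true bridge: b(w) = 2*w

B = np.linspace(0.0, 4.0, 41)               # finite primal class: b_c(w) = c*w
G = np.linspace(-2.0, 2.0, 81)              # finite dual class:   g_d(z) = d*z

def minimax_loss(c):
    """Empirical max_g Phi-hat for candidate slope c, cf. (3.10)."""
    resid = c * W - Y                       # empirical residual, cf. (3.5)
    return max(np.mean(resid * (d * Z) - lam * (d * Z) ** 2) for d in G)

losses = np.array([minimax_loss(c) for c in B])
b_hat = B[losses.argmin()]                  # minimax estimator, cf. (3.11)

xi = 0.05                                   # confidence parameter
CR = B[losses - losses.min() <= xi]         # level-set confidence region, cf. (3.12)
```

Here b̂ ≈ 2 recovers the true slope, and the confidence region collects every candidate whose minimax loss is within ξ of the minimum.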
More importantly, every b ∈ CR^π(ξ) enjoys a fast statistical rate of Õ(n^{-1/2}). Now, combining the confidence region (3.12) and the identification formula (3.4), for any policy π ∈ Π(H) we adopt a pessimistic estimate of the value J(π):

J_Pess(π) := min_{b ∈ CR^π(ξ)} F(b), where F(b) := E_{π_b}[ Σ_{a∈A} b_1(a, W_1) ].  (3.13)

3.3. POLICY OPTIMIZATION

Given the pessimistic value estimate (3.13), P3O outputs the policy π̂ that maximizes J_Pess(π), that is,

π̂ := argmax_{π∈Π(H)} J_Pess(π).  (3.14)

We summarize the entire P3O algorithm in Algorithm 1. In Section 4, we show that under mild assumptions on the function classes B and G, and under only a partial coverage assumption on the dataset D, the suboptimality (2.2) of Algorithm 1 decays at the fast statistical rate of Õ(n^{-1/2}), where Õ(·) omits H and factors characterizing the complexity of the function classes.
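Once the per-policy confidence regions are available, (3.13)-(3.14) reduce to a min over each region followed by a max over policies. A schematic sketch with fabricated values of F(b) for each b in CR^π(ξ):

```python
# Hypothetical summary of the confidence regions: each policy name maps to
# the values F(b) for every candidate b in its region CR^pi(xi).
cr_values = {
    "pi_1": [0.9, 1.4, 1.1],
    "pi_2": [1.2, 1.3, 1.25],
    "pi_3": [0.5, 2.0, 1.0],
}

# Pessimistic evaluation (3.13): worst case over each confidence region ...
j_pess = {pi: min(vals) for pi, vals in cr_values.items()}

# ... and policy optimization (3.14): maximize the pessimistic value.
pi_hat = max(j_pess, key=j_pess.get)
```

Note that pi_3 has the largest best-case value but the smallest pessimistic value, so it is not selected; pessimism guards against exactly this kind of optimism under uncertainty.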

4. THEORETICAL RESULTS

In this section, we present our theoretical results. For ease of presentation, we first assume that both the primal function class B and the policy class Π(H) are finite sets with cardinalities |B| and |Π(H)|, respectively, but we allow the dual function class G to be an infinite set. Our results can be easily extended to infinite B and Π(H) using the notion of covering numbers (Wainwright, 2019), which we demonstrate with linear function approximation in Section H.1. We first introduce some necessary assumptions for efficient learning of the optimal policy. To begin with, the following Assumption 4.1 ensures that the offline data generated by π_b has good coverage over π⋆. The problem would become intractable without such an assumption (Chen and Jiang, 2019).

Assumption 4.1 (Partial coverage). We assume that the concentrability coefficient for the optimal policy π⋆, defined as C_{π⋆} := max_{h∈[H]} E_{π_b}[ (q^{π⋆}_h(A_h, Z_h))^2 ], satisfies C_{π⋆} < +∞.

Importantly, Assumption 4.1 only assumes partial coverage, i.e., that the optimal policy π⋆ is well covered by π_b (Jin et al., 2021; Uehara and Sun, 2021), which is significantly weaker than uniform coverage, i.e., that the entire policy class Π(H) is covered by π_b (Chen and Jiang, 2019) in the sense that max_{π∈Π(H)} C_π < +∞. See Appendix B.1 for more about partial coverage in POMDPs. The next assumption is on the function classes B and G. We require that B and G are uniformly bounded, and that G is symmetric, star-shaped, and has bounded localized Rademacher complexity.

Assumption 4.2 (Function classes B and G).
We assume the classes B and G satisfy the following:

i) There exist M_B, M_G < +∞ such that B and G are bounded: sup_{b∈B} sup_{w∈W} | Σ_{a∈A} b(a, w) | ≤ M_B/2 and sup_{g∈G} sup_{(a,z)∈A×Z} |g(a, z)| ≤ M_G;

ii) G is star-shaped, i.e., for any g ∈ G and η ∈ [0, 1], it holds that ηg ∈ G;

iii) G is symmetric, i.e., for any g ∈ G, it holds that -g ∈ G;

iv) For each step h ∈ [H], G has a bounded critical radius α_{G,h,n}, which solves the inequality R_n(G; α) ≤ α^2/M_G, where R_n(G; α) is the localized population Rademacher complexity of G under the distribution of (A_h, Z_h) induced by π_b, that is,

R_n(G; α) := E_{π_b, ε_i}[ sup_{g∈G: ∥g∥_2 ≤ α} (1/n) Σ_{i=1}^n ε_i g(A_h^i, Z_h^i) ],

with ∥g∥_2 := (E_{π_b}[g^2(A_h, Z_h)])^{1/2} and random signs {ε_i}_{i=1}^n independent of the samples {(A_h^i, Z_h^i)} and independently uniformly distributed on {+1, -1}. Also, we denote α_{G,n} := max_{h∈[H]} α_{G,h,n}.

Finally, to ensure that the minimax estimation in Section 3.2 learns the value bridge functions unbiasedly, we make the following completeness and realizability assumptions on the function classes B and G, which are standard in the literature (Dikkala et al., 2020; Xie et al., 2021; Shi et al., 2021).

Assumption 4.3 (Completeness and realizability). We assume that:

i) Completeness: for any h ∈ [H], any π ∈ Π(H), and any b_h, b_{h+1} ∈ B, it holds that (1/(2λ)) ℓ^π_h(b_h, b_{h+1}) ∈ G, where ℓ^π_h is defined in (3.5);

ii) Realizability: for any h ∈ [H], any π ∈ Π(H), and any b_{h+1} ∈ B, there exists b⋆ ∈ B such that L^π_h(b⋆, b_{h+1}) ≤ ε_B for some ε_B < +∞, i.e., we assume that

0 ≤ ε_B := max_{h∈[H], π∈Π(H), b_{h+1}∈B} min_{b_h∈B} E_{π_b}[ ℓ^π_h(b_h, b_{h+1})(A_h, Z_h)^2 ] < +∞.

Here the completeness assumption means that the dual function class G is rich enough to guarantee the equivalence between L^π_h(·,·) and max_{g∈G} Φ^λ_{π,h}(·,·; g). The realizability assumption means that the primal function class B is rich enough that (3.2) always admits an (approximate) solution.
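For intuition on the critical radius in Assumption 4.2(iv), the localized Rademacher complexity can be estimated by Monte Carlo for simple classes. For a one-dimensional linear dual class G = {g_d(z) = d·z}, the localized supremum has the closed form d_max · |(1/n) Σ_i ε_i z_i|; everything below (the distribution, n, and α) is a made-up illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 500, 0.3
z = rng.normal(size=n)                 # stand-in for n samples of (A_h, Z_h)

sigma = np.sqrt(np.mean(z ** 2))       # empirical scale of z
d_max = alpha / sigma                  # ||g_d||_2 <= alpha  <=>  |d| <= d_max

# Monte Carlo over Rademacher signs eps_i uniform on {+1, -1}.
draws = []
for _ in range(2000):
    eps = rng.choice([-1.0, 1.0], size=n)
    # sup over |d| <= d_max of (d/n) * sum(eps_i * z_i), in closed form:
    draws.append(d_max * abs(np.mean(eps * z)))
R = float(np.mean(draws))              # estimate of R_n(G; alpha)
```

The resulting R is of order α/√n, so the inequality R_n(G; α) ≤ α^2/M_G holds once α exceeds roughly M_G/√n, illustrating why the critical radius α_{G,n} shrinks at the n^{-1/2} rate.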
With these technical assumptions in place, we can establish our main theoretical result, which gives an upper bound on the suboptimality (2.2) of the policy π̂ output by Algorithm 1.

Theorem 4.4 (Suboptimality). Suppose Assumptions 3.1, 3.2, 4.1, 4.2, and 4.3 hold, and set the regularization parameter λ = 1 and the confidence parameter ξ = C_1 · M_B^2 M_G^2 · log(|B||Π(H)|H/ζ)/n. Then, with probability at least 1 - 3δ, it holds that

SubOpt(π̂) ≤ C_1′ √(C_{π⋆}) H · M_B M_G · √( log(|B||Π(H)|H/ζ)/n ) + C_1′ √(C_{π⋆}) M_G H ε_B^{1/4},

where ζ = min{δ, 4c_1 exp(-c_2 n α_{G,n}^2)}. Here C_{π⋆}, α_{G,n}, and ε_B are defined in Assumptions 4.1, 4.2, and 4.3, respectively, and C_1, C_1′, c_1, c_2 are problem-independent universal constants.

We introduce the key technical lemmas and sketch the proof of Theorem 4.4 in Section D; see Appendix G for a detailed proof. When α_{G,n} = O(n^{-1/2}) and ε_B = 0, Theorem 4.4 implies that SubOpt(π̂) ≤ Õ(n^{-1/2}), which corresponds to a "fast statistical rate" for minimax estimation (Uehara et al., 2021). The derivation of such a fast rate relies on a novel analysis of the risk of functions in the confidence region, which is given in Lemma D.3 in Section D. Meanwhile, for many choices of the dual function class G, α_{G,n} scales with √(log(N_G)/n), where N_G denotes a complexity measure of the class G. In such cases, the suboptimality also scales with √(log N_G), without explicit dependence on the cardinality of the spaces S, A, or O. Finally, we highlight that, thanks to the principle of pessimism, the suboptimality of P3O depends only on the partial coverage concentrability coefficient C_{π⋆}, which can be significantly smaller than the uniform coverage coefficient sup_{π∈Π(H)} C_π. In conclusion, when α_{G,n} = O(√(log(N_G)/n)) and ε_B = 0, the P3O algorithm enjoys an Õ(H √( C_{π⋆} log(N_G)/n )) suboptimality.

Linear function approximation.
Theorem 4.4 can be readily extended to the case of linear function approximation (LFA) with infinite-cardinality B and Π(H), which yields an O(√(H³ d/n)) suboptimality. Due to space limits, we defer the detailed setup and main results for LFA to Appendix H.1. In this section, we provide a comprehensive clarification of the notation used in this paper. For instance, J(π) denotes the policy value E_π[∑_{h=1}^H γ^{h−1} R_h], and b^π_h and q^π_h denote the value and weight bridge functions, respectively. We use lower-case letters (i.e., s, a, o, and τ) to represent dummy variables and upper-case letters (i.e., S, A, O, and Γ) to represent random variables. We use variables in calligraphic font (i.e., S, A, O, and H) to represent the spaces of variables, and blackboard bold font (i.e., P and O) to represent probability kernels. We use H = {H_h}_{h=0}^{H−1} to denote the space of observable histories, where each element τ_h ∈ H_h is a (partial) trajectory such that τ_h ⊆ {(o_1, a_1), ..., (o_h, a_h)}. We use π^b = {π^b_h}_{h=1}^H to denote the behavior policy, where π^b_h : S → ∆(A). We use π = {π_h}_{h=1}^H ∈ Π(H) to denote a history-dependent policy with π_h : O × H_{h−1} → ∆(A). Also, we use π⋆ = {π⋆_h}_{h=1}^H to denote the optimal history-dependent policy. The offline data D is collected by π^b, as described in Section 2.2. We use P^b = {P^b_h}_{h=1}^H and P^π = {P^π_h}_{h=1}^H to denote the distributions of trajectories under π^b and π, respectively, where P^b_h and P^π_h denote the densities of the corresponding variables at step h. Also, we use E_{π^b} and E_π to denote expectations w.r.t. the distributions P^b and P^π, and Ê_{π^b} to denote the empirical version of E_{π^b}, which is computed on the data D.
(Guo et al., 2016; Krishnamurthy et al., 2016; Jin et al., 2020; Xiong et al., 2021; Jafarnia-Jahromi et al., 2021; Efroni et al., 2022; Liu et al., 2022). In the online setting, the actions are specified by history-dependent policies, and thus the latent state does not directly affect the actions. Consequently, the actions and observations in the online setting are not confounded by latent states. Although these works also conduct uncertainty quantification to encourage exploration, their confidence regions are not based on confounded data and are thus constructed differently. Offline reinforcement learning and pessimism. Our work is also related to the literature on offline RL, and particularly to the works based on the pessimism principle (Antos et al., 2007; Munos and Szepesvári, 2008; Chen and Jiang, 2019; Buckman et al., 2020; Liu et al., 2020; Min et al., 2021; Jin et al., 2021; Zanette, 2021; Xie et al., 2021; Uehara and Sun, 2021; Yin and Wang, 2021; Rashidinejad et al., 2021; Zhan et al., 2022; Yin et al., 2022; Yan et al., 2022). Offline RL faces the challenge of the distributional shift between the behavior policy and the family of target policies. Without any coverage assumption on the offline data, the amount of data needed to find a near-optimal policy can be exponentially large (Buckman et al., 2020; Zanette, 2021). To circumvent this problem, a few existing works study offline RL under a uniform coverage assumption, which requires that the concentrability coefficients between the behavior and target policies be uniformly bounded. See, e.g., Antos et al. (2007); Munos and Szepesvári (2008); Chen and Jiang (2019) and the references therein.
Furthermore, a more recent line of work aims to weaken the uniform coverage assumption by adopting the pessimism principle in algorithm design (Liu et al., 2020; Jin et al., 2021; Rashidinejad et al., 2021; Uehara and Sun, 2021; Xie et al., 2021; Yin and Wang, 2021; Zanette et al., 2021; Yin et al., 2022; Yan et al., 2022). In particular, these works prove theoretically that pessimism is effective in tackling the distributional shift of the offline dataset. By constructing pessimistic value function estimates, they establish upper bounds on the suboptimality of the proposed methods under a significantly weaker partial coverage assumption. That is, these methods can find a near-optimal policy as long as the dataset covers the optimal policy. The efficacy of pessimism has also been validated empirically in Kumar et al. (2020; 2021). Compared with these works on pessimism, we focus on the more challenging setting of POMDPs with a confounded dataset. To perform pessimism in the face of confounders, we conduct uncertainty quantification for the minimax estimation of the confounding bridge functions. Our work complements this line of research by successfully applying pessimism to confounded data. OPE via causal inference. Our work is closely related to the line of research that employs tools from causal inference (Pearl, 2009) to study OPE with unobserved confounders (Oberst and Sontag, 2019; Kallus and Zhou, 2020; Bennett et al., 2021; Kallus and Zhou, 2021; Mastouri et al., 2021; Shi et al., 2021; Bennett and Kallus, 2021; Shi et al., 2022). Among them, Bennett and Kallus (2021) and Shi et al. (2021) are most relevant to our work. In particular, these works also leverage proximal causal inference (Lipsitch et al., 2010; Miao et al., 2018a;b; Cui et al., 2020; Tchetgen et al., 2020; Singh, 2020) to identify the value of the target policy in POMDPs. See Tchetgen et al. (2020) for a detailed survey of proximal causal inference.
In comparison, this line of research only focuses on evaluating a single policy, whereas we focus on learning the optimal policy within a class of target policies. As a result, we need to handle a more challenging distributional shift problem between the behavior policy and an entire class of target policies, as opposed to a single target policy in OPE. Nevertheless, thanks to pessimism, we establish theory based on a partial coverage assumption that is similar to that in the OPE literature. To achieve this goal, we conduct uncertainty quantification for the bridge function estimators, which is absent from the works on OPE. As a result, our analysis differs from that of Bennett and Kallus (2021) and Shi et al. (2021). Relations between minimax-typed loss and least-square-typed loss (Xie et al., 2021). During the preparation of this paper, we found that in the MDP setting, the least-square-typed loss considered by Xie et al. (2021) can be reformulated into the minimax-typed loss that we consider in this paper, with a different dual function class. To see this, consider the MDP setting with a single transition tuple (S_h, A_h, S_{h+1}). The goal is to estimate the Bellman target (BV_{h+1}) : S × A → R, where B is the Bellman operator and V_{h+1} : S → R is a fixed state-value function. For each (s, a) ∈ S × A, the Bellman target is given by (BV_{h+1})(s, a) = R_h(s, a) + ∫_S P_h(ds′|s, a) V_{h+1}(s′). Here R_h is the reward function, which we assume to be known for now, and P_h : S × A → ∆(S) is the unknown transition kernel. We use a function class F to approximate the Bellman target. Then, based on the offline transition data D = {(s^τ_h, a^τ_h, s^τ_{h+1})}_{τ=1}^N, the least-square-typed loss function given in Equation (3.1) of Xie et al. (2021) becomes L^ls_h(f_h) = E_D[ (f_h(S_h, A_h) − R_h − V_{h+1}(S_{h+1}))² ] − min_{f′_h∈F} E_D[ (f′_h(S_h, A_h) − R_h − V_{h+1}(S_{h+1}))² ], (B.1) where R_h is an abbreviation for R_h(S_h, A_h).
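For a finite function class, the loss (B.1) reduces to elementary array arithmetic. The following sketch (all names hypothetical) evaluates it on an offline batch:

```python
import numpy as np

def least_square_loss(F_values, rewards, v_next):
    """Least-square-typed loss (B.1) for a finite class F:
    L_ls(f) = E_D[(f(S,A) - R - V(S'))^2] - min_{f'} E_D[(f'(S,A) - R - V(S'))^2].

    F_values: shape (|F|, N); the f-th row holds f(s_i, a_i) on N transitions.
    rewards, v_next: shape (N,); observed R_i and V_{h+1}(s'_i).
    """
    target = np.asarray(rewards, float) + np.asarray(v_next, float)  # Bellman target samples
    sq = ((np.asarray(F_values, float) - target) ** 2).mean(axis=1)  # squared error per f
    return sq - sq.min()   # subtract the best-in-class error
```

The subtraction of the in-class minimum makes the loss zero for the empirically best candidate, which is the key structural feature carried over to the minimax reformulation below.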
Using the identity x² − y² = (x + y)(x − y), we can rewrite the least-square-typed loss (B.1) as L^ls_h(f_h) = sup_{f′_h∈F} E_D[ ((f_h + f′_h)(S_h, A_h) − 2R_h − 2V_{h+1}(S_{h+1})) · (f_h − f′_h)(S_h, A_h) ]. For the derivation, we further rewrite the first factor as (f_h + f′_h)(S_h, A_h) − 2R_h − 2V_{h+1}(S_{h+1}) = 2(f_h(S_h, A_h) − R_h − V_{h+1}(S_{h+1})) − (f_h − f′_h)(S_h, A_h). With this, we can then rewrite the least-square-typed loss (B.1) as L^ls_h(f_h) = sup_{f′_h∈F} E_D[ 2(f_h(S_h, A_h) − R_h − V_{h+1}(S_{h+1})) · (f_h − f′_h)(S_h, A_h) − ((f_h − f′_h)(S_h, A_h))² ].

Now by defining a new function class

G_f depending on f as G_f = {f − f′ : f′ ∈ F}, we arrive at (1/2) L^ls_h(f_h) = sup_{g_h∈G_{f_h}} E_D[ (f_h(S_h, A_h) − R_h − V_{h+1}(S_{h+1})) g_h(S_h, A_h) − (1/2) g_h(S_h, A_h)² ]. (B.2) This shares the same form as the minimax-typed loss sup_{g_h∈G} Φ̂^{1/2}_{π,h}(b_h, b_{h+1}; g_h) that we consider in our work; see (3.10) in the main text. Still, there are differences. In (B.2), the dual function g_h lies in a dual function class G_{f_h} that depends on the primal function f_h, whereas in our minimax-typed loss the dual function class does not depend on the primal function. Finally, we point out that even though the two losses share the same form, the form of the confidence region considered in our work differs from that of Xie et al. (2021). To see this, still using the previous notation, the confidence region of Xie et al. (2021) (their Equation (3.2)) becomes CR_h(ξ) = {f_h ∈ F : L^ls_h(f_h) ≤ ξ}. Meanwhile, if we reduce our confidence region to the above MDP setting, it takes the form CR_h(ξ) = {f_h ∈ F : L^mm_h(f_h) − min_{f_h∈F} L^mm_h(f_h) ≤ ξ}, where L^mm_h(f_h) denotes the minimax-typed loss. Our algorithm and theoretical analysis are based on the second form of confidence region, which is key to deriving fast statistical rates for elements of the confidence region based on minimax estimation.
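The equivalence between (1/2)·(B.1) and the reformulated objective (B.2) is an exact pointwise identity, so it can be verified numerically for any finite class. A minimal sketch (function names hypothetical):

```python
import numpy as np

def half_ls_loss(f, F_values, target):
    """(1/2) L_ls(f), computed directly from the definition (B.1)."""
    sq = ((F_values - target) ** 2).mean(axis=1)
    return 0.5 * (((f - target) ** 2).mean() - sq.min())

def minimax_loss(f, F_values, target):
    """The reformulated loss (B.2): sup over g = f - f' in G_f of
    E_D[(f - R - V')g - g^2 / 2]."""
    G = f - F_values                                    # dual class G_f = {f - f'}
    vals = ((f - target) * G - 0.5 * G ** 2).mean(axis=1)
    return vals.max()
```

The identity holds because E[(f−y)(f−f′) − (f−f′)²/2] = (E[(f−y)²] − E[(f′−y)²])/2 for every f′, so taking the sup over f′ on both sides matches term by term.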

B.2 DISCUSSION ABOUT THE PARTIAL COVERAGE

More about partial coverage (Assumption 4.1). Our work assumes partial coverage of D according to Assumption 4.1, where we implicitly require that P^π_h(S_h, Γ_{h−1}) / P^b_h(S_h, Γ_{h−1}) < +∞ for all π ∈ Π(H) (we call this the finite-ratio condition from here on). We note that this finite-ratio condition can NOT be regarded as a full coverage assumption. Instead, it is a regularity condition that arises from causal inference. First of all, the finite-ratio condition is different from the full coverage assumption in standard MDPs. The full coverage assumption in standard MDPs usually takes the form max_{π∈Π} P^π_h(s, a) / P^b_h(s, a) < C, for some fixed C > 0. This condition means that the density ratio of the marginal distributions of (s, a) between any target policy π and the behavior policy π^b is uniformly bounded by a constant. This condition (or some similar form) is a common and widely accepted form of full coverage in the MDP literature, e.g., (Chen and Jiang, 2019; Xie and Jiang, 2020). Note that this constant C is a uniform upper bound over the candidate policy class. Very importantly, this constant C appears in the final error bound. The partial coverage assumption in MDPs, on the other hand, is commonly formulated as P^{π⋆}_h(s, a) / P^b_h(s, a) < C. This condition means that the density ratio of the marginal distributions of (s, a) between only the optimal policy π⋆ and the behavior policy π^b is bounded by a constant. The form of this assumption is very close to Assumption 4.1 (partial coverage) in our paper. In other words, our Assumption 4.1 is a version of the partial coverage assumption tailored to the POMDP case. Notably, the constant C in the partial coverage assumption also appears in the final error bound.
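The gap between the uniform and partial constants can be made concrete on toy occupancy distributions (all numbers below are hypothetical illustrations):

```python
import numpy as np

# Toy marginal occupancies over 4 (s, a) pairs at a fixed step h.
p_behavior = np.array([0.4, 0.3, 0.2, 0.1])           # P^b_h(s, a)
p_policies = {                                         # P^pi_h(s, a) for candidates
    "pi_star": np.array([0.5, 0.3, 0.1, 0.1]),
    "pi_bad":  np.array([0.05, 0.05, 0.1, 0.8]),       # concentrates where data is scarce
}

def concentrability(p_pi, p_b):
    """Largest density ratio between a target occupancy and the behavior one."""
    return float(np.max(p_pi / p_b))

# Uniform (full) coverage constant: worst ratio over the whole candidate class.
uniform_C = max(concentrability(p, p_behavior) for p in p_policies.values())
# Partial coverage constant: the ratio for the optimal policy only.
partial_C = concentrability(p_policies["pi_star"], p_behavior)
```

Here the partial constant is 1.25 while the uniform constant is driven to 8 by a single poorly covered candidate; only the former enters a pessimism-based bound.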
In sharp contrast to both the full coverage and partial coverage assumptions, the finite-ratio condition that P^π_h(S_h, Γ_{h−1}) / P^b_h(S_h, Γ_{h−1}) < +∞ for all π ∈ Π(H) does not introduce any constant factor into the final error bound. Even for an infinite policy class Π(H), the ratio can be arbitrarily large without hurting our final error bound. Therefore, this is not a coverage assumption. Our finite-ratio condition is a regularity condition that arises from causal inference, and it is needed to deal with the extra challenge of confounding in our POMDP setting. In related work studying OPE in confounded POMDPs (Shi et al., 2021), this finite-ratio condition is also needed. Overall, our paper is indeed under partial coverage, and the finite-ratio condition is not a kind of coverage assumption.

B.3 POTENTIAL APPLICATION: REAL-WORLD EXAMPLE OF PROXIMAL CAUSAL INFERENCE IN RL

Let us consider the real-world example of applying the POMDP model to sepsis treatment studied by Tsoukalas et al. (2015). In this example, the state, action, observation, and reward of the POMDP are given as follows:
• State variable S_h refers to the clinical state of the patient, e.g., sepsis, SIRS, Bacteremia, etc.
• Observation variable O_h refers to all the information one can read from a medical device, such as the heart rate, the respiratory rate, blood pressure, blood test results for infection, etc.
• Action A_h refers to a certain treatment given to the patient. For example, each antibiotic combination can be considered as an action. As mentioned in Tsoukalas et al. (2015), a total of 48 antibiotics have been included in the patients' remedies.
• Reward/cost values need to be provided empirically by physicians, based on the severity of each state. In the example of Tsoukalas et al. (2015), the states and their corresponding rewards/costs include: Healthy (100,000), No SIRS (50,000), Probable Sepsis (PS, 5000), SIRS (-50), Bacteremia (-10,000), etc.
• Finally, a history trajectory is the record of antibiotic treatments received by the patient. The behavior policy corresponds to the treatment plans that were applied to patients to generate the dataset.

C PROXIMAL CAUSAL INFERENCE

In this section, we complement the discussion of proximal causal inference in Section 3.1.

C.1 ILLUSTRATION OF EXAMPLES

In this subsection, we give detailed discussions of the three examples of history-dependent policies mentioned in Section 2.3. In particular, we give causal graphs of the POMDP when adopting these policies, and we explain the choice of negative control variables for these policies following Section 3.1.

C.1.1 REACTIVE POLICY (EXAMPLE 2.1 REVISITED)

When the target policy is a reactive policy, it depends only on the current observation O_h. That is, H_{h−1} = {∅} and Γ_{h−1} = ∅ for each h ∈ [H]. The causal graph for such a target policy is shown in Figure 2. In this case, we choose the negative control action as Z_h = O_{h−1} (node in green) and the negative control outcome as W_h = O_h (node in yellow). With this choice, we can check the independence conditions in Assumption 3.1 via Figure 2, i.e., under P^b,

O_{h−1} ⊥ (O_h, R_h, O_{h+1}) | S_h, A_h,    O_h ⊥ (A_h, S_{h−1}) | S_h.

C.1.2 FINITE-HISTORY POLICY (EXAMPLE 2.2 REVISITED)

When the target policy is a finite-length history policy, it depends on the current observation and a history of length at most k. That is, H_{h−1} = (O × A)^{⊗min{k,h−1}} for some k ∈ N, and Γ_{h−1} = ((O_l, A_l), ..., (O_{h−1}, A_{h−1})), where the index l = max{1, h − k}. The causal graph for such a target policy is shown in Figure 3. In this case, we choose the negative control action as Z_h = O_{l−1} (node in green) and the negative control outcome as W_h = O_h (node in yellow). With this choice, we can check the independence conditions in Assumption 3.1 via Figure 3, i.e., under P^b,

O_{l−1} ⊥ (O_h, R_h, O_{h+1}) | S_h, A_h, O_{h−1}, A_{h−1}, ..., O_l, A_l,    O_h ⊥ (A_h, S_{h−1}, O_{h−1}, A_{h−1}, ..., O_l, A_l) | S_h.
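As a sanity check on the first independence condition for reactive policies, the following sketch assembles the exact joint distribution of (O_{h−1}, S_h, A_h, O_h) in a small tabular POMDP with a state-dependent behavior policy (all kernels below are hypothetical, randomly generated) and verifies that O_{h−1} ⊥ O_h | S_h, A_h holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def row_stochastic(rows, cols, rng):
    """A random matrix whose rows are conditional distributions."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

nS, nO, nA = 3, 4, 2
p_s = np.full(nS, 1.0 / nS)                  # distribution of S_{h-1}
emit = row_stochastic(nS, nO, rng)           # emission P(O = o | S = s)
pi_b = row_stochastic(nS, nA, rng)           # behavior policy pi_b(a | s)
trans = rng.random((nS, nA, nS))             # transition P(s' | s, a)
trans /= trans.sum(axis=2, keepdims=True)

# Joint over (O_{h-1}, S_h, A_h, O_h), marginalizing S_{h-1}, A_{h-1}:
# P(o1, t, a, o2) = sum_{s,b} p(s) O(o1|s) pi_b(b|s) T(t|s,b) pi_b(a|t) O(o2|t).
joint = np.einsum("s,si,sb,sbt,ta,tj->itaj", p_s, emit, pi_b, trans, pi_b, emit)

def conditionally_independent(joint):
    """Check O_{h-1} independent of O_h given (S_h, A_h) on the joint tensor."""
    for t in range(nS):
        for a in range(nA):
            p = joint[:, t, a, :]
            p = p / p.sum()                  # P(o1, o2 | S_h = t, A_h = a)
            if not np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0))):
                return False
    return True
```

Conditioning on (S_h, A_h) blocks every path from O_{h−1} to O_h in the graph, which is why the conditional joint factors into a product of its marginals.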

C.1.3 FULL-HISTORY POLICY (EXAMPLE 2.3 REVISITED)

When the target policy is a full-history policy, it depends on the current observation and the full history. That is, H_{h−1} = (O × A)^{⊗(h−1)} and Γ_{h−1} = ((O_1, A_1), ..., (O_{h−1}, A_{h−1})). The causal graph for such a target policy is shown in Figure 4. In this case, we choose the negative control action as Z_h = O_0 (node in green) and the negative control outcome as W_h = O_h (node in yellow). With this choice, we can check the independence conditions in Assumption 3.1 via Figure 4, i.e., under P^b,

O_0 ⊥ (O_h, R_h, O_{h+1}) | S_h, A_h, O_{h−1}, A_{h−1}, ..., O_1, A_1,    O_h ⊥ (A_h, S_{h−1}, O_{h−1}, A_{h−1}, ..., O_1, A_1) | S_h.

Consider the equations

E_{π^b}[b^π_h(A_h, O_h) | A_h, S_h] = E_{π^b}[ R_h π_h(A_h|O_h) + γ ∑_{a′} b^π_{h+1}(a′, O_{h+1}) π_h(A_h|O_h) | A_h, S_h ],  (C.2)
E_{π^b}[q^π_h(A_h, O_{h−1}) | A_h, S_h] = μ_h(S_h) / π^b_h(A_h|S_h).  (C.3)

Then we show that the solutions to (C.2) and (C.3) also solve (3.2) and (3.3). The difference between (C.2) and (3.2) is that in (C.2) we condition on the latent state S_h rather than on the observable negative control variable Z_h. In the related literature (Bennett and Kallus, 2021; Shi et al., 2021), the solutions to (C.2) and (C.3) are referred to as unlearnable bridge functions. We first show the existence of {b^π_h}_{h=1}^H in a backward manner. Denote by b^π_{H+1} the zero function. Supposing that b^π_{h+1} exists, we show that b^π_h also exists. Since the spaces S, A, and O are now discrete, we adopt matrix notation. In particular, we denote

B ∈ R^{|A|×|O|}, B(a, o) = b_h(a, o),
O ∈ R^{|S|×|O|}, O(s, o) = P^b_h(O_h = o | S_h = s),
R ∈ R^{|A|×|S|}, R(a, s) = E_{π^b}[ R_h π_h(A_h|O_h) + γ ∑_{a′} b^π_{h+1}(a′, O_{h+1}) π_h(A_h|O_h) | A_h = a, S_h = s ].

The existence of b^π_h satisfying (C.2) is then equivalent to the existence of a matrix B solving the matrix equation B O^⊤ = R.
(C.4) By condition (C.1), we know that the matrix O^⊤ has full column rank, which implies that (C.4) admits a solution B. This proves the existence of b^π_h. For {q^π_h}_{h=1}^H, we use a similar method by considering

Q ∈ R^{|A|×|O|}, Q(a, o) = q_h(a, o),
O_− ∈ R^{|S|×|O|}, O_−(s, o) = P^b_h(O_{h−1} = o | S_h = s),
C ∈ R^{|A|×|S|}, C(a, s) = μ_h(S_h = s) / π^b_h(A_h = a | S_h = s).

The existence of q^π_h satisfying (C.3) is equivalent to the existence of Q solving the matrix equation Q O_−^⊤ = C. (C.5) By condition (C.1), we know that the matrix O_−^⊤ has full column rank, which implies that (C.5) admits a solution Q. This proves the existence of q^π_h. Thus we have shown the existence of {b^π_h}_{h=1}^H and {q^π_h}_{h=1}^H.

Example C.2 (Example 2.2 revisited). For the tabular setting and finite-length history policies (i.e., π_h : O × (O × A)^{min{k,h−1}} → ∆(A)), a sufficient condition under which Assumption 3.2 holds is that, for any action a ∈ A,

rank(P^b_h(O_h | A_h = a, O_{h−k−1})) = |O|,  rank(P^b_h(O_{h−k−1} | A_h = a, S_h, Γ_{h−1})) = |O|,  (C.6)

where P^b_h(O_h | A_h = a, O_{h−k−1}) is a |O| × |O| matrix whose (o, o′)-th element is P^b_h(O_h = o | A_h = a, O_{h−k−1} = o′), and P^b_h(O_{h−k−1} | A_h = a, S_h, Γ_{h−1}) is a |S||H_{h−1}| × |O| matrix defined similarly.

Proof of Example C.2. We first prove the existence of {b^π_h}. For simplicity, we denote P_a = P^b_h(O_h | A_h = a, O_{h−k−1}) ∈ R^{|O|×|O|} for each a ∈ A. Also, we denote

B_a = (b_h(a, O_h)) ∈ R^{|O|×1},
R_a = E_{π^b}[ R_h π_h(A_h | O_h) + γ ∑_{a′} b^π_{h+1}(a′, O_{h+1}) π_h(A_h | O_h) | A_h = a, O_{h−k−1} ] ∈ R^{|O|×1}.

Then for each a ∈ A, the existence of b^π_h(a, ·) is equivalent to the existence of a solution to P_a B_a = R_a. Such a linear equation admits a solution due to our assumption on the matrix P_a. This shows the existence of {b^π_h}.
For {q^π_h}, the deduction is similar: for each a ∈ A, consider

T_a = P^b_h(O_{h−k−1} | A_h = a, S_h, Γ_{h−1}) ∈ R^{|S||H_{h−1}|×|O|},
Q_a = (q_h(a, O_{h−k−1})) ∈ R^{|O|×1},
C_a = μ_h(S_h, Γ_{h−1}) / π^b(a | S_h) ∈ R^{|S||H_{h−1}|×1}.

By considering the equation T_a Q_a = C_a and using the full-rank assumption on the matrix T_a, we obtain the existence of {q^π_h}. This finishes the proof of Example C.2.
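The existence argument above is just the solvability of a linear system under a full-column-rank condition, which can be illustrated numerically. In the sketch below, all matrices (an emission matrix standing in for O and a right-hand side standing in for R) are hypothetical random instances:

```python
import numpy as np

rng = np.random.default_rng(1)
nA, nO, nS = 2, 4, 3                          # |A|, |O|, |S|, with |O| >= |S|

# Emission matrix O(s, o) = P^b_h(O_h = o | S_h = s).
emission = rng.random((nS, nO))
emission /= emission.sum(axis=1, keepdims=True)
# Condition (C.1)-style requirement: O^T has full column rank.
assert np.linalg.matrix_rank(emission.T) == nS

# A right-hand side standing in for the matrix R of conditional expectations.
R = rng.random((nA, nS))

# Solve B O^T = R, i.e. emission @ B^T = R^T, via least squares; full column
# rank of O^T (= full row rank of emission) makes the system consistent.
B = np.linalg.lstsq(emission, R.T, rcond=None)[0].T
```

When |O| > |S| the solution B is not unique; `lstsq` returns the minimum-norm one, which is enough for the existence claim.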

D PROOF SKETCHES OF MAIN THEORETICAL RESULT

In this section, we sketch the proof of the main theoretical result, Theorem 4.4, and we refer to Appendix G for a detailed proof. For simplicity, we denote, for any π ∈ Π(H) and any b = (b_1, ..., b_{H+1}) with each b_h ∈ B, F(b) := E_{π^b}[∑_{a∈A} b_1(a, W_1)]. (D.1) The following lemma relates the value difference to the RMSE losses of the bridge functions:

Lemma D.1 (Value difference). For any π ∈ Π(H) and any b as above, it holds that

F(b^π) − F(b) ≤ ∑_{h=1}^H γ^{h−1} √(C^π) · L^π_h(b_h, b_{h+1}),

where the concentrability coefficient C^π is defined as C^π := sup_{h∈[H]} E_{π^b}[(q^π_h(A_h, Z_h))²]. Proof of Lemma D.1. See Appendix F.1 for a detailed proof.

The following two lemmas characterize the theoretical properties of the confidence region CR^π(ξ). Specifically, Lemma D.2 shows that with high probability the confidence region of π contains the true value bridge function b^π. Besides, Lemma D.3 shows that each bridge function vector b ∈ CR^π(ξ) enjoys a fast statistical rate (Uehara et al., 2021) for its RMSE loss L^π_h defined in (3.6). To obtain such a fast rate, we develop novel proof techniques in Appendix F.3.

Lemma D.2 (Validity of confidence regions). Under Assumptions 3.2 and 4.2, for any 0 < δ < 1, by setting ξ = C_1 (λ + 1/λ) · M²_B · M²_G · log(|B||Π(H)|H/ζ)/n for some problem-independent universal constant C_1 > 0 and ζ = min{δ, 4c_1 exp(−c_2 n α²_{G,n})}, it holds with probability at least 1 − δ that b^π ∈ CR^π(ξ) for any policy π ∈ Π(H). Proof of Lemma D.2. See Appendix F.2 for a detailed proof.

Lemma D.3 (Accuracy of confidence regions). Under Assumptions 3.2, 4.2, and 4.3, by setting the same ξ as in Lemma D.2, with probability at least 1 − δ/2, for any policy π ∈ Π(H), any b ∈ CR^π(ξ), and any step h,

L^π_h(b_h, b_{h+1}) ≤ C_1 M_B M_G (λ + 1/λ) · √(log(|B||Π(H)|H/ζ)/n) + C_1 ε_B^{1/4} M_G^{1/2},

for some problem-independent universal constant C_1 > 0 and ζ = min{δ, 4c_1 exp(−c_2 n α²_{G,n})}. Proof of Lemma D.3. See Appendix F.3 for a detailed proof. When α_{G,n} ∈ O(n^{−1/2}) and ε_B = 0, Lemma D.3 implies that L^π_h(b_h, b_{h+1}) ≤ O(n^{−1/2}).
Now, with Lemmas D.1, D.2, and D.3, by the choice of π̂ in P3O we can show that

J(π⋆) − J(π̂) ≤ O(n^{−1/2}) + max_{b∈CR^{π⋆}(ξ)} F(b) − min_{b∈CR^{π̂}(ξ)} F(b)
 ≤ O(n^{−1/2}) + max_{b∈CR^{π⋆}(ξ)} F(b) − min_{b∈CR^{π⋆}(ξ)} F(b)
 ≤ O(n^{−1/2}) + 2 max_{b∈CR^{π⋆}(ξ)} ( F(b) − F(b^{π⋆}) )
 ≤ O(n^{−1/2}) + 2 max_{b∈CR^{π⋆}(ξ)} ∑_{h=1}^H γ^{h−1} √(C^{π⋆}) · L^{π⋆}_h(b_h, b_{h+1}),  (D.2)

where the first inequality holds by Lemma D.2, the second inequality holds by the optimality of π̂ in Algorithm 1, the third inequality holds directly, and the last inequality holds by Lemma D.1. Finally, applying Lemma D.3 to the right-hand side of (D.2) concludes the proof of Theorem 4.4.
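The role of the optimality of π̂ in the second inequality above is easiest to see on a toy instance of the pessimistic selection rule. The sketch below (names and numbers hypothetical) scores each candidate policy by the worst-case value of F over its confidence region and picks the best worst case:

```python
def p3o_select(regions, F):
    """Pessimistic policy selection: score each candidate policy by the
    worst-case value of F over its confidence region, then return the
    policy with the largest pessimistic score."""
    pessimistic = {pi: min(F(b) for b in region) for pi, region in regions.items()}
    best = max(pessimistic, key=pessimistic.get)
    return best, pessimistic

# Toy illustration: each region is a finite set of candidate bridge
# "functions" (here scalars) and F reads off the induced value estimate.
regions = {
    "pi_well_covered": [1.0, 0.9, 0.8],    # tight region: low uncertainty
    "pi_poorly_covered": [2.0, 0.1],       # wide region: huge uncertainty
}
chosen, scores = p3o_select(regions, F=lambda b: b)
```

Pessimism prefers the well-covered policy (worst case 0.8) over the poorly covered one (worst case 0.1), even though the latter's region contains the optimistic value 2.0.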

E PROOF OF THEOREM 3.3

Proof of Theorem 3.3. For any step h, we denote J_h(π) := E_π[R_h(S_h, A_h)]. We have that

J_h(π) = E_π[R_h(S_h, A_h)] = E_π[ E_π[R_h(S_h, A_h) | O_h, S_h, Γ_{h−1}] ] = E_π[ ∑_{a∈A} R_h(S_h, a) π_h(a | O_h, Γ_{h−1}) ] = E_π[ E_π[ ∑_{a∈A} R_h(S_h, a) π_h(a | O_h, Γ_{h−1}) | S_h, Γ_{h−1} ] ],

where the second and the last equalities follow from the tower property of conditional expectation. Using the definition of the density ratio μ_h(S_h, Γ_{h−1}) in Assumption 3.2, we can change the outer expectation to E_{π^b}:

J_h(π) = E_{π^b}[ μ_h(S_h, Γ_{h−1}) · E_π[ ∑_{a∈A} R_h(S_h, a) π_h(a | O_h, Γ_{h−1}) | S_h, Γ_{h−1} ] ]
 = E_{π^b}[ ∑_{a∈A} R_h(S_h, a) · π_h(a | O_h, Γ_{h−1}) · μ_h(S_h, Γ_{h−1}) ]
 = E_{π^b}[ ∑_{a∈A} π^b_h(a | S_h) · R_h(S_h, a) · (π_h(a | O_h, Γ_{h−1}) / π^b_h(a | S_h)) · μ_h(S_h, Γ_{h−1}) ]
 (a) = E_{π^b}[ E_{π^b}[ R_h(S_h, A_h) · (π_h(A_h | O_h, Γ_{h−1}) / π^b_h(A_h | S_h)) · μ_h(S_h, Γ_{h−1}) | S_h, O_h, Γ_{h−1} ] ]
 = E_{π^b}[ R_h(S_h, A_h) · π_h(A_h | O_h, Γ_{h−1}) · μ_h(S_h, Γ_{h−1}) / π^b_h(A_h | S_h) ],

where step (a) follows from the fact that A_h ∼ π^b_h(· | S_h) and satisfies A_h ⊥ (O_h, Γ_{h−1}) | S_h under π^b. Now, using the definition (3.3) of the weight bridge function q^π_h in Assumption 3.2, we have that

J_h(π) = E_{π^b}[ R_h(S_h, A_h) · π_h(A_h | O_h, Γ_{h−1}) · E_{π^b}[q^π_h(A_h, Z_h) | S_h, A_h, Γ_{h−1}] ]
 (a) = E_{π^b}[ R_h(S_h, A_h) · π_h(A_h | O_h, Γ_{h−1}) · q^π_h(A_h, Z_h) ]
 = E_{π^b}[ E_{π^b}[ R_h(S_h, A_h) · π_h(A_h | O_h, Γ_{h−1}) · q^π_h(A_h, Z_h) | A_h, Z_h ] ]
 = E_{π^b}[ E_{π^b}[ R_h(S_h, A_h) · π_h(A_h | O_h, Γ_{h−1}) | A_h, Z_h ] · q^π_h(A_h, Z_h) ],

where step (a) follows from the assumption that Z_h ⊥ (O_h, R_h) | S_h, A_h, Γ_{h−1} by Assumption 3.1.
Now using the definition (3.2) of value bridge function b π h in Assumption 3.2, we have that J h (π) = E π b E π b b π h (A h , W h ) -γ a ′ ∈A b π h+1 (a ′ , W h+1 )π h (A h |O h , Γ h-1 ) A h , Z h q π h (A h , Z h ) = E π b [f (S h , A h , O h , W h , W h+1 , Γ h-1 ) • q π h (A h , Z h )] = E π b [E π b [f (S h , A h , O h , W h , W h+1 , Γ h-1 ) • q π h (A h , Z h )|S h , A h , O h , W h , W h+1 , Γ h-1 ]] , = E π b [f (S h , A h , O h , W h , W h+1 , Γ h-1 ) • E π b [q π h (A h , Z h )|S h , A h , O h , W h , W h+1 , Γ h-1 ]] , (a) = E π b [f (S h , A h , O h , W h , W h+1 , Γ h-1 ) • E π b [q π h (A h , Z h )|S h , A h , Γ h-1 ]] , where for simplicity we have denoted that f (S h , A h , O h , W h , W h+1 , Γ h-1 ) = b π h (A h , W h ) -γ a ′ ∈A b π h+1 (a ′ , W h+1 )π h (A h |O h , Γ h-1 ), and step (a) follows from the assumption that Z h ⊥ O h , W h , W h+1 |S h , A h , Γ h-1 by Assumption 3.1. By the definition (3.3) of weight bridge function q π h in Assumption 3.2 again, we have that J h (π) = E π b f (S h , A h , O h , W h , W h+1 , Γ h-1 ) • µ h (S h , Γ h-1 ) π b h (A h |S h ) (a) = E π b b π h (A h , W h ) -γ a ′ ∈A b π h+1 (a ′ , W h+1 )π h (A h |O h , Γ h-1 ) • µ h (S h , Γ h-1 ) π b h (A h |S h ) , where step (a) just applies the definition of f . Now sum J h (π) over h ∈ [H], we have that J(π) = H h=1 γ h-1 J h (π) = E π b µ 1 (S 1 , Γ 0 ) π b 1 (A 1 |S 1 ) b π 1 (A 1 , W 1 ) (A) + H h=2 γ h-1 ∆ h (B) , (E.1) where for simplicity we define ∆ h for h = 2, • • • , H as ∆ h = E π b µ h (S h , Γ h-1 ) π b h (A h |S h ) b π h (A h , W h ) - µ h-1 (S h-1 , Γ h-2 ) π b h-1 (A h-1 |S h-1 ) • a ′ ∈A b π h (a ′ , W h )π h-1 (A h-1 |O h-1 , Γ h-1 ) . In the sequel, we deal with term (A) and (B) respectively. 
On the one hand, we have that

(A) (a) = E_{π^b}[ P^π_1(S_1, Γ_0) / (P^b_1(S_1, Γ_0) π^b_1(A_1 | S_1)) · b^π_1(A_1, W_1) ]
 (b) = E_{π^b}[ (1/π^b_1(A_1 | S_1)) b^π_1(A_1, W_1) ]
 = E_{π^b}[ E_{π^b}[ (1/π^b_1(A_1 | S_1)) b^π_1(A_1, W_1) | S_1, W_1 ] ]
 (c) = E_{π^b}[ ∑_{a∈A} (π^b_1(a | S_1)/π^b_1(a | S_1)) b^π_1(a, W_1) ] = E_{π^b}[ ∑_{a∈A} b^π_1(a, W_1) ],

where step (a) follows from the definition of μ_1(S_1, Γ_0) in Assumption 3.2, step (b) follows from the fact that at h = 1, P^b_1(S_1, Γ_0) = P^π_1(S_1, Γ_0), and step (c) follows from the assumption that A_1 ⊥ W_1 | S_1 by Assumption 3.1. On the other hand, term (B) in (E.1) is actually 0, which we show by proving that Δ_h = 0 for any h ≥ 2. We write Δ_h = Δ¹_h − Δ²_h and consider Δ¹_h and Δ²_h respectively, where

Δ¹_h = E_{π^b}[ μ_h(S_h, Γ_{h−1}) / π^b_h(A_h | S_h) · b^π_h(A_h, W_h) ],
Δ²_h = E_{π^b}[ μ_{h−1}(S_{h−1}, Γ_{h−2}) / π^b_{h−1}(A_{h−1} | S_{h−1}) · ∑_{a′∈A} b^π_h(a′, W_h) π_{h−1}(A_{h−1} | O_{h−1}, Γ_{h−2}) ].

In the sequel, we prove that Δ¹_h = Δ²_h for the three cases of T_h in Examples 2.1, 2.2, and 2.3, respectively. Case 1: Reactive policy (Example 2.1). We first focus on the simple case where the policy π is reactive. Since T_h = ∅ for reactive policies, we can equivalently write μ_h(S_h, Γ_{h−1}) as μ_h(S_h) = P^π_h(S_h)/P^b_h(S_h). Now we can rewrite Δ¹_h as

Δ¹_h = E_{π^b}[ P^π_h(S_h) / (P^b_h(S_h) π^b_h(A_h | S_h)) · b^π_h(A_h, W_h) ]
 (a) = ∫_S P^b_h(s_h) ds_h ∑_{a_h∈A} π^b_h(a_h | s_h) ∫_W P^b_h(w_h | s_h, a_h) dw_h · [P^π_h(s_h) / (P^b_h(s_h) π^b_h(a_h | s_h))] b^π_h(a_h, w_h)
 (b) = ∑_{a_h∈A} ∫_S P^π_h(s_h) ds_h ∫_W P^b_h(w_h | s_h) dw_h · b^π_h(a_h, w_h).

Here step (a) expands the expectation into integrals against the corresponding density functions, and step (b) follows from cancelling the same terms and the fact that W_h ⊥ A_h | S_h under Assumption 3.1.
For Δ²_h, we can also rewrite it as

Δ²_h = E_{π^b}[ P^π_{h−1}(S_{h−1}) π_{h−1}(A_{h−1} | O_{h−1}) / (P^b_{h−1}(S_{h−1}) π^b_{h−1}(A_{h−1} | S_{h−1})) · ∑_{a′∈A} b^π_h(a′, W_h) ]
 (a) = ∫_S P^b_{h−1}(s_{h−1}) ds_{h−1} ∫_O O_{h−1}(o_{h−1} | s_{h−1}) do_{h−1} ∑_{a_{h−1}∈A} π^b_{h−1}(a_{h−1} | s_{h−1}) ∫_S P_h(s_h | s_{h−1}, a_{h−1}) ds_h ∫_W P^b_h(w_h | s_h, s_{h−1}, a_{h−1}, o_{h−1}) · [P^π_{h−1}(s_{h−1}) π_{h−1}(a_{h−1} | o_{h−1}) / (P^b_{h−1}(s_{h−1}) π^b_{h−1}(a_{h−1} | s_{h−1}))] ∑_{a_h∈A} b^π_h(a_h, w_h) dw_h.

Here step (a) follows from expanding the expectation; the factors π^b_{h−1}(a_{h−1} | s_{h−1}) cancel. It follows that

Δ²_h (b) = ∑_{a_h∈A} ∫_S P^π_{h−1}(s_{h−1}) ds_{h−1} ∫_O O_{h−1}(o_{h−1} | s_{h−1}) do_{h−1} ∑_{a_{h−1}∈A} π_{h−1}(a_{h−1} | o_{h−1}) ∫_S P_h(s_h | s_{h−1}, a_{h−1}) ds_h ∫_W P^b_h(w_h | s_h) · b^π_h(a_h, w_h) dw_h
 (c) = ∑_{a_h∈A} ∫_S P^π_h(s_h) ds_h ∫_W P^b_h(w_h | s_h) · b^π_h(a_h, w_h) dw_h.

Here step (b) follows from cancelling the same terms and using the fact that W_h ⊥ (S_{h−1}, A_{h−1}, O_{h−1}) | S_h by Assumption 3.1, and step (c) follows by marginalizing over S_{h−1}, A_{h−1}, O_{h−1}. Thus we have proved that Δ¹_h = Δ²_h for reactive policies and consequently Δ_h = Δ¹_h − Δ²_h = 0. Case 2: Finite-history policy (Example 2.2). Now we have that Γ_{h−1} ∪ {A_h, O_h} = {A_{l−1}, O_{l−1}} ∪ T_h, where the index l = max{0, h − k}. Similarly, we can first rewrite Δ¹_h as

Δ¹_h = E_{π^b}[ P^π_h(S_h, Γ_{h−1}) / (P^b_h(S_h, Γ_{h−1}) π^b_h(A_h | S_h)) · b^π_h(A_h, W_h) ]
 (a) = ∫_{S×H_{h−1}} P^b_h(s_h, τ_{h−1}) ds_h dτ_{h−1} ∑_{a_h∈A} π^b_h(a_h | s_h) ∫_W P^b_h(w_h | s_h, a_h, τ_{h−1}) dw_h · [P^π_h(s_h, τ_{h−1}) / (P^b_h(s_h, τ_{h−1}) π^b_h(a_h | s_h))] b^π_h(a_h, w_h)
 (b) = ∑_{a_h∈A} ∫_{S×H_{h−1}} P^π_h(s_h, τ_{h−1}) ds_h dτ_{h−1} ∫_W P^b_h(w_h | s_h) dw_h · b^π_h(a_h, w_h).

Here step (a) follows from expanding the expectation, and step (b) follows from cancelling the same terms and using the fact that W_h ⊥ (A_h, Γ_{h−1}) | S_h under Assumption 3.1.
For Δ²_h, we can also rewrite it as

Δ²_h = E_{π^b}[ P^π_{h−1}(S_{h−1}, Γ_{h−2}) π_{h−1}(A_{h−1} | O_{h−1}) / (P^b_{h−1}(S_{h−1}, Γ_{h−2}) π^b_{h−1}(A_{h−1} | S_{h−1}, Γ_{h−2})) · ∑_{a′∈A} b^π_h(a′, W_h) ]
 (a) = ∫_{S×H_{h−2}} P^b_{h−1}(s_{h−1}, τ_{h−2}) ds_{h−1} dτ_{h−2} ∫_O O_{h−1}(o_{h−1} | s_{h−1}) do_{h−1} ∑_{a_{h−1}∈A} π^b_{h−1}(a_{h−1} | s_{h−1}) ∫_S P_h(s_h | s_{h−1}, a_{h−1}) ds_h ∫_W P^b_h(w_h | s_h, s_{h−1}, a_{h−1}, o_{h−1}, τ_{h−2}) · [P^π_{h−1}(s_{h−1}, τ_{h−2}) π_{h−1}(a_{h−1} | o_{h−1}, τ_{h−2}) / (P^b_{h−1}(s_{h−1}, τ_{h−2}) π^b_{h−1}(a_{h−1} | s_{h−1}))] ∑_{a_h∈A} b^π_h(a_h, w_h) dw_h
 (b) = ∑_{a_h∈A} ∫ P^π_{h−1}(s_{h−1}, τ̃_{h−2}, a_l, o_l) ds_{h−1} dτ̃_{h−2} da_l do_l ∫_O O_{h−1}(o_{h−1} | s_{h−1}) do_{h−1} ∑_{a_{h−1}∈A} π_{h−1}(a_{h−1} | o_{h−1}, τ̃_{h−2}) ∫_S P_h(s_h | s_{h−1}, a_{h−1}) ds_h ∫_W P^b_h(w_h | s_h) · b^π_h(a_h, w_h) dw_h  (E.2)
 (c) = ∑_{a_h∈A} ∫_{S×H_{h−1}} P^π_h(s_h, τ_{h−1}) ds_h dτ_{h−1} ∫_W P^b_h(w_h | s_h) · b^π_h(a_h, w_h) dw_h,

where the index l = max{1, h − 1 − k}. In step (b), we have denoted τ̃_{h−2} = τ_{h−2} \ {a_l, o_l}, and it holds that τ_{h−1} = τ̃_{h−2} ∪ {o_{h−1}, a_{h−1}}. Here step (a) follows from expanding the expectation, step (b) follows from cancelling the same terms and using the fact that W_h ⊥ (S_{h−1}, A_{h−1}, Γ_{h−1}) | S_h under Assumption 3.1, and step (c) follows by marginalizing over S_{h−1}, A_l, O_l. Thus we have proved that Δ¹_h = Δ²_h for finite-length history policies and consequently Δ_h = Δ¹_h − Δ²_h = 0. Case 3: Full-history policy (Example 2.3). For full-history information T_h, we have that Γ_{h−1} ∪ {A_h, O_h} = T_h. Following the same argument as in Case 2 (Example 2.2), we can first show that

Δ¹_h = ∑_{a_h∈A} ∫_{S×H_{h−1}} P^π_h(s_h, τ_{h−1}) ds_h dτ_{h−1} ∫_W P^b_h(w_h | s_h) dw_h · b^π_h(a_h, w_h).
Besides, for Δ²_h, by a similar argument as in Case 2, except that we do not need to marginalize over A_l, O_l in (E.2), we can show that

Δ²_h = ∑_{a_h∈A} ∫_{S×H_{h−2}} P^π_{h−1}(s_{h−1}, τ_{h−2}) ds_{h−1} dτ_{h−2} ∫_O O_{h−1}(o_{h−1} | s_{h−1}) do_{h−1} ∑_{a_{h−1}∈A} π_{h−1}(a_{h−1} | o_{h−1}, τ_{h−2}) ∫_S P_h(s_h | s_{h−1}, a_{h−1}) ds_h ∫_W P^b_h(w_h | s_h, s_{h−1}, a_{h−1}, o_{h−1}, τ_{h−2}) dw_h · b^π_h(a_h, w_h)
 = ∑_{a_h∈A} ∫_{S×H_{h−1}} P^π_h(s_h, τ_{h−1}) ds_h dτ_{h−1} ∫_W P^b_h(w_h | s_h) dw_h · b^π_h(a_h, w_h).

Therefore, Δ¹_h = Δ²_h for full-history policies as well, and consequently Δ_h = Δ¹_h − Δ²_h = 0. Plugging this into (E.1) yields J(π) = (A) = E_{π^b}[ ∑_{a∈A} b^π_1(a, W_1) ]. This finishes the proof of Theorem 3.3.
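The change-of-measure step at the heart of this proof, rewriting E_π[R_h] as an E_{π^b}-expectation weighted by π_h(a|o)·μ_h(s)/π^b_h(a|s), can be checked exactly in a one-step tabular model with a reactive target policy. All kernels below are hypothetical random instances:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nO, nA = 3, 4, 2

def rows(m):
    return m / m.sum(axis=-1, keepdims=True)

p_pi = rng.random(nS); p_pi /= p_pi.sum()      # P^pi(s): state marginal under pi
p_b = rng.random(nS); p_b /= p_b.sum()         # P^b(s): state marginal under pi_b
emit = rows(rng.random((nS, nO)))              # emission O(o | s)
pi = rows(rng.random((nO, nA)))                # reactive target policy pi(a | o)
pi_b = rows(rng.random((nS, nA)))              # behavior policy pi_b(a | s)
R = rng.random((nS, nA))                       # reward R(s, a)
mu = p_pi / p_b                                # density ratio mu(s)

# Left-hand side: E_pi[R(S, A)], with O ~ O(.|S) and A ~ pi(.|O).
joint_pi = p_pi[:, None, None] * emit[:, :, None] * pi[None, :, :]
lhs = (joint_pi * R[:, None, :]).sum()

# Right-hand side: E_{pi_b}[ R(S, A) * pi(A | O) * mu(S) / pi_b(A | S) ],
# with A ~ pi_b(.|S) under the behavior distribution.
joint_b = p_b[:, None, None] * emit[:, :, None] * pi_b[:, None, :]
weight = R[:, None, :] * pi[None, :, :] * (mu[:, None, None] / pi_b[:, None, :])
rhs = (joint_b * weight).sum()
```

The sampling probability π^b(a|s) cancels against the 1/π^b(a|s) in the weight, which is exactly the cancellation used in the proof.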

F PROOF OF LEMMAS IN SECTION D

We first review and define several notations and quantities that are useful in the proofs of the lemmas in Section D. Firstly, we define the mapping ℓ^π_h : B × B → {A × Z → R} as

ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) := E_{π^b}[ b_h(A_h, W_h) − R_h π_h(A_h | O_h, Γ_{h−1}) − γ ∑_{a′∈A} b_{h+1}(a′, W_{h+1}) π_h(A_h | O_h, Γ_{h−1}) | A_h, Z_h ].  (F.1)

Furthermore, for each step h ∈ [H], we define the joint space I_h = A × W × O × H_{h−1} × W and the mapping ς^π_h : B × B → {I_h → R} as

ς^π_h(b_h, b_{h+1})(A_h, W_h, O_h, Γ_{h−1}, W_{h+1}) := b_h(A_h, W_h) − R_h π_h(A_h | O_h, Γ_{h−1}) − γ ∑_{a′∈A} b_{h+1}(a′, W_{h+1}) π_h(A_h | O_h, Γ_{h−1}).  (F.2)

When appropriate, we abbreviate I_h = (A_h, W_h, O_h, Γ_{h−1}, W_{h+1}) ∈ I_h in the sequel. Using definitions (F.1) and (F.2), we further introduce two mappings Φ^λ_{π,h}, Φ_{π,h} : B × B × G → R, as defined by (3.9):

Φ^λ_{π,h}(b_h, b_{h+1}; g) := E_{π^b}[ ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) · g(A_h, Z_h) − λ g(A_h, Z_h)² ],
Φ_{π,h}(b_h, b_{h+1}; g) := Φ^0_{π,h}(b_h, b_{h+1}; g) = E_{π^b}[ ℓ^π_h(b_h, b_{h+1})(A_h, Z_h) · g(A_h, Z_h) ].

Also, recall from (3.10) that the empirical versions Φ̂^λ_{π,h}, Φ̂_{π,h} of Φ^λ_{π,h}, Φ_{π,h} are defined as

Φ̂^λ_{π,h}(b_h, b_{h+1}; g) := Ê_{π^b}[ ς^π_h(b_h, b_{h+1})(I_h) · g(A_h, Z_h) − λ g(A_h, Z_h)² ],
Φ̂_{π,h}(b_h, b_{h+1}; g) := Φ̂^0_{π,h}(b_h, b_{h+1}; g) = Ê_{π^b}[ ς^π_h(b_h, b_{h+1})(I_h) · g(A_h, Z_h) ].

Meanwhile, for ease of theoretical analysis, we define the quantity ∥g∥²_2 := E_{π^b}[g(A_h, Z_h)²], and we denote by ∥g∥²_{2,n} its empirical version, i.e., ∥g∥²_{2,n} := Ê_{π^b}[g(A_h, Z_h)²]. We remark that we drop the dependence of ∥g∥_2 on the step h, since it is always clear from the context in the proofs and causes no confusion.
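Given samples of the residual ς^π_h(b_h, b_{h+1})(I_h) and of a dual function g evaluated at (A_h, Z_h), the empirical objective Φ̂^λ is a one-line average, and the minimax loss takes a supremum over a (here finite) dual class. A minimal sketch with hypothetical names:

```python
import numpy as np

def phi_hat(residuals, g_values, lam):
    """Empirical objective Phi-hat^lam: mean over the data of
    zeta(b_h, b_{h+1})(I_h) * g(A_h, Z_h) - lam * g(A_h, Z_h)^2."""
    residuals = np.asarray(residuals, float)
    g_values = np.asarray(g_values, float)
    return float(np.mean(residuals * g_values - lam * g_values ** 2))

def minimax_loss_hat(residuals, dual_class, lam):
    """sup_{g in G} Phi-hat^lam over a finite dual class G, given each
    dual function by its values on the same sample."""
    return max(phi_hat(residuals, g, lam) for g in dual_class)
```

Since G is symmetric and contains functions of norm zero, the supremum is always nonnegative, matching the role of the minimax loss as a nonnegative violation measure.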
F (b π ) -F (b) (a) = E π b a∈A b π 1 (a, W 1 ) -b 1 (a, W 1 ) = E π b a∈A π b 1 (a|S 1 ) π b 1 (a|S 1 ) (b π 1 (a, W 1 ) -b 1 (a, W 1 )) (b) = E π b E π b 1 π b 1 (A 1 |S 1 ) (b π 1 (a, W 1 ) -b 1 (a, W 1 )) S 1 , W 1 = E π b 1 π b 1 (A 1 |S 1 ) (b π 1 (A 1 , W 1 ) -b 1 (A 1 , W 1 )) where step (a) follows from Theorem 3.3 and (D.1), and step (b) holds since A 1 ⊥ W 1 | S 1 by Assumption 3.1. Notice that by definition (3.3), at step h = 1, the weight bridge function q π h satisfies equation E π b [q π 1 (A 1 , Z 1 )|A 1 , S 1 , Γ 0 ] = P π h (S 1 , Γ 0 ) P π h (S 1 , Γ 0 )π b 1 (A 1 |S 1 ) = 1 π b 1 (A 1 |S 1 ) , which further gives that F (b π ) -F (b) = E π b E π b [q π 1 (A 1 , Z 1 )|A 1 , S 1 , Γ 0 ] (b π 1 (A 1 , W 1 ) -b 1 (A 1 , W 1 )) (a) = E π b E π b [q π 1 (A 1 , Z 1 )|A 1 , S 1 , W 1 , Γ 0 ] • b π 1 (A 1 , W 1 ) -b 1 (A 1 , W 1 ) = E π b q π 1 (A 1 , Z 1 ) b π 1 (A 1 , W 1 ) -b 1 (A 1 , W 1 ) , where step (a)holds since Z 1 ⊥ W 1 | A 1 , S 1 , H 0 by Assumption 3.1. Now we can further obtain that, F (b π ) -F (b) = E π b q π 1 (A 1 , Z 1 )E π b [b π 1 (A 1 , W 1 ) -b 1 (A 1 , W 1 )|A 1 , Z 1 ] (a) = E π b q π 1 (A 1 , Z 1 ) E π b R 1 π 1 (A 1 |O 1 , Γ 0 ) + γ a ′ b π 2 (a ′ , W 2 )π 1 (A 1 |O 1 , Γ 0 ) A 1 , Z 1 -E π b [b 1 (A 1 , W 1 )|A 1 , Z 1 ] , where step (a) follows from the definition in (3.2) of value bridge function b π 1 in Assumption 3.2. 
Now to relate the difference between F (b π ) and F (b) with the RMSE loss L π 1 defined in (3.6), we rewrite the above equation as the following, F (b π ) -F (b) = E π b q π 1 (A 1 , Z 1 ) E π b R 1 π h (A 1 |O 1 , Γ 0 ) + γ a ′ b π 2 (a ′ , W 2 )π 1 (A 1 |O 1 , Γ 0 ) A 1 , Z 1 -E π b R h π 1 (A 1 |O 1 , Γ 0 ) + γ a ′ b 2 (a ′ , W h+1 )π 1 (A 1 |O 1 , Γ 0 ) A 1 , Z 1 + E π b R 1 π 1 (A 1 |O 1 , Γ 0 ) + γ a ′ b 2 (a ′ , W 2 )π 1 (A 1 |O 1 , Γ 0 ) A 1 , Z 1 -E π b b 1 (A 1 , W 1 ) A 1 , Z 1 = E π b q π 1 (A 1 , Z 1 ) γE π b a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) A 1 , Z 1 + E π b R 1 π h (A 1 |O 1 , Γ 0 ) + γ a ′ b 2 (a ′ , W 2 )π h (A 1 |O 1 , Γ 0 ) -b 1 (A 1 , W 1 ) A 1 , Z 1 . (F.4) We deal with the two terms in the right-hand side of (F.4) respectively. On the one hand, the first term equals to γE π b q π 1 (A 1 , Z 1 )E π b a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) A 1 , Z 1 = γE π b q π 1 (A 1 , Z 1 ) a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) = γE π b E π b q π 1 (A 1 , Z 1 ) S 1 , A 1 , Γ 0 , O 1 , W 2 a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) (a) = γE π b E π b q π 1 (A 1 , Z 1 ) S 1 , A 1 , Γ 0 a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) (b) = γE π b µ 1 (S 1 , Γ 0 ) π b 1 (A 1 |S 1 ) a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) , where step (a) follows from the fact that Z 1 ⊥ O 1 , W 2 |S 1 , A 1 , Γ 0 according to Assumption 3.1, and step (b) follows from the definition (3.3) of weight bridge function q π 1 in Assumption 3.2. Now following the same argument as in showing ∆ h = 0 in the proof of Theorem 3.3, we can show that E π b µ 1 (S 1 , Γ 0 ) π b 1 (A 1 |S 1 ) a ′ b π 2 (a ′ , W 2 ) -b 2 (a, W 2 ) π 1 (A 1 |O 1 , Γ 0 ) = E π b q π 2 (A 2 , Z 2 ) b π 2 (A 2 , W 2 ) -b 2 (A 2 , W 2 ) . (F.5) On the other hand, the second term in the R.H.S. 
of (F.4) can be rewritten and bounded by E π b q π 1 (A 1 , Z 1 )E π b R 1 π 1 (A 1 |O 1 , Γ 0 ) + γ a ′ b 2 (a ′ , W 2 )π 1 (A 1 |O 1 , Γ 0 ) -b 1 (A 1 , W 1 ) A 1 , Z 1 ≤ √ C π E π b E π b R 1 π 1 (A 1 |O 1 , Γ 0 ) + γ a ′ b 2 (a ′ , W 2 )π 1 (A 1 |O 1 , Γ 0 ) -b 1 (A 1 , W 1 ) A 1 , Z 1 1/2 = √ C π • L π 1 (b 1 , b 2 ), (F.6) where C π is defined as C π := sup h∈[H] E π b (q π h (A h , Z h )) 2 , the inequality follows from Cauchy-Schwarz inequality, and the equality follows from the definition of L π 1 in (3.6). Combining (F.4), (F.5) with (F.6), we can obtain that F (b π ) -F (b) ≤ √ C π • L π 1 (b 1 , b 2 ) + γE π b q π 2 (A 2 , Z 2 ) b π 2 (A 2 , W 2 ) -b 2 (A 2 , W 2 ) . (F.7) Now applying the above argument on the second term in the R.H.S. of (F.7) recursively, we can obtain that F (b π ) -F (b) ≤ H h=1 γ h-1 √ C π • L π h (b h , b h+1 ). This finishes the proof of Lemma D.1. F.2 PROOF OF LEMMA D.2 Proof of Lemma D.2. By the definition of the confidence region CR π (α) in (3.12), we need to show for any policy π ∈ Π(H) and step h ∈ [H], it holds that, max g∈G Φ λ π,h (b π h , b π h+1 ; g) -max g∈G Φ λ π,h ( b h (b π h+1 ), b π h+1 ; g) ≤ ξ. (F.8) Notice that by Assumption 4.2, the function class G is symmetric and star-shaped, which indicates that max g∈G Φ λ π,h ( b h (b π h+1 ), b π h+1 ; g) ≥ Φ λ π,h ( b h (b π h+1 ), b π h+1 ; 0) = 0. Therefore, in order to prove (F.8), it suffices to show that max g∈G Φ λ π,h (b π h , b π h+1 ; g) ≤ ξ. (F.9) To relate the empirical expectation Φ λ π,h (b π h , b π h+1 ; g) = Φ π,h (b π h , b π h+1 ; g)-λ∥g∥ 2 2,n to its population version, we need two localized uniform concentration inequalities. 
On the one hand, to relate ∥g∥ 2 2 and ∥g∥ 2 2,n , by Lemma I.1 (Theorem 14.1 of Wainwright ( 2019)), for some absolute constants c 1 , c 2 > 0, it holds with probability at least 1 -δ/2 that, ∥g∥ 2 2,n -∥g∥ 2 2 ≤ 1 2 ∥g∥ 2 2 + M 2 G log(2c 1 /ζ) 2c 2 n , ∀g ∈ G, (F.10) where ζ = min{δ, 2c 1 exp(-c 2 nα 2 G,n /M 2 G )} and α G,n is the critical radius of function class G defined in Assumption 4.2. On the other hand, to relate Φ π,h (b h , b h+1 ; g) and Φ π,h (b h , b h+1 ; g) we invoke Lemma I.2 (Lemma 11 of (Foster and Syrgkanis, 2019)). Specifically, for any given b h , b h+1 ∈ B, π ∈ Π(H), and h ∈ [H], in Lemma I.2 we choose F = G, X = A × Z, Y = I h , and loss function ℓ(g(A h , Z h ), I h ) := ς π h (b h , b h+1 )(I h ) • g(A h , Z h ) where ς π h is defined in (F.1), I h ∈ I h is defined in the beginning of Appendix F. It holds that ℓ is L-Lipschitz continuous in the first argument since for any g, g ′ ∈ G, (A h , Z h ) ∈ A × Z, it holds that ℓ(g(A h , Z h ), I h ) -ℓ(g ′ (A h , Z h ), I h ) = |ς π h (b h , b h+1 )(I h )| • |g(A h , Z h ) -g ′ (A h , Z h )| ≤ 2M B • |g(A h , Z h ) -g ′ (A h , Z h )|, which indicates that L = 2M B . Now setting f ⋆ = 0 in Lemma I.2, we have that δ n in Lemma I.2 coincides with α G,n in Assumption 4.2. Then we can conclude that for some absolute constants c 1 , c 2 > 0, it holds with probability at least (F.11) where 1 -δ/(2|B| 2 |Π(H)|H) that Φ π,h (b h , b h+1 ; g) -Φ π,h (b h , b h+1 ; g) = E π b [ℓ(g(A h , Z h ), I h )] -E π b [ℓ(g(A h , Z h ), I h )] ≤ 18L∥g∥ 2 M 2 G log 2c 1 |B| 2 |Π(H)|H/ζ ′ c 2 n + 18LM 2 G log 2c 1 |B| 2 |Π(H)|H/ζ ′ c 2 n , ∀g ∈ G, ζ ′ = min{δ, 2c 1 |B| 2 |Π(H)|H exp(-c 2 nα 2 G,n /M 2 G )}. Applying a union bound argument over b h , b h+1 ∈ B, π ∈ Π(H), and h ∈ [H], we then have that (F.11) holds for any b h , b h+1 ∈ B, g ∈ G, π ∈ Π(H), and h ∈ [H] with probability at least 1 -δ/2. 
Now using these two concentration inequalities (F.10) and (F.11), we can further deduce that, for some absolute constants c 1 , c 2 > 0, with probability at least 1 -δ, max g∈G Φ λ π,h (b π h , b π h+1 ; g) = max g∈G Φ π,h (b π h , b π h+1 ; g) -λ∥g∥ 2 2,n ≤ max g∈G Φ π,h (b π h , b π h+1 ; g) -λ∥g∥ 2 2 + λ 2 ∥g∥ 2 2 + λM 2 G log(2c 1 /ζ) 2c 2 n , + 18L∥g∥ 2 M 2 G log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n + 18LM 2 G log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n , where ζ is given as ζ = min{δ, 2c 1 exp(-c 2 nα 2 G,n /M 2 G )} and ζ ′ is given as ζ ′ = min{δ, 2c 1 |B| 2 |Π(H)|H exp(-c 2 nα 2 G,n /M 2 G ) } for any policy π ∈ Π(H) and step h. Then we can further bound the right-hand side of the above inequality as max g∈G Φ λ π,h (b π h , b π h+1 ; g) ≤ max g∈G Φ π,h (b π h , b π h+1 ; g) + max g∈G - λ 2 ∥g∥ 2 2 + 18L∥g∥ 2 M 2 G • log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n + λM 2 G • log(2c 1 /ζ) 2c 2 n + 18LM 2 G • log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n ≤ 728L 2 • M 2 G • log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) λn + λM 2 G • log(2c 1 /ζ) 2c 2 n + 18LM 2 G • log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n . Here the last inequality holds from the fact that Φ π,h (b π h , b π h+1 ; g) = 0 since b π h and b π h+1 are true bridge functions, and the fact that sup ∥g∥2 {a∥g∥ 2 -b∥g∥ 2 2 } ≤ a 2 /4b for any b > 0. Now according to the choice of ξ in Lemma D.2, using the fact that ζ < ζ ′ and L = 2M B , we can conclude that, with probability at least 1 -δ, max g∈G Φ λ π,h (b π h , b π h+1 ; g) ≤ 728L 2 M 2 G • log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) λn + λM 2 G • log(2c 1 /ζ) 2c 2 n + 18LM 2 G • log(2c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n ≲ O (λ + 1/λ) • M 2 B M 2 G • log(|B||Π(H)|H/ζ) n ≲ ξ. This proves (F.9), and thus further indicates (F.8). Therefore, we finish the proof of Lemma D.2.

F.3 PROOF OF LEMMA D.3

We first give the high-level idea for proving Lemma D.3 as follows. In order to achieve the fast rate for the whole confidence region, we take a series of novel proof steps. We first introduce the following lemma, which claims that for any b h+1 ∈ B, the b ⋆ (b h+1 ) defined in (F.3) satisfies that max g∈G Φ λ π,h (b ⋆ (b h+1 ), b h+1 ; g) is well-bounded. The proof of this lemma follows the same argument as in the proof of Lemma D.2, which we defer to Appendix F.4. Then, given any bridge function in the confidence region, we identify a key term (term (⋆) in (F.12)) which is related to the RMSE of this bridge function. By carefully upper- and lower-bounding this term, where Lemma F.1 is applied, we eventually obtain a quadratic inequality that the RMSE of this bridge function satisfies. By solving this inequality, we can derive an upper bound on the RMSE loss which is uniform over the bridge functions in the confidence region, which is exactly the fast rate of the whole confidence region. Lemma F.1. For any function b h+1 ∈ B, policy π ∈ Π(H), and step h ∈ [H], it holds with probability at least 1 -δ/2 that max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g) ≤ ξ + ϵ 1/2 B M G , where b ⋆ (b h+1 ) is defined in (F.3) and ξ is defined in Lemma D.3. Proof of Lemma F.1. See Appendix F.4 for a detailed proof. With Lemma F.1, we are now ready to give the proof of Lemma D.3. Proof of Lemma D.3. Consider that for any b h , b h+1 ∈ CR π (ξ), we have that max g∈G Φ λ π,h (b h , b h+1 ; g) = max g∈G Φ π,h (b h , b h+1 ; g) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -2λ∥g∥ 2 2,n + Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + λ∥g∥ 2 2,n . We further write the above as max g∈G Φ λ π,h (b h , b h+1 ; g) ≥ max g∈G Φ π,h (b h , b h+1 ; g) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -2λ∥g∥ 2 2,n + min g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + λ∥g∥ 2 2,n = max g∈G Φ π,h (b h , b h+1 ; g) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -2λ∥g∥ 2 2,n (⋆) -max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g). 
(F.12) Here step (a) follows from that G is symmetric, Φ π,h (b h , h h+1 ; -g) = -Φ π,h (b h , h h+1 ; g), and that min g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + λ∥g∥ 2 2,n = min g∈G -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; -g) + λ∥g∥ 2 2,n = min g∈G -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + λ∥g∥ 2 2,n = -max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -λ∥g∥ 2 2,n = -max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g). In the sequel, we upper and lower bound term (⋆) respectively. Upper bound of term (⋆). By inequality (F.12), after rearranging terms, we can arrive that (⋆) ≤ max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + max g∈G Φ λ π,h (b h , b h+1 ; g) ≤ max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + max g∈G Φ λ π,h (b h , b h+1 ; g) -max g∈G Φ λ π,h ( b h (b h+1 ), b h+1 ; g) + max g∈G Φ λ π,h ( b h (b h+1 ), b h+1 ; g) On the one hand, by Lemma F.1, we have that with probability at least 1 -δ/2, max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g) ≤ ξ + ϵ 1/2 B M G , (F.13) and by the definition of b h (b h+1 ) in (3.11), it holds simultaneously that max g∈G Φ λ π,h ( b h (b h+1 ), b h+1 ; g) ≤ max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b h+1 ; g) ≤ ξ + ϵ 1/2 B M G . (F.14) On the other hand, by the choice of CR π (ξ), it holds that max g∈G Φ λ π,h (b h , b h+1 ; g) -max g∈G Φ λ π,h ( b h (b h+1 ), b h+1 ; g) ≤ ξ. (F.15) Consequently, by combining (F.13), (F.14), and (F.15), we conclude that with probability at least 1 -δ/2, (⋆) ≤ 3ξ + 2ϵ 1/2 B M G . (F.16) Lower bound of term (⋆). For lower bound, we need two localized uniform concentration inequalities similar to (F.10) and (F.11) in the proof of Lemma D.2. On the one hand, by Lemma I.1, for some absolute constants c 1 , c 2 > 0, it holds with probability at least 1 -δ/4 that,  ∥g∥ 2 2,n -∥g∥ 2 2 ≤ 1 2 ∥g∥ 2 2 + M 2 G log(4c 1 /ζ) 2c 2 n , ∀g ∈ G, (F.17 ℓ(g(A h , Z h ), I h ) := ς π h (b h , b h+1 )(I h )g(A h , Z h ) -ς π h (b ′ h , b h+1 )(I h )g(A h , Z h ) , where ς π h is defined in (F.1) and I h ∈ I h is defined in the beginning of Appendix F. 
It holds that ℓ is L-Lipschitz continuous in its first argument with L = 2M B . Now setting f ⋆ = 0 in Lemma I.2, we have that δ n in Lemma I.2 coincides with α G,n in Assumption 4.2. Then we have that for some absolute constants c 1 , c 2 > 0, it holds with probability at least 1 -δ/(4|B| 3 |Π(H)|H) that (F.18) where  Φ π,h (b h , b h+1 ; g) -Φ π,h (b ′ h , b h+1 ; g) -Φ π,h (b h , b h+1 ; g) -Φ π,h (b ′ h , b h+1 ; g) = E π b [ℓ(g(A h , Z h ), I h )] -E π b [ℓ(g(A h , Z h ), I h )] ≤ 18L∥g∥ 2 M 2 G • log(4c 1 |B| 3 |Π(H)|H/ζ ′ ) c 2 n + 18L • M 2 G • log 4c 1 |B| 3 |Π(H)|H/ζ ′ c 2 n , ∀g ∈ G, ζ ′ = min{δ, 4c 1 |B| 3 |Π(H)|H exp(-c 2 nα 2 G,n /M 2 G )}. Applying a union bound argument over b h , b ′ h , b h+1 ∈ B, π ∈ Π(H), ι n := M 2 G • log(4c 1 |B| 3 |Π(H)|H/ζ ′ ) c 2 n , ι ′ n := M 2 G • log(4c 1 /ζ) 2c 2 n (F.19) Now we are ready to prove the lower bound on term (⋆). For simplicity, given fixed b h , b h+1 ∈ B, we denote g π h := 1 2λ ℓ π h (b h , b h+1 ) ∈ G, where ℓ π h is defined in (F.1) and g π h ∈ G due to Assumption 4.3. Now consider that (⋆) = max g∈G Φ π,h (b h , b h+1 ; g) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -2λ∥g∥ 2 2,n ≥ Φ π,h (b h , b h+1 ; g π h /2) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g π h /2) - λ 2 ∥g π h ∥ 2 2,n , where the inequality follows from the fact that G is star-shaped and consequently g π h /2 ∈ G. 
Then by applying concentration inequality (F.17) and (F.18), we have that (F.20) where the second inequality follows from that (⋆) ≥ Φ π,h (b h , b h+1 ; g π h /2) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g π h /2) -18Lι n ∥g π h ∥ 2 -18Lι 2 n - λ 2 3 2 ∥g π h ∥ 2 2 + ι ′2 n ≥ λ∥g π h ∥ 2 2 -18Lι n ∥g π h ∥ 2 -ϵ 1/2 B M G -18Lι 2 n - λ 2 3 2 ∥g π h ∥ 2 2 + ι ′2 n = λ 4 ∥g π h ∥ 2 2 -18Lι n ∥g π h ∥ 2 -18Lι 2 n - λ 2 ι ′2 n -ϵ 1/2 B M G , Φ π,h (b ⋆ (b h+1 ), b h+1 ; g π h /2) ≤ ϵ 1/2 B M G (we prove this inequality by (F.25) in the proof of Lemma F.1) and the fact that Φ π,h (b h , b h+1 ; g π h /2) = 1 4λ E π b [ℓ π h (b h , b h+1 )(A h , Z h ) 2 ] = λ∥g π h ∥ 2 2 . Combining upper bound and lower bound of term (⋆). Now we are ready to combine the upper bound and lower bound of (⋆) to derive the bound on L π h (b h , b h+1 ). By combining upper bound (F.16) and lower bound (F.20), we have that with probability at least 1 -δ, for any b h , b h+1 ∈ B, π ∈ Π(H), and h ∈ [H], λ 4 ∥g π h ∥ 2 2 -18Lι n ∥g π h ∥ 2 -18Lι 2 n - λ 2 ι ′2 n -ϵ 1/2 B M G ≤ 3ξ + 2ϵ 1/2 B M G , (F.21) This gives a quadratic inequality on ∥g π h ∥ 2 , i.e., λ∥g π h ∥ 2 2 -72Lι n (A) ∥g π h ∥ 2 -4 18Lι 2 n + λ 2 ι ′2 n + 3ξ + 3ϵ 1/2 B M G (B) ≤ 0. By solving this quadratic equation, we have that ∥g π h ∥ 2 ≤ 1 2λ A + 1 2λ A 2 + 4B ≤ A λ + √ B λ . Applying the definition of A and B, we conclude that, with probability at least 1 -δ, ∥g π h ∥ 2 ≤ 72 λ Lι n + 2 λ 18Lι 2 n + λ 2 ι ′2 n + 3ξ + 3ϵ 1/2 B M G 1/2 ≤ 72 λ Lι n + 6 √ 2 λ L 1/2 ι n + √ 2 √ λ ι ′ n + 2 √ 3 λ ξ 1/2 + 2 √ 3 λ ϵ 1/4 B M 1/2 G Therefore, we can bound the RMSE loss L π h (b h , b h+1 ) by L π h (b h , b h+1 ) = 2λ∥g π h ∥ 2 ≤ (144L + 12 √ 2L 1/2 )ι n + 2 √ 2λι ′ n + 4 √ 3ξ 1/2 + 4 √ 3ϵ 1/4 B M 1/2 G . 
(F.22) Plugging in the definition of ι n , ι ′ n in (F.19), ξ in Lemma D.3, and that L = 2M B , we have that L π h (b h , b h+1 ) ≤ (144L + 12 √ 2L) • M 2 G • log(4c 1 |B| 3 |Π(H)|H/ζ ′ ) c 2 n + 2 √ 2λ • M 2 G • log(4c 1 /ζ) 2c 2 n + 4 √ 3 • C 1 (λ + 1/λ) • M 2 B M 2 G • log(|B||Π(H)|H/ζ ′ ) n + 4 √ 3 • ϵ 1/4 B M 1/2 G ≤ C 1 M B M G (λ + 1/λ) • log(|B||Π(H)|H/ζ) n + C 1 ϵ 1/4 B M 1/2 G . for some problem-independent constant C 1 > 0 and ζ = min{δ, 4c 1 exp(-c 2 nα 2 G,n /M 2 G )}. Here in the second inequality we have used the fact that ζ < ζ ′ . This finishes the proof of Lemma D.3.  ; g) = Φ π,h (b h , b h+1 ; g) -λ∥g∥ 2 2,n and its population version Φ λ π,h (b h , b h+1 ; g) via two localized uniform concentration inequalities. On the one hand, to relate ∥g∥ 2 2 and ∥g∥ 2 2,n , by Lemma I.1 (Theorem 14.1 of Wainwright ( 2019)), for some absolute constants c 1 , c 2 > 0, it holds with probability at least  1 -δ/4 that ∥g∥ 2 2,n -∥g∥ 2 2 ≤ 1 2 ∥g∥ 2 2 + M 2 G • log(4c 1 /ζ) 2c 2 n , ∀g ∈ G, (g(A h , Z h ), I h ) := ς π h (b h , b h+1 )(I h )g(A h , Z h ) where ℓ π h is defined in (F.1) and I h ∈ I h is defined in the beginning of Appendix F. We can see that ℓ is L-Lipschitz continuous in the first argument since for any g, g ′ ∈ G, (A h , Z h ) ∈ A × Z, it holds that ℓ(g(A h , Z h ), I h ) -ℓ(g ′ (A h , Z h ), I h ) = |ς π h (b h , b h+1 )(I h )| • |g(A h , Z h ) -g ′ (A h , Z h )| ≤ 2M B • |g(A h , Z h ) -g ′ (A h , Z h )|, which indicates that L = 2M B . Now setting f ⋆ = 0 in Lemma I.2, we have that δ n in Lemma I.2 coincides with α G,n in Assumption 4.2. Then we can conclude that for some absolute constants c 1 , c 2 > 0, it holds with probability at least 1 π ∈ Π(H), and h ∈ [H] with probability at least 1 -δ/4. 
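Two elementary facts drive the algebra above: sup_x {ax − bx²} = a²/(4b) (used to bound the localized maximum over g), and the root bound for the quadratic inequality λx² − Ax − B ≤ 0 (used to solve for ∥g^π_h∥₂). The sketch below verifies both numerically; it is our own illustrative check, with the normalization λx² − Ax − B ≤ 0 assumed for the quadratic step, so the absolute constants differ slightly from those in the displays above.

```python
import math

def linear_quadratic_sup(a, b):
    """sup over x of (a*x - b*x^2), for b > 0: equals a^2/(4b), attained at x = a/(2b)."""
    return a * a / (4.0 * b)

def quadratic_root_bound(lam, A, B):
    """Largest x satisfying lam*x^2 - A*x - B <= 0, for lam > 0 and A, B >= 0."""
    return (A + math.sqrt(A * A + 4.0 * lam * B)) / (2.0 * lam)

def relaxed_bound(lam, A, B):
    """Looser closed form A/lam + sqrt(B/lam), obtained via
    sqrt(A^2 + 4*lam*B) <= A + 2*sqrt(lam*B); convenient for stating rates."""
    return A / lam + math.sqrt(B / lam)
```

Any x in the feasible set of the quadratic inequality is thus at most `relaxed_bound(lam, A, B)`, which is the step that converts the inequality on ∥g^π_h∥₂ into the final RMSE rate.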
Now using these two concentration inequalities (F.23) and (F.24), we can further deduce that, for some absolute constants c 1 , c 2 > 0, with probability at least 1 -δ/2, -δ/(4|B| 2 |Π(H)|H) that, for all g ∈ G, Φ π,h (b h , b h+1 ; g) -Φ π,h (b h , b h+1 ; g) = E π b [ℓ(g(A h , Z h ), A h , Z h )] -E π b [ℓ(g(A h , Z h ), A h , Z h )] ≤ 18L∥g∥ 2 M 2 G • log 4c 1 |B| 2 |Π(H)|H/ζ c 2 n + 18L • M 2 G • log 4c 1 |B| 2 |Π(H)|H/ζ c 2 n , max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b π h+1 ; g) = max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -λ∥g∥ 2 2,n ≤ max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -λ∥g∥ 2 2 + λ 2 ∥g∥ 2 2 + λM 2 G log(4c 1 /ζ) 2c 2 n , + 18L∥g∥ 2 M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n + 18L • M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n ≤ max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) + max g∈G - λ 2 ∥g∥ 2 2 + 18L∥g∥ 2 M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n + λM 2 G log(4c 1 /ζ) 2c 2 n + 18L • M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n ≤ ϵ 1/2 B M G + 728L 2 • M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) λn + λM 2 G • log(4c 1 /ζ) 2c 2 n + 18L • sM 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n , where ζ is given as ζ = min{δ, 4c 1 exp(-c 2 nα 2 G,n /M 2 G )} and ζ ′ is given as ζ ′ = min{δ, 4c 1 |B| 2 |Π(H)|H exp(-c 2 nα 2 G,n /M 2 G )} for any policy π ∈ Π(H) and step h ∈ [H]. Here the last inequality holds from the fact that max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) ≤ ϵ 1/2 B M G , (F.25) and that sup ∥g∥2 {a∥g∥ 2 -b∥g∥ 2 2 } ≤ a 2 /4b. Note that inequality (F.25) holds according to Assumption 4.3 and 4.3. In fact, by Assumption 4.3, we can first obtain by quadratic optimization that for λ > 0, max g∈G Φ λ π,h (b h , b h+1 ) = 1 4λ L π h (b h , b h+1 ), for any functions b h , b h+1 ∈ B. Thus we can equivalently express b ⋆ h (b h+1 ) as b ⋆ h (b h+1 ) = arg min b∈B 1 4λ L π h (b, b h+1 ) = arg min b∈B L π h (b, b h+1 ). 
This further indicates the following bound on max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) that max g∈G Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) ≤ max g∈G L h (b ⋆ h (b h+1 ), b h+1 ) • E π b [g(A h , Z h ) 2 ] ≤ ϵ 1/2 B M G , by Cauchy-Schwarz inequality and Assumption 4.3. Now according to the choice of ξ in Lemma D.2, using the fact that ζ < ζ ′ and L = 2M B , we can conclude that, with probability at least 1 -δ/2,  max g∈G Φ λ π,h (b ⋆ h (b h+1 ), b π h+1 ; g) ≤ 728L 2 • M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) λn + λM 2 G • log(4c 1 /ζ) 2c 2 n + 18L • M 2 G • log(4c 1 |B| 2 |Π(H)|H/ζ ′ ) c 2 n + ϵ 1/2 B M G ≲ O (λ + 1/λ) • M 2 B M 2 G • log(|B||Π(H)|H/ζ) n + ϵ 1/2 B M G ≲ ξ + ϵ 1/2 B M G . Therefore, (π ⋆ ) -J( π) = F (b π ⋆ ) -F (b π ) = F (b π ⋆ ) -F (b π ⋆ ) (i) + F (b π ⋆ ) -F (b π ) (ii) + F (b π ) -F (b π ) . We can bound term (i) and term (iii) via uniform concentration inequalities, which we present latter. For term (ii), via Lemma D.2, with probability at least 1 -δ, b π ⋆ ∈ CR π ⋆ (ξ) and b π ∈ CR π (ξ), which indicates that (ii) = F (b π ⋆ ) -F (b π ) ≤ max b∈CR π ⋆ (ξ) F (b) -min b∈CR π (ξ) F (b). (G.1) From (G.1), we can further bound term (ii) as (ii) ≤ max b∈CR π ⋆ (ξ) F (b) -max π∈Π(H) min b∈CR π (ξ) F (b) ≤ max b∈CR π ⋆ (ξ) F (b) -min b∈CR π ⋆ (ξ) F (b) = max b∈CR π ⋆ (ξ) F (b) -F (b π ⋆ ) + F (b π ⋆ ) -min b∈CR π ⋆ (ξ) F (b) ≤ 2 max b∈CR π ⋆ (ξ) F (b) -F (b π ⋆ ) . (G.2) Here the first inequality holds because max π∈Π(H) min b∈CR π (ξ) F (b) = min b∈CR π (ξ) F (b) by the definition of π from (3.14). The second inequality holds because by definition π ⋆ is the optimal policy in Π(H). The third inequality is trivial. Now to further bound (G.2) by the RMSE loss defined in (3.6), we consider 2 max b∈CR π ⋆ (ξ) F (b) -F (b π ⋆ ) ≤ 2 max b∈CR π ⋆ (ξ) F (b) -F (b) (iv) + 2 max b∈CR π ⋆ (ξ) F (b) -F (b π ⋆ ) (v) + 2 F (b π ⋆ ) -F (b π ⋆ ) , where we can bound term (iv) and term (vi) via uniform concentration inequalities, which we present latter. 
For term (v), we invoke Lemma D.1 and obtain that (v) ≤ 2 max b∈CR π ⋆ (ξ) H h=1 γ h-1 √ C π ⋆ • L π ⋆ h (b h , b h+1 ) ≤ 2 √ C π ⋆ H h=1 γ h-1 max b∈CR π ⋆ (ξ) L π ⋆ h (b h , b h+1 ). Now invoking Lemma D.3, with probability at least 1-δ, max b∈CR π ⋆ (ξ) L π ⋆ h (b h , b h+1 ) is bounded by max b∈CR π ⋆ (ξ) L π h (b h , b h+1 ) ≤ C 1 M B M G (λ + 1/λ) log(|B||Π(H)|H/ζ) n + C 1 ϵ 1/4 B M 1/2 G , (G.3) for each step h ∈ [H], where ζ = min{δ, c 1 exp(-c 2 nα 2 G,n )}. In the sequel, we turn to deal with term (i), (iii), (iv), and (vi), respectively. To this end, it suffices to apply uniform concentration inequalities to bound F (b) and F (b) uniformly over b ∈ B ⊗H . By Hoeffding inequality, we have that, with probability at least 1 -δ, J(π, b) -J(π, b) ≤ 2M 2 B log(|B|/δ) n , ∀π ∈ Π(H), ∀b ∈ B ⊗H . (G.4) Consequently, all of (i), (iii), (iv), and (vi) are bounded by the right hand side of (G.4). Finally, by combining (G.3) and (G.4), with probability at least 1 -3δ, it holds that J(π ⋆ ) -J( π) ≤ (i) + (iii) + (iv) + (vi) + (v) ≤ 2 √ C π ⋆ H h=1 γ h-1 C 1 M B M G (λ + 1/λ) log(|B||Π(H)|H/ζ) n + C 1 ϵ 1/4 B M 1/2 G + 4 2M 2 B log(|B|/δ) n ≤ C ′ 1 √ C π ⋆ (λ + 1/λ) 1/2 HM B M G log(|B||Π(H)|H/ζ) n + C ′ 1 √ C π ⋆ Hϵ 1/4 B M 1/2 G , for some problem-independent constant C ′ 1 > 0. We finish the proof of Theorem 4.4 by taking λ = 1.
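The concentration step (G.4) is a standard Hoeffding bound combined with a union bound over the function class. The sketch below states a generic two-sided Hoeffding radius (our own normalization; the constant differs slightly from (G.4)) and runs a Monte Carlo sanity check of the coverage guarantee for a single bounded variable.

```python
import numpy as np

def hoeffding_radius(M, n, delta, class_size=1):
    """Two-sided Hoeffding radius with a union bound over `class_size` functions:
    for f bounded in [-M, M], with probability at least 1 - delta,
    |E_n f - E f| <= M * sqrt(2 * log(2 * class_size / delta) / n) for all of them."""
    return M * np.sqrt(2.0 * np.log(2.0 * class_size / delta) / n)

# Monte Carlo sanity check of the coverage guarantee for one bounded variable.
rng = np.random.default_rng(0)
M, n, delta = 1.0, 500, 0.05
radius = hoeffding_radius(M, n, delta)
# 2000 independent datasets of n samples from Uniform[-1, 1] (true mean 0).
sample_means = rng.uniform(-M, M, size=(2000, n)).mean(axis=1)
coverage = np.mean(np.abs(sample_means) <= radius)
```

The observed `coverage` should be at least 1 − δ, illustrating that the radius in (G.4) is valid (if conservative) for bounded variables.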

H DETAILS FOR LINEAR FUNCTION APPROXIMATION H.1 MAIN RESULT FOR LINEAR FUNCTION APPROXIMATION

In this subsection, we extend Theorem 4.4 to primal function class B, dual function class G, and policy class Π(H) with linear structures. The linear structure assumption is commonly considered in the RL literature (Jin et al., 2021; Xie et al., 2021; Zanette et al., 2021; Duan et al., 2021; Min et al., 2022a; b; Fei and Xu, 2022; Huang et al., 2023) , to mention a few. And it can be viewed as an extension of linear bandits (Auer, 2002; Dani et al., 2008; Li et al., 2010; Abbasi-Yadkori et al., 2011; He et al., 2022) to multiple-horizon setting. Note that the exact detail of the linear structure assumption might change across different works. In our case, we consider linear function classes B lin , G lin and Π lin , which is characterized by the following definition. Next, all of (i), (iii), (iv), and (vi) are bounded by the R.H.S. of (H.4). Finally, by (H.1), (H.2), (H.3) and (H.4), we have that J(π ⋆ ) -J( π) ≤ (i) + (iii) + (iv) + (vi) + (v) ≤ 2 √ C π ⋆ H h=1 γ h-1 C 2 • (1 + λ) M B M G • dH log (1 + L b L π Hn/δ) n + C 2 • M 1/2 G ϵ 1/4 B + 4 2M 2 B log(N ϵ,b N ϵ,π /δ) n + 2M B ϵ . Finally, by taking ϵ = 1/n 2 , and plugging in the values of N ϵ,b and N ϵ,π from Lemma H.3, we get J(π ⋆ ) -J( π) ≤ 2 √ C π ⋆ H h=1 γ h-1 C 2 • (1 + λ) M B M G • dH log (1 + L b L π Hn/δ) n + C 2 • M 1/2 G ϵ 1/4 B + C 3 M B dH log(1 + L b L π n/δ) n , where C 3 is some problem-independent universal constant. We then simplify the expression and use the fact that We first prove this for a fixed ϵ-net of Π(H) and B ⊗H . Specifically, choose an ϵ-net of Π(H) such that for any π = {π h } H h=1 and π ′ = {π ′ h } H h=1 in this ϵ-net, it holds that ∥π h -π ′ h ∥ ∞,1 ≤ ϵ for all h. Also choose an ϵ-net of B ⊗H such that for any b = {b h } H h=1 and b ′ = {b ′ h } H h=1 in the ϵ-net, it holds that ∥b h -b ′ h ∥ ∞ ≤ ϵ for all h. Denote the cardinality of these two ϵ-net by N ϵ,π and N ϵ,b , respectively. 
Then by the same argument behind (F.11), we get that, with probability at least 1 -δ/2, for any π and b in their ϵ-nets, and for any g ∈ G,  M G = sup ≤ | Φ π,h (b h , b h+1 ; g) -Φ π ′ ,h (b ′ h , b ′ h+1 ; g)| + | Φ π ′ ,h (b ′ h , b ′ h+1 ; g) -Φ π ′ ,h (b ′ h , b ′ h+1 ; g)| + |Φ π ′ ,h (b ′ h , b ′ h+1 ; g) -Φ π,h (b h , b h+1 ; g)| ≤ 8M B M G • ϵ + 18L∥g∥ 2 M 2 G log 2c 1 N 2 ϵ,b N ϵ,π H/ζ c 2 n + 18LM 2 G log 2c 1 N 2 ϵ,b N ϵ,π H/ζ c 2 n , (H.8) where the first step is by the triangle inequality and the second steps is by (H.5) and (H.7).  + 18L∥g∥ 2 M 2 G log 2c 1 N 2 ϵ,b N ϵ,π H/ζ c 2 n + 18LM 2 G log 2c 1 N 2 ϵ,b N ϵ,π H/ζ c 2 n + 8M B M G ϵ ≤ max g∈G Φ π,h (b π h , b π h+1 ; g) + max g∈G - λ 2 ∥g∥ 2 2 + 18L∥g∥ 2 M 2 G log 2c 1 N 2 ϵ,b N ϵ,π H/ζ c 2 n + λM 2 G log(2c 1 /ζ) 2c 2 n + 18LM 2 G log 2c 1 N 2 ϵ,b N ϵ,π H/ζ c 2 n + 8M B M G ϵ ≤ 728L 2 M 2 G log(2c 1 N 2 ϵ,b N ϵ,π H/ζ ′ ) λn + λM 2 G log(2c 1 /ζ) 2c 2 n + 18LM 2 G log(2c 1 N 2 ϵ,b N ϵ,π H/ζ ′ ) c 2 n + 8M B M G ϵ, ; g) ≤ C 1 • λ + 1 λ • M 2 B M 2 G dH log (1 + L b L π Hn/δ) n + C 1 • M B M G n 2 , where C 1 is some problem-independent constant. Note that second term on the right hand side is smaller than the first term. Then the result follows from our choice of ξ in Lemma H.5.

H.4.2 PROOF OF LEMMA H.6

Proof of Lemma H.6. Consider any π ∈ Π(H) and b = {b h } H h=1 ∈ CR π (ξ). Same as (F.12), we have Combining with (H.11) and (H.12), we get that, with probability at least 1 -δ/2, (⋆) ≤ 3ξ + 2ϵ where ζ = min{δ, 4c 1 exp(-c 2 nα 2 G,n /M 2 G )} for some absolute constants c 1 and c 2 , and α G,n is the critical radius of G defined in Assumption 4.2. Second, we fix an ϵ-net of Π(H) and an ϵ-net of B ⊗H , as described in Appendix H.2. Denote by N ϵ,π and N ϵ,b their respective covering numbers. Then by the same argument behind (F.18) and a union bound, we get that, with probability at least 1 -δ/4, for all π = {π h } H h=1 , b = {b h } H h=1 and b ′ = {b ′ h } H h=1 in their ϵ-nets, and for all g ∈ G, Φ π,h (b h , b h+1 ; g) -Φ π,h (b ′ h , b h+1 ; g) -Φ π,h (b h , b h+1 ; g) -Φ π,h (b ′ h , b h+1 ; g) ≤ 18L∥g∥ 2 M 2 G log(4c 1 N 3 ϵ,b N ϵ,π H/ζ ′ ) c 2 n + 18LM 2 G log 4c 1 N 3 ϵ,b N ϵ,π H/ζ ′ c 2 n , (H.15) where ζ ′ = min{δ, 4c 1 N 3 ϵ,b N ϵ,π H exp(-c 2 nα 2 G,n /M 2 G )}. We then use (H.5), and conclude that, with probability at least 1 -δ/4, for all π ∈ Π(H), and b, b ′ ∈ B ⊗H , and g ∈ G, Φ π,h (b h , b h+1 ; g) -Φ π,h (b ′ h , b h+1 ; g) -Φ π,h (b h , b h+1 ; g) -Φ π,h (b ′ h , b h+1 ; g) ≤ 18L∥g∥ 2 M 2 G log(4c 1 N 3 ϵ,b N ϵ,π H/ζ ′ ) c 2 n + 18LM 2 G log 4c 1 N 3 ϵ,b N ϵ,π H/ζ ′ c 2 n + 8M B M G ϵ. (H.16) In the sequel, for simplicity, we denote that where ℓ π h is defined by (F.1) and g π h ∈ G follows from Assumption 4.3. We then have ι n := M 2 G log(4c 1 N 3 ϵ,b N ϵ,π H/ζ ′ ) c 2 n , ι ′ n := M 2 G log( (⋆) = max g∈G Φ π,h (b h , b h+1 ; g) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g) -2λ∥g∥ 2 2,n ≥ Φ π,h (b h , b h+1 ; g π h /2) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g π h /2) - λ 2 ∥g π h ∥ 2 2,n , where the inequality holds because g π h /2 ∈ G. 
Together with (H.14) and (H.16), we have (H.19)  (⋆) ≥ Φ π,h (b h , b h+1 ; g π h /2) -Φ π,h (b ⋆ h (b h+1 ), b h+1 ; g π h /2) -18Lι n ∥g π h ∥ 2 -18Lι 2 n -8M B M G ϵ - λ 2 3 2 ∥g π h ∥ 2 2 + ι ′2 n ≥ λ∥g π h ∥ 2 2 -18Lι n ∥g π h ∥ 2 -ϵ 1/2 B M G -18Lι 2 n -8M B M G ϵ - λ 2 3 2 ∥g π h ∥ 2 2 + ι ′2 n = λ 4 ∥g π h ∥ 2 2 -18Lι n ∥g π h ∥ 2 -ϵ 1/2 B M G -18Lι 2 n -8M B M G ϵ - λ 2 ι ′2 n ,



We refer the readers to Appendix B.1 for a detailed comparison of the minimax-type loss (and confidence region) with the least-squares-type loss (and confidence region) used by Xie et al. (2021). The constant M B might seem to be proportional to |A| due to the summation a∈A b(a, w), but it is not. The reason is that the definition of the true confounding bridge function in (3.2) involves a product with π h (•|O h , Γ h-1 ), which is a distribution over A. Thus the summation over A is essentially an average over A.



a) : S × A → ∆(S) characterizes the distribution of the next state s h+1 given that the agent takes action a h = a ∈ A at state s h = s ∈ S and step h ∈ [H]. The set O = {O h } H h=1 denotes the observation emission kernels, where each kernel O h (•|s) : S → ∆(O) characterizes the distribution over observations given the current state s ∈ S at step h ∈ [H]. Finally, the set R = {R h } H h=1 denotes the collection of reward functions, where each function R h : S × A → [0, 1] specifies the reward the agent receives when taking action a ∈ A at state s ∈ S and step h ∈ [H].

.10) where E π b denotes the empirical version of E π b based on the dataset D described in Section 2.2. Furthermore, note that the value bridge functions (b π 1 , • • • , b π H ) admit a sequential dependence structure. To handle this dependency, for any π ∈ Π(H), h ∈ [H], and b h+1 ∈ B, we first define the minimax estimator b h (b h+1 ) as b h (b h+1 ) := arg min b∈B max g∈G Φ λ π,h (b, b h+1 ; g) . (3.11) Based on (3.11), we propose a confidence region for b

UEHARA, M., IMAIZUMI, M., JIANG, N., KALLUS, N., SUN, W. and XIE, T. (2021). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
UEHARA, M. and SUN, W. (2021). Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage. arXiv e-prints arXiv-2107.
VLASSIS, N., LITTMAN, M. L. and BARBER, D. (2012). On the computational complexity of stochastic controller optimization in POMDPs. ACM Transactions on Computation Theory (TOCT) 4 1-8.
WAINWRIGHT, M. J. (2019

b π h , q π h : value bridge function, weight bridge function of π at step h
b π , q π : value bridge function vector, weight bridge function vector of π
CR π (ξ) : confidence region of b π , according to (3.12)
b : an element in the confidence region CR π (ξ)
F (b), F (b) : a mapping for identification with J(π) = F (b π ), according to (3.4)
ℓ π h : "Bellman residual" for bridge functions, according to (3.5)
L π h : residual mean square loss for ℓ π h , according to (3.6)
Φ λ π,h , Φ λ π,h : a mapping for minimax estimation, according to (3.9)
b h (b h+1 ) : minimax estimator of b π h given b h+1 , according to (3.11)
J pess (π) : pessimistic estimator of J(π), according to (3.13)
π : policy returned by P3O algorithm, according to (3.14)

); Kidambi et al. (2020); Yu et al. (2021); Janner et al. (

Figure 2: Causal graph for the reactive policy. The dotted nodes indicate that the variables are not stored in the offline dataset. Solid arrows indicate the dependency among the variables. Specifically, the red arrows depict the dependence of the target policy on the observable variables, and the blue arrows depict the dependence of the behavior policy on the latent state. The negative control action and outcome variables at the h-th step are filled in green and yellow, respectively.

Figure 3: Causal graph for the finite-length history policy. The index l = max{1, h - k}. The dotted nodes indicate that the variables are not stored in the offline dataset. Solid arrows indicate the dependency among the variables. Specifically, the red arrows depict the dependence of the target policy on the observable variables, and the blue arrows depict the dependence of the behavior policy on the latent state. The negative control action and outcome variables at step h are filled in green and yellow, respectively.

Figure 4: Causal graph for the full-length history policy. The dotted nodes indicate that the variables are not stored in the offline dataset. Solid arrows indicate the dependency among the variables. Specifically, the red arrows depict the dependence of the target policy on the observable variables, and the blue arrows depict the dependence of the behavior policy on the latent state. The negative control action and outcome variables at step h are filled in green and yellow, respectively.

and {q^π_h}_{h=1}^H which solve equations (C.2) and (C.3). Finally, any solution to (C.2) and (C.3) also forms a solution to (3.2) and (3.3), as shown in Theorem 11 of Shi et al. (2021). This finishes the proof of Example C.1.

For any π ∈ Π(H) and b ∈ B^⊗H, define

F(b) := E^{π^b}[Σ_{a∈A} b_1(a, W_1)],  F̂(b) := Ê^{π^b}[Σ_{a∈A} b_1(a, W_1)],  (D.1)

where Ê denotes the empirical mean over the offline dataset. By definition (D.1) and Theorem 3.3, for any policy π ∈ Π(H) it holds that J(π) = F(b^π), where we denote by b^π = (b^π_1, …, b^π_H) the vector of true value bridge functions of π given in (3.2). Our proof of Theorem 4.4 relies on the following three key lemmas. The first lemma relates the difference between the values of F(·) at a true value bridge function b^π and at any other b ∈ B^⊗H to the residual mean squared loss that our algorithm is designed to minimize. This decomposes the suboptimality (2.2). Lemma D.1 (Suboptimality decomposition). Under Assumptions 3.1 and 3.2, for any policy π ∈ Π(H) and b ∈ B^⊗H, it holds that

Published as a conference paper at ICLR 2023

Recall from (3.11) that, given a function b_{h+1} ∈ B, the minimax estimator b̂_h(b_{h+1}) is defined as

b̂_h(b_{h+1}) := argmin_{b∈B} max_{g∈G} Φ̂^λ_{π,h}(b, b_{h+1}; g).
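To make the argmin-max structure of the estimator concrete, here is a minimal numerical sketch of a minimax estimator of this shape. The linear candidate classes, the synthetic one-step dataset, and the objective `phi_hat` below are hypothetical stand-ins for illustration, not the paper's actual construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-step dataset (hypothetical stand-in, not the paper's data):
# Z plays the role of a negative control action, W of a negative control
# outcome correlated with Z, and R of a reward with true coefficient 0.5 on W.
n = 2000
Z = rng.normal(size=n)
W = Z + 0.3 * rng.normal(size=n)
R = 0.5 * W + 0.1 * rng.normal(size=n)

lam = 0.05                          # regularization parameter lambda
grid = np.linspace(-2.0, 2.0, 41)   # coefficient grid for both linear classes

def phi_hat(c_b, c_g):
    """Empirical regularized objective for b(w) = c_b * w and g(z) = c_g * z,
    mimicking the shape of the regularized minimax objective."""
    residual = R - c_b * W          # one-step "Bellman residual" of b
    g_vals = c_g * Z
    return np.mean(g_vals * residual) - lam * np.mean(g_vals ** 2)

def worst_case(c_b):
    """Inner maximization over the dual class g."""
    return max(phi_hat(c_b, c_g) for c_g in grid)

# Outer minimization: the minimax estimate of the bridge-function coefficient.
c_hat = min(grid, key=worst_case)
print(c_hat)  # should land near the true coefficient 0.5
```

The dual class g acts as a test function: only when the residual of b is (nearly) orthogonal to every g does the worst-case objective stay small, which is what drives the outer minimization toward the true coefficient.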

By the boundedness assumption on B in Assumption 4.2, we have that |ℓ^π_h|, |ς^π_h| ≤ 2M_B. By the completeness assumption on G in Assumption 4.3, we also know that ℓ^π_h(b_h, b_{h+1})/(2λ) ∈ G for any b_h, b_{h+1} ∈ B. Finally, for notational simplicity, we define for each g ∈ G that,

PROOF OF LEMMA D.1

Proof of Lemma D.1. By definition (D.1) of F(b), for any policy π ∈ Π(H) and vector of functions b ∈ B^⊗H, it holds that

where ζ = min{δ, 4c_1 exp(−c_2 nα²_{G,n}/M²_G)} and α_{G,n} is the critical radius of G defined in Assumption 4.2. On the other hand, following the same argument as in deriving (F.11), for any given b_h, b′_h, b_{h+1} ∈ B, π ∈ Π(H), and h ∈ [H], in Lemma I.2 we choose F = G, X = A × Z, Y = I_h, and loss function

and h ∈ [H], we have that (F.18) holds for any b_h, b′_h, b_{h+1} ∈ B, g ∈ G, π ∈ Π(H), and h ∈ [H] with probability at least 1 − δ/4. Finally, for simplicity, we denote

PROOF OF LEMMA F.1

Proof of Lemma F.1. Following the proof of Lemma D.2, we first relate Φ̂^λ_{π,h}(b_h, b_{h+1}

(F.23) where ζ = min{δ, 4c_1 exp(−c_2 nα²_{G,n}/M²_G)} and α_{G,n} is the critical radius of the function class G defined in Assumption 4.2. On the other hand, to relate Φ̂_{π,h}(b_h, b_{h+1}; g) and Φ_{π,h}(b_h, b_{h+1}; g), we invoke Lemma I.2 (Lemma 11 of Foster and Syrgkanis (2019)). Specifically, for any given b_h, b_{h+1} ∈ B, π ∈ Π(H), and step h, in Lemma I.2 we choose F = G, X = A × Z, Y = I_h, and loss function ℓ

(F.24) where ζ = min{δ, 4c_1 |B|² |Π(H)| H exp(−c_2 nα²_{G,n}/M²_G)}. Applying a union bound argument over b_h, b_{h+1} ∈ B, π ∈ Π(H), and h ∈ [H], we then have that (F.11) holds for any b_h, b_{h+1} ∈ B, g ∈ G,

we conclude the proof of Lemma F.1.

G PROOF OF THEOREM 4.4

Proof of Theorem 4.4. By the definition of F(b) and F̂(b) in (D.1), and the fact that J(π) = F(b^π) according to Theorem 3.3, we first have that J

∥ν(a, z)∥_2 · ∥ω∥_2 ≤ L_g. This gives the result of Corollary H.2.

H.4 PROOF OF LEMMAS IN APPENDIX H

H.4.1 PROOF OF LEMMA H.5

Proof of Lemma H.5. First, for any ϵ ∈ (0, 1), consider arbitrary π = {π_h}_{h=1}^H and π′ = {π′_h}_{h=1}^H in Π_lin such that ∥π_h − π′_h∥_{∞,1} ≤ ϵ for all h ∈ [H], and consider arbitrary b = {b_h}_{h=1}^H and b′ = {b′_h}_{h=1}^H in B^⊗H such that ∥b_h − b′_h∥_∞ ≤ ϵ for all h ∈ [H]. Then, by the definitions of Φ^λ_{π,h}(b_h, b_{h+1}; g) in (3.9) and Φ̂^λ_{π,h}(b_h, b_{h+1}; g) in (3.10), and the facts that Φ_{π,h} = Φ^0_{π,h} and Φ̂_{π,h} = Φ̂^0_{π,h}, one can get that

|Φ_{π,h}(b_h, b_{h+1}; g) − Φ_{π′,h}(b′_h, b′_{h+1}; g)| ≤ [2ϵ + γ · (ϵ + ϵM_B)] · M_G ≤ 4M_B M_G ϵ,
|Φ̂_{π,h}(b_h, b_{h+1}; g) − Φ̂_{π′,h}(b′_h, b′_{h+1}; g)| ≤ [2ϵ + γ · (ϵ + ϵM_B)] · M_G ≤ 4M_B M_G ϵ,  (H.5)

for all g ∈ G. Now, as in the proof of Lemma D.2, we want to show that for any π ∈ Π(H),

max_{g∈G} Φ̂^λ_{π,h}(b^π_h, b^π_{h+1}; g) ≤ ξ.

The rest of the proof is very similar to that of Lemma D.2, with an additional covering argument. To begin with, we again write Φ̂^λ_{π,h}(b^π_h, b^π_{h+1}; g) = Φ̂_{π,h}(b^π_h, b^π_{h+1}; g) − λ∥g∥²_{2,n}. As in (F.10), we have that the corresponding bound holds with probability at least 1 − δ/2, where ζ = min{δ, 2c_1 exp(−c_2 nα²_{G,n}/M²_G)} and c_1, c_2 are some universal constants. Next, we upper bound |Φ̂_{π′,h}(b_h, b_{h+1}; g) − Φ_{π′,h}(b_h, b_{h+1}; g)| for any π ∈ Π(H) and b ∈ B^⊗H.

Φ̂_{π,h}(b_h, b_{h+1}; g) − Φ_{π,h}(b_h, b_{h+1}; g), where ζ′ = min{δ, 2c_1 N²_{ϵ,b} N_{ϵ,π} H exp(−c_2 nα²_{G,n}/M²_G)}. Now for any π ∈ Π(H) and b ∈ B^⊗H, by our construction of the ϵ-nets, we can find π′ and b′ in the ϵ-nets such that ∥π_h − π′_h∥_{∞,1} ≤ ϵ and ∥b_h − b′_h∥_∞ ≤ ϵ for all h. Then we have that with probability at least 1 − δ/2, for any π ∈ Π(H), b ∈ B^⊗H, and g ∈ G,

Φ̂_{π,h}(b_h, b_{h+1}; g) − Φ_{π,h}(b_h, b_{h+1}; g)

(H.9) with ζ = min{δ, 2c_1 exp(−c_2 nα²_{G,n}/M²_G)} and ζ′ = min{δ, 2c_1 N²_{ϵ,b} N_{ϵ,π} H exp(−c_2 nα²_{G,n}/M²_G)} for any policy π ∈ Π(H) and step h. Here the first inequality is by (H.6) and (H.8), the second inequality is trivial, and the last inequality holds from the fact that Φ_{π,h}(b^π_h, b^π_{h+1}; g) = 0 and the fact that sup_{∥g∥_2} {a∥g∥_2 − b∥g∥²_2} ≤ a²/(4b). Now, by Definition H.1, we apply Lemma H.3 with ∥θ_h∥_2 ≤ L_b and ∥β_h∥_2 ≤ L_π and get that log N_{ϵ,π} ≤ dH log(1/ϵ) with ϵ = 1/n². Together with (H.9) and (H.10), we get that

max_{g∈G} Φ̂^λ_{π,h}(b^π_h, b^π_{h+1}; g) ≤ C · (λ + 1/λ) · (M²_B M²_G dH log(1 + L_b L_π Hn/δ) + nα²_{G,n}/M²_G)/n + C · M_B M_G/n²,

where C is some universal constant. Here we have plugged in the values of ζ, ζ′, and L = 2M_B. Finally, by plugging in the value of α_{G,n} from Lemma H.4, we conclude that
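For completeness, the elementary quadratic fact invoked in the last inequality above can be verified directly: the concave function x ↦ ax − bx² (with a, b > 0) is maximized over x ≥ 0 at x = a/(2b), so

```latex
\sup_{x \ge 0} \bigl\{ a x - b x^{2} \bigr\}
  = a \cdot \frac{a}{2b} - b \cdot \frac{a^{2}}{4b^{2}}
  = \frac{a^{2}}{4b}.
```

Here a and b denote the generic positive coefficients of that display, not the bridge function b.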

(b_h, b_{h+1}; g) ≥ max_{g∈G} Φ_{π,h}(b_h, b_{h+1}; g) − Φ_{π,h}(b⋆_h(b_{h+1}), b_{h+1}; g) − Φ̂_{π,h}(b⋆_h(b_{h+1}), b_{h+1}; g). We again upper and lower bound term (⋆), respectively.

Upper bound of term (⋆). By the same argument as in the proof of Lemma F.1, we have that for any b ∈ B^⊗H, π ∈ Π(H), and h ∈ [H], it holds with probability at least 1 − δ/2 that

max_{g∈G} Φ̂_{π,h}(b⋆_h(b_{h+1}), b_{h+1}; g) ≤ ξ + ϵ^{1/2}_B M_G,

where b⋆_h(b_{h+1}) is defined in (F.3) and ξ is defined in Lemma H.5. We then get

max_{g∈G} Φ̂^λ_{π,h}(b̂_h(b_{h+1}), b_{h+1}; g) ≤ max_{g∈G} Φ̂^λ_{π,h}(b⋆_h(b_{h+1}), b_{h+1}; g) ≤ ξ + ϵ^{1/2}_B M_G,

where the first inequality follows from the definition of b̂_h(b_{h+1}) in (3.11). Also note that, by the construction of the confidence region CR_π(ξ), we have

max_{g∈G} Φ̂^λ_{π,h}(b_h, b_{h+1}; g) − max_{g∈G} Φ̂^λ_{π,h}(b̂_h(b_{h+1}), b_{h+1}; g) ≤ ξ.

max_{g∈G} Φ̂^λ_{π,h}(b⋆_h(b_{h+1}), b_{h+1}; g) + max_{g∈G} Φ̂^λ_{π,h}(b_h, b_{h+1}; g)
≤ max_{g∈G} Φ̂^λ_{π,h}(b⋆_h(b_{h+1}), b_{h+1}; g) + [max_{g∈G} Φ̂^λ_{π,h}(b_h, b_{h+1}; g) − max_{g∈G} Φ̂^λ_{π,h}(b̂_h(b_{h+1}), b_{h+1}; g)] + max_{g∈G} Φ̂^λ_{π,h}(b̂_h(b_{h+1}), b_{h+1}; g).

We compare with the most relevant representative works in closely related lines of research. The first line of research studies offline RL in standard MDPs without partial observability. The second line studies online RL in POMDPs, where actions are specified by history-dependent policies.

In the sequel, we use lowercase letters (i.e., s, a, o, and τ) for dummy variables and uppercase letters (i.e., S, A, O, and Γ) for random variables. We use calligraphic font (i.e., S, A, and O) for the spaces of variables, and blackboard bold font (i.e., P and O) for probability kernels.

SHALEV-SHWARTZ, S., SHAMMAH, S. and SHASHUA, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
SHAPIRO, A., DENTCHEVA, D. and RUSZCZYNSKI, A. (2021). Lectures on stochastic programming: modeling and theory. SIAM.
SHI, C., UEHARA, M. and JIANG, N. (2021). A minimax learning approach to off-policy evaluation in partially observable Markov decision processes. arXiv preprint arXiv:2111.06784.
SHI, C., WANG, X., LUO, S., ZHU, H., YE, J. and SONG, R. (2022). Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework. Journal of the American Statistical Association 1-29.
SINGH, R. (2020). Kernel methods for unobserved confounding: Negative controls, proxies, and instruments. arXiv preprint arXiv:2012.10315.
SUN, P., KRETZSCHMAR, H., DOTIWALLA, X., CHOUARD, A., PATNAIK, V., TSUI, P., GUO, J., ZHOU, Y., CHAI, Y., CAINE, B. ET AL. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Advances in Neural Information Processing Systems 34.
YU, T., KUMAR, A., RAFAILOV, R., RAJESWARAN, A., LEVINE, S. and FINN, C. (2021). COMBO: Conservative offline model-based policy optimization. Advances in Neural Information Processing Systems 34.
ZANETTE, A. (2021). Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. In International Conference on Machine Learning. PMLR.
ZANETTE, A., WAINWRIGHT, M. J. and BRUNSKILL, E. (2021). Provable benefits of actor-critic methods for offline reinforcement learning. Advances in Neural Information Processing Systems 34.
ZHAN, W., HUANG, B., HUANG, A., JIANG, N. and LEE, J. D. (2022). Offline reinforcement learning with realizability and single-policy concentrability. arXiv preprint arXiv:2202.04634.

1.1 Overview of Techniques
1.2 Related Work
2.1 Episodic Partially Observable Markov Decision Process
2.2 Offline Data Generation: Confounded Dataset
2.3 Learning Objective
Proof of Lemma D.1
H.1 Main Result for Linear Function Approximation
H.2 Auxiliary Results for Linear Function Approximation
H.3 Proof of Corollary H.2
H.4 Proof of Lemmas in Appendix H
H.4.1 Proof of Lemma H.5
H.4.2 Proof of Lemma H.6

Table of Notations

Throughout the paper, we use O(·) to hide problem-independent constants and Õ(·) to hide problem-independent constants and logarithmic factors. The following table summarizes the notation used in our proposed algorithm design and theory.

Reinforcement learning in POMDPs. Our work is related to the recent line of research on developing provably efficient online RL methods for POMDPs

When using reactive policies (Example 2.1), the negative control action variable Z_h is simply the observation variable O_{h−1}, which reflects the patient's clinical state at the previous treatment step, and the negative control outcome variable W_h is simply the observation variable O_h at the current step. Furthermore, when the observation O contains enough information to reflect the underlying state S, which essentially amounts to a certain full-rank assumption, we can use Example C.1 to guarantee the existence of the bridge functions (see Appendix C).

Example C.1 (Example 2.1 revisited). For the tabular setting (i.e., S, A, and O are finite spaces) and reactive policies (i.e., π_h : O → ∆(A)), a sufficient condition under which Assumption 3.2 holds is that

rank(P^b_h(O_h | S_h)) = |S|,  rank(P^b_h(O_{h−1} | S_h)) = |S|,  (C.1)

where P^b_h(O_h | S_h) denotes the |S| × |O| matrix whose (s, o)-th element is P^b_h(O_h = o | S_h = s), and P^b_h(O_{h−1} | S_h) is defined similarly.
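As a quick numerical illustration of condition (C.1), one can form the two emission matrices and check that each has full row rank |S|. The matrices below are hypothetical, chosen only to show the shape of the check:

```python
import numpy as np

# Hypothetical tabular POMDP with |S| = 2 latent states and |O| = 3 observations;
# rows index latent states s, columns index observations o, and each row sums to 1.
P_curr = np.array([[0.7, 0.2, 0.1],    # stands in for P^b_h(O_h = o | S_h = s)
                   [0.1, 0.3, 0.6]])
P_prev = np.array([[0.6, 0.3, 0.1],    # stands in for P^b_h(O_{h-1} = o | S_h = s)
                   [0.2, 0.2, 0.6]])

num_states = P_curr.shape[0]

# Condition (C.1): both emission matrices have full row rank |S|.
full_rank = (np.linalg.matrix_rank(P_curr) == num_states
             and np.linalg.matrix_rank(P_prev) == num_states)
print(full_rank)
```

Since rank ≤ min(|S|, |O|), the condition implicitly requires |O| ≥ |S|, i.e., the observations are rich enough to distinguish the latent states.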

We have now shown that term (B) in (E.1) equals zero for Examples 2.1, 2.2, and 2.3, respectively, which allows us to conclude that

Now, combining (H.6) and (H.8) with a union bound, we conclude that, with probability at least 1 − δ, for any π ∈ Π(H),

Lower bound of term (⋆). First, as in (F.17), it holds with probability at least 1 − δ/4 that,

where ζ and ζ′ are the same as in (H.14) and (H.15). Furthermore, given fixed b_h, b_{h+1} ∈ B, we denote

ACKNOWLEDGEMENTS

Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support. Zhuoran Yang acknowledges Simons Institute (Theory of Reinforcement Learning) for the support. The authors would like to thank Zhihan Liu for helpful discussions on minimax estimators.


Definition H.1 (Linear function approximation). Let ϕ : A × W → R^d be a feature mapping for some integer d ∈ N. We let the primal function class be B = B_lin. Let {ψ_h}_{h=1}^H be H feature mappings. We let the policy function class be Π(H) = Π_lin, where Π_lin = {Π_lin,h}_{h=1}^H and each Π_lin,h is defined as

Π_lin,h := {π_h : π_h(a | o, τ) = exp(⟨ψ_h(a, o, τ), β⟩)

Finally, let ν : A × Z → R^d be another feature mapping. We let the dual function class be G = G_lin. Assume without loss of generality that these feature mappings are normalized, i.e., ∥ϕ∥_2, ∥ψ∥_2, ∥ν∥_2 ≤ 1.

We note that Definition H.1 is consistent with Assumption 4.2. One can see that B_lin and G_lin are uniformly bounded, and that G_lin is symmetric and star-shaped. For more detailed theoretical properties of B_lin, G_lin, and Π_lin, we refer the reader to Appendix H.2.

Under linear function approximation, we can extend Theorem 4.4 to the following corollary, which characterizes the suboptimality (2.2) of π̂ found by P3O when using B_lin, G_lin, and Π_lin as the function classes.

Corollary H.2 (Suboptimality analysis: linear function approximation). With linear function approximation (Definition H.1), under Assumptions 3.1, 3.2, 4.1, and 4.3, by setting the regularization parameter λ = 1 and the confidence parameter ξ as

Recall the definition of the bridge function class B^⊗H, where B = B_lin. Denote by N^∞_ϵ(B) the ϵ-covering number of B with respect to the ℓ_∞ norm; that is, there exists a collection of functions

Recall the policy function class Π(H) = Π^⊗H_lin. The upper bounds for these covering numbers are given by the following lemma.

Lemma H.3 (Lemma 6 in Zanette et al. (2021)). For any ϵ ∈ (0, 1), we have

The ϵ-nets for the product function classes. In the rest of Appendix H, we need to consider ϵ-nets for the product function classes B^⊗H and Π(H) = Π^⊗H_lin. Specifically, for B^⊗H, we consider an ϵ-net defined in the following way: for any

Similarly, we consider an ϵ-net for Π(H) defined as follows: for any π = {π_h}_{h=1}^H ∈ Π(H), there exists an

for some problem-independent universal constant C_2 > 0, it holds with probability at least 1 − δ that b^π ∈ CR_π(ξ) for any policy π ∈ Π(H).

Lemma H.6 (Alternative of Lemma D.3 in the linear case). Under Assumptions 3.2, 4.2, and 4.3, by setting the same ξ as in Lemma H.5, with probability at least 1 − δ, for any policy π ∈ Π(H), b ∈ CR_π(ξ), and step h,

for some problem-independent universal constant C_2 > 0.

We are now ready to prove Corollary H.2.

Proof of Corollary H.2. We follow the proof of Theorem 4.4 and write (H.1). We deal with term (ii) first. By Lemma H.5, with probability at least

Then, following (G.2), we can upper bound term (ii) by

(H.2)

To bound term (v), we invoke Lemma D.1, which holds regardless of the underlying function classes, and obtain that

Now by Lemma H.6, with probability at least 1 − δ, (H.3)

Now we deal with terms (i), (iii), (iv), and (vi), respectively. To this end, we apply uniform concentration inequalities to bound Ĵ(π, b) and J(π, b) uniformly over the ϵ-nets of π and b as described in the proof of Lemma H.5. By Hoeffding's inequality, we have that, with probability at least 1 − δ, for all π and b in their ϵ-nets,

where N_{ϵ,π} and N_{ϵ,b} are the covering numbers defined in Appendix H.2. Here we use the regularity assumption that |Σ_{a∈A} b^π_1(a, w)| ≤ M_B for all w ∈ W and the definition of J(π, b) from (D.1). Consequently, for all π ∈ Π(H) and b ∈ B^⊗H, we have

where the second inequality follows from the same reason as in (F.20). Finally, combining (H.19) and (H.13), we get

We then plug in the values of ι_n and ι′_n from (H.17), ξ from Lemma H.5, ζ and ζ′ from below (H.14) and (H.15), N_{ϵ,b} and N_{ϵ,π} from Lemma H.3, and α_{G,n} from Lemma H.4, and set ϵ = 1/n². Simplifying the expression, we get

where C is some problem-independent universal constant. By (H.18) and (3.6), we have

for some constant C_2. This finishes the proof.
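The Hoeffding-plus-union-bound step above can be sanity-checked numerically: taking the maximum over N candidates inflates the deviation bound only by a factor of order √log N. The simulation below uses hypothetical bounded mean-zero variables, not the paper's quantities:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1000       # samples per empirical mean
N = 200        # number of functions in the (hypothetical) epsilon-net
M = 1.0        # uniform bound on the variables, so Hoeffding applies
delta = 0.05

# Empirical means of N mean-zero variables bounded in [-M, M].
X = rng.uniform(-M, M, size=(N, n))
max_dev = np.abs(X.mean(axis=1)).max()

# Hoeffding + union bound: with probability at least 1 - delta,
#   max_k |mean_k| <= M * sqrt(2 * log(2N / delta) / n),
# so the union over the net costs only a sqrt(log N) factor.
bound = M * np.sqrt(2.0 * np.log(2.0 * N / delta) / n)
print(max_dev <= bound)
```

This logarithmic dependence on the covering numbers N_{ϵ,b} and N_{ϵ,π} is what keeps the final rate at the n^{-1/2} order after the union bound over the ϵ-nets.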

I AUXILIARY LEMMAS

We introduce some useful lemmas for uniform concentration over function classes. Before presenting the lemmas, we first introduce several notations. For a function class F on a probability space (X, P), we denote by ∥f∥²_2 the expectation of f(X)², that is, ∥f∥²_2 = E_{X∼P}[f(X)²]. Also, we denote by

the localized Rademacher complexity of F with scale δ > 0 and size n ∈ N. Here {ϵ_i}_{i=1}^n and {X_i}_{i=1}^n are i.i.d. and mutually independent; each ϵ_i is uniformly distributed on {+1, −1} and each X_i is distributed according to P. Then for any t ≥ δ_n, we have that

Then for any t ≥ δ_n and some absolute constants c_1, c_2 > 0, with probability 1 − c_1 exp(−c_2 nt²/b²) it holds that

for any f ∈ F. If, furthermore, the loss function ℓ is linear in f, i.e., ℓ((f + f′)(x), y) = ℓ(f(x), y) + ℓ(f′(x), y) and ℓ(αf(x), y) = αℓ(f(x), y), then the lower bound on δ²_n is not required.

Proof of Lemma I.2. See Lemma 11 of Foster and Syrgkanis (2019) for a detailed proof.

Remark I.3. We remark that in the original Lemma 11 of Foster and Syrgkanis (2019), inequality (I.3) holds only for t = δ_n; we extend it to any t ≥ δ_n since, by Lemma 13.6 of Wainwright (2019), R_n(F; δ)/δ is non-increasing in δ on (0, +∞), which implies that any t ≥ δ_n also solves the defining inequality.
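For intuition, the localized Rademacher complexity can be estimated by Monte Carlo for a simple class. The toy class below, scaled identity functions on [−1, 1] with the localization imposed as ∥f∥₂ ≤ δ, is a hypothetical illustration of how localization shrinks the complexity:

```python
import numpy as np

rng = np.random.default_rng(2)

def localized_rademacher(n, delta, trials=2000):
    """Monte Carlo estimate of R_n(F; delta) for the toy class
    F = {x -> c * x : |c| <= 1} with X ~ Uniform[-1, 1], where the
    localization ||f_c||_2 = |c| / sqrt(3) <= delta caps the slope c."""
    c_max = min(1.0, delta * np.sqrt(3.0))
    vals = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-1.0, 1.0, size=n)
        eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        # sup over |c| <= c_max of (c/n) * sum(eps_i * X_i) is at c = +/- c_max.
        vals[t] = c_max * abs(np.mean(eps * X))
    return vals.mean()

# Localization shrinks the complexity: smaller delta gives smaller R_n(F; delta).
r_small = localized_rademacher(200, 0.05)
r_large = localized_rademacher(200, 0.50)
print(r_small < r_large)
```

The critical radius δ_n is the scale at which this localized complexity balances δ² itself, which is exactly the quantity α_{G,n} controls in the bounds above.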

