OFFLINE POLICY INTERVAL ESTIMATION WITHOUT SUFFICIENT EXPLORATION OR REALIZABILITY

Abstract

We study the problem of offline policy evaluation (OPE), where the goal is to estimate the value of a given decision-making policy without interacting with the actual environment. In particular, we consider interval-based OPE, where the output is an interval rather than a point, indicating the uncertainty of the evaluation. Interval-based estimation is especially important in OPE since, when the data coverage is insufficient relative to the complexity of the environmental model, any OPE method can be biased even with infinite sample size. In this paper, we characterize such irreducible biases in terms of the discrepancy between the target policy and the data-sampling distribution, and show that the marginal importance sampling (MIS) estimator achieves the minimax bias with an appropriate importance-weight function. Motivated by this result, we then propose a new interval-based MIS estimator that asymptotically achieves the minimax bias.

1. INTRODUCTION

Offline policy evaluation (OPE) is the art of estimating the value of a given decision-making policy based on offline datasets, without interacting with the actual environment. Since interaction with the environment is often infeasible or expensive in real-world applications, it is preferable to evaluate the value offline rather than online. In the literature, it is understood from theoretical perspectives that there are two fundamental conditions for OPE to be successful: sufficient exploration, the coverage of the data-sampling distribution over the state-action space relative to the target policy, and realizability, the knowledge of a correct environmental model with bounded complexity. In particular, if neither of these two conditions is met in a certain manner, it is known that OPE is never sample efficient, i.e., it takes a prohibitively large sample to make the estimation reasonably accurate (Wang et al., 2020; Zanette, 2021). In practice, given a problem instance of OPE, consisting of an environment and a dataset, it is difficult to confirm that these conditions hold, or to modify the problem instance so that they hold, making the existing theoretical guarantees less practical. Towards practical OPE, we set our research objective to develop a theoretically sound value estimator without assuming these two conditions. To this end, we first analyze the statistical performance of OPE methods when the two assumptions do not hold (Section 4). The key quantity is the information-theoretic worst-case bias of the value estimator (Eq. (5)) and its minimum, termed the minimax bias (Eq. (6)), which is positive when there exist multiple indistinguishable environments given only a problem instance of OPE. In fact, we show that the minimax bias can be non-zero if we do not assume the two conditions (Corollary 4.2).
This suggests that, without the two assumptions, there exists a problem instance for which no point-based value estimator is reliable. Given the existence of irreducible bias, we propose an alternative formulation of offline policy evaluation called minimax-bias offline policy interval estimation (minimax-bias OPI), where the objective is to estimate the shortest possible interval containing the true value, instead of a point estimate (Section 5). Since our characterization of the minimax bias allows us to define the optimal interval (Definition 5.1), minimax-bias OPI is formulated as the problem of estimating the optimal interval (Problem 5.1). We provide a theoretical foundation for solving minimax-bias OPI based on the marginal importance sampling estimator (Section 6). The key result is that the optimal importance weight minimizing the distributional Bellman residual (DBR) allows us to construct an approximately optimal interval (Theorem 6.3). This illustrates that our problem setting is well-posed and can be solved under realistic assumptions, provided we can minimize the DBR. Accordingly, we develop a novel algorithm in Section 7 to find the best importance-weight function, which results in an interval estimator applicable even if the two fundamental conditions do not hold (Theorem 7.7). Before proceeding to these results, we introduce basic mathematical notation in the rest of this section, review the related work in Section 2, and introduce useful OPE-specific notation in Section 3. Mathematical notation. Let I denote the identity operator and let a∨b := max{a, b} and a∧b := min{a, b} denote the maximum and minimum operators for a, b ∈ R, respectively. Let X be a metric space with Borel algebra Σ. Let B(X) and C(X) be the spaces of real-valued measurable bounded functions and continuous functions on X, respectively, both of which are equipped with the uniform norm ∥f∥_∞ := sup_{x∈X} |f(x)|.
Let M(X) denote the space of finite signed measures on the same space X, equipped with the total variation (TV) norm ∥P∥_TV := sup_{E₊∪E₋=X} {P(E₊) − P(E₋)}, where the supremum is taken over partitions of X. In particular, let δ_x ∈ M(X), x ∈ X, denote Dirac's delta measure. For any f ∈ B(X) and any P ∈ M(X), let ⟨f, P⟩ := ∫ f(x) dP(x) be a shorthand for the (signed) expectation of f with respect to P. Let ⊙ denote the importance-weighting operation given by d(f ⊙ P)(x) := f(x) dP(x), f ∈ B(X), P ∈ M(X). Let L¹(P) be the space of functions integrable with respect to P ∈ M(X), i.e., those with ∥f ⊙ P∥_TV < ∞. Let L(V) denote the set of bounded linear operators on a normed vector space V. For any A ∈ L(M(X)), let A* ∈ L(B(X)) denote the conjugate operator such that ⟨A*f, P⟩ = ⟨f, AP⟩ for f ∈ B(X) and P ∈ M(X).
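To make the notation concrete, here is a tiny numerical glossary (illustrative only), representing measures as weight vectors over finitely many atoms: the pairing ⟨f, P⟩ is a weighted sum, f ⊙ P rescales the masses, and ∥P∥_TV sums absolute masses.

```python
import numpy as np

# Signed measure P on three atoms and a bounded function f (toy values).
f = np.array([0.5, -1.0, 2.0])
P = np.array([0.2, 0.5, -0.3])

pairing = f @ P           # <f, P> = integral of f dP
weighted = f * P          # f ⊙ P: d(f⊙P)(x) = f(x) dP(x)
tv = np.abs(P).sum()      # ||P||_TV = sup over partitions {P(E+) - P(E-)}

# Sanity checks on this toy example.
assert abs(pairing - (-1.0)) < 1e-12
assert abs(tv - 1.0) < 1e-12
assert abs(weighted.sum() - pairing) < 1e-12  # <1, f⊙P> = <f, P>
```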

2. RELATED WORK

The problem of estimating an interval containing the true value is known as offline policy interval estimation (OPI). This section reviews the existing studies on OPI by dividing the previous OPI methods into two categories, non-asymptotic and asymptotic methods (see Table 1 for a summary of the comparison), and then discusses our contribution to the literature. The non-asymptotic methods typically put their emphasis on the validity of the interval at any finite sample size, where an interval is valid if it contains the true value J(π). For instance, Feng et al. (2020; 2021) compute intervals that contain the true policy value with high probability, under the realizability of the policy Q-function q_π. Jiang and Huang (2020) also proposed an interval estimator that is valid under the more relaxed realizability condition that either the policy Q-function q_π or the marginal density ratio function w_π is realizable. One limitation of this approach is that the theoretical understanding of the tightness of the interval is often unclear or partial. Another limitation is that these methods tend to require realizability with a hypothesis class of known complexity. This requirement is not desirable for practical use; if we used an overly complex hypothesis class, such as a reproducing kernel Hilbert space with infinite radius, the resulting interval would be trivial and thus non-informative. The asymptotic methods focus on the asymptotically dominant term of the uncertainty in the large-sample limit, which typically allows us to understand their behavior, especially the tightness, in depth. For instance, Kallus and Uehara (2020) and Shi et al. (2021) gave confidence interval estimators that achieve the efficiency lower bound. Bootstrap estimators (Hao et al., 2021) also enable us to compute asymptotically exact confidence intervals in a more flexible manner.
One major limitation is that they assume both the sufficient exploration condition and the realizability conditions on q_π and w_π, which can hardly be validated in real-world applications. These assumptions are essential to their analyses because they focus on estimating the asymptotic variance of order O(n^{−1/2}), assuming that the bias is negligible. Therefore, these methods are not applicable to our setting, where an asymptotic bias of order O(1) dominates the asymptotic variance. In this study, we take the asymptotic approach, but with a focus on estimating the bias rather than the variance, because the bias is dominant in our setting, where sufficient exploration and realizability need not hold at all. Our contributions are threefold. First, we characterize the theoretical lower bound of the asymptotic bias through asymptotic analysis, which serves as a theoretical foundation of OPE without sufficient exploration or realizability assumptions. Second, without the two assumptions, we develop an interval estimation method that outputs an asymptotically valid interval, that is, an interval that contains the true value in the large-sample limit. Third, under the realizability condition on the generalized marginal density ratio function w♯_π, we show that the estimated interval is optimal.

3. PRELIMINARIES

We first introduce our formulation of reinforcement learning and offline policy evaluation. Then, we introduce two fundamental concepts in RL, the Q-function and the occupancy measure, along with shorthand notation for them. Offline policy evaluation. Let X := S × A be a compact Hausdorff space representing the state-action space of the system with |X| < ∞. Let M := (ι, T, R) be the Markov decision process (MDP) of the environment on X, where ι ∈ M(S) is the initial state distribution, T : X → M(S) is the transition dynamics and R : X → M([−1, 1]) is the conditional reward distribution. Let π : S → M(A) be the target policy. Then, the value J(π) of π with respect to M is given by the γ-discounted expected average reward J(π) ≡ J_M(π) := E_{M,π}[(1 − γ) Σ_{t≥1} γ^{t−1} r_t], where γ ∈ (0, 1) is a discounting factor and E_{M,π} denotes the expectation with respect to the Markov chain generated with a_t ∼ π(s_t), r_t ∼ R(s_t, a_t), s_{t+1} ∼ T(s_t, a_t) for all t ≥ 1 and s_1 ∼ ι. In offline policy evaluation, we are given a dataset D := (D_ι, D_{T,R}) as input, where D_ι := {s_{ι,j}}_{j=1}^n is a set of initial states and D_{T,R} := {(x_i, s′_i, r_i)}_{i=1}^n is a set of transition records sampled from dG_{M,β}(D) := ∏_{j=1}^n dι(s_{ι,j}) · ∏_{i=1}^n dβ(x_i) dT(s′_i|x_i) dR(r_i|x_i), where β ∈ M(X) is an arbitrary state-action-sampling distribution. Then, an instance of offline policy evaluation (OPE) is identified by the quadruple P := (M, β, π, γ) and formalized as follows. Problem 3.1 (Offline policy evaluation, OPE). Given (D, π, γ) where D ∼ G_{M,β}, estimate J(π). Q-function and occupancy measure. Let ι_π ∈ M(X) and T_π ∈ L(M(X)) be the initial state-action distribution and the state-action transition operator associated with π, such that dι_π(s, a) := dι(s) dπ(a|s) and d(T_π P)(s, a) := ∫ dT(s|x) dπ(a|s) dP(x) for (s, a) ∈ X and P ∈ M(X), respectively. Also let ρ ∈ B(X) be the expected reward function such that ρ(x) := ∫ r dR(r|x) for x ∈ X.
Then, the value J(π) is rewritten as J(π) = ⟨ρ, Γ_π ι_π⟩ = ⟨Γ*_π ρ, ι_π⟩ = ⟨ρ, μ_π⟩ = ⟨q_π, ι_π⟩, (1) where Γ_π := (1 − γ) Σ_{t≥1} (γT_π)^{t−1} ∈ L(M(X)) is the accumulation operator, μ_π := Γ_π ι_π ∈ M(X) is the normalized occupancy measure of π (henceforth the occupancy measure), and q_π := Γ*_π ρ ∈ B(X) is the normalized Q-function of π (henceforth the Q-function). Note that we have ∥q_π∥_∞ ≤ 1 and ∥μ_π∥_TV = 1 thanks to the normalization. Two Bellman equations. One of the essential difficulties of OPE lies in the fact that direct estimation of the accumulation operator Γ_π (and hence of μ_π and q_π) is intractable due to the infinite sum. The Bellman equation is useful to mitigate this problem. Here, we introduce two variants of the Bellman equation, the functional and distributional Bellman equations, given by ρ = ∆*_π q_π and ι_π = ∆_π μ_π, where ∆_π := Γ_π^{−1} = (I − γT_π)/(1 − γ) is the difference operator. Note that, in the Bellman equations, both q_π and μ_π are uniquely characterized via the more directly estimable quantities (ρ, T_π) and (ι_π, T_π), respectively. The errors of the Bellman equations are referred to as the Bellman residuals. In particular, the distributional Bellman residual (DBR) is given by R_π(w) := ι_π − ∆_π(w ⊙ β) ∈ M(X), which plays an important role in our analysis. Empirical estimates. Finally, we introduce the empirical estimates of (ι_π, T_π, ρ, β) based on the dataset D as follows. For all P ∈ M(X) and x ∈ X, ι̂_π := (1/n) Σ_{j=1}^n δ_{x_{ι,j}}, T̂_π P := Σ_{i=1}^n (δ_{x′_i}/N(x_i)) P({x_i}), ρ̂(x) := (1/N(x)) Σ_{i: x_i=x} r_i, β̂ := (1/n) Σ_{i=1}^n δ_{x_i}, where N(x) := 1 ∨ |{i : x_i = x}| is the data-counting function (with the zero-division safeguard) and x_{ι,j} := (s_{ι,j}, a_{ι,j}) and x′_i := (s′_i, a′_i) are the state-action pairs associated with the additional samples a_{ι,j} ∼ π(s_{ι,j}) and a′_i ∼ π(s′_i), respectively.
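As a sanity check on the definitions above, the following minimal tabular sketch (all concrete numbers and variable names are illustrative) computes the occupancy measure by solving the distributional Bellman equation ι_π = ∆_π μ_π with a single linear solve, and verifies that ⟨ρ, μ_π⟩ matches the discounted reward series J(π).

```python
import numpy as np

gamma = 0.9
n_xa = 4  # number of state-action pairs |X| (toy size)
rng = np.random.default_rng(0)

# Column-stochastic state-action transition matrix: column x holds P(x' | x),
# so measures (column vectors) evolve as P -> T_pi @ P.
T_pi = rng.random((n_xa, n_xa))
T_pi /= T_pi.sum(axis=0, keepdims=True)
iota_pi = np.full(n_xa, 1.0 / n_xa)       # initial state-action distribution
rho = rng.uniform(-1.0, 1.0, size=n_xa)   # expected reward function in [-1, 1]

# Bellman equation (I - gamma*T_pi) mu_pi = (1-gamma) iota_pi => one solve.
mu_pi = np.linalg.solve(np.eye(n_xa) - gamma * T_pi, (1.0 - gamma) * iota_pi)
J = rho @ mu_pi                           # J(pi) = <rho, mu_pi>

# Compare with the truncated series (1-gamma) * sum_t gamma^t rho.(T_pi^t iota).
dist, J_series = iota_pi.copy(), 0.0
for t in range(500):
    J_series += (1.0 - gamma) * gamma**t * (rho @ dist)
    dist = T_pi @ dist

assert abs(mu_pi.sum() - 1.0) < 1e-8      # ||mu_pi||_TV = 1 (normalization)
assert abs(J - J_series) < 1e-8
```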
Throughout this paper, we employ the conventional marginal importance sampling (MIS) estimator (Liu et al., 2018; Xie et al., 2019) to estimate the value offline. The MIS estimator associated with a weight function w ∈ B(X) is given by Ĵ(w) := ⟨ρ̂, w ⊙ β̂⟩. (4) The MIS estimator is justified if the weight function w is equal to the marginal density ratio w_π := dμ_π/dβ (assuming it exists) since, in that case, the MIS estimator is unbiased, E[Ĵ(w_π)] = J(π), according to (1). Note, however, that w_π does not exist when the exploration is insufficient, β ̸≫ μ_π, and unbiasedness cannot be guaranteed in general. Two natural questions thus arise: Does the MIS estimator still enjoy any theoretical guarantee in such a general setting? If so, what is the best weight function w? In short, the answer to the first question is affirmative, and the answer to the second question is one of the main contributions of this work.
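A minimal sketch of the MIS estimate (4) on synthetic data: average the observed rewards reweighted by w at the sampled state-action pairs. The weight function, reward table, and sampling distribution below are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.integers(0, 4, size=n)              # state-action indices drawn from beta (uniform here)
w = np.array([0.5, 1.5, 1.0, 1.0])          # candidate weight function w(x)
rho = np.array([0.2, -0.1, 0.4, 0.0])       # underlying expected rewards (for reference only)
r = rho[x] + rng.normal(0.0, 0.1, size=n)   # noisy observed rewards

# MIS estimate <rho_hat, w ⊙ beta_hat> = empirical mean of w(x_i) * r_i.
J_hat = np.mean(w[x] * r)

# With beta uniform over 4 pairs, the population value is mean(w * rho);
# the estimate should match it up to sampling error.
J_expected = np.mean(w * rho)
assert abs(J_hat - J_expected) < 0.05
```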

4. IRREDUCIBLE BIAS IN OFFLINE POLICY EVALUATION

In this section, we theoretically analyze the statistical performance of OPE methods without sufficient exploration or realizability assumptions. As a result, we show that any OPE method must incur an irreducible bias that never disappears even as the sample size goes to infinity. Given such a negative result, we instead propose a novel problem setting called minimax-bias OPI, where the goal is to estimate an interval that contains the true value and is as short as possible. The proposed problem setting is expected to be solvable without sufficient exploration or realizability assumptions, and thus to be of practical use. To study the statistical performance of OPE methods, we introduce the notion of the minimax bias of point-based estimators. Let Ĵ be any random variable representing a point-based OPE estimator. Then, the information-theoretic worst-case bias of Ĵ is given by ε[Ĵ] ≡ ε[Ĵ; P] := sup_{(M′,β)∼(M,β)} |J_{M′}(π) − E[Ĵ]|, (5) where the equivalence ∼ is defined by equality of the corresponding distributions of the dataset, i.e., G_{M′,β} = G_{M,β}. If there exist equivalent environments M and M′ that result in different policy values J_M(π) ≠ J_{M′}(π) yet are indistinguishable from the dataset, the worst-case bias ε[Ĵ] is inevitable without an additional source of information, i.e., domain knowledge. The minimax bias is then defined as the minimum possible worst-case bias of OPE, ε⋆(π) ≡ ε⋆(π; P) := inf_Ĵ ε[Ĵ; P], (6) which can be thought of as a characteristic of the problem P indicating its hardness in terms of the irreducible uncertainty even with an infinitely large sample. In fact, there exists a unique Ĵ achieving the infimum, and we refer to it as the optimal point estimator J⋆(π). Our main objective is to understand the minimax bias in various settings. To this end, we introduce a novel concept, the projection of the occupancy measure μ_π with respect to β.
Let Π_β be the projection operator onto the support of β, such that Π_β P = χ_β ⊙ P for P ∈ M(X), where χ_β(x) := 1{x ∈ supp β}. Definition 4.1 (Projected occupancy measure and its importance weight). We refer to μ♯_π := (1 − γ) Π_β Σ_{t≥0} (γ T_π Π_β)^t ι_π (7) as the projected occupancy measure of π. Correspondingly, we also refer to w♯_π := dμ♯_π/dβ as the projected importance weight of π with respect to β. Note that w♯_π can be thought of as an extension of w_π in the sense that it is always well-defined and w♯_π = w_π whenever w_π exists. On the other hand, μ♯_π can be thought of as the known component of μ_π since it is always identifiable given G_{M,β}, thanks to the projection Π_β, and μ♯_π = μ_π whenever μ_π is also identifiable. We now present our main result, which reveals close relationships between the minimax bias ε⋆(π), the MIS estimator Ĵ(w), the DBR R_π(w) and the projected importance weight w♯_π. Theorem 4.1. For all w ∈ B(X), we have ε⋆(π) ≤ ε[Ĵ(w)] ≤ ∥R_π(w)∥_TV ≤ ε⋆(π) + ((1 + γ)/(1 − γ)) ∥w − w♯_π∥_{L¹(β)}. (8) Proof (sketch). The most nontrivial part is the last inequality, which follows from a constructive proof of ε⋆(π) ≥ ∥R_π(w♯_π)∥_TV. In particular, we construct two worst-case environments M± under the constraint (M±, β) ∼ (M, β). Roughly speaking, the environments are constructed to have a special state ⊥ in the underexplored region of X absorbing all transitions into that region, with the extreme rewards there, i.e., ρ(⊥) = ±1. With this explicit construction of the environments, we can give an analytic lower bound on ε⋆(π), which coincides with ∥R_π(w♯_π)∥_TV. See Section B for the complete proof. An immediate consequence of Theorem 4.1 is that it tells us when and how the minimax bias is positive. To see this, let μ̸♯_π := μ_π − μ♯_π be the projection residual of μ_π. Corollary 4.2. We have ε⋆(π) = ∥∆_π μ̸♯_π∥_TV and thus ∥μ̸♯_π∥_TV ≤ ε⋆(π) ≤ ((1 + γ)/(1 − γ)) ∥μ̸♯_π∥_TV.
In other words, the minimax bias is zero if and only if the projection residual μ̸♯_π is zero, or equivalently, if μ_π is absolutely continuous with respect to the data distribution β. Moreover, the size of the minimax bias is proportional to the size of the projection residual μ̸♯_π. This formally establishes the limitation of point-based estimators in insufficient-exploration settings. In summary, any OPE method must be biased in the worst case whenever the exploration is insufficient, μ_π ̸≪ β, motivating the interval-based approach.
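The projected occupancy measure and its residual can be illustrated in a small tabular example. The code below (names illustrative) uses the geometric-series identity μ♯_π = (1 − γ) Π_β (I − γ T_π Π_β)^{−1} ι_π, which follows from summing the series in Definition 4.1, and checks that the residual is nonzero when β misses part of the state-action space.

```python
import numpy as np

gamma = 0.9
n_xa = 4
rng = np.random.default_rng(2)

T_pi = rng.random((n_xa, n_xa))
T_pi /= T_pi.sum(axis=0, keepdims=True)   # column-stochastic: acts on measures
iota_pi = np.full(n_xa, 0.25)
chi = np.array([1.0, 1.0, 1.0, 0.0])      # supp(beta) misses the last pair
Pi = np.diag(chi)                         # projection onto supp(beta)

# Full occupancy measure vs. its projected counterpart.
mu_pi = np.linalg.solve(np.eye(n_xa) - gamma * T_pi, (1 - gamma) * iota_pi)
mu_sharp = Pi @ np.linalg.solve(np.eye(n_xa) - gamma * T_pi @ Pi, (1 - gamma) * iota_pi)

# Projection residual: its TV norm lower-bounds the minimax bias (Corollary 4.2);
# it vanishes iff beta covers mu_pi.
residual = np.abs(mu_pi - mu_sharp).sum()
assert residual > 0.0
assert mu_sharp[3] == 0.0                         # no mass off supp(beta)
assert np.all(mu_sharp >= -1e-12) and mu_sharp.sum() <= 1.0 + 1e-12
```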

5. PROBLEM SETUP: MINIMAX-BIAS OPI

As discussed above, point-based estimators suffer from an irreducible bias, suggesting the hardness of Problem 3.1 under realistic assumptions. Given such a limitation, we propose an alternative problem setting, called minimax-bias OPI, that can be solved under realistic assumptions. Since a bias exists, our idea is to estimate the value by an interval that contains the true value, instead of a point. Let us first define the target of the estimation, which we call the optimal interval. As discussed earlier, since J⋆(π) and ε⋆(π) are the best possible point estimator and its error guarantee, respectively, the optimal interval can be naturally formulated as follows. Definition 5.1 (Optimal interval). The following is referred to as the optimal interval: I⋆(π) ≡ I⋆(π; P) := [J⋆(π) − ε⋆(π), J⋆(π) + ε⋆(π)]. Then, the problem of offline policy interval estimation (OPI), an uncertainty-aware interval extension of Problem 3.1, is formalized as follows. Problem 5.1 (Minimax-bias OPI). Estimate I⋆(π) based on (D, π, γ), where D ∼ G_{M,β}. Towards estimating the optimal interval, let us introduce two desirable properties of an interval; the first (Definition 5.2) is stronger than the second (Definition 5.3). Definition 5.2 (Approximate optimality). We refer to an interval I satisfying d_H(I, I⋆(π)) ≤ ε, where d_H(·, ·) is the Hausdorff distance, as ε-approximately optimal. Moreover, a sequence of intervals {I_n}_{n≥1} is said to be asymptotically (approximately) optimal if it converges to an (approximately) optimal interval. Definition 5.3 (Validity). An interval I ⊂ R is said to be valid if I ⊃ I⋆(π). Moreover, a sequence of intervals {I_n}_{n≥1} is said to be asymptotically valid if its lower limit lim_{k→∞} ∩_{n≥k} I_n is valid.

6. THEORETICAL FOUNDATION OF MINIMAX-BIAS OPI

We provide a theoretical foundation for solving Problem 5.1, mainly based on Theorem 4.1. First, Theorem 4.1 implies that the MIS estimator Ĵ(w) with the projected importance weight w = w♯_π is optimal in the sense that it achieves the minimax bias. More generally: Corollary 6.1. There exists w ∈ L¹(β) such that ε[Ĵ(w)] = ε⋆(π). This motivates us to seek the best weight function w within the MIS framework. The following corollary shows the next significant implication: optimal point-based and interval-based OPE is achieved by combining the MIS estimator with the minimization of the DBR. Corollary 6.2. Let w⋆ be a minimizer of the DBR in a compact hypothesis class W ⊂ B(X), w⋆ ∈ argmin_{w∈W} ∥R_π(w)∥_TV. (10) Then, we have |E[Ĵ(w⋆)] − J⋆(π)| ≤ ((1 + γ)/(1 − γ)) ε_W and |∥R_π(w⋆)∥_TV − ε⋆(π)| ≤ ((1 + γ)/(1 − γ)) ε_W, where ε_W := min_{w∈W} ∥w − w♯_π∥_{L¹(β)} is the realizability error of W. In other words, given that W is expressive enough to approximate w♯_π well and thus ε_W is negligible, the optimal point estimator J⋆(π) and its uncertainty ε⋆(π) can be estimated with a solution w⋆ to (10) and its objective value ∥R_π(w⋆)∥_TV, respectively. These observations naturally lead us to the following proxy for the optimal interval: I(π; W) := [E[Ĵ(w⋆)] − ∥R_π(w⋆)∥_TV, E[Ĵ(w⋆)] + ∥R_π(w⋆)∥_TV]. (13) In fact, it satisfies the two desirable properties, validity and approximate optimality. Theorem 6.3. The interval I(π; W) is valid and 2((1 + γ)/(1 − γ))ε_W-approximately optimal.
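A population-level tabular sketch of the interval above: center at the MIS value and use the TV norm of the DBR as the half-width. Validity here follows from the bound |J(π) − ⟨ρ, w ⊙ β⟩| ≤ ∥q_π∥_∞ ∥R_π(w)∥_TV ≤ ∥R_π(w)∥_TV; all concrete numbers are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

gamma = 0.9
n_xa = 4
rng = np.random.default_rng(3)
T_pi = rng.random((n_xa, n_xa))
T_pi /= T_pi.sum(axis=0, keepdims=True)
iota_pi = np.full(n_xa, 0.25)
rho = rng.uniform(-1.0, 1.0, n_xa)          # rewards bounded in [-1, 1]
beta = np.array([0.4, 0.3, 0.3, 0.0])       # underexplored last state-action pair

mu_pi = np.linalg.solve(np.eye(n_xa) - gamma * T_pi, (1 - gamma) * iota_pi)
J_true = rho @ mu_pi

w = np.array([1.0, 1.0, 1.0, 0.0])          # some candidate weight (zero off-support)
wb = w * beta                               # w ⊙ beta
# DBR: R_pi(w) = iota_pi - (I - gamma*T_pi)(w ⊙ beta)/(1 - gamma)
R = iota_pi - (wb - gamma * T_pi @ wb) / (1 - gamma)
half_width = np.abs(R).sum()                # ||R_pi(w)||_TV
J_w = rho @ wb                              # population MIS value <rho, w ⊙ beta>

# The interval [J_w - half_width, J_w + half_width] covers the true value.
assert J_w - half_width <= J_true <= J_w + half_width
```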

7. ESTIMATION OF OPTIMAL INTERVAL

As discussed in the previous section, the estimation of the optimal interval I⋆(π) reduces to the minimization of the TV norm ∥R_π(w)∥_TV. However, even the evaluation of the exact TV norm is notoriously difficult (e.g., see Section 5 of Sriperumbudur et al. (2012)), let alone its minimization. This motivates us to develop new variational approximations of the TV norm. In particular, we introduce two approximations of ∥R_π(w)∥_TV, suited to the evaluation and the optimization of the objective, respectively.

7.1. EVALUATING THE OBJECTIVE

The first approximation reduces the evaluation of the TV norm to a conventional regression problem. To see this, let us begin with an approximation formula for the TV norm of general measures. Let F be a universal function approximator on X, i.e., a set of functions dense in C(X), such as a reproducing kernel Hilbert space (RKHS) with a universal kernel (Sriperumbudur et al., 2010) or a set of neural networks (Hornik et al., 1989). Proposition 7.1. For all positive measures P ∈ M(X), we have ∥P∥_TV = sup_{f∈F} ⟨clip(f), P⟩, where clip(t) := max{−1, min{1, t}} denotes the clipping of t ∈ R to [−1, 1]. The proof is relegated to Section C. Letting P = R_π(w), we immediately get the following special case, useful for the evaluation of the DBR. Corollary 7.2. For all w ∈ B(X), we have ∥R_π(w)∥_TV = sup_{f∈F} ⟨clip(f), R_π(w)⟩. (15) The supremum (15) is estimated via the regularized empirical risk minimization framework, minimizing the objective L(f) := −⟨clip(f), R̂_π(w)⟩ + Ψ(f), (16) where R̂_π(w) := ι̂_π − (1 − γ)^{−1}(I − γT̂_π)(w ⊙ β̂) is the natural empirical counterpart of the DBR and Ψ : F → R is a penalty function that makes the objective easier to minimize and prevents the minimizer from overfitting. Once the regularized empirical risk minimizer f̂ := argmin_{f∈F} L(f) is found, we evaluate ⟨clip(f̂), R̂_π(w)⟩ to approximate the RHS of (15) and obtain the desired estimate. The resulting procedure for evaluating the TV norm is summarized in Algorithm A.1. In fact, under a reasonable choice of F and Ψ, the output of Algorithm A.1 is shown to be consistent. The details of the specific choices of F and Ψ and the proof are provided in Section D. Theorem 7.3. Let EvaluateDBR(D, F, w) be the output of Algorithm A.1, where F and Ψ are given as in Section D.1. Then, for all w ∈ W, we have EvaluateDBR(D, F, w) → ∥R_π(w)∥_TV in probability.
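Proposition 7.1 can be checked numerically for a discrete signed measure: no clipped test function can push ⟨clip(f), P⟩ above the TV norm, and f = sign(P) attains it exactly. This is a toy sketch with random search, not the paper's Algorithm A.1.

```python
import numpy as np

rng = np.random.default_rng(4)
P = rng.normal(size=6)              # signed measure on 6 atoms (toy)
tv = np.abs(P).sum()                # ||P||_TV

clip = lambda f: np.clip(f, -1.0, 1.0)

# Random search over clipped test functions: never exceeds the TV norm.
best = max(clip(rng.normal(size=6, scale=3.0)) @ P for _ in range(2000))
assert best <= tv + 1e-12

# The supremum is attained at f = sign(P) (clipping leaves it unchanged).
attained = clip(np.sign(P) * 10.0) @ P
assert abs(attained - tv) < 1e-12
```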

7.2. MINIMIZING THE OBJECTIVE

We now turn to the minimization of ∥R_π(w)∥_TV with respect to w ∈ W. The previous variational formula (15) is not straightforwardly usable for this purpose, since it leads to a saddle-point problem, which we found too unstable in practice. To mitigate this issue, we introduce a minimization-based approximation of the TV norm. To this end, we first introduce the convolution of the TV norm with the maximum mean discrepancy (MMD) (Sriperumbudur et al., 2009). Here, the MMD of a measure P ∈ M(X) is given by MMD_κ(P) := ⟨κ, P⊗²⟩, where κ : X² → R is a c₀-universal kernel in the sense of Sriperumbudur et al. (2010) and P⊗² denotes the product measure of P on X². Definition 7.1 (Convolution norm). For all P ∈ M(X) and u ≥ 1, we refer to ∥P∥_{u,κ} := inf_{Q≪P} {u MMD_κ(P − Q) + ∥Q∥_TV} (17) as the u-convolution norm of P. The following proposition shows that the u-convolution norm is a reasonable approximation of the TV norm and, unlike the TV norm, admits sample-based estimation. Proposition 7.4. For all P ∈ M(X), we have ∥P∥_TV = lim_{u→∞} ∥P∥_{u,κ}. Moreover, if P is a probability measure, then for all δ ∈ (0, 1), we have ∥P̂_n − P∥_{u,κ} = O(√((u² + ln(1/δ))/n)) (19) with probability ≥ 1 − δ, where P̂_n := (1/n) Σ_{i=1}^n δ_{x_i} is the empirical distribution of an n-sample (x_1, ..., x_n) independently drawn from P, n ≥ 1. Proof. The key to the proof is Lemma E.2, which gives the dual representation of the convolution norm, ∥P∥_{u,κ} = sup {⟨f, P⟩ : f ∈ B(X), ∥f∥_H ≤ u, ∥f∥_∞ ≤ 1}, where H is the RKHS generated by κ. Then, the density of the universal RKHS in C(X) implies that lim_{u→∞} ∥P∥_{u,κ} = sup_{∥f∥_∞≤1} ⟨f, P⟩ = ∥P∥_TV, which proves the first statement. The second statement follows from the uniform law of large numbers, namely Theorem H.5 and Lemma H.6. Slightly extending this to the sample approximation R̂_π(w) ≈ R_π(w) of the DBR, we obtain the following approximation formula, useful for the weight optimization. The proof is given in Section F. Corollary 7.5.
For all w ∈ B(X), we have ∥R̂_π(w)∥_{u,κ} → ∥R_π(w)∥_TV (20) in probability as u → ∞ and n/u² → ∞. In the practical implementation, one may change the variable in the LHS of (20) from the measure Q ∈ M(X) to a function g ∈ B(X): ∥R̂_π(w)∥_{u,κ} = inf_{Q≪R̂_π(w)} {u MMD_κ(R̂_π(w) − Q) + ∥Q∥_TV} = min_{g∈B(X)} {u MMD_κ(R̂_π(w) − g ⊙ η̂_π) + ⟨|g|, η̂_π⟩}, (21) where η̂_π := (ι̂_π + T̂_π β̂ + β̂)/3. Here, we have exploited the transitivity of absolute continuity with Q ≪ R̂_π(w) ≪ η̂_π. Note that the minimization with respect to g is tractable since g is only evaluated on the support of η̂_π, which is a finite set. Moreover, the convolution-based formula (21) is stable under minimization with respect to w ∈ W, resulting in a joint minimization problem over (w, g), unlike the regression-based formula (15), which results in a saddle-point problem. Since the objective of (21) is lower bounded and convex with respect to (w, g), the minimizer ŵ_u := argmin_{w∈W} ∥R̂_π(w)∥_{u,κ} can be computed with any convex optimization algorithm. The hyperparameter u is then chosen from a predefined grid U so that the TV norm of the DBR is minimized. Specifically, one may choose the logarithmically even grid U := {2^k}_{k=0}^{k_max}, where the upper limit k_max := ⌊log₂ min{n, m}⌋ is determined according to the order of the empirical approximation error (19). The entire procedure of the weight estimation is summarized in Algorithm A.2. By its derivation, we can formally show the consistency of Algorithm A.2. The proof is relegated to Section G. Theorem 7.6. Let OptimizeDBR(D, F, W, κ) be the output of Algorithm A.2. Then, OptimizeDBR(D, F, W, κ) → w⋆ in probability as n → ∞. Finally, we present our OPI method in Algorithm 7.1, called minimax optimal interval estimation (MOI), which is derived straightforwardly from Algorithm A.2 and equation (13). We can also guarantee the validity and the approximate optimality of MOI in the asymptotic sense; the proof follows directly from Theorem 7.6. Theorem 7.7. Let MOI(D, F, W, κ) be the output of Algorithm 7.1. Then, MOI(D, F, W, κ) is asymptotically valid and 2((1 + γ)/(1 − γ))ε_W-approximately optimal in probability.
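For discrete measures, the MMD term ⟨κ, P⊗²⟩ in Definition 7.1 reduces to a quadratic form in the atom masses, which is what makes the convolution-norm objective computable on finite supports. The following is a hedged sketch with a Gaussian kernel; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
xs = rng.normal(size=(8, 2))        # atom locations in R^2 (toy)
c = rng.normal(size=8)              # signed masses: P = sum_i c_i * delta_{x_i}

def gaussian_kernel(a, b, alpha=1.0):
    """Normalized Gaussian kernel matrix k(a_i, b_j)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * alpha**2))

K = gaussian_kernel(xs, xs)
# <kappa, P ⊗ P> = sum_ij c_i c_j k(x_i, x_j): a quadratic form in the masses.
mmd2 = c @ K @ c

assert np.allclose(K, K.T)          # symmetric kernel matrix
assert mmd2 >= -1e-9                # positive-definite kernel => nonnegative
assert abs(K[0, 0] - 1.0) < 1e-12   # normalized kernel: k(x, x) = 1
```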

8. CONCLUSION

In this paper, we have studied OPI without the sufficient exploration and realizability conditions. In particular, we have pointed out the existence of an irreducible bias in this general setting and, correspondingly, introduced a novel formulation of interval-based OPE. We have then revealed the connection between the conventional MIS estimator and the irreducible bias, which is eventually utilized to construct the proposed method, the minimax optimal interval estimator (MOI), and to prove its optimality. One major limitation of the proposed method is its model agnosticity, lying at the opposite end of the spectrum from the model-based approach, e.g., Yu et al. (2020), which depends on the full correctness of the model. It is left for future work to extend and combine these methods so as to be applicable to partially correct models.

B PROOF OF THEOREM 4.1

We first show the first two inequalities of Theorem 4.1: ε⋆(π) ≤ ε[Ĵ(w)] ≤ ∥R_π(w)∥_TV. Proof. The first inequality is trivial. To show the second one, observe |J_M(π) − E[Ĵ(w)]| = |⟨ρ, Γ_π ι_π − w ⊙ β⟩| = |⟨q_π, ι_π − ∆_π(w ⊙ β)⟩| = |⟨q_π, R_π(w)⟩| ≤ ∥R_π(w)∥_TV, where the last inequality is owing to ∥q_π∥_∞ ≤ 1. Since the RHS is independent of M given G_{M,β}, we thus have ε[Ĵ(w)] = sup_{M′: G_{M′,β}=G_{M,β}} |J_{M′}(π) − E[Ĵ(w)]| ≤ ∥R_π(w)∥_TV. Now, to prove the last inequality, we prepare two extreme, yet indistinguishable, environments M± := (ι, T̃, R±). Let T̃ be an arbitrary state-transition operator indistinguishable from T, to be determined later. Also let T̃_π be the state-action transition operator associated with T̃ and π, such that dT̃_π(s, a|x) = dT̃(s|x) dπ(a|s) for x, (s, a) ∈ X. Let μ̃_π := (1 − γ)(I − γT̃_π)^{−1} ι_π be the common occupancy measure of M± induced by T̃, and let μ̃_π|̸≪β be the singular component of μ̃_π with respect to β. Let X_0 be a set separating μ̃_π|̸≪β from β and X_β := X \ X_0 be its complement. For convenience, let Π_β, Π_0 : M(X) → M(X) denote the projections of a measure onto X_β and X_0, respectively, given by Π_β P := χ_β ⊙ P and Π_0 P := (1 − χ_β) ⊙ P, where χ_β is the indicator function of X_β such that χ_β(x) = 1 if x ∈ X_β and χ_β(x) = 0 otherwise. Note that μ̃_π|̸≪β = Π_0 μ̃_π by construction. Finally, put R±(x) = δ_{±1} for x ∈ X_0 and R±(x) = R(x) otherwise, which is necessary for the indistinguishability of R±, and denote the associated expected rewards by ρ±(x) := ∫ r dR±(r|x), x ∈ X. Then, we have J_+(π) − J_−(π) = ⟨ρ_+ − ρ_−, μ̃_π⟩ = 2∥μ̃_π|̸≪β∥_TV, (25) where J_±(π) are the policy values with respect to M±. Now, the following lemma connects the RHS of (25) with the DBR. Lemma B.2. There exists T̃ indistinguishable from T such that ∥μ̃_π|̸≪β∥_TV = ∥R_π(w♯_π)∥_TV. (26) Proof. The proof is constructive. Consider an expanded state space S ← S ∪ {⊥}, where ⊥ denotes an absorbing state of T̃. Accordingly, let X_0 ← X_0 ∪ ({⊥} × A).
Now, put T̃ := T̃|≪β + T̃|̸≪β, where T̃|≪β is the restriction of T onto X_β and T̃|̸≪β is the absorbing transition, respectively given by T̃|≪β := T Π_β and T̃|̸≪β P := P(X_0) δ_⊥ for P ∈ M(X). Here, (1 − γ) ι_π = (I − γT̃_π) μ̃_π = (I − γT̃_π|≪β − γT̃_π|̸≪β) μ̃_π = (I − γT̃_π|≪β) μ̃_π − γ μ̃_π(X_0) δ_{⊥,π} = (I − γT̃_π|≪β) μ̃_π − γ ∥μ̃_π|̸≪β∥_TV δ_{⊥,π}, which implies, with P̃ := (I − γT̃_π|≪β)^{−1} ι_π, μ̃_π = (1 − γ) P̃ + γ ∥μ̃_π|̸≪β∥_TV (I − γT̃_π|≪β)^{−1} δ_{⊥,π} = (1 − γ) P̃ + γ ∥μ̃_π|̸≪β∥_TV δ_{⊥,π} (∵ T̃_π|≪β δ_{⊥,π} = 0). Measuring the masses on X_0, we further get ∥μ̃_π|̸≪β∥_TV = (1 − γ) P̃(X_0) + γ ∥μ̃_π|̸≪β∥_TV, which yields P̃(X_0) = ∥μ̃_π|̸≪β∥_TV. (27) On the other hand, since w♯_π = (1 − γ) d(Π_β P̃)/dβ, we have ∥R_π(w♯_π)∥_TV = ∥ι_π − (1 − γ) ∆_π Π_β P̃∥_TV = ∥ι_π − (Π_β − γT̃_π|≪β) P̃∥_TV (∵ T_π Π_β = T̃_π|≪β) = ∥Π_0 ι_π + γ Π_0 T̃_π|≪β P̃∥_TV (∵ T̃_π|≪β = Π_β T̃_π|≪β + Π_0 T̃_π|≪β and Π_β P̃ = Π_β ι_π + γ Π_β T̃_π|≪β P̃) = ∥Π_0 (ι_π + γT̃_π|≪β P̃)∥_TV = ∥Π_0 P̃∥_TV (∵ (I − γT̃_π|≪β) P̃ = ι_π) = P̃(X_0). Combining this with (27), we get the desired result. Plugging (26) into (25), we have ∥R_π(w♯_π)∥_TV = (1/2){J_+(π) − J_−(π)} (28) with this specific configuration of T̃. Since M± are indistinguishable from one another, any estimator must incur a bias of at least half the difference J_+(π) − J_−(π) in the worst case, i.e., ∥R_π(w♯_π)∥_TV ≤ ε⋆(π). This proves the last inequality of (8) via the triangle inequality, ∥R_π(w)∥_TV ≤ ∥R_π(w♯_π)∥_TV + ∥∆_π((w − w♯_π) ⊙ β)∥_TV ≤ ε⋆(π) + ((1 + γ)/(1 − γ)) ∥w − w♯_π∥_{L¹(β)}, and thus concludes the proof of Theorem 4.1. C PROOF OF PROPOSITION 7.1 First, we introduce a saddle-point formulation of the TV norm. Let F_1 := {f ∈ F : ∥f∥_∞ ≤ 1} be the intersection of F with the unit ball of B(X). Lemma C.1 gives a general variational formula for the TV norm. Lemma C.1. For all P ∈ M(X), we have ∥P∥_TV = sup_{f∈F_1} ⟨f, P⟩. (29) Proof. Let us denote the unit ball of B(X) by U_1 := {f ∈ B(X) : ∥f∥_∞ ≤ 1}.
As an instance of the integral probability metrics (IPM), the TV norm is known to be written as

∥P ∥ TV = sup g∈U 1 ⟨g, P ⟩. (32)

Now fix g ∈ U 1 such that |∥P ∥ TV - ⟨g, P ⟩| ≤ c for a constant c > 0. Since F 1 is dense in C (X ) ∩ U 1 , it is also dense in L 1 (P ) ∩ U 1 , and therefore there exists f ∈ F 1 such that ∥f - g∥ L 1 (P ) ≤ c. Then it follows that |∥P ∥ TV - ⟨f, P ⟩| ≤ |∥P ∥ TV - ⟨g, P ⟩| + ∥f - g∥ L 1 (P ) ≤ 2c. Since c > 0 can be arbitrarily small, we finally have ∥P ∥ TV ≤ sup f ∈F 1 ⟨f, P ⟩ ≤ sup g∈U 1 ⟨g, P ⟩ ≤ ∥P ∥ TV , which proves the desired result.

Since the supremum in (29) is taken over the constrained domain F 1 , its computation is not necessarily tractable in general. The following lemma is useful to make the domain unconstrained.

Lemma C.2. Let Ψ : B(X ) → R ≥0 be a penalty function such that Ψ(f ) = 0 if f ∈ F 1 . Then, for all P ∈ M (X ), we have

sup f ∈F 1 ⟨f, P ⟩ = sup f ∈F {⟨ f̄ , P ⟩ - Ψ(f )},

where f̄ (x) := max{-1, min{1, f (x)}} denotes the clipping of f (x) to [-1, 1].

Proof. Slightly extending (32), we have

∥P ∥ TV ≤ sup f ∈F 1 ⟨f, P ⟩ = sup f ∈F 1 {⟨f, P ⟩ - Ψ(f )} ≤ sup f ∈F {⟨ f̄ , P ⟩ - Ψ(f )} (∵ domain expansion) ≤ sup g∈U 1 ⟨g, P ⟩ (∵ f̄ ∈ U 1 , Ψ(f ) ≥ 0) ≤ ∥P ∥ TV .

This yields the desired claim.

Finally, Proposition 7.1 is proved by taking the penalty function to be the trivial one, Ψ(f ) = 0 for all f ∈ F. We utilize a nontrivial penalty function in Section D.
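To make the variational formula of Lemma C.1 concrete, here is a minimal numerical sketch on a finite space, where the supremum over ∥f∥∞ ≤ 1 is attained at f = sign P and recovers ∥P∥TV = ∥P+∥TV + ∥P−∥TV. The weight vector below is illustrative and not from the paper.

```python
import numpy as np

# A finite signed measure P on 4 atoms, represented by its weight vector.
P = np.array([0.5, -0.2, 0.3, -0.1])

# Variational formula (Lemma C.1): ||P||_TV = sup_{||f||_inf <= 1} <f, P>.
# On a finite space the supremum is attained at f = sign(P).
f_star = np.sign(P)
tv_variational = float(f_star @ P)

# Direct definition: ||P||_TV = ||P_+||_TV + ||P_-||_TV = sum_i |P_i|.
tv_direct = float(np.abs(P).sum())

assert abs(tv_variational - tv_direct) < 1e-12
```

Any other f with ∥f∥∞ ≤ 1 gives ⟨f, P⟩ ≤ Σ_i |P_i|, so the two quantities agree.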

D DETAILS AND PROOF OF THEOREM 7.3

We first present a preferred choice of the function approximator F and the penalty function Ψ, which is needed to construct the objective function L(f ) in (16). Then, we prove Theorem 7.3 to show the consistency of the resulting algorithm (Algorithm A.1).

D.1 CHOICE OF FUNCTION APPROXIMATOR AND PENALTY FUNCTION

As for the function approximator F, we choose a universal RKHS. Let κ : X 2 → R be the corresponding symmetric positive-definite kernel. We assume κ is c 0 -universal in the sense of Sriperumbudur et al. (2010). Also, without loss of generality, we assume κ is normalized, ∥κ∥ ∞ := sup x,x ′ ∈X |κ(x, x ′ )| ≤ 1. For instance, the Gaussian kernel κ(x, y) = exp{-∥x - y∥ 2 2 /(2α 2 )}, x, y ∈ R d , d ≥ 1, α > 0, is one such choice.

As for the penalty function Ψ, we employ

Ψ̂λ (f ) := ⟨(|f | - 1) + , R̂π,+ (w) + R̂π,- (w)⟩ + λ/(2(1 - γ)) ∥f ∥ 2 F ,

where λ > 0 is a hyperparameter, (g) + := max{0, g} denotes the positive part of a function g ∈ B(X ), R̂π,+ (w) := ι̂π + (γ/(1 - γ)) T̂π (w ⊙ β̂) and R̂π,- (w) := (1/(1 - γ)) (w ⊙ β̂) are the positive and negative parts of the empirical DBR, respectively, and ∥ • ∥ F is the RKHS norm. Then, letting C(P ) := 1 + ((1 + γ)/(1 - γ)) ⟨w, P ⟩, P ∈ M (X ), the penalized objective (16) simplifies to

L(f ) = ⟨|f - 1|, R̂π,+ (w)⟩ + ⟨|f + 1|, R̂π,- (w)⟩ + λ/(2(1 - γ)) ∥f ∥ 2 F - C( β̂),

which is convex with respect to f ∈ F. In other words, the minimizer f̂ ≡ f̂λ := argmin f ∈F L(f ) can be found in a tractable manner with convex optimization methods.

As for the choice of λ, as will be seen in the next section, we achieve consistency if λ → 0 and nλ → ∞. Thus, we may employ some fixed default λ = 1/√n, or select the hyperparameter within some fixed grid, e.g., Λ n := {1, 2, ..., 2 ⌊log 2 n⌋ }, that best attains the supremum (15), possibly using the training-validation split technique.
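As an illustration of the convexity of L(f), the following sketch minimizes a finite-sample analogue of the objective with a normalized Gaussian kernel and a representer-style parametrization f = Σ_i c_i κ(·, x_i), using plain subgradient descent. The atoms, weights, and step sizes are hypothetical stand-ins, not the paper's Algorithm A.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the atoms and weights of the empirical DBR parts
# R-hat_{pi,+}(w) and R-hat_{pi,-}(w) (illustrative data only).
X = rng.normal(size=(30, 2))
p_pos = rng.random(30); p_pos *= 0.6 / p_pos.sum()
p_neg = rng.random(30); p_neg *= 0.4 / p_neg.sum()
lam, gamma, alpha = 30 ** -0.5, 0.9, 1.0

# Normalized Gaussian kernel, so that ||kappa||_inf <= 1 as assumed above.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * alpha ** 2))

def L(c):
    f = K @ c  # f evaluated at the atoms, with f = sum_i c_i kappa(., x_i)
    return (p_pos @ np.abs(f - 1) + p_neg @ np.abs(f + 1)
            + lam / (2 * (1 - gamma)) * (c @ K @ c))

# Subgradient descent on the convex objective (up to the additive constant
# C(beta-hat), which does not affect the minimizer).
c, best = np.zeros(30), L(np.zeros(30))
for t in range(1, 501):
    f = K @ c
    g = K @ (p_pos * np.sign(f - 1) + p_neg * np.sign(f + 1)
             + lam / (1 - gamma) * c)
    c -= g / t ** 0.5
    best = min(best, L(c))

assert best <= L(np.zeros(30)) + 1e-9  # convexity: descent never does worse
```

In practice one would use a proper convex solver or a smoothed surrogate of the absolute values; the point here is only that the objective is a convex function of the representer coefficients.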

D.2 CONSISTENCY ANALYSIS

We first introduce some notation useful for the analysis. Let us define probability measures P, P̂n ∈ M (X 3 ) by

dP (x 1 , x 2 , x 3 ) := dι π (x 1 ) dβ(x 2 ) dT π (x 3 |x 2 ), d P̂n (x 1 , x 2 , x 3 ) := dι̂ π (x 1 ) d β̂(x 2 ) d T̂π (x 3 |x 2 ),

and loss functions ℓ f , φ f : X 3 → R, f ∈ F, by

ℓ f (x 1 , x 2 , x 3 ) := Σ 3 j=1 ℓ f,j (x 1 , x 2 , x 3 ), φ f (x 1 , x 2 , x 3 ) := Σ 3 j=1 φ f,j (x 1 , x 2 , x 3 ),

for x 1 , x 2 , x 3 ∈ X , where

ℓ f,1 (x 1 , x 2 , x 3 ) := (1 - γ)|f (x 1 ) - 1|, φ f,1 (x 1 , x 2 , x 3 ) := -(1 - γ) f̄ (x 1 ),
ℓ f,2 (x 1 , x 2 , x 3 ) := |w(x 2 )f (x 2 ) + |w(x 2 )||, φ f,2 (x 1 , x 2 , x 3 ) := w(x 2 ) f̄ (x 2 ),
ℓ f,3 (x 1 , x 2 , x 3 ) := γ|w(x 2 )f (x 3 ) - |w(x 2 )||, φ f,3 (x 1 , x 2 , x 3 ) := -γ w(x 2 ) f̄ (x 3 ).

Let us also define the associated risk functions

L λ (f ; Q) := ⟨ℓ f , Q⟩ + (λ/2) ∥f ∥ 2 F , Φ(f ; Q) := ⟨φ f , Q⟩,

for probability measures Q ∈ M (X 3 ). By these definitions, we have Φ(f ; P ) = -(1 - γ)⟨ f̄ , R π (w)⟩, Φ(f ; P̂n ) = -(1 - γ)⟨ f̄ , R̂π (w)⟩, and

L λ (f ; P ) = Φ(f ; P ) + (1 - γ) {Ψ λ (f ) + C(β)}, (37)
L λ (f ; P̂n ) = Φ(f ; P̂n ) + (1 - γ) {Ψ̂λ (f ) + C( β̂)},

for all λ > 0 and f ∈ F, where Ψ̂λ is given by (34) and

Ψ λ (f ) := ⟨(|f | - 1) + , R π,+ (w) + R π,- (w)⟩ + λ/(2(1 - γ)) ∥f ∥ 2 F ,

with R π,+ (w) := ι π + (γ/(1 - γ)) T π (w ⊙ β) and R π,- (w) := (1/(1 - γ)) (w ⊙ β) the positive and negative parts of the DBR, respectively. Therefore, we obtain an alternative expression of the objective (16),

L(f ) = (1/(1 - γ)) L λ (f ; P̂n ) - C( β̂), (40)

which implies f̂λ = argmin f ∈F L λ (f ; P̂n ). Moreover, by Lemma C.2, we also obtain alternative expressions of the quantity of interest,

∥R π (w)∥ TV = -(1/(1 - γ)) inf f ∈F Φ(f ; P ) (41)
= C(β) - (1/(1 - γ)) inf f ∈F L 0 (f ; P ). (42)

The goal of this section is to reveal the relationship between f̂λ and ∥R π (w)∥ TV via (40), (41) and (42). The following lemma gives a key insight into the behavior of f̂λ . Let G := 1 - γ + (1 + γ)∥w∥ ∞ be the Lipschitz constant of f → ℓ f and f → φ f .
Let B F (0, 1) := {f ∈ F : ∥f ∥ F ≤ 1} be the closed unit ball of F, and let R n (H) denote the Rademacher complexity of a function class H ⊂ B(X ) (see Definition H.4).

Lemma D.1. Suppose the minimizer attaining min f ∈F L λ (f ; P ) exists and denote it by f * λ ∈ F. Also suppose ∥f ∥ ∞ ≤ ∥f ∥ F for all f ∈ F. Then, for all δ ∈ (0, 1), we have

L λ ( f̂λ ; P ) ≤ L λ (f * λ ; P ) + (8G 2 /λ) {R n (B F (0, 1)) + √(ln(1/δ)/(2n))} 2

and

∥ f̂λ - f * λ ∥ F ≤ (4G/λ) {R n (B F (0, 1)) + √(ln(1/δ)/(2n))}

with probability 1 - δ.

Proof. Define ϵ(f ; Q) := L λ (f ; Q) - L λ (f * λ ; Q) for f ∈ F and Q ∈ M (X 3 ). Let F c := {f ∈ F : ϵ(f ; P ) ≤ c} and let l̄ g,j := ℓ f * λ +g,j - ℓ f * λ ,j for j = 1, 2, 3. Then, since ϵ(f ; P ) ≥ (λ/2)∥f - f * λ ∥ 2 F ≥ (λ/2)∥f - f * λ ∥ 2 ∞ by the strong convexity of f → L λ (f ; P ), the uniform law of large numbers (Theorem H.5) gives

sup f ∈F c {ϵ(f ; P ) - ϵ(f ; P̂n )} = sup f ∈F c Σ 3 j=1 ⟨ l̄ f -f * λ ,j , P - P̂n ⟩
≤ sup ∥g∥ F ≤ √(2c/λ) Σ 3 j=1 ⟨ l̄ g,j , P - P̂n ⟩ (∵ strong convexity)
≤ 2R n ({Σ 3 j=1 l̄ g,j : ∥g∥ F ≤ √(2c/λ)}) + 2G √(2c/λ) √(ln(1/δ)/(2n)) (∵ Theorem H.5)
≤ √(8c/λ) G {R n (B F (0, 1)) + √(ln(1/δ)/(2n))} =: ε(c, δ) (∵ see below)

with probability 1 - δ for all c > 0. Here, the last inequality follows from

R n ({Σ 3 j=1 l̄ g,j : ∥g∥ F ≤ √(2c/λ)}) ≤ Σ 3 j=1 R n ({ l̄ g,j : ∥g∥ F ≤ √(2c/λ)}) (∵ subadditivity)
≤ Σ 3 j=1 G j √(2c/λ) R n (B F (0, 1)) (∵ Lemma H.4)
≤ √(2c/λ) G R n (B F (0, 1)),

where G j are the Lipschitz constants of ℓ f,j , j = 1, 2, 3. Now take f̂ c ∈ F such that f̂ c = f̂λ if ϵ( f̂λ ; P ) ≤ c and, otherwise, ϵ( f̂ c ; P̂n ) ≤ 0 and ϵ( f̂ c ; P ) = c. Then, when c > ε(c, δ), we get with probability 1 - δ

ϵ( f̂ c ; P ) = ϵ( f̂ c ; P̂n ) + {ϵ( f̂ c ; P ) - ϵ( f̂ c ; P̂n )} ≤ sup f : ϵ(f ;P )≤c {ϵ(f ; P ) - ϵ(f ; P̂n )} (∵ ϵ( f̂ c ; P̂n ) ≤ 0, ϵ( f̂ c ; P ) ≤ c) < c,

which implies f̂ c = f̂λ and hence ϵ( f̂λ ; P ) < c with the same probability.
Thus, since this holds for any c > 0 such that c > ε(c, δ), we finally have ϵ( f̂λ ; P ) ≤ c * with probability 1 - δ, where c * is the solution to c * = ε(c * , δ), or more concretely,

c * = (8G 2 /λ) {R n (B F (0, 1)) + √(ln(1/δ)/(2n))} 2 .

This concludes the proof.

Now, verifying the assumptions of Lemma D.1 and evaluating the Rademacher complexity of the unit ball R n (B F (0, 1)), we get the following proposition.

Proposition D.2 (Generalization error bound of RERM with RKHS). For all δ ∈ (0, 1) and λ > 0, we have

L 0 ( f̂λ ; P ) ≤ inf f ∈F L λ (f ; P ) + (8G 2 /(λn)) ln(e 2 /δ)

and

∥ f̂λ - f * λ ∥ F ≤ (4G/λ) √(ln(e 2 /δ)/(2n))

with probability 1 - δ.

Proof. It suffices to invoke Lemma D.1 with Lemma H.6. To this end, we need to verify the existence of min f ∈F L λ (f ; P ) and the dominance of the norms, ∥ • ∥ ∞ ≤ ∥ • ∥ H . In fact, the minimum exists since f → L λ (f ; P ) is continuous with respect to L 2 (P ) and the infimum inf f ∈F L λ (f ; P ) does not change if we restrict the domain to the ball {f ∈ H : ∥f ∥ H ≤ √(2L λ (0; P )/λ)}, which is compact according to Lemma H.7. The dominance of the norms is shown by, for all f ∈ H,

∥f ∥ ∞ = sup x∈X |f (x)| = sup x∈X |⟨κ(•, x), f ⟩ H | ≤ sup x∈X ∥κ(•, x)∥ H ∥f ∥ H = sup x∈X √(κ(x, x)) ∥f ∥ H ≤ ∥f ∥ H .

We need one more lemma to connect Φ( f̂λ ; P̂n ) with Φ( f̂λ ; P ). Now, applying Theorem H.5 with δ ← δ/2, F ← ±G and D ← 2G, we get

sup ∥f -f * λ ∥ F ≤d ⟨φ f , P̂n - P ⟩ ≤ 2R n (G) + 2G √(ln(2/δ)/(2n))

with probability 1 - δ. Since ∥ f̂λ - f * λ ∥ F ≤ d with probability 1 - δ, we further get by the union bound

⟨φ f̂λ , P̂n - P ⟩ ≤ 2R n (G ∪ -G) + 2G √(ln(2/δ)/(2n))

with probability 1 - 2δ. Since Φ( f̂λ ; Q) := ⟨φ f̂λ , Q⟩ for all Q ∈ M (X 3 ), the proof is concluded by

R n (G) = R n ({φ f : f ∈ F, ∥f - f * λ ∥ F ≤ d}) ≤ Σ 3 j=1 R n ({φ f,j : f ∈ F, ∥f - f * λ ∥ F ≤ d}) ≤ G d R n (B F (0, 1)) ≤ G d/√n .

Finally, we are ready to prove Theorem 7.3 by observing the decomposition of -(1/(1 - γ)) Φ( f̂λ ; P̂n ) - ∥R π (w)∥ TV .

E PROOF OF PROPOSITION 7.4

We first show the following utility lemma.

Lemma E.1. Let ∥f ∥ A and ∥f ∥ B be arbitrary norms of f ∈ B(X ).
Also let ∥f ∥ A∨B := ∥f ∥ A ∨ ∥f ∥ B be the norm defined by the maximum of the two. Then, for all P ∈ M (X ), we have

∥P ∥ (A∨B) * ≤ inf Q≪P {∥P - Q∥ A * + ∥Q∥ B * },

where we denote the dual norm of ∥ • ∥ X on B(X ) by ∥P ∥ X * := sup f ∈B(X ), ∥f ∥ X ≤1 ⟨f, P ⟩, P ∈ M (X ). Moreover, if | supp(P )| is finite and ∥ • ∥ A and ∥ • ∥ B dominate ∥ • ∥ ∞ , then equality is attained.

Proof. Observe

∥P ∥ (A∨B) * = sup f ∈B(X ), ∥f ∥ A ≤1, ∥f ∥ B ≤1 ⟨f, P ⟩
= sup f,g∈B(X ), ∥f ∥ A ≤1, ∥g∥ B ≤1 inf Q≪P {⟨f, P ⟩ + ⟨g - f, Q⟩} (∵ Q as a Lagrange multiplier)
= sup f,g∈B(X ), ∥f ∥ A ≤1, ∥g∥ B ≤1 inf Q≪P {⟨f, P - Q⟩ + ⟨g, Q⟩}
≤ inf Q≪P {sup f ∈B(X ), ∥f ∥ A ≤1 ⟨f, P - Q⟩ + sup g∈B(X ), ∥g∥ B ≤1 ⟨g, Q⟩} = inf Q≪P {∥P - Q∥ A * + ∥Q∥ B * },

which proves the inequality. Now suppose d := | supp(P )| is finite and ∥ • ∥ A and ∥ • ∥ B dominate ∥ • ∥ ∞ . Label the elements of supp(P ) by {x j } d j=1 and let

B A := {(f (x j )) d j=1 : f ∈ B(X ), ∥f ∥ A ≤ 1} ⊂ R d , B B := {(g(x j )) d j=1 : g ∈ B(X ), ∥g∥ B ≤ 1} ⊂ R d .

Then, following the same equalities as above and exchanging the supremum and the infimum by Sion's minimax theorem, we get

∥P ∥ (A∨B) * = sup a∈B A , b∈B B inf Q≪P {Σ d j=1 a j {P (x j ) - Q(x j )} + Σ d j=1 b j Q(x j )}
= inf Q≪P {sup a∈B A Σ d j=1 a j {P (x j ) - Q(x j )} + sup b∈B B Σ d j=1 b j Q(x j )}
= inf Q≪P {∥P - Q∥ A * + ∥Q∥ B * }.

This concludes the proof.

For P ∈ M (X ), define F u (P ) := sup f ∈B(X ), ∥f ∥ ∞ ≤1, ∥f ∥ H ≤u ⟨f, P ⟩. The following lemma shows that F u (P ) is equal to the u-convolution norm. In other words, it gives the dual representation of the u-convolution norm in Proposition 7.4.

Lemma E.2. For all P ∈ M (X ), we have ∥P ∥ u,κ = F u (P ), where ∥P ∥ u,κ is given in Definition 7.1.

Proof. Without loss of generality, we assume ∥P ∥ TV = 1. Let P̂n be the empirical distribution of P given by Definition H.3, and let H be the RKHS associated with the kernel κ. Set ∥f ∥ A = ∥f ∥ H and ∥f ∥ B = ∥f ∥ ∞ in Lemma E.1 and observe that, for all P ∈ M (X ),

F u (P ) = ∥P ∥ (A∨B) * , ∥P ∥ u,κ = inf Q≪P {∥P - Q∥ A * + ∥Q∥ B * },

since ∥P ∥ H * = MMD κ (P ) and ∥P ∥ ∞ * = ∥P ∥ TV .
Then, we have 0 ≤ ∥P ∥ u,κ - F u (P ) ≤ E F u ( P̂n - P ), since

F u (P ) ≤ ∥P ∥ u,κ (∵ Lemma E.1)
≤ E ∥ P̂n ∥ u,κ (∵ Jensen's inequality with Lemma H.3)
= E F u ( P̂n ) (∵ Lemma E.1 with | supp( P̂n )| < ∞)
≤ F u (P ) + E F u ( P̂n - P ). (∵ triangle inequality of F u (•))

The proof is concluded by remembering that

0 ≤ E F u ( P̂n - P ) ≤ 2R n ({f ∈ B(X ) : ∥f ∥ H ≤ u, ∥f ∥ ∞ ≤ 1}) (∵ Theorem H.5)
≤ 2R n ({f ∈ B(X ) : ∥f ∥ H ≤ u}) ≤ 2u sup x∈X √(κ(x, x)/n) → 0 (∵ Lemma H.6)

as n → ∞, where R n (F) is the maximal Rademacher complexity (Definition H.4).

F PROOF OF COROLLARY 7.5

Let w̄ := w/∥w∥ ∞ be the normalization of w. By Lemma E.2, we have

∥ w̄ ⊙ ( β̂ - β)∥ u,κ = sup f ∈B(X ), ∥f ∥ H ≤u, ∥f ∥ ∞ ≤1 ⟨ w̄f, β̂ - β⟩
≤ sup f ∈B(X ), ∥ w̄f ∥ H w ≤u, ∥ w̄f ∥ ∞ ≤1 ⟨ w̄f, β̂ - β⟩ (∵ ∥ w̄f ∥ H w = ∥f ∥ H , ∥ w̄f ∥ ∞ ≤ ∥f ∥ ∞ )
≤ sup g∈B(X ), ∥g∥ H w ≤u, ∥g∥ ∞ ≤1 ⟨g, β̂ - β⟩ (∵ w̄ B(X ) ⊂ B(X ))
= ∥ β̂ - β∥ u,κ w ,

where H w is the RKHS associated with κ w (x, y) := w̄(x) w̄(y)κ(x, y). Similarly, with κ (2) w ((x, x ′ ), (y, y ′ )) := w̄(x) w̄(y)κ(x ′ , y ′ ), we have

∥ T̂π ( w̄ ⊙ β̂) - T π ( w̄ ⊙ β)∥ u,κ = sup f ∈B(X ), ∥f ∥ H ≤u, ∥f ∥ ∞ ≤1 ⟨f, T̂π ( w̄ ⊙ β̂) - T π ( w̄ ⊙ β)⟩ = sup f ∈B(X ), ∥f ∥ H ≤u, ∥f ∥ ∞ ≤1 ⟨ w̄ ⊗ f, β̂ (2) π - β (2) π ⟩ ≤ ∥ β̂ (2) π - β (2) π ∥ u,κ (2) w ,

where β (2) π , β̂ (2) π ∈ M (X 2 ) are given by dβ (2) π (x, x ′ ) := dβ(x) dT π (x ′ |x) and d β̂ (2) π (x, x ′ ) := d β̂(x) d T̂π (x ′ |x). Therefore, we have

|∥ R̂π (w)∥ u,κ - ∥R π (w)∥ u,κ | ≤ ∥ι̂ π - ι π ∥ u,κ + (γ/(1 - γ)) ∥ T̂π (w ⊙ β̂) - T π (w ⊙ β)∥ u,κ + (1/(1 - γ)) ∥w ⊙ ( β̂ - β)∥ u,κ
≤ ∥ι̂ π - ι π ∥ u,κ + (γ∥w∥ ∞ /(1 - γ)) ∥ β̂ (2) π - β (2) π ∥ u,κ (2) w + (∥w∥ ∞ /(1 - γ)) ∥ β̂ - β∥ u,κ w .

Since κ, κ w and κ (2) w are all bounded, (19) now implies

|∥ R̂π (w)∥ u,κ - ∥R π (w)∥ u,κ | = O( (∥w∥ ∞ /(1 - γ)) √((u 2 + ln(1/δ))/n) ).

Combining this with (18) and taking the limit u → ∞ with n/u 2 → ∞, we get the desired result.

G PROOF OF THEOREM 7.6

Note that Theorem 7.3, combined with the compactness of W and the continuity of w → ∥R π (w)∥ TV , implies that EvaluateDBR(D, F, w) converges to ∥R π (w)∥ TV uniformly over w ∈ W.
Thus, it suffices to show the following lemma.

Lemma G.1. We have min u∈U ∥R π ( ŵ u )∥ TV → min w∈W ∥R π (w)∥ TV .

Proof. Note also that Corollary 7.5, combined with the compactness of W and the continuity of w → ∥R π (w)∥ u,κ , implies that ∥ R̂π (w)∥ u,κ → ∥R π (w)∥ TV uniformly over w ∈ W, under suitable asymptotics of u and n as in Corollary 7.5. In other words, for all c > 0 and δ ∈ (0, 1), there exist u 0 ≥ 1 and p 0 > 0 such that, for all u ≥ u 0 and n ≥ p 0 u 2 ,

sup w∈W |∥ R̂π (w)∥ u,κ - ∥R π (w)∥ TV | ≤ c

with probability ≥ 1 - δ. Therefore, taking such a pair (u, n) with u ∈ U (which exists by the definition of U), we have

min w∈W ∥R π (w)∥ TV ≤ min u ′ ∈U ∥R π ( ŵ u ′ )∥ TV (∵ restriction of domain)
≤ c + ∥ R̂π ( ŵ u )∥ u,κ
≤ c + ∥ R̂π (w ⋆ )∥ u,κ (∵ definition of ŵ u )
≤ 2c + ∥R π (w ⋆ )∥ TV = 2c + min w∈W ∥R π (w)∥ TV

with probability ≥ 1 - δ. Since c > 0 can be arbitrarily small, we have proved the desired result.

H BASIC DEFINITIONS AND RESULTS

This section presents, for completeness, basic results used in the proofs of our results. The main purpose of this section is to show Proposition D.2.

H.1 SIGNED MEASURES

We first introduce the absolute value and the positive and negative parts of a signed measure. Recall that Σ is the Borel σ-algebra of X .

Definition H.1 (Absolute value and positive and negative parts of a signed measure). For all P ∈ M (X ), its absolute value is given by |P | ∈ M (X ) such that |P |(E) = sup E + +E - =E {P (E + ) - P (E - )}. Moreover, its positive and negative parts are given by P ± := (|P | ± P )/2 ∈ M (X ).

The following properties are then immediate; we omit the proof since it is trivial from the definitions.

Lemma H.1. The following statements are true. 1. P ± are nonnegative measures. 2. P = P + - P - and |P | = P + + P - . 3. P, P ± ≪ |P |. 4. ∥|P |∥ TV = ∥P + ∥ TV + ∥P - ∥ TV = ∥P ∥ TV .

Next, the sign function of a signed measure is defined via the absolute value.

Definition H.2 (Sign of a signed measure). For all P ∈ M (X ), its sign is given by sign P := dP/d|P | ∈ B(X ).

We note that the essential range of the sign function is contained in [-1, 1].

Lemma H.2. We have |(sign P )(x)| ≤ 1 for |P |-almost every x ∈ X .

Proof. It follows from

dP/d|P | = dP + /d|P | - dP - /d|P | ≤ dP + /d|P | + dP - /d|P | = d|P |/d|P | = 1. (|P |-almost everywhere)

Finally, we define the empirical distribution of a signed measure.

Definition H.3 (Empirical distribution of a signed measure). For all P ∈ M (X ) such that ∥P ∥ TV = 1, we define its n-th empirical distribution by P̂n := (1/n) Σ n i=1 (sign P )(x i ) δ x i , where {x i } n i=1 is an n-i.i.d. sample drawn from |P |, which is a probability distribution. Note that it coincides with the empirical distribution of probability measures if P is nonnegative. One of its most basic properties is unbiasedness.

Lemma H.3 (Unbiasedness). For all P ∈ M (X ) such that ∥P ∥ TV = 1, we have P (E) = E P̂n (E) for all E ∈ Σ.

The following lemma is useful to bound the Rademacher complexity of compositions of functions.

Lemma H.4 (Rademacher contraction lemma). For all Θ ⊂ R n and families of 1-Lipschitz continuous functions (φ i ) n i=1 , φ i : R → R, we have R(φ(Θ)) ≤ R(Θ), where φ(Θ) := {(φ 1 (θ 1 ), ..., φ n (θ n )) ∈ R n : θ ∈ Θ}.

Proof. Conditioning on the first n - 1 Rademacher variables, we have

R(φ(Θ)) = (1/2) E σ n-1 [ sup θ∈Θ {Σ n-1 i=1 σ i φ i (θ i ) + φ n (θ n )} + sup θ ′ ∈Θ {Σ n-1 i=1 σ i φ i (θ ′ i ) - φ n (θ ′ n )} ].
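The Jordan decomposition (Definition H.1) and the signed empirical distribution (Definition H.3) can be checked numerically on a discrete signed measure; the weight vector below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete signed measure P on 4 atoms with ||P||_TV = sum_i |P_i| = 1.
P = np.array([0.4, -0.1, 0.3, -0.2])

# Jordan decomposition (Definition H.1): |P|, P_+ = (|P|+P)/2, P_- = (|P|-P)/2.
absP = np.abs(P)
P_plus, P_minus = (absP + P) / 2, (absP - P) / 2
assert np.allclose(P_plus - P_minus, P)       # P = P_+ - P_-
assert np.allclose(P_plus + P_minus, absP)    # |P| = P_+ + P_-
assert abs(absP.sum() - 1.0) < 1e-12          # || |P| ||_TV = ||P||_TV = 1

# Empirical distribution of a signed measure (Definition H.3):
# draw x_i ~ |P| and weight each sample by sign(P)(x_i) / n.
n = 200_000
idx = rng.choice(len(P), size=n, p=absP)
signs = np.sign(P)[idx]
# Unbiasedness (Lemma H.3): E P-hat_n({j}) = P({j}) for each atom j.
est = np.array([signs[idx == j].sum() / n for j in range(len(P))])
assert np.max(np.abs(est - P)) < 0.01         # Monte Carlo agreement
```

Each atom's estimate is sign(P_j) times the empirical frequency of atom j under |P|, which converges to sign(P_j)·|P_j| = P_j, exactly as Lemma H.3 states.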
Since the expression inside the expectation is bounded by

sup θ∈Θ {Σ n-1 i=1 σ i φ i (θ i ) + φ n (θ n )} + sup θ ′ ∈Θ {Σ n-1 i=1 σ i φ i (θ ′ i ) - φ n (θ ′ n )}
= sup θ,θ ′ ∈Θ [Σ n-1 i=1 σ i {φ i (θ i ) + φ i (θ ′ i )} + {φ n (θ n ) - φ n (θ ′ n )}]
≤ sup θ,θ ′ ∈Θ [Σ n-1 i=1 σ i {φ i (θ i ) + φ i (θ ′ i )} + |θ n - θ ′ n |] (∵ 1-Lipschitz continuity)
= sup θ,θ ′ ∈Θ [Σ n-1 i=1 σ i {φ i (θ i ) + φ i (θ ′ i )} + θ n - θ ′ n ] (∵ symmetry of θ and θ ′ )
= sup θ∈Θ {Σ n-1 i=1 σ i φ i (θ i ) + θ n } + sup θ ′ ∈Θ {Σ n-1 i=1 σ i φ i (θ ′ i ) - θ ′ n },

we have

R(φ(Θ)) ≤ (1/2) E σ n-1 [ sup θ∈Θ {Σ n-1 i=1 σ i φ i (θ i ) + θ n } + sup θ ′ ∈Θ {Σ n-1 i=1 σ i φ i (θ ′ i ) - θ ′ n } ] = R( φ̃(Θ)),

where φ̃ = (φ 1 , ..., φ n-1 , I) and I is the identity map. Iterating the same argument to swap φ j with I for all j = 1, ..., n - 1, we get the desired result.

The following theorem gives a sufficient condition for the concentration of the empirical process f → ⟨f, P̂n - P ⟩, f ∈ F, in terms of the Rademacher complexity R n (F).

Theorem H.5 (Uniform law of large numbers). For all probability measures P ∈ M (X ) and all F ⊂ B(X ), we have

sup f ∈F ⟨f, P̂n - P ⟩ ≤ 2R n (F) + D √(ln(1/δ)/(2n))

with probability 1 - δ, where D := sup f ∈F , x,y∈X {f (x) - f (y)} and P̂n is the empirical distribution of P given by Definition H.3. Moreover, E sup f ∈F ⟨f, P̂n - P ⟩ ≤ 2R n (F).

Proof. Let {x i } n i=1 and {x ′ i } n i=1 be two n-i.i.d. samples drawn independently from P . The expectation bound follows from

E sup f ∈F ⟨f, P̂n - P ⟩ = E sup f ∈F E[ (1/n) Σ n i=1 {f (x i ) - f (x ′ i )} | {x i } n i=1 ] (∵ Lemma H.3)
≤ E sup f ∈F (1/n) Σ n i=1 {f (x i ) - f (x ′ i )}
= E sup f ∈F (1/n) Σ n i=1 σ i {f (x i ) - f (x ′ i )} (∵ symmetry of x i and x ′ i )
≤ 2 E sup f ∈F (1/n) Σ n i=1 σ i f (x i ) ≤ 2R n (F).

To show the high-probability bound, define A(S) := sup f ∈F {(1/n) Σ n i=1 f (x i ) - ⟨f, P ⟩} for S := {x i } n i=1 ∈ X n , and observe that A(S) follows the same law as sup f ∈F ⟨f, P̂n - P ⟩. Thus, it suffices to establish

A(S) - E A(S) ≤ D √(ln(1/δ)/(2n))

with probability 1 - δ, which follows from McDiarmid's inequality (Boucheron et al., 2003) applied to A(S).
Here, the bounded-difference assumption of McDiarmid's inequality we need to verify is that A(S ′ ) - A(S) ≤ D/n for all S ′ := {x ′ i } n i=1 ∈ X n that differ from S only at the j-th element, 1 ≤ j ≤ n. This is verified by

A(S ′ ) = sup f ∈F {(1/n) Σ n i=1 f (x ′ i ) - ⟨f, P ⟩} = sup f ∈F {(1/n) Σ n i=1 f (x i ) - ⟨f, P ⟩ + (1/n){f (x ′ j ) - f (x j )}} ≤ A(S) + (1/n) sup f ∈F {f (x ′ j ) - f (x j )} ≤ A(S) + D/n.

This concludes the proof.

To show the compactness of U (Λ) in ℓ 2 (continuing the proof of Lemma H.7), it suffices to show its completeness and total boundedness. The completeness is trivial, while the total boundedness follows from the fact that the elements of U (Λ) are approximated by their projections onto the first K coordinates, where the approximation error is bounded by λ K+1 , which tends to zero as K → ∞.
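The uniform law of large numbers above can be probed numerically. The sketch below Monte Carlo estimates the empirical Rademacher complexity of a small, hand-picked function class on X = [0, 1] and checks the symmetrization bound E sup_f ⟨f, P̂n − P⟩ ≤ 2R_n(F); the function class, sample sizes, and the use of a single random sample in place of the supremum over S are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny finite function class F on X = [0, 1].
funcs = [np.sin, np.cos, lambda x: 2 * x - 1, lambda x: x ** 2]

def emp_rademacher(S, n_mc=2000):
    """Monte Carlo estimate of R_S(F) = E_sigma sup_f (1/n) sum_i sigma_i f(x_i)."""
    n = len(S)
    vals = np.stack([f(S) for f in funcs])          # |F| x n matrix of f(x_i)
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return np.mean(np.max(vals @ sigma.T / n, axis=0))

# Check E sup_f <f, P_n-hat - P> <= 2 R_n(F) with P = Uniform[0, 1].
n, n_rep = 100, 500
means = np.array([np.mean(f(rng.random(10 ** 6))) for f in funcs])  # <f, P> by MC
sups = []
for _ in range(n_rep):
    S = rng.random(n)
    emp = np.array([np.mean(f(S)) for f in funcs])
    sups.append(np.max(emp - means))
lhs = np.mean(sups)                                  # ~ E sup_f <f, P_n-hat - P>
rhs = 2 * emp_rademacher(rng.random(n))              # ~ 2 R_S(F) at one sample S
assert lhs <= rhs + 1e-3                             # symmetrization bound holds
```

Note that the theorem's R_n(F) is the maximal complexity over all samples S, so the empirical quantity used here is only a stand-in; for this smooth function class it is representative.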



This includes the cases where S and A are finite or compact subsets of finite-dimensional Euclidean spaces.

For simplicity, we assume the sample sizes of D ι and D T,R are the same; the generalization to different sample sizes is possible with minor modifications.

That is, {X 0 , X β } is a partition of X such that μ̃π | ̸≪β (E) = 0 for all measurable E ⊂ X β and β(E) = 0 for all measurable E ⊂ X 0 .

Such an f̂ c exists at the intersection of the line segment [f * λ , f̂λ ] and the level set {f ∈ F : ϵ(f ; P ) ≤ c}, since f → ϵ(f ; Q) is convex.



Algorithm: Minimax Optimal Interval estimation (MOI)
Input: Dataset D, universal approximator F, hypothesis class W, kernel κ
Output: Estimate of the optimal interval I ⋆ (π)
  ŵ := OptimizeDBR(D, F, W, κ) ;  // Algorithm A.2
  ε̂ := EvaluateDBR(D, F, ŵ) ;  // Algorithm A.1
  return [ Ĵ( ŵ) - ε̂, Ĵ( ŵ) + ε̂]
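The control flow of MOI can be sketched as follows. The helper names and the toy estimators are hypothetical stand-ins for Algorithms A.1 and A.2, which are not reproduced here; only the interval construction mirrors the box above.

```python
import numpy as np

def optimize_dbr(dataset, candidates, evaluate):
    """Stand-in for OptimizeDBR (Alg. A.2): pick the candidate weight
    minimizing the estimated DBR norm."""
    return min(candidates, key=lambda w: evaluate(dataset, w))

def moi(dataset, candidates, evaluate, j_hat):
    """MOI: return [J-hat(w-hat) - eps, J-hat(w-hat) + eps]."""
    w_hat = optimize_dbr(dataset, candidates, evaluate)
    eps = evaluate(dataset, w_hat)      # stand-in for EvaluateDBR (Alg. A.1)
    center = j_hat(dataset, w_hat)      # MIS point estimate
    return center - eps, center + eps

# Toy usage with synthetic data and made-up estimators.
rng = np.random.default_rng(3)
data = rng.normal(size=100)
cands = [0.5, 1.0, 2.0]                            # hypothetical weight choices
ev = lambda d, w: abs(w - 1.0) + 0.05              # made-up DBR-norm estimate
jh = lambda d, w: w * float(np.mean(d))            # made-up MIS value estimate
lo, hi = moi(data, cands, ev, jh)
assert lo <= hi
assert abs((hi - lo) - 2 * ev(data, 1.0)) < 1e-12  # w-hat = 1.0 minimizes ev
```

The interval width is exactly twice the estimated DBR norm at the selected weight, matching the output line of the algorithm box.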

Here, T̃| ≪β corresponds to the known component of T , which necessarily coincides with that of T̃ by definition, and T̃| ̸≪β corresponds to the unknown component of T . Note that T̃ is a proper transition operator indistinguishable from T . Let T̃π | ≪β and T̃π | ̸≪β be the state-action transition operators associated with T̃| ≪β and T̃| ̸≪β , respectively, such that d T̃π | ≪β (s, a|x) := d T̃| ≪β (s|x) dπ(a|s) and d T̃π | ̸≪β (s, a|x) := d T̃| ̸≪β (s|x) dπ(a|s) for x, (s, a) ∈ X . Then, we have

For all δ ∈ (0, 1), we have, in addition to the statements of Proposition D.2, a corresponding high-probability bound on Φ( f̂λ ; P̂n ) - Φ( f̂λ ; P ).

Proof. It follows from the uniform law of large numbers (Theorem H.5) with the high-probability range of f̂λ given by (43). Let G := {φ f : f ∈ F, ∥f - f * λ ∥ F ≤ d}, where d := (4G/λ) √(ln(e 2 /δ)/(2n)).


It follows from that, for all E ∈ Σ,

E P̂n (E) = E[(1/n) Σ n i=1 (sign P )(x i ) 1{x i ∈ E}] = (sign P ⊙ |P |)(E) (∵ x i ∼ |P |) = P (E). (∵ sign P = dP/d|P |)

H.2 RADEMACHER COMPLEXITY AND UNIFORM LAW OF LARGE NUMBERS

The Rademacher complexity is a measure of the complexity of a function class. It is mainly utilized to establish the concentration of the empirical processes corresponding to function classes, i.e., the uniform law of large numbers. Throughout this section, let σ n := {σ i } n i=1 be a sequence of Rademacher random variables, each taking ±1 with probability 1/2 independently.

Definition H.4 (Rademacher complexity). For a subset of n-dimensional vectors Θ ⊂ R n , the Rademacher complexity of Θ is defined by R(Θ) := E σ n sup θ∈Θ Σ n i=1 σ i θ i . Moreover, for S ∈ X n and F ⊂ B(X ), the empirical Rademacher complexity of F with respect to S is defined by R S (F) := R(F(S)/n), where F(S) := {(f (x 1 ), ..., f (x n )) ∈ R n : S = {x i } n i=1 , f ∈ F} denotes the set of vectors obtained by applying f ∈ F to each element of S. Also, we define the n-th maximal Rademacher complexity of F by R n (F) := sup S∈X n R S (F).




Comparison of OPI methods. q π and w π denote the Q-function and the marginal density-ratio function, respectively; w ♯ π denotes a generalization of w π for the insufficient-exploration setting.


-(1/(1 - γ)) Φ( f̂λ ; P̂n ) - ∥R π (w)∥ TV is decomposed into terms involving s(d) := inf ∥f ∥ F ≤d L 0 (f ; P ) - inf f ∈F L 0 (f ; P ). Now, since lim d→∞ s(d) = 0, taking d = λ -1/3 , we have just shown the following proposition, which directly translates into Theorem 7.3.


H.3 REPRODUCING KERNEL HILBERT SPACE

Throughout this section, we assume H is the reproducing kernel Hilbert space generated by a continuous, symmetric, positive-definite kernel κ : X 2 → R. Also let ∥ • ∥ H and ⟨•, •⟩ H be the associated norm and inner product, and let B H (0, 1) := {f ∈ H : ∥f ∥ H ≤ 1} be the closed unit ball of H.

The following lemma shows that the Rademacher complexity of the RKHS ball can be explicitly bounded. Note that a c 0 -universal kernel is uniformly bounded and hence ∥κ∥ ∞ < ∞.

Lemma H.6 (Rademacher complexity of RKHS). We have R n (B H (0, 1)) ≤ sup x∈X √(κ(x, x)/n).

Proof. It follows from a direct computation of R S (B H (0, 1)) for all S ∈ X n .

The following lemma shows the compactness of closed RKHS balls in L 2 -metrics.

Lemma H.7 (Compactness of RKHS balls). B H (0, 1) is compact with respect to L 2 (P ) for all positive measures P ∈ M (X ).

Proof. Mercer's theorem gives an eigendecomposition of κ such that

κ(x, y) = Σ ∞ k=1 λ k ϕ k (x)ϕ k (y),

where the convergence is uniform on X 2 , {ϕ k : X → R} ∞ k=1 is a continuous orthonormal basis of L 2 (P ), and Λ := {λ k ∈ R ≥0 } ∞ k=1 is a nonnegative decreasing sequence with lim k→∞ λ k = 0. It then follows from standard functional analysis that B H (0, 1) under the L 2 (P )-metric is isometric to U (Λ) := {a ≡ (a k ) ∞ k=1 : Σ k≥1: a k ̸=0 a 2 k /λ k ≤ 1} under the ℓ 2 -metric, via the mapping a k (f ) := ∫ f (x)ϕ k (x) dP (x). Here, we evaluate 0/0 = ∞. Therefore, the compactness of B H (0, 1) in L 2 (P ) follows from the compactness of U (Λ) in ℓ 2 . To show the compactness of the latter, it suffices to show its completeness and total boundedness.

