NON-ASYMPTOTIC CONFIDENCE INTERVALS OF OFF-POLICY EVALUATION: PRIMAL AND DUAL BOUNDS

Abstract

Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. OPE is therefore a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimate, when applying OPE to make high-stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al. (2019) and a new martingale concentration inequality of KBL applicable to time-dependent data with unknown mixing conditions. Our algorithm makes minimal assumptions on the data and the function class of the Q-function, and works in the behavior-agnostic setting where the data is collected under a mix of arbitrary unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods.

1. INTRODUCTION

Off-policy evaluation (OPE) seeks to estimate the expected reward of a target policy in reinforcement learning (RL) from observational data collected under different policies (e.g., Murphy et al., 2001; Fonteneau et al., 2013; Jiang & Li, 2016; Liu et al., 2018a). OPE plays a central role in applying RL with only observational data and has found important applications in areas such as medicine and self-driving, where interactive "on-policy" data is expensive or even infeasible to collect. A critical challenge in OPE is uncertainty estimation, as reliable confidence bounds are essential for making high-stakes decisions. In this work, we aim to tackle this problem by providing non-asymptotic confidence intervals of the expected value of the target policy. Our method allows us to rigorously quantify the uncertainty of the prediction and hence avoid the dangerous case of being overconfident in making costly and/or irreversible decisions. However, off-policy evaluation per se has remained a key technical challenge in the literature (e.g., Precup, 2000; Thomas & Brunskill, 2016; Jiang & Li, 2016; Liu et al., 2018a), let alone rigorous confidence estimation for it. This is especially true when 1) the underlying RL problem has a long or infinite horizon, and 2) the data is collected under arbitrary and unknown algorithms (a.k.a. the behavior-agnostic setting). As a consequence, the collected data can exhibit an arbitrary dependency structure, which makes constructing rigorous non-asymptotic confidence bounds particularly challenging. Traditionally, the only approach to provide non-asymptotic confidence bounds in OPE is to combine importance sampling (IS) with concentration inequalities (e.g., Thomas et al., 2015a;b), which, however, tends to degenerate for long- or infinite-horizon problems (Liu et al., 2018a).
Furthermore, this approach can neither be applied in behavior-agnostic settings nor effectively handle the complicated time-dependency structure inside individual trajectories. Instead, it requires a large number of independently collected trajectories drawn under known policies. In this work, we provide a practical approach for Behavior-agnostic, Off-policy, Infinite-horizon, Non-asymptotic, Confidence intervals based on arbitrarily Dependent data (BONDIC). Our method is motivated by a recently proposed optimization-based (or variational) approach to estimating OPE confidence bounds (Feng et al., 2020), which leverages a tail bound of kernel Bellman statistics (Feng et al., 2019). Our approach achieves a new bound that is both an order of magnitude tighter and more computationally efficient than that of Feng et al. (2020). Our improvements rest on two pillars: 1) a new primal-dual perspective on non-asymptotic OPE confidence bounds, which is connected to a body of recent work on infinite-horizon value estimation (Liu et al., 2018a; Nachum et al., 2019a; Tang et al., 2020a; Mousavi et al., 2020); and 2) a new tight concentration inequality on kernel Bellman statistics that applies to behavior-agnostic off-policy data with arbitrary dependency between transition pairs. Empirically, we demonstrate that our method provides reliable and tight bounds on a variety of well-established benchmarks.

Related Work

Besides the aforementioned approach based on the combination of IS and concentration inequalities (e.g., Thomas et al., 2015a), bootstrapping methods have also been widely used in off-policy estimation (e.g., White & White, 2010; Hanna et al., 2017; Kostrikov & Nachum, 2020), but bootstrapping yields only asymptotic guarantees. Alternatively, Bayesian methods (e.g., Engel et al., 2005; Ghavamzadeh et al., 2016a) offer a different way to estimate uncertainty in RL, but fail to guarantee frequentist coverage. In addition, distributional RL (Bellemare et al., 2017) seeks to quantify the intrinsic uncertainty inside the Markov decision process, which is orthogonal to the estimation uncertainty that we consider. Our work builds upon recent advances in behavior-agnostic infinite-horizon OPE, including Liu et al. (2018a); Feng et al. (2019); Tang et al. (2020a); Mousavi et al. (2020), as well as the DICE family (e.g., Nachum et al., 2019a; Zhang et al., 2020a; Yang et al., 2020b). In particular, our method can be viewed as extending the minimax framework for infinite-horizon OPE in the infinite-data regime of Tang et al. (2020a); Uehara et al. (2020); Jiang & Huang (2020) to the non-asymptotic finite-sample regime.

Outline: For the rest of the paper, we start with the problem statement in Section 2 and an overview of the two dual approaches to infinite-horizon OPE that are tightly connected to our method in Section 3. We then present our main approach in Section 4 and perform empirical studies in Section 5. Proofs and additional discussion can be found in the appendix.

2. BACKGROUND, DATA ASSUMPTION, PROBLEM SETTING

Consider an agent acting in an unknown environment. At each time step $t$, the agent observes the current state $s_t$ in a state space $\mathcal{S}$, takes an action $a_t \sim \pi(\cdot \mid s_t)$ in an action space $\mathcal{A}$ according to a given policy $\pi$; then the agent receives a reward $r_t$ and the state transitions to $s_t' = s_{t+1}$, following an unknown transition/reward distribution $(r_t, s_t') \sim \mathcal{P}(\cdot \mid s_t, a_t)$. Assume the initial state $s_0$ is drawn from a known initial distribution $D_0$. Let $\gamma \in (0, 1)$ be a discount factor. In this setting, the expected reward of $\pi$ is defined as $J^\pi := \mathbb{E}_\pi\big[\sum_{t=0}^{T} \gamma^t r_t \mid s_0 \sim D_0\big]$, which is the expected total discounted reward when we execute $\pi$ starting from $D_0$ for $T$ steps. In this work, we consider the infinite-horizon case with $T \to +\infty$.

Our goal is to provide an interval estimate of $J^\pi$ in a general and challenging setting with significantly relaxed constraints on the data. In particular, we assume the data is behavior-agnostic and off-policy, which means that it can be collected from multiple experiments, each of which can execute a mix of arbitrary, unknown policies, or even follow a non-fixed policy. More concretely, suppose that the model $\mathcal{P}$ is unknown, and we have a set of transition pairs $\hat D_n = \{(s_i, a_i, r_i, s_i')\}_{i=1}^n$ collected from previous experiments in a sequential order, such that for each data point $i$, $(r_i, s_i')$ is drawn from the model $\mathcal{P}(\cdot \mid s_i, a_i)$, while $(s_i, a_i)$ is generated by an arbitrary black box given the previous data points. We formalize both the data assumption and the goal below.

Assumption 2.1 (Data Assumption). Assume the data $\hat D_n = \{(s_i, a_i, r_i, s_i')\}_{i=1}^n$ is drawn from an arbitrary joint distribution, such that for each $i = 1, \ldots, n$, conditional on $\hat D_{<i} := \{(s_j, a_j, r_j, s_j')\}_{j<i} \cup \{(s_i, a_i)\}$, the subsequent local reward and next state $(r_i, s_i')$ are drawn from $\mathcal{P}(\cdot \mid s_i, a_i)$.
Goal: Given a confidence level $\delta \in (0, 1)$, we want to construct an interval $[\hat J^{-}, \hat J^{+}] \subset \mathbb{R}$ based on the data $\hat D_n$, such that $\Pr(J^\pi \in [\hat J^{-}, \hat J^{+}]) \ge 1 - \delta$, where $\Pr(\cdot)$ is w.r.t. the randomness of the data.

The partial ordering on the data points is introduced to accommodate the case that $s_{i+1}$ equals $s_j'$ for some $j \le i$. The data assumption only requires that $(r_i, s_i')$ is generated from $\mathcal{P}(\cdot \mid s_i, a_i)$, and imposes no constraints on how $(s_i, a_i)$ is generated. This provides great flexibility in the data collection process. In particular, we do not require $\{(s_i, a_i)\}_{i=1}^n$ to be independent, as is always assumed in recent works (Liu et al., 2018a; Mousavi et al., 2020). A crucial fact is that our data assumption implies a martingale structure on the empirical Bellman residual operator of the Q-function. As we will show in Section 4.1, this enables us to derive a key concentration inequality underpinning our non-asymptotic confidence bounds.

Here, we summarize a few notations that will simplify the presentation in the rest of the work. First, we append each $(s_i, a_i, r_i, s_i')$ with an action $a_i' \sim \pi(\cdot \mid s_i')$ following $s_i'$. This can be done for free as long as $\pi$ is given (see the remark in Section 3). Also, we write $x_i = (s_i, a_i)$, $x_i' = (s_i', a_i')$, and $y_i = (x_i', r_i) = (s_i', a_i', r_i)$. Correspondingly, define $\mathcal{X} = \mathcal{S} \times \mathcal{A}$ to be the state-action space and $\mathcal{Y} = \mathcal{X} \times \mathbb{R}$. Denote $\mathcal{P}^\pi(y \mid x) = \mathcal{P}(s', r \mid x)\pi(a' \mid s')$. In this way, the observed data can be written as pairs $\{x_i, y_i\}_{i=1}^n$, and Assumption 2.1 is equivalent to saying that $y_i \sim \mathcal{P}^\pi(\cdot \mid x_i)$ given $\hat D_{<i}$, which is similar to a supervised learning setting. We identify the data $\hat D_n$ with its empirical measure $\hat D_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i, y_i}$, where $\delta$ is the Dirac delta measure.

3. TWO DUAL APPROACHES TO INFINITE-HORIZON OFF-POLICY ESTIMATION

The deficiency of traditional IS methods on long- or infinite-horizon RL problems (a.k.a. the curse of horizon (Liu et al., 2018a)) has motivated a line of work on efficient infinite-horizon value estimation (e.g., Liu et al., 2018a; Feng et al., 2019; Nachum et al., 2019a; Zhang et al., 2020a; Mousavi et al., 2020; Tang et al., 2020a). The main idea is to transform the value estimation problem into estimating either the Q-function or the visitation distribution (or its related density ratio) of the policy $\pi$. This section introduces and reinterprets these two tightly connected methods, laying a foundation for our main confidence bounds from a primal and dual perspective.

Given a policy $\pi$, its Q-function is defined as $q^\pi(x) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t r_t \mid x_0 = x]$, where the expectation is taken when we execute $\pi$ initialized from a fixed state-action pair $(s_0, a_0) = x_0 = x$. Let $D_{\pi,t}$ be the distribution of $(x_t, y_t) = (s_t, a_t, s_t', a_t', r_t)$ when executing policy $\pi$ starting from $s_0 \sim D_0$ for $t$ steps. The visitation distribution of $\pi$ is defined as $D_\pi = \sum_{t=0}^\infty \gamma^t D_{\pi,t}$. Note that $D_\pi$ integrates to $1/(1-\gamma)$, though we treat it as a probability measure in the notation. The expected reward $J^\pi$ can be expressed using either $q^\pi$ or $D_\pi$ as follows:

$$J^\pi := \mathbb{E}_\pi\Big[\sum_{t=0}^\infty \gamma^t r_t\Big] = \mathbb{E}_{r \sim D_\pi}[r] = \mathbb{E}_{x \sim D_{\pi,0}}[q^\pi(x)], \qquad (1)$$

where $r \sim D_\pi$ (resp. $x \sim D_{\pi,0}$) denotes sampling from the $r$- (resp. $x$-) marginal distribution of $D_\pi$ (resp. $D_{\pi,0}$). Eq. (1) plays a key role in infinite-horizon value estimation by transforming the estimation of $J^\pi$ into estimating either $q^\pi$ or $D_\pi$.

Value Estimation via Q-Function: Because $D_{\pi,0}(x) = D_0(s)\pi(a \mid s)$ is known, we can estimate $J^\pi$ by $\mathbb{E}_{x \sim D_{\pi,0}}[q(x)]$ with any estimate $q$ of the true Q-function $q^\pi$; the expectation under $x \sim D_{\pi,0}$ can be estimated to any accuracy with Monte Carlo.
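To make the identity in Eq. (1) concrete, it can be checked numerically in a small tabular MDP, where both $q^\pi$ and the (unnormalized) visitation distribution are available in closed form. The sketch below is purely illustrative (the random two-state MDP, the uniform policy, and all sizes are our own assumptions, not the paper's setup); it computes $J^\pi$ once through the Q-function and once through the visitation distribution:

```python
import numpy as np

gamma = 0.9
nS, nA = 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition model
R = rng.uniform(size=(nS, nA))                  # expected reward r(s, a)
pi = np.full((nS, nA), 1.0 / nA)                # uniform target policy
d0 = np.array([1.0, 0.0])                       # initial state distribution D_0

# Transition matrix on state-action pairs x = (s, a) under pi:
# T[(s,a), (s',a')] = P(s' | s, a) * pi(a' | s')
T = np.einsum("saq,qb->saqb", P, pi).reshape(nS * nA, nS * nA)
r = R.reshape(nS * nA)
d_pi0 = (d0[:, None] * pi).reshape(nS * nA)     # D_{pi,0}(s, a) = D_0(s) pi(a|s)

# q^pi solves the Bellman equation q = r + gamma * T q
q = np.linalg.solve(np.eye(nS * nA) - gamma * T, r)

# unnormalized visitation D_pi = sum_t gamma^t D_{pi,t}
d_pi = np.linalg.solve(np.eye(nS * nA) - gamma * T.T, d_pi0)

J_via_q = d_pi0 @ q   # E_{x ~ D_{pi,0}}[q^pi(x)]
J_via_d = d_pi @ r    # E_{r ~ D_pi}[r]
print(np.isclose(J_via_q, J_via_d))  # True: both equal J^pi
```

Both linear solves expand the same Neumann series $\sum_t \gamma^t T^t$, applied from the left or the right, which is exactly the duality between the two estimation routes discussed next.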
To estimate $q^\pi$, we consider the empirical and expected Bellman residual operators:

$$\hat{R}q(x, y) = q(x) - \gamma q(x') - r, \qquad R^\pi q(x) = \mathbb{E}_{y \sim \mathcal{P}^\pi(\cdot \mid x)}\big[\hat{R}q(x, y)\big]. \qquad (2)$$

It is well known that $q^\pi$ is the unique solution of the Bellman equation $R^\pi q = 0$. Since $y_i \sim \mathcal{P}^\pi(\cdot \mid x_i)$ for each data point in $\hat D_n$, if $q = q^\pi$, then $\hat{R}q(x_i, y_i)$, $i = 1, \ldots, n$, are all zero-mean random variables. Let $\omega$ be any function from $\mathcal{X}$ to $\mathbb{R}$; then $\sum_i \hat{R}q(x_i, y_i)\omega(x_i)$ also has zero mean. This motivates the following functional Bellman loss (Feng et al., 2019; 2020; Xie & Jiang, 2020):

$$L_{\mathcal{W}}(q; \hat D_n) := \sup_{\omega \in \mathcal{W}} \left\{ \frac{1}{n} \sum_{i=1}^n \hat{R}q(x_i, y_i)\,\omega(x_i) \right\}, \qquad (3)$$

where $\mathcal{W}$ is a set of functions $\omega \colon \mathcal{X} \to \mathbb{R}$. To ensure that the sup is finite, $\mathcal{W}$ is typically set to be a unit ball of some normed function space $\mathcal{W}_o$, such that $\mathcal{W} = \{\omega \in \mathcal{W}_o \colon \|\omega\|_{\mathcal{W}_o} \le 1\}$. Feng et al. (2019) consider the simple case when $\mathcal{W}$ is taken to be the unit ball $\mathcal{K}$ of the reproducing kernel Hilbert space (RKHS) with a positive definite kernel $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, in which case the loss has the simple closed form

$$L_{\mathcal{K}}(q; \hat D_n) = \sqrt{\frac{1}{n^2} \sum_{i,j=1}^n \hat{R}q(x_i, y_i)\, k(x_i, x_j)\, \hat{R}q(x_j, y_j)}. \qquad (4)$$

Note that the RHS of Eq. (4) is the square root of the kernel Bellman V-statistic of Feng et al. (2019). Feng et al. (2019) showed that, when the support of the data distribution $\hat D_n$ covers the whole space (which may require an infinite data size) and $k$ is an integrally strictly positive definite kernel, $L_{\mathcal{K}}(q; \hat D_n) = 0$ iff $q = q^\pi$. Therefore, one can estimate $q^\pi$ by minimizing $L_{\mathcal{K}}(q; \hat D_n)$.

Remark: The empirical Bellman residual operator $\hat{R}$ can be extended to $\hat{R}q(x, y) = q(x) - r - \gamma \frac{1}{m} \sum_{\ell=1}^m q(s', a_\ell')$, where $\{a_\ell'\}_{\ell=1}^m$ are i.i.d. drawn from $\pi(\cdot \mid s')$. As $m$ increases, this gives a lower-variance estimate of $R^\pi q$. If $m = +\infty$, we have $\hat{R}q(x, y) = q(x) - r - \gamma \mathbb{E}_{a' \sim \pi(\cdot \mid s')}[q(s', a')]$, which coincides with the operator used in expected SARSA (Sutton & Barto, 1998). In fact, without any modification, all results in this work apply to $\hat{R}q$ for any $m$.
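The closed form in Eq. (4) is a plain quadratic form in the residual vector once the Gram matrix is built. Here is a minimal sketch of that computation (the Gaussian RBF kernel, the vector encoding of state-action pairs, and the bandwidth are illustrative assumptions on our part):

```python
import numpy as np

def kernel_bellman_loss(q, transitions, gamma=0.99, bandwidth=1.0):
    """Closed-form kernel Bellman loss L_K(q; D_n) of Eq. (4).

    transitions: list of (x, r, x_next) where x encodes (s, a) as a vector
                 and x_next encodes (s', a') with a' ~ pi(.|s') appended.
    q: callable mapping a state-action feature vector to a scalar.
    """
    n = len(transitions)
    X = np.stack([x for x, _, _ in transitions])
    # empirical Bellman residuals: Rq(x_i, y_i) = q(x_i) - gamma q(x_i') - r_i
    eps = np.array([q(x) - gamma * q(xn) - r for x, r, xn in transitions])
    # Gaussian RBF Gram matrix k(x_i, x_j)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))
    # sqrt of (1/n^2) eps^T K eps; clamp guards tiny negative round-off
    return np.sqrt(max(eps @ K @ eps, 0.0)) / n
```

On a toy deterministic chain with a single self-looping state and unit reward, plugging in the true Q-value $1/(1-\gamma)$ drives the loss to zero, while any other constant gives a strictly positive loss, matching the identifiability property quoted above.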
Value Estimation via Visitation Distribution: Another way to estimate $J^\pi$ in Eq. (1) is to approximate $D_\pi$ with a weighted empirical measure of the data (Liu et al., 2018a; Nachum et al., 2019a; Mousavi et al., 2020; Zhang et al., 2020a). The key idea is to assign an importance weight $\omega(x_i)$ to each data point $x_i$ in $\hat D_n$. We can choose the function $\omega \colon \mathcal{X} \to \mathbb{R}$ properly such that $D_\pi$, and hence $J^\pi$, can be approximated by the $\omega$-weighted empirical measure of $\hat D_n$:

$$J^\pi \approx \hat{J}_\omega := \mathbb{E}_{\hat D_n^\omega}[r] = \frac{1}{n} \sum_{i=1}^n \omega(x_i)\, r_i, \qquad D_\pi \approx \hat D_n^\omega := \frac{1}{n} \sum_{i=1}^n \omega(x_i)\,\delta_{x_i, y_i}. \qquad (5)$$

Intuitively, $\omega$ can be viewed as the density ratio between $D_\pi$ and $\hat D_n$, although the empirical measure $\hat D_n$ may not have a well-defined density. Liu et al. (2018a) and Mousavi et al. (2020) proposed to estimate $\omega$ by minimizing a discrepancy measure between $\hat D_n^\omega$ and $D_\pi$. To see this, note that $D = D_\pi$ if and only if $\Delta(D, q) = 0$ for any function $q$, where

$$\Delta(D, q) = \mathbb{E}_{D}[\gamma q(x') - q(x)] - \mathbb{E}_{D_\pi}[\gamma q(x') - q(x)] = \mathbb{E}_{D}[\gamma q(x') - q(x)] + \mathbb{E}_{D_{\pi,0}}[q(x)], \qquad (6)$$

using the fact that $\mathbb{E}_{D_\pi}[\gamma q(x') - q(x)] = -\mathbb{E}_{D_{\pi,0}}[q(x)]$ (Theorem 1, Liu et al., 2018a). Also note that the RHS of Eq. (6) can be calculated in practice for any $D$ and $q$ without knowing $D_\pi$. Let $\mathcal{Q}$ be a set of functions $q \colon \mathcal{X} \to \mathbb{R}$. One can define the following loss for $\omega$:

$$I_{\mathcal{Q}}(\omega; \hat D_n) = \sup_{q \in \mathcal{Q}} \left\{ \Delta(\hat D_n^\omega, q) \right\}. \qquad (7)$$

Similar to $L_{\mathcal{W}}(q; \hat D_n)$, when $\mathcal{Q}$ is a ball in an RKHS, $I_{\mathcal{Q}}(\omega; \hat D_n)$ also has a bilinear closed form analogous to Eq. (4); see Mousavi et al. (2020) and Appendix F. As we show in Section 4, $I_{\mathcal{Q}}(\omega; \hat D_n)$ and $L_{\mathcal{W}}(q; \hat D_n)$ are connected to the primal and dual views of our confidence bounds, respectively.
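Both the weighted estimator of Eq. (5) and the discrepancy of Eq. (6) are direct empirical averages; the sketch below spells them out (the feature encoding of $x$ and the Monte Carlo handling of $D_{\pi,0}$, which is known and hence free to sample, are our illustrative assumptions):

```python
import numpy as np

def weighted_value_estimate(omega, transitions):
    """J_omega = (1/n) sum_i omega(x_i) r_i from Eq. (5)."""
    return float(np.mean([omega(x) * r for x, r, _ in transitions]))

def discrepancy(omega, q, transitions, d_pi0_samples, gamma=0.99):
    """Delta(D_n^omega, q) from Eq. (6):
    E_{D_n^omega}[gamma q(x') - q(x)] + E_{D_{pi,0}}[q(x)]."""
    term1 = np.mean([omega(x) * (gamma * q(xn) - q(x)) for x, _, xn in transitions])
    term2 = np.mean([q(x) for x in d_pi0_samples])
    return float(term1 + term2)
```

On the single self-looping state with unit reward and $\gamma = 0.99$, the correct (unnormalized) density ratio is the constant $1/(1-\gamma) = 100$, since $D_\pi$ carries total mass $1/(1-\gamma)$ on that one point; with this $\omega$, the discrepancy vanishes for any $q$ and $\hat J_\omega$ recovers $J^\pi = 100$.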

4. MAIN APPROACH

Let $\mathcal{Q}$ be a large enough function set that includes the true Q-function, that is, $q^\pi \in \mathcal{Q}$. Following Feng et al. (2020), a confidence interval $[\hat J^{-}_{\mathcal{Q},\mathcal{W}}, \hat J^{+}_{\mathcal{Q},\mathcal{W}}]$ of $J^\pi$ can be constructed as

$$\hat J^{+}_{\mathcal{Q},\mathcal{W}} = \sup_{q \in \mathcal{Q}} \Big\{ \mathbb{E}_{D_{\pi,0}}[q] ~~\text{s.t.}~~ L_{\mathcal{W}}(q; \hat D_n) \le \varepsilon_n \Big\}, \qquad (8)$$

and $\hat J^{-}_{\mathcal{Q},\mathcal{W}}$ is defined in a similar way by replacing the sup over $q \in \mathcal{Q}$ with an inf. The idea is to seek the extreme $q$ function with the largest (resp. smallest) expected value in the set $\mathcal{F} := \mathcal{Q} \cap \{q \colon L_{\mathcal{W}}(q; \hat D_n) \le \varepsilon_n\}$. Therefore, Eq. (8) yields a $1 - \delta$ confidence interval if $q^\pi$ is included in $\mathcal{F}$ with probability at least $1 - \delta$, which is ensured when $q^\pi \in \mathcal{Q}$ and

$$\Pr\big(L_{\mathcal{W}}(q^\pi; \hat D_n) \le \varepsilon_n\big) \ge 1 - \delta. \qquad (9)$$

Feng et al. (2020) showed that in the RKHS case when $\mathcal{W} = \mathcal{K}$, Eq. (9) can be achieved with

$$\varepsilon_n = \sqrt{2 c_{q^\pi\!,k} \left( \frac{n-1}{n}\sqrt{\frac{\log(1/\delta)}{n}} + \frac{1}{n} \right)} \qquad (10)$$

when $n$ is an even number, where $c_{q^\pi\!,k} = \sup_{x,y} \hat{R}q^\pi(x, y)^2 k(x, x)$. This was proved using Hoeffding's inequality for U-statistics (Hoeffding, 1963). To solve Eq. (8) efficiently, Feng et al. (2020) took $\mathcal{Q}$ to be a ball in an RKHS with a random feature approximation. Unfortunately, the method described by Eq. (8)-(10) has two major disadvantages:

1) Bound Needs to Be Tightened (Section 4.1). The bound $\varepsilon_n = O(n^{-1/4})$ in Eq. (10) is sub-optimal in rate. In Section 4.1, we improve it to an $\varepsilon_n = O(n^{-1/2})$ bound under the mild Assumption 2.1, which removes the independence requirement between the transition pairs. Our tightened bound is achieved by first noting a martingale structure on the empirical Bellman operator under Assumption 2.1, and then applying an inequality of Pinelis (1992).

2) Dependence on Global Optimization (Section 4.2). The bound in Eq. (8) is guaranteed to be a $1 - \delta$ confidence bound only when the maximization in Eq. (8) is solved to global optimality. With a large $n$, this leads to a high computational cost, even when $\mathcal{Q}$ is chosen to be an RKHS ball. Feng et al. (2020) solved Eq. (8) approximately using a random feature technique, but this method suffers from a gap between theory and practice. In Section 4.2, we address this problem by presenting a dual form of Eq. (8), which sidesteps solving the challenging global optimization in Eq. (8). Moreover, the dual form enables us to better analyze the tightness of the confidence interval and the choices of $\mathcal{Q}$ and $\mathcal{W}$.
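To illustrate the primal construction in Eq. (8), here is a deliberately simplified sketch in which $\mathcal{Q}$ is a *finite* candidate set rather than an RKHS ball (a toy stand-in of our own; the paper's actual $\mathcal{Q}$ is infinite and requires the machinery above):

```python
def primal_bounds(j0_vals, losses, eps_n):
    """Primal interval of Eq. (8) for a finite candidate set Q.

    j0_vals[i]: E_{D_{pi,0}}[q_i], estimated by Monte Carlo (D_{pi,0} is known)
    losses[i]:  L_W(q_i; D_n), e.g. the kernel Bellman loss of Eq. (4)

    Keeps the feasible set F = {q_i : L_W(q_i; D_n) <= eps_n} and returns
    (inf, sup) of the expected initial values over F.
    """
    feasible = [j for j, l in zip(j0_vals, losses) if l <= eps_n]
    if not feasible:
        raise ValueError("empty feasible set: Q misspecified or eps_n too small")
    return min(feasible), max(feasible)
```

For example, `primal_bounds([1.0, 2.0, 3.0], [0.2, 0.5, 2.0], 1.0)` discards the third candidate for violating the Bellman-loss constraint and returns `(1.0, 2.0)`; with a continuous $\mathcal{Q}$, the same screening becomes the global optimization whose cost motivates the dual form below.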

4.1. A TIGHTER CONCENTRATION INEQUALITY

In this section, we improve the bound in Eq. (10) by giving a tighter concentration inequality for the kernel Bellman loss in Eq. (4). We introduce the following semi-expected kernel Bellman loss:

$$L^{*}_{\mathcal{K}}(q; \hat D_n) = \sqrt{\frac{1}{n^2} \sum_{i,j=1}^n R^\pi q(x_i)\, k(x_i, x_j)\, R^\pi q(x_j)}, \qquad (11)$$

in which we replace the empirical Bellman residual operator $\hat{R}q$ in Eq. (4) with its expected counterpart $R^\pi q$, but still take the empirical average over the $\{x_i\}_{i=1}^n$ in $\hat D_n$. For a more general function set $\mathcal{W}$, we can similarly define $L^{*}_{\mathcal{W}}(q; \hat D_n)$ by replacing $\hat{R}q$ with $R^\pi q$ in Eq. (3). Obviously, $L^{*}_{\mathcal{W}}(q; \hat D_n) = 0$ when $q = q^\pi$.

Theorem 4.1 below shows that $L_{\mathcal{K}}(q; \hat D_n)$ concentrates around $L^{*}_{\mathcal{K}}(q; \hat D_n)$ with an $O(n^{-1/2})$ error under Assumption 2.1. At first glance, it may seem surprising that the concentration bound holds even without any independence assumption between the $\{x_i\}$. An easy way to make sense of this is to recognize that the randomness in $y_i$ conditional on $x_i$ is averaged out, even when the $\{x_i\}$ are deterministic. Since Assumption 2.1 does not impose any (weak) independence between the $\{x_i\}$, we cannot establish that $L_{\mathcal{K}}(q; \hat D_n)$ concentrates around its mean $\mathbb{E}_{\hat D_n}[L_{\mathcal{K}}(q; \hat D_n)]$ (a fully expected kernel Bellman loss) without introducing further assumptions.

Theorem 4.1. Assume $\mathcal{K}$ is the unit ball of the RKHS with a positive definite kernel $k(\cdot, \cdot)$. Let $c_{q,k} := \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \{(\hat{R}q(x, y) - R^\pi q(x))^2 k(x, x)\} < \infty$. Under Assumption 2.1, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have

$$\big| L_{\mathcal{K}}(q; \hat D_n) - L^{*}_{\mathcal{K}}(q; \hat D_n) \big| \le \sqrt{\frac{2 c_{q,k} \log(2/\delta)}{n}}. \qquad (12)$$

In particular, when $q = q^\pi$, we have $c_{q^\pi\!,k} = \sup_{x,y} \{\hat{R}q^\pi(x, y)^2 k(x, x)\}$, and

$$L_{\mathcal{K}}(q^\pi; \hat D_n) \le \sqrt{\frac{2 c_{q^\pi\!,k} \log(2/\delta)}{n}}. \qquad (13)$$

Intuitively, to see why we can expect an $O(n^{-1/2})$ bound, note that $L_{\mathcal{K}}(q; \hat D_n)$ is the square root of the product of two $\hat{R}q$ terms, each of which contributes an $O(n^{-1/2})$ error w.r.t. $R^\pi q$.
Technically, the proof is based on a key observation: Assumption 2.1 ensures that $Z_i := \hat{R}q(x_i, y_i) - R^\pi q(x_i)$, $i = 1, \ldots, n$, forms a martingale difference sequence w.r.t. $\{\hat D_{<i} \colon i = 1, \ldots, n\}$, in the sense that $\mathbb{E}[Z_i \mid \hat D_{<i}] = 0$ for all $i$. See Appendix B for details. The proof also leverages a special property of the RKHS and applies a Hoeffding-like inequality on Hilbert spaces due to Pinelis (1992) (see Appendix B). For other, more general function sets $\mathcal{W}$, we establish in Appendix E a similar bound using Rademacher complexity, although it yields a less tight result than Eq. (12) when $\mathcal{W} = \mathcal{K}$.
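Theorem 4.1's threshold for $q = q^\pi$ is trivial to evaluate; the helper below makes the $O(n^{-1/2})$ rate explicit (the constant $c_{q^\pi\!,k}$ must be supplied from known bounds on the residuals and the kernel, as the paper does in its experiments):

```python
import numpy as np

def epsilon_n(c_qpi_k, n, delta):
    """Threshold from Theorem 4.1: with probability >= 1 - delta,
    L_K(q^pi; D_n) <= sqrt(2 * c_{q^pi,k} * log(2/delta) / n)."""
    return np.sqrt(2.0 * c_qpi_k * np.log(2.0 / delta) / n)
```

Quadrupling the data size halves the threshold, e.g. `epsilon_n(1.0, 4000, 0.1) == epsilon_n(1.0, 1000, 0.1) / 2`, in contrast to the $O(n^{-1/4})$ threshold of Eq. (10), which shrinks by only a factor of $\sqrt{2}$ over the same range.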

4.2. DUAL CONFIDENCE BOUNDS

We derive a dual form of Eq. (8) that sidesteps the need to solve the challenging global optimization in Eq. (8). To do so, we plug the definition of $L_{\mathcal{W}}(q; \hat D_n)$ from Eq. (3) into Eq. (8) and introduce a Lagrange multiplier:

$$\hat J^{+}_{\mathcal{Q},\mathcal{W}} = \sup_{q \in \mathcal{Q}} \inf_{h \in \mathcal{W}} \inf_{\lambda \ge 0} \left\{ \mathbb{E}_{D_{\pi,0}}[q] - \lambda \left( \frac{1}{n} \sum_{i=1}^n h(x_i)\hat{R}q(x_i, y_i) - \varepsilon_n \right) \right\} \qquad (14)$$

$$= \sup_{q \in \mathcal{Q}} \inf_{\omega \in \mathcal{W}_o} \left\{ \mathbb{E}_{D_{\pi,0}}[q] - \frac{1}{n} \sum_{i=1}^n \omega(x_i)\hat{R}q(x_i, y_i) + \varepsilon_n \|\omega\|_{\mathcal{W}_o} \right\}, \qquad (15)$$

where we take $\omega(x) = \lambda h(x)$. Exchanging the order of min/max and some further derivation yields the following main result.

Theorem 4.2. I) Let $\mathcal{W}$ be the unit ball of a normed function space $\mathcal{W}_o$. We have

$$\hat J^{+}_{\mathcal{Q},\mathcal{W}} \le F^{+}_{\mathcal{Q}}(\omega) := \mathbb{E}_{\hat D_n^\omega}[r] + I_{\mathcal{Q}}(\omega; \hat D_n) + \varepsilon_n \|\omega\|_{\mathcal{W}_o}, \quad \forall \omega \in \mathcal{W}_o, \qquad (16)$$

$$\hat J^{-}_{\mathcal{Q},\mathcal{W}} \ge F^{-}_{\mathcal{Q}}(\omega) := \mathbb{E}_{\hat D_n^\omega}[r] - I_{-\mathcal{Q}}(\omega; \hat D_n) - \varepsilon_n \|\omega\|_{\mathcal{W}_o}, \quad \forall \omega \in \mathcal{W}_o,$$

where $-\mathcal{Q} = \{-q \colon q \in \mathcal{Q}\}$, and hence $I_{-\mathcal{Q}}(\omega; \hat D_n) = I_{\mathcal{Q}}(\omega; \hat D_n)$ if $\mathcal{Q} = -\mathcal{Q}$. Further, we have $\hat J^{+}_{\mathcal{Q},\mathcal{W}} = \inf_{\omega \in \mathcal{W}_o} F^{+}_{\mathcal{Q}}(\omega)$ and $\hat J^{-}_{\mathcal{Q},\mathcal{W}} = \sup_{\omega \in \mathcal{W}_o} F^{-}_{\mathcal{Q}}(\omega)$ if $\mathcal{Q}$ is convex and there exists a $q \in \mathcal{Q}$ satisfying the strict feasibility condition $L_{\mathcal{W}}(q; \hat D_n) < \varepsilon_n$.

II) For $\hat D_n$ and $\delta \in (0, 1)$, assume $\mathcal{W}_o$ and $\varepsilon_n \in \mathbb{R}$ satisfy Eq. (9) (e.g., via Theorem 4.1). Then for any function set $\mathcal{Q}$ with $q^\pi \in \mathcal{Q}$, and any functions $\omega^{+}, \omega^{-} \in \mathcal{W}_o$ (the choice of $\mathcal{Q}$, $\omega^{+}$, $\omega^{-}$ can depend on $\hat D_n$ arbitrarily), we have

$$\Pr\Big( J^\pi \in \big[ F^{-}_{\mathcal{Q}}(\omega^{-}),\; F^{+}_{\mathcal{Q}}(\omega^{+}) \big] \Big) \ge 1 - \delta. \qquad (17)$$

Theorem 4.2 transforms the original bound in Eq. (8), framed in terms of $q$ and $L_{\mathcal{W}}(q; \hat D_n)$, into a form that involves the density ratio $\omega$ and the related loss $I_{\mathcal{Q}}(\omega; \hat D_n)$. The bounds in Eq. (16) can be interpreted as assigning an error bar around the $\omega$-based estimator $\hat J_\omega = \mathbb{E}_{\hat D_n^\omega}[r]$ of Eq. (5), with error bar $I_{\pm\mathcal{Q}}(\omega; \hat D_n) + \varepsilon_n \|\omega\|_{\mathcal{W}_o}$. Specifically, the first term $I_{\pm\mathcal{Q}}(\omega; \hat D_n)$ measures the discrepancy between $\hat D_n^\omega$ and $D_\pi$, as discussed in Eq. (7), whereas the second term captures the randomness in the empirical Bellman residual operator $\hat{R}q^\pi$. Compared with Eq. (8), the global maximization over $q \in \mathcal{Q}$ is now folded into the $I_{\mathcal{Q}}(\omega; \hat D_n)$ term, which has a simple closed form in the RKHS case (see Appendix F). In practice, we can optimize $\omega^{+}$ and $\omega^{-}$ to obtain the tightest possible bound (and hence recover the primal bound) by minimizing $F^{+}_{\mathcal{Q}}(\omega)$ and maximizing $F^{-}_{\mathcal{Q}}(\omega)$, but it is not necessary to solve these optimizations to global optimality. When $\mathcal{W}_o$ is an RKHS, by the standard finite representer theorem (Scholkopf & Smola, 2018), the optimization over $\omega$ reduces to a finite-dimensional optimization, which can be sped up with any favourable approximation technique. We elaborate on this in Appendix D.

Length of the Confidence Interval: The form in Eq. (16) also makes it much easier to analyze the tightness of the confidence interval. Suppose $\omega = \omega^{+} = \omega^{-}$ and $\mathcal{Q} = -\mathcal{Q}$; the length of the optimal confidence interval is

$$\mathrm{length}\big(\big[\hat J^{-}_{\mathcal{Q},\mathcal{W}},\; \hat J^{+}_{\mathcal{Q},\mathcal{W}}\big]\big) = \inf_{\omega \in \mathcal{W}_o} \left\{ 2 I_{\mathcal{Q}}(\omega; \hat D_n) + 2\varepsilon_n \|\omega\|_{\mathcal{W}_o} \right\}.$$

Given that $\varepsilon_n$ is $O(n^{-1/2})$, the overall length of the optimal confidence interval is also $O(n^{-1/2})$, as long as $\mathcal{W}_o$ is rich enough to include a good density-ratio estimator $\omega^{*}$ that satisfies $I_{\mathcal{Q}}(\omega^{*}; \hat D_n) = O(n^{-1/2})$ and has a bounded norm $\|\omega^{*}\|_{\mathcal{W}_o}$. We can expect $I_{\mathcal{Q}}(\omega^{*}; \hat D_n) = O(n^{-1/2})$ when (1) $\mathcal{Q}$ has an $O(n^{-1/2})$ sequential Rademacher complexity (Rakhlin et al., 2015) (e.g., a finite ball in an RKHS); and (2) $\hat D_n$ is collected from a Markov chain with a strong mixing condition that weakly converges to some limit distribution $D_\infty$ whose support is $\mathcal{X}$, in which case we can take $\omega^{*}$ to be the density ratio between $D_\pi$ and $D_\infty$. Refer to Appendix C for more discussion. Indeed, our experiments show that the lengths of the practically constructed confidence intervals do tend to decay at an $O(n^{-1/2})$ rate.

Choice of $\mathcal{W}$ and $\mathcal{Q}$: To ensure the concentration inequality in Theorem 4.1 is valid, the choice of $\mathcal{W}_o$ cannot depend on the data $\hat D_n$.
Therefore, one should use a separate holdout set to construct a data-dependent $\mathcal{W}_o$. In contrast, the choice of $\mathcal{Q}$ can depend on the data $\hat D_n$ arbitrarily, since it enters the optimization bound in Eq. (8) but not the tail bound in Eq. (9). In this light, one can construct the best possible $\mathcal{Q}$ by exploiting the data in the most favourable way. For example, we can construct an estimator $\hat q \approx q^\pi$ with any state-of-the-art method (e.g., Q-learning or model-based methods), and set $\mathcal{Q}$ to be a ball centered at $\hat q$ such that $q^\pi - \hat q \in \mathcal{Q}$. This enables post-hoc analysis based on prior information on $q^\pi$, as suggested in Feng et al. (2020).

Mis-specification of $\mathcal{Q}$ and Oracle Upper/Lower Estimates: Our result relies on the assumption that $q^\pi \in \mathcal{Q}$. However, as with other statistical estimation problems, there is no provable way to empirically verify model assumptions such as $q^\pi \in \mathcal{Q}$: empirical data only reveals information about the unknown function (in our case $q^\pi$) at a finite number of data points, and no conclusion can be drawn about the unseen points without imposing certain smoothness assumptions. Typically, what we can do is the opposite: reject $q^\pi \in \mathcal{Q}$ when the Bellman loss $L_{\mathcal{W}}(q; \hat D_n)$ of every $q$ in $\mathcal{Q}$ is larger than the threshold $\varepsilon_n$. We highlight that, even without verifying $q^\pi \in \mathcal{Q}$, our method can still be viewed as a confidence interval of the best possible (oracle) upper and lower estimates given the data $\hat D_n$ plus the assumption that $q^\pi \in \mathcal{Q}$, defined as

$$\hat J^{+}_{\mathcal{Q},*} = \sup_{q \in \mathcal{Q}} \Big\{ \mathbb{E}_{D_{\pi,0}}[q] ~~\text{s.t.}~~ \hat{R}q(x_i, y_i) = \hat{R}q^\pi(x_i, y_i),~ \forall i = 1, \ldots, n \Big\}. \qquad (18)$$

In fact, it is impossible to derive an empirical upper bound lower than $\hat J^{+}_{\mathcal{Q},*}$, as there is no way to distinguish $q$ and $q^\pi$ if $\hat{R}q(x_i, y_i) = \hat{R}q^\pi(x_i, y_i)$ for all $i$. But our interval $[\hat J^{-}_{\mathcal{Q},\mathcal{K}}, \hat J^{+}_{\mathcal{Q},\mathcal{K}}]$ provides a $1 - \delta$ confidence outer bound of $[\hat J^{-}_{\mathcal{Q},*}, \hat J^{+}_{\mathcal{Q},*}]$ once Eq. (9) holds, regardless of whether $q^\pi \in \mathcal{Q}$ holds.
Hence, it is of independent interest to further explore the dual form of Eq. (18), which is another starting point for deriving our bound; we provide more discussion in Appendix G. Lastly, we argue that it is important to include $\mathcal{Q}$ in the bound: Proposition G.1 in the appendix shows that removing the $q \in \mathcal{Q}$ constraint in Eq. (18) would lead to an infinite upper bound, unless the $\{s_i, s_i'\}_{i=1}^n$ in $\hat D_n$ almost surely cover the whole state space $\mathcal{S}$, in the sense that $\Pr_{s \sim D_0}\big(s \in \{s_i, s_i'\}_{i=1}^n\big) = 1$.
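The dual bound above is driven by $I_{\mathcal{Q}}(\omega; \hat D_n)$. Its RKHS closed form (Appendix F) follows from the fact that $\Delta(\hat D_n^\omega, q)$ in Eq. (6) is linear in $q$, so its supremum over a unit RKHS ball equals the RKHS norm of the Riesz representer. The sketch below is our own illustration of that computation (the Gaussian kernel, bandwidth, and Monte Carlo handling of $D_{\pi,0}$ are assumptions, not the paper's exact implementation):

```python
import numpy as np

def rbf(A, B, h=1.0):
    """Gaussian RBF Gram matrix k(a_i, b_j) for row-stacked points."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * h ** 2))

def i_q_rkhs(w, X, Xn, Z, gamma=0.99, h=1.0):
    """I_Q(omega; D_n) for Q the unit RKHS ball of kernel k.

    Delta(D_n^omega, q) = <g, q> for the representer
      g = (1/n) sum_i w_i (gamma k(x_i',.) - k(x_i,.)) + (1/m) sum_j k(z_j,.),
    so sup over ||q|| <= 1 is ||g||. Here w[i] = omega(x_i), X / Xn stack the
    x_i / x_i', and Z holds Monte Carlo draws z_j ~ D_{pi,0} (which is known).
    """
    n, m = len(X), len(Z)
    C = np.vstack([Xn, X, Z])                                    # centers of g
    c = np.concatenate([gamma * w / n, -w / n, np.ones(m) / m])  # signed weights
    return float(np.sqrt(max(c @ rbf(C, C, h) @ c, 0.0)))
```

With the true (unnormalized) density ratio plugged in, this discrepancy vanishes, and the dual bound $F^{+}_{\mathcal{Q}}(\omega) = \hat J_\omega + I_{\mathcal{Q}}(\omega; \hat D_n) + \varepsilon_n \|\omega\|_{\mathcal{W}_o}$ is dominated by the concentration term, which shrinks at the $O(n^{-1/2})$ rate discussed above.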

5. EXPERIMENTS

We compare our method with a variety of existing algorithms for obtaining asymptotic and non-asymptotic bounds on a number of benchmarks. We find that our method provides confidence intervals that correctly cover the true expected reward with probability larger than the specified confidence level $1 - \delta$ (and is hence safe) across all the examples we tested. In comparison, the non-asymptotic bounds based on IS give much wider confidence intervals, while the asymptotic methods, such as bootstrap, despite giving tighter intervals, often fail to capture the true values at the promised probability in practice.

Environments and Dataset Construction

We test our method on three environments: Inverted-Pendulum and CartPole from OpenAI Gym (Brockman et al., 2016), and a Type-1 Diabetes medical treatment simulator. We follow a similar procedure to Feng et al. (2020) to construct the behavior and target policies. More details on the environments and the data collection procedure are included in Appendix H.1.

Algorithm Settings

We test the dual bound described in our paper. Throughout the experiments, we always set $\mathcal{W} = \mathcal{K}$, the unit ball of the RKHS with positive definite kernel $k$, and set $\mathcal{Q} = r_{\mathcal{Q}}\tilde{\mathcal{K}}$, the ball of radius $r_{\mathcal{Q}}$ in the RKHS with another kernel $\tilde{k}$. We take both kernels to be Gaussian RBF kernels and choose $r_{\mathcal{Q}}$ and the bandwidths of $k$ and $\tilde{k}$ using the procedure in Appendix H.2. We use a fast approximation method to optimize $\omega$ in $F^{+}_{\mathcal{Q}}(\omega)$ and $F^{-}_{\mathcal{Q}}(\omega)$, as shown in Appendix D. Once $\omega$ is found, we evaluate the bound in Eq. (16) exactly to ensure that the theoretical guarantee holds.

Baseline Algorithms: We compare our method with four existing baselines: the IS-based non-asymptotic bound using the empirical Bernstein inequality by Thomas et al. (2015b), the IS-based bootstrap bound of Thomas (2015), the bootstrap bound based on fitted Q-evaluation (FQE) by Kostrikov & Nachum (2020), and the bound of Feng et al. (2020), which is equivalent to the primal bound in Eq. (8) but with a looser concentration inequality (they use an $\varepsilon_n = O(n^{-1/4})$ threshold).

Results: Figure 1 shows that our method obtains much tighter bounds than Feng et al. (2020), because we use a much tighter concentration inequality, even though the dual bound that we use can be slightly looser than the primal bound used in Feng et al. (2020). Our method is also more computationally efficient than that of Feng et al. (2020), because the dual bound can be tightened approximately while the primal bound requires solving a global optimization problem. Figure 1(b) shows that we provide increasingly tight bounds as the data size $n$ increases, and the length of the interval decays at approximately an $O(n^{-1/2})$ rate. Figure 1(c) shows that as we increase the significance level $\delta$, our bounds become tighter while still capturing the ground truth. Figure 1(d) shows the percentage of times that the interval fails to capture the true value over 100 random trials (denoted $\hat\delta$) as we vary $\delta$.
We can see that $\hat\delta$ remains close to zero even when $\delta$ is large, suggesting that our bound is quite conservative. Part of the reason is that the bound is constructed by considering the worst case, and we use a conservative choice of the radius $r_{\mathcal{Q}}$ and the coefficient $c_{q^\pi\!,k}$ in Eq. (13) (see Appendix H.2). In Figure 2 we compare different algorithms on more examples with $\delta = 0.1$. We can again see that our method provides a tight yet conservative interval that always captures the true value. Although FQE (Bootstrap) yields tighter intervals than our method, it fails to capture the ground truth much more often than the promised $\delta = 0.1$ (e.g., it fails in all the random trials in Figure 2(a)). We conduct further ablation studies on the hyper-parameters and the data collection procedure; see Appendix H.2 and H.3 for details.

6. CONCLUSION

We develop a dual approach to constructing high-confidence bounds for off-policy evaluation with an improved rate over Feng et al. (2020). Our method can handle dependent data and does not require global optimization to obtain a valid bound. Empirical results demonstrate that our bounds are tight and valid compared with a range of existing baselines. Future directions include leveraging our bounds for policy optimization and safe exploration.

A PROOF OF THE DUAL BOUND IN THEOREM 4.2

Proof. Introducing a Lagrange multiplier, the bound in Eq. (8) is equivalent to

$$\hat J^{+}_{\mathcal{Q},\mathcal{W}} = \max_{q \in \mathcal{Q}} \min_{\lambda \ge 0} \left\{ \mathbb{E}_{D_{\pi,0}}[q] - \lambda \left( \max_{h \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n h(x_i)\hat{R}q(x_i, y_i) - \varepsilon_n \right) \right\}$$
$$= \max_{q \in \mathcal{Q}} \min_{\lambda \ge 0} \min_{h \in \mathcal{W}} \left\{ \mathbb{E}_{D_{\pi,0}}[q] - \lambda \left( \frac{1}{n} \sum_{i=1}^n h(x_i)\hat{R}q(x_i, y_i) - \varepsilon_n \right) \right\}$$
$$= \max_{q \in \mathcal{Q}} \min_{\omega \in \mathcal{W}_o} \left\{ \mathbb{E}_{D_{\pi,0}}[q] - \frac{1}{n} \sum_{i=1}^n \omega(x_i)\hat{R}q(x_i, y_i) + \varepsilon_n \|\omega\|_{\mathcal{W}_o} \right\},$$

where we take $\omega = \lambda h$, so that $\lambda$ is replaced by $\|\omega\|_{\mathcal{W}_o}$. Define

$$M(q, \omega; \hat D_n) = \mathbb{E}_{D_{\pi,0}}[q] - \frac{1}{n} \sum_{i=1}^n \omega(x_i)\hat{R}q(x_i, y_i) + \varepsilon_n \|\omega\|_{\mathcal{W}_o} = \mathbb{E}_{\hat D_n^\omega}[r] + \Delta(\hat D_n^\omega, q) + \varepsilon_n \|\omega\|_{\mathcal{W}_o}.$$

Then we have

$$\max_{q \in \mathcal{Q}} M(q, \omega; \hat D_n) = \mathbb{E}_{\hat D_n^\omega}[r] + \max_{q \in \mathcal{Q}} \Delta(\hat D_n^\omega, q) + \varepsilon_n \|\omega\|_{\mathcal{W}_o} = \mathbb{E}_{\hat D_n^\omega}[r] + I_{\mathcal{Q}}(\omega; \hat D_n) + \varepsilon_n \|\omega\|_{\mathcal{W}_o} = F^{+}_{\mathcal{Q}}(\omega).$$

Therefore,

$$\hat J^{+}_{\mathcal{Q},\mathcal{W}} = \max_{q \in \mathcal{Q}} \min_{\omega \in \mathcal{W}_o} M(q, \omega; \hat D_n) \le \min_{\omega \in \mathcal{W}_o} \max_{q \in \mathcal{Q}} M(q, \omega; \hat D_n) = \min_{\omega \in \mathcal{W}_o} F^{+}_{\mathcal{Q}}(\omega).$$

The lower bound follows analogously. Strong duality holds when Slater's condition is satisfied (Nesterov, 2013), which amounts to saying that the primal problem in Eq. (8) is convex and strictly feasible; this requires that $\mathcal{Q}$ is convex and that there exists at least one $q \in \mathcal{Q}$ satisfying the constraint strictly, that is, $L_{\mathcal{W}}(q; \hat D_n) < \varepsilon_n$. Note that the objective is linear in $q$ and the constraint function $L_{\mathcal{W}}(q; \hat D_n)$ is always convex in $q$ (since it is the sup of a set of linear functions of $q$, following Eq. (3)).

B PROOF OF CONCENTRATION BOUND IN THEOREM 4.1

Our proof requires the following Hoeffding inequality on Hilbert spaces due to Pinelis (Theorem 3, 1992); see also Section 2.4 of Rosasco et al. (2010).

Lemma B.1. (Theorem 3, Pinelis, 1992) Let H be a Hilbert space and {f_i}_{i=1}^n a martingale difference sequence in H that satisfies sup_i ‖f_i‖_H ≤ σ almost surely. Then for any ε > 0,
$$
\Pr\Big(\Big\|\frac{1}{n}\sum_{i=1}^n f_i\Big\|_{H} \ge \varepsilon\Big) \le 2\exp\Big(-\frac{n\varepsilon^2}{2\sigma^2}\Big).
$$
Therefore, with probability at least 1 − δ, we have
$$
\Big\|\frac{1}{n}\sum_{i=1}^n f_i\Big\|_{H} \le \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}}.
$$

Published as a conference paper at ICLR 2021

Lemma B.2. Let k(x, x̄) be a positive definite kernel whose RKHS is H_k. Define
$$
f_i(\cdot) = \hat R q(x_i, y_i)\,k(x_i,\cdot) - R_\pi q(x_i)\,k(x_i,\cdot).
$$
Assume Assumption 2.1 holds; then {f_i}_{i=1}^n is a martingale difference sequence in H_k w.r.t. T_{<i} := {(x_j, y_j)}_{j<i} ∪ {x_i}, that is, E[f_i(·) | T_{<i}] = 0. In addition,
$$
\Big\|\frac{1}{n}\sum_{i=1}^n f_i\Big\|_{H_k}^2 = \frac{1}{n^2}\sum_{i,j=1}^n \big(\hat Rq(x_i,y_i) - R_\pi q(x_i)\big)\, k(x_i,x_j)\, \big(\hat Rq(x_j,y_j) - R_\pi q(x_j)\big),
$$
and ‖f_i‖²_{H_k} ≤ c_{q,k} for all i = 1, …, n.

Proof of Theorem 4.1. Following Lemma B.1 and Lemma B.2, since {f_i}_{i=1}^n is a martingale difference sequence in H_k with ‖f_i‖²_{H_k} ≤ c_{q,k} almost surely, we have with probability at least 1 − δ,
$$
\frac{1}{n^2}\sum_{i,j=1}^n \big(\hat Rq(x_i,y_i) - R_\pi q(x_i)\big)\, k(x_i,x_j)\, \big(\hat Rq(x_j,y_j) - R_\pi q(x_j)\big)
= \Big\|\frac{1}{n}\sum_{i=1}^n f_i\Big\|_{H_k}^2 \le \frac{2c_{q,k}\log(2/\delta)}{n}.
$$
Using Lemma B.3 below, we have
$$
\Big|\sqrt{\hat L_K(q;\hat D_n)} - \sqrt{L^*_K(q;\hat D_n)}\Big| \le \Big\|\frac{1}{n}\sum_{i=1}^n f_i\Big\|_{H_k} \le \sqrt{\frac{2c_{q,k}\log(2/\delta)}{n}}.
$$
This completes the proof.

Lemma B.3. Assume k(x, x̄) is a positive definite kernel. We have
$$
\Big(\sqrt{\hat L_K(q;\hat D_n)} - \sqrt{L^*_K(q;\hat D_n)}\Big)^2 \le \frac{1}{n^2}\sum_{i,j=1}^n \big(\hat Rq(x_i,y_i) - R_\pi q(x_i)\big)\, k(x_i,x_j)\, \big(\hat Rq(x_j,y_j) - R_\pi q(x_j)\big).
$$

Proof. Define
$$
\hat g(\cdot) = \frac{1}{n}\sum_{i=1}^n \hat Rq(x_i,y_i)\,k(x_i,\cdot), \qquad g^*(\cdot) = \frac{1}{n}\sum_{i=1}^n R_\pi q(x_i)\,k(x_i,\cdot).
$$
Then we have
$$
\|\hat g\|^2_{H_k} = \frac{1}{n^2}\sum_{i,j=1}^n \hat Rq(x_i,y_i)\,k(x_i,x_j)\,\hat Rq(x_j,y_j) = \hat L_K(q;\hat D_n),
$$
$$
\|g^*\|^2_{H_k} = \frac{1}{n^2}\sum_{i,j=1}^n R_\pi q(x_i)\,k(x_i,x_j)\,R_\pi q(x_j) = L^*_K(q;\hat D_n),
$$
$$
\|\hat g - g^*\|^2_{H_k} = \frac{1}{n^2}\sum_{i,j=1}^n \big(\hat Rq(x_i,y_i) - R_\pi q(x_i)\big)\,k(x_i,x_j)\,\big(\hat Rq(x_j,y_j) - R_\pi q(x_j)\big).
$$
The result then follows from the triangle inequality $\big|\|\hat g\|_{H_k} - \|g^*\|_{H_k}\big| \le \|\hat g - g^*\|_{H_k}$.

B.1 CALCULATION OF c_{q_π,k}

The practical calculation of the coefficient c_{q_π,k} in the concentration inequality was discussed in Feng et al. (2020), which we include here for completeness.

(Feng et al. (2020) Lemma 3.1) Assume the reward function and the kernel function are bounded, with sup_x |r(x)| ≤ r_max and sup_{x,x̄} |k(x, x̄)| ≤ K_max. Then we have
$$
c_{q_\pi,k} := \sup_{x\in\mathcal X,\,y\in\mathcal Y}\big(\hat R q_\pi(x,y)\big)^2\, k(x,x) \le \frac{4K_{\max}\, r^2_{\max}}{(1-\gamma)^2}.
$$
In practice, we get access to K_max from the kernel function that we choose (e.g., K_max = 1 for RBF kernels), and r_max from knowledge of the environment.
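The quadratic-form expression for ‖(1/n)Σᵢ fᵢ‖²_{H_k} in Lemma B.2 is easy to check by simulation. The sketch below is my own illustration, not the paper's code: a synthetic bounded, mean-zero noise eᵢ stands in for the Bellman residual R̂q(xᵢ,yᵢ) − R_π q(xᵢ) (drawn i.i.d. here for simplicity, although the lemma covers martingale differences), and an RBF kernel plays the role of k:

```python
import numpy as np

def rbf(X, Y, bw=1.0):
    # RBF kernel matrix k(x, y) = exp(-(x - y)^2 / (2 bw^2)); K_max = 1
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * bw ** 2))

rng = np.random.default_rng(0)
n, delta, trials = 200, 0.1, 200
e_max = 1.0                              # |e_i| <= e_max, so c = e_max^2 * K_max
bound = 2 * (e_max ** 2) * np.log(2 / delta) / n

failures = 0
for _ in range(trials):
    x = rng.normal(size=n)                      # "states"
    e = rng.uniform(-e_max, e_max, size=n)      # mean-zero bounded Bellman noise
    K = rbf(x, x)
    quad = e @ K @ e / n ** 2                   # = ||(1/n) sum_i f_i||^2 in the RKHS
    failures += quad > bound

# The empirical failure rate should be at most delta (in practice far below it,
# since the Hoeffding-type bound is conservative).
fail_rate = failures / trials
assert fail_rate <= delta
```

Consistent with Section 5's observation, the realized failure rate is typically far below the nominal δ.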

C MORE ON THE TIGHTNESS OF THE CONFIDENCE INTERVAL

The benefit of having both upper and lower bounds is that we can empirically assess the tightness of the bound by checking the length of the interval [F^-_Q(ω_-), F^+_Q(ω_+)]. However, from a theoretical perspective, it is desirable to know a priori that the length of the interval decreases at a fast rate as the data size n increases. We now show that this is the case if W_o is chosen to be sufficiently rich, so that it includes an ω ∈ W_o such that \hat D^ω_n ≈ D_π.

Theorem C.1. Assume W_o is sufficiently rich to include a "good" ω* ∈ W_o with \hat D^{ω*}_n ≈ D_π, in the sense that
$$
\sup_{q\in Q}\Big|\mathbb{E}_{\hat D^{\omega^*}_n}\big[\hat Rq(x;x',r)\big] - \mathbb{E}_{D_\pi}\big[\hat Rq(x;x',r)\big]\Big| \le \frac{c}{n^\alpha}, \qquad (19)
$$
where c and α are two positive coefficients. Then we have
$$
\max\Big(\hat J^+_{Q,W} - J_\pi,\; J_\pi - \hat J^-_{Q,W}\Big) \le \frac{c}{n^\alpha} + \varepsilon_n\|\omega^*\|_{W_o}.
$$
Assumption (19) holds if \hat D_n is collected following a Markov chain satisfying a certain strong mixing condition that weakly converges to some limit distribution D_∞ whose support is X, for which we can define ω*(x) = D_π(x)/D_∞(x). In this case, if Q is a finite ball in an RKHS, then we can achieve (19) with α = 1/2, which yields an overall bound of rate O(n^{-1/2}). For more general function classes, α depends on the martingale Rademacher complexity of the function set \hat R Q = {\hat Rq(x, y) : q ∈ Q} (Rakhlin et al., 2015). In our empirical results, we observe that the gap of the practically constructed bounds tends to follow the O(n^{-1/2}) rate.

Proof. Note that J_π = E_{D_π}[r], and
$$
I_Q(\omega;\hat D_n) = \sup_{q\in Q}\Big\{\mathbb{E}_{\hat D^\omega_n}[\gamma q(x') - q(x)] - \mathbb{E}_{D_\pi}[\gamma q(x') - q(x)]\Big\}.
$$
Because ω* ∈ W_o, we have
$$
\hat J^+_{Q,W} - J_\pi \le F^+_Q(\omega^*) - J_\pi = \mathbb{E}_{\hat D^{\omega^*}_n}[r] - \mathbb{E}_{D_\pi}[r] + I_Q(\omega^*;\hat D_n) + \varepsilon_n\|\omega^*\|_{W_o}
$$
$$
= \sup_{q\in Q}\Big\{\mathbb{E}_{\hat D^{\omega^*}_n}\big[\hat Rq(x,y)\big] - \mathbb{E}_{D_\pi}\big[\hat Rq(x,y)\big]\Big\} + \varepsilon_n\|\omega^*\|_{W_o} \le \frac{c}{n^\alpha} + \varepsilon_n\|\omega^*\|_{W_o}.
$$
The case of the lower bound follows similarly.

D OPTIMIZATION ON W o

Consider the optimization of ω in W_o:
$$
F^+_Q(\omega) := \frac{1}{n}\sum_{i=1}^n r_i\,\omega(x_i) + I_Q(\omega;\hat D_n) + \|\omega\|_{W_o}\sqrt{\frac{2c_{q_\pi,k}\log(2/\delta)}{n}}. \qquad (20)
$$
Assume W_o is the RKHS of a kernel k(x, x̄), that is, W_o = H_k. By the finite representer theorem of RKHS (Smola et al., 2007), the optimization of ω in H_k can be reduced to a finite-dimensional optimization problem. Specifically, the optimal solution of (20) can be written in the form ω(x) = Σ_{i=1}^n k(x, x_i)α_i, with ‖ω‖²_{H_k} = Σ_{i,j=1}^n k(x_i, x_j)α_iα_j, for some vector α := [α_i]_{i=1}^n ∈ R^n. Write K = [k(x_i, x_j)]_{i,j=1}^n and r = [r_i]_{i=1}^n. The optimization of ω then reduces to a finite-dimensional optimization on α:
$$
\min_{\alpha\in\mathbb{R}^n}\; \frac{1}{n}\, r^\top K\alpha + I_Q(K\alpha;\hat D_n) + \sqrt{\alpha^\top K\alpha}\,\sqrt{\frac{2c_{q_\pi,k}\log(2/\delta)}{n}},
$$
where
$$
I_Q(K\alpha;\hat D_n) = \max_{q\in Q}\Big\{\mathbb{E}_{\hat D_{\pi,0}}[q] + \frac{1}{n}(\hat T q)^\top K\alpha\Big\}, \qquad \hat Tq = [\gamma q(x'_i) - q(x_i)]_{i=1}^n.
$$
When Q is an RKHS ball, we can calculate I_Q(Kα; \hat D_n) using (22) in Section F.

This computation can still be expensive when n is large. Fortunately, our confidence bound holds for any ω; a better ω only gives a tighter bound, so it is not necessary to find the globally optimal ω. Therefore, one can use any approximation algorithm to find ω, which provides a trade-off between tightness and computational cost. We discuss two methods:

1) Approximating α. The dimension of α can be too large when n is large. To address this, we assume α_i = g(x_i, θ), where g is any parametric function (such as a neural network) with a parameter θ that can be much lower-dimensional than α. We can then optimize θ with stochastic gradient descent, approximating the data averages (1/n)Σ_{i=1}^n(·) with averages over small mini-batches; this introduces bias into the gradient estimates, but that is not an issue when the goal is only to obtain a reasonable approximation.

2) Replacing kernel k. Assume the kernel k admits a random feature expansion: k(x, x̄) = E_{β∼π}[φ(x, β)φ(x̄, β)], where φ(x, β) is a feature map with parameter β and π is a distribution over β. We draw {β_i}_{i=1}^m i.i.d. from π, where m is taken to be much smaller than n. We replace k with \hat k(x, x̄) = (1/m)Σ_{i=1}^m φ(x, β_i)φ(x̄, β_i), and H_k with H_{\hat k}. That is, we consider solving
$$
\hat J^+_{Q,W} = \min_{\omega\in H_{\hat k}}\Big\{F^+_Q(\omega) := \frac{1}{n}\sum_{i=1}^n r_i\,\omega(x_i) + I_Q(\omega;\hat D_n) + \|\omega\|_{H_{\hat k}}\sqrt{\frac{2c_{q_\pi,\hat k}\log(2/\delta)}{n}}\Big\}.
$$
It is known that any function ω in H_{\hat k} can be represented as ω(x) = (1/m)Σ_{i=1}^m w_iφ(x, β_i) for some w = [w_i]_{i=1}^m ∈ R^m, with ‖ω‖²_{H_{\hat k}} = (1/m)Σ_{i=1}^m w_i². In this way, the problem reduces to optimizing an m-dimensional vector w, which can be solved with standard convex optimization techniques.
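As a concrete illustration of the random feature expansion above, the sketch below (an independent illustration, using the standard random Fourier features of Rahimi & Recht for an RBF kernel; bandwidth and sizes are arbitrary) checks that \hat k(x, x̄) = (1/m)Σᵢ φ(x, βᵢ)φ(x̄, βᵢ) approximates the exact kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
bw = 1.0          # RBF bandwidth: k(x, y) = exp(-(x - y)^2 / (2 bw^2))
m = 5000          # number of random features (m << n in the paper's setting)

# beta = (w, b): w ~ N(0, 1/bw^2), b ~ Uniform[0, 2*pi]
w = rng.normal(scale=1.0 / bw, size=m)
b = rng.uniform(0, 2 * np.pi, size=m)

def phi(x):
    # Feature map phi(x, beta) = sqrt(2) * cos(w x + b), chosen so that
    # E_beta[phi(x, beta) phi(y, beta)] equals the exact RBF kernel.
    return np.sqrt(2) * np.cos(np.outer(x, w) + b)

x = rng.normal(size=50)
K_exact = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))
K_rff = phi(x) @ phi(x).T / m    # (1/m) sum_i phi(x, b_i) phi(y, b_i)

err = np.abs(K_exact - K_rff).max()
assert err < 0.1                 # Monte Carlo error shrinks as O(1/sqrt(m))
```

With the parameterization ω(x) = (1/m)Σᵢ wᵢφ(x, βᵢ), the penalty ‖ω‖_{H_{\hat k}} reduces to sqrt((1/m)Σᵢ wᵢ²), so the bound becomes a finite-dimensional convex problem in w, as stated above.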

E CONCENTRATION INEQUALITIES OF GENERAL FUNCTIONAL BELLMAN LOSSES

When W is a general function set, one can still obtain a general concentration bound using Rademacher complexity. Define
$$
\hat Rq\circ W := \{h(x,y) = \hat Rq(x,y)\,\omega(x) \,:\, \omega\in W\}.
$$
Using the standard derivation in Rademacher complexity theory in conjunction with martingale theory (Rakhlin et al., 2015), we have
$$
\sup_{\omega\in W}\Big|\frac{1}{n}\sum_{i=1}^n\big(\hat Rq(x_i,y_i) - R_\pi q(x_i)\big)\,\omega(x_i)\Big| \le 2\,\mathrm{Rad}(\hat Rq\circ W) + \sqrt{\frac{2c_{q,W}\log(2/\delta)}{n}},
$$
where Rad(\hat Rq ∘ W) is the sequential Rademacher complexity as defined in Rakhlin et al. (2015). A triangle inequality yields
$$
\big|\hat L_W(q;\hat D_n) - L^*_W(q;\hat D_n)\big| \le \sup_{\omega\in W}\Big|\frac{1}{n}\sum_{i=1}^n\big(\hat Rq(x_i,y_i) - R_\pi q(x_i)\big)\,\omega(x_i)\Big|.
$$
Therefore,
$$
\big|\hat L_W(q;\hat D_n) - L^*_W(q;\hat D_n)\big| \le 2\,\mathrm{Rad}(\hat Rq\circ W) + \sqrt{\frac{2c_{q,W}\log(2/\delta)}{n}},
$$
where $c_{q,W} = \sup_{\omega\in W}\sup_{x,y}\big(\hat Rq(x,y) - R_\pi q(x)\big)^2\,\omega(x)^2$. When W equals the unit ball K of the RKHS with kernel k, we have c_{q,k} = c_{q,W}, and hence this bound is strictly worse than the bound in Theorem 4.1.

F CLOSED FORM OF I_Q(ω; \hat D_n) WHEN Q IS AN RKHS

Similar to L_K(q; \hat D_n), when Q is taken to be the unit ball K̄ of the RKHS of a positive definite kernel k̄(x, x̄), the quantity in (7) can be expressed in a bilinear closed form shown in Mousavi et al. (2020):
$$
I_Q(\omega;\hat D_n)^2 = A - 2B + C, \qquad (22)
$$
with
$$
A = \mathbb{E}_{(x,\bar x)\sim \hat D_{\pi,0}\times \hat D_{\pi,0}}\big[\bar k(x,\bar x)\big], \quad
B = \mathbb{E}_{(x,\bar x)\sim \hat D^\omega_n\times \hat D_{\pi,0}}\big[\hat T^x_\pi \bar k(x,\bar x)\big], \quad
C = \mathbb{E}_{(x,\bar x)\sim \hat D^\omega_n\times \hat D^\omega_n}\big[\hat T^x_\pi\hat T^{\bar x}_\pi \bar k(x,\bar x)\big],
$$
where $\hat T^x_\pi f(x) = \gamma f(x') - f(x)$; in $\hat T^x_\pi\hat T^{\bar x}_\pi \bar k(x,\bar x)$, we apply $\hat T^x_\pi$ and $\hat T^{\bar x}_\pi$ in sequential order, treating k̄ first as a function of x and then as a function of x̄.
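A finite-sample sketch of the A − 2B + C computation above (my own illustration: the Gaussian kernel, the synthetic transitions, and the candidate weights ω are arbitrary choices; the operator T̂ is applied by expanding γk̄(x′,·) − k̄(x,·) term by term):

```python
import numpy as np

def kbar(X, Y, bw=1.0):
    # RBF kernel \bar k used for the ball Q
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * bw ** 2))

def I_Q_squared(x, xp, w, x0, gamma=0.95):
    """Empirical A - 2B + C with transitions (x_i -> xp_i), weights w_i = omega(x_i),
    and initial-state samples x0 ~ D_{pi,0} (all one-dimensional for simplicity)."""
    w = w / w.sum()                     # normalize the weighted data distribution
    A = kbar(x0, x0).mean()
    # T^x applied to kbar(x, x0): gamma * kbar(x', x0) - kbar(x, x0)
    B = (w[:, None] * (gamma * kbar(xp, x0) - kbar(x, x0))).sum(axis=0).mean()
    # T^x T^xbar applied to kbar(x, xbar), expanded into four terms
    C_mat = (gamma ** 2 * kbar(xp, xp) - gamma * kbar(xp, x)
             - gamma * kbar(x, xp) + kbar(x, x))
    C = w @ C_mat @ w
    return A - 2 * B + C

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
xp = 0.9 * x + 0.1 * rng.normal(size=n)   # next states
w = rng.uniform(0.5, 1.5, size=n)         # candidate density ratios omega(x_i)
x0 = rng.normal(size=50)                  # samples from the initial distribution

val = I_Q_squared(x, xp, w, x0)
assert np.isfinite(val) and val >= 0      # A - 2B + C is a squared RKHS distance
```

The non-negativity assertion holds by construction: A − 2B + C equals the squared RKHS distance between the initial-state mean embedding and the T̂-transformed embedding of the ω-weighted data.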

G MORE ON THE ORACLE BOUND AND ITS DUAL FORM

The oracle bound (18) provides another starting point for deriving optimization-based confidence bounds. We derive its dual form here. Using a Lagrange multiplier, the optimization in (18) can be rewritten as
$$
\hat J^+_{Q,*} = \max_{q\in Q}\min_{\omega} M^*(q,\omega;\hat D_n), \quad\text{where}\quad
M^*(q,\omega;\hat D_n) = \mathbb{E}_{\hat D_{\pi,0}}[q] - \frac{1}{n}\sum_{i=1}^n\omega(x_i)\big(\hat Rq(x_i,y_i) - \hat Rq_\pi(x_i,y_i)\big),
$$
where ω now serves as the Lagrange multiplier. By weak duality, we have for any ω:
$$
\hat J^+_{Q,*} \le F^+_{Q,*}(\omega) := \underbrace{\mathbb{E}_{\hat D^\omega_n}[r] + I_Q(\omega;\hat D_n)}_{\text{known}} + \underbrace{R(\omega, q_\pi)}_{\text{unknown}},
\quad\text{with}\quad R(\omega, q_\pi) = \frac{1}{n}\sum_{i=1}^n\omega(x_i)\,\hat Rq_\pi(x_i, y_i).
$$
The derivation for the lower bound is similar. So for any ω ∈ W_o, we have [\hat J^-_{Q,*}, \hat J^+_{Q,*}] ⊆ [F^-_{Q,*}(ω), F^+_{Q,*}(ω)]. Here the first two terms of F^+_{Q,*}(ω) can be estimated empirically (they coincide with the first two terms of (16)), but the third term R(ω, q_π) depends on the unknown q_π and hence needs to be further upper bounded. Our method can be viewed as constraining ω by W, which is assumed to be the unit ball of W_o, and applying a worst-case bound: for any ω ∈ W_o,
$$
F^+_{Q,*}(\omega) = \mathbb{E}_{\hat D^\omega_n}[r] + I_Q(\omega;\hat D_n) + R(\omega, q_\pi)
\le \mathbb{E}_{\hat D^\omega_n}[r] + I_Q(\omega;\hat D_n) + \|\omega\|_{W_o}\sup_{h\in W} R(h, q_\pi)
$$
$$
= \mathbb{E}_{\hat D^\omega_n}[r] + I_Q(\omega;\hat D_n) + \|\omega\|_{W_o}\, L_W(q_\pi;\hat D_n)
\;\overset{\text{w.p. } 1-\delta}{\le}\; \mathbb{E}_{\hat D^\omega_n}[r] + I_Q(\omega;\hat D_n) + \varepsilon_n\|\omega\|_{W_o} = F^+_Q(\omega),
$$
where the last step applies the high-probability bound Pr(L_W(q_π; \hat D_n) ≤ ε_n) ≥ 1 − δ. A similar derivation for the lower bound counterpart gives
$$
\Pr\Big(\big[F^-_{Q,*}(\omega),\, F^+_{Q,*}(\omega)\big] \subseteq \big[F^-_Q(\omega),\, F^+_Q(\omega)\big]\Big) \ge 1 - \delta.
$$
Therefore, our confidence bound [F^-_Q(ω), F^+_Q(ω)] is a 1 − δ confidence outer bound of the oracle interval [\hat J^-_{Q,*}, \hat J^+_{Q,*}] ⊆ [F^-_{Q,*}(ω), F^+_{Q,*}(ω)].

Introducing Q is necessary Our method does not require any independence assumption between the transition pairs; the trade-off is that we have to assume that q_π falls into a function set Q that imposes a certain smoothness assumption. This is necessary because the data only provide information about q_π on a finite number of points, and q_π can be arbitrarily non-smooth outside of the data points; hence no reasonable upper/lower bound can be obtained without a smoothness condition that extends the information on the data points to other points in the domain.

Proposition G.1. Unless Pr_{s∼D_{π,0}}(s ∉ {s_i, s'_i}_{i=1}^n) = 0, for any u ∈ R there exists a function q: S × A → R such that
$$
\mathbb{E}_{\hat D_{\pi,0}}[q] = u, \qquad \hat Rq(x_i, y_i) = \hat Rq_\pi(x_i, y_i), \quad \forall i = 1,\dots,n.
$$

Proof. Let Q_null be the set of functions that are zero on {s_i, s'_i}_{i=1}^n, that is,
$$
Q_{\mathrm{null}} = \{g : S\times A\to\mathbb{R} \;:\; g(s,a) = 0,\;\forall s\in\{s_i, s'_i\}_{i=1}^n,\; a\in A\}.
$$

Then we have

$$
\hat R(q_\pi + g)(x_i, y_i) = \hat Rq_\pi(x_i, y_i), \quad \forall i = 1,\dots,n,
$$
and
$$
\mathbb{E}_{D_{\pi,0}}[q_\pi + g] = \mathbb{E}_{D_{\pi,0}}[q_\pi] + \mathbb{E}_{D_{\pi,0}}[g] = J_\pi + \mathbb{E}_{D_{\pi,0}}[g].
$$
Take g(s,a) = z\,\mathbb{I}(s ∉ {s_i, s'_i}_{i=1}^n), where z is any real number. Then we have
$$
\mathbb{E}_{D_{\pi,0}}[q_\pi + g] = J_\pi + z\,\Pr_{s\sim D_{\pi,0}}\big(s\notin\{s_i, s'_i\}_{i=1}^n\big).
$$
Because Pr_{s∼D_{π,0}}(s ∉ {s_i, s'_i}_{i=1}^n) ≠ 0, we can choose z to make E_{D_{π,0}}[q_π + g] take any value.

H.1 EXPERIMENTAL SETTINGS

Environments and Dataset Construction We test our method on three environments: Inverted-Pendulum and CartPole from OpenAI Gym (Brockman et al., 2016), and a Type-1 Diabetes medical treatment simulator. For Inverted-Pendulum we discretize the action space to be {-1, -0.3, -0.2, 0, 0.2, 0.3, 1}. The action spaces of CartPole and the medical treatment simulator are both discrete.

Policy Construction We follow a setup similar to Feng et al. (2020) to construct the behavior and target policies. For all environments, we constrain the policy class to softmax policies and use PPO (Schulman et al., 2017) to train a good policy π; we then use different temperatures of the softmax policy to construct the target and behavior policies (we set the temperature τ = 0.1 for the target policy and τ = 1 for the behavior policy, so that the target policy is more deterministic than the behavior policy). We consider other choices of behavior policies in Section H.3. We fix γ = 0.95 and set the horizon length H = 50 for Inverted-Pendulum, H = 100 for CartPole, and H = 50 for the Diabetes simulator.

Algorithm Settings We test the bound in Eq. (16)-(17). Throughout the experiments, we always set W = K, the unit ball of the RKHS with kernel k(·,·), and Q = r_Q K̄, the zero-centered ball of radius r_Q in an RKHS with kernel k̄(·,·). We take both k and k̄ to be Gaussian RBF kernels, with bandwidths selected to make sure the functional Bellman loss is not large on a validation set. The radius is selected to be sufficiently large to ensure that q* is included in Q: we use the data to fit a q̂ whose functional Bellman loss is smaller than ε_n, and then set r_Q = 10‖q̂‖_{K̄}. We optimize ω using the random feature approximation method described in Appendix D. Once ω_+ and ω_- are found, we evaluate the bound in Eq. (16) exactly, to ensure the theoretical guarantee holds.

H.2 SENSITIVITY TO HYPER-PARAMETERS

We investigate the sensitivity of our algorithm to the choice of hyper-parameters, which mainly concern how we choose the function classes Q and W.

Radius of Q Recall that we choose Q to be a ball in an RKHS with radius r_Q, that is, Q = r_Q K̄ = {r_Q f : f ∈ K̄}, where K̄ is the unit ball of the RKHS with kernel k̄. Ideally, we want to ensure that r_Q ≥ ‖q*‖_{K̄} so that q* ∈ Q. Since it is hard to analyze the behavior of the algorithm when q* is unknown, we consider a synthetic environment where the true q* is known. This is done by explicitly specifying a q* inside K̄ and then inferring the corresponding deterministic reward function by inverting the Bellman equation:
$$
r(x) := q^*(x) - \gamma\,\mathbb{E}_{x'\sim P_\pi(\cdot|x)}[q^*(x')].
$$
Here r is a deterministic function, rather than a random variable, with an abuse of notation. In this way, we have access to the true RKHS norm of q*: ρ* = ‖q*‖_{K̄}. For simplicity, we set both the state space S and the action space A to be R, and use a Gaussian policy π(a|s) ∝ exp(f(s,a)/τ), where τ is a positive temperature parameter. We set τ = 0.1 for the target policy and τ = 1 for the behavior policy. Figure 3 shows the results as we set r_Q to ρ*, 10ρ*, and 100ρ*, respectively. We can see that the tightness of the bound is affected significantly by the radius when the number n of samples is very small. However, as n grows (e.g., n ≥ 2 × 10³ in our experiments), the length of the bounds becomes less sensitive to the choice of the radius of Q.
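The Bellman inversion used to build this synthetic environment is easy to sanity-check on a finite MDP. The sketch below is my own illustration (the transition matrix and q* values are arbitrary): it constructs r from a chosen q* and verifies that q* solves the resulting Bellman equation:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95
S = 6                                   # number of (state, action) points x

# Random transition matrix P_pi[x, x'] under the target policy
P = rng.uniform(size=(S, S))
P /= P.sum(axis=1, keepdims=True)

q_star = rng.normal(size=S)             # the chosen "true" value function q*

# Invert the Bellman equation: r(x) = q*(x) - gamma * E_{x'|x}[q*(x')]
r = q_star - gamma * (P @ q_star)

# Check: q* is the unique fixed point of q = r + gamma * P q
q_rec = np.linalg.solve(np.eye(S) - gamma * P, r)
assert np.allclose(q_rec, q_star)
```

Since γ < 1, the Bellman operator is a contraction, so the recovered fixed point is unique and must coincide with the chosen q*.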

Similarity Between Behavior Policy and Target Policy

We study the performance as we change the temperature of the behavior policy, testing on the Inverted-Pendulum environment as previously described. Not surprisingly, the closer the behavior policy is to the target policy (with temperature τ = 0.1), the tighter our confidence interval, as observed in Figure 4 (a).

Bandwidth of RBF kernels

We study the results as we change the bandwidths of the kernels k and k̄ for W and Q, respectively. Figure 4(b) shows the length of the confidence interval for different bandwidth pairs in the Inverted-Pendulum environment. We can see that we obtain relatively tight confidence bounds as long as the bandwidths are set in a reasonable region (e.g., the bandwidth of k in [0.1, 0.5] and the bandwidth of k̄ in [0.5, 3]).

H.3 SENSITIVITY TO THE DATA COLLECTION PROCEDURE

We investigate the sensitivity of our method as we use different behavior policies to collect the dataset \hat D_n.

Varying Behavior Policies We study the effect of using different behavior policies. We consider the following cases:
1. Data is collected from a single behavior policy of the form π_α = απ + (1 − α)π_0, where π is the target policy and π_0 is another policy. We construct π and π_0 to be Gaussian policies of the form π(a|s) ∝ exp(f(s,a)/τ) with different temperatures: τ = 0.1 for the target policy and τ = 1 for π_0.
2. The dataset \hat D_n is the combination of data collected from multiple behavior policies of the form π_α defined above, with α ∈ {0.0, 0.2, 0.4, 0.6, 0.8}.
Figure 5 (a) shows the length of the confidence intervals produced by our method as we vary the number n of transition pairs and the mixture rate α. We can see that the length of the interval decays with the sample size n for all mixture rates α. Larger α yields better performance because the behavior policy is closer to the target policy.

Varying Trajectory Length T in \hat D_n When collecting \hat D_n, we can either have a small number of long trajectories, or a larger number of short trajectories. In Figure 5 (b)-(c), we vary the length T of the trajectories while fixing the total number n of transition pairs, so that the number of trajectories in each \hat D_n is m = n/T. We can see that the trajectory length does not impact the results significantly, especially when the length is reasonably large (e.g., T ≥ 20).

I MORE RELATED WORKS

We give a more detailed overview of different approaches to uncertainty estimation in OPE.

Finite-Horizon Importance Sampling (IS) Assume the data is collected by rolling out a known behavior policy π_0 up to a trajectory length T; then we can estimate the finite-horizon reward by changing E_{π,P}[·] to E_{π_0,P}[·] with importance sampling (e.g., Precup et al., 2000; Precup, 2001; Thomas et al., 2015a;b). Taking trajectory-wise importance sampling as an example, assume we collect a set of independent trajectories τ^i := {s^i_t, a^i_t, r^i_t}_{t=0}^{T-1}, i = 1,…,m, up to trajectory length T by unrolling a known behavior policy π_0. When T is large, we can estimate J* by a weighted average:
$$
\hat J_{\mathrm{IS}} = \frac{1}{m}\sum_{i=1}^m \omega(\tau^i)\, J(\tau^i), \qquad
\omega(\tau^i) = \prod_{t=0}^{T-1}\frac{\pi(a^i_t\mid s^i_t)}{\pi_0(a^i_t\mid s^i_t)}, \qquad
J(\tau^i) = \sum_{t=0}^{T-1}\gamma^t r^i_t.
$$
One can construct non-asymptotic confidence bounds based on \hat J_IS using variants of concentration inequalities (Thomas, 2015; Thomas et al., 2015b). Unfortunately, a key problem with this IS estimator is that the importance weight ω(τ^i) is a product of density ratios over time, which tends to cause an explosion in variance when the trajectory length T is large. Although improvements can be made by using per-step and self-normalized weights (Precup, 2001), or control variates (Jiang & Li, 2016; Thomas & Brunskill, 2016), the curse of horizon remains a key issue for classical IS-based estimators (Liu et al., 2018a). Moreover, due to the time dependency between the transition pairs inside each trajectory, the non-asymptotic concentration bounds can only be applied at the trajectory level and hence decay with the number m of independent trajectories at an O(1/√m) rate, even though m can be small in practice.
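The unbiasedness of the trajectory-wise IS estimator above can be checked by exact enumeration on a toy problem. A minimal sketch (my own illustration; the two-action setup, the reward r_t = a_t, and the state-independent policies are simplifying assumptions, not from the paper):

```python
import numpy as np
from itertools import product

gamma, T = 0.95, 3
p0, p1 = 0.5, 0.8   # behavior / target probabilities of taking action 1

def prob(seq, p):
    # Probability of an action sequence under a Bernoulli(p) policy
    return np.prod([p if a else 1 - p for a in seq])

J_true = 0.0   # on-policy value under the target policy
J_IS = 0.0     # IS identity: E_{pi_0}[ w(tau) J(tau) ]
for seq in product([0, 1], repeat=T):
    J = sum(gamma ** t * a for t, a in enumerate(seq))   # reward r_t = a_t
    w = prob(seq, p1) / prob(seq, p0)                    # trajectory-wise weight
    J_true += prob(seq, p1) * J
    J_IS += prob(seq, p0) * w * J

# IS reweighting recovers the on-policy value exactly (in expectation)
assert np.isclose(J_IS, J_true)
```

While the expectation is exact, the weight w(τ) is a product of T per-step ratios, so the variance of the Monte Carlo estimator grows rapidly with T, which is the "curse of horizon" described above.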
We could in principle apply concentration inequalities for Markov chains (e.g., Paulin, 2015) to the time-dependent transition pairs, but such inequalities require an upper bound on certain mixing coefficients of the Markov chain, which are unknown and hard to estimate empirically. Our work addresses these limitations by constructing a non-asymptotic bound that decays with the number n = mT of transition pairs, without requiring known behavior policies or independent trajectories. Other closely related infinite-horizon approaches include Yin & Wang (2020), as well as the DICE family (e.g., Nachum et al., 2019a;b; Zhang et al., 2020a; Wen et al., 2020; Zhang et al., 2020b). These methods are based on either estimating the value function or the stationary visitation distribution, which are shown to form a primal-dual relation (Tang et al., 2020a; Uehara et al., 2020; Jiang & Huang, 2020) that we elaborate on in depth in Section 3.

Infinite-Horizon, Behavior-Agnostic OPE Our work is closely related to the recent advances in infinite-horizon and behavior-agnostic OPE, including, for example, Liu et al. (2018a); Feng et al. (2019); Tang et al. (2020a); Mousavi et al. (2020); Liu et al. (2020); Yang et al. (2020b); Xie et al. (2019).

Besides Feng et al. (2020), which directly motivated this work, there has been a recent surge of interest in interval estimation for infinite-horizon OPE (e.g., Liu et al., 2018b; Jiang & Huang, 2020; Duan et al., 2020; Dai et al., 2020; Feng et al., 2020; Tang et al., 2020b; Yin et al., 2020; Lazic et al., 2020). For example, Dai et al. (2020) develop an asymptotic confidence bound (CoinDice) for DICE estimators under an i.i.d. assumption on the off-policy data; Duan et al. (2020) provide data-dependent confidence bounds based on Fitted Q Iteration (FQI) with linear function approximation when the off-policy data consists of a set of independent trajectories; Jiang & Huang (2020) propose a minimax method closely related to ours but do not provide an analysis of the data error; Tang et al. (2020b) propose a fixed-point algorithm for constructing deterministic intervals of the true value function when the reward and transition models are deterministic and the true value function has bounded Lipschitz norm.

Model-Based Methods Since the model P is the only unknown variable, we can construct an estimator \hat P of P using maximum likelihood estimation or other methods, and plug it into Eq. (1) to obtain a plug-in estimator \hat J = J_{π,\hat P}. This yields the model-based approach to OPE (e.g., Jiang & Li, 2016; Liu et al., 2018b). One can also estimate the uncertainty in J_{π,\hat P} by propagating the uncertainty in \hat P (e.g., Asadi et al., 2018; Duan et al., 2020), but it is hard to obtain non-asymptotic and computationally efficient bounds unless \hat P is a simple linear model. In general, estimating the whole model P can be an unnecessarily complicated intermediate step for the possibly simpler problem of estimating J_{π,P}.
Bootstrapping, Bayes, Distributional RL As a general approach to uncertainty estimation, bootstrapping has been used for interval estimation in RL in various ways (e.g., White & White, 2010; Hanna et al., 2017; Kostrikov & Nachum, 2020; Hao et al., 2021). Bootstrapping is simple and highly flexible, and can be applied to time-dependent data (as arises in RL) using variants of block bootstrapping methods (e.g., Lahiri, 2013; White & White, 2010). However, bootstrapping typically provides only asymptotic guarantees; although non-asymptotic bounds for the bootstrap exist (e.g., Arlot et al., 2010), they are sophisticated, difficult to use in practice, and would require knowing the mixing condition of the dependent data. Moreover, bootstrapping is time-consuming since it requires repeating the whole off-policy evaluation pipeline on a large number of resampled datasets. Bayesian methods (e.g., Engel et al., 2005; Ghavamzadeh et al., 2016b; Yang et al., 2020a) offer another general approach to uncertainty estimation in RL, but they require approximate inference algorithms and do not come with non-asymptotic frequentist guarantees. In addition, distributional RL (e.g., Bellemare et al., 2017) seeks to quantify the intrinsic uncertainty of the Markov decision process, which is orthogonal to the epistemic uncertainty that we consider in off-policy evaluation.



https://github.com/jxx123/simglucose.



Figure 1: Results on Inverted-Pendulum. (a) The confidence intervals (significance level δ = 0.1) of our method (green) and of Feng et al. (2020) (blue) as the data size n varies. (b) The length of the confidence intervals (δ = 0.1) of our method as a function of the data size n. (c) The confidence intervals as we vary the significance level δ (data size n = 5000). (d) The significance level δ vs. the empirical failure rate δ̂ of our confidence intervals in capturing the true expected reward (data size n = 5000). We average over 50 random trials for each experiment.

Figure 2: Results on different environments when we use a significance level of δ = 0.1. The colored bars represent the confidence intervals of different methods (averaged over 50 random trials); the black error bars represent the standard deviation of the endpoints of the intervals over the 50 random trials.


Figure 3: Ablation study on the radius r_Q of the function class Q. The default data collection procedure uses a horizon length of H = 50; the discount factor is γ = 0.95 by default.


Figure 4: Ablation studies on Inverted-Pendulum. We change the temperature τ of the behavior policies in (a), and change the bandwidths of the kernel k of W_o and the kernel k̄ of Q (denoted by h_k) in (b).

Figure 5: Ablation studies on the data collection procedure, as we (a) change the behavior policies, and (b)-(c) change the trajectory lengths. The other settings are the same as that in Figure 3.


