OFFLINE POLICY INTERVAL ESTIMATION WITHOUT SUFFICIENT EXPLORATION OR REALIZABILITY

Abstract

We study the problem of offline policy evaluation (OPE), where the goal is to estimate the value of a given decision-making policy without interacting with the actual environment. In particular, we consider interval-based OPE, where the output is an interval rather than a point, indicating the uncertainty of the evaluation. Interval-based estimation is especially important in OPE because, when the data coverage is insufficient relative to the complexity of the environmental model, any OPE method can be biased even with infinite sample size. In this paper, we characterize such irreducible biases in terms of the discrepancy between the target policy and the data-sampling distribution, and show that the marginal importance sampling (MIS) estimator achieves the minimax bias with an appropriate importance-weight function. Motivated by this result, we then propose a new interval-based MIS estimator that asymptotically achieves the minimax bias.
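To fix ideas, the following is a minimal numerical sketch of a marginal importance sampling point estimate on a synthetic dataset. The weight table w, the integer state/action encoding, and all numbers are illustrative assumptions; in practice the weight function must be estimated from data, which is the subject of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline dataset of n transitions (s, a, r) drawn from the behavior
# distribution; states and actions are integer indices for simplicity.
n = 10_000
states = rng.integers(0, 4, size=n)
actions = rng.integers(0, 2, size=n)
rewards = rng.normal(loc=states + actions, scale=0.1)

# Hypothetical marginal importance weights w(s, a): the density ratio
# between the target policy's state-action occupancy and the data
# distribution. Here w is simply given, not estimated.
w = np.ones((4, 2))
w[:, 1] = 2.0                     # target policy favors action 1
w /= w[states, actions].mean()    # normalize weights to average 1

# MIS point estimate: J_hat = (1/n) * sum_i w(s_i, a_i) * r_i
j_hat = np.mean(w[states, actions] * rewards)
```

Because the weights upweight action 1, the estimate exceeds the plain average reward of the behavior data; the irreducible bias studied in this paper arises when no choice of w can be certified against the unknown environment.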

1. INTRODUCTION

Offline policy evaluation (OPE) is the art of estimating the value of a given decision-making policy from offline datasets, without interacting with the actual environment. Since interaction with the environment is often infeasible or expensive in real-world applications, it is preferable to evaluate the value offline rather than online. It is understood from theoretical perspectives that there are two fundamental conditions for OPE to succeed: sufficient exploration, i.e., the coverage of the data-sampling distribution over the state-action space relative to the target policy, and realizability, i.e., knowledge of a correct environmental model with bounded complexity. In particular, if neither of these conditions is met in a certain manner, OPE is never sample efficient, i.e., it takes a prohibitively large sample to make the estimation reasonably accurate (Wang et al., 2020; Zanette, 2021). In practice, given a problem instance of OPE, consisting of an environment and a dataset, it is difficult to confirm that these conditions hold or to modify the problem instance so that they do, making the existing theoretical guarantees less practical.

Towards practical OPE, we set our research objective to develop a theoretically sound value estimator without assuming these two conditions. To this end, we first analyze the statistical performance of OPE methods when the two assumptions do not hold (Section 4). The key quantity is the information-theoretic worst-case bias of the value estimator (Eq. (5)) and its minimum, termed the minimax bias (Eq. (6)), which is positive whenever there exist multiple indistinguishable environments given only a problem instance of OPE. In fact, we show that the minimax bias can be non-zero if we do not assume the two conditions (Corollary 4.2).
This suggests that, without the two assumptions, there exists a problem instance on which no point-based value estimator is reliable. Given the existence of irreducible bias, we propose an alternative formulation of offline policy evaluation called minimax-bias offline policy interval estimation (minimax-bias OPI), where the objective is to estimate the shortest possible interval containing the true value, instead of a point estimate (Section 5). Since our characterization of the minimax bias allows us to define the optimal interval (Definition 5.1), minimax-bias OPI is formulated as the problem of estimating the optimal interval (Problem 5.1). We then provide a theoretical foundation for solving minimax-bias OPI based on the marginal importance sampling estimator (Section 6). The key result is that the optimal importance weight minimizing the distributional Bellman residual (DBR) allows us to construct an approximately optimal interval (Theorem 6.3). This illustrates that our problem setting is well-posed and can be solved under realistic assumptions, provided we can minimize the DBR. Accordingly, we develop a novel algorithm in Section 7 to find the best importance-weight function, which yields an interval estimator applicable even when the two fundamental conditions do not hold (Theorem 7.7). Before proceeding to these results, we introduce basic mathematical notation in the rest of this section, review the related work in Section 2, and introduce useful OPE-specific notation in Section 3.

Mathematical notation. Let I denote the identity operator, and let a∨b := max{a, b} and a∧b := min{a, b} denote the maximum and minimum operators for a, b ∈ R, respectively. Let X be a metric space with Borel algebra Σ. Let B(X) and C(X) be the spaces of real-valued bounded measurable functions and continuous functions on X, respectively, both equipped with the uniform norm ∥f∥_∞ := sup_{x∈X} |f(x)|.
Let M(X) denote the space of finite signed measures on the same space X, equipped with the total variation (TV) norm ∥P∥_TV := sup_{E⁺⊔E⁻=X} {P(E⁺) − P(E⁻)}, where the supremum is taken over measurable partitions of X. In particular, let δ_x ∈ M(X), x ∈ X, denote Dirac's delta measure. For any f ∈ B(X) and any P ∈ M(X), let ⟨f, P⟩ := ∫ f(x) dP(x) be a shorthand for the (signed) expectation of f with respect to P. Let ⊙ denote the importance-weighting operation given by d(f ⊙ P)(x) := f(x) dP(x), f ∈ B(X), P ∈ M(X). Let L¹(P) be the space of functions integrable with respect to P ∈ M(X), i.e., such that ∥f ⊙ P∥_TV < ∞. Let L(V) denote the set of bounded linear operators on a normed vector space V. For any A ∈ L(M(X)), let A* ∈ L(B(X)) denote the conjugate operator such that ⟨A*f, P⟩ = ⟨f, AP⟩ for f ∈ B(X) and P ∈ M(X).
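For intuition, the pairing ⟨f, P⟩, the importance-weighting operation ⊙, and the TV norm all reduce to elementary array operations when P is finitely supported. The following sketch, with arbitrarily chosen support points and signed weights, illustrates this reduction.

```python
import numpy as np

# Finitely supported signed measure P = sum_i p_i * delta_{x_i},
# represented by support points xs and (possibly negative) weights ps.
xs = np.array([0.0, 1.0, 2.0])
ps = np.array([0.5, -0.2, 0.7])    # a signed measure, not a probability

def f(x):
    return x ** 2                   # a bounded function on the support

# Pairing <f, P> = ∫ f dP reduces to a weighted sum.
pairing = np.sum(f(xs) * ps)        # 0.5*0 + (-0.2)*1 + 0.7*4 = 2.6

# Importance weighting f ⊙ P simply reweights each point mass.
weighted_ps = f(xs) * ps            # weights of the measure f ⊙ P

# TV norm of a discrete measure: the sum of absolute weights.
tv_norm = np.sum(np.abs(ps))        # 0.5 + 0.2 + 0.7 = 1.4
```

In this discrete setting the conjugate relation ⟨A*f, P⟩ = ⟨f, AP⟩ is just the familiar identity (Aᵀf)ᵀp = fᵀ(Ap) for a matrix A acting on the weight vector.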

2. RELATED WORK

The problem of estimating an interval containing the true value is known as offline policy interval estimation (OPI). This section reviews the existing studies on OPI, dividing the previous OPI methods into two categories, non-asymptotic and asymptotic methods (see Table 1 for a summary of the comparison), and then discusses our contribution to the literature.

The non-asymptotic methods typically put their emphasis on the validity of the interval at any finite sample size, where an interval is valid if it contains the true value J(π). For instance, Feng et al. (2020; 2021) compute intervals that contain the true policy value with high probability, under the realizability of the policy Q-function q^π. Jiang and Huang (2020) also proposed an interval estimator that is valid under the more relaxed realizability condition that either the policy Q-function q^π or the marginal density-ratio function w^π is realizable. One limitation of this approach is that the theoretical understanding of the tightness of the interval is often unclear or partial. Another limitation is that these methods tend to require realizability with known complexity. This requirement is undesirable for practical use; if we used too complex a hypothesis class, such as a reproducing kernel Hilbert space with infinite radius, the resultant interval would be trivial and thus non-informative.

The asymptotic methods focus on the asymptotically dominant term of the uncertainty in the large-sample limit, which typically allows us to understand their behavior, especially their tightness, in depth. For instance, Kallus and Uehara (2020) and Shi et al. (2021) gave confidence-interval estimators that achieve the efficiency lower bound. Bootstrap estimators (Hao et al., 2021) also enable us to compute asymptotically exact confidence intervals in a more flexible manner.
One major limitation is that they assume that both the sufficient-exploration condition and the realizability of q^π and w^π hold, which can hardly be validated in real-world applications. These assumptions are essential to their analyses because they focus on estimating the asymptotic variance of order O(n^{-1/2}), assuming the bias is negligible. Therefore, these methods are not applicable to our setting, where an asymptotic bias of order O(1) dominates the asymptotic variance.

In this study, we take the asymptotic approach, but with a focus on estimating the bias rather than the variance, because the bias is dominant in our setting, where sufficient exploration and realizability do not hold at all. Our contributions are threefold. First, we characterize the theoretical lower bound of the asymptotic bias through asymptotic analysis, which serves as a theoretical

