OFFLINE POLICY INTERVAL ESTIMATION WITHOUT SUFFICIENT EXPLORATION OR REALIZABILITY

Abstract

We study the problem of offline policy evaluation (OPE), where the goal is to estimate the value of a given decision-making policy without interacting with the actual environment. In particular, we consider interval-based OPE, where the output is an interval rather than a point, indicating the uncertainty of the evaluation. Interval-based estimation is especially important in OPE since, when the data coverage is insufficient relative to the complexity of the environmental model, any OPE method can be biased even with an infinite sample size. In this paper, we characterize such irreducible biases in terms of the discrepancy between the target policy and the data-sampling distribution, and show that the marginal importance sampling (MIS) estimator achieves the minimax bias with an appropriate importance-weight function. Motivated by this result, we then propose a new interval-based MIS estimator that asymptotically achieves the minimax bias.

1. INTRODUCTION

Offline policy evaluation (OPE) is the art of estimating the value of a given decision-making policy based on offline datasets, without interacting with the actual environment. Since interaction with the environment is often infeasible or expensive in many real-world applications, it is preferable to evaluate policies offline rather than online. In the literature, it is understood from theoretical perspectives that there are two fundamental conditions for OPE to be successful: sufficient exploration, i.e., the coverage of the data-sampling distribution over the state-action space relative to the target policy, and realizability, i.e., knowledge of a correct environmental model with bounded complexity. In particular, if neither of these two conditions is met in a certain manner, it is known that OPE is never sample efficient, i.e., a prohibitively large sample is required to make the estimation reasonably accurate (Wang et al., 2020; Zanette, 2021). In practice, given a problem instance of OPE, consisting of an environment and a dataset, it is difficult to confirm that these conditions hold or to modify the problem instance so that they hold, which makes the existing theoretical guarantees less practical. Towards practical OPE, our research objective is to develop a theoretically sound value estimator that does not assume these two conditions.

Towards this objective, we first analyze the statistical performance of OPE methods when the two assumptions do not hold (Section 4). The key quantity is the information-theoretic worst-case bias of the value estimator (Eq. (5)) and its minimum, termed the minimax bias (Eq. (6)), which is positive when there exist multiple environments that are indistinguishable given only a problem instance of OPE. In fact, we show that the minimax bias can be non-zero if we do not assume the two conditions (Corollary 4.2). This suggests that, without the two assumptions, there exists a problem instance for which no point-based value estimator is reliable.

Given the existence of such irreducible bias, we propose an alternative formulation of offline policy evaluation called minimax-bias offline policy interval estimation (minimax-bias OPI), where the objective is to estimate the shortest possible interval containing the true value, instead of a point estimate (Section 5). Since our characterization of the minimax bias allows us to define the optimal interval (Definition 5.1), minimax-bias OPI is formulated as the problem of estimating the optimal interval (Problem 5.1). We then provide a theoretical foundation for solving the minimax-bias OPI based on the marginal importance sampling estimator (Section 6). The key result is that the optimal importance weight minimizes the worst-case bias, i.e., it attains the minimax bias.
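As a concrete reference point, the sketch below shows a plain MIS point estimate in a discounted setting: rewards from the offline dataset are reweighted by an importance-weight function and averaged. The dataset layout, the weight function w, the discount factor gamma, and the name mis_value_estimate are illustrative assumptions for exposition; this is not the interval estimator developed in Sections 5 and 6.

```python
import numpy as np

# Minimal sketch of a marginal importance sampling (MIS) point estimate of a
# policy's value in a discounted setting. The dataset layout, the weight
# function `w`, and the normalization by (1 - gamma) are illustrative
# assumptions, not the paper's interval-based estimator.

def mis_value_estimate(dataset, w, gamma=0.99):
    """Return a point estimate of the target policy's value.

    dataset : iterable of (state, action, reward) tuples sampled from the
              data-sampling (behavior) distribution.
    w       : importance-weight function w(s, a), intended to approximate the
              ratio of the target policy's discounted state-action occupancy
              to the data-sampling distribution.
    gamma   : discount factor.
    """
    weighted_rewards = [w(s, a) * r for (s, a, r) in dataset]
    # The sample mean estimates E_mu[w(S, A) * R]; dividing by (1 - gamma)
    # rescales this per-step average to a discounted-return scale.
    return float(np.mean(weighted_rewards)) / (1.0 - gamma)
```

With the exact occupancy ratio as w, such an estimate is unbiased, but when the data-sampling distribution does not sufficiently cover the target policy's occupancy, the weight is only partially identifiable from the data, and the resulting bias cannot be removed even with infinitely many samples. This irreducible bias is what the interval-based formulation in Section 5 is designed to report rather than hide.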

