NON-ASYMPTOTIC CONFIDENCE INTERVALS OF OFF-POLICY EVALUATION: PRIMAL AND DUAL BOUNDS

Abstract

Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. OPE is therefore a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimate, when applying OPE to make high-stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al. (2019) and a new martingale concentration inequality of KBL applicable to time-dependent data with unknown mixing conditions. Our algorithm makes minimal assumptions on the data and on the function class of the Q-function, and works in behavior-agnostic settings where the data is collected under a mix of arbitrary, unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods.

1. INTRODUCTION

Off-policy evaluation (OPE) seeks to estimate the expected reward of a target policy in reinforcement learning (RL) from observational data collected under different policies (e.g., Murphy et al., 2001; Fonteneau et al., 2013; Jiang & Li, 2016; Liu et al., 2018a). OPE plays a central role in applying RL with only observational data and has found important applications in areas such as medicine and self-driving, where interactive "on-policy" data is expensive or even infeasible to collect. A critical challenge in OPE is uncertainty estimation, as reliable confidence bounds are essential for making high-stakes decisions. In this work, we aim to tackle this problem by providing non-asymptotic confidence intervals of the expected value of the target policy. Our method allows us to rigorously quantify the uncertainty of the prediction and hence avoid the danger of being overconfident when making costly and/or irreversible decisions. However, off-policy evaluation per se has remained a key technical challenge in the literature (e.g., Precup, 2000; Thomas & Brunskill, 2016; Jiang & Li, 2016; Liu et al., 2018a), let alone obtaining rigorous confidence estimates of it. This is especially true when 1) the underlying RL problem has a long or infinite horizon, and 2) the data is collected under arbitrary and unknown policies (a.k.a. behavior-agnostic). As a consequence, the collected data can exhibit an arbitrary dependency structure, which makes constructing rigorous non-asymptotic confidence bounds particularly challenging. Traditionally, the only approach to providing non-asymptotic confidence bounds in OPE is to combine importance sampling (IS) with concentration inequalities (e.g., Thomas et al., 2015a;b), which, however, tends to degenerate for long/infinite-horizon problems (Liu et al., 2018a).
Furthermore, this approach can neither be applied in behavior-agnostic settings nor effectively handle the complicated time-dependency structure inside individual trajectories; instead, it requires a large number of independently collected trajectories drawn under known policies. In this work, we provide a practical approach for Behavior-agnostic, Off-policy, Infinite-horizon, Non-asymptotic, Confidence intervals based on arbitrarily Dependent data (BONDIC). Our method is motivated by a recently proposed optimization-based (or variational) approach to estimating OPE confidence bounds (Feng et al., 2020), which leverages a tail bound of kernel Bellman statistics (Feng et al., 2019). Our approach achieves a new bound that is an order of magnitude tighter and more computationally efficient than that of Feng et al. (2020). Our improvements rest on two pillars: 1) a new primal-dual perspective on non-asymptotic OPE confidence bounds, which connects to a body of recent work on infinite-horizon value estimation (Liu et al., 2018a; Nachum et al., 2019a; Tang et al., 2020a; Mousavi et al., 2020); and 2) a new, tight concentration inequality on kernel Bellman statistics that applies to behavior-agnostic off-policy data with arbitrary dependency between transition pairs. Empirically, we demonstrate that our method provides reliable and tight bounds on a variety of well-established benchmarks.
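To see intuitively why IS-based bounds degenerate over long horizons, note that a trajectory-level importance weight is a product of T per-step ratios π(a|s)/μ(a|s), whose variance can grow exponentially in T even when each ratio has mean one. A minimal numerical sketch of this effect (the log-normal ratio model and all parameters here are hypothetical illustrations, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_weight_stats(horizon, n_traj=10_000):
    # Hypothetical per-step ratios pi(a|s)/mu(a|s): lognormal with
    # mu = -sigma^2/2 so each ratio has mean exactly 1, mimicking a
    # mild mismatch between target and behavior policies.
    ratios = rng.lognormal(mean=-0.02, sigma=0.2, size=(n_traj, horizon))
    weights = ratios.prod(axis=1)  # trajectory-level IS weight
    return weights.mean(), weights.std()

stats = {T: is_weight_stats(T) for T in (10, 50, 100)}
for T, (m, s) in stats.items():
    print(f"T={T:3d}  mean weight ~ {m:.2f}  std ~ {s:.2f}")
```

The mean weight stays near 1 at every horizon, but the standard deviation blows up as T grows, so concentration inequalities applied to these weights yield increasingly vacuous intervals.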

Related Work

Besides the aforementioned approach based on the combination of IS and concentration inequalities (e.g., Thomas et al., 2015a), bootstrapping methods have also been widely used in off-policy estimation (e.g., White & White, 2010; Hanna et al., 2017; Kostrikov & Nachum, 2020), but the latter are limited to asymptotic bounds. Alternatively, Bayesian methods (e.g., Engel et al., 2005; Ghavamzadeh et al., 2016a) offer a different way to estimate uncertainty in RL, but fail to guarantee frequentist coverage. In addition, distributional RL (Bellemare et al., 2017) seeks to quantify the intrinsic uncertainty inside the Markov decision process, which is orthogonal to the estimation uncertainty that we consider.

Our work is built upon recent advances in behavior-agnostic infinite-horizon OPE, including Liu et al. (2018a); Feng et al. (2019); Tang et al. (2020a); Mousavi et al. (2020), as well as the DICE family (e.g., Nachum et al., 2019a; Zhang et al., 2020a; Yang et al., 2020b). In particular, our method can be viewed as extending the minimax framework of infinite-horizon OPE in the infinite-data regime of Tang et al. (2020a); Uehara et al. (2020); Jiang & Huang (2020) to the non-asymptotic finite-sample regime.

Outline For the rest of the paper, we start with the problem statement in Section 2, and an overview of the two dual approaches to infinite-horizon OPE that are tightly connected to our method in Section 3. We then present our main approach in Section 4 and perform empirical studies in Section 5. The proofs and an abundance of additional discussion can be found in the Appendix.

2. BACKGROUND, DATA ASSUMPTION, PROBLEM SETTING

Consider an agent acting in an unknown environment. At each time step t, the agent observes the current state s_t in a state space S, takes an action a_t ∼ π(· | s_t) in an action space A according to a given policy π; then the agent receives a reward r_t and the state transits to s'_t = s_{t+1}, following an unknown transition/reward distribution (r_t, s_{t+1}) ∼ P(· | s_t, a_t). Assume the initial state s_0 is drawn from a known initial distribution D_0. Let γ ∈ (0, 1) be a discount factor. In this setting, the expected reward of π is defined as J^π := E_π[ Σ_{t=0}^T γ^t r_t | s_0 ∼ D_0 ], which is the expected total discounted reward when we execute π starting from D_0 for T steps. In this work, we consider the infinite-horizon case with T → +∞. Our goal is to provide an interval estimate of J^π in a general and challenging setting with significantly relaxed constraints on the data. In particular, we assume the data is behavior-agnostic and off-policy, meaning that the data can be collected from multiple experiments, each of which can execute a mix of arbitrary, unknown policies, or even follow a non-fixed policy. More concretely, suppose that the model P is unknown, and we have a set of transition pairs D_n = {(s_i, a_i, r_i, s'_i)}_{i=1}^n collected from previous experiments in sequential order, such that for each data point i, (r_i, s'_i) is drawn from the model P(· | s_i, a_i), while (s_i, a_i) is generated by an arbitrary black box given the previous data points. We formalize both the data assumption and the goal below.
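As a concrete illustration of the quantity being estimated, the discounted return of a single trajectory is Σ_t γ^t r_t, and J^π is its expectation over trajectories started from D_0. A minimal sketch (the function name and the example rewards are hypothetical, chosen only to make the definition executable):

```python
import numpy as np

def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t for one observed reward sequence
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# With on-policy rollouts, averaging this over trajectories drawn from D_0
# would estimate J^pi by Monte Carlo; the OPE setting instead requires
# estimating it from off-policy tuples (s_i, a_i, r_i, s'_i) without
# ever executing pi.
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```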




