NON-ASYMPTOTIC CONFIDENCE INTERVALS OF OFF-POLICY EVALUATION: PRIMAL AND DUAL BOUNDS

Abstract

Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. OPE is therefore a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. Because the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimate, when applying OPE to make high-stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al. (2019) and a new martingale concentration inequality for the KBL that is applicable to time-dependent data with unknown mixing conditions. Our algorithm makes minimal assumptions on the data and on the function class of the Q-function, and works in behavior-agnostic settings where the data is collected under a mix of arbitrary unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods.

1. INTRODUCTION

Off-policy evaluation (OPE) seeks to estimate the expected reward of a target policy in reinforcement learning (RL) from observational data collected under different policies (e.g., Murphy et al., 2001; Fonteneau et al., 2013; Jiang & Li, 2016; Liu et al., 2018a). OPE plays a central role in applying RL with only observational data and has found important applications in areas such as medicine and self-driving, where interactive "on-policy" data is expensive or even infeasible to collect. A critical challenge in OPE is uncertainty estimation, as reliable confidence bounds are essential for making high-stakes decisions. In this work, we aim to tackle this problem by providing non-asymptotic confidence intervals for the expected value of the target policy. Our method allows us to rigorously quantify the uncertainty of the prediction and hence avoid the dangerous case of being overconfident when making costly and/or irreversible decisions. However, off-policy evaluation per se has remained a key technical challenge in the literature (e.g., Precup, 2000; Thomas & Brunskill, 2016; Jiang & Li, 2016; Liu et al., 2018a), let alone rigorous confidence estimation on top of it. This is especially true when 1) the underlying RL problem has a long or infinite horizon, and 2) the data is collected under arbitrary and unknown algorithms (a.k.a. the behavior-agnostic setting). As a consequence, the collected data can exhibit arbitrary dependency structure, which makes constructing rigorous non-asymptotic confidence bounds particularly challenging. Traditionally, the only approach to providing non-asymptotic confidence bounds in OPE has been to combine importance sampling (IS) with concentration inequalities (e.g., Thomas et al., 2015a;b), which, however, tends to degenerate for long/infinite-horizon problems (Liu et al., 2018a). Furthermore,

