OFFLINE POLICY SELECTION UNDER UNCERTAINTY

Abstract

The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to an explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive with existing state-of-the-art approaches in confidence interval estimation. More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to an arbitrary downstream policy selection metric, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower-bound value estimates.

1. INTRODUCTION

Off-policy evaluation (OPE) (Precup et al., 2000) in the context of reinforcement learning (RL) is often motivated as a way to mitigate risk in practical applications where deploying a policy might incur significant cost or safety concerns (Thomas et al., 2015a). Indeed, by providing methods to estimate the value of a target policy solely from a static offline dataset of logged experience in the environment, OPE can help practitioners determine whether a target policy is safe and worthwhile to deploy. Still, in many practical applications the ability to accurately estimate the online value of a specific policy is less of a concern than the ability to select or rank a set of policies (one of which may be the currently deployed policy). This problem, related to but subtly different from OPE, is offline policy selection (Doroudi et al., 2017; Paine et al., 2020; Kuzborskij et al., 2020), and it often arises in practice. For example, in recommendation systems, a practitioner may have a large number of policies trained offline using various hyperparameters, while cost and safety constraints only allow a few of those policies to be deployed as live experiments. Which policies should be chosen to form the small subset that will be evaluated online? This and similar questions are closely related to OPE; indeed, OPE was arguably originally motivated by offline policy selection (Precup et al., 2000; Jiang, 2017), the idea being that one can use estimates of the values of a set of policies to rank and then select from this set. Accordingly, there is a rich literature of approaches for computing point estimates of a policy's value (Dudík et al., 2011; Bottou et al., 2013; Jiang & Li, 2015; Thomas & Brunskill, 2016; Nachum et al., 2019; Zhang et al., 2020; Uehara & Jiang, 2020; Kallus & Uehara, 2020; Yang et al., 2020).
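To make the ranking-based view of policy selection concrete, the following minimal sketch contrasts two common selection criteria. All names and numbers are illustrative: we assume each candidate policy comes with samples from a belief distribution over its value (e.g., as a posterior estimator might produce), and rank either by the mean (a point estimate) or by an empirical high-confidence lower bound. The two criteria can disagree, which previews why the choice of downstream selection metric matters.

```python
import random
import statistics

random.seed(0)

# Hypothetical posterior samples of each candidate policy's value.
# policy_b has the highest mean but also the widest uncertainty.
posterior_samples = {
    "policy_a": [random.gauss(0.50, 0.03) for _ in range(1000)],
    "policy_b": [random.gauss(0.55, 0.20) for _ in range(1000)],
    "policy_c": [random.gauss(0.45, 0.02) for _ in range(1000)],
}

def rank_by_mean(samples):
    # Point-estimate ranking: order policies by posterior mean value.
    return sorted(samples, key=lambda p: statistics.mean(samples[p]),
                  reverse=True)

def rank_by_lower_bound(samples, quantile=0.05):
    # Pessimistic ranking: order policies by an empirical high-confidence
    # lower bound (here, the 5th percentile of the posterior samples).
    def lower(p):
        s = sorted(samples[p])
        return s[int(quantile * len(s))]
    return sorted(samples, key=lower, reverse=True)

print("by mean:       ", rank_by_mean(posterior_samples))
print("by lower bound:", rank_by_lower_bound(posterior_samples))
```

In this toy setup the mean-based ranking prefers the high-variance policy, while the lower-bound ranking penalizes it; neither criterion is universally correct, which motivates keeping the full belief distribution so that any downstream selection metric can be applied.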
Because the offline dataset is finite and collected under a logging policy that may differ from the target policy, prior OPE methods also estimate high-confidence lower and upper bounds on a target policy's value (Thomas et al., 2015a; Kuzborskij et al., 2020; Bottou et al., 2013; Hanna et al., 2016; Feng et al., 2020; Dai et al., 2020; Kostrikov & Nachum, 2020). These existing approaches may be readily applied to our recommendation systems example by using either mean or lower-confidence-bound estimates of each candidate policy's value to rank the set and picking the top few to deploy online. However, this naïve approach ignores crucial differences between the problem setting of OPE and the downstream evaluation criteria a practitioner prioritizes. For example, when choosing a few policies out of a large number of available policies, a recommendation systems practitioner may

