OFFLINE POLICY SELECTION UNDER UNCERTAINTY

Abstract

The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or on high-confidence intervals, access to the full distribution of one's belief over the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to an explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive with existing state-of-the-art approaches to confidence interval estimation. More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to arbitrary downstream policy selection metrics, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower-bound value estimates.

1. INTRODUCTION

Off-policy evaluation (OPE) (Precup et al., 2000) in the context of reinforcement learning (RL) is often motivated as a way to mitigate risk in practical applications where deploying a policy might incur significant cost or safety concerns (Thomas et al., 2015a). Indeed, by providing methods to estimate the value of a target policy solely from a static offline dataset of logged experience in the environment, OPE can help practitioners determine whether a target policy is safe and worthwhile to deploy. Still, in many practical applications the ability to accurately estimate the online value of a specific policy is less of a concern than the ability to select or rank a set of policies (one of which may be the currently deployed policy). This problem, related to but subtly different from OPE, is offline policy selection (Doroudi et al., 2017; Paine et al., 2020; Kuzborskij et al., 2020), and it often arises in practice. For example, in recommendation systems, a practitioner may have a large number of policies trained offline using various hyperparameters, while cost and safety constraints allow only a few of those policies to be deployed as live experiments. Which policies should be chosen to form the small subset that will be evaluated online? This and similar questions are closely related to OPE; indeed, OPE was arguably originally motivated by offline policy selection (Precup et al., 2000; Jiang, 2017), the idea being that one can use estimates of the values of a set of policies to rank and then select from this set. Accordingly, there is a rich literature of approaches for computing point estimates of a policy's value (Dudík et al., 2011; Bottou et al., 2013; Jiang & Li, 2015; Thomas & Brunskill, 2016; Nachum et al., 2019; Zhang et al., 2020; Uehara & Jiang, 2020; Kallus & Uehara, 2020; Yang et al., 2020).
Because the offline dataset is finite and collected under a logging policy that may differ from the target policy, prior OPE methods also estimate high-confidence lower and upper bounds on a target policy's value (Thomas et al., 2015a; Kuzborskij et al., 2020; Bottou et al., 2013; Hanna et al., 2016; Feng et al., 2020; Dai et al., 2020; Kostrikov & Nachum, 2020). These existing approaches may be readily applied to our recommendation systems example, by using either mean or lower-confidence-bound estimates on each candidate policy to rank the set and picking the top few to deploy online. However, this naïve approach ignores crucial differences between the problem setting of OPE and the downstream evaluation criteria a practitioner prioritizes. For example, when choosing a few policies out of a large number of available policies, a recommendation systems practitioner may have a number of objectives in mind. The practitioner may strive to ensure that the policy with the overall highest groundtruth value is within the small subset of selected policies (akin to top-k precision). Or, in scenarios where the practitioner is sensitive to large differences in achieved value, a more relevant downstream metric may be the difference between the largest groundtruth value within the k selected policies and the groundtruth value of the best possible policy overall (akin to top-k regret). With these or other potential offline policy selection metrics, it is far from obvious that ranking according to OPE estimates is ideal (Doroudi et al., 2017).

The diversity of potential downstream metrics in offline policy selection presents a challenge to any algorithm that yields a point estimate for each policy: any one approach to computing point estimates will necessarily be sub-optimal for some policy selection criteria. To circumvent this challenge, we propose to compute a belief distribution over groundtruth values for each policy.
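The two downstream metrics mentioned above can be made concrete with a short sketch. The function names below are ours, not from any library; a ranking is simply a list of policy indices ordered from most to least preferred:

```python
import numpy as np

def topk_precision(ranking, true_values, k):
    """Fraction of the true top-k policies that appear among the first k
    slots of the proposed ranking."""
    true_topk = set(np.argsort(true_values)[::-1][:k])
    return len(true_topk & set(ranking[:k])) / k

def topk_regret(ranking, true_values, k):
    """Gap between the best achievable groundtruth value overall and the
    best groundtruth value among the k policies actually selected."""
    best_overall = np.max(true_values)
    best_selected = max(true_values[i] for i in ranking[:k])
    return best_overall - best_selected
```

Note that the two metrics reward different rankings: top-k regret is zero as soon as any one of the selected policies matches the best, while top-k precision penalizes every strong policy left out of the subset.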
Specifically, with the posteriors over value for each policy calculated, one can use a straightforward procedure that takes estimation uncertainty into account to rank the policy candidates according to arbitrarily complicated downstream metrics. While this belief distribution approach to offline policy selection is attractive, it also presents its own challenge: how should one estimate a distribution over a policy's value in the purely offline setting? In this work, we propose Bayesian Distribution Correction Estimation (BayesDICE) for off-policy estimation of a belief distribution over a policy's value. BayesDICE works by estimating posteriors over correction ratios for each state-action pair (correcting for the distribution shift between the off-policy data and the target policy's on-policy distribution). A belief distribution over the policy's value may then be estimated by averaging these correction distributions over the offline dataset, weighted by rewards. In this way, BayesDICE builds on top of the state-of-the-art DICE point estimators (Nachum et al., 2019; Zhang et al., 2020; Yang et al., 2020), while uniquely leveraging posterior regularization to satisfy chance constraints in a Markov decision process (MDP). As a preliminary experiment, we show that BayesDICE is highly competitive with existing frequentist approaches when applied to confidence interval estimation. More importantly, we demonstrate BayesDICE's application to offline policy selection under different utility measures on a variety of discrete and continuous RL tasks. Among other findings, our policy selection experiments suggest that, while conventional wisdom focuses on using lower-bound estimates to select policies (due to safety concerns) (Kuzborskij et al., 2020), policy ranking based on lower-bound estimates does not always lead to lower (top-k) regret.
Furthermore, when other metrics of policy selection are considered, such as top-k precision, being able to sample from the posterior enables significantly better policy selection than only having access to the mean or confidence bounds of the estimated policy values.
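To illustrate how posterior samples enable selection under an arbitrary downstream metric, the following sketch scores candidate rankings by their expected top-k regret under the belief distribution. The posterior here is a hypothetical Gaussian stand-in for whatever belief distribution an estimator produces, and the function names are ours:

```python
from itertools import permutations

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples over each policy's value:
# rows = candidate policies, columns = posterior draws.
posterior = rng.normal(loc=[0.20, 0.50, 0.45],
                       scale=[0.01, 0.30, 0.02],
                       size=(1000, 3)).T

def expected_topk_regret(order, posterior, k):
    """Monte Carlo estimate of E[max_i v_i - max_{i in top-k} v_i],
    treating each posterior column as one joint draw of the
    groundtruth values v."""
    draws = posterior.T                        # (num_draws, num_policies)
    best_overall = draws.max(axis=1)
    best_selected = draws[:, list(order)[:k]].max(axis=1)
    return float(np.mean(best_overall - best_selected))

# With only a handful of policies, we can score every ranking and pick
# the one minimizing expected top-1 regret.
orders = list(permutations(range(3)))
best_order = min(orders, key=lambda o: expected_topk_regret(o, posterior, k=1))
```

A point estimate such as the posterior mean collapses exactly the information this procedure exploits: a policy with a slightly lower mean but much tighter posterior may be the safer top-1 choice under regret, which no fixed scalar summary can capture for every metric.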

2. PRELIMINARIES

We consider an infinite-horizon Markov decision process (MDP) (Puterman, 1994) denoted as M = ⟨S, A, R, T, µ_0, γ⟩, which consists of a state space, an action space, a deterministic reward function,foot_0 a transition probability function, an initial state distribution, and a discount factor γ ∈ (0, 1]. In this setting, a policy π(a_t|s_t) interacts with the environment starting at s_0 ∼ µ_0 and receives a scalar reward r_t = R(s_t, a_t) as the environment transitions into a new state s_{t+1} ∼ T(s_t, a_t) at each timestep t. The value of a policy is defined as

    ρ(π) := (1 − γ) E_{s_0 ∼ µ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ T(s_t, a_t)} [ Σ_{t=0}^∞ γ^t r_t ].    (1)
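The normalization by (1 − γ) in Eq. 1 keeps ρ(π) on the scale of a single-step reward. A minimal sketch of the per-trajectory quantity inside the expectation (the function name is ours; averaging it over trajectories sampled from π gives a Monte Carlo estimate of ρ(π)):

```python
def normalized_discounted_return(rewards, gamma):
    """Computes (1 - gamma) * sum_t gamma^t * r_t for one sampled
    trajectory, given its reward sequence r_0, r_1, ..."""
    return (1.0 - gamma) * sum(gamma**t * r for t, r in enumerate(rewards))
```

For example, a trajectory with constant reward 1 yields a value approaching 1 as its length grows, regardless of γ, which is the sense in which ρ(π) is a normalized per-step reward.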

2.1. OFFLINE POLICY SELECTION

We formalize the offline policy selection problem as providing a ranking O ∈ Perm([1, N]) over a set of candidate policies {π_i}_{i=1}^N given only a fixed dataset D = {x^(j) := (s_0^(j), s^(j), a^(j), r^(j), s'^(j))}_{j=1}^n, where s_0^(j) ∼ µ_0, (s^(j), a^(j)) ∼ d^D are samples of an unknown distribution d^D, r^(j) = R(s^(j), a^(j)), and s'^(j) ∼ T(s^(j), a^(j)).foot_1 One approach to the offline policy selection problem is to first characterize the value of each policy (Eq. 1, also known as the normalized per-step reward) via OPE under some utility function u(π) that leverages a point estimate (or



foot_0: For simplicity, we restrict our analysis to deterministic rewards; extending our methods to stochastic reward scenarios is straightforward.

foot_1: This tuple-based representation of the dataset is for notational and theoretical convenience, following Dai et al. (2020) and Kostrikov & Nachum (2020), among others. In practice, the dataset is usually presented as finite-length trajectories {(s_0, a_0, r_0, s_1, …)}, and these can be processed into a dataset of finite samples from µ_0 and from d^D × R × T. For mathematical simplicity, we assume that the dataset is sampled i.i.d. This
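The processing step described in the footnote can be sketched as follows. We assume each trajectory is given as a list of (s, a, r) steps, with the successor state read off the next step; the function name and this trajectory encoding are ours, chosen for illustration:

```python
def trajectories_to_tuples(trajectories):
    """Flattens finite-length trajectories, each a list of (s, a, r)
    steps, into the tuple dataset D = {(s_0, s, a, r, s')} used in the
    text. The last step of each trajectory is dropped because its
    successor state is not observed in this encoding."""
    data = []
    for traj in trajectories:
        s0 = traj[0][0]  # the trajectory's initial state, a sample from mu_0
        for (s, a, r), (s_next, _, _) in zip(traj[:-1], traj[1:]):
            data.append((s0, s, a, r, s_next))
    return data
```

Each resulting tuple pairs one sample of the initial-state distribution µ_0 with one sample of d^D × R × T, matching the i.i.d. tuple representation assumed above.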

