OFFLINE POLICY SELECTION UNDER UNCERTAINTY

Abstract

The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive with existing state-of-the-art approaches in confidence interval estimation. More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to an arbitrary downstream policy selection metric, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates.

1. INTRODUCTION

Off-policy evaluation (OPE) (Precup et al., 2000) in the context of reinforcement learning (RL) is often motivated as a way to mitigate risk in practical applications where deploying a policy might incur significant cost or safety concerns (Thomas et al., 2015a) . Indeed, by providing methods to estimate the value of a target policy solely from a static offline dataset of logged experience in the environment, OPE can help practitioners determine whether a target policy is or is not safe and worthwhile to deploy. Still, in many practical applications the ability to accurately estimate the online value of a specific policy is less of a concern than the ability to select or rank a set of policies (one of which may be the currently deployed policy). This problem, related to but subtly different from OPE, is offline policy selection (Doroudi et al., 2017; Paine et al., 2020; Kuzborskij et al., 2020) , and it often arises in practice. For example, in recommendation systems, a practitioner may have a large number of policies trained offline using various hyperparameters, while cost and safety constraints only allow a few of those policies to be deployed as live experiments. Which policies should be chosen to form the small subset that will be evaluated online? This and similar questions are closely related to OPE, and indeed, the original motivations for OPE were arguably with offline policy selection in mind (Precup et al., 2000; Jiang, 2017) , the idea being that one can use estimates of the value of a set of policies to rank and then select from this set. Accordingly, there is a rich literature of approaches for computing point estimates of the value of the policy (Dudík et al., 2011; Bottou et al., 2013; Jiang & Li, 2015; Thomas & Brunskill, 2016; Nachum et al., 2019; Zhang et al., 2020; Uehara & Jiang, 2020; Kallus & Uehara, 2020; Yang et al., 2020) . 
Because the offline dataset is finite and collected under a logging policy that may be different from the target policy, prior OPE methods also estimate high-confidence lower and upper bounds on a target policy's value (Thomas et al., 2015a; Kuzborskij et al., 2020; Bottou et al., 2013; Hanna et al., 2016; Feng et al., 2020; Dai et al., 2020; Kostrikov & Nachum, 2020) . These existing approaches may be readily applied to our recommendation systems example, by using either mean or lower-confidence bound estimates on each candidate policy to rank the set and picking the top few to deploy online. However, this naïve approach ignores crucial differences between the problem setting of OPE and the downstream evaluation criteria a practitioner prioritizes. For example, when choosing a few policies out of a large number of available policies, a recommendation systems practitioner may have a number of objectives in mind: The practitioner may strive to ensure that the policy with the overall highest groundtruth value is within the small subset of selected policies (akin to top-k precision). Or, in scenarios where the practitioner is sensitive to large differences in achieved value, a more relevant downstream metric may be the difference between the largest groundtruth value within the k selected policies compared to the groundtruth of the best possible policy overall (akin to top-k regret). With these or other potential offline policy selection metrics, it is far from obvious that ranking according to OPE estimates is ideal (Doroudi et al., 2017) . The diversity of potential downstream metrics in offline policy selection presents a challenge to any algorithm that yields a point estimate for each policy. Any one approach to computing point estimates will necessarily be sub-optimal for some policy selection criteria. To circumvent this challenge, we propose to compute a belief distribution over groundtruth values for each policy. 
Specifically, with the posteriors for the distribution over value for each policy calculated, one can use a straightforward procedure that takes estimation uncertainty into account to rank the policy candidates according to arbitrarily complicated downstream metrics. While this belief distribution approach to offline policy selection is attractive, it also presents its own challenge: how should one estimate a distribution over a policy's value in the pure offline setting? In this work, we propose Bayesian Distribution Correction Estimation (BayesDICE) for off-policy estimation of a belief distribution over a policy's value. BayesDICE works by estimating posteriors over correction ratios for each state-action pair (correcting for the distribution shift between the off-policy data and the target policy's on-policy distribution). A belief distribution of the policy's value may then be estimated by averaging these correction distributions over the offline dataset, weighted by rewards. In this way, BayesDICE builds on top of the state-of-the-art DICE point estimators (Nachum et al., 2019; Zhang et al., 2020; Yang et al., 2020), while uniquely leveraging posterior regularization to satisfy chance constraints in a Markov decision process (MDP). As a preliminary experiment, we show that BayesDICE is highly competitive with existing frequentist approaches when applied to confidence interval estimation. More importantly, we demonstrate BayesDICE's application in offline policy selection under different utility measures on a variety of discrete and continuous RL tasks. Among other findings, our policy selection experiments suggest that, while the conventional wisdom focuses on using lower bound estimates to select policies (due to safety concerns) (Kuzborskij et al., 2020), policy ranking based on the lower bound estimates does not always lead to lower (top-k) regret.
Furthermore, when other metrics of policy selection are considered, such as top-k precision, being able to sample from the posterior enables significantly better policy selection than only having access to the mean or confidence bounds of the estimated policy values.

2. PRELIMINARIES

We consider an infinite-horizon Markov decision process (MDP) (Puterman, 1994) denoted as $M = \langle S, A, R, T, \mu_0, \gamma \rangle$, which consists of a state space, an action space, a deterministic reward function,[1] a transition probability function, an initial state distribution, and a discount factor $\gamma \in (0, 1]$. In this setting, a policy $\pi(a_t|s_t)$ interacts with the environment starting at $s_0 \sim \mu_0$ and receives a scalar reward $r_t = R(s_t, a_t)$ as the environment transitions into a new state $s_{t+1} \sim T(s_t, a_t)$ at each timestep $t$. The value of a policy is defined as

$\rho(\pi) := (1 - \gamma)\, E_{s_0, a_t, s_t}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. (1)
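The value definition above can be illustrated with a small Monte Carlo sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the reward sequences are assumed to come from on-policy rollouts truncated to a finite length, and $\gamma < 1$ is assumed so that the $(1 - \gamma)$ normalization is meaningful.

```python
import numpy as np


def normalized_discounted_return(rewards, gamma):
    """Compute (1 - gamma) * sum_t gamma^t * r_t for one finite reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return (1.0 - gamma) * float(np.sum(discounts * rewards))


def monte_carlo_policy_value(reward_sequences, gamma):
    """Average the normalized returns of several (truncated) rollouts."""
    returns = [normalized_discounted_return(r, gamma) for r in reward_sequences]
    return float(np.mean(returns))


# Hypothetical rollouts: one with constant reward 1, one with constant reward 0.
example_value = monte_carlo_policy_value([[1, 1, 1], [0, 0, 0]], gamma=0.5)
```

Truncating the rollouts biases the estimate toward zero by at most $\gamma^H$ times the largest reward magnitude for horizon $H$, which is one reason the offline estimators discussed later avoid relying on on-policy rollouts.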

2.1. OFFLINE POLICY SELECTION

We formalize the offline policy selection problem as providing a ranking $O \in \mathrm{Perm}([1, N])$ over a set of candidate policies $\{\pi_i\}_{i=1}^N$ given only a fixed dataset $D = \{x^{(j)} := (s_0^{(j)}, s^{(j)}, a^{(j)}, r^{(j)}, s'^{(j)})\}_{j=1}^n$ where $s_0^{(j)} \sim \mu_0$, $(s^{(j)}, a^{(j)}) \sim d^D$ are samples of an unknown distribution $d^D$, $r^{(j)} = R(s^{(j)}, a^{(j)})$, and $s'^{(j)} \sim T(s^{(j)}, a^{(j)})$.[2] One approach to the offline policy selection problem is to first characterize the value of each policy (Eq. 1, also known as the normalized per-step reward) via OPE under some utility function $u(\pi)$ that leverages a point estimate (or lower bound) of the policy value; i.e., $O \leftarrow \mathrm{ArgSortDescending}(\{u(\pi_i)\}_{i=1}^N)$.

2.2. SELECTION EVALUATION

A proposed ranking $O$ will eventually be evaluated according to how well its policy ordering aligns with the policies' groundtruth values. In this section, we elaborate on potential forms of this evaluation score. To this end, let us denote the groundtruth distribution of returns of policy $\pi_i$ by $Z(\cdot|\pi_i)$. In other words, $Z(\cdot|\pi_i)$ is a distribution over $\mathbb{R}$ such that

$z \sim Z(\cdot|\pi_i) \equiv \left[ z := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \cdot R(s_t, a_t);\ s_0 \sim \mu_0,\ a_t \sim \pi_i(s_t),\ s_{t+1} \sim T(s_t, a_t) \right]$. (2)

Note that $E_{Z(\cdot|\pi_i)}[z] = \rho(\pi_i)$. As ).

• Beyond expected return: One may define the above ranking scores in terms of statistics of $Z(\cdot|\pi_i)$ other than the groundtruth means $\rho(\pi_i)$. For example, in safety-critical applications, one may be concerned with the variance of the policy return. Accordingly, one may define CVaR analogues to top-k precision and regret.

For simplicity, we will restrict our attention to ranking scores which only depend on the average return of $\pi_i$. To this end, we will use $\rho_i$ as shorthand for $\rho(\pi_i)$ and assume that the ranking score $S$ is a function of $O$ and $\{\rho_i\}_{i=1}^N$.

2.3. RANKING SCORE SIMULATION FROM THE POSTERIOR

It is not clear whether ranking according to vanilla OPE (either mean or confidence based) is ideal for any of the ranking scores above, including, for example, top-1 regret in the presence of uncertainty. However, if one has access to an approximate belief distribution over the policy's values, there is a simple sampling-based approach that can be used to find a near-optimal ranking (optimality depending on how accurate the belief distribution is) with respect to an arbitrary specified downstream ranking score, and we elaborate on this procedure here.

First, note that if we have access to the true groundtruth policy values $\{\rho_i\}_{i=1}^N$, and the ranking score function $S$, we can calculate the score value of any ranking $O$ and find the ranking $O^*$ that optimizes this score. However, we are limited to a finite offline dataset and the full return distributions are unknown. In this offline setting, we propose to instead compute a belief distribution $q(\{\rho_i\}_{i=1}^N)$, and then we can optimize over the expected ranking score $E_q[S(O, \{\rho_i\}_{i=1}^N)]$ as shown in Algorithm 1. This algorithm simulates realizations of the groundtruth values $\{\rho_i\}_{i=1}^N$ by sampling from the belief distribution $q(\{\rho_i\}_{i=1}^N)$, and in this way estimates the expected realized ranking score $S$ over all possible rankings $O$. As we will show empirically, matching the selection process (the $S$ used in Algorithm 1) to the downstream ranking score naturally leads to improved performance. The question now becomes how to effectively learn a belief distribution over $\{\rho_i\}_{i=1}^N$.

(Footnote on the i.i.d. sampling assumption: this is a common assumption in the OPE literature (Uehara & Jiang, 2020) and may be relaxed in some cases by assuming a fast mixing time (Nachum et al., 2019).)

Figure 1: The belief distributions of $\rho_1$ and $\rho_2$ depend on the uncertainty induced from the finite offline data ($D$ and $D'$). A user might prefer $\pi_2$ only if $p(\rho_2 < \rho_1) < \xi$ (a choice of $S$). OPE based on mean point estimates would select $\pi_2$ in either case as $\rho_2$ has the greater mean. Sampling from the posterior belief in OfflineSelect allows simulation of any ranking score under $S$, aligning policy selection with the user's choice of $S$.

Algorithm 1 OfflineSelect
Inputs: Posteriors q({ρ_i}_{i=1}^N), ranking score Ŝ
Initialize O*; L*                                  ▷ Track best score
for O in Perm([1, ..., N]) do
    L = 0
    for j = 1 to n do
        sample {ρ_i^(j)}_{i=1}^N ∼ q({ρ_i}_{i=1}^N)
        L = L + Ŝ({ρ_i^(j)}_{i=1}^N, O)             ▷ Sum up sample scores
    end for
    if L < L* then
        L* = L; O* = O                             ▷ Update best ranking/score
    end if
end for
return O*, L*
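The loop in Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation, and several choices are assumptions: the names `offline_select` and `top_k_regret` are hypothetical; the score is treated as a loss (lower is better), matching the `L < L*` update; `top_k_regret` follows the verbal description of top-k regret given in the introduction; the same n posterior draws are reused for every candidate ranking instead of being resampled per ranking; and all N! permutations are enumerated, which is only practical for small N.

```python
import itertools
import numpy as np


def top_k_regret(values, ranking, k=1):
    """Best value overall minus the best value among the first k ranked policies."""
    values = np.asarray(values, dtype=float)
    return float(values.max() - values[list(ranking[:k])].max())


def offline_select(posterior_samples, score_fn):
    """Pick the ranking with the lowest summed score over posterior draws.

    posterior_samples: array of shape (n, N); row j is one joint draw of the
        N policy values from the belief distribution q.
    score_fn: callable (values, ranking) -> float, where lower is better.
    """
    samples = np.asarray(posterior_samples, dtype=float)
    num_policies = samples.shape[1]
    best_ranking, best_total = None, float("inf")
    for ranking in itertools.permutations(range(num_policies)):
        # Sum the simulated score of this ranking over all posterior draws.
        total = sum(score_fn(row, ranking) for row in samples)
        if total < best_total:
            best_ranking, best_total = ranking, total
    return best_ranking, best_total


# Two hypothetical joint draws for two policies; policy 1 is better in both.
samples = np.array([[0.1, 0.9], [0.2, 0.8]])
best_ranking, best_total = offline_select(samples, top_k_regret)
```

Swapping `top_k_regret` for another loss changes the selection criterion without touching the posterior, which is the flexibility the belief-distribution view is meant to provide.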

3. BAYESDICE

To learn a belief distribution over {ρ i } N i=1 , we pursue a Bayesian approach to infer an approximate posterior distribution given prior beliefs. While model-based Bayesian approaches exist (e.g., (Deisenroth & Rasmussen, 2011) and variants (Parmas et al., 2018) ), they typically suffer from compounding error, so a model-free approach is preferable. However, Bayesian inference is challenging in this model-free scenario because the likelihood function is not easy to compute, as it is defined over infinite horizon returns. Therefore, we first investigate several approaches to representing policy value, before identifying a novel posterior estimator that is computationally attractive and can support a broad range of ranking scores for downstream tasks.

3.1. POLICY RANKING SCORE REPRESENTATION

In practice, the downstream task of ranking or selecting policy candidates might require more than the value expectation, but also other properties of the policy value distribution. To ensure that the necessary distribution properties are computable, we first consider the class of ranking scores we would like to support:

• Offline: Since we focus on ranking policies given only offline data, the ranking score should not depend on on-policy samples.
• Flexible: Since the downstream task may utilize different ranking scores, the representation of the policy value should be sufficient to support their efficient computation.

With these considerations in mind, we review ways to represent the value of a policy $\pi$. Define $Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ and $d^\pi(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d_t^\pi(s, a)$, with $d_t^\pi(s, a) = P\left(s_t = s, a_t = a \mid s_0 \sim \mu_0, \forall i < t, a_i \sim \pi(\cdot|s_i), s_{i+1} \sim T(\cdot|s_i, a_i)\right)$, which are the state-action values and stationary visitations of $\pi$. These satisfy the recursions

$Q^\pi(s, a) = R(s, a) + \gamma \cdot P^\pi Q^\pi(s, a)$, where $P^\pi Q(s, a) := E_{s' \sim T(s,a),\, a' \sim \pi(s')}[Q(s', a')]$; (3)

$d^\pi(s, a) = (1 - \gamma)\mu_0(s)\pi(a|s) + \gamma \cdot P^\pi_* d^\pi(s, a)$. (4)

From these identities, the policy value can be expressed in two equivalent ways:

$\rho(\pi) = (1 - \gamma) \cdot E_{a_0 \sim \pi(s_0),\, s_0 \sim \mu_0}[Q^\pi(s_0, a_0)]$ (5)
$\quad\;\;\, = E_{(s,a) \sim d^\pi}[r(s, a)]$. (6)

Current OPE methods are generally based on one of the representations (1), (5) or (6). For example, importance sampling (IS) estimators (Precup et al., 2000; Murphy et al., 2001; Dudík et al., 2011) are based on (1); LSTDQ (Lagoudakis & Parr, 2003) is a representative algorithm for fitting $Q^\pi$ and thus based on (5); the recent DICE algorithms (Nachum & Dai, 2020; Yang et al., 2020) estimate the stationary density ratio $\zeta(s, a) := d^\pi(s, a) / d^D(s, a)$ so that $\rho(\pi) = E_{d^D}[\zeta \cdot r]$, and are thus based on (6). Among the three strategies, the third is the most promising in our scenario.
First, IS suffers from an exponential growth in variance (Liu et al., 2018) and further requires knowledge of the behavior policy. In contrast, the functions $Q^\pi$ and $d^\pi$ are duals (Nachum & Dai, 2020; Yang et al., 2020), and share common behavior-agnostic and minimax properties (Uehara & Jiang, 2020). However, estimation of $Q^\pi$ assumes a ranking score with a linear dependence on $R(s, a)$, and therefore, even if we estimate $Q^\pi$ accurately, it is still impossible to evaluate ranking scores that involve $(1 - \gamma)\, E\left[\sum_{t=0}^{\infty} \gamma^t \sigma(r_t)\right]$, where $\sigma(\cdot) : \mathbb{R} \to \mathbb{R}$ is a nonlinear function (unless one learns a different Q function for each possible ranking score, which may be computationally expensive). By contrast, ranking scores with such nonlinear components can be easily computed from the stationary density ratio as $E_{d^D}[\zeta \cdot \sigma(r)]$. Given these considerations, the estimator via stationary density ratio satisfies both requirements: it enjoys statistical advantages in the offline setting and is flexible for downstream ranking score calculation. Therefore, we focus on a Bayesian estimator for $\zeta^\pi$ next.
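To make the flexibility argument concrete, the following sketch computes both the plain value estimate and a nonlinear-reward score from the same density ratios. It assumes the ratios have already been estimated and are supplied as an array aligned with the logged rewards; the function name and the toy numbers are hypothetical.

```python
import numpy as np


def ratio_weighted_score(zeta, rewards, sigma=None):
    """Estimate E_{d^D}[zeta * sigma(r)] by a sample average over the dataset.

    zeta: estimated density ratios zeta(s_j, a_j), one per logged transition.
    rewards: logged rewards r_j, aligned with zeta.
    sigma: optional elementwise transform of the reward; identity if None.
    """
    zeta = np.asarray(zeta, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    transformed = rewards if sigma is None else sigma(rewards)
    return float(np.mean(zeta * transformed))


# Hypothetical ratios and rewards for two logged transitions.
zeta = [0.5, 1.5]
rewards = [1.0, 2.0]
value_estimate = ratio_weighted_score(zeta, rewards)              # sigma(r) = r
squared_score = ratio_weighted_score(zeta, rewards, np.square)    # sigma(r) = r^2
```

The same `zeta` serves every choice of `sigma`, whereas a Q-function based estimator would need to be refit for each transformed reward.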

3.2. STATIONARY RATIO POSTERIOR ESTIMATION

Recall that to apply a simple Bayesian approach to infer the posterior of $\zeta^\pi$, one requires a log-likelihood function, but such a quantity is not readily calculable in our scenario from the given data. Therefore, we develop an alternative, computationally tractable approach by considering an optimization view of Bayesian inference under a chance constraint, which allows us to derive the posterior over a set of stochastic equations.

Let $f(\cdot)$ denote a non-negative convex function with $f(0)$ achieving the minimum 0, e.g., $f(x) = x^\top x$. Also let $\Delta_d(s, a) := (1 - \gamma)\mu_0(s)\pi(a|s) + \gamma \cdot P^\pi_* d(s, a) - d(s, a)$. Starting with (4) we reduce the $|S||A|$ many constraints for the stationary distribution of $\pi$ to a single feature-vector-based constraint for $\zeta$:

$\Delta_d(s, a) = 0,\ \forall (s, a) \in S \times A \ \Rightarrow\ \langle \phi, \Delta_d \rangle = 0$ (7)
$\Rightarrow\ f(\langle \phi, \Delta_d \rangle) = 0 \ \Rightarrow\ \max_{\beta \in H_\phi} \beta^\top \langle \phi, \Delta_d \rangle - f^*(\beta) = 0$ (8)
$\Rightarrow\ \max_{\beta \in H_\phi} E_{d^D}\left[\zeta(s, a) \cdot \beta^\top (\gamma \phi(s', a') - \phi(s, a))\right] + (1 - \gamma)\, E_{\mu_0 \pi}[\beta^\top \phi] - f^*(\beta) = 0$, (9)

where $H_\phi$ denotes the bounded Hilbert space with the feature mappings $\phi$, $d^D$ denotes the distribution generating the empirical experience, and we have used Fenchel duality in the middle step. The function $\phi(\cdot, \cdot) : S \times A \to \mathbb{R}^m$ is a feature mapping, with $m$ possibly infinite. Then the condition $\langle \phi, \Delta_d \rangle = 0$ can be understood as matching the two distributions $(1 - \gamma)\mu_0(s)\pi(a|s) + \gamma \cdot P^\pi_* d(s, a)$ and $d(s, a)$ in terms of their embeddings (Smola et al., 2007), which is a generalization of the approximation methods in (De Farias & Van Roy, 2003; Lakshminarayanan et al., 2017). In particular, when $|S||A|$ is finite and we set $\phi(s, a) = \delta_{s,a}$, where $\delta_{s,a} \in \{0, 1\}^{|S||A|}$ is an indicator vector with a single 1 at position $(s, a)$ and 0 otherwise, we are matching the distributions pointwise. The feature map $\phi(s, a)$ can also be set to a general reproducing kernel $k((s, a), \cdot) \in \mathbb{R}^\infty$. As long as the kernel $k(\cdot, \cdot)$ is characteristic, the embeddings will match if and only if the distributions are identical almost surely (Sriperumbudur et al., 2011).

Given that the experience was collected by some other means, i.e., $D \sim d^D$, the constraint for $\zeta$ in (7) might not hold exactly. Therefore, we consider a feasible set $\zeta \in \{\zeta : \ell(\zeta, D) \le \epsilon\}$ where

$\ell(\zeta, D) := \max_{\beta \in H_\phi} \hat{E}_D\left[\zeta(s, a) \cdot \beta^\top (\gamma \phi(s', a') - \phi(s, a))\right] - f^*(\beta) + (1 - \gamma)\, E_{\mu_0 \pi}[\beta^\top \phi]$. (10)

Note that $\ell(\zeta) \ge 0$ since $H_\phi$ is symmetric. We expect the posterior of $\zeta$, $q(\zeta)$, to concentrate most of its mass on this set and balance the prior. Formally, this means

$\min_q \mathrm{KL}(q \| p) - \lambda \xi, \quad \text{s.t.}\ P_q(\ell(\zeta) \le \epsilon) \ge \xi$, (11)

where the chance constraint considers the probability of the feasibility of $\zeta$ under the posterior. This formulation can be equivalently rewritten as

$\min_q \mathrm{KL}(q \| p) - \lambda P_q(\ell(\zeta) \le \epsilon)$. (12)

Then, by applying Markov's inequality, i.e., $P_q(\ell(\zeta) \le \epsilon) = 1 - P_q(\ell(\zeta) > \epsilon) \ge 1 - E_q[\ell(\zeta)] / \epsilon$, we can obtain an upper bound on (12) as

$\min_q \mathrm{KL}(q \| p) + \frac{\lambda}{\epsilon} E_q[\ell(\zeta, D)]$ (13)
$= \min_{q(\zeta)} \max_{q(\beta|\zeta)} \mathrm{KL}(q \| p) + \frac{\lambda}{\epsilon} E_{q(\zeta) q(\beta|\zeta)}\left[\hat{E}_D\left[\zeta(s, a) \cdot \beta^\top (\gamma \phi(s', a') - \phi(s, a))\right] - f^*(\beta) + (1 - \gamma)\, E_{\mu_0 \pi}[\beta^\top \phi]\right]$,

where the equality follows by interchangeability (Shapiro et al., 2014; Dai et al., 2017). We amortize the optimization for $\beta$ w.r.t. each $\zeta$ to a distribution $q(\beta|\zeta)$ to reduce the computational effort.

Due to space limitations, we postpone the discussion of the important properties of BayesDICE, including the parametrization of the posteriors, the variants of BayesDICE for undiscounted MDPs and alternatives of the log-likelihoods, and the connections to the vanilla Bayesian stochastic processes, to Appendix A; please refer to the details there.

Finally, note that with the posterior approximation for $\zeta_i$, denoting the estimate for candidate policy $i$, we can draw posterior samples of $\hat{\rho}_i$ by drawing a sample $\zeta_i \sim q(\zeta_i)$ and computing $\hat{\rho}_i = \frac{1}{n} \sum_{(s,a,r) \in D} \zeta_i(s, a)\, r$. This defines a posterior distribution over $\hat{\rho}_i$ and we further assume that the distributions are independent for each policy, so $q(\{\rho_i\}_{i=1}^N) = \prod_i q(\rho_i)$. This defines the necessary inputs for OfflineSelect to determine a ranking of the candidate policies.
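The final sampling step can be sketched as below. The posterior family here is an assumption made purely for illustration: the ratio is taken to be linear in features, $\zeta(s, a) = \phi(s, a)^\top w$, with an independent Gaussian posterior over the weights $w$. The actual parametrization is deferred to Appendix A and may differ; the function and argument names are hypothetical.

```python
import numpy as np


def sample_policy_values(features, rewards, w_mean, w_std, num_samples, rng):
    """Draw posterior samples of rho_hat = (1/n) * sum_j zeta(s_j, a_j) * r_j.

    Assumes (for illustration only) zeta = features @ w with
    w ~ Normal(w_mean, diag(w_std ** 2)).

    features: array (n, m) of phi(s_j, a_j) for the logged transitions.
    rewards: array (n,) of logged rewards.
    """
    features = np.asarray(features, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    num_features = features.shape[1]
    # One weight vector per posterior sample: shape (num_samples, m).
    weights = rng.normal(w_mean, w_std, size=(num_samples, num_features))
    # Ratios for every sample and transition: shape (num_samples, n).
    zeta = weights @ features.T
    # Reward-weighted average over the dataset: shape (num_samples,).
    return zeta @ rewards / len(rewards)


rng = np.random.default_rng(0)
# Toy tabular case with indicator features and a zero-variance posterior,
# so every sample equals the point estimate (2 * 1 + 1 * 3) / 2.
rho_samples = sample_policy_values(
    features=[[1.0, 0.0], [0.0, 1.0]],
    rewards=[1.0, 3.0],
    w_mean=[2.0, 1.0],
    w_std=[0.0, 0.0],
    num_samples=4,
    rng=rng,
)
```

Under the independence assumption stated above, stacking one such sample vector per candidate policy as columns yields joint draws of $\{\rho_i\}_{i=1}^N$ suitable as input to OfflineSelect.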

4. RELATED WORK

We categorize the relevant related work into three categories: offline policy selection, off-policy evaluation, and Bayesian inference for policy evaluation.

Offline policy selection

The decision making problem we formalize as offline policy selection is a member of a set of problems in RL referred to as model selection. Previously, this term has been used to refer to state abstraction selection (Jiang, 2017; Jiang et al., 2015) as well as learning algorithm and feature selection (Foster et al., 2019; Pacchiano et al., 2020). More relevant to our proposed notion of policy selection are a number of previous works which use model selection to refer to the problem of choosing a near-optimal Q-function from a set of candidate approximation functions (Fard & Pineau, 2010; Farahmand & Szepesvári, 2011; Irpan et al., 2019; Xie & Jiang, 2020). In this case, the evaluation metric is typically defined as the $L_\infty$ norm of the difference between $Q$ and the state-action value function of the optimal policy $Q^*$. While one can relate this evaluation metric to the sub-optimality (i.e., regret) of the policy induced by the Q-function, we argue that our proposed policy selection problem is both more general, since we allow for the use of policy evaluation metrics other than sub-optimality, and more practically relevant, since in many practical applications the policy may not be expressible as the argmax of a Q-function. Lastly, the offline policy selection problem we describe is arguably a formalization of the problem approached in Paine et al. (2020) and referred to as hyperparameter selection. In contrast to this previous work, we not only formalize the decision problem, but also propose a method to directly optimize the policy selection evaluation metric. Offline policy selection has also been studied by Doroudi et al. (2017), who consider what properties a point estimator should have in order for it to yield good rankings in terms of a notion of ranking score referred to as fairness.

Off-policy evaluation

Off-policy evaluation (OPE) is a highly active area of research. While the original motivation for OPE was in the pursuit of policy selection (Precup et al., 2000; Jiang, 2017), the field has historically almost exclusively focused on the related but distinct problem of estimating the online value (accumulated rewards) of a single target policy. In addition to a plethora of techniques for providing point estimates of this groundtruth value (Dudík et al., 2011; Bottou et al., 2013; Jiang & Li, 2015; Thomas & Brunskill, 2016; Kallus & Uehara, 2020; Nachum et al., 2019; Zhang et al., 2020; Yang et al., 2020), there is also a growing body of literature that uses frequentist principles to derive high-confidence lower bounds for the value of a policy (Bottou et al., 2013; Thomas et al., 2015b; Hanna et al., 2016; Kuzborskij et al., 2020; Feng et al., 2020; Dai et al., 2020; Kostrikov & Nachum, 2020). As our results demonstrate, ranking or selecting policies based on either their estimated mean or lower confidence bounds can at times be sub-optimal, depending on the evaluation criteria.

Bayesian inference for policy evaluation

Our proposed method for policy selection relies on Bayesian principles to estimate a posterior distribution over the groundtruth policy value. While many Bayesian-inspired methods have been proposed for policy optimization (Deisenroth & Rasmussen, 2011; Parmas et al., 2018), especially in the context of exploration (Houthooft et al., 2016; Dearden et al., 2013; Kolter & Ng, 2009), relatively few have been proposed for policy evaluation. In one instance, Fard & Pineau (2010) derive PAC-Bayesian bounds on estimates of the Bellman error of a candidate Q-value function. In contrast to this work, we use our BayesDICE algorithm to estimate a distribution over target policy value, and this distribution allows us to directly optimize arbitrary downstream policy selection metrics.

5. EXPERIMENTS

We empirically evaluate the performance of BayesDICE on confidence interval estimation (which can be used for policy selection) and offline policy selection under linear and neural network posterior parametrizations. We consider the tabular tasks Bandit, Taxi (Dietterich, 1998), and FrozenLake (Brockman et al., 2016), and the continuous-control task Reacher (Brockman et al., 2016). As we show below, BayesDICE outperforms existing methods for confidence interval estimation, producing accurate coverage while maintaining tight interval width, suggesting that BayesDICE achieves accurate posterior estimation, being robust to approximation errors and potentially misaligned Bayesian priors in practice. Moreover, in offline policy selection settings, matching the selection algorithm (Algorithm 1) to the ranking score (enabled by estimating the posterior) shows clear advantages over ranking based on point estimates or confidence intervals on a variety of ranking scores. See Appendix C for additional results and implementation details.

5.1. CONFIDENCE INTERVAL ESTIMATION

Before applying BayesDICE to policy selection, we evaluate the BayesDICE approximate posterior by computing the accuracy of the confidence intervals it produces. We compare BayesDICE against a known set of confidence interval estimators based on concentration inequalities. To compute these baselines, we first use weighted (i.e., self-normalized) per-step importance sampling (Thomas & Brunskill, 2016) to compute a policy value estimate for each logged trajectory. These trajectories provide a finite sample of value estimates. We use self-normalized importance sampling since it has been found to yield better empirical results in MDPs despite being biased (Liu et al., 2018; Nachum et al., 2019) ; for Bandit results without self-normalization, see Figure 5 in Appendix C. We then use empirical Bernstein's inequality (Thomas et al., 2015b) , bias-corrected bootstrap (Thomas et al., 2015a) , and Student's t-test to derive lower and upper high-confidence bounds on these estimates. We further consider Bayesian Deep Q-Networks (BDQN) (Azizzadenesheli et al., 2018) with an average empirical reward prior in the function approximation setting, which applies Bayesian linear regression to the last layer of a deep Q-network to learn a distribution of Q-values. Both BayesDICE and BDQN output a distribution of parameters, from which we conduct Monte Carlo sampling and use the resulting samples to compute a confidence interval at a given confidence level. We plot the empirical coverage and interval width at different confidence levels in Figure 2 . To compute the empirical interval coverage, we conduct 200 trials with randomly sampled datasets. The interval coverage is the proportion of the 200 intervals that contains the true value of the target policy. The interval log-width is the median of the log width of the 200 intervals. 
As shown in Figure 2 , BayesDICE's coverage closely follows the intended coverage (black dotted line), while maintaining narrow interval width across all tasks considered. This suggests that BayesDICE's posterior estimation is highly accurate, being robust to approximation errors and potentially misaligned Bayesian priors in practice.
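The coverage and log-width statistics described above can be computed from posterior samples roughly as follows. This is a sketch under assumptions: the interval is taken to be equal-tailed (the text does not specify how the interval is formed from the Monte Carlo samples), each trial is represented by an array of sampled policy values, and the function names are hypothetical.

```python
import numpy as np


def credible_interval(samples, level):
    """Equal-tailed interval containing a `level` fraction of the samples."""
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(samples, [tail, 1.0 - tail])
    return float(lower), float(upper)


def coverage_and_log_width(sample_sets, true_value, level):
    """Empirical coverage and median log-width across trials.

    sample_sets: one array of posterior value samples per trial
        (e.g. one per randomly sampled dataset).
    """
    intervals = [credible_interval(s, level) for s in sample_sets]
    covered = [lower <= true_value <= upper for lower, upper in intervals]
    log_widths = [np.log(upper - lower) for lower, upper in intervals]
    return float(np.mean(covered)), float(np.median(log_widths))


# Two hypothetical trials: only the first interval contains the true value 0.5.
trials = [np.linspace(0.0, 1.0, 101), np.linspace(2.0, 3.0, 101)]
coverage, median_log_width = coverage_and_log_width(trials, true_value=0.5, level=0.9)
```

A well-calibrated posterior would give coverage close to `level` when the number of trials is large, which is the diagonal reference line in the coverage plots.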

5.2. POLICY SELECTION

Next, we demonstrate the benefit of matching the policy selection criteria to the ranking score in offline policy selection. Our evaluation is based on a variety of cardinal and ordinal ranking scores defined in Section 2.2. We begin by considering the use of Algorithm 1 with BayesDICE-approximated posteriors. By keeping the BayesDICE posterior fixed, we focus our evaluation on the performance of Algorithm 1. We plot the groundtruth performance of this procedure applied to Bandit and Reacher in Figure 3. These figures compare using different Ŝ to rank the policies according to Algorithm 1 across different downstream ranking scores S. We find that aligning the criteria Ŝ used in Algorithm 1 with the downstream ranking score S is empirically the best approach (Ŝ = S). In contrast, using point estimates such as Mean or Mean ± Std can yield much worse downstream performance. We also see that in the Bandit setting, where we can analytically compute the Bayes-optimal ranking, using aligned ranking scores in conjunction with BayesDICE-approximated posteriors achieves near-optimal performance.

[Figure panel titles: Bandit, Reacher]

Having established BayesDICE's ability to compute accurate posterior distributions as well as the benefit of appropriately aligning the ranking score used in Algorithm 1, we compare BayesDICE to state-of-the-art OPE methods in policy selection. In these experiments, we use Algorithm 1 with posteriors approximated by BayesDICE and Ŝ = S. We compare the use of BayesDICE in this way to ranking via point estimates of DualDICE (Nachum et al., 2019) and other confidence-interval estimation methods introduced in Section 5.1. We present results in Figure 4, in terms of top-k regret and correlation on Bandit and Reacher across different sample sizes and behavior data. BayesDICE outperforms other methods on both tasks. See additional ranking results in Appendix C.

6. CONCLUSION

In this paper, we formally defined the offline policy selection problem, and proposed BayesDICE to first estimate posterior distributions of policy values before using a simulation-based procedure to compute an optimal policy ranking. Empirically, BayesDICE not only provides accurate belief distribution estimation, but also shows excellent performance in policy selection tasks.

Instead of selecting from a set of policy candidates, policy optimization considers all feasible policies and selects optimistically. Specifically, the feasibility of the stationary state-action distribution can be characterized as (18) Then, the posteriors for all valid policies should satisfy $P_q(\ell(\zeta \cdot \pi, D) \le \epsilon) \ge \xi$, with

$\ell(\zeta \cdot \pi, D) := \max_{\beta \in H_\phi} \beta^\top \hat{E}_D\left[\sum_a (\zeta(s, a)\pi(a|s))\, \phi(s) - \gamma (\zeta(s, a)\pi(a|s))\, \phi(s')\right] + (1 - \gamma)\, E_{\mu_0}[\beta^\top \phi] - f^*(\beta)$.

Meanwhile, we will select one posterior from among these posteriors of all valid policies optimistically, i.e.,

$\max_{q(\zeta) q(\pi)} E_q[U(\tau, r, D)] + \lambda_1 \xi - \lambda_2 \mathrm{KL}(q(\zeta) q(\pi) \| p(\zeta, \pi))$ (20)
$\text{s.t.}\ P_q(\ell(\zeta \cdot \pi, D) \le \epsilon) \ge \xi$,

where $E_q[U(\tau, r, D)]$ denotes the optimistic policy score to capture the upper bound of the policy value estimation. For example, the most widely used one is $E_q[U(\tau, r, D)] = E_q \hat{E}_D[\tau \cdot r] + \lambda_u E_q\left[\left(\hat{E}_D[\tau \cdot r] - E_q \hat{E}_D[\tau \cdot r]\right)^2\right]$, where the second term is the empirical variance and is usually known as one kind of "exploration bonus". The whole algorithm then iterates between solving (20) and using the obtained policy to collect data into D in (20). This Exploration-BayesDICE follows the same philosophy as Osband et al. (2019); O'Donoghue et al. (2018), where the variance of the posterior of the policy value is taken into account for exploration. However, there are several significant differences: i), the first and most different is the modeling object, Osband et al. ( 2019 2018) considers fixed finite-horizon case. Therefore, exploration with BayesDICE paves the path for a principled and practical exploration-vs-exploitation algorithm. The regret bound is out of the scope of this paper, and we leave it for future work.

C.1 ENVIRONMENTS AND POLICIES.

Bandit. We create a Bernoulli two-armed bandit with binary rewards, where α controls the proportion of the time the optimal arm is chosen (α = 0 and α = 1 mean never and always choosing the optimal arm, respectively). Our selection experiments are based on 5 target policies with α = [0.75, 0.8, 0.85, 0.9, 0.95].

Reacher. We modify the Reacher task to be infinite horizon, and sample trajectories of length 100 in the behavior data. To obtain different behavior and target policies, we first train a deterministic policy from OpenAI Gym (Brockman et al., 2016) until convergence, and define various policies by converting the optimal policy into a Gaussian policy with the optimal mean and standard deviation 0.4 − 0.3α. Our selection experiments are based on 5 target policies with α = [0.75, 0.8, 0.85, 0.9, 0.95].
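As a rough illustration of how α parametrizes the two policy families described above, the following sketch draws bandit actions and computes the Reacher policy's standard deviation. The function names and the convention that arm 0 is the optimal arm are assumptions made for this example only.

```python
import random

ALPHAS = [0.75, 0.8, 0.85, 0.9, 0.95]  # target-policy values from the text

def bandit_policy_action(alpha, optimal_arm=0, rng=random):
    """Choose the optimal arm with probability alpha, else the other arm.

    Assumes a two-armed bandit with arms labeled 0 and 1 (hypothetical
    labeling; the text does not fix which arm is optimal).
    """
    return optimal_arm if rng.random() < alpha else 1 - optimal_arm

def reacher_policy_std(alpha):
    """Standard deviation 0.4 - 0.3 * alpha of the Gaussian policy."""
    return 0.4 - 0.3 * alpha

rng = random.Random(0)
for alpha in ALPHAS:
    pulls = [bandit_policy_action(alpha, rng=rng) for _ in range(10000)]
    frac_optimal = pulls.count(0) / len(pulls)
    print(alpha, round(frac_optimal, 3), round(reacher_policy_std(alpha), 3))
```

Larger α therefore means a bandit policy that picks the optimal arm more often and a Reacher policy with less action noise around the optimal mean.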



For simplicity, we restrict our analysis to deterministic rewards, and extending our methods to stochastic reward scenarios is straightforward. This tuple-based representation of the dataset is for notational and theoretical convenience, following Dai et al. (2020); Kostrikov & Nachum (2020), among others. In practice, the dataset is usually presented as finite-length trajectories $\{(s_0^{(j)}, a_0^{(j)}, r_0^{(j)}, s_1^{(j)}, \ldots)\}_{j=1}^{m}$, and this can be processed into a dataset of finite samples from $\mu_0$ and from $d^D \times R \times T$. For mathematical simplicity, we assume that the dataset is sampled i.i.d. This



a), where $P^\pi_* d(s, a) := \pi(a|s) \sum_{\tilde{s}, \tilde{a}} T(s|\tilde{s}, \tilde{a})\, d(\tilde{s}, \tilde{a})$. (4)

Figure 2: Confidence interval estimation on Bandit, FrozenLake, Taxi, and Reacher. The y-axis shows the empirical coverage and median log-interval width across 200 trials. BayesDICE exhibits near true coverage while maintaining narrow interval width, suggesting an accurate posterior approximation.

Figure 3: Policy selection using top-k ranking scores compared to mean/confidence ranking approaches on two-armed Bandit and Reacher. In these experiments, we fix the posterior to the one approximated by BayesDICE and evaluate different Ŝ used in Algorithm 1 to compute a policy ranking. We find that using Ŝ = S (i.e., aligning the ranking score in posterior simulation with the groundtruth evaluation) results in better performance than simple point estimates. Interestingly, the lower-bound point estimate almost always performs worse than the mean or the upper bound.

Figure 4: Policy selection evaluation under correlation and regret at top-k in two-armed Bandit (left) and Reacher (right), compared to other methods using point estimates (DualDICE) or high-confidence lower bounds. Please see Appendix C for more results with respect to other downstream metrics.

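(s, a) $= (1 - \gamma)\,\mu_0 + P_* d(s), \ \forall s \in S$, (17) where $P_* d(s) := \sum_{\bar{s}, \bar{a}} T(s|\bar{s}, \bar{a})\, d(\bar{s}, \bar{a})$. Applying the feature mapping for distribution matching, we obtain the constraint for $\zeta \cdot \pi$ with $\zeta(s, a) := \frac{d(s)}{d^D(s, a)}$ as $\max_{\beta \in H_\phi} \beta^\top E_{d^D}\big[\sum_a (\zeta(s, a)\pi(a|s))\,\phi(s) - \gamma\,(\zeta(s, a)\pi(a|s))\,\phi(s')\big] + (1 - \gamma)\,E_{\mu_0}[\beta^\top \phi] - f^*(\beta) = 0$.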

); O'Donoghue et al. (2018) update the Q-function, while we work with the dual representation; ii), BayesDICE is compatible with arbitrary nonlinear function approximators, while Osband et al. (2019); O'Donoghue et al. (2018) consider tabular or linear functions; iii), BayesDICE considers the infinite-horizon MDP, while Osband et al. (2019); O'Donoghue et al. (

Figure 7: Additional k values for top-k ranking on bandit. Ranking results based on Algorithm 1 (blue lines) always perform better than using mean or high-confidence lower bound.

part of the offline policy selection problem, we are given a ranking score S that is a function of a proposed ranking O and groundtruth policy statistics $\{Z(\cdot|\pi_i)\}_{i=1}^{N}$. The ranking score S can take on many forms and is application specific; e.g.,
• top-k precision: This is an ordinal ranking score. The ranking score considers the top k policies in terms of groundtruth means $\rho(\pi_i)$ and returns the proportion of these which appear in the top k spots of O.
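The top-k precision score just described can be sketched in a few lines. This example assumes that a ranking O is given as a list of policy indices ordered from best to worst, and that the groundtruth means are available as a plain list; the function name and the tie-breaking by index are illustrative choices rather than part of the original definition.

```python
def top_k_precision(ranking, true_means, k):
    """Proportion of the true top-k policies found in the top k of `ranking`.

    ranking:    policy indices, best first (the proposed ranking O).
    true_means: groundtruth mean value rho(pi_i) for each policy index i.
    k:          number of top spots to compare.
    """
    # Indices of the k policies with the highest groundtruth means.
    # sorted() is stable, so ties are broken by lower index (an assumption).
    by_true_value = sorted(range(len(true_means)),
                           key=lambda i: true_means[i], reverse=True)
    true_top_k = set(by_true_value[:k])
    proposed_top_k = set(ranking[:k])
    return len(true_top_k & proposed_top_k) / k

true_means = [0.1, 0.5, 0.3, 0.9]   # hypothetical groundtruth values
proposed = [3, 2, 1, 0]             # hypothetical proposed ranking O
print(top_k_precision(proposed, true_means, k=2))  # 0.5
```

Here the true top two policies are indices 3 and 1, while the proposed ranking places 3 and 2 in its top two spots, so one of the two is recovered.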

Appendix A MORE DISCUSSIONS ON BAYESDICE

In this section, we provide more details about BayesDICE.

Remark (parametrization of $q(\zeta)$ and $q(\beta|\zeta)$): We parametrize both $q(\zeta)$ (and the resulting $q(\beta|\zeta)$) as Gaussians with the mean and variance approximated by a multi-layer perceptron (MLP), i.e., $\zeta = \mathrm{MLP}_w(s, a) + \sigma_w \xi$, $\xi \sim N(0, 1)$, where the subscripts $w$ denote the parameters of the MLP.

Remark (connection to Bayesian inference for stochastic processes): Recall that the posterior can be viewed as the solution to an optimization (Zellner, 1988; Zhu et al., 2014; Dai et al., 2016). Eq. (13) is equivalent to defining the log-likelihood to be proportional to $\ell(\zeta, D)$, which is a stochastic process; this includes the Gaussian process (GP), obtained by setting $f^*(\beta) = \frac{1}{2}\beta^\top\beta$ and plugging it back into (13). Although GPs have been applied to RL (Engel et al., 2003; Ghavamzadeh et al., 2016; Azizzadenesheli et al., 2018), these works all focus on priors over the value function, whereas BayesDICE considers general stochastic-process likelihoods, including GPs, for modeling the stationary ratio, which, as we have argued, is more flexible for different selection criteria in downstream tasks.

Remark (auxiliary constraints and undiscounted MDP): As Yang et al. (2020) suggested, the non-negativity and normalization constraints are important for optimization. We use positive neurons to ensure the non-negativity of the mean of $q(\zeta)$. For the normalization, we add an extra term to (13). With the normalization condition introduced, the proposed BayesDICE is ready for the undiscounted MDP by simply setting γ = 1 in (13) together with the above extra term for normalization.

Remark (variants of log-likelihood):

We apply Markov's inequality to (12) to obtain the upper bound (13). In fact, optimization with chance constraints has a rich literature (Ben-Tal et al., 2009), in which many surrogates can be derived from different safe approximations. For example, if q is simple, one can directly calculate the CDF to obtain the probability $P_q(\ell(\zeta) \le \epsilon)$; alternatively, one can exploit other probability inequalities to derive other surrogates, e.g., the conditional value-at-risk and the Bernstein approximation (Nemirovski & Shapiro, 2007). These surrogates lead to better approximations of the chance probability $P_q(\ell(\zeta) \le \epsilon)$, at extra cost in optimization.
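To make the role of Markov's inequality concrete, the sketch below compares the Markov lower bound on a chance probability, $1 - E[\ell]/\epsilon$, with the directly computed coverage on a set of samples. It assumes the loss is non-negative, and it uses exponentially distributed draws purely as a hypothetical stand-in for samples of $\ell(\zeta)$ under q; neither the distribution nor the function names come from the paper.

```python
import random

def markov_lower_bound(losses, eps):
    """Markov lower bound on P(loss <= eps), valid for non-negative losses.

    Markov's inequality gives P(loss >= eps) <= E[loss] / eps, hence
    P(loss <= eps) >= 1 - E[loss] / eps.
    """
    mean_loss = sum(losses) / len(losses)
    return 1.0 - mean_loss / eps

def empirical_coverage(losses, eps):
    """Fraction of sampled losses that satisfy loss <= eps."""
    return sum(1 for loss in losses if loss <= eps) / len(losses)

rng = random.Random(0)
# Hypothetical stand-in for samples of the constraint violation under q.
losses = [rng.expovariate(1.0) for _ in range(10000)]
for eps in (1.0, 2.0, 4.0):
    print(eps, markov_lower_bound(losses, eps), empirical_coverage(losses, eps))
```

The Markov bound never exceeds the empirical coverage but can be quite loose, which is why tighter surrogates may be worth their extra optimization cost.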

B BAYESDICE FOR EXPLORATION VS. EXPLOITATION TRADEOFF

In the main text, we mainly consider using BayesDICE to estimate various ranking scores for both discounted and undiscounted MDPs. In fact, once the posterior of the stationary ratio has been computed, we can also use it to better balance exploration and exploitation in policy optimization.

C.2 DETAILS OF NEURAL NETWORK IMPLEMENTATION

We parametrize the distribution correction ratio as a Gaussian using a deep neural network for the continuous control task. Specifically, we use feed-forward networks with two hidden layers of 64 neurons each and ReLU as the activation function. The networks are trained using the Adam optimizer ($\beta_1 = 0.99$, $\beta_2 = 0.999$) with batch size 2048.

