OFFLINE POLICY COMPARISON WITH CONFIDENCE: BENCHMARKS AND BASELINES

Abstract

Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the risk-versus-coverage trade-off for a class of model-based baseline methods. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.

1. INTRODUCTION

Given historical data from a dynamic environment, how well can we make predictions about future trajectories while also quantifying the uncertainty of those predictions? Our main goal is to drive research toward a positive answer by encouraging work on a specific prediction problem, offline policy comparison with confidence (OPCC). OPCC involves using historical data to answer queries that each ask for: 1) a prediction of which of two policies is better for an initial state and horizon, where the policies, state, and horizon can be arbitrarily specified, and 2) a confidence value for the prediction. While here we use OPCC for benchmarking uncertainty quantification, it also has utility for both decision support and policy optimization. For decision support, a farm manager may want a prediction for which of two irrigation policies will best match season-level crop goals. A careful farm manager, however, would only take the prediction seriously if it comes with a meaningful measure of confidence. For policy optimization, we may want to search through policy variations to identify variations that confidently improve over others in light of historical data.

Offline reinforcement learning (ORL) (Levine et al., 2020), both for policy evaluation and optimization, offers a number of techniques relevant to decision support and OPCC in particular. One of the key ORL challenges is dealing with uncertainty due to statistical variance and limited coverage of historical data. This recognition has led to rapid progress in ORL, yielding different approaches for addressing uncertainty, e.g., pessimism in the face of uncertainty (Kumar et al., 2020; Buckman et al., 2020; Jin et al., 2021a; Shrestha et al., 2021) or regularizing policy learning toward the historical data (Kumar et al., 2019; Peng et al., 2019; Fujimoto & Gu, 2021; Kostrikov et al., 2021).
However, there has been very little work on directly evaluating the uncertainty quantification capabilities embedded in these approaches. Rather, overall ORL performance is typically evaluated, which can be affected by many algorithmic choices that are not directly related to uncertainty quantification. A major motivation for our work is to better measure and understand the underlying uncertainty quantification embedded in popular ORL approaches for offline policy evaluation (OPE).

Contribution. The first contribution of this paper is to develop benchmarks (Section 4) for OPCC derived from existing ORL benchmarks and to suggest metrics (Section 3.3) for the quality of uncertainty quantification. Each benchmark includes: 1) a set of trajectory data D collected in an environment via different types of data collection policies, and 2) a set of queries Q, where each query asks which of two provided policies has a larger expected reward with respect to a specified horizon and initial states. Note that our OPCC benchmarks are related to recent benchmarks for offline policy evaluation (OPE) (Fu et al., 2021), which include a policy ranking task similar to OPCC. That work, however, does not propose evaluation metrics and protocols for measuring uncertainty quantification over policy rankings. Further, our query sets Q span a much broader range of initial states than existing benchmarks, which is critical for understanding how uncertainty quantification varies across the wider state space as it relates to the trajectory data D.

Our second contribution is to present a pilot empirical evaluation (Section 5) of OPCC for a class of approaches that use ensembles as the mechanism to capture uncertainty, which is one of the prevalent approaches in ORL. This class uses learned ensembles of dynamics and reward models to produce Monte-Carlo simulations of each policy, which can then be compared in various ways to produce a prediction and confidence value.
Our results for different variations of this class provide evidence that some variations may improve aspects of uncertainty quantification. However, overall, we did not observe sizeable and consistent improvements from most of the considered variations. This suggests that there is significant room for future work aimed at consistent improvement for one or more of the uncertainty-quantification metrics. The benchmarks and baselines are made publicly available¹ with the intention of supporting community expansion over time.
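To make the baseline class concrete, the following is a minimal sketch of how an ensemble of learned dynamics/reward models can turn Monte-Carlo rollouts into a query prediction and confidence value. All names here are hypothetical, and each ensemble member is assumed to expose `step(state, action) -> (next_state, reward)`; the actual baselines aggregate member returns in several different ways.

```python
def answer_pcq(ensemble, policy_a, policy_b, s_a, s_b, horizon, gamma=0.99):
    """Answer a policy comparison query with an ensemble of learned models.

    Returns (prediction, confidence), where prediction is True if the
    ensemble believes policy_b from s_b beats policy_a from s_a over
    `horizon` steps, and confidence is the fraction of members agreeing
    with the majority vote.
    """
    def rollout_return(model, policy, state):
        # Discounted Monte-Carlo return of one simulated rollout.
        total, discount = 0.0, 1.0
        for t in range(horizon):
            action = policy(state, t)            # non-stationary policy
            state, reward = model.step(state, action)
            total += discount * reward
            discount *= gamma
        return total

    # One vote per ensemble member: does policy_b's return exceed policy_a's?
    votes = [rollout_return(m, policy_b, s_b) > rollout_return(m, policy_a, s_a)
             for m in ensemble]
    prediction = sum(votes) > len(votes) / 2     # majority vote over members
    confidence = max(sum(votes), len(votes) - sum(votes)) / len(votes)
    return prediction, confidence
```

Here the confidence is simply the agreement rate among ensemble members; this is one of the "various ways" member simulations can be combined, and the empirical section compares several such choices.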

2. BACKGROUND

We formulate our work in the framework of Markov Decision Processes (MDPs), for which we assume basic familiarity (Puterman, 2014). An MDP is a tuple $M = (S, A, P, R)$, where $S$ is the state space, $A$ is the action space, and $P(s'|s, a)$ is the first-order Markovian transition function that gives the probability of transitioning to state $s'$ given that action $a$ is taken in state $s$. Finally, $R(s, a)$ is a potentially stochastic reward function, which returns the reward for taking action $a$ in state $s$. In this work, we focus on decision problems with a finite horizon $h$, where action selection can depend on the time step. A non-stationary policy $\pi(s, t)$ is a possibly stochastic function that returns an action for the specified state $s$ and time step $t \in \{0, \ldots, h-1\}$. Given an MDP $M$, horizon $h$, and discount factor $\gamma \in [0, 1)$, the value of a policy $\pi$ at state $s$ is denoted by

$$V^{\pi}_M(s, h) = \mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t R(S_t, A_t) \,\middle|\, S_0 = s,\ A_t = \pi(S_t, t)\right],$$

where $S_t$ and $A_t$ are the state and action random variables at time $t$. It is important to note that we gain considerable flexibility by allowing for non-stationary policies. For example, $\pi$ could be an open-loop policy or even a fixed sequence of actions, which are commonly used in the context of model-predictive control (Richards, 2005). Further, we can implicitly represent the action-value function $Q^{\pi}_M(s, a, h)$ for a policy $\pi$ by defining a new non-stationary policy $\pi'$ that takes action $a$ at $t = 0$ and then follows $\pi$ thereafter, which yields $V^{\pi'}_M(s, h) = Q^{\pi}_M(s, a, h)$. For this reason, we will focus exclusively on comparisons in terms of state-value functions without loss of generality.
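The reduction from action values to state values at the end of this section can be sketched as a one-line policy wrapper. This is an illustrative sketch, assuming any policy is represented as a callable `(state, t) -> action`:

```python
def q_via_v(policy, first_action):
    """Build the non-stationary policy pi' that plays `first_action` at
    t = 0 and follows `policy` thereafter, so that evaluating the state
    value of pi' gives the action value Q^{pi}(s, first_action, h)."""
    def wrapped(state, t):
        return first_action if t == 0 else policy(state, t)
    return wrapped
```

Evaluating the wrapped policy's state value with any policy-evaluation routine then yields the action value of the original policy, which is why state-value comparisons suffice without loss of generality.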

3. OFFLINE POLICY COMPARISON WITH CONFIDENCE

In this section, we first introduce the concept of policy comparison queries, which are then used to define the OPCC learning problem. Finally, we discuss metrics used in our OPCC evaluations.

3.1. POLICY COMPARISON QUERIES

We consider the fundamental decision problem of predicting the relative future performance of two policies, which we formalize via policy comparison queries (PCQs). A PCQ is a tuple $q = (s, \pi, \hat{s}, \hat{\pi}, h)_M$, where $s$ and $\hat{s}$ are arbitrary starting states, $\pi$ and $\hat{\pi}$ are policies, $h$ is a horizon, and $M$ is an MDP. The answer to a PCQ is the truth value of $V^{\pi}_M(s, h) < V^{\hat{\pi}}_M(\hat{s}, h)$. That is, a PCQ asks whether the $h$-horizon value of $\hat{\pi}$ started in $\hat{s}$ is greater than that of $\pi$ started in $s$. As motivated in Section 1, PCQs are useful for both human decision support and automated policy optimization. For example, if a farm manager wants information about which of two irriga-



¹ Benchmark and baselines: https://github.com/opcciclr

