OFFLINE POLICY COMPARISON WITH CONFIDENCE: BENCHMARKS AND BASELINES

Abstract

Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the risk-versus-coverage trade-off for a class of model-based baseline methods. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.

1. INTRODUCTION

Given historical data from a dynamic environment, how well can we make predictions about future trajectories while also quantifying the uncertainty of those predictions? Our main goal is to drive research toward a positive answer by encouraging work on a specific prediction problem, offline policy comparison with confidence (OPCC). OPCC involves using historical data to answer queries that each ask for: 1) a prediction of which of two policies is better for an initial state and horizon, where the policies, state, and horizon can be arbitrarily specified, and 2) a confidence value for the prediction.

While here we use OPCC for benchmarking uncertainty quantification, it also has utility for both decision support and policy optimization. For decision support, a farm manager may want a prediction for which of two irrigation policies will best match season-level crop goals. A careful farm manager, however, would only take the prediction seriously if it comes with a meaningful measure of confidence. For policy optimization, we may want to search through policy variations to identify variations that confidently improve over others in light of historical data.

Offline reinforcement learning (ORL) (Levine et al., 2020), both for policy evaluation and optimization, offers a number of techniques relevant to decision support and OPCC in particular. One of the key ORL challenges is dealing with uncertainty due to statistical variance and limited coverage of historical data. This recognition has led to rapid progress in ORL, yielding different approaches for addressing uncertainty, e.g., pessimism in the face of uncertainty (Kumar et al., 2020; Buckman et al., 2020; Jin et al., 2021a; Shrestha et al., 2021) or regularizing policy learning toward the historical data (Kumar et al., 2019; Peng et al., 2019; Fujimoto & Gu, 2021; Kostrikov et al., 2021).
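To make the query structure concrete, the following is a minimal toy sketch of answering an OPCC query with an ensemble of dynamics models, in the spirit of the model-based baselines described in the abstract. All names, the 1-D linear dynamics, and the majority-vote confidence rule are illustrative assumptions, not the paper's benchmark code.

```python
import numpy as np

def rollout_return(dynamics, policy, state, horizon):
    """Simulate `policy` from `state` for `horizon` steps under one
    dynamics model; toy reward is the negated distance to the origin."""
    total = 0.0
    for _ in range(horizon):
        action = policy(state)
        state = dynamics(state, action)
        total += -abs(state)
    return total

def answer_query(ensemble, policy_a, policy_b, state, horizon):
    """Answer a query (policy_a, policy_b, state, horizon): predict
    whether policy_a is better, with confidence taken as the fraction
    of ensemble members agreeing with the majority prediction."""
    votes = [rollout_return(d, policy_a, state, horizon)
             > rollout_return(d, policy_b, state, horizon)
             for d in ensemble]
    prediction = sum(votes) > len(votes) / 2
    confidence = sum(v == prediction for v in votes) / len(votes)
    return prediction, confidence

# Ensemble of slightly perturbed linear dynamics models (assumed form).
rng = np.random.default_rng(0)
ensemble = [(lambda s, a, k=rng.normal(1.0, 0.05): k * s + a)
            for _ in range(7)]

aggressive = lambda s: -0.9 * s   # drives the state toward 0 quickly
lazy = lambda s: -0.1 * s         # drives the state toward 0 slowly

pred, conf = answer_query(ensemble, aggressive, lazy, state=2.0, horizon=10)
```

Here `pred` is the binary comparison and `conf` reflects how unanimously the ensemble simulations agree; other confidence rules (e.g., return-gap magnitudes) are equally possible under this interface.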
However, there has been very little work on directly evaluating the uncertainty quantification capabilities embedded in these approaches. Rather, overall ORL performance is typically evaluated, which can be affected by many algorithmic choices that are not directly related to uncertainty quantification. A major motivation for our work is to better measure and understand the underlying uncertainty quantification embedded in popular ORL approaches for offline policy evaluation (OPE).

Contribution. The first contribution of this paper is to develop benchmarks (Section 4) for OPCC derived from existing ORL benchmarks and to suggest metrics (Section 3.3) for the quality of uncertainty quantification. Each benchmark includes: 1) a set of trajectory data D collected in an environment via different types of data collection policies, and 2) a set of queries Q, where each

