THE IMPORTANCE OF PESSIMISM IN FIXED-DATASET POLICY OPTIMIZATION

Abstract

We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms. Our core contribution is a unified conceptual and mathematical framework for the study of algorithms in this regime. This analysis reveals that for naïve approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement: in order to guarantee that we select a policy which is near-optimal, we may need the dataset to be informative of the value of every policy. To avoid this, algorithms can follow the pessimism principle, which states that we should choose the policy which acts optimally in the worst possible world. We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and derive families of algorithms which follow this principle. These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments.

1. INTRODUCTION

We consider fixed-dataset policy optimization (FDPO), in which a dataset of transitions from an environment is used to find a policy with high return.[1] We compare FDPO algorithms by their worst-case performance, expressed as high-probability guarantees on the suboptimality of the learned policy. It is perhaps obvious that in order to maximize worst-case performance, a good FDPO algorithm should select a policy with high worst-case value. We call this the pessimism principle of exploitation, as it is analogous to the widely-known optimism principle (Lattimore & Szepesvári, 2020) of exploration.[2]

Our main contribution is a theoretical justification of the pessimism principle in FDPO, based on a bound that characterizes the suboptimality incurred by an FDPO algorithm. We further demonstrate how this bound may be used to derive principled algorithms. Note that the core novelty of our work is not the idea of pessimism, which is an intuitive concept that appears in a variety of contexts; rather, our contribution is a set of theoretical results rigorously explaining how pessimism matters in the specific setting of FDPO. An example conveying the intuition behind our results can be found in Appendix G.1.

We first analyze a family of non-pessimistic naïve FDPO algorithms, which estimate the environment from the dataset via maximum likelihood and then apply standard dynamic programming techniques. We prove a bound showing that the worst-case suboptimality of these algorithms is guaranteed to be small only when the dataset contains enough data that we are certain about the value of every possible policy. This stringent requirement stems from the outsized impact of value overestimation errors on suboptimality, sometimes called the optimizer's curse (Smith & Winkler, 2006). It is a fundamental consequence of ignoring the disconnect between the true environment and the picture painted by our limited observations. Importantly, it is not reliant on errors introduced by function approximation.
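To make the contrast concrete, the sketch below implements a naïve maximum-likelihood FDPO algorithm on a tabular MDP, alongside a pessimistic variant that penalizes each state-action value by an uncertainty term shrinking with visit count. This is a minimal illustration only: the function names and the 1/sqrt(n) penalty form are our assumptions for exposition, not the algorithms derived later in the paper.

```python
import numpy as np

def estimate_model(dataset, n_states, n_actions):
    """Maximum-likelihood estimate of a tabular MDP from (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for s, a, r, s2 in dataset:
        counts[s, a, s2] += 1
        reward_sums[s, a] += r
    n_sa = counts.sum(axis=2)                     # visit count for each (s, a)
    visited = n_sa > 0
    P = np.zeros_like(counts)
    P[visited] = counts[visited] / n_sa[visited, None]
    R = np.zeros_like(reward_sums)
    R[visited] = reward_sums[visited] / n_sa[visited]
    return P, R, n_sa

def naive_fdpo(dataset, n_states, n_actions, gamma=0.9, iters=500):
    """Value iteration on the maximum-likelihood model (no pessimism)."""
    P, R, _ = estimate_model(dataset, n_states, n_actions)
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)         # Bellman optimality backup
    return Q.argmax(axis=1), Q

def pessimistic_fdpo(dataset, n_states, n_actions, gamma=0.9, iters=500, c=1.0):
    """As above, but each (s, a) value is penalized by an uncertainty term
    that shrinks with visit count -- one simple instance of pessimism.
    The c / sqrt(n) penalty is illustrative, not the paper's derived bound."""
    P, R, n_sa = estimate_model(dataset, n_states, n_actions)
    penalty = c / np.sqrt(np.maximum(n_sa, 1))
    penalty[n_sa == 0] = c / (1.0 - gamma)        # never-visited pairs: worst case
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = (R - penalty) + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1), Q
```

Because the pessimistic backup subtracts a nonnegative penalty at every step, its value estimates lower-bound the naïve ones, so the greedy policy it returns avoids actions whose high estimated value rests on little data.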



[1] We use the term fixed-dataset policy optimization to emphasize the computational procedure; this setting has also been referred to as batch RL (Ernst et al., 2005; Lange et al., 2012) and more recently, offline RL (Levine et al., 2020). We emphasize that this is a well-studied setting, and we are simply choosing to refer to it by a more descriptive name.
[2] The optimism principle states that we should select a policy with high best-case value.

