THE IMPORTANCE OF PESSIMISM IN FIXED-DATASET POLICY OPTIMIZATION

Abstract

We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms. Our core contribution is a unified conceptual and mathematical framework for the study of algorithms in this regime. This analysis reveals that for naïve approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement: in order to guarantee that we select a policy which is near-optimal, we may need the dataset to be informative of the value of every policy. To avoid this, algorithms can follow the pessimism principle, which states that we should choose the policy which acts optimally in the worst possible world. We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and derive families of algorithms which follow this principle. These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments.

1. INTRODUCTION

We consider fixed-dataset policy optimization (FDPO), in which a dataset of transitions from an environment is used to find a policy with high return.1 We compare FDPO algorithms by their worst-case performance, expressed as high-probability guarantees on the suboptimality of the learned policy. It is perhaps obvious that in order to maximize worst-case performance, a good FDPO algorithm should select a policy with high worst-case value. We call this the pessimism principle of exploitation, as it is analogous to the widely-known optimism principle (Lattimore & Szepesvári, 2020) of exploration.2

Our main contribution is a theoretical justification of the pessimism principle in FDPO, based on a bound that characterizes the suboptimality incurred by an FDPO algorithm. We further demonstrate how this bound may be used to derive principled algorithms. Note that the core novelty of our work is not the idea of pessimism, which is an intuitive concept that appears in a variety of contexts; rather, our contribution is a set of theoretical results rigorously explaining how pessimism is important in the specific setting of FDPO. An example conveying the intuition behind our results can be found in Appendix G.1.

We first analyze a family of non-pessimistic naïve FDPO algorithms, which estimate the environment from the dataset via maximum likelihood and then apply standard dynamic programming techniques. We prove a bound which shows that the worst-case suboptimality of these algorithms is only guaranteed to be small when the dataset contains enough data that we are certain about the value of every possible policy. This requirement stems from the outsized impact of value overestimation errors on suboptimality, sometimes called the optimizer's curse (Smith & Winkler, 2006). It is a fundamental consequence of ignoring the disconnect between the true environment and the picture painted by our limited observations. Importantly, it is not reliant on errors introduced by function approximation.
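To make the naïve approach concrete, the following is a minimal tabular sketch: estimate the reward and transition functions by maximum likelihood from the dataset, run standard value iteration on the resulting empirical MDP, and extract the greedy policy. This is an illustrative sketch, not the paper's exact algorithm; the dataset format (tuples of state, action, reward, next state) and the uniform model for unvisited state-actions are assumptions made here for illustration.

```python
import numpy as np

def naive_fdpo(dataset, n_states, n_actions, gamma=0.9, iters=500):
    """Tabular sketch of a naive value-based FDPO algorithm:
    maximum-likelihood model estimation from the dataset, value
    iteration on the empirical MDP, then greedy policy extraction."""
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        rewards[s, a] += r
    n_sa = counts.sum(axis=2)                  # visit counts per (s, a)
    seen = n_sa > 0
    r_D = np.where(seen, rewards / np.maximum(n_sa, 1), 0.0)
    P_D = np.where(seen[..., None],
                   counts / np.maximum(n_sa, 1)[..., None],
                   1.0 / n_states)             # arbitrary uniform guess for unseen (s, a)
    v = np.zeros(n_states)
    for _ in range(iters):                     # standard value iteration
        q = r_D + gamma * P_D @ v
        v = q.max(axis=1)
    return q.argmax(axis=1), v
```

Note that the empirical model is trusted everywhere, including state-actions the dataset barely covers; this is exactly the property the suboptimality analysis above takes issue with.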
We contrast these findings with an analysis of pessimistic FDPO algorithms, which select a policy that maximizes some notion of worst-case expected return. We show that these algorithms do not require datasets which inform us about the value of every policy to achieve small suboptimality, due to the critical role that pessimism plays in preventing overestimation. Our analysis naturally leads to two families of principled pessimistic FDPO algorithms. We prove their improved suboptimality guarantees, and confirm our claims with experiments on a gridworld.

Finally, we extend one of our pessimistic algorithms to the deep learning setting. Recently, several deep-learning-based algorithms for fixed-dataset policy optimization have been proposed (Agarwal et al., 2019; Fujimoto et al., 2019; Kumar et al., 2019; Laroche et al., 2019; Jaques et al., 2019; Kidambi et al., 2020; Yu et al., 2020; Wu et al., 2019; Wang et al., 2020; Kumar et al., 2020; Liu et al., 2020). Our work is complementary to these results, as our contributions are conceptual, rather than algorithmic. Our primary goal is to theoretically unify existing approaches and motivate the design of pessimistic algorithms more broadly. Using experiments in the MinAtar game suite (Young & Tian, 2019), we provide empirical validation for the predictions of our analysis.

The problem of fixed-dataset policy optimization is closely related to the problem of reinforcement learning, and as such, there is a large body of work which contains ideas related to those discussed in this paper. We discuss these works in detail in Appendix E.
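One simple way pessimism can be instantiated in the tabular setting is to penalize each state-action's estimated reward by an uncertainty term before planning, so that rarely-visited state-actions look bad rather than promising. The sketch below is in this spirit but is not one of the paper's derived algorithms: the count-based penalty c/√n(s,a), the penalty cap of 1/(1−γ) for unseen state-actions (valid given R(s, a) ∈ [0, 1]), and all hyperparameters are assumptions made here for illustration.

```python
import numpy as np

def pessimistic_fdpo(dataset, n_states, n_actions, gamma=0.9, c=1.0, iters=500):
    """Tabular sketch of a pessimistic FDPO algorithm: value iteration
    on the empirical MDP with a count-based uncertainty penalty
    subtracted from each estimated reward."""
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        rewards[s, a] += r
    n_sa = counts.sum(axis=2)
    seen = n_sa > 0
    r_D = np.where(seen, rewards / np.maximum(n_sa, 1), 0.0)
    P_D = np.where(seen[..., None],
                   counts / np.maximum(n_sa, 1)[..., None],
                   1.0 / n_states)
    # Penalty shrinks with visit count; unseen pairs get the maximum
    # possible value 1 / (1 - gamma), i.e. full pessimism.
    penalty = np.where(seen, c / np.sqrt(np.maximum(n_sa, 1)), 1.0 / (1 - gamma))
    v = np.zeros(n_states)
    for _ in range(iters):
        q = (r_D - penalty) + gamma * P_D @ v
        v = q.max(axis=1)
    return q.argmax(axis=1)
```

With this penalty, a well-explored mediocre action can be preferred over a barely-observed action whose empirical reward happens to be high, which is precisely the overestimation failure mode the analysis attributes to naïve algorithms.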

2. BACKGROUND

We anticipate most readers will be familiar with the concepts and notation, which are fairly standard in the reinforcement learning literature. In the interest of space, we relegate a full presentation to Appendix A. Here, we briefly give an informal overview of the background necessary to understand the main results.

We represent the environment as a Markov Decision Process (MDP), denoted M := ⟨S, A, R, P, γ, ρ⟩. We assume without loss of generality that R(s, a) ∈ [0, 1], and denote its expectation by r(s, a). ρ represents the start-state distribution. Policies π can act in the environment, represented by the action matrix A^π, which maps each state to the probability of each state-action when following π. Value functions v assign a real value to each state. We use v^π_M to denote the value function which assigns the expected sum of discounted rewards in the environment when following policy π. A dataset D contains transitions sampled from the environment. From a dataset, we can compute the empirical reward and transition functions, r_D and P_D, and the empirical policy, π_D.

An important concept for our analysis is the value uncertainty function, denoted µ^π_{D,δ}, which returns a high-probability upper bound on the error of a value function derived from dataset D. Certain value uncertainty functions are decomposable by states or state-actions, meaning they can be written as the weighted sum of more local uncertainties. See Appendix B for more detail.

Our goal is to analyze the suboptimality of a specific class of FDPO algorithms, called value-based FDPO algorithms, which have a straightforward structure: they use a fixed-dataset policy evaluation (FDPE) algorithm to assign a value to each policy, and then select the policy with the maximum value. Furthermore, we consider FDPE algorithms whose solutions satisfy a fixed-point equation.
Thus, a fixed-point equation defines an FDPE objective, which in turn defines a value-based FDPO objective; we call the set of all algorithms that implement these objectives the family of algorithms defined by the fixed-point equation.
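In code, the structure of a value-based FDPO algorithm is just an argmax over the output of an FDPE subroutine. The sketch below assumes a finite candidate policy set and an evaluation routine with signature fdpe(dataset, policy); both are simplifications made here for illustration, not part of the paper's formal setup.

```python
def value_based_fdpo(fdpe, dataset, policies, rho):
    """Generic value-based FDPO: score each candidate policy with a
    fixed-dataset policy evaluation (FDPE) algorithm, then return the
    policy whose estimated value, averaged over the start-state
    distribution rho, is largest."""
    def score(pi):
        v = fdpe(dataset, pi)                   # v: state -> estimated value
        return sum(rho[s] * v[s] for s in rho)  # E_rho[v^pi_D]
    return max(policies, key=score)
```

Swapping the FDPE subroutine is exactly what distinguishes the naïve family from the pessimistic families analyzed later: the outer argmax is the same in every case.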

3. OVER/UNDER DECOMPOSITION OF SUBOPTIMALITY

Our first theoretical contribution is a simple but informative bound on the suboptimality of any value-based FDPO algorithm. Next, in Section 4, we make this concrete by defining the family of naïve algorithms and invoking this bound. This bound is insightful because it distinguishes the impact of errors of value overestimation from errors of value underestimation, defined as follows.

Definition 1. Consider any fixed-dataset policy evaluation algorithm E, any dataset D, and any policy π. Denote v^π_D := E(D, π). We define the underestimation error as E_ρ[v^π_M − v^π_D] and the overestimation error as E_ρ[v^π_D − v^π_M].

The following lemma shows how these quantities can be used to bound suboptimality.
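For concreteness, both error terms in Definition 1 are direct expectations under ρ once v^π_M and v^π_D are in hand; the toy two-state numbers below are illustrative only, not taken from the paper. Note that for any single fixed policy the two quantities are negatives of one another; their roles differ in the analysis because the optimization step selects policies on the basis of v^π_D.

```python
import numpy as np

# Toy two-state example (numbers are illustrative, not from the paper).
rho = np.array([0.5, 0.5])   # start-state distribution
v_M = np.array([1.0, 2.0])   # true values v^pi_M
v_D = np.array([0.8, 2.5])   # estimated values v^pi_D = E(D, pi)

under = rho @ (v_M - v_D)    # E_rho[v^pi_M - v^pi_D] = -0.15
over  = rho @ (v_D - v_M)    # E_rho[v^pi_D - v^pi_M] =  0.15
```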



1 We use the term fixed-dataset policy optimization to emphasize the computational procedure; this setting has also been referred to as batch RL (Ernst et al., 2005; Lange et al., 2012) and, more recently, offline RL (Levine et al., 2020). We emphasize that this is a well-studied setting, and we are simply choosing to refer to it by a more descriptive name.
2 The optimism principle states that we should select a policy with high best-case value.

