WHEN IS OFFLINE POLICY SELECTION FEASIBLE FOR REINFORCEMENT LEARNING?

Anonymous

Abstract

Hyperparameter selection and algorithm selection are critical procedures before deploying reinforcement learning algorithms in real-world applications. However, algorithm-hyperparameter selection prior to deployment requires choosing among candidate policies without online execution, a significant challenge known as offline policy selection. As yet, there is little understanding of the fundamental limitations of the offline policy selection problem. To contribute to this understanding, we investigate in this paper when sample efficient offline policy selection is possible. Since off-policy policy evaluation (OPE) is a natural approach to policy selection, the sample complexity of offline policy selection is upper-bounded by the number of samples needed to perform OPE. In addition, we prove that the sample complexity of offline policy selection is lower-bounded by the sample complexity of OPE. These results imply not only that offline policy selection is effective when OPE is effective, but also that sample efficient policy selection is not possible without additional assumptions that make OPE effective. Moreover, we theoretically study the conditions under which offline policy selection using Fitted Q evaluation (FQE) and the Bellman error is sample efficient. We conclude with an empirical study comparing FQE and Bellman errors for offline policy selection.

1. INTRODUCTION

Learning a policy from offline datasets, known as offline reinforcement learning (RL), has gained popularity due to its practicality. Offline RL is useful for many real-world applications, as learning from online interaction may be expensive or dangerous (Levine et al., 2020). For example, a stock-trading RL agent trained online may incur large losses before learning to perform well, and training self-driving cars in the real world may be too dangerous. Ideally, we want to learn a good policy offline, before deployment.

In practice, offline RL algorithms often have hyperparameters that require careful tuning, and whether we can select effective hyperparameters is perhaps the most important consideration when comparing algorithms (Wu et al., 2019; Kumar et al., 2022). Additionally, performance can differ between algorithms, so algorithm selection is also important (Kumar et al., 2022). When a specific algorithm is run with a specific hyperparameter configuration, it outputs a learned policy and/or a value function. Thus, each algorithm-hyperparameter configuration produces a candidate policy, from which we form a set of candidate policies. The problem of finding the best-performing policy in such a set is called policy selection. While policy selection is typically used for algorithm-hyperparameter selection, it is more general, since each candidate policy can be generated arbitrarily.

The typical way of finding the best-performing policy in RL is to perform several rollouts of each candidate policy in the environment, compute the average return of each policy, and select the policy with the highest average return. This approach is common in industrial settings where A/B testing is available. However, it assumes that such rollouts can be performed, which requires access to the environment or a simulator, a luxury not available in the offline setting. Consequently, in offline RL, Monte Carlo rollouts cannot be used to select among candidate policies, and mechanisms for selecting policies from offline data are needed.
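The online selection procedure described above (rollouts, average returns, argmax) can be sketched as follows. This is a toy illustration, not part of the paper: the environment interface (`env_reset`, `env_step`) and the two candidate policies are hypothetical stand-ins, and the sketch requires exactly the kind of online environment access that the offline setting forbids.

```python
import random

def average_return(policy, env_reset, env_step, n_rollouts=100, horizon=50):
    """Monte Carlo estimate of a policy's value via online rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        state = env_reset()
        ret = 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done = env_step(state, action)
            ret += reward
            if done:
                break
        total += ret
    return total / n_rollouts

def select_policy(candidates, env_reset, env_step):
    """Return the index of the candidate with the highest average return."""
    returns = [average_return(pi, env_reset, env_step) for pi in candidates]
    return max(range(len(candidates)), key=returns.__getitem__)

# Hypothetical one-step environment: action 1 has higher expected reward.
random.seed(0)
env_reset = lambda: 0
def env_step(state, action):
    reward = random.gauss(1.0 if action == 1 else 0.0, 0.1)
    return state, reward, True  # episode ends after one step

candidates = [lambda s: 0, lambda s: 1]  # two candidate policies
best = select_policy(candidates, env_reset, env_step)
```

In the offline setting, `env_step` is unavailable, which is why the paper turns to OPE-based surrogates such as FQE and the Bellman error for scoring candidates.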

