WHEN IS OFFLINE POLICY SELECTION FEASIBLE FOR REINFORCEMENT LEARNING?

Anonymous

Abstract

Hyperparameter selection and algorithm selection are critical procedures before deploying reinforcement learning algorithms in real-world applications. However, algorithm-hyperparameter selection prior to deployment requires selecting policies offline, without online execution, a significant challenge known as offline policy selection. As yet, there is little understanding of the fundamental limitations of the offline policy selection problem. To contribute to our understanding of this problem, we investigate in this paper when sample efficient offline policy selection is possible. Since off-policy policy evaluation (OPE) is a natural approach for policy selection, the sample complexity of offline policy selection is upper-bounded by the number of samples needed to perform OPE. In addition, we prove that the sample complexity of offline policy selection is also lower-bounded by the sample complexity of OPE. These results imply not only that offline policy selection is effective when OPE is effective, but also that sample efficient policy selection is not possible without additional assumptions that make OPE effective. Moreover, we theoretically study the conditions under which offline policy selection using fitted Q evaluation (FQE) and the Bellman error is sample efficient. We conclude with an empirical study comparing FQE and Bellman errors for offline policy selection.

1. INTRODUCTION

Learning a policy from offline datasets, known as offline reinforcement learning (RL), has gained in popularity due to its practicality. Offline RL is useful for many real-world applications, as learning from online interaction may be expensive or dangerous (Levine et al., 2020). For example, training a stock-trading RL agent online may incur large losses before it learns to perform well, and training self-driving cars in the real world may be too dangerous. Ideally, we want to learn a good policy offline, before deployment. In practice, offline RL algorithms often have hyperparameters that require careful tuning, and whether we can select effective hyperparameters is perhaps the most important consideration when comparing algorithms (Wu et al., 2019; Kumar et al., 2022). Additionally, performance can differ between algorithms, so algorithm selection is also important (Kumar et al., 2022). When a specific algorithm is run with a specific hyperparameter configuration, it outputs a learned policy and/or a value function. Thus, each algorithm-hyperparameter configuration produces a candidate policy, from which we create a set of candidate policies. The problem of finding the best-performing policy from a set of candidate policies is called policy selection. While policy selection is typically used for algorithm-hyperparameter selection, it is more general, since each candidate policy can be arbitrarily generated. The typical way of finding the best-performing policy in RL is to perform several rollouts for each candidate policy in the environment, compute the average return for each policy, and then select the policy that produced the highest average return. This approach is often used in industrial problems where A/B testing is available. However, it assumes that these rollouts can be performed, which necessarily requires access to the environment or a simulator, a luxury not available in the offline setting.
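The rollout-based selection procedure described above can be sketched as follows. This is a minimal illustration, not the paper's method; `env_rollout` is a hypothetical simulator call, and the toy policies and their true values are invented for the example. The online access it requires is exactly what the offline setting forbids.

```python
import random

def mc_policy_selection(env_rollout, policies, n_rollouts=100):
    """Select the candidate policy with the highest average Monte Carlo return.

    env_rollout(policy) is a hypothetical call that runs one episode in the
    environment and returns its return; this online access is unavailable
    in the offline setting.
    """
    def avg_return(pi):
        return sum(env_rollout(pi) for _ in range(n_rollouts)) / n_rollouts
    return max(policies, key=avg_return)

# Toy check: three "policies" whose true expected returns are 0.2, 0.9, 0.5.
random.seed(0)
true_values = {"a": 0.2, "b": 0.9, "c": 0.5}
rollout = lambda pi: true_values[pi] + random.gauss(0, 0.1)
best = mc_policy_selection(rollout, ["a", "b", "c"], n_rollouts=200)
print(best)  # "b"
```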
As such, in offline RL, Monte Carlo rollouts cannot be performed to select candidate policies, and mechanisms to select policies with offline data are needed. A common approach to policy selection in the offline setting, known as offline policy selection, is to perform off-policy policy evaluation (OPE) to estimate the value of candidate policies from a fixed dataset and then select the policy with the highest estimated value. Typical OPE algorithms include direct methods such as the fitted Q evaluation (FQE) estimator (Le et al., 2019), as well as the importance sampling (IS) estimator (Sutton & Barto, 2018), the doubly robust estimator (Jiang & Li, 2016), the model-based estimator (Mannor et al., 2007), and the marginalized importance sampling estimator (Xie et al., 2019). Empirically, Tang & Wiens (2021) and Paine et al. (2020) provide experimental results on offline policy selection using OPE; similarly, Doroudi et al. (2017) and Yang et al. (2022) use OPE estimators for policy selection. Offline policy selection has been mainly associated with OPE, since the two problems are closely related. It is known that OPE is a hard problem that requires an exponential number of samples to evaluate a given policy in the worst case (Wang et al., 2020), so OPE can be unreliable for policy selection. A natural follow-up question is: do the hardness results for OPE also hold for offline policy selection? If the answer is yes, then we would need to consider additional assumptions to enable sample efficient offline policy selection. Intuitively, however, offline policy selection should be easier than OPE, since accurately estimating the value of each policy may not be necessary to rank the policies. There is mixed evidence on whether alternatives to OPE can be effective. Tang & Wiens (2021) show empirically that TD errors perform poorly because they provide overestimates, and conclude that OPE is necessary. On the other hand, Zhang & Jiang (2021) perform policy selection without OPE, by selecting the value function that is closest to the optimal value function; however, their method relies on the optimal value function being in the set of candidate value functions. It remains an open question when, or even if, alternative approaches can outperform OPE for offline policy selection. Unfortunately, there is little understanding of the offline policy selection problem with which to answer these questions.
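As a concrete instance of OPE-based offline policy selection, the following sketch uses the ordinary (trajectory-wise) IS estimator: each episode's return is reweighted by the product of probability ratios π(a|s)/µ(a|s), and the candidate with the highest estimate is selected. The data layout and policy interfaces here are illustrative assumptions, not an interface from the paper.

```python
def is_estimate(trajectories, pi, mu):
    """Ordinary (trajectory-wise) importance sampling estimate of pi's value.

    trajectories: episodes collected under behavior policy mu, each a list
    of (state, action, reward) tuples; pi(s) and mu(s) return dicts mapping
    actions to probabilities. All interfaces here are illustrative.
    """
    total = 0.0
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in episode:
            weight *= pi(s)[a] / mu(s)[a]  # cumulative probability ratio
            ret += r
        total += weight * ret
    return total / len(trajectories)

def select_by_is(trajectories, candidates, mu):
    # Offline policy selection: pick the candidate with the highest IS estimate.
    return max(candidates, key=lambda pi: is_estimate(trajectories, pi, mu))
```

Note that the cumulative weight is a product over the horizon, which is the source of the estimator's worst-case exponential-in-horizon variance discussed below.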
Therefore, to provide a better understanding, we aim to investigate the question: when can we perform offline policy selection efficiently (with polynomial sample complexity) for RL? To this end, our contributions are as follows:
• We show that the sample complexity of the offline policy selection problem is lower-bounded by the number of samples needed to perform OPE. This implies that no policy selection approach can be more sample efficient than OPE in the worst case. On the other hand, OPE can be used for offline policy selection, so the sample complexity of policy selection is upper-bounded by the number of samples required for OPE. In particular, we show that a selection algorithm that simply chooses the policy with the highest IS estimate achieves a nearly minimax optimal sample complexity, which is exponential in the horizon.
• To circumvent this exponential sample complexity, we need to make additional assumptions. We identify when FQE, a commonly used OPE method, is efficient for offline policy selection. Specifically, we show that FQE is efficient for policy selection when the candidate policies are well covered by the offline dataset. This theoretical result supports several empirical findings.
• We explore the use of Bellman errors for policy selection, and provide a theoretical argument and experimental evidence for the improved sample efficiency of Bellman errors compared to FQE under stronger assumptions, such as deterministic dynamics and good data coverage.
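To make the Bellman-error approach in the last contribution concrete, the sketch below computes the average squared empirical Bellman error of a candidate value function on offline transitions. This is an illustrative sketch, not the paper's exact estimator; the data format and function interfaces are assumptions. Under stochastic dynamics this empirical quantity is a biased estimate of the true Bellman error (the double-sampling issue), which is one reason the deterministic-dynamics assumption matters.

```python
def bellman_error(Q, pi, transitions, gamma=1.0):
    """Average squared empirical Bellman error of candidate value function Q
    for policy pi, over offline transitions (s, a, r, s2, done).

    Q(s, a) is the candidate's value estimate and pi(s) the action pi takes
    in s (deterministic for simplicity). All interfaces are illustrative.
    """
    err = 0.0
    for s, a, r, s2, done in transitions:
        # One-sample Bellman target; unbiased as a target, but squaring it
        # biases the error estimate unless the dynamics are deterministic.
        target = r if done else r + gamma * Q(s2, pi(s2))
        err += (Q(s, a) - target) ** 2
    return err / len(transitions)
```

Selection then picks the candidate (policy, value function) pair with the smallest error, in contrast to FQE, which fits a fresh value function for every candidate policy.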

2. BACKGROUND

In reinforcement learning (RL), the agent-environment interaction can be formalized as a finite-horizon finite Markov decision process (MDP) M = (S, A, H, ν, Q). S is a set of states with size S = |S|, A is a set of actions with size A = |A|, H ∈ Z+ is the horizon, and ν ∈ ∆(S) is the initial state distribution, where ∆(S) is the set of probability distributions over S. Without loss of generality, we assume that there is only one initial state s0. The reward R and next state S′ are sampled from Q, that is, (R, S′) ∼ Q(·|s, a). We assume the reward is bounded in [0, rmax] almost surely, so the total return of each episode is bounded in [0, Vmax] almost surely. The stochastic kernel Q induces a transition probability P : S × A → ∆(S) and a mean reward function r(s, a), which gives the mean reward when taking action a in state s. A non-stationary policy is a sequence of memoryless policies (π0, . . . , πH−1), where πh : S → ∆(A). We assume that the sets of states reachable at time step h, Sh ⊂ S, are disjoint, without loss of generality.
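As a minimal sketch of this formalism, the following samples one episode of a tabular finite-horizon MDP under a non-stationary policy. The dictionary-based encoding of P, r, and the policy is an illustrative assumption (rewards are taken as deterministic and equal to their means for simplicity).

```python
import random

def rollout(P, r, pi, s0, H, rng=random):
    """Sample one H-step episode of a tabular finite-horizon MDP.

    P[(s, a)] is a dict of next-state probabilities, r[(s, a)] the mean
    reward (treated as deterministic here), and pi[h][s] the action of a
    non-stationary policy (pi_0, ..., pi_{H-1}). Names are illustrative.
    """
    s, ret = s0, 0.0
    for h in range(H):
        a = pi[h][s]
        ret += r[(s, a)]
        next_states, probs = zip(*P[(s, a)].items())
        s = rng.choices(next_states, weights=probs)[0]
    return ret
```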




