WHEN IS OFFLINE POLICY SELECTION FEASIBLE FOR REINFORCEMENT LEARNING? Anonymous

Abstract

Hyperparameter selection and algorithm selection are critical procedures before deploying reinforcement learning algorithms in real-world applications. However, algorithm-hyperparameter selection prior to deployment requires selecting policies offline, without online execution, a significant challenge known as offline policy selection. As yet, little is understood about the fundamental limitations of the offline policy selection problem. To contribute to our understanding of this problem, we investigate in this paper when sample efficient offline policy selection is possible. Since off-policy policy evaluation (OPE) is a natural approach for policy selection, the sample complexity of offline policy selection is upper-bounded by the number of samples needed to perform OPE. In addition, we prove that the sample complexity of offline policy selection is also lower-bounded by the sample complexity of OPE. These results imply not only that offline policy selection is effective when OPE is effective, but also that sample efficient policy selection is not possible without additional assumptions that make OPE effective. Moreover, we theoretically study the conditions under which offline policy selection using fitted Q evaluation (FQE) and the Bellman error is sample efficient. We conclude with an empirical study comparing FQE and Bellman errors for offline policy selection.

1. INTRODUCTION

Learning a policy from offline datasets, known as offline reinforcement learning (RL), has gained in popularity due to its practicality. Offline RL is useful for many real-world applications, as learning from online interaction may be expensive or dangerous (Levine et al., 2020) . For example, training a stock-trading RL agent online may incur large losses before learning to perform well, and training self-driving cars in the real world may be too dangerous. Ideally we want to learn a good policy offline, before deployment. In practice, offline RL algorithms often have hyperparameters which require careful tuning, and whether or not we can select effective hyperparameters is perhaps the most important consideration when comparing algorithms (Wu et al., 2019; Kumar et al., 2022) . Additionally, performance can differ between algorithms, so algorithm selection is also important (Kumar et al., 2022) . When a specific algorithm is run with a specific hyperparameter configuration, it outputs a learned policy and/or a value function. Thus, each algorithm-hyperparameter configuration produces a candidate policy from which we create a set of candidate policies. The problem of finding the best-performing policy from a set of candidate policies is called policy selection. While policy selection is typically used for algorithm-hyperparameter selection, it is more general since each candidate policy can be arbitrarily generated. The typical way of finding the best-performing policy in RL is to perform several rollouts for each candidate policy in the environment, compute the average return for each policy, and then select the policy that produced the highest average return. This approach is often used in many industrial problems where A/B testing is available. However, this approach assumes that these rollouts can be performed, which necessarily requires access to the environment or simulator, a luxury not available in the offline setting. 
As such, in offline RL, Monte Carlo rollouts cannot be performed to select candidate policies, and mechanisms to select policies with offline data are needed. A common approach to policy selection in the offline setting, known as offline policy selection, is to perform off-policy policy evaluation (OPE) to estimate the value of candidate policies from a fixed dataset and then select the policy with the highest estimated value. Typical OPE algorithms include direct methods such as the fitted Q evaluation (FQE) estimator (Le et al., 2019), the importance sampling (IS) estimator (Sutton & Barto, 2018), the doubly robust estimator (Jiang & Li, 2016), the model-based estimator (Mannor et al., 2007), and the marginalized importance sampling estimator (Xie et al., 2019). Empirically, Tang & Wiens (2021) and Paine et al. (2020) provide experimental results on offline policy selection using OPE. Similarly, Doroudi et al. (2017) and Yang et al. (2022) use OPE estimators for policy selection. Offline policy selection has been mainly associated with OPE, since these two problems are closely related. It is known that OPE is a hard problem that requires an exponential number of samples to evaluate any given policy in the worst case (Wang et al., 2020), so OPE can be unreliable for policy selection. As a result, a follow-up question is: do the hardness results from OPE also hold for offline policy selection? If the answer is yes, then we would need to consider additional assumptions to enable sample efficient offline policy selection. Moreover, offline policy selection should intuitively be easier than OPE, since estimating each policy's value accurately might not be necessary for policy selection. There is mixed evidence that alternatives to OPE might be effective. Tang & Wiens (2021) show empirically that using TD errors performs poorly because they provide overestimates; they conclude that OPE is necessary.
On the other hand, Zhang & Jiang (2021) perform policy selection without OPE, by selecting the value function that is closest to the optimal value function. However, their method relies on having the optimal value function in the set of candidate value functions. It remains an open question when, or even if, alternative approaches can outperform OPE for offline policy selection. Unfortunately, there is little understanding of the offline policy selection problem to answer the aforementioned questions. Therefore, to provide a better understanding, we aim to investigate the question: When can we perform offline policy selection efficiently (with polynomial sample complexity) for RL? To this end, our contributions are as follows: • We show that the sample complexity of the offline policy selection problem is lower-bounded by the number of samples needed to perform OPE. This implies that no policy selection approach can be more sample efficient than OPE in the worst case. On the other hand, OPE can be used for offline policy selection, so the sample complexity of policy selection is upper-bounded by the samples required for OPE. In particular, we show that a selection algorithm that simply chooses the policy with the highest IS estimate achieves a nearly minimax optimal sample complexity, which is exponential in the horizon. • To circumvent exponential sample complexity, we need to make additional assumptions. We identify when FQE, a commonly used OPE method, is efficient for offline policy selection. Specifically, we discuss that FQE is efficient for policy selection when the candidate policies are well covered by the offline dataset. This theoretical result supports several empirical findings. • We explore the use of Bellman errors for policy selection and provide a theoretical argument and experimental evidence for the improved sample efficiency of using Bellman errors compared to FQE, under stronger assumptions such as deterministic dynamics and good data coverage.

2. BACKGROUND

In reinforcement learning (RL), the agent-environment interaction can be formalized as a finite-horizon finite Markov decision process (MDP) M = (S, A, H, ν, Q). S is a set of states with size S = |S|, A is a set of actions with size A = |A|, H ∈ Z+ is the horizon, and ν ∈ ∆(S) is the initial state distribution, where ∆(S) is the set of probability distributions over S. Without loss of generality, we assume that there is only one initial state s_0. The reward R and next state S′ are sampled from Q, that is, (R, S′) ∼ Q(·|s, a). We assume the reward is bounded in [0, r_max] almost surely, so the total return of each episode is bounded in [0, V_max] almost surely. The stochastic kernel Q induces a transition probability P : S × A → ∆(S) and a mean reward function r(s, a), which gives the mean reward when taking action a in state s. A non-stationary policy is a sequence of memoryless policies (π_0, ..., π_{H−1}) where π_h : S → ∆(A). We assume that the sets of states reachable at time step h, S_h ⊂ S, are disjoint, without loss of generality, because we can always define a new state space S′ = S × {0, 1, 2, ..., H−1}. It is then sufficient to consider stationary policies π : S → ∆(A). Given a policy π, for any h ∈ [H] and (s, a) ∈ S × A, we define the value function and the action-value function as v^π_h(s) := E_π[Σ_{t=h}^{H−1} r(S_t, A_t) | S_h = s] and q^π_h(s, a) := E_π[Σ_{t=h}^{H−1} r(S_t, A_t) | S_h = s, A_h = a], respectively. The expectation is with respect to P_π, the probability measure on the random element (S_0, A_0, R_0, ..., R_{H−1}) induced by the policy π. The optimal value function is defined by v*_h(s) := sup_π v^π_h(s), the Bellman operator by (T q_h)(s, a) = r(s, a) + Σ_{s′∈S} P(s, a, s′) max_{a′∈A} q_h(s′, a′), and the Bellman evaluation operator by (T^π q_h)(s, a) = r(s, a) + Σ_{s′∈S} P(s, a, s′) Σ_{a′∈A} π(a′|s′) q_h(s′, a′).
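For intuition, both operators act on tabular q-functions and can be written in a few lines. Below is a minimal NumPy sketch of one application of the Bellman evaluation operator T^π; the array names and shapes are our own illustrative choices, not from the paper.

```python
import numpy as np

def bellman_evaluation_operator(q_next, r, P, pi):
    """One application of the Bellman evaluation operator T^pi.

    q_next[s, a] : q_{h+1}; r[s, a] : mean reward; P[s, a, s'] : transition
    probabilities; pi[s', a'] : action probabilities of the policy.
    Returns (T^pi q_{h+1})(s, a) = r(s, a) + sum_{s'} P(s, a, s') v_{h+1}(s'),
    where v_{h+1}(s') = sum_{a'} pi(a'|s') q_{h+1}(s', a').
    """
    v_next = (pi * q_next).sum(axis=1)  # v_{h+1}(s'), shape (S,)
    return r + P @ v_next               # shape (S, A)
```

Replacing `v_next` with `q_next.max(axis=1)` gives the Bellman optimality operator T instead.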
We use J(π) to denote the value of the policy π, that is, the expected return from the initial state: J(π) = v^π_0(s_0). A policy π is optimal if J(π) = v*_0(s_0). In the offline RL setting, we are given a fixed set of transitions D with samples drawn from a data distribution µ. In this paper, we consider the setting where the data is collected by a behavior policy π_b, since this data collection scheme is more practical (Xiao et al., 2022) and is used to collect data for benchmark datasets (Fu et al., 2020). We use d^π_h to denote the data distribution at horizon h obtained by following the policy π, that is, d^π_h(s, a) := P_π(S_h = s, A_h = a), and µ_h(s, a) := P_{π_b}(S_h = s, A_h = a). Notation. We use [H] to denote the set [H] := {0, 1, ..., H−1}. Given a value function q and a state-action distribution µ, the norm is defined as ∥q∥_{p,µ} := (Σ_{s,a} µ(s, a)|q(s, a)|^p)^{1/p} and the max norm as ∥q∥_∞ := max_{s,a} |q(s, a)|.
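The weighted norm above is a plain weighted sum in the tabular case; a small sketch (our own helper, not from the paper):

```python
import numpy as np

def weighted_p_norm(q, mu, p=2):
    """Compute ||q||_{p,mu} = (sum_{s,a} mu(s,a) |q(s,a)|^p)^(1/p)
    for tabular arrays q[s, a] and a state-action distribution mu[s, a]."""
    return float((mu * np.abs(q) ** p).sum() ** (1.0 / p))
```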

3. SAMPLE COMPLEXITY OF OFFLINE POLICY SELECTION

We consider the offline policy selection (OPS) problem and the offline policy evaluation (OPE) problem, following notation and formulations similar to Xiao et al. (2022). The OPS problem for a fixed number of episodes n is given by the tuple (S, A, H, ν, n, I). I is a set of instances of the form (M, d_b, Π), where M ∈ M(S, A, H, ν) specifies an MDP, d_b is the distribution over trajectories (S_0, A_0, R_0, ..., R_{H−1}) induced by running the behavior policy π_b on M, and Π is a finite set of candidate policies. We consider the setting where Π is small and does not depend on S, A or H. An OPS algorithm takes as input a batch of data D, which contains n trajectories, and a set of candidate policies Π, and outputs a policy π ∈ Π. We say an OPS algorithm L is (ε, δ)-sound on instance (M, d_b, Π) if Pr_{D∼d_b}(J_M(L(D, Π)) ≥ J_M(π†) − ε) ≥ 1 − δ, where π† is the best policy in Π. We say an OPS algorithm L is (ε, δ)-sound on the problem (S, A, H, ν, n, I) if it is sound on every instance (M, d_b, Π) ∈ I. Given a pair (ε, δ), the sample complexity of OPS is the smallest integer n such that there exist a behavior policy π_b and an OPS algorithm L such that L is (ε, δ)-sound on the OPS problem (S, A, H, ν, n, I(π_b)), where I(π_b) denotes the set of instances with data distribution d_b. That is, if the sample complexity is lower-bounded by n, then, for any behavior policy π_b, there exist an MDP M and a set of candidate policies Π such that any (ε, δ)-sound OPS algorithm on (M, d_b, Π) requires at least n samples. Similarly, the OPE problem for a fixed number of episodes n is given by (S, A, H, ν, n, I), where I is a set of instances of the form (M, d_b, π), M and d_b are defined as above, and π is a target policy. An OPE algorithm takes as input a batch of data D and a target policy π, and outputs an estimate of the policy value.
We say an OPE algorithm L is (ε, δ)-sound on instance (M, d_b, π) if Pr_{D∼d_b}(|L(D, π) − J_M(π)| ≤ ε) ≥ 1 − δ. We say an OPE algorithm L is (ε, δ)-sound on the problem (S, A, H, ν, n, I) if it is sound on every instance (M, d_b, π) ∈ I. Note that ε should be less than V_max/2; otherwise the bound is trivial.

3.1. OPE ALGORITHM AS SUBROUTINE FOR OFFLINE POLICY SELECTION

It is obvious that a sound OPE algorithm can be used for OPS, so the sample complexity of OPS is upper-bounded by the sample complexity of OPE up to a logarithmic factor. We state this formally in the theorem below. The proof can be found in Appendix A.1. Theorem 1. (Upper bound on sample complexity of OPS) Given an MDP M , a data distribution d b , and a set of policies Π, suppose that, for any pair (ε, δ), there exists an (ε, δ)-sound OPE algorithm L on any OPE instance I ∈ {(M, d b , π) : π ∈ Π} with a sample size at most O(N OP E (S, A, H, ε, 1/δ)). Then there exists an (ε, δ)-sound OPS algorithm for the OPS problem instance (M, d b , Π) which requires at most O(N OP E (S, A, H, ε, |Π|/δ)) episodes. In terms of the sample complexity, we have an extra log (|Π|)/n term for OPS due to the union bound. For hyperparameter selection in practice, the size of the candidate set is often much smaller than n, so this extra term is negligible. However, if the set is too large, complexity regularization (Bartlett et al., 2002) may need to be considered.
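The reduction in Theorem 1 is operationally simple: evaluate every candidate with the OPE subroutine and return the argmax, paying a log|Π| union-bound cost over the |Π| estimates. A minimal sketch, where `ope_estimate` is a stand-in for any sound OPE algorithm:

```python
def ops_via_ope(dataset, candidates, ope_estimate):
    """Select the candidate policy with the highest OPE estimate.

    candidates: dict mapping policy name -> policy object;
    ope_estimate(dataset, policy) -> estimated value of the policy.
    """
    estimates = {name: ope_estimate(dataset, pi) for name, pi in candidates.items()}
    return max(estimates, key=estimates.get)
```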

3.2. OFFLINE POLICY SELECTION IS NOT EASIER THAN OPE

We have shown that OPS is sample efficient when OPE is sample efficient. However, it remains unclear whether OPS can be sample efficient when OPE is not. In the following theorem, we lower bound the sample complexity of OPS by the sample complexity of OPE. As a result, both OPS and OPE suffer from the same hardness results, and we cannot expect OPS to be sample efficient under conditions where OPE is not. Theorem 2 (Lower bound on sample complexity of OPS). Suppose for any data distribution d_b and any pair (ε, δ) with ε ∈ (0, V_max/2) and δ ∈ (0, 1), there exists an MDP M and a policy π such that any (ε, δ)-sound OPE algorithm requires at least Ω(N_OPE(S, A, H, ε, 1/δ)) episodes. Then there exists an MDP M′ with S′ = S + 2, H′ = H + 1, and a set of candidate policies such that for any pair (ε, δ) with ε ∈ (0, V_max/3) and δ ∈ (0, 1/m), where m := ⌈log(V_max/ε)⌉ ≥ 1, any (ε, δ)-sound OPS algorithm also requires at least Ω(N_OPE(S, A, H, ε, 1/(mδ))) episodes. The proof sketch is to construct an OPE algorithm that queries an OPS algorithm as a subroutine; as a result, the sample complexity of OPS is lower-bounded by the sample complexity of OPE. We use the reduction first mentioned in Wang et al. (2020) and present a proof in Appendix A.2. There exist several hardness results for OPE in tabular settings and with linear function approximation (Yin & Wang, 2021; Wang et al., 2020). Theorem 2 implies that the same hardness results hold for OPS, so we should not expect a sound OPS algorithm without additional assumptions. Theorem 2, however, does not imply that OPS and OPE are always equally hard: there are instances where OPS is easy but OPE is not. For example, when all policies in the candidate set have the same value, any random selection is sound, yet OPE can still be difficult.
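The reduction behind Theorem 2 can be sketched concretely. In the constructed MDP M_r, a reference policy π_1 has known value r and π_2 has the target policy's value J(π); running OPS on this pair reveals on which side of r the value J(π) lies, and bisection narrows J(π) to an ε-interval in roughly ⌈log(V_max/ε)⌉ oracle calls. The sketch below idealizes the OPS oracle as error-free; the function and argument names are our own:

```python
def ope_via_ops_bisection(ops_prefers_target, v_min, v_max, eps):
    """Estimate J(pi) to within eps using an OPS oracle as a subroutine.

    ops_prefers_target(r): True if OPS, given {pi_1 (known value r),
    pi_2 (value J(pi))}, selects pi_2, i.e. J(pi) >= r. Idealized as exact.
    """
    lo, hi = v_min, v_max
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if ops_prefers_target(mid):
            lo = mid  # J(pi) is at least mid
        else:
            hi = mid  # J(pi) is below mid
    return (lo + hi) / 2.0
```

In the actual proof each oracle call can fail with probability δ, which is why the theorem's confidence parameter degrades to 1/(mδ) across the m bisection steps.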

3.3. IMPORTANCE SAMPLING ACHIEVES NEARLY MINIMAX OPTIMAL SAMPLE COMPLEXITY

We present an exponential lower bound on the sample complexity of OPE in Theorem 5 in the appendix, which uses the same construction as Xiao et al. (2022). Combining the lower bound on the sample complexity of OPE with Theorem 2 yields a lower bound for OPS. Corollary 1 (Exponential lower bound on the sample complexity of OPS). For any positive integers S, A, H with S > 2H and any pair (ε, δ) with 0 < ε ≤ 1/8 and δ ∈ (0, 1), any (ε, δ)-sound OPS algorithm needs at least Ω(A^{H−1}/ε²) episodes. We can use a common OPE method, importance sampling (IS), with a random behavior policy for policy selection. Recall that the IS estimator (Rubinstein, 1981) is given by Ĵ(π) = (1/n) Σ_{i=1}^n W_i G_i, where W_i := Π_{h=0}^{H−1} π(A_h^{(i)}|S_h^{(i)}) / π_b(A_h^{(i)}|S_h^{(i)}) is the importance weight, G_i := Σ_{h=0}^{H−1} R_h^{(i)} is the return of episode i, and n is the number of episodes in the dataset D. We now provide an upper bound on the sample complexity of OPS using IS. The proof can be found in Appendix A.3. Theorem 3 (OPS using importance sampling). Suppose the data collection policy is uniformly random, that is, π_b(a|s) = 1/A for all (s, a) ∈ S × A, and |G_i| ≤ V_max almost surely. Then the selection algorithm L that selects the policy with the highest IS estimate is (ε, δ)-sound with O(A^H V_max ln(|Π|/δ)/ε²) episodes. The theorem suggests that IS achieves a nearly minimax optimal sample complexity for OPS, up to a factor of A and logarithmic factors. There are other improved variants of IS, including per-decision IS and weighted IS (Precup et al., 2000; Sutton & Barto, 2018). However, none of these variants can reduce the sample complexity in the worst case, because the lower bound in Corollary 1 holds for any OPS algorithm. The result suggests that we need additional assumptions on the environment, the data distribution, or the candidate set to obtain guarantees for OPS. Note that Wang et al. (2017) have shown that the IS estimator achieves the minimax mean squared error for the OPE problem.
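The IS estimator is straightforward to compute from logged trajectories; a self-contained sketch, where the data layout is our own assumption:

```python
def is_estimate(episodes, pi, pi_b):
    """Ordinary importance sampling estimate of J(pi).

    episodes: list of trajectories, each a list of (s, a, r) tuples collected
    under the behavior policy; pi(s, a), pi_b(s, a): action probabilities.
    """
    total = 0.0
    for episode in episodes:
        weight, ret = 1.0, 0.0
        for s, a, r in episode:
            weight *= pi(s, a) / pi_b(s, a)  # W_i: product of likelihood ratios
            ret += r                          # G_i: undiscounted return
        total += weight * ret
    return total / len(episodes)
```

Under a uniform behavior policy the weights W_i can be as large as A^H, which is exactly the source of the exponential sample complexity in Corollary 1.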
Our result shows that IS also achieves a (nearly) minimax sample complexity for the OPS problem. There are other examples where OPS is efficient using particular OPE methods. For example, Liu et al. (2021) consider environments where the state contains exogenous variables, the agent has limited impact on those exogenous variables, and the agent has knowledge of the endogenous dynamics. In such environments, they provide a sound OPE method to select hyperparameters. Another example is when an accurate simulator of the environment is available: the simulation lemma (Kearns & Singh, 2002) shows that we can then evaluate any policy accurately and thus perform sound OPS. All these specialized strategies indicate that we can perform OPS efficiently under much stronger assumptions.

4. CONDITIONS FOR SAMPLE EFFICIENT OFFLINE POLICY SELECTION

In the following sections, we discuss the conditions under which OPS using fitted Q-evaluation and the Bellman error is sample efficient. These results build on existing theoretical analyses of FQE and the Bellman error, for example, Chen & Jiang (2019).

4.1. OFFLINE POLICY SELECTION USING FITTED Q-EVALUATION

Fitted Q-evaluation (FQE) is a commonly-used OPE method that has been shown to be effective for policy selection on benchmark datasets (Paine et al., 2020). FQE applies the empirical Bellman evaluation operator to the value estimate iteratively: given a function class F, we initialize q_{H−1} = 0, and, for h = H−2, ..., 0, set q_h = argmin_{f∈F} ℓ_h(f, q_{h+1}), where ℓ_h(f, q_{h+1}) := (1/|D_h|) Σ_{(s,a,r,s′)∈D_h} (f(s, a) − r − q_{h+1}(s′, π(s′)))² and D_h is the dataset at horizon h with distribution µ_h. The FQE estimate for J(π) is Ĵ(π) = E_{a∼π(·|s_0)}[q_0(s_0, a)]. Intuitively, we should be able to evaluate the value of all policies covered by the data distribution; if the candidate policy set is a subset of the set of policies covered by the data, then we can perform policy selection well. To define the notion of well-covered policies, we use the measure of distribution shift introduced in Xie et al. (2021). Given a non-negative real value C, we define Π(C, µ, F) to be the set of policies covered by the data distribution µ, that is, for any policy π ∈ Π(C, µ, F), max_{f,f′∈F} ∥f − T^π f′∥_{d^π_h} / ∥f − T^π f′∥_{µ_h} ≤ C holds for any h ∈ [H]. The theorem below provides a theoretical guarantee for OPS using FQE. The proof can be found in Appendix A.4. Theorem 4 (Offline policy selection for well-covered policies). Suppose we have a function class F with Rademacher complexity R^µ_n(F), an approximation error ε_apx, and an optimization error ε_opt (see Appendix A.4 for the definitions). If there exists a non-negative value C such that Π ⊂ Π(C, µ, F), then the OPS algorithm L that selects the policy with the highest FQE estimate satisfies Pr( J(L(D, Π)) ≥ J(π†) − 2H√C ( ε_opt + ε_apx + c_0 H R^µ_n(F) + c_0 H √(2 log(|Π|H/δ)/n) ) ) ≥ 1 − δ for some constant c_0. If we have a good function class and optimizer, then FQE only requires a polynomial number of episodes for OPS, under the assumption that all candidate policies are well-covered by the data.
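For intuition, the FQE regression has a closed form in the tabular case: the least-squares fit at each step is just the per-(s, a) sample mean of the regression target. A minimal sketch for a deterministic target policy; we adopt the common convention q_H := 0 and iterate h = H−1, ..., 0, and all names are our own:

```python
import numpy as np

def fqe_tabular(episodes, pi, n_states, n_actions, horizon):
    """Tabular fitted Q-evaluation: the least-squares fit at each step is the
    per-(s, a) sample mean of the target r + q_{h+1}(s', pi(s')).

    episodes: list of trajectories [(s, a, r, s_next), ...] of length `horizon`;
    pi[s]: action of a deterministic target policy. Unvisited pairs stay 0.
    """
    q = np.zeros((horizon + 1, n_states, n_actions))  # q[horizon] = 0
    for h in range(horizon - 1, -1, -1):
        targets = {}
        for ep in episodes:
            s, a, r, s_next = ep[h]
            targets.setdefault((s, a), []).append(r + q[h + 1][s_next, pi[s_next]])
        for (s, a), ys in targets.items():
            q[h][s, a] = np.mean(ys)
    return q
```

The FQE estimate of J(π) is then `q[0][s0, pi[s0]]` for the initial state s_0.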
However, in practice, we need to choose the function class and optimization hyperparameters. Optimization hyperparameters such as the learning rate are easy to choose: we can select the optimizer with the lowest ε_opt, for example by selecting the optimization hyperparameters that yield the smallest TD errors Σ_h ℓ_h(q_h, q_{h+1}) on a validation dataset. Selecting a function class is nontrivial, however, because we need to balance the approximation error, the complexity measure, and how well the data covers the candidate policies. In our experiments, we fix the function class as a neural network with a fixed architecture and tune the optimization hyperparameters. Theorem 4 presents a positive result: OPS can be sample efficient when selecting amongst well-covered policies. However, if one of the candidate policies is not well-covered, then FQE may overestimate the value of the uncovered policy, resulting in poor OPS. It is known that FQE can even diverge (Sutton & Barto, 2018), because it combines off-policy learning with bootstrapping and function approximation, known as the deadly triad. One way to circumvent the issue of uncovered policies is to assign low values to them. One heuristic is to stop FQE early when it diverges and assign lower value estimates to policies for which FQE diverges, since such policies are likely not covered by the data. Another heuristic is to use a pessimistic version of FQE that penalizes policies whose data distribution is dissimilar from the behavior distribution. This approach is similar to CQL, but it introduces an additional hyperparameter to control the level of pessimism, which again is not easy to tune in the offline setting. Practical insight. Our result explains the empirical findings in Paine et al. (2020) that FQE is sufficient for selecting hyperparameters for conservative algorithms and imitation learning algorithms.
A recent paper (Kumar et al., 2022) discusses when we should use pessimistic offline RL or behavioral cloning, under specific conditions. Our result shows that we can in fact run both algorithms and choose the one with the highest FQE estimate: with high probability, we choose the best policy among the offline RL and behavioral cloning algorithms, without needing to check those conditions.

4.2. OFFLINE POLICY SELECTION USING THE BELLMAN ERROR

In this section, we investigate whether OPS using the estimated Bellman error has any advantage over FQE. Most value-based methods output value functions, and we often perform policy selection over value functions by considering the greedy policies with respect to them. Suppose we are given a candidate set of value functions {q_i}_{i=1}^K. A natural criterion is to select the value function that is closest to the optimal value function, that is, the one with the smallest ∥q_i − q*∥_∞. A justification for this criterion is that the value error gives a lower bound on J(π_i), where π_i is the greedy policy with respect to q_i: J(π_i) ≥ J(π*) − 2H∥q_i − q*∥_∞ (Singh & Yee, 1994). Therefore, choosing the smallest ∥q_i − q*∥_∞ is equivalent to choosing the largest lower bound on the policy value. While we do not know q*, it might still be possible to derive a similar lower bound on the policy value by estimating the Bellman error (BE). A potential use case of the BE is that, if the data distribution covers the entire state-action space well, then the BE is a good surrogate for the policy value. More formally, the concentration coefficient C is defined as the smallest value such that max_{s,a,h} d^π_h(s, a)/µ_h(s, a) ≤ C for all policies π. By the performance difference lemma (Kakade, 2003), we can show that, for the greedy policy π with respect to q, J(π) ≥ J(π*) − CH Σ_{h=0}^{H−1} ∥q_h − T q_{h+1}∥_{2,µ_h}. That is, the Bellman error evaluated on the dataset can be a useful indicator of the quality of the greedy policy if C is small. Note that a small concentration coefficient is a stronger assumption than the coverage assumption used for FQE. To be specific, the condition max_{s,a,h} d^π_h(s, a)/µ_h(s, a) ≤ C for all policies π ∈ Π implies max_{f,f′∈F} ∥f − T^π f′∥_{d^π_h} / ∥f − T^π f′∥_{µ_h} ≤ C for all policies π ∈ Π. That is, for sample efficient OPS with FQE, we only need coverage of the policies in the candidate set.
However, for BE, the measure of distribution shift needs to be bounded for all policies. Finding the value function with the lowest Bellman error in deterministic environments is sample efficient. To see why, let k be the output of the selection algorithm that picks the index of the value function with the smallest empirical Bellman error. Then with probability at least 1 − δ, for some constant c_0, ∥q_k − T q_k∥²_{2,µ} ≤ min_{i=1,...,|Π|} ∥q_i − T q_i∥²_{2,µ} + c_0 V²_max √(log(|Π|H/δ)/n), where we write ∥q_i − T q_i∥²_{2,µ} := (1/H) Σ_h ∥q_{i,h} − T q_{i,h+1}∥²_{2,µ_h} for simplicity. Therefore, the sample size n only needs to scale with V²_max. In contrast, the sample size of FQE depends on the complexity of the function class, R^µ_n(F). If we use a function class with high complexity to perform FQE, then we expect BE to be a more sample efficient selection method than FQE when the data covers the state-action space well, that is, when C is small. If the concentration coefficient is large or infinite, then the Bellman error can be a poor indicator of the greedy policy's performance. For example, Fujimoto et al. (2022) provide several examples where the BE is a poor surrogate for the value error even when the environment is deterministic. Moreover, several empirical works have shown that OPS using the Bellman error or the value estimate from offline RL algorithms underperforms OPS using OPE estimates (Tang & Wiens, 2021). Our result provides theoretical reasoning for these experimental observations: since the concentration coefficient can be very large or infinite in most practical scenarios, OPS using the BE can be poor. Estimating the Bellman error in stochastic environments. Estimating the Bellman error in stochastic environments potentially requires more samples.
This is because doing so often involves fitting a function (Antos et al., 2008; Farahmand & Szepesvári, 2011), so the sample complexity of estimating the BE accurately depends on the complexity of the function class. We hypothesize that this makes the sample complexity similar to that of FQE. If the sample complexity is similar to FQE's and policy selection using the BE requires stronger assumptions, then it is natural to use FQE in stochastic environments even when the data coverage is good. We leave the exact analysis of the sample complexity of OPS using the estimated Bellman error in stochastic environments for future work.
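In a deterministic environment the sampled target r + max_{a′} q(s′, a′) equals (T q)(s, a) exactly, so the empirical Bellman error needs no inner regression. A tabular sketch of the selection rule analyzed above; the data layout and names are our own assumptions:

```python
import numpy as np

def select_by_bellman_error(transitions, q_list):
    """Return the index of the candidate value function with the smallest
    empirical squared Bellman error (exact targets assume deterministic
    dynamics). transitions: list of (s, a, r, s_next, h) tuples;
    q_list: candidates, each a list of arrays q[h][s, a] for h = 0..H-1.
    """
    errors = []
    for q in q_list:
        sq = 0.0
        for s, a, r, s_next, h in transitions:
            bootstrap = q[h + 1][s_next].max() if h + 1 < len(q) else 0.0
            sq += (q[h][s, a] - (r + bootstrap)) ** 2
        errors.append(sq / len(transitions))
    return int(np.argmin(errors))
```

Note that this rule only finds the smallest Bellman error; as argued above, that translates into a good policy only when the concentration coefficient C is small.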

5. EMPIRICAL COMPARISON BETWEEN FQE AND BE

In this section, we aim to validate our results experimentally by answering the following questions: (1) How do FQE and BE perform when (a) all candidate policies are well-covered, (b) the data contains diverse trajectories, or (c) the data neither covers all candidate policies nor provides diverse trajectories? (2) Can BE be more sample efficient than FQE for OPS? We conduct experiments on two standard RL environments: Acrobot and Cartpole. We first generate a set of candidate policy-value pairs {π_i, q_i}_{i=1}^K by running CQL (Kumar et al., 2020) with different hyperparameters on a batch of data collected by an optimal policy with random actions taken 40% of the time. We then use either FQE or BE to rank the candidate policies and select the top-k policies. The OPS problem described in this paper is the special case with k = 1. We also include a random selection baseline, where we randomly choose k policies from the candidate set. To evaluate the performance of top-k policy selection, we use the top-k regret, as in the existing literature (Zhang & Jiang, 2021; Paine et al., 2020): the gap between the best policy within the top-k selected policies and the best policy among all candidates. More experimental details can be found in Appendix B. In the first set of experiments, we generate three different datasets: (a) well-covered data is generated such that all candidate policies are well-covered, (b) diverse data includes more diverse trajectories to obtain a lower concentration coefficient, and (c) expert data is collected by a deterministic, optimal policy. Figure 1 shows the top-k regret with 900 episodes of data. FQE performs very well, with small regret, on well-covered and diverse data. BE does not perform well on well-covered data; with diverse data, however, BE performs much better.
Surprisingly, on expert data, BE performs better than FQE even though the data distribution has an extremely large concentration coefficient. In Acrobot, BE tends to perform better with a larger k even with good data coverage, while FQE can achieve small top-1 regret as long as the data covers the candidate policies. In the second set of experiments, we use diverse data. Figure 2 shows the results with varying numbers of episodes for top-k regret with k ∈ {1, 5, 10}. BE performs better than FQE when the sample size is small (less than 100 episodes) in most cases. The random selection baseline is expected to perform independently of the data used for OPS. We expect an even larger gap when the environment has a high-dimensional state space, so that a function class with large complexity is needed for FQE. In summary, these experimental results validate our theoretical results. We show that (1) FQE performs very well when the candidate policies are well-covered; (2) BE performs well under a diverse data coverage assumption; and (3) neither performs well without its corresponding data coverage condition. Moreover, BE can be more sample efficient than FQE under good data coverage.
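The top-k regret metric used in these experiments is simple to compute given the selection scores and the ground-truth policy values (which are available only for evaluating the experiment itself, via rollouts):

```python
import numpy as np

def top_k_regret(scores, true_values, k):
    """Gap between the best candidate overall and the best true value among
    the k candidates ranked highest by `scores` (e.g. FQE estimates, or
    negated Bellman errors so that higher is better)."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.max(true_values) - np.max(np.asarray(true_values)[top_k]))
```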

6. RELATED WORK

In this section we provide a more comprehensive survey of prior work on model selection for reinforcement learning. In the online setting, model selection has been studied extensively, from contextual bandits (Foster et al., 2019) to reinforcement learning (Lee et al., 2021). There, the goal is to select model classes while balancing exploration and exploitation to achieve low regret, which is very different from the offline setting, where no exploration is performed. In the offline setting, besides using OPE for model selection, Farahmand & Szepesvári (2011) and Zhang & Jiang (2021) consider selecting a value function that has the smallest Bellman error or is closest to the optimal value function. Farahmand & Szepesvári (2011) consider selecting a value function among a set of candidates in stochastic environments such that, with high probability, ∥q_k − T q_k∥²_{2,µ} ≤ min_i ∥q_i − T q_i∥²_{2,µ} + ε, where k is the output of the selection algorithm. To estimate the Bellman error, they propose to fit a regression model q̂_i to predict T q_i and bound the Bellman error by ∥q_i − q̂_i∥_{2,µ} + b_i, where b_i is an upper bound on the estimation error of the regression. However, as we show in this paper, even in a deterministic environment, where selecting the smallest Bellman error is easy, a small Bellman error does not imply effective OPS unless the data coverage is good. Zhang & Jiang (2021) consider the projected Bellman error with a piecewise constant function class. They show that if q* is in the candidate set and a stronger data assumption is satisfied, then they can choose the optimal value function by selecting the value function with the smallest Batch Value-Function Tournament (BVFT) loss. Xie & Jiang (2021) consider BVFT with approximation error. In practice, their method is computationally expensive, since it scales with O(|Π|²) instead of O(|Π|), making it impractical when the candidate set is not small.
Other work on model selection in RL addresses different settings: selecting models and selecting amongst OPE estimators. Hallak et al. (2013) consider model selection for model-based RL algorithms with batch data. They focus on selecting the most suitable model that generates the observed data, based on the maximum likelihood framework. Su et al. (2020) consider estimator selection for OPE when the estimators can be ordered with monotonically increasing biases and decreasing confidence intervals. They also show that their method achieves an oracle inequality for estimator selection. To the best of our knowledge, there is no previous work on understanding the fundamental limits of the OPS problem in RL. One related work in the batch contextual bandit setting studies the selection of a linear model (Lee et al., 2022). They provide a hardness result suggesting it is impossible to achieve an oracle inequality that balances the approximation error, the complexity of the function class, and data coverage. However, their results are restricted to the contextual bandit setting. In this paper, we consider the more general problem of selecting a policy from a set of candidate policies in the RL setting.

7. CONCLUSION & DISCUSSION

In this paper, we have made progress towards understanding when OPS is feasible for RL. Our main result, that the sample complexity of OPS is lower-bounded by the sample complexity of OPE, is perhaps expected. However, to our knowledge, this has never been formally shown. This result implies that without conditions that make OPE feasible, we cannot do policy selection efficiently. We provided theoretical statements with experimental evidence demonstrating that FQE is sample efficient when all candidate policies are well-covered and that BE can be more sample efficient under stronger data coverage. A natural next step from this work is to consider the structure or properties of the candidate policy set. For example, Kumar et al. (2021) use the trend in the TD error to select the number of training steps for CQL. The early stopping criterion could be selected based on errors on the dataset, and it is possible that such an approach could be proved to be sound in certain settings. As another example, if we know that performance is smooth in a hyperparameter for an algorithm, such as a regularization parameter, then it might be feasible to exploit curve fitting to get better estimates of performance. Our work does not yet answer these questions, but it does provide some insights into OPS that could lead to the development of sound hyperparameter selection procedures for real-world RL applications.

the reward $r$ of MDP $M_r$ and run our sound OPS algorithm on $\Pi$, using bisection search to estimate a precise interval for $J(\pi)$. The process is as follows. By construction, our OPS algorithm will output either $\pi_1$, which has value $J_{M_r}(\pi_1) = r$, or $\pi_2$, which has value $J_{M_r}(\pi_2) = J_M(\pi)$; that is, it has the same value as $\pi$ in the original MDP. Let $\pi^\dagger$ be the best policy in $\Pi$ for MDP $M_r$. We consider the following two cases. Case 1: the OPS algorithm selects $\pi_1$.
We know, by the definition of a sound OPS algorithm, that
$$\Pr(J_{M_r}(\pi_1) \ge J_{M_r}(\pi^\dagger) - \varepsilon') \ge 1-\delta' \implies \Pr(r \ge \max(r, J_{M_r}(\pi_2)) - \varepsilon') \ge 1-\delta' \implies \Pr(J_{M_r}(\pi_2) \le r + \varepsilon') \ge 1-\delta'.$$
Case 2: the OPS algorithm selects $\pi_2$. Then
$$\Pr(J_{M_r}(\pi_2) \ge J_{M_r}(\pi^\dagger) - \varepsilon') \ge 1-\delta' \implies \Pr(J_{M_r}(\pi_2) \ge \max(r, J_{M_r}(\pi_2)) - \varepsilon') \ge 1-\delta' \implies \Pr(J_{M_r}(\pi_2) \ge r - \varepsilon') \ge 1-\delta'.$$
Given this information, we describe the iterative process by which we produce the estimate $\hat J(\pi)$. We first set $U = V_{\max}$, $L = 0$, and $r = \frac{U+L}{2}$, and run the sound OPS algorithm with input $D_r$ of sample size $n_r$ and the candidate set $\Pi$. If the selected policy is $\pi_1$, then we conclude that the desired event $J(\pi) \le r + \varepsilon'$ occurs with probability at least $1-\delta'$, and set $U$ equal to $r$. If the selected policy is $\pi_2$, then we conclude that the desired event $J(\pi) \ge r - \varepsilon'$ occurs with probability at least $1-\delta'$, and set $L$ equal to $r$. We continue the bisection search until $U - L \le \varepsilon'$, and output the value estimate $\hat J(\pi) = \frac{U+L}{2}$. If the desired events at all calls occur, then $L - \varepsilon' \le J(\pi) \le U + \varepsilon'$ and thus $|J(\pi) - \hat J(\pi)| \le \varepsilon$. The total number of OPS calls is at most $m$. Setting $\delta' = \delta/m$ and applying a union bound, we conclude that with probability at least $1-\delta$, $|J(\pi) - \hat J(\pi)| \le \varepsilon$. Finally, since any $(\varepsilon, \delta)$-sound OPE algorithm on the instance $(M, d_b, \pi)$ needs at least $\Omega(N_{OPE}(S, A, H, \varepsilon, 1/\delta))$ samples, the $(\varepsilon', \delta')$-sound OPS algorithm needs at least $\Omega(N_{OPE}(S, A, H, \varepsilon, 1/\delta)) = \Omega(N_{OPE}(S, A, H, 3\varepsilon'/2, 1/(m\delta')))$ samples for at least one of the instances $(M_r, d_{b,r}, \Pi)$.
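The bisection reduction above can be sketched in a few lines. This is an illustrative sketch only: `run_ops` stands in for the black-box sound OPS call on $(M_r, d_{b,r}, \Pi)$, and its string return values are a hypothetical interface.

```python
def ope_via_ops(run_ops, v_max, eps):
    """Estimate J(pi) by bisecting over the reward r of the auxiliary
    MDP M_r, using only a black-box OPS oracle.  run_ops(r) returns
    'pi1' or 'pi2' (hypothetical interface for the sound OPS call)."""
    lo, hi = 0.0, v_max
    while hi - lo > eps:
        r = (lo + hi) / 2
        if run_ops(r) == 'pi1':  # implies J(pi) <= r + eps' w.h.p.
            hi = r
        else:                    # implies J(pi) >= r - eps' w.h.p.
            lo = r
    return (lo + hi) / 2

# With a noiseless oracle, the estimate converges to J(pi):
true_j = 0.37
oracle = lambda r: 'pi1' if r >= true_j else 'pi2'
print(abs(ope_via_ops(oracle, 1.0, 1e-6) - true_j) < 1e-5)  # True
```

The number of oracle calls is logarithmic in $V_{\max}/\varepsilon'$, matching the $m = \lceil \log(V_{\max}/\varepsilon') \rceil$ union-bound factor in the proof.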

A.3 PROOF OF THEOREM 3

Theorem 3 (OPS using importance sampling). Suppose the data collection policy is uniformly random, that is, $\pi_b(a|s) = 1/A$ for all $(s, a) \in S \times A$, and $|G_i| \le V_{\max}$ almost surely. Then the selection algorithm $L$ that selects the policy with the highest IS estimate is $(\varepsilon, \delta)$-sound with $O(A^H V_{\max}^2 \ln(|\Pi|/\delta)/\varepsilon^2)$ episodes.

Proof. Since the behavior policy is uniformly random, we know $|W_i G_i| \le A^H V_{\max}$ almost surely. Moreover, the importance sampling estimator is unbiased, that is, $E[W_i G_i] = J(\pi)$. Using Bernstein's inequality, the IS estimator satisfies
$$\Pr\left(|\hat J(\pi_k) - J(\pi_k)| \le \frac{2 A^H V_{\max} \ln(2/\delta)}{3n} + \sqrt{\frac{2\,\mathbb{V}(W_i G_i) \ln(2/\delta)}{n}}\right) \ge 1 - \delta$$
for one candidate policy $\pi_k$. Using the union bound, we have
$$\Pr\left(|\hat J(\pi_k) - J(\pi_k)| \le \frac{2 A^H V_{\max} \ln(2|\Pi|/\delta)}{3n} + \sqrt{\frac{2\,\mathbb{V}(W_i G_i) \ln(2|\Pi|/\delta)}{n}},\ \forall k\right) \ge 1 - \delta.$$
That is,
$$\Pr\left(J(L(D, \Pi)) \ge J(\pi^\dagger) - \frac{4 A^H V_{\max} \ln(2|\Pi|/\delta)}{3n} - \sqrt{\frac{8\,\mathbb{V}(W_i G_i) \ln(2|\Pi|/\delta)}{n}}\right) \ge 1 - \delta.$$


Figure 3: Lower bound construction, with a Bernoulli reward at the final state.

For the variance term,
$$\mathbb{V}(W_i G_i) = E[W_i^2 G_i^2] - E[W_i G_i]^2 \le E[W_i^2 G_i^2] \le V_{\max}^2\,E[W_i^2] \le A^H V_{\max}^2.$$
The second inequality uses $|G_i| \le V_{\max}$ almost surely, and the third uses $E[W_i^2] \le A^H$ for a uniformly random behavior policy. Therefore, if $n > 32 A^H V_{\max}^2 \ln(2|\Pi|/\delta)/\varepsilon^2$, $L$ is $(\varepsilon, \delta)$-sound.

Theorem 5 (Exponential lower bound on the sample complexity of OPE). For any positive integers $S$, $A$, $H$ with $S > 2H$ and a pair $(\varepsilon, \delta)$ with $0 < \varepsilon \le 1/8$, $\delta \in (0, 1)$, any $(\varepsilon, \delta)$-sound OPE algorithm needs at least $\Omega(A^H \ln(1/\delta)/\varepsilon^2)$ episodes.

Proof. We provide a proof that uses the construction from Xiao et al. (2022). They provide the result for the offline RL problem with Gaussian rewards; here we provide the result for the OPE problem with Bernoulli rewards, since we assume bounded rewards to match Theorem 3. We construct an MDP with $A$ actions and the $2H$ states shown in Figure 3. Given any behavior policy $\pi_b$, let $a_h = \arg\min_a \pi_b(a|s_h)$ be the action that leads to the next state $s_{h+1}$ from state $s_h$; all other actions lead to an absorbing state $s^o_h$. Once the agent reaches an absorbing state, it receives zero reward for all actions for the remainder of the episode. The only nonzero reward is in the last state $s_{H-1}$. Consider a target policy that chooses $a_h$ in state $s_h$ for all $h = 0, \dots, H-1$, and two MDPs whose only difference is the reward distribution at $s_{H-1}$: MDP 1 has a Bernoulli reward with mean $1/2$ and MDP 2 has a Bernoulli reward with mean $1/2 - 2\varepsilon$. Let $P_1$ and $P_2$ denote the probability measures with respect to MDP 1 and MDP 2, respectively, and let $\hat r$ denote the OPE estimate produced by an algorithm $L$. Define the event $E = \{\hat r < \frac12 - \varepsilon\}$. Then $L$ is not $(\varepsilon, \delta)$-sound if $\frac{P_1(E) + P_2(E^c)}{2} \ge \delta$, because this implies either $P_1(\hat r < \frac12 - \varepsilon) \ge \delta$ or $P_2(\hat r \ge \frac12 - \varepsilon) \ge \delta$.
Using the Bretagnolle-Huber inequality (see Theorem 14.2 of Lattimore & Szepesvári (2020)), we know
$$\frac{P_1(E) + P_2(E^c)}{2} \ge \frac14 \exp(-D_{KL}(P_1, P_2)).$$
By the chain rule for the KL-divergence and the fact that $P_1$ and $P_2$ differ only in the reward for $(s_{H-1}, a_{H-1})$, we have
$$D_{KL}(P_1, P_2) = E_1\left[\sum_{i=1}^n \mathbb{I}\{S^{(i)}_{H-1} = s_{H-1}, A^{(i)}_{H-1} = a_{H-1}\}\right]\left(\frac12 \log\frac{1/2}{1/2 - \varepsilon} + \frac12 \log\frac{1/2}{1/2 + \varepsilon}\right) = \sum_{i=1}^n P_1(S^{(i)}_{H-1} = s_{H-1}, A^{(i)}_{H-1} = a_{H-1})\left(-\frac12 \log(1 - 4\varepsilon^2)\right) \le \frac{8n\varepsilon^2}{A^H}.$$
The last inequality follows from $-\log(1 - 4\varepsilon^2) \le 8\varepsilon^2$ if $4\varepsilon^2 \le 1/2$ (Krishnamurthy et al., 2016) and $P_1(S^{(i)}_{H-1} = s_{H-1}, A^{(i)}_{H-1} = a_{H-1}) \le 1/A^H$ from the construction of the MDPs. Finally,
$$\frac{P_1(E) + P_2(E^c)}{2} \ge \frac14 \exp\left(-\frac{8n\varepsilon^2}{A^H}\right),$$
which is at least $\delta$ if $n \le A^H \ln(1/(4\delta))/(8\varepsilon^2)$. As a result, we need at least $\Omega(A^H \ln(1/\delta)/\varepsilon^2)$ episodes.
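The ordinary importance sampling estimator analyzed in Theorem 3 can be sketched as follows. This is a minimal illustration; the per-step policy-probability interface (`target_pi(s, a)`, `behavior_pi(s, a)`) is an assumption made for the sketch.

```python
import numpy as np

def is_estimate(episodes, target_pi, behavior_pi):
    """Ordinary importance sampling estimate of J(target_pi).
    episodes is a list of trajectories [(s, a, r), ...]; the policy
    arguments map (s, a) -> probability (hypothetical interface)."""
    estimates = []
    for traj in episodes:
        w, g = 1.0, 0.0
        for (s, a, r) in traj:
            w *= target_pi(s, a) / behavior_pi(s, a)  # cumulative ratio W_i
            g += r                                    # undiscounted return G_i
        estimates.append(w * g)
    return float(np.mean(estimates))
```

With a uniformly random behavior policy each per-step ratio is at most $A$, so $W_i \le A^H$, which is the source of the exponential variance term in the bound above.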

A.4 PROOF OF THEOREM 4

Assumption 1 (Approximation error). For any policy $\pi \in \Pi$ and $h \in [H]$, we assume the approximation error is bounded by $\varepsilon_{apx}$, that is, $\sup_{g \in F} \inf_{f \in F} \|f - T^\pi g\|^2_{2,\mu_h} \le \varepsilon_{apx}$.

Assumption 2 (Optimization error). Given a target value function $g$, we want to find the empirical risk minimizer $\tilde f = \arg\min_{f \in F} \hat l(f, g)$. We assume we have an optimization oracle that returns a value function $\hat f$ whose optimization error is bounded, that is, $\hat l(\hat f, g) \le \hat l(\tilde f, g) + \varepsilon_{opt}$.

Definition 1 (Rademacher complexity). Given a function class $F$, let $X = \{x_1, \dots, x_n\}$ denote $n$ fixed data points at horizon $h$ following the distribution $\mu_h$. The empirical Rademacher complexity is defined as
$$R_X(F) = E\left[\sup_{f \in F} \frac1n \sum_{i=1}^n \sigma_i f(x_i) \,\Big|\, X\right],$$
where the expectation is with respect to the Rademacher random variables $\sigma_i$. The Rademacher complexity is defined as $R^{\mu_h}_n(F) = E[R_X(F)]$, where the expectation is with respect to the $n$ data points. Finally, to simplify notation, we define $R^{\mu}_n(F) = \max_{h \in [H]} R^{\mu_h}_n(F)$ as the maximum Rademacher complexity over all horizons.

Lemma 1 (Excess risk bound, modification of Theorem 5.2 of Duan et al. (2021)). Let $\hat q = (\hat q_0, \dots, \hat q_{H-1})$ be the output of FQE with $n$ samples drawn from the data distribution $\mu_h$ at each horizon $h$. Then, with probability $1 - \delta$, the following holds for all $h = 0, \dots, H-1$:
$$\|\hat q_h - T^\pi \hat q_{h+1}\|^2_{2,\mu_h} \le \varepsilon_{opt} + \varepsilon_{apx} + c_0 H R^{\mu_h}_n(F) + c_0 H^2 \sqrt{\frac{\log(H/\delta)}{n}}$$
for some constant $c_0 > 0$.

Proof. For each horizon $h \in [H]$, let $\hat q_h$ be the output value function such that $\hat l(\hat q_h, \hat q_{h+1}) \le \hat l(\tilde q_h, \hat q_{h+1}) + \varepsilon_{opt}$, where $\tilde q_h$ is the empirical minimizer, that is, $\tilde q_h = \arg\min_{f \in F} \hat l_h(f, \hat q_{h+1})$. It follows that $\hat l(\hat q_h, \hat q_{h+1}) \le \hat l(q^\dagger_h, \hat q_{h+1}) + \varepsilon_{opt}$, where $q^\dagger_h$ is the population minimizer, that is, $q^\dagger_h = \arg\min_{f \in F} \|f - T^\pi \hat q_{h+1}\|_{2,\mu_h}$. The result then follows the proof of Theorem 5.2 in Duan et al.
(2021) by taking $f_h = \hat q_h$: we have
$$\|\hat q_h - T^\pi \hat q_{h+1}\|^2_{2,\mu_h} \le \varepsilon_{opt} + \varepsilon_{apx} + c H R^{\mu_h}_n(F) + c H^2 \sqrt{\frac{\log(H/\delta)}{n}}$$
for all $h = 0, \dots, H-1$ with probability at least $1 - \delta$.

Theorem 4 (Offline policy selection for well-covered policies). Suppose we have a function class $F$ with Rademacher complexity $R^\mu_n(F)$, approximation error $\varepsilon_{apx}$, and optimization error $\varepsilon_{opt}$ (see Appendix A.4 for the definitions). If there exists a non-negative value $C$ such that $\Pi \subset \Pi(C, \mu, F)$, then the OPS algorithm $L$ that selects the policy with the highest FQE estimate satisfies
$$\Pr\left(J(L(D, \Pi)) \ge J(\pi^\dagger) - 2H\sqrt{C\left(\varepsilon_{opt} + \varepsilon_{apx} + c_0 H R^{\mu}_n(F) + c_0 H^2 \sqrt{\frac{\log(|\Pi|H/\delta)}{n}}\right)}\right) \ge 1 - \delta$$
for some constant $c_0$.

Proof. First fix a policy $\pi \in \Pi$. Let $\hat q = (\hat q_0, \dots, \hat q_{H-1})$ be the output of FQE with $n$ samples drawn from the data distribution $\mu_h$ at each horizon $h$. Then
$$\|\hat q_h - T^\pi \hat q_{h+1}\|^2_{2,\mu_h} \le \varepsilon_{opt} + \varepsilon_{apx} + c_0 H R^{\mu_h}_n(F) + c_0 H^2 \sqrt{\frac{\log(H/\delta)}{n}}$$
from Lemma 1 for some constant $c_0$. By the definition of $\Pi(C, \mu, F)$, we know $\|\hat q_h - T^\pi \hat q_{h+1}\|_{2,d^\pi_h} \le \sqrt{C}\,\|\hat q_h - T^\pi \hat q_{h+1}\|_{2,\mu_h}$. Therefore,
$$\|\hat q_h - T^\pi \hat q_{h+1}\|^2_{2,d^\pi_h} \le C\left(\varepsilon_{opt} + \varepsilon_{apx} + c_0 H R^{\mu_h}_n(F) + c_0 H^2 \sqrt{\frac{\log(H/\delta)}{n}}\right).$$
Next,
$$\|q^\pi_0 - \hat q_0\|_{1,d^\pi_0} = \sum_a \pi(a|s_0)\,|q^\pi_0(s_0, a) - \hat q_0(s_0, a)| = \sum_a \pi(a|s_0)\,|(T^\pi q^\pi_1)(s_0, a) - (T^\pi \hat q_1)(s_0, a) + (T^\pi \hat q_1)(s_0, a) - \hat q_0(s_0, a)| \le \sum_{a,s',a'} \pi(a|s_0)\,p(s'|s_0, a)\,\pi(a'|s')\,|q^\pi_1(s', a') - \hat q_1(s', a')| + \sum_a \pi(a|s_0)\,|(T^\pi \hat q_1)(s_0, a) - \hat q_0(s_0, a)| = \|q^\pi_1 - \hat q_1\|_{1,d^\pi_1} + \|T^\pi \hat q_1 - \hat q_0\|_{1,d^\pi_0}.$$
Applying the same inequality recursively, we have
$$\|q^\pi_0 - \hat q_0\|_{1,d^\pi_0} \le \sum_{h=0}^{H-1} \|T^\pi \hat q_{h+1} - \hat q_h\|_{1,d^\pi_h} \le \sum_{h=0}^{H-1} \|T^\pi \hat q_{h+1} - \hat q_h\|_{2,d^\pi_h},$$
where the last inequality follows from Jensen's inequality. Therefore,
$$\left|v^\pi(s_0) - \sum_a \pi(a|s_0)\,\hat q_0(s_0, a)\right| \le H\sqrt{C\left(\varepsilon_{opt} + \varepsilon_{apx} + c_0 H R^{\mu}_n(F) + c_0 H^2 \sqrt{\frac{\log(H/\delta)}{n}}\right)}$$
with probability $1 - \delta$. This bound on the FQE estimate holds for one policy in the candidate set; using the union bound, it holds for all policies.
Therefore, the OPS algorithm has error at most
$$2H\sqrt{C\left(\varepsilon_{opt} + \varepsilon_{apx} + c_0 H R^{\mu}_n(F) + c_0 H^2 \sqrt{\frac{\log(|\Pi|H/\delta)}{n}}\right)}$$
with probability at least $1 - \delta$.

A.5 DETAILS OF SECTION 4.2

By the performance difference lemma,
$$J(\pi^*) - J(\pi) = E_{\pi^*}\left[\sum_{h=0}^{H-1} q^*_h(S_h, A_h) - q^*_h(S_h, \pi_h(S_h))\right] \le E_{\pi^*}\left[\sum_{h=0}^{H-1} q^*_h(S_h, A_h) - q_h(S_h, A_h) + q_h(S_h, \pi_h(S_h)) - q^*_h(S_h, \pi_h(S_h))\right] \le E_{\pi^*}\left[\sum_{h=0}^{H-1} |q^*_h(S_h, A_h) - q_h(S_h, A_h)| + |q_h(S_h, \pi_h(S_h)) - q^*_h(S_h, \pi_h(S_h))|\right] \le \sum_{h=0}^{H-1} \|q^*_h - q_h\|_{1,d^{\pi^*}_h \pi^*} + \|q^*_h - q_h\|_{1,d^{\pi^*}_h \pi_h} \le \sum_{h=0}^{H-1} \|q^*_h - q_h\|_{2,d^{\pi^*}_h \pi^*} + \|q^*_h - q_h\|_{2,d^{\pi^*}_h \pi_h},$$
where $d^\pi_h$ is the state distribution at horizon $h$ induced by policy $\pi$. The first inequality follows because $\pi_h$ is greedy with respect to $q_h$. Now consider a state-action distribution $\beta_0$ that is induced by some policy. Then
$$\|q^*_0 - q_0\|_{2,\beta_0} = \|T q^*_1 - T q_1 + T q_1 - q_0\|_{2,\beta_0} \le \|T q^*_1 - T q_1\|_{2,\beta_0} + \|T q_1 - q_0\|_{2,\beta_0} \le \|q^*_1 - q_1\|_{2,\beta_1} + \sqrt{C}\,\|T q_1 - q_0\|_{2,\mu_0},$$
where $\beta_1(s', a') = \sum_{s,a} \beta_0(s, a) P(s, a, s')\,\mathbb{I}\{a' = \arg\max_{a'' \in A} (q^*_1(s', a'') - q_1(s', a''))^2\}$ is also induced by some policy. The first equality follows from the fact that $q^*$ is the fixed point of the operator $T$. We can recursively apply the same process to $\|q^*_h - q_h\|_{2,\beta_h}$ for $h > 0$, which gives $\|q^*_h - q_h\|_{2,\beta_h} \le \sqrt{C} \sum_{h'=h}^{H-1} \|T q_{h'+1} - q_{h'}\|_{2,\mu_{h'}}$. Therefore,
$$\|q^*_0 - q_0\|_{2,\beta_0} \le \sqrt{C} \sum_{h=0}^{H-1} \|T q_{h+1} - q_h\|_{2,\mu_h}.$$
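The FQE procedure analyzed in Theorem 4 can be sketched in a tabular setting. This is a simplified illustration (a tabular "function class" and a deterministic evaluated policy), not the neural-network implementation used in the experiments; the per-horizon data layout is a hypothetical interface.

```python
import numpy as np

def fqe_tabular(data, pi, n_states, n_actions, horizon):
    """Finite-horizon fitted Q evaluation with a tabular function class.
    data[h] is a list of (s, a, r, s_next) transitions at horizon h and
    pi[s] is the evaluated policy's action (deterministic, for brevity).
    Regression targets are r + q_{h+1}(s', pi(s')), fit by averaging."""
    q_next = np.zeros((n_states, n_actions))
    for h in reversed(range(horizon)):
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for (s, a, r, s2) in data[h]:
            y = r + (q_next[s2, pi[s2]] if h + 1 < horizon else 0.0)
            targets[s, a] += y
            counts[s, a] += 1
        q_next = np.divide(targets, np.maximum(counts, 1))  # empirical risk minimizer
    return q_next  # estimate of q_0; J(pi) is read off at (s_0, pi[s_0])
```

In the tabular case the per-cell average is exactly the empirical risk minimizer of the squared loss $\hat l(f, g)$, so the $\varepsilon_{opt}$ and $\varepsilon_{apx}$ terms of Lemma 1 vanish and only the estimation error remains.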

Now we show that

$$\|q_k - T q_k\|^2_{2,\mu} \le \min_{i=1,\dots,|\Pi|} \|q_i - T q_i\|^2_{2,\mu} + c_0 V_{\max} \sqrt{\frac{\log(|\Pi|H/\delta)}{n}}$$
holds with probability at least $1 - \delta$ for some constant $c_0$.

Proof. Using Hoeffding's inequality and the union bound, with probability at least $1 - \delta$,
$$\left|\frac1n \sum_{(s,a,r,s') \in D_h} |q_{i,h}(s, a) - r - \max_{a'} q_{i,h+1}(s', a')|^2 - \|q_{i,h} - T q_{i,h+1}\|^2_{2,\mu_h}\right| \le c_1 B \sqrt{\frac{\log(H|\Pi|/\delta)}{n}}$$
for all candidate value functions $q_i = (q_{i,0}, \dots, q_{i,H-1})$, $i = 1, \dots, |\Pi|$, and all $h \in [H]$. Let $k$ be the index with the lowest empirical Bellman error, that is,
$$k = \arg\min_{i=1,\dots,|\Pi|} \frac{1}{nH} \sum_{h=0}^{H-1} \sum_{(s,a,r,s') \in D_h} |q_{i,h}(s, a) - r - \max_{a'} q_{i,h+1}(s', a')|^2.$$
Then
$$\|q_k - T q_k\|^2_{2,\mu} \le \min_{i=1,\dots,|\Pi|} \|q_i - T q_i\|^2_{2,\mu} + c_0 V_{\max} \sqrt{\frac{\log(|\Pi|H/\delta)}{n}}.$$
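The empirical Bellman error selection rule in this proof can be sketched as follows. The per-horizon dataset layout and the callable value functions are hypothetical interfaces chosen for the sketch.

```python
import numpy as np

def select_by_bellman_error(data, q_list, n_actions):
    """Select the candidate value function with the smallest empirical
    squared Bellman error.  data[h] holds (s, a, r, s_next) tuples and
    q_list[i][h] is a callable q_{i,h}(s, a) (hypothetical interface)."""
    H = len(data)
    errors = []
    for q in q_list:
        total, count = 0.0, 0
        for h in range(H):
            for (s, a, r, s2) in data[h]:
                if h + 1 < H:
                    nxt = max(q[h + 1](s2, a2) for a2 in range(n_actions))
                else:
                    nxt = 0.0
                total += (q[h](s, a) - r - nxt) ** 2  # empirical TD error
                count += 1
        errors.append(total / count)
    return int(np.argmin(errors))
```

Note that, as the proof assumes, the empirical TD error concentrates around the Bellman error only when the environment is deterministic; in stochastic environments the squared TD error is a biased estimate of the Bellman error.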

B EXPERIMENTAL DETAILS

We provide the experimental details in this section.

Generating candidate policies. To generate a set of candidate policies, we run CQL with different hyperparameter configurations on a batch of data with 300 episodes collected with an ε-greedy policy with respect to the optimal policy, where ε = 0.4. The hyperparameter configurations include:
• Learning rate ∈ {0.001, 0.0003, 0.0001}
• Network hidden layer size ∈ {128, 256, 512}
• Regularization coefficient ∈ {1.0, 0.1, 0.01, 0.001, 0.0}
• Iterations of CQL ∈ {100, 200}
That is, we have 90 candidate policies for Acrobot. For Cartpole, CQL with many hyperparameter configurations can generate an optimal policy (reaching a return of 200), so the selection is sufficiently easy that OPS algorithms using both FQE and BE achieve zero regret. To make the result more meaningful, we remove all optimal policies from the candidate set, which results in 67 candidate policies for Cartpole.

Generating offline data for OPS. To generate data for offline policy selection, we use three different data distributions: (a) a data distribution collected by running the mixture of all candidate policies, so that the data covers all candidate policies well; (b) a data distribution collected by running the mixture of all candidate policies and an ε-greedy optimal policy, which provides more diverse trajectories than (a); (c) a data distribution collected by a deterministic optimal policy.

Performing experiments with multiple runs. To perform experiments with multiple runs, we fix the candidate policies and only resample the offline data used for OPS. This better reflects the theoretical results, in which the randomness comes from resampling the data given to an OPS algorithm. In our experiments, we use 5 runs and report the average regret with one standard error. Since the variability across runs is not large, we find 5 runs to be enough.
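The hyperparameter grid described above can be enumerated with a few lines; `train_cql` below is a placeholder name for the actual training routine, which we do not reproduce here.

```python
from itertools import product

# Enumerate the CQL hyperparameter grid described in the text.
learning_rates = [0.001, 0.0003, 0.0001]
hidden_sizes = [128, 256, 512]
reg_coefs = [1.0, 0.1, 0.01, 0.001, 0.0]
iterations = [100, 200]

configs = list(product(learning_rates, hidden_sizes, reg_coefs, iterations))
print(len(configs))  # 90 configurations, matching the 90 Acrobot candidates

# Each configuration would then yield one candidate policy, e.g.:
# policies = [train_cql(lr, size, reg, iters) for (lr, size, reg, iters) in configs]
```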
Random selection baseline. We include a random selection baseline that randomly chooses k policies for a given k. Since random selection has very high variance within one run, we first compute the expected regret of random selection by repeating the random selection 2000 times within each run, and then report the average regret over runs.

Practical considerations for the FQE implementation. As mentioned in the paper, we fix the function class as a two-layer neural network with hidden size 256 and tune the optimization hyperparameters. We use the Adam optimizer with the learning rate selected from {0.001, 0.0003, 0.0001, 0.00003} for 200 epochs. We select the hyperparameter configuration that results in a value function with the smallest RMSTD error evaluated on a separate validation dataset. To avoid selecting an uncovered policy, we assign a low estimate to a policy when FQE diverges to a large value.




Figure 1: Top-k regret with varying k and 900 episodes on Acrobot and Cartpole. The results are averaged over 5 runs with one standard error.

A TECHNICAL DETAILS

A.1 PROOF OF THEOREM 1

Theorem 1 (Upper bound on the sample complexity of OPS). Given an MDP $M$, a data distribution $d_b$, and a set of policies $\Pi$, suppose that, for any pair $(\varepsilon, \delta)$, there exists an $(\varepsilon, \delta)$-sound OPE algorithm on any OPE instance $I \in \{(M, d_b, \pi) : \pi \in \Pi\}$ with a sample size of at most $O(N_{OPE}(S, A, H, \varepsilon, 1/\delta))$. Then there exists an $(\varepsilon, \delta)$-sound OPS algorithm for the OPS problem instance $(M, d_b, \Pi)$ which requires at most $O(N_{OPE}(S, A, H, \varepsilon, |\Pi|/\delta))$ episodes.

Proof. The OPS algorithm $L(D, \Pi)$ for a given $(\varepsilon, \delta)$ works as follows: we query an $(\varepsilon', \delta')$-sound OPE algorithm for each policy in $\Pi$ and select the policy with the highest estimated value. That is, $L(D, \Pi)$ outputs the policy $\hat\pi := \arg\max_{\pi \in \Pi} \hat J(\pi)$, where $\hat J(\pi)$ is the value estimate for policy $\pi$ produced by the OPE algorithm. By the definition of an $(\varepsilon', \delta')$-sound OPE algorithm, we have $\Pr(|\hat J(\pi) - J(\pi)| \le \varepsilon') \ge 1 - \delta'$ for each $\pi \in \Pi$. Applying the union bound, we have
$$\Pr\left(|\hat J(\pi) - J(\pi)| \le \varepsilon',\ \forall \pi \in \Pi\right) \ge 1 - |\Pi|\delta'. \quad (1)$$
Let $\pi^\dagger$ denote the best policy in the candidate set $\Pi$, that is, $\pi^\dagger = \arg\max_{\pi \in \Pi} J(\pi)$. With probability at least $1 - |\Pi|\delta'$,
$$J(\hat\pi) \ge \hat J(\hat\pi) - \varepsilon' \ge \hat J(\pi^\dagger) - \varepsilon' \ge J(\pi^\dagger) - 2\varepsilon'.$$
The first and third inequalities follow from equation 1, and the second inequality follows from the definition of $\hat\pi$. Finally, by setting $\delta' = \delta/|\Pi|$ and $\varepsilon' = \varepsilon/2$, we get $\Pr(J(L(D, \Pi)) \ge J(\pi^\dagger) - \varepsilon) \ge 1 - \delta$. That is, $L$ is an $(\varepsilon, \delta)$-sound OPS algorithm, and it requires at most $O(N_{OPE}(S, A, H, \varepsilon/2, |\Pi|/\delta))$ samples.

A.2 PROOF OF THEOREM 2

Theorem 2 (Lower bound on the sample complexity of OPS). Suppose for any data distribution $d_b$ and any pair $(\varepsilon, \delta)$ with $\varepsilon \in (0, V_{\max}/2)$ and $\delta \in (0, 1)$, there exists an MDP $M$ and a policy $\pi$ such that any $(\varepsilon, \delta)$-sound OPE algorithm requires at least $\Omega(N_{OPE}(S, A, H, \varepsilon, 1/\delta))$ episodes. Then there exists an MDP $M'$ with $S' = S + 2$, $H' = H + 1$, and a set of candidate policies such that, for any pair $(\varepsilon, \delta)$ with $\varepsilon \in (0, V_{\max}/3)$ and $\delta \in (0, 1/m)$ where $m := \lceil \log(V_{\max}/\varepsilon) \rceil \ge 1$, any $(\varepsilon, \delta)$-sound OPS algorithm also requires at least $\Omega(N_{OPE}(S, A, H, \varepsilon, 1/(m\delta)))$ episodes.

Proof.
Our goal is to construct an $(\varepsilon, \delta)$-sound OPE algorithm with $\delta \in (0, 1)$ and $\varepsilon \in (0, V_{\max}/2]$. To evaluate any policy $\pi$ in $M$ with a dataset $D$ sampled from $d_b$, we first construct a new MDP $M_r$ with two additional states: an initial state $s_0$ and a terminal state $s_1$. Taking $a_1$ at $s_0$ transitions to $s_1$ with reward $r$; taking $a_2$ at $s_0$ transitions to the initial state of the original MDP $M$ with reward 0. Let $\Pi = \{\pi_1, \pi_2\}$ be the candidate set in $M_r$, where $\pi_1(s_0) = a_1$, $\pi_2(s_0) = a_2$, and $\pi_2(a|s) = \pi(a|s)$ for all $(s, a) \in S \times A$. Since $\pi_1$ always transitions to $s_1$, it never visits states of the original MDP $M$; therefore, $\pi_1$ can be arbitrary for all $(s, a) \in S \times A$. We can add any number of transitions $(s_0, a_1, r, s_1)$ and $(s_0, a_2, 0, s)$ to $D$ to construct the dataset $D_r$ with distribution $d_{b,r}$. Suppose we have an $(\varepsilon', \delta')$-sound OPS algorithm, where we set $\varepsilon' = 2\varepsilon/3$, $\delta' = \delta/m$, and $m := \lceil \log(V_{\max}/\varepsilon') \rceil$. Note that if this assumption does not hold, then it directly implies that the sample complexity of OPS is larger than $\Omega(N_{OPE}(S, A, H, \varepsilon, 1/\delta))$. Our strategy will be to iteratively set

