NEAR-OPTIMAL POLICY IDENTIFICATION IN ACTIVE REINFORCEMENT LEARNING

Abstract

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

* The first three authors contributed equally to this work.
† We use the structural assumption for the kernel setting from Yang et al. (2020), which states that the Bellman operator maps any bounded value function to a function with a bounded RKHS norm.

1. INTRODUCTION

Reinforcement learning (RL) algorithms are increasingly applied to complex domains such as robotics (Kober et al., 2013), magnetic tokamaks (Seo et al., 2021; Degrave et al., 2022), and molecular search (Simm et al., 2020a;b). A central challenge in such environments is that data acquisition is often time-consuming and expensive, or may be infeasible due to safety considerations. A common approach is therefore to train policies offline by interacting with a simulator. However, even when a simulator is available, such applications require algorithms that are capable of learning and planning in large state spaces. Many existing approaches require a large amount of training data to obtain good policies, and efficient active exploration in large state spaces is still an open problem.

Moreover, when deploying policies trained on simulators in real-world applications, a crucial requirement is that the policy performs well in any state that it might encounter. In particular, at training time, the learning approach has to sufficiently explore the state space. This is of particular importance when, at test time, the system's state is partly out of the control of the learning algorithm, e.g., for a self-driving car or robot that may be influenced by human actions.

In this work, we formally study the setting of reinforcement learning with a generative model. Our objective is to learn a near-optimal policy by actively querying the simulator with state-action pairs chosen by the learning algorithm. The simulator then returns a new state that is sampled from the transition model of the (simulated) environment. Inspired by previous works, we make a structural assumption in the kernel setting, which states that the Bellman operator maps any bounded value function to one with a bounded reproducing kernel Hilbert space (RKHS) norm. In particular, this assumption implies that the reward and the optimal Q-function can be represented by an RKHS function.
We propose a novel approach based on least-squares value iteration (LSVI). The algorithm is designed to actively explore uncertain states based on the uncertainty in the Q-estimates, and makes use of optimism for action selection and pessimism for estimating a near-optimal policy.

Contributions We propose a novel kernelized algorithm for best-policy identification in reinforcement learning with a generative model. Our sampling strategy actively explores (i) states for which the best action is the most uncertain and (ii) the corresponding "optimistic" actions. We prove sample complexity guarantees for finding an $\epsilon$-optimal policy uniformly over any given initial state. Our bounds scale with the maximum information gain of the corresponding reproducing kernel Hilbert space but do not explicitly scale with the number of states or actions. When specialized to the offline contextual Bayesian optimization (BO) setting (Char et al., 2019), we improve upon sample complexity guarantees from prior work. Finally, we include experimental evaluations on several RL and BO benchmark tasks; the former includes one of the first empirical evaluations of model-free optimistic value iteration algorithms with function approximation (Yang et al., 2020).

2. PROBLEM STATEMENT

We consider an episodic MDP $(\mathcal{S}, \mathcal{A}, H, (P_h)_{h\in[H]}, (r_h)_{h\in[H]})$ with state space $\mathcal{S}$, action space $\mathcal{A}$, horizon $H \in \mathbb{N}$, Markov transition kernels $(P_h)_{h\in[H]}$, and deterministic reward functions $(r_h : \mathcal{S}\times\mathcal{A} \to [0,1])_{h\in[H]}$. In particular, for each $h \in [H]$, we let $P_h(\cdot\,|s,a)$ denote the probability transition kernel when action $a$ is taken at state $s \in \mathcal{S}$ in step $h \in [H]$. A policy consists of $H$ functions $\pi = (\pi_h)_{h\in[H]}$, where for all $h \in [H]$, $\pi_h(\cdot\,|s)$ is a probability distribution over the action set $\mathcal{A}$; in particular, $\pi_h(a|s)$ is the probability that the agent takes action $a$ in state $s$ at step $h$.

We assume the generative (or random) access model, in which the agent interacts with the environment in the following way. Let $T$ denote the number of episodes and $H$ the horizon, i.e., the number of steps in each episode. Then for each $t \in [T]$ and $h \in [H]$, the agent chooses $s_h^t \in \mathcal{S}$ and $a_h^t \in \mathcal{A}$, obtains the reward $r_h(s_h^t, a_h^t)$, and observes the new state $s'_{h,t} \sim P_h(\cdot\,|s_h^t, a_h^t)$.

To measure the performance of an agent, we use the value function. For a policy $\pi$, $h \in [H]$, $s \in \mathcal{S}$, and $a \in \mathcal{A}$, the value function $V_h^\pi : \mathcal{S} \to \mathbb{R}$ and the Q-function $Q_h^\pi : \mathcal{S}\times\mathcal{A} \to [0, H]$ are given by:

$$V_h^\pi(s) = \mathbb{E}_\pi\Big[\textstyle\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big], \qquad Q_h^\pi(s,a) = \mathbb{E}_\pi\Big[\textstyle\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\ a_h = a\Big], \tag{1}$$

where $\mathbb{E}_\pi$ denotes the expectation with respect to the randomness of the trajectory $\{(s_h, a_h)\}_{h=1}^H$ obtained by following the policy $\pi$. We use $\pi^*$ to denote the optimal policy, and we abbreviate $V_h^{\pi^*}, Q_h^{\pi^*}$ as $V_h^*, Q_h^*$, respectively. We also have $V_h^*(s) = \sup_\pi V_h^\pi(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$.

The goal is to find an $\epsilon$-optimal policy while minimizing the number of necessary episodes $T$. More precisely, for a fixed precision $\epsilon > 0$ and horizon $H \in \mathbb{N}$, the goal of the learner is to output a policy $\hat\pi_T$ after a suitable number of episodes $T > 0$ such that $\|V_1^* - V_1^{\hat\pi_T}\|_{\ell_\infty(\mathcal{S})} \le \epsilon$.
Finally, we also recall the Bellman equations associated with a policy $\pi$:

$$V_{H+1}^\pi = 0, \quad Q_h^\pi(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^\pi(s')\big], \quad V_h^\pi(s) = \mathbb{E}_{a\sim \pi_h(a|s)}\big[Q_h^\pi(s,a)\big], \tag{2}$$

and the Bellman optimality equations:

$$V_{H+1}^* = 0, \quad Q_h^*(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^*(s')\big], \quad V_h^*(s) = \max_{a\in\mathcal{A}} Q_h^*(s,a). \tag{3}$$

It follows that the optimal policy $\pi^*$ is the greedy policy with respect to $\{Q_h^*\}_{h\in[H]}$, a property that will be useful later on when defining our active exploration strategy.

We use the reproducing kernel Hilbert space (RKHS) function class to represent functions such as the reward functions $\{r_h\}_{h\in[H]}$ and the optimal Q-functions $\{Q_h^*\}_{h\in[H]}$ (see the formal statement in Assumption 1). In particular, we consider a space of well-behaved functions defined on $\mathcal{X} = \mathcal{S}\times\mathcal{A}$, where $\mathcal{H}$ denotes an RKHS defined on $\mathcal{X}$ induced by some continuous, positive definite kernel function $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$. We also assume that (i) $\mathcal{X} \subset \mathbb{R}^d$ is a compact set, (ii) the kernel function is bounded, $k(x,x') \le 1$ for all $x, x' \in \mathcal{X}$, and (iii) every $f \in \mathcal{H}$ has a bounded RKHS norm, i.e., $\|f\|_{\mathcal{H}} \le B_Q H$ for some fixed positive constant $B_Q > 0$.
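In a small finite MDP, the Bellman optimality equations above can be evaluated exactly by backward induction over $h = H, \ldots, 1$. The following is a minimal NumPy sketch (the function name and array layout are illustrative, not part of the paper):

```python
import numpy as np

def backward_induction(P, r):
    """Compute Q*_h and V*_h for a finite episodic MDP by backward induction.

    P: array (H, S, A, S) with P[h, s, a, s'] = P_h(s' | s, a)
    r: array (H, S, A) of rewards in [0, 1]
    Returns Q of shape (H, S, A) and V of shape (H+1, S), with V[H] = 0.
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))           # V*_{H+1} = 0
    for h in range(H - 1, -1, -1):     # backward over steps
        Q[h] = r[h] + P[h] @ V[h + 1]  # Bellman optimality equation
        V[h] = Q[h].max(axis=1)        # V*_h(s) = max_a Q*_h(s, a)
    return Q, V
```

With all rewards equal to 1, the recursion gives $V_1^*(s) = H$ for every state, which is a quick sanity check of the implementation.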

3. AE-LSVI ALGORITHM

Our algorithm runs in episodes $t \in [T]$ of horizon $H$. As in kernelized least-squares value iteration (Yang et al., 2020), at the beginning of every episode $t$ it solves a sequence of kernel ridge regression problems based on the data obtained in the previous $t-1$ episodes to obtain value function estimates $\{\hat Q_h^t\}_{h=1}^H$:

$$\hat Q_h^t \in \arg\min_{f\in\mathcal{H}} \sum_{i=1}^{t-1} \big( r_h(s_h^i, a_h^i) + \bar V_{h+1}^t(s'_{h,i}) - f(s_h^i, a_h^i) \big)^2 + \lambda \|f\|_{\mathcal{H}}^2, \tag{4}$$

where $\lambda$ is the regularization parameter. Recalling that $x \in \mathcal{X} := \mathcal{S}\times\mathcal{A}$, the solution of the problem in Eq. (4) can be written in closed form as

$$\hat Q_h^t(x) := k_h^t(x)^T (K_h^t + \lambda I)^{-1} Y_h^t, \tag{5}$$

where the vector $k_h^t(x) \in \mathbb{R}^{t-1}$, the kernel matrix $K_h^t \in \mathbb{R}^{(t-1)\times(t-1)}$, and the observations $Y_h^t \in \mathbb{R}^{t-1}$ are given by

$$k_h^t(x) := [k(x_h^1, x), \ldots, k(x_h^{t-1}, x)], \qquad K_h^t := \big[k(x_h^i, x_h^{i'})\big]_{i,i' \in [t-1]}, \qquad [Y_h^t]_i := r_h(s_h^i, a_h^i) + \bar V_{h+1}^t(s'_{h,i}).$$

Next, we can also compute the uncertainty function $\sigma_h^t(\cdot,\cdot)$ in closed form:

$$\sigma_h^t(s,a) = \frac{1}{\lambda^{1/2}} \big( k(x,x) - k_h^t(x)^T (K_h^t + \lambda I)^{-1} k_h^t(x) \big)^{1/2}. \tag{6}$$

We recall that each reward function is bounded in $[0,1]$. We use the optimistic estimates

$$\bar Q_h^t(\cdot,\cdot) := \big[ \hat Q_h^t(\cdot,\cdot) + \beta \sigma_h^t(\cdot,\cdot) \big]_0^{H-h+1}, \qquad \bar V_h^t(\cdot) := \max_{a\in\mathcal{A}} \bar Q_h^t(\cdot, a), \tag{7}$$

where $\hat Q_h^t$ is computed as in Eq. (5) with targets $[Y_h^t]_i := r_h(s_h^i, a_h^i) + \bar V_{h+1}^t(s'_{h,i})$. Similarly, we use the pessimistic estimates

$$\underline Q_h^t(\cdot,\cdot) := \big[ \underline{\hat Q}_h^t(\cdot,\cdot) - \beta \sigma_h^t(\cdot,\cdot) \big]_0^{H-h+1}, \qquad \underline V_h^t(\cdot) := \max_{a\in\mathcal{A}} \underline Q_h^t(\cdot, a), \tag{9}$$

where $\underline{\hat Q}_h^t(\cdot) := k_h^t(\cdot)^T (K_h^t + \lambda I)^{-1} \underline Y_h^t$ with $[\underline Y_h^t]_i := r_h(s_h^i, a_h^i) + \underline V_{h+1}^t(s'_{h,i})$.

Our proposed algorithm, AE-LSVI, is presented in Algorithm 1. At each step $h$, the algorithm uses the optimistic and pessimistic value estimates from Eqs. (7) and (9) (computed based on the data collected in previous episodes), and selects $s_h^t$ and $a_h^t$ as:

$$s_h^t \in \arg\max_{s\in\mathcal{S}} \Big( \max_{a\in\mathcal{A}} \bar Q_h^t(s,a) - \max_{a\in\mathcal{A}} \underline Q_h^t(s,a) \Big), \tag{11}$$

$$a_h^t \in \arg\max_{a\in\mathcal{A}} \bar Q_h^t(s_h^t, a). \tag{12}$$

The main intuition behind the proposed sampling rules is as follows.
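Before turning to the intuition, the quantities in Eqs. (5), (6), (11), and (12) reduce to a few lines of linear algebra. The sketch below is a simplified illustration for a single step $h$ with discrete candidate states and actions; the helper names and the squared-exponential kernel are assumptions for the example, not part of the paper:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """Illustrative squared-exponential kernel with k(x, x) = 1."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def posterior(X, y, Xq, lam=1.0):
    """Kernel ridge mean (Eq. (5)) and uncertainty (Eq. (6)) at query points Xq."""
    K = rbf(X, X)
    kq = rbf(X, Xq)                               # shape (t-1, m)
    w = np.linalg.solve(K + lam * np.eye(len(X)), kq)
    mean = w.T @ y
    var = (rbf(Xq, Xq).diagonal() - (kq * w).sum(0)) / lam
    return mean, np.sqrt(np.maximum(var, 0.0))

def select_pair(X, y, states, actions, beta=0.5, lam=1.0):
    """AE-LSVI rule: state whose best action is most uncertain (Eq. (11)),
    then the optimistic action at that state (Eq. (12))."""
    grid = np.array([np.concatenate([s, a]) for s in states for a in actions])
    mean, sigma = posterior(X, y, grid, lam)
    ucb = (mean + beta * sigma).reshape(len(states), len(actions))
    lcb = (mean - beta * sigma).reshape(len(states), len(actions))
    s_idx = int(np.argmax(ucb.max(1) - lcb.max(1)))
    a_idx = int(np.argmax(ucb[s_idx]))
    return s_idx, a_idx
```

The truncation to $[0, H-h+1]$ and the per-step regression targets are omitted here for brevity; the point is only the shape of the acquisition rule.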
Since the optimal policy $\pi^*$ is the greedy policy with respect to $\{Q_h^*\}_{h\in[H]}$, we do not need to learn $\{Q_h^*\}_{h\in[H]}$ everywhere on $\mathcal{S}\times\mathcal{A}$; it suffices to focus on discovering the best actions for each state. Our active exploration strategy is explicitly designed to focus on (i) states for which the best action is the most uncertain (Eq. (11)) and (ii) the corresponding best "optimistic" actions (Eq. (12)).

We use $\hat\pi_T$ to denote the final policy returned by AE-LSVI (see Algorithm 1). There are various reasonable greedy-based choices for $\hat\pi_T$. The simplest one is to return $\hat\pi_{T,h}(\cdot) = \arg\max_{a\in\mathcal{A}} \hat Q_h^T(\cdot, a)$, but in our theory and experiments we equate $\hat\pi_T$ with the policy that has the highest lower confidence estimate $\underline Q_h^t(s,a)$. Our sampling strategy, combined with the proposed policy reporting rule, allows for discovering an $\epsilon$-optimal policy uniformly over any given initial state, as we formally show in the next section.

Recall the Bellman optimality operator

$$\mathcal{T}_h^* Q(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot|s,a)}\Big[\max_{a'\in\mathcal{A}} Q(s',a')\Big].$$

Assumption 1 implies that for every $h \in [H]$, both $r_h(\cdot,\cdot)$ and $Q_h^*(\cdot,\cdot)$ are elements of the set $\{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le B_Q H\}$. Conversely, a sufficient condition for Assumption 1 to be satisfied with $B_Q = 2$ is that $\{r_h(\cdot,\cdot), P_h(s'|\cdot,\cdot)\} \subseteq \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1\}$ for all $h \in [H]$ and $s' \in \mathcal{S}$ (Yang et al., 2020). Moreover, only assuming $Q_h^* \in \mathcal{H}$ with $\|Q_h^*\| \le B_Q H$ for all $h \in [H]$ is not enough to obtain sample size guarantees that are polynomial in $H$ and $d$ (Du et al., 2020).

The main quantity that characterizes the complexity of the RKHS function class in the kernelized setting is the maximum information gain (Srinivas et al., 2010):

$$\Gamma_k(T, \lambda) := \sup_{D \subseteq \mathcal{S}\times\mathcal{A},\, |D| \le T} \tfrac{1}{2} \ln \big| I + \lambda^{-1} K_{D,D} \big|,$$

where $K_{D,D}$ denotes the Gram matrix, $|\cdot|$ denotes the determinant, $\lambda > 0$ is a regularization parameter, and the index $k$ indicates the kernel.
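For a fixed data set $D$, the information gain is just a log-determinant of the (regularized) Gram matrix, and is easy to compute. A minimal sketch, using an illustrative squared-exponential kernel (the function names are assumptions for the example):

```python
import numpy as np

def rbf(X1, X2, ls=0.5):
    """Illustrative squared-exponential kernel with k(x, x) = 1."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def information_gain(K, lam=1.0):
    """(1/2) ln |I + lam^{-1} K_{D,D}| for the Gram matrix K of a data set D."""
    sign, logdet = np.linalg.slogdet(np.eye(len(K)) + K / lam)
    return 0.5 * logdet
```

For $T$ identical points the Gram matrix is rank one and the gain is only $\tfrac12\ln(1 + T/\lambda)$: repeated queries contribute almost no information, which is the intuition behind why $\Gamma_k(T, \lambda)$ grows sublinearly for smooth kernels.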
This quantity is known to be sublinear in $T$ for most popularly used kernels (Srinivas et al., 2010). Further, we define the set of possible optimistic and pessimistic value functions

$$\mathcal{Q}(T, h, b) = \Big\{ Q(\cdot,\cdot) = \big[\hat Q(\cdot,\cdot) \pm \beta \sigma_D(\cdot,\cdot)\big]_0^{H-h+1} : \hat Q \in \mathcal{H},\ \|\hat Q\|_{\mathcal{H}} \le 2H\sqrt{\Gamma_k(T,\lambda)},\ \beta \in [0, b],\ D \subseteq \mathcal{S}\times\mathcal{A},\ |D| \le T \Big\},$$

where $b > 0$ and $\sigma_D(\cdot,\cdot)$ is of the form of Eq. (6) computed with a data set $D \subseteq \mathcal{S}\times\mathcal{A}$, and we denote its $\ell_\infty$-covering number by $N_\infty(\epsilon, T, h, b)$. Our sample complexity bounds depend on $b_T > 0$, defined as the smallest number that satisfies the following inequality:

$$8\Gamma_k\Big(T, \tfrac{T+1}{T}\Big) + 8\log N_\infty(H/T, T, h, b_T) + 16\log(2TH) + 22 + 2B_Q^2 \tfrac{T+1}{T} \le (b_T/H)^2. \tag{16}$$

For many kernel functions, $b_T$ has a sublinear dependence on $T$. For instance, $b_T = O(\gamma H \log(\gamma T H))$ for bounded and continuously differentiable kernels with $\gamma$-finite spectrum, and $b_T = O\big(H\sqrt{TH}\,\log(T)^{1/\gamma}\big)$ for bounded and continuously differentiable kernels with $\gamma$-exponential decay. See (Yang et al., 2020, Corollary 4.4) for more details.

We recall that $\bar Q_h^t(\cdot)$ and $\underline Q_h^t(\cdot)$ are upper and lower confidence bounds for $Q_h^*$ for all $h \in [H]$, respectively (see Lemma A.1), while the target functions of the kernel ridge regressions are $\mathcal{T}_h^* \bar Q_{h+1}^t(\cdot)$ and $\mathcal{T}_h^* \underline Q_{h+1}^t(\cdot)$. As a technical tool, we use the following concentration result, which follows from (Yang et al., 2020, Lemma 5.2).

Lemma 4.1. Consider the setup of Assumption 1, with $\bar Q_{h+1}^t(\cdot)$, $\underline Q_h^t(\cdot)$, and $\sigma_h^t(\cdot)$ from Eqs. (6), (7), and (9), computed with $\lambda = 1 + 1/T$ and $\beta = b_T$ from Eq. (16). Then with probability at least $1 - (2T^2H^2)^{-1}$, the following holds for all $t \in [T]$, $h \in [H]$, and all $(s,a) \in \mathcal{S}\times\mathcal{A}$:

$$0 \le \bar Q_h^t(s,a) - \mathcal{T}_h^* \bar Q_{h+1}^t(s,a) \le 2\beta\sigma_h^t(s,a), \qquad 0 \le \mathcal{T}_h^* \underline Q_{h+1}^t(s,a) - \underline Q_h^t(s,a) \le 2\beta\sigma_h^t(s,a). \tag{17}$$

With the previous confidence lemma in place, we state our main theorem, which characterizes the sample complexity of AE-LSVI. The proof is given in Appendix A.2. Theorem 4.2.
Consider the setting of Lemma 4.1 and let $H \in \mathbb{N}$ be a fixed horizon. When running Algorithm 1 for $T$ episodes, with probability at least $1 - (2T^2H^2)^{-1}$, the best-policy estimate $\hat\pi_T$ (Algorithm 1, Line 12) satisfies:

$$\|V_1^* - V_1^{\hat\pi_T}\|_{\ell_\infty(\mathcal{S})} \le 2\sqrt{3}\,\beta H(H+1)\sqrt{\frac{\Gamma_k(T,\lambda)}{T}}. \tag{18}$$

In other words, for a given fixed precision $\epsilon > 0$, after $T = O\big(\beta^2 H^4 \Gamma_k(T,\lambda)/\epsilon^2\big)$ episodes (or $O\big(\beta^2 H^5 \Gamma_k(T,\lambda)/\epsilon^2\big)$ samples), $\|V_1^* - V_1^{\hat\pi_T}\|_{\ell_\infty(\mathcal{S})} \le \epsilon$ holds with probability at least $1 - (2T^2H^2)^{-1}$.

The obtained result is general, since it holds for any kernel function that satisfies Assumption 1. To obtain concrete kernel-dependent bounds, it remains to specify the kernel and the bounds for the corresponding maximum information gain in Eq. (16). These are summarized in Yang et al. (2020) for the most widely used kernels (see Assumption 4.3 and its discussion). In the special case of linear kernels with feature dimension $d$, our sample complexity guarantee reduces to $\tilde O(d^3 H^7/\epsilon^2)$. Better bounds (in terms of $d$) are known for this special case, namely $\tilde O(d^2 H^7/\epsilon^2)$; see, e.g., Agarwal et al. (2019, Theorem 3.3). These bounds are obtained by the LSVI algorithm with D-optimal design. Unlike this algorithm, AE-LSVI uses optimism for active exploration, and such a performance gap is present even in the simpler linear bandit setting, where optimistic algorithms are known to attain worse sample complexity guarantees (Lattimore & Szepesvári, 2020, Chapter 22).

The special case also includes the linear MDP setting, which assumes linear reward functions and linear transition kernels. For linear MDPs, it is possible to find a policy $\pi$ satisfying $V_1(s_1) - V_1^\pi(s_1) \le \epsilon$ using $\tilde O(d^2 H^3/\epsilon^2)$ samples (Hu et al., 2022); in our setting of Assumption 1, such a policy $\pi$ can be found using $O(H^5 \beta^2 \Gamma_k(T,\lambda)/\epsilon^2)$ samples (Yang et al., 2020). Both results hold with at least constant probability. However, they require that the initial state $s_1$ is fixed for all episodes. In contrast, the result of Theorem 4.2 holds uniformly over the entire state space.

5. APPLICATION TO OFFLINE CONTEXTUAL BAYESIAN OPTIMIZATION

In this section, we specialize Algorithm 1 to the offline contextual Bayesian optimization setting (Char et al., 2019). We show that in this setting the proposed active exploration scheme leads to new sample complexity bounds that hold uniformly over the context space. The offline contextual Bayesian optimization setting is similar to the one considered in Section 2 with $H = 1$. In particular, instead of having $H$ different functions to learn, we have a single unknown objective $Q^* : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ that we learn about from noisy point evaluations. Here, we refer to $\mathcal{S}$ as the context space, and assume that both $\mathcal{S}$ and $\mathcal{A}$ are compact sets. As before, we use the shorthand notation $\mathcal{X} = \mathcal{S}\times\mathcal{A}$. In each round $t \in [T]$, the learner chooses a context-action pair $(s_t, a_t) \in \mathcal{S}\times\mathcal{A}$ and observes $y_t = Q^*(s_t, a_t) + \eta_t$ (with independent sub-Gaussian noise $\eta_t$). To choose $(s_t, a_t)$ at each round $t$, we make use of the same active exploration strategy from Eqs. (11) and (12). Our complete algorithm for the offline BO setting can be found in Appendix A.3 (see Algorithm 2).

We define $\hat Q^t : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ (and $\sigma^t : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$) analogously to $\hat Q_h^t$ (resp. $\sigma_h^t$) from Eq. (5) (resp. Eq. (6)), but with the modification of ignoring the index $h$ and defining $Y^t := (y_i)_{i=1}^{t-1} \in \mathbb{R}^{t-1}$. We further define the upper and lower confidence bounds for $Q^*$ as:

$$\bar Q^t(\cdot,\cdot) = \hat Q^t(\cdot,\cdot) + \beta_t \sigma^t(\cdot,\cdot), \qquad \underline Q^t(\cdot,\cdot) = \hat Q^t(\cdot,\cdot) - \beta_t \sigma^t(\cdot,\cdot). \tag{20}$$

When $Q^* \in \mathcal{H}$ and $\|Q^*\|_{\mathcal{H}} \le B$ for some known kernel (such that $k(x,x') \le 1$ for all $x, x' \in \mathcal{X}$), then $(\beta_t)_{t\in[T]}$ is a non-decreasing sequence of parameters that can be chosen according to Abbasi-Yadkori (2012, Theorem 3.11) to yield valid confidence bounds. Similarly, in the case of $Q^* \sim GP_{\mathcal{X}}(0, k)$ (the Bayesian setting), we can utilize Gaussian process confidence bounds (Srinivas et al., 2010) and use the corresponding $(\beta_t)_{t\in[T]}$ sequence.
In what follows, we assume that $(\beta_t(\delta))_{t\in[T]}$ is a non-decreasing sequence such that, with probability at least $1-\delta$,

$$\underline Q^t(s,a) \le Q^*(s,a) \le \bar Q^t(s,a) \tag{21}$$

holds for all $t \in [T]$ and $(s,a) \in \mathcal{S}\times\mathcal{A}$.

Corollary 5.1. Assume $(\beta_t(\delta))_{t\in[T]}$ is set to satisfy Eq. (21). Fix $\epsilon \in (0,1)$ and run Algorithm 2 for

$$T \ge \frac{12\,\beta_T^2\, \Gamma_k(T,\lambda)}{\epsilon^2} \tag{22}$$

rounds. Then, for every $s \in \mathcal{S}$, the reported policy $\hat\pi_T(\cdot)$ computed as in Line 6 (Algorithm 2) satisfies $Q^*(s, \hat\pi_T(s)) \ge \max_{a\in\mathcal{A}} Q^*(s,a) - \epsilon$ with probability at least $1-\delta$.

We briefly compare the result obtained in Corollary 5.1 with related results from the literature. In the Bayesian setting, Char et al. (2019, Theorem 1) obtain a sample complexity that scales as $\mathbb{E}[T] = O\big(|\mathcal{S}|^3 |\mathcal{A}|\, \Gamma_k(T,\lambda)/\epsilon^2\big)$ in expectation for a given context distribution. In comparison, our result in Eq. (22) holds in $\ell_\infty$-norm over the context space (i.e., it implies bounds for any context distribution). When specialized to a finite set $\mathcal{X} = \mathcal{S}\times\mathcal{A}$ with $f \sim GP_{\mathcal{X}}(0,k)$, the result of Corollary 5.1 holds with $\beta_T = O(\log(|\mathcal{X}| T^2))$ (Srinivas et al., 2010), which results in $T = O\big(\log^2(|\mathcal{X}| T^2)\,\Gamma_k(T,\lambda)/\epsilon^2\big)$, a significant improvement for large discrete context spaces. In the setting of distributionally robust Bayesian optimization (DRBO), Kirschner et al. (2020) obtain a result with the same dependency as ours; however, their bound holds only for a fixed contextual distribution and degenerates as a function of the distance between the training and test distributions.
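The $H = 1$ specialization makes the full round structure of Algorithm 2 compact enough to sketch. The following illustration uses a squared-exponential kernel and a fixed $\beta$; all helper names are assumptions for the example, not the paper's implementation:

```python
import numpy as np

def rbf(X1, X2, ls=0.3):
    """Illustrative squared-exponential kernel with k(x, x) = 1."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def confidence_bounds(X, y, Xq, beta, lam=1.0):
    """LCB and UCB of Eq. (20) from kernel ridge regression on (X, y)."""
    K = rbf(X, X)
    kq = rbf(X, Xq)
    w = np.linalg.solve(K + lam * np.eye(len(X)), kq)
    mean = w.T @ y
    sigma = np.sqrt(np.maximum(1.0 - (kq * w).sum(0), 0.0) / lam)
    return mean - beta * sigma, mean + beta * sigma

def bo_round(X, y, grid, n_ctx, n_act, beta=2.0):
    """One round: choose the (context, action) query and the reported policy."""
    lcb, ucb = confidence_bounds(X, y, grid, beta)
    lcb, ucb = lcb.reshape(n_ctx, n_act), ucb.reshape(n_ctx, n_act)
    s = int(np.argmax(ucb.max(1) - lcb.max(1)))  # most uncertain context
    a = int(np.argmax(ucb[s]))                   # optimistic action there
    policy = lcb.argmax(1)                       # report argmax of LCB per context
    return s, a, policy
```

The reported policy maximizes the pessimistic estimate per context, mirroring Line 6 of Algorithm 2; the query rule mirrors Eqs. (11) and (12) with contexts in place of states.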

6. RELATED WORK

Reinforcement learning with function approximation dates back to at least Bellman et al. (1963); Daniel (1976); Schweitzer & Seidmann (1985). A majority of work is in the online setting, where the learning agent interacts with the environment while (typically) minimizing regret. Upper confidence bound algorithms, originally developed in the bandit setting (Lattimore & Szepesvári, 2020) and frequently used in the related setting of best-arm identification (e.g., Gabillon et al., 2011; Kalyanakrishnan et al., 2012; Soare et al., 2014), have been successfully applied to tabular Markov decision processes (MDPs) (Auer & Ortner, 2006; Auer et al., 2008) and extended to RL with function approximation. Jin et al. (2020) propose the LSVI-UCB algorithm in the linear MDP setting, which achieves a near-optimal regret bound. Yang et al. (2020); Domingues et al. (2021) extend this work to the non-linear function approximation setting. These works are closely related to ours in that we make use of LSVI and confidence bounds for the Q-function in the kernelized setting. Unlike previous works, we consider the generative model setting and derive bounds on the sample complexity that hold uniformly over the initial state. There are many more alternative parametric models that admit sample-efficient algorithms (e.g., Ayoub et al., 2020; Zhou et al., 2021; Du et al., 2021; Zanette et al., 2020; Liu & Su, 2022). In the generative model setting, the learner has access to a simulator that, for any given state-action pair, returns a next-state sample from the transition kernel. This provides additional flexibility to obtain data from states that are otherwise hard to reach in the environment. For the tabular case, matching upper and lower bounds are shown by Azar et al. (2012); Gheshlaghi Azar et al. (2013). In the generative model setting with function approximation, Lattimore et al.
(2020) show that policy iteration can be used to compute a near-optimal policy given features such that the Q-function of any policy can be approximated by a linear function. Their algorithm uses a D-optimal experimental design to roll out policies from a sufficiently diverse set of states. The POLITEX algorithm (Abbasi-Yadkori et al., 2019; Szepesvári, 2022) can be used in lieu of policy iteration and leads to tighter bounds on the approximation error. A similar approach based on LSVI is analyzed by Agarwal et al. (2019, Chapter 3).

In practical applications of RL, either simpler approaches to exploration are used, or exploration techniques inspired by upper confidence bound algorithms or Thompson sampling are combined with deep learning function approximation. To list a few, the ϵ-greedy approach (Mnih et al., 2013), upper confidence bounds (UCB) (Chen et al., 2017), Thompson sampling (TS) (Osband et al., 2016a), added Ornstein-Uhlenbeck action noise (Lillicrap et al., 2015), and entropy bonuses (Haarnoja et al., 2018) are all widely applied. More sophisticated methods actively plan to encounter novel states (Shyam et al., 2019; Ecoffet et al., 2021). Though these methods serve as reasonable heuristics and are usually computationally efficient, they either lack theoretical guarantees or require large numbers of samples. One recent practical work (Mehta et al., 2022b) gives an acquisition function for the generative model setting based on methods from Bayesian experimental design (Neiswanger et al., 2021) and achieves good policies with small numbers of samples; however, this model-based method assumes access to the MDP reward function and is computationally expensive. An important special case of the MDP setting is the contextual bandit setting.
When combined with linear function approximation, this recovers the contextual linear bandit setting (Abbasi-Yadkori et al., 2011) , and contextual Bayesian optimization when using kernel features (Srinivas et al., 2010; Krause & Ong, 2011) . Various works consider the case where the learner has control over the choice of context during training time. Char et al. (2019) propose a variant based on Thompson sampling. Pearce & Branke (2018) ; Pearce et al. (2020) also propose variants that leverage ideas from the knowledge gradient (Frazier et al., 2009) . The latter works lack theoretical guarantees, while our result (from Section 5) improves upon the sample complexity guarantee of Char et al. (2019) . The approach by Kirschner et al. (2020) for the distributionally robust setting can be specialized to our setting, in which case they recover similar bounds but only for a fixed context distribution.

7.1. REINFORCEMENT LEARNING EXPERIMENTS

In the previous sections we presented the AE-LSVI algorithm, which provably identifies a near-optimal policy with polynomial sample complexity given access to a generative model of the MDP dynamics. Here, we test the AE-LSVI algorithm empirically, and additionally provide one of the first empirical evaluations of the LSVI-UCB method from Yang et al. (2020) on standard benchmarks. We evaluate AE-LSVI and LSVI-UCB on four MDPs from the literature as well as four synthetic contextual BO problems from Char et al. (2019). We discuss details of our implementation in Appendix B.1. Each environment has a discrete action space; for continuous environments, we discretize the action space into 10 bins per dimension but model the value function in the original continuous state and action space. All methods besides DDQN are initialized by executing a random policy for two episodes. In between exploration episodes, the pessimistic policy π̂_T is evaluated by executing it for 10 episodes in the environment.

Initial State Distribution

To evaluate the policies found by each method, we must initialize the policy at initial states drawn from some distribution p_0 at test time. As AE-LSVI does not explicitly consider the initial state distribution, for each environment we choose both a standard p_0 from the literature as well as an alternate distribution p'_0 that is translated in the state space, i.e., p'_0(s) = p_0(s − Δ_s) for some Δ_s. The alternate distribution allows us to evaluate the best-policy estimate in an area of the state space that is not explicitly given to agents. We evaluate each policy using initial states sampled from p'_0 as a proxy for understanding how well the optimal policy has been identified in regions of the state space beyond where it was initialized. We give a complete description of the various p'_0 for each environment in Appendix B.2. In Table 1, we present results for each method and environment when initialized on p_0, which is the typical setup for training and evaluating RL algorithms in the literature. In Table 2, we present results for each method evaluated on the initial state distribution p'_0.

Comparison Methods Besides AE-LSVI and LSVI-UCB, we compare against several ablations and methods taken from the literature. As a naive baseline for performance in active exploration, we randomly sample state-action pairs from the MDP, evaluate the next states and rewards, and fit Q-functions to that data as in the other methods, executing the policy given by the Q-function mean (Random). We also perform uncertainty sampling (US) on the Q-function, choosing state-action pairs at each step that maximize σ_h^t(·,·) as in Eq. (6).
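As a concrete aside, the translated evaluation distribution p'_0(s) = p_0(s − Δ_s) described above can be realized by simply shifting draws from p_0; the sketch below is illustrative, and `p0_sampler` and `delta` are hypothetical names:

```python
import numpy as np

def sample_shifted(p0_sampler, delta, n):
    """Draw n states from p'_0(s) = p_0(s - delta) by shifting draws from p_0."""
    return p0_sampler(n) + delta
```

If s_0 ~ p_0, then s_0 + Δ_s has density p_0(s − Δ_s), which is exactly the translated distribution used for evaluation.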
Additionally, we compare against three online RL baselines: the Double DQN algorithm (Van Hasselt et al., 2016), where an epsilon-greedy approach is used for exploration (DDQN); the bootstrapped DQN (Osband et al., 2016b), which keeps an ensemble of Q-functions and explores by acting according to a sampled Q-function in each exploratory rollout (BDQN); and a greedy exploration algorithm (Greedy) that chooses arg max_a Q̂_h^t(s, a) at every step h for a given state s but uses the same value iteration procedure as the main methods. The experiments are conducted with a default exploration bonus β = 0.5; we also empirically analyze the performance for other β values in Appendix B.3.

Table 2: Average Return ± standard error of executing the identified best policy on the MDP starting from p'_0 over 5 seeds after collecting 1000 timesteps of data through the use of a generative model (left) and online RL methods (right). For online methods, numbers without parentheses refer to training from episodes starting from p_0, whereas numbers in parentheses use the uniform distribution on the state space as initial states during training.
β Tracking: 14.0 ± 0.4 | 9.2 ± 0.9 | 12.5 ± 0.1 | 13.3 ± 0.3 | 13.8 ± 0.1 | 14.0 ± 0.1 | 12.5 ± 0.4 | (13.7 ± 0.2) (13.7 ± 0.2) (13.7 ± 0.1) (13.8 ± 0.1)
β + Rotation: 14.3 ± 0.2 | 12.8 ± 1.4 | 13.3 ± 0.5 | 10.1 ± 0.4 | 12.9 ± 1.1 | 13.7 ± 0.8 | 12.8 ± 0.7 | (12.7 ± 0.3) (13.4 ± 0.3) (12.7 ± 1.2) (7.5 ± 0.2)

Environments We evaluate all methods on four environments: a Cartpole swing-up problem with dense rewards, a nonlinear Navigation problem, and two plasma control problems (β Tracking and β + Rotation) from Mehta et al. (2022a), in which plasma is driven to a desired target state. We give further information on the environments used in Appendix B.2.

Results

As our bound on the value function error uses the ℓ∞(S)-norm, our method provably finds an approximately optimal policy regardless of the initial distribution. The LSVI-UCB method is able to quickly learn a policy for the initial state distribution p_0 given at training time, as it is designed to minimize regret on the episodic MDP initialized at p_0. This can be seen clearly in Table 1, which shows that after 1000 samples, LSVI-UCB performs best on nearly every environment. In the online setting, when the start state distribution is known, greedy and ϵ-greedy methods like DDQN also perform relatively well. We also see in Table 1 that AE-LSVI does not perform particularly well compared to the online methods given the 1000-sample budget. This is to be expected, as the online methods naturally collect data that is reachable from p_0, and LSVI-UCB in particular is designed to minimize regret on episodes beginning from p_0. However, this focus on performing well when starting from p_0 comes at the expense of active exploration and of identifying the best policy uniformly across the state space. As shown in Table 2, AE-LSVI outperforms the baselines when evaluated on a different initial state distribution p'_0, even when the online algorithms are trained with a uniform initial state distribution. This is unsurprising, as AE-LSVI is precisely built for this setting and identifies the best action uniformly across the state space, unlike LSVI-UCB, which aims to minimize regret starting from a fixed initial state distribution. We see that uncertainty sampling outperforms a random data selection strategy and is comparable to the online methods. However, as discussed in Section 3, it is in general the uncertainty in the value of the best action at a state, and not the uncertainty in the value of a state-action pair, that needs to be reduced in order to efficiently find the best policy.
We see that, in general, the online methods perform better on p ′ 0 when they train on episodes uniformly initialized on the state space. This suggests that in these cases, it is helpful to make sure that the evaluation distribution p ′ 0 is supported by the training distribution p 0 . We also note that (as we describe in Appendix B.2) the maximum possible score on Navigation starting from p ′ 0 is higher than that from p 0 due to a starting distribution closer to the goal. We believe that these results give empirical support to the theoretical claims of Section 4.

7.2. OFFLINE CONTEXTUAL BAYESIAN OPTIMIZATION EXPERIMENTS

We test the performance of AE-LSVI (Algorithm 2 in Appendix A.3) in the offline contextual Bayesian optimization setting. In particular, we test the algorithm on the optimization problems presented in Section 3 of Char et al. (2019) , each having a discrete context space but continuous action space. In all experiments, we average over 10 seeds. At the beginning of each experiment, the values corresponding to five actions, chosen uniformly at random, are observed for each context. Every time new data is observed, the hyperparameters of the GP are tuned according to the marginal likelihood. We leverage the Dragonfly library for these experiments (Kandasamy et al., 2020) . Comparison Methods For baselines, we compare against the Multi-task Thompson Sampling (MTS) method presented by Char et al. (2019) , which picks context and action based on the largest improvement over what has been seen according to samples from the posterior. In addition, we compare to the strategy of picking the context with the greatest expected improvement. This method was presented by Swersky et al. (2013) , and we refer to it as Multi-task Expected Improvement (MEI), following Char et al. (2019) . We also compare against the REVI algorithm (Pearce & Branke, 2018), which picks contexts and actions that will increase the posterior mean the most across all contexts. Additionally, we show the performance of naive Thompson sampling (TS) and expected improvement (EI), where contexts are picked in a round robin fashion. Lastly, we show the performance of randomly selecting contexts and actions at each time step (RAND). Experiment Tasks To evaluate the method in the case where the objective function is correlated in context space, we take a higher dimensional function and assign some dimensions to context space and the rest to action space. A single GP with a squared exponential kernel is then used to model the objective function. 
In particular, the Branin-Hoo (Branin, 1972), Hartmann 4, and Hartmann 6 (Picheny et al., 2013) functions are used to create the Branin 1-1, Hartmann 2-2, Hartmann 3-1, and Hartmann 4-2 tasks, where the first number corresponds to the context dimension and the second to the action dimension. These tasks have 10, 9, 8, and 16 equispaced contexts, respectively.

Results. Figure 1 shows the maximum simple regret seen in any given context as a function of the number of values observed, t. As seen from these plots, AE-LSVI is often among the best-performing methods. The only task on which AE-LSVI struggles is Hartmann 4-2. We believe that estimating the amount of improvement to be gained at each context is difficult on this benchmark task. This is supported by the fact that none of the more sophisticated methods outperforms the baseline that applies EI in round-robin fashion. It is likely that improved modeling or hyperparameter selection is needed for these methods to achieve the highest performance on this task.
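The evaluation metric of Figure 1 can be stated compactly. The helper below is hypothetical (both the function and its argument names are ours): `best_value` maps each context to its true optimum, and `observed` maps each context to the list of function values queried there; the metric is the worst simple regret over contexts.

```python
def max_simple_regret(best_value, observed):
    # For each context c, simple regret = f*(c) minus the best value
    # observed at c; report the maximum (worst case) over contexts.
    regrets = [best_value[c] - max(vals) for c, vals in observed.items()]
    return max(regrets)
```

A method is judged by how quickly this quantity shrinks as t grows, since it must do well at every context, not just the easiest one.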

8. CONCLUSION

We provided a new kernelized least-squares value iteration algorithm for RL in the generative model setting, which learns a near-optimal policy for all initial states by actively exploring states where the best action is most uncertain. Our algorithm identifies a near-optimal policy uniformly over the entire state space and attains polynomial sample complexity. Experimentally, we demonstrated that it outperforms other RL algorithms in a variety of environments when robustness to the initial state is required. Perhaps the most immediate direction for future work is to extend the algorithm to the local access model (Yin et al., 2022), in which the simulator can be queried only at states encountered in previous simulation steps.

Lemma A.1. Let $t \in [T]$. Then, for every $(s, a) \in \mathcal{S} \times \mathcal{A}$:

1. If $\overline{Q}^t_h(s, a) \geq \mathcal{T}^*_h \overline{Q}^t_{h+1}(s, a)$ holds for all $h \in [H]$, then $\overline{Q}^t_h(s, a) \geq Q^*_h(s, a)$ is true for all $h \in [H]$.
2. If $\underline{Q}^t_h(s, a) \leq \mathcal{T}^*_h \underline{Q}^t_{h+1}(s, a)$ holds for all $h \in [H]$, then $Q^*_h(s, a) \geq \underline{Q}^t_h(s, a)$ is true for all $h \in [H]$.

Proof. To prove part 1, let $s \in \mathcal{S}$ and $a \in \mathcal{A}$, and assume $\overline{Q}^t_h(s, a) \geq \mathcal{T}^*_h \overline{Q}^t_{h+1}(s, a)$ for all $h \in [H]$ and $t \in [T]$. We prove $\overline{Q}^t_h(s, a) \geq Q^*_h(s, a)$ for all $h \in [H]$ by induction on $h = H, H-1, \ldots, 1$. For the base case $h = H$, we have
$$\overline{Q}^t_H(s, a) \overset{\text{assumption}}{\geq} \mathcal{T}^*_H \overline{Q}^t_{H+1}(s, a) \overset{\text{def. of } \mathcal{T}^*_H}{=} r_H(s, a) + \mathbb{E}_{s' \sim P_H(\cdot \mid s, a)}\Big[\max_{a' \in \mathcal{A}} \overline{Q}^t_{H+1}(s', a')\Big] \tag{24}$$
$$= r_H(s, a) \tag{25}$$
$$= Q^*_H(s, a),$$
where Eq. (25) uses that $\overline{Q}^t_{H+1}$ is the zero function. For the inductive step, we assume that $Q^*_{h+1}(s, a) \leq \overline{Q}^t_{h+1}(s, a)$. Then,
$$Q^*_h(s, a) = \mathcal{T}^*_h Q^*_{h+1}(s, a) \overset{\text{def. of } \mathcal{T}^*_h}{=} r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\Big[\max_{a' \in \mathcal{A}} Q^*_{h+1}(s', a')\Big] \tag{28}$$
$$\overset{\text{inductive hypothesis}}{\leq} r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\Big[\max_{a' \in \mathcal{A}} \overline{Q}^t_{h+1}(s', a')\Big] \tag{29}$$
$$\overset{\text{def. of } \mathcal{T}^*_h}{=} \mathcal{T}^*_h \overline{Q}^t_{h+1}(s, a) \overset{\text{assumption}}{\leq} \overline{Q}^t_h(s, a).$$
This shows $\overline{Q}^t_h(s, a) \geq Q^*_h(s, a)$ for all $h \in [H]$ and thus concludes the proof of the first claim. The second part can be shown analogously.

The following is a standard result that can be found in multiple works.

Lemma A.2. Consider a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, x) \leq 1$ for every $x \in \mathcal{X}$. Then for all $h \in [H]$ and $\lambda \geq 1$ we have
$$\sum_{t=1}^{T} \sigma^t_h(s^t_h, a^t_h) \leq \sqrt{3\, \Gamma_k(T, \lambda)\, T}.$$

Proof. We can, for example, invoke the result of Lemma 3 in Bogunovic & Krause (2021), which in our notation reads
$$\sum_{t=1}^{T} \sigma^t_h(s^t_h, a^t_h) \leq \sqrt{\lambda^{-1}(2\lambda + 1)\, \Gamma_k(T, \lambda)\, T} \quad \text{for } \lambda > 0.$$
Since $\lambda^{-1}(2\lambda + 1) \leq 3$ for $\lambda \geq 1$, we obtain $\sum_{t=1}^{T} \sigma^t_h(s^t_h, a^t_h) \leq \sqrt{3\, \Gamma_k(T, \lambda)\, T}$.

A.2 PROOF OF THEOREM 4.2

Let $\hat{\pi}_T$ be the best-policy estimate returned by the algorithm. Recall the definition
$$\pi^{*\geq h}_T := \big(\pi^{*\geq h}_{T, h'}\big)_{h'=1}^{H} := \begin{cases} \hat{\pi}_{T, h'} & \text{for } h' = 1, \ldots, h-1, \\ \pi^*_{h'} & \text{for } h' = h, \ldots, H, \end{cases} \tag{35}$$
i.e., the policy that equals our best-policy estimate $\hat{\pi}_T$ until step $h-1$ and then equals the optimal policy $\pi^*$. We start the proof with the following useful lemma.

Lemma A.3. Let $\hat{\pi}_T$ be a best-policy estimate, let $s \in \mathcal{S}$ be an initial state, and let $h \in [H]$. Using the notation from Eq. (35), we obtain
$$V_1^{\pi^{*\geq h}_T}(s) - V_1^{\pi^{*\geq h+1}_T}(s) = \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T}\Big[Q^*_h\big(s_h, \pi^*_h(s_h)\big) - Q^*_h\big(s_h, \hat{\pi}_{T,h}(s_h)\big) \,\Big|\, s_1 = s\Big].$$

Proof. To formally prove the lemma, we first explicitly express $V_1^{\pi^{*\geq h}_T}(s)$ and $V_1^{\pi^{*\geq h+1}_T}(s)$ for an arbitrary initial state $s \in \mathcal{S}$ as
$$V_1^{\pi^{*\geq h}_T}(s) = \mathbb{E}_{(a_1, \ldots, s_H) \sim \pi^{*\geq h}_T \mid s_1 = s}\Big[\sum_{h'=1}^{H} r_{h'}(s_{h'}, a_{h'})\Big] \tag{37}$$
$$= \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T \mid s_1 = s}\Big[\mathbb{E}_{(a_h, \ldots, s_H) \sim \pi^* \mid s_h}\Big[\sum_{h'=1}^{H} r_{h'}(s_{h'}, a_{h'})\Big]\Big] \tag{38}$$
$$= \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T \mid s_1 = s}\Big[\sum_{h'=1}^{h-1} r_{h'}(s_{h'}, a_{h'}) + \mathbb{E}_{(a_h, \ldots, s_H) \sim \pi^* \mid s_h}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\Big]\Big], \tag{39}$$
and
$$V_1^{\pi^{*\geq h+1}_T}(s) = \mathbb{E}_{(a_1, \ldots, s_H) \sim \pi^{*\geq h+1}_T \mid s_1 = s}\Big[\sum_{h'=1}^{H} r_{h'}(s_{h'}, a_{h'})\Big] \tag{41}$$
$$= \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T \mid s_1 = s}\Big[\mathbb{E}_{(a_h, s_{h+1}) \sim \hat{\pi}_T \mid s_h}\Big[\mathbb{E}_{(a_{h+1}, \ldots, s_H) \sim \pi^* \mid s_{h+1}}\Big[\sum_{h'=1}^{h-1} r_{h'}(s_{h'}, a_{h'}) + \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\Big]\Big]\Big] \tag{42}$$
$$= \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T \mid s_1 = s}\Big[\sum_{h'=1}^{h-1} r_{h'}(s_{h'}, a_{h'}) + \mathbb{E}_{(a_h, s_{h+1}) \sim \hat{\pi}_T \mid s_h}\Big[\mathbb{E}_{(a_{h+1}, \ldots, s_H) \sim \pi^* \mid s_{h+1}}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\Big]\Big]\Big]. \tag{43}$$
Here, Eqs. (37) and (41) use the definition of $V^{\pi}_1$, Eqs. (38) and (42) use the definition of $\pi^{*\geq h}_T$ and $\pi^{*\geq h+1}_T$ from Eq. (35), and Eqs. (39) and (43) use the property that integration is a linear operator. Lemma A.3 then follows from Eqs. (39) and (43) as well as the definition of $Q^*_h$:
$$V_1^{\pi^{*\geq h}_T}(s) - V_1^{\pi^{*\geq h+1}_T}(s) = \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T \mid s_1 = s}\Big[\mathbb{E}_{(a_h, \ldots, s_H) \sim \pi^* \mid s_h}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\Big] - \mathbb{E}_{(a_h, s_{h+1}) \sim \hat{\pi}_T \mid s_h}\Big[\mathbb{E}_{(a_{h+1}, \ldots, s_H) \sim \pi^* \mid s_{h+1}}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\Big]\Big]\Big] \tag{44}$$
$$= \mathbb{E}_{(a_1, \ldots, s_h) \sim \hat{\pi}_T \mid s_1 = s}\Big[Q^*_h\big(s_h, \pi^*_h(s_h)\big) - Q^*_h\big(s_h, \hat{\pi}_{T,h}(s_h)\big)\Big]. \tag{45}$$
We proceed with the proof using the notation from Eq. (35). We can decompose the instantaneous regret for an arbitrary initial state $s \in \mathcal{S}$ as follows:
$$V^*_1(s) - V_1^{\hat{\pi}_T}(s) = V_1^{\pi^{*\geq 1}_T}(s) - V_1^{\pi^{*\geq H+1}_T}(s) \tag{46}$$
$$= \sum_{h=1}^{H} \Big(V_1^{\pi^{*\geq h}_T}(s) - V_1^{\pi^{*\geq h+1}_T}(s)\Big) \tag{47}$$
$$\overset{\text{Lemma A.3}}{=} \sum_{h=1}^{H} \mathbb{E}_{(s_1, a_1, \ldots, s_h) \sim \hat{\pi}_T}\Big[Q^*_h\big(s_h, \pi^*_h(s_h)\big) - Q^*_h\big(s_h, \hat{\pi}_{T,h}(s_h)\big) \,\Big|\, s_1 = s\Big]. \tag{48}$$
The intuition behind Lemma A.3 as used in Eq. (48) is as follows. Both $V_1^{\pi^{*\geq h}_T}(s)$ and $V_1^{\pi^{*\geq h+1}_T}(s)$ refer to the same random trajectory segment $(s_1, a_1, \ldots, s_h)$ until step $h$ (i.e., the same initial state and policy are used), which is captured by $\mathbb{E}_{(s_1, a_1, \ldots, s_h) \sim \hat{\pi}_T}[\cdot]$. For the remaining steps $h, \ldots, H$, the policies differ only at step $h$, a property which is captured in the difference $Q^*_h(s_h, \pi^*_h(s_h)) - Q^*_h(s_h, \hat{\pi}_{T,h}(s_h))$.

Conditioning on the event in Lemma 4.1 holding true and invoking Lemma A.1, we have that
$$\underline{Q}^t_h(s, a) \leq Q^*_h(s, a) \leq \overline{Q}^t_h(s, a) \tag{49}$$
holds for every $h \in [H]$, $t \in [T]$, and $(s, a) \in \mathcal{S} \times \mathcal{A}$. Next, we proceed to bound $Q^*_h(\cdot, \pi^*_h(\cdot)) - Q^*_h(\cdot, \hat{\pi}_{T,h}(\cdot))$ from Eq. (48) uniformly on $\mathcal{S}$. We have:
$$Q^*_h\big(s, \pi^*_h(s)\big) - Q^*_h\big(s, \hat{\pi}_{T,h}(s)\big) \overset{\text{Eq. (49)}}{\leq} Q^*_h\big(s, \pi^*_h(s)\big) - \max_{t \in [T]} \underline{Q}^t_h\big(s, \hat{\pi}_{T,h}(s)\big) \tag{50}$$
$$\overset{\text{def. of } \hat{\pi}_{T,h}}{=} Q^*_h\big(s, \pi^*_h(s)\big) - \max_{a \in \mathcal{A}} \max_{t \in [T]} \underline{Q}^t_h(s, a) \tag{51}$$
$$= \min_{t \in [T]} \Big[Q^*_h\big(s, \pi^*_h(s)\big) - \max_{a \in \mathcal{A}} \underline{Q}^t_h(s, a)\Big] \tag{52}$$
$$\overset{\text{def. of } \pi^*_h}{=} \min_{t \in [T]} \Big[\max_{a \in \mathcal{A}} Q^*_h(s, a) - \max_{a \in \mathcal{A}} \underline{Q}^t_h(s, a)\Big] \tag{53}$$
$$\overset{\text{Eq. (49)}}{\leq} \min_{t \in [T]} \Big[\max_{a \in \mathcal{A}} \overline{Q}^t_h(s, a) - \max_{a \in \mathcal{A}} \underline{Q}^t_h(s, a)\Big] \tag{54}$$
$$\overset{\text{Eq. (11)}}{\leq} \min_{t \in [T]} \Big[\max_{a \in \mathcal{A}} \overline{Q}^t_h(s^t_h, a) - \max_{a \in \mathcal{A}} \underline{Q}^t_h(s^t_h, a)\Big] \tag{55}$$
$$\overset{\text{Eq. (12)}}{\leq} \min_{t \in [T]} \Big[\overline{Q}^t_h(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h)\Big] \tag{56}$$
$$\leq \frac{1}{T} \sum_{t=1}^{T} \Big[\overline{Q}^t_h(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h)\Big], \tag{57}$$
where the last step uses that a minimum is bounded by the average.
Next, for convenience we introduce the notation
$$d^t_h := \Big[\overline{Q}^t_h(s^t_h, a^t_h) - \mathcal{T}^*_h \overline{Q}^t_{h+1}(s^t_h, a^t_h)\Big] + \Big[\mathcal{T}^*_h \underline{Q}^t_{h+1}(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h)\Big], \tag{58}$$
and obtain the following upper bound on $\overline{Q}^t_h(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h)$ (from Eq. (57)) for every $h \in [H]$, $t \in [T]$:
$$\overline{Q}^t_h(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h) \overset{\text{Eq. (58)}}{=} d^t_h + \mathcal{T}^*_h \overline{Q}^t_{h+1}(s^t_h, a^t_h) - \mathcal{T}^*_h \underline{Q}^t_{h+1}(s^t_h, a^t_h) \tag{59}$$
$$\overset{\text{def. of } \mathcal{T}^*_h}{=} d^t_h + \mathbb{E}_{s' \sim P_h(\cdot \mid s^t_h, a^t_h)}\Big[\max_{a \in \mathcal{A}} \overline{Q}^t_{h+1}(s', a) - \max_{a \in \mathcal{A}} \underline{Q}^t_{h+1}(s', a)\Big] \tag{60}$$
$$\leq d^t_h + \max_{s' \in \mathcal{S}} \Big[\max_{a \in \mathcal{A}} \overline{Q}^t_{h+1}(s', a) - \max_{a \in \mathcal{A}} \underline{Q}^t_{h+1}(s', a)\Big] \tag{61}$$
$$\overset{\text{Eq. (11)}}{=} d^t_h + \max_{a \in \mathcal{A}} \overline{Q}^t_{h+1}(s^t_{h+1}, a) - \max_{a \in \mathcal{A}} \underline{Q}^t_{h+1}(s^t_{h+1}, a) \tag{62}$$
$$\overset{\text{Eq. (12)}}{\leq} d^t_h + \overline{Q}^t_{h+1}(s^t_{h+1}, a^t_{h+1}) - \underline{Q}^t_{h+1}(s^t_{h+1}, a^t_{h+1}). \tag{63}$$
Using the definition of $\overline{Q}^t_{H+1}$ and $\underline{Q}^t_{H+1}$ as the zero functions, we can unroll the recursive inequality from Eq. (63) and upper bound $\overline{Q}^t_h(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h)$ for every $h \in [H]$, $t \in [T]$ as follows:
$$\overline{Q}^t_h(s^t_h, a^t_h) - \underline{Q}^t_h(s^t_h, a^t_h) \leq \sum_{h'=h}^{H} d^t_{h'} \tag{64}$$
$$= \sum_{h'=h}^{H} \Big[\overline{Q}^t_{h'}(s^t_{h'}, a^t_{h'}) - \mathcal{T}^*_{h'} \overline{Q}^t_{h'+1}(s^t_{h'}, a^t_{h'})\Big] + \Big[\mathcal{T}^*_{h'} \underline{Q}^t_{h'+1}(s^t_{h'}, a^t_{h'}) - \underline{Q}^t_{h'}(s^t_{h'}, a^t_{h'})\Big] \tag{66}$$
$$\overset{\text{Lemma 4.1}}{\leq} \sum_{h'=h}^{H} 4\beta\, \sigma^t_{h'}(s^t_{h'}, a^t_{h'}). \tag{67}$$
By substituting the bound from Eq. (67) into Eq. (57), and then into Eq. (48), we arrive at
$$V^*_1(s) - V_1^{\hat{\pi}_T}(s) \leq 4\beta \sum_{h=1}^{H} \sum_{h'=h}^{H} \frac{1}{T} \sum_{t=1}^{T} \sigma^t_{h'}(s^t_{h'}, a^t_{h'}) \leq 2\sqrt{3}\, \beta H(H+1) \sqrt{\frac{\Gamma_k(T, \lambda)}{T}}, \tag{68}$$
where the last inequality follows from Lemma A.2. Since Eq. (68) holds for any $s \in \mathcal{S}$, we arrive at our main result:
$$\big\|V^*_1 - V_1^{\hat{\pi}_T}\big\|_{\ell_\infty(\mathcal{S})} \leq 2\sqrt{3}\, \beta H(H+1) \sqrt{\frac{\Gamma_k(T, \lambda)}{T}}.$$

Our algorithm for the offline contextual Bayesian optimization setting is presented in Algorithm 2; in each round it observes the reward $y_t = Q^*(s_t, a_t) + \eta_t$ and, after $T$ rounds, outputs the policy estimate $\hat{\pi}_T$ with $\hat{\pi}_T(\cdot) = \arg\max_{a \in \mathcal{A}} \max_{t \in [T]} \underline{Q}^t(\cdot, a)$.
As a side observation, we note that similarly to Char et al. (2019), we can also simply incorporate context weights into the proposed acquisition function in case they are available: given a weight $w(s)$ for context $s$ (which may depend on the probability of seeing $s$ at evaluation time or on the importance of $s$), we select
$$s_t \in \arg\max_{s \in \mathcal{S}} \Big[\max_{a \in \mathcal{A}} \overline{Q}^t(s, a) - \max_{a \in \mathcal{A}} \underline{Q}^t(s, a)\Big]\, w(s).$$

A.3.1 PROOF OF COROLLARY 5.1

Proof. In this proof, we condition on the event in Eq. (21) holding true. Arguments similar to those in Eqs. (59)–(63) lead to the following for every $s \in \mathcal{S}$:
$$\max_{a \in \mathcal{A}} Q^*(s, a) - Q^*\big(s, \hat{\pi}_T(s)\big) \overset{\text{Eq. (21)}}{\leq} \max_{a \in \mathcal{A}} Q^*(s, a) - \max_{t \in [T]} \underline{Q}^t\big(s, \hat{\pi}_T(s)\big) \tag{71}$$
$$\overset{\text{def. of } \hat{\pi}_T}{=} \max_{a \in \mathcal{A}} Q^*(s, a) - \max_{a \in \mathcal{A}} \max_{t \in [T]} \underline{Q}^t(s, a) \tag{72}$$
$$= \min_{t \in [T]} \Big[\max_{a \in \mathcal{A}} Q^*(s, a) - \max_{a \in \mathcal{A}} \underline{Q}^t(s, a)\Big] \tag{73}$$
$$\overset{\text{Eq. (21)}}{\leq} \min_{t \in [T]} \Big[\max_{a \in \mathcal{A}} \overline{Q}^t(s, a) - \max_{a \in \mathcal{A}} \underline{Q}^t(s, a)\Big] \tag{74}$$
$$\overset{\text{def. of } s_t}{\leq} \min_{t \in [T]} \Big[\max_{a \in \mathcal{A}} \overline{Q}^t(s_t, a) - \max_{a \in \mathcal{A}} \underline{Q}^t(s_t, a)\Big] \tag{75}$$
$$\overset{\text{def. of } a_t}{\leq} \min_{t \in [T]} \Big[\overline{Q}^t(s_t, a_t) - \underline{Q}^t(s_t, a_t)\Big] \tag{76}$$
$$\leq \frac{1}{T} \sum_{t=1}^{T} \Big[\overline{Q}^t(s_t, a_t) - \underline{Q}^t(s_t, a_t)\Big] \tag{77}$$
$$\overset{\text{Eq. (20)}}{=} \frac{1}{T} \sum_{t=1}^{T} 2\beta_t\, \sigma^t(s_t, a_t) \tag{78}$$
$$\leq \frac{2\beta_T}{T} \sum_{t=1}^{T} \sigma^t(s_t, a_t) \tag{79}$$
$$\overset{\text{Lemma A.2}}{\leq} \frac{2\beta_T \sqrt{3\, \Gamma_k(T, \lambda)}}{\sqrt{T}}. \tag{80}$$
Finally, by setting $\epsilon \geq 2\beta_T \sqrt{3\, \Gamma_k(T, \lambda)} / \sqrt{T}$ and expressing this condition in terms of $T$, we arrive at the main result.

B ADDITIONAL EXPERIMENTAL DETAILS

B.1 IMPLEMENTATION

We use an exact Gaussian process with a squared exponential kernel with learned scale parameters in each dimension for the value function regression in Eq. (4). We fit the kernel hyperparameters at each iteration using 1,000 iterations of Adam (Kingma & Ba, 2014), maximizing the marginal log likelihood of the training data. We use the TinyGP package (Foreman-Mackey, 2021), built on top of JAX (Bradbury et al., 2018), in order to take advantage of JIT compilation. All experiments are conducted with a fixed bonus β = 0.5; we empirically evaluate various settings of β in Appendix B.3. To find an approximate maximizer of the objective in Eq. (11), we uniformly sample 1,000 points from the state space and evaluate the objective at each of them.

DDQN and BDQN. For both of these methods we use networks with two hidden layers of 256 units each. For the bootstrapped DQN, we use a network with 10 different heads, each representing a different Q-function. For each step collected during exploration, a corresponding mask is generated and added to the replay buffer that signifies which heads will train on this sample; each Q-function has a probability of 0.5 of being trained on each transition.
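The sample-based maximization described above can be sketched as follows. This is a minimal sketch under our own naming: `acq` stands in for any vectorized acquisition function (such as the objective in Eq. (11)), and the box bounds are assumptions, while the paper's implementation additionally relies on JAX JIT compilation.

```python
import numpy as np

def sample_argmax(acq, low, high, n=1000, rng=None):
    # Approximate argmax of an acquisition function over a box-shaped
    # state space by scoring n uniform samples (the paper uses n = 1000).
    rng = np.random.default_rng() if rng is None else rng
    cand = rng.uniform(low, high, size=(n, len(low)))  # (n, d) candidates
    scores = acq(cand)                                 # (n,) scores
    return cand[int(np.argmax(scores))]
```

This trades exactness for simplicity: the returned point is only as good as the densest of the n samples, which is why a modest budget works in low-dimensional state spaces but degrades as the dimension grows.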

B.2 ENVIRONMENTS

Each environment is defined with a native reward function taken from the literature. We established upper and lower bounds on the reward function values and used them to rescale the rewards to [0, 1] so that our environments match the theoretical results in this paper.

Cartpole. We use a modified version of the cartpole environment from Mehta et al. (2022a) that has the dense rewards implemented in Wang et al. (2019). The state space is 4D and consists of the horizontal position and velocity of the cart as well as the angular position and velocity of the pole. In this environment, $p_0$ is a normal distribution with very small variance, centered on the configuration with the cart horizontally below the goal and the pole hanging down. $p'_0$ is the same distribution displaced 5 meters to the right.

Navigation. This is a 2D navigation problem with dynamics of the form $s_{t+1} = s_t + B(s_t) a_t$, where, writing $s_t = (x_1, x_2)$,
$$B(s_t) = \begin{pmatrix} \sin(x_2/10) + 4 & 0 \\ 0 & 1.5\cos(x_1/10) - 2 \end{pmatrix}.$$
The goal is fixed at $(6, 9)$. We define $p_0$ to be the uniform distribution over the axis-aligned rectangle with corners $(-8, -9)$ and $(-6, -6)$, and $p'_0$ to be the uniform distribution over the axis-aligned rectangle with corners $(1, 4)$ and $(3, 7)$. The reward at every timestep is simply the negative $\ell_1$-distance between the agent and the goal.

β Tracking and β + Rotation. Our two simulated plasma control problems are taken from Mehta et al. (2022a), which gives a thorough description of their relevance to the problem of nuclear fusion. At a high level, $\beta_N$ is a normalized plasma pressure ratio that is correlated with the economic output of a fusion reactor. Our β Tracking environment aims to adjust the injected power during the reactions in order to achieve a target value of $\beta_N = 2\%$. The initial state distribution $p_0$ is taken from a set of real datapoints from shots on the DIII-D tokamak in San Diego.
Our alternate initial state distribution $p'_0$ consists of adding 0.4 to each component of a vector sampled from $p_0$. The reward function is the negative $\ell_1$-distance between the $\beta_N$ value and the 2% target. The dynamics are given by a learned model of the plasma state introduced in Char et al. (2022). The β + Rotation environment is a more complex plasma control problem that introduces an additional actuator (injected torque) and an additional control objective (controlling plasma rotation); rotation control is key to plasma stability, and this environment is a reduced version of the realistic problem. It also uses a dynamics model from Char et al. (2022), real plasma states for the initial state distribution $p_0$, and a fixed translation for the alternate initial state distribution $p'_0$. For every episode, we also include randomly drawn targets for $\beta_N$ and rotation in the state space.
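The Navigation dynamics above are simple enough to state directly in code. This is a minimal sketch: we assume the per-step reward is evaluated at the post-transition state, and we show the native (unscaled) reward, before the [0, 1] rescaling described above.

```python
import numpy as np

GOAL = np.array([6.0, 9.0])

def nav_step(s, a):
    """One step of the 2D Navigation dynamics s' = s + B(s) a, with the
    state-dependent gain matrix B from Appendix B.2."""
    x1, x2 = s
    B = np.array([[np.sin(x2 / 10.0) + 4.0, 0.0],
                  [0.0, 1.5 * np.cos(x1 / 10.0) - 2.0]])
    s_next = s + B @ a
    # Native reward: negative l1-distance to the goal (assumed to be
    # measured at the next state).
    reward = -np.abs(s_next - GOAL).sum()
    return s_next, reward
```

For example, from the origin with action (1, 1), B reduces to diag(4, −0.5), so the agent moves to (4, −0.5) and receives reward −11.5.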

B.3 EXPLORING β VALUES

In the main paper, we report experiments with the exploration parameter β = 0.5 for all t. We do not explore principled methods of choosing β in this work and welcome future work in this area. In lieu of this, we provide an empirical analysis of the sensitivity of AE-LSVI to different settings of β. We ran AE-LSVI on our evaluation environments as in the experiments of Table 2, allowing each method to collect 1,000 timesteps of data and evaluating the identified policies on the environments starting from an evaluation initial distribution p′0 distinct from the initial distribution p0. In Table 3 we observe that lower values of β perform better: at higher settings the confidence bounds appear too wide, and the performance becomes similar to that of uncertainty sampling. Therefore, we recommend initially trying β values around 0.2–0.5 when applying AE-LSVI.



The results on $b_T$ hold despite Yang et al. (2020, Lemma D.1) being stated only for the smaller class obtained from adding only $+\beta\sigma_{\mathcal{D}}(\cdot, \cdot)$ in the definition of $Q(\cdot, \cdot)$ in $\mathcal{Q}(T, h, b)$ in Eq. (15).



Figure 1: The maximum simple regret seen in any given context for the offline contextual Bayesian optimization experiments. The shaded regions show the standard error over 10 different seeds.


Algorithm 1 AE-LSVI (Active Exploration with Least-Squares Value Iteration)
Require: kernel function k(·, ·), exploration parameter β > 0, regularizer λ ≥ 1
1: for t = 1, . . . , T do (remaining steps of the loop are given in the main text)

Let $B_Q > 0$ be a fixed positive constant. Let $k : (\mathcal{S} \times \mathcal{A})^2 \to \mathbb{R}$ be a continuous kernel function on a compact set $\mathcal{S} \times \mathcal{A} \subset \mathbb{R}^d$ such that $\sup_{x, x' \in \mathcal{S} \times \mathcal{A}} k(x, x') \leq 1$. We assume that $\|\mathcal{T}^*_h Q\|_{\mathcal{H}} \leq B_Q H$ for all functions $Q : \mathcal{S} \times \mathcal{A} \to [0, H]$ and all $h \in [H]$, where $\mathcal{T}^*_h$ denotes the Bellman optimality operator, i.e., $(\mathcal{T}^*_h Q)(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\big[\max_{a' \in \mathcal{A}} Q(s', a')\big]$.

Average Return ± standard error of executing the identified best policy on the MDP starting from p 0 over 5 seeds after collecting 1000 timesteps of data through the use of a generative model (left of line) or episodes starting from p 0 (right of line).

ACKNOWLEDGMENTS

Johannes Kirschner gratefully acknowledges funding from the SNSF Early Postdoc.Mobility fellowship P2EZP2 199781. Ian Char is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745016 and DGE2140739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Viraj Mehta was supported in part by US Department of Energy grants under contract numbers DE-SC0021414 and DE-AC02-09CH1146. Willie Neiswanger was supported in part by NSF (#1651565), AFOSR (FA95501910024), ARO (W911NF-21-1-0125), CZ Biohub, and Sloan Fellowship. In addition, this project has received support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme grant No. 815943.

REPRODUCIBILITY STATEMENT

The proof of Theorem 4.2 is provided in Appendix A.2 and the proof of Corollary 5.1 is given in Appendix A.3. The supplementary material includes the source code for the experiments. It also includes a requirements file and README with full instructions on how to run the RL and BO experiments. Although we are not allowed to provide the data used for running the β Tracking and β + Rotation experiments at this time, all other experiments can be run using the provided code. Lastly, experimental details about the implementation and the environments used can be found in Appendix B.1 and Appendix B.2, respectively.


Table 3: Average Return ± standard error of executing the identified best policy on the MDP starting from p′0 over 5 seeds after collecting 1,000 timesteps of data using the AE-LSVI method with varying values of the exploration parameter β. Recovered rows:

(unlabeled row) | 13.9 ± 0.3 | 14.0 ± 0.4 | 13.2 ± 1.3 | 13.4 ± 0.7
β + Rotation | 14.8 ± 0.3 | 14.3 ± 0.2 | 14.1 ± 0.9 | 13.3 ± 0.9

