UNDERSTANDING THE COMPLEXITY GAINS OF REFORMULATING SINGLE-TASK RL WITH A CURRICULUM

Abstract

Reinforcement learning (RL) problems can be challenging without well-shaped rewards. Prior work on provably efficient RL methods generally proposes to address this issue with dedicated exploration strategies. However, another way to tackle this challenge is to reformulate it as a multi-task RL problem, where the task space contains not only the challenging task of interest but also easier tasks that implicitly function as a curriculum. Such a reformulation opens up the possibility of running existing multi-task RL methods as a more efficient alternative to solving a single challenging task from scratch. In this work, we provide a theoretical framework that reformulates a single-task RL problem as a multi-task RL problem defined by a curriculum. Under mild regularity conditions on the curriculum, we show that sequentially solving each task in the multi-task RL problem is more computationally efficient than solving the original single-task problem, without any explicit exploration bonuses or other exploration strategies. We also show that our theoretical insights can be translated into an effective practical learning algorithm that can accelerate curriculum learning on robotic goal-reaching tasks.

1. INTRODUCTION

Reinforcement learning (RL) provides an appealing and simple way to formulate control and decision-making problems in terms of reward functions that specify what an agent should do, and then automatically train policies to learn how to do it. However, in practice the specification of the reward function requires great care: if the reward function is well-shaped, then learning can be fast and effective, but if rewards are delayed, sparse, or can only be achieved after extensive exploration, RL problems can be exceptionally difficult (Kakade and Langford, 2002; Andrychowicz et al., 2017; Agarwal et al., 2019). This challenge is often overcome with either reward shaping (Ng et al., 1999; Andrychowicz et al., 2017; 2020; Gupta et al., 2022) or dedicated exploration methods (Tang et al., 2017; Stadie et al., 2015; Bellemare et al., 2016; Burda et al., 2018), but reward shaping can bias the solution away from optimal behavior, while even the best exploration methods may, in general, require covering the entire state space before discovering high-reward regions. On the other hand, a number of recent works have proposed multi-task RL methods that learn contextual policies which simultaneously represent solutions to an entire space of tasks, such as policies that reach any potential goal (Fu et al., 2018; Eysenbach et al., 2020b; Fujita et al., 2020; Zhai et al., 2022), policies conditioned on language commands (Nair et al., 2022), or even policies conditioned on the parameters of parametric reward functions (Kulkarni et al., 2016; Siriwardhana et al., 2019; Eysenbach et al., 2020a; Yu et al., 2020b).
While such methods are often not motivated directly from the standpoint of handling challenging exploration scenarios, but rather aim to acquire policies that can perform all tasks in the task space, these multi-task formulations often present a more tractable learning problem than acquiring a solution to a single challenging task in the task space (e.g., the hardest goal, or the most complex language command). We pose the following question: can we construct a multi-task RL problem with contextual policies that is easier than solving a single-task RL problem from scratch? In this work, we answer this question affirmatively by analyzing the sample complexity of a class of curriculum learning methods. To build intuition for how reformulating a single-task problem into a multi-task problem enables efficient learning, consider the setting where the optimal state visitation distributions $d^{\pi^\star_\omega}_\mu, d^{\pi^\star_{\omega'}}_\mu$ of two different contexts $\omega, \omega'$ are "similar", and our goal is to learn the optimal policy $\pi^\star_{\omega'}$ w.r.t. $\omega'$. Suppose we have already learned the optimal policy $\pi^\star_\omega$; we can then facilitate learning $\pi^\star_{\omega'}$ by: (1) using $\pi^\star_\omega$ as an initialization, and (2) setting a new initial state distribution $\mu' = \beta d^{\pi^\star_\omega}_\mu + (1-\beta)\mu$ by mixing $d^{\pi^\star_\omega}_\mu$, the optimal state visitation distribution of $\pi^\star_\omega$, and $\mu$, the initial distribution of interest. Using $\pi^\star_\omega$ as the initialization for learning $\pi^\star_{\omega'}$ facilitates the learning process because it guarantees that the initialization lies within some neighborhood of optimality. Setting the new initial distribution $\mu'$ amounts to rolling in with the optimal policy $\pi^\star_\omega$ of a nearby context with probability $\beta$ before collecting experience for $\omega'$. We refer to this approach, which consists of rolling in with the optimal state visitation distribution of a similar context, as ROLLIN. We illustrate the intuition of ROLLIN in Figure 1.
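To make the mixture concrete, the construction of $\mu'$ can be sketched for a tabular MDP as follows. This is a minimal illustrative sketch, not code from the paper; the transition tensor, policy, and $\beta$ value in the usage are hypothetical stand-ins:

```python
import numpy as np

def discounted_visitation(P, pi, mu, gamma):
    """d^pi_mu(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0 ~ mu, pi)."""
    S = mu.shape[0]
    # State-to-state transition matrix under pi: M[s, s'] = sum_a pi(a|s) P(s'|s, a)
    M = np.einsum("sa,sap->sp", pi, P)
    # d satisfies the linear system d = (1 - gamma) * mu + gamma * M^T d
    return np.linalg.solve(np.eye(S) - gamma * M.T, (1 - gamma) * mu)

def mixed_initial_distribution(P, pi_star, mu, gamma, beta):
    """mu' = beta * d^{pi*}_mu + (1 - beta) * mu, the roll-in mixture."""
    return beta * discounted_visitation(P, pi_star, mu, gamma) + (1 - beta) * mu
```

Since $d^{\pi^\star_\omega}_\mu$ and $\mu$ are both probability distributions, $\mu'$ remains a valid distribution for any $\beta \in [0, 1]$.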
More specifically, we adopt the contextual MDP formulation, where each MDP $M_\omega$ is uniquely defined by a context $\omega$ in the context space $W \subset \mathbb{R}^n$, and we are given a curriculum $\{\omega_k\}_{k=0}^K$, with the last MDP $M_{\omega_K}$ being the MDP of interest. To show our main results, we only require a Lipschitz continuity assumption on $r_\omega$ w.r.t. $\omega$ and some mild regularity conditions on the curriculum $\{\omega_k\}_{k=0}^K$. We show that learning $\pi^\star_{\omega_K}$ by recursively rolling in with a near-optimal policy for $\omega_k$ to construct the initial distribution $\mu_{k+1}$ for the next context $\omega_{k+1}$ is provably more efficient than learning $\pi^\star_{\omega_K}$ from scratch. In particular, we show that when an appropriate sequence of contexts is selected, we can reduce the iteration and sample complexity bounds of entropy-regularized softmax policy gradient (with an inexact stochastic estimation of the gradient) from an original exponential dependency on the state space size, as suggested by Ding et al. (2021), to a polynomial dependency. We also prescribe a practical implementation of ROLLIN. In summary, our contributions are as follows. First, we provide a theoretical method (ROLLIN) that facilitates single-task policy learning by recasting it as a multi-task problem under entropy-regularized softmax policy gradient (PG), which reduces the exponential complexity bound of entropy-regularized PG to a polynomial dependency on $S$. Second, we provide a deep RL implementation of ROLLIN and demonstrate that adding ROLLIN improves performance on simulated goal-reaching tasks with an oracle curriculum and a non-oracle curriculum learned from MEGA (Pitis et al., 2020), as well as on several standard MuJoCo locomotion tasks inspired by meta-RL (Clavera et al., 2018).

2. RELATED WORK

Convergence of policy gradient methods. Theoretical analysis of policy gradient methods has a long history (Williams, 1992; Sutton et al., 1999; Konda and Tsitsiklis, 1999; Kakade and Langford, 2002; Peters and Schaal, 2008). Motivated by the recent empirical success of policy gradient (PG) methods (Schulman et al., 2015; 2017), the theory community has extensively studied the convergence of PG in various settings (Fazel et al., 2018; Agarwal et al., 2021; 2020; Bhandari and Russo, 2019; Mei et al., 2020; Zhang et al., 2020b; Agarwal et al., 2020; Zhang et al., 2020a; Li et al., 2021; Cen et al., 2021; Ding et al., 2021; Yuan et al., 2022; Moskovitz et al., 2022). Agarwal et al. (2021) established the asymptotic global convergence of policy gradient under different policy parameterizations. We extend the result on entropy-regularized PG with stochastic gradients (Ding et al., 2021) to the contextual MDP setting. In particular, our contextual MDP setting reduces the exponential state space dependency of the iteration number and per-iteration sample complexity suggested by Ding et al. (2021) to a polynomial dependency. We also note that there exist many convergence analyses of other variants of PG whose iteration complexity does not suffer from an exponential state space dependency (Agarwal et al., 2021; Mei et al., 2020), but they assume access to the exact gradient during each PG update, while we assume a stochastic estimate of the gradient, which is arguably more practical. Contextual MDPs. Contextual MDPs (or MDPs with side information) have been studied extensively in the theoretical RL literature (Abbasi-Yadkori and Neu, 2014; Hallak et al., 2015; Dann et al., 2019; Jiang et al., 2017; Modi et al., 2018; Sun et al., 2019; Modi et al., 2020).
We analyze the iteration complexity and sample complexity of (stochastic) policy gradient methods, which is distinct from these prior works that mainly focus on regret bounds (Abbasi-Yadkori and Neu, 2014; Hallak et al., 2015; Dann et al., 2019) and PAC bounds (Jiang et al., 2017; Modi et al., 2018; Sun et al., 2019; Dann et al., 2019; Modi et al., 2020). Several works assumed a linear transition kernel and reward model (or a generalized linear model (Abbasi-Yadkori and Neu, 2014)) with respect to the context (Abbasi-Yadkori and Neu, 2014; Modi et al., 2018; Dann et al., 2019; Modi et al., 2020; Belogolovsky et al., 2021). These assumptions are similar to ours: we make a weaker Lipschitz continuity assumption on the reward function with respect to the context space (since linearity implies Lipschitz continuity) and a stronger shared transition kernel assumption. Exploration. A number of prior works have shown that one can reduce the complexity of learning an optimal policy with effective exploration methods (Azar et al., 2017; Jin et al., 2018; Du et al., 2019; Misra et al., 2020; Agarwal et al., 2020; Zhang et al., 2020d). The computational efficiency suggested by our work differs from that of the aforementioned methods that rely on adding an exploration bonus to construct a policy cover (Azar et al., 2017; Jin et al., 2018; Agarwal et al., 2020; Zhang et al., 2020d), as we assume access to a "good" curriculum which ensures the optimal policy defined by the next context is not too different from the optimal policy of the current context. The key intuition of our work is more closely related to strategic exploration in the block MDP setting (Du et al., 2019; Misra et al., 2020). The block MDP setting assumes the state space can be encoded into a feature space with a sequential block structure, and explores the next block in the feature space after obtaining a policy that covers the previous blocks.
Such a sequential exploration procedure is similar to ROLLIN, as our method also solves the multi-task contextual MDPs sequentially, but ROLLIN focuses on learning the optimal policy of the next context once we have a near-optimal policy of the current context. Curriculum learning in reinforcement learning. Curriculum learning is a powerful idea that has been widely used in RL (Florensa et al., 2017; Kim and Choi, 2018; Omidshafiei et al., 2019; Ivanovic et al., 2019; Akkaya et al., 2019; Portelas et al., 2020; Bassich et al., 2020; Fang et al., 2020; Klink et al., 2020; Dennis et al., 2020; Parker-Holder et al., 2022; Liu et al., 2022) (see Narvekar et al. (2020) for a detailed survey). Although curricula formed by well-designed reward functions (Vinyals et al., 2019; OpenAI, 2018; Berner et al., 2019; Ye et al., 2020; Zhai et al., 2022) are usually sufficient given enough domain knowledge, tackling problems with limited domain knowledge requires a more general approach where a suitable curriculum is automatically formed from a task space. In the goal-conditioned reinforcement learning literature, this corresponds to automatic goal proposal mechanisms (Florensa et al., 2018; Warde-Farley et al., 2018; Sukhbaatar et al., 2018; Ren et al., 2019; Ecoffet et al., 2019; Hartikainen et al., 2019; Pitis et al., 2020; Zhang et al., 2020c; OpenAI et al., 2021; Zhang et al., 2021). The practical instantiation of this work is also similar to Bassich et al. (2020) and Liu et al. (2022), where a curriculum is adopted for learning a progression of tasks. Learning conditional policies in multi-task RL. Multi-task RL (Tanaka and Yamamura, 2003) approaches usually learn a task-conditioned policy that is shared across different tasks (Rusu et al., 2015; Rajeswaran et al., 2016; Andreas et al., 2017; Finn et al., 2017; D'Eramo et al., 2020; Yu et al., 2020a; Ghosh et al., 2021; Kalashnikov et al., 2021).
Compared to learning each task independently, joint training enjoys sample efficiency benefits from sharing learned experience across different tasks, as long as the policies generalize well across tasks. To encourage generalization, it is often desirable to instead condition policies on low-dimensional feature representations that are shared across different tasks (e.g., using variational auto-encoders (Nair et al., 2018; Pong et al., 2019; Nair et al., 2020) or a variational information bottleneck (Goyal et al., 2019; 2020; Mendonca et al., 2021)). The idea of learning contextual policies has also been discussed in the classical adaptive control literature (Sastry et al., 1990; Tao, 2003; Landau et al., 2011; Åström and Wittenmark, 2013; Goodwin and Sin, 2014). Different from these prior works, which mostly focus on learning policies that generalize across different tasks, our work focuses on how the near-optimal policy of a learned task can be used to help the learning of a similar task.

3. PRELIMINARIES

We consider the contextual MDP setting, where a contextual MDP $M_W = (W, S, A, P, r_\omega, \gamma, \rho)$ consists of a context space $W$, a state space $S$, an action space $A$, a transition dynamics function $P : S \times A \to \mathcal{P}(S)$ (where $\mathcal{P}(X)$ denotes the set of all probability distributions over a set $X$), a context-conditioned reward function $r : W \times S \times A \to [0, 1]$, a discount factor $\gamma \in (0, 1]$, and an initial state distribution of interest $\rho$. For convenience, we use $S = |S|$, $A = |A|$ to denote the number of states and actions. While some contextual MDP formulations (Hallak et al., 2015) have context-conditioned transition dynamics and reward functions, we consider the setting where only the reward function changes across contexts. We denote by $r_\omega$ the reward function conditioned on a fixed $\omega \in W$ and by $M_\omega = (S, A, P, r_\omega, \gamma, \rho)$ the MDP induced by this fixed reward function. We use $\pi(a|s) : S \to \mathcal{P}(A)$ to denote a policy and adopt the softmax parameterization: $\pi_\theta(a|s) = \frac{\exp[\theta(s,a)]}{\sum_{a'} \exp[\theta(s,a')]}$, where $\theta : S \times A \to \mathbb{R}$. We use $d^\pi_\rho(s) := (1-\gamma) \sum_{t=0}^\infty \gamma^t P^\pi(s_t = s \mid s_0 \sim \rho)$ to denote the discounted state visitation distribution and $V^\pi_\omega := \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_\omega(s_t, a_t)\right] + \alpha H(\rho, \pi)$ to denote the entropy-regularized discounted return on $M_\omega$, where $H(\rho, \pi) := \mathbb{E}_{s_0 \sim \rho,\, a_h \sim \pi(\cdot|s_h)}\left[\sum_{h=0}^\infty -\gamma^h \log \pi(a_h|s_h)\right]$ is the discounted entropy term. We use $\pi^\star_\omega := \arg\max_\pi V^\pi_\omega$ to denote an optimal policy that maximizes the discounted return under $M_\omega$. We assume all the contextual reward functions are bounded within $[0, 1]$: $r_\omega(s, a) \in [0, 1]$, $\forall \omega \in W$, $\forall (s, a) \in S \times A$. Similarly to previous analyses (Agarwal et al., 2021; Mei et al., 2020; Ding et al., 2021), we assume the initial distribution $\rho$ for PG or stochastic PG satisfies $\rho(s) > 0$, $\forall s \in S$.
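As a concrete illustration of the softmax parameterization and the entropy-regularized return above, here is a minimal tabular evaluation routine. This is an illustrative sketch, not code from the paper; the MDP instance in the usage is hypothetical, and $\alpha$, $\gamma$ are the temperature and discount factor defined above:

```python
import numpy as np

def softmax_policy(theta):
    """pi_theta(a|s) = exp(theta[s, a]) / sum_a' exp(theta[s, a'])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))  # numerically stabilized
    return z / z.sum(axis=1, keepdims=True)

def entropy_regularized_return(P, r, pi, rho, gamma, alpha):
    """V^pi_omega(rho): discounted return plus alpha times the discounted entropy."""
    S = r.shape[0]
    # Per-state reward with entropy bonus: sum_a pi(a|s) [r(s, a) - alpha log pi(a|s)]
    g = (pi * (r - alpha * np.log(pi))).sum(axis=1)
    M = np.einsum("sa,sap->sp", pi, P)  # state-to-state kernel under pi
    V = np.linalg.solve(np.eye(S) - gamma * M, g)  # Bellman equation V = g + gamma M V
    return rho @ V
```

With $\alpha = 0$ this reduces to ordinary policy evaluation; the softmax policy always assigns positive probability to every action, so the $\log \pi$ term is well defined.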
3.1 ASSUMPTIONS

Given a curriculum $\{\omega_k\}_{k=0}^K$, where the last context $\omega_K$ defines $M_{\omega_K}$, the MDP of interest, our goal is to show that reformulating $M_{\omega_K}$ into a multi-task problem $\{M_{\omega_k}\}_{k=0}^K$ enjoys better computational and sample complexity than solving the single-task problem $M_{\omega_K}$ from scratch. As we will show in Section 4, if we have a "good" curriculum $\{\omega_k\}_{k=0}^K$ such that the optimal policies $\pi^\star_{\omega_k}, \pi^\star_{\omega_{k+1}}$ w.r.t. two consecutive contexts $\omega_k, \omega_{k+1}$ are "close enough" to each other, then using an $\varepsilon$-optimal policy of $\omega_k$ as an initialization allows us to start directly in the near-optimal regime of $\omega_{k+1}$, hence only requiring polynomial complexity to learn $\pi^\star_{\omega_{k+1}}$. Formally, we use the two following assumptions to characterize the good properties of our curriculum.

Assumption 3.1 (Lipschitz Reward in the Context Space) The reward function is Lipschitz continuous w.r.t. the context: $\max_{s,a} |r_\omega(s,a) - r_{\omega'}(s,a)| \le L_r \|\omega - \omega'\|_2$, $\forall \omega, \omega' \in W$.

Assumption 3.2 (Similarity Between Two Consecutive Contexts) We assume the given curriculum $\{\omega_k\}_{k=0}^K$ satisfies $\max_{0 \le k \le K-1} \|\omega_{k+1} - \omega_k\|_2 \le O(S^{-2})$, and we have access to a near-optimal initialization $\theta^{(0)}_0$ for learning $\pi^\star_{\omega_0}$ (formally defined in Section 4.2).

Intuitively, Assumption 3.1 defines the similarity between two tasks via Lipschitz continuity in the context space; similar Lipschitz assumptions also appear in Abbasi-Yadkori and Neu (2014); Modi et al. (2018); Dann et al. (2019); Modi et al. (2020); Belogolovsky et al. (2021). Assumptions 3.1 and 3.2 together quantify the maximum difference between two consecutive tasks $M_{\omega_{k-1}}, M_{\omega_k}$ in terms of the maximum difference between their reward functions. Assumptions 3.1 and 3.2 also play a crucial role in reducing the exponential complexity to a polynomial one, and we briefly discuss the intuition in the next section.

3.2 TWO-PHASE CONVERGENCE OF STOCHASTIC PG

Ding et al. (2021) proposed a two-phase convergence analysis framework for PG with a stochastic gradient. In particular, the authors demonstrate that, with high probability, stochastic PG with arbitrary initialization achieves an $\varepsilon$-optimal policy with iteration numbers $T_1, T_2$ and per-iteration sample complexities $B_1, B_2$ in two separate phases, where $T_1 = \tilde{\Omega}(S^{2S^3})$, $T_2 = \tilde{\Omega}(S^{3/2})$ ($\tilde{\Omega}(\cdot)$ suppresses $\log S$ and terms that do not contain $S$) and $B_1 = \tilde{\Omega}(S^{2S^3})$, $B_2 = \tilde{\Omega}(S^5)$, respectively, and PG enters phase 2 only when the current policy becomes $\varepsilon_0$-optimal, where $\varepsilon_0$ is a term depending on $S$ (formally defined by (19) in Appendix A.3). For completeness, we restate the main theorem of Ding et al. (2021) in Theorem A.2, provide the details of these dependencies on $S$ in Corollary A.3, and describe the two-phase procedure in Algorithm 4. The main implication of these two-phase results is that, if we apply PG to learn an optimal policy from an arbitrary initialization, we suffer from an exponential iteration number and sample complexity, unless the initialization is $\varepsilon_0$-optimal. In the next section, we discuss how Assumptions 3.1 and 3.2 enable an $\varepsilon_0$-optimal initialization for every $\omega_k$, hence reducing the exponential complexity to a polynomial one.
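A curriculum satisfying the spacing condition of Assumption 3.2 can be built by linearly interpolating between an easy context and the target, choosing $K$ large enough that consecutive steps stay below a constant times $S^{-2}$. The sketch below is illustrative, not from the paper; the constant `c` is a hypothetical knob:

```python
import numpy as np

def build_curriculum(omega_0, omega_K, S, c=1.0):
    """Linear path from omega_0 to omega_K with step size <= c * S**-2."""
    gap = np.linalg.norm(omega_K - omega_0)
    K = max(1, int(np.ceil(gap / (c * S ** -2))))  # number of curriculum steps
    return [omega_0 + (k / K) * (omega_K - omega_0) for k in range(K + 1)]
```

Note the price of the polynomial bound under this construction: the curriculum length $K$ grows as $\|\omega_K - \omega_0\|_2 \cdot S^2$.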

4. THEORETICAL ANALYSIS

In this section, we introduce ROLLIN, a simple algorithm that accelerates policy learning under the contextual MDP setup by bootstrapping the learning of each new context with a better initial distribution (Algorithm 1). We also provide the total complexity analysis of applying ROLLIN to stochastic PG to achieve an ε-optimal policy.

4.1. ROLLIN

The theoretical version of ROLLIN is provided in Algorithm 1. Our intuition behind ROLLIN is that when two consecutive contexts in the curriculum $\{\omega_k\}_{k=1}^K$ are close, their optimal parameters $\theta^\star_{\omega_{k-1}}, \theta^\star_{\omega_k}$ should be close to each other. Let $\theta^{(k)}_t$ denote the parameters at the $t$-th iteration of stochastic PG for learning $\theta^\star_{\omega_k}$. If we initialize the parameter $\theta^{(k)}_1$ as the optimal parameter of the previous context $\theta^\star_{\omega_{k-1}}$ (line 5 in Algorithm 1), and set the initial distribution $\mu_k$ as a mixture between the optimal state visitation distribution of the previous context $d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}$ and the original distribution of interest $\rho$ with ratio $\beta$ (line 6 in Algorithm 1), such that $\mu_k = \beta d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + (1-\beta)\rho$, then stochastic PG enjoys a faster convergence rate. This happens because setting $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$ ensures a near-optimal initialization for learning $\theta^\star_{\omega_k}$.

Algorithm 1 Provably Efficient Learning via ROLLIN
1: Input: $\rho$, $\{\omega_k\}_{k=0}^K$, $M_W$, $\beta \in (0,1)$, $\theta^{(0)}_0$.
2: Initialize $\mu_0 = \rho$.
3: Run stochastic PG (Algorithm 4) with initialization $\theta^{(0)}_0$, $\mu_0$, $M_{\omega_0}$ and obtain $\theta^\star_{\omega_0}$.
4: for $k = 1, \ldots, K$ do
5:   Set $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$. ▷ $\pi^\star_{\omega_{k-1}} = \pi_{\theta^\star_{\omega_{k-1}}}$ is the optimal policy of $\omega_{k-1}$.
6:   Set $\mu_k = \beta d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + (1-\beta)\rho$. ▷ $\mu_k$ is the initial distribution for learning $\theta^\star_{\omega_k}$.
7:   Run stochastic PG (Algorithm 4) with initialization $\theta^{(k)}_1$, $\mu_k$, $M_{\omega_k}$ and obtain $\theta^\star_{\omega_k}$.
8: end for
9: Output: $\theta^\star_{\omega_K}$

4.2. MAIN RESULTS

We now discuss how to use a sequence of contexts to learn the target context $\omega_K$ with provable efficiency, given the optimal policy $\pi^\star_{\omega_0}$ of the initial context $\omega_0$, without incurring an exponential dependency on $S$ (as mentioned in Section 3.2). Our polynomial complexity comes as a result of enforcing an $\varepsilon_0$-optimal initialization ($\varepsilon_0$ is the same as in Section 3.2 and (19)) for running stochastic PG (line 6 of Algorithm 1). Hence, stochastic PG directly enters phase 2, which has a polynomial dependency on $S$. Our main results consist of two parts. We first show that when two consecutive contexts $\omega_{k-1}, \omega_k$ are close enough to each other, using ROLLIN for learning $\theta^\star_{\omega_k}$ with initialization $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$ and initial distribution $\mu_k = \beta d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + (1-\beta)\rho$ improves the convergence rate. Specifically, the iteration number and sample complexity for learning $\theta^\star_{\omega_k}$ from $\theta^\star_{\omega_{k-1}}$ are stated as follows. Theorem 4.1 (Complexity of Learning the Next Context) Consider the context-based stochastic softmax policy gradient (line 7 of Algorithm 1) and suppose Assumptions 3.1 and 3.2 hold. Then the iteration number for obtaining an $\varepsilon$-optimal policy for $\omega_k$ from $\theta^\star_{\omega_{k-1}}$ is $\tilde{\Omega}(S^{3/2})$ and the per-iteration sample complexity is $\tilde{\Omega}\left(\frac{L_r}{\alpha(1-\beta)} S^3\right)$. In other words, Theorem 4.1 shows that when $\omega_{k-1}, \omega_k$ are close enough, ROLLIN reduces the iteration and sample complexity from an exponential dependency of $\tilde{\Omega}(S^{2S^3})$ to an iteration number of $\tilde{\Omega}(S^{3/2})$ and a per-iteration sample complexity of $\tilde{\Omega}(S^3)$. It is worth noting that the theorem above only addresses the iteration number and sample complexity for learning $\theta^\star_{\omega_k}$ from $\theta^\star_{\omega_{k-1}}$. Theorem 4.3 provides the total complexity for learning $\theta^\star_{\omega_K}$ from $\theta^{(0)}_0$ by recursively applying the results in Theorem 4.1. Before introducing Theorem 4.3, we first provide a criterion for the desired initialization $\theta^{(0)}_0$.
Definition 4.2 (Near-optimal Initialization) We say $\theta_0$ is a near-optimal initialization for learning $\theta^\star_\omega$ if $\theta_0$ satisfies $V^{\pi^\star_\omega}_\omega(\rho) - V^{\pi_{\theta_0}}_\omega(\rho) < \varepsilon_0$. Note that in the above definition, $\pi^\star_\omega$ represents the optimal policy of $\omega$, and $V^\pi_\omega$ represents the value function of context $\omega$ under policy $\pi$. We now introduce the result for the overall complexity. Theorem 4.3 (Main Result: Total Complexity of ROLLIN) Suppose Assumptions 3.1 and 3.2 hold, and $\theta^{(0)}_0$ is a near-optimal initialization. Then, with high probability, the total number of iterations for learning $\pi^\star_{\omega_K}$ using Algorithm 1 is $\tilde{\Omega}(K S^{3/2})$ and the per-iteration sample complexity is $\tilde{\Omega}(S^3)$. A direct implication of Theorem 4.3 is that, with a curriculum $\{\omega_k\}_{k=0}^K$ satisfying Assumptions 3.1 and 3.2, one can reduce the daunting exponential dependency on $S$ caused by poor initialization to a polynomial dependency on $S$. Admittedly, the state space size $S$ itself may still be large in practice, but reducing the dependency on $S$ further would require extra assumptions on $S$, which is beyond the scope of this work. We provide proof sketches of Theorems 4.1 and 4.3 in the next subsection and defer the detailed proofs to Appendices A.4 and A.5, respectively.
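To get a rough sense of the magnitude of this reduction, one can plug a tiny state space into the two scalings quoted above. All constants and logarithmic factors are dropped, so the numbers below are purely illustrative, not bounds from the paper:

```python
def exponential_iterations(S):
    """S^(2 S^3): phase-1 scaling from an arbitrary initialization (constants dropped)."""
    return S ** (2 * S ** 3)

def rollin_total_iterations(S, K):
    """K * S^(3/2): total iteration scaling of ROLLIN with K contexts (constants dropped)."""
    return K * S ** 1.5

S, K = 3, 20
print(exponential_iterations(S))      # 3^54, roughly 5.8e25
print(rollin_total_iterations(S, K))  # about 104
```

Even for $S = 3$, the exponential regime is astronomically larger than the curriculum-based one, which is the gap Theorem 4.3 formalizes.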

Proof sketch of Theorem 4.1

The key insight for proving Theorem 4.1 is to show that in the MDP $M_{\omega_k}$, the gap between the value functions of $\pi^\star_{\omega_k}$ and $\pi^\star_{\omega_{k-1}}$ can be bounded by the $\ell_2$ norm between $\omega_k$ and $\omega_{k-1}$. In particular, we prove such a relation in Lemma A.5:

$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) \le \frac{2 L_r \|\omega_k - \omega_{k-1}\|_2}{(1-\gamma)^2}. \quad (2)$$

By setting $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$, Equation (2) directly implies $V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k)}_1}}_{\omega_k}(\rho) \le \frac{2 L_r \|\omega_k - \omega_{k-1}\|_2}{(1-\gamma)^2}$. As suggested by Ding et al. (2021), stochastic PG can directly start from phase 2 with polynomial complexity of $T_2 = \tilde{\Omega}(S^{3/2})$, $B_2 = \tilde{\Omega}(S^5)$, if $V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k)}_1}}_{\omega_k}(\rho) \le \varepsilon_0$, where $\varepsilon_0$ (formally defined in Equation (19) in Appendix A.3) is a constant satisfying $\varepsilon_0 = O(S^{-2})$. Hence, by enforcing two consecutive contexts to be close enough, $\|\omega_k - \omega_{k-1}\|_2 \le O(S^{-2})$, we can directly start from a near-optimal initialization with polynomial complexity w.r.t. $S$. This largely explains the intuition behind the initialization in line 5 of ROLLIN: $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$. It is worth highlighting that the per-iteration sample complexity $B_2$ shown by Ding et al. (2021) scales as $\tilde{\Omega}(S^5)$, while our result in Theorem 4.1 only requires a smaller sample complexity of $\tilde{\Omega}(S^3)$. This improvement in the sample complexity comes from line 6 of ROLLIN: $\mu_k = \beta d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + (1-\beta)\rho$. Intuitively, setting $\mu_k$ in this way allows us to upper bound the density mismatch ratio:

$$\left\| d^{\pi^\star_{\omega_k}}_{\mu_k} / \mu_k \right\|_\infty \le \tilde{O}\left( \frac{L_r}{\alpha(1-\beta)} \Delta^k_\omega S \right), \quad (3)$$

where $\Delta^k_\omega = \max_{1 \le i \le k} \|\omega_i - \omega_{i-1}\|_2$. Since the sample complexity $B_2$ (provided in Corollary A.3) contains one multiplicative factor of $\left\| d^{\pi^\star_{\omega_k}}_{\mu_k} / \mu_k \right\|_\infty$, setting $\Delta^k_\omega = O(S^{-2})$ immediately reduces the complexity by an order of $S^2$. The proof of the upper bound on the density mismatch ratio (Equation (3)) is provided in Lemma A.1.

Proof sketch of Theorem 4.3. We obtain Theorem 4.3 by recursively applying Theorem 4.1.
More precisely, we use induction to show that, if we initialize the parameters of the policy as $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$, then for $t = \tilde{\Omega}(S^{3/2})$ and all $k \in [K]$, we have $V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k-1)}_t}}_{\omega_k}(\rho) < \varepsilon_0$. Hence, for any context $\omega_k$, $k \in [K]$, initializing $\theta^{(k)}_1 = \theta^{(k-1)}_t$ from learning $\pi^\star_{\omega_{k-1}}$ via stochastic PG for $t = \tilde{\Omega}(S^{3/2})$ iterations allows learning to start directly in the efficient phase 2 with polynomial complexity. Therefore, the total iteration number for learning $\theta^\star_{\omega_K}$ is $\tilde{\Omega}(K S^{3/2})$, and the per-iteration sample complexity $\tilde{\Omega}(S^3)$ remains the same as in Theorem 4.1.
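The induction step rests on chaining Lemma A.5 with the curriculum spacing condition. Writing it out (a sketch, with the constant in Assumption 3.2 assumed to be chosen so that the last step matches $\varepsilon_0 = O(S^{-2})$):

```latex
V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k)}_1}}_{\omega_k}(\rho)
  \;\le\; \frac{2 L_r \,\|\omega_k - \omega_{k-1}\|_2}{(1-\gamma)^2}
  \;\le\; \frac{2 L_r}{(1-\gamma)^2}\, O\!\left(S^{-2}\right)
  \;\le\; \varepsilon_0,
```

so every warm start already satisfies the phase-2 entry condition of Ding et al. (2021), and no context ever pays the exponential phase-1 cost.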

5. PRACTICAL IMPLEMENTATION OF ROLLIN

In this section, we describe the practical implementation of ROLLIN using the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018) when an oracle curriculum is available. SAC can be seen as a variant of entropy-regularized stochastic PG with the addition of critics to reduce the gradient variance. Recall that in the theoretical analysis, we learn a separate policy for each context that starts from the near-optimal state distribution of the previous context to achieve a good return under the current context. In practice, however, we usually want a policy that can start directly from the initial distribution ρ and obtain a good return for the final context $\omega_K$. In order to learn such a policy, we propose to train two context-conditioned RL agents in parallel, where the first agent $\pi_{\text{main}}$ is the main agent that eventually learns to achieve a good return from ρ, and the second agent $\pi_{\text{exp}}$ is an exploration agent that learns to achieve a good return under the current context starting from the near-optimal state density of the previous context. Another purpose of the exploration agent (as the name suggests) is to provide better exploration experience from which the main agent can learn the current context. This is made convenient by using an off-policy RL algorithm, so the main agent can easily learn from data that is not on-policy. Specifically, for each episode, with probability β we run the main agent conditioned on the previous context for a random number of steps, and then switch to the exploration agent to collect experience for the current context until the episode ends. Otherwise, we directly run the main agent for the entire episode. Both agents are trained to maximize the return under the current context. Whenever the average return of the last 10 episodes exceeds a performance threshold R, we immediately switch to the next context and re-initialize the exploration agent and its replay buffer.
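The episode-collection logic just described can be sketched as follows. This is an illustrative sketch, not the paper's code; `env`, `pi_main`, and `pi_exp` are hypothetical stand-ins for the environment and the two SAC agents:

```python
import numpy as np

def collect_episode(env, pi_main, pi_exp, omega_prev, omega_cur,
                    beta, gamma, H, rng):
    """With probability beta, roll in pi_main on the previous context for
    h ~ Geom(1 - gamma) steps (truncated at H), then hand off to pi_exp;
    otherwise run pi_main on the current context for the whole episode."""
    rollin = omega_prev is not None and rng.random() < beta
    h = min(rng.geometric(1 - gamma), H) if rollin else 0
    traj, s = [], env.reset()
    for t in range(H):
        if rollin and t < h:
            a = pi_main(s, omega_prev)   # roll-in phase on the previous context
        elif rollin:
            a = pi_exp(s, omega_cur)     # hand-off to the exploration agent
        else:
            a = pi_main(s, omega_cur)    # no roll-in: main agent for the whole episode
        s_next, r, done = env.step(a)
        traj.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return traj
```

Sampling the switch time from a truncated geometric distribution with parameter $1 - \gamma$ mirrors the discounted visitation mixture used in the theoretical roll-in.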
A high-level description is available in Algorithm 2 (a more detailed version is in Algorithm 8).

Algorithm 2 Practical Implementation of ROLLIN (excerpt)
...
if the average return of the last 10 episodes under context $\omega_k$ is greater than R then switch to the next context $\omega_{k+1}$
...
if k > 0 then, with probability β:
8:   Sample $h \sim \text{Geom}(1-\gamma)$ (truncated at H).
9:   Run $\pi_{\text{main}}(a|s, \omega_{k-1})$ from the initial state for h steps and switch to $\pi_{\text{exp}}(a|s, \omega_k)$ until the episode ends, obtaining trajectory $\tau_{0:H} = \{s_0, a_0, r_0, s_1, a_1, \cdots, s_H\}$.

6. EXPERIMENTS

While the focus of our work is on developing a provably efficient approach to curriculum learning, we also conduct an experimental evaluation of our practical implementation of ROLLIN with soft actor-critic (SAC) (Haarnoja et al., 2018) as the RL algorithm on several continuous control tasks, including goal reaching with an oracle curriculum and a learned curriculum, as well as non-goal tasks.

6.1. GOAL REACHING WITH AN ORACLE CURRICULUM

We adopt the antmaze-umaze environment (Fu et al., 2020) to evaluate the performance of ROLLIN on goal-reaching tasks. In the oracle curriculum case, we use a hand-crafted path of contexts, where each context specifies the location that the ant needs to reach (as shown in Figure 2). We consider a path of contexts ω(κ) parameterized by κ ∈ [0, 1], where ω(0) = ω_0 and ω(1) = ω_K, and propose contexts along the path with a fixed step size ∆. See Appendix E.1 for more implementation details. We combine ROLLIN with a variety of prior methods, and we evaluate the following conditions: (1) standard goal reaching; (2) goal reaching with goal relabeling (Andrychowicz et al., 2017); (3) goal reaching with Go-Explore (Ecoffet et al., 2019). For goal relabeling, we adopt a relabeling technique similar to Pitis et al. (2020), where each mini-batch contains 1/3 original transitions, 1/3 transitions with future-state relabeling, and 1/3 transitions with next-state relabeling. We implemented the Go-Explore method by adding standard Gaussian exploration noise (multiplied by a constant factor) to the agent for learning the next goal ω(k + 1) once it reaches the current goal ω(k). We empirically observed that sampling the replay buffer from a geometric distribution with p = 10^{-5} (more recent transitions are sampled more frequently) improves the overall performance. Hence, in all subsequent experiments, we compare the performance of ROLLIN with classic uniform sampling and the new geometric sampling. We demonstrate how adjusting the ROLLIN parameter (β = 0.1, 0.2, 0.5, 0.75, 0.9) impacts the learning speed for three different step sizes ∆ = 1/24, 1/18, 1/12. We expect coarser curricula with larger steps to be more difficult. Main comparisons. We first provide an overview of experiments comparing ROLLIN with a fixed β = 0.1 on different step sizes ∆ in different settings.
In each case, we compare the prior method (vanilla, relabeled, or Go-Explore) with and without the addition of ROLLIN. As shown in Table 1, ROLLIN improves the largest value of κ reached by the agent in most of the presented settings (except Go-Explore with ∆ = 1/12). This result suggests that ROLLIN facilitates goal-conditioned RL with a curriculum, as we only update the learning progress κ to κ + ∆ when the return of the current policy reaches a certain threshold R (see the detailed update of κ in Algorithm 2). Note that β = 0.1 does not always produce the best result; we provide more results comparing different values of β in different settings later in this section, and we leave all the learning curves and detailed tables to Appendix F.1. Note that we do not include the results of directly learning the last context in the antmaze-umaze environment because the agent cannot reach the goal without the aid of a curriculum, which is corroborated by Pitis et al. (2020).

6.2. GOAL REACHING WITH A LEARNED CURRICULUM

In this section, we focus on the setting where an oracle curriculum is not provided. In particular, we show that ROLLIN can improve MEGA (Pitis et al., 2020), an existing automated goal curriculum generation method (Figure 3). MEGA proposes intrinsic goals that are achievable but appear rarely in the replay buffer. Since there is no clear notion of progression in the learned curriculum setting, we use the progression in the number of environment steps, where $\omega_k$ corresponds to the goals proposed between $k \times 100\text{K}$ and $(k+1) \times 100\text{K}$ steps. We follow the same high-level procedure as described in Algorithm 2, where we randomly roll in a policy that was trained on the previous context (in this case, the policy checkpoint that was obtained after the last 100K-step learning segment). A detailed algorithmic procedure is provided in Algorithm 9.
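Tying the context index to the step count, the checkpoint selection used for rolling in can be sketched as follows. The 100K segment length comes from the text above; `checkpoints`, a dict of saved policies keyed by segment index, is a hypothetical bookkeeping structure:

```python
SEGMENT = 100_000  # segment length in environment steps, as described above

def context_index(env_step, segment=SEGMENT):
    """omega_k covers environment steps [k * segment, (k + 1) * segment)."""
    return env_step // segment

def rollin_checkpoint(checkpoints, env_step, segment=SEGMENT):
    """Policy checkpoint saved at the end of the previous segment,
    or None during the first segment (no roll-in is possible yet)."""
    k = context_index(env_step, segment)
    return checkpoints.get(k - 1) if k > 0 else None
```

During the first 100K steps no previous checkpoint exists, so every episode runs the main agent alone, matching the k > 0 condition in Algorithm 2.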

6.3. NON-GOAL REACHING TASKS

For the non-goal tasks, we choose a fixed contextual space with ten discrete contexts, κ ∈ {0.1, 0.2, . . . , 1}, where each κ uniquely determines the desired x-velocity of a locomotion agent in the following OpenAI Gym environments (Brockman et al., 2016): walker2d, hopper, humanoid, and ant. Our goal is to train an agent to move fast and stably. For each context ω(κ), we set the desired speed range to [λκ, λ(κ + 0.1)), where λ is a parameter depending on the physics of the agent in each environment. When the x-velocity of the agent is within the desired velocity range, we set the healthy_reward to a higher value healthy_reward_high; otherwise, we set it to a lower value healthy_reward_low. In each environment, we increase the task difficulty at later curriculum steps (larger κ) by increasing the near-optimality threshold R(κ). Detailed parameters for the desired speed range λ, the near-optimality threshold R(κ), healthy_reward_high, and healthy_reward_low are provided in Appendix E.3. Main comparisons. We first compare ROLLIN with a fixed β = 0.1 at different environment steps: 0.75 × 10^6 and 1 × 10^6. In each case, we compare the learning progress κ, average x-velocity, and average return with and without the addition of ROLLIN. Note that in the case without ROLLIN, we still provide the curriculum to the agent for training. As shown in Table 2, ROLLIN improves the largest learning progress κ, average x-velocity, and average return in most presented settings. This result suggests that ROLLIN can also facilitate learning of non-goal tasks, as we only update the learning progress κ to κ + 0.1 when the return of the current policy reaches the threshold R(κ) (see the detailed update of κ in Algorithm 2). Note that β = 0.1 does not always produce the best result; we provide more results comparing different βs in different settings later in this section, and we leave all the learning curves and detailed tables to Appendix F.3.
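The context-dependent reward switch described above can be sketched directly (healthy_reward_high, healthy_reward_low, and λ are the parameters named in the text; the default values here are placeholders, since the actual per-environment values are in Appendix E.3):

```python
def healthy_reward(x_velocity, kappa, lam,
                   healthy_reward_high=5.0, healthy_reward_low=1.0):
    """Pay the higher healthy reward only while the agent's x-velocity
    lies in the context's target band [lam*kappa, lam*(kappa + 0.1))."""
    in_band = lam * kappa <= x_velocity < lam * (kappa + 0.1)
    return healthy_reward_high if in_band else healthy_reward_low
```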
Table 2 Learning progress κ, average x-velocity, and average return at the 0.75 and 1.0 million environment steps in walker, hopper, humanoid, and ant. The average x-velocity and return are estimated using the last 50k time steps. "Scratch" shows the results of directly training the agent with the last context ω(1). "Baseline" indicates β = 0, where we provide the curriculum ω(κ) to the agent without using ROLLIN. We pick β = 0.1 for all experiments using ROLLIN; the results of using other βs can be found in Tables 8, 9, and 10 in Appendix F.3. The standard error is computed over 8 random seeds.

6.4. EXPERIMENTAL SUMMARY

In summary, we empirically demonstrated that ROLLIN improves the performance of goal-reaching and non-goal tasks in different settings across a wide range of βs. Although ROLLIN introduces an extra parameter β for fine-tuning goal-reaching tasks with a curriculum, our extensive experiments suggest that one can expect improvement simply by choosing a constant β = 0.1 or 0.2. Still, to achieve the best performance on a general task, one should consider fine-tuning β.

APPENDIX A GENERALIZATION BETWEEN DIFFERENT TASKS IN THE CONTEXT SPACE

A.1 SUMMARIES OF NOTATIONS AND ASSUMPTIONS

1. The maximum entropy RL (MaxEnt RL) objective with initial state distribution $\rho$ aims at maximizing (Equations 15 & 16 of Mei et al. (2020))
$$V^{\pi}(\rho) := \sum_{h=0}^{\infty}\gamma^{h}\,\mathbb{E}_{s_0\sim\rho,\,a_h\sim\pi(\cdot|s_h)}\left[r(s_h,a_h)\right] + \alpha\,\mathcal{H}(\rho,\pi),$$
where the discounted entropy term is
$$\mathcal{H}(\rho,\pi) := \mathbb{E}_{s_0\sim\rho,\,a_h\sim\pi(\cdot|s_h)}\left[\sum_{h=0}^{\infty} -\gamma^{h}\log\pi(a_h|s_h)\right],$$
and $\alpha$ is the entropy penalty coefficient. For simplicity, we refer to the optimization objective in (4) as $\alpha$-MaxEnt RL. Similar to Equations 18 & 19 of Mei et al. (2020), we also define the advantage and Q-functions for MaxEnt RL as
$$A^{\pi}(s,a) := Q^{\pi}(s,a) - \alpha\log\pi(a|s) - V^{\pi}(s),\qquad Q^{\pi}(s,a) := r(s,a) + \gamma\sum_{s'}P(s'|s,a)\,V^{\pi}(s').$$
2. We let $d^{\pi}_{s_0}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,P^{\pi}(s_t = s\,|\,s_0)$ denote the discounted state visitation distribution of policy $\pi$ starting at state $s_0$, and let $d^{\pi}_{\rho}(s) = \mathbb{E}_{s_0\sim\rho}\left[d^{\pi}_{s_0}(s)\right]$.

Under the ROLLIN update (1), $\mu_k = \beta d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + (1-\beta)\rho$, the density mismatch ratio $\left\|d^{\pi^\star_{\omega_k}}_{\mu_k}/\mu_k\right\|_\infty$ satisfies
$$\left\|\frac{d^{\pi^\star_{\omega_k}}_{\mu_k}}{\mu_k}\right\|_{\infty} \le \Omega\!\left(\frac{L_r}{\alpha(1-\beta)}\,\Delta^{k}_{\omega}\,S\right),\qquad\text{where } \Delta^{k}_{\omega} = \max_{1\le i\le k}\|\omega_i-\omega_{i-1}\|_2.$$
Proof By (1) from ROLLIN, we have
$$\left\|\frac{d^{\pi^\star_{\omega_k}}_{\mu_k}}{\mu_k}\right\|_\infty = \left\|\frac{d^{\pi^\star_{\omega_k}}_{\mu_k} - d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}}{\mu_k}\right\|_\infty \overset{(i)}{\le} \frac{\left\|d^{\pi^\star_{\omega_k}}_{\mu_k} - d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}\right\|_1}{\min_s \mu_k(s)} + \left\|\frac{d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}}{\beta d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}} + (1-\beta)\rho}\right\|_\infty \overset{(ii)}{\le} \frac{\left\|d^{\pi^\star_{\omega_k}}_{\mu_k} - d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}\right\|_1}{\min_s \mu_k(s)} + \frac{1}{\beta},$$
where inequality (i) holds because of (1), and inequality (ii) holds by dropping the $(1-\beta)\rho$ term in the denominator. It remains to bound $\left\|d^{\pi^\star_{\omega_k}}_{\mu_k} - d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}\right\|_1$ in terms of the difference $\|\omega_k - \omega_{k-1}\|_2$.
Let $P^{k}_{h}(s') = P^{\pi^\star_{\omega_k}}_{h}(s'\,|\,s_0\sim\mu_k)$ denote the state distribution at step $h$ under $\pi^\star_{\omega_k}$ starting from $\mu_k$. Then
$$P^{k}_{h}(s') - P^{k-1}_{h}(s') = \sum_{s,a}\left[P^{k}_{h-1}(s)\,\pi^\star_{\omega_k}(a|s) - P^{k-1}_{h-1}(s)\,\pi^\star_{\omega_{k-1}}(a|s)\right]P(s'|s,a)$$
$$= \sum_{s}P^{k}_{h-1}(s)\sum_{a}\left[\pi^\star_{\omega_k}(a|s) - \pi^\star_{\omega_{k-1}}(a|s)\right]P(s'|s,a) + \sum_{s}\left[P^{k}_{h-1}(s) - P^{k-1}_{h-1}(s)\right]\sum_{a}\pi^\star_{\omega_{k-1}}(a|s)\,P(s'|s,a),$$
where the second equality follows by adding and subtracting $P^{k}_{h-1}(s)\,\pi^\star_{\omega_{k-1}}(a|s)$. Taking absolute values on both sides and summing over $s'$ yields
$$\left\|P^{k}_{h} - P^{k-1}_{h}\right\|_1 \overset{(i)}{\le} c_1\|\omega_k-\omega_{k-1}\|_2 + \left\|P^{k}_{h-1} - P^{k-1}_{h-1}\right\|_1 \le \cdots \le c_1 h\,\|\omega_k-\omega_{k-1}\|_2 + \left\|P^{k}_{0} - P^{k-1}_{0}\right\|_1 \overset{(ii)}{=} c_1 h\,\|\omega_k-\omega_{k-1}\|_2 + \|\mu_k - \mu_{k-1}\|_1, \tag{13}$$
where inequality (i) holds by applying Lemma B.2 with $c_1 = L_r/(\alpha(1-\gamma))$, and equality (ii) holds because the initial distribution of $P^{k}_{h}$ is $\mu_k$. By the definition of $d^{\pi}_{\mu}$, writing $d^{k} := d^{\pi^\star_{\omega_k}}_{\mu_k}$, we have
$$d^{k}(s) - d^{k-1}(s) = (1-\gamma)\sum_{h=0}^{\infty}\gamma^{h}\left[P^{k}_{h}(s) - P^{k-1}_{h}(s)\right],\quad\forall s\in\mathcal{S}. \tag{14}$$
Taking the $\ell_1$ norm on both sides of (14) and applying (13) yields
$$\|d^{k} - d^{k-1}\|_1 \le (1-\gamma)\sum_{h=0}^{\infty}\gamma^{h}\left(c_1 h\,\|\omega_k-\omega_{k-1}\|_2 + \|\mu_k-\mu_{k-1}\|_1\right) \overset{(i)}{=} \frac{\gamma c_1}{1-\gamma}\|\omega_k-\omega_{k-1}\|_2 + \|\mu_k-\mu_{k-1}\|_1 \overset{(ii)}{=} \frac{\gamma c_1}{1-\gamma}\|\omega_k-\omega_{k-1}\|_2 + \beta\,\|d^{k-1}-d^{k-2}\|_1,$$
where equality (i) holds because $\sum_{h=0}^{\infty}\gamma^{h}h = \gamma/(1-\gamma)^2$ and equality (ii) holds because of (1). Unrolling the recursion, we have
$$\|d^{k}-d^{k-1}\|_1 \le \frac{\gamma c_1}{1-\gamma}\|\omega_k-\omega_{k-1}\|_2 + \beta\,\|d^{k-1}-d^{k-2}\|_1 \le \frac{\gamma c_1}{1-\gamma}\sum_{i=0}^{k-1}\beta^{i}\,\|\omega_{k-i}-\omega_{k-i-1}\|_2 + \beta^{k-1}\|d^{1}-d^{0}\|_1 \le \frac{\gamma c_1}{(1-\gamma)(1-\beta)}\Delta^{k}_{\omega} + \beta^{k-1}\|d^{1}-d^{0}\|_1 \approx \frac{\gamma c_1}{(1-\gamma)(1-\beta)}\Delta^{k}_{\omega}, \tag{16}$$
where $\Delta^{k}_{\omega} = \max_{1\le i\le k}\|\omega_i-\omega_{i-1}\|_2$ and the last $\approx$ holds when $k$ is large.
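Equality (i) above relies on the identity $\sum_{h=0}^{\infty}\gamma^h h = \gamma/(1-\gamma)^2$, which follows by differentiating the geometric series:

```latex
\sum_{h=0}^{\infty}\gamma^{h} = \frac{1}{1-\gamma}
\quad\Longrightarrow\quad
\frac{d}{d\gamma}\sum_{h=0}^{\infty}\gamma^{h}
  = \sum_{h=1}^{\infty} h\,\gamma^{h-1}
  = \frac{1}{(1-\gamma)^{2}}
\quad\Longrightarrow\quad
\sum_{h=0}^{\infty} h\,\gamma^{h} = \frac{\gamma}{(1-\gamma)^{2}}.
```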
Therefore, applying (16) back to (11), we obtain
$$\left\|\frac{d^{\pi^\star_{\omega_k}}_{\mu_k}}{\mu_k}\right\|_\infty \le \frac{\left\|d^{\pi^\star_{\omega_k}}_{\mu_k} - d^{\pi^\star_{\omega_{k-1}}}_{\mu_{k-1}}\right\|_1}{\min_s \mu_k(s)} + \frac{1}{\beta} \overset{(i)}{\le} \frac{1}{\min_s \mu_k(s)}\cdot\frac{\gamma c_1}{(1-\gamma)(1-\beta)}\Delta^{k}_{\omega} + \frac{1}{\beta} = \Omega\!\left(\frac{L_r}{\alpha(1-\beta)}\,\Delta^{k}_{\omega}\,S\right),$$
where inequality (i) holds since Lemma B.2 implies $c_1 = L_r/(\alpha(1-\gamma))$, and we omit the $1/(1-\gamma)$ and $\log$ factors in the $\Omega(\cdot)$, which completes the proof. Note that we can only achieve the final bound $\Omega\!\left(\frac{L_r}{\alpha(1-\beta)}\Delta^{k}_{\omega}S\right)$

2021)) Consider an arbitrary tolerance level $\delta > 0$ and a small enough tolerance level $\varepsilon > 0$. For every initial point $\theta_1$, if $\theta_{T+1}$ is generated by SPG (Algorithm 4) with
$$T_1 \ge \left(\frac{6D(\theta_0)}{\delta\varepsilon_0}\right)^{\frac{8L}{C^0_\delta}\ln 2},\qquad T_2 \ge \frac{\varepsilon_0}{6\delta\varepsilon} - 1 - t_0,\qquad T = T_1 + T_2,$$
$$B_1 \ge \max\left\{\frac{30\sigma^2}{C^0_\delta\,\varepsilon_0\,\delta},\;\frac{6\sigma T_1\log T_1}{\bar\Delta L}\right\},\qquad B_2 \ge \frac{\sigma^2\ln(T_2+t_0)}{6C_\zeta\,\delta\varepsilon},$$
$$\eta_t = \eta \le \min\left\{\frac{\log T_1}{T_1 L},\;\frac{C^0_\delta}{8},\;\frac{1}{2L}\right\}\quad\forall\,1\le t\le T_1,\qquad \eta_t = \frac{1}{t - T_1 + t_0}\quad\forall\, t > T_1,$$
where $D(\theta_t) = V^{\pi^\star}(\rho) - V^{\pi_{\theta_t}}(\rho)$,
$$\varepsilon_0 = \min\left\{\frac{\alpha\min_{s\in\mathcal S}\rho(s)}{6\ln 2}\left(\frac{2}{\zeta}\exp\!\left(\frac{1}{(1-\gamma)\alpha}\right)\right)^{-4},\,1\right\},\qquad t_0 \ge \frac{3\sigma^2}{2\delta\varepsilon_0},$$
$$C^0_\delta = \frac{2\alpha}{S}\left\|\frac{d^{\pi^\star}_\rho}{\rho}\right\|_\infty^{-1}\min_{s\in\mathcal S}\rho(s)\,\min_{\theta\in G^0_\delta}\min_{s,a}\pi_\theta(a|s)^2,\qquad C_\zeta = \frac{2\alpha}{S}\left\|\frac{d^{\pi^\star}_\rho}{\rho}\right\|_\infty^{-1}\min_{s\in\mathcal S}\rho(s)\,(1-\zeta)^2\min_{s,a}\pi^\star(a|s)^2,$$
$$G^0_\delta := \left\{\theta\in\mathbb{R}^{S\times A} : \min_{\theta^\star\in\Theta^\star}\|\theta-\theta^\star\|_2 \le (1+1/\delta)\,\bar\Delta\right\},\qquad \bar\Delta = \left\|\log \bar c_{\theta_1,\eta} - \log\pi^\star\right\|_2,\qquad \bar c_{\theta_1,\eta} = \inf_{t\ge 1}\min_{s,a}\pi_{\theta_t}(a|s),$$
$$\sigma^2 = \frac{8}{(1-\gamma)^2}\left[1 + \frac{(\alpha\log A)^2}{(1-\gamma^{1/2})^2}\right],\qquad L = \frac{8 + \alpha(4 + 8\log A)}{(1-\gamma)^3}, \tag{19}$$
then we have $\mathbb{P}\!\left(D(\theta_{T+1}) \le \varepsilon\right) \ge 1-\delta$.

Corollary A.3 (Iteration Complexity and Sample Complexity for $\varepsilon$-Optimal Policies) Suppose we set the tolerance levels $\varepsilon, \delta = O(S^{-1})$. Then the iteration complexity and sample complexity of obtaining an $\varepsilon$-optimal policy using stochastic softmax policy gradient (Algorithm 4) in phase 1 and phase 2 satisfy, with probability at least $1-\delta$:
• Phase 1: $T_1 = \Omega\!\left(S^{2S^3}\right)$, $B_1 = \Omega\!\left(S^{2S^3}\right)$;
• Phase 2: $T_2 = \Omega\!\left(S^{3/2}\right)$, $B_2 = \Omega\!\left(S^5\right)$.
Proof We first check the dependency of (19) on $S$.
Notice that
• $\varepsilon_0$: $\;\dfrac{1}{\varepsilon_0} = \max\left\{\dfrac{6\ln 2}{\alpha\min_{s\in\mathcal S}\rho(s)}\left(\dfrac{2}{\zeta}\exp\!\left(\dfrac{1}{(1-\gamma)\alpha}\right)\right)^{4},\,1\right\} = \Omega(S^2)$; (20)
• $t_0$: $\;t_0 \ge \dfrac{3\sigma^2}{2\delta\varepsilon_0} = \Omega(S)$; (21)
• $C^0_\delta$: $\;\dfrac{1}{C^0_\delta} = \dfrac{S}{2\alpha}\left\|\dfrac{d^{\pi^\star}_\rho}{\rho}\right\|_\infty \max_{s\in\mathcal S}\rho(s)^{-1}\left(\min_{\theta\in G^0_\delta}\min_{s,a}\pi_\theta(a|s)\right)^{-2} = \Omega(S^3)$; (22)
• $C_\zeta$: $\;\dfrac{1}{C_\zeta} = \dfrac{S}{2\alpha}\left\|\dfrac{d^{\pi^\star}_\rho}{\rho}\right\|_\infty \max_{s\in\mathcal S}\rho(s)^{-1}\,(1-\zeta)^{-2}\max_{s,a}\pi^\star(a|s)^{-2} = \Omega(S^3)$. (23)
Hence, the complexities in phase 1 scale as
$$T_1 \ge \left(\frac{6D(\theta_0)}{\delta\varepsilon_0}\right)^{\frac{8L}{C^0_\delta}\ln 2} = \Omega\!\left(S^{2S^3}\right),\qquad B_1 \ge \max\left\{\frac{30\sigma^2}{C^0_\delta\varepsilon_0\delta},\,\frac{6\sigma T_1\log T_1}{\bar\Delta L}\right\} = \Omega\!\left(S^{2S^3}\right). \tag{24}$$
To enforce a positive $T_2$, the tolerance levels $\varepsilon, \delta$ should satisfy $\frac{\varepsilon_0}{6\delta\varepsilon}\ge 1$, which implies $\frac{1}{\delta\varepsilon} = \Omega(S^2)$. Hence, assuming $\frac{\varepsilon_0}{\delta\varepsilon} = o(S)$ and tolerance levels $\varepsilon, \delta = O(S^{-1})$, the complexities in phase 2 scale as
$$T_2 \ge \frac{\varepsilon_0}{6\delta\varepsilon} - 1 - t_0 = \Omega\!\left(S^{3/2}\right),\qquad B_2 \ge \frac{\sigma^2\ln(T_2+t_0)}{6C_\zeta\delta\varepsilon} = \Omega\!\left(S^5\right). \tag{25}$$
We first introduce the following lemma to aid the proof of Theorem A.4.

Lemma A.5 (Bounded Optimal Values Between Two Adjacent Contexts) Under the same conditions as Theorem A.4, we have
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) \le \frac{2L_r\|\omega_k-\omega_{k-1}\|_2}{(1-\gamma)^2}. \tag{26}$$
Proof Let $V^{\pi}_{\omega}$ denote the value function of policy $\pi$ under reward function $r_\omega$. From (65) of Lemma B.3, for any initial distribution $\rho$ we have
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) = \frac{1}{1-\gamma}\sum_{s}d^{\pi^\star_{\omega_{k-1}}}_{\rho}(s)\cdot\alpha\cdot D_{\mathrm{KL}}\!\left(\pi^\star_{\omega_{k-1}}(\cdot|s)\,\middle\|\,\pi^\star_{\omega_k}(\cdot|s)\right). \tag{27}$$
From (47) of Lemma B.1, we know that
$$\pi^\star_{\omega_{k-1}}(a|s) = \mathrm{softmax}\!\left(Q^{\pi^\star_{\omega_{k-1}}}(s,\cdot)/\alpha\right)_a := \frac{\exp\!\left(Q^{\pi^\star_{\omega_{k-1}}}(s,a)/\alpha\right)}{\sum_{a'}\exp\!\left(Q^{\pi^\star_{\omega_{k-1}}}(s,a')/\alpha\right)},\qquad \pi^\star_{\omega_k}(a|s) = \mathrm{softmax}\!\left(Q^{\pi^\star_{\omega_k}}(s,\cdot)/\alpha\right)_a,$$
hence
$$D_{\mathrm{KL}}\!\left(\pi^\star_{\omega_{k-1}}(\cdot|s)\,\middle\|\,\pi^\star_{\omega_k}(\cdot|s)\right) = \sum_{a}\pi^\star_{\omega_{k-1}}(a|s)\left[\log\mathrm{softmax}\!\left(Q^{\pi^\star_{\omega_{k-1}}}(s,\cdot)/\alpha\right)_a - \log\mathrm{softmax}\!\left(Q^{\pi^\star_{\omega_k}}(s,\cdot)/\alpha\right)_a\right]. \tag{29}$$
Let $f(x)$ denote the log-softmax function for an input vector $x = [x_1, x_2, \ldots, x_A]^\top$ with $x_i \ge 0$. For a small perturbation $\Delta\in\mathbb{R}^A$, the mean value theorem implies
$$[f(x+\Delta)]_i - [f(x)]_i = \Delta^\top\nabla_z[f(z)]_i$$
for some vector $z$ on the segment $[x, x+\Delta]$. The Jacobian of the log-softmax function satisfies
$$\frac{\partial [f(z)]_i}{\partial z_j} = \begin{cases} 1 - p_i(z) \in (0,1) & \text{if } i = j,\\ -p_j(z) \in (-1,0) & \text{otherwise},\end{cases}\qquad p_i(z) = \frac{\exp(z_i)}{\sum_{k=1}^{A}\exp(z_k)},$$
hence
$$\left|[f(x+\Delta)]_i - [f(x)]_i\right| = \left|\Delta^\top\nabla_z[f(z)]_i\right| \le \|\Delta\|_\infty\sum_{k=1}^{A}\left|\frac{\partial[f(z)]_i}{\partial z_k}\right| = \|\Delta\|_\infty\left(1 - p_i(z) + \sum_{j\ne i}p_j(z)\right) \le 2\|\Delta\|_\infty. \tag{32}$$
Now let
$$x = \frac{1}{\alpha}\left[Q^{\pi^\star_{\omega_{k-1}}}(s,a_1),\ldots,Q^{\pi^\star_{\omega_{k-1}}}(s,a_A)\right],\qquad x+\Delta = \frac{1}{\alpha}\left[Q^{\pi^\star_{\omega_k}}(s,a_1),\ldots,Q^{\pi^\star_{\omega_k}}(s,a_A)\right];$$
then (57) from Lemma B.2 implies
$$\frac{1}{\alpha}\left\|Q^{\pi^\star_{\omega_k}} - Q^{\pi^\star_{\omega_{k-1}}}\right\|_\infty \le \frac{L_r\|\omega_k-\omega_{k-1}\|_2}{\alpha(1-\gamma)}. \tag{34}$$
Substituting (34) and (32) into (29) yields
$$D_{\mathrm{KL}}\!\left(\pi^\star_{\omega_{k-1}}(\cdot|s)\,\middle\|\,\pi^\star_{\omega_k}(\cdot|s)\right) \le \sum_a 2\,\pi^\star_{\omega_{k-1}}(a|s)\,\|\Delta\|_\infty = 2\|\Delta\|_\infty \le \frac{2L_r\|\omega_k-\omega_{k-1}\|_2}{\alpha(1-\gamma)}. \tag{35}$$
Combining (35) with (27), we have
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) = \frac{1}{1-\gamma}\sum_{s}d^{\pi^\star_{\omega_{k-1}}}_{\rho}(s)\cdot\alpha\cdot D_{\mathrm{KL}}\!\left(\pi^\star_{\omega_{k-1}}(\cdot|s)\,\middle\|\,\pi^\star_{\omega_k}(\cdot|s)\right) \le \frac{2L_r\|\omega_k-\omega_{k-1}\|_2}{(1-\gamma)^2},$$
which completes the proof. ■ Now we are ready to proceed to the proof of Theorem A.4.

Proof From (19) we know that
$$\varepsilon_0 = \min\left\{\frac{\alpha\min_{s\in\mathcal S}\rho(s)}{6\ln 2}\left(\frac{2}{\zeta}\exp\!\left(\frac{1}{(1-\gamma)\alpha}\right)\right)^{-4},\,1\right\} = O\!\left(\frac{1}{S^2}\right). \tag{37}$$
And from Section 6.2 of Ding et al. (2021), we can directly enter phase 2 of the stochastic PG when $V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) \le \varepsilon_0$.
Hence, when $\Delta^{k}_{\omega} = \max_{1\le i\le k}\|\omega_i-\omega_{i-1}\|_2 = O(1/S^2)$, we have
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) \le \frac{2L_r\Delta^{k}_{\omega}}{(1-\gamma)^2} \le \frac{\varepsilon_0}{2}, \tag{39}$$
which implies we can directly enter phase 2 and enjoy the faster iteration complexity of $T_2 = \Omega(S^{3/2})$ (by choosing $\delta = O(S^{-1})$) and the smaller batch size of
$$B_2 \ge \frac{\sigma^2\ln(T_2+t_0)}{6C_\zeta\delta\varepsilon} \overset{(i)}{=} \Omega\!\left(\frac{L_r}{\alpha(1-\beta)}\,\Delta^{k}_{\omega}\,S^5\right) \overset{(ii)}{=} \Omega\!\left(\frac{L_r}{\alpha(1-\beta)}\,S^3\right),$$
where equality (i) holds by applying Lemma A.1 to (23):
$$\frac{\sigma^2\ln(T_2+t_0)}{6C_\zeta\delta\varepsilon} = \Omega\!\left(S^4\cdot\left\|d^{\pi^\star_{\omega_k}}_{\mu_k}/\mu_k\right\|_\infty\right) = \Omega\!\left(\frac{L_r}{\alpha(1-\beta)}\,\Delta^{k}_{\omega}\,S^5\right),$$
and equality (ii) holds by the assumption that $\Delta^{k}_{\omega} = O(S^{-2})$; we omit log terms and components not related to $S$ in the $\Omega(\cdot)$. ■

…is a near-optimal initialization, then the total number of iterations for learning $\pi^\star_{\omega_K}$ using Algorithm 1 is $\Omega(KS^{3/2})$, and the per-iteration sample complexity is $\Omega(S^3)$, with high probability.

Proof From Lemma A.5, we know that
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) \le \frac{2L_r\|\omega_k-\omega_{k-1}\|_2}{(1-\gamma)^2}. \tag{41}$$
Suppose for each context $\omega_k$ we initialize the policy parameters as $\theta^{(k)}_1 = \theta^\star_{\omega_{k-1}}$, and let $\theta^{(k)}_t$ denote the parameters at the $t$-th iteration of SPG. We use induction to show that when $t = \Omega(S^{3/2})$, for all $k\in[K]$,
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k-1)}_t}}_{\omega_k}(\rho) < \varepsilon_0, \tag{42}$$
which implies that for any context $\omega_k$, $k\in[K]$, we can always find a good initialization by setting $\theta^{(k)}_1 = \theta^{(k-1)}_t$, obtained from learning $\pi^\star_{\omega_{k-1}}$ with SPG after $t = \Omega(S^{3/2})$ iterations. This guarantees that every initialization $\theta^{(k)}_1$ for learning the optimal contextual policy $\pi^\star_{\omega_k}$ starts directly in the efficient phase 2.

Induction base case ($k = 0$). When $k = 0$, Assumption 3.2 and the near-optimal initialization (Definition 4.2) of $\theta^{(0)}_0$ imply that
$$V^{\pi^\star_{\omega_0}}_{\omega_0}(\rho) - V^{\pi_{\theta^{(0)}_0}}_{\omega_0}(\rho) < \varepsilon_0, \tag{43}$$
i.e., a near-optimal initialization allows learning to start directly in phase 2 of SPG.

Induction step (from $k-1$ to $k$).
Suppose the result in (42) holds for $k-1$; then we know that
$$V^{\pi^\star_{\omega_{k-1}}}_{\omega_{k-1}}(\rho) - V^{\pi_{\theta^{(k-1)}_1}}_{\omega_{k-1}}(\rho) = V^{\pi^\star_{\omega_{k-1}}}_{\omega_{k-1}}(\rho) - V^{\pi_{\theta^{(k-2)}_t}}_{\omega_{k-1}}(\rho) < \varepsilon_0.$$
Select $\varepsilon$ such that $\varepsilon \le \varepsilon_0/2$. Theorem A.4 then implies that when $t' = \Omega(S^{3/2})$, with high probability we have
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k-1)}_{t'}}}_{\omega_k}(\rho) < \varepsilon \le \frac{\varepsilon_0}{2}. \tag{45}$$
Hence, if we initialize $\theta^{(k)}_1 = \theta^{(k-1)}_t$, with high probability when $t' = \Omega(S^{3/2})$ we have
$$V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k-1)}_{t'}}}_{\omega_k}(\rho) = \left[V^{\pi^\star_{\omega_k}}_{\omega_k}(\rho) - V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho)\right] + \left[V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k-1)}_{t'}}}_{\omega_k}(\rho)\right] \overset{(i)}{\le} \frac{\varepsilon_0}{2} + \left[V^{\pi^\star_{\omega_{k-1}}}_{\omega_k}(\rho) - V^{\pi_{\theta^{(k-1)}_{t'}}}_{\omega_k}(\rho)\right] \overset{(ii)}{<} \varepsilon_0,$$
where inequality (i) holds by equation (39) in Theorem A.4, and inequality (ii) holds because of the induction assumption in (45). Therefore, we have shown that (42) holds with $t = \Omega(S^{3/2})$ for all $k\in[K]$. Since we have $K$ contexts in total, Algorithm 1 can enforce a good initialization $\theta^{(k)}_1$ that starts directly in phase 2 for learning every $\pi^\star_{\omega_k}$, and for each $k\in[K]$ the iteration complexity is $\Omega(S^{3/2})$. Hence the total iteration complexity of obtaining an $\varepsilon$-optimal policy for the final context $\omega_K$ is $\Omega(KS^{3/2})$, with per-iteration sample complexity of $\Omega(S^3)$. ■

APPENDIX B KEY LEMMAS

B.1 OPTIMAL POLICY OF MAXIMUM ENTROPY RL (NACHUM ET AL., 2017)

Lemma B.1 The optimal policy $\pi^\star$ that maximizes the $\alpha$-MaxEnt RL objective (4) with penalty coefficient $\alpha$ satisfies
$$\pi^\star(a|s) = \exp\!\left(\frac{Q^{\pi^\star}(s,a) - V^{\pi^\star}(s)}{\alpha}\right) = \frac{\exp\!\left(Q^{\pi^\star}(s,a)/\alpha\right)}{\sum_{a}\exp\!\left(Q^{\pi^\star}(s,a)/\alpha\right)},$$
where
$$Q^{\pi^\star}(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\!\left[V^{\pi^\star}(s')\right],\qquad V^{\pi^\star}(s) := \alpha\log\sum_{a}\exp\!\left(Q^{\pi^\star}(s,a)/\alpha\right). \tag{47, 48}$$
Proof A similar proof appears in Nachum et al. (2017); we provide it for completeness. At the optimal policy $\pi_\theta = \pi^\star$, taking the gradient of (4) with respect to $p\in\Delta(\mathcal{A})$ and setting it to zero gives
$$\frac{\partial}{\partial p(a)}\sum_{a\in\mathcal{A}}p(a)\left[Q^{\pi^\star}(s,a) - \alpha\ln p(a)\right] = Q^{\pi^\star}(s,a) - \alpha\ln p(a) - \alpha = 0,$$
which implies
$$p(a) = \exp\!\left(\frac{Q^{\pi^\star}(s,a)}{\alpha} - 1\right) \propto \exp\!\left(\frac{Q^{\pi^\star}(s,a)}{\alpha}\right).$$
Hence, we conclude that π ⋆ (a|s) ∝ exp(Q ⋆ (s, a)/α). ■
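Lemma B.1's closed form is straightforward to check numerically. A small sketch (not from the paper's code; assumes a tabular Q given as a numpy array) computing the soft value $V(s) = \alpha\log\sum_a \exp(Q(s,a)/\alpha)$ and the corresponding softmax policy:

```python
import numpy as np

def soft_value(q, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    z = q / alpha
    m = z.max(axis=-1, keepdims=True)
    return alpha * (m[..., 0] + np.log(np.exp(z - m).sum(axis=-1)))

def soft_policy(q, alpha):
    """pi(a|s) = exp((Q(s, a) - V(s)) / alpha): a softmax over actions."""
    return np.exp((q - soft_value(q, alpha)[..., None]) / alpha)
```

As α → 0 the policy approaches the greedy argmax; as α → ∞ it approaches the uniform distribution.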

B.2 BOUNDING THE DIFFERENCE BETWEEN OPTIMAL POLICIES

Lemma B.2 Suppose Assumption 3.1 holds, and let $\pi^\star_\omega(a|s), \pi^\star_{\omega'}(a|s)$ denote the optimal policies of $\alpha$-MaxEnt RL (47) under contexts $\omega, \omega'$. Then for all $(s,a)\in\mathcal{S}\times\mathcal{A}$,
$$\left|\pi^\star_\omega(a|s) - \pi^\star_{\omega'}(a|s)\right| \le \frac{L_r\|\omega-\omega'\|_2}{\alpha(1-\gamma)}.$$
Proof From Lemma C.1, we know that the soft value iteration operator
$$\mathcal{T}Q(s,a) = r(s,a) + \gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(Q(s',a')/\alpha\right) \tag{52}$$
is a contraction. Let $Q^t_\omega, Q^t_{\omega'}$ denote the Q-functions at the $t$-th value iteration under contexts $\omega, \omega'$ respectively, so $Q^\infty_\omega = Q^{\pi^\star_\omega}$ and $Q^\infty_{\omega'} = Q^{\pi^\star_{\omega'}}$. Let $\varepsilon_t = \|Q^t_\omega - Q^t_{\omega'}\|_\infty$; then
$$\varepsilon_{t+1} = \left\|Q^{t+1}_\omega - Q^{t+1}_{\omega'}\right\|_\infty = \left\|r_\omega(s,a) - r_{\omega'}(s,a) + \gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(\frac{Q^t_\omega(s',a')}{\alpha}\right) - \gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(\frac{Q^t_{\omega'}(s',a')}{\alpha}\right)\right\|_\infty$$
$$\le \|r_\omega - r_{\omega'}\|_\infty + \gamma\alpha\left\|\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(Q^t_\omega(s',a')/\alpha\right) - \mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(Q^t_{\omega'}(s',a')/\alpha\right)\right\|_\infty \le \|r_\omega - r_{\omega'}\|_\infty + \gamma\left\|Q^t_\omega - Q^t_{\omega'}\right\|_\infty = \|r_\omega - r_{\omega'}\|_\infty + \gamma\varepsilon_t, \tag{53}$$
where the last inequality holds because $f(x) = \log\sum_{i=1}^{n}\exp(x_i)$ is 1-Lipschitz in the $\ell_\infty$ norm. From (53), we have
$$\varepsilon_{t+1} \le \|r_\omega - r_{\omega'}\|_\infty + \gamma\varepsilon_t \le (1+\gamma)\|r_\omega-r_{\omega'}\|_\infty + \gamma^2\varepsilon_{t-1} \le \cdots \le \|r_\omega-r_{\omega'}\|_\infty\sum_{i=0}^{t}\gamma^i + \gamma^t\varepsilon_1, \tag{54}$$
which implies
$$\left\|Q^{\pi^\star_\omega} - Q^{\pi^\star_{\omega'}}\right\|_\infty = \varepsilon_\infty \le \frac{\|r_\omega - r_{\omega'}\|_\infty}{1-\gamma} \le \frac{L_r\|\omega-\omega'\|_2}{1-\gamma},$$
where the last inequality holds by Assumption 3.1. Hence, we have
$$\frac{1}{\alpha}\left|Q^{\pi^\star_\omega}(s,a) - Q^{\pi^\star_{\omega'}}(s,a)\right| \le \frac{L_r\|\omega-\omega'\|_2}{\alpha(1-\gamma)}\ \ \forall (s,a)\in\mathcal S\times\mathcal A,\qquad\text{i.e.}\qquad \frac{1}{\alpha}\left\|Q^{\pi^\star_\omega} - Q^{\pi^\star_{\omega'}}\right\|_\infty \le \frac{L_r\|\omega-\omega'\|_2}{\alpha(1-\gamma)}. \tag{57}$$
Next, for a fixed state-action pair $(s,a)\in\mathcal S\times\mathcal A$, the optimal MaxEnt policies under contexts $\omega, \omega'$ satisfy
$$\pi^\star_\omega(a|s) = \mathrm{softmax}\!\left(Q^{\pi^\star_\omega}(s,\cdot)/\alpha\right)_a := \frac{\exp\!\left(Q^{\pi^\star_\omega}(s,a)/\alpha\right)}{\sum_{a'}\exp\!\left(Q^{\pi^\star_\omega}(s,a')/\alpha\right)},\qquad \pi^\star_{\omega'}(a|s) = \mathrm{softmax}\!\left(Q^{\pi^\star_{\omega'}}(s,\cdot)/\alpha\right)_a,$$
where $Q^{\pi^\star_\omega}(s,\cdot), Q^{\pi^\star_{\omega'}}(s,\cdot)\in\mathbb{R}^A$. Let $f(x)$ denote the softmax function for an input vector $x = [x_1, x_2, \ldots, x_A]^\top$ with $x_i\ge 0$. For a small perturbation $\Delta\in\mathbb{R}^A$, the mean value theorem implies
$$[f(x+\Delta)]_i - [f(x)]_i = \Delta^\top\nabla_z[f(z)]_i$$
for some vector $z$ on the segment $[x, x+\Delta]$. The Jacobian of the softmax function satisfies
$$\frac{\partial[f(z)]_i}{\partial z_j} = \begin{cases} p_i(z)\,(1-p_i(z)) & \text{if } i=j,\\ -p_i(z)\,p_j(z) & \text{otherwise},\end{cases}\qquad p_i(z) = \frac{\exp(z_i)}{\sum_{k=1}^{A}\exp(z_k)},$$
hence
$$\left|[f(x+\Delta)]_i - [f(x)]_i\right| \le \|\Delta\|_\infty\sum_{k=1}^{A}\left|\frac{\partial[f(z)]_i}{\partial z_k}\right| = \|\Delta\|_\infty\left(p_i(z)(1-p_i(z)) + \sum_{j\ne i}p_i(z)p_j(z)\right) < \|\Delta\|_\infty\left(p_i(z) + \sum_{j\ne i}p_j(z)\right) = \|\Delta\|_\infty. \tag{60}$$
Now let
$$x = \frac{1}{\alpha}\left[Q^{\pi^\star_\omega}(s,a_1),\ldots,Q^{\pi^\star_\omega}(s,a_A)\right],\qquad x+\Delta = \frac{1}{\alpha}\left[Q^{\pi^\star_{\omega'}}(s,a_1),\ldots,Q^{\pi^\star_{\omega'}}(s,a_A)\right],$$
so that $f(x)_a = \pi^\star_\omega(a|s)$ and $f(x+\Delta)_a = \pi^\star_{\omega'}(a|s)$. Then (57) implies $\|\Delta\|_\infty \le \frac{L_r\|\omega-\omega'\|_2}{\alpha(1-\gamma)}$, and substituting this bound on $\|\Delta\|_\infty$ into (60) gives
$$\left|\pi^\star_\omega(a|s) - \pi^\star_{\omega'}(a|s)\right| = \left|[f(x+\Delta)]_a - [f(x)]_a\right| \le \|\Delta\|_\infty \le \frac{L_r\|\omega-\omega'\|_2}{\alpha(1-\gamma)},$$
which completes the proof. ■

Lemma B.3 For any policy $\pi$ and any initial distribution $\rho$, the value function $V^\pi(\rho)$ of the $\alpha$-MaxEnt RL objective (48) satisfies
$$V^{\pi^\star}(\rho) - V^{\pi}(\rho) = \frac{1}{1-\gamma}\sum_{s}d^{\pi}_{\rho}(s)\cdot\alpha\cdot D_{\mathrm{KL}}\!\left(\pi(\cdot|s)\,\middle\|\,\pi^\star(\cdot|s)\right),$$
where $\pi^\star$ is the optimal policy of the $\alpha$-MaxEnt RL objective (4).
Proof Similar proofs appear in Lemmas 25 & 26 of Mei et al. (2020); we provide them here for completeness. Soft performance difference. We first show a soft performance difference result for the MaxEnt value function (Lemma 25 of Mei et al. (2020)).
By the definition of the MaxEnt value function and Q-function ((4), (6)), for all $\pi, \pi'$ we have
$$V^{\pi'}(s) - V^{\pi}(s) = \sum_a \pi'(a|s)\left[Q^{\pi'}(s,a) - \alpha\log\pi'(a|s)\right] - \sum_a \pi(a|s)\left[Q^{\pi}(s,a) - \alpha\log\pi(a|s)\right]$$
$$= \sum_a\left(\pi'(a|s) - \pi(a|s)\right)\left[Q^{\pi'}(s,a) - \alpha\log\pi'(a|s)\right] + \sum_a \pi(a|s)\left[Q^{\pi'}(s,a) - \alpha\log\pi'(a|s) - Q^{\pi}(s,a) + \alpha\log\pi(a|s)\right]$$
$$= \sum_a\left(\pi'(a|s) - \pi(a|s)\right)\left[Q^{\pi'}(s,a) - \alpha\log\pi'(a|s)\right] + \alpha D_{\mathrm{KL}}\!\left(\pi(\cdot|s)\,\middle\|\,\pi'(\cdot|s)\right) + \gamma\sum_a\pi(a|s)\sum_{s'}P(s'|s,a)\left[V^{\pi'}(s') - V^{\pi}(s')\right]$$
$$= \frac{1}{1-\gamma}\sum_{s'}d^{\pi}_{s}(s')\left[\sum_{a'}\left(\pi'(a'|s') - \pi(a'|s')\right)\left(Q^{\pi'}(s',a') - \alpha\log\pi'(a'|s')\right) + \alpha D_{\mathrm{KL}}\!\left(\pi(\cdot|s')\,\middle\|\,\pi'(\cdot|s')\right)\right],$$
where the last equality holds by unrolling the recursion with the state visitation distribution $d^{\pi}_{s_0}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t P^\pi(s_t = s|s_0)$. Taking the expectation of $s$ with respect to $s\sim\rho$ yields
$$V^{\pi'}(\rho) - V^{\pi}(\rho) = \frac{1}{1-\gamma}\sum_{s'}d^{\pi}_{\rho}(s')\left[\sum_{a'}\left(\pi'(a'|s') - \pi(a'|s')\right)\left(Q^{\pi'}(s',a') - \alpha\log\pi'(a'|s')\right) + \alpha D_{\mathrm{KL}}\!\left(\pi(\cdot|s')\,\middle\|\,\pi'(\cdot|s')\right)\right], \tag{68}$$
and (68) is known as the soft performance difference lemma (Lemma 25 in Mei et al. (2020)).

Soft sub-optimality. Next we show the soft sub-optimality result. By the definition of the optimal policy of $\alpha$-MaxEnt RL (47), we have $\alpha\log\pi^\star(a|s) = Q^{\pi^\star}(s,a) - V^{\pi^\star}(s)$. Substituting $\pi' = \pi^\star$ into the performance difference lemma (68), the first term vanishes:
$$\sum_{a'}\left(\pi^\star(a'|s') - \pi(a'|s')\right)\underbrace{\left(Q^{\pi^\star}(s',a') - \alpha\log\pi^\star(a'|s')\right)}_{=\,V^{\pi^\star}(s')} = V^{\pi^\star}(s')\underbrace{\sum_{a'}\left(\pi^\star(a'|s') - \pi(a'|s')\right)}_{=\,0} = 0,$$
hence
$$V^{\pi^\star}(s) - V^{\pi}(s) = \frac{1}{1-\gamma}\sum_{s'}d^{\pi}_{s}(s')\cdot\alpha\,D_{\mathrm{KL}}\!\left(\pi(\cdot|s')\,\middle\|\,\pi^\star(\cdot|s')\right),$$
and taking the expectation $s\sim\rho$ yields
$$V^{\pi^\star}(\rho) - V^{\pi}(\rho) = \frac{1}{1-\gamma}\sum_{s}d^{\pi}_{\rho}(s)\cdot\alpha\cdot D_{\mathrm{KL}}\!\left(\pi(\cdot|s)\,\middle\|\,\pi^\star(\cdot|s)\right).$$

$+\,\gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(Q(s',a')/\alpha\right)$ is a contraction. Proof A similar proof appears in Haarnoja (2018); we provide it for completeness.
To see that (72) is a contraction, for each $(s,a)\in\mathcal{S}\times\mathcal{A}$ we have
$$\mathcal{T}Q_1(s,a) = r(s,a) + \gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(\frac{Q_1(s',a')}{\alpha}\right) \le r(s,a) + \gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(\frac{Q_2(s',a') + \|Q_1-Q_2\|_\infty}{\alpha}\right)$$
$$= r(s,a) + \gamma\alpha\,\mathbb{E}_{s'}\log\left[\exp\!\left(\frac{\|Q_1-Q_2\|_\infty}{\alpha}\right)\sum_{a'}\exp\!\left(\frac{Q_2(s',a')}{\alpha}\right)\right] = \gamma\|Q_1-Q_2\|_\infty + r(s,a) + \gamma\alpha\,\mathbb{E}_{s'}\log\sum_{a'}\exp\!\left(\frac{Q_2(s',a')}{\alpha}\right) = \gamma\|Q_1-Q_2\|_\infty + \mathcal{T}Q_2(s,a),$$
which implies $\mathcal{T}Q_1(s,a) - \mathcal{T}Q_2(s,a) \le \gamma\|Q_1-Q_2\|_\infty$. Similarly, we also have $\mathcal{T}Q_2(s,a) - \mathcal{T}Q_1(s,a) \le \gamma\|Q_1-Q_2\|_\infty$, hence we conclude that
$$\left|\mathcal{T}Q_1(s,a) - \mathcal{T}Q_2(s,a)\right| \le \gamma\|Q_1-Q_2\|_\infty,\quad\forall(s,a)\in\mathcal{S}\times\mathcal{A},$$
which implies $\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \le \gamma\|Q_1-Q_2\|_\infty$. Hence $\mathcal{T}$ is a $\gamma$-contraction, its fixed point is unique, and so is the corresponding optimal policy $\pi^\star$. ■
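Lemma C.1's γ-contraction property is easy to verify numerically on a random finite MDP. A minimal tabular sketch (assumed shapes: P is (S, A, S), r and Q are (S, A); not from the paper's code):

```python
import numpy as np

def soft_bellman(q, r, P, gamma, alpha):
    """T Q(s,a) = r(s,a) + gamma * alpha * E_{s'} log sum_a' exp(Q(s',a')/alpha)."""
    z = q / alpha
    m = z.max(axis=-1, keepdims=True)
    v = alpha * (m[..., 0] + np.log(np.exp(z - m).sum(axis=-1)))  # soft value, shape (S,)
    return r + gamma * P @ v  # (S, A, S) @ (S,) -> (S, A)

rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.5
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)  # random transition kernel
r = rng.random((S, A))
q1, q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
gap = np.abs(soft_bellman(q1, r, P, gamma, alpha) - soft_bellman(q2, r, P, gamma, alpha)).max()
assert gap <= gamma * np.abs(q1 - q2).max() + 1e-12  # the gamma-contraction holds
```

The check holds for any pair of Q-tables, since log-sum-exp is 1-Lipschitz in the sup norm.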

C.2 CONSTANT MINIMUM POLICY PROBABILITY

Lemma C.2 (Lemma 16 of Mei et al. (2020)) Using the policy gradient method (Algorithm 3) with an initial distribution ρ such that ρ(s) > 0 for all s ∈ S, we have that c := inf_{t≥1} min_{s,a} π_{θ_t}(a|s) > 0 is a constant that does not depend on t. Remark C.3 (State Space Dependency of the Constant c) Note that assuming c is independent of S is quite strong and does not hold in general, as suggested by Li et al. (2021). Still, if one replaces the constant c with an S-dependent function f(S), one can still apply a similar proof technique for Theorem 4.1 to show that ROLLIN reduces the iteration complexity; the final iteration complexity bound in Theorem 4.1 will then include an additional factor of f(S).

APPENDIX D SUPPORTING ALGORITHMS

Algorithm 3 PG for α-MaxEnt RL (Algorithm 1 in Mei et al. (2020))
1: Input: ρ, θ_0, η > 0.
2: for t = 0, . . . , T do
3:   θ_{t+1} ← θ_t + η · ∂V^{π_{θ_t}}(ρ)/∂θ_t

(EstEntQ body; see Algorithm 7)
1: Input: s, a, θ, γ, α
2: Initialize s_0 ← s, a_0 ← a, Q ← r(s_0, a_0)
3: Draw H ∼ Geom(1 − γ)
4: for h = 0, 1, . . . , H − 1 do
5:   Draw s_{h+1} ∼ P(·|s_h, a_h), a_{h+1} ∼ π_θ(·|s_{h+1})
6:   Q ← Q + γ^{(h+1)/2} [r(s_{h+1}, a_{h+1}) − α log π_θ(a_{h+1}|s_{h+1})]

(Sampler subroutine; see Algorithm 6)
1: Input: ρ, θ, γ
2: Draw H ∼ Geom(1 − γ) ▷ Geom(1 − γ): the geometric distribution with parameter 1 − γ
3: Draw s_0 ∼ ρ, a_0 ∼ π_θ(·|s_0)
4: for h = 1, 2, . . . , H − 1 do
5:   Draw s_{h+1} ∼ P(·|s_h, a_h), a_{h+1} ∼ π_θ(·|s_{h+1})

Table 4 Learning progress κ, average x-velocity, and average return at the 0.75 and 1.0 million environment steps in walker, hopper, humanoid, and ant. The average x-velocity and return are estimated using the last 50k time steps. We pick β = 0.1 for all experiments using ROLLIN; the results of using other βs can be found in


Table 10 Average return of the last 50k time steps, at the 0.75 and 1.0 million environment steps with varying β of non-goal-reaching tasks. Baseline corresponds to β = 0, where no ROLLIN is used. The standard error is computed over 8 random seeds. We highlight the values that are larger than the baseline (β = 0) in purple, and the largest value in bold font.

APPENDIX G ADDITIONAL DISCUSSIONS ON THE RELATED WORK

Liu et al. (2022) and Bassich et al. (2020) also consider RL tasks with a curriculum. From a theoretical perspective, we demonstrate the sample complexity gain of the stochastic policy gradient method using non-asymptotic statistical analysis techniques, whereas Liu et al. (2022) and Bassich et al. (2020) do not study policy gradient methods. From an empirical perspective, the non-goal-reaching MuJoCo locomotion domain is related to the continuous robot evolution model considered by Liu et al. (2022). This prior work uses a context to characterize the transition dynamics of the robot, while we assume the same transition dynamics for all tasks and our context only determines the reward function. The major empirical difference between our work and Liu et al. (2022) lies in the algorithm: ROLLIN facilitates learning by rolling



Note that in general, ROLLIN is not specific to goal-conditioned RL and could in principle utilize any type of context-parameterized reward function that satisfies its assumptions.

DISCUSSION AND FUTURE WORK

We presented ROLLIN, a simple algorithm that accelerates curriculum learning under the contextual MDP setup by rolling in a near-optimal policy to bootstrap the learning of new nearby contexts, with provable learning efficiency benefits. Theoretically, we show that ROLLIN attains polynomial sample complexity by utilizing adjacent contexts to initialize each policy. Since the key theoretical insight of ROLLIN suggests that one can reduce the density mismatch ratio by constructing a new initial distribution, it would be interesting to see how ROLLIN affects other variants of convergence analysis of PG (e.g., NPG (Kakade, 2001; Cen et al., 2021) or PG in a feature space (Agarwal et al., 2021; 2020)). On the empirical side, our extensive experiments demonstrate that ROLLIN improves the empirical performance of various tasks beyond our theoretical assumptions, which reveals the potential of ROLLIN in other practical RL tasks with a curriculum. Our experiments also suggest that one can expect improvement by choosing a constant β = 0.1 or 0.2. Hence, another potential direction on the empirical side is to provide an automatic tuning method for selecting the best β.



The new distribution µ′ = βd^{π⋆_ω}_µ + (1 − β)µ for learning π⋆_{ω′} can be understood as "rolling in" d^{π⋆_ω}_µ; it also facilitates the learning of π⋆_{ω′}, as it reduces the density mismatch ratio and hence the sample complexity.

Figure 1 Illustration of ROLLIN. The red circle represents the initial state distribution. The dark curve represents the optimal policy w.r.t. ω. The blue diamonds represent the optimal state distributions d^{π⋆_ω}_µ and d^{π⋆_{ω′}}

and ρ further improves the rate of convergence by decreasing the density mismatch ratio (a quantity with an S dependency that influences the rate of convergence).

k ← k + 1, D_exp ← ∅, and re-initialize the exploration agent π_exp

10:   Record τ_{0:H} in D, and τ_{h:H} in D_exp.
11: else
12:   Run π_main(a|s, ω_k) to obtain trajectory τ_{0:H} and record τ_{0:H} in D.

Figure 2 Oracle curriculum of desired goals on antmaze-umaze.

) denote the initial state visitation distribution under initial state distribution ρ.
3. We assume the reward functions under all contexts are bounded within [0, 1]: r_ω(s, a) ∈ [0, 1], ∀ω ∈ Ω, ∀(s, a) ∈ S × A. (9)
4. Similar to previous analyses in Agarwal et al. (2021); Mei et al. (2020); Ding et al. (2021), we assume the initial distribution ρ for PG/stochastic PG satisfies ρ(s) > 0, ∀s ∈ S.

A.2 MAIN RESULTS: MISMATCH COEFFICIENT UPPER BOUND

Lemma A.1 (Density Mismatch Ratio via ROLLIN) Using (1) from

α(1 − β)) Δ^k_ω S) by setting β as a constant. If we pick an arbitrarily small β, then the 1/β term will dominate the complexity and we will not obtain the final bound Ω(L_r Δ^k_ω S / (α(1 − β))). ■

A.3 COMPLEXITY OF VANILLA STOCHASTIC PG

Theorem A.2 (Complexity of Stochastic PG; Theorem 5.1 of Ding et al. (

A.4 COMPLEXITY OF LEARNING THE NEXT CONTEXT

Theorem A.4 (Theorem 4.1: Complexity of Learning the Next Context) Consider the context-based stochastic softmax policy gradient (line 7 of Algorithm 1). Suppose Assumption 3.1 and Assumption 3.2 hold. Then the number of iterations for obtaining an ε-optimal policy for ω_k starting from θ⋆_{ω_{k−1}} is Ω(S^{3/2}), and the per-iteration sample complexity is Ω(L_r S³ / (α(1 − β))).

which completes the proof. ■

B.3 SOFT SUB-OPTIMALITY LEMMA (LEMMA 25 & 26 OF MEI ET AL. (2020))

CONSISTENCY EQUATION OF MAXENT RL

Lemma C.1 (Contraction of Soft Value Iteration) From (48) and (6), the soft value iteration operator T defined as T Q(s, a) := r(s, a)

Two-Phase SPG for α-MaxEnt RL (Algorithm 5.1 in Ding et al. (2021))

6: end for
7: Output: s_H, a_H

Algorithm 7 EstEntQ: Unbiased Estimation of the MaxEnt Q (Algorithm 8.2 in Ding et al. (2021))

Figure 4 The success rate of goal reaching in antmaze-umaze with a curriculum generated by MEGA with β = 0.1, 0.2, 0.5, 0.75, 0.9. The success rate is reported over 8 independent random seeds. 10 random trials for each seed are used to estimate the success rate.

Figure 5 Visualization of the successful trajectories by taking the ant's 2-D location in the maze. The maze walls are colored in black. The circles represent the starting locations of the agent and the crosses with the same color represent their corresponding goal locations.

Figure 11 Performance of ROLLIN combined with MEGA (Pitis et al., 2020) on antmaze-umaze with different choices of β.

Figure 12 Accelerating learning on several non-goal-reaching tasks. The confidence interval represents the standard error computed over 8 random seeds, for β = 0.1.

Initialize D ← ∅, D_exp ← ∅, k ← 0, and two SAC agents π_main and π_exp.

Table 1 Learning progress κ at 3 million environment steps with varying curriculum step size ∆ in different settings of goal reaching in antmaze-umaze. We pick β = 0.1 for all experiments using ROLLIN; the results of using other βs, ∆s, and exploration noise can be found in Tables 5, 6, and 7 in Appendix F.1. The standard error is computed over 8 random seeds.

and we want to bound |π⋆_ω(a|s) − π⋆_{ω′}(a|s)|. Next we will use (57) to bound |π⋆_ω(a|s) − π⋆_{ω′}(a|s)|, where the last inequality holds by (

Input: ρ, θ_0, α, B_1, B_2, T_1, T, {η_t}

Random-horizon SPG for α-MaxEnt RL Update (Algorithm 3.2 in Ding et al. (2021))
1: Input: ρ, α, θ_0, B, t, η_t
2: for i = 1, 2, ..., B do

in line 6 of Algorithm 6 is an unbiased estimator of the gradient ∇_θ V^{π_θ}(ρ).

Table 8, 9, 10 in Appendix F.3. The standard error is computed over 8 random seeds.

Vanilla Goal reaching. Learning progress κ at 3 million environment steps with varying β and curriculum step size ∆ of vanilla goal reaching task. Geo indicates the usage of geometric sampling. Baseline corresponds to β = 0, where no ROLLIN is used. The standard error is computed over 8 random seeds. We highlight the values that are larger than the baseline (β = 0) in purple, and the largest value in bold font.

Goal relabeling. All other settings are the same as Table 5.

Go-Explore with different exploration noise. EN represents the multiplier for the Gaussian exploration noise. All other settings are the same as Table 5.

Learning progress κ at 0.75 and 1.0 million environment steps with varying β of non goal reaching tasks. Baseline corresponds to β = 0, where no ROLLIN is used. The standard error is computed over 8 random seeds. We highlight the values that are larger than the baseline (β = 0) in purple, and the largest value in bold font.

Average x-direction velocity of the last 50k time steps, at 0.75 and 1.0 million environment steps with varying β of non goal reaching tasks. Baseline corresponds to β = 0, where no ROLLIN is used. The standard error is computed over 8 random seeds. We highlight the values that are larger than the baseline (β = 0) in purple, and the largest value in bold font.

APPENDIX E EXPERIMENTAL DETAILS

We use the SAC implementation from https://github.com/ikostrikov/jaxrl (Kostrikov, 2021) for all our experiments in the paper. The exception is the goal-reaching task with automated curriculum generation, where we use our customized DDPG implementation that is closely based on the same codebase.

E.1 GOAL REACHING WITH AN ORACLE CURRICULUM

For our antmaze-umaze experiments with an oracle curriculum, we use a sparse reward function where the reward is 0 when the distance D between the ant and the goal is greater than 0.5, and r = exp(−5D) when the distance is smaller than or equal to 0.5. The performance threshold is set to R = 200. Exceeding this threshold means that the ant stays on top of the desired location for at least 200 out of 500 steps, where 500 is the maximum episode length of the antmaze-umaze environment. We use the average return of the last 10 episodes and compare it to the performance threshold R. For both of the SAC agents, we use the same set of hyperparameters, shown in Table 3. See Algorithm 8 for more detailed pseudocode.
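As a concrete illustration, the sparse reward described above can be sketched as follows (a minimal sketch; the function name is ours and not part of any released code):

```python
import math

def antmaze_sparse_reward(distance):
    """Sparse reward for the oracle-curriculum antmaze-umaze setup:
    0 when the ant is farther than 0.5 from the goal, exp(-5 * D) otherwise."""
    return math.exp(-5.0 * distance) if distance <= 0.5 else 0.0
```

Since the per-step reward is at most 1 (at distance 0), the threshold R = 200 over a 500-step episode corresponds roughly to sitting on top of the goal for at least 200 steps.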

E.2 GOAL REACHING WITH AN AUTOMATICALLY GENERATED CURRICULUM

For our antmaze-umaze experiments with an automatically generated curriculum, we use a sparse reward function where the reward is -1 when the distance between the ant and the goal is greater than 1.0 (this differs from the oracle curriculum setting, where the threshold is 0.5), and 0 otherwise. We closely follow the DDPG setup in the MEGA paper (Pitis et al., 2020) and use the rfaab relabeling with a ratio of (0.1, 0.4, 0.1, 0.3, 0.1). We also use go-explore style exploration, where the epsilon-greedy ratio is increased from 0.1 to 0.11 when the goal is reached (whenever a reward of zero is received). The maximum episode length is 500, and we estimate the success rate of the agent by rolling out the policy 10 times with different pairs of random initial starting locations (a square region centered around (0.0, 0.0) with side length 2.0) and goals (a square region centered around (0.0, -8.0) with side length 2.0). See Figure 5 for a visualization of 8 successful trajectories of this task. We use a three-layer network (512, 512, 512) with layer normalization for both the actor and the critic. We also clip the target Q value so that the value that gets backed up is always within [-100, 0] (the range of achievable Q values with a discount factor of 0.99 and binary reward {-1, 0}). We did not use dynamic normalization of the observations (which was used in the original paper). Also, we used tanh to squash the output of the actor to be between -1 and 1 rather than using an L2 action penalty.

Algorithm 9 Practical Implementation of ROLLIN when combined with Automated Goal Curriculum Generation
1: Input: ρ: initial state distribution, β: roll-in ratio, discount factor γ, goal proposer G (e.g., MEGA (Pitis et al., 2020)), curriculum step interval N
2: Initialize D ← ∅, D_exp ← ∅, k ← 0, and two off-policy RL agents π_main and π_exp.
3: for each environment step do
4:   if more than N steps have been taken at curriculum step k then
6:     Re-initialize the exploration agent π_exp, and set π_main,k = π_main,k-1
8:   Start a new episode with s_0 ∼ ρ, t ← 0, and get a new goal g proposed by G.
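To make the roll-in step concrete, here is a minimal, self-contained sketch of the β-mixing idea on a toy chain environment. Everything here (ChainEnv, rollin_reset, the policies) is our own illustrative stand-in, not the paper's released code:

```python
import numpy as np

class ChainEnv:
    """Toy 1-D chain: the state is an integer position in [0, goal]."""
    def __init__(self, goal=10):
        self.goal, self.s = goal, 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):  # a in {-1, +1}
        self.s = max(0, min(self.goal, self.s + a))
        return self.s, float(self.s == self.goal)

def rollin_reset(env, pi_prev, beta, max_rollin_steps, rng):
    """With probability beta, let the previous near-optimal policy roll in
    for a random number of steps before the current policy takes over."""
    s = env.reset()
    if pi_prev is not None and rng.random() < beta:
        for _ in range(rng.integers(1, max_rollin_steps + 1)):
            s, _ = env.step(pi_prev(s))
    return s

rng = np.random.default_rng(0)
env = ChainEnv(goal=10)
pi_prev = lambda s: +1  # the previous policy walks toward the old goal
starts = [rollin_reset(env, pi_prev, beta=1.0, max_rollin_steps=5, rng=rng)
          for _ in range(100)]
# With beta = 1 every episode begins after a roll-in, so the current policy
# starts closer to the frontier reached at the previous curriculum step.
```

With a smaller β (e.g., 0.1 as in our experiments), most episodes start from ρ as usual, and only a fraction benefit from the roll-in initialization.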

E.3 NON GOAL REACHING

For the non-goal-reaching tasks in the walker2d, hopper, humanoid, and ant experiments, the desired x-velocity range [λκ, λ(κ + 0.1)), the near-optimal threshold R(κ), and the healthy_reward all depend on the environment. The maximum episode length is 1000. Details are provided in Table 4.

Our method rolls in a "near-optimal" policy learned from the previous context when learning the next context, which enables the agent to start from an initialization that is "close" to the optimal policy specified by the current context, while Liu et al. (2022) store all transitions from different robot parameters in one replay buffer and use that buffer to learn a single policy.
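A per-step membership check for the desired velocity band can be sketched as follows (the function name and signature are ours; λ and κ follow the notation above):

```python
def in_velocity_band(vx, kappa, lam):
    """True iff the x-velocity lies in the desired band
    [lam * kappa, lam * (kappa + 0.1)) for curriculum progress kappa
    and an environment-dependent scale lam."""
    return lam * kappa <= vx < lam * (kappa + 0.1)
```

As κ advances along the curriculum, the band slides toward higher target velocities, so each task asks the agent for slightly faster locomotion than the previous one.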

APPENDIX F ADDITIONAL LEARNING CURVES AND TABLES

Our goal-reaching tasks in the antmaze-umaze environment also appear in Bassich et al. (2020). However, our work studies a problem that is different from that of Bassich et al. (2020). In particular, Bassich et al. (2020) consider curriculum learning for RL by proposing a progression function that generates the context for each task, while our method tackles the orthogonal problem of using a given curriculum to accelerate RL. Our method is agnostic to where the curriculum comes from: it could be produced by the designer (Section 6.1) or by another algorithm (Section 6.2).

APPENDIX H ADDITIONAL DISCUSSIONS ON THE LIMITATIONS

Our theoretical analysis in the tabular case inspires a practical implementation of ROLLIN that demonstrates empirical success in various domains. However, we note that tabular analysis remains limited for understanding the empirical success of RL practice, as the state spaces in many RL applications are enormous or continuous. One potential future direction is to extend the initial distribution update procedure (1) from the tabular setting to a feature-space analysis. On the practical side, although ROLLIN generally demonstrates empirical benefits with β = 0.1 or β = 0.2, one still needs to fine-tune β to achieve the best performance in general. Hence, another future direction on the empirical side is to implement an auto-tuning version of the ROLLIN parameter β.

