UNDERSTANDING THE COMPLEXITY GAINS OF REFORMULATING SINGLE-TASK RL WITH A CURRICULUM

Abstract

Reinforcement learning (RL) problems can be challenging without well-shaped rewards. Prior work on provably efficient RL methods generally proposes to address this issue with dedicated exploration strategies. However, another way to tackle this challenge is to reformulate it as a multi-task RL problem, where the task space contains not only the challenging task of interest but also easier tasks that implicitly function as a curriculum. Such a reformulation opens up the possibility of running existing multi-task RL methods as a more efficient alternative to solving a single challenging task from scratch. In this work, we provide a theoretical framework that reformulates a single-task RL problem as a multi-task RL problem defined by a curriculum. Under mild regularity conditions on the curriculum, we show that sequentially solving each task in the multi-task RL problem is more computationally efficient than solving the original single-task problem, without any explicit exploration bonuses or other exploration strategies. We also show that our theoretical insights can be translated into an effective practical learning algorithm that can accelerate curriculum learning on robotic goal-reaching tasks.

1. INTRODUCTION

Reinforcement learning (RL) provides an appealing and simple way to formulate control and decision-making problems: reward functions specify what an agent should do, and policies are then automatically trained to learn how to do it. However, in practice the specification of the reward function requires great care: if the reward function is well-shaped, then learning can be fast and effective, but if rewards are delayed, sparse, or attainable only after extensive exploration, RL problems can be exceptionally difficult (Kakade and Langford, 2002; Andrychowicz et al., 2017; Agarwal et al., 2019). This challenge is often overcome with either reward shaping (Ng et al., 1999; Andrychowicz et al., 2017; 2020; Gupta et al., 2022) or dedicated exploration methods (Tang et al., 2017; Stadie et al., 2015; Bellemare et al., 2016; Burda et al., 2018), but reward shaping can bias the solution away from optimal behavior, while even the best exploration methods may, in general, require covering the entire state space before discovering high-reward regions. On the other hand, a number of recent works have proposed multi-task RL methods that learn contextual policies simultaneously representing solutions to an entire space of tasks, such as policies that reach any potential goal (Fu et al., 2018; Eysenbach et al., 2020b; Fujita et al., 2020; Zhai et al., 2022), policies conditioned on language commands (Nair et al., 2022), or even policies conditioned on the parameters of parametric reward functions (Kulkarni et al., 2016; Siriwardhana et al., 2019; Eysenbach et al., 2020a; Yu et al., 2020b).
While such methods are often not motivated directly from the standpoint of handling challenging exploration scenarios, but rather aim to acquire policies that can perform all tasks in the task space, these multi-task formulations often present a more tractable learning problem than acquiring a solution to a single challenging task in the task space (e.g., the hardest goal, or the most complex language command). We pose the following question: can we construct a multi-task RL problem with contextual policies that is easier than solving a single-task RL problem from scratch? In this work, we answer this question affirmatively by analyzing the sample complexity of a class of curriculum learning methods. To build intuition for how reformulating a single-task problem into a multi-task problem enables efficient learning, consider the setting where the optimal state visitation distributions d^{π⋆_ω}_µ and d^{π⋆_{ω′}}_µ of two different contexts ω, ω′ are "similar", and our goal is to learn the optimal policy π⋆_{ω′} w.r.t. ω′. Suppose we have learned the optimal policy π⋆_ω; we can then facilitate learning π⋆_{ω′} by (1) using π⋆_ω as an initialization and (2) setting a new initial state distribution µ′ = β d^{π⋆_ω}_µ + (1 − β)µ that mixes d^{π⋆_ω}_µ, the optimal state visitation distribution of π⋆_ω, with µ, the initial distribution of interest. Using π⋆_ω as the initialization for learning π⋆_{ω′} facilitates the learning process because it guarantees the initialization lies within some neighborhood of the optimum. Setting the new initial distribution µ′ eases exploration: with probability β, learning for ω′ starts from states visited by the optimal policy of the nearby context ω. We call this scheme ROLLIN: for learning π⋆_{ω′} when the optimal policy π⋆_ω of a similar context is known, ROLLIN rolls in the optimal policy of the nearby context ω with probability β. We illustrate the intuition of ROLLIN in Figure 1.
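To make the roll-in construction concrete, the sketch below samples start states from the mixed distribution µ′ = β d^{π⋆_ω}_µ + (1 − β)µ, using the standard fact that rolling in a policy from µ and terminating with probability 1 − γ at each step yields a state distributed according to the discounted visitation distribution. The `ChainEnv` environment and all function names here are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

class ChainEnv:
    """Toy 5-state chain (illustrative): action 1 moves right, action 0 stays."""
    def __init__(self, n=5):
        self.n = n
        self.s = 0
    def reset(self):
        self.s = 0          # mu is a point mass on state 0
        return self.s
    def step(self, a):
        self.s = min(self.s + a, self.n - 1)
        return self.s, 0.0, False, {}

def sample_mixed_initial_state(env, rollin_policy, beta, gamma, rng):
    """Sample s0 ~ mu' = beta * d^{pi*_omega}_mu + (1 - beta) * mu.

    With probability 1 - beta, keep the environment's own start state (s ~ mu).
    With probability beta, roll in the nearby context's optimal policy,
    terminating each step with probability 1 - gamma, so the stopping state is
    distributed according to the discounted visitation d^{pi*_omega}_mu.
    """
    s = env.reset()                          # s ~ mu
    if rng.random() >= beta:
        return s                             # no roll-in this episode
    while rng.random() < gamma:              # continue rolling in w.p. gamma
        s, _, _, _ = env.step(rollin_policy(s))
    return s

rng = np.random.default_rng(0)
env = ChainEnv()
always_right = lambda s: 1                   # stand-in for pi*_omega on this chain
starts = [sample_mixed_initial_state(env, always_right, beta=0.9, gamma=0.9, rng=rng)
          for _ in range(1000)]
```

With β = 0.9, most episodes begin from states along the roll-in policy's trajectory rather than from µ, which is precisely what shortens exploration when learning the nearby context ω′.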
More specifically, we adopt the contextual MDP formulation, where each MDP M_ω is uniquely defined by a context ω in the context space W ⊂ ℝⁿ, and we are given a curriculum {ω_k}_{k=0}^K, with the last MDP M_{ω_K} being the MDP of interest. To show our main results, we only require a Lipschitz continuity assumption on r_ω w.r.t. ω and some mild regularity conditions on the curriculum {ω_k}_{k=0}^K. We show that learning π⋆_{ω_K} by recursively rolling in with a near-optimal policy for ω_k to construct the initial distribution µ_{k+1} for the next context ω_{k+1} is provably more efficient than learning π⋆_{ω_K} from scratch. In particular, we show that when an appropriate sequence of contexts is selected, we can reduce the iteration and sample complexity bounds of entropy-regularized softmax policy gradient (with an inexact stochastic estimation of the gradient) from an exponential dependency on the size of the state space, as suggested by Ding et al. (2021), to a polynomial dependency. We also prescribe a practical implementation of ROLLIN. In summary, our contributions are as follows. First, we provide a theoretical method (ROLLIN) that facilitates single-task policy learning by recasting it as a multi-task problem under entropy-regularized softmax policy gradient (PG), reducing the exponential complexity bound of entropy-regularized PG to a polynomial dependency on S. Second, we provide a deep RL implementation of ROLLIN and demonstrate that adding ROLLIN improves performance on simulated goal-reaching tasks with an oracle curriculum and with a non-oracle curriculum learned by MEGA (Pitis et al., 2020), as well as on several standard MuJoCo locomotion tasks inspired by meta-RL (Clavera et al., 2018).
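The recursive procedure just described can be summarized in a few lines of control flow. In the sketch below, `train_pg` and `make_mu` are hypothetical placeholders for the paper's subroutines (an entropy-regularized softmax PG solver, and the construction of the roll-in distribution µ_{k+1}); they are shown only to make the recursion explicit.

```python
def rollin_curriculum(contexts, init_params, train_pg, make_mu, beta):
    """Sketch of the ROLLIN curriculum loop (all callables are illustrative).

    contexts    : the curriculum omega_0, ..., omega_K, hardest context last
    init_params : initial policy parameters for omega_0
    train_pg    : runs entropy-regularized softmax PG for one context, from a
                  given initial-state distribution and warm-start parameters,
                  returning near-optimal parameters for that context
    make_mu     : builds mu_{k+1} = beta * d^{pi_k}_mu + (1 - beta) * mu from
                  the parameters just learned
    """
    params, mu = init_params, None       # None: use the original mu for omega_0
    for omega in contexts:
        params = train_pg(omega, mu, params)   # warm-start from the previous task
        mu = make_mu(params, beta)             # roll-in distribution for the next task
    return params                              # near-optimal for omega_K

# Dummy callables, just to trace the control flow:
log = []
final = rollin_curriculum(
    contexts=[0.1, 0.5, 1.0],
    init_params=0.0,
    train_pg=lambda omega, mu, p: (log.append((omega, mu)), p + omega)[1],
    make_mu=lambda p, beta: ("mixed", p, beta),
    beta=0.9,
)
```

The loop makes the two ingredients of the analysis visible: each context is solved from a warm start (previous parameters) and from a roll-in initial distribution built out of the previous near-optimal policy.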

2. RELATED WORK

Convergence of policy gradient methods. Theoretical analysis of policy gradient methods has a long history (Williams, 1992; Sutton et al., 1999; Konda and Tsitsiklis, 1999; Kakade and Langford, 2002; Peters and Schaal, 2008). Motivated by the recent empirical success of policy gradient (PG) methods (Schulman et al., 2015; 2017), the theory community has extensively studied the convergence of PG in various settings (Fazel et al., 2018; Agarwal et al., 2021; 2020; Bhandari and Russo, 2019; Mei et al., 2020; Zhang et al., 2020b; Agarwal et al., 2020; Zhang et al., 2020a; Li et al., 2021; Cen et al., 2021; Ding et al., 2021; Yuan et al., 2022; Moskovitz et al., 2022). Agarwal et al. (2021) established the asymptotic global convergence of policy gradient under different policy parameterizations. We extend the result on entropy-regularized PG with stochastic gradients (Ding et al., 2021) to the contextual MDP setting. In particular, our contextual MDP setting reduces the exponential state space dependency of both the iteration number and the per-iteration sample complexity suggested by Ding et al. (2021) to a polynomial dependency. We should also clarify that there are many existing convergence analyses of other variants of PG whose iteration complexity does not suffer from an exponential state space dependency (Agarwal et al., 2021; Mei et al., 2020), but they assume access to the exact gradient during each PG update, while we assume a stochastic estimation of the gradient, which is arguably more practical.

Contextual MDPs. Contextual MDPs (or MDPs with side information) have been studied extensively in the theoretical RL literature (Abbasi-Yadkori and Neu, 2014; Hallak et al., 2015; Dann et al.,



Figure 1: Illustration of ROLLIN. The red circle represents the initial state distribution. The dark curve represents the optimal policy w.r.t. ω. The blue diamonds represent the optimal state distributions d^{π⋆_ω}_µ and d^{π⋆_{ω′}}_µ.

