SOLVING COMPOSITIONAL REINFORCEMENT LEARNING PROBLEMS VIA TASK REDUCTION

Abstract

We propose a novel learning paradigm, Self-Imitation via Reduction (SIR), for solving compositional reinforcement learning problems. SIR is based on two core ideas: task reduction and self-imitation. Task reduction tackles a hard-to-solve task by actively reducing it to an easier task whose solution is already known to the RL agent. Once the original hard task is successfully solved by task reduction, the agent naturally obtains a self-generated solution trajectory to imitate. By continuously collecting and imitating such demonstrations, the agent is able to progressively expand the solved subspace within the entire task space. Experimental results show that SIR can significantly accelerate and improve learning on a variety of challenging sparse-reward continuous-control problems with compositional structures. Code and videos are available at https://sites.google.

1. INTRODUCTION

A large part of everyday human activity involves sequential tasks with compositional structure, i.e., complex tasks that are built up in a systematic way from simpler tasks (Singh, 1991). Even in very simple scenarios, such as lifting an object, opening a door, sitting down, or driving a car, we typically decompose the activity into multiple steps in our mind and take multiple physical actions to finish each step. Building an intelligent agent that can solve a wide range of such compositional decision-making problems remains a long-standing challenge in AI. Deep reinforcement learning (RL) has recently shown promising capabilities for solving complex decision-making problems, but most approaches to these compositional challenges either rely on carefully designed reward functions (Warde-Farley et al., 2019; Yu et al., 2019; Li et al., 2020; OpenAI et al., 2019), which are typically subtle to derive and require strong domain knowledge, or utilize a hierarchical policy structure (Kulkarni et al., 2016; Bacon et al., 2017; Nachum et al., 2018; Lynch et al., 2019; Bagaria & Konidaris, 2020), which typically assumes a set of low-level policies for skills and a high-level policy for choosing the next skill to use. Although the policy hierarchy introduces structural inductive bias into RL, effective low-level skills can be non-trivial to obtain, and the bi-level policy structure often causes additional optimization difficulties in practice.

In this paper, we propose a novel RL paradigm, Self-Imitation via Reduction (SIR), which (1) naturally works on sparse-reward problems with compositional structures, and (2) does not impose any structural requirement on the policy representation. SIR has two critical components: task reduction and imitation learning. Task reduction leverages the compositionality of a parameterized task space and tackles an unsolved hard task by actively "simplifying" it to an easier one whose solution is already known.
When a hard task is successfully accomplished via task reduction, we can further run imitation learning on the obtained solution trajectories to significantly accelerate training.

Fig. 1 illustrates task reduction in a pushing scenario with an elongated box (brown) and a cubic box (blue). Consider pushing the cubic box to the goal position (red). The difficulty comes from the wall on the table, which contains a small door in the middle. When the door is clear (Fig. 1a), the task is easy and straightforward to solve. However, when the door is blocked by the elongated box (Fig. 1b), the task becomes significantly harder: the robot needs to clear the door before it can start pushing the cubic box. Such a compositional strategy is non-trivial to discover by directly running standard RL. By contrast, this solution is much easier to derive via task reduction: starting from an initial state with the door blocked (Fig. 1b), the agent can first imagine a simplified task that has the same goal but starts from a different initial state with a clear door (Fig. 1a); then the agent can convert to this simplified task by physically perturbing the elongated box to reach the desired initial state (Fig. 1c); finally, a straightforward push completes the solution to the original hard task (Fig. 1b). Notably, by running imitation learning on the composite reduction trajectories, we effectively incorporate the inductive bias of compositionality into the learned policy without the need to explicitly specify any low-level skills or options, which significantly simplifies policy optimization.
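The control flow above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `rollout_to` and `find_easier_init` are hypothetical helpers standing in for goal-conditioned policy execution and the search for a reduction target, and the toy test environment below is purely for demonstration.

```python
def solve_with_reduction(rollout_to, find_easier_init, state, goal, demos):
    """Illustrative SIR control flow (helper names are hypothetical).

    rollout_to(state, target) -> (trajectory, end_state, success)
        executes the goal-conditioned policy toward `target`.
    find_easier_init(state, goal) -> an easier initial state for the same goal.
    Successful trajectories are stored in `demos` for later imitation learning.
    """
    traj, end, ok = rollout_to(state, goal)              # try the task directly
    if not ok:
        easier_init = find_easier_init(state, goal)      # pick a reduction target
        traj_a, mid, ok_a = rollout_to(state, easier_init)  # perturb toward easier init
        traj_b, end, ok_b = rollout_to(mid, goal)           # solve the easier task
        traj, ok = traj_a + traj_b, ok_a and ok_b           # stitch composite trajectory
    if ok:
        demos.append(traj)   # self-generated demonstration to imitate later
    return ok
```

In a toy 1D world where the policy only succeeds over short distances, a hard task (distance 8) fails directly but succeeds once reduced through a midpoint, and the stitched trajectory lands in the demonstration buffer.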
Moreover, although task reduction performs only one-step planning from a planning perspective, SIR still retains the capability of learning an arbitrarily complex policy by alternating between imitation and reduction: as more tasks are solved, these learned tasks can recursively serve as new reduction targets for unsolved ones in the task space, while the policy gradually generalizes to increasingly harder tasks by imitating solution trajectories of growing complexity. We implement SIR by jointly learning a goal-conditioned policy and a universal value function (Schaul et al., 2015), so that task reduction can be accomplished via state search over the value function and policy execution with different goals. Empirical results show that SIR can significantly accelerate and improve policy learning on challenging sparse-reward continuous-control problems, including robot pushing, stacking, and maze navigation, with both object-based and visual state spaces.
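The state search over the value function can be illustrated as follows. This is a hedged sketch under simplifying assumptions: `value_fn(s, g)` stands for the learned universal value function (interpreted here as an estimated success probability), and the reduction target is chosen from a finite set of sampled candidate states by scoring how reachable each candidate is from the initial state and how solvable the original goal is from it. The actual search procedure in a continuous state space would be more involved.

```python
def task_reduction_target(value_fn, init_state, goal, candidates):
    """Pick a reduction target s' maximizing the estimated probability of
    (1) reaching s' from init_state, and (2) solving `goal` from s'.

    value_fn(s, g): learned universal value function (assumed in [0, 1]).
    candidates: sampled candidate intermediate states.
    Returns the best candidate and its composite score.
    """
    scores = [value_fn(init_state, sp) * value_fn(sp, goal) for sp in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

With a toy distance-based value function on a 1D state space, the search prefers a candidate that leaves a short remaining distance to the goal, since the composite score multiplies the two stage values.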

2. RELATED WORK

Hierarchical RL (HRL) (Barto & Mahadevan, 2003) is perhaps the most popular approach for solving compositional RL problems; it assumes a low-level policy for primitive skills and a high-level policy for sequentially proposing subgoals. Such a bi-level policy structure reduces the planning horizon for each policy but raises optimization challenges. For example, many HRL works require pretraining low-level skills (Singh, 1991; Kulkarni et al., 2016; Florensa et al., 2017; Riedmiller et al., 2018; Haarnoja et al., 2018a) or non-trivial algorithmic modifications to learn both modules jointly (Dayan & Hinton, 1993; Silver & Ciosek, 2012; Bacon et al., 2017; Vezhnevets et al., 2017; Nachum et al., 2018). In contrast, our paradigm works with any policy architecture of sufficient representational power by imitating composite trajectories. Notably, the idea of exploiting regularities in behavior instead of constraints on policy structure can be traced back to the "Hierarchies of Machines" model of 1998 (Parr & Russell, 1998). Our approach is also related to the options framework (Sutton et al., 1999; Precup, 2000), which explicitly decomposes a complex task into a sequence of temporal abstractions, i.e., options. The options framework likewise assumes a hierarchical representation with an inter-option controller and a discrete set of reusable intra-option policies, which are typically non-trivial to obtain (Konidaris & Barto, 2009; Konidaris et al., 2012; Bagaria & Konidaris, 2020; Wulfmeier et al., 2020). SIR implicitly reuses previously learned "skills" via task reduction and distills all the composite strategies into a single policy. Besides, there are other works tackling different problems with conceptually similar



(a) An easy task, which can be completed by a straightforward push of the blue cubic box. (b) A hard task, where a straightforward push fails since the elongated box blocks the door. (c) Task reduction: solve a hard task by reducing it to an easier one.

Figure 1: An example use case of task reduction, the key technique of our approach. The task is to push the blue box to the red position.

