MULTI-TASK OPTION LEARNING AND DISCOVERY FOR STOCHASTIC PATH PLANNING

Abstract

This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon stochastic path planning problems. Starting with a vanilla RL formulation with a stochastic dynamics simulator and an occupancy matrix of the environment, our approach computes useful options with policies as well as high-level paths that compose the discovered options. Our main contributions are (1) data-driven methods for creating abstract states that serve as end points for helpful options, (2) methods for computing option policies using auto-generated option guides in the form of dense pseudo-reward functions, and (3) an overarching algorithm for composing the computed options. We show that this approach yields strong guarantees of executability and solvability: under fairly general conditions, the computed option guides lead to composable option policies and consequently ensure downward refinability. Empirical evaluation on a range of robots, environments, and tasks shows that this approach effectively transfers knowledge across related tasks and that it outperforms existing approaches by a significant margin.

1. INTRODUCTION

Autonomous robots must compute long-horizon motion plans (or path plans) to accomplish their tasks. Robots use controllers to execute these motion plans by reaching each point in the motion plan. However, the physical dynamics can be noisy, and controllers are not always able to achieve precise trajectory targets. This prevents robots from deterministically reaching a goal while executing the computed motion plan and increases the complexity of the motion planning problem. Several approaches (Schaul et al., 2015; Pong et al., 2018) have used reinforcement learning (RL) to solve multi-goal stochastic path planning problems by learning goal-conditioned reactive policies. However, these approaches work only for short-horizon problems (Eysenbach et al., 2019). On the other hand, multiple approaches have been designed for handling stochasticity in motion planning (Alterovitz et al., 2007; Sun et al., 2016), but they require discrete actions for the robot. This paper addresses the following question: can we develop effective approaches that efficiently compute plans for long-horizon continuous stochastic path planning problems? In this paper, we show that we can develop such an approach by learning abstract states and then learning options that serve as actions between these abstract states.

Abstractions play an important role in long-horizon planning. Temporally abstracted high-level actions reduce the horizon of the problem and thereby the complexity of the overall decision-making problem. E.g., a task of reaching a location in a building can be solved using abstract actions such as "go from room A to corridor A", "reach the elevator from corridor A", etc., if one can automatically identify these regions of saliency. Each of these actions is a temporally abstracted action. Not only do these actions reduce the complexity of the problem, but they also allow the transfer of knowledge across multiple tasks.
E.g., if we learn how to reach room B from room A for one task, we can reuse the same solution whenever this abstract action is required to solve some other task. Reinforcement learning allows learning policies that account for the stochasticity of the environment. Recent work (Lyu et al., 2019; Yang et al., 2018; Kokel et al., 2021) has shown that combining RL with abstractions and symbolic planning enables robots to solve long-horizon problems that require complex reasoning. However, these approaches require hand-coded abstractions. In this paper, we show that automatically learned abstractions (Shah & Srivastava, 2022) can be efficiently combined with deep reinforcement learning approaches.

The main contributions of this paper are: (1) a formal foundation for constructing a library of two different types of options that are task-independent and transferable, (2) a novel approach for auto-generating dense pseudo-reward functions in the form of option guides that can be used to learn policies for the synthesized options, and (3) an overall hierarchical algorithm that combines these automatically synthesized abstract actions with reinforcement learning and uses them for multi-task long-horizon continuous stochastic path planning problems. We also show that these options are composable and can be used as abstract actions with high-level search algorithms. Our formal approach provides theoretical guarantees about the composability of the options and their executability using an option guide.

We present an extensive evaluation of our approach using two separate sets of automatically synthesized options in a total of 14 settings to answer the following critical questions: (1) Does this approach learn useful high-level planning representations? (2) Do these learned representations support transferring learning to new tasks?

The rest of the paper is organized as follows: Sec. 2 discusses some of the existing approaches that are closely related to our approach; Sec. 3 introduces a few existing ideas used by our approach; Sec. 4 presents our algorithm; Sec. 5 presents an extensive empirical evaluation of our approach.
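To make the two central constructs concrete, the sketch below illustrates (a) an option as a temporally abstracted action between two abstract states, with initiation and termination conditions and a low-level policy, and (b) a dense pseudo-reward in the spirit of an option guide, together with a high-level search that composes options into a plan. This is an illustrative sketch only: the class fields, the distance-based guide, and the BFS composition are our assumptions for exposition, not the paper's exact definitions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional
import numpy as np

State = np.ndarray  # a robot configuration, e.g. an (x, y) position


@dataclass
class Option:
    """A temporally abstracted action between two abstract states."""
    source: str                            # abstract source region, e.g. "room_A"
    target: str                            # abstract target region, e.g. "corridor_A"
    initiation: Callable[[State], bool]    # may the option start in this state?
    termination: Callable[[State], bool]   # has the option finished?
    policy: Callable[[State], np.ndarray]  # low-level goal-conditioned controller


def make_option_guide(target_center: State) -> Callable[[State, State], float]:
    """Dense pseudo-reward: progress toward the option's target region.

    A stand-in for an auto-generated option guide (hypothetical form):
    reward equals the reduction in distance to the target abstract state.
    """
    def guide(state: State, next_state: State) -> float:
        return (np.linalg.norm(state - target_center)
                - np.linalg.norm(next_state - target_center))
    return guide


def high_level_plan(options: List[Option], start: str, goal: str) -> Optional[List[Option]]:
    """Compose options as edges of an abstract graph via a simple BFS."""
    frontier, visited = [(start, [])], {start}
    while frontier:
        region, path = frontier.pop(0)
        if region == goal:
            return path
        for opt in options:
            if opt.source == region and opt.target not in visited:
                visited.add(opt.target)
                frontier.append((opt.target, path + [opt]))
    return None  # goal region unreachable with the current option library
```

In this reading, each option's policy would be trained with RL against its guide's dense reward, and any graph search over the abstract states then yields a high-level path whose options are executed in sequence.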

2. RELATED WORK

To the best of our knowledge, this is the first approach that uses a data-driven method for synthesizing transferable and composable options and leverages these options with a hierarchical algorithm to compute solutions for stochastic path planning problems. It builds upon the concepts of abstraction, stochastic motion planning, option discovery, and hierarchical reinforcement learning, and combines reinforcement learning with planning. Here, we discuss related work from each of these areas.

Motion planning is a well-researched area. Numerous approaches (Kavraki et al., 1996; LaValle, 1998; Kuffner & LaValle, 2000; Pivtoraiko et al., 2009; Saxena et al., 2022) have been developed for motion planning in deterministic environments. Kavraki et al. (1996); LaValle (1998); Kuffner & LaValle (2000) develop sampling-based techniques that randomly sample configurations in the environment and connect them to compute a motion plan between the initial and goal configurations. Holte et al. (1996); Pivtoraiko et al. (2009); Saxena et al. (2022) discretize the configuration space and use search techniques such as A* search to compute motion plans in the discrete space.

Multiple approaches (Du et al., 2010; Kurniawati et al., 2012; Vitus et al., 2012; Berg et al., 2017; Hibbard et al., 2022) have been developed for performing motion planning with stochastic dynamics. Alterovitz et al. (2007) build a weighted graph called a stochastic motion roadmap (SMR), inspired by probabilistic roadmaps (PRMs) (Kavraki et al., 1996), where the weights capture the probability of the robot making the corresponding transition. Sun et al. (2016) use a linear quadratic regulator -- a linear controller that does not explicitly avoid collisions -- along with value iteration to compute a trajectory that maximizes the expected reward. However, these approaches require an analytical model of the transition probabilities of the robot's dynamics. Tamar et al. (2016) develop a fully differentiable neural module that approximates value iteration and can be used to compute solutions for stochastic path planning problems. However, these approaches (Alterovitz et al., 2007; Sun et al., 2016; Tamar et al., 2016) require discretized actions. Du et al. (2010); Van Den Berg et al. (2012) formulate a stochastic motion planning problem as a POMDP to capture the uncertainty in robot sensing and movements. Multiple approaches (Jurgenson & Tamar, 2019; Eysenbach et al., 2019; Jurgenson et al., 2020) design end-to-end reinforcement learning approaches for solving stochastic motion planning problems. These approaches only learn policies to solve one path planning problem at a time and do not transfer knowledge across multiple problems. In contrast, our approach does not require discrete actions and learns options that are transferable to different problems.

Several approaches have considered the problem of learning task-specific subgoals. Kulkarni et al. (2016); Bacon et al. (2017); Nachum et al. (2018; 2019); Czechowski et al. (2021) use intrinsic reward functions to learn a two-level hierarchical policy. The high-level policy predicts a subgoal that the low-level goal-conditioned policy should achieve. The high-level and low-level policies are then trained simultaneously using simulations in the environment. Paul et al. (2019) combine imitation learning with reinforcement learning for identifying subgoals from expert trajectories and

