MULTI-TASK OPTION LEARNING AND DISCOVERY FOR STOCHASTIC PATH PLANNING

Abstract

This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon stochastic path planning problems. Starting with a vanilla RL formulation with a stochastic dynamics simulator and an occupancy matrix of the environment, our approach computes useful options with policies as well as high-level paths that compose the discovered options. Our main contributions are (1) data-driven methods for creating abstract states that serve as end points for helpful options, (2) methods for computing option policies using auto-generated option guides in the form of dense pseudo-reward functions, and (3) an overarching algorithm for composing the computed options. We show that this approach yields strong guarantees of executability and solvability: under fairly general conditions, the computed option guides lead to composable option policies and consequently ensure downward refinability. Empirical evaluation on a range of robots, environments, and tasks shows that this approach effectively transfers knowledge across related tasks and that it outperforms existing approaches by a significant margin.

1. INTRODUCTION

Autonomous robots must compute long-horizon motion plans (or path plans) to accomplish their tasks. Robots use controllers to execute these motion plans by reaching each point in the motion plan. However, the physical dynamics can be noisy, and controllers are not always able to achieve precise trajectory targets. This prevents robots from deterministically reaching a goal while executing the computed motion plan and increases the complexity of the motion planning problem. Several approaches (Schaul et al., 2015; Pong et al., 2018) have used reinforcement learning (RL) to solve multi-goal stochastic path planning problems by learning goal-conditioned reactive policies. However, these approaches work only for short-horizon problems (Eysenbach et al., 2019). On the other hand, multiple approaches have been designed for handling stochasticity in motion planning (Alterovitz et al., 2007; Sun et al., 2016), but they require discrete actions for the robot. This paper addresses the following question: Can we develop effective approaches that can efficiently compute plans for long-horizon continuous stochastic path planning problems? In this paper, we show that we can develop such an approach by learning abstract states and then learning options that serve as actions between these abstract states.

Abstractions play an important role in long-horizon planning. Temporally abstracted high-level actions reduce the horizon of the problem and thereby the complexity of the overall decision-making problem. E.g., a task of reaching a location in a building can be solved using abstract actions such as "go from room A to corridor A", "reach the elevator from corridor A", etc., if one can automatically identify these regions of saliency. Each of these actions is a temporally abstracted action. Not only do these actions reduce the complexity of the problem, but they also allow the transfer of knowledge across multiple tasks.
E.g., if we learn how to reach room B from room A for one task, we can reuse the same solution whenever this abstract action is required to solve another task. Reinforcement learning allows learning policies that account for the stochasticity of the environment. Recent work (Lyu et al., 2019; Yang et al., 2018; Kokel et al., 2021) has shown that combining RL with abstractions and symbolic planning enables robots to solve long-horizon problems that require complex reasoning. However, these approaches require hand-coded abstractions. In this paper, we show that automatically learned abstractions (Shah & Srivastava, 2022) can be efficiently combined with deep reinforcement learning approaches.

The main contributions of this paper are: (1) a formal foundation for constructing a library of two different types of options that are task-independent and transferable, (2) a novel approach for auto-generating dense pseudo-reward functions in the form of option guides that can be used to learn policies for the synthesized options, and (3) an overall hierarchical algorithm that combines these automatically synthesized abstract actions with reinforcement learning and uses them for multi-task, long-horizon, continuous stochastic path planning problems. We also show that these options are composable and can be used as abstract actions with high-level search algorithms. Our formal approach provides theoretical guarantees about the composability of the options and their executability using an option guide. We present an extensive evaluation of our approach using two separate sets of automatically synthesized options in a total of 14 settings to answer two critical questions: (1) Does this approach learn useful high-level planning representations? (2) Do these learned representations support transfer of learning to new tasks?

The rest of the paper is organized as follows: Sec. 2 discusses existing approaches that are closely related to ours; Sec. 3 introduces a few existing ideas used by our approach; Sec. 4 presents our algorithm; and Sec. 5 presents an extensive empirical evaluation of our approach.

2. RELATED WORK

To the best of our knowledge, this is the first approach that uses a data-driven method for synthesizing transferable and composable options and leverages these options with a hierarchical algorithm to compute solutions for stochastic path planning problems. It builds upon the concepts of abstraction, stochastic motion planning, option discovery, and hierarchical reinforcement learning, and combines reinforcement learning with planning. Here, we discuss related work from each of these areas.

Motion planning is a well-researched area. Numerous approaches (Kavraki et al., 1996; LaValle, 1998; Kuffner & LaValle, 2000; Pivtoraiko et al., 2009; Saxena et al., 2022) have been developed for motion planning in deterministic environments. Kavraki et al. (1996); LaValle (1998); Kuffner & LaValle (2000) develop sampling-based techniques that randomly sample configurations in the environment and connect them to compute a motion plan between the initial and goal configurations. Holte et al. (1996); Pivtoraiko et al. (2009); Saxena et al. (2022) discretize the configuration space and use search techniques such as A* search to compute motion plans in the discrete space. Multiple approaches (Du et al., 2010; Kurniawati et al., 2012; Vitus et al., 2012; Berg et al., 2017; Hibbard et al., 2022) have been developed for motion planning with stochastic dynamics. Alterovitz et al. (2007) build a weighted graph called the stochastic motion roadmap (SMR), inspired by probabilistic roadmaps (PRMs) (Kavraki et al., 1996), where the weights capture the probability of the robot making the corresponding transition. Sun et al. (2016) use a linear quadratic regulator (a linear controller that does not explicitly avoid collisions) along with value iteration to compute a trajectory that maximizes the expected reward. However, these approaches require an analytical model of the transition probabilities of the robot's dynamics. Tamar et al.
(2016) develop a fully differentiable neural module that approximates value iteration and can be used to compute solutions for stochastic path planning problems. However, these approaches (Alterovitz et al., 2007; Sun et al., 2016; Tamar et al., 2016) require discretized actions. Du et al. (2010); Van Den Berg et al. (2012) formulate a stochastic motion planning problem as a POMDP to capture the uncertainty in robot sensing and movement. Multiple approaches (Jurgenson & Tamar, 2019; Eysenbach et al., 2019; Jurgenson et al., 2020) design end-to-end reinforcement learning methods for solving stochastic motion planning problems. These approaches only learn policies that solve one path planning problem at a time and do not transfer knowledge across multiple problems. In contrast, our approach does not require discrete actions and learns options that are transferable to different problems.

Several approaches have considered the problem of learning task-specific subgoals. Kulkarni et al. (2016); Bacon et al. (2017); Nachum et al. (2018; 2019); Czechowski et al. (2021) use intrinsic reward functions to learn a two-level hierarchical policy: the high-level policy predicts a subgoal that the low-level goal-conditioned policy should achieve, and the two policies are trained simultaneously using simulations in the environment. Paul et al. (2019) combine imitation learning with reinforcement learning to identify subgoals from expert trajectories and bootstrap reinforcement learning. Levy et al. (2019) learn a multi-level policy where each level learns subgoals for the policy at the lower level using Hindsight Experience Replay (HER) (Andrychowicz et al., 2017); this targets control problems in deterministic settings rather than long-horizon motion planning problems. Kim et al. (2021) randomly sample subgoals in the environment, use a path planning algorithm to select the closest subgoal, and learn a policy that achieves it.
Numerous approaches (Stolle & Precup, 2002; Şimşek et al., 2005; Brunskill & Li, 2014; Kurutach et al., 2018; Eysenbach et al., 2019; Bagaria & Konidaris, 2020; Bagaria et al., 2021) perform hierarchical learning by identifying task-specific options through experience collected in the test environment and then use these options (Sutton et al., 1999) along with low-level primitive actions. Stolle & Precup (2002); Şimşek et al. (2005) lay the foundation for discovering options in discrete settings. They collect trajectories in the environment and use them to identify high-frequency states. These states are used as termination sets of the options, and initiation sets are derived by selecting states that lead to these high-frequency states. Once options are identified, they use Q-learning to learn policies for these options independently to formulate semi-MDPs (Sutton et al., 1999). Bagaria & Konidaris (2020) learn options in reverse: they compute trajectories in the environment that reach the goal state and use the last K points of each trajectory to define an option, with these points forming the initiation set and the goal state serving as the termination set. They continue to partition the rest of the collected trajectories similarly and generate a fixed number (a hyperparameter) of options. Approaches for combining symbolic planning with reinforcement learning (Silver & Ciosek, 2012; Yang et al., 2018; Jinnai et al., 2019; Lyu et al., 2019; Kokel et al., 2021) use pre-defined abstract models. In contrast, our approach learns such options (including initiation and termination sets) as well as their policies, and uses them to compute solutions for stochastic path planning problems with continuous state and action spaces. Next, we discuss existing concepts required to understand our approach.

3. BACKGROUND

Motion planning Let X = X_free ∪ X_obs be the configuration space of a robot R and let O be the set of obstacles in a given environment. Given a collision function f : X → {0, 1}, X_free is the set of configurations x that are not in collision with any obstacle o ∈ O, i.e., f(x) = 0, and X_obs is the set of configurations in collision, i.e., f(x) = 1. Let x_i ∈ X_free be the initial configuration of the robot and x_g ∈ X_free be the goal configuration of the robot. The motion planning problem can be defined as:

Definition 1. A motion planning problem M is defined as a 4-tuple ⟨X, f, x_i, x_g⟩, where X = X_free ∪ X_obs is the configuration space, f is the collision function, x_i is an initial configuration, and x_g is a goal configuration.

A solution to a motion planning problem is a motion plan τ: a sequence of configurations ⟨x_0, . . . , x_n⟩ such that x_0 = x_i, x_n = x_g, and ∀x ∈ τ, f(x) = 0. Robots use controllers that accept successive configurations from the motion plan and generate controls that take the robot from one configuration to the next. In practice, the environment dynamics are noisy, which introduces stochasticity into the overall motion planning problem. This stochasticity can be handled by computing a motion policy π : X → X that takes the current configuration of the robot and generates the next waypoint for the robot.
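A motion plan's validity under Definition 1 can be checked mechanically. The sketch below is a minimal illustration, assuming a 2-D configuration space and a made-up collision function; it is not part of the planner itself.

```python
# Hypothetical illustration of Definition 1: checking that a candidate
# motion plan starts at x_i, ends at x_g, and stays in X_free. The obstacle
# layout and collision function f below are invented for this sketch.
from typing import Callable, Sequence, Tuple

Config = Tuple[float, float]  # a 2-D configuration (x, y)

def is_valid_motion_plan(tau: Sequence[Config],
                         f: Callable[[Config], int],
                         x_i: Config, x_g: Config) -> bool:
    """A motion plan must start at x_i, end at x_g, and be collision-free."""
    if not tau or tau[0] != x_i or tau[-1] != x_g:
        return False
    return all(f(x) == 0 for x in tau)  # f(x) = 0 means x is in X_free

# Toy collision function: a single square obstacle [3,5] x [3,5].
def f(x: Config) -> int:
    return 1 if (3 <= x[0] <= 5 and 3 <= x[1] <= 5) else 0

plan = [(0.0, 0.0), (2.0, 0.0), (6.0, 0.0), (6.0, 6.0)]
print(is_valid_motion_plan(plan, f, (0.0, 0.0), (6.0, 6.0)))  # True
```

A plan through the obstacle, e.g. one containing the waypoint (4, 4), would be rejected by the same check.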

Markov decision process

In this work, we define the stochastic path planning problem as a continuous stochastic shortest path (SSP) problem (Bertsekas & Tsitsiklis, 1991). A continuous stochastic shortest path problem is defined as a tuple ⟨X, A, T, C, s_0, G⟩ where X is a continuous state space (the configuration space of the robot), A is a set of continuous actions, T : X × A × X → [0, 1] is a transition function, C : X → R is a cost function, s_0 is an initial state, and G ⊆ X is a set of goal states. The discount factor is set to 1 in this work. A solution to an SSP is a policy π : X → A that maps states to actions that take the robot to the goal states while minimizing the cumulative cost. Dynamic programming methods such as value iteration (VI) or policy iteration (PI) can be used to compute such policies when every component of the MDP is known. When one or more SSP components are unknown, various reinforcement learning (RL) approaches (Watkins, 1989; Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) can be used to learn policies.

Options framework The options framework (Sutton et al., 1999) models temporally extended actions as options. An option o is characterized by an initiation set I_o ⊆ X of states from which it can be invoked, an option policy π_o, and a termination set β_o ⊆ X of states in which it terminates.

Abstraction Shah & Srivastava (2022) use region-based Voronoi diagrams (RBVDs) to define abstractions for motion planning problems. To construct effective abstractions, they use critical regions (Molina et al., 2020) to construct the RBVD. Intuitively, critical regions are defined as regions that have a high density of solutions for a given class of problems but low sampling probability under a uniform sampling distribution. Shah & Srivastava (2022) use critical regions to define RBVDs as follows:

Definition 2. Given a configuration space X, let d_c(x, ρ) denote the minimum distance between a configuration x ∈ X and a region ρ ⊆ X. Given a set of regions Λ and a robot R, a region-based Voronoi diagram Ψ is a partitioning of X such that for every Voronoi cell ψ_i ∈ Ψ there exists a region ρ_i ∈ Λ such that for all x ∈ ψ_i and for all ρ_j ≠ ρ_i, d_c(x, ρ_i) < d_c(x, ρ_j), and each ψ_i is connected.
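To make the SSP definition concrete, the following sketch runs tabular value iteration on a tiny, invented discretized SSP (a 4-state chain with a noisy "right" action). The paper's problems are continuous; this is purely illustrative of how VI computes expected cost-to-go under stochastic transitions.

```python
# Minimal value-iteration sketch for a toy *discretized* SSP. The 4-state
# chain, transition probabilities, and unit costs are invented for
# illustration; the paper's SSP has continuous states and actions.
states = [0, 1, 2, 3]          # state 3 is the goal
actions = ["right"]
# T[s][a] = list of (next_state, prob): the action succeeds w.p. 0.8,
# otherwise the robot stays put (stochastic dynamics).
T = {s: {"right": [(min(s + 1, 3), 0.8), (s, 0.2)]} for s in states}
cost = 1.0                     # C(s) = 1 for every non-goal state

V = {s: 0.0 for s in states}
for _ in range(200):           # Bellman backups until (approx.) convergence
    for s in states:
        if s == 3:
            continue           # goal state: absorbing, V = 0
        V[s] = min(cost + sum(p * V[s2] for s2, p in T[s][a])
                   for a in actions)

print(round(V[2], 3))  # 1.25 — expected steps-to-go from state 2 is 1/0.8
```

The fixed point V(2) = 1 + 0.2·V(2) gives 1.25, matching the intuition that a 0.8-reliable step takes 1/0.8 attempts on average.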
In this framework, abstract states are defined using a bijective function ℓ : Ψ → S that maps each Voronoi cell to an abstract state. The abstraction function α : X → S is defined such that α(x) = s if and only if there exists a Voronoi cell ψ ∈ Ψ such that x ∈ ψ and ℓ(ψ) = s. A configuration x ∈ X is said to be in the high-level abstract state s ∈ S (denoted by x ∈ s) if α(x) = s. In this work, we use RBVDs along with critical regions to construct abstractions and synthesize options.

Alg. 1 outlines our approach, the stochastic hierarchical abstraction-guided robot planner (SHARP), for computing a policy for the given stochastic path planning problem. Our approach requires a robot simulator to compute path plans. It takes the occupancy matrix of the environment along with the initial and goal configurations of the robot as input and produces a policy composed of options as output.
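The abstraction function α induced by an RBVD can be illustrated with a nearest-region assignment. The point-shaped "critical regions" below are an invented simplification for this sketch; real critical regions are sets of configurations, and distances are taken to the nearest point of each region.

```python
# Sketch of the abstraction function α (Definition 2): each configuration
# maps to the abstract state whose critical region is nearest. The three
# point-shaped regions are a toy simplification invented for this example.
import math

critical_regions = {"s1": (1.0, 1.0), "s2": (8.0, 1.0), "s3": (5.0, 9.0)}

def alpha(x):
    """α : X → S — abstract state of the critical region closest to x."""
    return min(critical_regions,
               key=lambda s: math.dist(x, critical_regions[s]))

print(alpha((2.0, 2.0)))  # s1
print(alpha((7.0, 2.0)))  # s2
```

Each Voronoi cell of this assignment corresponds to one abstract state, matching the bijection ℓ above.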

4. OUR APPROACH

Learning the critical regions The first step in our approach is to identify critical regions in the given environment. We use a deep neural network to identify these critical regions. We train this network (lines 1-3) with the kinematic model of the robot and a set of training environments E_train that does not include the test environments. We discuss this in detail in Sec. 4.1. Once the network is trained, the same network is used to identify critical regions (line 5) in all test environments for the same robot.

Synthesizing option endpoints Alg. 1 uses the identified critical regions to construct the RBVD (line 6) and generate a set of abstract states (line 7). We then use this RBVD to construct a collection of options and synthesize option endpoints for the given environment (line 7). Sec. 4.2 explains this step in detail. These options are synthesized only once per robot and environment; they are reused for every subsequent pair of initial and goal configurations. We refer to option endpoints as options for brevity.

Options as abstract actions Let O be the set of synthesized options. Our approach uses these options as high-level abstract actions: it treats the initiation sets of the options as the preconditions of the abstract actions and the termination sets as their effects. We identify the abstract initial and goal states for the robot's initial and goal configurations (line 8) and use A* search to compute a sequence of abstract actions that takes the robot from the initial abstract state to the goal abstract state (line 9). We use the Euclidean distance between the termination set of the corresponding option and the goal configuration as the heuristic for the A* search. The A* search also requires costs for the options, which are initially unknown; therefore, we use the Euclidean distance between the initiation and termination sets of an option as an approximation of its cost.

Computing policies for options Let p = ⟨o_1, . . . , o_n⟩ be the sequence of options that takes the robot from its initial abstract state s_0 to its goal abstract state s_g. Alg. 1 then computes policies for each option in this sequence. It first computes a policy π_0 that takes the robot from its initial configuration x_i to some point in the initiation set of the first option o_1 (line 11). Alg. 1 then computes a policy for every option in p. First, we check whether a policy for an option was already computed by a previous call of Alg. 1 (line 14). If so, we reuse the same policy. Otherwise, our approach computes an option guide for the option using its endpoints and then uses this option guide to compute a policy for the option (lines 15-16). Sec. 4.3 presents this in detail. Lastly, Alg. 1 computes a policy that takes the robot from the termination set of the last option in p to the goal configuration x_g (line 19).

Updating the option costs To transfer the learned option policies efficiently, our approach needs to update the option costs that the A* search uses to compute the sequence of options. We update this cost (line 17) by collecting rollouts of the learned policy and using the average number of steps from the initiation set to the termination set as an approximation of the option's cost. Now, we explain each step of our approach in detail.

4.1. LEARNING CRITICAL REGIONS

Our approach first needs to identify critical regions (Sec. 3) to synthesize options in the given environment. Recall that critical regions are regions in the environment that have a high density of solutions for the given class of problems but, at the same time, are hard to sample under a uniform distribution over the configuration space. The training data is generated by solving randomly sampled motion planning problems. The input to the network is an occupancy matrix of the environment that represents the free space and obstacles in the given environment. The labels represent critical regions in the given environment. These critical region predictors are environment-independent, need to be trained only once per kinematic characteristics of a robot, and are also generalizable across robots to a large extent. The approach presented here directly uses the open-source critical region predictors made available by Shah & Srivastava (2022). E.g., the non-holonomic robots used to evaluate our approach (details in Sec. 5) are different from those used by Shah & Srivastava (2022), but we used the critical region predictor they developed for a rectangular holonomic robot. Shah & Srivastava (2022) use 20 training environments (E_train) to generate the training data. For each training environment e_train ∈ E_train, they randomly sample 100 goal configurations. They then randomly sample 50 initial configurations for each goal configuration and compute motion plans for them using an off-the-shelf motion planner and a kinematic model of the robot. They use UNet (Ronneberger et al., 2015) with the Tanh activation function for intermediate layers and the Sigmoid activation for the last layer, with a weighted logarithmic loss as the loss function. Lastly, they use the ADAM optimizer (Kingma & Ba, 2014) with learning rate 10⁻⁴ to minimize the loss function for 50,000 epochs.
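The notion of a critical region suggests a simple labeling scheme for the training data: mark cells that appear in a large fraction of solution motion plans. The sketch below uses toy trajectories and an invented 0.5 threshold; it is only a rough illustration of the idea, and Shah & Srivastava (2022) describe the actual training-data pipeline.

```python
# Rough sketch of deriving critical-region labels from solved motion plans:
# cells traversed by a large fraction of solution trajectories are marked
# critical. Toy trajectories and the 0.5 threshold are invented here.
trajectories = [
    [(1, c) for c in range(10)],          # a plan hugging one wall
    [(5, 5), (5, 6)],                     # three plans squeezing through
    [(5, 5), (6, 5)],                     # the same "doorway" cell (5, 5)
    [(4, 5), (5, 5)],
]

visits = {}
for tau in trajectories:
    for cell in set(tau):                 # count each cell once per plan
        visits[cell] = visits.get(cell, 0) + 1

# Label a cell critical if at least half of the solution plans pass through.
critical = {cell for cell, v in visits.items()
            if v / len(trajectories) >= 0.5}
print(sorted(critical))  # [(5, 5)]
```

The doorway-like cell visited by most plans is flagged, while cells covered by only one plan are not, which matches the intuition of high solution density with low uniform-sampling probability.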

4.2. SYNTHESIZING OPTION ENDPOINTS

The central idea of this paper is to synthesize transferable options using the set of critical regions Λ and the RBVD Ψ. Our approach uses these critical regions and the RBVD to identify option endpoints, i.e., initiation and termination sets. In this work, we define two types of options: (1) options between the centroids of two critical regions, called centroid options (O_c), and (2) options between the interfaces of two pairs of abstract states, called interface options (O_i). We now define each of these options.

Centroid options Centroid options are defined using the centroids of two neighboring critical regions. Intuitively, these options define abstract actions that transition between a pair of critical regions. Formally, let Λ be the set of critical regions, let Ψ be the RBVD constructed using Λ, and let S be the set of abstract states defined using Ψ.

Definition 3. Let s_i ∈ S be an abstract state in the RBVD Ψ with the corresponding critical region r_i ∈ Λ. Let d be the Euclidean distance measure, let t be a threshold distance, and let c_i be the centroid of the critical region r_i. The centroid region ĉ_i of the critical region r_i with centroid c_i is defined as the set of configurations {x | x ∈ s_i ∧ d(x, c_i) < t}.

We use this definition to define the endpoints of centroid options as follows:

Definition 4. Let s_i, s_j ∈ S be a pair of neighboring states in an RBVD Ψ constructed using the set of critical regions Λ. Let r_i, r_j ∈ Λ be the critical regions for the abstract states s_i and s_j, and let ĉ_i and ĉ_j be their centroid regions. The endpoints of the centroid option o_ij are defined as the pair ⟨I_ij, β_ij⟩ such that I_ij = ĉ_i and β_ij = ĉ_j.

Interface options Interface options are defined using the interfaces of high-level states. Intuitively, these options define high-level, temporally abstracted actions between the interfaces of pairs of abstract states and take the robot across the high-level states.
To construct interface options, we first need to identify interface regions between pairs of neighboring abstract states. Let S be the set of abstract states defined using the RBVD Ψ. We define an interface region as follows:

Definition 5. Let s_i, s_j ∈ S be a pair of neighboring states and let r_i and r_j be their corresponding critical regions. Let d_c(x, ρ) denote the minimum Euclidean distance between a configuration x ∈ X and some point in a region ρ ⊂ X. Let p be a configuration such that d_c(p, r_i) = d_c(p, r_j), i.e., p is on the border of the Voronoi cells that define s_i and s_j. Given the Euclidean distance measure d and a threshold distance t, the interface region σ̂_ij for a pair of neighboring states (s_i, s_j) is defined as the set {x | (x ∈ s_i ∨ x ∈ s_j) ∧ d(x, p) < t}.

We use this definition of interface regions to define the endpoints of interface options as follows:

Definition 6. Let s_i, s_j, s_k ∈ S be abstract states in the RBVD Ψ such that s_i and s_j are neighbors and s_j and s_k are neighbors. Let σ̂_ij and σ̂_jk be the interface regions for the pairs of high-level states (s_i, s_j) and (s_j, s_k). The endpoints of the interface option o_ijk are defined as the pair ⟨I_ijk, β_ijk⟩ such that I_ijk = σ̂_ij and β_ijk = σ̂_jk.

We construct a collection of options for the given environment using these definitions of endpoints for centroid and interface options. Let V : S × S → {0, 1} be a neighborhood function such that V(s_i, s_j) = 1 iff s_i, s_j ∈ S are neighbors. We define the set of centroid options as O_c = {o_ij | ∀s_i, s_j : V(s_i, s_j) = 1} with endpoints computed as in Def. 4. Similarly, we define the set of interface options as O_i = {o_ijk | ∀s_i, s_j, s_k : V(s_i, s_j) = 1 ∧ V(s_j, s_k) = 1} with endpoints computed as in Def. 6. Our approach uses this collection of options to compute a composed policy.

Let s_1, . . . , s_n be a sequence of distinct adjacent states such that V(s_i, s_{i+1}) = 1, and let O = {o_1, . . . , o_n} be the set of centroid or interface options for this sequence of adjacent states. The composed policy Π_O is a finite-state automaton, as shown in Fig. 1. Recall that for a sequence of options o_1, . . . , o_n, Alg. 1 computes π_0 and π_{n+1} as special cases. For a controller state q_i, Π(x) = π_i(x), where π_i is the policy for option o_i ∈ O. The controller makes a transition q_i → q_{i+1} when the robot reaches a configuration x ∈ I_{o_{i+1}}.

Now, we prove the existence of a composed policy for our approach. Recall that for a configuration space X and a set of critical regions Λ, the RBVD Ψ induces a set of abstract states S and an abstraction function α.

Theorem 4.1. Let x_1 and x_n be the initial and goal configurations of the robot such that s_1 = α(x_1) and s_n = α(x_n). Let V : S × S → {0, 1} be the neighborhood function defined above, and let O be the set of centroid or interface options. If there exists a sequence of distinct abstract states s_1, . . . , s_n such that V(s_i, s_{i+1}) = 1, then there exists a composed policy Π such that the resultant configuration after the termination of every option in Π is the goal configuration x_n.

Proof. (Sketch) The proof derives directly from the definitions of the endpoints of centroid and interface options. Given a sequence of adjacent abstract states s_1, . . . , s_n, Defs. 4 and 6 ensure a sequence of options o_1, . . . , o_n such that β_i = I_{i+1}. This implies that each option can be executed once the previous option has terminated. Given this sequence of options o_1, . . . , o_n, by the definition of the composed policy there exists a composed policy Π such that for every pair of consecutive options o_i, o_{i+1} ∈ Π, I_{o_{i+1}} = β_{o_i}. Thus, if every option in Π terminates, the resulting configuration is the goal configuration.
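The composed policy Π can be sketched as a small finite-state controller, mirroring Fig. 1: it runs the current option's policy and advances when the robot enters the next option's initiation set. The 1-D policies and initiation sets below are invented stand-ins for the learned options.

```python
# Sketch of the composed policy Π as a finite-state controller: execute the
# current option's policy and transition q_i -> q_{i+1} when the robot
# enters the next option's initiation set. Policies and initiation sets
# are 1-D toys invented for this illustration.
class ComposedPolicy:
    def __init__(self, policies, initiation_sets):
        self.policies = policies      # [pi_1, ..., pi_n]
        self.inits = initiation_sets  # [I_1, ..., I_n] as predicates
        self.i = 0                    # current controller state q_i

    def __call__(self, x):
        # Advance the controller while x is in the next initiation set.
        while self.i + 1 < len(self.policies) and self.inits[self.i + 1](x):
            self.i += 1
        return self.policies[self.i](x)

pi = ComposedPolicy(
    policies=[lambda x: x + 1, lambda x: x + 2],      # toy option policies
    initiation_sets=[lambda x: x < 5, lambda x: x >= 5])

x = 0
for _ in range(4):
    x = pi(x)
print(x)  # 4 — still executing pi_1; pi_2 takes over once x >= 5
```

Because β_i = I_{i+1} by construction (Theorem 4.1), the hand-off condition in the controller is always reachable when each option terminates.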

4.3. SYNTHESIZING AND USING OPTION GUIDES

Once we identify the endpoints (the initiation and termination sets) of an option, we use these sets to compute an option guide for the option. We use option guides to formulate the reward functions used to train the option policies. An option guide contains two components: a guide path and a pseudo reward function. We formally define an option guide as follows:

Definition 7. Given an option o_i ∈ O with option endpoints ⟨I_i, β_i⟩, an option guide G_{o_i} is defined as a 4-tuple ⟨I_i, β_i, G_i, R_i⟩ where I_i and β_i are the endpoints of the option o_i, G_i is the guide path, and R_i is the pseudo reward function.

We now define both components of the option guide. A guide path is a motion plan from the initiation set to the termination set of an option. Recall that α is the abstraction function defined using the RBVD Ψ. We define a guide path as follows:

Definition 8. Let X be the configuration space of the robot R and let d be the Euclidean distance measure. Given an option o_i with endpoints ⟨I_i, β_i⟩, let c_{I_i} and c_{β_i} denote the centroids of the initiation and termination sets respectively. Given a threshold distance t, a guide path G_i for the option o_i is a sequence of points [p_1, . . . , p_n] in the configuration space X such that p_1 = c_{I_i}, p_n = c_{β_i}, d(p_k, p_{k+1}) < t for each pair of consecutive points p_k, p_{k+1} ∈ G_i, and every point p_k ∈ G_i satisfies p_k ∈ α(I_i) or p_k ∈ α(β_i).

Here, we abuse notation and apply the abstraction function to a set of low-level configurations rather than a single configuration: intuitively, α(I_i) = {α(x) | x ∈ I_i}, and α(β_i) is computed similarly. We generate this guide path using a sampling-based motion planner, HARP (Shah & Srivastava, 2022), and the kinematic model of the robot. The guide path is used to define a pseudo reward function R_i for the option.
It is a dense reward function that rewards the robot according to its distance from the nearest point in the guide path and that point's distance from the last point of the guide path. For an option o_i, we define the dense pseudo reward function R_i : X → R as follows:

Definition 9. Let o_i be an option with endpoints ⟨I_i, β_i⟩ and let G_i = [p_1, . . . , p_n] be its guide path. Given a configuration x ∈ X, let n(x) denote the point in the guide path closest to x and let d be the Euclidean distance measure. The pseudo reward function R_i(x) is defined as:

R_i(x) = r_t, if x ∈ β_i;
R_i(x) = r_p, if α(x) ∈ S \ {α(I_i), α(β_i)};
R_i(x) = −(d(x, n(x)) + d(n(x), p_n)), otherwise.

Here r_t is a large positive reward and r_p is a large negative reward; both can be tuned as hyperparameters. Intuitively, we wish to automatically generate a dense pseudo-reward function based on the information available from the environment. Thus, rather than providing only a sparse reward function (e.g., "+1 when you reach the termination set of the option"), we develop a pseudo reward function that captures this condition (first case), a penalty for straying away from the source and target abstract states (second case), and a reward for covering more of the option guide (third case). Given an option o_i with endpoints ⟨I_i, β_i⟩ and an option guide G_{o_i} = ⟨I_i, β_i, G_i, R_i⟩, our approach can use an arbitrary reinforcement learning algorithm to learn the policy for the option. For our empirical evaluation (Sec. 5), we use Soft Actor-Critic (Haarnoja et al., 2018).
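A toy 1-D instantiation of the pseudo reward of Definition 9 may make the three cases concrete. The guide path, the membership tests for β_i and the abstract states, and the exact reward magnitudes are assumptions for this sketch.

```python
# Toy 1-D instantiation of the pseudo reward function (Definition 9). The
# guide path, the stubbed membership tests, and the magnitudes of r_t and
# r_p are assumptions invented for this sketch.
guide = [0.0, 1.0, 2.0, 3.0]        # guide path G_i = [p_1, ..., p_n]
r_t, r_p = 1000.0, -100.0           # terminal reward / straying penalty

def in_termination(x):              # stub for x ∈ β_i
    return abs(x - guide[-1]) < 0.25

def in_allowed_states(x):           # stub for α(x) ∈ {α(I_i), α(β_i)}
    return -1.0 <= x <= 4.0

def pseudo_reward(x):
    if in_termination(x):
        return r_t                  # case 1: reached the termination set
    if not in_allowed_states(x):
        return r_p                  # case 2: strayed off the abstraction
    n_x = min(guide, key=lambda p: abs(x - p))       # nearest guide point
    return -(abs(x - n_x) + abs(n_x - guide[-1]))    # case 3: shaped cost

print(pseudo_reward(3.1))   # 1000.0
print(pseudo_reward(-2.0))  # -100.0
print(pseudo_reward(1.4))   # ≈ -2.4: closer to the end means less penalty
```

The third case grows less negative as the robot advances along the guide, so a reward-maximizing learner is pulled toward the termination set.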

5. EMPIRICAL EVALUATION

We conduct an extensive evaluation of our approach in a total of seven environments with two different robots. All experiments were conducted on a system running Ubuntu 18.04 with an AMD Ryzen Threadripper 3960X 24-core processor and two NVidia 3080 GPUs.

Figure 2: The figure shows the time taken by our approach and the baselines to compute path plans in the test environments. Results for an additional environment are in Appendix B. The x-axis shows the problem instance and the y-axis shows the time in seconds. The reported time for our approach includes the time to predict critical regions, construct abstractions, and learn policies for all the options. Each subsequent problem instance reuses trained option policies from previous problems when they exist. The timeout was set to 2400 seconds. The numbers are averaged over 5 independent trials. The transparent bars for SAC indicate that training was stopped when it reached the timeout.

We implement our approach using the PyBullet (Coumans & Bai, 2016) robotics simulator. PyBullet does not explicitly model stochasticity in the movement of the robot; therefore, we apply random perturbations to the actions to introduce stochasticity into the environment during training, and use the default controllers to evaluate the learned policies. We implement our approach using PyTorch and Stable Baselines. We use a 2-layer neural network with 256 neurons in each layer to learn the policy for each option. The input to the network is the current configuration of the robot and a vector to the nearest point on the guide path for the current option. The network outputs the next point for the robot to reach. We use r_t = 1000 and r_p = −100 for the pseudo reward function, with the maximum number of learning steps set to 150k per option. We evaluate the model for 20 episodes every 10k steps and stop training once we achieve an average reward of 500. Our code and data are available in the supplementary material.
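The random action perturbations described above can be implemented as a thin wrapper around the simulator. The Env stub, the Gaussian noise model, and the noise scale below are assumptions for this sketch, not the exact setup used in our experiments.

```python
# How stochasticity could be injected during training: each commanded
# action is perturbed before being sent to the simulator. The EchoEnv stub,
# Gaussian noise model, and sigma value are invented for this sketch.
import random

class NoisyActionEnv:
    """Wraps an environment and perturbs actions with Gaussian noise."""
    def __init__(self, env, sigma=0.05, seed=0):
        self.env, self.sigma = env, sigma
        self.rng = random.Random(seed)  # seeded for reproducibility

    def step(self, action):
        noisy = [a + self.rng.gauss(0.0, self.sigma) for a in action]
        return self.env.step(noisy)

class EchoEnv:                      # stub: returns the executed action
    def step(self, action):
        return action

env = NoisyActionEnv(EchoEnv(), sigma=0.05)
executed = env.step([1.0, 0.0])
print(all(abs(e - a) < 0.5 for e, a in zip(executed, [1.0, 0.0])))  # True
```

At evaluation time the wrapper is simply omitted, matching the protocol of training under perturbations and testing with the default controllers.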
Evaluating our approach We evaluate our approach using a total of 7 environments. They are inspired by indoor household and office environments. Appendix A shows our test environments. The dimensions of the first four environments (App. A.1) are 15m × 15m; the rest of the environments (App. A.2) are of size 75m × 75m. These environments were not included in the set of training environments used to train the network that identifies critical regions. We generate 5 motion planning tasks (P1-P5) for each environment to evaluate our approach. We use two non-holonomic robots to evaluate our approach in these environments: the ClearPath Husky robot (H) and the AgileX Limo (L). The Husky is a 4-wheeled differential-drive robot that can move in one direction and rotate in place. The Limo is also a 4-wheeled robot that can move in one direction and has steering. We considered a set of recent approaches (Kulkarni et al., 2016; Lillicrap et al., 2016; Levy et al., 2019; Bagaria & Konidaris, 2020) before selecting two approaches to evaluate against our approach. First, we compare our approach against vanilla soft actor-critic (SAC) (Haarnoja et al., 2018). SAC is an off-policy deep reinforcement learning approach that learns a single policy for taking the robot from its initial configuration to the goal configuration. We use the same network architecture as ours for the SAC neural network, with a terminal reward of +1000 and a step reward of −1. We also evaluate our approach against the sampling-based motion planner RRT (LaValle, 1998) with replanning. We compute a path plan with RRT and execute it. If the robot does not reach the goal after executing the entire path plan, then we compute a new path plan from the configuration where the robot ended. We continue this loop until either the robot reaches the goal configuration or we reach the timeout.
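The execute-and-replan loop used for the RRT baseline can be sketched as below. The interfaces plan_fn and execute_fn are assumed stand-ins for an RRT planner and the stochastic simulator, not part of any specific library.

```python
import math
import time

def replanning_rrt(plan_fn, execute_fn, start, goal, timeout=2400.0, tol=0.5):
    """Replanning baseline (sketch): plan with RRT, execute, and replan
    from wherever the robot actually ended up, until the goal is reached
    or the timeout expires.

    plan_fn(start, goal) -> list of waypoints (an assumed RRT interface);
    execute_fn(path) -> the configuration reached under stochastic dynamics.
    tol is an assumed goal-reaching tolerance.
    """
    deadline = time.monotonic() + timeout
    current = start
    while time.monotonic() < deadline:
        path = plan_fn(current, goal)        # fresh plan from the current state
        current = execute_fn(path)           # stochastic execution may drift
        if math.dist(current, goal) < tol:   # reached the goal configuration
            return True
    return False                             # timed out
```

Because each execution can drift, the deterministic plan gives no guarantee of progress, which is what the success-rate comparison below probes.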

Analysis of the results

We thoroughly evaluated our approach to answer the following critical questions: (1) Does this approach learn useful high-level planning representations? (2) Do these learned representations support transferring learning to new tasks?

Figure 3: Success rate of our approach and the baselines in the test environments. Results for an additional environment are in Appendix B. The x-axis shows the problem instance and the y-axis shows the fraction of successful executions of the model out of 20 test executions. We used the final policy for our approach and SAC. RRT computed a new plan for each execution with a timeout of 2400 seconds. The numbers are averaged over 5 independent trials.

Fig. 2 shows the time taken by our approach for learning policies compared to the baselines SAC and RRT. Our approach was able to learn policies significantly faster than the baselines; in most cases, it computed solutions in half the time taken by the baselines. Vanilla SAC was not able to reach the threshold average reward of +500 in any environment before the timeout of 2400 seconds. This shows that our approach learns useful high-level planning representations that improve the efficiency of the overall stochastic motion planning. Fig. 3 shows the success rate for our approach and the baselines. Our approach achieves the goal significantly more often than replanning-based deterministic RRT: it consistently reached the goal configuration more than 80% of the time, whereas RRT reached the goal only 50% of the time within the 2400-second timeout. For the larger environments (E-F), SAC was not able to solve a single problem. We also conducted a thorough analysis of the options used across problems P1-P5 in each environment; Appendix C shows the fraction of options reused by our approach across P1-P5. These reuse rates, combined with the success rates, show that our approach is able to successfully transfer learned options across new tasks.

6. CONCLUSION

In this paper, we presented the first approach that uses a data-driven process to automatically identify transferable options. Our approach synthesizes two different types of options: centroid options and interface options. It uses these options with our hierarchical algorithm for solving stochastic path planning problems. We also provide a way to automatically generate pseudo-reward functions for synthesized options so that their policies can be learned efficiently. We show that learned options can be transferred to different problems and used to solve new problems. Our formal framework also provides guarantees on the composability of the generated options. Our empirical evaluation on a large variety of problem settings shows that our approach significantly outperforms other approaches. Through our empirical evaluation, we show that combining reinforcement learning with abstractions and high-level planning improves the efficiency of reinforcement learning and computes solutions faster. We also show that using reinforcement learning to solve path planning problems in stochastic environments yields robust plans that perform better than deterministic plans computed using replanning-based motion planners.



Sutton et al. (1999) model options as temporal abstractions over primitive actions in an MDP. Let S be the state space and A the action space of an MDP. An option o is defined as a 3-tuple ⟨I_o, β_o, π_o⟩. I_o ⊆ S defines the initiation set of option o: o can be applied in a state s iff s ∈ I_o. β_o is the termination set of option o: execution of o terminates when the agent reaches a state s ∈ β_o. π_o : S → A defines the policy for option o. Sutton et al. (1999) define a Semi-Markov Decision Process (SMDP) by adding the available options to the set of primitive actions of the MDP. A solution to an SMDP is a policy π : S → Ā, where Ā = A ∪ O and O is a set of options. Automated synthesis of abstractions Shah & Srivastava (
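The option definition above can be sketched in code. This is an illustrative reading of the ⟨I_o, β_o, π_o⟩ formulation with set-valued termination, as the paper defines it; the class names, the step_fn stand-in for the stochastic simulator, and the step cap are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """An option <I_o, beta_o, pi_o> (illustrative sketch)."""
    initiation: Set[State]               # I_o: states where the option may start
    termination: Set[State]              # beta_o: states where execution stops
    policy: Callable[[State], Action]    # pi_o: S -> A

def run_option(option: Option, state: State, step_fn, max_steps: int = 100) -> State:
    """Run one option until it reaches its termination set.
    step_fn(state, action) is an assumed deterministic stand-in for the
    stochastic dynamics simulator."""
    assert state in option.initiation, "option not applicable in this state"
    for _ in range(max_steps):
        if state in option.termination:
            return state
        state = step_fn(state, option.policy(state))
    return state

def compose(options: List[Option], state: State, step_fn) -> State:
    """Execute options o_1, ..., o_n in sequence. Composability means each
    option terminates inside the next option's initiation set, so the
    assert in run_option never fires along the sequence."""
    for o in options:
        state = run_option(o, state, step_fn)
    return state
```

In the paper's setting, the abstract states induced by critical regions play the role of these initiation and termination sets, which is what makes the high-level plan downward refinable.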

Figure 1: A policy composed of a collection of options o_1, ..., o_n for a sequence of distinct adjacent abstract states s_1, ..., s_n.

