EXAMPLE-BASED PLANNING VIA DUAL GRADIENT FIELDS

Anonymous

Abstract

Path planning is a key ability of an intelligent agent. However, both learning-based and sampling-based planners still require the task to be defined explicitly, by manually designing a reward function or optimisation objective, which limits their scope of application. Formulating path planning from a new perspective, example-based planning seeks the most efficient path for increasing the likelihood of a target distribution specified by a set of target examples. In this work, we introduce Dual Gradient Fields (DualGFs), an offline example-based planning framework built upon score matching. A DualGF comprises two gradient fields: a target gradient field that guides task completion and a support gradient field that keeps motion consistent with physical constraints. During learning, instead of interacting with the environment, the agent is trained on two sets of offline examples: the target gradients are trained on target examples and the support gradients on support examples. Support examples are randomly sampled from free space, i.e., states without collisions. The DualGF is a weighted mixture of the two fields, combining their merits. To update the mixing ratio adaptively, we further propose a field-balancing mechanism based on Lagrangian relaxation. Experimental results across four tasks (navigation, tracking, particle rearrangement, and room rearrangement) demonstrate the scalability and effectiveness of our method. Our code and demonstrations can be found at https://sites.google.com/view/dualgf.



1. INTRODUCTION

In this paper, we consider a novel data-driven planning paradigm: example-based planning, in which the user specifies the task by providing a set of target examples rather than programming a task-specific objective. Benefiting from this paradigm, example-based planning scales to various tasks, particularly tasks with implicit goals, i.e., tasks specified by a target distribution instead of a single goal state. Besides, the agent needs to infer the environmental constraints in order to move safely in the physical world. Previous approaches learn physical constraints either from interaction with the environment and collision penalties (Wu et al., 2022), from offline demonstrations (Janner et al., 2022), or by exhaustively sampling points at test time (LaValle et al., 1998a). However, online interaction is costly and unsafe, offline demonstrations are expensive to collect, and test-time sampling is time-inefficient. To this end, we propose a fully example-based planning framework, DualGF, that learns two gradient fields with different purposes from examples via score matching (Vincent, 2011). DualGF consists of a target gradient field and a support gradient field. The target gradient field estimates the gradient of the target distribution, thereby providing the fastest direction for accomplishing the task. The support gradient field learns to map a perturbed state back to free space, thereby helping to avoid collisions. To combine the merits of the two fields, we further introduce a gradient mixer that adaptively balances the trade-off between the two gradients (staying safe vs. reaching the goal) when constructing the dual gradient field at execution time. In practice, the dual gradient field can also be combined with a low-level controller to output primitive actions. The two gradient fields are trained from two sets of examples, respectively.
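Concretely, each gradient field can be trained with denoising score matching: an example is perturbed with Gaussian noise, and the model is regressed onto the direction that denoises it, which recovers the score of the (noise-smoothed) example distribution. The following self-contained sketch uses a hypothetical 1-D Gaussian target distribution and a linear score model in place of the paper's neural networks, purely to illustrate the objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target examples: samples from an (unknown to the learner) target
# distribution, here N(mu, 1) for illustration.
mu = 2.0
x = rng.normal(mu, 1.0, size=20000)

# Denoising score matching (Vincent, 2011): perturb each example with
# Gaussian noise and regress onto the denoising direction (x - x_tilde)/sigma^2.
sigma = 0.5
noise = rng.normal(0.0, sigma, size=x.shape)
x_tilde = x + noise
target = (x - x_tilde) / sigma**2  # equals -noise / sigma^2

# A linear score model s(x) = a*x + b suffices for this Gaussian toy case;
# the paper would use a neural network here.
A = np.stack([x_tilde, np.ones_like(x_tilde)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, target, rcond=None)

# DSM recovers the score of the *perturbed* distribution N(mu, 1 + sigma^2),
# whose score is (mu - x) / (1 + sigma^2), i.e. a ~ -0.8 and b ~ 1.6 here.
print(a, b)
```

The fitted model approximates the score field of the smoothed target distribution; following this field moves a state toward high-density (target-like) regions, which is exactly the role of the target gradient field above.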
For the target gradient field, we collect a set of target states sampled from the target distribution, such as a set of tidied rooms. For the support gradient field, we provide the agent with another set of examples (support examples) that are uniformly sampled from the free space, i.e., states without collisions. Support examples are abundant and easy to obtain in real scenarios, e.g., randomly initialised objects, which largely alleviates the safety issues of learning from interaction. As illustrated in Fig. 1, the agent learns a generalisable inference of the task specification and physical constraints from target and support examples, and plans in unseen environments. Our experiments validate the generalisation ability of the DualGF planning framework across a variety of tasks, including classical planning tasks such as navigation and tracking, as well as planning tasks without explicit goal specification, such as object rearrangement (Wu et al., 2022). Specifically, the proposed DualGF significantly outperforms learning-based baselines in planning performance and efficiency while achieving performance comparable to reference approaches that use the ground-truth model or test-time sampling. Ablation studies further demonstrate the effectiveness of the proposed support gradient field and field-balancing mechanism. In conclusion, our contributions are summarised as follows: a) We reformulate the path planning problem in a data-driven paradigm, where tasks are specified with examples rather than manually designed objectives; b) We propose DualGF, a novel score-based planning framework that adaptively integrates two gradient fields trained on different example sets to output instructions for completing a task; c) We conduct experiments on four tasks to demonstrate the scalability of our method, and empirical results show that DualGF significantly outperforms state-of-the-art methods in efficiency and safety.
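The field-balancing idea can be made concrete with a small toy problem. The sketch below is a hypothetical 2-D setup, not the paper's learned fields or exact update rule: a hand-coded "target gradient" pulls a point agent toward a goal, a hand-coded "support gradient" points back into free space around a circular obstacle, and a Lagrange multiplier updated by dual ascent on the constraint violation mixes the two, in the spirit of the Lagrangian-relaxation balancing mechanism:

```python
import numpy as np

goal = np.array([2.0, 0.0])
center, radius, margin = np.array([1.0, 0.0]), 0.3, 0.1

def target_grad(x):
    # Stand-in for the learned target score: steepest direction to the goal.
    return (goal - x) / (np.linalg.norm(goal - x) + 1e-9)

def support_grad(x):
    # Stand-in for the learned support score: points back into free space.
    return (x - center) / (np.linalg.norm(x - center) + 1e-9)

x = np.array([0.0, 0.05])          # start slightly off-axis
lam, eta, step = 0.0, 20.0, 0.02   # multiplier, dual step size, path step size
min_clearance = np.inf

for _ in range(500):
    d = np.linalg.norm(x - center) - radius        # clearance to the obstacle
    min_clearance = min(min_clearance, d)
    lam = max(0.0, lam + eta * (margin - d))       # dual ascent on violation
    g = target_grad(x) + lam * support_grad(x)     # dual gradient field
    x = x + step * g / (np.linalg.norm(g) + 1e-9)

print(min_clearance, np.linalg.norm(x - goal))
```

When the agent is far from the obstacle the multiplier decays to zero and the target gradient dominates; near the constraint boundary the multiplier grows until the support gradient deflects the path around the obstacle, so the agent reaches the goal without ever entering the unsafe region.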

2. RELATED WORK

Learning from Demonstration. Example-based planning can be viewed as a special case of learning from demonstration (LfD). LfD is a long-studied problem that aims to learn a policy from only a set of expert trajectories. There are two main streams of LfD: Behavioural Cloning (BC) (Pomerleau, 1991; Ross & Bagnell, 2010; Ross et al., 2011), which learns a policy in a supervised manner, and Inverse Reinforcement Learning (IRL) (Fu et al., 2017; Liu et al., 2020; Kostrikov et al., 2018; Ziebart et al., 2008), which finds a cost function under which the expert is uniquely optimal. Different from LfD, we train the agent from only two sets of examples instead of full demonstrations.



Figure 1: Our task setting. Left: during training, the agent learns the task specification from target examples and the physical constraints from support examples. Right: during the test phase, the agent plans a path under novel conditions.

Path Planning. Planning a path to reach a goal is a fundamental capability of an intelligent agent (Russell, 2010) and has a wide range of real-world applications, such as navigation (Patle et al., 2019), object tracking (Zhong et al., 2019), and object rearrangement (King et al., 2016). Existing planning algorithms, whether sampling-based (LaValle et al., 1998a; Karaman & Frazzoli, 2011) or learning-based (Kulathunga, 2021; Yu et al., 2020; Tamar et al., 2016), require either exhaustive test-time sampling to search for a path or reward functions for learning. This severely limits the scope of application of planning, since for many real-world tasks it is hard to design objectives/rewards with human priors, e.g., tidying up a house or rearranging a desktop.

Example-Based Control. Most related to our setting, RCE (Eysenbach et al., 2021) learns example-based control through online interaction, and (Hatch et al., 2022) extends RCE to an offline setting. Different from these methods, our method requires neither interaction with the environment nor offline demonstrations.

