EXAMPLE-BASED PLANNING VIA DUAL GRADIENT FIELDS Anonymous

Abstract

Path planning is one of the key abilities of an intelligent agent. However, both learning-based and sampling-based planners still require the task to be explicitly defined by manually designing a reward function or optimisation objective, which limits their scope of application. Formulating path planning from a new perspective, example-based planning seeks the most efficient path to increase the likelihood of a target distribution that is specified only by a set of target examples. In this work, we introduce Dual Gradient Fields (DualGFs), an offline example-based planning framework built upon score matching. There are two gradient fields in DualGFs: a target gradient field that guides task completion and a support gradient field that keeps the motion within physical constraints. In the learning process, instead of interacting with the environment, the agent is trained from two offline example sets: the target gradients are trained on target examples and the support gradients on support examples. The support examples are randomly sampled from free space, i.e., states without collisions. The DualGF is a weighted mixture of the two fields, combining their merits. To update the mixing ratio adaptively, we further propose a field-balancing mechanism based on Lagrangian relaxation. Experimental results across four tasks (navigation, tracking, particle rearrangement, and room rearrangement) demonstrate the scalability and effectiveness of our method. Our code and demonstrations can be found at https://sites.google.com/view/dualgf.



1. INTRODUCTION

Planning paths to reach a goal is a fundamental ability of an intelligent agent (Russell, 2010) and has a wide range of real-world applications, such as navigation (Patle et al., 2019), object tracking (Zhong et al., 2019), and object rearrangement (King et al., 2016). Existing planning algorithms, whether sampling-based (LaValle et al., 1998a; Karaman & Frazzoli, 2011) or learning-based (Kulathunga, 2021; Yu et al., 2020; Tamar et al., 2016), need exhaustive test-time sampling to search for a path or hand-designed reward functions for learning. This severely limits the scope of application, since for many real-world tasks, e.g., tidying up a house or rearranging a desktop, it is hard to design objectives or rewards from human priors.

In this paper, we consider a novel data-driven planning paradigm, example-based planning, in which the user specifies the task by providing a set of target examples rather than programming a task-specific objective. Benefiting from such a paradigm, example-based planning can scale to various tasks, particularly tasks with implicit goals, i.e., tasks specified by a target distribution instead of a specific goal state. Besides, the agent needs to infer the environmental constraints to move safely in a physical world. Previous approaches learn physical constraints from interacting with the environment under a collision penalty (Wu et al., 2022), from offline demonstrations (Janner et al., 2022), or by exhaustively sampling points at test time (LaValle et al., 1998a). However, online interaction is costly and unsafe, offline demonstrations are expensive to collect, and test-time sampling is time-inefficient.

To this end, we propose a fully example-based planning framework, DualGF, that learns two gradient fields with different purposes from examples via score matching (Vincent, 2011). DualGF consists of two fields: a target gradient field and a support gradient field. The target gradient field estimates the gradient of the target distribution, providing the fastest direction to accomplish the task. The support gradient field learns to revert a perturbed state back to the free space, helping the agent avoid collisions. To combine the merits of the two fields, we further introduce a gradient mixer that adaptively balances the trade-off between the two gradients (staying safe vs. reaching the goal) when constructing the dual gradient field in execution. In practice, we can also incorporate the dual gradient field with a low-level controller to output primitive actions in control tasks.

The two gradient fields are trained from two sets of examples, respectively. For the target gradient field, we collect a set of target states sampled from the target distribution, such as a set of tidied rooms. For the support gradient field, we provide the agent with another set of examples (support examples) that are uniformly sampled from the free space, i.e., states without collision. Support examples are abundant and easy to obtain in real scenarios, e.g., from randomly initialised objects, which largely alleviates the safety issue of learning from interactions. As illustrated in Fig. 1, the agent learns a generalisable inference of task specification and physical constraints from target and support examples and plans in unseen environments.
Our experiments validate the generalisation ability of the DualGF planning framework across a variety of tasks, including classical planning tasks such as navigation and tracking, as well as planning tasks without explicit goal specification such as object rearrangement (Wu et al., 2022). Specifically, DualGF significantly outperforms the learning-based baselines in planning performance and efficiency, while achieving comparable performance with reference approaches that use the ground-truth model or test-time sampling. Ablation studies also demonstrate the effectiveness of the proposed support gradient field and field-balancing mechanism. In conclusion, our contributions are summarised as follows: a) We reformulate the path planning problem in a data-driven paradigm, where tasks are specified with examples rather than manually designed objectives; b) We propose a novel score-based planning framework, DualGF, that adaptively integrates two gradient fields trained from different example sets to output instructions for completing a task; c) We conduct experiments on four tasks to demonstrate the scalability of our method, and empirical results show that DualGF significantly outperforms the state-of-the-art methods in efficiency and safety.

2. RELATED WORK

Learning from Demonstration. Example-based planning can be viewed as a special case of Learning from Demonstration (LfD). LfD is a long-studied problem aiming to learn a policy from only a set of expert trajectories. There are two mainstreams of LfD: Behavioural Cloning (BC) (Pomerleau, 1991; Ross & Bagnell, 2010; Ross et al., 2011), which learns a policy in a supervised manner, and Inverse Reinforcement Learning (IRL) (Fu et al., 2017; Liu et al., 2020; Kostrikov et al., 2018; Ziebart et al., 2008), which finds a cost function under which the expert is uniquely optimal. Different from LfD, we train the agent from only two sets of examples instead of whole demonstrations. Some RL-based methods consider an example-based setting: VICE (Fu et al., 2018) is similar to AIRL (Fu et al., 2017) but is designed for learning from a set of success examples. SQIL (Reddy et al., 2019) modifies SAC (Haarnoja et al., 2018) to label success examples with a reward of +1. RCE (Eysenbach et al., 2021), a modification of actor-critic methods (Fujimoto et al., 2018; Haarnoja et al., 2018), directly learns a value function from transitions and success examples. Recently, Offline-RCE (Hatch et al., 2022) extended RCE to the offline setting. Different from these methods, our method requires neither interaction with the environment nor offline demonstrations.

Path Planning. Sampling-based planning algorithms, such as the probabilistic roadmap method (PRM) (Kavraki & Latombe, 1994) and the rapidly exploring random tree (RRT) (LaValle et al., 1998b), are dominant in traditional path planning. PRM and its variants (Dobson & Bekris, 2014; You et al., 2021; Chai et al., 2022; Hüppi et al., 2022) generate a roadmap and find a collision-free path from a starting point to a goal region on the roadmap. RRT and its variants (Strub & Gammell, 2020; Gammell et al., 2020; Li et al., 2022; Strub & Gammell, 2022) iteratively build a tree by expanding towards a random sample point instead of directly connecting to it. Different from these methods, our approach enables path planning in an end-to-end paradigm without test-time sampling, which is more time-efficient and scales to high-dimensional tasks. Our method is also related to the artificial potential field (Lee & Park, 2003), which manually designs an attractive field guiding the agent to the goal location and a repulsive field pushing the agent away from obstacles. In contrast, our method learns two fields from two sets of examples that generalise to unseen conditions. Besides, our method can plan from visual observations (see Appendix D.4) and scales to high-dimensional domains.

3. PROBLEM STATEMENT

Example-based planning can be described by a tuple $(S, S_f, p_{tar}, \rho, \epsilon)$, where $S$ denotes the state space (e.g., configurations of objects and obstacles), $S_f \subseteq S$ denotes the free space (the valid states without object collision or boundary violation (Wu et al., 2022)), $p_{tar}$ is the target distribution, $\rho$ is the initial state distribution, and $\epsilon$ bounds the per-step state change. At each time step $t$, the planning agent $\pi : S \to A$ outputs a target state within an $\epsilon$-ball of the input state, i.e., $||\hat s_{t+1} - s_t|| \le \epsilon$. The goal of the planning agent is to find the most efficient path to increase the likelihood of the target distribution:

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{\rho(s_0),\,\tau\sim\pi}\Big[\sum_{s_t\in\tau}\gamma^t \log p_{tar}(s_t)\Big], \quad \text{s.t. } \forall s_t \in \tau,\ s_t \in S_f \qquad (1)$$

To this end, the agent needs to infer the task specification, i.e., where moving is likely to succeed, as well as the free space, i.e., how to avoid collisions between objects and obstacles. In this work, we consider the offline setting of example-based planning, where the agent has to infer the physical constraints from support examples instead of interacting with the environment. The support examples are uniformly sampled from the free space, $S^*_{sup} = \{s^* : s^* \sim p_{sup}(s)\}$, where

$$p_{sup}(s) = \begin{cases} \frac{1}{|S_f|}, & s \in S_f \\ 0, & \text{otherwise} \end{cases}, \qquad |S_f| = \int_{S_f} ds \qquad (2)$$

The rationale for 'replacing interactions with examples' is that support examples suffice for inferring the free space $S_f$, while online interactions are unsafe and costly. Besides, the agent should learn a generalisable inference of the free space instead of overfitting to a specific environment.
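To make the support-set construction concrete, below is a minimal sketch of rejection sampling from the free space. The environment interface (`sample_uniform_state`, `in_collision`) is a hypothetical assumption, not an API from the paper.

```python
import numpy as np

def sample_support_examples(env, n_examples, rng=np.random.default_rng(0)):
    """Uniformly sample collision-free states (support examples).

    `env.sample_uniform_state(rng)` draws a state uniformly from the
    state space S; `env.in_collision(s)` checks membership in S_f.
    Both are assumed interfaces for illustration only.
    """
    examples = []
    while len(examples) < n_examples:
        s = env.sample_uniform_state(rng)   # s ~ Uniform(S)
        if not env.in_collision(s):         # accept iff s lies in S_f
            examples.append(s)              # accepted samples follow p_sup
    return np.stack(examples)
```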

4. METHODOLOGY

Overview: To tackle example-based planning, our key idea is to learn gradient fields from examples. As illustrated in Fig. 2(a), the target gradient field $\Phi^{\theta}_{tar}$ and the support gradient field $\Phi^{\phi}_{sup}$ are trained from the target examples and the support examples via score matching, respectively. The target gradient guides task completion by providing the fastest direction to increase the task likelihood. The support gradient helps avoid collisions by pointing towards the interior of the free space. Finally, we construct a policy that plans the next state based on the mixture of the two gradients, i.e., the dual gradient. In the control scenario, a heuristic-based controller can interpret the dual gradient into a low-level action.

4.1. PLANNING WITH THE DUAL GRADIENT FIELDS

We tackle example-based planning by greedily maximising the target likelihood. Firstly, the agent searches for the next state in the free space that maximises the target likelihood:

$$\mathop{\arg\max}_{s_{t+1}} \log p_{tar}(s_{t+1}), \quad \text{s.t. } \log p_{sup}(s_{t+1}) \ge \log \tfrac{1}{|S_f|} \qquad (3)$$

However, this objective is problematic, since both $\log p_{sup}(s_{t+1})$ and $\log p_{tar}(s_{t+1})$ are uninformative in many regions, e.g., $p_{tar}$ is zero in non-target regions and $p_{sup}$ is constant in the interior of the free space. Besides, $p_{sup}(s_{t+1})$ is non-differentiable on the boundary of the free space $\partial S_f$. As a result, we choose to 'soften' both distributions by adding a small Gaussian noise:

$$p^{\sigma}_{sup}(\hat s) = \int \mathcal N(\hat s; s, \sigma^2 I)\, p_{sup}(s)\, ds, \qquad p^{\sigma}_{tar}(\hat s) = \int \mathcal N(\hat s; s, \sigma^2 I)\, p_{tar}(s)\, ds \qquad (4)$$

In this way, we replace the objective in Eq. 3 with the following:

$$\mathop{\arg\max}_{s_{t+1}} \log p^{\sigma}_{tar}(s_{t+1}), \quad \text{s.t. } \log p^{\sigma}_{sup}(s_{t+1}) \ge c \qquad (5)$$

where $c \in \mathbb R$ is a threshold indicating how conservative the policy is. Problem 5 can be further related to an unconstrained optimisation with a penalty term by the Lagrangian method:

$$\max_{s_{t+1}} \min_{\lambda \ge 0}\; \log p^{\sigma}_{tar}(s_{t+1}) + \lambda\big(\log p^{\sigma}_{sup}(s_{t+1}) - c\big) \qquad (6)$$

where $\lambda \ge 0$ is known as a Lagrangian multiplier. In principle, $s_{t+1}$ and $\lambda$ can be updated iteratively for multiple steps to obtain a solution. Here, we choose to update $s_{t+1}$ and $\lambda$ for only one step of Eq. 6, since $\hat s_{t+1}$ is close to $s_t$ in our problem, i.e., $||\hat s_{t+1} - s_t||_2 \le \epsilon$:

$$\hat s_{t+1} = s_t + \mu_s\big(\nabla_s \log p^{\sigma}_{tar}(s_t) + \lambda_t \nabla_s \log p^{\sigma}_{sup}(s_t)\big), \qquad \lambda_{t+1} = \mathrm{ReLU}\big(\lambda_t - \mu_\lambda(\log p^{\sigma}_{sup}(s_{t+1}) - c)\big) \qquad (7)$$

where $\mu_s$ and $\mu_\lambda$ are two step sizes. Empirically, the latter is fixed to $\mu_\lambda = 0.01$, while the former is set to ensure $||\hat s_{t+1} - s_t||_2 \le \epsilon$. In this way, our planning policy is derived as:

$$\hat s_{t+1} = s_t + \epsilon\,\frac{g_{dual}}{||g_{dual}||_2}, \qquad g_{dual} = \underbrace{g_{tar} + \lambda_t\, g_{sup}}_{\text{dual gradient}}, \qquad g_{tar} = \underbrace{\nabla_s \log p^{\sigma}_{tar}(s_t)}_{\text{target gradient}}, \quad g_{sup} = \underbrace{\nabla_s \log p^{\sigma}_{sup}(s_t)}_{\text{support gradient}} \qquad (8)$$

Intuitively, the state is updated by the mixture of the target gradient $\nabla_s \log p^{\sigma}_{tar}(s_t)$ and the support gradient $\nabla_s \log p^{\sigma}_{sup}(s_t)$, weighted by $\lambda_t$. As shown in Fig. 3, the target gradient is a vector field pointing towards the goal region, guiding task completion. The support gradient field points towards the interior of the free space: the closer a state is to the boundary of the free space, the larger the magnitude of its support gradient, which 'pushes' the state back into the interior. In Appendix D.9, we also derive the closed form of the target and support gradients and discuss these insights.

Figure 3: Visualisation of the learned target, support, and dual gradient fields on the Navigation task. The L2 norm of the support gradients is small in the free space but grows significantly faster close to the boundary. As a result, the direction of the dual gradient is consistent with the target gradient in the free space and consistent with the support gradient near the boundary.
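Returning to the updates in Eqs. 7-8, the following is a minimal sketch of one planning step. The two score functions stand in for the trained target and support networks; `eps` and `c_prime` are illustrative values, while `mu_lambda = 0.01` follows the paper.

```python
import numpy as np

def planning_step(s_t, lam_t, safety_acc, target_score, support_score,
                  eps=0.05, mu_lambda=0.01, c_prime=0.0):
    """One DualGF planning step (Eqs. 7-9, 12).

    `target_score(s)` / `support_score(s)` return estimates of
    grad log p_tar and grad log p_sup at s. `safety_acc` carries the
    running first-order estimate of log p_sup(s_t) - log p_sup(s_0).
    """
    g_tar = target_score(s_t)                       # target gradient
    g_sup = support_score(s_t)                      # support gradient
    g_dual = g_tar + lam_t * g_sup                  # dual gradient (Eq. 8)
    s_next = s_t + eps * g_dual / np.linalg.norm(g_dual)

    # First-order Taylor surrogate (Eq. 9): accumulate <g_sup, ds>.
    safety_acc = safety_acc + float(np.dot(g_sup, s_next - s_t))
    lam_next = max(0.0, lam_t - mu_lambda * (safety_acc - c_prime))
    return s_next, lam_next, safety_acc
```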

4.2. UPDATING MIXING RATE VIA SUPPORT GRADIENT FIELD

From Eq. 7, we can infer that $\lambda_t$ is updated to maintain the 'safety level' $\log p^{\sigma}_{sup}(s)$ around the threshold $c$: a larger $\lambda_t$ increases $\log p^{\sigma}_{sup}(s)$ at future states, and vice versa. However, updating $\lambda_t$ requires the exact value of $\log p^{\sigma}_{sup}(s_{t+1}) - c$, which involves an integration over the state space as shown in Eq. 4 and is intractable when the state space is high-dimensional. Thus, we seek to develop a practical field-balancing mechanism. We first approximate $\log p^{\sigma}_{sup}(s_{t+1})$ from $\log p^{\sigma}_{sup}(s_0)$ via a first-order Taylor expansion:

$$\log p^{\sigma}_{sup}(s_{t+1}) - c = -\big(c - \log p^{\sigma}_{sup}(s_0)\big) + \big(\log p^{\sigma}_{sup}(s_{t+1}) - \log p^{\sigma}_{sup}(s_0)\big) \approx -\big(c - \log p^{\sigma}_{sup}(s_0)\big) + \sum_{k=0}^{t}\big\langle \nabla_s \log p^{\sigma}_{sup}(s_k),\, s_{k+1} - s_k \big\rangle \qquad (9)$$

We further assume that $s_0$ is always initialised far from the boundary of the free space, so that $\mathrm{Var}_{s_0\sim\rho(s)}[\log p^{\sigma}_{sup}(s)] \approx 0$. Thus, the remaining term $c - \log p^{\sigma}_{sup}(s_0)$ can be regarded as a constant $c'$ in the sense of expectation:

$$\mathbb E_{s_0\sim\rho(s)}\big[c - \log p^{\sigma}_{sup}(s_0)\big] = c - \mathbb E_{\rho(s)}\big[\log p^{\sigma}_{sup}(s)\big] = c - \underbrace{H\big(\rho(\cdot),\, p^{\sigma}_{sup}(\cdot)\big)}_{\text{constant}} \overset{\text{def}}{=} c' \qquad (10)$$

4.3 INCORPORATING DUALGF WITH A LOW-LEVEL CONTROLLER

Under the assumption that the system is holonomic, DualGF can also tackle control tasks when incorporated with a low-level controller $C : S \times S \to A$. For instance, if the atomic action space is velocity (with speed limit $v_{max}$ and simulation duration $\Delta t$), we can set $\epsilon = v_{max}\cdot\Delta t$ in Eq. 7 and cast the state change $(\hat s_{t+1}, s_t)$ into a target velocity via a heuristic-based controller:

$$a_t = C(\hat s_{t+1}, s_t) = \frac{\hat s_{t+1} - s_t}{\Delta t} = \frac{v_{max}}{||g_{dual}||_2}\, g_{dual} \qquad (11)$$

For non-linear dynamics where the action space consists of forces imposed on the objects, we can leverage a PID controller to reach the target velocity over multiple steps. In this work, we focus on the effectiveness of the high-level module trained under the fully example-based setting, i.e., the dual gradient fields, rather than on the complexity of the dynamics. Hence, we conduct experiments on tasks with velocity-based action spaces (e.g., linear or angular velocity). The whole planning framework is summarised below:

$$\hat s_{t+1} = s_t + \epsilon\,\frac{g_{dual}}{||g_{dual}||_2}, \qquad g_{dual} = \nabla_s \log p^{\sigma}_{tar}(s_t) + \lambda_t \nabla_s \log p^{\sigma}_{sup}(s_t)$$
$$\lambda_{t+1} = \mathrm{ReLU}\Big(\lambda_t - \mu_\lambda\Big(\sum_{k=0}^{t}\big\langle \nabla_s \log p^{\sigma}_{sup}(s_k),\, s_{k+1} - s_k\big\rangle - c'\Big)\Big)$$
$$a_t = C(\hat s_{t+1}, s_t) = \frac{v_{max}}{||g_{dual}||_2}\, g_{dual} \qquad (12)$$

4.4 LEARNING GRADIENT FIELDS FROM EXAMPLES

Finally, we seek to estimate the target and support gradient fields mentioned above to realise our planning framework. Thanks to score-based generative modelling, we can obtain a guaranteed estimation of these gradient fields from examples.

Preliminary: Denoising Score Matching (DSM), proposed by (Vincent, 2011), aims at estimating the score function of a data distribution, $\nabla_s \log p_{data}(s)$. Given a sample set $\{s^* \sim p_{data}(s)\}$, DSM pre-specifies a noise distribution $q_\sigma(\tilde s\,|\,s)$, e.g., $\mathcal N(\tilde s; s, \sigma^2 I)$, and trains a score network $\Phi_\theta(\cdot)$ to denoise the perturbed data samples:

$$\mathcal L(\theta) = \mathbb E_{\tilde s\sim q_\sigma(\tilde s|s),\, s\sim p_{data}(s)}\Big[\big|\big|\Phi_\theta(\tilde s) - \tfrac{s - \tilde s}{\sigma^2}\big|\big|_2^2\Big] \qquad (13)$$

This objective guarantees that the optimal score network satisfies $\Phi^*_\theta(\tilde s) = \nabla_{\tilde s} \log q_\sigma(\tilde s)$ almost surely, where $q_\sigma(\tilde s) = \int q_\sigma(\tilde s\,|\,s)\, p_{data}(s)\, ds$. When $\sigma$ is small enough, we have $\nabla_s \log q_\sigma(s) \approx \nabla_s \log p_{data}(s)$, so that $\Phi^*_\theta(s) \approx \nabla_s \log p_{data}(s)$.

Training: Adopting DSM, we train a target score network $\Phi^{\theta}_{tar} : \mathbb R^{d_a+d_c} \to \mathbb R^{d_a}$ and a support score network $\Phi^{\phi}_{sup} : \mathbb R^{d_a+d_c} \to \mathbb R^{d_a}$ from the target examples $S^*_{tar}$ and the support examples $S^*_{sup}$ respectively, where $d_a$ denotes the dimension of the agent-state space (e.g., the agent's position and orientation) and $d_c$ denotes the dimension of the conditional-state space (e.g., the agent's category, the obstacles' states). With $s_a \in \mathbb R^{d_a}$ and $s_c \in \mathbb R^{d_c}$ denoting the agent-state and conditional-state respectively ($s = [s_a, s_c]$), the training objectives of the two score networks are as follows:

$$\mathcal L(\theta) = \mathbb E_{q_\sigma(\tilde s_a|s_a),\, p_{tar}(s)}\Big[\big|\big|\Phi^{\theta}_{tar}([\tilde s_a, s_c]) - \tfrac{s_a - \tilde s_a}{\sigma^2}\big|\big|_2^2\Big], \qquad \mathcal L(\phi) = \mathbb E_{q_\sigma(\tilde s_a|s_a),\, p_{sup}(s)}\Big[\big|\big|\Phi^{\phi}_{sup}([\tilde s_a, s_c]) - \tfrac{s_a - \tilde s_a}{\sigma^2}\big|\big|_2^2\Big] \qquad (14)$$

We adopt a variant of DSM (Song et al., 2020) that conducts DSM under different noise scales simultaneously. We defer the network architecture and key hyperparameters to Appendix C.1.
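To make the training objective concrete, here is a minimal PyTorch-style sketch of the conditional DSM loss in Eq. 14 for a single noise scale. The score-network interface and batch layout are assumptions; the paper actually trains across multiple noise scales following Song et al. (2020).

```python
import torch

def dsm_loss(score_net, s_a, s_c, sigma=0.1):
    """Denoising score matching for a conditional score network (Eq. 14).

    `score_net([s_a_noisy, s_c])` estimates grad_{s_a} log p^sigma(s_a | s_c).
    `sigma` is an illustrative single noise scale.
    """
    noise = torch.randn_like(s_a) * sigma        # s_a~ = s_a + sigma * z
    s_a_noisy = s_a + noise
    target = (s_a - s_a_noisy) / sigma ** 2      # = -noise / sigma^2
    pred = score_net(torch.cat([s_a_noisy, s_c], dim=-1))
    return ((pred - target) ** 2).sum(dim=-1).mean()
```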

5.2. EVALUATION

We collect trajectories over 5 random seeds for evaluation. For each task, the trajectories start from the same initial states. We calculate the following metrics on the trajectories (for more details, see Appendix B): Task Return weighted by Success (TRS) calculates the averaged cumulative reward of an episode, $\mathbb E[\mathbb 1(\tau \text{ is success}) \sum_{s_t\in\tau} r_t]$, where $r_t$ denotes the immediate reward and $\mathbb 1(\tau \text{ is success}) = 1$ only when $\tau$ has no collision. Success Rate (SR) calculates the percentage of collision-free trajectories. Pseudo-Likelihood (PL) measures the similarity between a given state and the target distribution via a pseudo-likelihood function $F_{proxy}$ (Appendix B.1). Coverage Score (CS) measures the diversity and fidelity of the terminal states (Appendix B.2).

In Ball Rearrangement: 1) As illustrated in Fig. 5, DualGF achieves the best performance among the learning-based baselines in PL and ACN. On the hardest task, Circling + Clustering, DualGF even outperforms TarGF(ORCA), which plans with the ground-truth model. This demonstrates the ability of our method to scale to high-dimensional tasks. 2) The CS of DualGF is comparable with other competitive baselines such as TarGF(SAC); hence, our method is not trivially overfitting to a single mode. 3) The classifier-based methods (i.e., GAIL and RCE) fail due to training collapse of the classifier. In practice, we observe that the classifier-based baselines achieve appealing results during training yet fail to converge at a high performance level due to over-exploitation of the classifier.
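Before turning to Room Rearrangement, a small sketch of the TRS metric as defined above; the trajectory container is a hypothetical structure, not the paper's evaluation code.

```python
def task_return_weighted_by_success(trajectories):
    """TRS = E[ 1(tau is success) * sum_t r_t ], where a trajectory
    succeeds only if it contains no collision. Each trajectory is
    assumed to be a dict with 'rewards' and 'collisions' lists."""
    returns = []
    for tau in trajectories:
        success = 1.0 if sum(tau["collisions"]) == 0 else 0.0
        returns.append(success * sum(tau["rewards"]))
    return sum(returns) / len(returns)
```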


In Room Rearrangement: 1) DualGF achieves the best performance among the baselines in both CS and ACN. This indicates that our method generalises well to diverse conditional attributes such as different sizes, numbers, and compositions of objects. 2) The planning-based methods suffer from scalability issues: PRM and Rapidly-exploring Random Trees (RRT) (LaValle et al., 1998a) cannot find a feasible solution within 10 minutes. Besides, TarGF(ORCA) assumes the agent is circular in shape, which does not hold in Room Rearrangement. Hence, we do not compare against them in Room Rearrangement.

6.2. ABLATION STUDIES AND ANALYSIS

We further conduct ablation studies to analyse the effectiveness of the key components of our method, i.e., the support gradient field and the field-balancing mechanism. Further ablations on the key hyperparameters (i.e., the initial λ and c′), the sizes of the target and support sets, and the choice of the noise level t are deferred to Appendices D.1, D.2, and D.3, respectively.

7. CONCLUSION

In this study, we first reformulate the path planning problem in a novel data-driven paradigm where the agent learns to plan from two sets of examples. We then propose a novel score-based planning framework (DualGF) to tackle this problem. There are two gradient fields in DualGFs: a target gradient field that guides task completion and a support gradient field that keeps the motion within physical constraints. Moreover, an adaptive gradient-balancing mechanism is introduced to combine the merits of the two fields. Experiments demonstrate the scalability of our method, and empirical results show that it significantly outperforms state-of-the-art baselines in efficiency and safety. Limitations and Future Work: The main limitations of this work are the local-minimum issue discussed in Appendix D.6 and the holonomic assumption made in Sec. 4.3. Another minor issue is the inherent instability of the Lagrangian relaxation, which can be mitigated via a small learning rate and gradient clipping, as discussed in Appendix D.7. In the future, we may incorporate our method with MCMC-based approaches such as Langevin dynamics to overcome the local-minimum issue. Besides, we may extend our method to non-holonomic scenarios by integrating model-predictive control (MPC).

A TASK DETAILS

A.1 TRACKING AND NAVIGATION

Navigation and Tracking require the agent to reach a static location or track a dynamic location while avoiding collisions with obstacles. There are four rectangular obstacles and a particle agent. The scale, position, and orientation of the obstacles are randomised before each episode in both tracking and navigation. The goal location is randomly reset at the start of each episode; in tracking, the goal location is further reset whenever it is reached by the agent. In this task, the target distribution is a Gaussian distribution centred on the goal location, i.e., $\mathcal N(g, 0.1^2 I)$. The state space is the concatenation of the agent's position, the goal location, and geometry observations of the obstacles. The action space is a two-dimensional linear velocity.

State and Action Spaces: $s = [s_a, s_c]$, where the agent-state is the 2-D location $s_a = [x, y] \in [-1, 1]^2$ and the condition-state is the 2-D goal location and the obstacles' states, $s_c = [g, o_1, o_2, o_3, o_4]$, with $g \in [-1, 1]^2$ and $o_i$ denoting the state of the $i$-th obstacle. The obstacle state consists of the normalised position, orientation, and side length, $o_i = [pos, ori, side]$, where $pos \in [-1, 1]^2$, $ori = [\sin(yaw), \cos(yaw)] \in [-1, 1]^2$, and $side \in [-1, 1]$.

Horizon: 100.

Initial Distribution: Before each episode, we first randomly initialise 4 box-shaped obstacles and then randomly sample an agent state that does not collide with the obstacles.

Dynamics:

The floor and walls are absolutely smooth planes. The arena is a 0.25m × 0.25m area, and the agent's radius is 0.015m. We set the friction coefficients of all obstacles and agents to 10.0 to penalise collisions.

Target Examples: To sample a target example, we first randomly initialise the obstacles' locations and sizes and the goal location. Then, we sample the agent's location by repeatedly sampling from a Gaussian distribution centred on the fixed goal location until the location is collision-free. To ensure the target score network works well, we sample 100,000 examples for training.

Support Examples: To sample a support example, we first randomly initialise the obstacles' locations and sizes and the goal location. Then, we sample the agent's location by repeatedly sampling from the uniform distribution over the configuration space until the location is collision-free. To collect support examples, we reset the environment 200 times; for each reset, we collect 1,000 examples under the same condition. Thus, the number of support examples is 200,000 in total.

A.2 BALL REARRANGEMENT

Ball Rearrangement includes three sub-tasks: Circling, Clustering, and their hybrid, Circling + Clustering. In all sub-tasks, there are three sets of 7 balls belonging to different categories, i.e., red, green, and blue. The state and action spaces are the joint positions and linear velocities of all balls, respectively. The agent is required to rearrange all balls to maximise the target likelihood of the joint state while avoiding collisions. The target distributions of the tasks are interpreted as follows: Clustering requires all balls to form three clusters by colour; both Circling and Circling + Clustering require all balls to form a circle that can be centred anywhere in the arena, except that the latter further requires balls of the same type to be adjacent to each other.

State and Action Spaces:

There are 21 balls in the environment. The state space is the concatenation of the sub-states of all balls: $s = [s_1, s_2, ..., s_{21}]$. For each ball, the sub-state is $s_i = [s_a, s_c]$, with $s_a = [x, y]$ and $s_c = [c]$, where $x, y \in [-1, 1]$ denote the two-dimensional position and $c \in \{0, 1, 2\}$ is a class label. The action space is likewise the concatenation of the sub-actions of all balls: $a = [a_1, a_2, ..., a_{21}]$, where $a_i \in [-1, 1]^2$ is a 2-dimensional linear velocity. Horizon: 300 for Circling + Clustering and 100 for Circling and Clustering.

Initial Distribution:

We first uniformly sample rough locations for each ball and then eliminate overlaps between these positions by running the physical simulation.

Dynamics: The floor and walls are absolutely smooth planes. All balls are bounded in a 0.3m × 0.3m area, with a radius of 0.025m. We set the friction coefficients of all balls to 100.0, since we observe that a small friction coefficient (e.g., not larger than 1.0) does not significantly affect the dynamics. Besides, to increase the complexity of the dynamics, we set the masses and restitution coefficients of all green and blue balls to 0.1 and 0.99, respectively, and those of all red balls to 10 and 0.1, respectively. We observe that under these dynamics, collisions can significantly harm the efficiency of the rearrangement process. Hence, the agent has to adapt to the dynamics for more efficient object rearrangement.

Target Examples: For Circling, we first randomly sample a feasible centre in the free space. Given the centre, we further randomly sample a feasible radius and an orientation for the target circle. Given the centre, radius, and orientation, we obtain feasible locations for the balls to form a circle and randomly assign these positions to the balls (a sketch of this generator is given below). For Clustering, we directly sample initial ball locations from the Gaussian Mixture Model in Eq. 15 and then step the physical simulation to remove collisions between balls; the final state serves as a target example. Circling + Clustering is similar to Circling: we first sample feasible locations for the balls to form a circle, then randomly sample a 'colourisation starting point' and, starting from this point, colourise the balls in R-G-B or R-B-G order (50% for each order). For these three tasks, we collect 100,000 target examples. However, we demonstrate in Appendix D.2 that the number of target examples can be reduced to 100 in Clustering.

Support Examples: When sampling support examples, we increase the number of balls from 3 × 7 to 3 × 10 to increase the ball density in the arena. Since our score network is implemented as a graph neural network (Appendix C.1), it transfers across different input dimensions. To sample a support example, we uniformly sample locations for all balls independently and then step the physical simulation to remove collisions between balls. We repeat this procedure to obtain 100,000 examples. As shown in Appendix D.2, 50,000 support examples also work well.
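As referenced above, a minimal sketch of the Circling target-example generator. The arena bounds and feasibility margins are simplified assumptions, not the exact simulator logic.

```python
import numpy as np

def circling_target_example(n_balls=21, rng=np.random.default_rng()):
    """Generate ball positions forming a circle (Circling target example).

    Radius and centre ranges are illustrative; the real generator
    rejects centres/radii that would put any ball outside the arena.
    """
    radius = rng.uniform(0.05, 0.12)                    # feasible circle radius
    center = rng.uniform(-0.15 + radius, 0.15 - radius, size=2)
    theta0 = rng.uniform(0.0, 2 * np.pi)                # circle orientation
    angles = theta0 + 2 * np.pi * np.arange(n_balls) / n_balls
    positions = center + radius * np.stack(
        [np.cos(angles), np.sin(angles)], axis=1)
    return rng.permutation(positions)                   # random ball assignment
```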

A.3 ROOM REARRANGEMENT

Room Rearrangement is built on 3D-FRONT (Fu et al., 2021), a large-scale synthetic indoor-scene dataset. Following (Wu et al., 2022), we use 756 of 839 rooms for training and 83 for testing. The agent is required to learn scene priors from the target examples (i.e., the training set) and rearrange objects into reasonable layouts during the test phase. The state space is the concatenation of the observations of each object and the room-boundary information. The action space is the concatenation of the two-dimensional linear velocity and one-dimensional angular velocity of each object.

Dataset and Simulator: We clean the 3D-FRONT dataset (Fu et al., 2021) to obtain bedrooms that consist of four walls and three to eight objects. We augment each room by flipping it two times and rotating it four times to get eight augmentations. We then import these rooms into iGibson (Shen et al., 2021) to run the physical simulation. The 756×8 training rooms provide the target examples used for training the target score network, the classifier-based baselines, and the VAE in the goal-conditioned baselines. The other 83×8 rooms are used to initialise the rooms in the test phase, so the room rearrangement task is performed on the test set. The rooms in the training set are only used for learning prior knowledge about arranging a room.

State and Action Spaces:

The state consists of an aspect ratio $r_a \in \mathbb R^+$ and an object state $s_o \in \mathbb R^{K\times 6}$, where $K$ denotes the number of objects. The aspect ratio is $r_a = \tanh(b_x / b_y)$, where $b_x$ and $b_y$ denote the horizontal and vertical wall bounds. The object state is the concatenation of the sub-states of all objects, $s_o = [s_1, s_2, ..., s_i, ..., s_K]$, where the sub-state of the $i$-th object $s_i \in \mathbb R^6$ consists of a 2-D position, a 1-D orientation, a 2-D bounding box, and a 1-D category label. The action is likewise a concatenation of the sub-actions of all objects, $a_o = [a_1, a_2, ..., a_i, ..., a_K]$; for the $i$-th object, the action $a_i \in \mathbb R^3$ consists of a 2-D linear and a 1-D angular velocity. The whole action space is normalised into a $3 \times K$ dimensional unit box $[-1, 1]^{3\times K}$ by the velocity bounds. Note that the agent-state $s_a$ in this task is the concatenation of the objects' positions and orientations, while the condition-state $s_c$ is the aspect ratio together with the concatenation of the objects' labels and 2-D bounding boxes.

Horizon: Each training episode contains 100 steps, instead of the 250 steps used in Wu et al. (2022).

Initial Distribution: To guarantee that the initial state can reach the high-density region of the target distribution, we sample an initial state in two stages: first, we sample a room from the 83×8 rooms in the test set; then we perturb this room by 1,000 Brownian steps.

Dynamics: For more efficient environment resets and physical simulation, we build a 'proxy simulator' based on PyBullet (Coumans & Bai, 2016-2021) to replace the iGibson simulator. We use iGibson to load and save the metadata of each room, then reload these rooms in the proxy simulator, where each object is replaced by a box with the same geometry. We set the friction coefficients of all objects in the room to zero, as the room dynamics are complex enough.

Target Examples: We use the 756×8 rooms from the training set as target examples.

Support Examples: Starting from each state in the target examples, we perturb the state via Brownian steps. Every 800 steps, we collect a state and then step the simulation to turn it into a feasible support example. For each room, we collect 100 examples. After cleaning the data, we obtain a support set of 448,754 examples.

B METRIC DETAILS

Here we introduce additional details of PL and CS.

B.1 PSEUDO-LIKELIHOOD (PL)

PL measures the similarity between a given state and a target distribution via a pseudo-likelihood function $F_{proxy} : S \to \mathbb R^+$. At each time step $t$, the PL curve reports the averaged PL across all trajectories, $\mathbb E[F_{proxy}(s_t)]$; hence, the PL curve indicates the efficiency and the final performance of each method. We do not report PL curves on Room Rearrangement, since human preferences and scene priors are hard to describe by hand-engineering; the PL curve is only reported on Ball Rearrangement. The specific pseudo-likelihood functions are as follows:

Circling: The pseudo-likelihood function is $F_{proxy}(s) = \exp(-(\sigma_\theta + \sigma_r))$, where $\sigma_\theta$ and $\sigma_r$ denote the standard deviations of the angles between adjacent balls and of the distances from each ball to the centre of gravity of all balls, respectively. Intuitively, if a set of balls is arranged into a circle, then $\sigma_r$ and $\sigma_\theta$ are close to zero, achieving a higher pseudo-likelihood.

Clustering: Different from (Wu et al., 2022), our target distribution is a six-mode Gaussian instead of a two-mode Gaussian. Defining the joint centre positions as a latent variable $C = (C_r, C_g, C_b)$, where $C_r$, $C_g$, and $C_b$ denote the centres of the red, green, and blue balls respectively, and the above six modes as $\{c_i\}_{1\le i\le 6}$, $C$ obeys a categorical distribution $p(C = c_i) = \tfrac{1}{6}$. The pseudo-likelihood function is a Gaussian Mixture Model:

$$p_{GMM}(s) = \sum_{1\le k\le 6} p(C=c_k)\, p(s\,|\,C=c_k), \quad p(s\,|\,C=c_k) = \prod_{1\le i\le K/3}\mathcal N(C_r^k, 0.05I)(s_i)\prod_{K/3 < i\le 2K/3}\mathcal N(C_g^k, 0.05I)(s_i)\prod_{2K/3 < i\le K}\mathcal N(C_b^k, 0.05I)(s_i)$$

Circling+Clustering: The pseudo-likelihood function is $F_{proxy}(s) = \exp(-(\sigma_\theta + \sigma_r)) \cdot \exp(-(\sigma_R + \sigma_G + \sigma_B) - \sigma_C)$, where $\sigma_R$, $\sigma_G$, and $\sigma_B$ denote the standard deviations of the angles between adjacent red, green, and blue balls respectively, and $\sigma_C$ denotes the standard deviation of the positions of the red, green, and blue centres. Intuitively, the first term $\exp(-(\sigma_\theta + \sigma_r))$ measures the pseudo-likelihood of the balls forming a circle, and the second term $\exp(-(\sigma_R + \sigma_G + \sigma_B) - \sigma_C)$ measures the pseudo-likelihood of the balls being clustered into three piles.
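To illustrate, a minimal sketch of the Circling pseudo-likelihood under the definition above; the helper names and angle/radius statistics are our own reading of that definition.

```python
import numpy as np

def circling_pl(positions):
    """F_proxy(s) = exp(-(sigma_theta + sigma_r)) for Circling.

    `positions` is an (N, 2) array of ball positions. sigma_r is the
    std of distances to the centroid; sigma_theta is the std of angular
    gaps between adjacent balls (sorted by polar angle)."""
    center = positions.mean(axis=0)
    offsets = positions - center
    sigma_r = np.linalg.norm(offsets, axis=1).std()

    angles = np.sort(np.arctan2(offsets[:, 1], offsets[:, 0]))
    gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
    sigma_theta = gaps.std()
    return np.exp(-(sigma_theta + sigma_r))
```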

B.2 COVERAGE SCORE (CS):

CS measures the diversity and fidelity of the terminal states $S_T$. CS reports the Minimal Matching Distance (MMD) (Achlioptas et al., 2018) between $S_T$ and a fixed set of examples $S_{gt}$ drawn from $p_{tar}$:

$$\mathrm{CS} = \sum_{s_{gt}\in S_{gt}} \min_{s_T\in S_T} ||s_{gt} - s_T||$$

This metric can detect mode collapse: if an agent only moves objects to a single mode, the terminal states may be far from the other modes, leading to a high value of CS (i.e., bad performance).

For Ball Rearrangement: To calculate the coverage score, we sample fixed example sets from the target distribution serving as $S_{gt}$ for Circling, Clustering, and Circling+Clustering, respectively: 20 examples for Circling and Circling+Clustering and 50 examples for Clustering. Since the balls of the same category can be viewed as a two-dimensional point cloud, we measure the distance between two states by summing the Chamfer Distances (CDs) between the piles of balls of each category.

For Room Rearrangement: The coverage score is averaged over room conditions, since the state dimension differs across rooms. For each of the 83 test rooms, we calculate the coverage score between the eight ground-truth states and the eight rearrangement results; the average over the 83 rooms is taken as the final coverage score of a method. We measure the distance between two states by the average L2 distance between the positions of corresponding objects.
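A minimal sketch of this coverage score; `dist` is a placeholder for the task-specific state distance (summed per-category Chamfer distance for balls, per-object L2 for rooms).

```python
def coverage_score(terminal_states, gt_states, dist):
    """MMD-style coverage score: for each ground-truth example, take
    its distance to the closest terminal state, then sum over S_gt
    (the paper's formula writes an unnormalised sum)."""
    return sum(min(dist(s_gt, s_T) for s_T in terminal_states)
               for s_gt in gt_states)
```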

C IMPLEMENTATION DETAILS

For a fair comparison, all learning-based methods, including ours, are implemented with the same network architecture design and essentially the same capacity (e.g., hidden dimensions).

C.1 DUALGF

Training Objective: The complete training objective for both the target and the support networks is the SDE-based score-matching objective proposed by Song et al. (2020):

$$\mathbb E_{t\sim\mathcal U(0,1)}\,\mathbb E_{s(0)\sim p_{data}(s)}\,\mathbb E_{s(t)\sim p_{0t}(s(t)|s(0))}\Big[\big|\big|\Phi_\theta(s(t), t) - \nabla_{s(t)}\log p_{0t}(s(t)\,|\,s(0))\big|\big|_2^2\Big] \qquad (16)$$

where $p_{0t}(s(t)\,|\,s(0)) = \mathcal N\big(s(t);\, s(0),\, \tfrac{1}{2\log\sigma}(\sigma^{2t}-1)I\big)$ and $\sigma = 25$ is a hyper-parameter. Using this objective, we obtain estimated scores of the noise-perturbed data distribution $p^t_{data}(s(t)) = \int p_{0t}(s(t)\,|\,s(0))\, p_{data}(s(0))\, ds(0)$ at different noise levels simultaneously. In this way, we can conveniently assign a different noise level $t$ to each task. For all tasks, our method uses $t = 0.01$.

Network Architecture: In Navigation and Tracking, our score networks are simply implemented as fully connected layers. We first encode the agent position $s_a$ and the noise level $t$ into feature vectors $f_a$ and $f_t$. The state of each obstacle $o_i$ is encoded by a shared linear layer into $f_{o_i}$, and we pool $\{f_{o_i}\}_{i=1}^4$ into a single vector $f_o$. The support score network concatenates $f_a$, $f_t$, and $f_o$ and feeds them to a linear fusion layer to produce the score output. The target score network additionally encodes the goal location into $f_g$ and concatenates $f_a$, $f_t$, $f_g$, and $f_o$ before the linear fusion layer. In Ball Rearrangement and Room Rearrangement, the architecture of the target score network is exactly the same as the score network of Wu et al. (2022) used in Clustering and Circling+Clustering. The support score network is exactly the same as the score network of Wu et al. (2022) used in Circling, which does not encode the class label into the initial feature.

Reproduction: To fully reproduce the results and check the details, we recommend running our code released on Anonymous GitHub: https://anonymous.4open.science/r/ICLR23AnonymousCode-1426.
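For reference, a short sketch of sampling from this perturbation kernel with σ = 25 as stated above. The score-network call is omitted, and t should be strictly positive to avoid a degenerate (zero) variance.

```python
import math
import torch

SIGMA = 25.0  # hyper-parameter from Eq. 16

def perturb(s0: torch.Tensor, t: float):
    """Sample s(t) ~ p_0t(s(t)|s(0)) = N(s(0), (SIGMA^{2t}-1)/(2 ln SIGMA) I)
    and return it with the regression target grad log p_0t = -noise/std."""
    var = (SIGMA ** (2.0 * t) - 1.0) / (2.0 * math.log(SIGMA))
    std = math.sqrt(var)                 # requires t > 0
    noise = torch.randn_like(s0)
    s_t = s0 + std * noise
    target = -noise / std
    return s_t, target
```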

C.2 BASELINES

C.3 TARGF

The training objective of TarGF is the same as Eq. 16, and we set t = 0.01 in all experiments for a fair comparison. For TarGF(SAC): The implementation is based on the code released by the authors, except that we set t = 0.01 for both the reward estimation and the residual policy for a fair comparison. For TarGF(ORCA): We set τ = 0.1 and the simulation duration of each time step to ∆t = 0.02. For each agent (object), ORCA only considers the 2 nearest agents as neighbours, since in our experiments ORCA often has no solution when the number of neighbours is larger than 2.

C.4 CLASSIFIER-BASED BASELINES

These baselines refer to the RCE and GAIL in our experiments. RCE is implemented based on the code released by RCE's authors (Eysenbach et al., 2021). We only modify γ = 0.95, decrease the training steps to 0.5 million (i.e., the same number of training steps as the other methods), and change the model architecture: the actor and critic networks are implemented the same as ours (i.e., the same feature-extraction layers and edge-convolutional layers). GAIL trains two classifiers by discriminating the two example sets against the agent's rollouts, respectively. The discriminator takes only one state instead of two adjacent states as input, since the 'ground-truth' reward (i.e., the likelihood) is defined on the current state. At each training step, we update the discriminator by distinguishing between the agent's and the expert's states (for one step) and then update the RL policy under the reward given by the discriminator (for one step).
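To make the described alternating update concrete, a compact sketch of one state-only discriminator step; the network, optimiser, and tensors of states are assumed placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, opt, expert_states, agent_states):
    """One update of a state-only GAIL discriminator: expert states are
    labelled 1, the agent's rollout states 0. The resulting logit then
    serves as the reward for the RL policy update that follows."""
    logits_exp = disc(expert_states)
    logits_agt = disc(agent_states)
    loss = (F.binary_cross_entropy_with_logits(
                logits_exp, torch.ones_like(logits_exp))
            + F.binary_cross_entropy_with_logits(
                logits_agt, torch.zeros_like(logits_agt)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```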

C.5 RL(SAC) IN NAVIGATION AND TRACKING

This is implemented as an ablation of GAIL that replaces the classifiers with ground-truth task rewards.

C.6 PRM

We implement PRM from the off-the-shelf open-source repository PythonRobotics (Sakai et al., 2018), which has 16.3k stars; this implementation is also used for the navigation tasks. We keep the default configuration and properly rescale our configurations to fit the API.

D.2 ABLATIONS ON THE SIZES OF THE TARGET AND SUPPORT SETS

The performance drops when the number of support examples falls below 50,000. This is reasonable, since the volume of the free space increases exponentially with the dimension of the state space; thus, it is infeasible to learn a good support score network from few examples. In our cases, using 50,000 to 100,000 support examples works well for the 42-dimensional state space (i.e., 21 balls).

D.3 ABLATIONS ON THE CHOICE OF NOISE LEVEL OF THE SCORE NETWORK

This ablation study is conducted on Clustering. For simplicity, we fix λ_t at its initial value.

Target Score Network: We fix the noise scale of the support score network at t = 0.01 (the same as DualGF in the main paper) and compare the performance of target score networks conditioned on different noise scales in Figure 12. As the noise level decreases, the performance on PL and ACN drops slightly. This indicates that the target score network can still provide useful target gradients under smaller noise levels. However, we observe that a target score network conditioned on a smaller noise scale seems to be more 'unsafe'; thus, we have to specify a larger initial value of λ_t to regularise the target gradient. For t = 0.005, 0.001, 0.0005, 0.0001, we specify λ_t = 4, 10, 15, 20, respectively.

Support Score Network: Similarly, we fix the noise scale of the target score network at t = 0.01 (the same as DualGF in the main paper) and λ_t = 3.0 for all experiments. We then compare the performance of support score networks conditioned on different noise scales in Figure 13. As the noise level decreases, the performance on PL drops significantly while ACN remains comparable to the original DualGF. In practice, we observe that a support score network conditioned on a smaller noise scale tends to be more 'conservative', i.e., the support gradients are of such large magnitude that they overwhelm the target gradients.

We also report the planning success rate (SR) on ball and room rearrangement. Since the number of objects in these environments is quite large, we relax the success threshold to 'fewer than N collisions in an episode', where N denotes the number of objects in each episode. As shown in Figure 15, DualGF achieves comparable performance with the model-based approach TarGF(ORCA) and outperforms all the other learning-based baselines, except in Room Rearrangement.

Table: Comparative results on Navigation and Tracking, comparing the image-based implementation of DualGF with state-based baselines. Each cell reports the two metrics of the original table as mean ± std.

| Method | Navigation (Static) | Navigation (Dynamic) | Tracking |
| RL (SAC) | 9.1 ± 0.6, 0.21 ± 0.04 | 8.3 ± 0.4, 0.19 ± 0.01 | 0.0 ± 0.0, 0.21 ± 0.02 |
| GAIL | 1.5 ± 0.4, 0.35 ± 0.01 | 1.6 ± 1.0, 0.20 ± 0.02 | 0.1 ± 0.1, 0.18 ± 0.04 |
| RCE | 2.2 ± 1.2, 0.13 ± 0.01 | 2.2 ± 0.4, 0.22 ± 0.04 | 0.1 ± 0.0, 0.17 ± 0.03 |
| TarGF (SAC) | 7.0 ± 0.2, 0.19 ± 0.02 | 4.4 ± 0.3, 0.19 ± 0.01 | 0.2 ± 0.1, 0.18 ± 0.04 |
| DualGF (Image) | 16.4 ± 1.0, 0.24 ± 0.02 | 15.7 ± 2.7, 0.26 ± 0.02 | 0.4 ± 0.0, 0.27 ± 0.01 |

Figure 16 illustrates the local-minimum problem. On the left side, there are regions in the dual gradient field where the target gradient and the support gradient cancel each other; moreover, the dual gradients around these regions point back towards them. Once the agent steps into such a region, it gets stuck and cannot get out. On the right side, we demonstrate another, implicit type of this problem, where the agent is stuck not by an obstacle but by itself: the target gradient leads a table and a shelf to move in opposite directions while the support gradient exerts a 'repulsive force' on them, so both the shelf and the table get stuck in the left corner. One potential solution is to leverage test-time sampling to increase the optimality of the planning results.




Figure 1: Our task setting. Left: The agent learns the task specification from target examples and the physical constraints from support examples during training. Right: The agent plans a path under novel conditions during the test phase.

Figure 2: Method Overview. a): A target and a support score network are trained from examples via the score-matching objective. b): DualGF computes the target and support gradients at each time step, where the target gradient guides task completion (i.e., moving objects towards a tidy layout) and the support gradient directs the objects to avoid collisions. DualGF adaptively mixes the two gradients via a field-balancing mechanism to output the dual gradient. Incorporated with a low-level controller, DualGF can interpret the dual gradient into an action.


Figure 4: Examples of four tasks used in the experiments. Navigation and Tracking: to reach/track a goal location while avoiding collision with obstacles. Ball Rearrangement: to rearrange balls into certain patterns. Room rearrangement: to rearrange the furniture into a reasonable layout.

Figure 5: Comparative Results on Ball Rearrangement and Room Rearrangement. The results on Ball Rearrangement are plotted in the first three columns. The CS and ACN results on Room Rearrangement are plotted in the last column. The order of the X-axis is fixed across CS and ACN bars.

Figure 6: Ablation results on Clustering, where ∆Metric = Metric(DualGF) − Metric(Ablated). As the mixing rate λ_0 decreases, the advantage of DualGF over the ablation increases significantly.


Figure 7: Visualisation of the three sub-tasks of Ball Rearrangement. (a): Clustering requires all balls to form three clusters by colour. (b): Circling requires all balls to form a circle, neglecting the colour order. (c): Circling + Clustering requires all balls to form a circle while balls of the same type are adjacent to each other.


Figure 10: Ablation results on the number of target examples.

Figure 12: Ablation results on the noise level of the target score network.

Figure 13: Ablation results on the noise level of the support score network.

Figure 15: Success rates of all methods. From left to right, we present results on Circling, Clustering, Circling+Clustering and Room respectively.

Figure 16: The local minimum problem of DualGF. Left: Case study on Ball Rearrangement. Right: Case study on Room Rearrangement.



Table: Inference time per step (in milliseconds), averaged over 100 trajectories.



C.7 POTENTIAL FIELD

This baseline is also implemented from the off-the-shelf open-source repository PythonRobotics (Sakai et al., 2018). We set the gain ζ = 1.0 for the attractive field and ζ = 1.0 for the repulsive field. The distance threshold of the repulsive field is set to Q* = 0.1.

D ADDITIONAL ANALYSIS

D.1 ABLATIONS ON MIXING RATE λ AND THRESHOLD c′

We seek to evaluate the sensitivity of our method to the two key hyperparameters, λ_0 and c′. The experiments are conducted on Clustering for better quantitative evaluation. We first evaluate DualGF with different levels of the initial λ_0. As shown in Fig. 8, PL and CS do not change significantly when λ_0 ≥ 1.0, which indicates that our method is not sensitive to λ once it is above a positive threshold. Further, we evaluate DualGF with different levels of the threshold c′. As shown in Fig. 9, PL and CS do not change significantly across all levels of the threshold, while ACN improves as the threshold increases. This indicates that our method is not sensitive to c′.

As noted in Section 7, the Lagrangian relaxation is inherently unstable. We can mitigate this problem by setting a small learning rate and clipping the gradient. When λ_t is properly updated, we obtain appealing results, as shown in Figure 17.

When $q_\sigma$ is a Gaussian kernel, we can derive the closed form of $\nabla_{\hat s}\, q_\sigma(\hat s\,|\,s)$:

$$\nabla_{\hat s}\, q_\sigma(\hat s\,|\,s) = \mathcal N(\hat s; s, \sigma^2 I)\,\frac{s - \hat s}{\sigma^2} = C(s, \hat s)\,(s - \hat s)$$

where $C(s, \hat s) > 0$ is a scalar function depending on $s$ and $\hat s$. Thus, $\nabla_{\hat s}\, q_\sigma(\hat s\,|\,s)$ essentially points in the denoising direction, i.e., $s - \hat s$. Further, we have:

$$\nabla_{\hat s}\, p_\sigma(\hat s) = \int \nabla_{\hat s}\, q_\sigma(\hat s\,|\,s)\, p(s)\, ds = \int C(s, \hat s)\,(s - \hat s)\, p(s)\, ds$$

Intuitively, even when $\hat s$ lies in a zero-density region of $p(s)$, i.e., $p(\hat s) = 0$, the gradient of the perturbed distribution $\nabla_{\hat s}\, p_\sigma(\hat s)$ still points in the averaged denoising direction towards the positive-density region. To summarise, the score of the perturbed distribution $\nabla_s \log p_\sigma(s)$ has the same direction as $\nabla_s p_\sigma(s)$:

$$\nabla_s \log p_\sigma(s) = \frac{1}{p_\sigma(s)}\,\nabla_s p_\sigma(s)$$

where $\frac{1}{p_\sigma(s)} > 0$ obviously. Thus, in zero-density regions of the original distribution, the score networks estimate the averaged denoising direction towards the positive-density region.

D.9 THE LIMITATION OF MANUALLY DESIGNING THE REWARD FOR BALL REARRANGEMENT

Here we take Circling as an example to demonstrate the non-triviality of example-based planning. We leverage the pseudo-likelihood function of Circling as a manually designed reward function to train an RL agent, namely Heuristic(SAC). As shown in Figure 18, our method achieves significantly higher PL than Heuristic(SAC), which indicates that our method is better in both sample quality and efficiency. Our method is also better than Heuristic(SAC) in CS, which indicates that our method is better in diversity and mode coverage.

As shown in the table below, our method still outperforms the learning-based baselines across all the tasks and metrics. However, this evaluation is also problematic, since the successful trajectories of different methods are of different difficulties. For instance, the Potential Field baseline has only a 28% Success Rate yet achieves the highest Task Return; we observe that this method only succeeds when the initial state is quite near the goal, which makes it easier to obtain a higher task return.

Table: Comparative results on Navigation and Tracking. The obstacles are always fixed in Navigation (Static) but change every episode in Navigation (Dynamic). Each cell reports the two metrics of the original table as mean ± std; the last two rows are the planning-based references, and the first row's method name is missing in the source.

| Method | Navigation (Static) | Navigation (Dynamic) | Tracking |
| | 15.7 ± 7.9, 0.13 ± 0.01 | 9.8 ± 0.1, 0.22 ± 0.04 | 0.4 ± 0.1, 0.17 ± 0.03 |
| TarGF (SAC) | 36.2 ± 3.9, 0.19 ± 0.02 | 0.7 ± 0.4, 0.19 ± 0.01 | 0.1 ± 0.0, 0.18 ± 0.04 |
| DualGF (wo Sup) | 0.0 ± 0.0, 0.07 ± 0.01 | 13.1 ± 4.9, 0.09 ± 0.03 | 0.2 ± 0.06, 0.11 ± 0.04 |
| DualGF | 81.3 ± 0.4, 0.37 ± 0.02 | 66.9 ± 7.3, 0.39 ± 0.05 | 6.0 ± 0.5, 0.56 ± 0.03 |
| PRM (Reference) | 85.0 ± 0.4, 0.45 ± 0.04 | 84.0 ± 0.1, 0.47 ± 0.04 | 25.7 ± 0.4, 0.16 ± 0.04 |
| Potential Field | 77.4 ± 1.8, 0.45 ± 0.00 | 84.3 ± 0.3, 0.28 ± 0.07 | 22.4 ± 2.4, 0.09 ± 0.02 |

