NEURO-SYMBOLIC PROCEDURAL PLANNING WITH COMMONSENSE PROMPTING

Abstract

Procedural planning aims to implement complex high-level goals by decomposition into simpler low-level steps. Although procedural planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack a deep understanding of the cause-effect relations in procedures. Previous methods require manual exemplars to acquire procedural knowledge from LLMs in the zero-shot setting. However, such elicited pre-trained knowledge in LLMs induces spurious correlations between goals and steps, impairing the model's generalization to unseen tasks. In contrast, this paper proposes a neuro-symbolic procedural PLANner (PLAN) that elicits procedural knowledge from the LLMs with commonsense-infused prompting. To mitigate spurious goal-step correlations, we use symbolic program executors on the latent procedural representations to formalize prompts from external knowledge bases as a causal intervention toward the Structural Causal Model of procedural planning. Both automatic and human evaluations on WikiHow and RobotHow show the superiority of PLAN on procedural planning without further training or manual exemplars.

1. INTRODUCTION

How to make a cup of coffee? As humans, we can easily specify a procedure to solve this task, using our innate ability of commonsense reasoning. However, can we endow machines with the same ability to construct a sequential plan? As depicted in Figure 1, procedural planning (Pearson, 1996; Zhang et al., 2020b; Huang et al., 2022) aims to decompose a high-level goal (Task: Watch TV) into a sequence of temporally extended steps (Procedural Plan: Step at all five time-steps). We study procedural planning as a conditional text generation problem since it resembles real-world scenarios. Previous approaches (Huang et al., 2022; Ahn et al., 2022) require a small number of carefully written or held-out exemplars to acquire procedural knowledge. However, these manual exemplars, derived from task data, cannot cover the ever-changing task setups and the flexible dependency relations among goals and steps. In fact, the biased data may cause the model to learn spurious correlations and hinder it from generalizing well in zero-shot scenarios. Studies in cognitive science show that humans rely on chunking mechanisms (Gobet et al., 2001; Miller, 1956), which turn primitive stimuli into conceptual groups, to solve novel and complex problems. Inspired by this, we hypothesize that generalizable procedural planning ability can be achieved by learning cause-effect relations among complex goals and simpler steps using external knowledge. To reveal the cause-effect relations in procedural planning, we devise a Structural Causal Model (SCM) (Peters et al., 2017), a directed acyclic graph commonly used to describe the causal relationships within a system (Pearl, 2009). As depicted in Figure 2, the pre-trained knowledge (D) in LLMs (e.g., "TV" and "living room" are highly correlated) confounds the system (D influences T, S_{i-1} and S_i, resulting in spurious correlations) into making biased decisions toward an unreasonable step (e.g., Find Television).
Thus, we adopt front-door adjustment (definition in Appendix A.3), which utilizes a mediator (P_i) that blocks all directed paths from the cause (T or S_{i-1}) to the effect (S_i). In this way, T (or S_{i-1}) affects S_i only through indirect paths: T (or S_{i-1}) affects P_i, and P_i affects S_i. We can then identify the causal effects among goals and steps by investigating the indirect effect (Equation 3), which is computed by multiplying the effect of T (or S_{i-1}) on P_i (Equation 1) with the effect of P_i on S_i (Equation 2). With the above front-door adjustment, we can mitigate the spurious correlations (e.g., between "television" and "living room") and thus make reasonable decisions on steps (e.g., Find book). Please refer to Appendix A.1 for causal preliminaries (including explanations of SCM, confounder, mediator, and spurious correlations), and Appendix A.3 for the front-door adjustment definition. Guided by the above causal analysis of procedural planning, we need to construct the mediator P_i and then intervene on the task T and the prompt P_i, which is required to compute the conditional probability in Equation 3. As depicted in Figure 3, we seek to automatically construct commonsense-infused prompts as the mediator P_i by concatenating the task and previous steps with commonsense knowledge extracted from external resources (e.g., ConceptNet (Speer et al., 2017)). First, we modify the goal input by sampling a task-relevant knowledge subgraph (Stage 1 in Section 3.1) to implement interventions on T. Then, we modify the prompt by adapting the edge weights to implement interventions on P_i (Edge-Wise Adaption of Stage 2 in Section 3.1). However, directly incorporating knowledge in graph form into LLMs loses the logical order needed to elicit procedural knowledge from LLMs. Thus, we apply symbolic executors (Mao et al., 2019; Yi et al., 2018) that execute a sequential mapping program on latent knowledge representations (e.g., the subevent-of relation).
In this way, we translate graph-structured knowledge into natural language that preserves procedural structure, such as the sequential order of two low-level steps (Symbolic Structuring of Stage 2 in Section 3.1). The procedural prompt P_G (e.g., "please get the remote control") is further translated into an admissible one P̂_G (e.g., "grab remote control") drawn from the available steps in a given domain (RobotHow or WikiHow in our case). Finally, we utilize the commonsense-infused prompt P̂_G to control the generation of procedural plans in LLMs in a zero-shot setting (Section 3.2). We conduct experiments on RobotHow (Puig et al., 2018) and WikiHow (Koupaee & Wang, 2018) under original and counterfactual situations. Our major contributions can be summarized as: • We develop the first causal framework for procedural planning by 1) defining a temporally extended Structural Causal Model and 2) resolving spurious correlations between high-level goals and low-level steps via front-door adjustment with a prompt-based mediator. • We propose a neuro-symbolic approach to construct commonsense-infused prompts for LLMs to tackle the procedural planning task without manual exemplars or further training. • Extensive evaluations show the superiority of PLAN in reasoning about the cause-effect relations among goals and steps and achieving promising planning ability.

2. EXTERNAL KNOWLEDGE MATTERS IN PROCEDURAL PLANNING

As depicted in Figure 1, procedural planning requires generating the Plan (e.g., Step 1: Walk to the living room.) conditioned on the Task (e.g., Watch TV). We first describe the problem definition, then show why external knowledge matters in procedural planning through the lens of causality, and finally show how we elicit procedural ability from the Large Language Models (LLMs).

[Figure 2: (a) The full temporal causal graph. T denotes the task query, and S_i is the sub-goal step at timestep i. D is the unobservable confounding variable introduced by the LLMs. P_i denotes the mediating variables we construct to mitigate the spurious correlation. (b) The SCM at timestep i. Without causal intervention, the model produces the sub-goal step "find television" due to the spurious correlation between "television" and "living room" caused by the confounding variable D. With our causal intervention, the constructed mediating variable P_i (Section 3.1) blocks the backdoor paths for T → S_i and S_{i-1} → S_i (opened by D), and the model generates the causal sub-goal "find book" precisely (Section 3.2).]

2.1. PROBLEM DEFINITION

Given a high-level task T (e.g., watch television in the living room) sampled from a task domain M_T (e.g., RobotHow), a procedural planner aims to decompose it into a sequence of lower-level, temporally extended steps S_T = {S_1, ..., S_n | S_i ∈ S}, where S is the fixed set of admissible steps constrained by the task domain M_T (e.g., the affordance of the interacted objects). The step S_i at timestep i is generated as π(S_i | T, S_0:i-1).

2.2. A CAUSAL LOOK AT PROCEDURE PLANNING WITH LLMS

We seek to empower the LLMs with the ability to reason about cause-effect relations in procedural planning. Thus, we devise a causal framework by first defining a Structural Causal Model (SCM) of procedural planning in Figure 2. The SCM describes the temporal dynamics and procedural cause-effect relationships. Our causal assumption in the SCM indicates that there is a backdoor path from task to step, which must be blocked with front-door adjustment. Therefore, we model the input prompt as a mediator created from external knowledge. More specifically, we define our Full Temporal Causal Graph as in Figure 2a, which is an unrolled Structural Causal Model (SCM) for sequential decision-making. Our goal is to identify the causal relations between the attended task T and the plan procedures S_T = {S_1, S_2, ...} from LLMs. Initially, there are direct paths T → S_i and S_k → S_i, k < i, because S_i relies on the LLM-attended task entities and previously accomplished steps. D is an unobserved confounder arising from knowledge learned during pre-training. D builds a backdoor path between T and S_i and misguides the LLMs to attend to false entities when generating the next step (see Figure 2b). Note that D is unobservable since we directly adopt the LLM without knowing its pre-training data. To mitigate the spurious correlation, we then introduce a mediator P_i for each S_i, as shown in Figure 2a. To achieve our front-door adjustment, we inject external knowledge into LLMs with a neuro-symbolic approach in three stages, described in Section 3.1.

3. OUR APPROACH

Although LLMs have strong general language intelligence, they still perform poorly in reasoning about the cause-effect relations in procedural plans due to a lack of daily-life experience. We propose to elicit unbiased procedural planning knowledge from the LLMs using the created commonsense-infused Prompt P as π(S_i | T, S_0:i-1, P). [Figure 3: PLAN performs procedural planning in a five-stage manner; the example admissible knowledge prompt yields Step 1: walk to the living room. Step 2: grab remote control. Step 3: sit on chair. Step 4: switch on remote control.] We illustrate the commonsense-infused prompt construction (the first three stages) in Section 3.1 and planning with LLMs (the last two stages) in Section 3.2.

3.1. COMMONSENSE-INFUSED PROMPT CONSTRUCTION

Overview Inspired by the causal analysis in Section 2.2, we construct a commonsense-infused Prompt P that helps reveal the cause-effect relations among goals and steps during procedural planning in three stages: 1) Stage 1 samples a subgraph G_s from the external knowledge base G by extracting task(T)-relevant nodes. 2) Stage 2 adapts the edge weights E_w in G_s and applies symbolic structuring to get the admissible knowledge prompt P̂_G. 3) Stage 3 acquires the temporal order by temporally aggregating the prompt P_i with previous steps S_0:i-1.

Stage 1: Task-Relevant Knowledge Subgraph Sampling First, we investigate the causal effects T → P_i and S_{i-1} → P_i (Figure 2). S_i is a collider that blocks the association between D and P_i in the path T ← D → S_i ← P_i. Let π_i denote π(·|P_{i-1}), the probability density function conditioned on P_{i-1}. Since there is no backdoor path for T → P_i, and similarly for S_{i-1} → P_i, the conditional probabilities after applying do-operators are simply:

π_i(P_i = p | do(T)) = π_i(P_i = p | T),    π_i(P_i = p | do(S_{i-1})) = π_i(P_i = p | S_{i-1})    (1)

We achieve the do-operation in a prompting way by modifying the goal input so that the model attends to the task-relevant entities. To implement this, we use NLTK to tokenize and POS-tag the task text T. We then use the nouns (e.g., television), noun phrases (e.g., remote control), and verb phrases (e.g., watch television) as concept nodes. In this way, the task name T is semantically parsed into the concept set T_E. Each concept e ∈ T_E is used as a query for sampling the H-hop task-relevant subgraph G_s ⊆ N_e × R_s × N_e from the external knowledge base G ⊆ N × R × N, where N and R denote the sets of concept nodes and commonsense relations, respectively.
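Stage 1 can be sketched as follows. The paper uses NLTK POS tagging and ConceptNet; the self-contained sketch below substitutes a toy noun/verb lexicon and a tiny in-memory triplet list, so all lexicon entries, triplets, and weights are illustrative assumptions, not the actual resources.

```python
from collections import deque

# Toy stand-ins: the paper uses NLTK POS tagging and ConceptNet; here a
# hand-written lexicon and an in-memory edge list play those roles
# (all entries below are illustrative).
NOUNS = {"television", "living", "room", "remote", "control"}
VERBS = {"watch"}

def extract_concepts(task):
    """Semantically parse a task name into a concept set T_E
    (nouns, adjacent noun phrases, and verb phrases)."""
    tokens = task.lower().split()
    concepts = {t for t in tokens if t in NOUNS}
    for a, b in zip(tokens, tokens[1:]):
        if a in NOUNS and b in NOUNS:
            concepts.add(f"{a} {b}")   # noun phrase, e.g. "living room"
        if a in VERBS and b in NOUNS:
            concepts.add(f"{a} {b}")   # verb phrase, e.g. "watch television"
    return concepts

# Knowledge base G as (head, relation, tail, weight) triplets.
TRIPLETS = [
    ("television", "AtLocation", "living room", 1.0),
    ("remote control", "UsedFor", "television", 0.8),
    ("living room", "UsedFor", "reading", 0.5),
    ("book", "AtLocation", "living room", 0.6),
]
HOUSEHOLD_RELATIONS = {"AtLocation", "UsedFor"}  # drop linguistic relations

def sample_subgraph(concepts, h_hops=2):
    """H-hop BFS from each query concept, keeping household-domain
    relations only (the Stage 1 task-relevant subgraph G_s)."""
    frontier = deque((c, 0) for c in concepts)
    seen, subgraph = set(concepts), []
    while frontier:
        node, depth = frontier.popleft()
        if depth >= h_hops:
            continue
        for head, rel, tail, w in TRIPLETS:
            if head == node and rel in HOUSEHOLD_RELATIONS:
                subgraph.append((head, rel, tail, w))
                if tail not in seen:
                    seen.add(tail)
                    frontier.append((tail, depth + 1))
    return subgraph

concepts = extract_concepts("watch television in the living room")
gs = sample_subgraph(concepts)
```

The extracted concepts here include "living room" and "watch television", and the sampled subgraph keeps only the household-domain triplets reachable from them.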
When extracting G_s, we keep triplets whose relation types are in the household domain (e.g., AtLocation, UsedFor) and filter out those in the linguistic domain (e.g., DistinctFrom, DerivedFrom) for the procedural planning task. N_e is maintained as a set of top-k task-relevant nodes using the weight of each edge R_e, which is updated with the edge-wise adaption in Stage 2.

Stage 2: Edge-Wise Adaption and Symbolic Structuring Second, we need to find the causal effect of P_i → S_i. Since the path P_i ← T ← D → S_i contains a backdoor from P_i to S_i, we cannot rely on the conditional probability. Instead, we intervene on P_i using the do-operator to block this backdoor path:

π_i(S_i | do(P_i = p)) = Σ_{t,s} π_i(S_i | p, T = t, S_{i-1} = s) π_i(T = t, S_{i-1} = s)
                       = Σ_{t,s} π_i(S_i | p, T = t, S_{i-1} = s) π_i(S_{i-1} = s | T = t) π_i(T = t)    (2)

The retrieved concept-centered graph has multiple edges representing various relationships with other actions/entities. Therefore, the summation over the intervened T can be achieved by incorporating these edges into the prompt. For instance, "living room" can be "walked to" and "used for reading", while "book" can be located in "living room" and "bedroom". Similarly, we extrapolate over the edges for i-1 hops to aggregate the intervened S_{i-1}, i.e., π(S_{i-1} = s | T = t). Directly ranking the retrieved nodes N_e with the annotated weight E_w in the external knowledge base would result in a spurious correlation, because such retrieved local subgraphs tend to capture task-invariant concept nodes as the causal factors. To mitigate this, we propose to adapt the weight of each triplet (Edge-wise Adaption). The adapted weight adds to the original edge weight the cosine similarity between the embedding n_{E_tail} of the edge's tail node and the task embedding v_task: Ê_w ← E_w + cosine(n_{E_tail}, v_task). The embeddings are projected from the node text and task name using the sentence-transformer (Reimers & Gurevych, 2019).
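The edge-wise adaption rule Ê_w ← E_w + cosine(n_{E_tail}, v_task) can be sketched as below. The paper embeds node and task text with a sentence-transformer; here small hand-written vectors stand in for those embeddings, so the vectors, triplets, and weights are purely illustrative assumptions.

```python
import math

# Toy embeddings standing in for sentence-transformer vectors
# (Reimers & Gurevych, 2019); the 3-d values are illustrative only.
EMB = {
    "watch television": [0.9, 0.1, 0.0],
    "living room":      [0.7, 0.3, 0.1],
    "book":             [0.1, 0.9, 0.2],
    "reading":          [0.2, 0.8, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def adapt_edges(triplets, task, emb=EMB):
    """Edge-wise adaption: new weight = E_w + cosine(tail_emb, task_emb).
    Returns the triplets re-ranked by the adapted weight."""
    adapted = [(head, rel, tail, w + cosine(emb[tail], emb[task]))
               for head, rel, tail, w in triplets]
    return sorted(adapted, key=lambda t: t[3], reverse=True)

triplets = [
    ("television", "AtLocation", "living room", 0.5),
    ("book", "RelatedTo", "reading", 0.4),
]
ranked = adapt_edges(triplets, "watch television")
```

For the task "watch television", the "living room" edge rises above the "reading" edge after adaption because its tail embedding is closer to the task embedding, illustrating how task-invariant nodes get down-weighted.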

Stage3:Temporally-Extended Aggregation

To acquire temporal order in the procedure, we obtain the Prompt P at timestep i by aggregating the task T, the history steps S_0:i-1, and the current external knowledge P̂_G. The underlying causal mechanism is a combination of Eq. 1 and Eq. 2:

π_i(S_i | do(T), do(S_{i-1})) = Σ_p π_i(S_i | do(P_i = p)) π_i(p | do(T), do(S_{i-1}))
                              = Σ_p π_i(p | T, S_{i-1}) Σ_{t,s} π_i(S_i | p, T = t, S_{i-1} = s) π_i(T = t, S_{i-1} = s)    (3)

The adjustment and marginalization in Eq. 3 are achieved in the input space by forming the procedural prompt P_G, which allows the LLM to attend to the causal entities instead of the highly correlated ones for next-step generation. The LLM can reason over the most relevant edges to link the concepts with the task entities as context. The prompts from knowledge bases are independent of the pre-training data distribution, so P_i is independent of D and satisfies the front-door criterion. Please refer to Appendix A.3 and Figure 4 for the simplification of our structural causal model.
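The front-door identity underlying Eq. 3 can be checked numerically on a toy discrete SCM with the same shape as Figure 2 (confounder D → T and D → S, mediator T → P → S). All probability tables below are invented for illustration; the point is only that the front-door estimate computed from observational conditionals matches the true interventional distribution, while the naive conditional does not.

```python
from itertools import product

# Toy discrete SCM: D -> T, D -> S (confounding), T -> P -> S (mediation).
P_D = {0: 0.5, 1: 0.5}
P_T_given_D = {0: {1: 0.2}, 1: {1: 0.8}}      # P(T=1 | D=d)
P_P_given_T = {0: {1: 0.1}, 1: {1: 0.9}}      # P(P=1 | T=t)
P_S_given_PD = {(0, 0): 0.1, (0, 1): 0.4,      # P(S=1 | P=p, D=d)
                (1, 0): 0.6, (1, 1): 0.9}

def bern(table, key, value):
    p1 = table[key][1]
    return p1 if value == 1 else 1.0 - p1

def joint(d, t, p, s):
    ps1 = P_S_given_PD[(p, d)]
    return (P_D[d] * bern(P_T_given_D, d, t) * bern(P_P_given_T, t, p)
            * (ps1 if s == 1 else 1.0 - ps1))

def cond_S_given_PT(p, t):
    """Observational P(S=1 | P=p, T=t), computed from the joint."""
    num = sum(joint(d, t, p, 1) for d in (0, 1))
    den = sum(joint(d, t, p, s) for d in (0, 1) for s in (0, 1))
    return num / den

def marg_T(t):
    return sum(joint(d, t, p, s) for d, p, s in product((0, 1), repeat=3))

def front_door(t):
    """Single-step form of Eq. 3: sum_p P(p|t) sum_t' P(S=1|p,t') P(t')."""
    return sum(bern(P_P_given_T, t, p)
               * sum(cond_S_given_PT(p, t2) * marg_T(t2) for t2 in (0, 1))
               for p in (0, 1))

def truth_do(t):
    """Ground-truth P(S=1 | do(T=t)) from the mutilated graph (D -> T cut)."""
    return sum(P_D[d] * bern(P_P_given_T, t, p) * P_S_given_PD[(p, d)]
               for d in (0, 1) for p in (0, 1))

def naive(t):
    """Biased observational P(S=1 | T=t)."""
    num = sum(joint(d, t, p, 1) for d, p in product((0, 1), repeat=2))
    return num / marg_T(t)
```

Here `front_door(1)` and `truth_do(1)` both equal 0.70, while `naive(1)` is 0.79: conditioning alone overestimates the effect because D confounds T and S, exactly the bias the mediator removes.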

3.2. PROCEDURAL PLANNING WITH LARGE LANGUAGE MODELS

Stage 4: Semantic Generation The external knowledge is concatenated with the goal input (T) as the initial prompt. Given the prompt, the generation language model LM_G ∈ {P_AR, P_AE} (e.g., GPT-3, BART) generates the next sentence, and the most confident prediction is appended to the previous prompt. The termination condition is either reaching the maximum step count t or the matching score falling below a threshold θ. The joint probabilities of the auto-regressive (P_AR) and auto-encoder (P_AE) models are factorized as:

π_AR(x) = Π_{n=1}^{N} p(s_n | P̂_G, s_1:n-1, T),    π_AE(x) = Π_{n=1}^{N} p(s_n | P̂_G, {s_1:n-1, [MASK]}, T)    (4)

where P̂_G represents the commonsense knowledge and T represents the task name.
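The Stage 4 loop can be sketched as below. Since the real LM_G is a large model like GPT-3, a canned mock (`mock_lm`, an assumption of this sketch, not the paper's interface) stands in for it: it returns a candidate next step with a confidence score, and the loop appends the most confident step until the budget or threshold termination condition fires.

```python
# Schematic zero-shot planning loop (Stage 4). `mock_lm` and its canned
# outputs are illustrative stand-ins for the real generation model LM_G.
CANNED = {
    0: ("walk to the living room", 0.9),
    1: ("grab remote control", 0.8),
    2: ("switch on remote control", 0.7),
    3: ("stare at wall", 0.2),   # low confidence -> triggers termination
}

def mock_lm(prompt, step_idx):
    """Stand-in for LM_G: returns (next step, confidence score)."""
    return CANNED.get(step_idx, ("", 0.0))

def plan(task, knowledge_prompt, max_steps=10, theta=0.5):
    """Append the most confident next step to the prompt until the
    max-step budget is reached or the score falls below theta."""
    prompt = f"{knowledge_prompt} Task: {task}."
    steps = []
    for i in range(max_steps):
        step, score = mock_lm(prompt, i)
        if score < theta:        # termination condition
            break
        steps.append(step)
        prompt += f" Step {i + 1}: {step}."
    return steps

steps = plan("Watch TV", "remote control is used for television.")
```

With these canned scores, the loop emits three steps and then stops when the fourth candidate's confidence drops below θ = 0.5.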

Stage 5: Admissible Step Translation To ensure that the generated procedural plans are grounded in the environment, we must avoid producing inadmissible steps (e.g., Toast the table). In other words, the generated steps should be fully constrained to the admissible composites of action and object in a given task domain. Thus, previous works (Huang et al., 2022; Ahn et al., 2022) have explored using a model (LM_T in our case) to score a step selected from a fixed set of available options, instead of directly sampling from the output distribution of the language model (LM_G in our case). Specifically, we match the step generated by LM_G to the most similar admissible step in the embedding space encoded by the translation language model LM_T. Following Huang et al. (2022), we utilize a Sentence-Transformer (Reimers & Gurevych, 2019) to calculate the cosine similarity as π(s_i|x) = LM_T(LM_G(x)), which translates LM_G(x) into the admissible step s_i ∈ S closest in the embedding space, measured by cosine similarity.
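Stage 5 can be sketched as a nearest-neighbor lookup in embedding space. The paper uses a Sentence-Transformer as LM_T; here a simple bag-of-words vector plays that role for illustration, so the embedding and the admissible-step list are assumptions of this sketch.

```python
import math
from collections import Counter

# Stage 5 sketch: map a generated step to the nearest admissible step.
# A bag-of-words vector stands in for the Sentence-Transformer LM_T.
def bow(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cos(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def translate(generated, admissible):
    """Return the admissible step closest to the generated text."""
    return max(admissible, key=lambda s: cos(bow(generated), bow(s)))

ADMISSIBLE = ["grab remote control", "walk to kitchen", "sit on chair"]
step = translate("please get the remote control", ADMISSIBLE)
```

The free-form generation "please get the remote control" is grounded to the admissible step "grab remote control", mirroring the P_G → P̂_G translation described in Section 3.1.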

3.3. COUNTERFACTUAL PROCEDURAL DATA CONSTRUCTION

To investigate counterfactual reasoning ability, we design three families of intervention methods: 1) Initial Configuration: intervene on the initial configuration, such as the location for implementing the task. 2) Intermediate Step: randomly select one step from the ground-truth program as an additional constraint on implementing the task and append it to the task name for generating the procedural plan. 3) Final Goal: intervene on the task goal as the composite of another randomly sampled task. Table 5 in the Appendix provides examples.

4. EXPERIMENTS

4.1. PROCEDURAL PLANNING SETUP

Datasets We conduct zero-shot experiments, without training, on two datasets with procedural information: WikiHow (collected following Koupaee & Wang (2018)) and RobotHow (Puig et al., 2018). WikiHow is a large-scale text summarization dataset constructed from a human-written knowledge base, involving procedural tasks that span various topics; we utilize the "how to" titles as task names and the summarized headlines as steps. RobotHow is a large knowledge base of common household tasks collected in the VirtualHome (Puig et al., 2018) simulator.

Table 2: Averaged 5-point Likert scale human evaluations on "coverage" and "order" aspects.

Baselines We compare our approach with three vanilla generative pre-trained language models (BART, GPT-2, and GPT-3) and two powerful generation baselines: Zero-shot Planner (Huang et al., 2022), noted as "LLMaP", and Chain of Thought (Wei et al., 2022), noted as "Chain". More method and configuration details can be found in Appendix B.3 and Appendix B.4.

Metrics We ask human annotators on the Amazon Mechanical Turk platform to rate model performance on two aspects: 1) Coverage: which set of steps covers more of the steps necessary to complete the target task (captures semantic completeness). 2) Order: which sequence better follows the correct order of steps for completing the target task (captures sequential order correctness). In addition, we use Sentence-BLEU (S-BLEU) (Papineni et al., 2002), BERTScore (Zhang* et al., 2020), ROUGE-1 (Lin, 2004), and Word Mover's Distance (WMD) (Kusner et al., 2015) as automatic evaluation metrics, which compute semantic scores between the annotated programs and the predictions. Details of the crowdsourcing human evaluation can be found in Appendix C.1.

4.2. HUMAN EVALUATION RESULTS WITH COVERAGE AND ORDER METRIC

Each example is rated by 3 crowdsourcing annotators. For the Win-Lose Comparison, we ask the human raters to choose between ours and the baseline LLMaP (Huang et al., 2022). Averaged results reported in Table 1 show that our PLAN is more frequently rated better for both coverage and order metrics, outperforming baselines in winning ratio by 21% in coverage and 26% in order across the two datasets. We report the averaged Human Ratings on a 5-point Likert scale in Table 2. The consistent performance boost of PLAN indicates the superiority of injecting external commonsense knowledge into the procedural planning task. The performance drops of LLMaP and Chain in the counterfactual setting indicate the vulnerability of fixed holdout knowledge and pre-defined manual exemplars in causal procedural planning. Please refer to Appendix C.1 for the crowdsourcing human-evaluation interface details. Table 3 shows two examples for qualitative comparison; more examples can be found in Appendix D.

4.3. AUTOMATICALLY MEASURING THE PROCEDURAL PLANNING

Main Results Table 4 summarizes the automatic evaluation results. PLAN achieves the best results regardless of the language model architecture, autoregressive or autoencoder based. The performance gain of "LLMaP" over "Chain" is probably due to direct exposure to the holdout task from the dataset, while the "Chain" baseline still outperforms the vanilla baseline that only takes the high-level task name as the prompt. Note that the annotated program is not the only valid solution, so these automatic metrics provide limited information about absolute performance. Details on the correlation between automatic metrics and human evaluation can be found in Section 4.5.

Table 3: Qualitative comparison of procedural plans.

RobotHow — Task: Toast Bread
Chain: Step 1: Walk to kitchen. Step 2: Put bowl on oven. Step 3: Switch off oven. Step 4: Open oven. Step 5: Grab bowl. Step 6: Put bowl in oven.
LLMaP: Step 1: Put bread in oven. Step 2: Put food bread in oven. Step 3: Put food bread on table.
PLAN: Step 1: Walk to kitchen counter. Step 2: Pull bread slice. Step 3: Put bread slice on toaster. Step 4: Switch on toaster. Step 5: Look at toaster. Step 6: Turn to toaster.

WikiHow — Task: How to Take Care of a Pet
Chain: Step 1: Take care of your pets. Step 2: Scroll to the end of the text file.
LLMaP: Step 1: Make sure your pet always has enough food and water. Step 2: Be sure to feed your dog at least three times a day. Step 3: Feed your pet high-quality "complete and balanced" dog food. Step 4: Give your dog fruits and vegetables as snacks. Step 5: Eat a variety of food. Step 6: Feed your puppy healthy food.
PLAN: Step 1: Research the type of pet, so you can make sure the pet fits into your environment and lifestyle. Step 2: Choose your breed carefully, especially if you have children. Step 3: Exercise your pet if it's the kind of pet that needs it, such as a dog. Step 4: Watch your pet's diet carefully. Step 5: Monitor your pet for illness and injury. Step 6: Make time for your pet.

Effects of Edge-wise Adaption and Symbolic Program Execution

The variant "w/o Adaption" maintains the top-k task-specific nodes ranked by the annotated weight E_w in the external knowledge base G, without adaption. The variant "w/o Symbolic" directly takes the extracted concept nodes from the external knowledge base as the prompt. The performance drops of these two variants in Table 4, with the significance test in Appendix C.2, demonstrate the importance of the adaption and symbolic modules. Effects of the Large Language Model Architecture We use GPT-2 and GPT-3 as autoregressive architectures and BART (Lewis et al., 2020) as an autoencoder architecture. The autoregressive architectures achieve better results than the autoencoder one. Since the pre-training objective of the autoregressive GPT is to predict the next token given the previous input tokens, we attribute the performance gain of GPT to the smaller gap between the objectives of pre-training and procedural planning.

Level of Complexity

We report results on the test set separated into several buckets according to the number of steps in the procedural planning task; the step number reflects the difficulty of the task (Table 7). Overall, our automatic and human evaluation scores are consistent with the main claim of this paper. However, human evaluation remains irreplaceable for procedural planning at the current stage.

7. LIMITATIONS

Given the limited cultural diversity of the datasets we use, RobotHow and WikiHow, our results may be biased toward a single cultural background. For instance, the task "make breakfast" should take multiple cultural backgrounds into consideration when generating procedural plans.

8. REPRODUCIBILITY STATEMENT

We provide more data samples and qualitative samples in supplemental materials. In addition, we provide our code implementation at https://anonymous.4open.science/r/PLANNER-7B24 to reproduce our experiments. The Preprocess folder provides the utils to construct the data. The Evaluation folder provides the code for automatic and human evaluation tools. The Planning folder contains the main code for our approach and reproduced planners for procedural planning. The Visualization folder provides the code we use to visualize in the environment.

A SCM THEORETICAL DETAILS

A.1 CAUSAL PRELIMINARIES

The Structural Causal Model (SCM) is a directed acyclic graph (DAG) describing the causal relationships within a system (Pearl, 2009). In this paper, we refer to the SCM unrolled along the time dimension as the full temporal causal graph, while the rolled-up version is also called the causal summary graph (Peters et al., 2017). In an SCM, if the variable D is a cause of both T and S_i, it is called a confounder. A confounder opens up a backdoor path and causes a spurious correlation between T and S_i. A backdoor path is a remaining path between T and S_i when all arrows pointing out of T are removed; therefore, T ← D → S_i is a backdoor path. For our SCM with mediator P_i shown in Figure 4c (the same as Figure 2b in the main paper), there is no backdoor path between T and {P_i, S_{i-1}} because only D → T is left after removing the outgoing arrows of T. On the other hand, there is a backdoor path between P_i and S_i, i.e., P_i ← T ← D → S_i, so P_i indirectly affects the observation of S_i through {T, S_{i-1}} and D. A mediator is a variable added between the treatment variable (the cause, T and S_{i-1} in our case) and the outcome variable (the effect, S_i in our case) that blocks all directed paths from the cause to the effect (Zhang et al., 2016). Spurious correlations happen when two variables are statistically related but not causally related, either because a third variable influences both at the same time or because the correlation is coincidental. To identify the true causal effect between X and Y, we aim to estimate the conditional π(Y|do(X)) after intervention with the do-operator. The do-operator breaks the backdoor path by setting X to a fixed value independent of its causes Z, so that the path Z → X can be removed to eliminate the backdoor paths. In practice, the backdoor adjustment and the front-door adjustment are the two fundamental methods for implementing interventions and obtaining π(Y|do(X)).
Clarity of the Definition As a language prompt, P_i inherits the content from P_{i-1} and thus can be detached from the steps before S_{i-1} for simplicity. Causal Intervention There are two types of operation to control the confounding bias: the backdoor adjustment and the front-door adjustment (Pearl, 2009). The backdoor adjustment is intractable in our case because it requires the prior distribution of the confounding variables. Instead, we construct an input prompt as a mediator P_i for T → S_i and S_{i-1} → S_i; the front-door adjustment then applies a two-step do-operation to mitigate bias by investigating P_i → S_i (Pearl, 2009). Specifically, we construct the prompt mediator P_i using the techniques illustrated in Section 3.1. The pre-trained knowledge (D) in LLMs confounds language models to make biased decisions toward an unreasonable action. Since the confounder is unobservable, intervention techniques such as the backdoor adjustment (definition in Appendix A.2) (Hu & Li, 2021; Weber et al., 2020; Yue et al., 2020) are not applicable in our SCM. Instead, we build a mediator and implement it as a commonsense-infused prompt. Through the mediator, we can identify causal effects among goals and steps by investigating the indirect effect from the goals, which is essentially the front-door adjustment (definition in Appendix A.3) in causality (Pearl, 2009).

A.2 THE BACKDOOR ADJUSTMENT

The backdoor adjustment is one way to realize the intervention do(T = t) by considering the conditional probability over the existing data distribution with an observed confounder D. Let π_i denote π(·|P_{i-1}), the probability density function conditioned on P_{i-1}. The adjustment calculates the average causal effect by considering all stratums of the dataset:

π_i(S_i | do(T)) = Σ_d π_i(S_i | T, D = d) π_i(D = d)    (5)

However, for LLMs, the pre-training data is usually unobservable and has been transformed into knowledge incorporated in the hidden space. Therefore, we are not able to directly apply the backdoor adjustment.
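The stratum-averaging in the backdoor adjustment can be illustrated numerically. The toy probability tables below are invented; as the text notes, this computation is not available for LLMs whose pre-training confounder D is unobserved, which is exactly why the paper resorts to the front-door adjustment instead.

```python
# Toy numeric illustration of the backdoor adjustment
# pi(S | do(T)) = sum_d pi(S | T, D=d) pi(D=d); all tables are invented.
P_D = {0: 0.5, 1: 0.5}
P_S_given_TD = {               # P(S=1 | T=t, D=d)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}
P_T_given_D = {0: 0.2, 1: 0.8}  # P(T=1 | D=d); D confounds T and S

def backdoor(t):
    """Average causal effect over all stratums of the confounder D."""
    return sum(P_S_given_TD[(t, d)] * P_D[d] for d in (0, 1))

def naive(t):
    """Observational P(S=1 | T=t), biased because D is a confounder."""
    pt = lambda d: P_T_given_D[d] if t == 1 else 1.0 - P_T_given_D[d]
    den = sum(pt(d) * P_D[d] for d in (0, 1))
    return sum(P_S_given_TD[(t, d)] * pt(d) * P_D[d] for d in (0, 1)) / den
```

With these tables, `backdoor(1)` gives 0.5 while `naive(1)` gives 0.62: conditioning alone over-weights the stratum D=1 in which T=1 is more likely, inflating the apparent effect.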

A.3 THE FRONT-DOOR ADJUSTMENT

The front-door adjustment is another technique to apply intervention by introducing a mediator P_i when the confounder is unobservable. As explained in Section 2.2 of the main paper, the front-door adjustment is equivalent to two consecutive do-operations on the task T and the prompt P_i. We first investigate the generation of S_1 and then expand it to S_t.

Timestep i = 1 As shown in Figure 4a, since there are no preceding steps, the first step generation involves D, T, and P_1 only. Similar to the proof in Section 2.2 of the main paper, we have:

π_i(S_1 | do(T)) = Σ_p π_i(S_1 | do(P_1 = p)) π_i(p | do(T)) = Σ_p π_i(p | T) Σ_t π_i(S_1 | p, T = t) π_i(T = t)    (6)

By adding an intervention to T, we make the value do(T = t) independent of the confounder D at the beginning, so the backdoor path through D → T is eliminated.

Timestep i > 1 As shown in Figure 2a of the main paper, we model the mediator P_i as an effect of three variables: T, P_{i-1}, and S_{i-1}. The first step of our front-door adjustment applies the do-operator to these three variables and observes the change in P_i, as explained in Section 2.2 of the main paper. Since there are no backdoor paths between P_i and these variables, the probabilities after intervention equal the conditional probabilities without intervention:

π_i(P_i = p | do(T)) = π_i(P_i = p | T)    (7)
π_i(P_i = p | do(P_{i-1})) = π_i(P_i = p | P_{i-1})    (8)
π_i(P_i = p | do(S_{i-1})) = π_i(P_i = p | S_{i-1})    (9)

The second step applies the do-operator to P_i and identifies the causal effect as:

π_i(S_i | do(P_i)) = Σ_{t,p',s} π_i(S_i | P_i, T = t, P_{i-1} = p', S_{i-1} = s) π_i(T = t, P_{i-1} = p', S_{i-1} = s)    (10)

Combining Equations 7-9 with Equation 10, we obtain the front-door adjustment. Note that there are three backdoor paths, one from each of the variables T, P_{i-1}, and S_{i-1}, as shown in Figure 4b (drawn in blue, red, and purple). More importantly, the path through T, i.e., P_i ← T ← D → S_i (the blue path in Figure 4b), and the path through P_{i-1}, i.e., P_i ← P_{i-1} ← T ← D → S_i (the red path in Figure 4b), share the same subpath. The intervention on the task T breaks the backdoor paths for both T and P_{i-1}. Therefore, we have our front-door adjustment as:

[Figure 4: (a) The SCM at timestep i = 1, where task-relevant sampling implements do(T) and adaption with symbolic structuring implements do(P_i), with π_i(P_i | do(T)) = π_i(P_i | T). (b) The SCM at timestep i > 1, with π_i(P_i | do(T), do(S_{i-1})) = π_i(P_i | T, S_{i-1}).]

π_i(S_i | do(S_{i-1}), do(P_{i-1}), do(T))    (11)
= Σ_p π_i(S_i | do(P_i = p)) π_i(p | do(S_{i-1}), do(P_{i-1}), do(T))    (12)
= Σ_p π_i(S_i | do(P_i = p)) π_i(p | do(S_{i-1}), P_{i-1}, do(T))    (13)
= Σ_p π_i(S_i | do(P_i = p)) π_i(p | do(S_{i-1}), do(T))    (14)
= Σ_p π_i(p | S_{i-1}, T) Σ_{s,t} π_i(S_i | p, S_{i-1} = s, T = t) π_i(S_{i-1} = s, T = t)    (15)
= π_i(S_i | do(S_{i-1}), do(T))    (16)

We have Equation 13 because of the intervention on T and Rule 2 (Pearl, 1995), and Equation 14 because of Rule 1 (Pearl, 1995). After simplification based on Equations 12-16, we obtain the SCM at timestep i > 1 in Figure 4c, an equivalent SCM after eliminating P_{i-1} from Figure 4b. The reason we can eliminate P_{i-1} is as follows. We follow a common method of constructing temporally extended prompts, which appends the predictions at previous timesteps to the prompt at the current timestep. In our case, P_{G,i} is the same as P_{G,i-1}, so P_i inherits part of its content from P_{i-1}, and the change depends only on S_{i-1}. Thus P_{i-1} and S_{i-2} are fixed, and there is no need to predict P_{i-1} again at timestep i. In this way, we simplify the causal graph in Figure 4b to the one in Figure 4c.
In summary, we define and simplify the causal graph based on the temporally extended property of our prompt construction (P_i inherits content from P_{i-1}). We end up with Equations 14-16, shown as Equation 3 in Section 2.2 of the main paper.

WikiHow This dataset (https://www.wikihow.com) is released under an Attribution-NonCommercial-ShareAlike 3.0 Creative Commons license, and its text content is free to modify, republish and share. We evaluate inference on 1000 tasks randomly selected from the dataset. The admissible action space and interaction object space are more complex than the programs in RobotHow, and there is no fixed "[Action] <Object> (Number)" form for each step. Each article contains a title, bold headlines and text; we use the title and headlines as the task name and steps, respectively.

External Knowledge Base For the external knowledge base, we utilize ConceptNet (Speer et al., 2017) to leverage commonsense reasoning ability and help ground language generation in goal-guided procedural text generation.

Task: Watch TV. Step 1: Find remote control. Step 2: Grab remote control. Step 3: Find television. Step 4: Switch on television. Step 5: Turn to television. Step 6: Watch television. Step 7: Switch off television. Step 8: Put back remote control.

Task: Watch TV in bedroom. Step 1: Walk to bedroom. Step 2: Find remote control. Step 3: Grab remote control. Step 4: Find television. Step 5: Switch on television. Step 6: Turn to television. Step 7: Watch television. Step 8: Switch off television. Step 9: Put back remote control.

Intermediate Step

Task: Work. Step 1: Walk to home office. Step 2: Walk to chair. Step 3: Find chair. Step 4: Sit on chair. Step 5: Find computer. Step 6: Switch on computer. Step 7: Turn to computer. Step 8: Look at computer.

Task: Work (Find Computer). Step 1: Walk to home office. Step 2: Walk to chair. Step 3: Find chair. Step 4: Sit on chair. Step 5: Find computer. Step 6: Switch on computer. Step 7: Turn to computer.

Table 6 shows examples comparing the original program with the counterfactual program for each intervention method. Specifically, for Initial Configuration, we randomly append a location to a given task name to constrain where the task is completed; the steps are prepended with the initial step "walk to <Location>". For Intermediate Step, we randomly sample a step from the task-specific program and append it to the task name to constrain how the task is implemented. For Final Goal, we randomly combine two tasks, merging both the task names and the programs, to construct a set of long-horizon composite tasks. We conduct counterfactual experiments by applying randomly selected intervention methods over RobotHow. We apply only the Intermediate Step intervention over WikiHow, due to its loose configuration requirements and the long text of the WikiHow contents. Note that the performance gain of PLAN under the counterfactual setting mainly comes from the additional task guidance introduced by the Intermediate Step intervention, whereas the baselines mostly suffer performance drops due to their limited annotated exemplars. PLAN consistently outperforms the baselines by a large margin, indicating its superiority under the counterfactual setting.

B.3 METHOD DETAILS

Existing formalizations of the procedural planning task fall mainly into two categories: 1) sequential choice making (Lyu et al., 2021; Wu et al., 2022; Zhang et al., 2020a;b), which reasons about the next step given candidate options, the task, and previous steps; and 2) conditioned generation (Huang et al., 2022; Ahn et al., 2022), which generates temporally extended plans to implement the task. We study procedural planning as a conditioned generation problem (Huang et al., 2022; Ahn et al., 2022) since it resembles real-world scenarios.

Baselines LLMaP proposes a procedure to extract temporally extended plans from large pre-trained language models. Chain explores manually creating exemplars that mimic the human reasoning process and uses them to prompt large language models on reasoning tasks. To compare with Chain on the procedural planning task, we manually write exemplars containing the chain of thought for 1% of the inference task programs. For the BART language model we use the BART-large version, and for GPT-2 we use the 1.5-billion-parameter version. For the translation model LM_T, we use sentence-transformers (RoBERTa-large). All these models are released by HuggingFace. In addition, our experiments with GPT-3 (davinci) use the OpenAI API (May 2022).

External Knowledge Graph ConceptNet5 defines a set of 34 relations (https://github.com/commonsense/conceptnet5/wiki/Relations). Within the relations we consider for the procedural planning task, the average subgraph sampling time is 0.03576 milliseconds per task program.
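The role of the translation model LM_T can be sketched as nearest-neighbor retrieval over the fixed set of admissible steps. The toy bag-of-words embedding below is a hypothetical stand-in for the sentence-transformers (RoBERTa-large) encoder actually used, and the step strings are illustrative:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; the paper uses a sentence-transformers
    # (RoBERTa-large) encoder instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def translate(step, admissible):
    # Map a free-form generated step to the closest admissible action.
    return max(admissible, key=lambda cand: cosine(embed(step), embed(cand)))

admissible = ["walk to bathroom", "find soap", "grab soap", "switch on television"]
print(translate("go to the bathroom", admissible))  # -> "walk to bathroom"
```

The same retrieval step serves both translation stages of the pipeline: mapping knowledge-prompt sentences to admissible phrasing, and mapping generated steps back into the executable action space.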

B.4 HYPERPARAMETER SEARCH AND CONFIGURATION DECISION

We perform a hyperparameter search for all evaluated methods over the following hyperparameters:
• The confidence threshold θ, which terminates generation when the score falls below it, is searched in {0, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8}.
• The step horizon, which constrains the maximal number of procedural planning steps, is searched in {10, 20, 40}.
• The number of hops for retrieving the subgraph from the external knowledge base is searched in {1, 2, 3}.
• The ratio of maximal concepts to the length of the task name is searched in {1, 2, 3}.
• The cosine similarity threshold for keeping a task-specific concept is searched in {0.4, 0.6, 0.8}.
• The edge weight threshold θ_e is searched in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}.
• The top-k value for task-specific nodes is searched in {1, 5, 10, 15, 20, 25, 50, 100}.
The configurations used in the experiments are: θ = 0.7, a 20-step horizon, 3 hops, a concept-to-task-length ratio of 3, a cosine similarity threshold of 0.4, θ_e = 0.6 and k = 10. We empirically set the hop number H to 3, considering both the input length limit of the LLMs and the observation that 3 hops contain reasonably relevant information in practice (Zhang et al., 2022).
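How the hop number H and the edge weight threshold θ_e interact during task-relevant subgraph sampling can be sketched as a bounded breadth-first traversal over weighted triples. The triples and the function below are hypothetical illustrations, not actual ConceptNet data or the paper's implementation:

```python
from collections import deque

def sample_subgraph(triples, seeds, hops=3, weight_thresh=0.6):
    """Collect triples reachable from the seed concepts within `hops` hops,
    keeping only edges whose weight passes the threshold (theta_e).
    Edges are followed in both directions for retrieval."""
    adj = {}
    for edge in triples:
        h, _, t, _ = edge
        adj.setdefault(h, []).append(edge)
        adj.setdefault(t, []).append(edge)
    seen_nodes, seen_edges, kept = set(seeds), set(), []
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for edge in adj.get(node, []):
            h, _, t, w = edge
            if w < weight_thresh or edge in seen_edges:
                continue
            seen_edges.add(edge)
            kept.append(edge)
            nxt = t if node == h else h
            if nxt not in seen_nodes:
                seen_nodes.add(nxt)
                frontier.append((nxt, depth + 1))
    return kept

# Hypothetical ConceptNet-style triples (head, relation, tail, weight).
triples = [
    ("watch tv", "HasPrerequisite", "find remote control", 0.9),
    ("find remote control", "AtLocation", "living room", 0.8),
    ("living room", "RelatedTo", "sofa", 0.3),  # dropped: below theta_e
]
subgraph = sample_subgraph(triples, ["watch tv"], hops=3, weight_thresh=0.6)
```

Tightening θ_e or lowering H shrinks the retrieved subgraph, which trades knowledge coverage against the LLM input length limit noted above.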

B.5 COMPUTATION AND RESOURCES

We use a single NVIDIA A100 GPU server for all experiments. Since there is no training in our zero-shot setting, computation is only used for the inference stage of the experiments.

C EVALUATION DETAILS

C.1 CROWDSOURCING HUMAN EVALUATION

We conduct all human evaluations (rating and win-lose comparison) on the Amazon Mechanical Turk platform; each example is rated by 3 annotators. For every assignment, we ask workers to evaluate the quality of the provided low-level steps given the high-level task description. For the Win-Lose Comparison, they choose one of three options for the two provided model-generated results: 1) the first one is better, 2) equal, and 3) the second one is better. For the Human Ratings, they score each sample on a 5-point Likert scale. This process does not involve collecting any personal information, and we manually check that no offensive content is produced by the models. The assignment layout templates for workers are shown in Figure 7 and Figure 6. Specifically, we evaluate 50 randomly selected task examples from each dataset (RobotHow and WikiHow) under all settings (standard and counterfactual). We only keep examples for which the workers read the instructions carefully, checked by whether they give a score of 1 to an empty program as a sanity check. The hourly wage paid to participants is approximately $9, and the total amount spent on participant compensation is $1296. The details of the Human Intelligence Task process are described in the following sections. For the coverage metric, the same process is conducted as for the order metric, except that the instructions are: "Read the given task and the sequence of steps, and determine which sequence covers more steps that are necessary to complete the target task. Please ignore the sequential order of the steps." For the order metric, the instructions are: "For every question below, determine whether the task can be completed in any reasonable scenario using the provided steps (please consider the sequential order of the steps). You could directly give the lowest score (1) for empty steps. In other words, can the task be decomposed into these steps? (Please consider the sequential order of the steps.)"
The program to be evaluated is then provided, and finally the workers are asked to score it following these instructions: "Use the slider below to indicate how much you agree with the following statement (1 = Strongly disagree, 5 = Strongly agree). If the 'sequence of steps' is blank, please directly choose 1 (lowest score). The task can be completed in any reasonable scenario using the provided steps. [SLIDER PROVIDED HERE]" The above example evaluates the order metric; for the coverage metric, the same process is conducted, except that the instructions read: "For every question below, determine whether the task can be completed in any reasonable scenario using the provided steps (please ignore the sequential order of the steps). You could directly give the lowest score (1) for empty steps. In other words, can the task be decomposed into these steps? (Please ignore the sequential order of the steps.)"

C.2 MORE RESULTS

Significance Test We provide paired t-test (p < 0.05) statistics for Table 2. On RobotHow, our PLAN significantly outperforms all baselines on Original-Order (BART) and Counterfactual-Coverage (GPT2). On WikiHow, our PLAN significantly outperforms all baselines on Original-Coverage (BART, GPT2), Counterfactual-Coverage (BART, GPT2), and Counterfactual-Order (BART). For the coverage metric under the counterfactual setting, the human-provided program is not significantly better than our PLAN.
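A paired t-test over per-example scores of two systems can be sketched as follows; the scores below are made-up numbers purely for illustration, not results from Table 2:

```python
import numpy as np

def paired_t(x, y):
    """Paired t statistic and degrees of freedom for per-example
    scores of two systems evaluated on the same examples."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1

# Hypothetical per-task coverage scores (NOT the paper's data).
plan     = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79, 0.85, 0.73]
baseline = [0.70, 0.72, 0.80, 0.61, 0.77, 0.74, 0.78, 0.69]
t_stat, dof = paired_t(plan, baseline)
# Significance at p < 0.05 (two-sided) means |t_stat| exceeds the
# critical value of the t distribution, roughly 2.365 for dof = 7.
```

Pairing by example matters here: it cancels per-task difficulty, so the test is sensitive to the systematic difference between systems rather than to which tasks happened to be sampled.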

Model RobotHow

Removing these modules leads to a performance drop; on BERTScore-F1 in particular, the p-values are 8.884e-13 and 1.4e-8, respectively. This further confirms the importance of the modules. Results on GPT-3 In addition, we conduct experiments with GPT-3 (davinci) using the OpenAI API. We showcase the comparison in Table 9 and Table 10. Table 9: Showcases of procedural steps predicted by different models with GPT-3 as the base LLM on RobotHow.

Model Program

WikiHow Task: How to Become an Art Investor

Human

Step 1: Start with some experience or interest in art. Step 2: Understand the difference between art collectors, art investors and art speculators. Step 3: Figure out what you are willing to pay for art, before going to an auction house. Step 4: Pay attention to what schools of art are selling well, and which are down. Step 5: Focus art investments on fine art paintings, rather than decorative art. Step 6: Reach out to trusted auction houses and dealers when you are looking to buy art. Step 7: Buy your investment art when you feel confident of its worth, its price and its ability to grow in value. Step 8: Study how art is properly stored. Step 9: Have your art investments appraised occasionally. Step 10: Consider renting out your art investments. Step 11: Understand that selling an art investment can take time.

Chain

Step 1: Buy your investment art when you feel confident of its worth, its price and its ability to grow in value. LLMaP Step 1: Reach out to trusted auction houses and dealers when you are looking to buy art.

PLAN

Step 1: Figure out what you are willing to pay for art, before going to an auction house. Step 2: Consider renting out your art investments. Step 3: Buy your investment art when you feel confident of its worth, its price and its ability to grow in value. WikiHow Task: How to Be an Organized Artist

Human

Step 1: Make sure you know what is expected of you. Step 2: Stick to your topic. Step 3: Don't try to be to be funny unless the scenario calls for it. Step 4: Act naturally for the situation; talk, act and sit as your character would usually do in the circumstances. Step 5: Participate. Step 6: Don't react to what others say as yourself, stay in character. Step 7: Don't make anything violent or too crazy. Step 8: Relax and enjoy yourself. Step 9: Be your character. Step 10: Play games that allow you to practice improvisation.

Chain

Step 1: First, you will need to make sure you have all of the materials listed below. Step 2: Set a schedule. Step 3: Create a comfortable space. Step 4: Take notes in journal or sketchbo. Step 5: Keep neat and tidy. Step 6: Take a break.

LLMaP

Step 1: Make plans.

PLAN

Step 1: Start with some experience or interest in art. Step 2: Put together a schedule and chart. Step 3: Prepare to create your neopoprealist mural. Step 4: Organize your computer-based materials. Step 5: Have a clear plan. Step 6: Buy your investment art when you feel confident of its worth, its price and its ability to grow in value. Step 7: Work on being the best you. WikiHow Task: How to Be Good at Improvisation

Human

Step 1: Keep related supplies in the same area. Step 2: Make an effort to clean a dedicated workspace after every session. Step 3: Place loose supplies in large, clearly visible containers. Step 4: Use clotheslines and clips to hang sketches, photos, and reference material. Step 5: Use every inch of the room for storage, especially vertical space. Step 6: Use chalkboard paint to make space for drafting ideas right on the walls. Step 7: Purchase a label maker to make your organization strategy semi-permanent. Step 8: Make a habit of throwing out old, excess, or useless stuff each month.

Chain

Step 1: Play games that allow you to practice improvisatio.

LLMaP

Step 1: Don't overdo it.

PLAN

Step 1: Try the spontaneous approach. Step 2: Express yourself creatively. Step 3: Play games that allow you to practice improvisatio. Step 4: Do extracurricular activitie. WikiHow Task: How to Train a Parrot to Say Something

Human

Step 1: Decide what you want your parrot to say, but make it basic. Step 2: If you want, you can make it say simple but funny things. Step 3: You should go to a nice and quiet room. Step 4: To start teaching it, repeat what you want it to say many times. Step 5: If you DO get your parrot to say it correctly, then you've succeeded!

Chain

Step 1: Decide what you want your parrot to say, but make it basic.

LLMaP

Step 1: If you do get your parrot to say it correctly, then you've succeeded.

PLAN

Step 1: Decide what you want your parrot to say, but make it basic. Step 2: If you do get your parrot to say it correctly, then you've succeeded.

Motivation of Evaluation Metrics The procedural planning task is open-domain in nature, so golden plans may not be unique. This makes common automatic metrics proposed for natural language tasks imperfect for evaluating procedural planning; the same difficulty of judging systems directly with automatic metrics is discussed in LLMaP (Huang et al., 2022). We assume that human evaluation of Coverage and Order reflects how closely the procedural plans match the human-annotated programs, because the human annotators are explicitly required to determine whether the task can be completed in any reasonable scenario using the procedural plans. We therefore provide both automatic and human evaluation on the two aspects, Coverage and Order, as described in the Metrics paragraph of Section 4.1.
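To make the two aspects concrete, here is a toy illustration of Coverage-style and Order-style scoring over step sequences. These simple set-overlap and longest-common-subsequence scores are illustrative stand-ins, not the paper's actual automatic metrics or the human judgments:

```python
def coverage(pred, ref):
    """Fraction of reference steps that appear in the prediction,
    ignoring order (a toy stand-in for the Coverage judgment)."""
    return sum(step in pred for step in ref) / len(ref)

def order(pred, ref):
    """Normalized longest-common-subsequence length, rewarding steps
    that appear in the reference order (a toy stand-in for Order)."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == ref[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / n

ref  = ["find remote control", "grab remote control",
        "switch on television", "watch television"]
pred = ["grab remote control", "find remote control",
        "switch on television", "watch television"]
print(coverage(pred, ref))  # 1.0: all reference steps are covered
print(order(pred, ref))     # 0.75: one swap breaks the order
```

The example shows why the two aspects are reported separately: a plan can cover every necessary step yet still sequence them wrongly.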

Evaluation on Success Rate Metric

To make the human evaluations more intuitive, we provide an additional Success Rate metric indicating whether the procedural plans can successfully implement the task, focusing on success rather than on the coverage or order of the plans. We show Success Rate evaluations of the baselines and our method in Table 11. The assignment layout template for workers is shown in Figure 8.

More Ablation

To verify the contribution of the first translation language model LM_T, which translates the knowledge prompt P_G into the admissible prompt P̃_G, we conduct an additional ablation that removes LM_T and uses P_G in place of P̃_G to prompt the LLM for procedural planning. We provide results with comparisons to the other ablations in Table 12.

Results on Counterfactual Task Samples

We show automatic evaluation results on counterfactual RobotHow in Table 13.
• Input task T: Take shower.
• Human-annotated plan reference: Step 1: Walk to bathroom. Step 2: Walk to clothes dress. Step 3: Find clothes dress. Step 4: Put off clothes dress. Step 5: Find shower. Step 6: Enter shower. Step 7: Find soap. Step 8: Grab soap. Step 9: Scrub soap. Step 10: Put back soap. Step 11: Leave shower. Step 12: Find towel. Step 13: Grab towel. Step 14: Wipe towel. Step 15: Find clothes dress. Step 16: Put on clothes dress.
• Sampled triples: (take a shower, HasSubevent, wash your body, 1.0); (take a shower, HasPrerequisite, go to the bathroom, 1.0); (take a shower, HasPrerequisite, go to the bathroom and undress, 1.0); (take a shower, HasPrerequisite, step into the shower, 1.0); (take a shower, HasPrerequisite, soap up, 1.0); (take a shower, HasPrerequisite, bring some soap, 1.0); (take a shower, HasPrerequisite, bring some shampoo, 1.0); (take a shower, HasLastSubevent, towel yourself off, 1.0); (take a shower, HasPrerequisite, remember to take your towel, 1.0).
• Knowledge prompt P_G: Step: take out your clothes. Step: set clean clothes. Step: go to the bathroom. Step: go to the bathroom and undress. Step: take your clothes off. Step: turn on the water. Step: bring some soap. Step: bring some shampoo. Step: remember to take your towel. Step: get clean. Step: wash your hair. Step: use shampoo. Step: wash behind your ears. Step: wash your body. Step: turn off the water. Step: dry off. Step: become more clean. Step: put your clothes on.
• Translated knowledge prompt P̃_G: Step: find clothes underwear. Step: wash shower. Step: walk to bathroom. Step: look at shower. Step: take off clothes underwear. Step: wash shower. Step: find soap. Step: grab soap. Step: scrub shower. Step: find shampoo. Step: grab shampoo. Step: find towel. Step: wipe shower. Step: wash hair. Step: wipe shampoo. Step: scrub shower. Step: wash body. Step: switch off washing machine. Step: scrub shower. Step: wipe shower. Step: put on clothes underwear.
• Generated plan S_T: Step 1: Find clothes underwear. Step 2: Walk to bathroom. Step 3: Take off clothes shirt. Step 4: Scrub shower. Step 5: Pour shampoo into hair. Step 6: Wipe shampoo. Step 7: Wipe hair. Step 8: Wash body. Step 9: Find towel. Step 10: Put on clothes underwear.
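The aggregation of the task, the translated knowledge prompt and previously generated steps into the final prompt P can be sketched as simple string assembly. The template below is hypothetical, not the paper's actual prompt format:

```python
def build_prompt(task, knowledge_steps, history):
    """Assemble a commonsense-infused prompt: the task description, the
    translated knowledge hints, then the steps generated so far.
    The exact wording/format is illustrative, not the paper's template."""
    lines = [f"Task: {task}"]
    lines += [f"Hint: {s}" for s in knowledge_steps]
    lines += [f"Step {i}: {s}" for i, s in enumerate(history, start=1)]
    lines.append(f"Step {len(history) + 1}:")  # slot for the next step
    return "\n".join(lines)

prompt = build_prompt(
    "Take shower",
    ["walk to bathroom", "find soap"],
    ["Find clothes underwear", "Walk to bathroom"],
)
print(prompt)
```

Because the history grows by one step per timestep while the hints stay fixed, the prompt at timestep i inherits the prompt at timestep i-1, which is exactly the property used to eliminate P_{i-1} in the causal derivation.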

D.2 PREDICTED PROCEDURAL PLANS

More qualitative examples of randomly selected final predicted procedural plans are provided. Table 14 shows random samples on the original dataset.

Model Program

RobotHow Task: Play Games

Human

Step 1: Walk to bedroom. Step 2: Walk to video game console. Step 3: Find video game console. Step 4: Walk to video game console. Step 5: Switch on video game console. Step 6: Find television. Step 7: Switch on television. Step 8: Find video game controller. Step 9: Grab video game controller. Step 10: Find chair. Step 11: Walk to chair. Step 12: Sit on chair

Chain

Step 1: Put chef knife on water glass. Step 2: Find sink.

LLMaP

Step 1: Walk to bedroom. Step 2: Walk to video game console. Step 3: Find video game console

PLAN

Step 1: Walk to video game controller. Step 2: Find video game controller. Step 3: Switch on video game controller. Step 4: Find tv. Step 5: Switch on tv. WikiHow Task: How to Be an Organized Artist

Human

Step 1: Keep related supplies in the same area. Step 2: Make an effort to clean a dedicated workspace after every session. Step 3: Place loose supplies in large, clearly visible containers. Step 4: Use clotheslines and clips to hang sketches, photos, and reference material. Step 5: Use every inch of the room for storage, especially vertical space. Step 6: Use chalkboard paint to make space for drafting ideas right on the walls. Step 7: Purchase a label maker to make your organization strategy semi-permanent. Step 8: Make a habit of throwing out old, excess, or useless stuff each month.

Chain

Step 1: Feel for a dense, kinky coat. Step 2: Look for rounded ears with crimped inner hairs.

LLMaP

Step 1: Make sure that all money has been covered on both end. Step 2: Check your bill and banking statements regularly.to ensure that your bills are getting paid as they should be, you should check all statements that you receive. Step 3: Keep track of your expenses. Step 4: Keep track of your income.

PLAN

Step 1: Prioritize your goals for the workshop. Step 2: Prepare to create your neopoprealist mural. Step 3: Organize your computer-based materials. Step 4: Have a clear plan. Step 5: Work on being the best you. Step 6: Paint patterns with brushes of suitable size for the particular portion of work you are painting. Step 7: Put together a schedule and chart. Step 8: Buy your investment art when you feel confident of its worth, its price and its ability to grow in value. Step 9: Make a clean brea. Step 10: First, you will need to make sure you have all of the materials listed below. WikiHow Task: How to Use Saddle Soap

Human

Step 1: Read the manufacturer's directions. Step 2: Avoid lye heavy products. Step 3: Make your own saddle soap.

Chain

Step 1: Give her a safe space. Step 2: Let him exercise. Step 3: Be kind to him. Step 4: Offer him a treat.

LLMaP

Step 1: Dip a damp cloth in the saddle soap. Step 2: Dampen several cloths and work them into a lather with the saddle soap. Step 3: Dip a cloth into the mixture. Step 4: Rub the product on with a cloth. Step 5: Clean the grime off with the cloth. Step 6: Dampen a cloth with water. Step 7: Rub a damp microfiber cloth over the surface. Step 8: Use a damp cloth to wipe once again. Step 9: Dampen a towel with lukewarm water. Step 10: Wipe the area clean with a damp towel. Step 11: Soak a towel in warm water.

PLAN

Step 1: Brush your horse. Step 2: Clean the saddle. Step 3: Start by using the first cloth to rub down the leather, making sure you remove any built up grass and dirt. Step 4: Rub the leather down so you remove most of the oil and wax. Step 5: Wipe off excess oil. Step 6: Rinse the leather. Step 7: Clean any other leather tack items (martingale, breastplate, draw reins, cruppers, etc. Table 14 : Showcases of procedural steps predicted by different models with GPT2 as the base LLM.

Model Program

RobotHow Initial Configuration Task: Play games in dining room.

Human

Step 1: Walk to dining room. Step 2: Walk to video game console. Step 3: Find video game console. Step 4: Walk to video game console. Step 5: Switch on video game console. Step 6: Find television. Step 7: Switch on television. Step 8: Find video game controller. Step 9: Grab video game controller. Step 10: Find chair. Step 11: Walk to chair. Step 12: Sit on chair. Task: Play games in dining room. Step 1: Walk to video game controller. Step 2: Put video game controller on diningtable. Step 3: Put boardgame on kitchen table. Step 4: Put boardgame on diningtable.

Chain

Step 1: Walk to dining room. Step 2: Walk to sauce pan. Step 3: Pour bottle water into dish bowl. Step 4: Walk to water. Step 5: Walk to carrot. Step 6: Walk to food salt.

LLMaP

Step 1: Walk to kitchen. Step 2: Walk to food food. Step 3: Walk to dining room.

PLAN

Step 1: walk to dining room. Step 2: find diningtable. Step 3: walk to board game. Step 4: put video game controller on diningtable. Step 5: put food snack on diningtable. Step 6: put board game on diningtable. Step 7: put boardgame on diningtable. Step 8: put board game on kitchen table. Step 9: find video game console. Step 10: put glass on board game. Step 11: grab video game controller. Step 12: put glass on boardgame. RobotHow Final Goal Task: Turn light off and clean

Human

Step 1: Walk to bedroom. Step 2: Walk to light. Step 3: Switch off light. Step 1: Walk to home office. Step 2: Walk to rag. Step 3: Find rag. Step 4: Grab rag. Step 5: Walk to desk. Step 6: Find computer. Step 7: Wipe computer. Step 8: Wipe desk. Step 9: Put back rag.

Chain

Step 1: Walk to kitchen. Step 2: Walk to cooking pot. Step 3: Walk to water. Step 4: Walk to dishwasher.

LLMaP

Step 1: Put light bulb on bowl. Step 2: Switch off light bulb. Step 3: Switch on light.

PLAN

Step 1: plug out lighting. Step 2: put cleaning solution on desk. Step 3: find dish soap. Step 4: scrub light switch. Step 5: wipe lighting. Table 16 : Showcases of procedural steps predicted by different models with GPT2 as the base LLM under counterfactual setting (Initial Configuration, Final Goal).

E DISCUSSION

E.1 LIMITATIONS

Though this work points out a direction for prompting actionable knowledge out of large-scale pre-trained language models with external commonsense knowledge, limitations in reasoning over long-horizon procedural plans remain. Existing datasets for procedural planning, such as WikiHow and RobotHow, are monolingual, supporting only English goals and plans. In the future, it is important to expand these datasets, or build novel datasets, that support multiple languages used across the world. The inherent differences between these languages may also result in different planning strategies in granularity or abstraction level, which is potentially challenging. In addition, long-horizon and complex composite tasks remain challenging for existing procedural planners. The above limitations concern the procedural planning task itself. There are also limitations of our implementation, which is guided by our causal analysis. First, the coverage of the leveraged external resources is limited, which is common in knowledge-enhanced systems. This may result in a wrong understanding of the task and produce unreasonable procedural plans. For example, the knowledge of the word "Turking", which refers to "the act or process of performing small tasks using the Amazon Mechanical Turk service" according to Wiktionary, is not covered by the external resources (e.g., ConceptNet). Since our proposed system does not assume specific external resources, it is plausible to utilize more powerful external resources (e.g., Wiktionary) in the future. Second, the hop number and the threshold of the multi-hop retrieval in task-relevant subgraph sampling are currently configured hyperparameters, which may result in prompts that are not ideally constructed.
Future work could instead make these hyperparameters learnable for each task domain, and explore the pros and cons of end-to-end commonsense-infused prompts versus neuro-symbolically constructed prompts.

E.2 FAILURE ANALYSIS

We discuss detailed failure modes with examples and analyses below. Consider the predicted procedural plans for the task "Turking", which refers to "the act or process of performing small tasks using the Amazon Mechanical Turk service" according to Wiktionary. We compare the plans predicted by the baselines and our method: (1) the ground-truth plan is "Task: Turking. Step 1: Walk to home office. Step 2: Walk to desk. Step 3: Find chair. Step 4: Sit on chair. Step 5: Find computer. Step 6: Switch on computer."; (2) the plan predicted by the Chain baseline is empty; (3) the plan predicted by the LLMaP baseline is "Task: Turking. Step 1: Put teddybear on oven."; (4) our prediction is "Task: Turking. Step 1: Eat food turkey. Step 2: Drink water. Step 3: Sleep." We can see that for such "out-of-knowledge" tasks, our method also fails to plan correctly. We attribute this mainly to the limited knowledge in the external resources, as discussed in Appendix E.1; this main failure mode could be mitigated by introducing larger external resources (e.g., Wiktionary), as in other knowledge-enriched methods.

E.3 ETHICAL CONSIDERATIONS

We hope to de-bias procedural planning so as to avoid misleading either humans or robots with daily-life instructions, which could result in unsafe situations. The cultural bias behind these datasets can be a critical issue for future work. As the ground-truth planning steps usually reflect the culture shared by the English-speaking community, other cultures may have completely different practical considerations that lead to different orderings of these steps, or even to novel steps not proposed by the LLMs we utilize in this paper. In the future, we will consider cultural bias as a proxy variable so that we can adjust the implicit knowledge from LLMs, or the commonsense from external sources, according to the needs of different cultural backgrounds.



Source code and datasets are publicly available at https://sites.google.com/view/iclr-clap

Procedural Planning Learning to generate procedural plans (Zhang et al., 2020a; Lyu et al., 2021; Zhang et al., 2020b; Chang et al., 2020; Wu et al., 2022; Huang et al., 2022) is important for embodied agents (Tellex et al., 2011; Jansen, 2020; Ahn et al., 2022) and conversational assistants (Ilievski et al., 2018; Yang et al., 2022). Previous work views procedural script learning as a structured form of commonsense knowledge (Gupta et al., 2004; Regneri et al., 2010; Wanzare et al., 2016), while more recent work strengthens its association with changing environments for executable action planning (Puig et al., 2018; Shridhar et al., 2020). Some works (Sun et al., 2020; Zhao et al., 2021) explore utilizing human-written programs to precisely specify tasks. Our method tackles the problem with awareness of cause-effect relations by utilizing commonsense-infused prompts via a neuro-symbolic approach (Mao et al., 2019; Nye et al., 2021; Yi et al., 2018) for zero-shot procedural planning.

Causality for Language Generation The integration of causality and machine learning has been an intriguing topic for many problems (Pearl, 2009; Schölkopf, 2022). Previous studies focus on causal inference for natural language understanding (Chen et al., 2020; Keith et al., 2020; Wood-Doughty et al., 2018) and on generating counterfactual text representations (Feder et al., 2021). Weber et al. (2020) propose an intervention method for script learning. However, these methods cannot be directly applied to procedural planning, which requires a formal structure. Our method is based on mediation analysis (VanderWeele, 2015) and causal intervention (Pearl, 2009; Peters et al., 2017).

Prompt for Large Language Models There is an emerging interest in using prompts to extract knowledge from large language models (Chen et al., 2022; Le Scao & Rush, 2021; Su et al., 2022; Ye et al., 2022; Zhou et al., 2022; Kojima et al., 2022). Cao et al. (2022) treat the prompt as a cause of the task-specific predictor and investigate biases in prompt-based probing evaluations. Chain of thought (Wei et al., 2022) discovers that LLMs can perform better on reasoning tasks when the prompt is designed as a series of short sentences that mimic the human reasoning process.

6 CONCLUSION AND FUTURE WORK

Procedural planning is a newly emerged research area of great importance to various applications, such as household robots and virtual assistants. We propose a neuro-symbolic procedural PLANner (PLAN) with commonsense-infused prompts elicited from an external knowledge base to solve the procedural planning problem in a zero-shot manner without human-annotated exemplars. Experiments show the effectiveness of our proposed PLAN under both the original and counterfactual settings, indicating its capability of mitigating spurious correlations by injecting external knowledge into LLMs. Still, procedural planning over long-horizon and composite tasks remains challenging, and exploring multimodal learning and developing human-aligned evaluation metrics are promising future directions in this area.

https://www.wikihow.com
https://github.com/commonsense/conceptnet5/wiki/Relations



Figure 1: Two independent procedural planning task examples from RobotHow and WikiHow. PLAN constructs a commonsense-infused prompt from external knowledge (e.g., ConceptNet) to elicit the procedural planning ability of Large Language Models (LLMs) without training or exemplars.

Figure 2: Structural Causal Model (SCM) for Procedural Planning. (a) The full temporal causal graph. T denotes the task query, and S_i is the sub-goal step at timestep i. D is the unobservable confounding variable introduced by the LLMs. P_i denotes the mediating variables we construct to mitigate the spurious correlation. (b) The SCM at timestep i. Without causal intervention, the model produces the sub-goal step "find television" due to the spurious correlation between "television" and "living room" caused by the confounding variable D. With our causal intervention, the constructed mediating variable P_i (Section 3.1) blocks the backdoor paths for T → S_i and S_{i-1} → S_i (opened by D), so the model precisely generates the causal sub-goal "find book" (Section 3.2).

Figure 3 and Algorithm 1 depict how PLAN tackles procedural planning for the running example Task T: "Watch TV in the living room", where Step 1 (S_1) is "Walk to the living room" and Step 2 (S_2) is "Find Remote Control", produced through semantic parsing, generation with LM_G, and translation with LM_T.

Figure 3: The Overview of Procedural Planning. Our five-stage pipeline: 1) semantically parse the task T into the concept set T_E to retrieve the subgraph G_s from the external knowledge base G; 2) formalize the procedural prompt P_G and translate it into the admissible one P̂_G; 3) aggregate the task, previous steps, and P̂_G into the final commonsense-infused prompt P (Section 3.1); 4) and 5) generate and translate temporally extended procedural plans until the termination condition is triggered (Section 3.2).

the Appendix summarizes the categories and descriptions. The counterfactual dataset construction details and post-intervention examples are provided in Appendix B.2.

Figure 4: The front-door adjustment for the Causal Procedural Planner. (a) The structural causal model at timestep i = 1. T denotes the task name and S_1 denotes the step at timestep 1. D is the unobservable confounding variable introduced by the pre-training data. P_1 denotes the mediating variable we construct to mitigate the spurious correlation at timestep 1. (b) D opens up backdoor paths for T → S_i, P_{i-1} → S_i, and S_{i-1} → S_i, which can be blocked by introducing P_i. Path 1 and path 2 share the same edge D → T. Intervention on T blocks D → T and backdoor path 2. Intervention on S_{i-1} blocks D → S_{i-1} and backdoor path 3. (c) The structural causal model at timestep i > 1 after simplification based on Equations 12-16.
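The blocking argument in the caption is an instance of the standard front-door adjustment (Pearl, 2009). A sketch of the formula at timestep i = 1, written with the figure's variable names; the paper's Equations 12-16, which also condition on previous steps, may differ in detail:

```latex
% Front-door adjustment with mediator P_1 between task T and step S_1:
% the causal effect of do(T = t) on S_1 is recovered from observational terms.
P(S_1 \mid do(T = t))
  = \sum_{p} P(P_1 = p \mid T = t)
    \sum_{t'} P(S_1 \mid P_1 = p, T = t')\, P(T = t')
```

The outer sum marginalizes over the constructed mediator P_1, while the inner sum re-weights by the marginal of T, which is what severs the backdoor path opened by D.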

Figure 5: The causal graphs after the do-operation. (a) The causal graph transition of the Structural Causal Model at timestep i = 1. (b) The causal graph transition of the Structural Causal Model at timestep i > 1.

This dataset is released under the Attribution-NonCommercial-ShareAlike 4.0 International Creative Commons License. We evaluate inference on 150 tasks randomly selected from the dataset. Each program contains the task name, task description, and steps. We use the task name and the sequence of steps as our input and output references. Each step is a composition of [Action], [Object], and [Number]. For example, the sequence of steps for the task "Watch TV" is: 1. [Walk] <TELEVISION> (1) 2. [SwitchOn] <TELEVISION> (1) 3. [Walk] <SOFA> (1) 4. [Sit] <SOFA> (1) 5. [Watch] <TELEVISION> (1).
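The `[Action] <OBJECT> (Number)` step format above is regular enough to parse mechanically. A minimal sketch of such a parser, assuming only the format shown in the example (the function name and error handling are our own, not part of the dataset tooling):

```python
import re

# One RobotHow-style step: "[Walk] <TELEVISION> (1)"
STEP_PATTERN = re.compile(r"\[(?P<action>\w+)\]\s*<(?P<obj>\w+)>\s*\((?P<num>\d+)\)")

def parse_step(step: str):
    """Split a step string into its (action, object, number) components."""
    m = STEP_PATTERN.match(step.strip())
    if m is None:
        raise ValueError(f"Unrecognized step format: {step!r}")
    return m.group("action"), m.group("obj"), int(m.group("num"))

print(parse_step("[SwitchOn] <TELEVISION> (1)"))  # ('SwitchOn', 'TELEVISION', 1)
```

Such a parser is enough to recover the structured action sequence from the flat program text when comparing predictions against references.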

Final Goal intervention example.
Task 1: Turn light off. Step 1: Walk to bedroom. Step 2: Walk to light. Step 3: Switch off light.
Task 2: Clean. Step 1: Walk to home office. Step 2: Walk to rag. Step 3: Find rag. Step 4: Grab rag. Step 5: Walk to desk. Step 6: Find computer. Step 7: Wipe computer. Step 8: Wipe desk. Step 9: Put back rag.
Composite Task: Turn light off and Clean. Step 1: Walk to bedroom. Step 2: Walk to light. Step 3: Switch off light. Step 4: Walk to home office. Step 5: Walk to rag. Step 6: Find rag. Step 7: Grab rag. Step 8: Walk to desk. Step 9: Find computer. Step 10: Wipe computer. Step 11: Wipe desk. Step 12: Put back rag.

Figure 6: Amazon Mechanical Turk Platform. Questions Layout for Human Raters for Win-Tie-Lose Comparison.

Figure 7: Amazon Mechanical Turk Platform. Questions Layout for Human Raters for 5 Point Likert Scale.

Figure 8: Amazon Mechanical Turk Platform. Questions Layout for Human Raters for 5 Point Likert Scale on Success Rate.

Task: Write an email Sequence of Steps: Step 1: Walk to home office Step 2: Walk to computer Step 3: Find computer Step 4: Turn to computer Step 5: Look at computer Step 6: Walk to computer Step 7: Find chair Step 8: Sit on chair Step 9: Find keyboard Step 10: Grab keyboard Step 11: Find mouse Step 12: Grab mouse Step 13: Type on keyboard

The nodes N_e are finally retrieved by ranking the adapted weight Ê_w. To better track the utilized external knowledge during inference, we construct the task-dependent commonsense prompt with a Symbolic Executor (Symbolic Structuring) guided by the relation type of each triplet in G_s whose adapted edge weight is beyond the threshold θ_e. Specifically, the Symbolic Executor acquires the neural information of each natural language node and executes the sequential mapping program by sampling the operation Op from the Symbolic Rule Set R according to the edge relation type. The Symbolic Rule Set R is obtained by mapping the descriptions of the relations in the external knowledge graph (e.g., ConceptNet) to symbolic operations (e.g., Op_AtLocation); for instance, AtLocation represents "A is a typical location for B, or A is the inherent location of B. Some instances of this would be considered meronyms in WordNet." The AtLocation edge samples the operation Op_AtLocation from R, which takes the commonsense relation of the triplet from G_s as the parameters to query the procedural concept output given the natural language meaning of the linked nodes (e.g., go to the location of Start_Node_Of(r_e) in this case). Similarly, Op_UsedFor may refer to "go to find End_Node_Of(r_e) and use it for Start_Node_Of(r_e)". The operators Op_HasSubevent and Op_HasPrerequisite recursively navigate the subgraph G_s. After navigating the subgraph, we linearize the transformed triplets as the Procedural Prompt P_G, which is then translated to the Admissible Knowledge Prompt P̂_G by the Translation Language Model LM_T.
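The relation-to-operation mapping above can be sketched as a small rule table. This is an illustrative sketch, not the paper's implementation: the templates are hypothetical paraphrases of the operations described in the text (Op_AtLocation, Op_UsedFor, etc.), and the toy subgraph is made up for the demo:

```python
# Symbolic Rule Set R: each ConceptNet relation type maps to an operation
# that verbalizes a triplet (head, relation, tail) into a procedural phrase.
SYMBOLIC_RULES = {
    "AtLocation": lambda head, tail: f"go to the {tail}",
    "UsedFor": lambda head, tail: f"find the {tail} and use it for {head}",
    "HasSubevent": lambda head, tail: f"{head} includes the step: {tail}",
    "HasPrerequisite": lambda head, tail: f"before {head}, first {tail}",
}

def verbalize_triplet(head: str, relation: str, tail: str) -> str:
    """Apply the operation selected by the edge's relation type."""
    op = SYMBOLIC_RULES.get(relation)
    if op is None:
        return f"{head} {relation} {tail}"  # fall back to flat linearization
    return op(head, tail)

# Linearize a toy subgraph G_s into a procedural prompt P_G.
subgraph = [("watch tv", "AtLocation", "living room"),
            ("watch tv", "HasPrerequisite", "switch on the television")]
prompt = ". ".join(verbalize_triplet(*t) for t in subgraph)
print(prompt)  # go to the living room. before watch tv, first switch on the television
```

In the full system, the linearized string would then be passed to LM_T to obtain the admissible prompt P̂_G.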

Edge-wise adaption as Ê_w ← E_w + cosine(n_{E_tail}, v_task) and re-rank N_e in T_E; 5: Map the description text of the relations R_s in G_s into the Symbolic Rule Set R; 6: Construct the procedural prompt P_G by verbalizing the re-weighted G_s using R; 7: Translate P_G into the Admissible Knowledge Prompt P̂_G = LM_T(P_G); Temporally-extended zero-shot inference for the Procedural Plan S_T = {S_1, ..., S_i}: 8: for each timestep i do
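The edge-wise adaption and re-ranking step can be sketched as follows. The dense toy embeddings stand in for real node/task encodings and are purely hypothetical; only the update rule Ê_w ← E_w + cosine(n_tail, v_task) comes from the algorithm:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def adapt_edge_weights(edges, node_emb, task_emb):
    """Edge-wise adaption: add each tail node's similarity to the task
    embedding onto the edge weight, then re-rank edges by adapted weight.
    `edges` is a list of (head, relation, tail, weight) tuples."""
    adapted = [(h, r, t, w + cosine(node_emb[t], task_emb))
               for h, r, t, w in edges]
    return sorted(adapted, key=lambda e: e[3], reverse=True)

edges = [("watch tv", "AtLocation", "kitchen", 0.9),
         ("watch tv", "AtLocation", "living room", 0.5)]
node_emb = {"kitchen": [0.0, 1.0], "living room": [1.0, 0.0]}
ranked = adapt_edge_weights(edges, node_emb, task_emb=[1.0, 0.0])
print([e[2] for e in ranked])  # ['living room', 'kitchen']
```

The task-similar tail node overtakes the edge with the higher prior weight, which is the intended effect of the adaption.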

simulator. The dataset contains programs with high-level task names and low-level steps. M_T is composed of 292 and 2000 distinct tasks from RobotHow and WikiHow, respectively. Human evaluations use 50 randomly sampled task examples for each dataset. Automatic evaluations use 150 and 1000 task examples randomly sampled from RobotHow and WikiHow, respectively. Please refer to Appendix B.1 and Appendix B.2 for dataset details. Percentages of procedural planning results of PLAN that are better than, tied with, or worse than Planner (Huang et al., 2022), in coverage and order metrics under the original and counterfactual settings.

Showcases of procedural steps predicted by different models with GPT2 as the base LLM.

Automatic evaluation results on the Original RobotHow and WikiHow. Metrics are computed between the annotated programs and the predictions.

In Table 8 in Appendix C.2, we show that the averaged performance gain of PLAN over the baselines is consistent or more significant in more complicated procedural planning settings. This indicates the superiority of PLAN in solving long-horizon tasks. PLAN consistently outperforms baselines by a large margin and experiences a much smaller performance drop than the powerful baselines when switching to the counterfactual setting. We assume this is due to the biased knowledge of the holdout examples and manual exemplars utilized in the baselines, which are vulnerable to counterfactual samples. Automatic evaluations on counterfactual RobotHow are summarized in Table 13 in Appendix C.2. Aligned with the human evaluations, PLAN achieves the best performance. The overall poor performance in the Final Goal category indicates the challenge of long-horizon and composite procedural planning, while the overall better performance in the Intermediate Step category benefits from the intermediate guidance.

4.5 CORRELATION BETWEEN AUTOMATIC AND HUMAN EVALUATION

We evaluate the segment-level Pearson correlation between human and automatic metrics. We observe that BERTScore has a moderate correlation to the human coverage score and WMD has a moderate correlation to the human order score, with 23.3% and 32.3%, respectively. Similar to prior findings (Xu et al., 2021), n-gram-based metrics (Sentence-BLEU and ROUGE) have a relatively weaker correlation to the human coverage score, with Pearson correlations of 16.4% and 21.1%.
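The segment-level correlation above is the standard sample Pearson coefficient between per-example metric scores and human ratings. A minimal self-contained version (in practice one would use `scipy.stats.pearsonr`; the toy inputs in the test are illustrative, not the paper's scores):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance and standard deviations share the 1/n factor, so it cancels.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Feeding it one automatic-metric score and one human score per evaluated segment yields the correlations reported in this section.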

captures commonsense knowledge explicitly with triplets of (head node, relation, end node). It contains 799,273 nodes and 2,487,810 edges that

Three Types of Counterfactual Procedural Planning. Three intervention methods, namely initial configuration, intermediate step, and final goal, are applied to intervene on the original procedural data.

Comparison between Standard and Counterfactual Procedural Planning. Three intervention methods, namely initial configuration, intermediate step, and final goal, are applied to intervene on the original procedural data.

Evaluation results on the Original RobotHow by separating the test set into several Step Buckets. Columns: Bucket, S-BLEU, WMD, BERT-f1, ROUGE-f1, Coverage, Order, Step, Avg. Time Cost (ms).

Evaluation results on the Original WikiHow by separating the test set into several Step Buckets.

Showcases of procedural steps predicted by different models with GPT3 as the base LLM on WikiHow.

Averaged 5-point Likert scale human evaluations on Success Rate aspect with GPT3 language model architecture.

Automatic evaluation results for additional ablation on the Original RobotHow and WikiHow. Metrics are computed between the annotated programs and the predictions.

Automatic evaluation results on the Counterfactual RobotHow with language model GPT2. We provide running examples with intermediate output for each module in the following paragraphs. First, we show the intermediate output of the input task T, the subgraph G_s depicted as tuples of (start node, relation type, tail node, edge weight), the knowledge prompt P_G, and the translated P̂_G as below:

• Task-relevant subgraph G_s (N_head, R_e, N_tail, E_w):

Table 15 shows random samples from the counterfactual datasets with the Intermediate Step intervention method, and Table 16 shows random samples from counterfactual RobotHow with the Initial Configuration and Final Goal intervention methods.

ACKNOWLEDGMENTS

The research was sponsored by the U.S. Army Research Office and was accomplished under Contract Number W911NF-19-D-0001 for the Institute for Collaborative Biotechnologies. This work was also supported by the National Science Foundation award #2048122. We thank the Robert N. Noyce Trust for their generous gift to the University of California via the Noyce Initiative. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

We also conduct paired t-test (p < 0.05) statistics over the variants "w/o Adaption" and "w/o Symbolic". Compared with the full model PLAN, these variants experienced a statistically significant

