PLANNING WITH SEQUENCE MODELS THROUGH ITERATIVE ENERGY MINIMIZATION

Abstract

Recent works have shown that sequence modeling can be effectively used to train reinforcement learning (RL) policies. However, the success of applying existing sequence models to planning, in which we wish to obtain a trajectory of actions to reach some goal, is less straightforward. The typical autoregressive generation procedure of sequence models precludes sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we suggest an approach towards integrating planning with sequence models based on the idea of iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, including new-task generalization, test-time constraint adaptation, and the ability to compose plans together.

1. INTRODUCTION

Sequence modeling has emerged as a unified paradigm to study numerous domains such as language (Brown et al., 2020; Radford et al., 2018) and vision (Yu et al., 2022; Dosovitskiy et al., 2020). Recently, Chen et al. (2021) and Janner et al. (2021) have shown how a similar approach can be effectively applied to decision making, by predicting the next action to take. However, in many decision-making domains, it is suboptimal to simply predict the next action to execute, as such an action may be only locally optimal and lead to a global dead end. Instead, it is more desirable to plan a sequence of actions towards a final goal, and choose the action that is optimal for the overall goal. Unlike greedily picking the next action to execute, effectively constructing an action sequence towards a given goal requires a careful, iterative procedure, in which we assess and refine intermediate actions in a plan to ensure we reach the final goal. To refine an action at a particular timestep in a plan, we must reconsider actions both before and after the chosen action. Directly applying this procedure to standard language generation is difficult, as the standard autoregressive decoding procedure prevents regeneration of previous actions based on future ones. For example, if the first five predicted actions place an agent at a location too far from a given goal, there is no way to change the early portions of the plan.

In this paper, we propose an approach to iteratively generate plans using sequence models. Our approach, Multistep Energy-Minimization Planner (LEAP), formulates planning as an iterative optimization procedure on an energy function over trajectories defined implicitly by a sequence model (illustrated in Figure 1). To define an energy function across trajectories, we train a bidirectional sequence model using a masked language modeling (MLM) objective (Devlin et al., 2019).
We define the energy of a trajectory as the negative pseudo-likelihood (PLL) of this MLM (Salazar et al., 2020) and sequentially minimize this energy value by replacing actions at different timesteps in the trajectory with the marginal estimates given by the MLM. Since our MLM is bidirectional in nature, the choice of a new action at a given timestep is informed by both future and past actions. By iteratively generating actions through planning, our proposed framework outperforms prior methods in both BabyAI (Chevalier-Boisvert et al., 2019) and Atari (Bellemare et al., 2013) tasks. Furthermore, by formulating the action generation process as an iterative energy minimization procedure, we can generalize to environments with new sets of test-time constraints as well as to more complex planning problems. Finally, such an energy minimization procedure enables us to compose the planning procedures of different models together, constructing a plan that achieves multiple objectives.

Concretely, in this paper, we contribute the following. First, we introduce LEAP, a framework through which we may iteratively plan with sequence models. Second, we illustrate how such a planning framework is beneficial in both BabyAI and Atari domains. Finally, we show that iterative planning through energy minimization yields a set of unique properties: better test-time performance in more complex environments and in environments with new test-time obstacles, and the ability to compose multiple learned models together to jointly generate plans that satisfy multiple sets of goals. Unlike prior energy-based model training (Du et al., 2020), our training approach relies on a more stable masked language modeling objective.

3. METHOD

In this section, we describe our framework, Multistep Energy-Minimization Planner (LEAP), which formulates planning as an energy minimization procedure. Given a set of trajectories in a discrete action space, with each trajectory containing state and action sequences $(s_1, a_1, s_2, a_2, \ldots, s_N, a_N)$, our goal is to learn a planning model which, given the trajectory context $\tau_{\text{ctx}}$ containing the states and actions of the past $K$ steps, predicts a sequence of actions $a_{1:T}$ that maximizes a long-term task-dependent objective $\mathcal{J}$:

$$a^*_{1:T} = \arg\max_{a_{1:T}} \mathcal{J}(\tau_{\text{ctx}}, a_{1:T}),$$

where $N$ denotes the length of the entire trajectory and $T$ is the planning horizon. We use the abbreviation $\mathcal{J}(\tau)$, where $\tau := (\tau_{\text{ctx}}, a_{1:T})$, to denote the objective value of that trajectory. To formulate this planning procedure, we learn an energy function $E_\theta(\tau)$, which maps each trajectory $\tau$ to a scalar-valued energy so that

$$a^*_{1:T} = \arg\min_{a_{1:T}} E_\theta(\tau_{\text{ctx}}, a_{1:T}).$$

3.1. LEARNING TRAJECTORY LEVEL ENERGY FUNCTIONS

We wish to construct an energy function $E_\theta(\tau_{\text{ctx}}, a_{1:T})$ such that minimal energy is assigned to the optimal set of actions $a^*_{1:T}$. To train our energy function, we assume access to a dataset of $M$ near-optimal demonstrations in the environment, and train our energy function to assign low energy to these demonstrations. Below, we introduce masked language models, and then discuss how we may obtain the desired energy function from masked language modeling.

Masked Language Models. Given a trajectory of the form $(s_1, a_1, s_2, a_2, \ldots, s_n, a_n)$, we train a transformer language model to model the marginal likelihood $p_\theta(a_t \mid \tau_{\text{ctx}}, a_{-t})$, where $a_{-t}$ is shorthand for the actions $a_{1:T}$ except the action at timestep $t$. To train this masked language model (MLM), we utilize the standard BERT training objective (Devlin et al., 2019), where we minimize the loss function

$$\mathcal{L}_{\text{MLM}} = \mathbb{E}_{\tau, t}\left[-\log p_\theta(a_t \mid \tau_{\text{ctx}}, a_{-t})\right],$$

masking out and predicting the marginal likelihood of a percentage of the actions in a trajectory (see Section A.4 for masking details).

Constructing Trajectory Level Energy Functions. Given a trained MLM, we define the energy of a trajectory as the sum of the negative marginal log-likelihoods of each action in the sequence:

$$E_\theta(\tau) = -\sum_t \log p_\theta(a_t \mid \tau_{\text{ctx}}, a_{-t}). \quad (1)$$

Such an energy function, also known as the pseudo-likelihood of the MLM, has been used extensively in prior work in NLP (Goyal et al., 2021; Salazar et al., 2020), and has been shown to effectively score the quality of natural language text, outperforming direct autoregressive scoring (Salazar et al., 2020). In our planning context, this translates to assigning low energy to optimal planned actions, which is our desired goal for $E_\theta(\tau)$. We illustrate the energy computation process in Figure 2.
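As a concrete sketch of the pseudo-likelihood energy above, the computation reduces to summing negative log-marginals of the chosen actions. The fixed `marginals` array below is a mock stand-in for the trained MLM's per-timestep predictions, not the paper's model:

```python
import numpy as np

def pll_energy(marginals, actions):
    """Negative pseudo-likelihood energy: E(tau) = -sum_t log p(a_t | ctx, a_{-t}).

    marginals: (T, A) array, row t = MLM marginal over A actions at timestep t.
    actions:   length-T sequence of chosen action indices.
    """
    logp = np.log(marginals[np.arange(len(actions)), actions])
    return -logp.sum()

# Toy example: 3 timesteps, 4 discrete actions.
marginals = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])
good_plan = [0, 1, 2]   # picks the high-marginal actions
bad_plan = [3, 2, 2]    # picks low-marginal actions
assert pll_energy(marginals, good_plan) < pll_energy(marginals, bad_plan)
```

Lower energy thus corresponds to action sequences the MLM considers jointly more plausible.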

3.2. PLANNING AS ENERGY MINIMIZATION

Given a learned energy function $E_\theta(\tau)$, which assigns low energy to optimal trajectories $a^*_{1:T}$, we wish to plan a sequence of actions at test time that minimizes this energy function. To implement this planning procedure, we use Gibbs sampling over individual actions at each timestep of the energy function $E_\theta(\tau)$ to obtain low-energy plans, as detailed below. For test-time plan generation, we initialize a masked trajectory of length $T$, with a small context of past states and actions:

$$\tau = \big(\underbrace{s_1, a_1,\; \ldots,\; s_{n-1}, a_{n-1}}_{\text{context}},\; \underbrace{s_n, [\text{PAD}],\; s_n, [\text{PAD}],\; \ldots,\; s_n, [\text{PAD}]}_{\text{plan}}\big), \quad (3)$$

where future action slots are filled with $[\text{PAD}]$ tokens and future state slots with the current state $s_n$. At each step of Gibbs sampling, we randomly mask out one or multiple action tokens at padded locations and perform a forward pass to estimate their energy distribution conditioned on the trajectory context $\tau^i_{\setminus I}$, the outcome of the previous iteration with the actions at the sampled timesteps in the index set $I$ masked out. An action $a_t$ is then sampled according to the locally normalized energy score, $a_t \sim p_\theta(a_t \mid \tau_{\text{ctx}}, a_{-t})$. The process is illustrated in Figure 2, where actions with low energy values (in blue) are sampled to minimize $E_\theta(\tau^i)$ in each iteration. To sample effectively from the underlying energy distribution, we repeat this procedure for multiple iterations, as illustrated in Algorithm 1. The computational time of Algorithm 1 increases linearly with the number of iterations.

Algorithm 1 Iterative Planning through Energy Minimization (for discrete actions)
1: Initialize $\tau^0$ with the context and $[\text{PAD}]$ tokens at the $T$ planned timesteps
2: for $i = 0, 1, \ldots, N-1$ do
3:   // Sample timesteps to mask
4:   $I \sim [1, 2, \cdots, T]$
5:   // Estimate the energy distributions at the masked tokens
6:   $E \leftarrow f(h(\tau^i_{\setminus I}; \theta))$
7:   // Sample the action tokens based on energy value
8:   $a \sim E$
9:   // Update actions $a$ in $\tau$ at the masked tokens
10:  $\tau^{i+1} \leftarrow \tau^i_{\setminus I} + a$
11: end for
12: Execute all planned actions $a_{1:T}$, or only the first planned action $a_1$, in the padded trajectory $\tau$

Note that our resultant algorithm has several important differences from sequence-model-based approaches such as Decision Transformer (DT) (Chen et al., 2021). First, actions are generated using an energy function defined globally across actions, enabling us to choose each action while factoring in the entire trajectory context. Second, our action generation procedure is iterative in nature, allowing us to spend additional computation time to find better solutions towards the final goal.
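A minimal sketch of the Gibbs-style planning loop above. Here `mlm_marginal` is a hand-crafted stand-in for the trained MLM (it simply prefers a hypothetical `goal` action at each timestep); a real system would run a forward pass of the masked language model instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlm_marginal(plan, t, n_actions, goal):
    """Mock stand-in for the MLM marginal p(a_t | context, a_{-t}).
    Strongly prefers action goal[t]; illustrative assumption only."""
    logits = np.full(n_actions, -4.0)
    logits[goal[t]] = 4.0
    p = np.exp(logits)
    return p / p.sum()

def gibbs_plan(T, n_actions, goal, n_iters=200, block=1):
    # Initialize the plan at random (the [PAD] slots to be filled).
    plan = rng.integers(0, n_actions, size=T)
    for _ in range(n_iters):
        # Randomly choose timesteps to mask and resample.
        idx = rng.choice(T, size=block, replace=False)
        for t in idx:
            p = mlm_marginal(plan, t, n_actions, goal)
            plan[t] = rng.choice(n_actions, p=p)  # sample from local distribution
    return plan

goal = [0, 1, 2, 3, 0]
plan = gibbs_plan(T=5, n_actions=4, goal=goal)
```

After enough iterations, the resampled plan concentrates on the low-energy (high-marginal) actions at each position.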

4. PROPERTIES OF MULTISTEP ENERGY-MINIMIZATION PLANNER

In LEAP, we formulate planning as an optimization procedure $\arg\min_\tau E_\theta(\tau)$. By formulating planning in this manner, our approach enables online adaptation to new test-time constraints, generalization to harder planning problems, and plan composition to achieve multiple sets of goals (illustrated in Figure 3).

Online adaptation. In LEAP, recall that plans are generated by minimizing an energy function $E_\theta(\tau)$ across trajectories. If new external constraints arise at test time, we may correspondingly define a new energy function $E_{\text{constraint}}(\tau)$ to encode them. For instance, if a state becomes dangerous at test time (illustrated as a red grid cell in Figure 3(a)), we may directly define an energy function that assigns zero energy to plans avoiding this state and high energy to plans utilizing it. We may then generate plans satisfying the constraint by minimizing the summed energy function $\tau^* = \arg\min_\tau (E_\theta(\tau) + E_{\text{constraint}}(\tau))$ with Gibbs sampling. While online adaptation may also be integrated with other sampling-based planners using rejection sampling, our approach directly integrates trajectory generation with online constraints.

Novel Environment Generalization. In LEAP, we leverage many steps of sampling to recover an optimal trajectory $\tau^*$ which minimizes our learned trajectory energy function $E_\theta(\tau)$. When the test-time environment is more complex than those seen at training time (illustrated in Figure 3(b)), the underlying energy function necessary to evaluate plan feasibility may remain simple (i.e., checking whether actions enter obstacles), but the underlying planning problem becomes much more difficult. In these settings, as long as the learned function $E_\theta(\tau)$ generalizes, we may simply use more steps of sampling to recover a successful plan in the more complex environment.
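The summed-energy adaptation above can be sketched on a toy 2-D grid. The `base_energy` and `constraint_energy` functions below are hypothetical stand-ins for $E_\theta$ and $E_{\text{constraint}}$, and exhaustive minimization replaces Gibbs sampling for clarity:

```python
import itertools

ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # up, down, right, left
GOAL = (2, 0)

def rollout(plan, start=(0, 0)):
    """Deterministic toy dynamics: the cells visited by a plan."""
    x, y = start
    visited = []
    for a in plan:
        dx, dy = ACTIONS[a]
        x, y = x + dx, y + dy
        visited.append((x, y))
    return visited

def base_energy(plan):
    # Toy stand-in for learned E_theta: Manhattan distance to goal at plan end.
    x, y = rollout(plan)[-1]
    return abs(x - GOAL[0]) + abs(y - GOAL[1])

def constraint_energy(plan, lava):
    # Zero energy if the plan avoids lava cells, very high energy otherwise.
    return 1e6 * sum(p in lava for p in rollout(plan))

lava = {(1, 0)}  # the direct path is blocked at test time

# Minimize the summed energy over all length-4 plans.
best = min(itertools.product(range(4), repeat=4),
           key=lambda p: base_energy(p) + constraint_energy(p, lava))
assert (1, 0) not in rollout(best)  # the plan detours around the lava
assert rollout(best)[-1] == GOAL    # and still reaches the goal
```

Neither energy term alone yields this behavior: the base energy ignores the lava, and the constraint alone ignores the goal; their sum satisfies both.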
Task compositionality. Given two different instances of LEAP, $E^1_\theta(\tau)$ and $E^2_\theta(\tau)$, encoding separate planning tasks, we may generate a trajectory accomplishing the tasks encoded by both models by minimizing a composed energy function (assuming task independence): $\tau^* = \arg\min_\tau (E^1_\theta(\tau) + E^2_\theta(\tau))$. A simple instance of such a setting is illustrated in Figure 3(c), where the first LEAP model $E^1_\theta(\tau)$ encodes two obstacles in an environment, and a second LEAP model $E^2_\theta(\tau)$ encodes four other obstacles. By jointly optimizing both energy functions (through Gibbs sampling), we can construct a plan which avoids all obstacles in both models.

5.1. BABYAI

Setup. The BabyAI benchmark comprises an extensible suite of tasks in environments of different sizes and shapes, where the reward is given only when the task is successfully finished. We evaluate models on trajectory planning, which requires the agent to move to the goal object through turn-left, turn-right, and move-forward actions, and on general instruction completion tasks, in which the agent needs to take additional pickup, drop, and open actions to complete the instruction. For detailed experimental settings and parameters, please refer to Table 5 in the Appendix.

Results. Across all environments, LEAP achieves the highest success rate, with the margin magnified on larger, harder tasks; see Table 1. In particular, LEAP solves easy tasks like GoToLocalS8N7 with nearly 90% success rate, and has large advantages over the baselines in large maze worlds (GoToObjMazeS7) and complex tasks (GoToSeqS5R2), which require going to a sequence of objects in the correct order. In contrast, the baselines perform poorly on these difficult tasks. Next, we visualize the underlying iterative planning and execution procedure in the GoToLocalS8N7 and GoToSeqS5R2 tasks. On the left side of Figure 5, we show how the trajectory is optimized and its energy minimized at a single time step. Through iterative refinement, the final blue trajectory is closer to the optimal solution than the original red one: it follows the correct direction more efficiently and performs actions such as opening the door in the correct situation. On the right side, we show the entire task completion process across many time steps. LEAP successfully plans an efficient trajectory to visit the two objects, and opens the door when blocked. We also explore model performance in stochastic settings; please refer to Appendix B.

Effect of Iterative Refinement. We investigate the effect of iterative refinement by testing the success rate of our approach under different numbers of sampling iterations in the GoToLocalS7N5 environment.
As shown on the left side of Figure 6, the task success rate continues to improve as we increase the number of sampling iterations.

Energy Landscape Verification. We further verify our approach by visualizing the energy assigned to various trajectories in the same environment as above. More specifically, we compare the estimated energy of the labeled optimal trajectory with that of noised trajectories, produced by randomizing a percentage of steps in the optimal action sequence. The right side of Figure 6 depicts the energy assigned to trajectories with various corruption levels as a function of training time. As training progresses, LEAP learns to (a) reduce the energy value assigned to the optimal trajectory and (b) increase the energy values assigned to the corrupted trajectories. This result supports the performance of LEAP.

Effect of Training Data. In BabyAI, we utilize a set of demonstrations generated by an oracle planner. We further investigate the performance of LEAP when the training data is not optimal. To this end, we randomly swap actions in the optimal demonstrations with arbitrary actions with probability 25%. We compare against DT, the autoregressive sequential planner. Despite a small performance drop in Table 2, LEAP still substantially outperforms DT, indicating that LEAP works well with imperfect data.

5.2. ATARI

Setup. We further evaluate our approach in the dynamically changing Atari environment, with higher-dimensional visual states. Due to these features, we train and test our model without a goal state, and update the plan after each step of execution to account for unexpected changes in the world. We compare our model to BC, DT (Chen et al., 2021), and offline RL baselines (CQL, QR-DQN, and REM).

Results. Table 3 shows the comparison. LEAP achieves the best average performance. Specifically, it achieves better or comparable results in 3 out of 4 games, whereas the baselines typically perform poorly in more than one game.

Energy Landscape Verification. In the Atari environment, the training trajectories are generated by an online DQN agent during training, and their accumulated rewards vary widely. LEAP is trained to estimate the energy values of trajectories according to their rewards. In Figure 7, we visualize the estimated energies of different training trajectories and their corresponding rewards in the Breakout and Pong games. We observe that the energy value estimated for a trajectory is well correlated with its reward, with low energy assigned to high-reward trajectories. This supports the correctness of our trained model and further gives a natural objective to assess the relative quality of a planned trajectory. In the Qbert and Seaquest games, where LEAP gets low scores, this negative correlation is not apparent, suggesting that the model is not well trained.
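The "plan, execute one action, replan" loop used in Atari is a receding-horizon control pattern; a minimal sketch follows, with a toy environment and a trivial `plan_fn` standing in for LEAP's iterative energy minimization (both are illustrative assumptions):

```python
def receding_horizon_control(env_step, plan_fn, obs, n_steps, horizon):
    """Plan a length-`horizon` action sequence, execute only its first
    action, observe the new state, and plan again."""
    total_reward = 0.0
    for _ in range(n_steps):
        plan = plan_fn(obs, horizon)   # e.g. iterative energy minimization
        obs, r = env_step(plan[0])     # execute only the first action
        total_reward += r
    return total_reward

# Toy usage: reach position 5 on a line; the "planner" proposes
# a constant plan toward the target.
state = {"pos": 0}
TARGET = 5

def env_step(action):
    state["pos"] += 1 if action == 1 else -1
    return state["pos"], float(state["pos"] == TARGET)

def plan_fn(obs, horizon):
    return [1 if obs < TARGET else 0] * horizon

total = receding_horizon_control(env_step, plan_fn, obs=0, n_steps=5, horizon=3)
```

Replanning after every step discards the stale tail of each plan, which is what protects against the environment changing between steps.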

6. PROPERTIES OF MULTISTEP ENERGY-MINIMIZATION PLANNER

Next, we analyze the unique properties of LEAP described in §4 in customized BabyAI environments. For each environment, we design up to three settings with increasing difficulty levels to gradually challenge the planning model. As before, the target-reaching success rate is used as the evaluation criterion. Performance is compared against Implicit Q-Learning (IQL) (Kostrikov et al., 2021) and Decision Transformer (DT) (Chen et al., 2021).

6.1. ONLINE ADAPTATION

Setup. To examine LEAP's ability to adapt to test-time constraints, we construct a BabyAI environment with multiple lethal lava zones at test time, as depicted in Figure 8(a). The planner $E_\theta$ generates the trajectory without awareness of the lava zones. Once planned, the energy prediction is corrected by the constraint energy function $E_{\text{constraint}}(\tau)$, which assigns a large energy to any immediate action leading into lava, and zero otherwise. The agent traverses under the guidance of the updated energy estimates. To make the benchmark comparison fair, we also remove any immediate action that would lead into lava for all baselines. The difficulty levels are characterized by the number of lava cells added and the way they are added: easy and medium correspond to adding at most 2 and 5 lava cells, respectively, on the way to the goal object in an 8×8 grid world. The third case is hard due to its unstructured maze world, in which the narrow paths can easily be blocked by lava cells, requiring the agent to plan a trajectory that bypasses them.

Results

The quantitative comparison is collected in Table 4, Left. Although performance drops with harder challenges, our model still exceeds both baselines under all settings. A visual illustration of a medium example is shown in Figure 8(a): the agent first goes up to bypass the lava cells and then moves toward the goal object.

6.2. NOVEL ENVIRONMENT GENERALIZATION

Setup. To evaluate LEAP's generalization ability in unseen test environments, we train the model in easier environments but test it in more challenging ones. In the easy case, the model is trained in an 8×8 world without any obstacles but tested in a world with 14 obstacles as distractors. In the medium and hard cases, the model is trained in a single-room world but tested in maze worlds containing multiple rooms connected by narrow paths (Figure 8(b)); the maze size and the number of rooms are 10×10 and 9 in the hard case, larger than the 7×7 and 4 of the medium case.

Results. Our model achieves the best average performance across the three cases, though it is slightly worse than IQL in the hard case; see Table 4, Middle. In contrast, the sequential RL model DT performs significantly worse when moved to unseen maze environments. LEAP trained in the plane world can still plan a decent trajectory in an unseen maze environment after being blocked by walls; see Figure 8(b).

6.3. TASK COMPOSITIONALITY

Setup. We design composite trajectory planning and instruction completion tasks for the easy and medium cases, respectively. In the easy case, all obstacles are equally split into two subsets, each observable by one of the two planners; see Figure 8(c). As a result, the planner needs to sum the two models' predictions, made from two partial observations, to successfully avoid the obstacles. In the medium case, two separate models are trained for different tasks: one for planning trajectories in a 10×10 maze world and the other for object pickup in a single-room world. The composite task is to complete the object pickup in the 10×10 maze world.

Results. Our model significantly outperforms the baselines in both test cases, while IQL and DT suffer large success-rate drops, indicating that they cannot be applied to composite tasks directly; see Table 4, Right. This demonstrates that LEAP can easily be combined with other models responsible for different tasks, making it more applicable and general for a wide range of tasks. In Figure 8(c), the composed LEAP reaches the goal while avoiding all obstacles, even though the first LEAP planner is blocked by an obstacle it cannot perceive.

7. CONCLUSION

This work proposes and evaluates LEAP, a sequence model that plans and refines a trajectory through energy minimization. The energy minimization is done iteratively, with actions sequentially refined along a trajectory. Our current approach is limited to discrete action spaces; relaxing this limitation using approaches such as discrete binning (Janner et al., 2021) would be interesting.

A.4 EXPERIMENT DETAILS

BabyAI. For LEAP, larger environments require a longer horizon T and correspondingly more sampling iterations N. After N iterations, all T planned actions are executed. For the DT model, it is beneficial to use a longer context length in more complex environments, as shown in its original paper (Chen et al., 2021). We list these parameters for LEAP and DT in the tables in this appendix. Note that our approach can easily be conditioned on total reward, by simply concatenating the reward as an input to the sequence model. One hypothesis is that when the demonstration set contains trajectories of varying quality, taking reward as input enables the model to recognize the quality of training trajectories and potentially improves performance. To further validate the importance of rewards, we test LEAP with and without return-to-go inputs, i.e., the sum of future rewards (Chen et al., 2021). The results in Table 10 show a performance degradation without the return-to-go inputs.
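The return-to-go conditioning above can be sketched as follows; the interleaved triple layout mirrors reward-conditioned sequence models (Chen et al., 2021), while the tuple token encoding is purely schematic (a real model would embed each element):

```python
def returns_to_go(rewards):
    """Suffix sums of rewards: R_t = sum_{t' >= t} r_{t'}."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples so the sequence
    model can condition each action on the remaining reward."""
    rtg = returns_to_go(rewards)
    seq = []
    for g, s, a in zip(rtg, states, actions):
        seq += [("rtg", g), ("state", s), ("action", a)]
    return seq

# Example: rewards [1, 0, 2] give returns-to-go [3, 2, 2].
seq = build_sequence(["s0", "s1", "s2"], ["a0", "a1", "a2"], [1, 0, 2])
```

Dropping the `("rtg", g)` tokens recovers the ablated, reward-free input described in the text.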

B STOCHASTIC ENVIRONMENT TESTING

In this section we demonstrate the possibility of extending our method to stochastic settings. Although Paster et al. (2022) reveal that planning by sampling from a learned policy conditioned on a desired reward can lead to suboptimal outcomes in the presence of stochastic factors, our model circumvents this problem by formulating planning as an optimization problem: we use Gibbs sampling to find the trajectory with the lowest energy under the trained model.



5. EXPERIMENTS

In this section, we evaluate the planning performance of LEAP in BabyAI and Atari environments. We compare with a variety of offline reinforcement learning approaches, and summarize the main results in Figure 4.



Figure 1: Plan Generation through Iterative Energy Minimization. LEAP plans a trajectory to a goal (specified by the yellow star) by iteratively sampling from and minimizing a trajectory energy function $E_\theta$ estimated using a masked language model.

Figure 2: Energy Minimization in LEAP. LEAP generates plans by Gibbs sampling different actions based on a learned trajectory energy model $E_\theta(\tau)$. In each iteration, the masked language model (MLM) predicts the energy of alternative actions at selected timesteps in a trajectory. A new trajectory is generated by the Gibbs sampler, with individual actions sampled according to the energy distribution. By repeating these steps iteratively, LEAP generates a trajectory with a low energy value.

Figure 3: Properties of Planning as Energy Minimization. By formulating planning as energy minimization, LEAP enables the following properties: (a): Online adaptation; (b): Generalization; (c): Task composition.

Figure 4: Quantitative Results of LEAP on Different Domains. Results comparing LEAP to Decision Transformer and TD learning (IQL in BabyAI and the generalization tests, CQL in Atari) across BabyAI, Atari, and the generalization tests. On a diverse set of tasks, LEAP performs better than prior approaches.

Figure 5: Qualitative Visualization of the Planning and Execution Procedure in BabyAI. Left depicts planning through iterative energy minimization, where N is the number of sampling iterations. Right shows the execution of the concatenated action sequences. Two task settings are illustrated. (a): Trajectory planning, where the task is solely to plan a trajectory leading to the goal location. (b): Instruction completion, where a sequence of tasks is commanded, and an additional "Open" action is involved to get through the doors. Target locations are marked in the figure.

Figure 6: Analysis of LEAP in the BabyAI Environments. Left: Success rate increases with more sampling steps, suggesting the importance of iterative refinement in LEAP. Right: LEAP captures the correct energy landscape. It assigns low energy to the optimal trajectory (noise level=0%) and high energy to noisy paths.

Figure 7: Energy vs. Reward on Atari. Energies and rewards are normalized to [0, 1]. We demonstrate the negative correlation between the achieved rewards and the energies estimated by LEAP, which supports our method.

Figure 8: Qualitative Visualization of the Generalization Tests. (a): Online adaptation (medium), trained in the plane world and tested in a world with lava; (b): Generalization (hard), trained in the plane world and tested in a maze world; (c): Task composition (easy), where each model perceives only half of the obstacles. Target locations and unperceivable obstacles are marked in the figure.


Table 1: BabyAI Quantitative Performance. Task success rates of LEAP and a variety of prior algorithms on BabyAI environments. Models are trained with 500 optimal trajectory demonstrations in each environment, and results are averaged over 5 random seeds. The SX and NY in an environment's name indicate its size and number of obstacles.

Table 2: Performance on Suboptimal Data.

Table 3: Quantitative Comparison on Atari. Gamer-normalized scores for the 1% DQN-replay Atari dataset (Agarwal et al., 2020). We report the mean and standard error across 5 seeds. LEAP achieves the best average score over the 4 games and performs comparably to DT and CQL across all games.

        | Online Adaptation     | Generalization        | Task Composition
Easy    | 92.0%  68.0%  90.5%   | 77.5%  36.0%  60.5%   | 83.5%  58.0%  42.5%
Medium  | 64.5%  20.0%  52.0%   | 64.0%  37.5%  57.5%   | 43.0%  15.5%  11.5%

Table 4: Property Tests on Modified BabyAI Environments. Performance of LEAP and prior algorithms on the three properties, on BabyAI tasks of different difficulty levels. Left: Online Adaptation; Middle: Generalization; Right: Task Composition.

Atari Baseline Models. The scores for DT, BC, CQL, QR-DQN, and REM in Table 3 are taken from Chen et al. (2021).

We do not use context information in LEAP in most BabyAI environments, as we expect iterative planning to generate a correct trajectory based solely on the current state observation. The exception is the GoToSeqS5R2 environment, which requires going to a sequence of objects in the correct order, so LEAP needs to remember from the context which objects have been visited. During training, we randomly select and mask one action per trajectory.

BabyAI environment experiment details for LEAP and DT. The input to the DT model includes the instruction, the state context sequence, the action context sequence, and the return-to-go sequence, in which the target reward is initially set to 1. The inputs to the other baseline models are the same, except that they use a normal reward sequence instead of a return-to-go sequence. LEAP uses only the instruction, the state context sequence, and the action context sequence. Within the state sequence, each state $s_n$ contains $[x, y, \text{dir}, g_x, g_y]$: the agent's x position, y position, and direction, and the goal object's x and y positions (if the goal location is available).

Atari. In the dynamically changing Atari environment, LEAP uses context information in all four games and executes only the first planned action to account for unexpected changes in the world; see details in Table 9. During training, we randomly sample and mask one action per trajectory.

Table 9: Atari environment experiment details for LEAP.

Table 10: LEAP performance in the Atari environment with and without return-to-go inputs.

Appendix

A EXPERIMENTAL DETAILS

A.1 BABYAI ENVIRONMENT DETAILS

We categorize the environments tested in trajectory planning and instruction completion into single-room plane worlds and multi-room maze worlds, which are connected by doors.

1. Trajectory planning:

• Plane world: GoToLocalS7N5 (7×7), GoToLocalS8N7 (8×8)
• Maze world: GoToObjMazeS4 (10×10), GoToObjMazeS7 (19×19)

2. Instruction completion:

• Plane world: PickUpLoc (8×8)
• Maze world: GoToObjMazeS4R2Close (7×7), GoToSeqS5R2 (9×9)

Table 5 presents the detailed BabyAI environment settings, including the environment size, the number of rooms, the number of obstacles, the status of doors, and one example instruction for each environment.

For MOPO, we use the authors' original implementation of dynamics model training and policy learning (https://github.com/tianheyu927/mopo). For the RL policy, we adopt the IQL discussed above. The actor network and policy network of BCQ and IQL use the same transformer architecture as our model (see details above). The original DT and BC already use the transformer architecture, so we did not change them. For all baselines, we add the same instruction encoder and image encoder described above to process instructions and image observations.

Assuming that the frequency of successful actions dominates in the dataset, our model is trained to assign lower energy to trajectories with a higher likelihood of reaching the goal. Consequently, in stochastic environments, LEAP constructs the sequence of actions that has the best chance of accomplishing the target. When executing this plan in a stochastic environment, we may also choose to replan our sequence of actions after each executed action (to handle the stochasticity of the next state given an action). This sequence of actions is then optimal in the stochastic environment, as we always choose the action that has the maximum likelihood of reaching the final state.
Also note that multi-step planning can potentially provide an advantage over a simple next-action policy in stochastic environments, as such a policy assigns a probability distribution only to the immediate next step, without awareness of the future-step adjustments required by stochastic factors.

To verify these assumptions, we constructed a stochastic test in the BabyAI environment. The test adopts a stochastic dynamics model in which the agent fails to execute the turning actions turn left/right with 20% probability, instead performing one of the remaining actions (turn right/left, forward, pickup, drop, and open) with uniform probability. The remaining settings follow the BabyAI experiments detailed in Appendix A.4, except that we train models on demonstrations generated under the above dynamics model. These training data are noisy in the sense that the actions taken are not optimal, and corrections are required from future actions. We believe LEAP, as a multi-step planner, can learn these correlations between consecutive actions. We compare LEAP with the DT baseline; the results are collected in Table 11. LEAP shows superior performance compared to DT on both tested environments, which indicates both the possibility of applying our approach in stochastic settings and the advantage of multi-step planning in the face of stochastic factors.

Energy model variants considered:

• Sequence model classifier: an LSTM sequence model predicts the scalar energy value given the entire trajectory $\tau$, trained with a loss between the ground-truth trajectory energy and the estimated energy. The optimal trajectories in BabyAI are assigned the lowest energy value, 0, and the generated suboptimal trajectories are assigned higher values depending on their degree of randomness.

• MLM: discussed in the main text.
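The stochastic dynamics used in this test can be sketched directly from the description above; the action names and `stochastic_step` helper are illustrative, not the benchmark's actual API:

```python
import random

ACTIONS = ["left", "right", "forward", "pickup", "drop", "open"]

def stochastic_step(action, rng, fail_p=0.2):
    """Stochastic BabyAI-style dynamics: a turning action (left/right)
    fails with probability fail_p, in which case one of the remaining
    five actions is executed uniformly at random."""
    if action in ("left", "right") and rng.random() < fail_p:
        others = [a for a in ACTIONS if a != action]
        return rng.choice(others)
    return action

rng = random.Random(0)
executed = [stochastic_step("left", rng) for _ in range(10000)]
frac_ok = executed.count("left") / len(executed)
assert 0.77 < frac_ok < 0.83   # roughly 80% of turns succeed
```

Demonstrations rolled out under such dynamics contain the corrective action patterns that a multi-step planner can pick up from context.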

