ABSTRACT-TO-EXECUTABLE TRAJECTORY TRANSLATION FOR ONE-SHOT TASK GENERALIZATION

Abstract

Training long-horizon robotic policies in complex physical environments is essential for many applications, such as robotic manipulation. However, learning a policy that can generalize to unseen tasks is challenging. In this work, we propose to achieve one-shot task generalization by decoupling plan generation and plan execution. Specifically, our method solves complex long-horizon tasks in three steps: build a paired abstract environment by simplifying geometry and physics, generate abstract trajectories, and solve the original task with an abstract-to-executable trajectory translator. In the abstract environment, complex dynamics such as physical manipulation are removed, making abstract trajectories easy to generate. However, this introduces a large domain gap between abstract trajectories and the actual executed trajectories, as abstract trajectories lack low-level details and are not aligned frame-to-frame with the executed trajectories. In a manner reminiscent of language translation, our approach leverages a seq-to-seq model to overcome the large domain gap between the abstract and executable trajectories, enabling the low-level policy to follow the abstract trajectory. Experimental results on various unseen long-horizon tasks with different robot embodiments demonstrate the practicality of our method in achieving one-shot task generalization. Videos and more details can be found in the supplementary materials and on the project page.

1. INTRODUCTION

Training long-horizon robotic policies in complex physical environments is important for robot learning. However, directly learning a policy that can generalize to unseen tasks is challenging for Reinforcement Learning (RL) based approaches (Yu et al., 2020; Savva et al., 2019; Shen et al., 2021; Mu et al., 2021). The state/action spaces are usually high-dimensional, requiring many samples to learn policies for various tasks. One promising idea is to decouple plan generation and plan execution. In classical robotics, a high-level planner generates an abstract trajectory using symbolic planning with a simpler state/action space than the original problem, while a low-level agent executes the plan in a fully physical environment (Kaelbling & Lozano-Pérez, 2013; Garrett et al., 2020b). In our work, we promote the abstract-to-executable philosophy via a learning-based approach. Given an abstract trajectory, a robot can aim for one-shot task generalization: instead of memorizing high-dimensional policies for every task, it can leverage the power of planning in the low-dimensional abstract space and focus on learning low-level executors. This two-level framework works well for classical robotics tasks such as motion control for robot arms, where a motion planner generates a kinematic motion plan at a high level and a PID controller executes the plan step by step. However, such a decomposition and abstraction is not always trivial for more complex tasks. In general domains, it either requires expert knowledge (e.g., PDDL (Garrett et al., 2020b;a)) to design the abstraction manually or enormous numbers of samples to distill suitable abstractions automatically (e.g., HRL (Bacon et al., 2017; Vezhnevets et al., 2017)). We refer to Abel (2022) for an in-depth investigation of this topic.
On the other hand, designing imperfect high-level agents whose state space does not precisely align with that of the low-level executor can be much easier and more flexible. High-level agents can be planners with abstract models and simplified dynamics in the simulator (by discarding some physical features, e.g., enabling a "magic" gripper (Savva et al., 2019; Torabi et al., 2018)) or an existing "expert" agent such as a human or an agent pre-trained on a different manipulator. Though imperfect, their trajectories still contain meaningful information to guide the low-level execution of novel tasks. For example, different robots may share a similar procedure of reaching, grasping, and moving when manipulating a rigid box, despite using different grasping poses. As a trade-off, executing their trajectories with low-level executors becomes non-trivial. As an example will show shortly, there may be no frame-to-frame correspondence between the abstract and executable trajectories due to the mismatch. Sometimes the low-level agent needs to discover novel solutions by slightly deviating from the plan in order to follow the rest of it. Furthermore, the dynamics mismatch may require low-level agents to pay attention to the entire abstract trajectory and not just a part of it. To benefit from abstract trajectories without perfect alignment between high- and low-level states, we propose TRajectory TRanslation (abbreviated as TR2), a learning-based framework that can translate abstract trajectories into executable trajectories on unseen tasks at test time. The key feature of TR2 is that we do not require frame-to-frame alignment between the abstract and executable trajectories. Instead, we utilize a powerful sequence-to-sequence translation model inspired by machine translation (Sutskever et al., 2014; Bahdanau et al., 2014) to translate abstract trajectories into executable actions even when there is a significant domain gap.
This process is naturally reminiscent of language translation, which is well solved by seq-to-seq models. We illustrate the idea with a simple Box Pusher task, shown in Fig. 1. The black agent needs to push the green target box to the blue goal position. We design the high-level agent as a point mass which can magically attract the green box to move along with it. For the high-level agent, it is easy to generate an abstract trajectory by either motion planning or heuristic methods. As TR2 does not place strict constraints on the high-level agent, we can train TR2 to translate the abstract trajectory, which includes the waypoints to the target, into a physically feasible trajectory. Our TR2 framework learns to translate the magical abstract trajectory into a strategy that moves around the box and pushes it in the correct direction, closing the domain gap between the high- and low-level agents. Our contributions are: (1) We provide a practical solution for learning policies for long-horizon, complex robotic tasks in three steps: build a paired abstract environment (e.g., using a point mass with magical grasping as the high-level agent), generate abstract trajectories, and solve the original task with abstract-to-executable trajectory translation. (2) Seq-to-seq models, specifically the transformer-based auto-regressive model (Vaswani et al., 2017; Chen et al., 2021; Parisotto et al., 2020), free us from the restriction of strict alignment between abstract and executable trajectories, providing additional flexibility in high-level agent design and abstract trajectory generation, and helping bridge the domain gap. (3) The combination of abstract trajectories and the transformer enables TR2 to solve unseen long-horizon tasks. Evaluating our method on a navigation-based task and three manipulation tasks, we find that our agent achieves strong one-shot generalization to new tasks while being robust to intentional interventions or mistakes via re-planning.
Our method is evaluated on various tasks and environments with different embodiments. In all experiments, the method shows great improvements over baselines. We also perform real-world experiments on the Block Stacking task to verify the capability to handle noise on a real robot system. Please refer to the anonymous project page for more visualizations.

2. RELATED WORKS

One-Shot Imitation Learning To achieve one-shot task generalization, these works usually assume a dataset of expert demonstrations, which is used to train a behavior cloning model (Duan et al., 2017; Rahmatizadeh et al., 2018; Torabi et al., 2018; James et al., 2018) that accelerates learning on future novel tasks. In terms of architecture, Xu et al. (2022) is similar to ours but still requires a dataset of low-level demos and experiments in the offline setting, whereas we utilize a dataset of simpler abstract trajectories and train online in complex environments.

Cross-Morphology Imitation Learning When there is a morphology difference between expert and imitator, a manually designed retargeting mapping is usually used to convert the state and action for both locomotion (Peng et al., 2020; Agrawal & van de Panne, 2016) and manipulation (Suleiman et al., 2008; Qin et al., 2022; Antotsiou et al., 2018). However, the mapping function is task-specific, which limits the application of these approaches to a small set of tasks. To overcome this limitation, action-free imitation has been explored by learning a dynamics model (Torabi et al., 2018; Radosavovic et al., 2020; Liu et al., 2019; Edwards et al., 2019) to infer the missing actions, or a reward function (Aytar et al., 2018; Zakka et al., 2022; Sermanet et al., 2016) to convert IL to a standard RL paradigm. However, these methods need extensive interaction data to learn the dynamics model or update the policy with learned rewards. Instead of learning the reward function from cross-morphology teacher trajectories, we propose a generic trajectory following reward based on the given abstract trajectory that provides dense supervision for the agent. DeepMimic (Peng et al., 2018) is similar to ours, but differs in that we do not allow training at inference time with new trajectories.
SILO (Lee et al., 2019) is another similar approach that trains an agent to follow the demonstrator, but it is limited by fixed window sizes/horizon parameters that inhibit its test-time generalizability.

Demo Augmented Reinforcement Learning Motion primitives, especially dynamic motion primitives, have long been used by robotics researchers to combine human demonstrations with RL (Kober & Peters, 2009; Theodorou et al., 2010; Li et al., 2017; Singh et al., 2020). Recently, Pertsch et al. (2021; 2020) extended this framework to learn task-agnostic skills from demonstrations. Another line of work (Ho & Ermon, 2016; Rajeswaran et al., 2017) directly uses demonstrations as interaction data. For example, Demo Augmented Policy Gradient (Rajeswaran et al., 2017; Radosavovic et al., 2020) performs behavior cloning and policy gradient interchangeably, with a decayed weight for imitation in on-policy training, while other works (Vecerik et al., 2017; Hester et al., 2018) append the demonstrations to the replay buffer for off-policy RL. However, these works typically utilize low-level demonstrations, whereas we utilize abstract trajectories that are much easier to generate.

3.1. OVERVIEW AND PRELIMINARIES

Similar to one-shot imitation learning (Duan et al., 2017), we tackle the problem of one-shot task generalization. In one-shot imitation learning, an agent must solve an unseen task given a demonstration (e.g., a human demo or a low-level demo) without additional training. However, even a single demonstration can be challenging to produce, especially for complex long-horizon robotics tasks. Different from one-shot imitation learning, we replace the demonstration with an abstract trajectory: a sequence of high-level states, corresponding to a high-level agent, that instructs the low-level agent on how to complete the task. In the high-level space, we strip low-level dynamics and equip the high-level agent with magical grasping, allowing it to manipulate objects easily. This simplification makes abstract trajectories easier and more feasible to generate for long-horizon unseen tasks than human demos or low-level demos. Given a novel task, our method seeks to solve it with the following three steps: (i) construct a paired abstract environment that can be solved with simple heuristics or planning algorithms and generate abstract trajectories (Sec. 3.2); (ii) translate the high-level abstract trajectory into a low-level executable trajectory with a trained trajectory translator in a closed-loop manner (Sec. 3.3, 3.4); (iii) solve the given task, and potentially other unseen tasks, with the trajectory translator (Sec. 3.5). These three steps enable our approach to tackle unseen long-horizon tasks that are out-of-distribution as well. Moreover, we can utilize the re-planning feature (regenerating the abstract trajectory during an episode) to increase the success rate at test time and handle unforeseen mistakes or interventions. Next, we introduce definitions and symbols.
We consider an environment as a Markov Decision Process (MDP) (S^L, A, P, R), where A is the action space, P is the transition function of the environment, and R is the reward function. Different from a regular MDP, we consider two state spaces: the low-level state space S^L and a high-level state space S^H. We assume there exists a map f : S^L → S^H and a dissimilarity function d : S^L × S^H → R such that d(s^L, s^H) = 0 when s^H = f(s^L), where s^H ∈ S^H and s^L ∈ S^L. The high-level agent generates an abstract trajectory τ^H = (s^H_1, s^H_2, ..., s^H_T) from an initial state s^H_1 = f(s^L_1). Note that actions are not included in τ^H. Lastly, the low-level agent receives observations s^L_t in the low-level state space S^L and takes actions a_t in the action space A, and the dynamics P returns the next observation s^L_{t+1} = P(s^L_t, a_t).
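As a concrete (hypothetical) illustration of f and d for the point-mass abstractions used in this work, one could assume the first three entries of the low-level state are the agent's position; the function names below are our own, not from the paper's code:

```python
import numpy as np

def f(s_L: np.ndarray) -> np.ndarray:
    """Map a low-level state to the high-level space.
    Assumption: the first 3 entries of s_L are the agent's position,
    which is exactly the point-mass high-level state."""
    return s_L[:3]

def d(s_L: np.ndarray, s_H: np.ndarray) -> float:
    """Dissimilarity between a low-level and a high-level state:
    Euclidean distance in the point-mass space, so d(s_L, f(s_L)) == 0
    as required by the definition above."""
    return float(np.linalg.norm(f(s_L) - s_H))
```

This satisfies the requirement d(s^L, f(s^L)) = 0 by construction, since f simply projects the low-level state onto the high-level space.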

3.2. ABSTRACTING ENVIRONMENTS AND GENERATING ABSTRACT TRAJECTORIES

The first step to solving a challenging task is to build a paired task that abstracts the low-level physical details away. The paired task should be much simpler than the original task so that it can be solved easily. We leverage two general ways to build the abstract environment: (i) simplify geometry (Manolis Savva* et al., 2019), e.g., representing the agent and objects as point masses; (ii) abstract away contact dynamics (Srivastava et al., 2021; Kolve et al., 2017), e.g., the original environment requires detailed physical manipulation to grasp objects, while the agent in the abstract environment can grasp them magically. Leveraging these two methods, abstract environments can be constructed concretely for many difficult tasks. For all our abstract environments, we remove all contact dynamics, enable magical grasping, and represent all relevant objects as point masses. The point-mass representation further makes the mapping function f simple to define. As a result, building abstract environments is scalable, as we use the same simple process for every environment. This in turn enables simple generation of abstract trajectories with heuristics, using a point mass as the high-level agent. To generate abstract trajectories for manipulation tasks, we make the high-level agent approach the object, magically grasp it, and finally move it to the target position. If there are no objects to be manipulated, the abstract trajectory is a simple navigation sequence. The low-level agent conditions on the abstract trajectory, which can also be viewed as a prompt. However, due to the high-level nature of the abstract trajectory and the lack of a frame-to-frame correspondence, learning to follow it requires a deeper understanding of the entire trajectory.
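The approach/grasp/move heuristic described above can be sketched as follows. This is an illustrative implementation under our own assumptions (point-mass positions, a waypoint spacing `step`), not the paper's released code:

```python
import numpy as np

def abstract_trajectory(agent, obj, goal, step=0.05):
    """Heuristic high-level plan for a manipulation task: approach the
    object, magically grasp it, then carry it to the goal. States are
    point-mass positions; `step` controls waypoint spacing."""
    def segment(a, b):
        # n+1 evenly spaced waypoints from a to b (inclusive)
        n = max(int(np.ceil(np.linalg.norm(b - a) / step)), 1)
        return [a + (b - a) * t / n for t in range(n + 1)]
    approach = segment(np.asarray(agent, float), np.asarray(obj, float))
    carry = segment(np.asarray(obj, float), np.asarray(goal, float))
    return np.stack(approach + carry)
```

For navigation-only tasks, the same `segment` helper alone would produce the plan.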
Thus, we adopt the transformer architecture, specifically GPT-2 (Radford et al., 2019; Brown et al., 2020; Radford et al., 2018), and format the input sequence by directly appending the current and past low-level states to the abstract trajectory (for more details on the architecture, see Appendix E). With the transformer's attention mechanism (Vaswani et al., 2017), when processing the current low-level state s^L_t, the model can attend to past low-level states as well as the entire abstract trajectory to make a decision. By allowing attention over the entire abstract trajectory, our model can better capture long-horizon dependencies and suffers less from information bottlenecks. For example, in Box Pusher (Fig. 1) the low-level agent must look far ahead into the abstract trajectory to determine in which direction the high-level agent (black dot) moves the green target box. By understanding where the target box moves, the low-level agent can position itself so as to move the target box the same way and follow the abstract trajectory.
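A minimal sketch of the input-sequence construction described above, under our own naming conventions (the paper's actual tokenization and embeddings are in its Appendix E):

```python
import numpy as np

def build_input_sequence(tau_H, low_level_states, k=3):
    """Concatenate the full abstract trajectory (a prompt-like prefix)
    with the k most recent low-level states. Each row becomes one token;
    the shorter state dimension is zero-padded so rows align."""
    recent = low_level_states[-k:]
    dim = max(tau_H.shape[1], recent.shape[1])
    def pad(x):
        return np.pad(x, ((0, 0), (0, dim - x.shape[1])))
    return np.concatenate([pad(tau_H), pad(recent)], axis=0)
```

In the transformer, causal attention over this sequence lets the token for s^L_t attend to every abstract-trajectory token in the prefix, which is what enables the look-ahead behavior discussed above.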

3.3. ABSTRACT-TO-EXECUTABLE TRAJECTORY TRANSLATOR

Note that, while the backbone of the model discussed here is the GPT-2 transformer, it can easily be replaced by any other seq-to-seq model like an LSTM (Hochreiter & Schmidhuber, 1997) .

3.4. TRAINING WITH TRAJECTORY FOLLOWING REWARD

Conventionally, seq-to-seq models are trained on large parallel corpora for translation tasks like English to German (Luong et al., 2015) in an auto-regressive / open-loop manner. However, we desire to train a policy network that solves robotic environments well. To this end, we adapt the seq-to-seq model to a closed-loop setting where the model receives environment feedback at every step, as opposed to an open-loop setting, reducing the error accumulation that often plagues purely offline methods. Thus, to train the translation model described in Sec. 3.3, we use online RL to maximize a trajectory following reward (Eq. 1); specifically, we use the PPO algorithm (Schulman et al., 2017). Note that our framework is not limited to any particular algorithm and can also work in offline settings if an expert low-level dataset is available. The core idea of the trajectory following reward is to encourage the low-level agent to match as many high-level states in the abstract trajectory as possible. We say a low-level state s^L matches a high-level state s^H when s^H has the shortest distance to s^L and this distance is below a threshold ε. During an episode, we track the farthest high-level state the low-level agent has matched and use j_t to denote the index of the farthest high-level state matched by timestep t. Concretely, j_t = max_{1≤k≤t} j'_k, where j'_k = argmin_{1≤i≤n} { d(s^L_k, s^H_i) | d(s^L_k, s^H_i) < ε } and d(s^L_t, s^H_{j'}) = ||f(s^L_t) − s^H_{j'}||.
We define our trajectory following reward as follows:

R_Traj =
  0                                           if j_t < n and j'_t ≤ j_{t-1}   (no progress)
  (1 + β · j'_t) · r_dist(s^L_t, s^H_{j'_t})  if j_t < n and j'_t > j_{t-1}   (progress)
  r_dist(s^L_t, s^H_n)                        if j_t = n   (all high-level states matched)

Here, r_dist(s^L_t, s^H_{j'}) = 1 − tanh(w · d(s^L_t, s^H_{j'})) is a common distance-based reward function that maps a distance to a bounded reward, and the weight term (1 + β · j'_t) places more emphasis on the later, harder-to-reach high-level states. w and β are scaling hyperparameters, kept the same for all environments and experiments. For a visual trace of the trajectory following reward, see Sec. C.3 of the appendix. In practice, we combine the trajectory following reward with the original task reward. Note that the original task rewards are simplistic and not always informative enough for a goal-conditioned policy to solve the task. Details about the task reward are in the appendix. Furthermore, a limitation of the reward function is that it cannot handle abstract trajectories in which a high-level state is repeated. This presents a problem in periodic tasks, such as repeatedly picking and placing a block between two locations. A simple solution is to chunk abstract trajectories so that each subsequence has no repeated high-level states; we show an example of this working in Sec. C.2.
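The reward can be implemented directly from its definition. The sketch below uses illustrative defaults for ε, β, and w and a Euclidean d over a placeholder mapping f; these are our assumptions, not the paper's exact values:

```python
import numpy as np

def traj_reward(s_L, tau_H, j_prev, eps=0.1, beta=0.1, w=1.0,
                f=lambda s: s, d=None):
    """Return (reward, j_t) given the current low-level state s_L, the
    abstract trajectory tau_H (n x dim), and j_prev, the farthest matched
    index so far (0 = nothing matched; indices are 1-based)."""
    if d is None:
        d = lambda sl, sh: float(np.linalg.norm(f(sl) - sh))
    n = len(tau_H)
    dists = [d(s_L, sh) for sh in tau_H]
    within = [i for i, dist in enumerate(dists) if dist < eps]
    # j'_t: closest high-level state within the threshold (1-based), else 0
    j_cur = (min(within, key=lambda i: dists[i]) + 1) if within else 0
    j_t = max(j_prev, j_cur)
    r_dist = lambda idx: 1.0 - np.tanh(w * dists[idx - 1])
    if j_t == n:                      # matched all high-level states
        return r_dist(n), j_t
    if j_cur > j_prev:                # made progress
        return (1.0 + beta * j_cur) * r_dist(j_cur), j_t
    return 0.0, j_t                   # no progress
```

Note that when no high-level state is within ε, j'_t is treated as 0 and the "no progress" branch applies, which is consistent with the piecewise definition above.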

3.5. TEST WITH TRAJECTORY TRANSLATION AND RE-PLANNING

At test time, starting with low-level state s^L_1, we generate an abstract trajectory τ^H from the mapped initial high-level state s^H_1 = f(s^L_1). At timestep t, we execute our low-level agent π^L_θ in a closed-loop manner by taking action a_t = argmax_{a∈A} π^L_θ(a | s^L_t, s^L_{t-1}, ..., s^L_{t-k+1}, τ^H). In addition, we can re-generate the abstract trajectory in the middle of completing the task to further boost test performance and handle unforeseen situations such as external interventions or mistakes. We refer to this strategy as re-planning and investigate it in Sec. 4.6. Note that re-planning is not adopted in most one-shot imitation learning methods because low-level or human demonstrations are challenging and time-consuming to generate for new tasks and are impractical for long-horizon tasks. In contrast, with a high-level agent acting in a simple high-level state space, we can alleviate these problems and quickly generate abstract trajectories to follow at test time. In practice, we run re-planning in one of two scenarios: 1) if the agent matches the final high-level state in τ^H, we re-plan to begin solving the next part of a potentially long-horizon task, as done for the Block Stacking and Open Drawer test settings; 2) if, after the maximum allowed timesteps, the agent has yet to match the final high-level state, we re-plan, as the agent likely made some errors. Re-planning enables the low-level policy to solve tasks even when there is some intervention or mistake, in addition to allowing it to solve longer-horizon tasks.
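The closed-loop execution with re-planning can be sketched as below. All interfaces (`env`, `policy`, `plan_fn`, `matched_final`) are hypothetical stand-ins for the components described in the text:

```python
def rollout_with_replanning(env, policy, plan_fn, f, k=3,
                            max_steps=200, matched_final=None):
    """Sketch of test-time execution. plan_fn maps a high-level state to
    an abstract trajectory; we re-plan when the final high-level state is
    matched (to start the next sub-task). A step budget bounds the loop."""
    s_L = env.reset()
    history = [s_L]
    tau_H = plan_fn(f(s_L))
    for t in range(max_steps):
        a = policy(history[-k:], tau_H)      # closed-loop action
        s_L, done = env.step(a)
        history.append(s_L)
        if matched_final and matched_final(s_L, tau_H):
            tau_H = plan_fn(f(s_L))          # re-plan: next sub-task
        if done:
            break
    return history
```

In the paper's second re-planning scenario, the same `plan_fn` call would simply be issued again after the step budget expires instead of terminating.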

4. EXPERIMENTS

The effectiveness of TR2-GPT2 (our TR2 method with a GPT-2 backbone) originates from two key designs: the abstract trajectory setup and the complementing transformer architecture. These two designs contribute to strong performance, especially on long-horizon and unseen tasks. To evaluate our approach, our experiments answer the following questions: (1) How does TR2-GPT2 perform compared to other baselines? (Sec. 4.2) (2) How does TR2-GPT2 perform on long-horizon unseen tasks? (Sec. 4.3) (3) How does the abstract trajectory setup impact learning? (Sec. 4.4) (4) How does TR2-GPT2 translate trajectories to bridge the domain gap via attention? (Sec. 4.5) (5) How can re-planning improve performance in long-horizon tasks? (Sec. 4.6) To answer these questions, we build four robotic tasks in the SAPIEN (Xiang et al., 2020) simulator with realistic low-level dynamics, shown in Fig. 3. These environments support flexible configurations that can generate task variants with different horizon lengths. The Box Pusher task is the simplest and can be used for fast concept verification. The Couch Moving task tests long-horizon dependency and generalization to abstract trajectory length. The last two tasks, Block Stacking and Open Drawer, have full-physics robot embodiment with visual sensory input, which can evaluate the performance and generalizability of our method on more difficult, high-dimensional robotics tasks. For each environment, we build a paired environment with point-mass-based simplifications and magical grasping to generate abstract trajectories via heuristic methods. See Appendix B for more details. We compare our method, namely TR2-GPT2, with three baselines on our four environments.
TR2-LSTM: Similar to our method, but replaces GPT-2 with an LSTM (Hochreiter & Schmidhuber, 1997); Goal-conditioned Policy (GC): Instead of using an abstract trajectory as guidance, it receives a task-specific goal and the original task reward, implemented as an MLP; Subgoal-conditioned Policy (SGC): Similar to GC, but it receives a sub-goal, which is a high-level state n steps ahead of the current furthest matched high-level state, and is trained with the same reward as TR2-GPT2. The correspondence between low- and high-level states is computed using the matching algorithm in Sec. 3.4. Furthermore, we compare against a similar line of work, SILO (Lee et al., 2019), and evaluate our method on two of their environments: Obstacle Push and Pick and Place (from SILO).
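For concreteness, the SGC baseline's subgoal selection might look like the following sketch; this is our interpretation of "a high-level state n steps ahead of the furthest matched high-level state", and the names and defaults are assumptions:

```python
import numpy as np

def select_subgoal(s_L, tau_H, j_prev, n_ahead=3, eps=0.1,
                   d=lambda sl, sh: float(np.linalg.norm(sl - sh))):
    """Find the furthest matched high-level state (same matching rule as
    the trajectory-following reward), then return the state n_ahead steps
    further along the abstract trajectory, clamped to the final state."""
    dists = [d(s_L, sh) for sh in tau_H]
    within = [i for i, dist in enumerate(dists) if dist < eps]
    j = max(j_prev, (min(within, key=lambda i: dists[i]) + 1) if within else 0)
    idx = min(j - 1 + n_ahead, len(tau_H) - 1)
    return tau_H[max(idx, 0)], j
```

Unlike TR2, the policy then sees only this single subgoal rather than the full abstract trajectory, which is the bottleneck the experiments probe.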

4.1. ENVIRONMENT DETAILS

In this section we outline all experimented environments, detailing the train and test tasks. See Appendix B for additional visuals for all environments, including those from SILO. Obstacle Push and Pick and Place from SILO (Lee et al., 2019): We recreate these two tasks from SILO (they did not release code) and compare our method against SILO on them.

4.2. RESULTS AND ANALYSIS

As shown in Table 1, our TR2-GPT2 performs better than all other baselines, especially on test tasks that are unseen and long-horizon. The performance can be attributed to the abstract trajectory setup of TR2 and the transformer architecture, which are further investigated in the ablation studies. The GC baseline cannot solve most tasks, as the designed task rewards do not provide sufficient guidance. The SGC baseline, even when conditioned on sub-goals, still has low success rates on test tasks, indicating that simple heuristics are not sufficient to select good subgoals. The TR2-LSTM baseline also performs worse than TR2-GPT2. One interpretation is that modelling long-horizon dependencies with LSTMs is challenging due to the information bottleneck. This prompts us to investigate how the transformer's attention module intuitively leverages long-term information in Sec. 4.5. Lastly, compared to SILO, which like ours seeks to imitate a demo as closely as possible, we achieve better results, as shown in Table 2.

4.3. PERFORMANCE ON LONG-HORIZON UNSEEN TASKS

In Couch Moving, we test the ability of the model to generalize to long-horizon unseen tasks with out-of-distribution abstract trajectories. Rows 3-6 in Table 1 demonstrate how the model can successfully solve much longer and varied mazes, showcasing the expressive power of TR2-GPT2 compared to baselines. These experiments also show that our model can handle variable sequence lengths and horizons at test time. In comparison, due to fixed/manually tuned horizon sizes, SILO and HRL methods such as HAC had difficulty handling the variance in abstract trajectory lengths in our preliminary attempts. In Block Stacking, we test the generalizability of TR2-GPT2 to out-of-distribution task settings, e.g., stacking to greater heights or farther locations. Rows 7-12 in Table 1 showcase how our method can stack blocks up to twice as tall as the training setting. It can even build new configurations such as a 4-3-2-1 pyramid. As shown in Fig. 6, TR2-GPT2 can go as far as stacking 26 blocks to build a castle configuration in real-world experiments. Lastly, in Open Drawer we test the ability of TR2-GPT2 to generalize to opening drawers of different geometries. Row 14 of Table 1 shows that TR2-GPT2 can generalize across different drawer handles and reach handles that are farther away than in the training setting.

4.4. HOW DOES THE GRANULARITY OF THE ABSTRACT TRAJECTORY IMPACT LEARNING

We perform an ablation study in which we vary the granularity of the abstract trajectory, with the sparsest settings resembling single-goal-conditioned policies and denser settings resembling imitation learning. We decrease the granularity by skip-sampling the abstract trajectory, a process detailed in Appendix E, resulting in shorter/sparser abstract trajectories. Results in Fig. 5 (right) show that as the abstract trajectory becomes less granular (more sparse), it becomes more difficult for agents to learn to follow it and solve the task. In the Couch Moving task, for example, an insufficient number of high-level states makes the problem ambiguous and makes it harder to determine the correct orientation to rotate into. Thus, low-level policies trained with sufficiently granular high-level states can be successful.
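Skip-sampling can be sketched as follows; this is an assumed implementation (the paper's exact procedure is in its Appendix E):

```python
def skip_sample(tau_H, skip=2):
    """Keep every `skip`-th high-level state of the abstract trajectory,
    always retaining the final state so the goal itself is never dropped."""
    sparse = list(tau_H[::skip])
    if (len(tau_H) - 1) % skip != 0:   # the final state was skipped over
        sparse.append(tau_H[-1])
    return sparse
```

Larger `skip` values yield the sparser settings of the ablation; `skip` equal to the trajectory length degenerates to a single-goal condition.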

4.5. ATTENTION ANALYSIS OF TR 2 -GPT2 FOR BRIDGING THE DOMAIN GAP

In general, TR2-GPT2 will learn whatever is necessary to bridge the domain gap between the high-level and low-level spaces and fill in the information that is excluded from the high-level space. For example, in Couch Moving, the abstract trajectory does not include information about when and how to rotate the couch, only a coarse path to the goal location. Thus the low-level policy must learn to rotate the couch appropriately to pass through the corners and follow the abstract trajectory. To visually understand how TR2-GPT2 learns to bridge the domain gap, we investigate the attention of the transformer when solving the Couch Moving task. We observe that after training, TR2-GPT2 exhibits an understanding of an optimal strategy for deciding when to rotate. As shown in Fig. 4, whenever the agent is in a chamber that permits rotation, it attends to positions between the next chamber and the one after it, all of which are indicative of the orientation of the upcoming corner. Attending to these locations enables the agent to successfully bridge the high-to-low domain gap in Couch Moving. Moreover, the agent learns to pay attention mostly to locations up ahead and learns that the past parts of the trajectory are uninformative, despite being given the full abstract trajectory to process at each timestep. A video with a full trajectory attention analysis can be found in the supplementary materials and on our project page.

4.6. EFFECTS OF RE-PLANNING

One property of our approach is the feasibility of re-generating the abstract trajectory at test time, which we refer to as re-planning. It enables us to introduce explicit error-corrective behavior via the high-level agent in long-horizon tasks. Our results in Fig. 5 show that re-planning improves success rates on long-horizon tasks and under external interventions.

5. CONCLUSION

We have introduced the Trajectory Translation (TR2) framework, which trains low-level policies by translating an abstract trajectory into executable actions. As a result, we can easily decouple plan generation and plan execution, allowing the low-level agent to focus solely on the low-level control needed to follow an abstract trajectory. This allows our method to generalize to unseen long-horizon tasks. We can further utilize re-planning via the high-level agent to improve the success rate and handle situations where mistakes or external interventions occur.

6. REPRODUCIBILITY STATEMENT

We have uploaded our anonymized code to a GitHub repo: https://github.com/abstract-to-executable/code. For details on running the training and evaluation code, see the README. For exact hyper-parameter settings, check Appendix G.

7. ETHICS STATEMENT

A part of our work utilizes collected data in the form of abstract trajectories that will be released for other researchers to use. We want to reaffirm that this data is completely generated using code with no humans involved in physically generating data. Moreover, the generated data only encodes information about solving a few manipulation and navigation tasks.

A REAL ROBOT EXPERIMENTS OF BLOCK STACKING

In this section we describe how the Block Stacking environment was constructed in the real world and how the real-world experiments were conducted. Videos of the results can be found here: https://sites.google.com/view/abstract-to-executable/real-world-videos

A.1 REAL WORLD ENVIRONMENT

The real-world environment consists of a robot arm with a parallel gripper, an RGBD camera, and a flat table on which blocks can be placed.

Robot Arm

We use a UFactory xArm 7 robot with an xArm gripper, as this was the closest available match to the Panda arm and gripper used in simulation. The arm and gripper are controlled via 5D position control: 3 dimensions for the end-effector position and 2 for the gripper.

RGBD Camera

We use the Intel RealSense sensor to capture RGBD data from the scene.

Blocks

We use 2-inch / 5.08 cm wide wooden blocks (built by taping together eight 1-inch blocks) for block stacking. Each block also has a texture taped onto it, made from paper cutouts using assets from the game Minecraft. Note that in simulation we actually use 4 cm wide blocks; to adjust for this, we scale all positions by 5.08/4.00 when transferring to the real world.

A.2 EXECUTION

In Fig. 6, we show different kinds of block configurations TR2-GPT2 tackles. On average, each block requires around 40 to 60 actions, and in our experiments we are able to stack up to 26 blocks in novel configurations requiring up to around 1600 actions. To perform Block Stacking in the real-world setting, we conduct the following process:

1. Detect the position of isolated blocks not at goal positions. Directly use the detected position to place the block in the simulated environment to create an initial state. See Sec. D for more details on how we estimate block positions in both the real world and simulation.
2. Run the high-level agent and generate an abstract trajectory to pick up the block and place it at a goal position.
3. In the simulated environment, run the trajectory translation model on the generated abstract trajectory and initial state to generate an executable trajectory.
4. Run the executable trajectory in the real world by setting the position of the xArm based on the 3D position of the end-effector and the 2D position of the grippers in the executable trajectory states.

As we simply wish to show the feasibility of sim-to-real transfer, in practice we assume that past blocks were placed successfully, and we only estimate the position of the new block given to the robot arm. Note that usually the positions of all blocks would be given to the high-level agent for planning purposes in order to generate an abstract trajectory. In our experiments we are not concerned with estimating the position of every block, although that would be possible with a more sophisticated vision pipeline. Moreover, the low-level policy only needs to observe the position of the block the high-level agent manipulates, so our simplification is more than feasible.
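The four steps above can be sketched as one loop. Every callable here is a hypothetical stand-in for a system component (perception, simulator, planner, translator, arm controller); only the 5.08/4.00 scale factor comes from the text:

```python
def stack_block_real(detect_block, sim_env, high_level_plan,
                     translator, xarm_execute, scale=5.08 / 4.00):
    """Sketch of the sim-to-real loop for placing one block."""
    # 1. Detect the new block and mirror it in the simulator
    #    (real-world cm -> sim units, so divide by the scale).
    block_pos = detect_block()
    state = sim_env.reset(block_pos=[p / scale for p in block_pos])
    # 2. Generate an abstract trajectory with the high-level agent.
    tau_H = high_level_plan(state)
    # 3. Translate it into an executable trajectory in simulation.
    executable = translator(state, tau_H)
    # 4. Replay end-effector / gripper targets on the real xArm,
    #    scaling positions back up for the 5.08 cm real blocks.
    for s in executable:
        xarm_execute(ee_pos=[p * scale for p in s.ee_pos],
                     gripper=s.gripper)
```

This loop would be run once per block, re-planning between blocks as described in Sec. 3.5.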

A.3 FAILURE MODES

Figure 6: Example real-world test configurations built using the TR 2 -GPT2 policy, which is trained in simulation on stacking a single block on top of another one or two blocks. Test configurations vary in height up to 5 blocks and use up to 26 blocks in trajectories that take up to 1600 steps to solve. Complexity varies from balancing blocks on one or multiple blocks to packing blocks closely together. Videos of real-world block stacking can be found in our supplemental materials.

There are a few failure modes that arise when transferring from simulation to the real world. We detail them in order of general frequency.

1. The biggest problem was blocks bouncing and rotating when dropped onto another block or the table. In simulation, the block is released nearly perfectly from both grippers on the end-effector, resulting in minimal external rotational forces, and thus the blocks land perfectly. However, in the real world this is not always possible, as the blocks we used would stick a little to the xArm gripper, causing one side of the block to be released at a different time than the other side. This leads to imprecision and bouncing in block placement as well as unwanted rotation. We partially address this issue by engineering abstract trajectories to release blocks from a lower height, so the model will also try to release blocks from a lower height, minimizing imprecision caused by rotation and bounces. We further mitigate this issue in some real-world experiments by discretizing along the z-axis to constrain the robot arm to dropping the blocks at certain heights.

2. Another issue is the supported range of the learned policy. Since in training the learned policy only learns to pick and place single blocks in a predefined region, it can only generalize so far before accumulating more and more error.
In the real-world experiments we did not draw explicit boundaries marking which regions the model had seen in training and which it had not. Thus, when placing blocks in the real world, we sometimes placed them too far away and the model failed to pick them up. This could be addressed by diversifying the training dataset to include a wider spawn range of blocks.

3. Lastly, while this is not a failure mode specific to the real world, it may appear at first glance that there is a discrepancy between simulation success and real-world success. In simulation we can stack 6-block-tall towers, albeit with low success rates, whereas in the real world we are able to stack a large castle configuration with over 3x the number of blocks. This is the result of the common failure mode where stacking higher towers is more difficult, as it requires more and more generalization in vertical stacking: in training, agents only ever stack a single block up to a height of 3 blocks. Table 1 shows that stacking a tower of height 4 has a very high success rate, and since the castle configuration is at most 4 blocks tall, it is more than feasible to stack successfully.

B SPECIFICATIONS OF ALL ENVIRONMENTS IN MAIN PAPER

This section dives into specific details for each environment we test on, including the heuristic used for abstract trajectory generation, the exact contents of the high- and low-level state spaces, and a description of each task. We also provide specific details left out of the main manuscript, such as figures depicting how Couch Moving works.

B.1 BOX PUSHER

Environment configuration The low-level agent receives a 6-dimensional observation consisting of the 2D positions of the agent, the green target box, and the blue goal location. The low-level agent's action space is 2-dimensional and is simply an xy delta position vector. The high-level space has 4-dimensional states containing the 2D position of the high-level pointmass and the 2D position of the target box. The high-level agent can further magically grasp the target box. The general task is to control a box-shaped agent to push a green target box, as shown in Fig. 1, to a blue goal location. The training task's variations do not include any obstacles, whereas at test time the task includes red obstacles that only the high-level agent can see, depicted in Fig. 3. Without guidance from a high-level agent through an abstract trajectory, it is difficult to avoid the obstacles since they are not in the observations. Thus, the test task requires careful following of an abstract trajectory in order to push the green target box to the goal. Success is defined as the green target box reaching within an ϵ distance of the blue goal location, where ϵ is equal to the width of the target box.

High- to low-level domain gap Under this setup, the high-level agent can attach itself to the green target box with magical grasping and directly drag the box around. The low-level agent, however, does not have magical grasp and can only push the green target box. In the example in Fig. 1, we show how the low-level agent must go behind the green target box, deviating from the high-level plan, in order to push it to the left, whereas the high-level agent can simply move to the green target box and immediately start moving to the left.

Abstract Trajectory Generation

The heuristic to generate the trajectory is as follows:

1. Move the agent along the x or y axis until the x or y position difference to the target box is below a threshold ϵ. Then do the same for the other axis.
2. Once near enough to the target box, enable grasping; the target box will then follow the agent's every subsequent movement.
3. Move the agent along the x or y axis again until the x or y position difference between the target box and the goal location is below a threshold ϵ. Then do the same for the other axis.

In total, there are 4 random variations of the abstract trajectories for any given starting high-level state, depending on which axis the agent approaches the target box along first (2), and which axis the agent moves the target box toward the goal location along first (2).
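The axis-by-axis heuristic above can be sketched as follows. This is our own minimal reimplementation with hypothetical names and step sizes, not the paper's released code:

```python
import numpy as np

def box_pusher_abstract_trajectory(agent, box, goal, step=0.1, eps=0.05):
    """Generate high-level waypoints: approach the box axis by axis,
    'magically grasp' it, then carry it axis by axis to the goal (sketch)."""
    traj = [np.array(agent, dtype=float)]

    def move_axis(target, axis):
        # Step along one axis until the position difference falls below eps.
        while abs(traj[-1][axis] - target[axis]) > eps:
            nxt = traj[-1].copy()
            nxt[axis] += np.clip(target[axis] - nxt[axis], -step, step)
            traj.append(nxt)

    box, goal = np.array(box, dtype=float), np.array(goal, dtype=float)
    # 1. Approach the target box (here: x axis first, then y).
    move_axis(box, 0)
    move_axis(box, 1)
    # 2. Grasp: from now on the box follows the agent with this fixed offset.
    offset = box - traj[-1]
    # 3. Move so that the carried box reaches the goal (x first, then y).
    move_axis(goal - offset, 0)
    move_axis(goal - offset, 1)
    return traj
```

Randomizing which axis is handled first in steps 1 and 3 yields the 2 x 2 = 4 variations described above.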

B.2 COUCH MOVING

Environment configuration

The low-level, couch-shaped agent receives a 9D vector representing a local 3x3 patch of its surroundings, indicating empty space (0) and walls (1). It also receives a 2D vector representing which direction along the maze moves forward toward the final goal, as well as its own 2D position. The low-level agent has a 3-dimensional action space consisting of 2D force and 1D torque. The high-level space contains 2-dimensional states containing only the 2D position of the high-level agent pointmass. The task is to control a couch-shaped agent and move it through a map composed of chambers, corridors, and corners, which are marked in Fig. 7a. Success is defined as the agent reaching within an ϵ distance of a blue target circle marked on the map. The environment nomenclature follows the structure Couch Moving [Short/Long] n: Long has corridors around 1.5x longer than the Short variation, and n is the number of corners in the maze. The training environment is Couch Moving Short 3 and the test tasks are all Long variations with more corners. Fig. 8 visualizes these different configurations.

High- to low-level domain gap In Couch Moving, the high-level agent and low-level agent have a large domain gap due to their different morphologies. The high-level agent is a pointmass that can traverse the entire map with no issues. However, the low-level agent is couch-shaped and cannot fit through corners unless oriented correctly, as shown in Fig. 7. Furthermore, the low-level agent can only access a local patch of the map layout, while the high-level agent has access to the whole map. This requires the low-level agent to learn to attend to different parts of the abstract trajectory in order to determine when to rotate in chambers, and we investigate how TR 2 -GPT2 bridges this gap via attention in Sec. 4.5.

Abstract trajectory generation The high-level agent is a pointmass that can freely move around within the map walls.
The high-level states consist of just the absolute 2D position of the high-level agent. As a result, the abstract trajectory is simply the map itself, represented as a sequence of 2D positions.

B.3 BLOCK STACKING

Environment configuration The low-level robot receives 32-dimensional observations from the environment, consisting of the robot arm's joint positions, joint velocities, the target block's pose, and the end-effector's pose. The arm is controlled via a 4-dimensional delta position controller on the end-effector. The high-level space consists of 6-dimensional vectors containing the 3D position of the high-level agent pointmass and the 3D position of the block to stack. The high-level agent also has magical grasping. The training task is visualized in Fig. 9 and consists of a single pick-and-place of a single block up to a height of 3 blocks. The locations of the already-placed blocks and the target block to be moved vary within a constrained region reachable by the robot arm. The robot arm also has a small amount of randomness in where it initializes. The test tasks shown in our results in Table 1 are visualized in Fig. 10 and vary in the number of blocks used to stack towers and pyramids, in addition to a wider region in which blocks are to be placed. The test tasks are designed to test long-horizon task generalization by stacking many blocks in farther locations in succession without fail. Additionally, the stacked configurations are different from what is seen at training time. We additionally run real-world tests on a wider variety of configurations detailed in Sec. A. Success is defined as placing every spawned block in a goal position and the robot gripper moving a minimum ϵ distance away from any of the stacked blocks. In our environment, ϵ is equal to 2 times the width of the blocks.

High-to low-level domain gap

The high-level agent is a pointmass represented by a 3D coordinate position and can magically grasp blocks. The low-level agent's state, however, contains the entire robot's proprioception as well as the pose of the end-effector. Moreover, the low-level agent has complex dynamics, as it must physically grasp and release blocks precisely.

Abstract trajectory generation

The high-level agent returns a trajectory consisting of 10-dimensional high-level states. This includes the 3D positions of the high-level pointmass agent and the target block to stack. The abstract trajectory generation heuristic is as follows:

1. Move the high-level agent to a random safe height. The high-level agent then elevates to another random height above the target block as it moves toward the target block.
2. The agent then moves down and approaches the target block. The high-level agent then turns magical grasp on, and the target block will now follow the high-level agent's every subsequent movement.
3. The agent then goes up to a random safe height, moves above the goal location, and finally lowers the block down to a random height above the goal location.
4. The high-level agent stops magically grasping the target block; the target block will no longer move and stays where it was last placed.
5. Lastly, the high-level agent moves up to a random safe height, then moves back to the location where it started.

There are various random variations in the abstract trajectory. These include the various random heights the agent moves to before/after picking or placing, as well as when the agent releases the target block it is magically grasping. These variations can be learned by the TR 2 -GPT2 model and allow a user more fine-grained control over how exactly the robot arm picks up blocks by engineering the abstract trajectory.

B.4 OPEN DRAWER

Environment configuration

The low-level robot receives 39-dimensional observations, which contain its joint positions, joint velocities, the pose of its end-effector, and the target handle's axis-aligned bounding box. Its actions are 13-dimensional and are realized through a joint velocity controller. The high-level space consists of 9-dimensional states composed of the 3D position of the high-level agent pointmass and a 6-dimensional bounding box of the handle. The high-level agent can further magically grasp the handle and pull the drawer out with ease. To produce the axis-aligned bounding boxes, we use the environment's point cloud data captured by a fixed camera. The handle's points in the point cloud are filtered out through a built-in segmentation map provided by SAPIEN, and these points are then used to calculate an axis-aligned bounding box. This vision pipeline is expanded upon in Section D.2. The training task is to open a single drawer on cabinets in the training set. One test task, which tests object generalization, is to open a single drawer on unseen cabinets in the test set. The final test task is to open two drawers on cabinets with at least two openable drawers, which is difficult to solve as it is easy to accidentally close one drawer while opening another. Success is defined as opening all targeted drawers at least 90% of the way with the drawers maintaining a velocity below a threshold.

High-to low-level domain gap

The high-level agent is a pointmass represented by a 3D position and can magically grasp the target drawer and move it open. The low-level agent, however, is a 13-DoF robot with complex dynamics that must physically grasp the drawer handle and pull it open.

Abstract trajectory generation The high-level state is 9-dimensional, containing the high-level agent's 3D position and the bounding box of the target handle. Note that in high-level states the bounding box is extracted once from the given starting low-level state and is then magically moved through space via translation only; we do not recompute the bounding box of the target handle or capture new point cloud data as the high-level agent generates the abstract trajectory. The abstract trajectory generation heuristic is as follows:

1. The high-level agent first moves up to a height equal to the mean height of the target handle's bounding box captured from the starting low-level state. Then, the high-level agent approaches the handle until it is close enough.

The only variation in the abstract trajectories is where the high-level agent decides to move after opening the cabinet.
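The translation-only treatment of the handle's bounding box can be sketched as below. This is a hypothetical helper of our own; the real implementation may differ:

```python
import numpy as np

def high_level_step(agent_pos, handle_bbox, delta, grasping):
    """Advance the Open Drawer high-level state by one movement `delta` (sketch).

    The 6D bounding box is captured once from the starting low-level state;
    while magical grasping is on, it is only ever translated with the agent
    and is never recomputed from new point cloud data."""
    delta = np.asarray(delta, dtype=float)
    agent_pos = np.asarray(agent_pos, dtype=float) + delta
    handle_bbox = np.asarray(handle_bbox, dtype=float)
    if grasping:
        # The 6D box stores two xyz-ordered corners, so translating it means
        # adding the same xyz delta to both halves of the vector.
        handle_bbox = handle_bbox + np.tile(delta, 2)
    return agent_pos, handle_bbox
```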

B.5 OBSTACLE PUSH (SILO)

Figure 11: Example of the recreated Obstacle Push environment from SILO (Lee et al., 2019). This is adapted from our Box Pusher environment, with the black stick now representing the closed gripper used in SILO, the red boxes representing obstacles, and the blue target representing the goal location.

Environment configuration The task is to control a stick-like agent to push the green box to the target location as depicted in Fig. 11. This environment was recreated based on the original paper by Lee et al. (2019), as their environment code could not be found at the time of writing.

High- to low-level domain gap The high-level agent is simply a pointmass with magical grasp and the magical ability to move the green target box through obstacles. The low-level agent, on the other hand, must deal with the obstacles and manipulate the box to move it around the obstacle and to the goal.

Abstract trajectory generation

The abstract trajectory generated here is meant to mimic as closely as possible the demonstrations generated in SILO for this environment. The heuristic is as follows:

1. The high-level agent first moves to the green box and magically grasps it.
2. The high-level agent then drags the grasped box along with it as it moves straight upwards, passing through the middle red obstacle, until it reaches the same x-axis position as the blue target.
3. The high-level agent moves straight toward the blue goal until the grasped green block is on the blue goal.

B.6 PICK AND PLACE (SILO)

Environment configuration The task is to control the robot arm to pick up the red block and bring it to the goal, as shown in Fig. 12. The action space is the exact same 4-DoF action space used by SILO, and the observation space is mostly the same barring some differences between the Panda and Sawyer arms. This environment was recreated based on the original paper by Lee et al. (2019), as their environment code could not be found at the time of writing.

Figure 12: Example of the recreated Pick and Place environment from SILO (Lee et al., 2019). This is adapted from our Block Stacking environment, but with the Panda arm and gripper instead of the Sawyer one. The goal location of the block is the white sphere.

High- to low-level domain gap The high-level agent is simply a pointmass with magical grasp and the magical ability to move the red block through the two walls. The low-level agent, on the other hand, must deal with the two walls and manipulate the block to mimic the high-level agent as closely as is feasible, since the walls prevent the low-level agent from fully following the high-level agent.

Abstract trajectory generation

The abstract trajectory generated is based on the details supplied by SILO. The heuristic is as follows:

1. The high-level agent first moves to the red block and magically grasps it.
2. The high-level agent then moves the grasped block along with it through two milestones in a slight curve. One milestone is either to the left or the right of the initial starting point; the second milestone is the goal.

C TRAINING

This section explains how we train our trajectory translation models as well as detailing the reward functions used.

C.1 ONLINE TRAINING

We use PPO (Schulman et al., 2017) as our training algorithm. The actor and critic are initialized with the same architecture but separate weights. For additional architecture details, see Sec. E. In all experiments, online training hyperparameters are mostly kept the same; see Sec. G for the specific hyperparameters used. All reported results are averaged over three seeds. In particular, for the Block Stacking environment, after the initial online training, only the TR 2 -GPT2 model had any substantial success rate. We train the TR 2 -GPT2 model in a second round where we simply turn gradient accumulation on and continue training, reducing the number of gradient updates in each epoch to just 3. This helped improve the maximum success rate the models were able to attain. Lastly, in practice the dissimilarity function d associated with the mapping function weighs data relevant to the agent by 0.1 and data relevant to all other objects by 0.9.
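The weighting in the last sentence can be sketched as a weighted distance. The exact functional form of d is not specified above, so the weighted L2 distance below is an assumption for illustration:

```python
import numpy as np

def dissimilarity(s_high, s_mapped, agent_dims, w_agent=0.1, w_other=0.9):
    """Weighted distance between a high-level state and the mapped low-level
    state (sketch). Dimensions describing the agent are down-weighted (0.1)
    relative to dimensions describing all other objects (0.9), as described
    above; the weighted-L2 form itself is our assumption."""
    diff = np.asarray(s_high, dtype=float) - np.asarray(s_mapped, dtype=float)
    weights = np.full(diff.shape, w_other)
    weights[list(agent_dims)] = w_agent
    return float(np.sqrt(np.sum(weights * diff ** 2)))
```

With this weighting, a mismatch in an object's position contributes far more to the dissimilarity than the same mismatch in the agent's own position.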

C.2 ADDITIONAL DISCUSSION ON THE TRAJECTORY FOLLOWING REWARD

Here we discuss and show an example of the limitation of the trajectory following reward when the abstract trajectory has repeated or similar high-level states. An example of this is periodic tasks such as repeated pick-and-place between two locations. Using the TR 2 framework, a practical way to overcome this issue is to chunk the abstract trajectory into parts that contain no periodicity. For repeated pick-and-place, for example, each chunk would be a smaller abstract trajectory corresponding to a single pick-and-place. See this video for a successful example of applying this strategy to the task of repeatedly picking and placing a block between two locations three times.

C.3 TRAJECTORY FOLLOWING RETURN TRACE

1. The increase in return in the first few time steps is attributed to the low-level agent following the high-level agent to move towards the target green box.
2. The following plateau is when the low-level agent must go behind the target green box before pushing it, showing how, in this scenario, we do not give more reward while the low-level agent is not matching new high-level states.
3. Once the agent starts pushing the target green box, it begins to match new high-level states as the target green box moves along the path demonstrated in the abstract trajectory, leading to more reward.
4. In the next plateau in return, we see that the low-level agent tries to push the target green box up to match the abstract trajectory, but requires around 30 steps to move behind the target green box, during which it is not matching new high-level states.
5. Finally, the low-level agent pushes the target green box to the blue goal location and gains constant reward for keeping the target green box at the blue goal location.

C.4 TASK REWARD

Each task comes with a basic task reward that defines the task, and we use the task reward as an auxiliary supervision signal in addition to the trajectory following reward. In this section, we present the task rewards.

C.4.1 BOX PUSHER

In the trajectory following setting, the task reward is scaled down by a factor of 0.1.

C.4.2 COUCH MOVING

Whenever the 2D position p_a of the agent is in a chamber, we give 0 reward. Whenever p_a is more than ϵ distance away from the center of any chamber, we give reward 0 if the agent is oriented correctly for the upcoming corner, as shown in Fig. 7a, and reward -1 if it is not, as shown in Fig. 7b. In the trajectory following setting, the task reward is scaled down by a factor of 0.5. Note that this reward function is not meant to fully define the task, and as a result goal-conditioned policies will struggle to achieve any meaningful success. The task reward here simply aids the exploration process for models in the TR 2 framework when trained online.
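The orientation-based task reward above can be sketched as follows, with chamber membership and the orientation check abstracted away as inputs (the chamber radius ϵ here is an arbitrary placeholder):

```python
import numpy as np

def couch_task_reward(p_a, chamber_centers, oriented_correctly, eps=0.5):
    """Couch Moving task reward (sketch). Inside a chamber the reward is 0;
    outside (more than eps from every chamber center) it is 0 when the couch
    is oriented correctly for the upcoming corner and -1 otherwise."""
    p_a = np.asarray(p_a, dtype=float)
    in_chamber = any(np.linalg.norm(p_a - np.asarray(c, dtype=float)) <= eps
                     for c in chamber_centers)
    if in_chamber:
        return 0.0
    return 0.0 if oriented_correctly else -1.0
```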

C.4.3 BLOCK STACKING

The low-level agent is tasked with picking up a block, stacking it at the goal location, then returning to its initial location. Thus our success metric is:

$$\text{success} = \begin{cases} 1 & \text{if } \|p_{\text{init}} - p_a\|_2 < \epsilon_1 \text{ and } \|p_a - p_g\|_2 > \epsilon_2 \text{ and not grasping and close enough} \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

The success metric relies on four conditions: first, the low-level agent must return to its initial location; second, the agent must move away from the goal position by more than a certain distance; third, the agent must not be grasping the block; and lastly, the target block must be close enough to the goal location. A check is also implemented to determine whether the agent is still grasping a block. To assist the agent in task completion, we employ a four-stage task reward to encourage it to approach, grasp, and transport the block to the goal position, and then return to its initial location. To prevent RL agents from remaining in an intermediate stage indefinitely, we ensure that the rewards in each successive stage are strictly greater than those in the current stage.

$$R_{\text{Task}} = \begin{cases} 1 - \tanh(\|p_b - p_a\|_2) & \text{not grasping and not close enough} \\ 2.25 - \tanh(\|p_b - p_a\|_2) - \tanh(\|p_g - p_b\|_2) & \text{grasping and not close enough} \\ 3.5 & \text{grasping and close enough} \\ 14.5 - \tanh(\|p_{\text{init}} - p_a\|_2) & \text{not grasping and close enough} \end{cases} \quad (4)$$

where p_b, p_g, and p_a represent the target block's position, the goal position, and the agent's position, respectively. At stage one, the agent is not grasping the target block and the target block is not close enough to the goal location, so the agent receives a reward encouraging it to approach the target block. A constant reward is given to the agent while it is grasping the block, and upon grasping, the reward enters the second stage.
At stage two, the agent is grasping the target block while the target block is not at the goal location, so an additional training signal is added to the reward to encourage the agent to take the block to the goal location. When the target block gets close enough to the goal location, an extra reward is given to the agent so that it keeps the block close to the goal location. During the last stage, the agent has released the target block at the correct location, and a reward is given to encourage it to go back to its initial location. In the trajectory following setting, the task reward is scaled down by a factor of 0.2.
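Eq. (4) can be transcribed directly. A sketch, with the grasping and close-enough predicates supplied as booleans rather than computed from the simulator state:

```python
import numpy as np

def block_stacking_reward(p_b, p_g, p_a, p_init, grasping, close_enough):
    """Four-stage Block Stacking task reward of Eq. (4) (sketch).
    p_b, p_g, p_a, p_init: block, goal, agent, and initial agent positions."""
    p_b, p_g = np.asarray(p_b, float), np.asarray(p_g, float)
    p_a, p_init = np.asarray(p_a, float), np.asarray(p_init, float)
    if not grasping and not close_enough:   # stage 1: approach the block
        return 1.0 - np.tanh(np.linalg.norm(p_b - p_a))
    if grasping and not close_enough:       # stage 2: carry block to the goal
        return 2.25 - np.tanh(np.linalg.norm(p_b - p_a)) \
                    - np.tanh(np.linalg.norm(p_g - p_b))
    if grasping and close_enough:           # stage 3: hold block at the goal
        return 3.5
    return 14.5 - np.tanh(np.linalg.norm(p_init - p_a))  # stage 4: return home
```

Note how the constant offsets (1, 2.25, 3.5, 14.5) make each successive stage's reward strictly greater than the previous stage's, as described above.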

C.4.4 OPEN DRAWER

Since this task is adapted from the OpenCabinetDrawer task in the ManiSkill benchmark (Mu et al., 2021), we use the original task reward from the ManiSkill benchmark, which we briefly explain below. The Open Drawer environment divides the reward into three stages. In the first stage, the agent is rewarded for being close to the target drawer's handle. To promote contact with the target link, the negative Euclidean distance between the handle and the gripper is added to the reward. When the distance between the gripper and the target link is smaller than a threshold, the agent proceeds to the second stage. At this stage, the agent receives a reward based on the opening angle of the door or the opening distance of the drawer. When the agent opens the door or drawer enough, the last stage starts, in which the agent receives a negative reward depending on its speed to encourage it to remain static. In the trajectory following setting, the task reward is scaled down by a factor of 0.02. Note that this task reward weight is much smaller than in other environments since the overall return of Open Drawer's task reward is much larger in magnitude.

C.4.5 OBSTACLE PUSH (SILO)

The same task reward for Box Pusher is used here as well.

C.4.6 PICK AND PLACE (SILO)

The same task reward for Block Stacking is used here as well.

D VISUAL INPUT PROCESSING

We use visual inputs in Block Stacking and Open Drawer, and this section explains how we process the visual inputs.

D.1 BLOCK STACKING

In the Block Stacking environment, we need to estimate the 3D positions of the blocks. In our code, our models in fact take the 3D position plus a 4D quaternion as input; however, the quaternion rarely deviates from a fixed value, so we treat the models as observing only 3D positions. In both simulation and the real world, we follow this general pipeline:

1. Capture an unstructured point cloud and use the camera intrinsics and extrinsics to transform the point cloud into the world frame relative to the robot arm base.
2. Using hard-coded xyz boundaries, filter out points outside of the workspace where blocks are placed and stacked.
3. Apply k-means clustering to segment the point cloud.
4. Apply the Iterative Closest Point (ICP) algorithm to estimate a block pose for each segmented point cloud. Note that we only keep the block position.

For the k-means step, we choose k by minimizing a scaled inertia score i + αk, where i is the inertia, equal to the within-cluster sum of squares (Buitinck et al., 2013), and α is a cluster penalty factor. The next two subsections provide additional details for the simulation and real-world settings. All our results for Block Stacking in Table 1 are produced using estimated 3D block positions at every time step. Note that at training time, the policies use ground-truth block positions in order to bypass the vision pipeline and speed up training.

D.1.1 SIMULATION

In the SAPIEN simulator, we capture RGBD data as an unstructured point cloud and transform it into a world-frame point cloud relative to the robot arm using the camera intrinsics and extrinsics given by the simulator. In addition to applying the hard-coded xyz boundaries, we remove the floor, which is at z = 0. We further filter out points that are not red, leaving us with a world-frame point cloud containing just block points. For the k-means clustering used for segmentation, we set the α parameter to 0.5.
Note that during evaluation in simulation, we only use the position of the block to be manipulated for the low-level agent, determined by the high-level agent.
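The scaled-inertia model selection i + αk can be sketched with a small hand-rolled Lloyd's k-means. Our actual code uses scikit-learn's KMeans; this numpy-only version with deterministic first-k-points initialization is an illustrative stand-in:

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Minimal Lloyd's k-means; returns (labels, inertia).
    Deterministic init (first k points) for illustration only."""
    pts = np.asarray(points, dtype=float)
    centers = pts[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(pts[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers, skipping any empty cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pts[labels == j].mean(axis=0)
    inertia = ((pts - centers[labels]) ** 2).sum()  # within-cluster SSQ
    return labels, inertia

def best_k(points, k_max, alpha=0.5):
    """Pick k minimizing the scaled inertia i + alpha * k described above."""
    scores = {k: kmeans(points, k)[1] + alpha * k for k in range(1, k_max + 1)}
    return min(scores, key=scores.get)
```

Larger α penalizes over-segmentation more strongly; the paper uses α = 0.5 in simulation and α = 1.75 in the real world.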

D.1.2 REAL WORLD

Using the Intel RealSense depth camera, we capture RGBD data in the form of an unstructured point cloud. The camera intrinsics are provided by the camera, and we estimate the camera extrinsics via hand-eye calibration. Using the intrinsics and extrinsics, we transform the point cloud into the robot base's frame. In addition to applying the hard-coded xyz boundaries that crop the RGBD data to the points shown in Fig. 14b, we remove most of the floor points by segmenting out the plane using RANSAC, as shown in Fig. 14c. For the k-means clustering used for segmentation, we set the α parameter to 1.75. After applying ICP, the resulting estimated block poses for each cluster are shown in Fig. 14d. As our paper's focus is not on sim-to-real transfer, in practice we simplify experiments by using only the position of the block closest to the camera and assume that it is the block to be picked up and stacked at the desired goal position. The final estimated pose is shown in Fig. 14e. More advanced vision pipelines could be used to better segment individual blocks if necessary.

D.2 OPEN DRAWER

As mentioned in Sec. B, the Open Drawer environment uses a point cloud representation to record its state, which grants it the potential to be transferred to real experiments. The points are first filtered by SAPIEN to include only points on the target drawer handle. The filtered points are then used to compute the bounding box of the target handle. Since the bounding box is axis-aligned, it is represented by a 6D vector as follows: Bounding Box := (x_min, y_min, z_max, x_max, y_max, z_min), where the minima and maxima are taken over all points.
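The 6D bounding-box computation can be sketched directly from per-axis minima and maxima, keeping the component ordering stated above (the function name is ours):

```python
import numpy as np

def handle_bounding_box(points):
    """Axis-aligned bounding box of the handle's filtered points, packed as
    (x_min, y_min, z_max, x_max, y_max, z_min) to match the 6D vector
    described above (sketch)."""
    pts = np.asarray(points, dtype=float)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    return np.array([mins[0], mins[1], maxs[2], maxs[0], maxs[1], mins[2]])
```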

E TRAJECTORY TRANSLATOR IMPLEMENTATION DETAILS

This section describes the model architecture we use for TR 2 -based models as well as some practical implementation details. When discussing the architecture in this section, we refer to Fig. 2.

Abstract Trajectory Pre-processing In practice, we preprocess the abstract trajectory before training to constrain the distances between high-level states to a fixed value and to ensure the abstract trajectory is not too long while still being descriptive. We preprocess abstract trajectories $\tau^H$ by interpolating between high-level states such that for a given high-level state $s^H_i$, the next high-level state $s^H_{i+1}$ is approximately ϵ distance away. If two adjacent high-level states from the original $\tau^H$ are closer than ϵ distance to each other, we drop the later high-level state. A higher ϵ reduces the length of the abstract trajectory, while lower values increase it. ϵ can vary between environments and is tuned such that no two high-level states are too close and a low-level agent cannot easily skip a high-level state by moving too fast. With the trajectory following reward, we observe that with lower ϵ values, TR 2 -GPT2 often generates very jittery trajectories, moving back and forth in order to emulate moving slower and match every high-level state, so we typically set ϵ higher. Lastly, we always retain the first and last high-level states.

Abstract Trajectory Sub-sampling During training and evaluation, we sub-sample the abstract trajectory in order to improve inference and training speeds. In particular, given an abstract trajectory $\tau^H = (s^H_1, s^H_2, \dots, s^H_n)$, we keep $s^H_n$ and interval-sample the sequence $s^H_1, \dots, s^H_{n-1}$ by keeping every p-th state $s^H_1, s^H_{p+1}, \dots$. The sub-sampled abstract trajectory is then given to the low-level agent as part of the observation.
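The ϵ-spacing pre-processing and the sub-sampling can both be sketched as follows. This is our own minimal reimplementation; the released code may differ in details such as the interpolation scheme:

```python
import numpy as np

def preprocess(traj, eps):
    """Resample an abstract trajectory so adjacent kept high-level states are
    roughly eps apart (sketch). States closer than eps to the previously kept
    state are dropped; the first and last states are always retained."""
    out = [np.asarray(traj[0], dtype=float)]
    for s in (np.asarray(p, dtype=float) for p in traj[1:]):
        # Walk toward s in eps-sized interpolated steps; once within eps of s,
        # s itself is dropped (it is too close to the last kept state).
        while np.linalg.norm(s - out[-1]) > eps:
            direction = (s - out[-1]) / np.linalg.norm(s - out[-1])
            out.append(out[-1] + eps * direction)
    last = np.asarray(traj[-1], dtype=float)
    if np.linalg.norm(out[-1] - last) > 0:
        out.append(last)  # always keep the final high-level state
    return out

def subsample(traj, p):
    """Keep every p-th high-level state plus the final state."""
    return traj[:-1][::p] + [traj[-1]]
```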
Note that for trajectory following reward calculations, we still use the full, un-sub-sampled version of the abstract trajectory.

Inputs The inputs at timestep t to models in the TR 2 framework consist of the abstract trajectory $\tau^H = (s^H_1, s^H_2, \dots, s^H_n)$ as well as the past executed low-level states $s^L_{t-k+1}, s^L_{t-k+2}, \dots, s^L_t$. The abstract trajectory can also be viewed as a prompt, with the difference that this prompt is always part of the input sequence. The low-level states are generated in an auto-regressive manner via interaction with the environment, similar to the auto-regressive nature of the Decision Transformer (DT) (Chen et al., 2021). The high-level states are passed through their own encoder, which can be an MLP, a Convolutional Neural Network (Krizhevsky et al., 2017), a PointNet (Qi et al., 2016), etc., depending on how one wishes to process the data. In our work we use only the MLP, as it is much faster to train with state-based inputs. Similarly, the low-level states are passed through their own separate encoder. The encoders output tokens $x_1, x_2, \dots, x_{n+k}$, forming a sequence of length n+k. Positional encodings are removed; instead, one can optionally add timestep-based embeddings to the tokens corresponding to high-level states, as the Decision Transformer does. The sequence is then fed through a sequence model, which in our work is mainly the GPT2 Transformer but can easily be any other sequence processing model such as an LSTM. The sequence model outputs tokens $z_1, z_2, \dots, z_{n+k}$. The $z_{n+k}$ token can be viewed as a contextual token that is fed into the final MLP to guide it in producing the action that makes the agent follow the abstract trajectory.
Note that by feeding the sequence in the order presented above, unidirectional transformers like GPT2 enable the final output token z_{n+k} to attend to all past executed states as well as the entire abstract trajectory, which is crucial for modelling long-horizon dependencies and is examined further in Section 4.5.
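A minimal sketch of how the input sequence might be assembled, assuming one-layer MLP encoders and showing the lower-triangular causal mask in place of the actual GPT2 stack. All names and dimensions here are illustrative, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_encoder(x, W, b):
    """One-layer MLP encoder mapping raw states to d-dimensional tokens."""
    return np.tanh(x @ W + b)

n, k = 6, 4                 # abstract-trajectory length, low-level context length
d_high, d_low, d = 8, 10, 16

tau_H = rng.normal(size=(n, d_high))   # abstract trajectory s^H_1 .. s^H_n
past_L = rng.normal(size=(k, d_low))   # past executed states s^L_{t-k+1} .. s^L_t

# separate encoders for high-level and low-level states
W_h, b_h = rng.normal(size=(d_high, d)), np.zeros(d)
W_l, b_l = rng.normal(size=(d_low, d)), np.zeros(d)

# token sequence x_1 .. x_{n+k}: abstract trajectory first, then past states
tokens = np.concatenate([mlp_encoder(tau_H, W_h, b_h),
                         mlp_encoder(past_L, W_l, b_l)], axis=0)  # (n + k, d)

# unidirectional (causal) attention mask: position i attends to j <= i, so the
# final token z_{n+k} sees the entire abstract trajectory and all past states
mask = np.tril(np.ones((n + k, n + k), dtype=bool))
```

The last row of `mask` is all True, which is exactly the property exploited above: the contextual token z_{n+k} can condition on every preceding token.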

F ADDITIONAL RESULTS

We further benchmark our TR²-GPT2 method on the X-Magical benchmark from Zakka et al. (2022), as our method can also be viewed as a form of cross-embodiment learning. We describe the environment setup and results below. Note that, as in Block Stacking, we utilize the extra fine-tuning stage for the Gripper embodiment.

F.1.1 ENVIRONMENT CONFIGURATION

The task is to control an agent to push all three boxes into the end-zone, as depicted in Figure 15. We follow the exact same environment configuration as Zakka et al. (2022) and refer the reader to their paper for details. The original environment configuration is our low-level agent setup. The high-level space, like those of our other environments, abstracts all objects and the agent as simple 2D positions. Concretely, the high-level state is an 8-dimensional vector containing the 2D positions of the pointmass and the three boxes.

High-to-low-level domain gap The high-level agent is simply a pointmass with magical grasp, allowing it to easily move boxes to the end-zone one by one. Each low-level embodiment affects how the low-level agent can feasibly push all three boxes to the end-zone. The gripper and short-stick embodiments are most similar to the high-level agent, as they also move boxes to the end-zone one by one, although without the power of magical grasping. The medium-stick and long-stick embodiments differ more from the high-level agent in that they can move two and three boxes at once, respectively, and likewise lack magical grasp.
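Assuming access to the underlying 2D positions, the 8-dimensional high-level state described above could be assembled as follows (the function name is hypothetical):

```python
import numpy as np

def xmagical_high_level_state(agent_xy, box_xys):
    """8-dimensional abstract state: the 2D position of the pointmass
    followed by the 2D positions of the three boxes."""
    assert len(box_xys) == 3
    return np.concatenate([np.asarray(agent_xy, dtype=float)]
                          + [np.asarray(b, dtype=float) for b in box_xys])
```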



1. The high-level agent turns on magical grasping, meaning every subsequent movement of the high-level agent also moves the axis-aligned bounding box captured from the initial state.
2. The high-level agent begins moving in the direction that opens the drawer until it is opened completely.
3. The high-level agent finally releases the magical grasp and moves to a random new location away from the handle and cabinet.



Figure 1: Task in Box Pusher: move the green target box to the blue goal position. The arrows in the map show how the agents move.

Figure 3: Training and Test Environments, with arrows indicating the direction of movement

(a) Agent attending to its current location as well as the next-next chamber. (b) Agent attends only to the next corner and chamber, not the past.

Figure 4: Mean attention of all heads of the TR²-GPT2 model. The orange arrow indicates where the agent is on the map, and the red circle indicates the goal to reach. Darker blue represents the most attention, and lighter colors represent minimal attention.

(left and middle) show that re-planning successfully handles unforeseen interventions in Block Stacking and mistakes in Open Drawer. In Block Stacking, we add external interventions by randomly moving blocks off the tower, and allow the agent to re-plan just once per intervention. As the number of external interventions increases, the success rate drops far less than without re-planning, since the re-generated abstract trajectories guide the low-level agent to pick up misplaced blocks. In Open Drawer, the robot arm must open two drawers and often accidentally closes the one it already opened. With re-planning, the high-level agent provides a corrective abstract trajectory that guides the low-level agent to re-open the closed drawer and succeed.

28-block Castle. Note that the robot failed to stack two blocks properly towards the back left; in the end, 26 out of 28 blocks were stacked.

Figure 7: 7a shows how a properly oriented couch can easily pass through a corner. 7b shows that, on the other hand, the wrong orientation can prevent the couch from moving through the corner.

Figure 9: Examples of training time tasks. The low-level agent must follow an abstract trajectory and place a block at the designated white goal position

Figure 13: Episode return trace for a single episode. The chart shows only trajectory-following rewards and does not include the scaled-down task reward. Orange arrows indicate the direction of the black agent's future movement.

.1 BOX PUSHER

We define two distances, d_{a,b} = ||p_a - p_b|| and d_{b,g} = ||p_b - p_g||, where p_a is the 2D position of the agent, p_b the 2D position of the target box, and p_g the 2D position of the goal location. The task reward at each timestep is then -(0.1 d_{a,b} + 0.9 d_{b,g}).
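This reward can be transcribed directly into code; the function name is our own:

```python
import numpy as np

def box_pusher_task_reward(p_a, p_b, p_g):
    """Task reward: weighted negative distances, nudging the agent toward the
    target box (weight 0.1) and the box toward the goal (weight 0.9)."""
    d_ab = np.linalg.norm(np.asarray(p_a, dtype=float) - np.asarray(p_b, dtype=float))
    d_bg = np.linalg.norm(np.asarray(p_b, dtype=float) - np.asarray(p_g, dtype=float))
    return -(0.1 * d_ab + 0.9 * d_bg)
```

The reward is 0 exactly when the agent touches the box and the box sits on the goal, and grows more negative the farther either is from its target.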

True if, for both the left and right finger, the angles between the impulses and the open directions are smaller than a threshold; False otherwise. (3)

Final extracted pose of block to pick up

Figure 14: Frame by frame process of how the pose of the block to manipulate is estimated in the real world.

Figure 15: Examples of different low-level embodiments (from left to right: Gripper, Short-stick, Medium-stick, Long-stick), all solving the same task of pushing 3 boxes to the pink end-zone at the top

Mean success rate and standard error on training and test tasks, evaluated over 3 training seeds and 128 evaluation episodes each.

The goal is to control a 13-DoF mobile robot (with gripper) to open a drawer on various cabinets. During training, the task is to pull open a drawer on cabinets from a training set. At test time, the task is to pull open drawers on unseen cabinets and/or open additional drawers in an episode. The test setting tests whether the low-level agent can learn to follow the abstract trajectory and manipulate and pull unseen drawer handles.

Results compared with Lee et al. (2019)

H LEARNING CURVES

Abstract trajectory generation The abstract trajectory generation procedure is implemented as follows:
1. The high-level agent first moves to a random box not in the end-zone and, when within an ϵ distance, magically grasps the targeted box.
2. The high-level agent then drags the grasped box along with it as it moves straight to the end-zone.
3. The high-level agent releases the grasped box onto the end-zone and moves back out.
4. Repeat steps 1-3 until all boxes are in the end-zone.
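The steps above can be sketched as a scripted high-level policy. This sketch assumes, for simplicity, that the end-zone is the region above a fixed y-coordinate, that boxes are visited in order rather than randomly, and that movement is discretized with a fixed step size; all names and constants are illustrative.

```python
import numpy as np

def generate_abstract_trajectory(agent_xy, box_xys, endzone_y, step=0.1, eps=0.05):
    """Scripted high-level agent: repeatedly move to a box outside the
    end-zone, magically grasp it, drag it straight into the end-zone,
    then release and back out, until all boxes are in the end-zone."""
    agent = np.asarray(agent_xy, dtype=float)
    boxes = [np.asarray(b, dtype=float) for b in box_xys]
    states = [np.concatenate([agent] + boxes)]
    for box in boxes:
        if box[1] >= endzone_y:
            continue  # box already in the end-zone
        # 1. move toward the box; grasp once within eps distance
        while np.linalg.norm(agent - box) > eps:
            agent = agent + step * (box - agent) / np.linalg.norm(box - agent)
            states.append(np.concatenate([agent] + boxes))
        # 2. drag the grasped box straight up into the end-zone
        while box[1] < endzone_y:
            agent = agent + np.array([0.0, step])
            box += np.array([0.0, step])  # grasped box follows the agent
            states.append(np.concatenate([agent] + boxes))
        # 3. release the box and move back out of the end-zone
        agent = agent + np.array([0.0, -2.0 * step])
        states.append(np.concatenate([agent] + boxes))
    return np.stack(states)
```

Each recorded state is the 8-dimensional abstract state (agent position plus three box positions), so the output is a sequence of high-level states τ^H ready for pre-processing.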

F.1.2 TASK REWARD FOR X-MAGICAL

We do not define a denser task reward for this environment in our experiments. We simply use the sparse success signal at the end of the episode as the additional task reward, scaled down by a factor of 0.2.

F.1.3 RESULTS

We achieve results comparable to XIRL on the X-Magical benchmark, as shown in Table 3. Note that we did not leverage access to a low-level dataset of (image-observation-only) demonstrations, and instead utilize the generated abstract trajectories and the trajectory-following reward to guide the agent to solve the task. Moreover, we would like to point out that our method is conditioned on an abstract trajectory, whereas XIRL trains goal-conditioned policies.

G HYPERPARAMETERS

We detail the hyperparameters of our training processes. For all online training experiments we use the common hyperparameters in Table 4. In the subsequent Tables 5, 6, 7, 8, 9, and 10, we detail only the differences relative to these defaults.

