ALFWORLD: ALIGNING TEXT AND EMBODIED ENVIRONMENTS FOR INTERACTIVE LEARNING

Abstract

Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding).

TextWorld Embodied

Welcome! You are in the middle of the room. Looking around you, you see a diningtable, a stove, a microwave, and a cabinet. Your task is to: Put a pan on the diningtable.

> goto the cabinet

You arrive at the cabinet. The cabinet is closed.

> open the cabinet

The cabinet is empty.

> goto the stove

You arrive at the stove. Near the stove, you see a pan, a pot, a bread loaf, a lettuce, and a winebottle.

> take the pan from the stove

You take the pan from the stove.

> goto the diningtable

You arrive at the diningtable.

> put the pan on the diningtable

You put the pan on the diningtable.

1. INTRODUCTION

Consider helping a friend prepare dinner in an unfamiliar house: when your friend asks you to clean and slice an apple for an appetizer, how would you approach the task? Intuitively, one could reason abstractly: (1) find an apple, (2) wash the apple in the sink, (3) put the clean apple on the cutting board, (4) find a knife, (5) use the knife to slice the apple, and (6) put the slices in a bowl. Even in an unfamiliar setting, abstract reasoning can help accomplish the goal by leveraging semantic priors: locations of objects (apples are commonly found in the kitchen, along with implements for cleaning and slicing), object affordances (a sink is useful for washing an apple; a refrigerator is not), and pre-conditions (it is better to wash an apple before slicing it than the converse).

We hypothesize that learning to solve tasks using abstract language, unconstrained by the particulars of the physical world, enables agents to complete embodied tasks in novel environments by leveraging the kinds of semantic priors that are exposed by abstraction and interaction.

To test this hypothesis, we have created the novel ALFWorld framework, the first interactive, parallel environment that aligns text descriptions and commands with physically embodied robotic simulation. We build ALFWorld by extending two prior works: TextWorld (Côté et al., 2018), an engine for interactive text-based games, and ALFRED (Shridhar et al., 2020), a large-scale dataset for vision-language instruction following in embodied environments. ALFWorld provides two views of the same underlying world and two modes by which to interact with it: TextWorld, an abstract, text-based environment, generates textual observations of the world and responds to high-level text actions; ALFRED, the embodied simulator, renders the world in high-dimensional images and responds to low-level physical actions as from a robot (Figure 1).
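To make the abstract text view concrete, the transcript above can be reproduced by a tiny rule-based environment. The class below is a minimal sketch for illustration only; its interface and behavior are assumptions, not the actual ALFWorld or TextWorld API:

```python
# Toy text environment mirroring the transcript above (illustrative only,
# not the real ALFWorld/TextWorld interface).
class ToyTextEnv:
    def __init__(self):
        # receptacle -> objects currently at/in it
        self.locations = {"cabinet": [], "stove": ["pan", "pot"], "diningtable": []}
        self.at = None       # agent's current location
        self.holding = None  # object in hand, if any

    def step(self, command):
        words = command.split()
        if words[0] == "goto":
            self.at = words[-1]
            return f"You arrive at the {self.at}."
        if words[0] == "take":
            obj = words[1]
            self.locations[self.at].remove(obj)
            self.holding = obj
            return f"You take the {obj} from the {self.at}."
        if words[0] == "put":
            obj, self.holding = self.holding, None
            self.locations[self.at].append(obj)
            return f"You put the {obj} on the {self.at}."
        return "Nothing happens."

env = ToyTextEnv()
for cmd in ["goto the stove", "take pan from the stove",
            "goto the diningtable", "put pan on the diningtable"]:
    print(env.step(cmd))
```

Even this toy version exposes the key property of the text view: observations and actions are short symbolic strings, so an agent can explore and plan without any perception or motor control.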
Unlike prior work on instruction following (MacMahon et al., 2006; Anderson et al., 2018a), which typically uses a static corpus of cross-modal expert demonstrations, we argue that aligned parallel environments like ALFWorld offer a distinct advantage: they allow agents to explore, interact, and learn in the abstract environment of language before encountering the complexities of the embodied environment. While fields such as robotic control use simulators like MuJoCo (Todorov et al., 2012) to provide infinite data through interaction, there has been no analogous mechanism, short of hiring a human around the clock, for providing linguistic feedback and annotations to an embodied agent. TextWorld addresses this discrepancy by providing programmatic and aligned linguistic signals during agent exploration. This facilitates the first work, to our knowledge, in which an embodied agent learns the meaning of complex multi-step policies, expressed in language, directly through interaction.

Empowered by the ALFWorld framework, we introduce BUTLER (Building Understanding in Textworld via Language for Embodied Reasoning), an agent that first learns to perform abstract tasks in TextWorld using Imitation Learning (IL) and then transfers the learned policies to embodied tasks in ALFRED. When operating in the embodied world, BUTLER leverages the abstract understanding gained from TextWorld to generate text-based actions; these serve as high-level subgoals that facilitate physical action generation by a low-level controller. Broadly, we find that BUTLER is capable of generalizing in a zero-shot manner from TextWorld to unseen embodied tasks and settings. Our results show that training first in the abstract text-based environment is not only 7× faster, but also yields better performance than training from scratch in the embodied world.
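The resulting control hierarchy, in which a text policy proposes subgoals and a controller grounds them in physical actions, can be sketched as follows. All names and signatures here are illustrative assumptions about the design, not the released BUTLER code:

```python
# Hedged sketch of BUTLER-style hierarchical control (hypothetical interfaces).
def run_episode(text_policy, controller, env, max_subgoals=10):
    """Alternate between high-level text actions and low-level execution."""
    obs = env.reset()  # initial textual description of the scene
    for _ in range(max_subgoals):
        subgoal = text_policy(obs)            # e.g. "take the pan from the stove"
        obs, done = controller(env, subgoal)  # executes low-level physical actions
        if done:                              # task goal satisfied
            return True
    return False                              # subgoal budget exhausted
```

In this factorization the text policy never sees pixels; it operates purely on textual observations, which is what allows it to be trained entirely in TextWorld before being transferred to the embodied setting.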
These results lend credibility to the hypothesis that solving abstract language-based tasks can help build priors that enable agents to generalize to unfamiliar embodied environments. Our contributions are as follows:

§ 2 ALFWorld environment: The first parallel interactive text-based and embodied environment.

§ 3 BUTLER architecture: An agent that learns high-level policies in language that transfer to low-level embodied executions, and whose modular components can be independently upgraded.

§ 4 Generalization: We demonstrate empirically that BUTLER, trained in the abstract text domain, generalizes better to unseen embodied settings than agents trained from corpora of demonstrations or from scratch in the embodied world.

2. ALIGNING ALFRED AND TEXTWORLD

ALFRED (Shridhar et al., 2020), set in the THOR simulator (Kolve et al., 2017), is a benchmark for learning to complete embodied household tasks using natural language instructions and egocentric visual observations. As shown in Figure 1 (right), ALFRED tasks pose challenging interaction and navigation problems to an agent in a high-fidelity simulated environment.

Tasks are annotated with a goal description that describes the objective (e.g., "put a pan on the dining table"). We consider both template-based and human-annotated goals; further details on goal specification can be found in Appendix H. Agents observe the world through high-dimensional pixel images and interact using low-level action primitives: MOVEAHEAD, ROTATELEFT/RIGHT, LOOKUP/DOWN, PICKUP, PUT, OPEN, CLOSE, and TOGGLEON/OFF.
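To illustrate the gap the low-level controller must bridge, a single high-level text action typically expands into several of these primitives. The hand-written mapping below is purely illustrative; in BUTLER the expansion is produced by a learned controller operating on visual observations, not a lookup table:

```python
# Illustrative expansion of high-level text actions into ALFRED-style primitives.
PRIMITIVES = {"MOVEAHEAD", "ROTATELEFT", "ROTATERIGHT", "LOOKUP", "LOOKDOWN",
              "PICKUP", "PUT", "OPEN", "CLOSE", "TOGGLEON", "TOGGLEOFF"}

def expand(text_action):
    """Hand-written sketch: map one text action to a plausible primitive sequence."""
    verb = text_action.split()[0]
    return {
        "goto": ["ROTATELEFT", "MOVEAHEAD", "MOVEAHEAD"],  # navigation, simplified
        "open": ["OPEN"],
        "take": ["PICKUP"],
        "put":  ["PUT"],
    }.get(verb, [])

plan = [p for cmd in ["goto the stove", "take the pan",
                      "goto the diningtable", "put the pan"] for p in expand(cmd)]
```

The four text actions above already expand into eight primitive steps; real episodes require far longer primitive sequences, which is precisely the complexity the abstract text view lets the high-level policy ignore.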



Note: Throughout this work, for clarity of exposition, we use ALFRED to refer to both the tasks and the grounded simulation environment, but rendering and physics are provided by THOR (Kolve et al., 2017).



Figure 1: ALFWorld: Interactive aligned text and embodied worlds. An example with high-level text actions (left) and low-level physical actions (right).

