ASK YOUR HUMANS: USING HUMAN INSTRUCTIONS TO IMPROVE GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Complex, multi-task problems have proven difficult to solve efficiently in a sparse-reward reinforcement learning setting. To be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level language generator and a low-level policy conditioned on language. We find that human demonstrations help solve the most complex tasks. We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting and to learn quickly from a few demonstrations. Generalization is reflected not only in the actions of the agent but also in the natural language instructions it generates on unseen tasks. Our approach also gives our trained agent interpretable behaviors, because it is able to generate a sequence of high-level descriptions of its actions.

1. INTRODUCTION

One of the most remarkable aspects of human intelligence is the ability to quickly adapt to new tasks and environments. From a young age, children are able to acquire new skills and solve new tasks through imitation and instruction (Council et al., 2000; Meltzoff, 1988; Hunt, 1965). The key is our ability to use language to learn abstract concepts and then reapply them in new settings. Inspired by this, one of the long-term goals in AI is to build agents that can learn to accomplish new tasks and goals in an open-world setting using just a few examples or instructions from humans. For example, if we had a health-care assistant robot, we might want to teach it how to bring us our favorite drink or make us a meal just the way we like it, perhaps by showing it how to do this a few times and explaining the steps involved. However, the ability to adapt to new environments and tasks remains a distant goal. Previous work has considered using language as a high-level representation for RL (Andreas et al., 2017; Jiang et al., 2019). However, these approaches typically use language generated from templates that are hard-coded into the simulators the agents are tested in, allowing the agents to receive virtually unlimited training data with which to learn language abstractions. Both ideally and practically, however, instructions are a limited resource. If we want to build agents that can quickly adapt in open-world settings, they need to be able to learn from limited, real instruction data (Luketina et al., 2019). And unlike the clean ontologies generated in these previous approaches, human language is noisy and diverse; there are many ways to say the same thing. Approaches that aim to learn new tasks from humans must be able to use human-generated instructions. In this work, we take a step towards agents that can learn from limited human instruction and demonstration by collecting a new dataset of natural-language-annotated tasks and corresponding gameplay.
The environment and dataset are designed to directly test multi-task and sub-task learning, as they comprise nearly 50 diverse crafting tasks. Crafts are designed to share similar features and sub-steps so that we can test whether the method learns these shared features and reuses existing knowledge to solve new but related tasks more efficiently. Our dataset is collected in a crafting-based environment and contains over 6,000 game traces on 14 unique crafting tasks, which serve as the training set. The other 35 crafting tasks act as zero-shot tasks. The goal is for an agent to learn one policy that can solve both the tasks it was trained on and a variety of unseen tasks that contain sub-tasks similar to those in the training tasks. To do this, we train a neural network system to generate natural language instructions as a high-level representation of the sub-task, and a policy to achieve the goal condition given these instructions. Figure 1 shows how our agent takes in the given state of the environment and a goal (Iron Ore), generates a language representation of the next instruction, and then uses the policy to select an action conditioned on the language representation, in this case to grab the key. We combine imitation learning (IL) on both the language and the human demonstrations with reinforcement learning (RL) rewards to train our agent to solve complicated multi-step tasks. Our approach, which learns from human demonstrations and language, outperforms or matches baseline methods in the standard RL setting.

Figure 1: From the state observation at time step t, the agent generates a natural language instruction, "go to key and press grab," which guides the agent to grab the key. After the instruction is fulfilled and the agent grabs the key, the agent generates a new instruction at t + 1.
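The hierarchical control loop described above can be sketched as follows. This is a hypothetical toy illustration, not the paper's implementation: the learned language generator and language-conditioned policy are replaced by hand-written stand-ins, and the environment is reduced to a dictionary. It shows only the control flow, in which the low-level policy acts until the current instruction is fulfilled, at which point the high-level generator produces the next instruction.

```python
def generate_instruction(state, goal):
    # Stand-in for the learned high-level language generator.
    if "key" not in state["inventory"]:
        return "go to key and press grab"
    return "craft " + goal

def policy(state, instruction):
    # Stand-in for the learned low-level policy conditioned on language.
    if instruction.startswith("go to key"):
        return "grab"
    return "craft"

def fulfilled(state, instruction):
    # Checks whether the current instruction's goal condition holds.
    if instruction.startswith("go to key"):
        return "key" in state["inventory"]
    return state["crafted"]

def env_step(state, action):
    # Toy environment transition.
    if action == "grab":
        state["inventory"].append("key")
    elif action == "craft":
        state["crafted"] = True
    return state

def run_episode(goal, max_steps=10):
    state = {"inventory": [], "crafted": False}
    trace = []
    instruction = generate_instruction(state, goal)
    for _ in range(max_steps):
        action = policy(state, instruction)
        state = env_step(state, action)
        trace.append((instruction, action))
        if fulfilled(state, instruction):
            if state["crafted"]:
                break  # final task complete
            # Sub-task done: ask the generator for the next instruction.
            instruction = generate_instruction(state, goal)
    return trace
```

Running `run_episode("Iron Ore")` yields a trace pairing each action with the instruction that produced it, which is also what makes the agent's behavior interpretable.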
We demonstrate that language enables better generalization to new tasks without reward signals, outperforming baselines on average over 35 zero-shot crafting tasks. Our method uses language as a high-level representation to decompose a larger, complex task into sub-tasks and to identify the correct sub-tasks to use in a zero-shot task setting. We also show that the agent can learn few-shot tasks with only a few additional demonstrations and instructions. Finally, training with human-generated instructions gives us an interpretable explanation of the agent's behavior in cases of both success and failure. Generalization is further demonstrated by the agent's ability to explain how a task is decomposed in both training and evaluation settings, in a way that reflects the actual recipes describing the crafting task. With our dataset collection procedure and language-conditioned method, we demonstrate that natural human language can be practically applied to solving difficult RL problems and to addressing the generalization problem in RL. We hope that this will inspire future work that incorporates human annotation, specifically language annotation, to solve more difficult and diverse tasks.

2. RELATED WORK

Previous work on language descriptions of tasks and sub-tasks has generally relied on what Andreas et al. (2017) call "sketches." A sketch specifies the necessary sub-tasks for a final task and is manually constructed for every task. The agent then relies on reward signals from the sketches to learn these predefined sub-tasks. In our setup, by contrast, we want to infer such "sketches" from a limited number of instructions given in human demonstrations. This setting is not only more difficult but also more realistic for practical applications of RL, where we might not have a predefined ontology and simulator, just a limited number of human-generated instructions. In addition, at test time their true zero-shot tasks require the sketch, whereas our method generates the "sketches" in the form of high-level language with no additional training or supervision.

Our dataset, environment, and code can be found at: https://github.com/valeriechen/ask-your-humans.

