HOW TO AVOID BEING EATEN BY A GRUE: STRUCTURED EXPLORATION STRATEGIES FOR TEXTUAL WORLDS

Abstract

Text-based games are long puzzles or quests, characterized by a sequence of sparse and potentially deceptive rewards. They provide an ideal platform to develop agents that perceive and act upon the world using a combinatorially sized natural language state-action space. Standard Reinforcement Learning agents are poorly equipped to effectively explore such spaces and often struggle to overcome bottlenecks: states that agents are unable to pass through simply because they do not see the right action sequence enough times for it to be sufficiently reinforced. We introduce Q*BERT, an agent that learns to build a knowledge graph of the world by answering questions, which leads to greater sample efficiency. To overcome bottlenecks, we further introduce MC!Q*BERT, an agent that uses a knowledge-graph-based intrinsic motivation to detect bottlenecks and a novel exploration strategy to efficiently learn a chain of policy modules to overcome them. We present an ablation study and results demonstrating how our method outperforms the current state-of-the-art on nine text games, including the popular game Zork, where, for the first time, a learning agent gets past the bottleneck where the player is eaten by a Grue.

1. INTRODUCTION

Text-adventure games such as Zork1 (Anderson et al., 1979) (Fig. 1) are simulations featuring language-based state and action spaces. Prior game-playing works have focused on a few challenges that are inherent to this medium: (1) Partial observability: the agent must reason about the world solely through incomplete textual descriptions (Narasimhan et al., 2015; Côté et al., 2018; Ammanabrolu & Riedl, 2019b). (2) Commonsense reasoning: the agent must interact intelligently with objects in its surroundings (Fulda et al., 2017; Yin & May, 2019; Adolphs & Hofmann, 2019; Ammanabrolu & Riedl, 2019a). (3) A combinatorial state-action space: most games have action spaces exceeding a billion possible actions per step; for example, the game Zork1 has 1.64 × 10^14 possible actions at every step (Hausknecht et al., 2020; Ammanabrolu & Hausknecht, 2020). Despite these challenges, modern text-adventure agents such as KG-A2C (Ammanabrolu & Hausknecht, 2020), TDQN (Hausknecht et al., 2020), and DRRN (He et al., 2016) have relied on surprisingly simple exploration strategies such as ε-greedy or sampling from the distribution of possible actions.

Most text-adventure games have relatively linear plots in which players must solve a sequence of puzzles to advance the story and gain score. To solve these puzzles, players have the freedom to explore both new areas and previously unlocked areas of the game, collect clues, and acquire the tools needed to solve the next puzzle and unlock the next portion of the game. From a Reinforcement Learning perspective, these puzzles can be viewed as bottlenecks that act as partitions between different regions of the state space. We contend that existing Reinforcement Learning agents are unaware of such latent structure and are thus poorly equipped for solving these types of problems. In this paper we introduce two new agents: Q*BERT and MC!Q*BERT, both designed with this latent structure in mind.
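To make the scale of challenge (3) concrete, the 1.64 × 10^14 figure can be reproduced with a back-of-the-envelope calculation, assuming the counting convention that any sequence of up to five words from Zork1's 697-word parser vocabulary is a candidate action (the constants here are illustrative of that convention, not part of our method):

```python
# Naive size of Zork1's action space: treat every five-word sequence
# over the parser's vocabulary as a candidate action.
VOCAB_SIZE = 697     # words recognized by Zork1's parser
MAX_ACTION_LEN = 5   # assumed maximum action length in words

naive_actions = VOCAB_SIZE ** MAX_ACTION_LEN
print(f"{naive_actions:.2e}")  # -> 1.64e+14 candidate actions per step
```

Only a vanishingly small fraction of these sequences are valid actions in any given state, which is precisely why undirected exploration strategies such as ε-greedy struggle.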
The first agent, Q*BERT, improves on existing text-game agents that use knowledge-graph-based state representations by framing knowledge graph construction during exploration as a question-answering task. To train Q*BERT's knowledge graph extractor, we introduce the Jericho-QA dataset for question answering in text games. We show that it leads to improved knowledge graph accuracy and sample efficiency compared to prior methods for constructing knowledge graphs in text games (Ammanabrolu & Riedl, 2019b).

Observation: West of House. You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here.

Action: Open mailbox

Observation: Opening the small mailbox reveals a leaflet.

Action: Read leaflet

Observation: (Taken) "WELCOME TO ZORK! ZORK is a game of adventure, danger, and low cunning. In it you will explore some of the most amazing territory ever seen by mortals. No computer should be without one!"

Action: Go north

Observation: North of House. You are facing the north side of a white house. There is no door here, and all the windows are boarded up. To the north a narrow path winds through the trees.

Figure 1: Excerpt from Zork1.

However, improved knowledge graph accuracy alone is not enough to overcome bottlenecks; it does not improve asymptotic performance. To this end, MC!Q*BERT (Modular policy Chaining! Q*BERT) extends Q*BERT by combining two innovations: (1) an intrinsic motivation based on the expansion of its knowledge graph, which serves both to encourage exploration and as a means for the agent to self-detect when it is stuck; and (2) a structured exploration algorithm that, when stuck at a bottleneck, backtracks through the sequence of states leading to that bottleneck in search of alternative solutions. As MC!Q*BERT overcomes bottlenecks, it constructs a modular policy that chains together the solutions to multiple bottlenecks. Like Go-Explore (Ecoffet et al., 2019), MC!Q*BERT relies on the determinism present in many text games to reliably revisit previous states. However, we show that MC!Q*BERT's ability to detect bottlenecks via the knowledge graph state representation enables it to outperform such alternative exploration strategies on nine different games. Our contributions are as follows: 1) We develop an improved knowledge-graph extraction procedure based on question answering and introduce the open-source Jericho-QA training dataset. 2) We show that an intrinsic motivation reward based on knowledge graph expansion is capable of reliably identifying bottleneck states.
3) Finally, we show that structured exploration in the form of backtracking can be used to overcome these bottleneck states and reach state-of-the-art levels of performance on the Jericho benchmark (Hausknecht et al., 2020) .
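To illustrate contribution 2, the following is a minimal sketch of a knowledge-graph-expansion intrinsic reward combined with patience-based stuck detection. The class name, the `patience` threshold, and the per-triple bonus are all illustrative assumptions, not the paper's exact formulation:

```python
# Sketch: reward the agent whenever its knowledge graph gains triples it has
# never produced before, and flag a potential bottleneck when neither the
# game score nor the knowledge graph has grown for `patience` steps.
class KGExplorationMonitor:
    def __init__(self, patience=50, kg_bonus=1.0):
        self.patience = patience          # steps without progress before flagging
        self.kg_bonus = kg_bonus          # intrinsic reward per new triple
        self.seen_triples = set()         # all (subject, relation, object) triples seen
        self.steps_since_progress = 0
        self.best_score = float("-inf")

    def step(self, kg_triples, game_score):
        """Return (intrinsic_reward, is_stuck) for the current step."""
        new_triples = set(kg_triples) - self.seen_triples
        self.seen_triples |= new_triples
        intrinsic = self.kg_bonus * len(new_triples)
        if new_triples or game_score > self.best_score:
            self.best_score = max(self.best_score, game_score)
            self.steps_since_progress = 0   # progress: KG grew or score improved
        else:
            self.steps_since_progress += 1
        return intrinsic, self.steps_since_progress >= self.patience
```

When `is_stuck` fires, a modular-chaining agent in the spirit of MC!Q*BERT would freeze the policy module that reached the current state and backtrack along the trajectory to search for an alternative continuation.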

2. UNDERSTANDING BOTTLENECK STATES

Overcoming bottlenecks is not as simple as selecting the correct action from the bottleneck state. Most bottlenecks have long-range dependencies that must first be satisfied: Zork1, for instance, features a bottleneck in which the agent must pass through the unlit Cellar, where a monster known as a Grue lurks, ready to eat unsuspecting players who enter without a light source. To pass this bottleneck the player must have previously acquired and lit the lantern. Other bottlenecks don't rely on inventory items and instead require the player to have satisfied an external condition, such as visiting the reservoir control to drain water from a submerged room before being able to visit it. In both cases, the actions that fulfill the dependencies of the bottleneck, e.g. acquiring the lantern or draining the room, are not rewarded by the game. Thus agents must correctly satisfy all latent dependencies, most of which are unrewarded, then take the right action from the correct location to overcome such bottlenecks. Consequently, most existing agents-regardless of whether they use a reduced action space (Zahavy et al., 2018; Yin & May, 2019) or the full space (Hausknecht et al., 2020; Ammanabrolu & Hausknecht, 2020)-have failed to consistently clear these bottlenecks.

To better understand how to design algorithms that pass these bottlenecks, we first need to gain a sense for what they are. We observe that quests in text games can be modeled in the form of a dependency graph. These dependency graphs are directed acyclic graphs (DAGs) where the vertices indicate either rewards that can be collected or dependencies that must be met to progress; they are generally unknown to a player a priori. In text-adventure games the dependencies are of two types: items that must be collected for future use, and locations that must be visited. An example of such a graph for the game of Zork1 can be found in Fig. 2.
More formally, bottleneck states are vertices in the dependency graph that, when the graph is laid out topologically in levels, are (a) the only vertex on their level, and (b) followed by another vertex at a higher level with non-zero reward. Bottlenecks can be expressed mathematically as follows: let D = ⟨V, E⟩ be the directed acyclic dependency graph for a particular game
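This definition can be made concrete with a short sketch. The dependency graph below is a toy Zork1-like fragment (not the paper's actual Fig. 2), and we assume a vertex's level is the length of the longest dependency chain leading to it:

```python
# Sketch of the bottleneck definition: lay the dependency DAG out in
# topological levels, then mark a vertex as a bottleneck if it is the only
# vertex on its level and some strictly higher level carries reward.
from functools import lru_cache

# Toy dependency fragment: an edge u -> v means u must be satisfied before v.
edges = {
    "start":    ["lantern", "sword"],
    "lantern":  ["cellar"],
    "sword":    ["cellar"],
    "cellar":   ["painting"],
    "painting": [],
}
reward = {"start": 0, "lantern": 0, "sword": 0, "cellar": 25, "painting": 4}

parents = {v: [] for v in edges}
for u, children in edges.items():
    for v in children:
        parents[v].append(u)

@lru_cache(maxsize=None)
def level(v):
    # Level = length of the longest dependency chain leading to v.
    return 0 if not parents[v] else 1 + max(level(u) for u in parents[v])

levels = {}
for v in edges:
    levels.setdefault(level(v), []).append(v)

def is_bottleneck(v):
    alone = len(levels[level(v)]) == 1                        # condition (a)
    reward_above = any(reward[w] > 0 for w in edges
                       if level(w) > level(v))                # condition (b)
    return alone and reward_above

print([v for v in edges if is_bottleneck(v)])
# -> ['start', 'cellar'] (the root trivially satisfies the definition too)
```

Here the Cellar is flagged because it alone gates the rewarded Painting above it, whereas the lantern and sword share a level and are therefore not bottlenecks under condition (a).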

