MONTE-CARLO PLANNING AND LEARNING WITH LANGUAGE ACTION VALUE ESTIMATES

Abstract

Interactive Fiction (IF) games provide a useful testbed for language-based reinforcement learning agents, posing significant challenges of natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space. Agents using standard planning algorithms struggle to play IF games due to the massive search space of language actions. Thus, language-grounded planning is a key ability of such agents, since inferring the consequence of a language action based on semantic understanding can drastically improve search. In this paper, we introduce Monte-Carlo planning with Language Action Value Estimates (MC-LAVE), which combines Monte-Carlo tree search with language-driven exploration. MC-LAVE concentrates search effort on semantically promising language actions using locally optimistic language value estimates, yielding a significant reduction in the effective search space of language actions. We then present a reinforcement learning approach built on MC-LAVE, which alternates between MC-LAVE planning and supervised learning of the self-generated language actions. In the experiments, we demonstrate that our method achieves new high scores in various IF games.

1. INTRODUCTION

Building an intelligent goal-oriented agent that can perceive and react via natural language is one of the grand challenges of artificial intelligence. In pursuit of this goal, we consider Interactive Fiction (IF) games (Nelson, 2001; Montfort, 2005), which are text-based simulation environments where the agent interacts with the environment only through natural language. They serve as a useful testbed for developing language-based goal-oriented agents, posing important challenges such as natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space of language actions. IF games naturally have a large branching factor, with at least hundreds of natural language actions that can affect the simulation of game states. This renders naive exhaustive search infeasible and raises a strong need for language-grounded planning: the effective search space is too large to find an optimal action without inferring the future impact of language actions from an understanding of the environment state described in natural language. Still, standard planning methods such as Monte-Carlo tree search (MCTS) are language-agnostic and rely only on uncertainty-driven exploration, which encourages more search on less-visited states and actions. This simple uncertainty-based strategy is not sufficient to find an optimal language action under limited search time, especially when each language action is treated as a discrete token. On the other hand, recent reinforcement learning agents for IF games have started to leverage pre-trained word embeddings for language understanding (He et al., 2016; Fulda et al., 2017; Hausknecht et al., 2020) or knowledge graphs for commonsense reasoning (Ammanabrolu & Hausknecht, 2020), but their exploration strategies are still limited to ε-greedy or softmax policies, lacking more structured and non-myopic planning ability.
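The uncertainty-driven exploration mentioned above can be made concrete with the standard UCB1 selection rule used in MCTS. The following is an illustrative Python sketch; the `Node` container and its per-action statistics are hypothetical, not part of any specific framework:

```python
import math

class Node:
    """Minimal search-node container for this sketch (hypothetical):
    per-action stats are dicts with a visit count 'n' and mean value 'q'."""
    def __init__(self, visits, children):
        self.visits = visits        # total visits to this node
        self.children = children    # {action_text: {"n": int, "q": float}}

def ucb1_select(node, c=1.4):
    """Language-agnostic MCTS selection: pick the action maximizing mean
    value plus an uncertainty bonus that depends only on visit counts,
    never on what the action text means."""
    best_action, best_score = None, -math.inf
    for action, stats in node.children.items():
        if stats["n"] == 0:
            return action  # unvisited actions are expanded first
        score = stats["q"] + c * math.sqrt(math.log(node.visits) / stats["n"])
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

Note that a rarely tried action such as `"open trap door"` can outrank a better-scoring but frequently visited one purely through its count-based bonus; the text of the action plays no role, which is exactly the limitation discussed above.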
As a consequence, current state-of-the-art agents for IF games have not yet reached human-level play. In this paper, we introduce Monte-Carlo planning with Language Action Value Estimates (MC-LAVE), a planning algorithm for environments with text-based interactions. MC-LAVE combines Monte-Carlo tree search with language-driven exploration, addressing the search inefficiency attributed to the lack of language understanding. It starts with credit assignment to language actions via Q-learning on the experiences collected from past searches. Then, MC-LAVE assigns non-uniform search priorities to each language action based on the optimistically aggregated Q-estimates of past actions that share similar meanings with the candidate action, so as to focus more on semantically promising actions. This is in contrast to previous methods that involve language understanding in the form of a knowledge graph, where insignificant language actions are uniformly filtered out by a graph mask (Ammanabrolu & Hausknecht, 2020; Ammanabrolu et al., 2020). We show that the non-uniform search empowered by language understanding in MC-LAVE yields better search efficiency without hurting the asymptotic guarantee of MCTS. We then present our reinforcement learning approach that uses MC-LAVE as a strong policy improvement operator. Since MCTS explores the combinatorial space of action sequences, its search results can be far better than simple greedy improvement, as demonstrated in the game of Go (Silver et al., 2017). The final algorithm, MC-LAVE-RL, alternates between planning via MC-LAVE and supervised learning of self-generated language actions. Experimental results demonstrate that MC-LAVE-RL achieves new high scores in various IF games provided by the Jericho framework (Hausknecht et al., 2020), showing the effectiveness of language-grounded MC-LAVE planning.
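To give intuition for how an optimistic language value estimate could enter the search, the sketch below aggregates Q-values of semantically similar past actions (by embedding cosine similarity) and adds the result as a visit-decayed bonus to a UCB-style selection score. This is an illustrative approximation under assumed forms; the nearest-neighbour aggregation, the constants, and the exact bonus shape are assumptions, not the paper's formula:

```python
import math

def language_value_estimate(action_vec, memory, k=3):
    """Optimistic aggregation sketch: take the max Q among the k nearest
    past actions by cosine similarity. `memory` is a hypothetical list of
    (embedding, q_value) pairs collected from previous searches."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    nearest = sorted(memory, key=lambda m: cos(action_vec, m[0]), reverse=True)[:k]
    return max(q for _, q in nearest)

def selection_score(q, n_parent, n_action, lave, c_ucb=1.0, c_lang=1.0):
    """Assumed selection score: Q-estimate + count-based UCB bonus + a
    bonus proportional to the language action value estimate that decays
    with visits, steering search toward semantically promising actions."""
    ucb = c_ucb * math.sqrt(math.log(n_parent + 1) / (n_action + 1))
    return q + ucb + c_lang * lave / (n_action + 1)
```

Under this score, an unvisited action whose neighbours in embedding space earned high Q-values receives a large initial priority, whereas a purely count-based rule would treat all unvisited actions identically.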

2.1. INTERACTIVE FICTION GAME

Interactive Fiction (IF) games are fully text-based environments where the observation and action spaces are defined in natural language. The game-playing agent observes textual descriptions of the world, selects a language-based action, and receives the associated reward. IF games can be modeled as a special case of partially observable Markov decision processes (POMDPs) defined by the tuple ⟨S, A, Ω, T, O, R, γ⟩, where S is the set of environment states s, A is the set of language actions a, Ω is the set of text observations o, T(s'|s, a) = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the transition function, R(s, a) is the reward function for taking action a in state s, O(s) = o is the deterministic observation function in state s, and γ ∈ (0, 1) is the discount factor. The history at time step t, h_t = {o_0, a_0, ..., o_{t-1}, a_{t-1}, o_t}, is a sequence of observations and actions. The goal is to find an optimal policy π* that maximizes the expected cumulative reward, i.e., π* = argmax_π E_π[Σ_{t=0}^∞ γ^t R(s_t, a_t)]. We use the same definition of observation and action spaces as Hausknecht et al. (2020), Ammanabrolu & Hausknecht (2020), and Côté et al. (2018): an observation is defined by o_t = (o_t^desc, o_t^game, o_t^inv, a_{t-1}), where o_t^desc is the textual description of the agent's current location, o_t^game is the simulator response to the previous action taken by the agent, o_t^inv is the information of the agent's inventory, and a_{t-1} is the previous action taken by the agent. An action is denoted by a sequence of words a_t = (a_t^1, a_t^2, ..., a_t^{|a_t|}). Finally, we denote A_valid(o_t) ⊆ A as the set of valid actions for the observation o_t, which is provided by the Jericho environment interface. Figure 1 shows an example of an observation and action in ZORK1, one of the representative IF games.
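As a minimal illustration of the formulation above, the snippet below packages the four observation components and computes the discounted return that the optimal policy π* maximizes. The `Observation` field names are illustrative containers for this sketch, not Jericho's actual API:

```python
from typing import NamedTuple

class Observation(NamedTuple):
    """o_t = (o_t^desc, o_t^game, o_t^inv, a_{t-1}) as defined above.
    Field names are illustrative, not part of the Jericho interface."""
    desc: str    # textual description of the agent's current location
    game: str    # simulator response to the previous action
    inv: str     # agent's inventory text
    a_prev: str  # previous action taken by the agent

def discounted_return(rewards, gamma=0.95):
    """Return sum_t gamma^t * r_t, the quantity pi* maximizes in
    expectation over trajectories."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, a trajectory with rewards [1, 1] under γ = 0.5 yields a return of 1 + 0.5 = 1.5, showing how the discount factor down-weights delayed rewards.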

2.2. CHALLENGES IN INTERACTIVE FICTION GAME

IF games pose important challenges for reinforcement learning agents, requiring natural language understanding, commonsense reasoning, and non-myopic language-grounded planning in the combinatorial search space of language actions (Hausknecht et al., 2020). More concretely, consider the particular game state of ZORK1 described in Figure 1. In this situation, a human player would naturally perform strategic planning via language understanding and commonsense reasoning: (1) the 'closed trap door' will have to be opened and explored to proceed with the game; (2) however, acquiring the 'lantern' should precede entering the trap door, since the cellar expected to exist below the trap door is likely to be pitch-dark. Without such language-grounded reasoning and planning, the agent may need to try every action uniformly, most of which leave it vulnerable to being eaten by a monster in the cellar, who always appears when there is no light source. As a result, any agent that lacks the ability of long-term planning with language reasoning is prone to getting stuck in a suboptimal policy: it enters the cellar to obtain an immediate reward, then does nothing further to avoid encountering the monster that would kill it immediately.

