LEARNED BELIEF SEARCH: EFFICIENTLY IMPROVING POLICIES IN PARTIALLY OBSERVABLE SETTINGS

Abstract

Search is an important tool for computing effective policies in single- and multi-agent environments, and has been crucial for achieving superhuman performance in several benchmark fully and partially observable games. However, one major limitation of prior search approaches for partially observable environments is that the computational cost scales poorly with the amount of hidden information. In this paper we present Learned Belief Search (LBS), a computationally efficient search procedure for partially observable environments. Rather than maintaining an exact belief distribution, LBS uses an approximate auto-regressive counterfactual belief that is learned as a supervised task. In multi-agent settings, LBS uses a novel public-private model architecture for the underlying policies in order to efficiently evaluate these policies during rollouts. In the benchmark domain of Hanabi, LBS obtains more than 60% of the benefit of exact search while reducing compute requirements by 35×, allowing it to scale to larger settings that were inaccessible to previous search methods.

1. INTRODUCTION

Search has been a vital component for achieving superhuman performance in a number of hard benchmark problems in AI, including Go (Silver et al., 2016; 2017; 2018), Chess (Campbell et al., 2002), Poker (Moravčík et al., 2017; Brown & Sandholm, 2017; 2019), and, more recently, self-play Hanabi (Lerer et al., 2020). Beyond achieving impressive results, the work on Hanabi and Poker provides some of the few examples of search being applied in large partially observable settings. In contrast, work on belief-space planning typically assumes a small belief space, since these methods scale poorly with the size of the belief space. Inspired by the recent success of the SPARTA search technique in Hanabi (Lerer et al., 2020), we propose Learned Belief Search (LBS), a simpler and more scalable approach for policy improvement in partially observable settings, applicable whenever a model of the environment and the policies of any other agents are available at test time. Like SPARTA, the key idea is to obtain Monte Carlo estimates of the expected return for every possible action in a given action-observation history (AOH) by sampling from a belief distribution over possible states of the environment. However, LBS addresses one of the key limitations of SPARTA: rather than requiring a belief space small enough to compute and sample from an exact belief distribution, LBS samples from a learned, auto-regressive belief model which is trained via supervised learning (SL). The autoregressive parameterization of the probabilities allows LBS to scale to high-dimensional state spaces whenever the state can be represented as a sequence of features. Another efficiency improvement over SPARTA is replacing full rollouts with partial rollouts that bootstrap from a value function after a specific number of steps. While in general this value function can be trained via SL in a separate training process, this is not necessary when the blueprint (BP) was trained via RL.
In these cases the RL training typically produces both a policy and an approximate value function (either for variance reduction or for value-based learning). In particular, it is common practice to train centralized value functions, which capture the required dependency on the sampled state even when this state cannot be observed by the agent at test time. While LBS is a very general search method for Partially Observable Markov Decision Processes (POMDPs), our application focuses on single-agent policy improvement in Decentralized POMDPs (Dec-POMDPs), in our specific case Hanabi. One additional challenge of single-agent policy improvement in Dec-POMDPs is that, unlike in standard POMDPs, the Markov state s of the environment is no longer sufficient for estimating the future expected return for a given AOH of the searching player. Instead, since the other players' policies also depend on their entire action-observation histories, e.g., via Recurrent Neural Networks (RNNs), only the union of the Markov state s and all AOHs is sufficient. This in general makes it challenging to apply LBS, since it would require sampling entire AOHs rather than states. However, in many Dec-POMDPs, including Hanabi, information can be split between the common-knowledge (CK) trajectory and private observations. Furthermore, information is commonly 'revealing', such that there is a mapping from the most recent private observation and the CK trajectory to the AOH for each given player. In these settings it is sufficient to track a belief over the union of private observations, rather than over trajectories, which was also exploited in SPARTA. We adapt LBS to this setting with a novel public-RNN architecture which makes replaying games from the beginning, as was done in SPARTA, unnecessary. When applied to the benchmark problem of two-player Hanabi self-play, LBS obtains around 60% of the performance improvement of exact search, while reducing compute requirements by up to 35×.
We also successfully apply LBS to a six-card version of Hanabi, where calculating the exact belief distribution would be prohibitively expensive.
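The core procedure can be sketched in a few dozen lines. This is a minimal illustration rather than the paper's implementation: the toy conditional model, the environment callables, and the hyper-parameters are all assumptions, but the two pieces shown are the ones the method rests on, namely sampling a state feature-by-feature from an autoregressive belief, and scoring each action with partial rollouts that bootstrap from a value function.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_belief(cond_logits, context, num_features, num_values):
    """Draw one hidden state from a learned autoregressive belief:
    p(s | AOH) = prod_i p(s_i | s_<i, AOH).

    `cond_logits` stands in for a trained network mapping the AOH
    encoding plus the already-sampled prefix to logits over the next
    feature's values (an illustrative assumption)."""
    state = []
    for _ in range(num_features):
        logits = cond_logits(context, state)
        probs = np.exp(logits - logits.max())  # softmax over feature values
        probs /= probs.sum()
        state.append(int(rng.choice(num_values, p=probs)))
    return state

def lbs_step(actions, draw_state, simulate, blueprint, value_fn,
             num_samples=100, rollout_len=5):
    """One LBS decision: Monte Carlo estimate of each action's expected
    return from sampled states, using partial rollouts under the
    blueprint policy that bootstrap from `value_fn` after
    `rollout_len` steps. `simulate(s, a) -> (s', r)` stands in for the
    environment model."""
    totals = {a: 0.0 for a in actions}
    for a in actions:
        for _ in range(num_samples):
            s = draw_state()               # s ~ learned belief
            s, ret = simulate(s, a)        # apply the candidate action
            for _ in range(rollout_len):   # then follow the blueprint
                s, r = simulate(s, blueprint(s))
                ret += r
            ret += value_fn(s)             # bootstrap the remaining return
            totals[a] += ret
    return max(totals, key=totals.get)
```

In the real setting `draw_state` would wrap `sample_belief` with the trained belief network conditioned on the searching player's AOH, and `value_fn` would be the centralized value function produced during blueprint training.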

2.1. BELIEF MODELING & PLANNING IN POMDPS

Deep RL on POMDPs typically circumvents explicit belief modeling by using a policy architecture such as an LSTM that can condition its action on its history, allowing it to implicitly operate on belief states (Hausknecht & Stone, 2015). 'Blueprint' policies used in this (and prior) work take that approach, but it does not permit search, since search requires explicitly sampling from beliefs in order to perform rollouts. There has been extensive prior work on learning and planning in POMDPs. Since solving for optimal policies in POMDPs is intractable for all but the smallest problems, most work focuses on approximate solutions, including offline methods to compute approximate policies as well as approximate search algorithms, although these are still typically restricted to small grid-world environments (Ross et al., 2008). One closely related approach is the Rollout algorithm (Bertsekas & Castanon, 1999), which, given an initial policy, computes Monte Carlo rollouts of the belief-space MDP assuming that this policy is played going forward, and plays the action with the highest expected value. In the POMDP setting, rollouts occur in the MDP induced by the belief states.[1] There has been some prior work on search in large POMDPs. Silver & Veness (2010) propose a method for performing Monte Carlo Tree Search in large POMDPs like Battleship and partially-observable PacMan. Instead of maintaining exact beliefs, they approximate beliefs using a particle filter with Monte Carlo updates. Roy et al. (2005) attempt to scale to large belief spaces by learning a compressed representation of the beliefs and then performing Bayesian updates in this space. Most recently, MuZero combines RL and MCTS with a learned implicit model of the environment (Schrittwieser et al., 2019).
Since recurrent models can implicitly operate on belief states in partially-observed environments, MuZero in effect performs search with implicit learned beliefs as well as a learned environment model.
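For contrast with the learned belief model used by LBS, the particle-filter belief update of Silver & Veness can be sketched as a simple rejection step; the environment callables and the fixed particle budget here are illustrative assumptions, not their exact algorithm.

```python
import random

def update_particles(particles, action, observation, step, obs_fn,
                     k=100, max_tries=10000):
    """Approximate belief update by particle filtering: resample states
    from the current particle set, advance each through the environment
    model, and keep those whose simulated observation matches the one
    actually received. `step(s, a) -> s'` and `obs_fn(s, a) -> o` stand
    in for the transition and observation models."""
    survivors, tries = [], 0
    while len(survivors) < k and tries < max_tries:
        tries += 1
        s = step(random.choice(particles), action)  # sample and advance
        if obs_fn(s, action) == observation:        # rejection step
            survivors.append(s)
    return survivors
```

The contrast with LBS is that here the belief is a finite set of sampled states maintained online, whereas LBS trains a generative model offline and can draw fresh samples for any AOH at search time.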

2.2. GAMES & HANABI

Search has been responsible for many breakthroughs on benchmark games. Most of these successes were achieved in fully observable games such as Backgammon (Tesauro, 1994), Chess (Campbell et al., 2002) and Go (Silver et al., 2016; 2017; 2018). More recently, belief-based search techniques



[1] SPARTA's single-agent search uses a similar strategy in the Dec-POMDP setting, but samples states from the beliefs rather than doing rollouts directly in belief space.

