LEARNED BELIEF SEARCH: EFFICIENTLY IMPROVING POLICIES IN PARTIALLY OBSERVABLE SETTINGS

Abstract

Search is an important tool for computing effective policies in single- and multi-agent environments, and has been crucial for achieving superhuman performance in several benchmark games, both fully and partially observable. However, one major limitation of prior search approaches for partially observable environments is that the computational cost scales poorly with the amount of hidden information. In this paper we present Learned Belief Search (LBS), a computationally efficient search procedure for partially observable environments. Rather than maintaining an exact belief distribution, LBS uses an approximate auto-regressive counterfactual belief that is learned as a supervised task. In multi-agent settings, LBS uses a novel public-private model architecture for the underlying policies so that these policies can be evaluated efficiently during rollouts. In the benchmark domain of Hanabi, LBS obtains more than 60% of the benefit of exact search while reducing compute requirements by 35×, allowing it to scale to larger settings that were inaccessible to previous search methods.

1. INTRODUCTION

Search has been a vital component for achieving superhuman performance in a number of hard benchmark problems in AI, including Go (Silver et al., 2016; 2017; 2018), Chess (Campbell et al., 2002), Poker (Moravčík et al., 2017; Brown & Sandholm, 2017; 2019) and, more recently, self-play Hanabi (Lerer et al., 2020). Beyond achieving impressive results, the work on Hanabi and Poker is among the few examples of search being applied in large partially observable settings. In contrast, work on belief-space planning typically assumes a small belief space, since these methods scale poorly with the size of the belief space.

Inspired by the recent success of the SPARTA search technique in Hanabi (Lerer et al., 2020), we propose Learned Belief Search (LBS), a simpler and more scalable approach for policy improvement in partially observable settings, applicable whenever a model of the environment and the policies of any other agents are available at test time. Like SPARTA, the key idea is to obtain Monte Carlo estimates of the expected return of every possible action in a given action-observation history (AOH) by sampling from a belief distribution over possible states of the environment. However, LBS addresses one of SPARTA's key limitations: rather than requiring a belief space small enough that an exact belief distribution can be computed and sampled from, LBS samples from a learned, auto-regressive belief model trained via supervised learning (SL). The auto-regressive parameterization of the probabilities allows LBS to scale to high-dimensional state spaces whenever the state is composed as a sequence of features. A further efficiency improvement over SPARTA is replacing full rollouts with partial rollouts that bootstrap from a value function after a fixed number of steps. While in general this value function can be trained via SL in a separate training process, this is not necessary when the blueprint (BP) was trained via RL.
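The two ingredients described above can be sketched in code: an auto-regressive belief that samples a hidden state one feature at a time, and a Monte Carlo search that scores each candidate action with partial blueprint rollouts bootstrapped by a value function. This is a minimal toy sketch, not the paper's implementation; the function names and the `(reward, next_state, done)` environment interface are hypothetical, and the learned conditionals are stood in for by a simple probability table.

```python
import random
import statistics

# Hypothetical sketch of LBS. An auto-regressive belief factorizes
#   P(s | AOH) = prod_i P(f_i | f_1..f_{i-1}, AOH)
# over the features of the hidden state s, so a state can be sampled
# one feature at a time from the learned conditionals.

def sample_state(conditional, aoh, n_features, rng):
    """Sample a hidden state feature-by-feature from the learned belief.
    `conditional(aoh, prefix, i)` returns a dict {value: prob}, standing
    in for a learned network head modelling P(f_i | f_1..f_{i-1}, AOH)."""
    features = []
    for i in range(n_features):
        probs = conditional(aoh, tuple(features), i)
        values, weights = zip(*probs.items())
        features.append(rng.choices(values, weights=weights)[0])
    return tuple(features)

def lbs_action(aoh, legal_actions, sample_belief, step, blueprint, value_fn,
               num_samples=32, horizon=5):
    """Monte Carlo search at one AOH: for each legal action, sample states
    from the belief, roll the blueprint forward `horizon` steps, then
    bootstrap the remaining return from the value function rather than
    playing the episode to the end (as a full SPARTA-style rollout would)."""
    q = {}
    for a in legal_actions:
        returns = []
        for _ in range(num_samples):
            s = sample_belief(aoh)
            ret, s, done = step(s, a)               # try the candidate action
            for _ in range(horizon):
                if done:
                    break
                r, s, done = step(s, blueprint(s))  # blueprint rollout
                ret += r
            if not done:
                ret += value_fn(s)                  # bootstrapped tail value
            returns.append(ret)
        q[a] = statistics.mean(returns)
    return max(q, key=q.get)
```

The argmax over these sampled Q-estimates is the LBS action; the original blueprint policy serves both as the rollout policy and as the fallback the search improves upon.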
In these cases, RL training typically produces both a policy and an approximate value function (either for variance reduction or for value-based learning). In particular, it is common practice to train centralized value functions, which capture the required dependency on the sampled state even when this state cannot be observed by the agent at test time. While LBS is a very general search method for Partially Observable Markov Decision Processes (POMDPs), our application focuses on single-agent policy improvement in Decentralized

