PLAYING NONDETERMINISTIC GAMES THROUGH PLANNING WITH A LEARNED MODEL

Anonymous authors
Paper under double-blind review

Abstract

The MuZero algorithm is known for achieving high-level performance on traditional zero-sum, two-player games of perfect information such as chess, Go, and shogi, as well as visual, non-zero-sum, single-player environments such as the Atari suite. Despite lacking a perfect simulator and relying instead on a learned model of the environment dynamics, MuZero produces game-playing agents comparable to its predecessor, AlphaZero. However, the current implementation of MuZero is restricted to deterministic environments. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero to nondeterministic, two-player, zero-sum games of perfect information. Borrowing from Nondeterministic Monte Carlo Tree Search and the theory of extensive-form games, NDMZ formalizes chance as a player in the game and incorporates the chance player into the MuZero network architecture and tree search. Experiments show that NDMZ is capable of learning effective strategies and an accurate model of the game.

1. INTRODUCTION

While the AlphaZero algorithm achieved superhuman performance in a variety of challenging domains, it relies upon a perfect simulation of the environment dynamics to perform precision planning. MuZero, the newest member of the AlphaZero family, combines the advantages of planning with a learned model of its environment, allowing it to tackle problems such as the Atari suite without the advantage of a simulator. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero to stochastic, two-player, zero-sum games of perfect information. We formalize the element of chance as a player in the game, learn a policy for the chance player via interaction with the environment, and augment the tree search to allow for chance actions.

As with MuZero, NDMZ is trained end-to-end in terms of policy and value. However, NDMZ aims to learn two additional quantities: the player identity policy and the chance player policy. Under the assumption of perfect information, we modify the MuZero architecture so that agents can determine whose turn it is to move and when a chance action should occur. The agent learns the fixed distribution of chance actions for any given state, allowing it to model the effects of chance in the larger context of environmental dynamics. Finally, we introduce new node classes to the Monte Carlo Tree Search, allowing the search to accommodate chance in a flexible manner. Our experiments using Nannon, a simplified variant of backgammon, show that NDMZ approaches AlphaZero in terms of performance and attains a high degree of dynamics accuracy.
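To make the node-class idea concrete, the following is a minimal sketch of how a search might treat decision nodes and chance nodes differently: decision nodes select an action by a PUCT-style score over network priors, while chance nodes sample an outcome from a learned chance-player policy. The class names, fields, and the exact selection rule here are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

# Hypothetical node classes; NDMZ's actual classes and fields may differ.
class DecisionNode:
    def __init__(self, prior):
        self.prior = prior        # network policy prior for reaching this child
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}        # action -> child node

class ChanceNode:
    def __init__(self, outcome_probs):
        self.outcome_probs = outcome_probs  # learned chance policy: outcome -> prob
        self.children = {}                  # outcome -> child node

def select_child(node, c_puct=1.25):
    """At a decision node, pick the action maximizing a PUCT-style score;
    at a chance node, sample an outcome from the chance policy."""
    if isinstance(node, ChanceNode):
        outcomes, probs = zip(*node.outcome_probs.items())
        o = random.choices(outcomes, weights=probs)[0]
        return o, node.children[o]
    total = sum(ch.visits for ch in node.children.values())
    def score(item):
        _, ch = item
        q = ch.value_sum / ch.visits if ch.visits else 0.0
        u = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
        return q + u
    return max(node.children.items(), key=score)
```

Interleaving the two node types in this way lets a single selection routine descend a tree in which the mover alternates between the two players and the chance player, as in the extensive-form view of the game.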

2.1. EARLY TREE SEARCH

The classic search algorithm for perfect information games with chance events is expectiminimax, which augments the well-known minimax algorithm with the addition of chance nodes (Michie, 1966; Maschler et al., 2013). Rather than taking the max or min over its children, a chance node assumes the expected value of the random event taking place, computing a weighted average of its children based on the probability of reaching each child. Chance nodes, min nodes, and max nodes are interleaved depending on the game being played. The *-minimax algorithm extends the alpha-beta tree pruning strategy to games with chance nodes (Ballard, 1983; Knuth & Moore, 1975).

