

Abstract

We present a general framework for evolutionary learning of emergent unbiased state representations without any supervision. Evolutionary frameworks such as self-play converge to bad local optima in multi-agent reinforcement learning within non-cooperative partially observable environments with communication, owing to information asymmetry. Our proposed framework is a simple modification of self-play inspired by mechanism design, also known as reverse game theory, to elicit truthful signals and make the agents cooperative. The key idea is to add imaginary rewards using the peer prediction method, i.e., a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment. Numerical experiments with predator-prey, traffic-junction, and StarCraft tasks demonstrate the state-of-the-art performance of our framework.

1. INTRODUCTION

Evolving culture prevents deep neural networks from falling into bad local optima (Bengio, 2012). For example, self-play (Samuel, 1967; Tesauro, 1995) has not only demonstrated the ability to abstract high-dimensional state spaces, as typified by AlphaGo (Silver et al., 2017), but has also improved exploration coverage in partially observable environments with communication (Sukhbaatar et al., 2016; Singh et al., 2019), where agents exchange internal representations such as explored observations and the hidden states of RNNs. Evolutionary learning is expected to be a general framework for creating superhuman AIs, as it can generate high-level abstract representations without any bias from supervision. However, when evolutionary learning is applied to a partially observable environment with non-cooperative agents, an improper bias is injected into the state representation. This bias originates from the environment: a partially observable environment with non-cooperative agents induces behavior that prevents an agent from honestly sharing its correct internal state, so that at equilibrium agents conceal information and deceive other agents (Singh et al., 2019). The problem arises because an agent cannot fully observe the state of the environment and thus lacks sufficient knowledge to verify the information provided by other agents. Furthermore, neural networks are vulnerable to adversarial examples (Szegedy et al., 2014) and are likely to exhibit erroneous behavior under small perturbations. Many discriminative models for information accuracy are available, including GANs (Goodfellow et al., 2014; Radford et al., 2016) and curriculum learning (Lowe et al., 2020). However, these models assume that accurate samples can be obtained by supervision, making it impossible to apply them to a partially observable environment, where the distribution is not stable.
We generalize self-play to non-cooperative partially observable environments via mechanism design (Myerson, 1983; Miller et al., 2005), also known as reverse game theory. The key idea is to add imaginary rewards by using the peer prediction method (Miller et al., 2005), that is, a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment, computed from the social influence of the signals. We formulate the non-cooperative partially observable environment as an extension of partially observable stochastic games (POSG) (Hansen et al., 2004) and introduce truthfulness (Vickrey, 1961), an indicator of the validity of the state representation. We show that the imaginary reward enables us to reflect the bias of the state representation in the gradient without oracles.

As our first contribution, we propose truthful self-play (TSP) and analytically demonstrate its convergence to the global optimum (Section 4). We construct the imaginary reward on the basis of the peer prediction method (Miller et al., 2005) and apply it to self-play. The mechanism affects the gradient at local optima but not at the global optimum. The trick is to use the actions taken by the agents as feedback to verify the signals received from every other agent, instead of the true state, input, and intent, which the agents cannot fully observe. TSP requires only a modification of the baseline function of self-play, yet it drastically improves convergence to the global optimum in Comm-POSG.

As our second contribution, based on the results of numerical experiments, we report that TSP achieves state-of-the-art performance on various multi-agent tasks comprising up to 20 agents (Section 5). Using the predator-prey (Barrett et al., 2011), traffic-junction (Sukhbaatar et al., 2016; Singh et al., 2019), and StarCraft (Synnaeve et al., 2016) environments, which are typically used in Comm-POSG research, we compare the performance of TSP with that of current neural networks, including the state-of-the-art methods LSTM, CommNet (Sukhbaatar et al., 2016), and IC3Net (Singh et al., 2019). We report that the model with IC3Net optimized by TSP performs best. This work is the first attempt to apply mechanism design to evolutionary learning. TSP is a general optimization algorithm whose convergence is theoretically guaranteed for arbitrary policies and environments. Furthermore, since no supervision is required, TSP has a wide range of applications, not only to game AIs (Silver et al., 2017) but also to robots (Jaderberg et al., 2018), chatbots (Gupta et al., 2019; Chevalier et al., 2019), and autonomous cars (Tang, 2019) employed in multi-agent tasks.

Notation: Vectors are columns. Let n := {1, . . . , n}. R is the set of real numbers, and i is the imaginary unit. Re u and Im u are the real and imaginary parts of the complex number u, respectively. n-tuples are written in boldface of the original variables, a := (a_1, . . . , a_n), and a_{-i} is the (n-1)-tuple obtained by removing the i-th entry from a. Let 1 := (1, . . . , 1)^T. Matrices are written in uppercase letters, L := (ℓ_{ij}), and E is the unit matrix. The set of probability distributions with support X is written P(X).
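The core idea of scoring a received signal by the actions it helps predict can be illustrated with a minimal sketch. The names (`log_score`, `imaginary_rewards`) and the prior/posterior interface below are our own assumptions for illustration, not the paper's exact formulation: each sender is scored with a strictly proper scoring rule on how much its signal improves the prediction of every other agent's realized action, which is the only feedback observable without access to the true state.

```python
import math

def log_score(p, outcome):
    # Logarithmic scoring rule (strictly proper): the expected score is
    # maximized only by reporting the true distribution over outcomes.
    return math.log(p[outcome] + 1e-12)

def imaginary_rewards(priors, posteriors, actions):
    """Peer-prediction-style imaginary reward for each sender i (a sketch).

    priors[j]        : agent j's action distribution before hearing i's signal
    posteriors[i][j] : agent j's action distribution after hearing i's signal
    actions[j]       : the action agent j actually took (observable feedback)

    Sender i is rewarded for how much its signal improves the prediction
    of every other agent's realized action, relative to the prior.
    """
    n = len(priors)
    r = [0.0] * n
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            r[i] += (log_score(posteriors[i][j], actions[j])
                     - log_score(priors[j], actions[j]))
    return r

# A truthful signal sharpens the receiver's prediction (positive score);
# an uninformative signal leaves it unchanged (zero score).
priors = [[0.5, 0.5], [0.5, 0.5]]
posteriors = [
    [None, [0.9, 0.1]],   # agent 0's signal makes agent 1's action predictable
    [[0.5, 0.5], None],   # agent 1's signal carries no information
]
actions = [0, 0]
r = imaginary_rewards(priors, posteriors, actions)
```

In an actual training loop, such a score would be added to the self-play baseline so that uninformative or deceptive signals contribute no gradient advantage at the global optimum.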

2. RELATED WORK

Neural communication has gained attention in the field of multi-agent reinforcement learning (MARL) for both discrete (Foerster et al., 2016) and continuous (Sukhbaatar et al., 2016; Singh et al., 2019) signals. Such networks are trained via self-play to exchange the internal state of the environment, stored in the working memory of recurrent neural networks (RNNs), in order to learn the right policy in partially observable environments.

The term self-play was coined by the game-AI community in the latter half of the 20th century. Samuel (1967) introduced self-play as a framework in which two opposing agents share a state-action value to efficiently search the state space in checkers. TD-Gammon (Tesauro, 1995) used self-play as a framework for learning TD(λ) (Sutton & Barto, 1998) and achieved professional-grade play in backgammon. AlphaGo (Silver et al., 2017) defeated the Go champion by combining supervised learning on professional game records with self-play. AlphaZero (Silver et al., 2018) successfully improved beyond its own performance entirely through self-play. All these studies attribute the advantage of self-play to eliminating the bias of human knowledge in supervision.

Self-play is also known as evolutionary learning (Bengio, 2012) in the deep learning community, mainly as an approach to the emergence of representations without supervision (Bansal et al., 2018; Balduzzi et al., 2019). Bansal et al. (2018) show that competitive environments contribute to the emergence of diversity and complexity. Rich generative models such as GANs (Goodfellow et al., 2014; Radford et al., 2016) acquire an environmental model through a competitive setting. RNNs such as world models (Ha & Schmidhuber, 2018; Eslami et al., 2018) enable broader exploration in partially observable environments and the generation of symbols and languages (Bengio, 2017; Gupta et al., 2019; Chevalier et al., 2019).
The difference between evolutionary learning and supervised learning is the absence of human knowledge and oracles. Several works have formalized settings in which agents exchange environmental information as formal classes of games, such as Dec-POMDP-Com (Goldman & Zilberstein, 2003) and COM-MTDP (Pynadath & Tambe, 2002), and several frameworks have been proposed to solve them. However, these frameworks are limited in that they assume a common reward. As there is as yet no formal definition of a non-cooperative communication game, we formalize such a game to

