TRUTHFUL SELF-PLAY

Abstract

We present a general framework for evolutionary learning of emergent unbiased state representations without any supervision. Evolutionary frameworks such as self-play converge to bad local optima in multi-agent reinforcement learning in non-cooperative partially observable environments with communication, owing to information asymmetry. Our proposed framework is a simple modification of self-play inspired by mechanism design, also known as reverse game theory, to elicit truthful signals and make the agents cooperative. The key idea is to add imaginary rewards using the peer prediction method, i.e., a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment. Numerical experiments with predator-prey, traffic-junction, and StarCraft tasks demonstrate the state-of-the-art performance of our framework.

1. INTRODUCTION

Evolving culture prevents deep neural networks from falling into bad local optima (Bengio, 2012). Self-play (Samuel, 1967; Tesauro, 1995) has not only demonstrated the ability to abstract high-dimensional state spaces, as typified by AlphaGo (Silver et al., 2017), but has also improved exploration coverage in partially observable environments. Communication (Sukhbaatar et al., 2016; Singh et al., 2019) lets agents exchange internal representations such as explored observations and the hidden states of RNNs. Evolutionary learning is expected to be a general framework for creating superhuman AIs, as such learning can generate high-level abstract representations without any bias from supervision. However, when evolutionary learning is applied to a partially observable environment with non-cooperative agents, improper bias is injected into the state representation. This bias originates from the environment: a partially observable environment with non-cooperative agents induces equilibria in which an agent, rather than honestly sharing its correct internal state, conceals information and deceives other agents (Singh et al., 2019). The problem arises because an agent cannot fully observe the state of the environment and thus lacks sufficient knowledge to verify the information provided by other agents. Furthermore, neural networks are vulnerable to adversarial examples (Szegedy et al., 2014) and are likely to exhibit erroneous behavior under small perturbations. Many discriminative models of information accuracy are available, including GANs (Goodfellow et al., 2014; Radford et al., 2016) and curriculum learning (Lowe et al., 2020). However, these models assume that accurate samples can be obtained by supervision; because of this assumption, it is impossible to apply them to a partially observable environment, where the distribution is not stationary.
We generalize self-play to non-cooperative partially observable environments via mechanism design (Myerson, 1983; Miller et al., 2005), which is also known as reverse game theory. The key idea is to add imaginary rewards by using the peer prediction method (Miller et al., 2005), that is, a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment, computed from the social influence of the signals. We formulate the non-cooperative partially observable environment as an extension of partially observable stochastic games (POSG) (Hansen et al., 2004) and introduce truthfulness (Vickrey, 1961) as an indicator of the validity of the state representation. We show that the imaginary reward enables us to reflect the bias of the state representation in the gradient without oracles.
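To make the imaginary-reward idea concrete, the following is a minimal sketch of the log-scoring variant of peer prediction, assuming (for illustration only) that the joint distribution of two agents' signals is known; the function name `peer_prediction_reward` and the toy distribution are our own, not part of the original method's presentation. An agent's report induces a posterior over its peer's signal, and the agent is paid the log-probability that this posterior assigns to the peer's actual report; under a strictly proper scoring rule, truthful reporting forms an equilibrium.

```python
import numpy as np

def peer_prediction_reward(joint, report_i, report_j):
    """Imaginary reward for agent i's report, scored against agent j's report.

    joint[s_i, s_j] is the (assumed known) joint distribution over the two
    agents' discrete signals.  Agent i's report induces a posterior over
    agent j's signal; the log scoring rule pays i the log-probability that
    this posterior assigns to j's actual report.
    """
    posterior = joint[report_i] / joint[report_i].sum()  # P(s_j | s_i = report_i)
    return np.log(posterior[report_j])

# Toy joint distribution with positively correlated binary signals.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

# With correlated signals, an agent that observed s_i = 0 earns a higher
# score by reporting 0 than by misreporting 1 when the peer also reports 0.
truthful = peer_prediction_reward(joint, report_i=0, report_j=0)
misreport = peer_prediction_reward(joint, report_i=1, report_j=0)
assert truthful > misreport
```

In the framework above, such a score would be added to the environment reward as an imaginary term, so that the gradient penalizes reports that carry no predictive information about other agents' observations.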

