

Abstract

We present a general framework for evolutionary learning of emergent unbiased state representations without any supervision. Evolutionary frameworks such as self-play converge to bad local optima in multi-agent reinforcement learning in non-cooperative, partially observable environments with communication, owing to information asymmetry. Our proposed framework is a simple modification of self-play inspired by mechanism design, also known as reverse game theory, that elicits truthful signals and makes the agents cooperative. The key idea is to add imaginary rewards using the peer prediction method, i.e., a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment. Numerical experiments with predator prey, traffic junction, and StarCraft tasks demonstrate the state-of-the-art performance of our framework.

1. INTRODUCTION

Evolving culture prevents deep neural networks from falling into bad local optima (Bengio, 2012). For example, self-play (Samuel, 1967; Tesauro, 1995) has not only demonstrated the ability to abstract high-dimensional state spaces, as typified by AlphaGo (Silver et al., 2017), but has also improved exploration coverage in partially observable environments with communication (Sukhbaatar et al., 2016; Singh et al., 2019), where agents exchange internal representations such as explored observations and the hidden states of RNNs. Evolutionary learning is expected to be a general framework for creating superhuman AIs, as such learning can generate high-level abstract representations without any bias from supervision. However, when evolutionary learning is applied to a partially observable environment with non-cooperative agents, an improper bias is injected into the state representation. This bias originates from the environment: a partially observable environment with non-cooperative agents induces actions that prevent an agent from honestly sharing its correct internal state, so that at equilibrium the agents conceal information and deceive one another (Singh et al., 2019). The problem arises because an agent cannot fully observe the state of the environment and thus lacks sufficient knowledge to verify the information provided by other agents. Furthermore, neural networks are vulnerable to adversarial examples (Szegedy et al., 2014) and are likely to exhibit erroneous behavior under small perturbations. Many discriminative models for information accuracy are available, including GANs (Goodfellow et al., 2014; Radford et al., 2016) and curriculum learning (Lowe et al., 2020). However, these models assume that accurate samples can be obtained by supervision. Because of this assumption, it is impossible to apply them to a partially observable environment, where the distribution is not stable.
We generalize self-play to non-cooperative partially observable environments via mechanism design (Myerson, 1983; Miller et al., 2005), which is also known as reverse game theory. The key idea is to add imaginary rewards by using the peer prediction method (Miller et al., 2005), that is, a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment, computed from the social influence of the signals. We formulate the non-cooperative partially observable environment as an extension of partially observable stochastic games (POSG) (Hansen et al., 2004) and introduce truthfulness (Vickrey, 1961), an indicator of the validity of the state representation. We show that the imaginary reward enables us to reflect the bias of the state representation in the gradient without oracles. As the first contribution, we propose truthful self-play (TSP) and analytically demonstrate its convergence to the global optimum (Section 4). We propose the imaginary reward on the basis of the peer prediction method (Miller et al., 2005) and apply it to self-play. The mechanism affects the gradient at local optima, but not at the global optimum. The trick is to use the actions taken by the agents as feedback to verify the signal received from every other agent, instead of the true state, input, and intent, which the agents cannot fully observe. TSP requires only a modification of the baseline function of self-play; it drastically improves convergence to the global optimum in Comm-POSG. As the second contribution, based on the results of numerical experiments, we report that TSP achieves state-of-the-art performance on various multi-agent tasks comprising up to 20 agents (Section 5). Using the predator prey (Barrett et al., 2011), traffic junction (Sukhbaatar et al., 2016; Singh et al., 2019), and StarCraft (Synnaeve et al., 2016) environments, which are typically used in Comm-POSG research, we compared the performance of TSP with that of current neural nets, including the state-of-the-art method, using LSTM, CommNet (Sukhbaatar et al., 2016), and IC3Net (Singh et al., 2019). We report that the model with IC3Net optimized by TSP has the best performance. This work is the first attempt to apply mechanism design to evolutionary learning. TSP is a general optimization algorithm whose convergence is theoretically guaranteed for arbitrary policies and environments. Furthermore, since no supervision is required, TSP has a wide range of applications, not only to game AIs (Silver et al., 2017) but also to robots (Jaderberg et al., 2018), chatbots (Gupta et al., 2019; Chevalier et al., 2019), and autonomous cars (Tang, 2019) employed in multi-agent tasks.

Notation: Vectors are columns. Let [n] := {1, . . . , n}. R is the set of real numbers, and i is the imaginary unit. Re u and Im u are the real and imaginary parts of a complex number u, respectively. An n-tuple is written in boldface as a := (a_1, . . . , a_n), and a_{-i} is the (n-1)-tuple obtained by removing the i-th entry from a. Let 1 := (1, . . . , 1)^T. Matrices are written in uppercase letters, L := (l_ij). E is the unit matrix. The set of probability distributions with support X is denoted P(X).

2. RELATED WORK

Neural communication has gained attention in the field of multi-agent reinforcement learning (MARL) for both discrete (Foerster et al., 2016) and continuous (Sukhbaatar et al., 2016; Singh et al., 2019) signals. These networks are trained via self-play to exchange the internal state of the environment stored in the working memory of recurrent neural networks (RNNs), in order to learn the right policy in partially observable environments. The term self-play was coined by the game AI community in the latter half of the twentieth century. Samuel (1967) introduced self-play as a framework for sharing a state-action value among two opposing agents to efficiently search the state space in Checkers. TD-Gammon (Tesauro, 1995) used self-play as a framework to learn TD(λ) (Sutton & Barto, 1998) and achieved professional-grade play in backgammon. AlphaGo (Silver et al., 2017) defeated the Go champion by combining supervised learning on professional game records with self-play. AlphaZero (Silver et al., 2018) successfully learnt beyond its own performance entirely through self-play. All these studies credit the advantage of self-play to eliminating the bias of human knowledge in supervision. Self-play is also known as evolutionary learning (Bengio, 2012) in the deep learning community, mainly as an approach to emergent representations without supervision (Bansal et al., 2018; Balduzzi et al., 2019). Bansal et al. (2018) show that competitive environments contribute to emergent diversity and complexity. Rich generative models such as GANs (Goodfellow et al., 2014; Radford et al., 2016) are frameworks for acquiring an environmental model through a competitive setting. RNNs such as world models (Ha & Schmidhuber, 2018; Eslami et al., 2018) are capable of wider ranges of exploration in partially observable environments and of generating symbols and languages (Bengio, 2017; Gupta et al., 2019; Chevalier et al., 2019).
The difference between evolutionary learning and supervised learning is the absence of human knowledge and oracles. Several works have formalized settings in which agents exchange environmental information as formal classes of games, such as Dec-POMDP-Com (Goldman & Zilberstein, 2003) and COM-MTDP (Pynadath & Tambe, 2002), and several frameworks have been proposed to solve them. However, the limitation of these frameworks is that they assume a common reward. As there is as yet no formal definition of non-cooperative communication games, we formalize such games as Comm-POSG, a superset of POSGs (Hansen et al., 2004), a more general class of multi-agent games that includes non-cooperative cases. To the best of our knowledge, no studies have introduced truthful mechanisms into the field of MARL, but it may be possible to do so by using agents that can learn flexibly, such as neural networks. A typical truthful mechanism is the VCG mechanism (Vickrey, 1961), a generalization of the pivot method used in auction theory; there, however, the report that must satisfy truthfulness is a valuation (a value function, from an RL perspective). In this study, the scope of application is different because the belief states of the environment are the subject of reporting. We therefore instead introduce the peer prediction method (Miller et al., 2005), which guarantees truthfulness with respect to reports of beliefs about arbitrary probability distributions by using proper scoring rules (Gneiting & Raftery, 2007).

3.1. COMM-POSG

A communicative partially observable stochastic game (Comm-POSG) is a class of non-cooperative Bayesian games in which no agent fully observes the environment but all agents interact with each other. We define Comm-POSG as an extension of POSG (Hansen et al., 2004) with a message protocol.

Definition 3.1 (Hansen et al., 2004). A POSG ⟨n, T, S, A, X, T, P, R⟩ is a class for multi-agent decision making under uncertainty in which the state evolves over time 1 ≤ t ≤ T, where
• n is the number of agents,
• T is the horizon, i.e., the episode length,
• S is a set of discrete/continuous states s_t ∈ S with an initial probability distribution p(s_0),
• A is a set of discrete/continuous actions a_ti ∈ A,
• X is a set of discrete/continuous observations x_ti ∈ X,
• T ∈ P(S × A × S) is the state transition probability,
• P ∈ P(S × X^n) is the observation probability, and
• R : S × A^n → R^n is a reward function that outputs an n-dimensional vector.

In Comm-POSG, every agent further follows a message protocol Z^{n×n}, where Z is a discrete/continuous signal space. The complete information exchanged among the agents at time t is Z_t := (z_tij)_{i,j∈[n]} ∈ Z^{n×n}, a signal matrix whose (i, j)-th entry z_tij represents the signal from Agent i to Agent j at t. The i-th diagonal entry of Z_t, h_ti := z_tii, represents the pre-state, the internal state of the i-th agent before receiving the signals from the others. A game in Comm-POSG is denoted as G := ⟨n, T, S, A, X, T, P, R, Z⟩. The objective function of Comm-POSG is social welfare (Arrow, 1963), defined by

J := Σ_{i=1}^n V^{π_i};  V^{π_i} := E_{π_i}[ Σ_{t=1}^T γ^{t-1} r_ti ],   (1)

where γ ∈ [0, 1] is the discount rate, r_ti is the reward, π_i is a stochastic policy, and V^{π_i} is the value function. In extensive-form games, including Comm-POSG, neither the full state of the environment nor the information of the other agents can be observed.
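The signal-matrix convention above can be illustrated with a minimal sketch (hypothetical NumPy code for exposition only, not part of the paper's implementation):

```python
import numpy as np

# Toy Comm-POSG message exchange for n agents with d-dimensional signals.
# Z[i, j] is the signal agent i sends to agent j; the diagonal Z[i, i]
# holds agent i's pre-state h_ti.
n, d = 3, 4
rng = np.random.default_rng(0)
Z = rng.normal(size=(n, n, d))  # one signal matrix Z_t

def received_messages(Z, i):
    # Agent i's received tuple z_hat_ti: every signal addressed to agent i,
    # i.e. the i-th column of the signal matrix (including its own pre-state).
    return Z[:, i]  # shape (n, d)

z_hat_0 = received_messages(Z, 0)
# Under truthful reporting, Z[i, j] == Z[i, i] for all j, so every agent
# would receive the same tuple of pre-states.
```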
In the optimization problem under these assumptions, a policy converges to a solution called the Bayesian Nash equilibrium (BNE) (Fudenberg, 1993). We denote the social welfare at the BNE as J* and the global maximum as Ĵ. In general, J* ≠ Ĵ holds, which is closely related to information asymmetry.

3.2. COMMUNICATIVE RECURRENT AGENTS

To state the optimization algorithm of this paper, we do not propose a concrete network structure; instead, we propose an abstract structure that covers existing neural communication models (Sukhbaatar et al., 2016; Singh et al., 2019), namely communicative recurrent agents (CRAs) ⟨f_φ, σ_φ, q_φ, π_θ⟩, where
• f_φ(ĥ_{t-1,i}, x_ti) → Z is a deep RNN for the high-dimensional input x_ti ∈ X and the previous post-state ĥ_{t-1,i} ∈ Z, with parameter φ and initial state ĥ_0 ∈ Z,
• σ_φ(z_ti | h_ti) is a stochastic messaging policy for the pre-state h_ti := f_φ(ĥ_{t-1,i}, x_ti),
• q_φ(ĥ_ti | ẑ_ti) is a stochastic model for the post-state ĥ_ti ∈ Z given the received messages ẑ_ti := Z_t:i^T = (z_t1i, . . . , z_{t,i-1,i}, h_ti, z_{t,i+1,i}, . . . , z_tni)^T, and
• π_θ(a_ti | ĥ_ti) is a stochastic action policy with parameter θ.

These agents are trained through self-play using on-policy learning such as REINFORCE (Williams, 1992). All n agents share the same weights per episode, and the weights are updated based on the cumulative reward after the episode. In addition to a recurrent agent's usual output of actions given the observation series as input, a CRA both receives and emits signals for communication. CRAs estimate the current state of the environment and the agent's own current value via the post-state model, from the pre-state h_ti in the hidden layer of the RNN and the signals ẑ_{ti,-i} received from the other agents. Hence, the veracity of the signals z_ti is the point of contention.
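One CRA forward pass can be sketched with minimal NumPy stand-ins for f_φ, σ_φ, q_φ, and π_θ (an illustrative toy with hypothetical weights; the paper's agents are deep RNNs such as LSTM or IC3Net):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_actions, n = 4, 3, 3  # state dim, action count, agent count

W_f = rng.normal(size=(d, 2 * d)) * 0.1        # RNN cell f_phi
W_q = rng.normal(size=(d, d)) * 0.1            # post-state model q_phi
W_pi = rng.normal(size=(n_actions, d)) * 0.1   # action policy pi_theta

def f_phi(h_post_prev, x):        # pre-state from previous post-state and input
    return np.tanh(W_f @ np.concatenate([h_post_prev, x]))

def sigma_phi(h):                  # messaging policy (here: truthful, z = h)
    return h

def q_phi(z_hat):                  # post-state from the received message tuple
    return np.tanh(W_q @ z_hat.mean(axis=0))

def pi_theta(h_post):              # softmax action policy
    logits = W_pi @ h_post
    p = np.exp(logits - logits.max())
    return p / p.sum()

x = rng.normal(size=(n, d))                        # observations x_t
h_post_prev = np.zeros((n, d))                     # initial post-states
h_pre = np.stack([f_phi(h_post_prev[i], x[i]) for i in range(n)])
Z = np.stack([[sigma_phi(h_pre[i]) for _ in range(n)] for i in range(n)])
h_post = np.stack([q_phi(Z[:, i]) for i in range(n)])
probs = np.stack([pi_theta(h_post[i]) for i in range(n)])
```

Here the deterministic σ_φ makes the agents truthful by construction; the point of the paper is that a learned σ_φ need not stay truthful without the mechanism introduced in Section 4.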

3.3. TRUTHFULNESS

In mechanism design, a truthful game (Vickrey, 1961) is a game in which all agents report honestly at the Bayesian Nash equilibrium. In Comm-POSG, the truthfulness of the game is achieved if every sent signal equals the pre-state, z_tij = h_ti, i.e., all agents share complete information. In that case, all agents have the same information ẑ_ti = h_t := (h_t1, . . . , h_tn)^T for all i, so all agents have the same post-state model distribution, and the mean cross-entropy between the model distributions of the agents is minimized:

D_φ(Z_t) := (1/n) Σ_{i=1}^n H[q_φ(ĥ_t | ẑ_ti)] + (1/n²) Σ_{i=1}^n Σ_{j=1}^n D_KL(q_φ(ĥ_t | ẑ_ti) || q_φ(ĥ_t | ẑ_tj)).

The first term represents the entropy of the knowledge each agent has about the environment, and the second term represents the information asymmetry between agents. The amount of true information the environment has, H[p(s_t)], is a lower bound on D_φ; since achieving truthfulness is essentially the same problem as minimizing D_φ, it eventually also maximizes J*.

Proposition 3.1 (global optimality). For any game G in Comm-POSG, if D_φ(Z_t) = H[p(s_t)] for 0 ≤ t ≤ T and R_i is symmetric under any permutation of i ∈ [n], then J*(G) = Ĵ(G) holds.
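For discrete beliefs, D_φ can be evaluated directly; the following sketch (hypothetical belief vectors over four states, not the authors' code) shows that identical beliefs reduce D_φ to the pure entropy term, while disagreement raises it:

```python
import numpy as np

def D_phi(Q, eps=1e-12):
    # Q: (n, k) array; row i is agent i's post-state belief q_phi(.|z_hat_i).
    n = len(Q)
    # mean entropy term: (1/n) sum_i H[q_i]
    entropy = -np.sum(Q * np.log(Q + eps), axis=1).mean()
    # information-asymmetry term: (1/n^2) sum_ij KL(q_i || q_j)
    kl = 0.0
    for qi in Q:
        for qj in Q:
            kl += np.sum(qi * np.log((qi + eps) / (qj + eps)))
    return entropy + kl / n**2

agree = np.full((3, 4), 0.25)                # identical (uniform) beliefs
disagree = np.array([[0.7, 0.1, 0.1, 0.1],   # each agent believes a
                     [0.1, 0.7, 0.1, 0.1],   # different state is likely
                     [0.1, 0.1, 0.7, 0.1]])

# With identical beliefs the KL term vanishes, so D_phi == H[uniform] == log 4;
# disagreeing beliefs add a positive asymmetry term.
```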

4. PROPOSED FRAMEWORK

An obvious way to achieve truthful learning is to add D_φ as a penalty term to the objective, but there are two obstacles to this approach. One is that the new regularization term adds a bias to the social welfare J; the other is that D_φ contains the agents' internal states, the post-states ĥ_ti, so the exact amount cannot be measured by the designer of the learner. If the post-states could be observed correctly, then the pre-states would also be reported honestly and truthfulness would already be achieved; therefore, we must assume that the post-states cannot be observed during optimization. Our framework, truthful self-play (TSP), consists of two elements: the introduction of imaginary rewards, a general framework for unbiased regularization in Comm-POSG, and the introduction of the peer prediction method (Miller et al., 2005), a truthful mechanism that encourages honest reporting based solely on observable variables. In the following, we describe each element and show that the proposed framework converges to the global optimum in Comm-POSG. The whole procedure is shown in Algorithm 1.

4.1. IMAGINARY REWARD

Imaginary rewards are virtual rewards passed from agent to agent; they have a different basis, i, from the rewards passed from the environment, with the characteristic that they sum to zero. Since most RL environments, including Comm-POSG, contain no entities other than the agents and the environment, a two-dimensional structure is sufficient to describe them comprehensively if we wish to distinguish the sender of the reward.

Algorithm 1 The truthful self-play (TSP).
Require: Comm-POSG ⟨n, T, S, A, X, T, P, R, Z⟩, CRA ⟨f_φ, σ_φ, q_φ, π_θ⟩ with initial weight w_0 = ⟨θ_0, φ_0⟩ and initial state ĥ_0, learning rate α > 0, and mass parameter β ≥ 0.
Initialize w ← w_0.
for each episode do
  Genesis: s_1 ∼ p(s), ĥ_0i ← ĥ_0, ∀i ∈ [n].
  for t = 1 to T do
    1. Self-play:
       Observe x_t ∼ P(·|s_t).
       Update the pre-state h_ti ← f_φ(ĥ_{t-1,i}, x_ti), ∀i ∈ [n].
       Generate the message z_ti ∼ σ_φ(·|h_ti), ∀i ∈ [n].
       Send the messages Z_t ← (z_t1, . . . , z_tn).
       Receive the messages ẑ_ti ← Z_t:i^T, ∀i ∈ [n].
       Update the post-state ĥ_ti ∼ q_φ(·|ẑ_ti), ∀i ∈ [n].
       Act a_ti ∼ π_θ(·|ĥ_ti), ∀i ∈ [n].
       Get the real reward r_t ← R(s_t, a_t).
    2. Compute a score matrix L_t with the peer prediction mechanism:
       l_tij ← log π_θ(a_ti|z_tji) if i ≠ j, else 0, ∀i, j ∈ [n].
    3. Combine the real and imaginary rewards: R_t^+ ← R_t + i∆L_t.
    4. Update the weights by policy gradient (Williams, 1992):
       g_t ← Σ_{i=1}^n (Σ_{j=1}^n r_tji^+) ∇_w [log π_θ(a_ti|ĥ_ti) + log q_φ(ĥ_ti|ẑ_ti) + log σ_φ(z_ti|ĥ_{t-1,i}, x_ti)],
       w ← w + α Re g_t + αβ Im g_t.   (5)
    5. Proceed to the next state s_{t+1} ∼ T(·|s_t, a_t).
  end for
end for
return w

To keep the social welfare of the system real, the system must be designed so that the sum of the imaginary rewards, i.e., the imaginary part of the social welfare, is zero. In other words, the imaginary part is not observed macroscopically and affects only the relative expected rewards of the agents.
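Steps 2 and 3 of Algorithm 1, scoring received signals by the log-likelihood of the observed actions and zero-averaging with the graph Laplacian, can be sketched as follows (a minimal NumPy sketch with random toy numbers, not the authors' implementation):

```python
import numpy as np

n = 3
rng = np.random.default_rng(2)

# Score matrix L_t: entry (i, j) is log pi_theta(a_ti | z_tji),
# with the diagonal set to zero (l_ii := 0), as in Algorithm 1.
L = rng.uniform(-3.0, -0.1, size=(n, n))
np.fill_diagonal(L, 0.0)

# Graph Laplacian Delta := E - 11^T / n zero-averages each column,
# so the imaginary rewards Y = Delta L sum to zero.
Delta = np.eye(n) - np.ones((n, n)) / n
Y = Delta @ L

# Combine with the (diagonal) environmental rewards: R_t^+ = R_t + i*Delta*L_t.
R = np.diag(rng.uniform(0.0, 1.0, size=n))
R_plus = R + 1j * Y

# Zero-sum imaginary part: social welfare stays real (macroscopically
# the imaginary rewards are unobserved).
```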
The real and imaginary parts of the complex rewards are weighted against each other by the mass parameter β during training, which allows the weights of the network to retain a real structure. The whole imaginary reward is denoted as iY = (iy_ij)_{i,j∈[n]}, where iy_ij is the imaginary reward passed from Agent i to Agent j, and the complex reward for the whole game is R^+ := R + iY, where R is a diagonal matrix with the environmental reward r_i as its (i, i)-th entry. We write G[iY] for the game into which this structure is introduced. In this case, the following proposition holds.

Proposition 4.1. For any G in Comm-POSG, if G[iY] is truthful and R^+ is a Hermitian matrix, then J*(G[iY]) = Ĵ(G) holds.

Proof. Since G[iY] is truthful, J*(G[iY]) = Ĵ(G[iY]) holds by Proposition 3.1. Further, since R^+ is Hermitian, iy_ij = -iy_ji, and hence Im Ĵ(G[iY]) = 0 holds; thus Ĵ(G[iY]) = Ĵ(G).

This indicates that J*(G[iY]) ≥ J*(G); that is, the BNE can be improved by introducing imaginary rewards. Also, since Σ_{i=1}^n Σ_{j=1}^n iy_ij = 0 follows from R^+ being Hermitian, the imaginary rewards do not affect the social welfare of the system, the macroscopic objective function, but only the expected reward of each agent. The baseline function in the policy gradient (Williams, 1992) is an example of a function that does not affect the objective function when its mean is zero. However, the baseline function is a quantity determined from the value function of a single agent, whereas the imaginary reward differs in that (1) it affects the value function of each agent and (2) it is a meaningful quantity only when n ≥ 2 and is not observed when n = 1.

4.2. PEER PREDICTION MECHANISM

The peer prediction mechanism (Miller et al., 2005) derives from mechanism design with proper scoring rules (Gneiting & Raftery, 2007), which aim to encourage verifiers to report their beliefs honestly by assigning a score, a reward measure, to their responses when predicting probabilistic events. These mechanisms assume at least two agents: a reporter and a verifier. A general scoring rule can be expressed as F(p_s ∥ s), where p_s is the probability of occurrence reported by the verifier for the event s, and F(p_s ∥ s) is the score obtained if the reported event s actually occurs. The scoring rule is proper if an honest declaration, i.e., a reported p_s consistent with the verifier's beliefs, maximizes the expected score, and strictly proper if it is the unique maximizer. A representative strictly proper rule is the logarithmic scoring rule F(p_s ∥ s) = log p_s, whose expected value for a 1-bit signal is the cross-entropy p*_s log p_s + (1 - p*_s) log(1 - p_s) for belief p*_s; one can verify that p_s = p*_s is the only report that maximizes the expected score. Since proper scoring rules assume that the event s is observable, they are not directly applicable to problems such as partially observable environments, where the true value is hidden. Miller et al. (2005), who first presented the peer prediction mechanism, assign the score to a posterior of the verifiers that is updated by the signal rather than by the event. This concept is formulated by a model that assumes an event s stochastically emits a signal z and infers the type of s from the signals of the reporters who receive it. The peer prediction mechanism is denoted F(p(s|z) ∥ s) under the assumptions that (1) the type of the event s and the signal z emitted by each type follow a given prior, (2) the priors are common knowledge among the verifiers, and (3) the posterior is updated according to the reports.
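The strict properness of the logarithmic scoring rule can be checked numerically (an illustrative sketch; the grid and the belief value 0.7 are arbitrary choices):

```python
import numpy as np

def expected_log_score(p_report, p_true):
    # Expected log score for a 1-bit event: the verifier holds belief p_true
    # but reports p_report; F(p_s || s) = log p_s.
    return p_true * np.log(p_report) + (1 - p_true) * np.log(1 - p_report)

p_true = 0.7
grid = np.linspace(0.01, 0.99, 99)        # candidate reports 0.01, ..., 0.99
scores = expected_log_score(grid, p_true)
best = grid[np.argmax(scores)]            # the report maximizing expected score

# The expected score is strictly concave in the report and is maximized
# only by the honest report p_report == p_true.
```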
We apply the mechanism to RL, i.e., to the problem of predicting the agent's optimal behavior a_ti ∼ π_θ(·|s_t) for the true state s_t ∈ S. In self-play, conditions (1) and (2) are satisfied because the prior π_θ is shared among the agents; furthermore, the post-state in Comm-POSG corresponds to (3), so the peer prediction mechanism can be applied to the problem of predicting agent behavior. Summarizing the above discussion, we can allocate a score matrix L_t as follows:

L_t := (l(a_ti|z_tji))_{i,j∈[n]};  l(a_ti|z_tji) := F(π_θ(a_ti|z_tji) ∥ a_ti) = log π_θ(a_ti|z_tji),

an n-th order square matrix representing the score from Agent i to Agent j.

4.3. THE TRUTHFUL SELF-PLAY

In our framework, a truthful mechanism is constructed by introducing the peer prediction mechanism into the imaginary rewards. Since the score matrix L_t does not satisfy Hermitianity, we zero-average it by subtracting the mean score from each element of the matrix, making the sum zero. Using the graph Laplacian ∆ := E - 11^T/n, this can be expressed as

Y_t = ∆L_t =
(1/n) [ n-1  -1   ...  -1  ]   [ 0              l(a_t2|z_t12)  ...  l(a_tn|z_t1n) ]
      [ -1   n-1  ...  -1  ] · [ l(a_t1|z_t21)  0              ...  l(a_tn|z_t2n) ]
      [ ...  ...  ...  ... ]   [ ...            ...            ...  ...           ]
      [ -1   -1   ...  n-1 ]   [ l(a_t1|z_tn1)  l(a_t2|z_tn2)  ...  0             ],

to get R_t^+ = R_t + i∆L_t, the formula that connects reinforcement learning and mechanism design. TSP then converges to the global optimum under the condition

sup_φ |∂ Re V^{π_θ} / ∂ Im V^{π_θ}| < β,   (9)

where β < ∞ is the bounded mass parameter.

Proof (summary). R + i∆L is Hermitian, and if Eq. (9) holds, then G[i∆L] is truthful by Proposition A.3. Therefore, by Proposition 4.1, convergence to the global optimum is achieved.
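The role of the mass parameter β in the resulting update w ← w + α Re g_t + αβ Im g_t (Eq. 5) can be sketched with toy numbers (hypothetical values, not the authors' code): the real part of the complex return drives the ordinary policy gradient, while the imaginary part, the zero-averaged peer prediction scores, adds truthfulness pressure scaled by β, and the network weights stay real throughout.

```python
import numpy as np

alpha, beta = 0.1, 0.5
# Complex returns r^+ per agent: real part from the environment,
# imaginary part from the zero-averaged peer prediction scores.
r_plus = np.array([1.0 + 0.2j, 0.5 - 0.2j])
# Toy score-function gradients d log pi / dw, one row per agent.
grad_logp = np.array([[0.3, -0.1],
                      [0.2,  0.4]])

# Complex policy gradient g = sum_i r_i^+ * grad_i.
g = (r_plus[:, None] * grad_logp).sum(axis=0)

# TSP update: real and imaginary parts are recombined with weight beta,
# so the parameter vector w remains real-valued.
w = np.zeros(2)
w = w + alpha * g.real + alpha * beta * g.imag
```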

5. NUMERICAL EXPERIMENT

In this section, we establish the convergence of TSP through numerical experiments with deep neural nets. We consider three environments for our analysis and experiments (Fig. 1). We compare the performance of TSP with self-play (SP) and SP with curiosity (Houthooft et al., 2016) on three tasks belonging to Comm-POSG, comprising up to 20 agents. The hyperparameters are listed in the appendix. With regard to CRAs, three models, namely LSTM, CommNet (Sukhbaatar et al., 2016), and IC3Net (Singh et al., 2019), were compared; IC3Net is an improvement over CommNet, a continuous communication method based on LSTM. The empirical mean of the social welfare J was used as the measure for comparison. Actor-critic and value functions were added to the baselines in all frameworks. We performed 2,000 epochs of 500 steps each, using 120 CPUs; the experiment ran over a period of three days. PP and TJ: Table 1 lists the experimental results for each task. We can see that IC3Net with TSP outperforms IC3Net with SP on all tasks. Fig. 2(a) shows that TSP elicits truthful information, (b) confirms that the social welfare of TSP exceeds that of the SPs, and (c) confirms that the average of the imaginary part is zero. From these results, we conclude that TSP realizes truthful learning and state-of-the-art performance on tasks comprising 3 to 20 agents.

StarCraft:

Table 2 shows a comparison of social welfare on the exploration and combat tasks in StarCraft. (i) In the exploration task, 10 Medics find one enemy medic on a 50×50-cell grid; similar to PP, this is a competitive task in which the reward is divided by the number of medics that have found the enemy. (ii) In the combat task, 10 Marines fight 3 Zealots on a 50×50-cell grid. The maximum episode length is set to 60. We find that IC3Net, with its information-hiding gate, performs worse than CommNet, but performs better when trained with TSP owing to the truthful mechanism.

6. CONCLUDING REMARK

Our objective was to construct a general framework for emergent unbiased state representation without any supervision. First, we proposed TSP and theoretically clarified its convergence to the global optimum in the general case. Second, we performed experiments involving up to 20 agents and achieved state-of-the-art performance on all tasks. We summarize the advantages of our framework:

1. Strong convergence: TSP guarantees convergence to the global optimum both theoretically and experimentally; self-play provides no such guarantee. Moreover, the imaginary reward i∆L satisfies the baseline condition.
2. Simple solution: The only modification TSP requires is that i∆L be added to the baseline, so it is easy to implement in deep learning software libraries such as TensorFlow and PyTorch.
3. Broad coverage: TSP is a general framework, like self-play. Since TSP is independent of both agents and environments and supports both discrete and continuous control, it can be applied to a wide range of domains. No supervision is required.

To the best of our knowledge, introducing mechanism design to MARL is a new direction for the deep-learning community. In future work, we will consider fairness (Sen, 1984) as the social choice function. We expect that many other frameworks will be developed using the methodology employed in this study.

A THEORY

". . . a human brain can learn such high-level abstractions if guided by the messages produced by other humans, which act as hints or indirect supervision for these high-level abstractions; and, language and the recombination and optimization of mental concepts provide an efficient evolutionary recombination operator, and this gives rise to rapid search in the space of communicable ideas that help humans build up better high-level internal representations of their world." (Bengio, 2012)

A.1 GLOBAL OPTIMALITY

Proposition A.1.
If G ∈ G is strongly truthful and R_i is symmetric under any permutation of i ∈ [n], then J*(G) = Ĵ(G) holds, and the value is given by

Ĵ = sup_{w∈W} n ∫ R_1 dπ_θ^n dq_φ^n(s|h) dP.

Proof. In this proof, we show that a truthful game maximizes the designer's objective, and then derive the closed form. For the former, we show the following lemma.

Lemma A.1. The global optimum of the designer's objective J(w; G) satisfies Pareto optimality of the joint objective

Maximize_{w∈W} U(w) and Minimize_{w∈W} D_KL(p || q_φ),

where

U(w; R) := E_{q_φ}[V^{π_θ}] = ∫_{s∈S, ẑ∈Z^n} V^{π_θ}(s) dq_φ(s|h_i, z_{-i}) dσ_φ(ẑ),   (12)
V^{π_θ}(s) := E_{π_θ}[R_i|s] = ∫_{a∈A^n} R_i(s, a) dπ_θ^n(a|s), ∀s ∈ S.   (13)

Proof. The first objective indicates the Bayesian Nash equilibrium w ∈ W*(G), and the second indicates that the belief state q_φ should be as close to the true state p as possible. By assumption, R is maximized if p is available to each agent. Hence, from the theory of value iteration, there exists a solution π°(s) attaining V°(s) for any s ∈ S. Hereinafter, we call π° and V° := V^{π°} the ideal policy and the ideal value, respectively. From Eq. (1), the ideal policy solves the designer's objective as J(w; G) = nE_{p(s)}[V^{π_θ}(s)]. Hence, the objective is to find a policy π*(a|x, φ), with ⟨π*, φ*⟩ ∈ W(G), as close to π° as possible, maximizing the first objective U(w). The policy π* can be decomposed into the ideal policy π° and a wrong policy π′ ≠ π° as follows:

π*(a|x, φ) = E_{q_φ(s|x)}[π°(a|s)] = q_φ(s_0|x) π°(a|s_0) + (1 - q_φ(s_0|x)) π′(a|φ, x),

for observations x ∈ X^n, where s_0 ∈ S is the true state. Hence,

V^{π*}(s_0|φ) = E_{P(x|s_0)}[q_φ(s_0|x) V°(s_0) + (1 - q_φ(s_0|x)) V′(x|φ)]
             = q_φ(s_0) V°(s_0) + E_{P(x|s_0)}[(1 - q_φ(s_0|x)) V′(x|φ)]
             = q_φ(s_0) V°(s_0) + (1 - q_φ(s_0)) V̄′(s_0|φ),

where

V′(x|φ) := ∫ R_1(s_0, a) Π_{i=1}^n dπ°(a_i|s) q_φ(s|x), and
V̄′(s_0|φ) := E_{P(x|s_0)}[V′(x|φ) (1 - q_φ(s_0|x)) / (1 - q_φ(s_0))].
Thus, the error from the ideal value function can be written as

V°(s_0) - V^{π*}(s_0|φ) = (1 - q_φ(s_0))(V°(s_0) - V̄′(s_0|φ)).

The error is minimized if q_φ(s_0) = 1, as V°(s_0) > V̄′(s_0|φ) by definition. From Jensen's inequality,

log E_{p(s_0)}[q_φ(s_0)] ≥ E_{p(s_0)}[log q_φ(s_0)] = -D_KL(p || q_φ) - H[p].

Therefore, the right-hand side determines the lower bound of the probability to be maximized, E_{p(s_0)}[q_φ(s_0)]. As the second term does not depend on φ, the optimization is achieved by minimizing D_KL(p || q_φ).

Suppose that J(⟨π, φ⟩; G) = Ĵ(G) > J*(G) for a non-truthful reporting policy such that σ_φ(h|h) < 1. From Lemma A.1, q_φ(s|h_i), for an internal state h_i = f(x_i) of Agent i with an encoder f, minimizes D_KL(p || q_φ(s|h_i)). However, as q_φ(s|z_i) ≠ q_φ(s|h_i), D_KL(p || q_φ(s|z_i)) > D_KL(p || q_φ(s|h_i)) holds. That contradicts the Pareto optimality. Therefore, σ_φ must be truthful. Next, since σ_φ(h|h) = 1 holds by the assumption of strong truthfulness, from the symmetry of R, Eqs. (12) and (13) can be written as

Ĵ = sup_{θ,φ} J = sup_{θ,φ} E_{p(s)}[nV^{π_θ}] = sup_{θ,φ} ∫ nV^{π_θ}(s) dp(s)
  = sup_{θ,φ} ∫ nV^{π_θ}(s) dq_φ(s|h_1, z_{-1}) Π_{j=2}^n dσ_j(z_j|x_j) dP(x_j|s) dp(s)
  = sup_{θ,φ} ∫ nV^{π_θ}(s) dq_φ(s|h) dP^n dp
  = sup_{θ,φ} ∫ nR_1(s, a) dπ_θ^n(a|s) dq_φ^n(s|h) dP^n dp.

Proposition A.2. If [C] is an unbiased truthful mechanism of G, self-play with G[C] converges to Ĵ(G).

Proof. Since [C] is unbiased, E_{π_θ}[C_i] = 0 holds. Hence, for an arbitrary baseline b, b + C_i also satisfies the baseline condition. Therefore, from the policy gradient theorem (Sutton & Barto, 1998), self-play converges to J*(G[C]). Further, since [C] is an unbiased truthful mechanism, J*(G[C]) = Ĵ(G[C]) = Ĵ(G) holds from Proposition A.1.
A general loss function ℓ_ψ : A × Z → R_∞ for any strictly concave nonnegative function ψ : P(A) → R_∞ is defined as follows:

ℓ_ψ := D_ψ(π_θ(a_j|z_i) ∥ δ(a_j|·)),

where δ(a_j|·) is a point-wise probability satisfying lim_{ε→0} ∫_{B(ε;ã_j)} dδ(a_j|ã_j) = 1 for an open ball B(ε; ã_j), and D_ψ is the Bregman divergence (Bregman, 1967) defined by

D_ψ(p ∥ q) := ψ(p) - ψ(q) + ∫ ∇ψ(q) d(p - q).

Sending a truthful signal is the best response for minimizing the expectation of the general loss function. For example, KL divergence is the special case of the Bregman divergence with ψ = -H[·], and the following holds:

E_{π_θ}[ℓ_ψ] = ∫ D_ψ(π_θ(a_j|z_i) ∥ δ(a_j|·)) dπ_θ(a_j|h_i) = D_KL(π_θ(a_i|z_i) || π_θ(a_i|h_i)) ≥ 0.

The equality holds if and only if z_i = h_i. Notice that π_θ(a_i|h_i) = π_θ(a_j|h_i). Now, we generalize the zero-one mechanism to arbitrary signaling games.

Proposition A.3 (Bregman mechanism). For any signaling game, if sup_φ dV^{π_θ}/dβI_ψ < 1, then [iI_ψ] is an unbiased truthful mechanism of G ∈ G for the general cost function

I_ψ(a|z) := ∆L_ψ(a|z)1 =
[ n-1  -1   ...  -1  ]   [ 0            ℓ_ψ(a_2|z_1)  ...  ℓ_ψ(a_n|z_1) ]   [ 1 ]
[ -1   n-1  ...  -1  ] · [ ℓ_ψ(a_1|z_2)  0            ...  ℓ_ψ(a_n|z_2) ] · [ 1 ]
[ ...  ...  ...  ... ]   [ ...           ...          ...  ...          ]   [...]
[ -1   -1   ...  n-1 ]   [ ℓ_ψ(a_1|z_n)  ℓ_ψ(a_2|z_n) ...  0            ]   [ 1 ],

where ∆ := nE - 11^T is a graph Laplacian.

Proof. The problem we deal with is designing a scoring rule for information that satisfies two properties: (1) regularity, the score should be finite if the information is sufficiently correct, and (2) properness, the score should be maximized if and only if the information is true. A well-known example of such a scoring rule is mutual information (MI), which compares a pair of probability distributions p and q in the logarithmic domain. However, MI cannot be applied to continuous actions.
Instead, we introduce a more general tool, the regular proper scoring rule, defined below.

Definition A.1. (Gneiting & Raftery, 2007) For a set Ω, F(· || ·) : P(Ω) × Ω → R_∞ is a regular proper scoring rule iff there exists a strictly concave, real-valued function f on P(Ω) such that

F(p || x) = f(p) - ∫_Ω f*(p, ω) dp(ω) + f*(p, x)

for p ∈ P(Ω) and x ∈ Ω, where f* is a subgradient of f that satisfies

f(q) ≥ f(p) + ∫_Ω f*(p, ω) d(q - p)(ω) for q ∈ P(Ω).

We also define F for q ∈ P(Ω) as F(p || q) := ∫_Ω F(p || x) dq(x), and denote by B the set of regular proper scoring rules. For F, F_1, F_2 ∈ B, the following properties hold (Gneiting & Raftery, 2007):

1. Strict propriety: if q ≠ p, then F(q || q) > F(p || q).
2. F_1(p || q) + aF_2(p || q) + f(q) ∈ B, where a > 0 and f do not depend on p.
3. -D_ψ ∈ B, where D_ψ is the Bregman divergence.

Lemma A.2. For any F ∈ B and the function ℓ_F defined below, if sup_φ ||dV^{π_θ}/dβℓ_F|| < 1, then [ıℓ_F] is a truthful mechanism of G:

ℓ_F(a_j|z_i) := -F(π_θ(a_j|z_i) || a_j)  (i ≠ j);  0  (i = j).

Proof. We prove that the surrogate objective of G[ıℓ_F], V^{π_θ}_β := V^{π_θ} - βℓ_F, is strictly concave, and that if ∇_φ V^{π_θ}_β = 0, then z_i = h_i with probability 1. We denote by φ̂ the truthful parameter for which σ_φ̂(h_i|h_i) = 1. The policy gradient for φ is

∫ ∇_φ V^{π_θ}_β dq_φ = ∫ ∇_φ V^{π_θ} dq_φ + β ∫ ∇_φ F(π_θ(a_j|z_i) || a_j) dπ_θ dq_φ
                     = ∫ ∇_φ V^{π_θ} dq_φ + β ∫ ∇_φ F(π_θ(a_j|z_i) || π_θ(a_j|h_i)) dq_φ.

First, we consider the local optima, i.e., ∫ ∇_φ V^{π_θ} dq_φ = 0 and φ ≠ φ̂. For the Gâteaux differential with respect to ∂φ := (φ̂ - φ)^T / ||φ̂ - φ||, ∂φ ∇V^{π_θ}_β = β ∂φ ∇ℓ_ψ > 0 holds from the strict concavity. At the global optimum, i.e., ∫ ∇_φ V^{π_θ} dq_φ = 0 and φ = φ̂, ∂φ ∇V^{π_θ}_β = β ∂φ ∇ℓ_ψ = 0 holds.

(Table 3, listing the quadratic and pseudospherical scoring rules for discrete and continuous actions, appears here.)
Next, if ∂φ ∇V^{π_θ} < 0, as sup_φ ||dV^{π_θ}/dℓ_F|| < β, the following holds for φ ≠ φ̂:

∫ ∂φ ∇V^{π_θ}_β dq_φ = ∫ ∂φ (∇V^{π_θ} + β ∇ℓ_F) dq_φ
                     > ∫ ∂φ ∇V^{π_θ} dq_φ + sup_φ ||dV/dℓ_F|| ∫ ∂φ ∇ℓ_F dq_φ
                     ≥ ∫ ∂φ ∇V^{π_θ} dq_φ - inf_φ ∫ (∂φ ∇V^{π_θ}) dq_φ
                     ≥ 0.

Hence, ∂φ ∇V^{π_θ}_β ≥ 0 holds, and the equality holds if and only if φ = φ̂. Therefore, V^{π_θ}_β is strictly concave, and the following holds for α_k ∈ o(1/k):

lim_{K→∞} Σ_{k=1}^{K} [∇_φ V^{π_θ}_β(φ_k) / ||∇_φ V^{π_θ}_β(φ_k)||] α_k = φ̂.  (a.s.)

I_ψ is defined for both discrete and continuous actions. Table 3 lists examples of scoring rules ℓ_ψ for arbitrary actions. In particular, minimizing ℓ_ψ for continuous actions is known as probability density estimation (Gneiting & Raftery, 2007). -ℓ_ψ is a proper scoring rule (Gneiting & Raftery, 2007).

||∂ Re V^{π_θ} / ∂ Im V^{π_θ}|| < β, where β < ∞ is the bounded mass parameter.

Proof.

From Proposition A.3, [ıI_ψ] is unbiased truthful. Therefore, from Proposition A.2, convergence to the global optimum is achieved.

A.2 SELF-PLAY CONVERGES TO LOCAL OPTIMA

Theorem A.2. If G ∈ G is non-truthful, self-play does not converge to the global optimum Ĵ(G).

Proof. Example A.1. (One-bit two-way communication game) Fig. 4 shows an example of a non-cooperative partially observable environment with a 1-bit state. The reward structure is presented in Table 4. The sum of rewards is maximized when both agents report the correct state to the environment:

Σ_{i=1}^{n} R_i(s, a) = 2c  (a_1 = a_2 = s);  0  (otherwise).

Hence, the objective varies in the range 0 ≤ J(G^2_com) ≤ 2c.

Table 4: Reward structure (r_1, r_2). Rows: Agent 1's action a_1; columns: Agent 2's action a_2.

(r_1, r_2)       a_2 = s     a_2 = 1-s
a_1 = s (T)      (c, c)      (1, -1)
a_1 = 1-s (F)    (-1, 1)     (0, 0)

Proposition A.4. If c < 1, J*(G^2_com) < Ĵ(G^2_com) holds.

Proof. Since p(s) = 1/2, we can assume s = 1 without loss of generality. Besides, we discuss only Agent 1 because of symmetry. From Z = {0, 1}, Agent 1's messaging policy σ_1 sends the correct information x or the false information 1 - x when it knows x. Hence, we can represent the policy by using a parameter φ ∈ [0, 1] as follows:

σ_φ(z|x) = φ^z (1 - φ)^{1-z}  (x = 1);  1/2  (x = •).

Differentiating σ_φ with respect to φ gives

dσ_φ/dφ = 2z - 1  (x = 1);  0  (x = •).

Therefore, from Eq. (13), if (π*, ·) ∈ W*(G^2_com), then the policy gradient for φ is as follows:

(d/dφ) U(π*, φ) = (d/dφ) ∫ V*_1 dq_1 dσ_1 dP dp
 = ∫ V*_1 dq_1 (dσ_1/dφ) dP dp
 = λ ∫ V*_1 (2z_1 - 1) dz_1 |_{s=x_1=1}
 = λ ∫ (2z_1 - 1) R_1 dπ*_1 dπ*_2 dq_2 dz_1 dP |_{s=x_1=1}
 = λ(1 - λ) ∫ (2z_1 - 1) R_1 dπ*_1 dπ*_2 dq_2 dz_1 |_{s=x_1=1, x_2=•}
 = λ(1 - λ) Σ_{z_1=0}^{1} (2z_1 - 1) R_1(s, x_1, z_1) |_{s=x_1=1}
 = λ(1 - λ) [R_1(1, 1, 1) - R_1(1, 1, 0)]
 = λ(1 - λ)(c - 1) < 0.

As the policy gradient is negative from the assumption c ∈ (0, 1), φ* = 0 attains the Nash equilibrium from φ ≥ 0, thereby resulting in always sending false information to the opponent:

σ_φ*(z|x) = 1 - z  (x = 1);  1/2  (x = •).  (35)

Let J⟨x_1, x_2⟩ := J|_{x=⟨x_1, x_2⟩}. We can obtain J* and Ĵ as follows.
J* = ∫ J*⟨x_1, x_2⟩ dP^2 dp
   = J*⟨1, 1⟩ λ^2 + J*⟨1, •⟩ · 2λ(1 - λ) + J*⟨•, •⟩ (1 - λ)^2
   = 2cλ^2 + 0 + (2c/2^2)(1 - λ)^2
   = 2c [λ^2 + (1/4)(1 - λ)^2],

and

Ĵ = ∫ Ĵ⟨x_1, x_2⟩ dP^2 dp
  = Ĵ⟨1, 1⟩ λ^2 + Ĵ⟨1, •⟩ · 2λ(1 - λ) + Ĵ⟨•, •⟩ (1 - λ)^2
  = 2cλ^2 + 2c · 2λ(1 - λ) + (2c/2^2)(1 - λ)^2
  = 2c [λ^2 + 2λ(1 - λ) + (1/4)(1 - λ)^2],

respectively. Therefore, Ĵ - J* = 4cλ(1 - λ) > 0, and J* < Ĵ holds. Hence, G = G^2_com is a counterexample in which global optimality is not attained.

A.3 ZERO-ONE MECHANISM SOLVES G^2_com

Proposition A.5. (zero-one mechanism) Let ℓ : A × Z → {0, 1} be the zero-one loss between an action and a message, ℓ(a_i|z_j) := a_i(1 - z_j) + (1 - a_i)z_j, and

I(a|z) := [  1  -1 ]   [     0      ℓ(a_2|z_1) ]   [ 1 ]
          [ -1   1 ] · [ ℓ(a_1|z_2)     0      ] · [ 1 ].

Proof. The following equation holds:

(d/dφ) ∫ V*_{β,1} dq_1 = (d/dφ) ∫ (V*_1 - β ∫ I_1 dπ*_2 dq_2) dq_1
 = -λ(1 - λ)(1 - c) - β ∫ I_1 dπ*_2 dq_1 dq_2 (dσ_1/dφ) dσ_2 dP^2 dp
 = -λ(1 - λ)(1 - c) - βλ ∫ I_1 dπ*_2 dq_2 (2z_1 - 1) dz_1 dσ_2 dP |_{s=x_1=1}
 = -λ(1 - λ)(1 - c) - βλ^2 ∫ (2z_1 - 1) ℓ(a_2|z_1) dπ*_2 dq_2 dz_1 |_{s=x_1=x_2=1}
 = -λ(1 - λ)(1 - c) - βλ^2 Σ_{z_1=0}^{1} (2z_1 - 1) ℓ(1|z_1)
 = -λ(1 - λ)(1 - c) - βλ^2 [ℓ(1|1) - ℓ(1|0)]
 = -λ(1 - λ)(1 - c) + βλ^2
 = λ^2 [β - (1 - c)(1 - λ)/λ].
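The closed forms in Proposition A.4 can be sanity-checked numerically. The script below (our own, not from the paper's code) verifies the gap Ĵ - J* = 4cλ(1 - λ) over a grid of parameters:

```python
# Numeric check of the closed forms for the one-bit game G^2_com:
#   J*  = 2c [lam^2 + (1-lam)^2 / 4]
#   J^  = 2c [lam^2 + 2 lam (1-lam) + (1-lam)^2 / 4]
# and the gap J^ - J* = 4 c lam (1-lam).
def j_star(c, lam):
    return 2 * c * (lam**2 + 0.25 * (1 - lam)**2)

def j_hat(c, lam):
    return 2 * c * (lam**2 + 2 * lam * (1 - lam) + 0.25 * (1 - lam)**2)

for c in (0.1, 0.5, 0.9):
    for lam in (0.1, 0.5, 0.9):
        gap = j_hat(c, lam) - j_star(c, lam)
        assert abs(gap - 4 * c * lam * (1 - lam)) < 1e-12
        assert gap > 0   # self-play is strictly suboptimal for 0 < lam < 1
```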

C.3 STARCRAFT: BLOOD WARS (SC)

Explore: To complete the exploration task, the agent must come within a specific range (field of view) of the enemy unit. Once the agent is within the enemy unit's field of view, it takes no further action. The reward structure is the same as in the PP task, with the only difference being that, instead of reaching a common location, the agent must explore the enemy unit's range of vision to obtain a non-negative reward. Medic units, which do not attack enemy units, are used to prevent combat from interfering with the mission objective. Each agent observes its own (absolute x, absolute y) and the enemy's (relative x, relative y, visible), where visible indicates whether the enemy is within visual range. If the enemy is not in exploration range, relative x and relative y are zero. The agent chooses from nine actions: the eight basic directions and one stay action.

Combat: Each agent observes itself (absolute x, absolute y, health points + shield, weapon cooldown, previous action) and the enemy (relative x, relative y, visible, health points + shield, weapon cooldown). Relative x and y are observed only when the enemy is visible, as indicated by the visible flag. All observations are normalized to lie in (0, 1). The agent must choose from 9 + M actions: 9 basic actions and M attack actions, one per enemy agent. An attack action is effective only if the enemy is within the agent's view; otherwise, it is a no-op. Our combat setup is much more difficult and restrictive than in prior StarCraft work, and is therefore not directly comparable. In the Combat task, we give a negative reward r_time = -0.01 at each time step to discourage delaying detection of the enemy team.
When an agent is not participating in a battle, at each time step it receives a reward based on (i) the change in its own normalized health between the previous and current time steps, and (ii) the change in the normalized health of the enemies it has attacked so far between the previous and current time steps. The final reward for each agent consists of (i) the total remaining health of all enemies times 3 as a negative reward if the agent's team loses, and (ii) 5 * m plus the total remaining team health times 3 as a positive reward if it wins. In this task, the enemy group is randomly initialized in one half of the map, so the agents must also search the other half, which makes the communication-demanding task even more difficult.



Refer to Section A for the proofs. Note the use of ı for imaginary units and i for indices.



Theorem 4.1. (global optimality) For any game in Comm-POSG, TSP converges to the global optimum Ĵ if the following convergence condition is met,

Figure 1: Experimental environments. a: Predator Prey (PP-n) (Barrett et al., 2011). Each agent receives r_ti = -0.05 at each time step. After reaching the prey, it receives r_ti = 1/m, where m is the number of predators that have reached the prey. b: Traffic Junction (TJ-n) (Sukhbaatar et al., 2016; Singh et al., 2019). n cars with limited sight inform the other cars of their positions to avoid collisions. The cars continue to receive a reward of r_ti = -0.05 at each time step as an incentive to run faster. If cars collide, each car involved in the collision receives r_ti = -1. c: Combat task in StarCraft: Blood Wars (SC) (Synnaeve et al., 2016).

Figure 2: Learning curves in three-agent predator prey (PP-3). a: Truthfulness, the fraction of steps in which the agent shares a true pre-state (z_ij = h_i); b: the real part of the reward; c: the imaginary part.

Figure 3: Left: One-bit two-way communication game G^2_com. S = A = {0, 1}, X = S ∪ {•}, and p(s) = 1/2. Agents are given correct information from the environment with probability λ > 0: P(s|s) = λ, P(•|s) = 1 - λ. q_i(s|x_i, z_j) follows a Bernoulli distribution and estimates the state p(s) of the environment from observations and signals z_j ∈ Z = X. Let h_i = x_i. Right: An example of the non-cooperative solutions: s = 1, x = ⟨•, 1⟩, z = ⟨•, 0⟩, a = ⟨0, 1⟩, r = ⟨-1, 1⟩, and r_1 + r_2 = 0 < 2c.

If β > (1 - c)(1 - λ)/λ, then [ıI] is an unbiased truthful mechanism of G^2_com, and self-play with G^2_com[ıI] converges to the global optimum Ĵ(G^2_com) = 2c[1 - (3/4)(1 - λ)^2].
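The expression 2c[1 - (3/4)(1 - λ)^2] agrees with the form 2c[λ^2 + 2λ(1 - λ) + (1/4)(1 - λ)^2] derived in Proposition A.4; a quick algebraic check:

```python
# Check that the two expressions for J^(G^2_com) coincide:
#   lam^2 + 2*lam*(1-lam) + (1-lam)^2/4  ==  1 - (3/4)*(1-lam)^2.
def j_hat_form1(lam):
    return lam**2 + 2 * lam * (1 - lam) + 0.25 * (1 - lam)**2

def j_hat_form2(lam):
    return 1 - 0.75 * (1 - lam)**2

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(j_hat_form1(lam) - j_hat_form2(lam)) < 1e-12
```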

Social welfare J in five Comm-POSG tasks (PP-n: predator prey; TJ-n: traffic junction). All scores are negative. Each experiment was repeated three times; the mean and standard deviation are listed, and bold marks the best score. The models in the top three rows are optimized by SP; the bottom row shows IC3Net optimized by TSP.

Model        PP-3          PP-5          TJ-5           TJ-10          TJ-20
(…)          (…) ± 0.35    4.97 ± 1.33   22.32 ± 1.04   47.91 ± 41.2   819.97 ± 438.7
CommNet      1.54 ± 0.33   4.30 ± 1.14   6.86 ± 6.43    26.63 ± 4.56   463.91 ± 460.8
IC3Net       1.03 ± 0.06   2.44 ± 0.18   4.35 ± 0.72    17.54 ± 6.44   216.31 ± 131.7
TSP IC3Net   0.69 ± 0.14   2.34 ± 0.21   3.93 ± 1.46    12.83 ± 2.50   132.60 ± 17.91



Table 3: Examples of scoring rules, where κ > 1 and ||p||_κ = (∫ p(a)^κ da)^{1/κ}. Readers can refer to (Gneiting & Raftery, 2007) for many other examples.

since it is a linear combination of Bregman divergences. Hence, from Lemma A.2, [ıI_ψ] is truthful. Besides, since 1^T I_ψ = 1^T ∆ L
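The unbiasedness (zero-sum) property 1^T I_ψ = 1^T ∆ L_ψ 1 = 0 follows from 1^T ∆ = 1^T(nE - 11^T) = 0. A pure-Python sketch (with a random placeholder loss matrix of ours standing in for L_ψ) checks it:

```python
import random

# Sketch of I_psi(a|z) = Delta @ L @ 1 for n agents, where L has zero
# diagonal (entries l_psi(a_j|z_i)) and Delta = n*E - 1 1^T.
n = 4
random.seed(0)
L = [[0.0 if i == j else random.random() for j in range(n)] for i in range(n)]

r = [sum(row) for row in L]          # L @ 1 (row sums)
total = sum(r)                       # 1^T L 1
I = [n * ri - total for ri in r]     # (Delta @ L @ 1)_i = n r_i - sum_j r_j

# zero-sum (unbiased): 1^T Delta = 0, hence sum_i I_i = 0
assert abs(sum(I)) < 1e-9
```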


Hyperparameters used in the experiment. β is grid-searched over {0.1, 1, 10, 100}, and the best parameter is shown. The other parameters are not adjusted.


Therefore, if β > (1 - c)(1 - λ)/λ, then φ* = 1 holds, and J* = Ĵ holds. The value of Ĵ is clear from the proof of Lemma A.2. [ıI] is also known as the peer prediction method (Miller et al., 2005), which is inspired by peer review. This process is illustrated in Fig. 4 (Left), and the state-action value functions are listed in Fig. 4 (Right).

B COMPLEXITY ANALYSIS

Although the computational complexity of βI_ψ per iteration is O(n^3) if it is evaluated as a product of n-order square matrices, it can be reduced to O(n^2), and the sample size is O(n).
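The O(n^2) evaluation follows from the identity ∆ L 1 = n(L 1) - (1^T L 1) 1, which we derive from ∆ = nE - 11^T; the full matrix product ∆L is never formed. A sketch with arbitrary placeholder loss values:

```python
# O(n^2) evaluation of Delta @ L @ 1 via  n*(L 1) - (1^T L 1)*1,
# compared against the naive O(n^3) matrix-product version.
def mechanism_fast(L):
    n = len(L)
    r = [sum(row) for row in L]      # L @ 1, O(n^2)
    total = sum(r)                   # 1^T L 1, O(n)
    return [n * ri - total for ri in r]

def mechanism_naive(L):
    n = len(L)
    delta = [[(n - 1.0) if i == j else -1.0 for j in range(n)] for i in range(n)]
    DL = [[sum(delta[i][k] * L[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]         # O(n^3) product Delta @ L
    return [sum(DL[i]) for i in range(n)]

L = [[0, 2, 1], [3, 0, 5], [4, 1, 0]]   # placeholder loss matrix
assert mechanism_fast(L) == mechanism_naive(L)
```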

C EXPERIMENTAL ENVIRONMENTS

In the experiments, we used three partially observable environments. These settings are the same as those adopted in existing studies (Sukhbaatar et al., 2016; Singh et al., 2019). Fig. 1 shows the environments.

C.1 PREDATOR PREY (PP)

Predator-Prey (PP) is a widely used benchmark environment in MARL (Barrett et al., 2011; Sukhbaatar et al., 2016; Singh et al., 2019). Multiple predators search for a prey placed at a randomly initialized location in a grid world, under a limited field of view. The field of view of a predator is limited so that only a few blocks around it are visible. Therefore, for the predators to reach the prey faster, they must inform the other predators of the prey's location and of the locations already searched. Thus, the prey's location is conveyed through communication among the predators, but a predator can also send false messages to keep the other predators away from the prey. We experiment with two difficulty levels, PP-3 and PP-5, where the number denotes the number of agents. In PP-3, the field of view is set to 0 in a 5 × 5 environment; in PP-5, it is set to 1 in a 10 × 10 environment.
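The PP reward rule described above and in the Fig. 1 caption can be sketched as follows (the function name and interface are ours, not from the original codebase): each predator receives -0.05 per step until it reaches the prey, then 1/m, where m is the number of predators that have reached it.

```python
# Minimal sketch of the per-step Predator-Prey reward (hypothetical helper).
def pp_rewards(reached):
    """reached: list of booleans, one per predator (True = on the prey)."""
    m = sum(reached)                 # number of predators that reached the prey
    return [1.0 / m if r else -0.05 for r in reached]

assert pp_rewards([True, True, False]) == [0.5, 0.5, -0.05]
```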

C.2 TRAFFIC JUNCTION (TJ)

Traffic Junction (TJ) is a simplified road-intersection task. n cars with a limited field of view inform the other cars of their positions to avoid collisions. We use three difficulty levels. TJ-5 is the task of crossing two direct one-way streets. In TJ-10, there are two lanes, and each car can not only go straight but also turn left or right. In TJ-20, the two-lane roads comprise two parallel roads each, for a total of four intersections. Each number corresponds to n. In the initial state, each car is given a starting point and a destination and is trained to follow the determined path as fast as possible while avoiding collisions. An agent controls each car and takes one of two actions, gas or brake, at each time step. It is crucial to ensure that the cars do not approach each other, preventing collisions while making good use of multi-agent communication. That is similar to blinkers and brake lights.
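The TJ reward described in the Fig. 1 caption can be sketched as a one-liner (the function name is ours): -0.05 per time step as an incentive to run faster, and -1 for each car involved in a collision.

```python
# Minimal sketch of the per-car, per-step Traffic Junction reward.
def tj_reward(collided):
    """-1 for a car involved in a collision, otherwise the -0.05 time penalty."""
    return -1.0 if collided else -0.05

assert tj_reward(False) == -0.05
```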

