STATUS-QUO POLICY GRADIENT IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Individual rationality, which involves maximizing expected individual return, does not always lead to optimal individual or group outcomes in multi-agent problems. For instance, in social dilemma situations, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to mutual defection, which is individually and socially sub-optimal. In contrast, humans evolve individually and socially optimal strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior in humans to the status-quo bias, we present a status-quo loss (SQLoss) and the corresponding policy gradient algorithm that incorporates this bias in an RL agent. We demonstrate that agents trained with SQLoss evolve individually as well as socially optimal behavior in several social dilemma matrix games. To apply SQLoss to games where cooperation and defection are determined by a sequence of non-trivial actions, we present GameDistill, an algorithm that reduces a multi-step game with visual input to a matrix game. We empirically show that agents trained with SQLoss on a GameDistill-reduced version of the Coin Game evolve optimal policies.

1. INTRODUCTION

In sequential social dilemmas, individually rational behavior leads to outcomes that are sub-optimal for each individual in the group (Hardin, 1968; Ostrom, 1990; Ostrom et al., 1999; Dietz et al., 2003). Current state-of-the-art Multi-Agent Deep Reinforcement Learning (MARL) methods that train agents independently can lead to agents that play selfishly and do not converge to optimal policies, even in simple social dilemmas (Foerster et al., 2018; Lerer & Peysakhovich, 2017). To illustrate why it is challenging to evolve optimal policies in such dilemmas, we consider the Coin Game (Foerster et al., 2018). Each agent can play either selfishly (pick all coins) or cooperatively (pick only coins of its color). Regardless of the other agent's behavior, the individually rational choice for an agent is to play selfishly, either to minimize losses (avoid being exploited) or to maximize gains (exploit the other agent). However, when both agents behave rationally, they both try to pick all coins and achieve an average long-term reward of -0.5. In contrast, if both play cooperatively, the average long-term reward for each agent is 0.5. Therefore, when agents cooperate, they are both better off. Training Deep RL agents independently in the Coin Game using state-of-the-art methods leads to mutually harmful selfish behavior (Section 2.2). The problem of how independently learning agents evolve optimal behavior in social dilemmas has been studied through human studies and simulation models (Fudenberg & Maskin, 1986; Green & Porter, 1984; Fudenberg et al., 1994; Kamada & Kominers, 2010; Abreu et al., 1990).
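The dilemma structure described above can be made concrete with the standard Prisoner's Dilemma payoff matrix, where selfish play dominates in the one-shot game even though mutual cooperation is better for everyone. The payoff values below are the conventional textbook numbers (analogous in role to the Coin Game's -0.5/0.5 averages), not values from this paper's experiments:

```python
# Standard Prisoner's Dilemma payoffs for the row player.
# Actions: 0 = Cooperate, 1 = Defect.
PAYOFF = [[-1, -3],   # I cooperate: opponent cooperates / opponent defects
          [ 0, -2]]   # I defect:    opponent cooperates / opponent defects

def best_response(opponent_action):
    """Return the action that maximizes my one-shot payoff."""
    return max((0, 1), key=lambda a: PAYOFF[a][opponent_action])

# Defection is a dominant strategy: it is the best response either way...
assert best_response(0) == 1 and best_response(1) == 1
# ...yet mutual defection (-2 each) is worse than mutual cooperation (-1 each).
assert PAYOFF[0][0] > PAYOFF[1][1]
```

Both assertions hold at once, which is exactly what makes the game a social dilemma: individually rational play leads both agents to the mutually worse outcome.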
A large body of work has studied the evolution of cooperation through reciprocal behavior and indirect reciprocity (Trivers, 1971; Axelrod, 1984; Nowak & Sigmund, 1992; 1993; 1998), through variants of reinforcement using aspiration (Macy & Flache, 2002), attitude (Damer & Gini, 2008), or multi-agent reinforcement learning (Sandholm & Crites, 1996; Wunder et al., 2010), and under specific conditions (Banerjee & Sen, 2007) using different learning rates (de Cote et al., 2006) similar to WoLF (Bowling & Veloso, 2002), embedded emotion (Yu et al., 2015), and social networks (Ohtsuki et al., 2006; Santos & Pacheco, 2006). However, these approaches do not directly apply to Deep RL agents (Leibo et al., 2017). Recent work in this direction (Kleiman-Weiner et al., 2016; Julien et al., 2017; Peysakhovich & Lerer, 2018) focuses on letting agents learn strategies in multi-agent settings through interactions with other agents. Leibo et al. (2017) defines the problem of social dilemmas in the Deep RL framework and analyzes the outcomes of a fruit-gathering game (Julien et al., 2017). They vary the abundance of resources and the cost of conflict in the fruit environment to generate degrees of cooperation between agents. Hughes et al. (2018) defines an intrinsic reward (inequality aversion) that attempts to reduce the difference in obtained rewards between agents. The agents are designed to have an aversion to both advantageous (guilt) and disadvantageous (unfairness) reward allocations. This handcrafting of loss with mutual fairness evolves cooperation, but it leaves the agent vulnerable to exploitation. LOLA (Foerster et al., 2018) uses opponent awareness to achieve high levels of cooperation in the Coin Game and the Iterated Prisoner's Dilemma. However, the LOLA agent assumes access to the other agent's network architecture, observations, and learning algorithms.
This access level is analogous to getting complete access to the other agent's private information and therefore devising a strategy with full knowledge of how they are going to play. Wang et al. (2019) proposes an evolutionary Deep RL setup to evolve cooperation. They define an intrinsic reward that is based on features generated from the agent's past and future rewards, and this reward is shared with other agents. They use evolution to maximize the sum of rewards among the agents and thus evolve cooperative behavior. However, sharing rewards in this indirect way enforces cooperation rather than evolving it through independently learning agents. Interestingly, humans evolve individual and socially optimal strategies in such social dilemmas without sharing rewards or having access to private information. Inspired by ideas from human psychology (Samuelson & Zeckhauser, 1988; Kahneman et al., 1991; Kahneman, 2011; Thaler & Sunstein, 2009) that attribute this behavior in humans to the status-quo bias (Guney & Richter, 2018), we present the SQLoss and the corresponding status-quo policy gradient formulation for RL. Agents trained with SQLoss evolve optimal policies in multi-agent social dilemmas without sharing rewards, gradients, or using a communication channel. Intuitively, SQLoss encourages an agent to stick to the action taken previously, with the encouragement proportional to the reward received previously. Therefore, mutually cooperating agents stick to cooperation since the status-quo yields higher individual reward, while unilateral defection by any agent leads to the other agent also switching to defection due to the status-quo loss. Subsequently, the short-term reward of exploitation is overcome by the long-term cost of mutual defection, and agents gradually switch to cooperation. 
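One way to realize this intuition in code is sketched below. This is an illustrative regularizer consistent with the verbal description (prefer the previous action, with strength proportional to the reward it earned), not the paper's exact SQLoss formulation; the function name and the weighting hyperparameter `kappa` are assumptions introduced here for illustration:

```python
import numpy as np

def status_quo_loss(log_probs, prev_actions, prev_rewards, kappa=1.0):
    """Illustrative status-quo regularizer (not the paper's exact loss).

    log_probs:    array [T, A], log pi(a | s_t) under the current policy
    prev_actions: length-T sequence, the action taken at the previous step
    prev_rewards: length-T sequence, the reward received at the previous step
    Minimizing this term raises the probability of repeating the previous
    action, and more strongly when that action was well rewarded.
    """
    # Pick out log pi(previous action | current state) for each timestep.
    lp_prev = log_probs[np.arange(len(prev_actions)), prev_actions]
    return -kappa * float(np.mean(np.asarray(prev_rewards) * lp_prev))

# A policy that already repeats the rewarded action incurs a lower loss.
sure = np.log(np.array([[0.9, 0.1]]))    # high probability on previous action 0
unsure = np.log(np.array([[0.5, 0.5]]))
assert status_quo_loss(sure, [0], [1.0]) < status_quo_loss(unsure, [0], [1.0])
```

In a full agent, a term like this would be added to the usual policy gradient objective, so the pull toward the status quo competes with the pull toward immediately rewarding deviations, matching the dynamics described above.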
To apply SQLoss to games where a sequence of non-trivial actions determines cooperation and defection, we present GameDistill, an algorithm that reduces a dynamic game with visual input to a matrix game. GameDistill uses self-supervision and clustering to automatically extract distinct policies from a sequential social dilemma game. Our key contributions can be summarized as follows:

1. We introduce a status-quo loss (SQLoss, Section 2.3) and an associated policy gradient-based algorithm to evolve optimal behavior for agents playing matrix games, where each agent can act either cooperatively or selfishly by choosing between a cooperative and a selfish policy. We empirically demonstrate that agents trained with SQLoss evolve optimal behavior in several iterated matrix game social dilemmas (Section 4).

2. We propose GameDistill (Section 2.4), an algorithm that reduces a social dilemma game with visual observations to an iterated matrix game by extracting policies that implement cooperative and selfish behavior. We empirically demonstrate that GameDistill extracts cooperative and selfish policies for the Coin Game (Section 4.2).

3. We demonstrate that when agents run GameDistill followed by MARL game-play using SQLoss, they converge to individually as well as socially desirable cooperative behavior in a social dilemma game with visual observations (Section 4.2).
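As a toy illustration of the clustering step that GameDistill relies on, the sketch below separates pre-computed trajectory encodings into two groups with a minimal k-means. The self-supervised encoder, the feature dimensionality, and the synthetic data are all stand-ins introduced here, not the paper's implementation:

```python
import numpy as np

def two_means(features, iters=20):
    """Minimal 2-cluster k-means: a stand-in for the step where GameDistill
    groups trajectory encodings into distinct (e.g. cooperative vs. selfish)
    policy clusters. Naive init: first and last point (adequate for toy data)."""
    centers = np.stack([features[0], features[-1]]).astype(float)
    for _ in range(iters):
        # Assign each encoding to its nearest center, then recompute centers.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels

# Two well-separated blobs of fake trajectory encodings.
rng = np.random.default_rng(1)
coop = rng.normal(loc=0.0, scale=0.1, size=(20, 4))
defect = rng.normal(loc=5.0, scale=0.1, size=(20, 4))
labels = two_means(np.vstack([coop, defect]))
# All encodings within a blob land in the same cluster, and the blobs differ.
assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1
assert labels[0] != labels[20]
```

Once such clusters are identified, each cluster can be distilled into a separate policy network, reducing the visual game to a matrix game over "which cluster's policy do I play".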

2. APPROACH

2.1 SOCIAL DILEMMAS MODELED AS ITERATED MATRIX GAMES

To remain consistent with previous work, we adopt the notation of Foerster et al. (2018). We model social dilemmas as general-sum Markov (simultaneous move) games. A multi-agent Markov game is specified by G = ⟨S, A, U, P, r, n, γ⟩. S denotes the state space of the game. n denotes the

