STATUS-QUO POLICY GRADIENT IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Individual rationality, which involves maximizing expected individual return, does not always lead to optimal individual or group outcomes in multi-agent problems. For instance, in social dilemma situations, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to mutual defection, which is individually and socially sub-optimal. In contrast, humans evolve individually and socially optimal strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior in humans to the status-quo bias, we present a status-quo loss (SQLoss) and the corresponding policy gradient algorithm that incorporates this bias into an RL agent. We demonstrate that agents trained with SQLoss evolve individually as well as socially optimal behavior in several social dilemma matrix games. To apply SQLoss to games where cooperation and defection are determined by a sequence of non-trivial actions, we present GameDistill, an algorithm that reduces a multi-step game with visual input to a matrix game. We empirically show how agents trained with SQLoss on a GameDistill-reduced version of the Coin Game evolve optimal policies.

1. INTRODUCTION

In sequential social dilemmas, individually rational behavior leads to outcomes that are sub-optimal for each individual in the group (Hardin, 1968; Ostrom, 1990; Ostrom et al., 1999; Dietz et al., 2003). Current state-of-the-art Multi-Agent Deep Reinforcement Learning (MARL) methods that train agents independently can lead to agents that play selfishly and do not converge to optimal policies, even in simple social dilemmas (Foerster et al., 2018; Lerer & Peysakhovich, 2017). To illustrate why it is challenging to evolve optimal policies in such dilemmas, we consider the Coin Game (Foerster et al., 2018). Each agent can play either selfishly (pick all coins) or cooperatively (pick only coins of its color). Regardless of the other agent's behavior, the individually rational choice for an agent is to play selfishly, either to minimize losses (avoid being exploited) or to maximize gains (exploit the other agent). However, when both agents behave rationally, they try to pick all coins and achieve an average long-term reward of -0.5. In contrast, if both play cooperatively, then the average long-term reward for each agent is 0.5. Therefore, when agents cooperate, they are both better off. Training Deep RL agents independently in the Coin Game using state-of-the-art methods leads to mutually harmful selfish behavior (Section 2.2). The problem of how independently learning agents evolve optimal behavior in social dilemmas has been studied by researchers through human studies and simulation models (Fudenberg & Maskin, 1986; Green & Porter, 1984; Fudenberg et al., 1994; Kamada & Kominers, 2010; Abreu et al., 1990).
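The dominance argument above can be checked with a stylized one-shot payoff matrix. The diagonal entries (-0.5 for mutual defection, 0.5 for mutual cooperation) are the average long-term rewards stated in the text; the off-diagonal entries (1.0 for a lone defector, -1.0 for an exploited cooperator) are illustrative assumptions chosen only to satisfy the standard social-dilemma ordering T > R > P > S, and are not taken from the Coin Game itself.

```python
C, D = 0, 1  # action indices: cooperate, defect

# payoff[my_action][other_action] = my average long-term reward
# (diagonal from the text; off-diagonal values are assumed for illustration)
payoff = [
    [0.5, -1.0],  # I cooperate
    [1.0, -0.5],  # I defect
]

def best_response(other_action):
    """Return the action that maximizes my payoff against a fixed opponent action."""
    return max((C, D), key=lambda a: payoff[a][other_action])

# Defection is a dominant strategy: it is the best response to both actions...
assert best_response(C) == D
assert best_response(D) == D

# ...yet mutual cooperation strictly dominates mutual defection for both agents.
assert payoff[C][C] > payoff[D][D]
```

Under any payoffs with this ordering, independent reward maximization drives both agents to the defect-defect outcome, which is exactly the sub-optimal equilibrium described above.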
A large body of work has studied the evolution of cooperation through reciprocal behavior and indirect reciprocity (Trivers, 1971; Axelrod, 1984; Nowak & Sigmund, 1992; 1993; 1998), through variants of reinforcement using aspiration (Macy & Flache, 2002), attitude (Damer & Gini, 2008), or multi-agent reinforcement learning (Sandholm & Crites, 1996; Wunder et al., 2010), and under specific conditions (Banerjee & Sen, 2007) using different learning rates (de Cote et al., 2006) similar to WoLF (Bowling & Veloso, 2002), as well as using embedded emotion (Yu et al., 2015) and social networks (Ohtsuki et al., 2006; Santos & Pacheco, 2006). However, these approaches do not directly apply to Deep RL agents (Leibo et al., 2017). Recent work in this direction (Kleiman-Weiner et al., 2016; Julien et al., 2017; Peysakhovich & Lerer, 2018) focuses on letting agents learn strategies in multi-agent settings through interactions with other agents.

