D3C: REDUCING THE PRICE OF ANARCHY IN MULTI-AGENT LEARNING

Abstract

Even in simple multi-agent systems, fixed incentives can lead to outcomes that are poor for the group and each individual agent. We propose a method, D3C, for online adjustment of agent incentives that reduces the loss incurred at a Nash equilibrium. Agents adjust their incentives by learning to mix their incentive with that of other agents, until a compromise is reached in a distributed fashion. We show that D3C improves outcomes for each agent and the group as a whole on several social dilemmas including a traffic network with Braess's paradox, a prisoner's dilemma, and several reinforcement learning domains.

1. INTRODUCTION

We consider a setting composed of multiple interacting artificially intelligent agents. These agents may be instantiated by humans, corporations, or machines with specific individual incentives. However, it is well known that interactions between individual agent goals can lead to inefficiencies at the group level, for example, in environments exhibiting social dilemmas (Braess, 1968; Hardin, 1968; Leibo et al., 2017). In order to resolve these inefficiencies, agents must reach a compromise.

Any arbitration mechanism that leverages a central coordinator[1] faces challenges when attempting to scale to large populations. The coordinator's task becomes intractable as it must both query preferences from a larger population and make a decision that accounts for the exponential growth of agent interactions. If agents or their designers are permitted to modify their incentives over time, the principal must collect all this information again, exacerbating the computational burden. Moreover, a central coordinator represents a single point of failure for the system, whereas one motivation for multi-agent systems research inspired by nature (e.g., humans, ants, the body) is robustness to node failures (Edelman and Gally, 2001). Therefore, we focus on decentralized approaches.

A trivial form of decentralized compromise is to require every agent to minimize the group loss (i.e., maximize welfare). Leaving the optimization problem aside, this removes inefficiency, but, like a mechanism with a central coordinator, it requires communicating all goals between all agents, an expensive step with real consequences for existing distributed systems such as wireless sensor networks (Kulkarni et al., 2010), where transmitting a signal saps a node's energy budget. There is also the obvious issue that this compromise may not appeal to an individual agent, especially one that is expected to trade its low-loss state for a higher average group loss.
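To make the group-level inefficiency concrete, the following sketch works through the classic Braess's paradox cited above. The four-node network and its latency functions are the standard textbook instance (our own illustrative numbers, not the paper's experimental network): one unit of traffic travels from S to T over two symmetric routes, and adding a free shortcut makes every driver worse off at equilibrium.

```python
# Minimal Braess's paradox computation on a hypothetical 4-node network.
# One unit of traffic flows from S to T. Edges S->A and B->T have latency
# x (the fraction of traffic using them); edges A->T and S->B cost 1.

def equilibrium_cost_without_shortcut():
    # By symmetry, traffic splits evenly over S->A->T and S->B->T.
    x = 0.5
    return x + 1.0          # each driver pays 0.5 + 1 = 1.5

def equilibrium_cost_with_shortcut():
    # A zero-latency edge A->B makes S->A->B->T weakly dominant:
    # all traffic takes it, so x = 1 on both variable edges, and
    # deviating to S->A->T or S->B->T also costs x + 1 = 2.
    x = 1.0
    return x + 0.0 + x      # every driver now pays 2.0

before = equilibrium_cost_without_shortcut()
after = equilibrium_cost_with_shortcut()
print(before, after)  # 1.5 2.0
```

No individual driver can improve by switching routes in either case, yet the added edge raises everyone's cost from 1.5 to 2.0: self-interested equilibrium behavior is inefficient for each agent and the group alike.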
One additional, more subtle consequence of optimizing a group loss is that it cannot distinguish between behaviors in environments where the group loss is constant-sum, for instance, in zero-sum games. But zero-sum games have rich structure to which we would like agents to respond. Electing a team leader (or voting on a decision) implies one candidate (decision) wins while another loses. Imagine two agents differ on their binary preference, with each trying to minimize their probability of losing. A group loss is indifferent; we prefer that the agents play the game (and, in this case, argue their points).

Design Criteria: We seek an approach to compromise in multi-agent systems that applies to the setting just described. The celebrated Myerson-Satterthwaite theorem (Arrow, 1970; Satterthwaite, 1975; Green and Laffont, 1977; Myerson and Satterthwaite, 1983) states that no mechanism can simultaneously achieve optimal efficiency (welfare-maximizing behavior), budget balance (no taxing agents and burning side-payments), individual rationality (individuals want to opt in to the mechanism), and incentive compatibility (the resulting behavior is a Nash equilibrium). Given this impossibility result, we aim to design a mechanism that approximates weaker notions of these criteria. In addition, the mechanism should be decentralized, extensible to large populations, and able to adapt to learning agents with evolving incentives in possibly non-stationary environments.

Design: We formulate compromise as agents mixing their incentives with others. In other words, an agent may become incentivized to minimize a mixture of their own loss and other agents' losses. We design a decentralized meta-algorithm to search over the space of these possible mixtures. We model the problem of efficiency using the price of anarchy. The price of anarchy, ρ ∈ [1, ∞), is a measure of inefficiency from algorithmic game theory, with lower values indicating more efficient games (Nisan et al., 2007).
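For intuition, the price of anarchy of a small matrix game can be computed directly from its definition: the worst total loss over Nash equilibria divided by the optimal total loss. The sketch below does this for a prisoner's dilemma written in loss form (the specific payoff numbers are our own illustration, not taken from the paper).

```python
import itertools
import numpy as np

# Losses (lower is better). Rows index player 1's action, columns
# player 2's. Actions: 0 = cooperate, 1 = defect. These are standard
# prisoner's-dilemma payoffs rewritten as losses.
L1 = np.array([[1.0, 3.0],
               [0.0, 2.0]])
L2 = L1.T  # symmetric game: L2[a, b] = L1[b, a]

def pure_nash(L1, L2):
    """Return all pure-strategy Nash equilibria of a 2-player loss game."""
    eqs = []
    for a, b in itertools.product(range(2), range(2)):
        best1 = L1[a, b] <= L1[:, b].min()  # player 1 cannot lower its loss
        best2 = L2[a, b] <= L2[a, :].min()  # player 2 cannot lower its loss
        if best1 and best2:
            eqs.append((a, b))
    return eqs

eqs = pure_nash(L1, L2)
worst_nash_cost = max(L1[e] + L2[e] for e in eqs)   # worst equilibrium
opt_cost = (L1 + L2).min()                          # social optimum
poa = worst_nash_cost / opt_cost
print(eqs, poa)  # [(1, 1)] 2.0
```

Mutual defection is the unique equilibrium with total loss 4, while mutual cooperation would cost 2, so ρ = 2: the equilibrium is twice as costly as the optimum. Enumerating equilibria this way is only feasible for tiny games, which is why the paper instead derives a differentiable local bound.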
Forcing agents to minimize a group (average) loss with a single local minimum results in a "game" with ρ = 1. Note that any optimal group loss solution is also Pareto-efficient. Computing the price of anarchy of a game is intractable in general. Instead, we derive a differentiable upper bound on the price of anarchy that agents can optimize incrementally over time. Differentiability of the bound makes it easy to pair the proposed mechanism with, for example, deep learning agents that optimize via gradient descent (Lerer and Peysakhovich, 2017; OpenAI et al., 2019). Budget balance is achieved exactly by placing constraints on the allowable mixtures of losses. We appeal to individual rationality in three ways. One, we initialize all agents to optimize only their own losses. Two, we include penalties for agents that deviate from this state and mix their losses with others. Three, we show empirically on several domains that opting into the proposed mechanism results in better individual outcomes. We also provide specific, albeit narrow, conditions under which agents may achieve a Nash equilibrium, i.e., the mechanism is incentive compatible, and demonstrate the agents achieving a Nash equilibrium under our proposed mechanism in a traffic network problem. The approach we propose divides the loss mixture coefficients among the agents to be learned individually; critically, the agents do not need to observe or directly differentiate with respect to the other agents' strategies. In this work, we do not tackle the challenge of scaling communication of incentives to very large populations; we leave this to future work. Under our approach, scale can be achieved through randomly sharing incentives according to the learned mixture weights or sparse optimization over the simplex (Pilanci et al., 2012; Kyrillidis et al., 2013; Li et al., 2016).

Our Contribution: We propose a differentiable, local estimator of game inefficiency, as measured by the price of anarchy.
We then present two instantiations of a single decentralized meta-algorithm, one 1st-order (gradient feedback) and one 0th-order (bandit feedback), that reduce this inefficiency. This meta-algorithm is general and can be applied to any group of individual agent learning algorithms. This paper focuses on how to enable a group of agents to respond to an unknown environment and minimize overall inefficiency. Agents with distinct losses may find their incentives well aligned to the given task; however, they may instead encounter a social dilemma (Sec. 3). We also show that our approach leads to interesting behavior in scenarios where agents may need to sacrifice team reward to save an individual (Sec. F.4) or need to form parties and vote on a new team direction (Sec. 3.4). Ideally, one meta-algorithm would allow a multi-agent system to perform sufficiently well in all these scenarios. The approach we propose, D3C (Sec. 2), is not that meta-algorithm, but it represents a holistic effort to combine critical ingredients that we hope takes a step in the right direction.[2]

2. DYNAMICALLY CHANGING THE GAME

In our approach, agents may consider slight redefinitions of their original losses, thereby changing the definition of the original game. Critically, this is done in a way that conserves the original sum of losses (budget balance) so that the original group loss can still be measured. In this section, we derive our approach to minimizing the price of anarchy in several steps. First, we formulate minimizing the price of anarchy via compromise as an optimization problem. Second, we specifically consider compromise as the linear mixing of agent incentives. Next, we define a local price of anarchy and derive an upper bound that agents can differentiate. Then, we decompose this bound into a set of differentiable objectives, one for each agent. Finally, we develop a gradient estimator to minimize the agent objectives in settings with bandit feedback (e.g., RL) that enables scalable decentralization.
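The budget-balance property of linear loss mixing can be sketched in a few lines. In this reading of the setup (variable names and shapes are our own illustration, not the paper's code), each agent j owns a simplex-constrained weight vector A[j] that says how its original loss is shared among all agents; agent i then minimizes the mixed loss f_i = Σ_j A[j, i] · ℓ_j. Because every row of A sums to 1, the mixed losses always sum to the original group loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
losses = rng.normal(size=n)            # agents' original losses l_1..l_n

# Each row A[j] is a distribution over recipients of agent j's loss.
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)      # normalize rows onto the simplex

mixed = A.T @ losses                   # f_i = sum_j A[j, i] * losses[j]

# Budget balance: row-stochastic A conserves the sum of losses exactly,
# so the original group loss is still measurable after mixing.
assert np.isclose(mixed.sum(), losses.sum())
```

Initializing A to the identity matrix recovers purely self-interested agents, matching the individually rational starting point described in the introduction; learning then searches over the simplex rows for a compromise.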



[1] For example, the VCG mechanism (Clarke, 1971).

[2] D3C is agnostic to any action or strategy semantics. We are interested in rich environments where high-level actions with semantics such as "cooperation" and "defection" are not easily extracted or do not exist.

