ERMAS: LEARNING POLICIES ROBUST TO REALITY GAPS IN MULTI-AGENT SIMULATIONS

Abstract

Policies for real-world multi-agent problems, such as optimal taxation, can be learned in multi-agent simulations with AI agents that emulate humans. However, simulations can suffer from reality gaps, as humans often act suboptimally or optimize for different objectives (i.e., bounded rationality). We introduce ε-Robust Multi-Agent Simulation (ERMAS), a robust optimization framework to learn AI policies that are robust to such multi-agent reality gaps. The objective of ERMAS theoretically guarantees robustness to the ε-Nash equilibria of other agents, that is, robustness to behavioral deviations with a regret of at most ε. ERMAS efficiently solves a first-order approximation of the robustness objective using meta-learning methods. We show that ERMAS yields robust policies for repeated bimatrix games and optimal adaptive taxation in economic simulations, even when baseline notions of robustness are uninformative or intractable. In particular, we show that ERMAS can learn tax policies that are robust to changes in agent risk aversion, improving policy objectives (social welfare) by up to 15% in complex spatiotemporal simulations using the AI Economist (Zheng et al., 2020).

1. INTRODUCTION

Reinforcement learning (RL) offers a tool to optimize policy decisions affecting complex, multi-agent systems, for example, to improve traffic flow or economic productivity. In practice, the need for efficient policy evaluation necessitates training on simulations of multi-agent systems (MAS). Agents in these systems can be emulated with fixed behavioral rules, or by optimizing for a reward function using RL (Zheng et al., 2020). For instance, the impact of economic policy decisions is often estimated with agent-based models (Holland & Miller, 1991; Bonabeau, 2002). This commonly introduces a reality gap, as the reward function and resulting behavior of simulated agents might differ from those of real people (Simon & Schaeffer, 1990). This becomes especially problematic as the complexity of the simulation grows, for example, when increasing the number of agents or adding agent affordances (Kirman, 1992; Howitt, 2012). As a result, policies learned in imperfect simulations need to be robust against reality gaps in order to be effective in the real world.

We introduce ε-Robust Multi-Agent Simulation (ERMAS), a robust optimization framework for training robust policies, termed planners, that interact with real-world multi-agent systems. ERMAS trains robust planners by simulating multi-agent systems with RL and sampling worst-case behaviors from worst-case agents. This form of multi-agent robustness poses a very challenging multi-level (e.g., max-min-min) optimization problem. Existing techniques that could be applied to ERMAS's multi-agent robustness objective, e.g., naive adversarial robustness (Pinto et al., 2017) and domain randomization (Tobin et al., 2017; Peng et al., 2018), are intractable, as they would require an expensive search through a large space of agent reward functions.
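To make the nested (max-min) structure concrete, the following toy sketch (our own illustration, not the ERMAS algorithm; all payoffs are hypothetical) evaluates a fixed planner policy under every candidate agent reward function: the agent best-responds to its own, possibly perturbed, reward, and the planner's score is its worst case over those best responses. Enumerating reward perturbations this way blows up with the number of agents and the richness of the reward space, which is why a naive search is intractable.

```python
# Toy illustration of worst-case evaluation over agent reward perturbations.
# One agent picks an action; the planner's payoff depends on that choice.
planner_payoff = {"cooperate": 1.0, "defect": -1.0}

def best_response(agent_reward):
    # Inner optimization: the agent plays the action maximizing its own reward.
    return max(agent_reward, key=agent_reward.get)

def worst_case_planner_value(candidate_rewards):
    # Outer "min" over reality gaps: the worst agent reward function
    # from the planner's point of view.
    return min(planner_payoff[best_response(r)] for r in candidate_rewards)

# A small set of perturbed agent reward functions (hypothetical numbers).
candidates = [
    {"cooperate": 1.0, "defect": 0.0},   # aligned agent
    {"cooperate": 0.0, "defect": 1.0},   # adversarial agent
]
print(worst_case_planner_value(candidates))  # -1.0: the adversarial agent defects
```

In a real MAS, each `best_response` call is itself an expensive RL training run, so the exhaustive outer loop is replaced in ERMAS by a dual, first-order approximation.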
Alternative frameworks improve robustness, e.g., to changes in environment dynamics or observation and action spaces (Pinto et al., 2017; Li et al., 2019; Tessler et al., 2019), but do not address reality gaps due to reward function mismatches, as they use inappropriate metrics on the space of adversarial perturbations. To solve this problem, ERMAS has three key features: 1) it formulates a multi-agent robustness objective equivalent to finding the worst-case ε-Nash equilibria; 2) it optimizes a tractable dual problem to the equivalent objective; 3) it approximates the dual problem using local solution concepts and first-order meta-learning techniques (Nichol et al., 2018; Finn et al., 2017). ERMAS ultimately yields policies that are robust to other agents' behavioral deviations, up to a regret of ε.

We show that ERMAS learns robust policies in repeated bimatrix games by finding the worst-case reality gaps, corresponding to highly adversarial agents, which in turn leads to more robust planners. We further consider a challenging, large-scale spatiotemporal economy that features a social planner that learns to adjust agent rewards. In both settings, we show that policies trained by ERMAS are more robust by testing them in perturbed environments with agents that have optimized for reward functions unused during ERMAS training. This generalization error emulates the challenge faced in transferring policies to the real world. In particular, we show ERMAS can find AI Economist tax policies that achieve higher social welfare across a broad range of agent risk-aversion objectives. In all, we demonstrate that ERMAS is effective even in settings where baselines fail or become intractable.

Contributions. To summarize, our contributions are:
• We derive a multi-agent adversarial robustness problem using ε-Nash equilibria, which poses a challenging nested optimization problem.
• We describe how ERMAS efficiently solves the nested problem using dualization, trust regions, and first-order meta-learning techniques.
• We empirically validate ERMAS by training robust policies in two multi-agent problems: sequential bimatrix games and economic simulations. In particular, ERMAS scales to complex spatiotemporal multi-agent simulations.
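The first-order meta-learning idea referenced above can be sketched on a toy objective (our own notation and numbers, not the paper's code): instead of re-solving each agent's RL problem from scratch for every candidate reward perturbation, the agent's adapted behavior is approximated by a few plain gradient steps from its current parameters, ignoring second derivatives as in Reptile or first-order MAML.

```python
import numpy as np

def inner_adapt(theta, grad_fn, lr=0.1, steps=5):
    # First-order adaptation: plain gradient ascent on the perturbed
    # objective, with no backpropagation through the adaptation itself
    # (no second derivatives).
    theta = theta.copy()
    for _ in range(steps):
        theta += lr * grad_fn(theta)
    return theta

# Toy perturbed agent objective: J(theta) = -(theta - target)^2, whose
# maximizer is `target`. A reward perturbation shifts `target`.
def make_grad(target):
    return lambda theta: -2.0 * (theta - target)

theta0 = np.array([0.0])
adapted = inner_adapt(theta0, make_grad(np.array([1.0])))
print(adapted)  # moved from 0 toward the perturbed optimum at 1
```

The cheap adapted parameters stand in for the worst-case agents' equilibrium behavior when estimating the planner's robust objective.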

2. ROBUSTNESS AND REALITY GAPS IN MULTI-AGENT ENVIRONMENTS

We seek to learn a policy π_p for an agent, termed the planner, that interacts with an environment featuring N other agents. The planner's objective depends both on its own policy and on the behavior of the other agents in response to that policy; this is a multi-agent RL problem in which the planner and agents co-adapt. In practice, evaluating (and optimizing) π_p requires a simulation with agents that emulate those in the environment of interest (i.e., the real world), which might contain agents whose reward functions differ from those used in the simulation. Our goal is to train planner policies that are robust to such reality gaps.

Formally, we build on partially-observable multi-agent Markov Games (MGs) (Sutton & Barto, 2018), defined by the tuple M := (S, A, r, T, γ, o, I), where S and A are the state and action spaces, respectively, and I are the agent indices. Since the MG played by the agents depends on the choice of planner policy, we denote the MG given by π_p as M[π_p]. MGs proceed in episodes that last H + 1 steps (possibly infinite), covering H transitions. At each time t ∈ [0, H], the world state is denoted s_t. Each agent i = 1, ..., N receives an observation o_{i,t}, executes an action a_{i,t}, and receives a reward r_{i,t}. The environment transitions to the next state s_{t+1} according to the transition distribution T(s_{t+1} | s_t, a_t).¹ Each agent observes o_{i,t}, a part of the state s_t. Agent policies π_i are parameterized by θ_i, while the planner policy π_p is parameterized by θ_p.

The Nash equilibria of M[π_p] are agent policies from which any unilateral deviation is suboptimal:

ANE(π_p) := {π | ∀i ∈ [1, N], ∀π'_i ∈ Π : J_i(π'_i, π_{-i}, π_p) ≤ J_i(π_i, π_{-i}, π_p)},

where J_i(π, π_p) := E_{π, π_p}[ Σ_{t=0}^{H} γ^t r_{i,t} ] denotes the objective of agent i. Hence, a rational agent would not unilaterally deviate from π ∈ ANE(π_p). To evaluate a fixed planner policy π_p, we simply sample outcomes using policies π ∈ ANE(π_p).
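The unilateral-deviation condition defining ANE(π_p) can be checked directly in a small normal-form game: a joint action profile is an ε-Nash equilibrium exactly when no single agent can gain more than ε by deviating on its own. A minimal sketch with pure strategies and toy payoff matrices of our own choosing (not from the paper):

```python
import numpy as np

def deviation_regret(payoffs, actions):
    # Max gain any single agent obtains by unilaterally switching its action.
    # `payoffs[i][a0][a1]` is agent i's reward under the joint action (a0, a1).
    regret = 0.0
    for i in range(2):
        current = payoffs[i][actions[0]][actions[1]]
        for alt in range(payoffs[i].shape[i]):
            a = list(actions)
            a[i] = alt
            regret = max(regret, payoffs[i][a[0]][a[1]] - current)
    return regret

def is_eps_nash(payoffs, actions, eps):
    # `actions` is an eps-Nash equilibrium iff no unilateral deviation
    # improves any agent's payoff by more than eps (eps = 0: exact Nash).
    return deviation_regret(payoffs, actions) <= eps

# Prisoner's-dilemma-style payoffs: (defect, defect) = (1, 1) is an exact Nash point.
A = np.array([[3.0, 0.0], [4.0, 1.0]])  # row player
B = np.array([[3.0, 4.0], [0.0, 1.0]])  # column player
print(is_eps_nash((A, B), (1, 1), eps=0.0))  # True: no profitable deviation
print(is_eps_nash((A, B), (0, 0), eps=0.5))  # False: defecting gains 1 > 0.5
```

Setting ε = 0 recovers the exact Nash condition above; ERMAS's robustness objective ranges over the larger set of ε-Nash profiles, i.e., all profiles with deviation regret at most ε.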
Optimizing π_p as well introduces a form of two-level learning; under appropriate conditions, this can be solved with simultaneous gradient descent (Zheng et al., 2020; Fiez et al., 2019).

Robustness Objective. As noted before, we wish to learn planner policies π_p that are robust to reality gaps arising from changes in agent reward functions, e.g., when agents are boundedly rational.² We develop a robustness objective for the planner by formalizing such reality gaps as perturbations



¹ Bold-faced quantities denote vectors or sets, e.g., a = (a_1, ..., a_N), the action profile for N agents.
² This type of reality gap occurs when the simulated environment's reward function r fails to rationalize the actual behavior of the agents in the real environment, i.e., when agents in the real world act suboptimally with respect to the simulation's reward function.

