A RISK-AVERSE EQUILIBRIUM FOR MULTI-AGENT SYSTEMS

Anonymous

Abstract

In multi-agent systems, intelligent agents are tasked with making decisions that lead to optimal outcomes when the actions of the other agents are as expected, whilst also being prepared for their unexpected behaviour. In this work, we introduce a novel risk-averse solution concept that allows the learner to accommodate low probability actions by finding the strategy with minimum variance, given any level of expected utility. We first prove the existence of such a risk-averse equilibrium, and propose a fictitious-play type learning algorithm for smaller games that enjoys provable convergence guarantees in game classes including zero-sum and potential games. Furthermore, we propose an approximation method for larger games based on iterative population-based training that generates a population of risk-averse agents. Empirically, our equilibrium is shown to reduce utility variance, specifically in the sense that other agents' low probability behaviour is better accounted for by our equilibrium than by other solutions. Importantly, we show that our population of agents that approximate a risk-averse equilibrium is particularly effective against unseen opposing populations, especially in guaranteeing a minimum level of performance, which is critical to safety-aware multi-agent systems.

1. INTRODUCTION

Game Theory (GT) has become an important analytical tool in solving Machine Learning (ML) problems; the idea of "gamification" has become popular in recent years (Wellman, 2006; Lanctot et al., 2017), particularly in multi-agent systems research. The importance of risk-aversion is well established in the single-agent decision-making literature (Zhang et al., 2020; Mihatsch & Neuneier, 2002; Chow et al., 2017), whilst many open questions remain in game theory research. This paper adds to the multi-agent strategic decision-making literature by studying risk-aversion through the lens of a new equilibrium concept.

One reason that risk-aversion is important is that multi-agent interaction is rife with strategic uncertainty, because performance does not depend solely on one's own action. It is rarely the case that one will have certainty over the execution and the strategy of the opponent in situations ranging from board games to economic negotiations (Calford, 2020). This presents a dilemma for autonomous decision-makers in human-AI interaction, as one can no longer rely on perfect execution or complete strategy knowledge. Therefore, an important issue is what happens when actors take dangerous low probability actions that could be considered mistakes. These issues can arise in an array of circumstances, from misunderstandings of reward structures to execution fatigue, leading to the execution of an unexpected pure strategy. Hedging against unexpected play is important for the agents, as otherwise it can lead to large costs. As demonstrated in Fig. (1), a mistake in the execution of the pure-strategy Nash equilibrium (NE) could lead to both cars overtaking and crashing into each other, a negative yet critical outcome in a multi-agent system.

Traditional equilibrium solutions in GT (e.g. NE, Trembling Hand Perfect Equilibrium (THPE) (Bielefeld, 1988)) lack the ability to handle this style of risk as either: 1) they assume strategies are executed perfectly, and/or, 2) large costs may be undervalued if there is a low probability attached to them. We address these by introducing a new framework for studying risk in multi-agent systems through mean-variance analysis. In our framework, strategies are evaluated both in terms of expected utility against the opponent and in terms of the potential utility variance if the opponent played low probability pure strategies. For example, the driving example in Fig. (1) describes a simple scenario where, due to the critical nature of wanting to avoid crashing, the benefits of overtaking may be entirely outweighed by the possibility of low probability play leading to crashes.

                  Stay in Lane    Overtake
Stay in Lane         5, 5           0, 20
Overtake            20, 0         -50, -50

Figure 1: Cars are rewarded for reaching their destination. They are behind slow tractors but can stay in their lanes and arrive safely, but slowly. They can overtake to arrive quickly, but if the other also overtakes they will crash, leading to large negative payoffs. The Overtake strategy is high-risk, high-reward and susceptible to errors, and is selected under a Nash equilibrium. The Stay in Lane strategy is low-risk, low-reward with low variance and is selected by our mean-variance RAE approach. Highlighted outcomes: Risk Averse Equilibrium (both cars stay in lane); Pure Strategy Nash Equilibrium (one car overtakes while the other stays in lane).

We summarise the contributions of our paper here:

1. We introduce a novel risk-averse equilibrium (RAE) based on mean-variance components of the available strategies. Our framework generalises the single-agent mean-variance decision framework to multi-agent settings.
2. We show that the RAE always exists in finite games, and that it is solvable in the class of games with the fictitious-play property. This, as we later show, unlocks a powerful array of computational methods for solving games.
3. We demonstrate that: 1) RAE is able to locate a minimum variance solution for any given level of utility; 2) a by-product of RAE is that it can be used as a Nash equilibrium selection tool in the presence of a "risk-dominant" equilibrium; 3) RAE is able to find a low risk strategy in a safety-sensitive autonomous driving environment.

2. RELATED WORK

There exist three relevant bodies of work: those that empirically study the presence of risk-aversion in humans, those that aim to develop new equilibrium frameworks, and those that study how risk-averse agents alter classical game-theoretic results.

On the empirical side, Camerer & Karjalainen (1994) were the first to show that humans prefer to bet on known probability devices rather than on other human choices, suggesting strategic uncertainty avoidance. Bohnet & Zeckhauser (2004) similarly found that subjects are more trusting of an objective randomisation device than of other humans. Eichberger et al. (2008) found that more trust is placed in game theorists than in "grannies", as the latter are a source of strategic ambiguity. Similar practices are noted in game settings which more closely model multi-agent interactions, especially in the form of ambiguity aversion, such as those games outside of 3-color Ellsberg Urn tasks (Kelsey & Le Roux, 2015), public goods and weakest link games (Kelsey & Le Roux, 2017), or in the presence of strategic complements and strategic substitutes (Kelsey & le Roux, 2018). For an extensive survey of the experimental evidence, we refer readers to (Harrison & Rutström, 2008).

The equilibrium literature can be divided into three distinct sections. Harsanyi et al. (1988) introduced risk-dominant Nash equilibria (NE) (Nash, 1951), which suggests that increasing levels of strategic ambiguity will lead to the equilibrium with the lowest deviation losses. Risk dominance is limited as it is restricted to the set of NE strategies; a strategy may therefore be risk-dominant in comparison to other NE without being particularly risk-averse at all. Bielefeld (1988) set out the THPE, which deals with strategic risk by accounting for off-equilibrium play. However, this is sensitive to strictly dominated strategies and, because all trembles happen with marginal probability, large downside risk can be masked. For example, in Fig. (1) an error probability of 0.01 would only change the expected utility by 0.5, whereas we later show that our variance solution values the error at 47.6γ, where γ > 0 is non-marginal. This masking is problematic for safety-sensitive systems (e.g., autonomous driving). McKelvey & Palfrey (1995) utilise the Quantal Response Equilibrium (QRE) to introduce errors into strategy selection, but with lower probabilities on big mistakes, which also discounts the impact of large downside risk. We argue that QRE undervalues big costs, which are particularly damaging in real-world settings, whereas our mean-variance approach hedges away from high cost risk even in the presence of marginal probabilities. Yekkehkhany et al. (2020) utilise a similar mean-variance equilibrium concept based on risk derived from one-shot play in the probabilistic setting, rather than the expectation setting of this work. Their concept does not extend to the model-free machine learning setting, where utility probabilities are not known, and is therefore more difficult to apply in practice.

In terms of risk-aversion outside of equilibrium concepts, competitive network games (Wardrop, 1952) and the non-expected-utility-maximising setting (Fiat & Papadimitriou, 2010) have been studied the most. Risk aversion in the network setting is based on a generalisation of the classical selfish routing model (Beckmann et al., 1956) to incorporate uncertain delays. Nikolova & Stier-Moses (2014) consider a mean-variance framework for Wardrop equilibria in this setting, whilst Lianeas et al. (2019) extend this research to look at how risk aversion degrades the performance of a routing system. Whilst the mean-variance approach is the same underlying notion as in our work, we instead propose a solution for general games rather than routing games. In general games, Fiat & Papadimitriou (2010) remove the assumption of expectation maximisation and show that under risk-averse utility functions there may exist no Nash equilibria.
Further works have generally focused on the Price of Anarchy (Piliouras et al., 2016; Kesselheim & Kodric, 2018), studying how removing the assumption of risk neutrality in general games impacts the difference between the achieved worst equilibrium and the maximum possible welfare of the system. Our work follows a similar strand in looking at general games, but focuses on defining a new equilibrium concept rather than establishing how risk-averse agents change the convergence properties to classical equilibrium concepts. In addition, we frame our work such that it is more scalable for usage alongside RL techniques.

Our framework fits broadly into the areas of risk-sensitive RL (Chow & Ghavamzadeh, 2014; Keramati et al., 2020; Zhang et al., 2021) and risk measures, such as mean-variance, value at risk (VaR) and conditional value at risk (CVaR), in RL (García & Fernández, 2015; Tang et al., 2019; Hiraoka et al., 2019; Ma et al., 2020). The focus of risk-sensitive RL has remained predominantly on single-agent settings where risk is due to the environment, rather than to other inhabitants of the environment. Multi-agent solutions include: RMIX (Qiu et al., 2021), which optimises decentralised CVaR policies in cooperative risk-sensitive settings; RAM-Q and RA3-Q (Gao et al., 2021), which tackle algorithmic trading by utilising an adversarial approach to promote variance reduction; and risk-sensitive DAPG (Eriksson et al., 2022), which approaches risk in Bayesian games in terms of the CVaR induced by the possible combinations of types in the game. However, as we are specifically concerned with game-theoretic equilibrium concepts, we will not directly compare to these methods.

Historically, the key challenge of computational GT is how to solve for an NE. For example, in two-player zero-sum games, it is theoretically possible to solve for an NE directly via linear programming (LP) (Morgenstern & Von Neumann, 1953).
Another approach to finding an equilibrium is the iterative method Fictitious Play (FP) (Brown, 1951), where players make best responses to the time-average action of the opponent. However, in practice both of the above approaches can be intractable. Limitations due to action space size led to a general wave of methods that start with a "restricted" action space and iteratively expand it in order to approximate an equilibrium with the best possible strategies. Notably, Double Oracle (DO) (McMahan et al., 2003; Dinh et al., 2021; McAleer et al., 2021) and Policy-Space Response Oracles (PSRO) (Lanctot et al., 2017; McAleer et al., 2020; Perez-Nieves et al., 2021; Feng et al., 2021) are the two major frameworks in this area. In this work we face a similar challenge in terms of the difficulty of solving for our own equilibrium. We demonstrate how FP and PSRO can be applied as solvers for our new equilibrium concept and, in doing so, provide a concrete methodology for obtaining solutions in our setting. However, we must adapt them, as they are generally designed for risk-neutral equilibria, which is not the case for this work.

3. PRELIMINARIES & NOTATIONS

In this section, we introduce the preliminaries and notation required to understand our formulation. A normal-form game (NFG) is the standard representation of strategic interaction in GT. A finite n-person NFG is a tuple $(N, A, u)$, where $N$ is a finite set of $n$ players, $A = A_1 \times \dots \times A_n$ is the joint action space, with $A_i$ being the strategies available to player $i$, and $u = (u_1, \dots, u_n)$, where $u_i : A \to \mathbb{R}$ is the real-valued expected utility function for each player. A player plays a mixed strategy, $\sigma_i \in \Delta_{A_i}$, which is a probability distribution over their possible actions. In Sec. 6 we replace our atomic pure strategies with neural network based strategies and therefore re-define our notation to keep clarity between the two game schemes.

The central equilibrium concept in game theory is the Nash equilibrium (NE), which is a strategy profile where no player has an incentive to deviate. Let $a_{-i} \in A_{-i}$ be the pure strategies of all players other than $i$, and let $u_i(a_i, a_{-i})$ be the expected utility function for player $i$ versus all players other than $i$. The strategy profile $a^* = (a^*_i, a^*_{-i})$ is an NE if, for every player $i$,

$$u_i(a^*_i, a^*_{-i}) \ge u_i(a_i, a^*_{-i}) \quad \forall a_i \in A_i. \qquad (1)$$

4. MEAN-VARIANCE EQUILIBRIUM

In the following section, we introduce our mean-variance based total utility function and then show how it can be used as an equilibrium concept. Our proposed variance method aims to deal with the main downside of QRE and THPE, namely that they both undervalue large downsides. For example, QRE is designed so that action probabilities are proportional to expected utility, and THPE assigns 'error' probability to all zero-probability actions. In both of these cases, variance from the average expected utility will provide a more pronounced effect of risk than using the raw values.
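As a concrete reference point for Eq. (1), the pure-strategy Nash profiles of the driving game in Fig. (1) can be enumerated directly. The following is a minimal sketch only: the payoff table is copied from Fig. (1), while the function name `pure_nash_profiles` and the index convention (0 = Stay in Lane, 1 = Overtake) are illustrative assumptions rather than part of the paper.

```python
from itertools import product

# Row-player utilities for the driving game of Fig. (1).
# Assumed index convention: 0 = Stay in Lane, 1 = Overtake.
# The game is symmetric, so the column player's utility for the
# profile (a1, a2) is read from the same table as M[a2][a1].
M = [[5.0, 0.0],
     [20.0, -50.0]]


def pure_nash_profiles(M):
    """Return every pure profile (a1, a2) satisfying the condition of Eq. (1)."""
    n = len(M)
    profiles = []
    for a1, a2 in product(range(n), repeat=2):
        # Row player cannot gain by switching to any deviation d.
        row_ok = all(M[a1][a2] >= M[d][a2] for d in range(n))
        # Column player cannot gain by switching to any deviation d.
        col_ok = all(M[a2][a1] >= M[d][a1] for d in range(n))
        if row_ok and col_ok:
            profiles.append((a1, a2))
    return profiles


nash = pure_nash_profiles(M)
```

Under these assumptions the only pure profiles that survive are those in which exactly one car overtakes, which is the high-risk outcome that the mean-variance utility is designed to hedge against.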

4.1. UTILITY FUNCTION

Here we propose a total utility function that measures both the expected utility and the potential variance of utility dependent on the opponent's strategy. For simplicity we provide definitions based on playing a symmetric game, such that two players share an action set $A$ and a utility function $u$. We extend this to the non-symmetric case in Appendix H. Define the expected utility of action $a_i \in A$ against action $a_j \in A$ as $u(a_i, a_j)$ and the full expected utility table as $M$, where the entry $M_{i,j}$ refers to $u(a_i, a_j)$ and $M_i$ refers to $u(a_i, a_j)\ \forall j$, i.e. the vector of expected utilities that action $a_i$ receives against all other actions. We now define the expected utility of the mixed strategy $\sigma$ for player 1 versus the mixed strategy $\varsigma$ for player 2 as

$$u(\sigma, \varsigma, M) = \sum_{a_i \in A} \sum_{a_j \in A} \sigma(a_i)\,\varsigma(a_j)\,u(a_i, a_j) = \sigma^T \cdot M \cdot \varsigma. \qquad (2)$$

The weighted co-variance matrix for $M$ is a $|A| \times |A|$ matrix $\Sigma_{M,\varsigma} = [c_{jk}]$ with entries

$$c_{jk} = \sum_{a_i \in A} \varsigma(a_i)\,\big(u(a_i, a_j) - \bar{M}_j\big)\big(u(a_i, a_k) - \bar{M}_k\big), \qquad (3)$$

where $\bar{M}_i = \frac{1}{|A|} \sum_{k=1}^{|A|} \varsigma_k\, u(a_i, a_k)$ is the weighted average expected utility for action $i$. This is a standard co-variance matrix where the values for each action are weighted by the likelihood of them being selected by the opponent. A uniform weighting could be used; however, we believe that in terms of utility variability avoidance it is more intuitive to hedge against the variance caused by high likelihood actions. Nevertheless, as will be discussed later, all actions still receive positive probability under our framework and therefore always carry some weight in the variance calculation, so that low likelihood, high-variance actions still have a large impact on the final result. This accounts for the idea that mistakes may happen, such that all actions can be played with at least a low probability.
This allows us to define the mixed-strategy $\sigma$ variance utility as follows:

$$\mathrm{Var}(\sigma, \varsigma, M) = \sum_{k=1}^{|A|} \sum_{n=1}^{|A|} \sigma(a_k)\,\sigma(a_n)\,c_{kn} = \sum_{i=1}^{|A|} \sigma(a_i)^2 c_{ii} + 2 \sum_{k=1}^{|A|} \sum_{n=k+1}^{|A|} \sigma(a_k)\,\sigma(a_n)\,c_{kn} = \sigma^T \cdot \Sigma_{M,\varsigma} \cdot \sigma. \qquad (4)$$

The final total utility function $r$, which considers both expected and variance utility for strategy $\sigma$, is

$$r(\sigma, \varsigma, M) = u(\sigma, \varsigma, M) - \gamma\, \mathrm{Var}(\sigma, \varsigma, M), \qquad (5)$$

where $\gamma \in \mathbb{R}$ is the risk-aversion parameter.

Applying Eq. (5) to Fig. (1), we show how we arrive at a strategy profile that has our desired properties. Consider two joint strategy profiles, $S_1 = ((1 - \epsilon, 0 + \epsilon), (1 - \epsilon, 0 + \epsilon))$ and a Nash equilibrium $S_2 = ((0 + \epsilon, 1 - \epsilon), (1 - \epsilon, 0 + \epsilon))$, where $(1 - \epsilon, 0 + \epsilon)$ represents playing Stay in Lane with probability $(1 - \epsilon)$. Here $\epsilon$ is arbitrarily small and is used to ensure fully mixed strategies; for the example we use $\epsilon = 0.01$. Profile $S_1$ receives $u(S_1) = 5$ and the Nash profile receives $u(S_2) = 20$. However, $\mathrm{Var}(S_1) = 0.32$ and $\mathrm{Var}(S_2) = 47.6$, i.e. the Nash strategy has huge variance for Player 1. Therefore, $r(S_1) = 5 - 0.32\gamma$ and $r(S_2) = 20 - 47.6\gamma$. Since $5 - 0.32\gamma > 20 - 47.6\gamma$ whenever $\gamma > 15/47.28 \approx 0.32$, for any risk-aversion parameter $\gamma > 0.32$ it is optimal to play $S_1$.
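The quantities in this subsection can be checked numerically on the driving game of Fig. (1). The sketch below is illustrative and rests on two stated assumptions: `M[j][i]` is read as the utility of own action `j` against opponent action `i`, and the weighted mean is taken as the ς-weighted average without an additional 1/|A| factor, a normalisation chosen here because it reproduces the variances of 0.32 and 47.6 quoted in the worked example. All function names are hypothetical.

```python
# Row-player utilities from Fig. (1); assumed indices: 0 = Stay in Lane, 1 = Overtake.
M = [[5.0, 0.0],
     [20.0, -50.0]]


def expected_utility(sigma, varsigma, M):
    """u(sigma, varsigma, M) = sigma^T M varsigma."""
    return sum(sigma[j] * varsigma[i] * M[j][i]
               for j in range(len(sigma)) for i in range(len(varsigma)))


def weighted_covariance(M, varsigma):
    """Covariance between own actions j and k, weighted by the opponent's play."""
    n, m = len(M), len(varsigma)
    # Assumption: varsigma-weighted mean utility of each own action (no 1/|A| factor).
    mean = [sum(varsigma[i] * M[j][i] for i in range(m)) for j in range(n)]
    return [[sum(varsigma[i] * (M[j][i] - mean[j]) * (M[k][i] - mean[k])
                 for i in range(m))
             for k in range(n)]
            for j in range(n)]


def variance_utility(sigma, varsigma, M):
    """Var(sigma, varsigma, M) = sigma^T Sigma sigma."""
    C = weighted_covariance(M, varsigma)
    n = len(sigma)
    return sum(sigma[j] * sigma[k] * C[j][k] for j in range(n) for k in range(n))


def total_utility(sigma, varsigma, M, gamma):
    """r = u - gamma * Var."""
    return (expected_utility(sigma, varsigma, M)
            - gamma * variance_utility(sigma, varsigma, M))


eps = 0.01
mostly_stay = [1 - eps, eps]
mostly_overtake = [eps, 1 - eps]

var_s1 = variance_utility(mostly_stay, mostly_stay, M)      # Player 1 in S1
var_s2 = variance_utility(mostly_overtake, mostly_stay, M)  # Player 1 in S2
```

With these conventions, `var_s1` and `var_s2` round to 0.32 and 47.6 respectively. The expected utilities computed with ϵ = 0.01 differ slightly from the rounded values 5 and 20 used in the text, since the latter correspond to the pure profiles.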

4.2. EQUILIBRIUM CONCEPT

We now define our new equilibrium concept based on the total utility function (5). We start by defining the best-response map:

$$\sigma^*(\varsigma) \in \arg\max_{\sigma}\ r(\sigma, \varsigma, M) \quad \text{s.t.} \quad \sigma(a) \ge 0\ \ \forall a \in A, \quad \sigma^T \mathbf{1} = 1, \qquad (6)$$

where, due to the quadratic term $\sigma^T \cdot \Sigma_{M,\varsigma} \cdot \sigma$ and the constraints, we have a Quadratic Programme (QP). The above programme finds $\sigma$ such that the total utility is maximised, whilst ensuring that no action is assigned negative probability and that the action probabilities sum to one. We now propose the following:

PROPOSITION 1. For any given expected utility $\mu_b$, there exists $\gamma$ such that the solution to (6) is the minimum variance solution.

We defer the proof to Appendix (A). This proposition implies that when using (6), given any expected utility value $\mu_b$, there exists $\gamma$ that achieves $\mu_b$ with the minimum possible variance. Therefore, $\gamma$ can be tuned to provide a desired expected utility whilst giving the user the minimum viable variance solution. Based on Eq. (6), we define the equilibrium for the strategy profile $\sigma$:

DEFINITION 2 (Risk-Averse Equilibrium (RAE)). A strategy profile $\{\sigma, \varsigma\}$ is a risk-averse equilibrium if both $\sigma$ and $\varsigma$ are risk-averse best responses to each other.

Finally, a property of most game-theoretic equilibria is that a solution exists, at least in the finite game setting. For our equilibrium, we note the following result in mixed strategies:

THEOREM 3. For any finite N-player game where each player $i$ has a finite number $k$ of pure strategies, $A^i = \{a^i_1, \dots, a^i_k\}$, an RAE exists.

We defer the proof of the result to Appendix (A). Importantly, Theorem 3 establishes the existence of solutions, providing practical relevance for our equilibrium concept.
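The best-response programme (6) can be solved with any QP solver. As a dependency-free illustration, the sketch below uses projected gradient ascent onto the probability simplex; this is a substituted generic technique, not the solver used in the paper. It assumes γ ≥ 0 (so the objective is concave), takes the ς-weighted mean without an extra 1/|A| factor, and reads `M[j][i]` as the utility of own action `j` against opponent action `i`. The function names are hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        css += u_k
        t = (css - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k passing the test gives the threshold
    return [max(x - theta, 0.0) for x in v]


def risk_averse_best_response(M, varsigma, gamma, steps=500):
    """Approximately maximise sigma^T M varsigma - gamma * sigma^T Sigma sigma."""
    n, m = len(M), len(varsigma)
    mean = [sum(varsigma[i] * M[j][i] for i in range(m)) for j in range(n)]
    C = [[sum(varsigma[i] * (M[j][i] - mean[j]) * (M[k][i] - mean[k])
              for i in range(m)) for k in range(n)] for j in range(n)]
    # Step size 1/L, where L bounds the gradient's Lipschitz constant:
    # the largest absolute row sum of C bounds its largest eigenvalue.
    L = 2.0 * gamma * max(sum(abs(c) for c in row) for row in C)
    step = 1.0 / L if L > 0 else 1.0
    sigma = [1.0 / n] * n
    for _ in range(steps):
        # Gradient of the objective: M varsigma - 2 * gamma * C sigma.
        grad = [mean[j] - 2.0 * gamma * sum(C[j][k] * sigma[k] for k in range(n))
                for j in range(n)]
        sigma = project_simplex([sigma[j] + step * grad[j] for j in range(n)])
    return sigma


# Driving game of Fig. (1); assumed indices: 0 = Stay in Lane, 1 = Overtake.
M = [[5.0, 0.0],
     [20.0, -50.0]]
sigma = risk_averse_best_response(M, [0.99, 0.01], gamma=1.0)
```

In this example the risk-neutral case (γ = 0) best responds with Overtake, whereas with γ = 1 the returned strategy places most, but not all, of its probability on Stay in Lane. The interior solution reflects the trade-off between the higher mean of overtaking and its variance.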

5. EQUILIBRIUM LEARNING VIA STOCHASTIC FICTITIOUS PLAY

We start by showing that our total utility function can be used as a form of stochastic fictitious play (SFP) (Fudenberg & Kreps, 1993) for finding an RAE in small NFGs. SFP has convergence guarantees in a selection of games, most notably potential games (Monderer & Shapley, 1996a;b) and finite two-player zero-sum games (Robinson, 1951). Furthermore, SFP is robust to games outside of the above game classes (Ganzfried, 2020), and we extend these observations in Appendix (B).

SFP describes a learning process where each player chooses a best response to their opponents' time-average strategies. In SFP, a group of $n \ge 2$ players repeatedly play an $n$-player NFG. The state variable is $Z_t \in \Delta_S$, whose components $Z^i_t$ describe the time averages of each player's behaviour, $Z^i_t = \frac{1}{t} \sum_{u=1}^{t} \sigma^i_u$, where $\sigma^i_u \in \Delta_{A^i}$ represents the observed strategy of player $i$ at time-step $u$. An SFP process is one where each player best responds to the time-average strategy of their opponents, $Z^{-i}_t$, such that

$$\sigma^i_{t+1} \in \arg\max_{\sigma}\ u^i(\sigma, Z^{-i}_t, M) - \lambda v^i(\sigma). \qquad (7)$$

THEOREM 4. Given the total utility function Eq. (5), there exist RAE convergence guarantees in the category of games that are solved by SFP.

Figure 2: Diagram of the iterative agent generation process. Panel labels: RAE Solver, $\sigma_t(\varsigma) \in \arg\max_{\sigma} u(\sigma, \varsigma, M) - \gamma\, \mathrm{Var}(\sigma, \varsigma, M)$; Best Response Oracle, $\phi_{BR}(\sigma_t) = \arg\max_{\phi} \sum_{k=1}^{t} \sigma^k_t \big(u(\phi, \phi_k) - \gamma\, \mathrm{Var}(\phi, \phi_k)\big)$; PSRO Inner Loop; PSRO Outer Loop with populations $\Phi_2 = \{\phi_1, \phi_2\}$, $\Phi_3 = \{\phi_1, \phi_2, \phi_3\}$, ..., $\Phi_T = \{\phi_1, \phi_2, \dots, \phi_T\}$; Meta-Game $t$ with Covariance Matrix $\Sigma_t$, and Meta-Game $t+1$ with Covariance Matrix $\Sigma_{t+1}$ after $\phi_{BR}$ is added.

SFP does not necessarily converge in all game classes (but is robust empirically). Therefore, we show that if the SFP process does converge to a strategy, then that strategy is guaranteed to be an RAE:

PROPOSITION 5. Suppose the SFP sequence $\{Z_t\}$ converges to $\sigma$ in observed strategies, then $\sigma$ is a risk-averse equilibrium.

Note that for SFP we require a stronger notion of convergence, in observed strategies $\sigma^i_t$ rather than in beliefs $Z^i_t$, but the usage of a converged final $\sigma^i_t$ guarantees a risk-averse equilibrium.
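The time-averaging process described in this section can be sketched on the driving game of Fig. (1). The code below is a simplified illustration, not the algorithm of the paper: it runs single-population fictitious play in which the best response maximises the total utility of Eq. (5), and it omits the stochastic perturbation term of Eq. (7). The inner solver (projected gradient ascent), the uniform initial belief, the ς-weighted mean without a 1/|A| factor and all function names are assumptions made for the example.

```python
# Driving game of Fig. (1); assumed indices: 0 = Stay in Lane, 1 = Overtake.
M = [[5.0, 0.0],
     [20.0, -50.0]]


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        css += u_k
        t = (css - 1.0) / k
        if u_k - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def risk_averse_best_response(M, opp, gamma, steps=300):
    """Projected gradient ascent on sigma^T M opp - gamma * sigma^T Sigma sigma."""
    n, m = len(M), len(opp)
    mean = [sum(opp[i] * M[j][i] for i in range(m)) for j in range(n)]
    C = [[sum(opp[i] * (M[j][i] - mean[j]) * (M[k][i] - mean[k]) for i in range(m))
          for k in range(n)] for j in range(n)]
    L = 2.0 * gamma * max(sum(abs(c) for c in row) for row in C)
    step = 1.0 / L if L > 0 else 1.0
    sigma = [1.0 / n] * n
    for _ in range(steps):
        grad = [mean[j] - 2.0 * gamma * sum(C[j][k] * sigma[k] for k in range(n))
                for j in range(n)]
        sigma = project_simplex([sigma[j] + step * grad[j] for j in range(n)])
    return sigma


def fictitious_play(M, gamma, iterations=300):
    """Return the time-average strategy after repeated risk-averse best responses."""
    n = len(M)
    belief = [1.0 / n] * n  # initial belief counts as one uniform observation
    for t in range(1, iterations + 1):
        br = risk_averse_best_response(M, belief, gamma)
        # Incremental update of the running average of observed strategies.
        belief = [z + (b - z) / (t + 1) for z, b in zip(belief, br)]
    return belief


belief = fictitious_play(M, gamma=1.0)
```

With γ = 1, the time-average in this toy run settles on a profile that plays Stay in Lane with high probability while keeping a small positive weight on Overtake. Note that the sketch tracks the belief `Z_t`, whereas Proposition 5 is stated for convergence in observed strategies.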

6. EQUILIBRIUM LEARNING VIA ITERATIVE AGENT GENERATION

For games that cannot be tractably displayed in the normal form, we use iterative solution frameworks, which make use of reinforcement learning (RL) policies as proxies for actions. This approach aims to approximate equilibria in large games by finding a small representative collection of risk-averse policies which can instead be selected over by RAE. We provide a visualisation of the following iterative agent generation process in Fig. (2), and provide an algorithm in Appendix D.

Consider two-player stochastic games $G$ defined by the tuple $\{S, A, P, R\}$, where $S$ is the set of states, $A = A^1 \times A^2$ is the joint action space, $P : S \times A \times S \to [0, 1]$ is the state-transition function and $R^i : S \times A \to \mathbb{R}$ is the reward function for player $i$. An agent is a policy $\phi$, where a policy is a mapping $\phi : S \times A \to [0, 1]$ which can be described either in tabular form or as a neural network. The expected utility between two agents is defined to be $M(\phi_i, \phi_j)$ (i.e., in the same manner defined for NFGs in Sec. 4.1), and represents the expected utility to agent $\phi_i$ against opponent $\phi_j$.

Our iterative framework performs $T \in \mathbb{N}^+$ iterative updates on a meta-game $M$ (an NFG made up of RL agents as actions), following the framework of PSRO (Lanctot et al., 2017). At every iteration $t \le T$, a player is defined by a population of fixed agents $\Phi_t = \Phi_0 \cup \{\phi_1, \phi_2, \dots, \phi_t\}$, where $\Phi_0$ is the initial random agent. For notational convenience, we consider the single-population case where players share the same $\Phi_t$. As such, the population generates a meta-game $M_t$, an expected utility matrix between all the agents in the population, with individual entries $M(\phi_i, \phi_j)\ \forall \phi_i, \phi_j \in \Phi_t$. To make use of a population $\Phi_t$, we require a way to select which agents $\phi_t \in \Phi_t$ will be utilised for training. This selection function $f$ is a mapping $f : M_t \to [0, 1]^t$ which takes as input a meta-game $M_t$ and outputs a meta-distribution $\sigma_t = f(M_t)$.
The output $\sigma_t$ is a probability assignment to each agent in the population $\Phi_t$ and, as we are in the single-population setting (i.e., symmetric play), we do not distinguish between populations. This is the equivalent of a mixed strategy in an NFG, except now the actions are RL policies. We apply our risk-averse equilibrium concept (Def. 2) as the meta-solver. As the $\phi$ are RL policies, the policies are sampled according to their respective probabilities in $\sigma_t$.

At each epoch the population $\Phi_t$ is augmented with a new agent that is a best response to the meta-distribution $\sigma_t$. Generally, this will be purely in terms of the expected reward, and can be found with any optimisation process such as reinforcement learning. In our setting there are two properties that we are concerned with when adding a new agent to the population: how it impacts the expected return, and how it impacts the variance utility of the population, $\sigma_{t+1}^T \cdot \Sigma_{M_{t+1},\sigma_{t+1}} \cdot \sigma_{t+1}$. To do this, we follow the PPO approach of (Zhang et al., 2020), which optimises both performance and per-step RL-reward variance. It is shown by (Bisi et al., 2019) that the variance of the per-step RL-reward bounds the variance of the total RL-reward from above. Notably, the variance utility of a population is measured in terms of the variance of the total RL-reward, and therefore shrinking the variance of the per-step RL-reward will also shrink the variance of the total RL-reward. To achieve this, an augmented MDP is used where the MDP-reward $g^i_t$ is replaced as follows:

$$\hat{g}^i_t = g^i_t - \lambda (g^i_t)^2 + 2 \lambda g^i_t y_i,$$

where $y_i = \frac{1}{T} \sum_{t=1}^{T} g^i_t$ is the average of the RL-rewards during the data collection phase. Notably, as this variance is also with respect to the sampling probability defined by $\sigma_t$, this optimises the correct co-variance matrix, which is similarly weighted by $\sigma_t$.
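The reward augmentation above is straightforward to apply to a batch of collected rewards. The following minimal sketch implements only the per-step transformation; the function name `augment_rewards`, the list-based interface and the use of the batch mean for $y_i$ are illustrative assumptions, and the surrounding PPO training loop of (Zhang et al., 2020) is not shown.

```python
def augment_rewards(rewards, lam):
    """Map each reward g_t to g_t - lam * g_t**2 + 2 * lam * g_t * y.

    `rewards` is the list of per-step rewards gathered during data collection
    and `y` is their average over that batch.
    """
    y = sum(rewards) / len(rewards)
    return [g - lam * g * g + 2.0 * lam * g * y for g in rewards]


# Small usage example with a hypothetical batch of three rewards.
augmented = augment_rewards([1.0, 2.0, 3.0], lam=0.5)
```

Setting `lam = 0` recovers the original rewards, so the parameter controls how strongly the per-step variance is penalised in this sketch.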

7. EXPERIMENTS

We validate the effectiveness of RAE on three environments that all display some risk component.

1. Randomly generated coordination games where some actions provide a high expected utility if the other player selects the same action, but have large costs if not. There also exist actions that have lower coordinated expected utility but lower costs. We conduct experiments testing SFP on games with 100 actions, and utilising our iterative approach on games with 500 actions. Vanilla policy gradient RL agents are used for the iterative approach.
2. A generalised grid-world stag-hunt game (Peysakhovich & Lerer, 2017) that has a payoff-dominant and a risk-dominant equilibrium. In this game it is not tractable to list out all actions, and therefore our iterative approach is applied. PPO RL agents are used.
3. An autonomous driving environment (Leurent, 2018).

For SFP we select the baselines to be NE (including risk-dominant and payoff-dominant NE), THPE (Bielefeld, 1988) and QRE (McKelvey & Palfrey, 1995). For our iterative experiments, we select the baselines to be PSRO-{Nash, Uniform, Self-Play, THPE, QRE}, where the brackets refer to the meta-solver used. In the population-based setting we believe it is fair to restrict our baselines to algorithms that operate within this framework, and to not consider non-population risk-aversion algorithms and opponent modelling frameworks. In addition, our goal is to introduce and evaluate a new game-theoretic concept, and therefore we believe the most natural comparisons are those from GT. We present our results in the form of answering three critical questions w.r.t. the effectiveness of RAE.

Question 1. Do RAE solutions have similar expected utility whilst lowering variance?

We start by investigating performance in randomly generated coordination NFGs. These NFGs are designed so that there are actions in the games that perform well if your opponent follows the same strategy but have large negative payoffs otherwise.
There also exist actions that perform worse (but still positively) when coordinated on, but also maintain better performance (albeit worse than coordination) when the opponent plays a different action. We provide an example in Appendix F. These games draw out potential pitfalls in current GT solution concepts (e.g. Nash) that focus on appealing coordination utility without considering the variability of expected utility. We present our results in Fig. (3), where (a) represents games with 100 actions solved using SFP, and (b) represents games with 500 actions solved using our iterative framework. We plot our RAE results across multiple values of γ in order to generate a theoretical 'efficient frontier'. An efficient frontier shows, for each value of expected utility, the minimum variance with which that expected utility can be attained. Our figures show that, whilst our baselines are able to achieve a diverse range of expected utility values, they are unable to find the minimum variance solution which our RAE is able to find. We believe this shows the flexibility of our approach, in that it is able to attain any expected utility that the baselines can achieve, whilst finding a better solution in terms of variance.
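The idea of tracing an efficient frontier by sweeping γ can be illustrated on a toy scale. The sketch below is not the experimental setup of this section: it reuses the two-action driving game of Fig. (1) against a fixed, mostly-Stay opponent, and finds each risk-averse best response by a simple grid search over the probability of staying in lane. The opponent strategy, the γ values, the grid resolution and the function names are all assumptions chosen for the example.

```python
# Driving game of Fig. (1); assumed indices: 0 = Stay in Lane, 1 = Overtake.
M = [[5.0, 0.0],
     [20.0, -50.0]]
opp = [0.99, 0.01]  # assumed opponent: mostly Stay in Lane


def mean_and_variance(p, M, opp):
    """Mean and opponent-weighted variance of utility when playing action 0 w.p. p."""
    sigma = [p, 1.0 - p]
    # Utility of the mixed strategy against each opponent action.
    payoff = [sum(sigma[j] * M[j][i] for j in range(2)) for i in range(len(opp))]
    mean = sum(w * x for w, x in zip(opp, payoff))
    var = sum(w * (x - mean) ** 2 for w, x in zip(opp, payoff))
    return mean, var


def frontier(M, opp, gammas, grid=1001):
    """For each gamma, return (gamma, mean, variance) of the grid-optimal strategy."""
    candidates = [mean_and_variance(k / (grid - 1), M, opp) for k in range(grid)]
    points = []
    for gamma in gammas:
        best = max(candidates, key=lambda mv: mv[0] - gamma * mv[1])
        points.append((gamma, best[0], best[1]))
    return points


points = frontier(M, opp, [0.0, 0.1, 0.2, 0.3, 0.5, 1.0, 5.0])
```

Each returned point has the smallest variance among the grid strategies that achieve at least its expected utility, so plotting mean against variance over the sweep gives a discrete approximation of the frontier for this toy game. As γ grows, both the variance and the expected utility of the selected strategy weakly decrease.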

Question 2. Can RAE act as a NE selection method?

A by-product of RAE is that it can be used as an NE selection tool. We evaluate this in a generalised stag-hunt grid world (Peysakhovich & Lerer, 2017) where there exist both 'payoff-dominant' and 'risk-dominant' NEs. We provide full environment details in Appendix (F) and provide a visualisation in Fig. (4a). There are two differing goals in the environment: 1) collect plants on your own and receive low expected utility, or 2) hunt the stag and receive a large positive expected utility if the agents catch the stag together, and a large negative expected utility if only one agent hunts the stag. These are the 'risk-dominant' safe strategy and the 'payoff-dominant' risky strategy, respectively.

In Fig. (4) we demonstrate how RAE can effectively act as an NE selection method. In Fig. (4b) we present the expected utility where each meta-solver population is trained against itself. Notably, the final strategies of all the baselines focus on capturing the stag, which is the risky payoff-dominant strategy. However, our RAE instead finds the risk-dominant strategy, in which it focuses on gathering plants and not going after the stag. The impact of this is particularly noticeable when we place an RAE population and a Nash population into the environment together as co-players, shown in Fig. (4c). The Nash population still attempts to hunt the stag, but in this case the RAE population is still gathering plants, which leads to the Nash population being caught by the stag numerous times and receiving very negative expected utility. This is an overall desirable property of RAE, as it suggests that, in the case that RAE and NE overlap, RAE will find the risk-dominant strategy.

Question 3. In safety-sensitive environments what sort of strategy does RAE learn to follow?

Finally, how does RAE act in an environment where avoiding any large downside possibility is critical, for example autonomous driving? Our environment is modelled on the example in Fig. (1), where there exists two-way traffic with slow-moving vehicles and faster-moving agents behind that may be interested in overtaking. From a game-theoretic standpoint this is a surprisingly difficult problem. An NE prescribes that one agent overtakes and the other waits, which is a strategy that is exposed to errors and low probability play. We provide full environment details in Appendix (F).

In Fig. (5) we provide our results. In Table (a) we provide metrics where the average value is based on 100 episodes in the environment and the standard deviation is taken over 5 different training seeds. Firstly, we note that in terms of expected utility and variance utility RAE outperforms the baselines, whilst also maintaining strong worst-case performance. Notably, RAE arrives at a strategy that very rarely crashes and nearly always arrives at the final destination. The same conclusion cannot be drawn for any of the provided baselines. To see why this is happening, in Fig. (5b) and (5c) we provide position heat-maps of the cars utilising the RAE strategy and the Nash strategy respectively. In the RAE heat-map one can see that the strategy taken is the safe strategy, i.e. follow behind until all vehicles in the on-coming lane have passed and then proceed to overtake. This strategy provides little expected utility for the vast majority of the episode, but remains sensitive to the risk element of the environment, which is our desired outcome. On the other hand, the Nash heat-map shows that the strategy is to overtake straight away, which nearly always ends in a crash due to car congestion in the middle of the episode.

8. CONCLUSION

We introduce a new risk-averse equilibrium concept, RAE, based on mean-variance analysis. Theoretically, we prove the existence and solvability of RAE and provide methods for arriving at an RAE in both small and large scale game settings. Empirically, we show that RAE is able to locate minimum variance solutions for any expected utility, acts as a NE selection method in the presence of risk-dominant equilibria, and is effective at finding a safe equilibrium in a safety-sensitive autonomous driving environment. Future work should focus on the limitations of the current RAE approach, namely the lack of convergence guarantees in certain classes of games and the fact that RAE minimises both upside and downside variance, whereas minimising downside variance only would be a desirable property.

A FULL PROOFS

A.1 PROPOSITION 1 [MINIMUM VARIANCE SOLUTION]

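PROPOSITION 1. The optimisation in Eq. (6) has the same solutions as the following problem:

$$\sigma^* \in \arg\min_{\sigma} \; \sigma^T \cdot \Sigma_M \cdot \sigma \quad \text{s.t.} \quad \sigma^T \cdot M \cdot \varsigma \ge \mu_b, \quad \sigma(a) \ge 0 \;\; \forall a \in A, \quad \sigma^T \mathbf{1} = 1 \tag{8}$$

where $\mu_b \in \mathbb{R}$ is the lowest level of expected return that the actor is willing to accept.

Proof. Merton (1972) shows by a Lagrange multiplier argument that the optimisation problem

$$\sigma^* \in \arg\min_{\sigma} \; \sigma^T \cdot \Sigma_M \cdot \sigma \quad \text{s.t.} \quad \sigma^T \cdot M \cdot \varsigma \ge \mu_b, \quad \sigma(a) \ge 0 \;\; \forall a \in A, \quad \sigma^T \mathbf{1} = 1$$

can be rewritten as

$$\sigma^* \in \arg\min_{\sigma} \; \sigma^T \cdot \Sigma_M \cdot \sigma - \tau \, \sigma^T \cdot M \cdot \varsigma \quad \text{s.t.} \quad \sigma(a) \ge 0 \;\; \forall a \in A, \quad \sigma^T \mathbf{1} = 1$$

which, for $\tau > 0$, can be equivalently expressed as

$$\sigma^* \in \arg\min_{\sigma} \; -\sigma^T \cdot M \cdot \varsigma + \lambda \, \sigma^T \cdot \Sigma_M \cdot \sigma \quad \text{s.t.} \quad \sigma(a) \ge 0 \;\; \forall a \in A, \quad \sigma^T \mathbf{1} = 1$$

where $\lambda = \frac{1}{\tau}$.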
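As a small numerical illustration of the constrained problem in Proposition 1, the sketch below searches a two-action simplex for the minimum-variance strategy that still meets a return floor $\mu_b$. The payoff matrix, the opponent strategy and the helper names are hypothetical, and the weighted covariance is an assumed form (payoff deviations from each action's weighted mean, weighted by the opponent's probabilities). A grid search stands in for a proper quadratic-programme solver.

```python
def weighted_cov(M, opp):
    """Assumed weighted covariance of row-action payoffs under the opponent mix."""
    n = len(M)
    means = [sum(w * u for w, u in zip(opp, row)) for row in M]
    return [[sum(w * (M[k][j] - means[k]) * (M[l][j] - means[l])
                 for j, w in enumerate(opp))
             for l in range(n)] for k in range(n)]


def expected_utility(sigma, M, opp):
    """sigma^T M opp."""
    return sum(s * sum(w * u for w, u in zip(opp, row)) for s, row in zip(sigma, M))


def variance(sigma, cov):
    """sigma^T cov sigma."""
    n = len(sigma)
    return sum(sigma[k] * cov[k][l] * sigma[l] for k in range(n) for l in range(n))


def min_variance_strategy(M, opp, mu_b, steps=100):
    """Grid search over two-action mixed strategies (1 - p, p)."""
    cov = weighted_cov(M, opp)
    best = None
    for i in range(steps + 1):
        p = i / steps
        sigma = [1.0 - p, p]
        # Skip strategies that violate the return floor (small float tolerance).
        if expected_utility(sigma, M, opp) < mu_b - 1e-9:
            continue
        v = variance(sigma, cov)
        if best is None or v < best[1]:
            best = (sigma, v)
    return best


# Hypothetical driving-style game: action 0 is safe, action 1 is risky.
M = [[1.0, 1.0], [3.0, -10.0]]
opp = [0.9, 0.1]
sigma_star, var_star = min_variance_strategy(M, opp, mu_b=1.35)
print(sigma_star, var_star)
```

With these assumed numbers the return floor binds: the solver puts just enough weight on the risky action to reach $\mu_b$ and no more, which is the behaviour the mean-variance formulation is meant to capture.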

A.2 THEOREM 3 [RAE EXISTENCE]

THEOREM 3. For any finite N-player game where each player i has a finite number k of pure strategies, $A_i = \{a^i_1, ..., a^i_k\}$, an RAE exists.

Proof. We base our proof on Kakutani's Fixed Point Theorem.

Lemma (Kakutani Fixed Point Theorem). Let A be a non-empty subset of a finite dimensional Euclidean space. Let $f : A \rightrightarrows A$ be a correspondence, with $x \in A \mapsto f(x) \subseteq A$, satisfying the following conditions:

1. A is a compact and convex set.
2. $f(x)$ is non-empty for all $x \in A$.
3. $f(x)$ is a convex-valued correspondence: for all $x \in A$, $f(x)$ is a convex set.
4. $f(x)$ has a closed graph: that is, if $\{x^n, y^n\} \to \{x, y\}$ with $y^n \in f(x^n)$, then $y \in f(x)$.

Then, f has a fixed point, that is, there exists some $x \in A$ such that $x \in f(x)$.

We define our best-response function as $B_i(\sigma_{-i}) = \arg\max_{a \in \Delta_i} r_i(a, \sigma_{-i})$, where $r_i$ is defined as in Eq. (5) and, by definition, $a$ must satisfy all of the properties of a proper mixed-strategy. The best-response correspondence is $B : \Delta \rightrightarrows \Delta$ such that for all $\sigma \in \Delta$ we have $B(\sigma) = [B_i(\sigma_{-i})]_{i \in N}$.

We show that $B(\sigma)$ satisfies the conditions of Kakutani's Fixed Point Theorem.

1. ∆ is compact, convex and non-empty.

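By definition, $\Delta = \prod_{i \in N} \Delta_i$, where each $\Delta_i = \{a \mid a_j \ge 0, \; \sum_j a_j = 1\}$ is a simplex of dimension $|A_i| - 1$. Thus each $\Delta_i$ is closed and bounded, and hence compact; it is also convex and non-empty. Their product set is therefore also compact, convex and non-empty.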

2. B(σ) is non-empty.

By the definition of $B_i(\sigma_{-i})$, $\Delta_i$ is non-empty and compact, and $r_i$ is a quadratic, and hence polynomial, function in $a$. Since all polynomial functions are continuous, we can invoke Weierstrass's Extreme Value Theorem, which states:

Lemma. If a real-valued function f is continuous on the closed interval $[a, b]$, then f must attain a maximum and a minimum, each at least once. That is, there exist numbers c and d in $[a, b]$ such that $f(c) \ge f(x) \ge f(d) \;\; \forall x \in [a, b]$.

The same conclusion holds for a continuous real-valued function on any non-empty compact subset of a Euclidean space. Therefore, as $\Delta_i$ is non-empty and compact and $r_i$ is continuous in $a$, $B_i(\sigma_{-i})$ is non-empty, and therefore $B(\sigma)$ is also non-empty.

3. B(σ) is a convex-valued correspondence.

Equivalently, $B(\sigma) \subset \Delta$ is convex if and only if $B_i(\sigma_{-i})$ is convex for all i. In order to show that $B_i(\sigma_{-i})$ is convex for all i, we instead show that the Quadratic Programme defined by Eq. (6) is a special case of convex optimisation under certain conditions, and therefore by definition has a feasible set which is a convex set. A convex optimisation problem is one of the form

$$\text{minimize} \;\; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, ..., m, \qquad a_i^T x = b_i, \; i = 1, ..., p$$

where $f_0, ..., f_m$ are convex functions. The requirements for a problem to be a convex optimisation problem are:

(a) the objective function must be convex;
(b) the inequality constraint functions must be convex;
(c) the equality constraint functions $h_i(x) = a_i^T x - b_i$ must be affine.

We note that a quadratic form $x^T A x$ is convex if A is positive semi-definite, and strictly convex if A is positive definite (we can guarantee strict convexity by adding a small constant to the diagonal of A without materially impacting the variance values). In our constrained optimisation, the quadratic term $\sigma^T \Sigma \sigma$ is always guaranteed to be at least convex as $\Sigma$, the covariance matrix, is always at least PSD. Therefore, our objective function is convex. Additionally, it is easy to see that our inequality constraint functions are also convex and that our equality constraint function is affine. Therefore, our Quadratic Programme is an instance of a convex optimisation problem. Importantly, the feasible set of a convex optimisation problem is convex, since it is the intersection of the domain of the problem $D = \bigcap_{i=0}^{m} \text{dom} f_i$, which itself is a convex set, with the m convex sublevel sets $\{x \mid f_i(x) \le 0\}$ and the p hyperplanes $\{x \mid a_i^T x = b_i\}$. Therefore, for all $x, y \in B_i(\sigma_{-i})$ and all $\theta \in [0, 1]$ we have that $\theta x + (1 - \theta) y \in B_i(\sigma_{-i})$, and we have a convex-valued correspondence.

4. B(σ) has a closed graph.

Suppose, to obtain a contradiction, that $B(\sigma)$ does not have a closed graph. Then there exists a sequence $(\sigma^n, \hat{\sigma}^n) \to (\sigma, \hat{\sigma})$ with $\hat{\sigma}^n \in B(\sigma^n)$ but $\hat{\sigma} \notin B(\sigma)$, i.e. there exists some i such that $\hat{\sigma}_i \notin B_i(\sigma_{-i})$. This implies that there exists some $\sigma'_i \in \Delta_i$ and some $\epsilon > 0$ such that

$$r_i(\sigma'_i, \sigma_{-i}) > r_i(\hat{\sigma}_i, \sigma_{-i}) + 3\epsilon.$$

By the continuity of $r_i$ and the fact that $\sigma^n_{-i} \to \sigma_{-i}$, we have for sufficiently large n,

$$r_i(\sigma'_i, \sigma^n_{-i}) \ge r_i(\sigma'_i, \sigma_{-i}) - \epsilon,$$

and combining the preceding two relations we obtain

$$r_i(\sigma'_i, \sigma^n_{-i}) > r_i(\hat{\sigma}_i, \sigma_{-i}) + 2\epsilon \ge r_i(\hat{\sigma}^n_i, \sigma^n_{-i}) + \epsilon \tag{18}$$

where the second relation follows from the continuity of $r_i$. This contradicts the assumption that $\hat{\sigma}^n_i \in B_i(\sigma^n_{-i})$ and completes the proof of the closed-graph property.

Therefore, $B(\sigma)$ satisfies the conditions of Kakutani's Fixed Point Theorem, so there exists some $\sigma^* \in B(\sigma^*)$, and any such $\sigma^*$ is an equilibrium.
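The convexity argument used in step 3 of the proof above can be sanity-checked numerically. The sketch below builds a weighted covariance matrix for a random 3-action game and tests the convexity inequality $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$ for the quadratic form on random pairs of mixed strategies. The weighted-covariance definition and all helper names are assumptions made for illustration; a passing check is evidence, not a proof.

```python
import random


def weighted_cov(M, opp):
    """Assumed weighted covariance of row-action payoffs under the opponent mix."""
    n = len(M)
    means = [sum(w * u for w, u in zip(opp, row)) for row in M]
    return [[sum(w * (M[k][j] - means[k]) * (M[l][j] - means[l])
                 for j, w in enumerate(opp))
             for l in range(n)] for k in range(n)]


def quad(x, cov):
    """Quadratic form x^T cov x."""
    n = len(x)
    return sum(x[k] * cov[k][l] * x[l] for k in range(n) for l in range(n))


def random_simplex_point(n, rng):
    """Random mixed strategy obtained by normalising positive draws."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
M = [[rng.uniform(-10.0, 10.0) for _ in range(3)] for _ in range(3)]
opp = random_simplex_point(3, rng)
cov = weighted_cov(M, opp)

convex_ok = True
nonneg_ok = True
for _ in range(1000):
    x = random_simplex_point(3, rng)
    y = random_simplex_point(3, rng)
    theta = rng.random()
    mix = [theta * a + (1.0 - theta) * b for a, b in zip(x, y)]
    # Convexity of the variance term along the segment between x and y.
    if quad(mix, cov) > theta * quad(x, cov) + (1.0 - theta) * quad(y, cov) + 1e-9:
        convex_ok = False
    # Positive semi-definiteness: the variance is never negative.
    if quad(x, cov) < -1e-9:
        nonneg_ok = False
print(convex_ok, nonneg_ok)
```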

A.3 THEOREM 4 [SFP CONVERGENCE]

THEOREM 4. Given the total utility function in Eq. (5), there exist RAE convergence guarantees in the category of games that are solved by SFP.

Proof. We show that our utility measure can be embedded as a version of stochastic fictitious play and therefore can be used to find equilibria in two-player zero-sum games and potential games. A smooth fictitious play procedure is one in which the best-response, $B(\sigma)$, is derived from maximising a function of the form $r_i(\sigma) - \lambda v_i(\sigma_i)$ where,

1. $v_i(\sigma_i) : A_i \to \mathbb{R}$ is a strictly convex function.
2. The gradient of $v_i(\sigma_i)$ becomes arbitrarily large near the boundary of the strategy simplex, i.e. $\lim_{\sigma_i \to \partial A_i} |\nabla v_i(\sigma_i)| = \infty$,

which ensures that there exists a unique solution to the best-response, and that all pure strategies receive strictly positive probability in the best-response. We have shown that our variance measure is a strictly convex objective under the assumption that $\Sigma_i$ is positive-definite. Therefore, we need to show that the gradient satisfies the boundary condition. We start by showing that $\lim \|x_n\| = \|\lim x_n\|$ if $\lim x_n = x$.

Theorem. Let X and Y be normed spaces. If $\lim x_n = x$ in X and $T : X \to Y$ is continuous, then $\lim T(x_n) = T(\lim x_n)$.

Proof. Let $\epsilon > 0$. As T is continuous at x, by the epsilon-delta definition of continuous functions, there exists $\delta > 0$ such that

$$\|x - y\| < \delta \Rightarrow \|T(x) - T(y)\| < \epsilon.$$

As $\lim x_n = x$, there exists $n_0 \in \mathbb{N}$ such that

$$n > n_0 \Rightarrow \|x_n - x\| < \delta,$$

and it follows that

$$n > n_0 \Rightarrow \|T(x_n) - T(x)\| < \epsilon,$$

and thus $\lim T(x_n) = T(x) = T(\lim x_n)$. Since $T : X \to \mathbb{R}, \; x \mapsto \|x\|$ is continuous, we have $\lim \|x_n\| = \|\lim x_n\|$.

Next, we show that our gradient has a lower bound that satisfies the boundary condition. Note that for this proof we replace $\sigma$ with $x$ and $\Sigma$ with Cov, as the proof relies upon the singular value decomposition and the notation may otherwise become confusing. Due to the symmetry of Cov, $\nabla_x (x^T \text{Cov} \, x) = 2 \, \text{Cov} \, x = W x$, where $W = U \Sigma V^T$ is the singular value decomposition of $W$, and we show that as $x \to \partial A$, $\lim_{x \to \partial A} \|W x\| \ge +\infty$. At the boundary of the simplex, i.e. utilising a pure strategy, we are in the specific case of a mixed-strategy where only Dirac probability distributions can be used. Therefore, in the limit there is infinite density upon the pure strategy at the edge of the simplex and we have that $\lim_{x \to \partial A} x = +\infty$. We can substitute this into the above:

$$\begin{aligned}
\lim_{x \to \partial A} \|W x\| &= \lim_{x \to \partial A} \|U \Sigma V^T x\| \\
&= \lim_{x \to \partial A} \|\Sigma V^T x\| && \text{as } U \text{ is orthogonal} \\
&= \lim_{x \to \partial A} \|\Sigma (V^T x)\| \\
&= \lim_{x \to \partial A} \sum_i \sigma_i |(V^T x)_i| && \text{where } \sigma_i \text{ is the } i\text{-th singular value} \\
&\ge \lim_{x \to \partial A} \sigma_{\min} \sum_i |(V^T x)_i|,
\end{aligned}$$

so that, using the interchange of limit and norm established above,

$$\lim_{x \to \partial A} \|W x\| \ge \sigma_{\min} \, \| \lim_{x \to \partial A} x \| = \sigma_{\min} (+\infty).$$

As Cov is restricted to positive-definiteness, all singular values are strictly positive and we have the desired result $\lim_{x \to \partial A} \|W x\| \ge +\infty$. Therefore, our variance function is admissible as the perturbation function $v_i(\sigma_i)$ in stochastic fictitious play, and retains convergence guarantees.
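To make the role of the variance term as a perturbation of the best response more concrete, the following sketch runs a deterministic fictitious-play-style iteration on a symmetric two-action game. It is an illustration under stated assumptions, not the procedure of Eq. (7): the payoffs are hypothetical, the utility is assumed to be mean minus γ times variance, the best response is found by grid search, and because both players start from the same empirical mix a single number `q` tracks the empirical probability of the risky action.

```python
def weighted_cov(M, opp):
    """Assumed weighted covariance of row-action payoffs under the opponent mix."""
    n = len(M)
    means = [sum(w * u for w, u in zip(opp, row)) for row in M]
    return [[sum(w * (M[k][j] - means[k]) * (M[l][j] - means[l])
                 for j, w in enumerate(opp))
             for l in range(n)] for k in range(n)]


def best_response(M, opp, gamma, grid=500):
    """Probability of action 1 that maximises mean - gamma * variance (grid search)."""
    means = [sum(w * u for w, u in zip(opp, row)) for row in M]
    cov = weighted_cov(M, opp)

    def r(p):
        s = (1.0 - p, p)
        mean = s[0] * means[0] + s[1] * means[1]
        var = sum(s[k] * cov[k][l] * s[l] for k in range(2) for l in range(2))
        return mean - gamma * var

    return max((i / grid for i in range(grid + 1)), key=r)


def fictitious_play(M, gamma, q0=0.5, iters=300):
    """Average the best responses to the opponent's empirical mix over time."""
    q = q0
    for t in range(1, iters + 1):
        br = best_response(M, [1.0 - q, q], gamma)
        q += (br - q) / (t + 1)  # running time-average update
    return q


# Hypothetical driving-style game: action 0 = stay in lane, action 1 = overtake.
M = [[1.0, 1.0], [3.0, -10.0]]
q_averse = fictitious_play(M, gamma=0.5)
q_neutral = fictitious_play(M, gamma=0.0)
print(q_averse, q_neutral)
```

In this toy setting the risk-averse time-average places less weight on the risky action than the risk-neutral one, which hovers around the mixed NE probability of 2/13. The sketch only illustrates the qualitative effect; it says nothing about the convergence guarantees proved above.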

A.4 PROPOSITION 5 [SFP IS RAE]

PROPOSITION 5. Suppose the SFP sequence $\{Z_t\}$ converges to $\sigma$ in the observed strategy sense (footnote 1), then $\sigma$ is a Risk-Averse equilibrium.

Proof. Assume the observed strategy has converged to $\sigma = (\sigma^1, \sigma^2)$ and that the strategy is not an RAE. This implies there exists some $\sigma^{i,\prime}$ such that

$$r_i(\sigma^{i,\prime}, \sigma^{-i}) > r_i(\sigma^i, \sigma^{-i}).$$

However, because $\sigma$ has converged, the SFP sequence $\{Z_t\}$ also converges such that $\lim_{t \to \infty} Z_t = \sigma$, and because we are in an SFP process it must be the case that

$$r_i(\sigma^i, \sigma^{-i}) > r_i(\sigma^{i,\prime}, \sigma^{-i}) \quad \forall \sigma^{i,\prime} \in \Delta_i,$$

and therefore $\sigma^{i,\prime}$ cannot be a best response to $\sigma^{-i}$, which is a contradiction.

for each player i ∈ N do:

C FIGURE 3 TRAINING CURVES

4: Compute meta-policy $\pi_t$ by SFP (Eq. 7).
5: Find new policy by Oracle: $\phi^i_t = O^i(\pi^{-i}_t)$.
6: Expand $\Phi^i_{t+1} \leftarrow \Phi^i_t \cup \{\phi^i_t\}$.
7: Update meta-payoff $M_{t+1}$.
8: Return: $\pi$ and $\Phi$.

if $P(X \le p_{ii}) > 0.9$ do

E HYPERPARAMETER SETTINGS

for all other actions j do
11: Sample anti-coordination element $p_{ij} \sim U(0, 10)$
12: Set payoff matrix element $P_{ij} = P_{ji} = p_{ij}$
13: Return: P.

A simple 3 action example of a NFG generated following the above would be:

F.2 STAG HUNT GRID WORLD

Our stag-hunt environment is taken from (Peysakhovich & Lerer, 2017), where we slightly alter the parameters of the game. A 5 × 5 grid is used with 2 players, 1 stag and 2 plants randomly spawned in. The action set of the players is A = {left, right, up, down}. At every time-step the stag moves one grid space closer to the closest player on the grid; the plants do not move. There are 3 different reward signals in the game:

I QRE FAILURE CASE

In the following section we present results on the two-action driving game described in Sec. 1 of the main article and displayed in Fig. 5. We specifically utilise this game to show a failure case of QRE as a risk-sensitive solution. Ideally, a risk-sensitive solution concept would only play the Stay in Lane strategy, as the Overtake strategy has far too high a potential downside risk.


Figure 12: QRE and RAE results on two-action driving game.

As can be seen from the results in Fig. 6, for a large sample of QRE hyperparameters the equilibrium found has high variance, with potentially poor downside performance. We believe this is because the very large costs of the errors are easily picked up by variance analysis, but not so easily by the setup of QRE.
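The intuition behind this failure case can be reproduced on a toy version of the two-action driving game. The sketch below computes a symmetric logit QRE by bisection for several rationality parameters. The payoff values and the names `logit_qre` and `overtake_gap` are hypothetical and are not taken from Fig. 11; the logit form of QRE is the standard one, but the hyperparameter values are arbitrary.

```python
import math


def overtake_gap(p):
    """u(Overtake) - u(Stay) when the opponent overtakes with probability p.

    Hypothetical payoffs: Stay always gives 1, Overtake gives 3 if the
    opponent stays and -10 if both overtake.
    """
    return (3.0 * (1.0 - p) - 10.0 * p) - 1.0


def logit_qre(lam, payoff_gap, tol=1e-12):
    """Symmetric logit QRE probability of the risky action, found by bisection."""
    def g(p):
        # Logit response to p, minus p; decreasing in p for this game.
        return 1.0 / (1.0 + math.exp(-lam * payoff_gap(p))) - p

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


mixed_ne = 2.0 / 13.0  # overtake probability that makes the opponent indifferent
qre_probs = {lam: logit_qre(lam, overtake_gap) for lam in (0.1, 1.0, 10.0)}
print(mixed_ne, qre_probs)
```

For every rationality parameter tried, the logit response at the indifference point is 1/2, so the QRE overtake probability sits above the mixed NE probability rather than near zero. This is consistent with the observation above that QRE does not by itself discount the large downside of the Overtake action.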



Footnote 1: Convergence in the time-average $Z_t$ does not imply convergence in the actual strategy taken at each t, but may for example imply cyclic actual behaviour that results in average behaviour converging.



Figure 2: Iterative agent generation process. Note that u(•) and Var(•) are overloaded to represent utility/variance between distributions over a population or utility/variance between two policies.

$v_i(\sigma) : \Delta_A \to \mathbb{R}$ is a strictly convex function, and the gradient of $v_i(\sigma)$ becomes arbitrarily large near the boundary of the strategy simplex, i.e. $\lim_{\sigma \to \partial(A_i)} |\nabla v_i(\sigma)| = \infty$. We propose the following with regards to our total utility function (proofs are deferred to Appendix (A)).

Figure 3: a) SFP on NFGs with 100 actions, b) PSRO on NFGs with 500 actions. Both compare final expected utility vs. variance utility results. We plot RAE values for multiple γ to form an 'efficient frontier' and show that, whilst baselines achieve similar expected utilities they are always finding solutions that are too high in variance utility. In Fig. a) we exclude the Payoff Dominant result as it provided huge final variance utility, whilst in Fig. b) we exclude the THPE result for the same reason.

based on Fig. (1) for testing that our RAE in Fig. (1) is attainable in an RL setting. PPO RL agents are used.

Figure 4: Stag-hunt environment results. a) Visualisation of the environment b) Results for intrapopulation play c) Results for RAE population vs. Nash population.

Figure 5: Results on autonomous driving environment. a) Results on 100 episodes over 5 seeds b) Position heat-map for RAE solution c) Position heat-map for Nash solution.


Figure 9: Training curves over multiple seeds for Figure 3.

element P ii = |p ii | 5:

element P ij = P ji = p ij

Figure 10: Game where one strategy (dotted outline) provides a high return assuming successful coordination but high variance in case the opponent does not coordinate correctly.

Figure 11: Two-action driving risk game.



Hyper-parameter settings for our experiments.

We randomly generate coordination games with N actions in the following way:

Figure: SFP Robustness, Anti-Coordination Games (one curve per Gamma: 0.1, 0.3, 0.5, 0.7, 0.9, 1.0).

Figure: SFP Robustness, Coordination Games (one curve per Gamma: 0.1, 0.3, 0.5, 0.7, 0.9, 1.0).

G COMPUTE ARCHITECTURE

All experiments run on one machine with:
• AMD Ryzen Threadripper 3960X 24 Core
• 1 x NVIDIA GeForce RTX 3090

H ASYMMETRIC FORMULATION

In the following section we will show the formulation of Sec. 4 but for the asymmetric case.

H.1 UTILITY FUNCTION

Player i has an action set $A_i$, and a utility function $u_i$. We define the on-equilibrium utility of action $a^k_i \in A_i$ against action $a^{k'}_j \in A_j$ as $u(a^k_i, a^{k'}_j)$ and the full utility table for player i as $M_i$, where the entry $M^{k,k'}_i$ refers to $u(a^k_i, a^{k'}_j)$ and $M^k_i$ refers to $u(a^k_i, a^{k'}_j) \; \forall k'$, i.e. the vector of utilities that action $a^k_i$ receives against all other actions. Taking the 2-player case, we now define the utility of the mixed-strategy for player 1, $\sigma$, versus the mixed strategy for player 2, $\varsigma$, as

The weighted co-variance matrix for the utility matrix

where 2 ), i.e. the weighted average utility for action i. As we are trying to minimise variance with respect to the opponent strategy, we use a weighted covariance matrix such that the potential variance caused by each action is weighted by its probability of selection under the opponent strategy. As will be discussed later, all actions will receive positive probability under our framework and therefore will always provide some weight in the variance calculation. This allows us to define the mixed-strategy $\sigma$ utility variance as follows:

The final utility function r, which considers both on- and off-equilibrium utility for strategy $\sigma$, is,

where $\gamma \in \mathbb{R}$ is the risk-aversion parameter.
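For readers who want a concrete handle on the asymmetric case, the sketch below evaluates a risk-adjusted utility for each of two players with different payoff tables. Everything in it is an assumption made for illustration: the payoff tables `M1` and `M2`, the two mixed strategies, the function name `risk_adjusted_utility`, the weighted-covariance form (deviations from each action's weighted mean, weighted by the opponent's probabilities) and the combination mean minus γ times variance.

```python
def risk_adjusted_utility(own, opp, payoff, gamma):
    """Assumed r = mean - gamma * variance for one player.

    `payoff[k][j]` is this player's utility for own action k against
    opponent action j; `own` and `opp` are mixed strategies.
    """
    n = len(payoff)
    # Weighted average utility of each own action under the opponent mix.
    means = [sum(w * u for w, u in zip(opp, row)) for row in payoff]
    # Weighted covariance between the payoffs of own actions k and l.
    cov = [[sum(w * (payoff[k][j] - means[k]) * (payoff[l][j] - means[l])
                for j, w in enumerate(opp))
            for l in range(n)] for k in range(n)]
    mean = sum(s * m for s, m in zip(own, means))
    var = sum(own[k] * cov[k][l] * own[l] for k in range(n) for l in range(n))
    return mean - gamma * var


# Hypothetical asymmetric game: each table is indexed by its owner's action first.
M1 = [[1.0, 1.0], [3.0, -10.0]]
M2 = [[2.0, 2.0], [4.0, -20.0]]
sigma = [0.9, 0.1]      # player 1's mixed strategy
varsigma = [0.8, 0.2]   # player 2's mixed strategy

r1 = risk_adjusted_utility(sigma, varsigma, M1, gamma=0.5)
r2 = risk_adjusted_utility(varsigma, sigma, M2, gamma=0.5)
print(r1, r2)
```

The only difference from the symmetric case in this sketch is that each player evaluates its own payoff table against the other player's strategy, so the two utilities need not coincide even when both players face the same kind of risk.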

H.2 EQUILIBRIUM CONCEPT

We now define our new equilibrium concept based on the utility function (25). We start by defining the best-response map:

