SIMILARITY-BASED COOPERATION

Abstract

As machine learning agents act more autonomously in the world, they will increasingly interact with each other. Unfortunately, in many social dilemmas like the one-shot Prisoner's Dilemma, standard game theory predicts that ML agents will fail to cooperate with each other. Prior work has shown that one way to enable cooperative outcomes in the one-shot Prisoner's Dilemma is to make the agents mutually transparent to each other, i.e., to allow them to access one another's source code (Rubinstein, 1998; Tennenholtz, 2004), or their weights in the case of ML agents. However, full transparency is often unrealistic, whereas partial transparency is commonplace. Moreover, it is challenging for agents to learn their way to cooperation in the full transparency setting. In this paper, we introduce a more realistic setting in which agents only observe a single number indicating how similar they are to each other. We prove that this allows for the same set of cooperative outcomes as the full transparency setting. We also demonstrate experimentally that cooperation can be learned using simple ML methods.

1. INTRODUCTION

As AI systems start to autonomously interact with the world, they will also increasingly interact with each other. We already see this in contexts such as trading agents (CFTC & SEC, 2010), but the number of domains where separate AI agents interact with each other in the world is sure to grow; for example, consider autonomous vehicles. In the language of game theory, AI systems will play general-sum games with each other. For example, autonomous vehicles may find themselves in Game-of-Chicken-like dynamics with each other (cf. Fox et al., 2018). In many of these interactions, cooperative or even peaceful outcomes are not a given. For example, standard game theory famously predicts and recommends defecting in the one-shot Prisoner's Dilemma. Even when cooperative equilibria exist, there are typically many equilibria, including uncooperative and asymmetric ones. For instance, in the infinitely repeated Prisoner's Dilemma, mutual cooperation is played in some equilibria, but so is mutual defection, and so is the strategy profile in which one player cooperates 70% of the time while the other cooperates 100% of the time. Moreover, the strategies from different equilibria typically do not cooperate with each other. A recent line of work at the intersection of AI/(multi-agent) ML and game theory aims to increase AI/ML systems' ability to cooperate with each other (Stastny et al., 2021; Dafoe et al., 2021; Conitzer & Oesterheld, 2022). Prior work has proposed making AI agents mutually transparent to allow for cooperation in equilibrium (McAfee 1984; Howard 1988; Rubinstein 1998, Section 10.4; Tennenholtz 2004; Barasz et al. 2014; Critch 2019; Oesterheld 2019b). Roughly, this literature considers for any given 2-player normal-form game Γ the following program meta game: both players submit a computer program, e.g., some neural net, to choose actions in Γ on their behalf. The computer program then receives as input the computer program submitted by the other player.
Prior work has shown that the program meta game has cooperative equilibria in the Prisoner's Dilemma. Unfortunately, there are multiple obstacles to cooperation based on full mutual transparency. 1) While partial transparency is the norm, settings of full transparency are rare. For example, while GPT-3's architecture and training regime are public knowledge, the exact learned model is not. 2) Games played with full transparency in general have many equilibria, including ones that are much worse for some or all players than the Nash equilibria of the underlying game (see the folk theorems given by Rubinstein 1998, Section 10.4, and Tennenholtz 2004). In particular, full mutual transparency can make the problem of equilibrium selection harder. 3) The full transparency setting poses challenges to modern ML methods. In particular, it requires at least one of the models to receive as input a model that has at least as many parameters as itself. Meanwhile, most modern successes of ML use models that are orders of magnitude larger than their input. Consequently, we are not aware of successful projects on learning general-purpose models such as neural nets in the full transparency setting.

Contributions. In this paper we introduce a novel variant of program meta games called difference (diff) meta games that enables cooperation in equilibrium while also addressing Obstacles 1-3. As in the program meta game, we imagine that two players each submit a program or policy to instruct an agent to play a given game, such as the Prisoner's Dilemma. The main idea is that before choosing an action, the agents are given information about how similar the two players' policies are to each other w.r.t. how they make the present decision. We formally introduce this setup in Section 3.
For an informal illustration, see Figure 1a. Because it requires a much lower degree of mutual transparency, we find the diff meta game setup more realistic than the full mutual transparency setting. Thus, it addresses Obstacle 1 to cooperation based on full mutual transparency. Diff meta games can still have cooperative equilibria when the underlying base game does not. Specifically, in Prisoner's Dilemma-like games, there are equilibria in which both players submit policies that cooperate with similar policies and thus with each other. We call this phenomenon similarity-based cooperation (SBC). For example, consider the Prisoner's Dilemma as given in Table 1 for G = 3. (We study such examples in more detail in Section 3.) Imagine that the players can only submit threshold policies that cooperate if and only if the perceived difference to the opponent is at most θ_i. As a measure of difference, the policies observe diff(θ_1, θ_2) = |θ_1 - θ_2| + N, where N is sampled independently for each player according to the uniform distribution over [0, 1]. For instance, if Player 1 submits a threshold of 1/2 and Player 2 submits a threshold of 3/4, then the perceived difference is 1/4 + N. Hence, Player 1 cooperates with probability P(1/4 + N ≤ 1/2) = 1/4 and Player 2 cooperates with probability P(1/4 + N ≤ 3/4) = 1/2. It turns out that (θ_1 = 1, θ_2 = 1), which leads to mutual cooperation with probability 1, is a Nash equilibrium of the meta game. Intuitively, the only way for either player to defect more is to lower their threshold. But then |θ_1 - θ_2| will increase, which will cause the opponent to defect more (at a rate of 1/2). This outweighs the benefit of defecting more oneself. In Section 4, we prove a folk theorem for diff meta games. Roughly speaking, this result shows that merely observing a diff value is sufficient for enabling all the cooperative outcomes that full mutual transparency enables.
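The equilibrium claim in this example can be checked directly: with N ~ Uniform([0, 1]), a player with threshold θ_i cooperates with probability clip(θ_i - |θ_1 - θ_2|, 0, 1), so expected payoffs have a closed form. Below is a minimal sketch (payoffs and G = 3 follow Table 1; everything else is plain Python):

```python
G = 3  # Prisoner's Dilemma parameter from Table 1

def coop_prob(theta_i, theta_j):
    """P(|theta_i - theta_j| + N <= theta_i) for N ~ Uniform([0, 1])."""
    d = abs(theta_i - theta_j)
    return min(max(theta_i - d, 0.0), 1.0)

def payoff(theta1, theta2):
    """Player 1's expected payoff when both submit (C, theta, D) policies."""
    p, q = coop_prob(theta1, theta2), coop_prob(theta2, theta1)
    # Expectation over independent cooperation events, payoffs from Table 1
    return p*q*G + p*(1-q)*0 + (1-p)*q*(G+1) + (1-p)*(1-q)*1

# (theta_1, theta_2) = (1, 1) yields mutual cooperation with payoff G = 3;
# no unilateral deviation by Player 1 improves on it
u_star = payoff(1.0, 1.0)
for dev in [0.0, 0.3, 0.6, 0.8, 1.2, 1.5]:
    assert payoff(dev, 1.0) <= u_star
```

Lowering one's threshold increases |θ_1 - θ_2| and thereby the opponent's defection probability, which is exactly why the deviations checked in the loop do not pay.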
Specifically, we show that for every individually rational strategy profile σ (i.e., every strategy profile that is better for each player than their minimax payoff), there is a diff function such that σ is played in an equilibrium of the resulting diff meta game. Next, we address Obstacle 2 to full mutual transparency, the multiplicity of equilibria. First, note that any given measure of similarity will typically only enable a specific set of equilibria, much smaller than the set of individually rational strategy profiles. For instance, in the above example, all equilibria are symmetric.

Table 1: The Prisoner's Dilemma, parameterized by some number G > 1.

                          Player 2
                    Cooperate    Defect
Player 1 Cooperate    G, G       0, G+1
         Defect       G+1, 0     1, 1

In general, one would hope that similarity-based cooperation will generally result in symmetric outcomes in symmetric games. After all, the new equilibria of the diff game are based on submitting similar policies, and if two policies play different strategies against each other, they cannot be similar. In Section 5, we substantiate this intuition. Specifically, we prove, roughly speaking, that in symmetric, additively decomposable games, the Pareto-optimal equilibrium of the meta game is unique and gives both players the same utility, if the measure of difference between the agents satisfies a few intuitive requirements (Section 5). For example, in the Prisoner's Dilemma, the unique Pareto-optimal equilibrium of the meta game must be one in which both players cooperate with the same probability. Finally, we show that diff meta games address Obstacle 3: we demonstrate that in games with higher-dimensional action spaces, we can find cooperative equilibria of diff meta games with ML methods. In Section 8, we show that, if we initialize the two policies randomly and then let each of them learn to be a best response to the other, they generally converge to the Defect-Defect equilibrium.
This is expected based on results in similar contexts, such as in the Iterated Prisoner's Dilemma. However, in Section 6, we introduce a novel pretraining method that trains policies to cooperate against copies and defect against randomly generated policies. Our experiments show that policies pretrained in this way find partially cooperative equilibria of the diff game when trained against each other. We discuss how the present paper relates to prior work in Section 9. We conclude in Section 10 with some ideas for further work.

2. BACKGROUND

Elementary game theory definitions. We assume familiarity with game theory. For an introduction, see Osborne (2004). A (two-player, normal-form) game Γ = (A_1, A_2, u) consists of sets of actions or pure strategies A_1 and A_2 for the two players and a utility function u : A_1 × A_2 → R^2. Table 1 gives the Prisoner's Dilemma as a classic example of a game. A mixed strategy for Player i is a distribution over A_i. We denote the set of such distributions by ∆(A_i). We can extend u to mixed strategies by taking expectations, i.e., u(σ_1, σ_2) := Σ_{a_1 ∈ A_1, a_2 ∈ A_2} σ_1(a_1) σ_2(a_2) u(a_1, a_2). For any player i, we use -i to denote the other player. We call σ_i a best response to a strategy σ_{-i} ∈ ∆(A_{-i}) if supp(σ_i) ⊆ argmax_{a_i ∈ A_i} u_i(a_i, σ_{-i}), where supp denotes the support. A strategy profile σ ∈ ∆(A_1) × ∆(A_2) is a vector of strategies, one for each player. We call a strategy profile (σ_1, σ_2) a (strict) Nash equilibrium if σ_1 is a (unique) best response to σ_2 and vice versa. As first noted by Nash (1950), each finite game has at least one Nash equilibrium. We say that a strategy profile σ is individually rational if each player's payoff is at least their minimax payoff, i.e., if u_i(σ) ≥ max_{σ_i ∈ ∆(A_i)} min_{a_{-i} ∈ A_{-i}} u_i(σ_i, a_{-i}) for i = 1, 2. We say that σ is Pareto-optimal if there exists no σ′ s.t. u_i(σ′) ≥ u_i(σ) for i = 1, 2 and u_i(σ′) > u_i(σ) for at least one i.

Symmetric games and additively decomposable games. We say that a game is (player) symmetric if A_1 = A_2 and for all a_1, a_2 and i = 1, 2, we have u_i(a_1, a_2) = u_{-i}(a_2, a_1). The Prisoner's Dilemma in Table 1 is symmetric. We say that a game additively decomposes into (u_{i,j} : A_j → R)_{i,j ∈ {1,2}} if u_i(a_1, a_2) = u_{i,1}(a_1) + u_{i,2}(a_2) for all i ∈ {1, 2} and all a_1 ∈ A_1, a_2 ∈ A_2.
Intuitively, this means that each action a_j of Player j generates some amount of utility u_{i,j}(a_j) for Player i independently of what the other player plays. For example, the Prisoner's Dilemma in Table 1 is additively decomposable, where u_{i,i} : Cooperate ↦ 0, Defect ↦ 1 and u_{i,-i} : Cooperate ↦ G, Defect ↦ 0 for i = 1, 2. Intuitively, Cooperate generates G for the opponent and 0 for oneself, while Defect generates 1 for oneself and 0 for the opponent.

Alternating best response learning. The orthodox approach to learning in games is to learn to best respond to the opponent, essentially ignoring that the opponent is also a learning agent. In this paper, we specifically consider alternating best response (ABR) learning. In ABR, the players take turns. In each turn, one of the two players updates the parameters θ_i of her strategy to optimize u_i(θ_i, θ_{-i}), i.e., updates her model to be a best response to the opponent's current model (cf. Brown 1951; Heinrich et al. 2021; Zhang et al. 2022). Since learning an exact best response is generally intractable, we will specifically consider the use of gradient ascent in each turn to optimize u_i(θ_i, θ_{-i}) over θ_i. In continuous games, if ABR with exact (locally) best response updates converges to (θ_1, θ_2), then (θ_1, θ_2) is a (local) Nash equilibrium. Note, however, that ABR may fail to converge (e.g., in the face of Rock-Paper-Scissors dynamics). Moreover, if the best response updates of θ_i are only approximated, ABR may converge to non-equilibria (Mazumdar et al., 2020, Proposition 6).
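The decomposition of the Prisoner's Dilemma can be verified mechanically against Table 1. A small sketch (G = 3 is an arbitrary choice satisfying G > 1):

```python
G = 3  # Prisoner's Dilemma parameter from Table 1 (any G > 1 works)

# Table 1 payoffs: (u1, u2) as a function of the action pair
def u(a1, a2):
    pay = {("C", "C"): (G, G), ("C", "D"): (0, G + 1),
           ("D", "C"): (G + 1, 0), ("D", "D"): (1, 1)}
    return pay[(a1, a2)]

# Additive decomposition: own action contributes u_own = u_{i,i},
# the opponent's action contributes u_opp = u_{i,-i}
u_own = {"C": 0, "D": 1}
u_opp = {"C": G, "D": 0}

for a1 in "CD":
    for a2 in "CD":
        assert u(a1, a2) == (u_own[a1] + u_opp[a2], u_own[a2] + u_opp[a1])
```

The loop confirms that each player's Table 1 payoff is exactly the sum of a term depending only on her own action and a term depending only on the opponent's action.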

3. DIFF META GAMES

We now formally introduce diff meta games, the novel setup we consider throughout this paper. Given some base game Γ, we consider a new meta game played by two players whom we will call principals. Each principal i submits a policy. The two players' policies each observe a real-valued measure of how similar they are to each other. Based on this, the policies then output a (potentially mixed) strategy for the base game. Finally, the utility is realized as per the base game. Below we define this new game formally. This model is illustrated in Figure 1.

Definition 1. Let Γ = (A_1, A_2, u) be a game. A (diff-based) policy for Player i for Γ is a function R → ∆(A_i) mapping the perceived real-valued difference between the policies to a mixed strategy of the game. For i = 1, 2 let A_i ⊆ ∆(A_i)^R be a set of difference-based policies for Player i. Then a policy difference (diff) function for (A_1, A_2) is a stochastic function diff : A_1 × A_2 ⇝ R^2. For any two policies π_1, π_2, we say that (π_1, π_2) plays the strategy profile σ ∈ ∆(A_1) × ∆(A_2) of Γ if σ_i = E[π_i(diff_i(π_1, π_2))] for i = 1, 2. For sets of policies A_1, A_2 and difference function diff we then define the diff meta game (Γ, A_1, A_2, diff) to be the game (A_1, A_2, V), where V(π_1, π_2) := E[u((π_i(diff_i(π_1, π_2)))_{i=1,2})] for all π_1 ∈ A_1, π_2 ∈ A_2.

Note that Definition 1 does not put any restrictions on diff. For example, the above definition allows (diff(π_1, π_2))_i to be a real number whose binary representation uniquely specifies π_{-i}. This paper is dedicated to situations in which diff specifically represents some intuitive notion of how different the policies are, thus excluding such diff functions. Unfortunately, there are many different ways in which one could formalize this constraint, especially in asymmetric games. In Section 5 we will impose some restrictions along these lines, including symmetry.
Our folk theorem (Theorem 3 in Section 4) will similarly impose constraints on diff to avoid diff functions like the above. The rest of this section will study concrete examples of Definition 1. First, we define a particularly simple type of diff-based policy. Almost all of our theoretical analysis will be based on this class of policies.

Definition 2. Let θ ∈ R ∪ {-∞, ∞} and let σ_i^≤, σ_i^> ∈ ∆(A_i) be strategies for Player i for i = 1, 2. Then we define (σ_i^≤, θ, σ_i^>) to be the policy π s.t. π(d) = σ_i^≤ if d ≤ θ and π(d) = σ_i^> otherwise. We call policies of this form threshold policies. Let Ā_i denote the set of such threshold policies.

Throughout the rest of this section, we analyze the Prisoner's Dilemma as a specific example. We limit attention to threshold agents of the form (C, θ, D), i.e., policies that cooperate against similar opponents (diff below threshold θ) and defect against dissimilar opponents. This is because such policies can be used to form cooperative equilibria, while policies that always cooperate ((C, 1, C)) or policies that are more cooperative against less similar opponent policies (e.g., (D, 1, C)) cannot be used to form cooperative equilibria in the PD with a natural diff function. Policies of the form (C, θ, D) are uniquely specified by a single real number θ. A natural measure of the similarity between two policies θ_1, θ_2 is then the absolute difference |θ_1 - θ_2|. We allow diff to be noisy, however. We summarize this in the following.

Example 1. Let Γ be the Prisoner's Dilemma as per Table 1. Then consider the (Γ, Â_1, Â_2, diff) meta game where Â_i = {(C, θ_i, D) | θ_i ∈ R} and diff_i((C, θ_1, D), (C, θ_2, D)) = |θ_1 - θ_2| + N_i for i = 1, 2, where N_i is some real-valued random variable. The only open parameters of Example 1 are G (the parameter used in our definition of the Prisoner's Dilemma) and the noise distribution. Nevertheless, Example 1 is a rich setting that allows for nontrivial results.
We leave a detailed analysis for Appendix B and only give two specific results about equilibria here.

Proposition 1. Consider Example 1 with N_i ~ Uniform([0, ε]) i.i.d. for some ε ≥ 0 and with G ≥ 2. Then ((C, θ_1, D), (C, θ_2, D)) is a Nash equilibrium if and only if θ_1, θ_2 ≤ 0 or 0 < θ_1 = θ_2 ≤ ε. In the latter case, the equilibrium is strict if G > 2.

Another natural distribution to use for N_i is the normal distribution. The following result characterizes, for G = 2, the Nash equilibria of the diff meta game.

Proposition 2. Consider Example 1 with G = 2. Assume the N_i are i.i.d. for i = 1, 2 according to some unimodal distribution with mode ν that has positive measure on every interval. Then ((C, θ_1, D), (C, θ_2, D)) is a Nash equilibrium if and only if θ_1 = θ_2 ≤ ν.
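Proposition 1 can be spot-checked numerically: with N_i ~ Uniform([0, ε]), player i's cooperation probability is clip((θ_i - |θ_1 - θ_2|)/ε, 0, 1), so equilibrium conditions reduce to payoff comparisons over deviations. A sketch with the assumed example values G = 3 and ε = 0.5:

```python
G, EPS = 3, 0.5  # assumed example values: G >= 2, noise N ~ Uniform([0, EPS])

def coop_prob(theta_i, theta_j):
    """P(|theta_i - theta_j| + N <= theta_i) for N ~ Uniform([0, EPS])."""
    d = abs(theta_i - theta_j)
    return min(max((theta_i - d) / EPS, 0.0), 1.0)

def payoff(t1, t2):
    """Player 1's expected payoff in the Table 1 PD under (C, theta, D) policies."""
    p, q = coop_prob(t1, t2), coop_prob(t2, t1)
    return p*q*G + (1-p)*q*(G+1) + (1-p)*(1-q)*1

# theta_1 = theta_2 = 0.4 lies in (0, eps]: no deviation on a fine grid helps
u_eq = payoff(0.4, 0.4)
assert all(payoff(i / 100, 0.4) <= u_eq + 1e-12 for i in range(-100, 201))

# theta_1 = theta_2 = 0.6 > eps: deviating down to 0.5 is strictly profitable,
# so this is not an equilibrium, matching the "only if" direction.
assert payoff(0.5, 0.6) > payoff(0.6, 0.6)
```

The grid check is, of course, only a sanity test of the proposition's characterization, not a proof.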

4. A FOLK THEOREM FOR DIFF META GAMES

What are the Nash equilibria of a diff meta game on Γ? Of course, the answer depends on what diff function we use. A first answer is that Nash equilibria of Γ carry over to the diff meta game regardless of what diff function is used (assuming that at least all constant policies are available); see Proposition 15 in Appendix C.1. Any other equilibria of the diff meta game hinge on the use of the right diff function. In fact, if diff is constant and thus uninformative, the Nash equilibria of the diff meta game are exactly the Nash equilibria of Γ; see Proposition 16 in Appendix C.1. The more interesting question is for what strategy profiles σ there exists some diff function s.t. σ is played in an equilibrium of the resulting diff meta game. The following result answers this question.

Theorem 3 (folk theorem for diff meta games). Let Γ be a game and σ be a strategy profile for Γ. Let A_i ⊇ Ā_i for i = 1, 2. Then the following two statements are equivalent: 1. There is a diff function such that there is a Nash equilibrium (π_1, π_2) of the diff meta game (Γ, A_1, A_2, diff) s.t. (π_1, π_2) plays σ. 2. The strategy profile σ is individually rational (i.e., better than each player's minimax payoff). The result continues to hold true if we restrict attention to deterministic diff functions with diff_1 = diff_2 and diff_i(π_1, π_2) ∈ {0, 1} for i = 1, 2.

We leave the full proof to Appendix C.2, but give a short sketch of the construction for 2⇒1 here. For any σ, we construct the desired equilibrium from policies π*_i = (σ_i, 1/2, σ̃_i) for i = 1, 2, where σ̃_i is Player i's minimax strategy against Player -i. We then take any diff function s.t. diff(π*_i, π_{-i}) = (0, 0) if π_{-i} = π*_{-i} and diff(π*_i, π_{-i}) = (1, 1) otherwise.
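The 2⇒1 construction can be made concrete in the Prisoner's Dilemma, where the minimax (punishment) strategy is D and mutual cooperation is individually rational. A minimal sketch with target profile (C, C) and a binary diff function as in the theorem (G = 3 is an assumed example value):

```python
G = 3  # Prisoner's Dilemma parameter from Table 1
PAY = {("C", "C"): (G, G), ("C", "D"): (0, G + 1),
       ("D", "C"): (G + 1, 0), ("D", "D"): (1, 1)}

def star_policy(d):
    """pi*_i = (sigma_i, 1/2, minimax): cooperate below threshold, punish above."""
    return "C" if d <= 0.5 else "D"

def diff(pi1, pi2):
    """Binary diff: (0, 0) iff both principals submit pi*, else (1, 1)."""
    return (0, 0) if (pi1 is star_policy and pi2 is star_policy) else (1, 1)

def value(pi1, pi2):
    d1, d2 = diff(pi1, pi2)
    return PAY[(pi1(d1), pi2(d2))]

# On path: mutual cooperation with payoff G each
assert value(star_policy, star_policy) == (G, G)

# Any deviating policy triggers the punishment strategy D, so it earns at
# most the minimax payoff 1 < G; deviating is never profitable.
for dev in [lambda d: "C", lambda d: "D", lambda d: "D" if d <= 0.5 else "C"]:
    assert value(dev, star_policy)[0] <= 1
```

Identity comparison (`is`) stands in for the theorem's equality check on submitted policies; the point is only that the binary diff value suffices to trigger punishment against any deviation.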

5. A UNIQUENESS THEOREM

Theorem 3 allows for highly asymmetric similarity-based cooperation. For example, in the PD with, say, G = 2, Theorem 3 shows that with the right diff function, the strategy profile (C, 2/3·C + 1/3·D) is played in an equilibrium of the diff meta game of the PD. This seems odd, as one would expect similarity-based cooperation to result in playing symmetric strategy profiles. Note that, for example, all equilibria of Propositions 1 and 2 are symmetric. In this section, we show that under some restrictions on diff and the base game Γ, we can recover the symmetry intuition. We first need a few definitions of properties of diff. Let Γ be a symmetric game. We say that diff is minimized by copies if for all policies π, π′, all y and i = 1, 2, P(diff_i(π, π′) < y) ≤ P(diff_i(π, π) < y). For example, the diff function in Example 1 is minimized by copies. The diff functions in the proof of Theorem 3 are not in general minimized by copies when the given base game is symmetric. For example, to achieve (C, 2/3·C + 1/3·D) in equilibrium, the proof of Theorem 3 (as sketched above) uses the policies π*_1 = (C, 1/2, D) and π*_2 = (2/3·C + 1/3·D, 1/2, D) and a diff function with diff(π*_1, π*_2) = (0, 0) but diff(π*_1, π*_1) = (1, 1). If the base game is symmetric, we call diff symmetric if for all π_1, π_2, diff(π_1, π_2) is distributed the same as diff(π_2, π_1) and (diff_1(π_1, π_2), diff_2(π_1, π_2)) is distributed the same as (diff_2(π_1, π_2), diff_1(π_1, π_2)). Finally, we need a more complicated but nonetheless intuitive property of diff functions. In this paper, we generally imagine that low values of diff are informative about the other player's policy. In contrast, we will here assume that high values of diff are uninformative. That is, for any σ_i and π_{-i}, we will assume that there is a policy π_i that plays σ_i against π_{-i} and triggers the above-threshold behavior of π_{-i} with the highest possible probability.
Appendix D.1.1 shows why this assumption is necessary. Formally, let π_{-i} = (σ_{-i}^≤, θ_{-i}, σ_{-i}^>) be any threshold policy. Let p be the highest number such that there is a π_i s.t. in (π_i, π_{-i}), Player -i plays arbitrarily close to (1-p)·σ_{-i}^≤ + p·σ_{-i}^>, and define σ^max_{π_{-i}} = (1-p)·σ_{-i}^≤ + p·σ_{-i}^>. Intuitively, σ^max_{π_{-i}} is the strategy played by π_{-i} against the most different opponent policies. For the examples of Section 3 we have p = 1 and thus simply σ^max_{π_{-i}} = σ_{-i}^>. But if diff is bounded (as in the proof of Theorem 3), then we might have p = 0 or anything in between.

Definition 3. We call diff : Ā_1 × Ā_2 ⇝ R^2 high-value uninformative if for each threshold policy π_{-i}, each σ_i and each ε > 0, there is a threshold policy π_i such that in (π_i, π_{-i}), a strategy profile within ε of (σ_i, σ^max_{π_{-i}}) is played.

We are now ready to state a uniqueness result for the Nash equilibria of diff meta games.

Theorem 4. Let Γ be a player-symmetric, additively decomposable game. Let diff be symmetric, high-value uninformative, and minimized by copies. Then if (π_1, π_2) is a Nash equilibrium that is not Pareto-dominated by another Nash equilibrium, we have that V_1(π_1, π_2) = V_2(π_1, π_2). Hence, if there exists a Pareto-optimal Nash equilibrium, its payoffs are unique, Pareto-dominant among Nash equilibria, and equal across the two players.

We prove Theorem 4 in Appendix D.3. Roughly, we first prove that under the given assumptions, equilibrium policies are more beneficial to the opponent when they observe a diff value below the threshold than when they observe a diff value above the threshold. Second, we show (using the first fact) that if in a given strategy profile Principal i receives a lower utility than Principal -i, then Principal i can increase her utility by submitting a copy of Principal -i's policy. Appendix D.1 shows why the assumptions (additive decomposability of the game and high-value uninformativeness and symmetry of diff) are necessary.

6. A NOVEL PRETRAINING METHOD FOR SIMILARITY-BASED COOPERATION

We now describe a simple machine learning method that we use to find cooperative equilibria in more complex games. To use this method, we consider neural net policies π_θ parameterized by a real vector θ. We call this procedure Cooperate against Copies and Defect against Random (CCDR) pretraining based on its intended effect in Prisoner's Dilemma-like games. First, for any given diff game, let V_d : (R^m)^(R^{n+1}) × (R^m)^(R^{n+1}) → R^2 be the utility of a version of the game in which diff is non-noisy. Then we pretrain each model π_{θ_i} to maximize V_d(π_{θ_i}, π_{θ_i}) + V_d(π_{θ_i}, π_{θ'_{-i}}) for randomly sampled θ'_{-i}. That is, each player i pretrains their policy π_{θ_i} to do well in both of the following scenarios: Principal -i copies Principal i's model; and Principal -i generates a random model. CCDR pretraining is motivated by two considerations. First, in games like the HDPD, it can be shown that there exist cooperative equilibria between policies that cooperate at a diff value of 0 and defect as the perceived diff value increases. We give a toy model of this in Appendix E. CCDR puts in place the rudimentary structure of these equilibria. Note, however, that CCDR does not directly optimize for the model's ability to form a cooperative equilibrium. Second, CCDR can be thought of as a form of curriculum training. Before trying to play diff games against other (different but similar) learned agents, we might first train a policy to solve two (conceptually and technically) easier related problems.
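As an illustration in the one-dimensional threshold setting of Section 3 (not the neural-net setup used in our experiments), the CCDR objective can be written down and optimized directly; here grid search stands in for gradient ascent, and the opponent range [-2, 2] and G = 3 are assumed example values:

```python
import random

G = 3  # Prisoner's Dilemma parameter from Table 1
PAY = {("C", "C"): (G, G), ("C", "D"): (0, G + 1),
       ("D", "C"): (G + 1, 0), ("D", "D"): (1, 1)}

def act(theta_self, theta_other):
    """Threshold policy under noiseless diff: C iff |t1 - t2| <= own threshold."""
    return "C" if abs(theta_self - theta_other) <= theta_self else "D"

def V1(t1, t2):
    return PAY[(act(t1, t2), act(t2, t1))][0]

rng = random.Random(0)
random_opponents = [rng.uniform(-2.0, 2.0) for _ in range(2000)]

def ccdr_objective(theta):
    """J(theta) = V_d(pi_theta, pi_theta) + E_{theta'}[V_d(pi_theta, pi_theta')]."""
    vs_copy = V1(theta, theta)
    vs_rand = sum(V1(theta, t) for t in random_opponents) / len(random_opponents)
    return vs_copy + vs_rand

# Crude "pretraining" by grid search instead of gradient ascent
best = max((i / 20 for i in range(-40, 41)), key=ccdr_objective)
assert V1(best, best) == G  # the selected policy cooperates with its copy
```

In this toy setting, the selected threshold cooperates with its exact copy while defecting against most random opponents, which is exactly the rudimentary equilibrium structure CCDR is meant to instill before ABR training begins.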

7. A HIGH-DIMENSIONAL ONE-SHOT PRISONER'S DILEMMA

To study similarity-based cooperation in an ML context, we need a more complex version of the Prisoner's Dilemma. The complex Prisoner's Dilemma-like games studied by the multi-agent learning community generally offer other mechanisms that establish cooperative equilibria (e.g., playing a game repeatedly). For our experiments, however, we specifically need SBC to be the only mechanism to establish cooperation. We therefore introduce a new game, the High-Dimensional (one-shot) Prisoner's Dilemma (HDPD). The goal is to give a variant of the one-shot Prisoner's Dilemma that is conceptually simple but introduces scalable complexity that makes finding, for example, exact best responses in the diff meta game intractable. In addition to G, the HDPD is parameterized by two functions f_C, f_D : R^n → R^m representing the two actions Cooperate and Defect, respectively, as well as a probability measure µ over R^n. Each player's action is also a function f_i : R^n → R^m. This is illustrated in Figure 2 for the case of n = 1 and m = 2. (Figure 2: the functions f_1 and f_2 are the actions chosen by the players; we first sample a point x (red) and then calculate the distances of f_i(x) to f_C(x) and f_D(x) to determine both players' losses.) For any pair of actions f_1, f_2, payoffs are then determined as follows. First, we sample some x from R^n according to µ. In Figure 2, this is represented by a red dot on the x axis. Then we consider the distance d(f_1(x), f_C(x)) to determine, roughly speaking, how much Player 1 cooperates. Here, d denotes the Euclidean distance. The larger the distance, the less cooperative is f_1. This distance is visualized in Figure 2 by the arrow from (x, f_1(x)) to (x, f_C(x)). We analogously determine how much the players defect. Formally, we define u_i(f_1, f_2) = -E_{x~µ}[d(f_i(x), f_D(x)) + G·d(f_{-i}(x), f_C(x))] / E_{x~µ}[d(f_C(x), f_D(x))].
Thus, the action f_i = f_D corresponds to defecting and f_i = f_C corresponds to cooperating; e.g., u(f_C, f_C) = (-1, -1) and u(f_D, f_D) = (-G, -G). The unique equilibrium of this game is (f_D, f_D). In our experiments, we specifically used G = 5. If we further let µ be uniform over [0, 1], the utilities for Figure 2 come out at about -5.501 for Player 1 and -2.599 for Player 2. We consider a diff meta game on the HDPD. Formally, a diff-based policy for the HDPD is a function R → (R^m)^(R^n). For notational convenience, we will instead write policies as functions R^{n+1} → R^m. We then define our diff function by diff_i(π_1, π_2) = E_{(y,x)~ν}[d(π_1(y, x), π_2(y, x))] + N_i, where ν is some probability distribution over R^{n+1} and N_i is some real-valued noise.
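The HDPD utility is straightforward to implement for n = 1, m = 2. In the sketch below, the choices of f_C, f_D and μ are hypothetical stand-ins (the text leaves them open); the sanity checks u(f_C, f_C) = (-1, -1) and u(f_D, f_D) = (-G, -G) hold for any such choice:

```python
import math
import random

G = 5  # HDPD parameter used in the experiments

# Hypothetical choices of f_C, f_D : R -> R^2 and of the measure mu;
# the paper leaves these open.
def f_C(x):
    return (math.sin(3 * x), x)

def f_D(x):
    return (math.cos(3 * x), -x)

def dist(a, b):
    """Euclidean distance d in R^m."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)
XS = [rng.random() for _ in range(5000)]  # Monte Carlo samples, mu = Uniform([0, 1])

def u(f1, f2):
    """u_i = -E[d(f_i, f_D) + G * d(f_-i, f_C)] / E[d(f_C, f_D)]."""
    norm = sum(dist(f_C(x), f_D(x)) for x in XS)
    u1 = -sum(dist(f1(x), f_D(x)) + G * dist(f2(x), f_C(x)) for x in XS) / norm
    u2 = -sum(dist(f2(x), f_D(x)) + G * dist(f1(x), f_C(x)) for x in XS) / norm
    return u1, u2

# Sanity checks from the text
assert all(abs(v + 1) < 1e-9 for v in u(f_C, f_C))  # u(f_C, f_C) = (-1, -1)
assert all(abs(v + G) < 1e-9 for v in u(f_D, f_D))  # u(f_D, f_D) = (-G, -G)
```

Note that the normalization by E[d(f_C, f_D)] makes these two payoffs exact regardless of the particular f_C, f_D and μ, since the same Monte Carlo sum appears in numerator and denominator.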

8. EXPERIMENTS

Our results so far demonstrate the theoretical viability of similarity-based cooperation, but leave open questions regarding its practicality. In this section, we address one of these questions: in complex environments, where cooperating and defecting are by themselves complex operations, can we find the cooperative equilibria for a given diff function with machine learning methods?

Experimental setup. We trained on the environment from Section 7. We selected a fixed set of hyperparameters based on prior exploratory experiments and the theoretical considerations in Appendix E. We then randomly initialized θ_1 and θ_2 and CCDR-pretrained them. Finally, we trained θ_1 and θ_2 against each other using ABR. We repeated the experiment with 28 different seeds. As a control, we also ran the experiment without CCDR pretraining (on 26 seeds). We also ran experiments with Learning with Opponent-Learning Awareness (LOLA) (Foerster et al., 2018), which we report in Appendix G.

Results. First, we observe that in the runs without CCDR pretraining, the players generally converge to mutual defection during alternating best response learning. In particular, in all 26 runs, at least one player's utility was below -5; only two runs had a utility above -5 for one of the players (-4.997 and -4.554). The average utility across the 26 runs and across the two players was -5.257 with a standard deviation of 0.1978. Anecdotally, these results are robust: ABR without pretraining practically never finds cooperative equilibria in the HDPD. Second, we observe that in all 28 runs, CCDR pretraining qualitatively yields the desired policy models, i.e., a policy that cooperates at low values of diff and gradually comes closer to defecting at high values of diff. Figure 3a shows a representative example. Our main positive experimental result is that after CCDR pretraining, the models converged in alternating best response learning to a partially cooperative equilibrium in 26 out of 28 runs.
Thus, the cooperative equilibria postulated in general by Theorem 3 and in simplified examples by Propositions 1 and 2 (as well as Proposition 24) do indeed exist and can be found with simple methods. The minimum utility of either player across the 26 successful runs was -4.854. The average utility across all runs and the two players was about -2.77 and thus a little closer to u(f_C, f_C) = -1 than to u(f_D, f_D) = -5. The standard deviation was about 1.19. Figure 3b shows the losses (i.e., the negated utilities) across ABR learning. Generally, the policies also converge to receiving approximately the same utility (cf. Section 5). The average of the absolute differences in utility between the two players at the end of the 28 runs is about 0.04 with a standard deviation of 0.05. We see that, in line with Theorem 4, we tend to learn egalitarian equilibria in this symmetric, additively decomposable setting. After alternating best response learning, the models generally have a similar structure to the model in Figure 3a, though often they cooperate only a little at low diff values. Based on prior exploratory experiments, CCDR's success is moderately robust.

Discussion. Without pretraining, ABR learning unsurprisingly converges to mutual defection. This is due to a bootstrapping problem. Submitting a policy of the form "cooperate with similar policies, defect against different policies" is a unique best response if the opponent submits a model of this form as well. If the opponent model π_{-i} is not of this form, then any policy π_i that defects, i.e., that satisfies π_i(diff(π_i, π_{-i})) = f_D, is a best response. Because f_C is complex, learning a model that cooperates at all is unlikely. (Even if f_C were simple, the appropriate use of the perceived diff value would still be specific and thus unlikely to be found by chance.)
Similar failures to find the more complicated cooperative equilibria by default have also been observed in the iterated PD (Sandholm & Crites 1996; Foerster et al. 2018; Letcher et al. 2019) and in the open-source PD (Hutter, 2020) (cf. Section 9.1). Opponent shaping methods have been used successfully to learn to cooperate both in the iterated Prisoner's Dilemma (Foerster et al. 2018; Letcher et al. 2019) and the open-source Prisoner's Dilemma (Hutter, 2020). Our experiments in Appendix G show that LOLA can also learn SBC, but unfortunately not as robustly as CCDR pretraining. CCDR pretraining reliably finds models that cooperate with each other and that continue to partially cooperate with each other throughout ABR training. This shows that when given some guidance, ABR can find similarity-based cooperative equilibria. We conclude from our experiments that SBC is a suitable means of establishing cooperation between modern ML agents. That said, CCDR also has some limitations that we hope can be addressed in future work. For one, in many games, optimal play against randomly generated opponents is unreasonable when facing a rational opponent. Second, our experiments show that while the two policies almost fully cooperate after CCDR pretraining, they quickly partially unlearn to cooperate in the ABR phase. We would prefer a method that preserves closer to full cooperation throughout ABR-style training. Third, while CCDR seems to often work, it can certainly fail in games in which SBC is possible. For instance, CCDR may sometimes result in insufficiently steep incentive curves. We suspect that to make progress on the latter issues we need training procedures that more explicitly reason about incentives à la opponent shaping (cf. our experiments with LOLA in Appendix G).

9.1. PROGRAM EQUILIBRIUM

We already discussed in Section 1 the literature on program meta games in which players submit computer programs as policies and the programs fully observe each other's code (McAfee 1984; Howard 1988; Rubinstein 1998, Section 10.4; Tennenholtz 2004). Interestingly, some constructions for equilibria in program meta games are similarity based. For example, the earliest cooperative program equilibrium for the Prisoner's Dilemma, described in all four of the above-cited papers, is the program "Cooperate if the opponent's program is equal to this program; else Defect". Other approaches to program equilibrium cannot be interpreted as similarity based, however (see, e.g., Barasz et al., 2014; Critch, 2019; Oesterheld, 2019b). To our knowledge, the only published work on ML in program equilibrium is due to Hutter (2020). It studies simple normal-form games and assumes the programs to have the structure proposed by Oesterheld (2019b), thus leaving only a few parameters open. Similar to our experiments, Hutter shows that best response learning fails to converge to the cooperative equilibria. In Hutter's experiments, the opponent shaping methods LOLA (Foerster et al., 2018) and SOS (Letcher et al., 2019) converge to mutual cooperation.

9.2. DECISION THEORY AND NEWCOMB'S PROBLEM

Brams (1975) and Lewis (1979) have pointed out that the Prisoner's Dilemma against a similar opponent closely resembles Newcomb's problem, a problem first introduced to the decision-theoretic literature by Nozick (1969). Most of the literature on Newcomb's problem concerns the normative, philosophical question of whether one should cooperate or defect when playing a Prisoner's Dilemma against an exact copy. Our work is inspired by the idea that in some circumstances one should cooperate with similar opponents. However, this literature only informally discusses the question of whether to also cooperate with agents other than exact copies (e.g., Hofstadter 1983; Drescher 2006, Ch. 7; Ahmed 2014, Sect. 4.6.3). We address this question formally. One idea behind the present project, as well as the program game literature, is to analyze a decision situation from the perspective of (actual or hypothetical) principals who design policies. The principals find themselves in an ordinary strategic situation. This is how our analysis avoids the philosophical issues that arise from the agent's perspective. Similar changes in perspective have been discussed in the literature on Newcomb's problem (e.g., Gauthier 1989; Oesterheld & Conitzer 2022).

9.3. LEARNING IN NEWCOMB-LIKE DECISION PROBLEMS

There is some existing work on learning in Newcomb-like environments, which therefore also applies to the Prisoner's Dilemma against a copy. Whether cooperation against a copy is learned generally depends on the learning scheme. Bell et al. (2021) show that Q-learning with a softmax policy learns to defect. Regret minimization also learns to defect. Other learning schemes do converge to cooperating against exact copies (Albert & Heiner, 2001; Mayer et al., 2016; Oesterheld, 2019a; Oesterheld et al., 2021). All schemes in prior work differ from the present setup, however, and to our knowledge none offer a model of cooperation between similar but non-identical agents.
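The claim about regret minimization can be illustrated with a minimal regret-matching simulation (in the style of Hart & Mas-Colell). The payoffs and the model of the copy, whose realized action is assumed to equal the agent's in every round, are our illustrative assumptions.

```python
import random

# Regret matching in a repeated PD against an exact copy: the copy is
# modeled as producing the identical realized action each round.
# Diagonal payoffs match the utilities quoted earlier in the paper;
# off-diagonal values are assumptions.
U = {('C', 'C'): -1, ('C', 'D'): -6, ('D', 'C'): 0, ('D', 'D'): -5}
ACTIONS = ['C', 'D']

def run(rounds=1000, seed=0):
    rng = random.Random(seed)
    regret = {'C': 0.0, 'D': 0.0}
    history = []
    for _ in range(rounds):
        pos = {a: max(regret[a], 0.0) for a in ACTIONS}
        if sum(pos.values()) > 0:
            a = rng.choices(ACTIONS, weights=[pos[x] for x in ACTIONS])[0]
        else:
            a = rng.choice(ACTIONS)  # no positive regret: play uniformly
        opp = a  # the exact copy plays the same realized action
        for alt in ACTIONS:
            regret[alt] += U[(alt, opp)] - U[(a, opp)]
        history.append(a)
    return history

print(run()[-5:])  # play locks onto defection: ['D', 'D', 'D', 'D', 'D']
```

Because defection dominates round by round, any round of cooperation creates positive regret for D and none for C, after which the agent plays D forever, even though always cooperating would pay more against a copy.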

10. CONCLUSION AND FUTURE WORK

We make a strong case for the promise of similarity-based cooperation as a means of improving outcomes from interactions between ML agents. At the same time, there are many avenues for future work. On the theoretical side, we would be especially interested in generalizations of Theorem 4, that is, theorems that tell us what outcomes we should expect in diff meta games. Is it true more generally that under reasonable assumptions about the diff function, we can expect similarity-based cooperation to result in fairly specific, symmetric, Pareto-optimal outcomes? We are also interested in further experimental investigations of SBC. We hope that future work can improve on our results in the HDPD in terms of robustness and degree of cooperation. Besides that, we think a natural next step is to study settings in which the agents observe their similarity to one another in a more realistic fashion. For example, we conjecture that similarity-based cooperation can occur when the agents can determine that their policies were generated by similar learning procedures.




Figure 1: (a) Illustration of the diff meta game of a Prisoner's Dilemma. (b) A graphical representation of diff meta games (Definition 1). Nodes with two incoming edges are determined by applying one of the parent nodes to the other.

Figure 2: The figure describes how utilities are calculated in the HDPD. Here the functions f_1 and f_2 are the actions chosen by the players. First, we sample a point x (red). We then calculate the distance of f_i(x) to f_C(x) and to f_D(x) to determine both players' losses.

Figure 3: (a) The behavior of a CCDR pretrained policy. For each perceived diff y to the opponent, the graph shows the expected distance of the learned policy's choice to f_C and to f_D. (b) Losses of Player 1 in 10 runs through the ABR phase.

