NO-REGRET LEARNING IN STRONGLY MONOTONE GAMES CONVERGES TO A NASH EQUILIBRIUM

Abstract

This paper studies a class of online games involving multiple agents with continuous actions that aim to minimize their local loss functions. An open question in the study of online games is whether no-regret learning for such agents leads to a Nash equilibrium. We address this question by providing a sufficient condition for strongly monotone games that guarantees Nash equilibrium convergence in a time-average sense, regardless of the specific learning algorithm, assuming only that it is no-regret. Furthermore, we show that the class of games for which no-regret learning leads to a Nash equilibrium can be expanded if some further information on the learning algorithm is known. Specifically, we provide relaxed sufficient conditions for first-order and zeroth-order gradient descent algorithms as well as for best response algorithms in which agents choose actions that best respond to other agents' actions during the last episode. We analyze the convergence rate for these algorithms and present numerical experiments on three economic market problems to illustrate their performance.

1. INTRODUCTION

Online convex optimization (Hazan et al., 2016; Shalev-Shwartz et al., 2012) is used to solve decision-making problems where the cost function is unknown and optimal actions are selected with only incomplete information. Recently, online convex optimization has also been employed for the solution of games involving multiple agents, with applications ranging from traffic routing (Sessa et al., 2019) to economic market optimization (Narang et al., 2022; Wang et al., 2022; Lin et al., 2020). In these online convex games, agents simultaneously take actions to minimize their loss functions, which depend on the other agents' actions. Generally, agents in non-cooperative games have access to limited information. For example, they may not be able to observe the actions of other agents and may not even know the exact game mechanism. As a result, rational agents will focus on sequentially learning their individual optimal actions at the expense of other agents, and their ability to do so efficiently can be quantified using notions of regret that capture the cumulative loss of the learned online actions compared to the best actions in hindsight. An algorithm is said to achieve no-regret learning if the regret of the sequence of online actions generated by this algorithm is sub-linear in the total number of episodes $T$. While no-regret learning has been studied for a variety of games, see, e.g., Sessa et al. (2019); Tatarenko & Kamgarpour (2018); Wang et al. (2022); Daskalakis et al. (2021); Anagnostides et al. (2022), the analysis of regret alone is not sufficient to characterize the limit points of a learning algorithm, i.e., the limit of the sequence of actions taken by the algorithm. In fact, no-regret learning may not converge at all and can exhibit limit cycles, as shown in Mertikopoulos et al. (2018).
In this paper, we adopt the notion of a Nash equilibrium, which describes a stable point at which the agents have no incentives to change their actions, to analyze the convergence properties of no-regret algorithms for online convex games. A growing literature has recently focused on showing Nash equilibrium convergence for online games; see, e.g., Bravo et al. (2018); Tatarenko & Kamgarpour (2020); Drusvyatskiy & Ratliff (2021); Lin et al. (2021; 2020); Mertikopoulos & Zhou (2019); Narang et al. (2022); Heliou et al. (2017); Golowich et al. (2020); Azizian et al. (2020). For example, for potential games with finite actions, Heliou et al. (2017) show that the sequence of play returned by the exponential weight algorithm converges to the Nash equilibrium. On the other hand, for games with continuous actions, strong monotonicity, which ensures the uniqueness of the Nash equilibrium (Rosen, 1965), is a sufficient condition for Nash equilibrium convergence for many specific learning algorithms, including the mirror descent algorithm (Bravo et al., 2018; Lin et al., 2021), the dual averaging algorithm (Mertikopoulos & Zhou, 2019), and the derivative-free algorithm (Drusvyatskiy & Ratliff, 2021; Narang et al., 2022). In addition, Golowich et al. (2020) propose an optimistic gradient algorithm that achieves tight last-iterate convergence for smooth monotone games. Similarly, Lin et al. (2020) investigate last-iterate convergence for continuous games with unconstrained action sets that satisfy a so-called 'cocoercive' condition, which includes a broader class of games with potentially many Nash equilibria. However, all these works analyze the convergence and/or regret for specific learning algorithms and under assumptions that depend on this specific choice of algorithms and games. In this paper, we follow a different approach and instead focus on understanding for what classes of games and learning algorithms Nash equilibrium convergence can be guaranteed.
Specifically, we are interested in understanding whether and for what class of online convex games with continuous action sets no-regret learning converges to a Nash equilibrium regardless of the specific algorithm. Moreover, we are interested in understanding whether and how this class of online convex games can be expanded when the no-regret learning algorithm is known. In our main result, we show that for $m$-strongly monotone games with parameter $m > 2L\sqrt{N-1}$, where $L$ is a Lipschitz constant of the gradient function with respect to the actions of the other agents and $N$ is the number of agents, any no-regret algorithm leads to Nash equilibrium convergence. While Nash equilibrium convergence has been analyzed for different combinations of learning algorithms and games, to the best of our knowledge, this is the first effort to understand for what classes of games and learning algorithms Nash equilibrium convergence can be guaranteed, and thus bridge regret analysis with Nash equilibrium convergence in games. We note that this result applies to any no-regret algorithm and thus can provide theoretical support for the convergence of any such algorithm for which regret analysis is easy but Nash equilibrium convergence is difficult to show. Furthermore, we show that the class of games with $m > 2L\sqrt{N-1}$ can be expanded if additional information about a specific no-regret algorithm is known. First, for the class of gradient-descent (GD) algorithms, including first-order and zeroth-order algorithms, we show that $m > 0$ is a sufficient condition for Nash equilibrium convergence. Note that Drusvyatskiy & Ratliff (2021) also show convergence of the zeroth-order algorithm to a Nash equilibrium, but with the additional assumption that the Jacobian of the gradient function is Lipschitz continuous, which they need to ensure that the smoothed game induced by the zeroth-order oracle is strongly monotone.
In this work, we show that this assumption is not necessary and Nash equilibrium convergence can still be guaranteed even when the smoothed game is not strongly monotone. In addition, we study the class of best response algorithms, where every agent selects the best action in the next episode given the other agents' current actions. Best response algorithms have been studied for several classes of games, including potential games (Swenson et al., 2018; Durand & Gaujal, 2016) and zero-sum games (Leslie et al., 2020). However, none of these works study games with continuous actions or provide sufficient conditions that guarantee convergence to a Nash equilibrium. We show that for $m$-strongly monotone games, the best response algorithm ensures Nash equilibrium convergence if $m > L\sqrt{N-1}$. This is, to the best of our knowledge, the first convergence analysis of the best response algorithm in continuous games. We numerically validate the proposed algorithms using three online marketing examples, specifically the Cournot game, the Kelly auction, and the retailer pricing competition, that satisfy different conditions on the parameter $m$ and, therefore, belong to different game classes. We show that for games that do not satisfy the sufficient condition $m > L\sqrt{N-1}$, such as the Cournot game, the best response algorithm may diverge. As a result, gradient descent algorithms may be better suited to solve games in this class. We also compare the performance of these algorithms for games for which Nash equilibrium convergence is guaranteed. We observe that when $m > L\sqrt{N-1}$, the best response algorithm outperforms first-order gradient descent which, in turn, outperforms the zeroth-order method. On the other hand, when $0 < m < L\sqrt{N-1}$, first-order gradient descent outperforms the zeroth-order method.
In summary, by defining sufficient conditions for Nash equilibrium convergence that depend only on the properties of the game, i.e., the parameter $m$, and not on the learning algorithm used to solve it, our analysis allows us to identify classes of games for which no-regret learning guarantees convergence to a Nash equilibrium without analyzing specific algorithms, or to identify specific no-regret learning algorithms with no guaranteed convergence to a Nash equilibrium, both fundamental questions in the analysis of online convex games. The rest of the paper is organized as follows. In Section 2, we define the online convex games under consideration and provide some assumptions. The main result is presented in Section 3, where we provide sufficient conditions on the class of games for which no-regret learning leads to Nash equilibrium convergence. In Sections 4 and 5, we study two specific classes of no-regret learning algorithms and show that the classes of games for which Nash equilibrium convergence can be guaranteed can be expanded if the algorithm is known. In Section 6, we use three online marketing examples to validate the proposed algorithms and conditions. We conclude this work in Section 7.

2. PROBLEM DEFINITION

Consider an online convex game with $N$ agents, whose goal is to learn their best individual actions that minimize their local loss functions. For each agent $i \in \mathcal{N} = \{1, \ldots, N\}$, denote by $C_i(x_i, x_{-i}) : \mathcal{X} \to \mathbb{R}$ the individual loss function, where $x_i \in \mathcal{X}_i$ is the action of agent $i$, $x_{-i}$ are the actions of all agents except for agent $i$, and we define $\mathcal{X} = \Pi_{i=1}^N \mathcal{X}_i$ to be the joint action space, since each agent takes actions independently. For ease of notation, we collect all agents' actions in a vector $x := (x_1, \ldots, x_N)$. We assume that $C_i(x)$ is convex in $x_i$ for all $x_{-i} \in \mathcal{X}_{-i}$, where $\mathcal{X}_{-i}$ is the joint action space excluding agent $i$. In addition, we assume that the diameter of the convex set $\mathcal{X}_i$ is bounded by $D$, for all $i = 1, \ldots, N$. The goal of every agent $i$ is to determine the action $x_i$ that minimizes its individual loss function, i.e.,

$$\min_{x_i \in \mathcal{X}_i} C_i(x_i, x_{-i}). \tag{1}$$

As shown in Rosen (1965), convex games always have at least one Nash equilibrium. In what follows, we denote by $x^*$ a Nash equilibrium of the game (1). Then, for each agent $i$, we have $C_i(x^*) \le C_i(x_i, x^*_{-i})$ for all $x_i \in \mathcal{X}_i$, $i \in \mathcal{N}$. At this Nash equilibrium point, agents are strategically stable in the sense that each agent lacks incentives to change its action. Since the agents' loss functions are convex, the Nash equilibrium can also be characterized by the first-order optimality condition, i.e., $\langle \nabla_{x_i} C_i(x^*), x_i - x^*_i \rangle \ge 0$ for all $x_i \in \mathcal{X}_i$, $i \in \mathcal{N}$, where $\nabla_{x_i} C_i(x)$ is the partial derivative of the loss function with respect to each agent's action. We write $\nabla_i C_i(x)$ instead of $\nabla_{x_i} C_i(x)$ whenever it is clear from the context. Throughout the paper, we make the following assumptions on the convex loss functions.

Assumption 1. For each agent $i$, we have that $C_i(x)$ is $L_0$-Lipschitz continuous in $x$ and $|C_i(x)| \le U$.

Assumption 2. For each agent $i$, we have that $\nabla_i C_i(x)$ is $L_1$-Lipschitz continuous in $x$ and $\|\nabla_i C_i(x)\| \le B$.
The above assumptions are very common in the literature and hold in many applications, e.g., Cournot games and retailer pricing games; see Bravo et al. (2018); Duvocelle et al. (2018); Lin et al. (2021). In general, it is not easy to show convergence to a Nash equilibrium for games with multiple Nash equilibria. For this reason, recent studies often focus on games that are so-called strongly monotone and are well known to have a unique Nash equilibrium (Rosen, 1965). In this case, convergence to the Nash equilibrium is shown, e.g., in Drusvyatskiy & Ratliff (2021); Bravo et al. (2018). The game (1) is said to be $m$-strongly monotone if for all $x, x' \in \mathcal{X}$ we have that

$$\sum_{i=1}^N \langle \nabla_i C_i(x) - \nabla_i C_i(x'), x_i - x'_i \rangle \ge m \|x - x'\|^2. \tag{2}$$

The ability of the agents to efficiently learn their optimal actions can be quantified using the notion of regret, which captures the cumulative loss of the learned online actions compared to the best actions in hindsight, and can be formally defined as

$$\text{Reg}_i = \sum_{t=1}^T C_i(x_t) - \min_{x_i \in \mathcal{X}_i} \sum_{t=1}^T C_i(x_i, x_{-i,t}), \tag{3}$$

for sequences of actions $\{x_{i,t}\}_{t=1}^T$, $i = 1, \ldots, N$. An algorithm is said to be no-regret if the regret of each agent is sub-linear in the total number of episodes $T$, i.e., $\text{Reg}_i = O(T^a)$, $a \in [0, 1)$, for all $i \in \mathcal{N}$. In this work, we are interested in understanding for what classes of games and learning algorithms Nash equilibrium convergence can be guaranteed. Specifically, we are interested in understanding whether and for what class of online convex games with continuous action sets no-regret learning converges to a Nash equilibrium regardless of the specific algorithm; see Section 3. Moreover, we are interested in understanding whether and how this class of online convex games can be expanded when the no-regret learning algorithm is known; see Sections 4 and 5.
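To make the regret definition (3) concrete, the sketch below evaluates each agent's regret for a simple two-agent quadratic game along a projected gradient-descent trajectory. The parameters, the step-size rule, and the trajectory itself are illustrative assumptions, not taken from the paper's experiments; for quadratic losses the best fixed action in hindsight has a closed form, so $\text{Reg}_i$ can be computed exactly.

```python
import numpy as np

# Illustrative two-agent quadratic game: C_i(x) = x_i*(a_i*x_i/2 + b_i*x_{-i} - e_i) + 1,
# so that grad_i C_i(x) = a_i*x_i + b_i*x_{-i} - e_i. All parameters are assumptions.
a, b, e = np.array([2.0, 1.0]), np.array([0.5, 0.5]), np.array([1.8, 1.9])

def cost(i, xi, x_other):
    return xi * (a[i] * xi / 2 + b[i] * x_other - e[i]) + 1

def regret(T):
    """Reg_i = sum_t C_i(x_t) - min_y sum_t C_i(y, x_{-i,t}) along a projected-GD run."""
    m = 0.79                                       # strong monotonicity parameter of this game
    x = np.array([3.0, 3.0])
    traj = []
    for t in range(1, T + 1):
        grad = a * x + b * x[::-1] - e             # exact gradients (deterministic run)
        x = np.clip(x - grad / (m * t), 0.0, 3.0)  # projected GD with eta_t = 1/(m*t)
        traj.append(x.copy())
    traj = np.array(traj)
    reg = np.zeros(2)
    for i in range(2):
        other = traj[:, 1 - i]
        # sum_t C_i(y, x_{-i,t}) is quadratic in y; its minimizer over [0, 3] is:
        y_star = np.clip((e[i] - b[i] * other.mean()) / a[i], 0.0, 3.0)
        reg[i] = cost(i, traj[:, i], other).sum() - cost(i, y_star, other).sum()
    return reg

avg100 = regret(100) / 100
avg1000 = regret(1000) / 1000
print(avg100, avg1000)  # per-episode regret shrinks with T, consistent with no-regret learning
```

The per-episode regret $\text{Reg}_i/T$ decays as the horizon grows, which is exactly the sub-linearity the definition requires.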

3. NO-REGRET LEARNING CONVERGES TO A NASH EQUILIBRIUM

In this section, we provide our main result which shows that any no-regret learning algorithm can guarantee Nash equilibrium convergence for the class of m-strongly monotone games that satisfy an additional condition on the parameter m. We start with single-agent learning and then extend it to multi-agent games.

3.1. SINGLE-AGENT LEARNING

Consider a single agent whose goal is to minimize its convex loss function $C(x)$ by optimizing its action $x \in \mathcal{X}$. Let $x^* = \operatorname{argmin}_x C(x)$. Suppose that an online algorithm generates a sequence $\{x_t\}_{t=1}^T$, and the regret is defined as $\text{Reg} = \sum_{t=1}^T (C(x_t) - C(x^*))$. In the following lemma, we show that strong convexity of the loss function $C(x)$ guarantees that no-regret learning leads to convergence to the minimizer.

Lemma 1. Suppose that the loss function $C(x)$ is $m$-strongly convex in $x$ with parameter $m > 0$. If the regret is sub-linear in $T$ such that $\text{Reg} = O(T^a)$ with $a \in [0, 1)$, then we have $\sum_{t=1}^T \|x_t - x^*\|^2 = O(T^a)$.

Proof. The strong convexity of the loss function $C(x)$ implies that $C(x) - C(x^*) \ge \frac{m}{2} \|x - x^*\|^2$ for all $x \in \mathcal{X}$. Substituting in this inequality the agent's action $x_t$ at time $t$ and summing over $t$, we obtain that $\text{Reg} = \sum_{t=1}^T (C(x_t) - C(x^*)) \ge \frac{m}{2} \sum_{t=1}^T \|x_t - x^*\|^2$. The result follows from the fact that $\frac{m}{2}$ is a constant that does not depend on $T$.

Lemma 1 shows that no-regret single-agent learning guarantees time-averaged convergence to the stable point when the loss function is strongly convex, i.e., $\frac{1}{T} \sum_{t=1}^T \|x_t - x^*\|^2 \to 0$ as $T \to \infty$. It is well known that strong convexity is equivalent to the condition

$$\langle \nabla C(x) - \nabla C(x'), x - x' \rangle \ge m \|x - x'\|^2. \tag{4}$$

Moreover, note that the strong monotonicity condition (2) is equivalent to (4) in the case of single-agent learning, i.e., when $N = 1$. This observation inspires the following analysis for multi-agent games.
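The key inequality in the proof of Lemma 1, $\text{Reg} \ge \frac{m}{2} \sum_t \|x_t - x^*\|^2$, can be checked numerically. The sketch below runs projected online gradient descent on an assumed strongly convex loss; the loss, step-size rule, and feasible set are illustrative choices, not from the paper.

```python
import numpy as np

# Assumed m-strongly convex single-agent loss on X = [0, 2]:
# C(x) = (x-1)^4 + (x-1)^2, with C''(x) = 12(x-1)^2 + 2 >= 2, so m = 2 and x* = 1.
m, x_star = 2.0, 1.0
C = lambda x: (x - 1) ** 4 + (x - 1) ** 2
grad = lambda x: 4 * (x - 1) ** 3 + 2 * (x - 1)

T, x = 500, 2.0
regret, sq_dist = 0.0, 0.0
for t in range(1, T + 1):
    regret += C(x) - C(x_star)
    sq_dist += (x - x_star) ** 2
    x = np.clip(x - grad(x) / (m * t), 0.0, 2.0)  # projected GD, eta_t = 1/(m*t)

# Lemma 1's proof: Reg >= (m/2) * sum_t ||x_t - x*||^2, so sub-linear regret
# forces the time-averaged squared distance to the minimizer to vanish.
print(regret, (m / 2) * sq_dist, sq_dist / T)
```

Because each term $C(x_t) - C(x^*)$ dominates $\frac{m}{2}(x_t - x^*)^2$, the accumulated regret upper-bounds the scaled sum of squared distances, and the time average $\frac{1}{T}\sum_t (x_t - x^*)^2$ is already small at this horizon.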

3.2. MULTI-AGENT GAMES

As discussed in Section 3.1, strong monotonicity is sufficient for the convergence of no-regret single-agent learning to an equilibrium point. However, the extension of this result from single-agent learning to multi-agent games is non-trivial, due to the structure of the agents' loss functions, which are coupled through the other agents' actions. Moreover, every time an agent updates its action, the other agents also react to this change. Therefore, since the actions $x_{-i,t}$ also change, the function $C_i(\cdot, x_{-i,t})$ becomes non-stationary from the perspective of agent $i$. In the following result, we show that if the game is sufficiently strongly monotone, no-regret learning can still guarantee Nash equilibrium convergence.

Theorem 1. Suppose that the game (1) is $m$-strongly monotone and $\nabla_i C_i(x_i, x_{-i})$ is $L$-Lipschitz continuous in $x_{-i}$ for every $x_i \in \mathcal{X}_i$. Suppose that an algorithm generates the action sequences $\{x_{i,t}\}$, $i = 1, \ldots, N$, and that the regret of each agent satisfies $\text{Reg}_i = \sum_{t=1}^T \big( C_i(x_t) - C_i(y^*_i, x_{-i,t}) \big) = O(T^a)$ for all $i = 1, \ldots, N$, where $y^*_i = \operatorname{argmin}_{y_i} \sum_{t=1}^T C_i(y_i, x_{-i,t})$ and $a \in [0, 1)$. Let $x^*$ denote the unique Nash equilibrium and $y^* = (y^*_1, \ldots, y^*_N)$. Then, the following hold:

1. $\|x^* - y^*\| \le \frac{NL}{mT} \sum_{t=1}^T \|x_t - x^*\|$;
2. If $m - 2L\sqrt{N-1} > 0$, then $\sum_{t=1}^T \|x_t - x^*\|^2 = O(T^a)$.

The proof can be found in Appendix 8.2. Theorem 1 implies that any no-regret algorithm leads to a Nash equilibrium whenever $m - 2L\sqrt{N-1} > 0$. Note that the condition $m - 2L\sqrt{N-1} > 0$ always holds for single-agent learning, where $N = 1$, as long as $m > 0$, which coincides with Lemma 1.

Remark 1. Recall that $L_1$ and $L$ are the Lipschitz constants of the function $\nabla_i C_i(x)$ with respect to $x$ and $x_{-i}$, respectively. From the definitions we can conclude that $L \le L_1$.
The Lipschitz constant $L_1$ provides an upper bound on the variation of the gradients and is always greater than the strong monotonicity parameter $m$, which provides a lower bound, i.e., $m \le L_1$. However, it is still possible to have $m - 2L\sqrt{N-1} > 0$. For example, in the extreme case where $C_i$ only depends on $x_i$, we have that $L = 0$ and thus the condition naturally holds as long as $m > 0$.

Remark 2. We provide here some intuition regarding the condition $m - 2L\sqrt{N-1} > 0$. Rearranging the terms in the inequality gives $L < \frac{m}{2\sqrt{N-1}}$. Recall that $L$ is the Lipschitz constant of the function $\nabla_i C_i(x_i, x_{-i})$ with respect to $x_{-i}$, which can be interpreted as a bound on the maximum influence of the other agents' actions. We need this influence of the other agents' actions and, therefore, $L$, to be small for the game to converge. On the other hand, the presence of multiple agents (large $N$) makes the game increasingly involved, which restricts the class of games for which no-regret learning converges to a Nash equilibrium. In general, it is easier to analyze the regret of an algorithm than to analyze Nash equilibrium convergence. Theorem 1 provides an alternative way to analyze Nash equilibrium convergence: as long as we can show that an algorithm is no-regret and a game is $m$-strongly monotone with $m - 2L\sqrt{N-1} > 0$, we directly obtain Nash equilibrium convergence.

4. SUFFICIENT CONDITIONS FOR CONVERGENCE OF GRADIENT DESCENT ALGORITHMS

In this section, we provide sufficient conditions for the convergence of multi-agent games to a Nash equilibrium with the additional knowledge that the no-regret learning algorithm is a gradient-descent (GD) algorithm. Specifically, we investigate two types of GD algorithms: a first-order algorithm and a zeroth-order algorithm. Note that, under Assumptions 1 and 2, it can be easily verified that both first-order and zeroth-order GD algorithms ensure that the regret of each agent in (3) is sub-linear in T ; see Hazan et al. (2016); Flaxman et al. (2004) . In what follows, we show that Nash equilibrium convergence can be guaranteed for both algorithms as long as m > 0.

4.1. FIRST-ORDER ALGORITHMS

We first consider the case where the agents have access to their gradient information. Specifically, we assume that each agent can obtain an unbiased gradient estimate $G_i$ of $\nabla_i C_i(x)$ with finite variance given the joint action profile $x$, where $\mathbb{E}[G_i] = \nabla_i C_i(x)$ and $\mathbb{E}\|G_i - \nabla_i C_i(x)\|^2 \le \sigma^2$. During learning, we assume that each agent performs the following action update

$$x_{i,t+1} = P_{\mathcal{X}_i}(x_{i,t} - \eta_t G_{i,t}), \tag{5}$$

where $\mathbb{E}[G_{i,t}] = \nabla_i C_i(x_t)$ and $P_{\mathcal{X}_i}$ projects the agent's actions onto its action space $\mathcal{X}_i$. In the following result, we present the Nash equilibrium convergence analysis of the first-order GD algorithm (5).

Theorem 2. Let Assumptions 1 and 2 hold. Suppose that the game (1) is $m$-strongly monotone with parameter $m > 0$. Then, the first-order GD algorithm (5) with $\eta_t = \frac{1}{mt}$ satisfies

$$\mathbb{E}\|x_T - x^*\|^2 = O(T^{-1}). \tag{6}$$

See Appendix 8.3 for the detailed proof. Note that equation (6) implies that there exists a constant $C_0 > 0$ such that $\mathbb{E}\|x_t - x^*\|^2 \le C_0 t^{-1}$ for all $t = 1, \ldots, T$. Summing over $t$ we get $\mathbb{E}\sum_{t=1}^T \|x_t - x^*\|^2 \le \sum_{t=1}^T C_0 t^{-1} \le C_0 \big(1 + \int_1^T \frac{1}{t} \, dt\big) \le C_0 (1 + \ln T)$, and thus $\mathbb{E}\sum_{t=1}^T \|x_t - x^*\|^2 = O(\ln T)$. Therefore, last-iterate convergence implies time-averaged convergence of the algorithm.
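A minimal sketch of update (5) on a two-agent strongly monotone quadratic game with noisy gradients and $\eta_t = \frac{1}{mt}$; the game parameters, noise level, horizon, and seed are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 2-agent game with grad_i C_i(x) = a_i*x_i + b*x_{-i} - e_i.
# The symmetrized Jacobian [[2, 0.5], [0.5, 2]] has smallest eigenvalue m = 1.5.
a, b, e = np.array([2.0, 2.0]), 0.5, np.array([1.0, 1.5])
m, sigma = 1.5, 0.1
x_star = np.linalg.solve(np.array([[a[0], b], [b, a[1]]]), e)  # unique NE: (1/3, 2/3)

x = np.array([3.0, 3.0])
for t in range(1, 2001):
    G = a * x + b * x[::-1] - e + sigma * rng.standard_normal(2)  # unbiased noisy gradient
    x = np.clip(x - G / (m * t), 0.0, 3.0)                        # projection onto X_i = [0, 3]

dist = np.linalg.norm(x - x_star)
print(dist)  # final distance to the NE; Theorem 2 predicts E||x_T - x*||^2 = O(1/T)
```

Despite the gradient noise, the decaying step size $\frac{1}{mt}$ drives the last iterate close to the unique Nash equilibrium.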

4.2. ZEROTH-ORDER ALGORITHMS

In many games, the agents have access to limited information and cannot observe the other agents' actions. For example, in the Cournot competition and Kelly auction games (Bravo et al., 2018; Lin et al., 2021), each company (agent) is not willing to share its strategy (action) with its rivals and would rather keep it secret. In this case, first-order gradient information is not available since it depends on the joint action of all agents. Instead, here, we assume that the agents can only access their own loss function evaluations, which is also referred to as bandit feedback. In this case, a common and effective approach to estimate the unknown gradient is to utilize zeroth-order methods. Specifically, at each episode, the agents perturb their actions $x_{i,t}$ by an amount $\delta u_{i,t}$, where $u_{i,t} \in \mathbb{S}^{d_i}$ is a random variable sampled from the unit sphere $\mathbb{S}^{d_i} \subset \mathbb{R}^{d_i}$ and $\delta$ is the size of this perturbation. Then, the agents play their perturbed actions $\hat{x}_{i,t} = x_{i,t} + \delta u_{i,t}$ and receive as feedback their local losses $C_i(\hat{x}_t)$. Using this information, every agent constructs its own gradient estimate $g_{i,t} = \frac{d_i}{\delta} C_i(\hat{x}_t) u_{i,t}$ and performs the following update

$$x_{i,t+1} = P_{(1-\delta)\mathcal{X}_i}(x_{i,t} - \eta_t g_{i,t}), \tag{7}$$

where the projection onto the set $(1-\delta)\mathcal{X}_i$ ensures the feasibility of the next played action. To facilitate the analysis, we define the $\delta$-smoothed function $C^\delta_i(x) = \mathbb{E}_{w_i \sim B_i, u_{-i} \sim S_{-i}}[C_i(x_i + \delta w_i, x_{-i} + \delta u_{-i})]$, where $S_{-i} = \Pi_{j \ne i} S_j$, and $B_i$, $S_i$ denote the unit ball and unit sphere in $\mathbb{R}^{d_i}$, respectively. As shown in Drusvyatskiy & Ratliff (2021); Bravo et al. (2018), the function $C^\delta_i(x)$ satisfies the following properties.

Lemma 2. Let Assumptions 1 and 2 hold. Then we have that:
1. $C^\delta_i(x_i, x_{-i})$ is convex in $x_i$;
2. $C^\delta_i(x)$ is $L_0$-Lipschitz continuous in $x$;
3. $|C^\delta_i(x) - C_i(x)| \le \delta L_0 \sqrt{N}$;
4. $\mathbb{E}\big[\frac{d_i}{\delta} C_i(\hat{x}_t) u_{i,t}\big] = \nabla_i C^\delta_i(x_t)$.
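Property 4 of Lemma 2 (the one-point perturbed estimate is an unbiased gradient of the smoothed loss) can be sanity-checked by Monte Carlo for a scalar action, where the unit sphere is just $\{-1, +1\}$; the loss, point, and perturbation size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar action (d_i = 1): u is drawn uniformly from the 0-sphere {-1, +1}.
C = lambda x: x ** 2          # assumed loss of agent i, with other agents held fixed
x0, delta = 0.7, 0.1

# Monte Carlo estimate of E[(d_i/delta) * C(x + delta*u) * u]
u = rng.choice([-1.0, 1.0], size=200_000)
zo_estimate = np.mean(C(x0 + delta * u) * u / delta)

# Smoothed loss: C_delta(x) = E_{w ~ U[-1,1]}[C(x + delta*w)] = x^2 + delta^2/3,
# whose derivative at x0 is 2*x0 -- identical to the true gradient for this quadratic.
smoothed_grad = 2 * x0
print(zo_estimate, smoothed_grad)
```

For $d_i = 1$ the estimator's expectation reduces to the central difference $\frac{C(x+\delta) - C(x-\delta)}{2\delta}$, which matches the derivative of the smoothed loss, so the Monte Carlo average lands on $2x_0$ up to sampling noise.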
Note that, since the game with loss functions $C_i(x)$ is assumed to be strongly monotone, it has a unique Nash equilibrium. However, without assuming that the Jacobian of $\nabla_i C_i(x)$ is Lipschitz continuous, as in Drusvyatskiy & Ratliff (2021), the smoothed game defined by the loss functions $C^\delta_i(x)$ over the set $(1-\delta)\mathcal{X}$ is possibly not strongly monotone and can have multiple Nash equilibria. As we discuss below, even if the smoothed game has multiple Nash equilibria, it is still possible to show that the original game converges to a Nash equilibrium. To do so, we first provide a lemma that bounds the distance between the Nash equilibria of the smoothed game. In this lemma, for a given $\delta$, we denote by $A_\delta$ the set of Nash equilibria of the smoothed game with losses $C^\delta_i(x)$ contained in the set $(1-\delta)\mathcal{X}$. Since $C^\delta_i(x)$ is convex in $x_i$, we know that there exists at least one Nash equilibrium in the smoothed game, i.e., $A_\delta \ne \emptyset$.

Lemma 3. Suppose that Assumptions 1 and 2 hold and that the game (1) is $m$-strongly monotone. Moreover, assume that the smoothed game with losses $C^\delta_i(x)$ has multiple Nash equilibria in the set $(1-\delta)\mathcal{X}$. Then the distance between any two Nash equilibria is bounded by $\frac{2 L_1 \delta N}{m}$.

The detailed proof can be found in Appendix 8.4. Lemma 3 states that, although there may exist multiple Nash equilibria in the smoothed game, the distance between these Nash equilibria can be bounded in terms of the parameter $\delta$. Based on this observation, we can further bound the distance between the Nash equilibrium $x^*$ of the original game and the Nash equilibria of the smoothed game, as shown in the following lemma.

Lemma 4. Suppose that Assumptions 1 and 2 hold and that the game (1) is $m$-strongly monotone. Then, any Nash equilibrium $x^{\delta,j} \in A_\delta$ of the smoothed game satisfies

$$\|x^* - x^{\delta,j}\| \le \delta \left( \Big(1 + \frac{L_1 \sqrt{N}}{m}\Big) \|x^*\| + \frac{L_1 N}{m} \right).$$

We provide the detailed proof in Appendix 8.5.
Lemma 4 shows that even if the smoothed game is not strongly monotone and possibly has multiple Nash equilibria, we can still upper bound the distance between these Nash equilibria and the Nash equilibrium of the original game. The result in Lemma 4 is the same as that in Drusvyatskiy & Ratliff (2021), with the difference that Drusvyatskiy & Ratliff (2021) make the additional assumption that the Jacobian of the gradient of the cost function is Lipschitz continuous. Here, we show that this assumption is not needed and the statement of the lemma still holds. We now present the main result.

Theorem 3. Let Assumptions 1 and 2 hold. Suppose that the game (1) is $m$-strongly monotone with parameter $m > 0$. Then, the zeroth-order GD algorithm (7) with $\eta_t = \frac{1}{mt}$ and $\delta = T^{-1/3}$ satisfies

$$\mathbb{E}\|x_T - x^*\|^2 = O(T^{-1/3}).$$

See Appendix 8.6 for the proof. Since $\sum_{t=1}^T t^{-1/3} \le 1 + \int_1^T t^{-1/3} \, dt \le 1 + \frac{3}{2} T^{2/3}$, we obtain that $\mathbb{E}\sum_{t=1}^T \|x_t - x^*\|^2 = O(T^{2/3})$. Thus the time-averaged convergence of the game is guaranteed.
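A sketch of the zeroth-order update (7) on the two-agent Cournot game of Section 6 (scalar actions, so the sampled "sphere" is $\{-1, +1\}$); the horizon and random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cournot game from Section 6: C_i(x) = x_i*(a_i*x_i/2 + b_i*x_{-i} - e_i) + 1.
a, b, e = np.array([2.0, 1.0]), np.array([0.5, 0.5]), np.array([1.8, 1.9])
m = 0.79
x_star = np.array([0.4857, 1.657])   # Nash equilibrium reported in Section 6

T = 20_000
delta = T ** (-1 / 3)                # perturbation size from Theorem 3
x = np.array([3.0, 3.0])
for t in range(1, T + 1):
    u = rng.choice([-1.0, 1.0], size=2)        # unit "sphere" for scalar actions (d_i = 1)
    xh = x + delta * u                         # played (perturbed) actions
    loss = xh * (a * xh / 2 + b * xh[::-1] - e) + 1
    g = loss * u / delta                       # one-point gradient estimate
    x = np.clip(x - g / (m * t), 0.0, 3.0 * (1 - delta))  # projection onto (1-delta)*X_i

dist = np.linalg.norm(x - x_star)
print(dist)  # noisier and slower than first-order GD, consistent with the O(T^{-1/3}) rate
```

The one-point estimator has variance of order $\frac{1}{\delta^2}$, which is why the zeroth-order method needs far more episodes than its first-order counterpart to reach comparable accuracy.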

5. SUFFICIENT CONDITIONS FOR CONVERGENCE OF THE BEST RESPONSE ALGORITHM

In the previous section, we showed that gradient descent algorithms guarantee convergence of $m$-strongly monotone games to a Nash equilibrium provided that $m > 0$. In this section, we provide sufficient conditions for Nash equilibrium convergence for the best response algorithm. The best response is a common strategy in the game theory literature, especially for fully competitive games, that produces the most favorable outcome given the other agents' plays. In continuous games (1), the best response is defined as

$$x_{i,t+1} = \operatorname{argmin}_{x_i \in \mathcal{X}_i} C_i(x_i, x_{-i,t}), \tag{10}$$

i.e., each agent takes the action that best responds to the other agents' actions from the previous episode. The convergence analysis of the best response algorithm is presented below.

Theorem 4. Suppose that the game (1) is $m$-strongly monotone with parameter $m > L\sqrt{N-1}$. Then the best response algorithm (10) satisfies

$$\|x_T - x^*\|^2 \le \rho^T \|x_0 - x^*\|^2, \tag{11}$$

where $\rho := \frac{L^2 (N-1)}{m^2}$.

The proof is provided in Appendix 8.7. From the inequality (11), it is easy to obtain that $\sum_{t=1}^T \|x_t - x^*\|^2 \le \sum_{t=1}^T \rho^t \|x_0 - x^*\|^2 \le \frac{\|x_0 - x^*\|^2}{1 - \rho} = O(1)$. Theorem 4 shows that the best response algorithm converges to the Nash equilibrium at a linear rate. Indeed, it is also a no-regret learning algorithm for each agent, as shown in Theorem 5 in Appendix 8.8. To the best of our knowledge, this is the first effort to provide a sufficient condition ($m > L\sqrt{N-1}$) under which the best response algorithm achieves Nash equilibrium convergence in convex games. Moreover, we experimentally show that when $m > L\sqrt{N-1}$ does not hold, the best response algorithm may lead to cycles, which further supports our theoretical results.
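For a game where the best response (10) has a closed form, the geometric bound (11) can be checked directly; the quadratic game below (with $m = 1.5$, $L = 0.5$, $N = 2$, so $m > L\sqrt{N-1}$) is an illustrative assumption, not one of the paper's experiments.

```python
import numpy as np

# Assumed 2-agent quadratic game: grad_i C_i(x) = a_i*x_i + b*x_{-i} - e_i,
# so the best response is x_i = clip((e_i - b*x_{-i})/a_i, 0, 3).
a, b, e = np.array([2.0, 2.0]), 0.5, np.array([1.0, 1.5])
m, L, N = 1.5, 0.5, 2                      # m > L*sqrt(N-1) holds
rho = L ** 2 * (N - 1) / m ** 2            # contraction factor from Theorem 4
x_star = np.array([1 / 3, 2 / 3])          # unique Nash equilibrium of this game

x0 = np.array([0.0, 0.0])
x, errs = x0.copy(), []
for t in range(10):
    x = np.clip((e - b * x[::-1]) / a, 0.0, 3.0)  # simultaneous best responses
    errs.append(np.linalg.norm(x - x_star) ** 2)

# Theorem 4: ||x_T - x*||^2 <= rho^T * ||x_0 - x*||^2
bound = rho ** 10 * np.linalg.norm(x0 - x_star) ** 2
print(errs[-1], bound)
```

The observed error decays even faster than the theorem's bound $\rho^T \|x_0 - x^*\|^2$, illustrating that (11) is a worst-case guarantee.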

6. NUMERICAL EXPERIMENTS

In this section, we illustrate and compare the proposed algorithms on Cournot game problems. Additional experiments on retailer pricing games and Kelly auction games can be found in Appendices 8.9 and 8.10, respectively. We show that, by defining sufficient conditions for Nash equilibrium convergence that depend only on the properties of the game, i.e., the parameter $m$, and not on the learning algorithm used to solve it, our analysis allows us to identify classes of games for which no-regret learning guarantees convergence to a Nash equilibrium without analyzing specific algorithms, or to identify specific no-regret learning algorithms with no guaranteed convergence to a Nash equilibrium. We first consider a Cournot game with two agents whose goal is to minimize their local losses by appropriately setting the production quantities $x_i$, $i = 1, 2$. The loss function of each agent is given by $C_i(x) = x_i(\frac{a_i x_i}{2} + b_i x_{-i} - e_i) + 1$, where $a_i > 0$, $b_i$, $e_i$ are constant parameters, and $x_{-i}$ denotes the production quantity of the opponent of agent $i$. It is easy to show that $\nabla_i C_i(x) = a_i x_i + b_i x_{-i} - e_i$. Recalling that $L$ is the Lipschitz constant of the function $\nabla_i C_i(x)$ with respect to $x_{-i}$, we have $L = \max\{b_1, b_2\}$. Define $g(x) = (\nabla_1 C_1(x), \nabla_2 C_2(x))$ and let $G(x)$ denote the Jacobian of $g(x)$, i.e., $G(x) = [a_1, b_1; b_2, a_2]$. According to Rosen (1965), the strong monotonicity parameter $m$ in this simple example coincides with the smallest eigenvalue of the matrix $\frac{G(x) + G^\top(x)}{2}$.

In what follows, we verify and compare the effectiveness of the first-order gradient descent algorithm (FO), the zeroth-order gradient descent algorithm (ZO), and the best response algorithm (BR), analyzed in Sections 4 and 5. To apply the first-order algorithm, we assume that the gradient is subject to noise sampled from a normal distribution. For all algorithms, the feasible set is defined as $\mathcal{X}_i = \{x_i \mid x_i \in [0, 3]\}$. We validate our methods for two different selections of parameters. First, we select the parameters $(a_1, a_2) = (2, 1)$, $(b_1, b_2) = (0.5, 0.5)$, $(e_1, e_2) = (1.8, 1.9)$. Using these parameters, we get that $m = 0.79$, $L = 0.5$, and, therefore, $m > L\sqrt{N-1}$, so that the sufficient conditions for all three algorithms are satisfied. The results are shown in Figure 1, where the solid lines and shaded regions are averages and $\pm$ standard deviations over 60 runs, respectively. Specifically, in Figure 1(a), the two gradient descent algorithms both converge to the Nash equilibrium, and the first-order algorithm outperforms the zeroth-order algorithm in terms of convergence speed and variance. Figure 1(b) illustrates the action updates of the best response algorithm. Since the best response algorithm performs the action update in a way that each agent selects the optimal action against the other agents' actions, this algorithm converges very fast to the Nash equilibrium point $(0.4857, 1.657)$. Next we select the parameters $(a_1, a_2) = (2, 1)$, $(b_1, b_2) = (-1.5, 1.5)$, $(e_1, e_2) = (1.8, 1.9)$. In this case, we have that $m = 1$, $L = 1.5$, and, therefore, $m < L\sqrt{N-1}$. As a result, the sufficient condition for the best response algorithm is not satisfied. The simulation results in this case are presented in Figure 2. When $m < L\sqrt{N-1}$, the first-order and zeroth-order gradient descent methods still converge to the Nash equilibrium, as shown in Figure 2(a).
Figure 2(b) shows that the agents' actions under the best response algorithm oscillate in a regular pattern after some episodes and fail to converge to the Nash equilibrium. Therefore, the best response algorithm does not guarantee convergence to the Nash equilibrium when the sufficient condition is not satisfied. We also consider a Cournot game with 5 agents, $i = 1, \ldots, 5$. The loss function of each agent is $C_i(x) = x_i(\frac{a_i x_i}{2} + b_i \sum_{j \ne i} x_j - e_i) + 1$, where $a = [2, 2, 1.5, 1.8, 2]$, $b = [0.2, 0.3, 0.3, 0.2, 0.3]$, $e = [1.8, 1.9, 1.5, 1.6, 1.8]$. In this case $m = 1.2844$, $L = 0.6$, and therefore $m > L\sqrt{N-1}$, so that the sufficient conditions are satisfied for all three algorithms. The Nash equilibrium is $x^* = [0.672, 0.597, 0.512, 0.631, 0.538]$. Figure 3 shows that all three algorithms converge to this Nash equilibrium.
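The qualitative difference between the two parameter sets can be reproduced in a few lines: the sketch below computes $m$ and $L$ from the symmetrized Jacobian and runs simultaneous best responses for both two-agent Cournot parameter sets from this section.

```python
import numpy as np

def run_br(a, b, e, T=200):
    """Simultaneous best responses x_i = clip((e_i - b_i*x_{-i})/a_i, 0, 3)."""
    G = np.array([[a[0], b[0]], [b[1], a[1]]])   # Jacobian of (grad_1 C_1, grad_2 C_2)
    m = np.linalg.eigvalsh((G + G.T) / 2).min()  # strong monotonicity parameter
    L = np.max(np.abs(b))                        # Lipschitz constant w.r.t. x_{-i}
    x = np.zeros(2)
    for _ in range(T):
        x = np.clip((e - b * x[::-1]) / a, 0.0, 3.0)
    x_star = np.linalg.solve(G, e)               # NE of the unconstrained linear system
    return m, L, np.linalg.norm(x - x_star)

a, e = np.array([2.0, 1.0]), np.array([1.8, 1.9])
m1, L1, err1 = run_br(a, np.array([0.5, 0.5]), e)    # m = 0.79 > L = 0.5: BR converges
m2, L2, err2 = run_br(a, np.array([-1.5, 1.5]), e)   # m = 1.0 < L = 1.5: BR cycles
print(m1, L1, err1)
print(m2, L2, err2)
```

For the first parameter set the iterates contract to $(0.4857, 1.657)$, while for the second they settle into a four-point cycle bounded away from the equilibrium, matching Figures 1(b) and 2(b).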

7. CONCLUSION

In this work, we studied the connection between no-regret learning and time-averaged Nash equilibrium convergence for the class of strongly monotone games. Specifically, we provided a sufficient condition on the class of strongly monotone games for which any no-regret learning algorithm leads to Nash equilibrium convergence. Moreover, we showed that this class of games can be expanded when additional information about a specific no-regret algorithm is considered, including the first-order and zeroth-order gradient descent algorithms and the best response algorithm. We numerically validated our theoretical results on games that belong to different classes, including Cournot games, retailer pricing games, and Kelly auction games. Compared to the existing literature, which analyzes the regret and Nash equilibrium convergence for specific algorithms and under assumptions that depend on the specific choice of algorithms and games, here we proposed a different approach that focuses on understanding the fundamental relationship between no-regret learning and Nash equilibrium convergence, regardless of the specific learning algorithm and based only on the game type.

8. APPENDIX

8.1 AUXILIARY LEMMAS

Lemma 5 (Bravo et al. (2018)). Let $a_n$, $n = 1, 2, \dots$, be a non-negative sequence such that
\[ a_{n+1} \le a_n\Big(1 - \frac{A}{n}\Big) + \frac{E}{n^{1+q}}, \]
where $q > 0$ and $A, E > 0$. Then, assuming $A > q$, we have that
\[ a_n \le \frac{E}{(A - q)\,n^q} + O\Big(\frac{1}{n^q}\Big). \tag{13} \]
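The recursion in Lemma 5 is easy to probe numerically. A minimal sketch with the illustrative choice $q = 1$, $A = 2$, $E = 1$ (values assumed here for demonstration, not taken from the lemma), for which the bound (13) predicts $a_n \le \frac{E}{(A-q)n} + O(\frac{1}{n}) = O(\frac{1}{n})$:

```python
# Iterate the worst case a_{n+1} = a_n (1 - A/n) + E/n^{1+q} and check a_n = O(1/n).
q, A, E = 1.0, 2.0, 1.0
a = 1.0
n_final = 10_000
for n in range(2, n_final):      # start at n = 2 so that 1 - A/n >= 0
    a = a * (1 - A / n) + E / n ** (1 + q)

print(a * n_final)  # n * a_n stays bounded by a constant, consistent with a_n = O(1/n)
```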

8.2. PROOF OF THEOREM 1:

We first show a useful lemma that lays the foundation for the subsequent analysis.

Lemma 6. If the game with costs $C_i(x_i, x_{-i})$ is strongly monotone with parameter $m$, then the function $C_i(z, x_{-i})$ is strongly convex in $z$.

Proof. From the strong monotonicity condition (2), it follows that
\[ \langle \nabla_i C_i(z, x_{-i}) - \nabla_i C_i(z', x_{-i}),\, z - z' \rangle \ge m\|z - z'\|^2. \tag{14} \]
Define $h_i(z) := C_i(z, x_{-i})$. Then inequality (14) becomes $\langle \nabla h_i(z) - \nabla h_i(z'), z - z' \rangle \ge m\|z - z'\|^2$. The result follows from the fact that a differentiable function $f$ is strongly convex if and only if its domain is convex and $\langle \nabla f(x) - \nabla f(x'), x - x' \rangle \ge m\|x - x'\|^2$.

By Lemma 6, the function $C_i(z, x_{-i})$ is strongly convex in $z$, which means that
\[ C_i(y_i', x_{-i}) \ge C_i(y_i, x_{-i}) + \langle \nabla_i C_i(y_i, x_{-i}),\, y_i' - y_i \rangle + \frac{m}{2}\|y_i' - y_i\|^2 \]
for any fixed $x_{-i} \in X_{-i}$. Recall that $y_i^* = \operatorname{argmin}_{y_i} \sum_{t=1}^T C_i(y_i, x_{-i,t})$ and $x^*$ denotes the Nash equilibrium of the game (1). Since the sum operator preserves convexity, $\sum_t C_i(y_i, x_{-i,t})$ is also convex in $y_i$. Moreover, from the necessary condition of optimality, $\sum_t \langle \nabla_i C_i(y_i^*, x_{-i,t}), y_i - y_i^* \rangle \ge 0$ for all $y_i \in X_i$. Since $x_i^* \in X_i$, replacing $y_i$ with $x_i^*$ gives
\[ \sum_t \langle \nabla_i C_i(y_i^*, x_{-i,t}),\, x_i^* - y_i^* \rangle \ge 0. \tag{15} \]

Next, we analyze the distance between $x^*$ and $y^*$, where $y^* = (y_1^*, \dots, y_N^*)$. By the strong convexity established in Lemma 6, we have
\[ \langle \nabla_i C_i(y_i^*, x_{-i,t}) - \nabla_i C_i(x_i^*, x_{-i,t}),\, y_i^* - x_i^* \rangle \ge m\|y_i^* - x_i^*\|^2. \]
Summing this inequality over $t = 1, \dots, T$ and combining it with inequality (15), we have
\[ mT\|x_i^* - y_i^*\|^2 \le \sum_t \langle \nabla_i C_i(y_i^*, x_{-i,t}) - \nabla_i C_i(x_i^*, x_{-i,t}),\, y_i^* - x_i^* \rangle \le \sum_t \langle -\nabla_i C_i(x_i^*, x_{-i,t}),\, y_i^* - x_i^* \rangle. \tag{16} \]
Summing inequality (16) over $i = 1, \dots, N$, we further obtain
\begin{align*}
mT\|x^* - y^*\|^2 &\le \sum_i \sum_t \langle -\nabla_i C_i(x_i^*, x_{-i,t}),\, y_i^* - x_i^* \rangle \\
&= \sum_i \sum_t \langle \nabla_i C_i(x^*) - \nabla_i C_i(x_i^*, x_{-i,t}),\, y_i^* - x_i^* \rangle - \sum_i \sum_t \langle \nabla_i C_i(x^*),\, y_i^* - x_i^* \rangle \\
&\le \sum_i \sum_t \langle \nabla_i C_i(x^*) - \nabla_i C_i(x_i^*, x_{-i,t}),\, y_i^* - x_i^* \rangle \\
&\le \sum_i \sum_t L\,\|x_{-i,t} - x_{-i}^*\|\,\|x_i^* - y_i^*\| \le \sum_i \sum_t L\,\|x_t - x^*\|\,\|x^* - y^*\| \le NL\|x^* - y^*\|\sum_t \|x_t - x^*\|,
\end{align*}
where the second inequality is due to the fact that $x^*$ is a Nash equilibrium and the third inequality follows from the Lipschitz continuity of $\nabla_i C_i$ with respect to $x_{-i}$. Rearranging the terms in the above inequality, we obtain the first of the two theorem statements.

Next, we analyze the regret of the action sequence $\{x_{i,t}\}$ for $i = 1, \dots, N$. By the definition of regret in (3), we have
\begin{align*}
\mathrm{Regret}_i &= \sum_t C_i(x_t) - \sum_t C_i(y_i^*, x_{-i,t}) = \sum_t \big[C_i(x_t) - C_i(x_i^*, x_{-i,t})\big] + \sum_t \big[C_i(x_i^*, x_{-i,t}) - C_i(y_i^*, x_{-i,t})\big] \\
&\ge \sum_t \Big[\langle \nabla_i C_i(x_i^*, x_{-i,t}),\, x_{i,t} - x_i^* \rangle + \frac{m}{2}\|x_i^* - x_{i,t}\|^2\Big] + \sum_t \Big[\langle \nabla_i C_i(y_i^*, x_{-i,t}),\, x_i^* - y_i^* \rangle + \frac{m}{2}\|y_i^* - x_i^*\|^2\Big] \\
&\ge \sum_t \Big[\langle \nabla_i C_i(x_i^*, x_{-i,t}),\, x_{i,t} - x_i^* \rangle + \frac{m}{2}\|x_i^* - x_{i,t}\|^2\Big] + \frac{mT}{2}\|y_i^* - x_i^*\|^2, \tag{17}
\end{align*}
where the first inequality follows from the strong convexity of $C_i(z, x_{-i})$ in $z$ for any $x_{-i} \in X_{-i}$ and the second inequality follows from the necessary condition of optimality, i.e., (15).

Summing the regret in (17) over $i = 1, \dots, N$, we have
\begin{align*}
\sum_i \mathrm{Regret}_i &= \sum_i \sum_t \big[C_i(x_t) - C_i(y_i^*, x_{-i,t})\big] \\
&\ge \sum_i \sum_t \langle \nabla_i C_i(x_i^*, x_{-i,t}) - \nabla_i C_i(x^*),\, x_{i,t} - x_i^* \rangle + \sum_t \sum_i \langle \nabla_i C_i(x^*),\, x_{i,t} - x_i^* \rangle + \frac{m}{2}\sum_t \|x^* - x_t\|^2 + \frac{mT}{2}\|y^* - x^*\|^2 \\
&\ge \sum_i \sum_t \langle \nabla_i C_i(x_i^*, x_{-i,t}) - \nabla_i C_i(x^*),\, x_{i,t} - x_i^* \rangle + \frac{m}{2}\sum_t \|x^* - x_t\|^2 + \frac{mT}{2}\|y^* - x^*\|^2 \\
&\ge \sum_i \sum_t (-L)\,\|x_{-i,t} - x_{-i}^*\|\,\|x_{i,t} - x_i^*\| + \frac{m}{2}\sum_t \|x^* - x_t\|^2 + \frac{mT}{2}\|y^* - x^*\|^2, \tag{18}
\end{align*}
where the second-to-last inequality follows from the fact that $x^*$ is a Nash equilibrium and therefore satisfies $\sum_i \langle \nabla_i C_i(x^*), x_i - x_i^* \rangle \ge 0$ for all $x \in X$, and the last inequality is due to the Lipschitz continuity of $\nabla_i C_i(x)$ in $x_{-i}$. Applying the inequality $ab \le \frac{1}{2\lambda}a^2 + \frac{\lambda}{2}b^2$, valid for any $\lambda > 0$, to the term $\|x_{-i,t} - x_{-i}^*\|\,\|x_{i,t} - x_i^*\|$ in (18), we have
\begin{align*}
\sum_i \mathrm{Regret}_i &\ge -L\sum_i \sum_t \Big[\frac{1}{2\lambda}\|x_i^* - x_{i,t}\|^2 + \frac{\lambda}{2}\|x_{-i}^* - x_{-i,t}\|^2\Big] + \frac{m}{2}\sum_t \|x^* - x_t\|^2 + \frac{mT}{2}\|y^* - x^*\|^2 \\
&\ge -L\sum_t \Big[\frac{1}{2\lambda} + \frac{\lambda(N-1)}{2}\Big]\|x^* - x_t\|^2 + \frac{m}{2}\sum_t \|x^* - x_t\|^2 + \frac{mT}{2}\|y^* - x^*\|^2, \tag{19}
\end{align*}
where the last inequality follows from the fact that $\sum_i \|x_{-i}^* - x_{-i,t}\|^2 = (N-1)\|x^* - x_t\|^2$. Inequality (19) holds for any $\lambda > 0$. Substituting $\lambda = (N-1)^{-1/2}$ in (19), we get
\[ \sum_i \mathrm{Regret}_i \ge -L\sqrt{N-1}\sum_t \|x^* - x_t\|^2 + \frac{m}{2}\sum_t \|x^* - x_t\|^2 + \frac{mT}{2}\|y^* - x^*\|^2 = \frac{m - 2L\sqrt{N-1}}{2}\sum_t \|x_t - x^*\|^2 + \frac{mT}{2}\|y^* - x^*\|^2. \tag{20} \]
Since the learning algorithm is assumed to be no-regret, we have
\[ \frac{m - 2L\sqrt{N-1}}{2}\sum_t \|x_t - x^*\|^2 + \frac{mT}{2}\|y^* - x^*\|^2 \le \sum_i \mathrm{Regret}_i = O(T^a). \]
Given that both terms on the left-hand side of this inequality are nonnegative, we conclude that $\sum_t \|x^* - x_t\|^2 = O(T^a)$ and $\frac{mT}{2}\|y^* - x^*\|^2 = O(T^a)$. The proof is complete.
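To illustrate the statement of Theorem 1, here is a minimal sketch on a hypothetical two-agent quadratic game (all parameters below are assumptions for illustration, not from the paper): losses $C_i(x) = x_i^2 + 0.5\,x_i x_{-i} - x_i$ on $X_i = [0,2]$, so the symmetrized game Jacobian is $\big[\begin{smallmatrix}2 & 0.5\\ 0.5 & 2\end{smallmatrix}\big]$, giving $m = 1.5$, $L = 0.5$, and $m > 2L\sqrt{N-1} = 1$; the unique Nash equilibrium is $x^* = (0.4, 0.4)$. Running a concrete no-regret algorithm (projected online gradient descent with $\eta_t = \frac{1}{mt}$), the time-average squared distance to $x^*$ vanishes, as the theorem predicts:

```python
import numpy as np

m = 1.5
x_star = np.array([0.4, 0.4])   # solves 2x + 0.5x = 1 for each agent
x = np.array([2.0, 0.0])        # arbitrary initial actions
T = 2000
total = 0.0
for t in range(1, T + 1):
    grad = 2 * x + 0.5 * x[::-1] - 1.0          # each agent's own gradient grad_i C_i(x_t)
    x = np.clip(x - grad / (m * t), 0.0, 2.0)   # projected step with eta_t = 1/(m t)
    total += float(np.sum((x - x_star) ** 2))

print(total / T)   # small: time-average convergence to the Nash equilibrium
print(x)           # close to [0.4, 0.4]
```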

8.3. PROOF OF THEOREM 2:

According to Theorem 2 in Rosen (1965), strongly monotone games have a unique Nash equilibrium. Let $x^* = (x_i^*, x_{-i}^*)$ denote the Nash equilibrium of the game (1). Using the update equation (5), we have
\[ \|x_{i,t+1} - x_i^*\|^2 = \|P_{X_i}(x_{i,t} - \eta_t G_{i,t}) - x_i^*\|^2 \le \|x_{i,t} - x_i^* - \eta_t G_{i,t}\|^2 \le \|x_{i,t} - x_i^*\|^2 + \eta_t^2\|G_{i,t}\|^2 - 2\eta_t\langle G_{i,t},\, x_{i,t} - x_i^*\rangle, \]
where the first inequality follows from the facts that $P_{X_i}(x_i^*) = x_i^*$ and the projection $P_{X_i}$ is non-expansive. Taking the expectation of both sides of the above inequality, we get
\[ \mathbb{E}\|x_{i,t+1} - x_i^*\|^2 \le \mathbb{E}\big[\|x_{i,t} - x_i^*\|^2 + \eta_t^2\|G_{i,t}\|^2 - 2\eta_t\langle G_{i,t},\, x_{i,t} - x_i^*\rangle\big] \le \mathbb{E}\|x_{i,t} - x_i^*\|^2 + \eta_t^2\,\mathbb{E}\|\nabla_i C_i(x_t)\|^2 + \eta_t^2\sigma^2 - 2\eta_t\,\mathbb{E}\langle\nabla_i C_i(x_t),\, x_{i,t} - x_i^*\rangle. \tag{21} \]
Since $x^*$ is a Nash equilibrium of the convex game, we have $\langle\nabla_i C_i(x^*), x_{i,t} - x_i^*\rangle \ge 0$ for $i = 1, \dots, N$. Summing inequality (21) over $i = 1, \dots, N$, we get
\begin{align*}
\mathbb{E}\|x_{t+1} - x^*\|^2 &\le \mathbb{E}\|x_t - x^*\|^2 + \eta_t^2\sum_i\|\nabla_i C_i(x_t)\|^2 + \eta_t^2 N\sigma^2 - 2\eta_t\sum_i\langle\nabla_i C_i(x_t),\, x_{i,t} - x_i^*\rangle \\
&\le \mathbb{E}\|x_t - x^*\|^2 + \eta_t^2 NB^2 + \eta_t^2 N\sigma^2 - 2\eta_t\sum_i\langle\nabla_i C_i(x_t) - \nabla_i C_i(x^*),\, x_{i,t} - x_i^*\rangle \\
&\le (1 - 2\eta_t m)\,\mathbb{E}\|x_t - x^*\|^2 + \eta_t^2 NB^2 + \eta_t^2 N\sigma^2 \le \Big(1 - \frac{2}{t}\Big)\mathbb{E}\|x_t - x^*\|^2 + \frac{N(B^2 + \sigma^2)}{m^2 t^2},
\end{align*}
where the second inequality follows from Assumption 2 and the fact that $x^*$ is a Nash equilibrium, the third inequality follows from the strong monotonicity of the game (1), and the last inequality is obtained by substituting $\eta_t = \frac{1}{mt}$. Then, using Lemma 5, we obtain
\[ \mathbb{E}\|x_T - x^*\|^2 \le \frac{N(B^2 + \sigma^2)}{m^2 T} + O\Big(\frac{1}{T}\Big), \]
which completes the proof.

8.4 PROOF OF LEMMA 3:

In the analysis that follows, the expectations are taken with respect to $w_i \sim B_i$ and $u_{-i} \sim S_{-i}$. Since $C_i$ is bounded and the perturbations have finite support, by Lebesgue's dominated convergence theorem (Chapter 4 in Royden & Fitzpatrick (1988)), we can interchange the order of integration and differentiation.
From the definition of $C_i^\delta(x)$, it follows that
\[ \|\nabla_i C_i(x) - \nabla_i C_i^\delta(x)\| = \big\|\nabla_i C_i(x) - \nabla_i\,\mathbb{E}[C_i(x_i + \delta w_i, x_{-i} + \delta u_{-i})]\big\| = \big\|\mathbb{E}\big[\nabla_i C_i(x) - \nabla_i C_i(x_i + \delta w_i, x_{-i} + \delta u_{-i})\big]\big\| \le \mathbb{E}\big[L_1\delta\,\|(w_i, u_{-i})\|\big] \le L_1\delta\sqrt{N}, \tag{22} \]
where the first inequality follows from the Lipschitz continuity of $\nabla_i C_i(x)$ with respect to $x$. Recall that $A_\delta$ denotes the set of Nash equilibria of the smoothed game with losses $C_i^\delta(x)$. Take two arbitrary points $y_\delta^*, z_\delta^* \in A_\delta$. Then, we have
\begin{align*}
\sum_i\langle\nabla_i C_i^\delta(y_\delta^*) - \nabla_i C_i^\delta(z_\delta^*),\, y_{\delta i}^* - z_{\delta i}^*\rangle &= \sum_i\langle\nabla_i C_i(y_\delta^*) - \nabla_i C_i(z_\delta^*),\, y_{\delta i}^* - z_{\delta i}^*\rangle + \sum_i\langle\nabla_i C_i^\delta(y_\delta^*) - \nabla_i C_i(y_\delta^*) - \nabla_i C_i^\delta(z_\delta^*) + \nabla_i C_i(z_\delta^*),\, y_{\delta i}^* - z_{\delta i}^*\rangle \\
&\ge \sum_i\langle\nabla_i C_i(y_\delta^*) - \nabla_i C_i(z_\delta^*),\, y_{\delta i}^* - z_{\delta i}^*\rangle - \sum_i 2L_1\delta\sqrt{N}\,\|y_{\delta i}^* - z_{\delta i}^*\| \\
&\ge m\|y_\delta^* - z_\delta^*\|^2 - 2L_1\delta N\|y_\delta^* - z_\delta^*\|, \tag{23}
\end{align*}
where the first inequality follows from inequality (22) and the second is derived using the strong monotonicity condition (2) and the fact that $\big(\sum_i\|y_{\delta i}^* - z_{\delta i}^*\|\big)^2 \le N\sum_i\|y_{\delta i}^* - z_{\delta i}^*\|^2 = N\|y_\delta^* - z_\delta^*\|^2$. Since $y_\delta^*, z_\delta^*$ are Nash equilibria of the smoothed game, for all $x_i \in (1-\delta)X_i$ we have $\langle\nabla_i C_i^\delta(y_\delta^*), x_i - y_{\delta i}^*\rangle \ge 0$ and $\langle\nabla_i C_i^\delta(z_\delta^*), x_i - z_{\delta i}^*\rangle \ge 0$. Replacing $x_i$ in these two inequalities with $z_{\delta i}^*$ and $y_{\delta i}^*$, respectively, we obtain $\langle\nabla_i C_i^\delta(y_\delta^*), z_{\delta i}^* - y_{\delta i}^*\rangle \ge 0$ and $\langle\nabla_i C_i^\delta(z_\delta^*), y_{\delta i}^* - z_{\delta i}^*\rangle \ge 0$. Adding these two inequalities and summing over $i$ gives
\[ \sum_i\langle\nabla_i C_i^\delta(y_\delta^*) - \nabla_i C_i^\delta(z_\delta^*),\, y_{\delta i}^* - z_{\delta i}^*\rangle \le 0. \tag{24} \]
Combining inequalities (23) and (24), we get $m\|y_\delta^* - z_\delta^*\|^2 - 2L_1\delta N\|y_\delta^* - z_\delta^*\| \le 0$. Rearranging the terms in this inequality completes the proof.
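A small Monte-Carlo sketch of the bound (22) on a hypothetical scalar loss (the loss $C(x) = x^4$ and all constants below are assumptions for illustration, not from the paper): the gradient of the smoothed loss stays within $L_1\delta\sqrt{N}$ of the true gradient, where $L_1$ bounds $|C''| = 12x^2$ on the sampled region.

```python
import numpy as np

# C(x) = x^4, so grad C(x) = 4 x^3; smoothing perturbations w ~ Uniform[-1, 1].
rng = np.random.default_rng(0)
x0, delta = 1.0, 0.1
w = rng.uniform(-1.0, 1.0, size=200_000)
grad_smoothed = np.mean(4 * (x0 + delta * w) ** 3)   # estimates grad of C^delta at x0
gap = abs(grad_smoothed - 4 * x0 ** 3)

# Analytically E[4(x + delta w)^3] = 4 x^3 + 4 x delta^2, so the gap is 4 x delta^2 = 0.04,
# well below the bound L_1 * delta with L_1 = 12 (x0 + delta)^2 on the sampled region (N = 1).
print(gap)                                     # ≈ 0.04
print(gap <= 12 * (x0 + delta) ** 2 * delta)   # the bound (22) holds
```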
8.5 PROOF OF LEMMA 4:

Let $\hat{x}^*$ denote the Nash equilibrium of the game defined by the losses $C_i$ over the shrunk set $(1-\delta)X$, and recall that $x^*$ is the Nash equilibrium of the original game with losses $C_i$ over $X$ and $x^{\delta,j}$ is a Nash equilibrium of the smoothed game with losses $C_i^\delta$ over $(1-\delta)X$. Since the losses $C_i$ are strongly monotone, the Nash equilibrium $\hat{x}^*$ is unique and well-defined. By the triangle inequality, we have
\[ \|x^* - x^{\delta,j}\| \le \|x^* - \hat{x}^*\| + \|x^{\delta,j} - \hat{x}^*\|. \tag{25} \]
We first bound $\|x^* - \hat{x}^*\|$. This bound has already been derived by Drusvyatskiy & Ratliff (2021) and takes the form
\[ \|x^* - \hat{x}^*\| \le \delta\Big(1 + \frac{L_1\sqrt{N}}{m}\Big)\|x^*\|. \]
Next, we focus on the term $\|x^{\delta,j} - \hat{x}^*\|$. The convexity of the game guarantees that the Nash equilibria satisfy the following property:
\[ \langle\nabla_i C_i(\hat{x}^*),\, x_i - \hat{x}_i^*\rangle \ge 0, \qquad \langle\nabla_i C_i^\delta(x^{\delta,j}),\, x_i - x_i^{\delta,j}\rangle \ge 0, \qquad \forall x_i \in (1-\delta)X_i. \tag{26} \]
Replacing $x_i$ in the two inequalities in (26) with $x_i^{\delta,j}$ and $\hat{x}_i^*$, respectively, and summing the two inequalities, we have
\[ \langle\nabla_i C_i(\hat{x}^*) - \nabla_i C_i^\delta(x^{\delta,j}),\, x_i^{\delta,j} - \hat{x}_i^*\rangle \ge 0. \tag{27} \]
Combining (27) with the strong monotonicity condition (2), we have
\begin{align*}
m\|x^{\delta,j} - \hat{x}^*\|^2 &\le \sum_i\langle\nabla_i C_i(x^{\delta,j}) - \nabla_i C_i(\hat{x}^*),\, x_i^{\delta,j} - \hat{x}_i^*\rangle \le \sum_i\langle\nabla_i C_i(x^{\delta,j}) - \nabla_i C_i^\delta(x^{\delta,j}),\, x_i^{\delta,j} - \hat{x}_i^*\rangle \\
&\le \sum_i L_1\delta\sqrt{N}\,\|x_i^{\delta,j} - \hat{x}_i^*\| \le L_1\delta N\|x^{\delta,j} - \hat{x}^*\|,
\end{align*}
where the first inequality is due to the strong monotonicity condition (2), the second inequality follows from (27), and the third inequality follows from (22). Therefore, for any Nash equilibrium $x^{\delta,j} \in A_\delta$, we have
\[ \|x^{\delta,j} - \hat{x}^*\| \le \frac{L_1\delta N}{m}. \tag{28} \]
Combining (28) with (25), we have
\[ \|x^* - x^{\delta,j}\| \le \|x^* - \hat{x}^*\| + \|x^{\delta,j} - \hat{x}^*\| \le \delta\Big(1 + \frac{L_1\sqrt{N}}{m}\Big)\|x^*\| + \frac{L_1\delta N}{m}, \]
which completes the proof.

8.6 PROOF OF THEOREM 3:

Recall that the Nash equilibria of the smoothed game with losses $C_i^\delta$ over the set $(1-\delta)X$ are denoted by $x^{\delta,j}$.
From the update equation (7), for any $x^{\delta,j} \in A_\delta$, we have
\[ \|x_{i,t+1} - x_i^{\delta,j}\|^2 = \|P_{(1-\delta)X_i}(x_{i,t} - \eta_t g_{i,t}) - x_i^{\delta,j}\|^2 \le \|x_{i,t} - x_i^{\delta,j} - \eta_t g_{i,t}\|^2 \le \|x_{i,t} - x_i^{\delta,j}\|^2 + \eta_t^2\|g_{i,t}\|^2 - 2\eta_t\langle g_{i,t},\, x_{i,t} - x_i^{\delta,j}\rangle, \]
where the first inequality holds since $x_i^{\delta,j} \in (1-\delta)X_i$. Taking expectations with respect to $u_{i,t}$ on both sides of the above inequality, we have
\[ \mathbb{E}\|x_{i,t+1} - x_i^{\delta,j}\|^2 \le \mathbb{E}\|x_{i,t} - x_i^{\delta,j}\|^2 + \frac{\eta_t^2 d_i^2 U^2}{4\delta^2} - 2\eta_t\langle\nabla_i C_i^\delta(x_t),\, x_{i,t} - x_i^{\delta,j}\rangle. \tag{29} \]
Since $x^{\delta,j}$ is a Nash equilibrium of the smoothed game, it satisfies $\langle\nabla_i C_i^\delta(x^{\delta,j}), x_i - x_i^{\delta,j}\rangle \ge 0$ for all $x_i \in (1-\delta)X_i$. Summing inequality (29) over $i = 1, \dots, N$, we get
\begin{align*}
\mathbb{E}\|x_{t+1} - x^{\delta,j}\|^2 &\le \mathbb{E}\|x_t - x^{\delta,j}\|^2 + \frac{\eta_t^2 d_i^2 U^2 N}{4\delta^2} - 2\eta_t\sum_i\langle\nabla_i C_i^\delta(x_t) - \nabla_i C_i^\delta(x^{\delta,j}),\, x_{i,t} - x_i^{\delta,j}\rangle \\
&\le \mathbb{E}\|x_t - x^{\delta,j}\|^2 + \frac{\eta_t^2 d_i^2 U^2 N}{4\delta^2} - 2\eta_t\big(m\|x_t - x^{\delta,j}\|^2 - 2L_1\delta N\|x_t - x^{\delta,j}\|\big) \\
&\le (1 - 2\eta_t m)\,\mathbb{E}\|x_t - x^{\delta,j}\|^2 + \frac{\eta_t^2 d_i^2 U^2 N}{4\delta^2} + 4\eta_t\delta L_1 ND, \tag{30}
\end{align*}
where the second inequality can be obtained using techniques similar to those in (23). Substituting $\eta_t = \frac{1}{mt}$ into (30), we get
\[ \mathbb{E}\|x_{t+1} - x^{\delta,j}\|^2 \le \Big(1 - \frac{2}{t}\Big)\mathbb{E}\|x_t - x^{\delta,j}\|^2 + \frac{d_i^2 U^2 N}{4m^2 t^2\delta^2} + \frac{4L_1 ND\delta}{mt} \le \Big(1 - \frac{2}{t}\Big)\mathbb{E}\|x_t - x^{\delta,j}\|^2 + \frac{E_1}{t^2\delta^2} + \frac{E_2\delta}{t}, \tag{31} \]
where $E_1 = \frac{d_i^2 U^2 N}{4m^2}$ and $E_2 = \frac{4L_1 ND}{m}$. Using induction, it can be verified that for all $t \ge 1$ there exists a constant $A_0 > 0$ such that
\[ \mathbb{E}\|x_t - x^{\delta,j}\|^2 \le \max\Big\{\frac{E_1}{t\delta^2} + E_2\delta,\; A_0\Big(\frac{1}{t\delta^2} + \delta\Big)\Big\}. \tag{32} \]
Replacing $t$ with $T$ and setting $\delta = T^{-1/3}$ in (32), we get
\[ \mathbb{E}\|x_T - x^{\delta,j}\|^2 \le \max\Big\{\frac{E_1}{T\delta^2} + E_2\delta,\; A_0\Big(\frac{1}{T\delta^2} + \delta\Big)\Big\} \le 2A_1 T^{-1/3}, \tag{33} \]
where $A_1 := \max\{E_1, E_2, A_0\}$. Combining (33) with Lemma 4, we have
\[ \mathbb{E}\|x_T - x^*\|^2 \le 2\,\mathbb{E}\|x_T - x^{\delta,j}\|^2 + 2\|x^{\delta,j} - x^*\|^2 \le 4A_1 T^{-1/3} + 2\Big(\Big(1 + \frac{L_1\sqrt{N}}{m}\Big)\|x^*\| + \frac{L_1 N}{m}\Big)^2 T^{-2/3} = O\big(T^{-1/3}\big), \]
which completes the proof.
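The one-point bandit estimator behind the zeroth-order update (7) can be sketched in one dimension (an illustrative toy; the paper's estimator $g_{i,t}$ and the constants $d_i$, $U$, $\delta$ are only mimicked here, not reproduced exactly): the agent observes only the loss value at a randomly perturbed action and forms $g = \frac{d}{\delta}\,C(x + \delta u)\,u$, whose expectation is the gradient of the smoothed loss.

```python
import numpy as np

# One dimension: d = 1, u uniform on the unit sphere {-1, +1}, loss C(x) = x^2.
# Then E[g] = (C(x + delta) - C(x - delta)) / (2 delta), the central difference,
# which equals grad C(x) = 2x exactly for a quadratic loss.
rng = np.random.default_rng(1)
x0, delta = 1.0, 0.1
u = rng.choice([-1.0, 1.0], size=400_000)
g = (1.0 / delta) * (x0 + delta * u) ** 2 * u   # one-point gradient estimates

print(g.mean())   # ≈ 2 x0 = 2, even though each sample has variance of order 1/delta^2
```

The large per-sample variance (here the samples are roughly $\pm 10$) is why the step size and the smoothing radius $\delta$ must be tuned jointly, which is the source of the slower $O(T^{-1/3})$ rate in Theorem 3.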

8.7. PROOF OF THEOREM 4:

From the convexity of the loss function $C_i$ and the update rule (10), we have
\[ \langle\nabla_i C_i(x_{i,t+1}, x_{-i,t}),\, x_i - x_{i,t+1}\rangle \ge 0, \qquad \forall x_i \in X_i. \tag{35} \]
Since the game (1) is strongly monotone, we also have that for all $x_i \in X_i$,
\[ \langle\nabla_i C_i(x_i, x_{-i,t}) - \nabla_i C_i(x_{i,t+1}, x_{-i,t}),\, x_i - x_{i,t+1}\rangle \ge m\|x_i - x_{i,t+1}\|^2. \tag{36} \]
Setting $x_i = x_i^*$ in (35) and (36) and combining the two inequalities, we obtain
\[ m\|x_i^* - x_{i,t+1}\|^2 \le \langle\nabla_i C_i(x_i^*, x_{-i,t}) - \nabla_i C_i(x_{i,t+1}, x_{-i,t}),\, x_i^* - x_{i,t+1}\rangle \le \langle\nabla_i C_i(x_i^*, x_{-i,t}),\, x_i^* - x_{i,t+1}\rangle. \tag{37} \]
Summing inequality (37) over $i = 1, \dots, N$, it follows that
\begin{align*}
\|x_{t+1} - x^*\|^2 &\le \frac{1}{m}\sum_i\langle\nabla_i C_i(x_i^*, x_{-i,t}),\, x_i^* - x_{i,t+1}\rangle \le \frac{1}{m}\sum_i\langle\nabla_i C_i(x_i^*, x_{-i,t}) - \nabla_i C_i(x^*),\, x_i^* - x_{i,t+1}\rangle \\
&\le \frac{1}{m}\sum_i L\,\|x_{-i,t} - x_{-i}^*\|\,\|x_i^* - x_{i,t+1}\| \le \frac{L}{m}\sum_i\Big[\frac{1}{2\lambda}\|x_{-i,t} - x_{-i}^*\|^2 + \frac{\lambda}{2}\|x_{i,t+1} - x_i^*\|^2\Big] \\
&\le \frac{L}{m}\Big[\frac{N-1}{2\lambda}\|x_t - x^*\|^2 + \frac{\lambda}{2}\|x_{t+1} - x^*\|^2\Big], \tag{38}
\end{align*}
where the second inequality follows from the Nash equilibrium condition $\sum_i\langle\nabla_i C_i(x^*), x_i - x_i^*\rangle \ge 0$ for all $x \in X$, the third inequality follows from the Lipschitz continuity of $\nabla_i C_i$ in $x_{-i}$, and the second-to-last inequality is due to the fact that $ab \le \frac{1}{2\lambda}a^2 + \frac{\lambda}{2}b^2$ for any $\lambda > 0$. Setting $\lambda = \frac{m}{L}$ and rearranging the terms in (38), we obtain
\[ \|x_{t+1} - x^*\|^2 \le \frac{L^2(N-1)}{m^2}\|x_t - x^*\|^2. \tag{39} \]
Applying inequality (39) iteratively for $t = 1, \dots, T$, we have
\[ \|x_T - x^*\|^2 \le \Big(\frac{L^2(N-1)}{m^2}\Big)^T\|x_0 - x^*\|^2, \]
which completes the proof.
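The contraction (39) can be checked on a hypothetical two-agent quadratic game (all parameters below are assumptions for illustration, not from the paper): losses $C_i(x) = x_i^2 + 0.5\,x_i x_{-i} - x_i$ on $[0,2]$ give $m = 1.5$ and $L = 0.5$, so the factor of Theorem 4 is $\rho = \frac{L^2(N-1)}{m^2} = \frac{1}{9} < 1$, and each simultaneous best response step solves $x_{i,t+1} = \operatorname{argmin}_{x_i} C_i(x_i, x_{-i,t})$ in closed form.

```python
import numpy as np

x_star = np.array([0.4, 0.4])        # unique Nash equilibrium of the toy game
rho = 0.5 ** 2 * 1 / 1.5 ** 2        # L^2 (N - 1) / m^2 = 1/9
x = np.array([2.0, 0.0])
errors = []
for t in range(15):
    # best response: argmin of x_i^2 + 0.5 x_i x_{-i} - x_i, clipped to [0, 2]
    x = np.clip((1.0 - 0.5 * x[::-1]) / 2.0, 0.0, 2.0)
    errors.append(float(np.sum((x - x_star) ** 2)))

print(x)   # ≈ [0.4, 0.4]
# every step contracts the squared error by at least the factor rho of (39)
print(all(errors[t + 1] <= rho * errors[t] + 1e-12 for t in range(len(errors) - 1)))
```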

8.8. BR ALGORITHM IS NO-REGRET LEARNING

Theorem 5. Suppose that the game (1) is strongly monotone with parameter $m$ satisfying $m > L\sqrt{N-1}$. Then the BR algorithm achieves no-regret learning; specifically,
\[ \mathrm{Reg}_i = \sum_{t=1}^T C_i(x_t) - \min_{x_i \in X_i}\sum_{t=1}^T C_i(x_i, x_{-i,t}) = O(\sqrt{T}). \tag{40} \]
Proof. Let $y_i^* := \operatorname{argmin}_{x_i}\sum_{t=1}^T C_i(x_i, x_{-i,t})$. Recalling that $x_{i,t+1} = \operatorname{argmin}_{x_i} C_i(x_i, x_{-i,t})$, we have $\langle\nabla_i C_i(x_{i,t+1}, x_{-i,t}),\, y_i^* - x_{i,t+1}\rangle \ge 0$. From the definition of regret in (3), we have
\begin{align*}
\mathrm{Reg}_i &= \sum_{t=1}^T\big(C_i(x_t) - C_i(y_i^*, x_{-i,t})\big) = \sum_{t=1}^T\big[C_i(x_t) - C_i(x_{i,t+1}, x_{-i,t})\big] + \sum_{t=1}^T\big[C_i(x_{i,t+1}, x_{-i,t}) - C_i(y_i^*, x_{-i,t})\big] \\
&\le \sum_{t=1}^T\big[C_i(x_t) - C_i(x_{i,t+1}, x_{-i,t})\big] + \sum_{t=1}^T\langle\nabla_i C_i(x_{i,t+1}, x_{-i,t}),\, x_{i,t+1} - y_i^*\rangle \le \sum_{t=1}^T\big[C_i(x_t) - C_i(x_{i,t+1}, x_{-i,t})\big],
\end{align*}
where the first inequality is due to the convexity of $C_i(x)$ with respect to $x_i$ and the second inequality follows from the necessary condition of optimality. Then, it follows that
\[ \mathrm{Reg}_i \le \sum_{t=1}^T\big[C_i(x_t) - C_i(x_{t+1})\big] + \sum_{t=1}^T\big[C_i(x_{t+1}) - C_i(x_{i,t+1}, x_{-i,t})\big] \le C_i(x_1) + \sum_{t=1}^T\big[C_i(x_{t+1}) - C_i(x_{i,t+1}, x_{-i,t})\big] \le U + L_0\sum_{t=1}^T\|x_{-i,t+1} - x_{-i,t}\|, \tag{41} \]
where the last inequality follows from the Lipschitz continuity of $C_i$ in $x$. Summing inequality (41) over $i = 1, \dots, N$, we have
\[ \sum_{i=1}^N\mathrm{Reg}_i \le NU + L_0\sum_{i=1}^N\sum_{t=1}^T\|x_{-i,t+1} - x_{-i,t}\| \le NU + L_0\sqrt{N(N-1)}\sum_{t=1}^T\|x_{t+1} - x_t\|, \tag{42} \]
where the last inequality follows from the fact that $\big(\sum_i\|x_{-i,t+1} - x_{-i,t}\|\big)^2 \le N\sum_i\|x_{-i,t+1} - x_{-i,t}\|^2 \le N(N-1)\|x_{t+1} - x_t\|^2$. From inequality (39) in the proof of Theorem 4, when $m > L\sqrt{N-1}$, we have $\|x_{t+1} - x^*\|^2 \le \rho\|x_t - x^*\|^2$ with $\rho := \frac{L^2(N-1)}{m^2} < 1$. Using this result, we can bound the term $\|x_{t+1} - x_t\|^2$ as follows:
\[ \|x_{t+1} - x_t\|^2 = \|x_{t+1} - x^* + x^* - x_t\|^2 \le \|x_{t+1} - x^*\|^2 + \|x_t - x^*\|^2 - 2\|x_{t+1} - x^*\|\,\|x_t - x^*\| \le \|x_{t+1} - x^*\|^2 + \|x_t - x^*\|^2 - \frac{2}{\sqrt{\rho}}\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2, \tag{43} \]
where the last inequality holds since $\rho < 1$.
Then, substituting the inequality (43) into the inequality (42), we get that 8.9 ADDITIONAL EXPERIMENTS: RETAILER PRICING GAME Consider a market with two retailers and four products. Each retailer is responsible for selling different products and making pricing decisions for their own products at each episode t. Suppose that Retailer 1 sells the j-th product for j = 1, 2, while Retailer 2 sells the j-th product for j = 3, 4. Retailers 1 and 2 make pricing decisions x t 1 = (p t 1 , p t 2 ) and x t 2 = (p t 3 , p t 4 ), respectively, where x t i denotes the decision of Retailer i, and p t j denotes the price of the j-th product at episode t. All products are substitutes or complements of each other and their pricing decisions influence the demand of other products. The demand of the j-th product is modeled as D j (x) = 4 k=1 A jk p t k + b j , where A jk , b j > 0 are constants. Note that A jk > 0 if the k-th product is a substitute for the j-th product, A jk < 0 if the k-th product is a complement of the j-th product, and A jj < 0 for j = 1, 2, 3, 4. The goal of each retailer is to minimize their own loss, where Using these parameters, we get that m = 2.82, L = 0.3, and, therefore, m > L √ N -1. As a result, the sufficient conditions are satisfied for all three algorithms. The results are shown in Figure 4 , where the solid lines and shaded areas denote the average and standard deviation over 60 runs. Figure 4 shows that all three algorithms converge to the Nash equilibrium. 8.10 ADDITIONAL EXPERIMENTS: KELLY AUCTION Consider a service provider (providing, e.g., bandwidth, space on a website, etc.) with two bidders N = 2 that place monetary bids x i ∈ [0, D] for the utilization of the resource. Each bidder receives some unit of the resource which is proportional to their own bid, i.e., the i-th bidder gets ρ i = xi d+ N i=1 xi units of resource, where d ≥ 0 is the entry barrier for bidding on it. 
We model the loss of each bidder as $u_i(x) = -(g_i\rho_i - x_i)$, where $g_i$ is the bidder's marginal gain from obtaining a unit of the resource. We set $g_i = 0.8$ for $i = 1, 2$ and $d = 0.2$. Figure 5 presents simulation results for this example. We see that the first-order and zeroth-order methods converge to the Nash equilibrium. Note also that, in this example, $m < L\sqrt{N-1}$ holds regardless of the choice of parameters, so the sufficient condition for the best response algorithm is not satisfied. Nevertheless, as seen in Figure 5(b), the best response algorithm still converges to the Nash equilibrium, which is consistent with our theoretical analysis: the condition is sufficient but not necessary.
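The equilibrium bid is not stated in the text; assuming an interior symmetric equilibrium (i.e., the bid cap $D$ exceeds the computed value), it can be found from the first-order condition. Setting the derivative of the loss $u_i(x) = x_i - g\,\frac{x_i}{d + x_1 + x_2}$ to zero at a symmetric profile $x_1 = x_2 = x$ gives $g(d + x) = (d + 2x)^2$, which a bisection solves directly:

```python
# Symmetric Kelly auction with the stated parameters g = 0.8, d = 0.2.
g, d = 0.8, 0.2

def foc(x):
    # derivative of one bidder's loss at the symmetric profile (x, x);
    # it is strictly increasing in x, so the root is unique
    return 1.0 - g * (d + x) / (d + 2 * x) ** 2

lo, hi = 0.0, 1.0        # foc(0) < 0 and foc(1) > 0, so bisection applies
for _ in range(60):
    mid = (lo + hi) / 2
    if foc(mid) < 0:
        lo = mid
    else:
        hi = mid
x_eq = (lo + hi) / 2

print(x_eq)   # ≈ 0.1732, i.e. sqrt(0.03): the condition reduces to 4 x^2 = 0.12
```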



Figure 1: Cournot game when $m > L\sqrt{N-1}$. (a) Error to Nash equilibrium of the first-order method (FO) and the zeroth-order method (ZO). (b) Action values of the best response algorithm (BR).

Figure 2: Cournot game when $0 < m < L\sqrt{N-1}$. Error to Nash equilibrium of the first-order method (FO) and the zeroth-order method (ZO).

Figure 3: Cournot game with multiple agents ($N = 5$) when $m > L\sqrt{N-1}$. (a) Error to Nash equilibrium of the first-order method (FO) and the zeroth-order method (ZO). (b) Error to Nash equilibrium of the best response algorithm (BR).

Completing the proof of Theorem 5, substituting inequality (43) into inequality (42) and applying the Cauchy–Schwarz inequality gives
\[ \sum_{i=1}^N\mathrm{Reg}_i \le NU + L_0\sqrt{N(N-1)}\sum_{t=1}^T\sqrt{\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2} \le NU + L_0\sqrt{N(N-1)T}\,\|x_1 - x^*\| \le NU + L_0 D\sqrt{N(N-1)T}; \]
the fact that $\mathrm{Reg}_i \le \sum_{i=1}^N\mathrm{Reg}_i$ completes the proof.

From inequality (41) in the convergence proof of the best response algorithm, we have that
\[ \mathrm{Reg}_i \le \sum_{t=1}^T\big[C_i(x_t) - C_i(x_{i,t+1}, x_{-i,t})\big] = \sum_{t=1}^T C_i(x_t) - \sum_{t=1}^T\min_{x_i\in X_i}C_i(x_i, x_{-i,t}) = O(\sqrt{T}), \]
which indicates that the dynamic regret of agent $i$, i.e., $\sum_{t=1}^T C_i(x_t) - \sum_{t=1}^T\min_{x_i\in X_i}C_i(x_i, x_{-i,t})$, is of order $O(\sqrt{T})$. Note that this term captures the dynamic regret induced by the variation of the other agents' actions.

Figure 4: Retailer pricing game when $m > L\sqrt{N-1}$. (a) Error to Nash equilibrium of the first-order method (FO) and the zeroth-order method (ZO). (b) Error to Nash equilibrium of the best response algorithm (BR).

The losses of the two retailers in the retailer pricing game are $C_1(x) = -(p_1 D_1(x) + p_2 D_2(x))$ and $C_2(x) = -(p_3 D_3(x) + p_4 D_4(x))$, with demand parameters given by the matrix $A = (A_{jk})_{1\le j,k\le 4}$.

Figure 5: Kelly auction game. (a) Error to Nash equilibrium of the first-order method (FO) and the zeroth-order method (ZO). (b) Error to Nash equilibrium of the best response algorithm (BR).

