THE POWER OF REGULARIZATION IN SOLVING EXTENSIVE-FORM GAMES

Abstract

In this paper, we investigate the power of regularization, a common technique in reinforcement learning and optimization, in solving extensive-form games (EFGs). We propose a series of new algorithms based on regularizing the payoff functions of the game, and establish a set of convergence results that strictly improve over the existing ones, with either weaker assumptions or stronger convergence guarantees. In particular, we first show that dilated optimistic mirror descent (DOMD), an efficient variant of OMD for solving EFGs, with adaptive regularization can achieve a fast $O(1/T)$ last-iterate convergence rate in terms of the duality gap and the distance to the set of Nash equilibria (NE), without assuming uniqueness of the NE. Second, we show that regularized counterfactual regret minimization (Reg-CFR), with a variant of the optimistic mirror descent algorithm as regret minimizer, can achieve $O(1/T^{1/4})$ best-iterate and $O(1/T^{3/4})$ average-iterate convergence rates for finding NE in EFGs. Finally, we show that Reg-CFR can achieve asymptotic last-iterate convergence, and an optimal $O(1/T)$ average-iterate convergence rate, for finding the NE of perturbed EFGs, which is useful for finding approximate extensive-form perfect equilibria (EFPE). To the best of our knowledge, these constitute the first last-iterate convergence results for CFR-type algorithms, while matching the state-of-the-art average-iterate convergence rate in finding NE for non-perturbed EFGs. We also provide numerical results to corroborate the advantages of our algorithms.

1. INTRODUCTION

Extensive-form games (EFGs) are widely used in modeling sequential decision-making of multiple agents with imperfect information. Many popular real-world multi-agent learning problems can be modeled as EFGs, including Poker (Brown and Sandholm, 2018; 2019b), Scotland Yard (Schmid et al., 2021), Bridge (Tian et al., 2020), cloud computing (Kakkad et al., 2019), and auctions (Shubik, 1971). Despite the recent success of many of these applications, efficiently solving large-scale EFGs is still challenging. Solving an EFG typically refers to finding a Nash equilibrium (NE) of the game, especially in the two-player zero-sum setting. In the past decades, the most popular methods for solving EFGs have arguably been regret-minimization based methods, such as counterfactual regret minimization (CFR) (Zinkevich et al., 2007) and its variants (Tammelin et al., 2015; Brown and Sandholm, 2019a). By controlling the regret of each player, the average of the strategies constitutes an approximate NE in two-player zero-sum games, which is called average-iterate convergence (Zinkevich et al., 2007; Tammelin et al., 2015; Farina et al., 2019a). However, averaging the strategies can be undesirable: it not only incurs additional memory and computation for the average strategy (Bowling et al., 2015), but also introduces additional representation and optimization errors when function approximation is used. For example, when neural networks are used to parameterize the strategies, the averaged strategy may not be representable and the optimization objective can be highly non-convex. Therefore, it is imperative to understand whether an (approximate) NE can be found efficiently without averaging, which motivates the study of last-iterate convergence.
In fact, the popular CFR-type algorithms mentioned above only enjoy average-iterate convergence guarantees so far (Zinkevich et al., 2007; Tammelin et al., 2015; Farina et al., 2019a), and it is unclear whether last-iterate convergence is achievable for this type of algorithm. The recent advances in Optimistic Mirror Descent (Rakhlin and Sridharan, 2013; Mertikopoulos et al., 2019; Wei et al., 2021; Cai et al., 2022) shed light on how to achieve last-iterate convergence for solving normal-form games (NFGs), a strict sub-class of EFGs. Last-iterate convergence in EFGs has not received attention until recently (Bowling et al., 2015; Farina et al., 2019c; Lee et al., 2021). Specifically, Bowling et al. (2015) provided some empirical evidence of last-iterate convergence for CFR-type algorithms, while Farina et al. (2019c) showed empirically that OMD enjoys last-iterate convergence in EFGs. Lee et al. (2021) proposed an OMD variant with the first last-iterate convergence guarantees in EFGs, but the solution itself leaves room for improvement: to make the update computationally efficient, the mirror map needs to be generated through a dilated operation (see §2 for more details), and in this case the analysis in Lee et al. (2021) requires the NE to be unique. In particular, an important and arguably the most well-studied instance of OMD for no-regret learning over the simplex, i.e., the optimistic multiplicative weights update (OMWU) (Daskalakis and Panageas, 2019; Wei et al., 2021), has so far not been shown to have an explicit last-iterate convergence rate without such a uniqueness condition, even for normal-form games. Anagnostides et al. (2022) can only guarantee asymptotic last-iterate convergence without the uniqueness assumption. Indeed, it is left as an open question in Wei et al. (2021) whether the uniqueness condition is necessary for OMWU to converge with an explicit rate for this strict sub-class of EFGs, when a constant stepsize is used.
In this paper, we remove the uniqueness condition, while establishing last-iterate convergence for Dilated Optimistic Mirror Descent (DOMD) type methods. The solution relies on exploiting the power of regularization techniques in EFGs. Our last-iterate convergence guarantee covers not only the duality gap, a common metric used in the literature, but also the actual iterate, i.e., the distance to the set of NE. This matches the bona fide last-iterate convergence studied in the literature, e.g., Daskalakis and Panageas (2019); Wei et al. (2021), and this kind of last-iterate guarantee was previously unknown when the mirror map is either dilated or entropy-based. More importantly, the techniques we develop can also be applied to CFR, resulting in the first last-iterate convergence guarantee for CFR-type algorithms. We detail our contributions as follows.
Contributions. Our contributions are mainly four-fold: (i) We develop a new type of dilated OMD algorithm, an efficient variant of OMD that exploits the structure of EFGs, with adaptive regularization (Reg-DOMD), and prove an explicit convergence rate of the duality gap, without the uniqueness assumption of the NE. (ii) We further establish a last-iterate convergence rate for the dilated optimistic multiplicative weights update to the NE of EFGs (beyond the duality gap as in Cen et al. (2021b), for the NFG setting), when a constant stepsize is used. This also moves one step further towards solving the open question for the NFG setting of whether the uniqueness assumption can be removed to prove last-iterate convergence of the authentic OMWU algorithm with constant stepsizes (Daskalakis and Panageas, 2019; Wei et al., 2021).
(iii) For CFR-type algorithms, using the regularization technique, we establish the first best-iterate convergence rate of $O(1/T^{1/4})$ for finding the NE of non-perturbed EFGs, and asymptotic last-iterate convergence for finding the NE of perturbed EFGs in terms of duality gap, which is useful for finding approximate extensive-form perfect equilibria (EFPE) (Selten, 1975). (iv) As a by-product of our analysis, we also provide a faster and optimal $O(1/T)$ average-iterate convergence guarantee for finding the NE of perturbed EFGs (see the formal definition in §4.1), while also matching the state-of-the-art guarantees for CFR-type algorithms in finding NE for the non-perturbed EFGs in terms of duality gap (Farina et al., 2019a).
Technical challenges. We emphasize the technical challenges we address as follows. First, by adding regularization to the original problem, Reg-DOMD will converge to the NE of the regularized
Table 1: Comparisons between our methods and previous last-iterate convergence methods. (D)OMWU refers to (Dilated) Optimistic Multiplicative Weights Update (Daskalakis and Panageas, 2019) and (D)OGDA refers to (Dilated) Optimistic Gradient Descent Ascent (Daskalakis et al., 2018; Liang and Stokes, 2019; Mokhtari et al., 2020). Reg-DOMWU (Reg-DOGDA) refers to DOMWU (DOGDA) with regularization. The fifth column, Iterate, refers to the Euclidean distance to NE. (G) and (L) refer to global and local convergence rates, respectively.
it is clear from the context. For any convex and differentiable function $\psi$, its associated Bregman divergence is defined as $D_\psi(u, v) := \psi(u) - \psi(v) - \langle \nabla\psi(v), u - v \rangle$. Finally, we use $\Pi_{\mathcal{C}}(u)$ to denote the projection of $u$ onto a convex set $\mathcal{C}$ with respect to the Euclidean distance.
Bilinear optimization problem. Strategies in two-player zero-sum extensive-form games with perfect recall can be interpreted in sequence-form (Von Stengel, 1996).
Thus, finding the Nash equilibrium reduces to solving a bilinear saddle-point problem,
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \; x^\top A y, \tag{2.1}$$
where $\mathcal{X} \subset \mathbb{R}^{M}_{\ge 0}$ and $\mathcal{Y} \subset \mathbb{R}^{N}_{\ge 0}$ are the decision sets of the min/max players, called treeplexes (to be defined next). In the sequence-form representation, $x_i$ denotes the probability of reaching node $i$ in the treeplex when only counting the uncertainty incurred by the min-player, and $y_i$ can be interpreted similarly. The payoff matrix satisfies $A \in [-1, 1]^{M \times N}$, where $A_{i,j}$ denotes the payoff of the max-player when the min-player reaches $i$ and the max-player reaches $j$. Nash equilibria are exactly the solutions to Eq (2.1). We define $\mathcal{Z}^* = \mathcal{X}^* \times \mathcal{Y}^*$ to denote the set of NE, which is always convex for two-player zero-sum games. For convenience, we use $P := M + N$ to denote the dimension of problem (2.1), and concatenate the sequence forms of both players by defining $z := (x, y) \in \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ and the gradient of the bilinear form (2.1) by defining $F(z) := (Ay, -A^\top x)$. By re-normalizing $A$, we can assume $\|F(z)\|_\infty \le 1$ without loss of generality.
Treeplex and dilated regularizer. The structure of a sequence form is enforced implicitly by the treeplexes, which we define formally here:
Definition 2.1 (Hoda et al. (2010)). A treeplex is recursively defined as follows:
1. Each probability simplex is a treeplex.
2. The Cartesian product of multiple treeplexes is a treeplex.
3. The branching of two treeplexes is a treeplex, where for integers $m, n > 0$, the branching of two treeplexes $\mathcal{Z}_1 \subset \mathbb{R}^{m}_{\ge 0}$, $\mathcal{Z}_2 \subset \mathbb{R}^{n}_{\ge 0}$ on index $i \in \{1, 2, \ldots, m\}$ is defined as
$$\mathcal{Z}_1 \,\boxed{i}\, \mathcal{Z}_2 = \{(u, u_i \cdot v) : u \in \mathcal{Z}_1, v \in \mathcal{Z}_2\}. \tag{2.2}$$
See an illustration of a treeplex in Figure 1 of Appendix A. The simplexes in the treeplex specify the decision points for both players, which are also called information sets in the EFG literature (Zinkevich et al., 2007; Tammelin et al., 2015; Farina et al., 2019a). The collection of information sets in treeplex $\mathcal{Z}$ is denoted as $\mathcal{H}^{\mathcal{Z}}$.
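The two composition rules of Definition 2.1 can be made concrete with a small sketch. The function names `cartesian` and `branch` are hypothetical and introduced only for illustration, and indices are 0-based here whereas the definition uses 1-based indices.

```python
from typing import List


def cartesian(u: List[float], v: List[float]) -> List[float]:
    """Point of the Cartesian product of two treeplexes: plain concatenation."""
    return list(u) + list(v)


def branch(u: List[float], i: int, v: List[float]) -> List[float]:
    """Point (u, u_i * v) of the branching of two treeplexes on index i (0-based)."""
    return list(u) + [u[i] * w for w in v]


# A simplex at the root, and a second simplex attached below root index 0.
root = [0.25, 0.75]
child = [0.4, 0.6]
z = branch(root, 0, child)  # sequence-form point of the composed treeplex
```

In the branched point, the entries contributed by the child sum to the parent entry `root[0]`, which is the defining property of the sequence form.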
For any $h \in \mathcal{H}^{\mathcal{Z}}$, we use $\Omega_h$ to denote the indices in $\mathcal{Z}$ belonging to decision point $h$, and $h(i)$ to denote the information set that index $i$ belongs to. That is, $h(i) = h$ if and only if $i \in \Omega_h$. We use $\sigma(h)$ to denote the index of the parent variable of $h$, and $\mathcal{H}_i = \{h \in \mathcal{H}^{\mathcal{Z}} : \sigma(h) = i\}$. For a simplex $\mathcal{Z}$, the parent of the only information set $h \in \mathcal{H}^{\mathcal{Z}}$ does not exist, and we use $\sigma(h) = 0$ to denote this (with the convention $z_0 = 1$). Applying the Cartesian product to multiple treeplexes does not change the parent of any information set. When we branch two treeplexes, that is $\mathcal{Z}_1 \,\boxed{i}\, \mathcal{Z}_2$, the parent of every information set $h \in \mathcal{H}^{\mathcal{Z}_2}$ with $\sigma(h) = 0$ is updated to $\sigma(h) = i$. For convenience, we use $z_h$ to denote the slice of $z$ with indices in $\Omega_h$. Let $C_\Omega := \max_{h \in \mathcal{H}^{\mathcal{Z}}} |\Omega_h|$ denote the maximum number of indices in any individual information set. We also define the vector $q \in \mathbb{R}^P$ with $q_i := z_i / z_{\sigma(h(i))}$ for any $i$. In the EFG terminology, $q_h \in \mathbb{R}^{|\Omega_h|}$, the slice of $q$ in information set $h$, is the probability distribution over actions in information set $h$. The treeplex structure motivates a natural dilation operation to generate regularizers that lead to efficient computation in EFGs (Hoda et al., 2010). For any strongly convex base regularizer $\psi^\Delta$ defined on a simplex, the dilated regularizer is defined by
$$\psi^{\mathcal{Z}}(z) := \sum_{h \in \mathcal{H}^{\mathcal{Z}}} \alpha_h \, z_{\sigma(h)} \, \psi^\Delta\!\left(\frac{z_h}{z_{\sigma(h)}}\right), \tag{2.3}$$
where $z_{\sigma(h)}$ is the probability of reaching the parent variable of information set $h$, and $\alpha_h$ is the $h$-th element of a vector $\alpha \in \mathbb{R}^{|\mathcal{H}^{\mathcal{Z}}|}_{+}$, a hyper-parameter set according to $\psi^\Delta$ to guarantee that $\psi^{\mathcal{Z}}$ is 1-strongly convex with respect to the 2-norm (Hoda et al., 2010; Kroer et al., 2020), i.e., $D_{\psi^{\mathcal{Z}}}(z_1, z_2) \ge \frac{1}{2}\|z_1 - z_2\|^2$. Two common base regularizers are the negative entropy $\psi^\Delta_{\mathrm{Entropy}}(p) = \sum_i p_i \log p_i$ and the Euclidean norm $\psi^\Delta_{\mathrm{Euclidean}}(p) = \sum_i p_i^2$, where $p \in \Delta$ is a probability distribution.
Finding NE and regret minimization.
Given a strategy $z$ in sequence form, there are two criteria to evaluate its performance:
• the Euclidean distance to the set of NE, $\|\Pi_{\mathcal{Z}^*}(z) - z\|$,
• the duality gap, $\max_{z' \in \mathcal{Z}} F(z)^\top (z - z')$.
When one or both of the above quantities are close to zero, we have found an approximate NE. A common approach to minimizing the duality gap is regret minimization, where we define the (external) regret of the min-player as
$$R^{\mathcal{X}}_T := \sum_{t=1}^{T} \ell_t(x_t) - \min_{\tilde{x} \in \mathcal{X}} \sum_{t=1}^{T} \ell_t(\tilde{x}), \tag{2.4}$$
where $\ell_t$ is the loss function at iteration $t$ and $x_t$ is the output of the regret minimizer at iteration $t$. The regret of the max-player can be defined similarly. When the regret grows sublinearly with respect to $T$, the average regret converges to zero (hence the name no-regret). The following folklore theorem implies that the average strategy will converge to NE.
Lemma 2.2. For a bilinear zero-sum game where $\ell^{\mathcal{X}}_t(x_t) = -\ell^{\mathcal{Y}}_t(y_t) = x_t^\top A y_t$, the duality gap of the average strategy $\left(\frac{1}{T}\sum_{t=1}^{T} x_t, \frac{1}{T}\sum_{t=1}^{T} y_t\right)$ is bounded by $(R^{\mathcal{X}}_T + R^{\mathcal{Y}}_T)/T$.
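As an illustration of the dilated regularizer defined earlier in this section, the following minimal sketch evaluates it on a two-level treeplex with the negative entropy as base regularizer. The data layout (`InfoSet` as a pair of owned indices and parent index), the function names, and the choice $\alpha_h = 1$ are assumptions made for the example; in particular, $\alpha = 1$ is a placeholder and is not claimed to give 1-strong convexity.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# An information set: (indices it owns, parent index or None for a root).
InfoSet = Tuple[List[int], Optional[int]]


def neg_entropy(p: Sequence[float]) -> float:
    """Base regularizer: sum_i p_i log p_i, with 0 log 0 treated as 0."""
    return sum(x * math.log(x) for x in p if x > 0.0)


def behavioral(z: Sequence[float], info_sets: Sequence[InfoSet]) -> List[float]:
    """Conditional strategy q_i = z_i / z_parent (uniform if the parent is unreached)."""
    q = [0.0] * len(z)
    for indices, parent in info_sets:
        reach = 1.0 if parent is None else z[parent]
        for i in indices:
            q[i] = z[i] / reach if reach > 0.0 else 1.0 / len(indices)
    return q


def dilated_regularizer(
    z: Sequence[float],
    info_sets: Sequence[InfoSet],
    alpha: Sequence[float],
    base: Callable[[Sequence[float]], float] = neg_entropy,
) -> float:
    """Sum over information sets of alpha_h * z_parent * base(z_h / z_parent)."""
    total = 0.0
    for (indices, parent), a in zip(info_sets, alpha):
        reach = 1.0 if parent is None else z[parent]
        if reach > 0.0:
            total += a * reach * base([z[i] / reach for i in indices])
    return total


# Root information set {0, 1}; a child information set {2, 3} below index 0.
info_sets: List[InfoSet] = [([0, 1], None), ([2, 3], 0)]
z = [0.5, 0.5, 0.25, 0.25]
q = behavioral(z, info_sets)
value = dilated_regularizer(z, info_sets, alpha=[1.0, 1.0])
```

For this uniform strategy each information set contributes the entropy of a fair coin, weighted by the probability of reaching it.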

3.1. SOLVING A REGULARIZED PROBLEM

To obtain a faster convergence rate for OMD algorithms, we first solve for the NE of the following regularized (and thus strongly convex-concave) problem as an intermediate step:
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \; x^\top A y + \tau \psi^{\mathcal{Z}}(x) - \tau \psi^{\mathcal{Z}}(y), \tag{3.1}$$
where $\tau \in (0, 1]$ is the weight of regularization and $\psi^{\mathcal{Z}}$ is a strongly convex regularizer. In the literature (McKelvey and Palfrey, 1995), the solution to the regularized problem is called the quantal-response equilibrium (QRE) when the regularizer $\psi^{\mathcal{Z}}$ is the entropy. Thanks to the strong convexity of $\psi^{\mathcal{Z}}$, Eq (3.1) has a unique NE, denoted by $z^*_\tau$. For $t = 1, 2, \ldots$, the update rule of optimistic mirror descent for the regularized problem (3.1), which we refer to as Reg-DOMD, can be written as
$$z_t = \operatorname*{argmin}_{z \in \mathcal{Z}} \left\{ \left\langle z, F(z_{t-1}) + \tau \nabla\psi^{\mathcal{Z}}(\hat{z}_t) \right\rangle + \frac{1}{\eta} D_{\psi^{\mathcal{Z}}}(z, \hat{z}_t) \right\},$$
$$\hat{z}_{t+1} = \operatorname*{argmin}_{z \in \mathcal{Z}} \left\{ \left\langle z, F(z_t) + \tau \nabla\psi^{\mathcal{Z}}(\hat{z}_t) \right\rangle + \frac{1}{\eta} D_{\psi^{\mathcal{Z}}}(z, \hat{z}_t) \right\}, \tag{3.2}$$
where we set $z_0 = \hat{z}_1$ as the uniform strategy, i.e., $z_{0,h}/z_{0,\sigma(h)}$ is the uniform distribution in $\Delta^{|\Omega_h|}$, and $\eta > 0$ is the stepsize. Dilated Optimistic Mirror Descent (DOMD) (Lee et al., 2021) is then the special case $\tau = 0$. We call the update rule (3.2) Regularized Dilated Optimistic Multiplicative Weights Update (Reg-DOMWU) when the base regularizer $\psi^\Delta$ is the negative entropy, and Regularized Dilated Optimistic Gradient Descent Ascent (Reg-DOGDA) when $\psi^\Delta$ is the Euclidean norm. As desired, $\hat{z}_{t+1}$ converges to $z^*_\tau$ at a linear rate for any fixed $\tau$.
Theorem 3.1. With $\eta \le \frac{1}{8P}$, $\tau \le 1$ and $\psi^{\mathcal{Z}}$ being a 1-strongly convex function with respect to the 2-norm, Reg-DOMD guarantees that $D_{\psi^{\mathcal{Z}}}(z^*_\tau, \hat{z}_{t+1}) \le (1 - \eta\tau)^t D_{\psi^{\mathcal{Z}}}(z^*_\tau, \hat{z}_1)$ for any $t \ge 1$ when we initialize $z_0 = \hat{z}_1$.
The results in Theorem 3.1 hold for general dilated regularizers, and apply to the regularized versions of two representative algorithms, Reg-DOMWU and Reg-DOGDA, whose unregularized counterparts are studied in Lee et al. (2021). The detailed proof is postponed to Appendix C. We sketch the proof below.
Proof sketch of Theorem 3.1.
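When $\psi^{\mathcal{Z}}$ is a 1-strongly convex function with respect to the 2-norm and $\eta \le \frac{1}{8P}$, then for any $z \in \mathcal{Z}$ and $t \ge 1$, we have
$$\eta\tau\psi^{\mathcal{Z}}(z_t) - \eta\tau\psi^{\mathcal{Z}}(z) + \eta F(z_t)^\top (z_t - z) \le (1 - \eta\tau) D_{\psi^{\mathcal{Z}}}(z, \hat{z}_t) - D_{\psi^{\mathcal{Z}}}(z, \hat{z}_{t+1}) - D_{\psi^{\mathcal{Z}}}(\hat{z}_{t+1}, z_t) - \frac{7}{8} D_{\psi^{\mathcal{Z}}}(z_t, \hat{z}_t) + \frac{1}{8} D_{\psi^{\mathcal{Z}}}(\hat{z}_t, z_{t-1}), \tag{3.3}$$
which is adapted from the standard OMD analysis (Rakhlin and Sridharan, 2013), but for the regularized problem. See Lemma C.2 for the proof. Taking $z = z^*_\tau$ in Eq (3.3), we have
$$(1 - \eta\tau) D_{\psi^{\mathcal{Z}}}(z^*_\tau, \hat{z}_t) - D_{\psi^{\mathcal{Z}}}(z^*_\tau, \hat{z}_{t+1}) - D_{\psi^{\mathcal{Z}}}(\hat{z}_{t+1}, z_t) - \frac{7}{8} D_{\psi^{\mathcal{Z}}}(z_t, \hat{z}_t) + \frac{1}{8} D_{\psi^{\mathcal{Z}}}(\hat{z}_t, z_{t-1}) \ge \eta\tau\psi^{\mathcal{Z}}(z_t) - \eta\tau\psi^{\mathcal{Z}}(z^*_\tau) + \eta F(z_t)^\top (z_t - z^*_\tau) \overset{(i)}{\ge} 0, \tag{3.4}$$
where (i) follows from the definition of $z^*_\tau$ as the saddle point of (3.1). Letting $\Theta_{t+1} = D_{\psi^{\mathcal{Z}}}(z^*_\tau, \hat{z}_{t+1}) + D_{\psi^{\mathcal{Z}}}(\hat{z}_{t+1}, z_t)$ and substituting $D_{\psi^{\mathcal{Z}}}(z^*_\tau, \hat{z}_t) = \Theta_t - D_{\psi^{\mathcal{Z}}}(\hat{z}_t, z_{t-1})$, inequality (3.4) can be written as
$$\Theta_{t+1} \le (1 - \eta\tau)\Theta_t - \frac{7}{8} D_{\psi^{\mathcal{Z}}}(z_t, \hat{z}_t) - \left(\frac{7}{8} - \eta\tau\right) D_{\psi^{\mathcal{Z}}}(\hat{z}_t, z_{t-1}) \le (1 - \eta\tau)\Theta_t, \tag{3.5}$$
where the second inequality comes from $\eta\tau \le \eta \le \frac{7}{8}$ and the non-negativity of the Bregman divergence. This justifies the linear convergence. In the existing works of Wei et al. (2021) and Lee et al. (2021) without regularization, i.e., when $\tau = 0$, the above argument cannot guarantee the linearly shrinking property of $\Theta_t$. With the unique NE assumption, one can prove some "slope" in the original bilinear objective, which implies an explicit convergence rate (Lee et al., 2021, Lemma 15). It was unclear whether such an assumption can be removed. Here, the regularization technique enables us to avoid it. See a more detailed and technical discussion below Lemma D.5.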
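To make update (3.2) concrete, the sketch below specializes it to a normal-form game, where each player's treeplex is a single simplex and the regularizer is the negative entropy. In that case both argmin steps have the closed form $p_i \propto \hat{p}_i^{\,1 - \eta\tau} \exp(-\eta g_i)$, where $g$ is the relevant block of $F$. The function names, the $2 \times 2$ payoff matrix, and the parameter values are assumptions chosen for illustration; this is not the dilated EFG implementation.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def gradient(A: Matrix, x: List[float], y: List[float]) -> Tuple[List[float], List[float]]:
    """F(z) = (A y, -A^T x) for z = (x, y)."""
    gx = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    gy = [-sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    return gx, gy


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable normalization of exp(logits)."""
    top = max(logits)
    w = [math.exp(v - top) for v in logits]
    s = sum(w)
    return [v / s for v in w]


def entropy_step(p_hat: List[float], g: List[float], tau: float, eta: float) -> List[float]:
    """Closed form of one argmin in (3.2) on a simplex with negative entropy."""
    return softmax([(1.0 - eta * tau) * math.log(p) - eta * gi for p, gi in zip(p_hat, g)])


def reg_omwu(A: Matrix, tau: float, eta: float, steps: int) -> Tuple[List[float], List[float]]:
    """Run the regularized optimistic update and return the last hat-iterate."""
    m, n = len(A), len(A[0])
    x_hat, y_hat = [1.0 / m] * m, [1.0 / n] * n  # hat z_1: uniform
    x, y = list(x_hat), list(y_hat)              # z_0 = hat z_1
    for _ in range(steps):
        gx, gy = gradient(A, x, y)               # F(z_{t-1})
        x, y = entropy_step(x_hat, gx, tau, eta), entropy_step(y_hat, gy, tau, eta)
        gx, gy = gradient(A, x, y)               # F(z_t)
        x_hat, y_hat = entropy_step(x_hat, gx, tau, eta), entropy_step(y_hat, gy, tau, eta)
    return x_hat, y_hat


A = [[0.8, -0.6], [-0.2, 0.4]]   # hypothetical payoffs in [-1, 1]
TAU, ETA = 0.5, 1.0 / 32.0       # eta = 1 / (8P) with P = M + N = 4
x_hat, y_hat = reg_omwu(A, TAU, ETA, 3000)
```

One way to check the output is the fixed-point condition of the entropy-regularized game: at its equilibrium, each player's strategy is the softmax of the negated gradient block divided by $\tau$, and the returned iterate can be compared against that response.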

3.2. FROM THE REGULARIZED PROBLEM TO THE ORIGINAL PROBLEM

Intuitively, if the weight of regularization $\tau$ is sufficiently small, the NE of the regularized problem should be close to the NE of the original problem (2.1). In the following we formalize this intuition and show how Theorem 3.1 implies a last-iterate guarantee. We shrink the weight of regularization $\tau$ as follows: first initialize $\tau = \tau_0$ for some hyper-parameter $\tau_0$, and run Reg-DOMD in episodes. In each episode, we update the parameters $z_t$ and $\hat{z}_{t+1}$ for $\Theta(1/\tau)$ iterations, so that the duality gap of $z_t$ is at most $O(\tau)$ according to Lemma D.1 and Theorem 3.1. Then, we shrink $\tau$ by one half and start the next episode from scratch. Notice that although $\tau$ is changing, the stepsize $\eta$ remains constant, which differs from Hsieh et al. (2021), where the stepsize is adaptive.
Theorem 3.2. With the shrinking algorithm described above, the duality gap satisfies $\max_{z' \in \mathcal{Z}} F(\hat{z}_{t+1})^\top (\hat{z}_{t+1} - z') \le O\!\left(\frac{1}{t}\right)$ for $t = 1, 2, \ldots, T$. Moreover, we have an iterate convergence rate of $\|\hat{z}_{t+1} - \Pi_{\mathcal{Z}^*}(\hat{z}_{t+1})\| \le O\!\left(\frac{1}{t}\right)$.
In practice, we use an adaptive weight-shrinking rule proposed in Appendix A, which is motivated by Yang et al. (2020). Note that Theorem 3.2 applies to both Reg-DOMWU and Reg-DOGDA. To the best of our knowledge, this is the first result to obtain a convergence rate for the duality gap and the distance to the NE set in EFGs without the unique NE assumption, when the mirror map is generated through a dilated operation (Lee et al., 2021).
Technical overview. We briefly sketch the intuition behind the proof and defer the full details to Appendix D. To prove the duality gap guarantee, first notice that in the regularized problem, $z_t$ has a small duality gap thanks to the last-iterate guarantee in Theorem 3.1. So we only need to argue that the duality gap of $z^*_\tau$ in the original problem is also small, which turns out to be $O(\tau)$.
However, this argument does not imply a small distance to the NE set, because the distance between $z^*_\tau$ and $z^*$ is unknown. Instead, we need the result that the lower bound of the "slope" of the duality gap is strictly positive, i.e., for any $z$, we have $\max_{z' \in \mathcal{Z}} F(z)^\top (z - z') \ge c\,\|z - \Pi_{\mathcal{Z}^*}(z)\|$ for some constant $c > 0$. Moreover, compared to existing "slope" results (Gilpin et al., 2008; Wei et al., 2021), we provide a stronger one when the regularizer is the entropy, since we prove that $\max_{z'} F(z)^\top (z - z') \ge c\,\|z - \Pi_{\mathcal{Z}^*}(z)\|$ even when $z'$ is restricted to a subset of $\mathcal{Z}$ (see Lemma D.6). Due to the regularization, our dependence on the EFG size $P$ is quite mild. There is only a $P\|\alpha\|_\infty$ dependence on the EFG size in the duality gap convergence result ($\|\alpha\|_\infty$ is usually $O(P^2)$, depending on the specific type of dilation (Hoda et al., 2010; Kroer et al., 2020; Farina et al., 2021)), which can be found in Appendix D. The convergence rate of the distance to the NE set of the original problem depends on the slope $c$, which in turn depends on the reward matrix.
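The episodic shrinking rule of this section can be sketched as a schedule generator. This is a minimal illustration under stated assumptions: the constant `c` hidden in the $\Theta(1/\tau)$ episode length and the function name `shrinking_schedule` are hypothetical, and the adaptive rule used in practice (Appendix A) is not reproduced here.

```python
import math
from typing import List, Tuple


def shrinking_schedule(tau0: float, total_steps: int, c: float = 1.0) -> List[Tuple[float, int]]:
    """Return (tau, episode_length) pairs: run ceil(c / tau) steps, then halve tau.

    The final episode is truncated so that the lengths sum to total_steps.
    """
    schedule: List[Tuple[float, int]] = []
    tau, used = tau0, 0
    while used < total_steps:
        length = min(math.ceil(c / tau), total_steps - used)
        schedule.append((tau, length))
        used += length
        tau /= 2.0
    return schedule


schedule = shrinking_schedule(tau0=1.0, total_steps=15)
```

Because episode lengths double while $\tau$ halves, the regularization weight in force after $t$ total iterations is of order $1/t$, which is the intuition behind converting an $O(\tau)$ per-episode duality gap into the $O(1/t)$ rate of Theorem 3.2.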

4. REGULARIZED COUNTERFACTUAL REGRET MINIMIZATION (Reg-CFR)

Counterfactual regret minimization has been the most widely used solution framework for EFGs in the past decades, and has achieved many successes, including defeating professional human players in Texas Hold'em (Brown and Sandholm, 2018; 2019b). Through this framework, the (global) regret of the EFG in (2.4) can be minimized by minimizing the local regret in each information set separately. To describe the regret decomposition framework in its full generality, we first introduce some additional notation. $W^h(z)$ is the value of the treeplex rooted at information set $h$, for the player that $h$ belongs to, when both players play according to $z$. For any $h \in \mathcal{H}^{\mathcal{X}}$, $W^h(z)$ can be recursively defined as
$$W^h(z) = \sum_{i \in \Omega_h} q_i \Big( (Ay)_i + \sum_{h' \in \mathcal{H}_i} W^{h'}(z) \Big) + \tau \alpha_h \psi^\Delta(q_h),$$
where $q_i = z_i / z_{\sigma(h(i))}$, so that $q_{h(i)} \in \Delta^{|\Omega_{h(i)}|}$ is the (conditional-form) strategy on information set $h(i)$ (it lies in a simplex due to the definition of treeplex in Definition 2.1), and $\alpha_h$ is the hyper-parameter defined in Eq (2.3). For $h \in \mathcal{H}^{\mathcal{Y}}$, $W^h(z)$ can be defined similarly. The local loss $\ell^h_t(q_h) : \Delta^{|\Omega_h|} \to \mathbb{R}$ at any information set $h \in \mathcal{H}^{\mathcal{Z}}$ can be defined by
$$\ell^h_t(q_h) := \langle V^h(z_t), q_h \rangle + \tau \alpha_h \psi^\Delta(q_h), \quad \text{where} \quad V^h(z) := \Big( (Ay)_i + \sum_{h' \in \mathcal{H}_i} W^{h'}(z) \Big)_{i \in \Omega_h}.$$
Notice that $W^h(z)$ is a scalar while $V^h(z)$ is a vector. Furthermore, the two quantities are related by $W^h(z) = \left\langle \frac{z_h}{z_{\sigma(h)}}, V^h(z) \right\rangle + \tau \alpha_h \psi^\Delta\!\left(\frac{z_h}{z_{\sigma(h)}}\right)$. The local difference at information set $h$ is $G^h_T(q_h) := \sum_{t=1}^{T} \ell^h_t(q_{t,h}) - \sum_{t=1}^{T} \ell^h_t(q_h)$, and the local regret is $R^h_T := \max_{\tilde{q}_h \in \Delta^{|\Omega_h|}} G^h_T(\tilde{q}_h)$. The following decomposition implies that the global regret can be controlled by the sum of local regrets:
Lemma 4.1 (Laminar regret decomposition (Farina et al., 2019b)).
For any $z_1, z_2, \ldots, z_T, z \in \mathcal{Z}$ and $\tau \ge 0$, we have
$$G^{\mathcal{Z}}_T(z) = \sum_{t=1}^{T} \left( F(z_t)^\top (z_t - z) + \tau\psi^{\mathcal{Z}}(z_t) - \tau\psi^{\mathcal{Z}}(z) \right) = \sum_{h \in \mathcal{H}^{\mathcal{Z}}} z_{\sigma(h)} \, G^h_T\!\left(\frac{z_h}{z_{\sigma(h)}}\right),$$
$$R^{\mathcal{Z}}_T = \max_{\tilde{z} \in \mathcal{Z}} G^{\mathcal{Z}}_T(\tilde{z}) \le \max_{\tilde{z} \in \mathcal{Z}} \sum_{h \in \mathcal{H}^{\mathcal{Z}}} \tilde{z}_{\sigma(h)} R^h_T, \tag{4.1}$$
where $R^{\mathcal{Z}}_T$ is the sum of the regrets of the min-player and the max-player defined in Eq (2.4), instantiated with $\ell_t(z) = \langle F(z_t), z \rangle + \tau\psi^{\mathcal{Z}}(z)$.
The proof is postponed to Appendix E. Hence, by minimizing $R^h_T$ at each information set $h \in \mathcal{H}^{\mathcal{Z}}$, $R^{\mathcal{Z}}_T$ will also be minimized. By Lemma 2.2, the average strategy will converge to NE when $\tau = 0$. In fact, when $\tau > 0$, the average strategy will converge to the corresponding NE $z^*_\tau$ of the regularized problem according to a stronger version of Lemma 2.2 (Farina et al., 2019b, Theorem 3). For completeness, we provide the formal version as Lemma F.3. To describe our main results in full generality, we introduce the notion of perturbed EFGs before diving into the algorithm and analysis.

4.1. PERTURBED EXTENSIVE-FORM GAME AND EXTENSIVE-FORM PERFECT NASH EQUILIBRIUM

Although NE specifies a natural notion of optimality in EFGs, an NE strategy does not necessarily behave reasonably at information sets that are reached with probability zero. To avoid this issue, a stronger and refined notion of equilibrium, the extensive-form perfect equilibrium, was proposed in Selten (1975), which takes every information set into consideration by perturbing the EFG to force the players to reach every information set. We formally introduce the definitions below.
Definition 4.2. For any $\gamma \ge 0$, a $\gamma$-perturbed EFG is an EFG with a $\gamma$-perturbed treeplex $\mathcal{Z}^\gamma := \mathcal{X}^\gamma \times \mathcal{Y}^\gamma$, which requires that $q_i = \frac{z_i}{z_{\sigma(h(i))}} \ge \gamma$ for any $z \in \mathcal{Z}^\gamma$ and index $i$. An extensive-form perfect equilibrium is a limit point of $\{z^{\gamma,*}\}_{\gamma \to 0}$, where $z^{\gamma,*}$ is the NE of the $\gamma$-perturbed EFG.
The simplest instance of a $\gamma$-perturbed treeplex is a $\gamma$-perturbed probability simplex $\Delta_\gamma$, where every entry has probability at least $\gamma$. Since the standard EFG is just a perturbed EFG with $\gamma = 0$, we will only describe our results for $\gamma$-perturbed EFGs to keep the argument unified and general, and only translate our results to the $\gamma = 0$ case when necessary. Correspondingly, we use $z^{\gamma,*}_\tau$ to denote the Nash equilibrium of the regularized game in Eq (3.1) when $(x, y) \in \mathcal{Z}^\gamma$. When $\gamma > 0$, $z^{\gamma,*}$ is empirically used as an approximation to the EFPE (Kroer et al., 2017; Farina et al., 2017). We prove that $z^{\gamma,*}$ can be seen as an approximation of the EFPE in terms of duality gap (see Lemma F.4 for more details about this approximation, which might be of independent interest).
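For intuition about the $\gamma$-perturbed simplex $\Delta_\gamma$, the sketch below computes the Euclidean projection onto it. It relies on the substitution $w = q - \gamma$, which reduces the problem to projecting onto a scaled simplex of total mass $1 - n\gamma$; the inner projection uses the standard sort-based method. The function names are hypothetical, and this is an illustrative helper rather than part of the algorithms analyzed in this paper.

```python
from typing import List


def project_simplex(v: List[float], mass: float = 1.0) -> List[float]:
    """Euclidean projection of v onto {w >= 0, sum(w) = mass}, assuming mass > 0."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - mass) / k
        if value - candidate > 0.0:
            theta = candidate  # keep the threshold of the largest valid k
    return [max(x - theta, 0.0) for x in v]


def project_perturbed_simplex(v: List[float], gamma: float) -> List[float]:
    """Euclidean projection of v onto {q >= gamma, sum(q) = 1}."""
    n = len(v)
    if gamma < 0.0 or gamma * n >= 1.0:
        raise ValueError("need 0 <= gamma < 1 / len(v)")
    shifted = [x - gamma for x in v]
    w = project_simplex(shifted, mass=1.0 - gamma * n)
    return [gamma + x for x in w]


p = project_perturbed_simplex([0.9, 0.2, -0.3], gamma=0.05)
```

Every coordinate of the result is at least $\gamma$, so each action keeps a minimum probability, which is what forces every information set to be reached in a perturbed EFG.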

4.2. MAIN RESULT

Given the regret decomposition in Lemma 4.1, we instantiate the regret minimizer in each information set by the regularized version of the Dual Stabilized Optimistic Mirror Descent algorithm (Hsieh et al., 2021), i.e., Reg-DS-OptMD. The DS-OptMD algorithm of Hsieh et al. (2021) achieves constant regret in two-player zero-sum NFGs, which, to the best of our knowledge, is the state-of-the-art result with this desired property. Hence, we develop our local regret minimizer based on this algorithm. For any information set $h \in \mathcal{H}^{\mathcal{Z}}$ and $t = 1, 2, \ldots, T$, the full update rule of our proposed algorithm, Regularized Counterfactual Regret Minimization (Reg-CFR), is
$$q_{t,h} = \operatorname*{argmin}_{q_h \in \Delta_\gamma^{|\Omega_h|}} \left\{ \left\langle V^h(z_{t-\frac{1}{2}}) + \tau\alpha_h \nabla\psi^\Delta(q_{t-1,h}), q_h \right\rangle + \lambda^h_{t-1} D_{\psi^\Delta}(q_h, q_{t-1,h}) + (\lambda^h_t - \lambda^h_{t-1}) D_{\psi^\Delta}(q_h, q_{1,h}) \right\},$$
$$q_{t+\frac{1}{2},h} = \operatorname*{argmin}_{q_h \in \Delta_\gamma^{|\Omega_h|}} \left\{ \left\langle V^h(z_{t-\frac{1}{2}}) + \tau\alpha_h \nabla\psi^\Delta(q_{t,h}), q_h \right\rangle + \lambda^h_t D_{\psi^\Delta}(q_h, q_{t,h}) \right\}, \tag{4.2}$$
where the adaptive stepsize is defined by $\lambda^h_t := \kappa + \sum_{s=1}^{t-1} \delta^h_s$ and $\kappa \ge 1$ is a hyper-parameter. Here $\delta^h_s := \|V^h(z_{s+\frac{1}{2}}) - V^h(z_{s-\frac{1}{2}})\|^2$ is the variation of the value function and $q_{t,h} = z_{t,h}/z_{t,\sigma(h)}$. Again, for any $h \in \mathcal{H}^{\mathcal{Z}}$, $q_{0,h} = q_{\frac{1}{2},h}$ are initialized as the uniform distribution in $\Delta^{|\Omega_h|}$. With the adaptive stepsize in Reg-DS-OptMD, we no longer need to tune the stepsize for each individual information set. Reg-CFR enjoys a desirable last-iterate convergence guarantee of the actual iterate as follows:
Theorem 4.3. Consider the case when $\tau > 0$. In $\gamma$-perturbed EFGs, if we use the Euclidean norm as the regularizer $\psi^\Delta$ in Reg-CFR, then $\sum_{t=1}^{T} D_{\psi^{\mathcal{Z}}}(z^{\gamma,*}_\tau, z_t) \le \frac{C_\gamma}{\tau}$, where $C_\gamma$ is some positive quantity depending on $\gamma$. As a result,
• When $\gamma > 0$ and $\tau \le \frac{1}{2\|\alpha\|_\infty}$, $C_\gamma$ is a constant, which implies asymptotic last-iterate convergence to $z^{\gamma,*}_\tau$ in terms of Bregman distance.
• When $\gamma = 0$, $C_\gamma \le O(T^{1/4})$, implying an $O(T^{-3/4})$ best-iterate convergence rate to $z^*_\tau$ in terms of Bregman distance.
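For the Euclidean regularizer used in Theorem 4.3, both argmin steps of (4.2) reduce to Euclidean projections. The following derivation is a sketch under the assumption $\psi^\Delta(p) = \sum_i p_i^2$ (as in §2), so that $\nabla\psi^\Delta(q) = 2q$ and $D_{\psi^\Delta}(u, v) = \|u - v\|^2$; the shorthand $g_t$ and $g'_t$ is introduced here only for readability and does not appear elsewhere in the paper.

```latex
% Shorthand (introduced for this sketch only):
%   g_t  := V^h(z_{t-1/2}) + 2 \tau \alpha_h q_{t-1,h},
%   g'_t := V^h(z_{t-1/2}) + 2 \tau \alpha_h q_{t,h}.
\begin{align*}
% First step: complete the square in q_h, using a + b = \lambda_t^h with
% a = \lambda_{t-1}^h and b = \lambda_t^h - \lambda_{t-1}^h.
\langle g_t, q_h \rangle
  + \lambda^h_{t-1} \| q_h - q_{t-1,h} \|^2
  + (\lambda^h_t - \lambda^h_{t-1}) \| q_h - q_{1,h} \|^2
  &= \lambda^h_t \, \| q_h - c_t \|^2 + \text{const}, \\
c_t &:= \frac{\lambda^h_{t-1} q_{t-1,h}
        + (\lambda^h_t - \lambda^h_{t-1}) q_{1,h}
        - \tfrac{1}{2} g_t}{\lambda^h_t}, \\
q_{t,h} &= \Pi_{\Delta_\gamma^{|\Omega_h|}} ( c_t ), \\
% Second step: a single quadratic term, hence a projected gradient step.
q_{t+\frac{1}{2},h}
  &= \Pi_{\Delta_\gamma^{|\Omega_h|}}
     \Big( q_{t,h} - \frac{1}{2 \lambda^h_t} \, g'_t \Big).
\end{align*}
```

Under this assumption each local update costs one projection onto the $\gamma$-perturbed simplex of the corresponding information set.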
To the best of our knowledge, under the regret decomposition framework, although some CFR-type algorithms such as CFR+ (Tammelin et al., 2015) have been empirically observed to have last-iterate convergence (Bowling et al., 2015), there is no theoretical justification for this in the literature yet. Our results appear to be the first to establish provable best- and last-iterate convergence under the regret decomposition framework of CFR. Even in terms of empirical performance, our algorithm Reg-CFR can achieve a faster last-iterate convergence rate compared to previous ones. More interestingly, by applying regularization to CFR (Zinkevich et al., 2007) and CFR+ (Tammelin et al., 2015), we empirically show that regularization can improve the last-iterate performance. We discuss these results in Appendix B.
Significance of last-iterate convergence for CFR. We believe that Theorem 4.3 paves the way for more tractable CFR-type algorithms with function approximation in large-scale EFGs like Texas Hold'em. Previously, although Brown and Sandholm (2018; 2019b) achieved super-human performance in Texas Hold'em, they utilized domain-specific abstraction techniques (Ganzfried and Sandholm, 2014; Brown et al., 2015), which merge similar nodes in Texas Hold'em into one to make the total number of nodes tractable. However, the existing abstraction methods are highly specific to poker games. Therefore, it is crucial to design algorithms with function approximation that perform such abstraction in an end-to-end manner. Currently, the average-iterate convergence of CFR is an obstacle to using function approximation. In the seminal work Deep-CFR (Brown et al., 2019), the authors trained an additional network to maintain the average policy, which caused additional approximation errors.
In the subsequent works (Steinberger, 2019; Steinberger et al., 2020), to obtain the average policy, the authors stored the networks of every iteration on disk and sampled one randomly to follow. Though sampling successfully eliminates the additional approximation error, given that it takes at least $10^5$ iterations to converge in large poker games, storing all networks on disk is not tractable for large games like Texas Hold'em. With Theorem 4.3, we can easily run CFR with function approximation, since we only need to keep the best model during learning due to the best-iterate guarantee. A direct consequence of the theorem above is the following corollary.
Corollary 4.4. For any desired duality gap $\epsilon$, we can set $\tau = \Theta(\epsilon)$. The best-iterate convergence rate to the NE $z^*$ when $\gamma = 0$ is then $O(T^{-1/4})$. When $\gamma > 0$, we still have asymptotic last-iterate convergence to $z^{\gamma,*}$, the NE of the $\gamma$-perturbed EFG. Both statements are in terms of duality gap.
Remark 4.5 (Technical challenges in showing best-iterate convergence for CFR-type algorithms). Although OMD achieves last-iterate convergence (Daskalakis and Panageas, 2019; Wei et al., 2021) and fast average-iterate convergence (Rakhlin and Sridharan, 2013; Syrgkanis et al., 2015), applying OMD as the local regret minimizer in the CFR framework does not inherit those results, since the loss function of the regret minimizer depends on the global strategy in the treeplex, which is not fully controlled by the local regret minimizer as it is in NFGs. Therefore, the local regret minimizer can be seen as deployed in a changing environment where the previous results do not apply.
Moreover, as a by-product, we find that the average strategy produced by Reg-CFR is, to the best of our knowledge, also superior to that of previous variants of CFR algorithms. Notice that when picking $\tau = 0$, the algorithm will converge to the NE $z^{\gamma,*}$ of the $\gamma$-perturbed EFG.
Theorem 4.6. Consider the case when $\tau \ge 0$ and the regularizer is the Euclidean norm.
In $\gamma$-perturbed EFGs with $\gamma > 0$ and $\tau \le \frac{1}{2\|\alpha\|_\infty}$, the average strategy output by Reg-CFR converges to $z^{\gamma,*}_\tau$ with convergence rate $O(1/T)$, which is the optimal rate. In the original EFG with $\gamma = 0$, the average strategy output by Reg-CFR converges to $z^*_\tau$ with convergence rate $O(1/T^{3/4})$.
To the best of our knowledge, Reg-CFR is the first CFR-type algorithm that achieves the theoretically optimal average-iterate convergence rate $O(1/T)$ when $\gamma > 0$ (for both $\tau > 0$ and $\tau = 0$). Furthermore, it maintains the current state-of-the-art average-iterate $O(1/T^{3/4})$ convergence rate established by Farina et al. (2019a) in the original EFG where $\gamma = 0$.

Supplementary Materials for "The Power of Regularization in Solving Extensive-Form Games"

A OMITTED DETAILS

Here we present some details omitted in the main text.

A.1 RELATED WORK

Regularization. In reinforcement learning, regularization has been widely used to accelerate convergence and encourage exploration (Tuyls et al., 2003; Geist et al., 2019; Cen et al., 2021a; Mei et al., 2020). In game theory, regularization can be used to turn the bilinear objective in normal-form games into a strongly-convex-strongly-concave one (Hofbauer and Hopkins, 2005; Cen et al., 2021b). However, Hofbauer and Hopkins (2005) only gave asymptotic convergence to the NE of the regularized game under the best-response dynamics, and Cen et al. (2021b) only provided convergence of OMWU to the original NE in terms of duality gap. Similar ideas can be traced back to the smoothing techniques of Nesterov (2003). With regularization, a linear convergence rate to the saddle point of the new objective can be guaranteed. When the regularization is small, the solution to the regularized problem is close to the NE of the original problem in terms of duality gap (Cen et al., 2021b). In contrast, we aim to show convergence in terms of not only the duality gap, but also the distance to the NE set (of the original problem), and for the more complicated setting of EFGs. The idea of using regularization in learning in games has also been explored recently in various different settings (Perolat et al., 2021; Leonardos et al., 2021). Specifically, Perolat et al. (2021) and Leonardos et al. (2021) study continuous-time dynamics using techniques based on Lyapunov arguments, and either only give a rate of convergence to the NE of the regularized game, or only guarantee asymptotic convergence to the NE of the original game. Instead, our focus is on discrete-time optimistic mirror-descent algorithms with constant stepsizes, with convergence rates for both the duality gap and the iterate distance. Finally, we note that the framework of CFR (for solving EFGs) was not investigated in these recent works.
Last-iterate convergence.
Finding the NE in EFGs could be formulated as finding the saddle point of a bilinear objective function. While mirror descent diverges in simple cases (in terms of the last-iterate) (Mertikopoulos et al., 2018; Bailey and Piliouras, 2018) , its optimistic version receives great success in finding the saddle point, enabling both faster and last-iterate convergence guarantees (Rakhlin and Sridharan, 2013; Mertikopoulos et al., 2019; Lei et al., 2021; Daskalakis et al., 2018; Mokhtari et al., 2020) . However, these previous works either only consider the case without constraints (which do not apply to the NFG/EFG setting), or provide only asymptotic convergence without explicit rate. Recently, with the unique NE assumption, Daskalakis and Panageas (2019) gives an asymptotic last-iterate convergence result for OMWU in NFGs. Wei et al. (2021) further improves the result by showing that both OMWU and OGDA converge to the NE with a global sublinear convergence rate O(1/T ) and a local linear convergence rate in NFGs. Among them, OMWU requires the unique NE assumption. Very recently, Cai et al. (2022) provides a tight last-iterate convergence for OGDA. Finally, Lee et al. (2021) extends the result of OMWU from NFGs in Wei et al. (2021) to EFGs, and still requires the unique NE assumption. Concurrent to our submission, we are aware of Piliouras et al. (2022) , which studies network zero-sum EFGs with last-iterate convergence rate guarantees, also without the unique NE assumption. However, the regularizer therein for the OMD update rule is neither dilated nor entropy-based, which makes the algorithm less scalable than the one we study, with dilated and entropy-based regularizer, see Lee et al. (2021) for a related discussion. Counterfactual regret minimization (CFR). CFR-type algorithms are based on the idea that the regret in an EFG could be decomposed into the local regret of each information set. 
By minimizing the local regret, the global regret will be minimized and the algorithms will achieve average-iterate convergence thereby. Recent work Farina et al. (2019a) utilizes the progress in the aforementioned optimistic methods, and achieves a faster average-iterate convergence rate of O(1/T 3/4 ) in EFGs. However, since CFR-type methods rely on the regret decomposition that breaks the structure of the strategy, up to now no CFR (and variant) algorithms are able to inherit the optimal rate optimistic algorithms have enjoyed in NFGs, to the best of our knowledge. Also, due to the decomposition, although Bowling et al. (2015) has found that the last iterate of CFR+ (Tammelin et al., 2015) , a variant of CFR, converges empirically, no CFR-type algorithm have the last-iterate convergence guarantee theoretically. Extensive-form perfect equilibrium and perturbed EFGs. Nash equilibrium in EFGs does not have any guarantee at the places with zero probability to reach when all players follow the NE. Therefore, in reality when players make an error that leads to an impossible state in the NE, still following the NE may be suboptimal. The concept of extensive-form perfect equilibria has thus been proposed to resolve the issue (Selten, 1975) . To find the EFPEs, Miltersen and Sørensen (2010); Farina and Gatti (2017) formulate the problem as a linear programming (LP), which is not tractable for large EFGs. Kroer et al. (2017) and Farina et al. (2017) extend the first-order method (Nesterov, 2005) and CFR to the perturbed extensive-form game (Selten, 1975 ) (which can be used for finding approximate EFPEs), where players have a small probability choosing to act randomly at every information set. Both of the results do not have last-iterate convergence.

A.2 A GRAPHICAL ILLUSTRATION OF TREEPLEX

For a better understanding of the treeplex structure, Figure 1 shows the treeplex of the player who moves first in Kuhn Poker.

Figure 1: Treeplex of the player who moves first in Kuhn Poker, say player x. The blue circle denotes the chance node and the grey triangles denote the indices in x. This is where the Cartesian product is applied. The squares denote the information sets of player x, which are simplexes. The purple arrow marks where Branching is applied once ($i = 1$). We omit the structure under Queen and King, which is the same as under Jack. The dotted squares mark the indices belonging to information sets $h_1$ and $h_2$, and the red line marks the parent index of $h_1$ and $h_2$.

The treeplex is built from 6 simplexes (2 under each private card), as follows.
• Branching: $h_1$ and $h_2$ are combined by the branching operation with $i = 1$.
• Cartesian product: the Cartesian product of the 3 similar treeplexes under Jack, Queen and King.

The whole game tree of Kuhn Poker is shown in Figure 2.
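The construction above can be mirrored in a few lines of code. The following is a minimal sketch, not the authors' implementation: the index layout, the names `build_kuhn_treeplex` and `to_sequence_form`, and the convention that the first action of $h_1$ is the one leading to $h_2$ are all assumptions made for illustration. It builds the 6 simplexes and converts a behavioral strategy into sequence form via $z_i = z_{\sigma(h(i))} \cdot q_i$.

```python
def build_kuhn_treeplex():
    """Hypothetical index layout for the first mover's treeplex in Kuhn Poker.

    Index 0 is the empty sequence. For every private card there are two
    information sets (simplexes): h1 at the root and h2 below the first
    action of h1. Returns (parent, infosets), where parent[h] is sigma(h)
    and infosets[h] is the list of indices in Omega_h.
    """
    parent, infosets = {}, {}
    next_index = 1
    for card in ("J", "Q", "K"):          # Cartesian product over the three cards
        h1, h2 = f"{card}:h1", f"{card}:h2"
        infosets[h1] = [next_index, next_index + 1]
        parent[h1] = 0
        first_action = next_index          # branching happens under this index
        next_index += 2
        infosets[h2] = [next_index, next_index + 1]
        parent[h2] = first_action
        next_index += 2
    return parent, infosets


def to_sequence_form(parent, infosets, behavior):
    """Map local strategies q_h (one distribution per simplex) to z."""
    z = {0: 1.0}
    # dicts keep insertion order, so every parent index is filled before use
    for h, indices in infosets.items():
        for i, q in zip(indices, behavior[h]):
            z[i] = z[parent[h]] * q
    return z
```

With uniform local strategies, every index under $h_1$ gets weight 0.5 and every index under $h_2$ gets weight 0.25, and each simplex sums to the weight of its parent index.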

A.3 PSEUDOCODE OF THE ADAPTIVE WEIGHT-SHRINKING ALGORITHM

The practical version of the adaptive $\tau$-shrinking framework mentioned in §3.2 is given below.

Algorithm 1 Adaptive Weight-Shrinking
    $\tau \leftarrow \tau_0$
    $z_0, \hat z_1 \leftarrow$ Uniform Strategy
    $\delta_{\tau_0} \leftarrow \max_{z'} F(z_0)^\top (z_0 - z') + \tau_0 \psi^{\mathcal Z}(z_0) - \tau_0 \psi^{\mathcal Z}(z')$
    for $t = 1, 2, \dots$ do
        $z_t, \hat z_{t+1} \leftarrow$ Reg-DOMD$(z_{t-1}, \hat z_t)$
        if $\max_{z'} F(\hat z_t)^\top (\hat z_t - z') + \tau \psi^{\mathcal Z}(\hat z_t) - \tau \psi^{\mathcal Z}(z') \le \delta_\tau / 4$ then
            $\tau \leftarrow \tau / 2$
            $\delta_\tau \leftarrow \max_{z'} F(\hat z_t)^\top (\hat z_t - z') + \tau \psi^{\mathcal Z}(\hat z_t) - \tau \psi^{\mathcal Z}(z')$
            $z_t \leftarrow \hat z_{t+1}$
        end if
    end for

Notice that this framework can also be applied to Reg-CFR by simply replacing Reg-DOMD with Reg-CFR.
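To make the shrinking loop concrete, here is a minimal sketch on a two-player zero-sum matrix game rather than an EFG. It is an illustration under stated assumptions, not the paper's code: the treeplex is replaced by a pair of simplexes, Reg-DOMD is replaced by its simplex special case (entropy-regularized optimistic multiplicative weights), the shrinking test is evaluated at $z_t$, and the function names are hypothetical. On the simplex with the negative-entropy regularizer, the regularized gap has the closed form $\langle g, p\rangle + \tau\psi(p) + \tau \log\sum_i \exp(-g_i/\tau)$ per player, which the sketch uses.

```python
import math


def prox(base, grad, eta, tau):
    # argmin_u <u, grad + tau * grad psi(base)> + KL(u, base) / eta on the simplex
    logits = [(1 - eta * tau) * math.log(b) - eta * g for b, g in zip(base, grad)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def losses(A, x, y):
    # F(z) = (A y, -A^T x): loss vectors of the min-player x and the max-player y
    gx = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    gy = [-sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    return gx, gy


def reg_gap_one(p, grad, tau):
    # max_{p'} <grad, p - p'> + tau * psi(p) - tau * psi(p'), psi = negative entropy
    top = max(-g / tau for g in grad)
    lse = top + math.log(sum(math.exp(-g / tau - top) for g in grad))
    psi = sum(v * math.log(v) for v in p if v > 0)
    return sum(g * v for g, v in zip(grad, p)) + tau * psi + tau * lse


def reg_gap(A, x, y, tau):
    gx, gy = losses(A, x, y)
    return reg_gap_one(x, gx, tau) + reg_gap_one(y, gy, tau)


def duality_gap(A, x, y):
    gx, gy = losses(A, x, y)
    return max(-g for g in gy) - min(gx)


def adaptive_weight_shrinking(A, tau0=1.0, eta=0.05, steps=20000):
    n, m = len(A), len(A[0])
    x_hat, y_hat = [1.0 / n] * n, [1.0 / m] * m
    x, y = x_hat[:], y_hat[:]
    tau = tau0
    delta = reg_gap(A, x, y, tau)
    for _ in range(steps):
        gx_prev, gy_prev = losses(A, x, y)            # F(z_{t-1})
        x, y = prox(x_hat, gx_prev, eta, tau), prox(y_hat, gy_prev, eta, tau)
        gx, gy = losses(A, x, y)                      # F(z_t)
        x_hat, y_hat = prox(x_hat, gx, eta, tau), prox(y_hat, gy, eta, tau)
        if reg_gap(A, x, y, tau) <= delta / 4:        # regularized gap shrank 4x
            tau /= 2
            delta = reg_gap(A, x, y, tau)
    return x, y, tau
```

For example, `adaptive_weight_shrinking([[2.0, -1.0], [-1.0, 1.0]])` returns a strategy pair together with the final value of $\tau$; `duality_gap` can then be used to inspect how close the pair is to an NE of the unregularized game.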

A.4 EXPERIMENT ENVIRONMENTS

Kuhn Poker (Kuhn, 1950). Kuhn Poker has two players and three cards: Jack, Queen and King. At the beginning, each player places 1 chip into the pot and is then dealt 1 private card. In each round, a player can call, raise or fold. A player who calls must make the two players' contributions to the pot equal. A player who raises puts 1 more chip into the pot than the other player. If a player folds, the other player takes all the chips in the pot. There is at most 1 raise in the game. A betting round ends when both players call or one of them folds. If the game ends and nobody has folded, the two players reveal their private cards and the one with the higher rank takes all the chips in the pot.

Leduc Poker (Southey et al., 2005). Leduc Poker is similar to Kuhn Poker. It has 6 cards: three ranks ({J, Q, K}) with two suits ({a, b}) each. There are two betting rounds, and each round admits two raises. A player who raises places 1 more chip in the first round and 2 more chips in the second. If the game ends and nobody has folded, the players reveal their private cards. The player whose private card has the same rank as the public card wins. If neither player's private card matches the public card, the one with the higher rank wins. Otherwise the game is a draw and the two players split the pot equally.
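The Kuhn Poker rules above translate directly into a terminal payoff function. The sketch below is illustrative only: the action encoding (`'c'` for call, `'r'` for raise, `'f'` for fold), the function name `kuhn_payoff`, and the convention of reporting the first player's net winnings are assumptions, and the function does not check that the history is a legal terminal history.

```python
RANK = {"J": 0, "Q": 1, "K": 2}


def kuhn_payoff(card_first, card_second, history):
    """Net chips won by the first player at the end of a Kuhn Poker hand.

    history is a string of actions, alternating between the players and
    starting with the first player: 'c' = call, 'r' = raise, 'f' = fold.
    """
    contribution = [1, 1]                  # each player places 1 chip up front
    player = 0
    for action in history:
        other = 1 - player
        if action == "r":                  # 1 more chip than the other player
            contribution[player] = contribution[other] + 1
        elif action == "c":                # match the other player's contribution
            contribution[player] = contribution[other]
        elif action == "f":                # the other player takes the pot
            return contribution[1] if player == 1 else -contribution[0]
        player = other
    # showdown: the higher rank takes the pot
    if RANK[card_first] > RANK[card_second]:
        return contribution[1]
    return -contribution[0]
```

For instance, `kuhn_payoff("K", "Q", "crc")` describes a hand where the first player calls, the second raises, and the first calls again before winning the showdown.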

B EXPERIMENT RESULTS

Beyond sharp theoretical guarantees, regularized algorithms for EFGs also perform well in practice, which we showcase in this section through numerical experiments on Kuhn Poker (Kuhn, 1950) and Leduc Poker (Southey et al., 2005). The details of the experiment setup are given in Appendix A. Figure 3 shows the last-iterate convergence in duality gap. We used grid search to find the best parameters for each algorithm. The algorithms Reg-DOMWU, Reg-DOGDA and Reg-CFR all apply the adaptive weight-shrinking framework given as Algorithm 1 in Appendix A. Figure 4 shows the regret upper bound $\max_{z \in \mathcal Z} \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} R^h_T$. We can see that, in practice, Reg-CFR has constant regret even in a non-perturbed EFG.

Moreover, we show empirically that regularization is also helpful for CFR and CFR+. That is, with RM and RM+ as local regret minimizers, adding regularization still lets the algorithm enjoy last-iterate convergence; see Figure 5 for details. To minimize the regret of a convex but non-linear loss function $l_t(x_t)$, we feed $\langle \nabla l_t(x_t), x_t \rangle$ into RM and RM+ as the loss function; see (Farina et al., 2019d, §2.1) for more details. The average-iterate convergence rate of this regularized version remains competitive with the original version; see Figure 6 for details.

Figure 7 illustrates the duality gap of the average iterate. We can see that Reg-CFR is faster than CFR in both environments and performs comparably to CFR+ in smaller environments such as Kuhn Poker. Figure 8 illustrates the maximum cumulative regret across all information sets, conditioned on reaching that information set. This metric is also used in Farina et al. (2017) and Kroer et al. (2017), and it measures the "closeness" to EFPEs. We can see that with $\gamma > 0$, Reg-CFR significantly outperforms CFR and CFR+ in finding EFPEs.

C PROOF OF THEOREM 3.1

Lemma C.1.
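For any $\tau \le 1$ and $z \in \mathcal Z$, the NE $z^*_\tau$ of the regularized problem Eq (3.1) satisfies

$$F(z)^\top (z - z^*_\tau) - \tau \psi^{\mathcal Z}(z^*_\tau) + \tau \psi^{\mathcal Z}(z) \ge 0. \tag{C.1}$$

Lemma C.2. Consider the update rule in Eq (3.2). When $\psi^{\mathcal Z}$ satisfies Eq (C.6) with $p = 2$ and $\eta \le \frac{1}{8P}$, then for any $z \in \mathcal Z$ and $t \ge 1$, we have

$$\eta\tau \psi^{\mathcal Z}(z_t) - \eta\tau \psi^{\mathcal Z}(z) + \eta F(z_t)^\top (z_t - z) \le (1 - \eta\tau) D_{\psi^{\mathcal Z}}(z, \hat z_t) - D_{\psi^{\mathcal Z}}(z, \hat z_{t+1}) - D_{\psi^{\mathcal Z}}(\hat z_{t+1}, z_t) - \tfrac{7}{8} D_{\psi^{\mathcal Z}}(z_t, \hat z_t) + \tfrac{1}{8} D_{\psi^{\mathcal Z}}(\hat z_t, z_{t-1}).$$

Proof of Theorem 3.1. Taking $z = z^*_\tau$ in Lemma C.2, we have

$$(1 - \eta\tau) D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_t) - D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_{t+1}) - D_{\psi^{\mathcal Z}}(\hat z_{t+1}, z_t) - \tfrac{7}{8} D_{\psi^{\mathcal Z}}(z_t, \hat z_t) + \tfrac{1}{8} D_{\psi^{\mathcal Z}}(\hat z_t, z_{t-1}) \ge \eta\tau \psi^{\mathcal Z}(z_t) - \eta\tau \psi^{\mathcal Z}(z^*_\tau) + \eta F(z_t)^\top (z_t - z^*_\tau) \overset{(i)}{\ge} 0, \tag{C.2}$$

where (i) is by Lemma C.1. Letting $\Theta_{t+1} = D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_{t+1}) + D_{\psi^{\mathcal Z}}(\hat z_{t+1}, z_t)$, inequality (C.2) can be written as

$$\Theta_{t+1} \le (1 - \eta\tau)\Theta_t - \tfrac{7}{8} D_{\psi^{\mathcal Z}}(z_t, \hat z_t) - \big(\tfrac{7}{8} - \eta\tau\big) D_{\psi^{\mathcal Z}}(\hat z_t, z_{t-1}) \le (1 - \eta\tau)\Theta_t, \tag{C.3}$$

where the second inequality comes from $\eta\tau \le \eta \le \frac{7}{8}$. As a result,

$$D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_{t+1}) \le \Theta_{t+1} \le (1 - \eta\tau)^t \Theta_1 = (1 - \eta\tau)^t D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1), \tag{C.4}$$

where the last equality holds when we initialize $z_0 = \hat z_1$.

Lemma C.3. $F$ is $P$-Lipschitz on $\mathcal Z$. That is, for any $z, z' \in \mathcal Z$, we have

$$\|F(z) - F(z')\| \le P \|z - z'\|. \tag{C.5}$$

Proof.

$$\|F(z) - F(z')\| = \sqrt{\|A^\top (x - x')\|^2 + \|A(y - y')\|^2} \le \sqrt{P\|x - x'\|_1^2 + P\|y - y'\|_1^2} \le \sqrt{P\|z - z'\|_1^2} \le P\|z - z'\|.$$

Lemma C.4. Let $C$ be a convex set and $u_1 = \arg\min_{\hat u_1 \in C} \{\langle \hat u_1, g + \tau\nabla\psi^C(u)\rangle + \frac{1}{\eta} D_{\psi^C}(\hat u_1, u)\}$, where $\psi^C$ is a strongly convex function on $C$. Then for any $u_2 \in C$, $\tau \in [0, 1]$ and $\eta > 0$,

$$\eta\tau\psi^C(u_1) - \eta\tau\psi^C(u_2) + \eta\langle u_1 - u_2, g\rangle \le (1 - \eta\tau) D_{\psi^C}(u_2, u) - D_{\psi^C}(u_2, u_1) - (1 - \eta\tau) D_{\psi^C}(u_1, u).$$

Proof.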
Plugging in the definition of the Bregman divergence, $D_{\psi^C}(u_1, u) = \psi^C(u_1) - \psi^C(u) - \langle \nabla\psi^C(u), u_1 - u\rangle$, the right-hand side is equal to

$$\begin{aligned}
&(1 - \eta\tau) D_{\psi^C}(u_2, u) - D_{\psi^C}(u_2, u_1) - (1 - \eta\tau) D_{\psi^C}(u_1, u) \\
&= (1 - \eta\tau)\big(\psi^C(u_2) - \psi^C(u) - \langle \nabla\psi^C(u), u_2 - u\rangle\big) + \big(-\psi^C(u_2) + \psi^C(u_1) + \langle \nabla\psi^C(u_1), u_2 - u_1\rangle\big) \\
&\quad + (1 - \eta\tau)\big(-\psi^C(u_1) + \psi^C(u) + \langle \nabla\psi^C(u), u_1 - u\rangle\big) \\
&= \eta\tau\psi^C(u_1) - \eta\tau\psi^C(u_2) + \langle \nabla\psi^C(u_1) - (1 - \eta\tau)\nabla\psi^C(u), u_2 - u_1\rangle \\
&\overset{(i)}{\ge} \eta\tau\psi^C(u_1) - \eta\tau\psi^C(u_2) + \eta\langle u_1 - u_2, g\rangle,
\end{aligned}$$

where (i) is by the first-order optimality of $u_1$, i.e., $(\eta g + \nabla\psi^C(u_1) - (1 - \eta\tau)\nabla\psi^C(u))^\top (u_2 - u_1) \ge 0$.

Lemma C.5. Suppose that $\psi^C$ is a 1-strongly convex function with respect to the $p$-norm on $C$, so that

$$D_{\psi^C}(x, x') \ge \tfrac{1}{2}\|x - x'\|_p^2 \tag{C.6}$$

for some $p \ge 1$, and that $u, u_1, u_2$ are members of a convex set $C$ such that

$$u_1 = \arg\min_{u' \in C} \{\langle u', g_1 + \tau\nabla\psi^C(u)\rangle + D_{\psi^C}(u', u)\}, \qquad u_2 = \arg\min_{u' \in C} \{\langle u', g_2 + \tau\nabla\psi^C(u)\rangle + D_{\psi^C}(u', u)\}. \tag{C.7}$$

Then we have

$$\|u_1 - u_2\|_p \le \|g_1 - g_2\|_q, \tag{C.8}$$

where $q \ge 1$ and $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. By the first-order optimality of $u_1$ and $u_2$, we have

$$(g_1 + \nabla\psi^C(u_1) - (1 - \tau)\nabla\psi^C(u))^\top (u_2 - u_1) \ge 0, \qquad (g_2 + \nabla\psi^C(u_2) - (1 - \tau)\nabla\psi^C(u))^\top (u_1 - u_2) \ge 0. \tag{C.9}$$

Summing up and rearranging the terms,

$$\langle u_2 - u_1, g_1 - g_2\rangle \ge \langle \nabla\psi^C(u_1) - \nabla\psi^C(u_2), u_1 - u_2\rangle. \tag{C.10}$$

To bound the right-hand side of inequality (C.10), we use the lower bound (C.6) on the Bregman divergence, which gives

$$\langle \nabla\psi^C(u_1), u_1 - u_2\rangle \ge \psi^C(u_1) - \psi^C(u_2) + \tfrac{1}{2}\|u_1 - u_2\|_p^2, \qquad \langle \nabla\psi^C(u_2), u_2 - u_1\rangle \ge \psi^C(u_2) - \psi^C(u_1) + \tfrac{1}{2}\|u_1 - u_2\|_p^2.$$

Summing them up, we have $\langle \nabla\psi^C(u_1) - \nabla\psi^C(u_2), u_1 - u_2\rangle \ge \|u_1 - u_2\|_p^2$. Combining with inequality (C.10),

$$\langle u_2 - u_1, g_1 - g_2\rangle \ge \|u_1 - u_2\|_p^2. \tag{C.11}$$

Finally, by Hölder's inequality, $\langle u_2 - u_1, g_1 - g_2\rangle \le \|u_1 - u_2\|_p \cdot \|g_1 - g_2\|_q$, and as a result $\|u_1 - u_2\|_p \le \|g_1 - g_2\|_q$ as claimed.
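The three-point inequality of Lemma C.4 can be checked numerically in a simple special case. The sketch below is an illustration under assumptions that are not part of the paper: $C$ is the probability simplex, $\psi^C$ is the negative entropy (so $D_{\psi^C}$ is the KL divergence), and all function names are hypothetical. In this case the minimizer has the closed form $u_1 \propto u^{1 - \eta\tau} \exp(-\eta g)$, and because $u_1$ lies in the interior of the simplex, the first-order optimality condition holds with equality, so the two sides coincide up to rounding.

```python
import math
import random


def neg_entropy(p):
    return sum(v * math.log(v) for v in p)


def kl(p, q):
    # Bregman divergence of the negative entropy between two simplex points
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def prox_step(u, g, eta, tau):
    # closed form of argmin <u', g + tau * grad psi(u)> + KL(u', u) / eta
    logits = [(1 - eta * tau) * math.log(b) - eta * gi for b, gi in zip(u, g)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def random_simplex_point(rng, n):
    raw = [rng.random() + 0.1 for _ in range(n)]   # keep entries away from 0
    total = sum(raw)
    return [r / total for r in raw]


def lemma_c4_sides(u, u2, g, eta, tau):
    """Return (left-hand side, right-hand side) of the Lemma C.4 inequality."""
    u1 = prox_step(u, g, eta, tau)
    lhs = eta * tau * (neg_entropy(u1) - neg_entropy(u2))
    lhs += eta * sum((a - b) * gi for a, b, gi in zip(u1, u2, g))
    rhs = (1 - eta * tau) * kl(u2, u) - kl(u2, u1) - (1 - eta * tau) * kl(u1, u)
    return lhs, rhs
```

Calling `lemma_c4_sides` on random interior points `u`, `u2` and a random vector `g` returns two numbers whose difference is at the level of floating-point error.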
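Proof of Lemma C.1. By the definition of NE, we have

$$\begin{aligned}
F(z)^\top (z - z^*_\tau) &= -x^{*\top}_\tau A y + x^\top A y^*_\tau \\
&= \big(-x^{*\top}_\tau A y + \tau\psi^{\mathcal Z}(y)\big) + \big(x^\top A y^*_\tau + \tau\psi^{\mathcal Z}(x)\big) - \tau\big(\psi^{\mathcal Z}(x) + \psi^{\mathcal Z}(y)\big) \\
&\ge \big(-x^{*\top}_\tau A y^*_\tau + \tau\psi^{\mathcal Z}(y^*_\tau)\big) + \big(x^{*\top}_\tau A y^*_\tau + \tau\psi^{\mathcal Z}(x^*_\tau)\big) - \tau\big(\psi^{\mathcal Z}(x) + \psi^{\mathcal Z}(y)\big) \\
&= \tau\psi^{\mathcal Z}(z^*_\tau) - \tau\psi^{\mathcal Z}(z).
\end{aligned}$$

Proof of Lemma C.2. Plugging $u = \hat z_t$, $u_1 = \hat z_{t+1}$, $u_2 = z$, $g = F(z_t)$ and $\psi^C = \psi^{\mathcal Z}$ into Lemma C.4,

$$\eta\tau\psi^{\mathcal Z}(\hat z_{t+1}) - \eta\tau\psi^{\mathcal Z}(z) + \eta\langle \hat z_{t+1} - z, F(z_t)\rangle \le (1 - \eta\tau) D_{\psi^{\mathcal Z}}(z, \hat z_t) - D_{\psi^{\mathcal Z}}(z, \hat z_{t+1}) - (1 - \eta\tau) D_{\psi^{\mathcal Z}}(\hat z_{t+1}, \hat z_t).$$

Plugging $u = \hat z_t$, $u_1 = z_t$, $u_2 = \hat z_{t+1}$, $g = F(z_{t-1})$ and $\psi^C = \psi^{\mathcal Z}$ into Lemma C.4,

$$\eta\tau\psi^{\mathcal Z}(z_t) - \eta\tau\psi^{\mathcal Z}(\hat z_{t+1}) + \eta\langle z_t - \hat z_{t+1}, F(z_{t-1})\rangle \le (1 - \eta\tau) D_{\psi^{\mathcal Z}}(\hat z_{t+1}, \hat z_t) - D_{\psi^{\mathcal Z}}(\hat z_{t+1}, z_t) - (1 - \eta\tau) D_{\psi^{\mathcal Z}}(z_t, \hat z_t).$$

Summing them up and adding $\eta\langle F(z_t) - F(z_{t-1}), z_t - \hat z_{t+1}\rangle$ to both sides, we have

$$\eta\tau\psi^{\mathcal Z}(z_t) - \eta\tau\psi^{\mathcal Z}(z) + \eta\langle F(z_t), z_t - z\rangle \le (1 - \eta\tau) D_{\psi^{\mathcal Z}}(z, \hat z_t) - D_{\psi^{\mathcal Z}}(z, \hat z_{t+1}) - D_{\psi^{\mathcal Z}}(\hat z_{t+1}, z_t) - (1 - \eta\tau) D_{\psi^{\mathcal Z}}(z_t, \hat z_t) + \eta\langle F(z_t) - F(z_{t-1}), z_t - \hat z_{t+1}\rangle.$$

It remains to bound the last term:

$$\begin{aligned}
\eta\langle F(z_t) - F(z_{t-1}), z_t - \hat z_{t+1}\rangle
&\overset{(i)}{\le} \|x_t - \hat x_{t+1}\| \cdot \|\eta A y_t - \eta A y_{t-1}\| + \|y_t - \hat y_{t+1}\| \cdot \|\eta A^\top x_t - \eta A^\top x_{t-1}\| \\
&\overset{(ii)}{\le} \eta^2 \big(\|A y_t - A y_{t-1}\|^2 + \|A^\top x_t - A^\top x_{t-1}\|^2\big) \\
&\overset{(iii)}{\le} 2\eta^2 P^2 \|z_t - z_{t-1}\|^2 \overset{(iv)}{\le} \tfrac{1}{32}\|z_t - z_{t-1}\|^2 \\
&\le \tfrac{1}{16}\big(\|z_t - \hat z_t\|^2 + \|\hat z_t - z_{t-1}\|^2\big) \le \tfrac{1}{8}\big(D_{\psi^{\mathcal Z}}(z_t, \hat z_t) + D_{\psi^{\mathcal Z}}(\hat z_t, z_{t-1})\big),
\end{aligned}$$

where (i) is by Hölder's inequality, (ii) is by Lemma C.5 with $p = q = 2$, (iii) is by Lemma C.3, and (iv) is by $\eta \le \frac{1}{8P}$. The proof of the claim is completed by putting everything together.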
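Step (ii) above relies on the stability bound of Lemma C.5 with $p = q = 2$. A minimal numerical sketch of that bound follows, under assumptions chosen only for illustration: $\psi^C(u) = \frac{1}{2}\|u\|^2$ (so the Bregman divergence is half the squared Euclidean distance), $C$ is the box $[0, 1]^n$, and the names `euclid_prox` and `l2` are hypothetical. In this case the update in Eq (C.7) is a coordinate-wise clipping of $(1 - \tau) u - g$, and the bound reduces to the non-expansiveness of the projection.

```python
import math
import random


def euclid_prox(u, g, tau, lo=0.0, hi=1.0):
    # argmin over the box of <u', g + tau * u> + 0.5 * ||u' - u||^2;
    # the objective is separable, so each coordinate is clipped independently
    return [min(hi, max(lo, (1 - tau) * ui - gi)) for ui, gi in zip(u, g)]


def l2(v):
    return math.sqrt(sum(a * a for a in v))


def stability_gap(u, g1, g2, tau):
    """Return ||g1 - g2||_2 - ||u1 - u2||_2, which Lemma C.5 says is >= 0."""
    u1 = euclid_prox(u, g1, tau)
    u2 = euclid_prox(u, g2, tau)
    diff_u = [a - b for a, b in zip(u1, u2)]
    diff_g = [a - b for a, b in zip(g1, g2)]
    return l2(diff_g) - l2(diff_u)
```

Sampling random `u`, `g1`, `g2` and calling `stability_gap` gives a non-negative number in every trial, matching Eq (C.8).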

D PROOF OF THEOREM 3.2

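Firstly, we prove that an approximate NE of the regularized problem is close to the NE of the original problem in terms of duality gap.

Lemma D.1. For any $\tau > 0$ and $z \in \mathcal Z$, we have

$$\max_{\tilde z \in \mathcal Z} F(z)^\top (z - \tilde z) \le 2\tau C_B + 2P\sqrt{D_{\psi^{\mathcal Z}}(z^*_\tau, z)}, \tag{D.1}$$

where $C_B$ is the upper bound of the regularizer $\psi^{\mathcal Z}$.

Proof.

$$\begin{aligned}
\max_{\tilde z \in \mathcal Z} F(z)^\top (z - \tilde z)
&= \max_{\tilde z \in \mathcal Z} \big\{ x^{*\top}_\tau A \tilde y - \tilde x^\top A y^*_\tau + \tau\psi^{\mathcal Z}(z^*_\tau) - \tau\psi^{\mathcal Z}(\tilde z) - \tau\psi^{\mathcal Z}(z^*_\tau) + \tau\psi^{\mathcal Z}(\tilde z) + (x - x^*_\tau)^\top A \tilde y + \tilde x^\top A (y^*_\tau - y) \big\} \\
&\le \max_{\tilde z \in \mathcal Z} \big\{ x^{*\top}_\tau A \tilde y - \tilde x^\top A y^*_\tau + \tau\psi^{\mathcal Z}(z^*_\tau) - \tau\psi^{\mathcal Z}(\tilde z) \big\} + \max_{\tilde z \in \mathcal Z} \big\{ -\tau\psi^{\mathcal Z}(z^*_\tau) + \tau\psi^{\mathcal Z}(\tilde z) + (x - x^*_\tau)^\top A \tilde y + \tilde x^\top A (y^*_\tau - y) \big\} \\
&\overset{(i)}{\le} 0 + 2\tau C_B + \|x - x^*_\tau\|_1 + \|y^*_\tau - y\|_1 \\
&\overset{(ii)}{\le} 2\tau C_B + 2P\|z - z^*_\tau\| \le 2\tau C_B + 2P\sqrt{D_{\psi^{\mathcal Z}}(z^*_\tau, z)},
\end{aligned} \tag{D.2}$$

where (i) is because of the definition of $z^*_\tau$ and $\|F(z)\|_\infty \le 1$ for any $z \in \mathcal Z$, and (ii) is by $\|x - x^*_\tau\|_1 \le \sqrt{P}\|x - x^*_\tau\|$, $\|y - y^*_\tau\|_1 \le \sqrt{P}\|y - y^*_\tau\|$ and $a + b \le 2\sqrt{a^2 + b^2}$.

A direct consequence of the lemma is that for any $\epsilon > 0$, we can set $\tau = \frac{\epsilon}{4C_B}$; then after

$$\frac{2(\log\epsilon - \log 4P) - \log D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)}{\log(1 - \tau)} \le \frac{-2(\log\epsilon - \log 4P) + \log D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)}{\tau}$$

iterations, the iterate $\hat z_t$ produced by Reg-DOMD satisfies

$$\max_{z \in \mathcal Z} \{\hat x_t^\top A y - x^\top A \hat y_t\} \le \frac{\epsilon}{2} + 2P\sqrt{\frac{\epsilon^2}{16P^2 D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)} D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)} \le \epsilon \tag{D.3}$$

by Theorem 3.1.

Proof of Theorem 3.2. Sublinear convergence rate of duality gap. For any $\epsilon$, the number of iterations needed for the duality gap to reach $\epsilon$ is no larger than $4C_B \cdot \frac{-2(\log\epsilon - \log 4P) + \log D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)}{\epsilon}$ by the discussion above. Therefore, when the duality gap reaches $\epsilon = \frac{\epsilon_0}{2^K}$, the number of iterations performed so far is no larger than

$$\sum_{k=0}^{K} 4C_B \cdot 2^k \cdot \frac{-2\log\epsilon_0 + 2k\log 2 + 2\log 4P + \log D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)}{\epsilon_0} \le 4C_B \cdot 2^{K+2} \cdot \frac{-\log\epsilon_0 + K\log 2 + \log 4P + \log D_{\psi^{\mathcal Z}}(z^*_\tau, \hat z_1)}{\epsilon_0} = O(1/\epsilon). \tag{D.4}$$

Iterate convergence. From the proof of Theorem 5 in Wei et al. (2021), we have the following lemma.

Lemma D.2 (Proved in Theorem 5 of Wei et al. (2021)). Consider a bilinear zero-sum game.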
Let $\rho := \min_{x \in \mathcal X} \max_{y \in \mathcal Y} x^\top A y$ be the game value. When $\mathcal X, \mathcal Y$ are polytopes, we have $\max_{\tilde y \in \mathcal Y} x^\top A \tilde y - \rho \ge c\|x - X^*(x)\|$ (and $\rho - \min_{\tilde x \in \mathcal X} \tilde x^\top A y \ge c\|y - Y^*(y)\|$) for some constant $c > 0$, where $X^*(x)$ ($Y^*(y)$) is the projection of $x$ ($y$) onto the NE set $\mathcal X^*$ ($\mathcal Y^*$) of the min-player (max-player).

Then, since the treeplex is a polytope by definition, we have

$$\max_{z \in \mathcal Z} F(\hat z_t)^\top (\hat z_t - z) = \max_{y \in \mathcal Y} \hat x_t^\top A y - \min_{x \in \mathcal X} x^\top A \hat y_t \ge c\big(\|\hat x_t - X^*(\hat x_t)\| + \|\hat y_t - Y^*(\hat y_t)\|\big) \ge c\|\hat z_t - Z^*(\hat z_t)\|, \tag{D.5}$$

where the last inequality comes from $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$. Therefore, $\|\hat z_t - Z^*(\hat z_t)\| \le \frac{1}{c}\max_{z \in \mathcal Z} F(\hat z_t)^\top (\hat z_t - z) \le O(\frac{1}{t})$.

Notice that, compared to the results in Gilpin et al. (2008) and Wei et al. (2021), our slope result (Lemma D.6) is based on different techniques. In Lemma D.6, we prove that $\max_{\tilde y \in V^*(X^*(x))} x^\top A \tilde y - \rho \ge c_x\|x - X^*(x)\|$, where $V^*(X^*(x)) \subseteq \mathcal Y$, when $x \in F_x$, and $F_x \subseteq \mathcal X$ contains all possible iterates generated by DOMWU. That is, our result is stronger than the existing results when the algorithm is DOMWU. Moreover, our result can be viewed as an extension of (Lee et al., 2021, Lemma 14) to the non-unique NE case. Given that (Lee et al., 2021, Lemma 14) plays a critical role in proving last-iterate convergence under the unique-NE assumption, Lemma D.6 may be useful for proving last-iterate convergence in EFGs without the unique-NE assumption and without regularization.
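The slope inequality of Lemma D.2 can be worked out by hand on a small example. The sketch below uses matching pennies, which is an illustrative choice and not one of the paper's experiments: the game value is $\rho = 0$, the unique NE of the min-player is $x^* = (1/2, 1/2)$, and for $x = (p, 1 - p)$ the best-response value is $|2p - 1|$ while $\|x - x^*\| = \sqrt{2}\,|p - 1/2|$, so the ratio equals $\sqrt{2}$ for every $p \ne 1/2$. The function names are hypothetical.

```python
import math

# matching pennies: x is the min-player, y the max-player, game value rho = 0
A = [[1.0, -1.0], [-1.0, 1.0]]
RHO = 0.0
X_STAR = (0.5, 0.5)


def best_response_value(x):
    # max over y of x^T A y is attained at a pure strategy (a column of A)
    return max(sum(x[i] * A[i][j] for i in range(2)) for j in range(2))


def slope(p):
    """Ratio (max_y x^T A y - rho) / ||x - x*|| for x = (p, 1 - p), p != 1/2."""
    x = (p, 1.0 - p)
    distance = math.hypot(x[0] - X_STAR[0], x[1] - X_STAR[1])
    return (best_response_value(x) - RHO) / distance
```

In this game the constant $c$ of Lemma D.2 can therefore be taken to be $\sqrt{2}$.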

D.1 COMPLEMENTARY SLACKNESS

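This part of the discussion is similar to the one in Lee et al. (2021). From Definition 2.1, we have

$$\forall h \in \mathcal H^{\mathcal Y}, \quad \sum_{i \in \Omega_h} y_i = y_{\sigma(h)}, \qquad y_0 = 1, \tag{D.6}$$

which can be written compactly as $E_{\mathcal Y} y = e_{\mathcal Y}$, where $E_{\mathcal Y} \in \mathbb R^{(|\mathcal H^{\mathcal Y}| + 1) \times N}$ and $e_{\mathcal Y} = (1, 0, 0, \dots, 0) \in \mathbb R^{|\mathcal H^{\mathcal Y}| + 1}$. The first row of $E_{\mathcal Y}$ has 1 at index 0 and 0 elsewhere; every other row has 1 at index $\sigma(h)$ and $-1$ at all $i \in \Omega_h$. Therefore, for any fixed $x$, the problem of $y$ can be written as

$$\max_{y \in \mathcal Y} x^\top A y \quad \text{s.t.} \quad E_{\mathcal Y} y = e_{\mathcal Y}, \; y \ge 0, \tag{D.7}$$

whose dual problem is

$$\min_{g} e_{\mathcal Y}^\top g \quad \text{s.t.} \quad E_{\mathcal Y}^\top g \ge A^\top x, \tag{D.8}$$

where $e_{\mathcal Y}^\top g = g_0$ since $e_{\mathcal Y} = (1, 0, 0, \dots, 0)$. Recall that the primal formulation of the original problem is

$$\min_{x \in \mathcal X} \max_{y \in \mathcal Y} x^\top A y \quad \text{s.t.} \quad E_{\mathcal X} x = e_{\mathcal X}, \; x \ge 0, \; E_{\mathcal Y} y = e_{\mathcal Y}, \; y \ge 0. \tag{D.9}$$

Therefore, every solution $x^*$ of the original problem is a solution of the following problem:

$$\min_{x \in \mathcal X, g} g_0 \quad \text{s.t.} \quad E_{\mathcal Y}^\top g \ge A^\top x, \; E_{\mathcal X} x = e_{\mathcal X}, \; x \ge 0. \tag{D.10}$$

Its dual is

$$\max_{y \in \mathcal Y, f} f_0 \quad \text{s.t.} \quad E_{\mathcal X}^\top f \le A y, \; E_{\mathcal Y} y = e_{\mathcal Y}, \; y \ge 0. \tag{D.11}$$

Note that $\mathcal X^*, \mathcal Y^*$ are the optimal solution sets of Eq (D.10) and Eq (D.11). By complementary slackness, for any optimal solution pair $(x^*, g^*)$, $(y^*, f^*)$, there are slack variables $w^* \in \mathbb R^M$, $s^* \in \mathbb R^N$ such that

$$E_{\mathcal X}^\top f^* + w^* = A y^*, \quad E_{\mathcal Y}^\top g^* - s^* = A^\top x^*, \quad x^* \odot w^* = 0, \quad y^* \odot s^* = 0, \quad w^* \ge 0, \quad s^* \ge 0, \tag{D.12}$$

where $\odot$ denotes the element-wise product. As a direct consequence, we have the following lemma.

Lemma D.3. For any optimal solution pair $(x^*, g^*)$, $(y^*, f^*)$ of Eq (D.10) and Eq (D.11), we have

$$\begin{aligned}
\sum_{h \in \mathcal H_i} f^*_h + (A y^*)_i &= f^*_{h(i)} && \forall i \in \mathrm{supp}(\mathcal X^*), \\
\sum_{h \in \mathcal H_i} f^*_h + (A y^*)_i &\ge f^*_{h(i)} && \forall i \notin \mathrm{supp}(\mathcal X^*), \\
\sum_{h \in \mathcal H_i} g^*_h + (A^\top x^*)_i &= g^*_{h(i)} && \forall i \in \mathrm{supp}(\mathcal Y^*), \\
\sum_{h \in \mathcal H_i} g^*_h + (A^\top x^*)_i &\le g^*_{h(i)} && \forall i \notin \mathrm{supp}(\mathcal Y^*),
\end{aligned} \tag{D.13}$$

where $\mathrm{supp}(x)$ denotes the support set of a vector $x$ and $\mathrm{supp}(C) = \bigcup_{x \in C} \mathrm{supp}(x)$ denotes the support set of a convex set $C$.

Proof. Since $(E_{\mathcal X}^\top f^*)_i = f^*_{h(i)} - \sum_{h \in \mathcal H_i} f^*_h$ by the definition of $E$, from Eq (D.12) we have

$$\sum_{h \in \mathcal H_i} f^*_h + (A y^*)_i = w^*_i + f^*_{h(i)} \ge f^*_{h(i)}. \tag{D.14}$$

For any $i$ for which there is an $x^* \in \mathcal X^*$ with $x^*_i > 0$, from $x^* \odot w^* = 0$ we have $w^*_i = 0$, so the above inequality holds with equality. This proves the first two lines of Lemma D.3. The last two lines can be proved similarly.

We further introduce the following definitions.

Definition D.4.

$$\begin{aligned}
\rho &= x^{*\top} A y^*, \\
PS(x^*) &= \{y : y \text{ is a pure strategy}, \; x^{*\top} A y = \rho\}, \\
PS(y^*) &= \{x : x \text{ is a pure strategy}, \; x^\top A y^* = \rho\}, \\
V^*(x^*) &= \mathcal C(PS(x^*)), \qquad V^*(y^*) = \mathcal C(PS(y^*)), \\
\mathrm{supp}(x) &= \{i : x_i > 0\}, \qquad \mathrm{supp}(C) = \{i : \exists x \in C, \; x_i > 0\},
\end{aligned} \tag{D.15}$$

where $\mathcal C(S)$ denotes the smallest convex set covering all points in $S$. A fact that follows from the definition is that $x^{*\top} A y = \rho$ for all $y \in V^*(x^*)$ and $x^\top A y^* = \rho$ for all $x \in V^*(y^*)$.

Lemma D.5. $V^*(x^*)$ and $V^*(y^*)$ are not empty for any $x^* \in \mathcal X^*$, $y^* \in \mathcal Y^*$.

Proof. For any $x \in \mathcal X$, $y^* \in \mathcal Y^*$ and $f^*$ such that $\mathrm{supp}(x) \subseteq \mathrm{supp}(\mathcal X^*)$ and $(f^*, y^*)$ is an optimal solution pair of Eq (D.11), we have

$$\begin{aligned}
x^\top A y^* &= \sum_i x_i (A y^*)_i = \sum_i x_i \Big(f^*_{h(i)} - \sum_{h \in \mathcal H_i} f^*_h\Big) = \sum_{h \in \mathcal H^{\mathcal X}} f^*_h \sum_{i \in \Omega_h} x_i - \sum_{h \in \mathcal H^{\mathcal X}, h \ne 0} f^*_h x_{\sigma(h)} \\
&= \sum_{h \in \mathcal H^{\mathcal X}} f^*_h x_{\sigma(h)} - \sum_{h \in \mathcal H^{\mathcal X}, h \ne 0} f^*_h x_{\sigma(h)} = f^*_0 = \rho,
\end{aligned} \tag{D.16}$$

where the second equality is because $\mathrm{supp}(x) \subseteq \mathrm{supp}(\mathcal X^*)$ and Lemma D.3, and the fourth equality comes from the fact that $\sum_{i \in \Omega_h} x_i = x_{\sigma(h)}$. Therefore, $V^*(y^*)$ is not empty for any $y^* \in \mathcal Y^*$. Similarly, $V^*(x^*)$ is not empty for any $x^* \in \mathcal X^*$.

Under the unique-NE assumption of Lee et al. (2021), the inequalities in the second and fourth lines of Lemma D.3 become strict by strict complementary slackness. The argument in Lemma D.5 then becomes an equivalence: $x^\top A y^* = \rho$ if and only if $\mathrm{supp}(x) \subseteq \mathrm{supp}(\mathcal X^*)$, which strengthens the conclusion here.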
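The compact form $E_{\mathcal Y} y = e_{\mathcal Y}$ of the treeplex constraints in Eq (D.6) can be illustrated on a toy treeplex. The sketch below is an assumption-laden example, not taken from the paper: the treeplex has a root information set `h0` with indices 1 and 2 and a second information set `h1` with indices 3 and 4 below index 1, and the helper names are hypothetical. Rows are built exactly as described in the text: the first row selects index 0, and every other row has $1$ at $\sigma(h)$ and $-1$ on $\Omega_h$.

```python
def constraint_matrix(num_indices, parent, infosets):
    """Build (E, e) with one row for y_0 = 1 and one row per information set."""
    first_row = [0.0] * num_indices
    first_row[0] = 1.0
    rows = [first_row]
    for h, indices in infosets.items():
        row = [0.0] * num_indices
        row[parent[h]] += 1.0            # +1 at the parent index sigma(h)
        for i in indices:
            row[i] -= 1.0                # -1 at every index in Omega_h
        rows.append(row)
    rhs = [1.0] + [0.0] * len(infosets)
    return rows, rhs


def matvec(rows, vec):
    return [sum(r * v for r, v in zip(row, vec)) for row in rows]


# toy treeplex (hypothetical): h0 at the root, h1 below index 1
PARENT = {"h0": 0, "h1": 1}
INFOSETS = {"h0": [1, 2], "h1": [3, 4]}
```

For the sequence-form vector `y = [1.0, 0.3, 0.7, 0.15, 0.15]`, `matvec` applied to the rows of $E$ returns the right-hand side `[1.0, 0.0, 0.0]` up to rounding, whereas a vector that violates one of the flow constraints does not.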

D.2 CONNECTION BETWEEN DUALITY GAP AND ITERATE DISTANCE

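Lemma D.6. The constants $c_x, c_y$ defined below satisfy $c_x, c_y > 0$:

$$c_x = \inf_{x \in F_x \setminus \mathcal X^*} \max_{y \in V^*(X^*(x))} \frac{(x - X^*(x))^\top A y}{\|x - X^*(x)\|}, \qquad c_y = \inf_{y \in F_y \setminus \mathcal Y^*} \max_{x \in V^*(Y^*(y))} \frac{x^\top A (Y^*(y) - y)}{\|y - Y^*(y)\|}, \tag{D.17}$$

where

$$F_x = \{x \mid x \in \mathcal X, \; \forall i \in \mathrm{supp}(\mathcal X^*), \; x_i \ge \epsilon_{\mathrm{dil}}\}, \qquad F_y = \{y \mid y \in \mathcal Y, \; \forall i \in \mathrm{supp}(\mathcal Y^*), \; y_i \ge \epsilon_{\mathrm{dil}}\}, \tag{D.18}$$

and $\epsilon_{\mathrm{dil}}$ is a game-dependent constant defined in Lemma D.7.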

Proof. Define the set $\mathcal X' = \{x \mid x \in \mathcal X, \; \|x - X^*(x)\| \ge \epsilon_{\mathrm{dil}}\}$. In the following, we show that we only need to consider $x \in \mathcal X'$ instead of $F_x \setminus \mathcal X^*$. Formally, we prove that for any $x \in F_x \setminus \mathcal X^*$, there is an $x' \in \mathcal X'$ such that, for all $y$,

$$\frac{(x - X^*(x))^\top A y}{\|x - X^*(x)\|} = \frac{(x' - X^*(x'))^\top A y}{\|x' - X^*(x')\|}. \tag{D.19}$$

The claim trivially holds if $x \in \mathcal X'$. Otherwise, let $x' = X^*(x) + \frac{\epsilon_{\mathrm{dil}}}{\|x - X^*(x)\|}(x - X^*(x))$. For any element with $x_i \ge X^*(x)_i \ge 0$, we know that $x'_i \ge 0$. For elements with $X^*(x)_i > x_i \ge 0$, we have $i \in \mathrm{supp}(\mathcal X^*)$, which means that $X^*(x)_i > x_i \ge \epsilon_{\mathrm{dil}}$ since $x \in F_x \setminus \mathcal X^*$. Therefore, we have $x'_i \ge X^*(x)_i - |x_i - X^*(x)_i| \cdot \frac{\epsilon_{\mathrm{dil}}}{\|x - X^*(x)\|} \ge X^*(x)_i - \epsilon_{\mathrm{dil}} \ge 0$. Also, for any $h \in \mathcal H^{\mathcal X}$,

$$\sum_{i \in \Omega_h} x'_i = \frac{\epsilon_{\mathrm{dil}}}{\|x - X^*(x)\|} \sum_{i \in \Omega_h} x_i + \Big(1 - \frac{\epsilon_{\mathrm{dil}}}{\|x - X^*(x)\|}\Big) \sum_{i \in \Omega_h} X^*(x)_i = \frac{\epsilon_{\mathrm{dil}}}{\|x - X^*(x)\|} x_{\sigma(h)} + \Big(1 - \frac{\epsilon_{\mathrm{dil}}}{\|x - X^*(x)\|}\Big) X^*(x)_{\sigma(h)} = x'_{\sigma(h)}. \tag{D.20}$$

Therefore, $x' \in \mathcal X$, and we can conclude that $x' \in \mathcal X'$ since $X^*(x) = X^*(x')$. Moreover, since $x' - X^*(x)$ and $x - X^*(x)$ are parallel and $X^*(x) = X^*(x')$, Eq (D.19) is satisfied. Because $\mathcal X'$ is closed, we can define

$$c'_x = \min_{x \in \mathcal X'} \max_{y \in V^*(X^*(x))} \frac{(x - X^*(x))^\top A y}{\|x - X^*(x)\|}, \qquad c'_y = \min_{y \in \mathcal Y'} \max_{x \in V^*(Y^*(y))} \frac{x^\top A (Y^*(y) - y)}{\|y - Y^*(y)\|}, \tag{D.21}$$

with $c_x \ge c'_x$ and $c_y \ge c'_y$ by the discussion above. Next, we prove that $c'_x, c'_y > 0$. Firstly, we prove that $c'_y \ge 0$. If $c'_y < 0$, then there is some $y$ such that

$$\min_{x \in V^*(Y^*(y))} x^\top A y > \rho, \tag{D.22}$$

which implies that $x^{*\top} A y > \rho$ for any $x^* \in \mathcal X^*$, contradicting the definition of $\mathcal X^*$. If $c'_y = 0$, then for some $y \notin \mathcal Y^*$, $\max_{x \in V^*(Y^*(y))} x^\top A($ And we can prove that $\xi(y^*) \in (0, 2M]$. The lower bound follows directly from Lemma D.3, and the upper bound follows from the assumption on $A$ that $\|A y\|_\infty \le 1$ for all $y \in \mathcal Y$. Let $y' = Y^*(y) + \frac{\xi(Y^*(y))}{2N \cdot M}(y - Y^*(y)) \in \mathcal Y$.
For any pure strategy x ∈ P S X \ P S(y * ), we have x ⊤ Ay ′ =x ⊤ A Y * (y) -x ⊤ A( Y * (y) -y ′ ) ≥x ⊤ A Y * (y) -∥x∥ ∞ • ∥ Y * (y) -y ′ ∥ 1 ≥x ⊤ A Y * (y) - ξ( Y * (y)) M ≥ρ (D.25) where the last inequality comes from the definition of ξ( Y * (y)) in Eq (D.24). For any pure strategy x ∈ P S(y * ), we have x ⊤ Ay ′ =x ⊤ A Y * (y) + ξ( Y * (y)) 2N • M x ⊤ A(y - Y * (y)) ≥x ⊤ A Y * (D.26) Therefore, min x∈X x ⊤ Ay ′ ≥ ρ since any x ∈ X is a linear combination of pure strategies. And it implies that y ′ ̸ ∈ Y * is also a maximin point, contradicting with the definition of Y * . So, c ′ y > 0 and so does c ′ x . And further we have that c x , c y > 0. Lemma D.7. For any t = 1, 2, ..., and i ∈ supp(Z * ), and η ≤ 1 8P , Reg-DOMWU ensures that z t,i ≥ ϵ dil where ϵ dil is some game-dependent constant. Proof. By Lemma C.2, Reg-DOMD satisfies ητ ψ Z (z) -ητ ψ Z (z t ) + ηF (z t ) ⊤ (z t -z) ≤(1 -ητ )D ψ Z (z, z t ) -D ψ Z (z, z t+1 ) -D ψ Z ( z t+1 , z t ) - 7 8 D ψ Z (z t , z t ) + 1 8 D ψ Z ( z t , z t-1 ). (D.27) Pick z = z * such that supp(z * ) = supp(Z * ) (note that such a z * ∈ Z * must exist since Z * is convex). Then, we have ητ ψ Z (z * ) -ητ ψ Z (z t ) ≤ητ ψ Z (z * ) -ητ ψ Z (z t ) + ηF (z t ) ⊤ (z t -z * ) ≤(1 -ητ )D ψ Z (z * , z t ) -D ψ Z (z * , z t+1 ) -D ψ Z ( z t+1 , z t ) - 7 8 D ψ Z (z t , z t ) + 1 8 D ψ Z ( z t , z t-1 ) (D.28) where the first inequality comes from F (z t ) ⊤ (z t -z * ) = x ⊤ t Ay * -x * ⊤ Ay t ≥ 0 by definition of NE. And it further implies that D ψ Z (z * , z t+1 ) + D ψ Z ( z t+1 , z t ) ≤(1 -ητ ) D ψ Z (z * , z t ) + D ψ Z ( z t , z t-1 ) - 1 2 D ψ Z (z t , z t ) + D ψ Z ( z t , z t-1 ) -ητ ψ Z (z * ) + ητ ψ Z (z t ) ≤(1 -ητ ) D ψ Z (z * , z t ) + D ψ Z ( z t , z t-1 ) - 1 2 D ψ Z (z t , z t ) + D ψ Z ( z t , z t-1 ) -ητ ψ Z (z * ) (D.29) when ητ ≤ η ≤ 3 8 . When τ = 0, we have D ψ Z (z * , z t+1 ) ≤ D ψ Z (z * , z 1 ) + D ψ Z ( z 1 , z 0 ) = D ψ Z (z * , z 1 ). 
(D.30) And when τ > 0, we have D ψ Z (z * , z t+1 ) ≤ (1 -ητ ) t D ψ Z (z * , z 1 ) -ψ Z (z * ) ≤ D ψ Z (z * , z 1 ) -ψ Z (z * ). (D.31) Therefore, for any i ∈ supp(Z * ) = supp(z * ), z * i log 1 q t+1,i ≤ j α h(j) z * j log 1 q t+1,j =D ψ Z (z * , z t+1 ) - j α h(j) z * j log q * j ≤D ψ Z (z * , z 1 ) -ψ Z (z * ) - j α h(j) z * j log q * j = j α h(j) z * j log 1 q 1,j -ψ Z (z * ) ≤2P ∥α∥ ∞ log C Ω (D.32) where the last inequality comes from the fact that z 1 is initialized as a uniform strategy. Therefore, q t+1,i ≥ exp -2P ∥α∥ ∞ log C Ω min i∈supp(Z * ) z * i (D.33) for any i ∈ supp(Z * ).

And we further have

$$z_{t+1,i} = z_{t+1,\sigma(h(i))} \cdot q_{t+1,i} = z_{t+1,\sigma(h(\sigma(h(i))))} \cdot q_{t+1,\sigma(h(i))} \cdot q_{t+1,i} = \dots \ge \exp\Big(-\frac{2P^2 \|\alpha\|_\infty \log C_\Omega}{\min_{i \in \mathrm{supp}(\mathcal Z^*)} z^*_i}\Big) =: \epsilon_{\mathrm{dil}} > 0, \tag{D.34}$$

completing the proof.

E PROOF OF LEMMA 4.1

Our regret decomposition framework follows the laminar regret decomposition (Farina et al., 2019b), which generalizes the original counterfactual regret minimization (Zinkevich et al., 2007). The second part of Lemma 4.1, the boundedness of the regret, also appears in (Farina et al., 2019b, Theorem 2). Here we prove it using Lemma E.1, which gives a more concise argument.

Lemma E.1 (First part of Lemma 4.1). The difference satisfies $G^{\mathcal Z}_T(z) = \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} G^h_T(q_h)$ for any $z \in \mathcal Z^\gamma$ and $\gamma \ge 0$.

Proof. We define the scalar subtree value $S^h_t(z)$ recursively:

$$S^h_t(z) := \sum_{i \in \Omega_h} q_i \Big((A y_t)_i + \sum_{h' \in \mathcal H_i} S^{h'}_t(z)\Big) + \tau\alpha_h \psi^\Delta(q_h). \tag{E.1}$$

For terminal nodes, $\mathcal H_i$ is the empty set and thus $S^h_t(z) = \sum_{i \in \Omega_h} q_i (A y_t)_i + \tau\alpha_h \psi^\Delta(q_h)$. By definition, for any $z \in \mathcal Z^\gamma$, we have

$$\begin{aligned}
G^{\mathcal Z}_T(z) &= \sum_{t=1}^T \big(\langle F(z_t), z_t\rangle + \tau\psi^{\mathcal Z}(z_t)\big) - \sum_{t=1}^T \big(\langle F(z_t), z\rangle + \tau\psi^{\mathcal Z}(z)\big) \\
&= \sum_{t=1}^T \sum_{h \in \mathcal H_0} S^h_t(z_t) - \sum_{t=1}^T \sum_{h \in \mathcal H_0} S^h_t(z) = \sum_{h \in \mathcal H_0} \Big(\sum_{t=1}^T S^h_t(z_t) - \sum_{t=1}^T S^h_t(z)\Big),
\end{aligned} \tag{E.2}$$

where $\mathcal H_0 = \{h : h \in \mathcal H^{\mathcal Z}, \sigma(h) = 0\}$ is the set of information sets at the root of the treeplex. Note that $\mathcal Z^\gamma = \mathcal Z^\gamma_{h_1} \times \mathcal Z^\gamma_{h_2} \times \dots \times \mathcal Z^\gamma_{h_m}$, where $\mathcal H_0 = \{h_1, h_2, \dots, h_m\}$. The equality in the second line then follows by expanding the recursive definition of $S^h_t(z)$. We further define $G^h_{T,\mathrm{sub}}(z) := \sum_{t=1}^T S^h_t(z_t) - \sum_{t=1}^T S^h_t(z)$.
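Then,

$$\begin{aligned}
G^h_{T,\mathrm{sub}}(z) &= \sum_{t=1}^T S^h_t(z_t) - \sum_{t=1}^T S^h_t(z) \\
&= \sum_{t=1}^T S^h_t(z_t) - \Big[\sum_{t=1}^T \Big(\sum_{i \in \Omega_h} q_i (A y_t)_i + \tau\alpha_h \psi^\Delta(q_h)\Big) + \sum_{i \in \Omega_h} q_i \sum_{h' \in \mathcal H_i} \sum_{t=1}^T S^{h'}_t(z)\Big] \\
&\overset{(i)}{=} \sum_{t=1}^T S^h_t(z_t) - \Big[\sum_{t=1}^T \Big(\sum_{i \in \Omega_h} q_i (A y_t)_i + \tau\alpha_h \psi^\Delta(q_h)\Big) + \sum_{i \in \Omega_h} q_i \sum_{h' \in \mathcal H_i} \Big(\sum_{t=1}^T S^{h'}_t(z_t) - G^{h'}_{T,\mathrm{sub}}(z)\Big)\Big] \\
&= \sum_{t=1}^T S^h_t(z_t) - \sum_{t=1}^T \Big(\sum_{i \in \Omega_h} q_i \Big((A y_t)_i + \sum_{h' \in \mathcal H_i} S^{h'}_t(z_t)\Big) + \tau\alpha_h \psi^\Delta(q_h)\Big) - \sum_{i \in \Omega_h} q_i \sum_{h' \in \mathcal H_i} \big(-G^{h'}_{T,\mathrm{sub}}(z)\big) \\
&= G^h_T(q_h) + \sum_{i \in \Omega_h} q_i \sum_{h' \in \mathcal H_i} G^{h'}_{T,\mathrm{sub}}(z),
\end{aligned} \tag{E.3}$$

where (i) comes from $\sum_{t=1}^T S^{h'}_t(z) = \sum_{t=1}^T S^{h'}_t(z_t) - G^{h'}_{T,\mathrm{sub}}(z)$. Applying this recursively, we get, for any $z \in \mathcal Z^\gamma$,

$$G^{\mathcal Z}_T(z) = \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} G^h_T(q_h), \tag{E.4}$$

which completes the proof.

Lemma E.2 (Second part of Lemma 4.1). The regret satisfies $R^{\mathcal Z}_T \le \max_{z \in \mathcal Z^\gamma} \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} R^h_T$ for any $\gamma \ge 0$.

Proof. By Lemma E.1, we have

$$R^{\mathcal Z}_T = \max_{z \in \mathcal Z^\gamma} G^{\mathcal Z}_T(z) = \max_{z \in \mathcal Z^\gamma} \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} G^h_T\Big(\frac{z_h}{z_{\sigma(h)}}\Big) \le \max_{z \in \mathcal Z^\gamma} \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} \max_{q_h \in \Delta^\gamma_{|\Omega_h|}} G^h_T(q_h) = \max_{z \in \mathcal Z^\gamma} \sum_{h \in \mathcal H^{\mathcal Z}} z_{\sigma(h)} R^h_T,$$

which completes the proof.

F PROOF OF THEOREM 4.3 AND THEOREM 4.6

F.1 PROOF OF LEMMA F.1

Lemma F.1. For any information set $h \in \mathcal H^{\mathcal Z}$, $q_h \in \Delta^\gamma_{|\Omega_h|}$ and $\tau \le \frac{1}{2\|\alpha\|_\infty}$, Reg-CFR guarantees

$$G^h_T(q_h) \le \lambda^h_{T+1} D_{\psi^\Delta}(q_h, q_{1,h}) + \|V^h(z_{\frac32}) - V^h(z_{\frac12})\|^2 - \alpha_h \tau \sum_{t=2}^T D_{\psi^\Delta}(q_h, q_{t,h}) + \sum_{t=2}^T \Big(\frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{\lambda^h_t} - \frac{\lambda^h_{t-1}}{8}\|q_{t+\frac12,h} - q_{t-\frac12,h}\|^2\Big). \tag{F.1}$$

Proof. By Lemma F.6,

$$\begin{aligned}
G^h_T(q_h) &= \sum_{t=1}^T \Big(\langle V^h(z_{t+\frac12}), q_{t+\frac12,h} - q_h\rangle + \tau\alpha_h \psi^\Delta(q_{t+\frac12,h}) - \tau\alpha_h \psi^\Delta(q_h)\Big) \\
&\le (\lambda^h_1 - \tau\alpha_h) D_{\psi^\Delta}(q_h, q_{1,h}) - \lambda^h_{T+1} D_{\psi^\Delta}(q_h, q_{T+1,h}) + (\lambda^h_{T+1} - \lambda^h_1) D_{\psi^\Delta}(q_h, q_{1,h}) \\
&\quad - (\lambda^h_1 - \tau\alpha_h) D_{\psi^\Delta}(q_{\frac32,h}, q_{1,h}) - \frac{\lambda^h_T}{2} D_{\psi^\Delta}(q_{T+1,h}, q_{T+\frac12,h}) \\
&\quad - \sum_{t=2}^T \Big(\frac{\lambda^h_{t-1}}{2} D_{\psi^\Delta}(q_{t,h}, q_{t-\frac12,h}) + (\lambda^h_t - \tau\alpha_h) D_{\psi^\Delta}(q_{t+\frac12,h}, q_{t,h})\Big) \\
&\quad + \sum_{t=1}^T \Big(\langle V^h(z_{t+\frac12}) - V^h(z_{t-\frac12}), q_{t+\frac12,h} - q_{t+1,h}\rangle - \frac{\lambda^h_t}{2} D_{\psi^\Delta}(q_{t+1,h}, q_{t+\frac12,h})\Big) - \tau\alpha_h \sum_{t=2}^T D_{\psi^\Delta}(q_h, q_{t,h}).
\end{aligned} \tag{F.2}$$

By the strong convexity of $\psi^\Delta$,

$$\|q_{t+\frac12,h} - q_{t-\frac12,h}\|^2 \le 2\|q_{t+\frac12,h} - q_{t,h}\|^2 + 2\|q_{t,h} - q_{t-\frac12,h}\|^2 \le 4 D_{\psi^\Delta}(q_{t+\frac12,h}, q_{t,h}) + 4 D_{\psi^\Delta}(q_{t,h}, q_{t-\frac12,h}). \tag{F.3}$$

Also,

$$\begin{aligned}
&\langle V^h(z_{t+\frac12}) - V^h(z_{t-\frac12}), q_{t+\frac12,h} - q_{t+1,h}\rangle - \frac{\lambda^h_t}{2} D_{\psi^\Delta}(q_{t+1,h}, q_{t+\frac12,h}) \\
&\le \frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{2\lambda^h_t} + \frac{\lambda^h_t}{2}\|q_{t+\frac12,h} - q_{t+1,h}\|^2 - \frac{\lambda^h_t}{2} D_{\psi^\Delta}(q_{t+1,h}, q_{t+\frac12,h}) \\
&\le \frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{2\lambda^h_t} \le \frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{\lambda^h_t},
\end{aligned} \tag{F.4}$$

where the first inequality is by Young's inequality.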

Therefore, with $\tau\alpha_h \le \frac12 \le \frac{\lambda^h_{t-1}}{2}$,

$$\begin{aligned}
G^h_T(q_h) &= \sum_{t=1}^T \Big(\langle V^h(z_{t+\frac12}), q_{t+\frac12,h} - q_h\rangle + \tau\alpha_h \psi^\Delta(q_{t+\frac12,h}) - \tau\alpha_h \psi^\Delta(q_h)\Big) \\
&\le (\lambda^h_{T+1} - \tau\alpha_h) D_{\psi^\Delta}(q_h, q_{1,h}) + \sum_{t=1}^T \frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{\lambda^h_t} - \frac18 \sum_{t=2}^T \lambda^h_{t-1}\|q_{t+\frac12,h} - q_{t-\frac12,h}\|^2 - \tau\alpha_h \sum_{t=2}^T D_{\psi^\Delta}(q_h, q_{t,h}) \\
&\le \lambda^h_{T+1} D_{\psi^\Delta}(q_h, q_{1,h}) + \|V^h(z_{\frac32}) - V^h(z_{\frac12})\|^2 + \sum_{t=2}^T \Big(\frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{\lambda^h_t} - \frac{\lambda^h_{t-1}}{8}\|q_{t+\frac12,h} - q_{t-\frac12,h}\|^2\Big) - \tau\alpha_h \sum_{t=2}^T D_{\psi^\Delta}(q_h, q_{t,h}),
\end{aligned} \tag{F.5}$$

which completes the proof.

For simplicity, we use the constant $M_h$ to denote the maximum value of $D_{\psi^\Delta}(q_h, q_{1,h})$ in information set $h$. $D_{\psi^\Delta}(q_h, q_{1,h})$ is upper-bounded since $q_{1,h}$ is initialized as the uniform distribution in $\Delta_{|\Omega_h|}$.

F.2 PROOF OF THEOREM 4.3

By Lemma E.1, we have $0 \le G^{\mathcal Z}_T(z^{\gamma,*}_\tau) = \sum_{h \in \mathcal H^{\mathcal Z}} z^{\gamma,*}_{\tau,\sigma(h)} G^h_T\big(\frac{z^{\gamma,*}_{\tau,h}}{z^{\gamma,*}_{\tau,\sigma(h)}}\big)$, where the first inequality is by the definition of $z^{\gamma,*}_\tau$. Now by Lemma F.1, taking $q_h = q^{\gamma,*}_{\tau,h} = \frac{z^{\gamma,*}_{\tau,h}}{z^{\gamma,*}_{\tau,\sigma(h)}}$,

$$0 \le \sum_{h \in \mathcal H^{\mathcal Z}} z^{\gamma,*}_{\tau,\sigma(h)} \Big[\lambda^h_{T+1} M_h + \|V^h(z_{\frac32}) - V^h(z_{\frac12})\|^2 + \sum_{t=2}^T \Big(\frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{\lambda^h_t} - \frac{\lambda^h_{t-1}}{8}\|q_{t+\frac12,h} - q_{t-\frac12,h}\|^2\Big) - \tau\alpha_h \sum_{t=2}^T D_{\psi^\Delta}(q^{\gamma,*}_{\tau,h}, q_{t,h})\Big].$$

By rearranging the terms, we have

$$\tau \sum_{t=2}^T D_{\psi^{\mathcal Z}}(z^{\gamma,*}_\tau, z_t) \overset{(i)}{=} \tau \sum_{t=2}^T \sum_{h \in \mathcal H^{\mathcal Z}} \alpha_h z^{\gamma,*}_{\tau,\sigma(h)} D_{\psi^\Delta}(q^{\gamma,*}_{\tau,h}, q_{t,h}) \le C_\gamma, \tag{F.6}$$

where (i) is by the expanded form of the (dilated) Bregman divergence $D_{\psi^{\mathcal Z}}$ (see Lemma F.8 for a detailed proof) and the constant $C_\gamma$ is defined by

$$C_\gamma := \sum_{h \in \mathcal H^{\mathcal Z}} z^{\gamma,*}_{\tau,\sigma(h)} \Big[\lambda^h_{T+1} M_h + \|V^h(z_{\frac32}) - V^h(z_{\frac12})\|^2 + \sum_{t=2}^T \Big(\frac{\|V^h(z_{t+\frac12}) - V^h(z_{t-\frac12})\|^2}{\lambda^h_t} - \frac{\lambda^h_{t-1}}{8}\|q_{t+\frac12,h} - q_{t-\frac12,h}\|^2\Big)\Big]. \tag{F.7}$$

Non-perturbed EFG best-iterate convergence.
To bound the quantity $\lambda^h_{T+1}M_h+\sum_{t=2}^T\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}$ in $C^\gamma$ (the other parts of $C^\gamma$ have already been bounded by a constant), we introduce the following lemma, whose proof is postponed to F.5.

Lemma F.2. Consider the update rule in Eq (4.2). For any $h\in\mathcal H^{\mathcal Z}$, by taking $\kappa=T^{\frac12}$, Reg-CFR satisfies
\[
\lambda^h_{T+1}M_h+\sum_{t=2}^T\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}\le O(T^{\frac14})
\tag{F.8}
\]
where the constant $M_h$ is the maximum value of $D_{\psi^\Delta}(q_h,q_{1,h})$ in information set $h$.

By Lemma F.2, we know that $C^\gamma\le O(T^{\frac14})$, that is,
\[
\tau\sum_{t=2}^T D_{\psi^{\mathcal Z}}(z^*_\tau,z_t)\le O(T^{\frac14}).
\tag{F.9}
\]
Therefore, there exists $t'\in\{2,3,\dots,T\}$ such that
\[
D_{\psi^{\mathcal Z}}(z^*_\tau,z_{t'})\le\frac1\tau O(T^{-\frac34}).
\tag{F.10}
\]
So, $z_{t'}$ converges to $z^*_\tau$ with convergence rate $O(T^{-\frac34})$.

Perturbed EFG asymptotic last-iterate convergence. From the form of the constant $C^\gamma$ in Eq (F.7) and $\lambda^h_{t-1}\ge\kappa\ge1$, we have
\[
C^\gamma\le\sum_{h\in\mathcal H^{\mathcal Z}}z^{\gamma,*}_{\tau,\sigma(h)}\Big[\lambda^h_{T+1}M_h+\|V^h(z_{\frac32})-V^h(z_{\frac12})\|^2+\sum_{t=2}^T\Big(\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}-\frac18\|q_{t+\frac12,h}-q_{t-\frac12,h}\|^2\Big)\Big].
\tag{F.11}
\]
We will prove that $C^\gamma\le O(1)$ when $\gamma>0$. By the Lipschitz property of $V^h(z)$ (see Lemma F.10 for a full proof), we have
\[
\begin{aligned}
\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2
&\le\Big(L_2\sum_{h\in\mathcal H^{\mathcal Z}}\|q_{t+\frac12,h}-q_{t-\frac12,h}\|\Big)^2
\le PL_2^2\sum_{h\in\mathcal H^{\mathcal Z}}\|q_{t+\frac12,h}-q_{t-\frac12,h}\|^2\\
&\le\frac{PL_2^2}{\gamma^P}\sum_{h\in\mathcal H^{\mathcal Z}}z^{\gamma,*}_{\tau,\sigma(h)}\|q_{t+\frac12,h}-q_{t-\frac12,h}\|^2
\end{aligned}
\tag{F.12}
\]
where the last inequality holds because $\frac{z^{\gamma,*}_{\tau,i}}{z^{\gamma,*}_{\tau,\sigma(h(i))}}\ge\gamma$ for any $i$ by the definition of a $\gamma$-perturbed EFG, so that $z^{\gamma,*}_{\tau,i}\ge\gamma^P$. Since $z^{\gamma,*}_{\tau,\sigma(h)}\le1$,
\[
\sum_{h'\in\mathcal H^{\mathcal Z}}z^{\gamma,*}_{\tau,\sigma(h')}\|q_{t+\frac12,h'}-q_{t-\frac12,h'}\|^2\ge\frac{\gamma^P}{PL_2^2}\,z^{\gamma,*}_{\tau,\sigma(h)}\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2
\tag{F.13}
\]
for any $h\in\mathcal H^{\mathcal Z}$.
Plugging inequality (F.13) into Eq (F.11), we have
\[
\begin{aligned}
C^\gamma\le{}&\sum_{h\in\mathcal H^{\mathcal Z}}z^{\gamma,*}_{\tau,\sigma(h)}\|V^h(z_{\frac32})-V^h(z_{\frac12})\|^2\\
&+\sum_{h\in\mathcal H^{\mathcal Z}}z^{\gamma,*}_{\tau,\sigma(h)}\Big[\lambda^h_{T+1}M_h-\frac{\gamma^P}{16P^2L_2^2}\sum_{t=2}^T\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2\Big]\\
&+\sum_{h\in\mathcal H^{\mathcal Z}}z^{\gamma,*}_{\tau,\sigma(h)}\sum_{t=2}^T\Big(\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}-\frac{\gamma^P}{16P^2L_2^2}\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2\Big).
\end{aligned}
\tag{F.14}
\]
As a result, it remains to bound each of the following two quantities by a constant:
\[
\lambda^h_{T+1}M_h-\iota\sum_{t=2}^T\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2,
\tag{F.15}
\]
\[
\sum_{t=2}^T\Big(\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}-\iota\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2\Big),
\tag{F.16}
\]
where we use $\iota:=\frac{\gamma^P}{16P^2L_2^2}$ for convenience. For Eq (F.15), since $\lambda^h_{T+1}=\sqrt{\kappa+\sum_{t=1}^T\delta^h_t}$ where $\delta^h_t=\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2$, we get
\[
M_h\sqrt{\kappa+\sum_{t=1}^T\delta^h_t}-\iota\sum_{t=2}^T\delta^h_t\le M_h\sqrt{\kappa+\delta^h_1}+M_h\sqrt{\sum_{t=2}^T\delta^h_t}-\iota\sum_{t=2}^T\delta^h_t=f_h\Big(\sqrt{\sum_{t=2}^T\delta^h_t}\Big)
\tag{F.17}
\]
where the inequality comes from $\sqrt{a+b}\le\sqrt a+\sqrt b$ and $f_h$ is a quadratic function with a negative coefficient on the quadratic term. Therefore, it is upper-bounded by a constant. As for Eq (F.16), we discuss the two possible cases separately.

When $\lim_{t\to\infty}\lambda^h_t<+\infty$, we have $\sum_{t=1}^\infty\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2<+\infty$ from the definition of $\lambda^h_t$, so that Eq (F.16) is bounded by a constant. When $\lim_{t\to\infty}\lambda^h_t=+\infty$, there must exist $t'=\min_t\{t:1/\lambda^h_t\le\iota\}$. Therefore, Eq (F.16) is bounded by $\sum_{t=1}^{t'}\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2<+\infty$. Therefore,
\[
\sum_{t=2}^T D_{\psi^{\mathcal Z}}(z^{\gamma,*}_\tau,z_t)\le\frac{O(1)}{\tau}
\tag{F.18}
\]
so that $z_t$ converges asymptotically to $z^{\gamma,*}_\tau$.

Proof of Corollary 4.4. By Lemma D.1, we know that when $\tau=\frac{\epsilon}{4C_B}$, we get
\[
\max_{\hat z\in\mathcal Z}F(z_t)^\top(z_t-\hat z)\le O(\epsilon)+2P\,D_{\psi^{\mathcal Z}}(z^*_\tau,z_t).
\tag{F.19}
\]
Using Theorem 4.3, the proof is done.

F.3 PROOF OF THEOREM 4.6

We first state a stronger version of the folklore theorem (Theorem 3; Farina et al., 2019b), which provides guarantees for the average iterate below.

Lemma F.3. For an EFG where $l^{\mathcal X}_t(x)=x^\top Ay_t+\tau\psi^{\mathcal Z}(x)$ and $l^{\mathcal Y}_t(y)=-x_t^\top Ay+\tau\psi^{\mathcal Z}(y)$, the saddle point residual $\max_{\hat z\in\mathcal Z}F(\bar z)^\top(\bar z-\hat z)+\tau\psi(\bar z)-\tau\psi(\hat z)$ of the average strategy $\bar z=\big(\frac1T\sum_{t=1}^Tx_t,\frac1T\sum_{t=1}^Ty_t\big)$ is bounded by $\frac{R^{\mathcal X}+R^{\mathcal Y}}{T}$.

Non-perturbed EFG average-iterate convergence. From Lemma E.2 and Lemma F.1, by taking $z=\operatorname{argmax}_{\hat z\in\mathcal Z}\sum_{h\in\mathcal H^{\mathcal Z}}\hat z_{\sigma(h)}R^h_T$, we have
\[
\begin{aligned}
R^{\mathcal Z}_T&\le\sum_{h\in\mathcal H^{\mathcal Z}}z_{\sigma(h)}R^h_T\\
&\le\sum_{h\in\mathcal H^{\mathcal Z}}z_{\sigma(h)}\Big[\lambda^h_{T+1}M_h+\|V^h(z_{\frac32})-V^h(z_{\frac12})\|^2+\sum_{t=2}^T\Big(\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}-\frac{\lambda^h_{t-1}}{8}\|q_{t+\frac12,h}-q_{t-\frac12,h}\|^2\Big)\Big]\\
&\le O(T^{1/4})
\end{aligned}
\tag{F.20}
\]
where the last inequality is by Lemma F.2 and the constant $M_h$ is the maximum value of $D_{\psi^\Delta}(q_h,q_{1,h})$ in information set $h$. Therefore, by Lemma F.3, the average iterate enjoys an $O(T^{-\frac34})$ convergence rate in terms of duality gap.
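The role of Lemma F.3 can be illustrated in the simplest special case. The following is a minimal sketch, assuming an unregularized ($\tau=0$) matrix game on probability simplices rather than a general EFG; all function names are hypothetical. In that case the duality gap of the averaged strategies equals $(R^{\mathcal X}+R^{\mathcal Y})/T$ exactly, for any sequence of iterates.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(A, y):
    """Return A y (one entry per row of A)."""
    return [dot(row, y) for row in A]


def vec_mat(x, A):
    """Return x^T A (one entry per column of A)."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]


def random_simplex_point(n, rng):
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]


def average(points):
    T = len(points)
    return [sum(p[i] for p in points) / T for i in range(len(points[0]))]


def duality_gap(A, x, y):
    """max_y' x^T A y' - min_x' x'^T A y; the x-player minimizes.

    Both optimizations are linear over a simplex, so a vertex is optimal.
    """
    return max(vec_mat(x, A)) - min(mat_vec(A, y))


def regrets(A, xs, ys):
    """External regrets of both players for losses x^T A y_t and -x_t^T A y."""
    realized = sum(dot(x, mat_vec(A, y)) for x, y in zip(xs, ys))
    sum_Ay = [sum(col) for col in zip(*(mat_vec(A, y) for y in ys))]
    sum_xA = [sum(col) for col in zip(*(vec_mat(x, A) for x in xs))]
    regret_x = realized - min(sum_Ay)   # best fixed row in hindsight
    regret_y = max(sum_xA) - realized   # best fixed column in hindsight
    return regret_x, regret_y


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]  # rock-paper-scissors
    T = 50
    xs = [random_simplex_point(3, rng) for _ in range(T)]
    ys = [random_simplex_point(3, rng) for _ in range(T)]
    rx, ry = regrets(A, xs, ys)
    print(duality_gap(A, average(xs), average(ys)), (rx + ry) / T)
```

The two printed numbers agree up to floating-point error, which is why an $O(T^{1/4})$ regret bound translates into an $O(T^{-3/4})$ bound on the duality gap of the average iterate.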

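Perturbed EFG average-iterate convergence. By taking $q_h=\frac{z_h}{z_{\sigma(h)}}$ where $z=\operatorname{argmax}_{\hat z\in\mathcal Z^\gamma}\sum_{h\in\mathcal H^{\mathcal Z}}\hat z_{\sigma(h)}R^h_T$, from Lemma F.1, we have
\[
\begin{aligned}
R^{\mathcal Z}_T\le\sum_{h\in\mathcal H^{\mathcal Z}}z_{\sigma(h)}R^h_T\le{}&\sum_{h\in\mathcal H^{\mathcal Z}}z_{\sigma(h)}\Big[\lambda^h_{T+1}M_h+\|V^h(z_{\frac32})-V^h(z_{\frac12})\|^2\Big]\\
&+\sum_{h\in\mathcal H^{\mathcal Z}}z_{\sigma(h)}\sum_{t=2}^T\Big(\frac{\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2}{\lambda^h_t}-\frac18\|q_{t+\frac12,h}-q_{t-\frac12,h}\|^2\Big)
\end{aligned}
\tag{F.21}
\]
where the constant $M_h$ is the maximum value of $D_{\psi^\Delta}(q_h,q_{1,h})$ in information set $h$. Following the same analysis as in F.2, we get $R^{\mathcal Z}_T\le O(1)$, which means that the duality gap converges with rate $O(\frac1T)$ by Lemma F.3.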
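The constant bound behind this $O(1)$ regret comes from the quadratic argument around Eq (F.17). The sketch below is a numerical check under stated assumptions: the step size is taken as $\lambda^h_{T+1}=\sqrt{\kappa+\sum_{t\le T}\delta^h_t}$ as in the analysis of Eq (F.15), the values of $\delta_t$, $M_h$ and $\iota$ are arbitrary test inputs, and the function names are hypothetical. Maximizing $f(x)=M\sqrt{\kappa+\delta_1}+Mx-\iota x^2$ over $x\ge0$ gives the closed-form bound $M\sqrt{\kappa+\delta_1}+M^2/(4\iota)$, which does not depend on $T$.

```python
import math
import random


def quantity_f15(deltas, kappa, M, iota):
    """M * lambda_{T+1} - iota * sum_{t>=2} delta_t, with
    lambda_{T+1} = sqrt(kappa + sum_t delta_t) (assumed form of the step size)."""
    lam_last = math.sqrt(kappa + sum(deltas))
    return M * lam_last - iota * sum(deltas[1:])


def constant_bound(delta_1, kappa, M, iota):
    """Maximum over x >= 0 of M*sqrt(kappa + delta_1) + M*x - iota*x**2,
    attained at x = M / (2*iota)."""
    return M * math.sqrt(kappa + delta_1) + M * M / (4.0 * iota)


if __name__ == "__main__":
    rng = random.Random(0)
    kappa, M, iota = 4.0, 2.0, 0.05
    for T in (10, 100, 1000, 10000):
        deltas = [rng.uniform(0.0, 4.0) for _ in range(T)]
        value = quantity_f15(deltas, kappa, M, iota)
        print(T, value <= constant_bound(deltas[0], kappa, M, iota))
```

However long the horizon, the quantity in Eq (F.15) stays below the same constant, because the negative linear term in $\sum_t\delta_t$ eventually dominates the square-root growth of the step size.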

F.4 APPROXIMATE EXTENSIVE-FORM PERFECT EQUILIBRIA

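To illustrate why the NE of $\mathcal Z^\gamma$ for some fixed $\gamma>0$ is a good approximation to the extensive-form perfect equilibria, we propose the following lemma.

Lemma F.4. The approximate NE $z^\gamma$ of a $\gamma$-perturbed EFG is an approximation of the EFPE in terms of duality gap. That is,
\[
\max_{\hat z\in\mathcal Z^0}F(z^\gamma)^\top(z^\gamma-\hat z)\le\max_{\hat z\in\mathcal Z^\gamma}F(z^\gamma)^\top(z^\gamma-\hat z)+\gamma P^2
\tag{F.22}
\]
where $\mathcal Z^0$ is an infinitely small perturbed treeplex whose NE is exactly the EFPE.

Proof. For any $z\in\mathcal Z^0$, we can define $z'\in\mathcal Z^\gamma$ by
\[
\frac{z'_i}{z'_{\sigma(h(i))}}=(1-\gamma|\Omega_{h(i)}|)\frac{z_i}{z_{\sigma(h(i))}}+\gamma.
\tag{F.23}
\]
Then, we use induction to prove that $\|z-z'\|_\infty\le\gamma P$. We write $\mathrm{anc}(i):=\{i,\sigma(h(i)),\sigma(h(\sigma(h(i)))),\dots,i'\}$, where $\sigma(h(i'))=0$, for the set of ancestors of index $i$ in the treeplex. First, for an index $i$ with $\sigma(h(i))=0$, we have
\[
\Big|\prod_{j\in\mathrm{anc}(i)}\big((1-\gamma|\Omega_{h(j)}|)q_j+\gamma\big)-\prod_{j\in\mathrm{anc}(i)}q_j\Big|=\big|-\gamma|\Omega_{h(i)}|q_i+\gamma\big|\le\gamma|\Omega_{h(i)}|.
\tag{F.24}
\]
Then, assume that we have already proved
\[
\Big|\prod_{j\in\mathrm{anc}(\sigma(h(i)))}\big((1-\gamma|\Omega_{h(j)}|)q_j+\gamma\big)-\prod_{j\in\mathrm{anc}(\sigma(h(i)))}q_j\Big|\le\gamma C_{\sigma(h(i))}
\]
for an index $i$, where $C_{\sigma(h(i))}=\sum_{j\in\mathrm{anc}(\sigma(h(i)))}|\Omega_{h(j)}|$. Then
\[
\begin{aligned}
&\prod_{j\in\mathrm{anc}(i)}\big((1-\gamma|\Omega_{h(j)}|)q_j+\gamma\big)-\prod_{j\in\mathrm{anc}(i)}q_j\\
&\quad\ge\big((1-\gamma|\Omega_{h(i)}|)q_i+\gamma\big)\Big(\prod_{j\in\mathrm{anc}(\sigma(h(i)))}q_j-\gamma C_{\sigma(h(i))}\Big)-\prod_{j\in\mathrm{anc}(i)}q_j\\
&\quad=-\gamma|\Omega_{h(i)}|\prod_{j\in\mathrm{anc}(i)}q_j+\gamma\prod_{j\in\mathrm{anc}(\sigma(h(i)))}q_j-\gamma C_{\sigma(h(i))}\big((1-\gamma|\Omega_{h(i)}|)q_i+\gamma\big)\\
&\quad\ge-\gamma\big(|\Omega_{h(i)}|+C_{\sigma(h(i))}\big)
\end{aligned}
\]
and similarly, we have the upper bound $\gamma(1+C_{\sigma(h(i))})$. Therefore, we have $\|z-z'\|_\infty\le\gamma P$.

Therefore, for $y=\operatorname{argmax}_{\hat y\in\mathcal Y^0}x^{\gamma\top}A\hat y$, where $z^\gamma=(x^\gamma,y^\gamma)$ is an approximate NE in a $\gamma$-perturbed EFG, and $y'\in\mathcal Y^\gamma$ defined from $y$ via Eq (F.23), we have
\[
x^{\gamma\top}Ay=x^{\gamma\top}A\big(y'+(y-y')\big)=x^{\gamma\top}Ay'+x^{\gamma\top}A(y-y')\le\max_{\hat y\in\mathcal Y^\gamma}x^{\gamma\top}A\hat y+\|A^\top x^\gamma\|_1\cdot\|y-y'\|_\infty
\tag{F.25}
\]
which implies that
\[
\begin{aligned}
\max_{\hat z\in\mathcal Z^0}F(z^\gamma)^\top(z^\gamma-\hat z)&\le\max_{\hat z\in\mathcal Z^\gamma}F(z^\gamma)^\top(z^\gamma-\hat z)+\|A^\top x^\gamma\|_1\cdot\|y-y'\|_\infty+\|Ay^\gamma\|_1\cdot\|x-x'\|_\infty\\
&\le\max_{\hat z\in\mathcal Z^\gamma}F(z^\gamma)^\top(z^\gamma-\hat z)+\gamma P^2
\end{aligned}
\]
where the last inequality comes from $\|F(z)\|_\infty\le1$ for any $z\in\mathcal Z$.

We first prove some standard results in DS-OptMD (Hsieh et al., 2021) when adding regularization.

Lemma F.5.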
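For any convex set $C$ and $u_0,u\in C$, consider the update rule
\[
u_1=\operatorname{argmin}_{\hat u_1\in C}\big\{\langle\hat u_1,g+\tau\nabla\psi^C(u)\rangle+\lambda_1D_{\psi^C}(\hat u_1,u)+(\lambda_2-\lambda_1)D_{\psi^C}(\hat u_1,u_0)\big\}
\]
where $\psi^C$ is a strongly convex function in $C$. Then for any $u_2\in C$,
\[
\begin{aligned}
\tau\psi^C(u_1)-\tau\psi^C(u_2)+\langle g,u_1-u_2\rangle\le{}&\lambda_1\Big(\big(1-\tfrac{\tau}{\lambda_1}\big)D_{\psi^C}(u_2,u)-D_{\psi^C}(u_2,u_1)-\big(1-\tfrac{\tau}{\lambda_1}\big)D_{\psi^C}(u_1,u)\Big)\\
&+(\lambda_2-\lambda_1)\big(D_{\psi^C}(u_2,u_0)-D_{\psi^C}(u_2,u_1)-D_{\psi^C}(u_1,u_0)\big).
\end{aligned}
\tag{F.26}
\]
Proof. Since
\[
u_1=\operatorname{argmin}_{\hat u_1\in C}\Big\{\big\langle g-\lambda_1\big(1-\tfrac{\tau}{\lambda_1}\big)\nabla\psi^C(u)-(\lambda_2-\lambda_1)\nabla\psi^C(u_0),\,\hat u_1\big\rangle+\lambda_2\psi^C(\hat u_1)\Big\},
\tag{F.27}
\]
by the first-order optimality condition,
\[
\Big(g+\lambda_2\nabla\psi^C(u_1)-\lambda_1\big(1-\tfrac{\tau}{\lambda_1}\big)\nabla\psi^C(u)-(\lambda_2-\lambda_1)\nabla\psi^C(u_0)\Big)^\top(u_2-u_1)\ge0.
\tag{F.28}
\]
Notice that
\[
\begin{aligned}
&\lambda_1\Big(\big(1-\tfrac{\tau}{\lambda_1}\big)D_{\psi^C}(u_2,u)-D_{\psi^C}(u_2,u_1)-\big(1-\tfrac{\tau}{\lambda_1}\big)D_{\psi^C}(u_1,u)\Big)\\
&\quad=\lambda_1\big\langle\nabla\psi^C(u_1)-\big(1-\tfrac{\tau}{\lambda_1}\big)\nabla\psi^C(u),\,u_2-u_1\big\rangle-\tau\psi^C(u_2)+\tau\psi^C(u_1),
\end{aligned}
\tag{F.29}
\]
and
\[
(\lambda_2-\lambda_1)\big(D_{\psi^C}(u_2,u_0)-D_{\psi^C}(u_2,u_1)-D_{\psi^C}(u_1,u_0)\big)=(\lambda_2-\lambda_1)\big\langle\nabla\psi^C(u_1)-\nabla\psi^C(u_0),\,u_2-u_1\big\rangle.
\tag{F.30}
\]
Summing them up,
\[
\begin{aligned}
&\lambda_1\Big(\big(1-\tfrac{\tau}{\lambda_1}\big)D_{\psi^C}(u_2,u)-D_{\psi^C}(u_2,u_1)-\big(1-\tfrac{\tau}{\lambda_1}\big)D_{\psi^C}(u_1,u)\Big)+(\lambda_2-\lambda_1)\big(D_{\psi^C}(u_2,u_0)-D_{\psi^C}(u_2,u_1)-D_{\psi^C}(u_1,u_0)\big)\\
&\quad=\big\langle\lambda_2\nabla\psi^C(u_1)-\lambda_1\big(1-\tfrac{\tau}{\lambda_1}\big)\nabla\psi^C(u)-(\lambda_2-\lambda_1)\nabla\psi^C(u_0),\,u_2-u_1\big\rangle-\tau\psi^C(u_2)+\tau\psi^C(u_1)\\
&\quad\ge\langle g,u_1-u_2\rangle-\tau\psi^C(u_2)+\tau\psi^C(u_1)
\end{aligned}
\tag{F.31}
\]
where the last inequality comes from Eq (F.28).

Lemma F.6. Consider the update rule Eq (4.2). For any information set $h\in\mathcal H^{\mathcal Z}$, $q_h\in\Delta^\gamma_{|\Omega_h|}$ and $t=1,2,\dots,T$, we have
\[
\begin{aligned}
&\tau\alpha_h\psi^\Delta(q_{t+\frac12,h})-\tau\alpha_h\psi^\Delta(q_h)+\big\langle V^h(z_{t+\frac12}),\,q_{t+\frac12,h}-q_h\big\rangle\\
&\quad\le(\lambda^h_t-\tau\alpha_h)D_{\psi^\Delta}(q_h,q_{t,h})-\lambda^h_{t+1}D_{\psi^\Delta}(q_h,q_{t+1,h})+(\lambda^h_{t+1}-\lambda^h_t)D_{\psi^\Delta}(q_h,q_{1,h})\\
&\qquad+\big\langle V^h(z_{t+\frac12})-V^h(z_{t-\frac12}),\,q_{t+\frac12,h}-q_{t+1,h}\big\rangle-\lambda^h_tD_{\psi^\Delta}(q_{t+1,h},q_{t+\frac12,h})-(\lambda^h_t-\tau\alpha_h)D_{\psi^\Delta}(q_{t+\frac12,h},q_{t,h}).
\end{aligned}
\]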
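(F.32)

Since $\psi^\Delta$ is 1-strongly convex with respect to the 2-norm, we have
\[
\begin{aligned}
\psi^\Delta(q_{t-\frac12,h})-\psi^\Delta(q_{t,h})&\ge\big\langle\nabla\psi^\Delta(q_{t,h}),\,q_{t-\frac12,h}-q_{t,h}\big\rangle+\tfrac12\|q_{t-\frac12,h}-q_{t,h}\|^2,\\
\psi^\Delta(q_{t,h})-\psi^\Delta(q_{t-\frac12,h})&\ge\big\langle\nabla\psi^\Delta(q_{t-\frac12,h}),\,q_{t,h}-q_{t-\frac12,h}\big\rangle+\tfrac12\|q_{t-\frac12,h}-q_{t,h}\|^2.
\end{aligned}
\tag{F.38}
\]
Adding them up, we get
\[
\big\langle\nabla\psi^\Delta(q_{t-\frac12,h})-\nabla\psi^\Delta(q_{t,h}),\,q_{t-\frac12,h}-q_{t,h}\big\rangle\ge\|q_{t-\frac12,h}-q_{t,h}\|^2.
\tag{F.39}
\]
Therefore,
\[
\begin{aligned}
&\lambda^h_{t-1}\|q_{t-\frac12,h}-q_{t,h}\|^2+(\lambda^h_t-\lambda^h_{t-1})\big\langle\nabla\psi^\Delta(q_{1,h})-\nabla\psi^\Delta(q_{t,h}),\,q_{t-\frac12,h}-q_{t,h}\big\rangle\\
&\quad\le\big\langle\lambda^h_{t-1}\nabla\psi^\Delta(q_{t-\frac12,h})-\lambda^h_t\nabla\psi^\Delta(q_{t,h})+(\lambda^h_t-\lambda^h_{t-1})\nabla\psi^\Delta(q_{1,h}),\,q_{t-\frac12,h}-q_{t,h}\big\rangle\\
&\quad\le\big\langle V^h(z_{t-\frac12})-V^h(z_{t-\frac32}),\,q_{t-\frac12,h}-q_{t,h}\big\rangle\\
&\quad\le\|V^h(z_{t-\frac12})-V^h(z_{t-\frac32})\|\cdot\|q_{t-\frac12,h}-q_{t,h}\|.
\end{aligned}
\tag{F.40}
\]
And by definition,
\[
\lambda^h_{t-1}\le\lambda^h_t=\sqrt{(\lambda^h_{t-1})^2+\|V^h(z_{t-\frac12})-V^h(z_{t-\frac32})\|^2}\le\lambda^h_{t-1}+\|V^h(z_{t-\frac12})-V^h(z_{t-\frac32})\|,
\tag{F.41}
\]
so that
\[
\lambda^h_{t-1}\|q_{t-\frac12,h}-q_{t,h}\|^2\le\big(\|\nabla\psi^\Delta(q_{1,h})-\nabla\psi^\Delta(q_{t,h})\|+1\big)\cdot\|V^h(z_{t-\frac12})-V^h(z_{t-\frac32})\|\cdot\|q_{t-\frac12,h}-q_{t,h}\|
\tag{F.42}
\]
which implies that
\[
\|q_{t-\frac12,h}-q_{t,h}\|\le\frac{O(1)}{\lambda^h_{t-1}},
\tag{F.43}
\]
since $\nabla\psi^\Delta$ is bounded by a constant when $\psi^\Delta$ is the Euclidean norm, and $\|V^h(z_{t-\frac12})-V^h(z_{t-\frac32})\|$ is also bounded by a constant since both the regularizer and $\|F(z)\|_\infty$ are bounded. At the same time, directly from the update rule Eq (4.2),
\[
\big\langle V^h(z_{t-\frac12})+\tau\nabla\psi^\Delta(q_{t,h}),\,q_{t+\frac12,h}\big\rangle+\lambda^h_tD_{\psi^\Delta}(q_{t+\frac12,h},q_{t,h})\le\big\langle V^h(z_{t-\frac12})+\tau\nabla\psi^\Delta(q_{t,h}),\,q_{t,h}\big\rangle
\tag{F.44}
\]
which implies that
\[
\begin{aligned}
\frac{\lambda^h_t}{2}\|q_{t+\frac12,h}-q_{t,h}\|^2\le\lambda^h_tD_{\psi^\Delta}(q_{t+\frac12,h},q_{t,h})&\le\big\langle V^h(z_{t-\frac12})+\tau\nabla\psi^\Delta(q_{t,h}),\,q_{t,h}-q_{t+\frac12,h}\big\rangle\\
&\le\|V^h(z_{t-\frac12})+\tau\nabla\psi^\Delta(q_{t,h})\|\cdot\|q_{t,h}-q_{t+\frac12,h}\|\\
&\le O(1)\,\|q_{t,h}-q_{t+\frac12,h}\|.
\end{aligned}
\tag{F.45}
\]
Hence, we have $\|q_{t+\frac12,h}-q_{t,h}\|\le\frac{O(1)}{\lambda^h_t}$.

Proof of Lemma F.2. By Lemma F.10,
\[
\|V^h(z_{t+\frac12})-V^h(z_{t-\frac12})\|^2\le\Big(L_2\sum_{h\in\mathcal H^{\mathcal Z}}\|q_{t+\frac12,h}-q_{t-\frac12,h}\|\Big)^2\le\frac{P^2L_2^2C_1^2}{(\lambda^h_{t-1})^2}
\]
where the last inequality is by Lemma F.7 and $\|q_{t+\frac12,h}-q_{t-\frac12,h}\|\le\|q_{t+\frac12,h}-q_{t,h}\|+\|q_{t,h}-q_{t-\frac12,h}\|\le\frac{C_1}{\lambda^h_{t-1}}$.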
Lipschitz continuity of $V^h(z)$. Here we show that $V^h(z)$ is Lipschitz continuous with respect to $q$. We first show that $z$ is Lipschitz continuous with respect to $q$.

Lemma F.9. In a $\gamma$-perturbed treeplex $\mathcal Z^\gamma$ with $\gamma\ge0$, for any $z,z'\in\mathcal Z^\gamma$, we have
\[
\|z-z'\|\le L_1\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|
\tag{F.51}
\]
for some game-dependent constant $L_1$.

Proof. We first consider the base case when $\mathcal Z^\gamma$ is a $\gamma$-perturbed simplex, where Eq (F.51) is satisfied with $L_1=1$ since $q=z$. We then consider the two basic operators, Cartesian product and branching, in the definition of a treeplex (see Definition 2.1 for details). We want to prove that both of them preserve smoothness, that is, Eq (F.51) remains satisfied after applying the operation to multiple treeplexes for which Eq (F.51) is satisfied.

First, for the Cartesian product, if Eq (F.51) is satisfied for $\mathcal Z^\gamma_1,\mathcal Z^\gamma_2,\dots,\mathcal Z^\gamma_m$, then for any $z=(z_1,z_2,\dots,z_m)$, $z'=(z'_1,z'_2,\dots,z'_m)\in\mathcal Z^\gamma=\mathcal Z^\gamma_1\times\mathcal Z^\gamma_2\times\dots\times\mathcal Z^\gamma_m$, we have
\[
\|z-z'\|\le\sum_{i=1}^m\|z_i-z'_i\|\le L_1\sum_{i=1}^m\sum_{h\in\mathcal H^{\mathcal Z_i}}\|q_h-q'_h\|=L_1\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|.
\tag{F.52}
\]
Notice that we abuse the notation $\mathcal H^{\mathcal Z_i}$ and $\mathcal H^{\mathcal Z}$ here to denote the sets of all information sets in $\mathcal Z^\gamma_i$ and $\mathcal Z^\gamma$.

For convenience, define the branching of $m$ $\gamma$-perturbed treeplexes $\mathcal Z^\gamma_1,\mathcal Z^\gamma_2,\dots,\mathcal Z^\gamma_m$ and a $\gamma$-perturbed simplex $\Delta^\gamma_m$ to be $\mathcal Z^\gamma=\{(p,p_1z_1,p_2z_2,\dots,p_mz_m):p\in\Delta^\gamma_m,\ z_i\in\mathcal Z^\gamma_i\}$. It is easy to see that this is equivalent to using the original branching operator $m$ times in a bottom-up manner.

Suppose that for any $z_i,z'_i\in\mathcal Z^\gamma_i$, $\|z_i-z'_i\|\le L_i\sum_{h\in\mathcal H^{\mathcal Z_i}}\|q_{i,h}-q'_{i,h}\|$. For $z=(p,p_1z_1,p_2z_2,\dots,p_mz_m)$, $z'=(p',p'_1z'_1,p'_2z'_2,\dots,p'_mz'_m)\in\mathcal Z^\gamma$,
\[
\|z-z'\|\le\sum_{i=1}^m\|p_iz_i-p'_iz'_i\|+\|p-p'\|.
\tag{F.53}
\]
Moreover,
\[
\begin{aligned}
\|p_iz_i-p'_iz'_i\|&=\|(p_iz_i-p_iz'_i)+(p_iz'_i-p'_iz'_i)\|\\
&\le\|p_i(z_i-z'_i)\|+\|(p_i-p'_i)z'_i\|\\
&=p_i\|z_i-z'_i\|+|p_i-p'_i|\cdot\|z'_i\|\\
&\le\|z_i-z'_i\|+|\mathcal Z^\gamma_i|\cdot\|p-p'\|
\end{aligned}
\tag{F.54}
\]
where the last inequality is by $\mathcal Z^\gamma_i\subset\mathbb R^{|\mathcal Z^\gamma_i|}$. Therefore,
\[
\begin{aligned}
\|(p,p_1z_1,\dots,p_mz_m)-(p',p'_1z'_1,\dots,p'_mz'_m)\|
&\le\sum_{i=1}^m\|z_i-z'_i\|+\Big(\sum_{i=1}^m|\mathcal Z^\gamma_i|+1\Big)\|p-p'\|\\
&\le\sum_{i=1}^mL_i\sum_{h\in\mathcal H^{\mathcal Z_i}}\|q_{i,h}-q'_{i,h}\|+(P+1)\|p-p'\|\\
&\le\max\{L_1,L_2,\dots,L_m,P+1\}\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|
\end{aligned}
\tag{F.55}
\]
where the third line is by the inductive assumption and the fourth line is by the definition of $\mathcal H^{\mathcal Z}$ and $q$. Therefore, recursively applying Eq (F.52) and Eq (F.55), we have for any $z,z'\in\mathcal Z^\gamma$
\[
\|z-z'\|\le L_1\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|
\tag{F.56}
\]
where we take $L_1=P+1$. Finally, since both operators preserve smoothness, by induction, Eq (F.51) is satisfied for any treeplex.

Now we can prove that $V^h(z)$ is Lipschitz continuous with respect to $q$.

Lemma F.10. When $\psi^\Delta$ is $L_p$-Lipschitz continuous, that is, $|\psi^\Delta(x)-\psi^\Delta(x')|\le L_p\|x-x'\|$ for any $x,x'\in\Delta^\gamma$, and $|\psi^\Delta(x)|$ is upper-bounded by a constant $C^\Delta_B$ for any $x\in\Delta^\gamma$, then for any $z,z'\in\mathcal Z^\gamma$ and $h\in\mathcal H$, we have
\[
\|V^h(z)-V^h(z')\|\le L_2\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|
\tag{F.57}
\]
where $L_2$ is a game-dependent constant.

Proof. Here we consider $h\in\mathcal H^{\mathcal X}$; the case $h\in\mathcal H^{\mathcal Y}$ can be addressed similarly.
\[
\begin{aligned}
\|V^h(z)-V^h(z')\|&\le\sum_{i\in\Omega_h}\Big(|(A(y-y'))_i|+\sum_{h'\in\mathcal H_i}|W^{h'}(z)-W^{h'}(z')|\Big)\\
&\le\sum_{i\in\Omega_h}\Big(\|y-y'\|_1+\sum_{h'\in\mathcal H_i}|W^{h'}(z)-W^{h'}(z')|\Big)\\
&\le\sum_{i\in\Omega_h}\Big(\sqrt P\,\|y-y'\|+\sum_{h'\in\mathcal H_i}|W^{h'}(z)-W^{h'}(z')|\Big)
\end{aligned}
\tag{F.58}
\]
where the second inequality holds because each entry of $A$ is in $[-1,1]$ and the last inequality is by $\|y-y'\|_1\le\sqrt P\,\|y-y'\|$.
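\[
\begin{aligned}
|W^h(z)-W^h(z')|&=\big|\big(\langle q_h,V^h(z)\rangle+\tau\alpha_h\psi^\Delta(q_h)\big)-\big(\langle q'_h,V^h(z')\rangle+\tau\alpha_h\psi^\Delta(q'_h)\big)\big|\\
&\le\big|\langle q_h,V^h(z)\rangle-\langle q'_h,V^h(z')\rangle\big|+\tau\alpha_h\big|\psi^\Delta(q_h)-\psi^\Delta(q'_h)\big|\\
&\le\big|\langle q_h,V^h(z)-V^h(z')\rangle\big|+\big|\langle q_h-q'_h,V^h(z')\rangle\big|+\tau\|\alpha\|_\infty L_p\|q_h-q'_h\|\\
&\overset{(i)}{\le}\|q_h\|_1\cdot\|V^h(z)-V^h(z')\|_\infty+\|q_h-q'_h\|\cdot\|V^h(z')\|+\tau\|\alpha\|_\infty L_p\|q_h-q'_h\|\\
&\overset{(ii)}{\le}\sum_{i\in\Omega_h}|(A(y-y'))_i|+\sum_{i\in\Omega_h}\sum_{h'\in\mathcal H_i}|W^{h'}(z)-W^{h'}(z')|\\
&\qquad+P(1+\tau\|\alpha\|_\infty C^\Delta_B)\|q_h-q'_h\|+\tau\|\alpha\|_\infty L_p\|q_h-q'_h\|.
\end{aligned}
\tag{F.59}
\]
Here (i) is by Hölder's inequality, and (ii) is by $\|q_h\|_1=1$ and $\|V^h(z)\|\le\|Ay\|_1+P\tau\|\alpha\|_\infty C^\Delta_B\le P(1+\tau\|\alpha\|_\infty C^\Delta_B)$. By recursively applying this inequality, we have
\[
|W^h(z)-W^h(z')|\le\|A(y-y')\|_1+P(1+\tau\|\alpha\|_\infty C^\Delta_B)\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|+\tau\|\alpha\|_\infty L_p\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|.
\tag{F.60}
\]
Notice that
\[
\|A(y-y')\|_1\le P\|y-y'\|_1\le P^2\|y-y'\|\le L_1P^2\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|
\tag{F.61}
\]
where the first inequality holds because $A\in[-1,1]^{M\times N}$ and the third inequality is by Lemma F.9. Therefore,
\[
\begin{aligned}
|W^h(z)-W^h(z')|&\le P^2L_1\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|+P(1+\tau\|\alpha\|_\infty C^\Delta_B)\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|+\tau\|\alpha\|_\infty L_p\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|\\
&=L_3\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|
\end{aligned}
\tag{F.62}
\]
where $L_3=P^2L_1+P(1+\tau\|\alpha\|_\infty C^\Delta_B)+\tau\|\alpha\|_\infty L_p$. Going back to Eq (F.58), we have
\[
\|V^h(z)-V^h(z')\|\le C_\Omega\cdot P\|z-z'\|+P\cdot L_3\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|\le P(L_1C_\Omega+L_3)\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|=L_2\sum_{h\in\mathcal H^{\mathcal Z}}\|q_h-q'_h\|,
\tag{F.63}
\]
where the second inequality comes from Lemma F.9.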
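The Lipschitz bound of Lemma F.9 can be sanity-checked numerically. The sketch below is illustrative only: it assumes a hypothetical three-information-set treeplex (a root simplex with one child simplex under each of its two actions, so $P=6$), samples behavioral strategies $q$ whose entries are all at least $\gamma$, and compares $\|z-z'\|$ with $(P+1)\sum_h\|q_h-q'_h\|$. The tree layout and all names are assumptions made for the example.

```python
import math
import random

# Hypothetical toy treeplex: each information set lists its sequence indices
# and the index of its parent sequence (None for the root). Parents come first.
INFOSETS = [
    {"seqs": [0, 1], "parent": None},
    {"seqs": [2, 3], "parent": 0},
    {"seqs": [4, 5], "parent": 1},
]
P = 6  # total number of sequences


def random_behavior(rng, gamma=0.0):
    """Sample one distribution q_h per information set, every entry >= gamma."""
    q = []
    for h in INFOSETS:
        n = len(h["seqs"])
        w = [rng.random() + 1e-9 for _ in range(n)]
        s = sum(w)
        q.append([gamma + (1.0 - gamma * n) * v / s for v in w])
    return q


def to_sequence_form(q):
    """z_i = q_i * z_{parent sequence}, with the empty sequence having mass 1."""
    z = [0.0] * P
    for h, q_h in zip(INFOSETS, q):
        parent_mass = 1.0 if h["parent"] is None else z[h["parent"]]
        for i, prob in zip(h["seqs"], q_h):
            z[i] = parent_mass * prob
    return z


def norm2(u):
    return math.sqrt(sum(a * a for a in u))


def lipschitz_sides(q, q_prime):
    """Return (||z - z'||, (P + 1) * sum_h ||q_h - q'_h||)."""
    z, z_prime = to_sequence_form(q), to_sequence_form(q_prime)
    lhs = norm2([a - b for a, b in zip(z, z_prime)])
    rhs = (P + 1) * sum(
        norm2([a - b for a, b in zip(u, v)]) for u, v in zip(q, q_prime)
    )
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(0)
    worst_ratio = 0.0
    for _ in range(1000):
        lhs, rhs = lipschitz_sides(random_behavior(rng, 0.05), random_behavior(rng, 0.05))
        worst_ratio = max(worst_ratio, lhs / rhs)
    print(worst_ratio <= 1.0)
```

On such a small tree the constant $P+1$ is far from tight; the check only confirms that the inequality of Eq (F.56) holds on the sampled pairs.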

G CONCLUSIONS AND FUTURE WORK

In this paper, we investigate regularization, a technique widely used in reinforcement learning and optimization, for solving EFGs. First, we prove that Reg-DOMD achieves the first last-iterate convergence rate to the NE without the unique-NE assumption, for dilated OMD-type algorithms with constant stepsizes, in terms of both the duality gap and the distance to the set of NE. We further prove that, by solving the regularized problem, CFR with Reg-DS-OptMD as the regret minimizer, which we call Reg-CFR, achieves best-iterate convergence in finding NE and asymptotic last-iterate convergence in finding approximate extensive-form perfect equilibria. These results constitute the first last-iterate convergence results for CFR-type algorithms. Furthermore, we have shown empirically that, for CFR and CFR+, solving the regularized problem yields better last-iterate performance, further demonstrating the power of regularization in solving EFGs. We leave the study of its explicit convergence rate for future work.



A recent result (Anagnostides et al., 2022, Theorem 3.4) also gave a best-iterate convergence result with rate, but only an asymptotic convergence result for the last iterate. In fact, we found that just taking the last iterate is good enough empirically. For this part, see Figure and Figure 5. In fact, here we only require that the regularizer be the entropy for Lemma D.7 to hold.



Figure 3: The last-iterate convergence result in Kuhn Poker (left) and Leduc Poker (right). CFR (Zinkevich et al., 2007) and CFR+ (Tammelin et al., 2015) are tested as baselines. We can see that the last-iterate performance of Reg-DOMWU and Reg-DOGDA is much better than that of their versions with τ = 0.

Figure 5: The last-iterate convergence results of CFR and CFR+, in Kuhn Poker (left) and Leduc Poker (right). We can see that with regularization, the last iterate produced by CFR and CFR+ significantly outperforms the original version without regularization.

Figure 7: The duality gap of average iterate in Kuhn Poker (left) and Leduc Poker (right).

the upper-bound of the regularizer ψ Z . It would be P ∥α∥ ∞ log C Ω for entropy regularizer and P ∥α∥∞ CΩ for Euclidean regularizer, where C Ω = max h∈H Z |Ω h |.

X denote all pure strategies of x. If P S(y * ) = P S X , then V * ( Y * (y)) = X . Eq (D.23) implies that min x∈X x ⊤ Ay = ρ so that y ∈ Y * . But this contradicts with the definition that y ̸ ∈ Y * . If P S(y * ) ̸ = P S X , we define ξ(y * ) = min x∈P S X \P S(y * ) {x ⊤ Ay * -ρ}. (D.24)

PROPERTIES OF Reg-DS-OptMD (4.2)

h ∥. For z = (p, p 1 z 1 , p 2 z 2 + ..., p m z m ), z ′ = (p ′ , p ′ 1 z ′ 1 , p ′ 2 z ′ 2 , ..., p ′ m z ′ m ) ∈ Z γ , ∥z -z ′ ∥ ≤ m i=1 ∥p i z i -p ′ i z ′ i ∥ + ∥p -p ′ ∥ (F.53)

The full game tree of Kuhn Poker. The yellow nodes belong to the player who moves first and the purple nodes belong to the other player. The blue node is the chance node, which deals the private cards to each player. The first line is the private card of the player moving first and the second line is that of the other player. The game trees under different private-card combinations are the same, so we only plot the case where the first-moving player gets the Jack and the second-moving player gets the Queen.


ACKNOWLEDGEMENT

T.Y. was supported by NSF CCF-2112665 (TILOS AI Research Institute). A.O. and K.Z. were supported by MIT-DSTA grant 031017-00016. K.Z. also acknowledges support from the Simons-Berkeley Research Fellowship. The authors also thank Suvrit Sra for the valuable discussions.


Proof. Plug u 1 = q t+ 1 2 ,h , u 2 = q t+1,h , g = V h (z t-1 2 ), ψ C = α h ψ ∆ into Lemma C.4, τ α h ψ ∆ (q t+ 1 2 ,h ) -τ α h ψ ∆ (q t+1,h ) + V h (z t-1 2 ), q t+ 1 2 ,h -q t+1,h ≤λ h t (1 -τ α h λ h t )D ψ ∆ (q t+1,h , q t,h ) -D ψ ∆ (q t+1,h , q t+ 1 2 ,h ) -(1 -τ α h λ h t )D ψ ∆ (q t+ 1 2 ,h , q t,h ) . (F.33)(F.34)By summing Eq (F.33) and Eq (F.34) up, then addingBy the two lemmas above, we can prove that the update of Reg-DS-OptMD (4.2) is stable. Lemma F.7 (Stability of Reg-DS-OptMD). For any t = 1, 2, ..., when ψ ∆ is Euclidean norm, Reg-CFR satisfies thatfor some constant C 1 .Proof. Consider the update rule Eq (4.2), by first-order optimality, for any h ∈ H Z , we haveAdd them up,Then, by letting κ = T 1 2 , we havewhich completes the proof.

F.6 AUXILIARY LEMMAS FOR Reg-CFR

In this section, we prove some auxiliary lemmas for Reg-CFR. We begin with the expanding form of the Bregman divergence generated by the dilated Euclidean norm. Lemma F.8. When ψ ∆ (q) = 1 2 i q 2 i , we haveProof. Firstly, we can write ψ Z in the formwhere q i = zi z σ(h(i)) . And by the definition of Bregman divergence, we havez 1,σ(h(i)) (q 2 1,i -2q 1,i q 2,i ) + z 1,σ(h(i))z 1,σ(h(i)) (q 1,i -q 2,i ) 2 .(F.50)Similarly, we will get i∈Ω h z 2,i (q 2,i -2q 2,i + k∈Ω h(i) q 2 2,k ) = 0. Therefore,

