SOLVING CONSTRAINED VARIATIONAL INEQUALITIES VIA A FIRST-ORDER INTERIOR POINT-BASED METHOD

Abstract

We develop an interior-point approach to solve constrained variational inequality (cVI) problems. Inspired by the efficacy of the alternating direction method of multipliers (ADMM) in the single-objective context, we generalize ADMM to derive a first-order method for cVIs, which we refer to as the ADMM-based interior-point method for constrained VIs (ACVI). We provide convergence guarantees for ACVI in two general classes of problems: (i) when the operator is ξ-monotone, and (ii) when it is monotone, some constraints are active, and the game is not purely rotational. When the operator is, in addition, L-Lipschitz in the latter case, we match known lower bounds on rates for the gap function of O(1/√K) and O(1/K) for the last and average iterate, respectively. To the best of our knowledge, this is the first presentation of a first-order interior-point method for the general cVI problem that has a global convergence guarantee. Moreover, unlike previous work in this setting, ACVI provides a means to solve cVIs when the constraints are nontrivial. Empirical analyses demonstrate clear advantages of ACVI over common first-order methods. In particular, (i) cyclical behavior is notably reduced as our methods approach the solution from the analytic center, and (ii) unlike projection-based methods that zigzag when near a constraint, ACVI efficiently handles the constraints.

1. INTRODUCTION

Figure 1 (caption, partially recovered): The constraints are depicted with dashed lines and the iterates with circles. ACVI gets close to the Nash equilibrium (⋆) in a single step, whereas EG zigzags when hitting a constraint. The remaining commonly used methods (GDA, OGDA, and LA-GDA) perform similarly to EG; see App. E.

We are interested in the constrained variational inequality problem (Stampacchia, 1964): find $x^\star \in \mathcal{X}$ s.t. $\langle x - x^\star, F(x^\star) \rangle \geq 0, \ \forall x \in \mathcal{X}$, (cVI) where $\mathcal{X}$ is a subset of the Euclidean $n$-dimensional space $\mathbb{R}^n$, and where $F : \mathcal{X} \to \mathbb{R}^n$ is a continuous map. Finding (an element of) the solution set $\mathcal{S}^\star_{\mathcal{X},F}$ of cVI is a key problem in multiple fields such as economics and game theory. More pertinent to machine learning, cVIs generalize standard single-objective optimization, complementarity problems (Cottle & Dantzig, 1968), zero-sum games (von Neumann & Morgenstern, 1947; Rockafellar, 1970), and multi-player games. For example, solving cVI is the optimization problem underlying reinforcement learning (e.g., Omidshafiei et al., 2017) and generative adversarial networks (Goodfellow et al., 2014). Moreover, even when training one set of parameters with one loss $f$, that is, $F(x) \equiv \nabla_x f(x)$, a natural way to improve the model's robustness in some regard is to introduce an adversary to perturb the objective or the input, or to consider the worst sample distribution of the empirical objective. As has been noted in many problem domains, including robust classification (Mazuelas et al., 2020), adversarial training (Szegedy et al., 2014), causal inference (Christiansen et al., 2020), and robust objectives (e.g., Rothenhäusler et al., 2018), this leads to a min-max structure, which is an instance of the cVI problem. To see this, consider two sets of parameters (agents), $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$, that share a loss/utility function, $f : \mathcal{X}_1 \times \mathcal{X}_2 \to \mathbb{R}$, which the first agent aims to minimize and the second agent aims to maximize.
Then the problem is to find a saddle point of $f$, i.e., a point $(x_1^\star, x_2^\star)$ such that $f(x_1^\star, x_2) \leq f(x_1^\star, x_2^\star) \leq f(x_1, x_2^\star)$. This corresponds to a cVI with $F(x) \equiv [\nabla_{x_1} f(x_1, x_2), \ -\nabla_{x_2} f(x_1, x_2)]^\top$. Solving cVIs is significantly more challenging than single-objective optimization problems, due to the fact that $F$ is a general vector field, leading to "rotational" trajectories in parameter space (App. A). In response, the development of efficient algorithms with provable convergence has recently been the focus of interest in machine learning and optimization, particularly in the unconstrained setting, where $\mathcal{X} \equiv \mathbb{R}^n$ (e.g., Tseng, 1995; Daskalakis et al., 2018; Mokhtari et al., 2019; 2020; Golowich et al., 2020b; Azizian et al., 2020; Chavdarova et al., 2021a; Gorbunov et al., 2022; Bot et al., 2022). In many applications, however, we have constraints on (part of) the decision variable $x$, that is, $\mathcal{X}$ is often a strict subset of $\mathbb{R}^n$. As an example, let us revisit the aforementioned distributionally robust prediction problem: consider a linear setting (cf. Eq. 1 in Rothenhäusler et al., 2018) and a class of parametrized distributions $\triangle \equiv \{w \in \mathbb{R}^d \mid w \geq 0, \ e^\top w = 1\}$, where $e \in \mathbb{R}^d$ is a vector of all ones. Thus, the robust problem is: $\min_{x \in \mathbb{R}^n} \max_{w \in \mathbb{R}^d} w^\top (y - Dx)$, subject to $w \geq 0$, $e^\top w = 1$, where $D \in \mathbb{R}^{d \times n}$ contains $d$ samples of an $n$-dimensional covariate vector, and $y \in \mathbb{R}^d$ is the vector of target variables (the constraint $w \leq 1$ is implied). This illustrates that given a standard minimization problem, its robustification immediately leads to an instance of the cVI problem; see further examples in § 5. Additional example applications include (i) machine learning applications in business, finance, and economics, where often the sum of the decision variables (representing, for example, resources) cannot exceed a specific value, (ii) contract theory (e.g. 
§2.3.2 in (Bates et al., 2022) where one player is the parameters of a probability distribution as above), and (iii) solving optimal control problems numerically, among others. Significantly fewer works address the convergence of first-order optimization methods in the constrained setting; see § 2 for an overview. Recently, Cai et al. (2022) established a convergence rate for the projected extragradient method (Korpelevich, 1976) , when F is monotone and Lipschitz (see § 3 for definitions). However, (i) the proof that the authors presented is computer-assisted, which makes it hard to interpret and of limited usefulness for inspiring novel (e.g., accelerated) methods, and (ii) the considered setting assumes the projection is fast to compute and thus ignores the projection in the rate. The latter assumption only holds in rare cases when the constraints are relatively simple so that operations such as clipping suffice. However, when the inequality and/or equality constraints are of a general form, each EG update requires two projections (see App. A.4). Each projection requires solving a new/separate constrained optimization problem, which if given general constraints implies the need for a second-order method as explained next. Interior point (IP) methods are the de facto family of iterative algorithms for constrained optimization. These methods enjoy well-established guarantees and theoretical understanding in the context of single-objective optimization [see, e.g., Boyd & Vandenberghe (2004 ), Ch.11, Megiddo (1989) , Wright (1997) ], and have extensions to a wide range of problem settings (e.g., Tseng, 1993; Nesterov & Nemirovski, 1994; Nesterov & Todd, 1998; Renegar, 2001; Wright, 2001) . They build on a natural idea of solving a simplified homotopic problem that makes it possible to "smoothly" transition to the original complex problem; see § 3.1. 
Several works extend IP methods to cVI by applying the second-order Newton method to a modified Karush-Kuhn-Tucker (KKT) system appropriate for the cVI (Ralph & Wright, 2000; Qi & Sun, 2002; Fan & Yan, 2010; Monteiro & Pang, 1996; Chen et al., 1998). Many of these approaches, however, rely on strong assumptions; see § 2. Moreover, although these methods enjoy fast convergence in terms of the number of iterations, each iteration involves the computation of the Jacobian of F (or the Hessian when F ≡ ∇f(x)), which quickly becomes prohibitive for large dimension n of x. Hence first-order methods are preferred in practice. We are currently missing a first-order optimization method for solving cVI with general constraints. Accordingly, in this paper, we focus on the following open question: Can we derive first-order algorithms for the cVI problem that (i) can be applied when general constraints are given, and that (ii) have global convergence guarantees? In this paper, we develop precisely such a method. To mitigate the computational burden of second-derivative computation, we replace the Newton step of the traditional IP methods with the alternating direction method of multipliers (ADMM). ADMM was designed with a different purpose: it is applicable only when the objective is separable into two or more different functions whose arguments are not disjoint (see § 3.1 for a full description), and it can be seen as equivalent to Douglas-Rachford operator splitting (Douglas & Rachford, 1956) applied in the dual space (see, e.g., Lin et al., 2022). ADMM owes its popularity primarily to its computational efficiency (Boyd et al., 2011) for large-scale machine learning problems and its fast convergence in some machine-learning settings (e.g., Nishihara et al., 2015).
The core idea of our approach is to reformulate the original cVI problem with equality and inequality constraints via the KKT conditions, so as to apply ADMM in such a way that the subproblems of the resulting algorithm have desirable properties (see § 4.1). That is, by generalizing the technique underlying ADMM, we derive a novel first-order algorithm for solving monotone VIs with very general constraints. Furthermore, this framework can be used to design novel algorithms for solving cVIs; see App. C. Our contributions can be summarized as follows:
• Based on the KKT system for the constrained VI problem and the ADMM technique, we derive an algorithm that we refer to as the ADMM-based Interior Point Method for Constrained VIs (ACVI); see § 4.1 and Algorithm 1.
• We prove the global convergence of ACVI given two sets of assumptions: (i) when F is ξ-monotone, and (ii) when it is monotone, the constraints are active at the solution, and the game is not purely rotational. By further assuming F is a Lipschitz operator, we upper bound the rate of decrease of the gap function and we match the known lower bound for the gap function of O(1/√K) for the last iterate; see § 4.2.
• Empirically, we document two notable advantages of ACVI over popular projection-based saddle-point methods: (i) the ACVI iterates exhibit significantly reduced rotations, as they approach the solution from the analytic center, and (ii) while projection-based methods show extensive zigzagging when hitting a constraint, ACVI avoids this, resulting in more efficient updates; see § 5.
Our convergence guarantees are parameter-free, meaning they do not require a priori knowledge of the constants of the problem (such as the Lipschitz constant), and, interestingly, the convergence guarantee does not require that F is Lipschitz. 
This assumption is solely used to express the rate of decrease of the gap function, in contrast to the extragradient method (Korpelevich, 1976), where such an assumption is necessary to show convergence. To the best of our knowledge, the proposed ACVI method is the first first-order IP algorithm for VIs with a global convergence proof.

2. RELATED WORK

Unconstrained VIs: methods and guarantees. Apart from the standard gradient descent ascent (GDA) method, among the most commonly used methods for VI optimization are the extragradient method (EG, Korpelevich, 1976), optimistic GDA (OGDA, Popov, 1980), and the lookahead method (LA, Zhang et al., 2019; Chavdarova et al., 2021b); see App. A for a full description. In contrast to gradient fields (as in a single-objective setting), when $F$ is a general vector field, the last iterate can be far from the solution even though the average iterate converges to it (Daskalakis et al., 2018; Chavdarova et al., 2019). This is problematic since it implies that the average convergence guarantee is weaker in the sense that it may not extend to more general setups where we can no longer rely on the convexity of $\mathcal{X}$. Golowich et al. (2020b; a) provided a last-iterate lower bound of $O(1/(p\sqrt{K}))$ for the broad class of $p$-stationary canonical linear iterative ($p$-SCLI) first-order methods (Arjevani et al., 2016). An extensive line of further work has provided guarantees for the last iterate for other problem classes. For the general monotone VI (MVI) class, the following $p$-SCLI methods come with guarantees that match the lower bound: (i) Golowich et al. (2020b) obtained a rate in terms of the gap function relying on first- and second-order smoothness of $F$, and Gorbunov et al. (2022) obtained a rate of $O(1/K)$ in terms of reducing the squared norm of the operator relying on first-order smoothness of $F$ (Assumption 1), using a computer-assisted proof, and (ii) Golowich et al. (2020b) and Chavdarova et al. (2021a) provided the best iterate rate for OGDA.
Constrained zero-sum and VI classes of problems. Gidel et al. (2017b) extended the Frank-Wolfe method (Frank & Wolfe, 1956; Jaggi, 2013; Lacoste-Julien & Jaggi, 2015), also known as the conditional gradient (V. F. Demyanov, 1970), to solve a subclass of cVI, specifically constrained zero-sum problems. 
This extension was carried out under a strong convex-concavity assumption and also under the assumption that the constraint set is strongly convex; that is, it has sublevel sets that are strongly convex functions (Vial, 1983) . Daskalakis & Panageas (2019) provided an asymptotic proof for the last iterate for zero-sum convex-concave constrained problems for the optimistic multiplicative weights update (OMWU) method. Wei et al. (2021) focused on OGDA and OMWU in the constrained setting and provided convergence rates for bilinear games over the simplex. In her seminal work, Korpelevich (1976) proposed the classical (projected) extragradient method (EG)-see App. A-and proved its convergence for monotone (c)VIs with an L-Lipschitz operator, and Cai et al. (2022) established a rate with respect to the gap function using a computer-aided proof. Tseng (1995) built on (Pang, 1987) and provided a linear convergence rate for EG in the setting of strongly monotone F , whereas Malitsky (2015) focused on the same constrained setting but on the projected reflected gradient method. Diakonikolas (2020) obtained parameter-free guarantee for Halpern iteration (Halpern, 1967) for cocoercive operators. Goffin et al. (1997) described a second-order cutting-plane method for solving pseudomonotone VIs with linear inequalities. Interior point (IP) methods in single-objective and VI settings. Traditionally IP methods primarily express the inequality constraints by augmenting the objective with a log-barrier penalty (see § 3.1), and then use Newton's method to solve the subproblem (Boyd & Vandenberghe, 2004) . The latter involves computing either the inverse of a large matrix or a Cholesky decomposition and yet it can be highly efficient in low dimensions as it requires only a few iterations to converge. When the dimensionality of the variable is large, however, the computation becomes infeasible. Among other IP variants that address this issue, Lin et al. 
(2018) replaced the Newton step with the ADMM method, which is known to be highly scalable in terms of the dimension (Boyd et al., 2011) . In the context of cVIs, a few works apply IP methods, mostly Newton-based (e.g., Nesterov & Nemirovski, 1994, Chapter 7 ). Monteiro & Pang (1996) analyze path-following IP methods for complementarity problems, which are a subclass of cVI, using local homeomorphic maps. Chen et al. (1998) provided a superlinear global convergence rate of the smoothing Newton method when F is semi-smooth for box constrained VIs. Similarly, Qi & Sun (2002) ; Qi et al. (2000) focused on the smoothing Newton method and provided the rate for the outer loop. Ralph & Wright (2000) showed superlinear convergence for MVI problems under inequality constraints, under the following set of assumptions: (i) existence of a strictly complementary solution, (ii) full rank of the Jacobian of the active constraints at the solution, and (iii) twice differentiable constraints. They provided a local convergence rate. Fan & Yan (2010) considered inequality constraints and proposed a second-order Newton-based method that has global convergence guarantees under certain conditions. A rate was not provided.

3. PRELIMINARIES

Notation. Bold small and capital letters denote vectors and matrices, respectively. Sets are denoted with curly capital letters, e.g., $\mathcal{S}$. The Euclidean norm of $v$ is denoted by $\|v\|$, and the inner product in Euclidean space by $\langle \cdot, \cdot \rangle$. With $\odot$ we denote the element-wise product. We let $[n]$ denote $\{1, \dots, n\}$ and let $e$ denote the vector of all ones. We write $x \perp y$ to denote that $x$ and $y$ are perpendicular. In the remainder of the paper, we consider a general setting in which the constraint set $\mathcal{C} \subseteq \mathcal{X}$ is defined as an intersection of finitely many inequalities and linear equalities: $\mathcal{C} = \{x \in \mathbb{R}^n \mid \varphi_i(x) \leq 0, \ i \in [m], \ Cx = d\}$, (CS) where each $\varphi_i : \mathbb{R}^n \to \mathbb{R}$, $C \in \mathbb{R}^{p \times n}$, $d \in \mathbb{R}^p$, and we assume $\mathrm{rank}(C) = p$. For brevity, with $\varphi$ we denote the concatenated $\varphi_i(\cdot)$, $i \in [m]$, and in the remainder of the paper, each $\varphi_i \in C^1(\mathbb{R}^n)$, $i \in [m]$, is convex. For convenience we denote: $\mathcal{C}_\leq \triangleq \{x \in \mathbb{R}^n \mid \varphi(x) \leq 0\}$, $\mathcal{C}_< \triangleq \{x \in \mathbb{R}^n \mid \varphi(x) < 0\}$, and $\mathcal{C}_= \triangleq \{y \in \mathbb{R}^n \mid Cy = d\}$; thus the relative interior of $\mathcal{C}$ is $\mathrm{int}\, \mathcal{C} \triangleq \mathcal{C}_< \cap \mathcal{C}_=$, and we assume that $\mathrm{int}\, \mathcal{C} \neq \emptyset$ and that $\mathcal{C}$ is compact. In the following, we list the definitions and assumptions we refer to later on.
Definition 1 ((strong/ξ) monotonicity). An operator $F : \mathcal{X} \supseteq \mathcal{S} \to \mathbb{R}^n$ is monotone on $\mathcal{S}$ if: $\langle x - x', F(x) - F(x') \rangle \geq 0, \ \forall x, x' \in \mathcal{S}$. $F$ is said to be ξ-monotone on $\mathcal{S}$ iff there exist $c > 0$ and $\xi > 1$ such that $\langle x - x', F(x) - F(x') \rangle \geq c \|x - x'\|^\xi$, for all $x, x' \in \mathcal{S}$. Finally, $F$ is µ-strongly monotone on $\mathcal{S}$ if there exists $\mu > 0$ such that $\langle x - x', F(x) - F(x') \rangle \geq \mu \|x - x'\|^2$, for all $x, x' \in \mathcal{S}$. Moreover, we say that an operator $F$ is star-monotone, star-ξ-monotone, or star-strongly-monotone (on $\mathcal{S}$) if the respective definition holds for $x' \equiv x^\star$, where $x^\star \in \mathcal{S}^\star_{\mathcal{S},F}$. Note that the "star-" definitions are weaker than their respective non-star counterparts. The above definition holds similarly for unconstrained VIs, by setting $\mathcal{S} \equiv \mathbb{R}^n$. 
The analog for cVI of the function values used as a performance measure for convergence rates in convex optimization is the gap function (a.k.a., the optimality gap or primal gap), defined next. Definition 2 (gap function). Given a candidate point x ′ ∈ X and a map F : X ⊇ S → R n where S is compact, the gap function G : R n → R is defined as G(x ′ , S) ≜ max x∈S ⟨F (x ′ ), x ′ -x⟩ . Note that the gap function requires S to be compact in order to be defined (as otherwise, it can be infinite). We will rely on the following assumption to express our rates in terms of the gap function. Assumption 1 (first-order smoothness). Let F : X ⊇ S → R n be an operator, we say that F satisfies L-first-order smoothness on S, or L-smoothness, if F is an L-Lipschitz map; that is, there exists L > 0 such that ∥F (x) -F (x ′ )∥ ≤ L ∥x -x ′ ∥, for all x, x ′ ∈ S. As an informal summary, a solution existence guarantee follows when X is compact; see Chapter 2.2 of (Facchinei & Pang, 2003) , and App. A.2.
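To make Definition 2 concrete, the following minimal sketch evaluates the gap function when S is the probability simplex. This choice of S, the toy operator F(x) = x, and the helper name `gap_on_simplex` are illustrative assumptions, not part of the paper. On the simplex the inner maximization has a closed form, because a linear function attains its minimum at a vertex.

```python
import numpy as np

def gap_on_simplex(F, x):
    """Gap function G(x, S) = max_{z in S} <F(x), x - z> for S = probability simplex.

    Hypothetical helper for illustration only.
    """
    Fx = np.asarray(F(x), dtype=float)
    # min_{z in simplex} <F(x), z> is attained at a vertex, i.e. it equals min_i F(x)_i.
    return float(Fx @ x - Fx.min())

# Toy monotone, 1-Lipschitz operator (assumption): F(x) = x, the gradient of 0.5 * ||x||^2.
F = lambda x: x

uniform = np.full(4, 0.25)                # solves the VI on the simplex for this F
vertex = np.array([1.0, 0.0, 0.0, 0.0])   # feasible, but not a solution

gap_uniform = gap_on_simplex(F, uniform)
gap_vertex = gap_on_simplex(F, vertex)
```

In this toy instance the gap vanishes at the uniform point and is strictly positive at the vertex, which is the sense in which G serves as a performance measure for candidate points in S.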

3.1. RELEVANT PATH-FOLLOWING INTERIOR-POINT METHODS AND ADMM

In this section, we overview the interior-point approach to single-objective optimization, focusing on aspects that are most relevant to our proposed method. Consider the following problem: $\min_x f(x)$ s.t. $\varphi(x) \leq 0$ and $Cx = d$, (cCVX) where $f, \varphi_i : \mathbb{R}^n \to \mathbb{R}$ are convex and continuously differentiable, $x \in \mathbb{R}^n$, $C \in \mathbb{R}^{p \times n}$, and $d \in \mathbb{R}^p$. IP methods solve problem (cCVX) by reducing it to a sequence of linear equality-constrained problems via a logarithmic barrier (see, e.g., Boyd & Vandenberghe, 2004, Chapter 11): $\min_x f(x) - \mu \sum_{i=1}^m \log(-\varphi_i(x))$ s.t. $Cx = d$, with $\mu > 0$. (l-cCVX) Assume that (l-cCVX) has a solution for each $\mu > 0$, and let $x_\mu$ denote the solution of (l-cCVX) for a given $\mu$. The central path of (l-cCVX) is defined as the set of points $x_\mu$, $\mu > 0$. Note that $x_\mu \in \mathbb{R}^n$ is a strictly feasible point of (cCVX), as it satisfies $\varphi(x_\mu) < 0$ and $Cx_\mu = d$.
Alternating direction method of multipliers (ADMM). ADMM (Glowinski & Marroco, 1975; Gabay & Mercier, 1976; Lions & Mercier, 1979; Glowinski & Le Tallec, 1989) is a gradient-based algorithm for convex optimization problems that splits the objective into subproblems, each of which is easier to solve. Its popularity is due to its computational scalability (Boyd et al., 2011). Consider a problem of the following form: $\min_{x,y} f(x) + g(y)$ s.t. $Ax + By = b$, (ADMM-Pr) where $f, g : \mathbb{R}^n \to \mathbb{R}$ are convex, $x, y \in \mathbb{R}^n$, $A, B \in \mathbb{R}^{n' \times n}$, and $b \in \mathbb{R}^{n'}$. The augmented Lagrangian function, $L_\beta(\cdot)$, of the (ADMM-Pr) problem is: $L_\beta(x, y, \lambda) = f(x) + g(y) + \langle Ax + By - b, \lambda \rangle + \frac{\beta}{2} \|Ax + By - b\|^2$, (AL-CVX) where $\beta > 0$ is referred to as the penalty parameter. If the augmented Lagrangian method is used to solve (ADMM-Pr), at each step $k$ we have: $x_{k+1}, y_{k+1} = \operatorname{argmin}_{x,y} L_\beta(x, y, \lambda_k)$ and $\lambda_{k+1} = \lambda_k + \beta(Ax_{k+1} + By_{k+1} - b)$, where the latter step is gradient ascent on the dual. 
In contrast, ADMM updates $x$ and $y$ in an alternating way as follows: $x_{k+1} = \operatorname{argmin}_x L_\beta(x, y_k, \lambda_k)$, $y_{k+1} = \operatorname{argmin}_y L_\beta(x_{k+1}, y, \lambda_k)$, $\lambda_{k+1} = \lambda_k + \beta(Ax_{k+1} + By_{k+1} - b)$. (ADMM)

4. ACVI: FIRST-ORDER ADMM-BASED IP METHOD FOR CONSTRAINED VIS
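
As a reminder of the (ADMM) updates from § 3.1 that the derivation below builds on, here is a minimal sketch on a toy instance of (ADMM-Pr). The instance is an assumption chosen for illustration: f(x) = ½‖x − a‖², g is the indicator of the nonnegative orthant (so it is extended-valued), A = I, B = −I, and b = 0. Its solution is the projection max(a, 0), which makes the result easy to check. The function name `admm_nonneg_projection` is hypothetical.

```python
import numpy as np

def admm_nonneg_projection(a, beta=1.0, iters=200):
    """ADMM for: min_x,y 0.5*||x - a||^2 + indicator(y >= 0)  s.t.  x - y = 0.

    Illustrative toy problem (assumption), with closed-form subproblems.
    """
    a = np.asarray(a, dtype=float)
    y = np.zeros_like(a)
    lam = np.zeros_like(a)
    for _ in range(iters):
        # x-step: minimize 0.5*||x - a||^2 + <lam, x> + (beta/2)*||x - y||^2 over x.
        x = (a - lam + beta * y) / (1.0 + beta)
        # y-step: minimize indicator(y >= 0) - <lam, y> + (beta/2)*||x - y||^2 over y,
        # which is the projection of x + lam/beta onto the nonnegative orthant.
        y = np.maximum(x + lam / beta, 0.0)
        # Dual step: gradient ascent on the multiplier of the constraint x - y = 0.
        lam = lam + beta * (x - y)
    return x, y, lam

a = np.array([1.0, -2.0, 3.0])
x_admm, y_admm, lam_admm = admm_nonneg_projection(a)
```

Each subproblem touches only one block of variables, which is the property ACVI exploits: one block absorbs the equality constraints and the other absorbs the log-barrier term.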

4.1. DERIVING THE ACVI ALGORITHM

In this section, we derive an interior-point method for the cVI problem that we refer to as ACVI (ADMM-based interior-point method for constrained VIs). We first restate the cVI problem in a form that will allow us to derive an interior-point procedure. By the definition of cVI it follows (see §1.3 in Facchinei & Pang, 2003) that:
$$x \in \mathcal{S}^\star_{\mathcal{C},F} \ \Leftrightarrow \ \begin{cases} w = x \\ x = \operatorname{argmin}_z F(w)^\top z \\ \text{s.t.} \ \varphi(z) \leq 0, \ Cz = d \end{cases} \ \Leftrightarrow \ \begin{cases} F(x) + \nabla\varphi^\top(x)\lambda + C^\top \nu = 0 \\ Cx = d \\ 0 \leq \lambda \perp \varphi(x) \leq 0, \end{cases}$$ (KKT)
where $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$ are dual variables, and $\perp$ denotes perpendicularity. Recall that we assume that $\mathrm{int}\, \mathcal{C} \neq \emptyset$; thus, by the Slater condition (using the fact that the $\varphi_i$, $i \in [m]$, are convex) and the KKT conditions, the second equivalence holds, yielding the KKT system of cVI. Note that the above equivalence also guarantees the two solutions coincide; see Facchinei & Pang (2003, Prop. 1.3.4 (b)). Analogous to the method described in § 3.1, we add a log-barrier term to the objective to remove the inequality constraints and obtain the following modified version of (KKT):
$$\begin{cases} w = x \\ x = \operatorname{argmin}_z F(w)^\top z - \mu \sum_{i=1}^m \log(-\varphi_i(z)) \\ \text{s.t.} \ Cz = d \end{cases} \ \Leftrightarrow \ \begin{cases} F(x) + \nabla\varphi^\top(x)\lambda + C^\top \nu = 0 \\ \lambda \odot \varphi(x) + \mu e = 0 \\ Cx - d = 0 \\ \varphi(x) < 0, \ \lambda > 0, \end{cases}$$ (KKT-2)
with $\mu > 0$, $e \triangleq [1, \dots, 1]^\top \in \mathbb{R}^m$. Again, the equivalence holds by the KKT and the Slater condition. We derive the update rule at step $k$ via the following subproblem: $\min_x F(w_k)^\top x - \mu \sum_{i=1}^m \log(-\varphi_i(x))$, s.t. $Cx = d$, where we fix $w = w_k$. Directly projecting on the equality constraint may cause the vectors to fall out of the domain of the log term. On the other hand, (i) $w_k$ is a constant vector in this subproblem, and (ii) the objective is split, making ADMM a natural choice to solve the subproblem. Hence, we introduce a new variable $y \in \mathbb{R}^n$, yielding:
$$\min_{x,y} \ F(w_k)^\top x + \mathbb{1}[Cx = d] - \mu \sum_{i=1}^m \log(-\varphi_i(y)) \quad \text{s.t.} \ x = y, \qquad \mathbb{1}[Cx = d] \triangleq \begin{cases} 0, & \text{if } Cx = d \\ +\infty, & \text{if } Cx \neq d. \end{cases}$$ (1)
Note that $\mathbb{1}[Cx = d]$ is a generalized real-valued convex function of $x$. 
We introduce the following: $P_c \triangleq I - C^\top (CC^\top)^{-1} C$, ($P_c$) and $d_c \triangleq C^\top (CC^\top)^{-1} d$, ($d_c$-EQ) where $P_c \in \mathbb{R}^{n \times n}$ and $d_c \in \mathbb{R}^n$. The augmented Lagrangian of (1) is thus:
$$L_\beta(x, y, \lambda) = F(w_k)^\top x + \mathbb{1}[Cx = d] - \mu \sum_{i=1}^m \log(-\varphi_i(y)) + \langle \lambda, x - y \rangle + \frac{\beta}{2} \|x - y\|^2,$$ (AL)
where $\beta > 0$ is the penalty parameter. Finally, using ADMM, we have the following update rule for $x$ at step $k$:
$$x_{k+1} = \operatorname{argmin}_{x \in \mathcal{C}_=} L_\beta(x, y_k, \lambda_k) = \operatorname{argmin}_{x \in \mathcal{C}_=} \frac{\beta}{2} \Big\| x - y_k + \frac{1}{\beta}\big(F(w_k) + \lambda_k\big) \Big\|^2.$$ (2)
This yields the following update for $x$:
$$x_{k+1} = P_c \Big( y_k - \frac{1}{\beta}\big(F(w_k) + \lambda_k\big) \Big) + d_c.$$ (X-EQ)
For $y$ and the dual variable $\lambda$, we have:
$$y_{k+1} = \operatorname{argmin}_y L_\beta(x_{k+1}, y, \lambda_k) = \operatorname{argmin}_y \ -\mu \sum_{i=1}^m \log(-\varphi_i(y)) + \frac{\beta}{2} \Big\| y - x_{k+1} - \frac{1}{\beta}\lambda_k \Big\|^2,$$ (Y-EQ)
$$\lambda_{k+1} = \lambda_k + \beta(x_{k+1} - y_{k+1}).$$ (λ-EQ)
Next, we derive the update rule for $w$. We set $w_k$ to be the solution of the following equation:
$$w + \frac{1}{\beta} P_c F(w) - P_c y_k + \frac{1}{\beta} P_c \lambda_k - d_c = 0.$$ (W-EQ)
The following theorem ensures the solution of (W-EQ) exists and is unique; see App. B.1 for the proof.
Theorem 1 (W-EQ: solution uniqueness). If $F$ is monotone on $\mathcal{C}_=$, the following statements hold true for the solution of (W-EQ): (i) it always exists, (ii) it is unique, and (iii) it is contained in $\mathcal{C}_=$.
Remark 1. Note that when there are no equality constraints, $\mathcal{C}_=$ becomes the entire space $\mathbb{R}^n$. Further notice that $w_k = x_{k+1}$; thus it is redundant to state it in the algorithm, and we remove $w$. We summarize the full algorithm as Algorithm 1. For problems such as affine or low-dimensional VIs, or optimization over the probability simplex, (W-EQ) can be solved analytically, so that step 8 is fast to compute. For problems where (W-EQ) is cumbersome to solve analytically (e.g., in GANs) one could use optimization methods for the unconstrained case, e.g., EG and GDA, among others. See App. B.5 for further discussion. In the remaining discussion, where clear from context, we drop the superscript from the iterate $x_k^{(t)}$.
Algorithm 1 ACVI pseudocode.
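1: Input: operator $F : \mathcal{X} \to \mathbb{R}^n$, constraints $Cx = d$ and $\varphi_i(x) \leq 0$, $i \in [m]$, hyperparameters $\mu_{-1}, \beta > 0$, $\delta \in (0, 1)$, number of outer and inner loop iterations $T$ and $K$, resp.
2: Initialize: $y_0^{(0)} \in \mathbb{R}^n$, $\lambda_0^{(0)} \in \mathbb{R}^n$
3: $P_c \triangleq I - C^\top (CC^\top)^{-1} C$, where $P_c \in \mathbb{R}^{n \times n}$
4: $d_c \triangleq C^\top (CC^\top)^{-1} d$, where $d_c \in \mathbb{R}^n$
5: for $t = 0, \dots, T-1$ do
6:     $\mu_t = \delta \mu_{t-1}$
7:     for $k = 0, \dots, K-1$ do
8:         Set $x_{k+1}^{(t)}$ to be the solution of: $x + \frac{1}{\beta} P_c F(x) - P_c y_k^{(t)} + \frac{1}{\beta} P_c \lambda_k^{(t)} - d_c = 0$ (w.r.t. $x$)
9:         $y_{k+1}^{(t)} = \operatorname{argmin}_y \ -\mu_t \sum_{i=1}^m \log(-\varphi_i(y)) + \frac{\beta}{2} \big\| y - x_{k+1}^{(t)} - \frac{1}{\beta} \lambda_k^{(t)} \big\|^2$
10:         $\lambda_{k+1}^{(t)} = \lambda_k^{(t)} + \beta (x_{k+1}^{(t)} - y_{k+1}^{(t)})$
11:     end for
12:     $(y_0^{(t+1)}, \lambda_0^{(t+1)}) \triangleq (y_K^{(t)}, \lambda_K^{(t)})$
13: end for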
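The steps of Algorithm 1 can be sketched in a few lines for a special case in which both subproblems have closed forms. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes an affine operator F(x) = Ax + b (so step 8 is a linear solve) and the inequality constraints φ_i(x) = −x_i ≤ 0 (so step 9 reduces, per coordinate, to the positive root of a quadratic). The function name `acvi_affine`, the default hyperparameters, and the small test game (a 2 + 2 dimensional analogue of a bilinear game over two simplices) are all hypothetical choices.

```python
import numpy as np

def acvi_affine(A, b, C, d, y0, mu_init=1e-2, beta=1.0, delta=0.5, T=10, K=30):
    """ACVI sketch for F(x) = A x + b, subject to C x = d and x >= 0 (assumptions)."""
    n = A.shape[0]
    CCt_inv = np.linalg.inv(C @ C.T)
    P_c = np.eye(n) - C.T @ CCt_inv @ C      # line 3
    d_c = C.T @ CCt_inv @ d                  # line 4
    lhs = np.eye(n) + P_c @ A / beta         # constant matrix of the linear system in line 8
    y = np.asarray(y0, dtype=float)
    lam = np.zeros(n)
    mu = mu_init
    for _ in range(T):
        mu *= delta                          # line 6: shrink the barrier parameter
        for _ in range(K):
            # Line 8: x + P_c F(x)/beta - P_c y + P_c lam/beta - d_c = 0 with F affine.
            x = np.linalg.solve(lhs, P_c @ (y - (lam + b) / beta) + d_c)
            # Line 9 with phi_i(y) = -y_i: minimize -mu*sum(log y) + (beta/2)*||y - v||^2.
            # Stationarity gives beta*y^2 - beta*v*y - mu = 0; keep the positive root.
            v = x + lam / beta
            y = 0.5 * (v + np.sqrt(v * v + 4.0 * mu / beta))
            lam = lam + beta * (x - y)       # line 10
    return x, y, lam

# Toy game (assumption): min_{x1} max_{x2} eta*x1'x1 + (1-eta)*x1'x2 - eta*x2'x2,
# with each player on the 2-dimensional probability simplex.
eta = 0.5
I2 = np.eye(2)
A = np.block([[2 * eta * I2, (1 - eta) * I2],
              [-(1 - eta) * I2, 2 * eta * I2]])   # F(x) = [grad_x1 f, -grad_x2 f] = A x
b = np.zeros(4)
C = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])              # one sum-to-one constraint per player
d = np.ones(2)

x_sol, y_sol, lam_sol = acvi_affine(A, b, C, d, y0=[1.0, 0.0, 0.0, 1.0])
```

By symmetry, the solution of this toy game is the uniform strategy (0.5, 0.5) for both players, so the returned `x_sol` can be compared against it. Note that the x iterates satisfy the equality constraints by construction, while the y iterates stay strictly positive because of the barrier term.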

4.2. CONVERGENCE ANALYSIS

We consider two broad classes of problems. The first class assumes that $F$ is ξ-monotone on $\mathcal{C}_=$, a stronger assumption than monotonicity, yet weaker than strong monotonicity. The second setup requires that (i) $F$ is monotone, (ii) the constraints are active at the solution, and (iii) $F$ is not purely rotational. Note that (iii) is weaker than requiring that the active constraints at the solution form an acute angle with the operator; in other words, given the latter, the former holds due to monotonicity of $F$ (see App. B). Note that (iii) is not strong, as purely rotational games occur "almost never" in a Baire category sense (Kupka, 1963; Smale, 1963; Balduzzi et al., 2018; Hsieh et al., 2021). The proofs of the main theorems use the following lemma.
Lemma 1 (Upper bound for $G(\cdot)$). When $F$ is $L$-Lipschitz on $\mathcal{C}_=$, as per Assumption 1, we have that any iterate $x_k$ produced by Algorithm 1 satisfies $G(x_k, \mathcal{C}) \leq M_0 \|x_k - x^\star\|$, where $M_0 > 0$ depends linearly on $L$, and $x^\star \in \mathcal{S}^\star_{\mathcal{C},F}$.
To state the results we define the following sets. For $r, s > 0$, let $\hat{\mathcal{C}}_r \triangleq \{x \in \mathbb{R}^n \mid Cx = d, \ \varphi(x) \leq re\}$, and similarly let $\mathcal{C}_s \triangleq \{x \in \mathbb{R}^n \mid \|Cx - d\| \leq s, \ \varphi(x) \leq 0\}$. We have the following.
Theorem 2 (Last and average iterate convergence for star-ξ-monotone operator). Given an operator $F : \mathcal{X} \to \mathbb{R}^n$ monotone on $\mathcal{C}_=$ (Def. 1), assume that either $F$ is strictly monotone on $\mathcal{C}$ or one of the $\varphi_i$ is strictly convex. Assume there exists $r > 0$ or $s > 0$ such that $F$ is star-ξ-monotone on either $\hat{\mathcal{C}}_r$ or $\mathcal{C}_s$. Let $x_K^{(t)}$ and $\bar{x}_K^{(t)} \triangleq \frac{1}{K} \sum_{k=1}^K x_k^{(t)}$ denote the last and average iterate of Algorithm 1, respectively, run with sufficiently small $\mu_{-1}$. Then for all $t \in [T]$, we have that:
1. $\|x_K^{(t)} - x^\star\| \leq O(1/K^{1/(2\xi)})$.
2. If in addition $F$ is ξ-monotone on $\mathcal{C}_=$, we have $\|\bar{x}_K^{(t)} - x^\star\| \leq O(1/K^{1/\xi})$.
3. Moreover, if $F$ is $L$-Lipschitz on $\mathcal{C}_=$, as per Assumption 1, the same corresponding upper bounds hold for $G(x_K^{(t)}, \mathcal{C})$ and $G(\bar{x}_K^{(t)}, \mathcal{C})$; that is, $G(x_K^{(t)}, \mathcal{C}) \leq O(L/K^{1/(2\xi)})$ and $G(\bar{x}_K^{(t)}, \mathcal{C}) \leq O(L/K^{1/\xi})$. 
Remark 2. Note that the convergence guarantee does not rely on Assumption 1, which is solely used to relate the rate to the gap function. Also, note that $\mu_{-1}$ does not impact the convergence rate. Moreover, for simplicity we state the result with sufficiently small $\mu_{-1}$; however, the proof extends to any $\mu_{-1} > 0$. That is, the above result can be made parameter-free; see App. B.4.
Theorem 3 (Last and average iterate convergence for monotone operator). Given an operator $F : \mathcal{X} \to \mathbb{R}^n$, assume (i) $F$ is monotone on $\mathcal{C}_=$, (ii) either $F$ is strictly monotone on $\mathcal{C}$ or one of the $\varphi_i$ is strictly convex, and (iii) $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x)^\top \frac{x - x^\star}{\|x - x^\star\|} > 0$, where $\mathcal{S} \equiv \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$. Let $x_K^{(t)}$ and $\bar{x}_K^{(t)} \triangleq \frac{1}{K} \sum_{k=1}^K x_k^{(t)}$ denote the last and average iterate of Algorithm 1, respectively, run with sufficiently small $\mu_{-1}$. Then for all $t \in [T]$, we have that:
1. $\|x_K^{(t)} - x^\star\| \leq O(1/\sqrt{K})$.
2. If in addition $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x^\star)^\top \frac{x - x^\star}{\|x - x^\star\|} > 0$ (with $\mathcal{S} \equiv \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$), then $\|\bar{x}_K^{(t)} - x^\star\| \leq O(1/K)$.
3. Moreover, if $F$ is $L$-Lipschitz on $\mathcal{C}_=$, as per Assumption 1, the same corresponding upper bounds hold for $G(x_K^{(t)}, \mathcal{C})$ and $G(\bar{x}_K^{(t)}, \mathcal{C})$; that is, $G(x_K^{(t)}, \mathcal{C}) \leq O(L/\sqrt{K})$ and $G(\bar{x}_K^{(t)}, \mathcal{C}) \leq O(L/K)$.
Assumption (iii) in Theorem 3 requires the angle between $F(x)$ and $x - x^\star$ to be acute on $\mathcal{S} \setminus \{x^\star\}$, where $\mathcal{S} = \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$. For example, when there are no equality constraints, Assumption (iii) becomes $\inf_{x \in \mathcal{C} \setminus \{x^\star\}} F(x)^\top \frac{x - x^\star}{\|x - x^\star\|} > 0$. From (cVI) and by the monotonicity of $F$, we can see that for any point $x \in \mathcal{C} \setminus \{x^\star\}$, the angle between $F(x)$ and $x - x^\star$ is always less than or equal to $\pi/2$. Assumption (iii) additionally requires that $F(x^\star) \neq 0$, which means some constraints are active at $x^\star$, and that there exists $\theta \in (0, \pi/2)$ such that for any $x \in \mathcal{C} \setminus \{x^\star\}$, the angle between $F(x)$ and $x - x^\star$ is upper bounded by $\theta$.
Remark 3. Our proofs rely on the existence of the central path; see Appendix A. 
Note that since C is compact, it suffices that either: (i) F is strictly monotone on C, or that (ii) one of the inequality constraints φ i is strictly convex for the central path to exist (Facchinei & Pang, 2003, Corollary 11.4.24) . Thus, if F is ξ-monotone on C, then the central path exists. However, to relax the former assumption, notice that-by the compactness of C-there exists a sufficiently large M such that for any x ∈ C, x ⊺ x ≤ M . Thus, one can add a strictly convex inequality constraint φ m+1 (x)-e.g., x ⊺ x -M ≤ 0-and the solution set remains intact. That is, as µ tends to 0 the original problem is recovered. This ensures the existence of the central path without changing the original problem.

5. EXPERIMENTS

Problems. To study the empirical performance of ACVI we use the following 2D problems: (i) cBG: the common bilinear game, constrained on $\mathbb{R}_+$ for the two players, stated in Fig. 1, (ii) Von Neumann's ratio game (Von Neumann, 1971; Daskalakis et al., 2020; Diakonikolas et al., 2021), (iii) the Forsaken game (Hsieh et al., 2021), which exhibits limit cycles, as well as (iv) the toy GAN used in (Daskalakis et al., 2018; Antonakopoulos et al., 2021). Note that these are known to be challenging problems in the literature, and interestingly the latter three are non-monotone, going beyond the assumptions that we made in our theoretical results. We also consider the following higher-dimension bilinear game on the probability simplex, with $\eta \in (0, 1)$, $n = 1000$: $\min_{x_1 \in \triangle} \max_{x_2 \in \triangle} \ \eta x_1^\top x_1 + (1 - \eta) x_1^\top x_2 - \eta x_2^\top x_2$; $\triangle = \{x_i \in \mathbb{R}^{500} \mid x_i \geq 0, \ e^\top x_i = 1\}$. (HBG) As GANs on MNIST (Lecun & Cortes, 1998) enjoy well-established metrics, we use this setup and augment it solely with linear inequalities. We implement the baselines with the greedy projection algorithm (see App. D.3 for details); hence these baselines will be slower when equality constraints are also given. App. E lists additional experiments, including on Fashion-MNIST (Xiao et al., 2017).
Methods. We compare ACVI with the projected variants of the common saddle point optimizers (fully described in App. A.4): (i) GDA, (ii) EG (Korpelevich, 1976), (iii) OGDA (Popov, 1980), and (iv) LA k-GDA (Chavdarova et al., 2021b; Zhang et al., 2019), where k is the hyperparameter of LA. For ACVI on MNIST, l denotes the number of steps to solve the subproblems; see Algorithm 4.
Results. From Fig. 1, we observe that projection-based algorithms may zigzag when hitting a constraint due to the rotational nature of $F$, behavior that ACVI avoids because it incorporates the constraints in its update rule; see Fig. 5 for the remaining baselines. Fig. 
2 shows that even with problems that go beyond our theoretical assumptions, a single step of ACVI significantly reduces the distance to the solution. Moreover, from Fig. 2 (b), we observe that ACVI escapes the limit cycles; see also Fig. 6 . Fig. 3 shows results for (HBG), indicating that ACVI is time efficient, and that ACVI performs well relative to projection-based methods for varying rotational intensity (1 -η). Fig. 4 summarizes the experiments on MNIST with linear inequality constraints; we observe that ACVI converges significantly faster than the corresponding baseline.
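To make the role of the rotational intensity $(1-\eta)$ in (HBG) concrete, the following is a minimal sketch, not the authors' experimental code, that builds the linear operator $F(x) = (\nabla_{x_1} f, -\nabla_{x_2} f)$ of the (HBG) objective and inspects its Jacobian. The helper name `hbg_operator` and the small dimension `d = 3` are illustrative assumptions.

```python
import numpy as np

def hbg_operator(eta, d):
    """Return (A, F) with F(x) = A @ x for the (HBG) objective, x = (x1, x2).

    Hypothetical helper: f = eta*x1'x1 + (1-eta)*x1'x2 - eta*x2'x2, so
    grad_{x1} f = 2*eta*x1 + (1-eta)*x2 and -grad_{x2} f = -(1-eta)*x1 + 2*eta*x2.
    """
    I = np.eye(d)
    A = np.block([[2 * eta * I, (1 - eta) * I],
                  [-(1 - eta) * I, 2 * eta * I]])
    return A, (lambda x: A @ x)

eta, d = 0.05, 3  # small d for illustration only
A, F = hbg_operator(eta, d)

# Symmetric part of the Jacobian: positive semidefinite, so F is monotone.
sym_eigs = np.linalg.eigvalsh((A + A.T) / 2)
# Eigenvalues of the Jacobian itself: real part 2*eta, imaginary part +/-(1-eta).
jac_eigs = np.linalg.eigvals(A)
print(sym_eigs.min(), jac_eigs)
```

In this sketch the imaginary part of the Jacobian eigenvalues grows as $\eta$ decreases, which is one way to read "rotational intensity".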

6. CONCLUSION

Motivated by the lack of a first-order method to solve constrained VI (cVI) problems with general constraints, we proposed a framework that combines (i) interior-point methods, needed to handle general constraints, with (ii) the ADMM method, designed to deal with separable objectives. The combination yields ACVI, a first-order ADMM-based interior point method for cVIs. We proved convergence for two broad classes of problems and derived the corresponding convergence rates. Numerical experiments showed that while projection-based methods zigzag when hitting a constraint due to the rotational vector field, ACVI avoids this by incorporating the constraints in the update rule.

A BACKGROUND: ADDITIONAL DETAILS

This section lists additional background, such as omitted definitions and a description of the baseline methods used.

A.1 ADDITIONAL VI DEFINITIONS & EQUIVALENT FORMULATIONS

Seeing an operator $F : \mathcal{X} \to \mathbb{R}^n$ as the graph $\mathrm{Gr}F = \{(x, y) \mid x \in \mathcal{X},\ y = F(x)\}$, its inverse $F^{-1}$ is defined by $\mathrm{Gr}F^{-1} = \{(y, x) \mid (x, y) \in \mathrm{Gr}F\}$. See, for example, (Ryu & Yin, 2022) for further discussion. We denote the projection onto the set $\mathcal{X}$ with $\Pi_{\mathcal{X}}$.

Definition 3 ($\frac{1}{\mu}$-cocoercive operator). An operator $F : \mathcal{X} \supseteq \mathcal{S} \to \mathbb{R}^n$ is $\frac{1}{\mu}$-cocoercive (or $\frac{1}{\mu}$-inverse strongly monotone) on $\mathcal{S}$ if its inverse (graph) $F^{-1}$ is $\mu$-strongly monotone on $\mathcal{S}$, that is, $\exists \mu > 0$ s.t.

$$\langle x - x', F(x) - F(x') \rangle \geq \mu \lVert F(x) - F(x') \rVert^2, \quad \forall x, x' \in \mathcal{S}.$$

It is star $\frac{1}{\mu}$-cocoercive if the above holds when setting $x' \equiv x^\star$, where $x^\star$ denotes a solution, that is, $\exists \mu > 0$ s.t.

$$\langle x - x^\star, F(x) - F(x^\star) \rangle \geq \mu \lVert F(x) - F(x^\star) \rVert^2, \quad \forall x \in \mathcal{S},\ x^\star \in \mathcal{S}^\star_{\mathcal{X},F}.$$

Note from Def. 3 that cocoercive operators form a strict subclass of the monotone and L-Lipschitz operators; cocoercivity is thus a stronger assumption. See Chapter 4.2 of (Bauschke & Combettes, 2017) for further relations of cocoercivity with other properties of operators.

In the following, we will make use of the natural and normal mappings of an operator $F : \mathcal{X} \to \mathbb{R}^n$, where $\mathcal{X} \subset \mathbb{R}^n$. Following the notation of (Facchinei & Pang, 2003), the natural map $F^{\mathrm{NAT}}_{\mathcal{X}} : \mathcal{X} \to \mathbb{R}^n$ is:

$$F^{\mathrm{NAT}}_{\mathcal{X}}(x) \triangleq x - \Pi_{\mathcal{X}}\big(x - F(x)\big), \quad \forall x \in \mathcal{X}, \tag{F-NAT}$$

whereas the normal map $F^{\mathrm{NOR}}_{\mathcal{X}} : \mathbb{R}^n \to \mathbb{R}^n$ is:

$$F^{\mathrm{NOR}}_{\mathcal{X}}(x) \triangleq F\big(\Pi_{\mathcal{X}}(x)\big) + x - \Pi_{\mathcal{X}}(x), \quad \forall x \in \mathbb{R}^n. \tag{F-NOR}$$

Moreover, we have the following solution characterizations: (i) $x^\star \in \mathcal{S}^\star_{\mathcal{X},F}$ iff $F^{\mathrm{NAT}}_{\mathcal{X}}(x^\star) = 0$, and (ii) $x^\star \in \mathcal{S}^\star_{\mathcal{X},F}$ iff $\exists x' \in \mathbb{R}^n$ s.t. $x^\star = \Pi_{\mathcal{X}}(x')$ and $F^{\mathrm{NOR}}_{\mathcal{X}}(x') = 0$.

Remark 4 ("rotational" component of the vector field). The rotational trajectories in parameter space arise because the eigenvalues of the Jacobian of the vector field F (the second-order derivative matrix) can be complex, that is, $\lambda_i \in \mathbb{C}$. In contrast, when $F \equiv \nabla f$ the second-order derivative matrix, known as the Hessian, is always symmetric, and thus its eigenvalues are real.
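The natural-map characterization (i) above suggests a simple numerical optimality check: the residual $\lVert F^{\mathrm{NAT}}_{\mathcal{X}}(x)\rVert$ vanishes exactly at solutions. The sketch below illustrates this on a hypothetical toy instance (the nonnegative orthant with $F(x) = x - b$); the function names and the instance are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def natural_map(F, project, x):
    """Natural map F^NAT_X(x) = x - Pi_X(x - F(x)) for a given projection."""
    return x - project(x - F(x))

# Hypothetical toy instance: X = nonnegative orthant, F(x) = x - b.
b = np.array([1.0, -2.0])
F = lambda x: x - b
project = lambda z: np.maximum(z, 0.0)  # Euclidean projection onto X

# For this F, the VI solution is the projection of b onto X.
x_star = np.maximum(b, 0.0)
res_star = np.linalg.norm(natural_map(F, project, x_star))
res_other = np.linalg.norm(natural_map(F, project, np.array([0.5, 0.5])))
print(res_star, res_other)
```

The residual is zero at `x_star` and strictly positive at the non-solution point, which matches characterization (i).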

A.2 EXISTENCE OF SOLUTION

We provide a brief informal summary of some sufficient conditions for solution existence, that is, for $\mathcal{S}^\star_{\mathcal{X},F} \neq \emptyset$. See Chapter 2 of (Facchinei & Pang, 2003) for a full treatment of the topic.

The common underlying tool for establishing that a solution to the VI($\mathcal{X}$, F) problem exists is the topological degree. The topological degree is designed to satisfy the so-called homotopy invariance axiom, which in turn allows reducing the solution-existence question for a complicated map to that of a simpler map (homotopic to the original one) for which existence is easier to show (e.g., the identity map on a closed domain). It can be used (as one way) to prove the celebrated Brouwer fixed-point theorem, which states that any continuous map $\Phi : \mathcal{S} \to \mathcal{S}$, where $\mathcal{S}$ is a nonempty convex compact set, has a fixed point in $\mathcal{S}$. We have the following sufficient condition.

Theorem 4 (sufficient condition for existence of the solution, Cor. 2.2.5, (Facchinei & Pang, 2003)). If $\mathcal{X} \subseteq \mathbb{R}^n$ is compact and convex, and $F : \mathcal{X} \to \mathbb{R}^n$ is continuous, then the solution set is nonempty and compact.

It can also be shown that when $\mathcal{X}$ is closed and convex (and F continuous), if one can find $x' \in \mathcal{X}$ s.t. $\langle F(x), x - x' \rangle \geq 0$, $\forall x \in \mathcal{X}$, then $\mathcal{S}^\star_{\mathcal{X},F} \neq \emptyset$. The same conclusion follows if there exists $x' \in \mathcal{X}$ such that the set $\{x \in \mathcal{X} \mid \langle F(x), x - x' \rangle < 0\}$ is bounded (possibly empty; see Prop. 2.2.3 in (Facchinei & Pang, 2003)).

Sufficient conditions can also be established via the natural and the normal map, owing to the above solution characterizations. In this case, we require that F is continuous on an open set $\mathcal{S}$, and we ask whether $\mathcal{S}^\star_{\mathcal{X},F} \neq \emptyset$, where $\mathcal{X}$ is assumed closed, convex, and a subset of $\mathcal{S}$, i.e., $\mathcal{X} \subseteq \mathcal{S}$. If one establishes that a solution exists for $F^{\mathrm{NAT}}_{\mathcal{X}}$ on a bounded open set $\mathcal{U}$, and if $\mathrm{cl}\,\mathcal{U} \subseteq \mathcal{S}$, then it follows that $\mathcal{S}^\star_{\mathcal{X},F} \neq \emptyset$. A similar implication holds when we have such a guarantee for $F^{\mathrm{NOR}}_{\mathcal{X}}$. See Theorem 2.2.1 of (Facchinei & Pang, 2003).

In summary, the solution existence guarantee follows from the boundedness of some set that includes $\mathcal{X}$, the boundedness of the set of potential solutions (if we can construct such a set), or the compactness of $\mathcal{X}$ itself.

A.3 EXISTENCE OF CENTRAL PATH

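In this section, we discuss the results that guarantee the existence of the central path. Let $L(x, \lambda, \nu) \triangleq F(x) + \nabla\varphi^\intercal(x)\lambda + C^\intercal \nu$ and $h(x) = Cx - d$. For $(\lambda, w, x, \nu) \in \mathbb{R}^{2m+n+p}$, let

$$G(\lambda, w, x, \nu) \triangleq \begin{pmatrix} w \circ \lambda \\ w + \varphi(x) \\ L(x, \lambda, \nu) \\ h(x) \end{pmatrix} \in \mathbb{R}^{2m+n+p}, \qquad H(\lambda, w, x, \nu) \triangleq \begin{pmatrix} w + \varphi(x) \\ L(x, \lambda, \nu) \\ h(x) \end{pmatrix} \in \mathbb{R}^{m+n+p}.$$

Let $H_{++} \triangleq H(\mathbb{R}^{2m}_{++} \times \mathbb{R}^n \times \mathbb{R}^p)$. By (Corollary 11.4.24, Facchinei & Pang, 2003) we have the following proposition.

Proposition 1 (sufficient condition for the existence of the central path). If F is monotone, either F is strictly monotone or one of the $\varphi_i$ is strictly convex, and the set $\mathcal{C}$ is bounded, then the following four statements hold for the functions G and H:
1. G maps $\mathbb{R}^{2m}_{++} \times \mathbb{R}^{n+p}$ homeomorphically onto $\mathbb{R}^m_{++} \times H_{++}$;
2. $\mathbb{R}^m_{++} \times H_{++} \subseteq G(\mathbb{R}^{2m}_{+} \times \mathbb{R}^{n+p})$;
3. for every vector $a \in \mathbb{R}^m_+$, the system $H(\lambda, w, x, \nu) = 0$, $w \circ \lambda = a$ has a solution $(\lambda, w, x, \nu) \in \mathbb{R}^{2m}_+ \times \mathbb{R}^{n+p}$; and
4. the set $H_{++}$ is convex.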
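As a small worked illustration of the central path (a sketch on a hypothetical one-dimensional instance, not an example from the paper), take $F(x) = x - 2$ with the single constraint $\varphi(x) = x - 1 \leq 0$ and no equality constraints. The system $H = 0$, $w \circ \lambda = \mu$ then reads $x - 2 + \lambda = 0$ and $\lambda(1 - x) = \mu$, which has the closed-form solution used below. The VI solution here is $x^\star = 1$.

```python
import numpy as np

def central_path_point(mu):
    """Closed-form central-path point for the toy problem F(x) = x - 2, x <= 1.

    Solves (x - 2) + lam = 0 and lam * (1 - x) = mu with lam > 0 and x < 1,
    i.e. the root of x^2 - 3x + 2 - mu = 0 that lies strictly below 1.
    """
    x = (3.0 - np.sqrt(1.0 + 4.0 * mu)) / 2.0
    lam = mu / (1.0 - x)
    return x, lam

mus = [1.0, 1e-1, 1e-2, 1e-4]
path = [central_path_point(mu) for mu in mus]
for mu, (x, lam) in zip(mus, path):
    print(f"mu={mu:g}  x_mu={x:.6f}  lambda_mu={lam:.6f}")
```

In this toy case the points stay strictly feasible and approach $x^\star = 1$ as $\mu \to 0$, with $\lvert x^\star - x_\mu \rvert \leq \mu$.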

A.4 SADDLE-POINT OPTIMIZATION METHODS

In this section, we describe in detail the saddle point methods that we compare against in the main part (in § 5). We denote the projection onto the set $\mathcal{X}$ with $\Pi_{\mathcal{X}}$; when a method is applied in the unconstrained setting, $\Pi_{\mathcal{X}} \equiv I$.

For an example of the associated vector field and its Jacobian, consider the following constrained zero-sum game:

$$\min_{x_1 \in \mathcal{X}_1}\ \max_{x_2 \in \mathcal{X}_2}\ f(x_1, x_2), \tag{ZS-G}$$

where $f : \mathcal{X}_1 \times \mathcal{X}_2 \to \mathbb{R}$ is smooth, convex in $x_1$, and concave in $x_2$. As in the main paper, we write $x \triangleq (x_1, x_2) \in \mathbb{R}^n$. The vector field $F : \mathcal{X} \to \mathbb{R}^n$ and its Jacobian J are defined as:

$$F(x) = \begin{pmatrix} \nabla_{x_1} f(x) \\ -\nabla_{x_2} f(x) \end{pmatrix}, \qquad J(x) = \begin{pmatrix} \nabla^2_{x_1} f(x) & \nabla_{x_2}\nabla_{x_1} f(x) \\ -\nabla_{x_1}\nabla_{x_2} f(x) & -\nabla^2_{x_2} f(x) \end{pmatrix}.$$

In the remainder of this section, we will only refer to the joint variable x, and (with abuse of notation) the subscript will denote the step. Let $\gamma \in [0, 1]$ denote the step size.

(Projected) Gradient Descent Ascent (GDA). The extension of gradient descent to the cVI problem is gradient descent ascent (GDA). The GDA update at step k is:

$$x_{k+1} = \Pi_{\mathcal{X}}\big(x_k - \gamma F(x_k)\big). \tag{GDA}$$

(Projected) Extragradient (EG). EG (Korpelevich, 1976) uses a "prediction" step to obtain an extrapolated point $x_{k+\frac{1}{2}}$ using GDA: $x_{k+\frac{1}{2}} = \Pi_{\mathcal{X}}\big(x_k - \gamma F(x_k)\big)$, and the gradients at the extrapolated point are then applied to the current iterate $x_k$:

$$x_{k+1} = \Pi_{\mathcal{X}}\Big(x_k - \gamma F\big(\Pi_{\mathcal{X}}(x_k - \gamma F(x_k))\big)\Big). \tag{EG}$$

In the original EG paper, Korpelevich (1976) proved that the EG method (with a fixed step size) converges for monotone VIs, as follows.

Theorem 5 (Korpelevich (1976)). Given a map $F : \mathcal{X} \to \mathbb{R}^n$, if the following is satisfied:
1. the set $\mathcal{X}$ is closed and convex,
2. F is single-valued, definite, and monotone on $\mathcal{X}$, as per Def. 1,
3. F is L-Lipschitz, as per Asm. 1,
then there exists a solution $x^\star \in \mathcal{X}$ such that the iterates $x_k$ produced by the EG update rule with a fixed step size $\gamma \in (0, \frac{1}{L})$ converge to it, that is, $x_k \to x^\star$ as $k \to \infty$.
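The (GDA) and (EG) updates above can be sketched in a few lines. The following is a minimal illustration under stated assumptions: the operator is that of a hypothetical 2D bilinear-type game $f(x_1, x_2) = 0.05x_1^2 + x_1x_2 - 0.05x_2^2$ constrained to the nonnegative orthant, and the step size and iteration count are arbitrary choices, not tuned values from the paper.

```python
import numpy as np

def project_nonneg(z):
    """Euclidean projection onto the nonnegative orthant."""
    return np.maximum(z, 0.0)

def gda_step(F, proj, x, gamma):
    """One projected GDA step: x_{k+1} = Pi(x_k - gamma * F(x_k))."""
    return proj(x - gamma * F(x))

def eg_step(F, proj, x, gamma):
    """One projected EG step: extrapolate with GDA, then update from x_k."""
    x_half = proj(x - gamma * F(x))
    return proj(x - gamma * F(x_half))

# Hypothetical toy operator F(x) = (grad_{x1} f, -grad_{x2} f) = A @ x.
A = np.array([[0.1, 1.0], [-1.0, 0.1]])
F = lambda x: A @ x

def run(step, x0, gamma=0.1, iters=2000):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = step(F, project_nonneg, x, gamma)
    return x

x_gda = run(gda_step, [1.0, 1.0])
x_eg = run(eg_step, [1.0, 1.0])
print(np.linalg.norm(x_gda), np.linalg.norm(x_eg))
```

On this strongly monotone toy instance both methods approach the solution at the origin; the example is meant to show the update structure rather than to compare the methods.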
Facchinei & Pang (2003) also show that for any convex-concave function f and any closed convex sets $\mathcal{X}_1$ and $\mathcal{X}_2$, the EG method converges (Facchinei & Pang, 2003, Theorem 12.1.11).

(Projected) Optimistic Gradient Descent Ascent (OGDA). The update rule of Optimistic Gradient Descent Ascent (OGDA; Popov, 1980) is:

$$x_{n+1} = \Pi_{\mathcal{X}}\big(x_n - 2\gamma F(x_n) + \gamma F(x_{n-1})\big). \tag{OGDA}$$

(Projected) Lookahead-Minmax (LA). The LA algorithm for min-max optimization (Chavdarova et al., 2021b), originally proposed for minimization by Zhang et al. (2019), is a general wrapper of a "base" optimizer where, at every step n: (i) a copy of the current iterate is made, $\tilde{x}_n \leftarrow x_n$; (ii) $\tilde{x}_n$ is updated $k \geq 1$ times, yielding $\tilde{x}_{n+k}$; and finally (iii) the actual update $x_{n+1}$ is obtained as a point that lies on the line between the current iterate $x_n$ and the predicted one $\tilde{x}_{n+k}$:

$$x_{n+1} \leftarrow x_n + \alpha(\tilde{x}_{n+k} - x_n), \quad \alpha \in [0, 1]. \tag{LA}$$

In this work, we use solely GDA as a base optimizer for LA and thus denote it with LAk-GDA.

The projection-free Frank-Wolfe (FW). FW (Frank & Wolfe, 1956) is an IP-type method for solving constrained smooth zero-sum games (ZS-G). It avoids the projection operator by ensuring that the iterates never leave the constraint set. To do so, it finds the intersection points of the inequality constraints; hence, it requires that the inequality constraints have a certain structure (such as being linear) for this operation to be computationally cheap.

Algorithm 2 Frank-Wolfe algorithm for zero-sum games.
 1: Input: $C$, $\nu > 0$
 2: Initialize: $z^{(0)} = (x_1^{(0)}, x_2^{(0)}) \in \mathcal{X}_1 \times \mathcal{X}_2$
 3: for $t = 0, \dots, T$ do
 4:     Compute $r^{(t)} \triangleq \big(\nabla_{x_1} f(x_1^{(t)}, x_2^{(t)}),\ -\nabla_{x_2} f(x_1^{(t)}, x_2^{(t)})\big)$
 5:     $s^{(t)} \triangleq \arg\min_{z \in \mathcal{X}_1 \times \mathcal{X}_2} \langle z, r^{(t)} \rangle$
 6:     Compute $g_t \triangleq \langle z^{(t)} - s^{(t)}, r^{(t)} \rangle$
 7:     if $g_t \leq \varepsilon$ then
 8:         return $z^{(t)}$
 9:     end if
10:     Let $\gamma = \min\big(1, \frac{\nu}{2C} g_t\big)$ or $\gamma = \frac{2}{2+t}$
11:     Update $z^{(t+1)} = (1 - \gamma) z^{(t)} + \gamma s^{(t)}$
12: end for
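Algorithm 2 can be sketched as follows for the special case where $\mathcal{X}_1$ and $\mathcal{X}_2$ are probability simplices, so that the linear minimization in step 5 reduces to picking a vertex. This is an illustrative sketch only: the simplex setting, the function names, the objective (the (HBG) form with $\eta = 0.5$ in dimension 3), and the choice of the step size $\gamma = 2/(2+t)$ are assumptions, and no convergence claim is made for this instance.

```python
import numpy as np

def lmo_simplex(r):
    """Linear minimization oracle on the simplex: vertex with smallest r_i."""
    s = np.zeros_like(r)
    s[np.argmin(r)] = 1.0
    return s

def frank_wolfe_zero_sum(F, z0, d1, T=200, eps=1e-8):
    """Sketch of Algorithm 2 on a product of two simplices (sizes d1, len(z0)-d1)."""
    z = np.array(z0, dtype=float)
    gaps = []
    for t in range(T + 1):
        r = F(z)                                  # step 4: r = (grad_x1 f, -grad_x2 f)
        s = np.concatenate([lmo_simplex(r[:d1]),  # step 5: argmin over the product set
                            lmo_simplex(r[d1:])])
        g = float((z - s) @ r)                    # step 6: Frank-Wolfe gap
        gaps.append(g)
        if g <= eps:                              # steps 7-9: stopping test
            break
        gamma = 2.0 / (2.0 + t)                   # step 10 (second option)
        z = (1.0 - gamma) * z + gamma * s         # step 11: convex combination
    return z, gaps

# Hypothetical instance: f = eta*x1'x1 + (1-eta)*x1'x2 - eta*x2'x2, eta = 0.5.
eta, d = 0.5, 3
I = np.eye(d)
A = np.block([[2 * eta * I, (1 - eta) * I], [-(1 - eta) * I, 2 * eta * I]])
z0 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
z, gaps = frank_wolfe_zero_sum(lambda x: A @ x, z0, d1=d)
print(z, gaps[-1])
```

Because each update is a convex combination of feasible points, the iterates remain in the constraint set without any projection, which is the property the paragraph above emphasizes.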

B OMITTED PROOFS AND DISCUSSIONS CONCERNING ALGORITHM 1

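This section provides the proofs of the theoretical results in § 4.2.

B.1 PROOF OF THEOREM 1: UNIQUENESS OF THE SOLUTION OF EQ. W-EQ

Recall first that Eq. W-EQ is as follows:

$$x + \frac{1}{\beta} P_c F(x) - P_c y_k + \frac{1}{\beta} P_c \lambda_k - d_c = 0,$$

since $w = x$.

Proof of Theorem 1: uniqueness of the solution of (W-EQ). Let $G(x)$ denote the LHS of (W-EQ), that is:

$$G(x) \triangleq x + \frac{1}{\beta} P_c F(x) - P_c y_k + \frac{1}{\beta} P_c \lambda_k - d_c.$$

We claim that $G$ is strongly monotone on $\mathcal{C}_=$. In fact, $\forall x, y \in \mathcal{C}_=$, $P_c(x - y) = x - y$. Since $P_c$ is symmetric and F is monotone, we have:

$$\langle G(x) - G(y), x - y \rangle = \lVert x - y \rVert^2 + \frac{1}{\beta} \langle P_c F(x) - P_c F(y), x - y \rangle = \lVert x - y \rVert^2 + \frac{1}{\beta} \langle x - y, F(x) - F(y) \rangle \geq \lVert x - y \rVert^2.$$

Therefore, according to Theorem 2.3.3 (b) in (Facchinei & Pang, 2003), $\mathcal{S}^\star_{\mathcal{C}_=, G}$ consists of a unique solution $\tilde{x} \in \mathcal{C}_=$. Since $\mathcal{C}_=$ is an affine set, this gives:

$$G(\tilde{x})^\intercal (x - \tilde{x}) = 0, \quad \forall x \in \mathcal{C}_=.$$

From the above, we deduce that $G(\tilde{x}) \in \mathrm{Span}\{c_1, \dots, c_p\}$, where $c_i$, $i \in [p]$, are the row vectors of C. Suppose that $G(\tilde{x}) = \sum_{i=1}^p \alpha_i c_i$. Notice that $C G(x) = 0$, $\forall x \in \mathcal{C}_=$. Thus, we have that:

$$c_j^\intercal G(\tilde{x}) = c_j^\intercal \sum_{i=1}^p \alpha_i c_i = 0, \quad \forall j \in [p].$$

Hence,

$$\langle G(\tilde{x}), G(\tilde{x}) \rangle = \Big\langle \sum_{i=1}^p \alpha_i c_i, \sum_{i=1}^p \alpha_i c_i \Big\rangle = 0,$$

which indicates that $G(\tilde{x}) = 0$. Hence, $\tilde{x}$ is a solution of (W-EQ) in $\mathcal{C}_=$. On the other hand, $\forall x \in \mathbb{R}^n$, if x is a solution of (W-EQ), i.e., $G(x) = 0$, then $x \in \mathcal{C}_=$. By the uniqueness of $\tilde{x}$ in $\mathcal{C}_=$ we have that $x = \tilde{x}$, which means $\tilde{x}$ is unique in $\mathbb{R}^n$.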
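For an affine monotone operator, (W-EQ) is a linear system and its unique solution can be computed directly. The sketch below illustrates this. It assumes the definitions $P_c = I - C^\intercal (C C^\intercal)^{-1} C$ and $d_c = C^\intercal (C C^\intercal)^{-1} d$ (these are not restated in this appendix, so treat them as an assumption), and the operator $F(x) = Ax + b$, the data, and the function name are hypothetical.

```python
import numpy as np

def solve_w_eq(A, b, C, d, y, lam, beta):
    """Solve x + (1/beta) P_c F(x) - P_c y + (1/beta) P_c lam - d_c = 0
    for the affine operator F(x) = A @ x + b (illustrative assumption)."""
    n = A.shape[0]
    CCt = C @ C.T
    P_c = np.eye(n) - C.T @ np.linalg.solve(CCt, C)  # assumed: projector onto null(C)
    d_c = C.T @ np.linalg.solve(CCt, d)              # assumed: particular solution of Cx = d
    lhs = np.eye(n) + P_c @ A / beta
    rhs = P_c @ y - P_c @ lam / beta - P_c @ b / beta + d_c
    x = np.linalg.solve(lhs, rhs)
    return x, P_c, d_c

# Hypothetical data: monotone A (symmetric part is PSD), one equality constraint.
A = np.array([[0.1, 1.0, 0.0], [-1.0, 0.1, 0.0], [0.0, 0.0, 1.0]])
b = np.array([0.3, -0.2, 0.1])
C = np.array([[1.0, 1.0, 1.0]])
d = np.array([1.0])
y = np.array([0.2, 0.3, 0.5])
lam = np.zeros(3)
beta = 0.5

x, P_c, d_c = solve_w_eq(A, b, C, d, y, lam, beta)
residual = x + P_c @ (A @ x + b) / beta - P_c @ y + P_c @ lam / beta - d_c
print(x, np.linalg.norm(residual), C @ x - d)
```

Consistent with the proof above, the computed solution satisfies $Cx = d$ automatically, since $C P_c = 0$ and $C d_c = d$ under the assumed definitions.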

B.2 PROOF OF LEMMA 1: UPPER BOUND ON THE GAP FUNCTION

Proof of Lemma 1: Upper bound on the gap function. Let $x_k$ denote an iterate produced by Algorithm 1, and let $x \in \mathcal{C}$. Note that we always have $x_k \in \mathcal{C}_=$. We have that:

$$\langle F(x_k), x_k - x \rangle = \langle F(x^\star), x^\star - x \rangle + \langle F(x_k), x_k - x \rangle - \langle F(x^\star), x^\star - x \rangle = \langle F(x^\star), x^\star - x \rangle + \langle F(x_k), x_k - x^\star + x^\star - x \rangle - \langle F(x^\star), x^\star - x \rangle.$$

From the proof of Theorem 6 we know that $x_k$ is bounded, which gives $\langle F(x_k), x_k - x^\star \rangle \leq M \lVert x_k - x^\star \rVert$, with $M > 0$; moreover, $\langle F(x^\star), x^\star - x \rangle \leq 0$. Thus, for the above we get:

$$\langle F(x_k), x_k - x \rangle \leq \langle F(x_k) - F(x^\star), x^\star - x \rangle + M \lVert x_k - x^\star \rVert \leq D \lVert F(x_k) - F(x^\star) \rVert + M \lVert x_k - x^\star \rVert \leq (DL + M) \lVert x_k - x^\star \rVert,$$

where for the second inequality we used $\lVert x^\star - x \rVert \leq D$, with $D \triangleq \max_{x' \in \mathcal{C}} \lVert x^\star - x' \rVert$ the largest distance between any point in $\mathcal{C}$ and $x^\star$. For the last inequality we used that F is L-Lipschitz (Assumption 1), which concludes the proof. The proof is analogous for the iterates $y_k \in \mathcal{C}_\leq$ produced by Algorithm 1.

B.3 PROOFS OF THE CONVERGENCE RATE: THEOREMS 2 AND 3

Let

$$f^{(t)}(x) \triangleq F(x_{\mu_t})^\intercal x + \mathbb{1}(Cx = d), \qquad f^{(t)}_k(x) \triangleq F(x^{(t)}_{k+1})^\intercal x + \mathbb{1}(Cx = d), \qquad g^{(t)}(y) \triangleq -\mu_t \sum_{i=1}^m \log\big(-\varphi_i(y)\big),$$

where $x_{\mu_t}$ is a solution of (KKT-2) when $\mu = \mu_t$. Note that the existence of $x_{\mu_t}$ is guaranteed by the existence of the central path (see App. A), and that $f^{(t)}$, $f^{(t)}_k$, and $g^{(t)}$ are all convex. In the following proofs, unless this causes confusion, we drop the index t to simplify the notation. Let $y_\mu = x_\mu$; from (KKT-2) we can see that $(x_\mu, y_\mu)$ is an optimal solution of

$$\min_{x, y}\ f(x) + g(y) \quad \text{s.t.}\quad x = y. \tag{4}$$

There exists $\lambda_\mu \in \mathbb{R}^n$ such that $(x_\mu, y_\mu, \lambda_\mu)$ is a KKT point of (4). We give the following proposition, which we will use repeatedly in the proofs:

Proposition 2. If F is monotone, then $\forall k \in \mathbb{N}$, $f_k(x_{k+1}) - f_k(x_\mu) \geq f(x_{k+1}) - f(x_\mu)$. Furthermore, if F is ξ-monotone, as per Def. 1, then $f_k(x_{k+1}) - f_k(x_\mu) \geq f(x_{k+1}) - f(x_\mu) + c \lVert x_{k+1} - x_\mu \rVert_2^\xi$.

Proof of Proposition 2.
It suffices to note that: f k (x k+1 ) -f k (x µ ) -(f (x k+1 ) -f (x µ )) = (F (x k+1 ) -F (x µ )) ⊺ (x k+1 -x µ ). Some of our proofs that follow are inspired by some convergence proofs in ADMM (Gabay, 1983; Eckstein & Bertsekas, 1992; Davis & Yin, 2016; He & Yuan, 2012; 2015; Lin et al., 2022) . However, although Algorithm 1 adopts the high level idea of ADMM, we can not directly refer to the convergence proofs of ADMM, but need to substantially modify these. We will use the following lemma. Lemma 2. f (x) + g(y) -f (x µ ) -g(y µ ) + ⟨λ µ , x -y⟩ ≥ 0, ∀x, y. Proof. The Lagrange function of (4) is L(x, y, λ) = f (x) + g(y) + λ ⊺ (x -y). And by the property of KKT point, we have L(x µ , y µ , λ) ≤ L(x µ , y µ , λ µ ) ≤ L(x, y, λ µ ), ∀(x, y, λ) , from which the conclusion follows. The following lemma is straightforward to verify: Lemma 3. If f (x) + g(y) -f (x µ ) -g(y µ ) + ⟨λ µ , x -y⟩ ≤ α 1 , ∥x -y∥ ≤ α 2 then we have -∥λ µ ∥α 2 ≤ f (x) + g(y) -f (x µ ) -g(y µ ) ≤ ∥λ µ ∥α 2 + α 1 . The following lemma lists some simple but useful facts that we will use in the following proofs. Lemma 4. For (4) and Algorithm 1, we have 0 ∈ ∂f k (x k+1 ) + λ k + β(x k+1 -y k+1 ) (5) 0 ∈ ∂g(y k+1 ) -λ k -β(x k+1 -y k+1 ), (6) λ k+1 -λ k = β(x k+1 -y k+1 ), (7) 0 ∈ ∂f (x µ ) + λ µ , (8) 0 ∈ ∂g(y µ ) -λ µ , (9) x µ = y µ . ( ) We define: ∇f k (x k+1 ) ≜ -λ k -β(x k+1 -y k ), (y k+1 ) ≜ λ k + β(x k+1 -y k+1 ). Then from ( 5) and ( 6) we can see that ∇f k (x k+1 ) ∈ ∂f k (x k+1 ) and ∇g(y k+1 ) ∈ ∂g(y k+1 ). Lemma 5. For Algorithm 1, we have ⟨ ∇g(y k+1 ), y k+1 -y⟩ = -⟨λ k+1 , y -y k+1 ⟩, and ⟨ ∇f k (x k+1 ), x k+1 -x⟩ + ⟨ ∇g(y k+1 ), y k+1 -y⟩ = -⟨λ k+1 , x k+1 -y k+1 -x + y⟩ + β⟨-y k+1 + y k , x k+1 -x⟩. ( ) Proof of Lemma 5. From ( 7), ( 11) and ( 12) we have: ⟨ ∇f k (x k+1 ), x k+1 -x⟩ = -⟨λ k + β(x k+1 -y k ), x k+1 -x⟩ = -⟨λ k+1 , x k+1 -x⟩ + β⟨-y k+1 + y k , x k+1 -x⟩, and ⟨ ∇g(y k+1 ), y k+1 -y⟩ = -⟨λ k+1 , yy k+1 ⟩. Adding these together yields (15). Lemma 6. 
For Algorithm 1, we have ⟨ ∇f k (x k+1 ), x k+1 -x µ ⟩ + ⟨ ∇g(y k+1 ), y k+1 -y µ ⟩ + ⟨λ µ , x k+1 -y k+1 ⟩ ≤ 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥y µ -y k ∥ 2 - β 2 ∥y µ -y k+1 ∥ 2 - 1 2β ∥λ k+1 -λ k ∥ 2 - β 2 ∥y k -y k+1 ∥ 2 Proof of Lemma 6. Letting (x, y, λ) = (x µ , y µ , λ µ ) in( 15), adding ⟨λ µ , x k+1 -y k+1 ⟩ to both sides, and using ( 7) and ( 10), we have: ⟨ ∇f k (x k+1 ), x k+1 -x µ ⟩ + ⟨ ∇g(y k+1 ), y k+1 -y µ ⟩ + ⟨λ µ , x k+1 -y k+1 ⟩ = -⟨λ k+1 -λ µ , x k+1 -y k+1 ⟩ + β⟨-y k+1 + y k , x k+1 -x µ ⟩ = - 1 β ⟨λ k+1 -λ µ , λ k+1 -λ k ⟩ + ⟨-y k+1 + y k , λ k+1 -λ k ⟩ -β⟨-y k+1 + y k , -y k+1 + y µ ⟩ (16) = 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k + y ⋆ ∥ 2 - β 2 ∥-y k+1 + y ⋆ ∥ 2 - β 2 ∥-y k+1 + y k ∥ 2 +⟨-y k+1 + y k , λ k+1 -λ k ⟩ . ( ) On the other hand, ( 14) gives ⟨ ∇g(y k ), y k -y⟩ + ⟨λ k , -y k + y⟩ = 0 . ( ) Letting y = y k in ( 14) and y = y k+1 in (18), and adding them together, we have: ⟨ ∇g(y k+1 ) -∇g(y k ), y k+1 -y k ⟩ + ⟨λ k+1 -λ k , -y k+1 + y k ⟩ = 0 . By the monotonicity of ∂g we know that the first term of the above equality is non-negative. Thus, we have: ⟨λ k+1 -λ k , -y k+1 + y k ⟩ ≤ 0 . ( ) Plugging it into (17), we have the conclusion. Lemma 7. For Algorithm 1, we have f (x k+1 ) + g(y k+1 ) -f (x µ ) -g(y µ ) + ⟨λ µ , x k+1 -y k+1 ⟩ ≤ 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥-y k + y µ ∥ 2 - β 2 ∥-y k+1 + y µ ∥ 2 - 1 2β ∥λ k+1 -λ k ∥ 2 - β 2 ∥-y k+1 + y k ∥ 2 (20) Furthermore, if F is ξ-monotone on C = , we have c∥x k+1 -x µ ∥ ξ 2 + f (x k+1 ) + g(y k+1 ) -f (x µ ) -g(y µ ) + ⟨λ µ , x k+1 -y k+1 ⟩ ≤ 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥-y k + y µ ∥ 2 - β 2 ∥-y k+1 + y µ ∥ 2 - 1 2β ∥λ k+1 -λ k ∥ 2 - β 2 ∥-y k+1 + y k ∥ 2 . ( ) Proof. 
From the convexity of f k (x) and g(y) and using Proposition 2 and (13), we have f (x k+1 ) + g(y k+1 ) -f (x µ ) -g(y µ ) + ⟨λ µ , x k+1 -y k+1 ⟩ ≤f k (x k+1 ) + g(y k+1 ) -f k (x µ ) -g(y µ ) + ⟨λ µ , x k+1 -y k+1 ⟩ ≤⟨ ∇f k (x k+1 ), x k+1 -x µ ⟩ + ⟨ ∇g(y k+1 ), y k+1 -y µ ⟩ +⟨λ µ , x k+1 -y k+1 ⟩ ≤ 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥-y k + y µ ∥ 2 - β 2 ∥-y k+1 + y µ ∥ 2 - 1 2β ∥λ k+1 -λ k ∥ 2 - β 2 ∥-y k+1 + y k ∥ 2 . ( ) If F is ξ-monotone on C = , again by Proposition 2, we can add the term c∥x k+1 -x µ ∥ ξ 2 to the first line and the inequality still holds. Theorem 6. For Algorithm 1, we have f (x k+1 ) -f (x µ ) + g(y k+1 ) -g(y µ ) → 0, f k (x k+1 ) -f k (x µ ) + g(y k+1 ) -g(y µ ) → 0, x k+1 -y k+1 → 0, as k → ∞. Furthermore, if F is ξ-monotone on C = , we have x k+1 → x µ , k → ∞ Proof of Theorem 6. Proof From Lemma 2 and (20), we have 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 ≤ 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥-y k + y µ ∥ 2 - β 2 ∥-y k+1 + y µ ∥ 2 . ( ) Summing over k = 0, • • • , ∞, we have ∞ k=0 ( 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 ) ≤ 1 2β ∥λ 0 -λ µ ∥ 2 + β 2 ∥-y 0 + y µ ∥ 2 . from which we deduce that λ k+1 -λ k → 0 and -y k+1 + y k → 0. Moreover, ∥λ k -λ µ ∥ 2 and ∥-y k + y µ ∥ 2 are bounded for all k, as well as ∥λ k ∥. Since λ k+1 -λ k = β(x k+1 -y k+1 ) = β(x k+1 -x µ ) + β(-y k+1 + y µ ) we deduce that x k+1 -y k+1 → 0 and x k+1 -x µ is also bounded. From (15) and the convexity of f and g, and using Proposition 2, we have: f (x k+1 ) -f (x µ ) + g(y k+1 ) -g(y µ ) ≤f k (x k+1 ) -f k (x µ ) + g(y k+1 ) -g(y µ ) ≤ -⟨λ k+1 , x k+1 -y k+1 ⟩ + β⟨-y k+1 + y k , x k+1 -x µ ⟩ → 0. On the other hand, from (8), (9), and (10), we have: f k (x k+1 ) -f k (x µ ) + g (y k+1 ) -g (y µ ) ≥f (x k+1 ) -f (x µ ) + g (y k+1 ) -g (y µ ) ≥⟨-λ µ , x k+1 -x µ ⟩ + ⟨λ µ , y k+1 -y µ ⟩ = -⟨λ µ , x k+1 -y k+1 ⟩ → 0. 
Thus, we have f (x k+1 ) -f (x µ ) + g (y k+1 ) -g (y µ ) → 0 and f k (x k+1 ) -f k (x µ ) + g (y k+1 ) - g (y µ ) → 0, k → ∞. If F is ξ-monotone on C = , from Lemma 2 and (21) we have c∥x k+1 -x µ ∥ ξ 2 + 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 ≤ 1 2β ∥λ k -λ µ ∥ 2 - 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥-y k + y µ ∥ 2 - β 2 ∥-y k+1 + y µ ∥ 2 From this we deduce that: c∥x k+1 -x µ ∥ ξ 2 + ∞ k=0 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 ≤ 1 2β ∥λ 0 -λ µ ∥ 2 + β 2 ∥-y 0 + y µ ∥ 2 . Therefore, ∥x k+1 -x µ ∥ 2 → 0, k → ∞. Lemma 8. For Algorithm 1, we have 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 ≤ 1 2β ∥λ k -λ k-1 ∥ 2 + β 2 ∥-y k + y k-1 ∥ 2 . (25) Furthermore, if F is ξ-monotone on C = , we have c∥x k+1 -x k ∥ 2 + 1 2β ∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 ≤ 1 2β ∥λ k -λ k-1 ∥ 2 + β 2 ∥-y k + y k-1 ∥ 2 . ( ) Proof of Lemma 8. (15) gives: ⟨ ∇f k-1 (x k ) , x k -x⟩ + ⟨ ∇g (y k ) , y k -y⟩ = -⟨λ k , x k -y k -x + y⟩ + β⟨-y k + y k-1 , x k -x⟩. Letting (x, y, λ) = (x k , y k , λ k ) in ( 15) and (x, y, λ) = (x k+1 , y k+1 , λ k+1 ) in ( 27), and adding them together, and using (7), we have ⟨ ∇f k (x k+1 ) -∇f k-1 (x k ) , x k+1 -x k ⟩ + ⟨ ∇g (y k+1 ) -∇g (y k ) , y k+1 -y k ⟩ = -⟨λ k+1 -λ k , x k+1 -y k+1 -x k + y k ⟩ + β⟨-y k+1 + y k -(-y k + y k-1 ) , x k+1 -x k ⟩ = - 1 β ⟨λ k+1 -λ k , λ k+1 -λ k -(λ k -λ k-1 )⟩ +⟨-y k+1 + y k + (y k -y k-1 ) , λ k+1 -λ k + βy k+1 -(λ k -λ k-1 + βy k )⟩ = 1 2β ∥λ k -λ k-1 ∥ 2 -∥λ k+1 -λ k ∥ 2 -∥λ k+1 -λ k -(λ k -λ k-1 )∥ 2 + β 2 ∥-y k + y k-1 ∥ 2 -∥-y k+1 + y k ∥ 2 -∥-y k+1 + y k -(-y k + y k-1 )∥ 2 +⟨-y k+1 + y k -(-y k + y k-1 ) , λ k+1 -λ k -(λ k -λ k-1 )⟩ = 1 2β ∥λ k -λ k-1 ∥ 2 -∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k + y k-1 ∥ 2 -∥-y k+1 + y k ∥ 2 - 1 2β ∥λ k+1 -λ k -(λ k -λ k-1 )∥ 2 - β 2 ∥-y k+1 + y k -(-y k + y k-1 )∥ 2 +⟨-y k+1 + y k -(-y k + y k-1 ) , λ k+1 -λ k -(λ k -λ k-1 )⟩ ≤ 1 2β ∥λ k -λ k-1 ∥ 2 -∥λ k+1 -λ k ∥ 2 + β 2 ∥-y k + y k-1 ∥ 2 -∥-y k+1 + y k ∥ 2 By the convexity of f k and f k-1 , we have ⟨ ∇f k (x k+1 ) , x k+1 -x k ⟩ 
≥f k (x k+1 ) -f k (x k ) , -⟨ ∇f k-1 (x k ) , x k+1 -x k ⟩ ≥f k-1 (x k ) -f k-1 (x k+1 ) . Adding them together gives that: ⟨ ∇f k (x k+1 ) -∇f k-1 (x k ) , x k+1 -x k ⟩ ≥f k (x k+1 ) -f k-1 (x k+1 ) -f k (x k ) + f k-1 (x k ) =⟨F (x k+1 ) -F (x k ), x k+1 -x k ⟩ ≥ 0 . Thus by the monotonicity of F and ∇g, (25) follows. Theorem 7. If F is monotone on C = , then for Algorithm 1, we have -∥λ µ ∥ ∆ µ β (K + 1) ≤f (x K+1 ) + g (y K+1 ) -f (x µ ) -g (y µ ) ≤f K (x K+1 ) + g (y K+1 ) -f K (x µ ) -g (y µ ) ≤ ∆ µ K + 1 + 2∆ µ √ K + 1 + ∥λ µ ∥ ∆ µ β (K + 1) , ( ) ∥x K+1 -y K+1 ∥ ≤ ∆ µ β (K + 1) , ( ) where ∆ µ ≜ 1 β ∥λ 0 -λ µ ∥ 2 + β∥y 0 -y µ ∥ 2 . Furthermore, if F is ξ-monotone on C = , we have c∥ xK+1 -x µ ∥ ξ ≤ ∆ µ K + 1 , ( ) c∥x K+1 -x µ ∥ ξ 2 ≤ ∆ µ K + 1 + 2∆ µ √ K + 1 . ( ) where xK+1 = 1 K+1 K+1 k=1 x K . Proof of Theorem 7. Summing ( 23) over k = 0, 1, . . . , K and using the monotonicity of 1 2β ∥λ k+1λ k ∥ 2 + β 2 ∥-y k+1 + y k ∥ 2 from Lemma 8, we have: 1 β ∥λ K+1 -λ K ∥ 2 + β∥-y K+1 + y K ∥ 2 ≤ 1 K + 1 1 β ∥λ 0 -λ µ ∥ 2 + β∥-y 0 + y µ ∥ 2 (32) From the above we deduce that β∥x K+1 -y K+1 ∥ = ∥λ K+1 -λ K ∥ ≤ β∆ µ (K + 1) , ∥-y K+1 + y K ∥ ≤ ∆ µ β (K + 1) . On the other hand, (23) gives: 1 2β ∥λ k+1 -λ µ ∥ 2 + β 2 ∥-y k+1 + y µ ∥ 2 ≤ 1 2β ∥λ k -λ µ ∥ 2 + β 2 ∥-y k + y µ ∥ 2 ≤ 1 2β ∥λ 0 -λ µ ∥ 2 + β 2 ∥-y 0 + y µ ∥ 2 = 1 2 ∆ µ . Hence, we have: ∥λ K+1 -λ µt ∥ ≤ β∆ µt , ∥-y K+1 + y µt ∥ ≤ ∆ µt β . Then from ( 16) and the convexity of f and g, we have: f (x K+1 ) -f (x µ ) + g (y K+1 ) -g (y µ ) + ⟨λ µ , x K+1 -y K+1 ⟩ ≤f K (x K+1 ) -f K (x µ ) + g (y K+1 ) -g (y µ ) + ⟨λ µ , x K+1 -y K+1 ⟩ ≤ 1 β ∥λ K+1 -λ µ ∥∥λ K+1 -λ K ∥ + ∥-y K+1 + y K ∥∥λ K+1 -λ K ∥ +β∥-y K+1 + y K ∥∥-y K+1 + y µ ∥ ≤ ∆ µ K + 1 + 2∆ µ √ K + 1 . ( ) From Lemma 3, we have (28). 
If in addition F is ξ-monotone on C = , using (24), we can obtain the following inequality similar to (32): c K k=0 ∥x k+1 -x µ ∥ ξ 2 K + 1 + 1 β ∥λ K+1 -λ K ∥ 2 + β∥-y K+1 + y K ∥ 2 ≤ 1 K + 1 1 β ∥λ 0 -λ µ ∥ 2 + β∥-y 0 + y µ ∥ 2 By the convexity of ∥•∥ ξ 2 we have c∥ xK+1 -x µ ∥ ξ 2 ≤ ∆ µ K+1 . And from (35) we can see that c∥x K+1 -x µ ∥ ξ 2 ≤ ∆ µ K+1 + 2∆ µ √ K+1 . Theorem 8. If F is monotone on C = , then for Algorithm 1, we have |f ( xK+1 ) + g ( ŷK+1 ) -f (x µ ) -g (y µ )| ≤ ∆ µ 2 (K + 1) + 2 √ β∆ µ ∥λ µ ∥ β (K + 1) , ∥ xK+1 -ŷK+1 ∥ ≤ 2 √ β∆ µ β (K + 1) where xK+1 = 1 K+1 K+1 k=1 x k , ŷK+1 = 1 K+1 K+1 k=1 y k , and ∆ µ ≜ 1 β ∥λ 0 -λ µ ∥ 2 + β∥y 0 -y µ ∥ 2 . Proof of Theorem 8. Summing (20) over k = 0, 1, . . . , K, dividing both sides with K + 1, and using the definitions of xK+1 and ŷK+1 and the convexity of f and g, we have f ( xK+1 ) + g ( ŷK+1 ) -f (x µ ) -g (y µ ) + ⟨λ µ , xK+1 -ŷK+1 ⟩ ≤ ∆ µ 2 (K + 1) . From ( 7) and ( 33), we have: ∥ xK+1 -ŷK+1 ∥ = 1 β (K + 1) ∥ K k=0 (λ k+1 -λ k )∥ = 1 β (K + 1) ∥λ K+1 -λ 0 ∥ ≤ 1 β (K + 1) (∥λ 0 -λ µ ∥ + ∥λ K+1 -λ µ ∥) ≤ 2 √ β∆ µ β (K + 1) Finally, from Lemma 3, the conclusion follows. From Proposition 1 and the fact that C is compact we can see that lim  ⋆ = lim µ→0 λ µ , then x ⋆ ∈ S ⋆ C,F . Theorem 9. ∃μ > 0, s.t. if µ t < μ, then F (x (t) K+1 ) ⊺ (x (t) K+1 -x ⋆ ) ≤ 2( ∆ µ K+1 + 2∆ µ √ K+1 + ∥λ ⋆ ∥ ∆ µ β(K+1) ), F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) ≤ 2( ∆ µ K+1 + 2∆ µ √ K+1 + ∥λ ⋆ ∥ ∆ µ β(K+1) ) and F (x ⋆ ) ⊺ ( x(t) K+1 -x ⋆ ) ≤ 2( ∆ µ 2(K+1) + 2 √ β∆ µ ∥λ ⋆ ∥ β(K+1) ), ∀K ≥ 0. Proof of Theorem 9. For simplicity we let B(x) denote the log-barrier term -m i=1 log(-φ i (x)). 
From Theorem 7 and 8 we have F (x µ ) ⊺ (x (t) K+1 -x µ ) + µ(B(y (t) K+1 ) -B(y µ )) = f (x (t) K+1 ) -f (x µ ) + g(y (t) K+1 ) -g(y µ ) ≤ ∆ µ K + 1 + 2∆ µ √ K + 1 + ∥λ µ ∥ ∆ µ β (K + 1) (37) F (x K+1 ) ⊺ (x (t) K+1 -x µ ) + µ(B(y (t) K+1 ) -B(y µ )) = f K (x (t) K+1 ) -f K (x µ ) + g(y (t) K+1 ) -g(y µ ) ≤ ∆ µ K + 1 + 2∆ µ √ K + 1 + ∥λ µ ∥ ∆ µ β (K + 1) (38) F (x µ ) ⊺ ( x(t) K+1 -x µ ) + µ(B( ŷ(t) K+1 ) -B(y µ )) = f ( x(t) K+1 ) -f (x µ ) + g( ŷ(t) K+1 ) -g(y µ ) ≤ ∆ µ 2 (K + 1) + 2 √ β∆ µ ∥λ µ ∥ β (K + 1) . ( ) From Proposition 1 in App. A we know that when µ → 0, x µ → x ⋆ , λ µ → λ ⋆ and µ(B(y (t) K+1 ) - B(y µ )) → 0, so ∃μ > 0, s.t. if µ t < μ, then we have F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) ≤ 2 ∆ µ K + 1 + 2∆ µ √ K + 1 + ∥λ ⋆ ∥ ∆ µ β (K + 1) F (x (t) K+1 ) ⊺ (x (t) K+1 -x ⋆ ) ≤ 2 ∆ µ K + 1 + 2∆ µ √ K + 1 + ∥λ ⋆ ∥ ∆ µ β (K + 1) F (x ⋆ ) ⊺ ( x(t) K+1 -x ⋆ ) ≤ 2 ∆ µ 2 (K + 1) + 2 √ β∆ µ ∥λ ⋆ ∥ β (K + 1) . To make the dependencies of the constants clear, here we restate Theorem 2 and Theorem 3 and provide their proofs. Theorem 10 (restatement of Theorem 2). Given an operator F : X → R n monotone on C = (Def. 1), and either F is strictly monotone on C or one of φ i is strictly convex. Assume there exists r > 0 or s > 0 such that F is star-ξ-monotone-as per Def. 1-on either Ĉr or Cs , resp. Let ∆ ≜ 1 β ∥λ 0 -λ ⋆ ∥ 2 + β∥y 0 -y ⋆ ∥ 2 . Let x (t) K and x(t) K ≜ 1 K K k=1 x (t) k denote the last and average iterate of Algorithm 1, respectively, run with sufficiently small µ -1 . Then there exists K 0 ∈ N, K 0 depends on r or s, s.t.∀K > K 0 , for all t ∈ [T ], we have that: 1. ∥x (t) K -x ⋆ ∥ ≤ ( 4 c ( ∆ K + 2∆ √ K + ∥λ ⋆ ∥ ∆ βK )) 1/ξ . 2. If in addition F is ξ-monotone on C = , we have ∥ x(t) K -x ⋆ ∥ ≤ ( 2∆ cK ) 1/ξ . 3. Moreover, if F is L-Lipschitz on C = -as per Assumption 1-then G(x (t) K , C) ≤ M 0 ( 4 c ( ∆ K + 2∆ √ K + ∥λ ⋆ ∥ ∆ βK )) 1/ξ and G( x(t) K , C) ≤ M 0 2∆ c(K+1) 1/ξ , where M 0 = DL + M is a linear function of L, see the proof of Lemma 1 in App. B. 
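Proof of Theorem 10. Without loss of generality, we suppose that F is star-ξ-monotone on $\hat{\mathcal{C}}_r$. Since $\varphi(y_k) \leq 0$, from (29) we can see that $\exists K_0 \in \mathbb{N}$, where $K_0$ depends on r, s.t. $\forall K \geq K_0$, $x_K \in \hat{\mathcal{C}}_r$. So if $\mu_t < \bar{\mu}$ ($\bar{\mu}$ is defined in Theorem 9), by Theorem 9 we have

$$c \lVert x^{(t)}_{K+1} - x^\star \rVert^\xi \leq \big(F(x^{(t)}_{K+1}) - F(x^\star)\big)^\intercal (x^{(t)}_{K+1} - x^\star) \leq F(x^\star)^\intercal (x^{(t)}_{K+1} - x^\star) + F(x^{(t)}_{K+1})^\intercal (x^{(t)}_{K+1} - x^\star) \leq 4\left(\frac{\Delta}{K+1} + \frac{2\Delta}{\sqrt{K+1}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta(K+1)}}\right).$$

So we have

$$\lVert x^{(t)}_{K+1} - x^\star \rVert \leq \left(\frac{4}{c}\left(\frac{\Delta}{K+1} + \frac{2\Delta}{\sqrt{K+1}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta(K+1)}}\right)\right)^{1/\xi}.$$

If in addition F is ξ-monotone on $\mathcal{C}_=$, then from (30) we know that when $\mu$ is small enough, we have

$$\lVert \hat{x}^{(t)}_{K+1} - x^\star \rVert \leq \left(\frac{2\Delta}{c(K+1)}\right)^{1/\xi}.$$

If F is L-Lipschitz on $\mathcal{C}_=$, then from Lemma 1 we can see that

$$G(x^{(t)}_{K+1}, \mathcal{C}) \leq M_0 \left(\frac{4}{c}\left(\frac{\Delta}{K+1} + \frac{2\Delta}{\sqrt{K+1}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta(K+1)}}\right)\right)^{1/\xi}, \qquad G(\hat{x}^{(t)}_{K+1}, \mathcal{C}) \leq M_0 \left(\frac{2\Delta}{c(K+1)}\right)^{1/\xi}.$$

Theorem 11 (restatement of Theorem 3). Given an operator $F : \mathcal{X} \to \mathbb{R}^n$, assume: (i) F is monotone on $\mathcal{C}_=$, as per Def. 1; (ii) either F is strictly monotone on $\mathcal{C}$ or one of the $\varphi_i$ is strictly convex; and (iii) $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} = a > 0$, where $\mathcal{S} \equiv \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$. Let $\Delta \triangleq \frac{1}{\beta} \lVert \lambda_0 - \lambda^\star \rVert^2 + \beta \lVert y_0 - y^\star \rVert^2$. Let $x^{(t)}_K$ and $\hat{x}^{(t)}_K \triangleq \frac{1}{K} \sum_{k=1}^K x^{(t)}_k$ denote the last and average iterate of Algorithm 1, respectively, run with sufficiently small $\mu_{-1}$. Then there exists $K_0 \in \mathbb{N}$, where $K_0$ depends on r or s, s.t. $\forall t \in [T]$, $\forall K > K_0$, we have that:

1. $\lVert x^{(t)}_K - x^\star \rVert \leq \frac{2}{a}\left(\frac{\Delta}{K} + \frac{2\Delta}{\sqrt{K}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta K}}\right)$.
2. If in addition $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x^\star)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} = b > 0$ (with $\mathcal{S} \equiv \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$), then $\lVert \hat{x}^{(t)}_K - x^\star \rVert \leq \frac{2}{b}\left(\frac{\Delta}{2K} + \frac{2\sqrt{\beta\Delta}\, \lVert \lambda^\star \rVert}{\beta K}\right)$.
3. Moreover, if F is L-Lipschitz on $\mathcal{C}_=$, as per Assumption 1, then $G(x^{(t)}_K, \mathcal{C}) \leq \frac{2M_0}{a}\left(\frac{\Delta}{K} + \frac{2\Delta}{\sqrt{K}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta K}}\right)$ and $G(\hat{x}^{(t)}_K, \mathcal{C}) \leq \frac{2M_0}{b}\left(\frac{\Delta}{2K} + \frac{2\sqrt{\beta\Delta}\, \lVert \lambda^\star \rVert}{\beta K}\right)$, where $M_0 = DL + M$ is a linear function of L; see the proof of Lemma 1 in App. B.

Proof of Theorem 11. Without loss of generality, we suppose $\inf_{x \in \hat{\mathcal{C}}_r \setminus \{x^\star\}} F(x)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} = a > 0$. When $K \geq K_0$ and $\mu_t < \bar{\mu}$ ($K_0$ and $\bar{\mu}$ are defined in the proof of Theorem 10 and in Theorem 9, resp.), by Theorem 9 we have that

$$\lVert x^{(t)}_{K+1} - x^\star \rVert \leq \frac{1}{a} F(x^{(t)}_{K+1})^\intercal (x^{(t)}_{K+1} - x^\star) \leq \frac{2}{a}\left(\frac{\Delta}{K+1} + \frac{2\Delta}{\sqrt{K+1}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta(K+1)}}\right).$$

Similarly, if $\inf_{x \in \hat{\mathcal{C}}_r \setminus \{x^\star\}} F(x^\star)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} = b > 0$, we have that

$$\lVert \hat{x}^{(t)}_{K+1} - x^\star \rVert \leq \frac{2}{b}\left(\frac{\Delta}{2(K+1)} + \frac{2\sqrt{\beta\Delta}\, \lVert \lambda^\star \rVert}{\beta(K+1)}\right).$$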
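If F is L-Lipschitz on $\mathcal{C}_=$, then from Lemma 1 we can see that

$$G(x^{(t)}_{K+1}, \mathcal{C}) \leq \frac{2M_0}{a}\left(\frac{\Delta}{K+1} + \frac{2\Delta}{\sqrt{K+1}} + \lVert \lambda^\star \rVert \sqrt{\frac{\Delta}{\beta(K+1)}}\right), \qquad G(\hat{x}^{(t)}_{K+1}, \mathcal{C}) \leq \frac{2M_0}{b}\left(\frac{\Delta}{2(K+1)} + \frac{2\sqrt{\beta\Delta}\, \lVert \lambda^\star \rVert}{\beta(K+1)}\right).$$

From the above proofs, we can see that Theorems 2 and 3 hold when $\mu_{-1}$ is set small enough. Since we do not know exactly how small $\mu_{-1}$ should be, in practice we can set a small $\mu_{-1}$; because $\mu_t$ decays exponentially, it becomes smaller than any fixed positive number after a very small number of outer loops. Thus, Algorithm 1 is actually parameter-free.

Note that assumption (iii) in Theorem 3 is a weakening of the assumption $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x^\star)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} > 0$, where $\mathcal{S} = \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$. In fact, by the monotonicity of F, we have $F(x)^\intercal (x - x^\star) \geq F(x^\star)^\intercal (x - x^\star)$, so if $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x^\star)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} > 0$, then also $\inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} > 0$.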
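The bounds above are stated for the gap function $G(x, \mathcal{C}) = \max_{x' \in \mathcal{C}} \langle F(x), x - x' \rangle$. For simple constraint sets this quantity can be evaluated in closed form. The sketch below does so for a box, as a hedged illustration: the box constraint, the operator, and the helper name `gap_box` are assumptions chosen for the example, not the setting of the theorems.

```python
import numpy as np

def gap_box(F, x, lo, hi):
    """Gap function G(x, C) = max_{x' in C} <F(x), x - x'> for the box C = [lo, hi].

    The inner maximization is linear in x', so it is attained at a vertex:
    coordinate i of the maximizer is lo_i when F_i(x) > 0 and hi_i otherwise.
    """
    g = F(x)
    x_best = np.where(g > 0, lo, hi)
    return float(g @ (x - x_best))

# Hypothetical toy operator on the box [0, 1]^2; its VI solution is the origin.
A = np.array([[0.1, 1.0], [-1.0, 0.1]])
F = lambda x: A @ x
lo, hi = np.zeros(2), np.ones(2)

gap_at_solution = gap_box(F, np.zeros(2), lo, hi)
gap_elsewhere = gap_box(F, np.array([0.5, 0.5]), lo, hi)
print(gap_at_solution, gap_elsewhere)
```

The gap is zero at the solution and positive elsewhere in the box, which is why it serves as a convergence measure in Theorems 2 and 3.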

B.4 PARAMETER-FREE CONVERGENCE RATE

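In this section, we give a convergence rate that takes into account both the inner and the outer loop when the operator F is L-Lipschitz. First, we bound $\lVert x^\star - x_\mu \rVert$ under the assumptions of Theorem 2 and Theorem 3, resp., and give the following lemma:

Lemma 9. Let F be monotone. Then for any $\mu > 0$, we have: (i) if F is star-ξ-monotone (Def. 1), then $\lVert x^\star - x_\mu \rVert \leq \big(\frac{m}{c} \mu\big)^{\frac{1}{\xi}}$; (ii) if $a \triangleq \inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} > 0$, where $\mathcal{S} \equiv \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$, then $\lVert x^\star - x_\mu \rVert \leq \frac{m}{a} \mu$.

Proof. Consider the convex problem

$$\min_x\ F(x_\mu)^\intercal x \quad \text{s.t.}\quad \varphi(x) \leq 0,\ Cx = d. \tag{P}$$

The Lagrangian of (P) is $L(x, \lambda, \nu) = F(x_\mu)^\intercal x + \lambda^\intercal \varphi(x) + \nu^\intercal (Cx - d)$. There exist $\tilde{\lambda}_\mu > 0$ and $\nu_\mu$ s.t. $(x_\mu, \tilde{\lambda}_\mu, \nu_\mu)$ is a KKT point of (KKT-2). By the stationarity condition in (KKT-2), the dual function satisfies

$$g(\tilde{\lambda}_\mu, \nu_\mu) = \inf_x L(x, \tilde{\lambda}_\mu, \nu_\mu) = F(x_\mu)^\intercal x_\mu + \tilde{\lambda}_\mu^\intercal \varphi(x_\mu) + \nu_\mu^\intercal \underbrace{(C x_\mu - d)}_{=0}.$$

Note that by the complementary slackness condition in (KKT-2), we have $\tilde{\lambda}_\mu^\intercal \varphi(x_\mu) = -m\mu$. Since $x^\star$ is a feasible point of (P) and $(\tilde{\lambda}_\mu, \nu_\mu)$ is dual feasible, by duality theory we have:

$$F(x_\mu)^\intercal x^\star \geq g(\tilde{\lambda}_\mu, \nu_\mu) = F(x_\mu)^\intercal x_\mu - m\mu,$$

from which we deduce that

$$F(x_\mu)^\intercal (x_\mu - x^\star) \leq m\mu. \tag{40}$$

Therefore:

(i) If F is star-ξ-monotone: since $x^\star \in \mathcal{S}^\star_{\mathcal{C},F}$, we have

$$F(x^\star)^\intercal (x_\mu - x^\star) \geq 0. \tag{41}$$

Subtracting (41) from (40) and using the star-ξ-monotonicity of F, we obtain

$$c \lVert x_\mu - x^\star \rVert^\xi \leq \big(F(x_\mu) - F(x^\star)\big)^\intercal (x_\mu - x^\star) \leq m\mu.$$

(ii) If $a \triangleq \inf_{x \in \mathcal{S} \setminus \{x^\star\}} F(x)^\intercal \frac{x - x^\star}{\lVert x - x^\star \rVert} > 0$, where $\mathcal{S} \equiv \hat{\mathcal{C}}_r$ or $\mathcal{C}_s$: since $x_\mu \in \mathcal{S}$, by (40) we have that $a \lVert x_\mu - x^\star \rVert \leq m\mu$.

We give another lemma that will be used in the proofs of our main results in this section.

Lemma 10.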
Assume F is monotone and L-smooth (Assumption 1), then we have F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + mµ t +L ∥x µt -x ⋆ ∥ ∥x µt -x ⋆ ∥ + ∆ µt β (K + 1) + ∆ µt β , F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + M ∥x µt -x ⋆ ∥ , and F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) ≤ ∆ µt 2 (K + 1) + 2 √ β∆ µt ∥λ µt ∥ β (K + 1) + mµ t +L ∥x µt -x ⋆ ∥ ∥x µt -x ⋆ ∥ + 2 √ β∆ µt β (K + 1) + ∆ µt β , where M = sup x∈C ∥F (x)∥, and B(x) = -m i=1 log(-φ i (x)). Proof. Note that x (t) K+1 -x µt = x (t) K+1 -y µt ≤ x (t) K+1 -y (t) K+1 + y (t) K+1 -y µt . ( ) Recall that (29) gives ∥x (t) K+1 -y (t) K+1 ∥ ≤ ∆ µt β (K + 1) , and (34) gives y (t) K+1 -y µt = ∆ µt β . Plugging ( 29) and ( 34) into (45), we have x (t) K+1 -x µt ≤ ∆ µt β (K + 1) + ∆ µt β . ( ) Note that F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) =F (x µt ) ⊺ (x (t) K+1 -x µ t ) + µ t (B(y (t) K+1 ) -B(y µt )) + F (x µ ) ⊺ (x µ -x ⋆ ) + (F (x ⋆ ) -F (x µt )) ⊺ (x (t) K+1 -x ⋆ ) . Thus by the L-Lipschitzness of F , using ( 40), ( 46) and (37) in the proof of Thm. 9, we have F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) ≤ F (x µt ) ⊺ (x (t) K+1 -x µt ) + µ t (B(y (t) K+1 ) -B(y µt )) + |F (x µ t ) ⊺ (x µ t -x ⋆ )| + (F (x ⋆ ) -F (x µt )) ⊺ (x (t) K+1 -x ⋆ ) ≤ F (x µt ) ⊺ (x (t) K+1 -x µt ) + µ t (B(y (t) K+1 ) -B(y µt )) + |F (x µt ) ⊺ (x µt -x ⋆ )| + L ∥x µt -x ⋆ ∥ x µt -x (t) K+1 + ∥x µt -x ⋆ ∥ ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + mµ t +L ∥x µt -x ⋆ ∥ ∥x µt -x ⋆ ∥ + ∆ µt β (K + 1) + ∆ µt β . 
Using (38), we have F (x (t) K+1 ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) = F (x µt ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) + F (x (t) K+1 ) ⊺ (x µt -x ⋆ ) ≤ F (x µt ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) + M ∥x µt -x ⋆ ∥ ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + M ∥x µt -x ⋆ ∥ . Similarly, recall that (36) gives ∥ xK+1 -ŷK+1 ∥ ≤ 2 √ β∆ µt β (K + 1) . By ( 34) and the convexity of ∥•∥, we have ŷ(t) K+1 -y µt = 1 K + 1 K+1 k=1 (y (t) k+1 -y µt ) ≤ ∆ µt β . ( ) Note that Thus we have x(t) K+1 -x µt = x(t) K+1 -y µt ≤ x(t) K+1 - ŷ(t) K+1 + ŷ(t) K+1 -y µt . ( ) Plugging ( 36) and ( 47) into (48), we have x(t) K+1 -x µt ≤ 2 √ β∆ µt β (K + 1) + ∆ µt β . ( ) Using the L-Lipschitzness of F , ( 39) and ( 49), we have F (x ⋆ ) ⊺ ( x(t) K+1 -x ⋆ ) + µ t (B( ŷ(t) K+1 ) -B(y µt )) ≤ F (x µt ) ⊺ ( x(t) K+1 -x µt ) + µ t (B( ŷ(t) K+1 ) -B(y µt )) + |F (x µt ) ⊺ (x µt -x ⋆ )| + (F (x ⋆ ) -F (x µt )) ⊺ ( x(t) K+1 -x ⋆ ) ≤ F (x µt ) ⊺ ( x(t) K+1 -x µ t ) + µ t (B( ŷ(t) K+1 ) -B(y µt )) + |F (x µt ) ⊺ (x µt -x ⋆ )| + L ∥x µt -x ⋆ ∥ x µt - x(t) K+1 + ∥x µt -x ⋆ ∥ ≤ ∆ µt 2 (K + 1) + 2 √ β∆ µt ∥λ µt ∥ β (K + 1) + mµ t + L ∥x µt -x ⋆ ∥ ∥x µt -x ⋆ ∥ + 2 √ β∆ µt β (K + 1) + ∆ µt β . Now we are ready to give our main theorems in this section. Theorem 12 (Complete convergence rate for star-ξ-monotone operator). Given an operator F : X → R n monotone and L-Lipschitz on C = (Def. 1, 1), and either F is strictly monotone on C or one of φ i is strictly convex. Assume there exists r > 0 or s > 0 such that F is star-ξ-monotone-as per Def. 1-on either Ĉr or Cs , resp. Let ∆ µt ≜ 1 β ∥λ 0 -λ µt ∥ 2 + β∥y 0 -y µt ∥ 2 , Let x (t) K and x(t) K ≜ 1 K K k=1 x (t) k denote the last and average iterate of Algorithm 1, respectively, run with sufficiently small µ t-1 . Then there exists K 0 ∈ N, K 0 depends on r or s, s.t.∀K > K 0 , for all t ∈ [T ], we have that: 1. 
x (t) K -x ⋆ ≤ 2 c ∆ µ t K + 2∆ µ t √ K + ∥λ µt ∥ ∆ µ t βK + LM1 c m c µ -1 1 ξ δ t+1 ξ 1 ξ , and G(x (t) K , C) ≤ M 0 2 c ∆ µ t K + 2∆ µ t √ K + ∥λ µt ∥ ∆ µ t βK + LM1 c m c µ -1 1 ξ δ t+1 ξ 1 ξ . 2. If in addition F is ξ-monotone on C = , we have ∥ x(t) K -x ⋆ ∥ ≤ ∆ µ t cK 1 ξ + m c µ t 1 ξ , G( x(t) K , C) ≤ M 0 ∆ µ t cK 1/ξ + m c µ t 1 ξ 1 ξ , where M 0 = DL + M is a linear function of L, see the proof of Lemma 1 in App. B, and M 1 ≜ m c µ -1 1 ξ + 2 ∆ µ t β + M L . Proof. By the star-ξ-monotonicity of course F , using ( 42) and ( 43) in Lemma 10, we have c (t) K+1 -x ⋆ ξ ≤(F (x (t) K ) -F (x ⋆ )) ⊺ (x (t) K+1 -x ⋆ ) =F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) -(F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt ))) ≤∥F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )∥ + ∥F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )∥ ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + M ∥x µt -x ⋆ ∥ + ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + mµ t + L ∥x µt -x ⋆ ∥ ∥x µt -x ⋆ ∥ + ∆ µt β (K + 1) + ∆ µt β ≤2 ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + L ∥x µt -x ⋆ ∥ ∥x µt -x ⋆ ∥ + ∆ µt β (K + 1) + ∆ µt β + M L + mµ t ≤2 ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + L m c µ t 1 ξ m c µ t 1 ξ + ∆ µt β (K + 1) + ∆ µt β + M L where in the last inequality we use Lemma 9 (i). We let M 1 ≜ m c µ -1 1 ξ + 2 ∆ µ t β + M L , then we have x (t) K+1 -x ⋆ ≤ 2 c ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + LM 1 c m c µ t 1 ξ 1 ξ . By Lemma 1, we obtain G(x (t) K+1 , C) ≤ M 0 2 c ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + LM 1 c m c µ -1 1 ξ δ t+1 ξ 1 ξ . If in addition F is ξ-monotone on C = , then from (30) we have: ∥ x(t) K+1 -x ⋆ ∥ ≤ ∆ µt c(K + 1) 1 ξ + m c µ t 1 ξ . Again by Lemma 1, we have G( x(t) K+1 , C) ≤ M 0 ∆ µ t c(K+1) 1 ξ + m c µ t 1 ξ . Theorem 13 (Complete convergence rate for star-ξ-monotone operator). 
Given an operator F : X → R n , assume: (i) F is monotone and L-smooth on C = , as per Def. 1, 1; (ii) either F is strictly monotone on C or one of φ i is strictly convex; and (iii) inf x∈S\{x ⋆ } F (x) ⊺ x-x ⋆ ∥x-x ⋆ ∥ = a > 0, where S ≡ Ĉr or Cs . Let ∆ µt ≜ 1 β ∥λ 0 -λ µt ∥ 2 + β∥y 0 -y µt ∥ 2 . Let (t) K and x(t) K ≜ 1 K K k=1 x (t) k denote the last and average iterate of Algorithm 1, respectively. Then there exists K 0 ∈ N, K 0 depends on r or s, s.t.∀t ∈ [T ], ∀K > K 0 , we have that: 1. x (t) K -x ⋆ = 1 a ∆ µ t K + 2∆ µ t √ K + ∥λ µt ∥ ∆ µ t βK + mM a 2 + B(y µ t )-B ⋆ a µ -1 δ t+1 , and G(x (t) K , C) ≤ M 0 1 a ∆ µ t K + 2∆ µ t √ K + ∥λ µt ∥ ∆ µ t βK + mM a 2 + B(y µ t )-B ⋆ a µ -1 δ t+1 . 2. If in addition inf x∈S\{x ⋆ } F (x ⋆ ) ⊺ x-x ⋆ ∥x-x ⋆ ∥ = b > 0 (with S ≡ Ĉr or Cs ), then x(t) K -x ⋆ ≤ 1 b ∆ µ t 2K + 2 √ β∆ µ t ∥λ µ t ∥ β(K) + M3 b µ -1 δ t+1 , and G( x(t) K , C) ≤ M 0 1 b ∆ µ t 2K + 2 √ β∆ µ t ∥λ µ t ∥ + M3 b µ -1 δ t+1 , where M = sup x∈C ∥F (x)∥, M 0 = DL + M is a linear function of L, see the proof of Lemma 1 in App. B, and M 3 ≜ m + Lm a m a µ -1 + 2 √ β∆ µ t β(K+1) + ∆ µ t β + B(y µt ) -B ⋆ . Proof. If the barrier term B(y (t) K+1 ) ≥ B(y µt ), using (43) in Lemma 10, we have F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) ≤F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + M ∥x µt -x ⋆ ∥ . If B(y (t) K+1 ) ≤ B(y µt ), since B is lower bounded in any compact set, and by (46) in Lemma 9 we have x(t) K+1 -x µt ≤ 2 √ β∆ µ t β(K+1) + ∆ µ t β ≤ 3 ∆ µ t β ≜ M 2 , we let B ⋆ = inf ∥x-x ⋆ ∥≤M2 B(x). Then we have F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) ≤ F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) + µ t (B(y µt ) -B ⋆ ) ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + M ∥x µt -x ⋆ ∥ + µ t (B(y µt ) -B ⋆ ) . Thus in both case we have a x (t) K+1 -x ⋆ ≤F (x (t) K ) ⊺ (x (t) K+1 -x ⋆ ) ≤ ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + mM a µ t + µ t (B(y µt ) -B ⋆ ) . 
where in the first inequality assumption (iii) is used and in the last inequality Lemma 9 is used. Thus we have: x (t) K+1 -x ⋆ = 1 a ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + mM a 2 + B(y µt ) -B ⋆ a µ t . And by Lemma 1, we obtain G(x (t) K+1 , C) ≤ M 0 1 a ∆ µt K + 1 + 2∆ µt √ K + 1 + ∥λ µt ∥ ∆ µt β (K + 1) + mM a 2 + B(y µt ) -B ⋆ a µ t . If in addition inf x∈S\{x ⋆ } F (x ⋆ ) ⊺ x-x ⋆ ∥x-x ⋆ ∥ = b > 0, then similarly, from (44) in Lemma 10, we have b x(t) K+1 -x ⋆ ≤ F (x ⋆ ) ⊺ (x (t) K+1 -x ⋆ ) + µ t (B(y (t) K+1 ) -B(y µt )) + µ t (B(y µt ) -B ⋆ ) ≤ ∆ µt 2 (K + 1) + 2 √ β∆ µt ∥λ µt ∥ β (K + 1) + mµ t + L m a µ t m a µ t + 2 √ β∆ µt (K + 1) + ∆ µt β + µ t (B(y µt ) -B ⋆ ) = ∆ µt 2 (K + 1) + 2 √ β∆ µt ∥λ µt ∥ β (K + 1) + m + Lm a m a µ -1 + 2 √ β∆ µt β (K + 1) + ∆ µt β + B(y µt ) -B ⋆ µ t , Let M 3 ≜ m + Lm a m a µ -1 + 2 √ β∆ µ t β(K+1) + ∆ µ t β + B(y µt ) -B ⋆ , then we have x(t) K+1 -x ⋆ ≤ 1 b ∆ µt 2 (K + 1) + 2 √ β∆ µt ∥λ µt ∥ β (K + 1) + M 3 b µ t . By Lemma 1, we have G(x (t) K+1 , C) ≤ 1 b ∆ µt 2 (K + 1) + 2 √ β∆ µt ∥λ µt ∥ β (K + 1) + M 3 b µ t . Remark 5. In Theorem 12, for any t ≥ O(ln K), we have x (t) K -x ⋆ , G(x (t) K , C) ≤ O( 1 K 1/(2ξ) ), and x(t) K -x ⋆ , G( x(t) K , C) ≤ O( 1 K 1/ξ ). Similarly, in Theorem 13, for any t ≥ O(ln K), we have x (t) K -x ⋆ , G(x (t) K , C) ≤ O( 1 √ K ), and x(t) K -x ⋆ , G( x(t) K , C) ≤ O( 1 K ).
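The threshold t ≥ O(ln K) in Remark 5 can be made explicit with a short calculation. The sketch below assumes that the barrier parameter decays geometrically as µ_t = δ^{t+1} µ_{-1}, which is how the factors µ_{-1} δ^{t+1} in Theorems 12 and 13 read; this schedule is an assumption here, since Algorithm 1 itself is not reproduced in this section.

```latex
\mu_t = \delta^{t+1}\mu_{-1} \le \frac{\mu_{-1}}{K}
\;\iff\; (t+1)\,\ln\frac{1}{\delta} \ge \ln K
\;\iff\; t \ge \frac{\ln K}{\ln(1/\delta)} - 1 .
```

For instance, with δ = 0.5 and K = 1000 this gives t ≥ 9, since 0.5^{10} = 1/1024 ≤ 1/1000. Under this reading, once t passes the threshold the µ_t-dependent terms in the bounds are of order 1/K.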

B.5 DISCUSSION ON ALGORITHM 1

Advantages and disadvantages of Algorithm 1. In Algorithm 1, the update of x (step 8) requires solving (W-EQ). Our method is especially suitable for problems where (W-EQ) is easy to solve analytically. This includes, for example, the class of affine variational inequalities, low-dimensional problems, and problems where the optimization variables represent probabilities. For problems where (W-EQ) is hard to solve (for example, min-max optimization problems in GANs), one could use unconstrained methods such as GDA or EG (without projection) to solve VI(R^n, G), where G is defined in (3). Algorithm 4 describes ACVI when using an unconstrained solver for the inner problems. We observe that Algorithm 4 outperforms the projection-based baseline algorithms; see for example Fig. 4. Note that when there are constraints, projection-based methods such as GDA, EG, and OGDA require solving a quadratic programming problem in each iteration (or twice per iteration for EG). This problem is often nontrivial, and solving it may require using an interior point method or variants (such as the Frank-Wolfe algorithm). Because of this, Algorithm 1 can be seen as an orthogonal approach to projection-based methods, or in other words, as a complementary tool for solving (cVI) that is particularly relevant when the constraints are nontrivial.

ACVI with only equality or inequality constraints. If there are no equality constraints, then $\mathcal{C}_=$ becomes $\mathbb{R}^n$. In this case, $x^{(t)}_{k+1}$ is the solution of:

$$x + \frac{1}{\beta}\left(F(x) + \lambda^{(t)}_k\right) - y^{(t)}_k = 0 .$$

When there are no inequality constraints, we let $y_k = x_k$ and $\lambda_k = 0$ for every k, and we can remove the outer loop. Thus, Algorithm 1 can be simplified to update only one variable x in each iteration, with the following rule: $x_{k+1}$ is the unique solution of

$$x + \frac{1}{\beta} P_c F(x) - P_c x_k - d_c = 0 .$$
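To illustrate the equality-only update rule, the following minimal sketch specializes it to an affine operator F(x) = Ax + b, for which the rule reduces to a linear system. The matrices A, b, C, d and the helper names are hypothetical choices for illustration, and P_c and d_c follow the definitions used in the algorithm listings (P_c = I - C^T (C C^T)^{-1} C, d_c = C^T (C C^T)^{-1} d). It is not the authors' implementation.

```python
import numpy as np

def affine_projection_terms(C, d):
    """Return P_c = I - C^T (C C^T)^{-1} C and d_c = C^T (C C^T)^{-1} d."""
    CCt_inv = np.linalg.inv(C @ C.T)
    P_c = np.eye(C.shape[1]) - C.T @ CCt_inv @ C
    d_c = C.T @ CCt_inv @ d
    return P_c, d_c

def equality_only_step(A, b, P_c, d_c, x_k, beta):
    """Solve x + (1/beta) P_c F(x) - P_c x_k - d_c = 0 for affine F(x) = A x + b.

    Substituting F gives (I + P_c A / beta) x = P_c x_k + d_c - P_c b / beta.
    """
    n = x_k.shape[0]
    lhs = np.eye(n) + (P_c @ A) / beta
    rhs = P_c @ x_k + d_c - (P_c @ b) / beta
    return np.linalg.solve(lhs, rhs)

# Hypothetical toy instance: F(x) = A x on the line x_1 + x_2 = 1.
A = np.array([[1.0, 1.0], [-1.0, 1.0]])  # symmetric part is the identity, so F is monotone
b = np.zeros(2)
C = np.array([[1.0, 1.0]])
d = np.array([1.0])

P_c, d_c = affine_projection_terms(C, d)
x = np.array([1.0, 0.0])
for _ in range(50):
    x = equality_only_step(A, b, P_c, d_c, x, beta=0.5)
```

On this toy instance the iterates stay on the constraint Cx = d and approach (0, 1), the point where F(x) is orthogonal to the null space of C.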

C VARIANT OF THE ACVI ALGORITHM (V-ACVI)

The presented approach of combining interior point methods with ADMM can be used as a framework to derive additional algorithms that may be more suitable for some specific problems. More precisely, observe from Eq. 1 that we could also consider a different splitting than that in § 4. Following this approach, we present a variant of Algorithm 1 and discuss its advantages and disadvantages relative to Algorithm 1.

C.1 INTRODUCTION OF THE VARIANT ACVI

Deriving the v-ACVI algorithm. By considering a different splitting in (1) we get:

$$\begin{cases} \min_{x,y} & F(w)^\top x - \mu \sum_{i=1}^m \log\left(-\varphi_i(x)\right) + \mathbf{1}[Cy = d] \\ \text{s.t.} & x = y , \end{cases} \tag{50}$$

where

$$\mathbf{1}[Cy = d] = \begin{cases} 0, & \text{if } Cy = d \\ +\infty, & \text{if } Cy \ne d . \end{cases}$$

The augmented Lagrangian of (50) is thus:

$$\mathcal{L}_\beta(x, y, \lambda) = F(w)^\top x - \mu \sum_{i=1}^m \log(-\varphi_i(x)) + \mathbf{1}[Cy = d] + \langle \lambda, x - y \rangle + \frac{\beta}{2}\|x - y\|^2 , \tag{AL-CVI}$$

where β > 0 is the penalty parameter. For x at step k + 1 we have:

$$x_{k+1} = \arg\min_x \mathcal{L}_\beta(x, y_k, \lambda_k) = \arg\min_x \frac{1}{2}\left\|x - y_k + \frac{1}{\beta}\left(F(w) + \lambda_k\right)\right\|^2 - \frac{\mu}{\beta}\sum_{i=1}^m \log(-\varphi_i(x)) .$$

We show that $x_{k+1}$ is the unique solution in $\mathcal{C}_<$ of the following closed-form equation (see App. C.2 for the proof):

$$x - y_k + \frac{1}{\beta}\left(F(w) + \lambda_k\right) - \frac{\mu}{\beta}\sum_{i=1}^m \frac{\nabla\varphi_i(x)}{\varphi_i(x)} = 0 . \tag{X-CF}$$

The following proposition ensures the existence and uniqueness of $x_{k+1}$ in $\mathcal{C}_<$.

Proposition 3 (unique solution). The problem (X-CF) has a solution in $\mathcal{C}_<$ and the solution is unique.

For y, the updating rule is

$$y_{k+1} = \arg\min_y \mathcal{L}_\beta(x_{k+1}, y, \lambda_k) = \arg\min_{y \in \mathcal{C}_=} -\frac{1}{\beta}\lambda_k^\top y + \frac{1}{2}\|y - x_{k+1}\|_2^2 = \arg\min_{y \in \mathcal{C}_=} \frac{1}{2}\left\|y - x_{k+1} - \frac{1}{\beta}\lambda_k\right\|_2^2 = P_c\left(x_{k+1} + \frac{\lambda_k}{\beta}\right) + d_c . \tag{y}$$

The updating rule for the dual variable λ is

$$\lambda_{k+1} = \lambda_k + \beta\,(x_{k+1} - y_{k+1}) . \tag{λ}$$

As in § 4, we would like to choose $w_k$ so that $w_k = x_{k+1}$. To this end, we need the following assumption:

Assumption 2. ∀b ∈ R^n and µ > 0, the equation $x + \frac{1}{\beta}F(x) - \frac{\mu}{\beta}\sum_{i=1}^m \frac{\nabla\varphi_i(x)}{\varphi_i(x)} + b = 0$ has a solution in $\mathcal{C}_<$.

If Assumption 2 holds true, we can let $w_k$ be a solution of

$$w - y_{k+1} + \frac{1}{\beta}\left(F(w) + \lambda_{k+1}\right) - \frac{\mu}{\beta}\sum_{i=1}^m \frac{\nabla\varphi_i(w)}{\varphi_i(w)} = 0 \tag{52}$$

in $\mathcal{C}_<$. By Proposition 3, we can let $x_{k+1}$ be the unique solution of

$$x - y_k + \frac{1}{\beta}\left(F(w_k) + \lambda_k\right) - \frac{\mu}{\beta}\sum_{i=1}^m \frac{\nabla\varphi_i(x)}{\varphi_i(x)} = 0 \tag{x}$$

in $\mathcal{C}_<$. Note that $w_k$ is also a solution of (x). By the uniqueness of the solution of (x) shown in Prop. 3, we know that $w_k = x_{k+1}$. So in the (k + 1)-th step, we can compute x, y, λ and w by (x), (y), (λ) and (52), respectively.
Since $w_k = x_{k+1}$, we can simplify our algorithm by removing the variable w and keeping only x, y and λ; see Algorithm 3.

Algorithm 3 v-ACVI pseudocode.
1: Input: operator F : X → R^n, equality constraints Cx = d and inequality constraints φ_i(x) ≤ 0, i ∈ [m], hyperparameters µ_{-1}, β > 0, δ ∈ (0, 1), number of outer and inner loop iterations T and K, resp.
2: Initialize: y^{(0)}_0 ∈ R^n, λ^{(0)}_0 ∈ R^n
3: P_c ≜ I - C^⊺ (C C^⊺)^{-1} C, where P_c ∈ R^{n×n}
4: d_c ≜ C^⊺ (C C^⊺)^{-1} d, where d_c ∈ R^n
5: for t = 0, . . . , T - 1 do
⋮
13: (y^{(t+1)}_0, λ^{(t+1)}_0) ≜ (y^{(t)}_K, λ^{(t)}_K)
14: end for

Discussion: ACVI vs. v-ACVI. Relative to Algorithm 1, the subproblem for solving x in line 7 of Algorithm 3 becomes more complex, whereas the subproblem for y becomes simpler. Hence, when the inequality constraints are simple, or when there are no inequality constraints, Algorithm 3 may be more suitable, as that simplifies the x subproblem. However, Algorithm 1 better balances the complexities of the subproblems, hence it may be simpler to use for general problems.

Convergence of v-ACVI. By proofs analogous to those in App. B, we can get convergence results similar to those for Algorithm 1, that is, Theorems 2 and 3. Specifically, Theorems 2 and 3 hold for Algorithm 3, provided that we replace the assumption "F is monotone on C_=" with "F is monotone on C_≤" in Theorems 2 and 3.
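As a rough illustration of the three v-ACVI updates (X-CF), (y) and (λ), the sketch below works through one inner step for the special case of nonnegativity constraints φ_i(x) = -x_i ≤ 0, where (X-CF) decouples into scalar quadratics with a closed-form positive root. The vector g stands in for F(w) and is treated as given; this simplification, the toy data, and all function names are assumptions made for illustration, not part of the authors' code.

```python
import numpy as np

def x_update_nonneg(y_k, lam_k, g, mu, beta):
    """Solve (X-CF) elementwise when phi_i(x) = -x_i.

    With grad(phi_i)/phi_i = e_i / x_i, (X-CF) becomes
    x_i - c_i - (mu/beta)/x_i = 0 with c = y_k - (g + lam_k)/beta,
    i.e. x_i^2 - c_i x_i - mu/beta = 0; we keep the positive root.
    """
    c = y_k - (g + lam_k) / beta
    return 0.5 * (c + np.sqrt(c**2 + 4.0 * mu / beta))

def y_update(x_next, lam_k, P_c, d_c, beta):
    """Update (y): projection of x_next + lam_k/beta onto {y : C y = d}."""
    return P_c @ (x_next + lam_k / beta) + d_c

def lam_update(lam_k, x_next, y_next, beta):
    """Update (lambda): dual ascent step on the constraint x = y."""
    return lam_k + beta * (x_next - y_next)

# Hypothetical toy data: three variables constrained to sum to one.
C = np.array([[1.0, 1.0, 1.0]])
d = np.array([1.0])
CCt_inv = np.linalg.inv(C @ C.T)
P_c = np.eye(3) - C.T @ CCt_inv @ C
d_c = C.T @ CCt_inv @ d

mu, beta = 1e-3, 0.5
y_k = np.array([0.2, 0.3, 0.5])
lam_k = np.zeros(3)
g = np.array([0.1, -0.2, 0.4])  # stands in for F(w)

x_next = x_update_nonneg(y_k, lam_k, g, mu, beta)
y_next = y_update(x_next, lam_k, P_c, d_c, beta)
lam_next = lam_update(lam_k, x_next, y_next, beta)

# Residual of (X-CF) at x_next; it should vanish up to rounding.
residual = x_next - y_k + (g + lam_k) / beta - (mu / beta) / x_next
```

The x-iterate stays strictly positive because of the barrier term, while the y-iterate satisfies the equality constraint exactly; the two only agree in the limit.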

C.2 PROOF OF PROPOSITION 3

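To prove Proposition 3, we will use the following lemma.

Lemma 11. ∀b ∈ R^n, for $x \in \mathcal{C}_<$,

$$\frac{1}{2}\|x - b\|_2^2 - \frac{\mu}{\beta}\sum_{i=1}^m \log(-\varphi_i(x)) \to +\infty \quad \text{as } \|x\|_2 \to +\infty .$$

Proof of Lemma 11. We denote $\phi : \mathcal{C}_< \to \mathbb{R}$ by

$$\phi(x) = \frac{1}{2}\|x - b\|_2^2 - \frac{\mu}{\beta}\sum_{i=1}^m \log(-\varphi_i(x)) .$$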

D DETAILS ON THE IMPLEMENTATION

In this section, we provide the details on the implementation of the experiments shown in the main part in 2D and higher dimension bilinear game, see § D.1 and § D.2, respectively. We also provide here in § D.3 the details of the MNIST experiments presented later in App. E. In addition, we provide the source code through the following link: https://github.com/Chavdarova/ACVI.

D.1 EXPERIMENTS IN 2D

We first state the considered problems fully, then describe the setup, which includes the hyperparameters.

Problems. We consider the following constrained bilinear game (for the experiment shown in Fig. 1 and 5):

$$\min_{x_1 \in \mathbb{R}_+} \max_{x_2 \in \mathbb{R}_+} \; 0.05\,x_1^2 + x_1 x_2 - 0.05\,x_2^2 . \tag{cBG}$$

The Von Neumann's ratio game (Von Neumann, 1971) (results in Fig. (a)) is as follows:

$$\min_{x \in \Delta^2} \max_{y \in \Delta^2} \frac{\langle x, Ry \rangle}{\langle x, Sy \rangle} ,$$

where $\Delta^2 = \{z \in \mathbb{R}^2 \mid z \ge 0,\ e^\top z = 1\}$, $R = \begin{pmatrix} -0.6 & -0.3 \\ 0.6 & -0.3 \end{pmatrix}$ and $S = \begin{pmatrix} 0.9 & 0.5 \\ 0.8 & 0.4 \end{pmatrix}$.

The so-called Forsaken game (Hsieh et al., 2021), used in Fig. 2(b) and in App. E, is as follows:

$$\min_{x_1 \in X_1} \max_{x_2 \in X_2} \; x_1 (x_2 - 0.45) + h(x_1) - h(x_2) ,$$

where $h(z) = \frac{1}{4}z^2 - \frac{1}{2}z^4 + \frac{1}{6}z^6$. The original version is unconstrained, X ≡ R^2. In Fig. 2(b) we use the constraint $x_1^2 + x_2^2 \le 4$, and in App. E we use two other constraints: $x_1 \ge 0.08$ and $x_2 \ge 0.4$.

For the toy GAN experiments, shown in Fig. 2(c), the problem is as follows:

$$\min_\theta \max_\varphi \; \mathbb{E}_{x \sim \mathcal{N}(0,2)}\left(\varphi x^2\right) - \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left(\varphi \theta^2 z^2\right) \quad \text{s.t. } \varphi^2 + \theta^2 \le 4 . \tag{toy-GAN}$$

In the experiment, we look at a finite sum (sample average) approximation, which we then solve deterministically in a full-batch fashion.

Setup. For all the 2D problems, we set the step size of GDA, EG and OGDA to 0.1; we use k = 5 and α = 0.5 for LA-GDA; we set β = 0.08, µ_{-1} = 10^{-5}, δ = 0.5 and λ_0 = 0 for ACVI; and we run for 50 iterations. For ACVI, we set the number of outer loop iterations to T = 20. In the first 19 outer loop iterations, we only run one inner loop iteration, and in the last outer loop iteration, we run 30 inner loop iterations (for a total of 50 updates). In Fig. 1 and Fig. (c ), we set the starting point for all algorithms. In Fig. 2(a) and (b), we set the starting point to be (0.5, 0.5)^⊺ for all algorithms.
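To make the link between (cBG) and the VI formulation concrete, the sketch below writes out the operator obtained by stacking the gradient of the objective with respect to the minimizing player and the negated gradient with respect to the maximizing player. This stacking is the usual convention for min-max problems and is assumed here; the names F_cbg, J and x_star are hypothetical.

```python
import numpy as np

def F_cbg(x):
    """Operator of (cBG): (df/dx1, -df/dx2) for f = 0.05 x1^2 + x1 x2 - 0.05 x2^2."""
    x1, x2 = x
    return np.array([0.1 * x1 + x2, -x1 + 0.1 * x2])

# F_cbg is linear, F(x) = J x, with a rotational (skew) part and a small
# symmetric part 0.1 * I coming from the quadratic terms.
J = np.array([[0.1, 1.0], [-1.0, 0.1]])
sym_eigs = np.linalg.eigvalsh(0.5 * (J + J.T))

# At the origin F vanishes, so <x - x_star, F(x_star)> = 0 >= 0 for all x >= 0.
x_star = np.zeros(2)
```

The symmetric part of the Jacobian has both eigenvalues equal to 0.1, so under this convention the operator is monotone, with the off-diagonal entries producing the rotation that makes plain gradient methods cycle.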

D.2 HIGH-DIMENSION BILINEAR GAME

We set the step size of GDA, EG, and OGDA to 0.1, and use k = 4 and α = 0.5 for LA-GDA. For ACVI, we set β = 0.5, µ_{-1} = 10^{-6}, δ = 0.5 and λ_0 = 0. The solution of (HBG) is $x^\star = \frac{1}{500}e$, where e ∈ R^{1000}. As a metric for the experiments on this problem, we use the relative error: $\varepsilon_r(x_k) = \frac{\|x_k - x^\star\|}{\|x^\star\|}$. In Fig. 3(b), we set ε = 0.02 and compute the number of iterations of ACVI needed to reach this relative error for different rotation "strengths" 1 - η, η ∈ (0, 1). Here we set the maximum number of iterations to 50 for all algorithms. In Fig. 3(a), we set η = 0.05 and compute the CPU times needed for ACVI, EG, OGDA, and LA4-GDA to reach different relative errors. Here we set the maximum run time to 1500 seconds for all algorithms. In Fig. 7 in App. E, on the other hand, we fix η = 0.05, and for varying CPU time limits, we compute the relative error of the last iterates of ACVI, GDA, EG, OGDA, and LA4-GDA.

More general HD-BG game (g-HBG). Since (HBG) has perfect conditioning (that is, the interactive term $x_1^\top B x_2$ has B ≡ I), in App. E.2 we present results on the following more general high-dimensional bilinear game:

$$\min_{x_1 \in \triangle} \max_{x_2 \in \triangle} \; \frac{\eta}{2}\, x_1^\top A x_1 + (1 - \eta)\, x_1^\top B x_2 - \frac{\eta}{2}\, x_2^\top C x_2 , \qquad \triangle = \{x_i \in \mathbb{R}^{500} \mid x_i \ge -e,\ e^\top x_i = 0\} , \tag{g-HBG}$$

where A, B and C are randomly generated 500 × 500 matrices, and A, C are positive semi-definite. The solution of (g-HBG) is x⋆ = 0, where 0 ∈ R^{1000}. As a metric for the experiments on this problem, we use the error ε(x_k) = ∥x_k∥. The remaining settings are identical to those of (HBG), explained above.

Comparison with the Frank-Wolfe algorithm on general HD-BG (g-HBG) and (gg-HBG). In App. E, we compare with the FW method (see A.4) on two problems: (i) (g-HBG), and (ii) (gg-HBG), which has the same objective as (g-HBG) but a more general constraint set, defined through randomly generated 10 × 500 matrices C_i, i ∈ {1, 2}.
In both experiments, we implement FW as in Algorithm 2, where we choose γ to be 2/(2 + t) at the t-th iterate, t = 0, . . . , T. For (i), the constraint set of (g-HBG) is a "shifted simplex", hence its vertices are easy to compute. This allows us to solve the linear minimization problem in Algorithm 2 of (Gidel et al., 2017a) much faster, and we refer to this implementation as fast-FW. In contrast, for the (gg-HBG) problem we cannot apply this, and in that case we use the standard linear programming routine cvxopt.solvers.lp in Python, referred to herein as FW. The (gg-HBG) problem is:

$$\min_{x_1 \in U_1} \max_{x_2 \in U_2} \; \frac{\eta}{2}\, x_1^\top A x_1 + (1 - \eta)\, x_1^\top B x_2 - \frac{\eta}{2}\, x_2^\top C x_2 , \qquad U_i = \{x_i \in \mathbb{R}^{500} \mid -100e \le x_i \le 100e,\ C_i x_i = 0\},\ i = 1, 2 . \tag{gg-HBG}$$

The solution of both (g-HBG) and (gg-HBG) is 0. As a metric for the experiments on these problems, we use the error ε(x_k) = ∥x_k∥. The remaining settings are identical to those of (HBG), explained above.
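The remark that the vertices of the "shifted simplex" are easy to compute can be illustrated as follows. For the set {x ∈ R^n | x ≥ -e, e^T x = 0}, the substitution z = x + e gives a scaled simplex, so the vertices are -e + n e_i and a linear function is minimized at the vertex indexed by the smallest coefficient. The sketch below is one possible reading of this idea; the function names are hypothetical and this is not claimed to be the fast-FW code used in the experiments.

```python
import numpy as np

def lmo_shifted_simplex(g):
    """Minimize <g, s> over {s >= -e, e^T s = 0}.

    The minimizer is the vertex -e + n * e_i with i = argmin_i g_i.
    """
    n = g.shape[0]
    s = -np.ones(n)
    s[np.argmin(g)] += n
    return s

def fw_step_size(t):
    """Step size gamma = 2 / (2 + t) at the t-th iterate."""
    return 2.0 / (2.0 + t)

# Example: the smallest coefficient is at index 1, so that vertex is selected.
g = np.array([3.0, 1.0, 2.0])
s = lmo_shifted_simplex(g)
```

Each call costs a single argmin over n entries, which is why this case can avoid a general linear programming solver.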

D.3 MNIST AND FASHION-MNIST EXPERIMENTS

For the experiments on the MNIST dataset (footnote 0), we use the source code of Chavdarova et al. (2021b) for the baselines and we build on it to implement ACVI. For completeness, we provide an overview of the implementation.

Models. We used the DCGAN architectures (Radford et al., 2016), listed in Table 1, and the parameters of the models are initialized using the PyTorch default initialization. For experiments on this dataset, we used the non-saturating GAN loss as proposed in (Goodfellow et al., 2014):

$$\mathcal{L}_D = \mathbb{E}_{x_d \sim p_d} \log D(x_d) + \mathbb{E}_{z \sim p_z} \log\left(1 - D(G(z))\right) \tag{L-D}$$

$$\mathcal{L}_G = \mathbb{E}_{z \sim p_z} \log D(G(z)) , \tag{L-G}$$

where G(·), D(·) denote the generator and discriminator, resp., and p_d and p_z denote the data and the latent distributions (the latter predefined as a normal distribution).

Table 1 notation: "conv." denotes a convolutional layer and "transposed conv" a transposed convolution layer (Radford et al., 2016). We use ker and pad to denote kernel and padding for the (transposed) convolution layers, respectively. With h×w we denote the kernel size. With c_in → y_out we denote the number of channels of the input and output, for (transposed) convolution layers. The models use Batch Normalization (Ioffe & Szegedy, 2015) layers.

Details on the ACVI implementation. When implementing ACVI on MNIST, we "remove" the outer loop of Algorithm 1 (that is, we set T = 1) and fix µ to be a small number; in particular, we select µ = 10^{-9}. We randomly initialize x and y and initialize λ to zero. For lines 8 and 9 of Algorithm 1, we run l steps of (unconstrained) EG and GD, respectively. For the update of x (using EG), we use step-size η_x = 0.001, whereas for y (using GD), we use step-size η_y = 0.2. We present results when l ∈ {1, 10}. At every iteration, we update λ using the expression in line 11 of Algorithm 1, with β = 0.5. Because the problem in step 8 of Algorithm 1 does not change a lot over the iterations (and the same holds when computing y), when we implement Algorithm 1 we do not reinitialize the variable x. We instead use the one from the previous iteration as initialization and update it l times. The full details of the training when using an inner optimizer for step 8 of Algorithm 1 are provided in Algorithm 4, where we recall that G(x) is defined as:

$$G(x) \triangleq x + \frac{1}{\beta} P_c F(x) - P_c y_k + \frac{1}{\beta} P_c \lambda_k - d_c . \tag{G-EQ}$$

Note that in the case of MNIST, we consider only inequality constraints (and there are no equality constraints); therefore, P_c is the identity matrix and d_c is the zero vector. Thus, there is no need for lines 3 and 4 in Algorithm 4.

Algorithm 4 Pseudocode for ACVI when using an inner optimizer (MNIST experiments).
1: Input: operator F : X → R^n, equality constraints Cx = d and inequality constraints φ_i(x) ≤ 0, i ∈ [m], hyperparameters µ, β > 0, δ ∈ (0, 1), inner optimizer A (e.g. EG, GDA, OGDA), number of steps l for the inner optimizer, number of iterations K.
2: Initialize: x_0 ∈ R^n, y_0 ∈ R^n, λ_0 ∈ R^n
3: P_c ≜ I - C^⊺ (C C^⊺)^{-1} C, where P_c ∈ R^{n×n}
4: d_c ≜ C^⊺ (C C^⊺)^{-1} d, where d_c ∈ R^n
5: for k = 0, . . . , K - 1 do
6: To obtain x_{k+1}: run l steps of A solving VI(R^n, G), where G is defined in (G-EQ)
7: To obtain y_{k+1}: run l steps of GD to find y_{k+1} that minimizes $-\mu \sum_{i=1}^m \log(-\varphi_i(y)) + \frac{\beta}{2}\left\|y - x_{k+1} - \frac{1}{\beta}\lambda_k\right\|^2$
8: λ_{k+1} = λ_k + β(x_{k+1} - y_{k+1})
9: end for
10: Return: x_K

The implementation details for the Fashion-MNIST experiment are identical to those of the MNIST experiment.

Setup 1: MNIST. The MNIST experiment in Fig. 4 in the main part (and Fig. 11, 12) has 100 randomly generated linear inequality constraints for the Generator and 100 for the Discriminator.

Setup 1: projection details. Suppose the linear inequality constraints for the Generator are Aθ ≤ b, where θ ∈ R^n is the vector of all parameters of the Generator, $A = (a_1^\top, \dots, a_{100}^\top)^\top \in \mathbb{R}^{100 \times n}$, and $b = (b_1, \dots, b_{100}) \in \mathbb{R}^{100}$. We use the greedy projection algorithm described in (Beck, 2017).
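As a rough illustration of such a greedy projection onto {θ | Aθ ≤ b} (the scheme listed as Algorithm 5), the sketch below repeatedly projects onto the most violated half-space until the largest normalized violation drops below ε. The iteration cap max_iter is an added safeguard and, like the function name and the toy data, is an assumption for illustration rather than part of the described method.

```python
import numpy as np

def greedy_projection(theta, A, b, eps=1e-8, max_iter=10_000):
    """Approximately project theta onto {x : A x <= b}, one half-space at a time."""
    theta = np.array(theta, dtype=float)  # work on a float copy
    row_norms = np.linalg.norm(A, axis=1)
    for _ in range(max_iter):
        # Signed distance of theta to each half-space S_j = {x : a_j^T x <= b_j};
        # positive entries correspond to violated constraints, i.e. j in I(theta).
        violation = (A @ theta - b) / row_norms
        i = int(np.argmax(violation))
        if violation[i] < eps:
            # Either no constraint is violated or the worst violation is below eps.
            break
        # Project onto the hyperplane a_i^T x = b_i of the most violated constraint.
        theta -= (violation[i] / row_norms[i]) * A[i]
    return theta

# Hypothetical toy example: the box constraints x_1 <= 1 and x_2 <= 1.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
projected = greedy_projection(np.array([3.0, 2.0]), A, b)
```

In this toy case two half-space projections suffice and the result is (1, 1); for general non-orthogonal constraints the loop only returns an approximately feasible point within the tolerance ε.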
A greedy projection algorithm is essentially a projected gradient method; it is easy to implement in high-dimensional problems, and it has a convergence rate of O(1/√K). See Chapter 8.2.3 in (Beck, 2017) for more details. Since the dimension n is very large, at each step of the projection one can only project θ onto one hyperplane $a_i^\top x = b_i$ for some i ∈ I(θ), where $I(\theta) \triangleq \{j \mid a_j^\top \theta > b_j\}$. For every j ∈ {1, 2, . . . , 100}, let $S_j \triangleq \{x \mid a_j^\top x \le b_j\}$. The greedy projection method chooses i so that $i \in \arg\max_i \{\mathrm{dist}(\theta, S_i)\}$. Note that as long as θ is not in the constraint set $\mathcal{C}_\le = \{x \mid Ax \le b\}$, i is in I(θ). Algorithm 5 gives the details of the greedy projection method we use for the baseline, written for the Generator only for simplicity; the same projection method is used for the Discriminator as well.

Algorithm 5 Greedy projection method for the baseline.
1: Input: θ ∈ R^n, $A = (a_1^\top, \dots, a_{100}^\top)^\top \in \mathbb{R}^{100 \times n}$, $b = (b_1, \dots, b_{100}) \in \mathbb{R}^{100}$, ε > 0
2: while True do
3: $I(\theta) \triangleq \{j \mid a_j^\top \theta > b_j\}$
4: if I(θ) = ∅ or $\max_{j \in I(\theta)} \frac{|a_j^\top \theta - b_j|}{\|a_j\|} < \varepsilon$ then
5: break
6: end if
7: choose $i \in \arg\max_{j \in I(\theta)} \frac{|a_j^\top \theta - b_j|}{\|a_j\|}$
8: $\theta \leftarrow \theta - \frac{|a_i^\top \theta - b_i|}{\|a_i\|^2}\, a_i$
9: end while
10: Return: θ

Setup 2: MNIST & Fashion-MNIST. We add two constraints for the MNIST experiment: we require the squared sum of all parameters of the Generator and that of the Discriminator (separately) to be less than or equal to a hyperparameter M. We select a large number for M; in particular, we set M = 50.

Metrics. We describe the metrics for the MNIST experiments shown later in App. E. We use the two standard GAN metrics, Inception Score (IS, Salimans et al., 2016) and Fréchet Inception Distance (FID, Heusel et al., 2017). Both FID and IS rely on a pre-trained classifier and take a finite set of m samples from the generator to compute these. Since MNIST has greyscale images, we used a classifier trained on this dataset and used m = 5000.
More precisely, IS is computed as follows:

$$IS(G) = \exp\left(\mathbb{E}_{x_g \sim p_g}\left[D_{KL}\left(p(\tilde{y} \mid x_g)\,\|\,p(\tilde{y})\right)\right]\right) = \exp\left(\frac{1}{m}\sum_{i=1}^m \sum_{c=1}^C p(y_c \mid x_i) \log\frac{p(y_c \mid x_i)}{p(y_c)}\right) .$$

Finally, FID is computed as:

$$D_{FID}(p_d, p_g) \approx D^2\left((m_d, C_d), (m_g, C_g)\right) = \|m_d - m_g\|_2^2 + \mathrm{Tr}\left(C_d + C_g - 2(C_d C_g)^{\frac{1}{2}}\right) , \tag{FID}$$

where D^2 denotes the Fréchet Distance. Note that as this metric is a distance, the lower it is, the better the performance.

Hardware. We used the Colab platform (https://colab.research.google.com/) and Tesla P100 GPUs. The running times are reported in App. E.

Figure 8: General high-dimensional bilinear game (g-HBG): comparison of ACVI with the GDA, EG, OGDA, and LA4-GDA baselines (described in App. A.4). Left: number of iterations (y-axis) needed to reach an ϵ-distance to the solution, for varying intensity of the rotational component 1 - η (η is the x-axis) of the vector field (the smaller the η, the higher the rotational component). We fix a maximum number of iterations, at which we stop the experiment. Right: distance to the solution (see App. D.2) of the last iterate (y-axis) for a varying wall-clock CPU time allowed to run each experiment (x-axis); in this experiment η is fixed to η = 0.05. See App. E.2 and D.2 for discussion and details on the implementation, respectively.

… we run experiments on the (gg-HBG) problem, where we fix CPU time and depict the relative error of ACVI and FW.

Since FW (and variants, such as approximate and accelerated) rely on a specific structure of the constraints, FW can be extremely slow when those assumptions are not met; see the discussion by Jaggi (2013) in §3 as well as the examples in §4 therein. In contrast, the ACVI algorithm presented herein focuses on constraints of a general form, and further variants can be derived from it to also exploit the structure of the constraints. We leave exploiting such constraint structure, including extending FW to VIs and deriving variants of ACVI, for future work.

In this section, we present more detailed results of the summarizing plot in Fig. 4 of the main paper. For this experiment, we used linear inequalities as described in § D.3. Unlike in subsection E.3.2, here all the baselines are projected methods (that is, the same problem setting applies to ACVI and the baselines). Fig. 11 and 12 depict the comparisons with projected GDA and projected EG, respectively. We observe that ACVI converges fast relative to the corresponding baseline. When choosing a larger number of steps for the inner problem, l = 10 (see Algorithm 4), the wall-clock time per iteration increases; interestingly, the ACVI steps compensate for that, and overall ACVI converges as fast as when l = 1.

E.3.2 SETUP 2: EXPERIMENTS ON (FASHION-)MNIST WITH QUADRATIC INEQUALITIES

In this section, we consider the MNIST and Fashion-MNIST datasets, which are unconstrained problems, so as to make use of the well-established performance metrics (which are otherwise unclear in the non-monotone settings, where we do not know the optimal solution a priori). We augment the problem with a mild constraint, which requires that the norm of the per-player parameters does not exceed a certain value (see App. D.3). We compare ACVI with unconstrained baselines, which sets ACVI at a disadvantage as the projection requires additional computation. However, the primary purpose of these experiments is to observe whether Algorithm 1 is computationally competitive when lines 8 and 9 are nontrivial and require an (unconstrained) solver. Note, however, that since MNIST is a relatively easy problem, it may not answer the natural question of whether ACVI has advantages over standard unconstrained methods on problems augmented with constraints. We leave such analyses for future work. The implementation and the metrics used are described in App. D.3. We believe that further exploring the type of constraints to be added, or the implementation options (e.g., l, step-size), may prove fruitful even for problems that are originally unconstrained, as such an approach may reduce the rotational component of the original vector field, which in turn leads to faster convergence or may help in escaping limit cycles for problems beyond monotone ones.



Footnote 0 (MNIST dataset): provided under Creative Commons Attribution-Share Alike 3.0.



Figure 1: ACVI (Algorithm 1) and EG iterates, depicted in red and green, resp., on the game: $\min_{x_1 \in \mathbb{R}_+} \max_{x_2 \in \mathbb{R}_+} 0.05 \cdot x_1^2 + x_1 x_2 - 0.05 \cdot x_2^2$.

Figure 2: Convergence of GDA, EG, OGDA, LA-GDA, and ACVI on three different 2D problems, for a fixed number of total iterations, where markers denote the iterates of the respective method.

Figure 4: FID (lower is better) on MNIST with added constraints, over wall-clock time; averaged over 3 seeds. See § 5 and App. D for discussion and implementation, resp.

) -F (x) -λ + yx = 0 with respect to x, where y, λ are variables 8:for k = 0, . . . , K -

Metrics: IS. Given a sample from the generator $x_g \sim p_g$, where $p_g$ denotes the data distribution of the generator, IS uses the softmax output of the pre-trained network $p(\tilde{y} \mid x_g)$, which represents the probability that $x_g$ is of class $c_i$, i ∈ 1 . . . C, i.e., $p(\tilde{y} \mid x_g) \in [0, 1]^C$. It then computes the marginal class distribution $p(\tilde{y}) = \int_x p(\tilde{y} \mid x_g)\, p_g(x_g)$. IS measures the Kullback-Leibler divergence $D_{KL}$ between the predicted conditional label distribution $p(\tilde{y} \mid x_g)$ and the marginal class distribution $p(\tilde{y})$.
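The two metrics can be sketched numerically as follows: IS as described above, from an m × C array of predicted class probabilities, and the Fréchet distance of (FID) in App. D.3 from the means and covariances of two feature distributions. This is a minimal NumPy sketch under stated assumptions: the inputs are taken as given (no pre-trained classifier is loaded), the function names are hypothetical, and the trace of the matrix square root is computed from the eigenvalues of C_d C_g, which is valid when both covariances are positive semi-definite.

```python
import numpy as np

def inception_score(probs):
    """IS from an (m, C) array whose i-th row is p(y | x_i)."""
    p_y = probs.mean(axis=0)  # empirical marginal class distribution
    # Ratio p(y_c | x_i) / p(y_c); entries with zero probability contribute 0.
    ratio = np.ones_like(probs, dtype=float)
    np.divide(probs, p_y, out=ratio, where=probs > 0)
    kl_per_sample = np.sum(probs * np.log(ratio), axis=1)
    return float(np.exp(kl_per_sample.mean()))

def frechet_distance(m_d, C_d, m_g, C_g):
    """||m_d - m_g||^2 + Tr(C_d + C_g - 2 (C_d C_g)^{1/2})."""
    # For PSD covariances the eigenvalues of C_d C_g are real and nonnegative,
    # and Tr((C_d C_g)^{1/2}) is the sum of their square roots.
    eigs = np.linalg.eigvals(C_d @ C_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigs.real, 0.0, None)))
    mean_term = np.sum((m_d - m_g) ** 2)
    return float(mean_term + np.trace(C_d) + np.trace(C_g) - 2.0 * tr_sqrt)

# Toy checks with hand-made inputs (not real classifier outputs).
uniform_probs = np.full((4, 3), 1.0 / 3.0)  # uninformative predictions
one_hot_probs = np.eye(3)                   # confident and diverse predictions
is_uniform = inception_score(uniform_probs)
is_one_hot = inception_score(one_hot_probs)
fid_shift = frechet_distance(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), np.eye(2))
```

Uninformative predictions give the minimum score of 1, confident and evenly spread predictions over C classes give C, and a unit mean shift with identical covariances gives a Fréchet distance of 1.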

Achieved error given varying fixed CPU time

Figure 9: General high-dimensional bilinear game (g-HBG): comparison of ACVI with the FW baseline (Algorithm 2). Left: number of iterations (y-axis) needed to reach an ϵ-distance to the solution, for varying intensity of the rotational component 1 - η (η is the x-axis) of the vector field (the smaller the η, the higher the rotational component). We fix a maximum number of iterations, at which we stop the experiment. Right: distance to the solution (see App. D.2) of the last iterate (y-axis) for a varying wall-clock CPU time allowed to run each experiment (x-axis); in this experiment η is fixed to η = 0.05. See App. E.2 and D.2 for discussion and details on the implementation, respectively.

Figure 10: Given varying CPU time (in seconds), depicting the relative error (see App. D.2) of FW and ACVI (Algorithm 1) on the HBG problem, where η is fixed to η = 0.05 (hence the vector field is highly rotational). For the details on the implementation, see App. D.2.

Fig. 13 summarizes the experiments in terms of the obtained FID score over time. We observe that ACVI (although it uses two solvers at each iteration) still performs competitively with unconstrained GDA and EG. Figures 14-19 additionally provide samples of the Generator and IS scores, separately for each method. Figures 20 and 21 depict samples generated by the different methods when trained on the Fashion-MNIST dataset.

Figure 12: Setup 1: Comparison between ACVI and EG, and the projected EG on MNIST with linear inequalities (described in § D.3). l denotes the number of steps for the inner problems, see Algorithm 4. The depicted results are over multiple seeds. The FID and IS metrics as well as the implementation details are described in App. D.3. See App. E.3.1 for discussion.

Figure 13: Summary of the experiments on MNIST, using FID (lower is better). 13(a): GDA and ACVI with GDA, and 13(b): EG and ACVI with EG, using l = {1, 10} for ACVI. Using step size of 0.001. The depicted results are over multiple seeds. See App. D.3 and E.3 for details on the implementation and discussion, resp.

Figure 14: GDA on MNIST. Left: samples xg ∼ p g of the last iterate of the Generator. Right: FID and IS of GDA, depicted in blue and red, respectively. Using step size of 0.001.

Figure 15: ACVI (Algorithm 1) with 10 GDA steps on MNIST. Left: samples xg ∼ p g of the last iterate of the Generator. Right: FID and IS of GDA, depicted in blue and red, respectively. Using step size of 0.001 for x and 0.2 for y, and l = 10 both for x and y.

Figure 16: ACVI (Algorithm 1) with 1 GDA step on MNIST. Left: samples xg ∼ p g of the last iterate of the Generator. Right: FID and IS of GDA, depicted in blue and red, respectively. Using step size of 0.001 for x and 0.2 for y, and l = 1 both for x and y.

Figure 17: EG on MNIST. Left: samples xg ∼ p g of the last iterate of the Generator. Right: FID and IS of EG, depicted in blue and red, respectively. Using step size of 0.001.

Figure 18: ACVI (Algorithm 1) with 10 EG steps on MNIST. Left: samples xg ∼ p g of the last iterate of the Generator. Right: FID and IS of GDA, depicted in blue and red, respectively. Using step size of 0.001 for x and 0.2 for y, and l = 10 both for x and y.

Figure 19: ACVI (Algorithm 1) with 1 EG step on MNIST. Left: samples xg ∼ p g of the last iterate of the Generator. Right: FID and IS of GDA, depicted in blue and red, respectively. Using step size of 0.001 for x and 0.2 for y, and l = 1 both for x and y.

(a) EG (b) ACVI+EG 1 step (c) ACVI+EG 10 step

Figure 20: Generated images at fixed wall-clock computation time (3000s) by: the baseline EG, and by ACVI with l ∈ {1, 10} on the Fashion-MNIST (Xiao et al., 2017) dataset. See App. D.3 and E.3 for details on the implementation and discussion, resp.

Figure 21: Generated images at fixed wall-clock computation time (3000s) by: the baseline GDA, and by ACVI with l ∈ {1, 10} on the Fashion-MNIST (Xiao et al., 2017) dataset. See App. D.3 and E.3 for details on the implementation and discussion, resp.

DCGAN architectures (Radford et al., 2016) used for experiments on MNIST. With "conv."

Inception Score (IS). It aims at estimating (i) whether the samples look realistic, i.e., p(ỹ | xg) should have low entropy, and (ii) whether the samples are diverse (from different ImageNet classes), i.e., p(ỹ) should have high entropy. As these are combined using the Kullback-Leibler divergence, the higher the score is, the better the performance.
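To make the two criteria above concrete, the following minimal sketch computes IS under the commonly used definition IS = exp(E_x[KL(p(ỹ | x) ‖ p(ỹ))]). This formula is an assumption here, since it is not stated explicitly in this text. The function name `inception_score` and the input format (one row of class probabilities per generated sample) are hypothetical; in practice the rows would come from a pretrained classifier rather than being written by hand.

```python
import math


def inception_score(cond_probs):
    """Sketch of IS from class probabilities (hypothetical helper).

    cond_probs: list of rows; row i is p(y | x_i) over the classes and
    is assumed to sum to 1.
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Marginal p(y): average of the conditional distributions over samples.
    marginal = [sum(row[c] for row in cond_probs) / n for c in range(num_classes)]
    # Mean KL(p(y|x) || p(y)) over samples; terms with p(y|x) = 0 contribute 0.
    mean_kl = 0.0
    for row in cond_probs:
        mean_kl += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
    mean_kl /= n
    return math.exp(mean_kl)


# Confident and diverse predictions give the maximum score for two classes.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Confident but collapsed predictions (a single class) give the minimum score.
collapsed = inception_score([[1.0, 0.0], [1.0, 0.0]])
```

In this sketch the score ranges from 1 (uninformative or collapsed predictions) up to the number of classes (confident predictions spread evenly over all classes), which matches the statement that higher is better.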

ACKNOWLEDGMENTS

TC thanks the support of the Swiss National Science Foundation (SNSF), grant P2ELP2 199740. The authors thank Matteo Pagliardini and Tianyi Lin for insightful discussions and feedback.


In the remainder, we prove Proposition 3, which guarantees that (X-CF) has a unique solution.

Proof of Proposition 3: uniqueness of the solution of (X-CF). Let $\phi : C_< \to \mathbb{R}$ denote:

let $B(x) = -\frac{\mu}{\beta} \sum_{i=1}^{m} \log(-\varphi_i(x))$. We choose an arbitrary $x_0 \in C_<$. Then by the convexity of $B(x)$ we deduce that, for all $x \in C_<$,

$$\phi(x) \geq \frac{1}{2} \|x - b\|_2^2 + B(x_0) + \nabla B(x_0)^\intercal (x - x_0) \to +\infty, \qquad \|x\|_2 \to +\infty.$$

So there exists $M > 0$ such that $x_0 \in B(0, M)$ and $\forall x \in S$, $\phi(x) \leq \phi(x_0)$, $x$ must belong to $B(0, M)$, where

It is clear that there exists $t > 0$ such that every $x \in C_<$ that satisfies $\phi(x) \leq \phi(x_0)$ must belong to $C_t$, where

And we can make $t$ small enough so that $x_0 \in C_t$. $C_t$ is a nonempty compact set and $\phi$ is continuous, so there exists

E ADDITIONAL EMPIRICAL ANALYSIS

In this section, we provide some omitted plots/analyses of the results in the main paper as well as additional experiments. In particular, (i) App. E.1 lists results in 2D, (ii) App. E.2 lists results on (HBG) and (g-HBG), whereas (iii) App. E.3 provides more detailed plots of the experiments summarized in Fig. 4 and presents additional results with other constraints on MNIST, where we compare computationally with unconstrained baselines.

Additional experiments: varying constraints on the Forsaken problem. The Forsaken game was first pointed out in (Hsieh et al., 2021) and is particularly relevant because it has limit cycles, despite being in 2D. Since we lack a tool to detect whether we are in a limit cycle in higher dimensions, this example is a popular benchmark in many recent works. Interestingly, in Fig. 2 (b) we observe that ACVI is the only method that escapes the limit cycle. However, in those simulations the constraints are never active during training, given the initial point; in this section, we therefore run experiments with additional constraints. Fig. 6 depicts the baseline methods and ACVI on the Forsaken problem with two constraints different from the one considered in Fig. 2 (that x_1^2 + x_2^2 ≤ 4). Since this game is non-monotone, we observe that for some constraints the baseline methods (GDA, EG, OGDA, LA4-GDA) stay near the constraint and do not converge. This suggests that ACVI may have better chances of converging for broader problem classes than monotone VIs, relative to baseline methods, whose convergence may depend on the constraints and which may be significantly slower when hitting a constraint (as Fig. 1 illustrates).

Figure 6: Forsaken game with different constraints: we consider two constraints in addition to that in Fig. 2: (a) that x_1 ≥ 0.08, and (b) that x_2 ≥ 0.4. See App. D.1 for details on the implementation, and App. E.1 for a discussion.

E.2 ADDITIONAL EXPERIMENTS ON HBG AND ON G-HBG

Complementary analysis to those in Fig. 3. Similar to Fig. 3, in Fig. 7 we run experiments on the HBG problem. However, here for a given fixed CPU time, we depict the relative error of the considered baselines (GDA, EG, OGDA, LA4-GDA) and ACVI (Algorithm 1) on the HBG problem, where η is fixed to η = 0.05 (hence the vector field is highly rotational). This experiment complements those in Fig. 3 in the main paper. For details on the implementation, see App. D.2.

Additional experiments on (g-HBG). In Fig. 8 we run experiments on the generalized HBG problem (g-HBG). In Fig. 8(a), we compute the number of iterations needed to reach an ε-distance to the solution for varying intensity of the rotational component (1 - η); in Fig. 8(b), we compute the error of the last iterate given fixed CPU time. We observe that despite the highly rotational monotone vector field, ACVI converges significantly faster in terms of wall-clock time in higher dimensions as well.

Comparison with Frank-Wolfe algorithm on (g-HBG) and (gg-HBG). Similar to Fig. 8, in Fig. 9 we also run experiments on (g-HBG), but here we compare ACVI with FW. We observe that ACVI outperforms FW even when we make use of the special structure of the simple constraint set when solving the linear minimization problem in FW (the fast FW method). Similar to Fig. 9(b), in Fig. 10 

