ACCELERATED SINGLE-CALL METHODS FOR CONSTRAINED MIN-MAX OPTIMIZATION

Abstract

We study first-order methods for constrained min-max optimization. Existing methods either require two gradient calls or two projections in each iteration, which may be costly in some applications. In this paper, we first show that a variant of the Optimistic Gradient (OG) method, a single-call single-projection algorithm, has an O(1/√T) best-iterate convergence rate for inclusion problems with operators that satisfy the weak Minty variational inequality (MVI). Our second result is the first single-call single-projection algorithm, the Accelerated Reflected Gradient (ARG) method, that achieves the optimal O(1/T) last-iterate convergence rate for inclusion problems that satisfy negative comonotonicity. Both the weak MVI and negative comonotonicity are well-studied assumptions and capture a rich set of non-convex non-concave min-max optimization problems. Finally, we show that the Reflected Gradient (RG) method, another single-call single-projection algorithm, has an O(1/√T) last-iterate convergence rate for constrained convex-concave min-max optimization, answering an open problem of Hsieh et al. (2019). Our convergence rates hold for standard measures such as the tangent residual and the natural residual.

1. INTRODUCTION

Various machine learning applications, from generative adversarial networks (GANs) (e.g., Goodfellow et al., 2014; Arjovsky et al., 2017), adversarial examples (e.g., Madry et al., 2017), and robust optimization (e.g., Ben-Tal et al., 2009), to reinforcement learning (e.g., Du et al., 2017; Dai et al., 2018), can be captured by constrained min-max optimization. Unlike the well-behaved convex-concave setting, these modern ML applications often require solving non-convex non-concave min-max optimization problems in high-dimensional spaces. Unfortunately, the general non-convex non-concave setting is intractable even for computing a local solution (Hirsch et al., 1989; Papadimitriou, 1994; Daskalakis et al., 2021). Motivated by this intractability, researchers have turned their attention to non-convex non-concave settings with structure. Significant progress has been made for several interesting structured non-convex non-concave settings, such as the ones that satisfy the weak Minty variational inequality (MVI) (Definition 2) (Diakonikolas et al., 2021; Pethick et al., 2022) and the ones that satisfy the stricter negative comonotonicity condition (Definition 3) (Lee & Kim, 2021a; Cai et al., 2022a). These algorithms are variations of the celebrated extragradient (EG) method (Korpelevich, 1976), an iterative first-order method. Like the extragradient method, these algorithms all require two oracle calls per iteration, which may be costly in practice. We investigate the following important question in this paper:

Can we design efficient single-call first-order methods for structured non-convex non-concave min-max optimization? (*)

We provide an affirmative answer to this question. We first show that a single-call method known as the Optimistic Gradient (OG) method (Hsieh et al., 2019) is applicable to all non-convex non-concave settings that satisfy the weak MVI.
We then provide the Accelerated Reflected Gradient (ARG) method, which achieves the optimal convergence rate in all non-convex non-concave settings that satisfy the negative comonotonicity condition. Single-call methods have been studied in the convex-concave setting (Hsieh et al., 2019) but not in the more general non-convex non-concave settings. See Table 1 for a comparison between our algorithms and other algorithms from the literature.

Algorithm                                1-Call?  Constraints?  Comonotone  Weak MVI
EG+ (Diakonikolas et al., 2021)          ✗        ✗             O(1/√T)     O(1/√T)
CEG+ (Pethick et al., 2022)              ✗        ✓             O(1/√T)     O(1/√T)
OGDA+ (Böhm, 2022; Bot et al., 2022)     ✓        ✗             O(1/√T)     O(1/√T)
OG [this paper]                          ✓        ✓             O(1/√T)     O(1/√T)
FEG (Lee & Kim, 2021b)                   ✗        ✗             O(1/T)      -
AS (Cai et al., 2022a)                   ✗        ✓             O(1/T)      -
ARG [this paper]                         ✓        ✓             O(1/T)      -

Table 1: Existing results for min-max optimization problems with non-monotone operators. A ✓ in "Constraints?" means the algorithm works in the constrained setting. The convergence rate is in terms of the operator norm (in the unconstrained setting) and the residual (in the constrained setting).

1.1. OUR CONTRIBUTIONS

Throughout the paper, we adopt the more general and abstract framework of inclusion problems, which includes constrained min-max optimization as a special case. More specifically, we consider the following problem.

Inclusion Problem. Given E = F + A, where F : R^n → R^n is a single-valued (possibly non-monotone) operator and A : R^n ⇒ R^n is a set-valued maximally monotone operator, the inclusion problem is:

find z* such that 0 ∈ E(z*) = F(z*) + A(z*). (IP)

As shown in the following example, we can interpret a min-max optimization problem as an inclusion problem.

Example 1 (Min-Max Optimization). The following structured min-max optimization problem captures a wide range of applications in machine learning such as GANs, adversarial examples, robust optimization, and reinforcement learning:

min_{x ∈ R^{n_x}} max_{y ∈ R^{n_y}} f(x, y) + g(x) − h(y), (1)

where f(·,·) is possibly non-convex in x and non-concave in y. Regularized and constrained min-max problems are covered by appropriate choices of lower semi-continuous and convex functions g and h. Examples include the ℓ1-norm, the ℓ2-norm, and the indicator function of a closed convex feasible set. Let z = (x, y). If we define F(z) = (∂_x f(x, y), −∂_y f(x, y)) and A(z) = (∂g(x), ∂h(y)), where A is maximally monotone, then the first-order optimality condition of (1) has the form of an inclusion problem.

Daskalakis et al. (2021) show that without any assumption on the operator E = F + A, the problem is intractable.¹ The most well-understood setting is when E is monotone, i.e., ⟨u − v, z − z′⟩ ≥ 0 for all z, z′ and u ∈ E(z), v ∈ E(z′), which captures convex-concave min-max optimization. Motivated by non-convex non-concave min-max optimization, we consider the two most widely studied families of non-monotone operators: (i) negatively comonotone operators and (ii) operators that satisfy the less restrictive weak MVI. See Section 2 for a more detailed discussion of their relationship.
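As a concrete illustration of Example 1, the bilinear problem min_x max_y xy over the box [−1, 1]² can be written as an inclusion problem with F(z) = (y, −x) and A the normal cone of the box. The sketch below is ours, not from the paper; the helper names `F`, `proj`, and `natural_residual` are our own, and the residual used is the natural residual defined later in Section 2.2.

```python
import numpy as np

# Toy instance of (IP): min_x max_y x*y over the box Z = [-1, 1]^2.
# F(z) = (y, -x) is the descent-ascent field; A = N_Z is the normal cone
# of the box, whose resolvent is the Euclidean projection onto Z.
def F(z):
    x, y = z
    return np.array([y, -x])

def proj(z):
    return np.clip(z, -1.0, 1.0)

def natural_residual(z):
    # r_nat(z) = ||z - J_A(z - F(z))||: zero exactly at solutions of (IP)
    return np.linalg.norm(z - proj(z - F(z)))

print(natural_residual(np.array([0.0, 0.0])))  # 0.0: the saddle point solves (IP)
print(natural_residual(np.array([1.0, 1.0])))  # 1.0: a non-solution has positive residual
```

The unique saddle point (0, 0) is the only point with zero residual, matching the first-order optimality condition 0 ∈ F(z*) + N_Z(z*).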
Here are the main contributions of this paper.

Contribution 1:

We provide an extension of the Optimistic Gradient (OG) method for inclusion problems in which the operator E = F + A satisfies the weak MVI. More specifically, we prove that OG has an O(1/√T) convergence rate (Theorem 1), matching the state-of-the-art algorithms (Diakonikolas et al., 2021; Pethick et al., 2022). Importantly, our algorithm only requires a single oracle call to F and a single call to the resolvent of A. The resolvent of A is defined as (I + A)^{-1}. When A is the subdifferential of the indicator function of a closed convex set, the resolvent operator is exactly the Euclidean projection; hence our algorithm performs a single projection in the constrained case. Next, we provide an accelerated single-call method for when the operator satisfies the stronger negative comonotonicity condition.

Contribution 2:

We design an accelerated version of the Reflected Gradient (RG) method (Chambolle & Pock, 2011; Malitsky, 2015; Cui & Shanbhag, 2016; Hsieh et al., 2019), which we call the Accelerated Reflected Gradient (ARG) method. ARG has the optimal O(1/T) convergence rate for inclusion problems whose operators E = F + A are negatively comonotone (Theorem 2). Note that O(1/T) is the optimal convergence rate for any first-order method even for monotone inclusion problems (Diakonikolas, 2020; Yoon & Ryu, 2021). Importantly, ARG only requires a single oracle call to F and a single call to the resolvent of A. Finally, we resolve an open question from (Hsieh et al., 2019).

Contribution 3:

We show that the Reflected Gradient (RG) method has a last-iterate convergence rate of O(1/√T) for constrained convex-concave min-max optimization (Theorem 3). Hsieh et al. (2019) show that the RG algorithm converges asymptotically but do not obtain a concrete rate. We strengthen their result to obtain a tight finite-time convergence rate for RG. We also provide illustrative numerical experiments in Appendix E.

1.2. RELATED WORKS

We provide a brief discussion of the most relevant and recent results on nonconvex-nonconcave min-max optimization here and defer the discussion of related results in the convex-concave setting to Appendix A. We also refer readers to (Facchinei & Pang, 2003; Bauschke & Combettes, 2011; Ryu & Yin, 2022) and the references therein for a comprehensive literature review on inclusion problems and related variational inequality problems.

Structured Nonconvex-Nonconcave Min-Max Optimization. Since nonconvex-nonconcave min-max optimization problems are intractable in general, recent works study problems under additional assumptions. The Minty variational inequality (MVI) assumption (also called coherence or variational stability), which covers all quasiconvex-concave and star-convex-concave problems, is well studied, e.g., in (Dang & Lan, 2015; Zhou et al., 2017; Liu et al., 2019; Malitsky, 2020; Song et al., 2020; Liu et al., 2021); in particular, an O(1/√T) convergence rate is known for problems that satisfy MVI (Dang & Lan, 2015). Diakonikolas et al. (2021) propose a weaker assumption called the weak MVI, which includes both MVI and negative comonotonicity (Bauschke et al., 2021) as special cases. Under the weak MVI, the EG+ (Diakonikolas et al., 2021) and OGDA+ (Böhm, 2022) algorithms achieve an O(1/√T) convergence rate; we show that the single-call single-projection OG method achieves the same O(1/√T) convergence rate when we only assume the weak MVI (Theorem 1). Results on accelerated algorithms in the nonconvex-nonconcave setting are sparser. For negatively comonotone operators, the optimal O(1/T) convergence rate is achieved by variants of the EG algorithm in the unconstrained setting (Lee & Kim, 2021a) and in the constrained setting (Cai et al., 2022a). To the best of our knowledge, the ARG algorithm is the first efficient single-call single-resolvent method that achieves the accelerated and optimal O(1/T) convergence rate in the constrained nonconvex-nonconcave setting (Theorem 2). We summarize previous results and our results in Table 1.

Our analysis of ARG is inspired by (Cai et al., 2022a) and uses a similar potential function argument.

2. PRELIMINARIES

Basic Notations. Throughout the paper, we work in the Euclidean space R^n equipped with the inner product ⟨·,·⟩. We denote the standard ℓ2-norm by ∥·∥. For any closed and convex set Z ⊆ R^n, Π_Z[·] : R^n → Z denotes the Euclidean projection onto Z, i.e., for any z ∈ R^n, Π_Z[z] = argmin_{z′∈Z} ∥z − z′∥. We denote by B(z, r) the ℓ2-ball centered at z with radius r.

Normal Cone. We denote by N_Z : Z ⇒ R^n the normal cone operator: for z ∈ Z, N_Z(z) = {a : ⟨a, z′ − z⟩ ≤ 0, ∀z′ ∈ Z}. Define the indicator function I_Z(z) = 0 if z ∈ Z and +∞ otherwise. It is not hard to see that the subdifferential operator satisfies ∂I_Z = N_Z. A useful fact is that if z = Π_Z[z′], then λ(z′ − z) ∈ N_Z(z) for any λ ≥ 0.

Monotone Operators. We recall some standard definitions and results on monotone operators here and refer the readers to (Bauschke & Combettes, 2011; Ryu & Boyd, 2016; Ryu & Yin, 2022) for a more detailed introduction. A set-valued operator A : R^n ⇒ R^n maps each point z ∈ R^n to a subset A(z) ⊆ R^n. We denote the graph of A by Gra(A) := {(z, u) : u ∈ A(z)} and the zeros of A by Zer(A) := {z : 0 ∈ A(z)}. The inverse operator of A is denoted A^{-1}, whose graph is Gra(A^{-1}) = {(u, z) : (z, u) ∈ Gra(A)}. For two operators A and B, A + B denotes the operator with graph Gra(A + B) = {(z, u_A + u_B) : (z, u_A) ∈ Gra(A), (z, u_B) ∈ Gra(B)}. We denote the identity operator by I : R^n → R^n. We say an operator A is single-valued if |A(z)| ≤ 1 for all z ∈ R^n. A single-valued operator A is L-Lipschitz if ∥A(z) − A(z′)∥ ≤ L·∥z − z′∥ for all z, z′ ∈ R^n. Moreover, we say A is non-expansive if it is 1-Lipschitz.

Definition 1 ((Maximal) Monotonicity). An operator A : R^n ⇒ R^n is monotone if ⟨u − u′, z − z′⟩ ≥ 0 for all (z, u), (z′, u′) ∈ Gra(A). Moreover, A is maximally monotone if A is monotone and Gra(A) is not properly contained in the graph of any other monotone operator.
When g : R^n → R ∪ {+∞} is closed, convex, and proper, its subdifferential operator ∂g is maximally monotone. As an example, the normal cone operator N_Z = ∂I_Z is maximally monotone. We denote the resolvent of A by J_A = (I + A)^{-1}. Some useful properties of the resolvent are summarized in the following proposition.

Proposition 1. If A is maximally monotone, then J_A satisfies the following.
1. The domain of J_A is R^n, and J_A is non-expansive and single-valued on R^n.
2. If z = J_A(z′), then z′ − z ∈ A(z). If c ∈ A(z), then z = J_A(z + c).
3. When A = N_Z is the normal cone operator of some closed convex set Z, then J_{ηA} = Π_Z is the Euclidean projection onto Z for all η > 0.

Non-Monotone Operators.

Definition 2 (Weak MVI (Diakonikolas et al., 2021; Pethick et al., 2022)). An operator A : R^n ⇒ R^n satisfies the weak MVI if for some z* ∈ Zer(A) there exists ρ ≤ 0 such that ⟨u, z − z*⟩ ≥ ρ∥u∥², ∀(z, u) ∈ Gra(A).

Definition 3 (Comonotonicity (Bauschke et al., 2021)). An operator A : R^n ⇒ R^n is ρ-comonotone if ⟨u − u′, z − z′⟩ ≥ ρ∥u − u′∥², ∀(z, u), (z′, u′) ∈ Gra(A).

When A is ρ-comonotone with ρ > 0, A is also known as ρ-cocoercive, which is a stronger condition than monotonicity. When A is ρ-comonotone with ρ < 0, A may be non-monotone. The weak MVI with ρ = 0 is also known as MVI, coherence, or variational stability. Note that the weak MVI is implied by negative comonotonicity. We refer the readers to (Lee & Kim, 2021a, Example 1), (Diakonikolas et al., 2021, Section 2.2), and (Pethick et al., 2022, Section 5) for examples of min-max optimization problems that satisfy the two conditions.
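The resolvent properties in Proposition 1 can be checked numerically when A = N_Z is the normal cone of a box, so that J_A is coordinate-wise clipping. The sketch below is our own illustration (the names `proj`, `zp`, `z` are ours); it verifies non-expansiveness and the membership z′ − z ∈ N_Z(z) on random instances.

```python
import numpy as np

rng = np.random.default_rng(0)
# J_{N_Z} for the box Z = [-1, 1]^3 is the Euclidean projection (Proposition 1, item 3)
proj = lambda z: np.clip(z, -1.0, 1.0)

# Item 1: the resolvent is non-expansive (1-Lipschitz).
for _ in range(100):
    a, b = rng.normal(size=3), rng.normal(size=3)
    assert np.linalg.norm(proj(a) - proj(b)) <= np.linalg.norm(a - b) + 1e-12

# Item 2: if z = J_A(z'), then z' - z ∈ A(z) = N_Z(z),
# i.e. <z' - z, w - z> <= 0 for every w in Z.
zp = 3.0 * rng.normal(size=3)   # a point possibly far outside the box
z = proj(zp)
for _ in range(100):
    w = rng.uniform(-1.0, 1.0, size=3)   # random feasible point
    assert np.dot(zp - z, w - z) <= 1e-12

print("resolvent properties verified on random instances")
```

Coordinate-wise, whenever a coordinate of `zp` exceeds the box, the clipped difference points outward, so the inner product with any feasible direction is non-positive, exactly the normal cone condition.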

2.1. PROBLEM FORMULATION

Inclusion Problem. Given E = F + A, where F : R^n → R^n is a single-valued (possibly non-monotone) operator and A : R^n ⇒ R^n is a set-valued maximally monotone operator, the inclusion problem is:

find z* such that 0 ∈ E(z*) = F(z*) + A(z*). (IP)

We say z is an ϵ-approximate solution to an inclusion problem (IP) if 0 ∈ F(z) + A(z) + B(0, ϵ). Throughout the paper, we study (IP) under the following assumption.

Assumption 1. In the setup of (IP),
1. there exists z* ∈ Zer(E), i.e., 0 ∈ F(z*) + A(z*);
2. F is L-Lipschitz;
3. A is maximally monotone.

When F is monotone, we refer to the corresponding (IP) as a monotone inclusion problem, which covers convex-concave min-max optimization. In the more general non-monotone setting, we study problems that satisfy negative comonotonicity or the weak MVI.

Assumption 2 (Comonotonicity). In the setup of (IP), E = F + A is ρ-comonotone, i.e., ⟨u − u′, z − z′⟩ ≥ ρ∥u − u′∥², ∀(z, u), (z′, u′) ∈ Gra(E).

Assumption 3 (Weak MVI). In the setup of (IP), E = F + A satisfies the weak MVI with ρ ≤ 0, i.e., there exists z* ∈ Zer(E) such that ⟨u, z − z*⟩ ≥ ρ∥u∥², ∀(z, u) ∈ Gra(E).

An important special case of the inclusion problem is the variational inequality problem.

Variational Inequality. Let Z ⊆ R^n be a closed and convex set and F : R^n → R^n be a single-valued operator. The variational inequality (VI) problem associated with Z and F is:

find z* ∈ Z such that ⟨F(z*), z* − z⟩ ≤ 0, ∀z ∈ Z. (VI)

Published as a conference paper at ICLR 2023

Note that (VI) is a special case of (IP) when A = N_Z = ∂I_Z is the normal cone operator:

0 ∈ F(z*) + N_Z(z*) ⇔ −F(z*) ∈ N_Z(z*) ⇔ ⟨F(z*), z* − z⟩ ≤ 0, ∀z ∈ Z.

The general formulation of (VI) unifies many problems, such as convex optimization, min-max optimization, and computing Nash equilibria in multi-player concave games, and has been extensively studied since the 1960s (Facchinei & Pang, 2003). Definitions of the convergence measure for (VI) and the classical algorithms, EG and PEG, are presented in Appendix B.

2.2. CONVERGENCE MEASURE

We focus on a strong convergence measure called the tangent residual, defined as r^tan_{F,A}(z) := min_{c∈A(z)} ∥F(z) + c∥. By definition, r^tan_{F,A}(z) ≤ ϵ implies that z is an ϵ-approximate solution to the inclusion problem (IP), and also an (ϵ·D)-approximate strong solution to the corresponding variational inequality problem (VI) when Z is bounded by D. Moreover, the tangent residual is an upper bound on other notions of residual in the literature, such as the natural residual r^nat_{F,A} (Diakonikolas, 2020) and the forward-backward residual r^fb_{F,A} (Yoon & Ryu, 2022), as shown in Proposition 2 (see Appendix B.4 for the formal statement and proof). Thus our convergence rates on the tangent residual also hold for the natural residual and the forward-backward residual. Note that in the unconstrained setting where A = 0, these residuals are all equal to the operator norm ∥F(z)∥.
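For box constraints, both residuals have closed forms, so the ordering r^tan ≥ r^nat of Proposition 2 can be checked directly. The sketch below is our own illustration on the toy bilinear instance; the helper names (`F`, `proj`, `tangent_residual`, `natural_residual`) are ours, and the coordinate-wise formula for the tangent residual uses the normal cone of the box (c_i ≥ 0 at the upper bound, c_i ≤ 0 at the lower bound, c_i = 0 in the interior).

```python
import numpy as np

def F(z):  # operator of the toy bilinear problem min_x max_y x*y
    return np.array([z[1], -z[0]])

lo, hi = -1.0, 1.0
proj = lambda z: np.clip(z, lo, hi)

def tangent_residual(z):
    # r_tan(z) = min_{c in N_Z(z)} ||F(z) + c||, coordinate-wise for the box.
    g = F(z)
    r = np.abs(g)                                            # interior: c_i = 0
    r = np.where(np.isclose(z, hi), np.maximum(g, 0.0), r)   # upper bound: c_i >= 0
    r = np.where(np.isclose(z, lo), np.maximum(-g, 0.0), r)  # lower bound: c_i <= 0
    return np.linalg.norm(r)

def natural_residual(z):
    # r_nat(z) = ||z - J_A(z - F(z))||
    return np.linalg.norm(z - proj(z - F(z)))

for z in [np.array([0.3, -0.2]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]:
    assert natural_residual(z) <= tangent_residual(z) + 1e-12  # Proposition 2
print("natural residual never exceeds tangent residual")
```

On interior points the two residuals coincide with ∥F(z)∥, matching the remark about the unconstrained setting.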

3. OPTIMISTIC GRADIENT METHOD FOR WEAK MVI PROBLEMS

In this section, we consider an extension of the Optimistic Gradient (OG) algorithm (Daskalakis et al., 2017; Mokhtari et al., 2020a;b; Hsieh et al., 2019; Peng et al., 2020) to inclusion problems: given an arbitrary starting point z^{-1/2} = z^0 ∈ R^n and step size η > 0, the update rule is

z^{t+1/2} = J_{ηA}[z^t − ηF(z^{t-1/2})],   z^{t+1} = z^{t+1/2} + ηF(z^{t-1/2}) − ηF(z^{t+1/2}). (OG)

For t ≥ 1, the update rule can also be written as z^{t+3/2} = J_{ηA}[z^{t+1/2} − 2ηF(z^{t+1/2}) + ηF(z^{t-1/2})], which coincides with the forward-reflected-backward algorithm (Malitsky & Tam, 2020). We remark that the update rule of OG is different from the Optimistic Gradient Descent/Ascent (OGDA) algorithm (also known as the Past Extragradient (PEG) algorithm) (Popov, 1980), which is single-call but requires two projections in each iteration. Previous results for OG only hold in the convex-concave (monotone) setting. The main result of this section is that OG has an O(1/√T) convergence rate even for nonconvex-nonconcave min-max optimization problems that satisfy the weak MVI, matching the state-of-the-art results achieved by two-call methods (Diakonikolas et al., 2021; Pethick et al., 2022). Remarkably, OG only requires a single call to F and a single call to the resolvent J_{ηA} in each iteration. The main result is shown in Theorem 1. The proof relies on a simple yet important observation that (z^t − z^{t+1})/η ∈ F(z^{t+1/2}) + A(z^{t+1/2}).

Theorem 1. Assume Assumptions 1 and 3 hold with ρ ∈ (−1/(12√3·L), 0]. Consider the iterates of (OG) with step size η ∈ (0, 1/(2L)) satisfying C = 1/2 + 2ρ/η − 2η²L² > 0 (the existence of such η is guaranteed by Fact 1). Then for any T ≥ 1,

min_{t∈[T]} r^tan_{F,A}(z^{t+1/2})² ≤ min_{t∈[T]} ∥z^{t+1} − z^t∥²/η² ≤ (H²/(Cη²)) · (1/T),

where H² = ∥z^1 − z*∥² + (1/4)∥z^{1/2} − z^0∥².

Proof.
From the update rule of (OG), we have the following identity (see also (Hsieh et al., 2019, Appendix B)): for any p,

∥z^{t+1} − p∥² = ∥z^t − p∥² + ∥z^{t+1} − z^{t+1/2}∥² − ∥z^{t+1/2} − z^t∥² + 2⟨z^t − ηF(z^{t-1/2}) − z^{t+1/2} + ηF(z^{t+1/2}), p − z^{t+1/2}⟩. (2)

Since z^{t+1/2} = J_{ηA}[z^t − ηF(z^{t-1/2})], we have (z^t − ηF(z^{t-1/2}) − z^{t+1/2})/η ∈ A(z^{t+1/2}) by Proposition 1. Then

(z^t − z^{t+1})/η = (z^t − ηF(z^{t-1/2}) − z^{t+1/2})/η + F(z^{t+1/2}) ∈ F(z^{t+1/2}) + A(z^{t+1/2}).

Set p = z*. By the weak MVI assumption, we have

2⟨z^t − ηF(z^{t-1/2}) − z^{t+1/2} + ηF(z^{t+1/2}), z* − z^{t+1/2}⟩ = 2η⟨(z^t − z^{t+1})/η, z* − z^{t+1/2}⟩ ≤ −(2ρ/η)∥z^t − z^{t+1}∥². (3)

Define c = 1/2 − 2η²L² > 0. We have the identity

(1 − 2c)η²L² = 4η⁴L⁴ = 1/2 − c − (1 + 2c)η²L². (4)

Combining Equations (2) and (3) and using ∥a + b∥² ≤ 2∥a∥² + 2∥b∥², we have

∥z^{t+1} − z*∥² ≤ ∥z^t − z*∥² + ∥z^{t+1} − z^{t+1/2}∥² − ∥z^{t+1/2} − z^t∥² + c∥z^t − z^{t+1}∥² − (c + 2ρ/η)∥z^t − z^{t+1}∥²
≤ ∥z^t − z*∥² + (1 + 2c)∥z^{t+1} − z^{t+1/2}∥² − (1 − 2c)∥z^{t+1/2} − z^t∥² − (c + 2ρ/η)∥z^t − z^{t+1}∥². (5)

Using the update rule of (OG) and the L-Lipschitzness of F, we have for any t ≥ 0,

∥z^{t+1} − z^{t+1/2}∥² = ∥ηF(z^{t-1/2}) − ηF(z^{t+1/2})∥² ≤ η²L²∥z^{t+1/2} − z^{t-1/2}∥². (6)

Moreover, using ∥a + b∥² ≤ 2∥a∥² + 2∥b∥² and Equation (6), we have for any t ≥ 1,

∥z^{t+1/2} − z^{t-1/2}∥² ≤ 2∥z^{t+1/2} − z^t∥² + 2∥z^t − z^{t-1/2}∥² ≤ 2∥z^{t+1/2} − z^t∥² + 2η²L²∥z^{t-1/2} − z^{t-3/2}∥²,

which implies

∥z^{t+1/2} − z^t∥² ≥ (1/2)∥z^{t+1/2} − z^{t-1/2}∥² − η²L²∥z^{t-1/2} − z^{t-3/2}∥². (7)

Combining Equations (4), (5), (6), and (7), we have for all t ≥ 1,

∥z^{t+1} − z*∥² ≤ ∥z^t − z*∥² + (1 + 2c)∥z^{t+1} − z^{t+1/2}∥² − (1 − 2c)∥z^{t+1/2} − z^t∥² − (c + 2ρ/η)∥z^t − z^{t+1}∥²
≤ ∥z^t − z*∥² + (1 − 2c)η²L²∥z^{t-1/2} − z^{t-3/2}∥² − (1/2 − c − (1 + 2c)η²L²)∥z^{t+1/2} − z^{t-1/2}∥² − (c + 2ρ/η)∥z^t − z^{t+1}∥²
= ∥z^t − z*∥² + 4η⁴L⁴(∥z^{t-1/2} − z^{t-3/2}∥² − ∥z^{t+1/2} − z^{t-1/2}∥²) − (c + 2ρ/η)∥z^t − z^{t+1}∥².
Telescoping the above inequality and using c = 1/2 − 2η²L² and ηL < 1/2, we get

(1/2 + 2ρ/η − 2η²L²) Σ_{t=1}^{T} ∥z^t − z^{t+1}∥² ≤ ∥z^1 − z*∥² + (1/4)∥z^{1/2} − z^{-1/2}∥².

Note that z^0 is the same as z^{-1/2}. This completes the proof.

Fact 1. For any L > 0 and ρ > −1/(12√3·L), there exists η ∈ (0, 1/(2L)) such that 1/2 + 2ρ/η − 2η²L² > 0.

Proof. Let η = 1/(2√3·L); then the desired inequality holds whenever ρ > −(ηL(1 − 4η²L²)/4)·(1/L) = −1/(12√3·L).
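The (OG) update is short enough to sketch in a few lines. The code below is our own illustration on the toy constrained bilinear problem min_x max_y xy over [−1, 1]² (here ρ = 0, so Assumption 3 holds); the names `F`, `proj`, `g_prev`, and `best` are ours, and the step size is the one used in the proof of Fact 1.

```python
import numpy as np

F = lambda z: np.array([z[1], -z[0]])   # 1-Lipschitz operator of min_x max_y x*y
proj = lambda z: np.clip(z, -1.0, 1.0)  # J_{eta A} for A = N_Z, Z = [-1, 1]^2

L = 1.0
eta = 1.0 / (2.0 * np.sqrt(3.0) * L)    # step size from the proof of Fact 1

z = np.array([0.9, 0.8])                # z^0
g_prev = F(z)                           # F(z^{-1/2}) with z^{-1/2} = z^0
best = np.inf
for t in range(2000):
    z_half = proj(z - eta * g_prev)     # the single resolvent/projection call
    g = F(z_half)                       # the single F-call of this iteration
    z_next = z_half + eta * g_prev - eta * g
    # (z^t - z^{t+1}) / eta lies in F(z^{t+1/2}) + A(z^{t+1/2}), so its norm
    # upper-bounds the tangent residual at the half-iterate (Theorem 1)
    best = min(best, np.linalg.norm(z - z_next) / eta)
    z, g_prev = z_next, g

print(best)  # best-iterate residual; O(1/sqrt(T)) is guaranteed under weak MVI
```

Caching `g_prev` makes the single-call structure explicit: each iteration evaluates F once and the resolvent once.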

4. ACCELERATED REFLECTED GRADIENT FOR NEGATIVELY COMONOTONE PROBLEMS

In this section, we propose a new algorithm called the Accelerated Reflected Gradient (ARG) algorithm. We prove that ARG enjoys an accelerated O(1/T) convergence rate for inclusion problems with comonotone operators (Theorem 2). Note that the lower bound Ω(1/T) holds even for the special case of convex-concave min-max optimization (Diakonikolas, 2020; Yoon & Ryu, 2021). Our algorithm is inspired by the Reflected Gradient (RG) algorithm (Chambolle & Pock, 2011; Malitsky, 2015; Cui & Shanbhag, 2016; Hsieh et al., 2019) for monotone variational inequalities. Starting at initial points z^{-1} = z^0 ∈ Z, the update rule of RG with step size η > 0 is as follows: for t = 0, 1, 2, ...

z^{t+1/2} = 2z^t − z^{t-1},   z^{t+1} = Π_Z[z^t − ηF(z^{t+1/2})]. (RG)

We propose the following Accelerated Reflected Gradient (ARG) algorithm, which is a single-call single-resolvent first-order method. Given arbitrary initial points z^0 = z^{1/2} ∈ R^n and step size η > 0, ARG sets z^1 = J_{ηA}[z^0 − ηF(z^0)] and updates, for t = 1, 2, ...,

z^{t+1/2} = 2z^t − z^{t-1} + (1/(t+1))(z^0 − z^t) − (1/t)(z^0 − z^{t-1}),
z^{t+1} = J_{ηA}[z^t − ηF(z^{t+1/2}) + (1/(t+1))(z^0 − z^t)]. (ARG)

We use the idea of the Halpern iteration (Halpern, 1967) to design the accelerated algorithm (ARG). This technique for deriving optimal first-order methods is also called anchoring and has received intense attention recently (Diakonikolas, 2020; Yoon & Ryu, 2021; Lee & Kim, 2021a; Tran-Dinh & Luo, 2021; Tran-Dinh, 2022; Cai et al., 2022a). We defer a detailed discussion of these works to Appendix A. We remark that the state-of-the-art result from (Cai et al., 2022a) is a variant of the EG algorithm that makes two oracle calls per iteration. Thus, to the best of our knowledge, ARG is the first single-call single-resolvent algorithm with optimal convergence rate for general inclusion problems with comonotone operators.

Theorem 2.
Assume Assumptions 1 and 2 hold with ρ ∈ [−1/(60L), 0]. Then the Accelerated Reflected Gradient (ARG) algorithm with constant step size η > 0 satisfying Inequality (10) has the following convergence rate: for any T ≥ 1,

r^tan_{F,A}(z^T) ≤ (√6·H/η) · (1/T),

where H² = ∥z^0 − z*∥² + 4∥z^1 − z^0∥² ≤ ∥z^0 − z*∥² + 4η²·r^tan_{F,A}(z^0)².

Remark 1. Note that if Assumption 2 is satisfied with some ρ > 0, it is also satisfied with ρ = 0, so Theorem 2 applies.

We provide a proof sketch for Theorem 2 here and the full proof in Appendix C. Our proof is based on a potential function argument similar to the one in (Cai et al., 2022a).

Proof Sketch. We apply a potential function argument. We first show that the potential function is approximately non-increasing and then prove that it is upper bounded by a term independent of T. As the potential function at step t is also at least Ω(t²)·r^tan_{F,A}(z^t)², we conclude that ARG has an O(1/T) convergence rate.
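The (ARG) recursion can likewise be sketched on the toy monotone instance min_x max_y xy over [−1, 1]² (ρ = 0, so Assumption 2 holds with L = 1, and the resolvent is the box projection). This is our own illustration, not the paper's code; the names `F`, `proj`, `z_prev`, and `nat` are ours, the step size η = 1/(12L) is the one shown to satisfy Inequality (10) in Fact 2, and we report the natural residual, which lower-bounds the tangent residual of Theorem 2.

```python
import numpy as np

F = lambda z: np.array([z[1], -z[0]])   # 1-Lipschitz, monotone (rho = 0)
proj = lambda z: np.clip(z, -1.0, 1.0)  # J_{eta A} for A = N_Z, Z = [-1, 1]^2

L = 1.0
eta = 1.0 / (12.0 * L)                  # satisfies Inequality (10) by Fact 2

z0 = np.array([1.0, 1.0])               # anchor point z^0 = z^{1/2}
z_prev = z0.copy()                      # z^0
z = proj(z0 - eta * F(z0))              # z^1
T = 2000
for t in range(1, T):
    # anchored reflection step: single F-call, single resolvent call
    z_half = 2 * z - z_prev + (z0 - z) / (t + 1) - (z0 - z_prev) / t
    z_next = proj(z - eta * F(z_half) + (z0 - z) / (t + 1))
    z_prev, z = z, z_next

nat = np.linalg.norm(z - proj(z - F(z)))  # natural residual <= tangent residual
print(nat)  # Theorem 2 bounds the tangent residual by sqrt(6) * H / (eta * T)
```

With H² = ∥z^0 − z*∥² + 4∥z^1 − z^0∥² ≈ 2.03 here, the Theorem 2 bound at T = 2000 is roughly 0.02, consistent with the O(1/T) rate.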

5. LAST-ITERATE CONVERGENCE RATE OF REFLECTED GRADIENT

In this section, we show that the Reflected Gradient (RG) algorithm (Chambolle & Pock, 2011; Malitsky, 2015; Cui & Shanbhag, 2016; Hsieh et al., 2019) has a last-iterate convergence rate of O(1/√T) with respect to the tangent residual and the gap function (see Definition 4) for solving monotone variational inequalities (Theorem 3).

Theorem 3. For a variational inequality problem (VI) associated with a closed convex set Z and a monotone, L-Lipschitz operator F with a solution z*, the (RG) algorithm with constant step size η ∈ (0, 1/((1+√2)L)) has the following last-iterate convergence rate: for any T ≥ 1,

r^tan_{F,Z}(z^T) ≤ λHL/√T,   GAP_{Z,F,D}(z^T) ≤ λDHL/√T,

where H² = 4∥z^0 − z*∥² + (13/L²)∥F(z^0)∥² and λ = 6(1 + 3η²L²)/(η²L²(1 − (1+√2)ηL)).

We remark that the convergence rate of RG is slower than that of ARG and other optimal first-order algorithms even in the monotone setting. Nevertheless, understanding its last-iterate convergence rate is still interesting: (1) RG is simple and widely used in practice; (2) last-iterate convergence rates of simple classic algorithms such as EG and RG are mentioned as open problems in (Hsieh et al., 2019); the question was recently resolved for EG (Gorbunov et al., 2022a; Cai et al., 2022b) but remained open for RG; (3) compared to EG, RG requires only a single call to F and a single projection in each iteration. We provide a proof sketch for Theorem 3 here and the full proof in Appendix D.

Proof Sketch. Our analysis is based on a potential function argument and can be summarized in three steps; see Appendix D for the details.

A RELATED WORK

A.1 CONVEX-CONCAVE SETTING

In the constrained convex-concave setting, the average iterate of classical algorithms converges at a rate of O(1/T) in terms of the gap function (Nemirovski, 2004; Nesterov, 2007; Mokhtari et al., 2020b; Hsieh et al., 2019), and this rate is optimal (Ouyang & Xu, 2021). But the gap function and average-iterate convergence are not meaningful in the nonconvex-nonconcave setting.
For convergence in terms of the residual in the constrained setting, EG and PEG have a slower rate of O(1/√T) for best-iterate convergence (Korpelevich, 1976; Popov, 1980; Facchinei & Pang, 2003; Hsieh et al., 2019) and for the more desirable last-iterate convergence (Cai et al., 2022b; Gorbunov et al., 2022b). We remark that the last-iterate convergence rate of the Reflected Gradient (RG) algorithm was previously unknown. The O(1/√T) rate is tight for p-SCLI algorithms (Golowich et al., 2020), a subclass of first-order methods that includes EG, PEG, and many of their variants, but a faster rate is possible for other first-order methods.

Accelerated Convergence Rate in Residual.

Recent results with accelerated convergence rates in terms of the residual are based on the Halpern iteration (Halpern, 1967) (also called anchoring). The vanilla Halpern iteration has an O(1/T) convergence rate for cocoercive operators (a condition stronger than monotonicity) (Diakonikolas, 2020; Kim, 2021). Recently, a line of work contributed to obtaining the O(1/T) convergence rate for monotone operators in the constrained setting. Diakonikolas (2020) and Yoon & Ryu (2022) provide double-loop algorithms with an O(log T / T) convergence rate for monotone operators in the constrained setting. In the unconstrained setting (A = 0), Yoon & Ryu (2021) propose the Extra Anchored Gradient (EAG) algorithm, the first efficient algorithm with an O(1/T) convergence rate for monotone operators; they also establish a matching lower bound for first-order methods. Lee & Kim (2021a) generalize EAG to the Fast Extragradient (FEG) method, which works even for negatively comonotone operators, but still in the unconstrained setting. Analyses of variants of EAG and FEG in the unconstrained setting are provided in (Tran-Dinh & Luo, 2021; Tran-Dinh, 2022). Recently, Cai et al. (2022a) closed the open problem by proving that the projected version of EAG has an O(1/T) convergence rate. They also propose the accelerated forward-backward splitting (AS) algorithm, a generalization of FEG, which has an O(1/T) convergence rate for negatively comonotone operators in the constrained setting.

A.2 NONCONVEX-NONCONCAVE SETTING

This paper studies structured nonconvex-nonconcave optimization problems from the general perspective of operator theory and focuses on global convergence under the weak MVI and negative comonotonicity. There is a line of work focusing on local convergence, e.g., (Heusel et al., 2017; Mazumdar et al., 2019; Jin et al., 2020; Fiez & Ratliff, 2021). Another line of work focuses on problems satisfying different structural assumptions, such as the Polyak-Łojasiewicz condition (Nouiehed et al., 2019; Yang et al., 2020).

B ADDITIONAL PRELIMINARY B.1 RESOLVENT AND PROXIMAL OPERATOR

When A = ∂g is the subdifferential operator of a lower semi-continuous, proper, and convex function g, its resolvent (I + λ∂g)^{-1} is also known as the proximal operator of g, denoted prox_{λg}. The resolvent (I + λ∂g)^{-1} is efficiently computable for the following popular choices of g: the ℓ1-norm ∥·∥₁, the ℓ2-norm ∥·∥₂, matrix norms, the log-barrier −Σ_{i=1}^n log(x_i), and, more generally, any quadratic or smooth function. Moreover, many of them have closed-form expressions. For example, the proximal operator of the ℓ1-norm g = ∥·∥₁ is the element-wise soft-thresholding operator (prox_{λg}(v))_i = (v_i − λ)_+ − (−v_i − λ)_+. We refer readers to (Parikh & Boyd, 2014, Chapters 6 and 7) for a comprehensive review of proximal operators and their efficient computation.
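The soft-thresholding formula above is a one-liner. The sketch below is our own illustration (the name `prox_l1` is ours); it evaluates the closed-form proximal operator of g = ∥·∥₁ and checks the resolvent identity z = J_{λ∂g}(z + λc) on a sample input.

```python
import numpy as np

# prox of g = ||.||_1 with parameter lam, via element-wise soft-thresholding:
# (prox_{lam g}(v))_i = (v_i - lam)_+ - (-v_i - lam)_+
def prox_l1(v, lam):
    return np.maximum(v - lam, 0.0) - np.maximum(-v - lam, 0.0)

v = np.array([2.0, -0.3, 0.5, -1.5])
p = prox_l1(v, 1.0)
print(p)  # soft-thresholds v to [1, 0, 0, -0.5]

# Optimality check: v - p must lie in lam * subdifferential of ||.||_1 at p,
# i.e. equal lam*sign(p_i) where p_i != 0 and have magnitude <= lam elsewhere.
r = v - p
ok = all(abs(r[i] - np.sign(p[i])) < 1e-12 if p[i] != 0 else abs(r[i]) <= 1.0
         for i in range(len(v)))
print(ok)  # True
```

Each coordinate is shrunk toward zero by λ and clamped to zero inside [−λ, λ], which is exactly the resolvent of λ∂∥·∥₁.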

B.2 GAP FUNCTION

A standard suboptimality measure for the variational inequality (VI) problem is the gap function, defined as GAP_{Z,F}(z) := max_{z′∈Z} ⟨F(z), z − z′⟩. Note that when the feasible set Z is unbounded, approximating the gap function is impossible: consider the simple unconstrained saddle-point problem min_{x∈R} max_{y∈R} xy, which has a unique saddle point (0, 0), but every other point has an infinitely large gap. A refined notion is the following restricted gap function (Nesterov, 2007), which is meaningful for unbounded Z.

Definition 4 (Restricted Gap Function). Given a closed convex set Z, a single-valued operator F, and a radius D, the restricted gap function at a point z ∈ Z is GAP_{Z,F,D}(z) := max_{z′∈Z∩B(z,D)} ⟨F(z), z − z′⟩, where B(z, D) is the Euclidean ball centered at z with radius D.

In the rest of the paper, we call GAP_{Z,F,D} the gap function (or gap) for convenience. The following lemma relates ∥F(z) + c∥, where c ∈ N_Z(z), to the gap function.

Lemma 1. Let Z be a closed convex set and F a monotone, L-Lipschitz operator. For any z ∈ Z and c ∈ N_Z(z), we have GAP_{Z,F,D}(z) := max_{z′∈Z∩B(z,D)} ⟨F(z), z − z′⟩ ≤ D·∥F(z) + c∥.

Proof. The proof is straightforward. Since c ∈ N_Z(z), we have ⟨c, z − z′⟩ ≥ 0 for any z′ ∈ Z. Therefore,

max_{z′∈Z∩B(z,D)} ⟨F(z), z − z′⟩ ≤ max_{z′∈Z∩B(z,D)} ⟨F(z) + c, z − z′⟩ ≤ max_{z′∈Z∩B(z,D)} ∥z − z′∥·∥F(z) + c∥ (Cauchy-Schwarz inequality) ≤ D·∥F(z) + c∥.

B.3 CLASSICAL ALGORITHMS FOR VARIATIONALY INEQUALITIES

The Extragradient Algorithm (Korpelevich, 1976). Starting at an initial point z^0 ∈ Z, the update rule of EG is: for t = 0, 1, 2, ...

z^{t+1/2} = Π_Z[z^t − ηF(z^t)],   z^{t+1} = Π_Z[z^t − ηF(z^{t+1/2})]. (EG)

At each step t ≥ 0, the EG algorithm makes an oracle call F(z^t) to produce an intermediate point z^{t+1/2} (a gradient descent step if F = ∂f is the gradient of some function f); then the algorithm makes another oracle call F(z^{t+1/2}) and updates z^t to z^{t+1}. In each step, EG needs two oracle calls to F and two projections Π_Z.

The Past Extragradient Algorithm (Popov, 1980). Starting at initial points z^0 = z^{-1/2} ∈ Z, the update rule of PEG with step size η > 0 is: for t = 0, 1, 2, ...

z^{t+1/2} = Π_Z[z^t − ηF(z^{t-1/2})],   z^{t+1} = Π_Z[z^t − ηF(z^{t+1/2})]. (PEG)

Note that PEG is also known as the Optimistic Gradient Descent/Ascent (OGDA) algorithm in the literature. The update rule of PEG is similar to (EG) but requires only a single call to F in each iteration. Both EG and PEG perform two projections in every iteration.
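The call structure of the two classical algorithms can be compared side by side. The sketch below is our own illustration on the toy monotone VI with F(z) = (y, −x) over [−1, 1]²; the names `extragradient` and `past_extragradient` are ours. Note how PEG reuses the stored evaluation `g` in place of EG's first oracle call.

```python
import numpy as np

# EG (two F-calls, two projections) vs PEG (one F-call, two projections)
# on the toy monotone VI with F(z) = (y, -x) over Z = [-1, 1]^2.
F = lambda z: np.array([z[1], -z[0]])
proj = lambda z: np.clip(z, -1.0, 1.0)
eta = 0.2

def extragradient(z, T):
    for _ in range(T):
        z_half = proj(z - eta * F(z))      # first call + first projection
        z = proj(z - eta * F(z_half))      # second call + second projection
    return z

def past_extragradient(z, T):
    g = F(z)                               # F(z^{-1/2}) with z^{-1/2} = z^0
    for _ in range(T):
        z_half = proj(z - eta * g)         # reuses the stored evaluation
        g = F(z_half)                      # the single F-call per iteration
        z = proj(z - eta * g)
    return z

z0 = np.array([0.9, -0.7])
print(np.linalg.norm(extragradient(z0, 500)))       # both spiral into the
print(np.linalg.norm(past_extragradient(z0, 500)))  # saddle point (0, 0)
```

On this bilinear instance both methods contract toward the unique solution, while plain projected gradient descent-ascent would diverge.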

B.4 TANGENT RESIDUAL UPPER BOUNDS OTHER NOTIONS OF RESIDUAL

Proposition 2. Let A be a maximally monotone operator and F be a single-valued operator. Then for any z ∈ R^n and α > 0,

r^tan_{F,A}(z) ≥ r^nat_{F,A}(z) := ∥z − J_A(z − F(z))∥,
r^tan_{F,A}(z) ≥ r^fb_{F,A,α}(z) := (1/α)∥z − J_{αA}[z − αF(z)]∥.

Proof. For any c ∈ A(z), we have

r^nat_{F,A}(z) = ∥z − J_A(z − F(z))∥ = ∥J_A(z + c) − J_A(z − F(z))∥ ≤ ∥F(z) + c∥ (J_A is non-expansive)

and

r^fb_{F,A,α}(z) = (1/α)∥z − J_{αA}(z − αF(z))∥ = (1/α)∥J_{αA}(z + αc) − J_{αA}(z − αF(z))∥ ≤ ∥F(z) + c∥. (J_{αA} is non-expansive)

Thus both r^nat_{F,A}(z) and r^fb_{F,A,α}(z) are at most r^tan_{F,A}(z) = min_{c∈A(z)} ∥F(z) + c∥.
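To make the natural residual concrete: when A = N_Z is the normal cone of a box, the resolvent J_A is the coordinate-wise projection, so r^nat is computable in one line. A sketch with a hypothetical toy operator (the names and the example are ours, not from the paper):

```python
import numpy as np

def natural_residual(F, z, lo, hi):
    """r_nat for A = normal cone of the box [lo, hi]^n; the resolvent J_A
    is then the coordinate-wise projection, implemented with np.clip."""
    return float(np.linalg.norm(z - np.clip(z - F(z), lo, hi)))

# Hypothetical toy operator: F is the gradient of f(z) = 0.5 * ||z - (2, 2)||^2.
F = lambda z: z - np.array([2.0, 2.0])
z_star = np.array([1.0, 1.0])  # solution of the VI over the box [0, 1]^2
```

At the solution z* the residual vanishes (z* is a fixed point of the projected step), while at any non-solution it is strictly positive.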

C MISSING PROOFS IN SECTION 4

To prove Theorem 2, we apply a potential function argument. We first show that the potential function is approximately non-increasing, and then prove that it is upper bounded by a term independent of T. As the potential function at step t is also at least Ω(t²) · r^tan(z_t)², we conclude that ARG has an O(1/T) convergence rate.

C.1 POTENTIAL FUNCTION

Recall the update rule of ARG: z_0 = z_{1/2} ∈ R^n are initial points and z_1 = J_{ηA}[z_0 − ηF(z_0)]; for t ≥ 1,

z_{t+1/2} = 2z_t − z_{t−1} + (1/(t+1))(z_0 − z_t) − (1/t)(z_0 − z_{t−1}),
z_{t+1} = J_{ηA}[z_t − ηF(z_{t+1/2}) + (1/(t+1))(z_0 − z_t)]. (ARG)

Recall that when A is the normal cone of a closed convex set Z, the resolvent J_A is the Euclidean projection onto Z. Hence, when the ARG algorithm is applied to solve monotone VI problems, it uses a single call to the operator F and a single projection onto Z per iteration. Here we allow A to be an arbitrary maximally monotone operator, and ARG becomes a single-call single-resolvent algorithm in this more general setting. Next, we specify the potential function. Define

c_{t+1} := (z_t − ηF(z_{t+1/2}) + (1/(t+1))(z_0 − z_t) − z_{t+1})/η, ∀t ≥ 0.

By the update rule, we have c_t ∈ A(z_t) for all t ≥ 1. The potential function at iterate t ≥ 1 is defined as

V_t := (t(t+1)/2)∥ηF(z_t) + ηc_t∥² + (t(t+1)/2)∥ηF(z_t) − ηF(z_{t−1/2})∥² + t⟨ηF(z_t) + ηc_t, z_t − z_0⟩. (9)

C.2 APPROXIMATELY NON-INCREASING POTENTIAL

Fact 2. For any L > 0 and ρ ≥ −1/(60L), there exists η > 0 such that

1/2 − (12 − 4ρ/η)η²L² + 2ρ/η ≥ 0. (10)

Moreover, every η > 0 satisfying (10) also satisfies ρ/η ≥ −1/4.

Proof. Rewriting (10), we get ρ ≥ (ηL(24η²L² − 1)/(4 + 8η²L²)) · (1/L). Let x = ηL and f(x) = x(24x² − 1)/(4 + 8x²). Since f(1/12) = −5/292 < −1/60, we know that η = 1/(12L) satisfies (10). Moreover, rewriting (10) and using ηL > 0, we get ρ/η ≥ −(1 − 24η²L²)/(4 + 8η²L²) ≥ −1/4.

We show in the following lemma that V_t is approximately non-increasing.

Lemma 2. In the same setup as Theorem 2, for any t ≥ 1, we have V_{t+1} ≤ V_t + (1/8)∥ηF(z_{t+1}) + ηc_{t+1}∥².

Proof. The plan is to show that V_t − V_{t+1} plus a few non-positive terms is still ≥ −(1/8)∥ηF(z_{t+1}) + ηc_{t+1}∥², which certifies the claim.

Two Positive Terms. Since F + A is ρ-comonotone, we have

⟨ηF(z_{t+1}) + ηc_{t+1} − ηF(z_t) − ηc_t, z_{t+1} − z_t⟩ − (ρ/η)∥ηF(z_{t+1}) + ηc_{t+1} − ηF(z_t) − ηc_t∥² ≥ 0. (11)

Since F is L-Lipschitz, we have η²L²∥z_{t+1} − z_{t+1/2}∥² − ∥ηF(z_{t+1}) − ηF(z_{t+1/2})∥² ≥ 0. Denote p = 1/24. Multiplying the above inequality by 1 − ρ/(3η) > 0 and rearranging terms, we get

p∥z_{t+1} − z_{t+1/2}∥² − ∥ηF(z_{t+1}) − ηF(z_{t+1/2})∥² + ((1 − ρ/(3η))η²L² − p)∥z_{t+1} − z_{t+1/2}∥² + (ρ/(3η))∥ηF(z_{t+1}) − ηF(z_{t+1/2})∥² ≥ 0. (12)

Sum-of-Squares Identity. We derive equivalent formulations of z_{t+1/2} and z_{t+1} using the definitions ηc_t = z_{t−1} − z_t − ηF(z_{t−1/2}) + (1/t)(z_0 − z_{t−1}) and ηc_{t+1} = z_t − ηF(z_{t+1/2}) + (1/(t+1))(z_0 − z_t) − z_{t+1}:

z_{t+1/2} = 2z_t − z_{t−1} + (1/(t+1))(z_0 − z_t) − (1/t)(z_0 − z_{t−1}) = z_t + (z_t − z_{t−1}) + (1/(t+1))(z_0 − z_t) − (1/t)(z_0 − z_{t−1}) = z_t − ηF(z_{t−1/2}) − ηc_t + (1/(t+1))(z_0 − z_t),
z_{t+1} = z_t − ηF(z_{t+1/2}) − ηc_{t+1} + (1/(t+1))(z_0 − z_t).

We also have

z_{t+1} − z_{t+1/2} = ηF(z_{t−1/2}) + ηc_t − ηF(z_{t+1/2}) − ηc_{t+1}. (13)

Next, we simplify V_t − V_{t+1} using the second identity of Proposition 3: replace x_0 with z_0; replace x_2, x_3, x_4 with z_t, z_{t+1/2}, z_{t+1}; replace y_1, y_2, y_3, y_4 with ηF(z_{t−1/2}), ηF(z_t), ηF(z_{t+1/2}), ηF(z_{t+1}); replace u_2 with ηc_t; replace u_4 with ηc_{t+1}; replace k with t; replace p with q. Note that x_3 = x_2 − y_1 − u_2 + (1/(k+1))(x_0 − x_2) and x_4 = x_2 − y_3 − u_4 + (1/(k+1))(x_0 − x_2) hold due to the above equivalent formulations of z_{t+1/2} and z_{t+1}. Expressions (17) and (18) appear on both sides of the following equation:

V_t − V_{t+1} − t(t+1) × LHS of Inequality (11) − (t(t+1)/(4p)) × LHS of Inequality (12)
= (t(t+1)/4)∥ηc_{t+1} − ηc_t + ηF(z_{t−1/2}) − 2ηF(z_t) + ηF(z_{t+1/2})∥² (14)
+ (((1 − 4p)t − 4p)/(4p))(t+1)∥ηF(z_{t+1/2}) − ηF(z_{t+1})∥² (15)
+ (t+1)⟨ηF(z_{t+1/2}) − ηF(z_{t+1}), ηF(z_{t+1}) + ηc_{t+1}⟩ (16)
+ t(t+1)(ρ/η)∥ηF(z_{t+1}) + ηc_{t+1} − ηF(z_t) − ηc_t∥² (17)
− (t(t+1)/(4p))[((1 − ρ/(3η))η²L² − p)∥z_{t+1} − z_{t+1/2}∥² + (ρ/(3η))∥ηF(z_{t+1}) − ηF(z_{t+1/2})∥²]. (18)
Since ∥a∥² + ⟨a, b⟩ = ∥a + b/2∥² − ∥b∥²/4, we have

Expression (15) + Expression (16)
= (((1 − 4p)t − 4p)/(4p))(t+1)∥ηF(z_{t+1/2}) − ηF(z_{t+1}) + (2p/((1 − 4p)t − 4p))(ηF(z_{t+1}) + ηc_{t+1})∥² − (p(t+1)/((1 − 4p)t − 4p))∥ηF(z_{t+1}) + ηc_{t+1}∥²
≥ −(p(t+1)/((1 − 8p)t))∥ηF(z_{t+1}) + ηc_{t+1}∥² (t ≥ 1)
≥ −(2p/(1 − 8p))∥ηF(z_{t+1}) + ηc_{t+1}∥² ((t+1)/t ≤ 2)
= −(1/8)∥ηF(z_{t+1}) + ηc_{t+1}∥². (p = 1/24)

Now it remains to show that the sum of Expressions (14), (17), and (18) is non-negative. Multiplying by 4/(t(t+1)) and replacing p = 1/24, we get

(4/(t(t+1))) · (Expression (14) + Expression (17) + Expression (18))
= ∥ηc_{t+1} − ηc_t + ηF(z_{t−1/2}) − 2ηF(z_t) + ηF(z_{t+1/2})∥² + (1 − (24 − 8ρ/η)η²L²)∥z_{t+1} − z_{t+1/2}∥² + (4ρ/η)∥ηF(z_{t+1}) + ηc_{t+1} − ηF(z_t) − ηc_t∥² − (8ρ/η)∥ηF(z_{t+1}) − ηF(z_{t+1/2})∥².

Denote

B_1 = ηc_{t+1} − ηc_t + ηF(z_{t−1/2}) − 2ηF(z_t) + ηF(z_{t+1/2}),
B_2 = z_{t+1} − z_{t+1/2} = ηF(z_{t−1/2}) + ηc_t − ηF(z_{t+1/2}) − ηc_{t+1} (by (13)),
B_3 = ηF(z_{t+1}) + ηc_{t+1} − ηF(z_t) − ηc_t,
B_4 = ηF(z_{t+1}) − ηF(z_{t+1/2}).

It is not hard to check that B_1 − B_2 = 2(B_3 − B_4):

B_1 − B_2 = 2ηc_{t+1} − 2ηc_t − 2ηF(z_t) + 2ηF(z_{t+1/2}) = 2(B_3 − B_4).

Note that ρ is non-positive, and we have

(4/(t(t+1))) · (Expression (14) + Expression (17) + Expression (18))
= ∥B_1∥² + (1 − (24 − 8ρ/η)η²L²)∥B_2∥² + (ρ/η)∥2B_3∥² − (2ρ/η)∥2B_4∥²
≥ (1/2 − (12 − 4ρ/η)η²L²)∥B_1 − B_2∥² + (ρ/η)∥2B_3∥² − (2ρ/η)∥2B_4∥² (∥a∥² + ∥b∥² ≥ (1/2)∥a − b∥² and (24 − 8ρ/η)η²L² ≥ 0)
≥ (1/2 − (12 − 4ρ/η)η²L²)∥B_1 − B_2∥² + (2ρ/η)∥2B_3 − 2B_4∥² (−∥a∥² + 2∥b∥² ≥ −2∥a − b∥² and −ρ/η ≥ 0)
= (1/2 − (12 − 4ρ/η)η²L² + 2ρ/η)∥B_1 − B_2∥² (B_1 − B_2 = 2(B_3 − B_4))
≥ 0. (Inequality (10))

The last inequality holds by the choice of η as shown in Fact 2.

C.3 BOUNDING THE POTENTIAL AT ITERATION 1

Lemma 3. Let F be an L-Lipschitz operator and A be a maximally monotone operator. For any z_0 = z_{1/2} ∈ R^n, η ∈ (0, 1/(2L)), and z_1 = J_{ηA}[z_0 − ηF(z_0)], we have the following:
1. ∥z_1 − z_0∥ ≤ η · r^tan_{F,A}(z_0).
2. ∥ηF(z_1) + ηc_1∥ ≤ (1 + ηL)∥z_1 − z_0∥.
3. V_1 ≤ 4∥z_1 − z_0∥², where V_1 is defined in (9).

Proof. For any c ∈ A(z_0), by the non-expansiveness of J_{ηA}, we have

∥z_1 − z_0∥ = ∥J_{ηA}[z_0 − ηF(z_0)] − J_{ηA}[z_0 + ηc]∥ ≤ η∥F(z_0) + c∥.

Proof of Theorem 2. It suffices to prove that for every T ≥ 1, we have ∥ηF(z_T) + ηc_T∥² ≤ 6H²/T². From Lemma 3, we have ∥ηF(z_1) + ηc_1∥² ≤ (1 + ηL)²∥z_1 − z_0∥² ≤ H². So the theorem holds for T = 1. For any T ≥ 2, by Lemma 4 we have

(T(T + 1/2)/4)∥ηF(z_T) + ηc_T∥² ≤ V_T + ∥z_0 − z*∥² ≤ V_1 + ∥z_0 − z*∥² + (1/8)∑_{t=2}^T ∥ηF(z_t) + ηc_t∥² = H² + (1/8)∑_{t=2}^T ∥ηF(z_t) + ηc_t∥².

By subtracting (1/8)∥ηF(z_T) + ηc_T∥² from both sides of the above inequality, we get

(T²/4)∥ηF(z_T) + ηc_T∥² ≤ H² + (1/8)∑_{t=2}^{T−1} ∥ηF(z_t) + ηc_t∥²,

which is in the form of Proposition 4 with C_1 = H² and p = 1/9. Thus, for any T ≥ 2,

∥ηF(z_T) + ηc_T∥² ≤ 6H²/T².
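The ARG update rule recalled in Appendix C.1 can be sketched numerically. This is a minimal NumPy illustration under our own naming, instantiated in the unconstrained monotone case (ρ = 0), where the resolvent is the identity; the step size η = 1/(12L) is the choice validated in Fact 2.

```python
import numpy as np

def arg_method(F, resolvent, z0, eta, num_iters):
    """ARG: one oracle call to F and one resolvent evaluation per iteration."""
    z0 = np.asarray(z0, dtype=float)
    z_prev, z = z0, resolvent(z0 - eta * F(z0))        # z_0 and z_1
    for t in range(1, num_iters):
        anchor = (z0 - z) / (t + 1)                    # anchoring toward z_0
        z_half = 2 * z - z_prev + anchor - (z0 - z_prev) / t
        z_prev, z = z, resolvent(z - eta * F(z_half) + anchor)
    return z

F = lambda z: np.array([z[1], -z[0]])                  # monotone (rho = 0) toy operator, L = 1
identity = lambda z: z                                 # unconstrained: resolvent is the identity
z = arg_method(F, identity, np.array([1.0, 1.0]), eta=1 / 12, num_iters=2000)
```

In the unconstrained case c_T = 0, so the residual ∥F(z_T)∥ itself shrinks at the O(1/T) rate of Theorem 2.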

D MISSING PROOFS IN SECTION 5

To prove Theorem 3, our analysis is based on a potential function argument and can be summarized in the following three steps. (1) We construct a potential function and show that it is non-increasing between two consecutive iterates; (2) we prove that the RG algorithm has a best-iterate convergence rate, i.e., for any T ≥ 1, there exists an iterate t* ∈ [T] such that the potential function at iterate t* is small; (3) we combine the above steps to show that the last iterate has the same convergence guarantee as the best iterate and derive the O(1/√T) last-iterate convergence rate.

D.1 NON-INCREASING POTENTIAL

Potential Function. We denote

c_{t+1} := (z_t − ηF(z_{t+1/2}) − z_{t+1})/η, ∀t ≥ 0. (20)

Note that according to the update rule of RG, z_{t+1} = Π_Z[z_t − ηF(z_{t+1/2})], so c_{t+1} ∈ N_Z(z_{t+1}). The potential function we adopt is P_t, defined as

P_t := ∥F(z_t) + c_t∥² + ∥F(z_t) − F(z_{t−1/2})∥², ∀t ≥ 1. (21)

Lemma 5. In the same setup as Theorem 3, P_t ≥ P_{t+1} for any t ≥ 1.

Proof. The plan is to show that P_t − P_{t+1} plus a few non-positive terms equals a sum of squares:

η² · (P_t − P_{t+1}) + LHS of Inequality (22) + LHS of Inequality (23) + LHS of Inequality (24)
= ∥(z_{t+1/2} − z_{t+1})/2 + ηF(z_{t−1/2}) − ηF(z_t)∥² + ∥(z_{t+1/2} + z_{t+1})/2 − z_t + ηF(z_t) + ηc_t∥².

The right-hand side of the above equality is clearly ≥ 0, and thus we conclude P_t − P_{t+1} ≥ 0.
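The RG update rule analyzed in this section can be sketched as follows; a minimal NumPy illustration with our own naming, instantiated on an unconstrained bilinear toy problem where the projection is the identity.

```python
import numpy as np

def reflected_gradient(F, proj, z0, eta, num_iters):
    """RG: one oracle call to F and one projection per iteration."""
    z_prev = np.asarray(z0, dtype=float)
    z = proj(z_prev - eta * F(z_prev))   # z_1, using z_{1/2} = z_0
    for _ in range(1, num_iters):
        z_half = 2 * z - z_prev          # reflection step (no projection, no oracle call)
        z_prev, z = z, proj(z - eta * F(z_half))
    return z

F = lambda z: np.array([z[1], -z[0]])    # monotone, 1-Lipschitz toy operator
identity = lambda z: z
z = reflected_gradient(F, identity, np.array([1.0, 1.0]), eta=0.2, num_iters=1000)
```

Note that η = 0.2 satisfies the step-size condition η < 1/((1 + √2)L) of this section with L = 1.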

D.2 BEST-ITERATE CONVERGENCE

In this section, we show that for any T ≥ 1, there exists some iterate t* such that P_{t*} = O(1/T), which is implied by ∑_{t=1}^T P_t = O(1). To prove this, we first show ∑_{t=1}^T ∥z_{t+1/2} − z_t∥² = ∑_{t=1}^T ∥z_t − z_{t−1}∥² = O(1) and then relate ∑_{t=1}^T P_t to these two quantities.

Lemma 6. In the same setup as Theorem 3, for any T ≥ 1, we have

∑_{t=1}^T ∥z_{t+1/2} − z_t∥² = ∑_{t=1}^T ∥z_t − z_{t−1}∥² ≤ H²/(1 − (1 + √2)ηL).

Proof. First note that by the update rule of RG, we have z_{t+1/2} = 2z_t − z_{t−1} and thus z_{t+1/2} − z_t = z_t − z_{t−1}. Therefore, it suffices to prove the inequality for ∑_{t=1}^T ∥z_{t+1/2} − z_t∥². From the proof of (Hsieh et al., 2019, Lemma 2), for any t ≥ 1 and p ∈ Z, we have

(1 − (1 + √2)ηL)∥z_{t+1/2} − z_t∥² ≤ ∥z_t − p∥² − ∥z_{t+1} − p∥² − 2η⟨F(z_{t+1/2}), z_{t+1/2} − p⟩ + ηL(∥z_t − z_{t−1/2}∥² − ∥z_{t+1} − z_{t+1/2}∥²). (25)

We set p = z*, a solution of the variational inequality (VI) problem, in the above inequality. Note that

−2η⟨F(z_{t+1/2}), z_{t+1/2} − z*⟩ = −2η⟨F(z_{t+1/2}) − F(z*), z_{t+1/2} − z*⟩ − 2η⟨F(z*), z_{t+1/2} − z*⟩ ≤ −2η⟨F(z*), z_{t+1/2} − z*⟩ (F is monotone) = 2η⟨F(z*), z_{t−1} − z*⟩ − 4η⟨F(z*), z_t − z*⟩, (26)

where the last equality holds since z_{t+1/2} = 2z_t − z_{t−1}. Also note that ⟨F(z*), z_t − z*⟩ ≥ 0 for all t ≥ 0, since z_t ∈ Z and z* is a solution to (VI). Combining Inequality (25) and Inequality (26), telescoping the terms for t = 1, 2, ..., T, and dividing both sides by 1 − (1 + √2)ηL > 0, we get

∑_{t=1}^T ∥z_{t+1/2} − z_t∥² ≤ (∥z_1 − z*∥² + ∥z_1 − z_{1/2}∥² + 2η⟨F(z*), z_0 − z*⟩)/(1 − (1 + √2)ηL).

To get a cleaner constant that only depends on the starting point z_0 = z_{1/2}, we further simplify the three terms in the numerator. Since η < 1/(2L) and z_1 = Π_Z[z_0 − ηF(z_0)], we have

∥z_1 − z_{1/2}∥² = ∥z_1 − z_0∥² ≤ η²∥F(z_0)∥² ≤ (4/L²)∥F(z_0)∥².

Thus we have

∥z_1 − z*∥² ≤ 2∥z_1 − z_0∥² + 2∥z_0 − z*∥² ≤ (8/L²)∥F(z_0)∥² + 2∥z_0 − z*∥².
Moreover,

2η⟨F(z*), z_0 − z*⟩ ≤ 2η∥F(z*)∥∥z_0 − z*∥
≤ 2η(∥F(z*) − F(z_0)∥ + ∥F(z_0)∥)∥z_0 − z*∥ (∥A∥ ≤ ∥A − B∥ + ∥B∥)
≤ 2ηL∥z_0 − z*∥² + 2η∥F(z_0)∥∥z_0 − z*∥
≤ ∥z_0 − z*∥² + (1/L)∥F(z_0)∥∥z_0 − z*∥ (η < 1/(2L))
≤ 2∥z_0 − z*∥² + (1/L²)∥F(z_0)∥². (2ab ≤ a² + b²)

Thus

∥z_1 − z*∥² + ∥z_1 − z_{1/2}∥² + 2η⟨F(z*), z_0 − z*⟩ ≤ (13/L²)∥F(z_0)∥² + 4∥z_0 − z*∥² = H².

This completes the proof.

Lemma 7. In the same setup as Theorem 3, for any T ≥ 1, we have ∑_{t=1}^T P_t ≤ λ²H²L².

Proof. We first show an upper bound for P_t:

P_t = ∥F(z_t) + c_t∥² + ∥F(z_t) − F(z_{t−1/2})∥²
= ∥F(z_t) − F(z_{t−1/2}) + (z_{t−1} − z_t)/η∥² + ∥F(z_t) − F(z_{t−1/2})∥² (definition of c_t, (20))
≤ 3∥F(z_t) − F(z_{t−1/2})∥² + (2/η²)∥z_t − z_{t−1}∥² (∥A + B∥² ≤ 2∥A∥² + 2∥B∥²)
≤ 3L²∥z_t − z_{t−1/2}∥² + (2/η²)∥z_t − z_{t−1}∥² (F is L-Lipschitz)
= 3L²∥z_t − z_{t−1} + z_{t−1} − z_{t−1/2}∥² + (2/η²)∥z_t − z_{t−1}∥²
≤ 6L²∥z_{t−1/2} − z_{t−1}∥² + (2/η² + 6L²)∥z_t − z_{t−1}∥² (∥A + B∥² ≤ 2∥A∥² + 2∥B∥²)
≤ ((2 + 6η²L²)/η²)(∥z_{t−1/2} − z_{t−1}∥² + ∥z_t − z_{t−1}∥²).

Summing the above inequality for t = 1, 2, ..., T, we get

∑_{t=1}^T P_t ≤ ((2 + 6η²L²)/η²) ∑_{t=1}^T (∥z_{t−1/2} − z_{t−1}∥² + ∥z_t − z_{t−1}∥²)
= ((2 + 6η²L²)/η²)(∥z_1 − z_0∥² + ∑_{t=1}^{T−1}(∥z_{t+1/2} − z_t∥² + ∥z_{t+1} − z_t∥²))
≤ ((2 + 6η²L²)/η²)(∥z_1 − z_0∥² + 2H²/(1 − (1 + √2)ηL))
≤ 6(1 + 3η²L²)H²/(η²(1 − (1 + √2)ηL)).

The second-to-last inequality holds by Lemma 6. The last inequality holds since ∥z_1 − z_0∥² ≤ (4/L²)∥F(z_0)∥² ≤ H². Recall that λ² = 6(1 + 3η²L²)/(η²L²(1 − (1 + √2)ηL)). This completes the proof.

D.3 PROOF OF THEOREM 3

Fix any T ≥ 1. From Lemma 5, we know that the potential function P_t is non-increasing in t ≥ 1. Lemma 7 guarantees that the sum of potentials ∑_{t=1}^T P_t is upper bounded by λ²H²L², where λ² = 6(1 + 3η²L²)/(η²L²(1 − (1 + √2)ηL)). Combining the above, we conclude that the potential function at the last iterate, P_T, is upper bounded by

E NUMERICAL ILLUSTRATION

In this section, we conduct numerical experiments to illustrate and compare the performance of several algorithms: Reflected Gradient (RG), Extragradient (EG), Accelerated Reflected Gradient (ARG), and Fast Extragradient (FEG) (Lee & Kim, 2021a). Among them, ARG and FEG are accelerated algorithms, while RG and EG are not.

Test Problem. We use a classical example (Problem 1 in (Malitsky, 2015)), which is unconstrained with operator F(z) = Az, where A is the n × n matrix with

A(i, j) = 1 if j = n + 1 − i > i, A(i, j) = −1 if j = n + 1 − i < i, and A(i, j) = 0 otherwise.

Note that F is 1-Lipschitz, and its solution is the zero vector 0 when n is even.
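A sketch of the construction of A (the helper name is ours), which also makes it easy to check the two properties used above: A is antisymmetric, so F(z) = Az is monotone, and for even n each row and column of A contains exactly one ±1 entry, so A is a signed permutation matrix with spectral norm 1, making F 1-Lipschitz.

```python
import numpy as np

def malitsky_matrix(n):
    """Operator matrix of Problem 1 in Malitsky (2015), 1-based indices:
    A[i, j] = 1 if j = n + 1 - i > i, A[i, j] = -1 if j = n + 1 - i < i, else 0."""
    A = np.zeros((n, n))
    for i in range(1, n + 1):
        j = n + 1 - i
        if j > i:
            A[i - 1, j - 1] = 1.0
        elif j < i:
            A[i - 1, j - 1] = -1.0
    return A

A = malitsky_matrix(10)
F = lambda z: A @ z
# A is antisymmetric; its spectral norm is 1, so F is 1-Lipschitz.
```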

Test Details

We run experiments using Python 3.9 in a Jupyter notebook on a MacBook Air (M1, 2020) running macOS 12.5.1. Execution time is measured using the time package in Python. For all tests, we take the initial point to be the all-ones vector z_0 = (1, ..., 1). We denote by η the step size, and the termination criterion is that the residual (operator norm) satisfies ∥F(z_t)∥ ≤ ε. The code can be found in the Supplementary Material.

Test Results

The results for EG and RG are shown in Figure 1. With step size η = 0.4, EG is slower than RG. This is due to the fact that EG makes two gradient calls per iteration. Even with the optimized step size η = 0.7, which gives EG its best performance, EG is still slower than RG on this problem. Our results are consistent with the numerical results on Mathematica by Malitsky (2015). The results for FEG and ARG are shown in Figure 2. With step size η = 0.5, FEG is slower than ARG. With the optimized step size η = 1, FEG is slightly faster than ARG. So for this problem, the performance of FEG and ARG is comparable. We also remark that for this particular problem, both ARG and FEG are slower than EG and RG. This does not contradict our theoretical results on worst-case convergence rates: simple algorithms like RG and EG can be faster than accelerated methods like ARG and FEG on particular problems. This also illustrates the importance of understanding simple algorithms like RG.

F AUXILIARY PROPOSITIONS

Proposition 3 (Two Identities). Let (x_k)_{k∈[4]}, (y_k)_{k∈[4]}, x_0, u_2, and u_4 be arbitrary vectors in R^n. Let k ≥ 1 and q ∈ (0, 1) be two real numbers. If the following two equations hold:

x_3 = x_2 − y_1 − u_2,
x_4 = x_2 − y_3 − u_4,

then the following identity holds:

∥y_2 + u_2∥² + ∥y_2 − y_1∥² − ∥y_4 + u_4∥² − ∥y_4 − y_3∥² − 2⟨y_4 − y_2, x_4 − x_2⟩ − 2((1/4)∥x_4 − x_3∥² − ∥y_4 − y_3∥²) − 2⟨u_4 − u_2, x_4 − x_2⟩
= ∥(x_3 − x_4)/2 + y_1 − y_2∥² + ∥(x_3 + x_4)/2 − x_2 + y_2 + u_2∥².

If the following two equations hold:

x_3 = x_2 − y_1 − u_2 + (1/(k+1))(x_0 − x_2),
x_4 = x_2 − y_3 − u_4 + (1/(k+1))(x_0 − x_2),
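The paper verifies these identities with MATLAB; the first identity can also be checked numerically in NumPy. This is a quick sanity-check sketch with randomly drawn vectors (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
y1, y2, y3, y4, u2, u4, x2 = (rng.standard_normal(n) for _ in range(7))
x3 = x2 - y1 - u2                                    # first condition of Proposition 3
x4 = x2 - y3 - u4                                    # second condition

sq = lambda v: float(np.dot(v, v))                   # squared Euclidean norm
lhs = (sq(y2 + u2) + sq(y2 - y1) - sq(y4 + u4) - sq(y4 - y3)
       - 2 * float(np.dot(y4 - y2, x4 - x2))
       - 2 * (0.25 * sq(x4 - x3) - sq(y4 - y3))
       - 2 * float(np.dot(u4 - u2, x4 - x2)))
rhs = sq((x3 - x4) / 2 + y1 - y2) + sq((x3 + x4) / 2 - x2 + y2 + u2)
```

Since the identity holds for all vectors satisfying the two conditions, lhs and rhs agree up to floating-point error for any random draw.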



Indeed, even if A is maximally monotone, (Daskalakis et al., 2021) implies that the problem is still intractable without further assumptions on F.



λ²H²L²/T. Since P_T = ∥F(z_T) + c_T∥² + ∥F(z_T) − F(z_{T−1/2})∥², we obtain the last-iterate convergence rate

r^tan_{F,Z}(z_T)² ≤ ∥F(z_T) + c_T∥² ≤ λ²H²L²/T.

The convergence rate on ∥F(z_T) + c_T∥² implies a convergence rate on the gap function GAP_{Z,F,D}(z_T) by Lemma 1:

GAP_{Z,F,D}(z_T) ≤ D · ∥F(z_T) + c_T∥ ≤ λDHL/√T.

Figure 1: Results for EG and RG when ε = 0.001. The red line and blue line are EG and RG with step size η = 0.4. The yellow line is EG with (approximately) optimized step size η = 0.7. We remark that RG would diverge with η = 0.7.

Figure 2: Results for FEG and ARG when ε = 0.01. The red line and blue line are FEG and ARG with step size η = 0.5. The yellow line is FEG with (approximately) optimized step size η = 1.

EG+ to CEG+ algorithm, achieving the same convergence rate in the general (constrained) setting. To the best of our knowledge, the OG algorithm is the only single-call single-resolvent algorithm with O(1




Three Non-Positive Terms. Since F is monotone, we have

(−2)⟨ηF(z_{t+1}) − ηF(z_t), z_{t+1} − z_t⟩ ≤ 0. (22)

Since F is L-Lipschitz and 0 < η < ; replace u_2 with ηc_t; replace u_4 with ηc_{t+1}; also note that x_3 = x_2 − y_1 − u_2 and x_4 = x_2 − y_3 − u_4 hold due to the above equivalent formulations of z_{t+1/2} and z_{t+1}. η² · (P_t − P_{t+1}) + LHS of Inequality (22) + LHS of Inequality (23) + LHS of Inequality (24)

ACKNOWLEDGEMENTS

Yang Cai is supported by a Sloan Foundation Research Fellowship and the NSF Award CCF-1942583 (CAREER). We thank the anonymous reviewers for their constructive comments.

annex

Thus ∥z_1 − z_0∥ ≤ η · r^tan_{F,A}(z_0). By the definition of V_1 in (9), we have

V_1 = ∥ηF(z_1) + ηc_1∥² + ∥ηF(z_1) − ηF(z_0)∥² + ⟨ηF(z_1) + ηc_1, z_1 − z_0⟩.

We bound ∥ηF(z_1) + ηc_1∥ first. Note that by definition, ηc_1 = z_0 − ηF(z_0) − z_1. Thus

∥ηF(z_1) + ηc_1∥ = ∥ηF(z_1) − ηF(z_0) + z_0 − z_1∥ ≤ (1 + ηL)∥z_1 − z_0∥.

Then we can apply the bound on ∥ηF(z_1) + ηc_1∥ to bound V_1, where we use the L-Lipschitzness of F and the Cauchy-Schwarz inequality in the first inequality, ∥ηF(z_1) + ηc_1∥ ≤ (1 + ηL)∥z_1 − z_0∥ in the second inequality, and ηL ≤ 1/2 in the last inequality.

C.4 PROOF OF THEOREM 2

We first show that the potential function satisfies V_t = Ω(t² · r^tan(z_t)²).

Lemma 4. In the same setup as Theorem 2, for any t ≥ 1, we have

Proof. Since 0 ∈ F(z*) + A(z*), by the ρ-comonotonicity of F + A and Fact 2, we have

By the definition of V_t in (9), for any t ≥ 1, we have

where in the second-to-last inequality we apply ⟨a, b⟩ ≥ −(α/4)∥a∥² − (1/α)∥b∥² with a = √t(ηF(z_t) + ηc_t), b = √t(z* − z_0), and α = t + 1/2.

then the following identity holds:

We verify the two identities with MATLAB. The code is available at https://github.com/weiqiangzheng1999/Single-Call.

Proposition 4 ((Cai et al., 2022a)). Let {a_k ∈ R_+}_{k≥2} be a sequence of real numbers. Let C_1 ≥ 0 and p ∈ (0, 1/3) be two real numbers. If the following condition holds for every k ≥ 2,

then for each k ≥ 2 we have

Proof. We prove the statement by induction.

Base Case: k = 2. From Inequality (27), we have

Thus, Inequality (28) holds for k = 2.

Inductive Step: for any k ≥ 3. Fix some k ≥ 3 and assume that Inequality (28) holds for all 2 ≤ t ≤ k − 1. We slightly abuse notation and treat a summation of the form ∑_{t=3}^{2} as 0. By Inequality (27), we have

(Induction assumption on Inequality (28))

This completes the inductive step. Therefore, for all k ≥ 2, we have

