Solving stochastic weak Minty variational inequalities without increasing batch size

Abstract

This paper introduces a family of stochastic extragradient-type algorithms for a class of nonconvex-nonconcave problems characterized by the weak Minty variational inequality (MVI). Unlike existing results on extragradient methods in the monotone setting, employing diminishing stepsizes is no longer possible in the weak MVI setting. This has led to approaches such as increasing batch sizes per iteration, which can however be prohibitively expensive. In contrast, our proposed method involves two stepsizes and requires only one additional oracle evaluation per iteration. We show that it is possible to keep the first stepsize fixed while only the second stepsize is taken to be diminishing, making the scheme interesting even in the monotone setting. Almost sure convergence is established, and we provide a unified analysis for this family of schemes, which contains a nonlinear generalization of the celebrated primal dual hybrid gradient algorithm.

1. Introduction

Stochastic first-order methods have been at the core of the current success of deep learning applications. These methods are by now mostly well understood for minimization problems. This is even the case in the nonconvex setting, where there exist matching upper and lower bounds on the complexity of finding an approximately stationary point (Arjevani et al., 2019). The picture becomes less clear when moving beyond minimization to nonconvex-nonconcave minimax problems, or more generally nonmonotone variational inequalities. Even in the deterministic case, finding a stationary point is in general intractable (Daskalakis et al., 2021; Hirsch & Vavasis, 1987). This is in stark contrast with minimization, where only global optimality is NP-hard. An interesting nonmonotone class for which we do have efficient algorithms is characterized by the so-called weak Minty variational inequality (MVI) (Diakonikolas et al., 2021). This problem class captures nontrivial structures such as attracting limit cycles and is governed by a parameter ρ whose negativity increases the degree of nonmonotonicity. It turns out that the stepsize γ of the exploration step in extragradient-type schemes lower bounds the problem class through ρ > -γ/2 (Pethick et al., 2022). In other words, it seems that we need to take γ large to guarantee convergence for a large class. This reliance on a large stepsize is at the core of why the community has struggled to provide stochastic variants for weak MVIs. The only known results effectively increase the batch size at every iteration (Diakonikolas et al., 2021, Thm. 4.5), a strategy that would be prohibitively expensive in most machine learning applications. Pethick et al. (2022) proposed (SEG+), which attempts to tackle the noise by only diminishing the second stepsize. This suffices in the special case of unconstrained quadratic games but can fail even in the monotone case, as illustrated in Figure 1. This naturally raises the following research question: can stochastic weak Minty variational inequalities be solved without increasing the batch size? We resolve this open problem in the affirmative, when the stochastic oracles are Lipschitz in mean, with a modification of stochastic extragradient called bias-corrected stochastic extragradient (BC-SEG+). The scheme requires only one additional first-order oracle call, while crucially maintaining the fixed stepsize. Specifically, we make the following contributions:

(i) We show that it is possible to converge for weak MVIs without increasing the batch size, by introducing a bias-correction term. The scheme introduces no additional hyperparameters and recovers the maximal range ρ ∈ (-γ/2, ∞) of explicit deterministic schemes. The rate we establish is interesting already in the star-monotone case, where only asymptotic convergence of the norm of the operator was known when refraining from increasing the batch size (Hsieh et al., 2020, Thm. 1). Our result additionally carries over to another class of problems treated in Appendix G, which we call negative weak MVIs.

(ii) We generalize the result to a whole family of schemes that can treat constrained and regularized settings. First and foremost, the class includes a generalization of the forward-backward-forward (FBF) algorithm of Tseng (2000) to stochastic weak MVIs. The class also contains a stochastic nonlinear extension of the celebrated primal dual hybrid gradient (PDHG) algorithm (Chambolle & Pock, 2011).
Both methods are obtained as instantiations of the same template scheme, thus providing a unified analysis and revealing an interesting requirement on the update under weak MVI when only stochastic feedback is available.

(iii) We prove almost sure convergence under the classical Robbins-Monro schedule for the second stepsize. This provides a guarantee on the last iterate, which is especially important in the nonmonotone case, where average guarantees cannot be converted into a single candidate solution. Almost sure convergence is challenging already in the monotone case, where even stochastic extragradient may not converge (Hsieh et al., 2020, Fig. 1).

2. Related work

Weak MVI
Diakonikolas et al. (2021) were the first to observe that an extragradient-like scheme called extragradient+ (EG+) converges globally for weak MVIs with ρ ∈ (-1/(8L_F), ∞). This result was later tightened to ρ ∈ (-1/(2L_F), ∞) and extended to constrained and regularized settings in Pethick et al. (2022). A single-call variant has been analyzed in Böhm (2022). Weak MVI is a star variant of cohypomonotonicity, for which an inexact proximal point method was originally studied in Combettes & Pennanen (2004). Later, a tight characterization was carried out by Bauschke et al. (2021) for the exact case. It was shown that acceleration is achievable for an extragradient-type scheme even for cohypomonotone problems (Lee & Kim, 2021). Despite this array of positive results, the stochastic case is largely untreated for weak MVIs. The only known result (Diakonikolas et al., 2021, Thm. 4.5) requires the batch size to be increasing. Similarly, the accelerated method in Lee & Kim (2021, Thm. 6.1) requires the variance of the stochastic oracle to decrease as O(1/k).

Stochastic & monotone
When more structure is present the story is different, since diminishing stepsizes become permissible. In the monotone case, rates for the gap function were obtained for stochastic Mirror-Prox in Juditsky et al. (2011) under a bounded domain assumption, which was later relaxed for the extragradient method under additional assumptions (Mishchenko et al., 2020). The norm of the operator was shown to converge asymptotically for unconstrained MVIs in Hsieh et al. (2020) with a double stepsize policy. There exists a multitude of extensions for monotone problems: single-call stochastic methods are covered in detail by Hsieh et al. (2019), variance reduction was applied to Halpern-type iterations (Cai et al., 2022), cocoercivity was used in Beznosikov et al. (2022), and bilinear games were studied in Li et al. (2022). Beyond monotonicity, a range of structures have been explored, such as MVIs (Song et al., 2020), pseudomonotonicity (Kannan & Shanbhag, 2019; Boţ et al., 2021), the two-sided Polyak-Łojasiewicz condition (Yang et al., 2020), expected cocoercivity (Loizou et al., 2021), sufficiently bilinear games (Loizou et al., 2020), and strong star-monotonicity (Gorbunov et al., 2022).

Variance reduction
The assumptions we make about the stochastic oracle in Section 3 are similar to what is found in the variance reduction literature (see for instance Alacaoglu & Malitsky (2021, Assumption 1) or Arjevani et al. (2019)). However, our use of the assumption is different in a crucial way. Whereas the variance reduction literature uses the stepsize γ ∝ 1/L̄_F (see e.g. Alacaoglu & Malitsky (2021, Thm. 2.5)), we aim at using the much larger γ ∝ 1/L_F. For instance, in the special case of a finite-sum problem of size N, the mean square smoothness constant L̄_F from Assumption III can be √N times larger than L_F (see Appendix I for details). Using 1/L̄_F would thus lead to a prohibitively strict requirement on the degree of allowed nonmonotonicity through the relationship ρ > -γ/2.

Bias-correction

The idea of adding a correction term has also been exploited in minimization, specifically in the context of compositional optimization (Chen et al., 2021). Due to their distinct problem setting it suffices to simply extend stochastic gradient descent (SGD), albeit under additional assumptions such as (Chen et al., 2021, Assumption 3). In our setting, however, an SGD-type scheme is not applicable even when restricting ourselves to monotone problems.

3. Problem formulation and preliminaries

We are interested in finding z ∈ ℝⁿ such that the following inclusion holds,

0 ∈ Tz := Az + Fz. (3.1)

A wide range of machine learning applications can be cast as an inclusion. Most noticeably, a structured minimax problem can be reduced to (3.1), as shown in Section 8.1. We will rely on common notation and concepts from monotone operator theory (see Appendix B for precise definitions).

Assumption I. In problem (3.1),
(i) the operator F : ℝⁿ → ℝⁿ is L_F-Lipschitz with L_F ∈ [0, ∞), i.e.,

∥Fz - Fz′∥ ≤ L_F∥z - z′∥ for all z, z′ ∈ ℝⁿ; (3.2)

(ii) the operator A : ℝⁿ ⇒ ℝⁿ is maximally monotone;
(iii) the weak Minty variational inequality (MVI) holds, i.e., there exists a nonempty set S⋆ ⊆ zer T such that for all z⋆ ∈ S⋆ and some ρ ∈ (-1/(2L_F), ∞),

⟨v, z - z⋆⟩ ≥ ρ∥v∥² for all (z, v) ∈ gph T. (3.3)

Remark 1. In the unconstrained and smooth case (A ≡ 0), Assumption I(iii) reduces to ⟨Fz, z - z⋆⟩ ≥ ρ∥Fz∥² for all z ∈ ℝⁿ. When ρ = 0 this condition reduces to the MVI (i.e. star-monotonicity), while negative ρ makes the problem increasingly nonmonotone. Interestingly, the inequality is not symmetric, and one may instead consider that the assumption holds for -F. Through this observation, Appendix G extends the reach of the extragradient-type algorithms developed for weak MVIs.
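As a concrete instance of (3.1), anticipating the minimax setting of Section 8.1, the correspondence below is a direct restatement of (8.5)-(8.6):

```latex
% Minimax problem and its first-order optimality inclusion, cf. (8.5)-(8.6):
\min_{x \in \mathbb{R}^n}\ \max_{y \in \mathbb{R}^r}\; f(x) + \varphi(x,y) - g(y)
\quad\rightsquigarrow\quad
0 \in Tz := \underbrace{(\partial f(x),\, \partial g(y))}_{Az}
          + \underbrace{(\nabla_x \varphi(z),\, -\nabla_y \varphi(z))}_{Fz},
\qquad z = (x, y).
```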

Stochastic oracle

We assume that we cannot compute Fz easily, but that instead we have access to the stochastic oracle F̂(z, ξ), which we assume is unbiased with bounded variance. We additionally assume that z ↦ F̂(z, ξ) is L̄_F-Lipschitz continuous in mean and that the oracle can be queried at two points simultaneously under the same randomness.

Assumption II. For the operator F̂(·, ξ) : ℝⁿ → ℝⁿ the following hold.
(i) Two-point oracle: the stochastic oracle can be queried at any two points z, z′ ∈ ℝⁿ, yielding

F̂(z, ξ), F̂(z′, ξ) where ξ ∼ P. (3.4)

(ii) Unbiased: E_ξ[F̂(z, ξ)] = Fz for all z ∈ ℝⁿ.
(iii) Bounded variance: E_ξ∥F̂(z, ξ) - Fz∥² ≤ σ_F² for all z ∈ ℝⁿ.

Assumption III. The operator F̂(·, ξ) : ℝⁿ → ℝⁿ is Lipschitz continuous in mean with L̄_F ∈ [0, ∞):

E_ξ∥F̂(z, ξ) - F̂(z′, ξ)∥² ≤ L̄_F²∥z - z′∥² for all z, z′ ∈ ℝⁿ. (3.5)

Remark 2. Assumptions II(i) and III are also common in the variance reduction literature (Fang et al., 2018; Nguyen et al., 2019; Alacaoglu & Malitsky, 2021), but in contrast with variance reduction we will not necessarily need knowledge of L̄_F to specify the algorithm, in which case this problem constant only affects the complexity. Crucially, this decoupling of the stepsize from L̄_F will allow the proposed scheme to converge for a larger range of ρ in Assumption I(iii). Finally, note that Assumption II(i) commonly holds in machine learning applications, where the stochasticity is usually induced by the sampled mini-batch.
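In a mini-batch implementation, Assumption II(i) holds by construction: fixing the sampled batch fixes ξ, which can then be reused at two query points. A minimal sketch, where the finite-sum structure and the names `F_single`/`F_hat` are our own illustrative assumptions and not from the paper:

```python
import numpy as np

def make_two_point_oracle(samples, batch_size, F_single, rng):
    """Build a two-point stochastic oracle: one draw of xi (a mini-batch)
    that can be evaluated at two different points z, z_prime."""
    def sample_xi():
        # xi is the realized randomness: here, a mini-batch of indices.
        return rng.choice(len(samples), size=batch_size, replace=False)

    def F_hat(z, xi):
        # Unbiased estimate of Fz (Assumption II(ii)): average over the batch.
        return np.mean([F_single(z, samples[i]) for i in xi], axis=0)

    return sample_xi, F_hat

# Usage: query the oracle at two points under the *same* randomness,
# as required by the bias-correction term in Algorithm 1 (step 1.2):
# xi = sample_xi(); v, v_prev = F_hat(z, xi), F_hat(z_prev, xi)
```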

4. Method

To arrive at a stochastic scheme for weak MVI we first need to understand the crucial ingredients in the deterministic setting. For simplicity we initially consider the unconstrained and smooth setting, i.e. A ≡ 0 in (3.1).

Algorithm 1 (BC-SEG+) Stochastic algorithm for problem (3.1) when A ≡ 0
Require: z_{-1} = z̄_{-1} = z_0 ∈ ℝⁿ, α_k ∈ (0, 1), γ ∈ ([-2ρ]₊, 1/L_F)
Repeat for k = 0, 1, . . . until convergence:
1.1: Sample ξ_k ∼ P
1.2: z̄_k = z_k - γF̂(z_k, ξ_k) + (1 - α_k)(z̄_{k-1} - z_{k-1} + γF̂(z_{k-1}, ξ_k))
1.3: Sample ξ̃_k ∼ P
1.4: z_{k+1} = z_k - α_kγF̂(z̄_k, ξ̃_k)
Return z_{k+1}

The first component is taking the second stepsize α smaller, as done in extragradient+ (EG+),

z̄_k = z_k - γFz_k
z_{k+1} = z_k - αγFz̄_k (EG+)

where α ∈ (0, 1). Convergence under weak MVI was first shown by Diakonikolas et al. (2021) and later tightened by Pethick et al. (2022), who showed that a smaller α allows for a larger range of the problem constant ρ. Taking α small is unproblematic for a stochastic scheme, where the stepsize is usually taken diminishing regardless. However, Pethick et al. (2022) also showed that the extrapolation stepsize γ plays a critical role for convergence under weak MVI. Specifically, they proved that a larger stepsize γ leads to a looser bound on the problem class through ρ > -γ/2. While a lower bound has not been established, we provide an example in Figure 3 of Appendix H where a small stepsize prevents convergence. Unfortunately, picking γ large (e.g. as γ = 1/L_F) causes significant complications in the stochastic case, where both stepsizes are usually taken diminishing as in the following scheme,

z̄_k = z_k - β_kγF̂(z_k, ξ_k) with ξ_k ∼ P
z_{k+1} = z_k - α_kγF̂(z̄_k, ξ̃_k) with ξ̃_k ∼ P (SEG)

where α_k = β_k ∝ 1/k. Even with a two-timescale variant (where β_k > α_k) it has only been possible to show convergence for MVIs (i.e. when ρ = 0) (Hsieh et al., 2020). Instead of decreasing both stepsizes, Pethick et al. (2022) propose a scheme that keeps the first stepsize constant,

z̄_k = z_k - γF̂(z_k, ξ_k) with ξ_k ∼ P
z_{k+1} = z_k - α_kγF̂(z̄_k, ξ̃_k) with ξ̃_k ∼ P (SEG+)

However, (SEG+) does not necessarily converge even in the monotone case, as we illustrate in Figure 1. The non-convergence stems from the bias introduced by the randomness of z̄_k in F̂(z̄_k, ξ̃_k). Intuitively, the role of z̄_k is to approximate the deterministic exploration step z̃_k := z_k - γFz_k. While z̄_k is an unbiased estimate of z̃_k, this does not imply that F̂(z̄_k, ξ̃_k) is an unbiased estimate of F(z̃_k). Unbiasedness only holds in special cases, such as when F is linear and A ≡ 0, for which we show convergence of (SEG+) in Section 5 under weak MVI. In the monotone case it suffices to take the exploration stepsize γ diminishing (Hsieh et al., 2020, Thm. 1), but this runs counter to the fixed stepsize requirement of weak MVI. Instead we propose bias-corrected stochastic extragradient+ (BC-SEG+) in Algorithm 1. BC-SEG+ adds a bias-correction term based on the previous operator evaluation under the current randomness ξ_k. This crucially allows us to keep the first stepsize fixed. We further generalize the scheme to constrained and regularized settings with Algorithm 2 by introducing the resolvent (id + γA)⁻¹.
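For concreteness, the following is a minimal Python sketch of Algorithm 1; the two-point oracle interface (`F_hat`, `sample_xi`) is an assumption of the sketch, while the numbered steps mirror the pseudocode above:

```python
import numpy as np

def bc_seg_plus(z0, F_hat, sample_xi, gamma, alpha, num_iters):
    """Sketch of BC-SEG+ (Algorithm 1) for the unconstrained case A = 0.

    F_hat(z, xi): stochastic oracle, queryable at two points per xi
    gamma:        fixed exploration stepsize in ([-2*rho]_+, 1/L_F)
    alpha(k):     diminishing second stepsize in (0, 1), e.g. 1/(k + r)
    """
    z, z_prev = z0.copy(), z0.copy()
    z_bar_prev = z0.copy()  # z_bar_{-1} = z_{-1} = z_0
    for k in range(num_iters):
        xi = sample_xi()                      # step 1.1
        a = alpha(k)
        # step 1.2: exploration with bias correction; both oracle calls
        # reuse the same randomness xi (Assumption II(i)).
        z_bar = (z - gamma * F_hat(z, xi)
                 + (1 - a) * (z_bar_prev - z_prev + gamma * F_hat(z_prev, xi)))
        xi_tilde = sample_xi()                # step 1.3
        z_prev, z_bar_prev = z, z_bar
        z = z - a * gamma * F_hat(z_bar, xi_tilde)  # step 1.4
    return z
```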

5. Analysis of SEG+

In the special case where F is affine and A ≡ 0, we can show convergence of (SEG+) under weak MVI up to arbitrary precision, even with a large stepsize γ.

Theorem 5.1. Suppose that Assumptions I and II hold. Assume Fz := Bz + v and choose α_k ∈ (0, 1) and γ ∈ (0, 1/L_F) such that ρ ≥ γ(α_k - 1)/2. Consider the sequence (z_k)_{k∈ℕ} generated by (SEG+). Then for all z⋆ ∈ S⋆,

Σ_{k=0}^K (α_k/Σ_{j=0}^K α_j) E[∥Fz_k∥²] ≤ (∥z_0 - z⋆∥² + γ²(γ²L_F² + 1)σ_F² Σ_{j=0}^K α_j²)/(γ²(1 - γ²L_F²) Σ_{j=0}^K α_j). (5.1)

Figure 1: Monotone constrained case illustrating the issue for projected variants of (SEG+) (see Appendix H.2 for algorithmic details). The objective is bilinear, ϕ(x, y) = (x - 0.9)·(y - 0.9), under box constraints ∥(x, y)∥_∞ ≤ 1. The unique stationary point (x⋆, y⋆) = (0.9, 0.9) lies in the interior, so even ∥Fz∥ can be driven to zero. Despite the simplicity of the problem, both projected variants of (SEG+) only converge to a γ-dependent neighborhood. For weak MVI with ρ < 0 this neighborhood cannot be made arbitrarily small, since γ cannot be taken arbitrarily small (see Figure 3 of Appendix H).

The underlying reason for this positive result is that F̂(z̄_k, ξ̃_k) is unbiased when F is linear. This no longer holds when either the linearity of F is dropped or the resolvent is introduced for A ≢ 0, in which case the scheme only converges to a γ-dependent neighborhood, as illustrated in Figure 1. This is problematic under weak MVI, where γ cannot be taken arbitrarily small (see Figure 3 of Appendix H).

6. Analysis for unconstrained and smooth case

For simplicity we first consider the case where A ≡ 0. To mitigate the bias introduced in F̂(z̄_k, ξ̃_k) for (SEG+), we propose Algorithm 1, which modifies the exploration step. The algorithm can be seen as a particular instance of the more general scheme treated in Section 7.

Theorem 6.1. Suppose that Assumptions I to III hold. Suppose in addition that γ ∈ ([-2ρ]₊, 1/L_F) and that (α_k)_{k∈ℕ} ⊂ (0, 1) is a diminishing sequence such that

2γL̄_F√α₀ + (1 + ((1 + γ²L_F²)/(1 - γ²L_F²))γ²L_F²)γ²L̄_F²α₀ ≤ 1 + 2ρ/γ. (6.1)

Then the following estimate holds for all z⋆ ∈ S⋆:

E[∥F(z_{k⋆})∥²] ≤ ((1 + ηγ²L_F²)∥z_0 - z⋆∥² + Cσ_F²γ² Σ_{j=0}^K α_j²)/(µ Σ_{j=0}^K α_j), (6.2)

where C = 1 + 2η(γ²L̄_F² + 1) + 2α₀, η = ½(((1 + γ²L_F²)/(1 - γ²L_F²))γ²L_F² + 1/(γL̄_F√α₀)), µ = γ²(1 - γ²L_F²)/2, and k⋆ is chosen from {0, 1, . . . , K} according to the probability P[k⋆ = k] = α_k/Σ_{j=0}^K α_j.

Remark 6.2. As α₀ → 0, the requirement (6.1) reduces to ρ > -γ/2, as in the deterministic setting of Pethick et al. (2022). Letting α_k = α₀/√(k + r), the rate becomes O(1/√k), thus matching the rate for the gap function of stochastic extragradient in the monotone case (see e.g. Juditsky et al. (2011)).

The above result provides a rate for a random iterate, as pioneered by Ghadimi & Lan (2013). Showing last-iterate results, even asymptotically, is more challenging. Already in the monotone case, vanilla (SEG) (where β_k = α_k) only has convergence guarantees for the average iterate (Juditsky et al., 2011). In fact, the scheme can cycle even in simple examples (Hsieh et al., 2020, Fig. 1). Under the classical (but more restrictive) Robbins-Monro stepsize policy, it is possible to show almost sure convergence of the iterates generated by Algorithm 1. The following theorem demonstrates the result in the particular case α_k = 1/(k + r); the more general statement is deferred to Appendix D.

Theorem 6.3 (almost sure convergence). Suppose that Assumptions I to III hold.
Suppose γ ∈ ([-2ρ]₊, 1/L_F), α_k = 1/(k + r) for any positive natural number r, and

(γL̄_F + 1)α_k + 2(((1 + γ²L_F²)/(1 - γ²L_F²))γ⁴L_F²L̄_F²α_{k+1} + γL̄_F)(α_{k+1} + 1)α_{k+1} ≤ 1 + 2ρ/γ. (6.3)

Then the sequence (z_k)_{k∈ℕ} generated by Algorithm 1 converges almost surely to some z⋆ ∈ zer T.

Algorithm 2 (BC-PSEG+) Stochastic algorithm for problem (3.1)
Require: z_{-1} = z_0 ∈ ℝⁿ, h_{-1} ∈ ℝⁿ, α_k ∈ (0, 1), γ ∈ ([-2ρ]₊, 1/L_F)
Repeat for k = 0, 1, . . . until convergence:
2.1: Sample ξ_k ∼ P
2.2: h_k = z_k - γF̂(z_k, ξ_k) + (1 - α_k)(h_{k-1} - z_{k-1} + γF̂(z_{k-1}, ξ_k))
2.3: z̄_k = (id + γA)⁻¹h_k
2.4: Sample ξ̃_k ∼ P
2.5: z_{k+1} = z_k - α_k(h_k - z̄_k + γF̂(z̄_k, ξ̃_k))
Return z_{k+1}

Remark 6.4. As α_k → 0 the condition on ρ reduces to ρ > -γ/2, as in the deterministic case. To make the results more accessible, both theorems make particular choices of the free parameters appearing in the proofs that ensure convergence for a given ρ and γ. However, since the parameters capture inherent tradeoffs, these choices might not always provide the tightest rate. The more general statements of the theorems are therefore preserved in the appendix.

7. Analysis for constrained case

The results for the unconstrained smooth case extend to the setting where the resolvent is available. Algorithm 2 provides a direct generalization of the unconstrained Algorithm 1. The construction relies on approximating the deterministic algorithm proposed in Pethick et al. (2022), which iteratively projects onto a half-space guaranteed to contain the solutions. Defining Hz = z - γFz, that scheme can be written concisely as

z̄_k = (id + γA)⁻¹(Hz_k), z_{k+1} = z_k - α_k(Hz_k - Hz̄_k), (CEG+)

for a particular adaptive choice of α_k ∈ (0, 1). With a fair amount of hindsight we choose to replace Hz_k with the bias-corrected estimate h_k (as defined in step 2.2 of Algorithm 2), such that the estimate is also reused in the second update.

Theorem 7.1. Suppose that Assumptions I to III hold. Moreover, suppose that α_k ∈ (0, 1), γ ∈ ([-2ρ]₊, 1/L_F), and the following holds,

µ ≜ (1 - √α₀)/(1 + √α₀) - α₀(1 + 2γ²L̄_F²η) + 2ρ/γ > 0, (7.1)

where η ≥ 1/(√α₀(1 - γL_F)²) + (1 - √α₀)/√α₀. Consider the sequence (z_k)_{k∈ℕ} generated by Algorithm 2. Then the following estimate holds for all z⋆ ∈ S⋆:

E[dist(0, T z̄_{k⋆})²] ≤ (E[∥z_0 - z⋆∥²] + ηE[∥h_{-1} - Hz_{-1}∥²] + Cγ²σ_F² Σ_{j=0}^K α_j²)/(γ²µ Σ_{j=0}^K α_j),

where C = 1 + 2η(1 + γ²L̄_F²) + 2α₀η and k⋆ is chosen from {0, 1, . . . , K} according to the probability P[k⋆ = k] = α_k/Σ_{j=0}^K α_j.

Remark 3. The condition on ρ in (7.1) reduces to ρ > -γ/2 when α₀ → 0, as in the deterministic case. As opposed to Theorem 6.3, which tracks ∥Fz_k∥², the convergence measure of Theorem 7.1 reduces to dist(0, T z̄_k)² = ∥Fz̄_k∥² when A ≡ 0. Since Algorithm 1 and Algorithm 2 coincide when A ≡ 0, Theorem 7.1 also applies to Algorithm 1 in the unconstrained case. Consequently, we obtain rates for both ∥Fz̄_k∥² and ∥Fz_k∥² in the unconstrained smooth case.
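A minimal sketch of Algorithm 2 under the same oracle interface as before; the `resolvent` callable is the only new ingredient, and the box projection mentioned in the comments is just one illustrative choice:

```python
import numpy as np

def bc_pseg_plus(z0, F_hat, sample_xi, resolvent, gamma, alpha, num_iters):
    """Sketch of BC-PSEG+ (Algorithm 2) for 0 in Az + Fz.

    resolvent(v): evaluates (id + gamma*A)^{-1} at v; e.g. a projection when
    A is the normal cone of a constraint set, or a prox when A = df.
    """
    z, z_prev = z0.copy(), z0.copy()
    h_prev = z0.copy()  # h_{-1}: free initialization
    for k in range(num_iters):
        xi = sample_xi()                                     # step 2.1
        a = alpha(k)
        # step 2.2: bias-corrected estimate of Hz_k = z_k - gamma*F(z_k),
        # reusing the same randomness xi at z and z_prev.
        h = (z - gamma * F_hat(z, xi)
             + (1 - a) * (h_prev - z_prev + gamma * F_hat(z_prev, xi)))
        z_bar = resolvent(h)                                 # step 2.3
        xi_tilde = sample_xi()                               # step 2.4
        z_prev, h_prev = z, h
        z = z - a * (h - z_bar + gamma * F_hat(z_bar, xi_tilde))  # step 2.5
    return z

# Example resolvent for box constraints ||z||_inf <= 1 (A = normal cone):
# resolvent = lambda v: np.clip(v, -1.0, 1.0)
```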

8. Asymmetric & nonlinear preconditioning

In this section we show that the family of stochastic algorithms which converge under weak MVI can be expanded beyond Algorithm 2. This is achieved by extending (CEG+) through an asymmetric and nonlinear preconditioning.

Algorithm 3 Nonlinearly preconditioned primal dual extragradient (NP-PDEG) for solving (8.5)
Require: z_{-1} = z_0 = (x_0, y_0) with x_0, x_{-1}, x̂_{-1}, x̄_{-1} ∈ ℝⁿ, y_0, y_{-1} ∈ ℝʳ, θ ∈ [0, ∞), Γ_1 ≻ 0, Γ_2 ≻ 0
Repeat for k = 0, 1, . . . until convergence:
3.1: Sample ξ_k ∼ P
3.2: x̂_k = x_k - Γ_1∇̂_xφ(z_k, ξ_k) + (1 - α_k)(x̂_{k-1} - x_{k-1} + Γ_1∇̂_xφ(x_{k-1}, y_{k-1}, ξ_k))
3.3: x̄_k = prox_f^{Γ_1⁻¹}(x̂_k)
3.4: Sample ξ′_k ∼ P
3.5: ŷ_k = y_k + Γ_2(θ∇̂_yφ(x̄_k, y_k, ξ′_k) + (1 - θ)∇̂_yφ(z_k, ξ_k))
3.6:       + (1 - α_k)(ŷ_{k-1} - y_{k-1} - Γ_2(θ∇̂_yφ(x̄_{k-1}, y_{k-1}, ξ′_k) + (1 - θ)∇̂_yφ(z_{k-1}, ξ_k)))
3.7: ȳ_k = prox_g^{Γ_2⁻¹}(ŷ_k)
3.8: Sample ξ̃_k ∼ P
3.9: x_{k+1} = x_k + α_k(x̄_k - x̂_k - Γ_1∇̂_xφ(z̄_k, ξ̃_k))
3.10: y_{k+1} = y_k + α_k(ȳ_k - ŷ_k + Γ_2∇̂_yφ(z̄_k, ξ̃_k))
Return z_{k+1} = (x_{k+1}, y_{k+1})

Asymmetric preconditioning has been used in the literature to unify a large range of algorithms in the monotone setting (Latafat & Patrinos, 2017). A subtle but crucial difference, however, is that the preconditioning considered here depends nonlinearly on the current iterate. As will be shown in Section 8.1, this nontrivial feature is the key to showing convergence for primal-dual algorithms in the nonmonotone setting. Consider the following generalization of (CEG+), obtained by introducing a potentially asymmetric nonlinear preconditioner P_{z_k} that depends on the current iterate z_k:

find z̄_k such that H_{z_k}(z_k) ∈ P_{z_k}(z̄_k) + A(z̄_k), (8.1a)
update z_{k+1} = z_k + αΓ(H_{z_k}(z̄_k) - H_{z_k}(z_k)), (8.1b)

where H_u(v) ≜ P_u(v) - F(v) and Γ is some positive definite matrix. The iteration-independent and diagonal choice P_{z_k} = γ⁻¹I and Γ = γI corresponds to the basic (CEG+). More generally we consider

P_u(z) ≜ Γ⁻¹z + Q_u(z), (8.2)

where Q_u(z) captures the nonlinear and asymmetric part, which ultimately enables alternating updates and relaxed Lipschitz conditions (see Remark 8.1(ii)). Notice that the iterates above do not always yield well-defined updates, and one must inevitably impose additional structure on the preconditioner (we provide sufficient conditions in Appendix F.1). Consistently with (8.2), in the stochastic case we define

P̂_u(z, ξ) ≜ Γ⁻¹z + Q̂_u(z, ξ). (8.3)

The proposed stochastic scheme, which introduces a carefully chosen bias-correction term, is summarized as

compute h_k = P̂_{z_k}(z_k, ξ_k) - F̂(z_k, ξ_k) + (1 - α_k)(h_{k-1} - P̂_{z_{k-1}}(z_{k-1}, ξ_k) + F̂(z_{k-1}, ξ_k) - Q̂_{z_{k-1}}(z̄_{k-1}, ξ′_{k-1}) + Q̂_{z_{k-1}}(z̄_{k-1}, ξ′_k)) with ξ_k, ξ′_k ∼ P, (8.4a)
find z̄_k such that h_k ∈ P̂_{z_k}(z̄_k, ξ′_k) + Az̄_k, (8.4b)
update z_{k+1} = z_k + α_kΓ(P̂_{z_k}(z̄_k, ξ̃_k) - F̂(z̄_k, ξ̃_k) - h_k) with ξ̃_k ∼ P. (8.4c)

Remark 4. The two additional terms in (8.4a) are due to the interesting interplay between weak MVI and stochastic feedback, which forces a change of variables (see Appendix F.4). To make a concrete choice of Q̂_u(z, ξ) we consider a minimax problem as a motivating example (see Appendix F.1 for a more general setup).

8.1. Nonlinearly preconditioned primal dual hybrid gradient

We consider the problem

minimize_{x∈ℝⁿ} maximize_{y∈ℝʳ} f(x) + φ(x, y) - g(y), (8.5)

where φ(x, y) := E_ξ[φ̂(x, y, ξ)]. The first-order optimality conditions may be written as the inclusion 0 ∈ Tz ≜ Az + Fz, where

A = (∂f, ∂g), F(z) = (∇_xφ(z), -∇_yφ(z)), (8.6)

while the algorithm only has access to the stochastic estimates F̂(z, ξ) ≜ (∇̂_xφ(z, ξ), -∇̂_yφ(z, ξ)).

Assumption IV. For problem (8.5), let the following hold with a stepsize matrix Γ = blkdiag(Γ_1, Γ_2), where Γ_1 ∈ ℝ^{n×n} and Γ_2 ∈ ℝ^{r×r} are symmetric positive definite:
(i) f, g are proper lsc convex.
(ii) φ : ℝ^{n+r} → ℝ is continuously differentiable and, for some symmetric positive definite matrices D_xx, D_xy, D_yx, D_yy, the following holds for all z = (x, y), z′ = (x′, y′) ∈ ℝ^{n+r}:

∥∇_xφ(z′) - ∇_xφ(z)∥²_{Γ_1} ≤ L_xx²∥x′ - x∥²_{D_xx} + L_xy²∥y′ - y∥²_{D_xy},
∥∇_yφ(z′) - θ∇_yφ(x′, y) - (1 - θ)∇_yφ(z)∥²_{Γ_2} ≤ L_yx²∥x′ - x∥²_{D_yx} + L_yy²∥y′ - y∥²_{D_yy}.

(iii) Stepsize condition: L_xx²D_xx + L_yx²D_yx ≺ Γ_1⁻¹ and L_xy²D_xy + L_yy²D_yy ≺ Γ_2⁻¹.
(iv) Bounded variance: E_ξ∥F̂(z, ξ) - F(z)∥²_Γ ≤ σ_F² for all z ∈ ℝ^{n+r}.
(v) φ̂(·, ξ) : ℝ^{n+r} → ℝ is continuously differentiable and, for some symmetric positive definite matrices D_xz, D_yz, D_yx, D_yy, the following holds for all z = (x, y), z′ = (x′, y′) ∈ ℝ^{n+r} and v, v′ ∈ ℝⁿ, with θ ∈ [0, ∞):

E_ξ∥∇̂_xφ(z′, ξ) - ∇̂_xφ(z, ξ)∥²_{Γ_1} ≤ L_xz²∥z′ - z∥²_{D_xz};
if θ ≠ 1: E_ξ∥∇̂_yφ(z, ξ) - ∇̂_yφ(z′, ξ)∥²_{Γ_2} ≤ L_yz²∥z′ - z∥²_{D_yz};
if θ ≠ 0: E_ξ∥∇̂_yφ(v′, y′, ξ) - ∇̂_yφ(v, y, ξ)∥²_{Γ_2} ≤ L_yx²∥v′ - v∥²_{D_yx} + L_yy²∥y′ - y∥²_{D_yy}.

Remark 8.1. In Algorithm 3 the choice of θ ∈ [0, ∞) leads to different algorithmic oracles and underlying Lipschitz continuity assumptions in Assumptions IV(ii) and IV(v).
(i) If θ = 0, the first two steps may be computed in parallel and we recover Algorithm 2. Moreover, to ensure Assumption IV(ii) in this case it suffices to assume, for some L_x, L_y ∈ [0, ∞),

∥∇_xφ(z′) - ∇_xφ(z)∥ ≤ L_x∥z′ - z∥, ∥∇_yφ(z′) - ∇_yφ(z)∥ ≤ L_y∥z′ - z∥.

(ii) Taking θ = 1 leads to Gauss-Seidel updates and a nonlinear primal-dual extragradient algorithm, with sufficient Lipschitz continuity assumptions, for some L_x, L_y ∈ [0, ∞),

∥∇_xφ(z′) - ∇_xφ(z)∥ ≤ L_x∥z′ - z∥, ∥∇_yφ(z′) - ∇_yφ(x′, y)∥ ≤ L_y∥y′ - y∥.

Algorithm 3 is an application of (8.4) to (8.6). In order to cast it as an instance of the template algorithm (8.4), we choose the positive definite stepsize matrix Γ = blkdiag(Γ_1, Γ_2) with Γ_1 ≻ 0, Γ_2 ≻ 0, and the nonlinear part of the preconditioner as

Q̂_u(z, ξ) ≜ (0, -θ∇̂_yφ(x̃, y, ξ)) and Q_u(z) ≜ (0, -θ∇_yφ(x̃, y)), (8.7)

where u = (x, y) and z = (x̃, ỹ), i.e. the x-argument is taken from z while the y-argument is taken from u. Recall H_u(z) ≜ P_u(z) - F(z) and define S_u(z; z̃) ≜ H_u(z) - Q_u(z̃). The convergence in Theorem 8.2 depends on the distance between the initial estimate Γ⁻¹ẑ_{-1}, with ẑ_{-1} = (x̂_{-1}, ŷ_{-1}), and the deterministic S_{z_{-1}}(z_{-1}; z̄_{-1}). See Appendix B for additional notation.

Theorem 8.2. Suppose that Assumptions I(iii) to II(ii) and IV hold.
Moreover, suppose that α_k ∈ (0, 1), θ ∈ [0, ∞), and the following holds,

µ ≜ (1 - √α₀)/(1 + √α₀) + 2ρ/γ - α₀ - 2α₀(ĉ_1 + 2ĉ_2(1 + ĉ_3))η > 0 and 1 - 4ĉ_2α₀ > 0, (8.8)

where γ denotes the smallest eigenvalue of Γ,

η ≥ (1 + 4ĉ_2α₀²)(1/(√α₀(1 - L_M)²) + (1 - √α₀)/√α₀)/(1 - 4ĉ_2α₀),

and

ĉ_1 ≜ L_xz²∥ΓD_xz∥ + 2(1 - θ)²L_yz²∥ΓD_yz∥ + 2θ²L_yy²∥Γ_2D_yy∥, ĉ_2 ≜ 2θ²L_yx²∥Γ_1D_yx∥, ĉ_3 ≜ L_xz²∥ΓD_xz∥,
L_M² ≜ max{L_xx²∥D_xxΓ_1∥ + L_yx²∥D_yxΓ_1∥, L_xy²∥D_xyΓ_2∥ + L_yy²∥D_yyΓ_2∥}.

Consider the sequence (z_k)_{k∈ℕ} generated by Algorithm 3. Then the following holds for all z⋆ ∈ S⋆:

E[dist_Γ(0, T z̄_{k⋆})²] ≤ (E[∥z_0 - z⋆∥²_{Γ⁻¹}] + ηE[∥Γ⁻¹ẑ_{-1} - S_{z_{-1}}(z_{-1}; z̄_{-1})∥²_Γ] + Cσ_F² Σ_{j=0}^K α_j²)/(µ Σ_{j=0}^K α_j),

where C ≜ 2(η + α₀(1 + 4ĉ_2α₀²)(1/(√α₀(1 - L_M)²) + (1 - √α₀)/√α₀)/(1 - 4ĉ_2α₀))(Θ + 2ĉ_2) + 1 + 2(ĉ_1 + 2ĉ_2(Θ + ĉ_3))η, with Θ = (1 - θ)² + 2θ², and k⋆ is chosen from {0, 1, . . . , K} according to the probability P[k⋆ = k] = α_k/Σ_{j=0}^K α_j.

Remark 5. When α₀ → 0, the conditions in (8.8) reduce to 1 + 2ρ/γ > 0, as in the deterministic case. For θ = 0, Algorithm 3 reduces to Algorithm 2. With this choice Theorem 8.2 simplifies, since the constant ĉ_2 = 0, and we recover the convergence result of Theorem 7.1.
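To make the update order concrete, here is a minimal sketch of Algorithm 3 under an assumed two-point stochastic gradient interface (`grad_x`, `grad_y`) and prox operators in the Γ-metrics (all interface names are ours); following (8.4c), the last two steps are read with the extrapolated point z̄_k = (x̄_k, ȳ_k):

```python
import numpy as np

def np_pdeg(x0, y0, grad_x, grad_y, prox_f, prox_g, G1, G2,
            sample_xi, alpha, theta, num_iters):
    """Sketch of NP-PDEG (Algorithm 3) for min_x max_y f(x)+phi(x,y)-g(y).

    grad_x(x, y, xi), grad_y(x, y, xi): stochastic gradients of phi
    prox_f, prox_g: proximal operators of f, g in the Gamma_1, Gamma_2 metrics
    theta = 0 recovers Algorithm 2; theta = 1 gives Gauss-Seidel updates.
    """
    x, y = x0.copy(), y0.copy()
    x_prev, y_prev = x0.copy(), y0.copy()
    x_hat_prev, y_hat_prev, x_bar_prev = x0.copy(), y0.copy(), x0.copy()
    for k in range(num_iters):
        a = alpha(k)
        xi = sample_xi()                                   # step 3.1
        # step 3.2: bias-corrected primal exploration (same xi at both points)
        x_hat = (x - G1 @ grad_x(x, y, xi)
                 + (1 - a) * (x_hat_prev - x_prev
                              + G1 @ grad_x(x_prev, y_prev, xi)))
        x_bar = prox_f(x_hat)                              # step 3.3
        xi_p = sample_xi()                                 # step 3.4
        # steps 3.5-3.6: dual exploration; theta mixes the gradient at (x_bar, y)
        y_hat = (y + G2 @ (theta * grad_y(x_bar, y, xi_p)
                           + (1 - theta) * grad_y(x, y, xi))
                 + (1 - a) * (y_hat_prev - y_prev
                              - G2 @ (theta * grad_y(x_bar_prev, y_prev, xi_p)
                                      + (1 - theta) * grad_y(x_prev, y_prev, xi))))
        y_bar = prox_g(y_hat)                              # step 3.7
        xi_t = sample_xi()                                 # step 3.8
        x_prev, y_prev = x, y
        x_hat_prev, y_hat_prev, x_bar_prev = x_hat, y_hat, x_bar
        # steps 3.9-3.10: update using the extrapolated point (x_bar, y_bar)
        x = x + a * (x_bar - x_hat - G1 @ grad_x(x_bar, y_bar, xi_t))
        y = y + a * (y_bar - y_hat + G2 @ grad_y(x_bar, y_bar, xi_t))
    return x, y
```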

9. Experiments

We compare BC-SEG+ and BC-PSEG+ against (EG+) with stochastic feedback (which we refer to as (SF-EG+)) and (SEG), in both an unconstrained setting and a constrained setting introduced in Pethick et al. (2022). See Appendix H.2 for the precise formulation of the projected variants, which we denote (SF-PEG+) and (PSEG) respectively. In the unconstrained example we control all problem constants and set ρ = -1/(10L_F), while the constrained example is a specific minimax problem where ρ > -1/(2L_F) holds within the constraint set, for a Lipschitz constant L_F restricted to that same set. To simulate a stochastic setting in both examples, we consider additive Gaussian noise, i.e. F̂(z, ξ) = Fz + ξ where ξ ∼ N(0, σ²I). In the experiments we choose σ = 0.1 and α_k ∝ 1/k, which ensures almost sure convergence of BC-(P)SEG+. For a more aggressive stepsize choice α_k ∝ 1/√k see Figure 4. Further details can be found in Appendix H. The results are shown in Figure 2. The sequences generated by (SEG) and (PSEG) diverge on the unconstrained problem and cycle on the constrained problem, respectively. In comparison, (SF-EG+) and (SF-PEG+) get within a neighborhood of the solutions but fail to converge due to the nondiminishing stepsize, while BC-SEG+ and BC-PSEG+ converge in both examples.
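The additive-noise setup is straightforward to reproduce; a sketch of the oracle and the stepsize schedules used above (σ per the text; the offsets in the schedules and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1

def sample_xi(dim=2):
    # Additive Gaussian noise: F_hat(z, xi) = F(z) + xi with xi ~ N(0, sigma^2 I).
    return rng.normal(0.0, sigma, size=dim)

def F_hat(z, xi, F):
    # F is the deterministic operator of whichever example is being run.
    return F(z) + xi

# Second-stepsize schedules used in the experiments:
alpha_as   = lambda k: 1.0 / (k + 2)           # ~1/k: almost sure convergence
alpha_fast = lambda k: 1.0 / np.sqrt(k + 2)    # ~1/sqrt(k): aggressive choice
```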

10. Conclusion

This paper shows that nonconvex-nonconcave problems characterized by the weak Minty variational inequality can be solved efficiently even when only stochastic gradients are available. The approach crucially avoids increasing batch sizes by instead introducing a bias-correction term. We show that convergence is possible for the same range of the problem constant ρ ∈ (-γ/2, ∞) as in the deterministic case. Rates are established for a random iterate, matching those of stochastic extragradient in the monotone case, and the result is complemented with almost sure convergence, thus providing asymptotic convergence of the last iterate. We show that the idea extends to a family of extragradient-type methods which includes a nonlinear extension of the celebrated primal dual hybrid gradient (PDHG) algorithm. For future work it would be interesting to see whether the rate can be improved by considering accelerated methods such as Halpern iterations.

B Preliminaries

Given a positive definite matrix V we define the inner product ⟨·, ·⟩_V ≜ ⟨·, V·⟩ and the corresponding norm ∥·∥_V ≜ √⟨·, ·⟩_V. The distance from u ∈ ℝⁿ to a set U ⊆ ℝⁿ with respect to a positive definite matrix V is defined as dist_V(u, U) ≜ min_{u′∈U} ∥u - u′∥_V.

Definition B.1 ((co)monotonicity (Bauschke et al., 2021)). An operator A : ℝⁿ ⇒ ℝⁿ is said to be ρ-monotone for some ρ ∈ ℝ if for all (x, y), (x′, y′) ∈ gph A,

⟨y - y′, x - x′⟩ ≥ ρ∥x - x′∥²,

and it is said to be ρ-comonotone if for all (x, y), (x′, y′) ∈ gph A,

⟨y - y′, x - x′⟩ ≥ ρ∥y - y′∥².

The operator A is said to be maximally (co)monotone if there exists no other (co)monotone operator B for which gph A ⊂ gph B properly. If A is 0-monotone we simply say it is monotone. When ρ < 0, ρ-comonotonicity is also referred to as |ρ|-cohypomonotonicity.

Definition B.2 (Lipschitz continuity and cocoercivity). Let D ⊆ ℝⁿ be a nonempty subset of ℝⁿ. A single-valued operator A : D → ℝⁿ is said to be L-Lipschitz continuous if for any x, x′ ∈ D,

∥Ax - Ax′∥ ≤ L∥x - x′∥,

and β-cocoercive if

⟨x - x′, Ax - Ax′⟩ ≥ β∥Ax - Ax′∥².

Moreover, A is said to be nonexpansive if it is 1-Lipschitz continuous, and firmly nonexpansive if it is 1-cocoercive. A β-cocoercive operator is also β⁻¹-Lipschitz continuous by direct implication of Cauchy-Schwarz. The resolvent operator J_A = (id + A)⁻¹ is firmly nonexpansive (with dom J_A = ℝⁿ) if and only if A is (maximally) monotone. We will make heavy use of the Fenchel-Young inequality: for all a, b ∈ ℝⁿ and e > 0 we have

2⟨a, b⟩ ≤ e∥a∥² + (1/e)∥b∥², (B.1)
∥a + b∥² ≤ (1 + e)∥a∥² + (1 + 1/e)∥b∥², (B.2)
-∥a - b∥² ≤ -(1/(1 + e))∥a∥² + (1/e)∥b∥². (B.3)

C Proof for SEG+

Proof of Theorem 5.1. Following Hsieh et al. (2020) closely, define the reference state z̃_k := z_k - γFz_k to be the exploration step using the deterministic operator, and denote the second stepsize η_k := α_kγ. We let ζ denote the additive noise term, i.e. F̂(z, ξ) := F(z) + ζ. Expanding the distance to the solution,

∥z_{k+1} - z⋆∥² = ∥z_k - η_kF̂(z̄_k, ξ̃_k) - z⋆∥²
= ∥z_k - z⋆∥² - 2η_k⟨F̂(z̄_k, ξ̃_k), z_k - z⋆⟩ + η_k²∥F̂(z̄_k, ξ̃_k)∥²
= ∥z_k - z⋆∥² - 2η_k⟨F̂(z̄_k, ξ̃_k), z̃_k - z⋆⟩ - 2γη_k⟨F̂(z̄_k, ξ̃_k), F(z_k)⟩ + η_k²∥F̂(z̄_k, ξ̃_k)∥². (C.1)

Recall that the operator is assumed to be linear, Fz = Bz + v, in which case we have

F̂(z̄_k, ξ̃_k) = Bz̄_k + v + ζ̃_k = B(z_k - γF̂(z_k, ξ_k)) + v + ζ̃_k = B(z_k - γBz_k - γv - γζ_k) + v + ζ̃_k
= B(z_k - γ(Bz_k + v)) + v - γBζ_k + ζ̃_k = F(z̃_k) - γBζ_k + ζ̃_k. (C.2)

The two latter terms are zero in expectation due to unbiasedness (Assumption II(ii)), which lets us write the terms on the RHS of (C.1) as

-E_k⟨F̂(z̄_k, ξ̃_k), z̃_k - z⋆⟩ = -⟨F(z̃_k), z̃_k - z⋆⟩, (C.3)
-E_k⟨F̂(z̄_k, ξ̃_k), F(z_k)⟩ = -⟨F(z̃_k), F(z_k)⟩, (C.4)
E_k∥F̂(z̄_k, ξ̃_k)∥² = ∥F(z̃_k)∥² + E_k∥γBζ_k∥² + E_k∥ζ̃_k∥². (C.5)

We can bound (C.3) directly through the weak MVI in Assumption I(iii) (noting that the bound may still be positive),

-⟨F(z̃_k), z̃_k - z⋆⟩ ≤ -ρ∥F(z̃_k)∥². (C.6)

For the latter two terms of (C.5) we have

E_k∥γBζ_k∥² + E_k∥ζ̃_k∥² = γ²E_k∥F(ζ_k) - F(0)∥² + E_k∥ζ̃_k∥² ≤ (γ²L_F² + 1)σ_F², (C.7)

where the last inequality follows from Lipschitz continuity (Assumption I(i)) and bounded variance (Assumption II(iii)). Combining everything into (C.1) we are left with

E_k∥z_{k+1} - z⋆∥² ≤ ∥z_k - z⋆∥² + η_k²(γ²L_F² + 1)σ_F² - 2γη_k⟨F(z̃_k), F(z_k)⟩ + (η_k² - 2η_kρ)∥F(z̃_k)∥². (C.8)

By the stepsize condition ρ ≥ (η_k - γ)/2 we have η_k² - 2η_kρ ≤ γη_k. This allows us to complete the square,

-2γη_k⟨F(z̃_k), F(z_k)⟩ + (η_k² - 2η_kρ)∥F(z̃_k)∥² ≤ -2γη_k⟨F(z̃_k), F(z_k)⟩ + γη_k∥F(z̃_k)∥²
= γη_k(∥F(z_k) - F(z̃_k)∥² - ∥F(z_k)∥²) ≤ γη_k(γ²L_F² - 1)∥F(z_k)∥², (C.9)

where the last inequality follows from Lipschitz continuity of F and the definition of the update rule. Plugging into (C.8) we are left with

E_k∥z_{k+1} - z⋆∥² ≤ ∥z_k - z⋆∥² + η_k²(γ²L_F² + 1)σ_F² - γη_k(1 - γ²L_F²)∥F(z_k)∥². (C.10)

The result is obtained by taking total expectation and summing. ∎

D Proof for smooth unconstrained case

Lemma D.1. Consider the recurrence B_{k+1} = ξ_kB_k + d_k with ξ_k > 0 for all k ≥ 0. Then

B_{k+1} = (Π_{p=0}^k ξ_p)(B_0 + Σ_{ℓ=0}^k d_ℓ/Π_{p=0}^ℓ ξ_p).

Assumption V. γ ∈ ([-2ρ]₊, 1/L_F) and, for a positive real-valued b,

µ ≜ γ²(1 - γ²L_F²(1 + b⁻¹)) > 0. (D.1)

Theorem D.2. Suppose that Assumptions I to III hold. Suppose in addition that Assumption V holds and that (α_k)_{k∈ℕ} ⊂ (0, 1) is a diminishing sequence such that

2γL̄_F√α₀ + (1 + (b + 1)γ²L_F²)γ²L̄_F²α₀ ≤ 1 + 2ρ/γ. (D.2)

Consider the sequence (z_k)_{k∈ℕ} generated by Algorithm 1. Then the following estimate holds:

Σ_{k=0}^K (α_k/Σ_{j=0}^K α_j) E[∥F(z_k)∥²] ≤ (∥z_0 - z⋆∥² + ηγ²∥F(z_0)∥² + Cσ_F²γ² Σ_{j=0}^K α_j²)/(µ Σ_{j=0}^K α_j), (D.3)

where C = 1 + 2η(γ²L̄_F² + 1) + 2α₀ and η = ½((b + 1)γ²L_F² + 1/(γL̄_F√α₀)).

Proof of Theorem D.2. The proof relies on establishing a (stochastic) descent property for the potential function

U_{k+1} ≜ ∥z_{k+1} - z⋆∥² + A_{k+1}∥u_k∥² + B_{k+1}∥z_{k+1} - z_k∥²,

where u_k ≜ z̄_k - z_k + γF(z_k) measures the deviation of the bias-corrected step from the deterministic exploration step, and (A_k)_{k∈ℕ}, (B_k)_{k∈ℕ} are positive scalar parameters to be identified. We proceed to consider each term individually. Let us begin by quantifying how well z̄_k estimates z_k - γF(z_k). By step 1.2,

u_k = z̄_k - z_k + γF(z_k) = γF(z_k) - γF̂(z_k, ξ_k) + (1 - α_k)(z̄_{k-1} - z_{k-1} + γF̂(z_{k-1}, ξ_k)).

Therefore,

∥u_k∥² = ∥γF(z_k) - γF̂(z_k, ξ_k) + (1 - α_k)(γF̂(z_{k-1}, ξ_k) - γF(z_{k-1}))∥² + (1 - α_k)²∥u_{k-1}∥²
+ 2(1 - α_k)⟨z̄_{k-1} - z_{k-1} + γF(z_{k-1}), γF(z_k) - γF̂(z_k, ξ_k) + (1 - α_k)(γF̂(z_{k-1}, ξ_k) - γF(z_{k-1}))⟩.

Conditioned on F_k, the left factor in the inner product is known and the right factor has zero expectation. Therefore, we obtain

E[∥u_k∥² | F_k] = E[∥(1 - α_k)(γF(z_k) - γF̂(z_k, ξ_k) + γF̂(z_{k-1}, ξ_k) - γF(z_{k-1})) + α_k(γF(z_k) - γF̂(z_k, ξ_k))∥² | F_k] + (1 - α_k)²∥u_{k-1}∥²
≤ (1 - α_k)²∥u_{k-1}∥² + 2(1 - α_k)²γ²E[∥F̂(z_k, ξ_k) - F̂(z_{k-1}, ξ_k)∥² | F_k] + 2α_k²γ²E[∥F(z_k) - F̂(z_k, ξ_k)∥² | F_k]
≤ (1 - α_k)²∥u_{k-1}∥² + 2(1 - α_k)²γ²L̄_F²∥z_k - z_{k-1}∥² + 2α_k²γ²σ_F², (D.4)

where the first inequality uses Young's inequality and the fact that the second moment is larger than the variance, and the second inequality uses Assumptions II(iii) and III. By step 1.4, the equality

∥z_{k+1} - z⋆∥² = ∥z_k - z⋆∥² - 2α_kγ⟨F̂(z̄_k, ξ̃_k), z_k - z⋆⟩ + α_k²γ²∥F̂(z̄_k, ξ̃_k)∥² (D.5)

holds. The inner product in (D.5) can be upper bounded using Young's inequality with positive parameters ε_k, k ≥ 0, and b as follows:

E[⟨-γF̂(z̄_k, ξ̃_k), z_k - z⋆⟩ | F̃_k] = -γ⟨F(z̄_k), z_k - z̄_k⟩ - γ⟨F(z̄_k), z̄_k - z⋆⟩
= -γ²⟨F(z̄_k), F(z_k)⟩ + γ⟨F(z̄_k), z̄_k - z_k + γF(z_k)⟩ - γ⟨F(z̄_k), z̄_k - z⋆⟩
≤ (γ²/2)(∥F(z̄_k) - F(z_k)∥² - ∥F(z̄_k)∥² - ∥F(z_k)∥²) + (γ²ε_k/2)∥F(z̄_k)∥² + (1/(2ε_k))∥z̄_k - z_k + γF(z_k)∥² - γρ∥F(z̄_k)∥²
≤ (γ²L_F²(1 + b)/2)∥u_k∥² + ((1 + b⁻¹)/2)γ⁴L_F²∥F(z_k)∥² - (γ²/2)∥F(z̄_k)∥² - (γ²/2)∥F(z_k)∥² + (γ²ε_k/2)∥F(z̄_k)∥² + (1/(2ε_k))∥u_k∥² - γρ∥F(z̄_k)∥²
= (γ²L_F²(1 + b)/2 + 1/(2ε_k))∥u_k∥² + (γ²(γ²L_F²(1 + b⁻¹) - 1)/2)∥F(z_k)∥² + (γ²(ε_k - 1)/2 - γρ)∥F(z̄_k)∥². (D.6)

Conditioning (D.6) with E[· | F_k] = E[E[· | F̃_k] | F_k], since F_k ⊂ F̃_k, yields

2E[⟨-γF̂(z̄_k, ξ̃_k), z_k - z⋆⟩ | F_k] ≤ (γ²L_F²(1 + b) + 1/ε_k)E[∥u_k∥² | F_k] - µ∥F(z_k)∥² + (γ²(ε_k - 1) - 2γρ)E[∥F(z̄_k)∥² | F_k], (D.7)

where µ was defined in (D.1).
The conditional expectation of the third term in (D.5) is bounded through Assumption II(iii) by

E[∥F̂(z̄_k, ξ̃_k)∥² | F_k] = E[E[∥F̂(z̄_k, ξ̃_k)∥² | F̃_k] | F_k] ≤ E[∥F(z̄_k)∥² | F_k] + σ_F²,

which in turn implies

E[∥z_{k+1} - z_k∥² | F_k] = α_k²γ²E[∥F̂(z̄_k, ξ̃_k)∥² | F_k] ≤ α_k²γ²E[∥F(z̄_k)∥² | F_k] + α_k²γ²σ_F². (D.8)

Combining (D.7), (D.8), and (D.5) yields

E[∥z_{k+1} - z⋆∥² + A_{k+1}∥u_k∥² + B_{k+1}∥z_{k+1} - z_k∥² | F_k] ≤ ∥z_k - z⋆∥² + (A_{k+1} + α_k(γ²L_F²(1 + b) + 1/ε_k))E[∥u_k∥² | F_k] - α_kµ∥F(z_k)∥²
+ (α_k(γ²(ε_k - 1) - 2γρ) + α_k²γ² + B_{k+1}α_k²γ²)E[∥F(z̄_k)∥² | F_k] + (1 + B_{k+1})α_k²γ²σ_F². (D.9)

Further using (D.4) and denoting

X_1^k ≜ α_k(γ²L_F²(1 + b) + 1/ε_k) + A_{k+1}, X_2^k ≜ α_k(γ²(ε_k - 1) - 2ργ + α_kγ²)

leads to

E[U_{k+1} | F_k] - U_k ≤ -α_kµ∥F(z_k)∥² + (X_1^k(1 - α_k)² - A_k)∥u_{k-1}∥² + (2X_1^k(1 - α_k)²γ²L̄_F² - B_k)∥z_k - z_{k-1}∥²
+ (X_2^k + B_{k+1}α_k²γ²)E[∥F(z̄_k)∥² | F_k] + (B_{k+1} + 1 + 2X_1^k)α_k²γ²σ_F². (D.10)

Having established (D.10), set A_k = A, B_k = 2Aγ²L̄_F², and ε_k = ε to obtain by the law of total expectation that

E[U_{k+1}] - E[U_k] ≤ -α_kµE[∥F(z_k)∥²] + (X_1^k(1 - α_k)² - A)E[∥u_{k-1}∥²] + 2γ²L̄_F²(X_1^k(1 - α_k)² - A)E[∥z_k - z_{k-1}∥²]
+ (X_2^k + 2Aγ⁴L̄_F²α_k²)E[∥F(z̄_k)∥²] + (2Aγ²L̄_F² + 1 + 2X_1^k)α_k²γ²σ_F². (D.11)

To get a recursion we require

X_1^k(1 - α_k)² - A ≤ 0 and X_2^k + 2Aγ⁴L̄_F²α_k² ≤ 0. (D.12)

Developing the first requirement of (D.12),

0 ≥ X_1^k(1 - α_k)² - A = α_k(1 - α_k)²(γ²L_F²(1 + b) + 1/ε) + α_k(α_k - 2)A. (D.13)

Equivalently, A needs to satisfy

A ≥ ((1 - α_k)²/(2 - α_k))(γ²L_F²(1 + b) + 1/ε) (D.14)

for any α_k ∈ (0, 1). Since (1 - α_k)²/(2 - α_k) ≤ 1/2 for α_k ∈ (0, 1), it suffices to pick

A = ½((b + 1)γ²L_F² + 1/ε). (D.15)

For the second requirement of (D.12), note that we can equivalently require that the following quantity is nonpositive:

(1/(α_kγ²))(X_2^k + 2Aγ⁴L̄_F²α_k²) = ε - 1 - 2ρ/γ + α_k + 2Aγ²L̄_F²α_k ≤ ε - 1 - 2ρ/γ + (1 + ((b + 1)γ²L_F² + 1/ε)γ²L̄_F²)α₀,

where we used α_k ≤ α₀ and the choice of A from (D.15). Setting the Young parameter ε = γL̄_F√α₀, we obtain that X_2^k + 2Aγ⁴L̄_F²α_k² ≤ 0 owing to (D.2). On the other hand, the last term in (D.11) may be upper bounded by

2Aγ²L̄_F² + 1 + 2X_1^k = 1 + ((b + 1)γ²L_F² + 1/(γL̄_F√α₀))(γ²L̄_F² + 1) + 2α_k ≤ 1 + ((b + 1)γ²L_F² + 1/(γL̄_F√α₀))(γ²L̄_F² + 1) + 2α₀ = C.

Thus, it follows from (D.11) that

E[U_{k+1}] - E[U_k] ≤ -α_kµE[∥F(z_k)∥²] + Cα_k²γ²σ_F².

Telescoping the above inequality completes the proof. ∎

Proof of Theorem 6.1. The theorem is obtained as a particular instantiation of Theorem D.2. The condition in (D.1) can be rewritten as b > γ²L_F²/(1 - γ²L_F²). A reasonable choice is b = 2γ²L_F²/(1 - γ²L_F²). Substituting back into µ we obtain

µ = γ²(1 - γ²L_F²(1 + (1 - γ²L_F²)/(2γ²L_F²))) = γ²(1 - γ²L_F²)/2 > 0. (D.16)

Similarly, the choice of b is substituted into η and (D.2) of Theorem D.2. The bound in (D.3) is further simplified by applying Lipschitz continuity of F from Assumption I(i) to ∥Fz_0∥² = ∥Fz_0 - Fz⋆∥². The proof is complete by observing that the guarantee on the weighted sum can be converted into an expectation over a sampled iterate in the style of Ghadimi & Lan (2013). ∎

Assumption VI (almost sure convergence). Let d ∈ [0, 1] and b > 0.
Suppose that the following hold:
(i) the diminishing sequence (α_k)_{k∈ℕ} ⊂ (0, 1) satisfies the classical conditions Σ_{k=0}^∞ α_k = ∞ and ᾱ ≜ Σ_{k=0}^∞ α_k² < ∞;
(ii) letting c_k ≜ (1 + b)γ²L_F² + (1/(γL̄_F))α_k^{-d} for all k ≥ 0,

η_k ≜ Σ_{ℓ=k}^∞ c_ℓα_ℓ Π_{p=0}^ℓ (1 - α_p)² < ∞, ν ≜ Σ_{k=0}^∞ η_{k+1}α_k² Π_{p=0}^k 1/(1 - α_p)² < ∞, (D.17)

and

γL̄_F α_k^d + α_k + 2γ²L̄_F²α_kη_{k+1} Π_{p=0}^k 1/(1 - α_p)² ≤ 1 + 2ρ/γ. (D.18)

Although at first look the above assumptions may appear involved, as shown in Theorem D.3 the classical stepsize choice α_k = 1/(k + r) is sufficient to satisfy (D.17) and to ensure almost sure convergence, provided that (D.20) holds instead. Note that with this choice, as k goes to infinity, α_k ↘ 0 and the deterministic range γ + 2ρ > 0 is recovered.

Theorem D.3 (almost sure convergence). Suppose that Assumptions I to III hold. Additionally, suppose that the stepsize conditions in Assumptions V and VI hold. Then, the sequence (z_k)_{k∈ℕ} generated by Algorithm 1 converges almost surely to some z⋆ ∈ zer T. Moreover, the following estimate holds:

Σ_{k=0}^K (α_k/Σ_{j=0}^K α_j) E[∥F(z_k)∥²] ≤ (∥z_0 - z⋆∥² + η_0γ²∥F(z_0)∥² + C)/(µ Σ_{j=0}^K α_j), (D.19)

where C = 2γ²σ_F²((γ²L̄_F² + 1)ν + (1/2 + (b + 1)γ²L_F² + 1/(γL̄_F))ᾱ) is finite. In particular, if α_k = 1/(k + r) for any positive natural number r, and d = 1, then Assumption VI(ii) can be replaced by

(γL̄_F + 1)α_k + 2((1 + b)γ⁴L_F²L̄_F²α_{k+1} + γL̄_F)(α_{k+1} + 1)α_{k+1} ≤ 1 + 2ρ/γ. (D.20)

Proof of Theorem D.3 (almost sure convergence). Having established (D.10), let B_k = 2A_kγ²L̄_F², such that

(2X_1^k(1 - α_k)²γ²L̄_F² - B_k)∥z_k - z_{k-1}∥² = 2γ²L̄_F²(X_1^k(1 - α_k)² - A_k)∥z_k - z_{k-1}∥². (D.21)

In what follows we show that it is sufficient to ensure

X_1^k(1 - α_k)² ≤ A_k and X_2^k + 2A_{k+1}γ⁴L̄_F²α_k² ≤ 0, (D.22)

resulting in the inequality

E[U_{k+1} | F_k] - U_k ≤ -α_kµ∥F(z_k)∥² + (2A_{k+1}γ²L̄_F² + 1 + 2X_1^k)α_k²γ²σ_F². (D.23)

Choose the Young parameter

ε_k = γL̄_F α_k^d, (D.24)

and, by Lemma D.1 with A_0 = η_0, set

A_{k+1} = Π_{p=0}^k (1/(1 - α_p)²)(A_0 - Σ_{ℓ=0}^k c_ℓα_ℓ Π_{p=0}^ℓ (1 - α_p)²) = η_{k+1} Π_{p=0}^k 1/(1 - α_p)², (D.26)

which ensures A_k ≥ 0 for all k. Therefore, assumptions (D.17) and (D.18) (the latter being a restatement of the conditions in (D.22)) are sufficient for ensuring (D.23). Substituting X_1^k and A_{k+1} in (D.23) yields

E[U_{k+1} | F_k] - U_k ≤ -α_kµ∥F(z_k)∥² + ξ_k, (D.27)

where ξ_k = 2(A_{k+1}(γ²L̄_F² + 1) + ½ + (b + 1)γ²L_F²α_k + (1/(γL̄_F))α_k^{1-d})α_k²γ²σ_F². By Assumption VI we have that

Σ_{k=0}^∞ ξ_k = 2γ²σ_F²((γ²L̄_F² + 1)Σ_{k=0}^∞ A_{k+1}α_k² + Σ_{k=0}^∞ α_k²/2 + (b + 1)γ²L_F² Σ_{k=0}^∞ α_k³ + (1/(γL̄_F))Σ_{k=0}^∞ α_k^{3-d})
≤ 2γ²σ_F²((γ²L̄_F² + 1)Σ_{k=0}^∞ A_{k+1}α_k² + (1/2 + (b + 1)γ²L_F² + 1/(γL̄_F))Σ_{k=0}^∞ α_k²) < ∞,

where we used α_k³ ≤ α_k² and d ≤ 1 in the first inequality, while the second inequality uses (D.17) and Assumption VI(i). The claimed convergence result follows by the Robbins-Siegmund supermartingale theorem (Bertsekas, 2011, Prop. 2) and standard arguments as in (Bertsekas, 2011, Prop. 9). The claimed rate follows by taking total expectation, summing the above inequality over k, and noting that the initial iterates were set as z̄_{-1} = z_{-1} = z_0.

To provide an instance of the sequence (α_k)_{k∈ℕ} that satisfies the assumptions, let r denote a positive natural number and set

α_k = 1/(k + r). (D.28)

Then Π_{p=0}^ℓ (1 - α_p)² = Π_{p=0}^ℓ ((p + r - 1)/(p + r))² = (r - 1)²/(ℓ + r)² = (r - 1)²α_ℓ², and for any K ≥ 0,

Σ_{ℓ=0}^K c_ℓα_ℓ Π_{p=0}^ℓ (1 - α_p)² = Σ_{ℓ=0}^K ((r - 1)²/(ℓ + r)³)c_ℓ.
Plugging the value of c_ℓ and ε_k from Assumption VI(ii) and (D.24), we obtain that A_0 is finite valued since Σ_{ℓ=0}^∞ (1/(ℓ + r)³)α_ℓ^{-d} = Σ_{ℓ=0}^∞ 1/(ℓ + r)^{3-d} < ∞, owing to the fact that d ≤ 1. Moreover,

A_{k+1} = ((k + r)²/(r - 1)²)(A_0 - Σ_{ℓ=0}^k ((r - 1)²/(ℓ + r)³)c_ℓ) = (k + r)² Σ_{ℓ=k+1}^∞ (1/(ℓ + r)³)c_ℓ = (1/α_k²) Σ_{ℓ=k+1}^∞ α_ℓ³c_ℓ. (D.29)

On the other hand, for e > 1 we have the bound Σ_{ℓ=k+1}^∞ 1/(ℓ + r)^e ≤ 1/((e - 1)(k + 1 + r)^{e-1}). Therefore, it follows from (D.29) that

A_{k+1}α_k = (1/α_k) Σ_{ℓ=k+1}^∞ α_ℓ³((1 + b)γ²L_F² + (1/(γL̄_F))α_ℓ^{-d}) (D.30)
≤ ((1 + b)/2)γ²L_F²α_{k+1}(2α_{k+1} + 1)α_{k+1} + (1/(γL̄_F(2 - d)))α_{k+1}^{1-d}(α_{k+1} + 1)α_{k+1}
≤ ((1 + b)γ²L_F²α_{k+1} + (1/(γL̄_F(2 - d)))α_{k+1}^{1-d})(α_{k+1} + 1)α_{k+1}. (D.31)

In turn, this inequality ensures that ν as defined in Assumption VI(ii) is finite. To see this, note that

ν = Σ_{k=0}^∞ A_{k+1}α_k² ≤ Σ_{k=0}^∞ ((1 + b)γ²L_F²α_{k+1} + (1/(γL̄_F(2 - d)))α_{k+1}^{1-d})(α_{k+1} + 1)α_{k+1}α_k ≤ δ Σ_{k=0}^∞ α_k² < ∞

for some δ > 0, where the last two inequalities use (D.31) and Assumption VI(i). It remains to confirm the second inequality in (D.22). With the choices of α_k and ε_k as in (D.28) and (D.24) we have

(1/(α_kγ²))(X_2^k + 2A_{k+1}γ⁴L̄_F²α_k²) = γL̄_F α_k^d - 1 - 2ρ/γ + α_k + 2A_{k+1}γ²L̄_F²α_k
≤ γL̄_F α_k^d + α_k + 2γ²L̄_F²((1 + b)γ²L_F²α_{k+1} + (1/(γL̄_F(2 - d)))α_{k+1}^{1-d})(α_{k+1} + 1)α_{k+1} - 1 - 2ρ/γ,

where the inequality uses (D.31). It follows that with d = 1 assumption (D.20) is sufficient to ensure that the second condition in (D.22) holds. ∎

Proof of Theorem 6.3 (almost sure convergence). The result is a restatement of the special case of Theorem D.3 with α_k = 1/(k + r). We proceed as in the proof of Theorem 6.1: the condition in (D.1) can be rewritten as b > γ²L_F²/(1 - γ²L_F²), and a reasonable choice is b = 2γ²L_F²/(1 - γ²L_F²), which upon substitution into (D.20) yields (6.3). ∎

E Proof for constrained case

We will rely on two well-known and useful properties of the deterministic operator H = id - γF from (Pethick et al., 2022, Lem. A.3), restated here for convenience.

Lemma E.1. Let F : ℝⁿ → ℝⁿ be an L_F-Lipschitz operator and H = id - γF with γ ∈ (0, 1/L_F]. Then,
(i) the operator H is 1/2-cocoercive;
(ii) the operator H is (1 - γL_F)-monotone, and in particular

∥Hz′ - Hz∥ ≥ (1 - γL_F)∥z′ - z∥ for all z, z′ ∈ ℝⁿ. (E.1)

Proof. The first claim follows from direct computation:

⟨Hz - Hz′, z - z′⟩ = ⟨Hz - Hz′, Hz - Hz′ + γFz - γFz′⟩ = ½∥Hz - Hz′∥² - (γ²/2)∥Fz′ - Fz∥² + ½∥z′ - z∥² ≥ ½∥Hz - Hz′∥², (E.2)

where the last inequality is due to Lipschitz continuity and γ ≤ 1/L_F. The strong monotonicity of H is a consequence of Cauchy-Schwarz and Lipschitz continuity of F:

⟨Hz′ - Hz, z′ - z⟩ = ∥z′ - z∥² - γ⟨Fz′ - Fz, z′ - z⟩ ≥ (1 - γL_F)∥z′ - z∥².

The last claim follows from the Cauchy-Schwarz inequality. ∎

Theorem E.2. Suppose that Assumptions I to III hold. Moreover, suppose that α_k ∈ (0, 1), γ ∈ ([-2ρ]₊, 1/L_F), and that for positive parameters ε and b the following holds,

µ ≜ (1/(1 + b))(1 - 1/(ε(1 - γL_F)²)) - α₀(1 + 2γ²L̄_F²A) + 2ρ/γ > 0 and 1 - 1/(ε(1 - γL_F)²) ≥ 0, (E.3)

where A ≥ ε + (1/b)(1 - 1/(ε(1 - γL_F)²)). Consider the sequence (z_k)_{k∈ℕ} generated by Algorithm 2. Then the following estimate holds for all z⋆ ∈ S⋆:

Σ_{k=0}^K (α_k/Σ_{j=0}^K α_j) E[∥h_k - Hz̄_k∥²] ≤ (E[∥z_0 - z⋆∥²] + AE[∥h_{-1} - Hz_{-1}∥²] + Cγ²σ_F² Σ_{j=0}^K α_j²)/(µ Σ_{j=0}^K α_j), (E.4)

where C = 1 + 2A(1 + γ²L̄_F²) + 2α₀A.

Proof of Theorem E.2. We rely on the following potential function,

U_{k+1} ≜ ∥z_{k+1} - z⋆∥² + A_{k+1}∥h_k - Hz_k∥² + B_{k+1}∥z_{k+1} - z_k∥²,

where (A_k)_{k∈ℕ} and (B_k)_{k∈ℕ} are positive scalar parameters to be identified. We denote Ĥ_k := z̄_k - γF̂(z̄_k, ξ̃_k), so that z_{k+1} = z_k - α_k(h_k - Ĥ_k). Expanding one step,

∥z_{k+1} - z⋆∥² = ∥z_k - z⋆∥² - 2α_k⟨h_k - Ĥ_k, z_k - z⋆⟩ + α_k²∥h_k - Ĥ_k∥². (E.5)

Recall that Hz ≜ z - γFz in the deterministic case. In Algorithm 2, h_k estimates Hz_k. Let us quantify how good this estimate is:

h_k - Hz_k = γFz_k - γF̂(z_k, ξ_k) + (1 - α_k)(h_{k-1} - z_{k-1} + γF̂(z_{k-1}, ξ_k)),

so that

∥h_k - Hz_k∥² = (1 - α_k)²∥h_{k-1} - z_{k-1} + γFz_{k-1}∥² + ∥γFz_k - γF̂(z_k, ξ_k) + (1 - α_k)(γF̂(z_{k-1}, ξ_k) - γFz_{k-1})∥²
+ 2(1 - α_k)⟨h_{k-1} - z_{k-1} + γFz_{k-1}, γFz_k - γF̂(z_k, ξ_k) + (1 - α_k)(γF̂(z_{k-1}, ξ_k) - γFz_{k-1})⟩.

In the inner product, the left factor is known once z_k is known, and the right factor has expectation equal to 0 by Assumption II(ii) once z_k is known. Thus, taking the conditional expectation and using the fact that the second moment is larger than the variance, we obtain

E[∥h_k - Hz_k∥² | F_k] ≤ (1 - α_k)²∥h_{k-1} - Hz_{k-1}∥² + 2(1 - α_k)²γ²E[∥F̂(z_k, ξ_k) - F̂(z_{k-1}, ξ_k)∥² | F_k] + 2α_k²γ²E[∥Fz_k - F̂(z_k, ξ_k)∥² | F_k]
≤ (1 - α_k)²∥h_{k-1} - Hz_{k-1}∥² + 2(1 - α_k)²γ²L̄_F²∥z_k - z_{k-1}∥² + 2α_k²γ²σ_F², (E.6)

where we have used Assumptions II(iii) and III. We continue with the conditional expectation of the inner product in (E.5):

-E[⟨h_k - Ĥ_k, z_k - z⋆⟩ | F_k] = -⟨h_k - Hz̄_k, z_k - z⋆⟩
= -⟨h_k - Hz̄_k, z_k - z̄_k⟩ - ⟨h_k - Hz̄_k, z̄_k - z⋆⟩
= -⟨h_k - Hz_k, z_k - z̄_k⟩ - ⟨Hz_k - Hz̄_k, z_k - z̄_k⟩ - ⟨h_k - Hz̄_k, z̄_k - z⋆⟩
≤ -⟨h_k - Hz_k, z_k - z̄_k⟩ - ½∥Hz_k - Hz̄_k∥² - ⟨h_k - Hz̄_k, z̄_k - z⋆⟩, (E.7)

where the last inequality uses 1/2-cocoercivity of H from Lemma F.2(i), which holds under Assumption I(i) and the choice γ ≤ 1/L_F. By the definition of z̄_k in step 2.3 we have h_k ∈ z̄_k + γA(z̄_k), so that (1/γ)(h_k - Hz̄_k) ∈ F(z̄_k) + A(z̄_k).
Hence, using the weak MVI from Assumption I(iii),

⟨h_k - Hz̄_k, z̄_k - z⋆⟩ ≥ (ρ/γ)∥h_k - Hz̄_k∥². (E.8)

Using (E.8) in (E.7) leads to the following inequality, valid for any ε_k > 0:

-E[⟨h_k - Ĥ_k, z_k - z⋆⟩ | F_k] ≤ (ε_k/2)∥h_k - Hz_k∥² + (1/(2ε_k))∥z_k - z̄_k∥² - ½∥Hz_k - Hz̄_k∥² - (ρ/γ)∥h_k - Hz̄_k∥².

To majorize the term ∥z_k - z̄_k∥² we use Lemma F.2(ii), which gives ∥Hz_k - Hz̄_k∥² ≥ (1 - γL_F)²∥z_k - z̄_k∥². Hence, as long as γL_F < 1,

-E[⟨h_k - Ĥ_k, z_k - z⋆⟩ | F_k] ≤ (ε_k/2)∥h_k - Hz_k∥² + (1/(2ε_k(1 - γL_F)²) - ½)∥Hz_k - Hz̄_k∥² - (ρ/γ)∥h_k - Hz̄_k∥². (E.9)

The third term in (E.5) is bounded by

α_k²E[∥h_k - Ĥ_k∥² | F_k] = α_k²∥h_k - Hz̄_k∥² + α_k²γ²E[∥Fz̄_k - F̂(z̄_k, ξ̃_k)∥² | F_k] ≤ α_k²∥h_k - Hz̄_k∥² + α_k²γ²σ_F². (E.10)

Combined with the update rule, (E.10) can also be used to bound the difference of iterates:

E[∥z_{k+1} - z_k∥² | F_k] = α_k²E[∥h_k - Ĥ_k∥² | F_k] ≤ α_k²∥h_k - Hz̄_k∥² + α_k²γ²σ_F². (E.11)

Using (E.5), (E.9), (E.10) and (E.11) we have

E[U_{k+1} | F_k] ≤ ∥z_k - z⋆∥² + (A_{k+1} + α_kε_k)∥h_k - Hz_k∥² - α_k(1 - 1/(ε_k(1 - γL_F)²))∥Hz_k - Hz̄_k∥²
+ α_k(α_k - 2ρ/γ + α_kB_{k+1})∥h_k - Hz̄_k∥² + α_k²(1 + B_{k+1})γ²σ_F²
≤ ∥z_k - z⋆∥² + (A_{k+1} + α_k(ε_k + (1/b)(1 - 1/(ε_k(1 - γL_F)²))))∥h_k - Hz_k∥²
+ α_k(α_k - 2ρ/γ + α_kB_{k+1} - (1/(1 + b))(1 - 1/(ε_k(1 - γL_F)²)))∥h_k - Hz̄_k∥² + α_k²(1 + B_{k+1})γ²σ_F², (E.12)

where the last inequality follows from Young's inequality (B.3) with positive parameter b, requiring 1 - 1/(ε_k(1 - γL_F)²) ≥ 0, as also stated in (E.3). Defining

X_1^k ≜ A_{k+1} + α_k(ε_k + (1/b)(1 - 1/(ε_k(1 - γL_F)²))), X_2^k ≜ α_k(α_k - 2ρ/γ + α_kB_{k+1} - (1/(1 + b))(1 - 1/(ε_k(1 - γL_F)²))) (E.13)

and applying (E.6), we finally obtain

E[U_{k+1} | F_k] - U_k ≤ X_2^k∥h_k - Hz̄_k∥² + (X_1^k(1 - α_k)² - A_k)∥h_{k-1} - Hz_{k-1}∥² + (2X_1^k(1 - α_k)²γ²L̄_F² - B_k)∥z_k - z_{k-1}∥² + (2X_1^k + 1 + B_{k+1})α_k²γ²σ_F². (E.14)

We can pick B_k = 2γ²L̄_F²A_k, in which case, to get a recursion, we only require

X_1^k(1 - α_k)² - A_k ≤ 0 and X_2^k < 0. (E.15)

Set A_k = A and ε_k = ε. For the first requirement of (E.15),

X_1^k(1 - α_k)² - A = α_k(1 - α_k)²(ε + (1/b)(1 - 1/(ε(1 - γL_F)²))) + (1 - α_k)²A - A
≤ α_k(ε + (1/b)(1 - 1/(ε(1 - γL_F)²))) + (1 - α_k)A - A = α_k(ε + (1/b)(1 - 1/(ε(1 - γL_F)²))) - α_kA, (E.16)

where the first step uses (1 - α_k)² ≤ 1 and the second uses (1 - α_k)² ≤ 1 - α_k. Thus, to satisfy the first inequality of (E.15) it suffices to pick

A ≥ ε + (1/b)(1 - 1/(ε(1 - γL_F)²)). (E.17)

The noise term in (E.14) can be made independent of k by using α_k ≤ α₀ and (E.17) as follows:

2X_1^k + 1 + B_{k+1} = 1 + 2A(1 + γ²L̄_F²) + 2α_k(ε + (1/b)(1 - 1/(ε(1 - γL_F)²))) ≤ 1 + 2A(1 + γ²L̄_F²) + 2α₀A = C. (E.18)

Thus it follows from (E.14) and α_k ≤ α₀ that

E[U_{k+1} | F_k] - U_k ≤ α_k(α₀ - 2ρ/γ + 2α₀γ²L̄_F²A - (1/(1 + b))(1 - 1/(ε(1 - γL_F)²)))∥h_k - Hz̄_k∥² + α_k²Cγ²σ_F². (E.19)

The result is obtained by taking total expectation and summing the above inequality, noting that the initial iterate was set as z_{-1} = z_0. ∎

Proof of Theorem 7.1. The theorem is a specialization of Theorem E.2 with particular choices of b and ε. The second requirement in (E.3) can be rewritten as

ε ≥ 1/(1 - γL_F)², (E.20)

which is satisfied by ε = 1/(√α₀(1 - γL_F)²). We substitute in this choice of ε along with b = √α₀, and denote η ≜ A.
The weighted sum in (E.4) can be converted into an expectation over a sampled iterate in the style of Ghadimi & Lan (2013):

E[∥h_{k⋆} - Hz̄_{k⋆}∥²] = Σ_{k=0}^K (α_k/Σ_{j=0}^K α_j) E[∥h_k - Hz̄_k∥²],

with k⋆ chosen from {0, 1, . . . , K} according to the probability P[k⋆ = k] = α_k/Σ_{j=0}^K α_j. Noticing that h_{k⋆} - Hz̄_{k⋆} ∈ γ(Fz̄_{k⋆} + Az̄_{k⋆}) = γT z̄_{k⋆}, we obtain

E[∥h_{k⋆} - Hz̄_{k⋆}∥²] ≥ min_{u∈T z̄_{k⋆}} E[∥γu∥²] ≥ E[min_{u∈T z̄_{k⋆}} ∥γu∥²] =: E[dist(0, γT z̄_{k⋆})²],

where the second inequality follows from concavity of the minimum. This completes the proof. ∎

F Proof for NP-PDEG through a nonlinear asymmetric preconditioner

F.1 Preliminaries

Consider the decomposition z = (z_1, . . . , z_m), u = (u_1, . . . , u_m) with z_i, u_i ∈ ℝ^{n_i}, and define the shorthand notation u_{≤i} ≜ (u_1, u_2, . . . , u_i) and u_{≥i} ≜ (u_i, . . . , u_m) for the truncated vectors. Moreover, suppose that A conforms to the decomposition Az = (A_1z_1, . . . , A_mz_m) with A_i : ℝ^{n_i} ⇒ ℝ^{n_i} maximally monotone. Consistently with the decomposition, define Γ = blkdiag(Γ_1, . . . , Γ_m), where Γ_i ∈ ℝ^{n_i×n_i} are positive definite matrices, and let P_u(z) ≜ Γ⁻¹z + Q_u(z), where

Q_u(z) = (0, q_1(z_1, u_{≥2}), q_2(z_{≤2}, u_{≥3}), . . . , q_{m-1}(z_{≤m-1}, u_m)). (F.1)

When P_u furnishes such an asymmetric structure, the preconditioned resolvent has full domain, thus ensuring that the algorithm is well defined. In the following lemma we show that the iterates in (8.1) are well defined for this particular choice of the preconditioner P_u. The proof is similar to that of (Latafat & Patrinos, 2017, Lem. 3.1) and is included for completeness.

Lemma F.1. Let z = (z_1, . . . , z_m), u = (u_1, . . . , u_m) be given vectors, suppose that A conforms to the decomposition Az = (A_1z_1, . . . , A_mz_m) with A_i : ℝ^{n_i} ⇒ ℝ^{n_i} maximally monotone, and let P_u be defined as in (F.1). Then the preconditioned resolvent (P_u + A)⁻¹ is Lipschitz continuous and has full domain. Moreover, the update z̄ = (P_u + A)⁻¹z reduces to

z̄_1 = (Γ_1⁻¹ + A_1)⁻¹z_1, z̄_i = (Γ_i⁻¹ + A_i)⁻¹(z_i - q_{i-1}(z̄_{≤i-1}, u_{≥i})) for i = 2, . . . , m. (F.2)

Proof. Owing to the asymmetric structure (F.1), the resolvent z̄ = (z̄_1, . . . , z̄_m) may equivalently be expressed as

z̄ = (P_u + A)⁻¹z ⟺ z_i - q_{i-1}(z̄_{≤i-1}, u_{≥i}) ∈ Γ_i⁻¹z̄_i + A_i(z̄_i), i = 1, . . . , m,

where q_0 ≡ 0. The Gauss-Seidel-type update in (F.2) is of immediate verification after noting that (Γ_i⁻¹ + A_i)⁻¹ is single-valued (in fact Lipschitz continuous), since the sum of Γ_i⁻¹ ≻ 0 and the monotone operator A_i is (maximally) strongly monotone. This also implies that Γ_i⁻¹ + A_i = Ā_i + βI for some β > 0 and some maximally monotone operator Ā_i. Thus dom(Γ_i⁻¹ + A_i)⁻¹ = range(Γ_i⁻¹ + A_i) = range(Ā_i + βI) = ℝ^{n_i}, where we used Minty's theorem in the last equality. ∎
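For intuition, the block-triangular structure of (F.1) makes the resolvent in (F.2) computable by a single forward sweep; a minimal sketch, with the per-block resolvents and couplings as abstract callables of our own naming:

```python
def preconditioned_resolvent(z_blocks, u_blocks, res, q):
    """Evaluate z_bar = (P_u + A)^{-1} z via the Gauss-Seidel sweep (F.2).

    res[i](v):   evaluates (Gamma_i^{-1} + A_i)^{-1} at v (single-valued by
                 strong monotonicity of Gamma_i^{-1} + A_i).
    q[j](zb, u): the coupling q_{j+1}(z_bar_{<= j+1}, u_{>= j+2}) from (F.1);
                 the first block has no coupling term (q_0 = 0).
    """
    z_bar = [res[0](z_blocks[0])]          # block 1: no coupling
    for i in range(1, len(z_blocks)):      # blocks 2..m, in order
        coupling = q[i - 1](z_bar[:i], u_blocks[i:])
        z_bar.append(res[i](z_blocks[i] - coupling))
    return z_bar
```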

F.2 Deterministic lemmas

To eventually prove Theorem F.5 we will compare the stochastic algorithm (8.4) with its deterministic counterpart (8.1), so we introduce

H_u(z) := P_u(z) − F(z), (F.3a)
Ḡ(z) := (P_z + A)^{-1}(H_z(z)), (F.3b)
G(z) := z − α_kΓ(H_z(z) − H_z(Ḡ(z))). (F.3c)

We first derive results for the deterministic operator G and then show that z^{k+1} from the stochastic scheme behaves similarly to G(z^k) when α_k is small enough, even if Γ, which also appears inside the preconditioner P̂_u(·, ξ), remains large. Instead of making assumptions on F directly, we consider the following important operator,

M_u(z) := F(z) − Q_u(z), (F.4)

such that we can write (F.3a) as H_u(z) = Γ^{-1}z − M_u(z). As a shorthand we write M(z) := M_z(z).

Assumption VII. The operator M_u as defined in (F.4) is L_M-Lipschitz with L_M ≤ 1 with respect to a positive definite matrix Γ ∈ ℝ^{n×n}, i.e.,

‖M_u(z) − M_u(z′)‖_Γ ≤ L_M‖z − z′‖_{Γ^{-1}}  ∀z, z′ ∈ ℝ^n. (F.5)

Remark 6. This is satisfied by the choice of Q_u in (8.7) and Assumptions IV(ii) and IV(iii).

With M_u defined, it is straightforward to establish that H_u is 1/2-cocoercive and strongly monotone.

Lemma F.2. Suppose Assumption VII holds. Then,

(i) The mapping H_u is 1/2-cocoercive for all u ∈ ℝ^n, i.e.,
⟨H_u(z′) − H_u(z), z′ − z⟩ ≥ (1/2)‖H_u(z′) − H_u(z)‖²_Γ  ∀z, z′ ∈ ℝ^n. (F.6)

(ii) Furthermore, H_u is (1 − L_M)-strongly monotone for all u ∈ ℝ^n, and in particular
‖H_u(z′) − H_u(z)‖_Γ ≥ (1 − L_M)‖z′ − z‖_{Γ^{-1}}  ∀z, z′ ∈ ℝ^n. (F.7)

Proof. By expanding using (F.4),

H_u(z) − H_u(z′) = Γ^{-1}(z − z′) − (M_u(z) − M_u(z′)). (F.8)

Using this we can show cocoercivity:

⟨H_u(z′) − H_u(z), z′ − z⟩ = ⟨H_u(z′) − H_u(z), H_u(z′) − H_u(z) − (M_u(z) − M_u(z′))⟩_Γ   (by (F.8))
= (1/2)‖H_u(z′) − H_u(z)‖²_Γ + (1/2)‖z′ − z‖²_{Γ^{-1}} − (1/2)‖M_u(z) − M_u(z′)‖²_Γ
≥ (1/2)‖H_u(z′) − H_u(z)‖²_Γ,   (by Assumption VII) (F.9)

That H_u is strongly monotone follows from Cauchy–Schwarz and Assumption VII:

⟨H_u(z′) − H_u(z), z′ − z⟩ = ‖z′ − z‖²_{Γ^{-1}} − ⟨M_u(z′) − M_u(z), z′ − z⟩ ≥ (1 − L_M)‖z′ − z‖²_{Γ^{-1}}. (F.10)

The last claim follows from Cauchy–Schwarz and dividing by ‖z′ − z‖_{Γ^{-1}}.

We will rely on the resolvent remaining nonexpansive when preconditioned with a variable stepsize matrix.

Lemma F.3. Let Γ ∈ ℝ^{n×n} be positive definite and let the operator A : ℝ^n ⇒ ℝ^n be maximally monotone. Then, R := (Γ^{-1} + A)^{-1} is nonexpansive in the sense that ‖Rx − Ry‖_{Γ^{-1}} ≤ ‖x − y‖_Γ for all x, y ∈ ℝ^n.

Proof. Let v = Rx and u = Ry, so that x − Γ^{-1}v ∈ Av and y − Γ^{-1}u ∈ Au. By maximal monotonicity of A,

0 ≤ ⟨(x − Γ^{-1}v) − (y − Γ^{-1}u), v − u⟩ = −‖v − u‖²_{Γ^{-1}} + ⟨x − y, v − u⟩.

Therefore, using the Cauchy–Schwarz inequality,

‖v − u‖²_{Γ^{-1}} ≤ ⟨x − y, v − u⟩ ≤ ‖x − y‖_Γ ‖v − u‖_{Γ^{-1}}. (F.11)

The proof is complete by rearranging.
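Lemma F.2 can also be sanity-checked numerically. The sketch below is our own illustration: it builds a linear M_u satisfying (F.5) with L_M < 1 by construction and verifies the cocoercivity inequality (F.6) for random points.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
g = rng.uniform(0.5, 2.0, n)
Gamma, Gamma_inv = np.diag(g), np.diag(1 / g)

# Linear M with ||M z - M z'||_Gamma <= L_M ||z - z'||_{Gamma^{-1}}:
# take M = Gamma^{-1/2} C Gamma^{-1/2} with spectral norm ||C|| = 0.9 < 1,
# since Gamma^{1/2} M = C Gamma^{-1/2} then gives exactly L_M = 0.9.
C = rng.standard_normal((n, n))
C *= 0.9 / np.linalg.norm(C, 2)
S = np.diag(1 / np.sqrt(g))
M = S @ C @ S

H = lambda z: Gamma_inv @ z - M @ z          # H_u(z) = Gamma^{-1} z - M_u(z), cf. (F.4)

z, zp = rng.standard_normal(n), rng.standard_normal(n)
d = H(zp) - H(z)
print(d @ (zp - z) >= 0.5 * d @ Gamma @ d)   # cocoercivity (F.6): prints True
```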

F.3 Stochastic results

The stochastic assumptions on F̂ in Theorem F.5 propagate to M̂ and Q̂_u, as captured by the following lemma.

Lemma F.4. Suppose Assumptions II(ii) and IV(iv) hold for F̂(z, ξ) = (∇_xφ(z, ξ), −∇_yφ(z, ξ)) as defined in (8.6). Let M̂ and M be as defined in (F.15) and Q̂_u and Q_u as in (8.7) with θ ∈ [0, ∞). Then, the following holds for all z, z′ ∈ ℝ^n:

(i) E_ξ[M̂(z, ξ)] = M(z) and E_ξ[Q̂_{z′}(z, ξ)] = Q_{z′}(z);
(ii) E_ξ[‖M(z) − M̂(z, ξ)‖²_Γ] ≤ ((1−θ)² + θ²)σ_F² and E_ξ[‖Q_{z′}(z) − Q̂_{z′}(z, ξ)‖²_Γ] ≤ θ²σ_F².

Proof. Unbiasedness follows immediately from Assumption II(ii). For the second claim we have, for all z = (x, y) ∈ ℝ^n,

E_ξ[‖M(z) − M̂(z, ξ)‖²_Γ] = E_ξ[‖(∇_xφ(z, ξ) − ∇_xφ(z), (1−θ)(∇_yφ(z, ξ) − ∇_yφ(z)))‖²_Γ]
= E_ξ[‖(1−θ)(∇_xφ(z, ξ) − ∇_xφ(z), ∇_yφ(z, ξ) − ∇_yφ(z)) + θ(∇_xφ(z, ξ) − ∇_xφ(z), 0)‖²_Γ]
≤ (1−θ)² E_ξ[‖(∇_xφ(z, ξ) − ∇_xφ(z), ∇_yφ(z, ξ) − ∇_yφ(z))‖²_Γ] + θ² E_ξ[‖(∇_xφ(z, ξ) − ∇_xφ(z), 0)‖²_Γ]   (Assumption II(ii))
≤ ((1−θ)² + θ²)σ_F².   (Assumption IV(iv)) (F.12)

The last claim follows directly through Assumption IV(iv). This completes the proof.

Theorem F.5. Suppose that Assumptions I(iii) to II(ii) and IV hold. Moreover, suppose that α_k ∈ (0, 1), θ ∈ [0, ∞), and that for positive parameters b and ε the following holds:

μ := (1/(1+b))(1 − 1/(ε(1−L_M)²)) + 2ρ/γ̌ − α_0 − 2α_0(ĉ_1 + 2ĉ_2(1 + ĉ_3))A > 0, (F.13)
1 − 4ĉ_2α_0 > 0 and 1 − 1/(ε(1−L_M)²) ≥ 0,

where γ̌ denotes the smallest eigenvalue of Γ,

A ≥ (1 + 4ĉ_2α_0²)(ε + (1/b)(1 − 1/(ε(1−L_M)²)))/(1 − 4ĉ_2α_0),

and

ĉ_1 := L²_{xz}‖ΓD_{xz}‖ + 2(1−θ)²L²_{yz}‖ΓD_{yz}‖ + 2θ²L²_{yy}‖Γ_2D_{yy}‖,
ĉ_2 := 2θ²L²_{yx}‖Γ_1D_{yx}‖,
ĉ_3 := L²_{xz}‖ΓD_{xz}‖,
L²_M := max{L²_{xx}‖D_{xx}Γ_1‖ + L²_{yx}‖D_{yx}Γ_1‖, L²_{xy}‖D_{xy}Γ_2‖ + L²_{yy}‖D_{yy}Γ_2‖}.

Consider the sequence (z^k)_{k∈ℕ} generated by Algorithm 3. Then, the following holds for all z⋆ ∈ S⋆:

Σ_{k=0}^{K} (α_k / Σ_{j=0}^{K} α_j) E[‖Γ^{-1}ẑ^k − S_{z^k}(z^k; z̄^k)‖²_Γ] ≤ (E[‖z^0 − z⋆‖²_{Γ^{-1}}] + A E[‖Γ^{-1}ẑ^{−1} − S_{z^{−1}}(z^{−1}; z̄^{−1})‖²_Γ] + Cσ_F² Σ_{j=0}^{K} α_j²) / (μ Σ_{j=0}^{K} α_j), (F.14)

where C := 2(A + α_0(ε + (1/b)(1 − 1/(ε(1−L_M)²))))(Θ + 2ĉ_2) + 1 + 2(ĉ_1 + 2ĉ_2(1 + ĉ_3))A and Θ := (1−θ)² + 2θ².

Proof of Theorem F.5. The proof relies on tracking the two following important operators instead of F and F̂,

M(z) := F(z) − Q_z(z) and M̂(z, ξ) := F̂(z, ξ) − Q̂_z(z, ξ), (F.15)

along with

S_u(z) := H_u(z) − Q_u(z) and S_u(z; z̄) := H_u(z) − Q_u(z̄), (F.16)

where Q_u(z) and H_u are as defined in Section 8. We will denote Ĥ^k := P̂_{z^k}(z̄^k, ξ̄_k) − F̂(z̄^k, ξ̄_k), so that z^{k+1} = z^k − α_kΓ(h^k − Ĥ^k). We will further need the following change of variables to later be able to apply the weak MVI (see Appendix F.4):

s^k := h^k − Q̂_{z^k}(z̄^k, ξ′_k), Ŝ^k := Ĥ^k − Q̂_{z^k}(z̄^k, ξ′_k),

so that s^k − Ŝ^k = h^k − Ĥ^k. In contrast with the unconstrained smooth case, we will rely on a slightly different potential function, namely

U^{k+1} := ‖z^{k+1} − z⋆‖²_{Γ^{-1}} + A_{k+1}‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ + B_{k+1}‖z^{k+1} − z^k‖²_{Γ^{-1}}, (F.17)

where (A_k)_{k∈ℕ} and (B_k)_{k∈ℕ} are positive scalar parameters to be identified. We start by writing out one step of the update:

‖z^{k+1} − z⋆‖²_{Γ^{-1}} = ‖z^k − z⋆‖²_{Γ^{-1}} − 2α_k⟨s^k − Ŝ^k, z^k − z⋆⟩ + α_k²‖s^k − Ŝ^k‖²_Γ. (F.18)

In the algorithm, s^k estimates S_{z^k}(z^k; z̄^k). Let us quantify how good this estimate is. We will make use of the careful choice of the bias-correction term to shift the noise index by one in the second equality below.
s^k − S_{z^k}(z^k; z̄^k)
= M(z^k) + Q_{z^k}(z̄^k) − M̂(z^k, ξ_k) − Q̂_{z^k}(z̄^k, ξ′_k) + (1−α_k)(h^{k−1} − Γ^{-1}z^{k−1} + M̂(z^{k−1}, ξ_k) − Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_{k−1}) + Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k))
= M(z^k) + Q_{z^k}(z̄^k) − M̂(z^k, ξ_k) − Q̂_{z^k}(z̄^k, ξ′_k) + (1−α_k)(s^{k−1} + Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k) − Γ^{-1}z^{k−1} + M̂(z^{k−1}, ξ_k))
= M(z^k) + Q_{z^k}(z̄^k) − M̂(z^k, ξ_k) − Q̂_{z^k}(z̄^k, ξ′_k) + (1−α_k)(s^{k−1} − S_{z^{k−1}}(z^{k−1}; z̄^{k−1})) + (1−α_k)(M̂(z^{k−1}, ξ_k) − M(z^{k−1}) + Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k) − Q_{z^{k−1}}(z̄^{k−1})).

Using the shorthand notation

s̃^k := s^k − S_{z^k}(z^k; z̄^k), Q̃_{z^k}(z̄^k, ξ′_k) := Q_{z^k}(z̄^k) − Q̂_{z^k}(z̄^k, ξ′_k), M̃(z^k, ξ_k) := M(z^k) − M̂(z^k, ξ_k),

it follows that

‖s̃^k‖²_Γ = (1−α_k)²‖s̃^{k−1}‖²_Γ + ‖M̃(z^k, ξ_k) + Q̃_{z^k}(z̄^k, ξ′_k) − (1−α_k)(M̃(z^{k−1}, ξ_k) + Q̃_{z^{k−1}}(z̄^{k−1}, ξ′_k))‖²_Γ + 2(1−α_k)⟨s̃^{k−1}, M̃(z^k, ξ_k) + Q̃_{z^k}(z̄^k, ξ′_k) − (1−α_k)(M̃(z^{k−1}, ξ_k) + Q̃_{z^{k−1}}(z̄^{k−1}, ξ′_k))⟩. (F.19)

In the scalar product, the left term is known once z^k is known. Moreover, since E[· | F_k] = E[E[· | F′_k] | F_k], owing to F_k ⊂ F′_k, we have

E[M̃(z^k, ξ_k) + Q̃_{z^k}(z̄^k, ξ′_k) − (1−α_k)(M̃(z^{k−1}, ξ_k) + Q̃_{z^{k−1}}(z̄^{k−1}, ξ′_k)) | F_k] = E[M̃(z^k, ξ_k) − (1−α_k)M̃(z^{k−1}, ξ_k) | F_k] = 0,

where we use Assumption II(ii) through Lemma F.4(i). Since the second moment is larger than the variance, we have

E[‖M̃(z^k, ξ_k) + Q̃_{z^k}(z̄^k, ξ′_k) − M̃(z^{k−1}, ξ_k) − Q̃_{z^{k−1}}(z̄^{k−1}, ξ′_k)‖²_Γ | F_k] ≤ E[‖M̂(z^k, ξ_k) − M̂(z^{k−1}, ξ_k) + Q̂_{z^k}(z̄^k, ξ′_k) − Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k)‖²_Γ | F_k]. (F.20)

Using Young's inequality, it follows from (F.19) and (F.20) that

E[‖s̃^k‖²_Γ | F_k] ≤ (1−α_k)²‖s̃^{k−1}‖²_Γ + 2α_k² E[‖M(z^k) − M̂(z^k, ξ_k) + Q_{z^k}(z̄^k) − Q̂_{z^k}(z̄^k, ξ′_k)‖²_Γ | F_k] + 2(1−α_k)² E[‖M̂(z^k, ξ_k) − M̂(z^{k−1}, ξ_k) + Q̂_{z^k}(z̄^k, ξ′_k) − Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k)‖²_Γ | F_k]. (F.21)

To bound the second-to-last term of (F.21), we use unbiasedness due to Assumption II(ii) through Lemma F.4(i) and that E[· | F_k] = E[E[· | F′_k] | F_k], owing to F_k ⊂ F′_k:

E[‖M(z^k) − M̂(z^k, ξ_k) + Q_{z^k}(z̄^k) − Q̂_{z^k}(z̄^k, ξ′_k)‖²_Γ | F_k] = E[‖M(z^k) − M̂(z^k, ξ_k)‖²_Γ | F_k] + E[E[‖Q_{z^k}(z̄^k) − Q̂_{z^k}(z̄^k, ξ′_k)‖²_Γ | F′_k] | F_k] ≤ Θσ_F² (F.22)

with Θ := (1−θ)² + 2θ², where the last inequality follows from Assumptions II(ii) and IV(iv) through Lemma F.4(ii). To bound the last term of (F.21), we use the particular choice of Q_u:

M̂(z^k, ξ_k) − M̂(z^{k−1}, ξ_k) + Q̂_{z^k}(z̄^k, ξ′_k) − Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k) = (∇_xφ(z^k, ξ_k) − ∇_xφ(z^{k−1}, ξ_k), (1−θ)(∇_yφ(z^{k−1}, ξ_k) − ∇_yφ(z^k, ξ_k)) − θ(∇_yφ(x̄^k, y^k, ξ′_k) − ∇_yφ(x̄^{k−1}, y^{k−1}, ξ′_k))).
(F.23)

So Assumption IV(v) applies after application of Young's inequality and the tower rule, leading to the following bound:

E[‖M̂(z^k, ξ_k) − M̂(z^{k−1}, ξ_k) + Q̂_{z^k}(z̄^k, ξ′_k) − Q̂_{z^{k−1}}(z̄^{k−1}, ξ′_k)‖²_Γ | F_k]
= E[‖∇_xφ(z^k, ξ_k) − ∇_xφ(z^{k−1}, ξ_k)‖²_{Γ_1} | F_k] + E[‖(1−θ)(∇_yφ(z^{k−1}, ξ_k) − ∇_yφ(z^k, ξ_k)) − θ(∇_yφ(x̄^k, y^k, ξ′_k) − ∇_yφ(x̄^{k−1}, y^{k−1}, ξ′_k))‖²_{Γ_2} | F_k]
≤ E[‖∇_xφ(z^k, ξ_k) − ∇_xφ(z^{k−1}, ξ_k)‖²_{Γ_1} | F_k] + 2(1−θ)² E[‖∇_yφ(z^{k−1}, ξ_k) − ∇_yφ(z^k, ξ_k)‖²_{Γ_2} | F_k] + 2θ² E[E[‖∇_yφ(x̄^k, y^k, ξ′_k) − ∇_yφ(x̄^{k−1}, y^{k−1}, ξ′_k)‖²_{Γ_2} | F′_k] | F_k]
≤ L²_{xz}‖z^k − z^{k−1}‖²_{D_{xz}} + 2(1−θ)²L²_{yz}‖z^k − z^{k−1}‖²_{D_{yz}} + 2θ²L²_{yy}‖y^k − y^{k−1}‖²_{D_{yy}} + 2θ²L²_{yx}‖x̄^k − x̄^{k−1}‖²_{D_{yx}}   (Assumption IV(v))
≤ ĉ_1‖z^k − z^{k−1}‖²_{Γ^{-1}} + ĉ_2‖x̄^k − x̄^{k−1}‖²_{Γ_1^{-1}}, (F.24)

where ĉ_1 := L²_{xz}‖ΓD_{xz}‖ + 2(1−θ)²L²_{yz}‖ΓD_{yz}‖ + 2θ²L²_{yy}‖Γ_2D_{yy}‖ and ĉ_2 := 2θ²L²_{yx}‖Γ_1D_{yx}‖. Using (F.24) and (F.22) in (F.21) yields

E[‖s̃^k‖²_Γ | F_k] ≤ (1−α_k)²‖s̃^{k−1}‖²_Γ + 2α_k²Θσ_F² + 2(1−α_k)²(ĉ_1‖z^k − z^{k−1}‖²_{Γ^{-1}} + ĉ_2‖x̄^k − x̄^{k−1}‖²_{Γ_1^{-1}}). (F.25)

To majorize ‖x̄^k − x̄^{k−1}‖²_{Γ_1^{-1}} in (F.25), let s_x^k denote the primal components of s^k in what follows. Recall that A decomposes as specified in Section 8, so that s_x^k ∈ Γ_1^{-1}x̄^k + A_1x̄^k, and nonexpansiveness of the resolvent (Lemma F.3) gives

‖x̄^k − x̄^{k−1}‖_{Γ_1^{-1}} ≤ ‖s_x^k − s_x^{k−1}‖_{Γ_1}. (F.26)

We can go on as

‖s_x^k − s_x^{k−1}‖_{Γ_1} = ‖Γ_1^{-1}x^k − ∇_xφ(z^k, ξ_k) + (1−α_k)(Γ_1^{-1}(x^{k−1} − x̄^{k−1}) + ∇_xφ(z^{k−1}, ξ_k)) − s_x^{k−1}‖_{Γ_1}
≤ (1−α_k)‖x^k − x^{k−1}‖_{Γ_1^{-1}} + (1−α_k)‖∇_xφ(z^k, ξ_k) − ∇_xφ(z^{k−1}, ξ_k)‖_{Γ_1} + α_k‖Γ_1^{-1}x^k − ∇_xφ(z^k, ξ_k) − s_x^{k−1}‖_{Γ_1}
≤ (1−α_k)‖x^k − x^{k−1}‖_{Γ_1^{-1}} + (1−α_k)L_{xz}‖z^k − z^{k−1}‖_{D_{xz}} + α_k‖s_x^k − s_x^{k−1}‖_{Γ_1} + α_k(1−α_k)‖Γ_1^{-1}x^{k−1} − ∇_xφ(z^{k−1}, ξ_k) − s_x^{k−1}‖_{Γ_1},

where the second inequality uses Assumption IV(v). Hence, by subtracting α_k‖s_x^k − s_x^{k−1}‖_{Γ_1}, dividing by 1 − α_k, squaring, and using ‖a − b‖² = ‖a‖² + ‖b‖² − 2⟨a, b⟩ together with unbiasedness from Assumption II(ii) to conclude that the inner product vanishes in conditional expectation, we get

E[‖s_x^k − s_x^{k−1}‖²_{Γ_1} | F_k] ≤ 2(1 + ĉ_3)‖x^k − x^{k−1}‖²_{Γ_1^{-1}} + 2α_k² E[‖Γ_1^{-1}x^{k−1} − ∇_xφ(z^{k−1}, ξ_k) − s_x^{k−1}‖²_{Γ_1} | F_k]
≤ 2(1 + ĉ_3)‖x^k − x^{k−1}‖²_{Γ_1^{-1}} + 2α_k² E[‖Γ_1^{-1}x^{k−1} − ∇_xϕ(z^{k−1}) − s_x^{k−1}‖²_{Γ_1} | F_k] + 2α_k²σ_F²   (Assumptions II(ii) and IV(iv))
≤ 2(1 + ĉ_3)‖z^k − z^{k−1}‖²_{Γ^{-1}} + 2α_k² E[‖S_{z^{k−1}}(z^{k−1}; z̄^{k−1}) − s^{k−1}‖²_Γ | F_k] + 2α_k²σ_F²,

where ĉ_3 := L²_{xz}‖ΓD_{xz}‖ and the last inequality reintroduces the y-components. We finally obtain

E[‖x̄^k − x̄^{k−1}‖²_{Γ_1^{-1}} | F_k] ≤ 2(1 + ĉ_3)‖z^k − z^{k−1}‖²_{Γ^{-1}} + 2α_k² E[‖s^{k−1} − S_{z^{k−1}}(z^{k−1}; z̄^{k−1})‖²_Γ | F_k] + 2α_k²σ_F². (F.27)

Introducing (F.27) into (F.25) yields

E[‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ | F_k] ≤ (1−α_k)²(1 + 4ĉ_2α_k²)‖s^{k−1} − S_{z^{k−1}}(z^{k−1}; z̄^{k−1})‖²_Γ + 2(1−α_k)²(ĉ_1 + 2ĉ_2(1 + ĉ_3))‖z^k − z^{k−1}‖²_{Γ^{-1}} + 2α_k²(Θ + 2(1−α_k)²ĉ_2)σ_F². (F.28)

We continue with the inner term in (F.18) under conditional expectation:

−E[⟨s^k − Ŝ^k, z^k − z⋆⟩ | F_k] = −⟨s^k − S_{z^k}(z̄^k), z^k − z⋆⟩
= −⟨s^k − S_{z^k}(z̄^k), z^k − z̄^k⟩ − ⟨s^k − S_{z^k}(z̄^k), z̄^k − z⋆⟩
= −⟨s^k − S_{z^k}(z^k; z̄^k), z^k − z̄^k⟩ − ⟨S_{z^k}(z^k; z̄^k) − S_{z^k}(z̄^k), z^k − z̄^k⟩ − ⟨s^k − S_{z^k}(z̄^k), z̄^k − z⋆⟩
= −⟨s^k − S_{z^k}(z^k; z̄^k), z^k − z̄^k⟩ − ⟨H_{z^k}(z^k) − H_{z^k}(z̄^k), z^k − z̄^k⟩ − ⟨s^k − S_{z^k}(z̄^k), z̄^k − z⋆⟩,

where the last equality uses that S_{z^k}(z^k; z̄^k) − S_{z^k}(z̄^k) = H_{z^k}(z^k) − H_{z^k}(z̄^k).
By definition of z̄^k in (8.4b), we have s^k = h^k − Q̂_{z^k}(z̄^k, ξ′_k) ∈ Γ^{-1}z̄^k + A(z̄^k), so that s^k − S_{z^k}(z̄^k) ∈ F(z̄^k) + A(z̄^k). Hence, using the weak MVI from Assumption I(iii),

⟨s^k − S_{z^k}(z̄^k), z̄^k − z⋆⟩ ≥ ρ‖s^k − S_{z^k}(z̄^k)‖². (F.29)

Using also cocoercivity of H_u from Lemma F.2(i), this leads to the following inequality, true for any ε_k > 0:

−E[⟨s^k − Ŝ^k, z^k − z⋆⟩ | F_k] ≤ (ε_k/2)‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ + (1/(2ε_k))‖z^k − z̄^k‖²_{Γ^{-1}} − (1/2)‖H_{z^k}(z^k) − H_{z^k}(z̄^k)‖²_Γ − ρ‖s^k − S_{z^k}(z̄^k)‖².

To majorize the term ‖z^k − z̄^k‖²_{Γ^{-1}}, we may use Lemma F.2(ii), for which we need to determine L_M. For the particular choice of Q_u, we have through Assumption IV(ii) that

‖M(z′) − M(z)‖²_Γ ≤ L²_M‖z′ − z‖²_{Γ^{-1}} (F.30)

with L²_M := max{L²_{xx}‖D_{xx}Γ_1‖ + L²_{yx}‖D_{yx}Γ_1‖, L²_{xy}‖D_{xy}Γ_2‖ + L²_{yy}‖D_{yy}Γ_2‖}. By the stepsize choice in Assumption IV(iii), L_M < 1, which will be important shortly. From Lemma F.2(ii) it then follows that ‖H_{z^k}(z^k) − H_{z^k}(z̄^k)‖²_Γ ≥ (1 − L_M)²‖z^k − z̄^k‖²_{Γ^{-1}}. Hence, given L_M < 1,

−E[⟨s^k − Ŝ^k, z^k − z⋆⟩ | F_k] ≤ (ε_k/2)‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ + (1/(2ε_k(1−L_M)²) − 1/2)‖H_{z^k}(z^k) − H_{z^k}(z̄^k)‖²_Γ − ρ‖s^k − S_{z^k}(z̄^k)‖²
= (ε_k/2)‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ + (1/(2ε_k(1−L_M)²) − 1/2)‖S_{z^k}(z^k; z̄^k) − S_{z^k}(z̄^k)‖²_Γ − ρ‖s^k − S_{z^k}(z̄^k)‖². (F.31)

The conditional expectation of the third term in (F.18) is bounded by

α_k² E[‖s^k − Ŝ^k‖²_Γ | F_k] = α_k²‖s^k − S_{z^k}(z̄^k)‖²_Γ + α_k² E[‖F(z̄^k) − F̂(z̄^k, ξ̄_k)‖²_Γ | F_k] ≤ α_k²‖s^k − S_{z^k}(z̄^k)‖²_Γ + α_k²σ_F², (F.32)

where we have used Assumption IV(iv). Combined with the update rule, (F.32) can also be used to bound the conditional expectation of the difference of iterates:

E[‖z^{k+1} − z^k‖²_{Γ^{-1}} | F_k] = E[α_k²‖s^k − Ŝ^k‖²_Γ | F_k] ≤ α_k²‖s^k − S_{z^k}(z̄^k)‖²_Γ + α_k²σ_F². (F.33)

Using (F.18), (F.31), (F.32), (F.33) and that −ρ‖s^k − S_{z^k}(z̄^k)‖² ≤ −(ρ/γ̌)‖s^k − S_{z^k}(z̄^k)‖²_Γ, with γ̌ denoting the smallest eigenvalue of Γ, we have

E[U^{k+1} | F_k] ≤ ‖z^k − z⋆‖²_{Γ^{-1}} + (A_{k+1} + α_kε_k)‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ − α_k(1 − 1/(ε_k(1−L_M)²))‖S_{z^k}(z^k; z̄^k) − S_{z^k}(z̄^k)‖²_Γ + α_k(α_k − 2ρ/γ̌ + α_kB_{k+1})‖s^k − S_{z^k}(z̄^k)‖²_Γ + α_k²(1 + B_{k+1})σ_F²
≤ ‖z^k − z⋆‖²_{Γ^{-1}} + (A_{k+1} + α_k(ε_k + (1/b)(1 − 1/(ε_k(1−L_M)²))))‖s^k − S_{z^k}(z^k; z̄^k)‖²_Γ + α_k(α_k − 2ρ/γ̌ + α_kB_{k+1} − (1/(1+b))(1 − 1/(ε_k(1−L_M)²)))‖s^k − S_{z^k}(z̄^k)‖²_Γ + α_k²(1 + B_{k+1})σ_F², (F.34)

where the last inequality follows from Young's inequality with positive b, as long as 1 − 1/(ε_k(1−L_M)²) ≥ 0. By defining

X¹_k := A_{k+1} + α_k(ε_k + (1/b)(1 − 1/(ε_k(1−L_M)²))),
X²_k := 2ρ/γ̌ − α_k − α_kB_{k+1} + (1/(1+b))(1 − 1/(ε_k(1−L_M)²)), (F.35)

and applying (F.28), we finally obtain

E[U^{k+1} | F_k] − U^k ≤ −α_kX²_k‖s^k − S_{z^k}(z̄^k)‖²_Γ + (X¹_k(1−α_k)²(1 + 4ĉ_2α_k²) − A_k)‖s^{k−1} − S_{z^{k−1}}(z^{k−1}; z̄^{k−1})‖²_Γ + (2X¹_k(1−α_k)²(ĉ_1 + 2ĉ_2(1 + ĉ_3)) − B_k)‖z^k − z^{k−1}‖²_{Γ^{-1}} + 2X¹_kα_k²(Θ + 2(1−α_k)²ĉ_2)σ_F² + α_k²(1 + B_{k+1})σ_F². (F.36)

If A_k ≥ X¹_k(1−α_k)²(1 + 4ĉ_2α_k²), then it suffices to pick B_k as

2X¹_k(1−α_k)²(ĉ_1 + 2ĉ_2(1 + ĉ_3)) ≤ 2(ĉ_1 + 2ĉ_2(1 + ĉ_3))A_k/(1 + 4ĉ_2α_k²) ≤ 2(ĉ_1 + 2ĉ_2(1 + ĉ_3))A_k =: B_k. (F.37)

To get a recursion, we then only require the following conditions:

X¹_k(1−α_k)²(1 + 4ĉ_2α_k²) ≤ A_k and X²_k > 0. (F.38)

Set A_k = A, ε_k = ε.
For the first inequality of (F.38), since (1−α_k)² ≤ (1−α_k), the terms involving A are bounded as

(1−α_k)²(1 + 4ĉ_2α_k²)A − A ≤ (1−α_k)(1 + 4ĉ_2α_k²)A − A = −α_kA + (1−α_k)4ĉ_2α_k²A ≤ −α_k(1 − 4ĉ_2α_0)A, (F.39)

where the last inequality follows from (1−α_k) ≤ 1 and α_k ≤ α_0. Thus, to satisfy the first inequality of (F.38) it suffices to pick

A ≥ (1 + 4ĉ_2α_0²)(ε + (1/b)(1 − 1/(ε(1−L_M)²)))/(1 − 4ĉ_2α_0), (F.40)

where 1 − 4ĉ_2α_0 > 0 is required. The second inequality of (F.38) is satisfied owing to (F.13). The noise term in (F.36) can be made independent of k by using α_k ≤ α_0:

2X¹_k(Θ + 2(1−α_k)²ĉ_2) + 1 + B_{k+1} = 2(A + α_k(ε + (1/b)(1 − 1/(ε(1−L_M)²))))(Θ + 2(1−α_k)²ĉ_2) + 1 + 2(ĉ_1 + 2ĉ_2(1 + ĉ_3))A
≤ 2(A + α_0(ε + (1/b)(1 − 1/(ε(1−L_M)²))))(Θ + 2ĉ_2) + 1 + 2(ĉ_1 + 2ĉ_2(1 + ĉ_3))A =: C. (F.41)

Thus, it follows from (F.36) that

E[U^{k+1} | F_k] − U^k ≤ α_k(α_0 − 2ρ/γ̌ + 2α_0(ĉ_1 + 2ĉ_2(1 + ĉ_3))A − (1/(1+b))(1 − 1/(ε(1−L_M)²)))‖s^k − S_{z^k}(z̄^k)‖²_Γ + α_k²Cσ_F². (F.42)

The result is obtained by taking total expectation and summing the above inequality, while noting that the initial iterates were set as z^{−1} = z^0.

Proof of Theorem 8.2. The theorem is a specialization of Theorem F.5 for a particular choice of b and ε. The third requirement of (F.13) can be rewritten as

ε ≥ 1/(1−L_M)², (F.43)

which is satisfied by ε = 1/(√α_0 (1−L_M)²). We substitute in this choice of ε, take b = √α_0, and denote η := A. The weighted sum in (F.14) is equivalent to an expectation over a sampled iterate in the style of Ghadimi & Lan (2013):

E[‖Γ^{-1}ẑ^{k⋆} − S_{z^{k⋆}}(z^{k⋆}; z̄^{k⋆})‖²_Γ] = Σ_{k=0}^{K} (α_k / Σ_{j=0}^{K} α_j) E[‖Γ^{-1}ẑ^k − S_{z^k}(z^k; z̄^k)‖²_Γ],

with k⋆ chosen from {0, 1, ..., K} according to the probability P[k⋆ = k] = α_k / Σ_{j=0}^{K} α_j. Noticing that Γ^{-1}ẑ^{k⋆} − S_{z^{k⋆}}(z^{k⋆}; z̄^{k⋆}) ∈ F(z̄^{k⋆}) + A(z̄^{k⋆}) = T z̄^{k⋆}, we have

E[‖Γ^{-1}ẑ^{k⋆} − S_{z^{k⋆}}(z^{k⋆}; z̄^{k⋆})‖²_Γ] ≥ min_{u∈T z̄^{k⋆}} E[‖u‖²_Γ] ≥ E[min_{u∈T z̄^{k⋆}} ‖u‖²_Γ] =: E[dist_Γ(0, T z̄^{k⋆})²],

where the second inequality follows from concavity of the minimum. This completes the proof.
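The sampled-iterate conversion used in the two proofs above is straightforward to implement. The following sketch (with an assumed stepsize sequence for illustration) shows the sampling rule.

```python
import numpy as np

def sample_iterate_index(alphas, rng):
    """Pick k* in {0,...,K} with P[k* = k] = alpha_k / sum_j alpha_j,
    turning a weighted average over iterates into an expectation at a
    randomly sampled iterate, in the style of Ghadimi & Lan (2013)."""
    alphas = np.asarray(alphas, dtype=float)
    return rng.choice(len(alphas), p=alphas / alphas.sum())

# e.g. with the schedule alpha_k proportional to 1/sqrt(k+1):
rng = np.random.default_rng(0)
k_star = sample_iterate_index(1.0 / np.sqrt(np.arange(1.0, 101.0)), rng)
```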

F.4 Explanation of bias-correction term

Consider the naive analysis which would track h^k. By the definition of z̄^k in (8.4b) and of H_k(z^k), we would have

h^k − H_k(z^k) + P_k(z̄^k) − P̂_k(z̄^k, ξ′_k) ∈ F(z̄^k) + A(z̄^k).

Hence, assuming zero mean and using the weak MVI from Assumption I(iii),

E[⟨h^k − H_k(z^k), z̄^k − z⋆⟩ | F′_k] = E[⟨h^k − H_k(z^k) + P_k(z̄^k) − P̂_k(z̄^k, ξ′_k), z̄^k − z⋆⟩ | F′_k] ≥ E[ρ‖h^k − H_k(z^k) + P_k(z̄^k) − P̂_k(z̄^k, ξ′_k)‖² | F′_k]. (F.44)

To proceed we could apply Young's inequality, but this would produce a noise term which would propagate to the descent inequality in (F.36) with a factor of α_k in front. To show convergence we would instead need the smaller factor α_k². To avoid this error term entirely, we instead perform a change of variables with s^k := h^k − P̂_{z^k}(z̄^k, ξ′_k), such that

h^k ∈ P̂_{z^k}(z̄^k, ξ′_k) + Az̄^k ⇔ h^k − P̂_{z^k}(z̄^k, ξ′_k) ∈ Az̄^k ⇔ s^k ∈ Az̄^k. (F.45)

This makes the application of Assumption I(iii) unproblematic, but it affects the choice of the bias-correction term, since the analysis now applies to s^k. If, instead of the careful choice of h^k in (8.4a), we had made the choice

h^k = P̂_{z^k}(z^k, ξ_k) − F̂(z^k, ξ_k) + (1−α_k)(h^{k−1} − P̂_{z^{k−1}}(z^{k−1}, ξ_k) + F̂(z^{k−1}, ξ_k)), (F.46)

then

s^k = P̂_{z^k}(z^k, ξ_k) − F̂(z^k, ξ_k) − P̂_{z^k}(z̄^k, ξ′_k) + (1−α_k)(s^{k−1} − P̂_{z^{k−1}}(z^{k−1}, ξ_k) + F̂(z^{k−1}, ξ_k) + P̂_{z^{k−1}}(z̄^{k−1}, ξ′_{k−1})).

Notice how the latter term is evaluated under ξ′_{k−1} instead of ξ′_k. The choice in (8.4a) resolves this issue.
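To make the role of the bias-correction concrete, here is a minimal sketch of a BC-SEG+-style loop for the unconstrained smooth case (A ≡ 0, H = id − γF) with the additive-noise oracle used in the experiments. Taking the exploration point z̄^k equal to the estimate h^k is an assumption of this sketch, consistent with the estimates (E.10) and (E.11), and this is not the paper's exact pseudocode.

```python
import numpy as np

def bc_seg_plus(F, z0, gamma, alphas, sigma=0.1, seed=0):
    """Sketch of bias-corrected stochastic extragradient (BC-SEG+), A = 0.

    The estimate h tracks H(z^k) = z^k - gamma*F(z^k); the (1 - alpha_k)
    bias-correction reuses the *same* fresh sample xi_k at z^k and z^{k-1},
    which shifts the noise index, cf. the discussion around (F.46).
    Noise model: F(z, xi) = F(z) + xi, as in Appendix H.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    z_prev = z.copy()
    h = z - gamma * (F(z) + sigma * rng.standard_normal(z.shape))
    for alpha in alphas:
        xi = sigma * rng.standard_normal(z.shape)   # shared sample xi_k
        h = (z - gamma * (F(z) + xi)
             + (1 - alpha) * (h - (z_prev - gamma * (F(z_prev) + xi))))
        z_bar = h                                    # exploration point (assumed)
        xi_bar = sigma * rng.standard_normal(z.shape)
        z_prev, z = z, z - alpha * gamma * (F(z_bar) + xi_bar)
    return z
```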

G Negative weak Minty variational inequality

In this section we consider the problem of finding a zero of the single-valued operator F (with the set-valued operator A ≡ 0). Observe that the weak MVI in Assumption I(iii),

⟨Fz, z − z⋆⟩ ≥ ρ‖Fz‖² for all z ∈ ℝ^n,

is not symmetric, and one may instead consider the assumption holding for −F. As we will see below, this simple observation leads to nontrivial problem classes extending the reach of extragradient-type methods in both the deterministic and the stochastic setting.

Assumption VIII (negative weak MVI). There exists a nonempty set S⋆ ⊆ zer T such that for all z⋆ ∈ S⋆ and some ρ̄ ∈ (−∞, 1/(2L)),

⟨Fz, z − z⋆⟩ ≤ ρ̄‖Fz‖² for all z ∈ ℝ^n. (G.1)

Under this assumption, the algorithm of Pethick et al. (2022) applied to −F leads to the following modified iterates:

z̄^k = z^k + γ_kFz^k, (G.2)
z^{k+1} = z^k + λ_kα_k(H_kz̄^k − H_kz^k) = z^k + λ_kα_kγ_kF z̄^k, where H_k := id + γ_kF. (G.3)

We next consider the lower bound example of (Pethick et al., 2022, Ex. 5) to show that, despite the condition for weak MVI being violated for b smaller than a certain threshold, the negative weak MVI in Assumption VIII holds for any negative b, and thus the extragradient method applied to −F is guaranteed to converge. Since M is a bisymmetric linear mapping with M^⊤M = (a² + b²)I, the characterizations given below in the discussion of (G.4) imply ρ ∈ (−1/(2L), b/(a²+b²)] and ρ̄ ∈ [b/(a²+b²), 1/(2L)). The range for ρ is nonempty only if b > −a/√3, while this is not an issue for ρ̄, which allows any negative b. We complete this section with a corollary to Theorem 6.3 obtained by replacing the weak MVI assumption with Assumption VIII.

Corollary G.1. Suppose that Assumptions I(i) and I(ii), Assumptions II, III and VIII hold. Let (z^k)_{k∈ℕ} denote the sequence generated by Algorithm 1 applied to −F. Then, the claims of Theorem 6.3 hold true.

H Experiments

Figure 3: The (projected) (SEG+) method needs to take γ arbitrarily small to guarantee convergence to an arbitrarily small neighborhood. We show an instance satisfying the weak MVI where γ cannot be taken arbitrarily small. The objective is ψ(x, y) = ϕ(x − 0.9, y − 0.9) under box constraints ‖(x, y)‖_∞ ≤ 1, with ϕ from Example 2 where L = 1 and ρ = −1/(10L). The unique stationary point (x⋆, y⋆) = (0.9, 0.9) lies in the interior, so even ‖Fz‖ can be driven to zero. Taking γ smaller does not make the neighborhood smaller, as opposed to the monotone case in Figure 1.

In both Example 2 and Example 3 the operator F is defined as Fz = (∇_xϕ(x, y), −∇_yϕ(x, y)). To simulate a stochastic setting in all examples, we consider additive Gaussian noise, i.e. F̂(z, ξ) = Fz + ξ where ξ ∼ N(0, σ²I). We choose σ = 0.1 and initialize with z^0 = 1 if not specified otherwise. The default configuration is γ = 1/(2L_F) with α_k = 1/(18·(k/c+1)), c = 100, and β_k = α_k for diminishing stepsize schemes, and α = 1/18 for fixed stepsize schemes. We make two exceptions: Figure 1 uses the slower decay c = 1000 when γ = 0.1, and Figure 3 uses c = 5000 for γ = 0.01 (and otherwise c = 1000) to ensure fast enough convergence.
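As an illustration of this experimental setup, the following sketch constructs the Example 2 operator for given (L, ρ), using the parametrization a = √(L² − L⁴ρ²), b = L²ρ stated with Example 2 below, together with the additive-noise oracle; the function names are ours.

```python
import numpy as np

def make_example2(L=1.0, rho=-0.1):
    """Operator Fz = Mz of Example 2, parametrized so that the Lipschitz
    constant is L and the weak MVI constant is rho."""
    a = np.sqrt(L**2 - L**4 * rho**2)
    b = L**2 * rho
    M = np.array([[b, a], [-a, b]])
    return lambda z: M @ z

F = make_example2()                                       # L = 1, rho = -1/(10L)
rng = np.random.default_rng(0)
sigma = 0.1
F_hat = lambda z: F(z) + sigma * rng.standard_normal(2)   # F(z, xi) = Fz + xi
z = np.ones(2)                                            # z^0 = 1
```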

H.2 Additional algorithmic details

For the constrained setting in Figure 1, we consider two extensions of (SEG+). One variant uses a single application of the resolvent, as suggested by Pethick et al. (2022):

z̄^k = (id + γA)^{-1}(z^k − γF̂(z^k, ξ_k)) with ξ_k ∼ P,
z^{k+1} = z^k + α_k((z̄^k − z^k) − γ(F̂(z̄^k, ξ̄_k) − F̂(z^k, ξ_k))) with ξ̄_k ∼ P. (P1SEG+)

The other variant applies the resolvent twice, as in stochastic Mirror-Prox (Juditsky et al., 2011):

z̄^k = (id + γA)^{-1}(z^k − γF̂(z^k, ξ_k)) with ξ_k ∼ P,
z^{k+1} = (id + α_kγA)^{-1}(z^k − α_kγF̂(z̄^k, ξ̄_k)) with ξ̄_k ∼ P. (P2SEG+)
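A minimal sketch of the two variants above, assuming A is the normal cone of a box so that every resolvent reduces to the Euclidean projection (independently of the stepsize scaling in front of A), and assuming the additive-noise model from the experiments; note that (P1SEG+) reuses the same sample ξ_k at both evaluation points.

```python
import numpy as np

def proj_box(z, r=1.0):
    # (id + c*A)^{-1} for the normal cone A of {z : ||z||_inf <= r} is the
    # projection onto the box, for any scaling c > 0.
    return np.clip(z, -r, r)

def p1_seg_plus_step(F, z, gamma, alpha, sigma, rng, r=1.0):
    xi = sigma * rng.standard_normal(z.shape)        # shared sample xi_k
    g = F(z) + xi
    z_bar = proj_box(z - gamma * g, r)
    g_bar = F(z_bar) + sigma * rng.standard_normal(z.shape)
    return z + alpha * ((z_bar - z) - gamma * (g_bar - g))

def p2_seg_plus_step(F, z, gamma, alpha, sigma, rng, r=1.0):
    z_bar = proj_box(z - gamma * (F(z) + sigma * rng.standard_normal(z.shape)), r)
    g_bar = F(z_bar) + sigma * rng.standard_normal(z.shape)
    return proj_box(z - alpha * gamma * g_bar, r)
```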






Figure 2: Comparison of methods in the unconstrained setting of Example 2 (left) and the constrained setting of Example 3 (right). Notice that only BC-SEG+ and BC-PSEG+ converge properly, while (SEG) diverges, (PSEG) cycles, and both (SF-EG+) and (SF-PEG+) only converge to a neighborhood. BC-(P)SEG+ is guaranteed to converge with probability 1, as established through Theorem 6.3 and ??.

which we simply denote dist(u, U) when V = I. The norm ‖X‖ refers to the spectral norm when X is a matrix. We summarize essential definitions from operator theory, but otherwise refer to Bauschke & Combettes (2017); Rockafellar (1970) for further details. An operator A : ℝ^n ⇒ ℝ^d maps each point x ∈ ℝ^n to a subset Ax ⊆ ℝ^d, where the notations A(x) and Ax will be used interchangeably. We denote the domain of A by dom A := {x ∈ ℝ^n | Ax ≠ ∅} and its graph by gph A := {(x, y) ∈ ℝ^n × ℝ^d | y ∈ Ax}. The inverse of A is defined through its graph, gph A^{-1} := {(y, x) | (x, y) ∈ gph A}, and the set of its zeros by zer A := {x ∈ ℝ^n | 0 ∈ Ax}. Definition B.1 ((co)monotonicity, Bauschke et al. (

(D.23) A reasonable choice for the Young parameter ε_k is ε_k = γL_F α_k^d for some d ∈ [0, 1]. (D.24) The rationale for this choice will become clearer in what follows. The first inequality in (D.22) is linear and we can solve it to equality by Lemma D.1. Let c_k and η_k be as in Assumption VI(ii). Then, Lemma D.1 yields

The choice of b is substituted into (D.1), (D.20) and C of Theorem D.3. This completes the proof.

Consider (Pethick et al., 2022, Ex. 5):

minimize_{x∈ℝ} maximize_{y∈ℝ} f(x, y) := axy + (b/2)(x² − y²), (G.4)

where b < 0 and a > 0. The associated F is a linear mapping. For a linear mapping M, Assumption VIII holds if

(1/2)(M + M^⊤) − ρ̄M^⊤M ⪯ 0, ρ̄ ∈ (−∞, 1/(2L)),

while Assumption I(iii) holds if

(1/2)(M + M^⊤) − ρM^⊤M ⪰ 0, ρ ∈ (−1/(2L), ∞).

For this example, L = √(a² + b²) and F(z) = Mz = (bx + ay, −ax + by).
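A quick numerical illustration of these characterizations and of the iterates (G.2)-(G.3): the values a = 1, b = −1 below are our own choice and violate the weak MVI range (b ≤ −a/√3), while the negative weak MVI still holds and extragradient applied to −F converges.

```python
import numpy as np

a, b = 1.0, -1.0
M = np.array([[b, a], [-a, b]])      # F(z) = Mz for f(x,y) = axy + b/2 (x^2 - y^2)
L = np.sqrt(a**2 + b**2)

rho = b / (a**2 + b**2)              # boundary value b/(a^2 + b^2)
print("weak MVI range nonempty:", rho > -1 / (2 * L))          # False here
print("negative weak MVI range nonempty:", rho < 1 / (2 * L))  # True for any b < 0

# Extragradient applied to -F, cf. (G.2)-(G.3) with lambda_k = alpha_k = 1
gamma = 1 / (2 * L)
z = np.array([1.0, 1.0])
for _ in range(200):
    z_bar = z + gamma * (M @ z)      # exploration step uses +F, i.e. -(-F)
    z = z + gamma * (M @ z_bar)
print("||z|| after 200 iterations:", np.linalg.norm(z))   # decays to ~0
```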

Example 2 (Unconstrained quadratic game (Pethick et al., 2022, Ex. 5)). Consider the objective in (G.4) with a ∈ ℝ_+ and b ∈ ℝ. The problem constants in Example 2 can easily be computed as ρ = b/(a² + b²) and L = √(a² + b²). We can rewrite Example 2 in terms of L and ρ by choosing a = √(L² − L⁴ρ²) and b = L²ρ.

Example 3 (Constrained minimax (Pethick et al., 2022, Ex. 4)). Consider

minimize_{|x|≤4/3} maximize_{|y|≤4/3} ϕ(x, y) := xy + ψ(x) − ψ(y), (GlobalForsaken)

where ψ(z) = (2/21)z⁶ − (1/3)z⁴ + (1/3)z².

When the aggressive stepsize schedule is used, we take α_k = 1/(18·√(k/100+1)).

Figure 4: Instead of taking α_k ∝ 1/k (for which almost sure convergence is established through Theorem 6.3 and ??), we take α_k ∝ 1/√k as permitted in Theorems 6.1 and 7.1. We consider the example provided in Figure 1 (top row) and the two examples from Figure 2 (bottom row). Under this more aggressive stepsize schedule the guarantee is only in expectation over the iterates, which is also apparent from the relatively large volatility in comparison with Figures 1 and 2.

When applying (SEG) to constrained settings we similarly use the following projected variant:

z̄^k = (id + β_kγA)^{-1}(z^k − β_kγF̂(z^k, ξ_k)) with ξ_k ∼ P,
z^{k+1} = (id + α_kγA)^{-1}(z^k − α_kγF̂(z̄^k, ξ̄_k)) with ξ̄_k ∼ P, (PSEG)

and likewise for (EG+) (using stochastic feedback, denoted SF):

z̄^k = (id + γA)^{-1}(z^k − γF̂(z^k, ξ_k)) with ξ_k ∼ P,
z^{k+1} = (id + αγA)^{-1}(z^k − αγF̂(z̄^k, ξ̄_k)) with ξ̄_k ∼ P, (SF-PEG+)

which in the unconstrained case (A ≡ 0) we refer to as (SF-EG+), as defined below:

z̄^k = z^k − γF̂(z^k, ξ_k) with ξ_k ∼ P,
z^{k+1} = z^k − αγF̂(z̄^k, ξ̄_k) with ξ̄_k ∼ P. (SF-EG+)

Overview of the results. The second row is obtained as special cases of the first row. In the unconstrained and smooth setting, Appendix C treats convergence of (SEG+) for the restricted case where F is linear. Appendix D shows both random-iterate results and almost sure convergence of Algorithm 1. Theorems 6.1 and 6.3 in the main body are implied by the more general results in this section, which preserve certain free parameters and allow more general stepsize requirements. Appendices E and F move beyond the unconstrained and smooth case by showing convergence for instances of the template scheme (8.1). The analysis of Algorithm 3 in Appendix F applies to Algorithm 2, but for completeness we establish convergence for general F separately in Appendix E. The relationship between the theorems is presented in Table 1.



11 Acknowledgments and disclosure of funding This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n°725594 -time-data). This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. The work of the third and fourth author was supported by the Research Foundation Flanders (FWO) postdoctoral grant 12Y7622N and research projects G081222N, G033822N, G0A0920N; Research Council KU Leuven C1 project No. C14/18/068; European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348. The work of Olivier Fercoq was supported by the Agence National de la Recherche grant ANR-20-CE40-0027, Optimal Primal-Dual Algorithms (APDO).

I Comparison with variance reduction

Consider the case where the expectation comes in the form of a finite sum, F = (1/N)Σ_{i=1}^{N} F_i. In the worst case, the averaged Lipschitz constant L̄_F scales with the square root of the number of elements N, i.e. L̄_F = Ω(√N L_F). It is easy to construct such an example by taking one element to have Lipschitz constant NL while letting the remaining elements have Lipschitz constant L. Recalling the definition in Assumption III,

L̄²_F = (1/N)(NL)² + ((N−1)/N)L² ≥ NL²,

while the average becomes

L_F = (1/N)(NL) + ((N−1)/N)L ≤ 2L,

so L̄_F ≥ (√N/2)L_F. Thus, L̄_F can be √N times larger than L_F, leading to a potentially strict requirement on the weak MVI parameter, of the order ρ > −1/(2L̄_F), for variance reduction methods.
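The √N/2 gap in the worked example above is easy to verify numerically; the sketch below uses the same construction (one element with constant NL, the rest with constant L).

```python
import numpy as np

N, L = 100, 1.0
lips = np.array([N * L] + [L] * (N - 1))   # one bad element, the rest benign
L_bar = np.sqrt(np.mean(lips**2))          # mean-squared constant: >= sqrt(N) L
L_avg = np.mean(lips)                      # mean constant: <= 2L
print(L_bar / L_avg, np.sqrt(N) / 2)       # ratio is roughly sqrt(N)/2
```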

