AVERAGE-CASE ACCELERATION FOR BILINEAR GAMES AND NORMAL MATRICES

Abstract

Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through numerical simulations with a varying degree of mismatch with our assumptions.

1. INTRODUCTION

The traditional analysis of optimization algorithms is a worst-case analysis (Nemirovski, 1995; Nesterov, 2004). This type of analysis provides a complexity bound for any input from a function class, no matter how unlikely. However, since hard-to-solve inputs might rarely occur in practice, worst-case complexity bounds might not be representative of the observed running time. A more representative analysis is given by the average-case complexity, which averages the algorithm's complexity over all possible inputs. This type of analysis is standard for, e.g., sorting (Knuth, 1997) and cryptography algorithms (Katz & Lindell, 2014).

Recently, a line of work (Berthier et al., 2020; Pedregosa & Scieur, 2020; Lacotte & Pilanci, 2020; Paquette et al., 2020) focused on optimal methods for the optimization of quadratics, specified by a symmetric matrix. While worst-case analysis uses bounds on the matrix eigenvalues to yield upper and lower bounds on convergence, average-case analysis relies on the expected distribution of eigenvalues and provides algorithms with sharp optimal convergence rates. While the algorithms developed in this context have been shown to be efficient for minimization problems, they have not been extended to smooth games. A different line of work considers algorithms for smooth games but studies worst-case optimal methods (Azizian et al., 2020).

In this work, we combine average-case analysis with smooth games, and develop novel average-case optimal algorithms for finding the root of a linear system determined by a (potentially non-symmetric) normal matrix. We make the following main contributions:

1. Inspired by the problem of finding equilibria in smooth games, we develop average-case optimal algorithms for finding the root of a non-symmetric affine operator, both under a normality assumption (Thm. 4.1) and under the extra assumption that the eigenvalues of the operator are supported in a disk (Thm. 4.2). The proposed method shows a polynomial speedup compared to the worst-case optimal method, verified by numerical simulations.

2. We make a novel connection between average-case optimal methods for optimization and average-case optimal methods for bilinear games. In particular, we show that solving the Hamiltonian using an average-case optimal method is optimal for bilinear games (Theorem 3.1). This result complements Azizian et al. (2020), who proved that the Polyak heavy-ball algorithm on the Hamiltonian is asymptotically worst-case optimal for bilinear games.

2. AVERAGE-CASE ANALYSIS FOR NORMAL MATRICES

In this paper we consider the following class of problems.

Definition 1. Let A ∈ R^{d×d} be a real matrix and x⋆ ∈ R^d a vector. The non-symmetric (affine) operator (NSO) problem is defined as:

Find x such that F(x) := A(x - x⋆) = 0. (NSO)

This problem generalizes the minimization of a convex quadratic function f, since we can cast the latter in this framework by setting the operator F = ∇f. The set of solutions is an affine subspace that we denote X⋆. We will find it convenient to consider the distance to this set, defined as

dist(x, X⋆) := min_{v ∈ X⋆} ‖x - v‖², with X⋆ = {x ∈ R^d | A(x - x⋆) = 0}. (1)

In this paper we develop average-case optimal methods. For this, we consider A and x⋆ to be random, together with a random initialization x_0. This induces a probability distribution over NSO problems, and we seek methods that have an optimal expected suboptimality w.r.t. this distribution. Denoting by E_{(A, x⋆, x_0)} the expectation over these random problems, average-case optimal methods verify the following property at each iteration t:

min_{x_t} E_{(A, x⋆, x_0)} dist(x_t, X⋆) s.t. x_i ∈ x_0 + span({F(x_j)}_{j=0}^{i-1}), for all i ∈ [1 : t]. (2)

The last condition on x_t stems from restricting the class of algorithms to first-order methods. The class of first-order methods encompasses many known schemes, such as gradient descent with momentum or full-matrix AdaGrad. However, methods such as Adam (Kingma & Ba, 2015) or diagonal AdaGrad (Duchi et al., 2011) are not in this class, as the diagonal re-scaling creates iterates x_t outside the span of previous gradients. Although we will focus on the distance to the solution, the results can be extended to other convergence criteria such as ‖F(x_t)‖². Finally, note that the expectations in this paper are over the problem instance, not over any randomness of the algorithm.
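The span condition in (2) can be illustrated with a minimal numerical sketch. The instance below (matrix, step-size, dimensions) is a hypothetical choice for illustration, not one from the paper: a fixed-step iteration only ever adds multiples of F evaluations to x_0, so it belongs to the first-order class defined by (2).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Hypothetical NSO instance: a non-symmetric matrix with eigenvalues
# clustered around 1, and a random solution x_star.
A = np.eye(d) + 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
x_star = rng.standard_normal(d)
F = lambda x: A @ (x - x_star)

# A basic first-order method: x_{t+1} = x_t - h * F(x_t). Each update
# adds a multiple of an F evaluation, so x_t stays in x_0 + span({F(x_j)}).
h = 1.0
x = np.zeros(d)  # x_0
for _ in range(200):
    x = x - h * F(x)

print(np.linalg.norm(x - x_star))  # tiny: the iterates found the root
```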

2.1. ORTHOGONAL RESIDUAL POLYNOMIALS AND FIRST-ORDER METHODS

The analysis of first-order methods simplifies through the use of polynomials. This section provides the tools required to leverage this connection.

Definition 2. A residual polynomial is a polynomial P that satisfies P(0) = 1.

Proposition 2.1. (Hestenes et al., 1952) If the sequence (x_t)_{t≥0} is generated by a first-order method, then there exist residual polynomials P_t, each of degree at most t, verifying

x_t - x⋆ = P_t(A)(x_0 - x⋆).

As we will see, optimal average-case methods are strongly related to orthogonal polynomials. We first define the inner product between polynomials, where we use z* for the complex conjugate of z ∈ C.

Definition 3. For P, Q ∈ R[X], we define the inner product ⟨•, •⟩_µ for a measure µ over C as

⟨P, Q⟩_µ := ∫_C P(λ) Q(λ)* dµ(λ). (4)

Definition 4. A sequence of polynomials {P_i} is orthogonal (resp. orthonormal) w.r.t. ⟨•, •⟩_µ if ⟨P_i, P_i⟩_µ > 0 (resp. = 1), and ⟨P_i, P_j⟩_µ = 0 if i ≠ j.
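Proposition 2.1 can be checked numerically. For the fixed-step method x_{k+1} = x_k - h F(x_k), the residual polynomial is P_t(λ) = (1 - hλ)^t; the sketch below (arbitrary dimensions and step-size, assumed for illustration) verifies the identity x_t - x⋆ = P_t(A)(x_0 - x⋆).

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, t = 8, 0.1, 5  # arbitrary illustration values

A = rng.standard_normal((d, d))
x_star = rng.standard_normal(d)
x0 = rng.standard_normal(d)
F = lambda x: A @ (x - x_star)

# Run t steps of the fixed-step method x_{k+1} = x_k - h * F(x_k).
x = x0.copy()
for _ in range(t):
    x = x - h * F(x)

# Its residual polynomial is P_t(lambda) = (1 - h*lambda)^t, so
# x_t - x_star = P_t(A) (x0 - x_star)  (Proposition 2.1).
Pt_of_A = np.linalg.matrix_power(np.eye(d) - h * A, t)
assert np.allclose(x - x_star, Pt_of_A @ (x0 - x_star))
print("residual-polynomial identity verified")
```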

2.2. EXPECTED SPECTRAL DISTRIBUTION

Following Pedregosa & Scieur (2020), we make the following assumption on the problem family.

Assumption 1. x_0 - x⋆ is independent of A, and E_{(x_0, x⋆)}[(x_0 - x⋆)(x_0 - x⋆)ᵀ] = (R²/d) I_d.

We will also require the following definitions to characterize the difficulty of a problem class. Let {λ_1, ..., λ_d} be the eigenvalues of a matrix A ∈ R^{d×d}. We define the empirical spectral distribution of A as the probability measure

μ_A(λ) := (1/d) Σ_{i=1}^d δ_{λ_i}(λ),

where δ_{λ_i} is the Dirac delta, a distribution equal to zero everywhere except at λ_i and whose integral over the entire real line is equal to one. Note that with this definition, ∫_D dμ_A(λ) corresponds to the proportion of eigenvalues in the region D. When A is a matrix-valued random variable, μ_A is a measure-valued random variable. As such, we can define its expected spectral distribution µ_A := E_A[μ_A], which by the Riesz representation theorem is the measure that verifies ∫ f dµ_A = E_A[∫ f dμ_A] for all measurable f. Surprisingly, the expected spectral distribution is the only characteristic required to design optimal algorithms in the average case.
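The expected spectral distribution can be estimated by Monte-Carlo: average the empirical eigenvalue measures over independent draws of A. The sketch below assumes a Ginibre-type ensemble (i.i.d. Gaussian entries scaled by 1/√d, an illustrative choice rather than one of the paper's ensembles) and estimates the expected mass of a disk.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_samples = 200, 50

# Average the empirical spectral measures of i.i.d. draws of A.
# For each draw, record the proportion of eigenvalues inside the
# disk of radius 0.5 centered at the origin.
proportions = []
for _ in range(n_samples):
    A = rng.standard_normal((d, d)) / np.sqrt(d)
    eigs = np.linalg.eigvals(A)
    proportions.append(np.mean(np.abs(eigs) <= 0.5))

estimate = np.mean(proportions)
print(estimate)  # the circular law predicts 0.5**2 = 0.25
```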

2.3. EXPECTED ERROR OF FIRST-ORDER METHODS

In this section we provide an expression for the expected convergence in terms of the residual polynomial and the expected spectral distribution introduced in the previous sections. To go further in the analysis, we have to assume that A is a normal matrix.

Assumption 2. The (real) random matrix A is normal, that is, it verifies AAᵀ = AᵀA.

Normality is equivalent to A admitting the spectral decomposition A = UΛU*, where U is unitary, i.e., U*U = UU* = I. We now have everything needed to write the expected error of a first-order algorithm applied to (NSO).

Theorem 2.1. Consider the application of a first-order method associated to the sequence of polynomials {P_t} (Proposition 2.1) to the problem (NSO). Let µ be the expected spectral distribution of A. Under Assumptions 1 and 2, we have

E[dist(x_t, X⋆)] = R² ∫_{C\{0}} |P_t|² dµ.

Before designing optimal algorithms for specific distributions, we compare our setting with the average-case acceleration framework for minimization problems of Pedregosa & Scieur (2020), who proposed average-case optimal optimization algorithms.

2.4. DIFFICULTIES OF FIRST-ORDER METHODS ON GAMES AND RELATED WORK

This section compares our contribution with the existing framework of average-case optimal methods for quadratic minimization problems.

Definition 5. Let H ∈ R^{d×d} be a random symmetric positive-definite matrix and x⋆ ∈ R^d a random vector. These elements determine the following random quadratic minimization problem:

min_{x∈R^d} f(x) := (1/2)(x - x⋆)ᵀ H (x - x⋆). (OPT)

As in our paper, Pedregosa & Scieur (2020) find deterministic first-order algorithms that are optimal in expectation w.r.t. the matrix H, the solution x⋆, and the initialization x_0. Since they work with problem (OPT), their problem is equivalent to (NSO) with the matrix A = H. However, they make the stronger assumption that the matrix is symmetric, which implies normality. The normality assumption is restrictive in the context of games, as the operators arising there are not always normal. However, this class is expressive enough to cover interesting cases, such as bilinear games, and our experiments show that our findings are also consistent with non-normal matrices.

Using orthogonal residual polynomials and spectral distributions, they derive an explicit formula for the expected error. Their result is similar to Theorem 2.1, the major difference being the domain of the integral: a segment of the positive real line in convex optimization, but a region of the complex plane in our case. This region plays a crucial role in the convergence rate of first-order algorithms, as depicted in the works of Azizian et al. (2020); Bollapragada et al. (2018). In the case of optimization methods, they show that average-case optimal schemes follow a simple three-term recurrence, arising from the three-term recurrence for residual orthogonal polynomials for the measure λµ(λ). Indeed, by Theorem 2.1 the optimal method corresponds to the residual polynomial minimizing ⟨P, P⟩_µ, and the following result holds:

Theorem 2.2. (Fischer, 1996, §2.4) When µ is supported on the real line, the residual polynomial of degree t minimizing ⟨P, P⟩_µ is given by the degree-t residual orthogonal polynomial w.r.t. λµ(λ).

However, the analogous result does not hold for general measures in C, and hence our arguments will instead make use of the following Theorem 2.3, which links the residual polynomial of degree at most t minimizing ⟨P, P⟩_µ to the sequence of orthonormal polynomials for µ.

Theorem 2.3. [Theorem 1.4 of Assche (1997)] Let µ be a positive Borel measure in the complex plane. The minimum of the integral ∫_C |P(λ)|² dµ(λ) over residual polynomials P of degree at most t is uniquely attained by the polynomial

P(λ) = (Σ_{k=0}^t φ_k(λ) φ_k(0)*) / (Σ_{k=0}^t |φ_k(0)|²),

with optimal value

∫_C |P(λ)|² dµ(λ) = 1 / (Σ_{k=0}^t |φ_k(0)|²),

where (φ_k)_k is the orthonormal sequence of polynomials with respect to the inner product ⟨•, •⟩_µ.

In the next sections we consider cases where the optimal scheme is identifiable.

3. AVERAGE-CASE OPTIMAL METHODS FOR BILINEAR GAMES

We consider the problem of finding a Nash equilibrium of the zero-sum minimax game given by

min_{θ_1} max_{θ_2} ℓ(θ_1, θ_2) := (θ_1 - θ_1⋆)ᵀ M (θ_2 - θ_2⋆), (9)

where θ_1, θ_1⋆ ∈ R^{d_1}, θ_2, θ_2⋆ ∈ R^{d_2}, M ∈ R^{d_1×d_2} and d := d_1 + d_2. The vector field of the game (Balduzzi et al., 2018) is defined as F(x) = A(x - x⋆), where

F(θ_1, θ_2) = [∇_{θ_1} ℓ(θ_1, θ_2); -∇_{θ_2} ℓ(θ_1, θ_2)] = [0, M; -Mᵀ, 0] ([θ_1; θ_2] - [θ_1⋆; θ_2⋆]) = A(x - x⋆). (10)

As before, X⋆ denotes the set of points x such that F(x) = 0, which coincides with the set of Nash equilibria. If M is sampled independently from x_0, x⋆ and x_0 - x⋆ has covariance (R²/d) I_d, Assumption 1 is fulfilled. Since A is skew-symmetric, it is in particular normal, and Assumption 2 is also satisfied. We now show that the optimal average-case algorithm to solve bilinear problems is Hamiltonian gradient descent with momentum, described below in its general form. Contrary to the methods in Azizian et al. (2020), the method we propose is anytime (and not only asymptotically) average-case optimal.

Optimal average-case algorithm for bilinear games (11).
Initialization: x_{-1} = x_0 = (θ_{1,0}, θ_{2,0}); sequence {h_t, m_t} given by Theorem 3.1.
Main loop: for t ≥ 0,

g_t = F(x_t - F(x_t)) - F(x_t) = ∇ (1/2)‖F(x_t)‖² (by (12)),
x_{t+1} = x_t - h_{t+1} g_t + m_{t+1} (x_{t-1} - x_t).

The quantity (1/2)‖F(x)‖² is commonly known as the Hamiltonian of the game (Balduzzi et al., 2018), hence the name Hamiltonian gradient descent. Indeed, g_t = ∇(1/2)‖F(x_t)‖² when F is affine:

F(x - F(x)) - F(x) = A(x - A(x - x⋆) - x⋆) - A(x - x⋆) = -A(A(x - x⋆)) = Aᵀ(A(x - x⋆)) = ∇ (1/2)‖A(x - x⋆)‖² = ∇ (1/2)‖F(x)‖², (12)

where we used the skew-symmetry of A (Aᵀ = -A). The following theorem shows that (11) is indeed the optimal average-case method associated to the minimization problem min_x (1/2)‖F(x)‖².

Theorem 3.1. Suppose that Assumption 1 holds and that the expected spectral distribution of M Mᵀ is absolutely continuous with respect to the Lebesgue measure. Then, the method (11) is average-case optimal for bilinear games when h_t, m_t are chosen to be the coefficients of the average-case optimal method for the minimization of (1/2)‖F(x)‖².

How to find the optimal coefficients? Since (1/2)‖F(x)‖² is a quadratic problem, the coefficients {h_t, m_t} can be found using the average-case framework for quadratic minimization problems of (Pedregosa & Scieur, 2020, Theorem 3.1).

Proof sketch. When computing the optimal polynomial x_t = P_t(A)(x_0 - x⋆), the residual orthogonal polynomial P_t behaves differently depending on whether t is even or odd.

• Case 1: t is even. In this case, the polynomial P_t(A) can be expressed as Q_{t/2}(-A²), where (Q_t)_{t≥0} is the sequence of orthogonal polynomials w.r.t. the expected spectral density of -A², whose eigenvalues are real and non-negative. This gives the recursion in (11).

• Case 2: t is odd. There is no residual orthogonal polynomial of degree t for t odd. Instead, odd iterations correspond to the intermediate computation of g_t in (11), but not to an actual iterate.
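The identity behind Hamiltonian gradient descent is easy to verify numerically for a bilinear game, where A is skew-symmetric. The dimensions below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 5, 7  # arbitrary illustration dimensions

# Bilinear game operator: A = [[0, M], [-M^T, 0]] is skew-symmetric.
M = rng.standard_normal((d1, d2))
A = np.block([[np.zeros((d1, d1)), M],
              [-M.T, np.zeros((d2, d2))]])
x_star = rng.standard_normal(d1 + d2)
F = lambda x: A @ (x - x_star)

# For affine F with skew-symmetric A:
# F(x - F(x)) - F(x) = -A A (x - x_star) = A^T A (x - x_star),
# which is the gradient of the Hamiltonian (1/2)||F(x)||^2.
x = rng.standard_normal(d1 + d2)
g = F(x - F(x)) - F(x)
grad_hamiltonian = A.T @ A @ (x - x_star)
assert np.allclose(g, grad_hamiltonian)
print("Hamiltonian gradient identity verified")
```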

3.1. PARTICULAR CASE: M WITH I.I.D. COMPONENTS

We now show the optimal method when the entries of M are sampled i.i.d. For simplicity, we order the players such that d_1 ≤ d_2.

Assumption 3. Assume that each component of M is sampled i.i.d. from a distribution with mean 0 and variance σ², and we take d_1, d_2 → ∞ with d_1/d_2 → r < 1.

In that case, the spectral distribution of (1/d_2) M Mᵀ tends to the Marchenko-Pastur law, supported on [ℓ, L] and with density

ρ_MP(λ) := √((L - λ)(λ - ℓ)) / (2πσ²rλ), where L := σ²(1 + √r)², ℓ := σ²(1 - √r)².

Proposition 3.1. When M satisfies Assumption 3, the optimal parameters of scheme (11) are

h_t = -δ_t / (σ²√r), m_t = 1 + ρδ_t, where ρ = (1 + r)/√r, δ_t = -(ρ + δ_{t-1})⁻¹, δ_0 = 0.

Proof. By Theorem 3.1, the problem reduces to finding the optimal average-case algorithm for the problem min_x (1/2)‖F(x)‖². Since the expected spectral distribution of (1/d_2) M Mᵀ is the Marchenko-Pastur law, we can use the optimal algorithm from (Pedregosa & Scieur, 2020, Section 5).
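The coefficient recursion of Proposition 3.1 is cheap to evaluate. The sketch below uses illustrative values of σ and r (chosen here, not taken from the paper), iterates δ_t, and checks that δ_t approaches the fixed point of δ = -(ρ + δ)⁻¹, i.e. a root of δ² + ρδ + 1 = 0, so the coefficients h_t, m_t stabilize, consistent with the asymptotic discussion in Section 5.

```python
import numpy as np

sigma, r = 1.0, 0.5  # illustrative values, not from the paper
rho = (1 + r) / np.sqrt(r)

delta, coeffs = 0.0, []
for t in range(1, 26):
    delta = -1.0 / (rho + delta)  # delta_t = -(rho + delta_{t-1})^{-1}
    h_t = -delta / (sigma**2 * np.sqrt(r))
    m_t = 1 + rho * delta
    coeffs.append((h_t, m_t))

# delta_t approaches a root of delta^2 + rho*delta + 1 = 0 (the fixed
# point of the recursion), so h_t and m_t converge to constants.
delta_inf = (-rho + np.sqrt(rho**2 - 4)) / 2
print(coeffs[-1], delta_inf)
```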

4. GENERAL AVERAGE-CASE OPTIMAL METHOD FOR NORMAL OPERATORS

In this section we derive general average-case optimal first-order methods for normal operators. First, we need to assume the existence of a three-term recurrence for residual orthogonal polynomials (Assumption 4). As mentioned in Subsection 2.4, for general measures in the complex plane the existence of a three-term recurrence for orthogonal polynomials is not guaranteed. Proposition B.3 in Appendix B gives a sufficient condition for its existence, and in the next subsection we show specific examples where the residual orthogonal polynomials satisfy the three-term recurrence.

Assumption 4 (Simplifying assumption). The sequence of residual polynomials {ψ_t}_{t≥0} orthogonal w.r.t. the measure µ, defined on the complex plane, admits the three-term recurrence

ψ_{-1} = 0, ψ_0 = 1, ψ_t(λ) = (a_t + b_t λ)ψ_{t-1}(λ) + (1 - a_t)ψ_{t-2}(λ).

Under Assumption 4, Theorem 4.1 shows that the optimal algorithm can be written as an average of iterates following a simple three-term recurrence.

Theorem 4.1. Under Assumption 4 and the assumptions of Theorem 2.1, the following algorithm is average-case optimal, with y_{-1} = y_0 = x_0:

y_t = a_t y_{t-1} + (1 - a_t) y_{t-2} + b_t F(y_{t-1}),
x_t = B_t/(B_t + β_t) x_{t-1} + β_t/(B_t + β_t) y_t, with β_t = |φ_t(0)|², B_t = B_{t-1} + β_{t-1}, B_0 = 0, (16)

where (φ_k(0))_{k≥0} can be computed using the three-term recurrence (upon normalization). Moreover, E_{(A,x⋆,x_0)} dist(x_t, X⋆) converges to zero at rate 1/B_t.

Remark. Notice that it is not immediate that (16) fulfills the definition of first-order algorithms stated in (2): y_t is clearly a first-order method, but x_t is an average of the iterates y_t. Using that F is an affine function, one can check that x_t indeed fulfills (2).

Remark. Assumption 4 is needed for the sequence (y_t)_{t≥0} to be computable through a three-term recurrence. However, for some distributions, the associated sequence of orthogonal polynomials may admit another recurrence that does not satisfy Assumption 4.

4.1. CIRCULAR SPECTRAL DISTRIBUTIONS

In random matrix theory, the circular law states that if A is an n×n matrix with i.i.d. entries of mean C and variance R²/n, then as n → ∞ the spectral distribution of A tends to the uniform distribution on the disk D_{C,R}. In this subsection we apply Theorem 4.1 to a class of spectral distributions specified by Assumption 5, which includes the uniform distribution on D_{C,R}. Even though random matrices with i.i.d. entries are not normal, we will see in Section 6 that the empirical results for such matrices are consistent with our theoretical results under the normality assumption.

Assumption 5. Assume that the expected spectral distribution µ_A is supported in the complex plane on the disk D_{C,R} of center C ∈ R, C > 0, and radius R < C. Moreover, assume that the spectral density is circularly symmetric, i.e., there exists a probability measure µ_R supported on [0, R] such that, in polar coordinates around C,

dµ_A(C + re^{iθ}) = (1/2π) dθ dµ_R(r), for r ∈ [0, R].

Proposition 4.1. If µ_A satisfies Assumption 5, the sequence of orthonormal polynomials is (φ_t)_{t≥0} with

φ_t(λ) = (λ - C)^t / K_{t,R}, where K_{t,R} = √(∫_0^R r^{2t} dµ_R(r)).

Example. The uniform distribution on D_{C,R} corresponds to dµ_R = (2r/R²) dr, which yields K_{t,R} = R^t / √(t + 1).

From Proposition 4.1, the sequence of residual polynomials is given by φ_t(λ)/φ_t(0) = (1 - λ/C)^t, which implies that Assumption 4 is fulfilled with a_t = 1, b_t = -1/C. Thus, by Theorem 4.1 we have:

Theorem 4.2. Given an initialization x_0 (y_0 = x_0), if Assumption 5 is fulfilled with R < C and the assumptions of Theorem 2.1 hold, then the average-case optimal first-order method is

y_t = y_{t-1} - (1/C) F(y_{t-1}), β_t = C^{2t} / K²_{t,R}, B_t = B_{t-1} + β_{t-1},
x_t = B_t/(B_t + β_t) x_{t-1} + β_t/(B_t + β_t) y_t. (18)

Moreover, E_{(A,x⋆,x_0)} dist(x_t, X⋆) converges to zero at rate 1/B_t. We now compare Theorem 4.2 with the worst-case methods studied in Azizian et al. (2020).
They give a worst-case convergence lower bound of (R/C)^{2t} on the quantity dist(z_t, X⋆) for first-order methods (z_t)_{t≥0} on matrices with eigenvalues in the disk D_{C,R}. By the classical analysis of first-order methods, this rate is achieved by gradient descent with stepsize 1/C, i.e., the iterates y_t defined in (18). However, by equation (79) in Proposition D.3, under slight additional assumptions (those of Proposition 5.2) we have

lim_{t→∞} E[dist(x_t, X⋆)] / E[dist(y_t, X⋆)] = 1 - R²/C².

That is, the average-case optimal algorithm outperforms gradient descent by a constant factor depending on the conditioning R/C.
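Theorem 4.2 can be sketched on a synthetic normal operator. A diagonal complex matrix with eigenvalues drawn uniformly from D_{C,R} is normal, and for the uniform disk K_{t,R} = R^t/√(t+1), so β_t = (C/R)^{2t}(t+1). All concrete values below (d, C, R, T) are illustration choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(4)
d, C, R, T = 500, 1.0, 0.8, 60  # illustration choices

# Normal operator with eigenvalues i.i.d. uniform in the disk D_{C,R}:
# a diagonal complex matrix suffices, since the method only queries F.
radii = R * np.sqrt(rng.uniform(size=d))
angles = 2 * np.pi * rng.uniform(size=d)
eigs = C + radii * np.exp(1j * angles)
x_star = rng.standard_normal(d).astype(complex)
F = lambda x: eigs * (x - x_star)  # F(x) = A(x - x_star), A = diag(eigs)

# Theorem 4.2 with K_{t,R} = R^t / sqrt(t+1): beta_t = (C/R)^(2t) (t+1),
# and x_t is a running weighted average of the iterates y_t.
y = np.zeros(d, dtype=complex)
x = np.zeros(d, dtype=complex)
B = 0.0
for t in range(1, T + 1):
    y = y - F(y) / C
    B += (C / R) ** (2 * (t - 1)) * t   # B_t = B_{t-1} + beta_{t-1}
    beta = (C / R) ** (2 * t) * (t + 1)
    x = (B * x + beta * y) / (B + beta)

print(np.linalg.norm(x - x_star), np.linalg.norm(y - x_star))
```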

5. ASYMPTOTIC BEHAVIOR

The recurrence coefficients of the average-case optimal method typically converge to limiting values as t → ∞, which gives an "average-case asymptotically optimal first-order method" with constant coefficients. For the case of symmetric operators with spectrum in [ℓ, L], Scieur & Pedregosa (2020) show that under mild conditions the asymptotically optimal algorithm is the Polyak momentum method, with coefficients depending only on ℓ and L. For bilinear games, since the average-case optimal algorithm is the average-case optimal algorithm of a minimization problem, we can make use of their framework to obtain the asymptotic algorithm (see Theorem 3 of Scieur & Pedregosa (2020)).

Proposition 5.1. Assume that the expected spectral density µ_{MMᵀ} of M Mᵀ is supported on [ℓ, L] for 0 < ℓ < L, and strictly positive on this interval. Then, the asymptotically optimal algorithm for bilinear games is the following version of Polyak momentum:

g_t = F(x_t - F(x_t)) - F(x_t),
x_{t+1} = x_t + ((√L - √ℓ)/(√L + √ℓ))² (x_{t-1} - x_t) - (2/(√L + √ℓ))² g_t. (19)

Notice that the algorithm in (19) is the worst-case optimal algorithm from Proposition 4 of Azizian et al. (2020). For the case of circularly symmetric spectral densities with support on disks, we can also compute the asymptotically optimal algorithm.

Proposition 5.2. Suppose that the assumptions of Theorem 4.2 hold, with µ_R ∈ P([0, R]) fulfilling µ_R([r, R]) = Ω((R - r)^κ) for r ∈ [r_0, R], for some r_0 ∈ [0, R) and some κ ∈ Z. Then, the average-case asymptotically optimal algorithm is, with y_0 = x_0:

y_t = y_{t-1} - (1/C) F(y_{t-1}),
x_t = (R/C)² x_{t-1} + (1 - (R/C)²) y_t. (20)

Moreover, the convergence rate for this algorithm is asymptotically the same as that of the optimal algorithm in Theorem 4.2. Namely, lim_{t→∞} E[dist(x_t, X⋆)] B_t = 1.

The condition on µ_R simply rules out cases in which the spectral density has exponentially small mass near the edge of its support. It is remarkable that in algorithm (20) the averaging coefficients can be expressed so simply in terms of the quantity R/C. Notice also that while the convergence rate of this algorithm is by definition slower than that of the optimal algorithm, the two rates match in the limit, meaning that the asymptotically optimal algorithm also outperforms gradient descent by the constant factor 1 - R²/C² in the limit t → ∞.
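The constant-coefficient scheme of Proposition 5.2 amounts to gradient descent with stepsize 1/C followed by an exponential moving average with weight (R/C)². Below is a sketch on a synthetic diagonal normal operator with eigenvalues uniform in D_{C,R}; all concrete values are illustration choices.

```python
import numpy as np

rng = np.random.default_rng(5)
d, C, R, T = 500, 1.0, 0.8, 80  # illustration choices

# Diagonal normal operator with eigenvalues uniform in D_{C,R}.
eigs = C + R * np.sqrt(rng.uniform(size=d)) * np.exp(2j * np.pi * rng.uniform(size=d))
x_star = rng.standard_normal(d).astype(complex)
F = lambda x: eigs * (x - x_star)

# Gradient descent with stepsize 1/C, plus an exponential moving
# average with the constant weight (R/C)^2.
w = (R / C) ** 2
y = np.zeros(d, dtype=complex)
x = np.zeros(d, dtype=complex)
for _ in range(T):
    y = y - F(y) / C
    x = w * x + (1 - w) * y

print(np.linalg.norm(x - x_star), np.linalg.norm(y - x_star))
```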

6. EXPERIMENTS

We compare some of the proposed methods in settings with varying degrees of mismatch with our assumptions.

Bilinear games. We consider min-max bilinear problems of the form (10), where the entries of M are generated i.i.d. from a standard Gaussian distribution. We vary the ratio parameter r for d = 1000 and compare the average-case optimal methods of Theorem 3.1 and Proposition 5.1, the asymptotic worst-case optimal method of Azizian et al. (2020), and extragradient (Korpelevich, 1976). In all cases, we use the convergence-rate optimal step-size, assuming knowledge of the edges of the spectral distribution. The spectral density for these problems is displayed in the first row of Figure 1, and the benchmark results in the second row. Average-case optimal methods always outperform the other methods, and the largest gain occurs in the ill-conditioned regime (r ≈ 1).

Circular distribution. For our second experiment we choose A as a matrix with i.i.d. Gaussian entries, so that the support of the distribution of its eigenvalues is a disk. Note that A does not satisfy the normality requirement of Assumption 2. Figure 1 (third row) compares the average-case optimal methods from Theorem 4.2 and Proposition 5.2 on two datasets with different levels of conditioning. Note that the methods converge despite the violation of Assumption 2, suggesting a broader applicability than the one proven in this paper. We leave this investigation for future work.

7. DISCUSSION AND FUTURE RESEARCH DIRECTIONS

In this paper, we presented a general framework for the design of average-case optimal algorithms for affine operators F whose underlying matrix is possibly non-symmetric. However, our approach presents some limitations, the major one being the restriction to normal matrices. Fortunately, the numerical experiments above suggest that this assumption can be relaxed; developing a theory without it is left for future work. Another avenue for future work is to analyze the nonlinear case, in which the operator F is non-linear, as well as the case in which it is accessed through a stochastic estimator (as done by Loizou et al. (2020) for the worst-case analysis).

Figure 1: First row: limiting and empirical eigenvalue densities for r = 0.9 and r = 0.95. Second row: benchmarks; average-case optimal methods always outperform the other methods, and the largest gain is in the ill-conditioned regime (r ≈ 1). Third row: benchmarks (columns 1 and 3) and eigenvalue distribution of a design matrix generated with i.i.d. entries, for two different degrees of conditioning. Despite the normality assumption not being satisfied, we still observe an improvement of average-case optimal methods over worst-case optimal ones.

A PROOF OF THEOREM 2.1

A.1 PRELIMINARIES

Before proving Theorem 2.1, we briefly analyze the distance function (1), recalled below:

dist(x, X⋆) := min_{v∈X⋆} ‖x - v‖².

This definition is not practical for the theoretical analysis. Fortunately, it is possible to find a simpler expression that uses the orthogonal projection matrix Π onto the kernel Ker(A). Since Π is an orthogonal projection matrix onto the kernel of a linear transformation, it satisfies

Π = Πᵀ, Π² = Π, and AΠ = 0. (21)

The normality assumption on A also implies that

ΠA = 0. (22)

Indeed, the spectral decomposition of A reads A = [U_1 | U_2] diag(Λ, 0) [U_1 | U_2]*, and then Π = U_2 U_2*.

The next proposition uses Π to derive an explicit expression for (1).

Proposition A.1. For any x⋆ ∈ X⋆, we have dist(y, X⋆) = ‖(I - Π)(y - x⋆)‖².

Proof. We first parametrize the set of solutions X⋆. By definition, X⋆ = {x : A(x - x⋆) = 0}, which can be written in terms of the kernel of A as X⋆ = {x⋆ + Πw : w ∈ R^d}. From this, we can rewrite the distance function (1) as dist(y, X⋆) = min_{w∈R^d} ‖y - (x⋆ + Πw)‖². The minimum may be attained at several points, in particular at w = y - x⋆, which proves the statement.

We now simplify the result of the previous proposition further in the case where x_t is generated by a first-order method.

Proposition A.2. For every iterate x_t of a first-order method, i.e., x_t satisfying x_t - x⋆ = P_t(A)(x_0 - x⋆) with deg(P_t) ≤ t and P_t(0) = 1, we have

dist(x_t, X⋆) = ‖x_t - x⋆‖² - ‖Π(x_0 - x⋆)‖².

Proof. We start with the result of Proposition A.1: dist(x_t, X⋆) = ‖(I - Π)(x_t - x⋆)‖². The norm can be split as

‖(I - Π)(x_t - x⋆)‖² = ‖x_t - x⋆‖² + ‖Π(x_t - x⋆)‖² - 2‖Π(x_t - x⋆)‖² = ‖x_t - x⋆‖² - ‖Π(x_t - x⋆)‖²,

where we used ΠᵀΠ = Π² = Π (by (21)). Since x_t is generated by a first-order method, we have x_t - x⋆ = P_t(A)(x_0 - x⋆) with P_t(0) = 1. Since P_t(0) = 1, the polynomial can be factorized as P_t(A) = I + A Q_{t-1}(A), with Q_{t-1} a polynomial of degree t - 1. Therefore, ‖Π(x_t - x⋆)‖² reads

‖Π(x_t - x⋆)‖² = ‖Π(I + A Q_{t-1}(A))(x_0 - x⋆)‖² = ‖Π(x_0 - x⋆) + ΠA Q_{t-1}(A)(x_0 - x⋆)‖² = ‖Π(x_0 - x⋆)‖²,

where ΠA = 0 by (22). This proves the statement.

A.2 PROOF OF THE THEOREM

We are now ready to prove the main result.

Theorem 2.1. Consider the application of a first-order method associated to the sequence of polynomials {P_t} (Proposition 2.1) to the problem (NSO). Let µ be the expected spectral distribution of A. Under Assumptions 1 and 2, we have E[dist(x_t, X⋆)] = R² ∫_{C\{0}} |P_t|² dµ.

Proof. We start with the result of Proposition A.2, dist(x_t, X⋆) = ‖x_t - x⋆‖² - ‖Π(x_0 - x⋆)‖², and take expectations:

E[dist(x_t, X⋆)]
= E[‖P_t(A)(x_0 - x⋆)‖²] - E[‖Π(x_0 - x⋆)‖²]
= E[tr(P_t(A) P_t(A)ᵀ (x_0 - x⋆)(x_0 - x⋆)ᵀ)] - E[tr(Π² (x_0 - x⋆)(x_0 - x⋆)ᵀ)]
= E_A[tr(P_t(A) P_t(A)ᵀ E[(x_0 - x⋆)(x_0 - x⋆)ᵀ | A])] - E_A[tr(Π E[(x_0 - x⋆)(x_0 - x⋆)ᵀ | A])]
= (R²/d) E_A[tr(P_t(A) P_t(A)ᵀ)] - (R²/d) E_A[tr Π]   (by Assumption 1)
= (R²/d) E_A[Σ_{i=1}^d |P_t(λ_i)|²] - (R²/d) E_A[tr Π]   (by normality of A)
= R² E_A[∫_{C\{0}} |P_t(λ)|² dμ_A(λ) + |P_t(0)|² (# zero eigenvalues)/d] - (R²/d) E_A[tr Π].

Since |P_t(0)|² = 1 and tr Π equals the number of zero eigenvalues of A, the last two terms cancel, and therefore

E[dist(x_t, X⋆)] = R² E_A[∫_{C\{0}} |P_t(λ)|² dμ_A(λ)] = R² ∫_{C\{0}} |P_t(λ)|² dµ(λ).

B PROOFS OF THEOREM 3.1 AND PROPOSITION 3.1

Proposition B.1 (Block determinant formula). If A, B, C, D are (not necessarily square) matrices of compatible sizes and D is invertible, then

det [A, B; C, D] = det(D) det(A - B D⁻¹ C).

Definition 6 (Pushforward of a measure). Recall that the pushforward f_*µ of a measure µ by a function f is defined as the measure such that for all measurable g,

∫ g(λ) d(f_*µ)(λ) = ∫ g(f(λ)) dµ(λ). (24)

Equivalently, if X is a random variable with distribution µ, then f(X) has distribution f_*µ.

Proposition B.2. Assume that the dimensions of M ∈ R^{d_x×d_y} fulfill d_x ≤ d_y, and let r = d_x/d_y. Let µ_{MMᵀ} be the expected spectral distribution of the random matrix M Mᵀ ∈ R^{d_x×d_x}, and assume that it is absolutely continuous with respect to the Lebesgue measure. Then the expected spectral distribution of A is supported on the imaginary axis and is given, for λ ∈ R, by

µ_A(iλ) = (1 - 2/(1 + 1/r)) δ_0(λ) + (2|λ|/(1 + 1/r)) µ_{MMᵀ}(λ²). (25)
If d_x ≥ d_y, then (25) holds with µ_{MᵀM} in place of µ_{MMᵀ} and 1/r in place of r.

Proof. By the block determinant formula, we have that for s ≠ 0,

det(sI_{d_1+d_2} - A) = det [sI_{d_1}, -M; Mᵀ, sI_{d_2}] = det(sI_{d_2}) det(sI_{d_1} + M s⁻¹ I_{d_2} Mᵀ) = s^{d_2-d_1} det(s² I_{d_1} + M Mᵀ).

Thus, for every eigenvalue -λ ≤ 0 of -M Mᵀ, both i√λ and -i√λ are eigenvalues of A. Since rank(M Mᵀ) = rank(M), we have rank(A) = 2 rank(M). Thus, the rest of the eigenvalues of A are 0, and there is a total of d - 2d_1 = d_2 - d_1 of them. Notice that

d_1/(d_1 + d_2) = 1/((d_1 + d_2)/d_1) = 1/(1 + 1/r). (27)

Let f_+(λ) = i√λ and f_-(λ) = -i√λ, and let (f_+)_* µ_{MMᵀ} (resp. (f_-)_* µ_{MMᵀ}) be the pushforward measure of µ_{MMᵀ} by the function f_+ (resp. f_-). Thus, by the definition of the pushforward measure (Definition 6),

µ_A(iλ) = (1 - 2/(1 + 1/r)) δ_0(λ) + (1/(1 + 1/r)) (f_+)_* µ_{MMᵀ}(λ) + (1/(1 + 1/r)) (f_-)_* µ_{MMᵀ}(λ).

We compute the pushforwards (f_+)_* µ_{MMᵀ}, (f_-)_* µ_{MMᵀ} by performing the change of variables y = ±i√λ, under the assumption that µ_{MMᵀ}(λ) = ρ_{MMᵀ}(λ) dλ:

∫_{R≥0} g(±i√λ) dµ_{MMᵀ}(λ) = ∫_{R≥0} g(±i√λ) ρ_{MMᵀ}(λ) dλ = ∫_{±iR≥0} g(y) ρ_{MMᵀ}(|y|²) 2|y| d|y|,

which means that the density of (f_+)_* µ_{MMᵀ} at y ∈ iR_{≥0} is 2|y| ρ_{MMᵀ}(|y|²), and the density of (f_-)_* µ_{MMᵀ} at y ∈ -iR_{≥0} is also 2|y| ρ_{MMᵀ}(|y|²).

Proposition B.3. The condition

for all polynomials P, Q: ⟨P(λ), λQ(λ)⟩ = 0 ⟹ ⟨λP(λ), Q(λ)⟩ = 0 (30)

is sufficient for any sequence (P_k)_{k≥0} of orthogonal polynomials of increasing degrees to satisfy a three-term recurrence of the form

γ_k P_k(λ) = (λ - α_k) P_{k-1}(λ) - β_k P_{k-2}(λ), (31)

where

γ_k = ⟨λP_{k-1}(λ), P_k(λ)⟩ / ⟨P_k(λ), P_k(λ)⟩, α_k = ⟨λP_{k-1}(λ), P_{k-1}(λ)⟩ / ⟨P_{k-1}(λ), P_{k-1}(λ)⟩, β_k = ⟨λP_{k-1}(λ), P_{k-2}(λ)⟩ / ⟨P_{k-2}(λ), P_{k-2}(λ)⟩. (32)

Proof.
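The eigenvalue correspondence established by this determinant computation can be checked numerically on a small instance (the dimensions below are arbitrary): the spectrum of A consists of ±i√λ for each eigenvalue λ of M Mᵀ, plus d_2 - d_1 zeros.

```python
import numpy as np

rng = np.random.default_rng(6)
d1, d2 = 4, 6  # hypothetical small dimensions, d1 <= d2

M = rng.standard_normal((d1, d2))
A = np.block([[np.zeros((d1, d1)), M],
              [-M.T, np.zeros((d2, d2))]])

# Predicted spectrum: +/- i*sqrt(lambda) for each eigenvalue lambda
# of M M^T, plus d2 - d1 zeros.
mm_eigs = np.clip(np.linalg.eigvalsh(M @ M.T), 0.0, None)
predicted_imag = np.concatenate([np.sqrt(mm_eigs),
                                 -np.sqrt(mm_eigs),
                                 np.zeros(d2 - d1)])
actual = np.linalg.eigvals(A)

# A is real skew-symmetric, so its eigenvalues are purely imaginary.
assert np.allclose(actual.real, 0.0, atol=1e-10)
assert np.allclose(np.sort(actual.imag), np.sort(predicted_imag), atol=1e-10)
print("spectrum of the game operator matches the prediction")
```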
Since λP_{k-1}(λ) is a polynomial of degree k, and (P_j)_{0≤j≤k} is a basis of the polynomials of degree up to k, we can write

λP_{k-1}(λ) = Σ_{j=0}^k (⟨λP_{k-1}, P_j⟩ / ⟨P_j, P_j⟩) P_j(λ). (33)

Now, remark that for all j < k - 2, ⟨P_{k-1}, λP_j⟩ = 0, because λP_j is a polynomial of degree at most k - 2 and P_{k-1} is orthogonal to all such polynomials. Using condition (30), this implies that ⟨λP_{k-1}, P_j⟩ = 0 for all j < k - 2. Plugging this into (33), we obtain (31).

Proposition B.4. Let Π^R_t be the set of polynomials with real coefficients and degree at most t. For t ≥ 0 even, the minimum of the problem

min_{P_t ∈ Π^R_t, P_t(0)=1} ∫_{iR\{0}} |P_t(λ)|² |λ| ρ_{MMᵀ}(|λ|²) d|λ| (34)

is attained by an even polynomial with real coefficients.

Proof. Since dµ(iλ) := |λ| ρ_{MMᵀ}(|λ|²) d|λ| is supported on the imaginary axis and is symmetric with respect to 0, for all polynomials P, Q,

⟨λP(λ), Q(λ)⟩ = ∫_{iR} λ P(λ) Q(λ)* dµ(λ) = -∫_{iR} P(λ) λ* Q(λ)* dµ(λ) = -⟨P(λ), λQ(λ)⟩. (35)

Hence, ⟨P(λ), λQ(λ)⟩ = 0 implies ⟨λP(λ), Q(λ)⟩ = 0. By Proposition B.3, the three-term recurrence (31)-(32) holds for the orthonormal sequence (φ_t)_{t≥0} of polynomials. By Proposition B.5, the orthonormal polynomials (φ_t)_{t≥0} of even (resp. odd) degree are even (resp. odd) and have real coefficients. Hence, for all even t ≥ 0,

(Σ_{k=0}^t φ_k(λ) φ_k(0)*) / (Σ_{k=0}^t |φ_k(0)|²) = (Σ_{k=0}^{t/2} φ_{2k}(λ) φ_{2k}(0)*) / (Σ_{k=0}^{t/2} |φ_{2k}(0)|²) (36)

is an even polynomial with real coefficients. By Theorem 2.3, this polynomial attains the minimum of the problem

min_{P_t ∈ Π^C_t, P_t(0)=1} ∫_{iR\{0}} |P_t(λ)|² |λ| ρ_{MMᵀ}(|λ|²) d|λ| (37)

and, a fortiori, the minimum of the problem in (34), in which the minimization is restricted to polynomials with real coefficients instead of complex coefficients.

Proposition B.5. The polynomials (φ_t)_{t≥0} of the orthonormal sequence corresponding to the measure µ(iλ) = |λ| ρ_{MMᵀ}(|λ|²) d|λ| have real coefficients and are even (resp. odd) for even (resp. odd) t.

Proof. The proof is by induction. The base case follows from the choice φ_0 = 1.
Assuming that $\varphi_{k-1} \in \mathbb{R}[X]$ by the induction hypothesis, we show that $\alpha_k = 0$ (where $\alpha_k$ is the coefficient from (31) and (32)):
$$\langle \lambda\varphi_{k-1}(\lambda), \varphi_{k-1}(\lambda)\rangle = \int_{i\mathbb{R}} \lambda\, |\varphi_{k-1}(\lambda)|^2\, |\lambda|\,\rho_{M^\top M}(|\lambda|^2)\, d|\lambda| = \int_{\mathbb{R}_{\ge 0}} i\lambda\big(|\varphi_{k-1}(i\lambda)|^2 - |\varphi_{k-1}(-i\lambda)|^2\big)\, \lambda\,\rho_{M^\top M}(\lambda^2)\, d\lambda = 0.$$
The last equality follows from $|\varphi_{k-1}(i\lambda)|^2 = |\varphi_{k-1}(-i\lambda)|^2$, which holds because $\varphi_{k-1}(i\lambda)^* = \varphi_{k-1}(-i\lambda)$, and in turn this is true because $\varphi_{k-1} \in \mathbb{R}[X]$ by the induction hypothesis. Once we have seen that $\alpha_k = 0$, it is straightforward to apply the induction hypothesis once again to show that $\varphi_k$ also satisfies the even/odd property. Namely, for $k$ even (resp. odd), $\gamma_k \varphi_k = \lambda\varphi_{k-1} - \beta_k\varphi_{k-2}$, and the two polynomials on the right-hand side are both even (resp. odd). Finally, $\varphi_k$ must have real coefficients because $\varphi_{k-1}$ and $\varphi_{k-2}$ have real coefficients by the induction hypothesis, and the recurrence coefficient $\beta_k$ is real, as
$$\langle \lambda\varphi_{k-1}(\lambda), \varphi_{k-2}(\lambda)\rangle = \int_{i\mathbb{R}} \lambda\,\varphi_{k-1}(\lambda)\varphi_{k-2}(\lambda)^*\, |\lambda|\,\rho_{M^\top M}(|\lambda|^2)\, d|\lambda| = \int_{\mathbb{R}_{\ge 0}} i\lambda\big(\varphi_{k-1}(i\lambda)\varphi_{k-2}(i\lambda)^* - \varphi_{k-1}(i\lambda)^*\varphi_{k-2}(i\lambda)\big)\, \lambda\,\rho_{M^\top M}(\lambda^2)\, d\lambda$$
$$= -\int_{\mathbb{R}_{\ge 0}} 2\lambda\,\mathrm{Im}\big(\varphi_{k-1}(i\lambda)\varphi_{k-2}(i\lambda)^*\big)\, \lambda\,\rho_{M^\top M}(\lambda^2)\, d\lambda \ \in \mathbb{R}.$$

Proposition B.6. Let $t \ge 0$ be even. Assume that on $\mathbb{R}_{>0}$, the expected spectral measure $\mu_{M^\top M}$ has Radon-Nikodym derivative $\rho_{M^\top M}$ with respect to the Lebesgue measure. If
$$Q_{t/2} \stackrel{\text{def}}{=} \arg\min_{P_{t/2} \in \Pi^{\mathbb{R}}_{t/2},\, P_{t/2}(0)=1}\ \int_{\mathbb{R}_{>0}} P_{t/2}(\lambda)^2\, d\mu_{-A^2}(\lambda) \qquad (40)$$
and
$$P_t \stackrel{\text{def}}{=} \arg\min_{P_t \in \Pi^{\mathbb{R}}_t,\, P_t(0)=1}\ \int_{i\mathbb{R}\setminus\{0\}} |P_t(\lambda)|^2\, |\lambda|\,\rho_{M^\top M}(|\lambda|^2)\, d|\lambda|, \qquad (41)$$
then $P_t(\lambda) = Q_{t/2}(-\lambda^2)$.

Proof. First, remark that the equalities in (40) and (41) are well defined because the arg min are unique by Theorem 2.3. Without loss of generality, assume that $d_x \le d_y$ (otherwise switch the players), and let $r \stackrel{\text{def}}{=} d_x/d_y \le 1$. Since
$$-A^2 = \begin{pmatrix} M^\top M & 0 \\ 0 & M M^\top \end{pmatrix},$$
each eigenvalue of $M^\top M \in \mathbb{R}^{d_x \times d_x}$ is an eigenvalue of $-A^2$ with doubled multiplicity, and the rest of the eigenvalues are zero. Hence, we have
$$\mu_{-A^2} = \Big(1 - \frac{2}{1+\frac{1}{r}}\Big)\delta_0 + \frac{2}{1+\frac{1}{r}}\,\mu_{M^\top M}.$$
Thus, for all $t \ge 0$,
$$Q_t = \arg\min_{P_t \in \Pi^{\mathbb{R}}_t,\, P_t(0)=1}\ \int_{\mathbb{R}_{>0}} P_t(\lambda)^2\, d\mu_{-A^2}(\lambda) = \arg\min_{P_t \in \Pi^{\mathbb{R}}_t,\, P_t(0)=1}\ \int_{\mathbb{R}_{>0}} P_t(\lambda)^2\, \rho_{M^\top M}(\lambda)\, d\lambda. \qquad (43)$$
By Proposition B.4, for an even $t \ge 0$ the minimum in (41) is attained by an even polynomial with real coefficients, which can be written as $P_{t/2}(\lambda^2)$ for some $P_{t/2} \in \Pi^{\mathbb{R}}_{t/2}$. Hence,
$$\min_{P_t \in \Pi^{\mathbb{R}}_t,\, P_t(0)=1}\ \int_{i\mathbb{R}\setminus\{0\}} |P_t(\lambda)|^2\, |\lambda|\,\rho_{M^\top M}(|\lambda|^2)\, d|\lambda| = \min_{P_{t/2} \in \Pi^{\mathbb{R}}_{t/2},\, P_{t/2}(0)=1}\ \int_{i\mathbb{R}\setminus\{0\}} |P_{t/2}(\lambda^2)|^2\, |\lambda|\,\rho_{M^\top M}(|\lambda|^2)\, d|\lambda|$$
$$= 2\min_{P_{t/2} \in \Pi^{\mathbb{R}}_{t/2},\, P_{t/2}(0)=1}\ \int_{\mathbb{R}_{>0}} |P_{t/2}((i\lambda)^2)|^2\, \lambda\,\rho_{M^\top M}(\lambda^2)\, d\lambda = 2\min_{P_{t/2} \in \Pi^{\mathbb{R}}_{t/2},\, P_{t/2}(0)=1}\ \int_{\mathbb{R}_{>0}} P_{t/2}(-\lambda^2)^2\, \lambda\,\rho_{M^\top M}(\lambda^2)\, d\lambda$$
$$= \min_{P_{t/2} \in \Pi^{\mathbb{R}}_{t/2},\, P_{t/2}(0)=1}\ \int_{\mathbb{R}_{>0}} P_{t/2}(\lambda)^2\, \rho_{M^\top M}(\lambda)\, d\lambda, \qquad (44)$$
where the last equality uses the change of variables $\lambda^2 \mapsto \lambda$ and the fact that $P_{t/2}(-\,\cdot\,)$ ranges over the same constraint set as $P_{t/2}$. Moreover, for any polynomial $Q_{t/2}$ that attains the minimum of the right-most term, the polynomial $Q_{t/2}(-\lambda^2)$ attains the minimum of the left-most term. In particular, using (43), $P_t(\lambda) \stackrel{\text{def}}{=} Q_{t/2}(-\lambda^2)$ attains the minimum of the left-most term.

Theorem 3.1. Suppose that Assumption 1 holds and that the expected spectral distribution of $M^\top M$ is absolutely continuous with respect to the Lebesgue measure. Then, the method (11) is average-case optimal for bilinear games when $h_t, m_t$ are chosen to be the coefficients of the average-case optimal method for the minimization of $\frac{1}{2}\|F(x)\|^2$.

Proof. Making use of Theorem 2.1 and Proposition B.2, we obtain that for any first-order method using the vector field $F$,
$$\mathbb{E}[\mathrm{dist}(x_t, X^\star)] = R^2 \int_{\mathbb{C}\setminus\{0\}} |P_t(\lambda)|^2\, d\mu_A(\lambda) = \frac{2R^2}{1+\frac{1}{r}} \int_{i\mathbb{R}\setminus\{0\}} |P_t(\lambda)|^2\, |\lambda|\,\rho_{M^\top M}(|\lambda|^2)\, d|\lambda|.$$
Let $Q_{t/2}, P_t$ be as defined in (40) and (41). For $t \ge 0$ even, iteration $t$ of the average-case optimal method for the bilinear game must satisfy
$$x_t - P_{X^\star}(x_0) = P_t(A)\big(x_0 - P_{X^\star}(x_0)\big) = Q_{t/2}(-A^2)\big(x_0 - P_{X^\star}(x_0)\big). \qquad (46)$$
On the other hand, the first-order methods for the minimization of the function $\frac{1}{2}\|F(x)\|^2$ make use of the vector field
$$\nabla \tfrac{1}{2}\|F(x)\|^2 = A^\top(Ax + b) = -A^2(x - x^\star).$$
Let $\mu_{-A^2}$ be the spectral density of $-A^2$.
By Theorem 2.1, the average-case optimal first-order method for the minimization problem is the one for which the residual polynomial $P_t$ (Proposition 2.1) minimizes the functional $\int_{\mathbb{R}} P_t^2\, d\mu_{-A^2}$. That is, the residual polynomial is $Q_t$. From (46), we see that the $t$-th iterate of the average-case optimal method for $F$ is equal to the $t/2$-th iterate of the average-case optimal method for $\nabla \frac{1}{2}\|F(x)\|^2$.

C PROOFS OF THEOREM 4.1 AND THEOREM 4.2

Theorem 4.1. Under Assumption 4 and the assumptions of Theorem 2.1, the following algorithm is optimal in the average case, with $y_{-1} = y_0 = x_0$:
$$y_t = a_t y_{t-1} + (1 - a_t)y_{t-2} + b_t F(y_{t-1}),$$
$$x_t = \frac{B_t}{B_t + \beta_t}\, x_{t-1} + \frac{\beta_t}{B_t + \beta_t}\, y_t, \qquad \beta_t = |\varphi_t(0)|^2, \quad B_t = B_{t-1} + \beta_{t-1}, \quad B_0 = 0, \qquad (16)$$
where $(\varphi_k(0))_{k\ge 0}$ can be computed using the three-term recurrence (upon normalization). Moreover, $\mathbb{E}_{(A,x^\star,x_0)}\,\mathrm{dist}(x_t, X^\star)$ converges to zero at rate $1/B_t$.

Proof. We prove by induction that
$$x_t - x^\star = \frac{\sum_{k=0}^{t} \varphi_k(A)\varphi_k(0)^*}{\sum_{k=0}^{t} |\varphi_k(0)|^2}\,(x_0 - x^\star). \qquad (47)$$
The base step $t = 0$ holds trivially because $\varphi_0 = 1$. Assume that (47) holds for $t-1$. Subtracting $x^\star$ from (16), we have
$$x_t - x^\star = \frac{\sum_{k=0}^{t-1} |\varphi_k(0)|^2}{\sum_{k=0}^{t} |\varphi_k(0)|^2}\,(x_{t-1} - x^\star) + \frac{|\varphi_t(0)|^2}{\sum_{k=0}^{t} |\varphi_k(0)|^2}\,(y_t - x^\star). \qquad (48)$$
If
$$|\varphi_t(0)|^2\,(y_t - x^\star) = \varphi_t(0)^*\varphi_t(A)\,(x_0 - x^\star), \qquad (49)$$
then by the induction hypothesis for $t-1$ and (48), we have
$$x_t - x^\star = \frac{\sum_{k=0}^{t-1} \varphi_k(A)\varphi_k(0)^*}{\sum_{k=0}^{t} |\varphi_k(0)|^2}\,(x_0 - x^\star) + \frac{\varphi_t(0)^*\varphi_t(A)}{\sum_{k=0}^{t} |\varphi_k(0)|^2}\,(x_0 - x^\star) = \frac{\sum_{k=0}^{t} \varphi_k(A)\varphi_k(0)^*}{\sum_{k=0}^{t} |\varphi_k(0)|^2}\,(x_0 - x^\star),$$
which concludes the proof of (47). The only thing left is to show (49), again by induction. The base case follows readily from $y_0 = x_0$ in (16). Dividing by $|\varphi_t(0)|^2$, we rewrite (49) as
$$y_t - x^\star = \frac{\varphi_t(A)}{\varphi_t(0)}\,(x_0 - x^\star) = \psi_t(A)(x_0 - x^\star), \qquad (52)$$
where $\psi_t$ is the $t$-th residual polynomial of the orthogonal sequence. By Assumption 4, $\psi_t$ must satisfy the recurrence in (15).
If we subtract $x^\star$ from the $y$-update in (16), apply the induction hypothesis and then the recurrence in (15), we obtain
$$y_t - x^\star = a_t(y_{t-1} - x^\star) + (1 - a_t)(y_{t-2} - x^\star) + b_t F(y_{t-1}) = a_t(y_{t-1} - x^\star) + (1 - a_t)(y_{t-2} - x^\star) + b_t A(y_{t-1} - x^\star)$$
$$= a_t \psi_{t-1}(A)(x_0 - x^\star) + (1 - a_t)\psi_{t-2}(A)(x_0 - x^\star) + b_t A\psi_{t-1}(A)(x_0 - x^\star) = \psi_t(A)(x_0 - x^\star),$$
thus concluding the proof of (49).

Proposition C.1. Suppose that Assumption 5 holds with $C = 0$, that is, the circular support of $\mu$ is centered at $0$. Then, the basis of orthonormal polynomials for the scalar product $\langle P, Q\rangle = \int_{D_{0,R}} P(\lambda)Q(\lambda)^*\, d\mu(\lambda)$ is
$$\varphi_k(\lambda) = \frac{\lambda^k}{K_{k,R}}, \quad \forall k \ge 0, \qquad \text{where } K_{k,R}^2 = 2\pi \int_0^R r^{2k}\, d\mu_R(r).$$

Proof. First, we show that if $\mu$ satisfies Assumption 5 with $C = 0$, then $\langle \lambda^j, \lambda^k\rangle = 0$ for $j, k \ge 0$ with $j \neq k$ (without loss of generality, suppose that $j > k$):
$$\langle \lambda^j, \lambda^k\rangle = \int_{D_{0,R}} \lambda^j(\lambda^*)^k\, d\mu(\lambda) = \int_{D_{0,R}} \lambda^{j-k}|\lambda|^{2k}\, d\mu(\lambda) = \int_0^R \int_0^{2\pi} (re^{i\theta})^{j-k}\, r^{2k}\, d\theta\, d\mu_R(r)$$
$$= \int_0^{2\pi} e^{i\theta(j-k)}\, d\theta\ \int_0^R r^{j+k}\, d\mu_R(r) = \frac{e^{2\pi i(j-k)} - 1}{i(j-k)}\ \int_0^R r^{j+k}\, d\mu_R(r) = 0. \qquad (54)$$
And for all $k \ge 0$,
$$\langle \lambda^k, \lambda^k\rangle = \int_{D_{0,R}} |\lambda|^{2k}\, d\mu(\lambda) = \int_0^R \int_0^{2\pi} r^{2k}\, d\theta\, d\mu_R(r) = 2\pi \int_0^R r^{2k}\, d\mu_R(r) = K_{k,R}^2.$$

Proposition 4.1. If $\mu$ satisfies Assumption 5, the sequence of orthonormal polynomials is $(\varphi_t)_{t\ge 0}$,
$$\varphi_t(\lambda) = \frac{(\lambda - C)^t}{K_{t,R}}, \qquad \text{where } K_{t,R} = \bigg(2\pi \int_0^R r^{2t}\, d\mu_R(r)\bigg)^{1/2}.$$

Proof. The result follows from Proposition C.1 using the change of variables $z \mapsto z + C$. To compute the measure $\mu_R$ for the uniform measure on $D_{C,R}$, we perform a change of variables to polar coordinates:
$$\int_{D_{C,R}} f(\lambda)\, d\mu(\lambda) = \frac{1}{\pi R^2}\int_0^R \int_0^{2\pi} f(C + re^{i\theta})\, r\, d\theta\, dr = \int_0^R \int_0^{2\pi} f(C + re^{i\theta})\, d\theta\, d\mu_R(r) \implies d\mu_R(r) = \frac{r}{\pi R^2}\, dr. \qquad (56)$$
And
$$K_{t,R}^2 = 2\pi \int_0^R r^{2t}\, d\mu_R(r) = \frac{2}{R^2}\int_0^R r^{2t+1}\, dr = \frac{R^{2t}}{t+1} \implies K_{t,R} = \frac{R^t}{\sqrt{t+1}}. \qquad (57)$$

Theorem 4.2.
Given an initialization $x_0$ (and $y_0 = x_0$), if Assumption 5 is fulfilled with $R < C$ and the assumptions of Theorem 2.1 hold, then the average-case optimal first-order method is
$$y_t = y_{t-1} - \frac{1}{C}F(y_{t-1}), \qquad \beta_t = \frac{C^{2t}}{K_{t,R}^2}, \quad B_t = B_{t-1} + \beta_{t-1}, \qquad x_t = \frac{B_t}{B_t + \beta_t}\, x_{t-1} + \frac{\beta_t}{B_t + \beta_t}\, y_t. \qquad (18)$$
Moreover, $\mathbb{E}_{(A,x^\star,x_0)}\,\mathrm{dist}(x_t, X^\star)$ converges to zero at rate $1/B_t$.

Proof. By Proposition 4.1, the sequence of residual orthogonal polynomials is given by
$$\psi_t(\lambda) = \frac{\varphi_t(\lambda)}{\varphi_t(0)} = \Big(1 - \frac{\lambda}{C}\Big)^t.$$
Hence, Assumption 4 is fulfilled with $a_t = 1$, $b_t = -\frac{1}{C}$, as $\psi_t(\lambda) = \psi_{t-1}(\lambda) - \frac{\lambda}{C}\psi_{t-1}(\lambda)$. We apply Theorem 4.1 and make use of the fact that $|\varphi_t(0)|^2 = C^{2t}/K_{t,R}^2$. See Proposition D.3 for the rate on $\mathrm{dist}(x_t, X^\star)$.

D PROOF OF PROPOSITION 5.2

Proposition D.1. Suppose that the assumptions of Theorem 4.2 hold with the probability measure $\mu_R$ fulfilling $\mu_R([r, R]) = \Omega((R-r)^\kappa)$ for $r$ in $[r_0, R]$, for some $r_0 \in [0, R)$ and some $\kappa \in \mathbb{Z}$. Then,
$$\lim_{t\to\infty} \frac{C^{2t}/K_{t,R}^2}{\sum_{k=0}^{t} C^{2k}/K_{k,R}^2} = 1 - \frac{R^2}{C^2}. \qquad (58)$$

Proof. Given $\varepsilon > 0$, let $c_\varepsilon \in \mathbb{Z}_{\ge 0}$ be minimal such that
$$\frac{1}{\sum_{i=0}^{c_\varepsilon} \big(\frac{R^2}{C^2}\big)^i} \le (1+\varepsilon)\,\frac{1}{\sum_{i=0}^{\infty} \big(\frac{R^2}{C^2}\big)^i} = (1+\varepsilon)\Big(1 - \frac{R^2}{C^2}\Big). \qquad (59)$$
Define $Q_{t,R} \stackrel{\text{def}}{=} R^{2t}/K_{t,R}^2$. Then,
$$\frac{C^{2t}/K_{t,R}^2}{\sum_{k=0}^{t} C^{2k}/K_{k,R}^2} = \frac{(C/R)^{2t}\, Q_{t,R}}{\sum_{k=0}^{t} (C/R)^{2k}\, Q_{k,R}} = \frac{Q_{t,R}}{\sum_{k=0}^{t} \big(\frac{R^2}{C^2}\big)^{t-k} Q_{k,R}}. \qquad (60)$$
On one hand, using that $Q_{t,R}$ is an increasing sequence in $t$,
$$\frac{Q_{t,R}}{\sum_{k=0}^{t} \big(\frac{R^2}{C^2}\big)^{t-k} Q_{k,R}} \ge \frac{1}{\sum_{k=0}^{t} \big(\frac{R^2}{C^2}\big)^{t-k}} \ge \frac{1}{\sum_{k=0}^{\infty} \big(\frac{R^2}{C^2}\big)^{k}} = 1 - \frac{R^2}{C^2}. \qquad (61)$$
On the other hand, for $t \ge c_\varepsilon$,
$$\frac{Q_{t,R}}{\sum_{k=0}^{t} \big(\frac{R^2}{C^2}\big)^{t-k} Q_{k,R}} \le \frac{Q_{t,R}}{\sum_{k=t-c_\varepsilon}^{t} \big(\frac{R^2}{C^2}\big)^{t-k} Q_{k,R}} = \frac{Q_{t,R}}{\sum_{k=t-c_\varepsilon}^{t} \big(\frac{R^2}{C^2}\big)^{t-k}\Big(Q_{t,R} - \int_k^t \frac{d}{ds}Q_{s,R}\, ds\Big)}. \qquad (62)$$
Thus, we want to upper-bound $\int_k^t \frac{d}{ds}Q_{s,R}\, ds$. First, notice that, up to positive constant factors that do not affect the argument,
$$\frac{d}{ds}Q_{s,R} = \frac{d}{ds}\bigg(\int_0^R \Big(\frac{r}{R}\Big)^{2s}\, d\mu_R(r)\bigg)^{-1} = \frac{\int_0^R \big(\frac{r}{R}\big)^{2s}\,\log\!\big(\frac{R}{r}\big)\, d\mu_R(r)}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2s}\, d\mu_R(r)\Big)^2}.$$
By concavity of the logarithm function we obtain $\log\big(\frac{R}{r}\big) \le \frac{R}{r_0} - 1$ for $r \in [r_0, R]$. Choose $r_0$ close enough to $R$ so that $\frac{R}{r_0} - 1 \le \varepsilon/c_\varepsilon$.
We obtain that
$$\int_0^R \Big(\frac{r}{R}\Big)^{2s}\log\!\Big(\frac{R}{r}\Big)\, d\mu_R(r) \le \int_0^{r_0} \Big(\frac{r}{R}\Big)^{2s}\log\!\Big(\frac{R}{r}\Big)\, d\mu_R(r) + \int_{r_0}^R \Big(\frac{r}{R}\Big)^{2s}\Big(\frac{R}{r_0} - 1\Big)\, d\mu_R(r).$$

Published as a conference paper at ICLR 2021

Thus,
$$\int_k^t \frac{d}{ds}Q_{s,R}\, ds \le \int_k^t \frac{\int_0^{r_0} \big(\frac{r}{R}\big)^{2s}\log\!\big(\frac{R}{r}\big)\, d\mu_R(r)}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2s}\, d\mu_R(r)\Big)^2}\, ds + \int_k^t \frac{\int_{r_0}^R \big(\frac{r}{R}\big)^{2s}\big(\frac{R}{r_0} - 1\big)\, d\mu_R(r)}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2s}\, d\mu_R(r)\Big)^2}\, ds. \qquad (65)$$
Using that $\log x \le x$, for $k \in [t - c_\varepsilon, t]$ we can bound the first term of (65) as
$$\int_k^t \frac{\int_0^{r_0} \big(\frac{r}{R}\big)^{2s}\log\!\big(\frac{R}{r}\big)\, d\mu_R(r)}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2s}\, d\mu_R(r)\Big)^2}\, ds \le \int_k^t \frac{\int_0^{r_0} \big(\frac{r}{R}\big)^{2s-1}\, d\mu_R(r)}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2s}\, d\mu_R(r)\Big)^2}\, ds \le (t-k)\, \frac{\big(\frac{r_0}{R}\big)^{2k-1}}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2t}\, d\mu_R(r)\Big)^2}$$
$$\le c_\varepsilon \Big(\frac{r_0}{R}\Big)^{2(t-c_\varepsilon)-1} Q_{t,R}^2 \le c_\varepsilon \Big(\frac{r_0}{R}\Big)^{2(t-c_\varepsilon)-1} \frac{(2t+1)^{2\kappa}}{c_1^2} \xrightarrow{\ t\to\infty\ } 0. \qquad (66)$$
In the last inequality we use that, by Proposition D.2, for $t$ large enough $Q_{t,R} = R^{2t}/K_{t,R}^2 \le (2t+1)^{\kappa}/c_1$. For $k \in [t - c_\varepsilon, t]$, the second term of (65) can be bounded as
$$\int_k^t \frac{\int_{r_0}^R \big(\frac{r}{R}\big)^{2s}\big(\frac{R}{r_0} - 1\big)\, d\mu_R(r)}{\Big(\int_0^R \big(\frac{r}{R}\big)^{2s}\, d\mu_R(r)\Big)^2}\, ds \le (t-k)\Big(\frac{R}{r_0} - 1\Big)\frac{1}{\int_0^R \big(\frac{r}{R}\big)^{2t}\, d\mu_R(r)} \le c_\varepsilon\Big(\frac{R}{r_0} - 1\Big)\frac{1}{\int_0^R \big(\frac{r}{R}\big)^{2t}\, d\mu_R(r)} \le \varepsilon\, Q_{t,R}. \qquad (67)$$
From (65), (66) and (67), we obtain that for $t$ large enough and $k \in [t - c_\varepsilon, t]$,
$$\int_k^t \frac{d}{ds}Q_{s,R}\, ds \le 2\varepsilon\, Q_{t,R}. \qquad (68)$$
Hence, we can bound the right-hand side of (62):
$$\frac{Q_{t,R}}{\sum_{k=t-c_\varepsilon}^{t} \big(\frac{R^2}{C^2}\big)^{t-k}\Big(Q_{t,R} - \int_k^t \frac{d}{ds}Q_{s,R}\, ds\Big)} \le \frac{Q_{t,R}}{\sum_{k=t-c_\varepsilon}^{t} \big(\frac{R^2}{C^2}\big)^{t-k}\big(Q_{t,R} - 2\varepsilon Q_{t,R}\big)} = \frac{1}{(1-2\varepsilon)\sum_{k=0}^{c_\varepsilon} \big(\frac{R^2}{C^2}\big)^{k}} \le \frac{1+\varepsilon}{1-2\varepsilon}\Big(1 - \frac{R^2}{C^2}\Big). \qquad (69)$$
The last inequality follows from the definition of $c_\varepsilon$ in (59). Since $\varepsilon$ is arbitrary, by the sandwich theorem applied to (60), (61) and (69),
$$\lim_{t\to\infty} \frac{C^{2t}/K_{t,R}^2}{\sum_{k=0}^{t} C^{2k}/K_{k,R}^2} = 1 - \frac{R^2}{C^2}. \qquad (70)$$

Proposition D.2. Under the assumptions of Theorem 4.2, there exists $c_1 > 0$ such that for $t$ large enough, $K_{t,R}^2 \ge c_1 R^{2t}(2t+1)^{-\kappa}$.

Proof. By the assumption on $\mu_R$, there exist $r_0, c_1, \kappa > 0$ such that
$$K_{t,R}^2 \stackrel{\text{def}}{=} 2\pi \int_0^R r^{2t}\, d\mu_R(r) = 2\pi \int_0^{r_0} r^{2t}\, d\mu_R(r) + 2\pi \int_{r_0}^R r^{2t}\, d\mu_R(r) \ge 2\pi c_1 \int_{r_0}^R r^{2t}(R-r)^{\kappa-1}\, dr$$
$$= -2\pi c_1 \int_0^{r_0} r^{2t}(R-r)^{\kappa-1}\, dr + 2\pi c_1 \int_0^R r^{2t}(R-r)^{\kappa-1}\, dr \ge -2\pi c_1 R\, r_0^{2t} + 2\pi c_1 R^{2t+\kappa}\, B(2t+1, \kappa), \qquad (72)$$
where the beta function $B(x, y)$ is defined as $B(x, y) \stackrel{\text{def}}{=} \int_0^1 r^{x-1}(1-r)^{y-1}\, dr$. Using the link between the beta function and the gamma function, $B(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x+y)$, and Stirling's approximation, we obtain that for fixed $y$ and large $x$,
$$B(x, y) \sim \Gamma(y)\, x^{-y}. \qquad (74)$$
Hence, for $t$ large enough, $B(2t+1, \kappa) \sim \Gamma(\kappa)(2t+1)^{-\kappa} = (\kappa-1)!\,(2t+1)^{-\kappa}$. Hence, from (72) we obtain that there exists $c_1' > 0$ depending only on $\kappa$ and $r_0$ such that for $t$ large enough,
$$K_{t,R}^2 \ge -2\pi c_1 R\, r_0^{2t} + 2\pi c_1 R^{2t+\kappa}(\kappa-1)!\,(2t+1)^{-\kappa} \ge c_1' R^{2t}(2t+1)^{-\kappa}. \qquad (75)$$

Proposition 5.2. Suppose that the assumptions of Theorem 4.2 hold with $\mu_R \in \mathcal{P}([0, R])$ fulfilling $\mu_R([r, R]) = \Omega((R-r)^\kappa)$ for $r$ in $[r_0, R]$, for some $r_0 \in [0, R)$ and some $\kappa \in \mathbb{Z}$. Then, the average-case asymptotically optimal algorithm is, with $y_0 = x_0$:
$$y_t = y_{t-1} - \frac{1}{C}F(y_{t-1}), \qquad x_t = \Big(\frac{R}{C}\Big)^2 x_{t-1} + \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)y_t. \qquad (20)$$
Moreover, the convergence rate for this algorithm is asymptotically the same as for the optimal algorithm in Theorem 4.2. Namely, $\lim_{t\to\infty} \mathbb{E}[\mathrm{dist}(x_t, X^\star)]\, B_t = 1$.

Proof. The proof follows directly from Theorem 4.2 and Proposition D.1. See (77) and (79) in Proposition D.3 for the statement regarding the convergence rate.

Proposition D.3. For the average-case optimal algorithm (18),
$$\mathbb{E}\,\mathrm{dist}(x_t, X^\star) = \xi_{\mathrm{opt}}(t) \stackrel{\text{def}}{=} \frac{1}{\sum_{k=0}^{t} C^{2k}/K_{k,R}^2}. \qquad (76)$$
For the average-case asymptotically optimal algorithm (20),
$$\mathbb{E}\,\mathrm{dist}(x_t, X^\star) = \xi_{\mathrm{asymp}}(t) \stackrel{\text{def}}{=} \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2 \sum_{k=1}^{t} \frac{K_{k,R}^2}{C^{2k}}\Big(\frac{R}{C}\Big)^{4(t-k)} + \Big(\frac{R}{C}\Big)^{4t}. \qquad (77)$$
For the iterates $y_t$ in (18), i.e., gradient descent with step size $1/C$, we have
$$\mathbb{E}\,\mathrm{dist}(y_t, X^\star) = \xi_{\mathrm{GD}}(t) \stackrel{\text{def}}{=} \frac{K_{t,R}^2}{C^{2t}}. \qquad (78)$$
Moreover, for all $t \ge 0$ we have $\xi_{\mathrm{opt}}(t) \le \xi_{\mathrm{asymp}}(t)$, and under the assumptions of (5.1),
$$\lim_{t\to\infty} \frac{\xi_{\mathrm{opt}}(t)}{\xi_{\mathrm{asymp}}(t)} = 1, \qquad \lim_{t\to\infty} \frac{\xi_{\mathrm{opt}}(t)}{\xi_{\mathrm{GD}}(t)} = \lim_{t\to\infty} \frac{\xi_{\mathrm{asymp}}(t)}{\xi_{\mathrm{GD}}(t)} = 1 - \frac{R^2}{C^2}. \qquad (79)$$

Proof. To show (76), (77) and (78), we use the expression $x_t - x^\star = P_t(A)(x_0 - x^\star)$ (Proposition 2.1) and then evaluate $\|P_t\|_\mu^2 = \int_{\mathbb{C}\setminus\{0\}} |P_t|^2\, d\mu$ (Theorem 2.1).
For (76), the value of $\|P_t\|_\mu^2$ follows directly from Theorem 2.3, which states that the value for the optimal residual polynomial $P_t$ is
$$\frac{1}{\sum_{k=0}^{t} |\varphi_k(0)|^2} = \frac{1}{\sum_{k=0}^{t} C^{2k}/K_{k,R}^2}.$$
A simple proof by induction shows that for the asymptotically optimal algorithm (20), the following expression holds for all $t \ge 0$:
$$x_t - x^\star = \Bigg(\Big(\frac{R}{C}\Big)^{2t} + \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)\sum_{k=1}^{t} \Big(1 - \frac{A}{C}\Big)^k \Big(\frac{R}{C}\Big)^{2(t-k)}\Bigg)(x_0 - x^\star).$$
Thus,
$$P_t(\lambda) = \Big(\frac{R}{C}\Big)^{2t} + \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)\sum_{k=1}^{t} \Big(1 - \frac{\lambda}{C}\Big)^k \Big(\frac{R}{C}\Big)^{2(t-k)} = \Big(\frac{R}{C}\Big)^{2t}\varphi_0(\lambda) + \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)\sum_{k=1}^{t} (-1)^k\,\frac{K_{k,R}}{C^k}\Big(\frac{R}{C}\Big)^{2(t-k)}\varphi_k(\lambda),$$
which concludes the proof of (77), as by orthonormality
$$\|P_t\|_\mu^2 = \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2 \sum_{k=1}^{t} \frac{K_{k,R}^2}{C^{2k}}\Big(\frac{R}{C}\Big)^{4(t-k)} + \Big(\frac{R}{C}\Big)^{4t}.$$
By equation (52),
$$y_t - x^\star = \Big(1 - \frac{A}{C}\Big)^t(y_0 - x^\star) = (-1)^t\,\frac{K_{t,R}}{C^t}\,\varphi_t(A)(y_0 - x^\star).$$
Thus, for the $y_t$ iterates, $\|P_t\|_\mu^2 = K_{t,R}^2/C^{2t}$, and (78) follows. Now, $\xi_{\mathrm{opt}}(t) \le \xi_{\mathrm{asymp}}(t)$ for all $t \ge 0$ is a consequence of $\xi_{\mathrm{opt}}(t)$ being the rate of the optimal algorithm. And
$$\lim_{t\to\infty} \frac{\xi_{\mathrm{opt}}(t)}{\xi_{\mathrm{GD}}(t)} = \lim_{t\to\infty} \frac{C^{2t}/K_{t,R}^2}{\sum_{k=0}^{t} C^{2k}/K_{k,R}^2} = 1 - \frac{R^2}{C^2}$$
follows from Proposition D.1. To show $\lim_{t\to\infty} \xi_{\mathrm{asymp}}(t)/\xi_{\mathrm{GD}}(t) = 1 - \frac{R^2}{C^2}$, which concludes the proof, we rewrite
$$\xi_{\mathrm{asymp}}(t) = \Big(\frac{R}{C}\Big)^{2t}\Bigg(\bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2 \sum_{k=1}^{t} \frac{1}{Q_{k,R}}\Big(\frac{R}{C}\Big)^{2(t-k)} + \Big(\frac{R}{C}\Big)^{2t}\Bigg), \qquad (86)$$
using that, by definition, $Q_{k,R} = R^{2k}/K_{k,R}^2$. Now, let $c_\varepsilon \in \mathbb{Z}_{\ge 0}$ be such that
$$\sum_{k=c_\varepsilon}^{\infty} \Big(\frac{R}{C}\Big)^{2k} \le \varepsilon.$$
Using the same argument as in Proposition D.1 (see (68)), for $t$ large enough and $k \in [t - c_\varepsilon, t]$, $\int_k^t \frac{d}{ds}Q_{s,R}\, ds \le 2\varepsilon Q_{t,R}$. Hence, for $t$ large enough,
$$\bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2 \sum_{k=1}^{t} \frac{1}{Q_{k,R}}\Big(\frac{R}{C}\Big)^{2(t-k)} + \Big(\frac{R}{C}\Big)^{2t} = \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2\Bigg(\sum_{k=t-c_\varepsilon}^{t} \frac{\big(\frac{R}{C}\big)^{2(t-k)}}{Q_{t,R} - \int_k^t \frac{d}{ds}Q_{s,R}\, ds} + \sum_{k=1}^{t-c_\varepsilon} \frac{1}{Q_{k,R}}\Big(\frac{R}{C}\Big)^{2(t-k)}\Bigg) + \Big(\frac{R}{C}\Big)^{2t}$$
$$\le \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2\Bigg(\frac{1}{(1-2\varepsilon)Q_{t,R}}\sum_{k=t-c_\varepsilon}^{t} \Big(\frac{R}{C}\Big)^{2(t-k)} + \sum_{k=1}^{t-c_\varepsilon} \Big(\frac{R}{C}\Big)^{2(t-k)}\Bigg) + \Big(\frac{R}{C}\Big)^{2t} \le \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)\frac{1}{(1-2\varepsilon)Q_{t,R}} + \bigg(1 - \Big(\frac{R}{C}\Big)^2\bigg)^2\varepsilon + \Big(\frac{R}{C}\Big)^{2t}$$
(using $Q_{k,R} \ge 1$ and $\sum_{j\ge 0}(R/C)^{2j} = \big(1 - \frac{R^2}{C^2}\big)^{-1}$), which can be made arbitrarily close to $\big(1 - \frac{R^2}{C^2}\big)\frac{1}{Q_{t,R}}$ by taking $\varepsilon > 0$ small enough. Plugging this into (86), we obtain that $\xi_{\mathrm{asymp}}(t)$ can be made arbitrarily close to
$$\Big(1 - \frac{R^2}{C^2}\Big)\Big(\frac{R}{C}\Big)^{2t}\frac{1}{Q_{t,R}} = \Big(1 - \frac{R^2}{C^2}\Big)\,\xi_{\mathrm{GD}}(t)$$
by taking $t$ large enough.
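The closed-form rates in Proposition D.3 can also be checked numerically for the uniform measure on the disk, where $K_{t,R}^2 = R^{2t}/(t+1)$ by (57). The following sketch is an illustration only (the choices $C = 2$, $R = 1$, $t = 50$ are ours, not the paper's); it verifies $\xi_{\mathrm{opt}} \le \xi_{\mathrm{asymp}}$ and the limits in (79):

```python
# Numerical check of Proposition D.3 for the uniform measure on the
# disk D(C, R), where K_{t,R}^2 = R^{2t} / (t + 1) by (57).
# Illustrative sketch; C, R and the horizon t are arbitrary choices.

def K2(k, R):
    # K_{k,R}^2 for the uniform disk
    return R ** (2 * k) / (k + 1)

def xi_opt(t, C, R):
    # (76): 1 / sum_{k=0}^t C^{2k} / K_{k,R}^2
    return 1.0 / sum(C ** (2 * k) / K2(k, R) for k in range(t + 1))

def xi_gd(t, C, R):
    # (78): K_{t,R}^2 / C^{2t}
    return K2(t, R) / C ** (2 * t)

def xi_asymp(t, C, R):
    # (77), with q = (R/C)^2 so that (R/C)^{4(t-k)} = q^{2(t-k)}
    q = (R / C) ** 2
    s = sum(K2(k, R) / C ** (2 * k) * q ** (2 * (t - k)) for k in range(1, t + 1))
    return (1 - q) ** 2 * s + q ** (2 * t)

C, R, t = 2.0, 1.0, 50
limit = 1 - (R / C) ** 2                          # = 0.75 here
assert xi_opt(t, C, R) <= xi_asymp(t, C, R)       # optimality of (18)
assert abs(xi_opt(t, C, R) / xi_gd(t, C, R) - limit) < 0.01    # (79)
assert abs(xi_asymp(t, C, R) / xi_gd(t, C, R) - limit) < 0.01  # (79)
```

Both ratios come out close to $1 - R^2/C^2 = 0.75$ at this horizon, in line with Proposition D.1.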



Figure 1: Benchmarks and spectral density for different games. Top row: spectral density associated with bilinear games for varying values of the ratio parameter r = n/d (the x-axis represents the imaginary line). Second row: benchmarks; average-case optimal methods always outperform other methods, and the largest gain is in the ill-conditioned regime (r ≈ 1). Third row: benchmarks (columns 1 and 3) and eigenvalue distribution of a design matrix generated with i.i.d. entries for two different degrees of conditioning. Despite the normality assumption not being satisfied, we still observe an improvement of average-case optimal methods over worst-case optimal ones.
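As a complement to the benchmarks above, the disk method of Theorem 4.2 is short enough to sketch end to end. The snippet below is a hedged illustration, not the authors' code: the dimension d = 200, disk parameters C = 2, R = 1, horizon T = 60 and all variable names are our own choices. It builds a synthetic normal matrix with eigenvalues drawn uniformly from the disk $D_{C,R}$, uses $K_{t,R}^2 = R^{2t}/(t+1)$ from (57), and runs the averaged iteration (18):

```python
import numpy as np

rng = np.random.default_rng(0)
d, C, R, T = 200, 2.0, 1.0, 60

# Synthetic normal matrix: unitary conjugation of eigenvalues drawn
# uniformly from the disk D(C, R); radial density grows like r, hence
# the square root when sampling radii.
radii = R * np.sqrt(rng.random(d))
eigs = C + radii * np.exp(2j * np.pi * rng.random(d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)))
A = Q @ np.diag(eigs) @ Q.conj().T

x_star = rng.standard_normal(d) + 1j * rng.standard_normal(d)
F = lambda x: A @ (x - x_star)          # affine vector field with root x_star

# Iteration (18): gradient-like steps with step size 1/C, averaged with
# weights beta_t = C^{2t} / K_{t,R}^2 = (C/R)^{2t} (t + 1) for the
# uniform disk (beta_0 = 1, carried by the initialization of B).
x0 = rng.standard_normal(d) + 0j
y, x, B = x0.copy(), x0.copy(), 1.0
for t in range(1, T + 1):
    y = y - F(y) / C
    beta = (C / R) ** (2 * t) * (t + 1)
    x = (B * x + beta * y) / (B + beta)
    B += beta

rel_err = np.linalg.norm(x - x_star) / np.linalg.norm(x0 - x_star)
print("relative error after", T, "steps:", rel_err)
```

Since $|1 - \lambda/C| \le R/C = 1/2$ on the support, the plain $y$-iterates already contract, and the weighted average $x_t$ tracks the accelerated rate $1/B_t$; the final relative error is tiny (limited only by floating-point precision).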

ACKNOWLEDGEMENTS

C. Domingo-Enrich has been partially funded by "la Caixa" Foundation (ID 100010434), under agreement LCF/BQ/AA18/11680094, and partially funded by the NYU Computer Science Department.

