AVERAGE-CASE ACCELERATION FOR BILINEAR GAMES AND NORMAL MATRICES

Abstract

Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through numerical simulations with a varying degree of mismatch with our assumptions.

1. INTRODUCTION

The traditional analysis of optimization algorithms is a worst-case analysis (Nemirovski, 1995; Nesterov, 2004). This type of analysis provides a complexity bound valid for any input from a function class, no matter how unlikely. However, since hard-to-solve inputs might rarely occur in practice, worst-case complexity bounds might not be representative of the observed running time. A more representative analysis is given by the average-case complexity, which averages the algorithm's complexity over all possible inputs. This analysis is standard for, e.g., sorting (Knuth, 1997) and cryptography algorithms (Katz & Lindell, 2014).

Recently, a line of work (Berthier et al., 2020; Pedregosa & Scieur, 2020; Lacotte & Pilanci, 2020; Paquette et al., 2020) has focused on optimal methods for the optimization of quadratics, specified by a symmetric matrix. While worst-case analysis uses bounds on the matrix eigenvalues to yield upper and lower bounds on convergence, average-case analysis relies on the expected distribution of eigenvalues and provides algorithms with sharp optimal convergence rates. While the algorithms developed in this context have been shown to be efficient for minimization problems, they have not been extended to smooth games. A different line of work considers algorithms for smooth games but studies worst-case optimal methods (Azizian et al., 2020). In this work, we combine average-case analysis with smooth games and develop novel average-case optimal algorithms for finding the root of a linear system determined by a (potentially non-symmetric) normal matrix.

We make the following main contributions:

1. Inspired by the problem of finding equilibria in smooth games, we develop average-case optimal algorithms for finding the root of a non-symmetric affine operator, both under a normality assumption (Thm. 4.1) and under the extra assumption that the eigenvalues of the operator are supported in a disk (Thm. 4.2). The proposed method shows a polynomial speedup compared to the worst-case optimal method, verified by numerical simulations.

2. We make a novel connection between average-case optimal methods for optimization and average-case optimal methods for bilinear games. In particular, we show that solving the Hamiltonian with an average-case optimal method is optimal for bilinear games (Theorem 3.1). This result complements Azizian et al. (2020), who proved that the Polyak Heavy Ball algorithm on the Hamiltonian is asymptotically worst-case optimal for bilinear games.
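To make the Hamiltonian connection concrete: for the zero-sum bilinear game min_x max_y x^T B y, the simultaneous-gradient vector field is F(x, y) = (By, -B^T x), i.e., F(z) = Az with the non-symmetric matrix A = [[0, B], [-B^T, 0]], and the Hamiltonian H(z) = 1/2 ||F(z)||^2 is a convex quadratic whose gradient is A^T A z. The sketch below (illustrative only; the matrix B, dimensions, and sign convention for F are our choices, not fixed by the paper) checks both facts numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
B = rng.standard_normal((d, d))

# Game operator for min_x max_y x^T B y: F(z) = A z with z = (x, y).
A = np.block([[np.zeros((d, d)), B],
              [-B.T, np.zeros((d, d))]])

def hamiltonian(z):
    """H(z) = 1/2 ||F(z)||^2, a convex quadratic in z."""
    return 0.5 * np.linalg.norm(A @ z) ** 2

z = rng.standard_normal(2 * d)

# Gradient of H is the symmetric PSD matrix A^T A applied to z, so
# minimizing H is an ordinary (symmetric) quadratic minimization.
grad_exact = A.T @ A @ z

# Finite-difference check of that gradient formula.
eps = 1e-6
grad_fd = np.array([
    (hamiltonian(z + eps * e) - hamiltonian(z - eps * e)) / (2 * eps)
    for e in np.eye(2 * d)
])
assert np.allclose(grad_exact, grad_fd, atol=1e-4)

# For bilinear games, A^T A is block-diagonal: diag(B B^T, B^T B).
assert np.allclose(A.T @ A,
                   np.block([[B @ B.T, np.zeros((d, d))],
                             [np.zeros((d, d)), B.T @ B]]))
```

The block-diagonal structure of A^T A is why running a minimization method on the Hamiltonian is a sensible candidate for solving the game.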

2. AVERAGE-CASE ANALYSIS FOR NORMAL MATRICES

In this paper we consider the following class of problems.

Definition 1. Let A ∈ R^{d×d} be a real matrix and x* ∈ R^d a vector. The non-symmetric (affine) operator (NSO) problem is defined as:

    Find x such that F(x) := A(x - x*) = 0.    (NSO)

This problem generalizes the minimization of a convex quadratic function f, since we can cast the latter in this framework by setting the operator F = ∇f. The set of solutions is an affine subspace that we will denote X*. We will find it convenient to consider the distance to this set, defined as

    dist(x, X*) := min_{v ∈ X*} ||x - v||^2,  with  X* = {x ∈ R^d | A(x - x*) = 0}.

In this paper we will develop average-case optimal methods. For this, we consider A and x* to be random, together with a random initialization x_0. This induces a probability distribution over (NSO) problems, and we seek methods with optimal expected suboptimality w.r.t. this distribution. Denoting by E_{(A, x*, x_0)} the expectation over these random problems, average-case optimal methods verify the following property at each iteration t:

    min_{x_t} E_{(A, x*, x_0)} [dist(x_t, X*)]  s.t.  x_i ∈ x_0 + span({F(x_j)}_{j=0}^{i-1}),  ∀i ∈ [1 : t].

The last condition on x_t stems from restricting the class of algorithms to first-order methods. This class encompasses many known schemes such as gradient descent with momentum or full-matrix AdaGrad. However, methods such as Adam (Kingma & Ba, 2015) or diagonal AdaGrad (Duchi et al., 2011) are not in this class, as the diagonal re-scaling creates iterates x_t outside the span of previous gradients. Although we will focus on the distance to the solution, the results can be extended to other convergence criteria such as ||F(x_t)||^2. Finally, note that the expectations in this paper are over the problem instance and not over any randomness of the algorithm.
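As a small sanity check of the span condition, iterates of a plain fixed-step update x_{i+1} = x_i - γ F(x_i) do remain in x_0 + span({F(x_j)}). The numpy sketch below (matrix, step size, and iteration count are arbitrary illustrative choices) verifies membership by solving a least-squares system:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
x_star = rng.standard_normal(d)
F = lambda x: A @ (x - x_star)    # the (NSO) operator

gamma = 0.05
x = rng.standard_normal(d)
x0 = x.copy()
residuals = []                    # stores F(x_0), F(x_1), ...
for _ in range(3):
    residuals.append(F(x))
    x = x - gamma * F(x)          # a first-order update

# x_3 - x_0 must lie in span{F(x_0), F(x_1), F(x_2)}.
S = np.column_stack(residuals)
coef, *_ = np.linalg.lstsq(S, x - x0, rcond=None)
assert np.allclose(S @ coef, x - x0)
```

A diagonal re-scaling of F(x_i), as in Adam, would generically push x - x0 out of this span, which is why such methods fall outside the class considered here.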

2.1. ORTHOGONAL RESIDUAL POLYNOMIALS AND FIRST-ORDER METHODS

The analysis of first-order methods simplifies through the use of polynomials. This section provides the tools required to leverage this connection.

Definition 2. A residual polynomial is a polynomial P that satisfies P(0) = 1.

Proposition 2.1. (Hestenes et al., 1952) If the sequence (x_t)_{t ∈ Z_+} is generated by a first-order method, then there exist residual polynomials P_t, each of degree at most t, verifying

    x_t - x* = P_t(A)(x_0 - x*).    (3)

As we will see, optimal average-case methods are strongly related to orthogonal polynomials. We first define the inner product between polynomials, where we use z* for the complex conjugate of z ∈ C.
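For instance, the fixed-step iteration x_{t+1} = x_t - γ F(x_t) corresponds to the residual polynomial P_t(λ) = (1 - γλ)^t, which indeed satisfies P_t(0) = 1. A quick numpy verification of the representation x_t - x* = P_t(A)(x_0 - x*) for this method (matrix, step size, and horizon are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.standard_normal((d, d))
x_star = rng.standard_normal(d)
F = lambda x: A @ (x - x_star)    # the (NSO) operator

gamma, t = 0.1, 5
x = rng.standard_normal(d)
x0 = x.copy()
for _ in range(t):
    x = x - gamma * F(x)          # fixed-step first-order update

# For this method, P_t(A) = (I - gamma A)^t.
P_t_of_A = np.linalg.matrix_power(np.eye(d) - gamma * A, t)
assert np.allclose(x - x_star, P_t_of_A @ (x0 - x_star))

# Residual property: P_t(0) = (1 - gamma * 0)^t = 1.
assert np.isclose((1 - gamma * 0) ** t, 1.0)
```

Other first-order methods (e.g., momentum schemes) yield different families of residual polynomials, and the average-case optimal method amounts to choosing the polynomial family optimally for the eigenvalue distribution.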

