Adaptive Extra-Gradient Methods for Min-Max Optimization and Games

Abstract

We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones. Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not, without requiring any prior tuning by the optimizer. As a result, the algorithm simultaneously achieves order-optimal convergence rates, i.e., it converges to an ε-optimal solution within O(1/ε) iterations in smooth problems, and within O(1/ε²) iterations in non-smooth ones. Importantly, these guarantees do not require any of the standard boundedness or Lipschitz continuity conditions that are typically assumed in the literature; in particular, they apply even to problems with singularities (such as resource allocation problems and the like). This adaptation is achieved through the use of a geometric apparatus based on Finsler metrics and a suitably chosen mirror-prox template that allows us to derive sharp convergence rates for the methods at hand.

1. Introduction

The surge of recent breakthroughs in generative adversarial networks (GANs) [20], robust reinforcement learning [41], and other adversarial learning models [27] has sparked renewed interest in the theory of min-max optimization problems and games. In this broad setting, it has become empirically clear that, ceteris paribus, the simultaneous training of two (or more) antagonistic models faces drastically new challenges relative to the training of a single one. Perhaps the most prominent of these challenges is the appearance of cycles and recurrent (or even chaotic) behavior in min-max games. This has been studied extensively in the context of learning in bilinear games, in both continuous [16, 31, 40] and discrete time [12, 18, 19, 32], and the methods proposed to overcome recurrence typically focus on mitigating the rotational component of min-max games.

The method with the richest history in this context is the extra-gradient (EG) algorithm of Korpelevich [25] and its variants. The EG algorithm exploits the Lipschitz smoothness of the problem and, if coupled with a Polyak-Ruppert averaging scheme, it achieves an O(1/T) rate of convergence in smooth, convex-concave min-max problems [35]. This rate is known to be tight [34, 39] but, in order to achieve it, the original method requires the problem's Lipschitz constant to be known in advance. If the problem is not Lipschitz smooth (or the algorithm is run with a vanishing step-size schedule), the method's rate of convergence drops to O(1/√T).

Our contributions. Our aim in this paper is to provide an algorithm that automatically adapts to smooth / non-smooth min-max problems and games, and achieves order-optimal rates in both classes without requiring any prior tuning by the optimizer. In this regard, we propose a flexible algorithmic scheme, which we call AdaProx, and which exploits gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones.
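To fix ideas, the extra-gradient template with Polyak-Ruppert averaging discussed above can be sketched on a toy bilinear problem. The step-size, horizon, and initial point below are illustrative choices, not values prescribed by any of the cited methods:

```python
# Extra-gradient (EG) on the bilinear toy problem
#   min_x max_y L(x, y) = x * y,
# whose unique saddle point is (0, 0). Plain gradient descent-ascent
# cycles on this problem; the extrapolation step of EG damps the rotation.

def eg_bilinear(x, y, eta=0.2, T=500):
    """Run T extra-gradient steps with Polyak-Ruppert averaging."""
    x_sum, y_sum = 0.0, 0.0
    for _ in range(T):
        # Leading ("extrapolation") step: probe the gradient ahead.
        x_half = x - eta * y          # grad_x L(x, y) = y
        y_half = y + eta * x          # grad_y L(x, y) = x
        # Update step: move from (x, y) using the midpoint gradient.
        x, y = x - eta * y_half, y + eta * x_half
        x_sum += x
        y_sum += y
    return x, y, x_sum / T, y_sum / T

x_T, y_T, x_avg, y_avg = eg_bilinear(1.0, 1.0)
```

With a constant step-size below the (known) Lipschitz threshold, both the last iterate and the time-average approach the saddle point (0, 0); it is precisely the need to know that threshold in advance that the adaptive methods of this paper aim to remove.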
Thanks to this mechanism, and to the best of our knowledge, AdaProx is the first algorithm that simultaneously achieves the following:

1. An O(1/√T) convergence rate in non-smooth problems and O(1/T) in smooth ones.
2. Applicability to min-max problems and games where the standard boundedness / Lipschitz continuity conditions required in the literature do not hold.
3. Convergence without prior knowledge of the problem's parameters (e.g., whether the problem's defining vector field is smooth or not, its smoothness modulus if it is, etc.).

Our proposed method achieves the above by fusing the following ingredients: a) a family of local norms - a Finsler metric - capturing any singularities in the problem at hand; b) a suitable mirror-prox template; and c) an adaptive step-size policy in the spirit of Rakhlin & Sridharan [43]. We also show that, under a suitable coherence assumption, the sequence of iterates generated by the algorithm converges, thus providing an appealing alternative to iterate averaging in cases where the method's "last iterate" is more appropriate (for instance, if using AdaProx to solve non-monotone problems).

Related works. There have been several works improving on the guarantees of the original extra-gradient / mirror-prox template. We review the most relevant of these works below; for convenience, we also tabulate these contributions in Table 1. Because many of these works appear in the literature on variational inequalities [15], we also use this language in the sequel.

In unconstrained problems with an operator that is locally Lipschitz continuous (but not necessarily globally so), the golden ratio algorithm (GRAAL) [29] achieves convergence without requiring prior knowledge of the problem's Lipschitz parameter. However, GRAAL provides no rate guarantees for non-smooth problems - and hence, a fortiori, no interpolation guarantees either. By contrast, such guarantees are provided in problems with a bounded domain by the generalized mirror-prox (GMP) algorithm of [47] under the umbrella of Hölder continuity.
Still, nothing is known about the convergence of GRAAL / GMP in problems with singularities (i.e., when the problem's defining vector field blows up at a boundary point of the problem's domain). Singularities of this type were treated in a recent series of papers [1, 17, 48] by means of a "Bregman continuity" or "Lipschitz-like" condition. These methods are order-optimal in the smooth case, without requiring any knowledge of the problem's smoothness modulus. On the other hand, like GRAAL (but unlike GMP), they do not provide any rate interpolation guarantees between smooth and non-smooth problems.

Another method that simultaneously achieves an O(1/√T) rate in non-smooth problems and an O(1/T) rate in smooth ones is the recent algorithm of Bach & Levy [2]. The BL algorithm employs an adaptive, AdaGrad-like step-size policy which allows the method to interpolate between the two regimes - and this, even with noisy gradient feedback. On the negative side, the BL algorithm requires a bounded domain with a (Bregman) diameter that is known in advance; as a result, its theoretical guarantees do not apply to unbounded problems. In addition, the BL algorithm makes crucial use of boundedness and Lipschitz continuity; extending the BL method beyond this standard framework is a highly non-trivial endeavor which formed a big part of this paper's motivation.
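To make the adaptive step-size idea concrete, here is a hedged sketch of an AdaGrad-like extra-gradient step on the same toy bilinear problem. The specific rule η_t = η₀/√(1 + Σ_s ‖V(z_{s+1/2}) − V(z_s)‖²) is illustrative only; it is not the exact policy of either the BL algorithm or AdaProx:

```python
import math

# Adaptive extra-gradient sketch: the step-size shrinks with the
# accumulated squared differences between the "leading" and "base"
# gradients, so no Lipschitz constant needs to be known in advance.

def V(x, y):
    # Monotone operator of min_x max_y x*y: V(x, y) = (grad_x L, -grad_y L).
    return y, -x

def adaptive_eg(x, y, eta0=1.0, T=3000):
    S = 0.0  # running sum of squared gradient differences
    for _ in range(T):
        eta = eta0 / math.sqrt(1.0 + S)
        gx, gy = V(x, y)
        x_half, y_half = x - eta * gx, y - eta * gy   # leading step
        hx, hy = V(x_half, y_half)
        x, y = x - eta * hx, y - eta * hy             # update step
        S += (hx - gx) ** 2 + (hy - gy) ** 2
    return x, y

x_T, y_T = adaptive_eg(1.0, 1.0)
```

On this smooth problem the accumulated differences stay bounded, so the step-size stabilizes at a strictly positive value and the iterates converge to the saddle point without any prior tuning, which is the qualitative behavior the rate-interpolation guarantees formalize.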

2. Problem Setup and Blanket Assumptions

We begin this section by reviewing some basic notions for min-max problems and games.



2.1. Min-max / Saddle-point problems. A min-max game is a saddle-point problem of the form

    min_{θ∈Θ} max_{φ∈Φ} L(θ, φ)    (SP)
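As an illustration, the vector field V(θ, φ) = (∇_θ L, −∇_φ L) defining a problem of the form (SP) can be probed numerically. The bilinear loss L(θ, φ) = θᵀAφ and the matrix A below are arbitrary stand-ins, not data from the paper:

```python
import numpy as np

# For the bilinear loss L(theta, phi) = theta^T A phi, the defining field is
#   V(theta, phi) = (A phi, -A^T theta),
# and it is monotone: <V(z) - V(z'), z - z'> >= 0 for all z, z'
# (for a bilinear loss the inner product is in fact exactly zero).

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # arbitrary coupling matrix

def V(theta, phi):
    return np.concatenate([A @ phi, -A.T @ theta])

gaps = []
for _ in range(100):
    t1, p1 = rng.standard_normal(3), rng.standard_normal(2)
    t2, p2 = rng.standard_normal(3), rng.standard_normal(2)
    z_diff = np.concatenate([t1 - t2, p1 - p2])
    gaps.append(float((V(t1, p1) - V(t2, p2)) @ z_diff))

min_gap = min(gaps)
```

The monotonicity gap vanishing identically here reflects the purely rotational character of bilinear games noted in the introduction; for general convex-concave L the gap is nonnegative rather than zero.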

Table 1: Overview of related work. For the purposes of this table, "parameter-agnostic" means that the method does not require prior knowledge of the parameters of the problem it was designed to solve (Lipschitz modulus, domain diameter, etc.); "rate interpolation" means that the algorithm's convergence rate is O(1/T) or O(1/√T) in smooth / non-smooth problems respectively; "unbounded domain" is self-explanatory; and, finally, "singularities" means that the problem's defining vector field may blow up at a boundary point of the problem's domain.

