LEARNING A MINIMAX OPTIMIZER: A PILOT STUDY

Abstract

Solving continuous minimax optimization is of extensive practical interest, yet notoriously unstable and difficult. This paper introduces the learning to optimize (L2O) methodology to the minimax problems for the first time and addresses its accompanying unique challenges. We first present Twin-L2O, the first dedicated minimax L2O framework consisting of two LSTMs for updating min and max variables separately. The decoupled design is found to facilitate learning, particularly when the min and max variables are highly asymmetric. Empirical experiments on a variety of minimax problems corroborate the effectiveness of Twin-L2O. We then discuss a crucial concern of Twin-L2O, i.e., its inevitably limited generalizability to unseen optimizees. To address this issue, we present two complementary strategies. Our first solution, Enhanced Twin-L2O, is empirically applicable for general minimax problems, by improving L2O training via leveraging curriculum learning. Our second alternative, called Safeguarded Twin-L2O, is a preliminary theoretical exploration stating that under some strong assumptions, it is possible to theoretically establish the convergence of Twin-L2O. We benchmark our algorithms on several testbed problems and compare against state-of-the-art minimax solvers.

1. INTRODUCTION

Many popular applications can be formulated as continuous minimax optimization, such as generative adversarial networks (GANs) (Goodfellow et al., 2014), distributionally robust learning (Globerson & Roweis, 2006), domain adaptation (Ganin & Lempitsky, 2014), distributed computing (Shamma, 2008; Mateos et al., 2010), and privacy protection (Wu et al., 2018; 2020), among many more. This paper studies such problems: we consider a cost function f : R^m × R^n → R and the min-max game

min_x max_y f(x, y).    (1)

We aim to find a saddle point (x*, y*) of f:

f(x*, y) ≤ f(x*, y*) ≤ f(x, y*), ∀(x, y) ∈ X × Y,

where X ⊂ R^m and Y ⊂ R^n. If X = R^m and Y = R^n, (x*, y*) is called a global saddle point; if X × Y is a neighborhood of (x*, y*), it is a local saddle point. The main challenge in solving problem (1) is the unstable dynamics of iterative algorithms. The simplest algorithms, such as gradient descent ascent (GDA), can cycle around the saddle point or even diverge (Benaïm & Hirsch, 1999; Mertikopoulos et al., 2018b; Lin et al., 2019). Plenty of works have been developed recently to address this issue (Daskalakis et al., 2018; Daskalakis & Panageas, 2018; Liang & Stokes, 2019; Mertikopoulos et al., 2018a; Gidel et al., 2018; Mokhtari et al., 2019). However, convergence remains sensitive to the parameters of these algorithms: even if the cost function is merely rescaled, those parameters must be re-tuned to ensure convergence. A recent trend of learning to optimize (L2O) parameterizes training algorithms to be learnable from data, so that the meta-learned optimizers can adapt to a special class of functions and outperform general-purpose optimizers. This is particularly meaningful when one has to solve a large number of similar optimization problems repeatedly and quickly.
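The cycling and divergence of GDA mentioned above is easy to reproduce on the classic bilinear game f(x, y) = xy, whose unique saddle point is the origin. The following is our own minimal illustration (not code from the paper): the simultaneous GDA update multiplies the distance to the saddle point by sqrt(1 + lr^2) > 1 at every step, so the iterates spiral outward.

```python
import math

def gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent ascent on f(x, y) = x * y."""
    for _ in range(steps):
        gx, gy = y, x                    # df/dx = y, df/dy = x (evaluated simultaneously)
        x, y = x - lr * gx, y + lr * gy  # descent on x, ascent on y
    return x, y

x0, y0 = 1.0, 1.0
xT, yT = gda(x0, y0)
# The distance to the saddle point (0, 0) has grown, not shrunk:
print(math.hypot(xT, yT) > math.hypot(x0, y0))
```

No matter how small the step size lr is chosen, the per-step growth factor stays above 1, which is why simply tuning GDA cannot fix the divergence on this problem.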
Specifically, existing L2O methods that operate in the space of continuous optimization almost all solve minimization problems (Andrychowicz et al., 2016; Chen et al., 2017; Li & Malik, 2016), leveraging an LSTM or a reinforcement learner to model the optimizer. Unlike classic optimization results, which often provide worst-case convergence guarantees, most L2O methods have little or no convergence guarantee, especially on problem or data instances distinct from those seen in training, leaving their generalizability in practice questionable (Heaton et al., 2020). Motivated by L2O's success in learning efficient minimization solvers from data, this paper seeks to answer: can we accomplish strong minimax L2O solvers as well; and if yes, how generalizable can they be? Although it might look straightforward at first glance, this extension is highly nontrivial and faces several unique challenges. Firstly, while continuous minimization has a multitude of mature and empirically stable solvers, for general minimax optimization even state-of-the-art analytical algorithms can exhibit instability or divergence. To the best of our knowledge, most state-of-the-art convergence analyses of minimax optimization are built on the convex-concave assumption (Gidel et al., 2018; Mokhtari et al., 2019; Ryu et al., 2019), and some recent works relax the assumption to nonconvex-concave (Lin et al., 2019; 2020). Convergence for general minimax problems remains open, which raises a prominent concern about whether a stable minimax L2O is even feasible. Secondly, given the two groups of min and max variables simultaneously, it is unclear to what extent their optimization strategies can be modeled and made to interact within one unified framework, a new question that never arises in minimization.
Thirdly, the noisy and sometimes cyclic dynamics of minimax optimization provide noisier guidance (e.g., reward) to L2O; moreover, it is not immediately clear how to define the reward: for minimization, the reward is typically defined as the negative cumulative objective value along the iteration history (Li & Malik, 2016), but in minimax optimization the objective cannot simply decrease or increase monotonically.

Contribution: This paper is a pilot study of minimax L2O. We start by establishing the first dedicated minimax L2O framework, called Twin-L2O. It is composed of two LSTMs sharing one objective-based reward, separately responsible for updating the min and max variables. By ablating the design options, we find that this decoupled design facilitates meta-learning the most, particularly when the min and max updates are highly asymmetric. We demonstrate the superior convergence of Twin-L2O on several testbed problems, compared against a number of analytical solvers. On top of that, we further investigate how to enhance the generalizability of the learned minimax solver¹, and discuss two complementary alternatives with experimental validation. The first alternative is an empirical toolkit applicable to general minimax L2O. We introduce curriculum learning into L2O training for the first time, recognizing that not all problem instances are equally difficult to learn to solve. After plugging in that idea, we show that Twin-L2O can be trained to stably solve an order of magnitude more problem instances (in terms of parameter varying range). The second alternative explores a theoretical mechanism called safeguarding, particularly for the important special case of convex-concave problems. When solving a testing instance, safeguarding identifies when an L2O failure would occur and provides an analytical fall-back option (Diakonikolas, 2020). That guarantees convergence for convex-concave problems and, in practice, yields faster convergence even when the problem parameters are drawn from a distribution different from that of training.
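The safeguarding mechanism can be sketched as follows. This is our own illustration of the idea, not the paper's implementation: `learned_step` below is a hypothetical stand-in for the LSTM optimizer's output, and the safeguard accepts its update only if a residual (here, the gradient norm) decreases; otherwise it falls back to an analytical step with known convex-concave guarantees (here, an extragradient step).

```python
import math

def grad(x, y):
    # Bilinear test problem f(x, y) = x * y: df/dx = y, df/dy = x.
    return y, x

def learned_step(x, y, lr=0.3):
    # Hypothetical stand-in for the L2O update: an overly aggressive
    # GDA step that overshoots on this problem.
    gx, gy = grad(x, y)
    return x - lr * gx, y + lr * gy

def fallback_step(x, y, lr=0.1):
    # Analytical fall-back: one extragradient step (convergent in the
    # convex-concave case).
    gx, gy = grad(x, y)
    xm, ym = x - lr * gx, y + lr * gy        # extrapolation
    gx2, gy2 = grad(xm, ym)
    return x - lr * gx2, y + lr * gy2        # correction

def safeguarded(x, y, steps=2000):
    for _ in range(steps):
        res = math.hypot(*grad(x, y))        # residual at the current point
        xc, yc = learned_step(x, y)
        if math.hypot(*grad(xc, yc)) < res:  # safeguard: accept only if residual drops
            x, y = xc, yc
        else:
            x, y = fallback_step(x, y)       # otherwise take the analytical step
    return x, y

xT, yT = safeguarded(1.0, 1.0)
print(math.hypot(xT, yT) < 1e-2)
```

On this example the stand-in "learned" step always increases the residual, so the safeguard rejects it and the fall-back drives the iterates to the saddle point; in the intended use case the learned step is usually accepted and the fall-back only fires on failures.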

2. RELATED WORK

2.1 MINIMAX OPTIMIZATION

Following (Neumann, 1928), problem (1) has been studied for decades due to its wide applicability. Simultaneous gradient descent (SimGD), or gradient descent ascent (GDA) (Nedić & Ozdaglar, 2009; Du & Hu, 2019; Jin et al., 2019; Lin et al., 2019), is one of the simplest minimax algorithms, performing gradient descent over the variable x and gradient ascent over the variable y. However, the dynamics of SimGD/GDA can converge to limit cycles or even diverge (Benaïm & Hirsch, 1999; Mertikopoulos et al., 2018b; Lin et al., 2019). To address this issue, optimistic gradient descent ascent (OGDA) modifies the GDA dynamics and shows more stable performance (Daskalakis et al., 2018; Daskalakis & Panageas, 2018; Liang & Stokes, 2019; Mertikopoulos et al., 2018a; Gidel et al., 2018; Mokhtari et al., 2019). OGDA has attracted particular attention because of its empirical success in training GANs. (Ryu et al., 2019) theoretically studies OGDA by analyzing its continuous time dynamic and
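As a concrete sketch of the OGDA update mentioned above (our own illustration, not the authors' code): OGDA takes a doubled gradient step and subtracts the previous gradient, i.e., x_{t+1} = x_t - 2η∇_x f(x_t, y_t) + η∇_x f(x_{t-1}, y_{t-1}) (and the ascent analogue for y). On the bilinear game f(x, y) = xy, where plain GDA spirals outward, this "optimistic" correction damps the rotation and the iterates contract toward the saddle point:

```python
import math

def ogda(x, y, lr=0.1, steps=2000):
    """Optimistic gradient descent ascent on f(x, y) = x * y."""
    gx_prev, gy_prev = y, x      # initialize the "previous" gradients at the start
    for _ in range(steps):
        gx, gy = y, x            # df/dx = y, df/dy = x (at the current point)
        x = x - 2 * lr * gx + lr * gx_prev   # doubled step minus previous gradient
        y = y + 2 * lr * gy - lr * gy_prev   # ascent analogue for y
        gx_prev, gy_prev = gx, gy
    return x, y

xT, yT = ogda(1.0, 1.0)
print(math.hypot(xT, yT) < 1e-2)   # distance to the saddle point (0, 0) shrinks
```

Compare with plain GDA under the same step size, whose distance to the origin grows by a factor of sqrt(1 + lr^2) per step on this problem.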



¹ We differentiate the usage of two terms, parameters and variables, throughout the paper. For example, in min_x max_y ax - by^2, we call a, b parameters and x, y variables. For simplicity, this paper only discusses L2O generalizability when the testing instances' parameter distribution differs from the training one.

