SOLVING MIN-MAX OPTIMIZATION WITH HIDDEN STRUCTURE VIA GRADIENT DESCENT ASCENT

Abstract

Many recent AI architectures are inspired by zero-sum games; however, the behavior of their dynamics is still not well understood. Motivated by this, we study standard gradient descent ascent (GDA) dynamics in a specific class of non-convex non-concave zero-sum games that we call hidden zero-sum games. In this class, players control the inputs of smooth but possibly non-linear functions whose outputs are applied as inputs to a convex-concave game. Unlike general zero-sum games, these games have a well-defined notion of solution: outcomes that implement the von Neumann equilibrium of the "hidden" convex-concave game. We prove that if the hidden game is strictly convex-concave then vanilla GDA converges not merely to local Nash, but typically to the von Neumann solution. If the game lacks strict convexity properties, GDA may fail to converge to any equilibrium; however, by applying standard regularization techniques we can prove convergence to a von Neumann solution of a slightly perturbed zero-sum game. Our convergence guarantees are non-local, which as far as we know is a first-of-its-kind result in non-convex non-concave games. Finally, we discuss connections of our framework with generative adversarial networks.

1. INTRODUCTION

Traditionally, our understanding of convex-concave games revolves around von Neumann's celebrated minimax theorem, which implies the existence of saddle point solutions with a uniquely defined value. Although many learning algorithms are known to be able to compute such saddle points (Cesa-Bianchi & Lugosi, 2006), recently there has been a fervor of activity in proving stronger results such as faster regret minimization rates or analysis of the day-to-day behavior (Mertikopoulos et al., 2018; Daskalakis et al., 2018; Bailey & Piliouras, 2018; Abernethy et al., 2018; Wang & Abernethy, 2018; Daskalakis & Panageas, 2019; Abernethy et al., 2019; Mertikopoulos et al., 2019; Bailey & Piliouras, 2019; Gidel et al., 2019; Zhang & Yu, 2019; Hsieh et al., 2019; Bailey et al., 2020; Mokhtari et al., 2020; Hsieh et al., 2020; Pérolat et al., 2020). This interest has been largely triggered by the impressive successes of AI architectures inspired by min-max games such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a), adversarial training (Madry et al., 2018) and reinforcement learning self-play in games (Silver et al., 2017). Critically, however, all these applications are based upon non-convex non-concave games, our understanding of which is still nascent. Nevertheless, some important early work in the area has focused on identifying new solution concepts that are widely applicable in general min-max games, such as (local/differential) Nash equilibrium (Adolphs et al., 2019; Mazumdar & Ratliff, 2019), local minmax (Daskalakis & Panageas, 2018), local minimax (Jin et al., 2019), (local/differential) Stackelberg equilibrium (Fiez et al., 2020), and local robust point (Zhang et al., 2020). The plethora of solution concepts is perhaps suggestive that "solving" general min-max games unequivocally may be too ambitious a task.
Attraction to spurious fixed points (Daskalakis & Panageas, 2018), cycles (Vlatakis-Gkaragkounis et al., 2019), robustly chaotic behavior (Cheung & Piliouras, 2019; Cheung & Piliouras, 2020) and computational hardness issues (Daskalakis et al., 2020) all suggest that general min-max games might inherently involve messy, unpredictable and complex behavior. Are there rich classes of non-convex non-concave games with an effectively unique game theoretic solution that is selected by standard optimization dynamics (e.g. gradient descent)?

Our class of games. We will define a general class of min-max optimization problems, where each agent selects its own vector of parameters, which is then processed separately by smooth functions. Each agent receives its respective payoff after entering the outputs of the processed decision vectors as inputs to a standard convex-concave game. Formally, there exist functions $F : \mathbb{R}^N \to X \subset \mathbb{R}^n$ and $G : \mathbb{R}^M \to Y \subset \mathbb{R}^m$ and a continuous convex-concave function $L : X \times Y \to \mathbb{R}$, such that the min-max game is

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^N} \max_{\boldsymbol{\phi} \in \mathbb{R}^M} L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})). \qquad \text{(Hidden Convex-Concave (HCC))}$$

We call this class of min-max problems Hidden Convex-Concave Games. It generalizes the recently defined hidden bilinear games of Vlatakis-Gkaragkounis et al. (2019).

Our solution concept. Out of all the local Nash equilibria of HCC games, there exists a special subclass: the vectors $(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ that implement the von Neumann solution of the convex-concave game. This solution has a strong and intuitive game theoretic justification. Indeed, it is stable even if the agents could perform arbitrary deviations directly on the output spaces $X, Y$. These parameter combinations $(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ "solve" the "hidden" convex-concave $L$ and thus we call them von Neumann solutions. Naturally, HCCs will typically have numerous local saddle/Nash equilibria/fixed points that do not satisfy this property.
Instead, they correspond to stationary points of $F, G$ where their output is stuck, e.g., due to an unfortunate initialization. At these points the agents may be receiving payoffs which can be arbitrarily smaller/larger than the game theoretic value of game $L$. Fortunately, we show that Gradient Descent Ascent (GDA) strongly favors von Neumann solutions over generic fixed points.

Our results. In this work, we study the behavior of continuous GDA dynamics for the class of HCC games where each coordinate of $F, G$ is controlled by disjoint sets of variables. In a nutshell, we show that GDA trajectories stabilize around or converge to the corresponding von Neumann solutions of the hidden game. Despite restricting our attention to a subset of HCC games, our analysis has to overcome unique hurdles not shared by standard convex-concave games.

Challenges of HCC games. In convex-concave games, deriving the stability of the von Neumann solutions relies on the Euclidean distance from the equilibrium being a Lyapunov function. In contrast, in HCC games where optimization happens in the parameter space of $\boldsymbol{\theta}, \boldsymbol{\phi}$, the non-linear nature of $F, G$ distorts the convex-concave landscape in the output space. Thus, the Euclidean distance will not in general be a Lyapunov function. Moreover, the existence of a Lyapunov function for the trajectories in the output space of $F, G$ does not translate to a well-defined function in the parameter space (unless $F, G$ are trivial, invertible maps). Worse yet, even if $L$ has a unique solution in the output space, this solution could be implemented by multiple equilibria in the parameter space, and thus none of them can be individually globally attracting. Clearly, any transfer of stability or convergence properties from the output to the parameter space needs to be initialization dependent.

Lyapunov Stability.
Our first step is to construct an initialization-dependent Lyapunov function that accounts for the curvature induced by the operators $F$ and $G$ (Lemma 2). Leveraging a potentially infinite number of initialization-dependent Lyapunov functions, in Theorem 4 we prove that under mild assumptions the outputs of $F, G$ stabilize around the von Neumann solution of $L$.

Convergence. Mirroring convex-concave games, we require strict convexity or concavity of $L$ to provide convergence guarantees to von Neumann solutions (Theorem 5). Barring initializations where von Neumann solutions are not reachable due to the limitations imposed by $F$ and $G$, the set of von Neumann solutions is globally asymptotically stable (Corollary 1). Even in non-strict HCC games, we can add regularization terms to make $L$ strictly convex-concave. Small amounts of regularization allow for convergence without significantly perturbing the von Neumann solution (Theorem 6), while increasing regularization enables exponentially faster convergence rates (Theorem 7).

Organization. In Section 2 we provide some preliminary notation, the definition of our model and some useful technical lemmas. Section 3 is devoted to the presentation of our main results. Section 4 discusses applications of our framework to specific GAN formulations. Section 5 concludes our work with a discussion of future directions and challenges. We defer the full proofs of our results as well as further discussion on applications to the Appendix.

2. PRELIMINARIES

2.1. NOTATION

Vectors are denoted in boldface $\mathbf{x}, \mathbf{y}$ and, unless otherwise indicated, are considered column vectors. We use $\|\cdot\|$ to denote the $\ell_2$-norm. For a function $f : \mathbb{R}^d \to \mathbb{R}$ we use $\nabla f$ to denote its gradient. For functions of two vector arguments, $f(\mathbf{x}, \mathbf{y}) : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$, we use $\nabla_{\mathbf{x}} f, \nabla_{\mathbf{y}} f$ to denote the partial gradients. For the time derivative we will use the dot accent abbreviation, i.e., $\dot{\mathbf{x}} = \frac{d}{dt}[\mathbf{x}(t)]$. A function $f$ belongs to $C^r$ if it is $r$ times continuously differentiable. The term "sigmoid" function refers to $\sigma : \mathbb{R} \to \mathbb{R}$ such that $\sigma(x) = (1 + e^{-x})^{-1}$.

2.2. HIDDEN CONVEX CONCAVE GAMES

We will begin our discussion by defining the notion of convex concave functions as well as strictly convex concave functions. Note that our class of strictly convex concave functions is a superset of the strictly convex strictly concave functions that are usually studied in the literature.

Definition 1. $L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is convex concave if for every $\mathbf{y} \in \mathbb{R}^m$, $L(\cdot, \mathbf{y})$ is convex and for every $\mathbf{x} \in \mathbb{R}^n$, $L(\mathbf{x}, \cdot)$ is concave. The function $L$ will be called strictly convex concave if it is convex concave and for every $(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^n \times \mathbb{R}^m$ either $L(\cdot, \mathbf{y})$ is strictly convex or $L(\mathbf{x}, \cdot)$ is strictly concave.

At the center of our definition of HCC games is a convex concave utility function $L$. Additionally, each player of the game is equipped with a set of operator functions. The minimization player is equipped with $n$ functions $f_i : \mathbb{R}^{n_i} \to \mathbb{R}$ while the maximization player is equipped with $m$ functions $g_j : \mathbb{R}^{m_j} \to \mathbb{R}$. We will assume in the rest of our discussion that $f_i, g_j, L$ are all $C^2$ functions. The inputs $\boldsymbol{\theta}_i \in \mathbb{R}^{n_i}$ and $\boldsymbol{\phi}_j \in \mathbb{R}^{m_j}$ are grouped in two vectors

$$\boldsymbol{\theta} = [\boldsymbol{\theta}_1\ \boldsymbol{\theta}_2\ \cdots\ \boldsymbol{\theta}_n]^\top, \qquad F(\boldsymbol{\theta}) = [f_1(\boldsymbol{\theta}_1)\ f_2(\boldsymbol{\theta}_2)\ \cdots\ f_n(\boldsymbol{\theta}_n)]^\top,$$
$$\boldsymbol{\phi} = [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2\ \cdots\ \boldsymbol{\phi}_m]^\top, \qquad G(\boldsymbol{\phi}) = [g_1(\boldsymbol{\phi}_1)\ g_2(\boldsymbol{\phi}_2)\ \cdots\ g_m(\boldsymbol{\phi}_m)]^\top.$$

We are ready to define the hidden convex concave game

$$(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^N} \arg\max_{\boldsymbol{\phi} \in \mathbb{R}^M} L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})),$$

where $N = \sum_{i=1}^{n} n_i$ and $M = \sum_{j=1}^{m} m_j$. Given a convex concave function $L$, all stationary points of $L$ are (global) Nash equilibria of the min-max game. We will call the set of all equilibria of $L$ the von Neumann solutions of $L$ and denote them by Solution($L$). Unfortunately, Solution($L$) can be empty for games defined over the entire $\mathbb{R}^n \times \mathbb{R}^m$. For games defined over convex compact sets, the existence of at least one solution is guaranteed by von Neumann's minimax theorem.
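Our definition of HCC games can capture games on restricted domains by choosing appropriately bounded functions $f_i$ and $g_j$. In the following sections, we will just assume that Solution($L$) is not empty. We note that our results hold for both bounded and unbounded $f_i$ and $g_j$. We are now ready to write down the equations of the GDA dynamics for an HCC game:

$$\dot{\boldsymbol{\theta}}_i = -\nabla_{\boldsymbol{\theta}_i} L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})) = -\nabla_{\boldsymbol{\theta}_i} f_i(\boldsymbol{\theta}_i)\, \frac{\partial L}{\partial f_i}(F(\boldsymbol{\theta}), G(\boldsymbol{\phi}))$$
$$\dot{\boldsymbol{\phi}}_j = \nabla_{\boldsymbol{\phi}_j} L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})) = \nabla_{\boldsymbol{\phi}_j} g_j(\boldsymbol{\phi}_j)\, \frac{\partial L}{\partial g_j}(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})) \qquad (1)$$

[Figure: schematic of an HCC game. Each parameter block $\boldsymbol{\theta}_i$ feeds $f_i(\boldsymbol{\theta}_i)$, and these outputs form $F(\boldsymbol{\theta})$; each block $\boldsymbol{\phi}_j$ feeds $g_j(\boldsymbol{\phi}_j)$, and these outputs form $G(\boldsymbol{\phi})$. Both enter $L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi}))$, and the blocks are updated by $\dot{\boldsymbol{\theta}}_i = -\nabla_{\boldsymbol{\theta}_i} L$ and $\dot{\boldsymbol{\phi}}_j = \nabla_{\boldsymbol{\phi}_j} L$.]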
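As an illustration of Equation (1), the following is a minimal sketch of an explicit Euler discretization of the GDA dynamics for a one-dimensional hidden game with sigmoid operators. The utility `L(x, y) = (x - 0.3)^2 + (x - 0.3)(y - 0.7) - (y - 0.7)^2`, the step size, and all function names are assumptions chosen for the example; they do not come from the paper.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical hidden utility (an assumption for illustration):
#   L(x, y) = (x - P)^2 + (x - P)(y - Q) - (y - Q)^2,
# strictly convex in x, strictly concave in y, with saddle point (P, Q).
P, Q = 0.3, 0.7


def dL_dx(x: float, y: float) -> float:
    return 2.0 * (x - P) + (y - Q)


def dL_dy(x: float, y: float) -> float:
    return (x - P) - 2.0 * (y - Q)


def gda_step(theta: float, phi: float, dt: float) -> tuple[float, float]:
    """One explicit Euler step of Equation (1) with f = g = sigmoid."""
    f, g = sigmoid(theta), sigmoid(phi)
    df, dg = f * (1.0 - f), g * (1.0 - g)        # sigmoid derivatives
    theta_new = theta - dt * df * dL_dx(f, g)    # descent for the min player
    phi_new = phi + dt * dg * dL_dy(f, g)        # ascent for the max player
    return theta_new, phi_new


def run_gda(theta0: float, phi0: float, dt: float = 0.01,
            steps: int = 50_000) -> tuple[float, float]:
    """Run discretized GDA and return the final outputs (F, G)."""
    theta, phi = theta0, phi0
    for _ in range(steps):
        theta, phi = gda_step(theta, phi, dt)
    return sigmoid(theta), sigmoid(phi)


if __name__ == "__main__":
    print(run_gda(2.0, -1.5))  # outputs should approach (P, Q)
```

The chain rule appears explicitly in `gda_step`: each player's parameter gradient is the derivative of its operator multiplied by the partial derivative of the hidden utility with respect to that operator's output.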

2.3. REPARAMETRIZATION

The following lemma is useful in studying the dynamics of hidden games.

Lemma 1. Let $k : \mathbb{R}^d \to \mathbb{R}$ be a $C^2$ function. Let $h : \mathbb{R} \to \mathbb{R}$ be a $C^1$ function and let $\mathbf{x}(t)$ denote the unique solution of the dynamical system $\Sigma_1$. Then the unique solution of the dynamical system $\Sigma_2$ is $\mathbf{z}(t) = \mathbf{x}\left(\int_0^t h(s)\,ds\right)$, where

$$\Sigma_1 : \ \dot{\mathbf{x}} = \nabla k(\mathbf{x}),\ \ \mathbf{x}(0) = \mathbf{x}_{init} \qquad\qquad \Sigma_2 : \ \dot{\mathbf{z}} = h(t)\nabla k(\mathbf{z}),\ \ \mathbf{z}(0) = \mathbf{x}_{init} \qquad (2)$$

[Figure 1: Neither Gradient Descent nor Ascent can traverse stationary points. An immediate consequence of Lemma 1 is that if we initialize in the above example $\theta_i(0)$ at (a), $f_i(\theta_i(t))$ cannot escape the purple section. This extends to cases where $\boldsymbol{\theta}_i$ is a vector of variables. Horizontal axis: $\theta_i$ (ticks at -3, -2, -1, 0, 1.5, 4); vertical axis: $f_i(\theta_i)$ with the equilibrium value $f_i^*$ marked; labeled points (a) through (g) on colored sections.]

By choosing $h(t) = -\partial L(F(t), G(t))/\partial f_i$ and $h(t) = \partial L(F(t), G(t))/\partial g_j$ respectively, we can connect the dynamics of each $\boldsymbol{\theta}_i$ and $\boldsymbol{\phi}_j$ under Equation (1) to gradient ascent on $f_i$ and $g_j$. Applying Lemma 1, we get that the trajectories of $\boldsymbol{\theta}_i$ and $\boldsymbol{\phi}_j$ under Equation (1) are restricted to be subsets of the corresponding gradient ascent trajectories with the same initializations. For example, in Figure 1, $\theta_i(t)$ cannot escape the purple section if it is initialized at (a), nor the orange section if it is initialized at (f). This limits the values that $f_i(t)$ and $g_j(t)$ can attain for a specific initialization. Let us thus define the following:

Definition 2. For each initialization $\mathbf{x}(0)$ of $\Sigma_1$, $\mathrm{Im}_k(\mathbf{x}(0))$ is the image of $k \circ \mathbf{x} : \mathbb{R} \to \mathbb{R}$.

Applying Definition 2 in the above example, $\mathrm{Im}_{f_i}(\theta_i(0)) = (f_i(-2), f_i(-1))$ if $\theta_i$ is initialized at (c). Additionally, observe that in each colored section $f_i(\theta_i(t))$ uniquely identifies $\theta_i(t)$. Generally, even in the case that the $\boldsymbol{\theta}_i$ are vectors, Lemma 1 implies that for a given $\boldsymbol{\theta}_i(0)$, $f_i(\boldsymbol{\theta}_i(t))$ uniquely identifies $\boldsymbol{\theta}_i(t)$. As a result, we obtain a new dynamical system involving only $f_i$ and $g_j$:

Theorem 1.
For each initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ of Equation (1), there are $C^1$ functions $X_{\boldsymbol{\theta}_i(0)}, X_{\boldsymbol{\phi}_j(0)}$ such that $\boldsymbol{\theta}_i(t) = X_{\boldsymbol{\theta}_i(0)}(f_i(t))$ and $\boldsymbol{\phi}_j(t) = X_{\boldsymbol{\phi}_j(0)}(g_j(t))$. If $(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))$ satisfy Equation (1), then $f_i(t) = f_i(\boldsymbol{\theta}_i(t))$ and $g_j(t) = g_j(\boldsymbol{\phi}_j(t))$ satisfy

$$\dot{f}_i = -\left\|\nabla_{\boldsymbol{\theta}_i} f_i\big(X_{\boldsymbol{\theta}_i(0)}(f_i)\big)\right\|^2 \frac{\partial L}{\partial f_i}(F, G), \qquad \dot{g}_j = \left\|\nabla_{\boldsymbol{\phi}_j} g_j\big(X_{\boldsymbol{\phi}_j(0)}(g_j)\big)\right\|^2 \frac{\partial L}{\partial g_j}(F, G). \qquad (3)$$

By determining the ranges of $f_i$ and $g_j$, an initialization clearly dictates whether a von Neumann solution is attainable. In Figure 1, for example, a trajectory starting at any point of the pink, orange or blue colored sections, like (e), (f) or (g), cannot converge to a von Neumann solution with $f_i(\theta_i) = f_i^*$. The notion of safety captures which initializations can converge to a given element of Solution($L$).

Definition 3. We will call the initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ safe for a $(\mathbf{p}, \mathbf{q}) \in$ Solution($L$) if $\boldsymbol{\theta}_i(0)$ and $\boldsymbol{\phi}_j(0)$ are not stationary points of $f_i$ and $g_j$ respectively, and $p_i \in \mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))$ and $q_j \in \mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))$.

Finally, in the following sections we use some fundamental notions of stability. We call an equilibrium $\mathbf{x}^*$ of an autonomous dynamical system $\dot{\mathbf{x}} = D(\mathbf{x}(t))$ stable if for every neighborhood $U$ of $\mathbf{x}^*$ there is a neighborhood $V$ of $\mathbf{x}^*$ such that if $\mathbf{x}(0) \in V$ then $\mathbf{x}(t) \in U$ for all $t \geq 0$. We call a set $S$ asymptotically stable if there exists a neighborhood $R$ such that for any initialization $\mathbf{x}(0) \in R$, $\mathbf{x}(t)$ approaches $S$ as $t \to +\infty$. If $R$ is the whole space, the set is called globally asymptotically stable.
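The role of safety (Definition 3) can be illustrated with a small numerical sketch. It assumes a hypothetical one-dimensional hidden game with operator `f(theta) = theta^2` for the minimizer, the identity operator for the maximizer, and utility `L(x, y) = (x - p)^2 + (x - p) y - y^2`, whose saddle point is `(p, 0)`. These choices, the Euler discretization, and the name `simulate` are assumptions made for the example.

```python
def simulate(p: float, theta0: float, phi0: float = 0.0,
             dt: float = 1e-3, steps: int = 20_000) -> tuple[float, float, float]:
    """Euler-discretized GDA for the hypothetical hidden game
    L(f(theta), phi) with f(theta) = theta^2 and
    L(x, y) = (x - p)^2 + (x - p) * y - y^2  (saddle point at (p, 0)).
    Returns the final (theta, f(theta), phi)."""
    theta, phi = theta0, phi0
    for _ in range(steps):
        x = theta * theta                  # output of the operator f
        dL_dx = 2.0 * (x - p) + phi
        dL_dy = (x - p) - 2.0 * phi
        # Chain rule: d/dtheta L = f'(theta) * dL/dx with f'(theta) = 2 theta.
        theta, phi = theta - dt * 2.0 * theta * dL_dx, phi + dt * dL_dy
    return theta, theta * theta, phi


if __name__ == "__main__":
    # p = 1 lies in the image (0, inf) reachable from any theta0 != 0:
    # both initializations are safe, but they reach different parameters.
    print(simulate(1.0, 0.5))    # theta approaches +1, f approaches 1
    print(simulate(1.0, -2.0))   # theta approaches -1, f approaches 1
    # p = -1 is not attainable by f = theta^2, so no initialization is safe:
    # the trajectory stalls at the stationary point theta = 0 of f.
    print(simulate(-1.0, 0.5))   # f approaches 0, far from p = -1
```

The first two runs show a single hidden solution implemented by two distinct parameter equilibria, while the third shows a trajectory attracted to a fixed point of Equation (1) that is not a von Neumann solution.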

3.1. GENERAL CASE

Our main results are based on designing a Lyapunov function for the dynamics of Equation (3):

Lemma 2. If $L$ is convex concave and $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe for $(\mathbf{p}, \mathbf{q}) \in$ Solution($L$), then the following quantity is non-increasing under the dynamics of Equation (3):

$$H(F, G) = \sum_{i=1}^{n} \int_{p_i}^{f_i} \frac{z - p_i}{\left\|\nabla f_i\big(X_{\boldsymbol{\theta}_i(0)}(z)\big)\right\|^2}\, dz + \sum_{j=1}^{m} \int_{q_j}^{g_j} \frac{z - q_j}{\left\|\nabla g_j\big(X_{\boldsymbol{\phi}_j(0)}(z)\big)\right\|^2}\, dz \qquad (4)$$

[Figure 2: Level sets of the Lyapunov function of Equation (4) when both $F$ and $G$ are one-dimensional sigmoid functions. Axes: $F(\theta)$ and $G(\phi)$; the point $(p, q)$ is marked.]

Observe that our Lyapunov function here is not the distance to $(\mathbf{p}, \mathbf{q})$ as in a classical convex concave game. The gradient terms account for the non-constant multiplicative terms in Equation (3). Indeed, if the game were not hidden and $f_i$ and $g_j$ were the identity functions, then $H$ would reduce to half the squared Euclidean distance to $(\mathbf{p}, \mathbf{q})$. Our first theorem employs the above Lyapunov function to show that $(\mathbf{p}, \mathbf{q})$ is stable for Equation (3).

Theorem 2. If $L$ is convex concave and $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe for $(\mathbf{p}, \mathbf{q}) \in$ Solution($L$), then $(\mathbf{p}, \mathbf{q})$ is stable for Equation (3).

Clearly, for the special case of globally invertible functions $F, G$ we could come up with an equivalent Lyapunov function in the $\boldsymbol{\theta}, \boldsymbol{\phi}$-space. In this case it is straightforward to transfer the stability results from the induced dynamical system of $F, G$ (Equation (3)) to the initial dynamical system of $\boldsymbol{\theta}, \boldsymbol{\phi}$ (Equation (1)). For example, we can prove the following result:

Theorem 3. If $f_i$ and $g_j$ are sigmoid functions, $L$ is convex concave, and there is a $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ that is safe for $(\mathbf{p}, \mathbf{q}) \in$ Solution($L$), then $(F^{-1}(\mathbf{p}), G^{-1}(\mathbf{q}))$ is stable for Equation (1).

In the general case, though, stability may not be guaranteed in the parameter space of Equation (1). We will instead prove a weaker notion of stability, which we call hidden stability.
Hidden stability captures that if $(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0)))$ is close to a von Neumann solution, then $(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t)))$ will remain close to that solution. Even though hidden stability is weaker, it is essentially what we are interested in, as the output space determines the utility that each player gets. Here we provide sufficient conditions for hidden stability.

Theorem 4 (Hidden Stability). Let $(\mathbf{p}, \mathbf{q}) \in$ Solution($L$). Let $R_{f_i}$ and $R_{g_j}$ be the sets of regular values of $f_i$ and $g_j$ respectively. Assume that there is a $\xi > 0$ such that $[p_i - \xi, p_i + \xi] \subseteq R_{f_i}$ and $[q_j - \xi, q_j + \xi] \subseteq R_{g_j}$. Define $r(t) = \|F(\boldsymbol{\theta}(t)) - \mathbf{p}\|^2 + \|G(\boldsymbol{\phi}(t)) - \mathbf{q}\|^2$. If $f_i$ and $g_j$ are proper functions, then for every $\epsilon > 0$ there is a $\delta > 0$ such that $r(0) < \delta \implies \forall t \geq 0 : r(t) < \epsilon$.

Unfortunately, hidden stability still does not imply convergence to von Neumann solutions. Vlatakis-Gkaragkounis et al. (2019) studied hidden bilinear games and proved that $\dot{H} = 0$ for this special class of HCC games. Hence, a trajectory is restricted to be a subset of a level set of $H$, which is bounded away from the equilibrium as shown in Figure 2. To sidestep this, in the next subsection we will require the hidden game to be strictly convex concave.
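The monotonicity of $H$ claimed in Lemma 2 can be checked numerically. The sketch below assumes one-dimensional sigmoid operators, for which $\nabla f(X(z)) = z(1 - z)$, and the hypothetical utility `L(x, y) = (x - 0.3)^2 + (x - 0.3)(y - 0.7) - (y - 0.7)^2`. The integrals of Equation (4) are approximated with a trapezoid rule, and the dynamics are discretized with explicit Euler steps; both approximations, and all names, are choices made for this example.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


P, Q = 0.3, 0.7  # saddle point of the hypothetical hidden utility


def hidden_integral(target: float, value: float, n: int = 2000) -> float:
    """Trapezoid rule for the integral of (z - target) / (z (1 - z))^2
    from `target` to `value`: one summand of Equation (4) for a sigmoid."""
    h = (value - target) / n
    total = 0.0
    for k in range(n + 1):
        z = target + k * h
        w = (z - target) / (z * (1.0 - z)) ** 2
        total += w if 0 < k < n else 0.5 * w
    return total * h


def lyapunov_H(f: float, g: float) -> float:
    return hidden_integral(P, f) + hidden_integral(Q, g)


def trajectory_H(theta: float, phi: float, dt: float = 0.01,
                 steps: int = 10_000, every: int = 500) -> list[float]:
    """Record H along an Euler-discretized GDA trajectory of Equation (1)."""
    values = []
    for step in range(steps + 1):
        f, g = sigmoid(theta), sigmoid(phi)
        if step % every == 0:
            values.append(lyapunov_H(f, g))
        dL_df = 2.0 * (f - P) + (g - Q)
        dL_dg = (f - P) - 2.0 * (g - Q)
        theta -= dt * f * (1.0 - f) * dL_df
        phi += dt * g * (1.0 - g) * dL_dg
    return values


if __name__ == "__main__":
    for value in trajectory_H(2.0, -1.5):
        print(f"{value:.6f}")  # the printed sequence should not increase
```

For this strictly convex concave example the recorded values decrease; for a hidden bilinear utility the continuous-time value of $H$ would instead stay constant, in line with the discussion above.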

3.2. HIDDEN STRICTLY CONVEX CONCAVE GAMES

In this subsection we focus on the case where $L$ is a strictly convex concave function. Based on Definition 1, a strictly convex concave game is not necessarily strictly convex strictly concave, and thus it may have a continuum of von Neumann solutions. Despite this, LaSalle's invariance principle, combined with the strict convexity-concavity, allows us to prove that if $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe for $Z \subseteq$ Solution($L$) then $Z$ is locally asymptotically stable for Equation (3).

Lemma 3. Let $L$ be strictly convex concave and let $Z \subset$ Solution($L$) be the non-empty set of equilibria of $L$ for which $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe. Then $Z$ is locally asymptotically stable for Equation (3).

The above lemma, however, does not suffice to prove that for an arbitrary initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$, $(F(t), G(t))$ approaches $Z$ as $t \to +\infty$. In other words, a priori it is unclear whether $(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0)))$ is necessarily inside the region of attraction (ROA) of $Z$. To get a refined estimate of the ROA of $Z$, we analyze the behavior of $H$ as $f_i$ and $g_j$ approach the boundaries of $\mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))$ and $\mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))$; more precisely, we show that the level sets of $H$ are bounded. Once again, the corresponding analysis is trivial for convex concave games, since the level sets are spheres around the equilibria.

Theorem 5. Let $L$ be strictly convex concave and let $Z \subset$ Solution($L$) be the non-empty set of equilibria of $L$ for which $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe. Under the dynamics of Equation (1), $(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t)))$ converges to a point in $Z$ as $t \to \infty$.

The theorem above guarantees convergence to a von Neumann solution for all initializations that are safe for at least one element of Solution($L$). However, this is not the same as global asymptotic stability. To get even stronger guarantees, we can assume that all initializations are safe. In this case it is straightforward to get a global asymptotic stability result:

Corollary 1.
Let $L$ be strictly convex concave and assume that all initializations are safe for at least one element of Solution($L$). Then the following set is globally asymptotically stable for continuous GDA dynamics:

$$\{(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) \in \mathbb{R}^N \times \mathbb{R}^M : (F(\boldsymbol{\theta}^*), G(\boldsymbol{\phi}^*)) \in \text{Solution}(L)\}$$

Notice that the above approach to global asymptotic convergence using Lyapunov arguments can be extended to other popular gradient-based heuristics, such as variations of Hamiltonian Gradient Descent. For concision, we defer the exact statements and proofs to Section 8.2.2.
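The LaSalle argument behind Lemma 3 and Theorem 5 depends on where $\dot{H}$ vanishes. As a hedged sketch (not the appendix proof), differentiating Equation (4) along Equation (3) and applying the first-order convexity and concavity inequalities gives the chain below. Here $(\mathbf{p}, \mathbf{q})$ is a von Neumann solution for which the initialization is safe, $\nabla_x L$ and $\nabla_y L$ denote the gradients of $L$ in its first and second argument, and the last step uses the saddle inequalities $L(\mathbf{p}, G) \le L(\mathbf{p}, \mathbf{q}) \le L(F, \mathbf{q})$.

```latex
\begin{aligned}
\dot{H}
 &= \sum_{i} \frac{f_i - p_i}{\|\nabla f_i\|^2}\,\dot{f}_i
  + \sum_{j} \frac{g_j - q_j}{\|\nabla g_j\|^2}\,\dot{g}_j \\
 &= -\sum_{i} (f_i - p_i)\,\frac{\partial L}{\partial f_i}(F, G)
  + \sum_{j} (g_j - q_j)\,\frac{\partial L}{\partial g_j}(F, G) \\
 &= \nabla_x L(F, G)^{\top} (\mathbf{p} - F)
  - \nabla_y L(F, G)^{\top} (\mathbf{q} - G) \\
 &\le \bigl[ L(\mathbf{p}, G) - L(F, G) \bigr]
  + \bigl[ L(F, G) - L(F, \mathbf{q}) \bigr] \\
 &= L(\mathbf{p}, G) - L(F, \mathbf{q}) \;\le\; 0 .
\end{aligned}
```

Under the strictness condition of Definition 1, at any $(F, G)$ the inequality in the fourth line is strict unless $F = \mathbf{p}$ (when $L(\cdot, G)$ is strictly convex) or $G = \mathbf{q}$ (when $L(F, \cdot)$ is strictly concave). This restricts the set on which $\dot{H} = 0$, which is the information LaSalle's principle requires.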

3.3. CONVERGENCE VIA REGULARIZATION

Regularization is a key technique that works both in the practice of GANs (Mescheder et al., 2018; Kurach et al., 2019) and in the theory of convex concave games (Pérolat et al., 2020; Roth et al., 2017; Sanjabi et al., 2018). Our setting of hidden convex concave games allows for provable guarantees for regularization in a wide class of settings, bringing practical and theoretical guarantees closer together. Consider a utility $L(\mathbf{x}, \mathbf{y})$ that is convex concave but not strictly so. Here we propose a modified utility $L'$ that is strictly convex strictly concave. Specifically, we choose

$$L'(\mathbf{x}, \mathbf{y}) = L(\mathbf{x}, \mathbf{y}) + \frac{\lambda}{2}\|\mathbf{x}\|^2 - \frac{\lambda}{2}\|\mathbf{y}\|^2 .$$

The choice of the parameter $\lambda$ captures the trade-off between convergence to the original equilibrium of $L$ and convergence speed. On the one hand, invoking the implicit function theorem, we get that for small $\lambda$ the equilibria of $L$ are not significantly perturbed.

Theorem 6. If $L$ is a convex concave function with invertible Hessians at all its equilibria, then for each $\epsilon > 0$ there is a $\lambda > 0$ such that $L'$ has equilibria that are $\epsilon$-close to the ones of $L$.

Note that invertibility of the Hessian means that $L$ must have a unique equilibrium. On the other hand, increasing $\lambda$ increases the rate of convergence of safe initializations to the perturbed equilibrium.

Theorem 7. Let $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ be a safe initialization for the unique equilibrium $(\mathbf{p}, \mathbf{q})$ of $L'$. If $r(t) = \|F(\boldsymbol{\theta}(t)) - \mathbf{p}\|^2 + \|G(\boldsymbol{\phi}(t)) - \mathbf{q}\|^2$, then there are initialization dependent constants $c_0, c_1 > 0$ such that $r(t) \leq c_0 \exp(-\lambda c_1 t)$.

4. APPLICATIONS

In this section we show how our theorems provide connections between the framework of hidden games and practical applications of min-max optimization like training GANs.

Hidden strictly convex-concave games. We will start our discussion with the fundamental generative architecture of Goodfellow et al. (2014a)'s GAN.
In the vanilla GAN architecture, as it is commonly called, our goal is to find a generator distribution $p_G$ that is close to an input data distribution $p_{data}$. To find such a generator function, we can use a discriminator $D$ that "criticizes" the deviations of the generator from the input data distribution. For the case of a discrete $p_{data}$ over a set $\mathcal{N}$, the minimax problem of Goodfellow et al. (2014a) is the following:

$$\min_{p_G(x) \geq 0,\ \sum_{x \in \mathcal{N}} p_G(x) = 1}\ \ \max_{D \in (0,1)^{|\mathcal{N}|}} V(G, D) = \sum_{x \in \mathcal{N}} p_{data}(x) \log(D(x)) + \sum_{x \in \mathcal{N}} p_G(x) \log(1 - D(x))$$

The problem above can be formulated as a constrained strictly convex-concave hidden game. On the one hand, for a fixed discriminator $D^*$, $V(G, D^*)$ is linear in the $p_G(x)$. On the other hand, for a fixed generator $G^*$, $V(G^*, D)$ is strongly concave. We can implement the inequality constraints on both the generator probabilities and the discriminator using sigmoid activations. For the equality constraint $\sum_{x \in \mathcal{N}} p_G(x) = 1$ we can introduce a Lagrange multiplier. Having effectively removed the constraints, we can see in Figure 3 that the dynamics of Equation (1) converge to the unique equilibrium of the game, an outcome consistent with our results in Corollary 1. It is worth noting that while the Euclidean distance to the equilibrium is not monotonically decreasing, $H(t)$ is. We defer the detailed proof of convergence to Section 9.2.

[Figure 3. Quantity shown: $r(t) = \|F(t) - \mathbf{p}\|^2 + \|G(t) - \mathbf{q}\|^2 + |\lambda - \lambda^*|^2$.]

Hidden convex-concave games & Regularization. An even more interesting case is Wasserstein GANs (WGANs) (Arjovsky et al., 2017). One of the contributions of Lei et al. (2019) is to show that WGANs trained with Stochastic GDA can learn the parameters of Gaussian distributions whose samples are transformed by non-linear activation functions. It is worth mentioning that the original WGAN formulation has a Lipschitz constraint on the discriminator function. For simplicity, Lei et al. (2019) replaced this constraint with a quadratic regularizer.
The min-max problem for the case of a one-dimensional Gaussian $\mathcal{N}(0, \alpha_*^2)$ and a linear discriminator $D_v(x) = vx$ with $x^2$ activation is:

$$\min_{\alpha \in \mathbb{R}} \max_{v \in \mathbb{R}} V_{WGAN}(G_\alpha, D_v) = \mathbb{E}_{X \sim p_{data}}[D(X)] - \mathbb{E}_{X \sim p_G}[D(X)] - v^2/2 = \mathbb{E}_{x \sim \mathcal{N}(0, \alpha_*^2)}[v x^2] - \mathbb{E}_{x \sim \mathcal{N}(0, \alpha^2)}[v x^2] - v^2/2 = (\alpha_*^2 - \alpha^2) v - v^2/2$$

Observe that $V_{WGAN}$ is not convex-concave, but it can be posed as a hidden strictly convex-concave game with $G(\alpha) = (\alpha_*^2 - \alpha^2)$ and $F(v) = v$. When computing expectations analytically without sampling, Theorem 5 guarantees convergence. In contrast, without the regularizer $V_{WGAN}$ can be modeled as a hidden linear-linear game and thus GDA dynamics cycle. Empirically, these results are robust to discrete and stochastic updates using sampling, as shown in Figure 4. Therefore regularization in the work of Lei et al. (2019) was a vital ingredient in their proof strategy and not just an implementation detail. In Section 9.3, we also discuss applications of regularization to normal form zero-sum games.

[Figure 4: [...] All trajectories converge to one of the two equilibria $(0, 1)$ and $(0, -1)$, whereas without regularization GDA would cycle on the level sets. In the right figure, we replace the exact expectations in $V_{WGAN}$ with approximations via sampling, and continuous time updates on $\alpha$ and $v$ with discrete ones. For small learning rates and large sample sizes, unregularized GDA continues to cycle. In contrast, the regularization approach of Lei et al. (2019) converges to the $(0, 1)$ equilibrium. Axes: $a$ and $v$.]

The two applications of HCC games in GANs are not isolated findings but instances of a broader pattern that connects HCC games and standard GAN formulations. As noted by Goodfellow (2017), if updates in GAN applications were directly performed in the "functional space", i.e. the generator and discriminator outputs, then standard arguments from convex concave optimization would imply convergence to global Nash equilibria.
Indeed, standard GAN formulations like the vanilla GAN (Goodfellow et al., 2014a), f-GAN (Nowozin et al., 2016) and WGAN (Arjovsky et al., 2017) can all be thought of as convex concave games in the space of generator and discriminator outputs. Given that the connections between convex concave games and standard GAN objectives in the output space are missing from recent literature, in Section 9.1 we show how one can apply von Neumann's minimax theorem to derive the optimal generators and discriminators even in the non-realizable case. In practice, the updates happen in the parameter space and thus convexity arguments no longer apply. Our study of HCC games is a stepping stone towards bridging the gap in convergence guarantees between the case of direct updates in the output space and the parameter space.
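The contrast between regularized and unregularized GDA in the one-dimensional WGAN example can be reproduced with a short sketch. It applies explicit Euler steps to $V_{WGAN} = (\alpha_*^2 - \alpha^2)v - v^2/2$ with exact expectations; the value $\alpha_* = 1$, the step size, the initialization, and the function name are assumptions made for the example, and the discretization is not the sampling-based procedure of Lei et al. (2019).

```python
import math

ALPHA_STAR = 1.0  # assumed true standard deviation, for illustration only


def simulate_wgan(alpha: float, v: float, regularized: bool,
                  dt: float = 0.002, steps: int = 15_000):
    """Euler-discretized GDA on V = (a*^2 - alpha^2) v - [v^2 / 2 if regularized].
    alpha descends on V and v ascends on V. Returns the final (alpha, v)
    and the distance to the nearest equilibrium (|alpha| = a*, v = 0)
    after every step."""
    dists = []
    for _ in range(steps):
        grad_alpha = -2.0 * alpha * v
        grad_v = ALPHA_STAR ** 2 - alpha ** 2 - (v if regularized else 0.0)
        alpha, v = alpha - dt * grad_alpha, v + dt * grad_v
        dists.append(math.hypot(abs(alpha) - ALPHA_STAR, v))
    return alpha, v, dists


if __name__ == "__main__":
    a_reg, v_reg, d_reg = simulate_wgan(0.5, 0.5, regularized=True)
    a_raw, v_raw, d_raw = simulate_wgan(0.5, 0.5, regularized=False)
    print("regularized final distance:", d_reg[-1])
    print("unregularized min distance over last third:", min(d_raw[-5000:]))
```

With the quadratic regularizer the distance to the equilibrium shrinks, while without it the iterates keep orbiting at a distance bounded away from zero, matching the cycling behavior described above.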

5. DISCUSSION

In this work, we introduce a class of non-convex non-concave games that we call hidden convex concave (HCC) games. In this class of games, the competition on the output/operator space has a convex concave structure but training happens in the input/parameter space, where the mappings between input and output space are smooth but non-convex non-concave functions. The main inspiration for this class is the indirect competition of the parameters of the generator and discriminator in GAN architectures. Our analysis combines ideas from game theory, dynamical systems and control theory, such as Lyapunov functions and LaSalle's theorem. Our convergence results favor not arbitrary local Nash equilibria, but only von Neumann solutions. To the best of our knowledge, such last iterate convergence results are the first of their kind. Given the modular structure of our model and proofs, HCC games show particular promise as a theoretical testbed for studying which dynamics are better suited to which GANs. We believe that further positive results of this kind for different combinations of GAN formulations and learning algorithms are possible by properly adapting our current techniques.

6.1. BACKGROUND IN DYNAMICAL SYSTEMS

Our analysis combines tools from dynamical systems, stability analysis and invariance principles. We start with the definitions of the different stability notions. We recall Lyapunov's well-known stability criterion (Theorem 8). Stability analysis in convex concave games is further complicated by the possibility of non-isolated fixed points. To tackle this issue, we recall the Krasovskii-LaSalle Invariance Principle (Theorem 9), a powerful result that has several implications for the asymptotic stability of a set in an autonomous (possibly nonlinear) dynamical system. In the special case where the goal set contains only stable fixed points, a pointwise convergence theorem can be derived (Theorem 10). Finally, we recall the notions of diffeomorphism and topological conjugacy of two dynamical systems, which are useful to transfer behavioral claims between equivalent dynamics.

Let $f : D \to \mathbb{R}^n$ be a locally Lipschitz map from a domain $D \subset \mathbb{R}^n$ to $\mathbb{R}^n$. We consider dynamical systems of the form

$$\dot{x} = f(x).$$

A point $x$ for which $f(x) = 0$ is called a fixed point. We will be interested in the following notions of stability for the fixed points of the system $\dot{x} = f(x)$.

Definition 4 (Stability properties, (Khalil, 2002, Definition 4.1)). The fixed point $x = 0$ of the system $\dot{x} = f(x)$ is
• stable if, for each $\epsilon > 0$, there is a $\delta = \delta(\epsilon) > 0$ such that $\|x(0)\| < \delta \implies \|x(t)\| < \epsilon\ \ \forall t \geq 0$
• unstable if it is not stable
• asymptotically stable if it is stable and $\delta$ can be chosen such that $\|x(0)\| < \delta \implies \lim_{t \to \infty} x(t) = 0$

The Lyapunov Theorem will be a useful tool to prove (asymptotic) stability of a fixed point.

Theorem 8 (Lyapunov Theorem, (Khalil, 2002, Theorem 4.1)). Let $x = 0$ be a fixed point of the system $\dot{x} = f(x)$ and let $D \subset \mathbb{R}^n$ be a domain containing $x = 0$. Let $V : D \to \mathbb{R}$ be a continuously differentiable function such that $V(0) = 0$ and $V(x) > 0$ in $D - \{0\}$, and $\dot{V}(x) \leq 0$ in $D$. Then $x = 0$ is stable. Moreover, if $\dot{V}(x) < 0$ in $D - \{0\}$, then $x = 0$ is asymptotically stable.
Unfortunately, the Lyapunov theorem is not very helpful when it comes to proving convergence in dynamical systems with non-isolated fixed points. By definition, non-isolated fixed points cannot be asymptotically stable. Non-isolated fixed points may give rise to more complex behaviour than pointwise convergence.

Definition 5. We say that a trajectory $x(t)$ approaches a set $M$ as $t \to \infty$ if for each $\epsilon > 0$ there is a $T > 0$ such that $\mathrm{dist}(x(t), M) < \epsilon,\ \forall t > T$, where the operator "dist" is the minimum distance from a point to a set $M$: $\mathrm{dist}(p, M) = \inf_{x \in M} \|p - x\|$.

Definition 6. We say that a set $M$ is invariant for the system $\dot{x} = f(x)$ if $x(0) \in M \implies x(t) \in M,\ \forall t \in \mathbb{R}$. We will say $M$ is positively invariant if the above holds for $t \geq 0$.

We are ready to state LaSalle's Invariance Principle, a general theorem that can help us study the stability of non-isolated fixed points.

Theorem 9 (LaSalle's Invariance Principle, (Khalil, 2002, Theorem 4.4)). Let $\Omega \subset D$ be a compact set that is positively invariant with respect to the system $\dot{x} = f(x)$. Let $V : D \to \mathbb{R}$ be a continuously differentiable function such that $\dot{V}(x) \leq 0$ in $\Omega$. Let $E$ be the set of all points in $\Omega$ where $\dot{V}(x) = 0$. Let $M$ be the largest invariant set in $E$. Then every solution starting in $\Omega$ approaches $M$ as $t \to \infty$.

LaSalle [...] $\forall \mathbf{x} \in A, t \in \mathbb{R} : g(\Phi^t(\mathbf{x})) = \Psi^t(g(\mathbf{x}))$. Furthermore, two flows $\Phi^t : A \to A$ and $\Psi^t : B \to B$ are diffeomorphic if there exists a diffeomorphism $g : A \to B$ such that $\forall \mathbf{x} \in A, t \in \mathbb{R} : g(\Phi^t(\mathbf{x})) = \Psi^t(g(\mathbf{x}))$. If two flows are diffeomorphic, then their vector fields are related by the derivative of the conjugacy. That is, we get precisely the same result that we would have obtained if we had simply transformed the coordinates in their differential equations.

6.2. BACKGROUND IN CONVEX OPTIMIZATION

For the sake of completeness, we recall here the definition of a (strictly) convex/concave function and its first-order necessary and sufficient criterion. We also discuss strong convexity and its second-order characterization. These notions from convex optimization are used throughout this work.

Definition 9 ((Boyd & Vandenberghe, 2004, p. 67)). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. Then
• $f$ is convex if $\forall x, y \in \mathbb{R}^n, t \in [0, 1] : f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y)$;
• $f$ is strictly convex if $\forall x \ne y \in \mathbb{R}^n, t \in (0, 1) : f(tx + (1 - t)y) < t f(x) + (1 - t) f(y)$;
• $f$ is (strictly) concave if $-f$ is (strictly) convex.

We will also use the first-order characterizations of convex and concave functions.

Theorem 11 ((Boyd & Vandenberghe, 2004, p. 69-70)). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function.
• $f$ is convex if and only if $\forall x, y \in \mathbb{R}^n : f(y) \ge f(x) + \nabla f(x)^T (y - x)$;
• $f$ is concave if and only if $\forall x, y \in \mathbb{R}^n : f(y) \le f(x) + \nabla f(x)^T (y - x)$.

To establish convergence rates, we will use the notion of strong convexity.

Definition 10 ((Nesterov, 2004, p. 63)). A continuously differentiable function $f$ on $\mathbb{R}^n$ is called $\mu$-strongly convex for a positive constant $\mu$ if for all $x, y \in \mathbb{R}^n$ we have

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|x - y\|^2.$$

We will also use the second-order characterization of strong convexity.

Theorem 12 ((Nesterov, 2004, p. 65)). A twice continuously differentiable function $f$ is $\mu$-strongly convex for a positive constant $\mu$ if and only if for all $x \in \mathbb{R}^n$ we have

$$\nabla^2 f(x) \succeq \mu I.$$

Symmetrically, a function $f$ is called $\mu$-strongly concave if $-f$ is $\mu$-strongly convex.
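To make Definition 10 and Theorem 12 concrete, the following sketch checks the strong convexity inequality numerically on a hypothetical quadratic $f(x) = x_1^2 + x_1 x_2 + x_2^2$, chosen only for illustration. Its Hessian has eigenvalues 1 and 3, so the inequality should hold with $\mu = 1$ and fail for any larger $\mu$. All names and the sampling range are assumptions of this example.

```python
import random


def f(x):
    """Hypothetical test function with Hessian [[2, 1], [1, 2]] (eigenvalues 1 and 3)."""
    x1, x2 = x
    return x1 * x1 + x1 * x2 + x2 * x2


def grad_f(x):
    x1, x2 = x
    return (2.0 * x1 + x2, x1 + 2.0 * x2)


def strong_convexity_gap(x, y, mu):
    """f(y) - [f(x) + <grad f(x), y - x> + (mu / 2) ||x - y||^2].

    Definition 10 holds at the pair (x, y) exactly when this gap is nonnegative.
    """
    g = grad_f(x)
    d = (y[0] - x[0], y[1] - x[1])
    inner = g[0] * d[0] + g[1] * d[1]
    sq_norm = d[0] * d[0] + d[1] * d[1]
    return f(y) - f(x) - inner - 0.5 * mu * sq_norm


def holds_on_samples(mu, n_samples=1000, seed=0):
    """Check the inequality on random pairs; a small tolerance absorbs rounding."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))
        y = (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))
        if strong_convexity_gap(x, y, mu) < -1e-9:
            return False
    return True


ok_with_mu_1 = holds_on_samples(1.0)
ok_with_mu_1_5 = holds_on_samples(1.5)
```

A sampling check like this can only refute strong convexity for a given $\mu$, never prove it; the eigenvalue criterion of Theorem 12 is what certifies $\mu = 1$ here.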

6.3. BACKGROUND IN GAME THEORY

In this short section, we recall a generalization of von Neumann's minimax theorem, which we will use to analyze the equilibrium solutions of the different GAN architectures. A special case of Fan's minimax theorem is the following.

Corollary 2 (Fan's minimax theorem, Fan (1953)). Let $X \subset \mathbb{R}^n$ and $Y \subset \mathbb{R}^m$ be convex nonempty sets. Suppose that $X$ is compact and $f : X \times Y \to \mathbb{R}$ is a function such that $f(\cdot, y)$ is lower semicontinuous on $X$ for each $y \in Y$ and that $f$ is convex-concave. Then

$$\min_{x \in X} \sup_{y \in Y} f(x, y) = \sup_{y \in Y} \min_{x \in X} f(x, y).$$

The time-reparametrization lemma below shows that the solution of a non-autonomous system whose vector field is a gradient field multiplied by a scalar function of time can be obtained by simply rescaling time in the solution of the plain gradient ascent dynamics. Indeed, since the multiplicative term is common to all coordinates of the vector field, over time it dictates only the magnitude of the vector field (the speed of the motion) and does not affect its direction, other than moving backwards or forwards along the same trajectory.

Lemma 1. Let $k : \mathbb{R}^d \to \mathbb{R}$ be a $C^2$ function. Let $h : \mathbb{R} \to \mathbb{R}$ be a $C^1$ function and let $x(t)$ denote the unique solution of the dynamical system $\Sigma_1$. Then the unique solution of the dynamical system $\Sigma_2$ is $z(t) = x\left(\int_0^t h(s)\,ds\right)$, where

$$\Sigma_1 : \begin{cases} \dot{x} = \nabla k(x) \\ x(0) = x_{init} \end{cases} \qquad \Sigma_2 : \begin{cases} \dot{z} = h(t) \nabla k(z) \\ z(0) = x_{init} \end{cases} \qquad (5)$$

Proof. First, notice that $x(0) = x_{init}$ and $\dot{x} = \nabla k(x)$, since $x$ is the unique solution of $\Sigma_1$. It is easy to check that

$$z(0) = x\left(\int_0^0 h(s)\,ds\right) = x(0) = x_{init},$$

$$\dot{z} = \dot{x}\left(\int_0^t h(s)\,ds\right) \times \frac{d\left[\int_0^t h(s)\,ds\right]}{dt} = \nabla k\left(x\left(\int_0^t h(s)\,ds\right)\right) h(t) = \nabla k(z)\, h(t).$$

In order to leverage the convex-concave properties of the operators in our hidden structure under the Gradient Descent Ascent dynamics, we need to recover the equivalent system $(T)$ in the operator space

$$\begin{pmatrix} \dot{F} \\ \dot{G} \end{pmatrix} = T \begin{pmatrix} F \\ G \end{pmatrix}.$$
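$$(\Sigma) := \begin{cases} \dot{\boldsymbol{\theta}} = -\nabla_{\boldsymbol{\theta}} L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})) \\ \dot{\boldsymbol{\phi}} = \nabla_{\boldsymbol{\phi}} L(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})) \end{cases} \equiv \begin{cases} \dot{\boldsymbol{\theta}}_i = -\nabla_{\boldsymbol{\theta}_i} f_i(\boldsymbol{\theta}_i)\, h_{f_i, L}(t) \\ \dot{\boldsymbol{\phi}}_j = \nabla_{\boldsymbol{\phi}_j} g_j(\boldsymbol{\phi}_j)\, h_{g_j, L}(t) \end{cases}$$

From this point, applying the aforementioned lemma, under GDA each $f_i$ and $g_j$ follows a time-dependent rescaling of the corresponding gradient ascent solution. Exploiting the monotonicity of $f_i(t)$ and $g_j(t)$ under gradient ascent, we can construct an invertible map between the parameter space $\{(\boldsymbol{\theta}_i, \boldsymbol{\phi}_j)\}$ and the operator space $\{(f_i, g_j)\}$, which allows us to construct the equivalent system $T$ in the operator space. Notice that the properties of gradient ascent are crucial, since the operator space can be arbitrarily smaller in dimension. In this case, a smooth invertible map that is common to all initializations cannot exist.

Theorem 1. For each initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ of Equation (1), there are $C^1$ functions $X_{\boldsymbol{\theta}_i(0)}, X_{\boldsymbol{\phi}_j(0)}$ such that $\boldsymbol{\theta}_i(t) = X_{\boldsymbol{\theta}_i(0)}(f_i(t))$ and $\boldsymbol{\phi}_j(t) = X_{\boldsymbol{\phi}_j(0)}(g_j(t))$. If $(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))$ satisfy Equation (1), then $f_i(t) = f_i(\boldsymbol{\theta}_i(t))$ and $g_j(t) = g_j(\boldsymbol{\phi}_j(t))$ satisfy

$$\dot{f}_i = -\left\|\nabla_{\boldsymbol{\theta}_i} f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\right\|^2 \frac{\partial L}{\partial f_i}(F, G), \qquad \dot{g}_j = \left\|\nabla_{\boldsymbol{\phi}_j} g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\right\|^2 \frac{\partial L}{\partial g_j}(F, G).$$

Proof. Let us first study a simpler dynamical system $(\Sigma^*)$ with unique solution $\gamma_{\boldsymbol{\theta}_i(0)}(t)$:

$$(\Sigma^*) \equiv \begin{cases} \dot{z} = \nabla f_i(z) \\ z(0) = \boldsymbol{\theta}_i(0) \end{cases}$$

It is easy to observe that along $(\Sigma^*)$

$$\dot{f}_i = \nabla f_i(z)^T \dot{z} = \|\nabla f_i(z)\|^2.$$

If $\boldsymbol{\theta}_i(0)$ is a stationary point of $f_i$, then the trajectory of $z$ is a single point. But the trajectory of $\boldsymbol{\theta}_i$ under the dynamics of Equation (1) is then also a single point, so we can pick the function $X_{\boldsymbol{\theta}_i(0)}(f_i) = \boldsymbol{\theta}_i(0)$. On the other hand, if $\boldsymbol{\theta}_i(0)$ is not a stationary point of $f_i$, then $f_i$ continuously increases along the trajectory of $(\Sigma^*)$. Therefore $A_{\boldsymbol{\theta}_i(0)}(t) = f_i(\gamma_{\boldsymbol{\theta}_i(0)}(t))$ is an increasing function and hence invertible. Let us call its inverse $A^{-1}_{\boldsymbol{\theta}_i(0)}(f_i)$.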
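Let us now recall the $\boldsymbol{\theta}_i$ part of the dynamical system of interest, Equation (1),

$$\dot{\boldsymbol{\theta}}_i = -\nabla_{\boldsymbol{\theta}_i} f_i(\boldsymbol{\theta}_i) \frac{\partial L}{\partial f_i}(F(\boldsymbol{\theta}), G(\boldsymbol{\phi})),$$

initialized at $\boldsymbol{\theta}_i(0)$. Applying Lemma 1 to this equation with $h(t) = -\frac{\partial L}{\partial f_i}(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t)))$, we have that under the dynamics of Equation (1)

$$\boldsymbol{\theta}_i(t) = \gamma_{\boldsymbol{\theta}_i(0)}\left(\int_0^t h(s)\,ds\right). \qquad (P)$$

Thus it holds that

$$f_i(\boldsymbol{\theta}_i(t)) = f_i\left(\gamma_{\boldsymbol{\theta}_i(0)}\left(\int_0^t h(s)\,ds\right)\right) = A_{\boldsymbol{\theta}_i(0)}\left(\int_0^t h(s)\,ds\right),$$

or equivalently

$$\int_0^t h(s)\,ds = A^{-1}_{\boldsymbol{\theta}_i(0)}(f_i(\boldsymbol{\theta}_i(t))).$$

Plugging this back into Equation (P) gives

$$\boldsymbol{\theta}_i(t) = \gamma_{\boldsymbol{\theta}_i(0)}\left(A^{-1}_{\boldsymbol{\theta}_i(0)}(f_i(\boldsymbol{\theta}_i(t)))\right).$$

Therefore we can pick $X_{\boldsymbol{\theta}_i(0)}(f_i) = \gamma_{\boldsymbol{\theta}_i(0)}(A^{-1}_{\boldsymbol{\theta}_i(0)}(f_i))$, which is $C^1$ as a composition of $C^1$ functions. We can perform an equivalent analysis for $\boldsymbol{\phi}_j(0)$ and $g_j$ to pick the $C^1$ function $X_{\boldsymbol{\phi}_j(0)}$. Let us now track the time derivatives of $f_i(\boldsymbol{\theta}_i)$ and $g_j(\boldsymbol{\phi}_j)$:

$$\dot{f}_i = \nabla_{\boldsymbol{\theta}_i} f_i(\boldsymbol{\theta}_i)^T \dot{\boldsymbol{\theta}}_i = -\left\|\nabla_{\boldsymbol{\theta}_i} f_i(\boldsymbol{\theta}_i)\right\|^2 \frac{\partial L}{\partial f_i}(F, G), \qquad \dot{g}_j = \nabla_{\boldsymbol{\phi}_j} g_j(\boldsymbol{\phi}_j)^T \dot{\boldsymbol{\phi}}_j = \left\|\nabla_{\boldsymbol{\phi}_j} g_j(\boldsymbol{\phi}_j)\right\|^2 \frac{\partial L}{\partial g_j}(F, G).$$

We can now substitute $\boldsymbol{\theta}_i = X_{\boldsymbol{\theta}_i(0)}(f_i)$ and $\boldsymbol{\phi}_j = X_{\boldsymbol{\phi}_j(0)}(g_j)$ to get the required equations.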
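The time-rescaling statement of Lemma 1 can be checked numerically on a one-dimensional instance. The sketch below is an illustration under assumed choices that do not come from the text: $k(x) = -x^2/2$, so that $\Sigma_1$ has the closed-form solution $x(t) = x_{init} e^{-t}$, and $h(t) = \cos t$, which changes sign and therefore moves the state both forwards and backwards along the same trajectory. Lemma 1 then predicts $z(t) = x(\sin t) = x_{init} e^{-\sin t}$.

```python
import math


def grad_k(x):
    """Gradient of the assumed k(x) = -x^2 / 2, so Sigma_1 reads x' = -x."""
    return -x


def h(t):
    """Assumed time-dependent multiplier; its integral from 0 to t is sin(t)."""
    return math.cos(t)


def x_exact(s, x_init):
    """Closed-form solution of Sigma_1 at time s (also valid for negative s)."""
    return x_init * math.exp(-s)


def z_predicted(t, x_init):
    """Lemma 1: z(t) = x(int_0^t h(s) ds) = x(sin t)."""
    return x_exact(math.sin(t), x_init)


def integrate_sigma2(x_init, t_end=5.0, dt=1e-3):
    """Integrate z' = h(t) * grad_k(z) with RK4.

    Returns the largest deviation from the Lemma 1 prediction over the run.
    """
    def rhs(t, z):
        return h(t) * grad_k(z)

    steps = int(round(t_end / dt))
    z, worst = x_init, 0.0
    for i in range(steps):
        t = i * dt
        k1 = rhs(t, z)
        k2 = rhs(t + 0.5 * dt, z + 0.5 * dt * k1)
        k3 = rhs(t + 0.5 * dt, z + 0.5 * dt * k2)
        k4 = rhs(t + dt, z + dt * k3)
        z += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        worst = max(worst, abs(z - z_predicted((i + 1) * dt, x_init)))
    return worst


max_error = integrate_sigma2(2.0)
```

The numerically integrated solution of $\Sigma_2$ should agree with the rescaled solution of $\Sigma_1$ up to integration error, which is the mechanism the proof of Theorem 1 relies on.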

8. HIDDEN CONVEX CONCAVE GAMES

In this section, we analyze the stability properties of hidden convex-concave games. It is worth mentioning that without strict/strong convexity/concavity from at least one of the operators, the results are limited to Lyapunov stability. First, we present the construction of a Lyapunov function for the operators' dynamics (Lemma 2 and Theorem 2). Then, in Theorem 3 and Theorem 4, we explore the stability of the initial conditions in the parameter space.

8.1. GENERAL CASE

The following lemma presents the construction of a Lyapunov potential function for the induced operator dynamics. To motivate its construction, we can study a fundamental convex-concave function $L(x, y) = (x - p)^2 - (y - q)^2$ with saddle point $(p, q)$. Under the gradient descent-ascent dynamics

$$(T) := \begin{cases} \dot{x} = -\nabla_x L(x, y) & \text{(minimization of convex part)} \\ \dot{y} = \nabla_y L(x, y) & \text{(maximization of concave part)} \end{cases}$$

it is easy to check that $H(x, y) = (x - p)^2 + (y - q)^2$ meets all the criteria of a Lyapunov function. The construction below extends this argument to any convex-concave function $L(F, G)$ and bypasses the more complex multiplicative terms of the gradient-induced dynamics of Theorem 1. Notice that

$$H(F, G) = \sum_{i=1}^{N} \int_{p_i}^{f_i} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz + \sum_{j=1}^{M} \int_{q_j}^{g_j} \frac{z - q_j}{\left\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\right\|^2}\,dz$$

coincides, up to a constant factor, with the squared $\ell_2$ distance from $(p, q)$ when the gradient norms are equal to one, i.e. $\|\nabla f_i\|^2 = \|\nabla g_j\|^2 = 1$.

Lemma 2. If $L$ is convex-concave and $(\boldsymbol{\phi}(0), \boldsymbol{\theta}(0))$ is safe for $(p, q) \in \mathrm{Solution}(L)$, then the following quantity is non-increasing under the dynamics of Equation (3):

$$H(F, G) = \sum_{i=1}^{N} \int_{p_i}^{f_i} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz + \sum_{j=1}^{M} \int_{q_j}^{g_j} \frac{z - q_j}{\left\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\right\|^2}\,dz \qquad (4)$$

Proof. Simple substitution gives

$$\dot{H} = -\sum_{i=1}^{N} (f_i - p_i) \frac{\partial L}{\partial f_i}(F, G) + \sum_{j=1}^{M} (g_j - q_j) \frac{\partial L}{\partial g_j}(F, G) = -\langle F - p, \nabla_F L(F, G) \rangle + \langle G - q, \nabla_G L(F, G) \rangle.$$

By Theorem 11 applied to the convex $L(\cdot, G)$ and the concave $L(F, \cdot)$,

$$-\langle F - p, \nabla_F L(F, G) \rangle \le L(p, G) - L(F, G), \qquad \langle G - q, \nabla_G L(F, G) \rangle \le L(F, G) - L(F, q).$$

Thus we can write

$$\dot{H} \le L(p, G) - L(F, G) + L(F, G) - L(F, q) = L(p, G) - L(p, q) + L(p, q) - L(F, q) \le 0.$$

The last inequality holds since $(p, q) \in \mathrm{Solution}(L)$. Indeed, if $(p, q)$ is a saddle point of $L$, then $L(p, G) \le L(p, q) \le L(F, q)$.

Theorem 2. If $L$ is convex-concave and $(\boldsymbol{\phi}(0), \boldsymbol{\theta}(0))$ is safe for $(p, q) \in \mathrm{Solution}(L)$, then $(p, q)$ is stable for Equation (3).

Proof.
Leveraging Lemma 2, there is a function $H$ that is well defined in $D = \{\mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))\}_{i=1}^{N} \times \{\mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))\}_{j=1}^{M}$, and in this domain $\dot{H} \le 0$. Given the safety conditions, we know that $(p, q) \in D$. Observe that $H(p, q) = 0$. Moreover, each $f_i$ and $g_j$ term of $H$ attains its minimum value 0 at the corresponding $p_i$ and $q_j$. We can deduce this by differentiating each term to study its monotonicity: for example, the $f_i$ terms are strictly increasing for $f_i > p_i$ and strictly decreasing for $f_i < p_i$. Thus $H > 0$ on $D - \{(p, q)\}$. Applying Theorem 8 to the continuously differentiable $H$, we conclude that $(p, q)$ is stable for Equation (3).

In the following theorem, we examine how stability properties can be transferred between two (topologically conjugate) dynamical systems.

Theorem 3. If $f_i$ and $g_j$ are sigmoid functions, $L$ is convex-concave, and there is a $(\boldsymbol{\phi}(0), \boldsymbol{\theta}(0))$ that is safe for $(p, q) \in \mathrm{Solution}(L)$, then $(F^{-1}(p), G^{-1}(q))$ is stable for Equation (1).

Proof. First, we recall the derivative of the sigmoid: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$. Thus the transformed dynamical system in the operator space can be written as

$$(T) := \begin{cases} \dot{f}_i = -f_i^2 (1 - f_i)^2 \frac{\partial L}{\partial f_i}(F, G) \\ \dot{g}_j = g_j^2 (1 - g_j)^2 \frac{\partial L}{\partial g_j}(F, G) \end{cases}$$

Notice that:
1. The dynamical system $(T)$ in the operator space is independent of the initial conditions. In fact, the dynamical system $(T)$ and the one of Equation (1), called $(\Sigma)$ for short, are diffeomorphic for all initializations, not just along a specific trajectory.
2. Since $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe, Theorem 2 implies that $(p, q)$ is stable for $(T)$.

We would like to prove that for every open neighborhood $V$ of $(F^{-1}(p), G^{-1}(q))$ there exists an open neighborhood $U$ of $(F^{-1}(p), G^{-1}(q))$ such that

$$(\boldsymbol{\theta}_{init}, \boldsymbol{\phi}_{init}) \in U \implies \forall t \ge 0 : (\boldsymbol{\theta}(t), \boldsymbol{\phi}(t)) \in V.$$
Using the diffeomorphism $\gamma = \gamma_{\Sigma \to T}$ between the GDA dynamics $(\Sigma)$ and $(T)$, $\gamma(V)$ is an open neighborhood of $(p, q)$, since $V$ is open and $\gamma((F^{-1}(p), G^{-1}(q))) \equiv (p, q) \in \gamma(V)$. By Item 2, since $(p, q)$ is stable for $(T)$, there is an open neighborhood $\tilde{U}$ of $(p, q)$ such that

$$(F_{init}, G_{init}) \in \tilde{U} \implies \forall t \ge 0 : (F(t), G(t)) \in \gamma(V),$$

or equivalently

$$\gamma(\boldsymbol{\theta}_{init}, \boldsymbol{\phi}_{init}) \in \tilde{U} \implies \forall t \ge 0 : \gamma(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t)) \in \gamma(V).$$

Indeed, using the inverse diffeomorphism $\gamma^{-1}$, we can establish that for $U = \gamma^{-1}(\tilde{U})$ it holds that

$$(\boldsymbol{\theta}_{init}, \boldsymbol{\phi}_{init}) \in U \implies \forall t \ge 0 : (\boldsymbol{\theta}(t), \boldsymbol{\phi}(t)) \in V.$$

Until now, we have established the stability of a pair $(p, q)$ for the induced dynamics $(T)$. By construction, the induced dynamics $(T)$ are in general coupled with a very specific initial condition $(\boldsymbol{\theta}_{init}, \boldsymbol{\phi}_{init})$. To obtain a stability result for a whole region of initial conditions, in the following theorem we prove that $r(\boldsymbol{\theta}, \boldsymbol{\phi}) = \|F(\boldsymbol{\theta}) - p\|^2 + \|G(\boldsymbol{\phi}) - q\|^2$ can serve as an intrinsic measure of closeness in the $\{\boldsymbol{\theta}, \boldsymbol{\phi}\}$ parameter space around a hidden fixed point of the $\{F, G\}$ operator space. Under this "hidden" notion of neighborhood, stability can be obtained by assuming that the hidden operators are proper.

Theorem 4. Let $(p, q) \in \mathrm{Solution}(L)$. Let $R_{f_i}$ and $R_{g_j}$ be the sets of regular values of $f_i$ and $g_j$ respectively. Assume that there is a $\xi > 0$ such that $[p_i - \xi, p_i + \xi] \subseteq R_{f_i}$ and $[q_j - \xi, q_j + \xi] \subseteq R_{g_j}$. Define $r(t) = \|F(\boldsymbol{\theta}(t)) - p\|^2 + \|G(\boldsymbol{\phi}(t)) - q\|^2$. If $f_i$ and $g_j$ are proper functions, then for every $\epsilon > 0$ there is a $\delta > 0$ such that

$$r(0) < \delta \implies \forall t \ge 0 : r(t) < \epsilon.$$

Proof. Let us define the following sets:

$$\forall i \in [n] : A_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{n_i} \mid f_i(\boldsymbol{\theta}_i) \in [p_i - \xi, p_i + \xi]\}, \qquad \forall j \in [m] : B_j = \{\boldsymbol{\phi}_j \in \mathbb{R}^{m_j} \mid g_j(\boldsymbol{\phi}_j) \in [q_j - \xi, q_j + \xi]\}.$$

Since $f_i$ and $g_j$ are proper, $A_i$ and $B_j$ are compact sets. Thus, the continuous functions $\|\nabla f_i(\boldsymbol{\theta}_i)\|^2$ and $\|\nabla g_j(\boldsymbol{\phi}_j)\|^2$ attain a minimum and a maximum value on $A_i$ and $B_j$ respectively. Let us call $K_{f_i}$ and $K_{g_j}$ the maxima and $\kappa_{f_i}$ and $\kappa_{g_j}$ the minima. Observe that the minima and maxima must all be greater than zero, since $[p_i - \xi, p_i + \xi]$ and $[q_j - \xi, q_j + \xi]$ consist of regular values. Let us define

$$\kappa = \min\left\{\min_{1 \le i \le n} \kappa_{f_i}, \min_{1 \le j \le m} \kappa_{g_j}\right\}, \qquad K = \max\left\{\max_{1 \le i \le n} K_{f_i}, \max_{1 \le j \le m} K_{g_j}\right\},$$

where $K \ge \kappa > 0$ as discussed. Let us create the following set:

$$S = \{(\boldsymbol{\theta}, \boldsymbol{\phi}) \in \mathbb{R}^N \times \mathbb{R}^M \mid \forall i \in [n] : \boldsymbol{\theta}_i \in A_i, \ \forall j \in [m] : \boldsymbol{\phi}_j \in B_j\}.$$

We can prove that every $(\boldsymbol{\theta}, \boldsymbol{\phi}) \in S$ is a safe initialization for $(p, q)$. First, no $\boldsymbol{\theta}_i$ or $\boldsymbol{\phi}_j$ in these sets is a stationary point of $f_i$ or $g_j$ respectively. We also need to prove that the equilibrium $(p, q)$ is feasible, which we do by contradiction. Suppose there is a $(\boldsymbol{\theta}, \boldsymbol{\phi}) \in S$ such that $(p, q)$ is not feasible. Without loss of generality, we can assume that there is an $i \in [n]$ such that $p_i \notin \mathrm{Im}_{f_i}(\boldsymbol{\theta}_i)$; the case of $g_j$ is symmetric. Along the gradient ascent trajectory of $f_i$ initialized at $\boldsymbol{\theta}_i$, observe that $f_i(t)$ cannot attain an infimum or a supremum in $[p_i - \xi, p_i + \xi]$, because there are no stationary points of $f_i$ in $A_i$. Observe also that at initialization $f_i(\boldsymbol{\theta}_i) \in [p_i - \xi, p_i + \xi]$. Thus $[p_i - \xi, p_i + \xi] \subseteq \mathrm{Im}_{f_i}(\boldsymbol{\theta}_i)$, a contradiction.

Let us pick an initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ such that $r(0) \le \xi^2$. It is clear that $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0)) \in S$ and so it is safe for $(p, q)$.
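We can follow the same steps as in Theorem 2 to prove that the function $H(F, G)$ below does not increase under the dynamics of Equation (1):

$$H(F, G) = \sum_{i=1}^{N} \int_{p_i}^{f_i} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz + \sum_{j=1}^{M} \int_{q_j}^{g_j} \frac{z - q_j}{\left\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\right\|^2}\,dz.$$

Observe that since $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0)) \in S$, the interval between $p_i$ and $f_i(\boldsymbol{\theta}_i(0))$ is contained in $[p_i - \xi, p_i + \xi]$, and $\|\nabla f_i(\cdot)\|^2 \ge \kappa$ on this interval. Thus we can write

$$\frac{(f_i(\boldsymbol{\theta}_i(0)) - p_i)^2}{2\kappa} \ge \int_{p_i}^{f_i(\boldsymbol{\theta}_i(0))} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz.$$

Repeating the same argument for all $f_i$ and $g_j$, we have that

$$\frac{r(0)}{2\kappa} \ge H(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))) \ge H(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t))).$$

Let us pick $r(0) < \min\{\xi^2, \xi^2 \frac{\kappa}{K}\} = \xi^2 \frac{\kappa}{K}$. We already know that trajectories start in $S$. We will prove by contradiction that they also remain in $S$. If a trajectory escaped $S$, then without loss of generality there would be at least one $i \in [n]$ such that at some $t > 0$, $f_i(\boldsymbol{\theta}_i(t)) \notin [p_i - \xi, p_i + \xi]$; the case of $g_j$ is similar. Clearly we have that

$$\int_{p_i}^{f_i(\boldsymbol{\theta}_i(t))} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz \ge \min\left\{\int_{p_i}^{p_i - \xi} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz, \ \int_{p_i}^{p_i + \xi} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz\right\}.$$

As above, the squared gradient norms in the integrals on the right-hand side are at most $K$, so

$$\int_{p_i}^{f_i(\boldsymbol{\theta}_i(t))} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz \ge \frac{\xi^2}{2K}.$$

The terms of $H$ are all non-negative, so we have that

$$\frac{r(0)}{2\kappa} \ge H(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t))) \ge \int_{p_i}^{f_i(\boldsymbol{\theta}_i(t))} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz \ge \frac{\xi^2}{2K}.$$

But $r(0) < \xi^2 \frac{\kappa}{K}$, a contradiction. So the trajectories stay in $S$. We can then write

$$\int_{p_i}^{f_i(\boldsymbol{\theta}_i(t))} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz \ge \frac{(f_i(\boldsymbol{\theta}_i(t)) - p_i)^2}{2K}.$$

Repeating the same argument for all $f_i$ and $g_j$, we have that

$$\frac{r(0)}{2\kappa} \ge H(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t))) \ge \frac{r(t)}{2K}.$$

Hence, for every $\epsilon > 0$ there is a positive $\delta = \min\{\xi^2, \epsilon\} \frac{\kappa}{K}$ such that $r(0) < \delta \implies r(t) < \epsilon$.

A special case of the above result is the standard convex-concave games:

Corollary 3.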
Let $L(x, y)$ be strictly convex-concave and let $\mathrm{Solution}(L)$ be the non-empty set of equilibria of $L$. Then $\mathrm{Solution}(L)$ is locally asymptotically stable for the continuous GDA dynamics.

Proof. This classical result follows from a straightforward application of Lemma 3 to the case $F(x) = x$ and $G(y) = y$. Notice that i) if $F, G$ are the identity maps, all initial configurations are safe, and ii) if $\|\nabla F\|^2 = \|\nabla G\|^2 = 1$, then the initialization-dependent Lyapunov functions coincide with a single Lyapunov function, which is proportional to the squared Euclidean distance $r(\boldsymbol{\theta}, \boldsymbol{\phi}) = \|F(\boldsymbol{\theta}) - p\|^2 + \|G(\boldsymbol{\phi}) - q\|^2 = \|\boldsymbol{\theta} - p\|^2 + \|\boldsymbol{\phi} - q\|^2$.
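The role of the potential $H$ from Lemma 2 can be illustrated on a toy hidden game. The sketch below is a minimal example under assumed choices that are not taken from the text: scalar sigmoid operators $f = \sigma(\theta)$, $g = \sigma(\phi)$ and the bilinear hidden objective $L(f, g) = (f - p)(g - q)$ with $p = q = 0.5$. Since $L$ is convex-concave but not strictly so, the two contributions to $\dot{H}$ cancel exactly and $H$ is conserved; we therefore expect stability without convergence. For sigmoids, $\|\nabla f\|^2 = (f(1 - f))^2$, and the integral defining $H$ is evaluated here with Simpson's rule.

```python
import math

P, Q = 0.5, 0.5  # assumed hidden equilibrium (p, q)


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def gda_field(state):
    """Parameter-space GDA for L(f, g) = (f - p)(g - q) with sigmoid operators."""
    theta, phi = state
    f, g = sigmoid(theta), sigmoid(phi)
    d_theta = -f * (1.0 - f) * (g - Q)  # -f'(theta) * dL/df
    d_phi = g * (1.0 - g) * (f - P)     # +g'(phi) * dL/dg
    return (d_theta, d_phi)


def rk4_step(field, state, dt):
    k1 = field(state)
    k2 = field(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = field(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = field(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simpson(func, a, b, n=200):
    """Composite Simpson rule with n (even) subintervals; also handles b < a."""
    if a == b:
        return 0.0
    step = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * func(a + i * step)
    return total * step / 3.0


def potential(theta, phi):
    """H(f, g) of Lemma 2, using ||grad f||^2 = (f (1 - f))^2 for sigmoids."""
    f, g = sigmoid(theta), sigmoid(phi)
    term_f = simpson(lambda z: (z - P) / (z * (1.0 - z)) ** 2, P, f)
    term_g = simpson(lambda z: (z - Q) / (z * (1.0 - z)) ** 2, Q, g)
    return term_f + term_g


def run(state=(1.0, -0.5), dt=0.05, steps=3000, record_every=100):
    """Record H and r = (f - p)^2 + (g - q)^2 along the GDA trajectory."""
    h_values, r_values = [], []
    for step in range(steps + 1):
        if step % record_every == 0:
            f, g = sigmoid(state[0]), sigmoid(state[1])
            h_values.append(potential(*state))
            r_values.append((f - P) ** 2 + (g - Q) ** 2)
        state = rk4_step(gda_field, state, dt)
    return h_values, r_values


h_values, r_values = run()
```

In this run the recorded values of $H$ should stay constant up to integration error while $r$ stays bounded away from zero, which matches the remark that without strictness the guarantees stop at Lyapunov stability.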

8.2.1. GRADIENT DESCENT-ASCENT DYNAMICS

In the following preliminary result, we show that strict convexity or concavity of $L(\cdot, \cdot)$ in at least one of its arguments suffices to yield local asymptotic stability starting from a safe initial condition. Our argument leverages the power of Theorem 9 and combines it with the stability results of the previous section. We first outline the basic steps:
1. We start by showing that there exists a compact set $\Omega \subset D$.
2. Since $\dot{H} \le 0$ (Lyapunov property), any configuration $(F(0), G(0))$ starting from a bounded sub-level set $\Omega$ of $H$ will remain inside $\Omega$ for all time.
3. The second crucial observation is that, thanks to the strictness of the convexity or concavity of $L$, the largest invariant set of $\dot{H} = 0$ contains only points belonging to von Neumann's $\mathrm{Solution}(L)$. Then Theorem 9 implies the local asymptotic stability of the set $Z$ for Equation (3).

Lemma 3. Let $L$ be strictly convex-concave and let $Z \subset \mathrm{Solution}(L)$ be the non-empty set of equilibria of $L$ for which $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe. Then $Z$ is locally asymptotically stable for Equation (3).

Proof. Pick a point $(p, q) \in Z$. Since our initialization is safe for this saddle point, we can construct the function $H$ as in Theorem 2 and prove that it has the following property:

$$\dot{H} \le 0 \ \text{ in } \ D = \{\mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))\}_{i=1}^{N} \times \{\mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))\}_{j=1}^{M}.$$

If $(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))) = (p, q)$ then the statement holds trivially. Otherwise, take a ball $B$ centered at the equilibrium with a small enough radius that it is contained in the interior of $D$, and define

$$H_0 = \min_{(F, G) \in \partial B} H(F, G), \qquad \Omega = \{(F, G) \in B \mid H(F, G) \le H_0 / 2\}.$$

We know that $H_0 > 0$ from Theorem 2. Since $\dot{H} \le 0$, starting in $\Omega$ implies that $H(F(t), G(t)) \le H_0 / 2 < H_0$ for $t \ge 0$, so the trajectory cannot reach $\partial B$ and $\Omega$ is forward invariant. Since $\Omega \subset D$, we know that it is bounded. $\Omega$ is closed since it is a sublevel set of a continuous function. Notice that the restriction of $\Omega$ to $B$ does not affect the above properties, since $\Omega$ is in the interior of $B$.
Thus $\Omega$ is a compact forward invariant set, satisfying the requirement of Theorem 9. Let $E = \{(F, G) \in B \mid \dot{H}(F, G) = 0\}$. Without loss of generality, we can assume that $L(\cdot, q)$ is strictly convex, as the case of $L(p, \cdot)$ being strictly concave is similar. In the inequality

$$\dot{H} \le L(p, G) - L(p, q) + L(p, q) - L(F, q) \le 0$$

we know that $L(p, G) - L(p, q) \le 0$ and $L(p, q) - L(F, q) \le 0$. So $\dot{H} = 0$ implies $L(p, G) = L(p, q) = L(F, q)$. By the strict convexity of $L(\cdot, q)$, this means that $F = p$. Let $M$ be the largest invariant set inside $E$. Since $M$ is an invariant subset of $E$, we have

$$(F(0), G(0)) \in M \implies \forall t : F(t) = p \ \text{ and } \ L(p, G(t)) = L(p, q).$$

The time derivatives of these constant quantities must be zero:

$$\dot{f}_i = 0 \Rightarrow \forall i \in [N] : \left\|\nabla_{\boldsymbol{\theta}_i} f_i(X_{\boldsymbol{\theta}_i(0)}(p_i))\right\|^2 \frac{\partial L}{\partial f_i}(p, G) = 0,$$

$$\frac{d}{dt} L(p, G(t)) = 0 \Rightarrow \sum_{j=1}^{M} \left\|\nabla_{\boldsymbol{\phi}_j} g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\right\|^2 \left(\frac{\partial L}{\partial g_j}(p, G)\right)^2 = 0.$$

We know that $\nabla_{\boldsymbol{\theta}_i} f_i(X_{\boldsymbol{\theta}_i(0)}(p_i)) \ne 0$ by the safety conditions, and that $\|\nabla_{\boldsymbol{\phi}_j} g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2 \ne 0$ inside $D$, again by the safety conditions. This implies

$$\forall i \in [N] : \frac{\partial L}{\partial f_i}(p, G) = 0, \qquad \forall j \in [M] : \frac{\partial L}{\partial g_j}(p, G) = 0.$$

Thus $M$ contains only stationary points of $L$, so $M \subseteq \mathrm{Solution}(L)$. In addition, $M \subseteq D$, so only stationary points of $L$ for which the initialization is safe are allowed, and hence $M \subseteq Z$. Applying Theorem 9, for any initialization of Equation (3) inside $\Omega$, $(F(t), G(t))$ approaches $M$ as $t \to \infty$, and thus $Z$ is locally asymptotically stable for Equation (3).

A special case of the above result is the standard convex-concave games:

Corollary 4. Let $L(x, y)$ be strictly convex-concave and let $\mathrm{Solution}(L)$ be the non-empty set of equilibria of $L$. Then $\mathrm{Solution}(L)$ is locally asymptotically stable for the continuous GDA dynamics.

In the following main result of our work, we show that strict convexity or concavity of $L(\cdot, \cdot)$ in at least one of its arguments suffices to yield convergence to a von Neumann solution in $\mathrm{Solution}(L)$ starting from a safe initial condition.
In order to get convergence results for any safe initialization, we need to study the region of attraction of the set $Z \subset \mathrm{Solution}(L)$. We refine the estimate of the region of attraction proposed in Lemma 3 by analyzing the behavior of the level sets of $H$. More precisely, we show that the proposed Lyapunov function

$$H(F, G) = \sum_{i=1}^{N} \int_{p_i}^{f_i} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz + \sum_{j=1}^{M} \int_{q_j}^{g_j} \frac{z - q_j}{\left\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\right\|^2}\,dz$$

is radially unbounded. In other words, as the operators converge to their limit values (the supremum/infimum of their domain), $H \to +\infty$. To show this, we analyze the asymptotic behavior of $\int_c^{F} \frac{1}{\|\nabla f_i\|^2}$ as $F \to \sup f_i$. Hence: A) Theorem 9 implies that the trajectory will approach the set of stationary points of $H$, or equivalently a set of von Neumann solutions in $\mathrm{Solution}(L)$. B) The stability of $\mathrm{Solution}(L)$ together with Theorem 10 leads to the conclusion that the trajectory converges to a specific point of $\mathrm{Solution}(L)$.

Theorem 5. Let $L$ be strictly convex-concave and let $Z \subset \mathrm{Solution}(L)$ be the non-empty set of equilibria of $L$ for which $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is safe. Under the dynamics of Equation (1), $(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t)))$ converges to a point in $Z$ as $t \to \infty$.

Proof. Again, let us pick a point $(p, q) \in Z$. Since our initialization is safe for this saddle point, we can construct the function $H$ as in Theorem 2 and prove that it has the following property:

$$\dot{H} \le 0 \ \text{ in } \ D = \{\mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))\}_{i=1}^{N} \times \{\mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))\}_{j=1}^{M}.$$

If $(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))) = (p, q)$ then the theorem holds trivially. Otherwise define

$$H_0 = H(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))), \qquad \Omega = \{(F, G) \in D \mid H(F, G) \le H_0\},$$

where we know that $H_0 > 0$ from Theorem 2. Let us assume for now that $\Omega$ is indeed in the interior of $D$. Then, applying the same argument as in Lemma 3 combined with Theorem 2, all fixed points in $Z$ are stable. So, applying Theorem 10, we get that the trajectory initialized at $(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))) \in \Omega$ converges to a point in $Z$. It remains to prove our assertion about the set $\Omega$:

Claim 1.
$\Omega$ is in the interior of $D$.

Proof. We will argue that as $(F, G)$ approaches the boundary of $D$, the value of $H$ becomes unbounded. If this is true, then given the finite upper bound $H_0$, $\Omega$ has no points close to the boundary of $D$ and thus lies in the interior. As $(F, G)$ approaches the boundary of $D$, at least one of the variables $f_i$ or $g_j$ approaches an endpoint of $\mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))$ or $\mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))$ respectively. We study the case of $f_i$, since the case of $g_j$ is symmetric. The endpoint $f_{is}$ can be either the supremum or the infimum of $f_i$ along the gradient ascent trajectory, or $\pm\infty$ if these do not exist. Let $f_{is}$ be the supremum, or $\infty$ if the supremum does not exist. We can take the gradient ascent dynamics and apply Lemma 1 to get

$$\dot{f}_i = \left\|\nabla_{\boldsymbol{\theta}_i} f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\right\|^2.$$

We know that $f_i(\boldsymbol{\theta}_i(t))$ goes to $f_{is}$ when initialized at $f_i(\boldsymbol{\theta}_i(0))$. Let us define the function

$$a(f_i) = \int_{p_i}^{f_i} \frac{1}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz.$$

Observe that $\dot{a} = 1$, thus $\lim_{t \to \infty} a(f_i(t)) = \infty$. In other words,

$$\lim_{t \to \infty} \int_{p_i}^{f_i(t)} \frac{1}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz = \int_{p_i}^{f_{is}} \frac{1}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz = \infty.$$

Symmetrically, if $f_{is}$ is the infimum or $-\infty$, then the limit above is $-\infty$. In either case,

$$f_i \to f_{is} \implies \int_{p_i}^{f_i} \frac{z - p_i}{\left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\right\|^2}\,dz \to \infty.$$

For the last step it is important to note that $p_i$ is not on the boundary of $D$, by the safety conditions. Therefore, as $(F, G)$ approaches the boundary of $D$ under the dynamics of Equation (3), at least one of the terms of $H$ goes to infinity. Also note that all the terms of $H$ are individually non-negative, so no matter what the other variables in $(F, G)$ are doing, they cannot prevent $H \to \infty$.

Again, a special case of the above result is the standard convex-concave games:

Corollary 5. Let $L(\mathbf{x}, \mathbf{y})$ be strictly convex-concave and let $\mathrm{Solution}(L)$ be the non-empty set of equilibria of $L$. Under the continuous GDA dynamics, $(\mathbf{x}(t), \mathbf{y}(t))$ converges to a point in $\mathrm{Solution}(L)$ as $t \to \infty$.
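The convergence statement of Theorem 5 can be observed on a toy hidden game. The sketch below is an illustration under assumed choices that are not taken from the text: scalar sigmoid operators $f = \sigma(\theta)$, $g = \sigma(\phi)$, the strictly convex-concave hidden objective $L(f, g) = (f - p)^2 + (f - p)(g - q) - (g - q)^2$, whose unique saddle point is $(p, q)$, and the values $p = 0.3$, $q = 0.7$. Any finite $(\theta, \phi)$ is a safe initialization here, since the sigmoid has no stationary points and its gradient ascent trajectory sweeps the whole interval $(0, 1)$.

```python
import math

P, Q = 0.3, 0.7  # assumed hidden saddle point (p, q)


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def hidden_gda_field(state):
    """Parameter-space GDA for L(f, g) = (f-p)^2 + (f-p)(g-q) - (g-q)^2."""
    theta, phi = state
    f, g = sigmoid(theta), sigmoid(phi)
    dL_df = 2.0 * (f - P) + (g - Q)
    dL_dg = (f - P) - 2.0 * (g - Q)
    d_theta = -f * (1.0 - f) * dL_df  # descent on theta via the chain rule
    d_phi = g * (1.0 - g) * dL_dg     # ascent on phi via the chain rule
    return (d_theta, d_phi)


def rk4_step(field, state, dt):
    k1 = field(state)
    k2 = field(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = field(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = field(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def run_gda(state=(2.0, -2.0), dt=0.05, steps=8000):
    """Integrate GDA from a far-away initialization; return the final (f, g)."""
    for _ in range(steps):
        state = rk4_step(hidden_gda_field, state, dt)
    return sigmoid(state[0]), sigmoid(state[1])


f_final, g_final = run_gda()
```

Starting far from the equilibrium in parameter space, the operator outputs should settle at $(p, q)$, consistent with the non-local nature of the guarantee.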

8.2.2. CONNECTIONS TO HAMILTONIAN DESCENT

In GANs, numerous learning heuristics are being tested and explored. One technique with particularly interesting theoretical justification as well as practical performance is Hamiltonian Gradient Descent (HGD). Understanding the convergence guarantees of HGD is an open research question (Maddison et al., 2018; Balduzzi et al., 2018; O'Donoghue & Maddison, 2019). We provide some new justification for its success in GANs by provably establishing convergence of a modified version of HGD in a relatively simple but illustrative subclass of hidden convex-concave games, namely 2x2 hidden bi-linear games. This class of games is fairly expressive: despite the restriction to planar bi-linear competition in the output space, the hidden game can have an arbitrary number of variables in the parameter space. It is important to note that, given the bi-linear nature of the competition, the classical GDA dynamics cycle instead of converging to the equilibrium, as shown in Vlatakis-Gkaragkounis et al. (2019).

More precisely, in the hidden 2x2 bi-linear game presented in Vlatakis-Gkaragkounis et al. (2019), we have two functions $f : \mathbb{R}^N \to [0, 1]$ and $g : \mathbb{R}^M \to [0, 1]$ and two constants $(p, q) \in (0, 1)^2$, where $(p, q)$ is the fully mixed equilibrium of the bi-linear game. Without loss of generality, we are interested in solving the following problem:

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^N} \max_{\boldsymbol{\phi} \in \mathbb{R}^M} (f(\boldsymbol{\theta}) - p)(g(\boldsymbol{\phi}) - q).$$

Defining $L(\boldsymbol{\theta}, \boldsymbol{\phi}) = (f(\boldsymbol{\theta}) - p)(g(\boldsymbol{\phi}) - q)$, the dynamics of HGD are

$$\dot{\boldsymbol{\theta}} = -\frac{1}{2} \nabla_{\boldsymbol{\theta}} \left\|\nabla_{\boldsymbol{\phi}} L(\boldsymbol{\theta}, \boldsymbol{\phi})\right\|^2 - \frac{1}{2} \nabla_{\boldsymbol{\theta}} \left\|\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \boldsymbol{\phi})\right\|^2, \qquad \dot{\boldsymbol{\phi}} = -\frac{1}{2} \nabla_{\boldsymbol{\phi}} \left\|\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \boldsymbol{\phi})\right\|^2 - \frac{1}{2} \nabla_{\boldsymbol{\phi}} \left\|\nabla_{\boldsymbol{\phi}} L(\boldsymbol{\theta}, \boldsymbol{\phi})\right\|^2 \qquad (6)$$

Observe that the second term of each right-hand side would be zero in a classical bi-linear game, but involves second-order derivatives of $f$ and $g$ in the case of hidden bi-linear games.
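To circumvent the complexities of the second-order derivatives and mimic the classical bi-linear game, we study a modified version of Equation (6), namely

$$\dot{\boldsymbol{\theta}} = -\frac{1}{2} \nabla_{\boldsymbol{\theta}} \left\|\nabla_{\boldsymbol{\phi}} L(\boldsymbol{\theta}, \boldsymbol{\phi})\right\|^2, \qquad \dot{\boldsymbol{\phi}} = -\frac{1}{2} \nabla_{\boldsymbol{\phi}} \left\|\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \boldsymbol{\phi})\right\|^2 \qquad (7)$$

Employing an analysis similar to the one in Section 3.2, we get the following convergence result.

Theorem 13. Let $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ be safe for $(p, q)$. Then $(f(\boldsymbol{\theta}(t)), g(\boldsymbol{\phi}(t)))$ converges to $(p, q)$ under the dynamics of Equation (7).

Proof. Simple substitution gives

$$\dot{\boldsymbol{\theta}} = -\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta}) \left\|\nabla_{\boldsymbol{\phi}} g(\boldsymbol{\phi})\right\|^2 (f(\boldsymbol{\theta}) - p), \qquad \dot{\boldsymbol{\phi}} = -\nabla_{\boldsymbol{\phi}} g(\boldsymbol{\phi}) \left\|\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta})\right\|^2 (g(\boldsymbol{\phi}) - q).$$

Applying Lemma 1 and following the same steps as before,

$$\dot{f} = -\left\|\nabla_{\boldsymbol{\theta}} f(X_{\boldsymbol{\theta}(0)}(f))\right\|^2 \left\|\nabla_{\boldsymbol{\phi}} g(X_{\boldsymbol{\phi}(0)}(g))\right\|^2 (f - p), \qquad \dot{g} = -\left\|\nabla_{\boldsymbol{\phi}} g(X_{\boldsymbol{\phi}(0)}(g))\right\|^2 \left\|\nabla_{\boldsymbol{\theta}} f(X_{\boldsymbol{\theta}(0)}(f))\right\|^2 (g - q).$$

Once again we consider the function

$$H(f, g) = \int_{p}^{f} \frac{z - p}{\left\|\nabla f(X_{\boldsymbol{\theta}(0)}(z))\right\|^2}\,dz + \int_{q}^{g} \frac{z - q}{\left\|\nabla g(X_{\boldsymbol{\phi}(0)}(z))\right\|^2}\,dz.$$

Simple substitution gives

$$\dot{H} = -(f - p) \left\|\nabla_{\boldsymbol{\phi}} g(X_{\boldsymbol{\phi}(0)}(g))\right\|^2 (f - p) - (g - q) \left\|\nabla_{\boldsymbol{\theta}} f(X_{\boldsymbol{\theta}(0)}(f))\right\|^2 (g - q),$$

and a little reorganization gives

$$\dot{H} = -(f - p)^2 \left\|\nabla_{\boldsymbol{\phi}} g(X_{\boldsymbol{\phi}(0)}(g))\right\|^2 - (g - q)^2 \left\|\nabla_{\boldsymbol{\theta}} f(X_{\boldsymbol{\theta}(0)}(f))\right\|^2 \le 0.$$

Thus we get $\dot{H} \le 0$ in $D = \mathrm{Im}_f(\boldsymbol{\theta}(0)) \times \mathrm{Im}_g(\boldsymbol{\phi}(0))$. As in the strictly convex analysis of the previous section, if $(f(\boldsymbol{\theta}(0)), g(\boldsymbol{\phi}(0))) = (p, q)$ then the theorem holds trivially. Otherwise define

$$H_0 = H(f(\boldsymbol{\theta}(0)), g(\boldsymbol{\phi}(0))), \qquad \Omega = \{(f, g) \in D \mid H(f, g) \le H_0\},$$

where we know that $H_0 > 0$ from Theorem 2. Additionally, Claim 1 applies to the new dynamics as well, so $\Omega$ is in the interior of $D$. Since $\dot{H} \le 0$, starting in $\Omega$ implies that $H(f(t), g(t)) \le H_0$ for $t \ge 0$, so $(f(t), g(t))$ stays in $\Omega$. Additionally, $\Omega$ is closed since it is a sublevel set of a continuous function. Notice that the restriction of $\Omega$ to $D$ does not affect the above properties, since $\Omega$ is in the interior of $D$. Thus $\Omega$ is a compact forward invariant set.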
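For a safe initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$, neither $\|\nabla_{\boldsymbol{\phi}} g(X_{\boldsymbol{\phi}(0)}(g(t)))\|$ nor $\|\nabla_{\boldsymbol{\theta}} f(X_{\boldsymbol{\theta}(0)}(f(t)))\|$ can go to 0, as this happens only at the boundary of $D$, which is outside $\Omega$. So $\dot{H} = 0$ only at $(p, q)$ in $\Omega$. Therefore, applying Theorem 9, we get that $(f(\boldsymbol{\theta}(t)), g(\boldsymbol{\phi}(t)))$ converges to $(p, q)$.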
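As a sanity check of Theorem 13, the sketch below integrates the modified HGD dynamics in the reduced form obtained by substitution in the proof, for an assumed toy instance that is not from the text: scalar sigmoid operators $f = \sigma(\theta)$, $g = \sigma(\phi)$ and $p = q = 0.5$. The initialization, step size, and horizon are illustrative choices; the long horizon reflects the small contraction rate $(f(1 - f))^2 (g(1 - g))^2$ of this instance.

```python
import math

P, Q = 0.5, 0.5  # assumed fully mixed equilibrium of the hidden bilinear game


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def modified_hgd_field(state):
    """Modified HGD for L = (f - p)(g - q) with scalar sigmoid operators.

    theta' = -f'(theta) * g'(phi)^2 * (f - p)
    phi'   = -g'(phi)   * f'(theta)^2 * (g - q)
    """
    theta, phi = state
    f, g = sigmoid(theta), sigmoid(phi)
    df, dg = f * (1.0 - f), g * (1.0 - g)
    return (-df * dg * dg * (f - P), -dg * df * df * (g - Q))


def rk4_step(field, state, dt):
    k1 = field(state)
    k2 = field(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = field(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = field(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def run_hgd(state=(1.0, -0.5), dt=0.1, steps=40000, record_every=1000):
    """Return the squared distances (f - p)^2 + (g - q)^2 at recorded times."""
    distances = []
    for step in range(steps + 1):
        if step % record_every == 0:
            f, g = sigmoid(state[0]), sigmoid(state[1])
            distances.append((f - P) ** 2 + (g - Q) ** 2)
        state = rk4_step(modified_hgd_field, state, dt)
    return distances


distances = run_hgd()
```

In contrast with the cycling behavior of GDA on the same hidden bilinear game, the recorded distance to $(p, q)$ should decrease monotonically towards zero.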

8.3. REGULARIZATION AND CONVERGENCE

In this section, we show that even in the absence of strict convexity/concavity for both of the operators, it is possible to achieve a positive convergence result by sacrificing the exactness of the targeted equilibrium. In other words, we prove that by adding a small regularization term, the new utility function becomes strictly convex and strictly concave. Besides the guaranteed convergence for the "perturbed" $L'$, we can always choose a sufficiently small regularization magnitude such that the new equilibria are arbitrarily close to the initial ones.

Theorem 6. If $L$ is a convex-concave function with invertible Hessians at all its equilibria, then for each $\epsilon > 0$ there is a $\lambda > 0$ such that $L'$ has equilibria that are $\epsilon$-close to the ones of $L$.

Proof. For any choice of $\lambda > 0$, $L'$ is strictly convex and strictly concave, so the KKT conditions are sufficient to determine its equilibria:

$$\frac{\partial L(x, y)}{\partial x_i} + \lambda x_i = 0, \qquad \frac{\partial L(x, y)}{\partial y_j} - \lambda y_j = 0.$$

We can view the above set of constraints as a single vector constraint $r(\lambda, x, y) = 0$. Note that by the assumption that the Hessians are invertible at all equilibria, $L$ has a unique equilibrium $(x^*, y^*)$. Clearly $r(0, x^*, y^*) = 0$. Observe that the Jacobian of $r$ at $(0, x^*, y^*)$ with respect to $(x, y)$ satisfies $D_{(x, y)} r(0, x^*, y^*) = \nabla^2 L(x^*, y^*)$ and is thus invertible. Invoking the Implicit Function Theorem, there is a differentiable function $g$, defined in a small enough neighborhood of 0, that takes a $\lambda$ and returns $g(\lambda) = (x(\lambda), y(\lambda))$ such that $r(\lambda, g(\lambda)) = 0$. Thus for small enough $\lambda$, $g$ returns the corresponding equilibrium of $L'$. By continuity of $g$, for every $\epsilon$ there is a $\delta > 0$ such that

$$\forall\, 0 < \lambda < \delta : \ \|x(\lambda) - x(0)\|^2 + \|y(\lambda) - y(0)\|^2 \le \epsilon^2.$$

But $(x(0), y(0)) = (x^*, y^*)$, so the equilibrium of $L$ has an $\epsilon$-close equilibrium of $L'$ for $\lambda < \delta$. By the strict convexity and strict concavity of $L'$, it has a unique equilibrium as well. So the equilibria of $L$ and $L'$ are $\epsilon$-close to each other.

The previous theorem highlights that small values of $\lambda$ induce only small changes to the equilibria of the hidden game. As is the case for classical convex-concave games, larger values of $\lambda$ lead to (exponentially) faster convergence. To prove this for HCC games, we provide a detailed upper and lower bound analysis of the gradients of $f_i$ and $g_j$.

Theorem 7. Let $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ be a safe initialization for the unique equilibrium $(p, q)$ of $L'$. If $r(t) = \|F(\boldsymbol{\theta}(t)) - p\|^2 + \|G(\boldsymbol{\phi}(t)) - q\|^2$, then there are initialization-dependent constants $c_0, c_1 > 0$ such that $r(t) \le c_0 \exp(-\lambda c_1 t)$.

Proof. Following the same steps as in the strictly convex-concave analysis of the previous section, if $(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))) = (p, q)$ then the theorem holds trivially.
Otherwise, since our initialization is safe for $(p, q)$, we can construct the function $H$ as in Theorem 2 and prove that it has the following property in $D = \{\mathrm{Im}_{f_i}(\boldsymbol{\theta}_i(0))\}_{i=1}^{N} \times \{\mathrm{Im}_{g_j}(\boldsymbol{\phi}_j(0))\}_{j=1}^{M}$:

$$\dot{H} \le L'(p, G) - L'(p, q) + L'(p, q) - L'(F, q) \le -\frac{\lambda}{2}\left(\|F(\boldsymbol{\theta}(t)) - p\|^2 + \|G(\boldsymbol{\phi}(t)) - q\|^2\right) \le -\frac{\lambda}{2} r(t),$$

where the second step follows from $L'(p, \cdot)$ being $\lambda$-strongly concave and $L'(\cdot, q)$ being $\lambda$-strongly convex, with $q$ and $p$ being the corresponding optima of these functions since $(p, q)$ is an equilibrium. Let us define

$$H_0 = H(F(\boldsymbol{\theta}(0)), G(\boldsymbol{\phi}(0))), \qquad \Omega = \{(F, G) \in D \mid H(F, G) \le H_0\},$$

where we know that $H_0 > 0$ from Theorem 2. Additionally, Claim 1 applies to the new dynamics as well, so $\Omega$ is in the interior of $D$. Since $\dot{H} \le 0$, starting in $\Omega$ implies that $H(F(\boldsymbol{\theta}(t)), G(\boldsymbol{\phi}(t))) \le H_0$ for $t \ge 0$, so $(F(t), G(t))$ stays in $\Omega$. Additionally, $\Omega$ is closed since it is a sublevel set of a continuous function. Notice that the restriction of $\Omega$ to $D$ does not affect the above properties, since $\Omega$ is in the interior of $D$. Thus $\Omega$ is a compact forward invariant set. For a safe initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$, the following continuous functions must attain a minimum and a maximum value on $\Omega$:

$$K_{f_i} \ge \left\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(\cdot))\right\|^2 \ge \kappa_{f_i}, \qquad K_{g_j} \ge \left\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(\cdot))\right\|^2 \ge \kappa_{g_j}.$$

Observe that the minima and maxima must all be greater than zero, since neither $\|\nabla_{\boldsymbol{\phi}_j} g_j(X_{\boldsymbol{\phi}_j(0)}(g_j(t)))\|$ nor $\|\nabla_{\boldsymbol{\theta}_i} f_i(X_{\boldsymbol{\theta}_i(0)}(f_i(t)))\|$ can go to 0, as this happens only at the boundary of $D$, which is outside $\Omega$.

Let us define

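$$\kappa = \min\Big\{ \min_{1 \le i \le N} \kappa_{f_i},\ \min_{1 \le j \le M} \kappa_{g_j} \Big\}, \qquad K = \max\Big\{ \max_{1 \le i \le N} K_{f_i},\ \max_{1 \le j \le M} K_{g_j} \Big\}.$$

Observe that $K \ge \|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(\cdot))\|^2 \ge \kappa$ in this interval. Thus we can write

$$\frac{(f_i(\boldsymbol{\theta}_i(t)) - p_i)^2}{2\kappa} \ \ge\ \int_{p_i}^{f_i(\boldsymbol{\theta}_i(t))} \frac{z - p_i}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\, dz \ \ge\ \frac{(f_i(\boldsymbol{\theta}_i(t)) - p_i)^2}{2K}.$$

Repeating the same argument for all $f_i$ and $g_j$, we have that

$$\frac{r(t)}{2\kappa} \ \ge\ H(\mathbf{F}(\boldsymbol{\theta}(t)), \mathbf{G}(\boldsymbol{\phi}(t))) \ \ge\ \frac{r(t)}{2K}.$$

Thus we can extend our analysis:

$$\dot{H} \le -\frac{\lambda}{2}\, r(t) \le -\frac{2\kappa\lambda}{2}\, H(t) = -\lambda\kappa\, H(t) \ \Rightarrow\ H(t) \le H_0 e^{-\lambda\kappa t} \ \Rightarrow\ r(t) \le 2 K H_0 e^{-\lambda\kappa t}.$$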
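The sandwich bound on a single coordinate term of $H$ can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the original analysis: it assumes a one-dimensional sigmoid $f_i$, for which the gradient at the point of the trajectory where $f_i$ takes the value $z$ is $z(1-z)$, and it approximates the integral with a midpoint rule. The function name `lyapunov_term` and the sample values of $p_i$ and $f_i$ are hypothetical.

```python
def lyapunov_term(f_val, p, n=20000):
    """Midpoint-rule estimate of the integral of (z - p) / (z (1 - z))^2
    from p to f_val, together with the smallest and largest squared
    gradient (z (1 - z))^2 seen on the integration grid."""
    h = (f_val - p) / n
    total = 0.0
    kappa, big_k = float("inf"), 0.0
    for k in range(n):
        z = p + (k + 0.5) * h           # midpoint of the k-th cell
        grad_sq = (z * (1.0 - z)) ** 2  # squared sigmoid gradient at value z
        total += (z - p) / grad_sq * h
        kappa = min(kappa, grad_sq)
        big_k = max(big_k, grad_sq)
    return total, kappa, big_k


# Hypothetical equilibrium coordinate p_i and current output f_i.
p, f_val = 0.5, 0.9
H_i, kappa, K = lyapunov_term(f_val, p)
lower = (f_val - p) ** 2 / (2.0 * K)
upper = (f_val - p) ** 2 / (2.0 * kappa)
print(lower, H_i, upper)
```

For $p_i = 1/2$ the integrand has the antiderivative $\frac{1}{2z} + \frac{1}{2(1-z)}$, so the exact value of the term at $f_i = 0.9$ is $32/9$, which lies between the two quadratic bounds.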

9. APPLICATIONS

9.1. CONNECTING GANS AND HIDDEN CONVEX-CONCAVE GAMES

At the heart of many GAN formulations, like the standard GAN (Goodfellow et al., 2014b), f-GAN (Nowozin et al., 2016) and Wasserstein GAN (WGAN) (Arjovsky et al., 2017), lies a classical convex-concave game in the operator output space. Indeed, for the realizable case, Goodfellow et al. (2014b) used the underlying convexity properties to find the Nash equilibria of the standard GAN, and Farnia & Ozdaglar (2020) did the same for the f-GAN and WGAN. Perhaps surprisingly, neither work explicitly references the convex-concave nature of the operator output space game or von Neumann's minimax theorem. To highlight the significance of von Neumann equilibria as a solution concept for GANs, we show how the optimal $G^*$ and $D^*$ can be derived separately from each other by solving the corresponding min-max (max-min) problems. This allows one to independently verify the validity of von Neumann's minimax theorem and its generalizations for GANs. We also extend our analysis to a wide class of non-realizable cases. In practice, however, as noted explicitly by Goodfellow (2017), the updates in GAN training happen in the parameter space, giving rise to an HCC game. This is exactly what motivated studying the learning dynamics of HCC games in Section 3.

Thus, in this section, we present these connections between hidden convex-concave games and the different architectures of generative adversarial networks. More specifically, we start by exploring the structure of GANs and we verify their hidden convex-concave intrinsic form.

1. Under this scope of hidden games, the strong (or even strict) convexity/concavity of at least one of the players (discriminator/generator), in combination with the convergence results of the following sections, provides some theoretical explanation for the convergence properties of those architectures even under the vanilla gradient descent-ascent dynamics.
In the following three subsections, we derive both the arg min max and the arg max min solutions for the "vanilla" GAN, the f-GAN and the WGAN, using min-max optimization arguments based on the minimax theorem for convex-concave functions. More precisely:

1. In Lemmas 4, 9 and 14, we present the optimal discriminators, which constitute the best response to a fixed generator. In all these maximization problems, each $D(x)$ is typically decoupled and $D^*_G(x)$ is derived from the hidden concavity of the discriminator architecture.
2. In Lemmas 5, 10 and 15, we present the optimal generators, which constitute the best response to a fixed discriminator. In all these minimization problems, the generator can typically cheat the fixed discriminator by greedily producing a distribution supported only on the restricted subset of points for which the discriminator has the highest confidence about their originality.
3. In Lemmas 6, 11 and 16, we leverage the lemmas of Item 1 to understand the form of the GAN's utility function, which typically corresponds to the JSD, the f-divergence and the Wasserstein distance, each of which lends its name to the respective GAN architecture. It is then trivial to show that $p_{data}$ is the optimal choice in the realizable case.
4. In Lemmas 7, 12 and 17, on the other side of the coin, we derive the max-min solutions as well. Our proof strategy partitions the data points into two basic sets, $S_{G^*_D}$ and $S^c_{G^*_D}$, the points that are "preferable" to the generator and those that are not. Leveraging the concavity of the objective, we show that the best strategy for the discriminator is to label all points uniformly with the same confidence, in order to incentivize the generator to expand its support as much as possible.
5. In Lemmas 8 and 13, we analyze the non-realizable case. On the one hand, using Item 3 we are able to compute the min-max generator $G^*$. On the other hand, to determine the max-min discriminator, we apply von Neumann's minimax theorem to prove that $D^* = \text{Best-Response}(G^*)$.

9.1.1. GAN

The utility of the zero-sum game $V(G, D)$ for the distribution $p_{data}$ over the discrete set $\mathcal{N}$ is

$$V(G, D) = \sum_{x \in \mathcal{N}} p_{data}(x) \log(D(x)) + \sum_{x \in \mathcal{N}} p_G(x) \log(1 - D(x)).$$

On the one hand, it is easy to check that for a fixed discriminator $D$, the utility function is linear in $p_G$. On the other hand, for a fixed generator $G$, the utility function is of the form $a \log(D) + b \log(1 - D)$, which is strongly concave. We start our work with the following lemmas.

Lemma 4 (Goodfellow et al. (2014b)). For a fixed generator $G$ the optimal discriminator is

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}.$$

Proof. Observe that the optimization problem for each $D(x)$ is decoupled. Thus

$$D^*_G(x) = \arg\max_{D \in [0,1]} \ p_{data}(x) \log(D) + p_G(x) \log(1 - D).$$

By concavity, the unique maximum of the above is given by

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}.$$

Lemma 5.

Proof. We can substitute in $V(G, D)$ the optimal discriminator from Lemma 4. Thus we get

$$V(G, D^*_G) = \sum_{x \in \mathcal{N}} p_{data}(x) \log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)} + \sum_{x \in \mathcal{N}} p_G(x) \log \frac{p_G(x)}{p_{data}(x) + p_G(x)}.$$

We can now prove that

$$V(G, D^*_G) = -\log(4) + \mathrm{KL}\Big(p_{data} \,\Big\|\, \frac{p_G + p_{data}}{2}\Big) + \mathrm{KL}\Big(p_G \,\Big\|\, \frac{p_G + p_{data}}{2}\Big) = -\log(4) + 2\,\mathrm{JSD}(p_{data} \| p_G).$$

By minimizing $V(G, D^*_G)$, the result follows trivially.

Lemma 7. The max-min discriminator is $\forall x \in \mathcal{N}: D^*(x) = \frac{1}{2}$ when the generator is allowed to choose any distribution over $\mathcal{N}$.

Proof. We can substitute in $V(G, D)$ the optimal generator from Lemma 5:

$$V(G^*_D, D) = \log(D_{\max}) \sum_{x \in S_{G^*_D}} p_{data}(x) + \log(1 - D_{\max}) \sum_{x \in S_{G^*_D}} p_G(x) + \sum_{x \notin S_{G^*_D}} p_{data}(x) \log(D(x)) + \sum_{x \notin S_{G^*_D}} \underbrace{p_G(x)}_{0} \log(1 - D(x)).$$

Observe that for $x \notin S_{G^*_D}$, if $D$ takes more than two values then setting $D$ equal to the highest of them for all $x \notin S_{G^*_D}$ improves utility. So for an optimal discriminator we would have a single value $D_{\min}$ outside $S_{G^*_D}$, with $D_{\max} > D_{\min}$. In the end we have that

$$x \notin S_{G^*_D} \implies D^*(x) = D_{\min}, \qquad x \in S_{G^*_D} \implies D^*(x) = D_{\max}.$$

Observe that for any combination of $D_{\max}$ and $D_{\min}$ with $D_{\max} > D_{\min}$, the constant discriminator $D_{\max}$ has higher utility. Therefore we can focus our attention on the constant discriminator $D_{const}(x) = D$, for which

$$V(G^*_{D_{const}}, D_{const}) = \log(D) + \log(1 - D).$$

The optimal value for $D$ is $\frac{1}{2}$ and as a result $D^*(x) = \frac{1}{2}$.

Lemma 10. For a fixed discriminator $D$, any distribution supported only on $S_{G^*_D} = \{x \in \mathcal{N} : \forall x' \in \mathcal{N},\ f^*(D(x)) \ge f^*(D(x'))\}$ is an optimal generator when it is allowed to choose any distribution over $\mathcal{N}$.

Proof. Observe that for a fixed discriminator, the optimal generator optimizes $-\sum_{x \in \mathcal{N}} p_G(x) f^*(D(x))$, since the other term is independent of the generator. Let us define $F_{\max} = \max_{x \in \mathcal{N}} f^*(D(x))$. Then we have that

$$-\sum_{x \in \mathcal{N}} p_G(x) f^*(D(x)) \ge -F_{\max},$$

with equality holding only for distributions supported only on $S_{G^*_D}$.

Lemma 11 (Nowozin et al. (2016)). The min-max generator is the following distribution: $G^* = \arg\min_{G \in \mathcal{G}} D_f(p_{data} \| p_G)$.

Proof. We can substitute in $V(G, D)$ the optimal discriminator from Lemma 9. Thus we get

$$V(G, D^*_G) = \sum_{x \in \mathcal{N}} p_{data}(x)\, f'\Big(\frac{p_{data}(x)}{p_G(x)}\Big) - \sum_{x \in \mathcal{N}} p_G(x)\, f^*\Big(f'\Big(\frac{p_{data}(x)}{p_G(x)}\Big)\Big).$$

We will first prove that $V(G, D^*_G) = D_f(p_{data} \| p_G)$. Let us first recall the definition of the f-divergence:

$$D_f(p_{data} \| p_G) = \sum_{x \in \mathcal{N}} p_G(x)\, f\Big(\frac{p_{data}(x)}{p_G(x)}\Big).$$

Since $f$ is convex and lower semi-continuous, Fenchel convex duality guarantees that we can write $f$ in terms of its conjugate dual as $f(u) = \sup_{v \in \mathbb{R}} \{uv - f^*(v)\}$. Equivalently we get:

$$D_f(p_{data} \| p_G) = \sum_{x \in \mathcal{N}} p_G(x) \sup_{v \in \mathbb{R}} \Big\{\frac{p_{data}(x)}{p_G(x)}\, v - f^*(v)\Big\} = \sum_{x \in \mathcal{N}} \sup_{v \in \mathbb{R}} \{p_{data}(x)\, v - f^*(v)\, p_G(x)\} = \sum_{x \in \mathcal{N}} p_{data}(x)\, f'\Big(\frac{p_{data}(x)}{p_G(x)}\Big) - \sum_{x \in \mathcal{N}} p_G(x)\, f^*\Big(f'\Big(\frac{p_{data}(x)}{p_G(x)}\Big)\Big).$$

The last line follows from arguments similar to Lemma 9 applied to each term. By minimizing $V(G, D^*_G)$, the result follows trivially.

Lemma 12. The max-min discriminator is $\forall x \in \mathcal{N}: D^*(x) = f'(1)$ when the generator is allowed to choose any distribution over $\mathcal{N}$.

Proof. We want to substitute in $V(G, D)$ the optimal generator from Lemma 10. Observe that for all $x \in S_{G^*_D}$, we may not have all $D(x)$ equal. Only the values of $f^*$ are guaranteed to be equal, $f^*(D(x)) = F_{\max}$. However, if there are two distinct $D$ values then we can always pick the higher one and improve utility. Thus we can focus on discriminators that are constant over $S_{G^*_D}$. Let $D_{F_{\max}}$ be the corresponding value. Then

$$V(G^*_D, D) = D_{F_{\max}} \sum_{x \in S_{G^*_D}} p_{data}(x) - f^*(D_{F_{\max}}) \sum_{x \in S_{G^*_D}} p_G(x) + \sum_{x \notin S_{G^*_D}} p_{data}(x) D(x) - \sum_{x \notin S_{G^*_D}} \underbrace{p_G(x)}_{0} f^*(D(x)).$$

It is easy to check that Lemma 12 holds even in the non-realizable case. As a result, the generator is minimizing $D_f(p_{data} \| p_G)$, whose value is finite under the assumptions we made on $f$. Clearly the quantities above are finite. Thus there exists a real number $v$, the value of the game, such that:

$$\forall D \in \mathcal{D}: \ V(G^*, D) \le v = V(G^*, D^*) \quad (A)$$

$\forall G \in \mathcal{G}: \ V(G, D^*) \ge v = V($

$$V(G^*_{D_{const}}, D_{const}) = 0.$$

Finally, it is easy to check that the constant discriminator trivially satisfies the Lipschitz constraints, i.e., $|D_{const}(x) - D_{const}(x')| = 0 \le \mathrm{dist}(x, x')$ for any metric function $\mathrm{dist}$.
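The f-GAN identities of Lemmas 11 and 12 can also be sanity-checked numerically. The sketch below is an illustration under assumptions that are not in the original text: it picks the KL generator $f(u) = u \log u$, whose conjugate is $f^*(v) = e^{v-1}$ and whose derivative is $f'(u) = 1 + \log u$, and it uses two hypothetical fully supported distributions. It evaluates the utility $V(G, D) = \sum_x p_{data}(x) D(x) - \sum_x p_G(x) f^*(D(x))$ that appears in the proofs of those lemmas.

```python
import math


def f(u):
    # Assumed generator of the divergence: f(u) = u log u (KL), with f(1) = 0.
    return u * math.log(u)


def f_prime(u):
    return 1.0 + math.log(u)


def f_conj(v):
    # Convex conjugate of u log u.
    return math.exp(v - 1.0)


def value(p_data, p_g, d):
    """f-GAN utility: sum p_data(x) D(x) - sum p_G(x) f*(D(x))."""
    return sum(pd * dx - pg * f_conj(dx) for pd, pg, dx in zip(p_data, p_g, d))


def f_divergence(p_data, p_g):
    return sum(pg * f(pd / pg) for pd, pg in zip(p_data, p_g))


# Hypothetical distributions over four points.
p_data = [0.1, 0.2, 0.3, 0.4]
p_g = [0.25, 0.25, 0.25, 0.25]

# Best-response discriminator D*_G(x) = f'(p_data(x) / p_G(x)).
d_star = [f_prime(pd / pg) for pd, pg in zip(p_data, p_g)]
print(value(p_data, p_g, d_star), f_divergence(p_data, p_g))

# Constant discriminator D(x) = f'(1): the utility D - f*(D) vanishes there.
print(value(p_data, p_g, [f_prime(1.0)] * 4))
```

With the best-response discriminator the utility coincides with the divergence, and among constant discriminators the value $D - f^*(D)$ peaks at $D = f'(1)$, where it equals zero.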

9.2. GANS AND HIDDEN CONSTRAINED OPTIMIZATION

In the following section, we will generalize the results of Section 3.2 and Section 3.3 to the case of the vanilla GAN of Goodfellow et al. (2014a), whose objective is linear-strongly-concave and whose minimization part is constrained to the distributional simplex. In order to remove the constraints from this objective, we plan to make use of a Lagrange multiplier. We remind the reader that since both the discriminator and the generator use sigmoid activations, we only have to capture the constraint $\sum_{x \in \mathcal{N}} p_G(x) = 1$. Thus, our equivalent Lagrangian is:

9.3. ZERO-SUM GAMES

We close this section with an application of our regularization machinery to hidden bilinear games. Hidden bilinear zero-sum games were introduced by Vlatakis-Gkaragkounis et al. (2019) and are formally defined as follows.

Definition 11 (Hidden Bilinear Zero-Sum Game). In a hidden bilinear zero-sum game there are two players, equipped with smooth functions $\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^N$ and $\mathbf{G}: \mathbb{R}^m \to \mathbb{R}^M$ respectively, and a payoff matrix $U \in \mathbb{R}^{N \times M}$, such that each player inputs its own decision vector, $\boldsymbol{\theta} \in \mathbb{R}^n$ and $\boldsymbol{\phi} \in \mathbb{R}^m$, and is trying to maximize or minimize $r(\boldsymbol{\theta}, \boldsymbol{\phi}) = \mathbf{F}(\boldsymbol{\theta})^\top U\, \mathbf{G}(\boldsymbol{\phi})$ respectively.

For the special case of hidden bilinear games, Vlatakis-Gkaragkounis et al. (2019) proved that if the dimension of the game is at least two (e.g., akin to Rock-Paper-Scissors), then GDA dynamics tend to "cycle" through their parameter space, with behavior even more complex than a typical periodic trajectory. Specifically, the system is formally analogous to Poincaré recurrent systems (e.g., the many-body problem in physics). In contrast, leveraging Theorem 6, we know that by adding a small regularization term we can "break" the cycling behavior and converge to an approximate Nash equilibrium. We close this section by presenting a comparison between the optimization portraits of GDA dynamics with and without regularization for the archetypical game of Rock-Paper-Scissors (Figure 5).
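The contrast between cycling and convergence can be reproduced in a few lines. The following is a minimal sketch, not the experimental code behind the paper's figures. It assumes per-coordinate sigmoid activations for both players, a regularizer $\frac{\lambda}{2}\|\mathbf{F} - p\|^2 - \frac{\lambda}{2}\|\mathbf{G} - q\|^2$ centered at the mixed equilibrium $p = q = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, a forward-Euler discretization of the continuous-time dynamics, and that the first player descends while the second ascends. The names `run_gda`, `lam`, `theta0`, `phi0` and the initial points are hypothetical.

```python
import math

# Rock-Paper-Scissors payoff matrix: antisymmetric, rows and columns sum to zero.
A = [[0.0, -1.0, 1.0],
     [1.0, 0.0, -1.0],
     [-1.0, 1.0, 0.0]]
CENTER = 1.0 / 3.0  # each coordinate of the assumed regularization center


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def run_gda(lam, theta0, phi0, dt=0.01, steps=30000):
    """Forward-Euler GDA on F^T A G + lam/2 |F - c|^2 - lam/2 |G - c|^2.

    Returns the squared distance of (F, G) from the mixed equilibrium
    at the end of the run.
    """
    theta, phi = list(theta0), list(phi0)
    for _ in range(steps):
        F = [sigmoid(t) for t in theta]
        G = [sigmoid(s) for s in phi]
        # Gradients of the regularized objective with respect to the outputs.
        grad_F = [sum(A[i][j] * G[j] for j in range(3)) + lam * (F[i] - CENTER)
                  for i in range(3)]
        grad_G = [sum(A[i][j] * F[i] for i in range(3)) - lam * (G[j] - CENTER)
                  for j in range(3)]
        # Chain rule through the sigmoid: dF_i / dtheta_i = F_i (1 - F_i).
        theta = [theta[i] - dt * grad_F[i] * F[i] * (1.0 - F[i]) for i in range(3)]
        phi = [phi[j] + dt * grad_G[j] * G[j] * (1.0 - G[j]) for j in range(3)]
    F = [sigmoid(t) for t in theta]
    G = [sigmoid(s) for s in phi]
    return sum((v - CENTER) ** 2 for v in F + G)


theta0, phi0 = [0.5, -1.0, -0.5], [-1.5, 0.0, -0.5]
r_unregularized = run_gda(0.0, theta0, phi0)
r_regularized = run_gda(1.0, theta0, phi0)
print(r_unregularized, r_regularized)
```

Without regularization the distance from the mixed equilibrium stays bounded away from zero, as expected from a conserved Lyapunov function, while with $\lambda = 1$ it shrinks to numerical noise.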



A value $a \in \mathrm{Im}\, f$ is called a regular value of $f$ if for all $q \in \mathrm{dom}\, f$ with $f(q) = a$, it holds that $\nabla f(q) \neq 0$. A function is proper if inverse images of compact subsets are compact.

We note that $D^*$ may take the value 1 for some $x \in \mathcal{N}$ if the generator $G^*$ does not have full support. Assigning $D(x) = 1$ for some $x$ may lead to infinite utilities in general. We prove, however, that for the pair $(G^*, D^*)$ this is not the case. We thus consider that pair an equilibrium.

This assumption guarantees that $D_f$ is always finite even if the distribution chosen by the generator is not fully supported on $\mathcal{N}$. This in turn guarantees that $D^*$ is also finite, resulting in a meaningful equilibrium. Unbounded divergences like KL are known to be problematic for GANs even in practice (Arjovsky et al., 2017).



Figure 3: Comparison of the $\ell_2$ distance from the equilibrium and the Lyapunov function. Both converge to zero, as we state, but the $\ell_2$ distance does not do so monotonically. For $p_{data}$ we choose a fully mixed distribution of dimension $d = 4$. Given the sigmoid activations, all initializations are safe. We defer the detailed proof of convergence to Section 9.2.

Figure 4: On the left, we show the trajectories of regularized GDA for α 2 * = 1 as well as the level sets of Equation (4). All trajectories converge to one of the two equilibria $(0, 1)$ and $(0, -1)$, whereas without regularization GDA would cycle on the level sets. In the right figure, we replace the exact expectations in $V_{WGAN}$ with approximations via sampling, and the continuous-time updates on $\alpha$ and $v$ with discrete ones. For small learning rates and large sample sizes, unregularized GDA continues to cycle. In contrast, the regularization approach of Lei et al. (2019) converges to the $(0, 1)$ equilibrium.

…
2.2 Hidden Convex Concave Games
2.3 Reparametrization
3 Learning in Hidden Convex Concave Games
3.1 General Case
3.2 Hidden strictly convex concave games
3.3 Convergence via regularization
… in dynamical systems
6.2 Background in convex optimization
6.3 Background in Game Theory
7 Preliminaries
8 Hidden Convex Concave Games
8.1 General case
8.2 Hidden strictly convex concave games
8.2.1 Gradient Descent-Ascent Dynamics
8.2.2 Connections to Hamiltonian Descent
8.3 Regularization and convergence
9 Applications
9.1 Connecting GANs and Hidden Convex-Concave Games
9.1.1 GAN
9.1.2 f-GAN
9.1.3 WGAN
9.2 GANs and Hidden Constrained Optimization
9.3 Zero-Sum Games

For a fixed discriminator $D$, any distribution supported only on $S_{G^*_D} = \{x \in \mathcal{N} : \forall x' \in \mathcal{N},\ D(x) \ge D(x')\}$ is an optimal generator when it is allowed to choose any distribution over $\mathcal{N}$.

Proof. Observe that for a fixed discriminator, the optimal generator optimizes $\sum_{x \in \mathcal{N}} p_G(x) \log(1 - D(x))$, since the other term is independent of the generator. Let us define $D_{\max} = \max_{x \in \mathcal{N}} D(x)$. Then we have that

$$\sum_{x \in \mathcal{N}} p_G(x) \log(1 - D(x)) \ge \log(1 - D_{\max}),$$

with equality holding only for distributions supported only on $S_{G^*_D}$.

Lemma 6 (Goodfellow et al. (2014b)). The min-max generator is the following distribution: $G^* = \arg\min_{G \in \mathcal{G}} \mathrm{JSD}(p_{data} \| p_G)$.
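The closed forms for the vanilla GAN can be verified on a small example. The sketch below is only an illustration with hypothetical distributions and helper names (`value`, `best_response_discriminator`, `generator_term`). It checks that the best-response discriminator $p_{data}/(p_{data} + p_G)$ yields $V(G, D^*_G) = -\log 4 + 2\,\mathrm{JSD}(p_{data} \| p_G)$, and that for a fixed discriminator the generator's term is smallest when its mass sits on the points where $D$ is largest.

```python
import math


def value(p_data, p_g, d):
    """Vanilla GAN utility on a finite set."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))


def best_response_discriminator(p_data, p_g):
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def generator_term(p_g, d):
    """The part of the utility that depends on the generator."""
    return sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d))


# Hypothetical distributions over four points.
p_data = [0.1, 0.2, 0.3, 0.4]
p_g = [0.4, 0.3, 0.2, 0.1]

d_star = best_response_discriminator(p_data, p_g)
v_star = value(p_data, p_g, d_star)
print(v_star, -math.log(4.0) + 2.0 * jsd(p_data, p_g))

# Fixed discriminator whose maximum 0.7 is attained on the middle two points.
d_fixed = [0.2, 0.7, 0.7, 0.4]
greedy_generator = [0.0, 0.5, 0.5, 0.0]
print(generator_term(greedy_generator, d_fixed), math.log(1.0 - 0.7))
```

In the realizable case $p_G = p_{data}$ the best response is the constant $\frac{1}{2}$ and the utility equals $-\log 4$, in line with the max-min discriminator derived in the text.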

Observe that for $x \notin S_{G^*_D}$, if $D$ takes more than two values then setting $D$ equal to the highest of them for all $x \notin S_{G^*_D}$ improves utility. So for an optimal discriminator we would have a single value $D_{F_{\min}}$ with $f^*(D_{F_{\min}}) < f^*(D_{F_{\max}})$. As a result

$$x \notin S_{G^*_D} \implies D^*(x) = D_{F_{\min}}, \qquad x \in S_{G^*_D} \implies D^*(x) = D_{F_{\max}}.$$

We now have two cases. For any combination with $D_{F_{\min}} > D_{F_{\max}}$, the constant discriminator $D(x) = D_{F_{\min}}$ has higher utility. Symmetrically, for any combination with $D_{F_{\max}} > D_{F_{\min}}$, the constant discriminator $D(x) = D_{F_{\max}}$ has higher utility. Thus the optimal discriminator is constant. Plugging in the constant discriminator $D_{const}(x) = D$ we get

$$V(G^*_{D_{const}}, D_{const}) = D - f^*(D).$$

The optimal value for $D$, following the approach of Lemma 9, is $f'(1)$, and as a result $D^*(x) = f'(1)$.

Lemma 13 (Non-realizable case). Assume that $f \in C^1$ is strictly convex and $\lim_{x \to 0^+} x f(\frac{1}{x})$ exists and is finite$^{6}$. If the choice of generator $G$ is restricted to $\mathcal{G}$, a convex compact subset of the $|\mathcal{N}|$-dimensional simplex such that $p_{data} \notin \mathcal{G}$, then

$$(G^*, D^*) = \Big(\arg\min_{G \in \mathcal{G}} D_f(p_{data} \| p_G),\ f'\Big(\frac{p_{data}}{p_{G^*}}\Big)\Big).$$

Proof. We cannot readily apply von Neumann's minimax theorem to $V(G, D)$, since the discriminator's strategy set $\mathcal{D} = \mathbb{R}^{|\mathcal{N}|}$ is not compact. We can still apply Fan's minimax theorem:

$$\min_{G \in \mathcal{G}} \sup_{D \in \mathcal{D}} V(G, D) = \sup_{D \in \mathcal{D}} \min_{G \in \mathcal{G}} V(G, D).$$

$G^*, D^*)$ (B)

for $G^*$ the minimizer of $D_f(p_{data} \| p_G)$ and a $D^* \in \mathbb{R}^{|\mathcal{N}|}$. Now applying Lemma 9 we have that

$$\tilde{D} = \text{Best-Response}(G^*) = f'\Big(\frac{p_{data}(x)}{p_{G^*}(x)}\Big).$$

Additionally, by the optimality of the response and the consequence (A) of the minimax theorem, it holds that $V(G^*, \tilde{D}) = v$. Finally, assuming that $f$ is strictly convex, we get that $V(G, \cdot)$ is strictly concave, $\text{Best-Response}(G^*)$ is unique, and thus

$$D^* = \tilde{D} = f'\Big(\frac{p_{data}(x)}{p_{G^*}(x)}\Big).$$

Lemma 17. The max-min discriminator is $\forall x \in \mathcal{N}: D^*(x) = c$, a constant function, when the generator is allowed to choose any distribution over $\mathcal{N}$.

Proof. We can substitute in $V(G, D)$ the optimal generator from Lemma 15. Observe that for $x \notin S_{G^*_D}$, if $D$ takes more than two values then setting $D$ equal to the highest of them for all $x \notin S_{G^*_D}$ improves utility. So for an optimal discriminator we would have a single value $D_{\min}$ outside $S_{G^*_D}$, with $D_{\max} > D_{\min}$. In the end we have that

$$x \notin S_{G^*_D} \implies D^*(x) = D_{\min}, \qquad x \in S_{G^*_D} \implies D^*(x) = D_{\max}.$$

Observe that for any combination of $D_{\max}$ and $D_{\min}$ with $D_{\max} > D_{\min}$, the constant discriminator $D_{\max}$ has higher utility. Therefore we can focus our attention on the constant discriminator $D_{const}(x) = D$, where the optimal value is exactly zero.

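$$\min_{p_G(x) \ge 0,\ \sum_{x \in \mathcal{N}} p_G(x) = 1}\ \max_{D \in (0,1)^{|\mathcal{N}|}} V(G, D) = \sum_{x \in \mathcal{N}} p_{data}(x) \log(D(x)) + \sum_{x \in \mathcal{N}} p_G(x) \log(1 - D(x))$$

At first glance, rewriting the equivalent Lagrangian formulation of the aforementioned constrained min-max problem shows that the strong-concavity property no longer holds. However, our following theorem shows that, by further exploiting the structure of the architecture of Goodfellow et al. (2014a), a convergence result is possible.

Theorem 14. Let $V(\mathrm{Gen}_{\boldsymbol{\theta}}, \mathrm{Disc}_{\boldsymbol{\phi}})$ be the Goodfellow GAN as described in Section 4, where $G$ and $D$ use sigmoid activations. Then for a fully mixed distribution $p_{data}$, $(\mathbf{F}(t) = \mathrm{Gen}_{\boldsymbol{\theta}(t)}, \mathbf{G}(t) = \mathrm{Disc}_{\boldsymbol{\phi}(t)})$ converges to $(p_{data}, \frac{1}{2} \mathbf{1}_{|\mathcal{N}|})$ as $t \to \infty$ under the dynamics of Equation (1).

Proof. Let us write down our original objective as a min-max problem in which the multiplier $\lambda \in \mathbb{R}$ enters alongside the discriminator output in $(0,1)^{|\mathcal{N}|}$:

$$L(\mathbf{F}, \mathbf{G}, \lambda) = p_{data}^\top \log(\mathbf{G}(\boldsymbol{\phi})) + \mathbf{F}(\boldsymbol{\theta})^\top \log(1 - \mathbf{G}(\boldsymbol{\phi})) + \lambda\,(\mathbf{F}^\top \mathbf{1}_{|\mathcal{N}|} - 1)$$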

Figure 5: Trajectories of a single player using gradient descent-ascent dynamics for a hidden bilinear game $L(\mathbf{F}(\boldsymbol{\theta}), \mathbf{G}(\boldsymbol{\phi})) = \mathbf{F}(\boldsymbol{\theta})^\top A\, \mathbf{G}(\boldsymbol{\phi})$, where $A$ is the classical Rock-Paper-Scissors payoff table and $\mathbf{F}, \mathbf{G}$ have sigmoid activations. The two left figures show the Poincaré recurrence for different initializations of the dynamics, a behavior consistent with the Lyapunov stability of Theorem 2. The two figures on the right illustrate executions that converge to the mixed Nash equilibrium by exploiting the regularization tools described in Section 3.3. The regularization terms added are centered at the mixed equilibrium of the game, leading to convergence to the unmodified equilibrium of the Rock-Paper-Scissors game.

's theorem does not give us pointwise convergence directly. But in the special case that $M$ contains only stable fixed points, we can apply the following theorem.

Theorem 10 (Pointwise Convergence Theorem, (Bhat & Bernstein, 2003, Proposition 5.4)). Let $x(t)$ be a trajectory of Equation ( ). If the positive limit sets of $x(t)$ contain a stable fixed point then

2. To indicate the relation of the von Neumann solution to this hidden model, we leverage this hidden convex-concave structure in order to compute the well-known min-max and max-min optima of GANs, with or without the realizability assumption. The results of this section are summarized in the following table: $p_{data}$ represents the target data distribution. $G^*$ is the min-max generator and $D^*$ is the max-min discriminator. JSD denotes the Jensen-Shannon divergence, $D_f$ the f-divergence for the convex function $f$, EMD the earth mover's distance and $c$ the constant discriminator. xGAN, xGAN correspond to the realizable and the non-realizable case respectively. "-" indicates the lack of a closed-form solution for $D^*$ of WGAN.

annex

It is easy to check that Lemma 6 holds even in the non-realizable case. As a result, the generator is minimizing $\mathrm{JSD}(p_{data} \| p_G)$, whose value is finite. Clearly the quantities above are finite. Thus there exists a real number $v$, the value of the game, such that

$$\forall D \in \mathcal{D}: \ V(G^*, D) \le v = V(G^*, D^*) \quad (A) \qquad \forall G \in \mathcal{G}: \ V(G, D^*) \ge v = V(G^*, D^*) \quad (B)$$

for $G^*$ the minimizer of $\mathrm{JSD}(p_{data} \| p_G)$ and a $D^* \in [0, 1]^{|\mathcal{N}|}$. Now applying Lemma 4, we have that

$$\tilde{D} = \text{Best-Response}(G^*) = \frac{p_{data}(x)}{p_{data}(x) + p_{G^*}(x)}.$$

Additionally, by the optimality of the response and the consequence (A) of the minimax theorem, it holds that $V(G^*, \tilde{D}) = v$. Finally, since $V(G^*, \cdot)$ is strongly concave, all other discriminators receive value less than $v$ and are not optimal. Thus $D^* = \tilde{D}$.

The utility of the zero-sum game $V(G, D)$ for the distribution $p_{data}$ over the discrete set $\mathcal{N}$ is

$$V(G, D) = \sum_{x \in \mathcal{N}} p_{data}(x) D(x) - \sum_{x \in \mathcal{N}} p_G(x) f^*(D(x)).$$

We will assume that $f$ is a strictly convex function with $f(1) = 0$. On the one hand, it is easy to check that for a fixed discriminator $D$, the utility function is linear in $p_G$. On the other hand, for a fixed generator $G$, the utility function is of the form $aD - b f^*(D)$, which is strictly concave. We start our work with the following lemmas.

Lemma 9 (Nowozin et al. (2016)). For a fixed generator $G$ the optimal discriminator is

$$D^*_G(x) = f'\Big(\frac{p_{data}(x)}{p_G(x)}\Big).$$

Proof. Observe that the optimization problem for each $D(x)$ is decoupled. Thus

$$D^*_G(x) = \arg\max_{D \in \mathbb{R}} \ p_{data}(x) D - p_G(x) f^*(D).$$

By concavity, the unique maximum of the above is given by the Fermat criterion $p_{data}(x) - p_G(x)\,(f^*)'(D^*_G(x)) = 0$, that is, $D^*_G(x) = f'\big(\frac{p_{data}(x)}{p_G(x)}\big)$.

The utility of the zero-sum game $V(G, D)$ for the distribution $p_{data}$ over the discrete metric space $(\mathcal{N}, \mathrm{dist})$ is

$$V(G, D) = \sum_{x \in \mathcal{N}} p_{data}(x) D(x) - \sum_{x \in \mathcal{N}} p_G(x) D(x).$$

On the one hand, it is easy to check that for a fixed discriminator $D$, the utility function is linear in $p_G$. On the other hand, for a fixed generator $G$, the utility function is linear in $D$. We start our work with the following lemmas.

Lemma 14 (Arjovsky et al. (2017)). For a fixed generator $G$ the optimal discriminator is a solution of the following linear program:

$$\text{maximize over } D(\cdot): \ \sum_{x \in \mathcal{N}} (p_{data}(x) - p_G(x))\, D(x) \quad \text{subject to } |D(x) - D(x')| \le \mathrm{dist}(x, x') \ \ \forall x, x' \in \mathcal{N},$$

where the optimal value of the LP is the earth mover's distance between $p_{data}$ and $p_G$.

Proof. Indeed, by definition any solution of the above LP is an optimal discriminator for a fixed generator $G$. To complete the proof of the statement, we recall that the earth mover's distance of $(p_{data}, p_G)$ is equal to

$$\min_{\Delta \in \mathrm{Coupling}(p_{data}, p_G)} \ \sum_{x, x' \in \mathcal{N}} \Delta(x, x')\, \mathrm{dist}(x, x').$$

Now if we consider the dual formulation of the Wasserstein distance, the Kantorovich duality (Evans, 1997; Villani, 2008) implies that the above linear program is exactly the dual linear program that computes the earth mover's distance.

Lemma 15. For a fixed discriminator $D$, any distribution supported only on $S_{G^*_D} = \{x \in \mathcal{N} : \forall x' \in \mathcal{N},\ D(x) \ge D(x')\}$ is an optimal generator when it is allowed to choose any distribution over $\mathcal{N}$.

Proof. Observe that for a fixed discriminator, the optimal generator optimizes $-\sum_{x \in \mathcal{N}} p_G(x) D(x)$, since the other term is independent of the generator. Let us define $D_{\max} = \max_{x \in \mathcal{N}} D(x)$. Then we have that $-\sum_{x \in \mathcal{N}} p_G(x) D(x) \ge -D_{\max}$, with equality holding only for distributions supported only on $S_{G^*_D}$.

Lemma 16 (Arjovsky et al. (2017)). The min-max generator is the following distribution: $G^* = \arg\min_{G \in \mathcal{G}} \mathrm{EMD}(p_{data}, p_G)$.

Proof. We can substitute in $V(G, D)$ the optimal discriminator from Lemma 14. Thus we get $V(G, D^*_G) = \mathrm{EMD}(p_{data}, p_G)$, and the result follows trivially.

Here $f_i$ and $g_j$ are sigmoid functions and $\theta_i$ and $\phi_j$ are their one-dimensional inputs. Let us write again the equivalent dynamics of Equation (3) for the sigmoid activations and the Lagrange multiplier. Applying the same steps as in Theorem 3 for sigmoids:

Since all initializations are safe in this game, our "generalized" Lyapunov function:

where $\lambda^*$ is the Lagrange multiplier at the equilibrium of the non-hidden game and $x_i$ is the $i$-th element of $\mathcal{N}$. Applying the same steps as in Lemma 3, we get that GDA approaches the largest invariant set $E$ of points $(\mathbf{F}, \mathbf{G}, \lambda)$ that have the following properties:

For the first equality, we have that the value of $\lambda$ does not affect $L$ when the generator respects the sum-to-one constraint. Thus $L(p_{data}, \mathbf{G}, \lambda) = L(p_{data}, \mathbf{G}, \lambda^*)$. Then we can observe that $L(p_{data}, \mathbf{G}, \lambda^*)$ is strictly concave in $\mathbf{G}$ and, given that $\frac{1}{2} \mathbf{1}_{|\mathcal{N}|}$ is its unique maximum, we have that

Given that $E$ is an invariant set and $\mathbf{G}$ is constant in $E$, we have that $\dot{\mathbf{G}} = 0$. In other words,

As a consequence we have that

Once again, given that $E$ is an invariant set and $\mathbf{F}$ is constant in $E$, we have that $\dot{\mathbf{F}} = 0$.

Observe that by the optimality conditions of the non-hidden game, $\lambda^*$ needs to satisfy the same equation and thus $\lambda = \lambda^*$. Clearly we have that

Thus the dynamics converge to the unique equilibrium of the hidden game.

