ON THE IMPOSSIBILITY OF GLOBAL CONVERGENCE IN MULTI-LOSS OPTIMIZATION

Abstract

Under mild regularity conditions, gradient-based methods converge globally to a critical point in the single-loss setting. This is known to break down for vanilla gradient descent when moving to multi-loss optimization, but can we hope to build some algorithm with global guarantees? We resolve this open problem in the negative by proving that desirable convergence properties cannot simultaneously hold for any algorithm. Our result has more to do with the existence of games with no satisfactory outcomes than with algorithms per se. More explicitly, we construct a two-player game with zero-sum interactions whose losses are both coercive and analytic, but whose only simultaneous critical point is a strict maximum. Any 'reasonable' algorithm, defined to avoid strict maxima, will therefore fail to converge. This is fundamentally different from the single-loss setting, where coercivity implies the existence of a global minimum. Moreover, we prove that a wide range of existing gradient-based methods almost surely have bounded but non-convergent iterates in a constructed zero-sum game for suitably small learning rates. It nonetheless remains an open question whether such behavior can arise in high-dimensional games of interest to ML practitioners, such as GANs or multi-agent RL.

1. INTRODUCTION

Problem Setting. As multi-agent architectures proliferate in machine learning, it is becoming increasingly important to understand the dynamics of gradient-based methods when optimizing multiple interacting goals, otherwise known as differentiable games. This framework encompasses GANs (Goodfellow et al., 2014), intrinsic curiosity (Pathak et al., 2017), imaginative agents (Racanière et al., 2017), synthetic gradients (Jaderberg et al., 2017), hierarchical reinforcement learning (Wayne & Abbott, 2014; Vezhnevets et al., 2017) and multi-agent RL in general (Busoniu et al., 2008). The interactions between learning agents make for vastly more complex mechanics: naively applying gradient descent on each loss simultaneously is known to diverge even in simple bilinear games.

Related Work. A large number of methods have recently been proposed to alleviate the failings of simultaneous gradient descent: adaptations of single-loss algorithms such as Extragradient (EG) (Azizian et al., 2019) and Optimistic Mirror Descent (OMD) (Daskalakis et al., 2018), Alternating Gradient Descent (AGD) for finite regret (Bailey et al., 2019), Consensus Optimization (CO) for GAN training (Mescheder et al., 2017), Competitive Gradient Descent (CGD) based on solving a bilinear approximation of the loss functions (Schaefer & Anandkumar, 2019), Symplectic Gradient Adjustment (SGA) based on a novel decomposition of game mechanics (Balduzzi et al., 2018; Letcher et al., 2019a), and opponent-shaping algorithms including Learning with Opponent-Learning Awareness (LOLA) (Foerster et al., 2018) and its convergent counterpart, Stable Opponent Shaping (SOS) (Letcher et al., 2019b). Let A be this set of algorithms. Each has shown promising theoretical implications and empirical results, but none offers insight into global convergence in the non-convex setting, which includes the vast majority of machine learning applications.
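The divergence of simultaneous gradient descent on bilinear games mentioned above is easy to reproduce numerically. The sketch below is a toy illustration (not taken from the paper): simGD on the zero-sum bilinear game with losses l1(x, y) = xy and l2(x, y) = -xy applies a linear map that multiplies the distance to the origin by sqrt(1 + alpha^2) at every step, so the iterates spiral outward for any constant learning rate.

```python
import math

def simgd_bilinear(x, y, alpha, steps):
    """Simultaneous gradient descent on l1(x, y) = x*y, l2(x, y) = -x*y.

    Player 1 descends l1 in x (grad_x l1 = y); player 2 descends l2 in y
    (grad_y l2 = -x). Both players update from the same current point.
    """
    norms = []
    for _ in range(steps):
        x, y = x - alpha * y, y + alpha * x
        norms.append(math.hypot(x, y))
    return norms

# Each step scales the norm by exactly sqrt(1 + alpha**2) > 1, so the
# iterates spiral outward and diverge, never approaching the unique
# critical point at the origin.
norms = simgd_bilinear(1.0, 0.0, alpha=0.1, steps=1000)
```

With alpha = 0.1 the norm grows by a factor of roughly 1.005 per step, reaching about 145 after 1000 steps despite the origin being the only critical point of the game.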
One of the main roadblocks compared with single-loss optimization has been noted by Schaefer & Anandkumar (2019): "a convergence proof in the nonconvex case analogue to Lee et al. (2016) is still out of reach in the competitive setting. A major obstacle to this end is the identification of a suitable measure of progress (which is given by the function value in the single agent setting), since norms of gradients can not be expected to decay monotonously for competitive dynamics in non-convex-concave games." It has been established that Hamiltonian Gradient Descent converges in two-player zero-sum games under a "sufficiently bilinear" condition by Abernethy et al. (2019), but this algorithm is unsuitable for optimization as it cannot distinguish between minimization and maximization (Hsieh et al., 2020, Appendix C.4). Global convergence has also been established for some algorithms in a few special cases: potential and Hamiltonian games (Balduzzi et al., 2018), zero-sum games satisfying the two-sided Polyak-Łojasiewicz condition (Yang et al., 2020), zero-sum linear quadratic games (Zhang et al., 2019) and zero-sum games whose loss and first three derivatives are bounded (Mangoubi & Vishnoi, 2020). These are significant contributions with several applications of interest, but do not include any of the architectures mentioned above. Finally, Balduzzi et al. (2020) show that GD dynamics are bounded under a 'negative sentiment' assumption in smooth markets, which do include GANs, but this does not imply convergence, as we will show. Hsieh et al. (2020) uploaded a preprint just after the completion of this work with a similar focus to ours. They prove that generalized Robbins-Monro schemes may converge with arbitrarily high probability to spurious attractors. This includes simGD, AGD, stochastic EG, optimistic gradient and Kiefer-Wolfowitz. However, Hsieh et al. (2020) focus on the possible occurrence of undesirable convergence phenomena for stochastic algorithms.
We instead prove that desirable convergence properties cannot simultaneously hold for all algorithms (including deterministic ones). Moreover, their results apply only to decreasing step-sizes whereas ours include constant step-sizes. These distinctions are further highlighted by Hsieh et al. (2020) in their further related work section. Taken together, our works give a fuller picture of the failure of global convergence in multi-loss optimization.

Contribution. We prove that global convergence in multi-loss optimization is fundamentally incompatible with the 'reasonable' requirement that algorithms avoid strict maxima and converge only to critical points. We construct a two-player game with zero-sum interactions whose losses are coercive and analytic, but whose only critical point is a strict maximum (Theorem 1). Reasonable algorithms must therefore either diverge to infinite losses or cycle (bounded non-convergent iterates). One might hope that global convergence could at least be guaranteed in games with strict minima and no other critical points. On the contrary, we show that strict minima can have arbitrarily small regions of attraction, in the sense that reasonable algorithms will fail to converge there with arbitrarily high probability for a fixed initial parameter distribution (Theorem 2). Finally, restricting the game class even further, we construct a zero-sum game in which all algorithms in A (as defined in Appendix A) are proven to cycle (Theorem 3). It may be that cycles do not arise in high-dimensional games of interest including GANs. Proving or disproving this is an important avenue for further research, but requires that we recognise the impossibility of global guarantees in the first place.
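Cycling (bounded but non-convergent iterates) can already be observed in a simple bilinear game. The following is a toy sketch, not the construction behind Theorem 3: alternating gradient descent on l1(x, y) = xy, l2(x, y) = -xy applies a linear map with determinant 1 and complex eigenvalues on the unit circle, so the iterates stay on a bounded invariant ellipse around the origin without ever converging to it.

```python
import math

def agd_bilinear(x, y, alpha, steps):
    """Alternating gradient descent on l1 = x*y, l2 = -x*y.

    Player 1 updates first; player 2 then responds to the *updated* x.
    The composite map (x, y) -> (x - a*y, a*x + (1 - a**2)*y) has
    determinant 1, so it preserves a quadratic invariant and the orbit
    stays on a bounded ellipse.
    """
    norms = []
    for _ in range(steps):
        x = x - alpha * y          # player 1 step on l1
        y = y + alpha * x          # player 2 step on l2, using new x
        norms.append(math.hypot(x, y))
    return norms

# Bounded: the orbit never leaves a fixed annulus around the origin.
# Non-convergent: the norm never decays towards zero.
norms = agd_bilinear(1.0, 0.0, alpha=0.1, steps=10000)
```

For alpha = 0.1 the orbit preserves the quadratic x^2 - 0.1*xy + y^2, so the norm oscillates forever in a narrow band around 1: the iterates are bounded yet never approach the unique critical point.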

2.1. SINGLE LOSSES: GLOBAL CONVERGENCE OF GRADIENT DESCENT

Given a continuously differentiable function f : R^d → R, let θ_{k+1} = θ_k − α∇f(θ_k) be the iterates of gradient descent with learning rate α, initialised at θ_0. Under standard regularity conditions, gradient descent converges globally to critical points:

Proposition 1. Assume f ∈ C^2 has compact sublevel sets and is either analytic or has isolated critical points. For any θ_0 ∈ R^d, define U_0 = {θ : f(θ) ≤ f(θ_0)} and let L < ∞ be a Lipschitz constant for ∇f on U_0. Then for any 0 < α < 2/L we have lim_{k→∞} θ_k = θ̄ for some critical point θ̄.
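As a quick numerical check of Proposition 1 (a toy sketch; the function and constants below are illustrative choices, not from the paper), gradient descent on the coercive analytic function f(x, y) = (x^2 - 1)^2 + y^2 with a step size below 2/L converges to a critical point:

```python
def grad_f(x, y):
    """Gradient of f(x, y) = (x**2 - 1)**2 + y**2, which is coercive,
    analytic, and has isolated critical points at (0, 0) and (+-1, 0)."""
    return 4.0 * x * (x * x - 1.0), 2.0 * y

# On the initial sublevel set {f <= f(2, 1) = 10} the gradient is
# Lipschitz with constant L ~= 46, so any alpha < 2/L ~= 0.043 suffices.
alpha = 0.01
x, y = 2.0, 1.0
for _ in range(5000):
    gx, gy = grad_f(x, y)
    x, y = x - alpha * gx, y - alpha * gy

gx, gy = grad_f(x, y)
grad_norm = (gx * gx + gy * gy) ** 0.5
```

From this initialisation the iterates descend monotonically into the basin of the minimum at (1, 0), and the gradient norm decays to numerical zero, exactly as the proposition predicts.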



On the other hand, failure of global convergence has been shown for the Multiplicative Weights Update method by Palaiopanos et al. (2017), for policy-gradient algorithms by Mazumdar et al. (2020), and for simultaneous and alternating gradient descent (simGD and AGD) by Vlatakis-Gkaragkounis et al. (2019); Bailey et al. (2019), with interesting connections to Poincaré recurrence. Nonetheless, nothing is claimed about other optimization methods. Farnia & Ozdaglar (2020) show that GANs may have no Nash equilibria, but it does not follow that algorithms fail to converge, since there may be locally-attracting but non-Nash critical points (Mazumdar et al., 2019, Example 2).

