SGDA WITH SHUFFLING: FASTER CONVERGENCE FOR NONCONVEX-PŁ MINIMAX OPTIMIZATION

Abstract

Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-PŁ-PŁ case.

1. INTRODUCTION

A finite-sum minimax optimization problem aims to solve the following:

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x; y) := \frac{1}{n} \sum_{i=1}^{n} f_i(x; y), \tag{1}$$

where $f_i$ denotes the $i$-th component function. In plain language, we want to minimize the average of the $n$ component functions over $x$, while maximizing it over $y$ given $x$. Many important areas of modern machine learning fall within the minimax framework, including generative adversarial networks (GANs) (Goodfellow et al., 2020), adversarial attack and robust optimization (Madry et al., 2018; Sinha et al., 2018), multi-agent reinforcement learning (MARL) (Li et al., 2019), AUC maximization (Ying et al., 2016; Liu et al., 2020; Yuan et al., 2021), and many more. In most cases, the objective $f$ is nonconvex-nonconcave, i.e., neither convex in $x$ nor concave in $y$. Since general nonconvex-nonconcave problems are known to be intractable, we tackle problems with additional structure, such as smoothness and Polyak-Łojasiewicz (PŁ) condition(s). We elaborate the detailed settings for our analysis, nonconvex-PŁ and primal-PŁ-PŁ (or PŁ(Φ)-PŁ), in Section 2.

One of the simplest and most popular algorithms for solving problem (1) is stochastic gradient descent-ascent (SGDA), a natural extension of stochastic gradient descent (SGD) from minimization problems. Given an initial iterate $(x_0; y_0)$, at time $t \in \mathbb{N}$, SGDA (randomly) chooses an index $i(t) \in \{1, \dots, n\}$ and accesses the $i(t)$-th component to perform a pair of updates

$$x_t = x_{t-1} - \alpha \nabla_1 f_{i(t)}(x_{t-1}; y_{t-1}), \qquad y_t = y_{t-1} + \beta \nabla_2 f_{i(t)}(\bar{x}; y_{t-1}),$$

where $\bar{x} = x_{t-1}$ (simSGDA) or $\bar{x} = x_t$ (altSGDA). Here, $\alpha > 0$ and $\beta > 0$ are the step sizes and $\nabla_j$ denotes the gradient of $f_{i(t)}$ with respect to its $j$-th argument ($j = 1, 2$). As shown in the update equations above, there are two widely used versions of SGDA: simultaneous SGDA (simSGDA) and alternating SGDA (altSGDA).
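The two update rules above can be sketched in a few lines of Python. The helper below is a minimal illustration, not the paper's experimental code: the function name `sgda_rr` and the toy quadratic components are our own choices, and the example uses a convex-concave objective purely so that convergence is easy to check.

```python
import numpy as np

def sgda_rr(grad1, grad2, x, y, n, alpha, beta, epochs, alternating=False, seed=0):
    """SGDA with random reshuffling (SGDA-RR), a minimal sketch.

    grad1(i, x, y) / grad2(i, x, y): gradients of the i-th component
    w.r.t. the first / second argument. With alternating=False, both
    updates use x_{t-1} (simSGDA); with alternating=True, the ascent
    step uses the freshly updated x_t (altSGDA).
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        # One epoch = one pass over a fresh permutation (without replacement).
        for i in rng.permutation(n):
            x_new = x - alpha * grad1(i, x, y)          # descent on x
            x_used = x_new if alternating else x        # simSGDA vs altSGDA
            y = y + beta * grad2(i, x_used, y)          # ascent on y
            x = x_new
    return x, y

# Toy components (illustration only): f_i(x, y) = a_i*x^2/2 + x*y - c_i*y^2/2,
# whose average has its saddle point at (0, 0).
a = np.array([1.0, 2.0, 3.0])
c = np.array([1.0, 1.5, 2.0])
g1 = lambda i, x, y: a[i] * x + y   # d f_i / dx
g2 = lambda i, x, y: x - c[i] * y   # d f_i / dy
x_star, y_star = sgda_rr(g1, g2, x=1.0, y=1.0, n=3,
                         alpha=0.05, beta=0.05, epochs=200)
```

With these step sizes both variants contract toward the saddle point $(0, 0)$ of the averaged objective; flipping `alternating=True` changes only which $x$ the ascent step reads.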
In such stochastic gradient methods, there are two main categories of sampling schemes for the component indices i(t). One way is to sample i(t) independently (in time) and uniformly at random.
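The difference between the two index-sampling schemes is easy to see in code. The snippet below is a small sketch (variable names `wr` and `rr` are ours): with-replacement sampling draws each index independently, while random reshuffling draws a fresh permutation per epoch and uses it sequentially, so every component is visited exactly once per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, epochs = 5, 3

# With-replacement: each i(t) is drawn independently and uniformly,
# so a component may repeat or be skipped within an epoch.
wr = [int(rng.integers(n)) for _ in range(n * epochs)]

# Random reshuffling (without replacement): one fresh permutation per
# epoch, consumed sequentially; every index appears exactly once per epoch.
rr = [int(i) for _ in range(epochs) for i in rng.permutation(n)]
```

Both schemes produce $n \cdot \text{epochs}$ indices in total, but only the reshuffled sequence guarantees that each epoch covers all $n$ components.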

