GRADIENT DESCENT ASCENT FOR MIN-MAX PROBLEMS ON RIEMANNIAN MANIFOLDS

Anonymous

Abstract

In this paper, we study a class of useful non-convex minimax optimization problems on Riemannian manifolds and propose a class of Riemannian gradient descent ascent algorithms to solve them. Specifically, we propose a new Riemannian gradient descent ascent (RGDA) algorithm for deterministic minimax optimization. We prove that RGDA has a sample complexity of O(κ²ε⁻²) for finding an ε-stationary point of nonconvex strongly-concave minimax problems, where κ denotes the condition number. At the same time, we introduce a Riemannian stochastic gradient descent ascent (RSGDA) algorithm for stochastic minimax optimization, and prove that RSGDA achieves a sample complexity of O(κ⁴ε⁻⁴). To further reduce the sample complexity, we propose a novel momentum variance-reduced Riemannian stochastic gradient descent ascent (MVR-RSGDA) algorithm based on the momentum-based variance-reduction technique of STORM. We prove that MVR-RSGDA achieves a lower sample complexity of Õ(κ⁴ε⁻³) without large batches, which nearly matches the best known sample complexity of its Euclidean counterparts. Extensive experimental results on robust deep neural network training over the Stiefel manifold demonstrate the efficiency of our proposed algorithms.

1. INTRODUCTION

In this paper, we study a class of useful non-convex minimax (a.k.a. min-max) problems on a Riemannian manifold M, defined as:

min_{x∈M} max_{y∈Y} f(x, y), (1)

where the function f(x, y) is µ-strongly concave in y but possibly nonconvex in x. Here Y ⊆ R^d is a convex and closed set, f(•, y) : M → R for all y ∈ Y is a smooth but possibly nonconvex real-valued function on the manifold M, and f(x, •) : Y → R for all x ∈ M is a smooth and (strongly) concave real-valued function. In this paper, we mainly focus on the stochastic minimax optimization problem f(x, y) := E_{ξ∼D}[f(x, y; ξ)], where ξ is a random variable that follows an unknown distribution D. In fact, the problem (1) is associated with many existing machine learning applications:

1) Robust training of DNNs over a Riemannian manifold. Deep Neural Networks (DNNs) have recently demonstrated exceptional performance on many machine learning applications. However, they are vulnerable to adversarial example attacks, which show that a small perturbation of the data input can significantly change the output of a DNN. Thus, the security properties of DNNs have been widely studied; one topic of secured DNN research is to enhance the robustness of DNNs under adversarial example attacks. To be more specific, given training data D := {ξ_i = (a_i, b_i)}_{i=1}^n, where a_i ∈ R^d and b_i ∈ R represent the features and label of sample ξ_i respectively, each data sample a_i can be corrupted by a universal small perturbation vector y to generate an adversarial attack sample a_i + y, as in (Moosavi-Dezfooli et al., 2017; Chaubey et al., 2020). To make DNNs robust against adversarial attacks, one popular approach is to solve the following robust training problem:

min_x max_{y∈Y} (1/n) Σ_{i=1}^n ℓ(h(a_i + y; x), b_i), (2)

where y ∈ R^d denotes a universal perturbation, x denotes the weights of the neural network, h(•; x) is the deep neural network parameterized by x, and ℓ(•) is the loss function.
Here the constraint Y = {y : ‖y‖ ≤ ε} indicates that the poisoned samples should not be too different from the original ones. Recently, orthonormality on the weights of DNNs has gained much interest and has been found useful across different tasks such as person re-identification (Sun et al., 2017) and image classification (Xie et al., 2017). In fact, orthonormality constraints improve the performance of DNNs (Li et al., 2020; Bansal et al., 2018) and reduce overfitting to improve generalization (Cogswell et al., 2015). At the same time, orthonormality can stabilize the distribution of activations over layers within DNNs (Huang et al., 2018). Thus, we consider the following robust training problem over the Stiefel manifold M:

min_{x∈M} max_{y∈Y} (1/n) Σ_{i=1}^n ℓ(h(a_i + y; x), b_i). (3)

When data arrive continuously, we can rewrite the problem (3) as follows:

min_{x∈M} max_{y∈Y} E_ξ[f(x, y; ξ)],

where f(x, y; ξ) = ℓ(h(a + y; x), b) with ξ = (a, b).

2) Distributionally robust optimization over a Riemannian manifold. Distributionally robust optimization (DRO) (Chen et al., 2017; Rahimian & Mehrotra, 2019) is an effective method to deal with noisy, adversarial, and imbalanced data. At the same time, DRO in the Riemannian manifold setting is also widely applied in machine learning problems such as robust principal component analysis (PCA). To be more specific, given a set of data samples {ξ_i}_{i=1}^n, DRO over a Riemannian manifold M can be written as the following minimax problem:

min_{x∈M} max_{p∈S} Σ_{i=1}^n p_i ℓ(x; ξ_i) − ‖p − 1/n‖²,

where p = (p_1, ..., p_n) and S = {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0}. Here ℓ(x; ξ_i) denotes the loss function over the Riemannian manifold M, which applies to many machine learning problems such as PCA (Han & Gao, 2020a), dictionary learning (Sun et al., 2016), DNNs (Huang et al., 2018), and structured low-rank matrix learning (Jawanpuria & Mishra, 2018), among others.
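The inner maximization of the DRO problem above runs over the probability simplex S, so an ascent step on p needs a Euclidean projection onto S. Below is a minimal numpy sketch of the standard sort-based simplex projection; the function name and interface are illustrative choices, not from the paper:

```python
import numpy as np

def project_simplex(p):
    """Euclidean projection of p onto the probability simplex
    S = {p in R^n : sum_i p_i = 1, p_i >= 0} (sort-based algorithm)."""
    n = len(p)
    u = np.sort(p)[::-1]                      # coordinates in decreasing order
    css = np.cumsum(u)
    # largest 0-based index rho with u[rho] + (1 - css[rho]) / (rho + 1) > 0
    rho = np.nonzero(u + (1.0 - css) / (np.arange(n) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)        # uniform shift before clipping
    return np.maximum(p + tau, 0.0)
```

For example, `project_simplex(np.array([2.0, 0.0]))` returns `[1.0, 0.0]`, and a point already in S is returned unchanged.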
For example, the task of PCA can be cast on a Grassmann manifold. To the best of our knowledge, existing explicit minimax optimization methods such as the gradient descent ascent method only focus on minimax problems in Euclidean space. To fill this gap, in this paper we propose a class of efficient Riemannian gradient descent ascent algorithms to solve the problem (1) by using general retraction and vector transport. When the problem (1) is deterministic, we propose a new deterministic Riemannian gradient descent ascent algorithm; when the problem (1) is stochastic, we propose two efficient stochastic Riemannian gradient descent ascent algorithms. Our main contributions can be summarized as follows:

1) We propose a novel Riemannian gradient descent ascent (RGDA) algorithm for the deterministic minimax optimization problem (1). We prove that RGDA has a sample complexity of O(κ²ε⁻²) for finding an ε-stationary point.

2) We also propose a new Riemannian stochastic gradient descent ascent (RSGDA) algorithm for stochastic minimax optimization. In the theoretical analysis, we prove that RSGDA has a sample complexity of O(κ⁴ε⁻⁴).

3) To further reduce the sample complexity, we introduce a novel momentum variance-reduced Riemannian stochastic gradient descent ascent (MVR-RSGDA) algorithm based on the momentum-based variance-reduction technique of STORM (Cutkosky & Orabona, 2019). We prove that MVR-RSGDA achieves a lower sample complexity of Õ(κ⁴ε⁻³) (please see Table 1), which nearly matches the best known sample complexity of its Euclidean counterparts.

4) Extensive experimental results on robust DNN training over the Stiefel manifold demonstrate the efficiency of our proposed algorithms.

2. RELATED WORKS

Riemannian manifold optimization methods have been widely applied in machine learning problems including dictionary learning (Sun et al., 2016), matrix factorization (Vandereycken, 2013), and DNNs (Huang et al., 2018).
Many Riemannian optimization methods have been proposed recently. For example, Zhang & Sra (2016) and Liu et al. (2017) proposed efficient first-order gradient methods for geodesically convex functions. Subsequently, Zhang et al. (2016) presented fast stochastic variance-reduced methods for Riemannian manifold optimization. More recently, Sato et al. (2019) proposed fast first-order gradient algorithms for Riemannian manifold optimization by using general retraction and vector transport. Building on these retraction and vector transport operators, several fast Riemannian gradient-based methods (Zhang et al., 2018; Kasai et al., 2018; Zhou et al., 2019; Han & Gao, 2020a) have been proposed for non-convex optimization, and Riemannian Adam-type algorithms (Kasai et al., 2019) were introduced for matrix manifold optimization. In addition, some algorithms (Ferreira et al., 2005; Li et al., 2009; Wang et al., 2010) have been studied for variational inequalities on Riemannian manifolds, which are implicit min-max problems on Riemannian manifolds.

Notations: ‖•‖ denotes the ℓ₂ norm for vectors and the spectral norm for matrices, and ⟨x, y⟩ denotes the inner product of two vectors x and y.

3. PRELIMINARIES

In this section, we first revisit some basic facts about the Riemannian manifold M. In general, the manifold M is endowed with a smooth inner product ⟨•, •⟩_x : T_xM × T_xM → R on the tangent space T_xM for every x ∈ M. The induced norm ‖•‖_x of a tangent vector in T_xM is associated with the Riemannian metric. We first define a retraction R_x : T_xM → M mapping the tangent space T_xM onto M with a local rigidity condition that preserves gradients at x ∈ M (please see Fig. 1(a)). The retraction R_x satisfies: 1) R_x(0) = x, where 0 ∈ T_xM; 2) dR_x(0)[u] = u for all u ∈ T_xM. In fact, the exponential map Exp_x is a special case of retraction, and a general retraction R_x locally approximates Exp_x to first order on the manifold. Next, we define a vector transport T : TM ⊕ TM → TM (please see Fig. 1(b)) that satisfies: 1) T has an associated retraction R, i.e., for x ∈ M and w, u ∈ T_xM, T_u w is a tangent vector at R_x(u); 2) T_0 v = v; 3) T_u(av + bw) = aT_u v + bT_u w for all a, b ∈ R and u, v, w ∈ T_xM. The vector transport T_x^y v, or equivalently T_u v with y = R_x(u), transports v ∈ T_xM along the retraction curve defined by the direction u. Here we focus on the isometric vector transport T_x^y, which satisfies ⟨u, v⟩_x = ⟨T_x^y u, T_x^y v⟩_y for all u, v ∈ T_xM. Let ∇f(x, y) = (∇_x f(x, y), ∇_y f(x, y)) denote the gradient in Euclidean space, and let gradf(x, y) = (grad_x f(x, y), grad_y f(x, y)) = Proj_{T_xM}(∇f(x, y)) denote the Riemannian gradient over the tangent space T_xM, where Proj_X(z) = arg min_{x∈X} ‖x − z‖ is a projection operator. Based on the above definitions, we provide some standard assumptions about the problem (1). Although the problem (1) is non-convex, following (Von Neumann & Morgenstern, 2007), there exists a local solution or stationary point (x*, y*) that satisfies the Nash equilibrium condition, i.e., f(x*, y) ≤ f(x*, y*) ≤ f(x, y*), where x* ∈ X and y* ∈ Y.
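To make the retraction and tangent-space projection concrete for the Stiefel manifold St(r, d) used later in the experiments, here is a minimal numpy sketch. The QR-based retraction shown is one common choice satisfying R_x(0) = x; the analysis in this paper allows any retraction and vector transport with the properties above, so this is an illustration, not the method's required operator:

```python
import numpy as np

def stiefel_retraction(x, u):
    """QR-based retraction on St(r, d): x is a d x r matrix with orthonormal
    columns, u a tangent vector at x. Returns a point on St(r, d)."""
    q, r = np.linalg.qr(x + u)
    # Fix column signs so the map is well defined and R_x(0) = x exactly.
    q = q * np.sign(np.sign(np.diag(r)) + 0.5)
    return q

def project_tangent(x, g):
    """Project a Euclidean gradient g onto the tangent space of St(r, d) at x:
    Proj(g) = g - x * sym(x^T g), with sym(a) = (a + a^T) / 2."""
    xtg = x.T @ g
    return g - x @ (xtg + xtg.T) / 2
```

A quick sanity check is that the retraction of the zero tangent vector returns x itself, and the retracted point always has orthonormal columns.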
Here X ⊂ M is a neighbourhood around an optimal point x*.

Assumption 1. X is compact. Each component function f(x, y; ξ) is twice continuously differentiable in both x ∈ X and y ∈ Y, and there exist constants L₁₁, L₁₂, L₂₁ and L₂₂ such that for every x, x₁, x₂ ∈ X and y, y₁, y₂ ∈ Y, we have

‖grad_x f(x₁, y; ξ) − T_{x₂}^{x₁} grad_x f(x₂, y; ξ)‖ ≤ L₁₁‖u‖,
‖grad_x f(x, y₁; ξ) − grad_x f(x, y₂; ξ)‖ ≤ L₁₂‖y₁ − y₂‖,
‖∇_y f(x₁, y; ξ) − ∇_y f(x₂, y; ξ)‖ ≤ L₂₁‖u‖,
‖∇_y f(x, y₁; ξ) − ∇_y f(x, y₂; ξ)‖ ≤ L₂₂‖y₁ − y₂‖,

where u ∈ T_{x₁}M and x₂ = R_{x₁}(u).

Assumption 1 is commonly used in Riemannian optimization (Sato et al., 2019; Han & Gao, 2020a) and min-max optimization (Lin et al., 2019; Luo et al., 2020; Xu et al., 2020b). Here, the constants L₁₁, L₁₂ and L₂₁ implicitly contain curvature information, as in (Sato et al., 2019; Han & Gao, 2020a). Specifically, Assumption 1 implies that the partial Riemannian gradient grad_x f(•, y; ξ) for all y ∈ Y is retraction L₁₁-Lipschitz continuous as in (Han & Gao, 2020a), and that the partial gradient ∇_y f(x, •; ξ) for all x ∈ X is L₂₂-Lipschitz continuous as in (Lin et al., 2019). Since

‖grad_x f(x, y₁; ξ) − grad_x f(x, y₂; ξ)‖ = ‖Proj_{T_xM}(∇_x f(x, y₁; ξ)) − Proj_{T_xM}(∇_x f(x, y₂; ξ))‖ ≤ ‖∇_x f(x, y₁; ξ) − ∇_x f(x, y₂; ξ)‖ ≤ L₁₂‖y₁ − y₂‖,

the bound ‖grad_x f(x, y₁; ξ) − grad_x f(x, y₂; ξ)‖ ≤ L₁₂‖y₁ − y₂‖ follows from the L₁₂-Lipschitz continuity of ∇_x f(x, •; ξ) for all x ∈ X. Let the partial Riemannian gradient grad_y f(•, y; ξ) for all y ∈ Y be retraction L̃₂₁-Lipschitz, i.e., ‖grad_y f(x₁, y; ξ) − T_{x₂}^{x₁} grad_y f(x₂, y; ξ)‖ ≤ L̃₂₁‖u‖. Since

‖grad_y f(x₁, y; ξ) − T_{x₂}^{x₁} grad_y f(x₂, y; ξ)‖ = ‖Proj_{T_{x₁}M}(∇_y f(x₁, y; ξ)) − T_{x₂}^{x₁} Proj_{T_{x₂}M}(∇_y f(x₂, y; ξ))‖ ≤ ‖∇_y f(x₁, y; ξ) − ∇_y f(x₂, y; ξ)‖ ≤ L₂₁‖u‖,

we have L₂₁ ≥ L̃₂₁. For the deterministic problem, replace f(x, y; ξ) with f(x, y) in Assumption 1.
Since f(x, y) is strongly concave in y ∈ Y, there exists a unique solution to the problem max_{y∈Y} f(x, y) for any x. We define the function Φ(x) = max_{y∈Y} f(x, y) and y*(x) = arg max_{y∈Y} f(x, y).

Assumption 2. The function Φ(x) is retraction L-smooth, i.e., there exists a constant L > 0 such that for all x ∈ X and z = R_x(u) with u ∈ T_xM,

Φ(z) ≤ Φ(x) + ⟨gradΦ(x), u⟩ + (L/2)‖u‖².

Assumption 3. The objective function f(x, y) is µ-strongly concave w.r.t. y, i.e., for any x ∈ M,

f(x, y₁) ≤ f(x, y₂) + ⟨∇_y f(x, y₂), y₁ − y₂⟩ − (µ/2)‖y₁ − y₂‖², ∀y₁, y₂ ∈ Y.

Assumption 4. The function Φ(x) is bounded from below on M, i.e., Φ* = inf_{x∈M} Φ(x) > −∞.

Assumption 5. The variance of the stochastic gradient is bounded, i.e., there exists a constant σ₁ > 0 such that for all x, E_ξ‖grad_x f(x, y; ξ) − grad_x f(x, y)‖² ≤ σ₁², and there exists a constant σ₂ > 0 such that for all y, E_ξ‖∇_y f(x, y; ξ) − ∇_y f(x, y)‖² ≤ σ₂². We also define σ = max{σ₁, σ₂}.

Assumption 2 imposes retraction smoothness of the function Φ(x), as in Sato et al. (2019); Han & Gao (2020b;a). Assumption 3 imposes strong concavity of f(x, y) in the variable y, as in (Lin et al., 2019; Luo et al., 2020). Assumption 4 guarantees the feasibility of nonconvex-strongly-concave problems, as in (Lin et al., 2019; Luo et al., 2020). Assumption 5 imposes bounded variance of the stochastic (Riemannian) gradients, which is commonly used in stochastic optimization (Han & Gao, 2020b; Lin et al., 2019; Luo et al., 2020).

4. RIEMANNIAN GRADIENT DESCENT ASCENT

In this section, we propose a class of Riemannian gradient descent ascent algorithms to solve the deterministic and stochastic versions of the minimax optimization problem (1), respectively.

4.1. RGDA AND RSGDA ALGORITHMS

In this subsection, we propose an efficient Riemannian gradient descent ascent (RGDA) algorithm to solve the deterministic min-max problem (1). At the same time, we propose a standard Riemannian stochastic gradient descent ascent (RSGDA) algorithm to solve the stochastic min-max problem (1). Algorithm 1 summarizes the algorithmic framework of our RGDA and RSGDA algorithms. In step 5 of Algorithm 1, we apply the retraction operator to ensure that the variable x_t stays on the manifold M for all t ≥ 1. In step 6 of Algorithm 1, we use 0 < η_t ≤ 1 to ensure that the variable y_t stays in the convex constraint set Y for all t ≥ 1. Here we define a reasonable metric to measure convergence:

H_t = ‖gradΦ(x_t)‖ + L̃‖y_t − y*(x_t)‖,

where L̃ = max(1, L₁₁, L₁₂, L₂₁, L₂₂); the first term of H_t measures convergence of the iterates {x_t}_{t=1}^T, and the second term measures convergence of the iterates {y_t}_{t=1}^T. Since the function f(x, y) is strongly concave in y ∈ Y, there exists a unique solution y*(x) to the problem max_{y∈Y} f(x, y) for any x ∈ M. Thus, we apply the standard metric ‖y_t − y*(x_t)‖ to measure convergence of the parameter y. Given y = y*(x_t), we use the standard metric ‖gradΦ(x_t)‖ = ‖grad_x f(x_t, y*(x_t))‖ to measure convergence of the parameter x. Note that we use the coefficient L̃ to balance the scales of the metrics for the variables x and y.

Algorithm 1 RGDA and RSGDA Algorithms for Min-Max Optimization
1: Input: T, parameters {γ, λ, η_t}_{t=1}^T, mini-batch size B, and initial input x₁ ∈ M, y₁ ∈ Y;
2: for t = 1, 2, ..., T do
3:   (RGDA) Compute deterministic gradients v_t = grad_x f(x_t, y_t), w_t = ∇_y f(x_t, y_t);
4:   (RSGDA) Draw B i.i.d. samples {ξ_t^i}_{i=1}^B, then compute stochastic gradients v_t = (1/B)Σ_{i=1}^B grad_x f(x_t, y_t; ξ_t^i), w_t = (1/B)Σ_{i=1}^B ∇_y f(x_t, y_t; ξ_t^i);
5:   Update: x_{t+1} = R_{x_t}(−γη_t v_t);
6:   Update: ỹ_{t+1} = P_Y(y_t + λw_t) and y_{t+1} = y_t + η_t(ỹ_{t+1} − y_t);
7: end for
8: Output: x_ζ and y_ζ chosen uniformly at random from {x_t, y_t}_{t=1}^T.

Algorithm 2 MVR-RSGDA Algorithm for Min-Max Optimization
1: Input: T, parameters {γ, λ, b, m, c₁, c₂}, and initial input x₁ ∈ M, y₁ ∈ Y;
2: Draw B i.i.d. samples B₁ = {ξ₁^i}_{i=1}^B, then compute v₁ = grad_x f_{B₁}(x₁, y₁) and w₁ = ∇_y f_{B₁}(x₁, y₁);
3: for t = 1, 2, ..., T do
4:   Compute η_t = b/(m + t)^{1/3};
5:   Update: x_{t+1} = R_{x_t}(−γη_t v_t);
6:   Update: ỹ_{t+1} = P_Y(y_t + λw_t) and y_{t+1} = y_t + η_t(ỹ_{t+1} − y_t);
7:   Compute α_{t+1} = c₁η_t² and β_{t+1} = c₂η_t²;
8:   Draw B i.i.d. samples B_{t+1} = {ξ_{t+1}^i}_{i=1}^B, then compute
     v_{t+1} = grad_x f_{B_{t+1}}(x_{t+1}, y_{t+1}) + (1 − α_{t+1})T_{x_t}^{x_{t+1}}(v_t − grad_x f_{B_{t+1}}(x_t, y_t)),
     w_{t+1} = ∇_y f_{B_{t+1}}(x_{t+1}, y_{t+1}) + (1 − β_{t+1})(w_t − ∇_y f_{B_{t+1}}(x_t, y_t));
9: end for
10: Output: x_ζ and y_ζ chosen uniformly at random from {x_t, y_t}_{t=1}^T.
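As a concrete illustration of steps 3-6 of Algorithm 1, the deterministic RGDA loop can be sketched in a few lines. The oracles, the trivial Euclidean "manifold" (retraction = addition, P_Y = identity), and the toy objective f(x, y) = xy − y²/2 below are illustrative choices for demonstration, not from the paper:

```python
import numpy as np

def rgda(grad_x, grad_y, retract, proj_Y, x, y, T, gamma, lam, eta=1.0):
    """Deterministic RGDA (Algorithm 1): retraction-based descent on x,
    projected ascent on y mixed with step size eta."""
    for _ in range(T):
        v = grad_x(x, y)                  # Riemannian partial gradient in x
        w = grad_y(x, y)                  # Euclidean partial gradient in y
        x = retract(x, -gamma * eta * v)  # step 5: x_{t+1} = R_{x_t}(-gamma*eta*v_t)
        y_tilde = proj_Y(y + lam * w)     # step 6: projected ascent point
        y = y + eta * (y_tilde - y)       # convex combination keeps y in Y
    return x, y

# Toy example: f(x, y) = x*y - y^2/2 on M = Y = R, so
# Phi(x) = max_y f(x, y) = x^2/2, with stationary point x = 0.
x, y = rgda(grad_x=lambda x, y: y,
            grad_y=lambda x, y: x - y,
            retract=lambda x, u: x + u,   # Euclidean retraction
            proj_Y=lambda y: y,           # Y = R, no projection needed
            x=1.0, y=0.0, T=400, gamma=0.05, lam=0.5)
```

With these (illustrative) step sizes the iterates contract linearly toward the stationary point (0, 0).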

4.2. MVR-RSGDA ALGORITHM

In this subsection, we propose a novel momentum variance-reduced stochastic Riemannian gradient descent ascent (MVR-RSGDA) algorithm to solve the stochastic min-max problem (1), which builds on the momentum-based variance-reduction technique of STORM (Cutkosky & Orabona, 2019). Algorithm 2 describes the algorithmic framework of the MVR-RSGDA method. In Algorithm 2, we use the momentum-based variance-reduction technique of STORM to update the stochastic Riemannian gradient v_t:

v_{t+1} = α_{t+1} grad_x f_{B_{t+1}}(x_{t+1}, y_{t+1})  [SGD]
        + (1 − α_{t+1})[grad_x f_{B_{t+1}}(x_{t+1}, y_{t+1}) − T_{x_t}^{x_{t+1}}(grad_x f_{B_{t+1}}(x_t, y_t) − v_t)]  [SPIDER]
        = grad_x f_{B_{t+1}}(x_{t+1}, y_{t+1}) + (1 − α_{t+1})T_{x_t}^{x_{t+1}}(v_t − grad_x f_{B_{t+1}}(x_t, y_t)),

where α_{t+1} ∈ (0, 1]. When α_{t+1} = 1, v_{t+1} degenerates to a vanilla stochastic Riemannian gradient; when α_{t+1} = 0, v_{t+1} degenerates to a stochastic Riemannian gradient based on the variance-reduction technique of SPIDER (Nguyen et al., 2017; Fang et al., 2018). Similarly, we use this momentum-based variance-reduction technique to estimate the stochastic gradient w_t.
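The single-line form of the estimator makes the two extremes easy to check numerically. A minimal sketch (names are illustrative), where the caller is assumed to have already vector-transported v_t and the old gradient into the current tangent space:

```python
import numpy as np

def storm_estimate(g_new, g_old, v_old, alpha):
    """STORM-style estimator v_{t+1} = g_new + (1 - alpha) * (v_old - g_old).
    alpha = 1 recovers the plain mini-batch stochastic gradient;
    alpha = 0 recovers the SPIDER-style recursive estimator."""
    return g_new + (1.0 - alpha) * (v_old - g_old)
```

For example, with alpha = 1 the output equals g_new exactly, and with alpha = 0 it equals g_new + (v_old − g_old).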

5. CONVERGENCE ANALYSIS

In this section, we study the convergence properties of our RGDA, RSGDA, and MVR-RSGDA algorithms under some mild conditions. For notational simplicity, let L̃ = max(1, L₁₁, L₁₂, L₂₁, L₂₂), and let κ = L₂₁/µ denote the condition number of the function f(x, y). We first give a useful lemma.

Lemma 1. Under the assumptions in §3, the gradient of the function Φ(x) = max_{y∈Y} f(x, y) is retraction G-Lipschitz, and the mapping y*(x) = arg max_{y∈Y} f(x, y) is retraction κ-Lipschitz. Given any x₁ ∈ X ⊂ M, x₂ = R_{x₁}(u) ∈ X and u ∈ T_{x₁}M, we have

‖gradΦ(x₁) − T_{x₂}^{x₁} gradΦ(x₂)‖ ≤ G‖u‖,  ‖y*(x₁) − y*(x₂)‖ ≤ κ‖u‖,

where G = κL₁₂ + L₁₁ and κ = L₂₁/µ.

5.1. CONVERGENCE ANALYSIS OF BOTH THE RGDA AND RSGDA ALGORITHMS

In this subsection, we study the convergence properties of the deterministic RGDA and stochastic RSGDA algorithms. The related proofs of RGDA and RSGDA are provided in Appendix A.1.

Theorem 1. Suppose the sequence {x_t, y_t}_{t=1}^T is generated from Algorithm 1 by using deterministic gradients. Given η = η_t for all t ≥ 1, 0 < η ≤ min(1, 1/(2γL)), 0 < λ ≤ 1/(6L̃) and 0 < γ ≤ µλ/(10L̃κ), we have

(1/T) Σ_{t=1}^T [‖gradΦ(x_t)‖ + L̃‖y_t − y*(x_t)‖] ≤ 2√(Φ(x₁) − Φ*) / √(γηT).

Remark 1. Since 0 < η ≤ min(1, 1/(2γL)) and 0 < γ ≤ µλ/(10L̃κ), we have 0 < ηγ ≤ min(µλ/(10L̃κ), 1/(2L)). Letting ηγ = min(µλ/(10L̃κ), 1/(2L)), we have ηγ = O(1/κ²). Hence the RGDA algorithm has a convergence rate of O(κ/T^{1/2}). By κ/T^{1/2} ≤ ε, i.e., E[H_ζ] ≤ ε, we choose T ≥ κ²ε⁻². In the deterministic RGDA algorithm, we need one sample to estimate the gradients v_t and w_t at each iteration, and we need T iterations. Thus, RGDA reaches a sample complexity of T = O(κ²ε⁻²) for finding an ε-stationary point.

Theorem 2. Suppose the sequence {x_t, y_t}_{t=1}^T is generated from Algorithm 1 by using stochastic gradients. Given η = η_t for all t ≥ 1, 0 < η ≤ min(1, 1/(2γL)), 0 < λ ≤ 1/(6L̃) and 0 < γ ≤ µλ/(10L̃κ), we have

(1/T) Σ_{t=1}^T E[‖gradΦ(x_t)‖ + L̃‖y_t − y*(x_t)‖] ≤ 2√(Φ(x₁) − Φ*) / √(γηT) + √2σ/√B + 5√2 L̃σ/(√B µ).

Remark 2. Since 0 < η ≤ min(1, 1/(2γL)) and 0 < γ ≤ µλ/(10L̃κ), we have 0 < ηγ ≤ min(µλ/(10L̃κ), 1/(2L)). Letting ηγ = min(µλ/(10L̃κ), 1/(2L)), we have ηγ = O(1/κ²). Setting B = T, the RSGDA algorithm has a convergence rate of O(κ/T^{1/2}). By κ/T^{1/2} ≤ ε, i.e., E[H_ζ] ≤ ε, we choose T ≥ κ²ε⁻². In the stochastic RSGDA algorithm, we need B samples to estimate the gradients v_t and w_t at each iteration, and we need T iterations. Thus, RSGDA reaches a sample complexity of BT = O(κ⁴ε⁻⁴) for finding an ε-stationary point.
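For completeness, the arithmetic behind both sample-complexity claims, under the stated choice $\eta\gamma = O(1/\kappa^2)$:

```latex
\frac{\kappa}{T^{1/2}} \le \epsilon
\;\Longleftrightarrow\;
T \ge \kappa^{2}\epsilon^{-2}
\quad\text{(RGDA: one sample per iteration, hence } O(\kappa^{2}\epsilon^{-2}) \text{ samples)},
\qquad
B = T \;\Longrightarrow\; BT = T^{2} = O(\kappa^{4}\epsilon^{-4})
\quad\text{(RSGDA)}.
```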

5.2. CONVERGENCE ANALYSIS OF THE MVR-RSGDA ALGORITHM

In this subsection, we provide the convergence properties of the MVR-RSGDA algorithm. The related proofs of MVR-RSGDA are provided in Appendix A.2.

Theorem 3. Suppose the sequence {x_t, y_t}_{t=1}^T is generated from Algorithm 2. Given y₁ = y*(x₁), c₁ ≥ 2/(3b³) + 2λµ, c₂ ≥ 2/(3b³) + 50λL̃²/µ, b > 0, m ≥ max(2, (cb)³), 0 < γ ≤ µλ/(2κL̃√(25 + 4µλ)) and 0 < λ ≤ 1/(6L̃), we have

(1/T) Σ_{t=1}^T E[‖gradΦ(x_t)‖ + L̃‖y_t − y*(x_t)‖] ≤ √(2M) m^{1/6}/T^{1/2} + √(2M)/T^{1/3},

where c = max(1, c₁, c₂, 2γL) and M = 2(Φ(x₁) − Φ*)/(γb) + 2σ²/(Bλµη₀b) + (2(c₁² + c₂²)σ²b²/(Bλµ)) ln(m + T).

Remark 3. Let c₁ = 2/(3b³) + 2λµ, c₂ = 2/(3b³) + 50λL̃²/µ, λ = 1/(6L̃), γ = µλ/(2κL̃√(25 + 4µλ)) and η₀ = b/m^{1/3}. It is easily verified that γ = O(1/κ²), λ = O(1), λµ = O(1/κ), c₁ = O(1), c₂ = O(κ) and m = O(κ³). Then the MVR-RSGDA algorithm has a convergence rate of Õ(κ^{3/2}/T^{1/3}) and a sample complexity of BT = Õ(κ^{9/2} ε⁻³) for finding an ε-stationary point.

Remark 4. In the above theoretical analysis, we only assume convexity of the constraint set Y, while Lin et al. (2019) not only assume convexity of the set Y but also assume and use its boundedness (please see Assumption 4.2 in (Lin et al., 2019)). Clearly, our assumption is milder than that of (Lin et al., 2019). When there is no constraint set on the parameter y, i.e., Y = R^d, our algorithms and theoretical results still work, while those of Lin et al. (2019) do not.

6. EXPERIMENTS

In this section, we conduct robust training of deep neural networks (DNNs) over the Stiefel manifold St(r, d) = {W ∈ R^{d×r} : WᵀW = I_r} to evaluate the performance of our algorithms. In the experiments, we use the MNIST, CIFAR-10, and CIFAR-100 datasets to train the model (more experimental results on the SVHN, STL10, and FashionMNIST datasets are provided in Appendix B). Since the sample sizes of these datasets are large, we only compare the proposed stochastic algorithms (RSGDA and MVR-RSGDA) in the experiments. Here, we use the SGDA algorithm (Lin et al., 2019) as a baseline, which does not apply orthogonal regularization in robust DNN training.

6.1. EXPERIMENTAL SETTING

Given a deep neural network h(•; x) parameterized by x as in problem (2), the weight of the l-th layer is x_l ∈ St(n_in^l, n_out^l), where St(n_in^l, n_out^l) is the Stiefel manifold of the l-th layer. For the weights in dense layers, n_in^l and n_out^l are the numbers of input and output neurons. For the weights in convolution layers, n_in^l is the number of input channels, and n_out^l is the product of the number of output channels and the kernel sizes. Note that the trainable parameters of other components (e.g., batchnorm) are not on the Stiefel manifold. For both the RSGDA and MVR-RSGDA algorithms, we set {γ, λ} to {1.0, 0.1}. We further set {b, m, c₁, c₂} to {0.5, 8, 512, 512} for MVR-RSGDA, and η in RSGDA is set to 0.01. For both algorithms, the mini-batch size is set to 512. We set ε for y to 0.05 for the MNIST dataset and 0.03 for the CIFAR-10/100 datasets. The above settings are the same for all datasets. An 8-layer (5 convolution layers and 3 dense layers) deep neural network is used in all experiments. All code is implemented with McTorch (Meghwanshi et al., 2018), which is based on PyTorch (Paszke et al., 2019).
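To make the convolution-layer construction concrete, the sketch below reshapes a conv weight into the (n_out, n_in) matrix described above and maps it onto the Stiefel manifold via QR. The exact memory layout of the flattening is an assumption for illustration (frameworks differ), and QR is just one way to obtain orthonormal columns:

```python
import numpy as np

def conv_to_stiefel_matrix(weight):
    """Reshape a conv weight (out_ch, in_ch, kh, kw) into an (n_out, n_in)
    matrix with n_in = in_ch and n_out = out_ch * kh * kw (assumed layout)."""
    out_ch, in_ch, kh, kw = weight.shape
    return weight.transpose(0, 2, 3, 1).reshape(out_ch * kh * kw, in_ch)

def orthonormalize(w):
    """Map an (n_out, n_in) matrix with n_out >= n_in onto the Stiefel
    manifold via reduced QR, so that w^T w = I_{n_in}."""
    q, _ = np.linalg.qr(w)
    return q
```

For an 8x3x3x3 conv weight this yields a 72x3 matrix q with qᵀq = I₃, matching the constraint WᵀW = I_r of St(r, d).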

6.2. EXPERIMENTAL RESULTS

The training loss curves of the robust training problem in Eq. (2) are shown in Fig. 2. From the figure, we can see that MVR-RSGDA enjoys a faster convergence speed than the baseline RSGDA. It is also clear that as the dataset becomes more complicated (from MNIST to CIFAR-10/100), the advantage of MVR-RSGDA becomes larger. When it comes to robust training, the training loss alone is not enough to identify which algorithm is better, so we also use a variant of the uniform perturbation to attack the models trained by our algorithms. We follow the design of the uniform attack in previous works (Moosavi-Dezfooli et al., 2017; Chaubey et al., 2020); the detailed uniform attack objective is shown below:

min_{y∈Y} (1/n) Σ_{i=1}^n max(h_{b_i}(y + a_i) − max_{j≠b_i} h_j(y + a_i), 0),  s.t. Y = {‖y‖_∞ ≤ ε},

where h_j is the j-th logit of the output of the deep neural network, and y here is a uniform perturbation added to all inputs. In practice, we sample a mini-batch of 512 samples at each iteration, and the optimization of the uniform perturbation lasts for 1000 iterations in all settings. The attack loss is presented in Fig. 3. The attack loss for the model trained by MVR-RSGDA is higher than for both RSGDA and SGDA, which indicates that the model trained by MVR-RSGDA is harder to attack and thus more robust. The test accuracy on natural images and under the uniform attack is shown in Tab. 2, which also suggests the advantage of MVR-RSGDA. More results are provided in Appendix B.
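The attack objective above is a per-sample margin loss on the logits. A minimal numpy sketch of one loss evaluation (function name and array layout are illustrative, not from the paper's code):

```python
import numpy as np

def uniform_attack_loss(logits, labels):
    """Mean of max(h_{b_i} - max_{j != b_i} h_j, 0) over the batch.
    logits: (n, classes) array; labels: (n,) integer array of true classes.
    The loss is zero exactly when every sample is already misclassified."""
    n = logits.shape[0]
    true_logit = logits[np.arange(n), labels]
    masked = np.array(logits, dtype=float)
    masked[np.arange(n), labels] = -np.inf       # exclude the true class
    best_other = masked.max(axis=1)              # max_{j != b_i} h_j
    return np.maximum(true_logit - best_other, 0.0).mean()
```

The attack drives this quantity toward zero; a robust model keeps it large, which is why a higher attack loss in Fig. 3 indicates a harder-to-attack model.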

7. CONCLUSION

In this paper, we investigated a class of useful min-max optimization problems on Riemannian manifolds. We proposed a class of novel, efficient Riemannian gradient descent ascent algorithms to solve these minimax problems, and we studied the convergence properties of the proposed algorithms. In particular, we proved that our new MVR-RSGDA algorithm achieves a sample complexity of Õ(κ⁴ε⁻³) without large batches, which nearly matches the best known sample complexity of its Euclidean counterparts.

A APPENDIX

In this section, we provide the detailed convergence analysis of our algorithms. We first review some useful lemmas.

Lemma 2. (Nesterov, 2018) Assume that f(x) is a differentiable convex function and X is a convex set. Then x* ∈ X is the solution of the constrained problem min_{x∈X} f(x) if and only if

⟨∇f(x*), x − x*⟩ ≥ 0, ∀x ∈ X.

Lemma 3. (Nesterov, 2018) Assume the function f(x) is L-smooth, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖. Then the following inequality holds:

|f(y) − f(x) − ∇f(x)ᵀ(y − x)| ≤ (L/2)‖x − y‖².

Next, based on the above assumptions and lemmas, we give some useful lemmas.

Lemma 4. The gradient of the function Φ(x) = max_{y∈Y} f(x, y) is retraction G-Lipschitz, and the mapping y*(x) = arg max_{y∈Y} f(x, y) is retraction κ-Lipschitz. Given any x₁ ∈ X ⊂ M, x₂ = R_{x₁}(u) ∈ X and u ∈ T_{x₁}M, we have

‖gradΦ(x₁) − T_{x₂}^{x₁} gradΦ(x₂)‖ ≤ G‖u‖,  ‖y*(x₁) − y*(x₂)‖ ≤ κ‖u‖,

where G = κL₁₂ + L₁₁ and κ = L₂₁/µ, and the vector transport T_{x₂}^{x₁} transports vectors from the tangent space of x₂ to that of x₁.

Proof. Given any x₁ ∈ X, x₂ = R_{x₁}(u) ∈ X and u ∈ T_{x₁}M, define y*(x₁) = arg max_{y∈Y} f(x₁, y) and y*(x₂) = arg max_{y∈Y} f(x₂, y). By the above Lemma 2, we have

(y − y*(x₁))ᵀ ∇_y f(x₁, y*(x₁)) ≤ 0, ∀y ∈ Y, (20)
(y − y*(x₂))ᵀ ∇_y f(x₂, y*(x₂)) ≤ 0, ∀y ∈ Y. (21)

Let y = y*(x₂) in the inequality (20) and y = y*(x₁) in the inequality (21); summing these inequalities, we have

(y*(x₂) − y*(x₁))ᵀ [∇_y f(x₁, y*(x₁)) − ∇_y f(x₂, y*(x₂))] ≤ 0. (22)

Since the function f(x₁, •) is µ-strongly concave, we have

f(x₁, y*(x₁)) ≤ f(x₁, y*(x₂)) + (∇_y f(x₁, y*(x₂)))ᵀ(y*(x₁) − y*(x₂)) − (µ/2)‖y*(x₁) − y*(x₂)‖², (23)
f(x₁, y*(x₂)) ≤ f(x₁, y*(x₁)) + (∇_y f(x₁, y*(x₁)))ᵀ(y*(x₂) − y*(x₁)) − (µ/2)‖y*(x₁) − y*(x₂)‖². (24)
Combining the inequalities (23) and (24), we obtain

(y*(x₂) − y*(x₁))ᵀ[∇_y f(x₁, y*(x₂)) − ∇_y f(x₁, y*(x₁))] + µ‖y*(x₁) − y*(x₂)‖² ≤ 0. (25)

Plugging the inequality (22) into (25), we have

µ‖y*(x₁) − y*(x₂)‖² ≤ (y*(x₂) − y*(x₁))ᵀ[∇_y f(x₂, y*(x₂)) − ∇_y f(x₁, y*(x₂))]
≤ ‖y*(x₂) − y*(x₁)‖ ‖∇_y f(x₂, y*(x₂)) − ∇_y f(x₁, y*(x₂))‖
≤ L₂₁‖u‖ ‖y*(x₂) − y*(x₁)‖,

where the last inequality is due to Assumption 1. Thus, we have ‖y*(x₁) − y*(x₂)‖ ≤ κ‖u‖, where κ = L₂₁/µ and x₂ = R_{x₁}(u), u ∈ T_{x₁}M.

Since Φ(x) = f(x, y*(x)), we have gradΦ(x) = grad_x f(x, y*(x)). Then we have

‖gradΦ(x₁) − T_{x₂}^{x₁} gradΦ(x₂)‖ = ‖grad_x f(x₁, y*(x₁)) − T_{x₂}^{x₁} grad_x f(x₂, y*(x₂))‖
≤ ‖grad_x f(x₁, y*(x₁)) − grad_x f(x₁, y*(x₂))‖ + ‖grad_x f(x₁, y*(x₂)) − T_{x₂}^{x₁} grad_x f(x₂, y*(x₂))‖
≤ L₁₂‖y*(x₁) − y*(x₂)‖ + L₁₁‖u‖ ≤ (κL₁₂ + L₁₁)‖u‖,

where u ∈ T_{x₁}M.

Lemma 5. Suppose the sequence {x_t, y_t}_{t=1}^T is generated from Algorithm 1 or 2. Given 0 < η_t ≤ 1/(2γL), we have

Φ(x_{t+1}) ≤ Φ(x_t) + γL₁₂²η_t‖y*(x_t) − y_t‖² + γη_t‖grad_x f(x_t, y_t) − v_t‖² − (γη_t/2)‖gradΦ(x_t)‖² − (γη_t/4)‖v_t‖². (29)

Proof. According to Assumption 2, i.e., the function Φ(x) is retraction L-smooth, we have

Φ(x_{t+1}) ≤ Φ(x_t) − γη_t⟨gradΦ(x_t), v_t⟩ + (γ²η_t²L/2)‖v_t‖² (30)
= Φ(x_t) + (γη_t/2)‖gradΦ(x_t) − v_t‖² − (γη_t/2)‖gradΦ(x_t)‖² + (γ²η_t²L/2 − γη_t/2)‖v_t‖²
= Φ(x_t) + (γη_t/2)‖gradΦ(x_t) − grad_x f(x_t, y_t) + grad_x f(x_t, y_t) − v_t‖² − (γη_t/2)‖gradΦ(x_t)‖² + (γ²η_t²L/2 − γη_t/2)‖v_t‖²
≤ Φ(x_t) + γη_t‖gradΦ(x_t) − grad_x f(x_t, y_t)‖² + γη_t‖grad_x f(x_t, y_t) − v_t‖² − (γη_t/2)‖gradΦ(x_t)‖² + (Lγ²η_t²/2 − γη_t/2)‖v_t‖²
≤ Φ(x_t) + γη_t‖gradΦ(x_t) − grad_x f(x_t, y_t)‖² + γη_t‖grad_x f(x_t, y_t) − v_t‖² − (γη_t/2)‖gradΦ(x_t)‖² − (γη_t/4)‖v_t‖²,

where the last inequality is due to 0 < η_t ≤ 1/(2γL).
Considering an upper bound of ‖gradΦ(x_t) − grad_x f(x_t, y_t)‖², we have

‖gradΦ(x_t) − grad_x f(x_t, y_t)‖² = ‖grad_x f(x_t, y*(x_t)) − grad_x f(x_t, y_t)‖² ≤ L₁₂²‖y*(x_t) − y_t‖². (31)

Then we have

Φ(x_{t+1}) ≤ Φ(x_t) + γη_tL₁₂²‖y*(x_t) − y_t‖² + γη_t‖grad_x f(x_t, y_t) − v_t‖² − (γη_t/2)‖gradΦ(x_t)‖² − (γη_t/4)‖v_t‖². (32)

Lemma 6. Suppose the sequence {x_t, y_t}_{t=1}^T is generated from Algorithm 1 or 2. Under the above assumptions, and setting 0 < η_t ≤ 1 and 0 < λ ≤ 1/(6L̃), we have

‖y_{t+1} − y*(x_{t+1})‖² ≤ (1 − η_tµλ/4)‖y_t − y*(x_t)‖² − (3η_t/4)‖ỹ_{t+1} − y_t‖² + (25η_tλ/(6µ))‖∇_y f(x_t, y_t) − w_t‖² + (25γ²κ²η_t/(6µλ))‖v_t‖², (33)

where κ = L₂₁/µ.

Proof. According to Assumption 3, i.e., the function f(x, y) is µ-strongly concave w.r.t. y, we have

f(x_t, y) ≤ f(x_t, y_t) + ⟨∇_y f(x_t, y_t), y − y_t⟩ − (µ/2)‖y − y_t‖²
= f(x_t, y_t) + ⟨w_t, y − ỹ_{t+1}⟩ + ⟨∇_y f(x_t, y_t) − w_t, y − ỹ_{t+1}⟩ + ⟨∇_y f(x_t, y_t), ỹ_{t+1} − y_t⟩ − (µ/2)‖y − y_t‖². (34)

According to Assumption 1, the function f(x, y) is L₂₂-smooth w.r.t. y, and since L̃ ≥ L₂₂, we have

f(x_t, ỹ_{t+1}) − f(x_t, y_t) − ⟨∇_y f(x_t, y_t), ỹ_{t+1} − y_t⟩ ≥ −(L₂₂/2)‖ỹ_{t+1} − y_t‖² ≥ −(L̃/2)‖ỹ_{t+1} − y_t‖². (35)

Combining the inequalities (34) and (35), we have

f(x_t, y) ≤ f(x_t, ỹ_{t+1}) + ⟨w_t, y − ỹ_{t+1}⟩ + ⟨∇_y f(x_t, y_t) − w_t, y − ỹ_{t+1}⟩ − (µ/2)‖y − y_t‖² + (L̃/2)‖ỹ_{t+1} − y_t‖². (36)

According to step 6 of Algorithm 1 or 2, we have ỹ_{t+1} = P_Y(y_t + λw_t) = arg min_{y∈Y} (1/2)‖y − y_t − λw_t‖². Since Y is a convex set and the function (1/2)‖y − y_t − λw_t‖² is convex, according to Lemma 2 we have

⟨ỹ_{t+1} − y_t − λw_t, y − ỹ_{t+1}⟩ ≥ 0, ∀y ∈ Y. (37)

Then we obtain

⟨w_t, y − ỹ_{t+1}⟩ ≤ (1/λ)⟨ỹ_{t+1} − y_t, y − ỹ_{t+1}⟩
= (1/λ)⟨ỹ_{t+1} − y_t, y_t − ỹ_{t+1}⟩ + (1/λ)⟨ỹ_{t+1} − y_t, y − y_t⟩
= −(1/λ)‖ỹ_{t+1} − y_t‖² + (1/λ)⟨ỹ_{t+1} − y_t, y − y_t⟩. (38)

Combining the inequalities (36) and (38), we have

f(x_t, y) ≤ f(x_t, ỹ_{t+1}) + (1/λ)⟨ỹ_{t+1} − y_t, y − y_t⟩ + ⟨∇_y f(x_t, y_t) − w_t, y − ỹ_{t+1}⟩ − (1/λ)‖ỹ_{t+1} − y_t‖² − (µ/2)‖y − y_t‖² + (L̃/2)‖ỹ_{t+1} − y_t‖². (39)
Letting y = y*(x_t), we obtain

f(x_t, y*(x_t)) ≤ f(x_t, ỹ_{t+1}) + (1/λ)⟨ỹ_{t+1} − y_t, y*(x_t) − y_t⟩ + ⟨∇_y f(x_t, y_t) − w_t, y*(x_t) − ỹ_{t+1}⟩ − (1/λ)‖ỹ_{t+1} − y_t‖² − (µ/2)‖y*(x_t) − y_t‖² + (L̃/2)‖ỹ_{t+1} − y_t‖². (40)

By the concavity of f(x_t, •) and y*(x_t) = arg max_{y∈Y} f(x_t, y), we have f(x_t, y*(x_t)) ≥ f(x_t, ỹ_{t+1}). Thus, we obtain

0 ≤ (1/λ)⟨ỹ_{t+1} − y_t, y*(x_t) − y_t⟩ + ⟨∇_y f(x_t, y_t) − w_t, y*(x_t) − ỹ_{t+1}⟩ − (1/λ − L̃/2)‖ỹ_{t+1} − y_t‖² − (µ/2)‖y*(x_t) − y_t‖². (41)

By y_{t+1} = y_t + η_t(ỹ_{t+1} − y_t), we have

‖y_{t+1} − y*(x_t)‖² = ‖y_t + η_t(ỹ_{t+1} − y_t) − y*(x_t)‖² = ‖y_t − y*(x_t)‖² + 2η_t⟨ỹ_{t+1} − y_t, y_t − y*(x_t)⟩ + η_t²‖ỹ_{t+1} − y_t‖². (42)

Since 0 < η_t ≤ 1, 0 < λ ≤ 1/(6L̃) and L̃ ≥ L₂₂ ≥ µ, we have λ ≤ 1/(6L̃) ≤ 1/(6µ) and η_t ≤ 1 ≤ 1/(6µλ). Then we obtain

(1 + η_tµλ/4)(1 − η_tµλ/2) = 1 − η_tµλ/2 + η_tµλ/4 − η_t²µ²λ²/8 ≤ 1 − η_tµλ/4,
−(1 + η_tµλ/4)(3η_t/4) ≤ −3η_t/4,
(1 + η_tµλ/4)(4η_tλ/µ) ≤ (1 + 1/24)(4η_tλ/µ) = 25η_tλ/(6µ),
(1 + 4/(η_tµλ))γ²κ²η_t² = γ²κ²η_t² + 4γ²κ²η_t/(µλ) ≤ γ²κ²η_t/(6µλ) + 4γ²κ²η_t/(µλ) = 25γ²κ²η_t/(6µλ). (43)

Thus we have

‖y_{t+1} − y*(x_{t+1})‖² ≤ (1 − η_tµλ/4)‖y_t − y*(x_t)‖² − (3η_t/4)‖ỹ_{t+1} − y_t‖² + (25η_tλ/(6µ))‖∇_y f(x_t, y_t) − w_t‖² + (25γ²κ²η_t/(6µλ))‖v_t‖². (44)

A.1 CONVERGENCE ANALYSIS OF THE RGDA AND RSGDA ALGORITHMS

In this subsection, we study the convergence properties of the deterministic RGDA and stochastic RSGDA algorithms, respectively. For notational simplicity, let L̃ = max(1, L₁₁, L₁₂, L₂₁, L₂₂).

Theorem 4. Suppose the sequence {x_t, y_t}_{t=1}^T is generated from Algorithm 1 by using deterministic gradients. Given η = η_t for all t ≥ 1, 0 < η ≤ min(1, 1/(2γL)), 0 < λ ≤ 1/(6L̃) and 0 < γ ≤ µλ/(10L̃κ), we have

(1/T) Σ_{t=1}^T [L̃‖y_t − y*(x_t)‖ + ‖gradΦ(x_t)‖] ≤ 2√(Φ(x₁) − Φ*)/√(γηT).

Proof. According to Lemma 6, we have

‖y_{t+1} − y*(x_{t+1})‖² ≤ (1 − η_tµλ/4)‖y_t − y*(x_t)‖² − (3η_t/4)‖ỹ_{t+1} − y_t‖² + (25η_tλ/(6µ))‖∇_y f(x_t, y_t) − w_t‖² + (25γ²κ²η_t/(6µλ))‖v_t‖². (52)
We first define a Lyapunov function $\Lambda_t$, for any $t\ge1$,
$$\Lambda_t=\Phi(x_t)+\frac{6\gamma L^2}{\lambda\mu}\|y_t-y^*(x_t)\|^2.$$
According to Lemma 5, we have
$$\Lambda_{t+1}-\Lambda_t=\Phi(x_{t+1})-\Phi(x_t)+\frac{6\gamma L^2}{\lambda\mu}\big(\|y_{t+1}-y^*(x_{t+1})\|^2-\|y_t-y^*(x_t)\|^2\big)$$
$$\le\gamma\eta_tL_{12}^2\|y_t-y^*(x_t)\|^2+\gamma\eta_t\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2-\frac{\gamma\eta_t}{2}\|\operatorname{grad}\Phi(x_t)\|^2-\frac{\gamma\eta_t}{4}\|v_t\|^2+\frac{6\gamma L^2}{\lambda\mu}\Big(-\frac{\mu\lambda\eta_t}{4}\|y_t-y^*(x_t)\|^2-\frac{3\eta_t}{4}\|\tilde y_{t+1}-y_t\|^2+\frac{25\lambda\eta_t}{6\mu}\|\nabla_yf(x_t,y_t)-w_t\|^2+\frac{25\gamma^2\kappa^2\eta_t}{6\mu\lambda}\|v_t\|^2\Big)$$
$$\le-\frac{L^2\gamma\eta_t}{2}\|y_t-y^*(x_t)\|^2-\frac{\gamma\eta_t}{2}\|\operatorname{grad}\Phi(x_t)\|^2-\frac{9\gamma L^2\eta_t}{2\lambda\mu}\|\tilde y_{t+1}-y_t\|^2-\Big(\frac{1}{4}-\frac{25\kappa^2L^2\gamma^2}{\mu^2\lambda^2}\Big)\gamma\eta_t\|v_t\|^2\le-\frac{L^2\gamma\eta_t}{2}\|y_t-y^*(x_t)\|^2-\frac{\gamma\eta_t}{2}\|\operatorname{grad}\Phi(x_t)\|^2,$$
where the first inequality holds by Lemma 6 above; the second inequality is due to $L=\max(1,L_{11},L_{12},L_{21},L_{22})$ and, in the deterministic setting, $v_t=\operatorname{grad}_x f(x_t,y_t)$ and $w_t=\nabla_yf(x_t,y_t)$; and the last inequality is due to $0<\gamma\le\frac{\mu\lambda}{10L\kappa}$. Thus, we obtain
$$\frac{L^2\gamma\eta_t}{2}\|y_t-y^*(x_t)\|^2+\frac{\gamma\eta_t}{2}\|\operatorname{grad}\Phi(x_t)\|^2\le\Lambda_t-\Lambda_{t+1}.$$
Since the initial solution satisfies $y_1=y^*(x_1)=\arg\max_{y\in Y}f(x_1,y)$, we have $\Lambda_1=\Phi(x_1)+\frac{6\gamma L^2}{\lambda\mu}\|y_1-y^*(x_1)\|^2=\Phi(x_1)$. Taking the average over $t=1,2,\dots,T$ on both sides of the previous inequality, we have
$$\frac{1}{T}\sum_{t=1}^T\Big[\frac{L^2\eta_t}{2}\|y_t-y^*(x_t)\|^2+\frac{\eta_t}{2}\|\operatorname{grad}\Phi(x_t)\|^2\Big]\le\frac{\Lambda_1-\Lambda_{T+1}}{\gamma T}\le\frac{\Phi(x_1)-\Phi^*}{\gamma T},$$
where the last inequality uses $\Lambda_1=\Phi(x_1)$ and Assumption 4. Let $\eta=\eta_1=\cdots=\eta_T$; then
$$\frac{1}{T}\sum_{t=1}^T\big[L^2\|y_t-y^*(x_t)\|^2+\|\operatorname{grad}\Phi(x_t)\|^2\big]\le\frac{2(\Phi(x_1)-\Phi^*)}{\gamma\eta T}.$$
According to Jensen's inequality, we have
$$\frac{1}{T}\sum_{t=1}^T\big[L\|y_t-y^*(x_t)\|+\|\operatorname{grad}\Phi(x_t)\|\big]\le\sqrt2\Big(\frac{1}{T}\sum_{t=1}^T\big[L^2\|y_t-y^*(x_t)\|^2+\|\operatorname{grad}\Phi(x_t)\|^2\big]\Big)^{1/2}\le\Big(\frac{4(\Phi(x_1)-\Phi^*)}{\gamma\eta T}\Big)^{1/2}=\frac{2\sqrt{\Phi(x_1)-\Phi^*}}{\sqrt{\gamma\eta T}}.$$

**Theorem 5.** Suppose the sequence $\{x_t,y_t\}_{t=1}^T$ is generated from Algorithm 1 using stochastic gradients. Given $\eta=\eta_t$ for all $t\ge1$, $0<\eta\le\min\big(1,\frac{1}{2\gamma L}\big)$, $0<\lambda\le\frac{1}{6L}$ and $0<\gamma\le\frac{\mu\lambda}{10L\kappa}$, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[L\|y_t-y^*(x_t)\|+\|\operatorname{grad}\Phi(x_t)\|\big]\le\frac{2\sqrt{\Phi(x_1)-\Phi^*}}{\sqrt{\gamma\eta T}}+\frac{\sqrt2\sigma}{\sqrt B}+\frac{5\sqrt2L\sigma}{\sqrt B\mu}.$$
*Proof.*
According to Lemma 6, we have
$$\|y_{t+1}-y^*(x_{t+1})\|^2\le\Big(1-\frac{\eta_t\mu\lambda}{4}\Big)\|y_t-y^*(x_t)\|^2-\frac{3\eta_t}{4}\|\tilde y_{t+1}-y_t\|^2+\frac{25\eta_t\lambda}{6\mu}\|\nabla_yf(x_t,y_t)-w_t\|^2+\frac{25\gamma^2\kappa^2\eta_t}{6\mu\lambda}\|v_t\|^2.$$
We first define a Lyapunov function $\Theta_t$, for any $t\ge1$,
$$\Theta_t=\mathbb{E}\Big[\Phi(x_t)+\frac{6\gamma L^2}{\lambda\mu}\|y_t-y^*(x_t)\|^2\Big].$$
By Assumption 5, we have
$$\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2=\mathbb{E}\Big\|\operatorname{grad}_x f(x_t,y_t)-\frac{1}{B}\sum_{i=1}^B\operatorname{grad}_x f(x_t,y_t;\xi_t^i)\Big\|^2\le\frac{\sigma^2}{B},\qquad\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2=\mathbb{E}\Big\|\nabla_yf(x_t,y_t)-\frac{1}{B}\sum_{i=1}^B\nabla_yf(x_t,y_t;\xi_t^i)\Big\|^2\le\frac{\sigma^2}{B}.$$
According to Lemma 5, we have
$$\Theta_{t+1}-\Theta_t=\mathbb{E}[\Phi(x_{t+1})]-\mathbb{E}[\Phi(x_t)]+\frac{6\gamma L^2}{\lambda\mu}\big(\mathbb{E}\|y_{t+1}-y^*(x_{t+1})\|^2-\mathbb{E}\|y_t-y^*(x_t)\|^2\big)$$
$$\le\gamma\eta_tL_{12}^2\,\mathbb{E}\|y_t-y^*(x_t)\|^2+\gamma\eta_t\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2-\frac{\gamma\eta_t}{2}\mathbb{E}\|\operatorname{grad}\Phi(x_t)\|^2-\frac{\gamma\eta_t}{4}\mathbb{E}\|v_t\|^2+\frac{6\gamma L^2}{\lambda\mu}\Big(-\frac{\mu\lambda\eta_t}{4}\mathbb{E}\|y_t-y^*(x_t)\|^2-\frac{3\eta_t}{4}\mathbb{E}\|\tilde y_{t+1}-y_t\|^2+\frac{25\lambda\eta_t}{6\mu}\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2+\frac{25\gamma^2\kappa^2\eta_t}{6\mu\lambda}\mathbb{E}\|v_t\|^2\Big)$$
$$\le-\frac{L^2\gamma\eta_t}{2}\mathbb{E}\|y_t-y^*(x_t)\|^2-\frac{\gamma\eta_t}{2}\mathbb{E}\|\operatorname{grad}\Phi(x_t)\|^2-\frac{9\gamma L^2\eta_t}{2\lambda\mu}\mathbb{E}\|\tilde y_{t+1}-y_t\|^2-\Big(\frac{1}{4}-\frac{25\kappa^2L^2\gamma^2}{\mu^2\lambda^2}\Big)\gamma\eta_t\,\mathbb{E}\|v_t\|^2+\gamma\eta_t\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+\frac{25L^2\gamma\eta_t}{\mu^2}\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2$$
$$\le-\frac{L^2\gamma\eta_t}{2}\mathbb{E}\|y_t-y^*(x_t)\|^2-\frac{\gamma\eta_t}{2}\mathbb{E}\|\operatorname{grad}\Phi(x_t)\|^2+\frac{\gamma\eta_t\sigma^2}{B}+\frac{25L^2\gamma\eta_t\sigma^2}{B\mu^2},$$
where the first inequality holds by Lemma 6 above; the second inequality is due to $L=\max(1,L_{11},L_{12},L_{21},L_{22})$; and the last inequality is due to $0<\gamma\le\frac{\mu\lambda}{10L\kappa}$ and Assumption 5. Thus, we obtain
$$\frac{L^2\gamma\eta_t}{2}\mathbb{E}\|y_t-y^*(x_t)\|^2+\frac{\gamma\eta_t}{2}\mathbb{E}\|\operatorname{grad}\Phi(x_t)\|^2\le\Theta_t-\Theta_{t+1}+\frac{\gamma\eta_t\sigma^2}{B}+\frac{25L^2\gamma\eta_t\sigma^2}{B\mu^2}.$$
Since the initial solution satisfies $y_1=y^*(x_1)=\arg\max_{y\in Y}f(x_1,y)$, we have $\Theta_1=\Phi(x_1)+\frac{6\gamma L^2}{\lambda\mu}\|y_1-y^*(x_1)\|^2=\Phi(x_1)$.
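Assumption 5 supplies the minibatch bounds $\mathbb{E}\|\cdot\|^2\le\sigma^2/B$ used above: averaging $B$ independent stochastic gradients divides the mean-squared estimation error by $B$. A quick one-dimensional empirical illustration; the Gaussian noise model and the constants $\sigma$, $B$ are our illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, B, trials = 2.0, 8, 200_000
# One-dimensional toy: stochastic gradient = true gradient + zero-mean noise
# with E[noise^2] = sigma^2; a minibatch of size B averages B such samples.
noise = rng.normal(0.0, sigma, size=(trials, B))
batch_err = noise.mean(axis=1)           # (1/B) sum_i grad f(.; xi_i)  -  grad f(.)
emp = np.mean(batch_err ** 2)            # empirical E || grad f - w_t ||^2
assert abs(emp - sigma ** 2 / B) < 0.05  # close to the sigma^2 / B bound
```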
Taking the average over $t=1,2,\dots,T$ on both sides of the previous inequality, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{L^2\eta_t}{2}\|y_t-y^*(x_t)\|^2+\frac{\eta_t}{2}\|\operatorname{grad}\Phi(x_t)\|^2\Big]\le\frac{\Theta_1-\Theta_{T+1}}{\gamma T}+\frac{1}{T}\sum_{t=1}^T\frac{\eta_t\sigma^2}{B}+\frac{1}{T}\sum_{t=1}^T\frac{25L^2\eta_t\sigma^2}{B\mu^2}\le\frac{\Phi(x_1)-\Phi^*}{\gamma T}+\frac{1}{T}\sum_{t=1}^T\frac{\eta_t\sigma^2}{B}+\frac{1}{T}\sum_{t=1}^T\frac{25L^2\eta_t\sigma^2}{B\mu^2},$$
where the last inequality uses $\Theta_1=\Phi(x_1)$ and Assumption 4. Let $\eta=\eta_1=\cdots=\eta_T$; then
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[L^2\|y_t-y^*(x_t)\|^2+\|\operatorname{grad}\Phi(x_t)\|^2\big]\le\frac{2(\Phi(x_1)-\Phi^*)}{\gamma\eta T}+\frac{\sigma^2}{B}+\frac{25L^2\sigma^2}{B\mu^2}.$$
According to Jensen's inequality, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[L\|y_t-y^*(x_t)\|+\|\operatorname{grad}\Phi(x_t)\|\big]\le\sqrt2\Big(\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[L^2\|y_t-y^*(x_t)\|^2+\|\operatorname{grad}\Phi(x_t)\|^2\big]\Big)^{1/2}\le\Big(\frac{4(\Phi(x_1)-\Phi^*)}{\gamma\eta T}+\frac{2\sigma^2}{B}+\frac{50L^2\sigma^2}{B\mu^2}\Big)^{1/2}\le\frac{2\sqrt{\Phi(x_1)-\Phi^*}}{\sqrt{\gamma\eta T}}+\frac{\sqrt2\sigma}{\sqrt B}+\frac{5\sqrt2L\sigma}{\sqrt B\mu},$$
where the last inequality is due to $(a_1+a_2+a_3)^{1/2}\le a_1^{1/2}+a_2^{1/2}+a_3^{1/2}$ for all $a_1,a_2,a_3>0$.
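To make the scheme analyzed above concrete, here is a toy Euclidean instance of the gradient descent ascent loop (identity retraction, $Y=\mathbb{R}$, so $P_Y$ is the identity). The test function $f(x,y)=\frac12x^2+xy-\frac12y^2$ is $1$-strongly concave in $y$, with $y^*(x)=x$ and $\Phi(x)=\max_y f(x,y)=x^2$; the function and step sizes are our illustrative choices, not the paper's experiments.

```python
# Toy Euclidean gradient descent ascent: min_x max_y 0.5*x^2 + x*y - 0.5*y^2.
gamma, eta, lam = 0.1, 1.0, 0.5
x, y = 1.0, -1.0
for _ in range(200):
    v = x + y                      # grad_x f(x, y)
    w = x - y                      # grad_y f(x, y)
    x = x - gamma * eta * v        # descent step (retraction = identity on R)
    y_tilde = y + lam * w          # P_Y is the identity since Y = R
    y = y + eta * (y_tilde - y)    # averaged ascent step
assert abs(x) < 1e-6 and abs(y - x) < 1e-6   # x -> 0 and y -> y*(x)
```

With these step sizes the iteration matrix has spectral radius below one, so the pair converges to the unique stationary point $(0,0)$, matching the $\|\operatorname{grad}\Phi(x_t)\|\to0$ and $\|y_t-y^*(x_t)\|\to0$ guarantees.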

A.2 CONVERGENCE ANALYSIS OF THE MVR-RSGDA ALGORITHM

In this subsection, we study the convergence properties of the MVR-RSGDA algorithm. For notational simplicity, let $L=\max(1,L_{11},L_{12},L_{21},L_{22})$.

**Lemma 7.** Suppose the stochastic gradients $v_t$ and $w_t$ are generated from Algorithm 2. Given $0<\alpha_{t+1}\le1$ and $0<\beta_{t+1}\le1$, we have
$$\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-v_{t+1}\|^2\le(1-\alpha_{t+1})^2\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4(1-\alpha_{t+1})^2L_{11}^2\gamma^2\eta_t^2\|v_t\|^2+4(1-\alpha_{t+1})^2L_{12}^2\eta_t^2\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{B}, \quad (71)$$
$$\mathbb{E}\|\nabla_yf(x_{t+1},y_{t+1})-w_{t+1}\|^2\le(1-\beta_{t+1})^2\,\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2+4(1-\beta_{t+1})^2L_{21}^2\gamma^2\eta_t^2\|v_t\|^2+4(1-\beta_{t+1})^2L_{22}^2\eta_t^2\|\tilde y_{t+1}-y_t\|^2+\frac{2\beta_{t+1}^2\sigma^2}{B}. \quad (72)$$

*Proof.* We first prove the inequality (71). According to the definition of $v_t$ in Algorithm 2, we have
$$v_{t+1}-\mathcal{T}_{x_t}^{x_{t+1}}v_t=-\alpha_{t+1}\mathcal{T}_{x_t}^{x_{t+1}}v_t+(1-\alpha_{t+1})\big[\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_t)\big]+\alpha_{t+1}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1}). \quad (73)$$
Then we have
$$\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-v_{t+1}\|^2=\mathbb{E}\big\|(1-\alpha_{t+1})\mathcal{T}_{x_t}^{x_{t+1}}\big(\operatorname{grad}_x f(x_t,y_t)-v_t\big)+(1-\alpha_{t+1})\big[\operatorname{grad}_x f(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f(x_t,y_t)-\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})+\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_t)\big]+\alpha_{t+1}\big[\operatorname{grad}_x f(x_{t+1},y_{t+1})-\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})\big]\big\|^2$$
$$\le(1-\alpha_{t+1})^2\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+2\alpha_{t+1}^2\,\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})\|^2+2(1-\alpha_{t+1})^2\,\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f(x_t,y_t)-\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})+\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_t)\|^2$$
$$\le(1-\alpha_{t+1})^2\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{B}+2(1-\alpha_{t+1})^2\,\underbrace{\mathbb{E}\|\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_t)\|^2}_{=:T_1}, \quad (74)$$
where the expansion uses $\mathbb{E}[\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})]=\operatorname{grad}_x f(x_{t+1},y_{t+1})$ and $\mathbb{E}[\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_t)]=\operatorname{grad}_x f(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f(x_t,y_t)$; the first inequality holds by Young's inequality applied to the cross term; and the last inequality is due to the identity $\mathbb{E}\|\zeta-\mathbb{E}[\zeta]\|^2=\mathbb{E}\|\zeta\|^2-\|\mathbb{E}[\zeta]\|^2$ and Assumption 5. Next, we consider an upper bound of the term $T_1$:
$$T_1\le2\,\mathbb{E}\|\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_{t+1},y_{t+1})-\mathcal{T}_{x_t}^{x_{t+1}}\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_{t+1})\|^2+2\,\mathbb{E}\|\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_{t+1})-\operatorname{grad}_x f_{\mathcal{B}_{t+1}}(x_t,y_t)\|^2\le2L_{11}^2\gamma^2\eta_t^2\|v_t\|^2+2L_{12}^2\|y_{t+1}-y_t\|^2=2L_{11}^2\gamma^2\eta_t^2\|v_t\|^2+2L_{12}^2\eta_t^2\|\tilde y_{t+1}-y_t\|^2, \quad (75)$$
where the last inequality is due to Assumption 1. Thus, we have
$$\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-v_{t+1}\|^2\le(1-\alpha_{t+1})^2\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4(1-\alpha_{t+1})^2L_{11}^2\gamma^2\eta_t^2\|v_t\|^2+4(1-\alpha_{t+1})^2L_{12}^2\eta_t^2\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{B}.$$
Applying the same analysis yields the inequality (72).

**Theorem 6.** Suppose the sequence $\{x_t,y_t\}_{t=1}^T$ is generated from Algorithm 2.
Given $y_1=y^*(x_1)$, $c_1\ge\frac{2}{3b^3}+2\lambda\mu$, $c_2\ge\frac{2}{3b^3}+\frac{50\lambda L^2}{\mu}$, $b>0$, $m\ge\max\big(2,(cb)^3\big)$, $0<\gamma\le\frac{\mu\lambda}{2\kappa L\sqrt{25+4\mu\lambda}}$ and $0<\lambda\le\frac{1}{6L}$, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\|\operatorname{grad}\Phi(x_t)\|+L\|y_t-y^*(x_t)\|\big]\le\frac{\sqrt{2M}\,m^{1/6}}{T^{1/2}}+\frac{\sqrt{2M}}{T^{1/3}},$$
where $c=\max(2\gamma L,c_1,c_2,1)$ and $M=\frac{2(\Phi(x_1)-\Phi^*)}{\gamma b}+\frac{2\sigma^2}{\lambda\mu\eta_0bB}+\frac{2(c_1^2+c_2^2)\sigma^2b^2}{\lambda\mu B}\ln(m+T)$.

*Proof.* Since $\eta_t$ is decreasing and $m\ge b^3$, we have $\eta_t\le\eta_0=\frac{b}{m^{1/3}}\le1$. Similarly, since $m\ge(2\gamma Lb)^3$, we have $\eta_t\le\eta_0\le\frac{1}{2\gamma L}$. Since $0<\eta_t\le1$ and $m\ge\max\big((c_1b)^3,(c_2b)^3\big)$, we have $\alpha_{t+1}=c_1\eta_t^2\le c_1\eta_t\le\frac{c_1b}{m^{1/3}}\le1$ and $\beta_{t+1}=c_2\eta_t^2\le c_2\eta_t\le\frac{c_2b}{m^{1/3}}\le1$. According to Lemma 7, we have
$$\frac{1}{\eta_t}\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-v_{t+1}\|^2-\frac{1}{\eta_{t-1}}\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2\le\Big(\frac{(1-\alpha_{t+1})^2}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4L_{11}^2\gamma^2\eta_t\|v_t\|^2+4L_{12}^2\eta_t\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{\eta_tB}$$
$$\le\Big(\frac{1-\alpha_{t+1}}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4L_{11}^2\gamma^2\eta_t\|v_t\|^2+4L_{12}^2\eta_t\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{\eta_tB}=\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}-c_1\eta_t\Big)\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4L_{11}^2\gamma^2\eta_t\|v_t\|^2+4L_{12}^2\eta_t\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{\eta_tB}, \quad (80)$$
where the second inequality is due to $0<\alpha_{t+1}\le1$. In the same way, we also obtain
$$\frac{1}{\eta_t}\mathbb{E}\|\nabla_yf(x_{t+1},y_{t+1})-w_{t+1}\|^2-\frac{1}{\eta_{t-1}}\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2\le\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}-c_2\eta_t\Big)\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2+4L_{21}^2\gamma^2\eta_t\|v_t\|^2+4L_{22}^2\eta_t\|\tilde y_{t+1}-y_t\|^2+\frac{2\beta_{t+1}^2\sigma^2}{\eta_tB}. \quad (81)$$
By $\eta_t=\frac{b}{(m+t)^{1/3}}$, we have
$$\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}=\frac{1}{b}\big[(m+t)^{1/3}-(m+t-1)^{1/3}\big]\le\frac{1}{3b(m+t-1)^{2/3}}\le\frac{1}{3b(m/2+t)^{2/3}}\le\frac{2^{2/3}}{3b(m+t)^{2/3}}=\frac{2^{2/3}}{3b^3}\eta_t^2\le\frac{2}{3b^3}\eta_t,$$
where the first inequality holds by the concavity of $x\mapsto x^{1/3}$, i.e., $(x+y)^{1/3}\le x^{1/3}+\frac{y}{3x^{2/3}}$; the second inequality is due to $m\ge2$; and the last inequality is due to $0<\eta_t\le1$.
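The facts about the schedule $\eta_t=\frac{b}{(m+t)^{1/3}}$ established above — monotone decrease, $\eta_t\le1$ when $m\ge b^3$, and $\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\le\frac{2}{3b^3}\eta_t$ — are easy to check numerically. The values of $b$ and $m$ below are illustrative choices satisfying the stated conditions.

```python
b, m = 2.0, 16          # illustrative: m >= max(2, b^3) = 8
eta = lambda t: b / (m + t) ** (1.0 / 3.0)

assert eta(0) <= 1.0                 # eta_0 = b / m^(1/3) <= 1 since m >= b^3
for t in range(1, 1000):
    assert eta(t) < eta(t - 1)       # the schedule is decreasing
    # key telescoping bound used to absorb the 1/eta_t weights:
    assert 1 / eta(t) - 1 / eta(t - 1) <= (2 / (3 * b ** 3)) * eta(t) + 1e-12
```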
Let $c_1\ge\frac{2}{3b^3}+2\lambda\mu$. Then we have
$$\frac{1}{\eta_t}\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-v_{t+1}\|^2-\frac{1}{\eta_{t-1}}\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2\le-2\lambda\mu\eta_t\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4L_{11}^2\gamma^2\eta_t\|v_t\|^2+4L_{12}^2\eta_t\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{\eta_tB}. \quad (83)$$
Let $c_2\ge\frac{2}{3b^3}+\frac{50\lambda L^2}{\mu}$. Then we have
$$\frac{1}{\eta_t}\mathbb{E}\|\nabla_yf(x_{t+1},y_{t+1})-w_{t+1}\|^2-\frac{1}{\eta_{t-1}}\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2\le-\frac{50\lambda L^2}{\mu}\eta_t\,\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2+4L_{21}^2\gamma^2\eta_t\|v_t\|^2+4L_{22}^2\eta_t\|\tilde y_{t+1}-y_t\|^2+\frac{2\beta_{t+1}^2\sigma^2}{\eta_tB}. \quad (84)$$
According to Lemma 6, we have
$$\|y_{t+1}-y^*(x_{t+1})\|^2-\|y_t-y^*(x_t)\|^2\le-\frac{\eta_t\mu\lambda}{4}\|y_t-y^*(x_t)\|^2-\frac{3\eta_t}{4}\|\tilde y_{t+1}-y_t\|^2+\frac{25\lambda\eta_t}{6\mu}\|\nabla_yf(x_t,y_t)-w_t\|^2+\frac{25\gamma^2\kappa^2\eta_t}{6\mu\lambda}\|v_t\|^2. \quad (85)$$
Next, we define a Lyapunov function $\Omega_t$, for any $t\ge1$,
$$\Omega_t=\mathbb{E}\Big[\Phi(x_t)+\frac{\gamma}{2\lambda\mu}\Big(\frac{1}{\eta_{t-1}}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+\frac{1}{\eta_{t-1}}\|\nabla_yf(x_t,y_t)-w_t\|^2\Big)+\frac{6\gamma L^2}{\lambda\mu}\|y_t-y^*(x_t)\|^2\Big]. \quad (86)$$
Since the initial solution satisfies $y_1=y^*(x_1)=\arg\max_{y\in Y}f(x_1,y)$, we have
$$\Omega_1=\Phi(x_1)+\frac{\gamma}{2\lambda\mu}\Big(\frac{1}{\eta_0}\mathbb{E}\|\operatorname{grad}_x f(x_1,y_1)-\operatorname{grad}_x f_{\mathcal{B}_1}(x_1,y_1)\|^2+\frac{1}{\eta_0}\mathbb{E}\|\nabla_yf(x_1,y_1)-\nabla_yf_{\mathcal{B}_1}(x_1,y_1)\|^2\Big)\le\Phi(x_1)+\frac{\gamma\sigma^2}{\lambda\mu\eta_0B},$$
where the last inequality holds by Assumption 5. Then we have
$$\Omega_{t+1}-\Omega_t=\mathbb{E}[\Phi(x_{t+1})]-\mathbb{E}[\Phi(x_t)]+\frac{6\gamma L^2}{\lambda\mu}\big(\mathbb{E}\|y_{t+1}-y^*(x_{t+1})\|^2-\mathbb{E}\|y_t-y^*(x_t)\|^2\big)+\frac{\gamma}{2\lambda\mu}\Big(\frac{1}{\eta_t}\mathbb{E}\|\operatorname{grad}_x f(x_{t+1},y_{t+1})-v_{t+1}\|^2-\frac{1}{\eta_{t-1}}\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+\frac{1}{\eta_t}\mathbb{E}\|\nabla_yf(x_{t+1},y_{t+1})-w_{t+1}\|^2-\frac{1}{\eta_{t-1}}\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2\Big)$$
$$\le\gamma\eta_tL_{12}^2\,\mathbb{E}\|y_t-y^*(x_t)\|^2+\gamma\eta_t\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2-\frac{\gamma\eta_t}{2}\mathbb{E}\|\operatorname{grad}\Phi(x_t)\|^2-\frac{\gamma\eta_t}{4}\mathbb{E}\|v_t\|^2+\frac{6\gamma L^2}{\lambda\mu}\Big(-\frac{\mu\lambda\eta_t}{4}\mathbb{E}\|y_t-y^*(x_t)\|^2-\frac{3\eta_t}{4}\mathbb{E}\|\tilde y_{t+1}-y_t\|^2+\frac{25\lambda\eta_t}{6\mu}\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2+\frac{25\gamma^2\kappa^2\eta_t}{6\mu\lambda}\mathbb{E}\|v_t\|^2\Big)+\frac{\gamma}{2\lambda\mu}\Big(-2\lambda\mu\eta_t\,\mathbb{E}\|\operatorname{grad}_x f(x_t,y_t)-v_t\|^2+4L_{11}^2\gamma^2\eta_t\mathbb{E}\|v_t\|^2+4L_{12}^2\eta_t\mathbb{E}\|\tilde y_{t+1}-y_t\|^2+\frac{2\alpha_{t+1}^2\sigma^2}{\eta_tB}-\frac{50\lambda L^2}{\mu}\eta_t\,\mathbb{E}\|\nabla_yf(x_t,y_t)-w_t\|^2+4L_{21}^2\gamma^2\eta_t\mathbb{E}\|v_t\|^2+4L_{22}^2\eta_t\mathbb{E}\|\tilde y_{t+1}-y_t\|^2+\frac{2\beta_{t+1}^2\sigma^2}{\eta_tB}\Big)$$
Summing the resulting per-step bound over $t=1,\dots,T$, dividing by $\frac{T\gamma\eta_T}{2}$ (note $\eta_t\ge\eta_T$), and using $\Omega_1\le\Phi(x_1)+\frac{\gamma\sigma^2}{\lambda\mu\eta_0B}$ and $\Omega_{T+1}\ge\Phi^*$ together with $\eta_t^3=\frac{b^3}{m+t}$, we obtain
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\|\operatorname{grad}\Phi(x_t)\|^2+L^2\|y_t-y^*(x_t)\|^2\big]\le\frac{2(\Phi(x_1)-\Phi^*)}{T\gamma\eta_T}+\frac{2\sigma^2}{T\lambda\mu\eta_0\eta_TB}+\frac{2(c_1^2+c_2^2)\sigma^2}{T\eta_T\lambda\mu B}\int_1^T\frac{b^3}{m+t}\,dt\le\frac{2(\Phi(x_1)-\Phi^*)}{T\gamma\eta_T}+\frac{2\sigma^2}{T\lambda\mu\eta_0\eta_TB}+\frac{2(c_1^2+c_2^2)\sigma^2b^3}{T\eta_T\lambda\mu B}\ln(m+T).$$
According to Jensen's inequality, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\|\operatorname{grad}\Phi(x_t)\|+L\|y_t-y^*(x_t)\|\big]\le\sqrt2\Big(\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\|\operatorname{grad}\Phi(x_t)\|^2+L^2\|y_t-y^*(x_t)\|^2\big]\Big)^{1/2}\le\frac{\sqrt{2M}\,(m+T)^{1/6}}{T^{1/2}}\le\frac{\sqrt{2M}\,m^{1/6}}{T^{1/2}}+\frac{\sqrt{2M}}{T^{1/3}},$$
where the last inequality is due to $(a_1+a_2)^{1/6}\le a_1^{1/6}+a_2^{1/6}$ for all $a_1,a_2>0$.
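The estimator analyzed in Lemma 7 is a STORM-style momentum variance-reduced update. A minimal Euclidean sketch follows: here the vector transport $\mathcal{T}$ is taken to be the identity (on a manifold, $v_t$ and the old-point gradient would first be transported to the new tangent space), and the function name is ours.

```python
import numpy as np

def storm_update(v_t, grad_new, grad_old, alpha):
    # STORM-style estimator: v_{t+1} = grad_new + (1 - alpha) * (v_t - grad_old),
    # where grad_new and grad_old are stochastic gradients at the new and old
    # iterates computed on the SAME minibatch. alpha = 1 recovers plain SGD;
    # alpha < 1 reuses v_t, correcting it by the fresh gradient difference.
    return grad_new + (1.0 - alpha) * (v_t - grad_old)
```

In MVR-RSGDA the same update is applied to both estimators, with weights $\alpha_{t+1}=c_1\eta_t^2$ for $v_t$ and $\beta_{t+1}=c_2\eta_t^2$ for $w_t$, which is what produces the $(1-\alpha_{t+1})^2$ contraction factors in (71)–(72).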

B ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide additional experimental results on the SVHN, FashionMNIST, and STL-10 datasets.



Figure 1: Illustration of manifold operations. (a) A vector $u$ in $T_x\mathcal{M}$ is mapped to $R_x(u)$ in $\mathcal{M}$; (b) a vector $v$ in $T_x\mathcal{M}$ is transported to $T_y\mathcal{M}$ by $\mathcal{T}_x^y v$ (or $\mathcal{T}_u v$), where $y=R_x(u)$ and $u\in T_x\mathcal{M}$.

Figure 2: Training loss for robust training of DNNs with orthogonality regularization on the weights.

Figure 3: Attack loss under the uniform attack on DNNs trained by SGDA, RSGDA, and MVR-RSGDA.

and $\eta_0=O(\frac{1}{\kappa})$. Without loss of generality, let $T\ge m=O(\kappa^3)$; then $M=O\big(\kappa^2+\frac{\kappa^2}{B}+\frac{\kappa^3}{B}\ln T\big)$. When $B=\kappa$, we have $M=O(\kappa^2\ln T)$. Thus, the MVR-RSGDA algorithm has a convergence rate of $\tilde O\big(\frac{\kappa}{T^{1/3}}\big)$. Setting $\frac{\kappa}{T^{1/3}}\le\epsilon$, i.e., $\mathbb{E}[\mathcal{H}_\zeta]\le\epsilon$, we choose $T\ge\kappa^3\epsilon^{-3}$. In Algorithm 2, we require $B$ samples to estimate the stochastic gradients $v_t$ and $w_t$ at each iteration, and need $T$ iterations. Thus, the MVR-RSGDA algorithm has a sample complexity of $BT=\tilde O(\kappa^4\epsilon^{-3})$ for finding an $\epsilon$-stationary point of problem (1). Similarly, when $B=1$, the MVR-RSGDA algorithm has a convergence rate of $\tilde O\big(\frac{\kappa^{3/2}}{T^{1/3}}\big)$.
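The sample-complexity bookkeeping above is simple arithmetic; the concrete values of $\kappa$ and $\epsilon$ below are purely illustrative.

```python
# With T >= kappa^3 / eps^3 iterations and minibatch size B = kappa, the total
# number of stochastic gradient samples is B * T = kappa^4 / eps^3 (up to the
# log factors hidden by the tilde notation).
kappa, eps = 10.0, 0.1
T = kappa ** 3 / eps ** 3
B = kappa
samples = B * T
assert abs(samples - kappa ** 4 / eps ** 3) < 1e-6 * samples
```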

Since $\frac{1}{\eta_T}=\frac{(m+T)^{1/3}}{b}$, we rewrite the above inequality as follows:
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\|\operatorname{grad}\Phi(x_t)\|^2+L^2\|y_t-y^*(x_t)\|^2\big]\le\frac{M}{T}(m+T)^{1/3}.$$

Figure 4: Additional results for robust training (a-c) and the uniform attack (d-f) with the SGDA, RSGDA, and MVR-RSGDA algorithms.

Comparison of the convergence properties of our algorithms for obtaining an $\epsilon$-stationary point of the min-max optimization problem (1). $\kappa$ denotes the condition number of the function $f(x,\cdot)$.


Test accuracy against natural images and the uniform attack for the MNIST, CIFAR-10, and CIFAR-100 datasets.

$$\le-\frac{L^2\gamma\eta_t}{2}\mathbb{E}\|y_t-y^*(x_t)\|^2-\frac{\gamma\eta_t}{2}\mathbb{E}\|\operatorname{grad}\Phi(x_t)\|^2+\frac{\gamma(c_1^2+c_2^2)\sigma^2\eta_t^3}{\lambda\mu B}, \quad (87)$$
where the first inequality holds by Lemma 5 and the above inequalities (83), (84) and (85); the second inequality is due to $L=\max(1,L_{11},L_{12},L_{21},L_{22})$; and the last inequality is due to $0<\gamma\le\frac{\mu\lambda}{2\kappa L\sqrt{25+4\mu\lambda}}$. According to the above inequality (87), we have
$$\frac{\gamma\eta_t}{2}\mathbb{E}\big[\|\operatorname{grad}\Phi(x_t)\|^2+L^2\|y_t-y^*(x_t)\|^2\big]\le\Omega_t-\Omega_{t+1}+\frac{\gamma(c_1^2+c_2^2)\sigma^2\eta_t^3}{\lambda\mu B}.$$

Benchmark datasets used in the experiments.

The training loss and the attack loss under the uniform attack are shown in Fig. 4. The test accuracy on natural images and under the uniform attack is shown in Tab. 4. These results show that our methods are robust to the uniform attack when training DNNs.

Test accuracy against natural images and the uniform attack for the FashionMNIST, SVHN, and STL-10 datasets.


Then we obtain
$$\langle\tilde y_{t+1}-y_t,\,y^*(x_t)-y_t\rangle\le\frac{1}{2\eta_t}\|y_t-y^*(x_t)\|^2+\frac{\eta_t}{2}\|\tilde y_{t+1}-y_t\|^2-\frac{1}{2\eta_t}\|y_{t+1}-y^*(x_t)\|^2. \quad (43)$$
Bounding the term $\langle\nabla_yf(x_t,y_t)-w_t,\,y^*(x_t)-\tilde y_{t+1}\rangle$ by Young's inequality (inequality (44)), and plugging the inequalities (41) and (43) into (44) — where the second inequality holds by $L\ge L_{22}\ge\mu$ and $0<\eta_t\le1$, and the last inequality is due to $0<\lambda\le\frac{1}{6L}$ — it implies that
$$\|y_{t+1}-y^*(x_t)\|^2\le\Big(1-\frac{\eta_t\mu\lambda}{2}\Big)\|y_t-y^*(x_t)\|^2-\frac{3\eta_t}{4}\|\tilde y_{t+1}-y_t\|^2+\frac{4\eta_t\lambda}{\mu}\|\nabla_yf(x_t,y_t)-w_t\|^2. \quad (46)$$
Next, we decompose the term $\|y_{t+1}-y^*(x_{t+1})\|^2$ as follows:
$$\|y_{t+1}-y^*(x_{t+1})\|^2\le\Big(1+\frac{\eta_t\mu\lambda}{4}\Big)\|y_{t+1}-y^*(x_t)\|^2+\Big(1+\frac{4}{\eta_t\mu\lambda}\Big)\|y^*(x_{t+1})-y^*(x_t)\|^2\le\Big(1+\frac{\eta_t\mu\lambda}{4}\Big)\|y_{t+1}-y^*(x_t)\|^2+\Big(1+\frac{4}{\eta_t\mu\lambda}\Big)\gamma^2\kappa^2\eta_t^2\|v_t\|^2, \quad (47)$$
where the first inequality holds by the Cauchy–Schwarz inequality and Young's inequality, and the last inequality is due to Lemma 4. By combining the above inequalities (46) and (47), we have the claimed bound of Lemma 6.

