GRADIENT DESCENT ASCENT FOR MIN-MAX PROBLEMS ON RIEMANNIAN MANIFOLDS

Anonymous

Abstract

In this paper, we study a class of useful non-convex minimax optimization problems on Riemannian manifolds and propose a class of Riemannian gradient descent ascent algorithms to solve them. Specifically, we propose a new Riemannian gradient descent ascent (RGDA) algorithm for deterministic minimax optimization. We prove that RGDA has a sample complexity of O(κ²ε⁻²) for finding an ε-stationary point of the nonconvex strongly-concave minimax problem, where κ denotes the condition number. At the same time, we introduce a Riemannian stochastic gradient descent ascent (RSGDA) algorithm for stochastic minimax optimization, and prove that RSGDA achieves a sample complexity of O(κ⁴ε⁻⁴). To further reduce the sample complexity, we propose a novel momentum variance-reduced Riemannian stochastic gradient descent ascent (MVR-RSGDA) algorithm based on the momentum variance-reduction technique of STORM. We prove that MVR-RSGDA achieves a lower sample complexity of Õ(κ⁴ε⁻³) without large batches, which nearly matches the best known sample complexity of its Euclidean counterparts. Extensive experimental results on robust deep neural network training over the Stiefel manifold demonstrate the efficiency of our proposed algorithms.

1. INTRODUCTION

In this paper, we study a class of useful non-convex minimax (a.k.a. min-max) problems on a Riemannian manifold M, defined as: min_{x∈M} max_{y∈Y} f(x, y), where the function f(x, y) is µ-strongly concave in y but possibly nonconvex in x. Here Y ⊆ R^d is a convex and closed set, f(·, y) : M → R for all y ∈ Y is a smooth but possibly nonconvex real-valued function on the manifold M, and f(x, ·) : Y → R for all x ∈ M is a smooth and (strongly) concave real-valued function. In this paper, we mainly focus on the stochastic minimax optimization problem f(x, y) := E_{ξ∼D}[f(x, y; ξ)], where ξ is a random variable that follows an unknown distribution D. In fact, problem (1) is associated with many existing machine learning applications:

1) Robust training of DNNs over a Riemannian manifold. Deep Neural Networks (DNNs) have recently demonstrated exceptional performance on many machine learning applications. However, they are vulnerable to adversarial example attacks, which show that a small perturbation of the input data can significantly change the output of a DNN. Thus, the security properties of DNNs have been widely studied, and one active research topic is to enhance the robustness of DNNs under adversarial example attacks. More specifically, given training data D := {ξ_i = (a_i, b_i)}_{i=1}^n, where a_i ∈ R^d and b_i ∈ R denote the features and label of sample ξ_i respectively, each data sample a_i can be corrupted by a universal small perturbation vector y to generate an adversarial attack sample a_i + y, as in (Moosavi-Dezfooli et al., 2017; Chaubey et al., 2020). To make DNNs robust against adversarial attacks, one popular approach is to solve the following robust training problem: min_x max_{y∈Y} (1/n) Σ_{i=1}^n ℓ(h(a_i + y; x), b_i), where y ∈ R^d denotes a universal perturbation, x is the weight of the neural network, h(·; x) is the deep neural network parameterized by x, and ℓ(·, ·) is the loss function.
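To make the alternating structure concrete, one gradient descent ascent step for this robust training problem can be sketched as follows. This is a minimal numpy illustration under our own assumptions: the QR retraction, the tangent-space projection, the function names, and the step sizes are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def retract_qr(X):
    """QR-based retraction onto the Stiefel manifold St(d, r)."""
    Q, R = np.linalg.qr(X)
    # flip column signs so diag(R) >= 0, making the factorization unique
    return Q * np.where(np.diag(R) < 0.0, -1.0, 1.0)

def proj_tangent(X, G):
    """Project a Euclidean gradient G onto the tangent space at X."""
    S = X.T @ G
    return G - X @ (S + S.T) / 2.0

def proj_ball(y, eps):
    """Project the universal perturbation onto {y : ||y|| <= eps}."""
    norm = np.linalg.norm(y)
    return y if norm <= eps else y * (eps / norm)

def gda_step(X, y, grad_x, grad_y, eta_x, eta_y, eps):
    """One descent step on the Stiefel weights X, one ascent step on y."""
    gX = proj_tangent(X, grad_x(X, y))                # Riemannian gradient in x
    X_new = retract_qr(X - eta_x * gX)                # retraction keeps X on St(d, r)
    y_new = proj_ball(y + eta_y * grad_y(X, y), eps)  # projected ascent in y
    return X_new, y_new
```

The descent step moves in the tangent space and then retracts back onto the manifold, while the ascent step is an ordinary projected gradient step since Y lives in Euclidean space.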
Here the constraint Y = {y : ‖y‖ ≤ ε} indicates that the poisoned samples should not differ too much from the original ones. Recently, orthonormality on the weights of DNNs has gained much interest and has been found useful across different tasks such as person re-identification (Sun et al., 2017) and image classification (Xie et al., 2017). In fact, orthonormality constraints improve the performance of DNNs (Li et al., 2020; Bansal et al., 2018) and reduce overfitting to improve generalization (Cogswell et al., 2015). At the same time, orthonormality can stabilize the distribution of activations over layers within DNNs (Huang et al., 2018). Thus, we consider the following robust training problem over the Stiefel manifold M: min_{x∈M} max_{y∈Y} (1/n) Σ_{i=1}^n ℓ(h(a_i + y; x), b_i). When data arrive continuously, we can rewrite problem (3) as follows: min_{x∈M} max_{y∈Y} E_ξ[f(x, y; ξ)], where f(x, y; ξ) = ℓ(h(a + y; x), b) with ξ = (a, b).

2) Distributionally robust optimization over a Riemannian manifold. Distributionally robust optimization (DRO) (Chen et al., 2017; Rahimian & Mehrotra, 2019) is an effective method to deal with noisy, adversarial, and imbalanced data. At the same time, DRO in the Riemannian manifold setting is also widely applied to machine learning problems such as robust principal component analysis (PCA). More specifically, given a set of data samples {ξ_i}_{i=1}^n, DRO over a Riemannian manifold M can be written as the following minimax problem: min_{x∈M} max_{p∈S} Σ_{i=1}^n p_i ℓ(x; ξ_i) − ‖p − 1/n‖², where p = (p_1, · · · , p_n) and S = {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0} is the probability simplex. Here ℓ(x; ξ_i) denotes the loss function over the Riemannian manifold M, which applies to many machine learning problems such as PCA (Han & Gao, 2020a), dictionary learning (Sun et al., 2016), DNNs (Huang et al., 2018), and structured low-rank matrix learning (Jawanpuria & Mishra, 2018), among others.
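To make the inner maximization concrete, the concave subproblem over p can be solved by projected gradient ascent onto the simplex S. The sketch below is our own illustration (the step size, the regularization weight lam, and the helper names are assumptions); it uses the standard sorting-based Euclidean projection onto the probability simplex.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex S."""
    n = v.size
    u = np.sort(v)[::-1]                 # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u + (1.0 - css) / ks > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def dro_ascent_step(p, losses, eta, lam=1.0):
    """One projected gradient ascent step on the concave inner problem
       max_{p in S}  sum_i p_i * loss_i - lam * ||p - 1/n||^2."""
    n = p.size
    grad = losses - 2.0 * lam * (p - 1.0 / n)  # gradient of the concave objective
    return project_simplex(p + eta * grad)
```

Because the −‖p − 1/n‖² regularizer makes the inner problem strongly concave in p, this matches the nonconvex-strongly-concave setting studied in the paper, and each ascent step shifts probability mass toward harder (higher-loss) samples.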
For example, the task of PCA can be cast on a Grassmann manifold. To the best of our knowledge, existing explicit minimax optimization methods such as the gradient descent ascent method only focus on minimax problems in Euclidean space. To fill this gap, in this paper, we propose a class of efficient Riemannian gradient descent ascent algorithms to solve problem (1) using general retractions and vector transports. When problem (1) is deterministic, we propose a new deterministic Riemannian gradient descent ascent algorithm. When problem (1) is stochastic, we propose two efficient stochastic Riemannian gradient descent ascent algorithms. Our main contributions can be summarized as follows:

1) We propose a novel Riemannian gradient descent ascent (RGDA) algorithm for the deterministic minimax optimization problem (1). We prove that RGDA has a sample complexity of O(κ²ε⁻²) for finding an ε-stationary point.

2) We also propose a new Riemannian stochastic gradient descent ascent (RSGDA) algorithm for the stochastic minimax optimization. In the theoretical analysis, we prove that RSGDA has a sample complexity of O(κ⁴ε⁻⁴).

3) To further reduce the sample complexity, we introduce a novel momentum variance-reduced Riemannian stochastic gradient descent ascent (MVR-RSGDA) algorithm based on the momentum variance-reduction technique of STORM (Cutkosky & Orabona, 2019). We prove that MVR-RSGDA achieves a lower sample complexity of Õ(κ⁴ε⁻³) (please see Table 1), which nearly matches the best known sample complexity of its Euclidean counterparts.

4) Extensive experimental results on robust DNN training over the Stiefel manifold demonstrate the efficiency of our proposed algorithms.
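The momentum variance-reduction idea behind MVR-RSGDA can be sketched compactly: a STORM-style estimator is maintained for both variables, with the previous estimator transported to the new tangent space before being corrected. The sketch below is a simplified illustration under our own assumptions: it specializes to the Stiefel manifold, uses the tangent-space projection as the vector transport, omits the projection onto Y, and all names and step-size choices are ours rather than the paper's exact algorithm.

```python
import numpy as np

def retract_qr(X):
    """QR retraction onto the Stiefel manifold."""
    Q, R = np.linalg.qr(X)
    return Q * np.where(np.diag(R) < 0.0, -1.0, 1.0)

def proj_tangent(X, V):
    """Tangent-space projection at X; also used here as the vector transport."""
    S = X.T @ V
    return V - X @ (S + S.T) / 2.0

def mvr_rsgda(grad_x, grad_y, X0, y0, sampler, T, eta, gamma, beta):
    """Sketch of momentum variance-reduced Riemannian SGDA (STORM-style)."""
    X, y = X0.copy(), y0.copy()
    xi = sampler()
    vX = proj_tangent(X, grad_x(X, y, xi))   # initial gradient estimators
    vy = grad_y(X, y, xi)
    for _ in range(T):
        X_new = retract_qr(X - eta * vX)     # descent step on the manifold
        y_new = y + gamma * vy               # ascent step (project onto Y if constrained)
        xi = sampler()                       # one fresh sample per iteration
        # STORM update: v_t = g(z_t; xi) + (1 - beta) * (transport(v_{t-1}) - g(z_{t-1}; xi))
        gX_new = proj_tangent(X_new, grad_x(X_new, y_new, xi))
        gX_old = proj_tangent(X_new, grad_x(X, y, xi))
        vX = gX_new + (1.0 - beta) * (proj_tangent(X_new, vX) - gX_old)
        vy = grad_y(X_new, y_new, xi) + (1.0 - beta) * (vy - grad_y(X, y, xi))
        X, y = X_new, y_new
    return X, y
```

The correction term (v_{t-1} − g(z_{t-1}; ξ_t)) reuses the same sample ξ_t at the old and new iterates, which is what drives the variance reduction without requiring large batches.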

