STOCHASTIC CONSTRAINED DRO WITH A COMPLEXITY INDEPENDENT OF SAMPLE SIZE

Abstract

Distributionally Robust Optimization (DRO), a popular method for training models that are robust to distribution shift between training and test sets, has received tremendous attention in recent years. In this paper, we propose and analyze stochastic algorithms that apply to both non-convex and convex losses for solving the Kullback-Leibler (KL) divergence constrained DRO problem. Compared with existing methods for this problem, our stochastic algorithms not only enjoy a competitive, if not better, complexity that is independent of the sample size, but also require only a constant batch size at every iteration, which is more practical for broad applications. We establish a nearly optimal complexity bound for finding an ϵ-stationary solution for non-convex losses and an optimal complexity for finding an ϵ-optimal solution for convex losses. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems.

1. INTRODUCTION

Large-scale optimization of DRO has recently garnered increasing attention due to its promising performance in handling noisy labels, imbalanced data and adversarial data (Namkoong & Duchi, 2017; Zhu et al., 2019; Qi et al., 2020a; Chen & Paschalidis, 2018). Various primal-dual algorithms can be used for solving various DRO problems (Rafique et al., 2021; Nemirovski et al., 2009). However, primal-dual algorithms inevitably suffer additional overhead for handling an n-dimensional dual variable, where n is the sample size. This is undesirable for large-scale deep learning, where n could be in the order of millions or even billions. Hence, a recent trend is to design dual-free algorithms for solving various DRO problems (Qi et al., 2021; Jin et al., 2021; Levy et al., 2020). In this paper, we provide efficient dual-free algorithms, which are still lacking in the literature, for solving the following constrained DRO problem:

min_{w∈W} max_{p∈∆_n: D(p,1/n)≤ρ} Σ_{i=1}^n p_i ℓ_i(w) − λ_0 D(p, 1/n),   (1)

where w denotes the model parameter, W is a closed convex set, ∆_n = {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0} denotes the n-dimensional simplex, ℓ_i(w) denotes the loss on the i-th data point, D(p, 1/n) = Σ_{i=1}^n p_i log(p_i n) is the Kullback-Leibler (KL) divergence between p and the uniform probabilities 1/n ∈ R^n, ρ is the constraint parameter, and λ_0 > 0 is a small constant. The small KL regularization on p is added to ensure that the objective in terms of w is smooth, which is needed for deriving fast convergence. There are several reasons for considering the above constrained DRO problem. First, existing dual-free algorithms are not satisfactory (Qi et al., 2021; Jin et al., 2021; Levy et al., 2020; Hu et al., 2021).
They are either restricted to problems with no constraint on the dual variable p other than the simplex constraint (Qi et al., 2021; Jin et al., 2021), or restricted to convex analysis, or require a batch size that depends on the accuracy level (Levy et al., 2020; Hu et al., 2021). Second, the KL divergence is a more natural metric for measuring the distance between two distributions than other divergence measures, e.g., the Euclidean distance. Third, compared with the KL-regularized DRO problem without constraints, the KL-constrained formulation above automatically decides a proper regularization effect, depending on the optimal solution, through tuning the constraint upper bound ρ. The question to be addressed is the following: can we develop stochastic algorithms whose oracle complexity is optimal for both convex and non-convex losses, and whose per-iteration complexity is independent of the sample size n, without imposing any requirement of a large batch size? We address this question by (i) deriving an equivalent primal-only formulation of a compositional form; (ii) designing two algorithms for non-convex losses and extending them to convex losses; and (iii) establishing an optimal complexity for both convex and non-convex losses. In particular, for a non-convex and smooth loss function ℓ_i(w), we achieve an oracle complexity of O(1/ϵ^3) for finding an ϵ-stationary solution; and for a convex and smooth loss function, we achieve an oracle complexity of O(1/ϵ^2) for finding an ϵ-optimal solution. We emphasize that these results are on par with the best complexities achieved by primal-dual algorithms (Huang et al., 2020; Namkoong & Duchi, 2016). But our algorithms have a per-iteration complexity of O(d), which is independent of the sample size n. The convergence comparison of different methods for solving (1) is shown in Table 1.
To achieve these results, we first convert problem (1) into an equivalent problem:

min_{w∈W} min_{λ≥λ_0} F(w, λ) := λ log( (1/n) Σ_{i=1}^n exp(ℓ_i(w)/λ) ) + (λ − λ_0)ρ.   (2)

By considering x = (w^⊤, λ)^⊤ ∈ R^{d+1} as a single variable to be optimized, the objective is a compositional function of x of the form f(g(x)), where g(x) = (λ, (1/n) Σ_{i=1}^n exp(ℓ_i(w)/λ)) ∈ R^2 and f(g) = g_1 log(g_2) + g_1 ρ. However, several challenges must be addressed to achieve optimal complexities for both convex and non-convex loss functions ℓ_i(w). First, F(x) is non-smooth in x given the domain constraints w ∈ W and λ ≥ λ_0. Second, the gradient of the outer function f(g) is not Lipschitz continuous in the second coordinate g_2 if λ is unbounded, a property that is essential for all existing stochastic compositional optimization algorithms. Third, to the best of our knowledge, no optimal complexity of order O(1/ϵ^2) has been achieved for a convex compositional function, except by Zhang & Lan (2021), who assume f is convex and component-wise non-decreasing, which is not applicable to (2). To address the first two challenges, we derive an upper bound λ̄ on the optimal λ assuming that ℓ_i(w) is bounded for w ∈ W, i.e., λ ∈ [λ_0, λ̄], which allows us to establish the smoothness of F(x) and f(g). Then we consider optimizing F̃(x) = F(x) + δ_X(x), where δ_X(x) = 0 if x ∈ X = {x = (w^⊤, λ)^⊤ : w ∈ W, λ ∈ [λ_0, λ̄]}. By leveraging the smoothness of F and f, we design stochastic algorithms that utilize a recursive variance-reduction technique to compute a stochastic estimator of the gradient of F(x), which allows us to achieve a complexity of O(1/ϵ^3) for finding a solution x such that E[dist(0, ∂F̃(x))] ≤ ϵ. To address the third challenge, we consider optimizing F̃_µ(x) = F̃(x) + µ∥x∥^2/2 for a small µ.
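To build intuition for (2): the term λ log((1/n) Σ_i exp(ℓ_i(w)/λ)) is a log-sum-exp smoothing of the losses that tends to the worst-case loss max_i ℓ_i(w) as λ → 0 and to the average loss as λ → ∞. A minimal numerical sketch of this behavior; the loss values and the max-shifted log-mean-exp implementation are illustrative choices, not from the paper:

```python
import numpy as np

def F(losses, lam, lam0=1e-3, rho=0.5):
    # F(w, lam) = lam * log((1/n) * sum_i exp(ell_i(w)/lam)) + (lam - lam0)*rho,
    # with the usual max-shift so exp() does not overflow for small lam.
    m = losses.max()
    log_mean_exp = m / lam + np.log(np.mean(np.exp((losses - m) / lam)))
    return lam * log_mean_exp + (lam - lam0) * rho

losses = np.array([0.2, 1.0, 3.0, 0.5])   # hypothetical per-sample losses
small_lam = F(losses, lam=1e-2, rho=0.0)  # close to max(losses) = 3.0
large_lam = F(losses, lam=1e2, rho=0.0)   # close to mean(losses) = 1.175
```

Tuning λ thus interpolates between worst-case robust training and plain ERM; in (2), λ is optimized rather than hand-tuned, with ρ controlling where it settles.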
We prove that F̃_µ(x) satisfies a Kurdyka-Łojasiewicz inequality, which allows us to boost the convergence of the aforementioned algorithm to an optimal complexity of O(1/ϵ^2) for finding an ϵ-optimal solution to F̃(x). Besides the optimal algorithms, we also present simpler algorithms with a worse complexity, which are more practical for deep learning applications as they do not require two backpropagations at two different points per iteration, unlike the optimal algorithms.

2. RELATED WORK

DRO springs from the robust optimization literature (Bertsimas et al., 2018; Ben-Tal et al., 2013) and has been extensively studied in machine learning and statistics (Namkoong & Duchi, 2017; Duchi et al., 2016; Staib & Jegelka, 2019; Deng et al., 2020; Qi et al., 2020b; Duchi & Namkoong, 2021), and operations research (Rahimian & Mehrotra, 2019; Delage & Ye, 2010). Depending on how the uncertain variables are constrained or regularized, there are constrained DRO formulations, which specify a constraint set for the uncertain variables, and regularized DRO formulations, which add a regularization term on the uncertain variables to the objective (Levy et al., 2020). Duchi et al. (2016) showed that minimizing constrained DRO with an f-divergence, including a χ^2-divergence constraint and a KL-divergence constraint, is equivalent to adding variance regularization to the Empirical Risk Minimization (ERM) objective, which is able to reduce the uncertainty and improve the generalization performance of the model.

Table 1: Convergence comparison of different methods for solving (1). "Oracle" is the oracle complexity, "Batch" the required batch size, and "Cost" the per-iteration cost; P = primal, PD = primal-dual, COM = compositional.

Setting     Method                          Oracle      Batch     Cost      Type
Non-convex  SCDRO (this work)               O(1/ϵ^4)    O(1)      O(d)      COM
Non-convex  ASCDRO (this work)              O(1/ϵ^3)    O(1)      O(d)      COM
Convex      FastDRO (Levy et al., 2020)     O(1/ϵ^3)    O(1/ϵ)    O(d/ϵ)    P
Convex      SPD (Namkoong & Duchi, 2016)    O(1/ϵ^2)    O(1)      O(n+d)    PD
Convex      Dual SGM (Levy et al., 2020)    O(1/ϵ^2)    O(1)      O(d)      P
Convex      RSCDRO (this work)              O(1/ϵ^3)    O(1)      O(d)      COM
Convex      RASCDRO (this work)             O(1/ϵ^2)    O(1)      O(d)      COM

Primal-Dual Algorithms. Many primal-dual algorithms designed for min-max problems can be directly applied to optimize the constrained DRO problem. The algorithms proposed in (Nemirovski et al., 2009; Juditsky et al., 2011; Yan et al., 2019; Namkoong & Duchi, 2016; Yan et al., 2020; Song et al., 2021; Alacaoglu et al., 2022) are applicable to solving (1) when ℓ is a convex function. Recently, Rafique et al. (2021) and Yan et al. (2020) proposed stochastic algorithms for solving non-convex strongly-concave min-max problems, which are applicable to solving (1) when ℓ is a weakly convex function or smooth.
Many primal-dual stochastic algorithms have been proposed for solving non-convex strongly-concave problems, with a state-of-the-art oracle complexity of O(1/ϵ^3) for finding a stationary solution (Huang et al., 2020; Luo et al., 2020; Tran-Dinh et al., 2020). However, primal-dual algorithms require maintaining and updating an O(n)-dimensional vector for the dual variable.

Constrained DRO. Recently, Levy et al. (2020) proposed sample-size-independent algorithms based on gradient estimators for solving a group of DRO problems in the convex setting. Specifically, they achieved a convergence rate of O(1/ϵ^2) for the χ^2-constrained/regularized and CVaR-constrained convex DRO problems with a batch size that depends only logarithmically on the inverse accuracy level, i.e., O(log(1/ϵ)), with the help of a multi-level Monte-Carlo (MLMC) gradient estimator. For the KL-constrained DRO objective and other more general settings, they achieve a convergence rate of O(1/ϵ^3) under a Lipschitz continuity assumption on the inverse CDF of the loss function, using a mini-batch gradient estimator with a batch size of order O(1/ϵ) (please refer to Table 3 in Levy et al. (2020)). In addition, Levy et al. (2020) also proposed a simple stochastic gradient method for solving the dual expression of the DRO formulation, called Dual SGM. In terms of convergence, they only discussed guarantees for the χ^2-regularized and CVaR-penalized convex DRO problems (cf. Claim 3 in their paper). However, there is still a gap in proving the convergence rate of Dual SGM for non-convex KL-constrained DRO problems, due to challenges similar to those mentioned in the previous section, in particular establishing the smoothness condition in terms of the primal variable and the Lagrangian multipliers (denoted x, ν, η respectively in their paper).
This paper makes unique contributions in addressing these challenges by (i) removing η in Dual SGM and deriving a box constraint for our Lagrangian multiplier λ to prove the smoothness condition; and (ii) establishing a nearly optimal complexity of order O(1/ϵ^3) in the presence of non-smooth box constraints, which, to the best of our knowledge, is the first such result for solving a non-convex constrained compositional optimization problem.

Regularized DRO. DRO with a KL-divergence-regularized objective has shown superior performance for addressing data imbalance problems (Qi et al., 2021; 2020a; Li et al., 2020; 2021). Jin et al. (2021) proposed a mini-batch normalized gradient descent method with momentum that finds a first-order ϵ-stationary point with an oracle complexity of O(1/ϵ^4) for KL-regularized and χ^2-regularized DRO with a non-convex loss; they address the challenge that the loss function could be unbounded. Qi et al. (2021) proposed online stochastic compositional algorithms to solve KL-regularized DRO. They leveraged a recursive variance-reduction technique (STORM (Cutkosky & Orabona, 2019)) to compute a gradient estimator for the model parameter w only, deriving a complexity of O(1/ϵ^3) for a general non-convex problem and improving it to O(1/(µϵ)) for problems satisfying a µ-PL condition. Qi et al. (2020a) report a worse complexity for a simpler algorithm for solving KL-regularized DRO. Li et al. (2020; 2021) studied the effectiveness of the KL-regularized objective in different applications, such as enforcing fairness between subgroups and handling class imbalance. More related work is included in the appendix due to space limits; this does not affect the discussion of the results in this paper.

3. PRELIMINARIES

In this section, we introduce notations, definitions and assumptions. We show that (1) is equivalent to (2) in Section G of the Appendix.

Notations: Let ∥·∥ denote the Euclidean norm of a vector or the spectral norm of a matrix. Let x = (w^⊤, λ)^⊤ ∈ R^{d+1}, g_i(x) = exp(ℓ_i(w)/λ) and g(x) = E_{i∼D}[exp(ℓ_i(w)/λ)].

Assumption 1. (a) W is a bounded closed convex set, i.e., ∥w∥ ≤ R for all w ∈ W. (b) ℓ_i(w) is G-Lipschitz continuous and bounded, i.e., 0 ≤ ℓ_i(w) ≤ C for all w ∈ W, i ∼ D. (c) ℓ_i(w) is L-smooth, i.e., ∥∇ℓ_i(w_1) − ∇ℓ_i(w_2)∥ ≤ L∥w_1 − w_2∥ for all w_1, w_2 ∈ W, i ∼ D. (d) There exist a positive constant ∆ < ∞ and an initial solution (w_1, λ_1) such that F(w_1, λ_1) − min_{w∈W} min_{λ≥λ_0} F(w, λ) ≤ ∆.

Assumption 2. Let σ_g, σ_∇g be positive constants and σ^2 = max{σ_g^2, σ_∇g^2}. For i ∼ D, assume that E[∥g_i(x) − g(x)∥^2] ≤ σ_g^2 and E[∥∇g_i(x) − ∇g(x)∥^2] ≤ σ_∇g^2.

Remark: Assumption 1(a), i.e., the boundedness of W, is also assumed in Levy et al. (2020) and is mainly used for the convex analysis. Given Assumption 1(b), (c), i.e., the Lipschitz continuity and smoothness of the loss function, the variance bounds for g_i and its gradient in Assumption 2 can be derived from Assumption 1(b), such that E[∥g_i(x) − g(x)∥^2] ≤ E[∥g_i(x)∥^2] ≤ exp(2C/λ_0), and E[∥∇g_i(x) − ∇g(x)∥^2] ≤ E[∥∇g_i(x)∥^2] ≤ exp(2C/λ_0)(G^2/λ_0^2 + C^2/λ_0^4). However, F(w, λ) is not necessarily smooth in x = (w^⊤, λ)^⊤ if λ is unbounded. To address this concern, we prove that the optimal λ is indeed bounded.

Lemma 1. The optimal solution of the dual variable λ* to problem (2) is upper bounded by λ̄ = λ_0 + C/ρ, where C is the upper bound of the loss function and ρ is the constraint parameter.

Thus, we can constrain the domain of λ in the DRO formulation (2) by the upper bound λ̄ and obtain the following equivalent formulation:

min_{w∈W} min_{λ_0≤λ≤λ̄} λ log( (1/n) Σ_{i=1}^n exp(ℓ_i(w)/λ) ) + λρ.   (3)

The upper bound λ̄ guarantees the smoothness of F(w, λ) and the smoothness of f_λ(·), which are critical for the proposed algorithms to enjoy fast convergence rates.

Lemma 2. F(w, λ) is L_F-smooth for any w ∈ W and λ ∈ [λ_0, λ̄], where L_F = λ̄L_g^2 + 2L_g + λ̄L_∇g + 1 + λ̄; L_g and L_∇g are constants independent of the sample size n, derived explicitly in Lemma 7.

Below, we let X = {x : w ∈ W, λ_0 ≤ λ ≤ λ̄}, δ_X(x) = 0 if x ∈ X, and δ_X(x) = ∞ otherwise. Problem (3) is then equivalent to

min_{x∈R^{d+1}} F̃(x) := F(x) + δ_X(x).   (4)

Since F̃ is non-smooth, we define the regular subgradient as follows.

Definition 1 (Regular Subgradient). Consider a function Φ : R^n → R that is finite at a point x̄. A vector v ∈ R^n is a regular subgradient of Φ at x̄, written v ∈ ∂Φ(x̄), if

lim inf_{x→x̄} [Φ(x) − Φ(x̄) − v^⊤(x − x̄)] / ∥x − x̄∥ ≥ 0.

Since F(x) is differentiable, we use ∂F̃(x) = ∇F(x) + ∂δ_X(x) as the subdifferential of F̃.
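Lemma 1 can be sanity-checked numerically: for fixed bounded losses, a grid search over λ should never place the minimizer of F(w, ·) above λ̄ = λ_0 + C/ρ. A small sketch; the random losses, grid, and constants are illustrative assumptions:

```python
import numpy as np

def F_lam(losses, lam, lam0, rho):
    # F(w, lam) for fixed per-sample losses, via a max-shifted log-mean-exp.
    m = losses.max()
    lme = m / lam + np.log(np.mean(np.exp((losses - m) / lam)))
    return lam * lme + (lam - lam0) * rho

rng = np.random.default_rng(0)
lam0, rho, C = 1e-3, 0.5, 2.0
losses = rng.uniform(0.0, C, size=50)        # bounded losses: 0 <= ell_i <= C
lam_bar = lam0 + C / rho                     # Lemma 1 upper bound
grid = np.linspace(lam0, 2 * lam_bar, 4000)  # search well past the bound
lam_star = grid[np.argmin([F_lam(losses, t, lam0, rho) for t in grid])]
```

Intuitively, the derivative of F in λ is ρ minus the KL divergence of the induced weights from uniform; for λ above λ̄ the KL term is below ρ, so F is increasing and the minimizer cannot lie there.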

4. STOCHASTIC CONSTRAINED DRO WITH NON-CONVEX LOSSES

In this section, we present two stochastic algorithms for solving (4). The first algorithm is simpler and practical for deep learning applications. The second is an accelerated variant with a better complexity, at the cost of a more involved update.

Algorithm 1 SCDRO(x_1, v_1, u_1, s_1, η_1, T_1)
1: Input: w_1 ∈ W, λ_1 ≥ λ_0, x_1 = (w_1^⊤, λ_1)^⊤
2: Initialization: Draw a sample ξ_1 ∼ D, and calculate s_1 = exp(ℓ_i(w_1)/λ_1), v_1 = ∇f_{λ_1}(s_1)∂_w g_i(x_1) ∈ R^d, u_1 = ∇f_{λ_1}(s_1)∂_λ g_i(x_1) + log(s_1) + ρ ∈ R
3: for t = 1, ..., T do
4:   Update x_{t+1} = Π_X(x_t − ηz_t)
5:   Draw a sample ξ_i ∼ D
6:   Let s_{t+1} = (1 − β)s_t + βg_i(x_{t+1})
7:   Update v_{t+1}, u_{t+1} according to (6)
8: end for
9: return: (x_τ, v_τ, u_τ, s_τ), where τ ∼ [T]

Algorithm 2 ASCDRO(x_1, v_1, u_1, s_1, η_1, T_1)
1: Input: w_1 ∈ W, λ_1 ≥ λ_0, x_1 = (w_1^⊤, λ_1)^⊤
2: Initialization: Draw a sample ξ_1 ∼ D, and calculate s_1 = exp(ℓ_i(w_1)/λ_1), v_1 = ∂_w g_i(x_1) ∈ R^d, u_1 = ∂_λ g_i(x_1) ∈ R
3: for t = 1, ..., T do
4:   Update x_{t+1} = Π_X(x_t − ηz_t), where z_t is given in (8)
5:   Draw a sample ξ_i ∼ D
6:   Update s_{t+1}, v_{t+1}, u_{t+1} according to (7)
7: end for
8: return: (x_τ, v_τ, u_τ, s_τ), where τ ∼ [T]

4.1 BASIC ALGORITHM: SCDRO

A major concern in the algorithm design is how to compute a stochastic estimator of the gradient of F(x). At iteration t, the gradient of F(x_t) is given by

∂_w F(x_t) = ∇f_{λ_t}(g(x_t))∇_w g(x_t),  ∂_λ F(x_t) = ∇f_{λ_t}(g(x_t))∇_λ g(x_t) + log(g(x_t)) + ρ.   (5)

Both ∇_λ g(x_t) and ∇_w g(x_t) can be estimated by the unbiased estimator ∇g_i(x_t). The difficulty lies in estimating g(x_t) inside ∇f_{λ_t}(·). The first algorithm, SCDRO, applies existing techniques for two-level compositional functions. In particular, we estimate g(x_t) by a sequence s_t updated by the moving average s_t = (1 − β)s_{t−1} + βg_i(x_t). Then we substitute s_t for g(x_t) in ∂_w F(x_t) and ∂_λ F(x_t), and invoke the following moving averages to obtain the gradient estimators in terms of w_t and λ_t, respectively:

v_t = (1 − β)v_{t−1} + β∇f_{λ_t}(s_t)∇_w g_i(x_t)   (6)
u_t = (1 − β)u_{t−1} + β(∇f_{λ_t}(s_t)∇_λ g_i(x_t) + log(s_t) + ρ).

Finally, we complete the update of x_t by x_{t+1} = Π_X(x_t − ηz_t), where z_t = (v_t^⊤, u_t)^⊤. We point out that the moving-average estimator for tracking the inner function g is widely used for solving compositional optimization problems (Wang et al., 2017; Qi et al., 2021; Zhang & Xiao, 2019; Zhou et al., 2019). Using a moving average to compute a stochastic gradient estimator of a compositional function was first used in the NASA method proposed in Ghadimi et al. (2020). The proposed method SCDRO is presented in Algorithm 1. It is similar to NASA but with a simpler update of x_{t+1}: we directly apply a projection after an SGD-style update, whereas NASA uses two steps to update x_{t+1}. As a consequence, NASA has two parameters for updating x_{t+1} while SCDRO has only one parameter η.
It is this simple change that allows us to extend SCDRO to convex problems in the next section. Below, we present the convergence rate of our basic algorithm SCDRO for a non-convex loss function.

Theorem 1. Suppose Assumptions 1 and 2 hold, and set β = 1/√T, η = β/(20L_F^2). Then after running Algorithm 1 for T iterations, we have E[dist(0, ∂F̃(x_τ))^2] ≤ (624σ^2 + 280∆)L_F^2/√T + 20L_F^2 ∆/T.

Remark: Theorem 1 shows that SCDRO achieves a complexity of O(1/ϵ^4) for finding an ϵ-stationary point, i.e., E[dist(0, ∂F̃(x_τ))] ≤ ϵ, for a non-convex loss function. Note that NASA (Ghadimi et al., 2020) enjoys the same oracle complexity but for a different convergence measure, i.e., E[∥y(x, z) − x∥^2 + ∥z − ∇F(x)∥^2] ≤ ϵ for a returned primal-dual pair (x, z), where y(x, z) = Π_X[x − z]. Our convergence measure is more intuitive. In addition, we are able to leverage our convergence measure to establish convergence for convex functions by using a Kurdyka-Łojasiewicz (KL) inequality and a restarting trick, as shown in the next section; such a convergence result for NASA is missing in their paper. Compared with stochastic primal-dual methods (Rafique et al., 2021; Yan et al., 2020) for the min-max formulation (1), their algorithms are double-looped and have the same oracle complexity for a different convergence measure, i.e., E[γ^2 ∥x̄ − x*∥^2] ≤ ϵ for some returned solution x̄, where x* is a reference point that is not computable. Our convergence measure is stronger, as we directly bound E[dist(0, ∂F̃(x_τ))^2] at the returned solution x_τ. This is because we leverage the smoothness of F(·).

4.2. ACCELERATED ALGORITHM: ASCDRO

Our second algorithm, presented in Algorithm 2, is inspired by Qi et al. (2021) for solving KL-regularized DRO; it leverages a recursive variance-reduction technique (i.e., STORM) to estimate g(x_t) and ∇g(x_t) for computing ∂_w F(x_t) and ∂_λ F(x_t) in (5). In particular, we use v_t to track ∇_w g(x_t), u_t to track ∇_λ g(x_t), and s_t to track g(x_t), updated by:

v_t = ∇_w g_i(x_t) + (1 − β)(v_{t−1} − ∇_w g_i(x_{t−1}))
u_t = ∇_λ g_i(x_t) + (1 − β)(u_{t−1} − ∇_λ g_i(x_{t−1}))   (7)
s_t = g_i(x_t) + (1 − β)(s_{t−1} − g_i(x_{t−1})).

A similar update to s_t has been used in Chen et al. (2021) for tracking inner function values in two-level compositional optimization. However, they do not use similar updates for tracking the gradients as in v_t, u_t; hence, their algorithm has a worse complexity. We then plug these estimators into ∂_w F(x_t) and ∂_λ F(x_t) to obtain the gradient estimator

z_t = (∇f_{λ_t}(s_t)v_t^⊤, ∇f_{λ_t}(s_t)u_t + log(s_t) + ρ)^⊤.   (8)

Below, we show that ASCDRO achieves a better convergence rate for non-convex losses.

Theorem 2. Under Assumptions 1 and 2, for any α > 1, let k = ασ^{2/3}/L_F, w = max(2σ^2, (16L_F^2 k)^3) and c = σ^2/(14L_F k^3) + 130L_F^4. Then after running Algorithm 2 for T iterations with η_t = k/(w + tσ^2)^{1/3} and β_t = cη_t^2, we have E[dist(0, ∂F̃(x_τ))^2] ≤ O(log T / T^{2/3}).

Remark: Theorem 2 implies that with a polynomially decreasing step size, ASCDRO finds an ϵ-stationary solution such that E[dist(0, ∂F̃(x_τ))] ≤ ϵ with a near-optimal complexity of O(1/ϵ^3). Note that the complexity O(1/ϵ^3) is optimal up to a logarithmic factor for solving non-convex smooth optimization problems (Arjevani et al., 2019). State-of-the-art primal-dual methods with variance reduction for min-max problems (Huang et al., 2020) have the same complexity but for a different convergence measure, i.e., E[(1/γ)∥x̄ − Π_X[x̄ − γ∇F(x̄)]∥] ≤ ϵ for a returned solution x̄.
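The recursion (7) needs each fresh sample evaluated at both x_t and x_{t−1}, i.e., two evaluations per iteration. A toy sketch of the trackers and the assembled estimator (8), on illustrative one-dimensional quadratic losses ℓ_i(w) = 0.5(w − a_i)^2; all constants below are arbitrary assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(-1.0, 1.0, size=500)
lam0, rho = 1e-3, 0.5
lam_bar = lam0 + 4.5 / rho              # Lemma 1 bound with C = 0.5*(2+1)^2

def evaluate(x, i):
    """g_i(x), grad_w g_i(x), grad_lam g_i(x) for ell_i(w) = 0.5*(w - a_i)^2."""
    w, lam = x
    ell = 0.5 * (w - a[i]) ** 2
    g = np.exp(ell / lam)
    return g, g * (w - a[i]) / lam, -g * ell / lam ** 2

beta, eta = 0.1, 0.02
x = np.array([1.0, 1.0])                          # (w_1, lam_1)
s, v, u = evaluate(x, rng.integers(len(a)))       # initialize trackers
for _ in range(3000):
    lam = x[1]
    # Eq. (8): z_t = (grad_f * v_t, grad_f * u_t + log s_t + rho), grad_f = lam/s
    z = np.array([(lam / s) * v, (lam / s) * u + np.log(s) + rho])
    x_new = np.clip(x - eta * z, [-2.0, lam0], [2.0, lam_bar])
    i = rng.integers(len(a))                      # ONE fresh sample, TWO evaluations
    new_g, new_gw, new_gl = evaluate(x_new, i)
    old_g, old_gw, old_gl = evaluate(x, i)
    s = new_g + (1 - beta) * (s - old_g)          # Eq. (7): STORM-style trackers
    v = new_gw + (1 - beta) * (v - old_gw)
    u = new_gl + (1 - beta) * (u - old_gl)
    x = x_new
```

Compared with the plain moving averages of SCDRO, the correction term (1 − β)(· − ·) cancels most of the sampling noise when consecutive iterates are close, which is what yields the improved rate.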

5. STOCHASTIC ALGORITHMS FOR CONVEX PROBLEMS

In this section, we present restarted algorithms for solving (3) with a convex loss function ℓ_i(w). The key is to restart SCDRO and ASCDRO with a stagewise step-size scheme. We define a new objective F_µ(x) = F(x) + µ∥x∥^2/2 and correspondingly F̃_µ(x) = F_µ(x) + δ_X(x), where µ is a constant to be determined later. With this new objective, we have the following lemma.

Lemma 3. Suppose that ℓ_i(w) is convex for all i. Then for all x ∈ X, F̃_µ(x) satisfies the following Kurdyka-Łojasiewicz (KL) inequality: dist(0, ∂F̃_µ(x))^2 ≥ 2µ(F̃_µ(x) − inf_{x∈X} F̃_µ(x)).

Algorithm 3 RSCDRO or RASCDRO
1: Input: w_1 ∈ W, λ_1 ∈ R_+, x_1 = (w_1^⊤, λ_1)^⊤
2: Initialization: the same as in SCDRO or ASCDRO
3: Let Λ_k = (x_k, v_k, u_k, s_k)
4: for k = 1, ..., K do
5:   Λ_{k+1} = SCDRO(Λ_k, η_k, T_k) or Λ_{k+1} = ASCDRO(Λ_k, η_k, T_k)
6:   Change η_k, T_k according to Lemma 4 or Lemma 5
7: end for
8: return: x_K

Lemma 3 allows us to obtain convergence guarantees for convex losses. The idea of the restarted algorithm is to apply SCDRO and ASCDRO to the new objective F̃_µ(x) by adding µx_t to the gradient estimator, i.e., replacing (∇f_{λ_t}(s_t)∇_w g_i(x_t)^⊤, ∇f_{λ_t}(s_t)∇_λ g_i(x_t) + log(s_t) + ρ)^⊤ in (6) of Algorithm 1 and z_t in (8) of Algorithm 2 by z_t = (∇f_{λ_t}(s_t)v_t^⊤, ∇f_{λ_t}(s_t)u_t + log(s_t) + ρ)^⊤ + µx_t, and to restart SCDRO or ASCDRO with a stagewise step size to enjoy the benefit of the KL inequality of F̃_µ(x). It is notable that stagewise step sizes are widely and commonly used in practice. The multi-stage restarted versions of SCDRO and ASCDRO are shown in Algorithm 3, which we refer to as restarted SCDRO (RSCDRO) and restarted ASCDRO (RASCDRO).
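To make the restart schedule concrete, the following sketch instantiates the stagewise parameters of Lemma 4 with placeholder constants (L_F, σ^2, µ, ∆_µ are problem-dependent; the values below are arbitrary assumptions). Each stage halves the target gap ϵ_k, which roughly doubles the stage length T_k:

```python
# Stagewise parameter schedule of Algorithm 3 (RSCDRO variant), per Lemma 4.
# L_F, sigma2, mu, Delta_mu are problem constants; these values are placeholders.
L_F, sigma2, mu, Delta_mu = 10.0, 1.0, 0.1, 5.0
c = 384 * L_F ** 2
eps_k, schedule = Delta_mu, []
for k in range(8):
    beta_k = min(mu * eps_k / (c * sigma2), 1.0 / c)
    eta_k = min(mu * eps_k / (12 * c * L_F ** 2 * sigma2), 1.0 / (12 * c * L_F ** 2))
    T_k = int(max(384 * c * L_F ** 2 * sigma2 / (mu ** 2 * eps_k),
                  384 * c * L_F ** 2 / mu))
    schedule.append((eps_k, beta_k, eta_k, T_k))
    eps_k /= 2.0   # target gap halves: E[F_mu(x_k) - inf F_mu] <= eps_k
```

Summing the geometrically growing T_k over K = O(log(ϵ_1/ϵ)) stages gives the O(1/(µ^2 ϵ)) total oracle complexity of Theorem 3.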

5.1. RESTARTED SCDRO FOR CONVEX PROBLEMS

In this subsection, we present the convergence rate of RSCDRO for convex losses. We first present a lemma stating that F_µ(x_k) decreases stagewise.

Lemma 4. Suppose Assumptions 1 and 2 hold, ℓ_i(w) is convex for all i, and F_µ(x_1) − inf_{x∈X} F_µ(x) ≤ ∆_µ < ∞. Let ϵ_1 = ∆_µ, ϵ_k = ϵ_{k−1}/2, β_k = min{µϵ_k/(cσ^2), 1/c}, η_k = min{µϵ_k/(12cL_F^2 σ^2), 1/(12cL_F^2)} and T_k = max{384cL_F^2 σ^2/(µ^2 ϵ_k), 384cL_F^2/µ}, where c = 384L_F^2. Run RSCDRO; then we have E[F_µ(x_k) − inf_{x∈X} F_µ(x)] ≤ ϵ_k for each stage k.

The above lemma implies that the objective gap E[F_µ(x_k) − inf_{x∈X} F_µ(x)] is decreased by a factor of 2 after each stage. Based on the above lemma, RSCDRO has the following convergence rate.

Theorem 3. Under the same assumptions and parameter settings as Lemma 4, after K = O(log_2(ϵ_1/ϵ)) stages, the output of RSCDRO satisfies E[F_µ(x_K) − inf_{x∈X} F_µ(x)] ≤ ϵ, and the oracle complexity is O(1/(µ^2 ϵ)).

The following corollary follows from the above theorem (please see Appendix F.5 for the proof).

Corollary 1. Let µ = ϵ/(2(R^2 + λ̄^2)). Then under the same assumptions and parameter settings as Lemma 4, after K = O(log_2(ϵ_1/ϵ)) stages, the output of RSCDRO satisfies E[F(x_K) − inf_{x∈X} F(x)] ≤ ϵ, and the oracle complexity is O(1/ϵ^3).

5.2. RESTARTED ASCDRO FOR CONVEX PROBLEMS

In this subsection, we establish a better convergence rate of RASCDRO for convex losses.

Lemma 5. Suppose Assumptions 1 and 2 hold, ℓ_i(w) is convex for all i, and F_µ(x_1) − inf_{x∈X} F_µ(x) ≤ ∆_µ < ∞. Let ϵ_1 = ∆_µ, ϵ_k = ϵ_{k−1}/2, β_k = min{µϵ_k/(cσ^2), 1/c}, η_k = min{√(µϵ_k)/(24cL_F σ^2), 1/(24cL_F^2)} and T_k = max{192cL_F σ/(µ^{3/2} √ϵ_k), 192cL_F^2 σ^2/(µϵ_k), 192cL_F^2/µ}, where c = 768L_F^2. Run RASCDRO; then we have E[F_µ(x_k) − inf_{x∈X} F_µ(x)] ≤ ϵ_k for each stage k.

The above lemma implies that the objective gap E[F_µ(x_k) − inf_{x∈X} F_µ(x)] is decreased by a factor of 2 after each stage. Hence we have the following convergence rate for RASCDRO.

Theorem 4. Under the same assumptions and parameter settings as Lemma 5, after K = O(log_2(ϵ_1/ϵ)) stages, the output of RASCDRO satisfies E[F_µ(x_K) − inf_{x∈X} F_µ(x)] ≤ ϵ, and the oracle complexity is O(max{1/(µϵ), 1/(µ^{3/2} √ϵ)}).

By the same derivation as for Corollary 1, the following corollary of Theorem 4 holds.

Corollary 2. Let µ = ϵ/(2(R^2 + λ̄^2)). Then under the same assumptions and parameter settings as Lemma 5, after K = O(log_2(ϵ_1/ϵ)) stages, the output of RASCDRO satisfies E[F(x_K) − inf_{x∈X} F(x)] ≤ ϵ, and the oracle complexity is O(1/ϵ^2).

Remark: Corollary 2 shows that RASCDRO achieves the claimed oracle complexity O(1/ϵ^2) for finding an ϵ-optimal solution, which is optimal for solving convex smooth optimization problems (Nemirovsky & Yudin, 1983). Finally, we note that a similar complexity was established in (Zhang & Lan, 2021) for constrained convex compositional optimization problems. However, their analysis requires each level function to be convex, which does not apply to our case since the outer function f_λ(·) is non-convex.

6. EXPERIMENTS

In this section, we verify the effectiveness of the proposed algorithms on imbalanced classification problems. We show that the proposed methods outperform baselines in both the convex and non-convex settings in terms of convergence speed and generalization performance. In addition, we study the influence of ρ on the robustness of different optimization methods in the supplement.

Baselines. For the comparison of convergence speed, we compare with different algorithms for optimizing the same objective (1): the stochastic primal-dual algorithms PG-SMD2 (Rafique et al., 2021) for a non-convex loss and SPD (Namkoong & Duchi, 2016) for a convex loss, as well as Dual SGM (Levy et al., 2020) and the mini-batch SGD method FastDRO (Levy et al., 2020) for both convex and non-convex losses. For the comparison of generalization performance, we compare with different methods optimizing different objectives: traditional ERM with the CE loss solved by SGD with momentum (SGDM), KL-regularized DRO solved by RECOVER (Qi et al., 2021), and CVaR-constrained and χ^2-regularized/-constrained DRO optimized by FastDRO.

Datasets. We conduct experiments on four imbalanced datasets, namely CIFAR10-ST, CIFAR100-ST (Qi et al., 2020b), ImageNet-LT (Liu et al., 2019), and iNaturalist2018.

Models. For the non-convex setting (deep models), we learn ResNet20 for CIFAR10-ST and CIFAR100-ST, and ResNet50 for ImageNet-LT and iNaturalist2018. On CIFAR10-ST and CIFAR100-ST, we optimize the networks from scratch with the different algorithms. For the large-scale ImageNet-LT and iNaturalist2018 datasets, we optimize the last block of the feature layers and the classifier weights of a pretrained ResNet50 model, with the other layers frozen. This is a common training strategy in the literature (Kang et al., 2019; Qi et al., 2020a). For the convex setting (linear model), we freeze the feature layers of the pretrained models and only fine-tune the last classifier weights.
The pretrained models for ImageNet-LT, CIFAR10-ST and CIFAR100-ST are trained from scratch by optimizing the standard cross-entropy (CE) loss using SGD with momentum 0.9 for 90 epochs. The pretrained ResNet50 model for iNaturalist2018 is the released model of Kang et al. (2019).

Parameters and Settings. For all experiments, the batch size is 128 for CIFAR10-ST and CIFAR100-ST, and 512 for ImageNet-LT and iNaturalist2018. The loss function is the CE loss. λ_0 is set to 1e-3. The (primal) learning rates for all methods are tuned in {0.01, 0.05, 0.1, 0.5, 1}. The learning rate for updating the dual variable in PG-SMD2 and SPD is tuned in {1e-5, 5e-5, 1e-4, 5e-4}. The momentum parameter β in our proposed algorithms and RECOVER is tuned in {0.1, 0.2, ..., 0.9}. For RECOVER, the hyper-parameter λ is tuned in {1, 50, 100}. The constraint parameter ρ is tuned in {0.1, 0.5, 1} for the comparison of generalization performance unless specified otherwise. The initial λ and the Lagrange multiplier in Dual SGM are both tuned in {0.1, 1, 10}.

Convergence comparison between different baselines. In the convex setting, we compare RSCDRO and RASCDRO with the SPD, FastDRO and Dual SGM baselines. We report the training and testing accuracy in terms of the number (#) of processed samples, denoting one pass over the training data as one epoch. We run a total of 3 epochs for CIFAR10-ST and CIFAR100-ST and decay the learning rate by a factor of 10 at the end of the 2nd epoch. Similarly, we run 60 epochs and decay the learning rate at the 30th epoch for ImageNet-LT, and run 30 epochs and decay the learning rate at the 20th epoch for iNaturalist2018. In the non-convex setting, we compare SCDRO with two baselines, PG-SMD2 and FastDRO. We run 120 epochs for CIFAR10-ST and CIFAR100-ST, decaying the learning rate by a factor of 10 at the 90th epoch, and 30 epochs for ImageNet-LT and iNaturalist2018, decaying the learning rate at the 20th epoch. Results.
We first report the results for the convex setting in Figures 1 and 3. RSCDRO and RASCDRO are consistently better than the baselines on CIFAR10-ST, CIFAR100-ST and ImageNet-LT. SPD and Dual SGM achieve results comparable to our proposed algorithms on iNaturalist2018 in terms of training accuracy, but are worse in terms of testing accuracy. FastDRO has the worst performance on all datasets. RSCDRO and RASCDRO achieve comparable results on all datasets; however, the stochastic estimator in RASCDRO requires two gradient computations per iteration, which incurs more computational cost than RSCDRO. Hence, in the non-convex setting, we focus on SCDRO. Figures 2 and 4 report the results for the non-convex setting. SCDRO achieves the best performance on all datasets, and the margin increases on the large-scale ImageNet-LT and iNaturalist2018 datasets. Among the three baselines, Dual SGM has better testing performance than FastDRO and PG-SMD2 on CIFAR10-ST and CIFAR100-ST; on the large-scale ImageNet-LT and iNaturalist2018, however, Dual SGM has the worst testing accuracy. Furthermore, SCDRO is more stable than FastDRO and Dual SGM across settings, as the training of Dual SGM and FastDRO is comparable to SCDRO in convex settings but much worse in non-convex settings.

Comparison with ERM and KL-regularized DRO. Next, we compare our method for solving KL-constrained DRO (KL-CDRO) with 1) ERM+SGDM and KL-regularized DRO (KL-RDRO) optimized by RECOVER in the non-convex setting, and 2) CVaR-constrained, χ^2-regularized and χ^2-constrained DRO optimized by FastDRO in the convex setting. We conduct the experiments on the large-scale ImageNet-LT and iNaturalist2018 datasets. The results shown in Tables 2 and 3 demonstrate that our method for constrained DRO outperforms the ERM-based method and other popular f-divergence constrained/regularized DRO objectives in the different settings.

7. CONCLUSIONS

In this paper, we proposed dual-free stochastic algorithms for solving KL-constrained distributionally robust optimization problems for both convex and non-convex losses. The proposed algorithms have nearly optimal complexity in both settings. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems.

A MORE RELATED WORK

Wang et al. (2021) study the Sinkhorn distance constraint, a variant of the Wasserstein distance based on entropic regularization. They propose an efficient batch gradient descent method with bisection search that obtains a near-optimal solution with an arbitrarily small sub-optimality gap. However, no non-asymptotic convergence results are established in their paper. Duchi & Namkoong (2021) developed a convex DRO framework with f-divergence constraints to improve model robustness. They established finite-sample minimax upper and lower bounds and a non-asymptotic convergence rate of O(1/√n), and provided empirical studies on real distribution-shift tasks using an existing interior point solver (Udell et al., 2014) and gradient descent with backtracking Armijo line searches (Boyd et al., 2004). However, they provide no stochastic algorithms with non-asymptotic convergence rates that directly optimize the considered constrained DRO.

Compositional Functions and DRO. The connection between compositional functions and DRO formulations has been observed and leveraged in the literature. Dentcheva et al. (2017) studied the statistical estimation of compositional functionals with applications to estimating conditional value-at-risk measures, which is closely related to CVaR-constrained DRO. However, they do not consider stochastic optimization algorithms. To the best of our knowledge, Qi et al. (2021) were the first to use stochastic compositional optimization algorithms to solve KL-regularized DRO problems.
Our work is different in that we solve KL-constrained DRO problems, which is more challenging than KL-regularized DRO problems. The benefits of using compositional optimization for solving DRO include (i) we do not need to maintain and update a high dimensional dual variable as in the primal-dual methods (Rafique et al., 2021) ; (ii) we do not need to worry about the batch size as in MLMC-based stochastic methods (Levy et al., 2020; Hu et al., 2021) .
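The payoff of the compositional view is the equivalent dual-free objective derived later (Lemma 1 and Section G): $\min_{w,\lambda\geq\lambda_0} \lambda \log(\frac{1}{n}\sum_i \exp(\ell_i(w)/\lambda)) + \lambda(\rho-\rho_0)$, which involves no $n$-dimensional dual variable. Below is a minimal numeric sketch of this objective on made-up losses; the function name, constants, and data are our own illustration, not the paper's implementation.

```python
import numpy as np

def dro_objective(losses, lam, rho, rho0=0.0):
    """KL-constrained DRO dual objective for fixed per-sample losses:
    F = lam * log( (1/n) * sum_i exp(l_i / lam) ) + lam * (rho - rho0)."""
    # log-mean-exp computed in a numerically stable way
    m = losses.max() / lam
    lme = m + np.log(np.mean(np.exp(losses / lam - m)))
    return lam * lme + lam * (rho - rho0)

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 2.0, size=1000)

# Small lam upweights the worst losses; as lam grows, the objective
# (minus the lam*(rho-rho0) offset) approaches the mean loss (ERM).
f_small = dro_objective(losses, lam=0.1, rho=0.5)
f_large = dro_objective(losses, lam=100.0, rho=0.5) - 100.0 * 0.5
print(f_small, f_large, losses.mean())
```

This also makes concrete why only a scalar $\lambda$ (rather than the $n$-dimensional $p$) needs to be maintained alongside $w$.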

B MORE EXPERIMENTAL RESULTS

GPU Setting: All our results are produced on Tesla V100 GPUs. Testing convergence curves are presented in Figures 3 and 4 for the convex and non-convex settings, respectively.

Sensitivity to ρ. We study the sensitivity of different methods to ρ. The results on CIFAR10-ST and CIFAR100-ST are shown in Table 4 in the supplement, which demonstrates that the testing performance is sensitive to ρ. However, our method SCDRO is better than the baselines PG-SMD2 and FastDRO for all considered values of ρ.

C PRELIMINARY LEMMAS

Lemma 6. For $q \geq 1$, $f_\lambda(q) = \lambda \log(q) + \lambda\rho$ is $L_{f_\lambda}$-Lipschitz continuous and $L_{\nabla f_\lambda}$-smooth, where $L_{\nabla f_\lambda} = L_{f_\lambda} = \lambda$.

Remark: $g_i(w, \lambda) = \exp(\frac{\ell_i(w)}{\lambda}) \geq 1$ since $\lambda \geq \lambda_0 \in \mathbb{R}_+$ and $\ell_i(w) \geq 0$ in problem (3). Thus $g(x) = \frac{1}{n}\sum_{i=1}^n g_i(w, \lambda) \geq 1$. Then by this lemma we have $\|\nabla f_\lambda(g(x))\| \leq \lambda$ and $\|\nabla f_\lambda(g(x_1)) - \nabla f_\lambda(g(x_2))\| \leq \lambda \|g(x_1) - g(x_2)\|$ for $x, x_1, x_2 \in \mathcal{X}$.

Proof. For any $q \geq 1$, we have $\nabla f_\lambda(q) = \frac{\lambda}{q} \leq \lambda$. And for any $q_1, q_2 \geq 1$, we have
$$\|\nabla f_\lambda(q_1) - \nabla f_\lambda(q_2)\| = \left\|\frac{\lambda}{q_1} - \frac{\lambda}{q_2}\right\| = \frac{\lambda\|q_1 - q_2\|}{q_1 q_2} \leq \lambda \|q_1 - q_2\|.$$
This completes the proof.

Lemma 7. Let $L_A = \exp(\frac{C}{\lambda_0})(\frac{G^2}{\lambda_0^2} + \frac{L}{\lambda_0})$, $L_B = \exp(\frac{C}{\lambda_0})(\frac{CG}{\lambda_0^3} + \frac{G}{\lambda_0^2})$, $L_C = \exp(\frac{C}{\lambda_0})(\frac{CG + \lambda_0 G}{\lambda_0^3})$ and $L_D = \exp(\frac{C}{\lambda_0})(\frac{C^2 + 2\lambda_0 C}{\lambda_0^4})$. Then $g_i(w, \lambda)$ is $L_g$-Lipschitz continuous and $L_{\nabla g}$-smooth in terms of $(w, \lambda)$, where $L_g = \exp(\frac{C}{\lambda_0})(\frac{G}{\lambda_0} + \frac{C}{\lambda_0^2})$ and $L_{\nabla g} = \sqrt{L_A^2 + L_B^2 + L_C^2 + L_D^2}$.

Proof. The gradient of $g_i(w, \lambda)$ is given as
$$\nabla_{w,\lambda} g_i(w, \lambda)^\top = (\partial_w g_i(w, \lambda)^\top, \partial_\lambda g_i(w, \lambda)) = \left(\exp\left(\frac{\ell_i(w)}{\lambda}\right)\frac{\nabla_w \ell_i(w)^\top}{\lambda},\ -\exp\left(\frac{\ell_i(w)}{\lambda}\right)\frac{\ell_i(w)}{\lambda^2}\right).$$
Then by Assumption 1, we have
$$\|\nabla_{w,\lambda} g_i(w, \lambda)\| \leq \exp\left(\frac{\ell_i(w)}{\lambda}\right)\left(\frac{\|\nabla_w \ell_i(w)\|}{\lambda} + \frac{\ell_i(w)}{\lambda^2}\right) \overset{\lambda \geq \lambda_0}{\leq} \exp\left(\frac{C}{\lambda_0}\right)\left(\frac{G}{\lambda_0} + \frac{C}{\lambda_0^2}\right).$$
Thus, $L_g = \exp(\frac{C}{\lambda_0})(\frac{G}{\lambda_0} + \frac{C}{\lambda_0^2})$.
For for all (w, λ), (w ′ , λ ′ ) ∈ X , we have ∥∇ w,λ g i (w, λ) -∇ w,λ g i (w ′ , λ ′ )∥ 2 ≤ exp ℓ i (w) λ ∇ w ℓ i (w) λ + exp ℓ i (w ′ ) λ ′ ∇ w ℓ i (w ′ ) λ ′ 2 + exp ℓ i (w) λ ℓ i (w) λ 2 -exp ℓ i (w ′ ) λ ′ ℓ i (w ′ ) λ ′2 2 ≤ exp ℓ i (w) λ ∇ w ℓ i (w) λ -exp ℓ i (w ′ ) λ ∇ w ℓ i (w ′ ) λ 2 + exp ℓ i (w ′ ) λ ∇ w ℓ i (w ′ ) λ -exp ℓ i (w ′ ) λ ′ ∇ w ℓ i (w ′ ) λ ′ 2 + exp ℓ i (w) λ ℓ i (w) λ 2 -exp ℓ i (w ′ ) λ ℓ i (w ′ ) λ 2 2 + exp ℓ i (w ′ ) λ ℓ i (w ′ ) λ 2 -exp ℓ i (w ′ ) λ ′ ℓ i (w ′ ) λ ′2 2 . To bound the first term, we first check the Lipschitz continuous of exp( ℓi(w) λ ) ∇wℓi(w) λ with respect to w, ∂ exp ℓi(w) λ ∇wℓi(w) λ ∂w ≤ exp ℓ i (w) λ ∇ w ℓ i (w) λ ∇ w ℓ i (w) λ ⊤ + exp ℓ i (w) λ ∇ 2 w ℓ i (w) λ (a) = exp ℓ i (w) λ ∇ w ℓ i (w) λ ⊤ ∇ w ℓ i (w) λ + exp ℓ i (w) λ ∇ 2 w ℓ i (w) λ (b) ≤ exp ℓ i (w) λ ∇ w ℓ i (w) λ 2 + exp ℓ i (w) λ ∇ 2 w ℓ i (w) λ ≤ exp C λ 0 G 2 λ 2 0 + L λ 0 := L A . where equality (a) is due to the property of the norm of rank-one symmetric matrix and inequality (b) is due to Cauchy-Schwarz inequality. Therefore, we have exp( ℓ i (w) λ ) ∇ w ℓ i (w) λ -exp( ℓ i (w ′ ) λ ) ∇ w ℓ i (w ′ ) λ 2 ≤ L A ∥w -w ′ ∥ 2 Furthermore, it holds that ∂ exp ℓi(w) λ ∇wℓi(w) λ ∂λ = exp ℓ i (w) λ ℓ i (w)∇ w ℓ i (w) λ 3 + exp ℓ i (w) λ ∇ w ℓ i (w) λ 2 ≤ exp C λ 0 CG λ 3 0 + G λ 2 0 := L B ∂ exp ℓi(w) λ ℓi(w) λ 2 ∂w = exp ℓ i (w) λ ℓ i (w)∇ w ℓ i (w) λ 3 + exp ℓ i (w) λ ∇ w ℓ i (w) λ 2 ≤ exp C λ 0 CG + λ 0 G λ 3 0 := L C ∂ exp ℓi(w) λ ℓi(w) λ 2 ∂λ = exp ℓ i (w) λ ℓ 2 i (w) λ 4 + exp ℓ i (w) λ 2ℓ i (w) λ 3 ≤ exp C λ 0 C 2 + 2λ 0 C λ 4 0 := L D . As a result, we obtain ∥∇ w,λ g i (w, λ) -∇ w,λ g i (w ′ , λ ′ )∥ 2 ≤ L 2 A ∥w -w ′ ∥ 2 + L 2 B ∥λ -λ ′ ∥ 2 + L 2 C ∥w -w ′ ∥ 2 + L 2 D ∥λ -λ ′ ∥ 2 = (L 2 A + L 2 C ) ∥w -w ′ ∥ 2 + (L 2 B + L 2 D ) ∥λ -λ ′ ∥ 2 ≤ (L 2 A + L 2 B + L 2 C + L 2 D ) (w ⊤ , λ) -(w ′⊤ , λ ′ ) 2 . Thus L ∇g = L 2 A + L 2 B + L 2 C + L 2 D . Lemma 8. F (w, λ) is L F -smooth, where L F = λL 2 g + 2L g + λL ∇g + 1 + λ. 
Remark: Lemmas 6, 7 and 8 imply that $L_{\nabla f_\lambda} = L_{f_\lambda} \leq L_F$, $L_g \leq L_F$ and $L_F \geq 1$.

Proof. For all $x_1 = (w_1^\top, \lambda_1)^\top, x_2 = (w_2^\top, \lambda_2)^\top \in \mathcal{X}$, let $d(x) = (0, \cdots, 0, \log(g(x)) + \rho)^\top \in \mathbb{R}^{d+1}$. By expansion we have
$$\begin{aligned} \|\nabla F(x_1) - \nabla F(x_2)\| &= \|\nabla f_{\lambda_1}(g(x_1))\nabla g(x_1) + d(x_1) - \nabla f_{\lambda_2}(g(x_2))\nabla g(x_2) - d(x_2)\| \\ &\leq \|\nabla f_{\lambda_1}(g(x_1))\nabla g(x_1) - \nabla f_{\lambda_2}(g(x_2))\nabla g(x_2)\| + |\log(g(x_1)) - \log(g(x_2))| \\ &\leq \|\nabla f_{\lambda_1}(g(x_1))\nabla g(x_1) - \nabla f_{\lambda_1}(g(x_2))\nabla g(x_1)\| + \|\nabla f_{\lambda_1}(g(x_2))\nabla g(x_1) - \nabla f_{\lambda_2}(g(x_2))\nabla g(x_1)\| \\ &\quad + \|\nabla f_{\lambda_2}(g(x_2))\nabla g(x_1) - \nabla f_{\lambda_2}(g(x_2))\nabla g(x_2)\| + |g(x_1) - g(x_2)|. \end{aligned}$$
Noting the Lipschitz continuity of $g(x)$ and $\nabla g(x)$, we obtain
$$\begin{aligned} \|\nabla F(x_1) - \nabla F(x_2)\| &\leq (L_{\nabla f_{\lambda_1}} L_g + 1)|g(x_1) - g(x_2)| + \frac{\|\nabla g(x_1)\|}{g(x_2)}\|\lambda_1 - \lambda_2\| + L_{f_{\lambda_2}}\|\nabla g(x_1) - \nabla g(x_2)\| \\ &\overset{(a)}{\leq} (L_{\nabla f_{\lambda_1}} L_g^2 + L_g)\|x_1 - x_2\| + \|\nabla g(x_1)\|\|\lambda_1 - \lambda_2\| + L_{f_{\lambda_2}} L_{\nabla g}\|x_1 - x_2\| \\ &\leq (L_{\nabla f_{\lambda_1}} L_g^2 + 2L_g + L_{f_{\lambda_2}} L_{\nabla g})\|x_1 - x_2\| \\ &\overset{(b)}{\leq} (\lambda L_g^2 + 2L_g + \lambda L_{\nabla g} + 1 + \lambda)\|x_1 - x_2\|, \end{aligned}$$
where inequality (a) is due to $g(x_2) \geq 1$ and inequality (b) is due to the upper bound of $\lambda$. Thus, $L_F = \lambda L_g^2 + 2L_g + \lambda L_{\nabla g} + 1 + \lambda$.
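The elementary bounds in Lemma 6, which the smoothness constants above build on, can be sanity-checked numerically. A quick sketch on a sampled grid, with $\lambda$, $\rho$, and the grid chosen arbitrarily by us:

```python
import numpy as np

lam, rho = 0.5, 1.0
grad = lambda q: lam / q          # d/dq [ lam*log(q) + lam*rho ]

qs = np.linspace(1.0, 50.0, 2000)
# Lipschitz constant of f_lam on q >= 1: sup |grad| = lam, attained at q = 1
assert np.all(np.abs(grad(qs)) <= lam + 1e-12)

# Smoothness: |grad(q1) - grad(q2)| = lam*|q1-q2|/(q1*q2) <= lam*|q1-q2| for q1, q2 >= 1
q1, q2 = qs[:-1], qs[1:]
assert np.all(np.abs(grad(q1) - grad(q2)) <= lam * np.abs(q1 - q2) + 1e-12)
print("Lemma 6 bounds hold on the sampled grid")
```

Both bounds rely crucially on $q \geq 1$, which is exactly what the remark after Lemma 6 guarantees via $g(x) \geq 1$.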

C.1 PROOF OF LEMMA 1

Proof. Recall the primal problem:
$$p^* = \max_{\{p \in \Delta_n:\, D(p, 1/n) \leq \rho\}} \sum_{i=1}^n p_i \ell_i(w) - \lambda_0 D(p, 1/n).$$
Introducing the dual variable $\tilde{\lambda}$, we obtain the dual problem:
$$q^* = \min_{\tilde{\lambda} \geq 0} \max_{p \in \Delta_n} \sum_{i=1}^n p_i \ell_i(w) - \tilde{\lambda}(D(p, 1/n) - \rho) - \lambda_0 D(p, 1/n).$$
Set $\bar{p} = (1/n, \ldots, 1/n)$, which is a Slater vector satisfying $D(\bar{p}, 1/n) - \rho < 0$. Applying Lemma 3 in (Nedić & Ozdaglar, 2009), we have
$$|\tilde{\lambda}^*| \leq \frac{1}{\rho}\left(q^* - \Big(\sum_{i=1}^n \bar{p}_i \ell_i(w) - \lambda_0 D(\bar{p}, 1/n)\Big)\right).$$
Since the primal problem is concave in terms of $p$ given $w$, we have $p^* = q^*$. Therefore,
$$|\tilde{\lambda}^*| \leq \frac{1}{\rho}\left(p^* - \sum_{i=1}^n \bar{p}_i \ell_i(w)\right) = \frac{1}{\rho}\left(\sum_{i=1}^n p_i^* \ell_i(w) - \lambda_0 D(p^*, 1/n) - \sum_{i=1}^n \bar{p}_i \ell_i(w)\right) \leq \frac{C}{\rho},$$
where the last inequality is because $|\ell_i(w)| \leq C$ for $w \in \mathcal{W}$. Let $\lambda = \tilde{\lambda} + \lambda_0$; then
$$q^* = \min_{\lambda \geq \lambda_0} \max_{p \in \Delta_n} \sum_{i=1}^n p_i \ell_i(w) - \lambda(D(p, 1/n) - \rho) - \lambda_0 \rho.$$
Section G will also show $q^* = \min_{\lambda \geq \lambda_0} \lambda \log\left(\frac{1}{n}\sum_{i=1}^n \exp\left(\frac{\ell_i(w)}{\lambda}\right)\right) + \lambda(\rho - \rho_0)$.

Lemma 9. Suppose $s_{t+1}$ is updated by Eq. (10). Then for every $t \in \{1, \cdots, T\}$ we have
$$\mathbb{E}[\|g(x_{t+1}) - s_{t+1}\|^2] \leq \mathbb{E}\left[(1 - \beta)\|g(x_t) - s_t\|^2 + \frac{2L_g^2 \|x_{t+1} - x_t\|^2}{\beta} + \beta^2 \sigma^2\right].$$
Taking the summation of $\mathbb{E}[\|g(x_{t+1}) - s_{t+1}\|^2]$ from $1$ to $T$, we have
$$\sum_{t=1}^T \mathbb{E}[\|g(x_t) - s_t\|^2] \leq \mathbb{E}\left[\frac{\|g(x_1) - s_1\|^2}{\beta}\right] + \frac{2L_g^2}{\beta^2}\sum_{t=1}^T \mathbb{E}[\|x_{t+1} - x_t\|^2] + \beta T \sigma^2.$$
Proof. Note that $s_{t+1} = (1 - \beta)s_t + \beta g_i(x_{t+1})$ and $\mathbb{E}[g(x_{t+1}) - g_i(x_{t+1})] = 0$. By simple expansion we have
$$\begin{aligned} \mathbb{E}[\|g(x_{t+1}) - s_{t+1}\|^2] &= \mathbb{E}[\|\beta(g(x_{t+1}) - g_i(x_{t+1})) + (1 - \beta)(g(x_{t+1}) - s_t)\|^2] \\ &= \mathbb{E}[\beta^2\|g(x_{t+1}) - g_i(x_{t+1})\|^2 + (1 - \beta)^2\|g(x_{t+1}) - s_t\|^2] + \underbrace{2\mathbb{E}[\langle g(x_{t+1}) - g_i(x_{t+1}), g(x_{t+1}) - s_t\rangle]}_{0} \\ &= \mathbb{E}[\beta^2\|g(x_{t+1}) - g_i(x_{t+1})\|^2 + (1 - \beta)^2\|g(x_{t+1}) - g(x_t) + g(x_t) - s_t\|^2]. \end{aligned}$$
Invoking Lemma 7 and recalling Assumption 2, we obtain
$$\begin{aligned} \mathbb{E}[\|g(x_{t+1}) - s_{t+1}\|^2] &\overset{(a)}{\leq} \mathbb{E}\left[\beta^2\|g(x_{t+1}) - g_i(x_{t+1})\|^2 + (1 - \beta)^2(1 + \beta)\|g(x_t) - s_t\|^2 + \left(1 + \frac{1}{\beta}\right)(1 - \beta)^2\|g(x_{t+1}) - g(x_t)\|^2\right] \\ &\overset{(b)}{\leq} \mathbb{E}\left[\beta^2\|g(x_{t+1}) - g_i(x_{t+1})\|^2 + (1 - \beta)\|g(x_t) - s_t\|^2 + \frac{2L_g^2\|x_{t+1} - x_t\|^2}{\beta}\right] \\ &\overset{(c)}{\leq} \mathbb{E}\left[(1 - \beta)\|g(x_t) - s_t\|^2 + \frac{2L_g^2\|x_{t+1} - x_t\|^2}{\beta} + \beta^2\sigma^2\right], \end{aligned}$$
where the inequality (a) is due to (a + b) 2 ≤ (1 + β)a 2 + (1 + 1 β )b 2 , the inequality (b) is because of (1 -β) 2 ≤ 1, (1 + 1 β ) ≤ 2 β and the Lemma 7 and the inequality (c) is from Assumption 2. Lemma 10. Under Assumption 1, run Algorithm 1 with ηL F ≤ 1/4, and then the output x R of Algorithm 1 satisfies E R [dist(0, ∂ F (x R )) 2 ] ≤ 2 + 40L F η T T t=1 ∥z t -∇F (x t )∥ 2 + 2∆ ηT + 40L F ∆ T . ( ) Proof. The proof of this lemma follow the proof of Theorem 2 in (Xu et al., 2019) . Recall the update of x t+1 is x t+1 = Π X (x t -ηz t ) = arg min x∈R d+1 {δ X (x) + ⟨z t , x -x t ⟩ + 1 2η ∥x -x t ∥ 2 }. then by Exercise 8.8 and Theorem 10.1 of (Rockafellar & Wets, 1998) we know -z t - 1 η (x t+1 -x t ) ∈ ∂δ X (x t+1 ) , which implies that ∇F (x t+1 ) -z t - 1 η (x t+1 -x t ) ∈ ∇F (x t+1 ) + ∂δ X (x t+1 ) = ∂ F (x t+1 ) . ( ) By the update of x t+1 , we also have, δ X (x t+1 ) + ⟨z t , x t+1 -x t ⟩ + 1 2η ∥x t+1 -x t ∥ 2 ≤ δ X (x t ). Since F (x) is smooth with parameter L F , then F (x t+1 ) ≤ F (x t ) + ⟨∇F (x t ), x t+1 -x t ⟩ + L F 2 ∥x t+1 -x t ∥ 2 . Combing the above two inequalities, we get ⟨z t -∇F (x t ), x t+1 -x t ⟩ + 1 2 (1/η -L)∥x t+1 -x t ∥ 2 ≤ F (x t ) -F (x t+1 ). That is 1 2 (1/η -L F )∥x t+1 -x t ∥ 2 ≤ F (x t ) -F (x t+1 ) -⟨z t -∇F (x t ), x t+1 -x t ⟩ ≤ F (x t ) -F (x t+1 ) + η∥z t -∇F (x t )∥ 2 + 1 4η ∥x t -x t+1 ∥ 2 , where the last inequality uses Young's inequality ⟨a, b⟩ ≤ ∥a∥ 2 + ∥b∥ 2 4 . Then by rearranging the above inequality and summing it across t = 1, • • • , T , we have T t=1 1 -2ηL F 4η ∥x t+1 -x t ∥ 2 ≤ F (x 1 ) -F (x T +1 ) + T t=1 η∥z t -∇F (x t )∥ 2 ≤ F (x 1 ) -inf x∈X F (x) + T t=1 η∥z t -∇F (x t )∥ 2 ≤ ∆ + T t=1 η∥z t -∇F (x t )∥ 2 . ( ) By the same method used in the proof of Theorem 2 in Xu et al. (2019) , we have the following inequality, T t=1 ∥z t -∇F (x t+1 ) + 1 η (x t+1 -x t )∥ 2 ≤ 2 T t=1 ∥z t -∇F (x t )∥ 2 + 2∆ η + (2L 2 F + 3L F η ) T t=1 ∥x t+1 -x t ∥ 2 . ( ) Recalling ηL F ≤ 1 4 and combining Eq. 
( 15) and Eq. ( 16), we obtain T t=1 ∥z t -∇F (x t+1 ) + 1 η (x t+1 -x t )∥ 2 (a) ≤ 2 T t=1 ∥z t -∇F (x t )∥ 2 + 2∆ η + 5L F η 1 1/4 -η 1 L F /2 η 1 ∆ + η 1 T t=1 η t ∥z t -∇F (x t )∥ 2 (b) ≤ 2 T t=1 ∥z t -∇F (x t )∥ 2 + 2∆ η + 40L F ∆ + 40ηL F T t=1 ∥z t -∇F (x t )∥ 2 . ( ) where inequality (a) is due to (2L 2 F + 3L F η ) ≤ 5L F η and inequality (b) is due to 1 1/4-ηL F /2 ≤ 8. Recalling Eq. ( 14) and the output rule of Algorithm 1, we have E R [dist(0, ∂ F (x R )) 2 ] ≤ 1 T T t=1 ∥z t -∇F (x t+1 ) + 1 η (x t+1 -x t )∥ 2 . ( ) Then by combining Eqs. (17,18) together we have the Lemma. Lemma 11. Under Assumption 1, 2, run Algorithm 1 with η ≤ β 4L F √ 4+20L 2 g ≤ 1 4L F , and then we have 1 T T t=1 E[∥z t -∇F (x t )∥ 2 ] ≤ 2E[∥z 1 -∇F (x 1 )∥ 2 ] βT + ∆ ηT + 20L F E[∥g(x 1 ) -s 1 ∥ 2 ] βT + 24βL 2 F σ 2 . Proof. To facilitate our proof statement, we define the following notations: ∇F (x t ) ⊤ = (∂ w F (x t ) ⊤ , ∂ λ F (x t )) = (∇f λt (g(x t ))∂ w g(x t ) ⊤ , ∇f λt (g(x t ))∂ λ g(x t ) + log(g(x t )) + ρ) ∇F (x t ) ⊤ = (∇f λt (g(x t ))∂ w g i (x t ) ⊤ , ∇f λt (g(x t ))∂ λ g i (x t ) + log(g(x t )) + ρ) G(x t ) ⊤ = (G wt (x t ) ⊤ , G λt (x t )) = (∇f λt (s t )∂ w g i (x t ) ⊤ , ∇f λt (s t )∂ λ g i (x t ) + log(s t ) + ρ). It is worth to notice that E[ ∇F (x t )] = ∇F (x t ). 
For every iteration t, by simple expansion we have I t = E[∥∇F (x t ) -z t ∥ 2 ] = E[∥∇F (x t ) -(1 -β)z t-1 -βG(x t )∥ 2 ] = E[∥(1 -β)(∇F (x t ) -∇F (x t-1 )) + (1 -β)∇F (x t-1 ) -(1 -β)z t-1 + β∇F (x t ) -βG(x t )∥ 2 ] = E[∥(1 -β) (∇F (x t ) -∇F (x t-1 ) A ) + (1 -β) (∇F (x t-1 ) -z t-1 ) B ∥ 2 ] + E[∥β( ∇F (x t ) -G(x t ) C ) + β (∇F (x t ) -∇F (x t )) D ∥ 2 ] = E[(1 -β) 2 ∥A∥ 2 + (1 -β) 2 ∥B∥ 2 + β 2 ∥C∥ 2 + β 2 ∥D∥ 2 + 2(1 -β)(1 -β)⟨A, B⟩ + 2β(1 -β)⟨A, C⟩ + 2β(1 -β)⟨A, D⟩ + 2(1 -β)β⟨B, C⟩ + 2(1 -β)β⟨B, D⟩ + 2β 2 ⟨C, D⟩] (a) = E[(1 -β) 2 ∥A∥ 2 + (1 -β) 2 ∥B∥ 2 + β 2 ∥C∥ 2 + β 2 ∥D∥ 2 + 2(1 -β) 2 ⟨A, B⟩ + 2(1 -β)β⟨C, B⟩ + 2β(1 -β)⟨A, C⟩ + 2β 2 ⟨C, D⟩], where the equality (a) is due to E⟨∇F (x t ) -∇F (x t-1 ), ∇F (x t ) -∇F (x t )⟩ = 0 and E⟨z t-1 -∇F (x t-1 ), ∇F (x t ) -∇F (x t )⟩ = 0. By Young's inequality, we have (1 -β) 2 ⟨A, B⟩ ≤ (1 -β)⟨A, B⟩ ≤ 2 β ∥A∥ 2 + (1-β) 2 β 8 ∥B∥ 2 , 2β(1 -β)⟨C, B⟩ ≤ (1-β) 2 β 2 ∥B∥ 2 + 2β∥C∥ 2 , 2β(1 -β)⟨A, C⟩ ≤ (1 -β) 2 ∥A∥ 2 + β 2 ∥C∥ 2 and 2β 2 ⟨C, D⟩ ≤ β 2 ∥C∥ 2 + β 2 ∥D∥ 2 . Therefore, noting (1 -β) < 1 and 1/β > 1, we can obtain I t ≤ E[(1 -β) 2 ∥A∥ 2 + (1 -β) 2 ∥B∥ 2 + β 2 ∥C∥ 2 + β 2 ∥D∥ 2 + 2 β ∥A∥ 2 + (1 -β) 2 β 2 ∥B∥ 2 + 2β∥C∥ 2 + (1 -β) 2 β 2 ∥B∥ 2 + (1 -β) 2 ∥A∥ 2 + β 2 ∥C∥ 2 + β 2 ∥C∥ 2 + β 2 ∥D∥ 2 ] ≤ E[(1 -β)∥B∥ 2 + 4 β ∥A∥ 2 + 5β∥C∥ 2 + 2β 2 ∥D∥ 2 ]. 
Thus recalling the defintion of G(x t ), ∇F (x t ), ∇F (x t ) and applying the smoothness and Lipschitz continuity of f λ and g, we have C = ∥ ∇F (x t ) -G(x t )∥ 2 = ∥∇f λt (g(x t ))∂ w g i (x t ) -∇f λt (s t )∂ wt g i (x t )∥ 2 + ∥∇f λt (g(x t ))∂ λ g i (x t ) + log(g(x t )) -∇f λt (s t )∂ λ g i (x t ) -log(s t )∥ 2 ≤ ∥∇f λt (g(x t ))∂ w g i (x t ) -∇f λt (s t )∂ wt g i (x t )∥ 2 + 2∥∇f λt (g(x t ))∂ λ g i (x t ) -∇f λt (s t )∂ λ g i (x t )∥ 2 + 2∥ log(g(x t )) -log(s t )∥ 2 (a) ≤ 2L 2 g L 2 ∇f λ t ∥s t -g(x t )∥ 2 + 2∥s t -g(x t )∥ 2 (b) ≤ 2L 2 F ∥s t -g(x t )∥ 2 , ( ) where the inequality (a) is due to | log(g(x t )) -log(s t )| ≤ |s t -g(x t )| since g(x t ) ≥ 1, s t ≥ 1 for all t = {1, • • • , T } by the definition and initialzation of g i (x t ), s t , and the inequality (b) is due to L 2 g L 2 ∇f λ t + 1 ≤ L 2 F . And by the similar method, we also have D = ∥∇F (x t ) -∇F (x t )∥ 2 = ∥∇f λt (g(x t ))∂ w g(x t ) -∇f λt (g(x t ))∂ w g i (x t )∥ 2 + ∥∇f λt (g(x t ))∂ λ g(x t ) + log(g(x t )) + ρ -∇f λt (g(x t ))∂ λ g i (x t ) -log(g(x t )) -ρ∥ 2 = ∥∇f λt (g(x t ))∂ w g(x t ) -∇f λt (g(x t ))∂ w g i (x t )∥ 2 + ∥∇f λt (g(x t ))∂ λ g(x t ) -∇f λt (g(x t ))∂ λ g i (x t )∥ 2 ≤ L 2 f λ t ∥∇g(x t ) -∇g i (x t )∥ 2 ≤ L 2 F ∥∇g(x t ) -∇g i (x t )∥ 2 . (21) Thus combining the Eqs. (19, 20, 21) and applying Assumption 2, we can obtain E[∥z t -∇F (x t )∥ 2 ] = E[(1 -β)∥z t-1 -∇F (x t-1 )∥ 2 + 4 β ∥∇F (x t ) -∇F (x t-1 )∥ 2 + 5β∥ ∇F (x t ) -G(x t )∥ 2 + 2β 2 ∥∇F (x t ) -∇F (x t )∥ 2 ] ≤ E[(1 -β)∥z t-1 -∇F (x t-1 )∥ 2 + 4 β L 2 F ∥x t -x t-1 ∥ 2 + 10L 2 F β∥g(x t ) -s t ∥ 2 ] + 2β 2 L 2 F σ 2 . 
Taking summation of E[∥z t+1 -∇F (x t+1 )∥ 2 ] from 1 to T and invoking Lemma 9, we have T t=1 E[∥z t -∇F (x t )∥ 2 ] ≤ E[∥∇F (x 1 ) -z 1 ∥ 2 ] β + 4L 2 F β 2 T t=1 E[∥x t+1 -x t ∥ 2 ] + 10L 2 F β T t=1 E[∥g(x t ) -s t ∥ 2 ] + 2β 2 L F σ 2 ≤ E[∥∇F (x 1 ) -z 1 ∥ 2 ] β + 4L 2 F β 2 T t=1 E[∥x t+1 -x t ∥ 2 ] + 10L 2 F E ∥g(x 1 ) -s 1 ∥ 2 β + 2L 2 g β 2 T t=1 ∥x t+1 -x t ∥ 2 + βT σ 2 + 2βL 2 F T σ 2 . Taking Eq. ( 15) into the above inequality, we have T t=1 E[∥z t -∇F (x t )∥ 2 ] ≤ E[∥∇F (x 1 ) -z 1 ∥ 2 ] β + ( 4L 2 F β 2 + 20L 2 F L 2 g β 2 ) η 1/4 -ηL F /2 ∆ + η T t=1 E[∥z t -∇F (x t )∥ 2 ] + 10L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] β + βT σ 2 + 2βL 2 F T σ 2 (a) ≤ E[∥∇F (x 1 ) -z 1 ∥ 2 ] β + ( 4L 2 F β 2 + 20L 2 F L 2 g β 2 ) 8η ∆ + η T t=1 E[∥z t -∇F (x t )∥ 2 ] + 10L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] β + βT σ 2 + 2βL 2 F T σ 2 (b) ≤ E[∥z 1 -∇F (x 1 )∥ 2 ] β + ∆ 2η + 1 2 T t=1 E[∥z t -∇F (x t )∥ 2 ] + 10L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] β + βT σ 2 + 2βL 2 F T σ 2 , where the inequality (a) is due to ηL F ≤ 1/4 and the inequality (b) is due to 8(4L 2 F + 20L 2 F L 2 g )η 2 ≤ β 2 2 . Rearranging terms and dividing T on both sides of Eq. ( 22), we compelte the proof.
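Lemma 9 is the workhorse behind Lemma 11: it bounds the error of the moving-average estimator $s_{t+1} = (1-\beta)s_t + \beta g_i(x_{t+1})$ by trading a $\beta^2\sigma^2$ noise term against a $\|x_{t+1}-x_t\|^2/\beta$ drift term. A toy simulation of this trade-off on a synthetic inner function of our own choosing (not the paper's $g_i$), comparing the moving average against a single fresh sample:

```python
import numpy as np

rng = np.random.default_rng(1)

def g_i(x):
    # Toy stochastic inner function: g(x) = E[g_i(x)] = x, with unit-variance noise
    return x + rng.normal(0.0, 1.0)

beta, T = 0.1, 5000
x, s = 0.0, g_i(0.0)                      # s_1 initialized with one stochastic sample
errs_ma, errs_single = [], []
for t in range(T):
    x += 0.001                            # slowly drifting iterate, as in the analysis
    s = (1 - beta) * s + beta * g_i(x)    # s_{t+1} = (1-beta)*s_t + beta*g_i(x_{t+1})
    errs_ma.append((s - x) ** 2)          # here g(x) = x, so this is ||s_t - g(x_t)||^2
    errs_single.append((g_i(x) - x) ** 2) # error of a single-sample estimate

print(np.mean(errs_ma), np.mean(errs_single))
```

The averaged estimator's mean squared error settles near the $O(\beta\sigma^2)$ level predicted by the recursion, far below the $\sigma^2$ error of a fresh single sample, provided the iterates move slowly.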

D.2 PROOF OF THEOREM 1

Proof . Since η = β 20L 2 F , L F ≥ 1 and L F ≤ L g , it holds that η ≤ β 4L F √ 4+20L 2 g ≤ 1 4L F which satisfy the assumptions of η in Lemma 10 and Lemma 11. Therefore, combining Lemma 10 and Lemma 11, we have E[dist(0, ∂ F (x R )) 2 ] ≤ 2 + 40L F η T T t=1 E[∥z t -∇F (x t )∥ 2 ] + 2∆ ηT + 40L F ∆ T ≤ 12 T T t=1 E[∥z t -∇F (x t )∥ 2 ] + 2∆ ηT + 20L F ∆ T ≤ 24E[∥z 1 -∇F (x 1 )∥ 2 ] βT + 12∆ ηT + 240L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] βT + 288L 2 F βσ 2 + 2∆ ηT + 20L 2 F ∆ T . (23) By the definition of s 1 and Assumption 2, it holds that E[∥s 1 -g(x 1 )∥ 2 ] ≤ E[∥g i (x 1 ) -g(x 1 )∥ 2 ] ≤ σ 2 . (24) Since L 2 g L 2 ∇f λ 1 ≤ L 2 F and 2L 2 f λ 1 ≤ L 2 F , we have E[∥z 1 -∇F (x 1 )∥ 2 ] = ∥∇f λ1 (g i (x 1 ))∇g i (x 1 ) -∇f λ1 (g(x 1 ))∇g(x 1 )∥ 2 = ∥∇f λ1 (g i (x 1 )∇g i (x 1 )) -∇f λ1 (g(x 1 ))∇g i (x 1 ) + ∇f λ1 (g(x 1 ))∇g i (x 1 ) -∇f λ1 (g(x 1 )∇g(x 1 ))∥ 2 (a) ≤ 2∥∇f λ1 (g i (x 1 )) -∇f λ1 (g(x 1 ))∥ 2 ∥∇g i (x 1 )∥ 2 + 2∥∇f λ1 (g i (x 1 ))∥ 2 ∥∇g i (x 1 ) -∇g(x 1 )∥ 2 ≤ (2L 2 g L 2 ∇f λ 1 + 2L 2 f λ 1 )σ 2 ≤ 4L 2 F σ 2 , where the inequality (a) is due to ∥a + b∥ 2 ≤ 2∥a∥ 2 + 2∥b∥ 2 . Combining Eqs. (23, 24, 25) , we obtain E[dist(0, ∂ F (x R )) 2 ] ≤ 24E[∥z 1 -∇F (x 1 )∥ 2 ] βT + 12∆ ηT + 240L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] βT + 288L 2 F βσ 2 + 2∆ ηT + 20L 2 F ∆ T ≤ 96L 2 F σ 2 βT + 12∆ ηT + 240L 2 F σ 2 βT + 288L 2 F βσ 2 + 2∆ ηT + 20L 2 F ∆ T ≤ 96L 2 F σ 2 √ T + 240∆L 2 F √ T + 528L 2 F σ 2 √ T + 40∆L 2 F √ T + 20L 2 F ∆ T ≤ (624σ 2 + 280∆) L 2 F √ T + 20L 2 F ∆ T . This complete the proof.
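The inner maximization used in the proof of Lemma 1 (and restated via Section G) has the closed form $\max_{p\in\Delta_n} \sum_i p_i\ell_i - \lambda(D(p,1/n)-\rho) = \lambda\log(\frac{1}{n}\sum_i \exp(\ell_i/\lambda)) + \lambda\rho$, attained at the Gibbs weights $p_i \propto \exp(\ell_i/\lambda)$. A numeric sanity check with made-up losses (all constants here are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, rho = 50, 0.7, 0.5
l = rng.uniform(0.0, 2.0, size=n)

def inner(p):
    """sum_i p_i*l_i - lam * (D(p, 1/n) - rho), with D the KL divergence to uniform."""
    return p @ l - lam * (np.sum(p * np.log(p * n)) - rho)

# Closed-form maximizer over the simplex: Gibbs weights p_i proportional to exp(l_i/lam)
w = np.exp(l / lam)
p_star = w / w.sum()

lse = lam * np.log(np.mean(np.exp(l / lam))) + lam * rho   # claimed optimal value
assert abs(inner(p_star) - lse) < 1e-10

# Any other distribution on the simplex does no better
p_rand = rng.dirichlet(np.ones(n))
assert inner(p_rand) <= lse + 1e-10
print("log-sum-exp duality verified")
```

This identity is what lets the $n$-dimensional maximization over $p$ be collapsed into the scalar $\lambda$, making the dual-free algorithms possible.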

E PROOFS IN SECTION 4.2 E.1 TECHNICAL LEMMAS

Lemma 12. Let z t = ∇f λt (s t )q t + q λt , where q t = (v ⊤ t , u t ) ⊤ , q λt = (0 ⊤ , log(s t ) + ρ) ⊤ and 0 ∈ R d . Let ∥κ t ∥ 2 = ∥s t -g(x t )∥ 2 + ∥v t -∂ w g(x t )∥ 2 + |u t -∂ λ g(x t )| 2 . Under Assumption 1, run Algorithm 2, and then for every t ∈ {1, • • • T } we have ∥z t -∇F (x t )∥ 2 ≤ 4L 2 F ∥κ t ∥ 2 . Proof. By simple expansion, it holds that ∥z t -∇F (x t )∥ 2 = ∥∇f λt (g(x t ))∂ w g(x t ) -∇f λt (s t )v t ∥ 2 + ∥∇f λt (g(x t ))∂ λ g(x t ) -∇f λt (s t )v t + log(g(x t )) -log(s t )∥ 2 (a) ≤ 2∥∇f λt (g(x t ))∂ w g(x t ) -∇f λt (s t )v t ∥ 2 + 2∥∇f λt (g(x t ))∂ λ g(x t ) -∇f λt (s t )u t ∥ 2 + 2∥g(x t ) -s t ∥ 2 = 2∥∇f λt (g(x t ))∇g(x t ) -∇f λt (s t )q t ∥ 2 + 2∥g(x t ) -s t ∥ 2 , where the inequality (a) is because ∥a + b∥ 2 ≤ 2∥a∥ 2 + 2∥b∥ 2 , and | log(x) -log(y)| ≤ |x -y| for all x, y ≥ 1. Applying the smoothness and Lipschitz continuity of f λ and g, we obtain ∥∇f λt (g(x t ))∇g(x t ) -∇f λt (s t )q t ∥ 2 = ∥∇f λt (g(x t ))∇g(x t ) -∇f λt (s t )∇g(x t ) + ∇f λt (s t )∇g(x t ) -∇f λt (s t )q t ∥ 2 ≤ 2∥∇f λt (g(x t ))∇g(x t ) -∇f λt (s t )∇g(x t )∥ 2 + 2∥∇f λt (s t )∇g(x t ) -∇f λt (s t )q t ∥ 2 ≤ 2L 2 g L 2 ∇f λ t ∥s t -g(x t )∥ 2 + 2L f λ t ∥q t -∇g(x t )∥ 2 + 2∥g(x t ) -s t ∥ 2 . ( ) Noting ∥q t -∇g(x t )∥ 2 ] = ∥v t -∂ w g(x t )∥ 2 + |u t -∂ λ g(x t )| 2 and combining Eqs. (26, 27), we have ∥z t -∇F (x t )∥ 2 ≤ (4L 2 g L 2 ∇f λ t + 2)∥s t -g(x t )∥ 2 + 4L 2 f λ t ∥q t -∇g(x t )∥ 2 ≤ 4L 2 F ∥s t -g(x t )∥ 2 + 4L 2 F ∥q t -∇g(x t )∥ 2 = 4L 2 F (∥s t -g(x t )∥ 2 + ∥v t -∂ w g(x t )∥ 2 + |u t -∂ λ g(x t )| 2 ). This complete the proof. Lemma 13. Under Assumption 1, 2, run Algorithm 2, and then for every t ∈ {1, • • • T } we have E[∥κ t+1 ∥ 2 ] ≤ (1 -β t ) 2 E[∥κ t ∥ 2 ] + 8(1 -β t ) 2 L 2 F E[∥x t+1 -x t ∥ 2 ] + 6β 2 t σ 2 . Proof. 
Since s t+1 = (g i (x t+1 ) + (1 -β)(s t -g i (x t )), it holds that E[∥s t+1 -g(x t+1 )∥ 2 ] = E[∥g i (x t+1 ) + (1 -β t )(s t -g i (x t )) -g(x t+1 )∥ 2 ] ≤ E[∥(1 -β t )(s t -g(x t )) + β t (g i (x t+1 ) -g(x t+1 )) + (1 -β t )(g i (x t+1 ) -g i (x t ) -(g(x t+1 ) -g(x t )))∥ 2 ] = E[(1 -β t ) 2 ∥s t -g(x t )∥ 2 ] + E[∥β t (g i (x t+1 ) -g(x t+1 )) ( 28) + (1 -β t )(g i (x t+1 ) -g i (x t ) -(g(x t+1 ) -g(x t )))∥ 2 ], where the last inequality is due to E[g i (x t+1 ) -g(x t+1 )] = 0. Noting E[⟨g i (x t+1 ) -g i (x t+1 ), g(x t+1 ) -g(x t )⟩] = E[∥(g(x t+1 ) -g(x t ))∥ 2 ] and applying the Lipschitz continuty of g i (x), we have E[∥g i (x t+1 ) -g i (x t+1 ) -(g(x t+1 ) -g(x t ))∥ 2 ] = E[∥(g i (x t+1 ) -g i (x t+1 )∥ 2 + ∥(g(x t+1 ) -g(x t ))∥ 2 -2 ⟨g i (x t+1 ) -g i (x t+1 ), g(x t+1 ) -g(x t )⟩] = E[∥(g i (x t+1 ) -g i (x t+1 )∥ 2 -∥(g(x t+1 ) -g(x t ))∥ 2 ] ≤ E[∥(g i (x t+1 ) -g i (x t+1 )∥ 2 ] ≤ L 2 g E[∥x t+1 -x t ∥ 2 ]. Combining Eqs. (28, 29) and invoking the Lipschitz continuty of g i (x), under Assumption 2, we have E[∥s t+1 -g(x t+1 )∥ 2 ] ≤ (1 -β t ) 2 E[∥s t -g(x t )∥ 2 ] + 2β 2 t E[∥g i (x t+1 ) -g(x t )∥ 2 ] + 2(1 -β t ) 2 E[∥g i (x t+1 ) -g i (x t+1 ) -(g(x t+1 ) -g(x t ))∥ 2 ] ≤ (1 -β t ) 2 E[∥s t -g(x t )∥ 2 ] + 2β 2 t σ 2 + 2(1 -β t ) 2 L 2 g E[∥x t+1 -x t ∥ 2 ]. (30) In the same way, we also have E[∥v t+1 -∂ w g(x t+1 )∥ 2 ] ≤ (1 -β t ) 2 E[∥v t -∂ w g(x t )∥ 2 ] + 2β 2 t σ 2 + 2(1 -β t ) 2 L 2 ∇g E[∥x t+1 -x t ∥ 2 ], E[|u t+1 -∂ λ g(x t+1 )| 2 ] ≤ (1 -β t ) 2 E[|u t -∂ λ g(x t )| 2 ] + 2β 2 t σ 2 + 2(1 -β t ) 2 L 2 ∇g E[∥x t+1 -x t ∥ 2 ]. (32) Therefore, combining Eqs. (30, 32, 31) , we obtain E[∥κ t+1 ∥ 2 ] ≤ (1 -β t ) 2 E[∥κ t ∥ 2 ] + 6β 2 t σ 2 + 4(1 -β t ) 2 (L 2 ∇g + L 2 g )∥x t+1 -x t ∥ 2 ) ≤ (1 -β t ) 2 E[∥κ t ∥ 2 ] + 8(1 -β t ) 2 L 2 F E[∥x t+1 -x t ∥ 2 ] + 6β 2 t σ 2 , where the last inequality applies (L 2 ∇g + L 2 g ) ≤ 2L 2 F . This complete the proof. Lemma 14. 
Under Assumption 1 and 2, for any α > 1, let k = ασ 2/3 L F , w = max(2σ 2 , (16L 2 F k) 3 ) and c = σ 2 14L F k 3 + 130L 4 F . Then with η t = k (w+tσ 2 ) 1/3 , β t = cη 2 t and after running T iterations, Algrithm 2 satisfies 4L 4 F T t=1 η t E[∥κ t ∥ 2 ] ≤ E[∥κ 1 ∥ 2 ] η 0 - E[∥κ T +1 ∥ 2 ] η T + T t=1 6c 2 η 3 t σ 2 + 64L 2 F ∆. Proof. Since w ≥ (16L 2 F k) 3 , it is easy to note that η t ≤ η 0 ≤ 1 16L 2 F ≤ 1 4L F . In addition, β t = cη 2 t ≤ cη 2 0 ≤ ( σ 2 14L F k 3 + 130L 4 F ) 1 256L 4 F = σ 2 L 3 F 14L F α 3 σ 2 1 256L 4 F + 65 128 = 1 14α 3 1 2556L 2 F + 65 128 ≤ 1. With η t = k (w+tσ 2 ) 1/3 , we obtain 1 η t - 1 η t-1 = (w + tσ 2 ) 1/3 -(w + (t -1)σ 2 ) 1/3 k (a) ≤ σ 2 3k(w + (t -1)σ 2 ) 2/3 (b) ≤ σ 2 3k(w/2 + tσ 2 ) 2/3 ≤ σ 2 3k(w/2 + tσ 2 /2) 2/3 = 2 2/3 σ 2 3k(w + tσ 2 ) 2/3 = 2 2/3 σ 2 3k 3 η 2 t (c) ≤ 2 2/3 12L F k 3 η t ≤ σ 2 7Lk 3 η t , where the inequality (a) uses the inequality (x + y) 1/3 -x 1/3 ≤ yx -2/3 3 , the inequality (b) is due to w ≥ 2σ 2 , and the inequality (c) is due to η t ≤ 1 4L F . Under review as a conference paper at ICLR 2023 Noting β t = cη 2 t and 0 ≤ (1 -β t ) ≤ 1, by Lemma 13 we have E[∥κ t+1 ∥ 2 ] η t - E[∥κ t ∥ 2 ] η t-1 ≤ ( (1 -β t ) 2 η t - 1 η t-1 )E[∥κ t ∥ 2 ] + 6c 2 η 3 t σ 2 + 8(1 -β t ) 2 L 2 F η t E[∥x t+1 -x t ∥ 2 ] ≤ (η -1 t -η -1 t-1 -2cη t )E[∥κ t ∥ 2 ] + 6c 2 η 3 t σ 2 + 8(1 -β t ) 2 L 2 F η t E[∥x t+1 -x t ∥ 2 ] ≤ -260L 4 F η t E[∥κ t ∥ 2 ] + 6c 2 η 3 t σ 2 + 8(1 -β t ) 2 L 2 F η t E[∥x t+1 -x t ∥ 2 ], where the last inequality is due to η -1 t -η -1 t-1 -2cη t ≤ σ 2 7L F k 3 η t -2( σ 2 14L F k 3 + 130L 4 F )η t ≤ -260L 4 F η t . Taking summation of Eq. ( 33) from 1 to T , we have 260L 4 F T t=1 η t E[∥κ t ∥ 2 ] ≤ E[∥κ 1 ∥ 2 ] η 0 - E[∥κ T +1 ∥ 2 ] η T + T t=1 6c 2 η 3 t σ 2 + 8L 2 F T t=1 1 η t E[∥x t+1 -x t ∥ 2 ]. (34) In the same way with Eq. 
( 15) and η t ≤ η 1 , ∀t ≥ 1, we could also have 1 -2η 1 L F 4 T t=1 1 η t ∥x t+1 -x t ∥ 2 ≤ T t=1 1 -2η t L F 4η t ∥x t+1 -x t ∥ 2 ≤ ∆ + T t=1 η t ∥z t -∇F (x t )∥ 2 . (35) Noting η 1 L F ≤ 1 4 and invoking Lemma 12, we obtain T t=1 1 η t E[∥x t+1 -x t ∥ 2 ] ≤ 4 1 -2η 1 L F (∆ + T t=1 η t E[∥z t -∇F (x t )∥ 2 ]) ≤ 8∆ + 8 T t=1 η t E[∥z t -∇F (x t )∥ 2 ] ≤ 8∆ + 32L 2 F T t=1 η t E[∥κ t ∥ 2 ]. Combining Eqs. (34, 36), we have 4L 4 F T t=1 η t E[∥κ t ∥ 2 ] ≤ E[∥κ 1 ∥ 2 ] η 0 - E[∥κ T +1 ∥ 2 ] η T + T t=1 6c 2 η 3 t σ 2 + 64L 2 F ∆. This complete the proof.
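The schedules in Lemma 14, $\eta_t = k/(w + t\sigma^2)^{1/3}$ and $\beta_t = c\eta_t^2$, can be instantiated directly. The sketch below (toy constants of our own choosing, with $k = \alpha\sigma^{2/3}/L_F$) checks the admissibility conditions used in the proof: $\eta_t \leq 1/(4L_F)$, $\beta_t \leq 1$, and monotone decay.

```python
import numpy as np

# Schedule from Lemma 14 with toy constants (alpha, sigma, L_F are ours):
alpha, sigma, L_F = 2.0, 1.0, 2.0
k = alpha * sigma ** (2 / 3) / L_F
w = max(2 * sigma ** 2, (16 * L_F ** 2 * k) ** 3)
c = sigma ** 2 / (14 * L_F * k ** 3) + 130 * L_F ** 4

t = np.arange(1, 10001)
eta = k / (w + t * sigma ** 2) ** (1 / 3)   # eta_t = k / (w + t*sigma^2)^(1/3)
beta = c * eta ** 2                          # beta_t = c * eta_t^2

assert np.all(eta <= 1 / (4 * L_F))   # step sizes stay in the admissible range
assert np.all(beta <= 1.0)            # momentum weights are valid
assert np.all(np.diff(eta) < 0)       # eta_t decays like t^(-1/3)
print(eta[0], eta[-1], beta[0], beta[-1])
```

The choice $w \geq (16L_F^2 k)^3$ is what caps $\eta_0$ at $1/(16L_F^2)$, and the $t^{-1/3}$ decay is the source of the $\widetilde{O}(T^{-2/3})$ rate in Theorem 2.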

E.2 PROOF OF THEOREM 2

Proof. Noting the monotonity of η t and dividing η1 1/4-η1L F /2 on both sides of Eq. ( 35), we have T t=1 ∥x t+1 -x t ∥ 2 ≤ 1 1/4 -η 1 L F /2 η 1 ∆ + η 1 T t=1 η t ∥z t -∇F (x t )∥ 2 . ( ) By the same method used in the proof of Theorem 2 in Xu et al. (2019) , we have the following inequality, ∥z t -∇F (x t+1 ) + 1 η t (x t -x t+1 )∥ 2 ≤ 2∥z t -∇F (x t )∥ 2 + 2 F (x t+1 ) -F (x t ) η t + (2L 2 F + 3L F η t )∥x t+1 -x t ∥ 2 . Multiplying η t on both sides of the above inequality and taking summation from 1 to T , we have T t=1 η t ∥z t -∇F (x t+1 ) + 1 η t (x t+1 -x t )∥ 2 (a) ≤ 2 T t=1 η t ∥z t -∇F (x t )∥ 2 + 2∆ + 5L F 1 1/4 -η 1 L F /2 η 1 ∆ + η 1 T t=1 η t ∥z t -∇F (x t )∥ 2 (b) ≤ 12 T t=1 η t ∥z t -∇F (x t )∥ 2 + 12∆, where inequality (a) is due to (2L 2 F + 3L F ηt ) ≤ 5L F ηt , inequality (b) is due to η 1 L F ≤ 1 4 and 1 1/4-η1L F /2 ≤ 8. Combining Eqs. (37, 39) and invoking Lemma 12 we have T t=1 η t ∥z t -∇F (x t+1 ) + 1 η t (x t+1 -x t )∥ 2 ≤ 48L 2 F T t=1 η t E[∥κ t ∥ 2 ] + 12∆ ≤ 12 E[∥κ 1 ∥ 2 ] η 0 - E[∥κ T +1 ∥ 2 ] η T + T t=1 6c 2 η 3 t σ 2 + 64L 2 F ∆ + 12∆. ( ) Noting the monotonity of η t and dividing T η T on both sides of Eq. ( 40), we obtain 1 T T t=1 ∥z t -∇F (x t+1 ) + 1 η t (x t+1 -x t )∥ 2 ≤ 12 E[∥κ 1 ∥ 2 ] T η T η 0 - E[∥κ T +1 ∥ 2 ] T η 2 T + 1 T η T T t=1 6c 2 η 3 t σ 2 + 64L 2 F ∆ T η T + 12∆ T η T . ( ) Combining Eqs. (18, 41) and noting T t=1 η 3 t ≤ O(log T ), we get the conclusion that E[dist(0, ∂ F (x R )) 2 ] ≤ 1 T T t=1 E[∥z t -∇F (x t+1 ) + 1 η t (x t+1 -x t )∥ 2 ] ≤ 12 E[∥κ 1 ∥ 2 ] T η T η 0 + 1 T η T T t=1 6c 2 η 3 t σ 2 + 64L 2 F ∆ T η T + 12∆ T η T ≤ O log T T 2/3 . This complete the proof. F PROOFS IN SECTION 5 F.1 TECHNICAL LEMMAS Lemma 15. If ℓ i (w) is convex for all i, we can show that F (w, λ) is jointly convex in terms of (w, λ). Proof. We have F (w, λ) = max p∈∆n n i=1 p i ℓ i (w) -λ( n i=1 p i log(np i ) -ρ) -λ 0 ρ G(w,λ,p) . 
Since G(w, λ, p) is jointly convex in terms of (w, λ) for every fixed p, F (w, λ) is jointly convex in terms of (w, λ). Lemma 16. Under Assumption 1, 2, run Algorithm 1 with η ≤ β 4L F √ 9+20L 2 g ≤ 1 6L F and apply SCDRO to the new objective Fµ (x) by adding µx t to (∇f λt (s t )∇ w g i (x t ) ⊤ , ∇f λt (s t )∇ λ g i (x t ) + log(s t ) + ρ) ⊤ in Eq. ( 6) of Algorithm 1, where µ is a small constant to be determined later. Without loss of the generality, we assume 0 < µ ≤ 1 2 and then we have 1 T T t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ 2E[∥z 1 -∇F µ (x 1 )∥ 2 ] βT + ∆ µ ηT + 20L F E[∥g(x 1 ) -s 1 ∥ 2 ] βT + 24βL 2 F σ 2 . Proof. To facilitate our proof statement, we define the following notations: ∇F µ (x t ) ⊤ = (∂ w F µ (x t ) ⊤ , ∂ λ F µ (x t )) = (∇f λt (g(x t ))∂ w g(x t ) ⊤ + µw ⊤ t , ∇f λt (g(x t ))∂ λ g(x t ) + log(g(x t )) + ρ + µλ t ) ∇F µ (x t ) ⊤ = (∇f λt (g(x t ))∂ w g i (x t ) ⊤ + µw ⊤ t , ∇f λt (g(x t ))∂ λ g i (x t ) + log(g(x t )) + ρ + µλ t ) G µ (x t ) ⊤ = (G wt (x t ) ⊤ , G λt (x t )) = (∇f λt (s t )∂ w g i (x t ) ⊤ + µw ⊤ t , ∇f λt (s t )∂ λ g i (x t ) + log(s t ) + ρ + µλ t ). It is worth to notice that E[ ∇F µ (x t )] = ∇F µ (x t ). Since F (x) is L F -smooth, then we have F µ (x) is L Fµ -smooth, where L Fµ = (L F + µ). Noting L F > 1 and µ ≤ 1 2 , we obtain L F + µ ≤ 3 2 L F . For every iteration t, by simple expansion we have I t = E[∥∇F µ (x t ) -z t ∥ 2 ] = E[∥∇F µ (x t ) -(1 -β)z t-1 -βG µ (x t )∥ 2 ] = E[∥(1 -β)(∇F µ (x t ) -∇F µ (x t-1 )) + (1 -β)∇F µ (x t-1 ) -(1 -β)z t-1 + β∇F µ (x t ) -βG µ (x t )∥ 2 ] = E[∥(1 -β)(∇F µ (x t ) -∇F µ (x t-1 )) + (1 -β)(∇F µ (x t-1 ) -z t-1 )∥ 2 ] + E[∥β( ∇F µ (x t ) -G µ (x t )) + β(∇F µ (x t ) -∇F µ (x t ))∥ 2 ] = E[∥(1 -β) (∇F µ (x t ) -∇F µ (x t-1 ) A ) + (1 -β) (∇F (x t-1 ) -z t-1 ) B ∥ 2 ] + E[∥β( ∇F (x t ) -G(x t ) C ) + β (∇F (x t ) -∇F (x t )) D ∥ 2 ]. The above inequality shows that the only difference between I t in the proof of Lemma 11 and I t in the proof of Lemma 16 is term A. 
Therefore, by the same method used in the proof of Lemma 11, we have T t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ E[∥∇F µ (x 1 ) -z 1 ∥ 2 ] β + ( 4L 2 Fµ β 2 + 20L 2 F L 2 g β 2 ) η 1/4 -ηL Fµ /2 (∆ µ + η T t=1 E[∥z t -∇F µ (x t )∥ 2 ]) + 10L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] β + βT σ 2 + 2βL 2 F T σ 2 . By L Fµ ≤ 3 2 L F and ηL Fµ ≤ 3 2 ηL F ≤ 1/4, it holds that T t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ E[∥∇F µ (x 1 ) -z 1 ∥ 2 ] β + ( 9L 2 F β 2 + 20L 2 F L 2 g β 2 ) 8η ∆ µ + η T t=1 E[∥z t -∇F µ (x t )∥ 2 ] + 10L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] β + βT σ 2 + 2βL 2 F T σ 2 ≤ E[∥z 1 -∇F µ (x 1 )∥ 2 ] β + ∆ µ 2η + 1 2 T t=1 E[∥z t -∇F µ (x t )∥ 2 ] + 10L 2 F E[∥g(x 1 ) -s 1 ∥ 2 ] β + βT σ 2 + 2βL 2 F T σ 2 , ( ) where the last inequality is due to 8 (9L 2 F + 20L 2 F L 2 g )η 2 ≤ β 2 2 . Rearranging terms and dividing T on both sides of Eq. ( 42), we complete the proof of this Lemma. Lemma 17. At the k-th stage of RASCDRO, let β k = cη 2 k and c = 512L 4 F we have 1 8L 2 F T k T k t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ E[∥κ k ∥ 2 ] β k T k + 6β k σ 2 + 64L 2 F E[∆ µ k ]η k β k T k , where ∆ µ k = F µ (x k ) -inf x∈X F µ (x). Proof. Recall the definition of ∥κ t ∥ 2 and by the same proof of Lemma 12 we have ∥z t -∇F µ (x t )∥ 2 ≤ 4L 2 F ∥κ t ∥ 2 . (44) Denote κ t at kth-stage as κ t k , and by Lemma 13, at the kth-stage in RASCDRO we have E[∥κ t+1 k ∥ 2 ] ≤ (1 -β k ) 2 ∥κ t k ∥ 2 + 6β 2 k σ 2 + 8L 2 F (1 -β k ) 2 ∥x t+1 -x t ∥ 2 ≤ (1 -β k ) 2t ∥κ k ∥ 2 + 6β 2 k σ 2 t i=1 (1 -β k ) 2(t-i) + 8L 2 F (1 -β k ) 2 t i=1 (1 -β k ) 2(t-i) ∥x i+1 -x i ∥ 2 ≤ (1 -β k ) 2t ∥κ k ∥ 2 + 6β k σ 2 (45) + 8L 2 F (1 -β k ) 2 t i=1 (1 -β k ) 2(t-i) ∥x i+1 -x i ∥ 2 . Combining Eqs. (44,45), we obtain 1 4L 2 F T k T k t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ 1 T k T k t=1 E[(1 -β k ) 2 ∥κ t k ∥ 2 + 6β 2 k σ 2 + 8L 2 F (1 -β k ) 2 ∥x t+1 -x t ∥ 2 ] ≤ 1 T k T k t=1 (1 -β k ) 2t-2 E[∥κ k ∥ 2 ] + 6β k σ 2 + 8L 2 F (1 -β k ) 2 T k T k t=1 t-1 i=1 (1 -β k ) 2(t-i) E[∥x i+1 -x i ∥ 2 ]. 
Under review as a conference paper at ICLR 2023 Noting T k t=1 (1 -β k ) 2t-2 ≤ 1/β k and invoking Eq. (38), we have 1 4L 2 F T k T k t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ E[∥κ k ∥ 2 ] β k T k + 6β k σ 2 + 8L 2 F (1 -β k ) 2 β k T k T k t=1 E[∥x t+1 -x t ∥ 2 ] ≤ E[∥κ k ∥ 2 ] β k T k + 6β k σ 2 + 8L 2 F (1 -β k ) 2 β k T k η k 1/4 -η k L Fµ /2 E[∆ µ k ] + η k T k t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ E[∥κ k ∥ 2 ] β k T k + 6β k σ 2 + 64L 2 F E[∆ µ k ]η k β k T k + 64L 2 F η 2 k β k T k T k t=1 ∥z t -∇F µ (x t )∥ 2 ], where the last inequality is due to 1/(1/4 -η k L Fµ /2) ≤ 8, (1 -β k ) 2 ≤ 1, L 2 ∇g + L 2 g ≤ 2L 2 F . Invoking β k = cη 2 k and c = 576L 4 F to above inequality, we get the conclusion that 1 8L 2 F T k T k t=1 E[∥z t -∇F µ (x t )∥ 2 ] ≤ E[∥κ k ∥ 2 ] β k T k + 6β k σ 2 + 64L 2 F E[∆ µ k ]η k β k T k . F.2 PROOF OF LEMMA 3 Proof. Since ℓ i (w) is convex for all i, by Lemma 15 we know F (x) is convex. And thus by the definition of Fµ (x) we have Fµ (x) is a strongly convex function. Then by strong convexity, we have Fµ (y) ≥ Fµ (x) + v ⊤ (y -x) + µ 2 ∥y -x∥ 2 , ∀x, y ∈ X , v ∈ ∂ Fµ (x). Then inf x∈X Fµ (x) ≥ min y∈X Fµ (x) + v ⊤ (y -x) + µ 2 ∥y -x∥ 2 ≥ min y Fµ (x) + v ⊤ (y -x) + µ 2 ∥y -x∥ 2 = Fµ (x) - ∥v∥ 2 2µ , ∀v ∈ ∂ Fµ (x). Hence, ∥v∥ 2 2µ ≥ Fµ (x) -inf x∈X Fµ (x) , ∀v ∈ ∂ Fµ (x), which implies dist(0, ∂ Fµ (x)) 2 ≥ 2µ Fµ (x) -Fµ (x * ) . F.3 PROOF OF LEMMA 4 Proof. We use inductions to prove E[∥z k -∇F µ (x k )∥ 2 ] ≤ µϵ k /4, E[∥g(x k ) -s k ∥ 2 ] ≤ µϵ k /4 and E[F µ (x k ) -inf x∈X F µ (x)] ≤ ϵ k . Let's consider the first stage in the beginning. Let ϵ 1 = ∆ µ , thus E[F µ (x 1 ) -inf x∈X F µ (x)] ≤ ϵ 1 . And we can use a batch size of 4/µϵ 1 for initialization.to make sure E [∥∇F µ (x 1 ) -z 1 ∥ 2 ] ≤ µϵ 1 /4,E[∥s 1 -g(x 1 )∥ 2 ] ≤ µϵ 1 /4. Suppose that E[∥g(x k-1 ) -s k-1 ∥ 2 ] ≤ µϵ k-1 /4, E[∥z k-1 -∇F µ (x k-1 )∥ 2 ] ≤ µϵ k-1 /4 and E[F µ (x k-1 ) -inf x∈X F µ (x)] ≤ ϵ k-1 . 
By setting $\beta_{k-1}=\min\{\frac{\mu\epsilon_{k-1}}{384L_F^2\sigma^2},\frac{1}{384L_F^2}\}$, $\eta_{k-1}=\min\{\frac{\mu\epsilon_{k-1}}{4608L_F^4\sigma^2},\frac{1}{4608L_F^4}\}$ and $T_{k-1}=\max\{\frac{147456L_F^4\sigma^2}{\mu^2\epsilon_{k-1}},\frac{147456L_F^4}{\mu}\}$, it is easy to obtain that $\eta_{k-1}\le\frac{\beta_{k-1}}{4L_F\sqrt{9+20L_g^2}}$. Therefore, invoking Lemma 16 we have
$$\begin{aligned}\mathbb{E}[\|z_k-\nabla F_\mu(x_k)\|^2]&\le\frac{1}{T_{k-1}}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|z_t-\nabla F_\mu(x_t)\|^2]\\&\le\frac{2\mathbb{E}[\|z_{k-1}-\nabla F_\mu(x_{k-1})\|^2]}{\beta_{k-1}T_{k-1}}+\frac{\mathbb{E}[\Delta_\mu^{k-1}]}{\eta_{k-1}T_{k-1}}+\frac{20L_F\,\mathbb{E}[\|g(x_{k-1})-s_{k-1}\|^2]}{\beta_{k-1}T_{k-1}}+24\beta_{k-1}L_F^2\sigma^2\\&\le\frac{\mu\epsilon_{k-1}}{2\beta_{k-1}T_{k-1}}+\frac{\epsilon_{k-1}}{\eta_{k-1}T_{k-1}}+\frac{5L_F\mu\epsilon_{k-1}}{\beta_{k-1}T_{k-1}}+24\beta_{k-1}L_F^2\sigma^2.\end{aligned}$$
Without loss of generality, we consider the case $\mu\epsilon_{k-1}/\sigma^2\le 1$. By definition we have $\beta_{k-1}=\mu\epsilon_{k-1}/(384L_F^2\sigma^2)$, $\eta_{k-1}=\mu\epsilon_{k-1}/(4608L_F^4\sigma^2)$ and $T_{k-1}=147456L_F^4\sigma^2/(\mu^2\epsilon_{k-1})$, which imply $\frac{1}{\beta_{k-1}T_{k-1}}\le\frac{\mu}{384L_F^2}$, $\frac{1}{\eta_{k-1}T_{k-1}}\le\frac{\mu}{32}$ and $24\beta_{k-1}L_F^2\sigma^2\le\frac{\mu\epsilon_{k-1}}{16}$. Then, noting $L_F\ge 1$, $\mu<1$ and $\epsilon_k=\epsilon_{k-1}/2$, we have
$$\mathbb{E}[\|z_k-\nabla F_\mu(x_k)\|^2]\le\frac{\mu^2\epsilon_{k-1}}{768L_F^2}+\frac{\mu\epsilon_{k-1}}{32}+\frac{5\mu^2\epsilon_{k-1}}{384L_F}+\frac{\mu\epsilon_{k-1}}{16}\le\frac{\mu\epsilon_{k-1}}{8}=\frac{\mu\epsilon_k}{4}.$$
Next we need to show $\mathbb{E}[\|g(x_k)-s_k\|^2]\le\mu\epsilon_k/4$ under the assumption that $\mathbb{E}[\|g(x_{k-1})-s_{k-1}\|^2]\le\mu\epsilon_{k-1}/4$. By Lemma 9, we have
$$\begin{aligned}\mathbb{E}[\|g(x_k)-s_k\|^2]&=\frac{1}{T_{k-1}}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|g(x_t)-s_t\|^2]\\&\le\frac{\mathbb{E}[\|g(x_{k-1})-s_{k-1}\|^2]}{\beta_{k-1}T_{k-1}}+\frac{2L_g^2}{\beta_{k-1}^2T_{k-1}}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|x_{t+1}-x_t\|^2]+\beta_{k-1}\sigma^2\\&\le\frac{\mu\epsilon_{k-1}}{4\beta_{k-1}T_{k-1}}+\frac{2L_g^2}{\beta_{k-1}^2T_{k-1}}\cdot\frac{\eta_{k-1}}{1/4-\eta_{k-1}L_{F_\mu}/2}\Big(\mathbb{E}[\Delta_\mu^{k-1}]+\eta_{k-1}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|z_t-\nabla F_\mu(x_t)\|^2]\Big)+\beta_{k-1}\sigma^2,\end{aligned}$$
where $\Delta_\mu^{k-1}=F_\mu(x_{k-1})-\inf_{x\in\mathcal X}F_\mu(x)$. With $1/(1/4-\eta_{k-1}L_{F_\mu}/2)\le 8$, $\mathbb{E}[\|g(x_{k-1})-s_{k-1}\|^2]\le\mu\epsilon_{k-1}/4$ and $\mathbb{E}[F_\mu(x_{k-1})-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon_{k-1}$, it holds that
$$\mathbb{E}[\|g(x_k)-s_k\|^2]\le\frac{\mu\epsilon_{k-1}}{2\beta_{k-1}T_{k-1}}+\frac{16L_g^2\eta_{k-1}\epsilon_{k-1}}{\beta_{k-1}^2T_{k-1}}+\frac{4L_g^2\eta_{k-1}^2\mu\epsilon_{k-1}}{\beta_{k-1}^2}+\beta_{k-1}\sigma^2\le\frac{\mu\epsilon_k}{384L_F^2}+\frac{L_g^2\mu\epsilon_{k-1}}{288L_F^4}+\frac{L_g^2\mu\epsilon_{k-1}}{36L_F^4}+\frac{\mu\epsilon_{k-1}}{192L_F^2}\le\frac{\mu\epsilon_k}{4}.$$
Invoking Lemma 10, at the $(k-1)$-th stage ($k>1$) we have
$$\begin{aligned}\mathbb{E}[\mathrm{dist}(0,\partial\widetilde F_\mu(x_k))^2]&\le\frac{2+40L_{F_\mu}\eta_{k-1}}{T_{k-1}}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|z_t-\nabla F_\mu(x_t)\|^2]+\frac{2\mathbb{E}[\Delta_\mu^{k-1}]}{\eta_{k-1}T_{k-1}}+\frac{40L_{F_\mu}\mathbb{E}[\Delta_\mu^{k-1}]}{T_{k-1}}\\&\le\frac{(2+40L_{F_\mu}\eta_{k-1})\mu\epsilon_{k-1}}{4}+\frac{2\epsilon_{k-1}}{\eta_{k-1}T_{k-1}}+\frac{40L_{F_\mu}\epsilon_{k-1}}{T_{k-1}}\\&\le\frac{197\mu\epsilon_k}{192}+\frac{\mu\epsilon_k}{8}+\frac{40L_{F_\mu}\mu\epsilon_{k-1}}{147456L_F^4}\le 2\mu\epsilon_k,\end{aligned}$$
where the second inequality is due to $L_{F_\mu}\eta_{k-1}\le(3/2)L_F\eta_{k-1}\le 1/1536$. Since $F_\mu(x_k)\le\widetilde F_\mu(x_k)$ and $\inf_{x\in\mathcal X}F_\mu(x)=\inf_{x\in\mathcal X}\widetilde F_\mu(x)$, applying Lemma 3 we have
$$\mathbb{E}[F_\mu(x_k)-\inf_{x\in\mathcal X}F_\mu(x)]\le\mathbb{E}[\widetilde F_\mu(x_k)-\inf_{x\in\mathcal X}\widetilde F_\mu(x)]\le\frac{1}{2\mu}\mathbb{E}[\mathrm{dist}(0,\partial\widetilde F_\mu(x_k))^2]\le\frac{2\mu\epsilon_k}{2\mu}=\epsilon_k.$$
This completes the proof of this lemma.

F.4 PROOF OF THEOREM 3

Proof. Invoking Lemma 4, after $K=O(\log_2(\epsilon_1/\epsilon))$ stages we have
$$\mathbb{E}[F_\mu(x_K)-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon_K=\frac{\epsilon_1}{2^{K-1}}=\epsilon.$$
Since $\sum_{k=1}^{K}2^k=O(1/\epsilon)$, the overall oracle complexity is
$$\sum_{k=1}^{K}T_k+\frac{4}{\mu\epsilon_1}\le 36864\sigma^2L_F^4\sum_{k=2}^{K}\frac{1}{\mu^2\epsilon_k}+\frac{4}{\mu\epsilon_1}\le\frac{36864\sigma^2L_F^4}{\mu^2\epsilon}\sum_{k=1}^{K}\frac{1}{2^k}+\frac{4}{\mu\epsilon_1}\le O\Big(\frac{1}{\mu^2\epsilon}\Big).$$
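The stagewise argument above can be sanity-checked numerically. The sketch below is illustrative only: `total_oracle_calls` and its constant `c` are hypothetical stand-ins for the exact stage sizes of Lemma 4, but they reproduce the key structure, namely that $\epsilon_k$ is halved each stage while $T_k\propto 1/(\mu^2\epsilon_k)$, so the total cost is dominated by the final stage.

```python
import math

# Illustrative check of the geometric-stage complexity argument:
# eps_k = eps_1 / 2^(k-1) and T_k proportional to 1/(mu^2 * eps_k),
# hence sum_k T_k <= 2 * T_K = O(1 / (mu^2 * eps)).

def total_oracle_calls(eps1, eps, mu, c=1.0):
    K = math.ceil(math.log2(eps1 / eps)) + 1          # number of stages
    eps_k = [eps1 / 2 ** (k - 1) for k in range(1, K + 1)]
    T_k = [c / (mu ** 2 * e) for e in eps_k]          # hypothetical stage cost
    return sum(T_k), T_k[-1]

total, last_stage = total_oracle_calls(eps1=1.0, eps=1e-4, mu=0.1)
# geometric sum: the total is at most twice the final-stage cost
assert total <= 2 * last_stage
```

The assertion mirrors the bound $\sum_{k}1/\epsilon_k\le 2/\epsilon$ used in the displayed complexity estimate.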

F.5 PROOF OF COROLLARY 1

Proof. It is easy to note that $F_\mu(x_K)-F_\mu(x^*)\le F_\mu(x_K)-\inf_{x\in\mathcal X}F_\mu(x)$, where $x^*=\arg\min_{x\in\mathcal X}F(x)$. Therefore, if after $K$ stages it holds that $\mathbb{E}[F_\mu(x_K)-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon/2$ with an oracle complexity of $O(1/(\mu^2\epsilon))$, we have $\mathbb{E}[F_\mu(x_K)-F_\mu(x^*)]\le\epsilon/2$, i.e., $\mathbb{E}[F(x_K)+\mu\|x_K\|^2/2-F(x^*)-\mu\|x^*\|^2/2]\le\epsilon/2$. By Assumption 1(a), $\mathcal W$ is bounded by $R$; then by setting $\mu=\epsilon/(2(R^2+\bar\lambda^2))$, with $\|x\|^2\le R^2+\bar\lambda^2$ we have
$$\mathbb{E}[F(x_K)-F(x^*)]\le\frac{\epsilon}{2}+\frac{2(R^2+\bar\lambda^2)\mu}{2}\le\frac{\epsilon}{2}+\frac{\epsilon}{2}\le\epsilon$$
with an oracle complexity of $O(1/\epsilon^3)$.

Let $\epsilon_1=\Delta_\mu$, so $\mathbb{E}[F_\mu(x_1)-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon_1$, and we can use a batch size of $48L_F^2/(\mu\epsilon_1)$ for initialization to make sure
$$\mathbb{E}[\|\kappa_1\|^2]=\mathbb{E}\big[\|s_1-g(x_1)\|^2+\|v_1-\partial_wg(x_1)\|^2+|u_1-\partial_\lambda g(x_1)|^2\big]\le\frac{\mu\epsilon_1}{16L_F^2}.$$
Suppose that $\mathbb{E}[\|\kappa_{k-1}\|^2]\le\mu\epsilon_{k-1}/(16L_F^2)$ and $\mathbb{E}[F_\mu(x_{k-1})-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon_{k-1}$. Set $\beta_{k-1}=\min\{\frac{\mu\epsilon_{k-1}}{768L_F^2\sigma^2},\frac{1}{768L_F^2}\}$, $\eta_{k-1}=\min\{\frac{\sqrt{\mu\epsilon_{k-1}}}{18432L_F^3\sigma},\frac{1}{18432L_F^4}\}$ and $T_{k-1}=\max\{\frac{147456L_F^3\sigma}{\mu^{3/2}\sqrt{\epsilon_{k-1}}},\frac{147456L_F^4\sigma^2}{\mu\epsilon_{k-1}},\frac{147456L_F^4}{\mu}\}$. Then, following Lemma 17 above, for $k\ge 1$ we have
$$\begin{aligned}\mathbb{E}[\|\kappa_k\|^2]&\le\frac{1}{4L_F^2T_{k-1}}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|z_t-\nabla F_\mu(x_t)\|^2]\le\frac{2\mathbb{E}[\|\kappa_{k-1}\|^2]}{\beta_{k-1}T_{k-1}}+12\beta_{k-1}\sigma^2+\frac{128L_F^2\,\mathbb{E}[\Delta_\mu^{k-1}]\eta_{k-1}}{\beta_{k-1}T_{k-1}}\\&\le\frac{\mu\epsilon_{k-1}}{4L_F^2\beta_{k-1}T_{k-1}}+12\beta_{k-1}\sigma^2+\frac{128L_F^2\epsilon_{k-1}\eta_{k-1}}{\beta_{k-1}T_{k-1}}.\end{aligned}$$
Without loss of generality, we consider the case $\mu\epsilon_{k-1}/\sigma^2\le 1$. By definition we have $\beta_{k-1}=\mu\epsilon_{k-1}/(768L_F^2\sigma^2)$ and $\eta_{k-1}=\sqrt{\mu\epsilon_{k-1}}/(9216L_F^3\sigma)$, which imply $\frac{1}{\beta_{k-1}T_{k-1}}\le\frac{1}{96L_F^2}$, $\frac{1}{\eta_{k-1}T_{k-1}}\le\frac{\mu}{8}$ and $12\beta_{k-1}\sigma^2\le\frac{\mu\epsilon_{k-1}}{64L_F^2}$. Then, noting $L_F\ge 1$, $\mu<1$ and $\epsilon_k=\epsilon_{k-1}/2$, we have
$$\mathbb{E}[\|\kappa_k\|^2]\le\frac{\mu\epsilon_{k-1}}{384L_F^4}+\frac{\mu\epsilon_{k-1}}{64L_F^2}+\frac{\mu\epsilon_{k-1}}{6912L_F^4}\le\frac{\mu\epsilon_k}{192L_F^2}+\frac{\mu\epsilon_k}{32L_F^2}+\frac{\mu\epsilon_k}{3456L_F^2}\le\frac{\mu\epsilon_k}{16L_F^2}.$$
Then by Eq. (44), we have $\|z_k-\nabla F_\mu(x_k)\|^2\le 4L_F^2\|\kappa_k\|^2\le\mu\epsilon_k/4$.
Invoking Lemma 10, at the $(k-1)$-th stage ($k>1$) we have
$$\begin{aligned}\mathbb{E}[\mathrm{dist}(0,\partial\widetilde F_\mu(x_k))^2]&\le\frac{2+40L_{F_\mu}\eta_{k-1}}{T_{k-1}}\sum_{t=1}^{T_{k-1}}\mathbb{E}[\|z_t-\nabla F_\mu(x_t)\|^2]+\frac{2\mathbb{E}[\Delta_\mu^{k-1}]}{\eta_{k-1}T_{k-1}}+\frac{40L_{F_\mu}\mathbb{E}[\Delta_\mu^{k-1}]}{T_{k-1}}\\&\le\frac{(2+40L_{F_\mu}\eta_{k-1})\mu\epsilon_{k-1}}{2}+\frac{2\epsilon_{k-1}}{\eta_{k-1}T_{k-1}}+\frac{40L_{F_\mu}\epsilon_{k-1}}{T_{k-1}}\\&\le\frac{773\mu\epsilon_k}{768}+\frac{\mu\epsilon_k}{2}+\frac{40L_{F_\mu}\mu\epsilon_{k-1}}{73728L_F^4}\le 2\mu\epsilon_k,\end{aligned}$$
where the second inequality is due to $L_{F_\mu}\eta_{k-1}\le(3/2)L_F\eta_{k-1}\le 1/3072$. Since $F_\mu(x_k)\le\widetilde F_\mu(x_k)$ and $\inf_{x\in\mathcal X}F_\mu(x)=\inf_{x\in\mathcal X}\widetilde F_\mu(x)$, applying Lemma 3 we have
$$\mathbb{E}[F_\mu(x_k)-\inf_{x\in\mathcal X}F_\mu(x)]\le\mathbb{E}[\widetilde F_\mu(x_k)-\inf_{x\in\mathcal X}\widetilde F_\mu(x)]\le\frac{1}{2\mu}\mathbb{E}[\mathrm{dist}(0,\partial\widetilde F_\mu(x_k))^2]\le\frac{2\mu\epsilon_k}{2\mu}=\epsilon_k.$$
This completes the proof of this lemma.

F.7 PROOF OF THEOREM 4

Proof. Invoking Lemma 5, after $K=O(\log_2(\epsilon_1/\epsilon))$ stages we have
$$\mathbb{E}[F_\mu(x_K)-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon_K=\frac{\epsilon_1}{2^{K-1}}=\epsilon.$$



$\tilde O$ omits a logarithmic dependence on $\epsilon$. PG-SMD2 refers to the PG-SMD algorithm under Assumption D2 in Rafique et al. (2021). FastDRO is the name of the GitHub repository of Levy et al. (2020), and we use "FastDRO" to refer to the algorithm based on the mini-batch gradient estimator in Levy et al. (2020). We would like to point out that the variance bound and the smoothness constant $L_F$ depend exponentially on the problem parameters, as do the corresponding constants in some other stochastic methods for solving constrained DRO, such as Dual SGM in Levy et al. (2020).



where $\mathcal D$ denotes the training set and $i$ denotes the index of a sample randomly drawn from $\mathcal D$. Let $f_\lambda(\cdot)=\lambda\log(\cdot)+\lambda\rho$, and let $\nabla f_\lambda(g)=\lambda/g$ denote the gradient of $f_\lambda$ with respect to $g$. Let $\Pi_{\mathcal X}(\cdot)$ denote the Euclidean projection onto the domain $\mathcal X$. Let $[T]=\{1,\ldots,T\}$ and let $\tau\sim[T]$ denote a randomly selected index. We make the following standard assumptions regarding problem (2).

Assumption 1. There exist $R$, $G$, $C$, and $L$ such that (a) the domain of the model parameter $\mathcal W$ is bounded by $R$, i.e., $\|w\|\le R$ for all $w\in\mathcal W$; (b) $\ell_i(w)$ is a $G$-Lipschitz continuous function bounded by $C$, i.e., $\|\partial\ell_i(w)\|\le G$ and $|\ell_i(w)|\le C$ for all $w\in\mathcal W$ and $i\sim\mathcal D$.
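As a concrete illustration of this notation (a sketch, not the paper's implementation), $f_\lambda$, its gradient $\lambda/g$, and the Euclidean projection can be written in a few lines; here the ball projection specializes $\Pi_{\mathcal X}$ to the assumed case $\mathcal X=\{x:\|x\|_2\le R\}$.

```python
import math

# Illustrative sketch of the definitions above:
# f_lambda(g) = lambda * log(g) + lambda * rho, with gradient lambda / g,
# and a Euclidean projection onto a norm ball (one instance of Pi_X).

def f_lam(g, lam, rho):
    return lam * math.log(g) + lam * rho

def grad_f_lam(g, lam):
    # d/dg [lam * log(g) + lam * rho] = lam / g
    return lam / g

def project_ball(x, R):
    # Euclidean projection onto {x : ||x||_2 <= R}
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= R else [R / norm * v for v in x]

# finite-difference check of the gradient formula
g, lam, rho, eps = 2.0, 0.5, 0.1, 1e-6
fd = (f_lam(g + eps, lam, rho) - f_lam(g - eps, lam, rho)) / (2 * eps)
assert abs(fd - grad_f_lam(g, lam)) < 1e-6

# the projection returns a feasible point and is (numerically) idempotent
p = project_ball([3.0, 4.0], R=1.0)          # input norm is 5
assert abs(math.sqrt(sum(v * v for v in p)) - 1.0) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(project_ball(p, R=1.0), p))
```

The finite-difference check confirms $\nabla f_\lambda(g)=\lambda/g$, which is why the text's "$\lambda\,g$" reads as an extraction typo for $\lambda/g$.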

Figure 1: Training accuracy (%) vs # of processed training samples for the convex setting. ρ is fixed to 0.5 on CIFAR10-ST and CIFAR100-ST, and 0.1 on ImageNet-LT and iNaturalist2018. The results are averaged over 5 independent runs.

Figure 2: Training accuracy (%) vs # of processed training samples for the non-convex setting. ρ is fixed to 0.5 on all datasets. The results are averaged over 5 independent runs.

Figure 3: Testing accuracy (%) vs # of processed training samples for the convex setting. ρ is fixed to 0.5 on CIFAR10-ST and CIFAR100-ST, and 0.1 on ImageNet-LT and iNaturalist2018. The results are averaged over 5 independent runs.

Figure 4: Testing accuracy (%) vs # of processed training samples for the non-convex setting. ρ is fixed to 0.5 on all datasets. The results are averaged over 5 independent runs.

Figure 5: Running time comparison between PG-SMD2 and SCDRO.

Per-Iteration Cost. We report the per-iteration cost of the non-convex primal-dual algorithm PG-SMD2 and of SCDRO on a single Tesla V100 GPU in Figure 5. It is clear that the primal-dual algorithm incurs significant overhead due to the updates of the dual variables. Comparing the large-scale datasets (ImageNet-LT and iNaturalist2018) with the medium-size datasets (CIFAR10-ST and CIFAR100-ST), the larger dataset size amplifies the running-time gap, since the dimension of the dual variable grows with the dataset size as O(n). For the largest dataset, iNaturalist2018, SCDRO can save days of training time.
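The O(n) overhead argument can be made concrete with a toy cost model. This is illustrative only: `per_iter_overhead` and its unit costs are hypothetical, not measurements from the paper; the point is simply that a primal-dual step touches an n-dimensional dual vector while a dual-free step maintains only constant-size extra state.

```python
# Toy per-iteration cost model (hypothetical unit costs, for intuition only):
# a primal-dual update writes the full n-dimensional dual vector p each step,
# while a dual-free / compositional update keeps only O(1) running scalars
# beyond the minibatch gradient computation.

def per_iter_overhead(n, batch, style):
    if style == "primal-dual":     # dual vector p in R^n updated every step
        return n + batch
    if style == "dual-free":       # constant-size running estimates only
        return 1 + batch
    raise ValueError(style)

for n in (10_000, 1_000_000):
    pd = per_iter_overhead(n, batch=32, style="primal-dual")
    df = per_iter_overhead(n, batch=32, style="dual-free")
    assert pd > df                 # gap exists at every dataset size

# and the gap grows linearly with the dataset size n
gap = lambda n: per_iter_overhead(n, 32, "primal-dual") - per_iter_overhead(n, 32, "dual-free")
assert gap(10**6) - gap(10**4) == 10**6 - 10**4
```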

), we have that the optimal solution of the above optimization problem satisfies $|\lambda^*|\le|\tilde\lambda^*|+\lambda_0\le\lambda_0+\frac{C}{\rho}$, which completes the proof.

D PROOFS IN SECTION 4

D.1 TECHNICAL LEMMAS

Lemma 9. Suppose Assumption 2 holds, $i\sim\mathcal D$, and $s$ is initialized with $s_1=\exp(\ell_i(w_1)/\lambda_1)$

F.6 PROOF OF LEMMA 5

Proof. We use induction to prove $\mathbb{E}[\|\kappa_k\|^2]\le\mu\epsilon_k/(16L_F^2)$ and $\mathbb{E}[F_\mu(x_k)-\inf_{x\in\mathcal X}F_\mu(x)]\le\epsilon_k$. Let us first consider the first stage.

Summary of algorithms solving KL-constrained DRO problem. Complexity represents the oracle complexity for achieving E[dist(0, ∂ F (x))] ≤ ϵ or other first-order stationarity convergence for the non-convex setting and E[F (x) -F (x * )] ≤ ϵ for the convex setting. Per Iter Cost denotes the per-iteration computational complexity. The algorithm styles include primal-dual (PD), primal only (P), and compositional (COM). "-" means not available in the original paper.

ST, we artificially construct imbalanced training data, where we only keep the last 100 images of each class for the first half of the classes, and keep the other classes and the test data unchanged. ImageNet-LT is a long-tailed subset of the original ImageNet-2012, obtained by sampling a subset following the Pareto distribution with power value 6. It has 115.8K images from 1000 categories, ranging from 4980 images for the largest head class to 5 images for the smallest tail class. iNaturalist 2018 is a real-world dataset whose class frequencies follow a heavy-tailed distribution. It contains 437K images from 8142 classes.
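The imbalanced-split construction described above can be sketched as follows. The helper `make_imbalanced` and the synthetic labels are hypothetical (not code from the paper); the sketch only mirrors the stated rule of keeping the last 100 images per class for the first half of the classes.

```python
# Sketch of the *-ST imbalanced split: for the first half of the classes
# keep only the last `keep` images per class; leave other classes untouched.

def make_imbalanced(labels, num_classes, keep=100):
    # returns sorted indices of the retained training samples
    per_class = {c: [] for c in range(num_classes)}
    for idx, y in enumerate(labels):
        per_class[y].append(idx)
    kept = []
    for c, idxs in per_class.items():
        kept.extend(idxs[-keep:] if c < num_classes // 2 else idxs)
    return sorted(kept)

# toy check: 10 classes with 500 images each
labels = [c for c in range(10) for _ in range(500)]
kept = make_imbalanced(labels, num_classes=10, keep=100)
counts = {c: sum(1 for i in kept if labels[i] == c) for c in range(10)}
assert all(counts[c] == 100 for c in range(5))     # truncated head half
assert all(counts[c] == 500 for c in range(5, 10)) # untouched second half
```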

Testing Accuracy in Convex Setting

Testing Accuracy in Non-Convex Setting

Test accuracy (%) of different methods for different constraint parameter ρ in the non-convex setting. The results are averaged over 5 independent runs.


Since $\sum_{k=1}^{K}2^k=O(1/\epsilon)$, the overall oracle complexity follows as stated in the theorem. This completes the proof.

G DERIVATION OF THE COMPOSITIONAL FORMULATION

Recall the original KL-constrained DRO problem:
$$\min_{w\in\mathcal W}\ \max_{p\in\Delta_n:D(p,1/n)\le\rho}\ \sum_{i=1}^{n}p_i\ell_i(w),$$
where $D(p,1/n)=\sum_{i=1}^{n}p_i\log(p_in)$. In order to tackle this problem, let us first consider the robust loss
$$\max_{p\in\Delta_n:D(p,1/n)\le\rho}\ \sum_{i=1}^{n}p_i\ell_i(w).$$
And then we invoke the dual variable $\lambda\ge 0$ to transform this primal problem into the following form:
$$\min_{\lambda\ge 0}\ \max_{p\in\Delta_n}\ \sum_{i=1}^{n}p_i\ell_i(w)-\lambda\big(D(p,1/n)-\rho\big).$$
Since this problem is concave in terms of $p$ given $w$, by the strong duality theorem the order of the minimization over $\lambda$ and the maximization over $p$ can be exchanged. Then the original problem is equivalent to the following problem:
$$\min_{w\in\mathcal W,\lambda\ge 0}\ \max_{p\in\Delta_n}\ \sum_{i=1}^{n}p_i\ell_i(w)-\lambda D(p,1/n)+\lambda\rho.$$
Next we fix $x=(w^\top,\lambda)^\top$ and derive an optimal solution $p^*(x)$ which depends on $x$ and solves the inner maximization problem. We consider the following problem:
$$\max_{p\in\Delta_n}\ \sum_{i=1}^{n}p_i\ell_i(w)-\lambda\sum_{i=1}^{n}p_i\log(p_in),$$
which has the same optimal solution $p^*(x)$ as our problem. There are three constraints to handle, i.e., $p_i\ge 0$ for all $i$, $p_i\le 1$ for all $i$, and $\sum_{i=1}^{n}p_i=1$. Note that the constraint $p_i\ge 0$ is enforced by the term $p_i\log(p_i)$; otherwise the above objective would become infinite. As a result, the constraint $p_i\le 1$ is automatically satisfied due to $\sum_{i=1}^{n}p_i=1$ and $p_i\ge 0$. Hence, we only need to explicitly tackle the constraint $\sum_{i=1}^{n}p_i=1$. To this end, we define the following Lagrangian function:
$$L(p,\mu)=\sum_{i=1}^{n}p_i\ell_i(w)-\lambda\sum_{i=1}^{n}p_i\log(p_in)-\mu\Big(\sum_{i=1}^{n}p_i-1\Big),$$
where $\mu$ is the Lagrangian multiplier for the constraint $\sum_{i=1}^{n}p_i=1$. The optimal solution satisfies the KKT conditions:
$$-\ell_i(w)+\lambda\big(\log(p_i^*(x)n)+1\big)+\mu=0\quad\text{and}\quad\sum_{i=1}^{n}p_i^*(x)=1.$$
From the first equation, we can derive $p_i^*(x)\propto\exp(\ell_i(w)/\lambda)$. Due to the second equation, we can conclude that $p_i^*(x)=\frac{\exp(\ell_i(w)/\lambda)}{\sum_{i=1}^{n}\exp(\ell_i(w)/\lambda)}$. Plugging this optimal $p^*(x)$ into the inner maximization problem, we have
$$\lambda\log\Big(\frac{1}{n}\sum_{i=1}^{n}\exp\big(\ell_i(w)/\lambda\big)\Big)+\lambda\rho,$$
which is Eq. (2) in the paper.
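The closed-form maximizer can be verified numerically. The sketch below (illustrative, not code from the paper) checks two facts of the derivation on a toy loss vector: the softmax-form $p^*$ attains the largest value of the entropy-regularized inner objective among random feasible $p$, and its objective value equals $\lambda\log(\frac1n\sum_i e^{\ell_i/\lambda})$.

```python
import math
import random

# Numeric check of the derivation: p*_i = exp(l_i/lam) / sum_j exp(l_j/lam)
# maximizes phi(p) = sum_i p_i*l_i - lam * sum_i p_i*log(p_i*n) over the
# simplex, with optimal value lam * log((1/n) * sum_i exp(l_i/lam)).

def phi(p, losses, lam):
    n = len(p)
    return sum(pi * li for pi, li in zip(p, losses)) \
        - lam * sum(pi * math.log(pi * n) for pi in p if pi > 0)

def p_star(losses, lam):
    z = [math.exp(l / lam) for l in losses]
    s = sum(z)
    return [zi / s for zi in z]

random.seed(0)
losses, lam = [0.2, 1.5, 0.7, 2.1], 0.8
ps = p_star(losses, lam)
closed_form = lam * math.log(sum(math.exp(l / lam) for l in losses) / len(losses))
assert abs(phi(ps, losses, lam) - closed_form) < 1e-10

# compare against random points on the simplex (Dirichlet via Gamma samples)
for _ in range(200):
    g = [random.gammavariate(1.0, 1.0) for _ in losses]
    p = [gi / sum(g) for gi in g]
    assert phi(p, losses, lam) <= phi(ps, losses, lam) + 1e-12
```

Since the inner objective is strictly concave in $p$, the KKT point checked here is the unique maximizer, matching the plug-in value used in Eq. (2).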

