STOCHASTIC CONSTRAINED DRO WITH A COMPLEXITY INDEPENDENT OF SAMPLE SIZE

Abstract

Distributionally Robust Optimization (DRO), as a popular method to train robust models against distribution shift between training and test sets, has received tremendous attention in recent years. In this paper, we propose and analyze stochastic algorithms that apply to both non-convex and convex losses for solving the Kullback-Leibler (KL) divergence constrained DRO problem. Compared with existing methods for this problem, our stochastic algorithms not only enjoy a competitive, if not better, complexity that is independent of the sample size, but also require only a constant batch size at every iteration, which is more practical for broad applications. We establish a nearly optimal complexity bound for finding an ϵ-stationary solution for non-convex losses and an optimal complexity for finding an ϵ-optimal solution for convex losses. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems.

1. INTRODUCTION

Large-scale optimization of DRO has recently garnered increasing attention due to its promising performance on handling noisy labels, imbalanced data and adversarial data (Namkoong & Duchi, 2017; Zhu et al., 2019; Qi et al., 2020a; Chen & Paschalidis, 2018). Various primal-dual algorithms can be used for solving various DRO problems (Rafique et al., 2021; Nemirovski et al., 2009). However, primal-dual algorithms inevitably suffer from additional overhead for handling an n-dimensional dual variable, where n is the sample size. This is an undesirable feature for large-scale deep learning, where n could be in the order of millions or even billions. Hence, a recent trend is to design dual-free algorithms for solving various DRO problems (Qi et al., 2021; Jin et al., 2021; Levy et al., 2020). In this paper, we provide efficient dual-free algorithms, which are still lacking in the literature, for solving the following constrained DRO problem:

min_{w∈W} max_{p∈Δ_n: D(p,1/n)≤ρ} Σ_{i=1}^n p_i ℓ_i(w) − λ_0 D(p, 1/n),   (1)

where w denotes the model parameter, W is a closed convex set, Δ_n = {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0} denotes the n-dimensional simplex, ℓ_i(w) denotes the loss function on the i-th data point, D(p, 1/n) = Σ_{i=1}^n p_i log(p_i n) represents the Kullback-Leibler (KL) divergence between p and the uniform probabilities 1/n ∈ R^n, ρ is the constraint parameter, and λ_0 > 0 is a small constant. A small KL regularization on p is added to ensure that the objective in terms of w is smooth, which is needed for deriving fast convergence.

There are several reasons for considering the above constrained DRO problem. First, existing dual-free algorithms are not satisfactory (Qi et al., 2021; Jin et al., 2021; Levy et al., 2020; Hu et al., 2021).
They are either restricted to problems with no additional constraints on the dual variable p other than the simplex constraint (Qi et al., 2021; Jin et al., 2021), or restricted to convex analysis, or require a batch size that depends on the accuracy level (Levy et al., 2020; Hu et al., 2021). Second, the Kullback-Leibler divergence is a more natural metric for measuring the distance between two distributions than other measures, e.g., the Euclidean distance. Third, compared with the KL-regularized DRO problem without constraints, the above KL-constrained DRO formulation automatically decides a proper regularization effect that depends on the optimal solution, via tuning of the constraint upper bound ρ.

The question to be addressed is the following: Can we develop stochastic algorithms whose oracle complexity is optimal for both convex and non-convex losses and whose per-iteration complexity is independent of the sample size n, without imposing any requirement on a (large) batch size?

We address the above question by (i) deriving an equivalent primal-only formulation that is of a compositional form; (ii) designing two algorithms for non-convex losses and extending them to convex losses; and (iii) establishing optimal complexities for both convex and non-convex losses. In particular, for a non-convex and smooth loss function ℓ_i(w), we achieve an oracle complexity of O(1/ϵ^3)¹ for finding an ϵ-stationary solution; and for a convex and smooth loss function, we achieve an oracle complexity of O(1/ϵ^2) for finding an ϵ-optimal solution. We would like to emphasize that these results are on par with the best complexities achievable by primal-dual algorithms (Huang et al., 2020; Namkoong & Duchi, 2016). But our algorithms have a per-iteration complexity of O(d), which is independent of the sample size n. The convergence comparison of different methods for solving (1) is shown in Table 1.
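To make the inner maximization in formulation (1) concrete: for a fixed temperature τ > 0, the maximizer of Σ_i p_i ℓ_i(w) − τ D(p, 1/n) over the simplex (ignoring the ρ-constraint) is the softmax p_i ∝ exp(ℓ_i(w)/τ), a standard fact about KL-regularized linear objectives. The following NumPy sketch illustrates this; the function names and numeric values are our own, not the paper's notation:

```python
import numpy as np

def worst_case_weights(losses, tau):
    """Maximizer of sum_i p_i * losses[i] - tau * D(p, 1/n) over the simplex.

    The closed-form solution is the softmax p_i ∝ exp(losses[i] / tau);
    a larger tau pulls p back toward the uniform distribution 1/n.
    """
    z = losses / tau
    z = z - z.max()               # shift for numerical stability before exp
    p = np.exp(z)
    return p / p.sum()

def kl_to_uniform(p):
    """KL divergence D(p, 1/n) = sum_i p_i log(n * p_i)."""
    n = len(p)
    return float(np.sum(p * np.log(n * p)))

losses = np.array([0.1, 2.0, 0.5, 3.0])   # illustrative per-sample losses
p = worst_case_weights(losses, tau=1.0)
# The hardest example (largest loss) receives the largest weight.
```

This is why tuning ρ in (1) effectively selects a data-dependent regularization strength: a tighter KL budget forces p closer to uniform, while a looser one concentrates p on the hardest examples.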
To achieve these results, we first convert problem (1) into an equivalent problem:

min_{w∈W} min_{λ≥λ_0} F(w, λ) := λ log( (1/n) Σ_{i=1}^n exp(ℓ_i(w)/λ) ) + (λ − λ_0)ρ.   (2)

By considering x = (w^⊤, λ)^⊤ ∈ R^{d+1} as a single variable to be optimized, the objective becomes a compositional function of x of the form f(g(x)), where g(x) = (λ, (1/n) Σ_{i=1}^n exp(ℓ_i(w)/λ)) ∈ R^2 and f(g) = g_1 log(g_2) + g_1 ρ. However, there are several challenges to be addressed for achieving optimal complexities for both convex and non-convex loss functions ℓ_i(w). First, the problem F(x) is non-smooth in terms of x given the domain constraints w ∈ W and λ ≥ λ_0. Second, the gradient of the outer function f(g) is non-Lipschitz continuous in terms of the second coordinate g_2 if λ is unbounded, while Lipschitz continuity is essential for all existing stochastic compositional optimization algorithms. Third, to the best of our knowledge, no optimal complexity in the order of O(1/ϵ^2) has been achieved for a convex compositional function, except by Zhang & Lan (2021), which assumes f is convex and component-wise non-decreasing and hence is not applicable to (2). To address the first two challenges, we derive an upper bound λ̄ for the optimal λ, assuming that ℓ_i(w) is bounded for w ∈ W, i.e., λ ∈ [λ_0, λ̄], which allows us to establish the smoothness conditions of F(x) and f(g). We then consider optimizing F̂(x) = F(x) + δ_X(x), where δ_X(x) = 0 if x ∈ X = {x = (w^⊤, λ)^⊤ : w ∈ W, λ ∈ [λ_0, λ̄]} and δ_X(x) = ∞ otherwise. By leveraging the smoothness conditions of F and f, we design stochastic algorithms that utilize a recursive variance-reduction technique to compute a stochastic estimator of the gradient of F(x), which allows us to achieve a complexity of O(1/ϵ^3) for finding a solution x such that E[dist(0, ∂F̂(x))] ≤ ϵ. To address the third challenge, we consider optimizing F̂_μ(x) = F̂(x) + μ∥x∥^2/2 for a small μ.
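Given the per-sample losses, the primal-only objective F(w, λ) in (2) can be evaluated stably with the log-mean-exp trick. A minimal sketch, with illustrative default values for λ_0 and ρ (these defaults are our own choices, not values from the paper):

```python
import numpy as np

def log_mean_exp(z):
    """Numerically stable log( (1/n) * sum_i exp(z_i) )."""
    m = np.max(z)
    return m + np.log(np.mean(np.exp(z - m)))

def F(losses, lam, lam0=0.1, rho=0.5):
    """F(w, lam) = lam * log( (1/n) sum_i exp(l_i(w)/lam) ) + (lam - lam0)*rho,
    evaluated from the vector of per-sample losses l_i(w)."""
    return lam * log_mean_exp(np.asarray(losses) / lam) + (lam - lam0) * rho

# Minimizing over lam >= lam0 (here crudely, on a grid) picks the
# data-dependent temperature that the constrained formulation induces.
lam_grid = np.linspace(0.1, 5.0, 50)
toy_losses = np.array([0.2, 1.5, 0.7, 3.1])
best_lam = lam_grid[np.argmin([F(toy_losses, l) for l in lam_grid])]
```

Note that by Jensen's inequality λ log( (1/n) Σ_i exp(ℓ_i/λ) ) ≥ (1/n) Σ_i ℓ_i, with equality when all losses are equal, so F upper-bounds the ERM objective plus the (λ − λ_0)ρ term.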
We prove that F̂_μ(x) satisfies a Kurdyka-Łojasiewicz inequality, which allows us to boost the convergence of the aforementioned algorithm to an optimal complexity of O(1/ϵ^2) for finding an ϵ-optimal solution to F̂(x). Besides the optimal algorithms, we also present simpler algorithms with a worse complexity; these are more practical for deep learning applications because, unlike the optimal algorithms, they do not require two backpropagations at two different points per iteration.
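To give a flavor of the simpler single-backpropagation variant (this is a generic moving-average compositional-gradient sketch in the spirit of the simpler algorithms, not the paper's exact method), the code below tracks a running estimate u of the inner mapping g(w) = (1/n) Σ_i exp(ℓ_i(w)/λ) and plugs it into the chain rule for F(w) = λ log g(w), with λ held fixed and toy quadratic losses ℓ_i(w) = (w − b_i)²; all step sizes and names are our own choices:

```python
import numpy as np

def scgd_sketch(b, lam=2.0, eta=0.02, a=0.3, iters=5000, seed=0):
    """Moving-average (single-backprop) sketch for minimizing
    F(w) = lam * log( (1/n) sum_i exp(l_i(w)/lam) ), with l_i(w) = (w - b_i)^2.
    Maintains u ≈ g(w) = (1/n) sum_i exp(l_i(w)/lam), then applies the chain
    rule grad F ≈ lam * g_i'(w) / u on a single sampled component per step."""
    rng = np.random.default_rng(seed)
    w = 1.5
    i0 = rng.integers(len(b))
    u = np.exp((w - b[i0]) ** 2 / lam)       # initial estimate of g(w)
    tail = []
    for t in range(iters):
        i = rng.integers(len(b))
        gi = np.exp((w - b[i]) ** 2 / lam)   # sampled inner component g_i(w)
        u = (1 - a) * u + a * gi             # moving-average inner estimate
        dgi = gi * 2.0 * (w - b[i]) / lam    # derivative of g_i in w
        w -= eta * lam * dgi / max(u, 1e-8)  # chain rule: f'(u) = lam / u
        if t >= iters - 500:
            tail.append(w)                   # average late iterates
    return float(np.mean(tail))

w_hat = scgd_sketch(np.array([-1.0, 1.0]))   # symmetric toy data; minimizer is w = 0
```

Only one stochastic evaluation per iteration is needed here, whereas the variance-reduced optimal algorithms evaluate the sampled component at both the current and the previous iterate, i.e., two backpropagations per step in a deep learning setting.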

2. RELATED WORK

DRO springs from the robust optimization literature (Bertsimas et al., 2018; Ben-Tal et al., 2013) and has been extensively studied in machine learning and statistics (Namkoong & Duchi, 2017; Duchi et al., 2016; Staib & Jegelka, 2019; Deng et al., 2020; Qi et al., 2020b; Duchi & Namkoong, 2021), as well as in operations research (Rahimian & Mehrotra, 2019; Delage & Ye, 2010). Depending on how the uncertain variables are constrained or regularized, there are constrained DRO formulations, which specify a constraint set for the uncertain variables, and regularized DRO formulations, which add a regularization term on the uncertain variables to the objective (Levy et al., 2020). Duchi et al. (2016) showed that minimizing constrained DRO with an f-divergence, including a χ²-divergence constraint and a KL-divergence constraint, is equivalent to adding variance regularization to the Empirical Risk Minimization (ERM) objective, which is able to reduce the uncertainty and



¹ O(·) omits a logarithmic dependence on ϵ.

