STOCHASTIC CONSTRAINED DRO WITH A COMPLEXITY INDEPENDENT OF SAMPLE SIZE

Abstract

Distributionally Robust Optimization (DRO), a popular method for training models that are robust to distribution shift between training and test sets, has received tremendous attention in recent years. In this paper, we propose and analyze stochastic algorithms that apply to both non-convex and convex losses for solving the Kullback-Leibler (KL) divergence constrained DRO problem. Compared with existing methods for this problem, our stochastic algorithms not only enjoy a competitive, if not better, complexity that is independent of the sample size, but also require only a constant batch size at every iteration, which is more practical for broad applications. We establish a nearly optimal complexity bound for finding an ϵ-stationary solution for non-convex losses and an optimal complexity for finding an ϵ-optimal solution for convex losses. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems.

1. INTRODUCTION

Large-scale optimization of DRO has recently garnered increasing attention due to its promising performance on handling noisy labels, imbalanced data, and adversarial data (Namkoong & Duchi, 2017; Zhu et al., 2019; Qi et al., 2020a; Chen & Paschalidis, 2018). Various primal-dual algorithms can be used for solving various DRO problems (Rafique et al., 2021; Nemirovski et al., 2009). However, primal-dual algorithms inevitably suffer from the additional overhead of maintaining an n-dimensional dual variable, where n is the sample size. This is an undesirable feature for large-scale deep learning, where n could be on the order of millions or even billions. Hence, a recent trend is to design dual-free algorithms for solving various DRO problems (Qi et al., 2021; Jin et al., 2021; Levy et al., 2020). In this paper, we provide efficient dual-free algorithms, which are still lacking in the literature, for solving the following constrained DRO problem:

min_{w∈W} max_{p∈∆_n: D(p,1/n)≤ρ} Σ_{i=1}^n p_i ℓ_i(w) − λ_0 D(p, 1/n),

where w denotes the model parameter, W is a closed convex set, ∆_n = {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0} denotes the n-dimensional simplex, ℓ_i(w) denotes the loss on the i-th data point, D(p, 1/n) = Σ_{i=1}^n p_i log(p_i n) represents the Kullback-Leibler (KL) divergence between p and the uniform probabilities 1/n ∈ R^n, ρ is the constraint parameter, and λ_0 > 0 is a small constant. A small KL regularization on p is added to ensure that the objective in terms of w is smooth, which is needed for deriving fast convergence.

There are several reasons for considering the above constrained DRO problem. First, existing dual-free algorithms (Qi et al., 2021; Jin et al., 2021; Levy et al., 2020; Hu et al., 2021) are not satisfactory.
These methods are either restricted to problems with no constraint on the dual variable p beyond the simplex constraint (Qi et al., 2021; Jin et al., 2021), or restricted to convex analysis, or require a batch size that depends on the accuracy level (Levy et al., 2020; Hu et al., 2021). Second, the KL divergence is a more natural measure of the distance between two distributions than other measures, e.g., the Euclidean distance. Third, compared with the KL-regularized DRO problem without constraints, the above KL-constrained formulation automatically determines a proper regularization effect that depends on the optimal solution, through tuning the constraint upper bound ρ. The question to be addressed is the following:

