BITRATE-CONSTRAINED DRO: BEYOND WORST CASE ROBUSTNESS TO UNKNOWN GROUP SHIFTS

Abstract

Training machine learning models robust to distribution shifts is critical for real-world applications. Some robust training algorithms (e.g., Group DRO) specialize to group shifts and require group information on all training points. Other methods (e.g., CVaR DRO) that do not need group annotations can be overly conservative, since they naively upweight high-loss points, which may form a contrived set that does not correspond to any meaningful group in the real world (e.g., when the high-loss points are randomly mislabeled training points). In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function (indicator over group) is simple. For example, we may expect that group shifts occur along low-bitrate features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on groups realized by simple functions of these low-bitrate features, and need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this, we consider the two-player game formulation of DRO in which the adversary's capacity is bitrate-constrained. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group information on training samples, yet matches the performance of Group DRO on datasets that have training group annotations and that of CVaR DRO on long-tailed distributions. Our theoretical analysis reveals that in some settings the BR-DRO objective can provably yield statistically efficient and less conservative solutions than unconstrained CVaR DRO.

1. INTRODUCTION

Machine learning models may perform poorly when tested on distributions that differ from the training distribution. A common form of distribution shift is group shift, where the source and target differ only in the marginal distribution over finite groups or sub-populations, with no change in group conditionals (Oren et al., 2019; Duchi et al., 2019) (e.g., when the groups are defined by spurious correlations and the target distribution upsamples the group where the correlation is absent (Sagawa et al., 2019)). Prior works consider various approaches to address group shift. One solution is to ensure robustness to worst-case shifts using distributionally robust optimization (DRO) (Bagnell, 2005; Ben-Tal et al., 2013; Duchi et al., 2016), which considers a two-player game where a learner minimizes risk on distributions chosen by an adversary from a predefined uncertainty set. As the adversary is only constrained to propose distributions that lie within an f-divergence based uncertainty set, DRO often yields overly conservative (pessimistic) solutions (Hu et al., 2018) and can suffer from statistical challenges (Duchi et al., 2019). This is mainly because DRO upweights high loss points that may not form a meaningful group in the real world, and may even be contrived if the high loss points simply correspond to randomly mislabeled examples in the training set. Methods like Group DRO (Sagawa et al., 2019) avoid overly pessimistic solutions by assuming knowledge of group membership for each training example. However, these group-based methods provide no guarantees on shifts that deviate from the predefined groups (e.g., when there is a new group), and are not applicable to problems that lack group knowledge. In this work, we therefore ask: Can we train non-pessimistic robust models without access to group information on training samples? We address this question by considering a more nuanced assumption on the structure of the underlying groups.
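For concreteness, the uncertainty sets discussed above can be written schematically as follows. These are standard formulations from the DRO literature, not the paper's own notation; here $\ell(\theta; x, y)$ denotes the per-example loss, $P$ the training distribution, and $\rho, \alpha$ the sizes of the uncertainty sets:

```latex
% Generic f-divergence DRO: worst case over distributions Q near training P.
\min_{\theta} \; \sup_{Q \,:\, D_f(Q \,\|\, P) \le \rho} \; \mathbb{E}_{(x,y) \sim Q}\big[\ell(\theta; x, y)\big]

% CVaR DRO at level \alpha: the adversary may reweight points by at most 1/\alpha,
% i.e., concentrate on any subset of probability mass \alpha -- including a
% contrived set of high-loss (e.g., mislabeled) points.
\min_{\theta} \; \sup_{Q \,:\, \frac{dQ}{dP} \le \frac{1}{\alpha}} \; \mathbb{E}_{(x,y) \sim Q}\big[\ell(\theta; x, y)\big]

% Group DRO: the shift is restricted to marginals over known groups g,
% so group labels are required at training time.
\min_{\theta} \; \max_{g \in \mathcal{G}} \; \mathbb{E}_{(x,y) \sim P_g}\big[\ell(\theta; x, y)\big]
```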
We assume that, conditioned on the label, group boundaries are realized by high-level features that depend on a small set of underlying factors (e.g., background color, brightness). This leads to simpler group functions with large margin and simple decision boundaries between groups (Figure 1 (left)). Invoking the principle of minimum description length (Grünwald, 2007), restricting our adversary to functions that satisfy this assumption corresponds to a bitrate constraint. In DRO, the adversary upweights points with higher losses under the current learner, which in practice often correspond to examples that belong to a rare group, contain complex patterns, or are mislabeled (Carlini et al., 2019; Toneva et al., 2018). Restricting the adversary's capacity prevents it from upweighting individual hard or mislabeled examples (as they cannot be identified with simple features), and biases it towards identifying sets of points that are misclassified by simple features. This also complements the failure mode of neural networks trained with stochastic gradient descent (SGD), which rely on simple spurious features that correctly classify points in the majority group but may fail on minority groups (Blodgett et al., 2016). The main contribution of this paper is Bitrate-Constrained DRO (BR-DRO), a supervised learning procedure that provides robustness to distribution shifts along groups realized by simple functions. Despite not using group information on training examples, we demonstrate that BR-DRO can match the performance of methods requiring it. We also find that BR-DRO is more successful than unconstrained DRO in identifying true minority training points. This indicates that not optimizing for performance on contrived worst-case shifts can reduce the pessimism inherent in DRO.
This further validates: (i) our assumption on the simple nature of group shift; and (ii) that our bitrate constraint meaningfully structures the uncertainty set to be robust to such shifts. As a consequence of the constraint, we also find that BR-DRO is robust to random noise in the training data (Song et al., 2022), since it cannot form "groups" entirely based on randomly mislabeled points using low-bitrate features. This is in contrast with existing methods that use the learner's training error to upweight arbitrary sets of difficult training points (e.g., Liu et al., 2021; Levy et al., 2020), which we show are highly susceptible to label noise (see Figure 1 (right)). Finally, we theoretically analyze our approach, characterizing how the degree of constraint on the adversary can affect worst-case risk estimation and excess risk (pessimism) bounds, as well as convergence rates for specific online solvers.
* Correspondence can be sent to asetlur@cs.cmu.edu.
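To make the intuition behind a bitrate-constrained adversary concrete, the following toy sketch plays out the two-player game on synthetic data. This is our own illustration, not the paper's implementation: the data-generating process, the logistic parameterizations of learner and adversary, and all step sizes are assumptions chosen for clarity. The adversary's reweighting is restricted to a simple (logistic) function of a single low-bitrate feature and the label, so it cannot single out individual hard points:

```python
# Toy sketch (not the paper's implementation) of a DRO-style two-player game
# with a bitrate-constrained adversary.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a spurious "simple" feature s agrees with the label y in the
# majority group and disagrees in the minority group; c is a weak core feature.
n = 200
y = rng.integers(0, 2, n) * 2 - 1                 # labels in {-1, +1}
minority = rng.random(n) < 0.1                    # ~10% minority group
s = np.where(minority, -y, y) + 0.1 * rng.standard_normal(n)
c = 0.5 * y + rng.standard_normal(n)
X = np.stack([s, c], axis=1)

theta = np.zeros(2)   # learner: linear classifier over all features
phi = np.zeros(2)     # adversary: logistic function of (s * y, 1) only
U = np.stack([s * y, np.ones(n)], axis=1)         # adversary's simple features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    margins = y * (X @ theta)
    losses = np.log1p(np.exp(-margins))           # per-example logistic loss
    w = sigmoid(U @ phi)                          # per-example weights in [0, 1]
    # Adversary ascends the weighted excess loss (mean loss as a baseline).
    excess = losses - losses.mean()
    phi += 0.5 * (U.T @ (excess * w * (1.0 - w))) / n
    # Learner descends the adversarially weighted loss.
    p = sigmoid(-margins)                         # since d(loss)/d(margin) = -p
    grad_theta = -(X.T @ (w * p * y)) / n
    theta -= 0.5 * grad_theta

w = sigmoid(U @ phi)
```

On this toy problem the adversary, despite its restricted capacity, ends up placing higher average weight on the minority group (where the spurious feature disagrees with the label) rather than on any individual hard example, which is the behavior the bitrate constraint is meant to encourage.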

2. RELATED WORK

Prior works in robust ML (e.g., Li et al., 2018; Lipton et al., 2018; Goodfellow et al., 2014) address various forms of adversarial or structured shifts. We specifically review prior work on robustness to group shifts. While methods based on DRO optimize for worst-case shifts in an explicit uncertainty set, for others the robust set is implicit, with most using some form of importance weighting.

Distributionally robust optimization (DRO). DRO methods generally optimize for worst-case performance on joint (x, y) distributions that lie in an f-divergence ball (uncertainty set) around the training distribution (Ben-Tal et al., 2013; Rahimian & Mehrotra, 2019; Bertsimas et al., 2018; Blanchet & Murthy, 2019; Miyato et al., 2018; Duchi et al., 2016; Duchi & Namkoong, 2021). Hu et al. (2018) highlight that the conservative nature of DRO may lead to degenerate solutions when the unrestricted adversary uniformly upweights all misclassified points. Sagawa et al. (2019) propose to address this by limiting the adversary to shifts that only differ in marginals over predefined groups. However, in addition to the difficulty of obtaining this group information, Kearns et al. (2018) raise "gerrymandering" concerns with notions of robustness that fix a small number of groups a priori. While they propose a solution that considers exponentially many subgroups defined over protected attributes, our method does not assume access to such attributes and



Figure 1: Bitrate-Constrained DRO: A method that assumes group shifts along low-bitrate features, and restricts the adversary appropriately so that the solution found is less pessimistic and more robust to unknown group shifts. Our method is also robust to training noise. (Left) In Waterbirds (Wah et al., 2011), the spurious feature (background) is a large-margin simple feature that separates the majority and minority points in each class. (Right) Prior works (Levy et al., 2020; Liu et al., 2021) that upweight arbitrary points with high losses force the model to memorize noisy mislabeled points, while our method is robust to noise and only upweights the true minority group without any knowledge of its identity (see Section 6.2).

