DISTRIBUTIONALLY ROBUST POST-HOC CLASSIFIERS UNDER PRIOR SHIFTS

Abstract

The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributionally robust post-hoc classifiers. An empirical implementation is available at

1. INTRODUCTION

Distribution shift, a problem characterized by the shift of test distribution away from the training distribution, deteriorates the generalizability of machine learning models and is a major challenge for successfully deploying these models in the wild. We are specifically interested in distribution shifts resulting from changes in marginal class priors or group priors from training to test. This is often caused by a skewed distribution of classes or groups in the training data, and vanilla empirical risk minimization (ERM) can lead to models overfitting to spurious features. These spurious features seem to be predictive on the training data but do not generalize well to the test set. For example, the background can act as a spurious feature for predicting the object of interest in images, e.g., camels in a desert background, water-birds in water background (Sagawa et al., 2020) . Distributionally robust optimization (DRO) (Ben-Tal et al., 2013; Duchi et al., 2016; Duchi & Namkoong, 2018; Sagawa et al., 2020) is a popular framework to address this problem which formulates a robust optimization problem over class-or group-specific losses. The common metrics of interest in the DRO methods are either the average accuracy or the worst accuracy over classes/groups (Menon et al., 2021; Jitkrittum et al., 2022; Rosenfeld et al., 2022; Piratla et al., 2022; Sagawa et al., 2020; Zhai et al., 2021; Xu et al., 2020; Kirichenko et al., 2022) . However, these metrics only cover the two ends of the full spectrum of distribution shifts in the priors. We are instead motivated by the need to measure the robustness of the model at various points on the spectrum of distribution shifts. To this end, we consider applications where we are provided a target prior distribution (that could either come from a practitioner or default to the uniform distribution), and would like to train a model that is robust to varying distribution shifts around this prior. 
Instead of taking the conventional approach of optimizing for either the average accuracy or the worst-case accuracy, we seek to maximize the minimum accuracy within a δ-radius ball around the specified target distribution. This strategy allows us to encourage generalization on a spectrum of controlled distribution shifts governed by the parameter δ. When δ = 0, our objective is simply the average accuracy for the specified target priors, and when δ → ∞, it reduces to the model's worst-case accuracy, thus providing a natural way to interpolate between the two extreme goals of average and worst-case optimization. To train a classifier that performs well on the prescribed distributionally robust objective, we propose a fast and extremely lightweight post-hoc method that learns scaling adjustments to predictions from a pre-trained model. These adjustments are computed by solving a constrained optimization problem on a validation set, and then applied to the model during evaluation time. A key advantage of our method is that it is able to reuse the same pre-trained model for different robustness requirements by simply scaling the model predictions. This is in contrast to several existing DRO methods that train all model parameters using the robust optimization loss (Sagawa et al., 2020; Piratla et al., 2022), which requires group annotations for the training data and careful regularization to work with overparameterized models (Sagawa et al., 2020). On the other hand, our approach only needs group annotations for a smaller held-out set and works by only scaling the model predictions of a pre-trained model at test time. Our method also comes with provable convergence guarantees. We apply our method on standard benchmarks for class imbalance and group DRO, and show that it compares favorably to the existing methods when evaluated on a range of distribution shifts away from the target prior distribution.

2. BACKGROUND

We are primarily interested in two specific prior shifts for distributional robustness of classifiers. In this section, we briefly introduce the problem setting of the two prior shifts and set the notation. Class-Level Prior Shifts. We are interested in a multi-class classification problem with instance space $\mathcal{X}$ and output space $[m] = \{1, \ldots, m\}$. Let $\mathcal{D}$ denote the underlying data distribution over $\mathcal{X} \times [m]$, with the random instance $X$ and label $Y$ satisfying $(X, Y) \sim \mathcal{D}$. We define the conditional-class probability as $\eta_y(x) = P(Y = y \mid X = x)$ and the class priors as $\pi_y = P(Y = y)$; note that $\pi_y = E[\eta_y(x)]$. We use $u = [\frac{1}{m}, \ldots, \frac{1}{m}]$ to denote the uniform prior over $m$ classes. Our goal is then to learn a multi-class classifier $h : \mathcal{X} \to [m]$ that maps an instance $x \in \mathcal{X}$ to one of $m$ classes. We will do so by first learning a scoring function $f : \mathcal{X} \to \Delta_m$ that estimates the conditional-class probability for a given instance, and construct the classifier by predicting the class with the highest score: $h(x) = \arg\max_{j \in [m]} f_j(x)$. We measure the performance of a scoring function using a loss function $\ell : [m] \times \Delta_m \to \mathbb{R}_+$ and measure the per-class loss using $\ell_i(f) := E[\ell(y, f(x)) \mid y = i]$. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of training data samples. The empirical estimate of the training set prior is $\hat\pi_i := \frac{1}{n} \sum_{j \in [n]} 1(y_j = i)$, where $1(\cdot)$ is the indicator function. In class prior shift, the class prior probabilities at test time shift away from $\pi$. A special case of such class-level prior shifts is class-imbalanced learning (Lin et al., 2017; Cui et al., 2019; Cao et al., 2019; Ren et al., 2020; Menon et al., 2021), where $\pi$ is a long-tailed distribution while the class prior at test time is usually chosen to be the uniform distribution. Regular ERM tends to focus more on the majority classes at the expense of ignoring the loss of the minority classes.
Recent work (Menon et al., 2021) uses temperature-scaled logit adjustment with training class priors to adapt the model for average class accuracy. Our method also applies post-hoc adjustments to model probabilities, but our goal differs from (Menon et al., 2021) as we care for varying distribution shifts around the uniform prior and the scaling adjustments are learned using a held-out set to optimize for a constrained robust loss. Group-Level Prior Shifts. The notion of groups arises when each data point (x, y) is associated with some attribute a ∈ A that is spuriously correlated with the label. This is used to form m = |A|× |Y | groups as the cross-product of |A| attributes and |Y | classes. The data distribution D is taken to be a mixture of m groups with mixture prior probabilities π, and each group-conditional distribution given by D j , j ∈ [m]. In this scenario, we have n training samples {(x i , y i )} n i=1 drawn i.i.d. from D, with empirical group prior probabilities π. For skewed group-prior probabilities π, regular ERM is vulnerable to spurious correlations between the attributes and labels, and the accuracy degrades when the test data comes from a shifted group prior (e.g., balanced groups). Domain-aware methods typically assume that the attributes are available for the training examples and optimize for the worst or average group loss (Sagawa et al., 2020; Piratla et al., 2022) . However, recent work (Rosenfeld et al., 2022; Kirichenko et al., 2022) has observed that ERM on skewed group priors is able to learn core features (in addition to spurious features), and training a linear classifier on top of ERM learned features with a small balanced held-out set works quite well. Our proposed method is aligned with this recent line of work in assuming access to only a small held-out set with group annotations. 
However, we differ in two aspects: (i) our method is more lightweight and works by only scaling the model predictions post-hoc during test time; (ii) the scaling adjustments are learned to allow more control over the desired robustness properties than implicitly targeting the worst or average accuracies as in (Kirichenko et al., 2022). Evaluation Metrics under Prior Shifts. Typical evaluation metrics used under prior shifts are:
• Mean: This evaluation metric assigns uniform weights for each class or group, measuring the average class- or group-level test accuracy.
• Worst: This evaluation metric measures the performance on the worst class or group, and fails to examine the effectiveness of proposed methods on the remaining classes.
• Stratified: To overcome the issue of the worst-case metric, stratified metrics are sometimes used that divide the classes into three strata and report average accuracy in each stratum, specifically (i) Head: average accuracy on subsets of classes that contain more than a specified number (e.g., 100) of training samples; (ii) Torso: average accuracy on classes that contain, e.g., 20 to 100 samples; (iii) Tail: average accuracy on tail classes.
The mean and worst metrics are limited in that they probe the model robustness only on the two ends of the full spectrum. The stratified metric looks into three strata of classes, but the design of strata is more heuristic and doesn't allow for a principled interpolation between mean and worst metrics.
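The three metrics above can be computed directly from per-class accuracies and training-set class counts. The following is a minimal sketch, not the authors' evaluation code: the function name `evaluation_metrics` and the thresholds `head_min` and `tail_max` are illustrative assumptions, with the defaults taken from the example values (100 and 20) mentioned in the text.

```python
def evaluation_metrics(per_class_acc, train_counts, head_min=100, tail_max=20):
    """Mean, worst, and stratified (head/torso/tail) accuracy.

    per_class_acc: accuracy of each class on the test set.
    train_counts:  number of training samples of each class.
    head_min, tail_max: assumed stratum thresholds (hypothetical defaults).
    """
    assert len(per_class_acc) == len(train_counts)

    def avg(values):
        # An empty stratum has no defined average.
        return sum(values) / len(values) if values else None

    pairs = list(zip(per_class_acc, train_counts))
    head = [a for a, c in pairs if c > head_min]
    torso = [a for a, c in pairs if tail_max <= c <= head_min]
    tail = [a for a, c in pairs if c < tail_max]
    return {
        "mean": avg(per_class_acc),    # uniform weight on every class
        "worst": min(per_class_acc),   # single worst class
        "head": avg(head),
        "torso": avg(torso),
        "tail": avg(tail),
    }


metrics = evaluation_metrics([0.9, 0.7, 0.4], [500, 50, 5])
```

Note that none of these numbers says how accuracy degrades between the mean and the worst case, which is the gap the δ-worst accuracy of the next section is meant to fill.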

3. DISTRIBUTIONAL ROBUSTNESS OBJECTIVE UNDER PRIOR SHIFTS

We assume that we are given a target prior distribution (that could either come from a practitioner or default to the uniform distribution), and our goal is to train a model that is robust to varying distribution shifts around this prior. To this end, we consider the δ-worst accuracy: $\delta\text{-worst accuracy} := \min_{g \in \Delta_m} \sum_{i \in [m]} g_i \, \mathrm{Acc}_i$, s.t. $D(g, r) \le \delta$. (1) Here $\mathrm{Acc}_i$ is the accuracy on class (or group) $i$, $\Delta_m$ denotes the $(m-1)$-dimensional probability simplex and $r \in \Delta_m$ is a specified reference (or target) distribution. The δ-worst accuracy seeks the worst-case $g$-weighted performance with the weights constrained to lie within the δ-radius ball (defined by the divergence $D : \Delta_m \times \Delta_m \to \mathbb{R}$) around the target distribution $r$. For the uniform distribution $r = u$ and any choice of divergence $D$, it reduces to the mean accuracy for δ = 0 and the worst accuracy for δ → ∞. The objective interpolates between these two extremes for other values of δ and captures our goal of optimizing for variations around target priors instead of the more conventional objectives of optimizing for either the average accuracy at the target prior or the worst-case accuracy. The divergence constraint in the δ-worst objective is convex in $g$ for several divergence functions (e.g., f-divergence, Bregman divergence), and allows for efficient measurement as well (if the per-class accuracies are known) using standard packages such as CVXPY (Diamond & Boyd, 2016).
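For the KL divergence, the δ-worst accuracy can also be evaluated without a convex solver. The sketch below is an illustrative alternative to the CVXPY route mentioned above, not the authors' implementation: it assumes $D(g, r) = \mathrm{KL}(g \,\|\, r)$, uses the fact that the minimizer of a linear objective over a KL ball is an exponential tilt $g_i \propto r_i \exp(-\mathrm{Acc}_i/\lambda)$, and finds $\lambda$ by bisection. The function names are hypothetical.

```python
import math


def kl(g, r):
    """KL(g || r), with the convention 0 * log(0) = 0."""
    return sum(gi * math.log(gi / ri) for gi, ri in zip(g, r) if gi > 0)


def tilted(acc, r, lam):
    """Exponentially tilted weights g_i proportional to r_i * exp(-acc_i / lam)."""
    a_min = min(acc)  # shift for numerical stability
    w = [ri * math.exp(-(ai - a_min) / lam) for ai, ri in zip(acc, r)]
    z = sum(w)
    return [wi / z for wi in w]


def delta_worst_accuracy(acc, r, delta, iters=200):
    """min_g sum_i g_i * acc_i  subject to  KL(g || r) <= delta."""
    if delta <= 0:
        return sum(ri * ai for ri, ai in zip(r, acc))
    lo, hi = 1e-6, 1e6  # bracket for the dual variable lam
    g = tilted(acc, r, lo)
    if kl(g, r) <= delta:
        # The ball already contains the most adversarial tilt in the bracket.
        return sum(gi * ai for gi, ai in zip(g, acc))
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if kl(tilted(acc, r, mid), r) > delta:
            lo = mid  # tilt too aggressive: increase lam
        else:
            hi = mid
    g = tilted(acc, r, hi)  # feasible side of the bracket
    return sum(gi * ai for gi, ai in zip(g, acc))


acc = [0.9, 0.6, 0.3]
uniform = [1.0 / 3.0] * 3
curve = [delta_worst_accuracy(acc, uniform, d) for d in (0.0, 0.1, 5.0)]
```

With the uniform reference, the first entry of `curve` is the mean accuracy and the last is the worst-class accuracy, since any $\delta \ge \log m$ places every vertex of the simplex inside the KL ball.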

4. ROBUST POST-HOC CLASSIFIERS UNDER CLASS PRIOR SHIFTS

In this section, we propose a Distributionally RObust PoSt-hoc method, DROPS, which enables the reuse of a pre-trained model for different robustness requirements by simply scaling the model predictions. In alignment with the earlier δ-worst objective, we aim to learn a classifier $h$ with scoring function $f$ as follows: Goal: $\min_{f : \mathcal{X} \to \Delta_m} \max_{g \in \Delta_m} \sum_{i \in [m]} g_i \, P_{X \mid Y = i}(f(X) \neq Y)$, s.t. $D(g, r) \le \delta$. (2)

4.1. BAYES-OPTIMAL SCORER

Recall that we measure the performance of a scoring function using a loss function $\ell : [m] \times \Delta_m \to \mathbb{R}_+$, and the per-class loss using $\ell_i(f) := E[\ell(y, f(x)) \mid y = i]$. To facilitate theoretical analysis, we consider the scenario where the target distribution $r$ is the uniform prior $u = [\frac{1}{m}, \ldots, \frac{1}{m}]$. We are then interested in minimizing the following robust objective: $\min_{f : \mathcal{X} \to \Delta_m} \mathrm{DRL}(f; \delta) = \min_{f : \mathcal{X} \to \Delta_m} \max_{g \in G(\delta)} \sum_{i=1}^{m} g_i \, \ell_i(f)$, (3) where $G(\delta) = \{g \in \Delta_m \mid D(g, u) \le \delta\}$ for some δ > 0 and divergence function $D : \Delta_m \times \Delta_m \to \mathbb{R}_+$. We first derive the Bayes-optimal solution to the learning problem in equation 3. Theorem 1 (Bayes-optimal scorer). Suppose $\ell(y, z)$ is a proper loss that is convex in its second argument and $D(g, \cdot)$ is convex in $g$. Let δ > 0 be such that $G(\delta)$ is non-empty. The optimal solution to equation 3 takes the following form for some $g^* \in G(\delta)$: $f^*_y(x) \propto \frac{g^*_y}{\pi_y} \cdot \eta_y(x)$. The proof is provided in Appendix A.1 and uses, as an intermediate step, the following standard result for cost-sensitive learning: Lemma 2. Suppose $\ell(y, z)$ is a proper loss and $\ell_i(f) = E[\ell(y, f(x)) \mid y = i]$. For any fixed $g \in \Delta_m$, the following is the minimizer of the objective $\sum_{i=1}^{m} g_i \cdot \ell_i(f)$ over all measurable functions $f : \mathcal{X} \to \Delta_m$: $f^*_y(x) \propto \frac{g_y}{\pi_y} \cdot \eta_y(x)$.
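To make the form in Lemma 2 concrete, here is a small worked instance. The numbers are purely illustrative assumptions (two classes, a skewed prior, and a single instance $x$), not values from the experiments.

```latex
% Illustrative values (assumed): m = 2, pi = (0.9, 0.1), eta(x) = (0.7, 0.3), g = u = (0.5, 0.5).
\begin{align*}
\frac{g_1}{\pi_1}\,\eta_1(x) &= \frac{0.5}{0.9}\cdot 0.7 \approx 0.389, &
\frac{g_2}{\pi_2}\,\eta_2(x) &= \frac{0.5}{0.1}\cdot 0.3 = 1.5, \\
f^*_1(x) &= \frac{0.389}{0.389 + 1.5} \approx 0.206, &
f^*_2(x) &= \frac{1.5}{0.389 + 1.5} \approx 0.794 .
\end{align*}
```

Although $\eta(x)$ favors class 1, the re-weighted scorer predicts the rare class 2, because dividing by the small prior $\pi_2$ up-weights its score.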

4.2. POST-HOC APPROACH

Given a pre-trained model $\hat\eta : \mathcal{X} \to \Delta_m$ that estimates the conditional-class probability function $\eta$, we seek to approximate the Bayes-optimal classifier. To do so, we write the Lagrangian form of equation 3 as: $L(f, g, \lambda) = \sum_{i=1}^{m} g_i \cdot \ell_i(f) - \lambda (D(g, u) - \delta)$. (4) We next show that the optimization w.r.t. the Lagrangian form has the following equivalent unconstrained problem, as specified in Proposition 1. Proposition 1. The following two optimization tasks are equivalent: $\min_{f : \mathcal{X} \to \Delta_m} \mathrm{DRL}(f; \delta) \iff \min_{f : \mathcal{X} \to \Delta_m} \max_{g \in \Delta_m} \min_{\lambda \ge 0} L(f, g, \lambda)$. To understand why the above two optimization tasks are equivalent, we point out that if the constraint is violated, then the minimizer over $\lambda$ would yield an unbounded objective. The maximizer over $g$, as a result, will always choose a $g$ that satisfies the constraint. Building upon Proposition 1, one can then solve the equivalent min-max problem by alternating between a gradient-descent update on $\lambda$, an Exponentiated Gradient (EG)-ascent update (Kivinen & Warmuth, 1997) on $g$, and a full minimization over $f$: $\min_{f : \mathcal{X} \to \Delta_m} \sum_{i=1}^{m} g_i \cdot \ell_i(f)$, the optimal solution for which is given from Lemma 2 by $f^*_y(x) \propto \frac{g_y}{\pi_y} \cdot \eta_y(x)$. Thus, for the pre-trained model $\hat\eta$, we propose to make post-hoc adjustments over $\hat\eta$ to make the classifiers more robust to prior shifts. We construct the "post-shifted" classifier by predicting the class with the "post-shifted" highest score for each $x$: DROPS: $h(x) \in \arg\max_{i \in [m]} \frac{g_i}{\hat\pi_i} \cdot \hat\eta_i(x)$. Intuitively, the model prediction of class $i$ is up-scaled if class $i$ is assigned a large weight $g_i$, or has a small prior $\hat\pi_i$.

An Empirical Alternative. In practice, the model probabilities are typically obtained from logits through a softmax, $\hat\eta_i(x) = \exp(\mathrm{Logit}_i(x)) / \sum_{j \in [m]} \exp(\mathrm{Logit}_j(x))$. Thus, an empirical alternative of the proposed DROPS is a logit-adjustment approach where, given a sample $x$, for fixed class weights $\frac{g_i}{\hat\pi_i}$, we have:

DROPS: $h(x) \in \arg\max_{y \in [m]} \frac{g_y}{\hat\pi_y} \cdot \hat\eta_y(x) \iff h(x) \in \arg\max_{y \in [m]} \mathrm{Logit}_y(x) + \log\left(\frac{g_y}{\hat\pi_y}\right)$.
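The equivalence between scaling probabilities and shifting logits can be checked numerically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the class weights `g` and prior estimate `prior` are made-up inputs with strictly positive entries, and ties in the argmax are broken by the lowest index.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def drops_predict_from_probs(probs, g, prior):
    """argmax_i (g_i / prior_i) * probs_i  (post-shifted probabilities)."""
    scores = [gi / pi * p for gi, pi, p in zip(g, prior, probs)]
    return max(range(len(scores)), key=scores.__getitem__)


def drops_predict_from_logits(logits, g, prior):
    """argmax_i logits_i + log(g_i / prior_i)  (logit adjustment)."""
    scores = [z + math.log(gi / pi) for z, gi, pi in zip(logits, g, prior)]
    return max(range(len(scores)), key=scores.__getitem__)


logits = [2.0, 1.0, 0.0]          # illustrative model output
prior = [0.7, 0.2, 0.1]           # illustrative skewed class prior
g = [1.0 / 3.0] * 3               # uniform class weights
pred_probs = drops_predict_from_probs(softmax(logits), g, prior)
pred_logits = drops_predict_from_logits(logits, g, prior)
```

Both routes select the same class because the softmax normalizer is shared by all classes and the logarithm is monotone; here the adjusted prediction moves away from the unadjusted argmax (class 0) toward a rarer class.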

4.3. CONVERGENCE OF POST-HOC CLASSIFIER

In practice, we only have access to an empirical estimate of the Lagrangian form computed using a validation set $S = \{(x_1, \tilde y_1), \ldots, (x_n, \tilde y_n)\}$: $\hat L(f, g, \lambda) = \sum_{i=1}^{m} g_i \cdot \hat\ell_i(f) - \lambda (D(g, u) - \delta)$, where $\hat\ell_i(f) = \frac{1}{\hat\pi_i} \sum_{(x, y) \in S : y = i} \ell(y, f(x))$ and $\hat\pi_i = \frac{1}{n} |\{(x, y) \in S : y = i\}|$. We can then approximately solve the saddle-point problem in equation 4 by repeating the following three steps for $T$ iterations:

Step 1: updating $\lambda^{(t)}$. Given the step size $\eta_\lambda > 0$, we perform gradient updates on $\lambda$ through: $\lambda^{(t+1)} = \lambda^{(t)} - \eta_\lambda \nabla_\lambda \hat L(f^{(t)}, g^{(t)}, \lambda^{(t)}) = \lambda^{(t)} - \eta_\lambda (\delta - D(g^{(t)}, u))$. For the KL divergence, the update amounts to $\lambda^{(t+1)} \leftarrow \lambda^{(t)} - \eta_\lambda \left(\delta - \sum_i g_i^{(t)} \log\left(\frac{g_i^{(t)}}{u_i}\right)\right)$. We can clip $\lambda$ if the update falls out of the desired range: $\lambda^{(t+1)} \leftarrow \mathrm{clip}_{[0, R]}\left(\lambda^{(t)} - \eta_\lambda (\delta - D(g^{(t)}, u))\right)$ for some $R > 0$.

Step 2: updating $g^{(t)}$. Given the step size $\eta_g > 0$, we perform gradient updates on $g$ through: $g_i^{(t+1)} \leftarrow g_i^{(t)} \cdot \exp\left(\eta_g \cdot \nabla_{g_i} \hat L(f^{(t)}, g^{(t)}, \lambda^{(t)})\right)$. To obtain the updates for $g$, we consider the fixed $\lambda^{(t)} \in \mathbb{R}_+$ and empirically adopt a simplified version: $g_i^{(t+1)} \leftarrow u_i \cdot \exp\left(\hat\ell_i(f^{(t)}) / \lambda^{(t)}\right)$.

Step 3: scaling the predictions. In this step, we apply the post-hoc shifts on the classifier's prediction through $f_y^{(t+1)}(x) \propto \frac{g_y^{(t+1)}}{\hat\pi_y} \cdot \hat\eta_y(x)$, in which we use the pre-trained estimate $\hat\eta$ of the conditional-class probabilities.

The final scorer returned is an average of the scorers from each iteration: $\bar f(x) = \frac{1}{T} \sum_{t=1}^{T} f^{(t)}(x)$. We now provide a convergence guarantee for this procedure. For a fixed pre-trained model $\hat\eta : \mathcal{X} \to \Delta_m$, we denote the class of post-hoc adjusted functions derived from it by $\mathcal{F} = \left\{ f : x \mapsto \left( \frac{\beta_1 \hat\eta_1(x)}{\sum_j \beta_j \hat\eta_j(x)}, \ldots, \frac{\beta_m \hat\eta_m(x)}{\sum_j \beta_j \hat\eta_j(x)} \right) \,\middle|\, \beta \in \Delta_m \right\}$, and we let $|\mathcal{F}|$ denote a measure of complexity of this function class.

Theorem 3. Fix δ > 0. Suppose $\ell(y, z)$ is a proper loss that is convex and $L_\ell$-Lipschitz in its second argument, and $\ell(y, z) \le B_\ell$ for some $B_\ell > 0$. Suppose $D(g, u) \le B_D$ for some $B_D > 0$, and $D$ is convex and $L_D$-Lipschitz in its first argument. Furthermore, suppose $\max_{y \in [m]} \frac{1}{\pi_y} \le Z$ for some $Z > 0$. Then when $n \ge 8Z \log(2m/\alpha)$, and we set $T = O(n)$, $R = 2 B_\ell Z / \delta$, $\eta_\lambda = \frac{R}{B_D \sqrt{n}}$ and $\eta_g = \frac{1}{2 B_\ell Z + R L_D} \sqrt{\frac{\log(m)}{n}}$, we have with probability at least $1 - \alpha$ over the draw of the validation sample $S \sim \mathcal{D}^n$ that the returned classifier $\bar f(x) = \frac{1}{T} \sum_{t=1}^{T} f^{(t)}(x)$ satisfies: $\mathrm{DRL}(\bar f; \delta) \le \min_{f : \mathcal{X} \to \Delta_m} \mathrm{DRL}(f; \delta) + O\left( \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} + E_x\left[ \|\eta(x) - \hat\eta(x)\|_1 \right] \right)$.

The proof is provided in Appendix A.3. The complexity measure $|\mathcal{F}|$ can be further bounded using, for example, standard covering number arguments (Shalev-Shwartz & Ben-David, 2014). Notice that the convergence to the optimal classifier is bounded by two terms: the first is a sample complexity term that goes to 0 as the number of validation samples $n \to \infty$; the second term measures how well the pre-trained model $\hat\eta$ is calibrated, i.e., how well its scores match the underlying conditional-class probabilities.
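The three steps of Section 4.3 can be sketched on a toy validation set as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the divergence is KL, the loss is cross-entropy, the per-class loss is the per-class mean, the EG update is renormalized onto the simplex, `prior` stands for whatever class-prior estimate is used for scaling, and the last iterate of `g` is returned instead of the averaged scorer. All function names, step sizes, and the initial `lam` are hypothetical.

```python
import math


def kl(g, u):
    """KL(g || u), with the convention 0 * log(0) = 0."""
    return sum(gi * math.log(gi / ui) for gi, ui in zip(g, u) if gi > 0)


def post_shift(probs, g, prior):
    """Step 3: f_y(x) proportional to (g_y / prior_y) * probs_y."""
    scores = [gi / pi * p for gi, pi, p in zip(g, prior, probs)]
    total = sum(scores)
    return [s / total for s in scores]


def per_class_losses(val_set, g, prior):
    """Mean cross-entropy of the post-shifted scorer on each class."""
    m = len(prior)
    totals, counts = [0.0] * m, [0] * m
    for probs, y in val_set:
        f = post_shift(probs, g, prior)
        totals[y] += -math.log(max(f[y], 1e-12))
        counts[y] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def drops_weights(val_set, prior, delta, steps=200,
                  eta_lam=0.1, eta_g=0.1, R=10.0):
    m = len(prior)
    u = [1.0 / m] * m
    g, lam = list(u), 1.0
    for _ in range(steps):
        # Step 1: gradient step on lam, clipped to [0, R].
        lam = min(max(lam - eta_lam * (delta - kl(g, u)), 0.0), R)
        # Step 2: exponentiated-gradient ascent on g.
        # d/dg_i of the Lagrangian with KL: loss_i - lam * (log(g_i/u_i) + 1).
        losses = per_class_losses(val_set, g, prior)
        grads = [l - lam * (math.log(gi / ui) + 1.0)
                 for l, gi, ui in zip(losses, g, u)]
        g = [gi * math.exp(eta_g * gr) for gi, gr in zip(g, grads)]
        total = sum(g)
        g = [gi / total for gi in g]  # renormalize onto the simplex
    return g


# Toy validation set: (model probabilities, true label) pairs.
val_set = [([0.9, 0.1], 0), ([0.8, 0.2], 0), ([0.7, 0.3], 1), ([0.4, 0.6], 1)]
g = drops_weights(val_set, prior=[0.8, 0.2], delta=0.1)
```

The learned `g` would then be plugged into the DROPS prediction rule of Section 4.2; only the $m$ weights are updated, so the pre-trained model itself is never retrained.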

4.4. ROBUST POST-HOC CLASSIFIERS UNDER GROUP PRIOR SHIFTS

We now introduce the variant of our approach that addresses group prior shifts. We are interested in the performance of the trained classifiers with additional attribute information available in a held-out validation set. Note that the settings of class and group prior shifts differ in the knowledge of group information at validation and test time. Thus, the first variant of DROPS under group prior shifts is the same as in the class prior setting: DROPS completely ignores the per-example group information and post-shifts the model predictions using only class labels. We consider another variant where we have access to the attribute information during both validation and test. This scenario can arise in practice when the attribute information is readily available for test examples, such as device or sensor type in the case of data coming from different devices/sensors. In this case, the attribute-specific class priors take the form $\pi_{a,i} = P(y = i \mid a)$. DROPS can be naturally adapted to this setting by learning multiple sets of scaling adjustments, one for each attribute type, and using the scaling adjustment corresponding to the attribute of the test example at prediction time. We provide the discussion and the form of the Bayes-optimal classifier for this setting in Appendix B.
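The attribute-aware variant amounts to a lookup of the scaling vector by attribute before taking the argmax. The sketch below is illustrative only: the function name, the attribute names, and the numeric scaling values (standing in for learned ratios $g_{a,i} / \pi_{a,i}$) are all hypothetical.

```python
def predict_with_attribute(probs, attribute, adjustments):
    """Scale the model probabilities with the adjustment learned for this
    attribute, then predict the class with the highest adjusted score.

    adjustments: dict mapping attribute -> list of per-class scaling
    factors (assumed to have been learned separately per attribute).
    """
    weights = adjustments[attribute]
    scores = [w * p for w, p in zip(weights, probs)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-attribute scaling factors for a two-class problem.
adjustments = {
    "land": [1.0, 4.0],
    "water": [4.0, 1.0],
}
probs = [0.7, 0.3]
pred_land = predict_with_attribute(probs, "land", adjustments)
pred_water = predict_with_attribute(probs, "water", adjustments)
```

The same model output can therefore lead to different predictions depending on the attribute of the test example, which is the intended effect of keeping one adjustment set per attribute.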

5. RELATED WORK

Class-Imbalanced Learning. Most work in class-imbalanced learning is typically interested in generalizing on a uniform class prior when the training data has a skewed or imbalanced class prior. Existing solutions to the class imbalanced learning problem could be categorized into three major lines: (1) Information augmentation methods, which make use of additional information such as open set data (Wei et al., 2021) , adopt a transfer learning approach to enrich the representation on the tail classes (Liu et al., 2020; Yin et al., 2019) , or use advanced data-augmentation techniques (Perez & Wang, 2017; Shorten & Khoshgoftaar, 2019) ; (2) Module improvement methods, i.e., the decoupled training on the head/tail classes (Kang et al., 2019; Chu et al., 2020; Zhong et al., 2021) , or through an ensemble way to make use of multiple networks with different expertise/concentration (Zhou et al., 2020; Guo & Wang, 2021; Wang et al., 2020) ; (3) The most related to our work are the class re-balancing based methods, which mitigate the impact of class-imbalanced data by adjusting the logits using the class prior (Menon et al., 2021) , or align the distributions of the model prediction and a set of balanced validation set (Zhang et al., 2021b) , or modify the loss values by referring to the label frequency (Ren et al., 2020) , sample influence (Park et al., 2021) , among many other robust loss designs (Amid et al., 2019; Wei & Liu, 2021; Zhu et al., 2021) and re-weighting schemes (Kumar & Amid, 2021; Cheng et al., 2021; Wei et al., 2022) . Label shift is a related problem setting where the training class prior is not so imbalanced but the class prior shifts during test time with p(x|y) staying the same, and the goal is to mitigate the effect of this shift (Lipton et al., 2018; Azizzadenesheli et al., 2019; Alexandari et al., 2020) . 
However, our goal differs from these methods as we are primarily interested in generalization at worst case variations around the target distribution.

Group Distributional Robustness.

It has been observed that classifiers learned with regular ERM are vulnerable to spurious correlations between the attributes and labels, and tend to perform worse when the test data comes from a shifted group prior. Most methods for group distributional robustness are interested in the average or worst group performance. Several prior works utilize the group information at training time (Sagawa et al., 2020; Piratla et al., 2022). Recent works consider a more practical setting where the classifier does not have access to the group information at training time, e.g., data re-balancing (Idrissi et al., 2022), re-weighting high-loss examples (Liu et al., 2021), logit correction (Liu et al., 2023), or vision transformer (ViT) models (Ghosal et al., 2022). It was also observed recently that ERM is able to learn features that can be reused for group robustness by training a linear classifier using a balanced held-out set (Kirichenko et al., 2022; Rosenfeld et al., 2022). Most relevant to us is AdvShift (Zhang et al., 2021a), which mitigates the impact of label shifts by optimizing a distributionally robust objective function with respect to the model parameters. Our work is in a similar vein in assuming access to only a held-out set with group annotations; however, our method is more lightweight and allows optimizing for varying worst-case perturbations around the target distribution of interest using only post-hoc scaling of model predictions. Other related work includes CGD (Piratla et al., 2022), which proposes a learning paradigm for optimizing the average group accuracy, and Invariant Risk Minimization (IRM) (Arjovsky et al., 2019; Ahuja et al., 2020), which aims to learn core features using data from multiple environments to mitigate spurious correlations.

6. EXPERIMENTS

In this section, we empirically demonstrate the effectiveness of our proposed method DROPS, for the tasks of class-imbalanced learning and group distributional robustness.

6.1. EXPERIMENTS ON CLASS-IMBALANCED LEARNING

We consider the class-imbalanced task to illustrate the robustness of DROPS under class prior shifts. For the CIFAR-10 and CIFAR-100 datasets, we down-sample the number of samples for each class to simulate class imbalance, as done in earlier works (Cui et al., 2019; Cao et al., 2019). We define the imbalance ratio as ρ :=
To demonstrate the effectiveness of DROPS, we compare the performance of our proposed method with several popular class-imbalanced learning approaches, including: Cross-Entropy (CE) loss, Focal Loss (Lin et al., 2017), Class-Balanced (CB) loss (Cui et al., 2019), LDAM (Cao et al., 2019), Balanced-Softmax (Ren et al., 2020), Logit-adjustment (Menon et al., 2021), and AdvShift (Zhang et al., 2021a) on the original test data. All methods are trained with the same architecture (PreAct-ResNet 18 (He et al., 2016)) with 5 random seeds, the same data augmentation techniques, and the same SGD optimizer with a momentum of 0.9 and Nesterov acceleration. All methods share the same initial learning rate of 0.1 and a piece-wise constant learning rate schedule of $[10^{-2}, 10^{-3}, 10^{-4}]$ at [30, 80, 110] epochs, respectively. We use a batch size of 128 for all methods and train the model for 140 epochs. A balanced held-out validation set is utilized for hyper-parameter tuning. All baseline models are picked by referring to the δ = 1.0-worst case performance on the validation set, which is made up of the last 10% of the original CIFAR training dataset. DROPS obtains the optimal post-hoc shifts under a variety of $\delta_{\text{train}}$ parameters, where $\delta_{\text{train}}$ is the perturbation hyper-parameter δ in the Lagrangian of equation 4. We take the divergence $D$ to be the KL-divergence. Performance Comparisons on the Generic δ-Worst Case Accuracy.
Experimental results in Table 1 demonstrate that DROPS not only gives promising performance under the reported δ-worst case accuracy (for δ = 1.0), but also outperforms other methods in the worst case accuracy, and remains competitive on the mean accuracy as well (performing best on the CIFAR-10 setting). To further investigate the robustness of each method under different levels of prior shift, we visualize the performance of each method under a list of δ values for the δ-worst case accuracy. For the uniform distribution $r = u$, the δ values recover both the mean accuracy (with δ = 0) and the worst accuracy (for large enough δ), and interpolate between these two metrics for other values of δ. For DROPS, we train using $\delta_{\text{train}} \in \{0.5, 1.0, 1.5, 2.0\}$ (CIFAR-10) and $\delta_{\text{train}} \in \{0.5, 1.0, 2.0, 3.0, 4.0\}$ (CIFAR-100) with the KL-divergence while evaluating for the aforementioned range of $\delta_{\text{eval}}$ values. We also evaluate using the Reverse-KL divergence to examine the behavior under a different divergence function from that used in learning the scaling adjustments with DROPS. Figure 1 illustrates the robustness and effectiveness of our proposed DROPS method: specifically, in each sub-figure of Figure 1, the x-axis indicates the value of δ, and the y-axis denotes the corresponding δ-worst case accuracy. For the experimental results on CIFAR-10 (1st row), the curves of DROPS under different $\delta_{\text{train}}$ are consistently higher than the other baselines, indicating that as the perturbation level of the distribution shift increases (from left to right in each sub-figure), DROPS is more robust to the distribution shift. As for the CIFAR-100 dataset, we do observe that optimizing for the controlled worst case performance may lead to a trade-off between the worst case accuracy and the averaged test accuracy, i.e., Logit-Adj is more competitive than DROPS in terms of the averaged accuracy, while DROPS still suffers less from the increase of the perturbation level.
[Figure 1: δ-worst case accuracy (y-axis) as a function of δ (x-axis). 1st row: CIFAR-10 with imbalance ratio ρ = 100, using the KL-divergence (left) and the Reverse-KL divergence (right) for the δ-worst case accuracy calculation; 2nd row: CIFAR-100 with imbalance ratio ρ = 100, using the KL-divergence (left) and the Reverse-KL divergence (right) for the δ-worst case accuracy calculation.]

6.2. EXPERIMENTS ON GROUP ROBUSTNESS TASKS

We consider the group robustness tasks to show the robustness of DROPS under group prior shifts. For the Waterbirds and CelebA datasets, we obtain the datasets as described in (Sagawa et al., 2020). We compare the performance of our proposed method with several popular group-DRO approaches, i.e., Just-train-twice (JTT) (Liu et al., 2021), Group Distributional Robustness (G-DRO) (Sagawa et al., 2020), data balancing strategies (SUBG) (Idrissi et al., 2021), and last-layer re-training (DFR) (Kirichenko et al., 2022). Among the baseline methods, ERM, JTT, G-DRO, and SUBG are trained with the same architecture (ResNet 50 (He et al., 2016)) for 5 runs; the base model for DFR and our approach DROPS is the same ERM pre-trained model. Note that the knowledge requirement over group information and validation set differs among the reported methods; we clarify the differences in the column "Group Info" of Table 2. As for DROPS on the group DRO datasets, since there are only two classes within the two datasets, we define a single post-hoc scalar $w$, treated as a hyper-parameter, that scales the classifier's prediction on class 1. Note that the spurious features (minority groups among each class) tend to have a significant impact on the ERM performance because of the imbalanced group distribution. We use DROPS to learn post-hoc corrections both for the ERM-trained model and for the DFR (Kirichenko et al., 2022) model that additionally retrains the last layer of the ERM-trained model with a balanced held-out set. We refer to the post-hoc corrected DFR as DROPS* in the result tables. Performance Comparisons on Various δ-Worst Accuracies. Post-hoc scaling on the ERM model (referred to as DROPS in Table 2) improves the robustness of the model as measured at various δ values (including δ = 0 and δ → ∞). DROPS also outperforms JTT on the CelebA dataset.
When applying post-hoc scaling to a better pre-trained model, i.e., with DFR (Kirichenko et al., 2022) that re-trains the last layer on the validation set, DROPS* outperforms all baseline methods in most settings, with the performance improvement especially clear on the CelebA dataset. 

7. CONCLUSIONS

We study the problem of improving the distributional robustness of a pre-trained model under controlled distribution shifts. We propose DROPS, a fast and lightweight post-hoc method that learns scaling adjustments to predictions from a pre-trained model. DROPS learns the adjustments by solving a constrained optimization problem on a held-out validation set, and then applies these adjustments to the model predictions during evaluation time. DROPS is able to reuse the same pre-trained model for different robustness requirements by simply scaling the model predictions. For group robustness tasks, our approach only needs group annotations for a smaller held-out set. We also show provable convergence guarantees for our method. Experimental results on standard benchmarks for class imbalance (CIFAR-10, CIFAR-100) and group DRO (Waterbirds, CelebA) demonstrate the effectiveness and robustness of DROPS when evaluated on a range of distribution shifts away from the target prior distribution.

A PROOFS

A.1 PROOF OF THEOREM 1 The proof is adaptation of a similar result in Wang et al. (2022) . We first prove Lemma 2. Proof of Lemma 2. We first expand the objective: m i=1 g i • i (f ) = m i=1 g i • E x|y=i [ (i, f (x))] = m i=1 g i π i • E x [η i (x) • (i, f (x))] . Given that is a proper loss, we have that the minimizer of this objective takes the form: f i (x) = gi πi • η i (x) m j=1 gj πj • η j (x) . We are now ready to prove Theorem 1. Proof of Theorem 1. The min-max problem in equation 3 can be expanded as: min f :X →∆m DRL(f ; δ) = min f :X →∆m max g∈G(δ) m y=1 g y π y • E [η y (x) • (y, f (x))] ω(g,f ) . Note that the objective ω(g, f ) is clearly linear in g (for fixed f ), and with chosen to be convex in f (x) (for fixed g), i.e., ω(g, κf 1 + (1 -κ)f 2 ) ≤ κω(g, f 1 ) + (1 -κ)ω(g, f 2 ), ∀f 1 , f 2 : X → ∆ m , κ ∈ [0, 1]. Furthermore, given that divergence D(g, •) is convex in g, we have that G(δ) is a convex compact set, while the domain of f is convex. It follows from Sion's minimax theorem (Sion, 1958) that: min f :X →∆m max g∈G(δ) ω(g, f ) = max g∈G(δ) min f :X →∆m ω(g, f ). Let (g * , f * ) be such that: f * ∈ argmin f :X →∆m DRL(f ; δ) = argmin f :X →∆m max g∈G(δ) ω(g, f ); g * ∈ argmax g∈G(δ) min f :X →∆m ω(g, f ). Such an f * exists because DRL(f ; δ) takes a bounded value when f = η, and any minimizer of DRL(f ; δ) yields a value below that; because DRL(f ; δ) ≥ 0 and is convex in f , such a minimizer exits. Similarly, g * also exists because G(δ) is a compact set, and for any fixed g, min f :X →∆m ω(g, f ) is bounded above (owing to the existence of a minimizer from Lemma 2). We now show that (g * , f * ) is a saddle-point for equation 6, i.e., ω(g * , f * ) = max g∈G(δ) ω(g, f * ) = min f :X →∆m ω(g * , f ). 
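To see this, notice that:
$$\omega(g^*, f^*) \le \max_{g \in \mathbb{G}(\delta)} \omega(g, f^*) = \min_{f: \mathcal{X} \to \Delta_m} \max_{g \in \mathbb{G}(\delta)} \omega(g, f) = \max_{g \in \mathbb{G}(\delta)} \min_{f: \mathcal{X} \to \Delta_m} \omega(g, f) = \min_{f: \mathcal{X} \to \Delta_m} \omega(g^*, f) \le \omega(g^*, f^*), \tag{8}$$
where we are able to swap the min and max in the second step using equation 7. We thus have from equation 8 that $f^*$ is a minimizer of $\omega(g^*, f)$, i.e.,
$$f^* \in \operatorname*{argmin}_{f: \mathcal{X} \to \Delta_m} \sum_{y=1}^m \frac{g^*_y}{\pi_y} \cdot \mathbb{E}_x\left[\eta_y(x) \cdot \ell(y, f(x))\right].$$
Following Lemma 2, we further have that for any $x \in \mathcal{X}$:
$$f^*_y(x) \propto \frac{g^*_y}{\pi_y}\, \eta_y(x).$$

A.2 PROOF OF PROPOSITION 1

Proof. We now prove the equivalence of the two optimization tasks:
$$\min_{f: \mathcal{X} \to \Delta_m} \mathrm{DRL}(f; \delta) \iff \min_{f: \mathcal{X} \to \Delta_m} \max_{g \in \Delta_m} \min_{\lambda \ge 0} \mathcal{L}(f, g, \lambda).$$

"=⇒"

Recall that $\min_{f: \mathcal{X} \to \Delta_m} \mathrm{DRL}(f; \delta) = \min_{f: \mathcal{X} \to \Delta_m} \max_{g \in \mathbb{G}(\delta)} \sum_{i=1}^m g_i \cdot \ell_i(f)$. We first show that any $(f^*, g^*)$ given by the L.H.S. yields the optimum of equation 4 (the R.H.S.), for some $\lambda^*$. Note that for any $f$ and any $g \in \mathbb{G}(\delta)$, i.e., any $g$ such that $D(g, r) \le \delta$, we have:
$$\min_{f: \mathcal{X} \to \Delta_m} \max_{g \in \mathbb{G}(\delta)} \sum_{i \in [m]} g_i\, \ell_i(f) \ge \sum_{i \in [m]} g^*_i\, \ell_i(f^*).$$
Plugging $f^*, g^*$ into the R.H.S., we then have:
$$\mathcal{L}(f^*, g^*, \lambda) = \sum_{i \in [m]} g^*_i\, \ell_i(f^*) - \lambda \left(D(g^*, r) - \delta\right).$$
Since $D(g^*, r) \le \delta$, there exists $\lambda^* \in \mathbb{R}_+$ that solves the optimization task $\lambda^* = \operatorname*{argmin}_{\lambda \in \mathbb{R}_+} \mathcal{L}(f^*, g^*, \lambda)$. To show that $(f^*, g^*, \lambda^*)$ returns the optimum of the R.H.S., we argue by contradiction and assume that there exist $f', g', \lambda'$ such that:
$$\mathcal{L}(f', g', \lambda') < \mathcal{L}(f^*, g^*, \lambda^*).$$
This indicates that:
$$\sum_i g'_i\, \ell_i(f') - \sum_i g^*_i\, \ell_i(f^*) - \left(\lambda' \left(D(g', r) - \delta\right) - \lambda^* \left(D(g^*, r) - \delta\right)\right) < 0 \iff \sum_i g'_i\, \ell_i(f') - \sum_i g^*_i\, \ell_i(f^*) < 0,$$
where the equivalence is due to the complementary slackness condition. This contradicts the fact that $\sum_i g'_i\, \ell_i(f') \ge \sum_i g^*_i\, \ell_i(f^*)$. Thus, $(f^*, g^*, \lambda^*)$ returns the optimum of the R.H.S.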

"⇐="

We next prove that any $(f^*, g^*, \lambda^*)$ given by the R.H.S. yields the optimum of the L.H.S. Due to the complementary slackness condition, we have $\lambda^* \left(D(g^*, r) - \delta\right) = 0$. Since $\lambda^* \in \mathbb{R}_+$, we then have $D(g^*, r) - \delta = 0$, and the constraint in the L.H.S. is satisfied. Thus, we have:
$$\text{R.H.S.} = \sum_i g^*_i\, \ell_i(f^*).$$
Again, if there exist $g', f'$ such that the L.H.S. is minimized, where $g' \ne g^*$, $f' \ne f^*$, we then have $(g', f', \lambda' = 0)$, which satisfies
$$\mathcal{L}(f', g', \lambda') < \mathcal{L}(f^*, g^*, \lambda^*),$$
contradicting the optimality of $(f^*, g^*, \lambda^*)$. This completes the proof.

A.3 PROOF OF THEOREM 3

The proof builds on prior convergence results in constrained and robust optimization (Narasimhan et al., 2019; Wang et al., 2022). We will find it useful to define the following quantities:
$$\mathcal{L}_1(f, g) = \sum_{i=1}^m g_i \cdot \ell_i(f); \qquad \hat{\mathcal{L}}_1(f, g) = \sum_{i=1}^m g_i \cdot \hat{\ell}_i(f); \qquad \mathcal{L}_2(g, \lambda) = -\lambda \left(D(g, u) - \delta\right).$$
We further define the averages of the iterates $\bar{\lambda} = \frac{1}{T} \sum_{t=1}^T \lambda^{(t)}$ and $\bar{g} = \frac{1}{T} \sum_{t=1}^T g^{(t)}$. We will also find the following lemmas useful. The first is a bound on our estimate of the class priors.

Lemma 4. Under the assumptions in Theorem 1, with probability at least $1 - \alpha/2$ over draw of validation sample $S \sim D^n$:
$$\hat{\pi}_y \ge \frac{1}{2Z}, \quad \forall y \in [m].$$

Proof. The proof follows from a direct application of Chernoff's bound (along with a union bound over all $m$ classes), noting that $\min_{y \in [m]} \pi_y \ge \frac{1}{Z}$ and $n \ge 8Z \log(2m/\alpha)$.

Throughout the proof, we will assume that the statement in the above lemma holds with probability at least $1 - \alpha/2$. Our second lemma shows that the equivalence between the saddle-point optimization in equation 4 and the original constrained optimization problem in equation 3 still holds when we minimize the Lagrange multiplier only over a bounded set:

Lemma 5. Under the assumptions in Theorem 1, we have for any $f: \mathcal{X} \to \Delta_m$:
$$\min_{\lambda \in [0, R]} \max_{g \in \Delta_m} \mathcal{L}(f, g, \lambda) = \max_{g \in \mathbb{G}(\delta)} \sum_{i=1}^m g_i \cdot \ell_i(f).$$

Proof. Let $\lambda^* \in \operatorname*{argmin}_{\lambda \ge 0} \max_{g \in \Delta_m} \mathcal{L}(f, g, \lambda)$ be the $\lambda$-minimizer over all non-negative reals $\mathbb{R}_+$. Such a minimizer exists for the following reason. Owing to the continuity of $D$, we know there exists at least one $g' \in \Delta_m$ for which $D(g', u) = \delta$, and therefore the minimization objective is bounded:
$$\max_{g \in \Delta_m} \mathcal{L}(f, g, \lambda) \le \mathcal{L}(f, g', \lambda) = \sum_{i=1}^m g'_i \cdot \ell_i(f) \le ZB.$$
It remains to be shown that $\lambda^* \le R$. To this end, let $g^* \in \operatorname*{argmax}_{g \in \Delta_m: D(g, u) \le \delta} \sum_{i=1}^m g_i \cdot \ell_i(f)$. We note that:
$$\sum_{i=1}^m g^*_i \cdot \ell_i(f) = \min_{\lambda \ge 0} \max_{g \in \Delta_m} \mathcal{L}(f, g, \lambda) = \max_{g \in \Delta_m} \mathcal{L}(f, g, \lambda^*) = \max_{g \in \Delta_m} \sum_{i=1}^m g_i \cdot \ell_i(f) - \lambda^* \left(D(g, u) - \delta\right).$$
Choose $\tilde{g}$ such that $D(\tilde{g}, u) = \delta/2$, which exists thanks to the continuity of $D$.
Upper bounding the max on the RHS in the above equality by substituting $\tilde{g}$, we get:
$$\sum_{i=1}^m g^*_i \cdot \ell_i(f) \le \sum_{i=1}^m \tilde{g}_i \cdot \ell_i(f) - \lambda^* \left(D(\tilde{g}, u) - \delta\right) = \sum_{i=1}^m \tilde{g}_i \cdot \ell_i(f) - \lambda^* \delta/2,$$
which gives us:
$$\lambda^* \le \frac{2}{\delta} \left(\sum_{i=1}^m \tilde{g}_i \cdot \ell_i(f) - \sum_{i=1}^m g^*_i \cdot \ell_i(f)\right) \le \frac{2}{\delta} ZB = R,$$
as desired.

The lemmas below follow from Lemma 4 and standard results in online convex optimization (Shalev-Shwartz et al., 2011).

Published as a conference paper at ICLR 2023

Lemma 6. Under the assumptions in Theorem 1, and for $\eta_g = \frac{1}{2BZ + R L_D} \sqrt{\frac{\log(m)}{T}}$, with probability at least $1 - \alpha/2$ over draw of $S \sim D^n$, the sequence of iterates $g^{(1)}, \ldots, g^{(T)}$ satisfies:
$$\max_{g \in \Delta_m} \frac{1}{T} \sum_{t=1}^T \hat{\mathcal{L}}(f^{(t)}, g, \lambda^{(t)}) - \frac{1}{T} \sum_{t=1}^T \hat{\mathcal{L}}(f^{(t)}, g^{(t)}, \lambda^{(t)}) \le 2 (2BZ + R L_D) \sqrt{\frac{\log(m)}{T}}.$$

Proof. The proof follows from the standard convergence result for the exponentiated-gradient descent algorithm, noting that
$$\left\| \nabla_g \hat{\mathcal{L}}(f^{(t)}, g^{(t)}, \lambda^{(t)}) \right\|_\infty \le \max_i \left| \hat{\ell}_i(f^{(t)}) \right| + \left| \lambda^{(t)} \right| \left\| \nabla_g D(g^{(t)}, u) \right\|_\infty \le B \cdot \max_i \frac{1}{\hat{\pi}_i} + R L_D \le 2ZB + R L_D,$$
where we use the bound on the class prior estimates in Lemma 4. The last step holds with probability at least $1 - \alpha/2$.

Lemma 7. Under the assumptions in Theorem 1, and for $\eta_\lambda = \frac{R}{B_D \sqrt{T}}$, with probability at least $1 - \alpha/2$ over draw of validation sample $S \sim D^n$, the sequence of iterates $\lambda^{(1)}, \ldots, \lambda^{(T)}$ satisfies:
$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}_2(g^{(t)}, \lambda^{(t)}) - \min_{\lambda \in [0, R]} \frac{1}{T} \sum_{t=1}^T \mathcal{L}_2(g^{(t)}, \lambda) \le \frac{R B_D}{\sqrt{T}}.$$

Proof. The proof follows from the standard convergence result for the online gradient descent algorithm, noting that $\left| \nabla_\lambda \mathcal{L}_2(g^{(t)}, \lambda) \right| \le \left| D(g^{(t)}, u) - \delta \right| \le B_D$ and $|\lambda| \le R$.

The following lemma provides a generalization bound for the empirical Lagrangian.

Lemma 8. Under the assumptions in Theorem 1, with probability at least $1 - \alpha$ over draw of validation sample $S \sim D^n$, for all $t \in [T]$:
$$\left| \mathcal{L}(f^{(t)}, g^{(t)}, \lambda^{(t)}) - \hat{\mathcal{L}}(f^{(t)}, g^{(t)}, \lambda^{(t)}) \right| \le O\left( \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} \right).$$

Proof.
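For any $t \in [T]$, we first bound the left-hand side by:
$$\left| \mathcal{L}(f^{(t)}, g^{(t)}, \lambda^{(t)}) - \hat{\mathcal{L}}(f^{(t)}, g^{(t)}, \lambda^{(t)}) \right| \le \sum_{i=1}^m g^{(t)}_i \cdot \left| \ell_i(f) - \hat{\ell}_i(f) \right| \le \max_{i \in [m]} \left| \ell_i(f) - \hat{\ell}_i(f) \right|. \tag{9}$$
Further define $\tilde{\ell}_i(f) = \frac{1}{\pi_i} \cdot \frac{1}{n} \sum_{(x, y) \in S: y = i} \ell(y, f(x))$. We can then bound the difference $|\ell_i(f) - \hat{\ell}_i(f)|$ in the above bound using:
$$\begin{aligned}
\left| \ell_i(f) - \hat{\ell}_i(f) \right| &\le \left| \ell_i(f) - \tilde{\ell}_i(f) \right| + \left| \tilde{\ell}_i(f) - \hat{\ell}_i(f) \right| \\
&= \frac{1}{\pi_i} \left| \mathbb{E}\left[ \ell(i, f(x)) \cdot \mathbf{1}(y = i) \right] - \frac{1}{n} \sum_{j=1}^n \ell(i, f(x_j)) \cdot \mathbf{1}(y_j = i) \right| + \left| \frac{1}{\pi_i} - \frac{1}{\hat{\pi}_i} \right| \frac{1}{n} \sum_{j=1}^n \ell(i, f(x_j))\, \mathbf{1}(y_j = i) \\
&\le Z \left| \mathbb{E}\left[ \ell(i, f(x)) \cdot \mathbf{1}(y = i) \right] - \frac{1}{n} \sum_{j=1}^n \ell(i, f(x_j)) \cdot \mathbf{1}(y_j = i) \right| + \frac{B}{\pi_i \hat{\pi}_i} \left| \pi_i - \hat{\pi}_i \right|.
\end{aligned}$$
We know $\pi_i \ge \frac{1}{Z}$. Further, applying Lemma 4, we can bound $\hat{\pi}_i$. We therefore have with probability at least $1 - \alpha/2$ over draw of $S \sim D^n$:
$$\left| \ell_i(f) - \hat{\ell}_i(f) \right| \le Z \left| \mathbb{E}\left[ \ell(i, f(x)) \cdot \mathbf{1}(y = i) \right] - \frac{1}{n} \sum_{j=1}^n \ell(i, f(x_j)) \cdot \mathbf{1}(y_j = i) \right| + 2 Z^2 B \left| \pi_i - \hat{\pi}_i \right|.$$
An application of Hoeffding's inequality to both of the above terms, noting that the loss $\ell(y, z)$ is bounded, together with a union bound over all $f \in \mathcal{F}$ and classes $i \in [m]$, gives us with probability at least $1 - \alpha/2$ over draw of $S \sim D^n$, for all $f \in \mathcal{F}$ and $i \in [m]$:
$$\left| \ell_i(f) - \hat{\ell}_i(f) \right| \le O\left( \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} \right).$$
Plugging this back into equation 9 and taking a union bound over both high-probability statements completes the proof.

We will additionally use the following regret bound for the $f$-minimization step:

Lemma 9. Under the assumptions in Theorem 1, for a fixed $g \in \Delta_m$, with probability at least $1 - \alpha$ over draw of $S \sim D^n$,
$$\mathcal{L}_1(f^{(t)}, g) - \min_{f \in \mathcal{F}} \mathcal{L}_1(f, g) \le BZ \cdot \mathbb{E}_x\left[ \left\| \eta(x) - \hat{\eta}(x) \right\|_1 \right] + O\left( \sqrt{\frac{\log(m / \alpha)}{n}} \right).$$

Proof. We first expand $\mathcal{L}_1$ in terms of the conditional-class probability function $\eta(x)$:
$$\mathcal{L}_1(f, g) = \sum_{i=1}^m g_i \cdot \ell_i(f) = \sum_{i=1}^m \frac{g_i}{\pi_i} \cdot \mathbb{E}_x\left[ \eta_i(x) \cdot \ell(i, f(x)) \right].$$
We know from Lemma 2 that the minimizer of $\mathcal{L}_1(f, g)$ over all $f$ takes the form $f^*_y(x) \propto \frac{g_y}{\pi_y} \eta_y(x)$. We also know from Lemma 2 that $f^{(t)}$ is the minimizer of a similar objective where $\eta$ is replaced by the pre-trained model $\hat{\eta}$:
$$\hat{\mathcal{L}}_1(f, g) = \sum_{i=1}^m \frac{g_i}{\hat{\pi}_i} \cdot \mathbb{E}_x\left[ \hat{\eta}_i(x) \cdot \ell(i, f(x)) \right].$$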
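We would like to bound:
$$\begin{aligned}
\mathcal{L}_1(f^{(t)}, g) - \mathcal{L}_1(f^*, g) &= \mathcal{L}_1(f^{(t)}, g) - \hat{\mathcal{L}}_1(f^{(t)}, g) + \hat{\mathcal{L}}_1(f^{(t)}, g) - \mathcal{L}_1(f^*, g) \\
&\le \mathcal{L}_1(f^{(t)}, g) - \hat{\mathcal{L}}_1(f^{(t)}, g) + \hat{\mathcal{L}}_1(f^*, g) - \mathcal{L}_1(f^*, g) \\
&= \mathbb{E}_x\left[ \sum_{i=1}^m g_i \cdot \left( \frac{\eta_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right) \cdot \ell(i, f^{(t)}(x)) \right] - \mathbb{E}_x\left[ \sum_{i=1}^m g_i \cdot \left( \frac{\eta_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right) \cdot \ell(i, f^*(x)) \right] \\
&= \mathbb{E}_x\left[ \sum_{i=1}^m g_i \cdot \left( \ell(i, f^{(t)}(x)) - \ell(i, f^*(x)) \right) \cdot \left( \frac{\eta_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right) \right] \\
&\le \mathbb{E}_x\left[ \max_{i \in [m]} g_i \cdot \left| \ell(i, f^{(t)}(x)) - \ell(i, f^*(x)) \right| \cdot \sum_{i=1}^m \left| \frac{\eta_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right| \right] \\
&\le B \cdot \mathbb{E}_x\left[ \sum_{i=1}^m \left| \frac{\eta_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right| \right] \\
&= B \cdot \mathbb{E}_x\left[ \sum_{i=1}^m \left| \frac{\eta_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\pi_i} + \frac{\hat{\eta}_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right| \right] \\
&\le B \cdot \max_{i \in [m]} \frac{1}{\pi_i} \cdot \mathbb{E}_x\left[ \left\| \eta(x) - \hat{\eta}(x) \right\|_1 \right] + \mathbb{E}_x\left[ \sum_{i=1}^m \left| \frac{\hat{\eta}_i(x)}{\pi_i} - \frac{\hat{\eta}_i(x)}{\hat{\pi}_i} \right| \right] \\
&\le BZ \cdot \mathbb{E}_x\left[ \left\| \eta(x) - \hat{\eta}(x) \right\|_1 \right] + \mathbb{E}_x\left[ \left\| \hat{\eta}(x) \right\|_1 \right] \cdot \max_{i \in [m]} \left| \frac{1}{\pi_i} - \frac{1}{\hat{\pi}_i} \right| \\
&= BZ \cdot \mathbb{E}_x\left[ \left\| \eta(x) - \hat{\eta}(x) \right\|_1 \right] + (1) \cdot \max_{i \in [m]} \frac{1}{\pi_i \hat{\pi}_i} \left| \pi_i - \hat{\pi}_i \right|,
\end{aligned}$$
where in the second step, we use the fact that $f^{(t)}$ minimizes $\hat{\mathcal{L}}_1$; in the second-last step, we apply Hölder's inequality; in the last step, we use the fact that $g_i \in [0, 1]$, $\pi_i \ge \frac{1}{Z}$ and $\ell(y, z) \le B$. We know $\pi_i \ge \frac{1}{Z}$. Further, applying Lemma 4, we can bound $\hat{\pi}_i$. We have with probability at least $1 - \alpha/2$ over draw of $S \sim D^n$:
$$\mathcal{L}_1(f^{(t)}, g) - \mathcal{L}_1(f^*, g) \le BZ \cdot \mathbb{E}_x\left[ \left\| \eta(x) - \hat{\eta}(x) \right\|_1 \right] + 2 Z^2 \cdot \max_{i \in [m]} \left| \pi_i - \hat{\pi}_i \right|.$$
An application of Hoeffding's inequality to the last term completes the proof.

We are now ready to prove Theorem 3.

Proof of Theorem 3. Let $\kappa_n = O\left( \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} + \mathbb{E}_x\left[ \left\| \eta(x) - \hat{\eta}(x) \right\|_1 \right] \right)$. We start by combining Lemma 6 with the generalization bound in Lemma 8, from which we have with probability at least $1 - \alpha$ over draw of validation sample $S \sim D^n$,
$$\begin{aligned}
\max_{g \in \Delta_m} \frac{1}{T} \sum_{t=1}^T \mathcal{L}(f^{(t)}, g, \lambda^{(t)}) &\le \frac{1}{T} \sum_{t=1}^T \mathcal{L}(f^{(t)}, g^{(t)}, \lambda^{(t)}) + O\left( \sqrt{\frac{\log(m)}{T}} + \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} \right) \\
&= \frac{1}{T} \sum_{t=1}^T \mathcal{L}_1(f^{(t)}, g^{(t)}) + \frac{1}{T} \sum_{t=1}^T \mathcal{L}_2(g^{(t)}, \lambda^{(t)}) + O\left( \sqrt{\frac{\log(m)}{T}} + \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} \right) \\
&= \frac{1}{T} \sum_{t=1}^T \mathcal{L}_1(f^{(t)}, g^{(t)}) + \frac{1}{T} \sum_{t=1}^T \mathcal{L}_2(g^{(t)}, \lambda^{(t)}) + O\left( \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} \right),
\end{aligned} \tag{10}$$
where we have used the fact that $T = O(n)$.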
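Applying Lemma 7 to the right-hand side of equation 10, with $T = O(n)$, we have with probability at least $1 - \alpha$,
$$\begin{aligned}
\max_{g \in \Delta_m} \frac{1}{T} \sum_{t=1}^T \mathcal{L}(f^{(t)}, g, \lambda^{(t)}) &\le \frac{1}{T} \sum_{t=1}^T \mathcal{L}_1(f^{(t)}, g^{(t)}) + \min_{\lambda \in [0, R]} \frac{1}{T} \sum_{t=1}^T \mathcal{L}_2(g^{(t)}, \lambda) + O\left( \sqrt{\frac{\log(m |\mathcal{F}| / \alpha)}{n}} \right) \\
&\le \min_{f: \mathcal{X} \to \Delta_m} \frac{1}{T} \sum_{t=1}^T \mathcal{L}_1(f, g^{(t)}) + \min_{\lambda \in [0, R]} \frac{1}{T} \sum_{t=1}^T \mathcal{L}_2(g^{(t)}, \lambda) + \kappa_n \\
&\le \min_{f: \mathcal{X} \to \Delta_m} \mathcal{L}_1(f, \bar{g}) + \min_{\lambda \in [0, R]} \mathcal{L}_2(\bar{g}, \lambda) + \kappa_n,
\end{aligned} \tag{11}$$
where in the penultimate step, we apply Lemma 9, and the last step uses the fact that $\mathcal{L}_1(f, g)$ is linear in $g$ and applies Jensen's inequality to $\mathcal{L}_2(g, \lambda)$, noting that it is concave in $g$ (as a result of $-D(g, u)$ being concave in $g$).

Applying Jensen's inequality again to the LHS of equation 11, noting that $\ell_i(f) = \mathbb{E}\left[ \ell(y, f(x)) \mid y = i \right]$ and as a result $\mathcal{L}_1(f, g)$ is convex in $f$ (owing to $\ell(y, z)$ being convex in $z$), and additionally using the fact that $\mathcal{L}_2(g, \lambda)$ is linear in $\lambda$, we further have:
$$\max_{g \in \Delta_m} \mathcal{L}(\bar{f}, g, \bar{\lambda}) \le \min_{f: \mathcal{X} \to \Delta_m,\, \lambda \in [0, R]} \mathcal{L}(f, \bar{g}, \lambda) + \kappa_n.$$
Lower bounding the LHS by a min over $\lambda \in [0, R]$ (noting that $\bar{\lambda} \in [0, R]$), and upper bounding the RHS by a max over $g \in \Delta_m$, we have:
$$\min_{\lambda \in [0, R]} \max_{g \in \Delta_m} \mathcal{L}(\bar{f}, g, \lambda) \le \max_{g \in \Delta_m} \min_{f: \mathcal{X} \to \Delta_m,\, \lambda \in [0, R]} \mathcal{L}(f, g, \lambda) + \kappa_n.$$
Exchanging the min's and max's using the min-max theorem,
$$\max_{g \in \Delta_m} \min_{\lambda \in [0, R]} \mathcal{L}(\bar{f}, g, \lambda) \le \min_{f: \mathcal{X} \to \Delta_m} \max_{g \in \Delta_m} \min_{\lambda \in [0, R]} \mathcal{L}(f, g, \lambda) + \kappa_n.$$
In other words, for any $f^*: \mathcal{X} \to \Delta_m$,
$$\max_{g \in \Delta_m} \min_{\lambda \in [0, R]} \mathcal{L}(\bar{f}, g, \lambda) \le \max_{g \in \Delta_m} \min_{\lambda \in [0, R]} \mathcal{L}(f^*, g, \lambda) + \kappa_n.$$
An application of Lemma 5 to both the LHS and RHS gives us for any $f^*: \mathcal{X} \to \Delta_m$,
$$\max_{g \in \mathbb{G}(\delta)} \sum_{i=1}^m g_i \cdot \ell_i(\bar{f}) \le \max_{g \in \mathbb{G}(\delta)} \sum_{i=1}^m g_i \cdot \ell_i(f^*) + \kappa_n,$$
which completes the proof.

A.4 EG-UPDATE FOR $g^{(t)}$

To obtain the updates for $g$, we consider a fixed $\lambda^{(t)} \in \mathbb{R}_+$ and adopt an EG-style computation. Taking the KL divergence for illustration, we provide the closed form of $g^{(t+1)}$ in Proposition 2.

Proposition 2. The (un-normalized) EG-update for $g^{(t)}$ under $D_{\mathrm{KL}}$ is given by:
$$g^{(t+1)}_i = \left( g^{(t)}_i \exp\left\{ \eta_g\, \ell_i(f) + \lambda \eta_g \log(u_i) \right\} \right)^{\frac{1}{1 + \lambda \eta_g}}.$$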
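Regarding the EG-updates for $g^{(t)}$, it follows directly from Proposition 2 that classes with a larger loss or a higher target distribution weight $r_i$ tend to receive a larger weight.

Proof. In this proof, we consider a generic target distribution $r$, which covers the uniform prior $u$ as a special case. To obtain $g^{(t+1)}_i$ for the KL divergence, where $D_{\mathrm{KL}}(g, r) = \sum_i g_i \log \frac{g_i}{r_i}$, we need:
$$\begin{aligned}
& \frac{\partial \left[ -\frac{1}{\eta_g} \sum_i g_i \log \frac{g_i}{g^{(t)}_i} + \sum_i g_i\, \ell_i(f) - \lambda \left( \sum_i g_i \log \frac{g_i}{r_i} - \delta \right) \right]}{\partial g_i} = 0 \\
\Longrightarrow\ & -\frac{1}{\eta_g} \log \frac{g_i}{g^{(t)}_i} - \frac{1}{\eta_g}\, g_i\, \frac{g^{(t)}_i}{g_i}\, \frac{1}{g^{(t)}_i} + \ell_i(f) - \lambda \log \frac{g_i}{r_i} - \lambda\, g_i\, \frac{r_i}{g_i}\, \frac{1}{r_i} = 0 \\
\Longrightarrow\ & -\frac{1}{\eta_g} \log \frac{g_i}{g^{(t)}_i} - \frac{1}{\eta_g} + \ell_i(f) - \lambda \log \frac{g_i}{r_i} - \lambda = 0 \\
\Longrightarrow\ & \ell_i(f) - \frac{1}{\eta_g} - \lambda = \lambda \log \frac{g_i}{r_i} + \frac{1}{\eta_g} \log \frac{g_i}{g^{(t)}_i} \\
\Longrightarrow\ & \ell_i(f) - \frac{1}{\eta_g} - \lambda + \lambda \log(r_i) + \frac{1}{\eta_g} \log(g^{(t)}_i) = \lambda \log(g_i) + \frac{1}{\eta_g} \log(g_i) \\
\Longrightarrow\ & (\lambda \eta_g + 1) \log(g_i) = \ell_i(f)\, \eta_g - 1 - \lambda \eta_g + \lambda \eta_g \log(r_i) + \log(g^{(t)}_i) \\
\Longrightarrow\ & g_i = \exp\left( \frac{\ell_i(f)\, \eta_g - 1 - \lambda \eta_g + \lambda \eta_g \log(r_i) + \log(g^{(t)}_i)}{\lambda \eta_g + 1} \right) \\
\Longrightarrow\ & g_i = \exp\left( \frac{\log(g^{(t)}_i) + \ell_i(f)\, \eta_g + \lambda \eta_g \log(r_i)}{1 + \lambda \eta_g} - 1 \right).
\end{aligned}$$
Dropping the constant factor, we then have:
$$g_i = \exp\left( \frac{\log(g^{(t)}_i) + \ell_i(f)\, \eta_g + \lambda \eta_g \log(r_i)}{1 + \lambda \eta_g} \right) \Longrightarrow g_i = \left( g^{(t)}_i \exp\left\{ \eta_g\, \ell_i(f) + \lambda \eta_g \log(r_i) \right\} \right)^{\frac{1}{1 + \lambda \eta_g}}.$$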
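The closed form in Proposition 2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `eg_update_kl` is hypothetical, the update is computed in log space for numerical stability, and the final renormalization onto the simplex is an assumption added here, since Proposition 2 states the un-normalized form.

```python
import numpy as np

def eg_update_kl(g, losses, r, lam, eta_g):
    """One EG-style step for g under the KL divergence (form of Proposition 2).

    g      : (m,) current weights g^(t), strictly positive.
    losses : (m,) per-class losses l_i(f).
    r      : (m,) target distribution, strictly positive.
    lam    : current Lagrange multiplier (lambda >= 0).
    eta_g  : step size for g.
    """
    g = np.asarray(g, dtype=float)
    losses = np.asarray(losses, dtype=float)
    r = np.asarray(r, dtype=float)
    # log of (g_i * exp{eta_g * l_i + lam * eta_g * log r_i})^(1 / (1 + lam * eta_g))
    log_new = (np.log(g) + eta_g * losses + lam * eta_g * np.log(r)) / (1.0 + lam * eta_g)
    log_new -= log_new.max()       # stabilize the exponentials
    new_g = np.exp(log_new)
    return new_g / new_g.sum()     # assumption: renormalize onto the simplex

# Toy example (made-up numbers): classes with larger loss gain weight.
g0 = np.full(3, 1.0 / 3.0)
step = eg_update_kl(g0, losses=[1.0, 2.0, 3.0], r=g0, lam=0.5, eta_g=0.1)
```

With $\lambda = 0$ the step reduces to the standard exponentiated-gradient update $g_i \propto g^{(t)}_i \exp\{\eta_g \ell_i(f)\}$. A larger $\lambda$ pulls the weights toward the target distribution $r$.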

B EXTENSION TO GROUP-PRIOR SHIFTS

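We now show how our theoretical results extend to the group-prior shift setting. Suppose each instance $x \in \mathcal{X}$ is associated with a group $a \in \mathcal{A}$, and $m = |\mathcal{Y}| \times |\mathcal{A}|$. We define the group-specific conditional-class probability to be $\eta_i(x, a) = \mathbb{P}(y = i \mid x, a)$ and the group-specific class priors to be $\pi_{a,i} = \mathbb{P}(y = i \mid a)$. In this case, we wish to learn a scoring function $f: \mathcal{X} \times \mathcal{A} \to \Delta_m$ that takes both the instance $x$ and the group $a$ into account. We use $\ell_{a,y}(f) = \mathbb{E}\left[ \ell(y, f(x, a)) \mid x, a \right]$ to denote the class-$y$ loss conditioned on group $a$. Our goal is to minimize the following group-specific DRE objective:
$$\min_{f: \mathcal{X} \times \mathcal{A} \to \Delta_m} \mathrm{DRL}(f; \delta) = \min_{f: \mathcal{X} \times \mathcal{A} \to \Delta_m} \max_{g \in \mathbb{G}(\delta)} \sum_{a, i} g_{a,i} \cdot \ell_{a,i}(f), \tag{12}$$
where $\mathbb{G}(\delta) = \{ g \in \Delta_{m \times k} \mid D(g, r) \le \delta \}$ for some $\delta > 0$, divergence function $D: \Delta_m \times \Delta_m \to \mathbb{R}$, and target distribution $r \in \Delta_m$.

Theorem 10 (Bayes-optimal scorer for group-prior shift). Suppose $\ell(y, z)$ is a proper loss that is convex in its second argument and $D(g, \cdot)$ is convex in $g$. Let $\delta > 0$ be such that $\mathbb{G}(\delta)$ is non-empty. Then, for some $g^* \in \mathbb{G}(\delta)$, the optimal solution to equation 12 takes the form:
$$f^*_y(x, a) \propto \frac{g^*_{a,y}}{\pi_{a,y}} \cdot \eta_y(x, a).$$

The proof follows the same steps as Theorem 1, except that it uses the following lemma instead of Lemma 2:

Lemma 11. Suppose $\ell(y, z)$ is a proper loss. For any fixed $g \in \mathbb{R}^{k \times m}_+$, the following is a minimizer of the objective $\sum_{a, y} g_{a,y} \cdot \ell_{a,y}(f)$ over all measurable functions $f: \mathcal{X} \times \mathcal{A} \to \Delta_m$:
$$f^*_y(x, a) \propto \frac{g_{a,y}}{\pi_{a,y}} \cdot \eta_y(x, a).$$

Proof. We first expand the objective:
$$\sum_{a, i} g_{a,i} \cdot \ell_{a,i}(f) = \sum_{a, i} g_{a,i} \cdot \mathbb{E}_{x \mid a, y = i}\left[ \ell(i, f(x, a)) \right] = \sum_{a, i} \frac{g_{a,i}}{\pi_{a,i}} \cdot \mathbb{E}_x\left[ \eta_i(x, a) \cdot \ell(i, f(x, a)) \right].$$
Given that $\ell$ is a proper loss, the minimizer of this objective takes the form:
$$f^*_i(x, a) = \frac{\frac{g_{a,i}}{\pi_{a,i}} \cdot \eta_i(x, a)}{\sum_j \frac{g_{a,j}}{\pi_{a,j}} \cdot \eta_j(x, a)}.$$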

C ADDITIONAL EXPERIMENT DETAILS AND RESULTS

In this section, we introduce omitted experiment details and additional experiment results of our proposed methods.

C.1 ABLATION STUDY OF DROPS ON CLASS-IMBALANCED CIFAR

We present an ablation study of DROPS in Table 3. Suppose the designer is interested in robustness at δ_eval = 1.0. Setting δ_train ∈ [0.5, 1.0] frequently yields one of the three best results in mean, (δ_eval = 1.0)-worst, and worst-class accuracy, averaged over 5 runs, which indicates that DROPS is relatively insensitive to the parameter δ_train. Setting δ_train to a coarse estimate of δ_eval should therefore be enough to improve model robustness under prior shifts.

Table 3: Ablation study of DROPS on the class-imbalanced CIFAR-100 dataset: mean ± std of averaged class accuracy, δ = 1.0-worst case accuracy, and worst class accuracy over 5 runs are reported. The three best-performing values of δ in each setting are highlighted.

We provide statistical testing of the results in Table 1: in Table 4, we include the paired Student's t-test results between each baseline method and DROPS, for each dataset and each metric (mean/δ-worst/worst accuracy), and the input samples of each method for testing are the test accuracies



$\frac{\max_{y \in [m]} \pi_y}{\min_{y \in [m]} \pi_y}$ and experiment with $\rho \in \{10, 50, 100\}$.

Figure 1: Performance/robustness of methods under different perturbations, measured by δ-worst case accuracy. 1st row: CIFAR-10 with imbalance ratio ρ = 100, using the KL divergence (left) and the reverse-KL divergence (right) for the δ-worst case accuracy calculation. 2nd row: CIFAR-100 with imbalance ratio ρ = 100, using the KL divergence (left) and the reverse-KL divergence (right) for the δ-worst case accuracy calculation.
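To make the δ-worst case accuracy concrete, the sketch below computes it for the KL case only, under the assumption that it is defined as the minimum of $\sum_i g_i \cdot \mathrm{acc}_i$ over all $g$ with $D_{\mathrm{KL}}(g, r) \le \delta$, mirroring the robust loss $\max_{g \in \mathbb{G}(\delta)} \sum_i g_i \cdot \ell_i(f)$ in Appendix A. It uses the standard exponential-tilting form of the solution, $g_i \propto r_i \exp(-\mathrm{acc}_i / \tau)$, with the temperature $\tau$ found by bisection. The function name and the toy accuracies are hypothetical, and this is not necessarily how the reported numbers were computed.

```python
import numpy as np

def delta_worst_accuracy(acc, r, delta, iters=100):
    """Sketch: min over g with KL(g || r) <= delta of sum_i g_i * acc_i."""
    acc = np.asarray(acc, dtype=float)
    r = np.asarray(r, dtype=float)

    def tilt(tau):
        # Exponentially tilted weights g_i ∝ r_i * exp(-acc_i / tau)
        logits = np.log(r) - acc / tau
        logits -= logits.max()
        g = np.exp(logits)
        return g / g.sum()

    def kl(g):
        mask = g > 0
        return float(np.sum(g[mask] * np.log(g[mask] / r[mask])))

    lo, hi = 1e-6, 1e6
    # If even the most concentrated weights fit in the ball, the constraint is slack.
    if kl(tilt(lo)) <= delta:
        return float(tilt(lo) @ acc)
    # KL(tilt(tau) || r) decreases as tau grows; bisect on a log scale.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl(tilt(mid)) > delta:
            lo = mid
        else:
            hi = mid
    return float(tilt(hi) @ acc)

# Toy per-class accuracies (made-up numbers) with a uniform target prior.
acc = np.array([0.9, 0.6, 0.3])
r = np.full(3, 1.0 / 3.0)
mid_value = delta_worst_accuracy(acc, r, delta=0.1)
```

As δ shrinks to 0 the value approaches the accuracy averaged under $r$, and for large δ it approaches the worst class accuracy, which matches the two ends of the spectrum discussed in the introduction.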


CIFAR-100 (Mean)

| Method | ρ = 10 | ρ = 50 | ρ = 100 |
|---|---|---|---|
| DROPS (δ = 0.1) | 59.08±0.38 | 48.24±0.63 | 43.37±0.62 |
| DROPS (δ = 0.2) | 59.12±0.21 | 47.90±0.52 | 43.58±0.61 |
| DROPS (δ = 0.3) | 58.62±0.56 | 48.17±0.91 | 42.85±0.86 |
| DROPS (δ = 0.4) | 58.98±0.37 | 47.64±0.88 | 42.60±1.15 |
| DROPS (δ = 0.5) | 58.79±0.21 | 47.84±0.35 | 42.52±0.43 |
| DROPS (δ = 0.6) | 59.35±0.27 | 47.44±0.65 | 42.47±0.84 |
| DROPS (δ = 0.7) | 59.42±0.30 | 48.04±0.47 | 43.20±0.92 |
| DROPS (δ = 0.8) | 59.63±0.10 | 47.78±0.96 | 43.44±0.60 |
| DROPS (δ = 0.9) | 59.69±0.39 | 47.83±0.80 | 43.20±0.78 |
| DROPS (δ = 1.0) | 59.60±0.63 | 47.96±0.86 | 43.23±1.45 |
| DROPS (δ = 1.1) | 59.00±0.56 | 48.31±0.46 | 43.72±0.55 |
| DROPS (δ = 1.2) | 59.36±0.37 | 48.02±1.30 | 43.11±0.35 |
| DROPS (δ = 1.3) | 59.54±0.75 | 47.86±1.00 | 42.80±0.67 |

CIFAR-100 (δ = 1.0-worst case acc)

| Method | ρ = 10 | ρ = 50 | ρ = 100 |
|---|---|---|---|
| DROPS (δ = 0.1) | 42.98±0.39 | 29.62±0.52 | 24.26±0.68 |
| DROPS (δ = 0.2) | 43.96±0.18 | 29.98±1.00 | 25.16±0.49 |
| DROPS (δ = 0.3) | 43.54±0.43 | 30.18±1.02 | 25.14±0.77 |
| DROPS (δ = 0.4) | 44.16±0.76 | 30.50±0.96 | 24.90±0.88 |
| DROPS (δ = 0.5) | 44.18±0.34 | 31.10±0.35 | 25.08±0.36 |
| DROPS (δ = 0.6) | 44.64±0.35 | 30.12±0.73 | 24.60±0.64 |
| DROPS (δ = 0.7) | 44.34±0.90 | 30.30±0.41 | 25.30±0.85 |
| DROPS (δ = 0.8) | 44.16±0.43 | 30.32±0.68 | 25.48±0.55 |
| DROPS (δ = 0.9) | 44.96±0.52 | 30.12±0.66 | 25.58±0.50 |
| DROPS (δ = 1.0) | 44.86±1.05 | 31.14±0.74 | 26.24±1.88 |
| DROPS (δ = 1.1) | 43.74±1.33 | 31.18±0.56 | 25.84±0.53 |
| DROPS (δ = 1.2) | 44.92±0.69 | 31.28±1.27 | 25.06±0.58 |
| DROPS (δ = 1.3) | 45.04±0.84 | 31.22±0.98 | 25.22±0.64 |

CIFAR-100 (Worst)

| Method | ρ = 10 | ρ = 50 | ρ = 100 |
|---|---|---|---|
| DROPS (δ = 0.1) | 20.61±3.24 | 8.25±2.46 | 6.02±1.50 |
| DROPS (δ = 0.2) | 22.51±3.10 | 9.58±3.34 | 7.40±1.04 |
| DROPS (δ = 0.3) | 23.32±4.39 | 10.90±3.90 | 7.29±1.90 |
| DROPS (δ = 0.4) | 22.51±2.87 | 10.79±2.07 | 8.28±1.98 |
| DROPS (δ = 0.5) | 25.13±3.05 | 12.52±2.57 | 6.70±2.39 |
| DROPS (δ = 0.6) | 24.86±4.18 | 10.33±1.92 | 7.43±2.78 |
| DROPS (δ = 0.7) | 21.74±4.97 | 10.89±2.07 | 9.03±2.67 |
| DROPS (δ = 0.8) | 24.78±4.84 | 11.17±1.54 | 8.21±2.06 |
| DROPS (δ = 0.9) | 25.98±3.95 | 11.65±3.05 | 8.61±2.22 |
| DROPS (δ = 1.0) | 24.80±2.85 | 11.86±2.85 | 7.52±2.11 |
| DROPS (δ = 1.1) | 24.22±4.83 | 12.81±2.15 | 9.42±1.99 |
| DROPS (δ = 1.2) | 25.77±4.58 | 12.89±2.46 | 7.27±2.32 |
| DROPS (δ = 1.3) | 26.49±3.66 | 12.54±1.41 | 8.56±1.02 |

C.2 HYPOTHESIS TESTING OF PERFORMANCE COMPARISONS ON CLASS-IMBALANCED CIFAR

Performance comparisons on class-imbalanced CIFAR datasets: mean ± std of averaged class accuracy, δ = 1.0-worst case accuracy, and worst class accuracy of 5 runs are reported. The best-performing methods (by accuracy averaged over 5 runs) in each setting are highlighted in purple. We defer the paired Student's t-test between each baseline method and DROPS to the Appendix (Table 4), to demonstrate the robustness of DROPS.
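For readers who want to reproduce this kind of comparison, the paired t-statistic can be sketched as follows. This is a minimal illustration on made-up accuracies, not the paper's data or the authors' script. The differences are taken as baseline minus DROPS, so a negative statistic means the baseline is worse; the helper name `paired_t_statistic` is hypothetical. The p-value would come from a Student's t distribution with n − 1 degrees of freedom, for example via `scipy.stats.ttest_rel`, and is omitted to keep the sketch dependency-free.

```python
import numpy as np

def paired_t_statistic(baseline_acc, drops_acc):
    """Paired t-statistic on matched runs (differences: baseline - DROPS)."""
    d = np.asarray(baseline_acc, dtype=float) - np.asarray(drops_acc, dtype=float)
    n = d.size
    # Mean difference divided by its standard error (sample std with ddof=1)
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return float(t_stat), n - 1  # statistic and degrees of freedom

# Made-up accuracies for four matched runs, for illustration only.
baseline = [70.0, 71.0, 69.0, 72.0]
drops = [73.0, 75.0, 72.0, 74.0]
t_stat, dof = paired_t_statistic(baseline, drops)
```

In the setting described in the text, each method would contribute 15 matched samples (5 runs × 3 imbalance ratio settings) per dataset and metric.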



Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16489-16498, 2021.

Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9719-9728, 2020.

ACKNOWLEDGEMENT

This work was done during JW's internship at Google Research, Brain Team. JW and YL are partially supported by the National Science Foundation (NSF) under grants IIS-2007951, IIS-2143895, and the Office of Naval Research under grant N00014-20-1-22. JW is also partially supported by the Center for Research in Open Source Software at UC Santa Cruz, which is funded by a donation from Sage Weil and industry memberships. We are thankful to Kevin Murphy for providing several helpful comments on the manuscript.

availability

https://github.com/weijiaheng/Drops.

annex

of 5 runs × 3 imbalance ratio settings). Each cell reports the (t-statistic, p-value). For most results the statistic is negative, meaning that the given baseline method is significantly worse than DROPS when the p-value is small enough (i.e., p < 0.05).

Table 4: Paired Student's t-test of the performance comparisons between each baseline method and DROPS: cells in the right 6 columns denote the (statistic, p-value) of the hypothesis test between each baseline method and DROPS. A negative statistic with a p-value less than 0.05 indicates that DROPS is statistically significantly better than the corresponding baseline method.

| Method vs. DROPS | CIFAR-10 Mean | CIFAR-10 δ = 1.0-worst | CIFAR-10 Worst | CIFAR-100 Mean | CIFAR-100 δ = 1.0-worst | CIFAR-100 Worst |
|---|---|---|---|---|---|---|
| Cross Entropy | -11.3, 1.9e-8 | -11.1, 2.6e-8 | -9.7, 1.4e-7 | -7.9, 1.5e-6 | -12.9, 3.7e-7 | -13.0, 3.3e-7 |
| Focal | -4.9, 2.2e-4 | -15.1, 4.9e-10 | -21.2, 4.8e-12 | -9.4, 3.6e-7 | 5.6, 8.8e-5 | -4.6, 5.0e-4 |
| LDAM | -9.1, 2.8e-7 | -9.8, 1.2e-7 | -9.2, 2.6e-7 | -7.4, 3.1e-6 | -6.6, 1.2e-5 | -7.4, 3.2e-6 |
| Balanced-Softmax | -4.1, 1.1e-3 | -3.9, 1.6e-3 | -3.6, 3.2e-3 | -5.5, 8.4e-5 | -5.7, 5.4e-5 | -6.1, 2.6e-5 |
| Logit-Adjust | 1.9, 0.08 | -4.6, 4.4e-4 | -5.9, 3.9e-5 | -3.5, 3.6e | | |

We adopt two kinds of fine-tuning strategies, starting from the pre-trained Cross-Entropy model in Table 1 and re-training for another 2000 iterations with a fixed learning rate (1e-3, 1e-4, 1e-5, 1e-6). Table 5 includes the following two options:

Fine-tuning the whole network: reported as the row "Cross-Entropy (1e-3)"; empirically we observed that the learning rate 1e-3 is consistently better than the others, so it is the one reported here.

Fine-tuning the last layer (DFR): the learning rate 1e-3 is again consistently better than the others.

All other settings and the model selection criterion remain the same as in the vanilla training procedure, except that the training data is replaced by the whole validation set. Although these two fine-tuning strategies improve model performance across each metric and setting over the Cross-Entropy loss in Table 1, they still fall largely behind DROPS in most scenarios and require re-training the model on the additional validation set, while DROPS only needs the per-class accuracy to decide on the parameters of post-hoc scaling, as is also used for model selection by all other baseline methods in Table 1.

Table 5: Performance comparisons on class-imbalanced CIFAR datasets: mean ± std of averaged class accuracy, δ = 1.0-worst case accuracy, and worst class accuracy of 5 runs are reported: for all three baseline methods, we fine-tune the pre-trained model that appeared in

