IMPROVED GROUP ROBUSTNESS VIA CLASSIFIER RE-TRAINING ON INDEPENDENT SPLITS

Abstract

Deep neural networks learned by minimizing the average risk can achieve strong average performance, but their performance for a subgroup may degrade if the subgroup is underrepresented in the overall data population. Group distributionally robust optimization (GDRO; Sagawa et al., 2020a) is a standard baseline for learning models with strong worst-group performance. However, GDRO requires group labels for every example during training and can be prone to overfitting, often requiring careful model capacity control via regularization or early stopping. When only a limited number of group labels is available, Just Train Twice (JTT; Liu et al., 2021) is a popular approach that infers a pseudo-group-label for every unlabeled example. However, the pseudo-labeling process can be highly sensitive to model selection. To alleviate overfitting for GDRO and the pseudo-labeling process for JTT, we propose a new method based on classifier retraining on independent splits of the training data. We find that a novel sample-splitting procedure achieves robust worst-group performance in the fine-tuning step. When evaluated on benchmark image and text classification tasks, our approach consistently reduces the number of group labels and the amount of hyperparameter search required during training. Experimental results confirm that our approach performs favorably compared with existing methods (including GDRO and JTT) both when group labels are available during training and when they are only available during validation.

1. INTRODUCTION

For many tasks, deep neural networks (DNNs) are developed under the assumption that the test distribution is independent of and identically distributed to the train distribution, which can be referred to as IID generalization. However, the performance of a DNN is known to worsen when the test distribution differs from the train distribution. This problem is often referred to as out-of-distribution (OOD) generalization. OOD generalization is crucial in safety-critical applications such as self-driving cars (Filos et al., 2020) or medical imaging (Oakden-Rayner et al., 2020). Hence, addressing the problem of OOD generalization is foundational for real-world deployment of deep learning models. A notable setting where OOD generalization problems appear is the group-shift setting, where different groups of the data may undergo a distribution shift. In this setting, there are predefined attributes that divide the input space into different groups of interest, and the goal is to find a model that performs well across all of the predefined groups (Sagawa et al., 2020a). As in other OOD generalization problems, DNNs learned by empirical risk minimization (ERM) are observed to suffer from poor worst-group performance despite good average-group performance. The difficulty of learning group-robust DNNs can be attributed to the phenomenon of shortcut learning (Geirhos et al., 2020) or spurious correlation (Sagawa et al., 2020a; Arjovsky et al., 2019). The shortcut-learning hypothesis posits that ERM favors models that discriminate based on simpler and/or spurious features of the data. However, one wishes for the learning algorithm to produce a model that uses features (i.e., correlations) that perform well not only on the train distribution but on all distributions that a task may generate, such as a worst-group distribution. In recent years, the group-shift setting has received considerable attention, where Sagawa et al.
(2020a) first investigates distributionally robust optimization (DRO) (Duchi et al., 2021; Ben-Tal et al., 2013) in this setting and introduces Group DRO (GDRO), which attempts to directly optimize the worst-group error. Since then, GDRO has been the standard method for producing group-robust models. However, GDRO can be sensitive to model capacity (Sagawa et al., 2020a) and requires group labels for all examples at training time. As group annotations can be expensive to obtain, many works consider ways to reduce the number of group labels needed (Liu et al., 2021; Creager et al., 2021; Nam et al., 2022). These methods usually follow the framework of first inferring pseudo-group-labels using a reference model (pseudo-labeling) and then applying a group-robust algorithm like GDRO to the pseudo-labeled data. While results for these methods have been promising, they usually introduce several additional sensitive hyperparameters that can be expensive to tune. Our work aims to obtain group-robust models using as few group labels as possible while alleviating the need to carefully control model capacity.

Figure 1: Worst-group learning curves on the Waterbird dataset for GDRO (left) and CROIS (right) from the setting in Section 4.2. In the left figure, the validation accuracy of GDRO becomes unstable as the number of epochs increases beyond 100 while the training accuracy remains close to 100%. Our approach, CROIS, instead uses 30% of the training data with group labels to fine-tune the classification layer of a DNN obtained via ERM on the rest of the training data without group labels (see also Algorithm 1). This allows CROIS to reuse ERM features while generalizing better than GDRO.

Our contribution. In this work, we propose a simple approach, called CROIS.
By foregoing the error-prone and costly pseudo-labeling phase and instead concentrating on efficiently utilizing group labels by applying them only to the final classifier layer, CROIS achieves good robust performance without relying on a multitude of hyperparameters and large-scale tuning, which has been a growing concern in the community (Gulrajani and Lopez-Paz, 2021). In short, CROIS takes advantage of the features learned by ERM (Kang et al., 2019; Menon et al., 2021) while overcoming its memorization behavior (Sagawa et al., 2020b) by treating the training data as two independent splits: one group-unlabeled split used to train the feature extractor and one group-labeled split used to retrain only the classifier with a robust algorithm like GDRO. We demonstrate through ablation studies that the use of independent splits is crucial for robust classifier retraining. Furthermore, by restricting GDRO to a low-capacity linear layer, CROIS reduces GDRO's sensitivity to model capacity as well as the amount of data GDRO needs to generalize well (e.g., Figures 1 and 3). In settings where group labels are only partially available during training (Section 4.1), our experimental results on standard datasets including Waterbird, CelebA, MultiNLI, and CivilComments show improved performance over existing methods including JTT (Liu et al., 2021) and SSA (Nam et al., 2022), despite minimal parameter tuning and no reliance on pseudo-labeling. In a setting where more group labels are available (Section 4.2), even when using only a fraction of the available group labels, our competitive robust performance against GDRO demonstrates our method's label efficiency. Finally, our results provide further evidence that ERM-trained DNNs contain good features on both image classification (Menon et al., 2021) and natural language classification tasks.

1.1. RELATED WORKS

There are three main settings for the group-shift problem: (1) full availability of group labels, (2) limited availability of group labels, and (3) no availability of group labels. Other related areas include domain generalization and long-tailed classification.

Full training group labels. Most methods here revolve around up-weighting minority groups, subsampling majority groups (Sagawa et al., 2020b), or performing GDRO (Sagawa et al., 2020a). Follow-up works include integrating data augmentation via generative models (Goel et al., 2020) or selective augmentation (Yao et al., 2022) into a robust training pipeline.

Limited access to group labels. In this setting, inferring group labels for the group-unlabeled data remains the most popular approach. These pseudo-group-labels are usually generated by training a reference model that performs the labeling. For example, Liu et al. (2021) utilize a low-capacity model and create groups by labeling whether an example is correctly classified by the reference model or not. Similarly, works like Creager et al. (2021); Dagaev et al. (2021); Krueger et al. (2021); Nam et al. (2022) and Nam et al. (2020) are variants of this approach of inferring pseudo-group-labels. These methods then use a group-robust algorithm like GDRO (Sagawa et al., 2020a) or Invariant Risk Minimization (IRM) (Arjovsky et al., 2019) to perform robust training on a new network with the newly generated pseudo-group-labels.

No group labels. This setting removes the ability to validate with group labels as well as knowledge of the potential groups. This makes the problem more difficult, as it is unclear which correlations to look for during training. Some theoretical works in this space include Hashimoto et al. (2018); Lahoti et al. (2020). Sohoni et al. (2020) propose a popular empirical approach in this setting and have popularized the pseudo-labeling-and-retraining approach. This setting is related to domain generalization.
Gulrajani and Lopez-Paz (2021) show through mass-scale experiments that most OOD generalization methods do not improve over ERM given the same amount of tuning and the same model selection criterion.

Long-tailed classification. The long-tailed problem concerns settings where certain classes have significantly fewer training examples than others (see Haixiang et al. (2017) or Zhang et al. (2021) for surveys). Some techniques from the long-tail literature, such as Cao et al. (2019), have been applied to the group-shift setting to account for group imbalances, as in Sagawa et al. (2020a). Insights from applications of representation learning to the long-tail problem (Kang et al., 2019) give valuable evidence that ERM-trained DNNs contain good features, which is central to our method.

2. PRELIMINARIES

For a classification task T of predicting labels in Y from inputs in X, we are given training examples {(x_i, y_i)}_{i=1}^n drawn IID from some train distribution D_train. In the domain generalization setting, we want good performance on some unknown test distribution D_test that is different from, but related to, D_train through the task T. More explicitly, we wish to find a classifier f from some hypothesis space F using D_train such that the classification error L(f) := E_{(x,y)~D_test}[1{f(x) ≠ y}] of f w.r.t. D_test is low. This framework encapsulates many problems such as adversarial robustness, domain adaptation, long-tail learning, few-shot learning, and the problem considered here: group shift.

In the group-shift setting (Sagawa et al., 2020a), we further assume that associated with each data point x is an attribute a(x) (some sub-property or statistic of x) from a set of possible attributes A. These attributes together with the labels form the set of possible groups G = A × Y that each example can take. We denote an input x's group label by g(x) ∈ G. We then define the classification error of a predictor f (w.r.t. a fixed implicit distribution) restricted to a group g ∈ G as L_g(f) := E_{(x,y) | g(x)=g}[1{f(x) ≠ y}]. The worst-group error upper bounds the error of f w.r.t. any group: L_wg(f) := max_{g ∈ G} L_g(f). Using this notation, the group-shift problem aims to discover a classifier in argmin_{f ∈ F} L_wg(f) = argmin_{f ∈ F} max_{g ∈ G} L_g(f). We observe that the group-shift problem is a special case of the domain generalization problem where D_test is the distribution consisting of only the points (x, y) with g(x) restricted to the worst group of f in G. GDRO solves this objective with a minimax optimization procedure that alternates between updating the model's weights and (relaxed) weights on the groups.

Spurious Correlations and Memorization.
As an example, consider the Waterbird dataset (Sagawa et al., 2020a), constructed by combining images of water/land birds from the CUB dataset (Welinder et al., 2010) with water/land backgrounds from the PLACES dataset (Zhou et al., 2017). The task is to distinguish whether an image of a bird is a waterbird or a landbird. In terms of our problem, the type of bird forms the labels Y, and the backgrounds are set to be the attributes A for each type of bird. Altogether, these form four groups: G = Y × A = {waterbird, landbird} × {water, land}. The dataset is constructed so that birds on matching backgrounds are significantly more frequent than those on mismatched backgrounds. The backgrounds are thus spuriously correlated with the labels: predicting the background alone already achieves high average accuracy w.r.t. the train distribution. As expected, for ERM-trained models, the groups with the highest error are the minority groups where the background mismatches the type of the bird, suggesting that the model is actually predicting using the background instead of the bird. Furthermore, the fact that these high-capacity models achieve zero training error leads to the conclusion that they not only utilize spurious features like the background for their predictions but must also have memorized the minority groups during training (Sagawa et al., 2020b). These problems are common when there is data imbalance (Feldman and Zhang, 2020). In the next section, we propose a method that attempts to circumvent these issues.

Algorithm 1 (splitting step): split D'_L into D_1 and D_2 such that |D_1| = (1 - p) · |D'_L| and |D_2| = p · |D'_L|; set D'_L ← D_2 and D'_U ← D'_U ∪ D_1.
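Concretely, the per-group error L_g(f) and worst-group error L_wg(f) defined above can be computed with a few lines of NumPy. This is an illustrative sketch; the function names are ours:

```python
import numpy as np

def group_errors(y_true, y_pred, groups):
    """Per-group 0-1 error L_g(f) for each group g present in `groups`."""
    errs = {}
    for g in np.unique(groups):
        mask = groups == g
        errs[g] = float(np.mean(y_true[mask] != y_pred[mask]))
    return errs

def worst_group_error(y_true, y_pred, groups):
    """L_wg(f) = max_g L_g(f)."""
    return max(group_errors(y_true, y_pred, groups).values())
```

Note that a classifier with low average error can still have high worst-group error when one group is small, which is exactly the failure mode discussed above.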

3. METHOD: CLASSIFIER RETRAINING ON INDEPENDENT SPLITS

Algorithm 1 presents an outline of our main method: Classifier Retraining On Independent Splits, or CROIS. Given group-labeled data and group-unlabeled data, CROIS involves several steps: (1) organize the data into one group-labeled split D'_L and one group-unlabeled split D'_U; (2) obtain an ERM-trained feature extractor with the group-unlabeled split D'_U; and (3) perform robust classifier retraining with the group-labeled split D'_L, where classifier retraining refers to fine-tuning the final linear layer of a DNN. In the setting where group labels are limited (as in Section 4.1), |D_L| is much smaller than |D_U| and we do not need to set p < 1; there, we are primarily concerned with partitioning D_L into D'_L and D^(val)_L. On the other hand, when training group labels are available (as in Section 4.2) and |D_U| is much smaller than |D_L|, the optional parameter p in step 3 controls the size of D'_U used to obtain a feature extractor and the amount of group labels D'_L actually used at train time.

Motivation. Inspired by previous works that have demonstrated the potential of simple ERM-trained DNNs on a variety of OOD tasks (Gulrajani and Lopez-Paz, 2021; Rosenfeld et al., 2021) and long-tailed tasks (Kang et al., 2019), we build a simple method around ERM-trained DNNs. As discussed below, a strength of an ERM-trained DNN is the convincing evidence that it contains good features; its weaknesses involve memorizing examples and using spurious features. CROIS is designed to alleviate these weaknesses while taking advantage of the good features of an ERM-trained DNN.
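The splitting step of Algorithm 1 (move a (1 - p) fraction of the group-labeled pool into the group-unlabeled pool used for ERM feature training) can be sketched as follows; the function name and index-array interface are ours, for illustration:

```python
import numpy as np

def crois_split(labeled_idx, unlabeled_idx, p, seed=0):
    """Splitting step of Algorithm 1 (sketch).

    Keeps a p fraction of the group-labeled pool for robust classifier
    retraining (D'_L <- D_2) and moves the remaining (1 - p) fraction into
    the group-unlabeled pool used for ERM training (D'_U <- D'_U u D_1).
    Returns (retrain_idx, erm_idx)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(labeled_idx)
    n_keep = int(round(p * len(perm)))
    retrain_idx = perm[:n_keep]                               # D_2
    erm_idx = np.concatenate([np.asarray(unlabeled_idx), perm[n_keep:]])
    return retrain_idx, erm_idx
```

The key property is that the two resulting index sets are disjoint, so the classifier is retrained on examples the feature extractor has never seen.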

Models trained by ERM contain good features.

While not the first work to notice that ERM-trained models contain good features, Kang et al. (2019) demonstrate this hypothesis through extensive experiments on several long-tailed vision datasets, investigating different learning strategies for obtaining a feature extractor as well as ways to fine-tune the classifier layer. There, an ERM-trained feature extractor combined with a non-parametric rescaling of the classifier layer achieves (then) state-of-the-art results on all three datasets. A similar study (Menon et al., 2021) confirms this hypothesis on vision datasets in the group-shift setting. This suggests that one key to the data-imbalance problem (from which the group-shift problem almost always suffers) lies in correcting the classifier layer, which forms the first phase of CROIS. It is peculiar that rescaling the classifier works best, whereas intuition suggests a data-dependent method like classifier retraining should be better. We hypothesize that this is related to the next issue of our discussion.

Models trained by ERM memorize minority examples. Overparameterized DNNs are well known to be capable of memorizing their training examples (Zhang et al., 2017). In the group-shift setting, this behavior has been investigated by Sagawa et al. (2020b), who provide empirical and theoretical justifications for DNNs' memorization of minority groups' training examples. Memorization of minority examples has also been observed in the broader framework of data imbalance (Feldman and Zhang, 2020). One way to circumvent memorization is to control the model's capacity by incorporating some combination of strong ℓ2 regularization, early stopping, and other correctional parameters, as done in Sagawa et al. (2020a). However, the additional tuning required can be costly. We instead tackle this via independent splits.

Circumventing memorization with independent splits. Examples that are memorized inevitably lose their usefulness in subsequent training. As memorized examples' (i.e.
already correctly classified) losses must be low, their gradients contain little information to be of much use. Furthermore, the features of memorized examples might not be representative of their group at test time, as illustrated in Figure 2.
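As an illustrative sketch of classifier retraining on frozen features, the following fits a simple logistic-regression head by gradient descent. This is a minimal stand-in for the robust (e.g., GDRO) retraining that CROIS actually uses; the function name and hyperparameters are ours:

```python
import numpy as np

def retrain_classifier(feats, y, lr=0.5, steps=500):
    """Fit a logistic-regression head on frozen features.

    `feats` is an (n, d) array of features from a frozen extractor and
    `y` contains binary labels in {0, 1}. Only the linear head (w, b) is
    learned; the feature extractor is left untouched."""
    n, d = feats.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # predicted probabilities
        grad = p - y                                # d(loss)/d(logit)
        w -= lr * feats.T @ grad / n
        b -= lr * grad.mean()
    return w, b
```

Because the head is a low-capacity linear model, it cannot memorize individual examples the way a full DNN can, which is part of the motivation for restricting the robust phase to the last layer.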

4. EXPERIMENTS

We conduct experiments in two settings: (1) group labels are only available from the validation split of the datasets (as in Liu et al. (2021); Nam et al. (2022)); and (2) a fraction of group labels is available from the training split and all group labels are available from the validation split.

Setup. We use a setup similar to Liu et al. (2021) and Sagawa et al. (2020a). To demonstrate the ease of tuning CROIS, unless noted otherwise (e.g., Table 2 and the parameter p in Table 3), we fix the hyperparameters of both the ERM phase and the robust classifier-retraining phase, reusing standard ERM parameters (see Appendix A for full hyperparameters and model details). Further results for CROIS with tuned hyperparameters are presented in Section C of the Appendix.

Datasets. We experiment on four datasets:

• Waterbird (Sagawa et al., 2020a). Combining the bird images from the CUB dataset (Welinder et al., 2010) with water or land backgrounds from the PLACES dataset (Zhou et al., 2017), the task is to classify whether an image contains a landbird or a waterbird without confounding with the background. There are 4795 training examples in total, and the minority group (waterbird, land background) has only 56 examples. We report the weighted test average accuracy due to the skewed nature of the val and test splits, consistent with Sagawa et al. (2020a).

• CelebA (Liu et al., 2015) is a popular image dataset of celebrity faces. The task is to classify whether the celebrity in the image is blond or not blond, with male or not male as the confounding attribute.

4.1. RESULT OF USING GROUP LABELS FROM THE VALIDATION SPLIT

Setup. In this section, we consider the setting where the group labels D_L are available only from the standard validation split; these group labels can be used for training (Nam et al., 2022) and/or model selection (Liu et al., 2021). Here, the training split is treated as the group-unlabeled set D_U. Most methods in this setting employ a pseudo-labeling approach to generate pseudo-group-labels that are then used to train a new network via a robust algorithm like GDRO. CROIS instead simply uses half of D_L for classifier retraining (D'_L) and the other half for model selection (D^(val)_L), and does not rely on pseudo-labeling. CROIS also reuses the initial model for the retraining phase, making it closer to a single-phase procedure with additional fine-tuning.

Results.

In Table 1, we compare CROIS against JTT (Liu et al., 2021) and SSA (Nam et al., 2022), reporting the mean and one standard deviation of the test average accuracy (Avg Acc) and worst-group accuracy (Wg Acc) across three random seeds. CROIS outperforms JTT on all 4 datasets and SSA on 3 datasets using default parameters. Note that, unlike CROIS and SSA, JTT only uses the available group labels for model selection. However, JTT requires training a large number of models across two phases, which can be expensive, and its model selection can be quite sensitive (see Section 5.4 of Liu et al. (2021)). SSA alleviates this problem by utilizing group labels more efficiently for pseudo-labeling. Our method dispenses with pseudo-labeling altogether while still achieving performance competitive with JTT and SSA.

Discussion. We clarify the differences between JTT and CROIS. First, the initial ERM phase of JTT is for inferring pseudo-group-labels for the group-unlabeled data. This phase requires careful hyperparameter tuning and capacity control using the group-labeled validation set to produce accurate pseudo-group-labels (as noted in Section 5.4 of Liu et al. (2021)). In contrast, CROIS trains a single ERM model and simply retrains the last layer with whatever group labels are available. Second, JTT's final performance is limited by GDRO's performance on the full network, which can be worsened by mislabeled pseudo-labels from the first phase. In contrast, we demonstrate in Section 4.2 that, by limiting GDRO to only the last layer, CROIS is competitive with full GDRO even when using only a fraction of the group labels and minimal tuning. Compared with SSA, CROIS does not rely on pseudo-labeling. SSA has one phase that performs pseudo-labeling and another that performs robust training with the inferred group labels; in the first phase, SSA trains a separate network that predicts the group rather than the class.
By treating pseudo-labeling as a semi-supervised learning problem, SSA's pseudo-labeling capability is shown to improve upon JTT's. Our results show that CROIS outperforms SSA on 3 out of 4 datasets while reusing default parameters.

Reducing validation split size. Following the setup in JTT and SSA, we vary the size of the validation split (20%, 10% and 5% of the original) to test whether our results still hold in these settings. We consider both the Waterbird and CelebA datasets. Note that in this setting CROIS must be tuned further to account for the increased difficulty of the reduced group-label quantity. Nevertheless, the smaller number of examples, together with training only the last layer, keeps the extra tuning inexpensive (details and setup in Section F). We present our results (with error bars) in Table 2, where we find that CROIS outperforms JTT and SSA at various percentage levels.

4.2. RESULT OF USING PARTIAL TRAINING SPLIT GROUP LABELS

Next, we consider the setting where group labels are available from both the training split and the validation split. In contrast to Section 4.1, the standard validation split is used only for model selection (D^(val)_L) and not for classifier retraining (D'_L). We compare CROIS using a fraction of the training split's group labels against GDRO using all of the group labels. Again, we fix the parameters of CROIS to the standard ERM parameters to demonstrate its ease of tuning (see Appendix A).

Setup. We study CROIS with different amounts of training group labels, determined by the parameter p: a (1 - p) fraction of the training split is used to obtain a feature extractor in the first phase (D'_U, which uses no group labels), and the remaining p fraction of group labels is used for robust classifier retraining (D'_L). This setup allows examining the trade-off between the quality of the feature extractor and the amount of data available for classifier retraining. Additionally, to demonstrate the importance of retraining on unseen examples, we experiment with robust classifier retraining using the same data as the first phase, i.e., without independent splits, denoted NCRT.

The parameter p. In practice, we expect |D_L| to be much smaller than |D_U|, as in Section 4.1. There, Table 2 suggests that reasonable robust performance can be achieved with a small fraction of group labels. In this setup, however, since group labels are abundant (D_U = ∅ and D_L is large), we treat p as a tunable parameter that controls the sizes of D'_U and D'_L. Furthermore, using a p fraction of the available group labels simulates the process of obtaining group labels for a random fraction of the data when one is constrained by a labeling budget.

Results.
In Table 3, CROIS outperforms GDRO on both image datasets and yields competitive performance on the two text datasets while using only a fraction of the group labels and reusing default hyperparameters. Our results imply that comparable or even better robust performance than GDRO can be obtained by collecting group labels for roughly 30% of the available training data (modulo validation). One exception is the severely group-imbalanced CivilComments (whose minority group makes up only 0.4% of the dataset); there, a higher fraction of group-labeled data is beneficial for obtaining more minority-group examples, so a more efficient sampling method that includes more minority examples (e.g., filtering by labels first) would be beneficial in practice. Finally, the results for NCRT show the importance of using independent splits for classifier retraining.

Trade-off between the feature extractor and the amount of group-labeled data for robust retraining. Across the datasets, allocating more examples to training the feature extractor (lower p) generally yields higher average accuracies. The worst-group error after classifier retraining has a more complex interaction with p, as it depends on both the quality of the feature extractor and the amount of group-labeled examples available for classifier retraining. While varying the proportion p in our experiments gives a rough estimate of this trade-off, we hypothesize that the availability of minority-group examples matters most for obtaining a robust classifier. We further support this intuition with an ablation study in Section D.2, where removing non-minority examples has an insignificant impact on the final group-robust performance.

Alleviating GDRO's sensitivity to model capacity. GDRO's requirement for model capacity control via either ℓ2 regularization or early stopping is well noted in the literature (Sagawa et al., 2020a).
In Table 13, we compare GDRO's and CROIS's sensitivity to different ℓ2 regularization strengths. While GDRO's performance on Waterbird is relatively uniform, it is more sensitive to the ℓ2 setting on CelebA; when ℓ2 = 1 on CelebA, GDRO fails altogether (see Figure 3). In contrast, CROIS achieves consistent performance across different ℓ2 settings. CROIS controls model capacity by limiting GDRO to the last layer, which alleviates GDRO's tendency to overfit and simplifies parameter tuning (as in Figure 1 for Waterbird).
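For concreteness, the online GDRO procedure restricted to a linear head, alternating an exponentiated-gradient update on the group weights q with a q-weighted gradient step on the head, can be sketched as follows. This is a simplified rendition for intuition only; step sizes and implementation details differ from Sagawa et al. (2020a), and all names are ours:

```python
import numpy as np

def gdro_last_layer(feats, y, groups, eta_q=0.1, lr=0.1, steps=300):
    """Online GDRO on a linear head only (sketch, binary labels in {0, 1}).

    Each iteration: (1) up-weight high-loss groups via an exponentiated
    update on q; (2) take a q-weighted gradient step on (w, b)."""
    groups = np.asarray(groups)
    G = int(groups.max()) + 1
    q = np.ones(G) / G                       # relaxed weights on the groups
    w, b = np.zeros(feats.shape[1]), 0.0
    counts = np.bincount(groups, minlength=G)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        per_ex = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        g_loss = np.array([per_ex[groups == g].mean() for g in range(G)])
        q *= np.exp(eta_q * g_loss)          # up-weight struggling groups
        q /= q.sum()
        ex_w = q[groups] / counts[groups]    # spread q_g over its members
        grad = (p - y) * ex_w
        w -= lr * feats.T @ grad
        b -= lr * grad.sum()
    return w, b, q
```

Because the head is linear, the minimax procedure operates over very few parameters, which is how CROIS sidesteps the capacity control that full-network GDRO requires.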

4.3. ABLATION STUDIES AND DISCUSSIONS

Obtaining a good feature extractor. An ablation study on the effects of different validation accuracies and different first-phase algorithms on the feature extractor's quality (measured by robust performance after classifier retraining) is presented in Section B. Similar to previous work (Kang et al., 2019), we find that ERM provides the best features, over reweighting or GDRO (both of which require group labels). We also find a positive correlation between average validation accuracy and feature quality, which serves as a proxy for CROIS's model selection criterion in the first phase and significantly simplifies parameter tuning compared with other two-phase methods in the group-shift setting.

GDRO is better than reweighting and subsampling for classifier retraining. Table 4 contains results for different classifier-retraining methods. We observe that GDRO produces the best group-robust performance (GDRO is, after all, designed for this setting). Reweighting and subsampling seem effective on the vision datasets but fail on the NLP datasets.

Classifier retraining outperforms full retraining with GDRO. Classifier retraining plays a central role in our method. In Table 5, we compare fine-tuning with GDRO on the full DNN versus just the last layer (LL) for an independent split with p = 0.30. GDRO on the LL is much better than full GDRO on most of the datasets (except CivilComments). The main practical difference is that while LL retraining (i.e., CROIS) requires little additional tuning, full-network GDRO requires a grid search over regularization strengths. CROIS is quite robust to different parameter settings and regularization strengths (details in Appendix C.1).

Models learned with ERM contain good features. The positive result for our decoupled training procedure provides further strong evidence that ERM-trained models contain good features for the group-shift problem.
While this is consistent with findings in the literature on vision datasets (Kang et al., 2019; Menon et al., 2021), our work provides some of the first evidence for this hypothesis on non-vision tasks, where the same result would not have been possible without independent splits, as evident from the NCRT results in Table 3.

Simplified model selection. The model selection criterion of picking the model with the best average validation accuracy simplifies hyperparameter tuning compared with other two-phase methods. This choice is based mainly on the ablation experiments in Section B of the Appendix, where we observe that higher average validation accuracy generally indicates better features.
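As a reference point for the reweighting baseline compared in the ablations above, one common formulation uses inverse group-frequency example weights in the training loss. The sketch below is our formulation for illustration; exact implementations vary:

```python
import numpy as np

def inverse_frequency_weights(groups):
    """Per-example weights proportional to 1 / (group frequency),
    normalized to mean 1, for use in a reweighted training loss."""
    counts = np.bincount(groups)
    w = 1.0 / counts[groups]
    return w * len(w) / w.sum()
```

Minority-group examples receive proportionally larger weights, mimicking a balanced-group training distribution without discarding data as subsampling does.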

5. CONCLUSION AND FUTURE WORKS

In this paper, we propose Classifier Retraining on Independent Splits (CROIS), a simple method that reduces the amount of group annotations needed to improve worst-group performance and alleviates GDRO's requirement for careful control of model capacity. Our experimental results show the effectiveness of CROIS on four standard datasets across two settings and provide further evidence that ERM-trained models contain good features for the group-shift problem.

Future works. The richness of ERM-trained DNNs' features can potentially be useful for the seemingly harder group-agnostic setting (where no group labels are available): the practitioner can focus on obtaining a robust classifier given an ERM-trained feature extractor, and we have shown that reasonable robustness can be achieved with relatively few group labels, which brings the problem closer within reach. On a broader note, while most works in representation learning focus on producing good features (whether with supervised, unsupervised, or self-supervised approaches), further examination of different ways to perform classifier retraining in different settings (as in our work) could give a fuller picture of the feature quality of different methods.

Reproducibility statement. We include our source code in the supplemental material. All implementation details and hyperparameters are detailed in Section A of the Appendix.

Table 6: Hyperparameters used in our experiments (first phase / second phase) for each dataset.

                     Waterbird     CelebA        MultiNLI      CivilComments
Learning Rate        1e-4 / 1e-4   1e-4 / 1e-4   2e-5 / 2e-5   1e-5 / 1e-5
ℓ2 Regularization    1e-4 / 1e-4   1e-4 / 1e-4   0 / 0         1e-2 / 0
Number of Epochs     250 / 250     20 / 20       20 / 20       6 / 6

A EXPERIMENTAL DETAILS

A.1 INFRASTRUCTURE

We performed our experiments on two PCs, one with an NVIDIA RTX 3070 and the other with an NVIDIA RTX 3090. Our implementation is built on top of the code base of Liu et al. (2021). Experimental data was collected with the help of Weights and Biases (Biewald, 2020).

A.2 MODELS

We use ResNet50 (He et al., 2016) with ImageNet initialization and batch normalization for CelebA and Waterbird, and pretrained BERT (Devlin et al., 2018) for MultiNLI and CivilComments. We use the original train-val-test splits of all datasets and report results on the test split. Cross-entropy is the base loss for all objectives. SGD with momentum 0.9 is used for the vision datasets, while the AdamW optimizer with dropout and a fixed linearly decaying learning rate is used for BERT. We use a batch size of 16 for CivilComments and 32 for the other datasets. We do not use any additional data augmentation or learning rate scheduler in our experiments.

A.3 HYPERPARAMETERS

Table 6 contains the hyperparameters used in the experiments of Sections 4.1 and 4.2. Note that these are the standard parameters for obtaining an ERM model on these datasets, as in previous works (Sagawa et al., 2020a; Liu et al., 2021). The only difference is that we train Waterbird and CelebA for slightly fewer epochs, as we found no further increase in validation accuracy past those epochs. Unless noted otherwise, we do not tune any other hyperparameters. For the second phase on CivilComments, we do not use the default regularization but opt for 0, since the linear layer already has low capacity; adding further regularization does not seem to have much of an effect, as shown in Section C.1.

B ABLATION STUDIES: OBTAINING A GOOD FEATURE EXTRACTOR

B.1 IMPACT OF THE FEATURE EXTRACTOR'S ALGORITHM

Here, we provide evidence that ERM-trained models produce the best features for worst-group robustness. We conduct an experiment on Waterbird where, instead of using ERM to obtain a feature extractor, we perform GDRO or reweighting in the first phase. The results are in Table 7. While using reweighting or GDRO for the first phase defeats the purpose of reducing the number of group labels needed (ERM needs none), it is informative to examine the features alone. Even though ERM does not use group labels, it provides the best features for robust classifier retraining on an independent split.

B.2 IMPACT OF EARLY STOPPING AND VALIDATION ACCURACIES ON THE FEATURE EXTRACTOR

In this section, we present an ablation study of how different early-stopping epochs (Figure 4), average validation accuracies (Figure 5, left), and worst-group accuracies (Figure 5, right) of the initial ERM-trained model (the feature extractor) affect the GDRO retraining phase of CROIS. The results are from performing CROIS with GDRO and p = 0.30 on Waterbird across a wide range of epochs. Table 8 presents the full data generated for this section.

C ADDITIONAL HYPERPARAMETER TUNING

In the main paper, we demonstrated CROIS's effectiveness even when simply reusing the same hyperparameters as for training an ERM model. In this section, we present results for further parameter tuning of the robust classifier retraining phase. These results provide empirical evidence for CROIS's potential as well as its robustness to different hyperparameter settings.

C.1 ℓ2 REGULARIZATION

We investigate whether additional regularization helps classifier retraining with GDRO on CelebA (Table 9) and Waterbird (Table 10). We further examine the effects of regularization on CivilComments (Table 11) to support our choice in Section A.

C.2 LEARNING RATE

We examine the effects of different learning rates on CROIS on CelebA and Waterbird in Table 12. Lower learning rates appear to be more beneficial.

Comparison with GDRO. We present a comparison of CROIS's and GDRO's test performance under different ℓ2 regularization configurations on CelebA and Waterbird in Table 13. On CelebA, GDRO is quite sensitive to model capacity, while it is less so on Waterbird. We note that GDRO fails to converge to a good stationary point when ℓ2 = 1 on CelebA (see Figure 3).

Table 13: Worst-group test accuracy for GDRO and CROIS (p = 0.30) with ℓ2 regularization varied over {0, 10^-4, 10^-3, 10^-2, 10^-1, 1} on CelebA and Waterbird. [Table body not recoverable from the extraction.]

Effects of robust retraining on predictions. We examine how robust classifier retraining on an independent split (with p = 0.30) affects the model's predictions on D_U and D_L. Tables 14 and 15 show the accuracy on D_U and D_L for Waterbird and CelebA. The "Points changed" column indicates the number of points whose prediction changes after robust retraining, per group, along with the total number of examples in that group (percentage in parentheses). The worst group is underlined in the tables. As Table 16 shows, there is no significant difference between the two sampling strategies, suggesting that it is the availability of minority-group examples that plays the key role in robust classifier retraining.
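The "Points changed" bookkeeping described above can be reproduced with a few lines of Python. This is an illustrative sketch with our own function name and group encoding, not the paper's analysis code:

```python
from collections import Counter

def points_changed_per_group(preds_before, preds_after, groups):
    """Per group: (number of examples whose prediction changed after
    robust retraining, group size, percentage changed)."""
    changed, total = Counter(), Counter()
    for pb, pa, g in zip(preds_before, preds_after, groups):
        total[g] += 1
        if pb != pa:
            changed[g] += 1
    return {g: (changed[g], total[g], 100.0 * changed[g] / total[g])
            for g in total}

# Toy example with a majority and a minority group.
before = [0, 0, 1, 1, 0, 1]
after  = [0, 1, 1, 0, 0, 1]
groups = ["maj", "maj", "maj", "min", "min", "min"]
stats = points_changed_per_group(before, after, groups)
# One of three predictions changes in each group here.
```

The same dictionary can then be rendered as the "Points changed" column of Tables 14 and 15.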

E ADDITIONAL COMPARISON

We provide additional baselines for comparison in the tables below.

E.1 ADDITIONAL BASELINES FOR GROUP LABELS FROM THE VALIDATION SET

In Table 17, we compare CROIS against JTT (Liu et al., 2021) and SSA (Nam et al., 2022), as well as additional baselines: CVaR DRO (Levy et al., 2020), LfF (Nam et al., 2020), EIIL (Creager et al., 2021), CnC (Zhang et al., 2022), and UMIX (Han et al., 2022). We report the mean and one standard deviation of the test average accuracy (Avg Acc) and worst-group accuracy (Wg Acc) across three random seeds.

E.2 ADDITIONAL BASELINES FOR GROUP LABELS FROM TRAINING SET

In the setting where group labels are available from the training set, we compare our method against additional baselines such as LISA (Yao et al., 2022).

F FRACTION OF THE VALIDATION SET: IMPLEMENTATION DETAILS FOR SECTION 4.1

Following the setup of Section 4.1 and of Liu et al. (2021); Nam et al. (2022), we further reduce the validation set to only a small fraction: 5%, 10%, and 20%. We investigate CROIS's performance in this very-few-group-labels setting on CelebA and Waterbird in Section 4.1. We note that the highly reduced sample size poses new challenges and makes it harder to simply reuse the default parameters:

• Tuning the learning rate: we tune the learning rate across {10^-5, 10^-4, 10^-3, 10^-2} instead of simply reusing the default.

• The use of group labels and model selection: since the number of examples for classifier retraining is now significantly reduced, it might be wasteful to further split the available group labels for validation. Instead, we use all available group labels for robust classifier retraining and perform model selection in the second phase via the train worst-group accuracy. The low-capacity linear layer and higher ℓ2 regularization allow us to avoid overfitting when performing model selection this way. The feature extractor from the first phase is selected via the best average accuracy on the full group-unlabeled validation set.

• Smaller batch size: since GDRO requires group-balanced sampling, a batch size greater than the number of examples in some group would cause duplicate sampling of that group's examples within the same step, artificially increasing its weight. We tune the batch size across a grid of powers of 2 less than the size of the smallest group or the default batch size (e.g., we search across {4, 8, 16} if the size of the smallest group is 17).

In Table 2, we present the results for CROIS with the above modifications and compare it to CROIS with default parameters and to JTT.
There, robust retraining for CelebA is performed with an ℓ2 regularization of 0.1, a batch size of 8, and a learning rate of 10^-5. For Waterbird, we found that a batch size of 8, a weight decay of 1, and a learning rate of 10^-5 work best for the 20% and 10% reductions; for the 5% reduction, we further reduce the batch size to 4 (since the minority group only has 7 examples) and increase the weight decay to 10. The results show that CROIS maintains its robust performance even with greatly reduced group labels, implying that, with a proper configuration, even a few (minority) examples can help debias the final-layer classifier.
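The batch-size grid described above can be generated mechanically. A small sketch (the function name and the minimum batch size of 4 are our assumptions, inferred from the example in the text):

```python
def batch_size_grid(smallest_group_size, default_bs=32, min_bs=4):
    """Powers of 2, starting at `min_bs`, that stay strictly below the
    smallest group's size and do not exceed the default batch size.

    This avoids duplicate sampling of minority examples within one
    group-balanced GDRO step."""
    grid, b = [], min_bs
    while b < smallest_group_size and b <= default_bs:
        grid.append(b)
        b *= 2
    return grid

batch_size_grid(17)  # -> [4, 8, 16], matching the example in the text
batch_size_grid(7)   # -> [4], matching the 5% Waterbird case
```

With group-balanced sampling, each step draws an equal number of examples per group, so the constraint is driven entirely by the smallest group.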

G IMPLICATIONS AND TAKEAWAYS

Comparison to standard pretraining and finetuning. As mentioned in the related-works section and in the discussion motivating our method, pretraining followed by finetuning is a well-established strategy in many domains. Our work differs most significantly from this standard strategy in two ways:

1. The use of independent splits: traditional pretraining and finetuning reuses the same dataset for both phases, possibly with additional labels (contrastive learning, long-tail learning, etc.). We have demonstrated through extensive experiments the importance of an independent split for classifier retraining to work well in the group-robust setting.

2. The use of a group-robust algorithm for finetuning: we mainly utilize GDRO for the finetuning phase, while most other works use simpler finetuning strategies such as reweighting or subsampling. Our experiments show that GDRO yields the best robust performance among these methods.

Our work provides evidence that the features of ERM-trained DNNs are rich enough to solve the group-shift problem (when an abundant amount of group labels is available to retrain the classifier), and that a major reason for the poor worst-group performance of an ERM-trained DNN lies in its classifier layer. We further demonstrate that even a few group labels can sufficiently "fix" the classifier to achieve better group-robust performance. This knowledge can potentially be useful for the seemingly much harder group-agnostic setting (where no group label is available), by allowing the practitioner to focus on obtaining a robust classifier given an ERM-trained feature extractor. Our experiments show that reasonable robustness can be achieved with relatively few group labels (that are not used to obtain the feature extractor), which brings the problem closer within reach.
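As a concrete illustration of the group-robust finetuning step discussed above, here is a minimal NumPy sketch of last-layer retraining with the online group-DRO update of Sagawa et al. (2020a): group weights are raised multiplicatively according to their losses, and the linear classifier takes a gradient step on the reweighted loss. The toy "frozen" features, dimensions, and step sizes are our own assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Frozen" phase-1 features for two groups; each group is linearly
# separable along a different direction, so a robust classifier must
# attend to both (the minority group is 20x smaller).
def make_group(center, n):
    pos = rng.normal(loc=center, scale=0.3, size=(n, 2))
    neg = rng.normal(loc=[-c for c in center], scale=0.3, size=(n, 2))
    return np.vstack([pos, neg]), np.concatenate([np.ones(n), np.zeros(n)])

groups = [make_group([2.0, 0.0], 200),   # majority group
          make_group([0.0, 2.0], 10)]    # minority group

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    eps = 1e-9  # numerical safety for log
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w = np.zeros(2)              # last-layer weights being retrained
q = np.ones(2) / 2           # group weights maintained by GDRO
eta_q, eta_w = 0.5, 0.5
for _ in range(300):
    losses = np.array([bce(sigmoid(X @ w), y) for X, y in groups])
    q = q * np.exp(eta_q * losses)   # exponentiated ascent: up-weight
    q = q / q.sum()                  # the currently worst group
    grad = sum(qk * Xk.T @ (sigmoid(Xk @ w) - yk) / len(yk)
               for qk, (Xk, yk) in zip(q, groups))
    w = w - eta_w * grad             # descend on the reweighted loss

accs = [np.mean((sigmoid(X @ w) > 0.5) == (y == 1)) for X, y in groups]
```

Because only a low-capacity linear layer is trained here, the overfitting pressure that forces heavy regularization in full-network GDRO is largely absent, which is the point made throughout the paper.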



Rescaling each row of the linear classifier using the row's norm to some power; see Kang et al. (2019).

Half of the data is used to obtain a feature extractor, while the other half is used to obtain the features of the unseen examples.

In Sagawa et al. (2020a), Group Adjustment (GA) (Cao et al., 2019) is observed to improve Waterbird's worst-group accuracy to 90.5%. We also notice an improvement when GA is incorporated into CROIS, obtaining a 90.3% ± 0.62 test worst-group accuracy. We also observe that, similarly to Sagawa et al. (2020a), GA only works for the synthetic dataset Waterbird, but not for CelebA or MultiNLI.

• Tuning ℓ2 regularization: when using so little data, overfitting can become a bigger problem, even when training only a low-capacity linear classifier. Hence, we tune ℓ2 regularization across the higher values {10^-4, 10^-2, 1, 10}.



3: Obtain the initial model f by running ERM on D'_U and selecting the best model in terms of average accuracy on D_U^(val).
4: Perform classifier retraining R with feature extractor f on D'_L, then select the best model in terms of worst-group accuracy on D_L^(val) as the final output.

Figure 2: tSNE projection (Van der Maaten and Hinton, 2008) of the features of an ERM-trained ResNet50 on seen (left) versus unseen (right) examples from Waterbird. The features of the minority groups (orange and yellow) in the unseen examples appear better separated from the majority groups than those of the seen examples. Using unseen examples plays a major role in CROIS's ability to improve worst-group performance via robust classifier retraining.

Memorization behavior of high-capacity DNNs. The ability of high-capacity DNNs to memorize training examples is now well known (Zhang et al., 2017). In the group-shift setting, this behavior was investigated by Sagawa et al. (2020b), who provide empirical and theoretical justification for DNNs' memorization of minority-group training examples. Memorization of minority examples has also been observed in the broader framework of data imbalance (Feldman and Zhang, 2020). One way to circumvent memorization is to control the model's capacity via some combination of high ℓ2 regularization, early stopping, and other correctional parameters, as done in Sagawa et al. (2020a). However, the additional tuning required can be costly. We instead tackle this via independent splits.

Figure 2 presents a visualization of the features of seen versus unseen examples. Combining this observation with the evidence that ERM-trained DNNs contain good features, CROIS performs robust classifier retraining on unseen examples (D'_L in Algorithm 1) in the hope of learning a classifier that utilizes features more representative of the examples encountered at test time.

Figure 3: While GDRO often requires high ℓ 2 regularization to avoid overfitting, setting ℓ 2 too high might cause instability in GDRO's minimax optimization procedure (left).

Figure 4: The effect of using different epochs for the feature extractor (phase 1) on classifier retraining's (phase 2) test accuracies.

Algorithm 1 Classifier Retraining on Independent Splits (CROIS)
1: Input: training data D_L with group labels and training data D_U without group labels; classifier retraining algorithm R (default: GDRO); optional splitting parameter p (default: 1).
2: Obtain validation sets by partitioning D_L into D'_L and D_L^(val), and D_U into D'_U and D_U^(val).
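The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a schematic with placeholder trainers, not the released implementation; the validation fraction, and our reading of p as the fraction of the group-labeled split kept for retraining, are assumptions:

```python
import random

def crois(D_L, D_U, retrain, run_erm, p=1.0, val_frac=0.2, seed=0):
    """Schematic CROIS driver.

    `run_erm` and `retrain` stand in for the two training phases;
    they are placeholders here, not real trainers."""
    rng = random.Random(seed)

    def split(data, frac):
        data = list(data)
        rng.shuffle(data)
        cut = int(len(data) * frac)
        return data[cut:], data[:cut]          # (train part, val part)

    # Step 2: carve validation sets out of both pools.
    D_L_tr, D_L_val = split(D_L, val_frac)
    D_U_tr, D_U_val = split(D_U, val_frac)
    # Optionally keep only a fraction p of the group-labeled split.
    D_L_tr = D_L_tr[: int(len(D_L_tr) * p)]

    f = run_erm(D_U_tr, D_U_val)               # Steps 1+3: feature extractor
    return retrain(f, D_L_tr, D_L_val)         # Step 4: robust classifier

# Usage with dummy trainers, just to exercise the splitting logic.
out = crois(list(range(100)), list(range(1000)),
            retrain=lambda f, tr, val: (f, len(tr), len(val)),
            run_erm=lambda tr, val: "feature_extractor",
            p=0.3)
# out == ("feature_extractor", 24, 20)
```

The essential property is that the examples used to retrain the classifier (D'_L) are disjoint from those used to fit the feature extractor, which is what prevents the memorization effects discussed above.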

Experimental results for the setting where only group labels from the validation split are used. Results for JTT and SSA are taken from Liu et al. (2021) and Nam et al. (2022), respectively. The numbers in parentheses denote one standard deviation from the mean across 3 random seeds. See Table 17 in Appendix E for comparison with additional baselines. MultiNLI (Williams et al., 2017) is a natural language inference dataset for determining whether a sentence's hypothesis is entailed by, is neutral with, or contradicts its premise. The spurious attribute is the presence of negation words like no, never, or nothing. This task has 6 groups, with 206,175 total training examples and 1,521 examples in the minority group (entailed and contains negation).

Worst-group test accuracy for partial group labels from the validation split. Results for SSA and JTT are taken from Table 3 of Nam et al. (2022). The standard deviation is reported over three independent runs. See Table 18 in Appendix E for comparison with additional baselines.

Comparison between CROIS and GDRO. NCRT refers to naive classifier retraining, i.e., when an independent split is not used during the retraining phase. Results marked with † are taken from Sagawa et al. (2020a). *For Waterbird, we omit the result for p = 0.05 because the dataset is so small that no minority-group example could be sampled for robust retraining.

Comparison between reweighting, subsampling, and GDRO as classifier retraining algorithms on top of the same feature extractor (p = 0.30).

Comparison between retraining with GDRO on the full network (Full) versus last layer (LL).

Hyperparameters used in the experiments. The slash indicates the parameters used in the first phase (feature extractor) versus the second phase (classifier retraining).

Effects of different methods for obtaining a feature extractor on test average accuracy and test worst-group accuracy (with ResNet50 on Waterbird).

CROIS with GDRO (p = 0.30) on Waterbird. Average (Avg Acc) and worst-group (Wg Acc) accuracies for various epochs of the feature extractor ("Phase 1") and the corresponding test accuracies after classifier retraining ("Phase 2"). While training for more epochs seems to help phase 2's average and worst-group accuracy, the benefit is small. Hence, simply selecting the model with the best validation average accuracy (row Epoch 131, denoted BEST here) yields good enough features, simplifying our training procedure and model selection criterion.

Effects of ℓ 2 regularization on classifier retraining with GDRO on CelebA.

Effects of ℓ 2 regularization on classifier retraining with GDRO on Waterbird.

Effects of ℓ 2 regularization on classifier retraining with GDRO on CivilComments.

Effects of varying the learning rate on CROIS (p = 0.30) on CelebA and Waterbird. We fix ℓ2 regularization to 10^-4.

Model's predictions on D_U and D_L for Waterbird before and after robust classifier retraining on an independent split.

Model's predictions on D_U and D_L for CelebA before and after robust classifier retraining on an independent split. RR denotes "after robust retraining".

Performance of CROIS when retraining on a subsampled split versus the full split. Subsampling the split does not seem to impact CROIS's performance, indicating that what matters is the availability of minority examples.

Experimental results for the setting where only group labels from the validation set are used. Results for C-DRO (CVaR DRO), LfF, EIIL, JTT, and SSA are from Nam et al. (2022). Results for UMIX are from Han et al. (2022). Results for CnC are from Zhang et al. (2022). The numbers in parentheses denote one standard deviation from the mean across 3 random seeds.

Additional comparison where group labels from the training set are available. Results for LISA are taken from Yao et al. (2022). Results for CAMEL are taken from Goel et al. (2020).

