TAKE ONE GRAM OF NEURAL FEATURES, GET ENHANCED GROUP ROBUSTNESS

Abstract

Predictive performance of machine learning models trained with empirical risk minimization (ERM) can degrade considerably under distribution shifts. In particular, the presence of spurious correlations in training datasets leads ERM-trained models to display high loss when evaluated on minority groups that do not present such correlations in test sets. Extensive attempts have been made to develop methods improving worst-group robustness. However, they require group information for each training input or, at least, a validation set with group labels to tune their hyperparameters, which may be expensive to obtain or unknown a priori. In this paper, we address the challenge of improving group robustness without group annotations during training. To this end, we propose to automatically partition the training dataset into groups based on Gram matrices of features extracted from an identification model, and to apply robust optimization based on these pseudo-groups. In the realistic context where no group labels are available, our experiments show that our approach not only improves group robustness over ERM but also outperforms all recent baselines.

1. INTRODUCTION

Empirical Risk Minimization (ERM) is the most standard machine learning formulation, which assumes that training and testing samples are independent and identically distributed (Vapnik, 1991). While academic datasets are mainly built to respect this assumption, practical settings display more challenging configurations with distribution shifts. Among different types of shifts, training data can be affected by selection biases and confounding factors, also called spurious correlations (Woodward, 2005; Duchi et al., 2019). Imagine crowd-sourcing an image dataset of camels and cows (Beery et al., 2018). Due to selection biases, a large majority of cows stand in front of a grass environment and camels in the desert. A simple way to differentiate cows from camels would be to classify the background, an undesirable shortcut that ERM will naturally exploit. Consequently, ERM may perform poorly on minority groups that do not display such spurious correlations (Hashimoto et al., 2018; Tatman, 2017; Duchi et al., 2019), e.g., a cow standing in the desert. To overcome this issue, recent works (Creager et al., 2021; Bao & Barzilay, 2022; Sohoni et al., 2020; Liu et al., 2021; Ahmed et al., 2021; Kirichenko et al., 2022) rely on two-stage schemes: first, automatic environment discovery (e.g., based on deep feature clustering); then, robust optimization based on environment pseudo-labels. Environment here refers to a recurring setting, not intrinsic to the object of interest, that may affect its classification, such as background, object color or object pose. However, all these approaches require the availability of ground-truth environment labels on a validation set to properly tune their hyperparameters. This paper addresses the problem of learning a robust classifier which, for instance, would not confuse a cow standing in the desert with a camel, although not given any annotation about grass or desert.
In computer vision, many identified spurious correlations are closely related to visual aspects, such as background (Beery et al., 2018), texture (Geirhos et al., 2019), image style (Hendrycks et al., 2021), physical attributes (Liu et al., 2015) or camera characteristics (Koh et al., 2021). In this work, we assume that relevant environment labels can be inferred from visual feature statistics, and demonstrate that they lead to meaningful environments and robust classifiers on standard datasets used to evaluate robust classification. We propose a two-stage approach, GRAMCLUST, which first assigns a group label, i.e., a class-environment pair label, by partitioning a training dataset into clusters of images with similar visual statistics, and then trains a robust classifier based on these pseudo-group labels. Our approach is summarized in Fig. 1. We use Gram matrices as visual descriptive statistics, which are second-order moments of neural activations. Gram matrices are well known for displaying impressive results in style transfer techniques (Gatys et al., 2016), but, more importantly for the interpretation of our approach, Li et al. (2017) demonstrate that matching Gram matrices between two groups of images is equivalent to aligning the respective distribution of each group by minimizing the Maximum Mean Discrepancy (MMD). Therefore, our method can be interpreted as grouping images into clusters of similar feature distributions, which are sensible candidates for environments.
Our main contributions are as follows: (1) We introduce an easy-to-scale method to split training images among distinct pseudo-environments, based on feature Gram matrices extracted by a specifically-trained identification model; (2) GRAMCLUST alleviates the need for ground-truth group labels altogether, even in the validation set, as hyperparameters are set based on validation performance computed from our pseudo-groups; (3) Extensive experiments on various image classification datasets with spurious correlations show that GRAMCLUST outperforms all recent baselines addressing robustness without group annotation. In particular, on the realistic large-scale CelebA dataset (Liu et al., 2015), we improve worst-group test accuracy by +24.3 points.
Under subpopulation shift, the goal is to perform well even on the minority group, a property also referred to as group robustness. In this study, we focus on this form of distribution shift. Note that prior two-stage methods apply GroupDRO (Sagawa et al., 2020a) to minimize the worst-group loss on their inferred groups; their best hyperparameters were actually found using a validation set with true-group labels in the original studies.

2. RELATED WORK

Gram matrices. The original work of Gatys et al. (2016) demonstrated impressive results in generating images with the style of an existing image: the style of a first image is transferred to a second one by matching Gram matrices of features extracted by a convolutional neural network. Sastry & Oore (2020) also used Gram matrices in out-of-distribution detection, identifying anomalies by comparing their values to the respective ranges observed over the training data. Interestingly, Li et al. (2017) demonstrate a formal equivalence between matching Gram matrices of neural activations with an L2 norm and minimizing the MMD with a second-order polynomial kernel. This shows that Gram matrices are also implicitly used in the process of distribution alignment between images. This finding motivates our approach, which consists in discovering pseudo-groups using Gram matrices.
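To make this equivalence concrete, here is a short sketch of the argument of Li et al. (2017), treating the $M$ spatial feature vectors $f^x_1, \ldots, f^x_M$ of an image $x$ (the rows of its feature map) as samples from its feature distribution; the notation is ours. With the second-order polynomial kernel $k(a,b) = (a^\top b)^2$, the squared MMD between two images $x$ and $y$ is

$$\mathrm{MMD}_k^2 = \frac{1}{M^2}\sum_{m,m'} k(f^x_m, f^x_{m'}) \;-\; \frac{2}{M^2}\sum_{m,m'} k(f^x_m, f^y_{m'}) \;+\; \frac{1}{M^2}\sum_{m,m'} k(f^y_m, f^y_{m'}).$$

Since $(a^\top b)^2 = \langle a a^\top, \, b b^\top \rangle_F$, each term is a Frobenius inner product of Gram matrices $G_x = \frac{1}{M}\sum_m f^x_m (f^x_m)^\top$, so

$$\mathrm{MMD}_k^2 = \lVert G_x \rVert_F^2 - 2 \langle G_x, G_y \rangle_F + \lVert G_y \rVert_F^2 = \lVert G_x - G_y \rVert_F^2,$$

i.e., matching Gram matrices in the L2 sense is exactly minimizing this MMD.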

3. GRAMCLUST: A CLUSTERING APPROACH FOR ROBUST OPTIMIZATION

Our method, GRAMCLUST, consists of two main steps. First, we discover pseudo-environments among the images of a given dataset (see Section 3.2). Second, we train a robust classifier that leverages the inferred pseudo-environment labels to reduce classification errors due to spurious environment correlations (see Section 3.3). To discover environments, we train an exogenous "identification model" for a few iterations. Then, using this model, we compute for each image its Gram matrix representation from different layers and apply random projections to reduce dimensionality. The resulting concatenated features are then fed to an unsupervised clustering algorithm (k-means) to produce pseudo-environment labels. This allows us to define pseudo-groups as the intersections of pseudo-environments and classes. Last, we train the target classifier by minimizing the standard cross-entropy classification loss on the worst pseudo-group with GroupDRO (Sagawa et al., 2020a).

3.1. PROBLEM FORMULATION AND NOTATIONS

Let us consider a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ of $N$ samples in $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y} = \llbracket 1, K \rrbracket$ a set of labels. We assume the data is sampled from random variables $(X^e, Y^e)$ in $\mathcal{X} \times \mathcal{Y}$ with probability law $P_{(X^e, Y^e)}$ for all $e \in \llbracket 1, E \rrbracket$, where $E$ is the number of environments. The full dataset can then be seen as the union of subsets associated to each random variable, i.e., $D = \bigcup_{e=1}^{E} D_e$, where each $D_e$ is composed of i.i.d. realisations of a random variable with joint probability law $P_{(X^e, Y^e)}$. For notation purposes, we actually choose the following equivalent formulation for the dataset: $D = \{(x_i, y_i, e_i)\}_{i=1}^N \in \mathcal{X} \times \mathcal{Y} \times \llbracket 1, E \rrbracket$, where $e_i$ refers to the environment from which $x_i$ and $y_i$ were sampled.

Our goal is to find a model $m$ in a given hypothesis space $\mathcal{M}$ that minimizes the error on the worst group, a group being defined as a set of samples both from the same class and in the same environment. Formally, we introduce the group distributions:

$$P_{G_{1,1}} = P(X^1 \mid Y^1 = 1), \;\; \cdots, \;\; P_{G_{E,K}} = P(X^E \mid Y^E = K). \quad (1)$$

The purpose is then to solve the following minimization problem:

$$\arg\min_{m \in \mathcal{M}} \; \max_{g \in \llbracket 1, E \rrbracket \times \llbracket 1, K \rrbracket} \; \mathbb{E}_{(x,y) \sim P_{G_g}} \left[ \ell(m(x), y) \right], \quad (2)$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ is the cross-entropy loss between the model's prediction and the true label. Note that we have no access to any environment labels. To circumvent this issue, we first discover pseudo-environment labels, then estimate the pseudo-group distributions to be used in Eq. (2).

3.2. DATASET PARTITION

In this section, we describe the first stage of GRAMCLUST, which aims at environment discovery. The method is illustrated in Fig. 2.

Identification model. Our approach starts by initializing a convolutional neural network $\Phi$ for the classification task at hand; it is composed of $L$ layers with parameters $\omega$ and is pre-trained on ImageNet (Deng et al., 2009). Liu et al. (2021) observed that ERM tends to fit models on data presenting easy-to-learn spurious correlations at the beginning of the learning process; it is only after a significant number of epochs that the model starts to learn more difficult patterns. Hence, we only train $\Phi$ during a few iterations, minimizing w.r.t. $\omega$ the following empirical loss:

$$\frac{1}{N} \sum_{i=1}^{N} \ell(\Phi(x_i, \omega), y_i), \quad (3)$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ is the cross-entropy loss between the model's predicted label $\Phi(x_i, \omega)$ and the true label $y_i$ associated with sample $x_i$. In the following, we call $\Phi$ the identification model, as our clustering is based on features extracted from this model. The idea is to leverage the biases learned by $\Phi$ to identify relevant environments and partition the training dataset into groups of images presenting spurious correlations, on the one hand, and groups of images free from these correlations, on the other hand. Hence, after this initial training and in the rest of the paper, the parameters $\omega$ of the identification model $\Phi$ are frozen.

Feature Gram matrices. We denote the feature map of an image $x$ at layer $l$ of $\Phi$ by $\phi_l(x) \in \mathbb{R}^{M_l \times C_l}$, where $C_l$ is the number of channels and $M_l$ is the spatial size of the feature map. For each image $x_i \in \mathcal{X}$, we extract its feature maps at $S \leqslant L$ different and fixed layers $\{l_1, \cdots, l_S\}$, and compute the Gram matrices defined as:

$$G_l(x_i) = \frac{1}{M_l} \, \phi_l(x_i)^\top \phi_l(x_i) \in \mathbb{R}^{C_l \times C_l}, \quad l \in \{l_1, \cdots, l_S\}. \quad (4)$$
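Eq. (4) amounts to one matrix product per selected layer. Below is a minimal NumPy sketch (the function name and the toy feature map are ours; in the paper, the feature maps come from a VGG-19 identification model):

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a feature map, as in Eq. (4).

    feat: array of shape (M, C), M spatial positions x C channels.
    Returns (1 / M) * feat^T feat, a C x C matrix.
    """
    M = feat.shape[0]
    return feat.T @ feat / M

# Toy feature map: 4 spatial positions, 3 channels.
phi = np.arange(12, dtype=float).reshape(4, 3)
G = gram_matrix(phi)
assert G.shape == (3, 3)
assert np.allclose(G, G.T)  # Gram matrices are symmetric
```

In practice, a feature map of shape (H, W, C) is first reshaped to (H·W, C) before this computation.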
Given input $x_i$ and identification model $\Phi$, the Gram matrix of its feature map $\phi_l(x_i)$ encodes visual correlations via an inner product between each pair of vectorized channel maps. In visual style transfer (Gatys et al., 2016), these Gram matrices have been shown to encode the "style" of an image, that is, loosely speaking, its textures and color palette, by contrast with its "structure".

Clustering with k-means. For each image $x_i$, we vectorize and normalize its $S$ associated Gram matrices: $f_{i,l} = \mathrm{vec}(G_l(x_i)) / \lVert \mathrm{vec}(G_l(x_i)) \rVert_2 \in \mathbb{R}^{C_l^2}$. The normalization balances the contributions of the different Gram matrices in the clustering loss. Each image $x_i$ is thus encoded by the vector $f_i = [f_{i,1}, \ldots, f_{i,S}] \in \mathbb{R}^{C}$, where $C = \sum_l C_l^2$. Relying on the assumption that environments can be inferred from visual feature statistics, we propose to discover $E'$ environments, $E'$ being a parameter as $E$ is unknown, by clustering the $N$ training images into $E'$ clusters $\{\mathcal{C}_1, \ldots, \mathcal{C}_{E'}\}$ via k-means, i.e., by computing a solution to:

$$\min_{\{\mathcal{C}_1, \ldots, \mathcal{C}_{E'}\}} \; \sum_{e=1}^{E'} \frac{1}{2 \, |\mathcal{C}_e|} \sum_{x_i, x_j \in \mathcal{C}_e} \lVert f_i - f_j \rVert_2^2, \quad (5)$$

where $\lVert f_i - f_j \rVert_2^2 = \sum_{l=1}^{S} \lVert f_{i,l} - f_{j,l} \rVert_2^2$.

Scaling with random projections. Storing all these vectors and computing distances between them in a high-dimensional space is computationally and memory expensive on large datasets. We overcome this difficulty by projecting the vectors $f_{i,l}$ into a lower-dimensional space, as proposed by Achlioptas (2003). Given a size $\ell_0$, we build for each layer $l$ a matrix $P \in \mathbb{R}^{\ell_0 \times C_l^2}$ whose entries $P_{mn}$ are realisations of independent random variables taking values $P_{mn} = 1$ or $P_{mn} = -1$, each with probability $1/2$. Then we compute $\tilde{f}_{i,l} = \frac{1}{\sqrt{\ell_0}} P f_{i,l}$ and substitute $\tilde{f}_{i,l}$ for $f_{i,l}$ in Eq. (5). We justify this choice by the fact that this projection preserves the distances $\lVert f_{i,l} - f_{j,l} \rVert_2^2$ involved in the k-means objective of Eq. (5).
Indeed, let $\epsilon \in \, ]0, 1[$ and $\ell_0 \propto \log(N)$; then, with high probability,

$$(1 - \epsilon) \, \lVert f_{i,l} - f_{j,l} \rVert_2 \;\leqslant\; \lVert \tilde{f}_{i,l} - \tilde{f}_{j,l} \rVert_2 \;\leqslant\; (1 + \epsilon) \, \lVert f_{i,l} - f_{j,l} \rVert_2, \quad \text{for all } (i,j) \in \llbracket 1, N \rrbracket^2.$$

In practice, we choose $\ell_0 = \lfloor 100 \log(N) \rfloor$, which yields dimensions for $\tilde{f}_{i,l}$ much lower than typical values of $C_l^2$. We remark that this choice of projection is independent of all $f_{i,l}$ and thus can be defined and fixed before any feature extraction.
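The projection step can be sketched as follows; the dimensions are illustrative (a stand-in for one layer's normalized Gram vectors) and the 1/√ℓ0 factor is folded into the matrix:

```python
import numpy as np

def sign_projection(l0, dim_in, rng):
    """Random +/-1 projection of Achlioptas (2003), scaled by 1/sqrt(l0)."""
    P = rng.choice([-1.0, 1.0], size=(l0, dim_in))
    return P / np.sqrt(l0)

N, dim_in = 1000, 4096               # e.g. C_l = 64 channels -> C_l^2 = 4096
l0 = int(np.floor(100 * np.log(N)))  # l0 = 690 here, well below 4096

rng = np.random.default_rng(0)
P = sign_projection(l0, dim_in, rng)
f = rng.normal(size=(N, dim_in))     # stand-ins for the Gram vectors f_{i,l}
f_proj = f @ P.T                     # projected features, shape (N, l0)

# Pairwise distances are approximately preserved (Johnson-Lindenstrauss).
d_true = np.linalg.norm(f[0] - f[1])
d_proj = np.linalg.norm(f_proj[0] - f_proj[1])
assert abs(d_proj / d_true - 1.0) < 0.2
```

Since `P` only depends on `dim_in` and `l0`, it can indeed be drawn once and reused for every image, before any feature extraction.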

3.3. ROBUST OPTIMIZATION WITH PSEUDO-GROUP LABELS

Given these estimated environments, we define their intersections with classes as "pseudo-groups". Formally, given the predicted environment $\hat{e}_i \in \llbracket 1, E' \rrbracket$ of image $i$, its pseudo-group label is $\hat{g}_i = (\hat{e}_i, y_i) \in \llbracket 1, E' \rrbracket \times \llbracket 1, K \rrbracket$. Going back to Eq. (2), the distributions over the groups $P_{G_{\hat{g}}}$ are estimated by the empirical distributions $\hat{P}_{G_{\hat{g}}} = \frac{1}{|G(\hat{g})|} \sum_{(x,y) \in G(\hat{g})} \delta_{(x,y)}$ for all $\hat{g} \in \llbracket 1, E' \rrbracket \times \llbracket 1, K \rrbracket$, where $\delta$ is the Dirac distribution, and

$$G(1,1) = \{(x_i, y_i), \, i \in \llbracket 1, N \rrbracket \mid y_i = 1, \, x_i \in \mathcal{C}_1\} \quad (9)$$
$$\cdots$$
$$G(E',K) = \{(x_i, y_i), \, i \in \llbracket 1, N \rrbracket \mid y_i = K, \, x_i \in \mathcal{C}_{E'}\} \quad (10)$$

are the sets of images and labels associated with the pseudo-group labels. Each training point $x_i \in \mathcal{X}$ is now associated with a class label $y_i$ and a pseudo-group annotation $\hat{g}_i$. We train a robust classifier $h$ with parameters $\theta$ by minimizing the worst-group risk on the training dataset (Sagawa et al., 2020a):

$$\hat{\theta} \in \arg\min_{\theta} \; \max_{\hat{g} \in \llbracket 1, E' \rrbracket \times \llbracket 1, K \rrbracket} \; \frac{1}{|G(\hat{g})|} \sum_{(x,y) \in G(\hat{g})} \ell(h(x, \theta), y),$$

where the loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ remains the cross-entropy between the predicted label $h(x, \theta)$ of the robust classifier and the true label $y$ associated with sample $x$.
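On the empirical pseudo-groups, the inner objective reduces to a max over per-group average losses. A minimal NumPy sketch (names ours; note that GroupDRO in practice optimizes a smoothed, exponentially re-weighted version of this objective rather than the hard max shown here):

```python
import numpy as np

def worst_group_risk(sample_losses, group_ids):
    """Worst-group empirical risk: max over pseudo-groups g of the
    average loss over the samples assigned to g."""
    return max(sample_losses[group_ids == g].mean()
               for g in np.unique(group_ids))

# Toy example: 4 samples, 2 pseudo-groups.
losses = np.array([1.0, 3.0, 5.0, 1.0])
groups = np.array([0, 0, 1, 1])
assert worst_group_risk(losses, groups) == 3.0  # group 0: 2.0, group 1: 3.0
```

Minimizing this quantity over the classifier parameters upweights the pseudo-group on which the model currently performs worst.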

3.4. MODEL SELECTION VIA CROSS-VALIDATION ON VALIDATION DATA

Setting relevant hyperparameters is important in optimization algorithms to ensure proper convergence. Hyperparameter tuning is performed with cross-validation using a held-out subset of training data. With robust optimization, the worst-group accuracy of the final classifier is the go-to metric for model selection. Previous approaches rely on true group labels of the validation set to define and assess performance on the worst group. In contrast, we do not rely on such prior information: we partition the validation set using the clusters found on the training set, and we conduct cross-validation based on the resulting pseudo-groups. In our experiments, we observe that this type of model selection is effective to achieve proper group robustness.

4. EXPERIMENTS

In this section, we evaluate the capacity of GRAMCLUST to improve group robustness on image classification datasets with spurious correlations. In Section 4.2, we empirically show that it outperforms other baselines addressing robustness without group annotation on three datasets. We then present in Section 4.3 an empirical analysis of our approach, including the importance of using Gram matrices as visual features, the impact of the choice of layers to extract features from, and the impact of the number of clusters. The code is available with the supplementary material.

Results of our approach include two types of model selection via cross-validation: (i) based on a validation set with true-group annotations (GRAMCLUST-orig), and (ii) based on pseudo-group labels (GRAMCLUST-cv) predicted by our clustering (see Section 3.4).

Metrics. We report worst-group and average test accuracy for the Waterbirds and CelebA datasets. On COCO-on-Places-224, we follow the evaluation protocol proposed by Ahmed et al. (2021) and report predictive performance on the in-distribution test set, which follows the same distribution as the training set, and on the systematically-shifted test set, where the spurious correlations have been removed and COCO objects are composited with uniformly-sampled random backgrounds.

4.2. COMPARATIVE RESULTS

We report quantitative comparisons on Waterbirds, CelebA and COCO-on-Places-224 in Table 1. We observe that GRAMCLUST improves worst-group test accuracy over the ERM baseline on Waterbirds and CelebA, and systematic generalisation on COCO-on-Places-224. More importantly, GRAMCLUST-cv achieves state-of-the-art performance on group robustness compared to all methods that do not use group labels on the training set. These results show empirically that our proposed approach, using Gram matrices of features to discover pseudo-groups, which are then used for robust optimization and hyperparameter cross-validation, is very effective for group robustness. It also supports that Gram matrices are well suited to capture various types of dataset biases (background for Waterbirds, physical attributes in CelebA, multiple backgrounds in COCO-on-Places-224). For instance, on Waterbirds, GRAMCLUST-cv achieves 85.3% worst-group accuracy compared to the second-best method, JTT, which reaches 82.9%. The gap is even more pronounced on CelebA, where our approach outperforms JTT by 24.3 pts. CelebA constitutes an interesting dataset to evaluate the scalability of methods, as its training set is composed of 200k images; for instance, we were not able to scale EIIL on this dataset. Note that GRAMCLUST-orig uses the same hyperparameters as EIIL, GEORGE and JTT for robust training of the target classifier from predicted group labels, and still displays significant improvements on the three datasets in terms of worst-group accuracy. Liu et al. (2021) reported results that were obtained with early stopping thanks to a small validation set annotated with group labels: the authors selected models before convergence (around epoch 3) with low average accuracy on the test set but high worst-group accuracy. We argue that this is not a suitable property for a model, and prefer models with high accuracy both on average and on the worst group of the test set.
Surprisingly, GRAMCLUST-cv and GRAMCLUST-orig outperform GroupDRO on Waterbirds with 85.3% vs. 83.9%, while the latter method uses true-group labels during training. Our intuition is that this may be due to the ambiguity of the background in some Waterbirds images; we further discuss this result in Section 4.5. Overall, these results show that our pseudo-groups on the validation set are relevant to select good hyperparameters and, more importantly, that GRAMCLUST does not require any group labels during training to achieve group robustness.

4.3. STUDY OF THE CLUSTERING FEATURES

In this section, we compare the performance obtained when clustering images with different visual features. In neural style transfer, Huang & Belongie (2017) proposed the channel-wise mean and variance of image features, instead of Gram matrices as in (Gatys et al., 2016). We thus compare the use of such features ('MeanVar') against our use of Gram matrices. We also compare our use of VGG-19 features with the direct use of the penultimate representation of a ResNet-50 identification model ('AvgPool'). Recall that although our features for group identification are extracted using a VGG-19, our robust classifier is a ResNet-50. One may thus wonder whether directly using the deepest features before the classification head of a ResNet-50 ('Standard') could be better than using VGG-19 features. For a fair comparison, we trained the robust classifier with the same hyperparameters for each method, which are consistent with those found by Sagawa et al. (2020a) with GroupDRO. The results are available in Table 2 for Waterbirds, CelebA and COCO-on-Places-224. Using the penultimate layer of a ResNet-50 as visual features for the clustering produces poorer performance than Gram matrices of VGG-19 features in every configuration. MeanVar reaches worst-group test accuracy on par with Gram matrices on Waterbirds, but significantly degrades performance on CelebA: 69.8% on average compared to 77.9% with Gram matrices. Gram matrices provide more information than MeanVar, as their diagonal entries are the channel-wise second moments of the deep features, which subsume the scale statistics captured by MeanVar (see Eq. (4)). This shows that, when scaling to large datasets such as CelebA, keeping all the correlations between different channels is important for group robustness.
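The remark on diagonals can be checked numerically: with G as in Eq. (4), the diagonal entries are the channel-wise second moments of the features, i.e., variance plus squared mean per channel (synthetic feature map; names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=(49, 8))       # M = 49 positions, C = 8 channels
G = feat.T @ feat / feat.shape[0]     # Gram matrix as in Eq. (4)

# diag(G)[c] = mean over positions of feat[:, c]^2 (second moment),
# which equals variance + squared mean for each channel.
second_moment = (feat ** 2).mean(axis=0)
assert np.allclose(np.diag(G), second_moment)
assert np.allclose(second_moment,
                   feat.var(axis=0) + feat.mean(axis=0) ** 2)
```

The off-diagonal entries, which carry the cross-channel correlations, are exactly what MeanVar discards.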

4.4. CLUSTERING ANALYSIS

Effect of the selected layers for features. We evaluate the impact of the selection of VGG-19 layers used to extract features in the clustering stage. To this end, we study the matching of the predicted environments to the true environment labels on the validation set. The assignment problem is solved via Hungarian matching (Kuhn, 1955), and we measure the global matching accuracy across all validation samples, where matching accuracy is the percentage of samples whose predicted group corresponds to their true group. In Fig. 3, we compare results on Waterbirds using either one of the five layers commonly used in neural style transfer (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1) or all layers together. Experiments show that: (i) features from deeper layers correlate with better matching accuracy; (ii) our approach is robust to the choice of deep layers, either taken together or individually.

Impact of the number of clusters. We study the impact of the number of clusters, a hyperparameter of the clustering algorithm. Worst-group accuracies on the validation set for E′ ∈ {2, 4, 8, 16, 32} clusters are reported in Fig. 4 for the Waterbirds dataset. Overall, our method is robust to variations in the number of clusters: GRAMCLUST with higher numbers of clusters incurs a slight drop in performance but still outperforms ERM, and remains on par with GroupDRO.
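The matching accuracy described above can be computed from a contingency table and `scipy.optimize.linear_sum_assignment` (a sketch; the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_accuracy(true_env, pred_env):
    """One-to-one matching of predicted clusters to true environments via
    the Hungarian algorithm, then fraction of correctly matched samples."""
    true_ids, pred_ids = np.unique(true_env), np.unique(pred_env)
    # contingency[i, j]: #samples with true env i and predicted env j
    cont = np.array([[np.sum((true_env == t) & (pred_env == p))
                      for p in pred_ids] for t in true_ids])
    rows, cols = linear_sum_assignment(-cont)  # negate to maximize matches
    return cont[rows, cols].sum() / len(true_env)

# Cluster ids are arbitrary: a relabeled perfect clustering scores 1.0.
true = np.array([0, 0, 1, 1])
assert matching_accuracy(true, np.array([1, 1, 0, 0])) == 1.0
assert matching_accuracy(true, np.array([0, 1, 0, 1])) == 0.5
```

Because cluster indices carry no meaning, the Hungarian step is what makes accuracies comparable across runs and layer choices.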

4.5. DISCUSSION ABOUT RESULTS OF GRAMCLUST VS. GROUPDRO ON WATERBIRDS

Comparative results in Table 1 show that GRAMCLUST-orig outperforms GroupDRO on the Waterbirds dataset. The difference between the approaches lies in the usage of true-group labels on the training dataset by GroupDRO, while GRAMCLUST-orig leverages its predicted pseudo-groups. This result might be surprising given that the evaluation is performed on true test group labels and that the two methods share the same robust optimization algorithm and hyperparameters. We intuit that this behavior, which occurs only on Waterbirds, is related to the group labels in the dataset. In Fig. 5, we show some examples of confusing images to which GRAMCLUST did not assign the annotated group label. These images are taken from the set of mismatches between true-group labels and our pseudo-group labels after the Hungarian matching. We can see that some of these samples present dominant characteristic elements of a land background, such as heavy vegetation and sand, while being labeled as water background. Conversely, some samples labeled as land background display a high percentage of water surface in the image. As mentioned in Section 4.1, the Waterbirds dataset was created by combining bird photographs with background scenes taken from the Places365 dataset. But the latter dataset is composed of very diverse images which might not reflect the expected background for a category. This unwanted behavior highlights the difficulty of manually annotating groups and raises the need for benchmarks including datasets with spurious correlations from non-artificial, real-world data, such as the hair color/gender bias observed in CelebA. It also motivates further research on the automatic discovery of groups in data, as proposed in our method.

5. CONCLUSION

In this paper, we introduce GRAMCLUST, a two-stage method that first partitions a training dataset into clusters via k-means based on Gram matrices of image features, extracted from an identification model trained to catch spurious correlations in a biased dataset. This first stage is followed by learning a robust classifier that minimizes the error on the worst pseudo-group discovered previously. GRAMCLUST proves to be an effective approach to tackle group robustness and outperforms every baseline on standard datasets with spurious correlations. The usage of Gram matrices of features is crucial to capture pertinent visual statistics of the image and enables a relevant partition for robust training. Our approach also alleviates the need to label a validation set of images with group information, and is able to tune its hyperparameters in an unsupervised fashion by applying its clustering algorithm on the validation set.



Footnotes and code references:
- We let the reader refer to Theorem 1.1 in (Achlioptas, 2003) for the exact expression of this probability as a function of ϵ, N and ℓ0.
- Results with JTT differ from the original paper as the scores that we report correspond to models trained without early-stopping.
- Code: https://github.com/Faruk-Ahmed/predictive_group_invariance, https://github.com/p-lambda/wilds, https://github.com/ecreager/eiil



Figure 1: Overview of the proposed GRAMCLUST approach for robust classification with unsupervised group discovery. (1) We first extract deep image features using an identification model and (2) cluster the training dataset based on Gram matrices of image features; (3) then, we train the targeted classifier with a robust optimization that exploits the assigned pseudo-group labels. Consequently, GRAMCLUST properly classifies samples in minority groups, e.g., cows and camels in unusual environments, in contrast to standard Empirical Risk Minimization (ERM) training.

Robustness to distribution shift (Rusak et al., 2020; Hendrycks* et al., 2020; Gulrajani & Lopez-Paz, 2021; Geirhos et al., 2019) has recently been an increasingly popular topic among machine learning researchers. Koh et al. (2021) distinguish two types of distribution shifts: domain generalization, where test samples come from a different distribution than the training dataset, and subpopulation shift, where train and test distributions overlap but their relative proportions differ.

Figure 2: Illustration of the proposed dataset partition. An identification model Φ with parameters ω is trained for a limited number T of epochs with ERM to fit groups with easy-to-learn spurious correlations. Then, for each image x_i ∈ X, we extract intermediate features ϕ_l at layer l, compute their Gram matrices G_l and apply a random projection. These projected Gram matrix representations are used as features to cluster the training dataset D_train into E′ environments.

Figure 3: Impact of the layer choice to extract features. Results in matching accuracy on the validation set for GRAMCLUST on Waterbirds.

Figure 4: Impact of the number of clusters. Results in worst-group val accuracies of GRAMCLUST on Waterbirds.

Figure 5: Examples of confusing samples in the Waterbirds dataset, wrongly predicted by GRAMCLUST. (a) Samples of confusing land-background images predicted as water background; (b) samples of confusing water-background images predicted as land background. In each case, the actual image background is confusing due to the joint presence of elements reflecting land background (forest, heavy vegetation, sand) and water background (water surface, rainfalls, mist).


Group robustness with group annotations. Recent approaches propose to leverage group annotations during training to improve group robustness. IRM (Arjovsky et al., 2020) augments the standard ERM term with invariance penalties across data from different groups. Ahmed et al. (2021) promote, through a simple penalty, identical prediction behaviour across groups. Other works (Sagawa et al., 2020a; Zhang et al., 2021) minimize explicitly the worst-group loss during training. Finally, Sagawa et al. (2020b) re-balance majority and minority groups via re-weighting and sub-sampling.

We compare our approach against the standard ERM baseline and recent methods that aim at robust predictions across groups without the use of train group annotations: EIIL (Creager et al., 2021), GEORGE (Sohoni et al., 2020), and JTT (Liu et al., 2021). We also include robust methods that use train group annotations: IRM (Arjovsky et al., 2020), importance weighting and GroupDRO (Sagawa et al., 2020a). The latter methods and ERM were already implemented and we took care to reproduce results for all methods. Our results with baselines are in line with those reported respectively in each original paper. Note that our approach and GroupDRO share the same robust optimization objective (Eq. (2)); hence, GRAMCLUST would boil down to GroupDRO if discovered pseudo-groups were to match exactly the ones annotated in the dataset.

Training details. All methods, including ours, use a ResNet-50 (He et al., 2016) architecture pre-trained on ImageNet as the robust classifier. Models are optimized using SGD with momentum. For GroupDRO and ERM, we use the hyperparameters reported by Sagawa et al. (2020a) on Waterbirds.

Table 1: Comparative results on Waterbirds, CelebA and COCO-on-Places-224. Worst-group (w-g) and average (avg) test accuracies (% mean and std.) for Waterbirds and CelebA datasets; systematically-shifted (shift) and in-distribution (in-dis) test-set accuracies (% mean and std.) for the COCO-on-Places-224 dataset. Experiments are with ResNet-50 models. Underlined and bold type indicate respectively best and per-block best performance (with significance p < 0.05 according to a paired t-test on five runs). Methods are grouped according to their need for ground-truth group labels on train and/or val set; the proposed GRAMCLUST-cv is the only one requiring none.

Table 2: Study of the clustering features. Results in worst-group (Waterbirds, CelebA) and systematically-shifted (COCO-on-P) test-set accuracies (%). Gram matrices show to be the most effective type of information to obtain improved group robustness.

A IMPLEMENTATION DETAILS

This section focuses on implementation details used to produce the results in the main text of our paper. The code that we used is provided along with this appendix. Our implementation builds upon the WILDS framework.

A.1 CONSTRUCTION OF COCO-ON-PLACES-224

We generated the dataset using the code of Ahmed et al. (2021) but, as explained in the main paper, we modified it to produce images of size 224 × 224 instead of 64 × 64. The reader can refer to the appendix of (Ahmed et al., 2021) for more details regarding the generation of the COCO-on-Places dataset.

A.2 DETAILS ABOUT ROBUST OPTIMIZATION

We trained all models on one NVIDIA V100 Tensor Core GPU with 16GB of memory, using PyTorch 1.10 and CUDA 10.2. We used the implementations of IRM (Arjovsky et al., 2020), Importance Weighting and GroupDRO (Sagawa et al., 2020a) available in WILDS (Koh et al., 2021), our own implementations of JTT (Liu et al., 2021) and of GEORGE (Sohoni et al., 2020) (while making sure that we could reproduce the original performance on Waterbirds and CelebA), and the official implementation of EIIL (Creager et al., 2021). Concerning EIIL, we recall that we were not able to make this method scale to large datasets such as CelebA.

For all methods, we used a ResNet-50 (He et al., 2016) architecture trained using stochastic gradient descent with momentum (SGD-M) and L2 regularization, but without any learning rate scheduler. We used a momentum of 0.9 and a batch size of 128 for all datasets and all methods. The learning rate η and L2 regularization parameter λ are set as detailed below.

JTT, GEORGE, EIIL and GRAMCLUST all use GroupDRO (Sagawa et al., 2020a) as the robust optimization step. On Waterbirds and CelebA, we did not redo any grid search and used the hyperparameters found in (Sagawa et al., 2020a). These hyperparameters were optimized using a small validation set annotated with true group labels. To produce the results on COCO-on-Places-224, we performed our own grid search using the annotated validation set. We considered values of η and λ close to those used in (Sagawa et al., 2020a): λ ∈ {10^-4, 10^-2, 10^-1, 1} and η ∈ {10^-5, 5·10^-5, 10^-4}. The best hyperparameters for GroupDRO are summarized in Table 3.

To ensure fair comparisons, we also performed the same grid search over η and λ for ERM, IRM and Importance Weighting. The best hyperparameters for ERM and IRM are summarized for each dataset in Table 4 and Table 5, respectively. Note that they correspond to those reported in (Sagawa et al., 2020a) for Waterbirds and CelebA.

Table 3: Best GroupDRO hyperparameters per dataset.

SGD-M hyperparameters    Waterbirds    CelebA    COCO-on-Places-224
Learning rate η          10^-4         10^-5     5·10^-5
L2 regularization λ      10^-3         0.1       0.1

Among the usual layers used to compute representations in neural style transfer, we observed improved performance when selecting deeper layers in the network (see Section 4.4). Consequently, for each dataset, we consistently extract features from the conv5_1 layer, i.e., the first convolutional layer of block 5. Following the results of Fig. 4, we set the number of clusters to 2 in our experiments.

For EIIL and GEORGE, the identification model is a ResNet-50, as used in the original methods. We train the model for 1 epoch with ERM using SGD-M, as for GRAMCLUST. Note that the activation at the output of the last layer is a sigmoid in EIIL while it is a softmax in GEORGE. As for GRAMCLUST, the best results were obtained when using 2 clusters for EIIL and GEORGE. We refer the reader to (Creager et al., 2021) and (Sohoni et al., 2020) for other implementation details specific to EIIL and GEORGE, respectively.

A.4 CROSS VALIDATION ON PSEUDO-GROUP ANNOTATIONS

We report in Table 6 the results of our grid search on the validation set of each dataset using the pseudo-annotations discovered with our method, i.e., using our discovered environments instead of the ground-truth ones. Hence, the average and worst-group accuracies in Table 6 are computed using the discovered pseudo-groups. The hyperparameters used in GRAMCLUST-cv correspond to those which yield the best worst-group accuracy in this table.

We present, in Figure 6, the matching accuracy between the ground-truth environments and the environments discovered with our method on the validation set of CelebA, for different layers of the VGG-19. As on Waterbirds, we notice that the best result is obtained when using the layer conv5_1.

Figure 6: Impact of the layer choice to extract features on CelebA. We show the matching accuracy between the ground-truth environments on the validation set of CelebA and the ones discovered with GRAMCLUST when using different VGG-19 layers. The result denoted all convX_1 is obtained when using all the layers conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 in our method.

