TAKE ONE GRAM OF NEURAL FEATURES, GET ENHANCED GROUP ROBUSTNESS

Abstract

Predictive performance of machine learning models trained with empirical risk minimization (ERM) can degrade considerably under distribution shifts. In particular, spurious correlations in training datasets lead ERM-trained models to display high loss when evaluated on minority groups that do not present such correlations at test time. Extensive attempts have been made to develop methods that improve worst-group robustness. However, they require group information for each training input or, at least, a validation set with group labels to tune their hyperparameters, which may be expensive to obtain or unknown a priori. In this paper, we address the challenge of improving group robustness without group annotations during training. To this end, we propose to automatically partition the training dataset into groups based on Gram matrices of features extracted by an identification model, and to apply robust optimization based on these pseudo-groups. In the realistic context where no group labels are available, our experiments show that our approach not only improves group robustness over ERM but also outperforms all recent baselines.

1. INTRODUCTION

Empirical Risk Minimization (ERM) is the most standard machine learning formulation, which assumes that training and testing samples are independent and identically distributed (Vapnik, 1991). While academic datasets are mainly built to respect this assumption, practical settings display more challenging configurations with distribution shifts. Among different types of shifts, training data can be affected by selection biases and confounding factors, also called spurious correlations (Woodward, 2005; Duchi et al., 2019). Imagine crowd-sourcing an image dataset of camels and cows (Beery et al., 2018). Due to selection biases, a large majority of cows stand in front of grass while camels stand in the desert. A simple way to differentiate cows from camels would be to classify the background, an undesirable shortcut that ERM will naturally exploit. Consequently, ERM may perform poorly on minority groups that do not display such spurious correlations (Hashimoto et al., 2018; Tatman, 2017; Duchi et al., 2019), e.g., a cow standing in the desert. To overcome this issue, recent works (Creager et al., 2021; Bao & Barzilay, 2022; Sohoni et al., 2020; Liu et al., 2021; Ahmed et al., 2021; Kirichenko et al., 2022) rely on two-stage schemes: first, automatic environment discovery (e.g., based on deep feature clustering); then, robust optimization based on environment pseudo-labels. An environment here refers to a recurring setting, not intrinsic to the object of interest, that may affect its classification, such as background, object color, or object pose. However, all these approaches require ground-truth environment labels on a validation set to properly tune their hyperparameters. This paper addresses the problem of learning a robust classifier which, for instance, would not confuse a cow standing in the desert with a camel, despite being given no annotation about grass or desert.
In computer vision, many identified spurious correlations are closely related to visual aspects, such as background (Beery et al., 2018), texture (Geirhos et al., 2019), image style (Hendrycks et al., 2021), physical attributes (Liu et al., 2015), or camera characteristics (Koh et al., 2021). In this work, we assume that relevant environment labels can be inferred from visual feature statistics, and demonstrate that they lead to meaningful environments and robust classifiers on standard datasets used to evaluate robust classification. We propose a two-stage approach, GRAMCLUST, which first assigns a group label, i.e., a class-environment pair label, by partitioning a training dataset into clusters of images with similar visual statistics, and then trains a robust classifier based on these pseudo-group labels. Our approach is summarized in Fig. 1. We use Gram matrices as visual descriptive statistics; they are second-order moments of neural activations. Gram matrices are well known for their impressive results in style transfer (Gatys et al., 2016), but, more importantly for the interpretation of our approach, Li et al. (2017) demonstrate that matching Gram matrices between two groups of images is equivalent to aligning the respective distributions of the two groups by minimizing the Maximum Mean Discrepancy. Therefore, our method can be interpreted as grouping images into clusters of similar feature distributions, which are sensible candidates for environments.
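As a concrete illustration, the environment-discovery stage can be sketched as follows. The layer choice, the upper-triangle descriptor, and the plain k-means routine are illustrative assumptions for this sketch, not the exact implementation:

```python
import numpy as np

def gram_matrix(fmap):
    """Gram matrix of one feature map, i.e. channel-wise second-order moments.

    fmap: array of shape (C, H, W), activations from some layer of the
    identification model (which layer to use is an assumption here).
    """
    c, h, w = fmap.shape
    f = fmap.reshape(c, h * w)
    return f @ f.T / (h * w)  # shape (C, C)

def gram_descriptor(fmap):
    """Flatten the Gram matrix into a vector; it is symmetric, so the
    upper triangle suffices."""
    g = gram_matrix(fmap)
    return g[np.triu_indices(g.shape[0])]

def kmeans(x, k, iters=50, seed=0):
    """Minimal k-means, standing in for any off-the-shelf clustering."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign

# Toy example: 16 images with 8-channel, 4x4 feature maps.
rng = np.random.default_rng(0)
fmaps = rng.standard_normal((16, 8, 4, 4))
desc = np.stack([gram_descriptor(f) for f in fmaps])
envs = kmeans(desc, k=2)  # one pseudo-environment label per image
print(envs.shape)
```

Combining these pseudo-environment labels with the class labels yields the class-environment pseudo-group labels used in the second stage.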
Our main contributions are as follows: (1) We introduce an easy-to-scale method to split training images among distinct pseudo-environments, based on feature Gram matrices extracted by a specifically trained identification model; (2) GRAMCLUST alleviates the need for ground-truth group labels altogether, even in the validation set, as hyperparameters are set based on validation performance computed from our pseudo-groups; (3) Extensive experiments on various image classification datasets with spurious correlations show that GRAMCLUST outperforms all recent baselines addressing robustness without group annotation. In particular, on the realistic large-scale CelebA dataset (Liu et al., 2015), we improve worst-group test accuracy by +24.3 points.
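For the second stage, one common choice of robust optimization over pseudo-groups is a GroupDRO-style objective that exponentially upweights the groups with the highest loss. The sketch below shows a single weight update; the step size `eta` and the surrounding training loop are assumptions, not the paper's exact recipe:

```python
import numpy as np

def group_dro_weights(group_losses, q, eta=0.1):
    """One GroupDRO-style multiplicative weight update.

    q: current per-group weights (a distribution over groups).
    Groups with higher current loss receive exponentially larger
    weight, so subsequent gradient steps focus on the worst group.
    """
    q = q * np.exp(eta * group_losses)
    return q / q.sum()  # renormalize to a distribution

# Toy example: 4 pseudo-groups from the clustering stage,
# with group 1 currently suffering the highest loss.
q = np.ones(4) / 4
losses = np.array([0.2, 1.5, 0.3, 0.4])
q = group_dro_weights(losses, q)
print(q.argmax())  # the highest-loss group gets the largest weight
```

In a full training loop, the weighted sum of per-group losses `(q * group_losses).sum()` would replace the uniform ERM average as the minimization objective.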

2. RELATED WORK

Robustness to distribution shift (Rusak et al., 2020; Hendrycks et al., 2020; Gulrajani & Lopez-Paz, 2021; Geirhos et al., 2019) has recently become an increasingly popular topic among machine learning researchers. Koh et al. (2021) distinguish two types of distribution shifts: domain generalization, where test samples come from a different distribution than the training dataset, and subpopulation shift, where train and test distributions overlap but their relative proportions differ. Under subpopulation shift, the goal is to perform well even on the minority group, which is also referred to as group robustness. In this study, we focus on this latter form of distribution shift.

Group robustness with group annotations. Recent approaches leverage group annotations during training to improve group robustness. IRM (Arjovsky et al., 2020) augments the standard ERM term with invariance penalties across data from different groups. Ahmed et al. (2021) promote, through a simple penalty, identical prediction behaviour across groups. Other works (Sagawa
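To make the idea of a cross-group invariance penalty concrete, here is a minimal sketch in the spirit of Ahmed et al. (2021). The exact penalty in their paper is class-conditional; the simplified version below, an assumption for illustration only, compares each group's mean predictive distribution to the global mean via a KL divergence:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

def cross_group_penalty(probs, groups):
    """Penalize differences in average prediction behaviour across groups.

    probs:  (N, num_classes) softmax outputs of the classifier.
    groups: (N,) group label per sample.
    The penalty is zero when every group's mean prediction matches
    the global mean, and grows as groups diverge.
    """
    global_mean = probs.mean(0)
    pen = 0.0
    for g in np.unique(groups):
        pen += kl(probs[groups == g].mean(0), global_mean)
    return pen

# Toy example: two groups with clearly different prediction behaviour.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]])
groups = np.array([0, 0, 1, 1])
print(cross_group_penalty(probs, groups) > 0)
```

Such a term would be added, with some weight, to the standard ERM loss; the weight is a hyperparameter of the kind these methods must tune on a group-annotated validation set.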



Figure 1: Overview of the proposed GRAMCLUST approach for robust classification with unsupervised group discovery. (1) We first extract deep image features using an identification model; (2) we cluster the training dataset based on the Gram matrices of image features; (3) then, we train the targeted classifier with a robust optimization that exploits the assigned pseudo-group labels. Consequently, GRAMCLUST properly classifies samples in minority groups, e.g., cows and camels in unusual environments, in contrast to standard Empirical Risk Minimization (ERM) training.

