LEARNING TO SPLIT FOR AUTOMATIC BIAS DETECTION

Abstract

Classifiers are biased when trained on biased datasets. As a remedy, we propose Learning to Split (ls), an algorithm for automatic bias detection. Given a dataset with input-label pairs, ls learns to split this dataset so that predictors trained on the training split cannot generalize to the testing split. This performance gap suggests that the testing split is under-represented in the dataset, which is a signal of potential bias. Identifying non-generalizable splits is challenging since we have no annotations about the bias. In this work, we show that the prediction correctness of each example in the testing split can be used as a source of weak supervision: generalization performance will drop if we move examples that are predicted correctly away from the testing split, leaving only those that are mispredicted. ls is task-agnostic and can be applied to any supervised learning problem, ranging from natural language understanding and image classification to molecular property prediction. Empirical results show that ls is able to generate astonishingly challenging splits that correlate with human-identified biases. Moreover, we demonstrate that combining robust learning algorithms (such as group DRO) with splits identified by ls enables automatic de-biasing. Compared to previous state-of-the-art, we substantially improve the worst-group performance (23.4% on average) when the source of biases is unknown during training and validation. Our code is included in the supplemental materials and will be publicly available.

1. INTRODUCTION

Recent work has shown promising results on de-biasing when the sources of bias (e.g., gender, race) are known a priori (Ren et al., 2018; Sagawa et al., 2019; Clark et al., 2019; He et al., 2019; Mahabadi et al., 2020; Kaneko & Bollegala, 2021). However, in the general case, identifying bias in an arbitrary dataset may be challenging even for domain experts: it requires expert knowledge of the task and details of the annotation protocols (Zellers et al., 2019; Sakaguchi et al., 2020). In this work, we study automatic bias detection: given a dataset with only input-label pairs, our goal is to detect biases that may hinder predictors' generalization performance. We propose Learning to Split (ls), an algorithm that simulates generalization failure directly from the set of input-label pairs. Specifically, ls learns to split the dataset so that predictors trained on the training split cannot generalize to the testing split (Figure 1). This performance gap indicates that the testing split is under-represented among the set of annotations, which is a signal of potential bias.

The challenge in this seemingly simple formulation lies in the existence of many trivial splits. For example, poor testing performance can result from a training split that is much smaller than the testing split (Figure 2a). Classifiers will also fail if the training split contains all positive examples, leaving the testing split with only negative examples (Figure 2b). The poor generalization of these trivial solutions arises from the lack of training data and from label imbalance, and it does not reveal the hidden biases. To ensure that the learned splits are meaningful, we impose two regularity constraints on the splits. First, the size of the training split must be comparable to the size of the testing split. Second, the marginal distribution of the labels should be similar across the splits. Our algorithm ls consists of two components, a Splitter and a Predictor.
At each iteration, the Splitter first assigns each input-label pair to either the training split or the testing split. The Predictor then takes the training split and learns to predict the label from the input. Its prediction performance on the testing split is used to guide the Splitter towards a more challenging split (under the regularity constraints) for the next iteration. Specifically, while we do not have any explicit annotations for creating non-generalizable splits, we show that the prediction correctness of each testing example can serve as a source of weak supervision: generalization performance will decrease if we move examples that are predicted correctly away from the testing split, leaving only those predicted incorrectly.

Figure 1: Given the set of image-label pairs, our algorithm ls learns to split the data so that predictors trained on the training split cannot generalize to the testing split. The learned splits help us identify the hidden biases. For example, while predictors can achieve perfect performance on the training split by using the spurious heuristic "polar bears live in snowy habitats", they fail to generalize to the under-represented group (polar bears that appear on grass).

ls is task-agnostic and can be applied to any supervised learning problem, ranging from natural language understanding (Beer Reviews, MNLI) and image classification (Waterbirds, CelebA) to molecular property prediction (Tox21). Given the set of input-label pairs, ls consistently identifies splits across which predictors cannot generalize. For example, in MNLI the generalization performance of a standard BERT-based predictor drops from 79.4% (random split) to 27.8% (split by ls). Further analysis reveals that our learned splits coincide with human-identified biases. Finally, we demonstrate that combining group distributionally robust optimization (DRO) with splits identified by ls enables automatic de-biasing.
Compared with the previous state-of-the-art, we substantially improve the worst-group performance (23.4% on average) when the sources of bias are completely unknown during training and validation.

The challenge arises when the source of biases is unknown (Li & Xu, 2021). Recent work has shown that the mistakes of a standard ERM predictor on its training data are informative of the biases (Bao et al., 2021; Sanh et al., 2021; Nam et al., 2020; Utama et al., 2020; Liu et al., 2021a; Lahoti et al., 2020; Liu et al., 2021b; Bao et al., 2022); these methods deliver robustness by boosting from the mistakes. Other work (Wang & Vasconcelos, 2018; Yoo & Kweon, 2019) also utilizes prediction correctness, for confidence estimation and active learning. Follow-up methods (Creager et al., 2021; Sohoni et al., 2020; Ahmed et al., 2020; Matsuura & Harada, 2020) further analyze the predictor's hidden activations to identify under-represented groups. However, many other factors (such as initialization, representation power, and the amount of annotations) can contribute to a predictor's training mistakes. For example, predictors that lack representation power may simply under-fit the training data.


In this work, instead of looking at the training statistics of the predictor, we focus on its generalization gap from the training split to the testing split. This effectively controls for those unwanted factors. Going back to the previous example, if the training and testing splits share the same distribution, the generalization gap will be small even if the predictors are under-fitted. The gap will increase only when the training and testing splits have different prediction characteristics. Furthermore, instead of using a fixed predictor, we iteratively refine the predictor during training so that it faithfully measures the generalization gap given the current Splitter.

Heuristics for data splitting. The data splitting strategy directly impacts the difficulty of the underlying generalization task. Therefore, in domains where out-of-distribution generalization is crucial for performance, various heuristics are used to find challenging splits (Sheridan, 2013; Yang et al., 2019; Bandi et al., 2018; Yala et al., 2021; Taylor et al., 2019; Koh et al., 2021). Examples include scaffold splits for molecules and batch splits for cells. Unlike these methods, which rely on human-specified heuristics, our algorithm ls learns how to split directly from the dataset alone and can therefore be applied to scenarios where human knowledge is unavailable or incomplete.

3.1. MOTIVATION

Given a dataset $D_{total}$ with input-label pairs $\{(x, y)\}$, our goal is to split this dataset into two subsets, $D_{train}$ and $D_{test}$, such that predictors learned on the training split $D_{train}$ cannot generalize to the testing split $D_{test}$.

Why do we have to discover such splits? Before deploying our trained models, it is crucial to understand the extent to which these models can even generalize within the given dataset. The standard cross-validation approach attempts to measure generalization by randomly splitting the dataset (Stone, 1974; Allen, 1974). However, this measure only reflects the average performance under the same data distribution $P_{D_{total}}(x, y)$. There is no guarantee of performance if the data distribution changes at test time (e.g., if the proportion of the minority group increases). For example, consider the task of classifying samoyeds vs. polar bears (Figure 1). Models can achieve good average performance by using spurious heuristics such as "polar bears live in snowy habitats" and "samoyeds play on grass". Finding splits across which the models cannot generalize helps us identify under-represented groups (polar bears that appear on grass).

How to discover such splits? Our algorithm ls has two components: a Splitter that decides how to split the dataset, and a Predictor that estimates the generalization gap from the training split to the testing split. At each iteration, the Splitter uses the feedback from the Predictor to update its splitting decision. One can view this splitting decision as a latent variable that represents the prediction characteristic of each input. To avoid degenerate solutions, we require the Splitter to satisfy two regularity constraints: the size of the training split should be comparable to the size of the testing split (Figure 2a), and the marginal distribution of the label should be similar across the splits (Figure 2b).

Algorithm 1 (excerpt):
7:  Sample a mini-batch from $D_{total}$ to compute the regularity constraints $\Omega_1, \Omega_2$ (Eq 1).
8:  Sample another mini-batch from $D_{test}$ to compute $L_{gap}$ (Eq 2).
9:  Compute the overall loss $L_{total} = L_{gap} + \Omega_1 + \Omega_2$. Update the Splitter to minimize $L_{total}$.
10: until $L_{total}$ stops decreasing
11: until the generalization gap stops increasing
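The alternating loop above (sample a split, retrain the Predictor from scratch, update the Splitter from its correctness on the testing split) can be sketched as a toy NumPy simulation. This is our illustrative stand-in, not the authors' implementation: `train_predictor` is a trivial majority-class classifier in place of a real re-initialized neural model, and the probability update is a simple surrogate for the gradient step on $L_{total}$.

```python
import numpy as np

def sample_split(p_train, rng):
    """z_i ~ Bernoulli(p_train_i); z_i = 1 puts example i in the training split."""
    return (rng.random(len(p_train)) < p_train).astype(int)

def train_predictor(y_train):
    """Toy stand-in for the re-initialized Predictor: a majority-class classifier.
    The paper instead trains a real model (e.g. BERT/ResNet) from scratch here."""
    majority = np.bincount(y_train).argmax()
    return lambda n: np.full(n, majority)

def ls_iteration(y, p_train, rng):
    """One outer-loop iteration of the ls sketch."""
    z = sample_split(p_train, rng)
    train_idx, test_idx = np.where(z == 1)[0], np.where(z == 0)[0]
    predictor = train_predictor(y[train_idx])
    correct = predictor(len(test_idx)) == y[test_idx]  # weak supervision signal
    # Surrogate Splitter update: nudge correctly-predicted testing examples
    # toward the training split (the paper does a gradient step on L_total).
    p_train = p_train.copy()
    p_train[test_idx] = np.clip(0.9 * p_train[test_idx] + 0.1 * correct, 0.05, 0.95)
    gap = 1.0 - correct.mean()  # error rate on the testing split
    return p_train, gap
```

Iterating `ls_iteration` concentrates hard-to-predict examples in the testing split, mirroring the outer loop of Algorithm 1 under these toy assumptions.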

3.2. SPLITTER AND PREDICTOR

Here we describe the two key components of our algorithm, the Splitter and the Predictor, in the context of classification tasks. The algorithm itself generalizes to regression problems as well.

Splitter. Given a list of input-label pairs $D_{total} = [(x_1, y_1), \ldots, (x_n, y_n)]$, the Splitter decides how to partition this dataset into a training split $D_{train}$ and a testing split $D_{test}$. We can view its splitting decisions as a list of latent variables $z = [z_1, \ldots, z_n]$, where each $z_i \in \{0, 1\}$ indicates whether example $(x_i, y_i)$ is included in the training split. In this work, we assume independent selections for simplicity. That is, the Splitter takes one input-label pair $(x_i, y_i)$ at a time and predicts the probability $P_{Splitter}(z_i \mid x_i, y_i)$ of allocating this example to the training split. We can factor the joint probability of the splitting decisions as

$$P(z \mid D_{total}) = \prod_{i=1}^{n} P_{Splitter}(z_i \mid x_i, y_i).$$

We can sample from the Splitter's predictions $P_{Splitter}(z_i \mid x_i, y_i)$ to obtain the splits $D_{train}$ and $D_{test}$. Note that while the splitting decisions are independent across examples, the Splitter receives global feedback, dependent on the entire dataset $D_{total}$, from the Predictor during training.

Predictor. The Predictor takes an input $x$ and predicts the probability of its label $P_{Predictor}(y \mid x)$. The goal of the Predictor is to provide feedback for the Splitter so that it can generate more challenging splits at the next iteration. Specifically, given the Splitter's current splitting decisions, we re-initialize the Predictor and train it to minimize the empirical risk on the training split $D_{train}$. This re-initialization step is critical: it ensures that the Predictor does not carry over information from previous splits and faithfully represents the current generalization gap. On the other hand, we note that neural networks can easily memorize the training split. To prevent over-fitting, we hold out 1/3 of the training split for early stopping. After training, we evaluate the generalization performance of the Predictor on the testing split $D_{test}$.
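Because the splitting decisions are independent Bernoulli variables, sampling a concrete split and scoring its joint probability follow directly from the factorization above. A minimal NumPy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def split_log_prob(p_train, z):
    """log P(z | D_total) = sum_i log P_Splitter(z_i | x_i, y_i),
    under the independence assumption: each factor is p_i if z_i = 1,
    and (1 - p_i) if z_i = 0."""
    p = np.where(z == 1, p_train, 1.0 - p_train)
    return float(np.log(p).sum())

def sample_split(p_train, rng):
    """Draw one concrete split: z_i = 1 -> D_train, z_i = 0 -> D_test."""
    return (rng.random(len(p_train)) < p_train).astype(int)
```

For instance, with two examples both assigned probability 0.5, any split has joint probability 0.25, so its log-probability is log(0.25).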

3.3. REGULARITY CONSTRAINTS

Many factors can impact generalization, but not all of them are of interest. For example, the Predictor may naturally fail to generalize due to the lack of training data or due to label imbalance across the splits (Figure 2). To avoid these trivial solutions, we introduce two soft regularizers to shape the Splitter's decisions:

$$\Omega_1 = D_{KL}\big(P(z) \,\|\, \mathrm{Bernoulli}(\delta)\big), \qquad \Omega_2 = D_{KL}\big(P(y \mid z{=}1) \,\|\, P(y)\big) + D_{KL}\big(P(y \mid z{=}0) \,\|\, P(y)\big). \quad (1)$$

The first term $\Omega_1$ ensures that we have sufficient training examples in $D_{train}$. Specifically, the marginal distribution $P(z) = \frac{1}{n}\sum_{i=1}^{n} P_{Splitter}(z_i = z \mid x_i, y_i)$ represents what fraction of $D_{total}$ is split into $D_{train}$ and $D_{test}$. We penalize the Splitter if it moves too far away from the prior distribution $\mathrm{Bernoulli}(\delta)$. Centola et al. (2018) suggest that minority groups typically make up 25 percent of the population. Therefore, we fix $\delta = 0.75$ in all experiments.

The second term $\Omega_2$ aims to reduce label imbalance across the splits. It achieves this goal by pushing the label marginals in the training split $P(y \mid z{=}1)$ and the testing split $P(y \mid z{=}0)$ to be close to the original label marginal $P(y)$ in $D_{total}$. We can apply Bayes' rule to compute these conditional label marginals directly from the Splitter's decisions $P_{S}(z_i \mid x_i, y_i)$:

$$P(y \mid z{=}1) = \frac{\sum_i \mathbb{1}[y_i = y]\, P_{S}(z_i = 1 \mid x_i, y_i)}{\sum_i P_{S}(z_i = 1 \mid x_i, y_i)}, \qquad P(y \mid z{=}0) = \frac{\sum_i \mathbb{1}[y_i = y]\, P_{S}(z_i = 0 \mid x_i, y_i)}{\sum_i P_{S}(z_i = 0 \mid x_i, y_i)}.$$
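Both regularizers can be computed directly from the Splitter's per-example probabilities. The following NumPy sketch follows Eq 1 under our own naming and clipping choices (the helper names and the `1e-8` clipping constants are not from the paper):

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, 1e-8, 1 - 1e-8), np.clip(q, 1e-8, 1 - 1e-8)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_categorical(p, q):
    """KL divergence between two categorical distributions."""
    p, q = np.clip(p, 1e-8, None), np.clip(q, 1e-8, None)
    return float((p * np.log(p / q)).sum())

def regularizers(p_train, y, n_classes, delta=0.75):
    # Omega_1: keep the expected train fraction P(z=1) near the prior Bernoulli(delta).
    omega1 = float(kl_bernoulli(p_train.mean(), delta))
    # Omega_2: keep the label marginals of both splits close to P(y),
    # using the Splitter's soft decisions as weights (Bayes' rule).
    onehot = np.eye(n_classes)[y]
    p_y = onehot.mean(axis=0)
    w1 = p_train / p_train.sum()              # weights for the training split
    w0 = (1 - p_train) / (1 - p_train).sum()  # weights for the testing split
    p_y_z1 = (onehot * w1[:, None]).sum(axis=0)
    p_y_z0 = (onehot * w0[:, None]).sum(axis=0)
    omega2 = kl_categorical(p_y_z1, p_y) + kl_categorical(p_y_z0, p_y)
    return omega1, omega2
```

As a sanity check, a Splitter that assigns every example probability 0.75 on a label-balanced dataset incurs (approximately) zero penalty from both terms.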

3.4. TRAINING STRATEGY

The only question that remains is how to learn the Splitter. Our goal is to produce difficult and non-trivial splits so that the Predictor cannot generalize. The challenge is that we do not have any explicit annotations for the splitting decisions. There are a few options to address this challenge. From the meta-learning perspective, we can back-propagate the Predictor's loss on the testing split directly to the Splitter. This process is expensive as it involves higher-order gradients from the Predictor's training. While one can apply episodic training (Vinyals et al., 2016) to reduce the computation cost, the Splitter's decisions will be biased by the size of the learning episodes (since the Predictor only operates on the sampled episode). From the reinforcement learning viewpoint, we can cast our objectives, maximizing the generalization gap while maintaining the regularity constraints, into a reward function (Lei et al., 2016). However, in our preliminary experiments, the learning signal from this scalar reward was too sparse for the Splitter to learn meaningful splits.

In this work, we take a simple yet effective approach to learn the Splitter. Our intuition is that the Predictor's generalization performance will drop if we move examples that are predicted correctly away from the testing split, leaving only those that are mispredicted. In other words, we can view the prediction correctness of each testing example as direct supervision for the Splitter. Formally, let $\hat{y}_i$ be the Predictor's prediction for input $x_i$: $\hat{y}_i = \arg\max_y P_{Predictor}(y \mid x_i)$. We minimize the cross-entropy loss between the Splitter's decision and the Predictor's prediction correctness over the testing split:

$$L_{gap} = \frac{1}{|D_{test}|} \sum_{(x_i, y_i) \in D_{test}} L_{CE}\big(P_{Splitter}(z_i \mid x_i, y_i),\, \mathbb{1}[\hat{y}_i = y_i]\big). \quad (2)$$
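The loss in Eq 2 is a binary cross-entropy whose targets are the Predictor's correctness bits on the testing split: the Splitter is pushed to send correctly-predicted examples to the training split ($z_i = 1$) and keep mispredicted ones in the testing split. A minimal NumPy sketch (the function name and the clipping constant are our choices):

```python
import numpy as np

def l_gap(p_train_test, correct):
    """Eq 2: binary cross-entropy between the Splitter's probability of sending
    each *testing* example to the training split and the Predictor's correctness
    on that example (target z_i = 1 iff x_i is predicted correctly)."""
    p = np.clip(np.asarray(p_train_test, dtype=float), 1e-8, 1 - 1e-8)
    t = np.asarray(correct, dtype=float)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())
```

For example, with two testing examples where the Splitter assigns train-probabilities [0.9, 0.1] and the Predictor gets [correct, incorrect], both decisions agree with their targets and the loss is low (−log 0.9 ≈ 0.105); flipping the correctness bits would make it large.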
Combining with the aforementioned regularity constraints, the overall objective for the Splitter is

$$L_{total} = L_{gap} + \Omega_1 + \Omega_2. \quad (3)$$

One can explore different weighting schemes for the three loss terms (Chen et al., 2018). In this paper, we found that the unweighted summation (Eq 3) works well out of the box across all our experiments. Algorithm 1 presents the pseudo-code of our algorithm. At each outer loop (lines 2-11), we start by using the current Splitter to partition $D_{total}$ into $D_{train}$ and $D_{test}$. We train the Predictor from scratch on $D_{train}$ and evaluate its generalization performance on $D_{test}$. For computational efficiency, we sample mini-batches in the inner loop (lines 6-10) and update the Splitter based on Eq 3.

Figure 4: The splits learned by ls correlate with human-identified biases. For example in Waterbirds (left), ls learns to amplify the spurious association between landbirds and land backgrounds in the training split $D_{train}$. As a result, predictors will over-fit the background features and fail to generalize at test time ($D_{test}$) when the spurious correlation is reduced.

4. EXPERIMENTS

We conduct experiments over multiple modalities (Section 4.1) and answer two main questions. Can ls identify splits that are not generalizable (Section 4.2)? Can we use the splits identified by ls to reduce unknown biases (Section 4.3)? Implementation details are deferred to the Appendix. Our code is included in the supplemental materials and will be publicly available.

4.1. DATASETS

Beer Reviews We use the BeerAdvocate review dataset (McAuley et al., 2012), where each input review describes multiple aspects of a beer and is written by a website user. Following previous work (Lei et al., 2016), we consider two aspect-level sentiment classification tasks: look and aroma. There are 2,500 positive reviews and 2,500 negative reviews for each task, with an average of 128.5 words per review. We apply ls to identify spurious splits for each task.

Tox21 Tox21 is a molecular property prediction benchmark with 12,707 molecules (Huang et al., 2016). Each input is annotated with a set of binary properties that represent the outcomes of different toxicological experiments. We consider the property Androgen Receptor (AR; active or inactive) as our prediction target. We apply ls to identify spurious splits over the entire dataset.

Waterbirds Waterbirds is an image classification dataset (Sagawa et al., 2019) that combines bird photographs from the CUB dataset (Welinder et al., 2010) with backgrounds from the Places dataset (Zhou et al., 2014). The task is to predict waterbirds vs. landbirds. The challenge is that waterbirds, by construction, appear more frequently against a water background, so predictors may exploit this spurious correlation to make their predictions. We combine the official training data and validation data (5,994 examples in total) and apply ls to identify spurious splits.

CelebA CelebA is an image classification dataset where each input image (a face) is paired with multiple human-annotated attributes (Liu et al., 2015). Following previous work (Sagawa et al., 2019), we treat the hair color attribute (y ∈ {blond, not blond}) as our prediction target. The label is spuriously correlated with the gender attribute ({male, female}). We apply ls to identify spurious splits over the official training data (162,770 examples).

MNLI MNLI is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information (Williams et al., 2018). The task is to classify the relationship between a pair of sentences: entailment, neutral or contradiction. Previous work has found that contradiction examples often include negation words (McCoy et al., 2019). We apply ls to identify spurious splits over the training data (206,175 examples) created by Sagawa et al. (2019).

4.2. IDENTIFYING NON-GENERALIZABLE SPLITS

Figure 3 presents the splits identified by our algorithm ls. Compared to random splitting, ls achieves substantially higher generalization gaps across all 6 tasks. Moreover, we observe that the learned splits are not degenerate: the training split D train and testing split D test share similar label distributions. This confirms the effectiveness of our regularity objectives.

Why are the learned splits so challenging for predictors to generalize across? While ls only has access to the set of input-label pairs, Figure 4 and Figure 5 show that the learned splits are informative of human-identified biases. For example, in the generated training split of MNLI, inputs with negation words are mostly labeled as contradiction. This encourages predictors to leverage the presence of negation words to make their predictions. These biased predictors cannot generalize to the testing split, where inputs with negation words are mostly labeled as entailment or neutral.

Table 1: Average and worst-group test accuracy for de-biasing. When using bias annotations on the validation data for model selection, previous work (CVaR DRO (Levy et al., 2020), LfF (Nam et al., 2020), EIIL (Creager et al., 2021), JTT (Liu et al., 2021a)) significantly outperforms ERM (also tuned using bias annotations on the validation data). However, these methods underperform the group DRO baseline (Sagawa et al., 2019) that was previously overlooked. When bias annotations are not available for validation, their performance quickly drops to that of ERM. In contrast, applying group DRO with splits identified by ls substantially improves the worst-group performance. † and ‡ denote numbers reported by Liu et al. (2021a) and Creager et al. (2021), respectively.

Convergence and time-efficiency ls requires learning a new Predictor for each outer-loop iteration. While this makes ls more time-consuming than training a regular ERM model, the procedure guarantees that the Predictor faithfully measures the generalization gap under the current Splitter. Figure 6 shows the learning curve of ls: the generalization gap steadily increases as we refine the Splitter, and the learning procedure usually converges within 50 outer-loop iterations.
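The per-iteration measurement described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_predictor` and `evaluate` are hypothetical helpers standing in for the actual training and evaluation routines.

```python
# Sketch of one outer-loop measurement in ls: retrain a fresh Predictor on the
# current training split and report the train-test accuracy gap. A larger gap
# means the Splitter has found a harder (less generalizable) split.
from typing import Callable, List, Tuple

Example = Tuple[object, int]  # (input, label)

def generalization_gap(
    d_train: List[Example],
    d_test: List[Example],
    train_predictor: Callable[[List[Example]], Callable],
    evaluate: Callable[[Callable, List[Example]], float],
) -> float:
    predictor = train_predictor(d_train)      # fresh Predictor each iteration
    acc_train = evaluate(predictor, d_train)  # accuracy on the training split
    acc_test = evaluate(predictor, d_test)    # accuracy on the testing split
    return acc_train - acc_test
```

The Splitter is then updated to increase this gap, subject to the regularity constraints.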

4.3. AUTOMATIC DE-BIASING

Once ls has identified the spurious splits, we can apply robust learning algorithms to learn models that generalize across the splits. Here we consider group distributionally robust optimization (group DRO) and study three well-established benchmarks: Waterbirds, CelebA and MNLI.

Group DRO Group DRO has shown strong performance when biases are annotated (Sagawa et al., 2019). For example, in CelebA, gender (male, female) constitutes a bias for predicting blond hair. Group DRO uses the gender annotations to partition the training data into four groups: {blond hair, male}, {blond hair, female}, {no blond hair, male}, {no blond hair, female}. By minimizing the worst-group loss during training, it regularizes the impact of the unwanted gender bias. At test time, we report the average accuracy and worst-group accuracy over a held-out test set.

Group DRO with supervised bias predictor Recent work considers a more challenging setting where bias annotations are not provided at training time. CVaR DRO (Levy et al., 2020) up-weights examples that have the highest training losses. LfF (Nam et al., 2020) and JTT (Liu et al., 2021a) train a separate de-biased predictor by learning from the mistakes of a biased predictor. EIIL (Creager et al., 2021) infers environment information from an ERM predictor and uses group DRO to promote robustness across the latent environments. However, these methods still access bias annotations on the validation data for model selection. With thousands of validation examples (1,199 for Waterbirds, 19,867 for CelebA, 82,462 for MNLI), a simple baseline was overlooked by the community: learning a bias predictor over the validation data (where bias annotations are available) and using the predicted bias attributes on the training data to define groups for group DRO.

Group DRO with splits identified by ls We consider the general setting where biases are not known during either training or validation. To obtain a robust model, we take the splits identified by ls (Section 4.2) and use them to define groups for group DRO. For example, we have four groups in CelebA: {blond hair, z = 0}, {blond hair, z = 1}, {no blond hair, z = 0}, {no blond hair, z = 1}. For model selection, we apply the learned Splitter to split the validation data and measure the worst-group accuracy.

Results Table 1 presents our results on de-biasing. We first see that when bias annotations are available in the validation data, the missing baseline, group DRO with a supervised bias predictor, outperforms all previous de-biasing methods (by 4.8% on average). This result is not surprising given that the bias attribute predictor, trained on the validation data, achieves an accuracy of 94.8% in Waterbirds (predicting the spurious background), 97.7% in CelebA (predicting the spurious gender attribute) and 99.9% in MNLI (predicting the presence of negation words). When bias annotations are not provided for validation, previous de-biasing methods (tuned on average validation performance) fail to improve over the ERM baseline, confirming the findings of Liu et al. (2021a). In contrast, applying group DRO with splits identified by ls consistently achieves the best worst-group accuracy, outperforming previous methods by 23.4% on average. While we no longer have access to the bias annotations for model selection, the worst-group performance defined by ls can be used as a surrogate (see Appendix C for details).
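To make the group construction concrete, the following is a hedged sketch of crossing the binary label y with the ls split decision z to form group DRO groups, and of the worst-group accuracy used for model selection. The encoding `2 * y + z` is an illustrative choice, not the paper's exact code.

```python
# Sketch: define group DRO groups from (label y, split decision z), then score
# worst-group accuracy, e.g. for model selection on the Splitter-split
# validation data.
from collections import defaultdict
from typing import List

def group_id(y: int, z: int) -> int:
    # e.g. CelebA: {blond, z=0}, {blond, z=1}, {not blond, z=0}, {not blond, z=1}
    return 2 * y + z

def worst_group_accuracy(preds: List[int], labels: List[int], zs: List[int]) -> float:
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, z in zip(preds, labels, zs):
        g = group_id(y, z)
        total[g] += 1
        correct[g] += int(p == y)
    # worst-group accuracy over all non-empty groups
    return min(correct[g] / total[g] for g in total)
```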

5. DISCUSSION

Section 4 shows that ls identifies non-generalizable splits that correlate with human-identified biases. However, we must keep in mind that bias is a human-defined notion. Given the set of input-label pairs, ls provides a tool for understanding potential biases, not a fairness guarantee. If the support of the given dataset does not cover the minority groups, ls will fail. For example, consider a dataset with only samoyeds in grass and polar bears in snow (no samoyeds in snow or polar bears in grass). ls will not be able to detect the background bias in this case.

We also note that poor generalization can result from label noise. Since the Splitter makes its decision based on the input-label pair, ls can achieve a high generalization gap by allocating all clean examples to the training split and all mislabeled examples to the testing split. In this situation, we can think of ls as a label-noise detector (see Appendix D for more analysis); blindly maximizing the worst-split performance would force the model to memorize the noise.

Another limitation is running time. Compared to empirical risk minimization, ls needs to perform second-order reasoning, which introduces extra time cost (see Appendix C for more discussion). Finally, in real-world applications, biases can also come from many independent sources (e.g., gender and race). Identifying multiple diverse splits will be an interesting direction for future work.

6. CONCLUSION

We present Learning to Split (ls), an algorithm that learns to split the data so that predictors trained on the training split cannot generalize to the testing split. Our algorithm only requires access to the set of input-label pairs and is applicable to general datasets. Experiments across multiple modalities confirm that ls identifies challenging splits that correlate with human-identified biases. Compared to previous state-of-the-art, learning with ls-identified splits significantly improves robustness.



To prevent over-fitting, we hold out 1/3 of the training split for early-stopping when training the Predictor. We note that the two regularizers Ω1 and Ω2 are introduced to shape the Splitter's decisions, but the model has the flexibility to deviate from this "prior." That is, the actual "posteriors" can differ depending on the dataset: for example, the minority group is unlikely to always constitute exactly 25% of the dataset. Therefore, it makes more sense to introduce soft regularizers instead of hard (and exact) constraints. Nevertheless, if users want to allocate exactly 25% of the data to the testing split, instead of sampling from the Splitter's decisions P_Splitter(z_i | x_i, y_i), they can simply sort these probabilities and split at the 25th percentile.
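The deterministic alternative mentioned above can be sketched as follows, assuming `p_test` holds each example's probability P_Splitter(z_i = 1 | x_i, y_i):

```python
# Sketch: instead of sampling z_i from the Splitter, sort the per-example
# probabilities and cut at a fixed quantile so that exactly `test_frac` of the
# data lands in the testing split (z = 1).
import numpy as np

def percentile_split(p_test: np.ndarray, test_frac: float = 0.25) -> np.ndarray:
    """Assign z = 1 (testing split) to the test_frac highest-probability examples."""
    n_test = int(round(test_frac * len(p_test)))
    order = np.argsort(-p_test)            # indices sorted by descending P(z=1)
    z = np.zeros(len(p_test), dtype=int)
    z[order[:n_test]] = 1
    return z
```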



Figure 1: Consider the task of classifying samoyed images vs. polar bear images. Given the set of image-label pairs, our algorithm ls learns to split the data so that predictors trained on the training split cannot generalize to the testing split. The learned splits help us identify the hidden biases. For example, while predictors can achieve perfect performance on the training split by using the spurious heuristic that polar bears live in snowy habitats, they fail to generalize to the under-represented group (polar bears that appear on grass).

De-biasing algorithms Modern datasets are often coupled with unwanted biases (Buolamwini & Gebru, 2018; Schuster et al., 2019; McCoy et al., 2019; Yang et al., 2019). If the biases have already been identified, we can use this prior knowledge to regulate their negative impact (Kusner et al., 2017; Hu et al., 2018; Oren et al., 2019; Belinkov et al., 2019; Stacey et al., 2020; Clark et al., 2019; He et al., 2019; Mahabadi et al., 2020; Sagawa et al., 2020; Singh et al., 2021).

Figure 2: Splits that are difficult to generalize across do not necessarily reveal hidden biases. (a) Predictors cannot generalize if the amount of annotations is insufficient. (b) Predictors fail to generalize when the labels are unbalanced between training and testing. ls imposes two regularity constraints to avoid such degenerate solutions: the training split and testing split should have comparable sizes, and the marginal distribution of the label should be similar across the splits.
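The intent of these two constraints can be illustrated with a minimal sketch. The exact penalty forms in ls may differ; here we assume, for illustration only, that the first regularizer targets an even split and the second is measured as total variation between the label marginals.

```python
# Hedged sketch of the two regularity penalties from Figure 2 (illustrative
# forms, not the paper's exact objectives). z[i] = 1 assigns example i to the
# testing split; both splits are assumed non-empty.
import numpy as np

def size_penalty(z: np.ndarray, target: float = 0.5) -> float:
    # (a) splits should have comparable sizes: penalize deviation of the
    # testing-split fraction from the target ratio.
    return float(abs(z.mean() - target))

def label_marginal_penalty(y: np.ndarray, z: np.ndarray) -> float:
    # (b) label marginals should match: total variation between the label
    # distributions of the training split (z=0) and testing split (z=1).
    labels = np.unique(y)
    p_train = np.array([np.mean(y[z == 0] == c) for c in labels])
    p_test = np.array([np.mean(y[z == 1] == c) for c in labels])
    return float(0.5 * np.abs(p_train - p_test).sum())
```

A balanced split with matching label marginals drives both penalties to zero.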


Figure 5: ls-identified splits correlate with certain spurious properties (ATAD5, AhR) even though they are not provided to the algorithm. Here we present the train-test assignment of compounds with AR=active given by ls. In the leftmost bar, we look at all examples: 58% of {AR=active} is in the training split and 42% is in the testing split. For each bar on the right, we look at the subset where an unknown property is active. For example, 17% of {AR=active, ATAD5=active} is allocated to the training split and 83% to the testing split.
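The allocation percentages behind Figure 5 can be reproduced with a small helper like the one below. This is a sketch: `mask` is a hypothetical boolean selector for the subset with the additional property active, and z = 0 is assumed to denote the training split.

```python
# Sketch: compute the fraction of a (possibly masked) set of examples that
# ls allocates to the training split, as plotted per subgroup in Figure 5.
from typing import List, Optional

def train_fraction(zs: List[int], mask: Optional[List[bool]] = None) -> float:
    """Fraction of selected examples with z == 0 (training split)."""
    if mask is None:
        mask = [True] * len(zs)
    selected = [z for z, m in zip(zs, mask) if m]
    return sum(z == 0 for z in selected) / len(selected)
```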

Apply the Splitter to split D total into D train and D test: for each input-label pair (x_i, y_i), sample the splitting decision z_i ∈ {0, 1} from P_Splitter(z_i | x_i, y_i).
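This sampling step can be sketched in NumPy, assuming z_i = 1 denotes the testing split:

```python
# Sketch of the Splitter's sampling step: draw each z_i from a Bernoulli with
# the Splitter's per-example probability P(z_i = 1 | x_i, y_i), then partition
# the dataset indices into the two splits.
import numpy as np

def sample_split(p_test: np.ndarray, seed: int = 0):
    rng = np.random.default_rng(seed)
    z = (rng.random(len(p_test)) < p_test).astype(int)  # z_i ~ Bernoulli(p_i)
    d_train_idx = np.flatnonzero(z == 0)  # indices assigned to D train
    d_test_idx = np.flatnonzero(z == 1)   # indices assigned to D test
    return d_train_idx, d_test_idx
```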


