LEARNING ROBUST MODELS BY COUNTERING SPURIOUS CORRELATIONS

Abstract

Machine learning has demonstrated remarkable prediction accuracy over i.i.d. data, but this accuracy often drops when the model is tested on data from another distribution. One reason behind the drop is the model's reliance on features that are associated with the label in the training distribution but not in the test distribution. This problem is usually known as spurious correlation, confounding factors, or dataset bias. In this paper, we formally study the generalization error bound for this setup, given knowledge of how the spurious features are associated with the label. We also compare our analysis to the widely accepted domain adaptation error bound and show that our bound can be tighter, under additional assumptions that we consider realistic. Further, our analysis naturally suggests a set of solutions to this problem, linked to established methods across several topics on robustness in general; all of these solutions require some understanding of how the spurious features are associated with the label. Finally, we also briefly discuss a method that does not require such an understanding.

1. INTRODUCTION

Machine learning, especially deep neural networks, has demonstrated remarkable empirical successes over various benchmarks. One promising next step is to extend such empirical achievements beyond i.i.d. benchmarks: if we train a model with data from one distribution (i.e., the source distribution), how can we guarantee a small error over other unseen but related distributions (i.e., target distributions)? Quantifying the generalization error over two arbitrary distributions is not useful; thus, we require the two distributions under study to be similar but different: similar in the sense that there exists a common function that can achieve zero error over both distributions, and different in the sense that there exists another function that achieves zero error over the training distribution but not the test distribution. This problem is nontrivial because the empirical risk minimizer (ERM) may lead the model to learn this second function, a topic studied under terminologies such as spurious correlations (Vigen, 2015), confounding factors (McDonald, 2014), or dataset bias (Torralba & Efros, 2011). As a result, a small empirical error may not mean the model learns what we expect (Geirhos et al., 2019; Wang et al., 2020), and thus the model may not perform consistently over other related data. In particular, our view of the challenges in this topic is illustrated with a toy example in Figure 1, where the model is trained on source domain data to classify triangle vs. circle and tested on target domain data. Because color coincides with shape in the source domain, the model may learn either the desired function (relying on shape) or the spurious function (relying on color). The spurious function will not classify the target domain data correctly while the desired function will, but ERM cannot differentiate between them.
As one may expect, whether shape or color is considered desired or spurious is subjective, depending on the task or the data, and is in general irrelevant to the statistical nature of the problem. Therefore, our error bound will require knowledge of the spurious function. While this is a toy example, the scenario surely exists in real-world tasks (e.g., Jo & Bengio, 2017; Geirhos et al., 2019; Wang et al., 2020). The contributions of this paper are:
• We analyze the cross-distribution generalization error bound of a model trained on a distribution with spuriously correlated features, formalized as the main theorem of this paper.
• We compare our bound to the widely accepted domain adaptation one (Ben-David et al., 2010) and show that our bound can be tighter under assumptions that we consider realistic.
• Our main theorem naturally offers principled solutions to this problem, and these solutions are linked to many previously established methods for robustness in a broader context.
• As the principled solutions all require some knowledge of the task or the data, our main theorem also leads to a new heuristic that requires no such knowledge. This new method may be on a par with the principled solutions, and empirically outperforms vanilla training.

Figure 1: A toy example of the main problem studied in this paper, showing a source domain and a target domain. The spurious labeling function classifies the shape by relying on color, which coincides with shape in the source distribution only, so it will not predict the shape correctly over target domain data; the desired labeling function relies on the shape itself and predicts correctly over target domain data.

2. RELATED WORK

There is a rich history of learning robust models. We first discuss works in three topics, all centering around the concept of invariance, where invariance intuitively means the model's prediction is preserved under certain shifts of the data. We then highlight works related to our theoretical discussion.

Cross-domain Generalization

This line of work probably originates from domain adaptation (Ben-David et al., 2007), which studies the problem of training a model over one distribution and testing it over another. Since Ganin et al. (2016), recent advances along this topic mainly center around the concept of invariance: most techniques leverage different regularizers to learn representations that are invariant to the marginals of the two distributions (e.g., Ghifary et al., 2016; Rozantsev et al., 2018). Further, the community has moved beyond the limitation that a model trained via domain adaptation may only be applicable to one distribution, focusing on domain generalization (Muandet et al., 2013), which studies training a model over a collection of distributions and testing it on distributions unseen during training. Similarly, most recent methods aim to learn representations invariant to the marginals of the training distributions (e.g., Motiian et al., 2017; Li et al., 2018; Carlucci et al., 2018). Recently, the community has extended the study to domain generalization without domain IDs, to address real-world situations in which domain IDs are unavailable (Wang et al., 2019b); this again focuses on learning representations invariant to specifically designed functions.

Adversarially Robust Models

The study of robustness against adversarial examples was popularized by the empirical observation that small perturbations of image data can significantly alter a model's prediction (Szegedy et al., 2013; Goodfellow et al., 2015). This observation initiated a line of works building models invariant to such small perturbations (the rigorous definitions of "small perturbations" will not be discussed in detail here) (e.g., Lee et al., 2017; Akhtar et al., 2018), and adversarial training (Madry et al., 2018) is currently the most widely accepted method in terms of empirical defense. On the other hand, the community also aims to develop methods that are provably robust to predefined perturbations (e.g., Wong & Kolter, 2018; Croce & Hein, 2020), which links back to works on distributionally robust models (e.g., Abadeh et al., 2015; Sagawa* et al., 2020), whose central goal is to train models invariant to a predefined shift of distributions. Recent evidence shows that a key challenge of learning adversarially robust models is spuriously correlated features (Ilyas et al., 2019; Wang et al., 2020), connecting adversarial robustness to the next topic.

Countering Spurious Correlation

Works along this line usually connect the robustness of a model to its ability to ignore the spurious correlations in the data, which were also studied under the terminologies of confounding factors or dataset bias. With different concrete definitions of the spurious correlation, methods have been developed for various applications, such as image/video classification (e.g., Goyal et al., 2017; Wang et al., 2019a;b; Bahng et al., 2019; Shi et al., 2020), text classification (e.g., He et al., 2019; Clark et al., 2019; Bras et al., 2020; Zhou & Bansal, 2020; Ko et al., 2020), medical diagnosis (e.g., Zech et al., 2018; Chaibub Neto et al., 2019; Larrazabal et al., 2020), etc. The key concept is, as expected, to be invariant to the spuriously correlated features.
Related Theoretical Discussion Out of a rich collection of theoretical discussions on learning robust models, we only focus on those for unsupervised domain adaptation, as they will be related to our discussion in the sequel. Popularized by (Ben-David et al., 2007; 2010), these analyses, although taking various forms (Mansour et al., 2009; Germain et al., 2016; Zhang et al., 2019; Dhouib et al., 2020), mostly involve two terms beyond the standard machine learning generalization bound: one term describes the "learnable" nature of the problem, and one quantifies the difference between the two distributions. This second term probably inspired most of the empirical methods that force representations to be invariant across distributions. However, the value of invariance has recently been challenged (Wu et al., 2019; Zhao et al., 2019). For example, Zhao et al. (2019) argued that "invariance is not sufficient" by constructing counterexamples that violate the "learnable" nature of the problem, and formalized this understanding as the two distributions having possibly different labeling functions. Key Difference: We find the argument of disparate labeling functions less intuitive, because humans can nonetheless agree on the label of an object in whichever distribution it lies: in the context of this paper, we argue that a shared labeling function always exists (in any task reasonable to humans), but the ERM model may not have the incentive to learn this function and may learn a spurious one instead. As in Figure 1, we formalize the problem as learning against spurious functions, and argue that the central problem is still invariance; but instead of invariance to marginals, we urge invariance to the spurious function. Our discussion also applies beyond unsupervised domain adaptation and relates to most of the topics discussed in this section.

3.1. NOTATIONS & BACKGROUND

We consider a binary classification problem from feature space $\mathcal{X} \subseteq \mathbb{R}^p$ to label space $\mathcal{Y} = \{0, 1\}$. The distribution over $\mathcal{X}$ is denoted as $P$. A labeling function $f: \mathcal{X} \to \mathcal{Y}$ is a function that maps a feature $x$ to its label $y$. A hypothesis or model $\theta: \mathcal{X} \to \mathcal{Y}$ is also a function that maps features to labels; the difference in naming is only to differentiate whether the function is a natural property of the space or distribution (thus called a labeling function) or a function to estimate (thus called a hypothesis or model). The hypothesis space is denoted as $\Theta$. This work concerns the generalization error across two distributions, namely the source and target distributions, denoted $P_s$ and $P_t$ respectively. As stated previously, we are only interested in the case where these two distributions are similar but different: being similar means there exists a desired labeling function, $f_d$, that maps any $x \in \mathcal{X}$ to its label (thus the label $y := f_d(x)$); being different means there exists a spurious labeling function, $f_p$, different from $f_d$, such that $f_p(x) = f_d(x)$ for any $x \sim P_s$. This "similar but different" property will be reiterated as an assumption (A2) later. We use $(x, y)$ to denote a sample, and $(\mathbf{X}, \mathbf{Y})_P$ to denote a finite dataset whose features are drawn from $P$. We use $\epsilon_P(\theta)$ to denote the expected risk of $\theta$ over distribution $P$, and $\hat{\bullet}$ to denote the estimated term $\bullet$ (e.g., the empirical risk is $\hat{\epsilon}_P(\hat{\theta})$). We use $l(\cdot, \cdot)$ to denote a generic loss function. For a dataset $(\mathbf{X}, \mathbf{Y})_P$, if we train a model

$\hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})_P} l(\theta(x), y), \quad (1)$

previous generalization studies suggest that we can expect the error to be bounded as

$\epsilon_P(\hat{\theta}) \leq \hat{\epsilon}_P(\hat{\theta}) + \phi(|\Theta|, n, \delta), \quad (2)$

where $\epsilon_P(\hat{\theta}) = \mathbb{E}_{x \sim P}|\hat{\theta}(x) - f_d(x)|$ and $\hat{\epsilon}_P(\hat{\theta}) = \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})_P}|\hat{\theta}(x) - y|$, and $\phi(|\Theta|, n, \delta)$ is a function of the hypothesis space size $|\Theta|$, the number of samples $n$, and $\delta$, which accounts for the probability with which the bound holds.
This paper only concerns this generic form, which subsumes many discussions, each with its own assumptions. We refer to these assumptions as A1. A1: the basic assumptions needed to derive (2), formalized with two examples in Appendix A.1.
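For instance, the first example of A1 in Appendix A.1 yields a closed-form $\phi$ that can be evaluated directly. The sketch below does so; the function name is ours, not the paper's:

```python
import math

def phi_finite_class(num_hypotheses, n, delta):
    """Hoeffding-style uniform-convergence term for a finite hypothesis
    class with zero-one loss: sqrt((log|Theta| + log(1/delta)) / (2n))."""
    return math.sqrt(
        (math.log(num_hypotheses) + math.log(1.0 / delta)) / (2 * n))

# The gap between empirical and expected risk shrinks as O(1/sqrt(n)).
print(phi_finite_class(1000, 100, 0.05))    # ~0.22
print(phi_finite_class(1000, 10000, 0.05))  # ~0.022
```

With $|\Theta| = 1000$ hypotheses and $\delta = 0.05$, a hundredfold increase in $n$ tightens the bound by a factor of ten.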

3.2. GENERALIZATION UNDERSTANDING WITH SPURIOUS CORRELATION

Our interest lies in more than the expected performance over samples from the same distribution: we care about a different distribution that shares the same labeling function. As we argued previously, the key difficulty in learning a robust model is the existence of an extra labeling function for features sampled from $P_s$. More formally, our first assumption describes this problem. A2: Existence of Spurious Correlation: for any $x \in \mathcal{X}$, $y := f_d(x)$. There also exists an $f_p$ that is different from $f_d$, with $f_d(x) = f_p(x)$ for $x \sim P_s$. Thus, a $\theta$ that learns either $f_d$ or $f_p$ will achieve a small source error, but only a $\theta$ that learns $f_d$ will achieve a small target error. Note that such an $f_p$ may not exist for an arbitrary $P_s$; in other words, A2 can be interpreted as a restriction on $P_s$ so that $f_p$, while being different from $f_d$, exists for any $x \sim P_s$. In this problem, $f_p$ and $f_d$ are not the same despite $f_p(x) = f_d(x)$ for any $x \sim P_s$, and we consider the difference to lie in the features they use. To describe this difference, we introduce the notation $A(\cdot, \cdot)$, a set parametrized by a labeling function and a sample, which describes the active set of features used by the labeling function. By active set, we refer to the minimum set of features that a labeling function requires to map a sample to its label. Formally, we define $A(f, x) = \alpha_x(z^\star)$ with

$z^\star = \arg\min_{z \in \mathcal{X},\, f(z) = f(x)} |\alpha_x(z)|,$

where $\alpha_x(z) = \{i \,|\, z_i = x_i\}$ is the set of indices on which $z$ and $x$ agree, and $|\cdot|$ measures cardinality. Although $f_p(x) = f_d(x)$, $A(f_p, x)$ and $A(f_d, x)$ can be different. Further, we define a function difference given a sample as

$d_x(\theta, f) = \max_{z \in \mathcal{X}:\, z_{A(f,x)} = x_{A(f,x)}} |\theta(z) - f(z)|,$

where $x_{A(f,x)}$ denotes the features of $x$ indexed by $A(f, x)$.
In other words, this distance describes: given a sample $x$, the maximum disagreement between the two functions $\theta$ and $f$ over all other data $z \in \mathcal{X}$ under the constraint that the features indexed by $A(f, x)$ are the same as those of $x$. Notice that this difference is not symmetric, as the active set is determined by the second function. By definition, we have $d_x(\theta, f) \geq |\theta(x) - f(x)|$. We introduce two more assumptions. A3: Separable Labeling Functions: for any $x \in \mathcal{X}$, $A(f_d, x) \cap A(f_p, x) = \emptyset$. A4: Realized Hypothesis: given a large enough hypothesis space $\Theta$, for any sample $(x, y)$ and any $\theta \in \Theta$ that is not a constant mapping, if $\theta(x) = y$, then $d_x(\theta, f_d)\, d_x(\theta, f_p) = 0$. Intuitively, A3 assumes the active sets of the two labeling functions do not overlap, and A4 assumes that $\theta$ learns at least one labeling function at the sample $x$ if $\theta$ maps $x$ correctly. Finally, we use the term

$r(\theta, A(f, x)) = \max_{x'_{A(f,x)} \in \mathcal{X}_{A(f,x)}} |\theta(x') - y|,$

where $x'$ agrees with $x$ outside $A(f, x)$, to describe how $\theta$ depends on the active set of $f$. Notice that $r(\theta, A(f, x)) = 1$ alone does not mean $\theta$ depends on the active set of $f$; it only means so when we also have $\theta(x) = y$ (see the formal discussion in Lemma B.1). With all of the above, we formalize a new generalization bound as follows. Theorem 3.1 (The Curse of Universal Approximation). With Assumptions A1–A4, with probability at least $1 - \delta$, we have

$\epsilon_{P_t}(\theta) \leq \hat{\epsilon}_{P_s}(\theta) + c(\theta) + \phi(|\Theta|, n, \delta),$

where $c(\theta) = \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})_{P_s}} I[\theta(x) = y]\, r(\theta, A(f_p, x))$.

Here, $I[\cdot]$ is the indicator function, returning 1 if the condition holds and 0 otherwise. As $\theta$ may learn $f_p$, $\hat{\epsilon}_{P_s}(\theta)$ alone can no longer indicate $\epsilon_{P_t}(\theta)$; we thus introduce $c(\theta)$ to account for the discrepancy. Intuitively, $c(\theta)$ quantifies the samples that are predicted correctly, but only because $\theta$ learns $f_p$ for those samples. $c(\theta)$ critically depends on the knowledge of $f_p$. We name Theorem 3.1 the curse of universal approximation to highlight the fact that the existence of $f_p$ is not always obvious, yet models can usually learn it nonetheless. For example, Ilyas et al. (2019) suggest that the root of the performance drop on adversarial examples is spurious features, and Wang et al. (2020) demonstrate the existence of human-imperceptible high-frequency spurious signals in image datasets, which may explain several generalization issues of models. In other words, even in a well-curated dataset that does not seemingly contain spuriously correlated features, modern machine learning models may still use spurious features not understood by humans, leading to non-robust behavior when tested on other datasets that humans consider similar. This argument may also align with recent discussions suggesting that reducing model complexity can improve cross-domain generalization (Chuang et al., 2020).
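To make $c(\theta)$ concrete, the following is a minimal brute-force sketch on the shape/color toy example of Figure 1, assuming binary features and that the spurious active set $A(f_p, x)$ is known; the helper names are ours:

```python
from itertools import product

def r_term(theta, x, y, active_idx, feature_values=(0, 1)):
    """r(theta, A(f_p, x)): worst-case error of theta when the features
    indexed by active_idx range over all their possible values."""
    worst = 0
    for vals in product(feature_values, repeat=len(active_idx)):
        z = list(x)
        for i, v in zip(active_idx, vals):
            z[i] = v
        worst = max(worst, abs(theta(tuple(z)) - y))
        if worst == 1:  # early termination, as noted in Appendix A.2
            break
    return worst

def c_hat(theta, data, active_idx):
    """Empirical c(theta): mass of correctly predicted samples whose
    correctness hinges on the spurious features."""
    return sum(r_term(theta, x, y, active_idx)
               for x, y in data if theta(x) == y) / len(data)

# Toy source set: feature 0 is the shape (the label), feature 1 is the
# color, and the two coincide on every source sample.
source = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1)]
shape_model = lambda x: x[0]   # learns f_d
color_model = lambda x: x[1]   # learns f_p
print(c_hat(shape_model, source, active_idx=[1]))  # 0.0
print(c_hat(color_model, source, active_idx=[1]))  # 1.0
```

Both models have zero empirical source error, but $c(\theta)$ separates them: every correct prediction of the color model flips once the color is perturbed.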

3.3. IN COMPARISON TO THE VIEW OF DOMAIN ADAPTATION

We further compare Theorem 3.1 with established understandings of domain adaptation, which we summarize in the following form (see details in §2):

$\epsilon_{P_t}(\theta) \leq \hat{\epsilon}_{P_s}(\theta) + D_\Theta(P_s, P_t) + \lambda + \phi'(|\Theta|, n, \delta),$

where $D_\Theta(P_s, P_t)$ quantifies the difference between the two distributions, and $\lambda$ describes the nature of the problem and usually involves non-estimable terms about the problem or the distributions. For example, Ben-David et al. (2010) formalize the difference as the $\mathcal{H}$-divergence and describe the corresponding empirical term as ($\Theta\Delta\Theta$ is the set of disagreements between two hypotheses in $\Theta$):

$D_\Theta(P_s, P_t) = 1 - \min_{\theta \in \Theta\Delta\Theta} \Big(\frac{1}{m}\sum_{x:\, \theta(x) = 0} I[x \in (\mathbf{X}, \mathbf{Y})_{P_s}] + \frac{1}{m}\sum_{x:\, \theta(x) = 1} I[x \in (\mathbf{X}, \mathbf{Y})_{P_t}]\Big), \quad (8)$

where $m$ denotes the number of unlabeled samples from each of $P_s$ and $P_t$, and

$\lambda = \epsilon_{P_t}(\theta^\star) + \epsilon_{P_s}(\theta^\star), \quad \text{where } \theta^\star = \arg\min_{\theta \in \Theta} \epsilon_{P_t}(\theta) + \epsilon_{P_s}(\theta).$

In our formalization, as we assume $f_d$ applies to any $x \in \mathcal{X}$ (according to A2), $\lambda = 0$ as long as the hypothesis space is large enough. Therefore, the difference mainly lies in the comparison between $c(\theta)$ and $D_\Theta(P_s, P_t)$. To compare them, we need an extra assumption. A5: Sufficiency of Training Samples: for the two finite datasets in the study, i.e., $(\mathbf{X}, \mathbf{Y})_{P_s}$ and $(\mathbf{X}, \mathbf{Y})_{P_t}$, for any $x \in (\mathbf{X}, \mathbf{Y})_{P_t}$, there exist one or many $z \in (\mathbf{X}, \mathbf{Y})_{P_s}$ such that

$x \in \{x' \,|\, x' \in \mathcal{X} \text{ and } x'_{A(f_d, z)} = z_{A(f_d, z)}\}. \quad (9)$

A5 intuitively means the finite training dataset needs to be diverse enough to describe the concept that needs to be learned. For example, imagine building a classifier of mammals vs. fishes that transfers from a distribution of photos to a distribution of sketches: we cannot expect the classifier to do anything good on dolphins if dolphins only appear in the test sketch dataset. A5 intuitively regulates that if dolphins appear in the test sketch dataset, they must also appear in the training dataset.
Theorem 3.2. With Assumptions A1–A5 and $f_d \in \Theta$, we have

$c(\theta) \leq D_\Theta(P_s, P_t) + \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})_{P_t}} I[\theta(x) = y]\, r(\theta, A(f_p, x)),$

where $c(\theta) = \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})_{P_s}} I[\theta(x) = y]\, r(\theta, A(f_p, x))$ and $D_\Theta(P_s, P_t)$ is defined as in (8). The comparison involves an extra term, $q(\theta) := \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})_{P_t}} I[\theta(x) = y]\, r(\theta, A(f_p, x))$, which intuitively measures, if $\theta$ learns $f_p$, how many samples $\theta$ can coincidentally predict correctly over the finite target set used to estimate $D_\Theta(P_s, P_t)$. As a sanity check, if we replace $(\mathbf{X}, \mathbf{Y})_{P_t}$ with $(\mathbf{X}, \mathbf{Y})_{P_s}$, $D_\Theta(P_s, P_t)$ evaluates to 0, as it cannot differentiate two identical datasets, and $q(\theta)$ becomes identical to $c(\theta)$. On the other hand, if no samples from $(\mathbf{X}, \mathbf{Y})_{P_t}$ can be mapped correctly by $f_p$ (coincidentally), then $q(\theta) = 0$ and $c(\theta)$ becomes a lower bound of $D_\Theta(P_s, P_t)$. The value of Theorem 3.2 lies in the fact that, for an arbitrary target dataset $(\mathbf{X}, \mathbf{Y})_{P_t}$ out of which no samples can be predicted correctly by learning $f_p$ (a situation likely to occur for arbitrary datasets), $c(\theta)$ will always be a lower bound of $D_\Theta(P_s, P_t)$. Further, when Assumption A5 does not hold, we are unable to derive a clear relationship between $c(\theta)$ and $D_\Theta(P_s, P_t)$. The difference arises mainly because, intuitively, we are only interested in problems that are "solvable" (A5, i.e., the hypothesis used to reduce the test error on the target distribution can be learned from the finite training samples) yet "hard to solve" (A2, i.e., another labeling function, namely $f_p$, exists for features sampled from the source distribution only), while $D_\Theta(P_s, P_t)$ estimates the divergence of two arbitrary distributions. Estimation of $c(\theta)$. Finally, due to the limitation of space, we discuss the estimation of $c(\theta)$ in Appendix A.2. In short, while we rarely know $f_p$ in practice, we usually know $A(f_p, x)$ through intuition or common sense, such as the texture or background of images. The estimation thus tests whether the model switches away from its correct prediction when these features are perturbed over the possible space. Also, the search can be terminated once $r(\theta, A(f_p, x))$ is evaluated as 1. As one may notice, this process is widely known as adversarial attack (e.g., Goodfellow et al., 2015).
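To illustrate the sanity checks above, the sketch below evaluates a finite-class analogue of $D_\Theta(P_s, P_t)$ from (8) by exhaustive minimization. The 1-D threshold class and all names are illustrative assumptions, not the paper's implementation:

```python
def empirical_divergence(source_xs, target_xs, hypotheses):
    """Empirical divergence in the spirit of eq. (8): one minus the smallest
    achievable 'domain confusion' over the hypothesis class -- 0 when no
    hypothesis can tell the two samples apart, 1 when some hypothesis
    separates them perfectly."""
    m_s, m_t = len(source_xs), len(target_xs)
    best = min(
        sum(h(x) == 0 for x in source_xs) / m_s +
        sum(h(x) == 1 for x in target_xs) / m_t
        for h in hypotheses)
    return 1.0 - best

# A tiny 1-D instance with threshold hypotheses and their complements
# (standing in for the symmetric-difference class).
hyps = []
for t in range(7):
    hyps.append(lambda x, t=t: int(x > t))
    hyps.append(lambda x, t=t: int(x <= t))

print(empirical_divergence([0, 1, 2], [0, 1, 2], hyps))  # ~0.0: identical
print(empirical_divergence([0, 1, 2], [4, 5, 6], hyps))  # 1.0: separable
```

This reproduces the two sanity checks: identical samples yield a divergence of 0, while perfectly separable domains yield the maximum value of 1.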

4. METHODS TO LEARN ROBUST MODELS

In this section, we take advantage of our analysis to introduce methods for learning robust models by countering spurious correlation. First, we discuss principled solutions that make the error bound in Theorem 3.1 smaller; these methods are interestingly linked to many previously established methods for robustness in general. Second, as all of the above methods require some knowledge of $f_p$ or $A(f_p, x)$, we also explore a new method that does not.

4.1. PRINCIPLED SOLUTIONS OF LEARNING ROBUST MODELS

According to Theorem 3.1, a key to learning a robust model is to reduce $c(\theta)$. However, for convenience during training, we can consider its upper bound

$c(\theta) \leq \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})} r(\theta, A(f_p, x)) = \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})} \max_{x'_{A(f_p, x)} \in \mathcal{X}_{A(f_p, x)}} |\theta(x') - y|, \quad (11)$

which intuitively means that instead of $c(\theta)$, which concerns correct predictions based only on $f_p$, we now study any prediction based only on $f_p$. Also, as introduced in "Estimation of $c(\theta)$": although we barely know $f_p$ in practice, we usually know $A(f_p, x)$ directly through intuition or common sense about the data or the task.

Adversarially robust models (worst-case data augmentation). The most direct approach to learning robust models is to optimize (11) in addition to the generic loss (i.e., $l(\theta(x), y)$) of a model. Further, as $|\theta(x) - y| \leq \max_{x'_{A(f_p, x)} \in \mathcal{X}_{A(f_p, x)}} |\theta(x') - y|$, we can drop the generic loss term and directly train a model with

$\min_\theta \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})} \max_{x'_{A(f_p, x)} \in \mathcal{X}_{A(f_p, x)}} l(\theta(x'), y),$

which is one of the most widely used methods in the adversarial robustness literature: adversarial training (Madry et al., 2018), also known as worst-case data augmentation (Fawzi et al., 2016).

Data augmentation. Alternatively, we can assume

$\mathbb{E}_{x'_{A(f_p, x)} \sim P^{all}_{A(f_p, x)}} |\theta(x') - y| = \max_{x'_{A(f_p, x)} \in \mathcal{X}_{A(f_p, x)}} |\theta(x') - y|,$

where $P^{all}$ denotes a distribution that removes the correlation between $f_p$-related features and $y$ (e.g., a uniform distribution over $\mathcal{X}$ suffices, but $P_s$ does not). The main strategy is to train with samples whose $f_p$-related features are randomized, so that the model is expected to learn to ignore the pattern. As $x'$ can usually be sampled from $P^{all}$, one can drop the generic loss and train a model with

$\min_\theta \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})} \mathbb{E}_{x'_{A(f_p, x)} \sim P^{all}_{A(f_p, x)}} l(\theta(x'), y).$
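The data-augmentation objective can be sketched as follows: a logistic model is trained while an assumed-known spurious feature is resampled from a label-independent distribution at each step. The synthetic task and all names are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_source(n=2000):
    """Hypothetical source data: feature 0 carries the label (f_d-related),
    feature 1 duplicates it on the source only (f_p-related), feature 2
    is a constant intercept."""
    y = rng.integers(0, 2, n).astype(float)
    x = np.stack([y + 0.1 * rng.standard_normal(n),
                  y + 0.1 * rng.standard_normal(n),
                  np.ones(n)], axis=1)
    return x, y

def train_logreg(x, y, spurious_idx=None, steps=500, lr=0.5):
    """Gradient-descent logistic regression.  With spurious_idx set, the
    indexed features are resampled label-independently (a stand-in for
    P_all) at every step -- the data-augmentation objective above."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        xb = x.copy()
        if spurious_idx is not None:
            xb[:, spurious_idx] = rng.uniform(
                -1.0, 2.0, size=(len(xb), len(spurious_idx)))
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - y) / len(y)
    return w

x, y = make_source()
w_vanilla = train_logreg(x, y)
w_aug = train_logreg(x, y, spurious_idx=[1])
# Vanilla training splits weight across both predictive features;
# augmentation drives the spurious weight toward zero.
```

Because the resampled feature no longer predicts the label, its gradient contribution vanishes on average and the trained weight concentrates on the causal feature.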
Regularizing the hypothesis space. We can also consider finding a $\Theta_{regularized}$ such that

$|\theta(x) - y| = \max_{x'_{A(f_p, x)} \in \mathcal{X}_{A(f_p, x)}} |\theta(x') - y| \;\text{ for } \theta \in \Theta_{regularized},$

which intuitively means that any $\theta \in \Theta_{regularized}$ ignores the information from $A(f_p, x)$. There is a proliferation of recent developments along this thread, either through intuitive understanding of the task or the data (e.g., Wang et al., 2019a;b; Bahng et al., 2019) or through theoretical understanding of rigorously defined perturbations (e.g., Abadeh et al., 2015; Cranko et al., 2020).

4.2. LEARNING ROBUST MODELS WITH MINIMUM SUPERVISION

While our analysis suggests that we cannot have a robust model without knowledge of $f_p$, we go on to ask what is the best we can do without such knowledge. If we use $F$ to denote the set $\{f_d, f_p\}$ and $i$ to index its elements, we have the following upper bound to optimize:

$c(\theta) \leq \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})} r(\theta, A(f_p, x)) \leq \frac{1}{n}\sum_{(x, y) \in (\mathbf{X}, \mathbf{Y})} \sum_i r(\theta, A(F_i, x)). \quad (16)$

By optimizing the RHS of (16), we aim to discourage the model from relying, for each sample, on only one labeling function, whichever labeling function that is. Intuitively, the method encourages the model to use all possible features (whether associated with $f_d$ or $f_p$), so the model may be more robust to changes of features when dealing with perturbations of the data. A model using all the features is not expected to beat a method that uses only the $f_d$-associated features. However, as all methods in §4.1 require specific knowledge of $f_p$, we argue this is a better practice than vanilla training (1) when there is no side information about either $f_d$ or $f_p$. In practice, as we do not have knowledge of $F$, we use the estimated $\hat{\theta}$ from the previous iteration as a substitute; and as searching for $A(f, x)$ can be computationally expensive, we use gradient information to guide the selection of features. In summary, this new method, which we name minimum supervision (MS), has the following three major steps at iteration $t$: • Use $\hat{\theta}^{(t-1)}$ as a substitute for either $f_d$ or $f_p$. • Identify $A(\hat{\theta}^{(t-1)}, x)$ as the top-$\rho$ fraction of features according to the gradient. • Sample $x'_{A(\hat{\theta}^{(t-1)}, x)}$ over $\mathcal{X}_{A(\hat{\theta}^{(t-1)}, x)}$ and continue training the model on the perturbed sample. The new method thus has one hyperparameter, $\rho$. Due to the limitation of space, we discuss the detailed algorithm and other practical aspects of the method in Appendix A.3.

5. EXPERIMENTS

The experiments cover two scenarios: we first use two binary classification experiments to support Theorems 3.1 and 3.2 (§5.1); we then test how the new method compares to previously developed methods that explicitly use the knowledge of $f_p$ (§5.2).

5.1. THEORY SUPPORTING EXPERIMENT

Synthetic Data with Spurious Correlation We extend the setting of Figure 1 to data with $p$ features, where $p/2$ features are related to $f_d$, $p/4$ features are related to $f_p$, and the remaining $p/4$ features are independent of the label. Further, $f_d$ is a non-linear function while $f_p$ is simpler. We test across multiple choices. Overall, the results suggest that (1) minimum supervision works better than the vanilla method, and (2) $c(\theta)$ is a tighter estimate of the test error than $D_\Theta(P_s, P_t)$. Details are in Appendix C.1.
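A generator in the spirit of this setup might look as follows. This is a sketch under our own assumptions (a specific non-linear $f_d$ and a mean-shift $f_p$); the paper's exact generator is described in its Appendix C.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n, p, spurious=True):
    """p/2 features drive a non-linear f_d, the next p/4 features copy
    the label only on the source (spurious=True), and the last p/4
    features are label-independent noise."""
    h, q = p // 2, p // 4
    x = rng.standard_normal((n, p))
    y = (np.sin(x[:, :h]).sum(axis=1) > 0).astype(int)  # non-linear f_d
    if spurious:
        x[:, h:h + q] += 2.0 * y[:, None]  # a simpler, linear f_p
    return x, y

x_src, y_src = make_split(1000, 8, spurious=True)    # correlation present
x_tgt, y_tgt = make_split(1000, 8, spurious=False)   # correlation broken
```

On the source split, the spurious block separates the classes by a large mean shift; on the target split, the same block carries no label information, so a model relying on it degrades.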

Binary Digit Classification over Transferable Adversarial Examples

We also verify the theoretical discussion through a binary digit classification experiment, where the train and validation sets are digits 0 and 1 from the MNIST train and validation datasets. To create the test set, we first estimate a model and perform adversarial attacks over this model for the test samples with five adversarial attack methods, including C&W (Carlini & Wagner, 2017), FGSM (Goodfellow et al., 2015), Salt&Pepper (Rauber et al., 2017), and SinglePixel (Rauber et al., 2017). These adversarially generated examples are treated as the test set from another distribution. An advantage of this setup is that $f_p$ is well defined as $1 - f_{adv}$, where $f_{adv}$ is the function each adversarial attack relies on; thus we can assess our analysis on image classification. We train models with the vanilla method, the minimum supervision method (MS), adversarial training (AT), and data augmentation (Aug). In addition to the training error (i.e., $\hat{\epsilon}_{P_s}(\theta)$) and test error (i.e., $\epsilon_{P_t}(\theta)$), we also report $\hat{\epsilon}_{P_s}(\theta) + c(\theta)$ and $\hat{\epsilon}_{P_s}(\theta) + D_\Theta$. Our results in Figure 2 again align with our analysis: (1) minimum supervision outperforms the vanilla method, but may be inferior to methods using $f_p$ explicitly; (2) $c(\theta)$ is often a tighter estimate of the test error than $D_\Theta(P_s, P_t)$.
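For intuition on how such a test set is built, here is a minimal one-step FGSM attack on a logistic model. The paper itself uses foolbox implementations of the listed attacks on a trained digit classifier, so this is only an illustrative stand-in with names of our choosing:

```python
import numpy as np

def fgsm(x, y, w, eps=0.25):
    """One-step FGSM (Goodfellow et al., 2015) against a logistic model
    p = sigmoid(x @ w): move each input along the sign of the
    cross-entropy gradient with respect to the input."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    grad_x = (p - y)[:, None] * w[None, :]  # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)

# The perturbed points play the role of the target distribution: every
# sample's loss under the attacked model is pushed up.
```

Collecting the perturbed samples with their original labels yields a test set on which the attacked model, and models sharing its spurious features, perform poorly.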

5.2. REAL IMAGE CLASSIFICATION

Finally, we conduct a real-image classification experiment to examine whether our minimum supervision method can compete with other advanced methods in a more challenging and realistic setting. We follow the setup in (Bahng et al., 2019) and compare the models on a 9-super-class ImageNet classification task (Ilyas et al., 2019) with class-balanced strategies. Also following (Bahng et al., 2019), we report standard accuracy; weighted accuracy, a scenario where samples with unusual texture are weighted more; and accuracy over ImageNet-A (Hendrycks et al., 2019), a collection of failure cases for most ImageNet-trained models. Additionally, we report performance over ImageNet-Sketch (Wang et al., 2019a), an independently collected ImageNet test set containing only sketch images. We report the results in Table 1. Our method (MS) outperforms the other methods in most situations, which we consider impressive since only the MS and vanilla methods do not use any knowledge of $f_p$ or $f_d$. More details of the experiment setup and competing methods are discussed in Appendix C.2.

6. CONCLUSION

In this paper, we formalized the generalization error when models can use spurious features that are present in the training set but not shared by the test set, a problem widely studied under the terminologies of spurious correlation, confounding factors, or dataset bias. We formalized a new generalization error bound and compared it to the well-established domain adaptation one. More importantly, our theorem naturally offers a set of principled solutions for this problem, linked to many previous methods for robustness in a broader context. Since all these principled solutions require some knowledge of the spuriously correlated features, we also leveraged our theorem to develop a new method that does not require such knowledge.

A SUPPORTING DISCUSSIONS

A.1 CONCRETE EXAMPLES OF THE GENERIC GENERALIZATION BOUND
• When A1 is "Θ is finite, $l(\cdot, \cdot)$ is a zero-one loss, samples are i.i.d.", $\phi(|\Theta|, n, \delta) = \sqrt{(\log|\Theta| + \log(1/\delta))/2n}$.
• When A1 is "samples are i.i.d.", $\phi(|\Theta|, n, \delta) = 2R(L) + \sqrt{\log(1/\delta)/2n}$, where $R(L)$ stands for the Rademacher complexity of $L = \{l_\theta \,|\, \theta \in \Theta\}$, and $l_\theta$ is the loss function corresponding to $\theta$.
For more information or more concrete examples of the generic term, one can refer to relevant textbooks such as (Bousquet et al., 2003).

A.2 ESTIMATION OF c(θ)

The estimation of $c(\theta)$ mainly involves two difficulties: the knowledge of $f_p$ and the computational cost of the search over the entire space $\mathcal{X}$. The first difficulty is usually resolved with intuition or common sense about the data or the task: in practice, we usually have direct knowledge of $A(f_p, x)$, i.e., the spuriously correlated features that $f_p$ relies on, such as the texture of images. The estimation therefore becomes a process of testing whether the model switches away from its correct prediction when these features are perturbed over the possible space. The second difficulty is alleviated by the fact that the search can be terminated once $r(\theta, A(f, x))$ is evaluated as 1. As one may be aware, this process of searching the entire space, with perturbations allowed in a predefined scope, to test the model's worst possible prediction for a sample $x$ is widely known as adversarial attack (Goodfellow et al., 2015). These techniques also usually leverage the model's gradient to accelerate the search. While adversarial attack can offer a fairly accurate estimation of $c(\theta)$, it usually requires heavy computational effort. As an alternative strategy, much other literature has tested models with some fixed perturbations of $x$, in other words taking advantage of the fact that $|\theta(x') - y| \leq \max_{x'_{A(f,x)} \in \mathcal{X}_{A(f,x)}} |\theta(x') - y| = r(\theta, A(f, x))$ for any $x'$ with $x'_{A(f,x)} \in \mathcal{X}_{A(f,x)}$. For example, Wang et al. (2020) demonstrated that models can capture high-frequency signals from images, which also links the discussion of learning through biased signals to the adversarial vulnerability of models (Ilyas et al., 2019). Similarly, these works mostly depend on a subjective choice of $A(f_p, x)$, usually given by knowledge of the data or the task. Although these works did not directly assess $c(\theta)$, $\theta$ usually switched its prediction on sufficiently many samples to raise an alarm.

A.3 LEARNING ROBUST MODELS WITH MINIMUM SUPERVISION IN PRACTICE

In practice, as we do not have knowledge of either f_d or f_p (F), the strategy we use is to estimate the model first and consider our estimated model θ̂ as a substitute for the labeling function (either f_d or f_p). Therefore, at each iteration t, we use the θ̂ from the previous iteration to identify the active set for the optimization of (16) (in the main manuscript). A further question is, given θ̂_{t−1}, how to identify A(θ̂_{t−1}, x), as searching for A(θ̂_{t−1}, x) by the definition can be computationally expensive. Our practical strategy is to use the gradient of θ̂_{t−1} to guide the selection of the features. Intuitively, we argue that the features with larger absolute values of ∂l(θ̂_{t−1}, x, y)/∂θ̂_{t−1} are the features θ̂_{t−1} relies on. Finally, we consider the features whose gradient magnitudes exceed a threshold τ(ρ, g) to be the features in A(θ̂_{t−1}, x). The threshold is set as the ρ-th quantile of all the calculated gradients for this sample. The procedure is shown in Algorithm 1.

Algorithm 1: Learning Robust Models with Minimum Supervision
Input: T, ρ, η, (X, Y); initialize θ_0, t = 1
Result: θ_T
while t ≤ T do
    for sample (x, y) do
        calculate the gradient g = ∂l(θ_{t−1}, x, y)/∂θ_{t−1};
        set the threshold τ(ρ, g) to be the ρ-th quantile of |g|;
        set A(θ_{t−1}, x) = {i : |g_i| ≥ τ(ρ, g)};
        sample x′ where x′_{A(θ_{t−1},x)} ∈ X_{A(θ_{t−1},x)};
        calculate the gradient g′ = ∂l(θ_{t−1}, x′, y)/∂θ_{t−1};
        update the model θ_t = θ_{t−1} − ηg′;
    end
end
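The following is a minimal runnable sketch of Algorithm 1 for a logistic-regression model. For a linear model, the per-coordinate parameter gradient (σ(θᵀx) − y)·x is indexed by the same coordinates as the input features, so thresholding |g_i| selects input features; resampling the active features from a standard Gaussian is an assumption standing in for sampling x′ with x′_A ∈ X_A.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ms_train(X, Y, T=50, rho=0.9, eta=0.1, rng=None):
    """Sketch of Algorithm 1 (minimum supervision) for logistic regression."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(T):
        for x, y in zip(X, Y):
            # g = d l(theta, x, y) / d theta for the logistic loss
            g = (sigmoid(theta @ x) - y) * x
            # threshold tau(rho, g): rho-th quantile of |g|
            tau = np.quantile(np.abs(g), rho)
            active = np.abs(g) >= tau            # A(theta_{t-1}, x)
            # sample x' by resampling the active features (assumed scheme)
            x_prime = x.copy()
            x_prime[active] = rng.normal(0.0, 1.0, active.sum())
            # gradient on the perturbed sample, then the update step
            g_prime = (sigmoid(theta @ x_prime) - y) * x_prime
            theta = theta - eta * g_prime
    return theta
```

The two-gradient structure (one pass to pick the active set, one pass on the perturbed sample to update) mirrors the loop body of Algorithm 1.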

C.2 REAL IMAGE CLASSIFICATION: MORE DETAILS

The main experiment setup follows that of (Bahng et al., 2019), and it can be conveniently replicated with the GitHub repo associated with that paper. Although results on ImageNet-C are also reported by (Bahng et al., 2019), their GitHub repo does not provide the corresponding replication scripts, so we skip those results. Additionally, we report on another ImageNet-level test set that is independently collected and contains only sketch images. We rename "bias" and "unbiased" in (Bahng et al., 2019) to "standard accuracy" and "weighted accuracy" to align with the terms we use in this paper and to help explain the results. Intuitively, "weighted accuracy" refers to an evaluation mechanism in which test samples with unusual texture receive more weight. Again following the setup in (Bahng et al., 2019), the base network is ResNet, and we compare with the vanilla network and several methods designed to reduce texture bias, including StylisedIN (Geirhos et al., 2019), LearnedMixin (Clark et al., 2019), RUBi (Cadene et al., 2019), and ReBias (Bahng et al., 2019). Finally, to obtain the reported performance, our MS method uses an extra heuristic: we only optimize (16) for half of each batch and optimize the other half with the vanilla training objective (1). Despite this heuristic, the main message remains: the MS method, which does not use knowledge of the spuriously correlated features, can compete with the methods that use this knowledge explicitly.
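Since "weighted accuracy" is only described verbally above, here is a hedged sketch of one such evaluation mechanism, assuming a hypothetical per-sample texture-group id and weighting each sample inversely to the size of its group, so that samples with unusual texture count more:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred, group):
    """Weighted accuracy sketch: each sample is weighted by 1/(size of its
    texture group). `group` is a hypothetical per-sample texture-cluster id;
    how texture groups are obtained is outside the scope of this sketch."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    _, inverse, counts = np.unique(group, return_inverse=True,
                                   return_counts=True)
    w = 1.0 / counts[inverse]          # rare-texture samples get more weight
    return float(np.sum(w * (y_true == y_pred)) / np.sum(w))
```

For example, a classifier that is correct on a common texture group but wrong on a rare one scores well on standard accuracy yet is penalized heavily here, which is the intended effect of the reweighting.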




Figure 2: Binary MNIST classification error and estimated bounds. Each panel represents one out-of-domain dataset generated through an attack method. Four methods are reported in each panel. For each method, four bars are plotted, from left to right: P̂s(θ), P̂t(θ), P̂s(θ) + c(θ), and P̂s(θ) + D_Θ. Some bars are not visible because the values are small.

Comparison of results on 9-super-class ImageNet classification. Only Vanilla and MS do not leverage any knowledge of the spurious features.

There are many works in this thread, and we only list a handful of examples: Jo & Bengio (2017) leveraged the Fourier transform to show that models can capture a significant amount of texture information; later, Geirhos et al. (2019) showed that CNNs trained on ImageNet are also biased towards texture. With a more concrete definition of texture, Wang et al. (2020) demonstrated that models can capture high-frequency signals from images.

B PROOFS OF THEORETICAL DISCUSSIONS

B.1 LEMMA B.1 AND PROOF

Lemma B.1. With sample (x, y) and two labeling functions f_1(x) = f_2(x) = y, for an estimated θ ∈ Θ, if θ(x) = y, then with A3 and A4, we have

d_x(θ, f_1) = 1 ⟺ r(θ, A(f_2, x)) = 1,    (18)

where x′_{A(f,x)} ∈ X_{A(f,x)} denotes that the features of x′ indexed by A(f, x) are searched in the entire space.

Proof. If θ(x) = y and d_x(θ, f_1) = 1, according to A4, we have d_x(θ, f_2) = 0.

First, we consider the direction d_x(θ, f_1) = 1 ⟹ r(θ, A(f_2, x)) = 1, and we prove it by contradiction. If the conclusion does not hold, then r(θ, A(f_2, x)) = 0, which means max_{x′: x′_{A(f_2,x)} ∈ X_{A(f_2,x)}} |θ(x′) − y| = 0. Together with d_x(θ, f_2) = 0, which means the maximum of |θ(x′) − y| over perturbations of the remaining features is also 0, we will have max_{x′ ∈ X} |θ(x′) − y| = 0, which means θ(x′) = y for any x′. This contradicts the premise in A4 (θ is not a constant function).

Second, we consider the other direction r(θ, A(f_2, x)) = 1 ⟹ d_x(θ, f_1) = 1, and we prove it by showing that its contrapositive holds. (The contrapositive is d_x(θ, f_1) = 0 ⟹ r(θ, A(f_2, x)) = 0, because, by definition, r and d can only be evaluated as 0 or 1.) Because of A3, the contrapositive proposition can be shown trivially.

B.2 THEOREM 3.1 AND PROOF

Theorem. With Assumptions A1–A4, with probability at least 1 − δ, we have

P_t(θ) ≤ P̂s(θ) + c(θ) + φ(|Θ|, n, δ),

where c(θ) = (1/n) Σ_{(x,y)∈(X,Y)} I[θ(x) = y] r(θ, A(f_p, x)).

Proof. Expanding the target error sample by sample, where the last line uses Lemma B.1, we have that P_t(θ) is bounded by the expectation of a modified loss whose empirical counterpart is P̂s(θ) + c(θ): it accounts for the correctly predicted terms on which θ functions the same as f_d, together with all the wrongly predicted terms. Therefore, conventional generalization analysis through uniform convergence applies, and we have, with probability at least 1 − δ,

P_t(θ) ≤ P̂s(θ) + c(θ) + φ(|Θ|, n, δ).

B.3 THEOREM 3.2 AND PROOF

Theorem. With Assumptions A2–A5, and if 1 − ..., we have

c(θ) ≤ D_Θ(P_s, P_t),

where c(θ) = (1/n) Σ_{(x,y)∈(X,Y)∼P̂s} I[θ(x) = y] r(θ, A(f_p, x)) and D_Θ(P_s, P_t) is defined as in (8).

Proof. By definition, g(x) ∈ Θ∆Θ ⟺ g(x) = θ(x) ⊕ θ′(x) for some θ, θ′ ∈ Θ. Together with Lemma 2 and Lemma 3 of (Ben-David et al., 2010), we have the chain of derivations leading to (40).

First line: see Lemma 2 and Lemma 3 of (Ben-David et al., 2010).

Second line: if 1 − ..., f_d ∈ Θ, and we use it as one of the two functions in the symmetric difference.

The fifth line is a result of our assumptions. We now present the details of this argument: the maximum term cannot be 0 unless θ is a constant mapping that maps every sample to 0 (which would contradict A4). Thus, we can rewrite the left-hand term accordingly, and similarly the right-hand term. We recap the definition of d_x: d_x(θ, f_d) = 0 implies I(θ(x) = y), and therefore we can continue to rewrite the left-hand term, and similarly the right-hand term, where z denotes any z ∈ X. Further, because of A5, we obtain the final inequality. Thus, we showed that (40) holds, which concludes our proof.

C ADDITIONAL EXPERIMENTS

C.1 THEORETICAL SUPPORTING EXPERIMENTS

Synthetic Data with Spurious Correlation. We extend the setup in Figure 1 to generate a synthetic dataset to test our methods. We study a binary classification problem over data with n samples and p features, denoted as X ∈ R^{n×p}. For every training and validation sample i, we generate feature j as follows:

x^(i)_j ∼ N(1, 1) w.p. ρ, if 3p/4 < j ≤ p and y^(i) = 1;
x^(i)_j ∼ N(−1, 1) w.p. ρ, if 3p/4 < j ≤ p and y^(i) = 0;
x^(i)_j ∼ N(0, 1) otherwise.

In contrast, testing data are simply sampled with x^(i)_j ∼ N(0, 1). To generate the label for training, validation, and test data, we sample two effect-size vectors β_1 ∈ R^{p/4} and β_2 ∈ R^{p/4}, each coefficient of which is sampled from a Normal distribution. We then generate two intermediate variables r_1 = x_{1,2,...,p/4} β_1 and r_2 = x_{p/4+1,...,p/2} β_2. Then we transform these continuous intermediate variables into binary intermediate variables r̃_1 and r̃_2 via Bernoulli sampling with the outcome of the inverse logit function g^{−1}(·) over the current responses. Finally, the label for sample i is determined as y^(i) = I(r̃_1 = r̃_2), where I is the function that returns 1 if the condition holds and 0 otherwise.

Intuitively, we create a dataset of p features: half of the features are generalizable across the train, validation, and test datasets through a non-linear decision boundary, one-fourth of the features are independent of the label, and the remaining features are spuriously correlated: these features are correlated with the labels in the train and validation sets, but independent of the label in the test set. About ρ · n train and validation samples have the correlated features. Results are reported in Figure 3, where for each setup we ran 3 random seeds and report the mean and standard deviation. We train a vanilla method, the minimum supervision method with different hyperparameters ρ, and an oracle method that uses data augmentation to randomize the previously known spurious features. The results consistently show the advantage of the new method over the vanilla baseline, although it does not yet match the method with prior knowledge.
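A sketch of this generator (the exact label-combination rule and constants are assumptions where the description above is ambiguous):

```python
import numpy as np

def make_synthetic(n, p, rho=0.9, train=True, rng=None):
    """Synthetic data sketch: features p/2..3p/4 are pure noise, features
    3p/4..p are spuriously tied to the label on train/validation data only.
    The label is built from the first half of the features; combining the
    binary intermediates via equality is an assumed choice."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, (n, p))
    q = p // 4
    beta1, beta2 = rng.normal(size=q), rng.normal(size=q)
    # binary intermediates via Bernoulli over the inverse logit
    r1 = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, :q] @ beta1)))
    r2 = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, q:2 * q] @ beta2)))
    y = (r1 == r2).astype(int)  # assumed label rule
    if train:
        # with probability rho, shift the spurious features toward the label
        mask = rng.random(n) < rho
        X[mask, 3 * q:] = rng.normal(2.0 * y[mask, None] - 1.0, 1.0,
                                     (mask.sum(), p - 3 * q))
    return X, y
```

On the train split, the last quarter of the features concentrate around +1 for y = 1 and −1 for y = 0, while on the test split they stay N(0, 1), reproducing the spurious correlation described above.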
We also calculate c(θ) by performing adversarial attacks over the spuriously correlated feature space, and we calculate D_Θ as defined in (8). We compare P̂s(θ) + c(θ) with P̂s(θ) + D_Θ, and the results clearly suggest that c(θ) offers a more accurate assessment of the target error than D_Θ.

