The Value of Out-of-distribution Data

Abstract

More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity such as intra-class variability but also in the form of temporal shifts or concept drifts. We demonstrate a counter-intuitive phenomenon for such problems: generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization but if the number of OOD samples is beyond a threshold, then the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect data augmentation, hyperparameter optimization, and pre-training have on this behavior.

1. Introduction

We procure more data to improve generalization. The central assumption behind doing so, one that is baked into learning theory (Vapnik, 1998), is that this data comes from the desired task. But this may not always be the case. Real data is often heterogeneous (Quinonero-Candela et al., 2008). This heterogeneity can arise from nuisances, i.e., variables that do not inform the task at hand (say, classification), e.g., geometric nuisances such as viewpoint, or semantic ones such as chairs of different shapes. Datasets curated at Internet scale (Srivastava et al., 2022) may also be susceptible to erroneous annotations (resulting in label noise) (Frénay & Verleysen, 2013) or data poisoning attacks (Steinhardt et al., 2017). Such "out-of-distribution" (OOD) data, i.e., data that does not come from our desired task, can be detrimental to the performance of the learned model. In this work, we study how OOD samples within datasets impact the generalization error on the desired task.

Our contributions are as follows. We demonstrate a counter-intuitive phenomenon: generalization error on the target task is non-monotonic in the number of OOD samples. In other words, there exist situations when a small number of OOD samples improves the generalization error, but if the number of OOD samples exceeds a threshold, then the generalization error deteriorates. This phenomenon is counter-intuitive because one would expect the generalization error of the target task to deteriorate or improve monotonically upon the introduction of OOD samples. Our investigation shows that the threshold is different for different tasks and different neural architectures. In Remark 2, we provide an intuitive explanation for this non-monotonic behavior using the bias-variance trade-off. We present empirical evidence for the presence of non-monotonic trends in target generalization error in many popular datasets, ranging from MNIST and CIFAR-10 to PACS and DomainNet.
OOD samples within a curated dataset could lead to worse generalization error on the task for which the dataset was curated. We show that when the OOD samples in the dataset are unknown, strategies such as data augmentation, hyperparameter optimization, and pre-training are not effective in eliminating the adverse impact of OOD data. We develop an algorithmic procedure for training on the target task that is resilient to OOD data. If we know which samples within the dataset are OOD, e.g., using a two-sample test to check for changes in the distribution (Gretton et al., 2012), then we could mitigate the non-monotonic behavior of the generalization error simply by ignoring the OOD samples. We show how one can do better: using a weighted objective between the target and OOD samples, we can ensure that the generalization error on the target task decreases monotonically with the number of OOD samples. We empirically demonstrate the utility of this weighted objective on a variety of problems.

2. Generalization error is non-monotonic in the number of OOD samples

We define a task P as a joint distribution over the input domain X and the output domain Y. We model the heterogeneity in the dataset as two distributions: n samples drawn from a target task P_t and m samples drawn from an out-of-distribution (OOD) task P_o. We would like to minimize the generalization error e_t(h) = E_{(x,y)∼P_t}[h(x) ≠ y] on the target task. In order to do so, we may find a hypothesis that minimizes the empirical loss

ê(h) = (1 / (n + m)) Σ_{i=1}^{n+m} ℓ(h(x_i), y_i),    (1)

using the dataset {(x_i, y_i)}_{i=1}^{n+m}; here ℓ measures the mismatch between the prediction h(x_i) and the label y_i. If P_t = P_o, then e_t(h) − ê(h) = O((n + m)^{−1/2}) (Smola & Schölkopf, 1998). But if P_t ≠ P_o, then we should expect that the error on P_t of a hypothesis obtained by minimizing the average empirical loss can be sub-optimal, especially when the number of OOD samples m ≫ n.
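When the OOD samples are identifiable, the weighted objective mentioned above can be sketched as a convex combination of the two per-task empirical losses. The weight α, the helper names, and the specific zero-one loss below are illustrative assumptions, not the paper's exact formulation; a minimal sketch:

```python
import numpy as np

def empirical_loss(h, X, y, loss):
    """Average loss of hypothesis h over a finite sample."""
    return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))

def weighted_objective(h, target, ood, alpha, loss):
    """Convex combination of target and OOD empirical losses.

    alpha = 1 ignores the OOD samples entirely, while
    alpha = n / (n + m) recovers the pooled average loss in Eq. (1).
    """
    Xt, yt = target
    Xo, yo = ood
    return alpha * empirical_loss(h, Xt, yt, loss) + \
           (1.0 - alpha) * empirical_loss(h, Xo, yo, loss)
```

Choosing α between these two extremes trades off the variance reduction from the extra OOD samples against the bias they introduce on the target task.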

2.1. An example using Fisher's Linear Discriminant

Consider a binary classification problem with one-dimensional inputs in Fig. 1. Target samples are drawn from a Gaussian mixture model (with means {−µ, µ} for the two classes) and OOD samples are drawn from a Gaussian mixture with means {−µ + ∆, µ + ∆}; see also Appendix A.1. Fisher's linear discriminant (FLD) is a linear classifier for such binary classification problems. It computes

ĥ(x) = 1 if ω⊤x > c, and 0 otherwise,

where ω is a projection vector that acts as a feature extractor and c is a threshold that performs one-dimensional discrimination between the two classes. FLD assumes that the class-conditional density of each class is a multivariate Gaussian distribution with the same covariance structure. We provide a detailed account of FLD in Appendix A.2.
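Under the shared-covariance Gaussian assumption, an FLD fit can be sketched as follows. The function names are ours, the pooled covariance is a simple average of the two class covariances, and a small ridge term is added for numerical stability; this is a sketch, not the paper's exact implementation:

```python
import numpy as np

def fit_fld(X, y):
    """Fisher's linear discriminant with a shared covariance estimate.

    Returns (w, c): predict class 1 when w @ x > c.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Shared covariance (simple average of the class covariances),
    # regularized for numerical stability.
    S = 0.5 * (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
    S = np.atleast_2d(S) + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(S, mu1 - mu0)       # projection vector
    c = w @ (mu0 + mu1) / 2.0               # threshold between projected means
    return w, c

def predict_fld(w, c, X):
    return (X @ w > c).astype(int)
```

For one-dimensional inputs, ω reduces to a scalar and the rule amounts to thresholding x halfway between the two estimated class means.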



Figure 1: Left: A picture of synthetic target and OOD tasks. Middle: A schematic of the Gaussian mixture model corresponding to the target task (top) and the OOD samples (bottom). The OOD sample size (m = 28) at which the target generalization error is minimized at ∆ = 1.6 is indicated at the top. Right: For n = 100, we plot the generalization error of FLD on the target task as a function of the ratio of OOD and target samples m/n, for different types of OOD samples corresponding to different values of ∆. This plot uses the analytical expression for the generalization error in (2); see Appendix A.6 for a numerical simulation study.

For small values of ∆, when the two tasks are similar to each other, the generalization error e_t(h) decreases monotonically. However, beyond a certain value of ∆, the generalization error is non-monotonic in the number of OOD samples. The optimal value of m/n which leads to the best generalization error is a function of the relatedness between the two tasks, as governed by ∆ in this example. This non-monotonic behavior can be explained in terms of a bias-variance tradeoff with respect to the target task: a large number of OOD samples reduces the variance but also results in a bias with respect to the optimal hypothesis of the target task.
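This behavior can be approximated numerically. The sketch below fits a one-dimensional midpoint threshold (the 1-D specialization of FLD) on pooled target and OOD samples, as in Eq. (1), and evaluates the exact target error of the resulting threshold; the parameter values and function names are our illustrative choices, not the paper's exact simulation:

```python
import numpy as np
from math import erf, sqrt

def target_error(c, mu=1.0):
    """Exact target error of the rule 1[x > c] when the target classes
    are N(-mu, 1) and N(mu, 1) with equal priors."""
    Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return 0.5 * (1.0 - Phi(c + mu)) + 0.5 * Phi(c - mu)

def simulate(n, m, delta, mu=1.0, trials=100, seed=0):
    """Average target error of a midpoint-threshold classifier fit on
    n target samples pooled with m OOD samples (means shifted by delta)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        # Per-class pooled samples: target first, then shifted OOD.
        x0 = np.concatenate([rng.normal(-mu, 1.0, n // 2),
                             rng.normal(-mu + delta, 1.0, m // 2)])
        x1 = np.concatenate([rng.normal(mu, 1.0, n - n // 2),
                             rng.normal(mu + delta, 1.0, m - m // 2)])
        c = (x0.mean() + x1.mean()) / 2.0   # 1-D FLD threshold
        errs.append(target_error(c, mu))
    return float(np.mean(errs))
```

Sweeping m for fixed n and ∆ traces curves of the kind shown on the right of Fig. 1: for large ∆, the error can dip at small m before climbing as the bias from the OOD shift dominates the variance reduction.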

