The Value of Out-of-distribution Data

Abstract

More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity, such as intra-class variability, but also in the form of temporal shifts or concept drift. We demonstrate a counter-intuitive phenomenon for such problems: the generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization, but if the number of OOD samples is beyond a threshold, then the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect that data augmentation, hyperparameter optimization, and pre-training have on this behavior.

1. Introduction

We procure more data to improve generalization. The central assumption behind doing so, one baked into learning theory (Vapnik, 1998), is that this data comes from the desired task. But this may not always be the case. Real data is often heterogeneous (Quinonero-Candela et al., 2008); this heterogeneity can arise from nuisances, i.e., variables that do not inform the task at hand (say, classification), e.g., geometric nuisances such as viewpoint, or semantic ones such as chairs of different shapes. Datasets curated at Internet scale (Srivastava et al., 2022) may also be susceptible to erroneous annotations (resulting in label noise) (Frénay & Verleysen, 2013) or data poisoning attacks (Steinhardt et al., 2017). Such "out-of-distribution" (OOD) data, i.e., data that does not come from our desired task, can be detrimental to the performance of the learned model. In this work, we study how OOD samples within datasets impact the generalization error on the desired task.

Our contributions are as follows. We demonstrate a counter-intuitive phenomenon: generalization error on the target task is non-monotonic in the number of OOD samples. In other words, there exist situations in which a small number of OOD samples improves the generalization error, but if the number of OOD samples is beyond a threshold, then the generalization error deteriorates. This phenomenon is counter-intuitive because one would expect the generalization error of the target task to deteriorate or improve monotonically upon the introduction of OOD samples. Our investigation shows that this threshold differs across tasks and across neural architectures. In Remark 2, we provide an intuitive explanation for this non-monotonic behavior using the bias-variance trade-off. We present empirical evidence for the presence of non-monotonic trends in target generalization error on many popular datasets, ranging from MNIST and CIFAR-10 to PACS and DomainNet.
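The flavor of the synthetic linear-classifier experiments can be sketched as follows: train a linear classifier on a small target sample pooled with a varying number of OOD samples drawn from a shifted distribution, and measure its error on held-out target data. The distributions, sample sizes, and hyperparameters below are illustrative assumptions for a minimal sketch, not the paper's exact experimental setup.

```python
# Sketch: target-task error of a linear classifier as a function of the
# number of OOD samples pooled into the training set (assumed setup).
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n, shift):
    """Two Gaussian classes in 2D; `shift` moves both class means (OOD)."""
    y = rng.integers(0, 2, n)
    means = np.where(y[:, None] == 1, 1.0, -1.0) + shift
    X = rng.normal(means, 1.0, (n, 2))
    return X, y

def train_logreg(X, y, steps=500, lr=0.1):
    """Plain logistic regression fit by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def error(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((Xb @ w > 0) != y))

X_tr, y_tr = sample_task(50, shift=0.0)    # small target training set
X_te, y_te = sample_task(5000, shift=0.0)  # held-out target test set

errs = []
for n_ood in [0, 25, 100, 400, 1600]:
    X_o, y_o = sample_task(n_ood, shift=1.5)  # OOD: shifted class means
    X = np.vstack([X_tr, X_o])
    y = np.concatenate([y_tr, y_o])
    errs.append(error(train_logreg(X, y), X_te, y_te))
print(errs)  # target error vs. number of OOD samples
```

Depending on the shift and the sample sizes, the error curve can first dip and then rise, which is the non-monotonic behavior described above.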
OOD samples within a curated dataset can lead to worse generalization error on the task for which the dataset was curated. We show that when the OOD samples in the dataset are unknown, strategies such as data augmentation, hyperparameter optimization, and pre-training are not effective in eliminating the adverse impact of OOD data. We develop an algorithmic procedure for training on the target task that is resilient to OOD data. If we know which samples within the dataset are OOD, e.g., using a two-sample test to check for changes in the distribution (Gretton et al., 2012), then we can mitigate the non-monotonic nature of the generalization error by ignoring the OOD samples. We show how one can do better: using
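One standard instance of such a two-sample test is the kernel maximum mean discrepancy (MMD) of Gretton et al. (2012). Below is a minimal sketch using a Gaussian kernel with a permutation test for calibration; the bandwidth, sample sizes, and permutation count are illustrative assumptions, not choices made in this work.

```python
# Sketch: MMD two-sample test (Gretton et al., 2012) with a Gaussian
# kernel and a permutation-based p-value. Parameters are assumptions.
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(X, Y, bandwidth):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_kernel(X, Y, bandwidth).mean())

def permutation_pvalue(X, Y, bandwidth=1.0, n_perm=200, seed=0):
    """p-value for H0: X and Y are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, bandwidth)
    Z = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        count += mmd2(Z[idx[:n]], Z[idx[n:]], bandwidth) >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, (100, 2))
ood = rng.normal(2.0, 1.0, (100, 2))   # clearly shifted distribution
same = rng.normal(0.0, 1.0, (100, 2))  # same distribution as target

p_ood = permutation_pvalue(target, ood)    # expected to be small
p_same = permutation_pvalue(target, same)  # expected to be larger
print(p_ood, p_same)
```

Samples flagged as OOD by such a test can then be ignored, or, as the abstract notes, down-weighted in a weighted objective rather than discarded outright.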

