ID AND OOD PERFORMANCE ARE SOMETIMES INVERSELY CORRELATED ON REAL-WORLD DATASETS

Abstract

Context. Several studies have empirically compared the in-distribution (ID) and out-of-distribution (OOD) performance of various models. They report frequent positive correlations on benchmarks in computer vision and NLP. Surprisingly, they never observe inverse correlations, which would suggest necessary trade-offs. This matters for determining whether ID performance can serve as a proxy for OOD generalization.

Findings. This paper shows that inverse correlations between ID and OOD performance do happen on real-world benchmarks. They may have been missed in past studies because of a biased selection of models. We show an example on the WILDS-Camelyon17 dataset, using models from multiple training epochs and random seeds. Our observations are particularly striking with models trained with a regularizer that diversifies the solutions to the ERM objective (Teney et al., 2022a).

Implications. We nuance recommendations and conclusions made in past studies.
• High OOD performance may sometimes require trading off ID performance.
• Focusing on ID performance alone may not lead to optimal OOD performance: it can lead to diminishing and eventually negative returns in OOD performance.
• Our example is a reminder that empirical studies only chart the regimes achievable with existing methods: care is warranted in deriving prescriptive recommendations.

1. INTRODUCTION

Past observations. This paper complements existing studies that empirically compare in-distribution (ID) and out-of-distribution[1] (OOD) performance of deep learning models (Andreassen et al., 2021; Djolonga et al., 2021; Miller et al., 2021; Mania & Sra, 2020; Miller et al., 2020; Taori et al., 2020; Wenzel et al., 2022). It has long been known that models applied to OOD data suffer a drop in performance, e.g. in classification accuracy. The above studies show that, despite this gap, ID and OOD performance are often positively correlated[2] across models on benchmarks in computer vision (Miller et al., 2021) and NLP (Miller et al., 2020).

Past explanations. Frequent positive correlations are surprising because nothing forbids opposite, inverse ones. Indeed, ID and OOD data contain different associations between labels and features. One could imagine, e.g., that an image background is associated with class C1 in ID data and with class C2 in OOD data. The more a model relies on the presence of this background, the better its ID performance but the worse its OOD performance, resulting in an inverse correlation. The absence of observed inverse correlations has been explained with the possibility that real-world benchmarks might contain only mild distribution shifts (Mania & Sra, 2020). We will show that such observations can also be an artefact of study design.

A recent large-scale study. Wenzel et al. (2022) show that not all datasets display a clear positive correlation.
The authors observe other patterns that sometimes reveal underspecification (D'Amour et al., 2020; Teney et al., 2022b; Lee et al., 2022), or severe shifts that prevent any training / test transfer. Surprisingly, they never observe inverse correlations: "We did not observe any trade-off between accuracy and robustness, where more accurate models would overfit to spurious features that do not generalize." (Wenzel et al., 2022) On the contrary, we do observe such cases, and we showcase one on a dataset from the above study.

Explaining inverse correlations. We name the underlying cause misspecification, by extension of underspecification, which was previously used to explain why models with similar ID performance can vary in OOD performance (D'Amour et al., 2020; Teney et al., 2022b; Lee et al., 2022). In cases of misspecification, the standard ERM objective (empirical risk minimization), which drives ID performance, conflicts with the goal of OOD performance. ID and OOD metrics can then vary independently of, or inversely to, one another. In Section 5, we present a minimal theoretical example that illustrates how an inverse correlation pattern originates from the presence of both robust and spurious features in the data. In Section 6, we show that different patterns of ID / OOD performance occur with different magnitudes of distribution shift.
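The mechanism behind an inverse correlation can be sketched numerically. The following is a minimal illustration (not the paper's Section 5 construction; all accuracies are assumed values chosen for exposition): a robust feature predicts the label equally well ID and OOD, while a spurious feature is highly predictive ID but anti-correlated with the label OOD. Across a hypothetical family of models parameterized by their reliance `w` on the spurious feature, ID accuracy rises while OOD accuracy falls.

```python
# Minimal sketch, under assumed feature accuracies, of how a spurious
# feature induces an inverse ID/OOD correlation across a model family.
P_ROBUST = 0.75      # robust feature: agrees with the label both ID and OOD
P_SPUR_ID = 0.95     # spurious feature: highly predictive in-distribution...
P_SPUR_OOD = 0.05    # ...but anti-correlated out-of-distribution

def accuracy(w, p_spur):
    """Model that follows the spurious feature with prob. w, robust otherwise."""
    return w * p_spur + (1 - w) * P_ROBUST

models = [i / 10 for i in range(11)]   # degrees of reliance on the spurious feature
id_acc = [accuracy(w, P_SPUR_ID) for w in models]
ood_acc = [accuracy(w, P_SPUR_OOD) for w in models]

# The more a model relies on the spurious feature, the better its ID
# accuracy and the worse its OOD accuracy: an inverse correlation.
for w, a_id, a_ood in zip(models, id_acc, ood_acc):
    print(f"w={w:.1f}  ID={a_id:.2f}  OOD={a_ood:.2f}")
```

Under this toy model, selecting for ID accuracy alone drives `w` toward 1 and OOD accuracy toward its minimum, which is exactly the trade-off the ERM objective cannot see.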

Summary of contributions.

• An empirical examination of ID vs. OOD performance on the WILDS-Camelyon17 dataset (Koh et al., 2021) that shows an inverse correlation pattern conflicting with past evidence (Section 3).
• An explanation and empirical verification that past studies could miss such patterns because of a biased sampling of models (Section 4).
• A theoretical analysis showing when inverse correlation patterns can occur (Sections 5-6).
• A revision of conclusions and recommendations made in past studies (Section 7).

2. PREVIOUSLY-OBSERVED PATTERNS OF ID VS. OOD PERFORMANCE

Past studies conclude that ID and OOD performance tend to vary jointly across models on many real-world datasets (Djolonga et al., 2021; Miller et al., 2021; Taori et al., 2020). Miller et al. (2021) report an almost-systematic linear correlation[3] between probit-scaled ID and OOD accuracies. Mania & Sra (2020) explain this trend with the fact that real-world benchmarks contain only mild distribution shifts.[4] Andreassen et al. (2021) find that pretrained models perform "above the linear trend" in the early stages of fine-tuning: their OOD accuracy rises more quickly than their ID accuracy early on, even though the final accuracies agree with a linear trend. Most recently, the large-scale study of Wenzel et al. (2022) is more nuanced: the authors observe a linear trend only on some datasets. Their setup consists of fine-tuning an ImageNet-pretrained model on a chosen dataset and evaluating it on matching ID and OOD test sets. They repeat the procedure with a variety of datasets, architectures, and implementation options such as data augmentations. The scatter plots of ID / OOD accuracy in Wenzel et al. (2022) show four typical patterns (Figure 2).
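The probit scaling behind the "linear trend" can be made concrete. The sketch below (the ID/OOD accuracy pairs are hypothetical, and this is not the cited studies' code) maps accuracies through the inverse CDF of a standard normal, the transform under which the cited studies fit a line:

```python
# Sketch of probit scaling: accuracies are mapped through the inverse
# standard-normal CDF before checking for a linear ID/OOD relationship.
from statistics import NormalDist

def probit(acc):
    """Inverse standard-normal CDF, applied to an accuracy in (0, 1)."""
    return NormalDist().inv_cdf(acc)

# Hypothetical (ID accuracy, OOD accuracy) pairs for three models.
pairs = [(0.80, 0.55), (0.90, 0.70), (0.95, 0.80)]
scaled = [(probit(i), probit(o)) for i, o in pairs]
```

The transform is monotonic, so it preserves model rankings; it only stretches the accuracy axis near 0 and 1, which is what lets a curved relationship between raw accuracies appear linear.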



Footnotes:
[1] We use "OOD" to refer to test data exhibiting covariate shift (Shimodaira, 2000) w.r.t. the training data.
[2] We use "correlation" to refer to both linear and non-linear relationships.
[3] The "linear trend" is not truly linear: it applies to probit-scaled accuracies (a non-linear transform).
[4] Mania & Sra (2020) explain the linear trend with (1) certain data points having similar probabilities of occurring in ID and OOD data, and (2) the probability being low that a model classifies some points correctly that a higher-accuracy model classifies incorrectly.



Figure 1: Past studies suggest that positive correlations between ID / OOD performance are ubiquitous. This paper shows, with a counterexample, that inverse correlations are possible and can be accidentally overlooked. The possible need for an ID / OOD trade-off is thus not merely theoretical and should be envisioned, e.g. preventing blind reliance on ID performance for model selection.

