IN-N-OUT: PRE-TRAINING AND SELF-TRAINING USING AUXILIARY INFORMATION FOR OUT-OF-DISTRIBUTION ROBUSTNESS

Abstract

Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model that performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt OOD error; but (ii) using auxiliary information as outputs of auxiliary pre-training tasks improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error.
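The three-step procedure above can be sketched in the multi-task linear regression setting the paper analyzes. The sketch below is illustrative only: the data-generating process, dimensions, and least-squares implementation are our assumptions for exposition, not the paper's actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, Y):
    """Ordinary least-squares fit; returns the weight matrix/vector."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Toy multi-task linear setting (assumed for illustration): the auxiliary
# info z and the label y both depend on x through a shared structure B.
d, k = 5, 3
B = rng.normal(size=(d, k))      # x -> z map
alpha = rng.normal(size=k)
theta = B @ alpha                # y shares the structure that produces z

def make_data(n, shift=0.0):
    x = rng.normal(loc=shift, size=(n, d))
    z = x @ B + 0.1 * rng.normal(size=(n, k))
    y = x @ theta + 0.1 * rng.normal(size=n)
    return x, z, y

x_lab, z_lab, y_lab = make_data(30)            # few labeled in-distribution
x_unl, z_unl, _ = make_data(500)               # many unlabeled in-distribution
x_ood, z_ood, _ = make_data(500, shift=2.0)    # unlabeled OOD

# Step 1: train an aux-inputs model on [x, z] and pseudolabel the
# unlabeled in-distribution inputs.
w_in = fit_linear(np.hstack([x_lab, z_lab]), y_lab)
pseudo_y = np.hstack([x_unl, z_unl]) @ w_in

# Step 2: pre-train on aux-outputs: predict z from x on all unlabeled
# data (in-distribution and OOD) to learn the representation x @ W.
W = fit_linear(np.vstack([x_unl, x_ood]), np.vstack([z_unl, z_ood]))

# Step 3: fine-tune on the pre-trained representation using the
# pseudolabels together with the original labels (self-training).
H = np.vstack([x_lab, x_unl]) @ W
w_out = fit_linear(H, np.hstack([y_lab, pseudo_y]))

def predict(x):
    return (x @ W) @ w_out
```

Because the final predictor only uses the representation learned from both in-distribution and OOD inputs, it avoids relying on the auxiliary inputs at test time while still benefiting from their signal through the pseudolabels.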

1. INTRODUCTION

When models are tested on distributions that are different from the training distribution, they typically suffer large drops in performance (Blitzer and Pereira, 2007; Szegedy et al., 2014; Jia and Liang, 2017; AlBadawy et al., 2018; Hendrycks et al., 2019a). For example, in remote sensing, central tasks include predicting poverty, crop type, and land cover from satellite imagery for downstream humanitarian, policy, and environmental applications (Xie et al., 2016; Jean et al., 2016; Wang et al., 2020; Rußwurm et al., 2020). In some developing African countries, labels are scarce due to the lack of economic resources to deploy human workers to conduct expensive surveys (Jean et al., 2016). To make accurate predictions in these countries, we must extrapolate to out-of-distribution (OOD) examples across different geographic terrains and political borders.

We consider a semi-supervised setting with few in-distribution labeled examples and many unlabeled examples from both in- and out-of-distribution (e.g., global satellite imagery). While labels are scarce, auxiliary information is often cheaply available for every input and may provide some signal for the missing labels. Auxiliary information can come from additional data sources (e.g., climate data from other satellites) or be derived from the original input (e.g., background or non-visible spectrum image channels). This auxiliary information is often discarded or not leveraged, and how best to use it is unclear. One way is to use it directly as input features (aux-inputs); another is to treat it as prediction outputs for an auxiliary task (aux-outputs) in pre-training. Which approach leads to better in-distribution or OOD performance?

Aux-inputs provide more features to potentially improve in-distribution performance, and one may hope that this also improves OOD performance.
Indeed, previous results on standard datasets show that improvements in in-distribution accuracy correlate with improvements in OOD accuracy (Recht et al., 2019; Taori et al., 2020; Xie et al., 2020; Santurkar et al., 2020). However, in this paper we find that aux-inputs can introduce more spurious correlations with the labels: as a result, while aux-inputs often improve in-distribution accuracy, they can worsen OOD accuracy. We give examples of this trend on CelebA (Liu et al., 2015) and real-world satellite datasets in Sections 5.2 and 5.3. Conversely, aux-output methods such as pre-training may improve OOD performance through auxiliary supervision (Caruana, 1997; Weiss et al., 2016; Hendrycks et al., 2019a). Hendrycks et al.

