IN-N-OUT: PRE-TRAINING AND SELF-TRAINING USING AUXILIARY INFORMATION FOR OUT-OF-DISTRIBUTION ROBUSTNESS

Abstract

Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model that performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically, across three image and time-series datasets, and theoretically, in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt OOD error; and (ii) using auxiliary information as outputs of auxiliary pre-training tasks improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error.

1. INTRODUCTION

When models are tested on distributions that differ from the training distribution, they typically suffer large drops in performance (Blitzer and Pereira, 2007; Szegedy et al., 2014; Jia and Liang, 2017; AlBadawy et al., 2018; Hendrycks et al., 2019a). For example, in remote sensing, central tasks include predicting poverty, crop type, and land cover from satellite imagery for downstream humanitarian, policy, and environmental applications (Xie et al., 2016; Jean et al., 2016; Wang et al., 2020; Rußwurm et al., 2020). In some developing African countries, labels are scarce due to the lack of economic resources to deploy human workers to conduct expensive surveys (Jean et al., 2016). To make accurate predictions in these countries, we must extrapolate to out-of-distribution (OOD) examples across different geographic terrains and political borders.

We consider a semi-supervised setting with few in-distribution labeled examples and many unlabeled examples from both in- and out-of-distribution (e.g., global satellite imagery). While labels are scarce, auxiliary information is often cheaply available for every input and may provide some signal for the missing labels. Auxiliary information can come from additional data sources (e.g., climate data from other satellites) or be derived from the original input (e.g., background or non-visible spectrum image channels). This auxiliary information is often discarded or left unleveraged, and how best to use it is unclear. One option is to use it directly as input features (aux-inputs); another is to treat it as prediction outputs for an auxiliary task (aux-outputs) in pre-training. Which approach leads to better in-distribution or OOD performance? Aux-inputs provide more features to potentially improve in-distribution performance, and one may hope that this also improves OOD performance.
Indeed, previous results on standard datasets show that improvements in in-distribution accuracy correlate with improvements in OOD accuracy (Recht et al., 2019; Taori et al., 2020; Xie et al., 2020; Santurkar et al., 2020). However, in this paper we find that aux-inputs can introduce more spurious correlations with the labels: as a result, while aux-inputs often improve in-distribution accuracy, they can worsen OOD accuracy. We give examples of this trend on CelebA (Liu et al., 2015) and real-world satellite datasets in Sections 5.2 and 5.3.

Conversely, aux-output methods such as pre-training may improve OOD performance through auxiliary supervision (Caruana, 1997; Weiss et al., 2016; Hendrycks et al., 2019a). Hendrycks et al. (2019b) show that auxiliary self-supervision tasks can improve robustness to synthetic corruptions. In this paper, we find that while aux-outputs improve OOD accuracy, the in-distribution accuracy is worse than with aux-inputs. Thus, we elucidate a tradeoff between in- and out-of-distribution accuracy that occurs when using auxiliary information as inputs or outputs.

To theoretically study how to best use auxiliary information, we extend the multi-task linear regression setting (Du et al., 2020; Tripuraneni et al., 2020) to allow for distribution shifts. We show that auxiliary information helps in-distribution error by providing useful features for predicting the target, but the relationship between the aux-inputs and the target can shift significantly OOD, worsening the OOD error. In contrast, the aux-outputs model first pre-trains on unlabeled data to learn a lower-dimensional representation and then solves the target task in the lower-dimensional space. We prove that the aux-outputs model improves robustness to arbitrary covariate shift compared to not using auxiliary information. Can we do better than using auxiliary information as inputs or outputs alone?
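As a toy numerical illustration of this tradeoff (our own construction, not one of the paper's experiments or its formal linear setting): suppose the auxiliary feature z is spuriously correlated with the label y in-distribution, but the correlation vanishes OOD. A least-squares model using z then beats an x-only model in-distribution and loses to it OOD:

```python
import numpy as np

# Hypothetical construction: in-distribution, z tracks y (spurious correlation);
# OOD, z is independent noise. All names here are illustrative.
rng = np.random.default_rng(0)
n = 2000

def make(split):
    x = rng.normal(size=n)
    y = x + 0.1 * rng.normal(size=n)
    if split == "id":
        z = y + 0.1 * rng.normal(size=n)   # spurious: z ~ y in-distribution
    else:
        z = rng.normal(size=n)             # correlation breaks OOD
    return np.stack([x, z], axis=1), y

X_id, y_id = make("id")
X_ood, y_ood = make("ood")

w_aux = np.linalg.lstsq(X_id, y_id, rcond=None)[0]       # aux-inputs: uses x and z
w_x = np.linalg.lstsq(X_id[:, :1], y_id, rcond=None)[0]  # baseline: uses x only

mse = lambda p, y: float(np.mean((p - y) ** 2))
print(mse(X_id @ w_aux, y_id) < mse(X_id[:, :1] @ w_x, y_id))      # True: better ID
print(mse(X_ood @ w_aux, y_ood) > mse(X_ood[:, :1] @ w_x, y_ood))  # True: worse OOD
```

The aux-inputs model places roughly half its weight on z, which helps while z carries signal about y and actively hurts once that relationship shifts.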
We answer affirmatively by proposing the In-N-Out algorithm to combine the benefits of auxiliary inputs and outputs (Figure 1). In-N-Out first uses an aux-inputs model, which has good in-distribution accuracy, to pseudolabel in-distribution unlabeled data. It then pre-trains a model using aux-outputs and finally fine-tunes this model on the larger training set consisting of labeled and pseudolabeled data. We prove that In-N-Out, which combines self-training and pre-training, further improves both in-distribution and OOD error over the aux-outputs model.

We show empirical results on CelebA and two remote sensing tasks (land cover and cropland prediction) that parallel the theory. On all datasets, In-N-Out improves OOD accuracy and has competitive or better in-distribution accuracy compared to aux-inputs or aux-outputs alone, and it improves over not using auxiliary information by 1-2% in-distribution and 2-3% OOD on the remote sensing tasks. Ablations show that In-N-Out achieves similar improvements over pre-training or self-training alone (up to 5% in-distribution, 1-2% OOD on the remote sensing tasks). We also find that using OOD (rather than in-distribution) unlabeled examples for pre-training is crucial for OOD improvements.
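The three steps can be sketched end-to-end in the linear setting, where every "model" is an ordinary least-squares fit. This is a minimal sketch under our own toy data assumptions (synthetic Gaussian data, linear models, `fit_linear` as a stand-in for training); it shows the data flow of In-N-Out, not the paper's actual architectures:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(A, y):
    """Least-squares fit; stands in for model training in this sketch."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Toy data: labeled ID triples (X, Z, y); unlabeled ID and OOD pairs (X, Z).
d, T, n, m_id, m_ood = 4, 2, 20, 200, 200
X, Z = rng.normal(size=(n, d)), rng.normal(size=(n, T))
y = X @ rng.normal(size=d) + Z @ rng.normal(size=T)
X_id, Z_id = rng.normal(size=(m_id, d)), rng.normal(size=(m_id, T))
X_ood, Z_ood = rng.normal(loc=1.0, size=(m_ood, d)), rng.normal(size=(m_ood, T))

# Step 1: aux-inputs model f([x, z]) -> y on the labeled set, then
# pseudolabel the in-distribution unlabeled examples.
w_in = fit_linear(np.hstack([X, Z]), y)
pseudo_y = np.hstack([X_id, Z_id]) @ w_in

# Step 2: aux-outputs pre-training g(x) -> z on OOD unlabeled data; the
# learned map plays the role of a pre-trained feature extractor.
W_out = fit_linear(X_ood, Z_ood)                 # shape (d, T)

# Step 3: fine-tune on labeled + pseudolabeled data, predicting y from
# the input together with the pre-trained features x @ W_out.
X_all = np.vstack([X, X_id])
y_all = np.concatenate([y, pseudo_y])
feats = np.hstack([X_all, X_all @ W_out])
w_final = fit_linear(feats, y_all)
print(w_final.shape)                             # (6,) i.e. (d + T,)
```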

2. SETUP

Let $x \in \mathbb{R}^d$ be the input (e.g., a satellite image), $y \in \mathbb{R}$ be the target (e.g., crop type), and $z \in \mathbb{R}^T$ be the cheaply obtained auxiliary information, either from additional sources (e.g., climate information) or derived from the original data (e.g., background).

Training data. Let $P_{\text{id}}$ and $P_{\text{ood}}$ denote the underlying distributions of $(x, y, z)$ triples in-distribution and out-of-distribution, respectively. The training data consists of (i) in-distribution labeled data $\{(x_i, y_i, z_i)\}_{i=1}^{n} \sim P_{\text{id}}$, (ii) in-distribution unlabeled data $\{(x_i^{\text{id}}, z_i^{\text{id}})\}_{i=1}^{m_{\text{id}}} \sim P_{\text{id}}$, and (iii) out-of-distribution unlabeled data $\{(x_i^{\text{ood}}, z_i^{\text{ood}})\}_{i=1}^{m_{\text{ood}}} \sim P_{\text{ood}}$.

Goal and risk metrics. Our goal is to learn a model from the input and auxiliary information to the target, $f : \mathbb{R}^d \times \mathbb{R}^T \to \mathbb{R}$. For a loss function $\ell$, the in-distribution population risk of the model $f$ is $R_{\text{id}}(f) = \mathbb{E}_{x,y,z \sim P_{\text{id}}}[\ell(f(x,z), y)]$, and its OOD population risk is $R_{\text{ood}}(f) = \mathbb{E}_{x,y,z \sim P_{\text{ood}}}[\ell(f(x,z), y)]$.
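The two population risks have straightforward empirical analogues over finite samples. A minimal sketch under squared loss, with a hypothetical toy data-generating process (not one of the paper's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n = 5, 3, 500

def empirical_risk(f, X, Z, Y):
    """Empirical analogue of R(f) = E[ell(f(x, z), y)] with squared loss."""
    preds = np.array([f(x, z) for x, z in zip(X, Z)])
    return float(np.mean((preds - Y) ** 2))

def sample(n, shift=0.0):
    # Toy process: y depends linearly on one input and one auxiliary feature.
    X = rng.normal(loc=shift, size=(n, d))
    Z = rng.normal(size=(n, T))
    Y = X[:, 0] + 0.5 * Z[:, 0] + 0.1 * rng.normal(size=n)
    return X, Z, Y

X_id, Z_id, Y_id = sample(n)            # draw from P_id
X_ood, Z_ood, Y_ood = sample(n, 2.0)    # covariate-shifted draw from P_ood

f = lambda x, z: x[0] + 0.5 * z[0]      # a fixed model f(x, z)
print(empirical_risk(f, X_id, Z_id, Y_id))    # ~ 0.01 (the noise variance)
print(empirical_risk(f, X_ood, Z_ood, Y_ood))
```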



Figure 1: A sketch of the In-N-Out algorithm, which consists of three steps: 1) use auxiliary information as input (Aux-in) to achieve good in-distribution performance; 2) use auxiliary information as output in pre-training (Aux-out) to improve OOD performance; 3) fine-tune the pre-trained model from step 2 on the labeled data and on in-distribution unlabeled data with pseudolabels generated in step 1, to improve both in- and out-of-distribution performance.

