RLSBENCH: A LARGE-SCALE EMPIRICAL STUDY OF DOMAIN ADAPTATION UNDER RELAXED LABEL SHIFT

Abstract

Despite the emergence of principled methods for domain adaptation under label shift (where only the class balance changes), their sensitivity to natural-seeming covariate shifts remains underexplored. Meanwhile, popular deep domain adaptation heuristics, despite showing promise on benchmark datasets, tend to falter when faced with shifts in the class balance. Moreover, inconsistencies among relevant papers in evaluation criteria, datasets, and baselines make it difficult to assess the state of the field. In this paper, we introduce RLSBENCH, a large-scale benchmark for such relaxed label shift settings, consisting of 14 datasets across vision, tabular, and language modalities and spanning more than 500 distribution shift pairs with varying class proportions. We evaluate 13 popular domain adaptation methods, demonstrating a more widespread susceptibility to failure under extreme shifts in class proportions than was previously known. We develop an effective meta-algorithm, compatible with most deep domain adaptation heuristics, that consists of the following two steps: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) the target label distribution. In our benchmark, the meta-algorithm improves existing domain adaptation heuristics, often by 2-10 accuracy points, when label distribution shifts are extreme, and has minimal (i.e., <0.5%) to no effect on accuracy when the label distribution does not shift. We hope that these findings and the availability of RLSBENCH will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings.

1. INTRODUCTION

Real-world deployments of machine learning models are typically characterized by distribution shift, where data encountered in production exhibit statistical differences from the available training data (Quinonero-Candela et al., 2008; Torralba & Efros, 2011; Koh et al., 2021). Because continually labeling data can be prohibitively expensive, researchers have focused on the unsupervised domain adaptation (DA) setting, where only labeled data sampled from the source distribution and unlabeled data from the target distribution are available for training. Absent further assumptions, the DA problem is well known to be underspecified (Ben-David et al., 2010b), and thus no method is universally applicable. Researchers have responded to these challenges in several ways. One approach is to investigate additional assumptions that render the problem well-posed. Popular examples include covariate shift and label shift, for which identification strategies and principled methods exist whenever the source and target distributions have overlapping support (Shimodaira, 2000; Schölkopf et al., 2012; Gretton et al., 2009). Under label shift in particular, recent research has produced effective methods that are applicable in deep learning regimes and yield both consistent estimates of the target label marginal and principled ways to update the resulting classifier (Lipton et al., 2018; Alexandari et al., 2021; Azizzadenesheli et al., 2019; Garg et al., 2020). However, these assumptions are typically violated, at least to some degree, in practice. Even in archetypal cases like shifts in disease prevalence (Lipton et al., 2018), the label shift assumption can be violated. For example, over the course of the COVID-19 pandemic, changes in disease positivity have been coupled with shifts in the age distribution of the infected and subtle mutations of the virus itself.
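To make the label marginal estimation step concrete, a moment-matching estimator in the spirit of black-box shift estimation (Lipton et al., 2018) can be sketched as follows. This is an illustrative sketch, not the implementation used by any cited method; the function and variable names are ours.

```python
import numpy as np

def estimate_target_label_marginal(source_preds, source_labels, target_preds, k):
    """Estimate p_t(y) from hard classifier predictions via confusion-matrix
    inversion. Valid under the label shift assumption, i.e., p(x|y) is fixed.

    source_preds / source_labels: predicted / true classes on labeled source data.
    target_preds: predicted classes on unlabeled target data.
    k: number of classes.
    """
    # C[i, j] = p_s(yhat = i, y = j), estimated on labeled source data.
    C = np.zeros((k, k))
    for yhat, y in zip(source_preds, source_labels):
        C[yhat, y] += 1
    C /= len(source_labels)

    # mu[i] = p_t(yhat = i), estimated on unlabeled target data.
    mu = np.bincount(target_preds, minlength=k) / len(target_preds)

    # Under label shift, mu = C @ p_t(y); solve, then project to the simplex.
    q = np.linalg.solve(C, mu)
    q = np.clip(q, 0, None)
    return q / q.sum()
```

The key property is consistency: as long as the confusion matrix C is invertible, the estimate converges to the true target label marginal, no matter how inaccurate the underlying classifier is.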
A complementary line of research focuses on constructing benchmark datasets for evaluating methods, in the hopes of finding heuristics that, for the kinds of problems that arise in practice, tend to incorporate the unlabeled target data profitably. Examples of such benchmarks include Office-Home (Venkateswara et al., 2017), DomainNet (Peng et al., 2019), and WILDS (Sagawa et al., 2021). However, most academic benchmarks exhibit little or no shift in the label distribution p(y). Consequently, benchmark-driven research has produced a variety of heuristic methods (Ganin et al., 2016; Sohn et al., 2020; Wang et al., 2021; Li et al., 2016) that, despite yielding gains in benchmark performance, tend to break when p(y) shifts. While this has previously been shown for domain-adversarial methods (Wu et al., 2019; Zhao et al., 2019), we show that the problem is more widespread than previously known. Several recent papers attempt to address shifts in the label distribution compounded by natural variations in p(x|y) (Tan et al., 2020; Tachet des Combes et al., 2020; Prabhu et al., 2021). However, the experimental evaluations are hard to compare across papers owing to discrepancies in how shifts in p(y) are simulated and in the choice of evaluation metrics. Moreover, many methods violate the unsupervised contract by peeking at target validation performance during model selection and hyperparameter tuning. In short, there is a paucity of comprehensive and fair comparisons between DA methods for settings with shifts in the label distribution. In this paper, we develop RLSBENCH, a standardized test bed of relaxed label shift settings, where p(y) can shift arbitrarily and the class conditionals p(x|y) can shift in seemingly natural ways (following the popular DA benchmarks).
We evaluate a collection of popular DA methods based on domain-invariant representation learning, self-training, and test-time adaptation across 14 multi-domain datasets spanning vision, natural language processing (NLP), and tabular modalities. The different domains in each dataset present different shifts in p(x|y). Since these datasets exhibit minor to no shift in the label marginal, we simulate shifts in the target label marginal via stratified sampling with varying severity. Overall, we obtain 560 different source and target distribution shift pairs and train more than 30k models in our testbed. Based on our experiments on the RLSBENCH suite, we make several findings. First, we observe that while popular DA methods often improve over a source-only classifier absent shift in the target label distribution, their performance tends to degrade, dropping below that of source-only classifiers under severe shifts in the target label marginal. Next, we develop a meta-algorithm with two simple corrections: (i) re-sampling the data to balance the source and pseudo-balance the target; and (ii) re-weighting the final classifier using an estimate of the target label marginal. We observe that in these relaxed label shift environments, the performance of existing DA methods (e.g., CDANN, FixMatch, and BN-adapt) paired with our meta-algorithm tends to improve over a source-only classifier. In contrast, existing methods specifically proposed for relaxed label shift (e.g., IW-CDANN and SENTRY) often fail to improve over a source-only classifier and significantly underperform existing DA methods paired with our meta-algorithm. Our findings underscore the importance of fair comparisons to avoid a false sense of scientific progress in relaxed label shift scenarios. Moreover, we hope that the RLSBENCH testbed and our meta-algorithm (which can be paired with any DA method) provide a framework for rigorous and reproducible future research on relaxed label shift.
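The two corrections above can be sketched in a few lines. This is a minimal illustration of the idea rather than the benchmarked implementation; the function and variable names are ours.

```python
import numpy as np

def pseudo_balanced_indices(labels, rng):
    """Correction (i): re-sample indices so each class appears equally often
    in an epoch. On source data, `labels` are ground-truth labels; on target
    data, they would be pseudo-labels produced by the current model."""
    classes, counts = np.unique(labels, return_counts=True)
    per_class = counts.max()  # upsample smaller classes with replacement
    idx = [rng.choice(np.where(labels == c)[0], size=per_class, replace=True)
           for c in classes]
    return rng.permutation(np.concatenate(idx))

def reweight_classifier(probs, target_marginal, source_marginal):
    """Correction (ii): re-weight each row of predictive probabilities by
    (an estimate of) p_t(y) / p_s(y) and renormalize."""
    adjusted = probs * (target_marginal / source_marginal)
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

Correction (i) is applied to each training epoch of the underlying DA method, while correction (ii) is a post-hoc adjustment of the final classifier's outputs, so the pair composes with most existing DA heuristics without modifying their internals.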

2. PRELIMINARIES AND PRIOR WORK

We first set up notation and formally define the problem. Let X be the input space and Y = {1, 2, . . . , k} the output space. Let P_s, P_t : X × Y → [0, 1] be the source and target distributions, and let p_s and p_t denote the corresponding probability density (or mass) functions. Unlike the standard supervised setting, in unsupervised DA we possess labeled source data {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} and unlabeled target data {x_{n+1}, x_{n+2}, . . . , x_{n+m}}. By f : X → Δ^{k-1}, we denote a predictor function which predicts ŷ = argmax_y f_y(x) on an input x. For a vector v, we use v_y to access the element at index y. In the traditional label shift setting, one assumes that p(x|y) does not change but that p(y) can. Under label shift, two challenges arise: (i) estimate the target label marginal p_t(y); and (ii) train a classifier f to maximize performance on the target domain. This paper focuses on the relaxed label shift setting. In particular, we assume that the label distribution can shift from source to target arbitrarily but that p(x|y) varies between source and target in some comparatively restricted way (e.g., shifts arising naturally in the real world, like ImageNet (Russakovsky et al., 2015) to ImageNetV2 (Recht et al., 2019)). Mathematically, we assume a divergence-based restriction on p(x|y): for some small ε > 0 and distributional distance D, we have max_y D(p_s(x|y), p_t(x|y)) ≤ ε, while the label marginal p(y) may shift arbitrarily. We discuss several precise instantiations in App. F. However, in

