RLSBENCH: A LARGE-SCALE EMPIRICAL STUDY OF DOMAIN ADAPTATION UNDER RELAXED LABEL SHIFT

Abstract

Despite the emergence of principled methods for domain adaptation under label shift (where only the class balance changes), the sensitivity of these methods to natural-seeming covariate shifts remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics, despite showing promise on benchmark datasets, tend to falter when faced with shifts in the class balance. Moreover, it is difficult to assess the state of the field owing to inconsistencies among relevant papers in evaluation criteria, datasets, and baselines. In this paper, we introduce RLSBENCH, a large-scale benchmark for such relaxed label shift settings, consisting of 14 datasets across vision, tabular, and language modalities, spanning more than 500 distribution shift pairs with different class proportions. We evaluate 13 popular domain adaptation methods, demonstrating a more widespread susceptibility to failure under extreme shifts in the class proportions than was previously known. We develop an effective meta-algorithm, compatible with most deep domain adaptation heuristics, that consists of the following two steps: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) the target label distribution. In our benchmark, the meta-algorithm often improves existing domain adaptation heuristics by 2-10 accuracy points when label distribution shifts are extreme, and has minimal (i.e., less than 0.5%) to no effect on accuracy in cases with no shift in label distribution. We hope that these findings and the availability of RLSBENCH will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings.
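Step (i) of the meta-algorithm, pseudo-balancing, can be illustrated with a minimal sketch: at each epoch, resample the unlabeled target data so that each class (as judged by the model's current pseudo-labels) is represented equally often. The function name and sampling-with-replacement choice below are our own illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def pseudo_balanced_indices(pseudo_labels, rng=None):
    """Return indices that resample the data so that each pseudo-label
    class appears equally often (sampling with replacement).

    `pseudo_labels` are the model's current hard predictions on the
    unlabeled target data; a fresh resampling is drawn each epoch as
    the pseudo-labels evolve.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(pseudo_labels)
    classes = np.unique(labels)
    per_class = len(labels) // len(classes)
    idx = [rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
           for c in classes]
    return np.concatenate(idx)

# Example: heavily imbalanced pseudo-labels (90 vs. 10) get rebalanced.
labels = np.array([0] * 90 + [1] * 10)
idx = pseudo_balanced_indices(labels, rng=0)
```

In practice the rebalanced indices would drive the data loader for the next training epoch, so that the adaptation heuristic never sees a batch distribution dominated by the majority class.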

1. INTRODUCTION

Real-world deployments of machine learning models are typically characterized by distribution shift, where data encountered in production exhibits statistical differences from the available training data (Quinonero-Candela et al., 2008; Torralba & Efros, 2011; Koh et al., 2021). Because continually labeling data can be prohibitively expensive, researchers have focused on the unsupervised domain adaptation (DA) setting, where only labeled data sampled from the source distribution and unlabeled data from the target distribution are available for training. Absent further assumptions, the DA problem is well known to be underspecified (Ben-David et al., 2010b), and thus no method is universally applicable. Researchers have responded to these challenges in several ways. One approach is to investigate additional assumptions that render the problem well-posed. Popular examples include covariate shift and label shift, for which identification strategies and principled methods exist whenever the source and target distributions have overlapping support (Shimodaira, 2000; Schölkopf et al., 2012; Gretton et al., 2009). Under label shift in particular, recent research has produced effective methods that are applicable in deep learning regimes and yield both consistent estimates of the target label marginal and principled ways to update the resulting classifier (Lipton et al., 2018; Alexandari et al., 2021; Azizzadenesheli et al., 2019; Garg et al., 2020). However, these assumptions are typically, to some degree, violated in practice. Even for archetypal cases like shift in disease prevalence (Lipton et al., 2018), the label shift assumption can be violated. For example, over the course of the COVID-19 epidemic, changes in disease positivity have been coupled with shifts in the age distribution of the infected and subtle mutations of the virus itself.
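The "principled way to update the resulting classifier" mentioned above is, in its simplest form, a post-hoc prior correction: reweight the classifier's predicted class probabilities by the ratio of the (estimated) target label marginal to the source label marginal, then renormalize. The sketch below assumes the target marginal has already been estimated (e.g., by a black-box shift estimator in the style of Lipton et al., 2018); the function name is ours.

```python
import numpy as np

def adjust_to_target_prior(probs, source_prior, target_prior):
    """Reweight predicted probabilities p(y|x) by the ratio of the
    (estimated) target label marginal to the source label marginal,
    then renormalize each row to sum to one.
    """
    w = np.asarray(target_prior) / np.asarray(source_prior)
    adjusted = probs * w  # broadcast the per-class weights over rows
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Toy example: a classifier trained on balanced classes, deployed where
# class 1 is estimated to be three times as prevalent as class 0.
probs = np.array([[0.6, 0.4],
                  [0.5, 0.5]])
adjusted = adjust_to_target_prior(probs,
                                  source_prior=[0.5, 0.5],
                                  target_prior=[0.25, 0.75])
```

Under exact label shift (p(x | y) unchanged), this correction recovers the Bayes-optimal predictor for the target distribution; the relaxed label shift setting studied in this paper asks how robust such corrections remain when p(x | y) also drifts.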
A complementary line of research focuses on constructing benchmark datasets for evaluating methods, in the hopes of finding heuristics that, for the kinds of problems that arise in practice, tend to perform well.

