DOMAIN GENERALIZATION VIA HECKMAN-TYPE SELECTION MODELS

Abstract

The domain generalization (DG) setup considers the problem in which models are trained on data sampled from multiple domains and evaluated on test domains unseen during training. In this paper, we formulate DG as a sample selection problem in which each domain is sampled from a common underlying population through non-random sampling probabilities that correlate with both the features and the outcome. Under this setting, the fundamental iid assumption of empirical risk minimization (ERM) is violated, so ERM often performs worse on test domains whose non-random sampling probabilities differ from those of the training domains. We propose a Selection-Guided DG (SGDG) framework that learns the selection probability of each domain and the joint distribution of the outcome and domain selection variables. SGDG is domain generalizable because it minimizes the risk under the population distribution. We theoretically prove that, under certain regularity conditions, SGDG achieves smaller risk than ERM. Furthermore, we present a class of parametric SGDG (HeckmanDG) estimators applicable to continuous, binary, and multinomial outcomes. We also demonstrate its efficacy empirically through simulations and through experiments on a set of benchmark datasets, comparing against other well-known DG methods.

1. INTRODUCTION

In statistical learning theory, the standard assumption behind many supervised learning algorithms is that training and test instances are independently and identically distributed (iid) according to the same underlying data distribution (Vapnik, 1991). In other words, most statistical models assume that the training and test data are random samples drawn from the same population. Unfortunately, this assumption is often violated in real-world applications, causing model performance to deteriorate on out-of-distribution (OOD) test data (Koh et al., 2021). Recently, the Domain Generalization (DG) problem (Blanchard et al., 2011) has gained particular attention: it assumes that learning systems have access to training data sampled from multiple domains, and the ultimate goal is to extrapolate to new instances sampled from previously unseen test domains. In this paper, we consider DG as a non-random sample selection problem. Let P_XY denote the population data distribution, and let S^k denote a binary random variable indicating whether a subject is selected from the population into domain k. In a random sampling process, S^k is independent of (X, Y). Under non-random sample selection, the distribution of (X, Y) in domain k is the conditional distribution of (X, Y) given S^k = 1, which in general does not equal P_XY. Consequently, this leads to distributional shifts across domains: P^j_XY ≠ P^k_XY for k ≠ j. Mathematically, the shifts in P^k_XY may comprise shifts in the distribution of X (P^k_X, covariate shift (Bickel et al., 2009)) and in the distribution of Y conditional on X (P^k_Y|X, concept shift (Moreno-Torres et al., 2012)). We present a graphical model in Figure 1 to conceptually illustrate the sources of distribution shifts, assuming the existence of latent factors confounding the relationship between X, Y, and domain (S^k).
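The effect of such outcome-dependent selection can be illustrated with a small simulation (not from the paper; the coefficients below are arbitrary choices for illustration): when the probability of entering a domain depends on both X and Y, a model fit on the selected domain recovers a different X–Y relationship than the population one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population model: Y = 2*X + noise, so the population regression slope is 2.
n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Non-random selection into a hypothetical domain k: the selection
# probability depends on both X and Y, so P(X, Y | S^k = 1) != P_XY.
logits = 1.0 * x - 2.0 * y
p_select = 1.0 / (1.0 + np.exp(-logits))
selected = rng.random(n) < p_select

# Regression slope in the population vs. in the selected domain.
slope_pop = np.polyfit(x, y, 1)[0]                      # close to 2.0
slope_dom = np.polyfit(x[selected], y[selected], 1)[0]  # biased downward
```

The bias arises purely from the selection mechanism: ERM on the selected domain minimizes risk under P(X, Y | S^k = 1) rather than under P_XY.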
In Figure 1, C1 represents unobserved latent factors that correlate with X and S^k, resulting in covariate shift. C2 correlates with X, Y, and S^k simultaneously, entailing both covariate and concept shifts. The goal of DG is to estimate the domain generalizable (agnostic) edge f : X → Y in the presence of these two types of latent confounders. The vast majority of DG methods are designed to identify an f that is robust to C1. In practice, however, C2-type confounders often exist, making P(S^k = 1) related to both X and Y. For example, when we train a model to predict tumor status (Y) from histological images (X) using patients from different hospitals (S^k), there may be variation in X due to inconsistent acquisition processes such as staining differences (C1) across hospitals, as well as differences in patient characteristics such as age, gender, race, and disease severity (C2) that correlate with the hospital, the covariates, and the outcome. As a result, a model trained in an oncology specialist hospital may not generalize to a hospital serving veterans. Similarly, when we train a model to predict a wealth index (Y) from satellite images (X) taken in different countries (S^k), latent factors such as economic status (C2) may correlate with X, Y, and the domain simultaneously. Therefore, a model trained on one country may not perform well in another country with a different rural/urban proportion or economic status (Koh et al., 2021). In this paper, we propose a new class of Selection Guided Domain Generalization (SGDG) models that first estimate the selection probability that an instance is sampled into a training domain, and then use the joint distribution of the outcome Y and the selection S^k to learn a domain generalizable model.
In particular, SGDG builds on Heckman's bias correction framework (Heckman, 1979), a powerful tool for learning an unbiased model from non-randomly selected samples in the presence of both C1- and C2-type confounders. The unique contributions of this paper are summarized as follows:
• To the best of our knowledge, this is the first paper to formulate the DG problem within a non-random sample selection framework and to propose a Selection Guided Domain Generalization (SGDG) method under this framework.
• We present a class of parametric SGDG (HeckmanDG) estimators applicable to continuous, binary, and multinomial outcomes†.
• We demonstrate the efficacy of our method both theoretically and empirically, on simulated data and on four challenging benchmarks.
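Heckman's (1979) correction, in its classic two-step linear form, illustrates the mechanism that selection-based bias correction relies on. The sketch below is our own illustration, not the paper's HeckmanDG implementation; all coefficients and variable names are arbitrary. It fits a probit selection model, then augments the outcome regression on the selected sample with the inverse Mills ratio:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50_000

# Population outcome model: y = 1 + 2*x + eps. Selection depends on x, an
# extra variable z, and an unobservable u with corr(u, eps) = 0.8, so the
# selected sample has E[eps | x, S = 1] != 0 (a C2-type confounder).
x = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)
eps = 0.8 * u + 0.6 * rng.normal(size=n)
y = 1.0 + 2.0 * x + eps
s = (0.5 + x + z + u) > 0                      # selection indicator S

# Step 1: probit model for P(S = 1 | x, z), fit by maximum likelihood.
W = np.column_stack([np.ones(n), x, z])
def nll(g):
    q = 2 * s - 1                              # +1 / -1 coding
    return -np.sum(norm.logcdf(q * (W @ g)))
gamma = minimize(nll, np.zeros(3), method="BFGS").x

# Step 2: OLS on the selected sample, augmented with the inverse Mills
# ratio phi(w'gamma) / Phi(w'gamma), which absorbs the selection-induced
# bias in E[eps | x, S = 1].
index = W[s] @ gamma
mills = norm.pdf(index) / norm.cdf(index)
X_heck = np.column_stack([np.ones(s.sum()), x[s], mills])
slope_heckman = np.linalg.lstsq(X_heck, y[s], rcond=None)[0][1]

# Naive OLS on the selected sample, for comparison.
X_naive = np.column_stack([np.ones(s.sum()), x[s]])
slope_naive = np.linalg.lstsq(X_naive, y[s], rcond=None)[0][1]
```

Here the naive slope is biased away from the population value of 2, while the Mills-ratio-augmented regression recovers it.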



† code available: https://github.com/hgkahng/domain-generalization-lightning



Figure 1: A graphical model illustrating the source of distributional shifts. X: covariates, Y: outcome, S^k: domain. C1 represents latent factors that correlate with X and S^k, resulting in covariate shift. C2 correlates with X, Y, and S^k, entailing both covariate and concept shifts. Our goal is to estimate the domain generalizable (agnostic) edge f : X → Y in the presence of the two types of latent confounders.

Domain Generalization. DG has been studied in various contexts. Many studies are devoted to learning domain-invariant features that are discriminative and independent of the domain, such as kernel-based methods (Muandet et al., 2013), moment matching (Sun & Saenko, 2016), adversarial learning (Ganin et al., 2016; Deng et al., 2020), entropy regularization (Zhao et al., 2020), and contrastive learning (Motiian et al., 2017; Kim et al., 2021). Other works exploit invariant causal effects across domains (Arjovsky et al., 2019; Ahuja et al., 2020; Rosenfeld et al., 2021). Another family of robust optimization methods seeks to minimize the worst-case error (Sagawa et al., 2020; Xie et al., 2020; Krueger et al., 2021). More recently, prominent directions improve DG via model averaging (Cha et al., 2021; Arpit et al., 2022), gradient matching (Shi et al., 2022), meta-learning (Li et al., 2018), data augmentation (Robey et al., 2021), and generating novel domains (Zhou et al., 2020).

Sample Selection Bias Correction. Zadrozny (2004) formalized sample selection bias in machine-learning terms and presented a bias correction method for the case where selection depends only on the input features. Cortes et al. (2008) proposed a sample reweighting approach to tackle the same problem, but assumed the availability of additional data drawn from the true population. Du & Wu (2021) proposed a framework for robust and fair learning under biased sample selection, but assumed conditional independence of Y and S given X.
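The setting of Zadrozny (2004), where selection depends only on X, admits a simple importance-weighting fix: weight each selected example by 1/P(S = 1 | x). A minimal sketch of this idea (our illustration; the selection probability is assumed known here, whereas in practice it must be estimated):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Population: E[Y] = E[sin(X)] = 0 for X ~ N(0, 1).
x = rng.normal(size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)

# Selection depends ONLY on x: large x is oversampled, so the naive
# selected-sample mean of y is biased upward.
p_select = 1.0 / (1.0 + np.exp(-2.0 * x))
s = rng.random(n) < p_select

mean_naive = y[s].mean()                                       # biased
mean_reweighted = np.average(y[s], weights=1.0 / p_select[s])  # near 0
```

Note that this reweighting is valid only under covariate shift; when selection also depends on Y given X (C2-type confounding), P(S = 1 | x) no longer identifies the correction, which is the case the Heckman-type approach targets.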

