DOMAIN GENERALIZATION VIA HECKMAN-TYPE SELECTION MODELS

Abstract

The domain generalization (DG) setup considers the problem where models are trained on data sampled from multiple domains and evaluated on test domains unseen during training. In this paper, we formulate DG as a sample selection problem in which each domain is sampled from a common underlying population through non-random sampling probabilities that correlate with both the features and the outcome. Under this setting, the fundamental iid assumption of empirical risk minimization (ERM) is violated, so ERM often performs worse on test domains whose non-random sampling probabilities differ from those of the training domains. We propose a Selection-Guided DG (SGDG) framework to learn the selection probability of each domain and the joint distribution of the outcome and domain selection variables. The proposed SGDG is domain generalizable in that it aims to minimize the risk under the population distribution. We theoretically prove that, under certain regularity conditions, SGDG can achieve smaller risk than ERM. Furthermore, we present a class of parametric SGDG (HeckmanDG) estimators applicable to continuous, binary, and multinomial outcomes. We also demonstrate its efficacy empirically through simulations and experiments on a set of benchmark datasets, comparing it with other well-known DG methods.

1. INTRODUCTION

In statistical learning theory, the standard assumption behind many supervised learning algorithms is that training and test instances are drawn independently and identically distributed (iid) from the same underlying data distribution (Vapnik, 1991). In other words, most statistical models assume that the training and test data are both random samples from the same population. Unfortunately, this assumption is often violated in real-world applications, causing model performance to deteriorate on out-of-distribution (OOD) test data (Koh et al., 2021). Recently, the Domain Generalization (DG) problem (Blanchard et al., 2011) has gained particular attention: learning systems are assumed to have access to training data sampled from multiple domains, and the ultimate goal is to extrapolate to new instances sampled from previously unseen test domains.

In this paper, we consider DG as a non-random sample selection problem. Let P_XY denote the population distribution of (X, Y), and let S^k be a binary random variable indicating whether a subject is selected from the population into domain k. Under random sampling, the selection probability P(S^k_i = 1) is constant across subjects and independent of (X_i, Y_i). Under non-random sample selection, by contrast, the distribution of (X, Y) in domain k is the conditional distribution of (X, Y) given S^k = 1, which in general does not equal P_XY. Consequently, this leads to distributional shifts across domains: P^j_XY ≠ P^k_XY for j ≠ k. Such shifts may involve shifts in the marginal distribution of X (P^k_X, covariate shift (Bickel et al., 2009)) and in the conditional distribution of Y given X (P^k_{Y|X}, concept shift (Moreno-Torres et al., 2012)). We present a graphical model in Figure 1 to conceptually illustrate the sources of distribution shifts, assuming the existence of latent factors confounding the relationship between X, Y, and domain membership S^k.
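The effect of non-random selection described above can be sketched with a minimal simulation. All distributions and coefficients below are illustrative assumptions, not taken from the paper: the population is a simple linear model, and the selection probability into a hypothetical domain k follows a logistic rule that depends on both X and Y, so the conditional distribution of (X, Y) given S^k = 1 shifts away from P_XY.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population: X ~ N(0, 1), Y = 2X + noise.
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Hypothetical non-random selection into domain k: the probability of
# being sampled depends on both X and Y (logistic selection rule).
p_select = 1.0 / (1.0 + np.exp(-(1.5 * x + 0.5 * y)))
s = rng.random(n) < p_select

# Population means are near zero; the selected domain's means are
# shifted upward, so P(X, Y | S^k = 1) != P_XY.
print("population:", x.mean(), y.mean())
print("domain k:  ", x[s].mean(), y[s].mean())
```

Because selection favors large X and Y, the domain-k sample over-represents the upper tail of the population, which is exactly the mechanism that makes ERM on the selected sample inconsistent for the population risk.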
In Figure 1, C_1 represents unobserved latent factors that correlate with X and S^k, resulting in covariate shift. C_2 correlates with X, Y, and S^k simultaneously, entailing both covariate and concept shift. The goal of DG is to estimate the domain-generalizable (domain-agnostic) relationship f : X → Y in the presence of these two types of latent confounders.
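The roles of the two latent confounders can likewise be sketched numerically. In this hedged example (the functional forms and coefficients are illustrative assumptions, not the paper's model), C_1 drives X and selection, while C_2 drives X, Y, and selection; conditioning on selection then shifts both the marginal of X (covariate shift) and the regression of Y on X (concept shift).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000

# Latent confounders in the spirit of Figure 1 (coefficients illustrative).
c1 = rng.normal(size=n)                  # correlates with X and S^k only
c2 = rng.normal(size=n)                  # correlates with X, Y, and S^k
x = c1 + c2 + rng.normal(size=n)
y = x + c2 + rng.normal(size=n)

# Selection into a domain depends on both latent factors.
p = 1.0 / (1.0 + np.exp(-(c1 + 2.0 * c2)))
s = rng.random(n) < p

def ols_slope(xv, yv):
    """Least-squares slope of yv on xv (with intercept)."""
    xc = xv - xv.mean()
    return float((xc * (yv - yv.mean())).sum() / (xc * xc).sum())

# Covariate shift: the marginal distribution of X differs under selection.
print("E[X] population vs. domain:", x.mean(), x[s].mean())
# Concept shift: the Y-on-X regression differs, because selecting on C_2
# (a common cause of X and Y) distorts P(Y | X) in the selected sample.
print("slope population vs. domain:", ols_slope(x, y), ols_slope(x[s], y[s]))
```

Only C_2 produces concept shift here: dropping it from the selection rule would leave the population and domain slopes equal, which matches the figure's distinction between the two confounders.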

