UNSUPERVISED ADAPTATION FOR FAIRNESS UNDER COVARIATE SHIFT

Abstract

Training fair models typically involves optimizing a composite objective accounting for both prediction accuracy and some fairness measure. However, due to a shift in the distribution of the covariates at test time, the learnt fairness tradeoffs may no longer be valid, which we verify experimentally. To address this, we consider an unsupervised adaptation problem of training fair classifiers when only a small set of unlabeled test samples is available along with a large labeled training set. We propose a novel modification to the traditional composite objective by adding a weighted entropy objective on the unlabeled test dataset. This involves a min-max optimization where weights are optimized to mimic the importance weighting ratios, followed by classifier optimization. We demonstrate that our weighted entropy objective provides an upper bound on the standard importance sampled training objective common in covariate shift formulations, under some mild conditions. Experimentally, we demonstrate that a Wasserstein-distance-based penalty for representation matching across protected subgroups, together with the above loss, outperforms existing baselines. Our method achieves the best accuracy-equalized odds tradeoff under the covariate shift setup. We find that, for the same accuracy, we get up to 2× improvement in equalized odds on notable benchmarks.

1. INTRODUCTION

Moving away from optimizing only prediction accuracy, there is considerable interest in understanding and analyzing Machine Learning model performance along other dimensions such as robustness (Silva & Najafirad, 2020), model generalization (Wiles et al., 2021) and fairness (Oneto & Chiappa, 2020). In this work, we focus on the algorithmic fairness aspect. When the predictions of a machine learning classifier are used to make important decisions with societal impact, as in criminal justice or loan approvals, the impact of those decisions on different protected groups needs to be taken into account. Datasets used for training can be biased in the sense that some groups are under-represented, biasing classifier decisions towards the over-represented group, or the bias can arise from undesirable causal pathways between the sensitive attribute and the label in the real-world data generating mechanism (Oneto & Chiappa, 2020). It has often been observed (Bolukbasi et al., 2016; Buolamwini & Gebru, 2018) that algorithms optimizing predictive accuracy, when fed pre-existing biases, learn and then propagate those same biases. While there are various approaches to fair machine learning, a class of methods called in-processing methods has been shown to perform well (Wan et al., 2021). These methods regularize the training of fair models, typically through a composite objective accounting for a specific fairness measure along with predictive accuracy. Popular fairness measures are based on notions of demographic parity, equal opportunity, predictive rate parity and equalized odds. After regularized training, the model attains a specific fairness-accuracy tradeoff. When the test distribution is close or identical to the training distribution, fairness-accuracy tradeoffs typically carry over. However, in practical scenarios, there can be non-trivial distributional shifts due to which tradeoffs achieved during training may not hold at test time.
For example, Ding et al. (2021) highlight how a classifier's fairness-accuracy tradeoff, trained on input samples derived from one state, does not extend to predicting income in other states for the Adult Income dataset. Similarly, Rezaei et al. (2021); Mandal et al. (2020) demonstrate that the tradeoffs achieved by state-of-the-art fairness techniques do not generalize to test data under shifts. In Figure 1, we complement these claims by analyzing the under-performance of a state-of-the-art fairness method, Adversarial Debiasing (Zhang et al., 2018). We also see a similar drop in performance under covariate shift in the other baselines we consider, which we highlight in our experimental analysis. In this work, we study covariate shift, where the distribution of covariates changes between training and testing while the optimal label predictor conditioned on the input remains the same. We address the following question for unsupervised adaptation of training fair classifiers: Under the covariate shift setup, given a sufficient amount of labeled training samples and only a few unlabeled test samples, how can we ensure good fairness-accuracy trade-offs on the test distribution? While this question has not received much attention, some recent works like Rezaei et al. (2021) have begun to address it. Prior works rely on explicit density estimation which is then used to adapt to test data. In our work, we focus on avoiding density estimation steps, which do not scale well in high dimensions. We propose a novel unsupervised adaptation training objective that is theoretically justified. The objective depends on labeled training samples and unlabeled test samples, along with a standard fairness objective involving representation matching across the groups on the test set. We report results on equalized odds in our experiments and use the related notion of accuracy parity to motivate our algorithmic design with empirical evidence.
Our key contributions are listed as follows:
1. We show that under a scenario of asymmetric covariate shift, where one group exhibits a large covariate shift while the other does not, accuracy parity degrades despite perfect representation matching across protected groups, highlighting the need to tackle covariate shift explicitly. (Section 4)
2. We introduce a composite objective for prediction that combines a novel weighted entropy objective on the set of unlabeled test samples with a standard ERM objective on the labeled training samples for tackling covariate shift. We optimize the weights using min-max optimization: the outer minimization optimizes the classifier with the composite objective, while the inner maximization finds appropriate per-sample weights related to the importance sampling ratios, determined implicitly with no density estimation steps. We prove that our composite objective provides an upper bound on the standard importance sampled training objective common in covariate shift formulations, under some mild conditions. We then combine the above composite objective with a representation matching loss to train fair classifiers. (Section 5)
3. We experiment on four benchmark datasets: Adult, Arrhythmia, Communities and Drug. We demonstrate that, by incorporating our proposed weighted entropy objective with the Wasserstein-based penalty for representation matching across protected sub-groups, we outperform existing fairness methods under covariate shifts. In particular, we achieve the best accuracy-equalized odds tradeoff: for the same accuracy, we achieve up to ≈ 2× improvement in the equalized odds metric. (Section 6)

Techniques for imposing fairness: Pre-processing techniques that transform the dataset (Calmon et al., 2017; Swersky et al., 2013; Feldman et al., 2015; Kamiran & Calders, 2012), followed by standard training, have been studied.
In-processing methods directly modify the learning algorithm using techniques such as adversarial learning (Madras et al., 2018; Zhang et al., 2018) and constrained or regularized optimization (Agarwal et al., 2018; Cotter et al., 2019; Donini et al., 2018; Fish et al., 2016; Zafar et al., 2017; Celis et al., 2019). Post-processing approaches primarily focus on modifying the outcomes of predictive models in order to make unbiased predictions (Pleiss et al., 2017; Zhao et al., 2017; Hardt et al., 2016b). Bellamy et al. (2019) provide a comprehensive survey of a broad variety of these algorithms. Our method is an in-processing technique and differs from the above methods in that it operates with a small unlabeled test set along with a standard labeled training set.

2. RELATED WORK

Distribution Shift: Research addressing distribution shift in machine learning is vast and growing. The general case considers a joint distribution shift between training and testing data (Ben-David et al., 2006; Blitzer et al., 2007; Moreno-Torres et al., 2012), resulting in techniques like domain adaptation (Ganin & Lempitsky, 2015), distributionally robust optimization (Sagawa et al., 2019; Duchi & Namkoong, 2021) and invariant risk minimization and its variants (Arjovsky et al., 2019; Krueger et al., 2021; Shi et al., 2021). A survey of various methods and their relative performance is given by Wiles et al. (2021). We focus on the problem of covariate shift, where the conditional label distribution is invariant while there is a shift in the marginal distribution of the covariates across training and test samples. This classical setup is studied by Shimodaira (2000); Sugiyama et al. (2007b); Gretton et al. (2009). Importance weighting is one of the most prominent techniques for tackling covariate shift (Sugiyama et al., 2007a; Lam et al., 2019); however, importance weighted estimators are known to have high variance even under minor shifts (Cortes et al., 2010a). More recently, entropy minimization (Wang et al., 2021a), pseudo-labeling (French et al., 2017; Xie et al., 2020) and batch normalization adaptation (Schneider et al., 2020; Nado et al., 2020) have emerged as de-facto approaches for tackling distribution shift because of their wide applicability and strong performance. Our work provides a connection between a version of weighted entropy minimization and the traditional importance sampling based loss, which may be of independent interest.

Fairness under Distribution shift:

The work by Rezaei et al. (2021) is by far the most closely aligned to ours, as they propose a method that is robust to covariate shift while ensuring fairness when unlabeled test data is available. However, it requires density estimation for the training and test distributions, which is inefficient in high dimensions and with a small number of test samples. In contrast, our method avoids density estimation and uses a weighted version of entropy minimization, constrained suitably to reflect the importance sampling ratios implicitly. Mandal et al. (2020) propose a method for fair classification under the worst-case weighting of the data via an iterative procedure, but it operates in the agnostic setting where test data is not available. Singh et al. (2021) study fairness under shifts through a causal lens, but the method requires access to the causal graph, separating sets and other non-trivial data priors. Zhang et al. (2021) propose FARF, an adaptive method for learning in an online setting under fairness constraints, which differs from the static shift setting considered in our work. Slack et al. (2020) propose a MAML-based algorithm to learn under fairness constraints, but it requires access to labeled test data. An et al. (2022) propose a consistency regularization technique to ensure fairness under label shift, while we consider covariate shift.

3. PROBLEM SETUP

Let X ⊆ R^d be the d-dimensional feature space for covariates, A the space of categorical group attributes and Y the space of class labels. In this work, we consider A = {0, 1} and Y = {0, 1}. Let X ∈ X, A ∈ A, Y ∈ Y be realizations from these spaces. We consider a training dataset D_S = {(X_i, A_i, Y_i) | i ∈ [n]}, where every tuple (X_i, A_i, Y_i) ∈ X × A × Y. We also have an unlabeled test dataset D_T = {(X_i, A_i) | i ∈ [m]}. We focus on the setup where m ≪ n. The training samples (X_i, A_i, Y_i) ∈ D_S are sampled i.i.d. from the distribution P_S(X, Y, A), while the unlabeled test instances are sampled from P_T(X, A). Let F : X → [0, 1] be the space of soft prediction models. In this work, we will consider F ∈ F of the form F = h • g, where g(X) ∈ R^k (for some dimension k > 0) is a representation that is being learnt, while h(g(X)) ∈ [0, 1] provides the soft prediction. Note that we do not consider A as an input to F, as explained in the work of Zhao (2021). The parameters of F are denoted θ(F). We denote the class prediction probabilities from F by P(Ŷ = y | X_i), where y ∈ {0, 1}. The supervised in-distribution training of F is done by minimizing the empirical risk ER_S as a proxy for the population risk R_S. Both risk measures are computed using the cross entropy (CE) loss for classification (correspondingly, we use ER_T and R_T over the test distribution for F):

R_S = E_{P_S(X,A,Y)}[−log P(Ŷ = Y | X)],    ER_S = (1/n) Σ_{(X_i, Y_i, A_i) ∈ D_S} −log P(Ŷ = Y_i | X_i)    (1)

3.1. COVARIATE SHIFT ASSUMPTION

For our work, we adopt the covariate shift assumption as in Shimodaira (2000). The covariate shift assumption implies that P_S(Y | X, A) = P_T(Y | X, A). In other words, the shift in distribution only affects the joint distribution of the covariates and the sensitive attribute, i.e. P_S(X, A) ≠ P_T(X, A). We note that our setup is identical to the recent work on fairness under covariate shift by Rezaei et al. (2021).
We also define and focus on a special case of covariate shift called asymmetric covariate shift.

Definition 1 (Asymmetric Covariate Shift). Asymmetric covariate shift occurs when the distribution of covariates of one group shifts while the other does not, i.e. P_T(X | A = 1) ≠ P_S(X | A = 1) while P_T(X | A = 0) = P_S(X | A = 0), in addition to P_S(Y | X, A) = P_T(Y | X, A).

This type of covariate shift occurs when one sub-group is over-represented (sufficiently capturing all parts of the domain of interest in the training data) while the other sub-group is under-represented and observed only in one part of the domain. In the test distribution, the covariates of the under-represented group undergo a more drastic shift.
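As a toy illustration of Definition 1 (not from the paper — the Gaussian covariates, dimensions and shift magnitude are our own assumptions), the following sketch constructs an asymmetric covariate shift where only group 1's covariate distribution moves between train and test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def sample_group(n, shift, rng):
    """Draw 2-D covariates for one group; `shift` translates the mean."""
    return rng.normal(loc=shift, scale=1.0, size=(n, 2))

# Train: both groups centered at the origin.
X_train_g0 = sample_group(n, shift=0.0, rng=rng)
X_train_g1 = sample_group(n, shift=0.0, rng=rng)

# Test: group 0 unchanged, group 1 shifted (asymmetric covariate shift).
X_test_g0 = sample_group(n, shift=0.0, rng=rng)
X_test_g1 = sample_group(n, shift=3.0, rng=rng)

# Group 0 statistics agree across train/test; group 1 statistics do not.
print(abs(X_train_g0.mean() - X_test_g0.mean()) < 0.2)  # True
print(abs(X_train_g1.mean() - X_test_g1.mean()) > 2.0)  # True
```

Under this construction, P_T(X | A = 0) = P_S(X | A = 0) by design, while the group-1 marginal is translated at test time, matching the definition.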

3.2. FAIRNESS MEASURE

To quantify fairness, we follow Rezaei et al. (2021) and use Equalized Odds (EOdds), proposed by Hardt et al. (2016a):

∆_EOdds = (1/2) Σ_{y ∈ {0,1}} |P(Ŷ = 1 | A = 0, Y = y) − P(Ŷ = 1 | A = 1, Y = y)|

EOdds requires parity in both true positive rates and false positive rates across the groups. Hardt et al. (2016a) raised several concerns regarding other widely used fairness metrics, e.g., Demographic Parity (DP) and Equal Opportunity (EOpp); therefore, we do not emphasize them in this work. Another way to interpret EOdds is that it requires I(Ŷ; A | Y) to be small, where I(·; · | ·) is the conditional mutual information measure. Ideally, we are interested in a classifier F that minimizes the objective R_T + λ I_T(Ŷ; A | Y), where I_T(·) is the mutual information measure with respect to the test distribution. However, the EOdds metric requires the true labels Y from the test distribution. Therefore, we consider optimizing a related weaker notion, called accuracy parity: ∆_Apar = |P(Ŷ = Y | A = 0) − P(Ŷ = Y | A = 1)|. In information theoretic terms, minimizing accuracy parity entails keeping I_T(1[Ŷ = Y]; A) small. We now state the main goal of this work:

Objective:   min_{θ(F)} R_T + λ ∆_Apar.    (2)
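Both ∆_EOdds and ∆_Apar are straightforward to compute from hard predictions. A minimal sketch (function names are ours; the paper does not prescribe an implementation):

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, a):
    """∆_EOdds: mean absolute gap in P(Ŷ=1 | A, Y=y) across groups, over y ∈ {0,1}."""
    gap = 0.0
    for y in (0, 1):
        rates = [y_pred[(a == g) & (y_true == y)].mean() for g in (0, 1)]
        gap += abs(rates[0] - rates[1])
    return gap / 2

def accuracy_parity_gap(y_true, y_pred, a):
    """∆_Apar: absolute gap in accuracy across groups."""
    accs = [(y_pred[a == g] == y_true[a == g]).mean() for g in (0, 1)]
    return abs(accs[0] - accs[1])

# Toy check: a predictor that is exact on group 0 and always wrong on group 1.
y_true = np.array([0, 1, 0, 1])
a      = np.array([0, 0, 1, 1])
y_pred = np.where(a == 0, y_true, 1 - y_true)
print(accuracy_parity_gap(y_true, y_pred, a))  # 1.0
print(equalized_odds_gap(y_true, y_pred, a))   # 1.0
```

On this extreme toy case both gaps attain their maximum value of 1, as expected.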

4. REPRESENTATION MATCHING AND COVARIATE SHIFT

Our objective is to learn a highly accurate classifier on the test distribution while ensuring accuracy parity as in (2). Despite the lack of test labels, accuracy parity admits a simple sufficient condition: train a classifier F = h • g(X) by matching the representation g(X) across the protected sub-groups and learning a classifier on top of that representation (Zhao & Gordon, 2019). Several variants of the representation matching loss have been proposed in the literature for both classification (Jiang et al., 2020; Wang et al., 2021c) and regression (Zhao, 2021; Chzhen et al., 2020). For ease of implementation, we pick the Wasserstein-2 metric to impose representation matching. We recall the definition of the Wasserstein distance:

Definition 2. Let (M, d) be a metric space and P_p(M) denote the collection of all probability measures µ on M with finite p-th moment. Then the p-th Wasserstein distance between measures µ, ν ∈ P_p(M) is given by: W_p(µ, ν) = (inf_{γ ∈ Γ(µ,ν)} ∫_{M×M} d(x, y)^p dγ(x, y))^{1/p}, where Γ(µ, ν) denotes the collection of all measures on M × M with marginals µ and ν respectively.

We minimize W_2 between the representations g(•) of the test samples from the two groups. Empirically, our representation matching loss is given by:

L_Wass(D_T) = W_p(µ̂, ν̂),  where µ̂ = Σ_{(X_i, A_i=0) ∈ D_T} δ_{g(X_i)} / |{(X_i, A_i = 0) ∈ D_T}|,  ν̂ = Σ_{(X_i, A_i=1) ∈ D_T} δ_{g(X_i)} / |{(X_i, A_i = 1) ∈ D_T}|    (3)

We arrive at the following objective:

min_{θ(F = h•g)} ER_T + λ L_Wass(D_T)    (4)

However, we still do not have labeled samples from D_T for realizing the first term. It is natural to optimize the following objective instead: ER_S + λ L_Wass(D_T), where the first term leverages labeled training data while the second term matches representations across groups using the unlabeled test set. We illustrate that under covariate shift, using the test set for representation matching alone is ineffective. We also provide strong experimental justification for this claim in section A.6.2.
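For intuition, in one dimension the empirical Wasserstein distance between equal-size, equal-weight samples has a closed form: the optimal coupling matches sorted order. The sketch below covers only this 1-D special case; the paper's loss operates on k-dimensional representations, where a general optimal transport solver (e.g. Sinkhorn) would be used instead:

```python
import numpy as np

def w2_1d(u, v):
    """Empirical Wasserstein-2 between equal-size 1-D samples.
    In 1-D the optimal coupling pairs sorted order statistics, so
    W_2^2 = mean over i of (u_(i) - v_(i))^2."""
    u, v = np.sort(u), np.sort(v)
    return np.sqrt(np.mean((u - v) ** 2))

# Two point masses shifted by 2: W_2 equals the shift.
print(w2_1d(np.array([0.0, 0.0]), np.array([2.0, 2.0])))  # 2.0
# Identical samples: distance 0.
print(w2_1d(np.array([1.0, 3.0]), np.array([3.0, 1.0])))  # 0.0
```

In the paper's setting, u and v would be (projections of) g(X_i) for the two protected groups in D_T.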
We now quote a result from the existing literature that bounds accuracy parity under representation matching.

Theorem 1 (Zhao & Gordon (2019)). Consider any soft classifier F = h • g(X) ∈ [0, 1] and the hard decision rule Ŷ = 1_{F(X) > 1/2}. Let the Bayes optimal classifier for group a under representation g(•) be s_a(X) = 1_{P_T(Y=1 | g(X), A=a) > 1/2}, and let the Bayes error for group a under representation g(•) be err_a. Then we have:

∆_Apar ≤ Σ_a err_a + ||P_T(g(X) | A = 1) − P_T(g(X) | A = 0)||_1 + min_a E_{P_T(X|a)}[|s_1(X) − s_0(X)|]

Here, ||P(•) − Q(•)||_1 is the total variation distance between measures P and Q. This suggests applying a loss for representation matching to enforce accuracy parity, as it would drive the purely label-independent middle term to zero. However, we argue that, under asymmetric covariate shift (Definition 1), accuracy parity is approximately equal to the third term in Theorem 1, even when the second term is driven to 0.

Representation matching does not work under asymmetric covariate shift: Consider the covariate shift scenario given by Definition 1. Suppose one is also able to find a representation g(•) that matches across groups exactly in the test, i.e. P_T(g(•) | A = 1) = P_T(g(•) | A = 0). Due to the asymmetric covariate shift assumption between train and test, we have P_T(g(•) | A = 0) = P_T(g(•) | A = 1) = P_S(g(•) | A = 0). Since there is no covariate shift for group A = 0, the optimal scoring function s_0(X) remains the same even for the training set, given the representation. Since the classifier h is learnt on top of the representation g, and only the training distribution of group A = 0 under g overlaps (completely) with the test, h is trained overwhelmingly with the correct labels for A = 0 in the region where test samples are found. Over the test distribution, the hard decision score function will therefore be approximately s_0(X), and the error in group 0 will be small.
Meanwhile, the test error in group 1 will be approximately E_{P_T(•|A=1)}[|s_0(X) − s_1(X)|], which matches the third term in Theorem 1. Therefore, in this setting it is essential to use the training samples and the unlabeled test samples to address the covariate shift problem for group 1: training samples from group 1 (not just those belonging to group A = 0) must be emphasized more. This motivates performing unsupervised adaptation using unlabeled test samples, focused on accuracy improvement, and combining it with representation matching.

5. METHOD AND ALGORITHM

Recall that the objective we are interested in is (4). One needs a proxy for the first term due to the lack of labels. From the considerations in the previous section, training has to be done in a manner that can tackle covariate shift despite using representation matching. Building on the analysis from the previous section, we derive a novel objective in Theorem 2 based on a weighted entropy over the instances in D_T along with the empirical loss over D_S, and show that it upper bounds R_T.

Theorem 2. Suppose that P_T(•) and P_S(•) are absolutely continuous with respect to each other over the domain X. Let ε ∈ R_+ be such that P_T(Y = y | X) / P(Ŷ = y | X) ≤ ε for y ∈ {0, 1}, almost surely with respect to the distribution P_T(X). Then, we can upper bound R_T using R_S along with an unsupervised objective over P_T as:

R_T ≤ R_S + ε × E_{P_T(X)}[ e^{−P_S(X)/P_T(X)} H(Ŷ | X) ]    (5)

where H(Ŷ | X) = Σ_{y ∈ {0,1}} −P(Ŷ = y | X) log P(Ŷ = y | X) is the conditional entropy of the predicted label given a sample X.

Proof Sketch. The full proof is relegated to appendix A.1. To arrive at this bound, we manipulate the importance sampled population training loss (Sugiyama et al., 2007b).

Figure 2: High level architecture of our method. Colored blocks represent parameterized subnetworks.

We emphasize that this result also provides an important connection and a rationale for using entropy based objectives for unsupervised adaptation from an importance sampling point of view, which has been missing in the literature (Wang et al., 2021a; Sun et al., 2019). The entropy objective is imposed on points that are more typical under the test distribution than the training distribution. Conversely, in regions where samples are less likely under the test distribution, and which have already been optimized for label prediction as part of training, the entropy objective is imposed only weakly. The bound, however, hinges on the assumption that pointwise over the domain X, F approximates the true soft predictor up to a constant factor ε.
To ensure a small value of ε, we pre-train F with only the D_S samples for a few epochs before imposing any other type of regularization.
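The bound in Theorem 2 can be sanity-checked numerically on a toy discrete domain where every quantity is computable in closed form (the two-point distributions below are arbitrary choices of ours, not from the paper):

```python
import numpy as np

# Two-point domain: P_S and P_T over X, shared P(Y=1|X), predictor P(Ŷ=1|X).
p_s = np.array([0.8, 0.2]); p_t = np.array([0.3, 0.7])
p_y = np.array([0.9, 0.2])   # P(Y=1|X), invariant under covariate shift
q   = np.array([0.8, 0.3])   # model's soft prediction P(Ŷ=1|X)

def ce(p, q):   # per-point cross entropy of label dist p under predictor q
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def ent(q):     # per-point predictor entropy H(Ŷ|X)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

R_T = np.sum(p_t * ce(p_y, q))   # test risk
R_S = np.sum(p_s * ce(p_y, q))   # train risk
# ε bounds the pointwise ratio P_T(Y=y|X)/P(Ŷ=y|X) over both labels.
eps = max(np.max(p_y / q), np.max((1 - p_y) / (1 - q)))
bound = R_S + eps * np.sum(p_t * np.exp(-p_s / p_t) * ent(q))
print(R_T <= bound)  # True
```

Here R_T ≈ 0.48 while the right-hand side of (5) evaluates to ≈ 0.77, so the bound holds with slack on this example.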

5.1. WEIGHTED ENTROPY OBJECTIVE

Implementing the objective in (5) requires computation of the Radon-Nikodym derivative dP_S(X)/dP_T(X). This is challenging when m (the number of unlabeled test samples) is small, and the typical route of density estimation in high dimensions is particularly hard. Therefore, we propose to estimate the ratio dP_S(X)/dP_T(X) with a parametrized network F_w : X → R, where F_w(X) shall satisfy the following constraints: E_{X∼P_T(X)}[F_w(X)] = 1 and E_{X∼P_S(X)}[1/F_w(X)] = 1. By definition of the Radon-Nikodym derivative, these constraints must be satisfied. Building on (5), we solve for the following upper bound in Theorem 2:

min_{θ(F)} max_{θ(F_w)} R_S + ε × E_{P_T(X)}[ e^{−F_w(X)} H(Ŷ | X) ]
s.t. E_{X∼P_T(X)}[F_w(X)] = 1, E_{X∼P_S(X)}[1/F_w(X)] = 1    (6)

Finally, we plug in the empirical risk estimator for R_S, approximate the expectation in the second term with its empirical version over D_T, posit ε as a hyperparameter, and add the unfairness objective in (3), minimizing the following:

min_{θ(F)} max_{θ(F_w)} L(θ(F), θ(F_w)) = ER_S + λ_1 (1/m) Σ_{X_i ∈ D_T} e^{−F_w(X_i)} H(Ŷ | X_i) + λ_2 L_Wass(D_T)
s.t. C_1 = (1/m) Σ_{X_i ∈ D_T} F_w(X_i) = 1, and C_2 = (1/n) Σ_{X_i ∈ D_S} 1/F_w(X_i) = 1    (7)

Here λ_1 and λ_2 are hyperparameters governing the objectives; C_1 and C_2 refer to the constraints.

Algorithm 1: Training procedure
Input: D_S, D_T, pre-training epochs Ẽ, training epochs E, step sizes η_t
Output: Optimized parameters θ*(F)
θ_0(F) ← random initialization
for t ← 1 to Ẽ do
    θ_t(F) ← θ_{t−1}(F) − η_t ∇_{θ_{t−1}(F)} ER_S
θ_Ẽ(F_w) ← random initialization
for t ← Ẽ + 1 to Ẽ + E do
    θ_t(F_w) ← θ_{t−1}(F_w) + η_t ∇_{θ_{t−1}(F_w)} L(θ_{t−1}(F), θ_{t−1}(F_w)), subject to C_1 and C_2
    θ_t(F) ← θ_{t−1}(F) − η_t ∇_{θ(F)} L(θ_{t−1}(F), θ_t(F_w))  /* We apply gradient stopping through F_w during backpropagation in this step */
θ*(F) ← θ_{E+Ẽ}(F)

Since the model has a representation layer followed by a classifier, i.e. F = h • g, in our implementation we apply the weighting function to g(•).
Therefore, we have the following final formulation:

min_{θ(F)} max_{θ(F_w)} L(θ(F), θ(F_w)) = ER_S + λ_1 (1/m) Σ_{X_i ∈ D_T} e^{−F_w(g(X_i))} H(Ŷ | X_i) + λ_2 L_Wass(D_T)
s.t. C_1 = (1/m) Σ_{X_i ∈ D_T} F_w(g(X_i)) = 1, and C_2 = (1/n) Σ_{X_i ∈ D_S} 1/F_w(g(X_i)) = 1    (8)

We use alternating gradient updates to solve the above min-max problem. Our learning procedure consists of two stages: (1) pre-training F for some epochs with only D_S, and (2) further training F with (8). The procedure is summarized in Algorithm 1 and a high level architecture is provided in Figure 2.
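To make the composite loss concrete, the sketch below evaluates it from per-sample quantities, handling C_1 and C_2 with quadratic penalties — a common relaxation chosen by us for illustration; the paper only states the constraints, not the enforcement mechanism:

```python
import numpy as np

def composite_loss(ce_train, ent_test, fw_test, fw_train,
                   lam1=1.0, lam2=0.0, wass=0.0, rho=10.0):
    """Penalty-form evaluation of the min-max objective:
    ER_S + λ1 * mean_test[e^{-F_w} H(Ŷ|X)] + λ2 * L_Wass,
    with C1 (mean_test F_w = 1) and C2 (mean_train 1/F_w = 1)
    enforced via quadratic penalties of weight rho (our relaxation)."""
    er_s = ce_train.mean()
    weighted_ent = np.mean(np.exp(-fw_test) * ent_test)
    c1 = fw_test.mean() - 1.0
    c2 = (1.0 / fw_train).mean() - 1.0
    return er_s + lam1 * weighted_ent + lam2 * wass + rho * (c1**2 + c2**2)

# With F_w ≡ 1 both constraints hold exactly and the entropy term
# is uniformly discounted by e^{-1}.
ce = np.array([0.5, 0.7]); ent = np.array([0.6, 0.6])
val = composite_loss(ce, ent, fw_test=np.ones(2), fw_train=np.ones(3))
print(np.isclose(val, 0.6 + np.exp(-1) * 0.6))  # True
```

In the full method, the inner player ascends this value in the F_w parameters while the outer player descends it in the classifier parameters, with gradients stopped through F_w in the outer step.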

6. EXPERIMENTS

We demonstrate our method on 4 widely used benchmarks from the fairness literature: the Adult, Communities and Crime, Arrhythmia and Drug datasets, with detailed descriptions in appendix A.2. The baseline methods used for comparison are: MLP, Adversarial Debiasing (AD) (Zhang et al., 2018), Robust Fair (RF) (Mandal et al., 2020), Robust Shift Fair (RSF) (Rezaei et al., 2021) and Z-Score Adaptation (ZSA), with detailed descriptions in appendix A.3. The implementation details of all methods, with the relevant hyperparameters, are provided in section A.5. The procedure for constructing the covariate shift is described in section A.4. To summarize, we use the principal component analysis (PCA) direction to generate the covariate shifted test set, similar to Rezaei et al. (2021); Gretton et al. (2008). We evaluate our method against the baselines via the trade-off between fairness violation (using ∆_EOdds) and error (which is 100 − accuracy). All algorithms are run 50 times before reporting the mean and the standard deviation of the results.
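The exact A.4 procedure is not reproduced in this section; the following is our own sketch of a PCA-direction shift construction in the spirit of Gretton et al. (2008): project onto the top principal component and bias test-set sampling toward one end of that direction (the sigmoid sampling form, γ and the quantile threshold are all our assumptions):

```python
import numpy as np

def pca_shift_split(X, frac_test=0.3, gamma=2.0, rng=None):
    """Sketch of a PCA-direction covariate-shift split: project onto the
    top principal component and bias test-set membership toward the high
    end of that direction via a sigmoid sampling probability."""
    rng = rng or np.random.default_rng(0)
    Xc = X - X.mean(axis=0)
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]  # top PC direction
    score = Xc @ v
    thresh = np.quantile(score, 1 - frac_test)
    p = 1.0 / (1.0 + np.exp(-gamma * (score - thresh)))
    test_mask = rng.random(len(X)) < p
    return ~test_mask, test_mask  # train mask, test mask

X = np.random.default_rng(1).normal(size=(500, 5))
train_m, test_m = pca_shift_split(X)
# Test points concentrate at the high end of the PC direction,
# producing a marginal shift in the covariates while labels (if drawn
# from a fixed P(Y|X)) remain conditionally unchanged.
```

Because sampling probability increases monotonically with the PC projection, the test set's mean projection exceeds the train set's, yielding a covariate shift along that direction.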

6.1. COMPARATIVE RESULTS

The experimental results for the shift constructed using the procedure in section A.4 are shown in Figure 3. Results closer to the bottom left corner of each plot are desirable. In some cases, the standard deviation bars in the figure stretch below 0 due to skewness; however, all the numbers across the runs are positive. Our method provides better error-fairness tradeoffs than the baselines on all the benchmarks. For example, on the Adult dataset, we have the lowest error rate at around 15% with ∆_EOdds at almost 0.075, while the closest baselines, MLP and RF, fall short on one of the two metrics. On Arrhythmia and Communities, our method achieves very low ∆_EOdds (best on Arrhythmia with a margin of ∼30%) with only marginally higher error compared to MLP and RF respectively. On the Drug dataset, we achieve the best numbers on both metrics. For the same accuracy, we obtain 1.3×-2× improvements over the baseline methods on most of the benchmarks. Similarly, for the same ∆_EOdds, we achieve up to 1.5× lower errors. It is also important to note that all the other unsupervised adaptation algorithms perform substantially worse and are highly unreliable. For example, ZSA performs well only on the Drug dataset, but shows extremely high errors (even worse than random predictions) on Communities and Adult; the adaptation performed by ZSA is insufficient to handle covariate shift. The RSF baseline is consistently worse across the board: it tries to explicitly estimate P_S(X) and P_T(X), which is extremely challenging, whereas we estimate the importance ratio implicitly.
While there is extensive evidence in the literature suggesting that fairness is achieved at the expense of performance (Menon & Williamson, 2018; Zhao, 2021; Zliobaite, 2015), we attribute the low errors achieved by our method to the novel entropy formulation, in which we minimize the entropy under the worst-case weighting permitted by the constraints. The saddle-point solution optimizes the entropy on points far from the training distribution via appropriate scaling (importance weighting).

6.2. RESULTS ON ASYMMETRIC SHIFT

We also study covariate shift under a new lens where the degree of shift differs substantially across the groups, which also motivates our formulation (section 4). To construct this, we follow the same procedure as described in section A.4, but treat the data of the two groups differently: the shift is introduced in one of the groups, while the other group is split randomly into train-val-test. Figure 4 provides the results when the shift is created in group A = 0, whereas Figure 5 provides the results for the shift in group A = 1. We again observe that our method provides better tradeoffs across the board. For the shift in group A = 0, we have substantially better results on Adult and Arrhythmia, with up to ∼2× improvements in ∆_EOdds for similar error and up to ∼1.4× improvements in error for similar ∆_EOdds. On the Communities dataset, MLP and AD show performance similar to ours, but they are much worse on the Drug dataset on both metrics. ZSA performs comparably to our method only on Drug and is substantially worse on the other datasets, confirming the inconsistency of the baselines under this setup as well. For the shift in group A = 1, we observe similar behavior. On the Drug dataset, we clearly obtain the best tradeoff among all baselines. MLP and AD achieve performance similar to ours on Communities, but show up to 2× worse ∆_EOdds on Arrhythmia with only marginal improvements in error. On the Adult dataset, we observe up to 1.5× improvements over MLP and AD in ∆_EOdds. The RF baseline performs strictly worse than ours on the Adult and Arrhythmia datasets; it is marginally better on one of the metrics on Communities and Drug, but at the expense of the other. It is also important to note that the errors of all methods are lower than in Figure 3, since only one group exhibits substantial shift, while the degradation in equalized odds is higher.
This is in line with the reasoning provided in section 4 based on Theorem 1.

6.3. RATIO ESTIMATED

We empirically justify the use of F_w(g(X)) by comparing the distribution of the learned ratio across samples from D_T and D_S in Figure 6. It is evident that the parametrized weight network can approximately learn the importance ratios with respect to P_T and P_S. The ratios computed for the test points lie mostly between 0 and 1 in order to satisfy C_1 (in eq. 8), whereas the ratios computed for the training points are mostly > 1 in order to satisfy C_2 (in eq. 8). More importantly, F_w is learned end to end via optimization and does not incur any significant overhead compared to explicit density estimation.
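The constraints C_1 and C_2 are exactly the identities satisfied by the true Radon-Nikodym derivative dP_S/dP_T, which can be verified by Monte Carlo on a pair of Gaussians where the ratio is known in closed form (the particular Gaussians are our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# P_S = N(0,1), P_T = N(1,1); exact density ratio r(x) = p_S(x)/p_T(x).
def r(x):
    return np.exp(-0.5 * x**2 + 0.5 * (x - 1.0) ** 2)  # = e^{0.5 - x}

xs_t = rng.normal(1.0, 1.0, 200_000)   # samples from P_T
xs_s = rng.normal(0.0, 1.0, 200_000)   # samples from P_S

# The identities satisfied by the true Radon-Nikodym derivative:
print(np.isclose(r(xs_t).mean(), 1.0, atol=0.02))        # C1: E_{P_T}[r] = 1
print(np.isclose((1 / r(xs_s)).mean(), 1.0, atol=0.02))  # C2: E_{P_S}[1/r] = 1
```

A learned F_w that satisfies C_1 and C_2 is thus consistent with being a density ratio, even though no density is ever estimated explicitly.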

6.4. EXTENDED ANALYSIS

Extensive experimental results and analysis across multiple settings are provided in the appendix (due to lack of space). We empirically justify the motivation for unsupervised adaptation (described in section 4) in section A.6.1. Ablation studies are performed for the hyperparameters λ_1 and λ_2 in section A.6.2, for the magnitude of the shift in section A.6.3, and for the value of m in section A.6.4. In section A.6.5, we derive the connection to the standard entropy loss over unlabeled test samples (akin to the work by Wang et al. (2021a)) and demonstrate that our formulation achieves a substantially better trade-off.

7. CONCLUSION

In this work, we considered the problem of unsupervised test-time adaptation under covariate shift to achieve good fairness-accuracy trade-offs when a small amount of unlabeled test data is available. We showed how fair representation matching alone is insufficient under covariate shift. We proposed a composite objective that combines a weighted entropy loss on the unlabeled test data with a representation matching loss across protected groups. Finally, we experimentally demonstrated that our composite objective outperforms many baselines on benchmarks in achieving non-trivial accuracy-fairness trade-offs.

8. REPRODUCIBILITY STATEMENT

We have described all the relevant implementation details required to reproduce the experiments in the appendix. The details for all the benchmarks as well as the baselines are provided comprehensively. We will publicly release the source code after the review process.

9. ETHICS STATEMENT

This work aims to address concerns related to unfairness and bias that manifest when there is a shift in distribution between the training and testing phases of a model. With the ever increasing real-world deployment of machine learning models, especially in life-altering scenarios like criminal justice and college admissions, we hope that this work helps tackle these issues and yields a cumulative social gain.

A APPENDIX

A.1 PROOFS

Proof of Theorem 2. We start by rewriting the expected cross-entropy loss on the test distribution as an importance-sampled loss on the training distribution:

$$R_T = \mathbb{E}_{P_T(X)}\Big[ \sum_{y\in\{0,1\}} -P_T(Y=y|X)\,\log P(\hat{Y}=y|X) \Big] \tag{9}$$

$$\stackrel{(a)}{=} \mathbb{E}_{P_S(X)}\Big[ \frac{dP_T(X)}{dP_S(X)} \sum_{y\in\{0,1\}} -P_S(Y=y|X)\,\log P(\hat{Y}=y|X) \Big] \tag{10}$$

(a) is the importance weighting technique proposed by Sugiyama et al. (2007b); dP_T(X)/dP_S(X) is the Radon-Nikodym derivative, which exists because the two distributions are absolutely continuous with respect to each other. Continuing,

$$= R_S + \mathbb{E}_{P_S(X)}\Big[ \Big(\frac{dP_T(X)}{dP_S(X)} - 1\Big) \sum_{y\in\{0,1\}} -P_S(Y=y|X)\,\log P(\hat{Y}=y|X) \Big] \tag{11}$$

$$= R_S + \mathbb{E}_{P_T(X)}\Big[ \Big(1 - \frac{dP_S(X)}{dP_T(X)}\Big) \sum_{y\in\{0,1\}} -P_T(Y=y|X)\,\log P(\hat{Y}=y|X) \Big] \tag{12}$$

$$\stackrel{(b)}{\le} R_S + \epsilon \cdot \mathbb{E}_{P_T(X)}\Big[ \Big(1 - \frac{dP_S(X)}{dP_T(X)}\Big) \sum_{y\in\{0,1\}} -P(\hat{Y}=y|X)\,\log P(\hat{Y}=y|X) \Big] \tag{13}$$

$$\stackrel{(c)}{\le} R_S + \epsilon \cdot \mathbb{E}_{P_T(X)}\Big[ e^{-\frac{dP_S(X)}{dP_T(X)}}\, H(\hat{Y}|X) \Big] \tag{14}$$

(b) follows from the assumption that P_T(Y|X)/P(\hat{Y}|X) ≤ ε almost surely with respect to X ∼ P_T. (c) follows because 1 − x ≤ e^{−x} for x ≥ 0.
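The chain of (in)equalities above can be checked numerically on a toy finite-X problem; the distributions below are assumptions made up purely for this check, not taken from the paper:

```python
import numpy as np

# Toy finite-X sanity check: X ∈ {0,1,2}, binary Y, covariate shift
# (P(Y|X) shared across train and test).
P_S = np.array([0.5, 0.3, 0.2])   # train covariate distribution
P_T = np.array([0.2, 0.3, 0.5])   # shifted test covariate distribution
P_Y1 = np.array([0.9, 0.5, 0.1])  # P(Y=1|X), shared
Q_Y1 = np.array([0.8, 0.5, 0.2])  # model probabilities P(Ŷ=1|X)

def xent(p1, q1):
    """Per-x cross entropy  Σ_y -P(Y=y|X) log Q(Ŷ=y|X)."""
    return -(p1 * np.log(q1) + (1 - p1) * np.log(1 - q1))

R_T = np.sum(P_T * xent(P_Y1, Q_Y1))                    # eq 9
R_T_is = np.sum(P_S * (P_T / P_S) * xent(P_Y1, Q_Y1))   # eq 10 (importance sampled)
R_S = np.sum(P_S * xent(P_Y1, Q_Y1))

# ε bounds P_T(Y|X)/P(Ŷ|X) over both classes, a.s. under P_T.
eps = max(np.max(P_Y1 / Q_Y1), np.max((1 - P_Y1) / (1 - Q_Y1)))
H = -(Q_Y1 * np.log(Q_Y1) + (1 - Q_Y1) * np.log(1 - Q_Y1))  # H(Ŷ|X)
bound = R_S + eps * np.sum(P_T * np.exp(-P_S / P_T) * H)    # eq 14
```

On this instance, eq 9 and eq 10 agree exactly, and the weighted entropy bound of eq 14 indeed upper bounds R_T.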

A.2 DATASET DESCRIPTION

The detailed descriptions of the datasets used in this work are as follows:

• Adult is a dataset from the UCI repository containing details of individuals. The output variable is an indicator of whether the adult makes over $50k a year.

• Drug: following prior work, we used the dataset with 1885 samples and 11 features.

A.3 BASELINES

We use the following baselines for comparison; this covers the exhaustive set of relevant methods described in section 2.

• MLP is the standard Multi-Layer Perceptron classifier that takes neither shift nor fairness properties into account. In the standard in-distribution evaluation setting, such a model usually provides an upper bound on accuracy when fairness is not considered; however, the scenario differs as we are dealing with distribution shifts.

• Adversarial Debiasing (AD) (Zhang et al., 2018) is one of the most popular debiasing methods in the literature. This method performs well on the fairness metrics under the standard in-distribution evaluation setting, but fails to do so in the shift setting.

• Robust Fair (RF) (Mandal et al., 2020) proposes a framework to learn classifiers that are fair not only with respect to the training distribution, but also for a broad set of distributions characterized by arbitrary weighted combinations of the dataset.

• Robust Shift Fair (RSF) (Rezaei et al., 2021) is a recent and the most relevant baseline to this work. The authors propose a method to robustly learn a classifier under covariate shift with fairness constraints. A severe limitation of this method is that it requires explicit estimation of both the source and target covariate distributions.

• Z-Score Adaptation (ZSA): following the thread of work in the Batch Norm Adaptation literature (Li et al., 2017; Schneider et al., 2020), we implement a baseline that adapts the parameters of the normalizing layer by recomputing the z-score statistics from the unlabeled test data points.
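A minimal sketch of the ZSA baseline's adaptation step, on synthetic stand-in features (the shift parameters below are assumptions for illustration, not from any benchmark):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical features: training data vs. covariate-shifted test data.
X_train = rng.normal(0.0, 1.0, size=(1000, 4))
X_test = rng.normal(2.0, 3.0, size=(200, 4))

# Normalization statistics fitted on the training data...
mu_s, sd_s = X_train.mean(0), X_train.std(0)

# ...are replaced by statistics recomputed on the unlabeled test batch,
# in the spirit of Batch-Norm adaptation (Li et al., 2017; Schneider et al., 2020).
mu_t, sd_t = X_test.mean(0), X_test.std(0)

z_stale = (X_test - mu_s) / sd_s    # stale train statistics: mis-centered
z_adapted = (X_test - mu_t) / sd_t  # ZSA: re-centered and re-scaled
```

Under shift, the stale statistics leave the test features badly centered, whereas the adapted z-scores are approximately zero-mean, unit-variance again.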

A.4 SHIFT CONSTRUCTION

To construct the covariate shift in the datasets, i.e., to introduce P_S(X, A) ≠ P_T(X, A), we adopt the following strategy, akin to the works of Rezaei et al. (2021); Gretton et al. (2008). First, all the non-categorical features are normalized by z-score. We then obtain the first principal component of the covariates and project the data onto it, denoting the projections by P_C. We assign a score to each point P_C[i] using the density function Ξ : P_C[i] → e^{γ·(P_C[i]−b)}/Z. Here, γ is a hyperparameter controlling the level of distribution shift under the split, b is the 60th percentile of P_C, and Z is the normalizing coefficient computed empirically. Using this, we sample 40% of the instances from the dataset as the test set and the remaining 60% as the training set. To construct the validation set, we further split the training subset to make the final train:validation:test ratio 5:1:4, where the test set is distribution shifted. Note that for large values of γ, all the points with P_C[i] > b have high density, increasing their probability of being sampled into the test set; this generates a sufficiently large distribution shift. Correspondingly, for smaller values of γ, the sampling probability of these points is not sufficiently high, leading to a higher overlap between the train and test distributions.
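The shift-construction recipe above can be sketched as follows; the feature matrix is a random stand-in and the helper names are ours, not from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for a z-scored tabular feature matrix.
X = rng.normal(size=(1000, 8))
Xz = (X - X.mean(0)) / X.std(0)

# Project onto the first principal component (top right-singular vector).
_, _, Vt = np.linalg.svd(Xz, full_matrices=False)
pc = Xz @ Vt[0]                    # P_C: scores on the first PC

# Exponential-tilt sampling density  Ξ(p) = exp(γ·(p - b)) / Z.
gamma = 10.0                       # shift magnitude γ
b = np.percentile(pc, 60)          # 60th percentile of P_C
scores = np.exp(gamma * (pc - b))
probs = scores / scores.sum()      # empirical normalizer Z

# Sample 40% of the points (without replacement) as the shifted test set.
n_test = int(0.4 * len(X))
test_idx = rng.choice(len(X), size=n_test, replace=False, p=probs)
train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
```

With a large γ, the test split concentrates on points above the 60th percentile of the principal-component scores, producing the intended covariate shift.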

A.5 IMPLEMENTATION DETAILS

We use the same model architecture across MLP and our method to ensure consistency. Following Wang et al. (2021b), a Fully Connected Network (FCN) with 4 layers is used, where the first two layers compose g and the subsequent layers compose h. For AD, we use an additional 2-layer FCN that serves as the adversarial head a : g(X) → A (similar to Wang et al. (2021b)). Unless specified otherwise, we use the following hyperparameters to train MLP, AD and ZSA. The number of epochs is set to 50 with Adam as the optimizer (Kingma & Ba, 2014) and a weight decay of 1e-5 (for the Adult dataset, the weight decay is 5e-4). The learning rate is initially set to 1e-3 and is decayed to 0 using the Cosine Annealing scheduler (Loshchilov & Hutter, 2017). A batch size of 32 is generally used to train the models. The gradients are clipped at the value of 5.0 to avoid explosion during training. The dropout (Srivastava et al., 2014) rate is set to 0.25 across the layers. For AD, the adversarial loss hyperparameter obtained via grid search is used. The RF and RSF works have tuned their models for specific architectures and corresponding hyperparameters (different from the aforementioned specifics); we perform another grid search over these hyperparameters and report the best results for comparison.

For our proposed method, we pre-train the model for 15 epochs with only the source empirical risk R_S. For the next 35 epochs we use the objective in eq 8, but with a larger training-data batch size to reduce the variance in the Monte Carlo estimation of the second constraint, (1/n) Σ_{X_i∈D_S} 1/F_w(g(X_i)) = 1. The value of m (the size of D_T) is kept at 50 for the main experiments, which is ≪ the size of D_S. The primary experiments are run with the shift magnitude γ = 10 (with ablations provided in section A.6.3). The constraints C_1 and C_2 mentioned in eq 8 are implemented as squared-error terms, where we minimize

$$c_1 \cdot \Big( \frac{1}{m} \sum_{X_i \in D_T} F_w(g(X_i)) - 1 \Big)^2 + c_2 \cdot \Big( \frac{1}{n} \sum_{X_i \in D_S} \frac{1}{F_w(g(X_i))} - 1 \Big)^2,$$

where c_1 and c_2 are hyperparameters controlling the relative importance of each constraint. The values of the tuple (λ_1, λ_2) are set as follows post grid search: Adult: (1, 0.01); Arrhythmia: (0.01, 0.005); Communities: (0.005, 0.0001); Drug: (0.1, 0.1). All experiments are run on a single NVIDIA Tesla V100 GPU.
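For concreteness, the penalty terms entering the composite objective can be sketched as below. This is a schematic with 1-D representations and random stand-ins for the model outputs and learned weights (the helper names are ours): it combines the e^{-F_w}-weighted entropy on D_T with an empirical 1-D Wasserstein matching term across protected groups.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in 1-D representations g(X) for the two protected groups in D_T.
g_a0 = rng.normal(0.0, 1.0, 300)
g_a1 = rng.normal(0.8, 1.2, 300)

def wasserstein_1d(u, v):
    """Empirical 1-Wasserstein distance between equal-size 1-D samples:
    mean absolute difference of sorted values (quantile coupling)."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

def entropy(q):
    """Binary prediction entropy H(Ŷ|X) per sample."""
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

# Random stand-ins for model probabilities and F_w(g(X)) on D_T.
q_test = rng.uniform(0.05, 0.95, 600)
w_hat = rng.uniform(0.2, 1.0, 600)

lam1, lam2 = 1.0, 0.01   # (λ1, λ2), e.g. the Adult setting
penalty = lam1 * np.mean(np.exp(-w_hat) * entropy(q_test)) \
    + lam2 * wasserstein_1d(g_a0, g_a1)
```

The quantile-coupling form of the 1-D Wasserstein distance keeps the matching term cheap to evaluate per batch.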

A.6.1 UNSUPERVISED ADAPTATION WITH OUR ENTROPY FORMULATION UNDER ASYMMETRIC SHIFT

The asymmetric shift setup described in section 3.1 provides a well-grounded motivation (section 4) and use case for explicitly handling shifts along with the unfairness objective. We complement the claim with empirical evidence here. The results in table 1 compare the performance across the metrics with and without our proposed formulation. The Wasserstein objective in eq 3 is retained in both settings. We observe significant improvements in both error and ∆ EOdds with our formulation. Particularly on the Drug dataset, we see an improvement of almost 4% in error and around 13× in ∆ EOdds; the gains are also notable on Arrhythmia.

Table 1: Comparison of the performance on using the unfairness objective without and with the unsupervised adaptation (our proposed entropy formulation). We observe substantial improvements in both error and ∆ EOdds. Numbers in parentheses represent the standard deviation across the 50 runs.

A.6.2 ABLATIONS OVER THE HYPERPARAMETERS λ1 AND λ2

In this section, we study the variation of the performance of our method against the hyperparameters governing error (λ_1) and ∆ EOdds (λ_2). While studying the effect of either, we keep the other constant. Table 2 reports the variation for λ_1, keeping λ_2 = 0.01 fixed. It is evident from the numbers that increasing λ_1 correlates strongly with a reduction in error, which saturates at 0.1. Higher values of λ_1 emphasize the minimization of the worst-case weighted entropy, thus helping to calibrate the network in regions across P_T. Furthermore, we observe significant improvements in ∆ EOdds, which is in line with the motivation of handling shifts along with an unfairness objective (section 4). Increasing λ_1 does not help beyond a threshold value, as the correct estimation of the true class for a given X under P_T becomes harder, particularly in regions far from the labeled in-distribution data. Imposing a very strong λ_1 can hurt the model performance.
The variation against λ_2, keeping λ_1 = 1 fixed, is reported in table 3. As λ_2 increases, we observe a gradual improvement in ∆ EOdds. This exhibits a maximum, after which the performance degrades drastically. This is because strongly penalizing L_Wass(D_T) with a small number of samples m leads to overfitting (illustrated by the large standard deviation) while matching P_T(X|A). This also hurts the optimization, as demonstrated by the substantial increase in error.

A.6.3 VARIATION OF THE MAGNITUDE OF SHIFT

We study the variation of the performance of our method against the magnitude of shift γ on Arrhythmia. A comparison against the best baseline, ZSA, is also provided. The variation of error is plotted in the left subfigure of figure 7. With no shift in the data (γ = 0), both methods exhibit small errors as D_T is in-distribution. As γ increases, ZSA shows a sudden, unstable increase in error, whereas our method exhibits a more gradual pattern and lower error compared to ZSA. This justifies that the weighted entropy objective helps. In terms of ∆ EOdds, we observe that our method is highly stable and performs consistently better than ZSA for larger shifts. We attribute this effect to the proposed objective, which optimizes the model to learn fairly under the shift and over the worst-case scenario.

Figure 7: Variation of error against γ (left subfigure) and ∆ EOdds against γ (right subfigure) on the Arrhythmia dataset. We observe that our method performs better in both metrics against the best baseline ZSA. While the error increases gradually, we observe substantially better ∆ EOdds for our method.

A.6.4 VARIATION OF THE SIZE OF D_T

Here, we study the dependence of the methods on the size of D_T. The left subfigure in figure 8 plots the variation of error against m. The error gradually decreases for our method and RSF as the estimation of the true test distribution improves and the optimization procedure covers a larger region of P_T.
This also makes the approximation by F_w much more reliable and closer to the true ratios. However, the results do not show notable improvements after a certain threshold, since we operate in an unsupervised regime over P_T: it becomes increasingly harder to correctly estimate the true class for a given X under P_T, particularly in regions far from the labeled in-distribution data. Interestingly, ZSA does not exhibit any improvement, which demonstrates that merely matching first- and second-order moments of the data is not sufficient to handle covariate shift. The right subfigure in figure 8 plots the variation of ∆ EOdds against m. Here, we observe a consistent reduction in ∆ EOdds, as more data from P_T helps in matching representations via an improved approximation of P_T(X|A). Further, this objective only deals with matching representations across the groups and does not stagnate as quickly with increasing m as the error does, since the error suffers from a lack of reliable estimation in regions far from in-distribution. We consistently outperform RSF in both the very small and the larger regimes of m, partly verifying the importance of F_w over a direct estimation of P_S and P_T as done by RSF. ZSA is substantially worse than both RSF and our method in terms of error. In terms of ∆ EOdds, it is only marginally better than our method for m = 10 and m = 20, but at a huge expense of prediction performance.

A.6.5 UNWEIGHTED ENTROPY VS OUR WEIGHTED ENTROPY FORMULATION

It is easy to observe that we can recover standard unlabeled test entropy minimization from our derivation. Formally, we can upper bound eq 14 to obtain the entropy objective as follows:

$$R_S + \epsilon \cdot \mathbb{E}_{P_T(X)}\Big[ e^{-\frac{dP_S(X)}{dP_T(X)}}\, H(\hat{Y}|X) \Big] < R_S + \epsilon \cdot \mathbb{E}_{P_T(X)}\Big[ H(\hat{Y}|X) \Big], \quad \because \; e^{-x} \le 1, \; \forall x \ge 0.$$

Our formulation thus provides a tighter bound than standard entropy and implicitly accounts for points in D_T that are close to P_S by assigning them low weight.
The experimental results comparing the two settings, both with and without the unfairness objective, are provided in table 4. Our formulation achieves substantially better results, with a relative improvement of around 33% in error. Note that, due to the fairness-error tradeoff, the standard (unweighted) entropy achieves better ∆ EOdds, but at the expense of a nearly random classifier, as evident from its error rate of nearly 50%. We also highlight the large standard deviation in the results achieved by unweighted entropy. This is largely because it seeks to minimize entropy across all m points, whereas our objective is more adaptive, based on the approximation of the importance ratio.

As the explicit computation of density values can be hard, we estimate the ratio dP_T(X)/dP_S(X) via a parametrized network for this class of baselines. Density ratio estimation methods were previously proposed by Sugiyama et al. (2007c) (KLIEP) and Kanamori et al. (2009) (LSIF); Menon & Ong (2016) analyzed these methods in a unifying framework. To experimentally demonstrate the efficacy of our method over the aforementioned, we use the KLIEP and LSIF density ratio estimation methods in the following manner. First, the importance ratio dP_T(X)/dP_S(X) is estimated from the unsupervised test samples and the available training samples using the KLIEP and LSIF losses (given in Menon & Ong (2016)) via a parametrized weight network s(X). Then, we train a classifier with the following instance-weighted cross-entropy loss and representation matching loss:

$$\min_{\theta(F = h \circ g)} \; \frac{1}{n} \sum_{(X_i, Y_i, A_i) \in D_S} s(X_i) \left( -\log P_{\theta(F)}(\hat{Y} = Y_i | X_i) \right) + \lambda \, L_{Wass}(D_T) \tag{16}$$

where s(X) is a non-negative function obtained by minimizing either

$$L_{KLIEP}(s) = \frac{1}{m} \sum_{X_i \in D_T} -\log s(X_i) + \Big( \frac{1}{n} \sum_{X_i \in D_S} s(X_i) - 1 \Big)^2 \tag{17}$$

or

$$L_{LSIF}(s) = \frac{1}{m} \sum_{X_i \in D_T} -s(X_i) + \frac{1}{2} \Big( \frac{1}{n} \sum_{X_i \in D_S} (s(X_i))^2 \Big).$$

The results are stated in tables 5 and 6.
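A minimal sketch of fitting s(X) with the KLIEP-style loss of eq 17, on an assumed Gaussian shift where the true ratio dP_T/dP_S = exp(x − 0.5) is known; the exponential-linear weight network and the finite-difference gradient descent are simplifications of ours, chosen for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy shift: P_S = N(0,1) (train), P_T = N(1,1) (test);
# the true ratio dP_T/dP_S is exp(x - 0.5), i.e. a = 1, b = -0.5 below.
x_s = rng.normal(0.0, 1.0, 4000)
x_t = rng.normal(1.0, 1.0, 4000)

def s(theta, x):
    a, b = theta
    return np.exp(a * x + b)   # positive parametrized weight "network"

def kliep_loss(theta):
    """Eq 17: -mean log s over D_T plus squared deviation of mean s over D_S from 1."""
    return -np.mean(np.log(s(theta, x_t))) + (np.mean(s(theta, x_s)) - 1.0) ** 2

theta = np.array([0.0, 0.0])
for _ in range(2000):
    grad = np.zeros(2)
    for i in range(2):   # central finite-difference gradient, for brevity
        e = np.zeros(2)
        e[i] = 1e-5
        grad[i] = (kliep_loss(theta + e) - kliep_loss(theta - e)) / 2e-5
    theta -= 0.05 * grad
```

The fitted slope should be positive (higher weight where the test density dominates), and the loss should drop well below its value at the trivial solution s ≡ 1.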
First, we observe that our method consistently outperforms these algorithms across the datasets. The relative improvement of our method is as high as ∼31% in error on the Adult dataset and ∼32.5× in ∆ EOdds on the Drug dataset against LSIF. Similar non-trivial margins can be noted on the other datasets. Second, the variance in the accuracies of the KLIEP- and LSIF-based importance weighting is very high on the Drug dataset. In particular, both KLIEP and LSIF exhibit up to 20-40× higher variance in error and up to 10-12× in ∆ EOdds. Key Takeaway: we attribute this to the phenomenon that, in the small-sample regime, importance-weighted training on the training dataset alone may not bring any improvement under covariate shift due to variance issues, and thus estimating the ratio can be insufficient; indeed, this has been pointed out in Menon & Ong (2016). We, on the other hand, propose a new formulation that optimizes an upper bound based on the ratio estimate; due to the negative exponent on the importance ratio, the ratio's effect on the loss does not induce such high variance, and our formulation also leverages unsupervised test samples at training time.

To further demonstrate the effectiveness of our method, we plot the Pareto frontier in figure 9 (variance bars are removed to retain clarity), similar to Agarwal et al. (2018). Achievable tradeoffs for the baselines are plotted along with our Pareto curve for comparison. We observe that the curve corresponding to our method is closer to the left axis with a high gradient: the error reduces drastically for a small increase in ∆ EOdds, while providing better tradeoffs compared to the optimal performance of the baselines.

We compare the ratio of the prediction probabilities for the classes (y ∈ {0, 1}) on the validation set (which is not available to our algorithm during training) between a classifier trained only on the training set (Train) and a classifier trained only on the held-out test set (Test).
We plot the ratios in figure 10 with outliers removed. Subfigures (a), (b) show the ratio for the true class label of each sample; subfigures (c), (d) show the ratio for class y = 0; and subfigures (e), (f) show the ratio for class y = 1. Correspondingly, in figure 11 we plot the ratios with outliers. Note that at most 4 points in every plot are outliers with ratios > 5. This empirically justifies that ε can be set not too high with high probability, except for a few outliers.

A.6.9 COMPARISON OF ACCURACY PARITY

We further demonstrate that our method is better than the baselines when the fairness metric is Accuracy Parity. The results for the Adult dataset are provided in table 7.

In this section, we contrast generalization bounds between our objective (the right-hand side of eq 5) and the left-hand side, which is the importance-sampled training loss. The main intention is to bring out the dependence on the variance of the importance ratios. Therefore, we make the following simplifying assumptions:

Assumption 3.
• Let Θ = {θ_1, ..., θ_k} be the parameters of a finite set of classifiers of the form P_θ(Ŷ|X).
• Let us assume the loss function ℓ_1(Y, X; θ) = −log P_θ(Y|X) is bounded in [0, 1] on the domain {0, 1} × X for all θ ∈ Θ.
• Let the loss function ℓ_2(X; θ) = Σ_{y∈{0,1}} −P_θ(Ŷ = y|X) log P_θ(Ŷ = y|X) also be bounded in [0, 1] on X for all θ ∈ Θ.
• Let us assume we have access to the exact importance weight w(X) = P_T(X)/P_S(X). Since we assume P_T(·) and P_S(·) are absolutely continuous with respect to each other, w(X) > 0 for all X ∈ X. For convenience of notation, let w̄(X) = P_S(X)/P_T(X).
• Let sup_{X∈X} w(X) = M, and let the second moment of the importance ratio with respect to the training distribution be E_{X∼P_S}[w²(X)] = σ².

Remark:

We have assumed ℓ_1 is bounded in [0, 1]. If the log loss over a suitable function class is Lipschitz and the domain is bounded, then the loss is also bounded; therefore this is not a very heavy assumption, and it keeps the analysis simple and normalized. There are two loss functions we compare:

1. $\hat{R}_{IS}(\theta) = \frac{1}{|D_S|} \sum_{(X_i, Y_i) \in D_S} w(X_i)\, \ell_1(Y_i, X_i; \theta)$, and

2. $\hat{R}_{WE}(\theta) = \frac{1}{|D_S|} \sum_{(X_i, Y_i) \in D_S} \ell_1(Y_i, X_i; \theta) + \lambda \frac{1}{|D_T|} \sum_{X_i \in D_T} e^{-\bar{w}(X_i)}\, \ell_2(X_i; \theta)$.

$\hat{R}_{IS}(\theta)$ is the empirical importance-sampled loss, while $\hat{R}_{WE}(\theta)$ is the empirical weighted entropy objective of Theorem 2. We recall some generalization bounds for finite hypothesis classes with bounded risks.

Definition 3. The Rademacher complexity R(A) of a finite set A = {a_1, a_2, ..., a_N} ⊂ R^n is given by

$$\mathcal{R}(A) = \mathbb{E}_{\sigma}\Big[ \sup_{a \in A} \sum_i \sigma_i\, a[i] \Big],$$

where σ is a sequence of n i.i.d. Rademacher variables, each uniformly sampled from {−1, +1}, and a[i] is the i-th coordinate of the vector a.

Theorem 5 (Bousquet et al. (2003)). When a dataset D is sampled i.i.d. from a distribution P(X) and every f ∈ F is uniformly bounded by L over the domain of P, then with probability 1 − δ over the draw of D, for all f ∈ F,

$$\mathbb{E}_{x \sim P}[f(x)] \le \frac{1}{|D|} \sum_{x \in D} f(x) + \frac{2}{|D|} \mathcal{R}(F(D)) + 3L\sqrt{\frac{\log(2/\delta)}{2|D|}}.$$

Proof of Theorem 6. We apply Theorem 5 to ℓ_1(·) (which is bounded by 1) and to e^{−w̄(·)} ℓ_2(·), where ℓ_2(·) ≤ 1 and e^{−w̄(·)} ≤ 1, with the appropriate datasets from Assumption 3. We then use a union bound over the two error events that result from applying the theorem twice.

For finite hypothesis classes, we recall generalization bounds for importance-sampled losses from Cortes et al. (2010b).

Theorem 7 (Cortes et al. (2010b)). Suppose that a dataset D is sampled i.i.d. from a distribution P(X), f is uniformly bounded by L over the domain of P, and a fixed weighting function w(x) satisfies sup_x w(x) = M and E_{x∼P}[w(x)²] ≤ σ². Consider the weighted loss function f̃(x) = w(x) f(x); we denote f̃ = w · f. Then with probability 1 − δ over the draw of D, we have:

$$\mathbb{E}_{x \sim P}[\tilde{f}(x)] \le \frac{1}{|D|} \sum_{x \in D} \tilde{f}(x) + \frac{2ML\log(1/\delta)}{3|D|} + \sqrt{\frac{2\sigma^2 L^2 \log(1/\delta)}{|D|}}.$$



Figure 1: Both Error (in %, left) and Equalized Odds (right) for the SOTA fairness method Adversarial Debiasing exhibit strong degradation as the magnitude of covariate shift increases. Three scenarios corresponding to no shift, intermediate shift and high shift are plotted (details on shift construction are provided in the experiments).

Figure 3: Comparison of our method against the baselines under covariate shift. The bars provide the standard deviation intervals for both error (vertical) and ∆ EOdds (horizontal).

Figure 4: Comparison of our method against the baselines under asymmetric covariate shift for group A = 0.

Figure 5: Comparison of our method against the baselines under asymmetric covariate shift for group A = 1.

Figure 6: Comparison of the ratio estimated via F_w(g(X)) across D_T and D_S. The network learns the importance ratios w.r.t. P_T and P_S.

Figure 8: Variation of error against m (left subfigure) and ∆ EOdds against m (right subfigure) on Adult Dataset. We see reduction in both error and ∆ EOdds with increasing value of m.

Figure 10: The subfigures demonstrate the ratio of the prediction probabilities for the classes (y ∈ {0, 1}) on the validation set between a classifier trained only on the training set (Train) and a classifier trained only on the held-out test set (Test), with outliers removed. Note that ε = 5 provides a reasonable threshold and holds for all the samples but for 4 outliers (shown in figure 11).

Figure 11: The subfigures demonstrate the ratio of the prediction probabilities for the classes (y ∈ {0, 1}) on the validation set between a classifier trained only on the training set (Train) and a classifier trained only on the held-out test set (Test), with outliers. At most 4 points in every plot are outliers with ratios > 5.

The Rademacher complexity of a finite class of functions F on a dataset D with m samples is given by R(F(D)), where F(D) = {vec(f(x), ∀x ∈ D), ∀f ∈ F} and vec(·) denotes the vector of entries (cf. Theorem 4, Bousquet et al. (2003)).

Applying Theorem 7 to $\hat{R}_{IS}(\theta)$, we have the following result.

Theorem 8. Under Assumption 3, with probability 1 − δ over the draw of D_S ∼ P_S, we have for all θ ∈ Θ:

$$\mathbb{E}_{P_S}[\hat{R}_{IS}(\theta)] \le \hat{R}_{IS}(\theta) + \frac{2ML(\log|\Theta| + \log(1/\delta))}{3|D_S|} + \sqrt{\frac{2\sigma^2(\log|\Theta| + \log(1/\delta))}{|D_S|}} \tag{23}$$

Algorithm 1: Gradient Updates for the proposed objective to learn fairly under covariate shift Input: Training data D S , Unlabelled Test data D T , model F , weight estimator F w , decaying learning rate η t , number of pre-training steps Ẽ, number of training steps E for eq 8, λ 1 , λ 2

Table 2: Variation of the performance of our method with the entropy regularizer λ_1 on the Adult dataset.

Table 3: Variation of the performance of our method with the Wasserstein regularizer λ_2 on the Adult dataset.

Table 4: Comparison of the performance of standard unweighted entropy vs. our weighted entropy formulation on the Communities dataset.

Table 5: Comparison of our method against popular density ratio estimation methods (KLIEP and LSIF) on the Drug and Adult datasets.

Table 6: Comparison of our method against popular density ratio estimation methods (KLIEP and LSIF) on the Communities and Arrhythmia datasets.

Figure 9: Fairness-error tradeoff curves for our method (Pareto frontier) against the optimal performance of the baselines. Our method provides better tradeoffs in all cases (on the Drug dataset, the performance is concentrated around the optimal point). Variance bars are removed to retain clarity in the plots.

Table 7: Comparison of Accuracy Parity as well as error for all the methods. We outperform the baselines, particularly KLIEP and LSIF, which are prone to poor results due to high variance.

Theorem 6. Under Assumption 3, with probability 1 − 2δ over the draws of D_S ∼ P_S and D_T ∼ P_T, we have for all θ ∈ Θ:

$$\mathbb{E}_{P_S, P_T}[\hat{R}_{WE}(\theta)] \le \hat{R}_{WE}(\theta) + \frac{2(\log|\Theta| + \log(1/\delta))}{3|D_S|} + \sqrt{\frac{2(\log|\Theta| + \log(1/\delta))}{|D_S|}} + \lambda\left[ \frac{2(\log|\Theta| + \log(1/\delta))}{3|D_T|} + \sqrt{\frac{2(\log|\Theta| + \log(1/\delta))}{|D_T|}} \right]$$


Proof. The proof is a direct application of Theorem 7 to $\hat{R}_{IS}(\theta)$ under Assumption 3.

Key Takeaways: Comparing Theorem 6 and Theorem 8, we see that the generalization bound for the importance-sampled training loss depends on the variance of the importance ratio and on the worst-case ratio over the training set (σ² and M). In contrast, our objective does not depend on these parameters, primarily due to the negative exponential dependence on w̄. We also note that $\hat{R}_{WE}$ depends on the size of the test set, while the other does not appear to; however, $\hat{R}_{IS}$ needs to estimate the importance ratios, which will itself depend on the test set. We have analyzed both losses assuming the importance ratios are known, purely to bring out the difference in the dependence on the other parameters. Remark: In Assumption 3, we have assumed a finite hypothesis class Θ. However, our result for Theorem 6 would generalize (as is) with Rademacher complexity or covering-number based arguments for infinite function classes of ℓ_1 and ℓ_2. Cortes et al. (2010b) also point out an analogous generalization for the importance-sampling loss.
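The contrast in the bounds can be illustrated numerically. Under an assumed Gaussian shift (an illustration of ours, not from the paper), the raw importance ratio w has a large second moment, which enters Theorem 8 via σ², while the weight e^{−w̄} used by our objective is confined to (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy shift P_S = N(0,1), P_T = N(1,1):
# w = dP_T/dP_S = exp(x - 0.5) is unbounded over P_S,
# while exp(-w̄) = exp(-dP_S/dP_T) always lies in (0, 1).
x_s = rng.normal(0.0, 1.0, 100000)

w = np.exp(x_s - 0.5)      # importance ratio on training samples
w_bar = 1.0 / w            # reverse ratio dP_S/dP_T
ew = np.exp(-w_bar)        # weight used by the weighted entropy objective

var_w = w.var()            # large: drives the Theorem 8 bound via σ²
var_ew = ew.var()          # at most 1/4, since ew ∈ (0, 1)
```

Even in this mild one-dimensional shift, the raw ratio's variance exceeds 1 (its population value is e − 1 ≈ 1.72), while the exponentiated weight's variance stays below the 1/4 cap for any (0, 1)-valued variable.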

