DOMAIN-INDEXING VARIATIONAL BAYES: INTERPRETABLE DOMAIN INDEX FOR DOMAIN ADAPTATION

Abstract

Previous studies have shown that leveraging domain indices can significantly boost domain adaptation performance (Wang et al., 2020; Xu et al., 2022). However, such domain indices are not always available. To address this challenge, we first provide a formal definition of domain index from the probabilistic perspective, and then propose an adversarial variational Bayesian framework that infers domain indices from multi-domain data, thereby providing additional insight on domain relations and improving domain adaptation performance. Our theoretical analysis shows that our adversarial variational Bayesian framework finds the optimal domain index at equilibrium. Empirical results on both synthetic and real data verify that our model can produce interpretable domain indices which enable us to achieve superior performance compared to state-of-the-art domain adaptation methods. Code is available at https://github.com/Wang-ML-Lab/VDI.

1. INTRODUCTION

In machine learning, it is standard to assume that training data and test data share an identical distribution. However, this assumption is often violated (Ganin & Lempitsky, 2015; Romera et al., 2019; Sun et al., 2017; Yuan et al., 2019; Ramponi & Plank, 2020) when training and test data come from different domains. Domain adaptation (DA) tries to solve such a cross-domain generalization problem by producing domain-invariant features. Typically, DA methods enforce independence between a data point's latent representation and its domain identity, a one-hot vector indicating which domain the data point comes from (Ganin et al., 2016; Tzeng et al., 2017; Zhao et al., 2017; Zhang et al., 2019). More recent studies have found that using a domain index, a real-valued scalar (or vector) embedding domain semantics, in place of the domain identity significantly boosts domain adaptation performance (Wang et al., 2020; Xu et al., 2022). For instance, Wang et al. (2020) adapted sleeping-stage prediction models across patients of different ages, with "age" as the domain index, and achieved superior performance compared to traditional models that split patients into groups by age and used discrete group IDs as domain identities (more discussion in Sec. J).

Although significant progress has been made in leveraging domain indices to improve domain adaptation (Wang et al., 2020; Xu et al., 2022), a major challenge remains: domain indices are not always available. This severely limits the applicability of such indexed DA methods, and motivates a natural question: Can one infer the domain index as a latent variable from data? This prompts us to first develop an expressive and formal definition of "domain index". We argue that an effective domain index (1) is independent of the data's encoding, (2) retains as much information on the data as possible, and (3) maximizes adaptation performance, e.g., accuracy (see Sec. 3.2 for rigorous descriptions).
With this definition, we then develop an adversarial variational Bayesian deep learning model (Wang et al., 2015; Wang & Yeung, 2016; 2020) that describes intuitive conditional dependencies among the input data, labels, encodings, and the associated domain indices. Our theoretical analysis shows that maximizing our model's evidence lower bound while adversarially training an additional discriminator is equivalent to inferring the optimal domain indices under our definition.

3.1. PROBLEM SETTING AND NOTATION

We consider the unsupervised domain adaptation setting with $N$ domains in total. Each domain has domain identity $k \in \mathcal{K} = [N] \triangleq \{1, \dots, N\}$; $k$ is in either the source domain identity set $\mathcal{K}_s$ or the target domain identity set $\mathcal{K}_t$. Each domain $k$ has $D_k$ data points. Given $n$ labeled data points $\{(x_i^s, y_i^s, k_i^s)\}_{i=1}^{n}$ from source domains ($k_i^s \in \mathcal{K}_s$) and $m$ unlabeled data points $\{(x_i^t, k_i^t)\}_{i=1}^{m}$ from target domains ($k_i^t \in \mathcal{K}_t$), we want to (1) predict the labels $\{y_i^t\}_{i=1}^{m}$ for target-domain data, and (2) infer a global domain index $\beta_k \in \mathbb{R}^{B_\beta}$ for each domain and a local domain index $u_i \in \mathbb{R}^{B_u}$ for each data point. $\alpha = \{\mu_\alpha, \sigma_\alpha\}$ are the hyper-parameters of the prior distributions of $\{\beta_k\}_{k=1}^{N}$. Note that each domain has only one global domain index, but has multiple local domain indices, one for each data point in the domain (more details in Sec. 3.3). We denote by $z \in \mathbb{R}^{B_z}$ the data encoding generated from an encoder that takes $x$ as input. We use $I(\cdot;\cdot)$ to denote mutual information.

3.2. FORMAL DEFINITION OF DOMAIN INDEX

We formally define "domain index" as follows (please refer to the notation in Sec. 3.1 if needed):

Definition 3.1 (Domain Index). Given data $x$ and label $y$, a domain-level variable $\beta$ and a data-level variable $u$ are called global and local domain indices, respectively, if there exists a data encoding $z$ such that the following holds: (1) Independence between $\beta$ and $z$: the global domain index $\beta$ is independent of the data encoding $z$, i.e., $\beta \perp z$, or equivalently $I(\beta; z) = 0$. This encourages a domain-invariant data encoding $z$. (2) Information Preservation of $x$: the data encoding $z$, local domain index $u$, and global domain index $\beta$ preserve as much information on $x$ as possible, i.e., they maximize $I(x; u, \beta, z)$. This prevents $\beta$ and $u$ from collapsing to trivial solutions. (3) Label Sensitivity of $z$: the data encoding $z$ should contain as much information on the label $y$ as possible to maximize prediction power, i.e., it maximizes $I(y; z)$ subject to $z \perp \beta$. This makes sure the previous two constraints on $\beta$, $u$, and $z$ do not harm prediction performance. To summarize, $\beta$ and $u$ are the global and local domain indices, respectively, if
$(\beta, u) = \arg\max_{\beta, u} \; I(x; u, \beta, z) + I(y; z) \quad \text{s.t.} \quad I(\beta; z) = 0.$
Later in Sec. 4, our theoretical analysis shows that maximizing our model's evidence lower bound while adversarially training an additional discriminator (Sec. 3.4) is equivalent to inferring the optimal domain indices according to Definition 3.1. In Appendix A, we provide a rigorous discussion on the definition of "domain index".

3.3. GENERATIVE PROCESS AND PROBABILISTIC GRAPHICAL MODEL

(c) Draw the data encoding $z_i$ from the Gaussian distribution $p_\theta(z_i|x_i, u_i, \beta_k)$. (d) Draw the label $y_i$ from the distribution $p_\theta(y_i|z_i)$. Besides the typical conditional dependencies defined in the graphical model (Fig. 1 (left)), we enforce additional independence between $\beta$ and $z$; this independence is represented as a dashed line in Fig. 1 (left). Note that there are multiple ways to satisfy such constraints during learning, e.g., using adversarial methods (Ganin et al., 2016; Tzeng et al., 2017; Zhang et al., 2019; Wang et al., 2020; Xu et al., 2022) or the concentration loss (Xiao et al., 2021).

Generative Model and Inference Model. Based on Fig. 1 (left), we factorize the generative model $p_\theta(x, u, \beta, z, y|\alpha)$ into five conditional distributions (omitting the subscript $i$ for clarity below):
$p_\theta(x, u, \beta, z, y|\alpha) = p_\theta(\beta|\alpha)\, p_\theta(u|\beta)\, p_\theta(x|u)\, p_\theta(z|x, u, \beta)\, p_\theta(y|z),$
where $\theta$ denotes the collection of parameters for the generative model, and $p_\theta(\beta|\alpha) = \mathcal{N}(\mu_\alpha, \sigma_\alpha^2)$ is a Gaussian distribution. The predictor $p_\theta(y|z)$ is a categorical distribution $\mathrm{Cat}(f_y(z;\theta))$ for classification tasks and a Gaussian distribution $\mathcal{N}(\mu_y(z;\theta), \sigma_y^2(z;\theta))$ for regression tasks; here $f_y(z;\theta)$, $\mu_y(z;\theta)$, and $\sigma_y(z;\theta)$ are neural networks taking $z$ as input. Similarly, we have $p_\theta(u|\beta) = \mathcal{N}(\mu_u(\beta;\theta), \sigma_u^2(\beta;\theta))$, $p_\theta(x|u) = \mathcal{N}(\mu_x(u;\theta), \sigma_x^2(u;\theta))$, and $p_\theta(z|x, u, \beta) = \mathcal{N}(\mu_z(x, u, \beta;\theta), \sigma_z^2(x, u, \beta;\theta))$.

We use an inference model $q_\phi(u, \beta, z|x)$ to approximate the posterior distribution of the latent variables, i.e., $p_\theta(u, \beta, z|x)$. As shown in Fig. 1 (right), we factorize $q_\phi(u, \beta, z|x)$ as
$q_\phi(u, \beta, z|x) = q_\phi(u|x)\, q_\phi(\beta|u)\, q_\phi(z|x, u, \beta),$
where $\phi$ denotes the collection of parameters for the inference model. Specifically, we have $q_\phi(u|x) = \mathcal{N}(\mu_u(x;\phi), \sigma_u^2(x;\phi))$, $q_\phi(\beta|u) = \mathcal{N}(\mu_\beta(u;\phi), \sigma_\beta^2(u;\phi))$, and $q_\phi(z|x, u, \beta) = \mathcal{N}(\mu_z(x, u, \beta;\phi), \sigma_z^2(x, u, \beta;\phi))$.
Note that $\mu_\bullet(\cdot;\cdot)$ and $\sigma_\bullet(\cdot;\cdot)$ denote neural networks, and $\theta$, $\phi$ are their parameters. The distribution $q_\phi(\beta|u)$ requires special treatment and will be discussed in Eqns. (14)-(17) below.
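To make the parameterization concrete, below is a minimal numpy sketch of the inference path $q(u|x) \to q(\beta|u) \to q(z|x,u,\beta)$ with linear Gaussian heads and reparameterized sampling. This is not the paper's implementation: the dimensions, weight shapes, and helper names (`gaussian_head`, `sample`) are illustrative, and the per-domain aggregation of $\beta$ (Eqns. 14-17) is omitted, so $\beta$ is sampled per data point here.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_head(x, W_mu, W_logvar):
    """Map an input to the mean and log-variance of a diagonal Gaussian."""
    return x @ W_mu, x @ W_logvar

def sample(mu, logvar):
    """Reparameterized Gaussian sample: mu + sigma * eps."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Illustrative dimensions: x in R^10, u in R^4, beta in R^2, z in R^8.
Dx, Bu, Bb, Bz = 10, 4, 2, 8
Wu_mu, Wu_lv = rng.normal(size=(Dx, Bu)), rng.normal(size=(Dx, Bu))
Wb_mu, Wb_lv = rng.normal(size=(Bu, Bb)), rng.normal(size=(Bu, Bb))
Wz_mu, Wz_lv = (rng.normal(size=(Dx + Bu + Bb, Bz)),
                rng.normal(size=(Dx + Bu + Bb, Bz)))

x = rng.normal(size=(5, Dx))                       # a minibatch of 5 points
u = sample(*gaussian_head(x, Wu_mu, Wu_lv))        # q(u|x)
beta = sample(*gaussian_head(u, Wb_mu, Wb_lv))     # q(beta|u)
xub = np.concatenate([x, u, beta], axis=1)
z = sample(*gaussian_head(xub, Wz_mu, Wz_lv))      # q(z|x, u, beta)
```

In a full model each `gaussian_head` would be an MLP (or ResNet for images, per Appendix E), but the sampling structure is the same.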

3.4. OBJECTIVE FUNCTION

Evidence Lower Bound. We use an evidence lower bound (ELBO) as the objective to learn the generative and inference models. Maximizing the ELBO learns the optimal variational distribution $q_\phi(u, \beta, z|x)$ that best approximates the posterior distribution of the latent variables (including the domain indices), $p_\theta(u, \beta, z|x)$. Specifically, the ELBO is
$\mathcal{L}_{ELBO}(x, y) = \mathbb{E}_{q_\phi(u, \beta, z|x)}[\log p_\theta(x, u, \beta, z, y|\alpha)] - \mathbb{E}_{q_\phi(u, \beta, z|x)}[\log q_\phi(u, \beta, z|x)]. \quad (9)$
With the factorizations in Eqn. 1 and Eqn. 5, we decompose the ELBO as (omitting $\alpha$ to avoid clutter):
$\mathcal{L}_{ELBO}(x, y) = \mathbb{E}_{q_\phi(u|x)}[\log p_\theta(x|u)] \quad (10)$
$+\; \mathbb{E}_{q_\phi(u, \beta, z|x)}[\log p_\theta(y|z)] \quad (11)$
$+\; \mathbb{E}_{q_\phi(u|x)}\mathbb{E}_{q_\phi(\beta|u)}[\log p_\theta(u|\beta)] \quad (12)$
$-\; \mathbb{E}_{q_\phi(u, \beta, z|x)}\big[\mathrm{KL}[q_\phi(\beta|u)\,\|\,p_\theta(\beta)] + \mathrm{KL}[q_\phi(z|x, u, \beta)\,\|\,p_\theta(z|x, u, \beta)]\big] - \mathbb{E}_{q_\phi(u|x)}[\log q_\phi(u|x)]. \quad (13)$
Each term above is computable with our neural network parameterization (see the network structure in Fig. 2); for target domains, Eqn. 11 is excluded. Below, we describe the intuition behind each term. (1) Reconstruct Data x from u (Eqn. 10). $q_\phi(u|x)$ and $p_\theta(x|u)$ aim to reconstruct the data $x$ using the inferred $u$, encouraging $u$ to preserve as much information on $x$ as possible. (2) Predict Label y from Latent z (Eqn. 11). This term samples $u$, $\beta$, and $z$ from $q_\phi(u|x)$, $q_\phi(\beta|u)$, and $q_\phi(z|x, u, \beta)$, respectively, and then uses $z$ to predict $y$ in $p_\theta(y|z)$, encouraging $z$ to contain as much information on $y$ as possible to maximize prediction performance. (3) Reconstruct Local Domain Index u from β (Eqn. 12). Eqn. 12 samples $u$ and $\beta$ from $q_\phi(u|x)$ and $q_\phi(\beta|u)$, respectively, and then uses the inferred $\beta$ to reconstruct the local domain index $u$ in $p_\theta(u|\beta)$, encouraging $\beta$ to preserve as much information on $u$ (and hence on $x$) as possible. (4) Regularize All Latent Variables u, β, z (Eqn. 13). Eqn. 13 includes two KL divergence terms between the inference model $q_\phi(\cdot)$ and the generative model $p_\theta(\cdot)$, as well as an entropy term for $q_\phi(u|x)$; they all serve as regularizers to prevent overfitting.
For example, the first regularization term implies that $q_\phi(\beta|u)$ should stay close to the prior distribution $p_\theta(\beta)$. In this way, VDI infers the domain indices $\beta$ and $u$ given only the data $(x_i, y_i)$ and domain identities $k_i$, thereby providing better interpretability and domain adaptation performance. See Sec. K for more discussion on $\beta$ and $u$.
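Since all the distributions involved are diagonal Gaussians, the KL regularizers in Eqn. 13 are available in closed form. The snippet below is the generic diagonal-Gaussian KL identity that such terms reduce to; it is a standard identity, not code from the paper.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL[N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p)))],
    summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Sanity check: the KL between identical Gaussians is zero.
mu = np.array([0.3, -1.2])
lv = np.array([0.1, -0.5])
print(gaussian_kl(mu, lv, mu, lv))  # → 0.0
```

For $\mathrm{KL}[q_\phi(\beta|u)\|p_\theta(\beta)]$, `mu_p` and `logvar_p` would come from the prior parameters $\alpha$, and `mu_q`, `logvar_q` from the inference network.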


Difference between Domain Identities k and Global Domain Indices β. Note that domain identities $k$ are discrete values and therefore cannot describe rich relations (e.g., similarity and distance) among domains. In contrast, global domain indices $\beta$ are continuous vectors and therefore carry much richer information describing such relations (see Sec. 5 for empirical results). Our VDI assumes $k$ is available and tries to infer $\beta$.

Inferring Global Domain Indices $q_\phi(\beta|u)$ in Eqn. 5. For each domain $k$, the global domain index $\beta_k$ should aggregate the domain information of all data in this domain. We therefore propose to leverage the local domain indices of all of domain $k$'s data points to infer $\beta_k$, in four steps. (1) Grouping $u_i$ in Domain k. Group all local domain indices from the same domain $k$ into one local index matrix (set), i.e., $U_k = [u_i]_{k_i = k} \in \mathbb{R}^{D_k \times B_u}$. (2) Pairwise Domain Distances. Calculate the Earth Mover's distance (EMD) (Rubner et al., 2000) between each pair of local index matrices (sets), $S_{k,j} = f_{EMD}(U_k, U_j)$, where $S_{k,j}$ is the EMD between domains $k$ and $j$, yielding a matrix $S \in \mathbb{R}^{N \times N}$. (3) Raw Global Domain Indices. Given the pairwise domain distance matrix $S$, use multi-dimensional scaling (MDS) (Borg & Groenen, 2005) to map each domain $k$ into a $B_\beta$-dimensional space and obtain the raw global domain index $\beta_k^r \in \mathbb{R}^{B_\beta}$, i.e., $[\beta_k^r]_{k=1}^{N} = f_{MDS}(S) = [f_{MDS}^k(S)]_{k=1}^{N}$. (4) Final Global Domain Indices. Feed the raw index $\beta_k^r$ into the inference neural network to obtain the variational distribution $\mathcal{N}(\mu_r(\beta_k^r; \phi), \sigma_r^2(\beta_k^r; \phi))$ for the final global domain index $\beta_k \in \mathbb{R}^{B_\beta}$, where $\phi$ denotes the inference network parameters. We summarize these four steps below:
Grouping $u_i$ in Domain k: $U_k = [u_i]_{k_i = k} \in \mathbb{R}^{D_k \times B_u}$, (14)
Pairwise Domain Distances: $S = [f_{EMD}(U_k, U_j)]_{k=1,j=1}^{N,N} \in \mathbb{R}^{N \times N}$, (15)
Raw Global Domain Indices: $\beta_k^r = f_{MDS}^k(S) \in \mathbb{R}^{B_\beta}$, (16)
Final Global Domain Indices: $\beta_k \sim \mathcal{N}(\mu_r(\beta_k^r; \phi), \sigma_r^2(\beta_k^r; \phi)) \in \mathbb{R}^{B_\beta}$. (17)
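Steps (14)-(16) can be sketched as follows. This is an illustrative implementation, not the paper's code: it assumes equal-size point sets with uniform weights, in which case the EMD reduces to a minimum-cost matching (computable with scipy's `linear_sum_assignment`), and it uses classical (Torgerson) MDS on the resulting distance matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(U, V):
    """EMD between two equal-size point sets with uniform weights:
    reduces to a minimum-cost perfect matching."""
    C = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean()

def classical_mds(S, dim=2):
    """Classical (Torgerson) MDS: double-center the squared distance
    matrix and embed with the top eigenvectors."""
    n = S.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (S ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:dim]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

rng = np.random.default_rng(0)
# Three toy "domains", each a set of 50 local indices u_i in R^4.
U = [rng.normal(loc=c, size=(50, 4)) for c in (0.0, 1.0, 5.0)]
N = len(U)
S = np.array([[emd(U[k], U[j]) for j in range(N)] for k in range(N)])
beta_raw = classical_mds(S, dim=2)   # raw global domain indices (Eqn. 16)
```

The raw indices would then be fed through the inference network head of step (17); metric MDS or a learned embedding could be substituted without changing the overall pipeline.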
Discriminator with an Adversarial Loss. To enforce independence between $\beta$ and $z$, i.e., Part (1) of Definition 3.1, we train an additional discriminator $D$ with an adversarial loss while maximizing the ELBO in Eqn. 9. The discriminator is a neural network $D(\cdot)$ that takes $z$ as input and predicts the global domain index $\beta$ and domain identity $k$. Essentially, $D(\cdot)$ plays a minimax game with the encoder inference network $q_\phi(z|x, u, \beta)$: $D(\cdot)$ tries to reconstruct the global domain index $\beta$ and domain identity $k$, while the encoder $q_\phi(z|x, u, \beta)$ tries to prevent $D(\cdot)$ from doing so by generating a domain-invariant encoding $z$. Denoting by $R_D$ the reconstruction loss and by $\hat{\beta}$, $\hat{k}$ the discriminator's predictions, the discriminator loss can be written as
$\mathcal{L}_D = R_D(\hat{\beta}, \beta, \hat{k}, k). \quad (18)$
In Sec. 4, we will prove that $\beta$ is guaranteed to be independent of $z$ if $k$ is independent of $z$. We therefore simplify Eqn. 18 into only classifying the domain identity $k$ and use the log-likelihood as $\mathcal{L}_{D,\phi}$:
$\mathcal{L}_{D,\phi} = \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)]. \quad (19)$
Final Objective Function. Putting Eqn. 9 and Eqn. 19 together, we have our final objective function:
$\max_{\theta,\phi}\min_D \mathcal{L}_{VDI} = \max_{\theta,\phi}\min_D \big(\mathcal{L}_{\theta,\phi} - \lambda_d\, \mathcal{L}_{D,\phi}\big) = \max_{\theta,\phi}\min_D \; \mathbb{E}_{p(x,y)}[\mathcal{L}_{ELBO}(x, y)] - \lambda_d\, \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)], \quad (20)$
where $\lambda_d$ is a hyper-parameter balancing the two terms.
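As a toy illustration of the adversarial term, the sketch below evaluates the log-likelihood in Eqn. 19 with a linear softmax classifier standing in for $D$ (hypothetical code, not the paper's implementation). In training, the discriminator ascends this quantity while the encoder parameters descend it (scaled by $\lambda_d$), pushing $z$ toward domain invariance.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def discriminator_loglik(z, k, W):
    """Mean log D(k|z) for a linear softmax discriminator with weights W."""
    probs = softmax(z @ W)
    return np.mean(np.log(probs[np.arange(len(k)), k] + 1e-12))

rng = np.random.default_rng(0)
N, Bz, n = 3, 8, 30                 # 3 domains, 8-dim encodings, 30 points
z = rng.normal(size=(n, Bz))        # encodings from q(z|x, u, beta)
k = rng.integers(0, N, size=n)      # domain identities
W = rng.normal(size=(Bz, N))        # discriminator parameters

ll = discriminator_loglik(z, k, W)  # the quantity in the minimax game
```

When $z$ carries no domain information, the best the discriminator can do is predict the marginal $p(k)$, which corresponds to the $-H(k)$ optimum in the theory section.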

4. THEORY

Lemma 4.1 (Upper Bound of the ELBO).
$\mathbb{E}_{p(x,y)}[\mathcal{L}_{ELBO}(p_\theta(x, y))] \le I(y; z) + I(x; u, \beta, z) - [H(y) + H(x)].$
Lemma 4.2 (Information Decomposition of the Adversarial Loss).
$\max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)] = I(z; \beta) + I(z; k|\beta) - H(k),$
and the global minimum of $\max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)]$ is achieved if and only if $I(z; \beta) = I(z; k|\beta) = 0$.
Theorem 4.1 (Objective Function as a Lower Bound).
$\mathbb{E}_{p(x,y)}[\mathcal{L}_{ELBO}(x, y)] - \max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)] \quad (23)$
$\le I(y; z) + I(x; u, \beta, z) - I(z; \beta) - I(z; k|\beta) - [H(y) + H(x) - H(k)].$
As Theorem 4.2 states, the global optimum of Eqn. 20 is guaranteed to satisfy all three conditions in Definition 3.1; therefore, training VDI with the minimax objective in Eqn. 20 is equivalent to inferring the optimal domain indices.

5. EXPERIMENTS

In this section, we compare VDI with existing DA methods on both synthetic and real-world datasets.

5.1. DATASETS

Circle (Wang et al., 2020) is a synthetic dataset with 30 domains for binary classification (see Sec. F for details). TPT-48 (Xu et al., 2022) is a real-world regression dataset containing the monthly average temperature of the 48 contiguous US states from 2008 to 2019. We use the first 6 months' temperature as model input to predict the temperature of the following 6 months. We formulate two DA tasks (Fig. 4): • W (6) → E (42): adapting models from the 6 states in the west to the 42 states in the east. • N (24) → S (24): adapting models from the 24 states in the north to the 24 states in the south. We treat target domains one hop away from the closest source domain as Level-1 target domains, those two hops away as Level-2 target domains, and those more than two hops away as Level-3 target domains (see Fig. 4 for an illustration). For TPT-48, we report the average MSE over all target domains as well as the average MSE over Level-1, Level-2, and Level-3 target domains, respectively; note that there is only one DA model per column of Table 2, and we mark the best results in bold. CompCars (Yang et al., 2015) is a car image dataset with attributes including car type, viewpoint, and year of manufacture (YOM). The task is to recognize the car type given an image. In CompCars, the data with each viewpoint and each YOM is treated as a single domain. We choose a subset of CompCars with 4 car types (MPV, SUV, sedan, and hatchback) and 5 viewpoints (front (F), rear (R), side (S), front-side (FS), and rear-side (RS)).

5.2. BASELINES

We compare VDI with state-of-the-art DA methods, including the method of Zhao et al. (2017), Margin Disparity Discrepancy (MDD) (Zhang et al., 2019), SENTRY (Prabhu et al., 2021), and Domain to Vector (D2V) (Peng et al., 2020b), among others. We also report results when the model is trained only on the source domains without adapting to the target domains (Source-Only). Unlike VDI, which works for both classification and regression tasks, MDD, SENTRY, and D2V only work for classification tasks.
We managed to adapt MDD and SENTRY to the regression tasks on TPT-48 (see App. H for details); D2V cannot be adapted and thus has no results on TPT-48. Both Wang et al. (2020) and Xu et al. (2022) assume domain indices are available; therefore, they are not applicable to our setting, where the goal is to infer domain indices (which are unavailable from data).

5.3. RESULTS

Circle, DG-15 and DG-60. Table 1 shows the accuracy of the evaluated methods on Circle, DG-15, and DG-60; all three datasets have complex domain relations, making it challenging to perform domain adaptation without knowing the ground-truth domain indices. Indeed, we observe that on Circle, all baselines perform only marginally better than random guessing (50% accuracy). Moreover, on DG-15 most baselines perform even worse than random guessing, possibly due to overfitting the source domains. In contrast, our VDI achieves very high accuracy (over 94%) on all three datasets, significantly outperforming all baselines, thanks to the inferred indices (e.g., Fig. 3(d) and Fig. 7). To verify that VDI infers non-trivial domain indices β, we connect the domain pairs within a distance threshold ($\|\beta_k - \beta_j\| < \epsilon$) to reconstruct the domain graph on DG-15 and DG-60. Compared with the ground-truth domain graphs, VDI achieves an area under the ROC curve (AUC) of 0.83 for DG-15 and 0.91 for DG-60. Fig. 3(d) shows an example inferred domain graph for DG-15. TPT-48. Table 2 shows the mean squared error (MSE) of all methods on TPT-48. In terms of average MSE across all domains, we observe that most methods suffer from negative transfer on both tasks, with only DANN and SENTRY marginally improving upon Source-Only. In contrast, our VDI further improves the performance and achieves the lowest average MSE on both tasks. Fig. 5 shows that VDI's inferred domain indices are highly correlated with each domain's latitude and longitude. For example, Florida (FL) has the lowest latitude among all 48 states and is hence the left-most circle in Fig. 5 (left). We also observe that states with similar latitude or longitude do have similar domain indices β. These results demonstrate that VDI can infer reasonable domain indices. CompCars. Table 3 shows the classification accuracy of all DA methods.
Results show that most of the methods outperform Source-Only, with our VDI achieving the most significant improvement. 

6. CONCLUSION

We identify the problem of inferring domain indices as latent variables, provide a rigorous definition of "domain index", develop the first general method for addressing it, and provide detailed theoretical analysis as well as empirical results. We demonstrate the effectiveness of our proposed VDI for inferring domain indices and show its potential for significant practical applications. As a limitation, our method still assumes the availability of domain identities to identify different domains. Therefore it would be interesting future work to explore jointly inferring domain indices and domain identities.

B THEORETICAL ANALYSIS

Lemma B.1 (Upper Bound of the ELBO of $p_\theta(x, y)$). The ELBO of $p_\theta(x, y)$ is upper bounded by mutual information terms between the observed variables $x, y$ and the latent variables $u, \beta, z$:
$\mathbb{E}_{p(x,y)}[\mathcal{L}_{ELBO}(p_\theta(x, y))] \le I(y; z) + I(x; u, \beta, z) - [H(y) + H(x)].$
Optimizing the ELBO of $p_\theta(x, y)$ is thus equivalent to increasing the mutual information between the label $y$ and the latent variable $z$, and between the data $x$ and all latent variables $u, \beta, z$.

Proof. First, we derive the ELBO of $\log p_\theta(x, y)$:
$\log p_\theta(x, y) = \log \int p_\theta(x, u, \beta, z, y)\, dz\, d\beta\, du = \log \int p_\theta(x, u, \beta, z, y)\, \frac{q_\phi(u, \beta, z|x)}{q_\phi(u, \beta, z|x)}\, dz\, d\beta\, du$
$= \log \mathbb{E}_q\Big[\frac{p_\theta(x, u, \beta, z, y)}{q_\phi(u, \beta, z|x)}\Big] \ge \mathbb{E}_q\Big[\log \frac{p_\theta(x, u, \beta, z, y)}{q_\phi(u, \beta, z|x)}\Big] = \mathbb{E}_q\Big[\log \frac{p_\theta(y|x, u, \beta, z)\, p_\theta(x|u, \beta, z)\, p_\theta(u, \beta, z)}{q_\phi(u, \beta, z|x)}\Big]$
$= \mathbb{E}_q[\log p_\theta(y|z)] + \mathbb{E}_q[\log p_\theta(x|u, \beta, z)] - \mathrm{KL}[q_\phi(u, \beta, z|x)\,\|\,p_\theta(u, \beta, z)].$
Then we have:
$\mathcal{L}_{ELBO}(p_\theta(x, y)) = \mathbb{E}_q[\log p_\theta(y|z)] + \mathbb{E}_q[\log p_\theta(x|u, \beta, z)] - \mathrm{KL}[q_\phi(u, \beta, z|x)\,\|\,p_\theta(u, \beta, z)]. \quad (26)$
To aid the analysis, we introduce a helper joint distribution over $x, y, z$: $r(x, y, z) = p(x, y)\, q_\phi(z|x)$. This distribution has the following properties: (1) $r(x, y) = p(x, y)$, $r(x) = p(x)$, $r(y) = p(y)$; (2) $r(z|y, x) = q_\phi(z|x) = r(z|x)$; (3) $r(y|z, x) = p(y|x) = r(y|x)$; (4) $I_r(y; z|x) = 0$, i.e., $y \perp z\,|\,x$ under $r$ (implied by (2) and (3)).

The First Term of the ELBO. Its upper bound is derived as follows:
$\mathbb{E}_{p(x,y)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(y|z)] = \mathbb{E}_{r(x,y,z)}[\log p_\theta(y|z)] = \mathbb{E}_{r(y,z)}[\log p_\theta(y|z)] \le \mathbb{E}_{r(y,z)}[\log r(y|z)]$
$= \mathbb{E}_{r(y,z)}\Big[\log \frac{r(y|z)}{p(y)}\Big] + \mathbb{E}_{p(y)}[\log p(y)] = I_r(y; z) - H(y).$
When $p_\theta(y|z) = r(y|z)$, equality holds, so
$\max_{p_\theta}\; \mathbb{E}_{p(x,y)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(y|z)] = I_r(y; z) - H(y).$
Then we have:
$\mathbb{E}_{p(x,y)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(y|z)] \le I_r(y; z) - H(y). \quad (27)$
Furthermore, we prove another upper bound for the first term:
$\mathbb{E}_{p(x,y)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(y|z)] \le \mathbb{E}_{p(x,y)}\log\big[\mathbb{E}_{q_\phi(z|x)}\, p_\theta(y|z)\big] \le \mathbb{E}_{p(x,y)}[\log p(y|x)] = \mathbb{E}_{p(x,y)}\Big[\log \frac{p(y|x)}{p(y)}\Big] + \mathbb{E}_{p(y)}[\log p(y)] = I_p(y; x) - H(y).$
The first inequality becomes an equality when $p_\theta(y|z)$ is constant w.r.t. $z$ given $x$, i.e., $p_\theta(y|z, x) = p_\theta(y|x)$. The second inequality becomes an equality when $\mathbb{E}_{q_\phi(z|x)}\, p_\theta(y|z) = p(y|x)$. Let $q_\phi^*(z|x)$ and $p_\theta^*(y|z)$ denote $q_\phi(z|x)$ and $p_\theta(y|z)$ at this optimum, i.e., when $\log \mathbb{E}_{q_\phi(z|x)}\, p_\theta(y|z) = \log p(y|x) = \mathbb{E}_{q_\phi(z|x)} \log p_\theta(y|z)$. Then:
$\mathbb{E}_{p(x,y)} \log \mathbb{E}_{q_\phi^*(z|x)}[p_\theta^*(y|z)] = \mathbb{E}_{p(x,y)}\mathbb{E}_{q_\phi^*(z|x)}[\log p_\theta^*(y|z)] = I^*(y; z) - H(y) = I_p(y; x) - H(y).$
This establishes the relationship between the two upper bounds of the first term and the mutual information $I^*(y; z)$ at the optimal $q_\phi(z|x)$ and $p_\theta(y|z)$:
$\max_{p_\theta}\; \mathbb{E}_{p(x,y)}\mathbb{E}_{q(z|x)}[\log p_\theta(y|z)] = I_r(y; z) - H(y) \;\le\; \max_{q, p_\theta}\; \mathbb{E}_{p(x,y)}\mathbb{E}_{q(z|x)}[\log p_\theta(y|z)] = I^*(y; z) - H(y) = I_p(y; x) - H(y).$
Moreover, since $I_p(y; x) = \mathbb{E}_{p(x,y)}[\log p(y|x)] + H(y)$ and $\mathbb{E}_{p(x,y)}[\log p(y|x)] \le 0$, we have $I_p(y; x) \le H(y)$. Finally, $I_r(y; z) \le I^*(y; z) = I_p(y; x) \le H(y)$.

The Second Term of the ELBO. We introduce another helper joint distribution over $x, u, \beta, z$: $s(x, u, \beta, z) = p(x)\, q_\phi(u, \beta, z|x)$.
Its upper bound is derived as follows:
$\mathbb{E}_{p(x,y)}\mathbb{E}_q[\log p_\theta(x|u, \beta, z)] = \mathbb{E}_{p(x)}\mathbb{E}_q[\log p_\theta(x|u, \beta, z)]$
$= \mathbb{E}_{p(x)}\mathbb{E}_q\Big[\log \frac{q_\phi(x|u, \beta, z)}{p(x)}\Big] + \mathbb{E}_{p(x)}[\log p(x)] + \mathbb{E}_{p(x)}\mathbb{E}_q\Big[\log \frac{p_\theta(x|u, \beta, z)}{q_\phi(x|u, \beta, z)}\Big]$
$= I_s(x; u, \beta, z) - H(x) - \mathbb{E}_{q_\phi(u, \beta, z)}\,\mathrm{KL}[q_\phi(x|u, \beta, z)\,\|\,p_\theta(x|u, \beta, z)]$
$\le I_s(x; u, \beta, z) - H(x),$
where $I_s(x; u, \beta, z)$ is taken w.r.t. $s(x, u, \beta, z) = p(x)\, q_\phi(u, \beta, z|x)$. Then we have:
$\mathbb{E}_{p(x,y)}\mathbb{E}_q[\log p_\theta(x|u, \beta, z)] \le I_s(x; u, \beta, z) - H(x). \quad (28)$
Applying Eqn. 27 and Eqn. 28 to Eqn. 26, we have:
$\mathbb{E}_{p(x,y)}[\mathcal{L}_{ELBO}(p_\theta(x, y))] \le I_r(y; z) + I_s(x; u, \beta, z) - [H(y) + H(x)].$
For clarity, we write $I(y; z)$ and $I(x; u, \beta, z)$ in place of $I_r(y; z)$ and $I_s(x; u, \beta, z)$, respectively, in the rest of the appendix and in the main paper.

Lemma B.2 (Information Decomposition of the Adversarial Loss). The global maximum of the adversarial loss w.r.t. the discriminator $D$ decomposes as
$\max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)] = I(z; \beta) + I(z; k|\beta) - H(k),$
and the global minimum of $\max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)]$ is achieved if and only if $I(z; \beta) = I(z; k|\beta) = 0$. Minimizing the adversarial loss w.r.t. the encoder thus reduces both the mutual information between the data encoding $z$ and the global domain index $\beta$, and that between $z$ and the domain identity $k$ given $\beta$. Proof.
Define
$s(k, x, \beta, z) = p(k, x)\, q_\phi(z|x)\, q_\phi(\beta|k),$
$s(k, \beta, z) = q_\phi(\beta, z|k)\, p(k) = q_\phi(z|\beta, k)\, q_\phi(\beta|k)\, p(k),$
$s(k, z) = p(k)\, q_\phi(z|k) = s(k|z)\, q_\phi(z), \quad \text{where } q_\phi(z|k) = \mathbb{E}_{p(x|k)}[q_\phi(z|x)].$
Then we have:
$\mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)] = \mathbb{E}_{s(z,k)}[\log D(k|z)] \le \mathbb{E}_{s(z,k)}[\log s(k|z)].$
Equality holds when the discriminator is optimal, i.e., $D(k|z) = s(k|z)$. Since $q_\phi(\beta|k)$ is induced by a deterministic function $\beta = f(k)$ mapping $k$ to $\beta$, we have $q_\phi(z|\beta, k) = q_\phi(z|f(k), k) = q_\phi(z|k)$. Thus $s(k, \beta, z) = q_\phi(z|k)\, q_\phi(\beta|k)\, p(k)$, and the three random variables satisfy the Markov chain $\beta \leftarrow k \to z$. By the chain rule of mutual information, we have:
$I(z; \beta) + I(z; k|\beta) = I(z; \beta, k) = I(z; k) + I(z; \beta|k),$
where $I(z; \beta|k) = 0$ by the Markov chain above. So $I(z; k) = I(z; \beta) + I(z; k|\beta)$. We also have:
$\mathbb{E}_{s(z,k)}[\log s(k|z)] = \mathbb{E}_{s(z,k)}\Big[\log \frac{s(k|z)}{p(k)}\Big] + \mathbb{E}_{s(z,k)}[\log p(k)] = I(z; k) - H(k).$
Therefore,
$\max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)] = I(z; k) - H(k) = I(z; \beta) + I(z; k|\beta) - H(k),$
and
$\min_\phi \max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)] = 0 - H(k),$
attained when $I(z; k) = 0$, and thus $I(z; \beta) = I(z; k|\beta) = 0$.

Theorem B.1 (Objective Function as a Lower Bound). The objective function, which involves both the ELBO of $p_\theta(x, y)$ and the adversarial loss $\mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)]$,
$\mathbb{E}_{p(x,y)}[\mathcal{L}_{ELBO}(x, y)] - \max_D \mathbb{E}_{p(k,x)}\mathbb{E}_{q_\phi(z|x)}[\log D(k|z)],$
is a lower bound of a combination of mutual information and entropy terms.

E.1 NETWORK ARCHITECTURE

For Circle, DG-15, DG-60, and TPT-48, we use multi-layer perceptrons to estimate $q(u|x)$, while for CompCars, we use ResNet-18 (He et al., 2015) for $q(u|x)$. All other neural networks are multi-layer perceptrons. To ensure training robustness, we fix the variance of some distributions and use a neural network to estimate only their means. All input data are normalized using their mean and variance.

E.2 HYPERPARAMETERS

For experiments on all 4 datasets, we set the dimension of the global domain indices to 2. For Circle, DG-15, and DG-60, the dimension of the local domain indices is 4, while for TPT-48 and CompCars it is 8. Our model is trained with 20 to 70 warmup steps, learning rates ranging from $1\times10^{-5}$ to $1\times10^{-4}$, and $\lambda_d$ ranging from 0.1 to 1.

E.3 ADDITIONAL LOSS ON LOCAL DOMAIN INDICES

During the inference of local domain indices u, we incorporate a modified contrastive loss from Chen et al. (2020) to improve the coherence of u. The loss aims to maximize the agreement between local domain indices of data points in the same domain. Specifically, we sample a minibatch of size $b$ from each of the $N$ domains, resulting in a large batch of size $bN$. For each $x_{k,i}$, the $i$-th sample in the minibatch of domain $k$, we pair it with $x_{k,(i+1) \bmod b}$, the neighbouring data point within the same domain. We denote $j = (i+1) \bmod b$ and refer to the sampled pair as $(x_{k,i}, x_{k,j})$. We then sample the corresponding $u_{k,i}, u_{k,j}$ from $q_\phi(u|x)$ (Eqn. 6). With a multi-layer perceptron, we map $u_{k,i}, u_{k,j}$ to $h_{k,i}, h_{k,j}$, which are subsequently used to calculate the contrastive loss:
$\ell^{Con}_{k,i} = -\log \frac{\exp(\mathrm{sim}(h_{k,i}, h_{k,j})/\tau)}{\sum_{m=1}^{N}\sum_{n=1}^{b} \mathbb{1}_{k \ne m}\, \exp(\mathrm{sim}(h_{k,i}, h_{m,n})/\tau)},$
where $\mathrm{sim}(\cdot,\cdot)$ is the similarity function between two vectors and $\tau$ is the temperature. In practice, we use the cosine similarity and set $\tau = 1$.
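The loss above can be sketched in a few lines. This is a minimal numpy illustration of the formula's structure (the function names and toy batch are ours, not the paper's code), using cosine similarity, $\tau = 1$, and negatives drawn only from other domains per the indicator $\mathbb{1}_{k \ne m}$:

```python
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def contrastive_loss(h, tau=1.0):
    """h: array of shape (N, b, d) -- projected local indices h_{k,i}
    for N domains with b samples each. Pairs each sample with its
    in-domain neighbour (positive) and uses samples from other
    domains as negatives; returns the mean loss over all (k, i)."""
    N, b, _ = h.shape
    losses = []
    for k in range(N):
        for i in range(b):
            j = (i + 1) % b                      # in-domain positive pair
            pos = np.exp(cosine_sim(h[k, i], h[k, j]) / tau)
            neg = sum(np.exp(cosine_sim(h[k, i], h[m, n]) / tau)
                      for m in range(N) if m != k
                      for n in range(b))
            losses.append(-np.log(pos / neg))
    return np.mean(losses)

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4, 8))   # 3 domains, minibatch of 4, 8-dim h
loss = contrastive_loss(h)
```

A vectorized implementation would precompute the full similarity matrix, but the nested-loop form mirrors the summation in the equation directly.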

G DG-15 AND DG-60

For completeness, we show DG-15 and DG-60 in Fig. 9. We use 'red' and 'blue' to indicate positive and negative data points inside a domain. The boundaries between the 'red' half circles and the 'blue' half circles show the direction of the ground-truth decision boundaries in the datasets. DG-15 is a synthetic dataset with 15 domains for binary classification. As shown in Fig. 3(c), these domains form a domain graph (DG) of 15 nodes, with adjacent domains having similar decision boundaries. Each domain contains 100 data points. We use 6 connected domains as the source domains and the other 9 as target domains. Similarly, DG-60 is another synthetic dataset with 60 domains, each of which contains 100 data points. We use 6 connected domains as source domains and the remaining 54 domains as target domains.

H ADAPTING MDD AND SENTRY FOR REGRESSION TASKS

For MDD, we simply replace its cross-entropy loss with an $L_2$ loss. For SENTRY, since it requires confidence scores during training, we include another classification network that predicts whether the average temperature of the next 6 months will go up or down compared to the previous 6 months, thereby providing confidence scores.

I NEW EDGE TYPE FOR PROBABILISTIC GRAPHICAL MODELS

In Fig. 10 , we introduce a new edge type, " ", to denote independence. β k z enforces independence between z and β k , i.e, p(z|β k ) = p(z); note that this does not contradict β k → z, as long as there are multiple paths between β k and z. In this case, we have two paths between β k and z, β k → z and β k → u → z. As a simplified example, consider only a sub-graph in Fig. 10 with only three nodes β k , u, and z. Essentially our algorithm tries to learn two conditional dependencies p(u|β k ) and p(z|β k , u). Adding a dashed edge β k z ensures that p(z|β k ) = p(z), which is equivalent to p(z|β k ) = p(z|β ′ k ), ∀β k , β ′ k , which is equivalent to p(u|β k )p(z|β k , u)du = p(u|β ′ k )p(z|β ′ k , u)du, ∀β k , β ′ k . It is easy to see that Eqn. 32 only introduces an additional constraint, but does not contradict the learning of conditional dependencies p(u|β k ) and p(z|β k , u). Combining the ELBO below L ELBO (x, y) = E q ϕ (u,β,z|x) [p θ (u, x, z, y, β|α)] -E q ϕ (u,β,z|x) [q ϕ (u, β, z|x)], with the additional constraint in Eqn. 32, we have the following objective function max θ,ϕ E p(x,y) E q ϕ (u,β,z|x) [p θ (u, x, z, y, β|α)] -E q ϕ (u,β,z|x) [q ϕ (u, β, z|x)] , s.t. p θ (u|β k )p θ (z|β k , u)du = p θ (u|β ′ k )p θ (z|β ′ k , u)du, ∀β k , β ′ k , which is equivalent to Eqn. 20.

J DOMAIN IDENTITIES VERSUS DOMAIN INDICES

Compared to domain identities, domain indices can better capture the similarity and complex relations among different domains, thereby better guiding the adaptation process. This is also empirically verified in previous works such as Wang et al. (2020) and Xu et al. (2022). As a simple example, since domain identities are one-hot vectors, the distance between any two of them is identical; in contrast, domain indices are real-valued vectors and therefore contain richer information. More specifically, since the encoder takes domain indices as additional input, domain indices tend to encourage the data $x$ of similar domains to go through similar transformations to produce the encoding $z$; consequently, the predicted decision boundaries of similar domains will also be similar. Theoretically, the target domain's error is upper-bounded by three terms, i.e., the source error, the domain gap, and the optimal joint error of the source and target domains (Ben-David et al., 2010). Encouraging similar transformations for similar domains can reduce the joint-error term of the upper bound, thereby achieving better performance.
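For reference, the three-term bound of Ben-David et al. (2010) mentioned above can be written as follows (in common notation, which differs slightly across presentations: $\epsilon_S, \epsilon_T$ are the source and target errors of a hypothesis $h$, the middle term is the $\mathcal{H}\Delta\mathcal{H}$-divergence between the source and target distributions, and $\lambda^*$ is the optimal joint error):

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h)
\;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
\;+\; \lambda^{*},
\qquad
\lambda^{*} \;=\; \min_{h' \in \mathcal{H}} \big[ \epsilon_S(h') + \epsilon_T(h') \big].
```

The argument in this section targets the $\lambda^*$ term: transformations shared across similar domains make a single well-performing joint hypothesis more attainable.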



Note that, different from Wang et al. (2020) and Xu et al. (2022), all evaluated methods only have access to domain identities k, but not ground-truth domain indices β, since the goal in our setting is to infer β.



Figure 1: Left: probabilistic graphical model for VDI's generative model. We introduce a new edge type, a dashed line, to denote independence; the dashed edge between $\beta_k$ and $z$ enforces independence between $z$ and $\beta_k$, i.e., $p(z|\beta_k) = p(z)$ (see Appendix Sec. I for a detailed discussion). Right: probabilistic graphical model for VDI's inference model.

Process and Probabilistic Graphical Model. Based on our definition, we propose our model: Variational Domain Indexing (VDI). The basic idea is to infer the domain indices as latent variables during domain adaptation. VDI is a generative model assuming the following generative process (see the corresponding graphical model in Fig. 1 (left)). For each domain k: (1) Draw the global domain index β_k from the Gaussian distribution p_θ(β_k|α). (2) For each data point i with domain identity k: (a) Draw the local domain index u_i from the Gaussian distribution p_θ(u_i|β_k). (b) Draw the input x_i from the Gaussian distribution p_θ(x_i|u_i).
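As a concrete illustration, the generative process above can be sketched as follows. This is a minimal numpy sketch with hypothetical dimensions and identity-covariance Gaussians; in VDI the conditionals p_θ(u_i|β_k) and p_θ(x_i|u_i) are parameterized by neural networks, which we stand in for with fixed linear maps here.

```python
import numpy as np

rng = np.random.default_rng(0)
B_beta, B_u, B_x = 2, 2, 10        # hypothetical latent/input dimensions
n_domains, n_per_domain = 5, 100

# Hypothetical linear "decoders" standing in for the learned networks.
W_u = rng.normal(size=(B_beta, B_u))   # mean map of p(u | beta)
W_x = rng.normal(size=(B_u, B_x))      # mean map of p(x | u)

data = []
for k in range(n_domains):
    beta_k = rng.normal(size=B_beta)                # (1) global domain index
    for i in range(n_per_domain):
        u_i = beta_k @ W_u + rng.normal(size=B_u)   # (2a) local domain index
        x_i = u_i @ W_x + rng.normal(size=B_x)      # (2b) observed input
        data.append((k, x_i))

print(len(data))  # 500 data points across 5 domains
```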

Figure 2: Network structure. For clarity, we omit the subscripts of q_ϕ and p_θ, as well as p_θ(z|x, u, β)'s input (x, u).

Figure 3: (a) The Circle dataset (Wang et al., 2020) with 30 domains, with different colors indicating ground-truth domain indices. The first 6 domains (in the green box) are source domains. (b) Ground-truth labels for Circle, with red dots and blue crosses as positive and negative data points, respectively. (c) Ground-truth domain graph for DG-15. We use 'red' and 'blue' to roughly indicate positive and negative data points in a domain. (d) VDI's inferred domain graph for DG-15, with an AUC of 0.83.

to infer the global domain index β_k. Specifically, our process consists of four steps: (1) Grouping u_i in Domain k. Group all local domain indices from the same domain k into one local index matrix (set), i.e., U_k = [u_i]_{k_i = k} ∈ R^{D_k × B_u}. (2) Pairwise Domain Distance. Calculate the Earth Mover's distance (EMD)
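The grouping and pairwise-distance steps above, followed by an MDS embedding of the distance matrix to obtain global indices β_k, might be sketched as below. This is a simplified illustration under stated assumptions: equal-size domains with uniform weights (so exact EMD reduces to an optimal assignment) and classical MDS via eigendecomposition; the paper's actual formulation (Eqns. 8-11) may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(U_a, U_b):
    """Exact EMD between two equal-size sets of local indices
    (uniform weights), computed via an optimal assignment."""
    cost = cdist(U_a, U_b)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def classical_mds(D, dim):
    """Embed a pairwise-distance matrix D into R^dim (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]          # top-dim eigenpairs
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

rng = np.random.default_rng(0)
# Step 1 (grouping): hypothetical local index sets U_k for 3 domains.
Us = [rng.normal(loc=m, size=(50, 2)) for m in (0.0, 1.0, 5.0)]

# Step 2 (pairwise EMD), then MDS gives the global indices beta_k.
K = len(Us)
D = np.array([[emd(Us[a], Us[b]) for b in range(K)] for a in range(K)])
beta = classical_mds(D, dim=2)               # one beta_k per domain
```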

Figure 4: Domain graphs for two adaptation tasks on TPT-48; black nodes indicate source domains, and white nodes indicate target domains. Left: Adaptation from the 6 states in the west to the 42 states in the east. Right: Adaptation from the 24 states in the north to the 24 states in the south.

Lemma 4.2 above shows that one can decompose max_D E_{p(k,x)} E_{q_ϕ(z|x)}[log D(k|z)] into several information-theoretic terms, including I(z; β), which is related to Part (1) of Definition 3.1. With Lemma 4.1 and Lemma 4.2, we then show in Theorem 4.1 below that VDI's objective function in Eqn. 20 lower-bounds a combination of mutual information terms plus some constant entropy terms. Theorem 4.1 (Objective Function as a Lower Bound). The objective function involves both the ELBO of p_θ(x, y) and the adversarial loss E_{p(k,x)} E_{q_ϕ(z|x)}[log D(k|z)], and it is a lower bound for a combination of mutual information and entropy terms:

With Theorem 4.1, we are now ready to analyze the global optimum of the minimax game in Eqn. 20. Theorem 4.2 (Global Optimum of VDI). In VDI, when the global optimum of Eqn. 20 is achieved, it is guaranteed that (1) I(z; β) = 0, (2) I(x; u, β, z) is maximized, and (3) I(y; z) is maximized.

Fig. 3(a) shows the 30 domains of Circle in different colors. Fig. 3(b) shows positive (red) and negative (blue) data points. The first 6 domains are source domains, and the remaining 24 domains are target domains. DG-15 and DG-60 (Xu et al., 2022). DG-15 (Fig. 3(c)) and DG-60 (Fig. 9(b)) are synthetic datasets with 15 and 60 domains for binary classification, respectively. In both datasets, we use 6 connected domains as the source domains and the others as target domains (see Table

Figure 5: Inferred domain indices for 48 domains in TPT-48. We color inferred domain indices according to ground-truth indices, latitude (left) and longitude (right). VDI's inferred indices are correlated with true indices, even though VDI does not have access to true indices during training.

Fig. 5 plots the inferred domain indices β ∈ R^2 for all 48 domains. For reference, we color the inferred domain indices according to the ground-truth latitude (Fig. 5 (left)) and longitude (Fig. 5 (right)); note that VDI does not have access to latitude and longitude during training. The plots show that VDI's inferred domain indices are highly correlated with each domain's latitude and longitude. For example, Florida (FL) has the lowest latitude among all 48 states and is hence the left-most circle in Fig. 5 (left). We also observe that states with similar latitudes or longitudes do have similar domain indices β. These results demonstrate that VDI can infer reasonable domain indices.

Fig. 6 plots the inferred domain indices β ∈ R^2 for all 30 domains. For reference, we also color the plotted circles according to YOMs (Fig. 6 (left)) and viewpoints (Fig. 6 (right)); note that VDI does not have access to YOMs and viewpoints during training. Interestingly, we have the following observations, which are consistent with intuition: (1) domains with the same viewpoint or YOM have similar domain indices; (2) domains with "front-side" and "rear-side" viewpoints have similar domain indices; (3) domains with "front" and "rear" viewpoints have similar domain indices.

Fig. 7 shows the inferred domain indices for Circle.

Figure 7: Inferred domain indices (reduced to 1 dimension by PCA) with true domain indices for dataset Circle. VDI's inferred indices have a correlation of 0.97 with true indices, even though VDI does not have access to true indices during training.

Figure 8: Local domain indices for DG-15. Left: Local domain indices for every domain. Right: Local domain indices for 3 selected domains. We select each domain by its location in the domain graph (see Fig. 9(a)).

E ARCHITECTURE AND IMPLEMENTATION DETAILS

E.1 ARCHITECTURE

As shown in Fig. 2, we use neural networks to estimate the density function of each distribution. For Circle, DG-15, DG-60, and TPT-48, we use multi-layer perceptrons to estimate q(u|x), while for CompCars, we use ResNet-18 (He et al., 2015) for q(u|x). All the other neural networks are multi-layer perceptrons. To ensure the robustness of training, we fix the variance of some distributions and use a neural network to estimate only their means. All input data are normalized with their mean and variance.
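The fixed-variance trick mentioned above amounts to a Gaussian likelihood whose standard deviation is a constant hyperparameter while only the mean is predicted. A minimal numpy sketch follows; the linear "mean network" and the σ value are placeholders for illustration, not VDI's actual architecture.

```python
import numpy as np

SIGMA = 0.1  # fixed standard deviation (not learned), for training stability

def gaussian_logpdf_fixed_var(x, mu, sigma=SIGMA):
    """log N(x; mu, sigma^2 I) with a fixed, shared variance."""
    d = x.shape[-1]
    return (-0.5 * np.sum((x - mu) ** 2, axis=-1) / sigma ** 2
            - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)) * 0.1     # placeholder for the mean network
u = rng.normal(size=(16, 4))
x = u @ W + SIGMA * rng.normal(size=(16, 8))

ll = gaussian_logpdf_fixed_var(x, u @ W)  # per-sample log-likelihood
```

Only the mean u @ W is "predicted"; with σ fixed, maximizing this log-likelihood reduces to a scaled squared-error reconstruction term, which avoids the instability of jointly learning a variance.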

Figure 9: Visualization of the DG-15 (a) and DG-60 (b) datasets. We use 'red' and 'blue' to indicate positive and negative data points inside a domain. The boundaries between 'red' half circles and 'blue' half circles show the direction of ground-truth decision boundaries in the datasets.

Figure 11: Inferred domain indices for 48 domains in TPT-48. We color inferred domain indices according to ground-truth latitude. VDI's inferred indices are correlated with true indices, even though VDI does not have access to true indices during training.

Figure 12: Inferred domain indices for 48 domains in TPT-48. We color inferred domain indices according to ground-truth longitude. VDI's inferred indices are correlated with true indices, even though VDI does not have access to true indices during training.

Figure 14: Inferred domain indices for 30 domains in CompCars. We color inferred domain indices according to ground-truth years of manufacture (YOMs). VDI's inferred indices are correlated with true indices, even though VDI does not have access to true indices during training.

Global Domain Index β and Local Domain Index u. VDI uses a bi-level structure for domain indices: a local domain index u ∈ R^{B_u} and a global domain index β ∈ R^{B_β}. Both u and β are low-dimensional compared to x ∈ R^{B_x}, i.e., B_u ≪ B_x and B_β ≪ B_x.

Accuracy (%) on Circle, DG-15 and DG-60.

MSE for various DA methods for both tasks W (6) → E (42) and N (24) → S (24) on TPT-48.

YOMs) with 18735 images in total. We choose the domain with front view and YOM 2009 as the source domain, and all the others as target domains.

Accuracy (%) on CompCars (4-Way Classification).



Summary of statistics and settings in different datasets.

ACKNOWLEDGEMENT

The authors thank the reviewers/AC for the constructive comments to improve the paper. ZX and HW are partially supported by NSF Grant IIS-2127918. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.


(EMD) and multi-dimensional scaling (MDS) in Eqns. 8-11 in the 4096-dimensional space; this is much more computationally expensive and dramatically slows down model training. Therefore local domain indices are necessary in VDI.

Necessity of EMD and MDS. It is also worth noting that EMD and MDS are necessary when inferring global domain indices β from local domain indices u. The main reason is that u's distribution tends to be multi-modal. As an example, we plot the local indices from 3 domains of DG-15 in Fig. 8 of Appendix D. We can see that the u's in Domain 11 contain two clusters (i.e., two modes), with one cluster corresponding to positive data and the other corresponding to negative data. This is also the case for Domains 1 and 0, but the distance between the two clusters gets smaller and smaller. Therefore, directly using the mean of the u's as the global domain index β does not work, because all three domains would have similar means, and consequently similar β's. Using EMD and MDS fixes this issue, since EMD can naturally compute the distance between two multi-modal distributions. Our preliminary experiments also confirm their necessity; removing EMD and MDS significantly brings down VDI's performance.

Independence between z and the Global Index β. In typical domain adaptation methods, an encoder first takes x as input to produce the representation z, and a predictor then takes z as input to predict the label. To improve z's generalization across different domains, it is common practice (Ganin et al., 2016; Zhao et al., 2017; Zhang et al., 2019; Tzeng et al., 2017; Wang et al., 2020; Xu et al., 2022) to enforce independence between the domain index β and the representation z, so that domain-specific information is removed from z. Therefore the assumption/constraint of independence between β and z is natural.

Why Independence between z and the Local Index u is Not Required. As mentioned above, we need the local indices u to capture the multi-cluster structure in the data, with each cluster corresponding to local indices for data with the same label y. Since z contains label-specific information, if we enforced independence between z and u, such label-specific information would be removed from u, making it impossible for u to capture the label-specific multi-cluster structure in the data.

How the Global Index β Indicates Different Domains from the Definition. One key property that ensures the global index β can distinguish different domains is the second point of Definition 3.1, i.e., information preservation of x. Specifically, maximizing the mutual information I(x; u, β, z) ensures that β contains as much information on x as possible under the constraint that β and z are independent. Since we assume covariate shift exists (i.e., different domains have different distributions over x), this property helps β distinguish (indicate) different domains. This is also empirically verified by the results in Fig. 3(d), Fig. 5, and Fig. 6, where VDI successfully inferred meaningful domain indices β from different datasets.
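The failure of the naive mean and the effectiveness of EMD for multi-modal u's can be demonstrated with a 1-D toy example (hypothetical data, not DG-15 itself): two bimodal domains with (almost) identical means but very different cluster gaps are indistinguishable by their means, yet clearly separated by EMD.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Two hypothetical domains whose local indices u are bimodal (one mode per
# class label), with different cluster gaps but nearly identical means.
u_dom_a = np.concatenate([rng.normal(-4, 0.3, 500), rng.normal(4, 0.3, 500)])
u_dom_b = np.concatenate([rng.normal(-1, 0.3, 500), rng.normal(1, 0.3, 500)])

print(abs(u_dom_a.mean() - u_dom_b.mean()))    # tiny: means cannot tell them apart
print(wasserstein_distance(u_dom_a, u_dom_b))  # large: EMD separates the domains
```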

L LARGER FIGURES

In this section, we provide larger versions of figures for domain index visualization in the main paper.

