LEARNING DICTIONARIES OVER DATASETS THROUGH WASSERSTEIN BARYCENTERS

Abstract

Unsupervised Domain Adaptation is an important machine learning problem that aims at mitigating data distribution shift when transferring knowledge from a labeled domain to a similar, unlabeled one. Optimal transport has been shown in previous works to be a powerful tool for comparing and matching empirical distributions. As such, we propose a novel approach for Multi-Source Domain Adaptation that consists in learning a Wasserstein dictionary of labeled empirical distributions, as a means of interpolating distributional shift across several related domains and inferring labels on the target domain. We evaluate this method on the Caltech-Office 10 and Office 31 benchmarks, where it improves the state of the art by 1.96% and 2.70%, respectively. We provide further insight into our dictionary, exploring how interpolations of atoms provide useful predictors for target-domain data, and how the dictionary can be used to study the geometry of data distributions. Our framework opens interesting perspectives for fitting and generating datasets based on learned probability distributions.

1. INTRODUCTION

Dictionary Learning (DiL) is a representation learning technique that seeks to express a set of vectors in $\mathbb{R}^d$ as weighted linear combinations of elementary elements called atoms. When the vectors represent histograms, this problem is known as Nonnegative Matrix Factorization (NMF). Optimal Transport (OT) has previously contributed to this setting, either through a metric over histograms (Rolet et al., 2016) or by defining a non-linear way of aggregating atoms (Schmitz et al., 2018) through Wasserstein barycenters (Agueh & Carlier, 2011). In parallel, different problems in Machine Learning (ML) can be analyzed through a probabilistic view, e.g., generative modeling (Goodfellow et al., 2014) and Domain Adaptation (DA) (Pan & Yang, 2009). For instance, in Multi-Source Domain Adaptation (MSDA), one wants to adapt data from heterogeneous domains or datasets to a new setting. In this case, the celebrated Empirical Risk Minimization (ERM) principle cannot be correctly applied due to the non-i.i.d. character of the data. However, we assume that the domain shifts have regularities that can be learned and leveraged for MSDA. Thus, in this paper, we take a novel approach to MSDA using distributional DiL: we learn a dictionary of empirical distributions. As such, we reconstruct domains using interpolations in the Wasserstein space, also known as Wasserstein barycenters. As we explore in section 3, this offers a principled framework for MSDA. We take inspiration from the works of Bonneel et al. (2016) and Schmitz et al. (2018) for defining our novel DiL framework. Indeed, these authors consider DiL over histograms, while we propose DiL over datasets, understood as point clouds, which enables its application to DA. We summarize our contributions as follows: • Dictionary Learning. To the best of our knowledge, we are the first to propose a DiL problem over point clouds. • Empirical Distributions Embedding.
As a by-product, we obtain embeddings of the DiL datasets as their barycentric coordinates w.r.t. the dictionary. We build on this new representation to define a (semi-)metric called Wasserstein Embedding Distance (WED). We explore the WED theoretically (theorems 3.2 and 3.3) and experimentally (section 4.2). • Domain Adaptation. We propose two novel ways of performing MSDA. The first relies on reconstructing labeled samples in the support of the target distribution. The second relies on weighting predictors learned on each atom, thus defining a new classifier that works on the target domain. We offer theoretical justification for both methods (section 3.1). We further explore, in section 4.1.3, general interpolations in the latent space of our dictionary.

Notation. We denote by $\Delta_N = \{a \in \mathbb{R}_+^N : \sum_{i=1}^N a_i = 1\}$ the probability simplex with $N$ components. We consider $n_P$ i.i.d. samples $X^{(P)} = \{x_i^{(P)}\}_{i=1}^{n_P} \in \mathbb{R}^{n_P \times d}$ from an unknown distribution $P$: $x_i^{(P)} \sim P$. The samples $X^{(P)}$ yield an empirical approximation $\hat{P}$ of $P$ as a sum of Diracs, $\hat{P} = \sum_{i=1}^{n_P} p_i \delta_{x_i^{(P)}}$, with $p_i = 1/n_P$ unless stated otherwise. Similarly, $\hat{Q} = \sum_{i=1}^{n_Q} q_i \delta_{x_i^{(Q)}}$, with $q_i = 1/n_Q$. When labels are available, $\hat{P} = \sum_{i=1}^{n_P} p_i \delta_{(x_i^{(P)}, y_i^{(P)})}$ is an empirical approximation of the joint $P(X, Y)$.

Paper Structure. Section 2 covers brief introductions to OT, DA, and DiL. Section 3 introduces our view on dictionary learning. Section 4 covers experiments on manifold learning of distributions and domain adaptation. Section 5 presents our conclusions.

2. PRELIMINARIES

2.1. OPTIMAL TRANSPORT

In this section, we focus on the computational treatment of OT (Peyré et al., 2019), which predominantly relies on empirical approximations of distributions. There are two discretization strategies. The first, known as Eulerian discretization, bins $\mathbb{R}^d$ into a fixed grid $\{x_i^{(P)}\}_{i=1}^{n_P}$, so that $p_i$ corresponds to how many samples are assigned to the $i$-th bin. The second, known as Lagrangian discretization, assumes the $x_i^{(P)}$ are i.i.d. according to $P$. Henceforth, contrary to Bonneel et al. (2016) and Schmitz et al. (2018), we use the Lagrangian discretization. For $\hat{P}$, $\hat{Q}$, in the formulation of Kantorovich (1942), OT seeks a transport plan $\pi \in \mathbb{R}^{n_P \times n_Q}$, where $\pi_{i,j}$ represents how much mass is transported from $x_i^{(P)}$ to $x_j^{(Q)}$. $\pi$ is required to preserve mass, that is, $\sum_{i=1}^{n_P} \pi_{i,j} = q_j$ and $\sum_{j=1}^{n_Q} \pi_{i,j} = p_i$, i.e., $\pi \in U(p, q)$, the set of bi-stochastic matrices with marginals $p \in \Delta_{n_P}$, $q \in \Delta_{n_Q}$. Therefore,

$\pi^\star = \mathrm{OT}(p, q, C) = \mathrm{argmin}_{\pi \in U(p,q)} \sum_{i=1}^{n_P} \sum_{j=1}^{n_Q} C_{i,j} \pi_{i,j} = \mathrm{argmin}_{\pi \in U(p,q)} \langle C, \pi \rangle_F, \quad (1)$

is the OT problem between $\hat{P}$ and $\hat{Q}$. In equation 1, $C_{i,j} = c(x_i^{(P)}, x_j^{(Q)})$ is called the ground-cost matrix. When $c$ is a distance, OT defines a distance between distributions (Peyré et al., 2019), called the Wasserstein distance, $W_c(\hat{P}, \hat{Q}) = \langle C, \pi^\star \rangle_F$. For $\mathcal{P} = \{\hat{P}_k\}_{k=1}^K$ and $\alpha \in \Delta_K$, the Wasserstein barycenter (Agueh & Carlier, 2011) is a solution to,

$B^\star = \mathcal{B}(\alpha; \mathcal{P}) = \inf_{B} \sum_{k=1}^K \alpha_k W_c(\hat{P}_k, B). \quad (2)$

Henceforth we call $\mathcal{B}(\cdot; \mathcal{P})$ the barycentric operator. For empirical $\hat{P}_k$, $B^\star$ is estimated through the fixed-point iterations of Álvarez-Esteban et al. (2016),

$(C_k)_{i,j} = \|x_i^{(P_k)} - x_j^{(B)}\|_2^2; \quad \pi_k = \mathrm{OT}(p_k, b, C_k); \quad X^{(B)} = \sum_{k=1}^K \alpha_k \, \mathrm{diag}(b)^{-1} \pi_k^T X^{(P_k)}, \quad (3)$

repeated until convergence. In this case, $b \in \Delta_n$ (e.g., $b_j = 1/n$). We further discuss the iterations in equation 3 in appendix A.
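As an illustration of equation 1, the Kantorovich problem admits an exact and simple solution in a special case: with uniform weights and equal sample counts, an optimal plan is (by Birkhoff's theorem) a scaled permutation matrix, so OT reduces to linear assignment. The sketch below exploits this; it is not the paper's implementation (which uses Python Optimal Transport), and the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein2_uniform(X, Y):
    """Squared 2-Wasserstein distance between two uniform empirical
    distributions with the same number n of samples.

    With p_i = q_j = 1/n, an optimal Kantorovich plan is (1/n) times a
    permutation matrix, so exact OT reduces to a linear assignment problem.
    """
    n = X.shape[0]
    assert Y.shape[0] == n, "both point clouds must have n samples"
    # Ground-cost matrix C_ij = ||x_i - y_j||^2
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(C)
    # <C, pi*>_F with pi* = (1/n) * permutation matrix
    return C[rows, cols].mean()
```

For two clouds that are exact translates of one another, this returns the squared norm of the translation, matching the closed form for $W_2^2$ under translation.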

2.2. EMPIRICAL RISK MINIMIZATION AND DOMAIN ADAPTATION

In traditional ML, the classification problem can be formalized through the ERM principle of Vapnik (1991), which consists in finding $\hat{h}^\star$ verifying,

$\hat{h}^\star = \mathrm{argmin}_{h \in \mathcal{H}} \hat{R}_P(h) = \mathrm{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \mathcal{L}(h(x_i^{(P)}), y_i^{(P)}),$

where $\mathcal{L}$ is a loss function and $h : \mathbb{R}^d \to \{1, \cdots, n_c\}$ is a classifier chosen among a family $\mathcal{H}$. Under the i.i.d. hypothesis between training and real application data, the expected classification error is quantified as $R_P(h, h_0) = \mathbb{E}_{x \sim P}[\mathcal{L}(h(x), h_0(x))]$, where $h_0$ is the unknown ground-truth labeling function. Given that enough samples are available, $R_P$ and $\hat{R}_P$ are close with high probability (Redko et al., 2020, Theorem 1). Nonetheless, the i.i.d. assumption is restrictive, since real-world data may be acquired under different regimes. In this case, train and test data may follow different distributions (Quinonero-Candela et al., 2008). DA (Kouw & Loog, 2019) is a framework for such non-standard cases. Following Pan & Yang (2009), a domain $\mathcal{D} = (\mathcal{X}, Q(X))$ is a pair of a feature space $\mathcal{X}$ (e.g., $\mathbb{R}^d$) and a distribution $Q(X)$ over it. In DA, one has $\mathcal{D}_S \neq \mathcal{D}_T$ due to different distributions, that is, $Q_S \neq Q_T$. The goal is thus to adapt a classifier learned with labeled data from $\mathcal{D}_S$ using unlabeled data from $\mathcal{D}_T$. When $N_S > 1$ sources $\{\mathcal{D}_{S_\ell}\}_{\ell=1}^{N_S}$ (resp. $Q_{S_\ell}$) are available, one has MSDA. DA can be further divided into shallow and deep DA. In the first case, one leverages fixed feature extractors (e.g., pre-trained convolutional layers). In contrast, in the second, one uses unlabeled target data during training to learn features invariant to distributional shift. OT has contributed to DA in various ways. For instance, in the seminal work of Courty et al. (2016), the authors proposed transporting source-domain samples using $\hat{X}^{(Q_S)} = \mathrm{diag}(q_S)^{-1} \pi X^{(Q_T)}$. In this sense $\hat{X}^{(Q_S)} \sim Q_T$; thus, their method generates labeled data on the target domain. OT has been applied to MSDA as well.
For instance, Montesuma & Mboula (2021a;b) propose to aggregate $\{\hat{Q}_{S_\ell}\}_{\ell=1}^{N_S}$ using the Wasserstein barycenter $\hat{B}$, then transport $\hat{B}$ to $\hat{Q}_T$. Alternatively, Turrisi et al. (2022) estimate domain importance coefficients $\alpha$ for weighting source-domain distributions.
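The barycentric mapping of Courty et al. (2016), $\hat{X}^{(Q_S)} = \mathrm{diag}(q_S)^{-1}\pi X^{(Q_T)}$, can be sketched in the simplified uniform, equal-size case, where the optimal plan is a permutation and each source sample is simply sent to its matched target sample. This is our own minimal illustration (function name hypothetical), not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def transport_sources_to_target(Xs, Xt):
    """Barycentric mapping in the spirit of Courty et al. (2016): each
    source sample is mapped to the barycenter of the target samples it is
    matched to. With uniform weights and equal sample sizes, the optimal
    plan is a permutation, so the mapping picks the matched target point.
    """
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(C)
    Xs_mapped = np.empty_like(Xs)
    Xs_mapped[rows] = Xt[cols]  # diag(q_S)^{-1} pi X^(Q_T) for pi = perm/n
    return Xs_mapped
```

When the source cloud is a pure translation of the target cloud, the identity matching is optimal, so the mapped source coincides with the target exactly.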

2.3. DICTIONARY LEARNING

DiL is a representation learning technique that tries to express a collection of $N$ vectors $\{x_\ell\}_{\ell=1}^N$, $x_\ell \in \mathbb{R}^d$, through a set of $K$ atoms $\mathcal{P} = \{p_k\}_{k=1}^K$, $p_k \in \mathbb{R}^d$, and $N$ weights $\mathcal{A} = \{\alpha_\ell\}_{\ell=1}^N$, $\alpha_\ell \in \mathbb{R}^K$. Mathematically, DiL corresponds to,

$(\mathcal{P}^\star, \mathcal{A}^\star) = \mathrm{argmin}_{\mathcal{P}, \mathcal{A}} \frac{1}{N} \sum_{\ell=1}^N \mathcal{L}(x_\ell, \mathbf{P}^T \alpha_\ell) + \lambda_A \Omega_A(\mathcal{A}) + \lambda_P \Omega_P(\mathcal{P}), \quad (5)$

where $\mathcal{L}$ is a suitable loss, and $\Omega_A$ and $\Omega_P$ are regularization terms on the representations and atoms, respectively. OT has previously contributed to DiL when the vectors are histograms, that is, $x_\ell \in \Delta_d$. In this sense, these contributions assume an Eulerian discretization paradigm. For instance, Rolet et al. (2016) considered using the Sinkhorn divergence (Cuturi, 2013) as the loss in equation 5, and Schmitz et al. (2018) proposed replacing $\mathbf{P}^T \alpha_\ell$ with a Wasserstein barycenter.
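To make the contrast with our Wasserstein formulation concrete, equation 5 in its plain linear form (Euclidean loss, no regularization) can be solved by alternating least squares over weights and atoms. This is a generic textbook sketch under those assumptions, with names of our choosing, not any specific prior method.

```python
import numpy as np

def linear_dil(X, K, n_iter=5, seed=0):
    """Minimal linear dictionary learning by alternating least squares:
    min_{A, P} ||X - A P||_F^2. Each row of X (one data vector) is
    approximated by a linear combination of the K atom rows of P."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = rng.standard_normal((K, d))  # random atom initialization
    for _ in range(n_iter):
        # Weights step: solve P^T alpha_l = x_l in the least-squares sense
        A = np.linalg.lstsq(P.T, X.T, rcond=None)[0].T
        # Atoms step: solve min_P ||X - A P||_F^2 given the weights
        P = np.linalg.lstsq(A, X, rcond=None)[0]
    return P, A
```

When $K = d$ and the random atoms are full rank, the very first weights step already fits the data exactly; for $K < d$, the alternation monotonically decreases the reconstruction error.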

3. DATASET DICTIONARY LEARNING

We seek to learn a dictionary over point clouds. As such, $\mathcal{Q} = \{\hat{Q}_\ell\}_{\ell=1}^N$, where each $\hat{Q}_\ell$ has support $\{x_i^{(Q_\ell)}\}_{i=1}^{n_{Q_\ell}}$ or $\{(x_i^{(Q_\ell)}, y_i^{(Q_\ell)})\}_{i=1}^{n_{Q_\ell}}$, depending on whether $\hat{Q}_\ell$ is labeled. In this setting, we rewrite equation 5 as,

$(\mathcal{P}^\star, \mathcal{A}^\star) = \mathrm{argmin}_{\mathcal{P}, \mathcal{A}} L(\mathcal{P}, \mathcal{A}) = \mathrm{argmin}_{\mathcal{P}, \mathcal{A}} \frac{1}{N} \sum_{\ell=1}^N W_c(\hat{Q}_\ell, \mathcal{B}(\alpha_\ell; \mathcal{P})), \quad (6)$

where $\mathcal{P} = \{\hat{P}_k\}_{k=1}^K$ is our set of atoms, and $\mathcal{A} = \{\alpha_\ell\}_{\ell=1}^N$ with $\alpha_\ell \in \Delta_K$; $\alpha_{\ell,k}$ denotes how much $\hat{Q}_\ell$ is composed of $\hat{P}_k$ in the Wasserstein-distance sense. Each $\hat{P}_k$ is a point cloud parametrized by its support $X^{(P_k)}$ and, when desired, labels $Y^{(P_k)}$. Hence, we use the fixed-point iterations of Álvarez-Esteban et al. (2016), and differentiate $\mathcal{B}(\alpha; \mathcal{P})$ w.r.t. $\alpha$ and $x_i^{(P_k)}$ using the Envelope theorem (Bertsekas, 1997). More on this matter is discussed in appendix A. In addition, optimizing equation 6 over entire datasets may be intractable due to the time and memory complexity of $W_c$, which are, given $n$ samples, $O(n^3 \log n)$ and $O(n^2)$, respectively. We thus employ mini-batch OT (Fatras et al., 2021). For $M$ mini-batches of size $n_b \ll n$, this reduces the complexities to $O(M n_b^3 \log n_b)$ and $O(n_b^2)$, respectively.

Algorithm 1 Dataset Dictionary Learning (DaDiL) learning loop.
Require: $\mathcal{Q} = \{\hat{Q}_\ell\}_{\ell=1}^N$, number of iterations $N_{iter}$, number of atoms $K$, number of batches $M$, batch size $n_b$.
1: Initialize $x_i^{(P_k)} \sim \mathcal{N}(0, I_d)$, $a_\ell \sim \mathcal{N}(0, I_K)$.
2: for $it = 1, \cdots, N_{iter}$ do
3:   for $batch = 1, \cdots, M$ do
4:     for $\ell = 1, \cdots, N$ do
5:       $\forall k$, sample $X^{(P_k)} = \{x_i^{(P_k)}\}_{i=1}^{n_b}$,
6:       sample $X^{(Q_\ell)} = \{x_j^{(Q_\ell)}\}_{j=1}^{n_b}$,
7:       change variables $\alpha_\ell = \mathrm{softmax}(a_\ell)$,
8:       calculate $X^{(B_\ell)} = \mathcal{B}(\alpha_\ell; \mathcal{P})$
9:     end for
10:    $L = (1/N) \sum_{\ell=1}^N W_c(\hat{B}_\ell, \hat{Q}_\ell)$
11:    $\forall k, i$, update $x_i^{(P_k)}$ using $\partial L / \partial x_i^{(P_k)}$
12:    $\forall \ell$, update $a_\ell$ using $\partial L / \partial a_\ell$
13:   end for
14: end for
Ensure: Dictionary $\mathcal{P}^\star$ and weights $\mathcal{A}^\star$.
Since the $\hat{Q}_\ell$ in equation 6 can be either marginals over features or joint feature-label distributions, the atoms $\hat{P}_k$ are joint feature-label distributions, and the ground cost must take both features and labels into account. We propose using,

$C_{i,j} = \|x_i^{(P)} - x_j^{(Q)}\|_2^2 + \beta \|Y_i^{(P)} - Y_j^{(Q)}\|_2^2, \quad (7)$

with $\beta = \max_{i,j} \|x_i^{(P)} - x_j^{(Q)}\|_2^2$. This ground cost allows us to determine $X^{(B)}$ using the iterations in equation 3, and to formally justify the label propagation approach of Redko et al. (2019),

$Y^{(B)} = \sum_{k=1}^K \alpha_k \, \mathrm{diag}(b)^{-1} \pi_k^T Y^{(P_k)}, \quad (8)$

for labeling the samples in $X^{(B)}$. Similarly to Bonneel et al. (2016) and Schmitz et al. (2018), we optimize w.r.t. $\mathbf{A} = \{a_{\ell,k}\} \in \mathbb{R}^{N \times K}$, then perform the change of variables $\alpha_\ell = \mathrm{softmax}(a_\ell)$. The overall procedure is shown in algorithm 1, which is implemented using the automatic differentiation of PyTorch (Paszke et al., 2019) and Python Optimal Transport (Flamary et al., 2021). In the following, we present the derived MSDA approaches.
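The supervised ground cost of equation 7 can be sketched directly. The snippet below (our own illustration, with a hypothetical function name) builds $C$ from features and one-hot labels, setting $\beta$ to the largest feature distance so that label mismatch dominates the cost; note that for one-hot labels the squared label distance is 0 for matching classes and 2 otherwise.

```python
import numpy as np

def supervised_ground_cost(Xp, Yp, Xq, Yq):
    """Ground cost of equation 7: squared Euclidean feature distance plus a
    label-mismatch penalty, with beta = max feature distance. Yp, Yq are
    one-hot label matrices, so the label term is 0 (same class) or 2."""
    feat = ((Xp[:, None, :] - Xq[None, :, :]) ** 2).sum(-1)
    lab = ((Yp[:, None, :] - Yq[None, :, :]) ** 2).sum(-1)
    beta = feat.max()
    return feat + beta * lab
```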

3.1. DOMAIN ADAPTATION

We assume that $N_S > 1$ labeled source distributions, $\hat{Q}_{S_1}, \cdots, \hat{Q}_{S_{N_S}}$, are available. The goal is to improve performance on an unlabeled target distribution, $\hat{Q}_T$. Hence, in equation 6, $\mathcal{Q} = \{\hat{Q}_{S_\ell}(X, Y)\}_{\ell=1}^{N_S} \cup \{\hat{Q}_T(X)\}$, with $\mathcal{P} = \{\hat{P}_k(X, Y)\}_{k=1}^K$ and $\mathcal{A} = \{\alpha_\ell\}_{\ell=1}^{N_S+1}$. By definition of $\mathcal{Q}$, $\alpha_{N_S+1}$ corresponds to the weights of the target distribution, i.e., $\alpha_T$. We now discuss our two strategies, DaDiL-R and DaDiL-E. An overview is shown in figure 1.

DaDiL-R. Our first strategy is based on the reconstruction $\hat{B}_T(X, Y) = \mathcal{B}(\alpha_T; \mathcal{P})$ of $\hat{Q}_T$. Features $X^{(B_T)}$ and labels $Y^{(B_T)}$ are generated with equations 3 and 8, respectively. Using (Redko et al., 2017, Theorem 2), we can bound the risks on these domains by,

$R_{Q_T}(h) \leq R_{B_T}(h) + W_1(\hat{Q}_T, \hat{B}_T) + \sqrt{2 \log(1/\delta)/\zeta} \left( \sqrt{1/n_{Q_T}} + \sqrt{1/n} \right) + \lambda. \quad (9)$

As such, a classifier trained with data from a distribution $\hat{B}_T$ close to $\hat{Q}_T$ is likely to perform well on $\hat{Q}_T$. The other terms in equation 9 are related to the sample sizes and the joint error $\lambda = \min_{h \in \mathcal{H}} R_{Q_T}(h) + R_{B_T}(h)$. Since $\hat{Q}_T$ is unlabeled, $\lambda$ cannot be controlled directly. Similarly to Redko et al. (2017), we hypothesize that the inclusion of labels in the ground cost and the presence of labels in the sources help control $\lambda$.

DaDiL-E. Our second strategy is based on ensembling. Each atom $\hat{P}_k$ has a labeled support, i.e., $\{(x_i^{(P_k)}, y_i^{(P_k)})\}_{i=1}^n$, on which we learn a classifier $\hat{h}_k = \mathrm{argmin}_{h \in \mathcal{H}} \hat{R}_{P_k}(h)$. To predict target labels, we use the predictor $\hat{h}_{\alpha_T}(x_j^{(Q_T)}) = \sum_{k=1}^K \alpha_{T,k} \hat{h}_k(x_j^{(Q_T)})$. We theoretically justify this method as follows.

Theorem 3.1. Let $\{X^{(P_k)}\}_{k=1}^K$ and $X^{(Q_T)}$ be samples of sizes $n$ and $n_{Q_T}$ from $P_k$ and $Q_T$. Let $h_k$ be the minimizer of $\hat{R}_{P_k}$.
Then, for any $d' > d$ and $\zeta > \sqrt{2}$, there exists $N_0$, depending on $d'$, such that for any $\delta > 0$ and $\min(n, n_{Q_T}) > N_0 \max(\delta^{-(d'+2)}, 1)$ and $\alpha \in \Delta_K$, with probability at least $1 - \delta$, the following holds,

$R_{Q_T}(h_\alpha) \leq \hat{R}_\alpha(h_\alpha) + \sum_{k=1}^K \alpha_k \left( W_1(\hat{P}_k, \hat{Q}_T) + \lambda_k + c \right), \quad (10)$

where $h_\alpha = \sum_{k=1}^K \alpha_k h_k$, $\lambda_k = \min_{h \in \mathcal{H}} R_{P_k}(h) + R_{Q_T}(h)$, $\hat{R}_\alpha(h) = \sum_{k=1}^K \alpha_k \hat{R}_{P_k}(h)$, and $c = \sqrt{2 \log(1/\delta)/\zeta} \left( \sqrt{1/n} + \sqrt{1/n_{Q_T}} \right)$.

The bound in equation 10 is very similar to that of equation 9, but it takes into account how the atom classifiers are weighted to form $h_\alpha$. In addition, $\alpha_T$ minimizes the r.h.s. of equation 10. Similarly to DaDiL-R, we assume that using equation 7 helps control $\lambda_k$. In the next subsection, we introduce a proxy for the Wasserstein distance based on the barycentric weights in the learned dictionary.
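The DaDiL-E predictor $\hat{h}_{\alpha_T} = \sum_k \alpha_{T,k} \hat{h}_k$ described above can be sketched in a few lines. Below, each per-atom classifier is modeled as a callable returning class probabilities; the function name and interface are our own illustrative assumptions.

```python
import numpy as np

def dadil_e_predict(classifiers, alpha, X):
    """DaDiL-E ensembling: weight the per-atom classifiers by the target
    barycentric coordinates alpha. Each classifier maps an (n, d) feature
    array to an (n, n_classes) array of class probabilities; the weighted
    probabilities are summed and the argmax gives the predicted class."""
    probs = sum(a * h(X) for a, h in zip(alpha, classifiers))
    return probs.argmax(axis=1)
```

With a one-hot $\alpha$, the ensemble reduces to the single corresponding atom classifier, which is consistent with the interpolation-model view of section 4.1.3.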

3.2. A PROXY FOR THE EMPIRICAL WASSERSTEIN DISTANCE

We propose approximating distributions by their projection onto the set $\mathcal{M}_\mathcal{P} = \{\mathcal{B}(\alpha; \mathcal{P}) : \alpha \in \Delta_K\}$. We thus introduce the following proxy for $W_2$, called WED,

$\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) = \min_{\pi \in U(\alpha_1, \alpha_2)} \sum_{k_1=1}^K \sum_{k_2=1}^K \pi_{k_1,k_2} W_c(\hat{P}_{k_1}, \hat{P}_{k_2}), \quad (11)$

which relies on the notion of Barycentric Coordinate Regression (BCR) (Bonneel et al., 2016, Definition 2), defined as,

$\alpha^\star = \mathrm{argmin}_{\alpha \in \Delta_K} W_c(\mathcal{B}(\alpha; \mathcal{P}), \hat{Q}). \quad (12)$

Theorem 3.2. Let $\hat{Q}_1$ and $\hat{Q}_2$ be two empirical distributions, and $\mathcal{P} = \{\hat{P}_k\}_{k=1}^K$ be a set of atoms learned by minimizing equation 6. Then the WED is a pseudo-metric over $\mathcal{M}_\mathcal{P}$ (proof in A.3).

Calculating the WED is simpler than $W_c$, since one only needs to: (i) compute and store the pairwise distances $W_c(\hat{P}_{k_1}, \hat{P}_{k_2})$ once; (ii) solve a small $K \times K$ OT problem. Especially when the distributions $\hat{P}_k$ have far fewer samples than the $\hat{Q}_\ell$, calculating the WED is faster than $W_c$. We present a theoretical result bounding the WED by the Wasserstein distance plus different terms from our dictionary learning problem.

Theorem 3.3. Let $\hat{Q}_1$ and $\hat{Q}_2$ be two empirical distributions, and $\mathcal{P} = \{\hat{P}_k\}_{k=1}^K$ be a set of learned atoms. For $\alpha_1$ (resp. $\alpha_2$) the barycentric coordinates (equation 12) of $\hat{Q}_1$ (resp. $\hat{Q}_2$), and $\hat{B}_1 = \mathcal{B}(\alpha_1; \mathcal{P})$ (resp. $\hat{B}_2$), the WED is bounded as follows,

$\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) \leq W_c(\hat{Q}_1, \hat{Q}_2) + \sum_{k=1}^K \left( \alpha_{1,k} W_c(\hat{P}_k, \hat{B}_1) + \alpha_{2,k} W_c(\hat{P}_k, \hat{B}_2) \right) + W_c(\hat{B}_1, \hat{Q}_1) + W_c(\hat{B}_2, \hat{Q}_2).$

This theorem bounds our proxy by the Wasserstein distance plus two kinds of terms. The first corresponds to the geometry of the learned dictionary, whereas the second corresponds to the approximation error (e.g., $W_c(\hat{B}_1, \hat{Q}_1)$), which is explicitly minimized by our algorithm. We compare our MSDA methods with five other (shallow) DA algorithms: (i) Subspace Alignment (SA) (Fernando et al., 2013); ...; (iv) WBT (Montesuma & Mboula, 2021a); and (v) WJDOT (Turrisi et al., 2022). (i) and (ii) are standard algorithms in DA, (iii) is the usual OT baseline, and (iv) and (v) are the State-of-the-Art (SOTA) in shallow MSDA.
We further consider a baseline, which corresponds to training a classifier on the concatenation of source-domain data and evaluating it on target-domain data.
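The WED computation described in this subsection, solving a small $K \times K$ OT problem over the precomputed pairwise atom distances, can be sketched as a linear program. This is our own minimal illustration (hypothetical function name); the marginal constraints encode $\pi \in U(\alpha_1, \alpha_2)$.

```python
import numpy as np
from scipy.optimize import linprog

def wed(alpha1, alpha2, atom_dists):
    """Wasserstein Embedding Distance (equation 11): a K x K OT problem
    between the barycentric coordinates of two datasets, using the pairwise
    atom Wasserstein distances (computed once) as ground cost."""
    K = len(alpha1)
    A_eq, b_eq = [], []
    for i in range(K):                       # row marginals: sum_j pi_ij = alpha1_i
        row = np.zeros((K, K)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(alpha1[i])
    for j in range(K):                       # column marginals: sum_i pi_ij = alpha2_j
        col = np.zeros((K, K)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(alpha2[j])
    res = linprog(atom_dists.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```

For identical coordinates and a zero-diagonal cost, the diagonal plan is feasible and optimal, recovering $\mathrm{WED}(\hat{Q}, \hat{Q}) = 0$ as required for a pseudo-metric.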

4. EXPERIMENTS

We experiment on the Caltech-Office 10 benchmark, which is the intersection of the Caltech 256 dataset of Griffin et al. (2007) and the Office 31 dataset of Saenko et al. (2010). This benchmark consists of four domains: Amazon (A), dSLR (D), Webcam (W), and Caltech (C). More information on these can be found in appendix B. A summary of our experiments is presented in table 1. We focus our analysis on the comparison with the SOTA. Our method improves average performance over WJDOT and WBT, being especially better on the Webcam and Caltech domains. W.r.t. WJDOT, Turrisi et al. (2022) show in their experiments that their algorithm chooses one source domain (optimal w.r.t. domain similarity) for the adaptation. This yields the best performance on domains A and D. On the other hand, ours combines information learned through the atoms for predicting on the target domain, which proves to be better for domains W and C. Note that a method similar to ours, WBT, also performs well on these domains. Nonetheless, as we learn to directly reconstruct the target distribution, we manage to outperform their method. For deep DA, we compare against six methods from the SOTA: (i) Domain Adversarial Neural Network (DANN) (Ganin et al., 2016); (ii) Wasserstein Distance Guided Representation Learning (WDGRL) (Shen et al., 2018); (iii) Maximum Classifier Discrepancy (MCD) (Saito et al., 2018); (iv) MOST (Nguyen et al., 2021); (v) Moment Matching for MSDA (M3SDA) (Peng et al., 2019); and (vi) WBT (Montesuma & Mboula, 2021a). MOST and WBT constitute the OT-based MSDA SOTA. In these experiments we consider the Office 31 dataset of Saenko et al. (2010) (see appendix B). As the backbone, we fine-tune a ResNet-50 (He et al., 2016) on the concatenation of source-domain data, then perform adaptation with target-domain data. A summary of our experiments is shown in table 2.

4.1.2. DEEP DOMAIN ADAPTATION

On average, our method is SOTA, primarily due to its outstanding performance on the Amazon domain, where it surpasses other methods by a margin of 7.80%. Overall, it presents an increase of 2.70% in average accuracy. Nonetheless, the transfer towards the dSLR domain is negative. In this domain, deep methods such as Margin Disparity Discrepancy (MDD) and M3SDA perform better. The main difference between our method and the deep methods in the SOTA is that we cannot update the parameters of the convolutional layers, which decreases performance on this domain.

4.1.3. ADAPTATION USING ATOM INTERPOLATIONS

Besides using the inferred target-domain weights $\alpha_T = \alpha_{N_S+1}$ for DaDiL-E and DaDiL-R, our method has the potential to generate infinitely many domains by interpolating atoms with an arbitrary $\alpha \in \Delta_K$, that is, through $\mathcal{M}_\mathcal{P} = \{\mathcal{B}(\alpha; \mathcal{P}) : \alpha \in \Delta_K\}$. We thus explore the performance on the Caltech-Office 10 and Office 31 benchmarks of $h_\alpha = \sum_{k=1}^K \alpha_k \hat{h}_k$ and $\hat{B}_\alpha = \mathcal{B}(\alpha; \mathcal{P})$; we call these interpolation models. This is shown in figure 2 for $K = 3$, which allows for a nice visualization of the latent space. First, in some cases, the atoms themselves provide good predictors for the target domain (e.g., column 2, rows 2 and 3 in figure 2). This implies that DaDiL learns distributions that can discriminate between classes. Second, the reconstruction loss is correlated with the performance of DaDiL-R, as regions with low reconstruction error tend to give better predictions on the target. This remark agrees with our theoretical discussion (e.g., equation 9). Further experiments on this idea can be found in appendix B.2.2. To give an overview of the performance of interpolations in the latent space, we show in figure 3 box plots corresponding to the adaptation performance of DaDiL-E and DaDiL-R on the Caltech-Office 10 benchmark, for each target and arbitrary $\alpha$. For domains A, C, and W, the choice $\alpha_T$ (dotted red line) is clearly above average. Surprisingly, for D, this choice is sub-optimal, indicating that other interpolations in the latent space can be useful for adaptation. The red dotted line corresponds to the results reported in table 1 for DaDiL.

4.2. TOY EXAMPLE: LEARNING THE MANIFOLD OF GAUSSIAN DISTRIBUTIONS

In this section, we propose to learn a dictionary over Gaussian distributions. Let $Q_\ell = \mathcal{N}(\mu_\ell, I)$, where $\mu_\ell \in [0, 8]^2 \subset \mathbb{R}^2$ is the mean vector. We discretize $[0, 8]^2$ using 25 uniformly spaced points. For each $Q_\ell$, we sample $n = 512$ points, generating datasets $\hat{Q}_\ell$. First, we learn dictionaries for $K = 2, \cdots, 6$, which are shown in figure 4. Except for $K = 2$, all dictionaries provide good reconstructions of the manifold. Indeed, since $\mu = [\mu_1, \mu_2]$ is a coordinate system for $\mathcal{Q}$, one needs at least three atoms to recover faithful reconstructions. We focus on $K = 3$ for better visualization. We fix $\alpha_0 = (1/3, 1/3, 1/3)$, and calculate and analyze $W_2$ and the WED over $\mathcal{M}_\mathcal{P} = \{\mathcal{B}(\alpha; \mathcal{P}) : \alpha \in \Delta_3\}$. This is shown in figures 4 (d) and (e). Since each $Q_\ell \in \mathcal{Q}$ is Gaussian, so is $B$ (Agueh & Carlier, 2011, Section 6.3). Therefore, $W_2(\hat{B}_0, \hat{B})^2 = \|\sum_k (\alpha_{0,k} - \alpha_k) \mu_k\|_2^2$, which explains the quadratic contours. Moreover, the WED contours are associated with the positions of the learned atoms and the polytope $U(\alpha_0, \alpha)$, since the ground-cost matrix $C_{k_1,k_2} = W_2(\hat{P}_{k_1}, \hat{P}_{k_2})$ does not depend on $\alpha_0$ or $\alpha$. Indeed, for each $\alpha$, the WED is defined by a $3 \times 3$ OT problem, which has fewer degrees of freedom (w.r.t. $\alpha \in \Delta_3$) than $W_2$, which is a $512 \times 512$ OT problem. Finally, figure 4 (e) shows that the WED is highly correlated with $W_2$. This is mainly due to the fact that our dictionary perfectly reconstructs the distributions in $\mathcal{Q}$.
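The closed form used above follows from a known fact: the Wasserstein barycenter of $\mathcal{N}(\mu_k, I)$ atoms with weights $\alpha$ is $\mathcal{N}(\sum_k \alpha_k \mu_k, I)$, and $W_2$ between two Gaussians with equal covariance is the Euclidean distance between their means. A sketch of this computation (our own helper, not part of the released code):

```python
import numpy as np

def w2_between_gaussian_barycenters(alpha0, alpha, mus):
    """For atoms N(mu_k, I), the Wasserstein barycenter with weights alpha
    is N(sum_k alpha_k mu_k, I). With equal covariances, W2 between two
    such barycenters is the Euclidean distance between their means, i.e.
    || sum_k (alpha0_k - alpha_k) mu_k ||_2."""
    mus = np.asarray(mus, dtype=float)
    alpha0 = np.asarray(alpha0, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    m0 = np.einsum("k,kd->d", alpha0, mus)  # mean of the first barycenter
    m1 = np.einsum("k,kd->d", alpha, mus)   # mean of the second barycenter
    return np.linalg.norm(m0 - m1)
```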

5. CONCLUSION

We present a novel probabilistic framework, based on DiL, for learning dictionaries over datasets understood as point clouds, $\mathcal{Q} = \{\hat{Q}_\ell\}_{\ell=1}^N$. We learn a dictionary of labeled point clouds $\mathcal{P} = \{\hat{P}_k\}_{k=1}^K$ and weights $\mathcal{A} = \{\alpha_\ell\}_{\ell=1}^N$, where each distribution $\hat{Q}_\ell$ is expressed as a Wasserstein barycenter of atoms, i.e., $\mathcal{B}(\alpha_\ell; \mathcal{P})$. We propose two novel ways of applying our dictionary to MSDA, by either reconstructing the target domain as a combination of atom distributions, or by ensembling classifiers learned on the atoms $\hat{P}_k$. We show that interpolations in the latent space of our dictionary also provide good predictors for the target distribution. Finally, we define a pseudo-metric for empirical distributions, based on their barycentric coordinates in the dictionary, which is a valuable approximation of the exact Wasserstein distance. Limitations and perspectives: first, in DA, our method relies on the quality of feature extractors or embedding functions. In comparison with deep DA, DaDiL is not able to guide representation learning. We plan to pursue this line of work in future research. Second, similarly to Bonneel et al. (2016), our framework is not able to represent distributions outside $\mathcal{M}_\mathcal{P}$. This may lead to inconsistent reconstructions, which could in fact be leveraged for novelty detection.

A THEORETICAL DETAILS

A.1 DIFFERENTIATION

In this section, we discuss the derivatives of (i) $\hat{P} \mapsto W_c(\hat{P}, \hat{Q})$, (ii) $\alpha \mapsto \mathcal{B}(\alpha; \mathcal{P})$, and (iii) $\mathcal{P} \mapsto \mathcal{B}(\alpha; \mathcal{P})$. Since we parametrize empirical distributions by their support, taking derivatives of distribution-valued functionals (e.g., the Wasserstein distance) is equivalent to differentiating w.r.t. the support of the free distribution (e.g., $\hat{P}$). Considering the Wasserstein distance, one has two alternatives: (i) using the Sinkhorn algorithm (Cuturi, 2013) for calculating $\pi_\epsilon^\star \approx \pi^\star$, where $\epsilon$ is a parameter controlling the amount of entropic regularization. Since the Sinkhorn algorithm relies on differentiable operations, one can backpropagate through its iterations. (ii) Differentiating through the min/inf using the Envelope or Danskin's theorem (Afriat, 1971; Bertsekas, 1997) at optimality. The Sinkhorn algorithm has two main drawbacks. First, it is numerically unstable for low regularization values, leading to problematic optimization. As remarked by Xie et al. (2020), a possible workaround is to use stabilized versions of the algorithm, but these come with costly exponential/logarithmic operations. The second issue is that the computed OT plans are no longer sparse, which may lead to degenerate distributions (e.g., take the extreme case $\epsilon \to \infty$). For those reasons, we decided to use exact OT between mini-batches. The second approach uses the Envelope theorem, which, as advocated by Feydy et al. (2019), can be faster than differentiating through Sinkhorn's iterations. For completeness, we state the Envelope theorem in the following.

Theorem A.1. (Envelope Theorem, Bertsekas (1997)) Let $Z \subset \mathbb{R}^m$ be a compact set and let $\phi : \mathbb{R}^n \times Z \to \mathbb{R}$ be continuous and such that $\phi(\cdot, z) : \mathbb{R}^n \to \mathbb{R}$ is convex for each $z \in Z$. Then the function $f : \mathbb{R}^n \to \mathbb{R}$ given by $f(x) = \max_{z \in Z} \phi(x, z)$ is convex. When $Z(x) = \{\bar{z} : \phi(x, \bar{z}) = \max_{z \in Z} \phi(x, z)\}$ consists of a unique point $\bar{z}$, then

$\frac{\partial f}{\partial x_i} = \frac{\partial \phi(x, \bar{z})}{\partial x_i}.$
The application of this theorem to OT is straightforward. In this case, $Z = U(p, q) \subset \mathbb{R}^{n_P \times n_Q}$, which is compact. Naturally, the variable $z$ is $\pi$, whereas $\phi(C, \pi) = \langle C, \pi \rangle_F$. This is a linear function of $C$, hence convex. Finally, the uniqueness of $\pi^\star$ depends on the regularity of the ground cost; this property holds in particular for the squared Euclidean distance. In this case, the Wasserstein distance's derivative reads,

$W_2(\hat{P}, \hat{Q}) = \sum_{i=1}^{n_P} \sum_{j=1}^{n_Q} \pi_{i,j}^\star \|x_i^{(P)} - x_j^{(Q)}\|_2^2 \quad \Rightarrow \quad \frac{\partial W_2}{\partial X^{(P)}}(\hat{P}, \hat{Q}) = \frac{2}{n_P} X^{(P)} - 2 \pi^\star X^{(Q)}. \quad (13)$

Here, it is essential to note that, due to equation 13, to evaluate $\partial W_2 / \partial x_i^{(P)}$ one needs to compute $\pi^\star = \mathrm{OT}(p, q, C)$ at optimality. This poses a problem for the Sinkhorn algorithm, since the number of iterations needed to converge is inversely proportional to $\epsilon$. Thus, in some cases, it is more efficient to compute exact OT. We may now apply the Envelope theorem to the iterations of Álvarez-Esteban et al. (2016) in equation 3. Let $\pi_k$ be the OT plan between $X^{(P_k)}$ and $X^{(B)}$ at convergence. The barycentric operator is thus,

$X^{(B)} = \mathcal{B}(\alpha; \mathcal{P}) = \sum_k \alpha_k \, \mathrm{diag}(b)^{-1} \pi_k^T X^{(P_k)},$

which is linear w.r.t. $\alpha_k$ and $x_i^{(P_k)}$. The derivatives are, therefore,

$\frac{\partial X^{(B)}}{\partial \alpha_k} = \mathrm{diag}(b)^{-1} \pi_k^T X^{(P_k)} \in \mathbb{R}^{n \times d} \quad \text{and} \quad \frac{\partial x_j^{(B)}}{\partial x_i^{(P_k)}} = \alpha_k b_j^{-1} (\pi_k)_{i,j} I_d \in \mathbb{R}^{d \times d}.$
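The gradient formula of equation 13 can be verified numerically. The sketch below (our own, restricted to uniform weights and equal sample counts, where exact OT is a linear assignment) computes both $W_2$ and its Danskin gradient and checks them against finite differences.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_and_grad(X, Y):
    """Squared W2 between uniform empirical distributions of equal size n,
    and its gradient w.r.t. the support X via Danskin's theorem:
    dW2/dX = (2/n) X - 2 pi* Y, with pi* = (1/n) * permutation matrix."""
    n = X.shape[0]
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(C)
    value = C[rows, cols].mean()
    pi = np.zeros((n, n))
    pi[rows, cols] = 1.0 / n  # optimal plan
    grad = (2.0 / n) * X - 2.0 * pi @ Y
    return value, grad
```

Because the optimal assignment is locally constant at generic supports, central finite differences of the value agree with the analytic gradient.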

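For completeness, the entropic alternative discussed in A.1 can also be sketched: Sinkhorn iterations (Cuturi, 2013) scale the Gibbs kernel $K = e^{-C/\epsilon}$ until both marginals match. This is a textbook sketch without the log-domain stabilization mentioned above, so it is only reliable for moderate $\epsilon$.

```python
import numpy as np

def sinkhorn_plan(p, q, C, eps=0.1, n_iter=500):
    """Entropic OT plan via Sinkhorn iterations (Cuturi, 2013):
    pi = diag(u) K diag(v) with K = exp(-C / eps), where u and v are
    alternately rescaled so the marginals of pi match p and q."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]
```

Note the returned plan is dense, illustrating the sparsity loss discussed above, whereas exact OT plans are supported on at most $n_P + n_Q - 1$ entries.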
A.2 WASSERSTEIN BARYCENTERS WITH FIXED-POINT ITERATIONS

In previous works (Montesuma & Mboula, 2021a;b), the authors used the iterations of Cuturi & Doucet (2014) for calculating the support $X^{(B)}$ of Wasserstein barycenters. A similar approach is the fixed-point scheme of Álvarez-Esteban et al. (2016). Let $T_{k,it}$ be the barycentric mapping between $\hat{P}_k$ and $\hat{B}^{(it)}$, that is,

$T_{k,it}(x_j^{(B^{(it)})}) = \frac{1}{b_j} \sum_{i=1}^n (\pi_k)_{i,j} x_i^{(P_k)},$

where $\pi_k$ is the OT plan between $X^{(P_k)}$ and $X^{(B^{(it)})}$. Moreover, let $T_{it} = \sum_{k=1}^K \alpha_k T_{k,it}$. Álvarez-Esteban et al. (2016) introduce the mapping,

$\psi(\hat{P}) = T_{it,\sharp} \hat{P} = \frac{1}{n} \sum_{i=1}^n \delta_{T_{it}(x_i)},$

and prove that the Wasserstein barycenter is a fixed point of $\psi$, namely, $\psi(\hat{B}) = \hat{B}$. Their procedure is then equivalent to $\hat{B}^{(it+1)} = \psi(\hat{B}^{(it)})$. These iterations correspond exactly to those in equation 3, and are summarized in algorithm 2.

In section 3, equation 7 introduced a supervised ground cost between pairs of samples $(x_i^{(P)}, y_i^{(P)})$ and $(x_j^{(Q)}, y_j^{(Q)})$,

$C_{i,j} = \|x_i^{(P)} - x_j^{(Q)}\|_2^2 + \beta \|Y_i^{(P)} - Y_j^{(Q)}\|_2^2,$

where $\beta \geq 0$ defines how costly it is to transport samples between different classes. For one-hot encoded labels $Y \in \mathbb{R}^{n \times n_c}$, the squared Euclidean distance is equivalent (up to a factor of 2) to the 0-1 loss, $\|Y_i^{(P)} - Y_j^{(Q)}\|_2^2 = 2\,\delta(y_i^{(P)} \neq y_j^{(Q)})$. This makes a direct link between our proposed ground cost and the semi-supervised regularization proposed by Courty et al. (2016). In addition, using a Euclidean cost for the labels allows us to interpolate distributions $P(X, Y)$ and $Q(X, Y)$ with respect to both features and labels. Indeed, taking derivatives as in the previous section,

$W_c(\hat{P}, \hat{Q}) = \sum_{i=1}^{n_P} \sum_{j=1}^{n_Q} \pi_{i,j}^\star \left( \|x_i^{(P)} - x_j^{(Q)}\|_2^2 + \beta \|Y_i^{(P)} - Y_j^{(Q)}\|_2^2 \right),$

$\Rightarrow \frac{\partial W_c}{\partial X^{(P)}}(\hat{P}, \hat{Q}) = \frac{2}{n_P} X^{(P)} - 2 \pi^\star X^{(Q)}, \quad (14)$

$\Rightarrow \frac{\partial W_c}{\partial Y^{(P)}}(\hat{P}, \hat{Q}) = \frac{2\beta}{n_P} Y^{(P)} - 2\beta \pi^\star Y^{(Q)}. \quad (15)$

Setting the derivatives in equations 14 and 15 to zero yields, respectively, the barycentric mapping of Courty et al. (2016) and the label propagation formula of Redko et al. (2019). This allows us to reformulate the fixed-point iterations for joint distributions. Algorithms 2 and 3 show the proposed iterations for barycenters of feature and feature-label distributions, respectively.

Algorithm 2 Free-Support Wasserstein Barycenter of Unlabeled Distributions
Require: $\{X^{(P_k)}\}_{k=1}^K$, $\alpha \in \Delta_K$, $\tau > 0$, $n$.
1: $X^{(B^{(0)})} = \mathrm{initialize}(\{X^{(P_k)}\}_{k=1}^K)$
2: while $\|X^{(B^{(it)})} - X^{(B^{(it-1)})}\|_F > \tau$ do
3:   for $k = 1, \cdots, K$ do
4:     $(C_k)_{i,j} = \|x_i^{(P_k)} - x_j^{(B^{(it)})}\|_2^2$
5:     $\pi_k = \mathrm{OT}(u_{n_k}, u_n, C_k)$
6:   end for
7:   $X^{(B^{(it+1)})} = n \sum_{k=1}^K \alpha_k \pi_k^T X^{(P_k)}$
8: end while
Ensure: Barycenter support samples $X^{(B)}$.

Algorithm 3 Free-Support Wasserstein Barycenter of Labeled Distributions
Require: $\{X^{(P_k)}, Y^{(P_k)}\}_{k=1}^K$, $\alpha \in \Delta_K$, $\tau > 0$, $n$, $\beta \geq 0$.
1: $X^{(B^{(0)})}, Y^{(B^{(0)})} = \mathrm{initialize}(\{X^{(P_k)}, Y^{(P_k)}\}_{k=1}^K)$
2: while $\|X^{(B^{(it)})} - X^{(B^{(it-1)})}\|_F > \tau$ do
3:   for $k = 1, \cdots, K$ do
4:     $(C_k)_{i,j} = \|x_i^{(P_k)} - x_j^{(B^{(it)})}\|_2^2 + \beta \|Y_i^{(P_k)} - Y_j^{(B^{(it)})}\|_2^2$
5:     $\pi_k = \mathrm{OT}(u_{n_k}, u_n, C_k)$
6:   end for
7:   $X^{(B^{(it+1)})} = n \sum_{k=1}^K \alpha_k \pi_k^T X^{(P_k)}$
8:   $Y^{(B^{(it+1)})} = n \sum_{k=1}^K \alpha_k \pi_k^T Y^{(P_k)}$
9: end while
Ensure: Barycenter support samples $X^{(B)}$ and labels $Y^{(B)}$.

Remark 1: in algorithms 2 and 3 we use the routine initialize to get the initial barycenter support $X^{(B^{(0)})}$, $Y^{(B^{(0)})}$. In our experiments, we sub-sample $n$ points from the concatenated matrices $\{X^{(P_k)}, Y^{(P_k)}\}_{k=1}^K$. Alternatively, one could follow Montesuma & Mboula (2021a;b) and initialize the barycenter support randomly.
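Algorithm 2 can be sketched compactly in the uniform, equal-size case, where each OT plan is $(1/n)$ times a permutation matrix and the update $X^{(B)} = n \sum_k \alpha_k \pi_k^T X^{(P_k)}$ simply averages, per barycenter point, the atom points matched to it. This is our own simplified illustration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def free_support_barycenter(supports, alpha, n_iter=20, seed=0):
    """Fixed-point iterations of Alvarez-Esteban et al. (2016), in the
    special case of uniform weights and atoms with the same number n of
    samples, so each OT plan is (1/n) times a permutation matrix."""
    rng = np.random.default_rng(seed)
    n, d = supports[0].shape
    XB = rng.standard_normal((n, d))  # random initialization of the support
    for _ in range(n_iter):
        new = np.zeros((n, d))
        for a, Xk in zip(alpha, supports):
            C = ((Xk[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
            rows, cols = linear_sum_assignment(C)
            # pi_k^T sends each barycenter point j to its matched atom point
            perm = np.empty(n, dtype=int)
            perm[cols] = rows
            new += a * Xk[perm]
        XB = new
    return XB
```

One consequence of the update is that the barycenter's mean is exactly the $\alpha$-weighted average of the atom means after a single iteration, which gives a quick sanity check.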

A.3 APPROXIMATION PROPERTIES OF WED

Proof of Theorem 3.2: our proof proceeds in four steps: (i) $\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) \geq 0$ for all $\hat{Q}_1, \hat{Q}_2$; (ii) $\mathrm{WED}(\hat{Q}_1, \hat{Q}_1) = 0$ for all $\hat{Q}_1$; (iii) $\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) = \mathrm{WED}(\hat{Q}_2, \hat{Q}_1)$; (iv) $\mathrm{WED}(\hat{Q}_1, \hat{Q}_3) \leq \mathrm{WED}(\hat{Q}_1, \hat{Q}_2) + \mathrm{WED}(\hat{Q}_2, \hat{Q}_3)$.

(i) Since $\pi_{k_1,k_2}$ and $W_c(\hat{P}_{k_1}, \hat{P}_{k_2})$ are non-negative, $\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) \geq 0$ for all $\hat{Q}_1, \hat{Q}_2$. (ii) First, note that $W_c(\hat{P}_k, \hat{P}_k) = 0$. Now, since equation 12 is a convex optimization problem, each $\hat{Q}_1$ has a single solution $\alpha_1$. The plan $\mathrm{diag}(\alpha_1) \in U(\alpha_1, \alpha_1)$ puts mass only on the diagonal, hence $\mathrm{WED}(\hat{Q}_1, \hat{Q}_1) = 0$. (iii) If $\pi \in U(\alpha_1, \alpha_2)$, then $\pi^T \in U(\alpha_2, \alpha_1)$. By commutativity of the summation, $\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) = \mathrm{WED}(\hat{Q}_2, \hat{Q}_1)$. (iv) Let $\pi^{(2)} \in U(\alpha_1, \alpha_2)$ and $\pi^{(3)} \in U(\alpha_2, \alpha_3)$ be the solutions to $\mathrm{WED}(\hat{Q}_1, \hat{Q}_2)$ and $\mathrm{WED}(\hat{Q}_2, \hat{Q}_3)$, respectively, and define the glued plan $\pi = \pi^{(2)} \mathrm{diag}(\alpha_2)^{-1} \pi^{(3)} \in U(\alpha_1, \alpha_3)$ (with the convention $0/0 = 0$). Since $\pi$ is feasible for the pair $(\hat{Q}_1, \hat{Q}_3)$, and $W_c$ satisfies the triangle inequality over atoms,

$\mathrm{WED}(\hat{Q}_1, \hat{Q}_3) \leq \sum_{k_1,k_3} \pi_{k_1,k_3} W_c(\hat{P}_{k_1}, \hat{P}_{k_3}) \leq \sum_{k_1,k_2,k_3} \frac{\pi^{(2)}_{k_1,k_2} \pi^{(3)}_{k_2,k_3}}{\alpha_{2,k_2}} \left( W_c(\hat{P}_{k_1}, \hat{P}_{k_2}) + W_c(\hat{P}_{k_2}, \hat{P}_{k_3}) \right) = \mathrm{WED}(\hat{Q}_1, \hat{Q}_2) + \mathrm{WED}(\hat{Q}_2, \hat{Q}_3),$

where the last equality uses the marginal constraints $\sum_{k_3} \pi^{(3)}_{k_2,k_3} = \alpha_{2,k_2}$ and $\sum_{k_1} \pi^{(2)}_{k_1,k_2} = \alpha_{2,k_2}$.

Proof of Theorem 3.3: the proof relies on successive applications of the triangle inequality for the Wasserstein distance. Let $\pi \in U(\alpha_1, \alpha_2)$ be the minimizer in equation 11. Then,

$\mathrm{WED}(\hat{Q}_1, \hat{Q}_2) = \sum_{k_1=1}^K \sum_{k_2=1}^K \pi_{k_1,k_2} W_c(\hat{P}_{k_1}, \hat{P}_{k_2}) \leq \sum_{k_1=1}^K \sum_{k_2=1}^K \pi_{k_1,k_2} \left( W_c(\hat{P}_{k_1}, \hat{Q}_2) + W_c(\hat{P}_{k_2}, \hat{Q}_2) \right) \leq \sum_{k_1=1}^K \sum_{k_2=1}^K \pi_{k_1,k_2} \left( W_c(\hat{P}_{k_1}, \hat{Q}_1) + W_c(\hat{Q}_1, \hat{Q}_2) + W_c(\hat{P}_{k_2}, \hat{Q}_2) \right).$

We now break this sum into three parts. First, consider the term involving $W_c(\hat{P}_{k_1}, \hat{Q}_1)$:

$\sum_{k_1=1}^K \sum_{k_2=1}^K \pi_{k_1,k_2} W_c(\hat{P}_{k_1}, \hat{Q}_1) = \sum_{k_1=1}^K W_c(\hat{P}_{k_1}, \hat{Q}_1) \sum_{k_2=1}^K \pi_{k_1,k_2} = \sum_{k_1=1}^K \alpha_{1,k_1} W_c(\hat{P}_{k_1}, \hat{Q}_1) \leq \sum_{k=1}^K \alpha_{1,k} W_c(\hat{P}_k, \hat{B}_1) + W_c(\hat{B}_1, \hat{Q}_1).$

The term involving $W_c(\hat{P}_{k_2}, \hat{Q}_2)$ is bounded analogously, and the term involving $W_c(\hat{Q}_1, \hat{Q}_2)$ sums to $W_c(\hat{Q}_1, \hat{Q}_2)$ since $\pi$ has total mass one. Combining the three bounds yields the statement.

B EXPERIMENTAL DETAILS

Office 31: we use this dataset for deep DA. To that end, we download the raw images from its public repository. We train the feature extractor ourselves, which consists of a ResNet-50 (He et al., 2016). The preprocessing steps are: (i) resizing each image to (224, 224, 3), and (ii) applying the tf.keras.applications.resnet50.preprocess_input function. We initialize its parameters with weights trained on ImageNet, then fine-tune for 51 epochs on each combination of source domains (e.g., A, D). Fine-tuning was performed using the ADAPT library of de Mathelin et al. (2021). We understand each epoch as a full pass through the entire dataset. We then use the fine-tuned network to extract features (vectors $x \in \mathbb{R}^{2048}$) from the target domain only.

Remark 2: we use a ResNet backbone since we were not able to reproduce previous results with the standard AlexNet backbone (e.g., Nguyen et al., 2021). Remark 3: we give the specifics for the Caltech-Office and Office 31 datasets in this appendix.

We further analyze artworks, by considering RGB images $I_\ell \in \mathbb{R}^{h \times w \times 3}$ as point clouds of $n = hw$ samples in $\mathbb{R}^3$. Thus, to each $I_\ell$ corresponds a $\hat{Q}_\ell$ with support $X^{(Q_\ell)} \in \mathbb{R}^{hw \times 3}$. In this experiment, we want to learn a dictionary over $\mathcal{Q} = \{\hat{Q}_\ell\}_{\ell=1}^N$, with $N = 50$ artworks of Monet, Delacroix, Magritte, Caravaggio, and Van Gogh, based on their color histograms. We further investigate the geometry induced by the latent codes $\mathcal{A} \in (\Delta_K)^N$ and the WED, compared with the Wasserstein distance. Here, we discuss our results for $K = 3$, using $N_{tr} = 35$ images to train our dictionary and leaving the remaining $N_{ts} = 15$ for test. We map these images into the latent space using equation 12. We show the result of DiL in figure 5. We compare the Wasserstein distance with DaDiL and the WED in terms of (i) running time and (ii) their induced geometry. For large-scale images, calculating $W_2(\hat{Q}_{\ell_1}, \hat{Q}_{\ell_2})$ is unfeasible due to the number of pixels. We thus down-sample the support matrices by selecting $n = 2048$ points randomly, hence $X^{(Q_\ell)} \in \mathbb{R}^{2048 \times 3}$. This step is not necessary for DaDiL, since we perform optimization in mini-batches.
Concerning running time, the 1225 pairwise W 2 distances took approximately 26 minutesfoot_2 . For DaDiL, training takes an overall of 15 minutes. We remark that, for new distributions, we only need to solve 12 over N ts × K variables. This is usually faster than calculating pairwise distances. In Figure 5 (a) we show the learned atoms, which correspond to the predominant colors in the training images. 

B.2.2 SENSITIVITY ANALYSIS

We study how the number of components K, the batch size b, and the number of samples n in the atoms' support influence the performance of DiL. Our results are summarized in Figures 6 and 7. We want to verify whether the DiL loss L(P, A) is a good predictor of the test score. On one hand, for the Caltech-Office dataset, Figure 6 shows that increasing the number of samples and the batch size globally leads to a better approximation of the distributions manifold, as L(P, A) shrinks. This, however, is not necessarily true for the test score, as b = 128 and n = 2048 lead to better adaptation results. Figures 6 (d) and (e) show that the DiL loss is a good proxy for DaDiL-E and DaDiL-R, as it is strongly correlated with the test score. We further remark that L(P, A) is more correlated with the test score of DaDiL-R, since this method directly relies on the goodness of the reconstructions. On the other hand, for the Office 31 dataset, Figure 7 (a) shows that the dependence on these parameters is much more complex: in general, more samples or components do not lead to better approximations. Moreover, Figures 7 (d) and (e) show that L(P, A) is correlated with neither DaDiL-R nor DaDiL-E. We believe that, since this dataset encompasses more classes than Caltech-Office, one needs more atoms to model the particularities of each domain. We now focus our attention on t-SNE embeddings of the reconstructions B̂_T. Figures 8 and 9 show the embeddings for the Caltech-Office and Office 31 datasets. For the former, our reconstructions are correctly aligned with the dataset's classes, indicating that the atoms have correctly captured the characteristics of the domains. For the latter, as the number of classes is large, we only show a comparison between original and reconstructed samples. Semi-transparent circles correspond to points in the target domain, whereas triangles correspond to points in the support of B̂_T.
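The loss-versus-score analysis above can be reproduced mechanically: given per-configuration DiL losses and test accuracies, one computes a Pearson correlation. A minimal sketch with hypothetical numbers (the values below are purely illustrative, not the paper's measurements):

```python
import numpy as np

# Hypothetical (loss, accuracy) pairs across DiL configurations; in the
# paper these come from the grid over K, batch size b and support size n.
dil_loss = np.array([2.10, 1.75, 1.40, 1.20, 0.95])
test_acc = np.array([88.4, 90.1, 91.7, 92.8, 94.0])

# Pearson correlation between the DiL loss and the test score; a strongly
# negative value means the loss is a good proxy for adaptation performance.
r = np.corrcoef(dil_loss, test_acc)[0, 1]
```

With this made-up grid, the loss decreases as accuracy increases, so r is close to -1, the Caltech-Office regime; an r near zero would correspond to the Office 31 regime described above.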



foot_0: https://faculty.cc.gatech.edu/~judy/domainadapt/
Best Artworks of All Time dataset: https://www.kaggle.com/datasets/ikarus777/best-artworks-of-all-time
foot_2: experiments were performed on an Intel Xeon CPU at 2.20 GHz with 12 GB of RAM



Figure 1: Illustration of DaDiL for MSDA. (Upper left) set of datasets Q. (Upper middle) learned atoms Pk and their corresponding classifiers ĥk (color map). (Upper right) learned weights. (Bottom left) reconstruction BT of QT = Q4 . (Bottom middle) Ensembled classifier ĥα T (color map) and the true labels of QT (scatter plot).

Figure 2: DA performance of interpolations in the latent space of DaDiL. Columns represent reconstruction loss W c ( Bα , QT ), and classification accuracy (Acc) of DaDiL-E and DaDiL-R, for Caltech-Office 10 (left) and Office 31 (right) benchmarks.

Figure 4: Overview of DiL on the space of Gaussian distributions.

Caltech-Office 10: we use this dataset for shallow DA. We follow the experimental setting of Montesuma & Mboula (2021a), namely, the 5-fold cross-validation partitions and the features (DeCAF 7th-layer activations).

(b) Latent codes in the simplex Δ₃. (c) Wasserstein distance embeddings. (d) WED embeddings.

Figure 5: Summary of DaDiL for palette learning. Our algorithm learns palettes that reflect different palette clusters (a). These allow for a useful visualization of the latent space (b). The latent codes can be used to approximate the Wasserstein geometry (c) through the proposed WED (d).

Figure 5 (b) shows the latent codes. The embeddings lead to an intuitive description of the images, as they express each image as a combination of blue, yellow, and gray palettes. In Figures 5 (c) and (d), we compare the Wasserstein and WED embeddings using t-distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten & Hinton (2008)), which shows that we are able to capture the Wasserstein geometry faithfully.
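Embedding a set of distributions this way only requires a matrix of pairwise distances (pairwise W₂, or pairwise WED). The paper uses t-SNE; as a dependency-free stand-in, the sketch below uses classical multidimensional scaling, which likewise maps a precomputed distance matrix to 2-D coordinates (the function name `classical_mds` is ours):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points from a pairwise distance matrix D via classical MDS
    (double centering + top eigenvectors). Stand-in for t-SNE with a
    precomputed distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # Gram matrix of centered points
    w, V = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]        # keep the top-`dim` components
    scales = np.sqrt(np.clip(w[idx], 0.0, None))
    return V[:, idx] * scales              # (n, dim) coordinates

# Distances between four points on a line: the embedding reproduces them.
x = np.array([0.0, 1.0, 3.0, 6.0])
D = np.abs(x[:, None] - x[None, :])
emb = classical_mds(D)
```

For truly Euclidean distance matrices, classical MDS recovers the pairwise distances exactly; t-SNE instead emphasizes local neighborhood structure, which is why the paper uses it for visualization.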

Correlation between DiL loss and DaDiL-E classification performance.

Figure 6: Study of how the batch size, number of samples, and number of components impact DiL, DaDiL-R, and DaDiL-E performance on the Caltech-Office 10 dataset.

Figure 7: Study of how the batch size, number of samples, and number of components impact DiL, DaDiL-R, and DaDiL-E performance on the Office 31 dataset.

Figure 8: t-SNE visualization of DaDiL-R reconstructions of each target domain. Semi-transparent circles correspond to points in the target domain, whereas triangles correspond to points in BT support.

Figure 9: t-SNE visualization of DaDiL-R reconstructions of each domain in the Office 31 dataset. Semi-transparent circles correspond to points in the target domain, whereas triangles correspond to points in BT support.


Classification accuracy (in %) of DA methods on the Caltech-Office 10 benchmark. Each column represents a target domain for which we report mean ± standard deviation over 5 folds. * denotes results from Montesuma & Mboula (2021a), while † denotes results from Turrisi et al. (2022).

DaDiL-R: 93.26 ± 1.53, 97.50 ± 2.43, 98.88 ± 0.71, 86.65 ± 1.13 (avg. 94.07)
DaDiL-E: 93.89 ± 1.81, 98.33 ± 0.83, 98.88 ± 0.71, 87.00 ± 1.42 (avg. 94.52)
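As a quick sanity check, the average reported for each method is the unweighted mean of the four per-target accuracies:

```python
# Per-target accuracies for DaDiL-R and DaDiL-E from the table above.
dadil_r = [93.26, 97.50, 98.88, 86.65]
dadil_e = [93.89, 98.33, 98.88, 87.00]

avg_r = sum(dadil_r) / len(dadil_r)  # 94.0725, reported rounded as 94.07
avg_e = sum(dadil_e) / len(dadil_e)  # 94.525, reported rounded as 94.52
```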

Classification accuracy (in %) of DA methods on the Office 31 benchmark. Each column represents a target domain for which we report mean ± standard deviation over 5 folds.

We train our algorithms with all source domain data (train and test) and the target domain training data. For instance, when training an algorithm on the Office 31 dataset in the setting A, D → W, at training time the algorithm has 2817 labeled samples from A, 498 labeled samples from D, and 636 unlabeled samples from W available. For evaluation, we use the target test set; in the context of our example, 159 unlabeled samples from W.

Details about the datasets considered for domain adaptation. Artworks dataset: we select a subset of 50 artworks from the Best Artworks of All Time dataset, by Monet, Delacroix, Magritte, Caravaggio, and Van Gogh, with 10 images from each. The artworks were selected so that they group around 3 major color palettes (red, blue, and somber).


where we used the triangle inequality in the last line. The exact same process can be repeated for the term involving W₂(P̂_{k₂}, Q̂₂), which is bounded through the barycenter B̂₂. The final term follows from the fact that W₂(Q̂₁, Q̂₂) is independent of both k₁ and k₂, so that Σ_{k₁,k₂} π_{k₁,k₂} W₂(Q̂₁, Q̂₂) = W₂(Q̂₁, Q̂₂). Combining the three bounds yields the result.
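Collecting the three bounds above gives the following assembly (a sketch; the precise constants are those of Theorem 3.3 in the main text, and B̂ᵢ = B(αᵢ; P̂) denotes the barycentric reconstruction of Q̂ᵢ):

```latex
\begin{align*}
\mathrm{WED}(\hat{Q}_1, \hat{Q}_2)
&\leq \sum_{k_1,k_2} \pi_{k_1,k_2}\bigl(W_c(\hat{P}_{k_1}, \hat{Q}_1)
      + W_c(\hat{Q}_1, \hat{Q}_2) + W_c(\hat{P}_{k_2}, \hat{Q}_2)\bigr) \\
&\leq W_c(\hat{Q}_1, \hat{Q}_2)
      + \sum_{k=1}^{K} \alpha_{1,k}\, W_c(\hat{P}_k, \hat{B}_1) + W_c(\hat{B}_1, \hat{Q}_1)
      + \sum_{k=1}^{K} \alpha_{2,k}\, W_c(\hat{P}_k, \hat{B}_2) + W_c(\hat{B}_2, \hat{Q}_2),
\end{align*}
```

where the reconstruction terms W_c(B̂ᵢ, Q̂ᵢ) appear with coefficient 1 because Σ_k α_{i,k} = 1.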

A.4 THEORETICAL BOUNDS FOR DOMAIN ADAPTATION

In what follows, we consider the theoretical results of Ben-David et al. (2010) and Redko et al. (2017) to give theoretical guarantees for both DaDiL-E and DaDiL-R. For completeness, we re-state Lemma 1 and Theorem 1 of Redko et al. (2017), and prove Theorem 3.1.

Lemma A.1. (Due to Redko et al. (2017)) Let P and Q be two probability distributions over R^d. Assume that the cost function c(x^(P), x^(Q)) is convex, symmetric, bounded, obeys the triangle inequality, and has the parametric form |h(x) − h₀(x)|^q for some q > 0. Assume further that the kernel Φ ∈ H_Φ is square-root integrable w.r.t. both P and Q, and that 0 ≤ Φ(x^(P), x^(Q)) ≤ M for all x^(P), x^(Q) ∈ R^d. Then the risk bound of Lemma 1 in Redko et al. (2017) holds.

Lemma A.1 bounds the risk (recall equation 4) of h with respect to h′ under distribution Q by the risk under distribution P, plus a term depending on the distance between those two distributions. Intuitively, if P is close to Q, their samples are similar, and thus the risks are similar. We are now interested in bounds on the empirical risks R_P̂ and R_Q̂ in terms of W₁(P̂, Q̂), which are quantities we can estimate from data. We thus state Theorem 1 of Redko et al. (2017) as Lemma A.2, in which d′ and ζ′ can be calculated explicitly. Lemma A.2 states the conditions under which P and P̂ are close in the sense of the Wasserstein distance. The bound has the form P[quantity > ϵ] < δ, that is, with high probability, quantity ≤ ϵ. Bounds of this type are ubiquitous in the theoretical analysis of learning algorithms. We can express ϵ explicitly in terms of δ, which will be useful in the following discussion. These results allowed Redko et al. (2017) to prove the following theorem.

Theorem. (Due to Redko et al. (2017)) Under the assumptions of Lemma A.1, let X^(P) and X^(Q) be two samples of size n_P and n_Q, drawn i.i.d. from P and Q, and let P̂ and Q̂ be the respective empirical approximations.
Then, for any d′ > d and ζ < √2, there exists some constant N₀, depending on d′, such that, for any δ > 0 and min(n_P, n_Q) ≥ N₀ max(δ^(−(d′+2)), 1), the risk bound holds with probability at least 1 − δ for all h ∈ H. This theorem effectively states that, besides constant terms depending on the number of samples n_P and n_Q, two factors determine whether the risk under Q is similar to that under P: (i) the distance in distribution between P̂ and Q̂, and (ii) how well a classifier in H can work on both domains (the term λ). In practice, what OTDA does is minimize W₁(P̂, Q̂) while keeping λ constant (e.g., by enforcing class sparsity).

Now, let us discuss how these concepts apply to DaDiL-R. Our strategy consists of learning with samples from the barycentric distribution B̂_T = B(α_T; P̂), since it approximates the target distribution Q̂_T. In this sense, the term W₁(Q̂_T, B̂_T) is the reconstruction error for the target domain, which we directly minimize in our objective function (equation 6). The remaining term is λ, which we cannot directly optimize, since target domain labels are not available during training. Nonetheless, our algorithm leverages label information from the source domains. If we suppose that their class structure is similar to that of the target domain, the regularization term in equation 7 ensures that λ remains bounded, since transport plans will not mix classes.

We now focus on Theorem 3.1, which presents a theoretical guarantee for DaDiL-E.

Proof of Theorem 3.1: this proof relies on the triangle inequality for the risk, applied to each atom classifier ĥ_k. Summing over k, weighted by α, yields the result. This theorem is similar to previous theoretical guarantees for MSDA, such as those of Ben-David et al. (2010), Redko et al. (2017), and WBT_reg (Montesuma & Mboula (2021a)). In particular, DaDiL-E uses α_T for weighting the classifiers h*_k. Furthermore, this choice of α minimizes the r.h.s. of the bound.

