LEARNING DICTIONARIES OVER DATASETS THROUGH WASSERSTEIN BARYCENTERS

Abstract

Unsupervised Domain Adaptation is an important machine learning problem that aims at mitigating data distribution shift when transferring knowledge from a labeled domain to a similar, unlabeled one. Optimal transport has been shown in previous works to be a powerful tool for comparing and matching empirical distributions. As such, we propose a novel approach to Multi-Source Domain Adaptation that consists in learning a Wasserstein dictionary of labeled empirical distributions, as a means of interpolating distributional shift across several related domains and inferring labels on the target domain. We evaluate this method on the Caltech-Office 10 and Office 31 benchmarks, where it improves the state of the art by 1.96% and 2.70%, respectively. We provide further insight into our dictionary, exploring how interpolations of atoms provide useful predictors for target-domain data, and how the dictionary can be used to study the geometry of data distributions. Our framework opens interesting perspectives for fitting and generating datasets based on learned probability distributions.

1. INTRODUCTION

Dictionary Learning (DiL) is a representation learning technique that seeks to express a set of vectors in $\mathbb{R}^d$ as weighted linear combinations of elementary elements called atoms. When the vectors represent histograms, this problem is known as Nonnegative Matrix Factorization (NMF). Optimal Transport (OT) has previously contributed to this setting, either through a metric over histograms (Rolet et al., 2016) or by defining a non-linear way of aggregating atoms (Schmitz et al., 2018) through Wasserstein barycenters (Agueh & Carlier, 2011). In parallel, different problems in Machine Learning (ML) can be analyzed through a probabilistic lens, e.g., generative modeling (Goodfellow et al., 2014) and Domain Adaptation (DA) (Pan & Yang, 2009). For instance, in Multi-Source Domain Adaptation (MSDA), one wants to adapt data from heterogeneous domains or datasets to a new setting. In this case, the celebrated Empirical Risk Minimization (ERM) principle cannot be correctly applied due to the non-i.i.d. character of the data. However, we assume that the domain shifts have regularities that can be learned and leveraged for MSDA. Thus, in this paper, we take a novel approach to MSDA based on distributional DiL: we learn a dictionary of empirical distributions. As such, we reconstruct domains using interpolations in Wasserstein space, also known as Wasserstein barycenters. As we explore in section 3, this offers a principled framework for MSDA. We take inspiration from the works of Bonneel et al. (2016) and Schmitz et al. (2018) when defining our novel DiL framework. Indeed, these authors consider DiL over histograms, while we propose DiL over datasets, understood as point clouds, which enables its application to DA. We summarize our contributions as follows.

• Dictionary Learning. To the best of our knowledge, we are the first to propose a DiL problem over point clouds.

• Empirical Distributions Embedding.
As a by-product, we obtain embeddings of the DiL datasets as their barycentric coordinates w.r.t. the dictionary. We build on this new representation to define a (semi-)metric called the Wasserstein Embedding Distance (WED). We explore the WED theoretically (theorems 3.2 and 3.3) and in experiments (section 4.2).

• Domain Adaptation. We propose two novel ways of performing MSDA. The first relies on reconstructing labeled samples in the support of the target distribution. The second relies on weighting predictors learned on each atom, thus defining a new classifier that works on the target domain. We offer theoretical justification for both methods (section 3.1). We further explore, in section 4.1.3, general interpolations in the latent space of our dictionary.

Notation. We denote by $\Delta_N = \{a \in \mathbb{R}_+^N : \sum_{i=1}^N a_i = 1\}$ the $N$-dimensional probability simplex. We consider $n_P$ i.i.d. samples $X^{(P)} = \{x_i^{(P)}\}_{i=1}^{n_P} \in \mathbb{R}^{n_P \times d}$ from an unknown distribution $P$: $x_i^{(P)} \sim P$. The samples $X^{(P)}$ yield an empirical approximation $\hat{P}$ of $P$ as a sum of Dirac deltas, $\hat{P} = \sum_{i=1}^{n_P} p_i \delta_{x_i^{(P)}}$, with $p_i = 1/n_P$ unless stated otherwise. Similarly, $\hat{Q} = \sum_{i=1}^{n_Q} q_i \delta_{x_i^{(Q)}}$, with $q_i = 1/n_Q$, is an empirical approximation of a probability distribution $Q$ on $\mathbb{R}^d$. Additionally, each $x_i^{(P)}$ may have a label $y_i^{(P)} \in \{1, \dots, n_c\}$, or $Y_i^{(P)} \in \Delta_{n_c}$ for its one-hot encoding. In this case, $\hat{P} = \sum_{i=1}^{n_P} p_i \delta_{(x_i^{(P)}, y_i^{(P)})}$ is an empirical approximation of the joint distribution $P(X, Y)$.

Paper Structure. Section 2 gives brief introductions to OT, DA, and DiL. Section 3 introduces our view on dictionary learning. Section 4 covers experiments on manifold learning of distributions and domain adaptation. Section 5 presents our conclusions.
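To illustrate the second MSDA strategy above, the sketch below combines class-probability predictors trained on each atom, weighted by the barycentric coordinates $\alpha$ of the target domain. The names `weighted_atom_classifier` and `prob_fns` are ours, and this is only a minimal, hypothetical reading of the idea, not the exact classifier developed in section 3.1.

```python
import numpy as np

def weighted_atom_classifier(prob_fns, alpha):
    """Combine per-atom predictors with barycentric weights (hypothetical sketch).

    prob_fns: list of K functions, each mapping an (n, d) sample array to an
              (n, n_c) array of class probabilities (one predictor per atom).
    alpha:    barycentric coordinates of the target domain in the simplex.
    Returns a function predicting labels for target-domain samples.
    """
    def predict(X):
        # Alpha-weighted mixture of the atoms' class-probability outputs.
        probs = sum(a_k * h_k(X) for a_k, h_k in zip(alpha, prob_fns))
        return probs.argmax(axis=1)
    return predict
```
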

2.1. OPTIMAL TRANSPORT

In this section, we focus on the computational treatment of OT (Peyré et al., 2019), which predominantly relies on empirical approximations of distributions. There are two discretization strategies. The first, known as Eulerian discretization, bins $\mathbb{R}^d$ into a fixed grid $\{x_i^{(P)}\}_{i=1}^{n_P}$, so that $p_i$ corresponds to how many samples are assigned to the $i$-th bin. The second, known as Lagrangian discretization, assumes the $x_i^{(P)}$ are sampled i.i.d. according to $P$. Henceforth, contrary to Bonneel et al. (2016) and Schmitz et al. (2018), we use the Lagrangian discretization.

For $\hat{P}$, $\hat{Q}$, in the formulation of Kantorovich (1942), OT seeks a transport plan $\pi \in \mathbb{R}^{n_P \times n_Q}$, where $\pi_{i,j}$ represents how much mass is transported from $x_i^{(P)}$ to $x_j^{(Q)}$. The plan $\pi$ is required to preserve mass, that is, $\sum_{i=1}^{n_P} \pi_{i,j} = q_j$ and $\sum_{j=1}^{n_Q} \pi_{i,j} = p_i$, i.e., $\pi \in U(p, q)$, the set of bi-stochastic matrices with marginals $p \in \Delta_{n_P}$, $q \in \Delta_{n_Q}$. Therefore,

$$\pi^\star = \text{OT}(p, q, C) = \underset{\pi \in U(p,q)}{\text{argmin}} \sum_{i=1}^{n_P} \sum_{j=1}^{n_Q} C_{i,j}\,\pi_{i,j} = \langle C, \pi \rangle_F, \quad (1)$$

is the OT problem between $\hat{P}$ and $\hat{Q}$. In equation 1, $C_{i,j} = c(x_i^{(P)}, x_j^{(Q)})$ is called the ground-cost matrix. When $c$ is a distance, OT defines a distance between distributions (Peyré et al., 2019) called the Wasserstein distance, $W_c(\hat{P}, \hat{Q}) = \langle C, \pi^\star \rangle_F$. For $\mathcal{P} = \{\hat{P}_k\}_{k=1}^K$ and $\alpha \in \Delta_K$, the Wasserstein barycenter (Agueh & Carlier, 2011) is a solution to,

$$B^\star = \mathcal{B}(\alpha; \mathcal{P}) = \inf_B \sum_{k=1}^K \alpha_k W_c(\hat{P}_k, B). \quad (2)$$

Henceforth, we call $\mathcal{B}(\cdot; \mathcal{P})$ the barycentric operator. For empirical $\hat{P}_k$, estimating $\hat{B}^\star$ is done through the fixed-point iterations of Álvarez-Esteban et al. (2016),

$$(C_k)_{i,j} = \|x_i^{(P_k)} - x_j^{(B)}\|_2^2; \quad \pi_k = \text{OT}(p_k, b, C_k); \quad \hat{X}^{(B)} = \sum_{k=1}^K \alpha_k\, \text{diag}(b)^{-1} \pi_k^T X^{(P_k)}, \quad (3)$$

until convergence. In this case, $b \in \Delta_n$ (e.g., $b_j = 1/n$). We further discuss the iterations in equation 3 in appendix A.
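To make equations 1–3 concrete, here is a minimal NumPy/SciPy sketch for the special case of equal-size point clouds with uniform weights, where the linear program in equation 1 admits an optimal plan that is a permutation matrix scaled by $1/n$ and can be found by linear assignment. The function names are ours; the paper's actual implementation may differ (e.g., it may use a general-purpose OT solver).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd_assignment(X_p, X_q):
    """Optimal plan of equation 1 between two equal-size uniform point clouds.

    With n_P = n_Q = n and uniform weights, an optimal plan is a permutation
    matrix scaled by 1/n, obtained by solving a linear assignment problem.
    """
    n = len(X_p)
    C = cdist(X_p, X_q, metric="sqeuclidean")  # ground-cost matrix
    rows, cols = linear_sum_assignment(C)
    pi = np.zeros((n, n))
    pi[rows, cols] = 1.0 / n
    return pi, C

def wasserstein_dist(X_p, X_q):
    """W_c between the empirical distributions supported on X_p and X_q."""
    pi, C = emd_assignment(X_p, X_q)
    return np.sum(pi * C)

def free_support_barycenter(Xs, alpha, n_iter=20, seed=0):
    """Fixed-point iterations of equation 3; all clouds have equal size n."""
    rng = np.random.default_rng(seed)
    n = len(Xs[0])
    X_B = rng.standard_normal(Xs[0].shape)  # initial barycenter support
    for _ in range(n_iter):
        X_new = np.zeros_like(X_B)
        for a_k, X_k in zip(alpha, Xs):
            pi_k, _ = emd_assignment(X_k, X_B)
            # alpha_k * diag(b)^{-1} pi_k^T X_k, with b_j = 1/n
            X_new += a_k * n * (pi_k.T @ X_k)
        X_B = X_new
    return X_B
```

For two point clouds, `free_support_barycenter([X, Y], [0.5, 0.5])` returns a support whose mean lies at the midpoint of the two clouds' means, since each fixed-point update preserves the $\alpha$-weighted average of the inputs' centroids.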
