LEARNING DICTIONARIES OVER DATASETS THROUGH WASSERSTEIN BARYCENTERS

Abstract

Unsupervised Domain Adaptation is an important machine learning problem that aims at mitigating data distribution shifts when transferring knowledge from a labeled domain to a similar but unlabeled domain. Optimal transport has been shown in previous works to be a powerful tool for comparing and matching empirical distributions. Building on this, we propose a novel approach to Multi-Source Domain Adaptation that consists in learning a Wasserstein dictionary of labeled empirical distributions, as a means of interpolating distributional shift across several related domains and inferring labels on the target domain. We evaluate this method on the Caltech-Office 10 and Office 31 benchmarks, where it improves the state of the art by 1.96% and 2.70%, respectively. We provide further insight into our dictionary, exploring how interpolations of atoms provide useful predictors for target-domain data and how the dictionary can be used to study the geometry of data distributions. Our framework opens interesting perspectives for fitting and generating datasets based on learned probability distributions.

1. INTRODUCTION

Dictionary Learning (DiL) is a representation learning technique that seeks to express a set of vectors in R^d as weighted linear combinations of elementary vectors called atoms. When the vectors represent histograms, this problem is known as Nonnegative Matrix Factorization (NMF). Optimal Transport (OT) has previously contributed to this setting, either by providing a metric over histograms (Rolet et al., 2016) or by defining a non-linear way of aggregating atoms (Schmitz et al., 2018) through Wasserstein barycenters (Agueh & Carlier, 2011). In parallel, various Machine Learning (ML) problems can be analyzed from a probabilistic viewpoint, e.g., generative modeling (Goodfellow et al., 2014) and Domain Adaptation (DA) (Pan & Yang, 2009). For instance, in Multi-Source Domain Adaptation (MSDA), one wants to adapt data from heterogeneous domains or datasets to a new setting. In this case, the celebrated Empirical Risk Minimization (ERM) principle cannot be applied directly due to the non-i.i.d. character of the data. However, we assume that the domain shifts exhibit regularities that can be learned and leveraged for MSDA.

Thus, in this paper, we take a novel approach to MSDA based on distributional DiL: we learn a dictionary of empirical distributions. As such, we reconstruct domains using interpolations in the Wasserstein space, also known as Wasserstein barycenters. As we explore in Section 3, this offers a principled framework for MSDA. We take inspiration from the works of Bonneel et al. (2016) and Schmitz et al. (2018) in defining our novel DiL framework. These authors consider DiL over histograms, whereas we propose DiL over datasets, understood as point clouds, which enables its application to DA. We summarize our contributions as follows:

• Dictionary Learning. To the best of our knowledge, we are the first to propose a DiL problem over point clouds.

• Empirical Distributions Embedding. As a by-product, we obtain embeddings of the datasets as their barycentric coordinates w.r.t. the dictionary. We build on this new representation to define a (semi-)metric called the Wasserstein Embedding Distance (WED). We study the WED theoretically (Theorems 3.2 and 3.3) and experimentally (Section 4.2).
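To make the notion of interpolating distributions with Wasserstein barycenters concrete, the following is a minimal NumPy sketch, not the paper's full method (which handles multi-dimensional labeled point clouds). It relies on the standard fact that, in one dimension, the Wasserstein-2 barycenter of equal-size empirical samples is obtained by sorting each sample and averaging the order statistics with the barycentric weights. The function name and the two toy "atoms" are illustrative choices, not from the paper.

```python
import numpy as np

def wasserstein_barycenter_1d(clouds, weights):
    """Barycenter of equal-size 1-D point clouds under Wasserstein-2.

    Sorting each sample aligns it with the others via the optimal
    (monotone) transport plan; the barycenter is then the weighted
    average of the aligned (sorted) support points.
    """
    sorted_clouds = np.sort(np.asarray(clouds, dtype=float), axis=1)
    return np.average(sorted_clouds, axis=0, weights=weights)

# Two toy "atoms": samples concentrated near 0 and near 10.
atom_a = np.array([0.0, 0.1, 0.2, 0.3])
atom_b = np.array([10.0, 10.1, 10.2, 10.3])

# With equal weights, the barycenter's support sits halfway between the
# atoms: the mass is displaced, not mixed as in a density average.
mid = wasserstein_barycenter_1d([atom_a, atom_b], weights=[0.5, 0.5])
print(mid)  # [5.  5.1 5.2 5.3]
```

Varying the weights sweeps the barycenter continuously between the atoms, which is the geometric mechanism the dictionary exploits: a domain is represented by the weight vector that best reconstructs it from the atoms.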

