UNSUPERVISED MANIFOLD ALIGNMENT WITH JOINT MULTIDIMENSIONAL SCALING

Abstract

We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment. The implementation of our work is available at https://github.com/BorgwardtLab/JointMDS.

1. INTRODUCTION

Many problems in machine learning require joint visual exploration and manipulation of multiple datasets from different (heterogeneous) domains, which is generally a preferable first step prior to any further data analysis. These different data domains may consist of measurements for the same samples obtained with different methods or technologies, such as single-cell multi-omics data in bioinformatics (Demetci et al., 2022; Liu et al., 2019; Cao & Gao, 2022) . Alternatively, the data could be comprised of different datasets of similar objects, such as word spaces of different languages in natural language modeling (Alvarez-Melis et al., 2019; Grave et al., 2019) , or graphs representing related objects such as disease-procedure recommendation in biomedicine (Xu et al., 2019b) . There are two main challenges in joint exploration of multiple datasets. First, the data from the heterogeneous domains may be high-dimensional or may not possess input features but rather only dissimilarities between them. Second, the correspondences between data instances across different domains may not be known a priori. We propose in this work to tackle both issues simultaneously while making few assumptions on the data modality. To address the first challenge, for several decades many dimensionality reduction methods have been proposed to provide lower-dimensional embeddings of data. Among them, multidimensional scaling (MDS) (Borg & Groenen, 2005) and its extensions (Tenenbaum et al., 2000; Chen & Buja, 2009) are widely used ones. They generate low-dimensional data embeddings while preserving the local and global information about its manifold structure. One of the key characteristics of MDS is the fact that it only requires pairwise (dis)similarities as input rather than specific data features, which makes it applicable to problems whose data does not have access to the input features, such as graph node embedding learning (Gansner et al., 2004) . However, when it comes to dealing with multiple datasets from different domains at the same time, the subspaces learned by MDS for different datasets are not naturally aligned, making it not directly applicable for a joint exploration.

