FACTORING OUT PRIOR KNOWLEDGE FROM LOW-DIMENSIONAL EMBEDDINGS

Abstract

Low-dimensional embedding techniques such as tSNE and UMAP allow us to visualize high-dimensional data and thereby facilitate the discovery of interesting structure. Although they are widely used, they visualize the data as is, rather than in light of the background knowledge we already have about it. What we already know, however, strongly determines what is novel and hence interesting. In this paper we propose two methods for factoring out prior knowledge, given in the form of distance matrices, from low-dimensional embeddings. To factor out prior knowledge from tSNE embeddings, we propose JEDI, which adapts the tSNE objective in a principled way using the Jensen-Shannon divergence. To factor out prior knowledge from any downstream embedding approach, we propose CONFETTI, which operates directly on the input distance matrices. Extensive experiments on both synthetic and real-world data show that both methods work well, providing embeddings that exhibit meaningful structure that would otherwise remain hidden.

1. INTRODUCTION

Embedding high-dimensional data into low-dimensional spaces, for example with tSNE (van der Maaten & Hinton, 2008) or UMAP (McInnes et al., 2018), allows us to visually inspect and discover meaningful structure in the data that would otherwise be difficult or impossible to see. These methods are as popular as they are useful, but they are limited in that they are one-shot only: they embed the data as is, and that is that. If the resulting embedding reveals novel knowledge, all is well. But what if the structure that dominates it is something we already know and are no longer interested in, or what if we want to discover whether the data has meaningful structure beyond what the first result revealed? In word embeddings, for example, we may already know that certain words are synonyms, while in single-cell sequencing we may want to discover structure other than known cell types, or factor out family relationships. The question at hand is therefore: how can we obtain low-dimensional embeddings that reveal structure beyond what we already know, i.e., how can we factor out prior knowledge from low-dimensional embeddings? For conditional embeddings, research has so far mostly focused on emphasizing rather than factoring out prior knowledge (De Ridder et al., 2003; Hanhijärvi et al., 2009; Barshan et al., 2011), with conditional tSNE as a notable exception, which, however, can only factor out label information (Kang et al., 2019). Here, we propose two techniques for factoring out a more general form of prior knowledge from low-dimensional embeddings of arbitrary data types. In particular, we consider background knowledge in the form of pairwise distances between samples. This formulation covers a plethora of practical instances, including labels, clustering structure, family trees, and user-defined distances, but also, and especially important for unstructured data, kernel matrices.
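To make the distance-matrix formulation of background knowledge concrete, consider the simplest instance mentioned above, label information. A minimal sketch of one common encoding (the function name `label_prior_distances` and the 0/1 encoding are illustrative choices, not prescribed by the paper) is:

```python
import numpy as np

def label_prior_distances(labels):
    """Encode class labels as a pairwise prior-distance matrix:
    distance 0 for samples that share a label, 1 otherwise.
    (One simple encoding; arbitrary distance matrices are supported.)"""
    labels = np.asarray(labels)
    # Broadcasting the comparison yields an n x n boolean mismatch matrix.
    return (labels[:, None] != labels[None, :]).astype(float)

# Example: three samples, two classes.
D_prior = label_prior_distances(["a", "a", "b"])
```

The same interface accommodates richer priors, e.g. path lengths in a family tree or distances derived from a kernel matrix, since any symmetric pairwise distance matrix can take the place of `D_prior`.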
To factor out prior knowledge from tSNE embeddings, we propose JEDI, in which we adapt the tSNE objective in a principled way using the Jensen-Shannon divergence. It has an intuitively appealing information-theoretic interpretation, and it retains all the strengths and weaknesses of tSNE. One of the latter is runtime, which is why UMAP is particularly popular in bioinformatics. To factor out prior knowledge from embedding approaches in general, including UMAP, we hence propose CONFETTI, which operates directly on the input data. An extensive set of experiments shows that both methods work well in practice and provide embeddings that reveal meaningful structure beyond the provided background knowledge, such as organizing flower images according to shape rather than color, or organizing single-cell gene expression data beyond cell type, revealing batch effects and tissue type.
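The shape of such an adapted objective can be sketched as follows. Standard tSNE minimizes the Kullback-Leibler divergence between the high-dimensional affinities $P$ and the low-dimensional affinities $Q$; a JEDI-style objective additionally rewards embeddings whose affinities are dissimilar, in Jensen-Shannon terms, from the prior affinities $P'$ derived from the given distance matrix. The trade-off weight $\beta$ and the plain (unskewed) form of the JS term below are assumptions of this sketch, not the paper's exact formulation:

```latex
% Standard tSNE objective:
\mathcal{C}_{\mathrm{tSNE}}
  = \mathrm{KL}(P \,\|\, Q)
  = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

% JEDI-style adaptation (schematic): fit P while repelling the prior P'.
\mathcal{C}_{\mathrm{JEDI}}
  = \mathrm{KL}(P \,\|\, Q) \;-\; \beta \, \mathrm{JS}(P' \,\|\, Q)
```

Minimizing the first term pulls the embedding towards the structure of the data, while the subtracted JS term pushes its affinities away from those implied by the prior, which matches the stated goal of revealing structure beyond what is already known.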

