FACTORING OUT PRIOR KNOWLEDGE FROM LOW-DIMENSIONAL EMBEDDINGS

Abstract

Low-dimensional embedding techniques such as tSNE and UMAP allow visualizing high-dimensional data and thereby facilitate the discovery of interesting structure. Although they are widely used, they visualize the data as is, rather than in light of the background knowledge we have about it. What we already know, however, strongly determines what is novel, and hence what is interesting. In this paper we propose two methods for factoring out prior knowledge in the form of distance matrices from low-dimensional embeddings. To factor out prior knowledge from tSNE embeddings, we propose JEDI, which adapts the tSNE objective in a principled way using the Jensen-Shannon divergence. To factor out prior knowledge from any downstream embedding approach, we propose CONFETTI, in which we directly operate on the input distance matrices. Extensive experiments on both synthetic and real-world data show that both methods work well, providing embeddings that exhibit meaningful structure that would otherwise remain hidden.

1. INTRODUCTION

Embedding high dimensional data into low dimensional spaces, such as with tSNE (van der Maaten & Hinton, 2008) or UMAP (McInnes et al., 2018), allows us to visually inspect and discover meaningful structure in the data that would otherwise be difficult or impossible to see. These methods are as popular as they are useful, but at the same time limited in that they are one-shot only: they embed the data as is, and that is that. If the resulting embedding reveals novel knowledge, all is well. But what if the structure that dominates it is something we already know, something we are no longer interested in, or what if we want to discover whether the data has meaningful structure other than what the first result revealed? In word embeddings, for example, we may already know that certain words are synonyms, while in single cell sequencing we may want to discover structure other than known cell types, or factor out family relationships. The question at hand is therefore: how can we obtain low-dimensional embeddings that reveal structure beyond what we already know, i.e. how can we factor out prior knowledge from low-dimensional embeddings? For conditional embeddings, research has so far mostly focused on emphasizing rather than factoring out prior knowledge (De Ridder et al., 2003; Hanhijärvi et al., 2009; Barshan et al., 2011), with conditional tSNE as a notable exception, which, however, can only factor out label information (Kang et al., 2019). Here, we propose two techniques for factoring out a more general form of prior knowledge from low-dimensional embeddings of arbitrary data types. In particular, we consider background knowledge in the form of pairwise distances between samples. This formulation allows us to cover a plethora of practical instances including labels, clustering structure, family trees, user-defined distances, but also, and especially important for unstructured data, kernel matrices.
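To make the formulation above concrete, here is a minimal sketch of how one common kind of prior knowledge, class labels, can be encoded as a pairwise distance matrix. The function name and the 0/1 encoding are our own illustrative choices, not part of the paper; priors such as family trees or kernel matrices yield distance matrices in an analogous way.

```python
import numpy as np

def labels_to_distance_matrix(labels):
    """Encode class labels as a pairwise distance matrix:
    distance 0 for samples that share a label, 1 otherwise.
    (Illustrative encoding, not taken from the paper.)"""
    labels = np.asarray(labels)
    # Broadcasting compares every pair of labels at once.
    return (labels[:, None] != labels[None, :]).astype(float)

# Four samples, three classes: the prior says samples 0 and 1 are "the same".
D_prior = labels_to_distance_matrix(["a", "a", "b", "c"])
# D_prior is symmetric with a zero diagonal.
```

Any source of pairwise (dis)similarity that can be written in this square, symmetric form fits the problem setting.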
To factor out prior knowledge from tSNE embeddings, we propose JEDI, in which we adapt the tSNE objective in a principled way using the Jensen-Shannon divergence. It has an intuitively appealing information theoretic interpretation, and maintains all the strengths and weaknesses of tSNE. One of these weaknesses is runtime, which is why UMAP is particularly popular in bioinformatics. To factor out prior knowledge from embedding approaches in general, including UMAP, we hence propose CONFETTI, which directly operates on the input data. An extensive set of experiments shows that both methods work well in practice and provide embeddings that reveal meaningful structure beyond the provided background knowledge, such as organizing flower images according to shape rather than color, or organizing single cell gene expression data beyond cell type, revealing batch effects and tissue type.

2. RELATED WORK

Embedding high dimensional data into low dimensional spaces is a research topic of perennial interest that includes classic methods such as principal component analysis (Pearson, 1901), multidimensional scaling (Torgerson, 1952), self organizing maps (Kohonen, 1982), and isomap (Tenenbaum et al., 2000), all of which focus on keeping large distances intact. This is inadequate for data that lies on a manifold that resembles a Euclidean space only locally, which is the case for high dimensional data (Silva & Tenenbaum, 2003) and for which we hence need methods such as locally linear embedding (LLE) (Roweis & Saul, 2000) and stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003) that focus on keeping local distances intact. The current state of the art methods are t-distributed SNE (tSNE) by van der Maaten & Hinton (2008) and Uniform Manifold Approximation and Projection (UMAP) by McInnes et al. (2018). Both are by now staple methods for data processing, e.g. in biology (Becht et al., 2019; Kobak & Berens, 2019) and NLP (Coenen et al., 2019). As they often yield highly similar embeddings (Kobak & Linderman, 2019), it is largely a matter of taste which one to use. While tSNE has an intuitive interpretation, it suffers from much longer runtimes than UMAP, despite recent optimizations (van der Maaten, 2014; Linderman et al., 2019).

Whereas the above consider only the data as is, there also exist proposals that additionally take user input and/or domain knowledge into account. Specifically for gene expression, it was proposed to optimize projections of gene expression to model similarities in corresponding gene ontology annotations (Peltonen et al., 2010).
More recently, attention has been brought to removing unwanted variation (RUV) from data using negative controls, in particular in the context of gene expression, assuming that the expression can be modeled as a linear function of factors of variation and a normally distributed variable (Gagnon-Bartsch & Speed, 2012). This approach has been successfully applied to different tasks and domains of gene expression (Risso et al., 2014; Buettner et al., 2015; Gerstner et al., 2016; Hung, 2019). Here, we are interested in developing a domain independent method to obtain low-dimensional embeddings while factoring out prior knowledge. For that, we neither want to assume a functional relationship between prior and input data, nor do we want to assume a particular distribution of the input, but rather keep the original data manifold intact. Furthermore, we do not want to rely on negative samples that have to be known and present in the data to be able to factor out the prior. The general, domain independent methods supervised LLE (De Ridder et al., 2003), guided LLE (Alipanahi & Ghodsi, 2011), and supervised PCA (Barshan et al., 2011) all aim to emphasize rather than factor out the structure given as prior knowledge. Like us, Kang et al. (2016; 2019) and Puolamäki et al. (2018) factor out background knowledge, but are much more limited in the type of prior knowledge they admit. In particular, Puolamäki et al. (2018) require users to specify clusters in the embedded space, Kang et al. (2016) require background knowledge for which a maximum entropy distribution can be obtained, while Kang et al. (2019) extend tSNE and propose conditional tSNE (ctSNE), which accepts prior knowledge in the form of class labels. In contrast, we consider prior knowledge in the form of arbitrary distance metrics, which can capture relative relationships that appear naturally in real world data, such as differences in age, geographic location, or level of gene expression.
We propose both an information theoretic extension to tSNE and an embedding-algorithm independent approach to factoring out prior knowledge.

3. THEORY

We present two approaches, with distinct properties, that both solve the problem of embedding high dimensional data while factoring out prior knowledge. We start with an informal definition of the problem, after which we introduce vanilla tSNE. We then present our first solution, JEDI, which extends the tSNE objective to incorporate prior information. Finally, we present CONFETTI, which uses an elegant yet powerful idea: it factors out prior knowledge directly from the distance matrix of the high dimensional data, and can hence be used in combination with any embedding algorithm that operates on distance matrices.
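As standard background for the discussion that follows, vanilla tSNE (van der Maaten & Hinton, 2008) embeds the data by minimizing the Kullback-Leibler divergence between pairwise affinities $P$ in the high dimensional space and $Q$ in the embedding, where $Q$ uses a Student-t kernel:

```latex
C_{\mathrm{tSNE}} \;=\; \mathrm{KL}(P \,\|\, Q) \;=\; \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} \;=\; \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

This is the textbook objective, restated here only for reference; the modified, Jensen-Shannon-based objective of JEDI is developed below.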

3.1. THE PROBLEM -INFORMALLY

Given a set of n samples X from a high dimensional space, e.g. ℝ^d, our goal is to find a low dimensional representation Y in ℝ^2 that captures the local structure in X while factoring out prior knowledge.
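The distance-matrix view of this problem can be sketched in code. The following is a hypothetical illustration only: the combination step simply subtracts the rescaled prior distances from the data distances and clips at zero, which is not the operator CONFETTI actually uses (that is defined later in the paper), and all helper names are our own.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix of the rows of X."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(D2, 0.0))

def factor_out_prior(D_X, D_prior, alpha=1.0):
    """Illustrative combination: remove the prior's contribution
    from the data distances (NOT the paper's exact operator)."""
    # Rescale the prior to the range of the data distances so the
    # subtraction operates on a comparable scale.
    scale = D_X.max() / max(D_prior.max(), 1e-12)
    D = D_X - alpha * scale * D_prior
    return np.maximum(D, 0.0)

# The corrected matrix can then be handed to any embedding method
# that accepts precomputed distances, e.g. with scikit-learn:
#   Y = TSNE(metric="precomputed", init="random").fit_transform(D)
```

The appeal of operating on the distance matrix itself is exactly what this sketch suggests: the downstream embedding algorithm (tSNE, UMAP, MDS, ...) is treated as a black box.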




