FROM t-SNE TO UMAP WITH CONTRASTIVE LEARNING

Abstract

Neighbor embedding methods t-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between t-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize t-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to t-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) t-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete/local and continuous/global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.

1. INTRODUCTION

Low-dimensional visualization of high-dimensional data is a ubiquitous step in exploratory data analysis, and the toolbox of visualization methods has been growing rapidly in recent years (McInnes et al., 2018; Amid & Warmuth, 2019; Szubert et al., 2019; Wang et al., 2021). Since all of these methods necessarily distort the true data layout (Chari et al., 2021), it is beneficial to have various tools at one's disposal. But only when equipped with a theoretical understanding of the aims of, and relationships between, different methods can practitioners make informed decisions about which visualization to use for which purpose and how to interpret the results. The state of the art for non-parametric, non-linear dimensionality reduction relies on the neighbor embedding framework (Hinton & Roweis, 2002). Its two most popular examples are t-SNE (van der Maaten & Hinton, 2008; van der Maaten, 2014) and UMAP (McInnes et al., 2018). Both can produce insightful, but qualitatively distinct embeddings. However, why their embeddings differ and how exactly their loss functions are related has remained elusive. Here, we answer this question and thus explain the mathematical underpinnings of the relationship between t-SNE and UMAP. Our conceptual insight naturally suggests a spectrum of embedding methods complementary to that of Böhm et al. (2022), along which the focus of the visualization shifts from local to global structure (Fig. 1). On this spectrum, UMAP and t-SNE are simply two instances, and inspecting various embeddings helps guard against over-interpretation of apparent structure. As a practical corollary, our analysis identifies and remedies an instability in UMAP. We provide the new connection between t-SNE and UMAP via a deeper understanding of contrastive learning methods.
Noise-contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010; 2012) can be used to optimize t-SNE (Artemenkov & Panov, 2020), while UMAP relies on another contrastive method, negative sampling (NEG) (Mikolov et al., 2013). We investigate the discrepancy between NCE and NEG, show that NEG introduces a distortion, and demonstrate that this distortion explains how UMAP and t-SNE embeddings differ. Finally, we discuss the relationship between neighbor embeddings and self-supervised learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020; Le-Khac et al., 2020). In summary, our contributions are:

1. a new connection between the contrastive methods NCE and NEG (Sec. 4),
2. the exact relation between t-SNE and UMAP and a remedy for an instability in UMAP (Sec. 6),
3. a spectrum of 'contrastive' neighbor embeddings encompassing UMAP and t-SNE (Sec. 5),
4. a connection between neighbor embeddings and self-supervised learning (Sec. 7),
5. a unified PyTorch framework for contrastive (non-)parametric neighbor embedding methods.

Our code is available at https://github.com/berenslab/contrastive-ne and https://github.com/hci-unihd/cl-tsne-umap.
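The contrast between the two objectives can be sketched schematically. The following is a minimal illustration, not the paper's implementation: it assumes a Cauchy similarity kernel, `m` noise samples per positive pair, and a uniform noise density `xi`; all function and parameter names here are our own choices.

```python
import torch

def cauchy(d2):
    # Cauchy kernel 1 / (1 + d^2) on squared distances in the embedding space
    return 1.0 / (1.0 + d2)

def neg_loss(pos_d2, neg_d2):
    # Negative sampling (NEG, UMAP-style): the unnormalized kernel is fed
    # into a binary cross-entropy directly, with no normalization constant
    q_pos = cauchy(pos_d2)
    q_neg = cauchy(neg_d2)
    return -(torch.log(q_pos).mean() + torch.log1p(-q_neg).mean())

def nce_loss(pos_d2, neg_d2, log_Z, m, xi):
    # Noise-contrastive estimation (NCE, NC-t-SNE-style): the kernel is
    # divided by a learnable normalization constant Z, and a classifier
    # discriminates model density from m noise samples with density xi
    q_pos = cauchy(pos_d2) / torch.exp(log_Z)
    q_neg = cauchy(neg_d2) / torch.exp(log_Z)
    loss_pos = -torch.log(q_pos / (q_pos + m * xi)).mean()
    loss_neg = -torch.log(m * xi / (q_neg + m * xi)).mean()
    return loss_pos + loss_neg
```

Setting the NCE normalization to a fixed constant rather than learning `log_Z` recovers a NEG-like objective, which is the structural difference the following sections make precise.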

2. RELATED WORK

One of the most popular methods for data visualization is t-SNE (van der Maaten & Hinton, 2008; van der Maaten, 2014). The recently developed NCVis (Artemenkov & Panov, 2020) employs noise-contrastive estimation (Gutmann & Hyvärinen, 2010; 2012) to approximate t-SNE in a sampling-based way. Therefore, we will often refer to the NCVis algorithm as 'NC-t-SNE'. UMAP (McInnes et al., 2018) has matched t-SNE's popularity, at least in computational biology (Becht et al., 2019), and uses another sampling-based optimization method, negative sampling (Mikolov et al., 2013), also employed by LargeVis (Tang et al., 2016). Other recent sampling-based visualization methods include TriMap (Amid & Warmuth, 2019) and PaCMAP (Wang et al., 2021). Given their success, t-SNE and UMAP have been scrutinized to find out which aspects are essential to their performance. Initialization was found to strongly influence the global structure



Figure 1: (a–e) Neg-t-SNE embedding spectrum of the MNIST dataset for various values of the fixed normalization constant Z, see Sec. 5. As Z increases, the scale of the embedding decreases, and clusters become more compact and separated before eventually starting to merge. The Neg-t-SNE spectrum produces embeddings very similar to those of (f) t-SNE, (g) NCVis, and (h) UMAP, when Z equals the partition function of t-SNE, the learned normalization parameter Z of NCVis, or |X|/m = n²/m used by UMAP, as predicted in Sec. 4–6. (i) The partition function Σᵢⱼ(1 + d²ᵢⱼ)⁻¹ tries to match Z and grows with it. Here, we initialized all Neg-t-SNE runs using Z = |X|/m; without this 'early exaggeration', low values of Z yield fragmented clusters (Fig. S11).
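The partition function tracked in panel (i), the sum of Cauchy similarities over all pairs of embedding points, can be computed directly for any 2D embedding. The sketch below is ours, not the paper's code; the function name and the use of `torch.cdist` are our choices.

```python
import torch

def partition_function(Y):
    # Sum of Cauchy similarities (1 + d_ij^2)^(-1) over all pairs i != j
    # of the n x 2 embedding matrix Y
    d2 = torch.cdist(Y, Y).pow(2)        # pairwise squared Euclidean distances
    q = 1.0 / (1.0 + d2)
    return q.sum() - q.diagonal().sum()  # drop the n self-pairs (d_ii = 0)
```

For large embeddings this O(n²) sum is what sampling-based methods like NCVis avoid computing exactly, which is precisely why the learned normalization constant matters.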

