FROM t-SNE TO UMAP WITH CONTRASTIVE LEARNING

Abstract

Neighbor embedding methods t-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated by entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between t-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize t-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to t-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) t-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete/local and continuous/global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.

1. INTRODUCTION

Low-dimensional visualization of high-dimensional data is a ubiquitous step in exploratory data analysis, and the toolbox of visualization methods has been growing rapidly in recent years (McInnes et al., 2018; Amid & Warmuth, 2019; Szubert et al., 2019; Wang et al., 2021). Since all of these methods necessarily distort the true data layout (Chari et al., 2021), it is beneficial to have various tools at one's disposal. But only when equipped with a theoretical understanding of the aims of, and relationships between, different methods can practitioners make informed decisions about which visualization to use for which purpose and how to interpret the results.

The state of the art for non-parametric, non-linear dimensionality reduction relies on the neighbor embedding framework (Hinton & Roweis, 2002). Its two most popular examples are t-SNE (van der Maaten & Hinton, 2008; van der Maaten, 2014) and UMAP (McInnes et al., 2018). Both can produce insightful, but qualitatively distinct, embeddings. However, why their embeddings differ, and what exactly the conceptual relation between their loss functions is, has remained elusive. Here, we answer this question and thus explain the mathematical underpinnings of the relationship between t-SNE and UMAP. Our conceptual insight naturally suggests a spectrum of embedding methods complementary to that of Böhm et al. (2022), along which the focus of the visualization shifts from local to global structure (Fig. 1). On this spectrum, UMAP and t-SNE are simply two instances, and inspecting various embeddings helps to guard against over-interpretation of apparent structure. As a practical corollary, our analysis identifies and remedies an instability in UMAP.
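To make the two contrastive objectives named above concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation) of the per-pair losses used by negative sampling (as in UMAP) and by noise-contrastive estimation (NCE). It assumes a Cauchy similarity kernel on embedding distances, a uniform noise distribution with probability `noise_p`, and a normalization constant `Z`; all function names and parameters here are hypothetical.

```python
import math

def cauchy_sim(d):
    # Cauchy kernel of the embedding distance d, the low-dimensional
    # similarity commonly used in t-SNE/UMAP-style embeddings.
    return 1.0 / (1.0 + d * d)

def neg_sampling_loss(d_pos, d_negs):
    # Negative-sampling objective (UMAP-style): binary cross-entropy
    # that attracts the positive pair and repels sampled negatives.
    loss = -math.log(cauchy_sim(d_pos))
    for d in d_negs:
        loss += -math.log(1.0 - cauchy_sim(d) + 1e-12)
    return loss

def nce_loss(d_pos, d_negs, noise_p=1.0, Z=1.0):
    # Noise-contrastive estimation: classify the positive pair against
    # m noise samples, dividing similarities by a (learned or fixed)
    # normalization constant Z. t-SNE can be optimized this way.
    m = max(len(d_negs), 1)
    q = cauchy_sim(d_pos) / Z
    loss = -math.log(q / (q + m * noise_p))
    for d in d_negs:
        qn = cauchy_sim(d) / Z
        loss += -math.log(m * noise_p / (qn + m * noise_p))
    return loss
```

The key structural difference visible here is that negative sampling treats the raw kernel value as a probability, whereas NCE rescales it by a normalization term; characterizing the distortion this difference induces is the subject of the analysis that follows.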

