PARAMETRIC UMAP: LEARNING EMBEDDINGS WITH DEEP NEURAL NETWORKS FOR REPRESENTATION AND SEMI-SUPERVISED LEARNING

Abstract

We propose Parametric UMAP, a parametric variation of the UMAP (Uniform Manifold Approximation and Projection) algorithm. UMAP is a non-parametric graph-based dimensionality reduction algorithm that uses applied Riemannian geometry and algebraic topology to find low-dimensional embeddings of structured data. The UMAP algorithm consists of two steps: (1) compute a graphical representation of a dataset (a fuzzy simplicial complex), and (2) optimize a low-dimensional embedding of that graph through stochastic gradient descent. Here, we replace the second step of UMAP with a deep neural network that learns a parametric relationship between data and embedding. We demonstrate that our method performs similarly to its non-parametric counterpart while conferring the benefits of a learned parametric mapping (e.g. fast online embedding of new data). We then show that the UMAP loss can be extended to arbitrary deep learning applications, for example constraining the latent distribution of autoencoders, and improving classifier accuracy in semi-supervised learning by capturing structure in unlabeled data.



Current non-linear dimensionality reduction algorithms can be divided broadly into non-parametric algorithms, which rely on the efficient computation of probabilistic relationships from neighborhood graphs to extract structure in large datasets (e.g. UMAP (McInnes et al., 2018), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016)), and parametric algorithms, which, driven by advances in deep learning, optimize an objective function related to capturing structure in a dataset over neural network weights (e.g. Hinton & Salakhutdinov 2006; Ding et al. 2018; Ding & Regev 2019; Szubert et al. 2019; Kingma & Welling 2013). The goal of this paper is to wed those two classes of methods: learning a structured graphical representation of the data and using a deep neural network to embed that graph. Over the past decade, several variants of the t-SNE algorithm have proposed parameterized forms of t-SNE (Van Der Maaten, 2009; Gisbrecht et al., 2015; Bunte et al., 2012; Gisbrecht et al., 2012). In particular, Parametric t-SNE (Van Der Maaten, 2009) performs exactly that wedding, training a deep neural network to minimize loss over a t-SNE graph. However, the t-SNE loss function itself is not well suited to optimization over deep neural networks using contemporary training schemes. In particular, t-SNE's optimization requires normalization over the entire dataset at each step, making batch-based optimization and on-line learning of large datasets difficult. In contrast, UMAP is optimized using negative sampling (Mikolov et al., 2013; Tang et al., 2016) and requires no normalization step, making it better suited to deep learning applications. Our proposed method, Parametric UMAP, brings the non-parametric graph-based dimensionality reduction algorithm UMAP into an emerging class of parametric, topologically-inspired embedding algorithms (reviewed in A.5).
In the following section we broadly outline the algorithm underlying UMAP to explain why our proposed algorithm, Parametric UMAP, is particularly well suited to deep learning applications. We contextualize our discussion of UMAP relative to t-SNE, outlining the advantages that UMAP confers over t-SNE in the domain of parametric, neural-network-based embedding. We then perform experiments comparing our algorithm, Parametric UMAP, to parametric and non-parametric algorithms. Finally, we show a novel extension of Parametric UMAP to semi-supervised learning.

0.1 PARAMETRIC UMAP

t-SNE and its more recent cousin UMAP are both non-parametric graph-based dimensionality reduction algorithms that learn embeddings based upon local structure in data. Much prior work has focused on extending t-SNE to learn the mapping between data and embeddings, for example by optimizing t-SNE's loss over learned neural network weights (Van Der Maaten, 2009). However, t-SNE optimizes its embeddings in a manner poorly suited to neural network optimization, requiring significant modification of the learning algorithm to train via stochastic gradient descent. Conversely, UMAP's loss is optimized in a manner that can be translated directly to neural network optimization. Here, we explain UMAP's compatibility with deep learning applications, why this differs for t-SNE, and how we take advantage of this difference. A full discussion of UMAP and t-SNE explaining this contrast is given in Appendix A.1. To summarize, both t-SNE and UMAP rely on the construction of a graph and a subsequent embedding that preserves the structure of that graph (Fig. 1). UMAP learns an embedding by minimizing a cross-entropy loss sampled over positively weighted edges (attraction) and using negative sampling randomly over the dataset (repulsion), allowing minimization to occur over sampled batches of the dataset.
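As a concrete illustration, the attraction/repulsion structure of this sampled cross-entropy can be sketched as follows. This is a simplified sketch, not the library's exact implementation: the function names are our own, and the curve parameters `a` and `b` are fixed at illustrative values near UMAP's defaults (in practice they are fit from the `min_dist` hyperparameter).

```python
import numpy as np

def low_dim_similarity(dist_sq, a=1.577, b=0.895):
    # Smooth approximation to embedding-space membership strength;
    # a and b here are illustrative values near UMAP's defaults.
    return 1.0 / (1.0 + a * dist_sq ** b)

def umap_edge_loss(z_i, z_j, z_neg, eps=1e-4):
    """Cross-entropy loss for one positive edge (z_i, z_j) plus a set
    of negatively sampled points z_neg, in embedding coordinates."""
    # attraction: pull the positive pair together
    q_pos = low_dim_similarity(np.sum((z_i - z_j) ** 2))
    attraction = -np.log(np.clip(q_pos, eps, 1.0))
    # repulsion: push randomly sampled points away
    q_neg = low_dim_similarity(np.sum((z_i - z_neg) ** 2, axis=1))
    repulsion = -np.sum(np.log(np.clip(1.0 - q_neg, eps, 1.0)))
    return attraction + repulsion
```

Because this loss decomposes into a sum over individual edges and their negative samples, it can be evaluated (and differentiated) on arbitrary subsets of the graph without touching the rest of the dataset.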
t-SNE, meanwhile, minimizes a KL-divergence loss function normalized over the entire set of embeddings in the dataset, using different approximation techniques to compute attractive and repulsive forces. Because t-SNE's optimization requires normalization over the distribution of embeddings in projection space, gradient descent can only be performed after computing edge probabilities over the entire dataset. Projecting the entire dataset through a neural network between each gradient descent step, however, would be too computationally expensive. Parametric t-SNE's solution to this problem is to split the dataset into large batches (e.g. 5,000 datapoints in the original paper) that are normalized independently and held fixed throughout training, rather than being resampled. Conversely, UMAP can be trained on batch sizes as small as a single edge, making it suitable both for the minibatch training needed for memory-expensive neural networks trained on large datasets and for on-line learning. Given these design features, the UMAP algorithm is better suited to deep neural networks and more extendable to typical neural network training regimes. Parametric UMAP can be defined simply by applying the UMAP cost function to a deep neural network over minibatches, using negative sampling within each minibatch. In our implementation, we keep all hyperparameters the same as in non-parametric UMAP.
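One way this minibatch scheme might be realized is sketched below: positive edges are sampled from the precomputed graph in proportion to their membership strength, and negatives are drawn from vertices already present in the batch, so no pass over the full dataset is needed per gradient step. This is a hypothetical illustration (the function `sample_umap_batch` and its signature are our own, not part of the UMAP library):

```python
import numpy as np

def sample_umap_batch(edges, weights, batch_size, n_neg, rng):
    """Sample one minibatch of UMAP training pairs.

    edges:   (n_edges, 2) int array of graph edges (i, j)
    weights: (n_edges,) edge membership strengths in (0, 1]
    Returns index arrays (anchors, positives, negatives) that a
    parametric encoder can embed and score with the UMAP loss.
    """
    # positive edges sampled proportionally to membership strength
    p = weights / weights.sum()
    idx = rng.choice(len(edges), size=batch_size, p=p)
    anchors, positives = edges[idx, 0], edges[idx, 1]
    # negative sampling: random vertices drawn from within the batch
    negatives = rng.choice(np.concatenate([anchors, positives]),
                           size=(batch_size, n_neg))
    return anchors, positives, negatives
```

In practice, the released umap-learn package wraps this kind of edge-sampling pipeline behind a `ParametricUMAP` class with a scikit-learn-style `fit_transform` interface, feeding batches like these to a neural-network encoder.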



Figure 1: Overview of UMAP (A → B) and Parametric UMAP (A → C). (A) The first stage of the UMAP algorithm computes a probabilistic graphical representation of the data. (B) The second stage of UMAP optimizes a set of embeddings to preserve the structure of the fuzzy simplicial complex. (C) In Parametric UMAP, this second stage is replaced with a neural network that learns weights (parameters) mapping the high-dimensional data to embeddings. Both B and C are learned through the same loss function.

