UNIFORM MANIFOLD APPROXIMATION WITH TWO-PHASE OPTIMIZATION

Abstract

We present a dimensionality reduction algorithm called Uniform Manifold Approximation with Two-phase Optimization (UMATO), which produces less biased global structures in the embedding results and is more robust over diverse initialization methods than previous methods such as t-SNE and UMAP. We divide the optimization into two phases to alleviate the bias, establishing the global structure early using representatives of the high-dimensional structures. The phases are 1) global optimization to obtain the overall skeleton of the data and 2) local optimization to identify the regional characteristics of local areas. In our experiments with one synthetic and three real-world datasets, UMATO outperformed widely used baseline algorithms, such as PCA, Isomap, t-SNE, UMAP, topological autoencoders, and Anchor t-SNE, in terms of quality metrics and 2D projection results.

1. INTRODUCTION

One effective way of understanding high-dimensional data in various domains is to reduce its dimensionality and investigate the projection in a lower-dimensional space. We present a novel dimensionality reduction method, Uniform Manifold Approximation with Two-phase Optimization (UMATO), to obtain embeddings that are less biased and more robust over diverse initialization methods. The limitation of previous approaches such as t-distributed Stochastic Neighbor Embedding (t-SNE, Maaten & Hinton (2008)) and Uniform Manifold Approximation and Projection (UMAP, McInnes et al. (2018)) is that they are susceptible to different initialization methods, generating considerably different embedding results (Section 5.5). t-SNE adopts the Kullback-Leibler (KL) divergence as its loss function. The fundamental limitation of the KL divergence is that the penalty for points that are distant in the original space being close in the projected space is too small (Appendix B). As a result, only the local manifolds are captured, while clusters that are far apart change their relative locations from run to run. Meanwhile, UMAP leverages the cross-entropy loss function, which penalizes both points that are distant in the original space being close in the projection space and points that are close in the original space being distant in the projection space (Appendix B). UMAP considers all points in the optimization at once with diverse sampling techniques (i.e., negative sampling and edge sampling). Although these approximation techniques make UMAP's optimization much faster, they raise another problem: the clusters in the embedding become dispersed as the number of epochs increases (Appendix K), which can lead to misinterpretation. UMAP tries to alleviate this by using a fixed number of epochs (e.g., 200), which is ad hoc, and by applying a learning rate decay. However, the optimal number of epochs and the decay schedule for each initialization method need to be found in practice.
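The asymmetry between the two loss functions can be sketched numerically. The snippet below is a toy illustration, not the papers' exact formulations (`kl_term` and `ce_term` are hypothetical helpers); it evaluates the per-pair penalty when a pair that is distant in the original space (p ≈ 0) is placed close in the embedding (q ≈ 1):

```python
import numpy as np

def kl_term(p, q):
    # t-SNE-style per-pair KL contribution: p * log(p / q)
    return p * np.log(p / q)

def ce_term(p, q):
    # UMAP-style binary cross-entropy for a single pair
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

# A pair that is distant in the original space (p ~ 0)
# but placed close in the embedding (q ~ 1):
p, q = 1e-4, 0.99
print(kl_term(p, q))  # ≈ -0.00092: near-zero penalty under KL
print(ce_term(p, q))  # ≈ 4.60: large penalty under cross-entropy
```

The KL term vanishes whenever p is small regardless of q, while the cross-entropy's (1 - p) log(1 - q) term keeps a large penalty for such false neighbors.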
To solve the aforementioned problems, we avoid using approximation during the optimization process, which would normally result in a greatly increased computational cost. Instead, we first run the optimization only with a small number of points that represent the data (i.e., hub points). Finding the optimal projection for a small number of points using a cross-entropy function is relatively easy and robust, making the additional techniques employed in UMAP unnecessary. Furthermore, it is less sensitive to the initialization method used (Section 5.5). After capturing the overall skeleton of the high-dimensional structure, we gradually append the rest of the points in subsequent phases. Although the same approximation technique as in UMAP is used for these points, since we have already embedded the hub points and use them as anchors, the projections become more robust and unbiased. The gradual addition of points can in fact be done in a single phase; we found that additional phases do not yield meaningful improvements in performance but only increase the computation time (Section 4.5). Therefore, we use only two phases in UMATO: global optimization to capture the global structures (i.e., the pairwise distances in the high-dimensional space) and local optimization to retain the local structures (i.e., the relationships between neighboring points in the high-dimensional space) of the data. We compared UMATO with popular dimensionality reduction techniques including PCA, Isomap (Tenenbaum et al. (2000)), t-SNE, UMAP, topological autoencoders (Moor et al. (2020)), and At-SNE (Fu et al. (2019)). We used one synthetic (101-dimensional Spheres) and three real-world (MNIST, Fashion-MNIST, and Kuzushiji-MNIST) datasets and analyzed the projection results with several quality metrics.
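The hub-selection step that precedes the global phase can be sketched as follows. This is a minimal stand-in heuristic, assuming hubs are points with high in-degree in the k-nearest-neighbor graph; UMATO's actual hub construction may differ in detail, and `select_hubs` is a hypothetical helper:

```python
import numpy as np

def select_hubs(knn_indices, n_points, n_hubs):
    # Count how often each point appears in other points' k-NN lists
    # (its in-degree in the k-NN graph) and keep the most-referenced
    # points as representatives of the data's overall skeleton.
    in_degree = np.bincount(knn_indices.ravel(), minlength=n_points)
    return np.argsort(-in_degree, kind="stable")[:n_hubs]

# Tiny example: 5 points, each row listing a point's 2 nearest neighbors.
knn = np.array([[1, 2], [0, 2], [0, 1], [2, 1], [2, 0]])
print(select_hubs(knn, n_points=5, n_hubs=2))  # → [2 0]
```

The global phase would then optimize only these hubs with the full cross-entropy loss, after which the remaining points are appended around the fixed hub layout.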
As a result, UMATO demonstrated better performance than the baseline techniques on all datasets in terms of KL_σ with different σ values, meaning that it reasonably preserves the density of data over diverse length scales. Finally, we present the 2D projections of each dataset, including a replication of the experiment on the synthetic Spheres dataset introduced by Moor et al. (2020), in which data points locally constitute multiple small balls that are globally contained in a larger sphere. We demonstrate that UMATO preserves both structures better than the baseline algorithms (Figure 3).
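The KL_σ metric can be sketched as follows, assuming the density-based formulation of Moor et al. (2020): each space is summarized by a Gaussian kernel density estimate at length scale σ, and the two estimates are compared via KL divergence (`kl_sigma` is a hypothetical helper; the exact normalization may differ from the original):

```python
import numpy as np

def kl_sigma(X_high, X_low, sigma):
    """Compare Gaussian density estimates of the original and embedded
    spaces at length scale sigma via KL divergence."""
    def density(X):
        # pairwise squared Euclidean distances
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        d = np.exp(-sq / sigma).sum(axis=1)  # kernel density per point
        return d / d.sum()                   # normalize to a distribution
    p, q = density(X_high), density(X_low)
    return float(np.sum(p * np.log(p / q)))

# Small sigma probes local density; large sigma probes global density.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # toy high-dimensional data
Y = X[:, :2]                   # a crude 2D "projection"
print(kl_sigma(X, Y, sigma=1.0))
```

Sweeping σ over several orders of magnitude is what allows the metric to report density preservation at both local and global scales.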

2. RELATED WORK

Dimensionality reduction. Most previous dimensionality reduction algorithms focused on preserving the data's local structures. For example, Maaten & Hinton (2008) proposed t-SNE, which addresses the crowding problem that earlier attempts (Hinton & Roweis (2002); Cook et al. (2007)) struggled with, visualizing high-dimensional data through a projection produced by performing stochastic gradient descent on the KL divergence between two density functions in the original and projection spaces. Van Der Maaten (2014) accelerated t-SNE by developing a variant of the Barnes-Hut algorithm (Barnes & Hut (1986)), reducing the computational complexity from O(N^2) to O(N log N). After that, grounded in Riemannian geometry and algebraic topology, McInnes et al. (2018) introduced UMAP as an alternative to t-SNE. Leveraging the cross-entropy function as its loss function, UMAP reduced the computation time by employing negative sampling from Word2Vec (Mikolov et al. (2013)) and edge sampling from LargeVis (Tang et al. (2015; 2016)) (Table 1). Moreover, they showed that UMAP generates more stable projection results than t-SNE over repeated runs.

On the other hand, there also exist algorithms that aim to capture the global structures of data. Isomap (Tenenbaum et al. (2000)) was proposed to approximate the geodesic distances of high-dimensional data and embed them in a lower dimension. Global t-SNE (Zhou & Sharpee (2018)) converted the joint probability distribution P in the high-dimensional space from a Gaussian to a Student's t-distribution and proposed a variant of the KL divergence; added to the original loss function of t-SNE, it assigns a relatively large penalty to pairs of points that are distant in the high-dimensional space but close in the projection space. Another example is topological autoencoders (Moor et al. (2020)), a deep-learning approach that uses a generative model to make the latent space resemble the high-dimensional space by appending a topological loss to the autoencoder's original reconstruction loss. However, it requires a large amount of time for hyperparameter exploration and training on each dataset, and focuses only on the global aspect of the data. Unlike other techniques that present a variation of the loss function within a single pipeline, UMATO is novel in that it preserves both structures by dividing the optimization into two phases; this makes it outperform the baselines with respect to quality metrics in our experiments.

Hubs, landmarks, and anchors. Many dimensionality reduction techniques have tried to draw sample points that better model the original space; these points are usually called hubs, landmarks, or anchors. Silva & Tenenbaum (2003) proposed Landmark Isomap, a landmark version of classical multidimensional scaling (MDS), to alleviate its computational cost. Building on Landmark Isomap, Yan et al. (2018) tried to retain the topological structures (i.e., homology) of high-dimensional data by approximating the geodesic distances of all data points. However, both techniques share the limitation that landmarks are chosen randomly, without considering their importance. UMATO instead uses a k-nearest neighbor graph to extract significant hubs that can represent the overall skeleton of the high-dimensional data. The most similar work to ours is At-SNE (Fu et al. (2019)), which optimizes the anchor points and all other points with two different loss functions. However, since the anchors wander during the optimization and the KL divergence does not penalize distant points, it hardly preserves the global structure of the data.