UNIFORM MANIFOLD APPROXIMATION WITH TWO-PHASE OPTIMIZATION

Abstract

We present a dimensionality reduction algorithm called Uniform Manifold Approximation with Two-phase Optimization (UMATO), which produces less biased global structures in its embeddings and is more robust over diverse initialization methods than previous methods such as t-SNE and UMAP. We divide the optimization into two phases to alleviate bias by establishing the global structure early, using representatives of the high-dimensional structures. The phases are 1) global optimization to obtain the overall skeleton of the data and 2) local optimization to identify the regional characteristics of local areas. In our experiments with one synthetic and three real-world datasets, UMATO outperformed widely used baseline algorithms, such as PCA, Isomap, t-SNE, UMAP, topological autoencoders, and Anchor t-SNE, in terms of quality metrics and 2D projection results.

1. INTRODUCTION

We present a novel dimensionality reduction method, Uniform Manifold Approximation with Two-phase Optimization (UMATO), to obtain embeddings that are less biased and more robust over diverse initialization methods. One effective way of understanding high-dimensional data in various domains is to reduce its dimensionality and investigate the projection in a lower-dimensional space. A limitation of previous approaches such as t-distributed Stochastic Neighbor Embedding (t-SNE, Maaten & Hinton (2008)) and Uniform Manifold Approximation and Projection (UMAP, McInnes et al. (2018)) is that they are susceptible to different initialization methods, generating considerably different embedding results (Section 5.5). t-SNE adopts the Kullback-Leibler (KL) divergence as its loss function. The fundamental limitation of the KL divergence is that it imposes too little penalty on points that are distant in the original space being close in the projected space (Appendix B). As a result, only the local manifolds are captured, while clusters that are far apart change their relative locations from run to run. Meanwhile, UMAP leverages the cross-entropy loss function, which penalizes both points that are distant in the original space being close in the projection space and points that are close in the original space being distant in the projection space (Appendix B). UMAP considers all points in the optimization at once with diverse sampling techniques (i.e., negative sampling and edge sampling). Although the approximation techniques in UMAP's optimization make the computation much faster, they raise another problem: the clusters in the embedding become dispersed as the number of epochs increases (Appendix K), which can lead to misinterpretation. UMAP tries to alleviate this by using a fixed number of epochs (e.g., 200), which is ad hoc, and by applying a learning rate decay. However, the optimal number of epochs and the decay schedule for each initialization method need to be found in practice.
To solve the aforementioned problems, we avoid using approximation during the optimization process, which normally would greatly increase the computational cost. Instead, we first run the optimization only with a small number of points that represent the data (i.e., hub points). Finding the optimal projection for a small number of points using a cross-entropy function is relatively easy and robust, making the additional techniques employed in UMAP unnecessary. Furthermore, it is less sensitive to the initialization method used (Section 5.5). After capturing the overall skeleton of the high-dimensional structure, we gradually append the rest of the points in subsequent phases. Although the same approximation technique as UMAP is used for these points, since we have already embedded the hub points and use them as anchors, the projections become more robust and unbiased. The gradual addition of points can in fact be done in a single phase; we found that additional phases result not in meaningful performance improvements but only in increased computation time (Section 4.5). Therefore, we use only two phases in UMATO: global optimization to capture the global structures (i.e., the pairwise distances in the high-dimensional space) and local optimization to retain the local structures (i.e., the relationships between neighboring points in the high-dimensional space) of the data. We compared UMATO with popular dimensionality reduction techniques including PCA, Isomap (Tenenbaum et al. (2000)), t-SNE, UMAP, topological autoencoders (Moor et al. (2020)), and At-SNE (Fu et al. (2019)). We used one synthetic (101-dimensional Spheres) and three real-world (MNIST, Fashion MNIST, and Kuzushiji MNIST) datasets and analyzed the projection results with several quality metrics.
UMATO demonstrated better performance than the baseline techniques on all datasets in terms of KL_σ with different σ values, meaning that it reasonably preserved the density of the data over diverse length scales. Finally, we present the 2D projections of each dataset, including the replication of an experiment using the synthetic Spheres dataset introduced by Moor et al. (2020), where data points locally constitute multiple small balls globally contained in a larger sphere. We demonstrate that UMATO preserves both structures better than the baseline algorithms (Figure 3).

2. RELATED WORK

Dimensionality reduction. Most previous dimensionality reduction algorithms focused on preserving the data's local structures. For example, Maaten & Hinton (2008) proposed t-SNE, focusing on the crowding problem with which previous attempts (Hinton & Roweis (2002); Cook et al. (2007)) had struggled, to visualize high-dimensional data through a projection produced by performing stochastic gradient descent on the KL divergence between two density functions in the original and projection spaces. Van Der Maaten (2014) accelerated t-SNE by developing a variant of the Barnes-Hut algorithm (Barnes & Hut (1986)), reducing the computational complexity from O(N^2) to O(N log N). Later, grounded in Riemannian geometry and algebraic topology, McInnes et al. (2018) introduced UMAP as an alternative to t-SNE. Leveraging the cross-entropy function as its loss function, UMAP reduced the computation time by employing negative sampling from Word2Vec (Mikolov et al. (2013)) and edge sampling from LargeVis (Tang et al. (2015; 2016)) (Table 1). Moreover, they showed that UMAP generates more stable projection results than t-SNE over repetitions. On the other hand, there also exist algorithms that aim to capture the global structures of data. Isomap (Tenenbaum et al. (2000)) approximates the geodesic distances of high-dimensional data and embeds them in the lower dimension. Global t-SNE (Zhou & Sharpee (2018)) converted the joint probability distribution P in the high-dimensional space from a Gaussian to a Student's t-distribution and proposed a variant of the KL divergence. By adding it to the original loss function of t-SNE, Global t-SNE assigns a relatively large penalty to a pair of distant data points in the high-dimensional space being close in the projection space. Another example is topological autoencoders (Moor et al. (2020)), a deep-learning approach that uses a generative model to make the latent space resemble the high-dimensional space by appending a topological loss to the original reconstruction loss of autoencoders. However, they require a huge amount of time for hyperparameter exploration and training, and focus only on the global aspect of the data. Unlike other techniques that present a variation of loss functions in a single pipeline, UMATO is novel in that it preserves both structures by dividing the optimization into two phases; this makes it outperform the baselines with respect to quality metrics in our experiments.

Hubs, landmarks, and anchors. Many dimensionality reduction techniques have tried to draw sample points that better model the original space; these points are usually called hubs, landmarks, or anchors. Silva & Tenenbaum (2003) proposed Landmark Isomap, a landmark version of classical multidimensional scaling (MDS), to alleviate its computation cost. Building on Landmark Isomap, Yan et al. (2018) tried to retain the topological structures (i.e., homology) of high-dimensional data by approximating the geodesic distances of all data points. However, both techniques share the limitation that the landmarks are chosen randomly, without considering their importance. UMATO uses a k-nearest neighbor graph to extract significant hubs that can represent the overall skeleton of high-dimensional data. The work most similar to ours is At-SNE (Fu et al. (2019)), which optimizes the anchor points and all other points with two different loss functions. However, since the anchors wander during the optimization and the KL divergence does not account for distant points, it hardly captures the global structure. UMATO separates the optimization process into two phases so that the hubs barely move but guide the other points, letting the subareas manifest the shape of the high-dimensional manifold in the projection.
Applying different cross-entropy functions to each phase also helps preserve both structures.

3. UMAP

Since UMATO shares the overall pipeline of UMAP (McInnes et al. (2018)), we briefly introduce UMAP in this section. Although UMAP is grounded in a sophisticated mathematical foundation, its computation can be simply divided into two steps, graph construction and layout optimization, a configuration similar to t-SNE. In this section, we succinctly explain the computation in an abstract manner. For more details about UMAP, please consult the original paper (McInnes et al. (2018)).

Graph Construction. UMAP starts by generating a weighted k-nearest neighbor graph that represents the distances between data points in the high-dimensional space. Given an input dataset X = {x_1, ..., x_n}, the number of neighbors to consider k, and a distance metric d : X × X → [0, ∞), UMAP first computes N_i, the k-nearest neighbors of x_i with respect to d. Then, UMAP computes two parameters, ρ_i and σ_i, for each data point x_i to identify its local metric space. ρ_i is the distance from x_i to its nearest neighbor at a nonzero distance:

ρ_i = min_{j ∈ N_i} { d(x_i, x_j) | d(x_i, x_j) > 0 }.    (1)

Using binary search, UMAP finds σ_i that satisfies:

Σ_{j ∈ N_i} exp(−max(0, d(x_i, x_j) − ρ_i) / σ_i) = log_2(k).    (2)

Next, UMAP computes

v_{j|i} = exp(−max(0, d(x_i, x_j) − ρ_i) / σ_i),    (3)

the weight of the edge from a point x_i to another point x_j. To make it symmetric, UMAP computes v_ij = v_{j|i} + v_{i|j} − v_{j|i} · v_{i|j}, a single edge with combined weight using v_{j|i} and v_{i|j}. Note that v_ij indicates the similarity between points x_i and x_j in the original space. Let y_i be the projection of x_i in a low-dimensional projection space. The similarity between two projected points y_i and y_j is

w_ij = (1 + a ||y_i − y_j||^{2b})^{−1},

where a and b are positive constants defined by the user. Setting both a and b to 1 is identical to using Student's t-distribution to measure the similarity between two points in the projection space, as in t-SNE (Maaten & Hinton (2008)).
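The graph construction above can be sketched in NumPy as follows. This is an illustrative sketch, not UMAP's actual implementation: the function names and the fixed 64-step binary search are our choices.

```python
import numpy as np

def smooth_knn_weights(knn_dists, k, n_iter=64, tol=1e-5):
    """Compute rho_i (Equation 1), sigma_i (Equation 2), and the directed
    weights v_{j|i} (Equation 3) from each point's k-nearest-neighbor
    distances (rows sorted ascending, self excluded). A sketch, not UMAP's API.
    """
    n = knn_dists.shape[0]
    # rho_i: distance to the nearest neighbor at a nonzero distance
    rho = np.where((knn_dists > 0).any(1),
                   np.where(knn_dists > 0, knn_dists, np.inf).min(1), 0.0)
    target = np.log2(k)
    sigma = np.ones(n)
    for i in range(n):
        lo, hi, mid = 0.0, np.inf, 1.0
        for _ in range(n_iter):  # binary search: the sum grows with sigma
            s = np.exp(-np.maximum(0.0, knn_dists[i] - rho[i]) / mid).sum()
            if abs(s - target) < tol:
                break
            if s > target:
                hi = mid
            else:
                lo = mid
            mid = mid * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        sigma[i] = mid
    v = np.exp(-np.maximum(0.0, knn_dists - rho[:, None]) / sigma[:, None])
    return rho, sigma, v

def symmetrize(V):
    """Fuzzy union of a dense directed-weight matrix:
    v_ij = v_{j|i} + v_{i|j} - v_{j|i} * v_{i|j}."""
    return V + V.T - V * V.T
```

Because each directed weight lies in [0, 1], the fuzzy union also stays in [0, 1] and is symmetric by construction.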
The goal of layout optimization is to find the y_i that minimizes the difference (or loss) between v_ij and w_ij. Unlike t-SNE, UMAP employs the cross entropy

C_UMAP = Σ_{i≠j} [ v_ij · log(v_ij / w_ij) + (1 − v_ij) · log((1 − v_ij) / (1 − w_ij)) ]    (4)

between v_ij and w_ij as the loss function. UMAP initializes y_i through spectral embedding (Belkin & Niyogi (2002)) and iteratively optimizes its position to minimize C_UMAP. Writing the output weight w_ij as 1/(1 + a d_ij^{2b}), the attractive gradient is

∂C_UMAP/∂y_i |_+ = (−2ab · d_ij^{2(b−1)}) / (1 + a d_ij^{2b}) · v_ij (y_i − y_j),    (5)

and the repulsive gradient is

∂C_UMAP/∂y_i |_− = 2b / ((ε + d_ij^2)(1 + a d_ij^{2b})) · (1 − v_ij)(y_i − y_j),    (6)

where ε is a small value added to prevent division by zero and d_ij is the Euclidean distance between y_i and y_j. For efficient optimization, UMAP leverages the negative sampling technique from Word2Vec (Mikolov et al. (2013)). After choosing a target point and its negative samples, the position of the target is updated with the attractive gradient, while the positions of the negative samples are updated with the repulsive gradient. Moreover, UMAP utilizes edge sampling (Tang et al. (2015; 2016)) to accelerate and simplify the optimization process (Table 1). In other words, UMAP randomly samples edges with a probability proportional to their weights, and subsequently treats the selected ones as binary edges. With these sampling techniques, the modified objective function is

O = Σ_{(i,j)∈E} v_ij ( log(w_ij) + Σ_{k=1}^{M} E_{j_k ∼ P_n(j)} [ γ log(1 − w_{i j_k}) ] ),    (7)

where v_ij and w_ij are the similarities in the high- and low-dimensional spaces respectively, M is the number of negative samples, and E_{j_k ∼ P_n(j)} indicates that j_k is sampled according to a noise distribution, P_n(j), from Word2Vec (Mikolov et al. (2013)).
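The attractive and repulsive gradients (Equations 5 and 6) can be written down directly; a small sketch with a = b = 1 defaults (the function names are ours):

```python
import numpy as np

def attractive_grad(yi, yj, vij, a=1.0, b=1.0):
    """Attractive gradient of the cross entropy w.r.t. y_i (Equation 5),
    applied between a target point and the endpoint of a sampled edge."""
    d2 = ((yi - yj) ** 2).sum()          # squared Euclidean distance d_ij^2
    coef = (-2.0 * a * b * d2 ** (b - 1.0)) / (1.0 + a * d2 ** b)
    return coef * vij * (yi - yj)

def repulsive_grad(yi, yj, vij, a=1.0, b=1.0, eps=1e-3):
    """Repulsive gradient (Equation 6); eps prevents division by zero.
    Applied between a target point and its negative samples (v_ij near 0)."""
    d2 = ((yi - yj) ** 2).sum()
    coef = (2.0 * b) / ((eps + d2) * (1.0 + a * d2 ** b))
    return coef * (1.0 - vij) * (yi - yj)
```

Adding the attractive gradient moves y_i toward y_j, while the repulsive gradient pushes it away; with the sampling techniques, both are applied per sampled edge rather than over all pairs.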

4. UMATO

Figure 2 illustrates the computation pipeline of UMATO, which delineates the two-phase optimization (see Figure 9 for a detailed illustration of the overall pipeline). As a novel approach, we split the optimization into a global and a local phase so that UMATO generates a low-dimensional projection in which both structures are well maintained. We present the pseudocode of UMATO in Appendix A and have made its source code publicly available.

4.1. POINTS CLASSIFICATION

Figure 1: Points classification using the Spheres dataset. Each point is classified into a hub (red circles), an expanded nearest neighbor (green squares), or an outlier (blue triangles). Best viewed in color.

In the big picture, UMATO follows the pipeline of UMAP. We first find the k-nearest neighbors in the same way as UMAP, assuming the local connectivity constraint, i.e., no single point is isolated and each point is connected to at least a user-defined number of points. After calculating ρ (Equation 1) and σ (Equation 2) for each point, we obtain the pairwise similarity for every pair of points. Once the k-nearest neighbor indices are established, we unfold them and count the frequency of each point, sorting the points in descending order of frequency so that the indices of the popular points come first. Then, we build a k-nearest neighbor graph by repeating the following steps until no points remain unconnected: 1) choose the most frequent point that is not already connected as a hub, and 2) retrieve the k-nearest neighbors of the chosen hub; the points selected in steps 1 and 2 become a connected component. The gist is that we divide the points into three disjoint sets: hubs, expanded nearest neighbors, and outliers (Figure 1). Thanks to the sorted indices, the most popular point in each iteration, but not one too densely located, becomes the hub point. Once the hub points are determined, we recursively seek out their nearest neighbors, and again the nearest neighbors of those neighbors, until no points can be newly appended. In other words, we find all connected points that are expanded from the original hub points; these, in turn, are called the expanded nearest neighbors. Any remaining point that is neither a hub nor a part of any expanded nearest neighbors is classified as an outlier. The main reason to rule out the outliers is, similar to the previous approach (Gong et al. (2012)), to achieve robustness in practical manifold learning. As the characteristics of these classes differ significantly, we take a different approach for each class of points to capture both structures: we run global optimization for the hub points (Section 4.2), local optimization for the expanded nearest neighbors (Section 4.3), and no optimization for the outliers (Section 4.4). We explain each in detail in the following sections.
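The classification above can be sketched as follows. The `n_hubs` cap and the function names are our illustrative assumptions (the experiments fix the number of hubs per dataset, e.g., 200 or 300):

```python
import numpy as np

def classify_points(knn_indices, n_hubs):
    """Split points into hubs, expanded nearest neighbors, and outliers
    (a sketch of Section 4.1). knn_indices: (n, k) int array, row i holding
    the indices of x_i's k nearest neighbors."""
    n = knn_indices.shape[0]
    # popularity = how often a point appears in other points' neighbor lists
    freq = np.bincount(knn_indices.ravel(), minlength=n)
    order = np.argsort(-freq)                      # most popular first
    covered = np.zeros(n, dtype=bool)
    hubs = []
    for i in order:                                # greedy hub selection
        if len(hubs) == n_hubs:
            break
        if not covered[i]:                         # skip already-connected points
            hubs.append(int(i))
            covered[i] = True
            covered[knn_indices[i]] = True
    # recursively expand neighbors-of-neighbors starting from the hubs
    expanded, frontier = set(hubs), set(hubs)
    while frontier:
        frontier = set(knn_indices[sorted(frontier)].ravel().tolist()) - expanded
        expanded |= frontier
    enn = np.array(sorted(expanded - set(hubs)))
    outliers = np.array(sorted(set(range(n)) - expanded))
    return np.array(hubs), enn, outliers
```

A point that never appears in any expanded point's neighbor list is never reached by the expansion and therefore ends up an outlier.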

4.2. GLOBAL OPTIMIZATION

After identifying hub points, we run the global optimization to retrieve the skeletal layout of the data. First, we initialize the positions of the hub points using PCA, which makes the optimization process faster and more stable than using random initial positions. Next, we optimize the positions of the hub points by minimizing the cross-entropy function (Equation 4). Let f(X) = {f(x_i, x_j) | x_i, x_j ∈ X} and g(Y) = {g(y_i, y_j) | y_i, y_j ∈ Y} be the two adjacency matrices in the high- and low-dimensional spaces. If X_h is the set of points selected as hubs in the high-dimensional space and Y_h is the set of corresponding points in the projection, we minimize the cross entropy CE(f(X_h)||g(Y_h)) between f(X_h) and g(Y_h).

Table 1: The runtime of each algorithm on the MNIST dataset, averaged over 10 runs on a Linux server with 40-core Intel Xeon Silver 4210 CPUs. UMAP and UMATO take much less time than Multicore t-SNE (Ulyanov (2016)). Isomap (Tenenbaum et al. (2000)) took more than 3 hours to produce an embedding.
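The PCA initialization of the hub positions mentioned above can be sketched with a plain SVD (`init_hub_positions` is our name, not part of the paper's code):

```python
import numpy as np

def init_hub_positions(X_hub, dim=2):
    """Project the hub points onto their top principal components to obtain
    initial low-dimensional positions for the first (global) phase."""
    Xc = X_hub - X_hub.mean(axis=0)                 # center the hubs
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T                          # scores on the top axes
```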

UMAP computes the cross-entropy between all existing points using two sampling techniques, edge sampling and negative sampling, for speed (Table 1). However, this often ends up capturing only the local properties of the data because of sampling biases, and thus it cannot be used for cases that require a comprehensive understanding of the data. On the other hand, in its first phase, UMATO only optimizes the representatives (i.e., the hub points) of the data, which takes much less time but can still approximate the manifold effectively.

4.3. LOCAL OPTIMIZATION

In the second phase, UMATO embeds the expanded nearest neighbors into the projection computed with only the hub points in the first phase. For each point in the expanded nearest neighbors, we retrieve its m (e.g., 10) nearest hubs in the original high-dimensional space and set its initial position in the projection to the average position of those hubs in the projection, with a small random perturbation. We follow a similar optimization process to UMAP in the local optimization, with small differences. As explained in Section 3, UMAP first constructs the graph structure; we perform the same task but only with the hubs and expanded nearest neighbors. While doing this, since some points are excluded as outliers, we need to update the k-nearest neighbor indices. This is fast because we recycle the already-built k-nearest neighbor indices, replacing each outlier with the next nearest neighbor. Once we compute the similarity between points (Equation 3), we optimize the positions of points, as in UMAP, with the cross-entropy loss function using edge sampling and negative sampling (Equation 7). Here, we avoid moving the hubs as much as possible since they have already formed the global structure. Thus, we only sample a point p among the expanded nearest neighbors as one end of an edge, while the point q at the other end can be chosen from all points except outliers. In the UMAP implementation, when q pulls in p, p also drags q to facilitate the optimization (Equation 5). When updating the position of q, if q is a hub point, we multiply its update by a small factor (e.g., 0.1) so that its position is not excessively affected by p. In addition, because the repulsive force can disperse the local attachment, making points veer off at each epoch and eventually destroying the well-shaped global layout, we also multiply the repulsive gradient (Equation 6) by a small factor (e.g., 0.1) for the points selected as negative samples.
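The second-phase initialization can be sketched as below; the function name, the default `m`, and the noise scale are our assumptions:

```python
import numpy as np

def init_enn_positions(X_enn, X_hub, Y_hub, m=10, noise=0.01, seed=0):
    """Place each expanded-nearest-neighbor point at the mean projected
    position of its m nearest hubs (measured in the original space),
    plus a small random perturbation."""
    rng = np.random.default_rng(seed)
    # high-dimensional distances from each enn point to every hub
    D = np.sqrt(((X_enn[:, None, :] - X_hub[None, :, :]) ** 2).sum(-1))
    nearest = np.argsort(D, axis=1)[:, :m]          # m nearest hubs per point
    Y0 = Y_hub[nearest].mean(axis=1)                # average hub projection
    return Y0 + rng.normal(scale=noise, size=Y0.shape)
```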

4.4. OUTLIERS ARRANGEMENT

Since the isolated points, which we call outliers, mostly have nearly the same distance to all the other data points in the high-dimensional space due to the curse of dimensionality, they both sabotage the global structure we have already built and try to mingle with all other points, distorting the overall projection. We do not optimize these points but instead simply append them using the already-projected points (i.e., hubs or expanded nearest neighbors) that belong to each outlier's connected component of the nearest neighbor graph. That is, if x_i ∈ C_n, where x_i is the target outlier and C_n is the connected component to which x_i belongs, we find x_j ∈ C_n that has already been projected and is closest to x_i. We place y_i, the low-dimensional counterpart of x_i, at the position of y_j offset by a random noise. In this way, when arranging the outliers, we can benefit from the comprehensive composition of the projection we have already optimized. All outliers are guaranteed to find such a neighbor since we picked hubs from each connected component of the nearest neighbor graph, so at least one point per component is already located and optimized (Section 4.2).
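The arrangement step can be sketched as below; `comp_id` (a connected-component label per point), the function name, and the noise scale are our illustrative assumptions:

```python
import numpy as np

def place_outliers(Y, X, outlier_idx, placed_idx, comp_id, noise=0.05, seed=0):
    """Position each outlier next to the closest already-projected point in
    its connected component, offset by random noise (Section 4.4)."""
    rng = np.random.default_rng(seed)
    Y = Y.copy()
    placed = np.asarray(placed_idx)
    for i in outlier_idx:
        same = placed[comp_id[placed] == comp_id[i]]    # candidates in C_n
        d = np.linalg.norm(X[same] - X[i], axis=1)      # high-dim distances
        j = same[np.argmin(d)]                          # nearest projected x_j
        Y[i] = Y[j] + rng.normal(scale=noise, size=Y.shape[1])
    return Y
```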

4.5. MULTI-PHASE OPTIMIZATION

The optimization of UMATO can easily be expanded to multiple phases (e.g., three or more). Since we have a recursive procedure for expanding the nearest neighbors, we can insert an optimization step each time we expand the neighbors to create a multi-phase algorithm. However, our experiments with three- and four-phase optimization on the Fashion MNIST dataset showed no big difference between two-phase optimization and optimization with more than two phases. Appendix C contains the quantitative and qualitative results of the experiment on multi-phase optimization.

5. EXPERIMENTS

We conducted experiments to evaluate UMATO's ability to capture the global and local structures of high-dimensional data. We compared UMATO with six baseline algorithms (PCA, Isomap, t-SNE, UMAP, topological autoencoders, and At-SNE) in terms of global (i.e., DTM and KL_σ) and local (i.e., trustworthiness, continuity, and MRREs) quality metrics.

5.1. DATASETS

We used four datasets for the experiments: Spheres, MNIST (LeCun & Cortes (2010)), Fashion MNIST (Xiao et al. (2017) ), and Kuzushiji MNIST (Clanuwat et al. (2018) ). Spheres is a synthetic dataset that has 10,000 rows of 101 dimensions. It has a high-dimensional structure in which ten small spheres are contained in a larger sphere. Specifically, the dataset's first 5,000 rows are the points sampled from a sphere of radius 25 and 500 points are sampled for each of the ten smaller spheres of radius 5 shifted to a random direction from the origin. This dataset is the one used for the original experiment with topological autoencoders (Moor et al. (2020) ). Other datasets are images of digits, fashion items, and Japanese characters, each of which consists of 60,000 784-dimensional (28 × 28) images from 10 classes.
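A sketch of generating such a Spheres dataset follows. The `shift` parameter (the distance of each small sphere's center from the origin) is our assumption; the description above only says the small spheres are shifted in a random direction from the origin:

```python
import numpy as np

def make_spheres(n_outer=5000, n_inner=500, n_spheres=10, dim=101,
                 r_outer=25.0, r_inner=5.0, shift=10.0, seed=42):
    """Ten small spheres of radius 5 contained in a larger sphere of
    radius 25, in 101 dimensions, with one class label per sphere."""
    rng = np.random.default_rng(seed)
    def sample_sphere(n, r):
        v = rng.normal(size=(n, dim))              # uniform random direction
        return r * v / np.linalg.norm(v, axis=1, keepdims=True)
    parts = [sample_sphere(n_outer, r_outer)]
    labels = [np.zeros(n_outer, dtype=int)]
    for c in range(n_spheres):
        center = sample_sphere(1, shift)           # random direction offset
        parts.append(sample_sphere(n_inner, r_inner) + center)
        labels.append(np.full(n_inner, c + 1, dtype=int))
    return np.vstack(parts), np.concatenate(labels)
```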

5.2. EXPERIMENTAL SETTING

Evaluation Metrics. To assess how well projections preserve the global structures of high-dimensional data, we computed the density estimates (Chazal et al. (2011; 2017)), the so-called Distance To a Measure (DTM), between the original data and the projections. Moor et al. (2020) adopted the Kullback-Leibler divergence between density estimates with different scales (KL_σ) to evaluate global structure preservation. To follow the original experimental setup by Moor et al. (2020), we found the projections with the lowest KL_0.1 for all algorithms by adjusting their hyperparameters. Next, to evaluate the local structure preservation of projections, we used the mean relative rank errors (MRREs, Lee & Verleysen (2007)), trustworthiness, and continuity (Venna & Kaski (2001)). All of these local quality metrics estimate how well the nearest neighbors in one space (e.g., the high- or low-dimensional space) are preserved in the other space. For more information on the quality metrics, we refer readers to Appendix E.

Baselines. We set the most widely used dimensionality reduction techniques as our baselines: PCA, Isomap (Tenenbaum et al. (2000)), t-SNE (Maaten & Hinton (2008)), UMAP (McInnes et al. (2018)), and At-SNE (Fu et al. (2019)). In the case of t-SNE, we leveraged Multicore t-SNE (Ulyanov (2016)) for fast computation. To initialize the points' positions, we used PCA for t-SNE, following the recommendation in previous work (Linderman et al. (2019)), and spectral embedding for UMAP, its default. In addition, we compared with topological autoencoders (Moor et al. (2020)), which were developed to capture the global properties of the data using a deep learning-based generative model. Following the convention of visualization in dimensionality reduction, we projected the results onto 2D space. We tuned the hyperparameters of each technique to minimize KL_0.1. Appendix F further describes the details of the hyperparameter settings.
An additional projection result in Appendix N shows the relationship between classes much better than baseline algorithms like t-SNE and UMAP.

5.5. PROJECTION ROBUSTNESS OVER DIVERSE INITIALIZATION METHODS

We experimented with the robustness of each dimensionality reduction technique under different initialization methods: PCA, spectral embedding, random position, and class-wise separation. In class-wise separation, we initialized each class at a non-overlapping random position in 2-dimensional space and added random Gaussian noise. In our results, UMATO embeddings were almost the same on the real-world datasets, while the UMAP and t-SNE results depended highly upon the initialization method. We report this in Table 3 with a quantitative comparison using the Procrustes distance. Specifically, given two datasets X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n}, where y_i corresponds to x_i, the Procrustes distance is defined as d_P(X, Y) = Σ_{i=1}^{n} ||x_i − y_i||^2. For all cases, we applied the optimal translation, uniform scaling, and rotation to minimize the Procrustes distance between the two distributions. In the case of the Spheres dataset, as defined in Appendix G, the clusters are equidistant from each other; the embedding results have to differ since there is no way to express this relationship within the limitation of the 2-dimensional space. However, as we report in Figure 4, the global and local structures of the Spheres data are manifested by UMATO under all initialization methods.
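The Procrustes comparison (optimal translation, uniform scaling, and rotation, then summed squared differences) can be sketched as below; the normalization convention is one common choice, not necessarily the exact one used in the experiments:

```python
import numpy as np

def procrustes_distance(X, Y):
    """Procrustes distance between two corresponding point sets after
    removing translation, uniform scale, and orthogonal alignment."""
    Xc = X - X.mean(axis=0)                        # optimal translation
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)                   # scale-normalize both sets
    Yc = Yc / np.linalg.norm(Yc)
    U, S, Vt = np.linalg.svd(Yc.T @ Xc)            # orthogonal Procrustes
    R = U @ Vt                                     # optimal alignment for Yc
    s = S.sum()                                    # optimal uniform scaling
    return float(((Xc - s * (Yc @ R)) ** 2).sum())
```

By construction the distance is zero when Y is a translated, uniformly scaled, and rotated copy of X.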

6. CONCLUSION

We present a two-phase dimensionality reduction algorithm called UMATO that can effectively preserve the global and local properties of high-dimensional data. In our experiments with diverse datasets, we have shown that UMATO outperforms previous widely used baselines (e.g., t-SNE and UMAP) both quantitatively and qualitatively. As future work, we plan to accelerate UMATO, as in previous attempts with other dimensionality reduction techniques (Pezzotti et al. (2019); Nolet et al. (2020)), by implementing it on a heterogeneous system (e.g., GPU) for speedups.

A UMATO ALGORITHM PSEUDOCODE

The pseudocode of UMATO is as below:

Algorithm 1 Uniform Manifold Approximation with Two-phase Optimization
1: procedure UMATO(X, k)
2:    Compute the k-nearest neighbor indices of the points in X
3:    Classify points into hubs, expanded nearest neighbors, and outliers
4:    Compute weights between hub points (Equation 3)
5:    Initialize hub positions Y_h using PCA
6:    Optimize CE(f(X_h)||g(Y_h)) to preserve global configuration (Equation 4)
7:    Initialize expanded nearest neighbors using hub locations
8:    Update k-nearest neighbors & compute weights (Equation 3)
9:    Optimize CE(f(X)||g(Y)) to preserve local configuration (Equation 7)
10:   Position outliers
11:   return Y
12: end procedure

B THE MEANING OF USING DIFFERENT LOSS FUNCTIONS IN DIMENSIONALITY REDUCTION

Following the notation from above, let v_ij be the similarity between points x_i and x_j in the high-dimensional space, and w_ij the similarity between y_i and y_j in the low-dimensional space. Then we can write the KL divergence and cross-entropy loss functions as:

KL = Σ_{i≠j} v_ij · log(v_ij / w_ij),

CE = Σ_{i≠j} v_ij · log(v_ij / w_ij) + Σ_{i≠j} (1 − v_ij) · log((1 − v_ij) / (1 − w_ij)).

Table 4: Analysis of the KL divergence and cross-entropy loss function for imposing penalties when updating the positions of points in low-dimensional space. (Upper table) The KL divergence and the first term of the cross-entropy function impose a big penalty when w_ij is small but v_ij is large. (Lower table) In contrast, the second term of the cross-entropy function imposes a big penalty when v_ij is small but w_ij is large.

When v_ij and w_ij are similar, the relationship between points in the high-dimensional space is well retained in the projection, so the positions of points in the low-dimensional space do not have to move; as v_ij and w_ij are similar, log(v_ij / w_ij) becomes nearly zero, producing a small cost. However, we need to modify the positions if there exists a gap between v_ij and w_ij. The KL divergence imposes a big penalty when v_ij is large but w_ij is small (Table 4 b.). That is, if neighboring points in the high-dimensional space are not captured well in the projection, the KL divergence imposes a high penalty and moves the corresponding points into the right positions; in the opposite case it imposes almost no penalty. Thus, we can understand why t-SNE is able to capture the local characteristics of the high-dimensional space, but not the global ones. In contrast, the second term of the cross entropy imposes a big penalty when v_ij is small but w_ij is large (Table 4 g.). Therefore, it moves apart points that are distant in the high-dimensional space but close together in the projection.
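The asymmetry described above is easy to check numerically; a tiny sketch (the helper names are ours, and log is the natural logarithm):

```python
import numpy as np

def kl_term(v, w):
    """Per-pair KL penalty: v * log(v / w)."""
    return v * np.log(v / w)

def ce_term(v, w):
    """Per-pair cross-entropy penalty: the KL term plus the second term,
    which punishes distant pairs (small v) placed close together (large w)."""
    return v * np.log(v / w) + (1 - v) * np.log((1 - v) / (1 - w))
```

For v = 0.9, w = 0.01 (neighbors torn apart) both losses are large, but for v = 0.01, w = 0.9 (a distant pair collapsed together) only the cross entropy reacts, which is why t-SNE's KL objective neglects the global structure.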

C MULTI-PHASE OPTIMIZATION

We report the result of multi-phase optimization (e.g., three- and four-phase) using the Fashion MNIST dataset both quantitatively (Table 5) and qualitatively (Figure 5). As in Figure 5, we were unable to find any significant differences between the 2D projections, although some outliers were located in different places. Moreover, the quality metrics were almost the same for all three results. The original UMATO was the winner in DTM, KL_0.1, KL_1, continuity, and MRRE_Z but came last in the other quality metrics. In addition, the gap in metrics between UMATO and the multi-phase optimizations was trivial. Thus, we concluded that multi-phase optimization for UMATO does not bring about any notable improvement in the projection result.

Table 5: Quantitative evaluation of UMATO and UMATO with multi-phase optimizations. Although the optimization process of UMATO is simply expandable to multiple phases, no apparent distinctions are found in the results with different numbers of optimization phases. The winner is in bold. Although there were small differences, such as the locations of outliers, the projection results were quite similar to each other.

D PROJECTION STABILITY

Table 6 shows the results of our experiment on the projection stability of UMATO and other dimensionality reduction techniques. When the data size grows, we may want to sample a portion of the data to speed up the visualization. The concern is whether the projection of the sampled indices is consistent with the corresponding part of the projection of the full dataset; if the algorithm generates stable and consistent results, the two projections should contain as little bias as possible. To compute the projection stability of dimensionality reduction techniques, we used the normalized Procrustes distance (Equation 8) to measure the distance between the two comparable distributions. To replicate the experiment by McInnes et al. (2018), we used the same Flow Cytometry dataset (Spidlen et al. (2012); Brodie et al. (2013)) and applied the optimal translation, uniform scaling, and rotation to minimize the Procrustes distance between the two distributions. As we can see in Table 6, UMATO outperformed t-SNE and At-SNE for all sub-sample sizes. Moreover, although UMAP is known to be stable among existing algorithms, UMATO showed an even better (lower) Procrustes distance except for one sub-sample size (60%). From this result, we conclude that UMATO generates more stable and consistent results than many other dimensionality reduction techniques, regardless of sub-sample size.

E QUALITY METRICS

As UMATO presents a dimensionality reduction technique that can capture both the global and local structures of high-dimensional data, we used several quality metrics to evaluate each aspect respectively. We referred to review papers (Gracia et al. (2014); Lee & Verleysen (2009)) for the best use and implementation. Among many quality metrics, we leveraged 1) Distance To a Measure (DTM, Chazal et al. (2011; 2017)), 2) the KL divergence between two density functions, 3) trustworthiness and continuity (Venna & Kaski (2001)), and 4) the mean relative rank errors (MRREs, Lee & Verleysen (2007)). The first two metrics test the preservation of global structures and the last two the preservation of local structures. Distance To a Measure considers the dispersion of the high- and low-dimensional data; for a given point it is defined as

f_σ^X(x) := Σ_{y∈X} exp(−dist(x, y)^2 / σ).

By summing the element-wise absolute differences between the two distributions, Σ |f_σ^X(x) − f_σ^Z(z)|, where x is a point in the high-dimensional space X and z is the corresponding projected point in the low-dimensional space Z, we can examine the similarity of the two datasets. In our experiments, we used the Euclidean distance, and the values were normalized between 0 and 1. The length scale parameter σ ∈ R_{>0} was set to 0.1. Moor et al. (2020) proposed the KL divergence of the two probability distributions, KL_σ := KL(f_σ^X || f_σ^Z), a variation of DTM. Changing σ as a normalizing factor of the distribution, the authors investigated whether the algorithms preserve the global structure of the high-dimensional data. Following the same notion as the experiment in the paper (Moor et al. (2020)), we used three σ values, 1.0, 0.1, and 0.01, to test whether each algorithm can capture the global aspect with respect to diverse density estimates.
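The DTM-based comparison can be sketched as follows, normalizing each density estimate to sum to 1 as when it is plugged into KL_σ; the function names are ours:

```python
import numpy as np

def dtm_density(X, sigma=0.1):
    """Density estimate f_sigma^X(x) = sum_y exp(-dist(x, y)^2 / sigma) for
    every point of X, normalized into a probability distribution."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    f = np.exp(-D2 / sigma).sum(axis=1)
    return f / f.sum()

def dtm_distance(X, Z, sigma=0.1):
    """Sum of element-wise absolute differences between the density
    estimates of the original data X and its projection Z (rows correspond)."""
    return float(np.abs(dtm_density(X, sigma) - dtm_density(Z, sigma)).sum())
```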
Trustworthiness and continuity measure how well the nearest neighbors are preserved in one space (i.e., the high- or low-dimensional space) compared to the other by analyzing the ranks of the k-nearest neighbors in both spaces. The difference between trustworthiness and continuity is which space is held as the base space. Specifically, we first find the k-nearest neighbors in both the high- and low-dimensional spaces. We then compute trustworthiness by checking whether the ranks of the nearest neighbors in the low-dimensional space resemble those in the high-dimensional space; if so, trustworthiness is high. Conversely, continuity is high if the ranks of the nearest neighbors in the high-dimensional space resemble those in the low-dimensional space. MRREs take a similar approach, calculating and comparing the ranks of the k-nearest neighbors in both spaces, but with a slightly different normalizing factor. Originally, lower MRRE values were better; for ease of comparison with the other local quality metrics, we report MRREs' := 1 − MRREs, so that higher values mean the k-nearest neighbors are better retained, as with trustworthiness and continuity.
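The rank-based definition of trustworthiness can be implemented directly from distance-matrix ranks; continuity is the same computation with the roles of the two spaces exchanged. The following numpy sketch is our own illustrative implementation (valid for k < n/2), not the evaluation code used in the paper.

```python
import numpy as np

def trustworthiness(high, low, k):
    """Penalize points that enter the k-NN in the embedding but ranked far in the original space."""
    n = high.shape[0]
    d_high = np.linalg.norm(high[:, None] - high[None, :], axis=-1)
    d_low = np.linalg.norm(low[:, None] - low[None, :], axis=-1)
    np.fill_diagonal(d_high, np.inf)  # exclude each point from its own neighborhood
    np.fill_diagonal(d_low, np.inf)
    # ranks_high[i, j] = rank of j among i's neighbors in the original space (1 = nearest)
    ranks_high = d_high.argsort(axis=1).argsort(axis=1) + 1
    knn_high = ranks_high <= k
    knn_low = (d_low.argsort(axis=1).argsort(axis=1) + 1) <= k
    # U: neighbors in the embedding that are *not* neighbors in the original space.
    penalty = np.where(knn_low & ~knn_high, ranks_high - k, 0).sum()
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

def continuity(high, low, k):
    return trustworthiness(low, high, k)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))   # toy high-dimensional data
Z = X[:, :2]                   # a crude "projection": keep two coordinates
print(trustworthiness(X, Z, k=5), continuity(X, Z, k=5))
```

A perfect embedding (one that preserves all neighborhood ranks) scores exactly 1.0 on both metrics.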

F HYPERPARAMETER SETTING

As explained in Section 5.2, we generated projections for each dimensionality reduction algorithm that had the lowest KL_0.1 measure. To tune each algorithm's hyperparameters, we employed a grid search for t-SNE, UMAP, and At-SNE. For t-SNE and At-SNE, we varied the perplexity from 5 to 50 with an interval of 5, and the learning rate from 0.1 to 1.0 on a log-uniform scale. For UMAP, we varied the number of nearest neighbors from 5 to 50 with an interval of 5, and the minimum distance between points in the projection from 0.1 to 1.0 with an interval of 0.1. We used the Python library scikit-optimize (Head et al. (2018)) to find the best hyperparameters for topological autoencoders. UMATO has several hyperparameters, such as the number of hub points, the number of epochs, and the learning rates for global and local optimization. In our experiments, we fixed all UMATO hyperparameters except the number of hub points: 200 hub points for the Spheres dataset and 300 for the others. We used fewer hub points for Spheres since it has only 10,000 data points in total, while the other datasets have 60,000. We set the number of epochs to 100 for global optimization and 50 for local optimization. Lastly, the global learning rate was set to 0.0065 and the local learning rate to 0.01.
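A grid search over the ranges above can be sketched as follows; `run_and_score` is a hypothetical placeholder for fitting an algorithm and computing its KL_0.1 score, and the dummy objective exists only to make the sketch runnable.

```python
import itertools
import numpy as np

# Grids mirroring the text: perplexity 5..50 step 5, learning rate log-uniform in [0.1, 1.0].
perplexities = range(5, 55, 5)
learning_rates = np.logspace(np.log10(0.1), np.log10(1.0), num=10)

def run_and_score(perplexity, learning_rate):
    """Hypothetical placeholder: fit the algorithm, return its KL_0.1 score (lower is better)."""
    return (perplexity - 30) ** 2 * 1e-4 + abs(learning_rate - 0.5)  # dummy objective

# Exhaustively evaluate the grid and keep the configuration with the lowest score.
best = min(itertools.product(perplexities, learning_rates),
           key=lambda params: run_and_score(*params))
print(best)
```

In practice `run_and_score` would train the projection and evaluate KL_0.1 on it; for the (Bayesian) search used for topological autoencoders, scikit-optimize's `gp_minimize` could replace the exhaustive loop.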

G SYNTHETIC SPHERES DATASET

We leveraged the same Spheres dataset that Moor et al. (2020) used in their experiments on topological autoencoders. The Spheres dataset contains eleven high-dimensional spheres residing in 101-dimensional space. We first generated ten spheres of radius 5 and shifted each sphere by adding a Gaussian noise vector in a random direction. For this, we created d-dimensional Gaussian vectors X ∼ N(0, I(10/√d)), where d is 101. To embed an interesting geometric structure in the dataset, the ten spheres of relatively small radius 5 were enclosed by another, larger sphere of radius 25.

We also leveraged the 3-dimensional S-curve and Swiss roll datasets to test whether UMATO can preserve both the global and local structures of the original datasets. As the visualization shows (Figure 6), only PCA and UMATO were able to capture both the global and local structures of the original datasets. Isomap, t-SNE, and UMAP could capture the local manifolds, but the high-level manifolds of the original datasets were not reflected in the embeddings.
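The Spheres construction described above can be reproduced in a few lines of numpy; this is our sketch of the described procedure (interpreting 10/√d as the noise scale), not the authors' generation script.

```python
import numpy as np

def sample_sphere(n, d, radius, rng):
    """Uniform points on a (d-1)-sphere: normalize Gaussian vectors, then scale."""
    g = rng.normal(size=(n, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_per_sphere = 101, 500
spheres = []
for _ in range(10):  # ten small spheres, each shifted by one Gaussian vector
    # X ~ N(0, I * 10/sqrt(d)), as in the text; used as a random shift direction.
    shift = rng.normal(scale=10.0 / np.sqrt(d), size=d)
    spheres.append(sample_sphere(n_per_sphere, d, radius=5.0, rng=rng) + shift)
spheres.append(sample_sphere(n_per_sphere, d, radius=25.0, rng=rng))  # enclosing sphere
data = np.vstack(spheres)
print(data.shape)
```

The ten inner clusters then sit inside the large sphere of radius 25, producing the nested structure the projections are evaluated on.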

I LOCAL QUALITY METRICS WITH DIFFERENT VALUES OF HYPERPARAMETER

We report the results of the local quality metrics with diverse hyperparameters (Table 7). We varied the number of nearest neighbors (k) from 5 to 15 with an interval of 5. As we have already reported the case of k = 5 (Table 2), below are the cases of k = 10 and k = 15. As the results show, while the values fluctuate slightly, the ranks are mostly robust over the diverse k values.

For the real-world datasets, UMATO produced projections similar to PCA's but with better-captured local characteristics. The results of topological autoencoders showed some points detached far from their centers, even though the best hyperparameters were used. Although At-SNE claims to capture both structures, its results were not significantly different from those of the original t-SNE algorithm when projecting the Spheres and Fashion MNIST datasets.

Figure 7: 2D projections produced by UMATO and six baseline algorithms. UMATO generated projections similar to PCA's but with the points more locally connected; this is best viewed in color.

K EMBEDDING ROBUSTNESS OVER NUMBER OF EPOCHS

We report the experimental results in Figure 8. As explained, the UMAP embedding results are susceptible to the number of epochs: the clusters become increasingly dispersed as the number of epochs grows. This can induce the misinterpretation that the distance between clusters is meaningful. The two-phase optimization of UMATO solves this problem because the global optimization (first phase) converges easily, running on only a small portion of the points. Therefore, increasing the number of epochs in the global optimization does not harm the final embedding.

N UMATO ON REAL-WORLD BIOLOGICAL DATASET

To test UMATO on a real-world biological dataset, we took professional advice from an expert with a Ph.D. in Bioinformatics. We ran UMATO and the baseline algorithms (t-SNE and UMAP) on 23,822 single-cell transcriptomes from two areas at distant poles of the mouse neocortex (Tasic et al. (2018)). Each cell belongs to one of 133 clusters defined by Jaccard-Louvain clustering (for more than 4,000 cells) or a combination of k-means and Ward's hierarchical clustering. Each cluster in turn belongs to one of four classes: GABAergic (red/purple), Endothelial (brown), Glutamatergic (blue/green), and Non-Neuronal (dark green). The embedding result for each method is given in Figure 11. In the case of t-SNE, the clusters are well captured, but the classes are widely dispersed, while UMAP adequately separates both classes and clusters. Compared to these baselines, UMATO captures the relationship between classes much better while retaining some of the local manifolds as well. This suggests that UMATO focuses more on the higher-level manifold than the baselines do, with the hub points serving as representatives that summarize the overall dataset. Moreover, there are cases in biological data analysis where researchers want to know the distance between samples (González-Blas et al. (2019); Van den Berge et al. (2020)). As UMAP embedding results are susceptible to the number of epochs, this may hinder accurate interpretation of the results. On the other hand, as UMATO is robust over the number of epochs, such biases are not a concern.

Figure 11: 2D visualization of UMATO, UMAP, and t-SNE on the mouse neocortex dataset (Tasic et al. (2018)). t-SNE separates clusters well but does not show class information: GABAergic (red/purple), Endothelial (brown), Glutamatergic (blue/green), Non-Neuronal (dark green). UMAP moderately captures both the clusters and classes.
In the case of UMATO, it demonstrates the relationship between classes much better than t-SNE and UMAP, retaining some of the local manifolds as well.



https://www.github.com/anonymous-author/anonymous-repo



Figure 2: An illustration of the UMATO pipeline using 10,000 data points (101 dimensions) of the Spheres dataset. (A) UMATO first initializes hub points using PCA, (B) then optimizes their positions using the cross-entropy function. (C) Next, we embed the expanded nearest neighbors into the projection and optimize their positions using sampling techniques for acceleration. (D) Lastly, we append the outliers and obtain the final projection result. Best viewed in color.

Figure 3: 2D projections produced by UMATO and six baseline algorithms. t-SNE, At-SNE, and UMAP showed the points from the surrounding sphere as if they were attached to the inner spheres, not reflecting the data's global structure. PCA, Isomap, and topological autoencoders attempted to preserve the global structure but failed to manifest the complicated hierarchical structures. UMATO was the only algorithm to capture both the global and local structures among all the different sphere classes; this is best viewed in color.

In the projections of t-SNE, At-SNE, and UMAP, the points representing the surrounding giant sphere mix with those representing the other small inner spheres, thus failing to capture the nested relationships among the classes. Meanwhile, topological autoencoders realize the global relationship between classes only in an incomplete manner; the points for the outer sphere are too spread out, losing the local characteristics of the class. From this result, we can see how effectively UMATO reveals both the global and local structures of high-dimensional data. 2D visualization results on other datasets (MNIST, Fashion MNIST, Kuzushiji MNIST) can be found in Appendix H. Lastly, we report an additional experiment on the mouse neocortex dataset (Tasic et al. (2018)) in Appendix N, which shows the relationship between classes much better than baseline algorithms such as t-SNE and UMAP.

Figure 4: UMATO results on the Spheres dataset using different initialization methods. Although the average normalized Procrustes distance of the UMATO results is higher than the baselines' because of the equidistant clusters of inner spheres, both the global and local structures are well captured with all initialization methods. Best viewed in color.

Figure 5: 2D projections of the Fashion MNIST dataset using UMATO and UMATO with multi-phase optimization. Although there were small differences, such as the locations of outliers, we observed that the projection results were quite similar to each other.

Figure 6: 2D projections produced by UMATO and four baseline algorithms on the 3-dimensional S-curve and Swiss roll datasets. While PCA and UMATO capture both the global and local structures of the original datasets, the other algorithms (Isomap, t-SNE, and UMAP) preserve only the local manifolds.

Figure 8: Comparison of UMATO and UMAP with a varying number of epochs. (Top row) UMAP is susceptible to the number of epochs, so the clusters become dispersed as the number of epochs increases. (Bottom row) In contrast, regardless of the number of epochs in the global optimization, UMATO yields almost the same embedding result.

Quantitative results of UMATO and six baseline algorithms. The hyperparameters of each algorithm were chosen to minimize KL_0.1. The best score is in bold and underlined, and the runner-up is in bold. Only the first four digits are shown for conciseness.

displays the experiment results. UMATO was the only method that showed competitive performance in both the global and local quality metrics in most datasets. For local metrics, t-SNE,

The average normalized Procrustes distance of diverse dimensionality reduction techniques over four datasets. On all real-world datasets, UMATO showed the most robust embedding results over different initialization methods. Although UMATO yields the highest normalized Procrustes distance on the Spheres dataset, the embedding results look quite similar (Figure 4). The winner is in bold.

procedure UMATO(X, k, d, min_dist, n_h, e_g, e_l)
    Input: high-dimensional data X, number of nearest neighbors k, projection dimension d, minimum distance in the projection result min_dist, number of hub points n_h, epochs for global and local optimization e_g and e_l

The normalized Procrustes distance between two projection results by the percentage of sub-samples. For four dimensionality reduction techniques, we measured the normalized Procrustes distance to check projection stability using the Flow Cytometry dataset. The winner is in bold.

Local quality metrics of UMATO and the baseline algorithms. Although the values change slightly depending on the number of nearest neighbors, comparing the results of k = 10 and k = 15, the ranks barely change. The winner is in bold and underlined, and the runner-up is in bold.

L ILLUSTRATION OF UMATO PIPELINE

For ease of understanding, we provide an illustration of the UMATO pipeline in Figure 9. A detailed explanation of UMATO can be found in Section 4.

M EFFECT OF LOCAL LEARNING RATE OF UMATO

By manipulating one of UMATO's hyperparameters, the local learning rate, the user can determine what the embedding result emphasizes: a lower value (e.g., 0.005) reveals more of the global structures, while a higher one (e.g., 0.1) generates an embedding more like UMAP's, which favors local manifolds.

