GEOMETRICALLY REGULARIZED AUTOENCODERS FOR NON-EUCLIDEAN DATA

Abstract

Regularization is almost de rigueur when designing autoencoders that are sparse and robust to noise. Given the recent surge of interest in machine learning problems involving non-Euclidean data, in this paper we address the regularization of autoencoders on curved spaces. We show that by ignoring the underlying geometry of the data and applying standard vector space regularization techniques, autoencoder performance can be severely degraded, or worse, training can fail to converge. Assuming that both the data space and latent space can be modeled as Riemannian manifolds, we show how to construct regularization terms in a coordinate-invariant way, and develop geometric generalizations of the denoising autoencoder and reconstruction contractive autoencoder such that the essential properties that enable the estimation of the derivative of the log-probability density are preserved. Drawing upon various non-Euclidean data sets, we show that our geometric autoencoder regularization techniques can have important performance advantages over vector-spaced methods while avoiding other breakdowns that can result from failing to account for the underlying geometry.

1. INTRODUCTION

Regularization is almost de rigueur when designing autoencoders that are sparse and robust to noise. With appropriate regularization, autoencoders enable representations useful for downstream applications (Bengio et al., 2013) , generate plausible data samples (Kingma & Welling, 2013; Rezende et al., 2014) , or even obtain information on the data-generating probability density (Vincent et al., 2010; Rifai et al., 2011b) . Existing work on autoencoder regularization has mostly been confined to vector spaces, i.e., the data are assumed to be drawn from a vector space. On the other hand, a significant and growing number of problems in machine learning involve data that is non-Euclidean (in some past cases the fact that the data was non-Euclidean was not recognized or ignored). Bronstein et al. (2017) reviews several deep neural network architectures and modeling principles to explicitly deal with data defined on non-Euclidean domains, e.g., data collected from sensor networks, social networks in computational social sciences, or two-dimensional meshes embedded in the three-dimensional space. Other works have also addressed manifold-valued data including human mass and shape data (Kendall, 1984; Freifeld & Black, 2012) , directional data (Mardia, 2014) , point cloud data (Lee et al., 2022) , and MRI imaging data (Fletcher & Joshi, 2007; Banerjee et al., 2015) , with several deep neural networks proposed to handle such data in a coordinate-invariant way (Huang & Van Gool, 2017; Chakraborty et al., 2020) . The fundamental idea behind these works is that the geometrical structure of the curved space from which the non-Euclidean data are drawn needs to be accounted for properly, so that the output of any deep learning network applied to such input data should not depend on the particular choice of coordinates used to parametrize the data. 
Ignoring the underlying geometry of the data and simply applying standard vector space techniques can severely degrade performance, or worse, cause training to fail. Autoencoder training and its regularization are no exception. For example, consider autoencoder training on a set of data points on a sphere as shown in Figure 1. When spherical coordinate representations are used as inputs to train an autoencoder with a contractive regularization, the trained reconstruction function can depend heavily on the choice of coordinates. Moreover, it often fails to learn the correct contractive directions toward data-dense regions, especially near the singularity (or the spherical coordinate origins). On the other hand, an autoencoder that properly reflects the spherical constraints can recover those directions successfully and gives results that are almost invariant to the choice of coordinates.

In this paper we address the regularization of autoencoders on curved spaces. Any loss function used to train or regularize the autoencoder should be formulated in a coordinate-invariant way, i.e., invariant to the choice of local coordinates used to parametrize the data, and should instead depend only on the intrinsic properties of the curved space such as curvature or the choice of metric. Assuming that both the data space and latent space can be modeled as Riemannian manifolds, we show how to construct regularization terms and objective functions in a coordinate-invariant way. We also develop geometric generalizations of the denoising autoencoder (DAE) and reconstruction contractive autoencoder (RCAE) such that the essential properties that enable the estimation of the score, i.e., the log-derivative of the data-generating density, are preserved.
We provide some applications that use this property, such as sampling, clustering, and filtering for non-Euclidean data, and also show that the proposed autoencoders can obtain useful representations for non-Euclidean data, especially when the data are noisy. Drawing upon various non-Euclidean data sets, we show that our geometric autoencoder regularization techniques can have important performance advantages over vector-spaced methods (in some cases by significant margins) while avoiding other breakdowns that can result from failing to account for the underlying geometry.

The paper is organized as follows. We describe regularized autoencoders for Euclidean data in Section 2 and propose their coordinate-invariant generalizations to non-Euclidean data in Section 3. We then provide autoencoder training case studies using non-Euclidean data sets in Section 4.

2. REGULARIZED AUTOENCODERS FOR EUCLIDEAN DATA

Mathematically, an autoencoder can be represented as the composition of two mappings $f : \mathbb{R}^D \to \mathbb{R}^d$ (the encoder) and $g : \mathbb{R}^d \to \mathbb{R}^D$ (the decoder), i.e., $r = g \circ f : \mathbb{R}^D \to \mathbb{R}^D$ with the space of hidden variables $\mathbb{R}^d$. Assume there exists a data-generating probability density $\rho : \mathbb{R}^D \to \mathbb{R}$ from which data points on $\mathbb{R}^D$ are drawn. Autoencoder training in a vector space can then be formulated as minimizing the reconstruction error

$$\int_{\mathbb{R}^D} \| x - g(f(x; \theta_1); \theta_2) \|^2 \, \rho(x) \, dx$$

over $\theta = (\theta_1, \theta_2)$, where $x \in \mathbb{R}^D$ denotes an input variable, $\theta_1$ and $\theta_2$ are respectively the parameter sets of the maps $f$ and $g$, and $\| \cdot \|^2$ is the squared Euclidean norm. In the typical case where $d < D$, the autoencoder engages in a type of dimensionality reduction. By disregarding the assumption that $d < D$, autoencoders have been modeled in the form of deep artificial neural networks accompanied by certain regularization terms to learn useful representations of the data (Bengio et al., 2007; Vincent et al., 2008; 2010; Ranzato et al., 2007; 2008; Kingma & Welling, 2013; Rezende et al., 2014; Rifai et al., 2011b; a). For a more comprehensive review of autoencoders, we refer the reader to Goodfellow et al. (2016).

The effects of regularization have been investigated in some detail in Alain & Bengio (2014) for the denoising autoencoder (DAE) (Vincent et al., 2010) and the reconstruction contractive autoencoder (RCAE). They point out that these regularization methods reduce the autoencoder's sensitivity to the input, while the reconstruction error increases the autoencoder's sensitivity to variations along the region of the highest density in the data space. Reconstruction and regularization together successfully capture variations in such regions while ignoring variations that are orthogonal to those and obtain information on the data-generating probability density.
Our focus will be on these two types of regularized autoencoders, i.e., the DAE and the RCAE, but the methods described in this paper are easily generalizable to other autoencoders. In the standard vector space formulation of the DAE, an input $x \in \mathbb{R}^D$ is assumed to have been corrupted by some noise density $q(\tilde{x}|x)$ to $\tilde{x} \in \mathbb{R}^D$, i.e., $\tilde{x} \sim q(\tilde{x}|x)$. We then seek the reconstruction function $r = g \circ f : \mathbb{R}^D \to \mathbb{R}^D$ that solves

$$\min_r \int_{\mathbb{R}^D} \mathbb{E}_{q(\tilde{x}|x)} \left[ \| r(\tilde{x}) - x \|^2 \right] \rho(x) \, dx, \tag{1}$$

where $\mathbb{E}_{q(\tilde{x}|x)}[\,\cdot\,]$ denotes the expectation with respect to the noise density $q(\tilde{x}|x)$. A trivial identity mapping $r(x) = x$ can be avoided due to the injected noise. For the vector space formulation of the RCAE, the objective function is

$$\min_r \int_{\mathbb{R}^D} \left( \| r(x) - x \|^2 + \sigma^2 \, \mathrm{Tr}\left( \frac{\partial r}{\partial x}^{\top} \frac{\partial r}{\partial x} \right) \right) \rho(x) \, dx, \tag{2}$$

where $\sigma^2$ is a scalar weighting coefficient. The second term in (2) acts as a regularization term and can be interpreted as the Dirichlet energy of $r : \mathbb{R}^D \to \mathbb{R}^D$, measuring how variable the mapping $r$ is (Belkin & Niyogi, 2003; Solomon et al., 2013). Minimizing this term induces the contraction of the mapping $r$ (e.g., in the absence of the reconstruction error term in (2), an extreme contraction of $r$ = constant would be obtained), preventing $r(x)$ from becoming the identity mapping $r(x) = x$. Note that replacing $\frac{\partial r}{\partial x}$ with $\frac{\partial f}{\partial x}$ in (2) gives the objective function for the contractive autoencoder (CAE) (Rifai et al., 2011b).

For both the DAE under a Gaussian corruption process $\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)$ and the RCAE with $\sigma$ small, the derivative of the log-probability density, which is also referred to as the score, can be estimated from the optimized $r(x)$ as follows (Alain & Bengio, 2014):

$$\frac{\partial \log \rho(x)}{\partial x} = \frac{1}{\rho} \frac{\partial \rho}{\partial x}(x) = \frac{r(x) - x}{\sigma^2} + O(\sigma^2). \tag{3}$$
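As a quick sanity check of the score relation (3), the following minimal sketch uses a case where the DAE minimizer is known in closed form. It assumes one-dimensional data drawn from $\mathcal{N}(0, s^2)$ with Gaussian corruption of variance $\sigma^2$, for which the optimal reconstruction is the posterior mean $r(x) = x\, s^2 / (s^2 + \sigma^2)$. The function names and the values of `s2` and `sigma2` are illustrative choices, not taken from the paper.

```python
import numpy as np

def optimal_dae_gaussian(x, s2, sigma2):
    """Closed-form DAE minimizer for 1-D data ~ N(0, s2) with noise N(0, sigma2).

    The minimizer of the denoising objective is the posterior mean of the
    clean point given the corrupted one, evaluated here at x.
    """
    return x * s2 / (s2 + sigma2)

def score_from_reconstruction(r_x, x, sigma2):
    """Score estimate (r(x) - x) / sigma^2, the leading term of equation (3)."""
    return (r_x - x) / sigma2

s2, sigma2 = 1.0, 1e-3          # illustrative data and noise variances
x = np.linspace(-2.0, 2.0, 9)
estimated = score_from_reconstruction(optimal_dae_gaussian(x, s2, sigma2), x, sigma2)
true_score = -x / s2            # d/dx log N(x; 0, s2)
max_abs_error = np.max(np.abs(estimated - true_score))
```

In this toy case the estimate equals $-x / (s^2 + \sigma^2)$, so the gap to the true score shrinks linearly in $\sigma^2$.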

3. REGULARIZED AUTOENCODERS FOR NON-EUCLIDEAN DATA

In this section, we address the problem of autoencoder training for the case where the data points are drawn from an a priori known non-Euclidean space M, possibly with another non-Euclidean latent space N . We formulate the reconstruction error and regularization terms in a coordinate-invariant way using notions from Riemannian geometry. (We provide some mathematical backgrounds required for our formulations in Appendix A.) We then show that, as in the Euclidean case, it is possible to estimate the score for non-Euclidean data using the trained autoencoders.

3.1. COORDINATE-INVARIANT GENERALIZATIONS OF THE AUTOENCODER COMPONENTS

Referring to Figure 2, let $\mathcal{M}$ be an $m$-dimensional manifold with local coordinates $x = (x^1, \ldots, x^m)$ and Riemannian metric $ds^2 = \sum_{i=1}^{m} \sum_{j=1}^{m} g_{ij}(x) \, dx^i dx^j$. Throughout this paper, we use italics to represent local coordinates, e.g., a point $\mathrm{x} \in \mathcal{M}$ has local coordinates $x \in \mathbb{R}^m$ and the mapping $\mathrm{r} : \mathcal{M} \to \mathcal{M}$ can be represented in local coordinates as $r : \mathbb{R}^m \to \mathbb{R}^m$. The metric will also be denoted in matrix form as $G(x) = (g_{ij}(x)) \in \mathbb{R}^{m \times m}$. We now provide coordinate-invariant generalizations of each component in autoencoder training, especially in (1) and (2), while simultaneously reflecting any intrinsic properties of the manifold $\mathcal{M}$.

Reconstruction error:

The reconstruction error on Riemannian manifolds can be defined as $\mathrm{dist}(\mathrm{r}(\mathrm{x}), \mathrm{x})^2$, where $\mathrm{dist}(\mathrm{x}, \mathrm{y})$ denotes the minimal geodesic distance between points $\mathrm{x}, \mathrm{y} \in \mathcal{M}$.

Probability function:

Assume there exists a data-generating probability function $p_g : \mathcal{M} \to \mathbb{R}$ from which data points on $\mathcal{M}$ are drawn. Denote by $\rho_g : \mathbb{R}^m \to \mathbb{R}$ its representation in local coordinates, satisfying $\rho_g(x) > 0$ for all $x \in \mathbb{R}^m$ and $\int_{\mathcal{M}} \rho_g(x) \sqrt{\det G(x)} \, dx = 1$ (Pennec, 1999), where $\sqrt{\det G(x)} \, dx$ is the natural volume element induced from the metric. When formulating regularized autoencoders later, $p_g$ (or $\rho_g$) is used in the form of a weighted volume element $\rho(x) \, dx \equiv \rho_g(x) \sqrt{\det G(x)} \, dx$. In practice, the integrations involving $\rho(x) \, dx$ are approximated as an equally weighted finite sum of the integrands (with $\rho(x)$ excluded) evaluated at given data points.

Contractive regularization:

The conventional Dirichlet energy (appearing in (2)) has been generalized to mappings between Riemannian manifolds in the theory of harmonic maps (Eells & Sampson, 1964), based on which we formulate the contractive regularization for non-Euclidean settings. Let $\mathcal{M}$ be the input manifold, and let $\mathcal{N}$ be an $n$-dimensional output manifold with local coordinates $y = (y^1, \ldots, y^n)$ and Riemannian metric $dr^2 = \sum_{\alpha=1}^{n} \sum_{\beta=1}^{n} h_{\alpha\beta}(y) \, dy^\alpha dy^\beta$. The metric will also be denoted in matrix form as $H(y) = (h_{\alpha\beta}(y)) \in \mathbb{R}^{n \times n}$. In Eells & Sampson (1964) the Dirichlet energy of a smooth map $f : \mathcal{M} \to \mathcal{N}$ is defined as

$$\int_{\mathcal{M}} \mathrm{Tr}\left( J^{\top} H J G^{-1} \right) \sqrt{\det G} \; dx^1 \cdots dx^m, \tag{4}$$

where $J(x) = \left( \frac{\partial f^i}{\partial x^j}(x) \right) \in \mathbb{R}^{n \times m}$ is the differential $df_x : T_x\mathcal{M} \to T_y\mathcal{N}$ denoted in local coordinates. The energy functional in (4) is an intrinsic quantity, i.e., coordinate-invariant. We discuss the coordinate-invariance of (4) and some physical interpretations of minimizing (4) in Appendix B.1 and refer the reader to the extensive literature on the theory and applications of harmonic maps, e.g., Eells & Lemaire (1978; 1988); Park & Brockett (1994); Gu et al. (2004); Jang et al. (2021); Lee et al. (2021b).

To apply this energy functional to autoencoder regularization, we replace the mapping $f : \mathcal{M} \to \mathcal{N}$ and the Riemannian metric $H(f(x))$ with the reconstruction mapping $r : \mathcal{M} \to \mathcal{M}$ and $G(r(x))$, respectively, and replace the natural volume element $\sqrt{\det G} \, dx$ with the weighted volume element $\rho(x) \, dx$. Also note that for the case of a non-Euclidean latent space $\mathcal{N}$, the objective in (4) with the volume element $\rho(x) \, dx$ can serve as a geometric regularizer for contractive autoencoders. (See Appendix B.2 for further discussion of non-Euclidean latent spaces.)

Figure 3: A data corruption example for an input $\mathrm{x} \in S^2$. Tangent vectors (the yellow triangles) are sampled from a zero-mean Gaussian in $T_x S^2$ and mapped via the exponential map to black dots, the corrupted points from $\mathrm{x}$. The red dot is $\tilde{\mathrm{x}} = \mathrm{Exp}_x(v)$ for $v \in T_x S^2$.

Data corruption process:

To corrupt an input $\mathrm{x} \in \mathcal{M}$, we sample a tangent vector $v \in T_x\mathcal{M}$ from an isotropic zero-mean multivariate Gaussian, i.e., a linear combination of an orthonormal basis for $T_x\mathcal{M}$ with coefficients sampled from $\mathcal{N}(0, \sigma^2 I)$, and then apply the exponential map $\mathrm{Exp}_x : T_x\mathcal{M} \to \mathcal{M}$ to $v$. The point $\tilde{\mathrm{x}} = \mathrm{Exp}_x(v)$ can then be interpreted as a corrupted point from $\mathrm{x}$. We denote by $q(\tilde{\mathrm{x}}|\mathrm{x})$ the noise density that samples $\tilde{\mathrm{x}}$ for given $\mathrm{x}$ according to this procedure. A data corruption example for data on a sphere ($S^2$) is illustrated in Figure 3, and a rationale for adopting this way of corruption is explained in Appendix B.3.
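The corruption procedure above can be sketched for the unit sphere $S^2$ embedded in $\mathbb{R}^3$. This is a minimal illustration and not the implementation used in the experiments: the helper names (`tangent_basis`, `exp_map_sphere`, `corrupt`) and the choice of working in ambient coordinates are assumptions made for brevity. It relies on the standard closed form of the exponential map on the unit sphere, $\mathrm{Exp}_x(v) = \cos(\|v\|)\, x + \sin(\|v\|)\, v / \|v\|$.

```python
import numpy as np

def tangent_basis(x):
    """Orthonormal basis (rows, shape 2 x 3) of T_x S^2 for a unit vector x."""
    # Start from the coordinate axis least aligned with x to avoid degeneracy.
    a = np.eye(3)[np.argmin(np.abs(x))]
    e1 = a - np.dot(a, x) * x
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(x, e1)
    return np.stack([e1, e2])

def exp_map_sphere(x, v):
    """Exponential map on the unit sphere: move from x along the great circle in direction v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x.copy()
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def corrupt(x, sigma, rng):
    """Sample tangent coefficients from N(0, sigma^2 I) and push them through Exp_x."""
    coeffs = rng.normal(0.0, sigma, size=2)
    v = coeffs @ tangent_basis(x)
    return exp_map_sphere(x, v), v

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0, 1.0])
x_tilde, v = corrupt(x, sigma=0.1, rng=rng)
```

The corrupted point stays on the sphere by construction, and its geodesic distance from `x` equals the norm of the sampled tangent vector, which is what makes the noise level $\sigma$ meaningful independently of the coordinates.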

3.2. GEOMETRICALLY REGULARIZED AUTOENCODERS

We now derive coordinate-invariant formulations of the regularized autoencoders presented in (1) and (2). For the reconstruction contractive autoencoder (2), the contractive regularizer modified from (4) (as explained above) is added to the reconstruction error term as follows:

$$\min_r \int_{\mathcal{M}} \left( \mathrm{dist}(\mathrm{r}(\mathrm{x}), \mathrm{x})^2 + \sigma^2 \, \mathrm{Tr}\left( \frac{\partial r}{\partial x}^{\top} G(r) \frac{\partial r}{\partial x} G^{-1} \right) \right) \rho(x) \, dx, \tag{5}$$

where the reconstruction map $\mathrm{r} : \mathcal{M} \to \mathcal{M}$ is expressed in local coordinates as $r : \mathbb{R}^m \to \mathbb{R}^m$, and $G(r)$ denotes the metric at point $r(x)$. We refer to (5) as the geometric RCAE or GRCAE. Similarly, the geometric version of the DAE (1), referred to as GDAE, can be formulated as follows:

$$\min_r \int_{\mathcal{M}} \mathbb{E}_{q(\tilde{\mathrm{x}}|\mathrm{x})} \left[ \mathrm{dist}(\mathrm{r}(\tilde{\mathrm{x}}), \mathrm{x})^2 \right] \rho(x) \, dx, \tag{6}$$

where $\mathbb{E}_{q(\tilde{\mathrm{x}}|\mathrm{x})}[\,\cdot\,]$ denotes the expectation with respect to the noise density $q(\tilde{\mathrm{x}}|\mathrm{x})$ presented above. From the geometric formulations (5)-(6), we can obtain the following relations between the reconstruction function $r$ and the log of the probability function $\rho_g(x) = \frac{\rho(x)}{\sqrt{\det G(x)}}$.

Theorem 1. Provided $\sigma^2$ is small, the derivative of the log of the probability function $\rho_g(x) = \frac{\rho(x)}{\sqrt{\det G(x)}}$ can be approximated for both the GRCAE and GDAE as

$$\frac{\partial \log \rho_g(x)}{\partial x} = \frac{1}{\rho_g} \frac{\partial \rho_g}{\partial x}(x) = G(x) \, \frac{r(x) - x}{\sigma^2} + O(\sigma^2). \tag{7}$$

The proof of Theorem 1 is provided in Appendix C. Equation (7) can be thought of as a generalization of (3) for non-Euclidean data, and we will refer to $\frac{\partial \log \rho_g(x)}{\partial x}$ as the geometric score. The reconstruction function $\mathrm{r} : \mathcal{M} \to \mathcal{M}$ for the proposed autoencoders is modeled as a neural network in later experiments, with implementation details provided in Appendix D. There we also provide some ideas for dealing with manifolds that require multiple coordinate charts and discuss another case where the data space Riemannian metric is not known a priori.
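To make the two coordinate expressions above concrete, the sketch below evaluates the regularizer integrand of (5) and the geometric score estimate of (7) at a single point of $S^2$ in spherical coordinates $x = (\theta, \phi)$, where $G(x) = \mathrm{diag}(1, \sin^2\theta)$. The function names, the finite-difference Jacobian, and the toy map `toy_r` are assumptions for illustration only; in the paper $r$ is a trained neural network.

```python
import numpy as np

def sphere_metric(x):
    """Metric of the unit sphere S^2 in spherical coordinates x = (theta, phi)."""
    theta = x[0]
    return np.diag([1.0, np.sin(theta) ** 2])

def geometric_score(r, metric, x, sigma2):
    """Leading term of (7): G(x) (r(x) - x) / sigma^2."""
    return metric(x) @ (r(x) - x) / sigma2

def contractive_term(r, metric, x, eps=1e-6):
    """Integrand Tr(J^T G(r(x)) J G(x)^{-1}) from (5), with J = dr/dx
    approximated by central finite differences."""
    m = x.size
    J = np.zeros((m, m))
    for j in range(m):
        step = np.zeros(m)
        step[j] = eps
        J[:, j] = (r(x + step) - r(x - step)) / (2.0 * eps)
    G_out = metric(r(x))
    return np.trace(J.T @ G_out @ J @ np.linalg.inv(metric(x)))

sigma2 = 1e-2
shift = np.array([1.0, 2.0])

def toy_r(x):
    # Hypothetical reconstruction map (not a trained network): a constant shift.
    return x + sigma2 * shift

x0 = np.array([np.pi / 3, 0.3])
score = geometric_score(toy_r, sphere_metric, x0, sigma2)
identity_penalty = contractive_term(lambda x: x, sphere_metric, x0)
```

For the identity map the integrand reduces to $\mathrm{Tr}(G G^{-1}) = m$, which is a convenient check that the metric factors are placed on the correct sides of the Jacobian.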

4. EXPERIMENTS

In the experiments, we first demonstrate the geometric score estimation (Theorem 1) based on our geometrically regularized autoencoders for non-Euclidean data, providing a solid basis for future applications of autoencoders. We then utilize the proposed autoencoders for various applications, such as data sampling based on the Langevin Monte Carlo methods (Girolami & Calderhead, 2011) or clustering and noise filtering based on mode-seeking (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002) , involving real-world non-Euclidean data sets. We also examine the usefulness of the proposed autoencoders in the representation learning perspective, using noisy point cloud data.

4.1. GEOMETRIC SCORE ESTIMATION

For geometric score estimation, we consider data sampled from $P(n)$, the space of $n \times n$ symmetric positive-definite matrices, endowed with the affine-invariant Riemannian metric (Fletcher & Joshi, 2007). We train the GDAE and GRCAE using synthetic data sampled from m mixtures of isotropic tangent space Gaussians for which the ground truth geometric score values are obtainable. For purposes of comparison, we also use DAE, RCAE, the least-squares log-density gradient method (LSLDG) presented in Sasaki et al. (2014), and its extension to data on Riemannian manifolds (R-LSLDG) in Ashizawa et al. (2017). Full experimental details are provided in Appendix F. For a given data set $\{x_1, \ldots, x_N\}$ represented in local coordinates of $P(n)$, the geometric score estimation error (Est. error), which is also defined to be coordinate-invariant, is evaluated as follows:

$$\text{Est. error} = \frac{1}{N} \sum_{i=1}^{N} \left( \left. \frac{\partial \log \rho_g}{\partial x} \right|_{\mathrm{est}}(x_i) - \frac{\partial \log \rho_g}{\partial x}(x_i) \right)^{\top} G^{-1}(x_i) \left( \left. \frac{\partial \log \rho_g}{\partial x} \right|_{\mathrm{est}}(x_i) - \frac{\partial \log \rho_g}{\partial x}(x_i) \right), \tag{8}$$

where the subscript est denotes the estimated geometric score. Notably, GRCAE shows the best performance for higher dimensionality and a higher number of mixtures in terms of estimation error. We also observe a similar tendency for another synthetic data set on the hypersphere $S^n = \{p \in \mathbb{R}^{n+1} \mid \|p\| = 1\}$, and the results are provided in Appendix F.4.

We can apply the geometric scores estimated from GDAE and GRCAE to the sampling of non-Euclidean data via Riemannian Langevin Monte Carlo (RLMC) methods. The stochastic process for the RLMC methods in Girolami & Calderhead (2011) can be reformulated using the geometric score in (7) as follows:

$$dx = \left( \frac{1}{2} G^{-1}(x) \frac{\partial \log \rho_g(x)}{\partial x} - \frac{1}{\beta} G^{-1}(x) \frac{\partial \beta}{\partial x} + \Psi(x) \right) dt + \sqrt{G^{-1}(x)} \, dw, \tag{9}$$

where $dw \in \mathbb{R}^m$ denotes the Brownian motion in an $m$-dimensional vector space, $\beta(x) = 1/\sqrt{\det G(x)}$, and $\Psi(x) \in \mathbb{R}^m$ is a vector whose $i$-th component is given by $\sum_{j=1}^{m} \frac{\partial g^{ij}}{\partial x^j}$ with $g^{ij}$ as the $(i, j)$ element of $G^{-1}(x)$. A discretization of the above process gives the RLMC method (Girolami & Calderhead, 2011).
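The estimation error in (8) is a mean of squared score residuals measured with the inverse metric at each data point. A minimal sketch of that computation is given below; the array layout (one row per data point, one inverse metric per data point) and the function name are assumptions for illustration, and the numbers are made up rather than taken from the experiments.

```python
import numpy as np

def estimation_error(score_est, score_true, metric_inv):
    """Mean over data points of e_i^T G^{-1}(x_i) e_i, with e_i the score residual.

    score_est, score_true: arrays of shape (N, m).
    metric_inv: array of shape (N, m, m) holding G^{-1}(x_i) for each point.
    """
    e = score_est - score_true
    return np.mean(np.einsum("ni,nij,nj->n", e, metric_inv, e))

# Made-up residuals and inverse metrics for two points in a 2-D chart.
score_true = np.zeros((2, 2))
score_est = np.array([[1.0, 0.0], [0.0, 2.0]])
metric_inv = np.stack([np.eye(2), np.diag([2.0, 0.5])])
error = estimation_error(score_est, score_true, metric_inv)
```

Weighting the residual by $G^{-1}$ rather than the identity is what keeps the value unchanged under a change of local coordinates, since the score transforms as a covector.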
Note that in (9), the terms except for 1 2 G -1 (x) ∂ log ρg(x) ∂x dt correspond to the Brownian motion on manifolds (Brockett, 1997) . As an illustrative case study, we consider sampling on a sphere (S 2 ). After training DAE, RCAE, GDAE, and GRCAE for data points on S 2 shown in Figure 4 (left), we sample new data points that approximately follow the original data distribution by applying the RLMC methods using the geometric scores estimated from GDAE/GRCAE and its Euclidean counterparts using DAE/RCAE. We also report the results obtained from the S-Flow method (a variation of the M-Flow method (Brehmer & Cranmer, 2020) for the case of S 2 ). The experimental details are provided in Appendix G. For a quantitative comparison of the sampling performances, the maximum mean discrepancy (MMD) (Gretton et al., 2012) between a test data set and the obtained samples is provided in Table 2 . As shown in the table and Figure 4 , GDAE shows the best quantitative and qualitative performance among the considered autoencoders. Even though the samples from GDAE do not yield better numerical results than those from the S-Flow method, an algorithm targeted for data sampling, it is observed that plausible samples can be obtained. Also note that the samples obtained from RCAE and GRCAE tend to be inferior to those from DAE and GDAE in this task, possibly due to the algorithmic properties of RCAE/GRCAE in which, unlike DAE/GDAE, the inputs to neural networks are strictly confined to the given training data. Thus, the reconstruction functions may not be accurately trained on other regions visited during the sampling process compared to DAE/GDAE. In natural language processing, the similarity between word embeddings, i.e., the vector representations that encode semantic information of the words, is often measured according to the cosine similarity (Mikolov et al., 2013) . 
This can be equated to considering the embeddings as points on a hypersphere and measuring the distance based on the geodesic distance (Straub et al., 2015) . Based on this idea, we group documents in the Newsgroup20 data set (Lang, 1995) using the GDAE trained on the document embeddings. The document embeddings are first represented as the average of the word embeddings in the document and then projected to a hypersphere. For the word embeddings, we utilize pre-trained 50-dimensional GloVe embeddings (Pennington et al., 2014) , hence the document embeddings lie on S 49 . We define four clustering tasks as described in Appendix H.1. We train DAE, GDAE, LSLDG, and R-LSLDG as explained in Appendix H.2. Here we consider two variations of DAE and LSLDG, respectively; the first ones are trained on the spherical coordinate representations of the data, and the second ones on the representations in the ambient space R n+1 . After training the models, the document embeddings are iteratively updated along the gradient of the log-probability (for GDAE, this can be performed by repeatedly applying the reconstruction function on the embedding) for a fixed number of steps and then grouped as done in the mean shift clustering (Comaniciu & Meer, 2002) . The adjusted Rand index (ARI) is used as the performance metric for the clustering tasks (Hubert & Arabie, 1985) . The averaged clustering performance for five runs of each method is presented in Table 3 . Note that the vector-spaced methods trained on the ambient representations of the data are hardly successful in grouping the data. On average, the R-LSLDG performs slightly better than LSLDG and DAE but with higher variance. The results obtained from GDAE show a much higher ARI than others. An additional case study for the clustering of covariance matrix data is provided in Appendix H.3. For the next case study, we consider data obtained from diffusion tensor imaging (DTI). 
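The link between cosine similarity and geodesic distance on the hypersphere used above can be sketched as follows. The random vectors stand in for word embeddings and are purely illustrative (no GloVe vectors are loaded here), and the function names are assumptions.

```python
import numpy as np

def document_embedding(word_vectors):
    """Average the word vectors of a document and project the mean onto the unit hypersphere."""
    mean = word_vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)

def geodesic_distance(p, q):
    """Great-circle distance between unit vectors: the arccosine of their cosine similarity."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))

rng = np.random.default_rng(0)
# Stand-ins for 50-dimensional word embeddings of two documents.
doc_a = document_embedding(rng.normal(size=(12, 50)))
doc_b = document_embedding(rng.normal(size=(8, 50)))
cosine_similarity = float(np.dot(doc_a, doc_b))
distance = geodesic_distance(doc_a, doc_b)
```

Because the arccosine is monotonically decreasing, ranking document pairs by cosine similarity and by geodesic distance on $S^{49}$ gives the same ordering, which is the sense in which the two views coincide.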
Mathematically, a DTI datum is a three-dimensional image in which the value assigned to each voxel is an element of P(3). A voxel of DTI data can be treated as a point in the space of three-dimensional normal distributions N(3), by regarding the voxel location and voxel value as the mean and covariance of a normal distribution, respectively. In this section, following Han & Park (2014), we adopt the Fisher information Riemannian metric (FIRM) for DTI data (see Appendix D.3.1 for the definition of the FIRM). We train the GDAE on raw DTI data and apply the trained GDAE to filter the noise in the data. By iteratively applying the reconstruction function of the trained GDAE, data points are mapped toward the local modes of the probability function (as implied by (7)), and the voxels with added noise, which usually have a lower probability, can be automatically filtered (Comaniciu & Meer, 2002). We summarize a DTI filtering algorithm based on the GDAE in Appendix I.2.

4.4. FILTERING OF DIFFUSION TENSOR IMAGING DATA

We first demonstrate the effectiveness of this algorithm in a simpler setting by using a synthetic data set in the space of two-dimensional normal distributions N(2). We train our GDAE on a noisy input artificially corrupted from the clean data in Figure 5 and apply the filtering algorithm. Qualitatively, the proposed algorithm filters out the noise better than the other filtering methods: DAE-based filtering leaves more residual noise, and the manifold-valued kernel regression-based filtering approach (MVKR) (Banerjee et al., 2015) tends to erroneously smooth discontinuous voxel values. A quantitative comparison in Table 4 also suggests that our GDAE-based filtering can perform well at various noise levels. The experimental details are deferred to Appendix I.3.

We now conduct numerical experiments on DTI filtering with our algorithm. Data used in these experiments were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). See Appendix I.4 for the data preprocessing details. We also perform filtering based on a DAE trained on nine-dimensional vector representations of voxels. Note that, critically, DAE-based filtering does not guarantee that the filtered voxel values are positive-definite, while GDAE-based filtering always satisfies the constraint. The input image and the filtering results from each algorithm are shown in Figure 6. We plot axial slices of the corresponding DTI data in Figures 6(a)-(c). Voxel values are drawn as ellipsoids with colors representing the direction of the first principal axis. Since ground truths for DTI data are not available, only qualitative comparisons between the filtering results can be made. In Figures 6(a)-(c), it can be seen that the GDAE-based method effectively filters outliers (the abruptly changing colors appearing in Figure 6(a)) when compared to DAE-based filtering.
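The positive-definiteness point can be illustrated with a small sketch. Note the assumptions: it uses the affine-invariant metric on $P(3)$ from Section 4.1 rather than the FIRM on N(3) adopted for the DTI experiments, and it compares a single additive update with a single exponential-map update instead of running any trained autoencoder. The helper names are hypothetical. The closed form $\mathrm{Exp}_P(V) = P^{1/2} \exp(P^{-1/2} V P^{-1/2}) P^{1/2}$ is the standard exponential map for that metric.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    w, U = np.linalg.eigh(S)
    return (U * fun(w)) @ U.T

def exp_map_spd(P, V):
    """Exp_P(V) = P^{1/2} exp(P^{-1/2} V P^{-1/2}) P^{1/2} for SPD P and symmetric V."""
    P_half = _sym_fun(P, np.sqrt)
    P_inv_half = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return P_half @ _sym_fun(P_inv_half @ V @ P_inv_half, np.exp) @ P_half

P = np.eye(3)                        # a positive-definite voxel value
V = np.diag([-3.0, 0.5, 0.5])        # a large symmetric update direction
euclidean_update = P + V             # vector-space update: one eigenvalue becomes 1 - 3 = -2
geometric_update = exp_map_spd(P, V) # manifold update: eigenvalues are exponentials, hence positive
```

The additive update leaves the manifold as soon as the step is large enough, whereas the exponential-map update cannot, because its eigenvalues are exponentials of real numbers.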
In anatomical terms, the separation between the cerebrospinal fluid and brain parenchyma becomes clear after GDAE-based filtering, and as a result identifying the sulci and gyri of the brain is much easier than from the raw input or the DAE-based filtering results. Furthermore, noise in the ventricle area also disappears after GDAE-based filtering, while DAE-based filtering fails to eliminate this noise. We should note here that the DTI filtering results would require further concordance verification procedures with experts such as radiologists to ensure that only spurious artifacts are erased rather than important anatomy.

4.5. ROBUST REPRESENTATION LEARNING OF POINT CLOUD DATA

We now show that we can obtain useful representations for non-Euclidean data from our proposed autoencoders, especially when noise exists in the data. We consider point cloud data in this case study. A point cloud in $\mathbb{R}^D$ is a set of $n$ points in $\mathbb{R}^D$, represented as $X = \{x_1, \ldots, x_n \mid x_i \in \mathbb{R}^D\}$. Recently, a statistical manifold framework for point cloud data was proposed in Lee et al. (2022), who observed that reflecting this geometry when training autoencoders yields better representations than the vector space counterparts. (See Appendix D.4.1 for more details.) Based on this choice of geometry, we train GDAE for point cloud data with $n = 2{,}048$ and $D = 3$ (hence of dimension $nD = 6{,}144$). We reflect the Fisher information metric proposed in Lee et al. (2022) in the data corruption process for GDAE and use the modified Chamfer loss as the reconstruction error for point cloud data. For comparison purposes, we also consider DAE trained on the vector representations of point cloud data. We use the FcNet (Yang et al., 2018) as the structure for the reconstruction functions with a latent space dimension of 512. Further details are provided in Appendix D.4.

To verify the usefulness of the representations obtained from our trained autoencoders, we utilize them as features to train classifiers in the transfer learning setting based on some benchmark data sets following Yang et al. (2018). More specifically, we train autoencoders using the ShapeNet data set (Chang et al., 2015) and obtain representations for the ModelNet data set (Wu et al., 2015). We then train a linear SVM using the representations and measure the transfer classification accuracy. We inject noise of varying levels into the data sets to verify whether the obtained representations are robust to noise. More experimental details are explained in Appendix J. The experimental results are in Table 5.
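A point cloud is a set, so a reconstruction error for it should not depend on the order in which the points are stored. The sketch below shows the standard symmetric Chamfer distance as an illustration of this property; it is not the modified Chamfer loss used in the experiments, whose exact form is given in the appendix, and the point cloud here is random rather than taken from ShapeNet or ModelNet.

```python
import numpy as np

def chamfer_distance(X, Y):
    """Standard symmetric Chamfer distance between point clouds X (n x D) and Y (k x D):
    mean squared distance from each point to its nearest neighbor in the other cloud,
    summed over both directions."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))
shuffled = cloud[rng.permutation(len(cloud))]    # same set of points, different storage order

set_distance = chamfer_distance(cloud, shuffled)                   # treats the clouds as sets
vector_distance = np.linalg.norm(cloud.ravel() - shuffled.ravel()) # treats them as flat vectors
```

The set-based distance is zero for the reordered cloud while the flattened vector distance is not, which is one way to see why a vector representation of a point cloud carries an arbitrary choice that the geometry should not depend on.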
The features obtained from GDAE lead to a better transfer classification accuracy than those from the vanilla autoencoder (AE) or DAE; this tendency gets stronger as the noise level increases. Our approach also shows comparable or better performances compared to another regularization method ('AE + R.' in Table 5 ) considered in Lee et al. (2022) , which tries to match the pull-back metric of the Fisher information metric (via the decoder mapping) to the identity. Combining these two regularization methods ('GDAE + R.' in Table 5 ) shows the best overall transfer classification accuracy, demonstrating the usefulness of our geometric regularization methods in obtaining better representations of non-Euclidean data for downstream tasks.

5. CONCLUSION

In this paper, we have introduced geometrically regularized autoencoders for non-Euclidean data. By constructing regularization terms in a coordinate-invariant way, we have developed two types of geometric autoencoders, the geometric reconstruction contractive autoencoder and the geometric denoising autoencoder. These autoencoders are effective in estimating the derivative of the log of the probability density of non-Euclidean data and have been successfully applied to several applications such as data sampling, mode-seeking, and representation learning tasks involving real-world non-Euclidean data sets. Although training these models can be computationally demanding and may involve numerical issues when reflecting the geometry of the data, our experiments show that our approach can be a viable option for handling high-dimensional and complex non-Euclidean data. In the future, it would be worth investigating more efficient and robust methods for reflecting the geometry during training. Additionally, the idea of using geometric regularizations could be applied to other machine learning problems that involve non-Euclidean data.

APPENDIX A MATHEMATICAL BACKGROUNDS

In this section, we briefly review some notions related to differentiable manifolds and Riemannian geometry. For further mathematical details on differentiable manifolds and Riemannian geometry, we refer the reader to Boothby (1986) and Dubrovin et al. (1992). Intuitively, an $m$-dimensional differentiable manifold $\mathcal{M}$ is a space which is locally diffeomorphic to $m$-dimensional Euclidean space. For every point $p \in \mathcal{M}$, there exists a coordinate chart $(U, x)$, where $U$ is an open subset of $\mathcal{M}$ containing $p$, and $x$ is a homeomorphism of $U$ to an open subset of $\mathbb{R}^m$. Applying $x$ to $p$ gives the $m$ coordinates of $p$, i.e., $x(p) = (x^1(p), \ldots, x^m(p)) \in \mathbb{R}^m$; each $x^i$ is a real-valued function on $U$, the $i$-th coordinate function. Here $x$ is called the local coordinates; note that other choices of local coordinates are also possible (e.g., for the sphere, spherical coordinates and stereographic projection are two different choices of local coordinates).

A differentiable manifold $\mathcal{M}$ endowed with a Riemannian metric is called a Riemannian manifold. The Riemannian metric is a function defined on the manifold $\mathcal{M}$ that assigns to each point $p \in \mathcal{M}$ a bilinear mapping $\Phi_p : T_p\mathcal{M} \times T_p\mathcal{M} \to \mathbb{R}$, where $T_p\mathcal{M}$ denotes the tangent space to $\mathcal{M}$ at $p$. Using the local coordinates $x = (x^1, \ldots, x^m)$, the Riemannian metric can be expressed as $\Phi = \sum_{i=1}^{m} \sum_{j=1}^{m} g_{ij}(x) \, dx^i dx^j$ or $ds^2 = \sum_{i=1}^{m} \sum_{j=1}^{m} g_{ij}(x) \, dx^i dx^j$. Here $g_{ij}(x)$ is assumed to be smooth, i.e., infinitely differentiable, and its matrix representation $G = (g_{ij}) \in \mathbb{R}^{m \times m}$ is symmetric positive-definite.

The Riemannian metric allows one to calculate lengths, angles, volumes, and even define a distance metric on differentiable manifolds in an intrinsic way, i.e., in a way that is invariant to the choice of local coordinates. The length of a curve $C = \{\mathrm{x}(t) \in \mathcal{M} \mid t \in [0, 1]\}$ is calculated as

$$\mathrm{Length}(C) = \int_0^1 \sqrt{\dot{x}(t)^{\top} G(x(t)) \, \dot{x}(t)} \; dt, \tag{10}$$

where $x(t) \in \mathbb{R}^m$ is the local coordinate representation of $\mathrm{x}(t)$ (see Figure 7).
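The curve length integral can be evaluated numerically once the metric is available in local coordinates. The sketch below does this for the unit sphere in spherical coordinates $(\theta, \phi)$ with $G = \mathrm{diag}(1, \sin^2\theta)$; the function names, the finite-difference velocity, and the trapezoidal quadrature are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def sphere_metric(x):
    """Metric of the unit sphere S^2 in spherical coordinates x = (theta, phi)."""
    return np.diag([1.0, np.sin(x[0]) ** 2])

def curve_length(curve, metric, num=2001):
    """Approximate the integral over [0, 1] of sqrt(xdot^T G(x) xdot) for a curve in local coordinates."""
    t = np.linspace(0.0, 1.0, num)
    x = np.array([curve(ti) for ti in t])
    xdot = np.gradient(x, t, axis=0)              # finite-difference velocity
    speed = np.array([np.sqrt(v @ metric(p) @ v) for p, v in zip(x, xdot)])
    return np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(t))  # trapezoidal rule

def latitude_circle(t):
    # Full circle of constant polar angle theta = pi / 3.
    return np.array([np.pi / 3, 2.0 * np.pi * t])

length = curve_length(latitude_circle, sphere_metric)
```

The coordinate curve has Euclidean length $2\pi$ in the $(\theta, \phi)$ chart, but the metric shortens it to $2\pi \sin(\pi/3)$, the circumference of the corresponding circle on the sphere.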
Given two fixed boundary points $x(0), x(1) \in \mathcal{M}$, the curves that minimize the length (10) are called the minimal geodesics, and the corresponding lengths the minimal geodesic distances. The equations for geodesics are obtained as

$$\frac{d^2 x^k}{dt^2} + \sum_{i=1}^{m} \sum_{j=1}^{m} \Gamma^k_{ij} \frac{dx^i}{dt} \frac{dx^j}{dt} = 0, \quad k = 1, \ldots, m,$$

where $\Gamma^k_{ij}$ denote the Christoffel symbols of the second kind in $\mathcal{M}$, i.e.,

$$\Gamma^k_{ij} = \sum_{l=1}^{m} \frac{1}{2} g^{kl} \left( \frac{\partial g_{li}}{\partial x^j} + \frac{\partial g_{lj}}{\partial x^i} - \frac{\partial g_{ij}}{\partial x^l} \right),$$

and $g^{kl}$ is the $(k, l)$ entry of $G^{-1} \in \mathbb{R}^{m \times m}$.

Figure 8: A geodesic curve $x : [0, 1] \to \mathcal{M}$ emanating from $x(0) \in \mathcal{M}$ along $v \in T_{x(0)}\mathcal{M}$.

Consider a geodesic curve $x : [0, 1] \to \mathcal{M}$ emanating from a point $x(0) \in \mathcal{M}$ with an initial velocity vector $\dot{x}(0) = v \in T_{x(0)}\mathcal{M}$, as shown in Figure 8. It is known that such a geodesic is unique if it exists (Boothby, 1986); we denote by $x(1) \in \mathcal{M}$ the endpoint of the geodesic after it propagates for unit time. The mapping that sends the initial velocity vector $v$ to the point $x(1)$ is called the exponential map $\mathrm{Exp}_{x(0)} : T_{x(0)}\mathcal{M} \to \mathcal{M}$. Note that the distance between $x(0)$ and $x(1)$ is the same as the norm of $v$. This generalizes the vector space operation of extending a line from a starting point along a vector.

On Riemannian manifolds, there exists a (natural) volume element induced from the Riemannian metric $G(x)$, which is expressed in local coordinates as $\sqrt{\det G(x)}\, dx^1 \cdots dx^m$. The volume of a compact subset $\mathcal{V} \subseteq \mathcal{M}$ is then obtained by the following integral:

$$\mathrm{Volume}(\mathcal{V}) = \int_{V} \sqrt{\det G(x)}\; dx^1 \cdots dx^m,$$

where $V$ denotes the domain of integration expressed in local coordinates (see footnote 4 and Figure 7). The integral of a bounded and continuous function $f : \mathcal{M} \to \mathbb{R}$ over the integration domain $\mathcal{V}$ is also obtained using the volume element as follows:

$$\int_{V} f(x) \sqrt{\det G(x)}\; dx^1 \cdots dx^m.$$

B FURTHER EXPLANATIONS OF THE GEOMETRIC REGULARIZATION COMPONENTS

B.1 CONTRACTIVE REGULARIZATION

To see why (4) is coordinate-invariant, observe that under a pair of local coordinate transformations $x \mapsto x' = \phi(x)$ and $y \mapsto y' = \psi(y)$, the quantities $G(x) = (g_{ij}(x)) \in \mathbb{R}^{m \times m}$, $H(y) = (h_{\alpha\beta}(y)) \in \mathbb{R}^{n \times n}$, and $J(x) = \left( \frac{\partial f^i}{\partial x^j}(x) \right) \in \mathbb{R}^{n \times m}$ transform according to the following rules (Dubrovin et al., 1992):

(i) $G \mapsto G' = \Phi^{-\top} G \Phi^{-1}$, where $\Phi = \frac{\partial \phi}{\partial x} \in \mathbb{R}^{m \times m}$;
(ii) $H \mapsto H' = \Psi^{-\top} H \Psi^{-1}$, where $\Psi = \frac{\partial \psi}{\partial y} \in \mathbb{R}^{n \times n}$;
(iii) $J \mapsto J' = \Psi J \Phi^{-1}$ (see footnote 5),

from which it can be verified that $\mathrm{Tr}\left( J^\top H J G^{-1} \right)$ remains the same.

Also note that minimizing (4) induces the contraction (shrinking without distortion) of the mapping $f$. As an extreme case without any boundary conditions or constraints, trivial solutions are obtained as $J = 0$ or equivalently $f = \text{constant}$, which is an extreme contraction. On the other hand, provided the boundary conditions or constraints for $f$ are well-specified, a useful physical analogy for minimizing (4) is to imagine wrapping a curved object made of marble ($\mathcal{N}$) by an elastic sheet ($\mathcal{M}$); harmonic maps, which are extrema of (4), can be viewed as solutions corresponding to elastic equilibria (Eells & Sampson, 1964).
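The invariance follows from a one-line substitution of rules (i)-(iii): $J'^\top H' J' G'^{-1} = \Phi^{-\top} \left( J^\top H J G^{-1} \right) \Phi^{\top}$, and the trace is unchanged under this similarity transform. The sketch below checks this numerically for $m = n = 2$. The matrices are arbitrary values chosen for the example ($G$ and $H$ symmetric positive-definite, $\Phi$ and $\Psi$ invertible), and the helper names are assumptions, not part of the paper.

```python
def matmul(A, B):
    """Product of two small dense matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def contractive_term(J, H, G):
    """Tr(J^T H J G^{-1}) for 2x2 inputs."""
    M = matmul(matmul(matmul(transpose(J), H), J), inv2(G))
    return M[0][0] + M[1][1]

# Arbitrary example values (assumptions for illustration).
G = [[2.0, 0.5], [0.5, 1.0]]      # data space metric
H = [[1.5, -0.3], [-0.3, 0.8]]    # latent space metric
J = [[0.7, -1.2], [0.4, 2.0]]     # Jacobian of f
Phi = [[1.0, 2.0], [-0.5, 3.0]]   # Jacobian of the change of x-coordinates
Psi = [[2.0, 1.0], [1.0, 1.5]]    # Jacobian of the change of y-coordinates

Phi_inv, Psi_inv = inv2(Phi), inv2(Psi)
G_new = matmul(matmul(transpose(Phi_inv), G), Phi_inv)   # rule (i)
H_new = matmul(matmul(transpose(Psi_inv), H), Psi_inv)   # rule (ii)
J_new = matmul(matmul(Psi, J), Phi_inv)                  # rule (iii)

before = contractive_term(J, H, G)
after = contractive_term(J_new, H_new, G_new)
```

The two values `before` and `after` should agree up to floating-point error.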

B.2 SOME REMARKS ON NON-EUCLIDEAN LATENT SPACES

Recently, to better capture the structure of data distributions, a growing number of works have considered autoencoders with non-Euclidean latent spaces, e.g., hyperspheres (Davidson et al., 2018; Xu & Durrett, 2018), hyperbolic spaces or Poincaré balls (Mathieu et al., 2019), their mixtures (Skopek et al., 2019), or submanifolds embedded in Euclidean spaces (Rey et al., 2019). In the case of such non-Euclidean latent spaces, by reflecting the coordinate-invariant contractive regularization discussed in Section 3.1, we can formulate the geometric contractive autoencoder (GCAE) as follows:

$$\min_{r = g \circ f} \int_{\mathcal{M}} \left[ \mathrm{dist}(r(x), x)^2 + \sigma^2\, \mathrm{Tr}\left( \frac{\partial f}{\partial x}^\top H(f(x)) \frac{\partial f}{\partial x} G^{-1} \right) \right] \rho(x)\, dx, \tag{15}$$

where the reconstruction map $r : \mathcal{M} \to \mathcal{M}$, the encoder mapping $f : \mathcal{M} \to \mathcal{N}$, and the decoder mapping $g : \mathcal{N} \to \mathcal{M}$ are respectively expressed in local coordinates as $r : \mathbb{R}^m \to \mathbb{R}^m$, $f : \mathbb{R}^m \to \mathbb{R}^n$, and $g : \mathbb{R}^n \to \mathbb{R}^m$. Training the GCAE may induce additional regularization effects that better capture data-concentrated regions or make the model more robust to noise, in addition to the effect of using non-Euclidean latent spaces. Confirming these experimentally is left for future work.

Footnote 4: We may also use $\mathcal{V}$ rather than $V$ to denote the domain of integration for notational simplicity.

Footnote 5: We can easily verify (i) and (ii) by observing that the Riemannian metric should remain the same under the local coordinate transform, i.e., $ds^2 = [dx]^\top G [dx] = [dx']^\top G' [dx']$, where $[dx] = [dx^1, \ldots, dx^m]^\top$ and $[dx'] = [dx'^1, \ldots, dx'^m]^\top$ are related by $[dx] = \frac{\partial x}{\partial x'} [dx'] = \Phi^{-1} [dx']$. We can verify (iii) by considering the chain rule $\frac{\partial (y' \circ f)}{\partial x'} = \frac{\partial y'}{\partial y} \frac{\partial f}{\partial x} \frac{\partial x}{\partial x'} = \Psi \frac{\partial f}{\partial x} \Phi^{-1}$.

As another choice of the Riemannian metric for the latent space, we can consider the pullback of the data space metric via the decoder mapping (Arvanitidis et al., 2018). It has been observed that applying this metric can better characterize the data distances in latent space and provide more meaningful results in analyzing latent representations. Adopting this metric on the latent space reveals an interesting connection between the GCAE in (15) and the GRCAE in (5). This pullback metric is obtained as

$$H(y) = \frac{\partial g}{\partial y}^\top G(g(y)) \frac{\partial g}{\partial y} \in \mathbb{R}^{n \times n}, \tag{16}$$

and we can observe that (15) and (5) become identical when substituting this metric into (15) and considering the chain rule $\frac{\partial r}{\partial x} = \frac{\partial g}{\partial y} \frac{\partial f}{\partial x} \in \mathbb{R}^{m \times m}$.
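The substitution mentioned above can be written out explicitly. The following is a short worked derivation under the stated setting, evaluating the pullback metric at $y = f(x)$ and using $r = g \circ f$; the result is a contractive term of the form $\mathrm{Tr}\left( \frac{\partial r}{\partial x}^\top G(r) \frac{\partial r}{\partial x} G^{-1} \right)$, the same form that appears in Lemma 2.

```latex
\begin{aligned}
\mathrm{Tr}\!\left( \frac{\partial f}{\partial x}^{\top} H(f(x))\, \frac{\partial f}{\partial x}\, G^{-1} \right)
&= \mathrm{Tr}\!\left( \frac{\partial f}{\partial x}^{\top} \frac{\partial g}{\partial y}^{\top} G\big(g(f(x))\big)\, \frac{\partial g}{\partial y} \frac{\partial f}{\partial x}\, G^{-1} \right) \\
&= \mathrm{Tr}\!\left( \left( \frac{\partial g}{\partial y} \frac{\partial f}{\partial x} \right)^{\!\top} G\big(r(x)\big) \left( \frac{\partial g}{\partial y} \frac{\partial f}{\partial x} \right) G^{-1} \right) \\
&= \mathrm{Tr}\!\left( \frac{\partial r}{\partial x}^{\top} G\big(r(x)\big)\, \frac{\partial r}{\partial x}\, G^{-1} \right).
\end{aligned}
```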

B.3 DATA CORRUPTION PROCESS

The corruption process $\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)$ in vector space is equivalent to setting $\tilde{x} = x + \epsilon$ for $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, i.e., $\tilde{x}$ is the endpoint of a line segment starting from $x$ and extended according to the vector $\epsilon$. On Riemannian manifolds, a similar construction is possible using the notions of the geodesic and the exponential map discussed in Appendix A. Therefore, we corrupt an input $x \in \mathcal{M}$ by applying the exponential map $\mathrm{Exp}_x : T_x\mathcal{M} \to \mathcal{M}$ to a tangent vector $v \in T_x\mathcal{M}$ sampled from an isotropic zero-mean multivariate Gaussian, i.e., $v = v^1 E_1 + \cdots + v^m E_m$ for an orthonormal basis $E_1, \ldots, E_m$ of $T_x\mathcal{M}$ and an $m$-dimensional vector $(v^1, \ldots, v^m) \sim \mathcal{N}(0, \sigma^2 I)$. Because the Gaussian is isotropic, the corruption process $\tilde{x} = \mathrm{Exp}_x(v)$ does not depend on which orthonormal basis $E_1, \ldots, E_m$ is used.

C PROOF OF THEOREM 1

C.1 THE FIRST-ORDER NECESSARY CONDITIONS FOR GRCAE

Lemma 1. Provided $\sigma^2$ is small, the derivative of the log of the probability function $\rho_g(x) = \rho(x) / \sqrt{\det G(x)}$ can be approximated for the geometric reconstruction contractive autoencoder as

$$\frac{\partial \log \rho_g(x)}{\partial x} = \frac{1}{\rho_g} \frac{\partial \rho_g}{\partial x}(x) = \frac{1}{\rho} \frac{\partial \rho}{\partial x}(x) - \Gamma(x) = G(x)\, \frac{r(x) - x}{\sigma^2} + O(\sigma^2),$$

where $\Gamma(x) \in \mathbb{R}^m$ is a vector whose $i$-th component is given by $\frac{1}{2} \mathrm{Tr}\left( G^{-1} \frac{\partial G}{\partial x^i} \right)$.

Proof. The first-order necessary conditions for (5) can be obtained from the following Euler-Lagrange equations:

$$\frac{\partial L}{\partial r^i} - \sum_{j=1}^{m} \frac{\partial}{\partial x^j} \left( \frac{\partial L}{\partial \left( \frac{\partial r^i}{\partial x^j} \right)} \right) = 0, \quad i = 1, \ldots, m, \tag{17}$$

where $L$ is the integrand of (5), with the squared geodesic distance approximated by $(r(x) - x)^\top G(x) (r(x) - x)$ provided $\sigma^2$ is small, and $r^i$ denotes the $i$-th coordinate representation of the reconstruction function $r$. Substituting $L$ into the Euler-Lagrange equations, (17) results in

$$r^i - x^i = \sigma^2 \eta^i(x, r, r', r''), \quad i = 1, \ldots, m,$$

where $r'$ denotes $\frac{\partial r}{\partial x}$, $r''$ denotes the collection of $\frac{\partial^2 r^j}{\partial x^2}$ for $j = 1, \ldots, m$, and $\eta^i$ denotes a function of $x, r, r', r''$ (see footnote 7). Assuming $r$, $G$, $G^{-1}$, $\rho$ are smooth and their higher-order derivatives are bounded, $r^i - x^i = \sigma^2 \eta^i(x, r, r', r'') = O(\sigma^2)$ holds. By iterating the relation $r^i = x^i + \sigma^2 \eta^i(x, r, r', r'') = x^i + O(\sigma^2)$, i.e., substituting this form of $r^i$ into $\eta^i(x, r, r', r'')$, the $i$-th component of $r$ is obtained for small $\sigma^2$ as follows:

$$r^i - x^i = \sigma^2 \left( \sum_{j=1}^{m} g^{ij} \left( \frac{1}{\rho} \frac{\partial \rho}{\partial x^j} - \frac{1}{2} \mathrm{Tr}\left( G^{-1} \frac{\partial G}{\partial x^j} \right) \right) \right) + O(\sigma^4) \tag{19}$$

$$\phantom{r^i - x^i} = \sigma^2 \left( \sum_{j=1}^{m} g^{ij} \frac{1}{\rho_g} \frac{\partial \rho_g}{\partial x^j} \right) + O(\sigma^4), \tag{20}$$

where $g^{ij}$ denotes the $(i, j)$ entry of $G^{-1}$, $\rho_g = \rho / \sqrt{\det G}$, and the identity $\frac{\partial \log(\det G)}{\partial x^j} = \mathrm{Tr}\left( G^{-1} \frac{\partial G}{\partial x^j} \right)$ is used to derive (20). By gathering (20) for all $i$, the first-order necessary conditions can be rewritten as

$$\frac{1}{\rho_g} \frac{\partial \rho_g}{\partial x}(x) = G(x)\, \frac{r(x) - x}{\sigma^2} + O(\sigma^2). \tag{21}$$

C.2 THE FIRST-ORDER NECESSARY CONDITIONS FOR GDAE

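To obtain the first-order necessary conditions for GDAE, the integrand of (6) is first represented in local coordinates $x = (x^1, \ldots, x^m)$ in Lemma 2.

Lemma 2. Provided $\sigma^2$ is small, $\mathbb{E}_{q(\tilde{x}|x)}\left[ \mathrm{dist}(r(\tilde{x}), x)^2 \right]$ in (6) can be approximated in local coordinates as

$$\begin{aligned} \mathbb{E}_{q(\tilde{x}|x)}\left[ \mathrm{dist}(r(\tilde{x}), x)^2 \right] = {} & (r(x) - x)^\top G(x) (r(x) - x) \\ & + \sigma^2 \Bigg[ \mathrm{Tr}\left( \frac{\partial r}{\partial x}^\top G(r) \frac{\partial r}{\partial x} G^{-1}(x) \right) + \sum_{i=1}^{m} \sum_{j=1}^{m} (r^i - x^i)\, g_{ij}\, \mathrm{Tr}\left( \frac{\partial^2 r^j}{\partial x^2} G^{-1} \right) \\ & \quad + \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} \sum_{l=1}^{m} \sum_{n=1}^{m} (r^i - x^i) \left( \Gamma_{i;jk} \frac{\partial r^j}{\partial x^l} \frac{\partial r^k}{\partial x^n} g^{ln} - g_{ij} \frac{\partial r^j}{\partial x^k} \Gamma^k_{ln} g^{ln} \right) \Bigg] + O(\sigma^4), \end{aligned} \tag{22}$$

where $r^i$ is the $i$-th component of $r$, and $g_{ij}$ and $g^{ij}$ respectively denote the $(i, j)$ entries of $G$ and $G^{-1}$. The symbols $\Gamma_{k;ij}$ and $\Gamma^k_{ij}$ respectively denote the Christoffel symbols of the first and second kind in $\mathcal{M}$.

Proof. We represent $x \in \mathcal{M}$ and $\tilde{x} \in \mathcal{M}$ in local coordinates as $x \in \mathbb{R}^m$ and $\tilde{x} \in \mathbb{R}^m$, respectively. Furthermore, at $x$, consider a nonlinear function $\phi : \mathbb{R}^m \to \mathbb{R}^m$ that maps the local coordinate representations of the points near $x$ to their representations in the normal coordinates; with this setting, $\phi(x) = 0$ holds, and we let $\tilde{u} = \phi(\tilde{x})$. Denote by $r_u : \mathbb{R}^m \to \mathbb{R}^m$ the reconstruction function represented in the normal coordinates. (The functions $r$ and $r_u$ are then related by $r_u(\tilde{u}) = \phi(r(\tilde{x}))$, and $r_u(0) = \phi(r(x))$ holds.) Since the distance between $x \in \mathcal{M}$ and $r(\tilde{x}) \in \mathcal{M}$ corresponds to the standard vector norm of the representation of $r(\tilde{x})$ in normal coordinates at $x$, $\mathrm{dist}(r(\tilde{x}), x)^2 = \| r_u(\tilde{u}) \|^2$ holds. Near the origin of the normal coordinates at $x$, $r_u^i : \mathbb{R}^m \to \mathbb{R}$ (the $i$-th component of $r_u : \mathbb{R}^m \to \mathbb{R}^m$) admits the following Taylor expansion:

$$r_u^i(\tilde{u}) = r_u^i(0) + \frac{\partial r_u^i}{\partial u} \tilde{u} + \frac{1}{2!} \tilde{u}^\top \frac{\partial^2 r_u^i}{\partial u^2} \tilde{u} + \cdots. \tag{23}$$

Footnote 7: The explicit form of $\eta^i(x, r, r', r'')$ is obtained as

$$\eta^i = \sum_{j,k,l,\alpha} g^{ij} \left[ \sum_{\beta} \left( -\frac{1}{2} \frac{\partial r^\alpha}{\partial x^k} \frac{\partial g(r)_{\alpha\beta}}{\partial r^j} \frac{\partial r^\beta}{\partial x^l} g^{kl} + \frac{\partial g(r)_{j\alpha}}{\partial r^\beta} \frac{\partial r^\beta}{\partial x^k} \frac{\partial r^\alpha}{\partial x^l} g^{kl} \right) + g(r)_{j\alpha} \left( \frac{\partial^2 r^\alpha}{\partial x^k \partial x^l} g^{kl} + \frac{\partial r^\alpha}{\partial x^l} \left( \frac{\partial g^{kl}}{\partial x^k} + g^{kl} \frac{1}{\rho} \frac{\partial \rho}{\partial x^k} \right) \right) \right],$$

where $g^{ij}$ and $g(r)_{ij}$ respectively denote the $(i, j)$ entries of $G^{-1}$ and $G(r(x))$.

Then $\| r_u(\tilde{u}) \|^2$ can be expressed as follows:

$$\| r_u(\tilde{u}) \|^2 = \sum_{i=1}^{m} r_u^i(\tilde{u})^2 \tag{24}$$

$$\phantom{\| r_u(\tilde{u}) \|^2} = \| r_u(0) \|^2 + 2\, r_u(0)^\top \frac{\partial r_u}{\partial u} \tilde{u} + \tilde{u}^\top \frac{\partial r_u}{\partial u}^\top \frac{\partial r_u}{\partial u} \tilde{u} + \sum_{i=1}^{m} r_u^i(0)\, \tilde{u}^\top \frac{\partial^2 r_u^i}{\partial u^2} \tilde{u} + \cdots, \tag{25}$$

where the terms of third and higher order (with respect to $\tilde{u}$) are omitted in (25). Consider the expectation of (25) with respect to $\tilde{u} \sim \mathcal{N}(0, \sigma^2 I)$ (this corresponds to the expectation of $\mathrm{dist}(r(\tilde{x}), x)^2$ with respect to $q(\tilde{x}|x)$). Provided $\sigma^2$ is small, we obtain

$$\mathbb{E}_{q(\tilde{x}|x)}\left[ \mathrm{dist}(r(\tilde{x}), x)^2 \right] = \mathbb{E}_{q(\tilde{u})}\left[ \| r_u(\tilde{u}) \|^2 \right] = \| r_u(0) \|^2 + \sigma^2\, \mathrm{Tr}\left( \frac{\partial r_u}{\partial u}^\top \frac{\partial r_u}{\partial u} \right) + \sigma^2 \sum_{i=1}^{m} r_u^i(0)\, \mathrm{Tr}\left( \frac{\partial^2 r_u^i}{\partial u^2} \right) + O(\sigma^4), \tag{26}$$

where $q(\tilde{u})$ denotes the noise density for $\tilde{u} \sim \mathcal{N}(0, \sigma^2 I)$, and the terms of order higher than $\sigma^2$ are assumed to be negligible. We now represent (26) in local coordinates. For this purpose, the following equations are useful:

$$r_u^i(0) = \phi^i(r(x)) \approx \frac{\partial \phi^i}{\partial x} (r(x) - x), \tag{27}$$

$$\frac{\partial r_u^i}{\partial u^j} = \frac{\partial \phi^i(r(x))}{\partial u^j} = \sum_{k=1}^{m} \sum_{l=1}^{m} \frac{\partial \phi^i}{\partial r^k} \frac{\partial r^k}{\partial x^l} \frac{\partial x^l}{\partial u^j}, \tag{28}$$

$$\begin{aligned} \frac{\partial^2 r_u^i}{\partial u^j \partial u^k} = {} & \sum_{l=1}^{m} \sum_{n=1}^{m} \sum_{p=1}^{m} \sum_{q=1}^{m} \frac{\partial^2 \phi^i}{\partial r^l \partial r^n} \frac{\partial r^l}{\partial x^p} \frac{\partial r^n}{\partial x^q} \frac{\partial x^p}{\partial u^j} \frac{\partial x^q}{\partial u^k} \\ & + \sum_{l=1}^{m} \sum_{p=1}^{m} \sum_{q=1}^{m} \frac{\partial \phi^i}{\partial r^l} \frac{\partial^2 r^l}{\partial x^p \partial x^q} \frac{\partial x^p}{\partial u^j} \frac{\partial x^q}{\partial u^k} + \sum_{l=1}^{m} \sum_{n=1}^{m} \frac{\partial \phi^i}{\partial r^l} \frac{\partial r^l}{\partial x^n} \frac{\partial^2 x^n}{\partial u^j \partial u^k}, \end{aligned} \tag{29}$$

$$g_{jk} = \sum_{i=1}^{m} \frac{\partial \phi^i}{\partial x^j} \frac{\partial \phi^i}{\partial x^k}, \tag{30}$$

$$g^{jk} = \sum_{i=1}^{m} \frac{\partial x^j}{\partial u^i} \frac{\partial x^k}{\partial u^i}, \tag{31}$$

$$\Gamma_{j;kl} = \sum_{i=1}^{m} \frac{\partial \phi^i}{\partial x^j} \frac{\partial^2 \phi^i}{\partial x^k \partial x^l}, \tag{32}$$

$$\sum_{j=1}^{m} \sum_{k=1}^{m} \Gamma^i_{jk}\, g^{jk} = -\sum_{j=1}^{m} \sum_{k=1}^{m} \frac{\partial^2 x^i}{\partial u^j \partial u^k} \delta^{jk}, \tag{33}$$

where $\frac{\partial x^i}{\partial u^j}$ is the $(i, j)$ entry of $\frac{\partial x}{\partial u} = \left( \frac{\partial \phi}{\partial x} \right)^{-1} \in \mathbb{R}^{m \times m}$, and $\delta^{jk} = 1$ for $j = k$ and $\delta^{jk} = 0$ otherwise in (33). Here, the higher-order terms in (27) can be neglected for small $\sigma^2$, and (28)-(29) are obtained from the chain rule. We can derive (30)-(33) using the properties of the normal coordinates. By applying (27)-(33) to (26) and rearranging the terms, we obtain the result in (22).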

Published as a conference paper at ICLR 2023

We now provide the first-order necessary conditions for (6) using Lemma 2.

Lemma 3. Provided $\sigma^2$ is small, the derivative of the log of the probability function $\rho_g(x) = \rho(x) / \sqrt{\det G(x)}$ can be approximated for the geometric denoising autoencoder as

$$\frac{\partial \log \rho_g(x)}{\partial x} = \frac{1}{\rho_g} \frac{\partial \rho_g}{\partial x}(x) = \frac{1}{\rho} \frac{\partial \rho}{\partial x}(x) - \Gamma(x) = G(x)\, \frac{r(x) - x}{\sigma^2} + O(\sigma^2),$$

where $\Gamma(x) \in \mathbb{R}^m$ is a vector whose $i$-th component is given by $\frac{1}{2} \mathrm{Tr}\left( G^{-1} \frac{\partial G}{\partial x^i} \right)$.

Proof. The first-order necessary conditions for (6) are obtained from the following Euler-Lagrange equations:

$$\frac{\partial L}{\partial r^i} - \sum_{j=1}^{m} \frac{\partial}{\partial x^j} \left( \frac{\partial L}{\partial \left( \frac{\partial r^i}{\partial x^j} \right)} \right) + \sum_{j,k=1}^{m} \frac{\partial^2}{\partial x^j \partial x^k} \left( \frac{\partial L}{\partial \left( \frac{\partial^2 r^i}{\partial x^j \partial x^k} \right)} \right) = 0, \quad i = 1, \ldots, m, \tag{35}$$

where $L$ denotes (22) approximated according to Lemma 2. Substituting $L$ into (35) results in an equation of the form $r^i - x^i = \sigma^2 \varphi^i(x, r, r', r'')$ for $i = 1, \ldots, m$, where $\varphi^i$ is a function of $x, r, r', r''$ (see footnote 8). Assuming $r$, $G$, $G^{-1}$, $\rho$ are smooth and their higher-order derivatives are bounded, we obtain an equation identical to (20) for small $\sigma^2$ by iterating $r^i = x^i + \sigma^2 \varphi^i(x, r, r', r'') = x^i + O(\sigma^2)$. Hence the first-order necessary conditions are obtained as (21).

C.3 PROOF OF THEOREM 1

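Theorem 1. Provided $\sigma^2$ is small, the derivative of the log of the probability function $\rho_g(x) = \rho(x) / \sqrt{\det G(x)}$ can be approximated for both the GRCAE and GDAE as

$$\frac{\partial \log \rho_g(x)}{\partial x} = \frac{1}{\rho_g} \frac{\partial \rho_g}{\partial x}(x) = \frac{1}{\rho} \frac{\partial \rho}{\partial x}(x) - \Gamma(x) = G(x)\, \frac{r(x) - x}{\sigma^2} + O(\sigma^2),$$

where $\Gamma(x) \in \mathbb{R}^m$ is a vector whose $i$-th component is given by $\frac{1}{2} \mathrm{Tr}\left( G^{-1} \frac{\partial G}{\partial x^i} \right)$.

Proof. Combining Lemma 1 from Appendix C.1 and Lemma 3 from Appendix C.2 completes the proof.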

D IMPLEMENTATION OF GEOMETRICALLY REGULARIZED AUTOENCODERS

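In Appendices D.1, D.2, D.3, and D.4, we provide the implementation details of the geometrically regularized autoencoders for $S^n$, $P(n)$, $N(3)$, and point cloud data considered in our experiments, respectively. We also provide some ideas to deal with the case of manifolds that require multiple coordinate charts in Appendix D.5. We then discuss the case when the data space Riemannian metric is not known a priori in Appendix D.6.

Footnote 8: The explicit form of $\varphi^i(x, r, r', r'')$ is obtained as

$$\begin{aligned} \varphi^i = \sum_{j,k,l,\alpha} g^{ij} \Bigg[ & -\frac{1}{2} \bigg( \sum_{\beta} \Big( \frac{\partial g(r)_{\alpha\beta}}{\partial r^j} \frac{\partial r^\alpha}{\partial x^k} \frac{\partial r^\beta}{\partial x^l} g^{kl} + \Gamma_{j;\alpha\beta} \frac{\partial r^\alpha}{\partial x^k} \frac{\partial r^\beta}{\partial x^l} g^{kl} - g_{j\alpha} \frac{\partial r^\alpha}{\partial x^\beta} \Gamma^\beta_{kl} g^{kl} \Big) + g_{j\alpha} \frac{\partial^2 r^\alpha}{\partial x^k \partial x^l} g^{kl} \bigg) \\ & + \frac{1}{\rho} \bigg( \frac{\partial}{\partial x^k} \Big( g(r)_{j\alpha} \frac{\partial r^\alpha}{\partial x^l} g^{kl} \rho + \sum_{\beta} \Big( (r^\alpha - x^\alpha) \Gamma_{\alpha;j\beta} \frac{\partial r^\beta}{\partial x^l} g^{kl} \rho - \frac{1}{2} (r^\alpha - x^\alpha) g_{j\alpha} \Gamma^k_{l\beta} g^{l\beta} \rho \Big) \Big) \\ & \qquad - \frac{1}{2} \frac{\partial^2}{\partial x^k \partial x^l} \Big( g_{j\alpha} (r^\alpha - x^\alpha) g^{kl} \rho \Big) \bigg) \Bigg], \end{aligned}$$

where $g(r)_{ij}$ is the $(i, j)$ entry of $G(r(x))$.

Given an $(n+1)$-dimensional unit-norm vector $x \in S^n \subseteq \mathbb{R}^{n+1}$ and a tangent vector $v \in T_x S^n \subseteq \mathbb{R}^{n+1}$, the exponential map $\mathrm{Exp}_x : T_x S^n \to S^n$ is defined as follows:

$$\mathrm{Exp}_x(v) = \cos \|v\| \cdot x + \frac{\sin \|v\|}{\|v\|} \cdot v, \tag{37}$$

where $\| \cdot \|$ is the Euclidean norm. For the data corruption process, we first sample an $(n+1)$-dimensional vector $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and project $\epsilon$ onto $T_x S^n$ (with its origin identified with that of the ambient Euclidean space) as $v = (I - x x^\top) \epsilon$. As discussed in Section 3.2, the input $x$ is then corrupted to $\tilde{x} = \mathrm{Exp}_x(v)$ according to (37).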
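The corruption procedure on $S^n$ described above can be sketched in a few lines. This is a minimal standard-library illustration, not the authors' PyTorch implementation; the function names, the small-norm guard, and the seeded random generator are assumptions made for the example.

```python
import math
import random

def exp_map_sphere(x, v):
    """Exponential map on the unit sphere: cos(|v|) x + sin(|v|)/|v| v."""
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_v < 1e-12:
        # Assumed guard: for a vanishing tangent vector, return the base point.
        return list(x)
    return [math.cos(norm_v) * xi + math.sin(norm_v) / norm_v * vi
            for xi, vi in zip(x, v)]

def corrupt_on_sphere(x, sigma, rng):
    """Sample eps ~ N(0, sigma^2 I), project onto T_x S^n, apply Exp_x."""
    eps = [rng.gauss(0.0, sigma) for _ in x]
    dot = sum(xi * ei for xi, ei in zip(x, eps))
    v = [ei - dot * xi for xi, ei in zip(x, eps)]  # v = (I - x x^T) eps
    return exp_map_sphere(x, v)

rng = random.Random(0)
x = [0.0, 0.0, 1.0]  # a point on S^2
x_tilde = corrupt_on_sphere(x, sigma=0.1, rng=rng)
norm_x_tilde = math.sqrt(sum(c * c for c in x_tilde))
```

Because $v$ is orthogonal to $x$ and $x$ has unit norm, the corrupted point stays on the sphere, which `norm_x_tilde` can be used to confirm.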

D.1.2 THE RECONSTRUCTION ERROR

For a reconstruction function $r : S^n \to S^n$, the reconstruction error is defined as follows:

$$\mathrm{dist}(r(\tilde{x}), x)^2 = \arccos\left( r(\tilde{x})^\top x \right)^2. \tag{38}$$
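A direct transcription of this squared geodesic distance is shown below as a hedged sketch. The clamping of the inner product to $[-1, 1]$ is an added safeguard against floating-point round-off and is an assumption of this example rather than something stated in the paper.

```python
import math

def sphere_dist_sq(a, b):
    """Squared geodesic distance between unit vectors a and b on the sphere."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    # Assumed safeguard: clamp so that acos never receives a value outside [-1, 1].
    dot = max(-1.0, min(1.0, dot))
    return math.acos(dot) ** 2

north_pole = [0.0, 0.0, 1.0]
equator_point = [1.0, 0.0, 0.0]
d2 = sphere_dist_sq(north_pole, equator_point)  # a quarter of a great circle
```

Here the two points are orthogonal, so the geodesic distance is $\pi/2$ and `d2` equals $(\pi/2)^2$ up to round-off.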

D.1.3 THE RECONSTRUCTION FUNCTION

We implement our regularized autoencoder for $S^n$ as a mapping $r : S^n \to S^n$. The input and output of the mapping are $(n+1)$-dimensional unit-norm vectors. The reconstruction function $r$ is modeled by a neural network with two hidden layers and the hyperbolic tangent (Tanh) activation function as follows:

$$r(x) = \mathrm{Proj}(x + W_3 h_2 + b_3), \tag{39}$$

$$h_i = \mathrm{Tanh}(W_i h_{i-1} + b_i), \quad i = 1, 2, \tag{40}$$

where $h_1, h_2 \in \mathbb{R}^{d_h}$ denote the hidden variables, $h_0$ is set to be $x \in S^n \subseteq \mathbb{R}^{n+1}$, $\mathrm{Proj} : \mathbb{R}^{n+1} \to S^n \subseteq \mathbb{R}^{n+1}$ is the function that projects a vector onto the hypersphere by dividing the input vector by its norm, and $W_i, b_i$ for $i = 1, 2, 3$ respectively denote the matrix and vector parameters with sizes defined accordingly. We set the hidden variable dimensionality $d_h$ to 1,000 in all the experiments.

D.1.4 CONTRACTIVE REGULARIZATION

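For the case of $S^n$, the contractive term of (5) (without the $\rho(x)$ term) can be computed using $r$ as

$$\mathrm{Tr}\left( \frac{\partial r}{\partial x}^\top \frac{\partial r}{\partial x} (I - x x^\top) \right),$$

where $\frac{\partial r}{\partial x} \in \mathbb{R}^{(n+1) \times (n+1)}$ and $x \in S^n \subseteq \mathbb{R}^{n+1}$.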
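The sketch below evaluates this contractive term for an arbitrary map given as a Python function. It uses a central finite-difference Jacobian in place of automatic differentiation, which is an assumption made to keep the example dependency-free; all function names are hypothetical. As a sanity check, the map is taken to be the projection $\mathrm{Proj}$ alone, whose Jacobian at a unit vector $x$ is $I - x x^\top$, so the term reduces to $\mathrm{Tr}(I - x x^\top) = n$.

```python
import math

def proj(z):
    """Project a vector onto the unit sphere by dividing by its norm."""
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z]

def jacobian_fd(r, x, h=1e-6):
    """Central finite-difference Jacobian with entries J[i][j] = d r_i / d x_j."""
    d = len(x)
    cols = []
    for j in range(d):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        rp, rm = r(xp), r(xm)
        cols.append([(a - b) / (2.0 * h) for a, b in zip(rp, rm)])
    return [[cols[j][i] for j in range(d)] for i in range(d)]

def sphere_contractive_term(r, x):
    """Tr(J^T J (I - x x^T)) for a map r evaluated at a unit vector x."""
    d = len(x)
    J = jacobian_fd(r, x)
    P = [[(1.0 if i == j else 0.0) - x[i] * x[j] for j in range(d)] for i in range(d)]
    # Tr(J^T J P) = sum over i, j, k of J[k][i] * J[k][j] * P[j][i]
    return sum(J[k][i] * J[k][j] * P[j][i]
               for i in range(d) for j in range(d) for k in range(d))

x = [0.0, 0.0, 1.0]  # a point on S^2, so n = 2
term = sphere_contractive_term(proj, x)
```

In a training setting the Jacobian would typically come from automatic differentiation; the finite-difference version here only illustrates the formula.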

D.1.5 AN EMPIRICAL ANALYSIS OF COMPUTATIONAL TIME

Training geometrically regularized autoencoders requires more computation than training vector space autoencoders, since it involves the geodesic distance, the exponential map, and the geometric contractive term. We have experimentally measured the computational times for models that apply each of these geometric components to the autoencoder $r : \mathcal{M} \to \mathcal{M}$. We have used the PyTorch library (Paszke et al., 2017) and an NVIDIA Tesla V100 GPU with an Intel Xeon E5-2698 v4 2.2 GHz (20-core) CPU (also for most of the other experiments). We report in Table 6 the computational time spent on one gradient update iteration for each model with a batch size of 10,000. In the table, AE, GAE, GAE + G., GAE + E., and GAE + C. respectively represent the vanilla autoencoder (with the same input and output dimensions as the other models), an autoencoder reflecting the non-Euclidean input and output structure, a model applying the geodesic distance as the reconstruction error for GAE, a model applying the exponential map in the data corruption process for GAE, and a model applying the geometric regularization term for GAE. For comparison, we also consider GDAE and GRCAE. To check the time for various data dimensionalities, we consider $n = 2, 6, 10, 25, 50, 100$ for $S^n$ data generated as explained in Appendix F.4. From the table, we can observe that our geometric components scale reasonably well to high-dimensional $S^n$ data, except for the geometric contractive term, which requires the derivative of the Jacobian $\frac{\partial r}{\partial x} \in \mathbb{R}^{(n+1) \times (n+1)}$ during the computation of gradients for an update.

D.2.1 MATRIX EXPONENTIAL AND MATRIX LOGARITHM

We first provide the closed-form expressions of the matrix exponential on $S(n)$, the space of $n \times n$ symmetric matrices, and the matrix logarithm on $P(n)$. Given an eigendecomposition of a symmetric matrix $A = R D R^\top \in S(n)$, where $R$ is the matrix of eigenvectors and $D = \mathrm{diag}(d_1, \ldots, d_n)$ is the matrix of corresponding eigenvalues, the matrix exponential is obtained as

$$\mathrm{Exp}(A) = R\, \mathrm{Exp}(D)\, R^\top,$$

where $\mathrm{Exp}(D) = \mathrm{diag}(\exp(d_1), \ldots, \exp(d_n))$. Given an eigendecomposition of a symmetric positive-definite matrix $A = R D R^\top \in P(n)$ similarly to the above, the matrix logarithm is obtained as

$$\mathrm{Log}(A) = R\, \mathrm{Log}(D)\, R^\top,$$

where $\mathrm{Log}(D) = \mathrm{diag}(\log(d_1), \ldots, \log(d_n))$.

D.2.2 DATA CORRUPTION PROCESS

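Given an input $P \in P(n)$, we first sample an $\frac{n(n+1)}{2}$-dimensional vector $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. As discussed in Section 3.2, the input $P$ is then corrupted to $\tilde{P} \in P(n)$ as follows:

$$\tilde{P} = P^{\frac{1}{2}}\, \mathrm{Exp}([\epsilon])\, P^{\frac{1}{2}}, \tag{43}$$

where $P^{\frac{1}{2}} = R S^{\frac{1}{2}} R^\top$ for an eigendecomposition $P = R S R^\top$, and the bracket operator is defined for $\epsilon = (\epsilon_{11}, \ldots, \epsilon_{1n}, \epsilon_{22}, \ldots, \epsilon_{2n}, \ldots, \epsilon_{nn}) \in \mathbb{R}^{\frac{n(n+1)}{2}}$ as $([\epsilon])_{ij} = a_{ij} \epsilon_{ij}$ for $i \leq j$ and $([\epsilon])_{ij} = ([\epsilon])_{ji}$ for $i > j$, with $a_{ij} = 1$ if $i = j$ and $\frac{1}{\sqrt{2}}$ otherwise.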
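The bracket operator can be sketched as follows. The function name and the row-major upper-triangular ordering of `eps` follow the definition above; the example values are arbitrary. One consequence of the $\frac{1}{\sqrt{2}}$ scaling is that the squared Frobenius norm of $[\epsilon]$ equals the squared Euclidean norm of $\epsilon$, so isotropic noise on the vector corresponds to isotropic noise on symmetric matrices.

```python
import math

def bracket(eps, n):
    """Map a vector of length n(n+1)/2 to an n x n symmetric matrix.

    Diagonal entries are copied; off-diagonal entries are scaled by 1/sqrt(2).
    """
    assert len(eps) == n * (n + 1) // 2
    S = [[0.0] * n for _ in range(n)]
    idx = 0
    for i in range(n):
        for j in range(i, n):
            scale = 1.0 if i == j else 1.0 / math.sqrt(2.0)
            S[i][j] = scale * eps[idx]
            S[j][i] = S[i][j]
            idx += 1
    return S

eps = [1.0, 2.0, 3.0]  # ordered as (eps_11, eps_12, eps_22)
S = bracket(eps, 2)
frobenius_sq = sum(v * v for row in S for v in row)
euclidean_sq = sum(e * e for e in eps)
```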

D.2.3 THE RECONSTRUCTION ERROR

For a reconstruction function $r : P(n) \to P(n)$, the reconstruction error is defined from the affine-invariant Riemannian metric as follows (Fletcher & Joshi, 2007):

$$\mathrm{dist}(r(\tilde{P}), P)^2 = \left\| \mathrm{Log}\left( P^{-\frac{1}{2}}\, r(\tilde{P})\, P^{-\frac{1}{2}} \right) \right\|_F^2, \tag{44}$$

where $P^{-\frac{1}{2}} = R S^{-\frac{1}{2}} R^\top$ for an eigendecomposition $P = R S R^\top$, $\mathrm{Log} : P(n) \to S(n)$ is the matrix logarithm, and $\| \cdot \|_F$ is the matrix Frobenius norm.

D.2.4 THE RECONSTRUCTION FUNCTION

The reconstruction functions $r : P(n) \to P(n)$ for GDAE and GRCAE should take into account the special structure of $P(n)$. By defining a function $v : P(n) \to S(n)$, the reconstruction function at $P \in P(n)$ is modeled as $r(P) = P \ldots$ (45).

We model $v$ by a neural network with two hidden layers and the hyperbolic tangent (Tanh) activation function as follows:

$$v(P) = W_3 h_2 + b_3,$$

$$h_i = \mathrm{Tanh}(W_i h_{i-1} + b_i), \quad i = 1, 2,$$

where $h_1, h_2 \in \mathbb{R}^{d_h}$ denote the hidden variables, $h_0 \in \mathbb{R}^{\frac{n(n+1)}{2}}$ is set to the vector representation of the lower- (or upper-) triangular part of $\mathrm{Log}(P)$ with $\mathrm{Log} : P(n) \to S(n)$ as the matrix logarithm, and $W_i, b_i$ for $i = 1, 2, 3$ respectively denote the matrix and vector parameters with sizes defined accordingly. We set the dimensionality of the hidden variables $d_h$ to 1,000 for both GDAE and GRCAE in all the experiments.

D.2.5 CONTRACTIVE REGULARIZATION

For the case of $P(n)$, we can consider the upper- (or lower-) triangular part of the matrix representations as its local coordinates, i.e., $r(P) \simeq r(x)$ and $P \simeq x$, and calculate the contractive term of (5) straightforwardly.

D.2.6 AN EMPIRICAL ANALYSIS OF COMPUTATIONAL TIME

Similarly to Appendix D.1.5, we have experimentally measured the computational times for each autoencoder model for $P(n)$ data. Note that we can calculate the matrix exponential and logarithm faster and in a numerically more stable way by using approximations from Taylor expansions, assuming small $\sigma^2$ (so that the inputs to the matrix exponential and logarithm in (43)-(45) are near zero and the identity, respectively). We also precompute and reuse some quantities that frequently appear in (43)-(45), such as $P^{\frac{1}{2}}$ and $P^{-\frac{1}{2}}$. We consider two models with different degrees of approximation for GDAE, namely GDAE-v1 and GDAE-v2. GDAE-v1 applies in (43)-(45) the second-order Taylor expansion of the matrix exponential and logarithm, i.e., $\mathrm{Exp}(v) \approx I + v + \frac{1}{2} v^2$ and $\mathrm{Log}(M) \approx (M - I) - \frac{1}{2} (M - I)^2$. GDAE-v2 additionally applies the first-order Taylor expansion near $P$ for $\tilde{P}^{\frac{1}{2}}$ and $\mathrm{Log}(\tilde{P})$, which are required to calculate $r(\tilde{P})$ based on (45), so that it can avoid performing additional eigenvalue decompositions during training. We report in Table 7 the computational time for one gradient update iteration for each model with a batch size of 5,000. To check the time for various data dimensionalities, we consider $n = 2, 3, 4, 7, 10, 14$ for $P(n)$ data (with dimensionality $d = \frac{n(n+1)}{2} = 3, 6, 10, 28, 55, 105$) generated as explained in Appendix F.1. From the table, we can observe that the computational time of our geometric components grows more slowly than linearly in $d$ for high-dimensional $P(n)$ data, except for the geometric contractive term, whose computational time increases at a nearly quadratic rate in $d$. Also note that although GDAE-v2 is faster when $n = 2, 3$, it becomes slower than GDAE-v1 as the data dimensionality increases, due to the heavy matrix multiplications required in the approximation. We use GDAE-v1 for the other experiments with $P(n)$ data.
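To give a feel for the accuracy of the second-order approximations used by GDAE-v1, the sketch below applies $\mathrm{Exp}(v) \approx I + v + \frac{1}{2} v^2$ followed by $\mathrm{Log}(M) \approx (M - I) - \frac{1}{2}(M - I)^2$ to a small symmetric matrix and measures how closely the round trip recovers $v$. The helper names and the test matrix are assumptions made for this example; the leading round-trip error is of third order in $v$.

```python
def matmul(A, B):
    """Product of two small dense matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def exp_taylor2(V):
    """Second-order approximation Exp(V) ~ I + V + V^2 / 2."""
    n = len(V)
    V2 = matmul(V, V)
    I = identity(n)
    return [[I[i][j] + V[i][j] + 0.5 * V2[i][j] for j in range(n)] for i in range(n)]

def log_taylor2(M):
    """Second-order approximation Log(M) ~ (M - I) - (M - I)^2 / 2."""
    n = len(M)
    I = identity(n)
    E = [[M[i][j] - I[i][j] for j in range(n)] for i in range(n)]
    E2 = matmul(E, E)
    return [[E[i][j] - 0.5 * E2[i][j] for j in range(n)] for i in range(n)]

V = [[0.01, 0.02], [0.02, -0.01]]  # small symmetric input (assumed example)
V_back = log_taylor2(exp_taylor2(V))
max_err = max(abs(V_back[i][j] - V[i][j]) for i in range(2) for j in range(2))
```

For entries of size about $10^{-2}$, `max_err` is on the order of $10^{-6}$, which illustrates why the approximation is reasonable when $\sigma^2$ is small.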
We model $r_p$ and $v$ by neural networks with two hidden layers and the hyperbolic tangent (Tanh) activation function as follows:

$$r_p(x, P) = x + W_3 h_2 + b_3, \tag{54}$$

$$v(x, P) = W_4 h_2 + b_4, \tag{55}$$

$$h_i = \mathrm{Tanh}(W_i h_{i-1} + b_i), \quad i = 1, 2,$$

where $h_1, h_2 \in \mathbb{R}^{d_h}$ denote the hidden variables, and $h_0$ is set to be $(x, p) \in \mathbb{R}^9$ with $p \in \mathbb{R}^6$ as the vector representation of the lower- (or upper-) triangular part of $P$. Moreover, $W_i, b_i$ for $i = 1, 2, 3, 4$ respectively denote the matrix and vector parameters with sizes defined accordingly. We set the dimensionality of the hidden variables $d_h$ to 1,000.

A statistical manifold framework has been suggested to deal with point cloud data in Lee et al. (2022). Briefly speaking, they regard each point in a point cloud as a sample from a specific parametric probability density model and identify the data with the density. The Fisher information metric for these density models can then serve as a natural Riemannian metric on the point cloud data space. For point cloud data in $\mathbb{R}^D$ represented as $X = \{x_1, \ldots, x_n \mid x_i \in \mathbb{R}^D\}$, using $X$ itself as the parameter, the following parametric probability density model is considered:

$$p(x; X) = \frac{1}{n \sqrt{|\Sigma|}} \sum_{i=1}^{n} K\left( \Sigma^{-\frac{1}{2}} (x - x_i) \right),$$

where $K : \mathbb{R}^D \to \mathbb{R}$ is a kernel function satisfying $\int_{\mathbb{R}^D} K(u)\, du = 1$, and $\Sigma \in \mathbb{R}^{D \times D}$ is a symmetric positive-definite bandwidth matrix. In the experiments, we choose the Gaussian kernel function $K(u) = \frac{1}{\sqrt{(2\pi)^D}} \exp(-u^\top u / 2)$ with $\Sigma = \sigma_p^2 I$ following Lee et al. (2022). With such a choice of density model, the Fisher information metric $G \in \mathbb{R}^{nD \times nD}$ is obtained as

$$G_{ijkl} = \int p(x; X)\, \frac{\partial \log p(x; X)}{\partial X_{ij}} \frac{\partial \log p(x; X)}{\partial X_{kl}}\, dx \tag{58}$$

for $i, k = 1, \ldots, n$ and $j, l = 1, \ldots, D$, where $G_{ijkl}$ is the $((i-1)D + j, (k-1)D + l)$ entry of $G$ and $X_{ij}$ is the $j$-th entry of $x_i$. We approximate the integration in (58) as a finite sum of the integrands (with $p(x; X)$ excluded) over the data points $x_1, \ldots, x_n$. We refer the reader to Lee et al. (2022) for more details.

D.4.2 DATA CORRUPTION PROCESS

Since applying the exponential map based on the Riemannian metric in (58) is computationally very expensive, we resort to an approximation for the data corruption process. Given a vector representation of the point cloud data $x = (x_1^\top, \ldots, x_n^\top) \in \mathbb{R}^{nD}$, we sample a random vector of the same size $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and corrupt the data as $\tilde{x} \approx x + \sqrt{G^{-1}}\, \epsilon$, where $G^{-1} \in \mathbb{R}^{nD \times nD}$ is the inverse of the Riemannian metric $G$ in (58). We further approximate $G$ by a diagonal matrix, keeping only the diagonal elements of (58), so that $\sqrt{G^{-1}}$ is computationally tractable.

D.4.3 THE RECONSTRUCTION ERROR

For a reconstruction function $r : \mathbb{R}^{n \times D} \to \mathbb{R}^{n \times D}$, consider a point cloud $X = \{x_1, \ldots, x_n \mid x_i \in \mathbb{R}^D\}$ and its reconstructed version $r(\tilde{X}) = Y = \{y_1, \ldots, y_n \mid y_i \in \mathbb{R}^D\}$. We define the reconstruction error by slightly modifying the Chamfer distance in Yang et al. (2018) as follows (Lee et al., 2022):

$$\text{Reconstruction error} = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \|x - y\|^2 + \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X} \|x - y\|^2, \tag{59}$$

where $|X|$ denotes the number of elements in the set $X$. Note that using the reconstruction error in (59) does not perfectly correspond to our original definition of GDAE in (6), which uses the geodesic distance as the reconstruction error. However, with a slight abuse of terminology, we still use the term GDAE for the method minimizing the reconstruction error defined in (59).
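A brute-force version of this Chamfer-style reconstruction error is sketched below. It reads the norm term as the squared Euclidean distance, which is an assumption of this example, and it uses an $O(|X||Y|)$ double loop that is only suitable for small point clouds; the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def chamfer_reconstruction_error(X, Y):
    """Symmetric sum of mean nearest-neighbour squared distances between X and Y."""
    x_to_y = sum(min(sq_dist(x, y) for y in Y) for x in X) / len(X)
    y_to_x = sum(min(sq_dist(x, y) for x in X) for y in Y) / len(Y)
    return x_to_y + y_to_x

X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Y = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
error = chamfer_reconstruction_error(X, Y)
```

In this toy case every point of `X` has an exact match in `Y`, while one point of `Y` lies at squared distance 1 from its nearest neighbour in `X`, so the error is $0 + \frac{1}{3}$.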

D.4.4 THE RECONSTRUCTION FUNCTION

We use exactly the same point cloud autoencoder as the FcNet in Yang et al. (2018), with the reconstruction function $r : \mathbb{R}^{n \times D} \to \mathbb{R}^{n \times D}$ given by the composition of the encoder $f : \mathbb{R}^{n \times D} \to \mathbb{R}^d$ and the decoder $g : \mathbb{R}^d \to \mathbb{R}^{n \times D}$. We set the latent space dimensionality $d$ to 512.

D.5 A REMARK ON IMPLEMENTATION FOR MANIFOLDS THAT REQUIRE MULTIPLE COORDINATE CHARTS

Since the manifolds mainly considered in the experiments, e.g., $S^n$, $P(n)$, $N(n)$, and the statistical manifold for point cloud data, can be embedded in Euclidean spaces, we have directly used the manifold elements embedded in Euclidean spaces as the input and output of $r : \mathcal{M} \to \mathcal{M}$. If this is not the case, e.g., for abstract manifolds, we can implement the mappings $r : \mathbb{R}^m \to \mathbb{R}^m$ represented in local coordinates. When multiple charts $\{(U_1, x_1), \ldots, (U_C, x_C)\}$ are required, one available approach would be to implement the mapping for each chart as a separate neural network and train each mapping by minimizing objective functions weighted by appropriate weight functions. In the case of GDAE, for example, we can solve $C$ optimization problems as follows:

$$\min_{r_i} \int_{\mathcal{M}} \mathbb{E}_{q(\tilde{x}|x)}\left[ \mathrm{dist}(r_i(\tilde{x}), x)^2 \right] \rho(x) f_i(x)\, dx, \quad i = 1, \ldots, C, \tag{60}$$

where $C$ is the number of charts, $r_i : \mathbb{R}^m \to \mathbb{R}^m$ is the mapping for the $i$-th chart (i.e., $r_i : U_i \to \mathcal{M}$ represented in local coordinates), and $f_i : \mathcal{M} \to \mathbb{R}_{\geq 0}$ is the weight function for the $i$-th chart satisfying $f_i(x) = 0$ for $x \notin U_i$, e.g., determined based on partitions of unity (satisfying that, for all $x \in \mathcal{M}$, there is a neighborhood of $x$ where all but a finite number of the $f_i(x)$ are zero, and $\sum_{i=1}^{C} f_i(x) = 1$). The final reconstruction results for $x \in \mathcal{M}$ can be obtained by geometrically averaging (after appropriate coordinate transformations) the outputs $r_i(x)$ from each mapping with the corresponding weights $f_i(x)$. Note that there is no guarantee that $r_i(x_i(U_i)) \subseteq x_i(U_i)$ (or $r_i(U_i) \subseteq U_i$) would always hold. That is, $r_i(x)$ may be outside of $U_i$, and hence $r_i(x)$ and $x$ may not be in the same coordinate chart for some $x \in U_i$. For small $\sigma^2$, such a phenomenon (if it happens) would mostly occur at $x \in U_i$ near the boundary of $U_i$, since $r_i$ is trained so that $r_i(x)$ is close enough to $x$, i.e., $r_i(x) - x = O(\sigma^2)$, according to Theorem 1.
We can eliminate any undesirable effect that this phenomenon may have on the optimization of the objective function in (60) or on the weighted average of the final results by choosing a weight function f i that has the value of zero in a sufficiently wide region within U i that includes the boundary of U i . As a side effect, this choice would also assign relatively large weights for the inner regions of U i far from the boundary of U i hence inducing a larger regularization effect (e.g., denoising or contraction) in those regions. Therefore, this may lead to learning the reconstruction function to direct toward the inner regions of U i , which helps to prevent the range of r i (or r i ) from going outside of x i (U i ) (or U i ).

D.6 A REMARK ON THE CASE WHEN THE DATA SPACE RIEMANNIAN METRIC IS NOT KNOWN a priori

When the Riemannian metric for the data space $\mathbb{R}^D$ is not known a priori, we can construct the Riemannian metric by resorting to metric learning techniques that either utilize some supervision on the desired distance for a given task or reflect some intuition about the data manifold. For example, in Hauberg et al. (2012), given a set of centers $\{c_1, \ldots, c_K\}$, $c_k \in \mathbb{R}^D$, $k = 1, \ldots, K$, and a set of metrics $\{G_1, \ldots, G_K\}$, $G_k \in \mathbb{R}^{D \times D}$, $k = 1, \ldots, K$, corresponding to each center, a smoothly varying Riemannian metric $G : \mathbb{R}^D \to \mathbb{R}^{D \times D}$ is constructed by interpolating the metrics as follows:

$$G(x) = \sum_{k=1}^{K} w_k(x)\, G_k,$$

where $w_k(x) = \frac{\tilde{w}_k(x)}{\sum_{i=1}^{K} \tilde{w}_i(x)}$, $\tilde{w}_k(x) = \exp\left( -\frac{1}{2h} (x - c_k)^\top G_k (x - c_k) \right)$, and $h > 0$ is a bandwidth parameter. Here we can obtain the set of metrics $\{G_1, \ldots, G_K\}$ from different metric learning methods, which learn task-specific distance metrics in a supervised manner (Kulis et al., 2013). In the unsupervised case, based on the intuition that the shortest path in data space should stay near the data manifold rather than pass through regions where data are sparse, the Riemannian metric $G : \mathbb{R}^D \to \mathbb{R}^{D \times D}$ can be constructed as follows:

$$G(x) = (\alpha \cdot h(x) + \epsilon)^{-1} \cdot I,$$

where $h : \mathbb{R}^D \to \mathbb{R}_{>0}$, $h(x) \to 1$ when $x$ is near the data manifold and $h(x) \to 0$ otherwise, and $\alpha, \epsilon > 0$ are scalars that control the lower and upper bound of the metric, respectively. (Some methods to construct a smoothly varying $h(x)$ are provided in Arvanitidis et al. (2021).)

Consider formulating geometrically regularized autoencoders that reflect the Riemannian metrics constructed above. The calculations of the geodesics and exponential maps in these cases usually require numerical solvers (Hauberg et al., 2012). Therefore, efficient training of these autoencoders would require some methods to approximately apply the geometric components, a detailed discussion of which is left for future work.
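The interpolated metric $G(x) = \sum_k w_k(x) G_k$ with normalized Gaussian-like weights can be sketched as below. The function name, the example centers, the example metrics, and the bandwidth value are all assumptions chosen for illustration; in practice the $G_k$ would come from a metric learning method.

```python
import math

def interpolated_metric(x, centers, metrics, h):
    """Return (G(x), weights) for the weighted interpolation of local metrics."""
    D = len(x)
    unnormalized = []
    for c, Gk in zip(centers, metrics):
        diff = [xi - ci for xi, ci in zip(x, c)]
        quad = sum(diff[i] * Gk[i][j] * diff[j] for i in range(D) for j in range(D))
        unnormalized.append(math.exp(-quad / (2.0 * h)))
    total = sum(unnormalized)
    weights = [w / total for w in unnormalized]
    G = [[sum(w * Gk[i][j] for w, Gk in zip(weights, metrics)) for j in range(D)]
         for i in range(D)]
    return G, weights

# Hypothetical setup: two centers in R^2 with isotropic local metrics.
centers = [(0.0, 0.0), (2.0, 0.0)]
metrics = [[[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 4.0]]]
G_mid, w_mid = interpolated_metric((1.0, 0.0), centers, metrics, h=1.0)
```

Since the weights are nonnegative and sum to one, $G(x)$ is a convex combination of the $G_k$ and therefore remains symmetric positive-definite whenever each $G_k$ is.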
In (a), red, green, and blue lines represent each reference frame's X-, Y-, and Z-axis, respectively. Note that the reference frames are translated to corresponding spherical coordinate origins for better visualization.

E DETAILS FOR EXPERIMENTS IN THE INTRODUCTION

For the example provided in the introduction, we train the RCAE and the GRCAE (modeled as described in Appendix D.1) on spherical data sampled from the von Mises-Fisher (vMF) distribution with a concentration parameter of 10. To see whether the trained results depend on the choice of coordinates, we consider five different spherical coordinate representations of the data for RCAE. We obtain these representations by expressing the data with respect to the different ambient space reference frames shown in Figure 9(a) and converting them to spherical coordinates. We then train an RCAE on each representation and compare, in the ambient space, the reconstruction directions obtained from each RCAE (see the footnote on data augmentation in spherical coordinates). For the GRCAEs, which use the ambient space representations, we train the models using data represented in five different reference frames (the same as those considered for RCAE) and compare the reconstruction directions from each model after transforming them to an identical reference frame. In Figure 1(b)-(c), we plot the reconstruction directions for a subsample of the data in Figure 1(a) and provide a magnified view near the zenith for better visualization.

F DETAILS FOR SECTION 4.1

F.1 SYNTHETIC DATA GENERATION

We use 10,000 data points sampled from $m$ mixtures of tangent space Gaussians on P(n) in Section 4.1. For the $i$-th mixture, we set the mean as $\mu_i = \mathrm{Exp}\left(\frac{\sqrt{d}}{2}\,\mathrm{diag}(e_i)\right) \in \mathrm{P}(n)$, where $d = \frac{n(n+1)}{2}$ is the dimension of P(n), $e_i \in \mathbb{R}^n$ is a standard vector whose $i$-th element is one, and $\mathrm{Exp} : \mathrm{S}(n) \to \mathrm{P}(n)$ is the matrix exponential. The covariance of the $i$-th mixture is represented using an orthonormal basis of $T_{\mu_i}\mathrm{P}(n)$ as $\Sigma_i = \mathrm{diag}(\sigma_1, \ldots, \sigma_d) \in \mathbb{R}^{d \times d}$, where $\sigma_k = 0.1$ if $k = 1 + (i-1)n - \frac{(i-1)(i-2)}{2}$ (for $i = 1, \ldots, n$) and $\sigma_k = 0.01$ otherwise, and the $k$-th basis of $T_{\mu_i}\mathrm{P}(n)$ is defined using the bracket operator in (43) as $\sqrt{\mu_i}\,[e_k]\,\sqrt{\mu_i} \in \mathrm{S}(n)$ with $e_k \in \mathbb{R}^d$ as a standard vector whose $k$-th element is one.
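The index rule for $\sigma_k$ can be checked with a small sketch. It assumes that the $d = n(n+1)/2$ coordinates follow the row-wise upper-triangular ordering $(1,1), (1,2), \ldots, (1,n), (2,2), \ldots, (n,n)$, which is the ordering used for the bracket operator in Appendix D.3.2; under that assumption, $k = 1 + (i-1)n - (i-1)(i-2)/2$ is the position of the $i$-th diagonal entry. The function names are illustrative only.

```python
def diagonal_index(i, n):
    """1-based position of entry (i, i) in the row-wise upper-triangular ordering."""
    # (i - 1) * (i - 2) is always even, so integer division is exact.
    return 1 + (i - 1) * n - (i - 1) * (i - 2) // 2


def upper_triangular_order(n):
    """List the 1-based index pairs (row, col) with row <= col, row by row."""
    return [(r, c) for r in range(1, n + 1) for c in range(r, n + 1)]


def sigma_values(i, n):
    """Diagonal of Sigma_i: 0.1 at the i-th diagonal direction, 0.01 elsewhere."""
    d = n * (n + 1) // 2
    k_star = diagonal_index(i, n)
    return [0.1 if k == k_star else 0.01 for k in range(1, d + 1)]
```

For example, with $n = 3$ the ordering is $(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)$, so the second mixture places its large variance on the fourth coordinate.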

F.2 DEFINITION OF THE ESTIMATION ERROR

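For a given data set $\{x_1, \ldots, x_N\}$ represented in local coordinates of P(n), the log-density gradient estimation error (Est. error) is evaluated as follows:

$$\text{Est. error} = \frac{1}{N}\sum_{i=1}^{N}\left(\left(\frac{\partial \log \rho_g}{\partial x}\right)_{\mathrm{est}}(x_i) - \frac{\partial \log \rho_g}{\partial x}(x_i)\right)^{\top} G^{-1}(x_i)\left(\left(\frac{\partial \log \rho_g}{\partial x}\right)_{\mathrm{est}}(x_i) - \frac{\partial \log \rho_g}{\partial x}(x_i)\right), \qquad (63)$$

where the subscript "est" denotes the estimated value. Since the true derivative is generally unknown, the cross term involving it can be rewritten by using the integration by parts technique presented in Hyvärinen (2005). Ignoring the constant terms that do not depend on the estimated value, (63) can be rewritten as

$$\frac{1}{N}\sum_{i=1}^{N}\left[\left(\frac{\partial \log \rho_g}{\partial x}\right)_{\mathrm{est}}^{\top}(x_i)\, G^{-1}(x_i)\left(\left(\frac{\partial \log \rho_g}{\partial x}\right)_{\mathrm{est}}(x_i) + 2\Gamma(x_i)\right) + 2\sum_{j=1}^{m}\sum_{k=1}^{m}\frac{\partial}{\partial x^k}\left(\left(\frac{\partial \log \rho_g}{\partial x^j}\right)_{\mathrm{est}}(x_i)\, g^{jk}(x_i)\right)\right], \qquad (64)$$

where $\Gamma(x) \in \mathbb{R}^m$ is a vector whose $i$-th component is given by $\frac{1}{2}\mathrm{Tr}\left(G^{-1}\frac{\partial G}{\partial x^i}\right)$. Note that this formulation turns out to be equivalent to the minimization objective of the R-LSLDG method (Ashizawa et al., 2017). In the case of GDAE and GRCAE, a further simplification is available for (64):

$$\sum_{j}\sum_{k}\frac{\partial}{\partial x^k}\left(\left(\frac{\partial \log \rho_g}{\partial x^j}\right)_{\mathrm{est}} g^{jk}\right) = \frac{1}{\sigma^2}\mathrm{Tr}\left(\frac{\partial r}{\partial x} - I\right)$$

holds from (7). Equation (64) can be used in hyperparameter tuning to evaluate the performance of the estimation in place of (63).

In discretizing (1), (2), (5), and (6), the expectations with respect to the data-generating probability density $\rho(x)$ are approximated by a finite sum over the $N$ training data points with equal weights $\frac{1}{N}$. Given an input $\mathrm{x}$ (or $x$), the expectation with respect to the noise density $q(\tilde{\mathrm{x}} \mid \mathrm{x})$ in (6) for GDAE (or $q(\tilde{x} \mid x)$ in (1) for DAE) is also approximated by a finite sum over finitely many samples of $\tilde{\mathrm{x}}$ (or $\tilde{x}$) with equal weights.

In training the autoencoders, the noise parameter in the data corruption process or the weighting coefficient for the contractive term ($\sigma^2$) should be chosen carefully. In this experiment, $\sigma$ is selected from $\{0.01, 0.025, 0.05\}$ to reduce the modified estimation error in (64), which does not require the true value of $\frac{\partial \log \rho_g(x)}{\partial x}$, on a randomly selected validation data set of size 20,000.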
Since the estimation results from the autoencoders can vary over iterations due to the stochastic nature of the optimization algorithm, the instance with the minimum value of (64) during training is taken as the output of each run of autoencoder training.

In LSLDG, the derivative of the log-probability density, i.e., $\frac{\partial \log \rho(x)}{\partial x}$, is modeled as a weighted sum of basis functions, with the weights optimized in a least-squares sense. The R-LSLDG method estimates the derivative of the log-probability function, i.e., $\sum_j g^{ij}\frac{\partial \log \rho_g(x)}{\partial x^j}$ for $i = 1, \ldots, m$ in local coordinates, similarly to LSLDG but using basis functions defined on the Riemannian manifold. For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters $\lambda \in \{10^{-3}, 10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}, 10^{0.5}, 10^{1}\}$ and $\sigma \in \{10^{-4}, 10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}$ as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of basis functions is set to 500.

The time spent in training each model is shown in Table 8. The autoencoders (DAE, RCAE, GDAE, and GRCAE) take much longer to train than LSLDG and R-LSLDG, which are linear-in-parameter models with closed-form solutions for the parameters. Furthermore, GRCAE (or RCAE) requires more computation than GDAE (or DAE) because the contractive terms contain the derivative of the reconstruction function. Compared to DAE and RCAE, GDAE and GRCAE involve heavier computations such as the eigenvalue decomposition (EVD) to reflect the non-Euclidean geometry of the data. Efficient parallel computation of the eigenvalue decomposition is available through the torch-batch-svd repository (MIT License; https://github.com/KinglittleQ/torch-batch-svd).

This section provides the results of the geometric score estimation for data on the hypersphere $S^n = \{p \in \mathbb{R}^{n+1} \mid \|p\| = 1\}$ endowed with the Riemannian metric induced from the Euclidean metric on $\mathbb{R}^{n+1}$.
We use 10,000 data points generated by mapping samples from $m$ Gaussian mixtures in $T_pS^n$ via the exponential map $\mathrm{Exp}_p : T_pS^n \to S^n$, where $p = (1, 0, \ldots, 0) \in \mathbb{R}^{n+1}$. For the $i$-th mixture, we set the mean as $\mu_i = \frac{1}{\sqrt{2}} e_i \in \mathbb{R}^n$ with $e_i \in \mathbb{R}^n$ as a standard vector whose $i$-th element is one, and the covariance as $\Sigma_i = (0.1)^2 I \in \mathbb{R}^{n \times n}$, so that the ground truth log-probability derivative values are obtainable.

We train the DAE, RCAE, GDAE, GRCAE, LSLDG, and R-LSLDG using the data. For the autoencoder models described in Appendix D.1 (for GDAE and GRCAE) and Appendix F.3 (for DAE and RCAE), we use the Adam optimizer (Kingma & Ba, 2015) of the PyTorch library (Paszke et al., 2017) to update the parameters for 500,000 iterations. We use a batch size of 1,000 to train RCAE and GRCAE and a batch size of 10,000 to train DAE and GDAE. Starting from the initialization scheme of Glorot & Bengio (2010), the initial $W_1, b_1$ are further divided by a factor between 1 and 5 for the experiments. Other optimization and hyperparameter tuning conditions are the same as those explained in Appendix F.3. For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters $\lambda \in \{10^{-3}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$ and $\sigma \in \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}$ as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of basis functions is set to 500.

The time spent in training each model is shown in Table 9. The estimation errors for $\frac{\partial \log \rho_g}{\partial x}$ obtained from each model are reported in Table 10. Similar to the results in Section 4.1, GDAE and GRCAE perform much better than the other methods (especially for higher dimensionality), while GRCAE gives the smallest estimation error in most cases.

To apply the RLMC method to $S^n$, we should appropriately discretize (9).
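Recall that, from (9), the Brownian motion $dx_B$ on an $n$-dimensional manifold is written as

$$dx_B = \frac{1}{2}\left(G^{-1}(x)\frac{\partial \log \sqrt{\det G(x)}}{\partial x} + \Psi(x)\right)dt + \sqrt{G^{-1}(x)}\,dw, \qquad (67)$$

where $dw \in \mathbb{R}^n$ is the Brownian motion in an $n$-dimensional vector space and $\Psi(x) \in \mathbb{R}^n$ is a vector whose $i$-th component is given by $\sum_{j=1}^{n}\frac{\partial g^{ij}}{\partial x^j}$, with $g^{ij}$ as the $(i, j)$ element of $G^{-1}(x) \in \mathbb{R}^{n \times n}$.

Consider the spherical coordinate representation $(x^1, \ldots, x^n)$, which parametrizes the points on $S^n$ as $\mathrm{x}(x^1, \ldots, x^n) = (\mathrm{x}_1, \ldots, \mathrm{x}_{n+1}) \in \mathbb{R}^{n+1}$ with $\mathrm{x}_1 = \cos(x^1)$, $\mathrm{x}_i = \prod_{j=1}^{i-1}\sin(x^j)\cos(x^i)$ for $i = 2, \ldots, n$, and $\mathrm{x}_{n+1} = \prod_{j=1}^{n}\sin(x^j)$. The Riemannian metric is then obtained as $G(x) = \mathrm{diag}(1, g_{22}, \ldots, g_{nn}) \in \mathbb{R}^{n \times n}$ with $g_{ii} = \prod_{j=1}^{i-1}\sin^2(x^j)$ for $i = 2, \ldots, n$.

To formulate the RLMC algorithm for $S^n$, it is useful to represent the Brownian motion in (67) in the ambient space. Applying the Ito rule, we can obtain such a representation as follows:

$$d\mathrm{x}_B = \frac{\partial \mathrm{x}}{\partial x}\left(\frac{1}{2}\left(G^{-1}(x)\frac{\partial \log \sqrt{\det G(x)}}{\partial x} + \Psi(x)\right)dt + \sqrt{G^{-1}(x)}\,dw\right) + \frac{1}{2}\Xi(x)\,dt, \qquad (68)$$

where $\frac{\partial \mathrm{x}}{\partial x} \in \mathbb{R}^{(n+1) \times n}$ is the Jacobian of the parametrization $\mathrm{x}(x)$ with respect to $x$, and $\Xi(x) \in \mathbb{R}^{n+1}$ is a vector whose $i$-th component is given by $\mathrm{Tr}\left(\sqrt{G^{-1}}^{\top}\frac{\partial^2 \mathrm{x}_i}{\partial x^2}\sqrt{G^{-1}}\right)$, with $\frac{\partial^2 \mathrm{x}_i}{\partial x^2} \in \mathbb{R}^{n \times n}$ as the Hessian of $\mathrm{x}_i$ with respect to $x$. After a straightforward calculation, the Brownian motion in (68) reduces to

$$d\mathrm{x}_B = -\frac{n}{2}\,\mathrm{x}\,dt + B\,dw, \qquad (69)$$

where $B = \frac{\partial \mathrm{x}}{\partial x}\sqrt{G^{-1}(x)} \in \mathbb{R}^{(n+1) \times n}$ is a matrix whose column vectors form an orthonormal basis of $T_{\mathrm{x}}S^n$ (footnote 14). Note here that the drift term $-\frac{n}{2}\,\mathrm{x}\,dt$ is orthogonal to the tangent space $T_{\mathrm{x}}S^n$.

Each Langevin diffusion step of the RLMC method for $S^n$ can then be performed by

$$\mathrm{x}_{j+1} = \mathrm{Exp}_{\mathrm{x}_j}\left(\frac{\Delta t}{2}\cdot s(\mathrm{x}_j) + \sqrt{\Delta t}\cdot(I - \mathrm{x}_j\mathrm{x}_j^{\top})\,w_j\right), \qquad (70)$$

where $\mathrm{x}_j \in S^n$ is the point sampled at the $j$-th step, $s(\mathrm{x}_j) \in T_{\mathrm{x}_j}S^n$ is the estimated geometric score at $\mathrm{x}_j$, and $w_j \in \mathbb{R}^{n+1}$ is a random vector sampled from $\mathcal{N}(0, I)$, with $(I - \mathrm{x}_j\mathrm{x}_j^{\top})\,w_j$ being the vector obtained by projecting $w_j$ onto $T_{\mathrm{x}_j}S^n$. Note that (70) can be approximated up to the order of $\Delta t$ by the update

$$\mathrm{x}_{j+1} = \mathrm{Proj}\left(\mathrm{x}_j + \Delta t\cdot\left(\frac{1}{2}s(\mathrm{x}_j) - \frac{n}{2}\mathrm{x}_j\right) + \sqrt{\Delta t}\cdot(I - \mathrm{x}_j\mathrm{x}_j^{\top})\,w_j\right),$$

which straightforwardly considers the Brownian motion in the ambient space in (69).

The geometric score in (70) can be estimated for a point $\mathrm{x} \in S^n$ using the reconstruction function $r : S^n \to S^n$ of the GDAE trained from a given set of training data as $s(\mathrm{x}) = \frac{1}{\sigma^2}\mathrm{Log}_{\mathrm{x}}(r(\mathrm{x}))$, where the logarithm map $\mathrm{Log}_{\mathrm{x}} : S^n \to T_{\mathrm{x}}S^n$, the inverse of the exponential map in (37), is locally defined for an input $\mathrm{y} \in S^n$ near $\mathrm{x}$ as follows:

$$\mathrm{Log}_{\mathrm{x}}(\mathrm{y}) = \frac{\arccos(\mathrm{x}^{\top}\mathrm{y})}{\sqrt{1 - (\mathrm{x}^{\top}\mathrm{y})^2}}\cdot(I - \mathrm{x}\mathrm{x}^{\top})\,\mathrm{y}.$$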
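A single Langevin step of the form (70) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the reconstruction function `r` is a hypothetical callable standing in for a trained GDAE, points are lists of ambient coordinates, and the exponential map uses the standard closed form on the unit sphere, $\mathrm{Exp}_{\mathrm{x}}(v) = \cos(\|v\|)\,\mathrm{x} + \sin(\|v\|)\,v/\|v\|$, which is assumed to agree with (37).

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def project_tangent(x, w):
    """(I - x x^T) w: project an ambient vector w onto the tangent space at x."""
    c = dot(x, w)
    return [wi - c * xi for wi, xi in zip(w, x)]


def exp_map(x, v):
    """Exponential map on the unit sphere for a tangent vector v at x."""
    t = norm(v)
    if t < 1e-12:
        return list(x)
    return [math.cos(t) * xi + math.sin(t) * vi / t for xi, vi in zip(x, v)]


def log_map(x, y):
    """Logarithm map on the unit sphere, as in the closed form above."""
    c = max(-1.0, min(1.0, dot(x, y)))
    s = math.sqrt(max(0.0, 1.0 - c * c))
    if s < 1e-12:  # y coincides with x (the antipodal case is not handled)
        return [0.0] * len(x)
    return [math.acos(c) / s * ui for ui in project_tangent(x, y)]


def score(x, r, sigma):
    """Estimated geometric score s(x) = Log_x(r(x)) / sigma^2."""
    return [li / sigma ** 2 for li in log_map(x, r(x))]


def rlmc_step(x, r, sigma, dt, rng):
    """One step of (70): move along Exp_x by drift plus projected Gaussian noise."""
    w = [rng.gauss(0.0, 1.0) for _ in x]
    s = score(x, r, sigma)
    noise = project_tangent(x, w)
    v = [0.5 * dt * si + math.sqrt(dt) * ni for si, ni in zip(s, noise)]
    return exp_map(x, v)
```

Because both the score and the projected noise lie in the tangent space at the current point, the output of `rlmc_step` stays on the sphere up to floating-point error, so no separate projection step is needed in this variant.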

G.2 EXPERIMENTAL DETAILS

We consider four synthetic data sets on $S^2$: four blobs, two moons, an s-curve, and circles, which are generated by the following procedure. First, we make two-dimensional data sets by using the Python sklearn.datasets package with zero noise level, place them on the plane $z = 1$, and project them onto the spherical surface $S^2$. Second, we add tangent Gaussian noise with standard deviations of 0.01 for two moons, s-curve, and circles, and 0.05 for four blobs, and project the perturbed data onto the sphere by Riemannian projections. The numbers of training, validation, and test data are 800, 200, and 1,000, respectively.

For autoencoder-based models, we use a fully-connected neural network architecture that has 3-512-512-512-512-512-3 layers with ReLU-ReLU-linear-ReLU-ReLU-linear activation functions. For S-Flow, which uses the RealNVP model (Dinh et al., 2016), the depth is eight and the length of the hidden feature is 512. For all cases, the learning rate is 1e-3 and the weight decay parameter is 1e-12. For autoencoder-based models, we search $\sigma \in \{0.01, 0.025, 0.05\}$. We use batch gradient descent, and the maximum number of training epochs is set to 5,000. As the validation loss used to determine the best model during training, we use the modified score estimation error in (64) for autoencoder-based models and the negative log-likelihood for S-Flow. In RLMC sampling with the autoencoder-based models, we need to determine the step size $\Delta t$ in (70). The step size is searched over $\Delta t \in \{0.00001, 0.00005, 0.0001, 0.0005, 0.001\}$. We run the experiment five times with different random seeds. In the result table, the averages and standard errors of the best cases are reported, where for each run the hyperparameters giving the best model (i.e., $\sigma$ and the step size for autoencoder-based models) are selected by computing the MMD metric (Gretton et al., 2012) between the sampled points and the validation data set.
We compute the MMD metric with the exponential kernel $k(x, y) = \exp\left(-d^2_{S^2}(x, y)/(2\eta^2)\right)$, where $d^2_{S^2}(x, y)$ is the squared geodesic distance between data $x, y \in S^2$ and the bandwidth parameter $\eta$ is defined as the mean of the 1-, 2-, 3-, 4-, and 5-nearest neighbor geodesic distances over all training data.

Footnote 14: This can be easily verified from the identity $B^{\top}B = \sqrt{G^{-1}}^{\top}\frac{\partial \mathrm{x}}{\partial x}^{\top}\frac{\partial \mathrm{x}}{\partial x}\sqrt{G^{-1}} = I$, since $G(x) = \frac{\partial \mathrm{x}}{\partial x}^{\top}\frac{\partial \mathrm{x}}{\partial x}$.

The value of $\sigma$ that gives the best clustering performance among $\sigma \in \{0.025, 0.05, 0.1, 0.2\}$ is chosen, and we use the remaining half of the data to train the models and report the clustering performance. Other optimization conditions are kept the same as those explained in Appendix F.3.

For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters $\lambda \in \{10^{-3}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$ and $\sigma \in \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}$ as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of basis functions is set to 100. The clustering is performed using the same data used to report the results from DAE and GDAE.

The time spent in training each model is shown in Table 12. The clustering performance of each method, averaged over five runs, is presented in Table 13. For the ETH-80 data set, R-LSLDG and GDAE perform better than the others on average, but not significantly so. For the COIL-20 data set, the clustering results from DAE show the best performance. The absence of a significant performance difference between DAE and GDAE in both cases may partly be because the covariance matrices are less extreme, e.g., they lack extremely small or large eigenvalues, when the number of categories is relatively small. (Under the affine-invariant metric, the geometry of the covariance matrices becomes more relevant as the relative eigenvalues of the covariance matrices get bigger or smaller.)
When the number of categories gets bigger, as in the COIL-100 data set, the results from GDAE are better than those from DAE by a large margin. Even though this case study does not show a distinct tendency, clustering using GDAE performs well among the considered methods in terms of the overall adjusted Rand index.

The DAE parameters in (65), (66) are optimized for (1). We use the Adam optimizer (Kingma & Ba, 2015) of the PyTorch library (Paszke et al., 2017) and update the parameters for 40,000 iterations. The learning rate starts at 1e-4 and is divided by 100 after 20,000 iterations. In discretizing (6), the expectations with respect to the data-generating probability density $\rho(\mathrm{x})$ and the noise density $q(\tilde{\mathrm{x}} \mid \mathrm{x})$ are dealt with in the same way as for the case studies in Section 4.1 (explained in Appendix F.3), and we use the approximated reconstruction error in (52). We use the corruption noise standard deviation $\sigma = 0.05$ in training both DAE and GDAE.

I.2 A GDAE-BASED DTI FILTERING ALGORITHM

Based on the discussions in Section 4.4, we summarize a DTI filtering algorithm based on the GDAE in Algorithm 1. At each iteration, the input points are moved toward the points reconstructed by the GDAE along the geodesic with a predefined step size $\epsilon$. Assuming small step sizes, we approximate the update along the geodesic on N(3) by treating the updates of the location (in $\mathbb{R}^3$) and the value (in P(3)) separately. The update of the voxel values then reduces to an update along the geodesic according to the affine-invariant Riemannian metric on P(3) (Fletcher & Joshi, 2007), as expressed in line 5 of Algorithm 1. The functions $\mathrm{Exp}_Q : \mathrm{S}(3) \to \mathrm{P}(3)$ and $\mathrm{Log}_Q : \mathrm{P}(3) \to \mathrm{S}(3)$ (for $Q \in \mathrm{P}(3)$) are defined as follows:

$$\mathrm{Exp}_Q(A) = Q^{\frac{1}{2}}\,\mathrm{Exp}\left(Q^{-\frac{1}{2}} A\, Q^{-\frac{1}{2}}\right) Q^{\frac{1}{2}},$$

$$\mathrm{Log}_Q(P) = Q^{\frac{1}{2}}\,\mathrm{Log}\left(Q^{-\frac{1}{2}} P\, Q^{-\frac{1}{2}}\right) Q^{\frac{1}{2}},$$

where $Q^{\frac{1}{2}} = R S^{\frac{1}{2}} R^{\top}$ and $Q^{-\frac{1}{2}} = R S^{-\frac{1}{2}} R^{\top}$ for an eigendecomposition $Q = R S R^{\top}$, and the functions $\mathrm{Exp}(\cdot)$ and $\mathrm{Log}(\cdot)$ respectively denote the matrix exponential and logarithm, whose closed forms are provided in Appendix D.2.1. When the iteration is terminated, we assign the final voxel values to their original locations.

Algorithm 1 DTI Filtering Algorithm Using GDAE

Given: Voxels of a DTI datum $(x_i, P_i) \in \mathrm{N}(3)$, $i = 1, \ldots, N$.
Input: Reconstruction function $r : \mathrm{N}(3) \to \mathrm{N}(3)$ of the GDAE trained from the given set of voxels, step size $\epsilon$, the number of iterations $N_{\mathrm{iter}}$.
Initialize: $(y_{i,1}, Q_{i,1}) = (x_i, P_i)$ for $i = 1, \ldots, N$.
Iteration:
1: for $j = 1, \ldots, N_{\mathrm{iter}}$, $i = 1, \ldots, N$ do
2:   Compute the reconstruction point $(\hat{y}_i, \hat{Q}_i) = r(y_{i,j}, Q_{i,j})$.
3:   Shift the current point according to step size $\epsilon$:
4:     $y_{i,j+1} \leftarrow y_{i,j} + \epsilon(\hat{y}_i - y_{i,j})$,
5:     $Q_{i,j+1} \leftarrow \mathrm{Exp}_{Q_{i,j}}\left(\epsilon\,\mathrm{Log}_{Q_{i,j}}(\hat{Q}_i)\right)$ ($Q_{i,j+1} \leftarrow \hat{Q}_i$ holds when $\epsilon = 1$).
6: end for
7: Assign $P_{i,f} = Q_{i,N_{\mathrm{iter}}+1}$ for $i = 1, \ldots, N$.
Output: Filtered voxels $(x_i, P_{i,f}) \in \mathrm{N}(3)$, $i = 1, \ldots, N$.
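The update in lines 2 to 5 of Algorithm 1 can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: `reconstruct` is a hypothetical callable standing in for the trained GDAE reconstruction function, and the matrix functions are evaluated through `numpy.linalg.eigh` instead of the closed forms of Appendix D.2.1.

```python
import numpy as np


def sym_fn(M, fn):
    """Apply a scalar function to the eigenvalues of a symmetric matrix M."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * fn(vals)) @ vecs.T


def exp_map(Q, A):
    """Exp_Q(A) = Q^(1/2) Exp(Q^(-1/2) A Q^(-1/2)) Q^(1/2)."""
    Q_half = sym_fn(Q, np.sqrt)
    Q_inv_half = sym_fn(Q, lambda s: 1.0 / np.sqrt(s))
    return Q_half @ sym_fn(Q_inv_half @ A @ Q_inv_half, np.exp) @ Q_half


def log_map(Q, P):
    """Log_Q(P) = Q^(1/2) Log(Q^(-1/2) P Q^(-1/2)) Q^(1/2)."""
    Q_half = sym_fn(Q, np.sqrt)
    Q_inv_half = sym_fn(Q, lambda s: 1.0 / np.sqrt(s))
    return Q_half @ sym_fn(Q_inv_half @ P @ Q_inv_half, np.log) @ Q_half


def geodesic_step(Q, Q_hat, eps):
    """Line 5 of Algorithm 1: move from Q toward Q_hat by a fraction eps."""
    return exp_map(Q, eps * log_map(Q, Q_hat))


def filter_voxels(locations, values, reconstruct, eps, n_iter):
    """Sketch of Algorithm 1; reconstruct(y, Q) -> (y_hat, Q_hat) is hypothetical."""
    ys = [np.asarray(x, dtype=float) for x in locations]
    Qs = [np.asarray(P, dtype=float) for P in values]
    for _ in range(n_iter):
        for i in range(len(ys)):
            y_hat, Q_hat = reconstruct(ys[i], Qs[i])
            # Both updates use the values from the previous iteration.
            ys[i], Qs[i] = ys[i] + eps * (y_hat - ys[i]), geodesic_step(Qs[i], Q_hat, eps)
    # The final values are assigned back to the original locations.
    return [(np.asarray(x, dtype=float), Q) for x, Q in zip(locations, Qs)]
```

With `eps = 1.0` the geodesic step returns the reconstructed matrix itself, and with `eps = 0.0` it leaves the current matrix unchanged, which matches the remark in line 5.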

I.3 DETAILS FOR N(2) FILTERING EXPERIMENTS

In this experiment, we consider a synthetic N(2) data set obtained by gathering the position-value pairs (positions in $\mathbb{R}^2$, values in P(2)) at 1,600 grid points (i.e., $N = 1{,}600$) of a P(2) field on $\mathbb{R}^2$. For the P(2) field for clean data, we gather four distinct smoothly-varying P(2) fields as depicted in Figure 5. For the noisy data, we corrupt each value in P(2) as explained in (43) with noise standard deviations (or noise levels) in $\{0.02, 0.05, 0.1, 0.2\}$.

We model the neural networks for DAE and GDAE as explained in Appendix F.3 and Appendix D.3.4, respectively, with appropriate modifications to deal with data in N(2), and train the reconstruction functions. We use the Adam optimizer (Kingma & Ba, 2015) and update the parameters for 20,000 iterations. The learning rate starts at 1e-4 and is divided by 100 after 10,000 iterations. The corruption noise standard deviation is chosen from $\sigma \in \{0.01, 0.025, 0.05, 0.1\}$ to minimize the modified estimation error in (64). For the trained models, we apply Algorithm 1 (with appropriate modifications to deal with N(2) data) with $N_{\mathrm{iter}} \leq 300$ and the step size $\epsilon \in \{0.01, 0.03, 0.1, 0.3, 1.0\}$ chosen to achieve the minimum averaged squared distance (according to the affine-invariant metric) to the clean data. For the MVKR method in Banerjee et al. (2015), the number of considered nearest neighbors $k \in \{10, 20, 30, 40, 100\}$ and the kernel bandwidth parameter $\sigma \in \{1, 10, 20, 30, 50\}$ are chosen to achieve the minimum averaged squared distance (according to the affine-invariant metric) to the clean data.
We report the R-squared score to compare the filtering results from each method in Table 4, and the scores are calculated as follows:

$$\text{R-squared score} = 1 - \frac{\sum_i \mathrm{dist}(P_{i,f}, P_{i,\mathrm{true}})^2}{\sum_i \mathrm{dist}(\bar{P}, P_{i,\mathrm{true}})^2}, \qquad (75)$$

where $\mathrm{dist}(P_1, P_2)$ is the distance between $P_1, P_2 \in \mathrm{P}(2)$ obtained according to the affine-invariant metric, $P_{i,f} \in \mathrm{P}(2)$ is the $i$-th filtered value, $P_{i,\mathrm{true}} \in \mathrm{P}(2)$ is the $i$-th value of the clean data, and $\bar{P} \in \mathrm{P}(2)$ is the intrinsic mean of $\{P_{1,\mathrm{true}}, \ldots, P_{N,\mathrm{true}}\}$ (Moakher, 2005).

I.4 PREPROCESSING DIFFUSION TENSOR IMAGING (DTI) DATA

Data used in the preparation of the experiments performed in Section 4.4 were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). From this data set, we use the DTI data of a randomly chosen healthy subject. The FSL library (Jenkinson et al., 2012) is used to fit diffusion tensors from the raw diffusion-weighted image data. The brain extraction tool (BET), the eddy current correction function (eddy_correct), a linear transformation tool to match the brain template (flirt), and the dtifit module are used in the preprocessing of the DTI data. The region of interest (ROI) is chosen as an axial slice that intersects the corpus callosum region of the brain.

J DETAILS FOR SECTION 4.5

The overall experimental settings are similar to those in Lee et al. (2022). We add noise to each point $x \in \mathbb{R}^3$ in the point cloud of the data sets (ShapeNet, ModelNet10, and ModelNet40) according to $x \mapsto x + m \cdot v$, where $v \in S^2 \subset \mathbb{R}^3$ is uniformly sampled on the unit sphere and $m$ is sampled from the Gaussian distribution with zero mean and different levels of standard deviation (0.01, 0.05, 0.1, and 0.2 of the diagonal length of the point cloud bounding box), as done in Lee et al. (2021a).
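The point cloud corruption $x \mapsto x + m \cdot v$ described above can be sketched in plain Python. The function names are illustrative, and the sketch makes two assumptions that the text does not spell out: the bounding box is axis-aligned, and the uniform direction $v$ is drawn by normalizing a standard Gaussian vector.

```python
import math
import random


def bounding_box_diagonal(points):
    """Diagonal length of the axis-aligned bounding box of a 3D point cloud."""
    lows = [min(p[a] for p in points) for a in range(3)]
    highs = [max(p[a] for p in points) for a in range(3)]
    return math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(lows, highs)))


def random_unit_vector(rng):
    """Uniform direction on the unit sphere via a normalized Gaussian sample."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        length = math.sqrt(sum(c * c for c in g))
        if length > 1e-12:
            return [c / length for c in g]


def add_noise(points, level, rng):
    """Perturb each point by m * v, with m ~ N(0, (level * diagonal)^2)."""
    std = level * bounding_box_diagonal(points)
    noisy = []
    for p in points:
        m = rng.gauss(0.0, std)
        v = random_unit_vector(rng)
        noisy.append([pa + m * va for pa, va in zip(p, v)])
    return noisy
```

For instance, `add_noise(cloud, 0.05, random.Random(0))` corresponds to the noise level 0.05 in the text, where `cloud` is a list of three-dimensional points.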
For the parameters in the kernel functions (57), we use $\Sigma = \sigma_p^2 I$ with $\sigma_p = 0.8 \times \mathrm{MED}$, where MED denotes the median of the distances between the points in the point cloud and their nearest points. The mean values of the MEDs of each data set are 0.0320, 0.0364, 0.0442, and 0.579 for the noise levels 0.01, 0.05, 0.1, and 0.2, respectively. To train the autoencoders defined in Appendix D.4, we use Adam (Kingma & Ba, 2015) with a learning rate of 1e-4, betas of [0.9, 0.999], a weight decay of 1e-6, and a batch size of 16; the total number of epochs is 500. The noise parameters $\sigma$ for DAE and GDAE are set to 0.1 and 1e-4, respectively. The coefficient for the regularization term (defined in (13) of Lee et al. (2022)) is set to $\lambda = 8000$ for both the 'AE + R.' and 'GDAE + R.' methods in Table 5.



Two manifolds are said to be diffeomorphic if there exists a differentiable mapping between the two manifolds which is invertible and whose inverse is also differentiable. Such a mapping is called a diffeomorphism. A continuous function is called a homeomorphism if it is invertible and its inverse is continuous.

See Appendix B.1 for more discussion of coordinate transformations and coordinate-invariance.

The difference between the optimal $r(x)$ and $x$ turns out to be $O(\sigma^2)$.

It can be verified that this approximation does not affect the final results, since the approximation error of the squared geodesic distance becomes $O(\sigma^6)$.

It has empirically performed better to use the vector representation obtained from $\mathrm{Log}(P)$ rather than $P$.

Here $U_i$ is an open subset of $\mathcal{M}$, and $x_i$ is a homeomorphism of $U_i$ to an open subset of $\mathbb{R}^m$ for $i = 1, \ldots, C$. The collection $\{U_1, \ldots, U_C\}$ covers $\mathcal{M}$, and for all $i$ and $j$, the transition map $x_i \circ x_j^{-1}$ is smooth.

To alleviate any topological issues arising from using the spherical coordinate representation $(x^1, x^2)$, which parametrizes the three-dimensional points as $(\cos(x^1), \sin(x^1)\cos(x^2), \sin(x^1)\sin(x^2))$, e.g., the fact that, for points with the $x^2$ value near $\pi$ or $-\pi$, nearby points on the sphere can be mapped from two distant regions in spherical coordinates, we augment the data by adding or subtracting $2\pi$ from the $x^2$ values (as shown in Figure 9(b)) during training and only use the original data when getting the reconstruction directions.

The definitions of $\lambda$ and $\sigma$ can be found in Sasaki et al. (2014) and Ashizawa et al. (2017).

URL: https://github.com/KinglittleQ/torch-batch-svd.

The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).

The license information is available at URL: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Licence.



Figure 1: Autoencoder training on spherical data sampled from the von Mises-Fisher (vMF) distribution. We train the reconstruction contractive autoencoder (RCAE) and the geometric RCAE (GRCAE). For (b)-(c), we plot the reconstruction directions of the autoencoders trained using representations obtained from different coordinate choices. The results from each coordinate choice are color-coded along with corresponding spherical coordinate origins. (See Appendix E for more details.)

Figure 2: Local coordinates and Riemannian metrics for the reconstruction mapping r : M → M. Local coordinates are denoted in italics.

Figure 4: Data sampling on $S^2$.

Figure 5: N(2) data filtering (noise 0.2). The redder, the higher the error.

Consider a data set comprised of $N$ voxels $\{(x_1, P_1), \ldots, (x_N, P_N)\}$ of a raw DTI datum, where $x_i \in \mathbb{R}^3$ and $P_i \in \mathrm{P}(3)$ denote the location and value of the $i$-th voxel, respectively. For the data set, we train the GDAE with the reconstruction function $r : \mathrm{N}(3) \to \mathrm{N}(3)$; the overall training process is explained in Appendix I.1.

Figure 6: DTI filtering results.

Figure 7: A Riemannian manifold M and its local coordinate x.



reduces to training unconstrained $v$ (when considering only the lower- or upper-triangular part), and we denote by $v : \mathrm{P}(n) \to \mathbb{R}^{\frac{n(n+1)}{2}}$ the function that returns the $\frac{n(n+1)}{2}$-dimensional vector representation of the lower- (or upper-) triangular part of the output for $v$.

In training GDAE for diffusion tensor imaging (DTI) data in Section 4.4, the structure of the manifold N(3) (using the Fisher information Riemannian metric) should be dealt with properly. Here, we provide technical details for the modeling and training of the reconstruction function r : N(3) → N(3).

D.4 POINT CLOUD DATA

D.4.1 THE FISHER INFORMATION RIEMANNIAN METRIC FOR POINT CLOUD DATA

Figure 9: Additional figures for experiments in the introduction. In (a), red, green, and blue lines represent each reference frame's X-, Y-, and Z-axis, respectively. Note that the reference frames are translated to corresponding spherical coordinate origins for better visualization.

F.3 TRAINING DAE, RCAE, GDAE, GRCAE, LSLDG, AND R-LSLDG

We model the reconstruction functions $r : \mathrm{P}(n) \to \mathrm{P}(n)$ for GDAE and GRCAE as explained in Appendix D.2.

I.1 TRAINING OF THE RECONSTRUCTION FUNCTION

We model GDAE for N(3) as explained in Appendix D.3.4. We then optimize the parameters $\theta = \{W_1, b_1, W_2, b_2, W_3, b_3, W_4, b_4\}$ for (6) to train GDAE. We model DAE as explained in Appendix F.3 with $d = 9$.

Estimation errors for

The MMD measure (the lower, the better) between the test and the sampled data with standard errors in parentheses.

The adjusted Rand index for the clustering results on the Newsgroup20 data set ($S^{49}$) with standard errors in parentheses. The higher, the better.

The R-squared score for N(2) data filtering. The higher, the better.

Classification accuracy by transfer learning for ModelNet10 and ModelNet40 from ShapeNet under the noise levels of 0.01, 0.05, 0.1, and 0.2.

The computational times (in milliseconds) spent for a gradient update iteration for autoencoders applying different geometric components for $S^n$ data.

$R D R^{\top} \in \mathrm{S}(n)$, with $R \in \mathrm{O}(n)$ as the eigenvector matrix and $D = \mathrm{diag}(d_1, \ldots, d_n)$

The computational times (in milliseconds) spent for a gradient update iteration for autoencoders applying different geometric components for P(n) data.

The time (in seconds) spent in training each model for P(n) data.

The time (in seconds) spent in training each model for $S^n$ data.

The estimation errors for $\frac{\partial \log \rho_g}{\partial x}$ for tangent space Gaussian mixture data on $S^n$. Bold entries mark the best and comparable methods according to a t-test with a significance level of 5%.

The time (in seconds) spent in training each model for P(5) data.

The adjusted Rand index for the covariance matrix data on P(5). Bold entries mark the best and comparable methods according to a t-test with a significance level of 5%.

ACKNOWLEDGMENTS

C. Jang and Y.-K. Noh were supported by IITP Artificial Intelligence Graduate School Program for Hanyang University funded by MSIT (Grant No. 2020-0-01373). Y.-K. Noh was partly supported by NRF/MSIT (Grant No. 2018R1A5A7059549, 2021M3E5D2A01019545) and IITP/MSIT (Grant No. IITP-2021-0-02068). Y. Lee and F. C. Park were supported in part by SRRC NRF grant 2016R1A5A1938472, IITP-MSIT (Grant No. 2022-0-00480, Development of Training and Inference Methods for Goal-Oriented AI Agents, 20%), SNU-AIIS, SNU-IAMD, and the SNU Institute for Engineering Research.

REPRODUCIBILITY STATEMENT

We refer the reader to the following pointers for reproducibility:
• Code to train the proposed autoencoders: Supplementary Material.
• Proof of Theorem 1: Appendix C.
• Implementation details of the proposed autoencoders: Appendix D.
• Experimental settings: Appendices E, F, G, H, I, and J.

D.3.1 THE FISHER INFORMATION RIEMANNIAN METRIC

Denoting by $\mathcal{N}(\mu, \Sigma)$ a normal distribution in $\mathbb{R}^n$ with mean $\mu$ and covariance $\Sigma$, the space of $n$-dimensional normal distributions N(n) is defined as follows:

$$\mathrm{N}(n) = \{\mathcal{N}(\mu, \Sigma) \mid \mu \in \mathbb{R}^n,\ \Sigma \in \mathrm{P}(n)\}. \qquad (48)$$

N(n) becomes a differentiable manifold of dimension $n + \frac{1}{2}n(n+1)$. By using the Fisher information Riemannian metric with local coordinates $(\mu, \Sigma)$ (for $\Sigma$, only the lower- or upper-triangular part is needed), the inner product of two tangent vectors $V = (V_\mu, V_\Sigma)$, $W = (W_\mu, W_\Sigma)$ for $V_\mu, W_\mu \in \mathbb{R}^n$ and $V_\Sigma, W_\Sigma \in \mathrm{S}(n)$ at a point $N = (\mu, \Sigma) \in \mathrm{N}(n)$ is defined as follows:

$$\langle V, W \rangle_N = V_\mu^{\top}\Sigma^{-1}W_\mu + \frac{1}{2}\mathrm{Tr}\left(\Sigma^{-1}V_\Sigma\Sigma^{-1}W_\Sigma\right). \qquad (49)$$

It can be easily verified that the Fisher information metric reduces to the affine-invariant metric when the $\mathbb{R}^n$ part of the tangent vectors is zero; the geodesic distance between two normal distributions with identical means is then obtained in closed form from the affine-invariant distance between the two covariance matrices. In general, however, the distance between two normal distributions cannot be obtained in closed form; a numerical algorithm to find the minimal geodesic according to this metric is proposed in Han & Park (2014), along with some physical insights into adopting the Fisher information Riemannian metric for DTI analysis.

D.3.2 DATA CORRUPTION PROCESS

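Given an input voxel datum $N = (x, P) \in \mathrm{N}(3)$, we first sample $\epsilon_p \sim \mathcal{N}(0, \sigma^2 P)$ and a six-dimensional vector $\epsilon_c \sim \mathcal{N}(0, \sigma^2 I)$. The input voxel $(x, P) \in \mathrm{N}(3)$ is then corrupted to $(\tilde{x}, \tilde{P}) \in \mathrm{N}(3)$ as follows:

$$\tilde{x} = x + \epsilon_p, \qquad (50)$$

$$\tilde{P} = P^{\frac{1}{2}}\,\mathrm{Exp}([\epsilon_c])\,P^{\frac{1}{2}}, \qquad (51)$$

where $P^{\frac{1}{2}} = R S^{\frac{1}{2}} R^{\top}$ for an eigendecomposition $P = R S R^{\top}$, and the bracket operator $[\,\cdot\,]$ is defined for $\epsilon = (\epsilon_{11}, \epsilon_{12}, \epsilon_{13}, \epsilon_{22}, \epsilon_{23}, \epsilon_{33}) \in \mathbb{R}^6$ as $([\epsilon])_{ij} = a_{ij}\epsilon_{ij}$ for $i \leq j$ and $([\epsilon])_{ij} = ([\epsilon])_{ji}$ for $i > j$, with $a_{ij} = \sqrt{2}$ if $i = j$ and 1 otherwise. Note that this corresponds to an approximation of the data corruption process $q(\tilde{\mathrm{x}} \mid \mathrm{x})$ (for small $\sigma^2$) discussed in Section 3.2. In the experiments, we have further applied the first-order Taylor expansion to the matrix exponential in (51).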
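The bracket operator defined above can be written out in a few lines of plain Python. The function names are illustrative, and the general size `n` is an assumption for the example (the text uses $n = 3$). A direct computation from the definition gives $\frac{1}{2}\mathrm{Tr}([\epsilon][\epsilon]) = \|\epsilon\|^2$, which the helper `half_trace_square` lets one check numerically.

```python
import math


def bracket(eps, n):
    """Build the symmetric n x n matrix [eps] from n(n+1)/2 entries.

    Entries are read in row-wise upper-triangular order; diagonal entries are
    scaled by sqrt(2) and off-diagonal entries by 1, as in the definition above.
    """
    M = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i, n):
            a = math.sqrt(2.0) if i == j else 1.0
            M[i][j] = M[j][i] = a * eps[k]
            k += 1
    return M


def half_trace_square(M):
    """Return (1/2) Tr(M M) for a symmetric matrix M, i.e. half its squared Frobenius norm."""
    return 0.5 * sum(entry * entry for row in M for entry in row)
```

For $\epsilon = (1, 2, 3, 4, 5, 6)$ the matrix `bracket(eps, 3)` has diagonal $(\sqrt{2}, 4\sqrt{2}, 6\sqrt{2})$ and off-diagonal entries 2, 3, and 5.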

D.3.3 THE RECONSTRUCTION ERROR

Denote the reconstruction function by $r = (r_p, r_c)$, where $r_p : \mathrm{N}(3) \to \mathbb{R}^3$ corresponds to the location part and $r_c : \mathrm{N}(3) \to \mathrm{P}(3)$ to the value part of the reconstruction function (we discuss their exact modeling in a later section). Assuming $\sigma^2$ is small, the reconstruction error between $(x, P) \in \mathrm{N}(3)$ and $(r_p(x, P), r_c(x, P)) \in \mathrm{N}(3)$ is approximated according to the inner product of the Fisher information metric defined in (49) as follows: $\mathrm{dist}\big((r_p(x, P), r_c(x, P)), (x, P)\big)$

The reconstruction functions $r : \mathbb{R}^d \to \mathbb{R}^d$ ($d = \frac{n(n+1)}{2}$ for P(n)) for DAE and RCAE are modeled by a neural network with two hidden layers and the hyperbolic tangent (Tanh) activation function as follows:

$$h_i = \tanh(W_i h_{i-1} + b_i), \quad i = 1, 2, \qquad (65)$$

$$r(x) = W_3 h_2 + b_3, \qquad (66)$$

where $h_1, h_2 \in \mathbb{R}^{d_h}$ denote the hidden variables, $h_0$ is set to be $x \in \mathbb{R}^d$, and $W_i, b_i$ for $i = 1, 2, 3$ respectively denote the matrix and vector parameters with sizes defined accordingly. We set the dimensionality of the hidden variables $d_h$ to 1,000 for both DAE and RCAE.

We optimize the parameters in (46), (47) for (6) and (5) to train GDAE and GRCAE, respectively (or those in (65), (66) for (1) and (2) to train DAE and RCAE, respectively). We apply the Adam algorithm (Kingma & Ba, 2015) using the PyTorch library (Paszke et al., 2017) and update the parameters for 500,000 iterations. We use a batch size of 1,000 to train RCAE and GRCAE and a batch size of 10,000 to train DAE and GDAE. The learning rate starts at 2.5e-5 and is divided by ten after 250,000 iterations, with a weight decay parameter of 1e-12. We have applied the weight initialization scheme of Glorot & Bengio (2010). Dividing the initial $W_1, b_1$ by two helped to improve the estimation error for GDAE and GRCAE.

H DETAILS FOR SECTION 4.3

H.1 DEFINITION OF THE CLUSTERING TASKS

We define the clusters by merging relevant topics among the 20 groups in the Newsgroup20 data set, resulting in 'computers,' 'politics,' 'automobiles,' 'sports,' and a topic in 'science' (one among 'crypt,' 'electronics,' 'med,' and 'space'). We define four clustering tasks by varying the topic in 'science.'

H.2 TRAINING DAE, GDAE, LSLDG, AND R-LSLDG

The reconstruction function for GDAE is modeled as explained in Appendix D.1, and DAE is modeled in the same way as explained in Appendix F.3. We then optimize the parameters for (6) and (1) to train GDAE and DAE, respectively. We use the Adam optimizer (Kingma & Ba, 2015) of the PyTorch library (Paszke et al., 2017) and update the parameters for 500,000 iterations. The learning rate starts at 2.5e-3 for GDAE and 1e-3 for DAE, and is divided by ten after 250,000 iterations. Other optimization conditions are kept the same as those in Appendix F.3, and the corruption noise standard deviation $\sigma$ is selected from $\{0.025, 0.05\}$ to reduce the modified estimation error in (64) on the training data set.

For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters $\lambda \in \{10^{-3}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$ and $\sigma \in \{10^{-1.2}, 10^{-1}, 10^{-0.5}, 10^{0}\}$ as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of basis functions is set to 300. The time spent in training each model is shown in Table 11.

In this section we provide a case study of the clustering of covariance matrix data. Following Jayasumana et al. (2015), we calculate covariance matrices from the features of image data and group the images based on the covariance matrices. For this case study, we consider the ETH-80 data set with 3,240 images from eight categories (Leibe & Schiele, 2003), the COIL-20 data set with 1,440 images from twenty categories (Nene et al., 1996a), and the COIL-100 data set with 7,200 images from a hundred categories (Nene et al., 1996b). In calculating the covariance matrix for each image, we use the feature (from every pixel of the image) consisting of $\{x, y, I, |I_x|, |I_y|\}$, where $x, y$ are the horizontal and vertical pixel locations, $I$ is the intensity of the pixel, and $I_x$ and $I_y$ are the horizontal and vertical gradients of the intensity, respectively. Hence the data in use are in P(5).

We train DAE, GDAE, LSLDG, and R-LSLDG using the covariance matrix data and perform clustering using the trained models in the same way as explained in Section 4.3. For the GDAE and DAE models respectively described in Appendix D.2 and Appendix F.3, we use the Adam optimizer (Kingma & Ba, 2015) of the PyTorch library (Paszke et al., 2017) to update the parameters for 400,000 iterations. The learning rate starts at 1e-4 and is divided by ten after 200,000 iterations, with a weight decay parameter of 1e-12.

The corruption noise standard deviation $\sigma$ can significantly affect the clustering performance of DAE and GDAE. To choose $\sigma$, we use a (randomly chosen) half of the data as a validation set to train the DAE and GDAE models and measure the clustering performance.

