MINIMUM CURVATURE MANIFOLD LEARNING

Abstract

It is widely observed that vanilla autoencoders can have low manifold learning accuracy given a noisy or small training dataset. Recent work has discovered that it is important to regularize the decoder that explicitly parameterizes the manifold, where a neighborhood graph is employed for decoder regularization. However, one caveat of this method is that it is not always straightforward to construct a correct graph. Alternatively, one may consider naive graph-free regularization methods such as minimizing the norm of the decoder's Jacobian or Hessian, but these norms are not coordinate-invariant (i.e., reparametrization-invariant) and hence neither capture any meaningful geometric quantity of the manifold nor produce geometrically meaningful regularization effects. Another recent method, isometric regularization, implicitly forces the manifold to have zero intrinsic curvature, resulting in some geometrically meaningful regularization effects. However, because the intrinsic curvature does not capture how the manifold is embedded in the data space from an extrinsic perspective, the regularization effects are often limited. In this paper, we propose a minimum extrinsic curvature principle for manifold regularization and the Minimum Curvature Autoencoder (MCAE), a graph-free, coordinate-invariant extrinsic curvature minimization framework for autoencoder regularization. Experiments with various standard datasets demonstrate that MCAE improves manifold learning accuracy compared to existing methods and, in particular, shows strong robustness to noise.

1. INTRODUCTION

Autoencoders are widely used to simultaneously identify, given a set of high-dimensional data, the underlying lower-dimensional manifold structure and its coordinate space (Kramer, 1991). The decoder explicitly parameterizes the data manifold as a mapping from a lower-dimensional coordinate space (i.e., latent space) to the high-dimensional data space, and the encoder maps data points to their corresponding coordinates (i.e., latent values). However, vanilla autoencoders trained to reconstruct the given training data often learn manifolds that severely overfit to noisy training data or are wrong in regions where there are fewer data, impairing their manifold learning performance. Lee et al. (2021) recently showed that autoencoder regularization methods that focus on the latent space distributions determined entirely by the encoders (Kingma & Welling, 2013; Tolstikhin et al., 2018; Makhzani et al., 2015; Rifai et al., 2011) are not sufficient to learn correct manifolds, and that it is important to properly regularize the decoders that parameterize the manifolds. In Lee et al. (2021), neighborhood graphs constructed from data are successfully used to regularize the local geometry and connectivity of the manifold, significantly improving the manifold learning accuracy. However, the underlying premise of this method is that the graph is accurate, and constructing a correct graph is not always straightforward. There are some graph-free methods, such as the denoising autoencoder (Vincent et al., 2010) and the reconstruction contractive autoencoder (Alain & Bengio, 2014), that regularize not only the encoder but also the decoder. They can learn manifolds that are robust to noise to some extent, but when the noise level is large their performance often degrades, and they do not always produce correct manifolds, especially in regions where there are fewer data (discussed in more detail in Section 4.2).
Since the decoder needs to be regularized, one may come up with some naive regularization strategies such as minimizing the norm of the decoder's Jacobian or Hessian, considering them as measures of the manifold's smoothness. However, these norms do not properly capture any geometric quantity of the manifold because they are not reparametrization-invariant (or coordinate-invariant). As shown in Figure 1 (Left), the above norms can be reduced just by increasing the volume of the latent space without actually changing the manifold, i.e., by re-parametrizing the manifold $f \to f'$. Recently, a coordinate-invariant geometric distortion measure has been introduced to regularize the decoder to be a geometry-preserving mapping, so that the data space geometry is preserved in the latent space; this is called isometric regularization (Lee et al., 2022). Minimizing this distortion measure implicitly forces the learned manifold to have zero intrinsic curvature, which depends only on distances measured within the manifold (e.g., a cylinder's side surface has zero intrinsic curvature, unlike a spherical surface), resulting in some geometrically meaningful manifold regularization effects. The intrinsic curvature, however, does not capture how the manifold lies in the data space, and thus minimizing the manifold's intrinsic curvature may not be enough to learn correct manifolds. For example, curves and developable surfaces in $\mathbb{R}^3$ always have zero intrinsic curvature, e.g., Figure 1 (Right), regardless of how severely they are curved from an extrinsic point of view (Do Carmo, 2016).
Figure 1: Left: Two decoders $f$ and $f'$ parameterize the same data manifold, where the norm of the Jacobian of $f'$ is smaller than that of $f$, i.e., $\|J_f\| > \|J_{f'}\|$. Right: A curve and a developable surface embedded in $\mathbb{R}^3$ have zero intrinsic curvature.
The main contribution of this paper is a coordinate-invariant extrinsic curvature minimization framework for autoencoder regularization, which we refer to as the Minimum Curvature Autoencoder (MCAE), that is graph-free and effectively improves the manifold learning accuracy given a noisy or small training dataset. Specifically, we develop a coordinate-invariant extrinsic curvature measure of the learned manifold by investigating how smoothly the tangent space changes over the manifold, and use it as a regularization term. To make things more explicit, let $\mathcal{M}$ be a manifold of dimension $m$ embedded in $\mathbb{R}^D$. Consider a mapping $T$ that maps a point $x \in \mathcal{M}$ to its tangent space $T_x\mathcal{M}$, an $m$-dimensional linear subspace attached at $x$, i.e., $T(x) = T_x\mathcal{M}$. The set of all linear subspaces of dimension $m$ in $\mathbb{R}^D$ forms a manifold called the Grassmann manifold, denoted by $\mathrm{Gr}(m, \mathbb{R}^D)$ (Bendokat et al., 2020), and thus $T$ can be viewed as a mapping between two Riemannian manifolds, i.e., $T : \mathcal{M} \to \mathrm{Gr}(m, \mathbb{R}^D)$. Using the Dirichlet energy (Eells & Lemaire, 1978), a natural smoothness measure of mappings between two Riemannian manifolds that is defined in a coordinate-invariant way, we formulate an extrinsic curvature measure. We also propose a practical estimation strategy for the curvature measure that reduces computation costs and can be used for high-dimensional problems. Experiments on diverse image and motion capture data confirm that, compared to existing graph-free regularized autoencoders, MCAE improves manifold learning accuracy for noisy and small training datasets. In particular, our experiments show that even compared to methods specifically designed to be robust to input perturbations, such as the DAE (Vincent et al., 2010) and RCAE (Alain & Bengio, 2014), MCAE shows comparable or, in some cases, significantly more robust manifold learning performance.

2.1. GRASSMANN MANIFOLD

In this section, we review the Grassmann manifold and its Riemannian geometry from a matrix-analytic perspective. The Grassmann manifold is defined as the set of all $m$-dimensional linear subspaces of the Euclidean space $\mathbb{R}^D$, denoted by $\mathrm{Gr}(m, \mathbb{R}^D)$; it can be identified with the set of orthogonal rank-$m$ projection matrices as follows:
$$\mathrm{Gr}(m, \mathbb{R}^D) = \{P \in \mathbb{R}^{D \times D} \mid P^T = P,\ P^2 = P,\ \mathrm{rank}(P) = m\}, \tag{1}$$
which is an $m(D-m)$-dimensional manifold; this identification associates $P \in \mathrm{Gr}(m, \mathbb{R}^D)$ with the linear subspace $\mathrm{range}(P) \subset \mathbb{R}^D$. This is an implicit parametrization of the Grassmann manifold considered as being embedded in the Euclidean space $\mathbb{R}^{D \times D}$. For more formal and detailed descriptions of the Grassmann manifold, we refer to (Bendokat et al., 2020).
Given a rank-$m$ matrix $J \in \mathbb{R}^{D \times m}$, one may want to consider its range, an $m$-dimensional linear subspace of $\mathbb{R}^D$, as an element of the Grassmann manifold. The embedding $E : \mathbb{R}^{D \times m} \to \mathrm{Gr}(m, \mathbb{R}^D)$ such that $E(J) = J(J^T J)^{-1} J^T$ converts $J$ to the corresponding element of (1). We note that (i) $\mathrm{range}(J) = \mathrm{range}(E(J))$ and (ii) $E(J) = E(JA)$ for any invertible matrix $A \in \mathbb{R}^{m \times m}$, since the transformation $J \to JA$ does not change the range.
Next, we introduce the basic Riemannian structure of the Grassmann manifold. At a point $P \in \mathrm{Gr}(m, \mathbb{R}^D)$, the tangent space is defined as follows:
$$T_P \mathrm{Gr}(m, \mathbb{R}^D) := \{V \in \mathbb{R}^{D \times D} \mid V^T = V,\ VP + PV = V\}, \tag{2}$$
which can be derived from (1) by differentiating the constraints. One canonical choice of the Riemannian metric is given as follows:
$$\langle V_1, V_2 \rangle := \frac{1}{\sqrt{2}} \mathrm{Tr}(V_1^T V_2) \tag{3}$$
for $V_1, V_2 \in T_P \mathrm{Gr}(m, \mathbb{R}^D)$. This metric is invariant under orthogonal transformations, i.e., $\langle V_1, V_2 \rangle = \langle RV_1, RV_2 \rangle$ for any $D \times D$ orthogonal matrix $R$.
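The properties of the embedding $E$ stated above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function name `grassmann_embedding`, the dimensions, and the test matrices are illustrative assumptions.

```python
import numpy as np

def grassmann_embedding(J):
    """E(J) = J (J^T J)^{-1} J^T for a full-column-rank J of shape (D, m)."""
    # Solve (J^T J) X = J^T instead of forming the inverse explicitly.
    return J @ np.linalg.solve(J.T @ J, J.T)

rng = np.random.default_rng(0)
D, m = 6, 2                                  # illustrative sizes (assumption)
J = rng.standard_normal((D, m))              # a generic rank-m matrix
A = np.array([[2.0, 1.0], [0.5, 3.0]])       # invertible m x m matrix (det = 5.5)

P = grassmann_embedding(J)

is_symmetric = np.allclose(P, P.T)           # P^T = P
is_idempotent = np.allclose(P @ P, P)        # P^2 = P
rank_from_trace = int(round(np.trace(P)))    # trace of a projector equals its rank
keeps_range = np.allclose(P @ J, J)          # range(J) is contained in range(P)
is_invariant = np.allclose(P, grassmann_embedding(J @ A))  # E(J) = E(JA)
```

All five checks should hold up to floating-point tolerance, which reflects that $E(J)$ depends only on the subspace spanned by the columns of $J$.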

2.2. DIRICHLET ENERGY FOR MAPPINGS BETWEEN RIEMANNIAN MANIFOLDS

This section introduces the Dirichlet energy for mappings between two Riemannian manifolds. Let $\mathcal{M}$ and $\mathcal{N}$ be Riemannian manifolds of dimension $m$ and $n$, respectively, and consider a differentiable mapping $f : \mathcal{M} \to \mathcal{N}$. We assume that $x \in \mathcal{M}$ is explicitly parametrized by local coordinates $x \in \mathbb{R}^m$ and that the Riemannian metric at $x \in \mathcal{M}$ is expressed as an $m \times m$ positive-definite matrix $G(x) = (g_{ij}(x)) \in \mathbb{R}^{m \times m}$. We also assume that $\mathcal{N}$ is embedded in a Euclidean space of higher dimension, $\mathcal{N} \subset \mathbb{R}^d$ ($d \gg n$), and that the Riemannian metric at $y \in \mathcal{N}$ is given as $\langle \cdot, \cdot \rangle_y$ (e.g., the Grassmann manifold). The mapping $f$ is then expressed as $f : \mathbb{R}^m \to \mathcal{N} \subset \mathbb{R}^d$ such that $y = f(x)$. The Dirichlet energy, a global measure of how much the mapping $f$ changes, is defined as follows:
$$\int_{\mathcal{M}} \sum_{i=1}^{m} \sum_{j=1}^{m} g^{ij}(x) \left\langle \frac{\partial f}{\partial x_i}(x), \frac{\partial f}{\partial x_j}(x) \right\rangle_{f(x)} \sqrt{\det G(x)}\, dx_1 \cdots dx_m, \tag{4}$$
where $g^{ij}(x)$ denotes the $(i, j)$-th element of the inverse of $G(x)$ and $\sqrt{\det G(x)}\, dx_1 \cdots dx_m$ is the Riemannian volume form. This corresponds to the integral functional from the theory of harmonic maps, and the integral is an intrinsic (i.e., coordinate-invariant) quantity. We note that the integrand is a local measure of how much the mapping $f$ changes. We refer to the extensive literature on the theory and applications of harmonic maps, e.g., (Eells & Lemaire, 1978; 1988; Park & Brockett, 1994; Jang et al., 2020; Lee et al., 2022).

3. MINIMUM CURVATURE AUTOENCODERS

In this section, we propose a regularized autoencoder based on the principle of minimum curvature manifold learning. Throughout, we consider a data space $\mathbb{R}^D$ and a latent space $\mathbb{R}^m$ ($D \gg m$), and denote a parametric encoder by $g_\phi : \mathbb{R}^D \to \mathbb{R}^m$ such that $z = g_\phi(x)$, and a parametric decoder by $f_\theta : \mathbb{R}^m \to \mathbb{R}^D$ such that $x = f_\theta(z)$. The manifold parametrized by the decoder is denoted by $\mathcal{M}_\theta$, and the Jacobian of the decoder by $J_\theta(z) = \frac{\partial f_\theta}{\partial z}(z)$. Given a set of data points $\{x_i \in \mathbb{R}^D\}_{i=1}^{N}$, the empirical data distribution is denoted by $p(x) := \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$ and the latent space distribution encoded by $g_\phi$ by $p_\phi(z) := \frac{1}{N} \sum_{i=1}^{N} \delta(z - g_\phi(x_i))$. The subscripts show which parameters, $\theta$ or $\phi$, each function or geometric object depends on.

3.1. COORDINATE-INVARIANT EXTRINSIC CURVATURE MEASURE

In this section, we formulate a coordinate-invariant (i.e., reparametrization-invariant) extrinsic curvature measure of the manifold $\mathcal{M}_\theta$ embedded in $\mathbb{R}^D$. We begin by introducing the notion of coordinate-invariance:
Definition 1. Given a manifold $\mathcal{M}$ of dimension $m$ embedded in $\mathbb{R}^D$, let $f : \mathbb{R}^m \to \mathcal{M}$ be its explicit parametrization. A functional $\mathcal{F}(f)$ is coordinate-invariant (i.e., reparametrization-invariant) if, given any invertible mapping or coordinate transformation (i.e., reparametrization) $h : \mathbb{R}^m \to \mathbb{R}^m$, $\mathcal{F}(f) = \mathcal{F}(f \circ h^{-1})$.
Coordinate-invariance is necessary to properly measure any geometrically meaningful quantity of the manifold. For example, the integral of the Frobenius norm of $J_\theta$ over the coordinate space $\mathbb{R}^m$ is not coordinate-invariant, and hence does not capture any geometrically meaningful quantity of $\mathcal{M}_\theta$.
Now, we define a coordinate-invariant extrinsic curvature measure of $\mathcal{M}_\theta$. The core idea is to define a local measure of the extrinsic curvature by measuring how fast the tangent space $T_x \mathcal{M}_\theta$ changes within the neighborhood of $x$, and then to integrate it over the manifold to define a global curvature measure. For this purpose, let a pair of mappings, the encoder $g_\phi$ and the decoder $f_\theta$, be a coordinate system for $\mathcal{M}_\theta$, and consider a mapping $T : \mathbb{R}^m \to \mathrm{Gr}(m, \mathbb{R}^D)$ such that $T(z)$ is the element of the Grassmann manifold (1) whose range is equal to $T_{f_\theta(z)} \mathcal{M}_\theta$. We note that the range of the Jacobian matrix $J_\theta(z) \in \mathbb{R}^{D \times m}$ is $T_x \mathcal{M}_\theta$ with $x = f_\theta(z)$; hence, by using the embedding $E : \mathbb{R}^{D \times m} \to \mathrm{Gr}(m, \mathbb{R}^D)$ such that $E(J_\theta) := J_\theta (J_\theta^T J_\theta)^{-1} J_\theta^T$, we can explicitly write the mapping $T$ as $T(z) = E(J_\theta(z))$. Let $\mathcal{M}_\theta$ be assigned the Riemannian metric induced from the Euclidean metric of the ambient space, so that the metric expressed in the coordinate space is $J_\theta^T(z) J_\theta(z)$, and let $\mathrm{Gr}(m, \mathbb{R}^D)$ be assigned the Riemannian metric in (3).
We use the Dirichlet energy in (4) of the mapping $T$ as a coordinate-invariant extrinsic curvature measure, where the integral is replaced by the expectation over $p_\phi(z)$:
Definition 2. Given an encoder $g_\phi$, a decoder $f_\theta$, and the empirical distribution in coordinate space $p_\phi(z)$, the global extrinsic curvature measure of $\mathcal{M}_\theta$ with respect to $p_\phi(z)$ is defined as
$$C(\theta, \phi) := \mathbb{E}_{z \sim p_\phi(z)} \left[ \sum_{i=1}^{m} \sum_{j=1}^{m} (J_\theta^T J_\theta)^{-1}_{ij}\, \mathrm{Tr}\!\left( \frac{\partial E(J_\theta)}{\partial z_i} \frac{\partial E(J_\theta)}{\partial z_j} \right) \right]. \tag{5}$$
Proposition 1. The curvature measure $C(\theta, \phi)$ in Definition 2 is coordinate-invariant, i.e., for another pair of encoder $g_{\phi'} := h \circ g_\phi$ and decoder $f_{\theta'} := f_\theta \circ h^{-1}$ with any invertible map or coordinate transformation $h$ such that $z' = h(z)$, the measure is invariant, i.e., $C(\theta, \phi) = C(\theta', \phi')$.
Proof. The proof is given in Appendix A.2.
Our definition of the curvature generalizes the classical definition of the curvature of a curve embedded in $\mathbb{R}^3$ from differential geometry (Kühnel, 2015) (please see Appendix A.3 for more details). With the proposed curvature measure, we define a regularized autoencoder whose loss function consists of the following two terms: (i) a reconstruction error term for manifold learning and (ii) the regularization term $C(\theta, \phi)$ for curvature minimization:
$$\min_{\theta, \phi}\ \mathbb{E}_{x \sim p(x)} \left[ \| x - f_\theta \circ g_\phi(x) \|^2 \right] + \alpha\, C(\theta, \phi), \tag{6}$$
where $\alpha$ is the regularization coefficient. We refer to this as the Minimum Curvature Autoencoder (MCAE).
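To make Definition 2 and Proposition 1 concrete, the sketch below evaluates the integrand of the curvature measure at a single latent point using finite differences. It is a small-scale illustration under stated assumptions, not the training implementation: the helper names (`jacobian`, `tangent_projector`, `local_curvature`), the toy sphere decoder, and the linear reparametrization `A` are all hypothetical. For a sphere of radius $R$, a short calculation from Definition 2 gives the value $4/R^2$, which serves as a sanity check.

```python
import numpy as np

def jacobian(f, z, eps=1e-5):
    """Central-difference Jacobian of f: R^m -> R^D at z, shape (D, m)."""
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

def tangent_projector(f, z):
    """E(J) = J (J^T J)^{-1} J^T, the projector onto the tangent space at f(z)."""
    J = jacobian(f, z)
    return J @ np.linalg.solve(J.T @ J, J.T)

def local_curvature(f, z, eps=1e-4):
    """sum_ij (G^{-1})_ij Tr(dP/dz_i dP/dz_j) with G = J^T J and P = E(J)."""
    z = np.asarray(z, dtype=float)
    J = jacobian(f, z)
    G_inv = np.linalg.inv(J.T @ J)
    dP = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        dP.append((tangent_projector(f, z + dz) - tangent_projector(f, z - dz)) / (2 * eps))
    m = z.size
    return sum(G_inv[i, j] * np.trace(dP[i] @ dP[j]) for i in range(m) for j in range(m))

R = 2.0  # sphere radius (illustrative)

def sphere(z):
    """Toy decoder: spherical coordinates (theta, phi) -> point on a sphere of radius R."""
    theta, phi = z
    return R * np.array([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])

# Reparametrization z' = h(z) = A z and the reparametrized decoder f' = f o h^{-1}.
A = np.array([[2.0, 0.5], [-1.0, 3.0]])

def sphere_reparam(z_prime):
    return sphere(np.linalg.solve(A, z_prime))

z = np.array([1.0, 0.5])
c = local_curvature(sphere, z)                      # close to 4 / R^2
c_reparam = local_curvature(sphere_reparam, A @ z)  # same point, new coordinates
```

Both values agree up to finite-difference error, even though the two decoders have different Jacobians at the same manifold point. In practice one would replace the finite differences with automatic differentiation.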

3.2. PRACTICAL IMPLEMENTATIONS

This section introduces two practical strategies for computing the curvature measure (5).
Augmented Distribution: In (5), the local curvature measure is averaged over the empirical latent space distribution. However, the influence of the measure is then limited to regions where data is available; thus the manifold's curvature in regions where there is no data may not be properly regularized. In practice, we use data augmentation to resolve this issue. Following (Chen et al., 2020; Lee et al., 2022), we use the modified mix-up data-augmentation method with a parameter $\eta > 0$, where $p_\phi(z)$ is augmented by $z = \delta z_1 + (1 - \delta) z_2$ such that $z_i \sim p_\phi(z)$, $i = 1, 2$, where $\delta$ is uniformly sampled from $[-\eta, 1 + \eta]$. We set $\eta = 0.2$ throughout.
Stochastic Trace Estimation: At first glance, the curvature measure (5) seems computationally very expensive, because it involves the computation of the full Jacobian $J_\theta$ of a deep neural network and the derivative of the Jacobian $\frac{\partial J_\theta}{\partial z}$, and we even need to backpropagate through them when using standard stochastic gradient descent algorithms. To efficiently compute the measure in practice, we use Hutchinson's trace estimator (Hutchinson, 1989), i.e., $\mathrm{Tr}(A) = \mathbb{E}_{v \sim \mathcal{N}(0, I)}[v^T A v]$. The curvature measure $C(\theta, \phi)$ then has the following expression:
$$C(\theta, \phi) = \mathbb{E}_{z \sim p_\phi(z),\, v \sim \mathcal{N}(0, I_m),\, w \sim \mathcal{N}(0, I_D)} \left[ v^T \frac{\partial (w^T E(J_\theta))}{\partial z} \frac{\partial (E(J_\theta) w)}{\partial z} G_\theta^{-1} v \right], \tag{7}$$
where $I_k$ is the $k \times k$ identity matrix and $G_\theta = J_\theta^T J_\theta$. To implement this efficiently, we use Jacobian-vector and vector-Jacobian products multiple times: (i) for $E(J_\theta) w = J_\theta G_\theta^{-1} J_\theta^T w$, we first use the vector-Jacobian product for $J_\theta^T w$ and then the Jacobian-vector product for $J_\theta (G_\theta^{-1} J_\theta^T w)$, and (ii) for $\frac{\partial (E(J_\theta) w)}{\partial z} v$ and $\frac{\partial (E(J_\theta) w)}{\partial z} (G_\theta^{-1} v)$, we use Jacobian-vector products. These techniques make the computation of (5) tractable for high-dimensional complex problems.
Surprisingly, for the estimation of (7), using one sample of $v$ and $w$ at each $z \sim p_\phi(z)$ was sufficient to train MCAE in our later experiments. When the latent space is high-dimensional, the computation of the matrix inverse $G_\theta^{-1}$ takes up most of the computation time. Using an approximate inverse can significantly reduce the computation time; see Appendix A.6.
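The identity $\mathrm{Tr}(A) = \mathbb{E}_{v \sim \mathcal{N}(0, I)}[v^T A v]$ behind the stochastic estimate can be illustrated in isolation. The following NumPy sketch is only a toy demonstration on an explicit matrix; the function name `hutchinson_trace`, the matrix size, and the sample count are assumptions, and in the actual method the matrix-vector products would come from Jacobian-vector and vector-Jacobian products rather than from a stored matrix.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_samples, rng):
    """Monte Carlo estimate of Tr(A) using only products v -> A v."""
    total = 0.0
    for _ in range(num_samples):
        v = rng.standard_normal(dim)   # v ~ N(0, I)
        total += v @ matvec(v)         # one sample of v^T A v
    return total / num_samples

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T / 50                       # a symmetric test matrix (illustrative)

estimate = hutchinson_trace(lambda v: A @ v, 50, 2000, rng)
exact = np.trace(A)
relative_error = abs(estimate - exact) / exact
```

The estimator never needs the matrix itself, only its action on vectors, which is why it pairs naturally with automatic-differentiation products. A single sample is noisy, but the noise averages out over the minibatches seen during training.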

4.1. PARAMETER SWEEP

We first provide an empirical study on the effect of the most important parameter of MCAE, the regularization coefficient $\alpha$. Intuitively, as $\alpha$ increases, the tendency to minimize the extrinsic curvature of the manifold becomes stronger, so the learned manifold becomes closer to a linear subspace. If $\alpha$ is too small, the learned manifold will not differ from that of the vanilla autoencoder; hence it is important to select an appropriate value of $\alpha$ depending on the dataset. Figure 2 shows how $\alpha$ affects the learned manifold in MCAE with two examples. In the upper figure, given noisy two-dimensional data points, we train MCAEs with one-dimensional latent spaces. In the lower figure, given sparse three-dimensional data points constrained to the 2-sphere $S^2 := \{x \in \mathbb{R}^3 \mid \|x\| = 1\}$, we train MCAEs with one-dimensional latent spaces, where the decoder outputs are normalized to lie in $S^2$. As can be seen, $\alpha = 0.01$ and $\alpha = 0.0001$ are good values for the upper and lower examples, respectively. In practice, we can find the optimal value of $\alpha$ with a proper validation criterion (e.g., the mean reconstruction error on validation data).

4.2. COMPARISON TO OTHER REGULARIZATION METHODS

In this section, we compare the proposed MCAE with other regularized autoencoders and highlight the differences. Please refer to Appendix A.1 for more detailed comparisons.
Comparison to Isometrically Regularized Autoencoders: In the Isometrically Regularized Autoencoder (IRAE) (Lee et al., 2022), the decoder is regularized to be a scaled isometry; similar to (6), a regularization term that measures how far $f_\theta$ is from being a scaled isometry is added to the reconstruction error term with the regularization coefficient $\alpha$. This regularization implicitly forces the learned manifold $\mathcal{M}_\theta$ to have zero intrinsic curvature, but not zero extrinsic curvature; therefore, at first glance, one would expect that the IRAE should not have any meaningful manifold regularization effect when learning a one-dimensional manifold (since one-dimensional manifolds always have zero intrinsic curvature). Counterintuitively, as shown in Figure 3 (a), our experiments show that the extrinsic curvature of the one-dimensional manifold learned by IRAE decreases as $\alpha$ increases. If the decoder's hypothesis space were the set of arbitrary smooth functions, this result would not have been obtained; but since the hypothesis space defined by the set of neural networks is smaller, the isometric regularization seems to reduce the extrinsic curvature at the expense of obtaining the isometric representations.
The DAE (Vincent et al., 2010) and RCAE (Alain & Bengio, 2014) are intuitive and straightforward regularization methods for learning manifolds robust to input perturbations. As shown in Figure 4 (Upper), the DAE and RCAE learn manifolds robust to noise to some extent. However, as shown in Figure 4 (Lower), for the projected S-curve example in Figure 2 (Lower), they still learn wrong manifolds in regions where there are fewer data and do not improve on the vanilla autoencoder. On the other hand, the MCAE explicitly regularizes the learned manifold to have a small curvature globally and improves the manifold learning accuracy.
Figure 6 shows the test reconstruction MSEs as a function of the number of training (80%) + validation (20%) data. For all methods, the error decreases as the number of data increases; MCAEs mostly produce the lowest errors except for some MNIST cases. Figure 7 shows the Peak Signal-to-Noise Ratios (PSNRs) computed with the clean test set data (the higher the better) as a function of the standard deviation of the Gaussian noise added to the training data (the number of training data is 8000). The PSNR decreases as the noise level increases; MCAEs mostly produce the highest PSNRs. Figure 5 shows some de-noising examples with corrupted input data of MNIST and FMNIST. 
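For reference, the PSNR metric mentioned above can be sketched as follows. The text does not spell out the formula, so this assumes the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ with peak value 1, matching pixel values normalized to $[0, 1]$; the function name `psnr` and the toy images are illustrative.

```python
import numpy as np

def psnr(clean, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB (standard definition, assumed here)."""
    clean = np.asarray(clean, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    mse = np.mean((clean - reconstructed) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy check: a uniform error of 0.1 gives MSE = 0.01 and hence 20 dB.
clean_image = np.zeros((28, 28))
noisy_reconstruction = np.full((28, 28), 0.1)
value_db = psnr(clean_image, noisy_reconstruction)
```

Because the metric is computed against clean test images, a higher PSNR indicates that the learned manifold removes more of the noise present in the training data.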

4.4. HUMAN SKELETON POSE DATA

In this section, we evaluate the MCAE on human skeleton pose data adopted from the NTU RGB+D dataset (Shahroudy et al., 2016). A human skeleton pose consists of 25 three-dimensional key points and is thus treated as a 75-dimensional vector. There are 60 different action classes (e.g., drinking water, brushing teeth), and each action sample consists of a sequence of skeleton poses. For each action class, we use 800 and 200 randomly selected skeleton poses as training and validation data, respectively, and 9000 poses as test data. We use two-layer fully connected neural networks (512 nodes per layer) with ELU activation functions for both the encoder and the decoder, and the latent space dimension is 8. Table 4 shows the averages and standard errors of the test set reconstruction MSEs over the 60 action classes (lower is better). MCAE mostly produces the lowest errors, by a particularly significant margin for the noisy training data cases. Figure 10 shows some example reconstruction results for noisy input skeleton data (noise level 0.05); MCAE shows the best de-noising results.

5. CONCLUSION

In this paper, we have proposed a minimum extrinsic curvature principle for manifold regularization and developed the Minimum Curvature Autoencoder (MCAE) by formulating a coordinate-invariant (reparametrization-invariant), and hence geometrically correct, extrinsic curvature measure. Our experiments show that the minimum curvature regularization can improve manifold learning accuracy for both noisy and small training datasets. The degree to which the performance improves depends on the dataset; for the grayscale image and human skeleton pose datasets in particular, the MCAE outperforms the existing methods by a significant margin.
Limitations and Future Directions: In the current implementation of MCAE, the manifold's extrinsic curvature is minimized globally by using equal weights for all points. However, for manifolds whose curvature varies strongly from region to region, it is difficult to find a proper weight parameter $\alpha$ in (6). Ideally, low- and high-curvature areas of the manifold need to be regularized with higher and lower weights, respectively. Developing a curvature regularization method with different local weights by exploiting local curvature estimation algorithms, e.g., the diffusion-based method (Bhaskar et al., 2022), is an interesting direction for future research.
because it is the decoder that has information about how the manifold lies in the data space. Based on the intuition that a local approximation of the decoder contains local geometric information on the decoded manifold (i.e., the learned manifold), e.g., a local linear approximation of $f$ spans the tangent space, an a priori constructed neighborhood graph is employed to regularize the local approximation of the decoder and hence the decoded manifold. This has shown improved manifold learning accuracy for both noisy and small training datasets; however, as in many other graph-based methods, the performance largely depends on the quality of the graph.
There are graph-free autoencoder regularization methods that regularize not only the encoder but also the decoder. The denoising autoencoder (Vincent et al., 2010) is trained to reconstruct the clean version of a corrupted input with the following loss:
$$\sum_{i=1}^{N} \| x_i - f(g(x_i + \epsilon)) \|^2,$$
for some noise variable $\epsilon$. As a limit case in (Alain & Bengio, 2014), the Jacobian of the reconstruction function is minimized, where the loss is defined as follows:
$$\sum_{i=1}^{N} \| x_i - f(g(x_i)) \|^2 + \alpha \left\| \frac{\partial (f \circ g)}{\partial x}(x_i) \right\|_F^2,$$
where $\alpha$ is the regularization coefficient and $\| \cdot \|_F$ denotes the Frobenius norm. By construction, these methods attempt to learn manifolds robust to noise, but we note that (i) they are designed to be robust to noise during inference after being trained with clean data, so if the training data points themselves are noisy, the robust manifold learning performance decreases, and (ii) their regularization effects are limited to where data points are available.
Since regularizing the decoder that explicitly parameterizes the manifold is important, one may consider minimizing the norm of the decoder's Jacobian as
$$\sum_{i=1}^{N} \| x_i - f(g(x_i)) \|^2 + \alpha \left\| \frac{\partial f}{\partial z}(g(x_i)) \right\|_F^2$$
with the regularization coefficient $\alpha$, or the norm of the decoder's Hessian $\sum_{j,k} \left\| \frac{\partial^2 f}{\partial z_j \partial z_k}(g(x_i)) \right\|^2$. However, these norms do not capture geometric quantities of the learned manifold because they are not coordinate-invariant (reparametrization-invariant), and thus they do not produce any meaningful regularization effects. For example, consider a coordinate transformation $z' = h(z)$ which converts the encoder as $g \to g' = h \circ g$ and the decoder as $f \to f' = f \circ h^{-1}$. The reconstruction loss is invariant since $f \circ g = f' \circ g'$, and hence the learned manifold is invariant, but the regularization term, the norm of the decoder's Jacobian, is different:
$$\left\| \frac{\partial f'}{\partial z'}(g'(x_i)) \right\|_F^2 = \left\| \frac{\partial f}{\partial z}(g(x_i)) \frac{\partial h^{-1}}{\partial z'}(g'(x_i)) \right\|_F^2 \neq \left\| \frac{\partial f}{\partial z}(g(x_i)) \right\|_F^2.$$
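The lack of coordinate-invariance of the Jacobian norm can be seen in a very small example. The sketch below is purely illustrative: it assumes a linear toy decoder $f(z) = Wz$ and a uniform latent rescaling $h(z) = sz$, neither of which comes from the paper.

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])        # toy linear decoder f(z) = W z (assumption)
s = 10.0                          # latent rescaling h(z) = s z (assumption)

def f(z):
    return W @ z

def f_prime(z_prime):
    """Reparametrized decoder f' = f o h^{-1}."""
    return W @ (z_prime / s)

J = W                             # Jacobian of f
J_prime = W / s                   # Jacobian of f' by the chain rule

z = np.array([0.3, -0.7])
same_point = np.allclose(f(z), f_prime(s * z))   # both decoders hit the same point
norm_f = np.linalg.norm(J, "fro")
norm_f_prime = np.linalg.norm(J_prime, "fro")    # equals norm_f / s
```

Both decoders trace out exactly the same plane in $\mathbb{R}^3$, yet the Frobenius norm of the Jacobian shrinks by the factor $s$, so a Jacobian-norm penalty can be driven down without changing the manifold at all.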
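This implies that we can minimize the norm of the decoder's Jacobian just by increasing the norm of the Jacobian of $h^{-1}$, without actually changing the learned manifold. A similar argument holds for the Hessian norm.
Recent works (Chen et al., 2020; Lee et al., 2022) have suggested decoder regularization methods for learning isometric representations that preserve the geometry of the data space. A common goal is to learn a decoder $f : \mathbb{R}^m \to \mathbb{R}^D$ that satisfies
$$\frac{\partial f}{\partial z}(z)^T \frac{\partial f}{\partial z}(z) = cI$$
for all $z \in \nu(\mathbb{R}^m)$ and some positive scalar $c$, where $I$ is the $m \times m$ identity matrix and $\nu(\mathbb{R}^m)$ is the support of the latent space data distribution. Such mappings are formally defined as scaled isometries in (Lee et al., 2022); they are geometry-preserving mappings in the sense that straight lines in the latent space are mapped to geodesic curves on the learned manifold. Regularizing the decoder to be a scaled isometry, beyond finding geometry-preserving representations, has an implicit manifold regularization effect. According to Gauss's Theorema Egregium, which states that "the Gaussian curvature of a surface is invariant under local isometry", for scaled isometries $f$ to exist, the Gaussian curvature of the learned manifold should be the same as that of the Euclidean space (i.e., zero). In other words, it has an implicit intrinsic curvature minimization effect, which is different from the method proposed in this paper that explicitly minimizes the extrinsic curvature.

A.2 PROOF OF PROPOSITION 1

Proof. Let us denote
$$c(\theta, \phi) = \sum_{i,j} (J_\theta^T J_\theta)^{-1}_{ij}\, \mathrm{Tr}\!\left( \frac{\partial E(J_\theta)}{\partial z_i} \frac{\partial E(J_\theta)}{\partial z_j} \right).$$
Given a coordinate transformation $z' = h(z)$ that maps $(g_\phi, f_\theta) \to (g_{\phi'}, f_{\theta'}) = (h \circ g_\phi, f_\theta \circ h^{-1})$, the following transformation rules hold: $J_\theta \to J_{\theta'} = J_\theta \frac{\partial h^{-1}}{\partial z'}$ and $\frac{\partial I}{\partial z} \to \frac{\partial I}{\partial z'} = \frac{\partial I}{\partial z} \frac{\partial h^{-1}}{\partial z'}$ for any scalar-valued function $I(z)$. We note that, since $E(J) = E(JA)$ for any invertible matrix $A$, the embedding is invariant, i.e., $E(J_{\theta'}) = E(J_\theta)$. Let $I_{\alpha\beta}$ denote the $(\alpha, \beta)$-component of $E(J_\theta)$. Then, by using $\mathrm{Tr}(AB) = \sum_{\alpha,\beta} A_{\alpha\beta} B_{\beta\alpha}$ and denoting $\frac{\partial h^{-1}}{\partial z'}$ by $H$,
$$\begin{aligned}
c(\theta', \phi') &= \sum_{i,j} (J_{\theta'}^T J_{\theta'})^{-1}_{ij} \sum_{\alpha,\beta} \frac{\partial I_{\alpha\beta}}{\partial z'_i} \frac{\partial I_{\beta\alpha}}{\partial z'_j} \\
&= \sum_{\alpha,\beta} \frac{\partial I_{\alpha\beta}}{\partial z'} (J_{\theta'}^T J_{\theta'})^{-1} \frac{\partial I_{\beta\alpha}}{\partial z'}^T \\
&= \sum_{\alpha,\beta} \frac{\partial I_{\alpha\beta}}{\partial z} H (H^T J_\theta^T J_\theta H)^{-1} H^T \frac{\partial I_{\beta\alpha}}{\partial z}^T \\
&= \sum_{\alpha,\beta} \frac{\partial I_{\alpha\beta}}{\partial z} (J_\theta^T J_\theta)^{-1} \frac{\partial I_{\beta\alpha}}{\partial z}^T \\
&= \sum_{i,j} (J_\theta^T J_\theta)^{-1}_{ij} \sum_{\alpha,\beta} \frac{\partial I_{\alpha\beta}}{\partial z_i} \frac{\partial I_{\beta\alpha}}{\partial z_j} = c(\theta, \phi).
\end{aligned}$$
Since the latent points transform as $z' = h(z)$, the expectation of $c$ over $p_{\phi'}(z')$ equals its expectation over $p_\phi(z)$, and hence $C(\theta', \phi') = C(\theta, \phi)$.

A.3 ON THE EXTRINSIC CURVATURE MEASURE

In this section, we derive the expression of our extrinsic curvature measure
$$\sum_{i,j} (J^T J)^{-1}_{ij}\, \mathrm{Tr}\!\left( \frac{\partial \big(J (J^T J)^{-1} J^T\big)}{\partial z_i} \frac{\partial \big(J (J^T J)^{-1} J^T\big)}{\partial z_j} \right)$$
for a one-dimensional manifold, i.e., a curve, embedded in $\mathbb{R}^D$. Let $x : \mathbb{R} \to \mathbb{R}^D$ be a smooth curve and assume that it is parameterized by arc length, i.e., $\left\| \frac{\partial x}{\partial z} \right\|^2 = J^T J = 1$. Then, the curvature measure becomes
$$\begin{aligned}
\mathrm{Tr}\!\left( \left( \frac{\partial (J J^T)}{\partial z} \right)^2 \right)
&= \mathrm{Tr}\!\left( \left( \frac{\partial}{\partial z} \left( \frac{\partial x}{\partial z} \frac{\partial x}{\partial z}^T \right) \right)^2 \right)
= \mathrm{Tr}\!\left( \left( \frac{\partial^2 x}{\partial z^2} \frac{\partial x}{\partial z}^T + \frac{\partial x}{\partial z} \frac{\partial^2 x}{\partial z^2}^T \right)^2 \right) \\
&= \mathrm{Tr}\!\left( \frac{\partial^2 x}{\partial z^2} \frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2} \frac{\partial x}{\partial z}^T + 2 \frac{\partial^2 x}{\partial z^2} \frac{\partial x}{\partial z}^T \frac{\partial x}{\partial z} \frac{\partial^2 x}{\partial z^2}^T + \frac{\partial x}{\partial z} \frac{\partial^2 x}{\partial z^2}^T \frac{\partial x}{\partial z} \frac{\partial^2 x}{\partial z^2}^T \right) \\
&= \frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2} \frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2} + 2 \frac{\partial x}{\partial z}^T \frac{\partial x}{\partial z} \frac{\partial^2 x}{\partial z^2}^T \frac{\partial^2 x}{\partial z^2} + \frac{\partial^2 x}{\partial z^2}^T \frac{\partial x}{\partial z} \frac{\partial^2 x}{\partial z^2}^T \frac{\partial x}{\partial z} \\
&= 2 \left( \frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2} \right)^2 + 2 \frac{\partial^2 x}{\partial z^2}^T \frac{\partial^2 x}{\partial z^2}.
\end{aligned}$$
Since $\frac{\partial}{\partial z} \left\| \frac{\partial x}{\partial z} \right\| = 0$ implies that $\frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2} = 0$, our curvature measure for an arc-length parameterized curve $x(z)$ simplifies to $2 \left\| \frac{\partial^2 x}{\partial z^2} \right\|^2$, that is, twice the squared norm of the second derivative. Because the classical curvature of an arc-length parameterized curve is $\kappa = \left\| \frac{\partial^2 x}{\partial z^2} \right\|$, the measure equals $2\kappa^2$ and thus agrees with the classical definition of the curvature of a curve up to squaring and a constant factor.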
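As a quick sanity check of the derivation above, consider a circle of radius $r$ in the plane, parameterized by arc length. This worked example is an illustration added for concreteness and is not part of the original derivation; it substitutes the circle into the final expression $2(\frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2})^2 + 2 \frac{\partial^2 x}{\partial z^2}^T \frac{\partial^2 x}{\partial z^2}$.

```latex
x(z) = r \begin{pmatrix} \cos(z/r) \\ \sin(z/r) \end{pmatrix}, \qquad
\frac{\partial x}{\partial z} = \begin{pmatrix} -\sin(z/r) \\ \cos(z/r) \end{pmatrix}, \qquad
\frac{\partial^2 x}{\partial z^2} = -\frac{1}{r} \begin{pmatrix} \cos(z/r) \\ \sin(z/r) \end{pmatrix},
\\[6pt]
2 \left( \frac{\partial x}{\partial z}^T \frac{\partial^2 x}{\partial z^2} \right)^2
+ 2\, \frac{\partial^2 x}{\partial z^2}^T \frac{\partial^2 x}{\partial z^2}
= 0 + \frac{2}{r^2} = 2 \kappa^2, \qquad \kappa = \frac{1}{r}.
```

The first term vanishes because the velocity and acceleration of an arc-length parameterized curve are orthogonal, and the remaining term is twice the squared classical curvature $1/r$ of the circle.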

A.4 EXPERIMENT DETAILS

Grayscale Image Data: The image size is 28 × 28 and the pixel values are normalized to lie between 0 and 1. The encoder and decoder are two-layer fully connected neural networks with ELU activation functions and 512 nodes per layer. The output layer is linear for the encoder and sigmoid for the decoder. For clean dataset cases, we use the following early stopping criterion in training: we stop the training if the mean reconstruction error on the validation dataset increases 10 times in a row; then we use the best model (i.e., the one with the lowest validation error) for evaluation. For noisy dataset cases, assuming that we do not have access to the clean dataset during training, we do not use early stopping and train the model for a number of epochs sufficient for convergence (1000 epochs). The number of test data is 60000. For evaluation, we use clean test data for the noisy training dataset cases as well. The batch size is 100 and the learning rate is 0.001.
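The early stopping rule described above can be sketched as a small helper. This is one possible reading of the criterion, assuming that "increases 10 times in a row" means the validation error rises relative to the previous epoch for 10 consecutive epochs; the function name `early_stopping_epoch` and the toy error sequence are hypothetical.

```python
def early_stopping_epoch(val_errors, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors.

    Training stops once the validation error has increased `patience` times in a
    row; the model from `best_epoch` (lowest validation error) is then used.
    """
    best_epoch, best_error = 0, float("inf")
    consecutive_increases = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        if epoch > 0 and err > val_errors[epoch - 1]:
            consecutive_increases += 1
        else:
            consecutive_increases = 0   # the streak is broken
        if consecutive_increases >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch

# Toy sequence: the error falls for three epochs, then rises steadily.
toy_errors = [5, 4, 3] + list(range(4, 16))
stop_epoch, best_epoch = early_stopping_epoch(toy_errors)
```

In a real training loop the same bookkeeping would run once per epoch, with a model checkpoint saved whenever the best validation error improves.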

SVHN & CIFAR10 Image Data: The image size is 32 × 32 and the pixel values are normalized to lie between 0 and 1. For the noisy training dataset experiments, we add noise as follows: (i) for Gaussian noise, the standard deviation is 0.1, (ii) for Shot noise, we multiply by 0.15 the noise variables sampled from Poisson distributions whose rates $\lambda$ are the image pixel values, and (iii) for Impulse noise, we randomly add 1 to each pixel with 5% probability. The encoder and decoder are convolutional and transposed convolutional neural networks with ReLU activation functions. Denoting a convolution layer with input channel size $c_i$, output channel size $c_o$, kernel size $k$, stride $s$, and padding $p$ by Conv2d($c_i$, $c_o$, $k$, $s$, $p$), and a transposed convolution layer by ConvTrans2d($c_i$, $c_o$, $k$, $s$, $p$), the sequence of layers Conv2d(3, 128, 4, 2)-Conv2d(128, 256, 4, 2)-Conv2d(256, 512, 4, 2)-Conv2d(512, 1024, 2, 2)-Conv2d(1024, 64, 1) is used for the encoder and ConvTrans2d(64, 1024, 8)-ConvTrans2d(1024, 512, 4, 2, 1)-ConvTrans2d(512, 256, 4, 2, 1)-ConvTrans2d(512, 3, 1) for the decoder. The output layer is linear for the encoder and sigmoid for the decoder. For clean dataset cases, we use the following early stopping criterion in training: we stop the training if the mean reconstruction error on the validation dataset increases 10 times in a row; then we use the best model (i.e., the one with the lowest validation error) for evaluation. For noisy dataset cases, assuming that we do not have access to the clean dataset during training, we do not use early stopping and train the model for a number of epochs sufficient for convergence (100 epochs). The number of test data is 63257 for SVHN and 10000 for CIFAR10. For evaluation, we use clean test data for the noisy training dataset cases as well. The batch size is 8 and the learning rate is 0.0001.
Human Skeleton Pose Data: From the NTU RGB+D dataset, a set of human pose skeleton data consisting of 25 key points is extracted and pre-processed to be aligned. Specifically, 10000 poses are extracted from each action class (a total of 60 action classes are used), and they are rotated and translated so that the direction from key point 1 to key point 2 becomes the z-axis, the direction from key point 1 to key point 13 becomes the y-axis, and key point 2 becomes the origin. The encoder and decoder are two-layer fully connected neural networks with ELU activation functions and 512 nodes per layer. The output layers are linear for both the encoder and decoder. For the clean dataset cases, we use the following early stopping criterion in training: we stop training if the mean reconstruction error on the validation dataset increases 10 times in a row; we then use the best model (i.e., the one with the lowest validation error) for evaluation. For the noisy dataset cases, assuming that we do not have access to the clean dataset during training, we do not use early stopping and instead train the model for a number of epochs large enough for convergence (5000 epochs). The number of test data is 9000. For evaluation, we use clean test data for the noisy training dataset cases as well. The batch size is 100 and the learning rate is 0.0001.

Qualitative Results: Figures 11, 12, 13, 14, 15, and 16 show additional de-noising results for the image data and the human skeleton pose data.

Table 5 shows the per-batch computation time comparisons between the reconstruction loss term and the curvature term in (7). Although the curvature computation has become feasible through stochastic trace estimation, it still takes much longer than the original reconstruction loss term. In particular, for the forward computation in the Conv net case, the curvature computation is roughly 100 to 150 times slower than the reconstruction term computation.
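To make the idea of stochastic trace estimation concrete, the sketch below shows a generic Hutchinson-type estimator, which uses the identity tr(A) = E[v^T A v] for v ∼ N(0, I) and needs only matrix-vector products with A. It illustrates the general technique only and is not the implementation behind Table 5; the names `hutchinson_trace` and `matvec` are hypothetical.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_samples, rng):
    """Estimate tr(A) from matrix-vector products v -> A v.

    Uses tr(A) = E[v^T A v] with v drawn from a standard normal distribution,
    so the matrix A itself never has to be formed explicitly.
    """
    total = 0.0
    for _ in range(num_samples):
        v = rng.standard_normal(dim)
        total += v @ matvec(v)
    return total / num_samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((20, 20))
    A = B @ B.T / 20.0  # a small symmetric test matrix
    estimate = hutchinson_trace(lambda v: A @ v, dim=20, num_samples=5000, rng=rng)
    print("estimate:", estimate, "exact:", np.trace(A))
```

In the autoencoder setting, `matvec` would be replaced by Jacobian-vector products computed with automatic differentiation, which is where most of the remaining cost arises.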

To see which part of the curvature measure

$$C(\theta, \phi) = \mathbb{E}_{z \sim p_\phi(z),\, v \sim \mathcal{N}(0, I_m),\, w \sim \mathcal{N}(0, I_D)} \left[ v^T \frac{\partial (w^T E(J_\theta))}{\partial z} \frac{\partial (E(J_\theta) w)}{\partial z} G_\theta^{-1} v \right]$$

accounts for most of the computational cost, we compare the computation times of the following operations: (i) the Riemannian metric $G_\theta = J_{f_\theta}^T J_{f_\theta}$, (ii) the inverse of $G_\theta$, (iii) the Jacobian-vector product for $\frac{\partial (E(J_\theta) w)}{\partial z} v$, and (iv) the Jacobian-vector product for $\frac{\partial (E(J_\theta) w)}{\partial z} (G_\theta^{-1} v)$. Table 6 shows the per-batch computation times of these intermediate operations in the curvature measure (7) for the Conv net case. As can be seen, the inverse computation takes up most of the total computation time.

To reduce the computation time of the matrix inverse, one can consider an approximate inverse computation method. For example, given $G_\theta$, define a function $f : \mathbb{R}^{m \times m} \to \mathbb{R}^{m \times m}$ by

$$f(X) = X^{-1} - G_\theta.$$

To find the root of $f$, we can use the standard Newton-Raphson method,

$$X_{n+1} = 2X_n - X_n G_\theta X_n,$$

which is known as the Newton-Schulz iteration for matrix inversion. We obtain an approximation of $G_\theta^{-1}$ by applying this update iteratively, and the approximation gets closer to the true inverse as the number of iterations increases. Table 7 shows the per-batch computation times and percent errors of the approximate matrix inverse $G_\theta^{-1}$ in the curvature measure (7) as the number of iterations increases, for the Conv net case. The percent error is computed as $100 \cdot \|G^{-1}_{\mathrm{true}} - G^{-1}_{\mathrm{est}}\|_F / \|G^{-1}_{\mathrm{true}}\|_F$. When the number of iterations is set to 100, the percent error is only 0.01% while the computation time is significantly reduced, from 0.28353 s to 0.005669 s. We highly recommend using the approximate matrix inverse when the latent space dimension is high.
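The Newton-Schulz update and the percent error above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code used to produce Table 7: the function names are hypothetical, and because the text does not specify the initial guess, we assume the common choice $X_0 = G^T / (\|G\|_1 \|G\|_\infty)$, for which the iteration converges for any nonsingular $G$.

```python
import numpy as np


def newton_schulz_inverse(G, num_iters):
    """Approximate the inverse of G with X_{n+1} = 2 X_n - X_n G X_n."""
    # Assumed initialization (not given in the text): scaled transpose of G.
    X = G.T / (np.linalg.norm(G, 1) * np.linalg.norm(G, np.inf))
    for _ in range(num_iters):
        X = 2.0 * X - X @ G @ X
    return X


def percent_error(inv_true, inv_est):
    """Relative Frobenius-norm error of the estimated inverse, in percent."""
    return 100.0 * (np.linalg.norm(inv_true - inv_est, "fro")
                    / np.linalg.norm(inv_true, "fro"))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.standard_normal((100, 8))  # stand-in for a decoder Jacobian
    G = J.T @ J                        # Riemannian metric G = J^T J
    G_inv = np.linalg.inv(G)
    for n in (1, 2, 5, 10, 30):
        print(n, percent_error(G_inv, newton_schulz_inverse(G, n)))
```

Each iteration costs only two matrix products, which suits batched GPU execution. The number of iterations needed for a given accuracy grows with the condition number of $G_\theta$.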



A manifold's intrinsic properties are defined without reference to any embedding. A developable surface is a surface that can be formed by bending or rolling a planar surface without stretching or tearing.



Figure 2: Learned manifold becomes flatter as the regularization coefficient α increases. Upper: Learned data manifolds of 1d sin-curve and noisy training data points. Lower: Learned data manifolds of 1d S-curve projected to the 2-sphere and sparse training data points.

Figure 3(b) shows how the reconstruction MSE for clean test data varies as a function of the extrinsic curvature of the manifolds learned by IRAE and MCAE. As the curvature decreases, i.e., as the regularization coefficient increases (from left to right), the test reconstruction MSE decreases, reaches a minimum, and then increases again. We note that the curve for MCAE lies below that of IRAE, implying that MCAE can learn a more accurate manifold than IRAE.

Figure 3: (a) Learned manifold by IRAE becomes flatter as the regularization coefficient α increases. (b) Test data reconstruction MSE (i.e., manifold learning accuracy) as a function of the extrinsic curvature obtained by IRAE and MCAE.

Figure 4: Learning by DAE and RCAE for the examples in Figure 2.

Comparison to Denoising and Reconstruction Contractive Autoencoders: The Denoising Autoencoder (DAE) (Vincent et al., 2010) and the Reconstruction Contractive Autoencoder (RCAE) (Alain & Bengio, 2014) are intuitive and straightforward regularization methods for learning manifolds robust to input perturbations. As shown in Figure 4 (Upper), the DAE and RCAE learn manifolds that are robust to noise to some extent. However, as shown in Figure 4 (Lower), for the projected S-curve example in Figure 2 (Lower), they still learn wrong manifolds in regions where there are fewer data and do not improve upon the vanilla autoencoder. On the other hand, the MCAE explicitly regularizes the learned manifold to have a small curvature globally and improves the manifold learning accuracy.

Figure 5: De-noising examples (noise level 0.3).

Figure 6: Test set MSEs as a function of the number of training (80%) + validation (20%) data, the lower the better.

Figure 7: Test set Peak Signal-to-Noise Ratios (PSNR) as a function of the noise level, the higher the better.

Figure 8: Corrupted SVHN and CIFAR10 images.

SVHN & CIFAR10 Image: We compare the manifold learning performance of MCAE with that of other regularized autoencoders on the SVHN and CIFAR10 image datasets, for both clean and corrupted training datasets. We use convolutional and transposed convolutional neural networks for the encoder and decoder with ReLU activation functions; the latent space dimension is 64 and the number of training data is 8000. For the corrupted training dataset cases, we add three different types of noise: (i) Gaussian, (ii) Shot, and (iii) Impulse noise, adopted from (Hendrycks & Dietterich, 2019); see Figure 8.

Figure 9: Density plots of the log-normalized local curvatures of manifolds learned by vanilla AEs.

Table 2 shows the test set MSEs for experiments with the clean training datasets, and Table 3 shows the PSNRs for experiments with the corrupted training datasets, where in both cases the metrics are computed with the clean test data. From the results, we note that (i) MCAE shows the second or third best results, (ii) MCAE does not improve upon the vanilla AE for the SVHN clean training dataset case, and (iii) for the corrupted training dataset cases, RCAE produces better results than MCAE, unlike for the grayscale image data. Overall, compared to the grayscale image data, the minimum curvature regularization is less effective for SVHN and CIFAR10. One possible interpretation is related to the limitation of MCAE (discussed in the conclusion section): the SVHN and CIFAR10 manifolds have locally very different curvatures, and thus it is difficult to find a proper constant regularization coefficient α in (6). If we use an α large enough to correctly learn the low curvature areas of the manifold, then the high curvature areas can be overly flattened, and vice versa. Figure 9 shows the density plots of the log-normalized local curvatures of the manifolds learned by vanilla autoencoders, i.e., $\log(\kappa_i) - \log(\bar{\kappa})$, $i = 1, \ldots, N$, where $\kappa_i$ is the local curvature at the $i$-th training data point and $\log(\bar{\kappa}) = \frac{1}{N} \sum_i \log(\kappa_i)$; this quantity is invariant to the scale of the mean curvature. As shown in Figure 9, the variance of the SVHN manifold's local curvature is larger than those of the others, which supports the above interpretation.
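The log-normalization used for Figure 9 and its scale invariance can be checked with a short NumPy sketch. The curvature values below are synthetic stand-ins, not values from the experiments, and the function name is hypothetical.

```python
import numpy as np


def log_normalized_curvature(kappa):
    """Return log(kappa_i) minus the mean of log(kappa) over all points."""
    log_kappa = np.log(kappa)
    return log_kappa - log_kappa.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kappa = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic curvatures
    a = log_normalized_curvature(kappa)
    b = log_normalized_curvature(25.0 * kappa)  # same manifold, rescaled
    # Rescaling all curvatures by a constant leaves the normalized values
    # unchanged, so only the spread of the local curvature is compared.
    print(np.allclose(a, b), float(a.var()))
```

Because a global rescaling cancels out, the variance of these values compares how unevenly curvature is distributed across datasets whose manifolds have different overall curvature scales.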

Figure 10: Human skeleton pose de-noising examples obtained by reconstructing noisy input data (noise level 0.05). Example poses are from the action class "eat meal".

Figure 11: De-noising examples of grayscale image data (noise level 0.1).

Figure 12: De-noising examples of grayscale image data (noise level 0.2).

Figure 13: De-noising examples of grayscale image data (noise level 0.3).

Figure 14: De-noising examples of SVHN data.

Figure 15: De-noising examples of CIFAR10 data.

Figure 16: De-noising examples of human pose data.

Averages and standard errors of the test data set reconstruction MSEs (over 5 runs) for the sin-curve example in Figure 2 (Upper) with Gaussian noise of standard deviations 0.1, 0.2, and 0.3; the lower the better. The best results are marked in bold. The numbers are in units of $10^{-3}$.

As seen from the above examples, besides the proposed MCAE, the IRAE, DAE, and RCAE all exhibit robustness to noise. We quantitatively compare the robust manifold learning performance given noisy input training data using the sin-curve example in Figure 2 (Upper) with various noise levels, i.e., Gaussian noise with standard deviations of 0.1, 0.2, and 0.3. In addition to the IRAE, DAE, and RCAE, we compare the MCAE with the vanilla Autoencoder (AE) and other regularized autoencoders such as the Variational Autoen-

Table 2: Test set MSEs of autoencoders trained with clean datasets; the lower the better. The best and second best results are marked in red and blue, respectively.

Table 3: Test data set PSNRs with various noise types; the higher the better. The best and second best results are marked in red and blue, respectively.



Table 5: Per-batch computation time comparisons. For the FC net with 1 × 28 × 28 images, the latent space dimension is 16 and the batch size is 100; for the Conv net with 3 × 32 × 32 images, the latent space dimension is 64 and the batch size is 8 (due to GPU memory limitations).

Table 6: Per-batch computation times of the intermediate operations in the curvature measure (7) for the Conv net with 3 × 32 × 32 images and a 64-dimensional latent space.

Table 7: Per-batch computation times and percent errors of the approximate matrix inverse $G_\theta^{-1}$ in the curvature measure (7) as the number of iterations increases, for the Conv net with 3 × 32 × 32 images and a 64-dimensional latent space.

A APPENDIX

The appendix is organized as follows: (A.1) Related Works, (A.2) Proof of Proposition 1, (A.3) On the Extrinsic Curvature, (A.4) Experiment Details, (A.5) Additional Experiment Results, and (A.6) Computational Complexity.

A.1 RELATED WORKS: REGULARIZED AUTOENCODERS

The framework of autoencoding, together with recent advances in deep learning techniques for approximating arbitrarily complex functions, successfully addresses the manifold learning problem (Kramer, 1991). The core idea is to learn two mappings, an encoder $g : \mathbb{R}^D \to \mathbb{R}^m$ and a decoder $f : \mathbb{R}^m \to \mathbb{R}^D$, both approximated with deep neural networks, so that their composition reconstructs the given data points and the data points approximately lie on the image of the decoder, which we refer to as the learned manifold.

Many existing autoencoder regularization methods have focused on the representation learning perspective of autoencoders and studied how to regularize the latent space distributions for purposes such as sampling, topology and geometry preservation, clustering, or capturing hierarchical structure (Rifai et al., 2011; Kingma & Welling, 2013; Wang et al., 2014; Makhzani et al., 2015; Tolstikhin et al., 2017; Chen et al., 2016; Tomczak & Welling, 2018; Klushyn et al., 2019; Moor et al., 2020; Schönenberger et al.; Duque et al., 2020; Chen et al., 2021); since the latent space distributions are entirely determined by the encoders, these methods mostly focus on regularizing the encoders but not the decoders.

As discovered in (Lee et al., 2021), to learn an accurate manifold in the presence of data noise or given a small number of training data, regularization of the decoder is indeed more important,

