PARAMETRIZING PRODUCT SHAPE MANIFOLDS BY COMPOSITE NETWORKS

Abstract

Parametrizations of data manifolds in shape spaces can be computed using the rich toolbox of Riemannian geometry. This, however, often comes with high computational costs, which raises the question if one can learn an efficient neural network approximation. We show that this is indeed possible for shape spaces with a special product structure, namely those smoothly approximable by a direct sum of low-dimensional manifolds. Our proposed architecture leverages this structure by separately learning approximations for the low-dimensional factors and a subsequent combination. After developing the approach as a general framework, we apply it to a shape space of triangular surfaces. Here, typical examples of data manifolds are given through datasets of articulated models and can be factorized, for example, by a Sparse Principal Geodesic Analysis (SPGA). We demonstrate the effectiveness of our proposed approach with experiments on synthetic data as well as manifolds extracted from data via SPGA.

1. INTRODUCTION

Modeling collections of shapes as data on Riemannian manifolds has enabled the usage of a rich set of mathematical tools in areas such as computer graphics and vision, medical imaging, computational biology, and computational anatomy. For example, Principal Geodesic Analysis, a generalization of Principal Component Analysis, can be used to parametrize submanifolds approximating given data points while preserving structure of the data such as its invariance to rigid motion. The evaluation of such a parametrization, however, typically comes at a high computational cost as the Riemannian exponential, mapping infinitesimal shape variations to shapes, has to be evaluated. This motivates trying to learn an efficient approximation for these parametrizations. Direct application of deep neural networks (NNs), however, proves ineffective for high-dimensional spaces with strongly nonlinear variations. Therefore, we consider more structured shape manifolds, namely, we assume that they can be approximated by an affine sum of low-dimensional submanifolds. In computer graphics, typical examples of data manifolds are given through datasets of articulated models, e.g. human bodies, faces, or hands. Then, the desired structure of an affine sum of factor manifolds can be produced, for example, by a Sparse Principal Geodesic Analysis (SPGA). Motivated by this, we exploit the data manifolds' approximability with such affine sums: We separately approximate the exponential map on the factor manifolds by fully connected NNs and the subsequent combination of factors by a convolutional NN to yield our approximate parametrization. In formulas, based on a judiciously chosen decomposition v = v_1 + … + v_J, our aim is to approximate the Riemannian exponential exp_z(v) by Ψ^ζ(ψ^ζ_1(v_1), …, ψ^ζ_J(v_J)), where Ψ^ζ is a NN and the ψ^ζ_j are further NNs approximating the Riemannian exponential exp_z on the low-dimensional factor manifolds.
We develop our approach focusing on the shape space of discrete shells, where shapes are given by triangle meshes and the manifold is equipped with an elasticity-based metric. In principle, our approach is also applicable to other shape spaces such as manifolds of images, and we will include remarks on how we propose this could work. We evaluate our approach with experiments on data manifolds of triangle meshes, both synthetic ones and ones extracted from data via SPGA, and we demonstrate that the proposed composite network architecture outperforms a monolithic fully connected network architecture as well as an approach based on the affine combination of the factors. We see this work as a first step towards using NNs to accelerate the complex computations of shape manifold parametrizations. Therefore, we think that our approach has great potential to stimulate further research in this direction, which could in turn advance the applications of Riemannian shape spaces.

Contributions In summary, the contributions of this paper are
• combining the Riemannian exponential map on shape spaces and neural network methodology for the efficient parametrization of shape space data manifolds,
• demonstrating the applicability of such an approach for data manifolds which can be smoothly approximated via direct sums of low-dimensional submanifolds,
• using a combination of fully connected neural networks for the factorwise Riemannian exponential maps and a convolutional network to couple them,
• verifying that such a setup works well with existing methods to construct product manifolds, such as Sparse Principal Geodesic Analysis, and
• showing that the composite network architecture outperforms alternative approaches.

2. RELATED WORK

Shape Spaces Shape spaces are manifolds in which each point is a shape, e.g., a triangle mesh or an image. A Riemannian metric on such a space provides means to define distances between shapes, to interpolate between shapes by computing shortest geodesic paths, and to explore the space by constructing the geodesic curve starting from a point into a given direction. Shape spaces have proven useful for applications in areas such as computer graphics (Kilian et al., 2007; Heeren et al., 2012; Wang et al., 2018) and vision (Heeren et al., 2018; Xie et al., 2014), medical imaging (Kurtek et al., 2011b; Samir et al., 2014; Kurtek et al., 2016; Bharath et al., 2018), computational biology (Laga et al., 2014), and computational anatomy (Miller et al., 2006; Pennec, 2009; Kurtek et al., 2011a). For an introduction to the topic, we refer to the textbook of Younes (2010).

Shape Space of Meshes

Triangle meshes are widely used to represent shapes in computer graphics and vision. Riemannian metrics on shape spaces of triangle meshes can be defined geometrically, using norms on function spaces on the meshes (Kilian et al., 2007), or physics-based, considering the meshes as thin shells and measuring the dissipation required to deform the shells (Heeren et al., 2012; 2014). The computation of geodesic curves in these spaces requires numerically solving high-dimensional nonlinear variational problems, which can be costly. For shape interpolation problems, model reduction methods can be used to efficiently find approximate solutions (Brandt et al., 2016; von Radziewsky et al., 2016).

Statistics in Shape Spaces Data in a Riemannian shape space can be analyzed using Principal Geodesic Analysis (PGA) (Fletcher et al., 2004; Pennec, 2006). Analogous to principal component analysis (PCA) for data in Euclidean spaces, PGA can construct low-dimensional latent representations that preserve much of the variability in the data. This is achieved by mapping the data with a nonlinear mapping, the Riemannian logarithmic map, from the manifold to a linear space, the tangent space at the mean shape, and computing a PCA there. Latent variables of the PCA are then mapped with the inverse mapping, the Riemannian exponential map, onto the manifold, so that the latent space describes a submanifold of the shape space. A PGA in shape spaces of meshes was introduced in (Heeren et al., 2018) and used to obtain a low-dimensional, nonlinear, rigid body motion invariant description of shape variation from data.

Sparse PGA While PCA modes involve all variables of the data, Sparse Principal Component Analysis (Zou et al., 2006) constructs modes that involve just a few variables. This is achieved by adding a sparsity-encouraging term to the objective that defines the modes. Based on this idea, Neumann et al.
(2013) proposed a scheme for extracting Sparse Localized Deformation Components (SPLOCS) from a dataset of non-rigid shapes. Since SPLOCS are linear modes, they are well-suited to accurately describe small deformations such as face motions. To increase the range of deformations and to compensate for linearization artifacts, Huang et al. (2014) integrated SPLOCS with gradient-domain techniques and Wang et al. (2017; 2021) with edge lengths and dihedral angles. In (Sassen et al., 2020b), a Sparse Principal Geodesic Analysis (SPGA) was introduced. Similar to PGA, the SPGA modes are nonlinear and rigid motion invariant. On top of that, however, the SPGA modes describe localized deformations. Due to the localization, many pairs of SPGA modes have disjoint support and are therefore independent of each other. We want to take advantage of this property to effectively learn the reconstruction of points in the manifold from their latent representation through an adapted network structure.

Product Manifolds This work is focused on the parametrization of the product manifold structures obtained from SPGA. Alternative approaches for extracting product structures of data manifolds include the approach of Fumero et al. (2021), which finds a product structure of a manifold based on a geometric notion of disentanglement, the Geometric Manifold Component Estimator presented in (Pfau et al., 2020), which uses a Lie group of transformations to generate a symmetry-based disentanglement of data manifolds, and the approach of Zhang et al. (2021), which decomposes the eigenfunctions of the Laplace-Beltrami operator of a manifold in order to find a product structure in the manifold. To the best of our knowledge, there are no works focusing on using networks to approximate the parametrization of such product manifolds in Riemannian shape spaces.

3. PRELIMINARIES AND NOTATION

We introduce step by step the background necessary to understand the context of our work. We also provide an overview of our notation in Appendix E.

Riemannian Shape Space A shape space is a manifold S ⊂ R^n whose elements are shapes. These could, for example, be images, curves, or surfaces described in various ways. Endowing such a shape manifold with the structure of a Riemannian manifold, i.e. a (smoothly) z-dependent inner product g_z on the tangent space T_z S at each point z ∈ S, provides us with a rich set of geometric tools. For example, we can use geodesics c : [0, 1] → S, i.e. arc length parametrized locally shortest paths, as a mathematical formulation of shape interpolation. The Riemannian logarithm log_z z̃ ∈ T_z S is then defined as the time derivative ċ(0) of the geodesic c interpolating between c(0) = z and c(1) = z̃. This allows us to interpret the tangent space T_z S as the linear space of infinitesimal shape variations. Lastly, the Riemannian exponential map is the inverse of the logarithm, which means that for a tangent vector v ∈ T_z S one 'shoots' a geodesic in its direction, i.e. constructs a geodesic curve c with initial velocity ċ(0) = v to obtain exp_z v := c(1) ∈ S. The exponential map allows us to transfer operations from infinitesimal shape variations back to actual shapes. We provide more details on Riemannian operators and their discretization in Appendix C.

Principal Geodesic Analysis With these tools at hand, one can use Principal Geodesic Analysis (PGA) to compute submanifolds of the shape space approximating given data points {z_i} ⊂ S: One first computes their Riemannian center of mass z̄, i.e. the point with minimal sum of squared distances to all data points. Then one computes the logarithms v_i = log_z̄ z_i and thus linearizes the approximation problem at the center of mass by passing to the tangent space T_z̄ S.
In T_z̄ S, one uses classical Principal Component Analysis (PCA) to compute the m dominant modes {u_j}, whose span U is the best m-dimensional subspace approximating the logarithms v_i. The submanifold M approximating the data points is then parametrized by the exponential map, i.e. M := exp_z̄ U.
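The PGA pipeline sketched above (center of mass, logarithms, PCA in the tangent space, exponential map) can be illustrated on a manifold where the logarithm and exponential are available in closed form. The following sketch uses the unit sphere S² in place of a shape space; all function names are our own, and this is purely illustrative, not the discretization used in the paper.

```python
import numpy as np

def sphere_exp(p, v):
    # Riemannian exponential on the unit sphere: shoot a geodesic from p with velocity v
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def sphere_log(p, q):
    # Riemannian logarithm: initial velocity of the geodesic from p to q
    c = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(c)
    return np.zeros_like(p) if theta < 1e-12 else theta * (q - c * p) / np.sin(theta)

def center_of_mass(points, iters=50):
    # Riemannian center of mass via fixed-point iteration
    z = points[0].copy()
    for _ in range(iters):
        z = sphere_exp(z, np.mean([sphere_log(z, q) for q in points], axis=0))
    return z

def pga(points, m):
    z = center_of_mass(points)
    V = np.stack([sphere_log(z, q) for q in points])   # logarithms v_i in T_z S
    _, _, Vt = np.linalg.svd(V, full_matrices=False)   # PCA in the tangent space
    modes = Vt[:m]                                     # m dominant modes u_j
    return z, modes, V @ modes.T                       # latent coordinates

# data near a great circle on S^2, perturbed slightly out of plane
rng = np.random.default_rng(0)
t = rng.uniform(-0.5, 0.5, 40)
pts = np.stack([np.cos(t), np.sin(t), 0.02 * rng.standard_normal(40)], axis=1)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
z, modes, coeffs = pga(pts, m=1)
recon = np.stack([sphere_exp(z, c @ modes) for c in coeffs])
```

A single mode suffices here because the data is concentrated near a one-dimensional geodesic submanifold (a great circle), mirroring how PGA captures dominant shape variation with few modes.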

Sparse Principal Geodesic Analysis

The coordinates on Riemannian shape spaces, e.g. pixel values, often correspond to different spatial locations of the shape, e.g. pixel positions, such that it makes sense to consider their support and sparsity: Sassen et al. (2020b) introduced Sparse Principal Geodesic Analysis (SPGA) to compute spatially localized dominant modes with widely disjoint supports. Those modes are easier to interpret semantically and in addition allow for efficient approximations of the exponential map. SPGA is performed by adding an appropriate sparsity-inducing regularization functional R to the variational formulation of PCA to compute such sparse deformation modes. The concrete choice of R depends on the specific shape space. Hence, for a given set V ∈ R^{n×K} of K logarithms, the SPGA problem to compute the first m dominant modes U ∈ R^{n×m} reads

    minimize_{U ∈ R^{n×m}, W ∈ R^{m×K}}  ‖V − U W‖²_g + λ R(U)
    subject to  u_j ∈ T_z̄ S  and  |w_j|_∞ ≤ 1  for j ∈ {1, …, m}.    (1)

The bound on the magnitude of the weights ensures that the coordinates of the u_j do not shrink while the weights grow inversely proportionally. As before, the manifold approximating the data points is then parametrized using the exponential map. It can be equipped with a product structure by grouping the modes and applying the exponential map to the corresponding subspaces. This will be elucidated more in Section 4.
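To make the structure of problem (1) concrete, the following sketch evaluates the sparse objective and performs simple alternating proximal/projected gradient updates in U and W. This is a toy stand-in, not the numerical scheme of Sassen et al. (2020b): the Frobenius norm replaces the g-norm, the tangent-space constraint is omitted, and all names are ours.

```python
import numpy as np

def spga_objective(V, U, W, lam):
    # ||V - U W||^2 + lam * sum_j ||u_j||_1  (Frobenius norm as stand-in for g)
    return np.linalg.norm(V - U @ W, 'fro') ** 2 + lam * np.abs(U).sum()

def soft_threshold(X, t):
    # proximal operator of the L1 term
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def alternating_step(V, U, W, lam, step=2e-3):
    # proximal gradient step in U (soft-thresholding realizes the L1 regularizer) ...
    U = soft_threshold(U - step * 2.0 * (U @ W - V) @ W.T, step * lam)
    # ... and a projected gradient step in W enforcing |w_j|_inf <= 1
    W = np.clip(W - step * 2.0 * U.T @ (U @ W - V), -1.0, 1.0)
    return U, W

rng = np.random.default_rng(1)
V = rng.standard_normal((50, 20))                      # K = 20 logarithms in R^50
U0, _, _ = np.linalg.svd(V, full_matrices=False)
U, W = U0[:, :5].copy(), np.clip(U0[:, :5].T @ V, -1.0, 1.0)   # m = 5 modes
objs = []
for _ in range(30):
    objs.append(spga_objective(V, U, W, lam=0.1))
    U, W = alternating_step(V, U, W, lam=0.1)
```

With a sufficiently small step size, each block update does not increase the objective, so the objective decreases from its initial value; the clipping step is exactly the projection onto the feasible set |w_j|_∞ ≤ 1.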

NRIC Manifold

As shapes we will consider triangular surfaces with fixed connectivity, i.e. with shared sets of vertices V, edges E, and faces F. For vertex positions X ∈ R^{3|V|}, we denote by l(X) = (l_e(X))_{e∈E} the vector of edge lengths and by θ(X) = (θ_e(X))_{e∈E} the vector of dihedral angles. We consider the vectors z(X) = (l(X), θ(X)) ∈ R^{2|E|} combining both. The manifold of all z ∈ R^{2|E|} corresponding to immersed triangular surfaces is given by S := { z ∈ R^{2|E|} | T(z) > 0, Q(z) = 0 }, where T(z) > 0 encodes the triangle inequalities and Q(z) = 0 are the discrete integrability conditions from Wang et al. (2012), see also Sassen et al. (2020a) for more details. S is called the NRIC manifold, short for Nonlinear Rotation-Invariant Coordinates, and can be equipped with an elasticity-based Riemannian metric (Heeren et al., 2014). To evaluate the logarithm and exponential map, we use the time-discretization developed by Rumpf & Wirth (2015). To obtain vertex positions for given lengths and angles, we use the nonlinear least-squares method from Fröhlich & Botsch (2011). We use weighted L^p-norms to measure the difference between arbitrary z_a, z_b,

    ‖z_a − z_b‖^p_{p,z̄} := Σ_{e∈E} w^p_{e,l} |l^b_e − l^a_e|^p + η Σ_{e∈E} w^p_{e,θ} |θ^b_e − θ^a_e|^p.

The weights are computed from a reference shape z̄ with vertex positions X̄, e.g. the center of mass from above, as w_{e,l} = l_e(X̄)^{-1} and w_{e,θ} = l_e(X̄) a_e(X̄)^{-1/2}, where a_e is (one third of) the area of both faces adjacent to e. The bending weight η is the same as used in the elasticity-based metric. To apply SPGA to this shape space, we need to specify the sparsity-inducing regularization R. Sassen et al. (2020b) observed that the simple mode-wise L^1-norms R(U) = Σ_{j=1}^{m} ‖u_j‖_1 suffice due to the natural connection of NRIC variation with elastic distortions. We provide more details on NRIC in Appendix D.

Neural Networks

We indicate functions that are implemented as neural networks by the superscript ζ as in φ^ζ. This ζ represents the network parameters and is the same for all occurring networks, with the implicit convention that different networks depend on different subsets of these parameters. We denote by MLP^ζ_ρ[N_1, …, N_T] a fully connected network with layer sizes N_1, …, N_T, the nonlinear activation function ρ : R → R after each layer, and parameters ζ. For graph convolutional networks, we adapt the approach by Kipf & Welling (2017) to data z^0_e ∈ R^{N_0} on edges e: The t-th layer is given by

    z^{t+1}_e = ρ( W^{t+1}_1 z^t_e + W^{t+1}_2 Σ_{ẽ ∈ N(e)} z^t_ẽ + b^{t+1} ),

where W^{t+1}_1 ∈ R^{N_{t+1}×N_t} and W^{t+1}_2 ∈ R^{N_{t+1}×N_t} are small matrices with learnable parameters and b^{t+1} ∈ R^{N_{t+1}} is the bias, all stored in the parameters ζ. We define the neighborhood of an edge in a triangle mesh as N(e) := {ẽ ∈ E | e and ẽ share a vertex}. We used the Exponential Linear Unit (Clevert et al., 2016) as activation function in all our experiments.
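The edge-based convolution layer above can be sketched in plain NumPy (the actual implementation uses PyTorch Geometric; the helper names below are ours):

```python
import numpy as np

def elu(x):
    # Exponential Linear Unit (Clevert et al., 2016)
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def edge_neighborhoods(faces):
    # edges of a triangle mesh and, for each edge e, the indices of N(e),
    # i.e. all other edges sharing a vertex with e
    edges = sorted({tuple(sorted((f[i], f[(i + 1) % 3]))) for f in faces for i in range(3)})
    nbrs = [[k for k, other in enumerate(edges) if other != e and set(e) & set(other)]
            for e in edges]
    return edges, nbrs

def edge_conv(Z, nbrs, W1, W2, b):
    # z_e' = ELU( W1 z_e + W2 * sum_{e~ in N(e)} z_e~ + b )
    agg = np.stack([Z[n].sum(axis=0) for n in nbrs])
    return elu(Z @ W1.T + agg @ W2.T + b)

# tiny example: the four faces of a tetrahedron give 6 edges; every edge shares a
# vertex with all edges except the one opposite to it, so |N(e)| = 4
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
edges, nbrs = edge_neighborhoods(faces)
rng = np.random.default_rng(0)
Z = rng.standard_normal((len(edges), 2))          # N_0 = 2 features per edge
out = edge_conv(Z, nbrs, rng.standard_normal((4, 2)),
                rng.standard_normal((4, 2)), np.zeros(4))
```

The quadratic neighborhood construction is fine for illustration; a production version would precompute N(e) once per mesh with a vertex-to-edge incidence map.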

4. COMPOSITE NETWORK APPROXIMATION

As previously explained, we want to learn an efficient parametrization Φ : R^m → M of an m-dimensional Riemannian data manifold M. Below, we will explain our structural assumptions to achieve this, detail how we include them in the network architecture and training, and finally discuss their applicability to practical examples.

Structural Assumptions One can think of the parametrization as the decoder part of an autoencoder; however, for the purpose of this article we want to be more specific: We would like the parametrization to have a geometric meaning, so we will focus on the situation in which Φ should encode the Riemannian exponential map at some point z ∈ M. In principle, our approach can also be combined with alternative concepts in which no ground truth parametrization is available, but this setting allows a particularly simple quantification of the results. If no additional structure of the data manifold is known, one can of course not improve over computing the Riemannian exponential directly or approximating it by some standard neural network. In contrast, we consider the case where a specific structure is known. For the purpose of our article, this structure is a priori given, that is, we do not study how such a structure can be found or obtained, though we give an example for a corresponding procedure on Riemannian shape spaces later on. We require that the data manifold M to be parametrized is embedded in some R^n for possibly large n. The structure of M we want to exploit has three components:

(0) Correlation We assume a structural correlation between the different coordinates, e.g. graph-neighbour relations for triangular meshes or pixel-neighbour relations for image data, so that convolutional networks can be applied.

(1) Factorization We assume that the manifold can be smoothly approximated by a product of much lower-dimensional manifolds M_1, M_2, …, M_J, which we will parametrize separately.
Thereby, we exploit that the necessary network size as well as the required training effort decrease with smaller manifold dimension: For m-dimensional manifolds the network size should scale at least linearly in m, while the required training set and thus also the training time will scale exponentially in m.

(2) Combination It is not sufficient that the single factor manifolds are easy to parametrize since a generic point on M has components in all factors. Thus, we assume that the direct sum of all factors

    M_1 ⊕_z … ⊕_z M_J := { z + (z_1 − z) + … + (z_J − z) | z_1 ∈ M_1, …, z_J ∈ M_J }

for some z ∈ M already approximates the actual manifold M with a (possibly large, but) very smooth approximation error. This will ensure that a suitable map from (z_1 − z, …, z_J − z) to M can be efficiently learned.

The toy shape manifold M of tori from Figure 1 has the flat metric of S^1 × S^1 × T^2 (the factors representing the orientation of the longitudinal and latitudinal ellipsoidal cross-sections as well as a bump position, see subsection 5.1) and thus satisfies condition (1). Condition (2) holds since each torus is represented as NRIC: For instance, the creation of the bump (which corresponds to changing the position in T^2) is well described by simply adding fixed numbers in the right places of the edge length vector l(X). However, the direct sum is indeed only an approximation; otherwise there would be no error in Figure 1 (green), but one clearly sees a remnant of the bump from the reference shape.

Figure 1: Approximation of the Freaky Torus. We show the reference shape z in grey, the approximation of exp_z v by affine combination of exact factors in green, by a monolithic network in blue, and by our composite network in yellow. The correct vertex positions of exp_z v are shown as purple points. These purple points should ideally lie on the shaded surface, which would indicate a good fit.
Indeed, for the approximation by our composite network, this is mostly the case, while for the other two approaches the approximation does not match the point cloud in many places.

Network Representation and Training Before we discuss the previous conditions, let us describe how we exploit them in our network architecture and training procedure. To learn the parametrization Φ : R^m → M, we decompose it as

    Φ = Ψ^ζ ∘ (ψ^ζ_1, …, ψ^ζ_J),

where ψ^ζ_j : R^{m_j} → R^n is the parametrization of the m_j-dimensional factor manifold M_j and Ψ^ζ : (R^n)^J → R^n is the combination of the single factors, behaving approximately like (x_1, …, x_J) → z + (x_1 − z) + … + (x_J − z), see Figure 2.

Figure 2: Structure of the composite network.

The maps ψ^ζ_j will be fully connected neural networks. Because they approximate the smooth exponential map on rather low-dimensional spaces, they are expected to achieve a high approximation quality that is typically stable under variation of the concrete network architecture (number and size of layers). For the maps Ψ^ζ, we exploit that they operate on a structured domain, e.g. a mesh, and thus they will be convolutional neural networks. This allows for efficient training and storage even though they operate on high-dimensional data. One could alternatively also use fully connected networks for Ψ^ζ, but the observed quality of the results was similar despite significantly increased memory requirements to train, store, and evaluate. These networks are trained separately: To train the map ψ^ζ_j, we consider a parametrization ω_j : R^{m_j} → T_z M, ω_j(a_j) = a^j_1 e^j_1 + … + a^j_{m_j} e^j_{m_j} with a coefficient vector a_j = (a^j_1, …, a^j_{m_j}) and a basis e^j_1, …, e^j_{m_j} of the Riemannian logarithm of M_j at point z as a linear subspace of the tangent space T_z M.
We then consider a set of random samples S_j ⊂ R^{m_j} as training data (for instance normally distributed or uniformly distributed on a ball) and minimize the loss function

    J_j(ζ) = (1/|S_j|) Σ_{a_j ∈ S_j} ‖exp_z(ω_j(a_j)) − ψ^ζ_j(a_j)‖²

in some norm ‖·‖ depending on the application. To train the map Ψ^ζ, we subsequently consider a random training set S ⊂ R^{m_1} × … × R^{m_J} = R^m and minimize the loss function

    J(ζ) = (1/|S|) Σ_{(a_1,…,a_J) ∈ S} ‖exp_z(ω_1(a_1) + … + ω_J(a_J)) − Ψ^ζ(ψ^ζ_1(a_1), …, ψ^ζ_J(a_J))‖²,

where we may or may not keep the maps ψ^ζ_j fixed.

Applicability of Assumptions While condition (0) essentially has to hold for any approach employing neural networks, conditions (1) and (2) are specific to our approach. Condition (1) expresses that the intrinsic geometry of the data manifold M has a simplifying structure, while condition (2) is about the extrinsic geometry of how M is embedded in R^n; only both conditions together characterize a structure that can efficiently be learned. Typical situations where our conditions hold include the following two examples:

• The different factor manifolds correspond to different spatial regions. For instance, images may sometimes be partitioned into different regions that can vary more or less independently of each other. If the regions are fully disjoint, the manifold M can exactly be written as a direct sum of factor manifolds; if on the other hand the regions slightly overlap, then this is only fulfilled approximately. Examples in shape spaces are where the single factor manifolds correspond to shape variations with disjoint support such as the articulation of the different extremities of a character. Such a manifold structure is ubiquitous in computer graphics applications, and our computational examples mostly belong to this category.
• On shapes one can also often observe variations at different length scales that are independent of each other, for example, geometric texture on small length scales versus global shape variations at large length scales. The above toy example of the torus, in which the small bump is more or less independent of the global shape variations, belongs to this category.

As mentioned before, we assume the product manifold structure to be given a priori (in the torus example, for instance, it was given by design of the data manifold). The product manifold structure would typically be the result of some data manifold analysis, for example from disentanglement learning. In the setting of Riemannian shape spaces, one way to obtain this structure is the SPGA introduced earlier: It makes use of the decomposition into independent product manifolds due to disjoint support of the corresponding shape variations.
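The composite map Φ = Ψ^ζ ∘ (ψ^ζ_1, …, ψ^ζ_J) and the empirical loss J(ζ) from the training procedure above can be sketched as follows. The networks here are untrained random MLPs, an MLP also stands in for the convolutional Ψ^ζ, and the exponential map is a placeholder; all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # small fully connected network with ELU activations and random (untrained) weights
    params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
              for n, m in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for i, (W, b) in enumerate(params):
            x = W @ x + b
            if i < len(params) - 1:
                x = np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)  # ELU
        return x
    return forward

n, factor_dims = 32, [2, 3, 2]                   # ambient dim n and factor dims m_j
psi = [mlp([m, 16, n]) for m in factor_dims]     # factor parametrizations psi_j
Psi = mlp([len(factor_dims) * n, 64, n])         # combination network Psi

def Phi(parts):
    # composite parametrization Phi = Psi o (psi_1, ..., psi_J)
    return Psi(np.concatenate([p(a) for p, a in zip(psi, parts)]))

def loss(samples, exp_z):
    # empirical loss: mean || exp_z(omega_1(a_1)+...+omega_J(a_J)) - Phi(a) ||^2
    return np.mean([np.sum((exp_z(parts) - Phi(parts)) ** 2) for parts in samples])

samples = [[rng.standard_normal(m) for m in factor_dims] for _ in range(8)]
dummy_exp = lambda parts: np.zeros(n)            # placeholder for the true exponential
val = loss(samples, dummy_exp)
```

In training, `dummy_exp` would be replaced by precomputed ground-truth exponentials exp_z(ω_1(a_1) + … + ω_J(a_J)), and the weights would be optimized by gradient descent, factorwise first and then jointly for Ψ^ζ.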

5. EXPERIMENTS AND APPLICATIONS

In this section, we present experimental results on the aforementioned synthetic shape manifold of deformed tori and on manifolds extracted via SPGA. We will compare our method to approaches based on (Sassen et al., 2020b) and a straightforward approximation of the parametrization by a single fully connected network. To quantify the approximation quality of the different approaches, we use the coefficient of determination R². For approximations ẑ_i of NRIC z_i with mean z̄, it is defined as

    R²(z, ẑ) = 1 − Σ_i ‖z_i − ẑ_i‖²_{2,z̄} / Σ_i ‖z_i − z̄‖²_{2,z̄}.

From a statistical point of view, it quantifies the proportion of the variation of the data that is explainable by a given model. This means an R² of one is optimal, the smaller it is the worse the approximation, and a negative R² means that the model is worse than simply using the mean.

Training & Implementation We used Adam (Kingma & Ba, 2015) as descent method for training all networks, where the initial learning rate was 10^-3 and was reduced by a factor of 10 every time the loss did not decrease for multiple iterations. For regularization, we used batch normalization after each layer and a moderate dropout regularization (p = 0.1) after each convolutional layer. We implemented the neural networks in PyTorch (Paszke et al., 2019) using the PyTorch Geometric library (Fey & Lenssen, 2019). The tools for the NRIC manifold were implemented in C++ based on OpenMesh (Botsch et al., 2002), where we use the Eigen library (Guennebaud et al., 2010) for numerical linear algebra. We follow an approach similar to Kilian et al. (2007) and perform all computations on a coarsened mesh, prolongating solutions to a fine one only for visualizations.
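The R² metric follows directly from the definition above; the sketch below uses plain Euclidean norms, whereas our experiments use the weighted norms ‖·‖_{2,z̄} from Section 3, and the function name is ours.

```python
import numpy as np

def r_squared(Z, Z_hat):
    # R^2 = 1 - sum_i ||z_i - zhat_i||^2 / sum_i ||z_i - zbar||^2
    # rows of Z are the data points z_i, rows of Z_hat their approximations
    z_bar = Z.mean(axis=0)
    return 1.0 - np.sum((Z - Z_hat) ** 2) / np.sum((Z - z_bar) ** 2)
```

A perfect approximation gives R² = 1, predicting the mean for every point gives R² = 0, and models worse than the mean give negative values.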

5.1. SYNTHETIC DATA: FREAKY TORUS

For the Freaky Torus dataset, we construct a synthetic shape space with factors S^1 × S^1 × T^2, where T^2 refers to the flat 2-dimensional torus. It is realized in NRIC by (i) deforming the two cross-sectional circles of a torus to ellipses of fixed aspect ratio and orientation controlled by the first two S^1 factors and (ii) growing a bump in normal direction whose position is controlled by the last T^2 factor. These torus deformations are applied to a regular mesh of a regular torus embedded in R^3, and the deformed meshes' NRIC are extracted to obtain our datapoints. We used a mesh with 2048 vertices and uniformly drew 1000 samples from S^1 × S^1 × T^2. More details on the data generation can be found in Appendix A. Figure 1 shows that a single fully connected network struggles with approximating the high-frequency detail of the bump, while our composite network is able to handle this well. This can also be observed in the approximation quality quantified using the R². The composite network achieves an R² of 0.99 and the monolithic network one of 0.95. This difference may sound small; however, this is because the bump is a detail and the error is dominated by the overall shape of the torus.

5.2. APPLICATION: SPGA MANIFOLDS

Lastly, we report the results of applying our method to shape manifolds whose approximate product structure is found with the help of SPGA. To this end, we repeat three of the examples discussed by Sassen et al. (2020b) and consider one new dataset. The repeated examples are a humanoid dataset from (Anguelov et al., 2005), a dataset of face meshes from (Zhang et al., 2004), and a set of hand meshes from (Yeh et al., 2011). For the new example, we examine a humanoid dataset based on SMPL-X (Pavlakos et al., 2019), where we consider their expressive hands and faces (EHF) dataset, containing 100 shapes, and 49 additional shapes from the SMPL+H dataset, which feature more expressive arm and leg movements, for a total of 149 input shapes. We interpret all shapes as elements of the nonlinear NRIC space so that this small amount of data points already suffices to span a high-dimensional, nonlinear NRIC submanifold that serves as our data manifold. Based on this data, we follow the numerical approach of Sassen et al. (2020b) to solve (1) and thereby compute the sparse tangent modes. We report the chosen number m of included modes in Table 1, where we used the same number as Sassen et al. (2020b) for the repeated examples. To factor the resulting data manifold, we again follow the approach outlined by Sassen et al. (2020b), which is a clustering based on the spatial overlap of the modes. For the hand and face examples, we choose the same number of factors J, while for the SCAPE example we decreased the number to account for the possibility to handle higher-dimensional factors with our method. Our choices are again documented in Table 1. Each cluster then spans exactly one of the factor manifolds, and the range of their dimensions is also reported in said table.
Comparison to Affine Combination Sassen et al. (2020b) also proposed a scheme based on multilinear interpolation of precomputed exponentials for each of the factor manifolds and a subsequent affine combination of the results to approximate the exponential map and parametrize M. A natural question is how our network-based approach compares to this. For the humanoid examples, it was not possible to precompute Riemannian exponentials on a regular grid for all factor manifolds due to their high dimensionality. Hence, instead of multilinearly interpolating precomputed exponentials, we simply compute the exponentials within each factor manifold exactly before combining them affinely. We dub this method simply 'affine combination' and present its results in Figure 4 and Table 1; it is computationally heavy, but yields an upper bound on the quality of the method by Sassen et al. (2020b). The limitation does not apply to our new approach, which for example allows us to learn an efficient parametrization for the SMPL-X dataset, where the expressive movements of hand and face require a higher-dimensional data manifold. In all examples, our composite network approach achieves higher approximation accuracy than the 'affine combination'. This shows that the network Ψ^ζ is able to correct the approximation errors of the direct sum structure. For the lower-dimensional examples, this difference is not as pronounced since their sparse modes have a better support separation. Furthermore, storing our network-based approximation requires less memory than the approach by Sassen et al. (2020b): For example, on the SCAPE dataset, storing grids with approx. 20000 samples as reported by Sassen et al. (2020b) requires about 1.7 GB of storage, while our networks only require 0.6 GB (without optimizing for a small memory footprint).

Figure 5: Two examples from the SMPL-X dataset. We use the same colors as in Figure 1.
Comparison to Monolithic Network Another obvious question is whether our composite approach shows any benefit over training a simple, single network. For evaluation, we also trained one fully connected network Φ ζ : R m → M to approximate the parametrization at once, dubbed 'monolithic' approach. The corresponding approximation qualities are reported in Table 1 . One sees that for the lower-dimensional examples this monolithic approach achieves an approximation quality close to the one of our composite network. However, for the higher-dimensional, humanoid examples the approximation quality of the monolithic approach is noticeably lower.

6. CONCLUSION

Our results suggest that the most fundamental geometric operation on Riemannian data manifolds, the parametrization of the manifold via the Riemannian exponential map, can in principle be learned. We illustrated this on shape manifolds of triangular meshes, for which the exponential map is computationally expensive so that an approximation is attractive. However, while naive implementations via deep neural networks proved ineffective, we achieved consistently satisfying results by matching our training and network architecture to a typical structure of shape manifolds: that they can be approximated by an affine sum of submanifolds. We thus learned both the lower-dimensional Riemannian exponential map on each submanifold and the (close to affine) composition of the different submanifolds. We furthermore illustrated in our examples that such manifold structures arise from basic principles like support or scale separation of different shape variations. While we implemented the above concept of composite networks only for shape manifolds, it should also be applicable to image manifolds. However, first the corresponding tools to identify approximate product manifold structures (such as the SPGA) would have to be developed for images. It is also conceivable to replace the SPGA by learning approaches akin to disentanglement learning. Furthermore, we did not touch upon further optimization for the specific setting of our shape manifolds: The single factor manifolds typically describe localized shape variations, so a sparsity regularization of the corresponding networks would make sense and could substantially reduce the parameter size. Furthermore, one could add a regularization favoring those NRIC that correspond to an immersed mesh, thereby reducing postprocessing.

Figure 6: Factors of the Freaky Torus. We visualize our synthetic shape space by demonstrating the effect of moving along the individual factors on the final shape.
In the first row, we see how the first factor, an S 1 , controls the deformation of the latitudinal cross-section. We show the deformed torus from the top. In the second row, we see how the next factor, another S 1 , controls the deformation of the longitudinal cross-section. Here, we show the torus cut in half to better highlight the crosssection's shape. In the last row, we show how the third factor, a two-dimensional flat torus T 2 , controls the position of a bump on the deformed torus.

A FREAKY TORUS

In this appendix, we provide more details on the construction of our shape space Freaky Torus of deformed tori, a synthetic shape space with factors S^1 × S^1 × T^2, whose action is also summarized in Figure 6. We first derive its continuous version, leading to a family of parametrizations, and then arrive at the discrete version via a simple spatial discretization. To begin, we recall the parametrization f : [0, 2π)^2 → R^3 of a standard torus of revolution with radii R and r of the latitudinal and longitudinal circular cross-sections, respectively, which is given by

f(u, v) := R (cos(u), sin(u), 0)^T + r (cos(u) cos(v), sin(u) cos(v), sin(v))^T.

The first two factors S^1 × S^1 of our shape space control the deformation of these cross-sections into ellipses, and hence we want to replace the parametrizations of the cross-sectional circles by ones of appropriate ellipses. To this end, a circle deformed into an ellipse with semi-axes' lengths a and b rotated by η/2 against the coordinate axes is parametrized by

τ^{η,a,b}(t) := ( a cos(η/2) cos(t - η/2) - b sin(η/2) sin(t - η/2),  a sin(η/2) cos(t - η/2) + b cos(η/2) sin(t - η/2) )^T.

The phase shift in the parametrization comes from the fact that the rotation of the semi-axes is achieved by a warping of the circle instead of rotating it. We introduced this parametrization with the half-scaling of η for cosmetic reasons, so that later all deformations are parametrized over the same interval [0, 2π). Now, we use this parametrization as a replacement for the circular cross-sections in the parametrization f of the torus. We fix the semi-axes' lengths a_R, b_R and a_r, b_r of the latitudinal and longitudinal cross-sections, respectively, and introduce their rotations as parameters α, β ∈ [0, 2π). This leads to the parametrization

f_{α,β}(u, v) := R ( τ^{α,a_R,b_R}_1(u), τ^{α,a_R,b_R}_2(u), 0 )^T + r ( cos(u) τ^{β,a_r,b_r}_1(v), sin(u) τ^{β,a_r,b_r}_1(v), τ^{β,a_r,b_r}_2(v) )^T.
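The ellipse parametrization τ and the deformed torus f_{α,β} can be sketched directly from the formulas above. The default parameter values below are the dataset values stated later in this appendix; the function names are ours.

```python
import numpy as np

def tau(t, eta, a, b):
    """Ellipse with semi-axes a, b, rotated by eta/2 via a phase shift."""
    c, s = np.cos(eta / 2), np.sin(eta / 2)
    ct, st = np.cos(t - eta / 2), np.sin(t - eta / 2)
    return np.array([a * c * ct - b * s * st,
                     a * s * ct + b * c * st])

def f(u, v, alpha, beta,
      R=0.375, r=0.125, aR=1.0, bR=0.5, ar=1.0, br=0.5):
    """Deformed torus f_{alpha,beta}(u, v) with elliptic cross-sections."""
    tR = tau(u, alpha, aR, bR)   # latitudinal cross-section
    tr = tau(v, beta, ar, br)    # longitudinal cross-section
    return (R * np.array([tR[0], tR[1], 0.0])
            + r * np.array([np.cos(u) * tr[0], np.sin(u) * tr[0], tr[1]]))
```

For eta = 0, tau reduces to the standard ellipse (a cos t, b sin t), and for a = b it reduces to a circle, recovering the torus of revolution.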
The last factor T^2 controls the position of a bump on such a torus. To describe the corresponding deformation, we first need the normal of the surface, in whose direction the bump will point. It is given by the usual formula for smooth surfaces, namely

n_{α,β}(u, v) := (∂_u f_{α,β} × ∂_v f_{α,β}) / ‖∂_u f_{α,β} × ∂_v f_{α,β}‖ (u, v).

Then the bump is a deformation in the direction of this normal around the position determined by (γ, ζ) ∈ [0, 2π)^2 with maximal height h. To limit the support of the deformation, we use a simple Gaussian with parameter ε applied to the distance to the center point. We finally arrive at the parametrization of the continuous version of our synthetic shape space

F : [0, 2π) × [0, 2π) × [0, 2π)^2 → ([0, 2π)^2 → R^3),
(α, β, γ, ζ) ↦ ( (u, v) ↦ f_{α,β}(u, v) + h exp(-‖f_{α,β}(u, v) - f_{α,β}(γ, ζ)‖^2 / ε^2) n_{α,β}(γ, ζ) ),

where the images are parametrizations of embedded surfaces. To obtain discrete surfaces, we apply these to a fixed triangulation of the torus, which yields the discrete version of the shape space. For the dataset used in our experiments, we chose a_R = a_r = 1, b_R = b_r = 1/2, R = 0.375, r = 0.125, h = 0.075, and ε = 0.05. It is available along with the code at https://gitlab.com/jrsassen/freaky-torus.
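The bump deformation can be sketched as follows; for brevity we use the round torus of revolution (the case α = β = 0) and a finite-difference surface normal. Parameter values match the dataset values above; the function names are ours.

```python
import numpy as np

R, r = 0.375, 0.125
h_bump, eps = 0.075, 0.05

def torus(u, v):
    """Round torus of revolution (the alpha = beta = 0 case, for brevity)."""
    return np.array([(R + r * np.cos(v)) * np.cos(u),
                     (R + r * np.cos(v)) * np.sin(u),
                     r * np.sin(v)])

def normal(u, v, h=1e-5):
    """Unit surface normal via central finite differences."""
    fu = (torus(u + h, v) - torus(u - h, v)) / (2 * h)
    fv = (torus(u, v + h) - torus(u, v - h)) / (2 * h)
    n = np.cross(fu, fv)
    return n / np.linalg.norm(n)

def bumped(u, v, gamma, zeta):
    """Torus with a Gaussian bump of height h_bump centered at (gamma, zeta)."""
    base, center = torus(u, v), torus(gamma, zeta)
    w = np.exp(-np.sum((base - center) ** 2) / eps ** 2)
    return base + h_bump * w * normal(gamma, zeta)
```

At the bump center the displacement has exactly the maximal height h, and it decays rapidly away from the center because of the Gaussian weight.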

B ADDITIONAL RESULTS

Figure 7: Exemplary SPGA modes extracted from the SCAPE dataset. We show a selection of modes generating two of the factor manifolds (indicated by the blue frames; not all modes of each factor manifold are shown, as indicated by the purple dots). Note that while these deformations move many nodal positions, their support in NRIC is indeed localized. For example, the fifth mode is supported mainly in the hip region of the shape. This also links it with the other modes in the same group, even though they might move a different leg.

Sparsity In Section 4, we postulated several assumptions on the structure of our data manifold, which raises the question of whether they are actually satisfied in our experiments. Since we work with NRIC, assumption (0) is fulfilled, as they are given as data on the edges of a mesh (see also Appendix D). For the Freaky Torus example, we explicitly constructed a product manifold (hence assumption (1) is fulfilled) and chose the factors such that they act on different length scales, which entails the fulfillment of assumption (2). For the examples using a decomposition via SPGA, we rely on it producing sparsely supported modes that can be grouped by their spatial overlap. As Sassen et al. (2020b) already observed, the resulting factorization of the data manifold fulfills our assumptions. In Figure 7, we show exemplary modes from the SCAPE dataset highlighting that this is indeed the case. The sparsity of the SPGA modes is a result of the regularization R in the SPGA problem (1). To still achieve a good approximation of the input data, the modes typically localize in different regions of the shape. This allows us to group them as explained before to obtain factor manifolds that fulfill assumptions (1) and (2). In contrast, this is not the case for PGA modes: if we do not use any regularization, there is no reason to expect localized support, as illustrated in Figure 8. This means our method is not applicable to such modes.
Animation One possible application of our composite network is the efficient animation of shapes. In this context, we can consider shape interpolation and extrapolation problems, which correspond to the evaluation of the Riemannian logarithm and exponential map as explained in Section 3 and Appendix C. For shape interpolation, we are given two shapes by their latent coordinates a(0) ∈ R^m and a(1) ∈ R^m, respectively. Then, the latent coordinates of intermediate shapes are obtained by linear interpolation, i.e. we define a(t) := t a(1) + (1 - t) a(0) for t ∈ [0, 1]. By evaluating our composite network on these coordinates, we obtain the approximate NRIC z(t) := Ψ_ζ(ψ_{ζ_1}(a_1(t)), . . . , ψ_{ζ_J}(a_J(t))) of these shapes, where a_j(t) ∈ R^{m_j} are the factorized coordinates as before. This leads to a smooth interpolation between shapes, as demonstrated in Figure 9 and the supplementary video. Shape extrapolation can equivalently be phrased by considering linear extrapolation in the latent space R^m.

Figure 9: For two given shapes with latent coordinates a(0) and a(1), we compute interpolating NRIC z(t) using our composite network. In the top row, we see the surfaces reconstructed from these NRIC for intermediate time steps, exhibiting smooth deformations. Below, we also show the elements ψ_{ζ_j}(a_j(1)) from the factor manifolds M_j, which lead to the final shape by applying the combination network Ψ_ζ. These individual factors lead primarily to deformations of the legs for ψ_1 and ψ_3, of the arms for ψ_4 and ψ_5, of the wrists for ψ_2, and of the head for ψ_6. See also the supplementary video.
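The latent-space interpolation described above amounts to a few lines of code. This is a minimal sketch in which the trained composite network is a stand-in callable taking the factorized coordinates; the helper names are ours.

```python
import numpy as np

def split_factors(a, m_js):
    """Split latent coordinates a in R^m into factor coordinates a_j in R^{m_j}."""
    idx = np.cumsum(m_js)[:-1]
    return np.split(a, idx)

def interpolate(a0, a1, composite_net, m_js, steps=5):
    """Evaluate the composite network along the segment between a0 and a1."""
    shapes = []
    for t in np.linspace(0.0, 1.0, steps):
        a_t = t * a1 + (1.0 - t) * a0          # linear latent interpolation
        shapes.append(composite_net(split_factors(a_t, m_js)))
    return shapes
```

Extrapolation works identically by letting t run outside [0, 1].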

Number of Samples

We observed that our composite network can also be trained with fewer samples than we used in subsection 5.2. For example, on the SCAPE dataset, if we use only 20 % of the data as training set (about 800 samples), we still achieve an R^2 of 0.86. Even with a mere 5 % (200 samples), we still reach an R^2 of 0.77. We observed a similar behavior on the SMPL+X dataset, with an R^2 of 0.85 at 20 % training data and 0.74 at 5 % training data.

Runtimes Our network-based approach enables a runtime-efficient approximation of the exponential map. For example, on the SCAPE dataset, we used K = 16 time steps to evaluate the time-discrete exponential map (see Appendix C) when generating the training samples. Each such evaluation required around 8 seconds. In contrast, evaluating the networks takes about 10 milliseconds. To render the result, we have to reconstruct the nodal positions of the triangle mesh from the NRIC, for which we use the nonlinear least-squares method from Fröhlich & Botsch (2011). This requires a small number (e.g. 2 to 3 in Figure 9) of Gauß-Newton iterations taking about 20 ms each. Overall, the performance is comparable to the approach by Sassen et al. (2020b), which, in contrast to our approach, is limited in the number of latent dimensions it can handle.

D A RECAP OF NONLINEAR ROTATION-INVARIANT COORDINATES

Let us briefly review the tools introduced by Wang et al. (2012) for a discrete version of the fundamental theorem of surfaces. Consider a simply connected, triangular surface with the set of vertices V, edges E ⊂ V × V, and faces F ⊂ V × V × V. For a given vector of vertex positions X ∈ R^{3|V|}, we introduce the vector of all edge lengths l(X) = (l_e(X))_{e∈E} and the vector of all dihedral angles θ(X) = (θ_e(X))_{e∈E}. As discussed by Wang et al. (2012), to ensure that the edge length and dihedral angle data z = (l, θ) ∈ R^{2|E|} actually corresponds to a triangular surface immersed in R^3, two admissibility conditions have to be fulfilled. The obvious first condition is the triangle inequality for the edge lengths on all triangles. In formulas, we write this condition as T_f(l) > 0 for all f ∈ F, where

T_f(l) = ( l_i + l_j - l_k,  l_i - l_j + l_k,  -l_i + l_j + l_k )

for a face f ∈ F with edge lengths l_i, l_j, l_k, and the above inequality is meant componentwise. Fulfillment of these conditions guarantees that we can construct the individual triangles from the given edge lengths. However, we need a second condition ensuring that these triangles fit together with the given dihedral angles to form a surface, i.e. guaranteeing the integrability of z. For simply connected discrete surfaces, this can be broken down into individual conditions for the fans of triangles surrounding each vertex v in the set of interior vertices V_0. Formally, we express this individual condition as Q_v(z) = 0, which guarantees that we can construct the geometry of this fan from z. The explicit formula for Q_v was introduced by Wang et al. (2012). If this condition is fulfilled for all interior vertices, one can show that it is indeed possible to construct the geometry of the entire surface. Sassen et al. (2020a) demonstrated that Q_v and its derivatives can be robustly and efficiently computed using quaternions.
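The per-face triangle-inequality functional T_f can be sketched directly; the function names below are ours.

```python
import numpy as np

def T_f(l):
    """Componentwise triangle-inequality functional for one face with edge
    lengths l = (l_i, l_j, l_k); all entries must be strictly positive."""
    li, lj, lk = l
    return np.array([li + lj - lk, li - lj + lk, -li + lj + lk])

def admissible_lengths(l):
    """True iff the edge lengths satisfy T_f(l) > 0 componentwise."""
    return bool(np.all(T_f(l) > 0))
```

For degenerate or impossible lengths, e.g. (1, 1, 3), at least one component of T_f is non-positive and the face cannot be constructed.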
The conditions can be extended to higher-genus surfaces by including integrability conditions along non-contractible paths that generate the fundamental group of the triangular surface, but this is not used here. The manifold of all z ∈ R^{2|E|} corresponding to immersed triangular surfaces of the given mesh connectivity is given by

M = { z ∈ R^{2|E|} | T(z) > 0, Q(z) = 0 },

where we collect all constraints in the vector-valued functionals T = (T_f)_{f∈F} and Q = (Q_v)_{v∈V_0}. Following Sassen et al. (2020a), we call M the NRIC manifold (Nonlinear Rotation-Invariant Coordinates). The differential structure of this manifold is first described in terms of the tangent space, which at a point z ∈ M is given by

T_z M = ker DQ(z) := { w ∈ R^{2|E|} | DQ(z) w = 0 }.

Here the matrix DQ(z) ∈ R^{3|V_0| × 2|E|} is the Jacobian of Q. The triangle inequalities define an open subset of R^{2|E|} and are thus not needed to define the tangent space. The advantage of NRIC is that they allow a local description of shell deformations based on the local variation of the edge lengths, which encodes membrane distortions, and the local variation of the dihedral angles, which encodes bending distortions. We use a Riemannian metric g on the manifold M that reflects the physical dissipation caused by these infinitesimal variations of the discrete surface. In detail, the metric coincides with the Hessian of an elastic energy W, i.e. g_z : R^{2|E|} × R^{2|E|} → R with g_z = 1/2 Hess W[z, ·] restricted to T_z M × T_z M. We use an elastic energy describing a deformation from a configuration z to a configuration z̃ that decomposes into a membrane energy and a bending energy, i.e. W[z, z̃] = W_mem[z, z̃] + W_bend[z, z̃]. The bending energy is given by

W_bend[z, z̃] = Σ_{e∈E} (θ_e - θ̃_e)^2 d_e^{-1} l_e^2,

where d_e = 1/3 (a_f + a_f′) for the two faces f and f′ adjacent to e ∈ E (a_f denotes the area of f).
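The bending energy above is a simple weighted sum over the edges and can be sketched as follows; the vectorized function name is ours.

```python
import numpy as np

def bending_energy(theta, theta_tilde, l, d):
    """W_bend = sum over edges of (theta_e - theta_tilde_e)^2 * l_e^2 / d_e,
    where theta/theta_tilde are dihedral angles of the two configurations,
    l the edge lengths, and d the associated edge areas d_e."""
    theta, theta_tilde = np.asarray(theta), np.asarray(theta_tilde)
    return float(np.sum((theta - theta_tilde) ** 2
                        * np.asarray(l) ** 2 / np.asarray(d)))
```

The energy vanishes exactly when the dihedral angles of the two configurations coincide, i.e. when there is no bending distortion.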
Furthermore, the membrane energy is given by W_mem[z, z̃] = Σ_{f∈F} a_f W_mem(G[z, z̃]|_f), with the energy density

W_mem(A) := μ/2 tr A + λ/4 det A - (μ/2 + λ/4) log det A - μ - λ/4.

The constants μ and λ are positive material constants, and G[z, z̃] denotes the Cauchy-Green strain tensor of the deformation, describing the face-wise distortion, as a function of the edge lengths on each face. Let us remark that the logarithmic term in the energy density W_mem acts as a barrier ensuring the triangle inequalities for finite-energy deformations.
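The density can be sketched and sanity-checked in a few lines. Note that the coefficients below follow our reconstruction of the (garbled) formula above, chosen so that the density vanishes and is minimized at the identity strain A = Id; the material constants are illustrative.

```python
import numpy as np

def membrane_density(A, mu=1.0, lam=1.0):
    """Membrane energy density W_mem(A) for a 2x2 Cauchy-Green strain tensor A.
    Coefficients are chosen so that W_mem(Id) = 0 and Id is the minimizer."""
    trA, detA = np.trace(A), np.linalg.det(A)
    return (mu / 2 * trA + lam / 4 * detA
            - (mu / 2 + lam / 4) * np.log(detA) - mu - lam / 4)
```

Any isotropic stretching or compression of a face yields a positive density, and det A → 0 drives the log-barrier term to infinity, which is the barrier effect mentioned above.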






Figure 3: Examples from the face and hands datasets. We use the same colors as in Figure 1.

Data Generation Next, we sample the exponential map on the space spanned by the SPGA modes to generate the training (and test) data. To this end, we consider the hypercube in R^m given by the minimal and maximal coefficients of the projections of the input data onto the SPGA subspace. Then we draw our parametrization coefficient samples S ⊂ R^m uniformly from this hypercube. To create the samples S_j for the factor manifolds, we simply take the corresponding subcomponents of the coefficient vectors from S. The corresponding shapes were then computed by evaluating the exponential map for each of them. Overall, we sampled approximately |S| = 4000 points for each of the considered examples. The dataset was split randomly into a training (80 %) and a test (20 %) set, with the training set being used for the descent method on the loss functionals and the test set being used to evaluate the performance of the networks.
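The sampling scheme can be sketched as follows. The coefficient matrix here is a random stand-in for the projections of the input data onto the SPGA subspace; in practice it comes from the SPGA computation, and the factor dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the projections of the input data onto the
# SPGA subspace (shape: number of input shapes x latent dimension m).
coeffs = rng.standard_normal((100, 8))

# Hypercube bounds in R^m from the minimal/maximal coefficients.
lo, hi = coeffs.min(axis=0), coeffs.max(axis=0)

# Draw |S| = 4000 coefficient samples uniformly from the hypercube.
S = rng.uniform(lo, hi, size=(4000, coeffs.shape[1]))

# Factor samples S_j are subcomponents of S, e.g. for factor dims (3, 5):
S_1, S_2 = S[:, :3], S[:, 3:]

# Random 80/20 train/test split.
perm = rng.permutation(len(S))
train, test = S[perm[:3200]], S[perm[3200:]]
```

Each sampled coefficient vector would then be mapped through the (time-discrete) exponential map to produce the corresponding training shape.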

Figure 4: Two examples from the SCAPE dataset. We use the same colors as in Figure 1.

Example | m  | J  | m_j  | Affine | Monolithic | Composite (Ours) | Sassen et al. (2020b)
--------|----|----|------|--------|------------|------------------|----------------------
SMPL+X  | 80 | 10 | 3-24 | 0.78   | 0.85       | 0.93             | -
SCAPE   | 40 | 6  | 5-9  | 0.77   | 0.60       | 0.91             | -
Hands   | 12 | 4  | 2-4  | 0.88   | 0.95       | 0.98             | 0.80
Faces   | 10 | 6  | 1-4  | 0.96   | 0.95       | 0.99             | 0.95

Figure 8: Exemplary PGA modes extracted from the SCAPE dataset showing the global support typical for such modes.

Table 1: Approximation quality R^2 on SPGA examples.

ACKNOWLEDGMENTS

This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via project 211504053 -Collaborative Research Center 1060, project 212212052 -"Geodesic Paths in Shape Space" (part of the NFN Geometry + Simulation), and via Germany's Excellence Strategy project 390685813 -Hausdorff Center for Mathematics and project 390685587 -Mathematics Münster: Dynamics-Geometry-Structure.

C RIEMANNIAN OPERATORS AND THEIR DISCRETIZATIONS IN BRIEF

The tangent space T_z S of a manifold S at a point z ∈ S is the vector space of all velocities a path in S can have when passing through z. A Riemannian manifold equips each of these tangent spaces T_z S with an inner product g_z(·, ·) so that norms of velocity vectors and angles between them can be measured. The length of a path γ : [0, 1] → S in the Riemannian manifold then is the time integral of the norm of its velocity γ̇,

L[γ] = ∫_0^1 √( g_{γ(t)}(γ̇(t), γ̇(t)) ) dt.

Given two points z_0, z_1 ∈ S, the shortest connecting path γ with γ(0) = z_0, γ(1) = z_1 is called a geodesic, and the Riemannian distance dist(z_0, z_1) between both points is defined as its length. It is essentially an application of Jensen's inequality that the geodesic connecting z_0 and z_1 can equivalently be found by minimizing the path energy

E[γ] = ∫_0^1 g_{γ(t)}(γ̇(t), γ̇(t)) dt.

Yet another view of geodesics, following from the optimality conditions of the above minimization, is that they are paths that always go straight and at constant speed, i.e. they do not accelerate (they neither change the direction of motion nor the velocity along this direction; on the earth's surface, for instance, geodesics are great circles). This last viewpoint suggests associating to every v in the tangent space T_z S at an arbitrary z ∈ S a point y ∈ S, defined as follows: Given v ∈ T_z S, start out at z with initial velocity v and continue straight (i.e. along a geodesic) for total time 1. The map which maps v to the arrival point y is the so-called Riemannian exponential map exp_z : T_z S → S. Via this exponential map one can identify the tangent space T_z S with the manifold S (at least if exp_z is injective; otherwise T_z S can be identified with a multiple covering of S). The inverse map is the Riemannian logarithm log_z : U → T_z S, defined on a small enough neighborhood U ⊂ S of z.
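On the unit sphere S^2 these maps have the well-known closed forms exp_z(v) = cos(‖v‖) z + sin(‖v‖) v/‖v‖ and log_z(y) = arccos(z · y) w/‖w‖ with w the projection of y onto the tangent plane at z. A minimal sketch, not part of the paper's shape-space setting:

```python
import numpy as np

def sphere_exp(z, v):
    """Riemannian exponential on the unit sphere (z a unit vector, v tangent)."""
    nv = np.linalg.norm(v)
    if nv < 1e-14:
        return z.copy()
    return np.cos(nv) * z + np.sin(nv) * v / nv

def sphere_log(z, y):
    """Riemannian logarithm on the unit sphere (y not antipodal to z)."""
    w = y - np.dot(z, y) * z               # project y onto tangent plane at z
    nw = np.linalg.norm(w)
    if nw < 1e-14:
        return np.zeros_like(z)
    ang = np.arccos(np.clip(np.dot(z, y), -1.0, 1.0))
    return ang * w / nw
```

Within the injectivity radius, log inverts exp, which is exactly the identification of a neighborhood of z with a neighborhood of 0 in T_z S described above.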
For a point z_1 ∈ U, it tells us the initial velocity log_z z_1 of the geodesic from z to z_1. Riemannian geodesics, logarithms, and exponentials are expensive to approximate computationally. The computation typically has to be performed in charts or local parametrizations of the manifold, i.e. S is identified with (an open subset of) R^n and the Riemannian metric with a symmetric positive definite matrix G_z ∈ R^{n×n} depending on z ∈ R^n. A variational discretization by Rumpf & Wirth (2015) approximates continuous paths γ by time-discrete paths (γ_0, . . . , γ_K), to be interpreted as polygonal paths in R^n with vertices γ_0, . . . , γ_K at times 0/K, 1/K, . . . , K/K. The path energy is then approximated by a discrete path energy

E_K[(γ_0, . . . , γ_K)] = K Σ_{k=1}^K W(γ_{k-1}, γ_k),

where W(z_0, z_1) is a second-order accurate approximation of the squared Riemannian distance dist^2(z_0, z_1) (for instance, W(z_0, z_1) = (z_0 - z_1)^T G_{z_0} (z_0 - z_1)). E_K may be viewed as a Riemann sum approximation of the integral in E. Minimizing E_K under fixed end points then yields a discrete K-geodesic, i.e. a discrete approximation (γ_0, . . . , γ_K) to a geodesic between γ_0 and γ_K. The initial velocity K(γ_1 - γ_0) of this polygonal path is the discrete approximation of the Riemannian logarithm log_{γ_0} γ_K. The discretization of the Riemannian exponential works the other way round: Given a velocity vector v ∈ R^n, we approximate exp_{γ_0} v ∈ R^n by that point γ_K ∈ R^n for which the discrete K-geodesic (γ_0, . . . , γ_K) has initial velocity K(γ_1 - γ_0) = v. This point can be found by a time-stepping procedure: one first sets γ_1 = γ_0 + v/K and then iteratively computes γ_2, γ_3, . . . ,
γ_K as follows: Since each triplet γ_{k-1}, γ_k, γ_{k+1} of a discrete K-geodesic forms a discrete 3-geodesic (much like any subsegment of a continuous geodesic is itself again a geodesic), γ_k must minimize z ↦ W(γ_{k-1}, z) + W(z, γ_{k+1}). The corresponding optimality condition is then solved via Newton's method for γ_{k+1}, given γ_{k-1} and γ_k. Instead of working in charts, one can also consider an implicitly defined manifold M = { z ∈ R^m | Q(z) = 0 } for a suitable smooth function Q : R^m → R^r. In this case, one proceeds analogously, constraining the search for points on the manifold via a Lagrangian approach.
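The time-stepping scheme for the discrete exponential can be sketched as follows. This is a minimal chart-based sketch assuming the simple choice W(z_0, z_1) = (z_1 - z_0)^T G_{z_0} (z_1 - z_0) and using finite-difference derivatives; it is not the paper's shell-energy implementation.

```python
import numpy as np

def discrete_exp(z0, v, K, metric, fd=1e-6):
    """Time-discrete exponential: step through the discrete 3-geodesic
    condition grad_1 W(g_k, g_{k+1}) + grad_2 W(g_{k-1}, g_k) = 0, solving
    for g_{k+1} by Newton's method. metric(z) returns the SPD matrix G_z."""
    z0, v = np.asarray(z0, float), np.asarray(v, float)
    n = len(z0)

    def W(a, b):
        d = b - a
        return d @ metric(a) @ d

    def grad1(a, b):
        # gradient of W with respect to its first argument (central differences)
        g = np.zeros(n)
        for i in range(n):
            e = np.zeros(n); e[i] = fd
            g[i] = (W(a + e, b) - W(a - e, b)) / (2 * fd)
        return g

    path = [z0, z0 + v / K]                 # initial velocity K(g_1 - g_0) = v
    for _ in range(K - 1):
        gm1, gk = path[-2], path[-1]
        c = 2.0 * metric(gm1) @ (gk - gm1)  # = grad_2 W(g_{k-1}, g_k)
        x = 2 * gk - gm1                    # initial guess: continue straight
        for _ in range(30):
            F = grad1(gk, x) + c
            J = np.zeros((n, n))            # finite-difference Jacobian of F
            for i in range(n):
                e = np.zeros(n); e[i] = fd
                J[:, i] = (grad1(gk, x + e) - grad1(gk, x - e)) / (2 * fd)
            dx = np.linalg.solve(J, -F)
            x = x + dx
            if np.linalg.norm(dx) < 1e-9:
                break
        path.append(x)
    return np.array(path)
```

For the flat metric G_z = Id, the scheme reproduces straight lines exactly; for a curved metric such as the hyperbolic half-plane metric G_z = Id / z_2^2, the path bends accordingly.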

