ESTIMATING RIEMANNIAN METRIC WITH NOISE-CONTAMINATED INTRINSIC DISTANCE

Abstract

We extend metric learning by studying the Riemannian manifold structure of the underlying data space induced by dissimilarity measures between data points. The key quantity of interest here is the Riemannian metric, which characterizes the Riemannian geometry and defines straight lines and derivatives on the manifold. Being able to estimate the Riemannian metric allows us to gain insights into the underlying manifold and to compute geometric features such as geodesic curves. We model the observed dissimilarity measures as noisy responses generated from a function of the intrinsic geodesic distance between data points. A new local regression approach is proposed to learn the Riemannian metric tensor and its derivatives based on a Taylor expansion for the squared geodesic distance. Our framework is general and accommodates different types of responses, whether they are continuous, binary, or comparative, extending existing works that consider a single type of response at a time. We develop a theoretical foundation for our method by deriving the rates of convergence for the asymptotic bias and variance of the estimated metric tensor. The proposed method is shown to be versatile in simulation studies and real data applications involving taxi trip time in New York City and MNIST digits.

1. INTRODUCTION

The estimation of a distance metric, also known as metric learning, has attracted great interest since its introduction for classification (Hastie & Tibshirani, 1996) and clustering (Xing et al., 2002). A global Mahalanobis distance is commonly used to obtain the best distance for discriminating two classes (Xing et al., 2002; Weinberger & Saul, 2009). While a global metric is often the focus of earlier works, multiple local metrics (Frome et al., 2007; Weinberger & Saul, 2009; Ramanan & Baker, 2011; Chen et al., 2019) are found to be useful because they better capture the geometry of the data space. There is a great body of work on distance metric learning; see, e.g., Bellet et al. (2015); Suárez et al. (2021) for recent reviews. Metric learning is intimately connected with learning on Riemannian manifolds. Hauberg et al. (2012) connects multi-metric learning to learning the geometric structure of a Riemannian manifold, and advocates its benefits in regression and dimensionality reduction tasks. Lebanon (2002; 2006); Le & Cuturi (2015) discuss Riemannian metric learning by utilizing a parametric family of metrics, and demonstrate applications in text and image classification. Like these works, our target is to learn the Riemannian metric instead of a distance metric, which fundamentally differentiates our approach from most existing works in metric learning. We focus on the nonparametric estimation of the data geometry as quantified by the Riemannian metric tensor. Contrary to distance metric learning, where the coefficient matrix for the Mahalanobis distance is constant in a neighborhood, the Riemannian metric is a smooth tensor field that allows the analysis of finer structures. Our emphasis is on inference, namely learning how differences in the response measure are explained by specific differences in the predictor coordinates, rather than on obtaining a metric optimal for a supervised learning task.
A related field is manifold learning, which attempts to find low-dimensional nonlinear representations of apparently high-dimensional data sampled from an underlying manifold (Roweis & Saul, 2000; Tenenbaum et al., 2000; Coifman & Lafon, 2006). Those embedding methods generally start by assuming that the local geometry is given, e.g., by the Euclidean distance between ambient data points. Not all existing methods are isometric, so the geometry obtained this way can be distorted. Perraul-Joncas & Meila (2013) use the Laplacian operator to obtain the pushforward metric for the low-dimensional representations. Instead of specifying the geometry of the ambient space, our focus is to learn the geometry from noisy measures of intrinsic distances. Fefferman et al. (2020) discusses an abstract setting of this task, while our work proposes a practical estimation of the Riemannian metric tensor when coordinates are also available, and we show that our approach is numerically sound. We suppose that data are generated from an unknown Riemannian manifold, and that the coordinates of the data objects are available. The Euclidean distance between the coordinates may not reflect the underlying geometry. Instead, we assume that we further observe similarity measures between objects, modeled as noise-contaminated intrinsic distances, that are used to characterize the intrinsic geometry of the Riemannian manifold. The targeted Riemannian metric is estimated in a data-driven fashion, which enables estimating geodesics (straight lines and locally shortest paths) and performing calculus on the manifold. To formulate the problem, let (M, G) be a Riemannian manifold with Riemannian metric G, and let dist(·,·) be the geodesic distance induced by G, which measures the true intrinsic difference between points. The coordinates of data points x_0, x_1 ∈ M are assumed known, identifying each point via a tuple of real numbers.
Also observed are noisy measurements y of the intrinsic distance between data points, which we refer to as similarity measurements (equivalently, dissimilarity). The response is modeled flexibly, and we consider the following common scenarios: (i) noisy distance, where y = dist(x_0, x_1)^2 + ε for an error ε; (ii) similarity/dissimilarity, where y = 0 if the two points x_0, x_1 are considered similar and y = 1 otherwise; and (iii) relative comparison, where a triplet of points (x_0, x_1, x_2) is given and y = 1 if x_0 is more similar to x_1 than to x_2 and y = 0 otherwise. The binary similarity measurement is common in computer vision (e.g., Chopra et al., 2005), while the relative comparison can be useful for perceptual tasks and recommendation systems (e.g., Schultz & Joachims, 2003; Berenzweig et al., 2004). We aim to estimate the Riemannian metric G and its derivatives using the coordinates and similarity measures among the data points. The contribution of this paper is threefold. First, we formulate a framework for the probabilistic modeling of similarity measurements among data on a manifold via intrinsic distances. Based on a Taylor expansion for the spread of geodesic curves in differential geometry, a local regression procedure successfully estimates the Riemannian metric and its derivatives. Second, a theoretical foundation is developed for the proposed method, including asymptotic consistency. Last and most importantly, the proposed method provides a geometric interpretation for the structure of the data space induced by the similarity measurements, as demonstrated in numerical examples that include a taxi travel and an MNIST digit application.

2. BACKGROUND IN RIEMANNIAN GEOMETRY

For brevity, metric now refers to the Riemannian metric, while distance metric is always spelled out. Throughout the paper, M denotes a d-dimensional manifold endowed with a coordinate chart (U, φ), where φ : U → R^d maps a point p ∈ U ⊂ M on the manifold to its coordinate φ(p) = (φ^1(p), ..., φ^d(p)) ∈ R^d. Without loss of generality, we identify a point by its coordinate as (p^1, ..., p^d), suppressing φ for the coordinate chart. Upper-script Roman letters denote the components of a coordinate, e.g., p^i is the i-th entry in the coordinate of the point p, and γ^i is the i-th component function of a curve γ : [a, b] ⊂ R → M when expressed on the chart U. The tangent space T_pM is a vector space consisting of velocities of the form v = γ'(0), where γ is any curve satisfying γ(0) = p. The coordinate chart induces a basis on the tangent space T_pM, namely ∂_i|_p = ∂/∂x^i|_p for i = 1, ..., d, so that a tangent vector v ∈ T_pM is represented as v = Σ_{i=1}^d v^i ∂_i for some v^i ∈ R, suppressing the subscript p in the basis. We adopt the Einstein summation convention unless otherwise specified, namely v^i ∂_i denotes Σ_{i=1}^d v^i ∂_i, where a common pair of upper and lower indices denotes a summation from 1 to d (see, e.g., Lee, 2013, pp. 18-19). The Riemannian metric G on a d-dimensional manifold M is a smooth tensor field acting on the tangent vectors. At any p ∈ M, G(p) : T_pM × T_pM → R is a symmetric bilinear tensor satisfying G(p)(v, v) ≥ 0 for any v ∈ T_pM, and G(p)(v, v) = 0 if and only if v = 0. On a chart φ, the metric is represented as a d-by-d positive definite matrix that quantifies the distance traveled along infinitesimal changes in the coordinates. With an abuse of notation, the chart representation of G is given by the matrix-valued function p ↦ G(p) = [G_ij(p)]_{i,j=1}^d ∈ R^{d×d} for p ∈ M, so the distance traveled by γ at time t over an infinitesimal duration dt is [G_ij(γ(t)) γ̇^i(t) γ̇^j(t)]^{1/2} dt.
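For readers less familiar with index notation, the contraction G_ij v^i v^j maps directly onto a numerical einsum; below is a minimal sketch with an arbitrary, purely illustrative positive definite G.

```python
import numpy as np

# Metric at a point p of a d = 2 manifold, represented as a positive definite
# matrix [G_ij]; the values are illustrative only.
G = np.array([[2.0, 0.5],
              [0.5, 1.0]])
v = np.array([1.0, -1.0])   # components v^i of a tangent vector in the chart basis

# Einstein summation G_ij v^i v^j: the squared norm of v under G
sq_norm = np.einsum('ij,i,j->', G, v, v)
print(sq_norm)   # 2.0 = G_11 - 2 G_12 + G_22
```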
The intrinsic distance induced by G, or the geodesic distance, is computed as

dist(p, q) = inf_α ∫_0^1 [Σ_{1≤i,j≤d} G_ij(α(t)) α̇^i(t) α̇^j(t)]^{1/2} dt, (2.1)

for two points p, q on the manifold M, where the infimum is taken over all curves α : [0, 1] → M connecting p to q. A geodesic curve (or simply geodesic) is a smooth curve γ : [a, b] ⊂ R → M satisfying the geodesic equations, represented on a coordinate chart as

γ̈^k(t) + γ̇^i(t) γ̇^j(t) (Γ^k_ij ∘ γ)(t) = 0, for k = 1, ..., d, (2.2)

where over-dots represent derivatives w.r.t. t; Γ^k_ij = (1/2) G^{kl} (∂_i G_jl + ∂_j G_il - ∂_l G_ij) are the Christoffel symbols; and G^{kl} is the (k, l)-element of G^{-1}. Solving (2.2) with initial conditions produces a geodesic that traces out the generalization of a straight line on the manifold, preserving travel direction with no acceleration, and is also locally the shortest path. Considering the shortest path γ connecting p to q and applying Taylor's expansion at t = 0, we obtain dist(p, q)^2 ≈ Σ_{1≤i,j≤d} G_ij(p)(q^i - p^i)(q^j - p^j), showing the connection between the geodesic distance and a quadratic form analogous to the Mahalanobis distance. Our estimation method is based on this approximation, and we will shortly discuss the higher-order terms, which unveil finer structures of the manifold.
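As a quick numerical sanity check of this quadratic approximation (not part of the paper's procedure), one can compare the exact great-circle distance on the unit sphere with the quadratic form under stereographic coordinates, where the round metric is G_ij(x) = 4(1 + |x|^2)^{-2} 1{i=j}; the projection from the north pole is assumed.

```python
import numpy as np

def inv_stereo(x):
    # inverse stereographic projection R^2 -> S^2 (projecting from the north pole)
    s = np.sum(x**2)
    return np.array([2 * x[0], 2 * x[1], s - 1.0]) / (s + 1.0)

def geodesic_dist(x, y):
    # great-circle distance between the sphere points with chart coordinates x, y
    return np.arccos(np.clip(inv_stereo(x) @ inv_stereo(y), -1.0, 1.0))

def round_metric(p):
    # round metric in stereographic coordinates: G_ij = 4 (1 + |p|^2)^{-2} delta_ij
    return 4.0 / (1.0 + np.sum(p**2))**2 * np.eye(2)

p = np.array([0.3, -0.2])
q = p + np.array([0.005, -0.004])          # a nearby point
delta = q - p
exact = geodesic_dist(p, q)**2             # dist(p, q)^2
approx = delta @ round_metric(p) @ delta   # G_ij(p) (q^i - p^i)(q^j - p^j)
print(exact, approx)
```

The discrepancy shrinks as q approaches p; away from normal coordinates, the leading correction is the Christoffel term discussed in Section 3.2.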

3.1. PROBABILISTIC MODELING FOR SIMILARITY MEASUREMENTS

Suppose that we observe N independent triplets (Y_u, X_u0, X_u1), u = 1, ..., N. Here, the X_uj are locations on the manifold identified with their coordinates (X^1_uj, ..., X^d_uj) ∈ R^d, j = 0, 1, and the Y_u are noisy similarity measures of the proximity of (X_u0, X_u1) in terms of the intrinsic geodesic distance dist(·,·) on M. To account for different structures of the similarity measurements, we model the response in a fashion analogous to generalized linear models. For X_u0, X_u1 lying in a small neighborhood U_p ⊂ M of a target location p ∈ M, the similarity measure Y_u is modeled as

E(Y_u | X_u0, X_u1) = g^{-1}(dist(X_u0, X_u1)^2), (3.1)

where g is a given link function that relates the conditional expectation to the squared distance.

Example 3.1. We describe below three common scenarios modeled by the general framework (3.1).

1. Continuous response being the squared geodesic distance contaminated with noise:

Y_u = dist(X_u0, X_u1)^2 + σ(p) ε_u, (3.2)

where ε_1, ..., ε_N are i.i.d. mean-zero random variables, and σ : M → R_+ is a smooth positive function determining the magnitude of noise near the target point p. This model will be applied to trip time as a noisy measure of the cost to travel between locations.

2. Binary (dis)similarity response:

P(Y_u = 1 | X_u0, X_u1) = logit^{-1}(dist(X_u0, X_u1)^2 - h(p)) (3.3)

for some smooth function h : M → R, where logit(μ) = log(μ/(1 - μ)), μ ∈ (0, 1), is the logit function. This models the case when there are latent labels for the X_uj (e.g., digit 6 vs. 9) and Y_u measures whether their labels are in common or not. The function h(p) in (3.3) describes the homogeneity of the latent labels for points in a small neighborhood of p. The latent labels could have intrinsic variation even if measurements are made for the same data point x = X_u0 = X_u1, and the strength of this variation is captured by h(p).

3.
Binary relative comparison response, where we extend the model to triplets of points (X_u0, X_u1, X_u2) and Y_u indicates whether X_u0 is more similar to X_u1 than to X_u2:

P(Y_u = 1 | X_u0, X_u1, X_u2) = logit^{-1}(dist(X_u0, X_u2)^2 - dist(X_u0, X_u1)^2), (3.4)

so that the relative comparison Y_u reflects the comparison of squared distances.
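The three response mechanisms above can be simulated directly. The sketch below takes M = R^2 with the identity metric, so the intrinsic distance is simply Euclidean; the values of σ and the constant h(p) = h0 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(z):                      # inverse logit
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative setup: M = R^2 with the identity metric, so the intrinsic
# distance is the Euclidean one.
N = 1000
X0 = rng.uniform(-1, 1, (N, 2))
X1 = X0 + rng.uniform(-0.2, 0.2, (N, 2))
X2 = X0 + rng.uniform(-0.2, 0.2, (N, 2))
d01 = np.sum((X0 - X1)**2, axis=1)    # dist(X_u0, X_u1)^2
d02 = np.sum((X0 - X2)**2, axis=1)    # dist(X_u0, X_u2)^2

sigma = 0.01                          # (i) continuous noisy response, model (3.2)
y_cont = d01 + sigma * rng.standard_normal(N)

h0 = 0.02                             # (ii) binary (dis)similarity, model (3.3)
y_bin = rng.binomial(1, logistic(d01 - h0))

# (iii) relative comparison, model (3.4)
y_rel = rng.binomial(1, logistic(d02 - d01))
print(y_cont[:3], y_bin[:5], y_rel[:5])
```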

3.2. LOCAL APPROXIMATION OF SQUARED DISTANCES

We develop a local approximation for the squared distance as the key tool to estimate our model (3.1) through local regression. Proposition 3.1 provides a Taylor expansion for the squared geodesic distance between two geodesics with the same starting point but different initial velocities (see Figure D.1 for a visualization). For a point p on the Riemannian manifold M, let exp_p : T_pM → M denote the exponential map defined by exp_p(tv) = γ(t), where γ is a geodesic starting from p at time 0 with initial velocity γ'(0) = v ∈ T_pM. For notational simplicity, we suppress the dependency on p in geometric quantities (e.g., the metric G is understood to be evaluated at p). For i = 1, ..., d, denote δ^i = δ^i(t) = γ^i(t) - γ^i(0) as the difference in coordinates after a travel of time t along γ.

Proposition 3.1 (spread of geodesics, in coordinates). Let p ∈ M and v, w ∈ T_pM be two tangent vectors at p. On a local coordinate chart, the squared geodesic distance between the two geodesics γ_0(t) = exp_p(tv) and γ_1(t) = exp_p(tw) satisfies, as t → 0,

dist(γ_0(t), γ_1(t))^2 = δ^i_{0-1} δ^j_{0-1} G_ij + δ^i_{0-1} (δ^k_0 δ^l_0 - δ^k_1 δ^l_1) Γ^j_kl G_ij + O(t^4), (3.5)

where, for i, j, k, l = 1, ..., d,
• δ^i_0 = γ^i_0(t) - p^i, δ^i_1 = γ^i_1(t) - p^i, and δ^i_{0-1} = δ^i_0 - δ^i_1, i.e., δ^i_0, δ^i_1 are the differences in the i-th coordinates of γ_0(t) and γ_1(t) to the origin p, respectively, and δ^i_{0-1} is the coordinate difference between γ_0(t) and γ_1(t);
• G_ij and Γ^j_kl are the elements of the metric and the Christoffel symbols at p, respectively.

On the RHS of (3.5), the first term is the quadratic form appearing in distance metric learning. The second term results from the coordinate representation of geodesics; it vanishes under a normal coordinate chart, where the Christoffel symbols are zero. This expansion inspires the use of local regression to estimate the metric tensor and the Christoffel symbols.
For X_u0, X_u1 in a small neighborhood of p, write the linear predictor as

η_u := β^(0) + δ^i_{u,0-1} δ^j_{u,0-1} β^(1)_ij + δ^k_{u,0-1} (δ^i_u0 δ^j_u0 - δ^i_u1 δ^j_u1) β^(2)_ijk, (3.6)

a function of the intercept β^(0) and coefficients β^(1)_ij, β^(2)_ijk, where δ^i_u0 = X^i_u0 - p^i, δ^i_u1 = X^i_u1 - p^i, and δ^i_{u,0-1} = δ^i_u0 - δ^i_u1, for i, j, k = 1, ..., d and u = 1, ..., N. The intercept term β^(0) is included to capture h(p) in (3.3) under that scenario and can otherwise be dropped from the model. The link function connects the linear predictor to the conditional mean via μ_u := g^{-1}(η_u) ≈ E(Y_u | X_u0, X_u1), as indicated by (3.1) and (3.5), where μ_u is seen as a function of the coefficients β^(0), β^(1)_ij, and β^(2)_ijk. Therefore, upon the specification of a loss function Q : R × R → {0} ∪ R_+ and non-negative weights w_1, ..., w_N, the minimizers

(β̂^(0), β̂^(1)_ij, β̂^(2)_ijk) = argmin_{β^(0), β^(1)_ij, β^(2)_ijk} Σ_{u=1}^N Q(Y_u, μ_u) w_u, (3.7)

subject to β^(1)_ij = β^(1)_ji, β^(2)_ijk = β^(2)_jik, for i, j, k = 1, ..., d, (3.8)

are used to estimate the metric tensor and the Christoffel symbols, obtaining

Ĝ_ij = β̂^(1)_ij, Γ̂^l_ij = β̂^(2)_ijk Ĝ^kl, (3.9)

where Ĝ^kl is the (k, l)-element of the matrix inverse of Ĝ satisfying Ĝ^kl Ĝ_kj = 1{j=l}. The symmetry constraints (3.8) result from the symmetries in the metric tensor and the Christoffel symbols, and are enforced by optimizing without constraints over only the indices with i ≤ j. Asymptotic results guarantee the positive-definiteness of the metric estimate, as will be shown in Proposition 4.1. To weigh the pairs of endpoints according to their proximity to the target location p, we apply kernel weights specified by

w_u = h^{-2d} Π_{i=1}^d K((X^i_u0 - p^i)/h) K((X^i_u1 - p^i)/h), (3.10)

for some bandwidth h > 0 and non-negative kernel function K. The bandwidth h controls the bias-variance trade-off of the estimated Riemannian metric tensor and its derivatives. The following loss and link specifications are used for the scenarios in Example 3.1.

1.
Continuous noisy response: use the squared loss Q(y, μ) = (y - μ)^2 with g the identity link, so μ_u = η_u.

2. Binary (dis)similarity response: use the negative log-likelihood of a Bernoulli random variable,

Q(y, μ) = -{y log μ + (1 - y) log(1 - μ)}, (3.11)

and g the logit link, so μ_u = logit^{-1}(η_u). The model becomes a local logistic regression.

3. Binary relative comparison response: apply the same loss function (3.11) and logit link as in the previous scenario, but here the linear predictor is formulated through dist(X_u0, X_u2)^2 - dist(X_u0, X_u1)^2 ≈ η_u2 - η_u1, with

μ_u = g^{-1}(η_u2 - η_u1). (3.12)

Locally, the difference η_u1 - η_u2 in the linear predictors is

η_u1 - η_u2 = (δ^i_{u,0-1} δ^j_{u,0-1} - δ^i_{u,0-2} δ^j_{u,0-2}) β^(1)_ij + (δ^k_{u,0-1} (δ^i_u0 δ^j_u0 - δ^i_u1 δ^j_u1) - δ^k_{u,0-2} (δ^i_u0 δ^j_u0 - δ^i_u2 δ^j_u2)) β^(2)_ijk, (3.13)

where δ^i_u2 = X^i_u2 - p^i and δ^i_{u,0-2} = δ^i_u0 - δ^i_u2, for i = 1, ..., d. Here η_u1 and η_u2 are constructed in analogy to (3.6) using the (X_u0, X_u1) and (X_u0, X_u2) pairs, respectively. Examples in Section 5 further illustrate the proposed method in these scenarios. Besides the models listed, other choices of the link g and loss function Q can be considered under this local regression framework (Fan & Gijbels, 1996), accommodating a wide variety of data. To efficiently estimate the metric on the entire manifold M, we apply a procedure based on discretization and post-smoothing, as detailed in Appendix B.
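As one concrete instance of the binary (dis)similarity scenario, the local logistic regression can be fitted with Newton (IRLS) steps. The sketch below is not the paper's implementation: it assumes a constant metric on R^2, so the squared distance equals the quadratic form exactly and all kernel weights can be taken as 1; the values of G and h(p) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed constant metric and offset h(p) = h0 (illustrative values)
G_true = np.array([[2.0, 0.5], [0.5, 1.0]])
h0 = 1.0
N = 100_000
delta = rng.uniform(-1.0, 1.0, (N, 2))     # coordinate differences X_u0 - X_u1
dist2 = np.einsum('ui,ij,uj->u', delta, G_true, delta)
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(dist2 - h0))))

# design: intercept plus quadratic monomials; the off-diagonal pair is kept
# once with a factor of 2, enforcing the symmetry constraint (3.8)
D = np.column_stack([np.ones(N), delta[:, 0]**2,
                     2 * delta[:, 0] * delta[:, 1], delta[:, 1]**2])
beta = np.zeros(4)
for _ in range(25):                        # Newton / IRLS iterations
    mu = 1.0 / (1.0 + np.exp(-(D @ beta)))
    r = mu * (1.0 - mu)
    beta += np.linalg.solve(D.T @ (D * r[:, None]), D.T @ (Y - mu))

G_hat = np.array([[beta[1], beta[2]], [beta[2], beta[3]]])
print(np.round(G_hat, 2), round(-beta[0], 2))   # estimates of G and of h(p)
```

The intercept estimates -h(p), mirroring the role of β^(0) in (3.6).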

4. BIAS AND VARIANCE OF THE ESTIMATED METRIC TENSOR

This section provides asymptotic justification for model (3.2) with E(Y_u | X_u0, X_u1) = dist(X_u0, X_u1)^2 under the squared loss Q(y, μ) = (y - μ)^2 and the identity link g(μ) = μ. The estimator analyzed here fits a local quadratic regression without intercept and approximates the squared distance by a simplified form of (3.6):

dist(X_u0, X_u1)^2 ≈ η_u := δ^i_{u,0-1} δ^j_{u,0-1} β^(1)_ij, for u = 1, ..., N. (4.1)

Given a suitable ordering of the indices (i, j) for vectorization, we rewrite the formulation in matrix form. Denote the local design matrix and regression coefficients as

D_u = (δ^1_{u,0-1} δ^1_{u,0-1}, ..., δ^i_{u,0-1} δ^j_{u,0-1}, ..., δ^d_{u,0-1} δ^d_{u,0-1})^T, β = (β^(1)_11, ..., β^(1)_ij, ..., β^(1)_dd)^T,

so that the linear predictor is η_u = D_u^T β. Further write D = (D_1, ..., D_N)^T, Y = (Y_1, ..., Y_N)^T, and W = diag(w_1, ..., w_N), with the weights w_u specified in (3.10). The objective function in (3.7) becomes (Y - Dβ)^T W (Y - Dβ), and the minimizer is β̂ = (D^T W D)^{-1} D^T W Y, for which we analyze the bias and variance. To characterize the asymptotic bias and variance of the estimator, we assume the following conditions are satisfied in a neighborhood of the target p. These conditions are standard and analogous to those assumed in a local regression setting (Fan & Gijbels, 1996).

(A1) The joint density of the endpoints (X_u0, X_u1) is positive and continuously differentiable.
(A2) The functions G_ij and Γ^k_ij are C^2-smooth for i, j, k = 1, ..., d.
(A3) The kernel K in the weights (3.10) is symmetric, continuous, and has bounded support.
(A4) sup_u var(Y_u | X_u0, X_u1) < ∞.

Proposition 4.1. Under (A1)-(A4),

bias(β̂ | X) = O_p(h^2), var(β̂ | X) = O_p(N^{-1} h^{-4-2d}),

as h → 0 and N h^{2+2d} → ∞, where X = {(X_u0, X_u1)}_{u=1}^N is the collection of observed endpoints.
The local approximation (4.1) is similar to a local polynomial estimation of the second derivative of a 2d-variate squared geodesic distance function, explaining the order of h in the rates for bias and variance.
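The closed-form estimator β̂ = (D^T W D)^{-1} D^T W Y can be sketched numerically. The example below again assumes a constant metric on R^2 (illustrative values), so that the approximation (4.1) is exact; the symmetry of β^(1) is folded in by keeping one coefficient per index pair i ≤ j with a factor of 2 on the off-diagonal monomial.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed constant metric on R^2 (illustrative): dist^2 equals the quadratic
# form exactly, so the only error in Y is the additive noise.
G_true = np.array([[2.0, 0.5], [0.5, 1.0]])
N, h = 20_000, 0.5
X0 = rng.uniform(-1, 1, (N, 2))
X1 = X0 + rng.uniform(-0.2, 0.2, (N, 2))
delta = X0 - X1
Y = np.einsum('ui,ij,uj->u', delta, G_true, delta) \
    + 0.001 * rng.standard_normal(N)

# design matrix of quadratic monomials delta^i delta^j
D = np.column_stack([delta[:, 0]**2, 2 * delta[:, 0] * delta[:, 1],
                     delta[:, 1]**2])
# product rectangular-kernel weights (3.10) around the target p = 0
w = (np.all(np.abs(X0) <= h, axis=1)
     & np.all(np.abs(X1) <= h, axis=1)).astype(float)

# beta_hat = (D^T W D)^{-1} D^T W Y, without forming the N x N matrix W
beta = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * Y))
G_hat = np.array([[beta[0], beta[1]], [beta[1], beta[2]]])
print(np.round(G_hat, 3))
```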

5. SIMULATION

We illustrate the proposed method using simulated data with the different types of responses described in Example 3.1. We study whether the proposed method accurately estimates Riemannian geometric quantities, including the metric tensor, geodesics, and Christoffel symbols. Additional details are included in Appendix C of the Supplementary Materials.

5.1. UNIT SPHERE

The usual arc-length/great-circle distance on the d-dimensional unit sphere is induced by the round metric, which is expressed under the stereographic projection coordinate (x^1, ..., x^d) as

G_ij = 4 (1 + Σ_{k=1}^d x^k x^k)^{-2} 1{i=j}, for i, j = 1, ..., d.

For binary responses under model (3.3), Figure 5.1c visualizes the data, where the background color illustrates h. Figure 5.1d and the left panel of Figure 5.1a suggest that the intercept and the metric were reasonably estimated, while the geodesics are slightly away from the truth (Figure 5.1a, right). This indicates that the binary model has higher complexity and that less information is provided by the binary response (see also Figure C.1b in the Supplementary Materials).

5.2. RELATIVE COMPARISON ON THE DOUBLE SPIRALS

A set of 7 × 10^4 points in R^2 were generated around two spirals, corresponding to two latent classes A and B (e.g., green points in Figure 5.2a are from latent class A). We compare neighboring points (X_u0, X_u1, X_u2) to generate the relative comparison response Y_u as follows. For u = 1, ..., N, Y_u = 1 if X_u0, X_u1 belong to the same latent class and X_u0, X_u2 belong to different classes, and Y_u = 0 otherwise. Here, the contrast of the two latent classes induces the intrinsic distance, so the distance is larger across the supports of the two classes and smaller within a single support. Therefore, the resulting metric tensor should reflect less cost when moving along the tangential direction of the spirals compared to the perpendicular directions. Estimates were drawn under model (3.4) by minimizing the objective (3.7) with the link function (3.12) and the local approximation (3.13). The estimated metric shown in Figure 5.2c is consistent with the interpretation of the intrinsic distance and metric induced by the class membership discussed above. Meanwhile, the estimated geodesic curve unveils the hidden circular structure of the data support (Figure 5.2).

6. NEW YORK CITY TAXI TRIPS

We accessed the New York City taxi trip records (retrieved on May 1st, 2022) to obtain business-day morning taxi trip records, including the GPS coordinates of the pickup/dropoff locations as (X_u0, X_u1) and the trip duration as Y_u. The travel time metric was estimated under model (3.2) with Q(y, μ) = (y - μ)^2 and g(μ) = μ. Figure 6.1a shows the estimated metric for taxi travel time. The background color shows the Frobenius norm of the metric tensor, where larger values mean that longer travel time is required to pass through that location. Trips through midtown Manhattan and the financial district were estimated to be the most costly during rush hours, which is coherent with the fact that these are the city's primary business districts. Moreover, the cost ellipses represent the cost in time to travel a unit distance along different directions.
This suggests that in Manhattan, it takes longer to drive along the east-west direction (narrower streets) than along the north-south direction (wider avenues). Geodesic curves in Figure 6.1b show where a 15-minute taxi ride starting from the Empire State Building leads. Each geodesic curve corresponds to one of 12 starting directions (1-12 o'clock). Note that we apply a continuous Riemannian manifold approximation to the city, so the geodesic curves provide approximations to the shortest paths between locations and need not conform to the road network. Travel appears to be faster in lower Manhattan than in midtown Manhattan. The spread of the geodesics differs along different directions, indicating the existence of non-constant curvature on the manifold and advocating for estimating the Riemannian metric tensor field instead of applying a single global distance metric.

7. HIGH-DIMENSIONAL DATA: AN EXAMPLE WITH MNIST

The curse of dimensionality is a major challenge in applying nonparametric methods to data sources such as images and audio. However, it is often found that apparently high-dimensional data actually lie close to some low-dimensional manifold, which is exploited by the manifold learning literature to produce reasonable low-dimensional coordinate representations. The proposed method can then be applied to the resulting low-dimensional coordinates, as in the following MNIST example. We embed the images in MNIST into a 2-dimensional space via tSNE (Hinton & Roweis, 2002). Similarity between the objects was computed as the sum of the Wasserstein distance between images and the indicator of whether the underlying digits are different (1) or not (0). The goal is to infer the geometry of the embedded data induced by these similarity measures. The geodesics estimated by our method tend to minimize the number of switches between labels. For example, geodesic A in Figure 7.1 remains "4" (1st row of panel (b)) throughout, while the straight line on the tSNE chart translates to a path of images switching between "4" and "9". Also, our estimated geodesics produce reasonable transitions and reside in the space of digits, while unrestricted optimal transport (3rd, 6th, and 9th rows of panel (b)) could produce unrecognizable intermediate images. Our estimated geodesic is faithful to the geometric interpretation that a geodesic is locally the shortest path.

[Figure 7.1a: estimated geodesics A and C overlaid on the tSNE embedding (axes tSNE.1 and tSNE.2).]

8. DISCUSSION

We present a novel framework for inferring the data geometry based on pairwise similarity measures. Our framework targets data lying on a low-dimensional manifold, since observations need to be dense near the locations where we wish to estimate the metric. However, these assumptions are general enough for our method to be applied to manifold data with high ambient dimension in combination with manifold embedding tools. Context-specific interpretation of the geometrical notions, e.g., the Riemannian metric and geodesics, has been demonstrated in the taxi travel and MNIST digit examples. Our method has the potential to be applied in many other application communities, such as cognition and perception research, where psychometric similarity measures are commonly made.

A ADDITIONAL DEFINITION

A cost ellipse visualizes the metric by an ellipse

E_p = {(x^1, ..., x^d) : Σ_{i,j=1}^d (x^i - p^i)(x^j - p^j) G^{ij} = r^2}, (A.1)

for some constant r > 0, which shows, approximately, the intrinsic distance on the manifold when traveling a unit length on the coordinate chart along each direction. More precisely, it shows the norms of the tangent vectors v^i ∂_i ∈ T_pM at p subject to Σ_{i=1}^d v^i v^i = r^2 pointing in the corresponding directions. For example, on a d = 2-dimensional manifold at p = 0 with G = diag(λ_1, λ_2), λ_1 > λ_2, and r = 1, the long axis (±√λ_1, 0) is the norm of the tangent vector ±∂_1. Thus, the direction in which the ellipse is larger corresponds to the direction of the larger geodesic distance. One can see (A.1) as the "inverse" of the Tissot indicatrix, where the latter shows a local equidistance contour around the ellipse's center. The Frobenius norm of tensors is denoted ‖·‖_F, defined as

‖G‖_F = (Σ_{i,j=1}^d G_ij G_ij)^{1/2}, ‖Γ‖_F = (Σ_{i,j,k=1}^d Γ^k_ij Γ^k_ij)^{1/2}, (A.2)

for the metric tensor G and Christoffel symbols Γ.
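With G = diag(λ_1, λ_2) as above, the semi-axes of the cost ellipse have lengths r√λ along the eigenvector directions of G; a small sketch with illustrative values:

```python
import numpy as np

def cost_ellipse_axes(G, r=1.0):
    # semi-axis lengths r * sqrt(lambda) and directions (eigenvectors of G);
    # for G = diag(lam1, lam2) and r = 1 the long axis is (+-sqrt(lam1), 0)
    lam, vecs = np.linalg.eigh(G)
    return r * np.sqrt(lam), vecs

G = np.diag([4.0, 1.0])                 # illustrative metric at p
lengths, dirs = cost_ellipse_axes(G)
print(lengths)                          # semi-axes 1.0 and 2.0 (ascending)
```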

B IMPLEMENTATION NOTES

An R (R Core Team, 2022) package was developed to implement the proposed methods and all numerical experiments. We utilize an efficient procedure to obtain estimates Ĝ_ij over the entire manifold M as follows. We first obtain estimates Ĝ_ij(p_n) over a dense grid of points p_1, p_2, ..., p_Ngrid ∈ M by following (3.7)-(3.9). Next, the estimate Ĝ_ij(x) at an arbitrary x ∈ M is obtained by the post-smoothing estimate

Ĝ_ij(x) = [Σ_{n=1}^{Ngrid} K(‖x - p_n‖/h_ps) Ĝ_ij(p_n)] / [Σ_{n=1}^{Ngrid} K(‖x - p_n‖/h_ps)],

for some kernel K and bandwidth h_ps > 0. We also use local regression (Loader, 1999) for post-smoothing. The grids for the examples (Sections 5, 6, and 7) are 128 × 128 for the unit sphere, 80 × 80 for the double spirals, 250 × 250 meter cells for the New York taxi example, and 64 × 64 for the MNIST example. The estimated geodesics are computed by numerically solving a system of ordinary differential equations, either given the starting point and initial velocity, or given the starting and end points. It suffices to notice that the geodesic equations (2.2) are equivalently written, after plugging in the estimated Christoffel symbols Γ̂, as

γ̇^i(t) = v^i(t), v̇^k(t) = -v^i(t) v^j(t) (Γ̂^k_ij ∘ γ)(t), for i, j, k = 1, ..., d,

where (Γ̂^k_ij ∘ γ)(t) is the value of the estimated Christoffel symbol at the point (γ^1(t), ..., γ^d(t)). Further supplying the initial conditions γ^i(0) = p^i_0, v^i(0) = v^i_0, i = 1, ..., d, for a point p_0 ∈ M and tangent vector v_0 ∈ T_p0 M constitutes an initial value problem, whose solution is the geodesic curve starting from p_0 with initial velocity v_0. On the other hand, supplying the boundary conditions γ^i(0) = p^i_0, γ^i(1) = p^i_1, i = 1, ..., d, for p_0, p_1 ∈ M constitutes a boundary value problem, whose solution is the geodesic curve from p_0 to p_1. We use deSolve (Soetaert et al., 2010) and bvpSolve (Mazzia et al., 2014) for the initial value problems and boundary value problems, respectively; see the references therein for further details on the numerical solution of ODEs.
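The initial value problem can be sketched with a classical Runge-Kutta integrator in place of deSolve. For a self-contained example we plug in the closed-form Christoffel symbols of the round metric in stereographic coordinates, Γ^k_ij = -2(δ^k_i x_j + δ^k_j x_i - δ_ij x^k)/(1 + |x|^2), rather than estimated ones:

```python
import numpy as np

def christoffel_round(x):
    # Gamma^k_ij of the round metric in stereographic coordinates,
    # returned as an array indexed Gam[i, j, k]
    d = len(x)
    I = np.eye(d)
    Gam = (I[:, None, :] * x[None, :, None]      # delta^k_i x_j
           + I[None, :, :] * x[:, None, None]    # delta^k_j x_i
           - I[:, :, None] * x[None, None, :])   # delta_ij x_k
    return -2.0 * Gam / (1.0 + x @ x)

def rhs(state):
    # first-order system: gamma' = v, v'^k = -Gamma^k_ij v^i v^j
    x, v = state[:2], state[2:]
    acc = -np.einsum('ijk,i,j->k', christoffel_round(x), v, v)
    return np.concatenate([v, acc])

def rk4(state, t_end, n_steps=200):
    # classical 4th-order Runge-Kutta integration of the IVP
    dt = t_end / n_steps
    for _ in range(n_steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * dt * k1)
        k3 = rhs(state + 0.5 * dt * k2)
        k4 = rhs(state + dt * k3)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return state

# geodesic from the chart origin with initial velocity (1, 0)
end = rk4(np.array([0.0, 0.0, 1.0, 0.0]), t_end=0.5)
print(end[:2])
```

By symmetry this geodesic stays on the first coordinate axis of the chart; its chart position at time t is tan(t), since the initial speed under the round metric at the origin is 2. A boundary value problem can be handled analogously by shooting on the initial velocity.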

C ADDITIONAL EXPERIMENT DETAILS

This section provides further details to complement Sections 5, 6, and 7 of the main text, including how the simulated data were generated, and additional figures.

C.1 UNIT SPHERE AND ROUND METRIC

The stereographic coordinate of the d-dimensional sphere S^d identifies a point on the sphere by mapping it to its stereographic projection in R^d from the north pole. The round metric on the sphere S^d is the metric induced by the embedding S^d → R^{d+1}. For details, see, for example, page 30 of Lee (2013) and chapter 3 of Lee (2018). We generated endpoints X_u0, X_u1 uniformly in the coordinate chart (-3, 3) × (-3, 3), then paired the endpoints so that the difference in the coordinates of the endpoints, |δ^i_u0 - δ^i_u1| = |X^i_u0 - X^i_u1|, would not exceed 0.2 for i = 1, 2 and u = 1, ..., N. More precisely, data were generated on the d = 2-dimensional sphere under the stereographic projection coordinate. A total of N = 5 × 10^5 pairs of endpoints with X^i_u0, X^i_u1 ∈ (-3, 3) were generated subject to |X^i_u0 - X^i_u1| ≤ 0.2 for all u = 1, ..., N and i = 1, ..., d. For a reasonable signal-to-noise ratio, we set σ(p) = σ ≈ 9 × 10^{-4} for all p, which is approximately one-tenth of the marginal expectation of the squared distance, i.e., σ ≈ E[dist(X_u0, X_u1)^2]/10. For simplicity of presentation, we scaled the distance for the binary similarity response model (3.3). More precisely, we used dist_c(·,·) = √c dist(·,·), induced by the scaled metric G_ij,c = c G_ij for some constant c and i, j = 1, ..., d. The experiment here used c = 300. Intuitively, the constant c regulates the signal-to-noise ratio without changing the form of the geodesics. Given the endpoints, a smaller c leads to smaller geodesic distances and hence smaller variation in the linear predictors η_u, so the response Y_u is less influenced by the distance, representing a higher amount of noise. Then h(p) was set to be the average of the local squared distances within a small neighborhood of p under the scaled distance. In the end, the responses were generated following (3.2) and (3.3), respectively.
In addition, Figure C.1 illustrates the relative Frobenius error of the estimated tensors using noiseless or binary responses.

C.1.1 BANDWIDTH SELECTION

Like local regression, the proposed method relies on a neighborhood specification for an optimal bias–variance trade-off. The simulation in Subsection 5.1 uses the rectangular kernel K(x) = 1_{[−1,1]}(x) for (3.10), where 1 is the indicator function, so the estimation only utilizes observations whose endpoints X_{u0}, X_{u1} both lie in the neighborhood U_p = {(x^1, ..., x^d) : |x^i − p^i| ≤ h, i = 1, ..., d} of the target point p. We propose a train–test scheme for data-driven bandwidth selection. To simplify computation, we only considered additive errors under (3.2). A 16 × 16 grid p_1, ..., p_256 ∈ (−3, 3) × (−3, 3) was used as the target points where metric tensors were estimated, with a test set of N_test = 31,246 observations in close proximity to the grid. For each candidate bandwidth h, the tensors were estimated using a training set of N_train = 400,158 observations (approximately 80% of the data) randomly selected from outside the test set. For the test set, Ŷ_{u,test} = η̂_{u,test} were then computed under the identity link by plugging the estimated tensors into (3.6). The bandwidth minimizing the squared loss Σ_u (Y_{u,test} − Ŷ_{u,test})^2 over the test set was selected.

C.2 DOUBLE SPIRALS

We obtained candidate endpoints Z_m, m = 1, ..., 70000. Provided with those candidate endpoints, we paired them to form relative comparisons subject to the restriction that |X^i_{u0} − X^i_{uj}| ≤ 0.35 for i = 1, 2, j = 1, 2, and u = 1, ..., N. The responses Y_u were then generated based on the classes of the involved endpoints through their corresponding positions X on the spirals. For estimation, we used a larger local neighborhood U_p = {(x^1, x^2) : |x^i − p^i| ≤ h for i = 1, 2} with h = π/2 and weights w_u = 1{X_{u0}, X_{u1}, X_{u2} ∈ U_p} for u = 1, ..., N, to avoid degenerate estimates. Note that different starting points and initial velocities generate different geodesics, not all resembling a spiral, as shown in Figure C.3.
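The indicator-weight constructions above — both endpoints of a pair lying in U_p for the sphere simulations, and all three endpoints of a relative comparison for the spirals — can be sketched as follows. This is a minimal sketch with hypothetical helper names, not the authors' code.

```python
import numpy as np

def in_neighborhood(x, p, h):
    """True for each row of x lying in U_p = {x : |x^i - p^i| <= h for all i}."""
    return np.all(np.abs(x - p) <= h, axis=-1)

def pair_weights(p, x0, x1, h):
    """Rectangular-kernel weights for pairs: w_u = 1 iff both endpoints are in U_p."""
    return (in_neighborhood(x0, p, h) & in_neighborhood(x1, p, h)).astype(float)

def triplet_weights(p, x0, x1, x2, h):
    """Weights for relative comparisons: w_u = 1 iff all three endpoints are in U_p."""
    return (in_neighborhood(x0, p, h) & in_neighborhood(x1, p, h)
            & in_neighborhood(x2, p, h)).astype(float)
```

With such 0/1 weights, observations with any endpoint outside U_p simply drop out of the local objective, which is exactly how the rectangular kernel localizes the fit.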

C.3 NYC TAXI TRIPS

We focus on the 8,809,982 sensible records between 7 a.m. and 10 a.m. on business days from May to September (summer months, to avoid snow) of 2015 in New York City areas other than Staten Island. Here "sensible" means, for example, that the GPS coordinates do not fall into the rivers, that the travel time is not a mere few seconds, and that the inferred traveling speed does not reach 120 mph. We measure the cost to travel Y_u by the squared trip duration (instead of the trip distance). For each target location p, estimation was computed using trips among the M ≤ 5 × 10^4 closest pickup/dropoff endpoints in the neighborhood U_p = {(x^1, x^2) : |x^i − p^i| ≤ 5 kilometers for i = 1, 2}, with weights given by the bandwidth h = 2.5 kilometers and the kernel K being the density function of the standard normal distribution.
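A filtering step of this kind can be sketched with pandas as below. The column names (`pickup_datetime`, `trip_duration_s`, `trip_miles`) are hypothetical placeholders, not the TLC schema, and the thresholds simply mirror the description above.

```python
import pandas as pd

def sensible_trips(df):
    """Keep 'sensible' records: weekday morning rush hour, plausible duration,
    and implied speed under 120 mph. Column names are hypothetical."""
    t = pd.to_datetime(df["pickup_datetime"])
    keep = (t.dt.hour >= 7) & (t.dt.hour < 10) & (t.dt.weekday < 5)
    keep &= df["trip_duration_s"] > 60                    # not a mere few seconds
    speed_mph = df["trip_miles"] / (df["trip_duration_s"] / 3600.0)
    keep &= speed_mph < 120                               # physically plausible
    return df[keep]
```

Spatial sanity checks (e.g., coordinates landing in rivers) would additionally require a geometry test against a shoreline polygon, which is omitted here.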

C.4 THE MNIST EXAMPLE

The dimension reduction is computed using the R package dimRed (Kraemer et al., 2018). The Wasserstein distance and optimal transport are computed using the package transport (Schuhmacher et al., 2022). To show image transitions, a weighted average is adopted to approximate the inverse of the tSNE embedding so as to map the trajectories back to image space, similar to, e.g., equation (3.9) of Chen & Müller (2012), but with a Gaussian kernel and a sufficiently small bandwidth. To simplify computation, we only embedded the first 3 × 10^4 images (half of the entire data), and the resulting embedding coordinates were standardized (i.e., centered by the mean and divided by the standard deviation). We generated N = 10^5 comparisons by selecting nearby points in the embedded space subject to ‖X_{u0} − X_{u1}‖_∞ ≤ 0.75, whose responses Y_u were computed from the same-digit-or-not indicator and the 2-Wasserstein distance between the corresponding images: dist(X_{u0}, X_{u1}) = C dist_wass(pic_{u0}, pic_{u1}) + 1{lbl_{u0} ≠ lbl_{u1}}, where for u = 1, ..., N,
• X_{u0}, X_{u1} ∈ R^2 are coordinates in the embedded space;
• pic_{u0} and pic_{u1} are the 28 × 28 grey-scale images;
• lbl_{u0} and lbl_{u1} are the image labels (0–9);
• dist_wass(·, ·) is the 2-Wasserstein distance treating images as 2-dimensional probability distributions;
• 1{event} is the indicator of whether the event is true (1) or false (0).
We multiply the Wasserstein distance by C = 4 to balance the magnitudes of the two summands; otherwise the latter could be overly dominating. Estimation was carried out under model (3.2) with the squared loss Q(y, µ) = (y − µ)^2 and the identity link. We included the intercept term (β^{(0)} in (3.6)) to capture intrinsic variation, since the dimension reduction embedding map is not necessarily injective, so that different images with non-zero similarity measures could share identical coordinates in the embedded space.
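The pairing and response construction just described can be sketched as follows. The function names are ours, and `w2_fn` is a placeholder for a precomputed 2-Wasserstein distance lookup (the paper uses the R package transport for that step).

```python
import numpy as np

def standardize(Z):
    """Center and scale embedding coordinates columnwise."""
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

def make_comparisons(Z, labels, w2_fn, n_pairs, cap=0.75, C=4.0, seed=0):
    """Pair nearby embedded points (L-infinity gap <= cap) and form the response
    Y_u = C * dist_wass + same-digit-or-not indicator.  `w2_fn(u0, u1)` stands in
    for the precomputed 2-Wasserstein distance between the two images."""
    rng = np.random.default_rng(seed)
    pairs, ys = [], []
    while len(pairs) < n_pairs:
        u0, u1 = rng.integers(0, len(Z), size=2)
        if u0 != u1 and np.max(np.abs(Z[u0] - Z[u1])) <= cap:
            pairs.append((u0, u1))
            ys.append(C * w2_fn(u0, u1) + float(labels[u0] != labels[u1]))
    return np.array(pairs), np.array(ys)
```

Rejection sampling as above is the simplest way to honor the L-infinity cap; a k-d tree over the embedding would be the efficient alternative at the paper's scale.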
Figure C.4a shows the estimated intercepts, which are larger near class boundaries, consistent with a greater variation in the similarity measure there. The few blank pixels indicate failures to obtain a positive definite metric, which are alleviated by averaging neighboring estimated values. We also dropped the Christoffel-symbol terms from (3.6) for better numerical stability; the estimated Christoffel symbols were instead computed by numerical differentiation following the definition. Results are similar, but less stable, if we include the Christoffel-symbol terms in the linear predictor. Notably, the proposal also works when supplied with binary similarity measures using only the same-digit-or-not indicator (i.e., setting C = 0 to remove the Wasserstein distance), and retains the "fewer label switching" tendency, as illustrated in Figure C.5. We see this as a real-data analogue of the double spirals (Subsection 5.2). We also remark that not all geodesics differ from straight lines on the chart, and that a geodesic is not guaranteed to travel within the same class whenever possible, since its path is jointly determined by the metric and the locations of the endpoints.

D SPREAD OF GEODESICS

Here we prove Proposition 3.1 in the main text, which characterizes the distance between geodesics departing from the same starting point. Proposition 3.1 follows from combining Proposition D.1 and Proposition D.2. Proposition D.1 (spread of geodesics). Let p ∈ M and v, w ∈ T_pM be two tangent vectors at p. Then the squared separation distance satisfies the Taylor expansion


dist(exp_p(tv), exp_p(tw))^2 = t^2 ‖v − w‖^2 − (1/3) t^4 ⟨R(v, w)w, v⟩ + O(t^5)

as t → 0. Here, R is the (1, 3)-curvature tensor defined as R(X, Y)Z = ∇_X ∇_Y Z − ∇_Y ∇_X Z − ∇_{[X,Y]} Z, where X, Y, Z are vector fields and [X, Y] = XY − YX is the Lie bracket (c.f., e.g., Lee, 2018, page 385). Further, the Riemann curvature tensor is defined as Rm(X, Y, Z, W) = ⟨R(X, Y)Z, W⟩, where W is also a vector field. Note that R and Rm are both tensor fields, so ⟨R(v, w)w, v⟩ (equivalently Rm(v, w, w, v)) is evaluated at p, since v, w ∈ T_pM. See Lee (2018), pp. 196–199 for details. However, additional terms are introduced when computing via coordinate charts, as a result of approximating the initial velocities v and w.

Proposition D.2 (approximation of velocity). For any p ∈ M, let v ∈ T_pM be a tangent vector at p and γ(t) = exp_p(tv) the geodesic from p with initial velocity v. Given any local coordinate chart, write v = v^i ∂_i. For i = 1, ..., d, denote by δ^i = δ^i(t) = γ^i(t) − γ^i(0) the difference in coordinates after traveling for time t along γ. We have v^i = t^{-1} δ^i(t) + R^i(t), where the remainder is

R^i(t) = (2t)^{-1} δ^m δ^n Γ^i_{mn} + (6t)^{-1} δ^m δ^n δ^l (Γ^k_{mn} Γ^i_{kl} + ∂_l Γ^i_{mn}) + O(t^3)   (D.1)

as t → 0, where Γ and ∂Γ denote the Christoffel symbols and their derivatives at p.
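As a concrete special case (our illustration, not part of the paper's statements), on a manifold of constant sectional curvature K — e.g., the unit sphere with K = 1 — one has ⟨R(v, w)w, v⟩ = K(‖v‖^2 ‖w‖^2 − ⟨v, w⟩^2), so the expansion of Proposition D.1 specializes to:

```latex
% Spread of geodesics under constant sectional curvature K, where
% \langle R(v,w)w, v\rangle = K\bigl(\|v\|^2\|w\|^2 - \langle v,w\rangle^2\bigr):
\operatorname{dist}\bigl(\exp_p(tv), \exp_p(tw)\bigr)^2
  = t^2\,\|v - w\|^2
  - \frac{K}{3}\, t^4 \bigl(\|v\|^2\|w\|^2 - \langle v, w\rangle^2\bigr)
  + O(t^5), \qquad t \to 0.
```

For K > 0 the correction is negative, matching the intuition that geodesics on the sphere spread more slowly than straight lines and eventually reconverge.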

D.1 PROOFS

Proof of Proposition D.1. Similar results can be found in Proposition 2.7 of do Carmo (1992) and Proposition 5.4 of (Lang, 1999, IX, §5). We use the form presented by Meyer (1989); in the following, we reproduce the proof of equation (9) of Meyer (1989) with some additional clarification. Let γ_0(s) = exp_p(sv) and γ_1(s) = exp_p(sw), and define a family of curves V(s, t) = exp_{γ_0(s)}(t exp^{-1}_{γ_0(s)} γ_1(s)), so that the curves V_s : t ↦ V(s, t) are geodesics connecting γ_0(s) and γ_1(s) (c.f., e.g., Proposition 5.19 and equation (10.2) of Lee (2018)), and V is a variation through the geodesics V_s. Further, let T = ∂_t V, the tangent field of velocities, and let E = ∂_s V, a Jacobi field along the geodesics V_s that vanishes at p. Denote H(s) = dist(γ_0(s), γ_1(s))^2 = ‖T‖^2 |_{s,t} for any t ∈ [0, 1], where "|_{s,t}" means evaluation at the point V(s, t). Then, by the chain rules for covariant derivatives (see, e.g., Lee, 2018, chapter 4), we have

(d/ds) H(s) = 2 ⟨D_s T, T⟩ |_{s,t},
(d/ds)^2 H(s) = 2 (⟨D_s^2 T, T⟩ + ‖D_s T‖^2) |_{s,t},
(d/ds)^3 H(s) = 2 (⟨D_s^3 T, T⟩ + 3 ⟨D_s^2 T, D_s T⟩) |_{s,t},
(d/ds)^4 H(s) = 2 (⟨D_s^4 T, T⟩ + 3 ‖D_s^2 T‖^2 + 4 ⟨D_s^3 T, D_s T⟩) |_{s,t}.

Note that V(0, t) = p for all t, so that T|_{s=0,t} = 0 for all t; hence H(0) = 0 and H′(0) = 0. Note that V_s : t ↦ V(s, t), s ↦ V(s, 0), and s ↦ V(s, 1) are geodesics; thus D_t T = 0 for all t, and D_s E|_{s,t=0} = 0 and D_s E|_{s,t=1} = 0 for all s. In addition, by Lemma 6.2 of Lee (2018), D_s T = D_t E. By the Jacobi equation, D_t^2 E + R(E, T)T = 0 for all s, which implies D_t^2 E|_{s=0} = 0 since T|_{s=0} = 0. This means the vector field t ↦ E|_{s=0,t} at p is linear in t; together with E|_{s=0,t=0} = v and E|_{s=0,t=1} = w, we can write E|_{s=0,t} = v + t(w − v) for t ∈ [0, 1]. Therefore D_s T|_{s=0,t} = D_t E|_{s=0,t} = w − v, which implies H″(0) = 2 ‖v − w‖^2.
Proceeding to the third-order derivative, observe that H‴(0) = 6 ⟨D_s^2 T, D_s T⟩ |_{s=0,t}, and by Proposition 7.5 of Lee (2018), D_s^2 T = D_s D_t E = D_t D_s E + R(E, T)E; thus it suffices to show

D_s E|_{s=0,t} = 0 for all t,   (D.2)

in order to argue H‴(0) = 0. Since it is known that D_s E|_{s=0,t=0} = 0 = D_s E|_{s=0,t=1}, it suffices to consider its derivatives for (D.2). Using Proposition 7.5 of Lee (2018) repeatedly, we have

D_t^2 D_s E|_{s=0,t} = D_t D_s D_t E|_{s=0,t} + D_t (R(T, E)E)|_{s=0,t} = D_t D_s^2 T|_{s=0,t} = (D_s D_t D_s T − R(E, T)(D_s T))|_{s=0,t} = D_s D_t D_s T|_{s=0,t} = D_s (D_s D_t T − R(E, T)T)|_{s=0,t} = D_s (R(T, E)T)|_{s=0,t},

where the last equality is due to D_t T = 0. Further, by the chain rule for covariant derivatives (c.f., e.g., Proposition 4.15 of Lee (2018)),

D_s (R(T, E)T) = (∇_E R)(T, E)T + R(D_s T, E)T + R(T, D_s E)T + R(T, E) D_s T,

which equals zero at (s = 0, t) since T|_{s=0,t} = 0 for all t. Hence t ↦ D_s E|_{s=0,t} is also a linear vector field, implying (D.2) and subsequently H‴(0) = 0.

For the fourth-order derivative, note that (D.2) also implies that D_t D_s E|_{s=0,t} = 0 and D_s^2 T|_{s=0,t} = 0 for all t. Therefore H⁗(0) = 8 ⟨D_s^3 T, D_s T⟩ |_{s=0,t}. Further,

D_s^3 T = D_s^2 D_t E = D_s (D_t D_s E + R(E, T)E) = D_s D_t D_s E + (∇_E R)(E, T)E + R(D_s E, T)E + R(E, D_s T)E + R(E, T)(D_s E),

so D_s^3 T|_{s=0,t} = (D_s D_t D_s E + R(E, D_s T)E)|_{s=0,t}. Thus

⟨D_s^3 T, D_s T⟩|_{s=0,t} = (⟨D_s D_t D_s E, D_s T⟩ + ⟨R(E, D_s T)E, D_s T⟩)|_{s=0,t}.

Proof of Proposition D.2. Under the coordinate chart, we can write the geodesic curve as γ : t ↦ (γ^1(t), ..., γ^d(t)) for some smooth functions γ^1, ..., γ^d. Then, for any i = 1, ..., d, a univariate Taylor expansion provides

γ^i(t) = γ^i(0) + γ̇^i(0) t + (1/2) t^2 γ̈^i(0) + (1/6) t^3 γ⃛^i(0) + O(t^4)

as t → 0, where γ̇^i, γ̈^i, and γ⃛^i are the first-, second-, and third-order derivatives of γ^i w.r.t. t. Note that the first derivative γ̇^i(0) = v^i, and the geodesic equation and its derivative give

γ̈^i(0) = −v^m v^n Γ^i_{mn},   γ⃛^i(0) = v^m v^n v^l (2 Γ^k_{mn} Γ^i_{kl} − ∂_l Γ^i_{mn}).

Plugging these into the initial Taylor expansion gives the desired result.

Proof of Proposition 3.1 in the main text. By Proposition D.1 and Proposition D.2, as t → 0, we have

t^2 ‖v − w‖^2 = (δ^i_{0−1} + t R^i_0(t) − t R^i_1(t)) G_{ij} (δ^j_{0−1} + t R^j_0(t) − t R^j_1(t))
= δ^i_{0−1} δ^j_{0−1} G_{ij} + 2 t δ^i_{0−1} (R^j_0(t) − R^j_1(t)) G_{ij} + O(t^4)
= δ^i_{0−1} δ^j_{0−1} G_{ij} + δ^i_{0−1} (δ^k_0 δ^l_0 − δ^k_1 δ^l_1) (Γ^j_{kl} G_{ij}) + O(t^4),

where R^i_a(t) = (2t)^{-1} δ^m_a δ^n_a Γ^i_{mn} + (6t)^{-1} δ^m_a δ^n_a δ^l_a (Γ^k_{mn} Γ^i_{kl} + ∂_l Γ^i_{mn}) + O(t^3) for a = 0, 1, similar to (D.1). Note that δ^i_a = δ^i_a(t) = O(t), so it suffices to keep only the first term of R^j_a(t), which is O(t).

If further N h^{2+2d} → ∞, then S_{3N,i1i2i3i4} = O_p(N h^6), as h → 0 and N → ∞.

Proof. Write

U_{u;i1i2i3i4} = w_u δ^{i1}_{u,0−1} δ^{i2}_{u,0−1} δ^{i3}_{u,0−1} δ^{i4}_{u,0−1}, so

EU_{u;i1i2i3i4} = h^{−2d} ∫ ∏_{i=1}^d K(δ^i_{u0}/h) K(δ^i_{u1}/h) δ^{i1}_{u,0−1} δ^{i2}_{u,0−1} δ^{i3}_{u,0−1} δ^{i4}_{u,0−1} · f(p^1 + δ^1_{u0}, ..., p^d + δ^d_{u1}) dδ^1_{u0} ⋯ dδ^d_{u1}
= h^4 ∫ K(s^1_{u0}) ⋯ K(s^d_{u1}) (s^{i1}_{u0} − s^{i1}_{u1})(s^{i2}_{u0} − s^{i2}_{u1})(s^{i3}_{u0} − s^{i3}_{u1})(s^{i4}_{u0} − s^{i4}_{u1}) ⋯

Next, write V_{u;i1i2} = w_u δ^{i1}_{u,0−1} δ^{i2}_{u,0−1} R_u, where F_{klm} = Γ^r_{kl} G_{mr}. Note that

EV_{u;i1i2} = h^{−2d} ∫ ∏_{i=1}^d K(δ^i_{u0}/h) K(δ^i_{u1}/h) δ^{i1}_{u,0−1} δ^{i2}_{u,0−1} × Σ_{1≤k,l,m,r≤d} δ^m_{u,0−1} (δ^k_{u0} δ^l_{u0} − δ^k_{u1} δ^l_{u1}) F_{klm} × f(p^1 + δ^1_{u0}, ..., p^d + δ^d_{u1}) dδ^1_{u0} ⋯ dδ^d_{u1}
= h^5 ∫ K(s^1_{u0}) ⋯ K(s^d_{u1}) (s^{i1}_{u0} − s^{i1}_{u1})(s^{i2}_{u0} − s^{i2}_{u1}) ⋯

Indeed, since the kernel K is symmetric and the leading terms in the integrand are of the fifth power in s, with some abuse of notation, EV_{u;i1i2} = h^5 ∫ K(s) s^5 (f + h O(s)) ds = O(h^6). Similarly, var V_{u;i1i2} = O(h^{10−2d}). The rest of the results for S_{3N,i1i2i3i4} proceed analogously to those for S_{1N,i1i2i3i4} and S_{2N,i1i2i3i4}.

Proposition E.2. Under the conditions of Proposition E.1,

bias(β̂ | X) = O_p(h^2),   var(β̂ | X) = O_p(1 / (N h^{4+2d})),

as h → 0 and N h^{2+2d} → ∞, where X are the observed endpoints.

Proof. Note that the S_{1N;i1i2i3i4} are elements of D^T W D, where a pair (i1, i2) indexes a row and a pair (i3, i4) indexes a column, for i1, i2, i3, i4 = 1, ..., d. Similarly, the S_{2N;i1i2i3i4} are elements of D^T Σ D, and the S_{3N;i1i2} are elements of D^T W r by Proposition 3.1. Applying Proposition E.1 leads to the result.
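The matrix form used in this proof can be made concrete with a minimal weighted-least-squares sketch for d = 2, keeping only the quadratic design (intercept and Christoffel terms excluded, as in this section). The helper names are ours, not the paper's code.

```python
import numpy as np

def design_row(delta):
    """Design row for one pair: (d1^2, 2*d1*d2, d2^2), matching the quadratic
    term delta^i delta^j G_ij with the off-diagonal entry doubled by symmetry."""
    d1, d2 = delta
    return np.array([d1 * d1, 2.0 * d1 * d2, d2 * d2])

def fit_metric(deltas, y, w):
    """Weighted least squares beta_hat = (D^T W D)^{-1} D^T W y, reshaped into
    the symmetric 2x2 metric estimate at the target point."""
    D = np.array([design_row(d) for d in deltas])
    DW = D * w[:, None]                       # rows scaled by the weights
    beta = np.linalg.solve(DW.T @ D, DW.T @ y)
    return np.array([[beta[0], beta[1]], [beta[1], beta[2]]])
```

With noiseless responses y_u = δ_u^T G δ_u, the fit recovers G exactly, which mirrors the role of D^T W D and D^T W y in the asymptotic analysis above.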



Footnotes: (1) See Appendix B for details about solving it in practice. (2) See, e.g., Lee (2018), pp. 131–133 for normal coordinates. (3) See Meyer (1989) and Proposition D.1 for a coordinate-invariant version of Proposition 3.1. (4) https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Data format changed after our download. (5) After rescaling; see Subsection C.4.



Altering the link function g and the loss function Q in (3.7) enables flexible local regression estimation for the models in Example 3.1. Example 3.2. Consider the following loss functions for estimating the metric tensors and the Christoffel symbols when data are drawn from models (3.2)–(3.4), respectively.

Under the additive model (3.2) in Example 3.1, we considered either noiseless or noisy responses by setting σ(p) = 0 or σ(p) > 0, respectively. Experiments were performed with d = 2, and the findings are summarized in Figure 5.1. For continuous responses, the left panel of Figure 5.1a visualizes the true and estimated metric tensors via cost ellipses (A.1), and the right panel shows the corresponding geodesics obtained by solving the geodesic equations (2.2) with the true and estimated Christoffel symbols. The metrics and the geodesics were well estimated under the continuous response model with or without additive errors, where the estimates overlap with the truth. Figure 5.1b evaluates the relative estimation errors ‖Ĝ − G‖_F / ‖G‖_F and ‖Γ̂ − Γ‖_F / ‖Γ‖_F w.r.t. the Frobenius norm (A.2) for data from the continuous model (3.2).

Figure 5.1: Simulation results for the 2-dimensional sphere under the stereographic projection coordinate. (a) Estimated and true metric tensors via the ellipse representation (left) and the geodesic curves (right) starting from (1, 0) with unit initial velocities pointing in the 1–12 o'clock directions. (b) Relative errors, in terms of the Frobenius norm, of the estimated tensors for the continuous response model (3.2) with additive error. Simulated data, where line segments show pairs of endpoints colored according to their binary responses and the background shows the value of h. Errors for estimating h with binary responses (3.3).

Figure 5.2d: Geodesics starting from (1, 0) with initial velocity pointing to 9 o'clock under the estimated metric.

Under review as a conference paper at ICLR 2023

6 NEW YORK CITY TAXI TRIP DURATION

We study the geometry induced by taxi travel time in New York City (NYC) during weekday morning rush hours. New York City Taxi and Limousine Commission (TLC) Trip Record Data was

Figure5.2: Simulation results for relative comparison on double spirals. Gray curves (solid and dashed) in the background represent the approximate support of the two latent classes. In (b), the tiny circles are X u0 , each with two segments connecting to X u1 and X u2 , colored according to Y u .

(a) Estimated metric tensors for trip duration: cost ellipses and Frobenius norm (background color). (b) Geodesics corresponding to 15-minute taxi rides from the Empire State Building heading toward 1–12 o'clock.

Figure 6.1: New York taxi travel time during morning rush hours.


(a) The estimated geodesic curves (solid red) and straight lines on the chart (dashed black). (b) Image transitions corresponding to A, B, and C in (a). Every 3 rows correspond to a set of paths sharing the same pair of starting and ending images, where the first, second, and third rows correspond to the estimated geodesics, the straight lines on the chart, and the optimal transport (path not shown in (a)), respectively.

Figure 7.1: Geometry induced by a sum of Wasserstein distance and same-digit-or-not indicator.

Figure C.1: Relative errors w.r.t. Frobenius norm (A.2) of the estimated tensors with noiseless or binary response for 2-dimensional sphere under stereographic projection coordinate chart.

Figure C.2: Root mean squared error (RMSE) for the test set (Left) and the average number of local training observations (Right) as the bandwidth h varies.

Figure C.3: Geodesics with different starting points and initial velocities under estimated metric, crosses indicate starting points.

Figure C.4b shows the cost ellipses for additional visualization.

Estimated intercept reflecting intrinsic local variation of the similarity measure.

The cost ellipses of estimated metric.

Figure C.4: More figures for the geometry induced by adding the Wasserstein distance and the same-digit-or-not indicator.

(a) The geodesic curves (solid red) and straight lines (dashed black). (b) Image transitions (per row) corresponding to boxes A, B, C, and D in panel (a). Every 2 rows correspond to one pair of start and end images, where the first and second rows follow the geodesics and the straight lines, respectively.

Figure C.5: Induced geodesics of only the same-digit-or-not indicator.

Figure D.1: A visualization for the spread of geodesics as in Proposition 3.1. A tangent space (blue plane) and tangent vectors are annotated in red.
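Complementing the visualization, the leading term of the spread-of-geodesics expansion can be checked numerically on the unit sphere. At the origin of the stereographic chart, the round metric is G(0) = 4I and the Christoffel symbols vanish, so exp_0(tv) = tv + O(t^3) and the squared great-circle distance should match t^2 (v − w)^T G(0) (v − w) to leading order. This is our own sanity-check sketch, not part of the paper's code.

```python
import numpy as np

def inv_stereo(x):
    """Inverse stereographic projection (from the north pole) of R^2 into S^2."""
    s = float(np.dot(x, x))
    return np.array([2.0 * x[0], 2.0 * x[1], s - 1.0]) / (s + 1.0)

def sphere_dist(a, b):
    """Great-circle distance between unit vectors on S^2."""
    return float(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

# Compare dist(exp_0(tv), exp_0(tw))^2 with t^2 * ||v - w||_G^2 at the chart
# origin, where G(0) = 4 I and exp_0(t v) = t v + O(t^3).
v, w = np.array([1.0, 0.0]), np.array([0.3, 0.8])
t = 1e-3
d2 = sphere_dist(inv_stereo(t * v), inv_stereo(t * w)) ** 2
leading = t ** 2 * 4.0 * float(np.dot(v - w, v - w))
```

For t = 10^{-3} the two quantities agree to roughly one part in 10^6, consistent with the O(t^4) curvature correction in the expansion.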

⟨D_s D_t D_s E, D_s T⟩ + ⟨R(E, D_s T)E, D_s T⟩)|_{s=0,t}. Recall that at s = 0, D_s T|_{s=0,t} = D_t E|_{s=0,t} = w − v; therefore

⟨R(E, D_s T)E, D_s T⟩|_{s=0,t} = Rm(E, D_s T, E, D_s T)|_{s=0,t} = Rm(v + t(w − v), w − v, v + t(w − v), w − v) = Rm(v, w − v, v, w − v) + t Rm(v, w − v, w − v, w − v) = Rm(v, w, v, w).

⟨D_s D_t D_s E, D_s T⟩|_{s=0,t} = (∂_s ⟨D_t D_s E, D_s T⟩ − ⟨D_t D_s E, D_s^2 T⟩)|_{s=0,t} = ∂_s ⟨D_t D_s E, D_s T⟩|_{s=0,t} = (∂_s ∂_t ⟨D_s E, D_s T⟩ − ∂_s ⟨D_s E, D_t^2 E⟩)|_{s=0,t}, where the second term in the last line vanishes since D_t^2 E|_{s=0,t} = 0 and D_s E|_{s=0,t} = 0. Moreover, since the Levi-Civita connection is torsion-free, we have ∂_s ∂_t ⟨D_s E, D_s T⟩|_{s=0,t} = ∂_t ∂_s ⟨D_s E, D_s T⟩|_{s=0,t}, which should be independent of t, so that ∂_s ⟨D_s E, D_s T⟩|_{s=0,t} is linear in t. Yet ∂_s ⟨D_s E, D_s T⟩ = ⟨D_s^2 E, D_s T⟩ + ⟨D_s E, D_s^2 T⟩, which vanishes at s = 0 and t = 0, 1. Hence ∂_s ⟨D_s E, D_s T⟩|_{s=0,t} = 0 for all t ∈ [0, 1]. Combining these with the symmetries of the Riemann curvature tensor leads to the desired expansion.

⋯ (f(p^1, ..., p^d) + o(1)) ds^1_{u0} ⋯ ds^d_{u1} = O(h^4), where f is the joint density of the endpoints X_{u0}, X_{u1}; the second-to-last equality is due to a change of variables, and the last is due to (A1). A similar argument implies

var U_{u;i1i2i3i4} = O(h^{8−2d}),   E[w_u U_{u;i1i2i3i4}] = O(h^{4−2d}),   var(w_u U_{u;i1i2i3i4}) = O(h^{8−6d}).

These rates apply uniformly over u; therefore, since the observations are i.i.d. and var(Y_u | X_{u0}, X_{u1}) is uniformly bounded,

E S_{1N,i1i2i3i4} = O(N h^4),   var S_{1N,i1i2i3i4} = O(N h^{8−2d}),   E S_{2N,i1i2i3i4} = O(N h^{4−2d}),   var S_{2N,i1i2i3i4} = O(N h^{8−6d}).

Hence

S_{1N,i1i2i3i4} = E S_{1N,i1i2i3i4} + O_p(√(var S_{1N,i1i2i3i4})) = O_p(N h^4),

under h → 0 and N h^{2d} → ∞. Similarly, we obtain the results for S_{2N,i1i2i3i4}.

⋯ (s^{i2}_{u0} − s^{i2}_{u1}) F_{klm} × (f(p^1, ..., p^d) + h Σ_{r=1}^d (∂f/∂p^r)(p) (s^r_{u0} + s^r_{u1}) + o(h)) ds^1_{u0} ⋯ ds^d_{u1} = O(h^6),

E ASYMPTOTICS OF THE ESTIMATED METRIC TENSOR

Now we discuss the variance and bias of the estimated metric tensors. For simplicity, use the squared loss Q(µ, y) = (µ − y)^2 and the identity link g(µ) = µ, and exclude the intercept β^{(0)} and the derivative terms β^{(2)}_{ijk}. Given a suitable ordering of the indices i, j, we rewrite (3.6) in matrix form: denote β = (β^{(1)}_{11}, ..., β^{(1)}_{ij}, ..., β^{(1)}_{dd})^T, so that the linear predictors take the form Dβ for a design matrix D. The assumptions (A1)–(A4) in the main text are reiterated here. (A1) The joint density of the endpoints X_{u0}, X_{u1} is positive and continuously differentiable. Under (A1), (A2), (A3), and (A4), and supposing that h → 0 and N h^{2d} → ∞, then

