Learning Globally Smooth Functions on Manifolds

Abstract

Smoothness and low-dimensional structure play central roles in improving generalization and stability in learning and statistics. The combination of these properties has led to many advances in semi-supervised learning, generative modeling, and the control of dynamical systems. However, learning smooth functions is generally challenging, except in simple cases such as learning linear or kernel models. Typical methods are either too conservative, relying on crude upper bounds such as spectral normalization; too lax, penalizing smoothness only on average; or too computationally intensive, requiring the solution of large-scale semi-definite programs. These issues are only exacerbated when trying to simultaneously exploit low dimensionality using, e.g., manifolds. This work proposes to overcome these obstacles by combining techniques from semi-infinite constrained learning and manifold regularization. To do so, it shows that, under typical conditions, the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. We prove that, under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Numerical examples illustrate the advantages of using this method to impose global smoothness on manifolds, as opposed to imposing smoothness on average.

1. Introduction

Learning smooth functions has been shown to be advantageous in general and is of particular interest in physical systems. This is because of the general observation that close input features tend to be associated with close outputs, and because of the particular fact that, in physical systems, Lipschitz continuity of input-output maps translates into stability and safety (Oberman and Calder, 2018; Finlay et al., 2018a; Couellan, 2021; Finlay et al., 2018b; Pauli et al., 2021; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021). To learn smooth functions one can require the parameterization to be smooth. Such is the idea, e.g., of spectral normalization of the weights of neural networks (Miyato et al., 2018; Zhao and Liu, 2020). Smooth parameterizations have the advantage of being globally smooth, but they may be restrictive because they impose smoothness for inputs that are not necessarily realized in the data. This drawback motivates the use of Lipschitz penalties in risk minimization (Oberman and Calder, 2018; Finlay et al., 2018a; Couellan, 2021; Pauli et al., 2021; Bungert et al., 2021), which offers the opposite tradeoff. Since penalties encourage but do not enforce small Lipschitz constants, we may learn functions that are smooth on average but with no global guarantees of smoothness at every point in the support of the data. Formulations that guarantee global smoothness can be obtained if the risk minimization problem is modified by the addition of a Lipschitz constant constraint (Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021). This yields formulations that guarantee Lipschitz smoothness at all inputs in the support of the data distribution, without the drawback of enforcing smoothness outside of it.
Several empirical studies (Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021) have demonstrated the advantage of imposing global smoothness constraints only on observed inputs. In this paper we exploit the fact that data can often be modeled as points on a low-dimensional manifold. We therefore consider manifold Lipschitz constants, in which function smoothness is assessed with respect to distances measured over the data manifold (Definition 1). Although this may look like a minor difference, controlling Lipschitz constants over data manifolds is quite different from controlling Lipschitz constants in the ambient space. In Figure 1 we look at a classification problem with classes arranged in two separate half moons. Constraining Lipschitz constants in the ambient space effectively assumes the underlying data is uniformly distributed in space [cf. Figure 1-(d)].
Constraining Lipschitz constants on the data manifold, however, properly accounts for the data distribution [cf. Figure 1-(a)]. This example also illustrates how constraining manifold Lipschitz constants is related to manifold regularization (Belkin et al., 2005; Niyogi, 2013; Li et al., 2022). The difference is that manifold regularization penalizes the average norm of the manifold gradient. This distinction is significant because regularizing is more brittle than imposing constraints. In the example in Figure 1, manifold regularization fails to separate the dataset when the moons are close [cf. Figure 1-(c), bottom]. Classification with a manifold Lipschitz constant constraint is more robust to this change in the data distribution [cf. Figure 1-(a), bottom]. Global constraints on the manifold gradient yield a statistical constrained learning problem with an infinite (and dense) number of constraints. This is a challenging problem to approximate and solve. Here, we approach the solution of this problem in the Lagrangian dual domain and establish connections with manifold regularization that allow for the use of point-cloud Laplacians. Our specific contributions are the following: (C1) We introduce a constrained statistical risk minimization problem in which we learn a function that (i) attains a target loss and (ii) attains the smallest possible manifold Lipschitz constant among the functions that satisfy the target loss (Section 2). (C2) We introduce the Lagrangian dual problem and show that its empirical version is a statistically consistent approximation of the primal. This result does not require the learning parameterization to be convex (Section 3.1). (C3) We generalize results from the manifold regularization literature to show that, under regularity conditions, the evaluation of manifold Lipschitz constants can be recast in a more amenable form utilizing a weighted point-cloud Laplacian (Proposition 3 in Section 3.2). (C4) We present a dual ascent algorithm to find the optimal multipliers. The function that attains the target loss and minimizes the manifold Lipschitz constant follows as a byproduct (Section 3.3).
(C5) We illustrate the merits of learning with global manifold smoothness guarantees with respect to ambient space and standard manifold regularization through two examples: (i) learning robot navigation policies in a space with obstacles and (ii) learning model mismatches in differential drive steering over non-ideal surfaces (Section 4).

Related Work

This paper lies at the intersection of learning with Lipschitz constant constraints (Oberman and Calder, 2018; Finlay et al., 2018a; Couellan, 2021; Pauli et al., 2021; Bungert et al., 2021; Miyato et al., 2018; Zhao and Liu, 2020; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021) and manifold regularization (Belkin et al., 2005; Niyogi, 2013; Li et al., 2022; Hein et al., 2005; Belkin and Niyogi, 2005). Relative to the literature on learning with Lipschitz constraints, we offer the ability to leverage data manifolds. Since data manifolds are often characterized with unlabeled data (Kejani et al., 2020; Belkin and Niyogi, 2004; Jiang et al., 2019; Kipf and Welling, 2016; Yang et al., 2016; Zhu, 2005; Lecouat et al., 2018; Ouali et al., 2020; Cabannes et al., 2021), similar to these works, we utilize the point-cloud Laplacian to compute the integral of the norm of the gradient. Relative to the literature on manifold regularization, we offer global smoothness assurances instead of an average penalty on large manifold gradients. In (Krishnan et al., 2020), the space is discretized while maintaining convexity, which leads to a large optimization problem and a solution that is sensitive to the discretization. Similar to us, Krishnan et al. (2020) pose the problem of minimizing a Lipschitz constant; however, they utilize a softer surrogate (a p-norm loss), which is a smoothed version of the Lipschitz constant. Their approach therefore trades off numerical stability (small p) against accurate Lipschitz constant estimation (p = ∞). We do not work with surrogates: we seek to minimize the maximum norm of the gradient, utilizing an epigraph technique.

2. Global Constraining of Manifold Lipschitz Constants

We consider data pairs (x, y) in which the input features x ∈ M ⊂ R^D lie in a compact oriented Riemannian manifold M and the output features are real valued, y ∈ R. We study the regression problem of finding a function f_θ : M → R, parameterized by θ ∈ Θ ⊂ R^Q, that minimizes the expectation of a nonnegative loss ℓ : R × R → R_+, where ℓ(f_θ(x), y) represents the loss of predicting the output f_θ(x) when the world realizes the pair (x, y). Data pairs (x, y) are drawn according to an unknown probability distribution p(x, y) on M × R, which we can factor as p(x, y) = p(x)p(y|x). We are interested in learning smooth functions, i.e., functions with controlled variability over the manifold M. We therefore let ∇_M f_θ(x) denote the manifold gradient of f_θ and introduce the following definition.

Definition 1 (Manifold Lipschitz Constant). Given a Riemannian manifold M, the function f_θ : M → R is said to be L-Lipschitz continuous if there exists a constant L > 0 such that for all pairs of points x_1, x_2 ∈ M,

|f_θ(x_1) − f_θ(x_2)| ≤ L d_M(x_1, x_2), (1)

where d_M(x_1, x_2) denotes the distance between x_1 and x_2 on the manifold M.

If the function f_θ is differentiable, (1) is equivalent to requiring the gradient norm to be bounded by L,

∥∇_M f_θ(x)∥ = lim_{δ→0} sup_{x′ ∈ M : x′ ≠ x, d_M(x, x′) ≤ δ} |f_θ(x) − f_θ(x′)| / d_M(x, x′) ≤ L, (2)

for all x ∈ M. With Definition 1 in place, and restricting attention to differentiable functions f_θ, our stated goal of learning functions f_θ with controlled variability over the manifold M can be written as

P* = min_{θ ∈ Θ, ρ ≥ 0} ρ, subject to E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ, ∥∇_M f_θ(z)∥² ≤ ρ, p(z)-a.e. z ∈ M. (3)

In this formulation the statistical loss E_{p(x,y)}[ℓ(f_θ(x), y)] is required to be below a target level ϵ. Of all functions f_θ that satisfy this loss requirement, Problem (3) defines as optimal the one whose Lipschitz constant L = √ρ is the smallest.
The goal of this paper is to develop methodologies to solve (3) when the data distribution and the manifold are unknown. To characterize the distribution we are given sample pairs (x_i, y_i) drawn from the joint distribution p(x, y). To characterize the manifold we are given samples z_i drawn from the marginal distribution p(x). These include the samples x_i from the (labeled) data pairs (x_i, y_i) and may also include additional (unlabeled) samples z_i. It is interesting to observe that (3) shows that the problems of manifold regularization (Belkin et al., 2005; Niyogi, 2013; Kejani et al., 2020; Li et al., 2022) and Lipschitz constant control (Oberman and Calder, 2018; Finlay et al., 2018a; Couellan, 2021; Finlay et al., 2018b; Pauli et al., 2021; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021) are related. This connection is important to understand the merit of (3). To explain this better, observe that there are three motivations for the problem formulation in (3): (i) It is often the case that if samples x_1 and x_2 are close, then the conditional distributions p(y | x_1) and p(y | x_2) are close as well. A function f_θ with small Lipschitz constant leverages this property. (ii) The Lipschitz constant of f_θ is guaranteed to be smaller than L = √ρ. This provides advantages in, e.g., physical systems, where Lipschitz constant guarantees translate to stability and safety assurances. (iii) It leverages the intrinsic low-dimensional structure of the manifold M embedded in the ambient space. In particular, this permits taking advantage of unlabeled data. Motivations (i) and (ii) are tropes of the Lipschitz regularization literature; e.g., Oberman and Calder (2018); Finlay et al. (2018a); Couellan (2021); Finlay et al. (2018b); Pauli et al. (2021); Krishnan et al. (2020); Shi et al. (2019); Lindemann et al. (2021); Arghal et al. (2021). Indeed, the problem formulation in (3) is inspired by similar problem formulations in which the Lipschitz constant is regularized in the ambient space,

minimize_{θ ∈ Θ, L ≥ 0} L, subject to E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ, |f_θ(y) − f_θ(z)| ≤ L ∥y − z∥, (y, z) ∼ p(x). (4)

A difference between (3) and (4) is that in the latter we use a Lipschitz condition that does not require differentiability. A more important difference is that in (4) the Lipschitz constant is regularized in the ambient space. The distance between features y and z in (4) is the Euclidean distance ∥y − z∥. This is disparate from the manifold metric d_M(y, z) that is implicit in the manifold gradient constraint in (3). Thus, the formulation in (3) improves upon (4) because it leverages the structure of the manifold M [cf. Motivation (iii)]. Motivations (i) and (iii) are themes of the manifold regularization literature (Belkin et al., 2005; Niyogi, 2013; Kejani et al., 2020; Li et al., 2022).
Indeed, it is easy to conclude by invoking Green's first identity (see Section 3.2) that the formulation in (3) is also inspired by the manifold regularization problem,

minimize_{θ ∈ Θ} E_{p(x,y)}[ℓ(f_θ(x), y)] + γ ∫_M ∥∇_M f_θ(z)∥² p(z) dV(z). (5)

The difference between (3) and (5) is that in the latter the manifold Lipschitz constant is added as a regularization penalty. This is disparate from the imposition of a manifold Lipschitz constraint in (3). The regularization in (5) favors solutions with small Lipschitz constant by penalizing large Lipschitz constants, while the constraint in (3) guarantees that the squared Lipschitz constant is bounded by ρ. This is the distinction between regularizing a Lipschitz constant and constraining a Lipschitz constant. The constraint in (3) is also imposed at all points of the manifold, whereas the regularization in (5) is an average over the manifold. Taking an average allows for large Lipschitz constants at some specific points if this is canceled out by small Lipschitz constants at other points of the manifold. Both of these observations imply that (3) improves upon (5) because it offers global smoothness guarantees that are important in, e.g., physical systems [cf. Motivation (ii)].

Remark 1 (Alternative Manifold Lipschitz Control Formulations). There are three arbitrary choices in (3). (a) We choose to constrain the average statistical loss E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ; (b) we choose to constrain the pointwise Lipschitz constant, ∥∇_M f_θ(z)∥² ≤ ρ; and (c) we choose as our objective to require a target loss ϵ and minimize the Lipschitz constant L = √ρ. We can alternatively choose to constrain the pointwise loss ℓ(f_θ(x), y) ≤ ϵ, to constrain the average Lipschitz constant ∫_M ∥∇_M f_θ(z)∥² p(z) dV(z) ≤ ρ [cf. (5)], or to require a target smoothness L = √ρ and minimize the loss ϵ. All eight possible combinations of these choices are of interest.
We formulate (3) because it is the most natural intersection between the regularization of Lipschitz constants in ambient spaces [cf. ( 4)] and manifold regularization [cf. ( 5)]. The techniques we develop in this paper can be adapted to any of the other seven alternative formulations.
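To make Definition 1 concrete, the manifold Lipschitz constant of a sampled function can be probed numerically by approximating geodesic distances d_M with shortest paths in a k-nearest-neighbor graph, a standard heuristic. The sketch below is illustrative only; the helper names and the circle example are our choices, not part of the formulation above.

```python
import numpy as np

def knn_graph_geodesics(X, k=5):
    """Approximate manifold distances d_M by shortest paths in a k-NN graph."""
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # ambient distances
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(N):
        for j in np.argsort(D[i])[1:k + 1]:  # connect the k nearest neighbors
            G[i, j] = G[j, i] = D[i, j]
    for m in range(N):  # Floyd-Warshall all-pairs shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

def manifold_lipschitz_estimate(f_vals, geo):
    """max |f(x_i) - f(x_j)| / d_M(x_i, x_j) over sampled pairs (Definition 1)."""
    N = len(f_vals)
    return max(abs(f_vals[i] - f_vals[j]) / geo[i, j]
               for i in range(N) for j in range(i + 1, N)
               if np.isfinite(geo[i, j]) and geo[i, j] > 0)

# Points on a circle (a 1-D manifold embedded in R^2); f(x) = sin(angle)
# is 1-Lipschitz with respect to arc length.
theta = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
geo = knn_graph_geodesics(X, k=2)
L_hat = manifold_lipschitz_estimate(np.sin(theta), geo)
print(round(L_hat, 2))  # → 1.0
```

Note that the same function measured with ambient Euclidean distances can have a much smaller apparent Lipschitz constant, which is precisely the gap between (3) and (4).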

3. Learning with Global Lipschitz Constraints

Problem (3) is a constrained learning problem that we will solve in the dual domain (Chamon and Ribeiro, 2020). To that end, observe that (3) has statistical and pointwise constraints. The loss constraint E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ is said to be statistical because it restricts the expected loss over the data distribution. The Lipschitz constant constraints ∥∇_M f_θ(z)∥² ≤ ρ are said to be pointwise because they are imposed at all individual points of the manifold except for a set of zero measure. Consider then a Lagrange multiplier µ associated with the statistical constraint E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ and a Lagrange multiplier distribution λ(z) associated with the set of pointwise constraints ∥∇_M f_θ(z)∥² ≤ ρ. The dual problem associated with (3) can then be written as

D* = max_{µ, λ ≥ 0} min_θ L(θ, µ, λ) := µ ( E_{p(x,y)}[ℓ(f_θ(x), y)] − ϵ ) + ∫_M λ(z) ∥∇_M f_θ(z)∥² p(z) dV(z), subject to ∫_M λ(z) p(z) dV(z) = 1. (6)

We point out that in (6) we remove ρ from the Lagrangian by incorporating the dual variable constraint ∫_M λ(z) p(z) dV(z) = 1; see Appendix A for details. We henceforth use the dual problem (6) in lieu of (3). Since we are interested in situations in which we do not have access to the data distribution p(x, y), we further consider empirical versions of (6). Consider then N i.i.d. samples (x_n, y_n) drawn from p(x, y) and define the empirical dual problem as

D̂* = max_{µ̂, λ̂ ≥ 0} min_θ L̂(θ, µ̂, λ̂) := µ̂ ( (1/N) Σ_{n=1}^N ℓ(f_θ(x_n), y_n) − ϵ ) + (1/N) Σ_{n=1}^N λ̂(x_n) ∥∇_M f_θ(x_n)∥², subject to (1/N) Σ_{n=1}^N λ̂(x_n) = 1, (7)

where, to simplify notation, we assume no unlabeled samples are available. If unlabeled samples are given the modification is straightforward; see Appendix B. The remainder of this section provides three technical contributions: (C2) To justify solutions of (7) we must show statistical consistency with respect to the primal problem (3).
This is challenging for two reasons: (i) Since we did not assume the use of a convex parameterization, (3) is not a convex problem in θ. Thus, the primal and dual problems are not necessarily equivalent. (ii) Since we are maximizing over the dual variables µ and λ(z), we do not know if the empirical dual formulation in (7) is close to the statistical dual formulation in (6). We will overcome these two challenges and show that the empirical dual problem (7) is a consistent approximation of the statistical primal problem (Proposition 1 in Section 3.1). (C3) Solving (7) requires evaluating the gradient norm sum (1/N) Σ_{n=1}^N λ̂(x_n) ∥∇_M f_θ(x_n)∥². We will generalize results from the manifold regularization literature to show that, under regularity conditions on λ, the gradient norm integral can be computed in a more amenable form utilizing a weighted point-cloud Laplacian (Proposition 3 in Section 3.2). (C4) We introduce a primal-dual algorithm to solve (7) (Section 3.3).
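For intuition, the empirical Lagrangian in (7) is cheap to evaluate once per-sample losses and gradient norms are available. The sketch below computes L̂ for given multipliers, normalizing λ̂ so that its empirical mean is one; the function name and the numbers are illustrative assumptions, not from the paper.

```python
import numpy as np

def empirical_lagrangian(loss_vals, grad_norms_sq, mu, lam, eps):
    """Empirical Lagrangian of (7):
    mu * (mean loss - eps) + (1/N) * sum_n lam(x_n) * ||grad_M f(x_n)||^2."""
    lam = lam / lam.mean()  # enforce the constraint (1/N) sum_n lam(x_n) = 1
    return mu * (loss_vals.mean() - eps) + (lam * grad_norms_sq).mean()

# hypothetical per-sample losses and squared manifold-gradient norms
loss_vals = np.array([0.5, 0.25, 0.0, 0.25])
grad_sq = np.array([1.0, 4.0, 0.5, 2.5])
L_hat = empirical_lagrangian(loss_vals, grad_sq, mu=2.0, lam=np.ones(4), eps=0.25)
print(L_hat)  # → 2.0 (the loss constraint is tight, so only the gradient term remains)
```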

3.1. Statistical consistency of the empirical dual problem

We show in this section that (7) is close to (3). Doing so requires the following assumptions.

Assumption 1. The loss ℓ is M-Lipschitz continuous, B-bounded, and convex.

Assumption 2. Let H = {f_θ | θ ∈ Θ ⊂ R^Q}, with compact Θ, be the hypothesis class, and let H̄ be the closure of its convex hull conv(H). For each ν > 0 and φ ∈ H̄ there exists θ ∈ Θ such that simultaneously sup_{z ∈ M} |φ(z) − f_θ(z)| ≤ ν and sup_{z ∈ M} ∥∇_M φ(z) − ∇_M f_θ(z)∥ ≤ ν.

Assumption 3. The hypothesis class H is G-Lipschitz in its gradients, i.e., ∥∇_M f_{θ1}(z) − ∇_M f_{θ2}(z)∥ ≤ G |f_{θ1}(z) − f_{θ2}(z)|.

Assumption 4. There exist ζ(N, δ) ≥ 0 and ζ̄(N, δ) ≥ 0, monotonically decreasing with N, such that

| E[ℓ(f_θ(x), y)] − (1/N) Σ_{n=1}^N ℓ(f_θ(x_n), y_n) | ≤ ζ(N, δ) and | E[f_θ(x)] − (1/N) Σ_{n=1}^N f_θ(x_n) | ≤ ζ̄(N, δ), (8)

for all θ ∈ Θ, with probability 1 − δ over independent draws (x_n, y_n) ∼ p.

Assumption 5. There exists a feasible solution θ ∈ Θ such that E[ℓ(f_θ(x), y)] < ϵ − Mν.

Assumption 1 holds for most losses utilized in practice. Assumption 2 is a requirement on the richness of the parameterization. In the particular case of neural networks, the covering constant ν is upper bounded by the universal approximation bound of the network. Assumption 3 also holds for neural networks if we use smooth nonlinearities, e.g., hyperbolic tangents. The uniform convergence property (8) is customary in learning theory to prove PAC learnability and is implied by bounds on complexity measures such as the VC dimension or the Rademacher complexity (Mohri et al., 2018; Vapnik, 1999; Shalev-Shwartz and Ben-David, 2014). The following theorem provides the desired bound.

Proposition 1. Let µ̂*, λ̂* be solutions of the empirical dual problem (7). Under Assumptions 1-5, there exists θ̂* ∈ argmin_θ L̂(θ, µ̂*, λ̂*) such that, with probability 1 − 5δ,

|P* − D̂*| ≤ O(ν) + (1 + Δ) ζ(N, δ) + O(ζ̄(N, δ)),

where Δ = max(µ̂*, µ*) ≤ C for a constant C < ∞, and µ*, µ̂* are solutions of (6) and (7), respectively. Proposition 1 shows that the empirical dual problem (7) is statistically consistent.
That is to say, for any draw of N samples according to p, the difference between the empirical dual problem (7) and the statistical smooth learning problem (3) decreases as N increases. This difference is bounded in terms of the richness of the parameterization (ν), the difficulty of the fit requirement (as expressed by the optimal dual variables µ*, µ̂*), and the number of samples (N). The guarantee has a form typical of constrained learning problems (Chamon and Ribeiro, 2020). Proposition 1 states that we are able to predict the minimum gradient norm that a function class can attain while satisfying an expected loss of ϵ. This is important because we do not require access to the distribution p of the samples, only a set of N samples drawn according to this distribution. On the other hand, Proposition 1 does not state that by solving the dual problem (7) we will obtain a solution of the primal problem (3). The following proposition provides a bound on the near feasibility of the solution of (7) with respect to the solution of (3).

Proposition 2. Let µ̂*, λ̂* be solutions of the empirical dual problem (7). Under Assumptions 1-5, there exists θ̂* ∈ argmin_θ L̂(θ, µ̂*, λ̂*) such that, with probability 1 − 5δ,

max_{z ∈ M} ∥∇_M f_{θ̂*}(z)∥² ≤ P* + µ̂* ( (1/N) Σ_{n=1}^N ℓ(f_{θ̂*}(x_n), y_n) − ϵ ) + O(ν) + O(ζ̄(N, δ)), and E[ℓ(f_{θ̂*}(x), y)] ≤ ϵ + ζ(N, δ).

Proposition 2 provides near optimality and near feasibility conditions for solutions θ̂* obtained through the empirical dual problem (7). The difference between the maximum gradient norm of the obtained solution θ̂* and the optimal value P* is bounded in terms of the number of samples N, as well as the empirical constraint violation. Notice that even though the optimal dual variable µ̂* is not known in advance, the constraint violation can be evaluated in practice, as it only requires evaluating the obtained function over the N given samples.

Remark 2 (Interpolators). In practice, the number of parameters in a parametric function (e.g.,
a neural network) tends to exceed the dimension of the input, which allows functions to interpolate the data, i.e., to attain zero loss on the dataset. Proposition 2 has a connection to interpolating functions. Setting ϵ = 0, if the function achieves zero error over the empirical distribution, i.e., ℓ(f_{θ̂*}(x_n), y_n) = 0 for all n ∈ [N], then the dependency on µ̂* disappears. This implies that, among interpolating classifiers, the one with the minimum Lipschitz constant over the samples is, with high probability, the one that attains the minimum Lipschitz constant over the whole manifold.

3.2. From Manifold Gradient to Discrete Laplacian

We derive an alternative way of computing the integral of the norm of the gradient utilizing samples. To do so, we define the normalized point-cloud Laplacian according to a probability distribution λ.

Definition 2 (Point-cloud Laplacian). Consider a set of points x_1, …, x_N ∈ M, sampled according to a probability distribution λ : M → R. The normalized point-cloud Laplacian of f_θ at z ∈ M is defined as

L^t_{λ,N} f_θ(z) = (1/N) Σ_{n=1}^N W(z, x_n) ( f_θ(z) − f_θ(x_n) ), with W(z, x_n) = (1/t) G_t(z, x_n) / ( ŵ(z) Ŵ(x_n) ),

where G_t(z, x_n) = (4πt)^{−d/2} e^{−∥z − x_n∥² / (4t)}, ŵ(z) = (1/N) Σ_{n=1}^N G_t(z, x_n), and Ŵ(x_n) = (1/(N−1)) Σ_{m ≠ n} G_t(x_m, x_n).

As long as the function considered is smooth enough, the following convergence result holds.

Proposition 3 (Point Cloud Estimate). Let Λ be the set of probability distributions defined on a compact d-dimensional differentiable manifold M isometrically embedded in R^D such that Λ = {λ : 0 < a ≤ λ(z) ≤ b < ∞, |∂λ/∂x| ≤ c < ∞, and |∂²λ/∂x²| ≤ d̃ < ∞ for all z ∈ M}, and let f_θ, with θ ∈ Θ, be a family of functions with uniformly bounded derivatives up to order 3, vanishing at the boundary ∂M. For any ϵ > 0 and δ > 0 there exists N_0 such that for all N > N_0,

P ( sup_{λ ∈ Λ, θ ∈ Θ} | ∫_M ∥∇_M f_θ(z)∥² λ(z) dV(z) − (1/N) Σ_{i=1}^N f_θ(z_i) L^t_{λ,N} f_θ(z_i) λ(z_i) | > ϵ ) ≤ δ, (15)

where the point-cloud Laplacian L^t_{λ,N} f_θ is as defined in Definition 2, with t = N^{−1/(d+2+α)} for any α > 0.

The proof of Proposition 3 relies on two steps. First, we relate the integral of the norm of the gradient over the manifold to the integral of the continuous Laplace-Beltrami operator by virtue of Green's identity. Second, we approximate the value of the Laplace-Beltrami operator by the point-cloud Laplacian. Proposition 3 connects the integral of the norm of the gradient of the function f_θ with a point-cloud Laplacian operator; this result connects the dual problem (7) with the primal (3) while allowing for a more amenable way of computing the integral.

Remark 3 (Laplacian Regularization and Manifold Lipschitz).
The dual problem (6) is closely related to the manifold regularization problem (5). In particular, the two become equivalent by substituting ρ = µ^{−1} and utilizing a uniform distribution for λ. The key difference between the problems is given by the dual variable λ, which can be thought of as a probability distribution over the manifold that penalizes regions of the manifold where the norm of the gradient of f_θ is large. The standard procedure in Laplacian regularization is to compute the graph Laplacian L of the set of points utilizing the heat kernel, and to compute the integral as f^T L f, where f = [f_θ(x_1), …, f_θ(x_N)]^T. In the case of the manifold Lipschitz problem, the same product can be computed, but utilizing the reweighted point-cloud Laplacian.
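As a sanity check of Definition 2 and Proposition 3, the sketch below implements the normalized point-cloud Laplacian and the weighted estimate (1/N) Σ_i f(z_i) L^t_{λ,N} f(z_i) λ(z_i) on the unit circle, where for f = sin and uniform λ = 1/(2π) the target integral ∫ ∥∇_M f∥² λ dV equals 1/2. The helper names and the fixed bandwidth t are our illustrative choices (Proposition 3 prescribes t as a function of N).

```python
import numpy as np

def point_cloud_laplacian(X, f, t, d):
    """Normalized point-cloud Laplacian of Definition 2, evaluated at the samples."""
    N = len(X)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / (4 * t)) / (4 * np.pi * t) ** (d / 2)  # heat kernel G_t
    w_hat = G.mean(axis=1)                                  # \hat w(z_i)
    W_hat = (G.sum(axis=0) - np.diag(G)) / (N - 1)          # \hat W(x_n), sum over m != n
    W = G / (t * np.outer(w_hat, W_hat))                    # kernel weights W(z_i, x_n)
    return (W * (f[:, None] - f[None, :])).mean(axis=1)

def weighted_dirichlet_energy(X, f, lam, t, d):
    """Point-cloud estimate of the integral of ||grad_M f||^2 * lam (Proposition 3)."""
    return np.mean(f * point_cloud_laplacian(X, f, t, d) * lam)

# check on the unit circle (no boundary): f = sin(angle), lam uniform = 1/(2*pi),
# so the target integral of cos(angle)^2 / (2*pi) over the circle is 1/2
N, d = 500, 1
ang = np.linspace(0, 2 * np.pi, N, endpoint=False)
X = np.stack([np.cos(ang), np.sin(ang)], axis=1)
f, lam = np.sin(ang), np.full(N, 1 / (2 * np.pi))
E = weighted_dirichlet_energy(X, f, lam, t=0.01, d=d)
print(round(E, 2))
```

A nonuniform λ simply up-weights the contribution of the regions it concentrates on, which is exactly how the dual variable in (7) penalizes large local gradients.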

3.3. Dual Ascent Algorithm

We outline an iterative, empirical primal-dual algorithm to solve the dual problem. Upon initialization of θ^0, λ^0, and µ^0, we minimize the dual function as follows,

θ^{k+1} = θ^k − η_θ ∇_θ L̂(θ^k, µ^k, λ^k), (16)

where η_θ is a positive step size. Note that to solve problem (16), we can either utilize the gradient version of the Lagrangian [cf. (7)] or the point-cloud Laplacian [cf. (15)]. After updating θ, we update the dual variables as follows,

µ^{k+1} = [ µ^k + η_µ ( (1/N) Σ_{n=1}^N ℓ(f_{θ^{k+1}}(x_n), y_n) − ϵ ) ]_+, (17)

λ̃^{k+1}(x_n) = λ^k(x_n) + η_λ ∥∇_M f_{θ^{k+1}}(x_n)∥² for all data points x_n, (18)

λ^{k+1} = argmin_λ ∥λ̃^{k+1} − λ∥, such that Σ_{n=1}^N λ(x_n) = N and λ ≥ 0, (19)

where η_µ and η_λ are positive step sizes. Note that we require a convex projection over λ to satisfy the normalization constraint. In step (18) we need the norm of the gradient at each data point x_n, which we estimate using the neighboring data points. Intuitively, the primal-dual procedure increases the value of λ(x_n) at points at which the norm of the gradient is large. The role of µ is to enforce the loss ℓ to be smaller than ϵ; a larger value of µ increases the relative importance of the loss term over the integral of the gradient norm. See Appendix F.
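The updates above can be sketched on a toy instance. Below, f_θ(x) = θx on scalar data, so the manifold gradient norm is |θ| at every sample; the model, data, and step sizes are illustrative assumptions, and the projection onto {λ ≥ 0, Σ_n λ(x_n) = N} is implemented with the standard sorting-based Euclidean simplex projection.

```python
import numpy as np

def project_scaled_simplex(v, s):
    """Euclidean projection of v onto {w >= 0, sum(w) = s}, cf. step (19)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - s
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[idx] / (idx + 1.0), 0.0)

# Toy instance: f_theta(x) = theta * x, squared loss, |grad f_theta| = |theta|.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 50)
y = 2.0 * X                          # data generated by a 2-Lipschitz map
eps = 0.5                            # target average loss
theta, mu, lam = 0.0, 1.0, np.ones(50)
eta_theta, eta_mu, eta_lam = 0.05, 0.5, 0.5

for _ in range(500):
    # primal step on the Lagrangian  mu * (mean loss - eps) + mean(lam) * theta^2
    g = mu * np.mean(2 * (theta * X - y) * X) + np.mean(lam) * 2 * theta
    theta -= eta_theta * g
    # dual step for mu, projected onto mu >= 0
    mu = max(mu + eta_mu * (np.mean((theta * X - y) ** 2) - eps), 0.0)
    # dual step for lam followed by the projection (the gradient norm is the
    # same at every sample here, so lam remains uniform after projecting)
    lam = project_scaled_simplex(lam + eta_lam * theta ** 2, 50.0)

# theta settles near the smallest slope meeting the loss target,
# i.e. (theta - 2)^2 * mean(X^2) = eps
print(round(theta, 2))
```

In a realistic run θ is a neural network, the primal step backpropagates through the weighted Laplacian term, and λ concentrates on the samples with the largest local gradients.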

4. Experiments

4.1. Navigation Controls Problem

In this section, we consider the problem of continuous navigation of an agent. The agent's objective is to reach a goal while avoiding obstacles. The state of the agent is given by its position, and it navigates by taking actions on its velocity. We construct a square grid of points in the free space, i.e., outside of the obstacles, and find the shortest paths between two starting positions and the goal along the grid. For those two grid trajectories, we compute the optimal action to be taken at each point in order to follow the trajectory. The learner is equipped with both the labeled trajectories and the unlabeled point grid. To leverage the manifold structure of the data, we construct the point-cloud Laplacian considering adjacent points in the grid. As a measure of merit, we take 100 random starting points and compute the resulting trajectories. A trajectory is successful if it reaches the goal without colliding. The results are shown in Table 1, and the learned functions in Figure 2.

4.2. Learning Model Mismatch in Differential Drive Steering

In this real-world experiment, we seek to learn the error that a model makes in predicting the dynamics of a ground robot Koppel et al. (2016). We possess trajectories, in the form of time series of the control signals of an iRobot Packbot making turns on both pavement and grass. For each trajectory, the mismatch between the real and the model-predicted states is quantified. Given a dataset of trajectories and the errors made by the model, the objective is to learn the error that the model will make. Details of the experiment can be found in Appendix G.3.
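The grid-based label generation described in Section 4.1 can be sketched as follows: breadth-first search gives distances to the goal over the free-space grid, and the optimal action at each cell is the unit step toward the neighbor closest to the goal. The 4-connected grid, the obstacle layout, and the helper names are illustrative assumptions.

```python
import numpy as np
from collections import deque

def grid_shortest_path_actions(free, goal):
    """BFS distances-to-goal on a free-space grid; the optimal action at each
    free cell is the move toward the neighbor with the smallest distance."""
    H, W = free.shape
    dist = np.full((H, W), np.inf)
    dist[goal] = 0
    q = deque([goal])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    while q:
        i, j = q.popleft()
        for di, dj in moves:
            ni, nj = i + di, j + dj
            if 0 <= ni < H and 0 <= nj < W and free[ni, nj] and dist[ni, nj] == np.inf:
                dist[ni, nj] = dist[i, j] + 1
                q.append((ni, nj))
    actions = {}
    for i in range(H):
        for j in range(W):
            if free[i, j] and np.isfinite(dist[i, j]) and dist[i, j] > 0:
                actions[(i, j)] = min(
                    ((di, dj) for di, dj in moves
                     if 0 <= i + di < H and 0 <= j + dj < W and free[i + di, j + dj]),
                    key=lambda m: dist[i + m[0], j + m[1]])
    return dist, actions

free = np.ones((5, 5), dtype=bool)
free[1:4, 2] = False                  # a wall-like obstacle
dist, actions = grid_shortest_path_actions(free, goal=(2, 4))
print(actions[(2, 0)])  # → (-1, 0): step up to go around the wall
```

The resulting (position, action) pairs play the role of the labeled trajectories, while the full grid provides the unlabeled samples for the point-cloud Laplacian.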

5. Conclusion

In this work, we presented a constrained learning method to obtain smooth functions over manifold data. We have shown that, under mild conditions, the problem of finding smooth functions over a manifold can be reformulated as a weighted point-cloud Laplacian penalty over a varying probability distribution whose dynamics are governed by the constraint violations. Three numerical examples validate the empirical advantages of obtaining functions that vary smoothly over the data.



Figure 1: Two moons dataset. The setting is a two-dimensional classification problem with two classes, with one labeled and 200 unlabeled samples per class. The objective is to correctly classify the 200 unlabeled samples. We consider two cases: (top) the estimated manifold has two connected components, and (bottom) the manifold is weakly connected. We plot the output of a one-layer neural network trained using Manifold Regularization and Manifold/Ambient Lipschitz constraints. Ambient regularization fails to classify the unlabeled samples, as it ignores the distribution of samples given by the manifold. In the case in which the manifold has two connected components (cf. Figure 1a), our method works as well as Manifold Regularization, since the Lipschitz constant is made small in both components separately. However, when the manifold is weakly connected, Manifold Regularization fails to recognize the transition between the components: as it penalizes large gradients on average across the manifold, it converges to a plane that connects the two labeled samples. Our Manifold Lipschitz method, as it requires the Lipschitz constant to be small everywhere, forces a sharp transition at the point of maximal separation.




Figure 2: Figure 2a shows the training dataset; blue stars depict unlabeled points, and the blue arrow the optimal action at the red star. Figure 2b shows the learned function using Manifold Lipschitz learning, and Figure 2c its associated dual variables. Figures 2d, 2e, and 2f show the functions learned using ERM, Ambient Regularization, and Manifold Regularization, respectively.

Table 1: Number of successful trajectories from 100 random starting points.

Table 2: Error prediction accuracy for the ground robot experiment.

