STATISTICAL EFFICIENCY OF SCORE MATCHING: THE VIEW FROM ISOPERIMETRY

Abstract

Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood $\log p(x)$ for the training data, we fit the score function $\nabla_x \log p(x)$, obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood, which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated, i.e. the Poincaré, log-Sobolev and isoperimetric constants, quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant, then even for simple families of distributions like exponential families with rich enough sufficient statistics, score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.

Definition 1 (Score matching). Given a smooth ground truth distribution $p$ with sufficient decay at infinity and a smooth distribution $q$, the score matching loss (at the population level) is defined to be

$$J_p(q) := \frac{1}{2} E_{X \sim p}\left[\|\nabla \log p(X) - \nabla \log q(X)\|^2\right] = E_{X \sim p}\left[\mathrm{Tr}\,\nabla^2 \log q(X) + \frac{1}{2}\|\nabla \log q(X)\|^2\right] + K_p. \qquad (1)$$

Footnote 1: Note that there is another popular variant of score matching called denoising score matching, in which the data distribution is convolved with a Gaussian. This will not be the focus of this paper.

1. INTRODUCTION

Energy-based models (EBMs) are deep generative models parametrized up to a normalizing constant, namely $p(x) \propto \exp(f(x))$. The primary training challenge is that evaluating the likelihood (and gradients thereof) requires evaluating the partition function of the model, which is generally computationally intractable, even when using relatively sophisticated MCMC techniques. The seminal paper of Song and Ermon (2019) circumvented this difficulty by instead fitting the score function of the model, that is, $\nabla_x \log p(x)$. Though it is not obvious how to evaluate this loss from training samples only, Hyvärinen (2005) showed this can be done via integration by parts, and that the estimator is consistent (that is, it converges to the correct value in the limit of infinite samples).

The maximum likelihood estimator is the de facto choice for model fitting because of its well-known property of being statistically optimal in the limit where the number of samples goes to infinity (Van der Vaart, 2000). It is unclear how much worse score matching can be, and thus how much statistical efficiency we sacrifice for the algorithmic convenience of avoiding partition functions. In the seminal paper (Song and Ermon, 2019), it was conjectured that multimodality, as well as a low-dimensional manifold structure, may cause difficulties for score matching. The intuition for this is natural: having poor estimates for the score in "low probability" regions of the distribution can "propagate" into bad estimates for the likelihood once the score vector field is "integrated". Making this formal, however, seems challenging. We show that the right mathematical tools to formalize, and substantially generalize, such intuitions are functional analytic tools that characterize isoperimetric properties of the distribution in question.
Namely, we show that three quantities, the Poincaré, log-Sobolev and isoperimetric constants (which are all in turn very closely related, see Section 2), tightly characterize how much worse the efficiency of score matching is compared to maximum likelihood. These quantities can be (equivalently) viewed as: (1) characterizing the mixing time of Langevin dynamics, a stochastic differential equation used to sample from a distribution $p(x) \propto \exp(f(x))$ given access to a gradient oracle for $f$; (2) characterizing "sparse cuts" in the distribution, that is, sets $S$ for which the surface area of $S$ can be much smaller than the volume of $S$. Notably, multimodal distributions with well-separated, deep modes have very large log-Sobolev/Poincaré/isoperimetric constants (Gayrard et al., 2004; 2005), as do distributions supported over manifolds with negative curvature (Hsu, 2002), such as hyperbolic manifolds. Since it is commonly thought that the complex, high-dimensional distributions that deep generative models are trained to learn do in fact exhibit multimodal and low-dimensional manifold structure, our paper can be interpreted as showing that in many of these settings, score matching may be substantially less statistically efficient than maximum likelihood. Thus, our results can be thought of as a formal justification of the conjectured challenges for score matching in Song and Ermon (2019), as well as a vast generalization of the set of "problem cases" for score matching. This also shows that, surprisingly, the same obstructions for efficient inference (i.e. drawing samples from a trained model, which is usually done using Langevin dynamics for EBMs) are also an obstacle for efficient learning using score matching. [Footnote 1]

We roughly show the following results:

1. For a finite number of samples $n$, we show that if we are trying to estimate a distribution from a class with Rademacher complexity bounded by $R_n$, as well as a log-Sobolev constant bounded by $C_{LS}$, achieving score matching loss at most $\epsilon$ implies that we have learned a distribution that is no more than $\epsilon C_{LS} R_n$ away from the data distribution in KL divergence. The main tool for this is showing that the score matching objective is at most a multiplicative factor of $C_{LS}$ away from the KL divergence to the data distribution.

2. In the asymptotic limit (i.e. as the number of samples $n \to \infty$), we focus on the special case of estimating the parameters $\theta$ of a probability distribution from an exponential family $\{p_\theta(x) \propto \exp(\langle \theta, F(x) \rangle)\}$, for some sufficient statistics $F$, using score matching. If the distribution $p_\theta$ we are estimating has Poincaré constant bounded by $C_P$, we show that score matching and the MLE have asymptotic efficiencies that differ by at most a factor of $C_P$. Conversely, we show that if the family of sufficient statistics is sufficiently rich, and the distribution $p_\theta$ we are estimating has isoperimetric constant lower bounded by $C_{IS}$, then the score matching loss is less efficient than the MLE estimator by at least a factor of $C_{IS}$.

3. Based on our new conceptual framework, we identify a precise analogy between score matching in the continuous setting and pseudolikelihood methods in the discrete (and continuous) setting. This connection is made by replacing the Langevin dynamics with its natural analogue, the Glauber dynamics (Gibbs sampler). We show that the approximate tensorization of entropy inequality (Marton, 2013; Caputo et al., 2015), which guarantees rapid mixing of the Glauber dynamics, allows us to obtain finite-sample bounds for learning distributions in KL via pseudolikelihood in an identical way to the log-Sobolev inequality for score matching. A variant of this connection is also made for the related ratio matching estimator of Hyvärinen (2007).

4. In Section 7, we perform several simulations which illustrate the close connection between isoperimetry and the performance of score matching. We give examples both when fitting the parameters of an exponential family and when the score function is fit using a neural network.

In equation (1), $K_p$ is a constant independent of $q$, and the last equality is due to Hyvärinen (2005). Given samples from $p$, the training loss $\hat{J}_p(q)$ is defined by replacing the rightmost expectation with the average over data.

Functional and Isoperimetric Inequalities. Let $q(x)$ be a smooth probability density over $\mathbb{R}^d$. A key role in this work is played by the log-Sobolev, Poincaré, and isoperimetric constants of $q$: closely related geometric quantities, connected to the mixing of the Langevin dynamics, which have been deeply studied in probability theory and in geometric and functional analysis (see e.g. Gross, 1975; Ledoux, 2000; Bakry et al., 2014). We discuss the background in more detail in Appendix A.

Definition 2. The log-Sobolev constant $C_{LS}(q) \ge 0$ is the smallest constant so that for any smooth probability density $p$, we have

$$KL(p, q) \le C_{LS}(q)\, I(p \mid q), \qquad (2)$$

where $KL(p, q) = E_{X \sim p}[\log(p(X)/q(X))]$ is the Kullback-Leibler divergence or relative entropy, and the relative Fisher information $I(p \mid q)$ is defined as $I(p \mid q) := E_q \left\langle \nabla \log \frac{p}{q}, \nabla \frac{p}{q} \right\rangle$.

The log-Sobolev inequality is equivalent to exponential ergodicity of the Langevin dynamics for $q$, a canonical Markov process which preserves $q$ and is used for sampling from it, described by the stochastic differential equation $dX_t = \nabla \log q(X_t)\,dt + \sqrt{2}\,dB_t$. The log-Sobolev constant can be bounded for log-concave distributions: if $q$ is $\alpha$-strongly log-concave, then $C_{LS} \le 1/(2\alpha)$ by Bakry-Emery theory (Bakry et al., 2014). See Appendix A for details. For a class of distributions $\mathcal{P}$, we can also define the restricted log-Sobolev constant $C_{LS}(q, \mathcal{P})$ to be the smallest constant such that (2) holds under the additional restriction that $p \in \mathcal{P}$; see e.g.
Anari et al. (2021b). For $\mathcal{P}$ an infinitesimal neighborhood of $p$, the restricted log-Sobolev constant of $q$ becomes half of the Poincaré constant or inverse spectral gap $C_P(q)$:

Definition 3. The Poincaré constant $C_P(q) \ge 0$ is the smallest constant so that for all smooth functions $f : \mathbb{R}^d \to \mathbb{R}$,

$$\mathrm{Var}_q(f) \le C_P(q)\, E_q \|\nabla f\|^2. \qquad (3)$$

It is related to the log-Sobolev constant by $C_P \le 2 C_{LS}$ (Lemma 3.28 of Van Handel (2014)). Both of these inequalities measure the isoperimetric properties of $q$ from the perspective of functions; they are closely related to the isoperimetric constant:

Definition 4. The isoperimetric constant $C_{IS}(q)$ is the smallest constant such that for every set $S$,

$$\min\left\{ \int_S q(x)\,dx,\ \int_{S^C} q(x)\,dx \right\} \le C_{IS}(q)\, \liminf_{\epsilon \to 0} \frac{\int_{S_\epsilon} q(x)\,dx - \int_S q(x)\,dx}{\epsilon}, \qquad (4)$$

where $S_\epsilon = \{x : d(x, S) \le \epsilon\}$ and $d(x, S)$ denotes the (Euclidean) distance of $x$ from the set $S$. The isoperimetric constant is related to the Poincaré constant by $C_P \le 4 C_{IS}^2$ (Proposition 8.5.2 of Bakry et al. (2014)). Assuming $S$ is chosen so that $\int_S q(x)\,dx < 1/2$, the left hand side can be interpreted as the volume and the right hand side as the surface area of $S$ with respect to $q$.

Mollifiers. We recall the definition of one of the standard mollifiers/bump functions, as used in e.g. Hörmander (2015). Mollifiers are smooth functions useful for approximating non-smooth functions: convolving a function with a mollifier makes it "smoother", in the sense of the existence and size of the derivatives. Precisely, define the (infinitely differentiable) function $\psi : \mathbb{R}^d \to \mathbb{R}$ as

$$\psi(y) = \frac{1}{G_d}\, e^{-1/(1 - |y|^2)} \ \text{ for } |y| < 1, \qquad \psi(y) = 0 \ \text{ for } |y| \ge 1,$$

where $G_d := \int_{|y| < 1} e^{-1/(1 - |y|^2)}\,dy$. For $\gamma > 0$, we also define a "sharpening" of $\psi$, namely $\psi_\gamma(y) = \gamma^{-d} \psi(y/\gamma)$, so that $\int \psi_\gamma = 1$.

Notation. For a random vector $X$, $\Sigma_X := E[XX^T] - E[X]E[X]^T$ denotes its covariance matrix.
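The link between these constants and the mixing of the Langevin dynamics can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes an Euler-Maruyama discretization (the text only discusses the continuous-time process), uses the drift $\nabla \log q$, which is the sign convention under which $q$ is stationary and which matches the generator $Lf = \langle \nabla \log q, \nabla f \rangle + \Delta f$ in Appendix A, and takes a standard Gaussian target. The function names, step size, and number of chains are illustrative choices.

```python
import math
import random


def grad_log_q(x):
    """Score of the standard Gaussian target q = N(0, 1)."""
    return -x


def langevin_chain(x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dX_t = grad log q(X_t) dt + sqrt(2) dB_t."""
    x = x0
    for _ in range(n_steps):
        x = x + step * grad_log_q(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x


def simulate(n_chains=2000, x0=3.0, step=0.01, n_steps=500, seed=0):
    """Run independent chains from the same start; return sample mean and variance."""
    rng = random.Random(seed)
    finals = [langevin_chain(x0, step, n_steps, rng) for _ in range(n_chains)]
    mean = sum(finals) / n_chains
    var = sum((x - mean) ** 2 for x in finals) / n_chains
    return mean, var


mean, var = simulate()
print(f"mean after t=5: {mean:.3f}, variance: {var:.3f}")
```

For this strongly log-concave target the chains started far from the mode at $x_0 = 3$ forget their initialization quickly, so the empirical mean and variance should be close to 0 and 1. A target with a large log-Sobolev constant (for instance, two well-separated modes) would need a far longer run to equilibrate.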

3. LEARNING DISTRIBUTIONS FROM SCORES: NONASYMPTOTIC THEORY

Though consistency of the score matching estimator was proven in Hyvärinen (2005), it is unclear what one can conclude about the proximity of the learned distribution from a finite number of samples. Precisely, we would like a guarantee that shows that if the training loss (i.e. the empirical estimate of (1)) is small, the learned distribution is close to the ground truth distribution (e.g. in the KL divergence sense). However, this is not always true! We will see an illustrative example where this fails in Section 7, and also establish a general negative result in Section 4.

Our first new observation is that understanding the multiplicative gap between the KL divergence and the score matching test loss is equivalent to understanding log-Sobolev constants. Based on this, we prove (Theorem 1) that minimizing the training loss does learn the true distribution, assuming that the class of distributions we are learning has bounded complexity and small log-Sobolev constant. First, we formalize the connection to the log-Sobolev constant:

Proposition 1. The log-Sobolev inequality for $q$ is equivalent to the following inequality over all smooth probability densities $p$:

$$KL(p, q) \le 2 C_{LS}(q)\,(J_p(q) - J_p(p)). \qquad (5)$$

More generally, for a class of distributions $\mathcal{P}$, the restricted log-Sobolev constant is the smallest constant such that $KL(p, q) \le 2 C_{LS}(q, \mathcal{P})\,(J_p(q) - J_p(p))$ for all distributions $p \in \mathcal{P}$.

Proof. This follows from the following equivalent form for the relative Fisher information (Shao et al., 2019; Vempala and Wibisono, 2019):

$$I(p \mid q) = E_q \left\langle \nabla \frac{p}{q}, \nabla \log \frac{p}{q} \right\rangle = E_p \left\langle \frac{q}{p} \nabla \frac{p}{q}, \nabla \log \frac{p}{q} \right\rangle = E_p \left\langle \nabla \log \frac{p}{q}, \nabla \log \frac{p}{q} \right\rangle = E_p \|\nabla \log p - \nabla \log q\|^2. \qquad (6)$$

Using this and (1), we have $J_p(q) - J_p(p) = \frac{1}{2} I(p \mid q)$, so the log-Sobolev inequality can be rewritten as $KL(p, q) \le 2 C_{LS}\,(J_p(q) - J_p(p))$, which proves the first claim, and the same argument shows the second claim.

Remark 1 (Interpretation of Score Matching). The left hand side of (5) is $KL(p, q) = E_p[\log p] - E_p[\log q]$.
The first term is independent of $q$ and the second term is the likelihood, the objective for Maximum Likelihood Estimation. So (5) shows that the score matching objective is a relaxation (within a multiplicative factor of $2 C_{LS}(q)$) of maximum likelihood via the log-Sobolev inequality. We discuss connections to other proposed interpretations in Appendix B.

Remark 2. Interestingly, the log-Sobolev constant which appears in the bound is that of $q$ and not of $p$, the ground truth distribution. This is useful because $q$ is known to the learner whereas $p$ is only indirectly observed. If $q$ is actually close to $p$, the log-Sobolev constants are comparable due to the Holley-Stroock perturbation principle (Proposition 5.1.6 of Bakry et al. (2014)).

To our knowledge, we are the first to point out the useful connection of the score matching loss with the log-Sobolev inequality. Because log-Sobolev constants have been well studied, this observation has many nice consequences which would otherwise be difficult to prove. Combining Proposition 1, bounds on log-Sobolev constants from the literature, and generalization theory gives us finite-sample guarantees for learning distributions in KL divergence via score matching.

Theorem 1. Suppose that $\mathcal{P}$ is a class of probability distributions containing $p$, and define $C_{LS}(\mathcal{P}, \mathcal{P}) := \sup_{q \in \mathcal{P}} C_{LS}(q, \mathcal{P}) \le \sup_{q \in \mathcal{P}} C_{LS}(q)$ to be the worst-case (restricted) log-Sobolev constant in the class of distributions. Let

$$R_n := E_{X_1, \ldots, X_n, \epsilon_1, \ldots, \epsilon_n} \sup_{q \in \mathcal{P}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \left[ \mathrm{Tr}\,\nabla^2 \log q(X_i) + \frac{1}{2} \|\nabla \log q(X_i)\|^2 \right]$$

be the expected Rademacher complexity of the class given $n$ samples $X_1, \ldots$

Example 1. Suppose we are fitting an isotropic Gaussian in $d$ dimensions with unknown mean $\mu^*$ satisfying $\|\mu^*\| \le R$.
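The class of distributions $\mathcal{P}$ consists of $q_\mu$ with $\|\mu\| \le R$ of the form $q_\mu(x) \propto \exp(-\|x - \mu\|^2/2)$, so the expected Rademacher complexity can be upper bounded as follows:

$$R_n = E \sup_\mu \frac{1}{n} \sum_{i=1}^n \epsilon_i \left[ -d + \frac{1}{2} \|X_i - \mu\|^2 \right] = E \sup_\mu \frac{1}{n} \sum_{i=1}^n \epsilon_i \langle X_i, \mu \rangle = R\, E \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i X_i \right\| \le R \sqrt{E \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i X_i \right\|^2} \le R \sqrt{\frac{R^2 + d}{n}},$$

where the first inequality is Jensen's inequality, and in the last step we expanded the square and used that $E\,\epsilon_i \epsilon_j = 1(i = j)$ and $E\|X_i\|^2 \le R^2 + d$. Recall that the standard Gaussian distribution is 1-strongly log-concave, so $C_{LS} \le 1/2$. Hence we have the concrete bound $E\,KL(p, \hat{p}) \le 2R \sqrt{\frac{R^2 + d}{n}}$.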
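The Jensen step in Example 1 can be checked by simulation. The sketch below is an illustration rather than part of the paper: it estimates $R \cdot E\|\frac{1}{n}\sum_i \epsilon_i X_i\|$ by Monte Carlo for Gaussian data with a mean on the boundary of the ball $\|\mu\| \le R$ and compares it with $R\sqrt{(R^2 + d)/n}$. The values of `n`, `d`, `R`, the number of trials, and the function names are all assumed for illustration.

```python
import math
import random


def rademacher_norm(n, d, mu, rng):
    """One draw of ||(1/n) sum_i eps_i X_i|| with X_i ~ N(mu, I_d) and eps_i = +-1."""
    total = [0.0] * d
    for _ in range(n):
        eps = rng.choice((-1.0, 1.0))
        for j in range(d):
            total[j] += eps * (mu[j] + rng.gauss(0.0, 1.0))
    return math.sqrt(sum(t * t for t in total)) / n


def estimate_and_bound(n=100, d=2, R=1.5, trials=1000, seed=0):
    """Monte Carlo estimate of R * E||(1/n) sum eps_i X_i|| and the bound R * sqrt((R^2 + d)/n)."""
    rng = random.Random(seed)
    mu = [R] + [0.0] * (d - 1)  # a mean on the boundary of the ball ||mu|| <= R
    avg = sum(rademacher_norm(n, d, mu, rng) for _ in range(trials)) / trials
    return R * avg, R * math.sqrt((R * R + d) / n)


estimate, bound = estimate_and_bound()
print(f"Monte Carlo estimate: {estimate:.4f}, bound: {bound:.4f}")
```

The estimate should sit somewhat below the bound, with the gap coming from Jensen's inequality; both quantities shrink at the rate $1/\sqrt{n}$ as the sample size grows.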

4. STATISTICAL COST OF SCORE MATCHING: ASYMPTOTIC RESULTS

In this section, we compare the asymptotic efficiency of the score matching estimator in exponential families to the efficiency of the maximum likelihood estimator. Because we are considering asymptotics, we might expect (recall the discussion in Section 2) that the relevant functional inequality will be the local version of the log-Sobolev inequality around the true distribution $p$, which is the Poincaré inequality for $p$. Our results will show precisely how this occurs and characterize the situations where score matching is substantially less statistically efficient than maximum likelihood.

Setup. In this section, we will focus on distributions from exponential families. We will consider estimating the parameters of an exponential family using two estimators, the classical maximum likelihood estimator (MLE) and the score matching estimator; we will use that the score matching estimator $\arg\min_{\theta'} \hat{J}_p(p_{\theta'})$ admits a closed-form formula in this setting.

Definition 5 (Exponential family). For sufficient statistics $F : \mathbb{R}^d \to \mathbb{R}^m$, the exponential family of distributions associated with $F$ is $\{p_\theta(x) \propto \exp(\langle \theta, F(x) \rangle) \mid \theta \in \Theta \subseteq \mathbb{R}^m\}$.

Definition 6 (MLE, Van der Vaart (2000)). Given i.i.d. samples $x_1, \ldots, x_n \sim p_\theta$, the maximum likelihood estimator is $\hat{\theta}_{MLE} = \arg\max_{\theta' \in \Theta} \hat{E}[\log p_{\theta'}(X)]$, where $\hat{E}$ denotes the expectation over the samples. As $n \to \infty$ and under appropriate regularity conditions, we have $\sqrt{n}\,(\hat{\theta}_{MLE} - \theta) \to N(0, \Gamma_{MLE})$, where $\Gamma_{MLE} := \Sigma_F^{-1}$ and $\Sigma_F$ is known as the Fisher information matrix.

Proposition 2 (Score matching estimator, Equation (34) of Hyvärinen (2007)). Given i.i.d. samples $x_1, \ldots, x_n \sim p_\theta$, the score matching estimator equals

$$\hat{\theta}_{SM} = -\hat{E}\left[(JF)_X (JF)_X^T\right]^{-1} \hat{E}\,\Delta F,$$

where $(JF)_X$ is the $m \times d$ Jacobian of $F$ at the point $X$, and $\Delta f = \sum_i \partial_i^2 f$ is the Laplacian, applied coordinate-wise to the vector-valued function $F$.
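As a concrete instance of the closed-form score matching estimator $\hat{\theta}_{SM} = -\hat{E}[(JF)_X (JF)_X^T]^{-1} \hat{E}\,\Delta F$, the sketch below works it out for a one-dimensional Gaussian written as an exponential family with $F(x) = (x, x^2)$, so that $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$. The choice of family, the function name, and the tiny data set are illustrative assumptions, not taken from the paper.

```python
def score_matching_gaussian(xs):
    """Closed-form score matching estimate for p_theta(x) ∝ exp(theta1*x + theta2*x^2).

    The sufficient statistics are F(x) = (x, x^2), so the Jacobian is (1, 2x)^T and
    the Laplacian is (0, 2). Returns theta_hat = -A^{-1} b with
    A = mean[(JF)(JF)^T] and b = mean[Delta F].
    """
    n = len(xs)
    m1 = sum(xs) / n                  # empirical mean of x
    m2 = sum(x * x for x in xs) / n   # empirical mean of x^2
    # A = [[1, 2*m1], [2*m1, 4*m2]] and b = (0, 2)
    a11, a12, a22 = 1.0, 2.0 * m1, 4.0 * m2
    b1, b2 = 0.0, 2.0
    det = a11 * a22 - a12 * a12
    # Solve A y = b by Cramer's rule, then negate.
    y1 = (a22 * b1 - a12 * b2) / det
    y2 = (a11 * b2 - a12 * b1) / det
    return -y1, -y2


theta1, theta2 = score_matching_gaussian([0.0, 2.0])
print(theta1, theta2)
```

For the two-point data set the sample mean is 1 and the (biased) sample variance is 1, so the estimate is $(1, -1/2)$, which is exactly the natural parameter of $N(1, 1)$. In this Gaussian family score matching reduces to moment matching and therefore coincides with the MLE; the gap between the two estimators studied in this section only appears for richer sufficient statistics.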

4.1. ASYMPTOTIC NORMALITY

Next, we prove asymptotic normality of the score matching estimator and give a formula for the limiting renormalized covariance matrix $\Gamma_{SM}$. Since the MLE also satisfies asymptotic normality with an explicit covariance matrix, we can then proceed in the next sections to compare their relative efficiency (as in e.g. Section 8.2 of Van der Vaart (2000)) by comparing the asymptotic covariances $\Gamma_{SM}$ and $\Gamma_{MLE}$. The proof of the following result is in Appendix C.

Proposition 3 (Asymptotic normality). As $n \to \infty$, and assuming sufficient smoothness and decay conditions so that score matching is consistent (see Hyvärinen (2005)), we have the following convergence in distribution: $\sqrt{n}\,(\hat{\theta}_{SM} - \theta) \to N(0, \Gamma_{SM})$, where

$$\Gamma_{SM} := E\left[(JF)_X (JF)_X^T\right]^{-1}\, \Sigma_{(JF)_X (JF)_X^T \theta + \Delta F}\, E\left[(JF)_X (JF)_X^T\right]^{-1}. \qquad (7)$$

4.2. STATISTICAL EFFICIENCY OF SCORE MATCHING UNDER A POINCARÉ INEQUALITY

Our first result will show that if we are estimating a distribution with a small Poincaré constant (and under some relatively mild smoothness assumptions), the statistical efficiency of the score matching estimator is not much worse than that of the maximum likelihood estimator.

Theorem 2 (Efficiency under a Poincaré inequality). Suppose the distribution $p_\theta$ satisfies a Poincaré inequality with constant $C_P$. Then we have

$$\|\Gamma_{SM}\|_{OP} \le 2 C_P^2\, \|\Gamma_{MLE}\|_{OP}^2 \left( \|\theta\|^2\, E\|(JF)_X\|_{OP}^4 + E\|\Delta F\|_2^2 \right).$$

More generally, the same bound holds assuming only the following restricted version of the Poincaré inequality: for any $w$, $\mathrm{Var}(\langle w, F(x) \rangle) \le C_P\, E\|\nabla \langle w, F(x) \rangle\|_2^2$.

Remark 3. To interpret the terms in the bound, the quantities $E_{p_\theta}\|(JF)_X\|_{OP}^4$ and $E\|\Delta F\|_2^2$ can be seen as measures of the smoothness of the sufficient statistics $F$, and $\|\theta\|$ as a bound on the radius of parameters for the exponential family. In Section 7 we will give an example to show that bounded smoothness is indeed necessary for score matching to be efficient. A direct consequence of this result (see Appendix D) is that with 99% probability and for sufficiently large $n$,

$$n \|\theta - \hat{\theta}_{SM}\|^2 \le n\, E\|\theta - \hat{\theta}_{MLE}\|_2^2 \cdot O\left( C_P^2\, m \left( \|\theta\|^2\, E\|(JF)_X\|_{OP}^4 + E\|\Delta F\|_2^2 \right) \right).$$

So if the distribution is smooth and satisfies a Poincaré inequality, score matching achieves small $\ell_2$ error provided the MLE does. We illustrate Theorem 2 for a natural example of a bimodal distribution in Example 3 of Appendix G.

The proof of Theorem 2 is in Appendix D, and the main lemma used to prove the theorem is the following:

Lemma 1. $E[(JF)_X (JF)_X^T]^{-1} \preceq C_P\, \Sigma_F^{-1}$, where $C_P$ is the Poincaré constant of $p_\theta$.

Proof. For any vector $w \in \mathbb{R}^m$, we have by the Poincaré inequality that

$$C_P \langle w, E[(JF)_X (JF)_X^T]\, w \rangle = C_P\, E\left\| \nabla_x \langle w, F(x) \rangle \big|_X \right\|_2^2 \ge \mathrm{Var}(\langle w, F(x) \rangle) = \langle w, \Sigma_F w \rangle.$$

This shows $C_P\, E[(JF)_X (JF)_X^T] \succeq \Sigma_F$, and inverting both sides gives the result.
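A worked instance may help make Lemma 1 concrete. The following is an illustrative check, not taken from the paper, for the Gaussian $p = N(\mu, \sigma^2)$ in one dimension with sufficient statistics $F(x) = (x, x^2)$; it relies on the standard fact that the Poincaré constant of $N(\mu, \sigma^2)$ is $\sigma^2$.

```latex
% Worked instance of Lemma 1 for p = N(\mu, \sigma^2), F(x) = (x, x^2), C_P = \sigma^2.
% Here (JF)_X = (1, 2X)^T, Var(X^2) = 4\mu^2\sigma^2 + 2\sigma^4, Cov(X, X^2) = 2\mu\sigma^2.
\[
E\big[(JF)_X (JF)_X^T\big]
  = \begin{pmatrix} 1 & 2\mu \\ 2\mu & 4(\mu^2 + \sigma^2) \end{pmatrix},
\qquad
\Sigma_F
  = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 4\mu^2\sigma^2 + 2\sigma^4 \end{pmatrix},
\]
\[
C_P\, E\big[(JF)_X (JF)_X^T\big] - \Sigma_F
  = \begin{pmatrix} 0 & 0 \\ 0 & 2\sigma^4 \end{pmatrix} \succeq 0 .
\]
```

The difference is positive semidefinite, as the lemma requires, and it vanishes in the direction of the linear statistic $x$, for which the Gaussian Poincaré inequality holds with equality.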

4.3. STATISTICAL EFFICIENCY LOWER BOUNDS FROM SPARSE CUTS

In this section, we prove a converse to Theorem 2: whereas a small (restricted) Poincaré constant upper bounds the variance of the score matching estimator, if the Poincaré constant of our target distribution is large and we have sufficiently rich sufficient statistics, score matching will be extremely inefficient compared to the MLE. In fact, we will be able to show this by taking an arbitrary family of sufficient statistics and adding a single sufficient statistic! Informally, we will show the following:

Consider estimating a distribution $p_\theta$ in an exponential family with isoperimetric constant $C_{IS}$. Then $p_\theta$ can be viewed as a member of an enlarged exponential family with one more ($O_{\partial S}(1)$-Lipschitz) sufficient statistic, such that score matching has asymptotic relative efficiency $\Omega_{\partial S}(C_{IS})$ compared to the MLE, where $\partial S$ denotes the boundary of the isoperimetric cut of $p_\theta$ and $\Omega_{\partial S}$ indicates a constant depending only on the geometry of the manifold $\partial S$.

As noted in Section 2, a large Poincaré constant implies a large isoperimetric constant, so we focus on showing that the score matching estimator is inefficient when there is a set $S$ which is a "sparse cut". Our proof uses differential geometry, so our final result will depend on standard geometric properties of the boundary $\partial S$ (in terms of how small $\gamma$ is; see Appendix E for more discussion, a proof sketch, and the full proof). We now give the formal statement.

Theorem 3 (Inefficiency of score matching in the presence of sparse cuts). There exists an absolute constant $c > 0$ so that the following is true. Suppose $S$ is a set with smooth and compact boundary $\partial S$, and suppose that $p_{\theta_1^*}$ is an element of an exponential family with sufficient statistic $F_1$ and parameterized by elements of $\Theta_1$.
Define an additional sufficient statistic $F_2 = 1_S * \psi_\gamma$, so that the enlarged exponential family contains distributions with $\theta_1 \in \Theta_1$, $\theta_2 \in \mathbb{R}$ of the form $p_{(\theta_1, \theta_2)}(x) \propto \exp(\langle \theta_1, F_1(x) \rangle + \theta_2 F_2(x))$, and consider the MLE and score matching estimators in this exponential family with ground truth $p_{(\theta_1^*, 0)}$. Suppose that $1_S$ is not an affine function of $F_1$, and so there exists some $\delta_1 > 0$ such that

$$\sup_{w_1} \mathrm{Cov}\left( \frac{\langle w_1, F_1 \rangle}{\sqrt{\mathrm{Var}(\langle w_1, F_1 \rangle)}},\ \frac{1_S}{\sqrt{\mathrm{Var}(1_S)}} \right)^2 \le 1 - \delta_1.$$

Then for all $\gamma$ sufficiently small in terms of $S$, there exists a vector $w$ such that the asymptotic relative (in)efficiency of the score matching estimator compared to the MLE for estimating $\langle w, \theta \rangle$ admits the following lower bound

$$\frac{\langle w, \Gamma_{SM} w \rangle}{\langle w, \Gamma_{MLE} w \rangle} \ge \frac{c'}{\gamma} \cdot \frac{\min\{\Pr(X \in S), \Pr(X \notin S)\}}{\int_{x \in \partial S} p(x)\,dx}$$

provided c ′ := c d 1- √ 1-δ1+2 γ x∈∂S p(x)dx Pr(X∈S)(1-Pr(X∈S)) 2 1+∥Σ F 1 ∥ OP > 0.

Remark 4. If we choose $S$ to be the set achieving the worst isoperimetric constant, then the right hand side of the bound is simply $\frac{c'}{\gamma} C_{IS}$. (See the appendix for details.) Finally, we observe that although $c'$ is exponentially small in $d$, the bound is still useful in high dimensions because in the bad cases of interest $C_{IS}$ is often exponentially large in $d$. For example, this is the case for a mixture of standard Gaussians with $\Omega(\sqrt{d})$ separation between the means (see e.g. Chen et al. (2021a)).

Remark 5. The assumption $\delta_1 > 0$ is a quantitative way of saying that the function $1_S$, the cut we are using to define the new sufficient statistic $F_2$, is not already a linear combination of the existing sufficient statistics. The assumption always holds with some $\delta_1 \ge 0$ by the Cauchy-Schwarz inequality. The equality case is when $1_S$ is an affine function of $\langle w_1, F_1 \rangle$; if such a linear dependence exists, the parameterization is degenerate and the coefficient of $F_2$ is not identifiable.

Example 2.
A concrete example in one dimension with a single sufficient statistic is $F_1(x) = -\frac{1}{8a^2}(x - a)^2 (x + a)^2 = -\frac{x^4}{8a^2} + \frac{x^2}{4} - \frac{a^2}{8}$ and $\theta = (1, 0)$, for a parameter $a > 1$ to be taken large. This looks similar to a mixture of standard Gaussians centered at $-a$ and $a$. Specializing Theorem 3 to this case, we get:

Corollary 1. There exist absolute constants $\gamma_0 > 0$ and $c > 0$ so that the following is true. Suppose that $a > 1$, $\theta = (1, 0)$, and consider the expanded exponential family $\{p_{\theta'}\}_{\theta'}$ with $p_{\theta'}(x) \propto \exp(\langle \theta', (F_1(x), F_2(x)) \rangle)$, where the new sufficient statistic $F_2$ is the output of Theorem 3 applied to $F_1$, $S = \{x : x > 0\}$, and $\gamma = \gamma_0$. Then there exists $w$ so that the relative (in)efficiency of estimating $\langle w, \theta \rangle$ is lower bounded as

$$\frac{\langle w, \Gamma_{SM} w \rangle}{\langle w, \Gamma_{MLE} w \rangle} \ge c\, e^{a^2/8}.$$
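The exponent $a^2/8$ has a simple geometric reading in the double-well example: it is the depth of the barrier separating the two modes. The sketch below is an illustration and not a derivation of Corollary 1. It checks the polynomial expansion of $F_1$ and computes the ratio of the unnormalized density $e^{F_1}$ at a mode ($x = a$) to its value at the barrier ($x = 0$); the function names are hypothetical.

```python
import math


def F1(x, a):
    """Sufficient statistic of Example 2: -(x - a)^2 (x + a)^2 / (8 a^2)."""
    return -((x - a) ** 2) * ((x + a) ** 2) / (8.0 * a * a)


def F1_expanded(x, a):
    """Expanded polynomial form -x^4/(8 a^2) + x^2/4 - a^2/8."""
    return -(x ** 4) / (8.0 * a * a) + x * x / 4.0 - a * a / 8.0


def mode_to_barrier_ratio(a):
    """Ratio of the unnormalized density exp(F1) at a mode (x = a) to the barrier (x = 0)."""
    return math.exp(F1(a, a) - F1(0.0, a))


for a in (2.0, 4.0, 6.0):
    print(a, mode_to_barrier_ratio(a), math.exp(a * a / 8.0))
```

Since $F_1(\pm a) = 0$ and $F_1(0) = -a^2/8$, the two printed columns agree: the cut at $x = 0$ carries density smaller than the modes by a factor $e^{a^2/8}$, the same exponential rate that appears in the lower bound.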

5. DISCRETE ANALOGUES: PSEUDOLIKELIHOOD, GLAUBER DYNAMICS, AND APPROXIMATE TENSORIZATION

Several authors have proposed variants of score matching for discrete probability distributions, e.g. Lyu (2009); Shao et al. (2019); Hyvärinen (2007). Furthermore, Hyvärinen (2006; 2007) pointed out some connections between pseudolikelihood methods (a classic alternative to maximum likelihood in statistics (Besag, 1975; 1977)), Glauber dynamics (a.k.a. the Gibbs sampler, see Appendix F for the definition), and score matching. Finally, just as the log-Sobolev inequality controls the rapid mixing of Langevin dynamics, there are functional inequalities (Gross, 1975; Bobkov and Tetali, 2006) which bound the mixing time of Glauber dynamics. Thus, we ask: Is there a discrete analogue of the relationship between score matching and the log-Sobolev inequality?

The answer is yes. To explain further, we need a key concept recently introduced by Marton (2013; 2015) and Caputo et al. (2015): if $(\Omega_1, \mathcal{F}_1), \ldots, (\Omega_d, \mathcal{F}_d)$ are arbitrary measure spaces, we say a distribution $q$ on $\prod_{i=1}^d \Omega_i$ satisfies approximate tensorization of entropy with constant $C_{AT}(q)$ if

$$KL(p, q) \le C_{AT}(q) \sum_{i=1}^d E_{X_{\sim i} \sim p_{\sim i}} \left[ KL\big(p(X_i \mid X_{\sim i}),\, q(X_i \mid X_{\sim i})\big) \right]. \qquad (8)$$

This inequality is sandwiched between two discrete versions of the log-Sobolev inequality (Proposition 1.1 of Caputo et al. (2015)): it is weaker than the standard discrete version of the log-Sobolev inequality (Diaconis and Saloff-Coste, 1996) and stronger than the Modified Log-Sobolev Inequality (Bobkov and Tetali, 2006), which characterizes exponential ergodicity of the Glauber dynamics. We define a restricted version $C_{AT}(q, \mathcal{P})$ analogously to the restricted log-Sobolev constant. Finally, we recall the pseudolikelihood objective (Besag, 1975) based on entrywise conditional probabilities:

$$L_p(q) := \sum_{i=1}^d E_{X \sim p} \left[ \log q(X_i \mid X_{\sim i}) \right].$$
With these definitions in place, we can show that just as the score matching objective is a relaxation of maximum likelihood through the log-Sobolev inequality, pseudolikelihood is a relaxation through approximate tensorization of entropy:

Proposition 4. We have $KL(p, q) \le C_{AT}(q)\,(L_p(p) - L_p(q))$, and more generally, for any class $\mathcal{P}$ containing $p$, we have $KL(p, q) \le C_{AT}(q, \mathcal{P})\,(L_p(p) - L_p(q))$.

Proof. Observe that $L_p(p) - L_p(q) = \sum_{i=1}^d E_{X_{\sim i} \sim p_{\sim i}} \left[ KL\big(p(X_i \mid X_{\sim i}),\, q(X_i \mid X_{\sim i})\big) \right]$, so the result follows by expanding the definition.

Remark 6. Pseudolikelihood methods (and variants like node-wise regression) are one of the dominant approaches to fitting fully-observed graphical models, e.g. (Wu et al., 2019; Lokhov et al., 2018; Klivans and Meka, 2017; Kelner et al., 2020). Like score matching, pseudolikelihood methods do not require computing normalizing constants, which can be slow or computationally hard (e.g. Sly and Sun (2012)). Pseudolikelihood is applicable in both discrete and continuous settings, as is our connection with approximate tensorization. An analogous version of Theorem 1 holds by the same argument (Theorem 5 in Appendix F) and guarantees learning in KL when $q$ satisfies approximate tensorization (e.g. under Dobrushin's uniqueness threshold (Marton, 2015)).

Remark 7. Hyvärinen (2007) proposed a version of score matching for distributions on the hypercube $\{\pm 1\}^d$ and observed that the resulting method ("ratio matching") bears similarity to pseudolikelihood. A calculation similar to the proof of Proposition 4 allows us to arrive at ratio matching based on a strengthening of approximate tensorization studied in (Marton, 2015). Our derivation seems more conceptual than the original derivation, explains the similarity to pseudolikelihood, and establishes some useful connections. For space reasons, this is included in Appendix F.2.
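Proposition 4 can be checked by brute force on a tiny state space. The sketch below is illustrative and not from the paper: it takes $q$ to be the uniform distribution on $\{\pm 1\}^2$, a product measure, for which approximate tensorization is known to hold with constant 1 (the classical tensorization of entropy), and a correlated $p$ chosen arbitrarily. The helper names are hypothetical.

```python
import math
from itertools import product


def kl(p, q):
    """KL divergence between two distributions given as dicts over the same finite set."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)


def conditional(dist, x, i):
    """Conditional probability dist(X_i = x_i | X_{~i} = x_{~i}) on {-1, +1}^d."""
    flipped = x[:i] + (-x[i],) + x[i + 1:]
    return dist[x] / (dist[x] + dist[flipped])


def pseudolikelihood(p, q):
    """L_p(q) = sum_i E_{X ~ p}[log q(X_i | X_{~i})]."""
    d = len(next(iter(p)))
    return sum(pv * math.log(conditional(q, x, i))
               for x, pv in p.items() if pv > 0 for i in range(d))


states = list(product((-1, 1), repeat=2))
q = {x: 0.25 for x in states}  # uniform product measure on {-1, +1}^2
p = {(1, 1): 0.4, (1, -1): 0.1, (-1, 1): 0.1, (-1, -1): 0.4}

kl_pq = kl(p, q)
gap = pseudolikelihood(p, p) - pseudolikelihood(p, q)
print(f"KL(p, q) = {kl_pq:.4f}, L_p(p) - L_p(q) = {gap:.4f}")
```

In this example the pseudolikelihood gap is twice the KL divergence, so the inequality $KL(p, q) \le C_{AT}(q)\,(L_p(p) - L_p(q))$ holds with room to spare at $C_{AT} = 1$. For a strongly coupled $q$ the constant would have to be larger.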

6. RELATED WORK

Score matching was originally introduced by Hyvärinen (2005), who also proved that the estimator is asymptotically consistent. In (Hyvärinen, 2007), the authors propose estimators that are defined over bounded domains. Song and Ermon (2019) scaled the techniques to neurally parameterized energy-based models, leveraging score matching variants such as denoising score matching (Vincent, 2011), which involves an annealing strategy of convolving the data distribution with Gaussians of different variances, and sliced score matching (Song et al., 2020). The authors conjectured that annealing helps with multimodality and low-dimensional manifold structure in the data distribution, and our paper can be seen as formalizing this conjecture.

The connection between the score matching objective and the relative Fisher information in (6) was previously observed in (Shao et al., 2019; Nielsen, 2021). We also remark that since $I(p \mid q) = -\frac{d}{dt} KL(p_t, q)\big|_{t=0}$ for $p_t$ the output of Langevin dynamics at time $t$, score matching can be interpreted as finding a $q$ to minimize the contraction of the Langevin dynamics for $q$ started at $p$. Previously, Lyu (2009) observed that the score matching objective can be interpreted as the infinitesimal change in KL divergence as we add Gaussian noise; see Appendix B for an explanation of why these two quantities are equal.

In the discrete setting, it was recently observed that approximate tensorization has applications to identity testing of distributions in the "coordinate oracle" query model (Blanca et al., 2022), which is another application of approximate tensorization outside of sampling, otherwise unrelated to our result.

Finally, (Block et al., 2020; Lee et al., 2022) show guarantees on running Langevin dynamics given estimates of $\nabla \log p$ that are only $\epsilon$-correct in the $L^2(p)$ sense. They show that when the Langevin dynamics are run for some moderate amount of time, the drift between the true Langevin dynamics (using $\nabla \log p$ exactly) and the noisy estimates can be bounded.

7. SIMULATIONS

Exponential family experiments. Fitting a bimodal distribution with and without a statistic approximating a cut. First, we show the result of fitting a bimodal distribution (as in Example 2) from an exponential family. In Figure 1 , the difference of the two sufficient statistics we consider corresponds to the cut statistic used in our negative result (Theorem 3). As predicted (Corollary 1) score matching performs poorly compared to the MLE as the distance between modes grows. Figure 1 : Statistical efficiency of score matching vs MLE for fitting the distribution with ground truth parameters (θ 0 , θ 1 ) = (1, 0) of the form p θ (x) ∝ e θ0(x 2 -x 4 /(2a 2 ))+θ1(x 2 -x 4 /(2a 2 )+erf(x)) as we vary the offset a between 1 and 7 and train with fixed number of samples (10 6 ). We see score matching (red) performs very poorly compared to the MLE (blue) as the offset (distance between modes) grows, by plotting the log of the Euclidean distance to the true parameter for both estimators. In Appendix G, we show that when the second sufficient statistic (which is correlated with a sparse cut in the distribution) is removed, score matching performs nearly as well as MLE. This is what our theory predicts (since the cut statistic is removed) and illustrates the use of restricted functional inequalities (in Theorem 2, the restricted Poincaré inequality explains what happens here -see the appendix). Fitting a unimodal but not smooth distribution. For space reasons this is left to Appendix Gwe demonstrate that, even if the distribution is unimodal, the performance of score matching degrades as the sufficient statistics become less smooth. Hence the dependence on smoothness in our results, e.g. Theorem 2, is really required. Fitting a mixture of Gaussians with a onelayer network. We also show that empirically, our results are robust even beyond exponential families. 
In Figure 2 we show the results of fitting a mixture of two Gaussians via score matching, where the score function is parameterized as a one hidden-layer network with tanh activations. We see that the predictions of our theory persist: the distribution is learned successfully when the two modes are close and is not when the modes are far. This matches our expectations, since the Poincaré, log-Sobolev, and isoperimetric constants blow up exponentially in the distance between the two modes (see e.g. Chen et al. (2021a)) and the neural network is capable of detecting the cut between the two modes. We discuss the interpretation of this result more in the appendix.

Figure 2: Training a single hidden-layer network to score match a mixture of Gaussians (ground truth green, score matching output blue) succeeds at learning the distribution when the modes are close (left, small isoperimetric constant), but not when they are distant (right, large isoperimetric constant), in which case it weighs the modes incorrectly.
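The qualitative trend in the exponential family experiment can be sketched without sampling, by evaluating the asymptotic covariances on a grid. The following is a minimal illustration, not the experimental code behind Figure 1: it assumes the family from the Figure 1 caption at $\theta = (1, 0)$, uses the sandwich formula from Proposition 3 for score matching and the standard inverse Fisher information $\Sigma_F^{-1}$ for the MLE, and replaces integrals by sums over a grid. The helper names `weighted_cov` and `asymptotic_covariances`, the grid size, and the use of the trace ratio as a summary are choices made only for this sketch.

```python
import math
import numpy as np

def weighted_cov(v, w):
    """Covariance of the rows of v (shape (m, n)) under probability weights w (shape (n,))."""
    centered = v - (v @ w)[:, None]
    return (centered * w) @ centered.T

def asymptotic_covariances(a, n_grid=40001):
    """Return (Gamma_MLE, Gamma_SM) at theta = (1, 0) for the two-statistic family."""
    x = np.linspace(-(2 * a + 6), 2 * a + 6, n_grid)
    f0 = x**2 - x**4 / (2 * a**2)          # first sufficient statistic
    df0 = 2 * x - 2 * x**3 / a**2
    ddf0 = 2 - 6 * x**2 / a**2
    e = np.vectorize(math.erf)(x)           # erf(x), the "cut" part of the second statistic
    de = 2 / math.sqrt(math.pi) * np.exp(-x**2)
    dde = -2 * x * de
    F = np.stack([f0, f0 + e])              # sufficient statistics, shape (2, n)
    J = np.stack([df0, df0 + de])           # derivatives (the Jacobian in d = 1)
    lap = np.stack([ddf0, ddf0 + dde])      # second derivatives (the Laplacian in d = 1)
    theta = np.array([1.0, 0.0])
    log_p = theta @ F
    w = np.exp(log_p - log_p.max())
    w /= w.sum()                            # grid approximation of p_theta
    gamma_mle = np.linalg.inv(weighted_cov(F, w))
    A = (J * w) @ J.T                       # E[(JF)(JF)^T]
    psi = J * (theta @ J) + lap             # (JF)(JF)^T theta + Laplacian of F
    A_inv = np.linalg.inv(A)
    gamma_sm = A_inv @ weighted_cov(psi, w) @ A_inv
    return gamma_mle, gamma_sm

ratios = {}
for a in (1.0, 4.0):
    g_mle, g_sm = asymptotic_covariances(a)
    ratios[a] = np.trace(g_sm) / np.trace(g_mle)
```

Under these assumptions, `ratios[4.0]` should be markedly larger than `ratios[1.0]`, in line with the degradation of score matching as the modes separate.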

8. CONCLUSION

In this paper, we initiated the study of the statistical efficiency of score matching and identified a close connection to the functional inequalities which characterize the ergodicity of Langevin dynamics. For future work, it would be interesting to characterize formally the improvements conferred by annealing strategies such as those of Song and Ermon (2019), as has been done in the setting of sampling using Langevin dynamics (Lee et al., 2018).

A FURTHER BACKGROUND

Correspondence between functional inequalities and exponential ergodicity. If $p_t$ is the distribution of the continuous-time Langevin dynamics for $q$ started from $X_0 \sim p$, then $I(p \mid q) = -\frac{d}{dt}\mathrm{KL}(p_t, q)\big|_{t=0}$, and so by integrating, $\mathrm{KL}(p_t, q) \le e^{-t/C_{LS}}\,\mathrm{KL}(p, q)$. This holding for any $p$ and $t$ is an equivalent characterization of the log-Sobolev constant (Theorem 3.20 of Van Handel (2014)). Similarly, the Poincaré inequality implies exponential ergodicity for the $\chi^2$-divergence, $\chi^2(p_t, q) \le e^{-2t/C_P}\,\chi^2(p, q)$, and this holding for every $p$ and $t$ is an equivalent characterization of the Poincaré constant (Theorem 2.18 of Van Handel (2014)).

We can equivalently view the Langevin dynamics in a functional-analytic way through its definition as a Markov semigroup, which is equivalent to the SDE definition via the Fokker-Planck equation (Van Handel, 2014; Bakry et al., 2014). From this perspective, we can write $p_t = q\,H_t\frac{p}{q}$ where $H_t$ is the Langevin semigroup for $q$, so $H_t = e^{tL}$ with generator $Lf = \langle \nabla \log q, \nabla f \rangle + \Delta f$. In this case, the Poincaré constant has a direct interpretation in terms of the inverse spectral gap of $L$, i.e. the inverse of the gap between its two largest eigenvalues.

Further remarks. A strengthened isoperimetric inequality (Bobkov inequality) upper bounds the log-Sobolev constant, see Ledoux (2000); Bobkov (1997).

Facts about the mollifier $\psi$. We will use the basic estimate $8^{-d} B_d < G_d < B_d$, where $B_d$ is the volume of the unit ball in $\mathbb{R}^d$, which follows from the fact that $e^{-1/(1-\|y\|^2)} \ge 1/4$ for $\|y\| \le 1/2$ and $e^{-1/(1-\|y\|^2)} \le 1$ everywhere. The mollifier $\psi$ is infinitely differentiable and its gradient is
$$\nabla_y \psi(y) = -\frac{2}{G_d}\, e^{-1/(1-\|y\|^2)}\, \frac{y}{(1-\|y\|^2)^2} = -\frac{2y}{(1-\|y\|^2)^2}\,\psi(y).$$
It is straightforward to check that $\sup_y \|\nabla_y \psi(y)\| < 1/G_d$.
For $\gamma > 0$, we'll also define a "sharpening" of $\psi$, namely $\psi_\gamma(y) = \gamma^{-d}\psi(y/\gamma)$, so that $\int \psi_\gamma = 1$ and (by the chain rule)
$$\nabla_y \psi_\gamma(y) = \gamma^{-d-1}(\nabla\psi)(y/\gamma) = -\frac{2y}{\gamma^2 (1-\|y/\gamma\|^2)^2}\,\psi_\gamma(y),$$
so in particular $\|\nabla_y \psi_\gamma\|_2 \le \gamma^{-d-1}/G_d$.

Reach and Condition Number of a Manifold. For a smooth submanifold $M$ of Euclidean space, the reach $\tau_M$ is the smallest radius $r$ so that every point with distance at most $r$ to the manifold $M$ has a unique nearest point on $M$ (Federer, 1959); the reach is guaranteed to be positive for compact manifolds. The reach has a few equivalent characterizations (see e.g. Niyogi et al. (2008)); a common terminology is that the condition number of a manifold is $1/\tau_M$.
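The facts about the mollifier can be checked numerically in the one-dimensional case. The sketch below is only a sanity check under stated assumptions: it takes $d = 1$ (so $B_1 = 2$), approximates integrals by Riemann sums on a fine grid, and uses an arbitrary value $\gamma = 0.3$; the helper name `bump` is introduced here for illustration.

```python
import numpy as np

def bump(y):
    """Unnormalized mollifier exp(-1 / (1 - y^2)) on (-1, 1), and zero elsewhere."""
    y = np.asarray(y, dtype=float)
    inside = np.abs(y) < 1
    s = np.where(inside, 1 - y**2, 1.0)     # placeholder 1.0 outside avoids division by zero
    return np.where(inside, np.exp(-1.0 / s), 0.0)

y = np.linspace(-1.0, 1.0, 200001)
dy = y[1] - y[0]

G1 = bump(y).sum() * dy                     # normalizing constant G_d for d = 1
B1 = 2.0                                    # volume (length) of the unit ball in R^1

# Gradient of psi = bump / G1, using the closed form -2y / (1 - y^2)^2 * psi(y).
inside = np.abs(y) < 1
s = np.where(inside, 1 - y**2, 1.0)
grad_psi = np.where(inside, -2 * y / s**2, 0.0) * bump(y) / G1
sup_grad = np.abs(grad_psi).max()

# Sharpened mollifier psi_gamma(z) = psi(z / gamma) / gamma should still integrate to 1.
gamma = 0.3
psi_gamma_mass = (bump(y / gamma) / (gamma * G1)).sum() * dy
```

Here `G1` plays the role of $G_d$, and the quantities `sup_grad` and `psi_gamma_mass` correspond to $\sup_y \|\nabla_y \psi(y)\|$ and $\int \psi_\gamma$.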

B RECOVERING LYU'S INTERPRETATION OF SCORE MATCHING

As mentioned, the connection between the score matching objective and the relative Fisher information was previously observed, for example in (Shao et al., 2019; Nielsen, 2021). We also remark that, using the fact $I(p \mid q) = -\frac{d}{dt}\mathrm{KL}(p_t, q)\big|_{t=0}$, the score matching objective has a natural interpretation in terms of selecting $q$ to minimize the contraction of the Langevin dynamics for $q$ started at $p$. On the other hand, Lyu (2009) previously observed that the score matching objective can be interpreted as the infinitesimal change in KL divergence between $p$ and $q$ as we add noise to both of them. We now explain why these two quantities are equal by giving a proof of their equality (which is shorter than the one obtained by going through the proof in Lyu (2009)).

Before giving the formal proof, we give some intuition for why the statement should be true. The Langevin dynamics approximately adds a noise of size $N(0, 2t)$ and subtracts a gradient step along $\nabla \log q$, and this dynamics preserves $q$. For small $t$, the gradient step is essentially reversible and preserves the KL. So heuristically, reversing the gradient step gives $\mathrm{KL}(p_t, q) \approx \mathrm{KL}(N(0, 2t) * p, N(0, 2t) * q)$. We now give the formal proof.

Lemma 2. Assuming smooth probability densities $p(x)$ and $q(x)$ decay sufficiently fast at infinity,
$$\frac{d}{dt}\mathrm{KL}(p_t, q)\Big|_{t=0} = \frac{d}{dt}\mathrm{KL}(p * N(0, 2t),\, q * N(0, 2t))\Big|_{t=0}$$
where $*$ denotes convolution.

Proof. All derivatives below are evaluated at $t = 0$. Recalling from Appendix A that $H_t = e^{tL}$, we have
$$\frac{d}{dt}\frac{p_t}{q} = \frac{d}{dt} H_t \frac{p}{q} = L\frac{p}{q}.$$
Since $\mathrm{KL}(p_t, q) = \mathbb{E}_q\left[\frac{p_t}{q}\log\frac{p_t}{q}\right]$ and $\frac{d}{dx}[x \log x] = \log x + 1$, it follows by the chain rule that
$$\frac{d}{dt}\mathrm{KL}(p_t, q) = \mathbb{E}_q\left[\left(\log\frac{p}{q} + 1\right) L\frac{p}{q}\right] = \mathbb{E}_q\left[\left(\log\frac{p}{q} + 1\right)\left(\left\langle \nabla\log q, \nabla\frac{p}{q}\right\rangle + \Delta\frac{p}{q}\right)\right]$$
$$= \mathbb{E}_q\left[\left(\log\frac{p}{q} + 1\right)\left(-\left\langle \nabla\log q, \nabla\frac{p}{q}\right\rangle + \frac{\Delta p}{q} - \frac{p\,\Delta q}{q^2}\right)\right]$$
where in the last step we used the quotient rule
$$\Delta\frac{p}{q} = \frac{\Delta p}{q} - 2\left\langle \nabla\log q, \nabla\frac{p}{q}\right\rangle - \frac{p\,\Delta q}{q^2}.$$
On the other hand, by using the Fokker-Planck equation $\frac{\partial}{\partial t}(p * N(0, 2t)) = \Delta p$ (Lemma 2 of Lyu (2009)) and the chain rule, we have
$$\frac{d}{dt}\mathrm{KL}(p * N(0, 2t),\, q * N(0, 2t)) = \frac{d}{dt}\int (q * N(0, 2t))\,\frac{p * N(0, 2t)}{q * N(0, 2t)}\log\frac{p * N(0, 2t)}{q * N(0, 2t)}\,dx$$
$$= \int (\Delta q)\,\frac{p}{q}\log\frac{p}{q}\,dx + \mathbb{E}_q\left[\left(\log\frac{p}{q} + 1\right)\left(\frac{\Delta p}{q} - \frac{p\,\Delta q}{q^2}\right)\right].$$
Since by the chain rule and integration by parts we have
$$\mathbb{E}_q\left[\left(\log\frac{p}{q} + 1\right)\left\langle \nabla\log q, \nabla\frac{p}{q}\right\rangle\right] = \int \left\langle \nabla q, \nabla\left[\frac{p}{q}\log\frac{p}{q}\right]\right\rangle dx = -\int (\Delta q)\,\frac{p}{q}\log\frac{p}{q}\,dx,$$
we see that the two derivatives are indeed equal.
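Lemma 2 can be sanity-checked in a case where everything is available in closed form. The sketch below assumes $q = N(0, 1)$ and $p = N(m, s^2)$ in one dimension, so that the Langevin dynamics for $q$ is an Ornstein-Uhlenbeck process with $p_t = N(m e^{-t},\, s^2 e^{-2t} + 1 - e^{-2t})$; the specific values of `m`, `s2` and the finite-difference step `h` are arbitrary illustrative choices.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for one-dimensional Gaussians with variances v1, v2."""
    return 0.5 * (v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0 - math.log(v1 / v2))

def kl_langevin(t, m, s2):
    """KL(p_t, q): p = N(m, s2) evolved for time t under the Langevin dynamics of q = N(0, 1)."""
    decay = math.exp(-2 * t)
    return kl_gauss(m * math.exp(-t), s2 * decay + 1 - decay, 0.0, 1.0)

def kl_noised(t, m, s2):
    """KL(p * N(0, 2t), q * N(0, 2t)): both distributions convolved with the same noise."""
    return kl_gauss(m, s2 + 2 * t, 0.0, 1.0 + 2 * t)

m, s2, h = 0.7, 2.5, 1e-6
d_langevin = (kl_langevin(h, m, s2) - kl_langevin(0.0, m, s2)) / h   # forward difference at t = 0
d_noised = (kl_noised(h, m, s2) - kl_noised(0.0, m, s2)) / h

# Relative Fisher information I(p | q) = E_p |grad log p - grad log q|^2 for these Gaussians.
s = math.sqrt(s2)
fisher = m**2 + (s - 1 / s) ** 2
```

Both finite-difference derivatives should agree with each other and with `-fisher` up to discretization error, matching the identity $I(p \mid q) = -\frac{d}{dt}\mathrm{KL}(p_t, q)|_{t=0}$ used in the text.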

C PROOF OF PROPOSITION 3

Proof. From Hyvärinen (2005), we have consistency of score matching (Theorem 2) and in particular the formula
$$\theta = -\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\mathbb{E}\Delta F. \quad (9)$$
We now compute the limiting distribution of the estimator as the number of samples $n \to \infty$. We will need to use some standard results from probability theory such as Slutsky's theorem and the central limit theorem, see e.g. Van der Vaart (2000) or Durrett (2019) for references. To minimize ambiguity, let $\hat{\mathbb{E}}_n$ denote the empirical expectation over $n$ i.i.d. samples and let $\hat\theta_n$ denote the score matching estimator $\hat\theta_{SM}$ from $n$ samples. Define $\delta_{n,1}$ and $\delta_{n,2}$ by the equations
$$\hat{\mathbb{E}}_n[(JF)_X (JF)_X^T] = \mathbb{E}[(JF)_X (JF)_X^T] + \delta_{n,1}/\sqrt{n} \quad\text{and}\quad \hat{\mathbb{E}}_n \Delta F = \mathbb{E}\Delta F + \delta_{n,2}/\sqrt{n}.$$
By the central limit theorem, $\delta_n = (\delta_{n,1}, \delta_{n,2})$ converges in distribution to a multivariate Gaussian (with a covariance matrix that we won't need explicitly) as $n \to \infty$. From the definition,
$$\hat\theta_n = -\hat{\mathbb{E}}_n[(JF)_X (JF)_X^T]^{-1}\,\hat{\mathbb{E}}_n\Delta F = -\left[\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\hat{\mathbb{E}}_n[(JF)_X (JF)_X^T]\right]^{-1}\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\hat{\mathbb{E}}_n\Delta F$$
and we now simplify the expression on the right hand side. By applying (9) we have
$$\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\hat{\mathbb{E}}_n\Delta F = \mathbb{E}[(JF)_X (JF)_X^T]^{-1}(\mathbb{E}\Delta F + \delta_{n,2}/\sqrt{n}) = -\theta + \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\delta_{n,2}/\sqrt{n}.$$
Since
$$\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\hat{\mathbb{E}}_n[(JF)_X (JF)_X^T] = I + \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\delta_{n,1}/\sqrt{n}$$
and $(I + X)^{-1} = I - X + X^2 - \cdots$, we have by applying Slutsky's theorem that
$$\left[\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\hat{\mathbb{E}}_n[(JF)_X (JF)_X^T]\right]^{-1} = I - \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\delta_{n,1}/\sqrt{n} + O_P(1/n)$$
where we use the standard notation $Y_n = O_P(1/n)$ to indicate that $nY_n/f(n) \to 0$ in probability for any function $f$ with $f(n) \to \infty$.

Hence
$$\hat\theta_n = -\left[I - \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\delta_{n,1}/\sqrt{n} + O_P(1/n)\right]\left(-\theta + \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\delta_{n,2}/\sqrt{n}\right)$$
and applying Slutsky's theorem again, we find
$$\sqrt{n}(\hat\theta_n - \theta) = \mathbb{E}[(JF)_X (JF)_X^T]^{-1}(-\delta_{n,1}\theta - \delta_{n,2}) + O_P(1/\sqrt{n}).$$
From the definition, we know
$$\frac{1}{\sqrt{n}}(-\delta_{n,1}\theta - \delta_{n,2}) = \hat{\mathbb{E}}_n[-(JF)_X (JF)_X^T\theta - \Delta F] - \mathbb{E}[-(JF)_X (JF)_X^T\theta - \Delta F]$$
so altogether by the central limit theorem, we have
$$\sqrt{n}(\hat\theta_n - \theta) \to N\left(0,\ \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\Sigma_{(JF)_X (JF)_X^T\theta + \Delta F}\,\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\right)$$
as claimed.
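Formula (9), with expectations replaced by empirical averages, gives the score matching estimator in closed form. The sketch below illustrates this on an example chosen only for convenience: a one-dimensional Gaussian written as an exponential family with $F(x) = (x, x^2)$, for which $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$. The function name `score_matching_expfam`, the array layout, and the sample size are assumptions of this illustration.

```python
import numpy as np

def score_matching_expfam(jac, lap):
    """Empirical version of formula (9): theta_hat = -E_n[(JF)(JF)^T]^{-1} E_n[Laplacian F].

    jac has shape (n, m, d): the Jacobian of the m sufficient statistics at each sample.
    lap has shape (n, m): the Laplacian of each sufficient statistic at each sample.
    """
    gram = np.einsum("nid,njd->ij", jac, jac) / jac.shape[0]
    return -np.linalg.solve(gram, lap.mean(axis=0))

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# F(x) = (x, x^2) in dimension d = 1: JF = (1, 2x) and Laplacian F = (0, 2).
jac = np.stack([np.ones_like(x), 2 * x], axis=1)[:, :, None]
lap = np.stack([np.zeros_like(x), 2 * np.ones_like(x)], axis=1)

theta_hat = score_matching_expfam(jac, lap)
theta_true = np.array([mu / sigma**2, -1 / (2 * sigma**2)])
```

In this example the estimator reduces to the sample mean and variance plugged into the natural parameters, so `theta_hat` should be close to `theta_true` at this sample size.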

D PROOF OF THEOREM 2

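First, we will need the following helper lemma:

Lemma 3. For any random vectors $A, B$ we have $\Sigma_{A+B} \preceq 2\Sigma_A + 2\Sigma_B$.

Proof. For any vector $w$ we have
$$\mathrm{Var}(\langle w, A + B\rangle) = \mathrm{Var}(\langle w, A\rangle) + 2\,\mathrm{Cov}(\langle w, A\rangle, \langle w, B\rangle) + \mathrm{Var}(\langle w, B\rangle)$$
$$\le \mathrm{Var}(\langle w, A\rangle) + 2\sqrt{\mathrm{Var}(\langle w, A\rangle)\,\mathrm{Var}(\langle w, B\rangle)} + \mathrm{Var}(\langle w, B\rangle) \le 2\,\mathrm{Var}(\langle w, A\rangle) + 2\,\mathrm{Var}(\langle w, B\rangle)$$
where the first inequality is Cauchy-Schwarz for variance and the second is $ab \le a^2/2 + b^2/2$. Since this holds for every vector $w$, the PSD inequality follows.

With this in mind, we can proceed to the proof of Theorem 2:

Proof of Theorem 2. Recall from Proposition 3 that
$$\Gamma_{SM} := \mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\Sigma_{(JF)_X (JF)_X^T\theta + \Delta F}\,\mathbb{E}[(JF)_X (JF)_X^T]^{-1}.$$
By Lemma 1 and submultiplicativity of the operator norm, we have
$$\left\|\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\,\Sigma_{(JF)_X (JF)_X^T\theta + \Delta F}\,\mathbb{E}[(JF)_X (JF)_X^T]^{-1}\right\|_{OP} \le C_P^2\,\|\Sigma_F^{-1}\|_{OP}^2\,\|\Sigma_{(JF)_X (JF)_X^T\theta + \Delta F}\|_{OP}.$$
It remains to bound the last operator norm on the right hand side. By Lemma 3, we have
$$\Sigma_{(JF)_X (JF)_X^T\theta + \Delta F} \preceq 2\Sigma_{(JF)_X (JF)_X^T\theta} + 2\Sigma_{\Delta F}.$$
Furthermore, we have
$$\|\Sigma_{(JF)_X (JF)_X^T\theta}\|_{OP} \le \left\|\mathbb{E}[(JF)_X (JF)_X^T\theta\theta^T (JF)_X (JF)_X^T]\right\|_{OP} \le \mathbb{E}\|(JF)_X\|_{OP}^4\,\|\theta\|^2$$
and
$$\|\Sigma_{\Delta F}\|_{OP} \le \|\mathbb{E}(\Delta F)(\Delta F)^T\|_{OP} \le \mathrm{Tr}\,\mathbb{E}(\Delta F)(\Delta F)^T \le \mathbb{E}\|\Delta F\|_2^2$$
which implies the statement of the theorem.

Supporting details for remark after Theorem 2. Since $\sqrt{n}(\theta - \hat\theta_{SM}) \to N(0, \Gamma_{SM})$ by Proposition 3, for all sufficiently large $n$ it follows from Markov's inequality that with probability at least 99%,
$$n\|\theta - \hat\theta_{SM}\|^2 = O(\mathbb{E}_{Z \sim N(0, \Gamma_{SM})}\|Z\|^2) = O(\mathrm{Tr}\,\Gamma_{SM}) = O(m\|\Gamma_{SM}\|_{OP}).$$
On the other hand, by Fatou's lemma we have that
$$\liminf_{n \to \infty} n\,\mathbb{E}\|\theta - \hat\theta_{MLE}\|^2 \ge \mathbb{E}_{Z \sim N(0, \Gamma_{MLE})}\|Z\|^2 = \mathrm{Tr}(\Gamma_{MLE}) \ge \|\Gamma_{MLE}\|_{OP}$$
where in the first expression $\hat\theta_{MLE}$ implicitly depends on $n$, the number of samples. Combining these two observations with Theorem 2 gives the inequality stated in the remark.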
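Lemma 3 is easy to check numerically, because the gap $2\Sigma_A + 2\Sigma_B - \Sigma_{A+B}$ equals $\Sigma_{A-B}$, which is a covariance matrix and hence PSD. The sketch below verifies this on synthetic correlated data; the dimensions, the way `A` and `B` are generated, and the use of sample covariances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 4

# Two correlated random vectors, one sample per row.
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))
B = 0.5 * A + rng.normal(size=(n, k))

def cov(V):
    """Sample covariance matrix of the rows of V."""
    return np.cov(V, rowvar=False)

gap = 2 * cov(A) + 2 * cov(B) - cov(A + B)
min_eig = np.linalg.eigvalsh(gap).min()      # nonnegative up to rounding if the gap is PSD
matches_difference = np.allclose(gap, cov(A - B))
```

The identity with $\Sigma_{A-B}$ holds exactly for sample covariances as well, since the sample covariance is bilinear.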

E PROOF OF THEOREM 3 AND APPLICATIONS

We restate Theorem 3 for the reader's convenience, in a slightly more explicit form in terms of the bounds on $\gamma$. Note that we use the concept of the reach $\tau_M$ of a manifold, which was defined in the preliminaries (Appendix A).

Theorem 4 (Inefficiency of score matching in the presence of sparse cuts, restatement of Theorem 3). There exists an absolute constant $c > 0$ such that the following is true. Suppose that $p_{\theta_1^*}$ is an element of an exponential family with sufficient statistic $F_1$ and parameterized by elements of $\Theta_1$. Suppose $S$ is a set with smooth and compact boundary $\partial S$. Let $\tau_{\partial S} > 0$ denote the reach of $\partial S$ (see Appendix A). Suppose that $1_S$ is not an affine function of $F_1$, so there exists $\delta_1 > 0$ such that
$$\sup_{w_1 : \mathrm{Var}(\langle w_1, F_1\rangle) = 1} \mathrm{Cov}\left(\langle w_1, F_1\rangle, \frac{1_S}{\sqrt{\mathrm{Var}(1_S)}}\right)^2 \le 1 - \delta_1.$$
Suppose that $\gamma > 0$ satisfies
$$\gamma < \min\left\{\frac{c^d}{(1 + \|\theta_1\|)\sup_{x : d(x, \partial S) \le \gamma}\|(JF_1)_x\|_{OP}},\ \frac{c\,\tau_{\partial S}}{d}\right\}$$
and is small enough so that
$$0 < \delta := 1 - \left(\sqrt{1 - \delta_1} + 2\sqrt{\frac{\gamma\int_{x \in \partial S} p(x)\,dx}{\Pr(X \in S)(1 - \Pr(X \in S))}}\right)^2.$$
Define an additional sufficient statistic $F_2 = 1_S * \psi_\gamma$, so that the enlarged exponential family contains distributions of the form
$$p_{(\theta_1, \theta_2)}(x) \propto \exp(\langle \theta_1, F_1(x)\rangle + \theta_2 F_2(x)),$$
and consider the MLE and score matching estimators in this exponential family with ground truth $p_{(\theta_1^*, 0)}$. Then the asymptotic renormalized covariance matrix $\Gamma_{MLE}$ of the MLE is bounded above as
$$\Gamma_{MLE} \preceq \frac{1}{1 - \delta}\begin{pmatrix} \Sigma_{F_1}^{-1} & 0 \\ 0 & \frac{1}{\Pr(X \in S)(1 - \Pr(X \in S))} \end{pmatrix}$$
and there exists some $w$ and corresponding asymptotic variances $\sigma^2_{SM}(w), \sigma^2_{MLE}(w)$ so that
$$\sqrt{n}\langle w, \hat\theta_{SM} - \theta\rangle \to N(0, \sigma^2_{SM}(w)), \qquad \sqrt{n}\langle w, \hat\theta_{MLE} - \theta\rangle \to N(0, \sigma^2_{MLE}(w)),$$
and the relative (in)efficiency of the score matching estimator compared to the MLE for estimating $\langle w, \theta\rangle$ admits the following lower bound:
$$\frac{\sigma^2_{SM}(w)}{\sigma^2_{MLE}(w)} \ge \frac{c'}{\gamma}\cdot\frac{\min\{\Pr(X \in S), \Pr(X \notin S)\}}{\int_{x \in \partial S} p(x)\,dx}$$
where $c' := \frac{\delta c^d}{1 + \|\Sigma_{F_1}\|_{OP}}$.

The proof will proceed in two parts: we will lower bound $\sigma^2_{SM}(w)$ and upper bound $\sigma^2_{MLE}(w)$.
The former part will proceed by proving a lower bound on the spectral norm of $\Gamma_{SM}$ (Subsection E.1) by picking a direction in which the quadratic form is large. The upper bound on $\sigma^2_{MLE}(w)$ (Subsection E.2) will proceed by relating the Fisher matrix for the augmented sufficient statistic $(F_1, F_2)$ with the Fisher matrix for the original sufficient statistic $F_1$.

Supporting details for remarks after Theorem 3. If we choose $S$ to be the worst set in the isoperimetric inequality, the term $\frac{\min\{\Pr(X \in S), \Pr(X \notin S)\}}{\int_{x \in \partial S} p(x)\,dx}$ in the bound is simply $C_{IS}$. To see this, observe that
$$\lim_{\epsilon \to 0}\frac{\int_{S_\epsilon} p(x)\,dx - \int_S p(x)\,dx}{\epsilon} = \int_{x \in \partial S} p(x)\,dx$$
as a special case of Weyl's tube formula (Weyl, 1939; Gray, 2003).

E.1 LOWER BOUNDING THE SPECTRAL NORM OF Γ SM

We recall the new statistic F 2 , defined in terms of the mollifier ψ introduced in Section 2: F 2 (x) := (1 S * ψ γ )(x) = R d 1 S (y)ψ γ (x -y)dy = S ψ γ (x -y)dy and the new sufficient statistic is F (x) = (F 1 (x), F 2 (x)). We first show the following lower bound on the largest eigenvalue of Γ SM , the renormalized limiting covariance of score matching: Lemma 4 (Largest eigenvalue of Γ SM ). The largest eigenvalue of K satisfies λ max (Γ SM ) ≥ 8 -d γ 2 Pr[d(X, ∂S) ≤ γ] E X|d(X,∂S)≤γ (∇F 2 ) T X (JF ) T X θ + ∆F 2 2 sup d(x,∂S)≤γ ∥(JF ) x ∥ 2 OP . ( ) Proof. We have ∇ x F 2 (x) = S ∇ x ψ γ (x -y)dy, ∇ 2 x F 2 (x) = S ∇ 2 x ψ γ (x -y)dy. Defining u := E[(JF ) X (JF ) T X ](0, 1) = E[(JF ) X ∇ x F 2 (x)] we have, by the variational characterization of eigenvalues of symmetric matrices, that λ max (K) ≥ ⟨u, E[(JF ) X (JF ) T X ] -1 Σ (JF ) X (JF ) T X θ+∆F E[(JF ) X (JF ) T X ] -1 u⟩ ∥u∥ 2 2 . ( ) To upper bound the denominator we observe that if B d is the volume of the unit ball, ∥∇ x F 2 (x)∥ 2 = S (∇ψ γ )(x -y)dy 2 (13) ≤ 1(d(x, ∂S) ≤ γ)γ -d-1 vol(B(X, γ))/G d (14) ≤ 8 d 1(d(x, ∂S) ≤ γ)γ -1 and so ∥u∥ 2 ≤ 8 d γ -1 Pr[d(X, ∂S) ≤ γ] sup d(x,∂S)∈[-γ,γ] ∥(JF ) x ∥ OP where we used the computation of the derivative of ψ γ . To lower bound the numerator we have ⟨u, E[(JF ) X (JF ) T X ] -1 Σ (JF ) X (JF ) T X θ+∆F E[(JF ) X (JF ) T X ] -1 u⟩ = (0, 1) T Σ (JF ) X (JF ) T X θ+∆F (0, 1) = E⟨(0, 1), (JF ) X (JF ) T X θ + ∆F ⟩ 2 = E (∇ x F 2 ) T X (JF ) T X θ + ∆F 2 2 . The integrand is zero except when d(X, ∂S) ≤ γ so it equals Pr[d(X, ∂S) ≤ γ]E X|d(X,∂S)∈[-γ,γ] (∇F 2 ) T X (JF ) T X θ + ∆F 2 2 and combining gives the result. We now estimate the right hand side of (11) for small γ, using differential geometric techniques. The main idea is that as we take γ smaller, we end up zooming into the manifold ∂S which locally looks closer and closer to being flat. Differential-geometric quantities describing the manifold appear when we make this approximation rigorous. 
The most involved term to handle ends up to be calculating the expectation E X|d(X,∂S)≤γ (∇F 2 ) T X (JF ) T X θ + ∆F 2 2 . To do this, we first argue that the term with the Laplacian dominates as γ → 0, then by Stokes theorem, we end up integrating ⟨∇ψ, dN ⟩ over intersections of S with small spheres of radius γ, where N is a normal to S. Such quantities can be calculated by comparing to the "flat" manifold case -i.e. when N does not change. How far away these quantities are (thus how small γ needs to be) depends on the curvature of S (or more precisely, the condition number of the manifold). Lemma 6 makes rigorous the statement that well-conditioned manifolds are locally flat and then Lemma 7, which is part of the proof of Weyl's tube formula (Gray, 2003; Weyl, 1939) , lets us rigorously say that the tubular neighborhood (that is, a thickening of the manifold) behaves similarly to the flat case. Lemma 5. There exists an absolute constant c > 0 such that the following is true. For any γ > 0 satisfying γ < min c d (1 + ∥θ 1 ∥) sup x:d(x,∂S)≤γ ∥(JF 1 ) x ∥ OP , c τ ∂S d for score matching on the extended family with m + 1 sufficient statistics and distribution p θ with θ = (θ 1 , 0) we have λ max (Γ SM ) ≥ c d γ ∂S p(x)dA Proof. In the denominator, we can observe by (15) that ∥(JF ) x ∥ 2 OP ≤ ∥JF 1 ∥ 2 OP + ∥∇F 2 ∥ 2 2 ≤ ∥JF 1 ∥ 2 OP + γ -2 B 2 d ≤ 2γ -2 B 2 d where the last inequality holds assuming γ is sufficiently small that ∥JF 1 ∥ 2 OP ≤ γ -2 B 2 d . 
In the numerator we can observe (∇ x F 2 ) T X (JF ) T X θ + ∆F 2 = S ⟨(∇ψ γ )((X -y)), (JF ) T X θ⟩ + (∆ψ γ )(X -y)dy = S∩B(X,γ) ⟨(∇ψ γ )((X -y)), (JF ) T X θ⟩ + (∆ψ γ )(X -y)dy = γ d B(0,1)∩(X-S)/γ ⟨(∇ψ γ )(γu), (JF ) T X θ⟩ + (∆ψ γ )(γu)du = B(0,1)∩(X-S)/γ γ -1 ⟨∇ψ(u), (JF ) T X θ⟩ + γ -2 (∆ψ)(u)du = B(0,1)∩(X-S)/γ γ -1 ⟨∇ψ(u), (JF ) T X θ⟩ + ∂(B(0,1)∩(X-S)/γ) γ -2 ⟨∇ψ, dN ⟩ = B(0,1)∩(X-S)/γ γ -1 ⟨∇ψ(u), (JF ) T X θ⟩ + B(0,1)∩(X-∂S)/γ γ -2 ⟨∇ψ, dN ⟩ where the second-to-last expression is a surface integral which we arrived at by applying the divergence theorem, using that the Laplacian is the divergence of the gradient, and in the last step we used that ψ and all of its derivatives vanish on the boundary of the unit sphere. Using that θ = (θ 1 , 0) we have B(0,1)∩(X-S)/γ γ -1 ⟨∇ψ(u), (JF ) T X θ⟩ ≤ γ -1 B(0,1) ∥∇ψ(u)∥∥(JF 1 ) X ∥ OP ∥θ∥ (16) ≤ 8 d γ -1 ∥(JF 1 ) X ∥ OP ∥θ∥. Let p be the point in ∂(X -S)/γ which is closest in Euclidean distance to the origin. Let n(q) denote the unit normal vector at point q oriented outwards (Gauss map). Note that by first-order optimality conditions for p, we must have n(p) = p/∥p∥. Since dN = n(q)dA where dA is the surface area form, we have B(0,1)∩(X-∂S)/γ ⟨∇ψ, dN ⟩ = q∈B(0,1)∩(X-∂S)/γ ⟨∇ψ(q), n(p) + (n(q) -n(p))⟩dA = q∈B(0,1)∩(X-∂S)/γ -2ψ(q) (1 -∥q∥ 2 ) 2 ⟨q, p ∥p∥ + (n(q) -n(p))⟩dA. We now show how to lower bounding the integral by showing ⟨q, p ∥p∥ + (n(q) -n(p))⟩ is lower bounded. Let c(t) be a minimal unit-speed geodesic on M := (X -∂S)/γ from p to q. Note that τ M = τ ∂S /γ so if γ is very small, M is very well-conditioned. By the fundamental theorem of calculus, we have that ⟨p, q⟩ = ⟨p, p⟩ + 1 0 ⟨p, c ′ (t)⟩dt = ⟨p, p⟩ + 1 0 ⟨Proj T c(t) p, c ′ (t)⟩dt where T c(t) is the tangent space to M at the point c(t). Hence by the Cauchy-Schwarz inequality we have |⟨p, q⟩ ≥ ⟨p, p⟩ - 1 0 ∥ Proj T c(t) p∥∥c ′ (t)∥dt. By Proposition 6.3 of Niyogi et al. 
(2008) , we have that for ϕ t the angle between the tangent spaces T p and T c(t) that cos ϕ t ≥ 1 - 1 τ M d M (p, c(t)) = 1 - t τ M d M (p, q). Since sin 2 ϕ t + cos 2 ϕ t = 1 and p is orthogonal to the tangent space at T p , it follows that ∥ Proj T c(t) p∥ ≤ ∥p∥| sin ϕ t | = ∥p∥ 1 -cos 2 ϕ t ≤ ∥p∥ (2t/τ M )d M (p, q) + (t/τ M ) 2 d M (p, q) 2 ≤ ∥p∥ (2t/τ M )d M (p, q) + ∥p∥(t/τ M )d M (p, q) hence 1 0 ∥ Proj T c(t) p∥∥c ′ (t)∥dt ≤ (2/3)∥p∥ (2/τ M )d M (p, q) 3/2 + ∥p∥(1/2τ M )d M (p, q) 2 . Since ∥p -q∥ ≤ 2, provided that τ M > 16 we have by Proposition 6.3 of Niyogi et al. (2008) that d M (p, q) ≤ τ M (1 -1 -2∥p -q∥/τ M ) ≤ 4. Combining, we have for some absolute constant C > 0 that ⟨p, q⟩ ≥ ⟨p, p⟩(1 -C 1/τ M -C/τ M ). Also, we can compute ∥n(q) -n(p)∥ = 2 -2 cos ϕ 1 ≤ 2 τ M d M (p, q) ≤ 8 τ M so |⟨q, n(q) -n(p)⟩⟩| ≤ ∥q∥∥n(q) -n(p)∥ ≤ 8 τ M . Hence provided τ M > C ′ for some absolute constant C ′ > 0 and ∥p∥ > 0.1, we have q∈B(0,1)∩(X-∂S)/γ -2ψ(q) (1 -∥q∥ 2 ) 2 ⟨q, p ∥p∥ + (n(q) -n(p))⟩dA ≥ q∈B(0,1)∩(X-∂S)/γ ψ(q) (1 -∥q∥ 2 ) 2 ∥p∥dA using that the integrand on the left is always negative. We can further lower bound the integral by considering the intersection of M with a ball of radius r := 1-∥p∥ 2 centered at p. We have q∈B(0,1)∩(X-∂S)/γ ψ(q) (1 -∥q∥ 2 ) 2 ∥p∥dA ≥ q∈B(p,r)∩M ψ(q) (1 -∥q∥ 2 ) 2 ∥p∥dA ≥ ∥p∥(cos θ) k vol(B k (p, r)) inf q∈B(p,r)∩M ψ(q) (1 -∥q∥ 2 ) 2 = ∥p∥(cos θ) k r k inf q∈B(p,r)∩M B k ψ(q) (1 -∥q∥ 2 ) 2 where k = d -1 is the dimension of M and θ = arcsin(r/2τ ) and we applied Lemma 5.3 of Niyogi et al. (2008) . If ∥p∥ ∈ (0.1, 0.9) this is lower bounded by a constant C k > 0 which is at worst exponentially small in k. Hence recalling (17) we have for any X with d(X, ∂S) ∈ (0.1γ, 0.9γ) and for γ sufficiently small so that γ8 k+1 ∥(JF 1 ) X ∥ OP ∥θ∥ < C k /4 for any such X, we have that (∇F 2 ) T X (JF ) T X θ + ∆F 2 2 ≥ γ -4 C ′ k where C ′ k > 0 is a constant that is at worst exponentially small in k. 
Therefore E X|d(X,∂S)∈[-γ,γ] (∇F 2 ) T X (JF ) T X θ + ∆F 2 2 ≥ γ -4 C ′ k Pr(d(X, ∂S) ∈ (0.1γ, 0.9γ)) Pr(d(X, ∂S) ≤ γ) . Combining these estimates, we have for some constant C ′′ k > 0 which is at worst exponentially small in k and γ sufficiently small (to satisfy the conditions above, including the requirement τ M > C ′′ ) that λ max (Γ SM ) ≥ C ′′ k Pr(d(X, ∂S) ∈ (0.1γ, 0.9γ)) Pr(d(X, ∂S) ≤ γ) 2 . ( ) Lemma 8. Suppose that F = (F 1 , F 2 ) is a random vector valued in R m+1 with F 1 valued in R m and F 2 valued in R. Suppose that F 2 is not in the affine of linear combinations of the coordinates of F 1 , i.e. for all w 1 ∈ R m there exists δ > 0 such that Cov(⟨w 1 , F 1 ⟩, F 2 ) 2 ≤ δVar(⟨w 1 , F 1 ⟩)Var(F 2 ). Then we have the lower bound Σ F ⪰ (1 -δ) Σ F1 0 0 Var(F 2 ) in the standard PSD (positive semidefinite) order. Proof. To show a lower bound on Σ F = Σ F1 Σ F1F2 Σ F2F1 Σ F2 observe that ⟨w, Σ F w⟩ = ⟨w 1 , Σ F1 w 1 ⟩ + 2w 2 ⟨w 1 , Σ F1F2 ⟩ + w 2 2 Σ F2 so under the assumption we have by the AM-GM inequality that ⟨w, Σ F w⟩ ≥ (1 -δ)[⟨w 1 , Σ F1 w 1 ⟩ + w 2 2 Σ F2 ] and hence Σ F is lower bounded in the PSD order as long as Σ F1 is and Σ F2 is. The lower bound on Var(F 2 ) is guaranteed when F 2 corresponds to a cut with large mass on both sides since the variance of F 2 is lower bounded by its variance conditioned on being away from the boundary of S.

E.3 PUTTING TOGETHER

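Finally, given Lemmas 5 and 8, we can complete the proof of Theorem 3.

Proof of Theorem 3. Define $\rho = \Pr(X \in S)$ for the purpose of this proof. Observe that by (21),
$$\mathrm{Var}(1_S - F_2) \le \mathbb{E}(1_S - F_2)^2 \le \Pr(d(X, \partial S) \le \gamma) \le 4\gamma V$$
where $V = \int_{\partial S} p(x)\,dA$. We have that
$$\mathrm{Cov}(\langle w_1, F_1\rangle, F_2) = \mathrm{Cov}(\langle w_1, F_1\rangle, 1_S) + \mathrm{Cov}(\langle w_1, F_1\rangle, F_2 - 1_S)$$
so if $w_1$ is arbitrary and normalized so that $\mathrm{Var}(\langle w_1, F_1\rangle) = 1$, then we have
$$|\mathrm{Cov}(\langle w_1, F_1\rangle, F_2)| \le \sqrt{1 - \delta_1}\sqrt{\mathrm{Var}(1_S)} + \sqrt{\mathrm{Var}(F_2 - 1_S)} \le \left(\sqrt{1 - \delta_1} + 2\sqrt{\frac{\gamma V}{\rho(1 - \rho)}}\right)\sqrt{\mathrm{Var}(1_S)}.$$
Therefore, provided $\delta > 0$, we have
$$\Sigma_F^{-1} \preceq \frac{1}{\delta}\begin{pmatrix} \Sigma_{F_1}^{-1} & 0 \\ 0 & \mathrm{Var}(F_2)^{-1} \end{pmatrix}.$$
On the other hand, by Lemma 5 we have $\lambda_{\max}(\Gamma_{SM}) \ge \frac{c^d}{\gamma V}$. Hence there exists some $w$ such that
$$\frac{\sigma^2_{SM}(w)}{\sigma^2_{MLE}(w)} \ge \frac{\delta c^d}{\max\{\|\Sigma_{F_1}^{-1}\|_{OP},\ 1/\rho(1 - \rho)\}}\cdot\frac{1}{\gamma V} \ge \frac{\delta c^d}{1 + \rho(1 - \rho)\|\Sigma_{F_1}^{-1}\|_{OP}}\cdot\frac{\rho(1 - \rho)}{\gamma V}.$$
Using that $\min\{\rho, 1 - \rho\}/2 \le \rho(1 - \rho) \le 1/4$ and dividing $c$ by two gives the result.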

E.4 MULTIMODAL EXAMPLE: PROOF OF COROLLARY 1

Proof of Corollary 1. First observe that
$$\int_{-\infty}^{\infty} e^{-F_1(x)}\,dx = 2\int_0^{\infty} e^{-(1/8)(x - a)^2 (x/a + 1)^2}\,dx \le 2\int_{-\infty}^{\infty} e^{-(1/8)(x - a)^2}\,dx = 2\int_{-\infty}^{\infty} e^{-x^2/8}\,dx =: C$$
where $C$ is a positive constant independent of $a$. Using that $F_1(x) = (1/8)(x - a)^2 (x/a + 1)^2$ and that $(x - a)^2 \le 1$ on $[a - 1, a + 1]$, it then follows that
$$\Pr(X \in [a - 1, a + 1]) = \frac{\int_{a-1}^{a+1} e^{-F_1(x)}\,dx}{\int_{-\infty}^{\infty} e^{-F_1(x)}\,dx} \ge \frac{\int_{a-1}^{a+1} e^{-(1/8)(x/a + 1)^2}\,dx}{C} \ge C' > 0$$
where $C'$ is a positive constant independent of $a$. From this, we see by the law of total variance that
$$\mathrm{Var}(F_1) \ge \mathrm{Var}(F_1 \mid X \in [a - 1, a + 1])\Pr(X \in [a - 1, a + 1]) \ge C'' > 0$$
where $C'' > 0$ is another positive constant independent of $a$. Hence $\|\Sigma_{F_1}^{-1}\|_{OP} = O(1)$ independent of $a$. Also, if we define $S = \{x : x > 0\}$ then $\mathrm{Cov}(F_1(x), 1_S) = 0$ because $F_1(x)$ is even, $1_S - 1/2$ is odd and the distribution is symmetric about zero. So we can take $\delta_1 = 1$ in the statement of Theorem 3. Therefore, applying Theorem 3 to $S$ and using that $e^{-F_1(0)} = e^{-a^2/8}$, we get for $\gamma$ smaller than an absolute constant that the inefficiency is lower bounded by $\Omega(e^{a^2/8}/\gamma)$. By taking $\gamma$ equal to a fixed constant we get the result.

In Section 7, we perform simulations which show that the performance of score matching indeed degrades exponentially as $a$ becomes large.

F DISCRETE ANALOGUES OF SCORE MATCHING

Glauber dynamics. The Glauber dynamics, or Gibbs sampler, is the standard sampler for discrete spin systems: it repeatedly selects a random coordinate $i$ and then resamples the spin $X_i$ there conditional on all of the other ones (i.e. conditional on $X_{\sim i}$). See e.g. Levin and Peres (2017). It also applies to, and has been extensively studied for, continuous systems (see e.g. Marton (2013)). Exponential ergodicity of the Glauber dynamics is equivalent to the Modified Log-Sobolev Inequality (MLSI); in most cases where the MLSI is known, approximate tensorization of entropy is also known, e.g. Chen et al. (2021b); Anari et al. (2021a); Marton (2015); Caputo et al. (2015).
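The update rule described above can be sketched in a few lines. As a concrete instance, the code below assumes an Ising model $q(x) \propto \exp(\frac{1}{2}x^T J x + h^T x)$ on $\{\pm 1\}^d$ with symmetric couplings `J`; this model, the function names `glauber_step` and `run_glauber`, and the parameter values are assumptions made only for illustration.

```python
import numpy as np

def glauber_step(x, J, h, rng):
    """Resample one uniformly random coordinate of x from its conditional law (in place)."""
    i = rng.integers(len(x))
    # Local field at site i from the other spins (any diagonal entry of J is excluded).
    field = h[i] + J[i] @ x - J[i, i] * x[i]
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))   # q(X_i = +1 | X_~i) for the Ising model
    x[i] = 1 if rng.random() < p_plus else -1
    return x

def run_glauber(J, h, n_steps, seed=0):
    """Run n_steps of Glauber dynamics from a uniformly random initial configuration."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=len(h))
    for _ in range(n_steps):
        glauber_step(x, J, h, rng)
    return x

d = 5
J = np.zeros((d, d))            # no interactions in this toy run
h = 5.0 * np.ones(d)            # strong positive external field
sample = run_glauber(J, h, n_steps=2000)
```

With a strong positive field and no interactions, each resampled spin is $+1$ with probability close to one, so the returned configuration should consist almost entirely of $+1$ entries.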

F.1 FINITE SAMPLE BOUNDS

We state explicitly the analogue of Theorem 1 for pseudolikelihood, which follows from the same proof by replacing Proposition 1 with Proposition 4.

Theorem 5. Suppose that $\mathcal{P}$ is a class of probability distributions containing $p$ and define
$$C_{AT}(\mathcal{P}, \mathcal{P}) := \sup_{q \in \mathcal{P}} C_{AT}(q, \mathcal{P}) \le \sup_{q \in \mathcal{P}} C_{AT}(q)$$
to be the worst-case (restricted) approximate tensorization constant in the class of distributions. Let
$$R_n := \mathbb{E}_{X_1, \ldots, X_n, \epsilon_1, \ldots, \epsilon_n}\sup_{q \in \mathcal{P}}\frac{1}{n}\sum_{i=1}^n \epsilon_i\left[\sum_{j=1}^d \log q((X_i)_j \mid (X_i)_{\sim j})\right]$$
be the expected Rademacher complexity of the class given $n$ samples $X_1, \ldots, X_n \sim p$ i.i.d. and independent i.i.d. Rademacher random variables $\epsilon_1, \ldots, \epsilon_n \sim \mathrm{Uni}\{\pm 1\}$. Let $\hat p$ be the pseudolikelihood estimator from $n$ samples, i.e. $\hat p = \arg\min_{q \in \mathcal{P}} \hat L_p(q)$. Then
$$\mathbb{E}\,\mathrm{KL}(p, \hat p) \le 2\,C_{AT}(\mathcal{P}, \mathcal{P})\,R_n.$$
In particular, if $C_{AT} < \infty$ then $\lim_{n \to \infty}\mathbb{E}\,\mathrm{KL}(p, \hat p) = 0$ as long as $\lim_{n \to \infty} R_n = 0$.

F.2 RATIO MATCHING AND APPROXIMATE TENSORIZATION

Marton (2015) studied a strengthened version of approximate tensorization of the form
$$\mathrm{KL}(p, q) \le C_{AT2}(q)\sum_{i=1}^d \mathbb{E}_{X_{\sim i} \sim p_{\sim i}}\mathrm{TV}^2(p(X_i \mid X_{\sim i}), q(X_i \mid X_{\sim i}))$$
where TV denotes the total variation distance (see Cover (1999)). (This is known to hold for a class of distributions $q$ satisfying a version of Dobrushin's condition and marginal bounds (Marton, 2015).) This inequality is stronger than the standard approximate tensorization because of Pinsker's inequality $\mathrm{TV}^2(P, Q) \lesssim \mathrm{KL}(P, Q)$ (Cover, 1999).
In the case of distributions on the hypercube, we have
$$\mathrm{TV}^2(p(X_i \mid X_{\sim i}), q(X_i \mid X_{\sim i})) = |p(X_i = +1 \mid X_{\sim i}) - q(X_i = +1 \mid X_{\sim i})|^2$$
$$= \mathbb{E}_{X_i \sim p_{X_i \mid X_{\sim i}}}|1(X_i = +1) - q(X_i = +1 \mid X_{\sim i})|^2 - \mathbb{E}_{X_i \sim p_{X_i \mid X_{\sim i}}}|1(X_i = +1) - p(X_i = +1 \mid X_{\sim i})|^2$$
where in the last step we used the Pythagorean theorem applied to the $p_{X_i \mid X_{\sim i}}$-orthogonal decomposition
$$1(X_i = +1) - q(X_i = +1 \mid X_{\sim i}) = [1(X_i = +1) - p(X_i = +1 \mid X_{\sim i})] + [p(X_i = +1 \mid X_{\sim i}) - q(X_i = +1 \mid X_{\sim i})].$$
Hence, there exists a constant $K_p$ not depending on $q$ such that
$$\sum_{i=1}^d \mathbb{E}_{X_{\sim i} \sim p_{\sim i}}\mathrm{TV}^2(p(X_i \mid X_{\sim i}), q(X_i \mid X_{\sim i})) = K_p + M_p(q)$$
where we define the ratio matching objective function to be
$$M_p(q) := \sum_{i=1}^d \mathbb{E}_{X \sim p}|1(X_i = +1) - q(X_i = +1 \mid X_{\sim i})|^2. \quad (24)$$
This objective is now straightforward to estimate from data, by replacing the expectation with the average over data. Analogous to before, we have the following proposition:

Proposition 5. We have $\mathrm{KL}(p, q) \le C_{AT2}(q)(M_p(q) - M_p(p))$, and more generally for any class $\mathcal{P}$ containing $p$, we have $\mathrm{KL}(p, q) \le C_{AT2}(q, \mathcal{P})(M_p(q) - M_p(p))$.

We now show how to rewrite $M_p(q)$ to match the formula from the original reference. Observe
$$M_p(q) = \frac{1}{4}\sum_{i=1}^d \mathbb{E}_{X \sim p}|X_i - \mathbb{E}_q[X_i \mid X_{\sim i}]|^2 = \frac{1}{4}\sum_{i=1}^d \mathbb{E}_{X \sim p}|1 - X_i\,\mathbb{E}_q[X_i \mid X_{\sim i}]|^2.$$
Observe that for any $z \in \{\pm 1\}$ we have
$$z\,\mathbb{E}_q[X_i \mid X_{\sim i}] = \frac{q(X_i = z \mid X_{\sim i}) - q(X_i = -z \mid X_{\sim i})}{q(X_i = z \mid X_{\sim i}) + q(X_i = -z \mid X_{\sim i})}$$
and
$$1 - z\,\mathbb{E}_q[X_i \mid X_{\sim i}] = \frac{2q(X_i = -z \mid X_{\sim i})}{q(X_i = z \mid X_{\sim i}) + q(X_i = -z \mid X_{\sim i})} = \frac{2}{1 + q(X_i = z \mid X_{\sim i})/q(X_i = -z \mid X_{\sim i})}.$$
Also, for $z \in \{\pm 1\}^d$ we have $q(X_i = z_i \mid X_{\sim i} = z_{\sim i})/q(X_i = -z_i \mid X_{\sim i} = z_{\sim i}) = q(z)/q(z_{-i})$, where $z_{-i}$ represents $z$ with coordinate $i$ flipped, so
$$M_p(q) = \sum_{i=1}^d \mathbb{E}_{X \sim p}\left(\frac{1}{1 + q(X)/q(X_{-i})}\right)^2$$
which matches the formula in Theorem 1 of Hyvärinen (2007).
Summarizing, minimizing the ratio matching objective makes the right hand side of the strengthened tensorization estimate (22) small, so when $C_{AT2}(q)$ is small it will imply successful distribution learning in KL. (The obvious variant of Theorem 5 will therefore hold.) In this way ratio matching can also be understood as a relaxation of maximum likelihood.
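The equivalence between the conditional-probability form (24) of $M_p(q)$ and the ratio form from Hyvärinen (2007) can be verified exactly on a small hypercube by enumerating all states. The sketch below assumes small Ising models for $p$ and $q$ purely as a convenient source of strictly positive distributions; the helper names `ising_probs` and `ratio_matching_two_ways` and the random parameters are hypothetical.

```python
import itertools
import numpy as np

def ising_probs(J, h):
    """All states of {-1, +1}^d and their probabilities under q(x) ∝ exp(x^T J x / 2 + h^T x)."""
    states = np.array(list(itertools.product([-1, 1], repeat=len(h))))
    logw = 0.5 * np.einsum("si,ij,sj->s", states, J, states) + states @ h
    w = np.exp(logw - logw.max())
    return states, w / w.sum()

def ratio_matching_two_ways(p, q, states):
    """Compute M_p(q) via the conditional form (24) and via the ratio form q(X) / q(X_{-i})."""
    index = {tuple(s): k for k, s in enumerate(states)}
    m_cond, m_ratio = 0.0, 0.0
    for i in range(states.shape[1]):
        flipped = states.copy()
        flipped[:, i] *= -1
        q_flip = q[[index[tuple(s)] for s in flipped]]                     # q(X_{-i})
        q_plus = np.where(states[:, i] == 1, q, q_flip) / (q + q_flip)     # q(X_i = +1 | X_~i)
        m_cond += np.sum(p * ((states[:, i] == 1) - q_plus) ** 2)
        m_ratio += np.sum(p * (1.0 / (1.0 + q / q_flip)) ** 2)
    return m_cond, m_ratio

rng = np.random.default_rng(2)
d = 3

def random_ising():
    J = rng.normal(size=(d, d))
    J = (J + J.T) / 2
    np.fill_diagonal(J, 0.0)
    return J, rng.normal(size=d)

states, p = ising_probs(*random_ising())
_, q = ising_probs(*random_ising())
m_cond, m_ratio = ratio_matching_two_ways(p, q, states)
m_self, _ = ratio_matching_two_ways(p, p, states)
```

The two forms should coincide, and `m_cond - m_self` should be nonnegative, reflecting that $M_p(q) - M_p(p)$ is a sum of expected squared total variation distances.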

G FURTHER SIMULATIONS

Complementary visualization to Figure 1. In Figure 3, we illustrate the distribution of the errors in the bimodal experiment with the cut statistic. As expected based on the theory, the direction where score matching with large offset performs very poorly corresponds to the difference between the two sufficient statistics, which encodes the sparse cut in the distribution.

Fitting a bimodal distribution without a cut statistic. In Figure 4 we show the result of fitting the same bimodal distribution using score matching, but we remove the second sufficient statistic (which is correlated with the sparse cut in the distribution). In this case, score matching fits the distribution nearly as well as the MLE. This is consistent with our theory (e.g. the failure of score matching in Theorem 3 requires that we have a sufficient statistic approximately representing the cut) and justifies some of the distinctions we made in our results: even though the Poincaré constant is very large, the asymptotic variance of score matching within the exponential family is upper bounded by the restricted Poincaré constant (see Theorem 2), which is much smaller.

Figure 4: Here we see the result of running an identical experiment to Figure 1, except that we remove the second sufficient statistic, so our distribution is now $p_\theta(x) \propto e^{\theta_0 (x^2 - x^4/(2a^2))}$ where $\theta_0 = 1$ and we again vary the offset $a$ between 1 and 7. With only the single sufficient statistic, score matching performs comparably to MLE.

Example 3 (Application of Theorem 2 to this example). To briefly expand on the last point, we show how to apply Theorem 2 in this example (Example 2, where we have not added a bad cut statistic). The restricted Poincaré constant for applying Theorem 2 will be
$$C := \frac{\mathrm{Var}(F_1(X))}{\mathbb{E}(F_1'(X))^2} = \frac{\mathrm{Var}(X^2 - X^4/2a^2)}{\mathbb{E}(2X - 2X^3/a^2)^2}$$
which asymptotically goes to a constant, rather than blowing up exponentially, as $a$ goes to infinity.
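The boundedness of this constant can be observed numerically. The sketch below evaluates $C$ by summing over a grid for the single-statistic family $p(x) \propto e^{x^2 - x^4/(2a^2)}$; the function name `restricted_poincare_constant`, the grid, and the chosen offsets are assumptions of this illustration rather than part of the original experiments.

```python
import numpy as np

def restricted_poincare_constant(a, n_grid=40001):
    """Grid approximation of C = Var(F_1(X)) / E[F_1'(X)^2] under p(x) ∝ exp(F_1(x))."""
    x = np.linspace(-(2 * a + 6), 2 * a + 6, n_grid)
    f = x**2 - x**4 / (2 * a**2)            # F_1(x)
    df = 2 * x - 2 * x**3 / a**2            # F_1'(x)
    w = np.exp(f - f.max())                 # subtract the max for numerical stability
    w /= w.sum()                            # grid approximation of the density
    mean_f = np.sum(w * f)
    var_f = np.sum(w * (f - mean_f) ** 2)
    return var_f / np.sum(w * df**2)

offsets = [2.0, 4.0, 6.0, 8.0]
constants = [restricted_poincare_constant(a) for a in offsets]
```

Under these assumptions the values in `constants` should stay of order one across the offsets, in contrast to the unrestricted Poincaré constant, which grows exponentially in $a$.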
(This can be made formal using arguments as in the proof of Corollary 1; informally, the distribution is similar to a mixture of two standard Gaussians centered at $\pm a$, so the numerator is close to
$$\mathrm{Var}_{Z \sim N(0,1)}((a + Z)^2 - (a + Z)^4/2a^2) = \mathrm{Var}(2aZ + Z^2 - (4aZ + 6Z^2 + 4Z^3/a + Z^4/a^2)/2) = \Theta(1)$$
and the denominator is approximately
$$\mathbb{E}_{Z \sim N(0,1)}(2(a + Z) - 2(a + Z)^3/a^2)^2 = \mathbb{E}(2Z - 2(3Z + 3Z^2/a + Z^3/a^2))^2 = \Theta(1).)$$
Given this bound on the restricted Poincaré constant, we can apply Theorem 2. Based on similar reasoning to above, one can show that
$$\mathbb{E}F_1'(X)^4 = (-1/4a^2)^4\,\mathbb{E}((X - a)(X + a)^2 + (X - a)^2(X + a))^4 = \Theta(1)$$
and $\mathbb{E}F_1''(X)^2 = \mathbb{E}(-3X^2/2a^2 + 1/2)^2 = \Theta(1)$, so we conclude that $\|\Gamma_{SM}\|_{OP} = O(\|\Gamma_{MLE}\|_{OP}^2)$. This proves that score matching will perform not much worse than the MLE, as we saw in the experimental result of Figure 4.

Remark 8. Example 3 shows a case where there is a large gap between the restricted and unrestricted Poincaré constants. This also implies a completely analogous gap between appropriate restricted and unrestricted log-Sobolev constants, as used e.g. in the context of Theorem 1. To elaborate, we know that the unrestricted log-Sobolev constant blows up exponentially in $a$, just like the unrestricted Poincaré constant, because $C_{LS} \ge C_P/2$ (Van Handel, 2014). On the other hand, if we fix the ground truth distribution $p_a$ and consider the class of distributions $\mathcal{P}_r = \{p_{a'} : |a - a'| \le r\}$, we have that $\lim_{r \to 0} C_{LS}(q, \mathcal{P}_r) = C/2$, where $C$ is the constant defined in (25) in terms of $a$ (and which is $O(1)$ as $a \to \infty$). This is because from the definition as an exponential family, we have
$$\frac{p_a(x)}{p_{a'}(x)} = \frac{\exp((a - a')F_1(x))}{\mathbb{E}_{a'}\exp((a - a')F_1(x))}$$
so
$$\frac{(a - a')^2\,\mathrm{Var}_{p_{a'}}(F_1(x))}{2(a - a')^2\,\mathbb{E}_{p_{a'}}\|\nabla F_1(x)\|^2} = C/2$$
where the first equality is by a standard Taylor expansion argument (see proof of Lemma 3.28 of (Van Handel, 2014)).

Fitting a unimodal but not smooth distribution.
In Figure 5, we demonstrate what happens when the distribution is unimodal (and has small isoperimetric constant), but the sufficient statistic is not quantitatively smooth. More precisely, we consider the case $p_\theta(x) \propto e^{-\theta_0 x^2/2 - \theta_1 \sin(\omega x)}$ as $\omega$ increases. In the figure, we used the formulas from asymptotic normality to calculate the distribution over parameter estimates from 100,000 samples. We also verified via simulations that the asymptotic formula almost exactly matches the actual error distribution. The result is that while the MLE can always estimate the coefficient $\theta_1$ accurately, score matching performs much worse for large values of $\omega$. This demonstrates that the dependence on smoothness in our results (in particular, Theorem 2) is actually required, rather than being an artifact of the proof. Conceptually, the reason score matching fails even though the distribution has no sparse cuts is this: the gradient of the log density becomes harder to fit as the distribution becomes less smooth (for example, the Rademacher complexity from Theorem 1 will become larger as it scales with $\nabla_x \log p$ and $\nabla_x^2 \log p$).

Figure 5: Score matching vs MLE for a distribution with a rapidly oscillating sufficient statistic, $p_\theta(x) \propto e^{-\theta_0 x^2/2 - \theta_1 \sin(\omega x)}$ where $(\theta_0, \theta_1) = (1, 1)$, and increasing $\omega$. On the top, for increasing $\omega$ we show a log-log plot of the average Euclidean distance in parameter space between $\theta$ and the output of each estimator. On the bottom, for each value of $\omega$, we draw a level set of the distribution within which a fixed fraction of returned estimates lie (MLE left, score matching right). Score matching becomes increasingly inaccurate as $\omega$ increases while the MLE stays extremely accurate.

Fitting a mixture of Gaussians with a one-layer network: further discussion. We provide some further remarks on the results in Figure 2.
In the right-hand side example (the one with large separation between modes), the shape of the two Gaussian components is learned essentially perfectly; it is only the relative weights of the two components which are wrong. This closely matches the idea behind the proof of the lower bound in Theorem 3; informally, the feedforward network can naturally represent a function which detects the cut between the two modes of the distribution, i.e. the additional bad sufficient statistic $F_2$ from Theorem 3. The fact that the shapes are almost perfectly fit where the distribution is concentrated indicates that the test loss $J_p$ is near its minimum. Recall from (1) that the suboptimality of a distribution $q$ in score matching loss is given by $J_p(q) - J_p(p) = \mathbb{E}_p \|\nabla \log p - \nabla \log q\|^2$. If we let $q$ be the distribution recovered by score matching, we see from the figure that the slopes of the distribution were correctly fit wherever $p$ is concentrated, so $\mathbb{E}_p \|\nabla \log p - \nabla \log q\|^2$ is small. However, near-optimality of the test loss $J_p(q)$ does not imply that $q$ is actually close to $p$: the test loss does not heavily depend on the behavior of $\log q$ in between the two modes, but the value of $\nabla \log q$ in between the modes affects the relative weight of the two modes of the distribution, leading to failure.

Model details: both models illustrated in the figure have 2048 tanh units and are trained via SGD on fresh samples for 300,000 steps. After training the model, the estimated distribution is computed from the learned score function using numerical integration.
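The gap between a small score matching loss and a large distributional error can be illustrated with a minimal sketch that does not involve any network. Here $p$ is an equal-weight mixture of $N(-a, 1)$ and $N(a, 1)$, and $q$ has the same components but incorrect weights. The weights 0.5 and 0.9, the separation values, the grid, and all function names are hypothetical choices made for illustration; they are not the settings of the experiment in Figure 2.

```python
import numpy as np

def log_mixture(x, w_left, a):
    """Log density of w_left * N(-a, 1) + (1 - w_left) * N(a, 1),
    together with the two weighted component log densities."""
    log_norm = -0.5 * np.log(2.0 * np.pi)
    left = np.log(w_left) + log_norm - 0.5 * (x + a) ** 2
    right = np.log1p(-w_left) + log_norm - 0.5 * (x - a) ** 2
    return np.logaddexp(left, right), left, right

def score_mixture(x, w_left, a):
    """Exact score (derivative in x of the log density) of the mixture."""
    log_p, left, right = log_mixture(x, w_left, a)
    post_left = np.exp(left - log_p)    # posterior weight of the left mode
    post_right = np.exp(right - log_p)  # posterior weight of the right mode
    return post_left * (-(x + a)) + post_right * (-(x - a))

def fisher_and_kl(a, w_p=0.5, w_q=0.9, n_grid=400_001):
    """Return (E_p |score_p - score_q|^2, KL(p || q)) on a uniform grid."""
    x = np.linspace(-a - 10.0, a + 10.0, n_grid)
    dx = x[1] - x[0]
    log_p = log_mixture(x, w_p, a)[0]
    log_q = log_mixture(x, w_q, a)[0]
    p = np.exp(log_p)
    diff = score_mixture(x, w_p, a) - score_mixture(x, w_q, a)
    fisher = np.sum(p * diff**2) * dx
    kl = np.sum(p * (log_p - log_q)) * dx
    return fisher, kl

if __name__ == "__main__":
    for a in (1.0, 3.0, 5.0):
        fisher, kl = fisher_and_kl(a)
        print(f"a = {a}: E_p|grad log p - grad log q|^2 = {fisher:.2e}, KL = {kl:.3f}")
```

In this sketch the score discrepancy shrinks rapidly as the separation grows, because the two scores differ only in the low-probability region between the modes, while the KL divergence caused by the wrong weights does not shrink.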



There are several alternative formulas for $I(p \mid q)$, see Remark 3.26 of Van Handel (2014).

We use the simplest version of Rademacher complexity bounds to illustrate our techniques. Standard literature, e.g. Shalev-Shwartz and Ben-David (2014); Bartlett et al. (2005), contains more sophisticated versions, and our techniques readily generalize.

Asymptotic normality was proved in Corollary 1 of Song et al. (2020); we reprove it because in the context of exponential families, we will show and use a much simpler expression for the limiting covariance.

We note that this experiment is similar in flavor to the plots in Figure 2 of Song and Ermon (2019), where they show that the score is estimated poorly near the low-probability regions of a mixture of Gaussians. In our plots, we numerically integrate the estimates of the score to produce the pdf of the estimated distribution.

See e.g. Vempala and Wibisono (2019) for more background and the connection to the discrete time dynamics.



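$$\begin{aligned}
\mathrm{Var}(\langle w, A + B \rangle) &= \mathrm{Var}(\langle w, A \rangle) + 2\,\mathrm{Cov}(\langle w, A \rangle, \langle w, B \rangle) + \mathrm{Var}(\langle w, B \rangle) \\
&\le \mathrm{Var}(\langle w, A \rangle) + 2\sqrt{\mathrm{Var}(\langle w, A \rangle)\,\mathrm{Var}(\langle w, B \rangle)} + \mathrm{Var}(\langle w, B \rangle) \\
&\le 2\,\mathrm{Var}(\langle w, A \rangle) + 2\,\mathrm{Var}(\langle w, B \rangle)
\end{aligned}$$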
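The two inequalities in this chain are the Cauchy-Schwarz inequality for the covariance, followed by $2\sqrt{xy} \le x + y$. A minimal numerical check on synthetic data is sketched below; the dimension, sample size, and the particular correlated vectors `A` and `B` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10_000
# Hypothetical correlated random vectors A and B (rows are samples).
A = rng.normal(size=(n, d))
B = 0.7 * A + rng.normal(size=(n, d))
w = rng.normal(size=d)

# Empirical variances of the projections onto w.
var_sum = np.var((A + B) @ w)
var_a = np.var(A @ w)
var_b = np.var(B @ w)

# Middle term of the chain (after Cauchy-Schwarz) and the final upper bound.
cauchy_schwarz = var_a + 2.0 * np.sqrt(var_a * var_b) + var_b
upper_bound = 2.0 * var_a + 2.0 * var_b

print(var_sum, cauchy_schwarz, upper_bound)
```

Since empirical variances are variances under the empirical distribution, both inequalities hold exactly for the sample quantities, up to floating point error.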

Figure 3: Level sets for the distribution over estimates in the same example as Figure 1. We see that as the distance a between modes increases, the direction of large variance for the score matching estimator (right figure) corresponds to the difference of the sufficient statistics which encodes the sparse cut in the distribution. On the other hand, the MLE (left figure) does not exhibit this behavior and has low variance in all directions.

$\lim_{a' \to a} \dfrac{\mathrm{KL}(p_a, p_{a'})}{I(p_a \mid p_{a'})} = \lim_{a' \to a}$

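In particular, if $C_{LS}(\mathcal{P}, \mathcal{P}) < \infty$ then $\lim_{n \to \infty} \mathbb{E}\,\mathrm{KL}(p, \hat{p}) = 0$ as long as $\lim_{n \to \infty} R_n = 0$.

Proof. By the standard symmetrization argument (Theorem 26.3 of Shalev-Shwartz and Ben-David (2014)) we have $\mathbb{E} J_p(\hat{p}) - J_p(p) \le 2 R_n$, so by Proposition 1 we have $\mathbb{E}\,\mathrm{KL}(p, \hat{p}) \le 2\,\mathbb{E}\, C_{LS}(\mathcal{P}) (J_p(\hat{p}) - J_p(p)) \le 4\, C_{LS}(\mathcal{P}) R_n$.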


Acknowledgements: Frederic Koehler acknowledges support by NSF award CCF-1704417, NSF award IIS-1908774, and N. Anari's Sloan Research Fellowship. Andrej Risteski acknowledges support by NSF awards IIS-2211907 and CCF-2238523, and an Amazon Research Award on "Causal + Deep Out-of-Distribution Learning".


Observe that for any points $x, y$ and $\theta = (\theta_1, 0)$ we have by the mean value theorem that

$$\frac{p_\theta(x)}{p_\theta(y)} = \exp\left(\langle \theta_1, F_1(x) - F_1(y) \rangle\right) \le \exp\left(\|\theta\| \sup_{s \in [0,1]} \|(JF_1)_{sx + (1-s)y}\|_{OP}\, \|x - y\|\right) \tag{20}$$

so the log of the density is Lipschitz. This basically reduces estimating $\Pr(d(X, \partial S) \le \gamma)$ for small $\gamma$ to understanding the volume of tubes around $\partial S$, which can be done using the same ideas as the proof of Weyl's tube formula (Weyl, 1939; Gray, 2003).

Lemma 6 (Proposition 6.1 of Niyogi et al. (2008)). Let $M$ be a smooth and compact submanifold of dimension $q$ in $\mathbb{R}^d$. At a point $p \in M$ let $B : T_p \times T_p \to T_p^\perp$ denote the second fundamental form, and for a unit normal vector $u$, let $L_u$ be the linear operator defined so that $\langle u, B(v, w) \rangle = \langle v, L_u w \rangle$ (this matches the notation from Niyogi et al. (2008)). Then

Lemma 7 (Lemma 3.14 of Gray (2003)). Let $M$ be a smooth and compact submanifold of dimension $q$ in $\mathbb{R}^d$. Let $\exp_p$ denote the exponential map from the normal bundle at $p$. The Jacobian determinant of the map

We can compute

where in the second equality we performed a change of variables and obtained the result by applying Lemma 7. We have $\det(I - tL_u) \in [(1 - t/\tau)^k, (1 + t/\tau)^k]$ and so applying (20) we find that if we define $c := \gamma \|\theta\| \sup_{x : d(x, \partial S) \le \gamma} \|(JF_1)_x\|_{OP}$, which can be made arbitrarily small by taking $\gamma$ sufficiently small, then

where

Since $\Pr(d(X, \partial S) \in (0.1\gamma, 0.9\gamma)) = \Pr(d(X, \partial S) < 0.9\gamma) - \Pr(d(X, \partial S) \le 0.1\gamma)$ and the distribution we consider has a density, by combining (21) and (19) we find that for $\gamma$ sufficiently small we have

where $C'''_k$ is at worst exponentially small in $k$.

E.2 RELATING FISHER MATRICES OF AUGMENTED AND ORIGINAL SUFFICIENT STATISTICS

Next, we show that adding the extra sufficient statistic $F_2$ has a comparatively minor effect on the efficiency of MLE. Intuitively, to be able to estimate the coefficient of $F_2$ correctly we just need: (1) the variance of $F_2$ is large, so that a nonzero coefficient of $F_2$ can be observed from samples (e.g. when $F_2$ encodes the cut $S$, the coefficient can be estimated by looking at the relative weight between $S$ and $S^C$), and (2) there is no redundancy in the sufficient statistics, e.g. $F_2 \neq F_1$, since otherwise different coefficients can encode the same distribution. The proof of this uses that the inverse covariance of the MLE has a simple explicit form (the Fisher information, which is the covariance matrix of $(F_1, F_2)$), and conditions (1) and (2) naturally appear when we use this fact.

Quantitatively, we show:

