STATISTICAL EFFICIENCY OF SCORE MATCHING: THE VIEW FROM ISOPERIMETRY

Abstract

Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood log p(x) of the training data, we instead fit the score function ∇_x log p(x), obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood, which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper and show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated, i.e. its Poincaré, log-Sobolev and isoperimetric constants: quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant (even for simple families of distributions, like exponential families with rich enough sufficient statistics), score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite-sample regime and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.

1. INTRODUCTION

Energy-based models (EBMs) are deep generative models parametrized up to a constant of proportionality, namely p(x) ∝ exp(f(x)). The primary training challenge is the fact that evaluating the likelihood (and gradients thereof) requires evaluating the partition function of the model, which is generally computationally intractable, even when using relatively sophisticated MCMC techniques. Score matching, introduced by Hyvärinen (2005) and popularized for deep generative models by Song and Ermon (2019), circumvents this difficulty by instead fitting the score function of the model, that is ∇_x log p(x). Though it is not obvious how to evaluate this loss from training samples only, Hyvärinen (2005) showed this can be done via integration by parts, and that the resulting estimator is consistent (that is, it converges to the correct value in the limit of infinite samples). Maximum likelihood is the de facto choice for model fitting owing to its well-known property of being statistically optimal in the limit where the number of samples goes to infinity (Van der Vaart, 2000). It is unclear how much worse score matching can be; thus, it is unclear how much statistical efficiency we sacrifice for the algorithmic convenience of avoiding partition functions. In the seminal paper of Song and Ermon (2019), it was conjectured that multimodality, as well as low-dimensional manifold structure, may cause difficulties for score matching. The intuition for this is natural: poor estimates of the score in "low probability" regions of the distribution can "propagate" into bad estimates of the likelihood once the score vector field is "integrated". Making this formal, however, seems challenging. We show that the right mathematical tools to formalize, and substantially generalize, such intuitions are functional-analytic tools that characterize the isoperimetric properties of the distribution in question.
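For concreteness, the Langevin dynamics that recur throughout (the standard way to draw samples from p(x) ∝ exp(f(x)) given only a gradient oracle for f, and the Markov process whose mixing time the constants below govern) can be simulated with a simple Euler discretization. This is a minimal illustrative sketch, not code from the paper; the 1-D standard Gaussian target, step size, and chain length are arbitrary choices for illustration.

```python
import numpy as np

def ula_chain(grad_log_p, x0, step=0.01, n_steps=200_000, seed=0):
    """Unadjusted Langevin algorithm: x <- x + step * grad_log_p(x) + sqrt(2*step) * noise."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.standard_normal()
        samples[t] = x
    return samples

# Target: p(x) ∝ exp(-x^2 / 2), i.e. the standard Gaussian, so grad log p(x) = -x.
samples = ula_chain(lambda x: -x, x0=3.0)
burned = samples[10_000:]  # discard burn-in
print(burned.mean(), burned.var())  # ≈ 0 and ≈ 1, up to discretization bias
```

For well-conditioned targets like this one the chain mixes quickly; for multimodal targets with well-separated modes (large isoperimetric constant), the same recursion takes exponentially long to cross between modes.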
Namely, we show that three quantities, the Poincaré, log-Sobolev and isoperimetric constants (which are all in turn very closely related, see Section 2), tightly characterize how much worse the efficiency of score matching is compared to maximum likelihood. These quantities can be (equivalently) viewed as: (1) characterizing the mixing time of Langevin dynamics, a stochastic differential equation used to sample from a distribution p(x) ∝ exp(f(x)) given access to a gradient oracle for f; (2) characterizing "sparse cuts" in the distribution: that is, sets S for which the surface area of S can be much smaller than the volume of S. Notably, multimodal distributions with well-separated, deep modes have very large log-Sobolev/Poincaré/isoperimetric constants (Gayrard et al., 2004; 2005), as do distributions supported over manifolds with negative curvature (Hsu, 2002), like hyperbolic manifolds. Since it is commonly thought that the complex, high-dimensional distributions deep generative models are trained to learn do in fact exhibit multimodal and low-dimensional manifold structure, our paper can be interpreted as showing that in many of these settings, score matching may be substantially less statistically efficient than maximum likelihood. Thus, our results can be thought of as a formal justification of the challenges for score matching conjectured in Song and Ermon (2019), as well as a vast generalization of the set of "problem cases" for score matching. This also shows that, surprisingly, the same obstructions to efficient inference (i.e. drawing samples from a trained model, which is usually done using Langevin dynamics for EBMs) are also an obstacle to efficient learning using score matching. 1

Roughly, we show the following results:

1. For a finite number of samples n, we show that if we are trying to estimate a distribution from a class with Rademacher complexity bounded by R_n, as well as log-Sobolev constant bounded by C_LS, then achieving score matching loss at most ϵ implies that we have learned a distribution that is no more than ϵ C_LS R_n away from the data distribution in KL divergence. The main tool for this is showing that the score matching objective is at most a multiplicative factor of C_LS away from the KL divergence to the data distribution.

2. In the asymptotic limit (i.e. as the number of samples n → ∞), we focus on the special case of estimating the parameters θ of a probability distribution in an exponential family {p_θ(x) ∝ exp(⟨θ, F(x)⟩)} for some sufficient statistics F using score matching. If the distribution p_θ we are estimating has Poincaré constant bounded by C_P, then the score matching estimator and the maximum likelihood estimator have asymptotic efficiencies that differ by at most a factor of C_P. Conversely, we show that if the family of sufficient statistics is sufficiently rich, and the distribution p_θ we are estimating has isoperimetric constant lower bounded by C_IS, then score matching is less efficient than the MLE estimator by at least a factor of C_IS.

3. Based on our new conceptual framework, we identify a precise analogy between score matching in the continuous setting and pseudolikelihood methods in the discrete (and continuous) setting. This connection is made by replacing the Langevin dynamics with its natural analogue, the Glauber dynamics (Gibbs sampler). We show that the approximate tensorization of entropy inequality (Marton, 2013; Caputo et al., 2015), which guarantees rapid mixing of the Glauber dynamics, allows us to obtain finite-sample bounds for learning distributions in KL via pseudolikelihood in an identical way to the log-Sobolev inequality for score matching. A variant of this connection is also made for the related ratio matching estimator of Hyvärinen (2007).

4. In Section 7, we perform several simulations which illustrate the close connection between isoperimetry and the performance of score matching. We give examples both when fitting the parameters of an exponential family and when the score function is fit using a neural network.
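For the exponential-family setting above, the score matching objective is quadratic in θ, so the estimator is the solution of a linear system: with G(x) the Jacobian of the sufficient statistics F and ΔF the coordinatewise Laplacian of F, the minimizer is θ̂ = −(Ê[G Gᵀ])⁻¹ Ê[ΔF], where Ê denotes the empirical average. The sketch below (an illustrative example of the general recipe, not code from the paper) works this out for the 1-D Gaussian family p_θ(x) ∝ exp(θ₁ x − θ₂ x²/2), i.e. F(x) = (x, −x²/2).

```python
import numpy as np

# Score matching for p_theta(x) ∝ exp(theta1 * x - theta2 * x^2 / 2),
# i.e. sufficient statistics F(x) = (x, -x^2/2). The objective
#   E[ theta^T ΔF(x) + (1/2) theta^T G(x) G(x)^T theta ]
# is quadratic in theta, so the estimator solves a linear system.
rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=100_000)  # data from N(mu, sigma^2)

G = np.stack([np.ones_like(x), -x])  # Jacobian dF/dx stacked over samples, shape (2, n)
A = G @ G.T / len(x)                 # empirical E[G G^T]
b = np.array([0.0, -1.0])            # ΔF = (0, -1), constant for this family
theta_hat = -np.linalg.solve(A, b)

# For Gaussian data the true parameters are theta1 = mu / sigma^2, theta2 = 1 / sigma^2.
print(theta_hat, [mu / sigma**2, 1 / sigma**2])
```

Note that no partition function is ever evaluated: the linear system involves only derivatives of the sufficient statistics averaged over data. The asymptotic results in item 2 compare the covariance of exactly this estimator to that of the MLE.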

2. PRELIMINARIES

Definition 1 (Score matching). Given a smooth ground truth distribution p with sufficient decay at infinity and a smooth distribution q, the score matching loss (at the population level) is defined to be

J_p(q) := (1/2) E_{X∼p}[ ∥∇ log p(X) − ∇ log q(X)∥² ] = E_{X∼p}[ Tr ∇² log q(X) + (1/2) ∥∇ log q(X)∥² ] + K_p,   (1)

where the second equality follows by integration by parts and K_p is a constant depending only on p (so both sides have the same minimizers in q).
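As a quick Monte Carlo sanity check of equation (1) (an illustration added here, not from the paper), one can evaluate both sides for a pair of Gaussians, where the scores and Tr ∇² log q are available in closed form; the particular parameters of q are arbitrary.

```python
import numpy as np

# Check the two forms of the score matching loss in equation (1)
# with p = N(0, 1) and q = N(mu, s^2).
rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)  # X ~ p = N(0, 1)
mu, s = 0.7, 1.3                    # parameters of q (arbitrary)

score_p = -x                        # ∇ log p(x)
score_q = -(x - mu) / s**2          # ∇ log q(x)
lhs = 0.5 * np.mean((score_p - score_q) ** 2)

K_p = 0.5 * np.mean(score_p**2)                    # (1/2) E_p[∥∇ log p∥²], independent of q
rhs = np.mean(-1 / s**2 + 0.5 * score_q**2) + K_p  # Tr ∇² log q = -1/s² for this q

print(lhs, rhs)  # the two forms agree up to Monte Carlo error
```

The right-hand side is the form that matters in practice: it depends on p only through samples, whereas the left-hand side requires the (unknown) ground-truth score.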



Note that there is another popular variant of score matching, called denoising score matching, in which the data distribution is convolved with a Gaussian. This variant will not be the focus of this paper.

