PITFALLS OF GAUSSIANS AS A NOISE DISTRIBUTION IN NCE

Abstract

Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easy-to-sample noise distribution q, in a manner that avoids having to calculate a partition function. It is well-known that the choice of q can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for q is a Gaussian which matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of q will be problematic in practice, suggesting that more complex and tailored noise distributions are essential to the success of NCE.

1. INTRODUCTION

Noise contrastive estimation (NCE), introduced in Gutmann & Hyvärinen (2010; 2012), is one of several popular approaches for learning probability density functions parameterized up to a constant of proportionality, i.e. p(x) ∝ exp(E_θ(x)), for some parametric family {E_θ}_θ. A recent incarnation of this paradigm is, for example, energy-based models (EBMs), which have achieved near-state-of-the-art results on many image generation tasks (Du & Mordatch, 2019; Song & Ermon, 2019). The main idea in NCE is to set up a self-supervised learning (SSL) task, in which we train a classifier to distinguish between samples from the data distribution P* and a known, easy-to-sample distribution Q, often called the "noise" or "contrast" distribution. It can be shown that for a large class of losses for the classification problem, the optimal classifier is a (simple) function of the density ratio p*/q, so an estimate for p* can be extracted from a good classifier. Moreover, this strategy can be implemented while avoiding calculation of the partition function, which is necessary when using maximum likelihood to learn p*. The noise distribution q is the most significant "hyperparameter" in NCE training, with both strong empirical (Rhodes et al., 2020) and theoretical (Liu et al., 2021) evidence that a poor choice of q can result in poor algorithmic behavior. Chehab et al. (2022) show that even the q that is optimal for a finite number of samples can have an unexpected form (e.g., it is not equal to the true data distribution p*). Since q needs to be a distribution that one can efficiently draw samples from, as well as write an expression for the probability density function of, the choices are somewhat limited. A particularly common way to pick q is as a Gaussian that matches the mean and covariance of the input data (Gutmann & Hyvärinen, 2012; Rhodes et al., 2020).
Our main contribution in this paper is to formally show that such a choice can result in an objective that is statistically poorly behaved, even for relatively simple data distributions. We show that even if p* is a product distribution and a member of a very simple exponential family, the Hessian of the NCE loss, when using a Gaussian noise distribution q with matching mean and covariance, has exponentially small (in the ambient dimension) spectral norm. As a consequence, the optimization landscape around the optimum will be exponentially flat, making gradient-based optimization challenging. As the main result of the paper, we show that the asymptotic sample efficiency of the NCE objective will be exponentially bad in the ambient dimension.

2. OVERVIEW OF RESULTS

Let P* be a distribution in a parametric family {P_θ}_{θ∈Θ}, i.e., P* = P_{θ*} for some θ* ∈ Θ, which we wish to estimate by solving a noise contrastive estimation task. To set up the task, we also need to choose a noise distribution Q, with the constraint that we can draw samples from it efficiently and evaluate its probability density function efficiently. We will use p_θ, p*, q to denote the probability density functions (pdfs) of P_θ, P*, and Q. For a data distribution P* and noise distribution Q, the NCE loss of a distribution P_θ is defined as follows:

Definition 1 (NCE Loss). The NCE loss of P_θ w.r.t. data distribution P* and noise Q is

L(P_θ) = -(1/2) E_{P*}[log (p_θ / (p_θ + q))] - (1/2) E_Q[log (q / (p_θ + q))].   (1)

Moreover, the empirical version of the NCE loss, given i.i.d. samples (x_1, ..., x_n) ∼ P*^n and (y_1, ..., y_n) ∼ Q^n, is

L_n(θ) = (1/n) Σ_{i=1}^n -(1/2) log (p_θ(x_i) / (p_θ(x_i) + q(x_i))) + (1/n) Σ_{i=1}^n -(1/2) log (q(y_i) / (p_θ(y_i) + q(y_i))).   (2)

By a slight abuse of notation, we will use L(θ), L(p_θ), and L(P_θ) interchangeably. The NCE loss can be interpreted as the binary cross-entropy loss for the classification task of distinguishing the data samples from the noise samples. To avoid calculating the partition function, one treats it as an additional parameter: namely, we consider an augmented vector of parameters θ = (θ, c) and let p_θ(x) = exp(E_θ(x) - c). The crucial property of the NCE loss is that it has a unique minimizer:

Lemma 2 (Gutmann & Hyvärinen 2012). The NCE objective in Definition 1 is uniquely minimized at θ = θ* and c = log(∫ exp(E_{θ*}(x)) dx), provided that the support of Q contains that of P*.

We will focus on the Hessian of the loss L, as the crucial object governing both the algorithmic and statistical difficulty of the resulting objective. We will show the following two main results:

Theorem 3 (Exponentially flat Hessian).
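To make Definition 1 concrete, eq. (2) can be evaluated numerically via the softplus identity -log(p/(p+q)) = softplus(log q - log p), which is how the loss is typically implemented in practice to avoid overflow. The following sketch (ours, for illustration; the function names and the 1-D toy family are our own) evaluates L_n for a generic unnormalized model, with the log-partition c treated as a free parameter as described above:

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), computed stably
    return np.logaddexp(0.0, z)

def nce_loss(log_p, log_q, x, y):
    # empirical NCE loss, eq. (2): x ~ data P*, y ~ noise Q;
    # uses -log(p/(p+q)) = softplus(log q - log p)
    data_term = softplus(log_q(x) - log_p(x)).mean()
    noise_term = softplus(log_p(y) - log_q(y)).mean()
    return 0.5 * (data_term + noise_term)

# toy check in 1D: P* = N(0,1), Q = N(0,4); unnormalized model
# p_theta(x) = exp(theta * x^2 - c), with the log-partition c as a free parameter
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
y = rng.normal(0.0, 2.0, 100_000)
log_q = lambda t: -t**2 / 8 - 0.5 * np.log(8 * np.pi)
log_p_model = lambda theta, c: (lambda t: theta * t**2 - c)

c_star = 0.5 * np.log(2 * np.pi)   # true log-partition for theta = -1/2
loss_true = nce_loss(log_p_model(-0.5, c_star), log_q, x, y)
loss_wrong = nce_loss(log_p_model(-0.5, c_star + 1.0), log_q, x, y)
assert loss_true < loss_wrong      # Lemma 2: the loss is minimized at the true parameters
```

The final comparison illustrates Lemma 2: perturbing the partition-function parameter away from its true value increases the loss.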
For d > 0 large enough, there exists a distribution P* = P_{θ*} over R^d such that:
• E_{P*}[x] = 0 and E_{P*}[xx^⊤] = I_d;
• P* is a product distribution, namely p*(x_1, x_2, ..., x_d) = Π_{i=1}^d p*(x_i);
• the NCE loss with noise distribution q = N(0, I_d) satisfies ∥∇²L(θ*)∥₂ ≤ exp(-Ω(d)).

We remark that the above example of a problematic distribution P* is extremely simple: P* is a product distribution with mean 0 and identity covariance. It is also log-concave, which is typically thought of as an "easy" class of distributions to learn, since log-concave distributions are unimodal. The fact that the Hessian is exponentially flat near the optimum means that gradient-descent based optimization without additional tricks (e.g., gradient normalization, or second-order methods like Newton's method) will fail. (See, e.g., Theorems 4.1 and 4.2 in Liu et al. (2021).) For us, this will be merely an intermediate result. We will address a more fundamental issue: the sample complexity of NCE, which is independent of the optimization algorithm used. Namely, we will show that without a large number of samples, the minimizer of the empirical NCE loss might not be close to the target distribution. Proving this will require the development of some technical machinery. More precisely, we use the result above to show that the asymptotic statistical complexity, for the above choice of P*, Q, is exponentially bad in the dimension. This substantially clarifies results in Gutmann & Hyvärinen (2012), who provide an expression for the asymptotic statistical complexity in terms of P*, Q (Theorem 3, Gutmann & Hyvärinen (2012)), but from which it is very difficult to glean quantitatively how bad the dependence on dimension can be for a particular choice of P*, Q.
Unlike the landscape issues that Liu et al. (2021) point out, the statistical issues are impossible to fix with a better optimization algorithm: they are fundamental limitations of the NCE loss.

Theorem 4 (Asymptotic Statistical Complexity). Let d > 0 be sufficiently large and Q = N(0, I_d). Let θ̂_n be the optimizer of the empirical NCE loss L_n(θ) with data distribution P* given by Theorem 3 above and noise distribution Q. Then, as n → ∞, the mean-squared error satisfies

E∥θ̂_n - θ*∥₂² = exp(Ω(d)) / n.

3. EXPONENTIALLY FLAT HESSIAN: PROOF OF THEOREM 3

The proof of Theorem 3 consists of three ingredients. First, in Section 3.1, we compute an algebraically convenient upper bound on the spectral norm of the Hessian of the loss (eq. (1)). We restrict our attention to the case where {P_θ} belongs to an exponential family. The upper bound will be in terms of the total variation distance TV(P*, Q) and the Fisher information matrix of the sufficient statistics at θ*. Here, P* denotes the true data distribution and Q denotes the noise distribution. Then, in Section 3.2, we construct a distribution P* for which the TV distance between P* and Q is large. We do this by "tensorizing" a univariate distribution: namely, we construct a univariate distribution with mean 0 and variance 1 that is at a constant TV distance from a standard univariate Gaussian. Then, we use the fact that the Hellinger distance tensorizes, along with the relationship between TV and Hellinger distance, to show that TV(P*, Q) ≥ 1 - δ^d for some constant δ < 1. (See Wasserman (2020) for a detailed review of distance measures.) Section 3.3 bounds the Fisher information matrix term, completing all the components required to establish Theorem 3.

3.1. BOUNDING THE HESSIAN IN TERMS OF TV DISTANCE

Suppose {P_θ} is an exponential family of distributions, that is, p_θ(x) = exp(θ^⊤T(x)), where T(x) is a known function. Then, a straightforward calculation (see, e.g., Appendix A in Liu et al. (2021)) shows that the gradient and the Hessian of the NCE loss (eq. (1)) with respect to θ have the following forms:

∇_θ p_θ(x) = p_θ(x) T(x),
∇_θ L(p_θ) = (1/2) ∫ (q / (p_θ + q)) (p_θ - p*) T(x) dx,
∇²_θ L(p_θ) = (1/2) ∫ ((p* + q) p_θ q / (p_θ + q)²) T(x)T(x)^⊤ dx.

For θ = θ* and p_θ = p*, we have

∇²_θ L(p_{θ*}) = (1/2) ∫ (p* q / (p* + q)) T(x)T(x)^⊤ dx ⪯ (1/2) ∫ min(p*, q) T(x)T(x)^⊤ dx.   (6)

The inequality holds since p*q/(p* + q) = min(p*, q) · max(p*, q)/(p* + q) ≤ min(p*, q). Applying the matrix version of the Cauchy-Schwarz inequality (Lemma 9, Appendix A) to eq. (6) with the two parts min(p*(x), q(x))/√p*(x) and T(x)T(x)^⊤ √p*(x), we obtain

∥∇²_θ L(P*)∥₂ ≤ ∥∇²_θ L(P*)∥_F ≤ (1/2) (∫ min(p*, q)²/p* dx)^{1/2} (∫ ∥T(x)T(x)^⊤∥_F² p*(x) dx)^{1/2}
≤ (1/2) (∫ min(p*, q) dx)^{1/2} (∫ ∥T(x)T(x)^⊤∥_F² p*(x) dx)^{1/2}

=⇒ ∥∇²_θ L(P*)∥₂ ≤ (1/2) (1 - TV(P*, Q))^{1/2} (∫ ∥T(x)T(x)^⊤∥_F² p*(x) dx)^{1/2},   (7)

where the second inequality uses min(p*, q)² ≤ min(p*, q) · p*, and the last step uses ∫ min(p*, q) dx = 1 - TV(P*, Q). We bound the two terms in the product above separately. The first term is small when P* and Q are significantly different. The second term is an upper bound on the Frobenius norm of the Fisher information matrix at P*. We will construct P* such that the first term dominates, giving us the required upper bound.
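The gradient and Hessian formulas above can be sanity-checked numerically. The following sketch (ours; a 1-D toy instance, with T(x) = (x², 1), P* = N(0,1), Q = N(0,4), all chosen by us for illustration) compares the closed form (1/2)∫ p*q/(p*+q) TT^⊤ dx at θ* against central second differences of the quadrature-evaluated loss L(θ):

```python
import numpy as np

# quadrature grid and a 1-D exponential family with T(x) = (x^2, 1)
grid = np.linspace(-12.0, 12.0, 40_001)
dx = grid[1] - grid[0]
T = np.stack([grid**2, np.ones_like(grid)])             # sufficient statistics, shape (2, N)
p_star = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)      # data density N(0, 1)
q = np.exp(-grid**2 / 8) / np.sqrt(8 * np.pi)           # noise density N(0, 4)

def L(theta):
    # population NCE loss, eq. (1), by quadrature; p_theta(x) = exp(theta . T(x))
    p_theta = np.exp(theta @ T)
    val = (-0.5 * p_star * np.log(p_theta / (p_theta + q))
           - 0.5 * q * np.log(q / (p_theta + q)))
    return val.sum() * dx

theta_star = np.array([-0.5, -0.5 * np.log(2 * np.pi)])   # p_theta* = N(0, 1)

# closed form: Hessian at theta* is (1/2) \int p* q / (p* + q) T T^T dx
w = p_star * q / (p_star + q)
H_formula = 0.5 * (T * w) @ T.T * dx

# central second differences of L around theta*
h, H_fd, I2 = 1e-3, np.zeros((2, 2)), np.eye(2)
for i in range(2):
    for j in range(2):
        H_fd[i, j] = (L(theta_star + h * (I2[i] + I2[j])) - L(theta_star + h * (I2[i] - I2[j]))
                      - L(theta_star - h * (I2[i] - I2[j])) + L(theta_star - h * (I2[i] + I2[j]))) / (4 * h * h)

assert np.allclose(H_formula, H_fd, rtol=1e-3, atol=1e-7)
```

The two Hessian estimates agree, confirming the closed form (6) used throughout the proof.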

3.2. CONSTRUCTING THE HARD DISTRIBUTION P *

The hard distribution P* over R^d will have the property that E_{P*}[x] = 0 and E_{P*}[xx^⊤] = I_d, but will still be at large TV distance from the standard Gaussian Q = N(0, I_d). This distribution will simply be a product distribution; the following lemma formalizes our main trick of tensorization for constructing a distribution at large TV distance from the Gaussian.

Lemma 5. Let d > 0 be given, and let Q = N(0, I_d) be the standard Gaussian in R^d. Then, for some δ < 1, there exists a log-concave distribution P (also over R^d) with mean 0 and covariance I_d satisfying TV(P, Q) ≥ 1 - δ^d.

Proof. Let Q̃ denote the standard normal distribution over R. Let P̃ be any other distribution over R with mean 0 and variance 1 that satisfies ρ(P̃, Q̃) = δ < 1, where ρ(P̃, Q̃) = ∫ √(p̃ q̃) dx is the Bhattacharyya coefficient. Since ρ tensorizes (Wasserman, 2020), we have ρ(P̃^d, Q̃^d) = ρ(P̃, Q̃)^d for any d > 1. The squared Hellinger distance between P̃^d and Q̃^d is then

H²(P̃^d, Q̃^d) := 2(1 - ∫ √(p̃^d q̃^d) dx) = 2(1 - ρ(P̃, Q̃)^d).

Further, since (1/2) H²(P̃^d, Q̃^d) ≤ TV(P̃^d, Q̃^d), we get 1 - ρ(P̃, Q̃)^d ≤ TV(P̃^d, Q̃^d), i.e., 1 - δ^d ≤ TV(P̃^d, Q̃^d). Setting P = P̃^d and noting that Q̃^d = Q = N(0, I_d), we have TV(P, Q) ≥ 1 - δ^d. Finally, if the chosen P̃ is log-concave, then so is P̃^d, since a product of log-concave densities is log-concave, which completes the proof.

We now explicitly define the distribution P* that we will work with for the rest of the paper.

Definition 6. Consider the exponential family {p_θ(x) = exp(θ^⊤T(x))}_{θ∈R^{d+1}} given by the sufficient statistics T(x) = (x_1⁴, ..., x_d⁴, 1). Let P* = P̃^d, where P̃ is the distribution on R with density p̃(x) = (1/C) exp(-x⁴/σ⁴). We will set the constant of proportionality C and the scale σ appropriately to ensure that P̃ has mean 0 and variance 1. Note that P* = P_{θ*} for θ* = (-1/σ⁴, ..., -1/σ⁴, -d log C). Since (d²/dx²) log p̃ = -12x²/σ⁴ ≤ 0, p̃ is log-concave.
Further, symmetry of p̃ around the origin gives E[P̃] = 0, and the choice of σ ensures that Var[P̃] = 1. The normalizing constant C satisfies

C = ∫_{-∞}^{∞} e^{-x⁴/σ⁴} dx = 2 ∫_0^{∞} e^{-x⁴/σ⁴} dx.

Substituting t = x⁴/σ⁴, so that dx = (σ/4) t^{-3/4} dt, gives

C = (σ/2) ∫_0^{∞} t^{-3/4} e^{-t} dt = (σ/2) Γ(1/4) = 2σ Γ(5/4),

where Γ(z) = ∫_0^{∞} x^{z-1} e^{-x} dx is the gamma function. The variance is given by

Var[P̃] = (1/C) ∫_{-∞}^{∞} x² e^{-x⁴/σ⁴} dx = (2/C) ∫_0^{∞} x² e^{-x⁴/σ⁴} dx.

The same substitution as above gives

Var[P̃] = (σ³/2C) ∫_0^{∞} t^{-1/4} e^{-t} dt = (σ³/2C) Γ(3/4) = (σ²/4) Γ(3/4)/Γ(5/4).

Thus, setting σ² = 4Γ(5/4)/Γ(3/4) results in Var[P̃] = 1. Correspondingly, we have C = 4Γ(5/4)^{3/2}/√Γ(3/4). For this choice of P̃, the Bhattacharyya coefficient ρ(P̃, Q̃) is

ρ(P̃, Q̃) = ∫_{-∞}^{∞} √(p̃(x) q̃(x)) dx = (1/√(C√(2π))) ∫_{-∞}^{∞} exp(-x²/4 - x⁴/(2σ⁴)) dx ≈ 0.9905 ≤ 0.991 < 1.

Thus, in the proof of Lemma 5 we can use this choice of P̃, and for δ = 0.991 and P* = P̃^d we get TV(P*, Q) ≥ 1 - δ^d, as required.
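These constants can be verified by direct numerical quadrature. The sketch below (ours) computes σ and C from the Gamma-function formulas, checks that p̃ integrates to one with mean 0 and variance 1, and recomputes the Bhattacharyya coefficient ρ(P̃, Q̃) ≈ 0.9905:

```python
import numpy as np
from math import gamma, pi, sqrt

sigma = sqrt(4 * gamma(1.25) / gamma(0.75))   # sigma^2 = 4*Gamma(5/4)/Gamma(3/4)
C = 2 * sigma * gamma(1.25)                   # C = 2*sigma*Gamma(5/4)

x = np.linspace(-10, 10, 200_001); dx = x[1] - x[0]
p = np.exp(-x**4 / sigma**4) / C              # density of P~
q = np.exp(-x**2 / 2) / np.sqrt(2 * pi)       # standard Gaussian density of Q~

mass = (p * dx).sum()
mean = (x * p * dx).sum()
var = (x**2 * p * dx).sum()
rho = (np.sqrt(p * q) * dx).sum()             # Bhattacharyya coefficient

assert abs(mass - 1) < 1e-6 and abs(mean) < 1e-9 and abs(var - 1) < 1e-6
assert abs(rho - 0.9905) < 2e-3 and rho < 0.9915
```

The computed ρ matches the value quoted above, confirming that δ = 0.991 is a valid choice in Lemma 5.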

3.3. BOUNDING THE FISHER INFORMATION MATRIX

In this subsection, we bound the second factor in eq. (7), which is an upper bound on the Frobenius norm of the Fisher information matrix at θ*.

Lemma 7. For some constant M > 0, we have ∫ ∥T(x)T(x)^⊤∥_F² p*(x) dx ≤ d²M.

Proof. Recall that T(x) = (x_1⁴, ..., x_d⁴, 1). Then

∥T(x)T(x)^⊤∥_F² = Σ_i x_i^{16} + Σ_{i≠j} x_i⁸ x_j⁸ + 2 Σ_i x_i⁸ + 1.

Therefore, by linearity of expectation, and using the fact that P* is a product distribution,

∫ ∥T(x)T(x)^⊤∥_F² p*(x) dx = d · E_{P̃}[x^{16}] + d(d-1) · (E_{P̃}[x⁸])² + 2d · E_{P̃}[x⁸] + 1 ≤ d²M,

for an appropriate choice of constant M. Such a constant exists since all the expectations above are finite, the exponential factor e^{-x⁴/σ⁴} in p̃ dominating the polynomials in the integrals.
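The moments in Lemma 7 are easy to evaluate; by the same Gamma-function substitution as above, one can check that E[x^{4k}] = σ^{4k} Γ(k + 1/4)/(4Γ(5/4)) under p̃ (a closed form we derived ourselves, not stated in the paper). The sketch below (ours) computes the moments by quadrature and exhibits an explicit valid constant M:

```python
import numpy as np
from math import gamma, sqrt

sigma = sqrt(4 * gamma(1.25) / gamma(0.75))
C = 2 * sigma * gamma(1.25)

x = np.linspace(-12, 12, 400_001); dx = x[1] - x[0]
p = np.exp(-x**4 / sigma**4) / C

m4, m8, m16 = ((x**k * p * dx).sum() for k in (4, 8, 16))
M = m16 + m8**2 + 2 * m8 + 1     # valid since d, d(d-1), 2d and 1 are each at most d^2 (d >= 2)

# closed forms via the Gamma substitution: E[x^{4k}] = sigma^{4k} Gamma(k+1/4) / (4 Gamma(5/4))
assert abs(m4 - sigma**4 / 4) < 1e-5
assert abs(m8 - 5 * sigma**8 / 16) < 1e-4
assert np.isfinite(m16) and M < 1e5
```

In particular M is a dimension-independent constant, as the lemma requires.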

3.4. PUTTING THINGS TOGETHER

For P* defined as above and Q = N(0, I_d), Lemma 5 ensures that 1 - TV(P*, Q) ≤ δ^d for δ = 0.991. From Lemma 7, we have ∫ ∥T(x)T(x)^⊤∥_F² p*(x) dx ≤ d²M. Substituting these bounds into eq. (7), we get

∥∇²_θ L(P*)∥₂ ≤ (1/2) δ^{d/2} d √M = exp(-Ω(d)).

By construction, p* is a product distribution with E_{p*}[x] = 0 and E_{p*}[xx^⊤] = I_d, which completes the proof of the theorem.
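The key quantity driving this bound, 1 - TV(P*, Q) = ∫ min(p*, q) dx ≤ ρ(P̃, Q̃)^d (using min(a,b) ≤ √(ab)), can be estimated by Monte Carlo as E_{x∼P*}[min(1, q(x)/p*(x))]. The sketch below (ours) uses an exact sampler for p̃ that we constructed ourselves: |x| = σ·G^{1/4} with G ∼ Gamma(1/4, 1) and a uniform random sign, which one can check recovers the density (1/C)e^{-x⁴/σ⁴}:

```python
import numpy as np
from math import gamma, log, pi, sqrt

rng = np.random.default_rng(0)
sigma = sqrt(4 * gamma(1.25) / gamma(0.75))   # so that Var[P~] = 1
logC = log(2 * sigma * gamma(1.25))           # log normalizing constant of p~

def sample_p_star(n, d):
    # exact sampler for p~(x) ∝ exp(-x^4/sigma^4): |x| = sigma * G^{1/4}, G ~ Gamma(1/4, 1),
    # with a uniform random sign
    g = rng.gamma(0.25, 1.0, size=(n, d))
    sign = np.where(rng.random((n, d)) < 0.5, -1.0, 1.0)
    return sign * sigma * g**0.25

def overlap(d, n=40_000):
    # Monte Carlo estimate of \int min(p*, q) dx = E_{x~P*}[min(1, q(x)/p*(x))] = 1 - TV(P*, Q)
    x = sample_p_star(n, d)
    log_p = (-x**4 / sigma**4).sum(1) - d * logC
    log_q = (-x**2 / 2).sum(1) - 0.5 * d * log(2 * pi)
    return np.minimum(1.0, np.exp(log_q - log_p)).mean()

dims = (100, 400, 1600)
vals = [overlap(d) for d in dims]
assert vals[0] > vals[1] > vals[2]                              # overlap decays with dimension
assert all(v <= 0.991**d + 0.01 for v, d in zip(vals, dims))    # <= rho^d, up to MC error
```

The overlap shrinks geometrically with d, which is exactly the mechanism behind the exponentially flat Hessian.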

4. PROOF OF THEOREM 4

We will bound the error of the optimizer θ̂_n of the empirical NCE loss (eq. (2)) using the bias-variance decomposition of the MSE. To do this, we will reason about the random variable √n(θ̂_n - θ*); let Σ be its covariance matrix. Since θ̂_n is an (asymptotically) unbiased estimate of θ*, the MSE decomposes as E∥θ̂_n - θ*∥₂² = (1/n) Tr(Σ). The proof of Theorem 4 proceeds as follows. In Section 4.1, we show that the random variable √n(θ̂_n - θ*) is asymptotically normal with mean 0 and covariance matrix Σ given by

Σ = ∇²_θ L(θ*)^{-1} Var[√n ∇_θ L_n(θ*)] ∇²_θ L(θ*)^{-1}.   (12)

We prove that the Hessian ∇²_θ L(θ*) is invertible in Appendix C, so that the above expression is well-defined. Since Σ ⪰ 0 (it is a covariance matrix), to get a lower bound on Tr(Σ), it suffices to lower bound the largest eigenvalue of Σ. Looking at the factors on the right-hand side of eq. (12), we note first that Theorem 3 ensures an exponential lower bound on all eigenvalues of ∇²_θ L(θ*)^{-1}. The bulk of the proof towards lower bounding the largest eigenvalue of Σ consists of lower bounding Var[v^⊤ · √n ∇_θ L_n(θ*)], the directional variance of √n ∇_θ L_n(θ*) along a suitably chosen direction v, in terms of v^⊤ ∇²_θ L(θ*) v. In Section 4.2 and Section 4.3, we use anti-concentration bounds to prove such variance lower bounds.

4.1. GAUSSIAN LIMIT OF √n(θ̂_n - θ*)

To begin, we show that √n(θ̂_n - θ*) behaves as a Gaussian random variable as n → ∞. Recall the empirical NCE loss from eq. (2):

L_n(θ) = (1/n) Σ_{i=1}^n -(1/2) log (p_θ(x_i)/(p_θ(x_i) + q(x_i))) + (1/n) Σ_{i=1}^n -(1/2) log (q(y_i)/(p_θ(y_i) + q(y_i))),

where x_i ∼ P* and y_i ∼ Q are i.i.d. Let θ̂_n be the optimizer of L_n. Then, by the Taylor expansion of ∇_θ L_n around θ*, we have

√n(θ̂_n - θ*) = -∇²_θ L_n(θ*)^{-1} · √n ∇_θ L_n(θ*) - √n · O(∥θ̂_n - θ*∥²),   (13)

following Gutmann & Hyvärinen (2012), who also show in their Theorem 2 that θ̂_n is a consistent estimator of θ*; hence, as n → ∞, ∥θ̂_n - θ*∥₂ → 0. Gutmann & Hyvärinen (2012, Lemma 12) also assert¹ that the Hessian of the empirical NCE loss (eq. (2)) at θ* converges in probability to the Hessian of the true NCE loss (Definition 1) at θ*, i.e., ∇²_θ L_n(θ*)^{-1} →_P ∇²_θ L(θ*)^{-1}. On the other hand, by the Central Limit Theorem, √n ∇_θ L_n(θ*) converges to a Gaussian with mean E[√n ∇_θ L_n(θ*)] = √n ∇_θ L(θ*) = 0 and covariance Var[√n ∇_θ L_n(θ*)]. With these considerations, we conclude that the random variable √n(θ̂_n - θ*) in eq. (13) is asymptotically Gaussian with mean 0 and covariance Σ = ∇²_θ L(θ*)^{-1} Var[√n ∇_θ L_n(θ*)] ∇²_θ L(θ*)^{-1}, as defined in eq. (12). Next, we introduce some quantities which will be useful in the subsequent calculations. As we already have a handle on the spectrum of ∇²_θ L(θ*) from Theorem 3, the main object of our focus in eq. (12) is the term Var[√n ∇_θ L_n(θ*)]. In particular, since we are concerned with the directional variance of Σ, we will reason about Var[v^⊤ · √n ∇_θ L_n(θ*)] for the fixed vector of ones, v = 1_{d+1}. This vector has the property that v^⊤T(x) ≥ 1 for all x, as all non-constant coordinates of T are non-negative, and the remaining coordinate is 1.
Note that

∇_θ L_n(θ*) = -(1/2n) Σ_{i=1}^n (q(x_i) T(x_i))/(p*(x_i) + q(x_i)) + (1/2n) Σ_{i=1}^n (p*(y_i) T(y_i))/(p*(y_i) + q(y_i)),

where x_i ∼ P* and y_i ∼ Q. Writing out the variance term explicitly, we have

Var[v^⊤ · √n ∇_θ L_n(θ*)] = n · (1/4n) Var_{x∼p*}[(q(x) · v^⊤T(x))/(p*(x) + q(x))] + n · (1/4n) Var_{y∼q}[(p*(y) · v^⊤T(y))/(p*(y) + q(y))]   (using linearity and independence)
= (1/4) Var_{x∼p*}[A(x)] + (1/4) Var_{y∼q}[B(y)],   (14)

where A(x) := (q(x) · v^⊤T(x))/(p*(x) + q(x)) = (R₁(x)/(1 + R₁(x))) v^⊤T(x) with R₁(x) := q(x)/p*(x), and B(y) := (p*(y) · v^⊤T(y))/(p*(y) + q(y)) = (R₂(y)/(1 + R₂(y))) v^⊤T(y) with R₂(y) := p*(y)/q(y). To show that Var_{x∼p*}[A(x)] and Var_{y∼q}[B(y)] are large, we will need anti-concentration bounds on R₁(x) and R₂(y).

4.2. ANTI-CONCENTRATION OF R₁(x), R₂(y)

Next, we show that R₁ and R₂ satisfy (quantitative) anti-concentration. We show this by a relatively straightforward application of the Berry-Esseen theorem; the proof is given in Appendix B. Precisely, we show:

Lemma 8. Let d > 0 be sufficiently large. Let p = p̃^d and q = q̃^d be any product distributions, and define R(x) = q(x)/p(x). Suppose we have the following third moment bound: E_{x∼p̃}[|log(q̃/p̃)(x)|³] < ∞. Then, for any ϵ > 0, there exist constants α = α(p̃, q̃, ϵ) and µ = µ(p̃, q̃) < 0 such that

P_{x∼p}[R(x) ≤ exp(µd - α√d)] ≥ 1/2 - ϵ   and   P_{x∼p}[R(x) ≥ exp(µd + α√d)] ≥ 1/2 - ϵ.

Instantiating Lemma 8 for the pair (p*, q) gives us the anti-concentration result for R₁, while instantiating it for the reversed pair (q, p*) gives us the anti-concentration result for R₂. We can verify that the third moment condition holds in both instantiations, since in both cases log(q̃/p̃) is a polynomial, all of whose moments are finite under the corresponding base distribution. Crucially, we will also utilize the fact that the constant µ is negative (as it equals -KL(p̃||q̃)). We are now ready to bound the variances of A(x) and B(y).
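Lemma 8 can be illustrated numerically: log R₁(x) is a sum of d i.i.d. terms with per-coordinate mean µ_r = -KL(p̃||q̃) and standard deviation σ_r, so each of the two events has probability close to P[Z ≤ -c]. In the sketch below (ours; the quadrature grid, the choice c = 0.2, and the sampler for p̃ are our own), µ_r and σ_r are computed by quadrature and both tail fractions are checked at d = 100:

```python
import numpy as np
from math import gamma, log, pi, sqrt

rng = np.random.default_rng(1)
sigma = sqrt(4 * gamma(1.25) / gamma(0.75))
logC = log(2 * sigma * gamma(1.25))

# one-dimensional quantities for y = log(q~/p~)(x), x ~ p~, computed by quadrature
t = np.linspace(-10, 10, 200_001); dt = t[1] - t[0]
p1 = np.exp(-t**4 / sigma**4 - logC)                       # density p~
yv = (-t**2 / 2 - 0.5 * log(2 * pi)) - (-t**4 / sigma**4 - logC)
mu_r = (yv * p1 * dt).sum()                                # = -KL(p~ || q~) < 0
sd_r = sqrt((((yv - mu_r)**2) * p1 * dt).sum())

# sample S = log R_1(x) = sum_i y_i for d = 100 and check both tails carry ~1/2 mass
d, n, c = 100, 20_000, 0.2
g = rng.gamma(0.25, 1.0, size=(n, d))
x = np.where(rng.random((n, d)) < 0.5, -1.0, 1.0) * sigma * g**0.25
S = (-x**2 / 2 - 0.5 * log(2 * pi) + x**4 / sigma**4 + logC).sum(1)

lo = (S <= mu_r * d - c * sd_r * sqrt(d)).mean()
hi = (S >= mu_r * d + c * sd_r * sqrt(d)).mean()
assert mu_r < 0
assert lo > 0.35 and hi > 0.35     # each event has probability close to P[Z <= -c] ~ 0.42
```

Both tails indeed carry nearly half the mass, with the center of the distribution at the (negative) location µ_r·d, exactly as the lemma asserts.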

4.3. BOUNDING THE VARIANCE OF A(x), B(y)

Recall that A(x) = (R₁(x) · v^⊤T(x))/(1 + R₁(x)) and B(y) = (R₂(y) · v^⊤T(y))/(1 + R₂(y)). Let µ, α be the constants given by Lemma 8 for p*, q, ϵ, and let L₁ = exp(µd - α√d) and L₂ = exp(µd + α√d). Since the map x ↦ x/(1+x) is monotonically increasing in x,

P_{x∼p*}[R₁(x) ≤ L₁] = P_{x∼p*}[R₁(x)/(1 + R₁(x)) ≤ L₁/(1 + L₁)] ≥ 1/2 - ϵ,   (15)
P_{x∼p*}[R₁(x) ≥ L₂] = P_{x∼p*}[R₁(x)/(1 + R₁(x)) ≥ L₂/(1 + L₂)] ≥ 1/2 - ϵ.   (16)

Let T_up be such that

P_{x∼p*}[∥T(x)∥ ≤ T_up] ≥ 7/8   and   P_{x∼q}[∥T(x)∥ ≤ T_up] ≥ 7/8.   (17)

In Appendix D, we show that some T_up = O(σ²√d) suffices for this to hold. Then, from eq. (15), we have

P_{x∼p*}[R₁(x)/(1 + R₁(x)) ≤ L₁/(1 + L₁)] ≥ 1/2 - ϵ
=⇒ P_{x∼p*}[(R₁(x) · v^⊤T(x))/(1 + R₁(x)) ≤ (L₁ √(d+1) ∥T(x)∥)/(1 + L₁)] ≥ 1/2 - ϵ   (Cauchy-Schwarz: v^⊤T(x) ≤ √(d+1) ∥T(x)∥)
=⇒ P_{x∼p*}[(R₁(x) · v^⊤T(x))/(1 + R₁(x)) ≤ (L₁ √(d+1) ∥T(x)∥)/(1 + L₁) ∧ ∥T(x)∥ ≤ T_up] ≥ 3/8 - ϵ   (union bound with eq. (17))
=⇒ P_{x∼p*}[(R₁(x) v^⊤T(x))/(1 + R₁(x)) ≤ (√(d+1) L₁ T_up)/(1 + L₁)] ≥ 3/8 - ϵ
=⇒ P_{x∼p*}[A(x) ≤ (√(d+1) L₁ T_up)/(1 + L₁)] ≥ 1/4, for ϵ ≤ 1/8.

On the other hand, recall also that v satisfies v^⊤T(x) ≥ 1 for all x. Therefore, from eq. (16),

P_{x∼p*}[R₁(x)/(1 + R₁(x)) ≥ L₂/(1 + L₂)] ≥ 1/2 - ϵ
=⇒ P_{x∼p*}[(R₁(x) · v^⊤T(x))/(1 + R₁(x)) ≥ L₂/(1 + L₂)] ≥ 1/2 - ϵ
=⇒ P_{x∼p*}[A(x) ≥ L₂/(1 + L₂)] ≥ 1/4.

Now, consider the event

A₁ = {A(x) ∈ [(1/2) E_{x∼p*}[A(x)], (3/2) E_{x∼p*}[A(x)]]}.

If this event were to intersect both of the events A₂ = {A(x) ≤ (√(d+1) L₁ T_up)/(1 + L₁)} and A₃ = {A(x) ≥ L₂/(1 + L₂)}, then we would have

(1/2) E_{x∼p*}[A(x)] ≤ (√(d+1) L₁ T_up)/(1 + L₁)   and   (3/2) E_{x∼p*}[A(x)] ≥ L₂/(1 + L₂)
=⇒ (L₂/L₁) · (1/(T_up √(d+1))) · ((L₁ + 1)/(L₂ + 1)) ≤ 3.

We will show that this cannot be the case. Recall that µ < 0, which means that L₂ = exp(µd + α√d) < 1 for sufficiently large d. This means that for sufficiently large d we have:

exp(µd + α√d) < 1
=⇒ exp(µd + α√d) - 2 exp(µd - α√d) < 1
=⇒ 1 + exp(µd + α√d) < 2 + 2 exp(µd - α√d)
=⇒ (1 + exp(µd - α√d))/(1 + exp(µd + α√d)) > 1/2
=⇒ (L₁ + 1)/(L₂ + 1) > 1/2.

Further, since L₂/L₁ = exp(2α√d) and T_up = O(σ²√d), we get

(L₂/L₁) · (1/(T_up √(d+1))) · ((L₁ + 1)/(L₂ + 1)) > (exp(2α√d)/O(σ²d)) · (1/2) > 3,

where the last inequality follows for large enough d since the numerator grows faster than the denominator. Hence, for large enough d, A₁ cannot intersect both A₂ and A₃. If the event A₁ is disjoint from A₂, then

P_{x∼p*}[A₁] + P_{x∼p*}[A₂] = P_{x∼p*}[A₁ ∪ A₂] ≤ 1
=⇒ P_{x∼p*}[A₁] ≤ 1 - P_{x∼p*}[A₂] ≤ 3/4
=⇒ P_{x∼p*}[|A - E_{p*}A| ≥ (1/2) E_{p*}A] ≥ 1/4.

This finally lower-bounds the variance of A as

Var_{p*}[A] = E[(A - E_{p*}A)²] ≥ (1/4)(E_{p*}A)² · P[(A - E_{p*}A)² ≥ (1/4)(E_{p*}A)²] ≥ (1/16)(E_{p*}A)²,

and thus E_{p*}[A²] - (E_{p*}A)² = Var_{p*}[A] ≥ (1/16)(E_{p*}A)², so that (E_{p*}A)² ≤ (16/17) E_{p*}[A²]. Altogether, we get Var_{p*}[A] ≥ (1/17) E_{p*}[A²]. An analogous argument in the case when A₁ is disjoint from A₃ yields the same bound on the variance. Using an identical argument for R₂ and B, we get that for large enough d, Var_q[B] ≥ (1/17) E_q[B²].

4.4. PUTTING THINGS TOGETHER

Putting together the lower bounds Var_{p*}[A] ≥ (1/17) E_{p*}[A²] and Var_q[B] ≥ (1/17) E_q[B²] from the previous subsection, and recalling eq. (14), we get

Var[v^⊤ · √n ∇_θ L_n(θ*)] = (1/4) Var_{p*}[A] + (1/4) Var_q[B]
≥ (1/68) (E_{p*}[A²] + E_q[B²])
= (1/68) ∫ ((q(x)² p*(x) + q(x) p*(x)²)/(p*(x) + q(x))²) (v^⊤ T(x)T(x)^⊤ v) dx
= (1/68) v^⊤ (∫ (p*(x) q(x)/(p*(x) + q(x))) T(x)T(x)^⊤ dx) v
= (1/34) v^⊤ ∇²_θ L(θ*) v   (from eq. (6)).

Finally, since ∇²_θ L(θ*) is invertible as claimed earlier (Lemma 10, Appendix C), let w be such that v = ∇²_θ L(θ*)^{-1} w. Then, recalling the expression for Σ in eq. (12), we can conclude that

w^⊤ Σ w = v^⊤ Var[√n ∇_θ L_n(θ*)] v = Var[v^⊤ · √n ∇_θ L_n(θ*)] ≥ (1/34) v^⊤ ∇²_θ L(θ*) v = (1/34) w^⊤ ∇²_θ L(θ*)^{-1} w,

which gives us the desired bound on the MSE, namely

E∥θ̂_n - θ*∥₂² ≥ (1/n) Tr(Σ) ≥ (1/n) sup_z (z^⊤ Σ z/∥z∥²) ≥ (1/n) (w^⊤ Σ w/∥w∥²) ≥ (1/34n) (w^⊤ ∇²_θ L(θ*)^{-1} w/∥w∥²) ≥ (1/34n) inf_z (z^⊤ ∇²_θ L(θ*)^{-1} z/∥z∥²) ≥ exp(Ω(d))/n,

where the last inequality follows from Theorem 3 and the fact that λ_max(∇²_θ L(θ*))^{-1} = λ_min(∇²_θ L(θ*)^{-1}). This concludes the proof of Theorem 4.

5. SIMULATIONS

We also verify our results with simulations. Precisely, we study the MSE of the empirical NCE minimizer as a function of the ambient dimension, and recover the dependence from Theorem 4. For each dimension d ∈ {70, 72, ..., 120}, we generate n = 500 samples from the distribution P* constructed in the theorem, generate an equal number of samples from the noise distribution Q = N(0, I_d), and run gradient descent on the empirical NCE loss to obtain θ̂_n. Since we know θ* explicitly, we can compute the squared error ∥θ̂_n - θ*∥². We run 100 trials, drawing fresh samples from P* and Q each time, and average the squared errors over the trials to obtain an estimate of the MSE.

Figure 1 shows the plot of log MSE versus dimension; the graph is nearly linear. This corroborates the bound in Theorem 4, which tells us that as n → ∞, the MSE scales exponentially with d. This behavior is robust even when the proportion of noise samples to true data samples is changed to 70:30 (though our theory only addresses the 50:50 case). Finally, we note that optimizing the empirical NCE loss becomes numerically unstable with increasing d (due to very large ratios in the loss), which is why we used comparatively moderate values of d.
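A miniature version of this experimental pipeline can be sketched as follows (ours, at a much smaller scale d = 5 for speed; the sampler for P*, the learning rate, and all function names are our own choices, and the analytic gradient of eq. (2) is used in place of autodiff):

```python
import numpy as np
from math import gamma, log, pi, sqrt

rng = np.random.default_rng(2)
sigma = sqrt(4 * gamma(1.25) / gamma(0.75))
logC = log(2 * sigma * gamma(1.25))
d, n = 5, 4000

def T(z):
    # sufficient statistics T(x) = (x_1^4, ..., x_d^4, 1)
    return np.concatenate([z**4, np.ones((len(z), 1))], axis=1)

g = rng.gamma(0.25, 1.0, size=(n, d))
x = np.where(rng.random((n, d)) < 0.5, -1.0, 1.0) * sigma * g**0.25   # x ~ P*
y = rng.normal(size=(n, d))                                           # y ~ Q = N(0, I_d)

def log_q(z):
    return (-z**2 / 2).sum(1) - 0.5 * d * log(2 * pi)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(z / 2.0))

Tx, Ty, lqx, lqy = T(x), T(y), log_q(x), log_q(y)

def loss_and_grad(theta):
    lx = Tx @ theta - lqx            # log p_theta - log q on data samples
    ly = Ty @ theta - lqy            # log p_theta - log q on noise samples
    loss = 0.5 * (np.logaddexp(0.0, -lx).mean() + np.logaddexp(0.0, ly).mean())
    grad = 0.5 * (-(sigmoid(-lx)[:, None] * Tx).mean(0)
                  + (sigmoid(ly)[:, None] * Ty).mean(0))
    return loss, grad

theta = np.zeros(d + 1)              # initialize away from theta*
loss0, _ = loss_and_grad(theta)
for _ in range(1500):
    _, grad = loss_and_grad(theta)
    theta -= 0.01 * grad             # plain gradient descent on the (convex) empirical loss
loss1, _ = loss_and_grad(theta)

theta_star = np.array([-1 / sigma**4] * d + [-d * logC])   # theta* is known in closed form
err = np.linalg.norm(theta - theta_star)
assert loss1 < loss0 and theta[-1] < 0 and np.isfinite(err)
```

At this small dimension gradient descent makes steady progress; the paper's point is that at large d the flat Hessian and the exploding asymptotic variance make the same procedure fail.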

6. CONCLUSION

Despite significant interest in alternatives to maximum likelihood (for example, NCE, considered in this paper, or score matching), there is little understanding of what is "sacrificed" by these losses, either algorithmically or statistically. In this paper, we provided formal lower bounds on the asymptotic sample complexity of NCE when using a common choice for the noise distribution Q: a Gaussian with matching mean and covariance. Thus, it is likely that even for moderately complex distributions in practice, more involved techniques like those of Gao et al. (2020) and Rhodes et al. (2020) will have to be used, in which one learns a noise distribution Q simultaneously with the NCE minimization or "anneals" the NCE objective. There is very little theoretical understanding of such techniques, and this seems like a very fruitful direction for future work.

A BOUNDING THE MATRIX INTEGRAL IN EQUATION 6

We prove a variant of the Cauchy-Schwarz inequality that gives us a handle on norms of matrix integrals.

Lemma 9. Let f : R^d → R and A : R^d → R^n be integrable functions, and let M = ∫ f(x)A(x) dx. Then we have

∥M∥₂² = ∥∫ f(x)A(x) dx∥₂² ≤ (∫ |f(x)|² dx)(∫ ∥A(x)∥₂² dx).   (19)

Similarly, if A : R^d → R^{n×n} is a matrix-valued function, then

∥M∥_F² = ∥∫ f(x)A(x) dx∥_F² ≤ (∫ |f(x)|² dx)(∫ ∥A(x)∥_F² dx).   (20)

Proof. The proof follows from the Cauchy-Schwarz inequality. Since we integrate component-wise, for eq. (19) we have

M_i² = (∫ f(x)A(x)_i dx)² ≤ (∫ f(x)² dx)(∫ A(x)_i² dx).

Summing over i, we get the result. The matrix variant, eq. (20), follows by viewing the matrix M as a vector in R^{n²}.

B PROOF OF LEMMA 8

We restate the lemma for convenience:

Lemma 8. Let d > 0 be sufficiently large. Let p = p̃^d and q = q̃^d be any product distributions, and define R(x) = q(x)/p(x). Suppose we have the following third moment bound: E_{x∼p̃}[|log(q̃/p̃)(x)|³] < ∞. Then, for any ϵ > 0, there exist constants α = α(p̃, q̃, ϵ) and µ = µ(p̃, q̃) < 0 such that

P_{x∼p}[R(x) ≤ exp(µd - α√d)] ≥ 1/2 - ϵ   and   P_{x∼p}[R(x) ≥ exp(µd + α√d)] ≥ 1/2 - ϵ.

Proof. We will analyze the behaviour of R(x) using the Berry-Esseen theorem. Since p = p̃^d and q = q̃^d are product distributions, for x ∼ p we can write log R(x) = Σ_{i=1}^d y_i, where the y_i = log(q̃(x_i)/p̃(x_i)), 1 ≤ i ≤ d, are i.i.d. random variables. Let E[y_i] = µ_r, E[|y_i - µ_r|²] = σ_r², and E[|y_i - µ_r|³] = γ_r, all of which are well-defined by the hypothesis of the lemma. Let Y = Σ_{i=1}^d y_i, and let Z be the standard Gaussian in R. Then, by the Berry-Esseen theorem (Durrett, 2019, Theorem 3.4.17),

P[(Y - µ_r d)/(σ_r √d) ≤ -c] ≥ P[Z ≤ -c] - (C_BE · γ_r)/(σ_r³ √d),

where C_BE < 1 (van Beek, 1972) is an absolute constant. We can now choose c = c(ϵ) such that P[Z ≤ -c] ≥ 1/2 - ϵ/2, and choose d large enough that (C_BE · γ_r)/(σ_r³ √d) ≤ ϵ/2. Then, for µ = µ_r and α = c σ_r, we have P_{x∼p}[R(x) ≤ exp(µd - α√d)] ≥ 1/2 - ϵ. Since Z is symmetric around 0, the Berry-Esseen theorem gives the other inequality for the same choice of µ and α:

P[(Y - µ_r d)/(σ_r √d) ≥ c] ≥ P[Z ≥ c] - (C_BE · γ_r)/(σ_r³ √d) ≥ 1/2 - ϵ.

Note that the constants µ and α are independent of d. Further, note that µ = µ_r = -KL(p̃||q̃) < 0.
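Lemma 9 can be checked on a discretization, where both sides of eq. (20) become finite sums and the inequality is exactly the Cauchy-Schwarz inequality for sums. A small sketch (ours; the functions f and A are arbitrary choices for illustration):

```python
import numpy as np

# discretized check of Lemma 9: M = \int f(x) A(x) dx with a matrix-valued A
x = np.linspace(-5, 5, 2001); dx = x[1] - x[0]
f = np.exp(-x**2) * np.sin(3 * x)                  # an arbitrary integrable scalar function
A = np.stack([np.stack([np.exp(-x**2 / 2) * x**i * x**j for j in range(3)])
              for i in range(3)])                  # an arbitrary 3x3 matrix-valued function A(x)

M = (f * A).sum(-1) * dx                           # entrywise \int f(x) A(x) dx
lhs = (M**2).sum()                                 # ||M||_F^2
rhs = ((f**2).sum() * dx) * ((A**2).sum() * dx)    # (\int f^2 dx) * (\int ||A||_F^2 dx)
assert lhs <= rhs + 1e-12                          # eq. (20)
```

The same check with a vector-valued A verifies eq. (19).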

C INVERTIBILITY OF THE HESSIAN

We prove that the Hessian of the NCE loss for the exponential family given by T(x) = (x_1⁴, ..., x_d⁴, 1) is invertible. In particular, we have the following lemma:

Lemma 10. Let Q = N(0, I_d) be the standard Gaussian in R^d, and let P* = P̃^d, where P̃ is the log-concave distribution defined in Definition 6. Observe that P* lies in the exponential family given by T(x) = (x_1⁴, ..., x_d⁴, 1) and equals P_{θ*} for some θ*. Let p* and q denote the density functions of P* and Q, respectively. Then the Hessian of the NCE loss at θ*,

H = ∇²_θ L(θ*) = (1/2) ∫ (p* q/(p* + q)) T(x)T(x)^⊤ dx,

is invertible.

Proof. For any subset A ⊆ R^d, define

H_A = (1/2) ∫_{x∈A} (p* q/(p* + q)) T(x)T(x)^⊤ dx.

Observe that the density functions p* and q of P* and Q are strictly positive over all of R^d, and that the integrand is positive semi-definite. Therefore, for any subset A ⊆ R^d and any v ∈ R^{d+1}, we have

v^⊤ H v ≥ (1/2) ∫_{x∈A} (p* q/(p* + q)) v^⊤ T(x)T(x)^⊤ v dx = v^⊤ H_A v.

Given a non-zero vector v ∈ R^{d+1}, we will pick A such that T(x)^⊤ v ≠ 0 for all x ∈ A. Note that T(e_i) = e_i + e_{d+1} for 1 ≤ i ≤ d and T(0) = e_{d+1}, and the set B = {e_1 + e_{d+1}, ..., e_d + e_{d+1}, e_{d+1}} is a basis of R^{d+1}. Therefore, if b^⊤ v = 0 for all b ∈ B, then v = 0. Hence, for non-zero v, there exists some x₀ ∈ {e_1, ..., e_d, 0} such that T(x₀)^⊤ v ≠ 0. Since x ↦ T(x)^⊤ v is a continuous function, we can find an open set A around x₀ such that T(y)^⊤ v ≠ 0 for all y ∈ A. It follows that

v^⊤ H_A v = (1/2) ∫_{x∈A} (p* q/(p* + q)) (T(x)^⊤ v)² dx > 0,

and hence v^⊤ H v ≥ v^⊤ H_A v > 0. Since this holds for an arbitrary non-zero vector v, the matrix H is positive definite and hence invertible.
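Positive definiteness of H is also easy to observe numerically at small dimension. The sketch below (ours) forms the Monte Carlo estimate H ≈ (1/2)·(1/n) Σ_k w(x_k) T(x_k)T(x_k)^⊤ with w = q/(p* + q) and x_k ∼ P*, for d = 2:

```python
import numpy as np
from math import gamma, log, pi, sqrt

rng = np.random.default_rng(4)
sigma = sqrt(4 * gamma(1.25) / gamma(0.75))
logC = log(2 * sigma * gamma(1.25))
d, n = 2, 200_000

g = rng.gamma(0.25, 1.0, size=(n, d))
x = np.where(rng.random((n, d)) < 0.5, -1.0, 1.0) * sigma * g**0.25    # x ~ P*

log_ratio = (-x**2 / 2 - 0.5 * log(2 * pi) + x**4 / sigma**4 + logC).sum(1)   # log(q/p*)
w = 1.0 / (1.0 + np.exp(-log_ratio))               # q/(p* + q) evaluated at the samples
T = np.concatenate([x**4, np.ones((n, 1))], axis=1)

H = 0.5 * (w[:, None] * T).T @ T / n               # MC estimate of (1/2) E_{p*}[w T T^T]
eigs = np.linalg.eigvalsh(H)
assert eigs.min() > 0                              # H is positive definite (Lemma 10)
```

All eigenvalues of the estimated (d+1)×(d+1) matrix come out strictly positive, consistent with the lemma.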

D TAIL BOUNDS FOR EQUATION 17

We prove that some T_up = O(σ²√d) suffices to obtain the bounds in eq. (17). Concretely, we prove tail bounds for ∥T(x)∥ using tail bounds for P* and Q. We will use Lemma 1 from Laurent & Massart (2000), which proves a tail bound for χ² distributions:

Lemma (Lemma 1, Laurent & Massart (2000)). If X is a χ² random variable with d degrees of freedom, then for any positive t,

P[X ≥ d + 2√(dt) + 2t] ≤ exp(-t).

For x ∼ Q, ∥x∥² is a χ² random variable with d degrees of freedom, and for t, d ≥ 4 the lemma yields a sub-Gaussian tail bound on P_{x∼Q}[∥x∥ ≥ t]. Further, if ∥x∥ ≥ σ²√d, then q(x) ≥ p*(x), implying the analogous tail bound on P_{x∼P*}[∥x∥ ≥ t] for t ≥ σ²√d. In particular, for any δ such that log(1/δ) ≥ σ⁴, we have

P_{x∼Q}[∥x∥ ≥ √(2d log(1/δ))] ≤ δ   and   P_{x∼P*}[∥x∥ ≥ √(2d log(1/δ))] ≤ δ.   (21)

¹ Translating notation from Gutmann & Hyvärinen (2012): T_d = n, J̃_{T_d}(θ) = -2L_n(θ), and setting ν = 1 gives Ĩ_ν = 2∇²_θ L(θ*) as in eq. (6).

Figure 1: Log MSE versus dimension. Theorem 4 suggests this plot should be linear, as is observed.
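The Laurent & Massart tail bound quoted above can be sanity-checked by simulation (a check of ours, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 50, 200_000
X = rng.chisquare(d, size=n)    # ||x||^2 for x ~ N(0, I_d) is chi^2 with d degrees of freedom

# Lemma 1 of Laurent & Massart (2000): P[X >= d + 2*sqrt(d*t) + 2*t] <= exp(-t)
ts = (1.0, 3.0, 6.0)
fracs = [(X >= d + 2 * np.sqrt(d * t) + 2 * t).mean() for t in ts]
assert all(f <= np.exp(-t) + 3e-3 for f, t in zip(fracs, ts))
```

The empirical tail probabilities sit well below the bound, as expected (the bound is not tight for moderate t).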

ACKNOWLEDGEMENTS

Andrej is supported in part by NSF award IIS-2211907.

