PITFALLS OF GAUSSIANS AS A NOISE DISTRIBUTION IN NCE

Abstract

Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easy-to-sample noise distribution q, in a manner that avoids having to calculate a partition function. It is well-known that the choice of q can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for q is a Gaussian which matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of q will be problematic in practice, suggesting that more complex and tailored noise distributions are essential to the success of NCE.

1. INTRODUCTION

Noise contrastive estimation (NCE), introduced in Gutmann & Hyvärinen (2010; 2012), is one of several popular approaches for learning probability density functions parameterized up to a constant of proportionality, i.e., p(x) ∝ exp(E_θ(x)) for some parametric family {E_θ}_θ. A recent incarnation of this paradigm is, for example, energy-based models (EBMs), which have achieved near-state-of-the-art results on many image generation tasks (Du & Mordatch, 2019; Song & Ermon, 2019). The main idea in NCE is to set up a self-supervised learning (SSL) task, in which we train a classifier to distinguish between samples from the data distribution P* and a known, easy-to-sample distribution Q, often called the "noise" or "contrast" distribution. It can be shown that for a large class of losses for the classification problem, the optimal classifier is a (simple) function of the density ratio p*/q, so an estimate of p* can be extracted from a good classifier. Moreover, this strategy can be implemented while avoiding calculation of the partition function, which is necessary when using maximum likelihood to learn p*. The noise distribution q is the most significant "hyperparameter" in NCE training, with both strong empirical (Rhodes et al., 2020) and theoretical (Liu et al., 2021) evidence that a poor choice of q can result in poor algorithmic behavior. Chehab et al. (2022) show that even the q that is optimal for a finite number of samples can have an unexpected form (e.g., it is not equal to the true data distribution p*). Since q needs to be a distribution from which one can efficiently draw samples, and whose probability density function one can evaluate, the choices are somewhat limited. A particularly common way to pick q is as a Gaussian that matches the mean and covariance of the input data (Gutmann & Hyvärinen, 2012; Rhodes et al., 2020).
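The claim that the optimal classifier encodes the density ratio can be illustrated concretely. The following is a minimal sketch in a hypothetical 1-d setup of our own choosing (p* = N(1, 1), q = N(0, 1), not a setup from the paper): with equally many data and noise samples, the Bayes-optimal classifier is D(x) = p*(x)/(p*(x) + q(x)), and the ratio p*/q is recovered as D/(1 − D).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d illustration (hypothetical setup, for illustration only):
# data distribution p* = N(1, 1), noise distribution q = N(0, 1).
def gauss_pdf(x, mu):
    return np.exp(-(x - mu) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

p_star = lambda x: gauss_pdf(x, 1.0)
q = lambda x: gauss_pdf(x, 0.0)

# With equally many data and noise samples, the Bayes-optimal classifier
# for "is x a data sample?" is D(x) = p*(x) / (p*(x) + q(x)).
def bayes_classifier(x):
    return p_star(x) / (p_star(x) + q(x))

# The density ratio is recovered from the classifier via
# p*(x)/q(x) = D(x) / (1 - D(x)), which is the sense in which a good
# classifier yields an estimate of p* (up to normalization).
x = rng.normal(size=5)
D = bayes_classifier(x)
assert np.allclose(D / (1.0 - D), p_star(x) / q(x))
```

In practice D is a trained model rather than the Bayes classifier, but the same inversion D/(1 − D) is what turns the classification solution into a density-ratio estimate.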
Our main contribution in this paper is to show formally that such a choice can result in an objective that is statistically poorly behaved, even for relatively simple data distributions. We show that even if P* is a product distribution and a member of a very simple exponential family, the Hessian of the NCE loss, when using a Gaussian noise distribution q with matching mean and covariance, has exponentially small (in the ambient dimension) spectral norm. As a consequence, the optimization landscape around the optimum will be exponentially flat, making gradient-based optimization challenging. As the main result of the paper, we show that the asymptotic sample complexity of the NCE objective is exponentially bad in the ambient dimension.

2. OVERVIEW OF RESULTS

Let P* be a distribution in a parametric family {P_θ}_{θ∈Θ}, i.e., P* = P_{θ*} for some θ* ∈ Θ. We wish to estimate θ* by solving a noise contrastive estimation task. To set up the task, we also need to choose a noise distribution Q, with the constraints that we can draw samples from it efficiently and that we can evaluate its probability density function efficiently. We will use p_θ, p*, q to denote the probability density functions (pdfs) of P_θ, P*, and Q. For a data distribution P* and noise distribution Q, the NCE loss of a distribution P_θ is defined as follows:

Definition 1 (NCE Loss). The NCE loss of P_θ w.r.t. data distribution P* and noise Q is
\[
L(P_\theta) = -\frac{1}{2}\,\mathbb{E}_{P_*}\!\left[\log \frac{p_\theta}{p_\theta + q}\right] - \frac{1}{2}\,\mathbb{E}_{Q}\!\left[\log \frac{q}{p_\theta + q}\right]. \tag{1}
\]
Moreover, the empirical version of the NCE loss, given i.i.d. samples (x_1, ..., x_n) ∼ P*^n and (y_1, ..., y_n) ∼ Q^n, is
\[
L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} -\frac{1}{2} \log \frac{p_\theta(x_i)}{p_\theta(x_i) + q(x_i)} + \frac{1}{n} \sum_{i=1}^{n} -\frac{1}{2} \log \frac{q(y_i)}{p_\theta(y_i) + q(y_i)}. \tag{2}
\]
By a slight abuse of notation, we will use L(θ), L(p_θ), and L(P_θ) interchangeably. The NCE loss can be interpreted as the binary cross-entropy loss for the classification task of distinguishing the data samples from the noise samples. To avoid calculating the partition function, one treats it as an additional parameter: namely, we consider an augmented parameter vector θ̄ = (θ, c) and let p_θ̄(x) = exp(E_θ(x) − c). The crucial property of the NCE loss is that it has a unique minimizer:

Lemma 2 (Gutmann & Hyvärinen, 2012). The NCE objective in Definition 1 is uniquely minimized at θ = θ* and c = log(∫ exp(E_{θ*}(x)) dx), provided that the support of Q contains that of P*.

We will focus on the Hessian of the loss L, as the crucial object governing both the algorithmic and statistical difficulty of the resulting objective. We will show the following two main results:

Theorem 3 (Exponentially flat Hessian).
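The empirical loss of Eq. (2), with the partition constant c treated as a free parameter, can be sketched as follows. This is a minimal toy instantiation of our own (not the paper's construction): a 1-d family with E_θ(x) = −(x − θ)²/2, so the true constant is c* = log √(2π), and noise q = N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-d exponential family (hypothetical, for illustration):
# p_theta(x) = exp(E_theta(x) - c) with E_theta(x) = -(x - theta)^2 / 2,
# so the true partition constant is c* = log sqrt(2*pi). Noise q = N(0, 1).
def log_p(x, theta, c):
    return -(x - theta) ** 2 / 2.0 - c

def log_q(x):
    return -x ** 2 / 2.0 - 0.5 * np.log(2.0 * np.pi)

def nce_loss(theta, c, xs, ys):
    """Empirical NCE loss of Eq. (2): xs ~ P*, ys ~ Q."""
    # -log(p / (p + q)) = log(1 + q/p) = softplus(log q - log p)
    softplus = lambda t: np.logaddexp(0.0, t)
    data_term = 0.5 * np.mean(softplus(log_q(xs) - log_p(xs, theta, c)))
    noise_term = 0.5 * np.mean(softplus(log_p(ys, theta, c) - log_q(ys)))
    return data_term + noise_term

theta_star = 1.0
c_star = 0.5 * np.log(2.0 * np.pi)
xs = rng.normal(loc=theta_star, size=100_000)  # data samples
ys = rng.normal(loc=0.0, size=100_000)         # noise samples

# Consistent with Lemma 2, the loss at (theta*, c*) should be (near-)minimal
# relative to perturbed parameters, up to sampling noise.
loss_star = nce_loss(theta_star, c_star, xs, ys)
assert loss_star <= nce_loss(theta_star + 0.5, c_star, xs, ys)
assert loss_star <= nce_loss(theta_star, c_star + 0.5, xs, ys)
```

Writing the loss with softplus (log-add-exp) rather than dividing densities keeps the computation numerically stable when p_θ and q differ by many orders of magnitude.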
For d > 0 large enough, there exists a distribution P* = P_{θ*} over R^d such that:
• E_{P*}[x] = 0 and E_{P*}[xx^⊤] = I_d.
• P* is a product distribution, namely p*(x_1, x_2, ..., x_d) = ∏_{i=1}^{d} p*(x_i).
• The NCE loss when using q = N(0, I_d) as the noise distribution satisfies ∥∇²L(θ*)∥_2 ≤ exp(−Ω(d)).

We remark that the above example of a problematic distribution P* is extremely simple: P* is a product distribution with mean 0 and identity covariance. In fact, P* is also log-concave, a class typically thought of as "easy" to learn, since log-concave distributions are unimodal. The fact that the Hessian is exponentially flat near the optimum means that gradient-descent-based optimization without additional tricks (e.g., gradient normalization, or second-order methods like Newton's method) will fail. (See, e.g., Theorems 4.1 and 4.2 in Liu et al. (2021).) For us, this will be merely an intermediate result. We will address a more fundamental issue, the sample complexity of NCE, which is independent of the optimization algorithm used. Namely, we will show that without a large number of samples, the minimizer of the empirical NCE loss might not be close to the target distribution. Proving this will require developing some technical machinery. More precisely, we use the result above to show that the asymptotic statistical complexity, for the above choice of P* and Q, is exponentially bad in the dimension. This substantially clarifies results in
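The link between a flat Hessian and poor statistical efficiency can be sketched via standard M-estimator asymptotics (a heuristic under the usual regularity conditions, not the paper's formal argument):

```latex
% Asymptotic normality of the NCE estimator \hat{\theta}_n = \arg\min_\theta L_n(\theta):
\[
  \sqrt{n}\,\big(\hat{\theta}_n - \theta^*\big)
  \;\xrightarrow{d}\;
  \mathcal{N}\!\big(0,\; H^{-1}\,\Sigma\,H^{-1}\big),
  \qquad
  H = \nabla^2 L(\theta^*),
  \quad
  \Sigma = \operatorname{Cov}\!\big[\widehat{\nabla L}(\theta^*)\big].
\]
% If every eigenvalue of H is at most \exp(-\Omega(d)), then H^{-1} has an
% eigenvalue of at least \exp(\Omega(d)); unless \Sigma shrinks at a matching
% rate in the corresponding direction, the asymptotic variance is
% exponentially large in d, i.e., exponentially many samples are needed
% for \hat{\theta}_n to be close to \theta^*.
```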

