PITFALLS OF GAUSSIANS AS A NOISE DISTRIBUTION IN NCE

Abstract

Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easy-to-sample noise distribution q, in a manner that avoids having to calculate a partition function. It is well-known that the choice of q can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for q is a Gaussian which matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of q will be problematic in practice, suggesting that more complex and tailored noise distributions are essential to the success of NCE.

1. INTRODUCTION

Noise contrastive estimation (NCE), introduced by Gutmann & Hyvärinen (2010; 2012), is one of several popular approaches for learning probability density functions parameterized up to a constant of proportionality, i.e. p(x) ∝ exp(E θ (x)), for some parametric family {E θ } θ . A recent incarnation of this paradigm is, for example, energy-based models (EBMs), which have achieved near-state-of-the-art results on many image generation tasks (Du & Mordatch, 2019; Song & Ermon, 2019). The main idea in NCE is to set up a self-supervised learning (SSL) task, in which we train a classifier to distinguish between samples from the data distribution P * and a known, easy-to-sample distribution Q, often called the "noise" or "contrast" distribution. It can be shown that for a large class of losses for the classification problem, the optimal classifier is a (simple) function of the density ratio p * /q, so an estimate for p * can be extracted from a good classifier. Moreover, this strategy can be implemented while avoiding calculation of the partition function, which is necessary when using maximum likelihood to learn p * .

The noise distribution q is the most significant "hyperparameter" in NCE training, with both strong empirical (Rhodes et al., 2020) and theoretical (Liu et al., 2021) evidence that a poor choice of q can result in poor algorithmic behavior. Chehab et al. (2022) show that even the optimal q for a finite number of samples can have an unexpected form (e.g., it is not equal to the true data distribution p * ). Since q needs to be a distribution that one can efficiently draw samples from, as well as one whose probability density function can be written down explicitly, the choices are somewhat limited. A particularly common way to pick q is as a Gaussian that matches the mean and covariance of the input data (Gutmann & Hyvärinen, 2012; Rhodes et al., 2020).
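To make the setup above concrete, the following is a minimal sketch (not the authors' code) of the NCE logistic-classification loss with the common Gaussian noise distribution matched to the data's mean and covariance. The classifier's logit is the log-density ratio log p θ (x) − log q(x); in practice p θ would carry a learnable normalizing constant, which we omit here for brevity. All function names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def log_gaussian(x, mean, cov):
    # Log-density of a multivariate Gaussian; this plays the role of log q(x)
    # when q is chosen to match the data's mean and covariance.
    d = x.shape[-1]
    diff = x - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * sol, axis=-1) + logdet + d * np.log(2 * np.pi))


def nce_loss(log_p_theta, data, noise, q_mean, q_cov):
    # Logistic-regression NCE loss: label 1 for data samples, 0 for noise
    # samples. The optimal classifier's logit is log(p*/q), so only the
    # (possibly unnormalized) log-density log_p_theta enters the loss --
    # no partition function is computed.
    logit_data = log_p_theta(data) - log_gaussian(data, q_mean, q_cov)
    logit_noise = log_p_theta(noise) - log_gaussian(noise, q_mean, q_cov)
    # -log sigmoid(z) = log(1 + exp(-z)), computed stably via logaddexp.
    loss_data = np.logaddexp(0.0, -logit_data).mean()
    loss_noise = np.logaddexp(0.0, logit_noise).mean()
    return loss_data + loss_noise


# Illustrative usage: fit q's moments to the data, then evaluate the loss
# for a candidate model (here, a standard Gaussian log-density).
d = 5
data = rng.normal(size=(1000, d))
q_mean, q_cov = data.mean(axis=0), np.cov(data.T)
noise = rng.multivariate_normal(q_mean, q_cov, size=1000)
loss = nce_loss(
    lambda x: log_gaussian(x, np.zeros(d), np.eye(d)),
    data, noise, q_mean, q_cov,
)
```

When the model density, the data distribution, and q all coincide, the logits are near zero and the loss sits near its value 2 log 2 at a non-informative classifier; this is the baseline against which the conditioning issues studied in this paper arise.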
Our main contribution in this paper is to formally show that such a choice can result in an objective that is statistically poorly behaved, even for relatively simple data distributions. We show that even if p * is a product distribution and a member of a very simple exponential family, the Hessian of the NCE loss, when using a Gaussian noise distribution that matches the mean and covariance of the data, has condition number exponential in the ambient dimension.

