CONTRASTIVE DIVERGENCE LEARNING IS A TIME REVERSAL ADVERSARIAL GAME

Abstract

Contrastive divergence (CD) learning is a classical method for fitting unnormalized statistical models to data samples. Despite its widespread use, the convergence properties of this algorithm are still not well understood. The main source of difficulty is an unjustified approximation which has been used to derive the gradient of the loss. In this paper, we present an alternative derivation of CD that does not require any approximation and sheds new light on the objective that is actually being optimized by the algorithm. Specifically, we show that CD is an adversarial learning procedure, where a discriminator attempts to classify whether a Markov chain generated from the model has been time-reversed. Thus, although predating generative adversarial networks (GANs) by more than a decade, CD is, in fact, closely related to these techniques. Our derivation settles well with previous observations, which have concluded that CD's update steps cannot be expressed as the gradients of any fixed objective function. In addition, as a byproduct, our derivation reveals a simple correction that can be used as an alternative to Metropolis-Hastings rejection, which is required when the underlying Markov chain is inexact (e.g., when using Langevin dynamics with a large step size).

1. INTRODUCTION

Unnormalized probability models have drawn significant attention over the years. These models arise, for example, in energy based models, where the normalization constant is intractable to compute, and are thus relevant to numerous settings. Particularly, they have been extensively used in the context of restricted Boltzmann machines (Smolensky, 1986; Hinton, 2002), deep belief networks (Hinton et al., 2006; Salakhutdinov & Hinton, 2009), Markov random fields (Carreira-Perpinan & Hinton, 2005; Hinton & Salakhutdinov, 2006), and recently also with deep neural networks (Xie et al., 2016; Song & Ermon, 2019; Du & Mordatch, 2019; Grathwohl et al., 2019; Nijkamp et al., 2019). Fitting an unnormalized density model to a dataset is challenging due to the missing normalization constant of the distribution. A naive approach is to employ approximate maximum likelihood estimation (MLE). This approach relies on the fact that the likelihood's gradient can be approximated using samples from the model, generated using Markov Chain Monte Carlo (MCMC) techniques. However, a good approximation requires using very long chains and is thus impractical. This difficulty motivated the development of a plethora of more practical approaches, like score matching (Hyvärinen, 2005), noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010), and conditional NCE (CNCE) (Ceylan & Gutmann, 2018), which replace the log-likelihood loss with objectives that do not require the computation of the normalization constant or its gradient. Perhaps the most popular method for learning unnormalized models is contrastive divergence (CD) (Hinton, 2002). CD's advantage over MLE stems from its use of short Markov chains initialized at the data samples.
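To see why MLE requires samples from the model, write $p_\theta(x) = \tilde{p}_\theta(x)/Z(\theta)$, where $\tilde{p}_\theta$ denotes the unnormalized model and $Z(\theta)$ its partition function (this notation is ours, introduced here for illustration). The log-likelihood gradient then decomposes as

```latex
\nabla_\theta \log p_\theta(x)
  = \nabla_\theta \log \tilde{p}_\theta(x) - \nabla_\theta \log Z(\theta),
\qquad
\nabla_\theta \log Z(\theta)
  = \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta \log \tilde{p}_\theta(x')\right].
```

The second term is an expectation under the model itself, which is what MCMC is used to estimate; the chains must be run long enough to (approximately) reach the stationary distribution $p_\theta$, which is what makes this approach impractical.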
CD has been successfully used in a wide range of domains, including modeling images (Hinton et al., 2006), speech (Mohamed & Hinton, 2010), documents (Hinton & Salakhutdinov, 2009), and movie ratings (Salakhutdinov et al., 2007), and is continuing to attract significant research attention (Liu & Wang, 2017; Gao et al., 2018; Qiu et al., 2019). In this paper, we present an alternative derivation of CD, which relies on completely different principles and requires no approximations. Specifically, we show that CD's update steps are the gradients of an adversarial game in which a discriminator attempts to classify whether a Markov chain generated from the model is presented to it in its original or a time-reversed order (see Fig. 1). Thus, our derivation sheds new light on CD's success: Similarly to modern generative adversarial methods (Goodfellow et al., 2014), CD's discrimination task becomes more challenging as the model approaches the true distribution. This keeps the update steps effective throughout the entire training process and prevents early saturation, as often happens in non-adaptive methods like NCE and CNCE. In fact, we derive CD as a natural extension of the CNCE method, replacing the fixed distribution of the contrastive examples with an adversarial adaptive distribution. CD requires that the underlying MCMC be exact, which is not the case for popular methods like Langevin dynamics. This commonly requires using Metropolis-Hastings (MH) rejection, which ignores some of the generated samples. Interestingly, our derivation reveals an alternative correction method for inexact chains, which does not require rejection.
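As a concrete illustration of the Metropolis-Hastings rejection mentioned above, the following is a minimal sketch of Langevin dynamics with an MH accept/reject step (the Metropolis-adjusted Langevin algorithm) on a toy one-dimensional target. All names here (`mala_step`, `log_p`, `grad_log_p`) and the toy Gaussian target are our own illustrative choices, not from the paper:

```python
import numpy as np

def mala_step(x, grad_log_p, log_p, step, rng):
    """One Langevin proposal followed by Metropolis-Hastings accept/reject."""
    noise = rng.standard_normal(x.shape)
    prop = x + step * grad_log_p(x) + np.sqrt(2 * step) * noise

    def log_q(a, b):
        # log of the (asymmetric) Gaussian proposal density q(a | b)
        return -np.sum((a - b - step * grad_log_p(b)) ** 2) / (4 * step)

    log_alpha = log_p(prop) + log_q(x, prop) - log_p(x) - log_q(prop, x)
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return x  # rejection: the proposed sample is ignored

# Toy target: standard Gaussian, log p(x) = -x^2/2 up to a constant.
log_p = lambda x: -0.5 * np.sum(x ** 2)
grad_log_p = lambda x: -x

rng = np.random.default_rng(0)
x = np.zeros(1)
samples = []
for t in range(20000):
    x = mala_step(x, grad_log_p, log_p, step=0.5, rng=rng)
    if t > 2000:  # discard burn-in
        samples.append(x[0])
```

With the MH correction, the chain is exact for any step size; dropping the accept/reject step leaves a discretization bias that grows with the step, which is precisely the inexactness the paper's alternative correction addresses.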

2. BACKGROUND

2.1 THE CLASSICAL DERIVATION OF CD

Assume we have an unnormalized distribution model $p_\theta$. Given a dataset of samples $\{x_i\}$ independently drawn from some unknown distribution $p$, CD attempts to determine the parameters $\theta$ with which $p_\theta$ best explains the dataset. Rather than using the log-likelihood loss, CD's objective involves distributions of samples along finite Markov chains initialized at $\{x_i\}$. When based on chains of length $k$, the algorithm is usually referred to as CD-$k$. Concretely, let $q_\theta(x'|x)$ denote the transition rule of a Markov chain with stationary distribution $p_\theta$, and let $r_\theta^m$ denote the distribution of samples after $m$ steps of the chain. As the Markov chain is initialized from the dataset distribution and converges to $p_\theta$, we have that $r_\theta^0 = p$ and $r_\theta^\infty = p_\theta$. The CD algorithm then attempts to minimize the loss
$$\ell_{\text{CD-}k} = D_{\mathrm{KL}}(r_\theta^0 \,\|\, r_\theta^\infty) - D_{\mathrm{KL}}(r_\theta^k \,\|\, r_\theta^\infty) = D_{\mathrm{KL}}(p \,\|\, p_\theta) - D_{\mathrm{KL}}(r_\theta^k \,\|\, p_\theta),$$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. Under mild conditions on $q_\theta$ (Cover & Halliwell, 1994), this loss is guaranteed to be non-negative, and it vanishes when $p_\theta = p$ (in which case $r_\theta^k = p_\theta$).
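To make the CD-$k$ procedure concrete, here is a hedged sketch of the classical update on a toy model: chains of length $k$ are started at data samples, and $\theta$ is updated with the difference of $\nabla_\theta \log \tilde{p}_\theta$ between the data and the chain endpoints. The toy model $p_\theta(x) \propto \exp(-\theta x^2)$, the use of uncorrected Langevin transitions as $q_\theta$ (inexact for a finite step, as discussed in Section 1), and all names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized model: p_theta(x) ∝ exp(-theta * x^2), so
# grad_theta log p~_theta(x) = -x^2. Data from N(0, 1) means theta* = 0.5.
data = rng.standard_normal(1000)

def langevin_step(x, theta, step, rng):
    # Transition rule q_theta whose stationary distribution approaches
    # p_theta in the small-step limit (no MH correction here).
    grad_log_p = -2.0 * theta * x
    return x + step * grad_log_p + np.sqrt(2 * step) * rng.standard_normal(x.shape)

theta, lr, k = 0.1, 0.05, 10
for it in range(500):
    batch = rng.choice(data, size=128)
    neg = batch.copy()            # r^0: chains are initialized at data samples
    for _ in range(k):
        neg = langevin_step(neg, theta, step=0.05, rng=rng)  # neg ~ r^k
    # CD-k update: E_data[grad log p~] - E_{r^k}[grad log p~], with grad = -x^2
    theta += lr * (np.mean(neg ** 2) - np.mean(batch ** 2))
```

The update has a fixed point where the chain endpoints and the data are statistically indistinguishable; here `theta` settles near the true value 0.5, up to the small bias introduced by the uncorrected Langevin transitions.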



Figure 1: Contrastive divergence as an adversarial process. In the first step, the distribution model defines an MCMC transition rule, which is used to generate a chain of samples. In the second step, the distribution model is updated using a gradient descent step.

