GRADIENT-GUIDED IMPORTANCE SAMPLING FOR LEARNING BINARY ENERGY-BASED MODELS

Abstract

Learning energy-based models (EBMs) is known to be difficult, especially on discrete data, where gradient-based learning strategies cannot be applied directly. Although ratio matching is a sound method to learn discrete EBMs, it suffers from expensive computation and excessive memory requirements, thereby making it difficult to learn EBMs on high-dimensional data. Motivated by these limitations, in this study we propose ratio matching with gradient-guided importance sampling (RMwGGIS). Particularly, we use the gradient of the energy function w.r.t. the discrete data space to approximately construct the provably optimal proposal distribution, which is subsequently used by importance sampling to efficiently estimate the original ratio matching objective. We perform experiments on density modeling over synthetic discrete data, graph generation, and training Ising models to evaluate our proposed method. The experimental results demonstrate that our method can significantly alleviate the limitations of ratio matching, perform more effectively in practice, and scale to high-dimensional problems. Our implementation is available at https://github.com/divelab/RMwGGIS.

1. INTRODUCTION

Energy-based models (EBMs), also known as unnormalized probabilistic models, model distributions by assigning unnormalized probability densities to data points. Such methods have been developed for decades (Hopfield, 1982; Ackley et al., 1985; Cipra, 1987; Dayan et al., 1995; Zhu et al., 1998; Hinton, 2012) and are unified as energy-based models (EBMs) (LeCun et al., 2006) in the machine learning community. EBMs offer great simplicity and flexibility since energy functions are not required to integrate or sum to one, thus enabling the use of various energy functions. In practice, given different data types, we can parameterize the energy function with different neural networks as needed, such as multi-layer perceptrons (MLPs), convolutional neural networks (CNNs) (LeCun et al., 1998), and graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008). Recently, EBMs have been drawing increasing attention and have been demonstrated to be effective in various domains, including images (Ngiam et al., 2011; Xie et al., 2016; Du & Mordatch, 2019), videos (Xie et al., 2017), texts (Deng et al., 2020), 3D objects (Xie et al., 2018), molecules (Liu et al., 2021; Hataya et al., 2021), and proteins (Du et al., 2020b). Nonetheless, learning (a.k.a. training) EBMs is known to be challenging since we cannot compute the exact likelihood due to the intractable normalization constant. As reviewed in Section 4, many approaches have been proposed to learn EBMs, such as maximum likelihood training with MCMC sampling (Hinton, 2002) and score matching (Hyvärinen & Dayan, 2005). However, most recent advanced methods cannot be applied to discrete data directly since they usually leverage gradients over the continuous data space. For example, many methods based on maximum likelihood training with MCMC sampling use the gradient w.r.t. the data space to update samples in each MCMC step.
However, if we update discrete samples using such gradients, the resulting samples are usually invalid in the discrete space. Therefore, learning EBMs on discrete data remains challenging. Ratio matching (Hyvärinen, 2007) is a sound method for learning EBMs on binary discrete data, as introduced in Section 2.2. However, as analyzed in Section 3.1, it requires expensive computation and excessive memory usage, which is infeasible when the data is high-dimensional. In this work, we propose to use the gradient of the energy function w.r.t. the discrete data space to guide importance sampling for estimating the original ratio matching objective. More specifically, we use this gradient to approximately construct the provably optimal proposal distribution for importance sampling. The proposed approach is thus termed ratio matching with gradient-guided importance sampling (RMwGGIS). RMwGGIS significantly overcomes the limitations of ratio matching and is demonstrated to be more effective than the original ratio matching in practice. We perform extensive analysis of this improvement by connecting it with hard negative mining, and further propose an advanced version of RMwGGIS by reconsidering the importance weights. Experimental results on synthetic discrete data, graph generation, and Ising model training demonstrate that RMwGGIS significantly alleviates the limitations of ratio matching, achieves better performance by clear margins, and scales to high-dimensional problems.
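To convey the flavor of the approach before the formal development, the sketch below illustrates how a gradient can guide importance sampling over the d dimensions of a binary point. Everything here is our own illustrative assumption, not the paper's exact construction: a toy quadratic energy stands in for a neural energy function, and the per-dimension energy changes are approximated with a first-order Taylor expansion so that a single gradient evaluation replaces d separate energy evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Illustrative quadratic energy E(x) = x^T W x + b^T x; W, b stand in for theta.
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

def energy(x):
    return x @ W @ x + b @ x

def grad_energy(x):
    # Exact gradient of the quadratic energy w.r.t. a real-valued relaxation of x.
    return (W + W.T) @ x + b

def g(u):
    return 1.0 / (1.0 + u)

def estimate_rm_terms(x, num_samples=2):
    """Importance-sampling estimate of sum_i g(exp(E(x_-i) - E(x)))^2,
    the per-datum sum appearing in the simplified ratio matching objective."""
    # First-order Taylor guess of the energy change from flipping bit i:
    # E(x_-i) - E(x) ~= (1 - 2 x_i) * dE/dx_i. One gradient gives all d guesses.
    dE_approx = (1.0 - 2.0 * x) * grad_energy(x)
    weights = g(np.exp(dE_approx)) ** 2      # approximate term magnitudes
    q = weights / weights.sum()              # gradient-guided proposal over dims

    Ex = energy(x)
    est = 0.0
    for _ in range(num_samples):
        i = rng.choice(d, p=q)
        x_flip = x.copy()
        x_flip[i] = 1.0 - x_flip[i]
        term = g(np.exp(energy(x_flip) - Ex)) ** 2  # exact term for sampled i
        est += term / q[i]                          # importance weight 1 / q(i)
    return est / num_samples

x = rng.integers(0, 2, size=d).astype(float)
print(estimate_rm_terms(x))
```

Because the proposal roughly tracks the magnitude of each term, a handful of samples already gives a low-variance, unbiased estimate of the d-term sum, whereas exact evaluation would cost d extra energy evaluations per data point.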

2. PRELIMINARIES

2.1. ENERGY-BASED MODELS

Let $x$ be a data point and $E_\theta(x) \in \mathbb{R}$ be the corresponding energy, where $\theta$ represents the learnable parameters of the parameterized energy function $E_\theta(\cdot)$. The probability density function of the model distribution is given as
$$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta} \propto e^{-E_\theta(x)},$$
where $Z_\theta \in \mathbb{R}$ is the normalization constant (a.k.a. partition function). To be specific, $Z_\theta = \int e^{-E_\theta(x)}\,dx$ if $x$ is in a continuous space, and $Z_\theta = \sum_x e^{-E_\theta(x)}$ for discrete data. Hence, computing $Z_\theta$ is usually infeasible due to the intractable integral or summation. Note that $Z_\theta$ depends on $\theta$ but is a constant w.r.t. $x$.
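For intuition, the snippet below computes the model distribution of a toy EBM over binary data by brute-force enumeration. The quadratic energy is an illustrative stand-in for a neural energy function (its parameters play the role of $\theta$); enumerating all $2^d$ states is feasible only for tiny $d$, which is precisely why $Z_\theta$ is intractable for high-dimensional data.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
d = 4

# Illustrative quadratic "energy" E(x) = x^T W x + b^T x; W, b stand in for theta.
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

def energy(x):
    return x @ W @ x + b @ x

# Enumerate all 2^d binary states {0,1}^d.
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
E = np.array([energy(x) for x in states])
Z = np.exp(-E).sum()          # Z_theta = sum_x e^{-E_theta(x)}
p = np.exp(-E) / Z            # p_theta(x) = e^{-E_theta(x)} / Z_theta

print(p.sum())  # normalizes to 1 by construction
```

Doubling $d$ doubles the cost of one energy evaluation but squares the number of states, so this brute-force normalization becomes impossible almost immediately.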

2.2. RATIO MATCHING

Ratio matching (Hyvärinen, 2007) is developed for learning EBMs on binary discrete data by matching ratios of probabilities between the data distribution and the model distribution. Note that we focus on $d$-dimensional binary discrete data $x \in \{0, 1\}^d$ in this work. Specifically, ratio matching considers the ratio of $p(x)$ and $p(x_{-i})$, where $x_{-i} = (x_1, x_2, \cdots, \bar{x}_i, \cdots, x_d)$, with $\bar{x}_i = 1 - x_i$, denotes the point in the data space obtained by flipping the $i$-th dimension of $x$. The key idea is to force the ratios $\frac{p_\theta(x)}{p_\theta(x_{-i})}$ defined by the model distribution $p_\theta$ to be as close as possible to the ratios $\frac{p_D(x)}{p_D(x_{-i})}$ given by the data distribution $p_D$. The benefit of considering ratios of probabilities is that they do not involve the intractable normalization constant $Z_\theta$, since
$$\frac{p_\theta(x)}{p_\theta(x_{-i})} = \frac{e^{-E_\theta(x)}}{Z_\theta} \cdot \frac{Z_\theta}{e^{-E_\theta(x_{-i})}} = e^{E_\theta(x_{-i}) - E_\theta(x)}.$$
To achieve the match between ratios, Hyvärinen (2007) proposes to minimize the objective function
$$J_{RM}(\theta) = \mathbb{E}_{x \sim p_D(x)} \sum_{i=1}^{d} \left[ g\!\left(\frac{p_D(x)}{p_D(x_{-i})}\right) - g\!\left(\frac{p_\theta(x)}{p_\theta(x_{-i})}\right) \right]^2 + \left[ g\!\left(\frac{p_D(x_{-i})}{p_D(x)}\right) - g\!\left(\frac{p_\theta(x_{-i})}{p_\theta(x)}\right) \right]^2. \quad (1)$$
The sum of two squared distances, with the roles of $x$ and $x_{-i}$ switched, is specifically designed since it is essential for the following simplification. In addition, the function $g(u) = \frac{1}{1+u}$ is also carefully chosen in order to obtain that simplification. To compute the objective defined in Eq. (1), the expectation over the data distribution (i.e., $\mathbb{E}_{x \sim p_D(x)}$) can be unbiasedly estimated by the empirical mean of samples $x \sim p_D(x)$. However, obtaining the ratios between $p_D(x)$ and $p_D(x_{-i})$ in Eq. (1) requires the exact data distribution to be known, which is usually impossible. Fortunately, thanks to the carefully designed objective above, Hyvärinen (2007) demonstrates that the objective function in Eq. (1) is equivalent to the following simplified version:
$$J_{RM}(\theta) = \mathbb{E}_{x \sim p_D(x)} \sum_{i=1}^{d} g\!\left(\frac{p_\theta(x)}{p_\theta(x_{-i})}\right)^2. \quad (2)$$
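The simplified objective in Eq. (2) is straightforward to evaluate, since each ratio reduces to $e^{E_\theta(x_{-i}) - E_\theta(x)}$ and $Z_\theta$ cancels. The sketch below evaluates it on a minibatch for a toy quadratic energy (an illustrative stand-in for the neural energy functions used in the paper, not the paper's architecture).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Illustrative quadratic energy; W, b play the role of theta.
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

def energy(x):
    return x @ W @ x + b @ x

def g(u):
    # g(u) = 1 / (1 + u), the function chosen by Hyvarinen (2007).
    return 1.0 / (1.0 + u)

def rm_objective(batch):
    """Simplified ratio matching objective, Eq. (2):
    mean over data of sum_i g(p_theta(x) / p_theta(x_-i))^2,
    where the ratio equals exp(E(x_-i) - E(x)), so Z_theta cancels."""
    total = 0.0
    for x in batch:
        Ex = energy(x)
        for i in range(len(x)):
            x_flip = x.copy()
            x_flip[i] = 1.0 - x_flip[i]       # flip the i-th bit
            total += g(np.exp(energy(x_flip) - Ex)) ** 2
    return total / len(batch)

batch = rng.integers(0, 2, size=(16, d)).astype(float)
print(rm_objective(batch))
```

Note the cost: each data point requires $d + 1$ energy evaluations (the point itself plus all $d$ single-bit flips), and differentiating through all of them dominates memory as well. This per-datum factor of $d$ is exactly the bottleneck that motivates estimating the inner sum by importance sampling.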

