ANNEALED FISHER IMPLICIT SAMPLER

Abstract

Sampling from an un-normalized target distribution is an important problem in many scientific fields. An implicit sampler uses a parametric transform $x = G_\theta(z)$ to push forward an easy-to-sample latent code $z$ to obtain a sample $x$. Such samplers are favored for their fast inference speed and flexible architectures, so it is appealing to train an implicit sampler for sampling from an un-normalized target. In this paper, we propose a novel approach to training an implicit sampler by minimizing the Fisher divergence between the sampler and target distributions. We find that the trained sampler works well for relatively simple targets but may fail for more complicated multi-modal ones. To improve training for multi-modal targets, we propose an adaptive training approach in which the sampler gradually learns a sequence of annealed distributions. We construct the annealed distribution path to bridge a simple distribution and the complicated target. With the annealed approach, the sampler can handle challenging multi-modal targets. In addition, we introduce a few MCMC correction steps after the sampler to better spread the samples. We call the proposed sampler the Annealed Fisher Implicit Sampler (AFIS). We test AFIS on several sampling benchmarks; the experiments show that AFIS outperforms baseline methods in many aspects. We also show in theory that the added MCMC correction steps mix faster when the learned sampler is used as the MCMC initialization.

1. INTRODUCTION

Sampling from an un-normalized distribution is an important problem in many scientific fields such as Bayesian statistics (Green, 1995), biology (Schütte et al., 1999), physics simulations (Olsson, 1995), and machine learning (Andrieu et al., 2003). Typically, the problem is formulated as follows: given a known differentiable un-normalized target potential function $\log q(x)$, one wants to sample from the target distribution. Owing to the success of deep neural networks, it has become increasingly popular to train a deep generative model to learn to sample (Hu et al., 2018; Wu et al., 2020; Matthews et al., 2022; Corenflos et al., 2021). Such learned models, which can approximately sample from the target distribution, are called samplers.

Training a neural network (i.e., a parameterized transform) $x = G_\theta(z)$ to push forward an easy-to-sample latent code $z \sim p_Z(z)$ to obtain a sample is an appealing approach. Such approaches are favored for fast sampling because they only need a single forward pass of the neural network transform. Let $G_\theta(\cdot)$ denote the parametric transform and $q(x)$ the un-normalized target distribution with unknown normalizing constant $Z = \int q(x)\,dx$. Let $p_\theta(x)$ denote the sampler-induced distribution. Some previous work takes a normalizing flow model as the sampler and minimizes the KL divergence between the sampler-induced and target distributions regardless of the normalizing constant: $D_{\mathrm{KL}}(p_\theta, q) = \mathbb{E}_{x\sim p_\theta}\left[\log p_\theta(x) - \log q(x)\right] + \log Z$. Note that $Z$ is parameter-free and can be ignored during training. However, minimizing the KL divergence relies on the explicit log-likelihood of the sampler-induced distribution, which cannot be computed for a general transform. A transform with no explicit likelihood is referred to as an implicit sampler; in this paper, we focus on implicit samplers. Note that the annoying normalizing constant vanishes when considering the score function of a distribution, $s(x) = \nabla_x \log p(x)$.
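The push-forward sampling mechanism described above can be sketched in a few lines. The tiny fixed two-layer network below stands in for $G_\theta$ (the architecture and random weights are illustrative assumptions, not the paper's model); the point is that drawing a sample costs exactly one forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a tiny two-layer MLP transform G_theta.
# Shapes: latent dim 2 -> hidden width 32 -> sample dim 2.
W1 = rng.normal(scale=0.5, size=(2, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 2))
b2 = np.zeros(2)

def G_theta(z):
    """Push-forward transform x = G_theta(z): one forward pass per sample."""
    h = np.tanh(z @ W1 + b1)
    return h @ W2 + b2

# Sampling: draw easy-to-sample latent codes, then transform them.
z = rng.standard_normal((1000, 2))  # z ~ p_Z = N(0, I)
x = G_theta(z)                      # approximate samples from p_theta
print(x.shape)  # (1000, 2)
```

Because only the samples $x$ (and not $\log p_\theta(x)$) come out of this transform, the model is an implicit sampler: its likelihood is unavailable, which is exactly why KL-based training does not apply.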
Thus, we can use a score-based divergence to constructively get rid of the unknown normalizing constant for implicit samplers. The Fisher divergence (FD), a popular score-based probability divergence, and its variants have seen much success in recent years, especially in training deep generative models such as energy-based models (Kingma & Cun, 2010; Martens et al., 2012; Song et al., 2019) and score-based diffusion models (Song et al., 2020; Kingma et al., 2021; Vahdat et al., 2021; Song & Ermon, 2019; Ho et al., 2020). Assume $p(x)$ and $q(x)$ are two probability densities. The Fisher divergence between $p$ and $q$ is defined as $D_{\mathrm{FD}}(p, q) = \frac{1}{2}\mathbb{E}_{x\sim p(x)}\left[\left\|\nabla_x \log p(x) - \nabla_x \log q(x)\right\|_2^2\right]$. It is always nonnegative and equals zero if and only if $p(x) = q(x)$ almost surely under the probability measure $p$. The Fisher divergence is therefore suitable for measuring the dissimilarity between the sampler and an un-normalized target distribution, and hence for training the implicit sampler.

In this paper, we first propose a novel approach to learning a sampler by minimizing the Fisher divergence between the sampler and the un-normalized target distribution. We call such a sampler the Fisher Implicit Sampler. We then show that the proposed sampler is capable of handling relatively simple target distributions but fails on more challenging multi-modal targets. To remedy this issue and unlock the full potential of the Fisher Implicit Sampler, we additionally propose a novel adaptive training approach that trains the implicit sampler gradually on a sequence of annealed distributions instead of the target distribution itself. We anneal the target distribution to bridge the hard-to-sample target and an easy-to-sample prior. More precisely, we extend the target distribution $q(x)$ to a sequence of annealed distributions $\{q_k(x)\}_{k=0}^{K}$, where $q_K(x)$ is the target density and $q_0(x)$ is an easy-to-sample prior distribution, typically a normal distribution.
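The Fisher divergence above can be estimated by plain Monte Carlo whenever both scores are available. The sketch below checks such an estimator on a case with a closed form of our own choosing (not an example from the paper): for $p = \mathcal{N}(0, I)$ and $q = \mathcal{N}(\mu, I)$, the score difference is the constant $-\mu$, so $D_{\mathrm{FD}}(p, q) = \frac{1}{2}\|\mu\|_2^2$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_gaussian(x, mean):
    # Score of an isotropic unit-variance Gaussian: grad_x log N(x; mean, I) = -(x - mean).
    return -(x - mean)

def fisher_divergence_mc(samples, score_p, score_q):
    """Monte Carlo estimate of D_FD(p, q) = 0.5 * E_{x~p} ||s_p(x) - s_q(x)||^2."""
    diff = score_p(samples) - score_q(samples)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

# p = N(0, I), q = N(mu, I): the score difference is the constant -mu,
# so D_FD(p, q) = 0.5 * ||mu||^2 -- a handy sanity check.
mu = np.array([1.0, 2.0])
x = rng.standard_normal((100_000, 2))  # samples from p
est = fisher_divergence_mc(
    x,
    lambda s: score_gaussian(s, 0.0),
    lambda s: score_gaussian(s, mu),
)
print(est)  # 0.5 * (1 + 4) = 2.5
```

In the setting of the paper the target score $\nabla_x \log q(x)$ is available from the known potential, while the sampler score $\nabla_x \log p_\theta(x)$ is precisely what an implicit sampler lacks; handling that missing term is what the proposed training objective must address.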
The design of such an annealed path gradually reduces the learning difficulty for the sampler. Moreover, we find that a few MC correction steps after the sampler help the samples spread better at little cost, as also used in previous work (Wu et al., 2020; Arbel et al., 2021; Matthews et al., 2022). Putting these pieces together, we call our proposed sampler the Annealed Fisher Implicit Sampler (AFIS), as illustrated in Figure 1. We validate AFIS on sampling benchmarks, showing improvements over baseline approaches. The main contributions of our work are summarized as follows:

• We propose a novel loss function whose minimization is equivalent to minimizing the Fisher divergence between the sampler and target distributions. Our objective differs substantially from those in previous work.

• We provide an insightful analysis of why learning multi-modal targets by minimizing the Fisher divergence is difficult, and build our annealing technique for training samplers on this understanding.

• We combine a novel annealing technique and MC correction steps with our sampler, leading to improved sampling performance at little additional cost.
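As a minimal, self-contained illustration of an annealed path followed by MC correction steps, the sketch below anneals prior samples toward a bimodal 1-D target. The geometric interpolation $\log q_k = (1-\beta_k)\log q_0 + \beta_k \log q_K$, the finite-difference scores, and the unadjusted Langevin updates are our own assumed choices for the sketch; the paper's concrete construction and sampler may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D bimodal target: mixture of N(-3, 0.5^2) and N(3, 0.5^2),
# known only up to its normalizing constant.
def log_q_target(x):
    return np.logaddexp(-0.5 * ((x + 3.0) / 0.5) ** 2,
                        -0.5 * ((x - 3.0) / 0.5) ** 2)

def log_q0(x):
    # Easy-to-sample prior q_0 = N(0, 3^2), up to a constant.
    return -0.5 * (x / 3.0) ** 2

def log_qk(x, beta):
    # Assumed annealed path: geometric interpolation between prior and target.
    return (1.0 - beta) * log_q0(x) + beta * log_q_target(x)

def score(x, beta, eps=1e-4):
    # Finite-difference score of q_k; autograd would supply this in practice.
    return (log_qk(x + eps, beta) - log_qk(x - eps, beta)) / (2.0 * eps)

def langevin_correct(x, beta, n_steps=5, step=0.05):
    """A few unadjusted Langevin (MC correction) steps targeting q_k."""
    for _ in range(n_steps):
        x = x + step * score(x, beta) \
              + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

x = 3.0 * rng.standard_normal(2000)     # start from the prior q_0
for beta in np.linspace(0.1, 1.0, 10):  # anneal q_0 -> target
    x = langevin_correct(x, beta)
# Samples end up concentrated near the two modes at +-3.
```

Because each intermediate $q_k$ differs only slightly from $q_{k-1}$, the correction steps start from an already good initialization at every stage; this is the intuition behind the faster-mixing claim for the MC correction steps appended to the learned sampler.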

2.1. TRAIN IMPLICIT SAMPLERS WITH SCORE-BASED DIVERGENCE

The learning-to-sample problem arises in many application fields of machine learning. Assume we only have access to an un-normalized target distribution $q(x)$ (or its logarithm $\log q(x)$); the goal is to approximately sample from the target. In recent years, training a neural network-based transform to approximately sample from the target distribution has become an appealing method. Such a



Figure 1: Illustration of proposed Annealed Fisher Implicit Sampler.

