ANNEALED FISHER IMPLICIT SAMPLER

Abstract

Sampling from an un-normalized target distribution is an important problem in many scientific fields. An implicit sampler uses a parametric transform x = G_θ(z) to push forward an easy-to-sample latent code z and obtain a sample x. Such samplers are favored for their fast inference and flexible architectures, so it is appealing to train an implicit sampler to sample from an un-normalized target. In this paper, we propose a novel approach that trains an implicit sampler by minimizing the Fisher divergence between the sampler-induced distribution and the target distribution. We find that the trained sampler works well for relatively simple targets but may fail for more complicated multi-modal ones. To improve training on multi-modal targets, we propose an adaptive approach that trains the sampler to gradually learn a sequence of annealed distributions; we construct the annealed distribution path to bridge a simple distribution and the complicated target. With the annealed approach, the sampler is capable of handling challenging multi-modal targets. In addition, we introduce a few MCMC correction steps after the sampler to better spread the samples. We call our proposed sampler the Annealed Fisher Implicit Sampler (AFIS). We test AFIS on several sampling benchmarks, and the experiments show that AFIS outperforms baseline methods in many aspects. We also show in theory that the added MCMC correction steps mix faster when the learned sampler is used as the MCMC initialization.

1. INTRODUCTION

Sampling from an un-normalized distribution is an important problem in many scientific fields such as Bayesian statistics (Green, 1995), biology (Schütte et al., 1999), physics simulations (Olsson, 1995), machine learning (Andrieu et al., 2003), and so on. Typically, the problem is formulated as follows: given a known, differentiable, un-normalized target potential function log q(x), one wants to sample from the target distribution. Due to the success of deep neural networks, it has become increasingly popular to train a deep generative model to learn to sample (Hu et al., 2018; Wu et al., 2020; Matthews et al., 2022; Corenflos et al., 2021). Such learned models, which can approximately sample from the target distribution, are called samplers.

Training a neural network, i.e., a parameterized transform x = G_θ(z), to push forward an easy-to-sample latent code z ∼ p_Z(z) into a sample is an appealing approach. Such approaches are favored for fast sampling because they require only a single forward pass of the network. Let G_θ(·) denote the parametric transform, q(x) the un-normalized target distribution with unknown normalizing constant Z = ∫ q(x) dx, and p_θ(x) the sampler-induced distribution. Some previous work takes a normalizing flow model as the sampler and minimizes the KL divergence between the sampler-induced and target distributions regardless of the normalizing constant:

D_KL(p_θ, q) = E_{x∼p_θ}[log p_θ(x) − log q(x)] + log Z.

Since Z is parameter-free, the log Z term can be ignored during training. However, minimizing the KL divergence relies on the explicit log-likelihood of the sampler-induced distribution, which cannot be computed for a general transform. Such a transform with no explicit likelihood is referred to as an implicit sampler, and implicit samplers are the focus of this paper. Note that the troublesome normalizing constant vanishes when one considers the score function of a distribution, s(x) = ∇_x log p(x).
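To make the pushforward construction concrete, the following is a minimal sketch of an implicit sampler: a small, non-invertible MLP transform G_θ (the weights W1, b1, W2, b2 are hypothetical placeholders, not the paper's architecture) maps Gaussian latent codes to samples in a single forward pass. Because G is not invertible, the induced density p_θ(x) has no closed form, which is exactly what makes the sampler "implicit".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a one-hidden-layer transform G_theta.
W1 = rng.normal(size=(2, 16))
b1 = rng.normal(size=16)
W2 = rng.normal(size=(16, 2))
b2 = rng.normal(size=2)

def G(z):
    """Pushforward x = G_theta(z). The map is not invertible in general,
    so the induced density p_theta(x) has no explicit likelihood."""
    h = np.tanh(z @ W1 + b1)
    return h @ W2 + b2

# Sampling costs a single forward pass: draw latent codes, transform them.
z = rng.standard_normal((1000, 2))   # z ~ p_Z = N(0, I)
x = G(z)                             # 1000 samples from p_theta
```

This single-pass structure is what gives implicit samplers their speed advantage over iterative methods such as MCMC, at the cost of losing an explicit log p_θ(x) needed by the KL objective.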
Thus, we can use a score-based divergence to constructively get rid of the unknown normalizing constant for implicit samplers. The Fisher divergence (FD), a popular score-based probability divergence, and its variants have achieved much success in recent years, especially in training deep generative models such as energy-based models (Kingma & Cun, 2010; Martens et al., 2012; Song et al., 2019).
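The key property can be illustrated with a small Monte Carlo estimate. Taking the common form FD(p, q) = E_{x∼p} ||s_p(x) − s_q(x)||², the sketch below (a toy example with two unit-variance Gaussians, not the paper's training objective) evaluates the divergence using only score functions, so the normalizing constants never appear; for N(0, 1) vs. N(2, 1) the closed-form value is (0 − 2)² = 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_gauss(x, mean, var):
    """Score of N(mean, var): grad_x log p(x) = -(x - mean) / var.
    Note the normalizing constant of the Gaussian drops out entirely."""
    return -(x - mean) / var

# Monte Carlo estimate of the Fisher divergence
#   FD(p, q) = E_{x~p} || s_p(x) - s_q(x) ||^2
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples from p = N(0, 1)
sp = score_gauss(x, 0.0, 1.0)                     # score of p
sq = score_gauss(x, 2.0, 1.0)                     # score of q = N(2, 1)
fd = np.mean((sp - sq) ** 2)                      # closed form here: 4.0
```

Because only ∇_x log q(x) is needed, the same estimator applies when q is known only up to the constant Z, which is the setting of this paper.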

