FLOW ANNEALED IMPORTANCE SAMPLING BOOTSTRAP

Abstract

Normalizing flows are tractable density models that can approximate complicated target distributions, e.g. Boltzmann distributions of physical systems. However, current methods for training flows either suffer from mode-seeking behavior, use samples from the target generated beforehand by expensive MCMC methods, or use stochastic losses that have high variance. To avoid these problems, we augment flows with annealed importance sampling (AIS) and minimize the mass-covering α-divergence with α = 2, which minimizes importance weight variance. Our method, Flow AIS Bootstrap (FAB), uses AIS to generate samples in regions where the flow is a poor approximation of the target, facilitating the discovery of new modes. We apply FAB to multimodal targets and show that we can approximate them very accurately where previous methods fail. To the best of our knowledge, we are the first to learn the Boltzmann distribution of the alanine dipeptide molecule using only the unnormalized target density, without access to samples generated via Molecular Dynamics (MD) simulations: FAB produces better results than training via maximum likelihood on MD samples while using 100 times fewer target evaluations. After reweighting the samples, we obtain unbiased histograms of dihedral angles that are almost identical to the ground truth.

1. INTRODUCTION

Approximating intractable distributions is a challenging task whose solution has relevance in many real-world applications. A prominent example involves approximating the Boltzmann distribution of a given molecule. In this case, the unnormalized density can be obtained by physical modeling and is given by e^{-u(x)}, where x are the 3D atomic coordinates and u(·) returns the dimensionless energy of the system. Drawing independent samples from this distribution is difficult (Lelièvre et al., 2010). It is typically done by running expensive Molecular Dynamics (MD) simulations (Leimkuhler & Matthews, 2015), which yield highly correlated samples and require long simulation times. An alternative is given by normalizing flows. These are tractable density models parameterized by neural networks. They can generate a batch of independent samples with a single forward pass, and any bias in the samples can be eliminated by reweighting via importance sampling. Flows are called Boltzmann generators when they approximate Boltzmann distributions (Noé et al., 2019). Recently, there has been a growing interest in these methods (Dibak et al., 2022; Köhler et al., 2021; Liu et al., 2022) as they have the potential to avoid the limitations of MD simulations. Most current approaches to train Boltzmann generators rely on MD samples since these are required for the estimation of the flow parameters by maximum likelihood (ML) (Wu et al., 2020). Alternatively, flows can be trained without MD samples by minimizing the Kullback-Leibler (KL) divergence with respect to the target distribution. Wirnsberger et al. (2022) followed this approach to approximate the Boltzmann distribution of atomic solids with up to 512 atoms. However, the KL divergence suffers from mode-seeking behavior, which severely deteriorates the performance of this approach with multimodal target distributions (Stimper et al., 2022).
On the other hand, mass-covering objectives such as the forward KL divergence suffer from the high variance of estimates computed with samples from the flow. To address these challenges, we present a new method for training flows: Flow AIS Bootstrap (FAB). Our main contributions are as follows:

1. We propose to use the α-divergence with α = 2 as our training objective, which is mass-covering and minimizes importance weight variance. At test time, an importance sampling distribution with low α-divergence (with α = 2) may be used to approximate expectations with respect to the target with low variance. This objective is challenging to estimate during training. To approximate it, we use annealed importance sampling (AIS) with the flow as the initial distribution and, as the AIS target, the minimum-variance distribution for estimating the α-divergence. AIS returns samples that provide a higher-quality training signal than samples from the flow, as it focuses on the regions that contribute the most to the α-divergence loss.

2. We reduce the computational cost of our method by introducing a scheme to re-use samples via a prioritized replay buffer.

3. We apply FAB to a toy 2D Gaussian mixture distribution, the 32-dimensional "Many Well" problem, and the Boltzmann distribution of alanine dipeptide. In these experiments, we outperform competing approaches and, to the best of our knowledge, we are the first to successfully train a Boltzmann generator on alanine dipeptide using only the unnormalized target density. In particular, we use over 100 times fewer target evaluations than a Boltzmann generator trained with MD samples while producing a better approximation to the target.
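The sample re-use scheme in contribution 2 can be illustrated with a minimal sketch. The class below is a hypothetical simplification (the class name, fixed-capacity eviction, and the exact priority rule are our own illustrative choices, not the paper's precise scheme): it stores samples together with their log importance weights and draws minibatches with probability proportional to those weights.

```python
import numpy as np

rng = np.random.default_rng(0)


class PrioritizedBuffer:
    """Minimal prioritized replay buffer sketch: stores samples with their
    log importance weights and draws minibatches with probability
    proportional to the (normalized) weights."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.x, self.log_w = [], []

    def add(self, x, log_w):
        self.x.extend(x)
        self.log_w.extend(log_w)
        # Keep only the most recent `capacity` entries.
        self.x = self.x[-self.capacity:]
        self.log_w = self.log_w[-self.capacity:]

    def sample(self, batch_size):
        log_w = np.array(self.log_w)
        probs = np.exp(log_w - log_w.max())  # shift for numerical stability
        probs /= probs.sum()
        idx = rng.choice(len(self.x), size=batch_size, p=probs)
        return np.array(self.x)[idx], idx


# Toy usage: item 9 has a much larger log weight, so it dominates the batch.
buf = PrioritizedBuffer(capacity=1000)
buf.add(x=list(range(10)), log_w=[0.0] * 9 + [5.0])
batch, idx = buf.sample(batch_size=64)
```

In FAB the stored weights would come from AIS, and the flow is trained on the replayed minibatches, amortizing the cost of each AIS run over several gradient steps.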

2. BACKGROUND

Normalizing flows Given a random variable z with distribution q(z), a normalizing flow (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015; Papamakarios et al., 2021) uses an invertible map F : R^d → R^d to transform z, yielding the random variable x = F(z) with distribution q(x) = q(z) |det(J_F(z))|^{-1}, where J_F(z) = ∂F/∂z is the Jacobian of F. If we parameterize F, we can use the resulting model to approximate a target distribution p. To simplify our notation, we will assume the target density p(x) is normalized, i.e., it integrates to 1, but the methods described here are equally applicable when this is not the case. If samples from the target distribution are available, the flow can be trained via ML. If only the target density p(x) is given, the flow can then be trained by minimizing the reverse KL divergence between q and p, i.e., KL(q∥p) = ∫ q(x) log[q(x)/p(x)] dx, which is estimated via Monte Carlo using samples from q.
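The change-of-variables density and the Monte Carlo estimate of the reverse KL can be sketched as follows. This is a minimal NumPy example with a hypothetical 1D affine flow x = a·z + b (so q is a Gaussian); the target is a normalized Gaussian chosen purely for illustration, so the estimate can be checked against the analytic KL value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D affine flow x = F(z) = a*z + b with base z ~ N(0, 1),
# so q is the N(b, a^2) density obtained via the change of variables.
a, b = 2.0, 1.0


def flow_sample_and_log_q(n):
    z = rng.standard_normal(n)
    x = a * z + b
    log_base = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)  # log q(z)
    log_det = np.log(np.abs(a))                         # log |det J_F(z)|
    return x, log_base - log_det                        # log q(x)


def log_p(x):
    # Example target: N(0, 2^2), normalized here so the Monte Carlo
    # estimate is directly comparable to the analytic KL.
    return -0.5 * (x / 2.0) ** 2 - 0.5 * np.log(2.0 * np.pi * 4.0)


# Monte Carlo estimate of the reverse KL(q || p) from samples of q.
# Analytic value for N(1, 4) vs N(0, 4): (1 - 0)^2 / (2 * 4) = 0.125.
x, log_q = flow_sample_and_log_q(100_000)
kl_estimate = np.mean(log_q - log_p(x))
```

With an unnormalized p, the same estimator is shifted by the constant log Z, which does not affect the gradients used for training.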

Alpha divergence

An alternative to the KL divergence is the α-divergence (Zhu & Rohwer, 1995; Minka, 2005; Müller et al., 2019; Bauer & Mnih, 2021; Campbell et al., 2021), defined by

D_α(p∥q) = −(∫ p(x)^α q(x)^{1−α} dx) / (α(1 − α)).

The α-divergence is mode-seeking for α ≤ 0 and mass-covering for α ≥ 1 (Minka, 2005), as shown in Figure 1. When α = 2, minimizing the α-divergence is equivalent to minimizing the variance of the importance sampling weights w_IS(x) = p(x)/q(x), which is desirable if importance sampling will be used to eliminate bias in the samples from q at test time.

Annealed importance sampling AIS begins by sampling from an initial distribution x_1 ∼ p_0 = q, given by the flow in our case, and then transitioning via MCMC through a sequence of intermediate distributions, p_1 to p_{M−1}, to produce a sample x_M closer to the target distribution g = p_M (Neal, 2001). Each transition generates an intermediate sample x_j by running a few steps of a Markov chain, initialized with the previous intermediate sample x_{j−1}, that leaves the intermediate distribution p_j invariant.
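The AIS procedure above can be sketched in a few lines of NumPy. The toy choices here are assumptions for illustration: the initial distribution p_0 = q is a standard Gaussian standing in for the flow, the AIS target g is N(3, 1) rather than the paper's minimum-variance distribution, the intermediate densities follow the common geometric schedule p_j ∝ q^{1−β_j} g^{β_j}, and the MCMC transitions are plain Metropolis steps.

```python
import numpy as np

rng = np.random.default_rng(0)


# Toy setup: p0 = q = N(0, 1) (standing in for the flow), AIS target
# g = N(3, 1), geometric intermediate densities p_j ∝ q^(1-β_j) * g^(β_j).
def log_q(x):
    return -0.5 * x**2


def log_g(x):
    return -0.5 * (x - 3.0) ** 2


M = 50                                   # number of annealing levels
betas = np.linspace(0.0, 1.0, M + 1)


def log_pj(x, beta):
    return (1.0 - beta) * log_q(x) + beta * log_g(x)


n = 5000
x = rng.standard_normal(n)               # x_1 ~ p_0 = q
log_w = np.zeros(n)                      # running log importance weights

for j in range(1, M + 1):
    # Weight update: w *= p_j(x) / p_{j-1}(x), evaluated at the current sample.
    log_w += log_pj(x, betas[j]) - log_pj(x, betas[j - 1])
    # A few Metropolis steps leaving p_j invariant.
    for _ in range(2):
        prop = x + rng.standard_normal(n)
        accept = np.log(rng.uniform(size=n)) < log_pj(prop, betas[j]) - log_pj(x, betas[j])
        x = np.where(accept, prop, x)

# Self-normalized importance sampling estimate of E_g[x] (true value: 3).
w = np.exp(log_w - log_w.max())
ais_mean = np.sum(w * x) / np.sum(w)
```

Even with few MCMC steps per level the estimate remains consistent, because the accumulated weights exactly account for the mismatch between consecutive intermediate distributions; more steps simply reduce the weight variance.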



FAB uses the flow in combination with AIS to estimate a loss in order to improve the flow. Thus we use "bootstrap" in the name of our method to mean "using one's existing resources to improve oneself". We refer to the reverse KL divergence as just "KL divergence", following standard practice in the literature.

