Maximum Likelihood Learning of Energy-Based Models for Simulation-Based Inference

Abstract

We introduce two synthetic likelihood methods for Simulation-Based Inference (SBI), to conduct either amortized or targeted inference from experimental observations when a high-fidelity simulator is available. Both methods learn a conditional energy-based model (EBM) of the likelihood using synthetic data generated by the simulator, conditioned on parameters drawn from a proposal distribution. The learned likelihood can then be combined with any prior to obtain a posterior estimate, from which samples can be drawn using MCMC. Our methods uniquely combine a flexible Energy-Based Model and the minimization of a KL loss: this is in contrast to other synthetic likelihood methods, which either rely on normalizing flows, or minimize score-based objectives; choices that come with known pitfalls. Our first method, Amortized Unnormalized Neural Likelihood Estimation (AUNLE), introduces a tilting trick during training that significantly lowers the computational cost of inference by enabling the use of efficient MCMC techniques. Our second method, Sequential UNLE (SUNLE), uses a new conditional EBM training technique to re-use simulation data and improve posterior accuracy for a specific dataset. We demonstrate the properties of both methods on a range of synthetic datasets, and apply them to a neuroscience model of the pyloric network in the crab, matching the performance of other synthetic likelihood methods at a fraction of the simulation budget.

1. Introduction

Simulation-based modeling expresses a system as a probabilistic program (Ghahramani, 2015), which describes, in a mechanistic manner, how samples from the system are drawn given the parameters of that system. This probabilistic program can be concretely implemented in a computer, as a simulator, from which synthetic parameter-sample pairs can be drawn. This setting is common in many scientific and engineering disciplines, such as stellar events in cosmology (Alsing et al., 2018; Schafer & Freeman, 2012), particle collisions in a particle accelerator for high energy physics (Eberl, 2003; Sjöstrand et al., 2008), and biological neural networks in neuroscience (Markram et al., 2015; Pospischil et al., 2008). Describing such systems using a probabilistic program often turns out to be easier than specifying the underlying probabilistic model with a tractable probability distribution. We consider the task of inference for such systems, which consists of computing the posterior distribution of the parameters given observed (non-synthetic) data. When a likelihood function of the simulator is available alongside a prior belief over the parameters, inferring the posterior distribution of the parameters given data is possible using Bayes' rule. Traditional inference methods such as variational techniques (Wainwright & Jordan, 2008) or Markov Chain Monte Carlo (Andrieu et al., 2003) can then be used to produce approximate posterior samples of the parameters that are likely to have generated the observed data. Unfortunately, the likelihood function of a simulator is computationally intractable in general, making traditional inference techniques inapplicable to simulation-based modelling. Simulation-Based Inference (SBI) methods (Cranmer et al., 2020) are specifically designed to perform inference in the presence of a simulator with an intractable likelihood.
These methods repeatedly generate synthetic data using the simulator to build an estimate of the posterior that can either be used for any observed data (resulting in a so-called amortized inference procedure) or be targeted to a specific observation. While the accuracy of inference increases as more simulations are run, so does the computational cost, especially when the simulator is expensive, as is common in many physics applications (Cranmer et al., 2020). In high-dimensional settings, early simulation-based inference techniques such as Approximate Bayesian Computation (ABC) (Marin et al., 2012) struggle to generate high-quality posterior samples at a reasonable cost, since ABC repeatedly rejects simulations that fail to reproduce the observed data (Beaumont et al., 2002). More recently, model-based inference methods (Wood, 2010; Papamakarios et al., 2019; Hermans et al., 2020; Greenberg et al., 2019), which encode information about the simulator via a parametric density(-ratio) estimator of the data, have been shown to drastically reduce the number of simulations needed to reach a given inference precision (Lueckmann et al., 2021). The computational gains are particularly important when comparing ABC to targeted SBI methods, implemented in a multi-round procedure that refines the estimation of the model around the observed data by sequentially simulating data points that are closer to the observed ones (Greenberg et al., 2019; Papamakarios et al., 2019; Hermans et al., 2020). Previous model-based SBI methods have used their parametric estimator to learn the likelihood (i.e. the conditional density specifying the probability of an observation being simulated given a specific parameter set; Wood 2010; Papamakarios et al. 2019; Pacchiardi & Dutta 2022), the likelihood-to-marginal ratio (Hermans et al., 2020), or the posterior directly (Greenberg et al., 2019).
We focus in this paper on likelihood-based (also called Synthetic Likelihood; SL, in short) methods, of which two main instances exist: (Sequential) Neural Likelihood (SNL; Papamakarios et al., 2019), which learns a likelihood estimate using a normalizing flow trained by optimizing a Maximum Likelihood (ML) loss; and Score Matched Neural Likelihood Estimation (SMNLE; Pacchiardi & Dutta, 2022), which learns an unnormalized (or Energy-Based; LeCun et al. 2006) likelihood model trained using conditional score matching. Recently, SNL was applied successfully to challenging neural data (Deistler et al., 2021). However, limitations remain in the approaches taken by both SNL and SMNLE. On the one hand, flow-based models may need very complex architectures to properly approximate distributions with rich structure such as multi-modality (Kong & Chaudhuri, 2020; Cornish et al., 2020). On the other hand, score matching, the objective of SMNLE, minimizes the Fisher Divergence between the data and the model, a divergence that fails to capture important features of probability distributions such as mode proportions (Wenliang & Kanagawa, 2020; Zhang et al., 2022). This is unlike Maximum-Likelihood-based objectives, whose maximizers satisfy attractive theoretical properties (Bickel & Doksum, 2015). Contributions. In this work, we introduce Amortized Unnormalized Neural Likelihood Estimation (AUNLE) and Sequential UNLE (SUNLE), a pair of SBI Synthetic Likelihood methods performing respectively amortized and targeted inference. Both methods learn a Conditional Energy-Based Model of the simulator's likelihood using a Maximum Likelihood (ML) objective, and perform MCMC on the posterior estimate obtained after invoking Bayes' rule. While posteriors arising from conditional EBMs exhibit a particular form of intractability called double intractability, which requires the use of tailored MCMC techniques for inference, we train AUNLE using a new approach which we call tilting.
This approach automatically removes this intractability in the final posterior estimate, making AUNLE compatible with standard MCMC methods and significantly reducing the computational burden of inference. Our second method, SUNLE, departs from AUNLE by using a new training technique for conditional EBMs, suited to settings where the proposal distribution is not available analytically. While SUNLE returns a doubly intractable posterior, we show that inference can be carried out accurately through robust implementations of doubly-intractable MCMC methods. We demonstrate the properties of AUNLE and SUNLE on an array of synthetic benchmark models (Lueckmann et al., 2021), and apply SUNLE to a neuroscience model of the crab Cancer borealis, increasing posterior accuracy over prior art while needing only a fraction of the simulations required by the most efficient prior method (Glöckler et al., 2021).

2. Background

Simulation-Based Inference (SBI) refers to the set of methods aimed at estimating the posterior p(θ|x_o) of some unobserved parameters θ given an observed variable x_o recorded from a physical system, and a prior p(θ). In SBI, one assumes access to a simulator G : (θ, u) ↦ y = G(θ, u), from which samples y|θ can be drawn, and whose associated likelihood p(y|θ) accurately matches the likelihood p(x|θ) of the physical system of interest. Here, u represents draws of all random variables involved in performing draws of x|θ. By a slight abuse of notation, we will not distinguish between the physical random variable x representing data from the physical system of interest and the simulated random variable y drawn from the simulator: we will use x for both. The complexity of the simulator (Cranmer et al., 2020) prevents access to a simple form for the likelihood p(x|θ), making standard Bayesian inference impossible. Instead, SBI methods perform inference by drawing parameters from a proposal distribution π(θ), and use these parameters as inputs to the simulator G to obtain a set of simulated pairs (x, θ) from which they compute a posterior estimate of p(θ|x). Specific SBI submethods have been designed to handle separately the case of amortized inference, where the practitioner seeks a posterior estimate valid for any x_o (which might not be known a priori), and targeted inference, where the posterior estimate should maximize accuracy for a specific observed variable x_o.

Figure 1: Performance of SMNLE, NLE and AUNLE trained using a simulator with a bimodal likelihood p(x|θ) and a Gaussian prior p(θ), using 1000 samples. Top: simulator likelihood p(x|θ_0) for some fixed θ_0. Bottom: posterior estimates.
While amortized inference methods set their proposal distribution π to be the prior p, targeted inference methods iteratively refine their proposal π to focus their simulated observations around the targeted x o through a sequence of simulation-training rounds (Papamakarios et al., 2019) .
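To make the probabilistic-program view concrete, the setup above can be sketched in a few lines of numpy. This is a hypothetical toy simulator (not one used in the paper): the randomness u is made explicit through the random generator, and parameter-sample pairs are drawn with the proposal π set to the prior.

```python
import numpy as np

def simulator(theta, rng):
    """Toy probabilistic program G(theta, u). Here p(x|theta) happens to be
    N(theta, 1), but SBI methods only ever call G, never this analytic form."""
    u = rng.standard_normal()        # u: all randomness used by the program
    return theta + u                 # x = G(theta, u)

rng = np.random.default_rng(0)
pairs = []
for _ in range(1000):
    theta = rng.uniform(-5.0, 5.0)   # proposal pi(theta), here set to the prior
    pairs.append((theta, simulator(theta, rng)))
```

The simulated pairs are then the only interface between the simulator and the density estimator trained downstream.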

2.1. (Conditional) Energy-Based Models.

Energy-Based Models (LeCun et al., 2006) are unnormalized probabilistic models of the form

$$q_\psi(x) = \frac{e^{-E_\psi(x)}}{Z(\psi)}, \qquad Z(\psi) = \int e^{-E_\psi(x)}\,dx,$$

where Z(ψ) is the intractable normalizing constant of the model, and E_ψ is called the energy function, usually set to be a neural network with weights ψ. By directly modelling the density p(x) of the data through a flexible energy function, simple EBMs can capture rich geometries and multi-modality, whereas other model classes such as normalizing flows may require a more complex architecture (Cornish et al., 2020). The flexibility of EBMs comes at the cost of an intractable density q_ψ(x) due to the presence of the normalizer Z(ψ), increasing the challenge of both training and sampling. In particular, an EBM's log-likelihood log q_ψ and its gradient ∇_ψ log q_ψ both contain terms involving the (intractable) normalizer Z(ψ):

$$\log q_\psi(x) = -E_\psi(x) - \underbrace{\log Z(\psi)}_{\text{intractable}}, \qquad \nabla_\psi \log q_\psi(x) = -\nabla_\psi E_\psi(x) + \underbrace{\mathbb{E}_{x' \sim q_\psi}\big[\nabla_\psi E_\psi(x')\big]}_{\text{intractable}}, \tag{1}$$

making exact training of EBMs via Maximum Likelihood impossible. Approximate likelihood optimization can be performed using a gradient-based algorithm where, at each iteration k, the intractable expectation (under the EBM q_{ψ_k}) present in ∇_ψ log q_{ψ_k} is replaced by one under a particle approximation q̂ = Σ_{i=1}^N w_i δ_{y_i} of q_{ψ_k}. The particles y_i forming q̂ are traditionally set to be samples from an MCMC chain with invariant distribution q_{ψ_k}, with uniform weights w_i = 1/N, while recent work on EBMs for high-dimensional image data uses an adaptation of Langevin Dynamics (Raginsky et al., 2017; Du & Mordatch, 2019; Nijkamp et al., 2019; Kelly & Grathwohl, 2021). We outline the traditional ML learning procedure for EBMs in Algorithm 2, where make_particle_approx(q, q̂_0) is a generic routine producing a particle approximation of a target unnormalized density q from an initial particle approximation q̂_0.
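The approximate ML procedure above can be sketched in a few lines of numpy (a toy sketch, not the paper's jax implementation). We use a hypothetical one-parameter energy E_ψ(x) = (x − ψ)²/2, for which q_ψ is N(ψ, 1) and maximum likelihood should recover ψ ≈ mean of the data; the intractable expectation in the gradient of Equation (1) is replaced by an average over persistent Langevin particles.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=500)    # training data

# Hypothetical one-parameter energy E_psi(x) = (x - psi)^2 / 2, so that
# q_psi = N(psi, 1) and ML should recover psi close to data.mean().
grad_E_psi = lambda x, psi: -(x - psi)   # dE/dpsi
grad_E_x = lambda x, psi: x - psi        # dE/dx, used by the Langevin sampler

psi, particles, eps, lr = 0.0, rng.standard_normal(100), 0.1, 0.5
for _ in range(200):
    # refresh the persistent particle approximation of q_psi (Langevin steps)
    for _ in range(5):
        noise = rng.standard_normal(100)
        particles = (particles - eps * grad_E_x(particles, psi)
                     + np.sqrt(2 * eps) * noise)
    # gradient estimate of Eq. (1): data term plus expectation under particles
    g = -grad_E_psi(data, psi).mean() + grad_E_psi(particles, psi).mean()
    psi += lr * g
```

The particles are persisted across outer iterations, mirroring the persistent-initialization strategy discussed in the experiments section.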
Energy-Based Models naturally extend to both joint EBMs q_ψ(θ, x) = e^{-E_ψ(θ,x)}/Z(ψ) (Kelly & Grathwohl, 2021; Grathwohl et al., 2020) and conditional EBMs (CEBMs; Khemakhem et al. 2020; Pacchiardi & Dutta 2022) of the form

$$q_\psi(x|\theta) = \frac{e^{-E_\psi(x,\theta)}}{Z(\theta,\psi)}, \qquad Z(\theta,\psi) = \int e^{-E_\psi(x,\theta)}\,dx. \tag{2}$$

Unlike joint and standard EBMs, conditional EBMs define a family of conditional densities q_ψ(x|θ), each of which is endowed with its own intractable normalizer Z(θ, ψ).

2.2. Synthetic Likelihood Methods for SBI

Synthetic Likelihood (SL) methods (Wood, 2010; Papamakarios et al., 2019; Pacchiardi & Dutta, 2022) form a class of SBI methods that learn a conditional density model q_ψ(x|θ) of the unknown likelihood p(x|θ) for every possible pair of observations and parameters (x, θ). The set {q_ψ(x|θ), ψ ∈ Ψ} is a model class parameterised by some vector ψ ∈ Ψ, which recent methods set to be the weights of a neural network. We describe the existing Neural SL variants to date. Neural Likelihood Estimation (NLE; Papamakarios et al. 2019) sets q_ψ to a (normalized) flow-based model, optimized by maximizing the average conditional log-likelihood E_{π(θ)p(x|θ)}[log q_ψ(x|θ)]. NLE performs inference by invoking Bayes' rule to obtain an unnormalized posterior estimate

$$p_\psi(\theta|x) = \frac{q_\psi(x|\theta)\,p(\theta)}{\int q_\psi(x|\theta')\,p(\theta')\,d\theta'} \propto p(\theta)\,q_\psi(x|\theta),$$

from which samples can be drawn using either MCMC or Variational Inference (Glöckler et al., 2021). Score Matched Neural Likelihood Estimation (SMNLE; Pacchiardi & Dutta 2022) models the unknown likelihood using a conditional Energy-Based Model q_ψ(x|θ) of the form of Equation (2), trained using a score matching objective adapted for conditional density estimation. The use of an unnormalized likelihood model makes the posterior estimate obtained via Bayes' rule known only up to a θ-dependent term:

$$q_\psi(\theta|x) \propto p(\theta)\,q_\psi(x|\theta) \propto \frac{e^{-E_\psi(x,\theta)}\,p(\theta)}{\underbrace{Z(\theta,\psi)}_{\text{intractable}}}, \qquad Z(\theta,\psi) = \int e^{-E_\psi(x,\theta)}\,dx. \tag{3}$$

Posteriors of this form are called doubly intractable posteriors (Møller et al., 2006). In the case where the likelihood q_ψ(x|θ) can be sampled from, Møller et al. (2006) and Murray et al. (2006) proposed tractable MCMC methods that draw an auxiliary variable y ~ q_ψ(x|θ) at every iteration to compute the acceptance probability of the proposed sample. Importantly, these MCMC methods still admit q_ψ(θ|x) as their invariant distribution, making inference as exact as in standard MCMC methods.
In the case of SMNLE, however, q_ψ(x|θ) cannot be tractably sampled from; SMNLE instead uses an approximate doubly-intractable method, which replaces the exact sample y by the result of an MCMC chain with invariant distribution q_ψ(x|θ). Even though this variant introduces an additional approximation not present in standard ("singly" intractable) MCMC algorithms, the distance between the true posterior and the distribution of the MCMC samples can be bounded under specific assumptions (Alquier et al., 2016). Neither the likelihood objective of NLE nor the score-based objective of SMNLE involves the analytic expression of the proposal π, making it easy to adapt these methods to either amortized or targeted inference. To address the limitations of both methods mentioned in the introduction, we next propose a method that combines the use of flexible Energy-Based Models, as in SMNLE, with optimization of a likelihood loss, as in NLE.
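As an illustration of the auxiliary-variable trick above, here is a minimal numpy sketch of the exchange algorithm of Murray et al. (2006), under the simplifying assumption that the unnormalized likelihood can be sampled exactly. The toy unnormalized likelihood (with a genuinely θ-dependent normalizer) and the prior are hypothetical choices for illustration; the auxiliary term in the acceptance ratio cancels the unknown ratio Z(θ', ψ)/Z(θ, ψ).

```python
import numpy as np

rng = np.random.default_rng(0)
x_o = 1.0

# Hypothetical unnormalized likelihood q~(x|theta) = exp(-(x-theta)^2 e^{-theta}/2),
# i.e. N(theta, e^theta) missing its normalizer Z(theta) = sqrt(2*pi*e^theta).
log_q_unnorm = lambda x, th: -0.5 * (x - th) ** 2 * np.exp(-th)
log_prior = lambda th: -0.5 * th ** 2              # N(0, 1) prior

theta, samples = 0.0, []
for _ in range(20000):
    prop = theta + 0.5 * rng.standard_normal()
    # exact auxiliary draw y ~ q(.|prop): possible here since q is Gaussian
    y = prop + np.exp(prop / 2) * rng.standard_normal()
    # exchange-algorithm ratio: the auxiliary terms cancel Z(prop)/Z(theta)
    log_alpha = (log_q_unnorm(x_o, prop) + log_prior(prop)
                 - log_q_unnorm(x_o, theta) - log_prior(theta)
                 + log_q_unnorm(y, theta) - log_q_unnorm(y, prop))
    if np.log(rng.uniform()) < log_alpha:
        theta = prop
    samples.append(theta)
```

Because the chain leaves the exact doubly intractable posterior invariant, its long-run average can be checked against a direct quadrature of p(θ)q̃(x_o|θ)/Z(θ).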

3. Unnormalized Neural Likelihood Estimation

In this section, we present our two methods, Amortized-UNLE and Sequential-UNLE. Both AUNLE and SUNLE approximate the unknown likelihood p(x|θ) for any possible pair (x, θ) using a conditional Energy-Based Model q_ψ(x|θ) as in Equation (2), where E_ψ is a neural network. Additionally, AUNLE and SUNLE are both trained using a likelihood-based loss; however, their training objectives and inference phases differ to account for the specificities of amortized and targeted inference, as detailed below.

3.1. Amortized UNLE

Given a likelihood model q_ψ(x|θ), a natural learning procedure would fit a model q_ψ(x|θ)π(θ) of the true "joint synthetic" distribution π(θ)p(x|θ), as NLE does. However, we show that using an alternative, tilted, version of this model allows us to compute a posterior that is more tractable than those computed by other SL methods relying on conditional EBMs, such as SMNLE (Pacchiardi & Dutta, 2022). Our method, AUNLE, fits a joint probabilistic model q_{ψ,π} of the form

$$q_{\psi,\pi}(x,\theta) := \frac{\pi(\theta)\,e^{-E_\psi(x,\theta)}}{Z_\pi(\psi)}, \qquad Z_\pi(\psi) = \int \pi(\theta)\,e^{-E_\psi(x,\theta)}\,dx\,d\theta, \tag{4}$$

by maximizing its log-likelihood L_a(ψ) := E_{π(θ)p(x|θ)}[log q_{ψ,π}(x, θ)] using an instance of Algorithm 2. The gain in tractability offered by AUNLE is a direct consequence of the following proposition.

Proposition 1. Let P_Ψ := {q_ψ(·|θ), ψ ∈ Ψ} and q_ψ ∈ P_Ψ. Then we have:
- (likelihood modelling) q_{ψ,π}(x|θ) = q_ψ(x|θ);
- (joint model tilting) q_{ψ,π}(x, θ) = f(θ)π(θ)q_ψ(x|θ), where f(θ) := Z(θ, ψ)/Z_π(ψ);
- ((Z, θ)-uniformization) if p(·|θ) ∈ P_Ψ, then the ψ⋆ maximizing L_a(ψ) satisfies q_{ψ⋆}(x|θ) = p(x|θ) and Z(θ, ψ⋆) = Z_π(ψ⋆).

Proof. The first point follows by holding θ fixed in q_{ψ,π}(x, θ). For the second point, note that

$$q_{\psi,\pi}(x,\theta) = \frac{Z(\theta,\psi)}{Z(\theta,\psi)}\,\frac{\pi(\theta)\,e^{-E_\psi(x,\theta)}}{Z_\pi(\psi)} = \frac{Z(\theta,\psi)}{Z_\pi(\psi)}\,\pi(\theta)\,\frac{e^{-E_\psi(x,\theta)}}{Z(\theta,\psi)} = f(\theta)\,\pi(\theta)\,q_\psi(x|\theta).$$

For the last point, note that at the optimum, q_{ψ⋆,π}(x, θ) = π(θ)p(x|θ). Integrating out x on both sides of the equality yields f(θ)π(θ) = π(θ), proving the result.

Proposition 1 shows that AUNLE indeed learns a likelihood model q_ψ(x|θ) through a joint model q_{ψ,π} tilting the prior π with f(θ). Importantly, this tilting guarantees that the optimal likelihood model has a normalizing function Z(θ, ψ) that is constant (or uniform) in θ, reducing AUNLE's posterior to a standard unnormalized posterior q_{ψ⋆}(θ|x) = p(θ) e^{-E_{ψ⋆}(x,θ)}/Z_π(ψ⋆), from which samples can be drawn using classical MCMC techniques, as for NLE. AUNLE's posterior contrasts with the posterior of SMNLE (Pacchiardi & Dutta, 2022), an amortized SBI method which also computes a posterior using a conditional EBM of the likelihood, but whose posterior remains doubly intractable, as discussed in Section 2. The gain in tractability of AUNLE's posterior is beneficial from an inference accuracy standpoint, as it removes the need for an otherwise approximate doubly-intractable technique when performing inference. Importantly, this property is also beneficial from a computational cost standpoint, since approximate doubly-intractable methods require running an (inner) MCMC chain with target q_{ψ⋆}(x|θ) for every iteration of the (outer) MCMC chain with target q_{ψ⋆}(θ|x), roughly squaring the computational cost of standard MCMC methods. This computational advantage is all the more important since AUNLE returns an amortized posterior, valid for any observed data x_o, which may thus be sampled from more than once. We confirm in Appendix B.3 that the (Z, θ)-uniformization of AUNLE's posterior, which is only guaranteed in a well-specified setting at the true optimum ψ⋆, holds well in practice.

Algorithm 1: Amortized-UNLE
    Input: prior p(θ), simulator G, budget N
    Output: posterior estimate q_ψ(θ|x)
    Initialize ψ_0, q_{ψ_0,π} ∝ e^{-E_{ψ_0}(x,θ)} π(θ); set π = p
    for i = 1, …, N do
        Draw θ ~ π and x ~ G(θ, ·); add (θ, x) to D
    end for
    Get ψ⋆ := maximize_ebm_log_l(D, ψ_0)
    Set q_{ψ⋆}(θ|x) ∝ e^{-E_{ψ⋆}(x,θ)} p(θ)
    Infer using MCMC on q_{ψ⋆}(θ|x)

Algorithm 2: maximize_ebm_log_l(D, ψ_0)
    Input: training data D := {(x_i, θ_i)}_{i=1}^N, initial EBM parameters ψ_0
    Output: density estimator q_ψ(x, θ)
    Initialize q_{ψ_0}(x, θ) ∝ e^{-E_{ψ_0}(x,θ)} and q̂_0 ∝ Σ_i δ_{(x_i, θ_i)}
    for k = 0, …, K−1 do
        q̂ := make_particle_approx(q_{ψ_k}, q̂)
        G := −(1/N) Σ_i ∇_ψ E_{ψ_k}(x_i, θ_i) + E_{q̂}[∇_ψ E_{ψ_k}(x, θ)]
        ψ_{k+1} := ADAM(ψ_k, G)
    end for
    Return q_{ψ_K}
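The (Z, θ)-uniformization property admits a quick numerical sanity check (a toy sketch with a hypothetical well-specified family, not the paper's implementation). Take p(x|θ) = N(θ, 1) and E_ψ(x, θ) = (x − θ)²/2 + aθ + bθ², so that Z(θ, ψ) = √(2π)·exp(−(aθ + bθ²)); gradient ascent on the tilted objective L_a over ψ = (a, b), computed by quadrature, drives (a, b) → (0, 0), i.e. makes Z(θ, ψ⋆) constant in θ.

```python
import numpy as np

# Grid quadrature over theta; proposal pi(theta) = N(0, 1).
tg = np.linspace(-5.0, 5.0, 2001)
dt = tg[1] - tg[0]
pi = np.exp(-tg**2 / 2)
pi /= pi.sum() * dt

a, b, lr = 0.5, 0.3, 0.3        # start from a theta-dependent normalizer
for _ in range(2000):
    # model's theta-marginal is proportional to pi(theta) * Z(theta, psi)
    tilted = pi * np.exp(-(a * tg + b * tg**2))
    tilted /= tilted.sum() * dt
    # ascent on L_a: grad_a = -E_pi[theta] + E_tilted[theta], idem for theta^2
    a += lr * ((tilted - pi) * tg).sum() * dt
    b += lr * ((tilted - pi) * tg**2).sum() * dt
```

At the fixed point the tilted marginal matches π, which by Proposition 1 is exactly the statement that f(θ) = Z(θ, ψ⋆)/Z_π(ψ⋆) = 1.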

3.2. Targeted Inference using Sequential-UNLE

In this section, we introduce our second method, Sequential-UNLE (SUNLE in short), which performs targeted inference for a specific observation x_o. SUNLE follows the traditional methodology of targeted inference by splitting the simulation budget N over R rounds (often equally), where in each round r, a likelihood estimate q_{ψ⋆_r}(x|θ) in the form of a conditional EBM is trained using all the currently available simulated data D. This yields a new posterior estimate q_{ψ⋆_r}(θ|x) = e^{-E_{ψ⋆_r}(x,θ)} p(θ)/Z(θ, ψ⋆_r), which is used to sample parameters {θ^(i)}_{i=1}^{N/R} that are then provided to the simulator to generate new data x^(i) ~ G(θ^(i), ·). The new data are added to the set D and are expected to be more similar to the observation of interest x_o. This procedure focuses the simulation budget on regions relevant to the single observed datapoint x_o and, as such, is expected to make more efficient use of the simulator than amortized methods (Lueckmann et al., 2021). Next, we discuss the learning procedure for the likelihood model and the posterior sampling. Learning the likelihood. At each round r, the effective proposal π of the available training data can be understood (provided the number of data points drawn at each round is randomized) as a mixture probability

$$\pi := \frac{1}{r}\Big(\pi^{(0)}(\theta) + q_{\psi^\star_1}(\theta|x_o) + \dots + q_{\psi^\star_{r-1}}(\theta|x_o)\Big),$$

which is used to update the likelihood model. In this case, the analytical form of π is unavailable, as it would require computing the normalizing constants of the posterior estimates at each round, making the tilting approach introduced for AUNLE impractical in the sequential setting. Since currently available likelihood objectives for EBMs (Kelly & Grathwohl, 2021; Du & Mordatch, 2019) take as input unconditional (or joint) EBMs, a likelihood learning approach building on such objectives would require modeling and learning the entire joint distribution π(θ)p(x|θ), including the proposal π(θ).
This latter point is problematic, since π is not needed for inference and can be highly complex (as it is set to be the current posterior estimate), increasing the difficulty of training. Instead, SUNLE learns a likelihood model maximizing the average conditional log-likelihood

$$\mathcal{L}(\psi) = \frac{1}{N}\sum_{i=1}^{N} \log q_\psi(x_i|\theta_i), \qquad \nabla_\psi \mathcal{L}(\psi) = \frac{1}{N}\sum_{i=1}^{N}\Big({-\nabla_\psi E_\psi(x_i,\theta_i)} + \underbrace{\mathbb{E}_{q_\psi(\cdot|\theta_i)}\big[\nabla_\psi E_\psi(x,\theta_i)\big]}_{\text{intractable}}\Big), \tag{5}$$

where (x_i, θ_i)_{i=1}^N are the current samples. Unlike standard EBM objectives, this loss directly targets the likelihood q_ψ(x|θ), thus bypassing the need to model the proposal π. We propose in Algorithm 4 a method that optimizes this objective (previously used for normalizing flows in Papamakarios et al., 2019) when the density estimator is a conditional EBM. The intractable term of Equation (5) is an average over the EBM densities conditioned on each parameter from the training set, and thus differs from the intractable term of Equation (1), which is a single integral. Algorithm 4 approximates this term during training by keeping track of one single-particle approximation q̂_i = δ_{x_i} per conditional density q_ψ(·|θ_i). The algorithm proceeds by updating only a batch of size B of such particles at each iteration, using an MCMC update with invariant distribution q_{ψ_k}(·|θ_i), where ψ_k is the EBM iterate at iteration k of round r. Learning the likelihood using Algorithm 4 allows all the existing simulated data to be used during training without re-learning the proposal, maximizing sample efficiency while minimizing learning complexity. The multi-round procedure of SUNLE is summarized in Algorithm 3. Posterior sampling. Unlike AUNLE, SUNLE's likelihood estimate q_ψ does not inherit the (Z, θ)-uniformization property guaranteed by Proposition 1. As a consequence, its posterior q_{ψ⋆_R}(θ|x) is doubly intractable, as it involves the intractable normalizing constant Z(θ, ψ⋆_R).
Nevertheless, we propose to sample from q_{ψ⋆_R}(θ|x) using doubly-intractable MCMC techniques. We use a custom, robust doubly-intractable implementation that allows accurate inference even on challenging posteriors, with no parameter tuning other than compute-related parameters such as the number of warmup steps.

Algorithm 3: Sequential-UNLE
    Input: prior p(θ), simulator G, budget N, number of rounds R
    Output: posterior estimate q_ψ(θ|x)
    Initialize π^(0) = p, ψ_0, q_{ψ_0,π} ∝ e^{-E_{ψ_0}(x,θ)} π(θ)
    Get D = {θ^(i) ~ π^(0)(θ), x^(i) ~ G(θ^(i), ·)}_{i=1}^{N/R}
    for r = 1, …, R do
        Get ψ⋆_r := maximize_cebm_log_l(D, ψ⋆_{r−1})
        Set π^(r)(θ) := q_{ψ⋆_r}(θ|x_o) = e^{-E_{ψ⋆_r}(x_o,θ)} p(θ)/Z(θ, ψ⋆_r)
        Get {θ^(i)}_{i=1}^{N/R} ~ π^(r) via doubly-intractable MCMC
        Set D := D ∪ {θ^(i), x^(i) ~ G(θ^(i), ·)}_{i=1}^{N/R}
    end for
    Infer using doubly-intractable MCMC on q_{ψ⋆_R}(θ|x)

Algorithm 4: maximize_cebm_log_l(D, ψ_0)
    Input: training data D := {(θ^(i), x^(i))}_{i=1}^N, initial EBM parameters ψ_0
    Output: conditional density estimator q_ψ(x|θ)
    Initialize q_{ψ_0} ∝ e^{-E_{ψ_0}(x,θ)} and {q̂_i = δ_{x_i}}_{i=1}^N
    for k = 0, …, K−1 do
        for i = 1, …, N do
            q̂_i := make_particle_approx(q_{ψ_k}(·|θ_i), q̂_i)
        end for
        G := (1/N) Σ_i (−∇_ψ E_{ψ_k}(x_i, θ_i) + E_{q̂_i}[∇_ψ E_{ψ_k}(x, θ_i)])
        ψ_{k+1} := ADAM(ψ_k, G)
    end for
    Return q_{ψ_K}
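The per-conditional persistent-particle idea of Algorithm 4 can be sketched in numpy (a toy sketch under a hypothetical conditional energy family, not the paper's jax implementation). We take E_ψ(x, θ) = (x − ψθ)²/2, so q_ψ(x|θ) = N(ψθ, 1) and conditional ML should recover ψ ≈ 1 on data generated with x = θ + noise; one persistent particle is kept per conditional, and only a batch of them is refreshed at each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 400, 50
thetas = rng.uniform(-2.0, 2.0, size=N)
xs = thetas + rng.standard_normal(N)       # simulated data; true psi* = 1

# Hypothetical conditional energy E_psi(x, theta) = (x - psi*theta)^2 / 2.
grad_E_psi = lambda x, th, psi: -th * (x - psi * th)   # dE/dpsi
grad_E_x = lambda x, th, psi: x - psi * th             # dE/dx

psi, particles, eps, lr = 0.0, xs.copy(), 0.1, 0.1
for k in range(400):
    # refresh one persistent particle per conditional q_psi(.|theta_i),
    # for a round-robin batch of B conditionals only
    idx = np.arange((k * B) % N, (k * B) % N + B)
    for _ in range(5):
        noise = rng.standard_normal(B)
        particles[idx] = (particles[idx]
                          - eps * grad_E_x(particles[idx], thetas[idx], psi)
                          + np.sqrt(2 * eps) * noise)
    # conditional ML ascent, Eq. (5): data term plus per-conditional particle term
    g = np.mean(-grad_E_psi(xs, thetas, psi)
                + grad_E_psi(particles, thetas, psi))
    psi += lr * g
```

Note that each particle only ever targets its own conditional q_ψ(·|θ_i), so the proposal over θ never needs to be modelled.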

4. Experiments

In this section, we study the performance and properties of AUNLE and SUNLE in three different settings: a toy model that highlights the failure modes of other synthetic likelihood methods, a series of benchmark datasets for SBI, and a real-life neuroscience model. Experimental details. AUNLE and SUNLE are implemented using jax (Frostig et al., 2018). We approximate expectations of AUNLE's joint EBM using 1000 independent MCMC chains with a Langevin kernel parameterised by a step size σ, which is automatically adapted to maintain an acceptance rate of 0.5 during a per-iteration warmup period, before freezing the chain and computing a final particle approximation. Additionally, we introduce a new method that replaces the MCMC chains by a single Sequential Monte Carlo sampler (Chopin et al., 2020; Del Moral et al., 2006), which yields similar performance to the Langevin-MCMC approach discussed above, but is more robust for lower computational budgets (see Appendix A.2). The particle approximations are persisted across iterations (Tieleman, 2008; Du & Mordatch, 2019) to reduce the risk of learning a "short run" EBM (Nijkamp et al., 2019; Xie et al., 2021) that would not approximate the true likelihood correctly (see Appendix B.2 for a detailed discussion). All experiments are averaged across 5 random seeds (and additionally 10 different observations x_o for benchmark problems). We provide all code needed to reproduce the experiments of the paper. Training and inference are computed using a single RTX5000 GPU. For benchmark models, a single round of EBM training takes around 2 minutes on a GPU (see Appendix B.4).
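The step-size-adapted Langevin kernel described above can be sketched as follows (a toy numpy sketch of a Metropolis-adjusted Langevin kernel with acceptance-rate adaptation; the 1-d target and the adaptation constants are hypothetical illustrations, not the paper's settings).

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical 1-d target for illustration: N(3, 2^2), known up to a constant
log_p = lambda x: -(x - 3.0) ** 2 / 8.0
grad_p = lambda x: -(x - 3.0) / 4.0

n_chains, sigma = 1000, 1.0
x = rng.standard_normal(n_chains)            # 1000 independent chains

def mala_step(x, sigma):
    """One Metropolis-adjusted Langevin step, vectorized over chains."""
    m = x + 0.5 * sigma**2 * grad_p(x)
    prop = m + sigma * rng.standard_normal(x.shape)
    m_back = prop + 0.5 * sigma**2 * grad_p(prop)
    log_a = (log_p(prop) - log_p(x)
             - (x - m_back) ** 2 / (2 * sigma**2)
             + (prop - m) ** 2 / (2 * sigma**2))
    acc = np.log(rng.uniform(size=x.shape)) < log_a
    return np.where(acc, prop, x), acc.mean()

for _ in range(300):                          # warmup: adapt the step size
    x, rate = mala_step(x, sigma)
    sigma *= np.exp(0.05 * (rate - 0.5))      # push acceptance toward 0.5
for _ in range(300):                          # frozen kernel, as in the text
    x, _ = mala_step(x, sigma)
```

After the warmup the kernel is frozen, and the final states of the chains form the particle approximation used in the EBM gradient estimate.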

4.1. A toy model with a multi-modal likelihood

First, we illustrate the issues that NLE and SMNLE can face when modelling certain distributions, using a simulator with a bimodal likelihood. Such a likelihood is known to be hard to model with normalizing flows, which, when fitted on multi-modal data, assign high density values to low-density regions of the data in order to "connect" the modes of the true likelihood (Cornish et al., 2020). Moreover, multi-modal distributions are also poorly handled by score matching, which minimizes the Fisher Divergence between the model and the data distribution, a divergence that does not account for mode proportions (Wenliang & Kanagawa, 2020). Figure 1 shows the likelihood models learned by NLE and SMNLE on this simulator, which exhibit the pathologies mentioned above: the score-matched likelihood only recovers a single mode of the likelihood, while the flow-based likelihood has a distorted shape. In contrast, AUNLE estimates both the likelihood and the posterior accurately. This suggests that AUNLE has an advantage when working with more complex, possibly multi-modal distributions, as we confirm later in Section 4.3.

4.2. Results on SBI Benchmark Datasets

We next study the performance of AUNLE and SUNLE on 4 SBI benchmark datasets with well-defined likelihoods and varying dimensionality and structure (Lueckmann et al., 2021). SLCP: a toy SBI model introduced by Papamakarios et al. (2019) with a unimodal Gaussian likelihood p(x|θ); the dependence of p(x|θ) on θ is nonlinear, yielding a complex posterior. The Lotka-Volterra model (Lotka, 1920): an ecological model describing the evolution of the populations of two interacting species, usually referred to as prey and predators. Two Moons: a well-known 2-d toy model with posteriors comprised of two moon-shaped regions, which is still not completely solved by SBI methods. Gaussian Linear Uniform: a simple Gaussian generative model with a 10-dimensional parameter space. These models encompass a variety of posterior structures (see Appendix B.1 for posterior pairplots): the Two Moons and SLCP posteriors are multimodal, include cutoffs, and exhibit sharp and narrow regions of high density, while the posteriors of the Lotka-Volterra model place mass on a very small region of the prior support. We compare the performance of AUNLE and SUNLE with NLE and its sequential analogue SNLE, respectively: NLE and SNLE represent the gold standard of current synthetic likelihood methods, and perform particularly well on benchmark problems (Lueckmann et al., 2021). We use the same set of hyperparameters for all models, and use a 4-layer MLP with 50 hidden units and swish activations for the energy function. Results are shown in Figure 2. While some fluctuations exist depending on the task considered, these results show that the performance of AUNLE (and of SUNLE when targeted inference is necessary) is on par with that of (S)NLE, demonstrating that a generic method involving Energy-Based Models can be trained robustly, without extensive hyperparameter tuning.
Interestingly, the model where UNLE has the greatest advantage over NLE is Two Moons, the benchmark whose likelihood exhibits the most complex geometry; in comparison, the three remaining benchmarks have simple normal (or log-normal) likelihoods, which are unimodal distributions for which normalizing flows are particularly well suited. This point underlines the benefits of using EBMs to fit challenging densities. Finally, we remark that SMNLE, which addresses only amortized inference (Pacchiardi & Dutta, 2022), struggled in practice on the toy problems investigated here.

4.3. Using SUNLE in a Real World neuroscience model

We further investigate the performance of SUNLE by running its inference procedure on a simulator model of the pyloric network located in the stomatogastric ganglion (STG) of the crab Cancer borealis, given an observed neuronal recording (Haddad & Marder). This model simulates 3 neurons, whose behaviors are governed by synapses and membrane conductances that act as simulator parameters θ of dimension 31. The simulated observations are composed of 15 summary statistics of the voltage traces produced by the neurons of this network (Prinz et al., 2003; 2004). Amortized SBI methods require tens of millions of samples, while currently the most sample-efficient targeted inference method for this problem is SNVI (a variant of SNLE that replaces the MCMC-powered posterior sampling with a variational inference step; Glöckler et al. 2021), which uses 30 rounds of 10000 simulations each.

Figure 3 (caption, partially recovered): fraction of simulated observations with well-defined summary statistics (higher is better) at each round for SNVI and SUNLE, with dashed lines indicating the maximum fraction for each method. Bottom-right: performance of the posterior using the Energy Distance.

We perform targeted inference on this model using SUNLE with an MLP of 9 layers and 300 hidden units per layer for the energy E_ψ, and perform doubly intractable MCMC to draw new proposal parameters across rounds. All inference and training steps are initialized using the previously available MCMC chains and EBM parameters. We report in Figure 3 the evolution of the rate of simulated observations with valid summary statistics (a metric indicative of posterior quality), as well as the Energy Scoring Rule (Gneiting & Raftery, 2007) of SUNLE's and SNVI's posteriors across rounds. The synthetic observation simulated using SUNLE's posterior mean closely matches the empirical observation (Figure 3, Left vs Center).
As shown in Figure 3, SUNLE matches the performance of SNVI in only 5 rounds, reducing by a factor of 6 the simulation budget needed by SNVI to achieve a comparable inference quality. After 10 rounds, SUNLE's posterior significantly exceeds the performance of SNVI in terms of the number of valid samples obtained by taking the final posterior samples as parameters. The total procedure takes only 3 hours (half of which is spent simulating samples), 10 times less than SNVI.

Conclusion

The expanding range of applications of Simulation-Based Inference poses new challenges to the way SBI algorithms model data. In this work, we presented SBI methods that use an expressive Energy-Based Model as their inference engine, fitted using Maximum Likelihood. We demonstrated promising performance on synthetic benchmarks and on a real-world neuroscience model. In future work, we hope to see applications of this method to other fields where EBMs have proven successful, such as physics (Noé et al., 2019) or protein modelling (Ingraham et al., 2018).

A.2.1 Background: Sequential Monte Carlo Samplers

Sequential Monte Carlo (SMC) Samplers [Chopin et al., 2020; Del Moral et al., 2006] are a family of efficient Importance Sampling (IS)-based algorithms that address the same problem as MCMC, namely computing a normalized particle approximation of a target density q known up to a normalizing constant Z. The particle approximation q_SMC computed by SMC samplers (consisting of N particles y^i, as in MCMC methods, but weighted non-uniformly by weights w^i) is produced by defining a set of intermediate densities (ν_l)_{l=0}^{L} bridging between the target density ν_L = q and some initial density ν_0, for which a particle approximation ν_0^N = Σ_{i=1}^{N} w_0^i δ_{y_0^i} is readily available. The intermediate densities are often chosen as a geometric interpolation between ν_0 and ν_L, i.e. ν_l ∝ (ν_0)^{1−l/L} (ν_L)^{l/L}, so that each ν_l is also known up to a normalizing constant. SMC samplers sequentially construct an approximation ν_l^N := Σ_{i=1}^{N} w_l^i δ_{y_l^i} of the density ν_l at time l, using the previously computed approximation of ν_{l−1} at time l−1. At each time step, the approximation is obtained by applying three successive operations: Importance Sampling, Resampling and MCMC sampling. We provide a vanilla SMC sampler implementation in Algorithm 5, and refer to this algorithm as make_smc_particle_approx.

Algorithm 5 SMC(q, ν_0, ν_0^N)
1: Hyper-parameters: number of particles N, number of steps L, resampling threshold A ∈ [1/N, 1).
2: Input: target density q, initial density ν_0, particle approximation ν_0^N of ν_0.
3: Output: particle approximation of q.
4: Construct the geometric path (ν_l)_{l=1}^{L} from ν_0 and q.
5: for l = 1, ..., L do
6:   Compute IS weights w_l^i and normalized weights W_l^i.
7:   Draw N samples (Ỹ_l^i)_{i=1}^{N} from (Y_{l−1}^i)_{i=1}^{N} according to the weights (W_l^i)_{i=1}^{N}, then set W_l^i = 1/N.
8:   Sample Y_l^i ∼ K_l(Ỹ_l^i, ·) using a Markov kernel K_l targeting ν_l.
9: end for
10: Return the approximation q_SMC^N := (Y_L^i, W_L^i)_{i=1}^{N}.
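To make Algorithm 5 concrete, the following is a minimal, self-contained Python sketch of a vanilla SMC sampler on a one-dimensional target. The function name, the ESS-based resampling rule, and the random-walk kernel are illustrative choices, not the paper's actual implementation:

```python
import numpy as np

def smc_sampler(log_q, log_nu0, y0, L=20, n_mcmc=3, ess_frac=0.5, step=0.5, seed=0):
    """Vanilla SMC sampler: bridge from nu_0 to the unnormalized target q along a
    geometric path, alternating reweighting / resampling / MCMC-move operations."""
    rng = np.random.default_rng(seed)
    y, n = y0.copy(), len(y0)
    logw = np.full(n, -np.log(n))                       # uniform initial weights
    bridge = lambda y, b: (1 - b) * log_nu0(y) + b * log_q(y)  # log nu_l with b = l/L
    for l in range(1, L + 1):
        b_prev, b = (l - 1) / L, l / L
        logw += bridge(y, b) - bridge(y, b_prev)        # importance-sampling update
        w = np.exp(logw - logw.max()); w /= w.sum()
        if 1.0 / np.sum(w ** 2) < ess_frac * n:         # resample when ESS drops below A*N
            y = y[rng.choice(n, size=n, p=w)]
            logw = np.full(n, -np.log(n))
        for _ in range(n_mcmc):                         # random-walk MH kernel K_l targeting nu_l
            prop = y + step * rng.standard_normal(n)
            acc = np.log(rng.uniform(size=n)) < bridge(prop, b) - bridge(y, b)
            y = np.where(acc, prop, y)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return y, w                                         # weighted particle approximation of q
```

For instance, bridging from a standard normal ν_0 to an unnormalized Gaussian target centered at 2 recovers the target's mean from the weighted particles.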
Importantly, under mild assumptions, the particle approximation constructed by SMC provides consistent estimates of expectations of any function f under the target q: Σ_{i=1}^{N} w^i f(y^i) →_P E_{y∼q}[f(y)]. We briefly compare the roles played by the number of steps and the number of particles in MCMC and SMC algorithms.

Number of particles. SMC samplers differ from MCMC samplers in the origin of their bias: while the bias of MCMC methods comes from running the chain for a finite number of steps only, the bias of SMC methods comes from the use of finitely many particles.

Number of steps. While it is usually beneficial to use a high number of iterations within MCMC samplers to decrease the algorithm's bias and ensure that the stationary distribution is reached, the number of steps (or intermediate distributions) in SMC serves to ensure a smooth transition from the proposal to the target distribution. However, the variance of SMC samplers is not guaranteed to decrease as a function of the number of steps, even though variance bounds that are uniform in the number of steps can be derived under assumptions on K_l [Chopin et al., 2020]. When applying SMC within AUNLE's training loop, we find that using more SMC steps usually increases the quality of the final posterior. In the next paragraph, we describe how to use the SMC routine efficiently to approximate EBM expectations within Algorithm 2.

A.2.2 Efficient use of SMC during AUNLE training using OG-SMC

A naive approach using the SMC routine of Algorithm 5 within the EBM training loop of Algorithm 2 would consist in calling SMC at every training iteration using a fixed, predefined proposal density ν_0 and associated particle approximation ν̂_0, such as one from a standard Gaussian distribution. However, as training goes on, the EBM is likely to differ significantly from the proposal density ν_0, requiring many SMC inner steps to obtain a good particle approximation.
A more efficient approach, which we propose, is to use the readily available unnormalized EBM density q_{ψ_{k−1}} and the associated particle approximation q̂_{k−1} computed by SMC at iteration k−1 as inputs to the call to SMC targeting the EBM q_{ψ_k} at iteration k. Algorithm 6 implements this approach.

Algorithm 6 SMC-powered ML training of EBMs
Input: training data {x^{(i)}}_{i=1}^{N}, initial EBM parameters ψ_0.
Output: density estimator q_ψ(x).
Initialize q_{ψ_0}(x) ∝ e^{−E_{ψ_0}(x)}, q_{−1} = ν_0, q̂_{−1} = ν̂_0.
for k = 0, ..., max_iter − 1 do
  q̂_k := SMC(q_{ψ_k}, q_{k−1}, q̂_{k−1})   # q̂ := make_smc_particle_approx(q_{ψ_k}, q̂)
  q_k := q_{ψ_k}
  G := −(γ/N) Σ_{i=1}^{N} ∇_ψ E_{ψ_k}(x^{(i)}) + γ E_{q̂_k}[∇_ψ E_{ψ_k}(x)]
  ψ_{k+1} := ADAM(ψ_k, G)
end for
Return q_{ψ_K}.

In practice, we find that using 20 SMC intermediate densities (with 3 steps of K_l each) in each call to SMC yields a performance similar to a 250-step MCMC EBM training procedure. Under a more constrained budget, using only 5 SMC intermediate densities outperforms a 30-step MCMC EBM training procedure.

Figure 4: Performance of AUNLE, using either an MCMC-powered particle approximation routine (30 MCMC steps) or an SMC routine (5 SMC steps).

B.1 Posterior pairplots on benchmark problems

We report the ground truth and estimated posterior pairplots on benchmark problems. AUNLE and SUNLE exhibit satisfying mode coverage, and are able to capture complex posterior structures.

B.2 Manifestation of the short-run effect in UNLE

It was shown in Nijkamp et al. [2019] that training an EBM by replacing the intractable expectation under the model with an expectation under a particle approximation, obtained by running parallel Langevin Dynamics chains initialized from random noise and updated for a fixed (and small) number of steps, can yield an EBM whose density is not proportional to the true density, but rather a generative model that generates faithful images when a few steps of Langevin Dynamics are run from random noise. Our design choices for both training and inference purposefully prevent this effect from manifesting itself in UNLE. During training, we estimate the intractable expectation using persistent MCMC or SMC chains, i.e. by initializing the MCMC (or SMC) algorithm of iteration k with the result of the MCMC (or SMC) algorithm at iteration k − 1, yielding a training method different from short-run EBMs. At inference, the posterior model is sampled from Markov Chains with a significant burn-in period, contrasting with the sampling model of short-run EBMs. Figure 7 compares the density of UNLE's posterior estimate for the two-moons model (a 2d posterior which can be easily visualized) with the true posterior. As Figure 7 shows, AUNLE's and SUNLE's posterior densities match the ground truth very closely, demonstrating that UNLE's EBM is not a short-run generative model, but a faithful density estimator.
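The persistent-chain scheme described above can be sketched on a toy one-parameter energy E_ψ(x) = (x − ψ)²/2 (a purely illustrative stand-in for the paper's neural energy): the particles at iteration k are warm-started from the particles of iteration k − 1, rather than from fresh noise as in short-run training, so a short MCMC refresh per gradient step suffices and the learned ψ converges to the maximum-likelihood solution.

```python
import numpy as np

def train_ebm_persistent(data, n_particles=1000, n_iters=400, lr=0.05, n_mh=3, seed=0):
    """Maximum-likelihood EBM training with persistent chains: the MCMC particles of
    iteration k are initialized from those of iteration k-1, not from fresh noise.
    Toy energy E_psi(x) = (x - psi)^2 / 2; psi is the only parameter (illustrative)."""
    rng = np.random.default_rng(seed)
    psi = 0.0
    y = rng.standard_normal(n_particles)          # persistent particles, initialized once
    for _ in range(n_iters):
        neg_E = lambda x: -0.5 * (x - psi) ** 2
        for _ in range(n_mh):                     # short MH refresh suffices: the EBM has
            prop = y + 0.5 * rng.standard_normal(n_particles)  # barely moved since last step
            acc = np.log(rng.uniform(size=n_particles)) < neg_E(prop) - neg_E(y)
            y = np.where(acc, prop, y)
        # NLL gradient: E_data[dE/dpsi] - E_model[dE/dpsi], with dE/dpsi = psi - x
        grad = (psi - data.mean()) - (psi - y.mean())
        psi -= lr * grad
    return psi
```

At convergence the model expectation matches the data mean, so ψ settles at the maximum-likelihood estimate; a short-run variant that re-initializes `y` from noise at every iteration would instead yield a biased density estimate.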

B.3 Validating the (Z, θ) uniformization of AUNLE's posterior in practice

Proposition 1 ensures that the normalizing constant Z(θ; ψ) present in AUNLE's posterior is independent of θ provided that the problem is well-specified and that ψ = ψ⋆, the optimum of AUNLE's population objective. In practice, these conditions will not hold exactly, and the uniformization of AUNLE's posterior thus only holds approximately. To assess the loss of precision associated with using a standard MCMC posterior sampler in the context of approximate uniformization, we compare the quality of AUNLE's posterior samples obtained using a standard MCMC sampler (which is valid only if uniformization holds) and using a doubly intractable MCMC sampler, which handles non-uniformized posteriors. We minimize the approximation error of the doubly intractable sampler by using a large number of steps (1000) when sampling from the likelihood using MCMC. As Figure 8 shows, there is no gain in using a doubly intractable sampler for inference in AUNLE, suggesting that the uniformization property of AUNLE holds well in practice.

B.4 Computational Cost Analysis

Training unnormalized models by approximate maximum likelihood is computationally intensive, as it requires running a sampler at each gradient step, yielding a computational cost of O(T_1 T_2 N), where T_1 is the number of gradient steps, T_2 is the number of MCMC steps, and N is the number of parallel chains used to estimate the gradient. To maximize training efficiency, we implement all samplers using jax [Frostig et al., 2018], which provides a Just-In-Time compiler and an auto-vectorization primitive that together generate efficient, custom parallel sampling routines. For AUNLE, we introduced a warm-started SMC approximation procedure to estimate gradients, yielding competitive performance with as few as 5 intermediate densities per gradient computation. For SUNLE, we warm-start the parameters of the EBM across training rounds, and warm-start the chains of the doubly intractable sampler across inference rounds, which significantly reduces the need for burn-in steps and long training. Finally, all experiments are run on GPUs. Together, these techniques make AUNLE and SUNLE almost always the fastest methods for amortized and sequential inference, with total per-problem runtimes of 2 minutes for AUNLE and 15 minutes for SUNLE on benchmark models (significantly faster than NLE and SNLE on their canonical CPU setup; Lueckmann et al., 2021) and less than 3 hours for SUNLE on the pyloric network model (with half of this time spent simulating samples). The latter is 10 times faster than SNVI (30 hours) on the same model. A breakdown of training, simulation and inference time is provided in Figure 9. We note that (S)NLE was run on a CPU, which is its advertised computational setting [Lueckmann et al., 2021], since (S)NLE uses networks that do not benefit much from GPU acceleration.
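As an illustration of how jax's primitives absorb the O(T_1 T_2 N) sampling cost, the sketch below (illustrative, not our actual codebase) vectorizes a single-chain MALA update over N chains with `vmap` and compiles the T_2-step loop with `jit`; the toy `energy` stands in for a learned E_ψ:

```python
from functools import partial
import jax
import jax.numpy as jnp

def energy(x):
    # Stand-in energy with a standard-normal stationary density (illustrative).
    return 0.5 * jnp.sum(x ** 2)

def mala_step(key, x, step=0.1):
    # One Metropolis-adjusted Langevin update for a single chain.
    k1, k2 = jax.random.split(key)
    grad = jax.grad(energy)
    prop = x - step * grad(x) + jnp.sqrt(2 * step) * jax.random.normal(k1, x.shape)
    def log_trans(a, b):  # log density of proposing a from b (up to constants)
        return -jnp.sum((a - b + step * grad(b)) ** 2) / (4 * step)
    log_alpha = energy(x) - energy(prop) + log_trans(x, prop) - log_trans(prop, x)
    return jnp.where(jnp.log(jax.random.uniform(k2)) < log_alpha, prop, x)

@partial(jax.jit, static_argnames="n_steps")
def run_chains(key, xs, n_steps=300):
    # vmap turns the single-chain update into an N-chain update; lax.scan loops it.
    def body(carry, _):
        key, xs = carry
        key, sub = jax.random.split(key)
        xs = jax.vmap(mala_step)(jax.random.split(sub, xs.shape[0]), xs)
        return (key, xs), None
    (_, xs), _ = jax.lax.scan(body, (key, xs), None, length=n_steps)
    return xs
```

Because the whole scan is compiled once, running N chains for T_2 steps costs a single kernel launch per gradient step rather than a Python-level loop.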
We note that the time spent performing inference is negligible for AUNLE, which uses standard MCMC for inference thanks to the tilting trick employed in its model. The runtime of SUNLE, on the other hand, which performs inference using a doubly intractable sampler, is dominated by its inference phase. This demonstrates the computational benefits of AUNLE's tilting trick. Note that SUNLE performs inference in a multi-round procedure, and thus requires R training and inference phases (where R is the number of rounds), as opposed to 1 for AUNLE. We alleviate this effect by leveraging efficient warm-starting strategies for both training and inference, which amortize these steps across rounds.

B.5 Experimental setup for SNLE and SMNLE

SNLE The results reported for SNLE are those of the SBI benchmark suite [Lueckmann et al., 2021], which reports the performance of both NLE and SNLE on all benchmark problems studied in this paper.

SMNLE

The results reported for SMNLE were obtained by running the implementation referenced by Pacchiardi & Dutta [2022]. SMNLE comes in two variants: the first uses plain Score Matching [Hyvärinen & Dayan, 2005] to estimate its conditional EBM, while the second uses Sliced Score Matching [Song et al., 2020], which yields significant computational speedups during training. For both variants, we train the model for 500 epochs, using neural networks with 4 hidden layers and 50 hidden and output units. To optimize inference performance, we carry out inference using our own doubly intractable sampler, which automatically tunes all parameters of the sampler except the number of burn-in steps, and initializes the chain at local posterior modes. We carry out a grid search over the learning rates 0.01 and 0.001, and leave the other training parameters at their defaults. Figures in the main body only report the performance of the Sliced Score Matching variant, which performs better in practice and runs orders of magnitude faster. Figure 10 reports the performance of both variants for completeness. We used a GPU both to train and to perform inference with SMNLE, yielding similar or longer training times compared to AUNLE for the sliced variant, and much longer training times for the standard variant.

B.6 Neuroscience Model: Details

Pairwise Marginals We provide the full pairwise marginals obtained after computing a kernel density estimate on the final posterior samples of SUNLE. We retrieve patterns similar to those displayed in the pairwise marginals of the SNVI samples.



https://github.com/anon-autors-a-sunle-iclr/unle



Figure 2: Performance of AUNLE (resp. SUNLE) compared with NLE and SMNLE (resp. SNLE), using the Classifier Accuracy Metric (Lueckmann et al., 2021) (lower is better). AUNLE and SUNLE exhibit robust performance across a wide array of problems. Additional details on the experimental setup can be found in Appendix B.5.

Figure 3: Inference with SUNLE on a model of the pyloric network. Left: simulations obtained by using the final posterior mean and maximum a posteriori (MAP) as parameters. Center: the empirical observation x_o; arrows indicate the summary statistics. Top-right: fraction of simulated observations with well-defined summary statistics (higher is better) at each round for SNVI and SUNLE, with dashed lines indicating the maximum fraction for each method. Bottom-right: performance of the posterior using the Energy Distance.

Figure 5: Performance of AUNLE, using either an MCMC-powered particle approximation routine (200 MCMC steps) or an SMC routine (20 SMC steps).

Figure 7: Normalized density of AUNLE and SUNLE for the two moons model. Left: manually normalized posterior density of both AUNLE and SUNLE using a discretization of the posterior over a grid. Middle: kernel density estimate of the MCMC samples obtained from AUNLE's and SUNLE's posteriors. Right: ground truth posterior. AUNLE's and SUNLE's posterior densities match the true density closely, showing that these methods indeed learn a density estimator, and not merely a short-run generative model [Nijkamp et al., 2019].

Figure 8: Quality of AUNLE's posterior samples (measured in classifier accuracy) obtained using a standard MCMC sampler (S. MCMC) and a doubly intractable sampler (D. MCMC). The results show no gain in using a doubly intractable sampler, justifying the use of standard samplers for AUNLE.

Figure 9: Runtime of UNLE: analysis and comparisons. First row: time (in minutes) spent training, inferring, and simulating for AUNLE. Second row: time (in minutes) spent training, inferring, and simulating for SUNLE. Third row: runtime comparison between AUNLE and NLE (in log scale). Fourth row: runtime comparison between SUNLE and SNLE.

Figure 10: Comparison of AUNLE, SMNLE with Sliced Score Matching (SSM), SMNLE with Score Matching (SM) and NLE on a set of benchmark problems.

Supplementary Material for the paper Maximum Likelihood Learning of Energy-Based Models for Simulation-Based Inference

The supplementary material includes the following:
• A discussion of the computational rationale motivating the tilting approach of AUNLE.
• A training method for EBMs which uses the family of Sequential Monte Carlo samplers to efficiently approximate expectations under the EBM during approximate likelihood maximization. We show that using this new method can lead to increased stability and performance for a fixed budget.
• An experiment in Appendix B.3 that suggests that the uniformization of AUNLE's posterior holds in learned AUNLE models.
• A discussion in Appendix B.2 about the (absence of) manifestation of the short-run effect [Nijkamp et al., 2019] in UNLE.
• A detailed computational analysis in Appendix B.4 of AUNLE, which proves highly competitive over alternatives.
• Figures in Appendix B.1 of UNLE's posterior samples for the benchmark and pyloric network problems.
• Finally, additional details in Appendix B.6 on the results of SUNLE on the pyloric network: we provide an estimate of the pairwise marginals of the final posterior, which contains patterns also present in the pairwise marginals obtained by Glöckler et al. [2021].

A Methodological Details

A.1 Energy-Based Models as Doubly-Intractable Joint Energy-Based Models

AUNLE learns a likelihood model q_ψ(x|θ) by maximizing the likelihood of a tilted joint EBM p(θ) e^{−E_ψ(x,θ)} / Z_π(ψ). While the gain in tractability arising in AUNLE's posterior suffices to motivate the use of this model, another computational argument holds. Consider the non-tilted joint model q_{π,ψ}(θ, x) ∝ π(θ) e^{−E_ψ(x,θ)}. Expectations under this model can be computed by running an MCMC chain implementing a Metropolis-Within-Gibbs sampling method, as in Kelly & Grathwohl [2021], which uses:
• any proposal distribution for q_{π,ψ}(x|θ) ∝ q_ψ(x|θ), such as a MALA proposal;
• an approximate doubly-intractable MCMC kernel step for q_{π,ψ}(θ|x), which is doubly intractable.
However, running the approximate doubly intractable MCMC kernel step requires sampling from q_ψ(x|θ), incurring an additional nested loop during training. Thus, naive MCMC-based Maximum-Likelihood optimization of the untilted joint EBM is prohibitive from a computational point of view.
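To make the nested-loop cost concrete, here is a minimal sketch of one exchange-style doubly intractable update of θ given x (the exact auxiliary-variable scheme and all names are illustrative; `sample_x_given` stands for the inner sampling run over x that every θ-update must perform, an MCMC run in general):

```python
import numpy as np

def exchange_step(theta, x, log_prior, log_q_tilde, sample_x_given, rng, scale=0.3):
    """One exchange-algorithm update of theta given data x, for an unnormalized
    likelihood q(x|theta) proportional to exp(log_q_tilde(x, theta)). The auxiliary
    draw x_aux ~ q(.|theta') cancels the intractable Z(theta') in the acceptance
    ratio, but costs a full inner sampling run per outer step -- the nested loop."""
    theta_prop = theta + scale * rng.standard_normal()
    x_aux = sample_x_given(theta_prop)                 # inner loop (MCMC in general)
    log_alpha = (log_prior(theta_prop) - log_prior(theta)
                 + log_q_tilde(x, theta_prop) - log_q_tilde(x, theta)
                 + log_q_tilde(x_aux, theta) - log_q_tilde(x_aux, theta_prop))
    return theta_prop if np.log(rng.uniform()) < log_alpha else theta
```

In a tractable Gaussian check (q(x|θ) = N(θ, 1), prior N(0, 1), observation x = 1), the chain recovers the exact posterior N(0.5, 0.5); in the EBM setting, each `sample_x_given` call would itself be a long MCMC run, which is precisely what the tilting trick avoids.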

A.2 Training EBMs using Sequential Monte Carlo

The main technique to compute particle approximations of the EBM iterates (returned by the generic make_particle_approx) when training an EBM using Algorithm 2 is to run N MCMC chains in parallel targeting the EBM [Song & Kingma, 2021]; aggregating the final samples y^i of each chain i yields a particle approximation q̂ = (1/N) Σ_i δ_{y^i} of the EBM in question. In this appendix section, we describe an alternative, make_ebm_approx, which efficiently constructs EBM particle approximations across iterations of Algorithm 2 through a Sequential Monte Carlo (SMC) algorithm [Chopin et al., 2020; Del Moral et al., 2006]. In addition to its efficiency, this new routine does not suffer from the bias incurred by the use of finitely many steps in MCMC-based methods. We apply this routine within AUNLE's EBM training step, and show that the learned posteriors can be more accurate than those obtained with MCMC methods for a fixed compute budget allocated to training.

