ANY-SCALE BALANCED SAMPLERS FOR DISCRETE SPACES

Abstract

The locally balanced informed proposal has proved to be highly effective for sampling from discrete spaces. However, its success relies on the "local" factor, which ensures that whenever the proposal distribution is restricted to be near the current state, the locally balanced weight functions are asymptotically optimal and the gradient approximations are accurate. In seeking a more efficient sampling algorithm, many recent works have considered increasing the scale of the proposal distributions, but this causes the "local" factor to no longer hold. Instead, we propose any-scale balanced samplers to close this gap for non-local proposals. In particular, we substitute the locally balanced function with an any-scale balanced function that can self-adjust to achieve better efficiency for proposal distributions at any scale. We also use quadratic approximations to capture the curvature of the target distribution and reduce the error in the gradient approximation, while employing a Gaussian integral trick with a specially estimated diagonal to efficiently sample from the quadratic proposal distribution. On various synthetic and real distributions, the proposed sampler substantially outperforms existing approaches.

1. INTRODUCTION

The Markov Chain Monte Carlo (MCMC) algorithm is one of the most widely used methods for sampling from intractable distributions (Robert et al., 1999). Gradient-based samplers that leverage gradient information to guide the proposal have achieved significant advances in sampling from continuous spaces, demonstrated, for example, by the Metropolis Adjusted Langevin Algorithm (MALA) (Rossky et al., 1978), Hamiltonian Monte Carlo (HMC) (Duane et al., 1987), and related variants (Girolami & Calderhead, 2011; Hoffman et al., 2014). However, for discrete spaces, gradient-based samplers remain far less well understood. Recently, a family of locally balanced (LB) samplers (Zanella, 2020; Grathwohl et al., 2021; Sun et al., 2021; 2022a; Zhang et al., 2022) has demonstrated promise in sampling from discrete spaces. Such samplers use a locally balanced weight function in an informed proposal Q(x, y) ∝ g(π(y)/π(x)) K_σ(x − y), where g : R → R is a weight function that satisfies g(t) = t g(1/t), π is the target distribution, and K_σ is a kernel that determines the scale of the proposal distribution. It has also been shown that such a locally balanced informed proposal is a discrete version of MALA, since both simulate gradient flows in the Wasserstein manifold (Sun et al., 2022a).

In initial work, Zanella (2020) considered a local proposal with a kernel K_σ that restricts next states to lie within a 1-Hamming ball, seeking to capture natural discrete topological structure arising, for example, in spaces of trees, partitions, or permutations. For more regular discrete spaces, such as lattices, Grathwohl et al. (2021) introduced a gradient approximation for the probability ratio, π(y)/π(x) ≈ exp(⟨y − x, ∇ log π(x)⟩), to make the locally balanced proposal more scalable. However, by restricting attention to a local proposal, these methods tend not to make large jumps and exhibit highly correlated samples. Sun et al.
(2021) made the first provably efficient attempt to extend local proposals from the 1-Hamming ball to the L-Hamming ball, after which subsequent works (Zhang et al., 2022; Sun et al., 2022a; Rhodes & Gutmann, 2022) have shown that using a non-local proposal with the heat kernel K_σ(z) = exp(−∥z∥²/(2σ)) can further improve sampling efficiency.

Even though extending locally balanced samplers to non-local proposals has delivered some progress, there remain opportunities for improvement by closing gaps in the current methods. One gap is exemplified by the choice of weight function. To illustrate, consider g(t) = t^α. For a 100-dimensional Bernoulli distribution, we used an informed proposal with the heat kernel K_σ(z) = exp(−∥z∥²/(2σ)) and plotted the effective sample size as a function of α for different σ. Figure 1 shows clearly that the performance of α varies with σ. In particular, the optimal choice of α monotonically increases with σ. When σ ↓ 0, the optimal choice g(t) = √t recovers the locally balanced function. This result indicates that the locally balanced function is no longer optimal for non-local proposals. We will show that a good choice of α depends on the variance ratio between the target distribution and the kernel. We also give an adaptive algorithm that tunes (σ, α) automatically.

[Figure 1: ESS for different (σ, α) pairs. Plot of g(t) = t^α on the Bernoulli model, with one curve per σ ∈ {0.2, 0.3, 0.5, 0.75, 1, 1.5, 3, 10, 100, 1000}.]

Another gap arises from the gradient approximation. For a local proposal, a first-order gradient approximation is usually sufficient to estimate the probability ratio. However, for a non-local proposal, higher-order approximations are generally required to capture correlations between different variables. Extending recent work, we consider a quadratic approximation of the probability ratio, π(y)/π(x) ≈ exp((y − x)^⊤ ∇ log π(x) + ½ (y − x)^⊤ W (y − x)), for the non-local proposal, where W is an arbitrary symmetric real matrix.
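The two approximations above can be sketched concretely. The following NumPy snippet is our own minimal illustration (the function name and the toy pairwise target are ours, not from the paper's code); the target is chosen quadratic so that the second-order approximation with the true coupling matrix is exact, and setting W to None recovers the first-order approximation:

```python
import numpy as np

def approx_log_ratio(x, y, grad_log_pi, W=None):
    """Approximate log pi(y)/pi(x) around x.

    W = None gives the first-order approximation <y - x, grad log pi(x)>;
    a symmetric matrix W adds the quadratic term (1/2)(y - x)^T W (y - x).
    """
    d = y - x
    first = d @ grad_log_pi
    return first if W is None else first + 0.5 * d @ W @ d

# Toy pairwise model: log pi(x) = <theta, x> + (1/2) x^T W x, so the
# quadratic approximation using the true coupling matrix W is exact.
W = np.array([[0.0, 0.3], [0.3, 0.0]])
theta = np.array([0.5, -1.0])
log_pi = lambda s: theta @ s + 0.5 * s @ W @ s
x, y = np.zeros(2), np.ones(2)
grad = theta + W @ x                      # gradient of log pi at x
exact = log_pi(y) - log_pi(x)             # true log ratio
quad = approx_log_ratio(x, y, grad, W)    # matches `exact` for this target
first = approx_log_ratio(x, y, grad)      # first-order only
```

For a non-quadratic target the second-order formula is of course only an approximation, but the example shows why the extra ½ d^⊤ W d term matters for large jumps: here the first-order estimate is off by exactly the coupling contribution 0.5.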
Unfortunately, the quadratic heat kernel renders a proposal distribution that is in general a pairwise Markov random field, which is intractable to sample from directly. However, this difficulty can be addressed by leveraging a stochastic factorization via the Gaussian integral trick (Hertz et al., 1991; Zhang et al., 2012), also known as the Hubbard-Stratonovich transform (Hubbard, 1959). In particular, we decouple the quadratic term via (W + D)^{1/2} ξ, where D is a diagonal matrix chosen to make sure W + D is positive semi-definite (PSD) and ξ is standard Gaussian noise. In this paper we show that the quality of the factorization can be characterized by D. While previous work chose D to be isotropic, we find a substantial increase in performance by numerically optimizing over general diagonal matrices D.

Closing these two gaps yields our proposed Any-scale Balanced Sampling (AB Sampling) methods. We extensively demonstrate the advantages of the proposed sampler on both synthetic and real distributions. The results show that, with the proposed numerical optimization of D and the adaptive tuning of (σ, α), the two extensions robustly improve the efficiency of non-local informed proposals.
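As a sketch of the decoupling step described above, the following NumPy snippet (our own illustration, not the paper's implementation) uses the simple isotropic choice of D from prior work rather than the numerically optimized diagonal proposed here:

```python
import numpy as np

def decouple_quadratic(W, grad_log_pi, rng, eps=1e-6):
    """Gaussian integral trick: exp((1/2) d^T M d) = E_xi[exp(d^T M^{1/2} xi)]
    for xi ~ N(0, I) and M = W + D positive semi-definite.

    Writing (1/2) d^T W d = (1/2) d^T M d - (1/2) d^T D d, the first term
    decouples once xi is sampled, and the leftover -(1/2) d^T D d is
    diagonal, so the proposal factorizes over coordinates with an
    effective linear field b.
    """
    # Isotropic D = (max(-lambda_min, 0) + eps) I guarantees W + D is PSD;
    # the paper instead optimizes over general diagonal matrices D.
    lam_min = np.linalg.eigvalsh(W).min()
    D = (max(-lam_min, 0.0) + eps) * np.eye(len(W))
    # Symmetric square root of M = W + D via eigendecomposition.
    evals, evecs = np.linalg.eigh(W + D)
    sqrt_M = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    xi = rng.standard_normal(len(W))
    b = grad_log_pi + sqrt_M @ xi          # per-coordinate linear field
    return b, D

rng = np.random.default_rng(0)
W = np.array([[0.0, 0.8], [0.8, 0.0]])     # indefinite coupling matrix
b, D = decouple_quadratic(W, np.zeros(2), rng)
```

Conditioned on the auxiliary draw ξ, sampling each coordinate only requires its own field b_i and the diagonal penalty from D, which is what makes the otherwise intractable pairwise proposal cheap to sample.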

2. PRELIMINARIES

Informed Proposal. The informed proposal (Zanella, 2020) is a class of Metropolis-Hastings algorithms for discrete spaces, such that the proposal distribution at the current state x has the form: Q_σ^g(x, y) = g(π(y)/π(x)) K_σ(x − y) / Z_g(x), where Z_g(x) = Σ_{z∈X} g(π(z)/π(x)) K_σ(x − z), π is the target distribution, X is the state space, g : R_+ → R_+ is a weight function, Z_g is the partition function, and K_σ is an uninformed kernel with scale σ, such that K_σ(z) = K_σ(−z), lim_{σ→0} K_σ(z) = 1{z = 0}, and lim_{σ→∞} K_σ(z) ≡ 1. For example, K_σ can be a Hamming ball kernel K_σ(z) = 1{|z| ≤ σ}, or a heat kernel K_σ(z) = exp(−∥z∥²/(2σ)).

Balanced Proposal. Let π_σ^g denote the reversible distribution associated with the informed proposal Q_σ^g; that is, satisfying π_σ^g(x) Q_σ^g(x, y) = π_σ^g(y) Q_σ^g(y, x). We refer to the family {Q_σ^g}_σ as a balanced proposal if there exists a sequence σ_1, σ_2, ..., such that π_{σ_j}^g weakly converges to the target

