ANY-SCALE BALANCED SAMPLERS FOR DISCRETE SPACES

Abstract

The locally balanced informed proposal has proved to be highly effective for sampling from discrete spaces. However, its success relies on the "local" factor, which ensures that whenever the proposal distribution is restricted to be near the current state, the locally balanced weight functions are asymptotically optimal and the gradient approximations are accurate. In seeking a more efficient sampling algorithm, many recent works have considered increasing the scale of the proposal distributions, but this causes the "local" factor to no longer hold. Instead, we propose any-scale balanced samplers to repair the gap in non-local proposals. In particular, we substitute the locally balanced function with an any-scale balanced function that can self-adjust to achieve better efficiency for proposal distributions at any scale. We also use quadratic approximations to capture curvature of the target distribution and reduce the error in the gradient approximation, while employing a Gaussian integral trick with a special estimated diagonal to efficiently sample from the quadratic proposal distribution. On various synthetic and real distributions, the proposed sampler substantially outperforms existing approaches.

1. INTRODUCTION

The Markov Chain Monte Carlo (MCMC) algorithm is one of the most widely used methods for sampling from intractable distributions (Robert et al., 1999). Gradient-based samplers that leverage gradient information to guide the proposal have achieved significant advances in sampling from continuous spaces, as demonstrated by the Metropolis Adjusted Langevin Algorithm (MALA) (Rossky et al., 1978), Hamiltonian Monte Carlo (HMC) (Duane et al., 1987), and related variants (Girolami & Calderhead, 2011; Hoffman et al., 2014). For discrete spaces, however, gradient-based samplers remain far less well understood.

Recently, a family of locally balanced (LB) samplers (Zanella, 2020; Grathwohl et al., 2021; Sun et al., 2021; 2022a; Zhang et al., 2022) has demonstrated promise in sampling from discrete spaces. Such samplers use a locally balanced weight function in an informed proposal $Q(x, y) \propto g(\pi(y)/\pi(x))\, K_\sigma(x - y)$, where $g : \mathbb{R}_+ \to \mathbb{R}_+$ is a weight function that satisfies $g(t) = t\, g(1/t)$, $\pi$ is the target distribution, and $K_\sigma$ is a kernel that determines the scale of the proposal distribution. Such a locally balanced informed proposal has also been shown to be a discrete version of MALA, since both simulate gradient flows in the Wasserstein manifold (Sun et al., 2022a). In initial work, Zanella (2020) considered a local proposal with a kernel $K_\sigma$ that restricts next states to lie within a 1-Hamming ball, seeking to capture natural discrete topological structure arising, for example, in spaces of trees, partitions or permutations. For more regular discrete spaces, such as lattices, Grathwohl et al. (2021) introduced a gradient approximation of the probability ratio, $\pi(y)/\pi(x) \approx \exp(\langle y - x, \nabla \log \pi(x) \rangle)$, to make the locally balanced proposal more scalable. However, by restricting attention to a local proposal, these methods tend not to make large jumps and exhibit highly correlated samples. Sun et al. (2021) made the first provably efficient attempt to extend local proposals from a 1-Hamming ball to an L-Hamming ball, after which subsequent works (Zhang et al., 2022; Sun et al., 2022a; Rhodes & Gutmann, 2022) have shown that using a non-local proposal with the heat kernel $K_\sigma(z) = \exp(-\frac{1}{2\sigma}\|z\|^2)$ can further improve sampling efficiency.

Even though extending locally balanced samplers to non-local proposals has delivered some progress, there remain opportunities for improvement by closing gaps in the current methods. One gap is exemplified by the choice of weight function. To illustrate, consider $g(t) = t^\alpha$. For a 100-dimensional Bernoulli distribution, we used an informed proposal with the heat kernel $K_\sigma(z) = \exp(-\frac{1}{2\sigma}\|z\|^2)$ and plotted the effective sample size as a function of $\alpha$ for different $\sigma$. Figure 1 shows clearly that the best-performing $\alpha$ varies with $\sigma$. In particular, the optimal choice of $\alpha$ increases monotonically with $\sigma$. When $\sigma \downarrow 0$, the optimal choice $g(t) = \sqrt{t}$ recovers the locally balanced function. This result indicates that the locally balanced function is no longer optimal for non-local proposals. We will show that a good choice of $\alpha$ depends on the variance ratio between the target distribution and the kernel. We also give an adaptive algorithm that tunes $(\sigma, \alpha)$ automatically.

Another gap arises from the gradient approximation. For a local proposal, a first order gradient approximation is usually sufficient to estimate the probability ratio. For a non-local proposal, however, higher order approximations are generally required to capture correlations between different variables. Extending recent work, we consider a quadratic approximation of the probability ratio for non-local proposals, $\pi(y)/\pi(x) \approx \exp\big((y - x)^\top \nabla \log \pi(x) + \frac{1}{2}(y - x)^\top W (y - x)\big)$, where $W$ is an arbitrary symmetric real matrix.
Unfortunately, the quadratic heat kernel renders a proposal distribution that is, in general, a pairwise Markov random field, which is intractable to sample from directly. However, this difficulty can be addressed by leveraging a stochastic factorization via the Gaussian integral trick (Hertz et al., 1991; Zhang et al., 2012), also known as the Hubbard-Stratonovich transform (Hubbard, 1959). In particular, we decompose the quadratic term via $(W + D)^{1/2} \xi$, where $D$ is a diagonal matrix chosen so that $W + D$ is positive semi-definite (PSD) and $\xi$ is standard Gaussian noise. In this paper we will show that the quality of the factorization can be characterized by $D$. While previous work chose $D$ to be isotropic, we find a substantial increase in performance by numerically optimizing over general diagonal matrices $D$.

Closing these two gaps yields our proposed Any-scale Balanced Sampling (AB Sampling) methods. We extensively demonstrate the advantages of the proposed sampler on both synthetic and real distributions. The results show that, with the proposed numerical optimization of $D$ and the adaptive tuning, the two extensions robustly improve the efficiency of non-local informed proposals.

2. PRELIMINARIES

Informed Proposal. The informed proposal (Zanella, 2020) is a class of Metropolis-Hastings algorithms for discrete spaces, in which the proposal distribution at the current state $x$ has the form

$$Q^g_\sigma(x, y) = \frac{g\big(\pi(y)/\pi(x)\big)\, K_\sigma(x - y)}{Z_g(x)}, \qquad Z_g(x) = \sum_{z \in \mathcal{X}} g\big(\pi(z)/\pi(x)\big)\, K_\sigma(x - z), \tag{1}$$

where $\pi$ is the target distribution, $\mathcal{X}$ is the state space, $g : \mathbb{R}_+ \to \mathbb{R}_+$ is a weight function, $Z_g$ is the partition function, and $K_\sigma$ is an uninformed kernel with size $\sigma$, such that $K_\sigma(z) = K_\sigma(-z)$, $\lim_{\sigma \to 0} K_\sigma(z) = 1_{\{z = 0\}}$ and $\lim_{\sigma \to \infty} K_\sigma(z) \equiv 1$. For example, $K_\sigma$ can be a Hamming ball kernel $K_\sigma(z) = 1_{\{|z| \le \sigma\}}$, or a heat kernel $K_\sigma(z) = \exp(-\|z\|^2 / (2\sigma))$.

Balanced Proposal. Let $\pi^g_\sigma$ denote the reversible distribution associated with the informed proposal $Q^g_\sigma$; that is, the distribution satisfying $\pi^g_\sigma(x)\, Q^g_\sigma(x, y) = \pi^g_\sigma(y)\, Q^g_\sigma(y, x)$. We refer to the family $\{Q^g_\sigma\}_\sigma$ as a balanced proposal if there exists a sequence $\sigma_1, \sigma_2, \dots$ such that $\pi^g_{\sigma_j}$ converges weakly to the target distribution $\pi$. In particular, we say that $\{Q^g_\sigma\}_\sigma$ is a locally or globally balanced proposal if the sequence $\sigma_j$ satisfies $\lim_j \sigma_j = 0$ or $\lim_j \sigma_j = \infty$, respectively.

Locally Balanced Sampler. Zanella (2020) showed that any locally balanced function, that is, any $g$ satisfying $g(t) = t\, g(1/t)$, defines a family of locally balanced proposals, which furthermore is asymptotically optimal for locally informed proposals with $\sigma \downarrow 0$. The most commonly used locally balanced function is $g(t) = \sqrt{t}$. Grathwohl et al. (2021) introduced a gradient approximation of the probability ratio, $\pi(y)/\pi(x) \approx \exp((y - x)^\top \nabla \log \pi(x))$, to make the locally balanced proposal scalable. Sun et al. (2021), Zhang et al. (2022), Sun et al. (2022a), and Rhodes & Gutmann (2022) show that locally balanced samplers can be more efficient with a large scale $\sigma$, and Sun et al. (2022b) prove that the optimal scale $\sigma$ for a locally balanced proposal is achieved when the average acceptance rate is 0.574.
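To make the informed proposal in (1) concrete, the following minimal sketch enumerates it exactly on a small one-dimensional state space with the heat kernel and the weight function $g(t) = t^\alpha$. The function names and the toy target values are illustrative assumptions, not part of the paper.

```python
import math


def heat_kernel(z, sigma):
    """Heat kernel K_sigma(z) = exp(-z^2 / (2 * sigma)) for a scalar offset z."""
    return math.exp(-(z * z) / (2.0 * sigma))


def informed_proposal(pi, x, sigma, alpha):
    """Return Q(x, .) over the state space {0, ..., len(pi) - 1}.

    pi holds positive (possibly unnormalized) target weights and the
    weight function is g(t) = t ** alpha.
    """
    weights = [
        (pi[y] / pi[x]) ** alpha * heat_kernel(x - y, sigma)
        for y in range(len(pi))
    ]
    z = sum(weights)  # partition function Z_g(x)
    return [w / z for w in weights]


# Toy target on five states (illustrative values only).
pi = [0.1, 0.2, 0.4, 0.2, 0.1]
q_local = informed_proposal(pi, x=2, sigma=0.5, alpha=0.5)    # stays near x
q_global = informed_proposal(pi, x=2, sigma=1e12, alpha=1.0)  # close to pi
```

With a very large $\sigma$ and $\alpha = 1$ the kernel is essentially flat, so the proposal row is numerically indistinguishable from the normalized target, whereas a small $\sigma$ concentrates the mass around the current state.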

3. ANY-SCALE BALANCED SAMPLER

Many recent works (Sun et al., 2021; Zhang et al., 2022; Sun et al., 2022b) have shown that using a larger kernel $K_\sigma$ can significantly improve the sampling efficiency of locally balanced samplers. Unfortunately, in migrating locally balanced samplers to a global proposal regime, these works ignore the fact that a locally balanced weight function is no longer optimal and that the accuracy of the gradient approximation diminishes. To address these shortcomings, we propose any-scale balanced samplers. We consider sampling from the discrete space $\mathcal{X} = \mathcal{S}^d = \{1, \dots, S\}^d$ with a target distribution $\pi(x) \propto e^{f(x)}$.

3.1. ANY-SCALE BALANCED FUNCTIONS

The first challenge in developing a non-local proposal is that the locally balanced weight function is no longer optimal. To determine the proper choice of weight function for kernels $K_\sigma(z) = \exp(-\|z\|^2 / (2\sigma))$ at different scales, we examine the acceptance rate for the informed proposal in (1) and consider the simple but sufficiently representative weight function class $g(t) = t^\alpha$ for different $\alpha$. Given a current state $x$ and new state $y$, we have the ratio

$$A_\sigma = \frac{\pi(y)\, Q_\sigma(y, x)}{\pi(x)\, Q_\sigma(x, y)} = \frac{\pi(y) \big(\frac{\pi(x)}{\pi(y)}\big)^{\alpha} K_\sigma(x - y) \big/ \sum_z \big(\frac{\pi(z)}{\pi(y)}\big)^{\alpha} K_\sigma(z - y)}{\pi(x) \big(\frac{\pi(y)}{\pi(x)}\big)^{\alpha} K_\sigma(y - x) \big/ \sum_z \big(\frac{\pi(z)}{\pi(x)}\big)^{\alpha} K_\sigma(z - x)} \tag{2}$$

$$= \frac{\pi^{1-\alpha}(y) \big/ \sum_z \pi^{\alpha}(z) K_\sigma(z - y)}{\pi^{1-\alpha}(x) \big/ \sum_z \pi^{\alpha}(z) K_\sigma(z - x)} = \frac{\pi^{1-\alpha}(y) \big/ (\pi^{\alpha} * K_\sigma)(y)}{\pi^{1-\alpha}(x) \big/ (\pi^{\alpha} * K_\sigma)(x)},$$

where $(F * G)(x) = \sum_z F(z)\, G(x - z)$ represents the convolution of two functions. Based on this formulation, one can easily recover the locally balanced function. In particular, consider the local proposal at diminishing scales $\sigma$, leading to:

$$\lim_{\sigma \to 0} (\pi^{\alpha} * K_\sigma)(x) = \pi^{\alpha}(x) \;\Rightarrow\; \lim_{\sigma \to 0} A_\sigma = \frac{\pi^{1-2\alpha}(y)}{\pi^{1-2\alpha}(x)}.$$

The limit ratio implies that $\alpha = \frac{1}{2}$ makes the stationary distribution $\pi_\sigma$ converge weakly to the target distribution $\pi$; hence the corresponding weight function is $g(t) = \sqrt{t}$, which is one of the most widely used locally balanced functions (Zanella, 2020).

A more interesting question is how to select $\alpha$ for $\sigma > 0$. Since computing the convolution for a general target distribution is intractable, we consider a continuous relaxation to obtain a hint for determining the proper value of $\alpha$ for a given $\sigma > 0$. In particular, consider a normal target distribution $\pi = \mathcal{N}(\mu, \sigma_0 I)$ in the real space $\mathbb{R}^d$. In this case, the ratio has a closed form:

$$\pi^{\alpha} * K_\sigma \propto \mathcal{N}\big(\mu, (\sigma + \sigma_0/\alpha) I\big) \propto \pi^{\frac{\alpha \sigma_0}{\alpha \sigma + \sigma_0}} \;\Rightarrow\; A_\sigma = \frac{\pi^{1 - \alpha - \frac{\alpha \sigma_0}{\alpha \sigma + \sigma_0}}(y)}{\pi^{1 - \alpha - \frac{\alpha \sigma_0}{\alpha \sigma + \sigma_0}}(x)}. \tag{5}$$

Here, to make the proposal balanced, the parameter $\alpha$ needs to satisfy

$$1 - \alpha - \frac{\alpha \sigma_0}{\alpha \sigma + \sigma_0} = 0 \;\Rightarrow\; \alpha = \frac{r - 2 + \sqrt{r^2 + 4}}{2r}, \qquad r = \frac{\sigma}{\sigma_0}.$$
One can easily check that this family of balanced functions forms a set of interpolants between $g(t) = \sqrt{t}$ and $g(t) = t$, where the two limiting values are:

$$\lim_{\sigma \to 0} \alpha = \lim_{r \to 0} \frac{r - 2 + \sqrt{r^2 + 4}}{2r} = \frac{1}{2}, \qquad \lim_{\sigma \to \infty} \alpha = \lim_{r \to \infty} \frac{r - 2 + \sqrt{r^2 + 4}}{2r} = 1.$$

The first equation recovers the locally balanced function $g(t) = \sqrt{t}$. The second equation shows that, if we consider all states as candidates in the proposal distribution, the optimal choice of weight function is $g(t) = t$, which causes the proposal distribution to become:

$$\lim_{\sigma \to \infty} Q_\sigma(x, y) = \frac{\pi(y)/\pi(x)}{\sum_{z \in \mathcal{X}} \pi(z)/\pi(x)} = \pi(y).$$

That is, the proposal degenerates to the target distribution. Ignoring computational cost, such a Markov chain draws independent samples from the target distribution in each step and has, in general, the best efficiency one can expect. Between these two limiting cases, the parameter $\alpha \in (0, 1)$ specifies an interpolation that needs to be carefully selected based on $\sigma$ to balance the proposal.

To this end, we employ an adaptive algorithm to automatically learn the proper configuration during sampling (Andrieu & Thoms, 2008). Since the hyperparameters $(\sigma, \alpha)$ are highly correlated, it can be challenging to tune them directly, so we instead employ a coordinate descent style method that alternately updates $\sigma$ and $\alpha$ based on the average jump distance. Specifically, with $\alpha$ fixed, we probe the values $(1 + \gamma)\sigma$, $\sigma$, and $(1 - \gamma)\sigma$, and select as the new value of $\sigma$ the one with the largest average jump distance. The same procedure is then applied to $\alpha$; see Algorithm 3 in Appendix A.2 for full details of the adaptation algorithm. In our experiments below, we observe that the effective sample size is a concave function of $\alpha$ for fixed $\sigma$ and vice versa, hence the adaptation algorithm is typically able to find a good $(\sigma, \alpha)$ configuration efficiently.
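The closed-form balance condition from the Gaussian relaxation is easy to check numerically. The sketch below evaluates $\alpha$ as a function of $r = \sigma/\sigma_0$ and verifies that it zeroes the exponent $1 - \alpha - \alpha\sigma_0/(\alpha\sigma + \sigma_0)$; the function names are illustrative assumptions.

```python
import math


def balanced_alpha(sigma, sigma0):
    """Exponent alpha that balances g(t) = t**alpha for kernel scale sigma
    under the Gaussian relaxation with target variance sigma0."""
    r = sigma / sigma0
    return (r - 2.0 + math.sqrt(r * r + 4.0)) / (2.0 * r)


def balance_residual(alpha, sigma, sigma0):
    """Exponent left on pi in the acceptance ratio; zero means balanced."""
    return 1.0 - alpha - alpha * sigma0 / (alpha * sigma + sigma0)


for sigma in (0.1, 1.0, 10.0, 100.0):
    alpha = balanced_alpha(sigma, sigma0=2.0)
    print(f"sigma={sigma:6.1f}  alpha={alpha:.4f}  "
          f"residual={balance_residual(alpha, sigma, 2.0):.1e}")
```

For $r = 1$ the formula gives $\alpha = (\sqrt{5} - 1)/2 \approx 0.618$, and the value moves from $1/2$ toward $1$ as $r$ grows, matching the two limits above.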

3.2. QUADRATIC APPROXIMATION

The second challenge is that, in a non-local proposal, the gradient approximation of the probability ratio becomes less accurate. To capture the correlation between variables, we consider a quadratic approximation of the log probability change:

$$f(y) - f(x) \approx (y - x)^\top \nabla f(x) + \frac{1}{2} (y - x)^\top W (y - x). \tag{9}$$

When $W = \nabla^2 f(x)$ is the Hessian matrix, (9) becomes a second order Taylor approximation. In this work, we employ a global $W$ for all states $x$, hence we choose $W$ as the empirical average Hessian. In particular, the pairs $(y - x, \nabla f(y) - \nabla f(x))$ are first collected during a burn-in period and $W$ is selected as:

$$W = \arg\min_{W = W^\top} \sum_{i=1}^{N} \big\| W (y_i - x_i) - (\nabla f(y_i) - \nabla f(x_i)) \big\|^2, \tag{10}$$

which can be efficiently solved via gradient descent. Please refer to Appendix A.3 for details. Substituting the quadratic approximation (9) into the informed proposal in (1), with weight function $g(t) = t^\alpha$, the quadratic proposal distribution becomes

$$Q(x, y) \propto \exp\Big( \alpha \big[ (y - x)^\top \nabla f(x) + \tfrac{1}{2} (y - x)^\top W (y - x) \big] - \tfrac{1}{2\sigma} (y - x)^\top (y - x) \Big). \tag{11}$$

Although the second order approximation improves proposal quality from the perspective of gradient approximation, it also turns (11) into a pairwise Markov random field, which is typically intractable to sample from (Murray, 2007). Therefore, to develop a practical sampling algorithm, we exploit a stochastic factorization of the quadratic proposal distribution known as the Gaussian integral trick, which originated in statistical physics (Hubbard, 1959; Hertz et al., 1991) and has more recently been extended in machine learning (Martens & Sutskever, 2010; Zhang et al., 2012). The original Gaussian integral trick is designed for binary random variables; here we show that it also works for more general discrete random variables.
In particular, for a quadratic distribution $\pi(z) \propto \exp(\frac{1}{2} z^\top W z + z^\top b)$ and a PSD diagonal matrix $D$ that guarantees $W + D$ is PSD, one can introduce a Gaussian auxiliary variable $Q(u \mid z) = \mathcal{N}((W + D)^{1/2} z, I)$, so that the conditional distribution of $z$ given $u$ can be obtained via Bayes' rule:

$$Q(z \mid u) \propto \exp\Big( \tfrac{1}{2} z^\top W z + z^\top b - \tfrac{1}{2} \big(u - (W + D)^{1/2} z\big)^\top \big(u - (W + D)^{1/2} z\big) \Big) \tag{12}$$

$$\propto \exp\Big( z^\top \big[ (W + D)^{1/2} u + b \big] - \tfrac{1}{2} z^\top D z \Big),$$

where the square root of a matrix can be obtained by either Cholesky or eigen decomposition. Because $D$ is diagonal, this conditional factorizes across the dimensions of $z$ and is therefore easy to sample. More details of the Gaussian integral trick are provided in Appendix A.5; also see Zhang et al. (2012) for a good introduction. To use this trick in sampling from (11), we first sample the auxiliary variable $u$ based on the current state, $Q(u \mid x) = \mathcal{N}((W + D)^{1/2} x, I)$, then propose $y$ according to

$$Q(y \mid x, u) \propto \exp\Big( \alpha\, y^\top \big[ \nabla f(x) - W x + (W + D)^{1/2} u \big] - \tfrac{1}{2} y^\top \big( \alpha D + \tfrac{1}{\sigma} I \big) y \Big). \tag{14}$$

Note that the marginal distribution $\int Q(u \mid x)\, Q(y \mid x, u)\, du$ is exactly $Q(x, y)$ in (11), hence we call this a stochastic factorization. By introducing the auxiliary variable $u$, we avoid calculating the intractable partition function for (11). The M-H acceptance test for this auxiliary sampler is:

$$A(x, u, y) = \min\Big\{ 1, \frac{\pi(y)\, Q(u \mid y)\, Q(x \mid y, u)}{\pi(x)\, Q(u \mid x)\, Q(y \mid x, u)} \Big\}. \tag{15}$$

Given $W$, a good choice for $D$ should give a high acceptance rate and a proposal distribution with large variance. However, directly maximizing the acceptance rate and proposal variance at the current sample with respect to $D$ is intractable, so we construct a surrogate. Consider a continuous relaxation $\pi(x) \propto \exp(\frac{1}{2} x^\top W x)$, where one can use $\sigma = \infty$, $\alpha = 1$, and the Gaussian integral trick guarantees that the acceptance rate is always 1. In this case, the sampling efficiency is determined only by the variance of the proposal distribution. For a current state $x$ and auxiliary variables $\xi, \zeta \sim \mathcal{N}(0, I_d)$, denote $u = (W + D)^{1/2} x + \xi$, and observe that the new state $y$ in (14) satisfies

$$y - x = D^{-1} W x + D^{-1} (W + D)^{1/2} \xi + D^{-1/2} \zeta. \tag{16}$$
One can compute the variance of the change $(y - x)$ in the proposal distribution in closed form:

$$\mathbb{E}_x \mathbb{E}_{\xi, \zeta} \Big[ \big( (y - x) - \mathbb{E}[y - x] \big) \big( (y - x) - \mathbb{E}[y - x] \big)^\top \Big] = 2 D^{-1}, \tag{17}$$

which is totally determined by the diagonal matrix $D$. See Appendix A.4 for a detailed derivation. Therefore, we would like to minimize the diagonal of $D$ to obtain larger variance, and thus better sample efficiency, while still keeping $W + D$ a PSD matrix. A common approach to determining the diagonal matrix $D$ is the $\lambda$-shift, where $D = \lambda I$ is used with $\lambda = \max\{\epsilon, -\lambda_{\min}(W)\}$, such that $\epsilon \ge 0$ is a threshold and $\lambda_{\min}(W)$ is the smallest eigenvalue of $W$ (Martens & Sutskever, 2010). However, such an isotropic choice can suppress movement in dimensions with large variance. For example, consider a special case where $W$ is a diagonal matrix with $W_{11} = -100$ and $W_{jj} = -1$ for $j = 2, \dots, d$. Using $D = 100 I$ restricts the variance in all dimensions to 0.01, which is inefficient.

Instead, since the quadratic term $W$ is known, one can improve sampling efficiency by a more careful choice of diagonal matrix $D$. For example, Zhang et al. (2012) claim that the convexity of the proposal distribution depends on the spectrum of $W + D$. Following this idea, a straightforward choice is to minimize the largest eigenvalue of $W + D$. However, instead of only considering one direction, we find empirically that it is better to maximize the harmonic mean of the diagonal of $D^{-1}$ in (17). Intuitively, the harmonic mean provides a balanced approach to maximizing the variance of the proposal distribution, as it maximizes variance in all dimensions and puts more weight on directions with smaller variance. Conveniently, since

$$\arg\max_D \frac{1}{\sum_{i=1}^{d} 1 / D^{-1}_{ii}} = \arg\max_D \frac{1}{\sum_{i=1}^{d} D_{ii}} = \arg\min_D \sum_{i=1}^{d} D_{ii} = \arg\min_D \operatorname{tr}(D), \tag{18}$$

maximizing the harmonic mean of the variance reduces to minimizing the trace of $D$. Under the constraint that $W + D$ is PSD, we obtain a semi-definite programming (SDP) problem:

$$\min_D \operatorname{tr}(D) \quad \text{s.t.} \quad D \succeq 0, \; W + D \succeq 0. \tag{19}$$

Empirically, we found there is no need for an exact optimum of (19); a rough solution after early stopping is sufficient to characterize the variance scale in each dimension. Using a modern solver, this estimation step can typically be done in milliseconds for domains with 100 dimensions.

Complete Algorithm. With the any-scale weight function and the quadratic approximation, we are ready to present our any-scale balanced sampler (AB sampler) in Algorithm 1. The parameters $(\sigma, \alpha)$ are automatically tuned during sampling (see Algorithm 3 and Algorithm 4 for the details of the adaptive tuning), while $W$ and $D$ are 0 during burn-in and are updated via (10) and (18) right after burn-in.

Algorithm 1: AB sampling algorithm
Input: initial σ = 0.1, α = 0.5, W = 0, D = 0; initial state x_0
Output: MCMC chain x_{0:T} and adjusted σ, α, W, D
1. Burn-in period t < T_1: alternately update σ and α using Algorithm 3, calling the M-H step defined in Algorithm 2 as a subroutine; collect the trace x_{0:T_1} along the way.
2. Estimate W and D from the collected x_{0:T_1} via (10) and (18).
3. Mixed period T_1 ≤ t ≤ T: use the estimated W and D and continue alternately updating σ and α with Algorithm 3, calling the M-H step defined in Algorithm 2 as a subroutine.
4. Return the entire trace x_{0:T} and the estimated parameters.
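The alternating tuning of $(\sigma, \alpha)$ used by Algorithm 1 can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: `run_chain` is a hypothetical hook standing in for the M-H step of Algorithm 2, the chain state is threaded from one probe to the next, and `toy_chain` is a deterministic stand-in whose step size peaks at $(\sigma, \alpha) = (2, 0.8)$ so that the behaviour can be checked without a real sampler.

```python
def jump_distance(chain):
    """Total L1 distance between consecutive states of a chain."""
    return sum(
        sum(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(chain, chain[1:])
    )


def adapt_block(theta, gamma, x, run_chain, n_steps):
    """Probe theta, theta*(1+gamma), theta*(1-gamma) and keep the best.

    run_chain(theta, x, n_steps) is a hypothetical callable returning the
    visited states (starting with x) after n_steps M-H steps.
    """
    candidates = (theta, theta * (1 + gamma), theta * (1 - gamma))
    distances = []
    for candidate in candidates:
        chain = run_chain(candidate, x, n_steps)
        distances.append(jump_distance(chain))
        x = chain[-1]  # assumption: continue from the last visited state
    best = max(distances)
    if distances[1] == best:
        return candidates[1], x
    if distances[2] == best:
        return candidates[2], x
    return theta, x


def adapt(sigma, alpha, x, run_chain, gamma=0.2, beta=0.9,
          n_steps=100, n_rounds=50):
    """Alternately tune sigma and alpha; shrink gamma when both stall."""
    for _ in range(n_rounds):
        new_sigma, x = adapt_block(
            sigma, gamma, x,
            lambda s, x0, n: run_chain(s, alpha, x0, n), n_steps)
        new_alpha, x = adapt_block(
            alpha, gamma, x,
            lambda a, x0, n: run_chain(sigma, a, x0, n), n_steps)
        if new_sigma == sigma and new_alpha == alpha:
            gamma *= beta
        else:
            sigma, alpha = new_sigma, new_alpha
    return sigma, alpha, x


def toy_chain(sigma, alpha, x, n_steps):
    """Toy stand-in for a sampler: step size peaks at (sigma, alpha) = (2, 0.8)."""
    step = 1.0 / (1.0 + (sigma - 2.0) ** 2 + (alpha - 0.8) ** 2)
    return [(x[0] + k * step,) for k in range(n_steps + 1)]


sigma_hat, alpha_hat, _ = adapt(0.1, 0.5, (0.0,), toy_chain)
```

On the toy objective the loop settles near the peak and then shrinks the probe width $\gamma$; with a real sampler the jump distances are noisy, so the buffer size per probe matters.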

4. RELATED WORK

The informed proposal, which uses information about the target distribution to guide the proposal for the Metropolis-Hastings (M-H) algorithm, has been extensively studied for discrete spaces in recent years. A number of methods have attempted to first map the discrete space to a continuous space using relaxation, apply gradient-based methods in the continuous space, and then map the new state back to the discrete space, either by using auxiliary variables, uniform dequantization, or VAE flow (Zhang et al., 2012; Pakman & Paninski, 2013; Nishimura et al., 2017; Han & Liu, 2018; Zhou, 2020; Jaini et al., 2021). Such methods work in some scenarios, but embedding a discrete space into a continuous space often destroys its natural topological structure and can create highly multimodal and irregular target distributions (Zanella, 2020). As shown in previous work, such methods do not scale well to high dimensional discrete settings (Grathwohl et al., 2021). Another group of methods attempts to work directly within discrete spaces. Titsias & Yau (2017) and Dai et al. (2020) introduce auxiliary variables to trade off the number of updated variables in a block against computational cost; however, by relying on Gibbs sampling, such methods still require significant overhead to make updates. In addition to the related works (Zanella, 2020; Grathwohl et al., 2021; Sun et al., 2021; 2022a; Zhang et al., 2022) already discussed in depth above, a concurrent work (Rhodes & Gutmann, 2022) has considered preconditioning and also used the Gaussian integral trick to incorporate second order information from the target distribution, but this work does not study how to properly choose the weight function $g$, the hyperparameters $(\sigma, \alpha)$, and the diagonal matrix $D$, making the resulting algorithm less efficient.
Another recent work (Sun et al., 2022b) proves that the optimal scale $\sigma$ for a locally balanced proposal is achieved when the average acceptance rate equals 0.574, and gives a robust adaptive algorithm for tuning $\sigma$. However, this result relies on the properties of the locally balanced function, and does not apply to the more general weight function $g(t) = t^\alpha$ with $\alpha \ne 0.5$.

5. EXPERIMENTS

We conducted an experimental evaluation on three types of target distributions: 1) quadratic synthetic distributions, 2) non-quadratic synthetic distributions, and 3) real distributions. For quadratic synthetic distributions, we focus on demonstrating the benefits of selecting a high quality diagonal matrix D. For non-quadratic synthetic distributions, we show that the performance of the proposed sampler significantly relies on the choice of weight function g(t) = t α . For real distributions, we compare against baseline samplers on challenging inference problems in deep energy based models trained on MNIST, Omniglot, and Caltech datasets.

5.1. SETTINGS

Samplers. We denote the proposed Any-scale Balanced sampler as the AB-trace sampler, which uses the any-scale balanced function $g(t) = t^\alpha$ and obtains the diagonal matrix $D$ by minimizing its trace. For comparison, we consider the classical discrete samplers, random walk Metropolis (RWM) and the Gibbs sampler. We also compare to a locally balanced sampler (LB), considering a representative version, DLP in Zhang et al. (2022), that uses $\alpha = 0.5$ and $W = 0$ in (11); this is mathematically equivalent to NCG in Rhodes & Gutmann (2022). For RWM and LB, we follow the optimal acceptance rates in Sun et al. (2022b) and tune the scale of the proposal distribution until the average acceptance rate is 0.234 and 0.574, respectively. To demonstrate the benefit of the proposed methods, we consider a few variants for ablation: DLP-trace, which uses the same anisotropic diagonal matrix $D$ as AB-trace in the kernel $K_\sigma(z) = \exp(-z^\top D z / \sigma)$ for DLP; AB-1st, which only uses gradients to approximate the probability ratio; and AB-shift and AB-max, which obtain the diagonal matrix $D$ via the $\lambda$-shift or by minimizing the maximum eigenvalue of $W + D$, respectively. For all AB-* samplers, we tune $(\sigma, \alpha)$ adaptively via the algorithm discussed above (and described in Appendix A.2). Note that the PAVG sampler in Rhodes & Gutmann (2022) is equivalent to AB-shift with fixed $(\sigma, \alpha) = (\infty, 1)$. Since we find that tuning $(\sigma, \alpha)$ improves efficiency, we use AB-shift to represent PAVG. More details about the sampler implementations, such as solving for $D$, are given in Appendix A.1.

Metrics.

As in other works (Hoffman et al., 2014; Zanella, 2020), we use effective sample size (ESS) (Lenth, 2001) to characterize the efficiency of the samplers on synthetic distributions. To reduce the effects of implementation, we report ESS normalized in two different ways: we let $\mathrm{ESS}_n$ denote the ESS per 10,000 queries of the log likelihood function, and $\mathrm{ESS}_t$ denote the ESS per second of sampling. For each setting and sampler, we run 100 chains for $T = 100{,}000$ steps, with $T_1 = 20{,}000$ burn-in steps to make sure the chain mixes. For real distributions, we compare the mixing time of different samplers.
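As an illustration of the metric, the sketch below estimates ESS for a scalar chain by summing sample autocorrelations until the first non-positive lag, and then normalizes by the number of log likelihood queries. This truncation rule is one common estimator and is an assumption here; the estimator actually used for the reported numbers is not specified in this section, and the function names are illustrative.

```python
import random


def effective_sample_size(xs):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncated at the first
    non-positive autocorrelation (a simple, commonly used rule)."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = sum(centered[i] * centered[i + lag]
                  for i in range(n - lag)) / (n * var)
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def ess_per_queries(ess, n_queries, per=10_000):
    """Normalize ESS by the number of log likelihood queries (ESS_n style)."""
    return ess * per / n_queries


random.seed(0)
iid = [random.gauss(0.0, 1.0) for _ in range(2000)]
ar1 = [0.0]
for _ in range(1999):
    # Strongly autocorrelated chain: x_t = 0.9 * x_{t-1} + noise.
    ar1.append(0.9 * ar1[-1] + random.gauss(0.0, 1.0))

ess_iid = effective_sample_size(iid)
ess_ar1 = effective_sample_size(ar1)
```

The independent draws keep an ESS close to the chain length, while the autocorrelated chain loses most of it, which is the effect that large correlated local moves have on a sampler.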

5.2. QUADRATIC SYNTHETIC DISTRIBUTIONS

Ising model. The Ising model (Ising, 1924) is a mathematical model of ferromagnetism in statistical mechanics. It consists of binary random variables arranged in a graph $G = (V, E)$ and allows each node to interact with its neighbors. The unnormalized log probability function of the Ising model is:

$$f(x) = \sum_{i \in V} w_i x_i + \sum_{(i, j) \in E} J_{ij} x_i x_j.$$

In this experiment, we consider Ising models on 2D grid graphs and Barabasi-Albert-4 graphs (Albert & Barabási, 2002). For grid Ising, we set $J_{ij} = 0.4407$ at the critical temperature (Onsager, 1944), so that the model is at its phase transition and hard to sample from; see Appendix B.2 for a more detailed description. We conduct sampling at high, medium, and low temperatures. In Table 1, we report results for the medium temperature, where $J_{ij} = 0.4407$. More results on the Ising model are given in Table 2 and Table 3.

Lattice Gaussian Model. The lattice Gaussian model is obtained by restricting the Gaussian distribution to a lattice; it is an important distribution in coding and cryptography (Kschischang & Pasupathy, 1993; Micciancio & Regev, 2007). The unnormalized log probability is:

$$f(x) = -\frac{1}{2} (x - b)^\top W (x - b).$$

In this experiment, we use a finite state space $\mathcal{X} = \{0, 1, \dots, 20\}^{100}$ and investigate two settings for the Gaussian model. The first setting is a rotated Gaussian, $W = P^\top \Lambda P$, where $P$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. The second setting is a sparse Gaussian, which is a pairwise Markov random field defined on a cycle. We constructed the lattice Gaussian models with low, medium, and high conditions. More detailed descriptions of these models are given in Appendix B.3. We report the results for the medium condition in Table 1. More results are given in Tables 4 and 5.

Results Analysis. One can observe that the AB samplers substantially outperform existing samplers on all distributions.
Specifically, the first order sampler AB-1st has ESS consistently larger than LB, which justifies the benefit of selecting the proper weight function. Also, the AB-trace sampler has comparable efficiency to AB-shift on Grid Ising and Rotation Gaussian, but is significantly better on BA-4 Ising and Sparse Gaussian. The reason is that the variables in Grid Ising and Rotation Gaussian are nearly homogeneous, so an isotropic diagonal matrix $D = \lambda I$ is not slowed by a few hard dimensions. However, in BA-4 Ising and Sparse Gaussian, the variance in different dimensions can be very different, and AB-trace demonstrates significant advantages by employing a general diagonal matrix $D$ that allows different step sizes in different dimensions.

5.3. NON-QUADRATIC SYNTHETIC DISTRIBUTIONS

Bayesian Logistic Regression (BLR). Following Zhou (2020), we consider a logistic regression model $Y \sim \mathrm{Bernoulli}(\mathrm{sigmoid}(X\beta))$, with $Y \in \{0, 1\}^m$, $X \in \mathbb{R}^{m \times d}$, $\beta \in \{0, 1\}^d$. We first generate the sample $X, Y$; then, using a uniform prior, the target distribution is the posterior of $\beta$ with the unnormalized log probability function:

$$f(\beta) = -\sum_{i=1}^{m} \Big[ y_i \log\big(1 + \exp(-\sigma_i)\big) + (1 - y_i) \log\big(1 + \exp(\sigma_i)\big) \Big], \qquad \sigma_i = \sum_{j=1}^{d} X_{ij} \beta_j. \tag{22}$$

In this experiment, we considered a $d = 100$ dimensional regression with $m = 50$ samples. More details for generating $X, Y$ are given in Appendix B.4. We report the results for ESS in Table 2. For AB-trace, the selected configuration is $(\sigma, \alpha) = (16, 0.96)$. To justify the quality of this selection, we also plot the ESS for AB-trace with different $(\sigma, \alpha)$ in Figure 2.

Quartic Mixture Model (QMM). Following Rhodes & Gutmann (2022), we consider a quartic mixture model, where the unnormalized log likelihood function can be written as:

$$f(x) = \log \sum_{k=1}^{K} \exp\big( -\mathrm{poly}^4_k(x) \big),$$

where $\mathrm{poly}^4_k$ is a multivariate polynomial of degree 4. In this experiment, we use a finite state space $\mathcal{X} = \{0, 1, \dots, 20\}^{50}$ and $K = 50$ components for the mixture model. More details about $\mathrm{poly}^4_k$ are given in Appendix B.5. We report the ESS results in Figure 3. For AB-trace, the selected configuration is $(\sigma, \alpha) = (415, 0.92)$. To justify the quality of this selection, we also plot the ESS for AB-trace with different $(\sigma, \alpha)$ in Figure 3.

Results Analysis. For the non-quadratic synthetic distributions, the AB-trace sampler significantly outperformed the other methods. From the curves for $(\sigma, \alpha)$, one can see that the adaptive tuning algorithm successfully found near-optimal configurations. Note that in Figure 2 and Figure 3, the values $\sigma = 64$ and $\sigma = 1000$ can be regarded as infinity, since further increasing $\sigma$ does not influence the results. Even in this regime, and in contrast to the limiting analysis in (8), the optimal $\alpha$ is still not 1, because the probability ratio is only estimated approximately.
One interesting phenomenon is that the first order method AB-1st can be more efficient than second order samplers AB-shift and AB-max in BLR. The reason is that the Gaussian integral trick introduces extra variance in the proposal distribution. If the diagonal matrix D is not properly selected, the benefit of using a second order sampler can be reduced.

5.4. DEEP EBMS ON REAL DISTRIBUTIONS

Having observed excellent performance of the AB sampler on synthetic datasets, we considered sampling from more challenging real distributions. In particular, we trained deep EBMs parameterized by ResNet (He et al., 2016) on the MNIST, Omniglot, and Caltech datasets. For these real image distributions, we are interested in how fast sampling algorithms can find high quality images, so we report the mixing rate in Figure 4. Since we are comparing behavior during the mixing stage, we do not have samples to estimate $W$ via (10), hence we use a slightly modified version of Algorithm 1. In particular, we use the true data (from the datasets) to estimate the variance $\mathrm{var}_i$ of each variable $x_i$ and set $W$ to a diagonal matrix with $W_{ii} = 1/(1 + \mathrm{var}_i)$. In this case, we do not need the Gaussian integral trick and we do not distinguish between the different versions of the AB sampler. More details are given in Appendix B.6. In Figure 4, one can see that the AB sampler mixes faster than the LB sampler on all three real distributions. The optimal values of $\alpha$ selected for MNIST, Omniglot, and Caltech are 0.6, 0.55, and 0.55, respectively. They are significantly smaller than those for the non-quadratic synthetic distributions. We believe the reason is that these deep EBMs are much more complicated than the synthetic distributions: larger estimation errors only allow the sampler to make local movements, which leads to a smaller $\alpha$.

6. CONCLUSION

In this work, we proposed an Any-scale Balanced sampler (AB sampler) that substantially improves on existing locally balanced samplers for discrete spaces in two respects:
• the AB sampler goes beyond considering the locally balanced function as an "optimal" choice of weight function in an informed proposal, and provides an adaptive algorithm for finding a good configuration of $(\sigma, \alpha)$;
• the AB sampler introduces the Gaussian integral trick, which allows an efficient second order approximation to improve proposal quality.

There are still directions for further improvement of the AB sampler. First, the current adaptation procedure (Algorithm 3) tunes $(\sigma, \alpha)$ based on an empirical estimate of the jump distance, which can vary considerably during the mixing process. As a result, the adaptation is not stable until the Markov chain reaches its stationary distribution. This is not a major problem for sampling, but if we want to train EBMs via contrastive divergence (Hinton, 2002; Tieleman, 2008), the current adaptation algorithm can hardly find the optimal configuration for $(\sigma, \alpha)$ because the model keeps changing. A potential solution is to use adaptive algorithms based on the acceptance rate (Roberts & Rosenthal, 2001; Sun et al., 2022b). Second, the estimation of $W$ in (10) is rough. In complicated distributions, the Hessian matrix can vary considerably across states $x$. A more accurate quadratic approximation could be obtained via Riemannian (Girolami & Calderhead, 2011) or quasi-Newton style algorithms (Zhang & Sutton, 2011) that allow $W = W(x)$ to depend on the current state $x$. Third, the current quadratic approximation can be inefficient on large models. For example, on a large sparse quadratic model, solving the sparse SDP can be time consuming, and the square root $(W + D)^{1/2}$ is not necessarily sparse. Hence, more sophisticated designs are needed to make the quadratic approximation scalable.

A SAMPLERS

A.1 SEMI-DEFINITE PROGRAMMING

Solving for the diagonal matrix D in AB-trace and AB-max can be formulated as semi-definite programming (SDP), with one SDP problem for each variant. For models with at most 100 variables (Lattice Gaussian, Bayesian Logistic Regression, Quartic Mixture Model), Mosek (ApS, 2019) takes less than 0.05 seconds to solve the SDP problem. For models with 400 variables (Ising), Mosek takes less than 1 second. For models with 784 variables, Mosek takes around 10 seconds. For all distributions considered in this work, the time spent solving the SDP is negligible compared to the sampling time. However, for models with several thousand or more variables, the cost of directly solving the SDP could be high, and better methods for estimating the diagonal matrix D are needed.

A.2 ADAPTIVE TUNING ALGORITHM

We give the pseudo code for the adaptive tuning of (σ, α) in Algorithm 3. The basic idea is to alternately update σ and α so as to maximize the average jump distance. In lines 1, 2, and 3 of Algorithm 4, the samples are collected by calling the M-H step of the AB sampler as in Algorithm 2.

Algorithm 3: Adapting Algorithm
Input: initial σ = 0.1, α = 0.5, update rate γ = 0.2, decay rate β = 0.9, initial state x_0, buffer size N = 100.
Output: parameters σ, α, samples x_1, x_2, ...
for i = 0, 1, ... do
    σ′, (x_{6iN+1}, ..., x_{6iN+3N}) ← Adapting Algorithm Block(θ = σ, γ, x_0)
    α′, (x_{6iN+3N+1}, ..., x_{6iN+6N}) ← Adapting Algorithm Block(θ = α, γ, x_0)
    if σ == σ′, α == α′ then
        γ = βγ
    else
        σ = σ′, α = α′
    end
end

Algorithm 4: Adapting Algorithm Block
Input: target parameter θ, adapting rate γ, initial state x_0
Output: updated parameter θ′, samples x_1, ..., x_{3N}
Using parameter θ to sample x_1, ..., x_N via Algorithm 2
Compute d_0 = Σ_{i=1}^{N} |x_i − x_{i−1}|_1
Using parameter θ(1 + γ) to sample x_{N+1}, ..., x_{2N} via Algorithm 2
Compute d_+ = Σ_{i=1}^{N} |x_{N+i} − x_{N+i−1}|_1
Using parameter θ(1 − γ) to sample x_{2N+1}, ..., x_{3N} via Algorithm 2
Compute d_− = Σ_{i=1}^{N} |x_{2N+i} − x_{2N+i−1}|_1
if max{d_0, d_+, d_−} == d_+ then
    θ′ = θ(1 + γ)
else if max{d_0, d_+, d_−} == d_− then
    θ′ = θ(1 − γ)
else
    θ′ = θ
end

A.3 QUADRATIC APPROXIMATION

Here, we explain how to efficiently solve the following optimization problem:

$$W^* = \arg\min_{W = W^\top} \sum_{i=1}^{N} \left\| W (y_i - x_i) - \left( \nabla f(y_i) - \nabla f(x_i) \right) \right\|^2 .$$

Let X ∈ R^{N×d} be the matrix whose i-th row is X_i = y_i − x_i. Similarly, let Y ∈ R^{N×d} be the matrix whose i-th row is Y_i = ∇f(y_i) − ∇f(x_i). Then the problem can be rewritten as:

$$W^* = \arg\min_{W = W^\top} \sum_{j=1}^{d} \left\| X W_{:,j} - Y_{:,j} \right\|^2 ,$$

where W_{:,j} denotes the j-th column of W. One can see that the loss function is a linear regression loss. The feasible region W = W^⊤ is a d(d+1)/2-dimensional linear subspace.
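As a result, one can efficiently solve this convex optimization problem via projected gradient descent, where the projection onto the space of symmetric matrices is simply X → (X + X^⊤)/2.

With y − x = D^{-1} W x + D^{-1} (W + D)^{1/2} ξ + D^{-1/2} ζ, we have

$$\begin{aligned}
& \mathbb{E}_x \mathbb{E}_{\xi,\zeta} \left[ \left( (y - x) - \mathbb{E}[y - x] \right) \left( (y - x) - \mathbb{E}[y - x] \right)^\top \right] && (30) \\
&= \mathbb{E}_x \left[ D^{-1} W x x^\top W D^{-1} + D^{-1} (W + D) D^{-1} + D^{-1} \right] && (31) \\
&= -D^{-1} W D^{-1} + D^{-1} W D^{-1} + 2 D^{-1} && (32) \\
&= 2 D^{-1},
\end{aligned}$$

where (32) holds because, for a normal random variable x with density proportional to exp(½ x^⊤ W x), the covariance of x is −W^{-1}, and because D^{-1} (W + D) D^{-1} = D^{-1} W D^{-1} + D^{-1}.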
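The symmetric least-squares fit of W in A.3 can be sketched with projected gradient descent as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name fit_symmetric_w, the zero initialization, the fixed step size 1/L, and the number of steps are all assumptions.

```python
import numpy as np


def fit_symmetric_w(X, Y, num_steps=500):
    """Minimize ||X W - Y||_F^2 over symmetric W by projected gradient descent.

    X: (N, d) array whose i-th row is y_i - x_i.
    Y: (N, d) array whose i-th row is grad f(y_i) - grad f(x_i).
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    # The gradient 2 X^T (X W - Y) is Lipschitz with constant 2 * lambda_max(X^T X);
    # eigvalsh returns eigenvalues in ascending order, so [-1] is the largest.
    lipschitz = 2.0 * np.linalg.eigvalsh(X.T @ X)[-1]
    step = 1.0 / lipschitz
    for _ in range(num_steps):
        grad = 2.0 * X.T @ (X @ W - Y)
        W = W - step * grad
        W = 0.5 * (W + W.T)  # projection onto the symmetric matrices
    return W
```

The projection is exact and costs only one transpose and one average per step, so each iteration is dominated by the two matrix products in the gradient.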

A.5 GAUSSIAN INTEGRAL TRICK

The Gaussian integral trick, also known as the Hubbard-Stratonovich transform (Hubbard, 1959), was first named in (Hertz et al., 1991) and extended by Martens & Sutskever (2010); Zhang et al. (2012) for efficient Gibbs/HMC sampling inference. The main idea is that, for a discrete-valued pairwise MRF with within-layer connections, introducing a real-valued auxiliary variable cancels the quadratic form x^⊤ W x in the energy function, which makes inference easy to carry out. Specifically, we would like to sample from an MRF with pairwise dependency, i.e., p(x) = exp(½ x^⊤ W x + x^⊤ b)/Z, where x ∈ {0, 1}^d and Z = Σ_x exp(½ x^⊤ W x + x^⊤ b). The vanilla sampler for this model is Markov chain Monte Carlo or Gibbs sampling, which provably converges to the target distribution but may still get stuck in some region. To accelerate sampling, one can introduce auxiliary variables to reformulate the MRF into a family of equivalent Boltzmann machines. Concretely, we introduce the auxiliary variable u with conditional distribution:

p(u|x) = N(u | A(W + D)x, A(W + D)A^⊤),

critical, and low temperatures. We report the results in Table 2. Consistent with results in statistical physics, a phase transition occurs at the critical temperature and makes sampling much harder (Onsager, 1944). In more detail, at low temperature, each variable in the grid Ising model is strongly correlated only with variables close to it. At the critical temperature, the correlation is global, and a variable can strongly depend on variables far from it. At high temperature, each variable is only weakly correlated with all the other variables. In all three scenarios, Any-scale (AB) samplers substantially outperform previous discrete sampling methods, including locally balanced (LB) samplers.

(Albert & Barabási, 2002). See Figure 5 for a visualization of a 50-4 BA graph.
Since we do not know the critical temperature in BA graphs, we keep the setting used for grid Ising, with zero external force w_i = 0 and interaction J_ij = 0.3000, 0.4407, 0.7071 for high, critical, and low temperatures. We report the results in Table 3. At all three temperatures, Any-scale (AB) samplers substantially outperform previous discrete sampling methods, including locally balanced (LB) samplers. One can also notice that the AB-trace sampler significantly outperforms the other AB samplers that use the quadratic approximation. In the low temperature model Ising (0.7071), the first-order method AB-1st even beats AB-shift and AB-max, which use the quadratic approximation. The reason is that the variables in a BA graph have inhomogeneous topology, so a casual selection of D does not help. Furthermore, the quadratic approximation involves extra randomness from the Gaussian integral trick, which harms the proposal quality. In grid graphs, where different variables have very similar topology, this drawback is less significant.

Then we generate an orthogonal matrix P and let W = −P Λ P^⊤. The results for the rotation Gaussian with L = 2, 10, 50 are reported in Table 4.

Sparse Gaussian. For the sparse Gaussian, we use bias vector b = 0. Given the parameter L, we generate the weight matrix W in the following way. We first generate the matrix M ∈ R^{100×100}, with M_ii = 1 and M_{i,i+1} = M_{i+1,i} ∼ N(0, 0.04), for i = 1, 2, ..., 100, where the indices wrap around, i.e., M_{100,101} = M_{100,1} and M_{101,100} = M_{1,100}. Then we generate the diagonal matrix Λ, such that Λ ii = 99 1 + i * (L -1) 1 2 , i = 1, 2, ..., 100. Then, we let W = −Λ M Λ. Such a log probability function defines a graphical model on a cycle. Since a cycle is sparse, we call it the sparse Gaussian. The results for the sparse Gaussian with L = 2, 10, 50 are reported in Table 4.
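The cyclic structure of the sparse Gaussian weight matrix can be illustrated with the sketch below. Several things here are assumptions: the function name sparse_gaussian_weights is hypothetical, N(0, 0.04) is read as a variance of 0.04 (standard deviation 0.2), and the diagonal of Λ is passed in by the caller as scales instead of being computed from L.

```python
import numpy as np


def sparse_gaussian_weights(scales, rng, offdiag_std=0.2):
    """Return W = -Lambda M Lambda for a cycle-structured matrix M.

    scales: 1-D array holding the diagonal of Lambda (supplied by the caller).
    rng: a numpy.random.Generator.
    offdiag_std: standard deviation of the off-diagonal entries (assumed 0.2).
    """
    scales = np.asarray(scales, dtype=float)
    d = len(scales)
    M = np.eye(d)
    for i in range(d):
        j = (i + 1) % d  # wrap around so that the last node links to the first
        M[i, j] = M[j, i] = rng.normal(0.0, offdiag_std)
    Lam = np.diag(scales)
    return -Lam @ M @ Lam
```

Each row of the resulting W has at most three nonzero entries (the diagonal and the two cycle neighbors), which is the sparsity the name refers to.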



Figure 1: ESS for different (σ, α) pairs

Figure 3: Results on QMM: (l) ESS for different samplers, (r) ESS for AB-trace with different (σ, α)

Figure 4: Mixing Time on Real Distributions

efficiency. Unlike in (8), the optimal α is still not 1, since there is some estimation error in the probability ratio. One interesting phenomenon is that the first-order method AB-1st can be more efficient than the second-order samplers AB-shift and AB-max on BLR. The reason is that the Gaussian integral trick introduces extra variance into the proposal distribution. If the diagonal matrix D is not properly selected, the benefit of using a second-order sampler can be reduced.

D ⪰ 0, λI ⪰ W + D ⪰ 0 (27)

Both SDP problems can be solved efficiently by modern SDP solvers. In this work, we use the academic version of Mosek (ApS, 2019).
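A candidate diagonal returned by a solver can be checked against the constraints in (27) with a small numerical test. The sketch below is only a feasibility check, not an SDP solver; the function name satisfies_constraints and the tolerance tol are assumptions.

```python
import numpy as np


def satisfies_constraints(W, d, lam, tol=1e-9):
    """Check D >= 0 and lam * I >= W + D >= 0 (in the PSD order) for D = diag(d).

    W: symmetric (n, n) array; d: length-n diagonal of D; lam: scalar bound.
    """
    d = np.asarray(d, dtype=float)
    eig = np.linalg.eigvalsh(W + np.diag(d))  # eigenvalues in ascending order
    d_is_psd = bool(np.all(d >= -tol))
    lower_ok = bool(eig[0] >= -tol)        # W + D is positive semi-definite
    upper_ok = bool(eig[-1] <= lam + tol)  # lam * I - (W + D) is positive semi-definite
    return d_is_psd and lower_ok and upper_ok
```

For example, with W = [[-2, 1], [1, -2]] and d = (3, 3), the matrix W + D has eigenvalues 0 and 2, so the check passes for λ = 2 and fails for λ = 1.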

OF THE PROPOSAL DISTRIBUTION

For ξ, ζ ∼ N(0, I_d), u = (W + D)^{1/2} x + ξ, and y

Figure 6: Images sampled from the trained EBMs.

ESS on selected Quadratic Distributions

Adapting Algorithm Block Input: target parameter θ, adapting rate γ, initial state x 0 Output: updated parameter θ ′ , samples x 1 , ..., x 3N Using parameter θ to sample x 1 , ..., x N via Algorithm 2 Compute d 0
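The adaptive tuning of (σ, α) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' code: sample_chain and make_chain are hypothetical callables standing in for N M-H steps of Algorithm 2, states are represented as lists of numbers, and each block continues the chain from the last state instead of restarting from x_0.

```python
def l1_jump_distance(states):
    """Sum of L1 distances between consecutive states."""
    return sum(
        sum(abs(a - b) for a, b in zip(prev, curr))
        for prev, curr in zip(states[:-1], states[1:])
    )


def adapting_block(theta, gamma, x0, sample_chain, n=100):
    """Try theta, theta*(1+gamma), theta*(1-gamma); keep the best jump distance.

    sample_chain(theta, x_start, n) is a hypothetical callable that runs n
    M-H steps from x_start and returns the list of n visited states.
    """
    candidates = [theta, theta * (1 + gamma), theta * (1 - gamma)]
    samples, distances = [], []
    x = x0
    for cand in candidates:
        chain = sample_chain(cand, x, n)
        distances.append(l1_jump_distance([x] + chain))
        samples.extend(chain)
        x = chain[-1]
    d0, d_plus, d_minus = distances
    top = max(distances)
    if top == d_plus:
        new_theta = candidates[1]
    elif top == d_minus:
        new_theta = candidates[2]
    else:
        new_theta = theta
    return new_theta, samples


def adapt(sigma, alpha, x0, make_chain, num_rounds, gamma=0.2, beta=0.9, n=100):
    """Alternate between tuning sigma and alpha for a fixed number of rounds.

    make_chain(sigma, alpha, x_start, n) is a hypothetical sampler callable.
    """
    x = x0
    for _ in range(num_rounds):
        new_sigma, chain = adapting_block(
            sigma, gamma, x, lambda s, xs, k: make_chain(s, alpha, xs, k), n)
        x = chain[-1]
        new_alpha, chain = adapting_block(
            alpha, gamma, x, lambda a, xs, k: make_chain(sigma, a, xs, k), n)
        x = chain[-1]
        if new_sigma == sigma and new_alpha == alpha:
            gamma *= beta  # neither parameter moved, so shrink the search step
        else:
            sigma, alpha = new_sigma, new_alpha
    return sigma, alpha, gamma
```

The tie-breaking order in adapting_block follows the pseudo code: d_+ is checked first, then d_−, and θ is kept only when d_0 is the strict maximum.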

ESS on 20 × 20 Grid Ising

We consider a 400-4 BA graph, that is, a Barabási-Albert random graph with 400 nodes and 4 attached edges for every node

ESS on 400-4 BA Ising

ESS on 100d Rotation Gaussian (columns: ESS n, ESS t)


Published as a conference paper at ICLR 2023

Figure 5: Visualization: (l) 10 × 10 Grid Graph, (r) 50-4 BA Graph

where D = diag(d). Therefore, we have the joint distribution as

which cancels the quadratic term w.r.t. x and makes the dimensions of x independent. This implies that p(x|u) is a multivariate Bernoulli distribution, i.e.,

with

Note that in the original Gaussian integral trick (36), the simplification −½ x^⊤ D x = −½ d^⊤ x uses the property that x is a binary random variable. Actually, we do not have to use this simplification. As long as D is diagonal, we can factorize the distribution and make an efficient proposal. In this work, we consider using A = (W + D)^{-1/2}, such that the conditional distributions have the following simple forms:
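To make the cancellation concrete for the choice A = (W + D)^{-1/2}, the sketch below assumes p(u|x) = N(u | (W + D)^{1/2} x, I) and, for binary x, a factorized p(x|u) with logits (W + D)^{1/2} u + b − d/2. These forms were obtained by expanding log p(x) + log p(u|x) from the definitions of p(x) and p(u|x) and should be read as a worked assumption, not as a copy of the paper's equations. All function names are hypothetical, and the sampler shown is the classic auxiliary-variable Gibbs sampler for a binary pairwise MRF, not the AB proposal itself.

```python
import itertools

import numpy as np


def psd_sqrt(S):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def log_joint(x, u, W, b, d):
    """Unnormalized log p(x) + log p(u | x), with p(u | x) = N((W + D)^{1/2} x, I)."""
    R = psd_sqrt(W + np.diag(d))
    log_px = 0.5 * x @ W @ x + b @ x
    resid = u - R @ x
    return log_px - 0.5 * resid @ resid


def bernoulli_logits(u, W, b, d):
    """Logits of the factorized conditional p(x | u) for x in {0, 1}^d."""
    R = psd_sqrt(W + np.diag(d))
    return R @ u + b - 0.5 * d


def gibbs_step(x, W, b, d, rng):
    """One auxiliary-variable Gibbs sweep: draw u | x, then x | u."""
    R = psd_sqrt(W + np.diag(d))
    u = R @ x + rng.standard_normal(len(x))
    probs = 1.0 / (1.0 + np.exp(-bernoulli_logits(u, W, b, d)))
    return (rng.random(len(x)) < probs).astype(float)
```

For binary x, log_joint(x, u, ...) equals x^⊤ logits − ½ u^⊤ u, so the joint is linear in x once u is fixed. This is the cancellation of the quadratic term described in the text, and it requires W + D to be positive semi-definite.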

B EXPERIMENTS

B.1 HARDWARE

All experiments were run on a virtual machine with CPU: Intel Haswell, GPU: 4× Nvidia V100, System: Debian 10.

B.2 ISING MODEL

The Ising model (Ising, 1924) is a mathematical model of ferromagnetism in statistical mechanics. It consists of binary random variables arranged in a graph G = (V, E) and allows each node to interact with its neighbors. The log probability function of the Ising model is:

In this experiment, we consider Ising models on 2D grid graphs and Barabási-Albert graphs (Albert & Barabási, 2002).

Grid Ising. We consider 20 × 20 grid graphs. See Figure 5 for a visualization of a 10 × 10 grid graph. We use zero external force w_i = 0 and set the interaction J_ij = 0.3000, 0.4407, 0.7071 for high,

B.4 BAYESIAN LOGISTIC REGRESSION

We consider the logistic regression model Y ∼ Bernoulli(sigmoid(Xβ)), with Y ∈ {0, 1}^50, X ∈ R^{50×100}, β ∈ {0, 1}^100. We first generate X ∈ R^{50×100}. Each row X_i is a realization of the normal distribution N(0, 0.25 Λ Σ Λ), where Σ_ii = 1.25, Σ_ij = 0.25 for i ≠ j, and Λ_ii = exp(−0.25 + (i − 1)/99). Then, we set the ground truth β such that β_i = 1 for i = 1, 2, ..., 7 and β_i = 0 for i = 8, 9, ..., 100. Next, we compute the logits v = Xβ and sample Y_i ∼ Bernoulli(σ(v_i)) for i = 1, ..., 50, where σ(t) = 1/(1 + exp(−t)). Our target distribution is the posterior of β, and the log probability function is:

B.5 QUARTIC MIXTURE MODEL

Following Rhodes & Gutmann (2022), we consider a quartic mixture model, where the log likelihood function can be written as:

where poly^4_k is a multivariate polynomial of degree 4 generated in the following way. For component k = 1, ..., 50, the associated bias is b_k = ((k − 1)/49) 1 ∈ R^50. Denote s ∈ R^20 such that s_i = 20 · exp(−0.5 + (i − 1)/19) for i = 1, ..., 20. Then, we have the vector

(45)

We also generate a rotation matrix P ∈ R^20 shared by all components, and we have:

Then, the polynomial is defined as

B.6 DEEP EBM

Model training. In the main text, we compare the sampling efficiency of different samplers using trained deep EBMs. Here we provide more details on how these pretrained EBMs were obtained. We follow existing work and parameterize the EBMs using ResNets, trained within the persistent contrastive divergence framework (Tieleman, 2008). Specifically, we follow Grathwohl et al. (2021); Sun et al. (2021) and maintain a buffer of multiple MCMC chains. We use the sampler proposed in Sun et al. (2021) and run 60 steps to obtain samples from the current model for each gradient update. We retain the model after 50,000 steps of training. The models are all reasonable and can produce realistic binary images that resemble the ground truth data.

Estimating W. Since we compare mixing times on real distributions, we cannot use the AB sampler as in Algorithm 1, which always uses W = D = 0 during the burn-in stage. Instead, we directly use the true data from the datasets to estimate the variance var_i of each variable. We then set W to be a diagonal matrix with W_ii = 1/(1 + var_i). In this case, the proposal distribution in equation 11 is naturally factorized, and we do not need the Gaussian integral trick.

Adaptive Tuning. The adaptive Algorithm 3 tunes (σ, α) based on the average jump distance, which can be very unstable during the mixing stage. Hence, we simply apply a grid search over the configurations of (σ, α) and report the best one in Figure 4.

