A SCORE-BASED MODEL FOR LEARNING NEURAL WAVEFUNCTIONS

Abstract

Quantum Monte Carlo coupled with neural network wavefunctions has shown success in computing ground states of quantum many-body systems. Existing optimization approaches compute the energy by sampling local energy from an explicit probability distribution given by the wavefunction. In this work, we provide a new optimization framework for obtaining properties of quantum many-body ground states using score-based neural networks. Our new framework does not require an explicit probability distribution and performs the sampling via Langevin dynamics. Our method is based on the key observation that the local energy is directly related to scores, defined as the gradient of the logarithmic wavefunction. Inspired by the score matching and diffusion Monte Carlo methods, we derive a weighted score matching objective to guide our score-based models to converge correctly to ground states. We evaluate our approach with experiments on quantum harmonic traps and atomic systems, and the results show that it can accurately learn ground states. By implicitly modeling high-dimensional data distributions, our work paves the way toward a more efficient representation of quantum systems.

1. INTRODUCTION

Understanding the properties of quantum systems lies at the core of many scientific disciplines, such as condensed matter physics, materials science, and quantum chemistry. A quantum system is characterized by its ground state wavefunction, formally obtained by solving the Schrödinger equation. However, directly solving the Schrödinger equation for quantum systems with many particles is impractical due to the exponentially large Hilbert space. Owing to their strong dimension reduction capabilities, deep learning methods have emerged as strong candidates for approximately solving the Schrödinger equation and extracting properties of quantum systems with the desired accuracy. For example, under the supervised learning setting, deep learning methods have been successfully applied to predict the quantum properties of molecular systems based on training data generated from density functional theory (DFT) calculations (Schütt et al., 2017; Gasteiger et al., 2020; Liu et al., 2022; Wang et al., 2022). However, supervised methods rely on expensive computational simulations to generate a large amount of training data, and the accuracy of these methods is fundamentally limited by the data quality. Furthermore, DFT calculations involve various approximations and are not guaranteed to reach true ground states.

A common scheme for approximately solving the Schrödinger equation is the variational principle, which optimizes a trial wavefunction to reach the ground state by minimizing its energy as much as possible via quantum Monte Carlo (QMC). Such a method is called variational Monte Carlo, and its accuracy relies on the expressive power of the trial wavefunction. Recently, deep learning methods coupled with variational Monte Carlo have unleashed the potential of both methods (Carleo & Troyer, 2017; Hermann et al., 2022). Powered by the efficient sampling and optimization framework of quantum Monte Carlo and the universal approximation capability of deep neural networks, neural wavefunctions can accurately model quantum states, and dramatic improvements have been achieved (Pfau et al., 2020; Hermann et al., 2020).

Modeling a wavefunction is conceptually similar to modeling a probability density. Existing methods model the wavefunction explicitly by training a neural network to directly output the wavefunction values. However, numerous examples in machine learning have shown that implicitly modeling data distributions provides better representations (Kingma & Welling, 2014; Goodfellow et al., 2014; Ho et al., 2020). As our direct reference, score-based methods have demonstrated strong success in generative modeling (Song & Ermon, 2019; Song et al., 2020). A score is defined as the gradient of the log probability. For example, realistic images can be generated from random noise by following dynamics defined by scores. In this paper, we show that the quantum wavefunction can be represented by score models and be optimized within the QMC framework. Our motivation to relate score-based models with QMC is based on an interesting connection between energy computations and score-based formulations. In QMC, the energy of a system is averaged over the local energy of plausible quantum states. Our observation is that the local energy only involves gradients of the logarithmic wavefunction, which we define as the score of the wavefunction. As a result, to minimize energy, the score must be explicitly computed. On the other hand, the actual wavefunction values are only used for sampling and optimization. To this end, we propose a new optimization framework for QMC where sampling and optimization are also achieved by using score functions alone, eliminating the need to explicitly compute the wavefunction value.

In our proposed score-based framework, the sampling is done via Langevin dynamics and optimization is done through a new loss function inspired by diffusion Monte Carlo. Our score-based method enables the possibility of performing QMC computation with only score functions, which is infeasible in existing optimization frameworks. A direct benefit is that, by predicting gradients, we avoid the need to recompute them from the wavefunction. Moreover, score functions can be interpreted as the force of quantum systems. By implicitly modeling distributions with score functions, the dynamics of quantum systems could be better captured. Our experimental results show that with our score-based optimization framework, ground states of quantum systems can be accurately learned.

2.1. QUANTUM MANY-BODY WAVEFUNCTION IN CONTINUOUS SPACE

We use $x \in \mathbb{R}^{N \times d}$ to denote the coordinates of $N$ particles in $d$ dimensions. The quantum state of a system is defined by its wavefunction $\psi : \mathbb{R}^{N \times d} \to \mathbb{R}$. By definition, $\psi$ is normalized ($\int |\psi(x)|^2 \, dx = 1$) and $|\psi(x)|^2$ gives the probability density of observing $x$. Any wavefunction $\psi$ can be expressed as a linear combination of eigenfunctions $\psi_n$, which are solutions to the time-independent Schrödinger equation

$$\hat{H}\psi_n(x) = E_n \psi_n(x),$$

where $\hat{H}$ is a linear operator known as the Hamiltonian, and $E_n$ is a scalar giving the energy of the $n$-th eigenstate. The Hamiltonian is defined as

$$\hat{H}\psi(x) = -\frac{1}{2}\sum_i \nabla_i^2 \psi(x) + V(x)\psi(x), \tag{1}$$

where the index $i$ runs over all of the $N \times d$ dimensions in the summation. The first term in the Hamiltonian takes the sum of the second-order partial derivatives of the wavefunction and is related to the kinetic energy of the system. The second term in the Hamiltonian multiplies the wavefunction by a scalar value and is related to the potential energy of the system. The kinetic term is intrinsic to the Schrödinger equation and always takes the same form, whereas the potential function $V : \mathbb{R}^{N \times d} \to \mathbb{R}$ varies for different physics problems. Note that although a wavefunction can be complex-valued in general, we can let $\psi$ be real-valued because $\hat{H}$ is real. Our objective is to find the ground state $\psi_0$, which is the eigenstate associated with the lowest energy $E_0$. In our notation, the coordinates $x$ can be viewed either as $N$ $d$-dimensional vectors or as a flattened $N \cdot d$ dimensional vector. In the rest of this paper, depending on the context, we may use bold $\mathbf{x}_i$ to denote the $i$-th particle or use the regular font $x_i$ to denote the $i$-th scalar component of the flattened vector.
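As a small numerical illustration of the eigenvalue equation above, the sketch below applies the Hamiltonian to a trial function with a central finite difference and checks that the ratio $\hat{H}\psi/\psi$ is constant. The one-dimensional setting, the Gaussian trial function, the harmonic potential $V(x) = x^2/2$ (used later in Section 4.2), and all function names are assumptions chosen for illustration.

```python
import math

def apply_hamiltonian_1d(psi, potential, x, h=1e-3):
    """Return (H psi)(x) for H = -1/2 d^2/dx^2 + V, using a central difference."""
    second_derivative = (psi(x + h) - 2.0 * psi(x) + psi(x - h)) / (h * h)
    return -0.5 * second_derivative + potential(x) * psi(x)

def gaussian(x):
    # Illustrative trial function psi(x) = exp(-x^2 / 2).
    return math.exp(-0.5 * x * x)

def harmonic(x):
    # Illustrative potential V(x) = x^2 / 2.
    return 0.5 * x * x

def energy_ratio(x):
    # (H psi)(x) / psi(x); for an eigenfunction this is the same at every x.
    return apply_hamiltonian_1d(gaussian, harmonic, x) / gaussian(x)
```

For this particular pair of trial function and potential, the ratio stays close to 1/2 at every point, which is the signature of an eigenfunction.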

2.2. VARIATIONAL MONTE CARLO

The variational Monte Carlo (VMC) method uses a parameterized function $\psi_\theta : \mathbb{R}^{N \times d} \to \mathbb{R}$ (called the Ansatz) to model a wavefunction, where $\theta$ denotes the parameters to be optimized. The normalization of $\psi_\theta$ is not required. The energy expectation of $\psi_\theta$ is computed as:

$$L(\theta) = \frac{\int \psi_\theta(x)\hat{H}\psi_\theta(x)\,dx}{\int \psi_\theta(x)\psi_\theta(x)\,dx} = \frac{\int \psi_\theta(x)^2 \frac{\hat{H}\psi_\theta(x)}{\psi_\theta(x)}\,dx}{\int \psi_\theta(x)^2\,dx} = \mathbb{E}_{x \sim \psi_\theta^2 / \int \psi_\theta^2}\left[E_L(x;\theta)\right],$$

where $E_L(x;\theta) = \frac{\hat{H}\psi_\theta(x)}{\psi_\theta(x)}$ is called the local energy of $x$. The expectation is evaluated numerically on sparse samples. Markov Chain Monte Carlo (MCMC) sampling is employed to drive the sample distribution to converge to the target density $\psi_\theta^2 / \int \psi_\theta^2$. We can approach ground state wavefunctions by minimizing the energy expectation with gradient descent. The unbiased gradient of $L(\theta)$ w.r.t. the parameters $\theta$ is given by:

$$\nabla_\theta L(\theta) = 2\,\mathbb{E}_{x \sim \psi_\theta^2 / \int \psi_\theta^2}\left[\left(E_L(x;\theta) - \mathbb{E}_{x \sim \psi_\theta^2 / \int \psi_\theta^2}\left[E_L(x;\theta)\right]\right)\nabla_\theta \log|\psi_\theta(x)|\right], \tag{3}$$

where expectations are evaluated by averaging over samples. The derivation of this gradient makes use of the fact that $\hat{H}$ is Hermitian (Ceperley et al., 1977). A detailed derivation can be found in Appendix E of Lin et al. (2021). The optimized energy expectation value gives the approximated ground state energy, and the eigenvalue formulation ensures that the estimate is variational, that is, the energy expectation defined by the Ansatz is always above the true ground state energy. Other properties of quantum systems can also be estimated by taking expectations over the corresponding operators.
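The energy estimate described above can be sketched for a toy problem. This is a minimal illustration, not the setup used in the experiments: it assumes a single particle in one dimension, the harmonic potential $V(x) = x^2/2$, a hypothetical one-parameter Ansatz $\psi_a(x) = \exp(-a x^2/2)$ whose local energy can be written in closed form, and a plain Metropolis sampler with a uniform proposal.

```python
import math
import random

def local_energy(x, a):
    # For psi_a(x) = exp(-a x^2 / 2) and V(x) = x^2 / 2:
    # E_L(x) = a / 2 + x^2 (1 - a^2) / 2.
    return 0.5 * a + 0.5 * x * x * (1.0 - a * a)

def vmc_energy(a, n_steps=20000, step=1.0, seed=0):
    """Estimate the energy of psi_a by Metropolis sampling of psi_a^2."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        # Acceptance ratio psi_a(proposal)^2 / psi_a(x)^2.
        ratio = math.exp(-a * (proposal * proposal - x * x))
        if rng.random() < ratio:
            x = proposal
        total += local_energy(x, a)
    return total / n_steps
```

At $a = 1$ the Ansatz coincides with the exact ground state, so every sample has the same local energy 1/2; for any other $a$ the estimate lies above 1/2, in line with the variational property.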

2.3. RELATED WORK

Neural quantum states based on the restricted Boltzmann machine were initially studied for simple quantum spin models on lattices (Carleo & Troyer, 2017). More complicated neural network architectures, such as convolutional neural networks and graph neural networks, were then extended to more complicated spin systems to capture frustration due to the lattice structure and next-nearest-neighbor interactions (Kochkov et al., 2021; Fu et al., 2022). Similar ideas have also been studied for ab-initio simulation in quantum chemistry (Han et al., 2019) to study properties of small molecules. Recently, FermiNet (Pfau et al., 2020) and PauliNet (Hermann et al., 2020) greatly improved the accuracy of neural wavefunctions and were applied to larger molecular systems. Follow-up works contribute improved performance (Gerard et al., 2022), joint training for multiple geometries (Gao & Günnemann, 2021; Scherbela et al., 2022; Gao & Günnemann, 2022), or solutions for excited states (Entwistle et al., 2022). All of the existing methods use neural networks to explicitly model the wavefunction, while in this work we propose to implicitly model the quantum state using the score function.

Diffusion Monte Carlo (DMC) (Toulouse et al., 2016; Ceperley, 2004) employs score-based diffusion to improve upon VMC. In VMC, the quality of approximations is limited by the capacity of the Ansatz. DMC additionally assigns a weight to each sample in such a way that the weighted distribution gives a better approximation of the ground state. In DMC's formulation, each sample is defined by a walker, and each walker is assigned a weight. In importance-sampled DMC, a trial function $\psi_T$ is used. At each iteration, the walkers randomly diffuse by following the score of the trial function, and the weights are updated according to the imaginary time evolution, so that the repeated diffusing and weighting procedure projects out ground states from the trial wavefunction. At convergence, the ground state energy is estimated by the weighted average of local energies. Ceperley & Alder (1980) apply DMC to the electron gas. Ren et al. (2022) and Wilson et al. (2021) apply fixed-node DMC starting from the optimized FermiNet Ansatz.

Score-based methods have shown great success in generative modeling (Song & Ermon, 2019; Song et al., 2020). The effectiveness of implicit density modeling has also been demonstrated by other successful generative models, such as VAEs (Kingma & Welling, 2014), GANs (Goodfellow et al., 2014; Brock et al., 2018), and diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020).

3. METHOD

For the Hamiltonian $\hat{H}$ defined in Equation 1, the local energy can be expressed in terms of the logarithmic wavefunction as:

$$E_L(x;\theta) = -\frac{1}{2}\sum_i \left[\frac{\partial^2 \log|\psi_\theta(x)|}{\partial x_i^2} + \left(\frac{\partial \log|\psi_\theta(x)|}{\partial x_i}\right)^2\right] + V(x).$$

One can easily prove this expression by using the fact that $\frac{\partial \log|\psi_\theta(x)|}{\partial x_i} = \frac{1}{\psi_\theta(x)}\frac{\partial \psi_\theta(x)}{\partial x_i}$. If we take a closer look at this expression, we can notice that the local energy depends on the wavefunction only through the gradient of the logarithmic wavefunction. In fact, if we define the function $s_\theta : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d}$ such that $s_\theta(x)_i = \frac{\partial \log|\psi_\theta(x)|}{\partial x_i}$, we can rewrite the local energy as:

$$E_L(x;\theta) = -\frac{1}{2}\left[\mathrm{tr}(\nabla_x s_\theta(x)) + \|s_\theta(x)\|^2\right] + V(x).$$

The function $s_\theta$ is called the quantum force or the drift in quantum Monte Carlo. Here we follow the convention from the machine learning community and call it the score. This direct connection between local energies and scores motivates us to ask the question: can we represent quantum states using only $s_\theta$? As a direct benefit, this can avoid recomputing the first-order derivatives in evaluating the local energy and consequently provide a more succinct representation. It is true that we will lose access to the unnormalized wavefunction. However, in practice, our primary goal is to estimate observables, such as kinetic energy or density, which can be estimated from sample distributions. To this end, our score-based optimization framework follows the gradient descent formulation in VMC, and we design a new loss function inspired by DMC. As we mix the flavors of these two methods, we call our new optimization framework Diffusion Variational Monte Carlo (DiffVMC). The pipelines of VMC and DiffVMC are shown in Figure 1.

Please note that the original score (Hyvärinen & Dayan, 2005) is defined in terms of probability densities, so we have $s_{\mathrm{original}}(x) = \nabla_x \log p(x) = \nabla_x \log \psi(x)^2 = 2\nabla_x \log|\psi(x)|$, which is twice the score in our definition. By abuse of notation, we define our score in terms of wavefunctions. We do this for two reasons. First, wavefunctions are more natural to deal with in quantum mechanics. Second, as we will show later, in addition to moving samples locally, we also use $s_\theta(x)$ to describe distributions, which shares the same physical meaning as the original score definition.
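The score form of the local energy can be evaluated without any wavefunction value. The sketch below is a minimal illustration under stated assumptions: the caller supplies the score vector, its divergence $\mathrm{tr}(\nabla_x s(x))$, and the potential value, and the linear test score $s(x) = -a x$ with the harmonic potential is a hypothetical example rather than a model from the paper.

```python
def local_energy_from_score(score, divergence, potential):
    """E_L = -1/2 (tr(grad s) + ||s||^2) + V, given the score vector s(x),
    its divergence tr(grad_x s(x)), and the potential value V(x)."""
    squared_norm = sum(component * component for component in score)
    return -0.5 * (divergence + squared_norm) + potential

def gaussian_score_local_energy(x, a):
    # Hypothetical test model: s(x) = -a x, so tr(grad s) = -a * len(x),
    # combined with the harmonic potential V(x) = ||x||^2 / 2.
    score = [-a * xi for xi in x]
    divergence = -a * len(x)
    potential = 0.5 * sum(xi * xi for xi in x)
    return local_energy_from_score(score, divergence, potential)
```

With $a = 1$ the local energy no longer depends on $x$ and equals half the number of coordinates, which matches the harmonic-trap ground state energy quoted in Section 4.2.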

3.1. SCORE-BASED NEURAL WAVEFUNCTION ANSATZ

We use a parameterized score function to implicitly model the wavefunction. Formally, for systems with $N$ particles in $d$ dimensions, the score function $s_\theta : \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d}$ maps a set of input coordinates to a set of output scores, which are vectors having the same dimensions as the inputs. Intuitively, after training, the output score should tell the particles the direction toward regions with higher density. In our case, we are dealing with indistinguishable particles in quantum mechanics, so exchanging two particles will not change the probability density:

$$\psi(\ldots, \mathbf{x}_i, \ldots, \mathbf{x}_j, \ldots)^2 = \psi(\ldots, \mathbf{x}_j, \ldots, \mathbf{x}_i, \ldots)^2.$$

As a result, the score, which is the gradient of the logarithmic wavefunction, is permutation equivariant. Formally,

$$\nabla_{\mathbf{x}_i} \log|\psi(\ldots, \mathbf{x}_i, \ldots, \mathbf{x}_j, \ldots)| = \nabla_{\mathbf{x}_j} \log|\psi(\ldots, \mathbf{x}_j, \ldots, \mathbf{x}_i, \ldots)|.$$

Essentially, the equivariance means that if two input particles exchange their positions, their corresponding output score vectors will also exchange positions. The proof is straightforward (Appendix B). We parameterize the score function with a neural network. The equivariance can be easily achieved by considering the input coordinates as a set (Zaheer et al., 2017; Qi et al., 2017).
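The set-based construction can be illustrated with a toy function. This is a sketch under assumptions, not the network used in the paper: a mean over particles stands in for the pooling step, and a fixed elementwise formula stands in for the shared MLP.

```python
import math

def equivariant_score(particles):
    """Toy permutation-equivariant map: each output depends on the particle's own
    coordinates and on a pooled, order-independent summary of all particles."""
    n = len(particles)
    dim = len(particles[0])
    # Mean pooling over particles is invariant to their ordering.
    pooled = [sum(p[k] for p in particles) / n for k in range(dim)]
    # The same per-particle function is applied to every particle
    # (a stand-in for a shared MLP).
    return [[math.tanh(pooled[k] - p[k]) - p[k] for k in range(dim)]
            for p in particles]
```

Swapping two input particles swaps the corresponding output rows, because the pooled summary is unchanged and the per-particle function is shared.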

3.2. SAMPLING VIA LANGEVIN DYNAMICS

To generate samples that follow the distribution implicitly defined by the score, we use Langevin dynamics, similar to score-based generative models (Song & Ermon, 2019). Given the samples at the current time step $x_t$, the coordinates for the new samples are computed as:

$$x_{t+1} = x_t + \sqrt{\alpha}\,\epsilon + \alpha s_\theta(x_t),$$

where $\alpha$ is a hyperparameter defining the step size, and $\epsilon \sim \mathcal{N}(0, I_{Nd})$ is a random vector sampled from the standard multivariate Gaussian distribution. The process is the same as in DMC and can be understood as first doing a random diffusion, then drifting by following scores. We can prove that when $\alpha$ is small, the distribution converges to the distribution defined by the score function. In Langevin dynamics, similar to the Metropolis-Hastings rejection step in MCMC, an accept/reject procedure can be employed to alleviate the finite time error. Although omitted in the context of generative modeling (Song & Ermon, 2019), this is a standard step in DMC. The original rejection step computes the ratio

$$P_{acc} = \frac{\exp\left(-\frac{1}{2\alpha}\|x - x' - s_\theta(x')\alpha\|^2\right)\psi_\theta(x')^2}{\exp\left(-\frac{1}{2\alpha}\|x' - x - s_\theta(x)\alpha\|^2\right)\psi_\theta(x)^2}.$$

After each Langevin dynamics move, we decide whether to accept or reject the move based on $P_{acc}$. Concretely, we first sample a random number uniformly between 0 and 1. The move is accepted if $P_{acc}$ is larger than the random number and is rejected otherwise. However, the expression of $P_{acc}$ involves the wavefunction values $\psi_\theta(x)$ and $\psi_\theta(x')$, which are generally not available in our score-based framework. To still be able to use the rejection step, we propose an approximate estimate of $P_{acc}$ which involves only the score function. This is achieved by approximating $\log\frac{|\psi_\theta(x')|}{|\psi_\theta(x)|} = \log|\psi_\theta(x')| - \log|\psi_\theta(x)|$ using the average gradient $\frac{1}{2}\left(\nabla_x \log|\psi_\theta(x)| + \nabla_x \log|\psi_\theta(x')|\right)$. We can then show that the ratio can be approximated only in terms of the score as (derivation in Appendix C):

$$P_{acc} \approx \exp\left(\frac{\alpha}{2}\left(\|s_\theta(x)\|^2 - \|s_\theta(x')\|^2\right)\right).$$
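The Langevin move and the score-only accept/reject rule can be sketched as follows. This is a minimal illustration, assuming coordinates stored as a flat list and a hypothetical test score $s(x) = -x$, which corresponds to $\psi(x) = \exp(-x^2/2)$ and a target density $\psi^2$ with second moment 1/2; the function names are not from the paper.

```python
import math
import random

def langevin_step(x, score, alpha, rng):
    """One Langevin move with the score-only accept/reject rule."""
    s_old = score(x)
    proposal = [xi + math.sqrt(alpha) * rng.gauss(0.0, 1.0) + alpha * si
                for xi, si in zip(x, s_old)]
    s_new = score(proposal)
    norm_old = sum(si * si for si in s_old)
    norm_new = sum(si * si for si in s_new)
    # log P_acc ~= alpha / 2 * (||s(x)||^2 - ||s(x')||^2); no wavefunction values.
    log_p_acc = 0.5 * alpha * (norm_old - norm_new)
    # Accept when the uniform random number is below P_acc (compared in log space).
    if math.log(1.0 - rng.random()) < log_p_acc:
        return proposal
    return x

def sample_second_moment(alpha=0.1, n_steps=40000, burn_in=1000, seed=0):
    """Run one chain for the test score s(x) = -x and average x^2."""
    rng = random.Random(seed)
    score = lambda x: [-xi for xi in x]
    x, total, count = [0.0], 0.0, 0
    for step in range(n_steps):
        x = langevin_step(x, score, alpha, rng)
        if step >= burn_in:
            total += x[0] * x[0]
            count += 1
    return total / count
```

For this linear test score the averaged-gradient approximation happens to be exact, so the chain samples $\psi^2$ without step-size bias; for general scores the rule is only approximate.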

3.3. NEURAL WAVEFUNCTION OPTIMIZATION

The energy loss in VMC (Equation 3) depends explicitly on wavefunctions. An expression of the energy loss only in terms of scores is unknown, so we need to find a new loss to optimize scores towards ground states. We motivate our new loss from the imaginary time evolution in DMC. The imaginary time evolution operator $e^{-\tau\hat{H}}$ projects out ground states when $\tau \to \infty$. At short $\tau$, by Taylor expansion,

$$e^{-\tau\hat{H}}\psi(x) \approx \psi(x) - \tau\hat{H}\psi(x) = \psi(x)\left(1 - \tau\frac{\hat{H}\psi(x)}{\psi(x)}\right) \approx e^{-\tau E_L(x)}\psi(x).$$

Thus, evolving $\psi(x)$ in imaginary time for a short time $\tau$ can be approximated as:

$$\psi(x) \to \psi'(x) = e^{-\tau E_L(x)}\psi(x).$$

$\psi'$ is closer to the ground state than $\psi$ because higher energy eigenstates decay exponentially faster than the ground state. Therefore, for our score-based model, we can minimize energies by letting $s_\theta(x)$ approach the evolved score $\nabla_x \log|\psi'(x)|$. We achieve this via score matching. Score matching (Hyvärinen & Dayan, 2005) provides a way to make $s_\theta$ converge to the true score of the sample distribution. Assuming that at the current step the samples follow $\psi^2 / \int\psi^2$ and $s_\theta(x) = \nabla_x \log|\psi(x)|$, we can make $s_\theta(x)$ converge to $\nabla_x \log|\psi'(x)|$ by minimizing the implicit score matching (ISM) objective:

$$\mathrm{ISM}(\theta) = 2\,\mathbb{E}_{x \sim \psi'^2 / \int\psi'^2}\left[\mathrm{tr}(\nabla_x s_\theta(x)) + \|s_\theta(x)\|^2\right]. \tag{11}$$

The problem is that we do not have samples following $\psi'^2 / \int\psi'^2$. We can solve this by transforming the ISM into a weighted version based on current samples:

$$\mathrm{ISM}(\theta) = 2\,\frac{\int \psi'^2(x)\left[\mathrm{tr}(\nabla_x s_\theta(x)) + \|s_\theta(x)\|^2\right]dx}{\int \psi'^2(x)\,dx} \tag{12}$$

$$= 2\,\frac{\int \psi^2(x)\,e^{-2\tau E_L(x)}\left[\mathrm{tr}(\nabla_x s_\theta(x)) + \|s_\theta(x)\|^2\right]dx}{\int \psi^2(x)\,dx}\cdot\frac{\int \psi^2(x)\,dx}{\int \psi'^2(x)\,dx} \tag{13}$$

$$= 2\,\mathbb{E}_{x \sim \psi^2 / \int\psi^2}\left[\exp(-2\tau E_L(x))\left(\mathrm{tr}(\nabla_x s_\theta(x)) + \|s_\theta(x)\|^2\right)\right]\cdot C,$$

where

$$C = \frac{\int \psi^2(x)\,dx}{\int \psi'^2(x)\,dx} = \frac{\int \psi(x)^2\,dx}{\int \exp(-2\tau E_L(x))\,\psi(x)^2\,dx} = \left(\mathbb{E}_{x \sim \psi^2 / \int\psi^2}\left[\exp(-2\tau E_L(x))\right]\right)^{-1}$$

is a constant independent of $\theta$. In practice, using a small $\tau$ makes the loss too small, so we replace $2\tau$ with a hyperparameter $\beta$. We also subtract the sample mean of local energies to improve numerical stability.

Putting everything together, given a batch of $M$ samples, we define the weighted score matching (WSM) objective as:

$$E_L(x) = -\frac{1}{2}\left[\mathrm{tr}(\nabla_x s_\theta(x)) + \|s_\theta(x)\|^2\right] + V(x),$$

$$E_{\mathrm{diff}}(x_i) = E_L(x_i) - \langle E_L \rangle = E_L(x_i) - \frac{1}{M}\sum_{j=1}^{M} E_L(x_j), \tag{16}$$

$$\mathrm{WSM}(\theta) = 2\sum_{i=1}^{M}\frac{\exp(-\beta E_{\mathrm{diff}}(x_i))}{\sum_{j=1}^{M}\exp(-\beta E_{\mathrm{diff}}(x_j))}\left[\mathrm{tr}(\nabla_x s_\theta(x_i)) + \|s_\theta(x_i)\|^2\right] \tag{17}$$

$$= 2\sum_{i=1}^{M}\underbrace{\mathrm{softmax}(-\beta E_{\mathrm{diff}})_i}_{\text{not differentiated w.r.t. } \theta}\;\underbrace{\left[\mathrm{tr}(\nabla_x s_\theta(x_i)) + \|s_\theta(x_i)\|^2\right]}_{\text{differentiated w.r.t. } \theta},$$

where $\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$. Note that the weighting terms are treated as constants when differentiating the loss w.r.t. the parameters $\theta$. Only the ISM terms require computing gradients during back-propagation. This weighting scheme is similar to the attention mechanism in machine learning, where the attention scores are based on the local energies and $\beta$ plays a role similar to the temperature variable in the Gumbel softmax (Jang et al., 2016). As an interesting observation, the ISM (Equation 11) is $-4$ times the kinetic part of the local energy, so we only need to compute one of them and can reuse the computation.

We can prove that WSM is unbiased by using the zero-variance property of ground states. At ground states we have $\hat{H}\psi_0(x) = E_0\psi_0(x)$. Hence $E_L(x) = \frac{\hat{H}\psi_0(x)}{\psi_0(x)} = E_0$. Consequently, the WSM weights will be identical at ground states (uniform under the softmax normalization). As a result, the WSM loss reduces to the regular ISM loss at the ground state. Moreover, thanks to Langevin dynamics, every state is a local minimum of the ISM loss because the estimated score is the same as the true score of the sample distribution. Hence the ground state is a local minimum of the WSM loss. Also due to the zero-variance property, when close to ground states, the weights should be nearly uniform. We can create more imbalanced weights by further normalizing $E_{\mathrm{diff}}$ by its standard deviation. We thus have the Scaled-WSM defined as:

$$\text{Scaled-WSM}(\theta) = 2\sum_{i=1}^{M}\mathrm{softmax}\left(-\beta\frac{E_{\mathrm{diff}}}{\mathrm{std}(E_{\mathrm{diff}})}\right)_i\left[\mathrm{tr}(\nabla_x s_\theta(x_i)) + \|s_\theta(x_i)\|^2\right].$$

As in VMC, we update $\theta$ according to the loss for one step, then we run Langevin dynamics for several steps to make samples follow the updated score. In our experiments, we use $\beta \sim 1$, so the short time approximation from imaginary time evolution does not hold strictly. However, since each time we only update parameters with a small step size and the target wavefunction $\psi'$ is constantly updated, the approximation may still be valid to some extent. This should be taken as an intuition rather than a proof. To gain further intuition for the convergence behavior, we can compare the WSM loss to the energy gradient (Equation 3). In both cases, we try to increase the probability density for regions with smaller local energies and decrease the probability density for regions with larger local energies. So far, however, we do not have a rigorous proof of the convergence property. Nevertheless, in our experiments, systems converge to ground states consistently with the proposed WSM loss.
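The weighting scheme can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, the per-sample divergences and squared score norms are taken as given numbers, and the guard that skips the scaling when all local energies coincide is an added safeguard that the text does not specify.

```python
import math

def wsm_weights(local_energies, beta=1.0, scaled=False):
    """Softmax weights over -beta * E_diff, where E_diff subtracts the batch mean.
    With scaled=True, E_diff is also divided by its standard deviation."""
    m = len(local_energies)
    mean = sum(local_energies) / m
    diffs = [e - mean for e in local_energies]
    if scaled:
        std = math.sqrt(sum(d * d for d in diffs) / m)
        if std > 0.0:  # assumption: skip the scaling when all energies coincide
            diffs = [d / std for d in diffs]
    logits = [-beta * d for d in diffs]
    shift = max(logits)  # max-subtraction keeps the exponentials bounded
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_score_matching(weights, divergences, squared_norms):
    # WSM = 2 * sum_i w_i * (tr(grad s(x_i)) + ||s(x_i)||^2); the weights are
    # plain numbers here, mirroring their treatment as constants in the gradient.
    return 2.0 * sum(w * (d + n)
                     for w, d, n in zip(weights, divergences, squared_norms))
```

Samples with lower local energy receive larger weights, and when all local energies are equal, as at a ground state, the weights are uniform and the objective reduces to the unweighted ISM average.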

3.4. THE DIFFUSION VARIATIONAL MONTE CARLO ALGORITHM

The overall procedure is similar to VMC. We first perform several steps of Langevin dynamics to equilibrate the sampling. Then we compute the weighted score matching loss and update the network parameters via gradient descent. The algorithm is summarized in Algorithm 1.

Algorithm 1 Diffusion Variational Monte Carlo (DiffVMC)
Input: Randomly initialized sample $x$, score network $s_\theta$, step size $\alpha$, number of iterations $T$, number of Langevin dynamics steps $N_{ld}$
Output: New sample $x$, optimized score network $s_\theta$
for $t = 1$ to $T$ do
    for $i = 1$ to $N_{ld}$ do
        $x' = x + \sqrt{\alpha}\,\epsilon + \alpha s_\theta(x)$, $\epsilon \sim \mathcal{N}(0, I)$    ▷ Langevin dynamics (Section 3.2)
        $P_{acc} = \exp\left(\frac{\alpha}{2}\left(\|s_\theta(x)\|^2 - \|s_\theta(x')\|^2\right)\right)$
        Sample $z \sim U(0, 1)$
        if $P_{acc} > z$ then
            $x = x'$    ▷ Accept the move
        end if
    end for
    loss = Scaled-WSM($\theta$)    ▷ Weighted score matching objective (Section 3.3)
    Update $\theta$ via gradient descent to minimize the loss.
end for
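The loop above can be exercised end to end on a toy problem. The sketch below is a simplified illustration under several assumptions that differ from the paper's experiments: one particle in a one-dimensional harmonic trap, a hypothetical one-parameter score $s_\theta(x) = -\theta x$ in place of a neural network, the plain WSM loss rather than Scaled-WSM (the standard deviation of the local energies vanishes as this toy model reaches the ground state), and a hand-derived gradient instead of automatic differentiation.

```python
import math
import random

def run_diffvmc(theta=0.5, n_walkers=400, n_iters=200, n_ld=5,
                alpha=0.2, beta=1.0, lr=0.05, seed=0):
    """Toy DiffVMC: 1-d harmonic trap, linear score s_theta(x) = -theta * x."""
    rng = random.Random(seed)
    walkers = [rng.gauss(0.0, 1.0) for _ in range(n_walkers)]
    mean_energy = 0.0
    for _iteration in range(n_iters):
        # Langevin dynamics with the score-only accept/reject rule.
        for _ld in range(n_ld):
            for k, x in enumerate(walkers):
                new = x + math.sqrt(alpha) * rng.gauss(0.0, 1.0) - alpha * theta * x
                log_p_acc = 0.5 * alpha * theta * theta * (x * x - new * new)
                if math.log(1.0 - rng.random()) < log_p_acc:
                    walkers[k] = new
        # Local energies: E_L = -1/2 (tr(grad s) + s^2) + V with tr(grad s) = -theta.
        energies = [-0.5 * (-theta + (theta * x) ** 2) + 0.5 * x * x
                    for x in walkers]
        mean_energy = sum(energies) / n_walkers
        # Softmax weights over -beta * E_diff, held fixed in the gradient.
        logits = [-beta * (e - mean_energy) for e in energies]
        shift = max(logits)
        exps = [math.exp(l - shift) for l in logits]
        total = sum(exps)
        # Hand-derived d/dtheta of 2 * sum_i w_i * (-theta + theta^2 * x_i^2).
        grad = 2.0 * sum((w / total) * (-1.0 + 2.0 * theta * x * x)
                         for w, x in zip(exps, walkers))
        theta -= lr * grad
    return theta, mean_energy
```

In this toy setting the exact ground state corresponds to $\theta = 1$ with energy 1/2, so the returned parameter and mean local energy give a quick sanity check of the update rule.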

4. EXPERIMENTS

We first show that our score-based method correctly finds the ground state by solving the harmonic trap. Then we showcase the applicability of our method on simple atomic systems, i.e., interacting fermions with a Coulomb potential. We will release our code after the review process.

4.1. WAVEFUNCTION ANSATZ FOR BOSONS AND FERMIONS

Since the unnormalized density $\psi^2(\cdot)$ is invariant under exchange of particle positions (Section 3.1), there are two cases for wavefunction values. In the first case, the wavefunction does not change its sign, i.e., $\psi(\ldots, \mathbf{x}_i, \ldots, \mathbf{x}_j, \ldots) = \psi(\ldots, \mathbf{x}_j, \ldots, \mathbf{x}_i, \ldots)$; such particles are called bosons. In the second case, the wavefunction changes its sign, i.e., $\psi(\ldots, \mathbf{x}_i, \ldots, \mathbf{x}_j, \ldots) = -\psi(\ldots, \mathbf{x}_j, \ldots, \mathbf{x}_i, \ldots)$; such particles are called fermions. These different symmetries fundamentally change the ground states. The fermion ground state must satisfy the antisymmetry constraint and has a higher ground state energy than the boson ground state.

Bosons. We use a feed-forward neural network (Figure 3a) composed of multi-layer perceptrons (MLPs) which satisfies the permutation equivariance (Zaheer et al., 2017; Qi et al., 2017). Input features are coordinates and distances to the center. Each particle is simultaneously transformed into two feature vectors with two different MLPs. We perform average pooling for the first feature vector over all particles and obtain a global feature. The global feature is then concatenated with the second feature vector. A final MLP is used to predict the score of each particle. All MLPs are shared among different particles.

Fermions. The antisymmetric property of fermions is very challenging to model (Ceperley, 1991). Different from bosons, due to the change of sign, the fermion wavefunction has both positive regions and negative regions. The region where the wavefunction equals zero is called the nodal surface (or the node). In 1-d, the nodal surface is exactly the set where two particles coincide. However, in higher dimensions the nodal surface can be arbitrary. The fermion sign structure gives rise to some intrinsic difficulties in parameterizing the score function. The score function diverges in inverse proportion to the distance from the nodal surface (Umrigar et al., 1993). Since neural networks essentially model continuous functions, using feed-forward neural networks to model the score function is prohibitively difficult. In fact, even if the networks manage to approximately model the discontinuity, computing the local energy also requires the derivatives of the score to be accurate, which is more difficult. Nevertheless, we are still able to illustrate how our score-based framework works with fermions. We can do so by modeling the score as the gradient of an antisymmetric wavefunction Ansatz. We use the score computed from FermiNet (Pfau et al., 2020), where Slater determinants are employed to ensure the antisymmetry. We call it $\nabla_x$FermiNet (Figure 3b). Although the wavefunction value is practically computed by doing so, our goal is to evaluate our method in this more challenging fermion setting. Currently this stands as the only feasible option to overcome the fermion sign problem. More generally, we can model scores for fermions as the sum of the gradient of an antisymmetric wavefunction and a symmetric wavefunction:

$$s_{\mathrm{Fermion}}(x) = \nabla_x f_{\theta_1}(x) + g_{\theta_2}(x) - x.$$

In our setting, $f_{\theta_1}$ is FermiNet and we can model $g_{\theta_2}$ using a feed-forward network (FFN). In our experiment, the FFN uses the FermiNet encoder to get a feature vector for each particle, which is then mapped to a score via a linear layer. The outputs of the FermiNet score and the FFN score are summed together to get the final score. We call it $\nabla_x$FermiNet+FFN (Figure 3c). We subtract $x$ to ensure that the density vanishes at the boundary.
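The divergence of the score near a node can be seen in a closed-form toy case. The sketch below assumes a hypothetical wavefunction for two fermions in one dimension, $\psi(x_1, x_2) = (x_1 - x_2)\exp(-(x_1^2 + x_2^2)/2)$, whose node is the set $x_1 = x_2$; it is an illustration only and is unrelated to the FermiNet-based models used in the experiments.

```python
def two_fermion_score(x1, x2):
    """Score of psi(x1, x2) = (x1 - x2) * exp(-(x1^2 + x2^2) / 2).
    Since log|psi| = log|x1 - x2| - (x1^2 + x2^2) / 2, the gradient contains
    a 1 / (x1 - x2) term that blows up as the particles approach each other."""
    gap = x1 - x2
    return (1.0 / gap - x1, -1.0 / gap - x2)
```

Evaluating the function at a separation of $10^{-3}$ gives components of magnitude about $10^3$, with opposite signs on the two sides of the node, which is the kind of discontinuity a smooth feed-forward network struggles to represent.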

4.2. 2D QUANTUM HARMONIC TRAP

The Schrödinger equation with the harmonic potential

$$V_{\mathrm{QHO}}(x) = \frac{1}{2}\|x\|^2$$

describes a quantum harmonic oscillator, one of the most famous exactly solvable quantum systems. The energy levels and eigenstates are known analytically. In particular, for $n$ bosons in the $d$-dimensional harmonic potential, the ground state energy and wavefunction are

$$E = \frac{nd}{2}, \qquad |\psi\rangle = \frac{1}{2^{nd/2}}\exp\left(-\frac{1}{2}\|x\|^2\right).$$

Therefore, the quantum harmonic oscillator provides an analytical benchmark for our numerical method. Furthermore, despite its simplicity, the quantum harmonic oscillator is of considerable experimental relevance and can be realized in cold atom systems by trapping atoms using lasers (Dalfovo et al., 1999). Ground states for fermions are more complex due to the antisymmetry constraint. The ground state energies are $E_n = \sum_{i=1}^{n} e_i$, where $e = 1, 2, 2, 3, 3, 3, 4, \ldots$, and the ground state wavefunctions are Slater determinants of Hermite polynomials. $\nabla_x$FermiNet+FFN is used here. We conduct simulations with various numbers of particles. The results are shown in Figure 2. All runs converge correctly to ground states.
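The analytical reference energies quoted above are easy to tabulate. The helper names below are hypothetical, and the fermion routine assumes the level sequence $1, 2, 2, 3, 3, 3, \ldots$ stated in the text for the 2-d trap, in which the level value $k$ appears $k$ times.

```python
def boson_ground_energy(n, d):
    # E = n d / 2 for n bosons in a d-dimensional harmonic trap.
    return n * d / 2.0

def fermion_ground_energy_2d(n):
    """Sum of the n lowest single-particle levels 1, 2, 2, 3, 3, 3, ..."""
    energy, level, remaining = 0, 1, n
    while remaining > 0:
        filled = min(level, remaining)  # level value k holds at most k particles
        energy += filled * level
        remaining -= filled
        level += 1
    return energy
```

For example, three fermions fill the levels 1, 2, 2 and six fermions fill 1, 2, 2, 3, 3, 3, which gives the targets against which training energies can be compared.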

4.3. ATOMIC SYSTEMS

For atoms, we work in the Born-Oppenheimer approximation, where we assume the nuclei are fixed in space and only the electrons are allowed to move. Hence, our inputs are the coordinates of the electrons. The potential due to the Coulomb interactions is:

$$V_{\mathrm{Atom}}(x) = \sum_{i>j}\frac{1}{\|\mathbf{x}_i - \mathbf{x}_j\|} - \sum_i \frac{Z}{\|\mathbf{x}_i\|},$$

where $Z$ is the atom charge. As mentioned above, we compute the score by taking the gradient of FermiNet (Pfau et al., 2020). Following Lin et al. (2021), four atoms are tested: Boron, Carbon, Nitrogen, and Oxygen. To show the effect of step size, we train and test the score network with $\alpha$ = 1e-3 and $\alpha$ = 1e-2. We optimize for 200k iterations for all systems. We use $\beta = 2$ for Oxygen and $\beta = 1$ for all other atoms. We run Langevin dynamics for 20 steps between two parameter updates. The results are summarized in Table 1. Chemical accuracy is defined as 1.594 m$E_h$ (Pfau et al., 2020). Except for the Nitrogen atom, all atoms reach chemical accuracy under one setting, although the effect of step size is mixed. In many cases a higher level of variance is observed. This may be due to the difficulty in optimizing the gradient network. Moreover, FermiNet works with the second-order optimizer KFAC (Martens & Grosse, 2015), which is adapted for wavefunction outputs and may not be suitable for our score-based optimization. Nevertheless, our framework has demonstrated the correct convergence behaviour for the challenging electronic potential. The results for Nitrogen are improved with FFN. For other atoms the results are not improved but are comparable. With this setting we show that, in terms of network architectures, we can go beyond the FermiNet score.
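The atomic potential can be evaluated directly from electron coordinates. The sketch below assumes a single nucleus fixed at the origin and electrons given as coordinate tuples; the function name is hypothetical.

```python
import math

def atom_potential(electrons, charge):
    """V = sum_{i>j} 1 / |x_i - x_j| - sum_i Z / |x_i|, nucleus at the origin."""
    repulsion = 0.0
    for i in range(len(electrons)):
        for j in range(i):
            # Electron-electron repulsion, each pair counted once.
            repulsion += 1.0 / math.dist(electrons[i], electrons[j])
    # Electron-nucleus attraction.
    attraction = sum(charge / math.hypot(*e) for e in electrons)
    return repulsion - attraction
```

For two electrons placed at unit distance on opposite sides of a nucleus with $Z = 2$, the repulsion term is 1/2 and the attraction term is 4, so the potential is $-3.5$.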

5. COMPARISON TO VMC AND DMC

Compared to DMC, DiffVMC is closer to VMC's paradigm, where the Ansatz is optimized by estimating a loss based on samples following the Ansatz distribution. With DiffVMC we are able to directly model scores, which are more fundamental to the optimization problem and could be more expressive than modeling wavefunctions. On the other hand, DiffVMC and DMC both use scores to update samples. DMC starts with an optimized trial wavefunction Ansatz. Energy minimization is achieved by constantly adjusting the weights of walkers. However, the walkers are always guided by the trial function's score, and the parameters of the trial Ansatz cannot be updated in DMC. In contrast, DiffVMC is able to update the guiding score function directly. Nevertheless, there is no conflict between DiffVMC and DMC. DMC walkers can be guided with a score optimized with DiffVMC and further projected toward the ground state through weighting.

6. CONCLUSION

Inspired by the score-based formulation of the local energy, we explore the possibility of implicitly modeling the quantum wavefunction. With the weighted score matching objective, the proposed DiffVMC makes it possible to optimize the score network toward the ground state. Experiments show that our proposed method can accurately find the ground state for both bosons and fermions.

A COMPUTATIONAL COMPLEXITY

Compared to VMC, DiffVMC avoids recomputing the gradient of wavefunctions during energy evaluation, but the loss is more complicated. So the overall computational cost should be similar to VMC. In our experiments with FermiNet, DiffVMC is slower than VMC because we need to backpropagate through the gradient network. However, this should be mitigated with more adapted network architectures. In general, the major computational bottleneck of QMC is evaluating $\mathrm{tr}(\nabla_x s(x))$, which cannot be parallelized efficiently. This is a common problem for all QMC methods. However, by noticing the connection between kinetic energy and score matching, we might be able to accelerate this computation via more efficient score matching techniques (Song & Ermon, 2019; Song et al., 2020; Pang et al., 2020).

B PROOF FOR EQUIVARIANCE

We show it for two 1-d particles:

$$\partial_1 \log|\psi(x_1, x_2)| = \lim_{\Delta x \to 0}\frac{\log|\psi(x_1 + \Delta x, x_2)| - \log|\psi(x_1, x_2)|}{\Delta x} \tag{23}$$

$$= \lim_{\Delta x \to 0}\frac{\log|\psi(x_2, x_1 + \Delta x)| - \log|\psi(x_2, x_1)|}{\Delta x} = \partial_2 \log|\psi(x_2, x_1)|.$$

C DERIVATION OF THE APPROXIMATED DETAILED BALANCING

The original detailed balancing in Langevin Monte Carlo is

$$P_{acc}(x'|x) = \frac{\exp\left(-\frac{1}{2\alpha}\|x - x' - s_\theta(x')\alpha\|^2\right)\psi_\theta(x')^2}{\exp\left(-\frac{1}{2\alpha}\|x' - x - s_\theta(x)\alpha\|^2\right)\psi_\theta(x)^2}.$$

In our score-based framework, we generally do not have access to the wavefunction value. So our goal here is to get rid of the explicit dependencies on $\psi_\theta(x')$ and $\psi_\theta(x)$. We can do so by assuming the score is constant between $x$ and $x'$. We first transform into the log domain:

$$\log P_{acc} = \underbrace{-\frac{1}{2\alpha}\left(\|x - x' - s_\theta(x')\alpha\|^2 - \|x' - x - s_\theta(x)\alpha\|^2\right)}_{A} + \underbrace{2\left(\log|\psi_\theta(x')| - \log|\psi_\theta(x)|\right)}_{B},$$

where we use $A$ and $B$ to denote the two components and $\log P_{acc} = A + B$. We first simplify $A$. By expanding the norms as $\|x - x' - s_\theta(x')\alpha\|^2 = \|x - x'\|^2 - 2\langle x - x', s_\theta(x')\rangle\alpha + \|s_\theta(x')\|^2\alpha^2$ and $\|x' - x - s_\theta(x)\alpha\|^2 = \|x' - x\|^2 - 2\langle x' - x, s_\theta(x)\rangle\alpha + \|s_\theta(x)\|^2\alpha^2$, we obtain:

$$A = -\langle x' - x, s_\theta(x) + s_\theta(x')\rangle + \frac{\alpha}{2}\left(\|s_\theta(x)\|^2 - \|s_\theta(x')\|^2\right).$$

Figure 1: Comparison between the pipelines of VMC and our proposed DiffVMC. In VMC, the wavefunction is explicitly modeled with a neural network, which directly outputs the wavefunction value. The sampling is carried out with MCMC and the optimization is achieved by minimizing the energy loss. In the proposed DiffVMC we instead model the score of the wavefunction. The samples are generated via Langevin dynamics and the proposed weighted score matching objective is employed to optimize the score neural network.

Figure 2: Training energies for bosons and fermions in the 2d quantum harmonic trap. N is the number of particles. All runs converge correctly to ground states, despite slight fluctuations for fermions.

The norm of the score is clipped at 20 to increase numerical stability. The complete hyperparameters are in Appendix D.

Experimental results on atoms in Hartree ($E_h$). Underline denotes our results within chemical accuracy. References taken from Lin et al. (2021).

CJ Umrigar, MP Nightingale, and KJ Runge. A diffusion Monte Carlo algorithm with very small time-step errors. The Journal of Chemical Physics, 99(4):2865-2890, 1993.
Limei Wang, Yi Liu, Yuchao Lin, Haoran Liu, and Shuiwang Ji. ComENet: Towards complete and efficient message passing for 3D molecular graphs. In The 36th Annual Conference on Neural Information Processing Systems, 2022.
Max Wilson, Nicholas Gao, Filip Wudarski, Eleanor Rieffel, and Norm M Tubman. Simulations of state-of-the-art fermionic neural network wave functions with diffusion Monte Carlo. arXiv preprint arXiv:2103.12570, 2021.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.


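To simplify $B$, we use the approximation:

$$\log|\psi_\theta(x')| - \log|\psi_\theta(x)| \approx \left\langle x' - x, \frac{1}{2}\left(s_\theta(x) + s_\theta(x')\right)\right\rangle.$$

Such an approximation is commonly employed in finite difference methods. The approximation should be valid since $x$ and $x'$ are close in space. With this approximation, we have:

$$B \approx \langle x' - x, s_\theta(x) + s_\theta(x')\rangle.$$

Finally,

$$\log P_{acc} = A + B \approx \frac{\alpha}{2}\left(\|s_\theta(x)\|^2 - \|s_\theta(x')\|^2\right).$$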
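The quality of the approximation can be checked numerically by comparing the full log acceptance ratio with the score-only expression. The sketch below works in one dimension with two hypothetical test wavefunctions, a Gaussian $\exp(-x^2/2)$ and a quartic $\exp(-x^4/4)$, both chosen for illustration because their log-wavefunctions and scores are available in closed form.

```python
def exact_log_acceptance(x, x_new, alpha, log_abs_psi, score):
    # Full log ratio: proposal-kernel term A plus density term B.
    a_term = -((x - x_new - alpha * score(x_new)) ** 2
               - (x_new - x - alpha * score(x)) ** 2) / (2.0 * alpha)
    b_term = 2.0 * (log_abs_psi(x_new) - log_abs_psi(x))
    return a_term + b_term

def approx_log_acceptance(x, x_new, alpha, score):
    # Score-only approximation: alpha / 2 * (s(x)^2 - s(x')^2).
    return 0.5 * alpha * (score(x) ** 2 - score(x_new) ** 2)

def acceptance_gap(x, x_new, alpha, log_abs_psi, score):
    # Absolute difference between the exact and approximate log ratios.
    return abs(exact_log_acceptance(x, x_new, alpha, log_abs_psi, score)
               - approx_log_acceptance(x, x_new, alpha, score))
```

For the Gaussian, whose score is linear, the averaged-gradient step is exact and the gap vanishes up to rounding; for the quartic the gap is small but nonzero and shrinks as $x'$ approaches $x$.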

D COMPUTATIONAL SETTINGS

We implement DiffVMC in PyTorch (Paszke et al., 2017) for experiments with bosons and in JAX (Bradbury et al., 2018) 

