A SELF-ATTENTION ANSATZ FOR AB-INITIO QUANTUM CHEMISTRY

Abstract

We present a novel neural network architecture using self-attention, the Wavefunction Transformer (Psiformer), which can be used as an approximation (or Ansatz) for solving the many-electron Schrödinger equation, the fundamental equation for quantum chemistry and materials science. This equation can be solved from first principles, requiring no external training data. In recent years, deep neural networks like the FermiNet and PauliNet have been used to significantly improve the accuracy of these first-principles calculations, but they lack an attention-like mechanism for gating interactions between electrons. Here we show that the Psiformer can be used as a drop-in replacement for these other neural networks, often dramatically improving the accuracy of the calculations. On larger molecules especially, the ground state energy can be improved by dozens of kcal/mol, a qualitative leap over previous methods. This demonstrates that self-attention networks can learn complex quantum mechanical correlations between electrons, and are a promising route to reaching unprecedented accuracy in chemical calculations on larger systems.

1. INTRODUCTION

The laws of quantum mechanics describe the nature of matter at the microscopic level, and underpin the study of chemistry, condensed matter physics and materials science. Although these laws have been known for nearly a century (Schrödinger, 1926), the fundamental equations are too difficult to solve analytically for all but the simplest systems. In recent years, tools from deep learning have been used to great effect to improve the quality of computational quantum physics (Carleo & Troyer, 2017). For the study of chemistry in particular, it is the quantum behavior of electrons that matters, which imposes certain constraints on the possible solutions. The use of deep neural networks for successfully computing the quantum behavior of molecules was introduced almost simultaneously by several groups (Pfau et al., 2020; Hermann et al., 2020; Choo et al., 2020), and has since led to a variety of extensions and improvements (Hermann et al., 2022). However, follow-up work has mostly focused on applications and iterative improvements to the neural network architectures introduced in the first set of papers. At the same time, neural networks using self-attention layers, like the Transformer (Vaswani et al., 2017), have had a profound impact on much of machine learning. They have led to breakthroughs in natural language processing (Devlin et al., 2018), language modeling (Brown et al., 2020), image recognition (Dosovitskiy et al., 2020), and protein folding (Jumper et al., 2021). The basic self-attention layer is also permutation equivariant, a useful property for applications to chemistry, where physical quantities should be invariant to the ordering of atoms and electrons (Fuchs et al., 2020). Despite the manifest successes in other fields, no one has yet investigated whether self-attention neural networks are appropriate for approximating solutions in computational quantum mechanics.
In this work, we introduce a new self-attention neural network, the Wavefunction Transformer (Psiformer), which can be used as an approximate numerical solution (or Ansatz) for the fundamental equations of the quantum mechanics of electrons. We test the Psiformer on a wide variety of benchmark systems for quantum chemistry and find that it is significantly more accurate than existing neural network Ansatzes of roughly the same size. The increase in accuracy grows with system size, reaching as much as 75 times the normal standard for "chemical accuracy", which suggests that the Psiformer is a particularly attractive approach for scaling neural network Ansatzes to larger, more challenging systems. In what follows, we provide an overview of the variational quantum Monte Carlo approach to computational quantum mechanics (Sec. 2), introduce the Psiformer architecture in detail (Sec. 3), present results on a wide variety of atomic and molecular benchmarks (Sec. 4) and wrap up with a discussion of future directions (Sec. 5).

2. BACKGROUND

2.1. QUANTUM MECHANICS AND CHEMISTRY

The fundamental object of study in quantum mechanics is the wavefunction, which represents the state of all possible classical configurations of a system. If the wavefunction is known, then all other properties of a system can be calculated from it. While there are multiple ways of representing a wavefunction, we focus on the first quantization approach, where the wavefunction is a map from possible particle states to a complex amplitude. The state of a single electron x ∈ R^3 × {↑, ↓} can be represented by its position r ∈ R^3 and spin σ ∈ {↑, ↓}. Then the wavefunction for an N-electron system is a function Ψ : (R^3 × {↑, ↓})^N → C. Let x ≜ (x_1, . . . , x_N) denote the set of all electron states. The wavefunction is constrained to have unit ℓ2 norm, ∫ dx |Ψ(x)|^2 = 1, and |Ψ|^2 can be interpreted as the probability of observing a quantum system in a given state when measured. Not all functions are valid wavefunctions: particles must be indistinguishable, meaning |Ψ|^2 should be invariant to changes in ordering. Additionally, the Pauli exclusion principle states that the probability of observing any two electrons in the same state must be zero. This is enforced by requiring the wavefunction for electronic systems to be antisymmetric. In this paper, we will focus on how to learn an unnormalized approximation to Ψ by representing it with a neural network.

The physical behavior of non-relativistic quantum systems is described by the Schrödinger equation. In its time-independent form, it is an eigenfunction equation ĤΨ(x) = EΨ(x), where Ĥ is a Hermitian linear operator called the Hamiltonian and the scalar eigenvalue E corresponds to the energy of that particular solution. In quantum chemistry, atomic units (a.u.) are typically used, in which the unit of distance is the Bohr radius (a_0) and the unit of energy is the Hartree (Ha). The physical details of a system are defined through the choice of Hamiltonian.
For chemical systems, the only details which need to be specified are the locations and charges of the atomic nuclei. In quantum chemistry it is standard to approximate the nuclei as classical particles with fixed positions, known as the Born-Oppenheimer approximation, in which case the Hamiltonian becomes:

Ĥ = -(1/2) Σ_i ∇^2_i + Σ_{i>j} 1/|r_i - r_j| - Σ_{i,I} Z_I/|r_i - R_I| + Σ_{I>J} Z_I Z_J/|R_I - R_J|   (1)

where ∇^2_i = Σ_{j=1}^{3} ∂^2/∂r^2_{ij} is the Laplacian w.r.t. the ith particle and Z_I and R_I, I ∈ {1, . . . , N_nuc}, are the charges and coordinates of the nuclei. Two simplifications follow from this. First, since Ĥ is real as well as Hermitian, its eigenfunctions can be chosen to be real-valued, so we can restrict our attention to real-valued wavefunctions. Second, since the spins σ_i do not appear anywhere in Eq. 1, we can fix a certain number of electrons to be spin up and the remainder to be spin down before beginning any calculation (Foulkes et al., 2001). The appropriate number for the lowest energy state can usually be guessed by heuristics such as Hund's rules.

While the time-independent Schrödinger equation defines the possible solutions of constant energy, at the energy scales relevant for most chemistry the electrons are almost always found near the lowest energy state, known as the ground state. Solutions with higher energy, known as excited states, are relevant to photochemistry, but in this paper we will restrict our attention to ground states. For a typical small molecule, the total energy of the system is on the order of hundreds to thousands of Hartrees. However, the relevant energy scale for chemical bonds is typically much smaller: on the order of 1 kilocalorie per mole (kcal/mol), or ∼1.6 mHa, less than one part in one hundred thousand of the total energy. Calculations within 1 kcal/mol of the ground truth are generally considered "chemically accurate". Mean-field methods are typically within about 0.5% of the true total energy.
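The potential terms of the Hamiltonian in Eq. 1 can be evaluated directly from the particle positions; only the kinetic term involves derivatives of the wavefunction. A minimal numpy sketch (the function name and the H2+-like test configuration are illustrative, not taken from the paper):

```python
import numpy as np

def potential_energy(r, R, Z):
    """Potential terms of Eq. 1 in atomic units (Hartree).

    r: (N, 3) electron positions; R: (M, 3) nuclear positions; Z: (M,) charges.
    The kinetic term -1/2 sum_i laplacian_i is omitted: it acts on the wavefunction.
    """
    v = 0.0
    N, M = len(r), len(R)
    # electron-electron repulsion, sum over i > j
    for i in range(N):
        for j in range(i):
            v += 1.0 / np.linalg.norm(r[i] - r[j])
    # electron-nuclear attraction
    for i in range(N):
        for I in range(M):
            v -= Z[I] / np.linalg.norm(r[i] - R[I])
    # nuclear-nuclear repulsion, sum over I > J
    for I in range(M):
        for J in range(I):
            v += Z[I] * Z[J] / np.linalg.norm(R[I] - R[J])
    return v

# one electron midway between two protons 2 a0 apart
r = np.array([[0.0, 0.0, 0.0]])
R = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Z = np.array([1.0, 1.0])
print(potential_energy(r, R, Z))  # -1 - 1 + 0.5 = -1.5 Ha
```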
The difference between the mean-field energy and the true energy is known as the correlation energy, and chemical accuracy is usually less than 1% of this correlation energy. For example, the binding energy of the benzene dimer (investigated in Section 4.5) is only ∼4 mHa.

Figure 1 (caption): In contrast, the Psiformer uses a single stream of self-attention layers, acting on electron-nuclear features only. Electron-electron features appear only via the Jastrow factor. The FermiNet+SchNet also includes a nuclear embedding stream and separate spin-dependent electron-electron streams, not pictured here.

2.2. VARIATIONAL QUANTUM MONTE CARLO

There are a wide variety of computational techniques to find the ground state solution of the Hamiltonian in Eq. 1. We are particularly interested in solving these equations from first principles (ab initio), that is, without any data other than the atomic positions. The ab-initio method most compatible with the modern deep learning paradigm is variational quantum Monte Carlo (VMC, Foulkes et al. (2001)). In VMC, a parametric wavefunction approximation (or Ansatz) is optimized using samples from the Ansatz itself, in much the same way that deep neural networks are optimized by gradient descent on stochastic minibatches. VMC is variational in the sense that it minimizes an upper bound on the ground state energy of a system. Therefore if two VMC solutions give different energies, the one with the lower energy will be closer to the true energy, even if the true energy is not known. In VMC, we start with an unnormalized wavefunction Ansatz Ψ_θ : (R^3 × {↑, ↓})^N → R with parameters θ. The expected energy of the system is given by the Rayleigh quotient:

L_θ = ⟨Ψ_θ ĤΨ_θ⟩ / ⟨Ψ^2_θ⟩ = ∫ dx Ψ_θ(x) ĤΨ_θ(x) / ∫ dx Ψ^2_θ(x) = E_{x∼Ψ^2_θ}[Ψ^{-1}_θ(x) ĤΨ_θ(x)] = E_{x∼Ψ^2_θ}[E_L(x)]   (2)

where we have rewritten the Rayleigh quotient as an expectation over a random variable with density proportional to Ψ^2_θ on the right-hand side. The term E_L(x) = Ψ^{-1}_θ(x) ĤΨ_θ(x) is known as the local energy. Details on how to compute the local energy and unbiased estimates of the gradient of the average energy are given in Sec. A.1 in the appendix. In VMC, samples from the distribution proportional to Ψ^2_θ are generated by Monte Carlo methods, and unbiased estimates of the gradient are used to optimize the Ansatz, either by standard stochastic gradient methods or by more advanced methods (Umrigar et al., 2007; Sorella, 1998). Notably, the samples x ∼ Ψ^2_θ can be generated from the Ansatz itself, rather than requiring external data.
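To make Eq. 2 concrete, here is a toy VMC energy estimate for a 1D harmonic oscillator with the Gaussian Ansatz ψ_α(x) = exp(-αx²/2), for which the local energy E_L(x) = α/2 + x²(1 - α²)/2 can be derived by hand. The toy model and the exact-sampling shortcut are ours, not the paper's; real calculations use MCMC sampling:

```python
import numpy as np

def local_energy(x, alpha):
    """E_L(x) = psi^{-1}(x) H psi(x) for psi_alpha(x) = exp(-alpha x^2 / 2)
    and H = -1/2 d^2/dx^2 + x^2/2 (1D harmonic oscillator, a.u.)."""
    return 0.5 * alpha + 0.5 * x**2 * (1.0 - alpha**2)

rng = np.random.default_rng(0)
# for this toy Ansatz, x ~ psi^2 = N(0, 1/(2 alpha)) can be sampled exactly
alpha = 0.8
x = rng.normal(0.0, np.sqrt(0.5 / alpha), size=100_000)
energy = local_energy(x, alpha).mean()  # Monte Carlo estimate of Eq. 2
print(energy)                           # above the true ground state energy, 0.5

exact = local_energy(x, 1.0)
print(exact.std())  # 0.0: the local energy is constant at the exact alpha
```

The second print illustrates the zero-variance property: when the Ansatz is an exact eigenfunction, E_L is constant, which is why the variance of the local energy is often monitored during optimization.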
The form of Ψ must be restricted to antisymmetric functions to avoid collapsing onto non-physical solutions. This is most commonly done by taking the determinant of a matrix of single-electron functions Ψ(x) = det [Φ(x)], where Φ(x) denotes the matrix with elements ϕ i (x j ), since the determinant is antisymmetric under exchange of rows or columns. This is known as a Slater determinant, and the minimum-energy wavefunction of this form gives the mean-field solution to the Schrödinger equation. While ϕ i is a function of one electron in a Slater determinant, any permutation-equivariant function of all electrons can be used as input to Φ and Ψ will still be antisymmetric. The potential energy becomes infinite when particles overlap, which places strict constraints on the form of the wavefunction at these points, known as the Kato cusp conditions (Kato, 1957) . The cusp conditions state that the wavefunction must be non-differentiable at these points, and give exact values for the average derivatives at the cusps. This can be built into an Ansatz by multiplying by a Jastrow factor which satisfies these conditions analytically (Drummond et al., 2004) .
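The antisymmetry of the determinant under exchange of two electrons is easy to verify numerically; in this sketch a random matrix stands in for the orbital matrix Φ:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 4  # electrons

# Phi[i, j] = phi_i(x_j): a random stand-in for N single-electron orbitals
phi = rng.normal(size=(N, N))
psi = np.linalg.det(phi)

# exchanging electrons 0 and 1 swaps the corresponding columns of Phi
phi_swapped = phi[:, [1, 0, 2, 3]]
psi_swapped = np.linalg.det(phi_swapped)

print(np.isclose(psi_swapped, -psi))  # True: the determinant is antisymmetric
```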

2.3. RELATED WORK

Machine learning has found numerous applications to computational chemistry in recent years, but has mostly focused on problems at the level of classical physics (Schütt et al., 2018; Fuchs et al., 2020; Batzner et al., 2022; Segler et al., 2018; Gómez-Bombarelli et al., 2018) , which all rely on learning from large datasets of experiments or ab-initio calculations. There is also work on machine learning for density functional theory (DFT) (Nagai et al., 2020; Kirkpatrick et al., 2021) , which is an intermediate between classical and all-electron quantum chemistry, but even this is still primarily a supervised learning problem and relies on ab-initio calculations for data. Here instead, we are focused on improving the ab-initio methods themselves. For a thorough introduction to ab-initio quantum chemistry, we recommend Helgaker et al. (2014) and Szabo & Ostlund (2012) . Within this field, VMC was considered a simple but low-accuracy method, failing to match the performance of sophisticated methods (Motta et al., 2020) , but often used as a starting point for diffusion Monte Carlo (DMC) calculations, which are more accurate, but do not produce an explicit functional form for the wavefunction (Foulkes et al., 2001) . These VMC calculations typically used a Slater-Jastrow Ansatz (Kwon et al., 1993) , which consists of a large linear combination of Slater determinants multiplied by a Jastrow factor. They also sometimes include backflow, a coordinate transformation that accounts for electron correlations with a fixed functional form (Feynman & Cohen, 1956) . Recently, the use of neural network Ansatzes in VMC was shown to greatly improve the accuracy of many-electron calculations, often making them competitive with, or in some circumstances superior to, methods like DMC or coupled cluster (Pfau et al., 2020; Hermann et al., 2020; Choo et al., 2020; Han et al., 2019; Luo & Clark, 2019; Taddei et al., 2015) . 
In first quantization, these Ansatzes used a sum of a small number of determinants to construct antisymmetric functions, but used very general permutation-equivariant functions of all electrons as inputs, rather than the single-electron functions used in Slater determinants. Some, like the PauliNet (Hermann et al., 2020), used Jastrow factors, while the FermiNet (Pfau et al., 2020) used non-differentiable input features to learn the cusp conditions. Most follow-up work, as surveyed in Hermann et al. (2022), integrated these Ansatzes with other methods and applications, but did not significantly alter the architecture of the neural networks. The most significant departure in terms of the neural network architecture was Gerard et al. (2022), which extended the FermiNet (generally recognized to be the most accurate neural network Ansatz up to that point) and integrated it with several details of the PauliNet, especially the continuous-filter convolutions also used by the SchNet (Schütt et al., 2018), claiming to reach even higher accuracy on several challenging systems. We refer to this architecture as the FermiNet+SchNet.

3. THE PSIFORMER

The Psiformer has the basic form:

Ψ_θ(x) = exp(J_θ(x)) Σ_{k=1}^{N_det} det[Φ^k_θ(x)]

where J_θ : (R^3 × {↑, ↓})^N → R and Φ^k_θ : (R^3 × {↑, ↓})^N → R^{N×N} are functions with learnable parameters θ. This is similar to the Slater-Jastrow Ansatz, FermiNet and PauliNet, with the key difference being that in the Psiformer, Φ^k_θ consists of a sequence of multiheaded self-attention layers (Vaswani et al., 2017). The key motivation for this is that the electron-electron dependence in the Hamiltonian introduces subtle and complex dependence in the wavefunction, and self-attention is one way of introducing this without a fixed functional form. The high-level structure of the Psiformer is shown in Fig. 1(b), where it is contrasted with the FermiNet (and SchNet extension) in Fig. 1(a). Because a self-attention layer takes a sequence of vectors as input, only features of single electrons are used as input to Φ^k_θ. The input feature vector f^0_i for electron i is similar to the one-electron stream of the FermiNet, which uses a concatenation of electron-nuclear differences r_i - R_I and distances |r_i - R_I|, I = 1, . . . , N_nuc, with two key differences. First, we found that for systems with widely separated atoms, using FermiNet one-electron features caused self-attention Ansatzes to become unstable, so we rescale the inputs in the Psiformer by a factor of log(1 + |r_i - R_I|)/|r_i - R_I|, so that the input vectors grow logarithmically with distance from the nucleus. Second, we concatenate the spin σ_i into the input feature vector itself (mapping ↑ to 1 and ↓ to -1). While this spin term is kept fixed during training, it breaks the symmetry between spin-up and spin-down electrons, so that columns of Φ^k_θ are only equivariant under exchange of same-spin electrons. This is a notable departure from the FermiNet and PauliNet, where the difference between spin-up and spin-down electrons is instead built into the architecture.
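The rescaled one-electron input features described above can be sketched as follows; the helper name and exact feature ordering are our assumptions, since the paper specifies only the rescaling factor and the spin concatenation:

```python
import numpy as np

def psiformer_features(r, R, spins):
    """One-electron input features f_i^0, a sketch of Sec. 3's description.

    Per nucleus: the difference (r_i - R_I) rescaled by log(1 + d)/d and the
    rescaled distance log(1 + d); the spin (+1 up, -1 down) is concatenated.
    """
    feats = []
    for i, ri in enumerate(r):
        f = []
        for RI in R:
            d = np.linalg.norm(ri - RI)
            # rescaling factor log(1 + d)/d -> logarithmic growth at large d
            # (its limit as d -> 0 is 1, so the features stay finite)
            scale = np.log1p(d) / d if d > 0 else 1.0
            f.extend((ri - RI) * scale)
            f.append(np.log1p(d))
        f.append(1.0 if spins[i] == "up" else -1.0)
        feats.append(f)
    return np.array(feats)  # shape (N_electrons, 4 * N_nuc + 1)

r = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
R = np.array([[0.0, 0.0, 0.0]])
f = psiformer_features(r, R, ["up", "down"])
print(f.shape)  # (2, 5)
```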
The input features f^0_i are next projected into the same dimension as the attention inputs by a linear mapping h^0_i = W_0 f^0_i, and then passed into a sequence of multiheaded self-attention layers followed by linear-nonlinear layers, both with residual connections:

f^{ℓ+1}_i = h^ℓ_i + W^ℓ_o concat_h[SELFATTN^h_i(h^ℓ_1, . . . , h^ℓ_N; W^{ℓh}_q, W^{ℓh}_k, W^{ℓh}_v)]   (4)

h^{ℓ+1}_i = f^{ℓ+1}_i + tanh(W^{ℓ+1} f^{ℓ+1}_i + b^{ℓ+1})   (5)

where h indexes the different attention heads, concat_h denotes concatenation of the output from different attention heads, and SELFATTN denotes standard self-attention:

SELFATTN_i(h_1, . . . , h_N; W_q, W_k, W_v) = Σ_j σ_j(q^T_1 k_i/√d, . . . , q^T_N k_i/√d) v_j   (6)

k_i = W_k h_i,   q_i = W_q h_i,   v_i = W_v h_i   (7)

σ_i(x_1, . . . , x_N) = exp(x_i) / Σ_j exp(x_j)   (8)

where d is the output dimension of the key and query weights. In principle, multiple linear-nonlinear layers could be used between self-attention layers, but we found that adding a deeper MLP between self-attention layers was less effective than adding more self-attention layers. While a smooth nonlinearity must be used to guarantee that the wavefunction is smooth everywhere except at the cusps, we found that using activation functions other than tanh had only a marginal impact on performance. A final linear projection into an N·N_det-dimensional space is applied to the activations, and the output is multiplied by a weighted sum of exponentially-decaying envelopes:

Φ^k_ij = Ω^k_ij w^{kT}_i h^L_j,   Ω^k_ij = Σ_I π^k_{iI} exp(-σ^k_{iI} |r_j - R_I|)

This enforces the boundary condition lim_{|r|→∞} Ψ_θ(r) = 0, and is the same form as the envelope used by the FermiNet in Spencer et al. (2020a). The matrix elements Φ^k_ij are then arranged into N_det determinants of size N × N and the outputs summed together. As the distances |r_i - R_I| are inputs to the Psiformer, it is capable of learning the electron-nuclear cusp conditions, much like the FermiNet.
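A minimal numpy sketch of one such block, with standard multiheaded self-attention, a tanh linear-nonlinear layer, and residual connections; the weights are random placeholders, and the real model's projection, envelopes, and optional LayerNorm are not shown:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, Wq, Wk, Wv):
    """Standard scaled dot-product self-attention, one head."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # softmax over electrons j
    return attn @ v

def psiformer_layer(h, heads, Wo, W, b):
    """One block: multiheaded attention and a tanh layer, both residual."""
    a = np.concatenate([self_attention(h, *hd) for hd in heads], axis=-1)
    f = h + a @ Wo
    return f + np.tanh(f @ W + b)

rng = np.random.default_rng(0)
N, dim, nheads, dhead = 6, 8, 2, 4
heads = [tuple(rng.normal(size=(dim, dhead)) for _ in range(3))
         for _ in range(nheads)]
Wo = rng.normal(size=(nheads * dhead, dim))
W, b = rng.normal(size=(dim, dim)), np.zeros(dim)

h = rng.normal(size=(N, dim))
out = psiformer_layer(h, heads, Wo, W, b)

# permutation equivariance: permuting electrons permutes the outputs
perm = rng.permutation(N)
out_perm = psiformer_layer(h[perm], heads, Wo, W, b)
print(np.allclose(out_perm, out[perm]))  # True
```

The final check illustrates the permutation equivariance property discussed in Sec. 1, which is what makes self-attention a natural fit for electrons.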
However, the self-attention part of the Psiformer does not take pairwise electron distances as inputs, so it cannot learn the electron-electron cusp conditions. Instead, the Psiformer uses a conventional Jastrow factor only for the electron-electron cusps. We use a particularly simple Jastrow factor:

J_θ(x) = Σ_{i<j; σ_i=σ_j} -(1/4) α^2_par / (α_par + |r_i - r_j|) + Σ_{i<j; σ_i≠σ_j} -(1/2) α^2_anti / (α_anti + |r_i - r_j|)

which has only two free parameters, α_par and α_anti, and works well in practice.
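This Jastrow factor's cusp behavior can be checked numerically: the derivative of J with respect to a pair distance at coincidence should be 1/4 for parallel spins and 1/2 for antiparallel spins. The function signature and α values below are illustrative:

```python
import numpy as np

def jastrow(r_par, r_anti, a_par=1.0, a_anti=1.0):
    """Two-parameter Jastrow of Sec. 3 (a sketch; alphas are placeholders).

    r_par / r_anti: pairwise distances |r_i - r_j| for same-spin and
    opposite-spin electron pairs respectively.
    """
    r_par = np.asarray(r_par, dtype=float)
    r_anti = np.asarray(r_anti, dtype=float)
    j = np.sum(-0.25 * a_par**2 / (a_par + r_par))
    j += np.sum(-0.5 * a_anti**2 / (a_anti + r_anti))
    return j

# cusp check: dJ/dr as r -> 0 should be 1/4 (parallel), 1/2 (antiparallel)
eps = 1e-6
dpar = (jastrow([eps], []) - jastrow([0.0], [])) / eps
danti = (jastrow([], [eps]) - jastrow([], [0.0])) / eps
print(dpar, danti)  # ~0.25, ~0.5
```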

4. EXPERIMENTS

Here we present an evaluation of the Psiformer on a wide variety of benchmark systems. Where not otherwise specified, all results with the FermiNet and FermiNet+SchNet are from our own implementation, forked from Spencer et al. (2020b). We use Kronecker-factored Approximate Curvature (KFAC) (Martens & Grosse, 2015) to optimize the Psiformer; the use of KFAC for self-attention has been investigated in Zhang et al. (2019). We show the advantage of KFAC, and how it interacts with LayerNorm, in Sec. A.2.3 of the appendix. We use hyperparameters and a Metropolis-Hastings MCMC algorithm similar to the original FermiNet paper, though we have made several modifications which help for larger systems: we pretrain for longer, we generate samples for pretraining from the target wavefunction, we take more MCMC steps between parameter updates, we propose updates for subsets of electrons rather than all electrons simultaneously, and we slightly change the gradient computation to be more robust to outliers. Details are given in Sec. A.2.4 of the appendix.
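A bare-bones version of the Metropolis-Hastings sampler that generates x ∼ Ψ² can be written in a few lines. This sketch omits the modifications listed above (per-electron moves, step-size tuning, burn-in handling) and uses a toy one-dimensional log-wavefunction for the check:

```python
import numpy as np

def metropolis_psi2(log_psi, x0, n_steps, step_size, rng):
    """Metropolis-Hastings sampling of |psi|^2 with symmetric Gaussian
    proposals: a minimal sketch of the MCMC used in VMC."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_new = x + step_size * rng.normal(size=x.shape)
        # accept with probability min(1, psi(x')^2 / psi(x)^2)
        if np.log(rng.uniform()) < 2.0 * (log_psi(x_new) - log_psi(x)):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# toy check: psi = exp(-x^2/2), so psi^2 is N(0, 1/2) with variance 0.5
rng = np.random.default_rng(0)
log_psi = lambda x: -0.5 * np.sum(x**2)
s = metropolis_psi2(log_psi, np.zeros(1), 50_000, 1.0, rng)
print(s.var())  # close to 0.5
```

Note that only the ratio Ψ(x')²/Ψ(x)² enters the acceptance test, which is why VMC works with unnormalized Ansatzes.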

4.1. REVISITING SMALL MOLECULES

Pfau et al. (2020) compared the FermiNet against CCSD(T) extrapolated to the complete basis set (CBS) limit on a number of small molecules (4-30 electrons) from the G3 database (Curtiss et al., 2000). While the FermiNet captured more than 99% of the correlation energy relative to CCSD(T)/CBS for systems as large as ethene (16 electrons), the quality of the FermiNet calculation began to decline as the system size grew. While some of this discrepancy was reduced simply by changing to a framework with better numerics (Spencer et al., 2020a), the reported difference in energy for bicyclobutane was still greater than 20 mHa. Here we revisit these systems, comparing the FermiNet with training improvements against the Psiformer. To investigate whether the Psiformer performance could be reproduced by simply making the FermiNet larger, we investigated both a "small" and "large" configuration for the Psiformer and FermiNet. The small FermiNet has the same layer dimensions and determinants as the one used in Pfau et al. (2020), while the large configuration has twice as many determinants and a one-electron stream twice as wide, similar to the largest networks in Spencer et al. (2020a). The Psiformer configurations have the same number of determinants, and their MLP layers are the same width as the FermiNet one-electron stream, though due to the self-attention weights the small Psiformer has a number of parameters between that of the small and large FermiNet. Exact details are given in Table 7 in the appendix. On these systems, the small Psiformer ran in a similar amount of time to the large FermiNet, though for even larger systems the difference in wall time between the small FermiNet and small Psiformer became much smaller (see Table 9 in the appendix). The results on small molecules can be seen in Fig. 2.
While increasing the size of the FermiNet increases the accuracy somewhat, the small Psiformer is more accurate than the large FermiNet, despite having fewer parameters, while the large Psiformer is the most accurate of all. This is true for all systems investigated. The improvement from the Psiformer is particularly dramatic on ozone and bicyclobutane -on ozone, the large Psiformer is within 1 kcal/mol of CCSD(T)/CBS, while even the largest FermiNet has an error more than 4 times larger than this. On all molecules, the large Psiformer captures more than 99% of the correlation energy relative to the reference energy.

4.2. COMPARISON AGAINST FERMINET WITH SCHNET-LIKE CONVOLUTIONS

While the performance of the Psiformer relative to the FermiNet is impressive, recent work has proposed several innovations to reach even lower energies (Gerard et al., 2022). The primary innovations of this work were the FermiNet+SchNet architecture and new hyperparameters which they claimed led to much faster optimization. They showed an especially large improvement on heavy atoms like potassium (K) and iron (Fe), and some improvement on larger molecules like benzene. We attempted to reproduce the results of Gerard et al. (2022) with our own FermiNet+SchNet implementation, with somewhat surprising results in Fig. 3. First, the changes in training used for the FermiNet seem to be enough to close the gap with the published FermiNet+SchNet results on heavy atoms. For instance, ablation studies suggested that modifying the hyperparameters plus adding SchNet-like convolutions accounts for a 38.9 mHa improvement in accuracy on the potassium atom at 10^5 iterations, but in our experiments the FermiNet with default hyperparameters is within a few mHa of the published result. On benzene, both the SchNet-like convolutions and modified hyperparameters improved the final energy by a few mHa, though still fell slightly short of the published results. Most importantly, the Psiformer is clearly either comparable to the best published FermiNet+SchNet results, as on K, or better by a wide margin, as on benzene. These results also give us confidence that the FermiNet is as strong a baseline as any other published method for further comparison.

4.3. THIRD-ROW ATOMS

The FermiNet, FermiNet+SchNet and Psiformer all seem to perform well on the potassium and iron atoms, but it is difficult to judge how close to the ground truth these results are. Spencer et al. (2020a) had shown that the FermiNet can achieve energies within chemical accuracy of exact results for atoms up to argon, but exact calculations are impractical for third-row atoms. Instead, comparison to experimental results is a more practical evaluation metric. The ionization potential, the amount of energy it takes to remove one electron, is a particularly simple comparison for which good experimental data exists (Koga et al., 1997). Here we compare the FermiNet, FermiNet+SchNet and Psiformer for estimating the ionization potential of potassium, iron and zinc. To compare the relative importance of size and architecture, we also looked at both a small and large configuration of all Ansatzes, with parameters described in Table 6 in the appendix. Results are shown in Table 1. For atoms this heavy, all methods showed some amount of run-to-run variability, usually small but occasionally as large as 10 mHa, which may explain some outlier results. On potassium, the difference between Ansatzes was within the range of run-to-run variability, but the Psiformer still did quite well, and reached the most accurate ionization potential. On heavier atoms, the difference between architectures became more pronounced, and the improvement of the Psiformer relative to other models on absolute energy was more robust. The results on ionization potentials were more mixed, and no Ansatz came within chemical accuracy of the ground truth. This shows that even the Psiformer has not yet converged to the ground truth, though it is the Ansatz that comes closest so far.

4.4. LARGER MOLECULES

Much of the promise of deep neural networks for QMC comes from the fact that they can, in theory, scale much better than other all-electron methods, though this promise has yet to be realized. While CCSD(T) scales with the number of electrons as O(N^7), a single iteration of wavefunction optimization for a fixed neural network size scales as O(N^4) in theory, and in practice is closer to cubic for system sizes of several dozen electrons. Self-attention has been especially powerful when scaling to extremely large problems in machine learning, so here we investigate whether the same holds true for QMC by applying both the FermiNet and Psiformer to single molecules much larger than those investigated in most prior work. Results on systems from benzene (42 electrons) to carbon tetrachloride (CCl4, 74 electrons) are given in Table 2. On these systems, CCSD(T) becomes impractical for us to run without approximations, so we only compare against other QMC results. Due to computational constraints, we were only able to run the small network configurations on these systems. Additionally, to help MCMC convergence, we updated half the electron positions at a time in each MCMC move, rather than moving all electrons simultaneously, as all-electron moves become less efficient for larger systems. In Table 2, it can be seen that the Psiformer is not only significantly better on benzene than the FermiNet, as in Fig. 3, but it is better than the best previously published DMC energy, a remarkable feat. For even larger molecules, there are no results in the literature with neural network wavefunctions to compare against, so we only compare the Psiformer and FermiNet directly. The Psiformer outperforms the FermiNet by an even larger margin on larger systems, reaching a 120 mHa (75 kcal/mol) improvement on CCl4. On the three hydrocarbon systems investigated, LayerNorm has only a small impact.
However, for CCl4, LayerNorm has a significant impact, accounting for 70 mHa of the total 120 mHa improvement over the FermiNet. While we do not claim these are the best variational results in the literature, it is clear that on larger molecules the Psiformer is a significant improvement over the FermiNet, which is itself the most accurate Ansatz for many smaller systems.

4.5. THE BENZENE DIMER

Finally, we look at the benzene dimer, a challenging benchmark system for computational chemistry due to the weak van der Waals force between the two molecules, and the largest molecular system ever investigated using neural network Ansatzes (Ren et al., 2022). The dimer has several possible equilibrium configurations, many of which have nearly equal energy (Sorella et al., 2007; Azadi & Cohen, 2015), making it additionally challenging to study computationally, but here we restrict our attention to the T-shaped structure (Fig. 4) so that we can directly compare against Ren et al. (2022). Results are given in Table 3. The results from Ren et al. (2022) are from a small FermiNet (3 layers) trained for 2 million iterations. Even our "small" 4-layer FermiNet baseline (which is larger than their FermiNet) trained for 200,000 iterations is able to reach the same accuracy as their small FermiNet trained for 800,000 iterations (Ren et al. (2022), Supplementary Figure 7a). The Psiformer reaches a significantly lower energy than the FermiNet with the same number of training iterations, again surpassing the DMC result by a few mHa. We also tried to estimate the dissociation energy by similar means as Ren et al. (2022): comparing against the energy of the same model trained with a bond length of 10 Å, and against twice the energy of the same model trained on the monomer. Ironically, our FermiNet baseline, which had the worst absolute energy, had the best relative energy between configurations. While every model underestimated the energy relative to twice the monomer, the discrepancy was lowest with DMC. It should be noted that the zero-point vibrational energy (ZPE) is not included, the dissociation energies from Ren et al. (2022) in Table 3 are estimates based on their figures, and there is disagreement over the exact experimental energy (Grover et al., 1987; Krause et al., 1991), so this comparison should be considered a rough estimate only.
Our main result is that the absolute energy of the Psiformer is a vast improvement over the FermiNet, and we leave it to future work to properly apply the Psiformer to predicting binding energies.

5. DISCUSSION

We have shown that self-attention networks are capable of learning quantum mechanical properties of electrons far more effectively than comparable methods. The advantage of self-attention networks seems to become most pronounced on large systems, suggesting that these models should be the focus of future efforts to scale to even larger systems. In addition to the strong empirical results, using standard attention layers means that we can leverage existing work on improving scalability, either by architectural advances (Child et al., 2019; Wang et al., 2020; Xiong et al., 2021; Jaegle et al., 2021a; b) or software implementations optimized for specialized hardware (Dao et al., 2022) , which could make it possible to scale these models even further. This presents a promising path towards studying the most challenging molecules and materials in silico with unprecedented accuracy.

ETHICS STATEMENT

The work presented here focuses on fundamental questions in computational chemistry, and is not yet at the stage where it is likely to be widely adopted by experimental chemists. However, in the future, this line of work could lead to computational chemistry becoming much more accurate, making it easier for chemists and material scientists to make new discoveries without requiring cumbersome trial-and-error physical experiments. This could lead to the discovery of new beneficial drugs or industrial chemical processes which are more environmentally friendly. The field of experimental chemistry already has robust ethical standards and processes for preventing harmful applications, and we are confident that more accurate computational methods will not in any way make it easier to circumvent these safeguards.

Figure 5 (caption): While there is a clear advantage to using KFAC on both systems, LayerNorm has a marginal impact on the sulphur dimer. On the iron atom, however, LayerNorm improves the accuracy of ADAM and the stability of KFAC. A learning rate of 3e-4 was used for ADAM. A rolling mean over the last 1000 iterations is used to smooth the energy.

A.2.1 DENSE DETERMINANTS

In the original FermiNet, the determinant was factorized into the product of a determinant of spin-up electrons of size N↑ × N↑ and a determinant of spin-down electrons of size N↓ × N↓, as is the case with conventional VMC Ansatzes (Foulkes et al., 2001):

det[Φ(x)] = det[[Φ^↑(x^↑), 0], [0, Φ^↓(x^↓)]] = det[Φ^↑(x^↑)] det[Φ^↓(x^↓)]

where x^σ denotes the set of electron states with the specified spin and different functions are used for electrons of different spins. The FermiNet authors subsequently proposed simply using dense determinants of size N × N, N = N↑ + N↓ (Spencer et al., 2020b):

det[Φ(x)] = det[Φ^↑(x^↑) Φ^↓(x^↓)]

where now each block is of dimension N × N_σ.
This has largely become the default choice for FermiNet architectures and has shown to provide improved accuracy at small additional cost for a myriad of systems (Lin et al., 2021; Cassella et al., 2022; Ren et al., 2022; Gerard et al., 2022; Gao & Günnemann, 2022) .

A.2.2 IMPLEMENTATION OF FERMINET+SCHNET

To best reproduce the results from Gerard et al. (2022), we implemented the changes to the FermiNet which they claimed made the largest difference. All experiments in this paper, including the FermiNet, used dense determinants. We added the SchNet-like continuous-filter convolutions, as well as the nuclear embedding stream and separate streams for same-spin and different-spin electrons in the two-electron stream. We also compared their training hyperparameters against ours, except for the batch size, which was kept at 4096 for all experiments. We did not implement their local input features or envelope initialization, as their results suggested these did not make as significant a difference, and our longer pretraining likely had the same effect as changing the envelope initialization.
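As a rough illustration of the SchNet-style continuous-filter convolution added to the two-electron stream, the sketch below generates elementwise filters from pairwise distances and uses them to gate the aggregation of neighbour features. This is a minimal NumPy sketch, not the actual implementation; the radial-basis expansion, shapes, and weight matrices are hypothetical choices for illustration.

```python
import numpy as np

def rbf_features(r, n_rbf=16, cutoff=5.0):
    """Expand scalar distances in Gaussian radial basis functions."""
    centers = np.linspace(0.0, cutoff, n_rbf)
    return np.exp(-((r[..., None] - centers) ** 2))  # (..., n_rbf)

def continuous_filter_conv(h, r_ij, w1, w2):
    """SchNet-style continuous-filter convolution (illustrative).

    h    : (n, d)  per-electron features
    r_ij : (n, n)  pairwise electron distances
    w1   : (n_rbf, d), w2 : (d, d)  filter-generating weights (hypothetical)
    """
    # generate an elementwise filter for each pair from their distance
    filt = np.tanh(rbf_features(r_ij) @ w1) @ w2      # (n, n, d)
    # each electron i aggregates neighbour features j, gated by the filter
    return np.einsum('jd,ijd->id', h, filt)

n, d, n_rbf = 4, 8, 16
rng = np.random.default_rng(0)
h = rng.normal(size=(n, d))
pos = rng.normal(size=(n, 3))
r = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
out = continuous_filter_conv(h, r, rng.normal(size=(n_rbf, d)), rng.normal(size=(d, d)))
print(out.shape)  # (4, 8)
```

The key design point is that the filter is a smooth function of the continuous distance, rather than a fixed kernel on a grid, so the operation remains well defined for arbitrary electron positions.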

A.2.3 CHOICE OF OPTIMIZER AND LAYERNORM

The original FermiNet was only able to reach high accuracy when trained with Kronecker-factored approximate curvature (KFAC) (Martens & Grosse, 2015). To see whether the same holds true for the Psiformer, here we compare training with KFAC against ADAM on several systems. Additionally, self-attention layers usually require LayerNorm (Ba et al., 2016) when trained with ADAM. However, other work has suggested that in some contexts normalization may not be necessary when using KFAC (Martens et al., 2021), so we also compare the Psiformer with and without LayerNorm. Figure 5 shows a comparison for the Psiformer on the sulphur dimer and the iron atom. Consistent with the FermiNet results, the Psiformer converges faster and to lower energies with KFAC. With LayerNorm, the situation is more ambiguous: on the sulphur dimer it seems to have marginal impact, while on the iron atom it clearly improves training with ADAM and seems to improve stability when using KFAC.
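For reference, LayerNorm normalizes each feature vector over its feature axis before applying a learned affine transform; a minimal NumPy sketch (illustrative only, not the network code):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm (Ba et al., 2016): normalize each feature vector to zero
    mean and unit variance over the last axis, then apply a learned
    elementwise affine transform (gamma, beta)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=(5, 16))
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(np.allclose(y.mean(-1), 0.0, atol=1e-6))  # True
```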

A.2.4 HYPERPARAMETERS

Table 4 shows the default hyperparameters used for training all models implemented in this work. Note that Pfau et al. (2020) took the sum over gradients across the batch on each device and averaged over devices, whereas here the gradients are averaged over the entire batch. This amounts to a rescaling of the learning rate, meaning that the same learning rate as in Pfau et al. (2020) cannot be used.

As in previous FermiNet implementations, we pretrain all networks to match Hartree-Fock (HF) orbitals computed using PySCF, here using the LAMB optimizer (You et al., 2020) as the pretraining optimizer. We find that longer pretraining stabilises training for all models considered. In addition, during pretraining, we draw samples from the HF orbitals only, instead of from the neural network wavefunction. A smaller number of pretraining iterations (20,000) was used for the small molecules in Section 4.1, while 100,000 iterations were used for third-row atoms and larger molecules.

To generate samples from Ψ², we use the Metropolis-Hastings algorithm with symmetric Gaussian proposals, as in Pfau et al. (2020). We increase the number of decorrelation MCMC steps between optimization iterations from 10 in Pfau et al. (2020) to 30. Additionally, for larger systems, we do not update all electron positions simultaneously in one Metropolis step. Instead, we split the electrons into multiple blocks, and iteratively update each block once per step. This is an intermediate between the all-electron and one-electron moves commonly used in VMC.

Note that while all models were trained for 200,000 optimization iterations, this does not mean that all systems required that many iterations to converge. Many smaller systems converged in far fewer iterations, but the same number was used for all systems to minimize confusion.
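The blocked Metropolis-Hastings update described above can be sketched as follows. This is a toy NumPy illustration: the Gaussian `log_psi2` stands in for 2 log|Ψ| of a real wavefunction, and the step size and block count are arbitrary choices, not the training settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_psi2(x):
    """Toy stand-in for 2*log|Psi(x)|: an isotropic Gaussian wavefunction."""
    return -np.sum(x**2)

def mh_step_blocked(x, n_blocks=2, step=0.3):
    """One MCMC step with symmetric Gaussian proposals, updating the
    electrons one block at a time rather than all at once."""
    n_elec = x.shape[0]
    for block in np.array_split(np.arange(n_elec), n_blocks):
        prop = x.copy()
        prop[block] += step * rng.normal(size=(len(block), 3))
        # symmetric proposal: accept with prob min(1, |Psi(prop)|^2/|Psi(x)|^2)
        if np.log(rng.uniform()) < log_psi2(prop) - log_psi2(x):
            x = prop
    return x

x = rng.normal(size=(6, 3))   # 6 electrons in 3D
for _ in range(30):           # decorrelation steps between gradient updates
    x = mh_step_blocked(x)
print(x.shape)  # (6, 3)
```

With one block this reduces to all-electron moves; with one block per electron it reduces to one-electron moves, which is why the scheme interpolates between the two.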
Not all systems used identical parameters: for larger systems we increased the number of pretraining steps and blocks per MCMC update, and LayerNorm was not used for the small molecules in Table 2. In Table 5 we specify which systems were trained with which settings. In Section 4.1, the performance of "small" and "large" FermiNet and Psiformer models was investigated on a set of small and medium molecules.



Figure 1: Comparison of (a) the FermiNet and FermiNet+SchNet and (b) the Psiformer. The FermiNet variants have two streams, acting on electron-nuclear and electron-electron features, which are merged via concatenation or continuous-filter convolution operations. In contrast, the Psiformer uses a single stream of self-attention layers, acting on electron-nuclear features only. Electron-electron features appear only via the Jastrow factor. The FermiNet+SchNet also includes a nuclear embedding stream and separate spin-dependent electron-electron streams, not pictured here.

Figure 2: FermiNet and Psiformer accuracy on small molecules. Geometries and CCSD(T)/CBS reference energies are taken from Pfau et al. (2020). The grey region indicates chemical accuracy (1 kcal/mol or 1.6 mHa) relative to the reference energies.

Figure 3: Comparison of the Psiformer (ΨF), with and without LayerNorm (LN), against the FermiNet (FN) and FermiNet+SchNet (FN+SN) using training hyperparameters from Pfau et al. (2020) and Gerard et al. (2022). Learning curves from the original FN+SN implementation in Gerard et al. (2022) are in green. Energies are smoothed over 4000 iterations.

Figure 4: The T-shaped benzene dimer equilibrium.

Figure 5: Comparison of different optimization algorithms (KFAC and ADAM) for the Psiformer on (a) the sulphur dimer and (b) the iron atom, with and without LayerNorm. While there is a clear advantage to using KFAC on both systems, LayerNorm has a marginal impact on the sulphur dimer. On the iron atom, however, LayerNorm improves the accuracy of ADAM and the stability of KFAC. A learning rate of 3e-4 was used for ADAM. A rolling mean of the last 1000 iterations is used to smooth the energy.

Energies of third-row neutral atoms and cations. Ionization potentials are compared against experimental results from Koga et al. (1997). Total energies are in Hartree while ionization potentials are in eV. Chemical accuracy is 0.043 eV.

Energies for molecules with between 42 and 74 electrons. For benzene, the best published results using neural network Ansatzes are either from (a) Gerard et al. (2022) or (b) Ren et al. (2022).

Energies for the benzene dimer, with center separation of 4.95 Å (equilibrium) and 10 Å (dissociated), and the estimated dissociation energy from taking the difference of the equilibrium energy from either twice the monomer energy (∆E mono) or the dissociated energy (∆E 10 Å). All ∆E mono results are based on comparing like-for-like dimer and monomer calculations. ∆E 10 Å results from Ren et al. (2022) are estimated from figures. Experimental energies are from

REPRODUCIBILITY STATEMENT

Details of training, network parameters, and optimization hyperparameters are given in the appendices for all experiments shown. The code is available under the Apache License 2.0 as part of the FermiNet repo at https://github.com/deepmind/ferminet.

Table of default hyperparameters used.

Variations in hyperparameters between experiments.

Table 6 gives the network parameters used for these models.

Molecular geometries in Bohr.

ACKNOWLEDGMENTS

We would like to thank Alex G. de G. Matthews for suggesting the model name, Alex Botev for assistance with KFAC, Michael Scherbela and Leon Gerard for providing data for figures, and James Kirkpatrick for support and encouragement.

A APPENDIX

A.1 CALCULATING ENERGIES AND GRADIENTS

It is generally more numerically stable to work directly with the log wavefunction, and the local energy can be expressed as

E_L(x) = -½ Σ_i [ ∇²_{x_i} log|Ψ(x)| + (∇_{x_i} log|Ψ(x)|)² ] + V(x),

where V(x) is the potential energy (the last three terms of Eq 1). The gradient of the energy is given by

∇_θ E = 2 E_{x∼Ψ²}[ (E_L(x) - E_{x′∼Ψ²}[E_L(x′)]) ∇_θ log|Ψ(x)| ],

where the local energy E_L(x) = Ψ⁻¹(x) ĤΨ(x). Note that the E_L(x) - E_{x′∼Ψ²}[E_L(x′)] term is the difference between the local energy at x and the average energy over all x′.

We make a small but critical change to how the gradients are computed relative to Pfau et al. (2020) which stabilizes training for all models considered here. The local energy often has very large tails due to numerical issues, especially near cusps (where two particles overlap) and nodes (where the wavefunction goes to zero) (Pathak & Wagner, 2020). To mitigate this, the local energy is often truncated in practice. Let ⟨E_L⟩_mean denote the mean local energy for one minibatch of walkers and ⟨E_L⟩_median denote the median. Then in Pfau et al. (2020), the local energies were clipped to be within a constant multiple ρ of the mean absolute deviation around the mean:

σ_mean = ⟨ |E_L - ⟨E_L⟩_mean| ⟩_mean,    E_L clipped to [⟨E_L⟩_mean - ρσ_mean, ⟨E_L⟩_mean + ρσ_mean].

While this is more robust than the standard deviation, it is still susceptible to large outliers because it is centered at the mean. Instead, we use the mean absolute deviation around the median to determine the window for clipping:

σ_median = ⟨ |E_L - ⟨E_L⟩_median| ⟩_mean,    E_L clipped to [⟨E_L⟩_median - ρσ_median, ⟨E_L⟩_median + ρσ_median].

Secondly, in Pfau et al. (2020), the average energy term E_{x′∼Ψ²}[E_L(x′)] in the gradient was approximated by the mean local energy ⟨E_L⟩_mean of a minibatch. That meant that outliers were still included in this term. We instead use the mean of the clipped local energies. This guarantees that the mean over a minibatch of the energy difference term is always zero, improving stability during optimization. If we let clip_mean/median(x; σ) denote the functions that clip x to be in the range [⟨x⟩_mean/median - σ, ⟨x⟩_mean/median + σ], then the gradient for one batch of size B in Pfau et al. (2020) was:

∇_θ E ≈ (2/B) Σ_i [ clip_mean(E_L(x_i); ρσ_mean) - ⟨E_L⟩_mean ] ∇_θ log|Ψ(x_i)|,

while here we use:

∇_θ E ≈ (2/B) Σ_i [ clip_median(E_L(x_i); ρσ_median) - ⟨clip_median(E_L; ρσ_median)⟩_mean ] ∇_θ log|Ψ(x_i)|.

While this may seem like a subtle difference, it has a great effect on larger systems, especially heavier atoms. A similar technique was used to stabilize training the FermiNet with pseudopotentials (Li et al., 2022), but instead of clipping outlier local energies, the outlier walkers were removed entirely for that minibatch. We found that clipping with proper centering was more effective than removing the walkers entirely.
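The median-centered clipping and recentering can be sketched in a few lines of NumPy. This is a minimal illustration of the scheme, not the training code; the batch values and ρ = 5 are arbitrary.

```python
import numpy as np

def clipped_energy_diff(e_loc, rho=5.0):
    """Clip local energies to within rho mean-absolute-deviations of the
    batch *median*, then center by the mean of the *clipped* energies so
    the energy-difference term averages to exactly zero over the batch."""
    med = np.median(e_loc)
    sigma = np.mean(np.abs(e_loc - med))   # MAD around the median
    e_clip = np.clip(e_loc, med - rho * sigma, med + rho * sigma)
    return e_clip - np.mean(e_clip)

rng = np.random.default_rng(0)
e_loc = np.append(rng.normal(-1.0, 0.01, size=100), 50.0)  # one outlier walker
diff = clipped_energy_diff(e_loc)
print(abs(diff.mean()) < 1e-12, diff.max() < 49.0)  # True True
```

Centering at the median keeps the clipping window tight even when one walker's local energy blows up, and subtracting the mean of the clipped energies (rather than the raw mean) is what makes the difference term exactly zero-mean per minibatch.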

A.2 TRAINING

In this section we give details on training and further differences from the original FermiNet (Pfau et al., 2020) .

A.2.1 DENSE DETERMINANTS

The original FermiNet and PauliNet Ansatzes used block-diagonal determinants for spin-up and spin-down electrons (Pfau et al., 2020; Hermann et al., 2020), factorized into the product of a determinant of spin-up electrons of size N↑ × N↑ and a determinant of spin-down electrons of size N↓ × N↓, as is the case with conventional VMC Ansatzes (Foulkes et al., 2001):

det[Φ(x)] = det[ Φ↑(x↑), 0 ; 0, Φ↓(x↓) ] = det[Φ↑(x↑)] det[Φ↓(x↓)],

where x^σ denotes the set of electron states with the specified spin and different functions are used for electrons of different spins. The FermiNet authors subsequently proposed simply using dense determinants of size N × N, N = N↑ + N↓ (Spencer et al., 2020b):

det[Φ(x)] = det[ Φ↑(x↑)  Φ↓(x↓) ],

where now each block Φ^σ(x^σ) is of dimension N × N^σ. This has largely become the default choice for FermiNet architectures and has been shown to provide improved accuracy at small additional cost for a myriad of systems (Lin et al., 2021; Cassella et al., 2022; Ren et al., 2022; Gerard et al., 2022; Gao & Günnemann, 2022).

In all other experiments, unless otherwise specified, the 'small' network configurations were used. Table 7 contains the number of parameters used by each model for this set of molecules. Note that Pfau et al. (2020) used block-diagonal determinants and did not report precise parameter counts; parameter counts for their results are estimated using their published settings and the FermiNet JAX implementation (Spencer et al., 2020b). Our FermiNet experiments used the envelope function proposed in Spencer et al. (2020a) and dense determinants, resulting in slightly different numbers of parameters between the networks of Pfau et al. (2020) and our FermiNet (Small) networks.

A.3 COMPUTATIONAL DETAILS

All models were implemented in JAX (Bradbury et al., 2018) based upon the public FermiNet (Spencer et al., 2020b) and KFAC implementations (Botev & Martens, 2022), and trained in parallel using between 16 and 64 A100 GPUs, depending on the system size. All calculations were done at standard single precision, as we found that TensorFloat-32 calculations on A100 had numerical accuracy issues. A table of total training time, including pretraining, for several models is given in Table 9. Empirically, the time per iteration scaled roughly cubically, i.e. the benzene dimer required ∼8 times the computational resources of the benzene molecule, though the number of atoms also played a significant role in timing. For systems with very large numbers of atoms, like the benzene dimer, the FermiNet and Psiformer ran at nearly the same speed. This is likely because the determinants and envelope, which are similar between the FermiNet and Psiformer, make up a larger share of the total run time as the systems become larger.

Geometries are taken from Curtiss et al. (2000) and given in Table 8 or in Pfau et al. (2020). The benzene dimer geometry is obtained via a rigid translation and rotation of two monomers.

Figure 6: A rolling mean of the last 1000 iterations is used to smooth the energy. The divergence of the energy without using input rescaling, as shown here for ethanol, is typical for medium or large systems.
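The block-diagonal factorization discussed in Appendix A.2.1 is easy to verify numerically. The sketch below uses random matrices as stand-ins for the orbital blocks (purely illustrative, not the Ansatz code): the block-diagonal determinant factorizes into spin-up and spin-down determinants, while the dense form simply fills in the off-diagonal blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_up, n_down = 3, 2

# Stand-ins for the orbital blocks: Phi_up (n_up, n_up), Phi_down (n_down, n_down)
phi_up = rng.normal(size=(n_up, n_up))
phi_down = rng.normal(size=(n_down, n_down))

# Block-diagonal Ansatz: the full determinant factorizes by spin
block = np.zeros((n_up + n_down, n_up + n_down))
block[:n_up, :n_up] = phi_up
block[n_up:, n_up:] = phi_down
print(np.allclose(np.linalg.det(block),
                  np.linalg.det(phi_up) * np.linalg.det(phi_down)))  # True

# Dense determinant: a single (N, N) matrix whose off-diagonal blocks need
# not vanish, strictly generalizing the block-diagonal form
dense = block.copy()
dense[:n_up, n_up:] = rng.normal(size=(n_up, n_down))
dense[n_up:, :n_up] = rng.normal(size=(n_down, n_up))
```

Since setting the off-diagonal blocks to zero recovers the factorized form exactly, the dense determinant can represent everything the block-diagonal one can, at the cost of one N × N determinant instead of two smaller ones.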

A.4 ABLATION STUDIES

Two modifications to the self-attention network proved critical to the stability of the Psiformer: the Jastrow factor ensuring correct behaviour at electronic cusps, and the rescaling of input distances. Here we show ablation studies without these modifications. Figure 6 shows the Psiformer on (a) the nitrogen dimer and (b) ethanol, without a Jastrow factor and without input feature rescaling. In the absence of an electron-electron Jastrow factor to enforce cusp conditions, the energy is very noisy. If the input features are not rescaled, the energy is often unstable. For smaller systems, such as the nitrogen dimer shown, training may recover from the instability, but for medium or larger systems the energy often diverges.

A.5 ATTENTION MAPS

