D4FT: A DEEP LEARNING APPROACH TO KOHN-SHAM DENSITY FUNCTIONAL THEORY

Abstract

Kohn-Sham Density Functional Theory (KS-DFT) has traditionally been solved by the Self-Consistent Field (SCF) method. Behind the SCF loop is the physical intuition of solving a system of non-interacting single-electron wave functions under an effective potential. In this work, we propose a deep learning approach to KS-DFT. First, in contrast to the conventional SCF loop, we propose directly minimizing the total energy by reparameterizing the orthogonal constraint as a feed-forward computation. We prove that such an approach has the same expressivity as the SCF method yet reduces the computational complexity from O(N^4) to O(N^3). Second, the numerical integration, which involves a summation over the quadrature grids, can be amortized over the optimization steps. At each step, stochastic gradient descent (SGD) is performed on a sampled minibatch of the grids. Extensive experiments are carried out to demonstrate the advantage of our approach in terms of efficiency and stability. In addition, we show that our approach enables us to explore more complex neural-based wave functions.

1. INTRODUCTION

Density functional theory (DFT) is the most successful quantum-mechanical method, widely used in chemistry and physics for predicting electron-related properties of matter (Szabo & Ostlund, 2012; Levine et al., 2009; Koch & Holthausen, 2015). As scientists explore more complex molecules and materials, DFT methods are often limited in scale or accuracy by their computational complexity. On the other hand, Deep Learning (DL) has achieved great success over the past decade in function approximation (Hornik et al., 1989), optimization algorithms (Kingma & Ba, 2014), and systems (Bradbury et al., 2018). Many aspects of deep learning can be harnessed to improve DFT.

Of them, data-driven function fitting is the most straightforward and often the first to be considered. It has been shown that models learned from a sufficient amount of data generalize well to unseen data, given that the models have the right inductive bias. The Hohenberg-Kohn theorem proves that the ground state energy is a functional of the electron density (Hohenberg & Kohn, 1964a), but this functional is not available analytically. This is where data-driven learning can be helpful for DFT: the strong function approximation capability of deep learning gives hope of learning such functionals in a data-driven manner. There have already been initial successes in learning the exchange-correlation functional (Chen et al., 2020a;b; Dick & Fernandez-Serra, 2020).

Furthermore, deep learning has shifted the mindsets of researchers and engineers towards differentiable programming. Implementing the derivative of a function has no extra cost if the primal function is implemented with a deep learning framework. Differentiation of functions appears frequently in DFT, e.g., in estimating the kinetic energy of a wave function, or in calculating generalized gradient approximation (GGA) exchange-correlation functionals.
Using modern automatic differentiation (AD) techniques eases the implementation greatly (Abbott et al., 2021). Despite the numerous efforts to apply deep learning to DFT, there is still a vast space for exploration. For example, the most popular variant, Kohn-Sham DFT (KS-DFT) (Kohn & Sham, 1965), utilizes the self-consistent field (SCF) method for solving the parameters. At each SCF step, it solves a closed-form eigendecomposition problem, which finally leads to energy minimization. However, this method suffers from several drawbacks. Many computational chemists and material scientists observe that optimizing via SCF is time-consuming for large molecules or solid cells, and that the convergence of SCF is not always guaranteed. Furthermore, DFT methods often utilize a linear combination of basis functions as the ansatz of the wave functions, which may not be expressive enough to approximate realistic quantum systems.

To address the problems of SCF, we propose a deep learning approach for solving KS-DFT. Our approach differs from SCF in the following aspects. First, the eigendecomposition steps in SCF stem from the orthogonality constraints on the wave functions; we show in this work that the original objective function of KS-DFT can be converted into an unconstrained equivalent by reparameterizing the orthogonality constraints as part of the objective function. Second, we further explore amortizing the integral in the objective function over the optimization steps, i.e., using stochastic gradient descent (SGD), which is well-motivated both empirically and theoretically for large-scale machine learning (Bottou et al., 2018). We demonstrate the equivalence between our approach and the conventional SCF method both empirically and theoretically. Our approach reduces the computational complexity from O(N^4) to O(N^3), which significantly improves the efficiency and scalability of KS-DFT. Third, gradient-based optimization treats all parameters equally.
We show that it is possible to optimize more complex neural-based wave functions instead of optimizing only the coefficients. In this paper, we instantiate this idea with a local scaling transformation as an example of how to construct neural-based wave functions for DFT.

2. DFT PRELIMINARIES

Density functional theory (DFT) is among the most successful quantum-mechanical simulation methods for computing electronic structure and all electron-related properties. DFT defines the ground state energy as a functional of the electron density ρ : R^3 → R: E_gs = E[ρ]. The Hohenberg-Kohn theorem (Hohenberg & Kohn, 1964b) guarantees that such a functional E exists and that the ground state energy is determined uniquely by the electron density. However, the exact form of this functional has been a puzzling obstacle for physicists and chemists. Approximations, including the famous Thomas-Fermi and Kohn-Sham methods, have been proposed and have since become the most important ab-initio calculation methods.

The Objective Function  One of the difficulties in finding the functional of the electron density is the lack of an accurate functional for the kinetic energy. The Kohn-Sham method resolves this issue by introducing an orthogonal set of single-particle wave functions {ψ_i} and rewriting the energy as a functional of these wave functions. The energy functional connects back to the Schrödinger equation. Without compromising the understanding of this paper, we leave the detailed derivation from the Schrödinger equation and the motivation of the orthogonality constraint to Appendix B.1. As far as this paper is concerned, we focus on the objective function of KS-DFT, defined as

E_gs = min_{{ψ_i^σ}} E[{ψ_i^σ}] = min_{{ψ_i^σ}} ( E_Kin[{ψ_i^σ}] + E_Ext[{ψ_i^σ}] + E_H[{ψ_i^σ}] + E_XC[{ψ_i^σ}] )    (3)

s.t.  ⟨ψ_i^σ|ψ_j^σ⟩ = δ_ij    (4)

where ψ_i^σ is a wave function mapping R^3 → C, and ψ_i^σ* denotes its complex conjugate. For simplicity, we use bra-ket notation: ⟨ψ_i^σ|ψ_j^σ⟩ = ∫ ψ_i^σ*(r) ψ_j^σ(r) dr. δ_ij is the Kronecker delta. The superscript σ ∈ {α, β} denotes the spin.
E_Kin, E_Ext, E_H, E_XC are the kinetic, external potential (nuclear attraction), Hartree (Coulomb repulsion between electrons), and exchange-correlation energies respectively, defined by

E_Kin[{ψ_i^σ}] = -1/2 Σ_i^N Σ_σ ∫ ψ_i^σ*(r) ∇²ψ_i^σ(r) dr,    (5)

E_Ext[{ψ_i^σ}] = Σ_i^N Σ_σ ∫ v_ext(r) |ψ_i^σ(r)|² dr,    (6)

E_H[{ψ_i^σ}] = 1/2 ∫∫ ( Σ_i^N Σ_σ |ψ_i^σ(r)|² ) ( Σ_j^N Σ_σ′ |ψ_j^σ′(r′)|² ) / |r - r′| dr dr′,    (7)

E_XC[{ψ_i^σ}] = Σ_i^N Σ_σ ∫ ε_xc(r) |ψ_i^σ(r)|² dr,    (8)

in which v_ext is the external potential defined by the molecule's geometry, and ε_xc is the exchange-correlation energy density, which has different instantiations. The entire objective function is given analytically, except that the functions ψ_i^σ are replaced with parametric functions to be optimized. In the rest of this paper, we focus on the algorithms that optimize this objective while minimizing the discussion of its scientific background.
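To make the energy terms concrete, the following is a minimal 1D NumPy sketch (not the paper's 3D JAX implementation) evaluating the kinetic energy E_Kin = -1/2 ∫ ψ*(r) ∇²ψ(r) dr on a quadrature grid. The orbital is the harmonic-oscillator ground state ψ(x) = π^{-1/4} exp(-x²/2), chosen because its exact kinetic energy is 1/4 in these units; the grid size and integration interval are illustrative choices.

```python
import numpy as np

# Gauss-Legendre quadrature on [-1, 1], mapped to [-a, a].
x, w = np.polynomial.legendre.leggauss(200)
a = 10.0
x, w = a * x, a * w

# Harmonic-oscillator ground state; its Laplacian is (x^2 - 1) * psi.
psi = np.pi ** -0.25 * np.exp(-x ** 2 / 2)
lap_psi = (x ** 2 - 1) * psi

# E_kin = -1/2 * sum_i w_i * psi(x_i) * psi''(x_i); exact value is 0.25.
e_kin = -0.5 * np.sum(w * psi * lap_psi)
print(e_kin)  # ≈ 0.25
```

The same pattern (evaluate the integrand at grid points, weight, and sum) is what the quadrature section below builds on.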

Kohn-Sham Equation and SCF Method

The objective function above can be solved by the Euler-Lagrange method, which yields the canonical form of the well-known Kohn-Sham equation,

[ -1/2 ∇² + v_ext(r) + ∫ d³r′ ( Σ_i^N |ψ_i(r′)|² ) / |r - r′| + v_xc(r) ] ψ_i = ε_i ψ_i    (9)

where N is the total number of electrons, and v_ext and v_xc are the external and exchange-correlation potentials, respectively. This equation is usually solved in an iterative manner called the Self-Consistent Field (SCF) method. The method starts with an initial guess of the orthogonal set of single-electron wave functions, which are used to construct the Hamiltonian operator. The new wave functions and the corresponding electron density are then obtained by solving the eigenvalue equation. This process is repeated until the electron density of the system converges. The derivation of the Kohn-Sham equation and the SCF algorithm is presented in Appendix B.2.

The LCAO Method  In the general case, we can use any parametric approximator for ψ_i to transform the optimization problem from the function space to the parameter space. In quantum chemistry, the wave functions are usually represented by a linear combination of atomic orbitals (LCAO):

ψ_i(r) = Σ_j^B c_ij ϕ_j(r),    (10)

where the ϕ_j are atomic orbitals (or basis functions more generally), which are usually pre-determined analytical functions, e.g., truncated series of spherical harmonics or plane waves. We denote the number of basis functions by B. The single-particle wave functions are linear combinations of these basis functions with c_ij as the only optimizable parameters. Introducing the vectorized notations Ψ := (ψ_1, ψ_2, ..., ψ_N)^⊤, Φ := (ϕ_1, ϕ_2, ..., ϕ_B)^⊤, and C := [c_ij], the LCAO wave functions can be written as

Ψ = CΦ.    (11)

Classical atomic orbitals include Slater-type orbitals, Pople basis sets (Ditchfield et al., 1971), correlation-consistent basis sets (Dunning Jr, 1989), etc.
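The LCAO ansatz Ψ = CΦ can be sketched numerically. The following 1D NumPy example (the basis exponents, centers, and coefficient matrix are all made up for illustration) builds two Gaussian basis functions on a quadrature grid, forms the overlap matrix S_ij = ⟨ϕ_i|ϕ_j⟩, and combines the basis into molecular orbitals with a coefficient matrix C:

```python
import numpy as np

# Quadrature grid on [-10, 10].
x, w = np.polynomial.legendre.leggauss(400)
x, w = 10 * x, 10 * w

# Two made-up Gaussian atomic orbitals (rows of Phi are basis values on the grid).
alphas, centers = [1.0, 0.8], [-0.7, 0.7]
Phi = np.stack([np.exp(-a * (x - c) ** 2) for a, c in zip(alphas, centers)])

# Overlap matrix S_ij = <phi_i|phi_j>, evaluated by quadrature.
S = (Phi * w) @ Phi.T

# Unconstrained LCAO coefficients; row i gives psi_i = sum_j c_ij phi_j.
C = np.array([[0.6, 0.4],
              [0.5, -0.5]])
Psi = C @ Phi  # molecular orbitals sampled on the grid

print(S.shape, Psi.shape)
```

Note that nothing here enforces ⟨ψ_i|ψ_j⟩ = δ_ij; that constraint is exactly what the reparameterization in Sec. 3.1 handles.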

3. A DEEP LEARNING APPROACH TO KS-DFT

In this section, we propose a deep learning approach to solving KS-DFT. Our method can be described by three keywords:

• deep learning: our method is deep-learning native; it is implemented in the widely-used deep learning framework JAX;
• differentiable: all the functions are differentiable, and thus the energy can be optimized via purely gradient-based methods;
• direct optimization: due to the differentiability, our method does not require self-consistent field iterations, but the convergence result is nonetheless self-consistent.

[Figure 1: An illustration of the computational graph of D4FT. Atomic orbitals are combined, via the QR decomposition of W, into molecular orbitals; an integrator then computes E_kin, E_ext, E_H, and E_xc for a given molecular configuration. In conventional DFT methods using fixed basis sets, the gradients w.r.t. the basis functions (blue dotted line) are unnecessary.]

We refer to our method as D4FT, which represents the above three keywords and DFT. Fig. 1 shows the overarching framework of our method. We elaborate on each component in the following parts.

3.1. REPARAMETERIZE THE ORTHONORMAL CONSTRAINTS

Constraints can be handled with various optimization methods, e.g., the Lagrange multiplier method or the penalty method. In deep learning, it is preferred to reparameterize the constraints into the computation graph. For example, to impose a normalization constraint on a vector x, we can reparameterize it as y/∥y∥, which converts the problem into solving for y without constraints. This constraint-free optimization can then benefit from existing differentiable frameworks and various gradient-descent optimizers.

The wave functions in Kohn-Sham DFT have to satisfy the constraint ⟨ψ_i|ψ_j⟩ = δ_ij given in Eq. 4. Traditionally the constraint is handled by the Lagrange multiplier method, which leads to the SCF method introduced in Sec. 2. To enable direct optimization, we propose the following reparameterization of the constraint. Using LCAO wave functions, the constraint translates to ∫ (Σ_k c_ik ϕ_k(r)) (Σ_l c_jl ϕ_l(r)) dr = δ_ij, or in matrix form,

C S C^⊤ = I.    (12)

S is called the overlap matrix of the basis functions, with ij-th entry S_ij = ⟨ϕ_i|ϕ_j⟩. The literature on whitening transformations (Kessy et al., 2018) offers many ways to construct a C satisfying Eq. 12, based on different matrix factorizations of S:

C = Q Λ^{-1/2} U    (PCA whitening, with U^⊤ Λ U = S),
C = Q L^⊤           (Cholesky whitening, with L L^⊤ = S^{-1}),
C = Q S^{-1/2}      (ZCA whitening).    (13)

Take PCA whitening as an example. Since S can be precomputed from the overlap of the basis functions, Λ^{-1/2} U is fixed. The matrix Q can be any orthonormal matrix; for deep learning, we can parameterize Q in a differentiable way using the QR decomposition of an unconstrained matrix W:

Q, R = QR(W).    (14)

Besides QR decomposition, there are several other differentiable methods to construct the orthonormal matrix Q, e.g., the Householder transformation (Mathiasen et al., 2020) and the exponential map (Lezcano-Casado & Martínez-Rubio, 2019). Finally, denoting the fixed whitening matrix by D (e.g., D = Λ^{-1/2} U for PCA whitening), the wave functions can be written as

Ψ_W = QDΦ.    (15)
In this way, the wave functions Ψ_W are always orthonormal for arbitrary W. The search over the orthonormal function space is transformed into an optimization over the parameter space in which W resides. Moreover, this parameterization covers all possible sets of orthonormal wave functions in the space spanned by the basis functions. These statements are formalized in the following proposition, whose proof is presented in Appendix C.1.

Proposition 3.1. Define the original orthonormal function space F and the transformed search space F_W by
F = {Ψ = (ψ_1, ψ_2, ..., ψ_N)^⊤ : Ψ = CΦ, C ∈ R^{N×N}, ⟨ψ_i|ψ_j⟩ = δ_ij} and
F_W = {Ψ_W = (ψ_1^W, ψ_2^W, ..., ψ_N^W)^⊤ : Ψ_W = QDΦ, (Q, R) = QR(W), W ∈ R^{N×N}}.
Then the two spaces are equivalent: F_W = F.
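The reparameterization can be checked numerically. The following NumPy sketch uses a made-up symmetric positive-definite matrix as a stand-in for the overlap matrix S (a real S would come from basis-function overlaps); it builds the fixed PCA whitening matrix D = Λ^{-1/2} U^⊤ from the eigendecomposition of S, obtains an orthonormal Q from the QR decomposition of an unconstrained W, and verifies that C = QD satisfies C S C^⊤ = I by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 4  # number of basis functions

# Made-up symmetric positive-definite "overlap matrix".
A = rng.normal(size=(B, B))
S = A @ A.T + B * np.eye(B)

# PCA whitening: S = U diag(lam) U^T, so D = diag(lam)^{-1/2} U^T is fixed.
lam, U = np.linalg.eigh(S)
D = np.diag(lam ** -0.5) @ U.T

# Unconstrained parameters W -> orthonormal Q via (differentiable) QR.
W = rng.normal(size=(B, B))
Q, _ = np.linalg.qr(W)

# The constraint C S C^T = I holds for any W.
C = Q @ D
print(np.allclose(C @ S @ C.T, np.eye(B)))  # True
```

Because the check holds for arbitrary W, gradient steps on W can never leave the constraint set, which is what makes the unconstrained optimization valid.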

3.2. STOCHASTIC GRADIENT

SGD is the modern workhorse for large-scale machine learning optimization. It has been harnessed to achieve an unprecedented scale of training, which would have been impossible with full-batch training. We elaborate in this section on how DFT can also benefit from SGD.

Numerical Quadrature  The total energies defined in Equations 5-8 are integrals that involve the wave functions. Although analytical solutions to the integrals of commonly-used basis sets do exist, most DFT implementations adopt numerical quadrature integration, which approximates the value of a definite integral using a set of grid points and weights g = {(x_i, w_i)}_{i=1}^n, where x_i and w_i are the coordinates and the corresponding weights, respectively:

∫_a^b f(x) dx ≈ Σ_{(x_i, w_i) ∈ g} f(x_i) w_i.    (16)

These grids and weights can be obtained by solving polynomial equations (Golub & Welsch, 1969; Abramowitz & Stegun, 1964). One key issue that hinders their application to large-scale systems is that the Hartree energy requires at least O(n²) computation, as it needs the distance between every pair of grid points. Some large quantum systems need 100k ∼ 10m grid points, which causes out-of-memory errors and hence is not feasible on most devices.

Stochastic Gradient on Quadrature  Instead of evaluating the gradient of the total energy at all grid points in g, we randomly sample a minibatch g′ ⊂ g containing m < n grid points, and evaluate the objective and its gradient on this minibatch. For example, for single-integral energies such as the kinetic energy, the gradient can be estimated by

∂E_kin/∂W = -(1/2)(n/m) Σ_{(x_i, w_i) ∈ g′} w_i ∂[Ψ_W*(x_i) (∇²Ψ_W)(x_i)]/∂W.    (17)

The gradients of the external and exchange-correlation energies are defined accordingly. The gradient of the Hartree energy, which is a double integral, can be defined as

∂E_H/∂W = [2n(n-1)/(m(m-1))] Σ_{(x_i, w_i) ∈ g′} Σ_{(x_j, w_j) ∈ g′} w_i w_j ( ∥Ψ_W(x_j)∥² / ∥x_i - x_j∥ ) Σ_k ∂ψ_k(x_i)/∂W.    (18)
It can be proved that the expectation of the above stochastic gradient equals the full gradient. Note that the summation over the quadrature grid resembles the summation over the data points of a dataset in deep learning. Therefore, in our implementation, we can rely directly on AD with minibatching to generate unbiased gradient estimates.
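The unbiasedness of the minibatch estimator can be illustrated for a single integral. In this NumPy sketch (with a made-up integrand, not one of the DFT energy terms), the rescaled minibatch sum (n/m) Σ_{i ∈ g′} w_i f(x_i) is averaged over many random minibatches and compared against the full quadrature sum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Full quadrature grid g with n = 512 points, and an assumed integrand f.
x, w = np.polynomial.legendre.leggauss(512)
f = np.exp(-x ** 2)
full = np.sum(w * f)  # full-batch quadrature value

# Minibatch estimates: sample m of n points, rescale by n/m.
n, m = len(x), 64
estimates = []
for _ in range(2000):
    idx = rng.choice(n, size=m, replace=False)
    estimates.append(n / m * np.sum(w[idx] * f[idx]))

# The estimator is unbiased: its mean approaches the full sum.
print(abs(np.mean(estimates) - full) < 0.02)  # True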

3.3. THEORETICAL PROPERTIES

Asymptotic Complexity  Here we analyze the computational complexity of D4FT in comparison with SCF. We use N to denote the number of electrons, B the number of basis functions, n the number of grid points for quadrature integration, and m the minibatch size when the stochastic gradient is used. The major source of complexity is the Hartree energy E_H, as it involves a double integral. In our direct optimization approach, computing the Hartree energy involves evaluating Σ_i |ψ_i(r)|² at all grid points, which takes O(nNB). After that, O(n²) computation is needed to compute and aggregate the repulsion energies between all pairs of grid points. The total complexity is therefore O(nNB) + O(n²). In the SCF approach, the cost is O(nNB) + O(n²N²), considerably more expensive because it computes the Fock matrix instead of a scalar energy. Since both n and B are approximately linear in N, the direct optimization approach has a more favorable O(N³) compared to O(N⁴) for SCF. However, since n is often much larger than N and B in practice, minibatch SGD is an indispensable ingredient for computational efficiency: at each iteration, minibatching reduces the factor n to a small constant m, giving O(mNB) + O(m²) for D4FT. A full breakdown of the complexity is available in Appendix D.

Self-Consistency

The direct optimization method converges at a self-consistent point, which we demonstrate in the following. We first define self-consistency.

Definition 3.1 (Self-consistency). The wave function Ψ is said to be self-consistent if it is an eigenfunction of its corresponding Hamiltonian Ĥ(Ψ), i.e., there exists a real number ε ∈ R such that Ĥ(Ψ)|Ψ⟩ = ε|Ψ⟩.

The next proposition states the equivalence between SCF and direct optimization: the convergence point of D4FT is self-consistent. The proof is given in Appendix C.2.

Proposition 3.2 (Equivalence between SCF and D4FT for KS-DFT). Let Ψ† be a local optimum of the ground-state energy E_gs defined in Eq. 3, such that ∂E_gs/∂Ψ† = 0. Then Ψ† is self-consistent.

4. NEURAL BASIS WITH LOCAL SCALING TRANSFORMATION

In previous sections, the basis functions were considered given and fixed. We now demonstrate and discuss the possibility of a learnable basis set that can be jointly optimized in D4FT. We hope that this extension can serve as a stepping stone toward neural-based wave function approximators. As discussed in Sec. 2, LCAO wave functions are used in existing DFT calculations. This restricts the wave functions to the subspace spanned by the basis functions, so a large number of basis functions would be required to make a strong approximator. From the deep learning viewpoint, it is more efficient to increase the depth of the computation. The obstacle to using deep basis functions is that the overlap matrix S changes with the basis, requiring an expensive recomputation of the whitening matrix D at each iteration. To remedy this problem, we introduce basis functions φ_i of the form

φ_i(r) := |det J_f(r)|^{1/2} ϕ_i(f(r)),

where f: R³ → R³ is a parametric bijective function and det J_f(r) denotes its Jacobian determinant. It is verifiable by change of variables that

⟨φ_i|φ_j⟩ = ∫ |det J_f(r)| ϕ_i(f(r)) ϕ_j(f(r)) dr = ∫ ϕ_i(u) ϕ_j(u) du = ⟨ϕ_i|ϕ_j⟩.

Therefore, the overlap matrix S remains fixed even as f varies. Within our framework, we can use a parameterized function f_θ and optimize both θ and W jointly with gradient descent. As a proof of concept, we design f_θ as follows, which we term neural local scaling:

f_θ(r) := λ_θ(r) r,   λ_θ(r) := α η(g_θ(r)),   (20)

where η is the sigmoid function, α is a scaling factor controlling the range of λ_θ, and g_θ: R³ → R can be an arbitrarily complex neural network parameterized by θ. This method has two benefits. First, the additional parameters make the orbitals more expressive, potentially achieving a better ground-state energy. Second, conventional Gaussian basis functions struggle to capture the cusps at the locations of the nuclei.
By introducing the local scaling transformation, we obtain a scaling function that can control the sharpness of the wave functions near the nuclei; as a consequence, this approximator can model the ground-state wave function better. We perform experiments using neural locally-scaled orbitals on atoms. Details and results are presented in Section 6.3.
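A minimal JAX sketch of the neural local scaling basis: an illustrative two-layer MLP stands in for g_θ, and jax.jacfwd supplies the Jacobian of f_θ. The shapes and the value of α are assumptions for illustration, not the paper's exact architecture:

```python
import jax
import jax.numpy as jnp

# hypothetical two-layer MLP for g_theta: R^3 -> R
def g_theta(params, r):
    W1, b1, W2, b2 = params
    return jnp.tanh(r @ W1 + b1) @ W2 + b2

alpha = 2.0                                     # lambda_theta ranges over (0, alpha)

# f_theta(r) = lambda_theta(r) * r with lambda_theta(r) = alpha * sigmoid(g_theta(r))
def f_theta(params, r):
    lam = alpha * jax.nn.sigmoid(g_theta(params, r))
    return lam * r

# phi_i(r) := |det J_f(r)|^(1/2) * phi_i(f(r)), Jacobian via forward-mode autodiff
def transformed_basis(phi, params, r):
    J = jax.jacfwd(lambda x: f_theta(params, x))(r)   # 3x3 Jacobian of f_theta
    return jnp.sqrt(jnp.abs(jnp.linalg.det(J))) * phi(f_theta(params, r))
```

As a sanity check, with all-zero parameters λ_θ ≡ α·η(0) = 1, so f_θ is the identity and the transformed basis function reduces to the original one.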

5. RELATED WORK

Direct Optimization in DFT Although the SCF method dominates optimization in DFT, there is research exploring direct optimization methods (Gillan, 1989; Van Voorhis & Head-Gordon, 2002; Ismail-Beigi & Arias, 2000; VandeVondele & Hutter, 2003; Weber et al., 2008; Ivanov et al., 2021). The challenging part is how to preserve the orthonormality constraint on the wave functions. A straightforward way to achieve this is explicit orthonormalization of the orbitals after each update (Gillan, 1989). In recent years, some works have investigated direct optimization of the total energy with the orthonormality constraint incorporated into the formulation of the wave functions. A representative method expresses the coefficients of the basis functions as an exponential transformation of a skew-Hermitian matrix (Ismail-Beigi & Arias, 2000; Ivanov et al., 2021).
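A small sketch of this parameterization in the real-valued (skew-symmetric) case, with scipy.linalg.expm standing in for whatever matrix exponential an implementation would use: the exponential of a skew-symmetric matrix is orthogonal, so orthonormality holds by construction for any unconstrained parameter matrix A.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))      # unconstrained parameters
S = A - A.T                      # skew-symmetric: S^T = -S
Q = expm(S)                      # coefficient matrix; Q^T Q = exp(-S) exp(S) = I
```

Gradient descent can then act on A freely, without any projection or re-orthonormalization step.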

6. EXPERIMENTS

In this section, we demonstrate the accuracy and scalability of D4FT via numerical experiments on molecules. We compare our method with two baselines:
• PySCF (Sun et al., 2018): one of the most widely-used open-source quantum chemistry computation frameworks, implemented in Python/C.
• JAX-SCF: our implementation of the classical SCF method with a Fock matrix momentum mechanism.
We implement our D4FT method with the deep learning framework JAX (Bradbury et al., 2018). For a fair comparison under the same software/hardware environment, we reimplemented the SCF method in JAX. All experiments with JAX implementations (D4FT, JAX-SCF) are conducted on an NVIDIA A100 GPU with 40GB memory. As a reference, we also test PySCF on a 64-core Intel Xeon CPU @ 2.10GHz with 128GB memory.

6.1. ACCURACY

We evaluate the accuracy of D4FT on two tasks. First, we compare the ground-state energy obtained by D4FT with that of PySCF on a series of molecules. Second, we predict the paramagnetism of molecules to test whether our method handles spin correctly. For both tasks, we use the 6-31g basis set with the LDA exchange-correlation functional.

Ground State Energy

The predicted ground-state energies are presented in Table 1. The absolute error between D4FT and PySCF is smaller than 0.02 Ha (≈0.54 eV) over all the molecules compared. This validates the equivalence of our SGD optimization and the SCF loop.

Paramagnetism The results of the magnetism prediction are shown in Table 2. We observe that an oxygen molecule with 2 unpaired electrons achieves lower energy, whereas carbon dioxide is the opposite, in both methods. This coincides with the fact that the oxygen molecule is paramagnetic while carbon dioxide is diamagnetic.

6.2. EFFICIENCY

Convergence speed comparisons are shown in Fig. 2. When the system size is small, the acceleration brought by SGD is limited: SCF has a clear advantage, as the closed-form eigendecomposition brings the solution very close to optimal in just one step. The two methods become on par when the system reaches a size of 480 orbitals (C80). After that, SGD is significantly more efficient than SCF, because exact numerical integration is prohibitively expensive for large molecules. Column 4 in Fig. 2 shows a clear advantage of D4FT over JAX-SCF, which verifies the smaller complexity of D4FT.

Influence of Batch Size As repeatedly demonstrated in large-scale machine learning, SGD with minibatches has a more favorable convergence speed than full-batch GD. Fig. 3 provides evidence that this is also true for D4FT. A smaller batch size generally leads to faster convergence in terms of epochs over the grids. However, when the batch size becomes too small to fully utilize the GPU's computational power, it becomes less favorable in wall-clock time.

6.3. NEURAL BASIS

In this part, we test the neural basis with local scaling transformation presented in Sec. 4 on atoms. In this experiment, we use the simple STO-3g basis set. A 3-layer MLP with tanh activation is adopted for g_θ; the hidden dimension at each layer is 9. Partial experimental results are shown in Table 3; complete results are in Appendix E.3. The results demonstrate that the local scaling transformation effectively increases the flexibility of the STO-3g basis set and achieves lower ground-state energies.

STO-3g: -2.65731, -7.04847, -13.97781, -36.47026, -52.82486, -72.78665
STO-3g + local scaling: -2.65978, -7.13649, -13.99859, -36.52543, -52.86340, -72.95256

7. DISCUSSION AND CONCLUSION

We demonstrate in this paper that KS-DFT can be solved in a more deep-learning-native manner. Our method brings many benefits, including a unified optimization algorithm, a stronger convergence guarantee, and better scalability. It also enables us to design neural-based wave functions with better approximation capability. However, at this point, several issues still await solutions in order to move further along this path. One such issue is the inherent stochasticity of SGD: the convergence of our algorithm is affected by many factors, such as the choice of batch size and optimizer. Another issue concerns integrals involving fully neural-based wave functions. Existing numerical integration methods rely on truncated series expansions, e.g., quadrature and the Fourier transform. They can incur larger truncation errors when more potent function approximators such as neural networks are used; neural networks might not be helpful if we still rely on linear series expansions for integration. Monte Carlo methods may be what we should resort to for this problem, potentially linking to Quantum Monte Carlo methods. On the bright side, these limitations are at the same time opportunities for machine learning researchers to contribute to this extremely important problem, whose breakthroughs could benefit tremendous applications in materials science, drug discovery, etc.

A NOTATIONS

A.1 BRA-KET NOTATION The bra-ket notation, or Dirac notation, is the conventional notation in quantum mechanics for denoting quantum states. It can be viewed as a way to denote linear operations in a Hilbert space V. A ket |ψ⟩ represents a quantum state in V, whereas a bra ⟨ψ| represents a linear mapping V → C. The inner product of two quantum states, represented by wave functions ψ_i and ψ_j, is written in bra-ket notation as

⟨ψ_i|ψ_j⟩ := ∫_{R³} ψ*_i(r) ψ_j(r) dr.

Let Ô be an operator; the expectation of Ô is defined as

⟨ψ|Ô|ψ⟩ := ∫_{R³} ψ*(r) (Ôψ)(r) dr.
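A quick numerical sketch of the inner product evaluated on a quadrature grid, in 1D for readability (the Hermite-type functions below are illustrative examples, not orbitals from the paper):

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
w = np.full_like(x, x[1] - x[0])    # uniform quadrature weights

psi1 = np.pi**-0.25 * np.exp(-x**2 / 2)                    # normalized Gaussian
psi2 = np.sqrt(2) * np.pi**-0.25 * x * np.exp(-x**2 / 2)   # first excited state

# <f|g> approximated as a weighted sum over the grid
def braket(f, g):
    return np.sum(w * f * g)
```

Here braket(psi1, psi1) ≈ 1 and braket(psi1, psi2) ≈ 0, matching the orthonormality ⟨ψ_i|ψ_j⟩ = δ_ij of these two functions.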

A.2 NOTATION TABLE

We list the frequently-used notations of this paper in Table 4.

In the time-invariant Schrödinger equation Ĥ|Ψ⟩ = ε|Ψ⟩, Ĥ is the system Hamiltonian and Ψ(r) denotes the many-body wave function mapping R^{N×3} → C; r is the vector of coordinates of the N 3D particles. The Hamiltonian is

Ĥ = −(1/2)∇² (kinetic) − Σ_i Σ_j 1/|r_i − R_j| (external) + Σ_{i<j} 1/|r_i − r_j| (electron repulsion),

where r_i is the coordinate of the i-th electron and R_j is the coordinate of the j-th nucleus. The following objective function is optimized to obtain the ground-state energy and the corresponding ground-state wave function:

min_Ψ ⟨Ψ|Ĥ|Ψ⟩,   s.t.   Ψ(r) = σ(P)Ψ(Pr) for all P.

Antisymmetric Constraint In the constraint above, P is any permutation matrix and σ(P) denotes the parity of the permutation. This constraint comes from the Pauli exclusion principle, which enforces the antisymmetry of the wave function Ψ. When solving the above optimization problem, we need a variational ansatz, in machine learning terms a function approximator, for Ψ. To satisfy the antisymmetry constraint, we can start with any function approximator Ψ̃ and apply the antisymmetrizer, Ψ = A Ψ̃, defined as

A Ψ̃(r) = (1/√N!) Σ_P σ(P) Ψ̃(Pr).

However, the sum over P has N! terms, which is prohibitively expensive. Therefore, the Slater determinant is resorted to as a much cheaper approximation.
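The N! cost is visible directly in a brute-force implementation of the antisymmetrizer; the toy product wave function below (single-particle "orbitals" ψ_i(x) = x^i in 1D) is hypothetical:

```python
import itertools
import math
import numpy as np

# sign of a permutation via counting inversions
def parity(perm):
    inv = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
              if perm[i] > perm[j])
    return -1 if inv % 2 else 1

# A Psi(r) = (1/sqrt(N!)) sum_P sigma(P) Psi(P r): N! terms
def antisymmetrize(psi, r):
    N = len(r)
    total = 0.0
    for perm in itertools.permutations(range(N)):
        total += parity(perm) * psi(r[list(perm)])
    return total / math.sqrt(math.factorial(N))

# hypothetical product wave function: psi(r) = prod_i r_i**i
def psi_product(r):
    return np.prod([r[i]**i for i in range(len(r))])

r = np.array([0.3, 1.1, 2.0])
a = antisymmetrize(psi_product, r)
b = antisymmetrize(psi_product, np.array([1.1, 0.3, 2.0]))  # two particles swapped
```

Swapping any two particle coordinates flips the sign of the antisymmetrized function (b = −a), as the Pauli principle requires.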

Slater Determinant & Hartree Fock Approximation

To efficiently compute the above antisymmetrizer, the mean-field assumption is usually made, i.e., the many-body wave function is a product of single-particle wave functions. The antisymmetrized product is the Slater determinant:

Ψ_slater = A Π_i ψ_i = (1/√N!) det [ψ_i(r_j)]_{i,j=1}^N,

the determinant of the N × N matrix whose (i, j) entry is ψ_i(r_j). This approximation has a much more favorable computational complexity: computing the determinant takes only O(N³). However, it is more restricted in its approximation capability. Notice that the mean-field assumption discards the correlation between different wave functions; therefore, with the Slater determinant approximation, the correlation between electrons is omitted. Plugging this into Eq. 26 gives the Hartree-Fock approximation.

Orthogonality Constraints While the Slater determinant is much cheaper to compute, integration in the R^{N×3} space is still complex. Introducing an orthogonality constraint between the single-particle wave functions simplifies things greatly:

⟨ψ_i|ψ_j⟩ = δ_ij.

Plugging this constraint and the Slater determinant into Eq. 26, elementary calculus yields the following useful conclusions. From this point on, without ambiguity, the symbol r also denotes a vector in 3D space when it appears in a single-particle wave function ψ_i(r). First, the total electron density is the sum of the densities contributed by each wave function; with this conclusion, the external part of the Hamiltonian of the many-body Schrödinger equation connects with the external energy of DFT in Eq. 3:

ρ(r) = Σ_{i=1}^N |ψ_i(r)|².

Second, the kinetic energy of the wave function in the joint N × 3 space breaks down into the sum of the kinetic energies of the single-particle wave functions in 3D space, which again equals the kinetic term in Eq. 3:

⟨Ψ_slater| −(1/2)∇² |Ψ_slater⟩ = Σ_i ⟨ψ_i| −(1/2)∇² |ψ_i⟩.
Third, the electron repulsion term splits into two terms, one corresponding to the Hartree energy in Eq. 3 and the other to the exact exchange energy:

⟨Ψ_slater| Σ_{i<j} 1/|r_i − r_j| |Ψ_slater⟩
= (1/2) Σ_{i≠j} ∫∫ ψ*_i(r₁) ψ*_j(r₂) ψ_i(r₁) ψ_j(r₂) (1/|r₁ − r₂|) dr₁ dr₂   (Hartree energy E_H)
− (1/2) Σ_{i≠j} ∫∫ ψ*_i(r₁) ψ*_j(r₁) ψ_i(r₂) ψ_j(r₂) (1/|r₁ − r₂|) dr₁ dr₂   (exchange energy E_X).

The above three conclusions link the Schrödinger equation with the three terms in the KS-DFT objective. To finish the link, note that introducing the Slater determinant amounts to the mean-field assumption, which causes an approximation error due to the electron correlation missing from the Hartree-Fock formulation we are deriving. In DFT, this correlation error, together with the exchange energy E_X, is approximated with a functional E_XC. The art of DFT is thus to find the best E_XC that minimizes the approximation error.
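The determinant form can be sketched directly. The 1D orbitals below are hypothetical placeholders; the point is that the determinant costs O(N³) and that swapping two electron coordinates (two columns of the matrix) flips the sign:

```python
import math
import numpy as np

# Slater determinant of single-particle orbitals: M[i, j] = psi_i(r_j)
def slater(orbitals, r):
    M = np.array([[psi(rj) for rj in r] for psi in orbitals])
    return np.linalg.det(M) / math.sqrt(math.factorial(len(r)))

# hypothetical 1D orbitals for illustration
orbitals = [lambda x: np.exp(-x**2),
            lambda x: x * np.exp(-x**2)]

r = np.array([0.5, -1.2])
val = slater(orbitals, r)
val_swapped = slater(orbitals, r[::-1])   # exchange the two electrons
```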

B.2 DERIVATION OF THE KOHN-SHAM EQUATION

The energy minimization in the functional space formally reads

min_{ψ_i} { E_gs(ψ₁, …, ψ_N) | ψ_i ∈ H¹(R³), ⟨ψ_i, ψ_j⟩ = δ_ij, 1 ⩽ i, j ⩽ N },

where we denote the electron density as

ρ(r) = Σ_{i=1}^N |ψ_i(r)|²,

and the objective function to be minimized is

E = −(1/2) Σ_{i=1}^N ∫ ψ*_i(r) ∇² ψ_i(r) dr + Σ_{i=1}^N ∫ v_ext(r) |ψ_i(r)|² dr + (1/2) ∫∫ [Σ_i |ψ_i(r)|²][Σ_j |ψ_j(r′)|²] / |r − r′| dr dr′ + Σ_{i=1}^N ∫ v_xc(r) |ψ_i(r)|² dr.

This objective function can be solved by the Euler-Lagrange method. The Lagrangian is expressed as

L(ψ₁, …, ψ_N) = E(ψ₁, …, ψ_N) − Σ_{i,j=1}^N ϵ_ij ( ∫_{R³} ψ*_i ψ_j − δ_ij ).

The first-order conditions of the above optimization problem are

δL/δψ_i = 0, i = 1, …, N;   ∂L/∂ϵ_ij = 0, i, j = 1, …, N.

The first variation can be written as

δL/δψ_i = ( δE_Kin/δρ + δE_Ext/δρ + δE_Hartree/δρ + δE_XC/δρ ) δρ/δψ_i − Σ_{j=1}^N ϵ_ij ψ_i.

Next, we derive each component of the above equation. We follow the widely adopted tradition of treating ψ*_i and ψ_i as different functions (Parr, 1980), and have

δρ/δψ*_i = δ(ψ*_i ψ_i)/δψ*_i = ψ_i.

For E_Hartree, according to the definition of functional derivatives:

E_Hartree[ρ + δρ] − E_Hartree[ρ]
= (1/2) ∫∫ [(ρ + δρ)(r)(ρ + δρ)(r′) − ρ(r)ρ(r′)] / |r − r′| dr dr′
= (1/2) ∫∫ [(δρ)(r)ρ(r′) + (δρ)(r′)ρ(r)] / |r − r′| dr dr′
= ∫∫ (δρ)(r)ρ(r′) / |r − r′| dr dr′
= ∫ (δρ)(r) [ ∫ ρ(r′)/|r − r′| dr′ ] dr,

where we have used the fact that

∫∫ f₁(r) f₂(r′) / |r − r′| dr dr′ = ∫∫ f₁(r′) f₂(r) / |r − r′| dr dr′

due to the symmetry between r and r′. We emphasize that the functional derivatives are computed for functionals mapping functions to scalars; consequently, we must account for the double integral over both r and r′. Therefore, the functional derivative of E_Hartree is

δE_Hartree[ρ]/δρ = ∫ ρ(r′)/|r − r′| dr′.
Using the chain rule, we have

δE_Hartree[ρ]/δψ*_i = [ ∫ ρ(r′)/|r − r′| dr′ ] ∂ρ/∂ψ*_i = [ ∫ ρ(r′)/|r − r′| dr′ ] ψ_i.

For E_Ext and E_XC,

δE_Ext/δψ*_i = (δE_Ext/δρ)(∂ρ/∂ψ*_i) = v_ext ψ_i,   δE_XC/δψ*_i = (δE_XC/δρ)(∂ρ/∂ψ*_i) = v_xc ψ_i.

For the kinetic energy,

δE_Kin/δψ*_i = δ( −(1/2) ψ*_i ∇² ψ_i )/δψ*_i = −(1/2) ∇² ψ_i.

Setting the first-order derivatives to zero, the stationarity condition becomes

Ĥ(ψ₁, …, ψ_N) ψ_i = λ_i ψ_i,   where λ_i = Σ_{j=1}^N ϵ_ij,

and the Hamiltonian is explicitly written as

Ĥ = −(1/2)∇² + v_ext(r) + ∫_{R³} ρ(r′)/|r − r′| dr′ + v_xc(r),   ρ(r) = Σ_{i=1}^N |ψ_i(r)|²,

which is the desired form of the Kohn-Sham equation.

B.3 THE SCF ALGORITHM

The connection between SCF and direct optimization can be summarized as follows. D4FT minimizes the energy directly, with the constraints encoded into the corresponding parameter constraints, while SCF deals with the constrained optimization via the Lagrangian. They solve the same minimization problem using different approaches, which is reflected in how they handle the constraints: D4FT performs projected gradient descent on the constraint set Φ, while SCF works with the corresponding Lagrangian, which enforces the orthogonality constraint; the additional constraint on the basis functions is reflected in the inner products of SCF. The SCF algorithm is presented in Algo. 1. Its main loop solves the eigenvalue equations defined in Eq. 48 to obtain new {ψ_i}, computes the electron density ρ_new via Eq. 50, and terminates when |ρ_new − ρ| < ε. The output of the (t+1)-th iteration is derived from the Hamiltonian built with the previous output, i.e., Ĥ(Ψ^t) ψ_i^{t+1} = λ_i^{t+1} ψ_i^{t+1}. The convergence criterion is Ψ^t = Ψ^{t+1}, which is the self-consistency formally defined in Definition 3.1. Since GD and SCF solve essentially the same problem, we can derive the equivalence between their solutions: GD performs projected gradient descent on the constraint set {Ψ | ⟨Ψ|Ψ⟩ = I}, while SCF works with the corresponding Lagrangian, which enforces the orthogonality constraint.
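The fixed-point structure of this loop can be illustrated on a toy nonlinear eigenproblem H(ρ)ψ = εψ with ρ = |ψ|². The matrix H0, the coupling g, the single occupied orbital, and the threshold are all illustrative choices, not a real Kohn-Sham Hamiltonian:

```python
import numpy as np

H0 = np.array([[0.0, 0.2, 0.0],
               [0.2, 1.0, 0.2],
               [0.0, 0.2, 2.0]])
g = 0.3                                  # strength of the density dependence

# density-dependent effective "Hamiltonian"
def hamiltonian(rho):
    return H0 + g * np.diag(rho)

rho = np.full(3, 1.0 / 3.0)              # initial density guess
for _ in range(200):
    eps, U = np.linalg.eigh(hamiltonian(rho))
    rho_new = U[:, 0]**2                 # occupy the lowest orbital
    if np.max(np.abs(rho_new - rho)) < 1e-12:
        break                            # self-consistency reached
    rho = rho_new
```

At the fixed point, diagonalizing H(ρ) reproduces the same density ρ, which is exactly the self-consistency of Definition 3.1 for this toy system.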

6. In total: O(nNB) + O(n²N²) = O(n²N²).

SGD's computational complexity analysis for the Hartree energy ∫∫ ρ(r₁) (1/|r₁ − r₂|) ρ(r₂) dr₁ dr₂:
1. Use steps 1, 2, and 4 from the SCF breakdown. Step 3 is not needed because only the scalar energy is required: O(nNB).
2. Contract the R^n density, R^n weights, and R^{n×n} kernel into a scalar: O(n²).
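Step 2 is a single tensor contraction. A sketch with placeholder grid data (random coordinates and density standing in for real quadrature inputs), using einsum for the O(n²) contraction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = rng.normal(size=(n, 3))                 # grid point coordinates
w = np.full(n, 1.0 / n)                     # quadrature weights
rho = rng.random(n)                         # density values on the grid

dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)              # crude guard against self-interaction
K = 1.0 / dist                              # Coulomb kernel, R^{n x n}

# contract R^n, R^n, R^{n x n} into the scalar Hartree energy: O(n^2)
E_H = 0.5 * np.einsum('i,i,ij,j,j->', w, rho, K, rho, w)
```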

D.3 LITERATURE ON THE CONVERGENCE OF SCF METHODS.

There is a rich literature on the convergence of SCF. Cancès et al. (2021) provide a literature review and method summary for the iterative SCF algorithm; specifically, they show that the problem can be tackled either via Lagrangian + SCF or via direct gradient descent on a matrix manifold. Yang et al. (2007) show that the Lagrangian + SCF algorithm indirectly optimizes the original energy by minimizing a sequence of quadratic surrogate functions: at the t-th step, SCF minimizes the surrogate energy with the t-th Hamiltonian, min_A (1/2) Tr(A* H^{(t)} A), while the correct nonlinear energy is min_A (1/2) Tr(A* H(A) A). Our GD thus directly minimizes the nonlinear energy, while SCF deals with a quadratic surrogate. Using a concrete two-dimensional example, they illustrate how SCF may fail to converge, since the surrogate is only a second-order approximation of the original energy. Yang et al. (2009) show that SCF can oscillate between two limit points, neither of which solves a particular class of nonlinear eigenvector problems. Nevertheless, they also identify the condition under which the SCF iteration becomes contractive, guaranteeing convergence: the gap between the occupied and unoccupied states is sufficiently large, and the second-order derivatives of the exchange-correlation functional are uniformly upper bounded. Bai et al. (2020) provide an optimal convergence rate for a class of nonlinear eigenvector problems.

E MORE EXPERIMENTAL RESULTS

Partial ground-state energies (Ha):
-195.58268 -237.25091 -283.59821 -334.79401 -390.90888
D4FT: -195.47165 -237.03756 -283.41116 -335.10473 -391.09967
NBLST: -196.00629 -235.83420 -283.72964 -335.38168 -391.34567

E.3 MORE RESULTS ON THE EFFICIENCY COMPARISON

We test the efficiency and scalability of our method against several mature quantum chemistry packages, computing the ground-state energy with each implementation and recording the wall-clock running time. The packages we compare with are:
• PySCF: implemented mainly in C (Qint and libxc), with a Python interface.
• GPAW: a plane-wave method for solids and molecules.
• Psi4: mainly written in C++.
We use the STO-3g basis set and the LDA exchange-correlation functional. For GPAW, we apply the LCAO mode with the dzp basis set; the number of bands calculated is double the number of electrons, and the grid for evaluating the XC and Coulomb potentials is 400×400×400 by default. The maximum number of iterations is 100 for all methods. Our D4FT ran on an NVIDIA A100 40GB GPU, whereas the other methods ran on a 64-core Intel Xeon CPU with 60GB memory. The wall-clock running times for convergence (with default settings) are shown in the following table. We also show two cases that do not converge with PySCF: we test two unstable systems, C120 and C140, which do not exist in the real world.



We omit the spin notation σ in the following sections for simplicity. We assume that backward-mode gradient computation has the same complexity as the forward computation. Larger molecules get a larger learning-rate decay.



Figure 2: Convergence speed on different carbon Fullerene molecules. Columns 1-3: the convergence curves on different molecules. X-axis: wall clock time (s), Y-axis: total energy (1k Ha). Right-top: the scale of convergence time. Right-bottom: the scale of per-epoch time. It can be seen that as the number of orbitals increases, D4FT scales much better than JAX-SCF.

Figure 3: Convergence speed at different batch sizes on Fullerene C20. Left: comparison in wall-clock time. Right: comparison in the number of epochs. Smaller batch sizes are less favorable in wall-clock time because they cannot fully utilize the GPU computation.

In total: O(nNB) + O(n²) = O(nN²) + O(n²).

D.2 COMPLEXITY ANALYSIS FOR MINIBATCH

The per-iteration complexity of SCF under minibatch size m is O(mNB) + O(m²N²). Since the number of iterations per epoch is n/m, the total complexity for one epoch is O(nNB) + O(nmN²) = O(n(1 + m)N²), using B = N, i.e., the number of orbitals equals the number of basis functions. Similarly, the per-epoch complexity of GD is (O(mNB) + O(m²)) · n/m = O(n(N² + m)).
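Plugging illustrative sizes into these per-epoch counts shows the gap (the constants below are arbitrary choices, not measurements):

```python
# per-epoch operation counts under the stated complexity model
n, N, m = 100_000, 100, 1_000
B = N                                        # number of basis functions ~ N

scf_epoch = n * N * B + n * m * N**2          # O(nNB) + O(nmN^2)
gd_epoch = (m * N * B + m**2) * (n // m)      # (O(mNB) + O(m^2)) * n/m
```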

Both of them are truncated from a C180 Fullerene molecule. Results are shown in Fig. 4.

Figure 4: Cases that do not converge with PySCF, but do with D4FT. Due to the good convergence property of SGD, D4FT is more likely to converge.

A representative work is that of Gilmer et al. (2017), which proposes a message-passing neural network and predicts DFT results on the QM9 dataset within chemical accuracy. Custódio et al. (2019) and Ellis et al. (2021) follow similar perspectives and predict the DFT ground-state energy using feed-forward neural networks. Another active research line is learning the functionals, i.e., the kinetic and exchange-correlation functionals. As it is difficult to evaluate the kinetic energy given the electron density, some researchers turn to deep learning to fit a neural kinetic functional (Alghadeer et al., 2021; Ghasemi & Kühne, 2021; Ellis et al., 2021), which makes the energy functional orbital-free (depending only on the electron density). Since achieving chemical accuracy is one of the biggest obstacles in DFT, many works in the past few years (Kirkpatrick et al., 2021; Kalita et al., 2021; Ryabov et al., 2020; Kasim & Vinko, 2021; Dick & Fernandez-Serra, 2021b; Nagai et al., 2020) have researched approximating the exchange-correlation functional with neural networks to make DFT more accurate. It is worth noting that the above deep learning approaches are mostly data-driven, and their generalization capability remains to be tested. Several works (e.g., Laestadius et al., 2019) have discussed how to make the training of these neural-based DFT methods, particularly the SCF loop, differentiable, so that their optimization can be easily handled by auto-differentiation frameworks. Li et al. (2021) use backward-mode differentiation to backpropagate through the SCF loop. In contrast, Tamayo-Mendoza et al. (2018) employ forward-mode differentiation, which has less burden on system memory compared to backward mode. Kasim & Vinko (2021) use the implicit gradient method, which has the smallest memory footprint but requires running the SCF loop to convergence.

Table 1: Partial results of ground state energy calculation (LDA, Ha).

Table 2: Prediction on magnetism (LSDA, Ha).

Table 3: Partial results of the neural basis method (LSDA, Ha).

Table 4: Notation used in this paper (E_Kin, …: energy functionals).

B KOHN-SHAM EQUATION AND SCF METHOD

B.1 KS-DFT OBJECTIVE FUNCTION

The Schrödinger Equation The energy functional of KS-DFT can be derived from the quantum many-body Schrödinger equation. The time-invariant Schrödinger equation reads Ĥ|Ψ⟩ = ε|Ψ⟩.

SCF's computational complexity analysis for the two-electron integrals ∫∫ ψ_i(r₁)ψ_j(r₁) (1/|r₁ − r₂|) ψ_i(r₂)ψ_j(r₂) dr₁ dr₂:
1. Compute ψ_i(r) for i ∈ [1..N] at all n grid points. It takes O(NB) per point, hence O(nNB) in total, producing an R^{n×N} matrix.
2. Compute ρ(r): an extra O(nN) step to sum the squares of the wave functions on top of step 1, producing an R^n vector.
3. Compute ψ_i(r)ψ_j(r): an outer product on top of step 1, O(nN²), producing an R^{n×N×N} tensor.
4. Compute 1/|r₁ − r₂|: O(n²).
5. Tensor contraction of the outcomes of steps 2, 3, and 4: O(n²N²).

Comparison of Ground State Energy (LDA, Ha)

Comparison of ground state energy on atoms (LSDA, STO-3g, Hartree).

Comparison of wall-clock running time (s): .35, >2144.49, 2555.16, 14321.64. The notation > represents that the experiment cannot finish or converge given the default convergence condition. It can be seen from the results that our D4FT is the fastest among all the packages.

E.4 CASES WHERE PYSCF FAILS

ACKNOWLEDGEMENT

We would like to extend our sincerest appreciation to all those who made contributions to this research project. We express our gratitude to Pengru Huang of the NUS Institute for Functional Intelligent Materials, Alexandra Carvalho and Keian Noori from the NUS Centre for Advanced 2D Materials, Martin-Isbjorn Trappe of the NUS Center for Quantum Technologies, and Aleksandr Rodin from Yale-NUS College for providing invaluable insights and feedback during the group discussions. Their willingness to share their experiences and perspectives enabled us to gain a deeper understanding of the current computational challenges related to the quantum many-body problem at hand. We thank our research engineer, Zekun Shi, for his assistance in improving the quality of our code, paving the way towards a high-quality package. We thank Jianhao Li from University of Minnesota Goodpaster Research Group for his valuable comments on the manuscript.

C PROOFS AND DERIVATIONS

C.1 PROOF OF PROPOSITION 3.1

Proof. We first prove the following statement as preparation (orthogonality constraint): for any (ψ₁, ψ₂, …, ψ_N)ᵀ ∈ F_W, it holds that ⟨ψ_i|ψ_j⟩ = δ_ij for all (i, j). This statement, together with the definitions of F_W and F, implies that F_W ⊆ F. Thus, in the rest of the proof, we need to show F_W ⊇ F. This is equivalent to the condition of Eq. 51, where C_i is the i-th row vector of C, which is in turn implied by Eq. 52. Finally, we note that Eq. 52 holds for any such C; substituting this into Eq. 51 yields the desired statement.

C.2 PROOF OF PROPOSITION 3.2

Proof. The total energy minimization problem defined in Eq. 3 is a constrained minimization problem, where H¹ denotes the Sobolev space containing all L² functions whose first-order derivatives are also L²-integrable. One resorts to the associated Euler-Lagrange equations for the solution. Setting the first-order derivatives to zero, i.e., δL/δψ*_i = 0, the stationarity condition becomes Ĥ(Ψ) ψ_i = λ_i ψ_i, where λ_i = Σ_{j=1}^N ϵ_ij and the Hamiltonian Ĥ is the same as defined in Eq. 48. To solve this nonlinear eigenvalue/eigenfunction problem, the SCF iteration in Algorithm 1 is adopted.

The SCF iteration, when convergent, is a contraction mapping:

d(A^{t+1}, A*) ⩽ η d(A^t, A*),

where d measures the distance between two matrices, A* is the solution, and η is the contraction factor determining the convergence rate of SCF, whose upper bound is given in their result. Liu et al. (2015; 2014) formulate SCF as a fixed-point map whose Jacobian is derived explicitly, from which the convergence of SCF is established under a condition similar to that of Yang et al. (2009). Cai et al. (2018) derive nearly optimal local and global convergence rates for SCF.

