MACHINE LEARNING FORCE FIELDS WITH DATA COST AWARE TRAINING

Abstract

Machine learning force fields (MLFFs) have been proposed to accelerate molecular dynamics (MD) simulation, which finds widespread applications in chemistry and biomedical research. Even for the most data-efficient MLFF models, reaching chemical accuracy can require hundreds of frames of force and energy labels generated by expensive quantum mechanical algorithms, which may scale as O(n³) to O(n⁷), with n being the number of basis functions used, typically proportional to the number of atoms. To address this issue, we propose a multi-stage computational framework, ASTEROID, which enjoys a low training data generation cost without significantly sacrificing the accuracy of MLFFs. Specifically, ASTEROID leverages a combination of a large amount of cheap inaccurate data and a small amount of expensive accurate data. The motivation behind ASTEROID is that inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Therefore, we first train an MLFF model on a large amount of inaccurate training data, employing a bias-aware loss function to prevent the model from overfitting the potential bias of the inaccurate training data. We then fine-tune the obtained model using a small amount of accurate training data, which preserves the knowledge learned from the inaccurate training data while significantly improving the model's accuracy. Moreover, we propose a variant of ASTEROID based on score matching for the setting where the inaccurate training data are unlabelled. Extensive experiments on MD simulation datasets show that ASTEROID can significantly reduce data generation costs while improving the accuracy of MLFFs.

1. INTRODUCTION

Molecular dynamics (MD) simulation is a key technology driving scientific discovery in fields such as chemistry, biophysics, and materials science (Alder & Wainwright, 1960; McCammon et al., 1977). By simulating the dynamics of molecules, important macro statistics such as the folding probability of a protein (Tuckerman, 2010) or the density of new materials (Varshney et al., 2008) can be estimated. These macro statistics are an essential part of many important applications such as structure-driven drug design (Hospital et al., 2015) and battery development (Leung & Budzien, 2010). Most MD simulation techniques share a common iterative structure: MD simulations calculate the forces on each atom in the molecule, and use these forces to move the molecule forward to the next state. The fundamental challenge of MD simulation is how to efficiently calculate the forces at each iteration. An exact calculation requires solving the Schrödinger equation, which is not feasible for many-body systems (Berezin & Shubin, 2012). Instead, approximation methods such as the Lennard-Jones potential (Johnson et al., 1993), Density Functional Theory (DFT, Kohn (2019)), or Coupled Cluster Single-Double-Triple (CCSD(T), Scuseria et al. (1988)) are used. CCSD(T) is seen as the gold standard for force calculation, but is computationally expensive. In particular, CCSD(T) has complexity O(n⁷) with respect to the number of basis functions used, along with a huge storage requirement (Chen et al., 2020). To accelerate MD simulation while maintaining high accuracy, machine learning based force fields have been proposed. These machine learning models take a molecular configuration as input and predict the forces on each atom in the molecule. These models have been successful, producing force fields with moderate accuracy while drastically reducing computation time (Chmiela et al., 2017).
Built upon the success of machine learning force fields, deep learning techniques for force fields have been developed, resulting in highly accurate force fields parameterized by large neural networks (Gasteiger et al., 2021; Batzner et al., 2022). Despite this empirical success, a key drawback is rarely discussed in existing literature: in order to train state-of-the-art machine learning force field models, a large amount of costly training data must be generated. For example, to train a model at the CCSD(T) level of accuracy, at least a thousand CCSD(T) calculations must be done to construct the training set. This is computationally expensive due to the method's O(n⁷) cost. A natural solution to this problem is to train on fewer data points. However, if the number of training points is decreased, the accuracy of the learned force fields quickly deteriorates. In our experiments, we empirically find that the prediction error and the number of training points roughly follow a power law relationship, with prediction error ∼ 1/(number of training points). This can be seen in Figure 1a, where prediction error and training set size are observed to have a linear relationship with a slope of -1 when plotted on a log scale. Another option is to train the force field model on less accurate but computationally cheap reference forces calculated using DFT (Kohn, 2019) or empirical force field methods (Johnson et al., 1993). However, these algorithms introduce undesirable bias into the force labels, meaning that the trained models will have poor performance. This phenomenon can be seen in Figure 1b, where models trained on large quantities of DFT reference forces are shown to perform poorly relative to force fields trained on moderate quantities of CCSD(T) reference forces. Therefore current methodologies are not sufficient for training force field models in low resource settings, as training on either small amounts of accurate data (i.e.
from CCSD(T)) or large amounts of inaccurate data (i.e. from DFT or empirical force fields) will result in inaccurate force fields. To address this issue, we propose to use both large amounts of inaccurate force field data (i.e. DFT) and small amounts of accurate data (i.e. CCSD(T)) to significantly reduce the cost of the data needed to achieve highly accurate force fields. Our motivation is that computationally cheap inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Moreover, if treated properly, we can further reduce the bias of the obtained model by taking advantage of the accurate data. More specifically, we propose a multi-stage computational framework: datA coST awarE tRaining of fOrce fIelDs (ASTEROID). In the first stage, a small amount of accurate data is used to identify the bias of the force labels in a large but inaccurate dataset. In the second stage, the model is trained on the large inaccurate dataset with a bias-aware loss function. Specifically, the loss function assigns smaller weights to data points with larger bias, suppressing the effect of label noise on training. This inaccurately trained model serves as a warm start for the third stage, where the force field model is fine-tuned on the small and accurate dataset. Together, these stages allow the model to learn from many molecular configurations while incorporating highly accurate force data, significantly outperforming conventional methods trained with similar data generation budgets. Beyond using cheap labelled data to boost model performance, we also extend our method to the case where a large amount of unlabelled molecular configurations is cheaply available (Smith et al., 2017; Köhler et al., 2022). Without labels, we cannot adopt the supervised learning approach.
Instead, we draw a connection to score matching, which learns the gradient of the log density function with respect to each data point (called the score) (Hyvärinen & Dayan, 2005) . In the context of molecular dynamics, we notice that if the log density function is proportional to the energy of each molecule, then the score function with respect to a molecule's position is equal to the force on the molecule. Based on this insight, we show that the supervised force matching problem can be tackled in an unsupervised manner. This unsupervised approach can then be incorporated into the ASTEROID framework, improving performance when limited data is available. We demonstrate the effectiveness of our framework with extensive experiments on different force field data sets. We use two popular model architectures, GemNet (Gasteiger et al., 2021) and EGNN (Satorras et al., 2021) , and verify the performance of our method in a variety of settings. In particular, we show that our multi-stage training framework can lead to significant gains when either DFT reference forces or empirical force field forces are viewed as inaccurate data and CCSD(T) configurations are used as accurate data. In addition, we show that we can learn accurate forces via the connection to score matching, and that using this objective in the second stage of training can improve performance on both DFT and CCSD(T) datasets. Finally, our framework is backed by a variety of ablation studies. The rest of this paper is organized as follows: Section 2 presents necessary background on machine learning force fields and training data generation, Section 3 details the ASTEROID framework, Section 4 extends ASTEROID to settings where unlabelled configurations are available, Section 5 presents our experimental results, and Section 6 compares our method to several related works (Ramakrishnan et al., 2015; Ruth et al., 2022; Nandi et al., 2021; Smith et al., 2019; Deringer et al., 2020) and briefly concludes the paper.

2. BACKGROUND

⋄ Machine Learning Force Fields. Recent years have seen a surge of interest in machine learning force fields. Much of this work has focused on developing large machine learning architectures that have physically correct equivariances, resulting in large graph neural networks that can generate highly accurate force and energy predictions (Gasteiger et al., 2021; Satorras et al., 2021; Batzner et al., 2022). Two popular architectures are EGNN and GemNet. Both models have graph neural network architectures that are translation invariant, rotationally equivariant, and permutation equivariant. EGNN is a smaller model and is often used when limited resources are available. The GemNet architecture is significantly larger and more refined than the EGNN architecture, modeling various types of inter-atom interactions. GemNet is therefore more powerful and can achieve state-of-the-art performance, but requires more resources. It has been noted that in order to be reliable enough for MD simulations, a molecular force field must attain an accuracy of 1 kcal mol⁻¹ Å⁻¹ (Chmiela et al., 2018). Critically, the accuracy of deep force fields such as GemNet and EGNN is highly dependent on the size and quality of the training dataset. With limited training data, modern MLFFs cannot achieve the required accuracy, preventing their application in settings where data is expensive to generate (e.g. large molecules). The amount of resources needed for training is therefore a key bottleneck preventing widespread use of modern MLFFs.

⋄ Data Generation Cost. The training data for MLFFs can be generated by a variety of force calculation methods. These methods exhibit an accuracy-cost tradeoff: accurate reference forces from methods such as CCSD(T) require high computational cost to generate, while inaccurate reference forces from methods such as DFT and empirical force fields can be generated extremely quickly.
Concretely, CCSD(T) is highly accurate but has O(n⁷) complexity, DFT is less accurate with complexity O(n³), and empirical force fields are inaccurate but have complexity O(n) (Lin et al., 2019; Ratcliff et al., 2017). CCSD(T) is typically viewed as the gold standard for calculating reference forces, but its computational costs often make it impractical for MD simulation (it has been estimated that "a nanosecond-long MD trajectory for a single ethanol molecule executed with the CCSD(T) method would take roughly a million CPU years on modern hardware") (Chmiela et al., 2018). Due to this large expense, molecular configurations are typically generated first with MD simulations driven by DFT or empirical force fields. These simulations generate a large number of molecular configurations, and then CCSD(T) reference forces are computed for a small portion of these configurations. Because of this generation process, a large amount of inaccurately labelled molecular configurations is typically available alongside the accurate CCSD(T) labelled data. Therefore large amounts of inaccurate force data can be obtained and used to train MLFF models while incurring almost no extra data generation cost compared to CCSD(T).

3. ASTEROID

To reduce the amount of resources needed to train machine learning force fields, we propose a multi-stage training framework, ASTEROID, which learns from a combination of cheaply available inaccurate data and more expensive accurate data.

Preliminaries. For a molecule with $k$ atoms, we denote a configuration (the positions of its atoms in 3D) of this molecule as $x \in \mathbb{R}^{3k}$, its energy as $E(x) \in \mathbb{R}$, and its force as $F(x) \in \mathbb{R}^{3k}$. We denote the accurately labelled data as $D_A = \{(x_1^a, e_1^a, f_1^a), \ldots, (x_N^a, e_N^a, f_N^a)\}$ and the inaccurately labelled data as $D_I = \{(x_1^n, e_1^n, f_1^n), \ldots, (x_M^n, e_M^n, f_M^n)\}$, where $(x_i^a, e_i^a, f_i^a)$ represents the position, potential energy, and force of the $i$th accurately labelled molecule, using the shorthand $e_i^a = E(x_i^a)$ and $f_i^a = F(x_i^a)$ (similarly $(x_j^n, e_j^n, f_j^n)$ for the $j$th inaccurately labelled data point). Conventional methods train a force field model $E(\cdot;\theta)$ with parameters $\theta$ on the accurate data by minimizing the loss

$$\min_\theta\ \mathcal{L}(D_A, \theta) = (1-\rho)\sum_{i=1}^{N} \ell_f\big(f_i^a,\ -\nabla_x E(x_i^a;\theta)\big) + \rho\sum_{i=1}^{N} \ell_e\big(e_i^a,\ E(x_i^a;\theta)\big), \tag{1}$$

where $\ell_f$ is the loss function for the force prediction and $\ell_e$ is the loss function for the energy prediction. Here the force is predicted as $-\nabla_x E(x;\theta)$, i.e., the negative gradient of the energy $E(x;\theta)$ w.r.t. the input $x$. The hyperparameter $\rho$ balances the losses of the energy prediction and the force prediction. In practice, most of the emphasis is placed on the force prediction, e.g. $\rho = 0.001$.
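A minimal PyTorch sketch of the Eq. 1 objective, under the assumption that a model maps a batch of flattened configurations to per-configuration energies; all names here are illustrative, not the paper's implementation:

```python
import torch

def combined_loss(model, x, e_true, f_true, rho=0.001):
    """Sketch of the Eq. 1 objective: weighted sum of force and energy losses.

    `model` is assumed to map flattened configurations (B, 3k) to
    per-configuration energies (B,); names are illustrative.
    """
    x = x.detach().requires_grad_(True)
    e_pred = model(x)
    # Force prediction as the negative gradient of the total energy w.r.t. positions
    f_pred = -torch.autograd.grad(e_pred.sum(), x, create_graph=True)[0]
    loss_f = torch.nn.functional.l1_loss(f_pred, f_true)
    loss_e = torch.nn.functional.l1_loss(e_pred, e_true)
    return (1.0 - rho) * loss_f + rho * loss_e
```

With `create_graph=True`, gradients of the force loss can flow back into the model parameters, which is what makes energy-gradient force matching trainable.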

3.1. BIAS IDENTIFICATION

The approximation algorithms used to generate the cheap data $D_I$ introduce a large amount of bias into some force labels $f^n$, which may significantly hurt training. Motivated by this phenomenon, we aim to identify the most biased force labels so that we can avoid overfitting to the bias during training. In the first stage of our training framework, we use a small amount of accurately labelled data $D_A$ to identify the level of bias in the inaccurate dataset $D_I$. Specifically, we train a force field model by minimizing $\mathcal{L}(D_A, \theta)$ (Eq. 1), the loss over the accurate data, to obtain parameters $\theta_0$. Although the resulting model $E(\cdot;\theta_0)$ will not necessarily have good prediction performance because of the limited amount of training data, it can still help estimate the bias of the inaccurate data. For every configuration $x_j^n$ in the inaccurate dataset $D_I$, we suspect it to have a large bias if there is a large discrepancy between its force label $f_j^n$ and the force predicted by the accurately trained model, $-\nabla_x E(x_j^n;\theta_0)$. We can therefore use this discrepancy as a surrogate for the bias, i.e.,

$$B(x_j^n) = \big\| -\nabla_x E(x_j^n;\theta_0) - f_j^n \big\|_1.$$
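The bias surrogate can be sketched as follows, assuming a differentiable energy model standing in for $E(\cdot;\theta_0)$; function and variable names are hypothetical:

```python
import torch

def estimate_bias(energy_model, x, f_label):
    """Surrogate bias B(x) = || -grad_x E(x; theta_0) - f ||_1, one score per
    configuration; `energy_model` is an illustrative stand-in for E(.; theta_0)."""
    x = x.detach().requires_grad_(True)
    e = energy_model(x)
    f_pred = -torch.autograd.grad(e.sum(), x)[0]  # predicted force per configuration
    return (f_pred - f_label).abs().sum(dim=-1)   # L1 discrepancy, shape (B,)
```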

3.2. BIAS-AWARE TRAINING WITH INACCURATE DATA

In the second stage of our framework, we train a force field model $E(\cdot;\theta_{\mathrm{init}})$ from scratch on the large amount of inaccurately labelled data $D_I$. To avoid overfitting to the biased force labels, we use a bias-aware loss function that weighs the inaccurate data according to their bias. In particular, we assign the weight $w_j = \exp(-B(x_j^n)/\gamma)$ to configuration $x_j^n$, where $\gamma$ is a hyperparameter to be tuned. In this way, low-bias points are given higher importance and high-bias points are treated more carefully. We then minimize the bias-aware loss function

$$\min_\theta\ \mathcal{L}_w(D_I, \theta) = (1-\rho)\sum_{j=1}^{M} w_j\, \ell_f\big(f_j^n,\ -\nabla_x E(x_j^n;\theta)\big) + \rho\sum_{j=1}^{M} w_j\, \ell_e\big(e_j^n,\ E(x_j^n;\theta)\big), \tag{2}$$

to obtain parameters $\theta_{\mathrm{init}}$, resulting in the initial estimate of the force field model $E(\cdot;\theta_{\mathrm{init}})$.
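The weighting scheme is simple enough to sketch directly; this is an illustrative fragment, not the paper's training code:

```python
import torch

def bias_aware_weights(bias, gamma=1.0):
    """w_j = exp(-B(x_j)/gamma): near 1 for low-bias points, near 0 for high-bias ones."""
    return torch.exp(-bias / gamma)

def weighted_force_loss(f_pred, f_label, weights):
    """Force part of the bias-aware objective, with a per-configuration L1 error."""
    per_point = (f_pred - f_label).abs().mean(dim=-1)
    return (weights * per_point).mean()
```

A larger $\gamma$ flattens the weights toward uniform training; a smaller $\gamma$ suppresses high-bias points more aggressively, which is why $\gamma$ is tuned on validation loss.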

3.3. FINE-TUNING OVER ACCURATE DATA

The model $E(\cdot;\theta_{\mathrm{init}})$ contains information useful to the force prediction problem, but may still be biased because it is trained on the inaccurately labelled data $D_I$. Therefore, we further refine it using the accurately labelled data $D_A$. Specifically, we use $E(\cdot;\theta_{\mathrm{init}})$ as an initialization for our final stage, in which we fine-tune the model on the accurate data by minimizing $\mathcal{L}(D_A, \theta)$ (Eq. 1) to obtain the final parameters $\theta_{\mathrm{final}}$.
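The warm-start fine-tuning stage amounts to continuing training from the stage-two parameters on the small accurate set; a minimal sketch, with `accurate_loader` and `loss_fn` as hypothetical placeholders:

```python
import torch

def fine_tune(model, accurate_loader, loss_fn, epochs=3, lr=1e-4):
    """Stage-three sketch: a warm-started model (theta_init) is refined on the
    small accurate set D_A; `accurate_loader` yields (x, e, f) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, e, f in accurate_loader:
            opt.zero_grad()
            loss_fn(model, x, e, f).backward()
            opt.step()
    return model
```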

4. ASTEROID FOR UNLABELLED DATA

In several settings molecular configurations are generated without force labels, either because they are not generated via MD simulation (e.g. normal mode sampling, Smith et al. (2017)) or because the forces are not stored during the simulation (Köhler et al., 2022). Although these unlabelled configurations may be cheaply available, they are not generated for the purpose of learning force fields and have not been used in existing literature. Here, we show that the unlabelled configurations can be used to obtain an initial estimate of the force field, which can then be further fine-tuned on accurate data.

More specifically, we consider a molecular system where the number of particles, volume, and temperature are constant (NVT ensemble). Let $x$ refer to the molecule's configuration and $E(x)$ refer to the corresponding potential energy. It is known that $x$ follows a Boltzmann distribution, i.e.

$$p(x) = \frac{1}{Z}\exp\left(-\frac{E(x)}{k_B T}\right),$$

where $Z$ is a normalizing constant, $T$ is the temperature, and $k_B$ is the Boltzmann constant. In practice, configurations generated using normal mode sampling (Unke et al., 2021) or via a sufficiently long NVT MD simulation follow a Boltzmann distribution. Recall that we model the energy $E(x)$ as $E(x;\theta)$, so that the force can be calculated as $F(x;\theta) = -\nabla_x E(x;\theta)$. It follows from Hyvärinen & Dayan (2005) that we can learn the score function of the Boltzmann distribution using score matching, where the score function is defined as the gradient of the log density function, $\nabla_x \log p(x)$. In our case, we observe that the force on a configuration $x$ is proportional to the score function, i.e., $F(x) \propto \nabla_x \log p(x)$. Therefore, we can use score matching to learn the forces by minimizing the unsupervised loss

$$L(\theta) = \mathbb{E}_{p(x)}\left[\,k_B T\,\mathrm{Tr}\big[\nabla_x F(x;\theta)\big] + \frac{1}{2}\big\|F(x;\theta)\big\|_2^2\,\right]. \tag{3}$$

Although this allows us to solve the force matching problem in an unsupervised manner, the unsupervised loss is difficult to optimize in practice. To reduce the cost of solving Eq.
3, we adopt sliced score matching (Song et al., 2020) . Sliced score matching takes advantage of random projections to significantly reduce the cost of solving Eq. 3, allowing us to apply score matching to large neural models such as GemNet. In our experiments, we find that score matching does not match the accuracy of CCSD(T) force labels. Instead, we can think of score-matching as a form of inaccurate training. We therefore use score matching as an alternative to stages one and two of the ASTEROID framework. That is, we minimize Eq. 3 to get θ init , after which the model is fine-tuned on the accurate data.
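A sliced approximation of the Eq. 3 objective can be sketched as follows: the expensive trace term is replaced by Hutchinson-style random projections $v^\top (\nabla_x F)\, v$, in the spirit of Song et al. (2020). This is an illustrative fragment with assumed names, not the paper's implementation:

```python
import torch

def sliced_score_matching_loss(force_model, x, kbt=1.0, n_proj=1):
    """Sliced stand-in for Eq. 3: Tr[grad_x F] is estimated with random
    projections v^T (grad_x F) v (Hutchinson estimator), since E[v v^T] = I."""
    x = x.detach().requires_grad_(True)
    f = force_model(x)                      # predicted forces, shape (B, d)
    sq = 0.5 * (f ** 2).sum(dim=-1)         # (1/2) ||F(x; theta)||^2 term
    trace_est = torch.zeros_like(sq)
    for _ in range(n_proj):
        v = torch.randn_like(x)
        grad_fv = torch.autograd.grad((f * v).sum(), x, create_graph=True)[0]
        trace_est = trace_est + (grad_fv * v).sum(dim=-1) / n_proj
    return (kbt * trace_est + sq).mean()
```

Each projection costs one extra backward pass instead of the $d$ passes an exact trace would need, which is what makes the objective feasible for large models such as GemNet.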

5. EXPERIMENTS

5.1. DATA AND MODELS

For our main experiments, we consider three settings: using DFT data to enhance CCSD(T) training, using empirical force field data to enhance CCSD(T) training, and using unlabelled configurations to enhance CCSD(T) training. For the CCSD(T) data, we use MD17@CCSD, which contains 1,000 configurations labelled at the CCSD(T) or CCSD level of accuracy for 5 molecules (Chmiela et al., 2017). For the DFT data, we use the MD17 dataset, which contains over 90,000 configurations labelled at the DFT level of accuracy for 10 molecules (Chmiela et al., 2017). For the empirical force field data, we generate 100,000 configurations using the OpenMM empirical force field software (Eastman et al., 2017). For the unlabelled datasets, we use MD17 with the force labels removed. In each setting we use 200, 400, 600, 800, or 950 CCSD(T) training samples as accurately labelled data. A validation set of size 50 and a test set of size 500 are used in all experiments, except for ethanol, where a test set of size 1000 is used. Recall that the inaccurately labelled datasets can be generated at a tiny fraction of the cost of the CCSD(T) data. We implement our method on GemNet and EGNN. For GemNet we use the same model parameters as Gasteiger et al. (2021). For EGNN, we use a 5-layer model and an embedding size of 128. We also compare ASTEROID to sGDML (Chmiela et al., 2019), a kernel-based method that has been shown to perform well when limited training data is available. For training with inaccurate data, we use a batch size of 16 and stop training when the validation loss stabilizes. In the fine-tuning stage, we use a batch size of 10 and train for a maximum of 2000 epochs. To tune the bias-aware loss parameter γ, we search in the set {0.5, 0.75, 1.0, 2.0, 3.0, 4.0, 5.0} and select the model with the lowest validation loss. Comprehensive experimental details are deferred to Appendix A.1.

5.2. ENHANCING FORCE FIELDS WITH DFT

We display the results for using DFT data to enhance CCSD(T) training in Figure 2 for GemNet and Figure 3 for EGNN. From these figures, we can see that ASTEROID can outperform standard training for all amounts of data. In particular, the performance gain from using ASTEROID is strongest when limited amounts of accurate data are available and decreases as the amount of accurate data increases. When applied to GemNet in low resource settings, ASTEROID on average improves prediction accuracy by 52% and sample efficiency by a factor of 2.4. For EGNN, ASTEROID improves prediction accuracy by 66% and increases sample efficiency by more than 5 times. The large performance increase for EGNN may be due to the fact that the EGNN architecture has less inductive bias than GemNet, and therefore may struggle to learn the structures of the underlying force field with only a small amount of data. 

5.3. ENHANCING FORCE FIELDS WITH EMPIRICAL FORCE CALCULATION

We present the results for empirical force field data in Table 1. Due to limited space, we only display the results for the case where 200 accurate data points are available; the remaining results for GemNet can be found in Appendix A.2. Again we find that ASTEROID significantly outperforms the supervised baseline, improving prediction accuracy by 36% for GemNet and by 17% for EGNN. We note that empirical force fields are typically much less accurate than DFT (see Figure 4) but also much faster to compute. Despite this accuracy gap, the ASTEROID framework is able to utilize the empirical force field data to improve performance, allowing ASTEROID to benefit from even cheaper labels than DFT and obtain better efficiency.

5.4. ENHANCING FORCE FIELDS WITH UNLABELLED MOLECULES

We first verify that score matching can learn the forces on unlabelled molecules by comparing its prediction accuracy on CCSD(T) data with models trained on supervised data (DFT and empirical force fields). We show the results in Figure 4. Surprisingly, we find that the prediction error of score matching falls between that of DFT and empirical force fields. This indicates that relatively accurate force predictions can be obtained solely by solving Eq. 3. Next, we extend ASTEROID to settings where unlabelled data is available by fine-tuning the model obtained from score matching. We present the results for using ASTEROID on unsupervised configurations in Table 2, where we find that ASTEROID can improve prediction accuracy by 18% for GemNet and 4% for EGNN.

5.5. ABLATION STUDIES

We study the effectiveness of each component of ASTEROID. Specifically, we investigate the importance of bias-aware training (BAT) and fine-tuning (FT) when compared with standard training. As can be seen in Table 3 , each of ASTEROID's components is effective and complementary to one another. We find that bias-aware training is most helpful with GemNet, which has more capability to overfit harmful data points than EGNN.

5.6. ANALYSIS

⋄ Size of inaccurate data. To demonstrate that ASTEROID can exploit varying amounts of inaccurate data, we plot the performance of ASTEROID with randomly sub-sampled inaccurate datasets in Figure 5. We also investigate the performance of score matching when varying amounts of unlabelled data are available. We find that score matching is fairly robust to the number of available configurations, beating empirical force fields with only 100 configurations and requiring only 5,000 configurations for optimal performance.

⋄ Runtime Comparison. Although there is no official documentation on the time needed to create the DFT and CCSD(T) data in the MD17 dataset, we create our own empirical force field dataset and measure the time needed to create it. Generating 100,000 molecular conformations and their respective force labels takes roughly two hours on a single V100 32GB GPU. This is much faster than DFT or CCSD(T), which can take hundreds or thousands of seconds for a single computation (Bhattacharya et al., 2021; Datta & Gordon, 2021). In general, it is difficult to get an exact time comparison between DFT and CCSD(T) because the runtime depends on the molecule, the number of basis functions used, and the implementation of each algorithm.

⋄ Hyperparameter Sensitivity. We investigate the sensitivity of ASTEROID to the hyperparameters γ and ρ (note that ρ is only used in GemNet). From Figure 6 we can see that ASTEROID is robust to the choice of hyperparameters, outperforming standard training in every setting.

6. DISCUSSION AND CONCLUSION

One related area of work is ∆-machine learning (Ramakrishnan et al., 2015; Ruth et al., 2022; Nandi et al., 2021) , which also uses both inaccurate force calculations and accurate force calculations. Specifically, ∆-machine learning learns the difference between inaccurate and accurate force predictions, therefore speeding up MD simulation while maintaining high accuracy. However, such an approach requires an equal amount of accurate and inaccurate training data, which therefore makes the training data expensive to generate. Another related area of work is training general purpose MLFFs over many molecules (Smith et al., 2019; Deringer et al., 2020) . Different from our work, these methods require lots of data from multiple molecules and train on a huge dataset that is not generated in a cost-aware manner. Additionally, the method from Smith et al. (2017) 

A APPENDIX

A.1 EXPERIMENTAL DETAILS

In this section we go over the experimental details.

GemNet Training Details. For bias identification, we train a freshly initialized model with a batch size of 10 on the accurate dataset for 2000 epochs. To train the inaccurate model, we train a freshly initialized model with the bias-aware loss function and a batch size of 16 over the inaccurate dataset. Finally, to fine-tune the inaccurately trained model, we train with a batch size of 10 on the accurate dataset for 2000 epochs. In each stage of training we use the following hyperparameters:
• Evaluation interval: 1 epoch
• Decay steps: 1,200,000
• Warmup steps: 10,000
• Decay patience: 50,000
• Decay cooldown: 50,000
The rest of the parameters are the same as those used in Gasteiger et al. (2021).

A.2 ADDITIONAL RESULTS WITH EMPIRICAL FORCE FIELD DATA

Here we include additional results for ASTEROID when empirical force field data is viewed as inaccurate. For the baseline model we use GemNet. The ASTEROID framework again leads to consistent gains across all amounts of data.

A.3 DERIVATION OF SCORE MATCHING FOR FORCES

For a given molecule with conformations $x_1, \ldots, x_n$, denote the energy as $E(x)$. The Boltzmann (equilibrium) distribution of the molecule is given by

$$p(x) = \frac{1}{Z}\exp(-\beta E(x)),$$

where $Z$ is a normalizing constant, $\beta = \frac{1}{k_B T}$, $k_B$ is the Boltzmann constant, and $T$ is the temperature under which the simulation is run. Then the force on a conformation $x$ is equivalent to a scaled score, i.e.

$$F(x) = -\nabla_x E(x) = \frac{1}{\beta}\nabla_x \log p(x).$$

Therefore learning the force $F(x)$ is equivalent to learning the score $\frac{1}{\beta}\nabla_x \log p(x)$. Suppose we parameterize the MLFF to directly predict the force as $F_\theta(x)$. Then the force matching loss can be written as

$$L(\theta) = \frac{1}{2}\mathbb{E}_{x\sim p(x)}\|F_\theta(x) - F(x)\|_2^2 = \frac{1}{2}\mathbb{E}_{x\sim p(x)}\|F(x)\|_2^2 - \mathbb{E}_{x\sim p(x)}[\langle F_\theta(x), F(x)\rangle] + \frac{1}{2}\mathbb{E}_{x\sim p(x)}\|F_\theta(x)\|_2^2.$$

The middle term can then be expanded as

$$\mathbb{E}_{x\sim p(x)}[\langle F_\theta(x), F(x)\rangle] = \int p(x)\langle F_\theta(x), F(x)\rangle\,dx = \frac{1}{\beta}\sum_{i=1}^{d}\int p(x)\frac{\partial \log p(x)}{\partial x_i}F_\theta(x)_i\,dx = \frac{1}{\beta}\sum_{i=1}^{d}\int \frac{\partial p(x)}{\partial x_i}F_\theta(x)_i\,dx.$$

Integrating by parts in $x_i$, and using that the boundary term $p(x)F_\theta(x)_i\big|_{x_i=-\infty}^{+\infty}$ vanishes,

$$\mathbb{E}_{x\sim p(x)}[\langle F_\theta(x), F(x)\rangle] = -\frac{1}{\beta}\sum_{i=1}^{d}\int p(x)\frac{\partial F_\theta(x)_i}{\partial x_i}\,dx = -\frac{1}{\beta}\mathbb{E}_{x\sim p(x)}\big[\mathrm{Tr}[\nabla_x F_\theta(x)]\big].$$

Since the first term $\frac{1}{2}\mathbb{E}_{x\sim p(x)}\|F(x)\|_2^2$ does not depend on $\theta$, we obtain, up to an additive constant, the loss

$$L(\theta) = \mathbb{E}_{x\sim p(x)}\left[\frac{1}{\beta}\mathrm{Tr}[\nabla_x F_\theta(x)] + \frac{1}{2}\|F_\theta(x)\|_2^2\right].$$
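As a sanity check on the integration-by-parts step, the identity $\mathbb{E}[\langle F_\theta, \nabla \log p\rangle] = -\mathbb{E}[\mathrm{Tr}\,\nabla_x F_\theta]$ can be verified by Monte Carlo in a case where everything is available in closed form. Here we assume, purely for illustration, that $p$ is a standard normal (so $\nabla_x \log p(x) = -x$) and that the "force model" is a toy linear map $F_\theta(x) = Ax$ with Jacobian $A$:

```python
import torch

torch.manual_seed(0)
d = 4
A = torch.randn(d, d)                          # toy force model F(x) = A x, Jacobian A
x = torch.randn(200000, d)                     # samples from p = N(0, I): grad_x log p(x) = -x
lhs = (x @ A.T * (-x)).sum(dim=-1).mean()      # Monte Carlo E[<F(x), grad_x log p(x)>]
rhs = -torch.trace(A)                          # -E[Tr grad_x F(x)] = -Tr A, exact
print(float(lhs), float(rhs))                  # the two agree up to Monte Carlo error
```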

A.4 ASTEROID DIAGRAM

In order to make the ASTEROID framework more clear, we provide a workflow diagram.

A.5 ASTEROID TOY EXAMPLE

We have added a new result using a two-layer MLP (128 hidden units per layer) and synthetic data. This experiment shows that ASTEROID can significantly improve generalization error in a variety of settings. We generate inaccurate data with noise levels [..., 2, 4, 8, 16]. We also generate accurate data according to Y = AX, where X ∼ N(0, 1). We then evaluate the test MAE of ASTEROID and standard training over a variety of accurate data sizes. We find that ASTEROID significantly outperforms standard training.

A.6 MD SIMULATION

We run MD simulations where the step size is 0.5 femtoseconds, the temperature is 500K, the friction coefficient is 0.002, and the maximum number of time steps is 10,000. The results of these simulations can be seen in Figure 10a, where ASTEROID is able to produce stable dynamics, while the error compounding effect of the DFT-trained force field and the Lennard-Jones potential is too severe, resulting in a diverged simulation. Next we compare the simulation ability of ASTEROID with that of standard training, where both MLFFs are trained on 200 points of CCSD(T) data. To evaluate the robustness of the MLFFs, we run MD simulations with varying step sizes on the aspirin molecule. The setting is the same as in Figure 10a. In Figure 10b we plot the number of simulations that do not converge out of 20 random runs. The ASTEROID framework is able to maintain steady performance across step sizes, and almost all of the simulations converge. In contrast, the simulations powered by standard MLFFs fail at larger step sizes.
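The A.5 toy experiment can be sketched as follows. This is a stand-in rather than the paper's exact protocol: the "inaccurate" set is the linear target corrupted by label noise, pre-training on it is followed by fine-tuning on a few clean points, and all sizes, noise levels, and learning rates are illustrative:

```python
import torch

torch.manual_seed(0)
d = 8
A = torch.randn(1, d)

def make_data(n, noise=0.0):
    X = torch.randn(n, d)
    Y = X @ A.T + noise * torch.randn(n, 1)   # noisy labels mimic inaccurate references
    return X, Y

# Two-layer MLP with 128 hidden units per layer, as in the toy experiment
mlp = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 1))

def train(model, X, Y, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        torch.nn.functional.l1_loss(model(X), Y).backward()
        opt.step()

# Pre-train on many noisy points (stand-in for stages one/two), then
# fine-tune on a handful of clean points (stage three)
Xn, Yn = make_data(2000, noise=4.0)
train(mlp, Xn, Yn, epochs=200, lr=1e-3)
Xa, Ya = make_data(32, noise=0.0)
train(mlp, Xa, Ya, epochs=400, lr=1e-4)
```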

A.7 COMPARISON WITH ANI-1CCX

We compare the performance of the method from Smith et al. (2017) with ours in Table 4, both in the case where the ANI-1 model is fine-tuned on MD17 and in the zero-shot setting. Despite having a much larger data generation budget, the open-domain pre-training of Smith et al. (2017) is not as powerful as ASTEROID.



Figure 1: (a) Log-log plot of the number of training points versus the prediction error for deep force fields. (b) Prediction error on CCSD labelled molecules for force fields trained on large amounts of DFT reference forces (100,000 configurations) and moderate amounts of CCSD reference forces (1,000 configurations). In both cases the model architecture used is GemNet (Gasteiger et al., 2021).

Figure 2: Main results for GemNet when DFT data is viewed as inaccurate.

Figure 3: Main results for EGNN when DFT data is viewed as inaccurate.

Figure 4: Prediction errors of models tested on CCSD(T) data. Models are trained with different inaccurately labelled data but are not fine-tuned on the accurate CCSD(T) data.

Figure 5: Prediction error of models trained with varying amounts of inaccurate data. The molecule is aspirin, the inaccurate data is labelled with DFT, and the accurate data is made up of 200 CCSD(T) labelled molecules. For score matching we do not fine-tune the model.

Figure 6: Hyperparameter sensitivity study on the toluene molecule. We set γ = 1.0 and ρ = 0.001 by default. The red line represents standard training.

EGNN Training Details. The EGNN training setup is similar to GemNet. For bias identification, we train a freshly initialized model with a batch size of 10 on the accurate dataset for 2000 epochs. To train the inaccurate model, we train a freshly initialized model with the bias-aware loss function and a batch size of 32 over the inaccurate dataset. Finally, to fine-tune the inaccurately trained model, we train with a batch size of 10 on the accurate dataset for 2000 epochs. In each stage of training we use the following hyperparameters:
• Evaluation interval: 1 epoch
• Learning rate: 10⁻⁴ for inaccurate training, 10⁻

Figure 7: Additional results for GemNet when empirical force field data is viewed as inaccurate.

Figure 8: Asteroid workflow diagram.

Figure 9: Asteroid toy example.



Table 2: Accuracy of ASTEROID with unlabelled molecular configurations. The training set for the fine-tuning stage contains 200 CCSD(T) labelled molecules.

Table 3: Ablation study on ASTEROID. The inaccurate data is DFT labelled configurations and the accurate dataset contains 200 CCSD(T) labelled configurations.

only trains on equilibrium states and may not work well for MD trajectory data. Numerical comparisons can be found in Appendix A.7.

Different from previous works on machine learning force fields, we propose to learn MLFFs in a data cost aware manner. Specifically, we propose a new training framework, ASTEROID, to boost prediction accuracy with cheap inaccurate data. In addition, we extend ASTEROID to the unsupervised setting. By learning force field structures from the cheap data, ASTEROID only requires a small amount of expensive and accurate data points to achieve good performance. Extensive experiments validate the efficacy of ASTEROID.

Table 4: Accuracy of ASTEROID compared with ANI-1ccx. The training set for the fine-tuning stage contains 200 CCSD(T) labelled molecules. FT refers to fine-tuning ANI-1ccx on MD17.

