FORCES ARE NOT ENOUGH: BENCHMARK AND CRITICAL EVALUATION FOR MACHINE LEARNING FORCE FIELDS WITH MOLECULAR SIMULATIONS

Abstract

Molecular dynamics (MD) simulation techniques are widely used across the natural sciences. Increasingly, machine learning (ML) force field (FF) models have begun to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case is to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for ML MD simulation. We curate representative MD systems, including water, organic molecules, a peptide, and materials, and design evaluation metrics corresponding to the scientific objectives of the respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, and offer directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate further work.

1. INTRODUCTION

Molecular dynamics (MD) simulations provide atomistic insights into physical phenomena in materials and biological systems. Such simulations are typically based on force fields (FFs) that characterize the underlying potential energy surface (PES) of the system; Newtonian forces derived from the PES are then used to simulate long trajectories (Frenkel & Smit, 2001). The PES itself is challenging to compute and would ideally be obtained through quantum chemistry, which is computationally expensive. Traditionally, the alternative has been parameterized force fields: surrogate models built from empirically chosen functional forms (Halgren, 1996). Recently, machine learning (ML) force fields (Unke et al., 2021b) have shown promise to accelerate MD simulations by orders of magnitude while retaining quantum-chemical accuracy. The evidence supporting the utility of ML FFs is often based on their accuracy in reproducing forces across test cases (Faber et al., 2017); these evaluations almost invariably do not involve simulations.

However, we show that force accuracy alone does not suffice for effective simulation (Figure 1). MD simulation not only describes microscopic details of how the system evolves, but also yields macroscopic observables that characterize system properties. Calculating meaningful observables often requires long simulations to sample the underlying equilibrium distribution. These observables are designed to be predictive of material properties, such as diffusivity in electrolyte materials (Webb et al., 2015), and to reveal detailed physical mechanisms, such as the folding kinetics of protein dynamics (Lane et al., 2011). Although these observables are critical products of MD simulations, their systematic evaluation has not been sufficiently studied in the existing literature. To gain insight into the performance of existing models in a simulation setting, we propose a series of simulation-based benchmark protocols.
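The simulation loop behind such benchmarks is conceptually simple: at each step, a force field supplies forces, and a numerical integrator advances positions and velocities. A minimal sketch of one velocity Verlet step is shown below; `force_fn` is a hypothetical stand-in for any force provider, be it a classical force field or an ML model's force prediction (this is an illustration, not the benchmark's actual implementation):

```python
import numpy as np

def velocity_verlet_step(positions, velocities, forces, masses, dt, force_fn):
    """Advance one velocity Verlet step.

    positions, velocities, forces: (n_atoms, 3) arrays; masses: (n_atoms,).
    `force_fn` maps positions -> forces and stands in for any FF, including
    an ML model (hypothetical interface, not a specific library's API).
    """
    accel = forces / masses[:, None]
    # Position update uses current forces
    positions = positions + velocities * dt + 0.5 * accel * dt**2
    # Re-evaluate forces at the new positions
    new_forces = force_fn(positions)
    # Velocity update averages old and new accelerations
    velocities = velocities + 0.5 * (accel + new_forces / masses[:, None]) * dt
    return positions, velocities, new_forces
```

Production MD codes add thermostats, barostats, and neighbor lists on top of this core step; the point here is only that every integration step queries the force field, so force errors compound over millions of steps.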
Compared to the popular multistep prediction task in the learned-simulator community (Sanchez-Gonzalez et al., 2020), MD observables focus on distributional quantities: exact recovery of the trajectories given the initial conditions is not the ultimate goal. Evaluating learned models through MD simulations requires careful design in the selection of systems, the simulation protocol, and the evaluation metrics: (1) a benchmark suite should cover diverse and representative systems to reflect the various challenges in different MD applications; (2) simulations can be computationally expensive, so an ideal benchmark needs to balance the cost of evaluation against the complexity of the system, such that meaningful metrics can be obtained with reasonable time and computation; (3) selected systems should be well studied in the simulation domain, and chosen metrics should characterize the system's important degrees of freedom or geometric features.

Are current state-of-the-art (SOTA) ML FFs capable of simulating a variety of MD systems? What might cause a model to fail in simulations? In this paper, we aim to answer these questions with a novel benchmark study. The contributions of this paper include:

• We introduce a novel benchmark suite for ML MD simulation with simulation protocols and quantitative metrics. We perform extensive experiments to benchmark a collection of SOTA ML models, and we provide a complete codebase for training and simulating MD with ML FFs to lower the barrier to entry and facilitate future work.

• We show that many existing models are inadequate when evaluated on simulation-based benchmarks, even when they show accurate force prediction (as shown in Figure 1).

• By performing and analyzing MD simulations, we summarize common failure modes and discuss causes and potential solutions to motivate future research.
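A canonical example of such a distributional quantity is the radial distribution function g(r), which measures pair-density correlations of the sampled ensemble rather than any individual trajectory. The sketch below computes a single-frame g(r) for a cubic periodic box; it is an illustrative helper under simplifying assumptions (one particle type, one frame), not code from the benchmark suite. A predicted g(r) can then be compared to a reference g(r) via, e.g., mean absolute error:

```python
import numpy as np

def radial_distribution(positions, box_length, n_bins=100, r_max=None):
    """Histogram-based g(r) for one frame of a cubic periodic box.

    Minimal sketch: production analyses average over many frames and
    handle multiple particle types.
    """
    n = len(positions)
    r_max = box_length / 2 if r_max is None else r_max
    # Minimum-image pairwise displacement vectors
    diff = positions[:, None, :] - positions[None, :, :]
    diff -= box_length * np.round(diff / box_length)
    dist = np.sqrt((diff ** 2).sum(-1))
    # Keep each pair once (upper triangle, excluding self-distances)
    iu = np.triu_indices(n, k=1)
    counts, edges = np.histogram(dist[iu], bins=n_bins, range=(0.0, r_max))
    # Normalize by the ideal-gas expectation per spherical shell
    rho = n / box_length ** 3
    shell_vol = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    g = counts / (shell_vol * rho * n / 2)
    r_centers = 0.5 * (edges[1:] + edges[:-1])
    return r_centers, g
```

For an ideal gas (uniformly random positions), g(r) fluctuates around 1; structured liquids such as water show characteristic peaks, and deviations between a model's g(r) and the reference quantify how well the simulation reproduces the equilibrium structure.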

2. RELATED WORK

ML force fields learn the potential energy surface (PES) from data by applying expressive regressors, such as kernel methods (Chmiela et al., 2017) and neural networks, to symmetry-preserving representations of atomic environments (Behler & Parrinello, 2007; Khorshidi & Peterson, 2016; Smith et al., 2017; Artrith et al., 2017; Unke & Meuwly, 2018; Zhang et al., 2018b;a; Kovács et al., 2021; Thölke & De Fabritiis, 2021; Takamoto et al., 2022). Recently, graph neural network architectures (Gilmer et al., 2017; Schütt et al., 2017; Gasteiger et al., 2020; Liu et al., 2021) have gained popularity, as they provide a systematic strategy for building many-body correlation functions to capture highly complex PESs. In particular, equivariant representations have proven powerful for representing atomic environments (Satorras et al., 2021; Thomas et al., 2018; Qiao et al., 2021; Schütt et al., 2021; Batzner et al., 2022; Gasteiger et al., 2021; Liao & Smidt, 2022), leading to significant improvements on benchmarks such as MD17 and OC20/OC22. Some works have presented simulation-based results (Unke et al., 2021a; Park et al., 2021; Batzner et al., 2022; Musaelian et al., 2022)



Figure 1: Results on water-10k. Models are sorted by force mean absolute error (MAE) in descending order. Higher stability and lower RDF MAE are better. Performance in force error does not align with simulation-based metrics.

but do not compare different models with simulation-based metrics. Existing benchmarks for ML force fields (Ramakrishnan et al., 2014; Chmiela et al., 2017) mostly focus on force/energy prediction, with small molecules being the most typical systems. The catalyst-focused OC20 (Chanussot et al., 2021) and OC22 (Tran et al., 2022) benchmarks focus on structural relaxation with force computations, where force prediction is part of the evaluation metrics. The structural relaxation task involves relaxation processes of around hundreds of steps, and the goal is to predict the final relaxed structure/energy. These tasks do not characterize system properties under a structural ensemble, which requires simulations that are millions of steps long. Several recent works (Rosenberger et al., 2021) have also studied the utility of certain ML FFs in MD simulations. In particular, Stocker et al. (2022) use GemNet (Gasteiger et al., 2021) to simulate small molecules in the QM7-X (Hoja et al., 2021) dataset, with a focus on simulation stability. Zhai et al. (2022) apply the DeepMD (Zhang et al., 2018a) architecture to simulate water and demonstrate its shortcomings in generalizing across different phases. However, existing works focus on a single system and model, without proposing evaluation protocols and quantitative metrics for model comparison. Systematic benchmarks with simulation-based metrics are lacking in the existing literature, which obscures the challenges in applying ML FFs to MD applications.
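Simulation stability itself can be quantified in several ways. One simple proxy, sketched below for molecular systems, flags the first frame at which any bonded pair of atoms strays beyond a tolerance from its reference bond length. This is an illustrative criterion only; stability definitions in the literature vary (e.g., RDF-based checks are more natural for liquids), and the function and threshold here are hypothetical:

```python
import numpy as np

def first_unstable_frame(trajectory, bonds, ref_lengths, tol=0.5):
    """Return the index of the first frame where any bond deviates from
    its reference length by more than `tol` (same units as positions),
    or None if the trajectory stays stable throughout.

    trajectory: (n_frames, n_atoms, 3); bonds: (n_bonds, 2) atom indices;
    ref_lengths: (n_bonds,) reference bond lengths.
    Illustrative stability criterion -- actual benchmark definitions may differ.
    """
    for t, frame in enumerate(trajectory):
        # Current bond lengths for this frame
        d = np.linalg.norm(frame[bonds[:, 0]] - frame[bonds[:, 1]], axis=1)
        if np.any(np.abs(d - ref_lengths) > tol):
            return t
    return None
```

A model's stability score can then be reported, for example, as the fraction of simulated time before the first such violation, which makes the metric comparable across models and systems.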

