FORCES ARE NOT ENOUGH: BENCHMARK AND CRITICAL EVALUATION FOR MACHINE LEARNING FORCE FIELDS WITH MOLECULAR SIMULATIONS

Abstract

Molecular dynamics (MD) simulation techniques are widely used across the natural sciences. Increasingly, machine learning (ML) force field (FF) models are beginning to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case is to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for ML MD simulation. We curate representative MD systems, including water, organic molecules, a peptide, and materials, and design evaluation metrics corresponding to the scientific objectives of the respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, and offer directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate further work.

1. INTRODUCTION

Molecular dynamics (MD) simulations provide atomistic insights into physical phenomena in materials and biological systems. Such simulations are typically based on force fields (FFs) that characterize the underlying potential energy surface (PES) of the system; the resulting forces are then integrated under Newtonian mechanics to simulate long trajectories (Frenkel & Smit, 2001). The PES itself is challenging to compute: ideally it would be obtained from quantum chemistry, which is computationally expensive. Traditionally, the alternative has been parameterized force fields, which are surrogate models built from empirically chosen functional forms (Halgren, 1996). Recently, machine learning (ML) force fields (Unke et al., 2021b) have shown promise to accelerate MD simulations by orders of magnitude while retaining quantum-chemical accuracy. The evidence supporting the utility of ML FFs is often based on their accuracy in reproducing forces across test cases (Faber et al., 2017). These evaluations invariably omit simulations. However, we show that force accuracy alone does not suffice for effective simulation (Figure 1). MD simulation not only describes the microscopic details of how the system evolves, but also yields macroscopic observables that characterize system properties. Calculating meaningful observables often requires long simulations to sample the underlying equilibrium distribution. These observables are designed to be predictive of material properties, such as diffusivity in electrolyte materials (Webb et al., 2015), and to reveal detailed physical mechanisms, such as the folding kinetics of proteins (Lane et al., 2011). Although these observables are critical products of MD simulations, systematic evaluations have not been sufficiently studied in existing



Figure 1: Results on water-10k. Models sorted by force mean absolute error (MAE) in descending order. High stability and low RDF MAE are better. Performance in force error does not align with simulation-based metrics.

