LEARNING SYMBOLIC MODELS FOR GRAPH-STRUCTURED PHYSICAL MECHANISM

Abstract

Graph-structured physical mechanisms are ubiquitous in real-world scenarios, so revealing their underlying formulas is of great importance for scientific discovery. However, classical symbolic regression methods fail on this task because they can only handle input-output pairs that are not graph-structured. In this paper, we propose a new approach that generalizes symbolic regression to graph-structured physical mechanisms. The essence of our method is to model the formula skeleton with a message-passing flow, which transforms the discovery of the skeleton into a search for the message-passing flow. This transformation allows us to search for a message-passing flow that is efficient and Pareto-optimal in terms of both accuracy and simplicity. The underlying formulas can then be identified by interpreting the component functions of the searched message-passing flow, reusing classical symbolic regression methods. We conduct extensive experiments on datasets from different physical domains, including mechanics, electricity, and thermology, and on real-world datasets of pedestrian dynamics without ground-truth formulas. The experimental results not only verify the rationale of our design but also demonstrate that the proposed method can automatically learn precise and interpretable formulas for graph-structured physical mechanisms.

1. INTRODUCTION

For centuries, the development of the natural sciences has relied on human intuition to abstract physical mechanisms, represented by symbolic models (i.e., mathematical formulas), from experimental data recording natural phenomena. Many of these mechanisms are naturally graph-structured (Leech, 1966), where the physical quantities are associated with individual objects (e.g., mass), pair-wise relationships (e.g., force), and the whole system (e.g., overall energy), corresponding to three types of variables on graphs: node, edge, and global variables. For example, as shown in Figure 1 (a), the mechanical interaction mechanism in a multi-body problem corresponds to a graph with masses ($m_i$) and positions ($\vec{V}_i$) as node attributes and spring constants ($k_{ij}$) as edge attributes, which, together with the graph connectivity, yield the accelerations as output attributes of the nodes. In the case of a resistor circuit, nodes and edges correspond to voltages and resistances, respectively, and these attributes define a graph-level output, the overall power of the circuit.
In the past few years, Symbolic Regression (SR) (Sahoo et al., 2018; Schmidt & Lipson, 2009; Udrescu et al., 2020), which searches for symbolic models $y = F(x)$ from experimentally obtained input-output pairs $\{(x, y)\}$ with $F$ being an explicit formula, has become a promising approach to automating scientific discovery. Traditional SR methods include genetic programming-based methods (Schmidt & Lipson, 2009; Fortin et al., 2012), which generate candidate formulas by "evolution" (i.e., manipulations), and deep learning-based methods (Li et al., 2019; Biggio et al., 2021; Zheng et al., 2021), which use sequence models to generate candidate formulas. However, these methods are designed for traditional SR problems on input-output pairs $\{(x, y)\}$ and do not take graph information into account.
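To make the two examples of Figure 1 (a) concrete, the following minimal sketch computes a node-level output (accelerations in a spring system) and a graph-level output (total power of a resistor circuit) from node and edge attributes. It is only an illustration under stated assumptions, not the formulas or code used in this paper: the springs are assumed ideal with zero rest length, the power is the sum of Joule dissipation over resistors, and the function names `spring_accelerations` and `circuit_power` are hypothetical.

```python
import numpy as np

def spring_accelerations(edges, k, pos, mass):
    """Node-level output: edge terms are summed per node, then scaled by mass.

    Assumption: ideal springs with zero rest length, so the force on node i
    from edge (i, j) is k_ij * (pos_j - pos_i).
    """
    acc = np.zeros_like(pos, dtype=float)
    for (i, j), k_ij in zip(edges, k):
        force_on_i = k_ij * (pos[j] - pos[i])  # edge-level component
        acc[i] += force_on_i                   # aggregate on node i
        acc[j] -= force_on_i                   # equal and opposite on node j
    return acc / mass[:, None]                 # node-level component

def circuit_power(edges, resistance, voltage):
    """Graph-level output: per-edge dissipation summed over the whole graph."""
    return sum((voltage[i] - voltage[j]) ** 2 / r
               for (i, j), r in zip(edges, resistance))

# Two masses joined by one spring.
acc = spring_accelerations(
    edges=[(0, 1)], k=[2.0],
    pos=np.array([[0.0, 0.0], [1.0, 0.0]]),
    mass=np.array([1.0, 2.0]),
)

# Two resistors in series (2 ohm and 4 ohm) carrying a current of 1 A.
power = circuit_power(
    edges=[(0, 1), (1, 2)], resistance=[2.0, 4.0],
    voltage=np.array([6.0, 4.0, 0.0]),
)
```

The two functions differ in how the edge-level terms are combined (per-node sum followed by a node-level scaling versus a single sum over the graph), which is the kind of structural difference that Figure 1 (c) refers to as different formula skeletons.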
To exploit the inherent graph structure in physical mechanisms, as shown in Figure 1 (b), SR on graphs aims to find a formula $F$ that characterizes a mapping from input $\{G, X\}$ to output $y$, with $X$ and $y$ both residing on the graph structure $G$. This requires both careful exploitation of the inherent graph structure of physical mechanisms and sufficient flexibility to cover the diverse forms of interaction between entities in the physical world. Graph Neural Networks (GNNs) have recently been incorporated into SR for discovering the mechanisms behind particle interactions (Cranmer et al., 2020; Lemos et al., 2022). However, these approaches have an obvious drawback: the message-passing flow of the GNN, which corresponds to the formula skeleton, must be designed manually in order to learn the underlying mechanism. This is impractical because formula skeletons are usually unknown and differ significantly across physical domains, as shown in Figure 1 (c). To solve this problem, inspired by the correspondence between the skeleton and the message-passing flow in a GNN, our core idea is to transform the discovery of the skeleton into a search for the message-passing flow, which paves the way for identifying the underlying formula by interpreting each component function in the searched message-passing flow. However, because the skeleton and the component formulas within it are coupled, neither can be identified independently, which implies a vast, highly entangled search space for both the message-passing flow and the component functions. To tackle this challenge, we formulate a bi-level optimization problem that searches for the message-passing flow with a pruning strategy at the upper level, on the condition that its component functions have been optimized with deep learning (DL) at the lower level. Besides empirical accuracy, it is equally vital but non-trivial to maintain explicit interpretability and generalization ability in the discovered formulas.
We propose to search for the message-passing flow that is Pareto-optimal between accuracy and simplicity by carefully designing a scoring function, which involves a complexity function of message-passing flows and optimizes both aspects across search steps. Our contributions can be summarized as follows:
• We generalize the problem of learning formulas with given skeletons (inductive bias) from graph data in Cranmer et al. (2020) by additionally learning the formula skeleton from data, which is essential for learning graph-structured physical mechanisms from diverse physical domains.
• We propose a novel method to learn graph-structured physical mechanisms from data without knowing the formula skeleton, by searching for the Pareto-optimal message-passing flows of a GNN together with the symbolic models as components.
• We conduct experiments on five datasets from diverse physical domains, including mechanics, electricity, thermology, and two real-world datasets about pedestrian dynamics, demonstrating that our model can first automatically identify the correct skeleton from collected data instead of expert knowledge and then learn the overall symbolic model for the corresponding graph-structured physical mechanism.
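As a reminder of what Pareto optimality between accuracy and simplicity means, the sketch below filters a set of candidates down to those that are not dominated in both complexity and error. It illustrates only the general notion; it is not the scoring function or search procedure of this paper, and the candidate names, complexity values, and error values are made up for illustration.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    Each candidate is (name, complexity, error); lower is better for both.
    A candidate is dominated if another one is no worse in both criteria
    and strictly better in at least one.
    """
    front = []
    for name, c, e in candidates:
        dominated = any(
            (c2 <= c and e2 <= e) and (c2 < c or e2 < e)
            for _, c2, e2 in candidates
        )
        if not dominated:
            front.append((name, c, e))
    return sorted(front, key=lambda t: t[1])  # order by complexity

# Hypothetical candidates: (name, complexity, validation error).
candidates = [("A", 1, 0.9), ("B", 2, 0.3), ("C", 3, 0.35), ("D", 4, 0.1)]
front = pareto_front(candidates)
```

Here candidate C is dropped because B is both simpler and more accurate, while A, B, and D each represent a different trade-off between the two criteria.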

2. THE PROPOSED METHOD

Before introducing the proposed method, we first formally define the problem of symbolic regression on graphs.



Figure 1: (a) Two examples of graph-structured physical mechanisms. (b) Illustration of traditional SR (left) and SR on graphs (right). (c) Two different formula skeletons in the two examples, where all formula components are interconnected accordingly to construct an overall formula.

