GRAMMAR-INDUCED GEOMETRY FOR DATA-EFFICIENT MOLECULAR PROPERTY PREDICTION

Abstract

The prediction of molecular properties is a crucial task in materials and drug discovery. The potential benefits of using deep learning techniques are reflected in the wealth of recent literature. Still, these techniques face a common challenge in practice: labeled data are limited by the cost of manual extraction from literature and by laborious experimentation. In this work, we propose a data-efficient property predictor that utilizes a learnable hierarchical molecular grammar capable of generating molecules from grammar production rules. Such a grammar induces an explicit geometry of the space of molecular graphs, which provides an informative prior on molecular structural similarity. Property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, our evaluation shows that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data (only ∼100 samples) and its extension to molecular generation.

1. INTRODUCTION

Molecular property prediction is an essential step in the discovery of novel materials and drugs, as it applies to both high-throughput screening and molecule optimization. Recent advances in machine learning, particularly deep learning, have made tremendous progress in predicting complex property values that are difficult to measure in reality due to the associated cost. Depending on the representation form of molecules, various methods have been proposed, including recurrent neural networks (RNN) for SMILES strings (Lusci et al., 2013; Goh et al., 2017), feed-forward networks (FFN) for molecule fingerprints (Tao et al., 2021b;a), and, more dominantly, graph neural networks (GNN) for molecular graphs (Yang et al., 2019; Bevilacqua et al., 2022; Aldeghi & Coley, 2022; Yu & Gao, 2022). These methods have been employed to predict biological and mechanical properties of both polymers and drug-like molecules. Typically, they learn a deep neural network that maps the molecular input into an embedding space, where molecules are represented as latent features and then transformed into property values. Despite their promising performance on common benchmarks, these deep learning-based approaches require a large amount of training data to be effective (Audus & de Pablo, 2017; Wieder et al., 2020). In practice, however, scientists often have small datasets at their disposal, in which case deep learning fails, particularly in the context of polymers (Subramanian et al., 2016; Altae-Tran et al., 2017). For example, due to the difficulty of generating and acquiring data, which usually entails synthesis, wet-lab measurement, and mechanical testing, state-of-the-art works on polymer property prediction using real data are limited to only a few hundred samples (Menon et al., 2019; Chen et al., 2021).
To compensate for the scarcity of experimental data, applied works often rely on labeled data generated by simulations, such as density functional theory and molecular dynamics (Aldeghi & Coley, 2022; Antoniuk et al., 2022). Yet, these techniques suffer from high computational costs, tedious parameter optimization, and a considerable discrepancy between simulations and experiments (Afzal et al., 2020; Chen et al., 2021), which limit their applicability in practice. Recent deep learning research has recognized the scarcity of molecular data in several domains and has developed methods for handling small datasets, including self-supervised learning (Zhang et al., 2021; Rong et al., 2020; Wang et al., 2022; Ross et al., 2021) and few-shot learning (Guo et al., 2021b; Stanley et al., 2021). These methods involve pre-training networks on large molecular datasets before applying them to domain-specific, small-scale target datasets. However, when applied to datasets of very small size (e.g., ∼300 samples), most of these methods perform poorly and are statistically unstable (Hu et al., 2020). Moreover, as we will show in our experiments, these methods are less reliable when deployed on target datasets that exhibit significant domain gaps from the pre-training dataset (e.g., inconsistency in molecule sizes).
As an alternative to pure deep learning-based methods, formal grammars over molecular structures offer an explicit, explainable representation for molecules and have shown great potential in addressing molecular tasks in a data-efficient manner (Kajino, 2019; Krenn et al., 2019; Guo et al., 2021a; 2022). A molecular grammar consists of a set of production rules that can be chained to generate molecules. The production rules, which can either be manually defined (Krenn et al., 2019; Guo et al., 2021a) or learned from data (Kajino, 2019; Guo et al., 2022), encode the constraints necessary for generating valid molecular structures, such as valency restrictions. A molecular grammar has the combinatorial capacity to represent a large number of molecules using a relatively small number of production rules. It has thus been adapted as a data-efficient generative model (Kajino, 2019; Guo et al., 2022). While grammar-based molecular generation has been widely studied and is relatively straightforward, extending the data-efficiency advantage of grammars to property prediction has not yet been well explored.

Motivation. In this paper, we propose a framework for highly data-efficient property prediction based on a learnable molecular grammar. The intuition behind this approach is that the production rule sequences used for molecule generation provide rich information about the similarity of molecular structures. For instance, two molecular structures that share a common substructure would share a common sequence of grammar production rules. As it is widely recognized in cheminformatics that molecules with similar structures have similar properties (Johnson & Maggiora, 1990; Martin et al., 2002), grammar production sequences can thus serve as a strong structural prior for predicting molecular properties. We aim to develop a model that explicitly represents grammar production sequences and captures the structure-level similarity between molecules.
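To make the notion of chained production rules concrete, the following toy sketch shows a string-rewriting "grammar" whose rules are applied in sequence to derive SMILES-like token strings. The rule names, symbols, and derivations here are purely illustrative and are not the paper's actual graph grammar, which operates on molecular graphs with valency constraints:

```python
# Toy illustration of a molecular grammar: each rule rewrites a nonterminal
# (here, any symbol longer than one character) into a mix of terminals and
# nonterminals. Rules and symbols are hypothetical, for exposition only.
RULES = {
    "r1": ("MOL", ["CHAIN"]),
    "r2": ("CHAIN", ["C", "CHAIN"]),       # extend the carbon backbone
    "r3": ("CHAIN", ["C", "O", "CHAIN"]),  # insert an ether linkage
    "r4": ("CHAIN", ["C"]),                # terminate the chain
}

def derive(rule_sequence):
    """Apply rules in order to the leftmost nonterminal; return the
    resulting terminal string (a SMILES-like token sequence)."""
    symbols = ["MOL"]
    for name in rule_sequence:
        lhs, rhs = RULES[name]
        # locate the leftmost nonterminal (multi-character symbol)
        i = next(j for j, s in enumerate(symbols) if len(s) > 1)
        assert symbols[i] == lhs, f"rule {name} does not apply here"
        symbols[i:i + 1] = rhs
    return "".join(symbols)

print(derive(["r1", "r2", "r3", "r4"]))  # -> CCOC
print(derive(["r1", "r2", "r4"]))        # -> CC
```

Note that the two derivations share the prefix (r1, r2), mirroring the shared `CC` substructure of the two outputs; this kind of shared production history is exactly the structural signal the proposed method exploits.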
Even from a few molecules, this model is expected to reveal a wealth of information about the structural relationships between molecules, providing the key to a data-efficient property predictor.

Framework. Figure 1 outlines our approach. At the heart of our method is a grammar-induced geometry (in the form of a graph) for the space of molecules. In this geometry, every path tracing from the root to a leaf represents a grammar production sequence that generates a particular molecule. Such a geometry explicitly captures the intrinsic closeness between molecules: structurally similar molecules are closer in distance along the geometry. In contrast to the embedding space used in most deep learning methods, our grammar-induced geometry is entirely explicit and can be integrated with the property predictor by considering all involved molecules simultaneously. To construct the geometry, we propose a hierarchical molecular grammar consisting of two parts: a pre-defined meta grammar at the top and a learnable molecular grammar at the bottom. We provide both theoretical and experimental evidence that the hierarchical molecular grammar is compact yet complete. To predict properties, we exploit the structural prior captured by the grammar-induced geometry using a graph neural diffusion model over the geometry. A joint optimization framework learns the geometry and the diffusion simultaneously.
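The idea of propagating property information along an explicit geometry can be sketched with plain graph heat diffusion. The 5-node graph, step size, and initial labeling below are hypothetical, and the paper's actual model is a learned graph neural diffusion rather than this fixed linear process; the sketch only illustrates how values spread along edges of a geometry graph:

```python
import numpy as np

# Hypothetical 5-node geometry graph (symmetric adjacency matrix).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

deg = A.sum(axis=1)
L = np.diag(deg) - A  # combinatorial graph Laplacian

def diffuse(x, step=0.1, n_steps=50):
    """Explicit-Euler integration of the graph heat equation dx/dt = -Lx.
    Node values flow along edges, smoothing toward neighboring nodes."""
    for _ in range(n_steps):
        x = x - step * (L @ x)
    return x

x0 = np.zeros(5)
x0[0] = 1.0          # property signal known only at node 0
x = diffuse(x0)      # every node now carries some propagated signal
```

After enough steps, nodes closer to the labeled node along the graph receive more of its signal, which is the sense in which a geometry where structurally similar molecules are nearby acts as a prior for prediction. The total signal is conserved because the Laplacian's rows (and, by symmetry, columns) sum to zero.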



Figure 1: Overview. Given a set of molecules, we learn a hierarchical molecular grammar that can generate molecules from production rules. The hierarchical molecular grammar induces an explicit geometry for the space of molecules, where structurally similar molecules are closer in distance along the geometry. Such a grammar-induced geometry provides an informative prior for data-efficient property prediction. We achieve this by using graph neural diffusion over the geometry.

