GRAMMAR-INDUCED GEOMETRY FOR DATA-EFFICIENT MOLECULAR PROPERTY PREDICTION

Abstract

The prediction of molecular properties is a crucial task in materials and drug discovery. The potential benefits of using deep learning techniques are reflected in the wealth of recent literature. Still, these techniques face a common challenge in practice: labeled data are limited by the cost of manual extraction from literature and laborious experimentation. In this work, we propose a data-efficient property predictor that utilizes a learnable hierarchical molecular grammar, which can generate molecules from grammar production rules. Such a grammar induces an explicit geometry on the space of molecular graphs, which provides an informative prior on molecular structural similarity. Property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, our evaluation shows that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data (only ∼100 samples) and its extension to molecular generation.

1. INTRODUCTION

Molecular property prediction is an essential step in the discovery of novel materials and drugs, as it applies to both high-throughput screening and molecule optimization. Machine learning, particularly deep learning, has made tremendous progress in predicting complex property values that are difficult to measure in reality due to the associated cost. Depending on the representation form of molecules, various methods have been proposed, including recurrent neural networks (RNN) for SMILES strings (Lusci et al., 2013; Goh et al., 2017), feed-forward networks (FFN) for molecule fingerprints (Tao et al., 2021b;a), and, more dominantly, graph neural networks (GNN) for molecule graphs (Yang et al., 2019; Bevilacqua et al., 2022; Aldeghi & Coley, 2022; Yu & Gao, 2022). They have been employed to predict biological and mechanical properties of both polymers and drug-like molecules. Typically, these methods learn a deep neural network that maps the molecular input into an embedding space, where molecules are represented as latent features and then transformed into property values. Despite their promising performance on common benchmarks, these deep learning-based approaches require a large amount of training data in order to be effective (Audus & de Pablo, 2017; Wieder et al., 2020). In practice, however, scientists often have small datasets at their disposal, in which case deep learning fails, particularly in the context of polymers (Subramanian et al., 2016; Altae-Tran et al., 2017). For example, due to the difficulty of generating and acquiring data (which usually entails synthesis, wet-lab measurement, and mechanical testing), state-of-the-art works on polymer property prediction using real data are limited to only a few hundred samples (Menon et al., 2019; Chen et al., 2021).
To compensate for the scarcity of experimental data, applied works often rely on labeled data generated by simulations, such as density functional theory and molecular dynamics (Aldeghi & Coley, 2022; Antoniuk et al., 2022). Yet, these techniques suffer from high computational costs, tedious parameter optimization, and a considerable discrepancy between simulations and experiments (Afzal et al., 2020; Chen et al., 2021), which limit their applicability in practice. Recent deep learning research has recognized the scarcity of molecular data in several domains and has developed methods handling small datasets, including self-supervised learning (Zhang et al., 2021; Rong et al., 2020; Wang et al., 2022; Ross et al., 2021), transfer learning (Hu et al., 2020),

