CONTEXT-ENRICHED MOLECULE REPRESENTATIONS IMPROVE FEW-SHOT DRUG DISCOVERY

Abstract

A central task in computational drug discovery is to construct models from known active molecules to find further promising molecules for subsequent screening. However, typically only very few active molecules are known. Therefore, few-shot learning methods have the potential to improve the effectiveness of this critical phase of the drug discovery process. We introduce a new method for few-shot drug discovery. Its main idea is to enrich a molecule representation with knowledge about known context or reference molecules. Our novel concept for molecule representation enrichment is to associate molecules from both the support set and the query set with a large set of reference (context) molecules through a modern Hopfield network. Intuitively, this enrichment step is analogous to a human expert who would associate a given molecule with familiar molecules whose properties are known. The enrichment step reinforces and amplifies the covariance structure of the data, while simultaneously removing spurious correlations arising from the decoration of molecules. We compare our approach with other few-shot methods for drug discovery on the FS-Mol benchmark dataset. On FS-Mol, our approach outperforms all compared methods and therefore sets a new state of the art for few-shot learning in drug discovery. An ablation study shows that the enrichment step of our method is the key to improving predictive quality. In a domain shift experiment, we further demonstrate the robustness of our method. Code is available at https://github.com/ml-jku/MHNfs.

1. INTRODUCTION

To improve human health, combat diseases, and tackle pandemics, there is a steady need to discover new drugs in a fast and efficient way. However, the drug discovery process is time-consuming and cost-intensive (Arrowsmith, 2011). Deep learning methods have been shown to reduce the time and costs of this process (Chen et al., 2018; Walters and Barzilay, 2021). They diminish the required number of both wet-lab measurements and molecules that must be synthesized (Merk et al., 2018; Schneider et al., 2020). However, as of now, deep learning approaches use only the molecular information about the ligands after being trained on a large training set. At inference time, they yield highly accurate property and activity prediction models (Mayr et al., 2018; Yang et al., 2019), generative models (Segler et al., 2018a; Gómez-Bombarelli et al., 2018), or synthesis models (Segler et al., 2018b; Seidl et al., 2022). Deep learning methods in drug discovery usually require large amounts of biological measurements. To train deep learning-based activity and property prediction models with high predictive performance, hundreds or thousands of data points per task are required. For example, well-performing predictive models for activity prediction tasks of ChEMBL have been trained with an average of 3,621 activity points per task, i.e., per drug target, by Mayr et al. (2018). The ExCAPE-DB dataset provides on average 42,501 measurements per task (Sun et al., 2017; Sturm et al., 2020). Wu et al. (2018) published a large-scale benchmark for molecular machine learning, including prediction models for the SIDER dataset (Kuhn et al., 2016) with an average of 5,187 data points, Tox21 (Huang et al., 2016b; Mayr et al., 2016) with on average 9,031, and ClinTox (Wu et al., 2018) with 1,491 measurements per task.
However, for typical drug design projects, the amount of available measurements is very limited (Stanley et al., 2021; Waring et al., 2015; Hochreiter et al., 2018), since in vitro experiments are expensive and time-consuming. Therefore, methods that need only few measurements to build precise prediction models are desirable. This problem, i.e., the challenge of learning from few data points, is the focus of machine learning areas like meta-learning (Schmidhuber, 1987; Bengio et al., 1991; Hochreiter et al., 2001) and few-shot learning (Miller et al., 2000; Bendre et al., 2020; Wang et al., 2020). Few-shot learning tackles the low-data problem that is ubiquitous in drug discovery. Few-shot learning methods have been predominantly developed and tested on image datasets (Bendre et al., 2020; Wang et al., 2020) and have recently been adapted to drug discovery problems (Altae-Tran et al., 2017; Guo et al., 2021; Wang et al., 2021; Stanley et al., 2021; Chen et al., 2022). They are usually categorized into three groups according to their main approach (Bendre et al., 2020; Wang et al., 2020; Adler et al., 2020). a) Data-augmentation-based approaches augment the available samples and generate new, more diverse data points (Chen et al., 2020; Zhao et al., 2019; Antoniou and Storkey, 2019). b) Embedding-based and nearest-neighbour approaches learn embedding space representations. Predictive models can then be constructed from only few data points by comparing these embeddings. For example, in Matching Networks (Vinyals et al., 2016), an attention mechanism that relies on embeddings is the basis for the predictions. Prototypical Networks (Snell et al., 2017) create prototype representations for each class using the above-mentioned representations in the embedding space. c) Optimization-based or fine-tuning methods utilize a meta-optimizer that focuses on efficiently navigating the parameter space.
For example, with MAML the meta-optimizer learns initial weights that can be adapted to a novel task by few optimization steps (Finn et al., 2017). Most of these approaches have already been applied to few-shot drug discovery (see Section 4). Surprisingly, almost all these few-shot learning methods in drug discovery are worse than a naive baseline, which does not even use the support set (see Section 5). We hypothesize that the underperformance of these methods stems from disregarding the context, both in terms of similar molecules and similar activities. Therefore, we propose a method that informs the representations of the query and support set with a large number of context molecules covering the chemical space.

Enriching molecule representations with context using associative memories. In data-scarce situations, humans extract co-occurrences and covariances by associating current perceptions with memories (Bonner and Epstein, 2021; Potter, 2012). When we show a small set of active molecules to a human expert in drug discovery, the expert associates them with known molecules to suggest further active molecules (Gomez, 2018; He et al., 2021). In an analogous manner, our novel concept for few-shot learning uses associative memories to extract co-occurrences and the covariance structure of the original data and to amplify them in the representations (Fürst et al., 2022). We use Modern Hopfield Networks (MHNs) as an associative memory, since they can store a large set of context molecule representations (Ramsauer et al., 2021, Theorem 3). The representations that are retrieved from the MHNs replace the original representations of the query and support set molecules. Those retrieved representations have amplified co-occurrences and covariance structures, while peculiarities and spurious co-occurrences of the query and support set molecules are averaged out.
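To make the embedding-based category (b) above concrete, the following is a minimal sketch of a Prototypical-Networks-style prediction: class prototypes are mean embeddings of the support molecules, and a query is classified by its distance to the prototypes. The embeddings and helper name are illustrative, not part of any of the cited implementations.

```python
import math

def prototype_predict(query_emb, support_embs, support_labels):
    """Predict the class of a query molecule from class prototypes.

    Each prototype is the mean embedding of the support molecules of
    one class; negative squared Euclidean distances to the prototypes
    are turned into class probabilities via a softmax.
    """
    classes = sorted(set(support_labels))
    # Mean embedding per class = prototype
    protos = {}
    for c in classes:
        members = [e for e, y in zip(support_embs, support_labels) if y == c]
        protos[c] = [sum(vals) / len(members) for vals in zip(*members)]
    # Negative squared distance as similarity score
    scores = {c: -sum((q - p) ** 2 for q, p in zip(query_emb, protos[c]))
              for c in classes}
    # Softmax over classes (shifted by the max for numerical stability)
    mx = max(scores.values())
    exp = {c: math.exp(s - mx) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: exp[c] / z for c in classes}
```

With only a handful of support molecules per class, such a model needs no gradient-based adaptation at test time, which is what makes this family attractive for the low-data regime discussed above.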
In this work, our contributions are the following:
• We propose a new architecture, MHNfs, for few-shot learning in drug discovery.
• We achieve a new state-of-the-art on the benchmarking dataset FS-Mol.
• We introduce a novel concept to enrich the molecule representations with context by associating them with a large set of context molecules.
• We add a naive baseline to the FS-Mol benchmark that yields better results than almost all other published few-shot learning methods.
• We provide results of an ablation study and a domain shift experiment to further demonstrate the effectiveness of our new method.

2. PROBLEM SETTING

Drug discovery projects revolve around models g(m) that can predict a molecular property or activity ŷ, given a representation m of an input molecule from a chemical space M. We consider machine learning models ŷ = g_w(m) with parameters w that have been selected using a training set. Typically,

