TARGET-FREE LIGAND SCORING VIA ONE-SHOT LEARNING

Abstract

Scoring ligands in a library based on their structural similarity to a known hit compound is widely used in drug discovery following high-throughput screening. However, such "similarity search" relies on the assumption that structurally similar compounds have similar activities, and will therefore only retrieve ligands with hit-like affinity, requiring resource-intensive tweaking by medicinal chemists to reach a more active lead compound. We propose a new approach, One-Shot Ligand Scoring (OSLS), that uses one-shot learning and is better suited to directly retrieving lead-like compounds from a library. For this new task, we design a Siamese-inspired neural architecture using two Transformer encoders without tied weights, a novel positional encoding-like mechanism, and a final prediction head. OSLS scores ligands by activity against a target without any target-specific knowledge beyond a single known activity value, making it a cost-effective approach to ligand-based or phenotypic drug discovery. We show that OSLS surpasses traditional similarity search as well as modern deep learning baselines on a simulated ligand retrieval task. Furthermore, we demonstrate the applicability of our approach to various drug discovery tasks that also involve ligand scoring, including drug repositioning, precision patient-level drug efficacy prediction, and even molecular generative modeling.

1. INTRODUCTION

Contemporary drug discovery is a costly and time-consuming process requiring billions of dollars per new approved drug. A significant portion of the total development cost is incurred in preclinical stages, where medicinal chemists identify one or more "hit" compounds with activity against the target of interest from high-throughput screens, retrieve structural analogs of these hits from large molecular catalogs for further exploration and development, and eventually produce a lead compound after much optimization (de Souza Neto et al., 2020; Hughes et al., 2011). Currently, the concept of chemical similarity is critical to the retrieval of these structural analogs through pairwise similarity scoring between the hit compound and each compound in a library. Commonly used similarity metrics include the Tanimoto similarity computed between binary fingerprints of two molecules (Bajusz et al., 2015), as well as other more specialized metrics and molecular featurizations (Cereto-Massagué et al., 2015; Nikolova & Jaworska, 2003; Maziarka et al., 2020; Jaeger et al., 2018; Coupry & Pogány, 2022; Gandini et al., 2022). Retrieving structural analogs from a chemical library with such similarity scoring tends to yield compounds whose activity is similar to that of the initial hit, based on the concept that chemically similar compounds have similar activity (Johnson & Maggiora, 1990). However, hit compounds from experimental screens typically have low activities, e.g. binding affinities of 1-10 µM, compared to the typical 1-10 nM goal of preclinical drug development, a thousand-fold difference (Freire, 2015; Hughes et al., 2011). Thus, retrieving library compounds based purely on their similarity to an early-stage hit compound is not an optimal strategy, as this is expected to yield compounds with hit-like activity instead of the desired highly active compounds.
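As a minimal illustration of the similarity-based retrieval described above, the following sketch computes Tanimoto similarity between binary fingerprints and ranks a library against a hit. The fingerprints are represented as sets of on-bit indices; the compound names, bit indices, and the helper `rank_by_similarity` are hypothetical and chosen only for illustration, not taken from any real library or toolkit.

```python
from typing import Dict, FrozenSet, List


def tanimoto(fp_a: FrozenSet[int], fp_b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two binary fingerprints.

    Each fingerprint is given as the set of indices of its on-bits,
    so the similarity is |A & B| / |A | B|.
    """
    union = len(fp_a | fp_b)
    if union == 0:
        # Two empty fingerprints: define similarity as 0 by convention here.
        return 0.0
    return len(fp_a & fp_b) / union


def rank_by_similarity(
    hit_fp: FrozenSet[int], library: Dict[str, FrozenSet[int]]
) -> List[str]:
    """Return library compound names sorted by decreasing similarity to the hit."""
    return sorted(
        library,
        key=lambda name: tanimoto(hit_fp, library[name]),
        reverse=True,
    )


# Toy (hypothetical) fingerprints: on-bit indices only.
hit = frozenset({1, 4, 7, 9})
library = {
    "cmpd_a": frozenset({1, 4, 7, 9, 12}),  # close analog of the hit
    "cmpd_b": frozenset({1, 4, 20, 21}),    # partial overlap
    "cmpd_c": frozenset({30, 31}),          # different scaffold
}

ranking = rank_by_similarity(hit, library)
print(ranking)
```

Note that the ranking depends only on structural overlap with the hit: the hit's measured activity never enters the score, which is why this procedure tends to return compounds with hit-like activity.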
Compound scoring via similarity is common for other related problems in drug discovery, such as drug repositioning (Jarada et al., 2020) and lead optimization (Hughes et al., 2011), which also suffer from the problem of measuring similarity to a weakly active compound. Here, we propose One-Shot Ligand Scoring (OSLS), an alternative to chemical similarity that predicts the activity of an experimentally uncharacterized query compound (e.g. a compound drawn from a chemical library) against an unseen target based on a single context compound and its experimentally known activity against that target (e.g. a hit from an experimental screen). Like standard chemical similarity-based scoring, OSLS needs no information about the target protein (Zheng et al., 2013; Vijayan et al., 2021). However, OSLS is distinct from standard measures of chemical similarity because, by using a one-shot learning paradigm, it can assign the highest scores to the most active compounds instead of those most similar to a weakly active hit. Specifically, we
• introduce the novel formulation of ligand scoring as a one-shot regression problem, and argue for its utility over traditional similarity-based scoring;
• design a novel architecture, OSLS, which addresses this problem by using a Siamese-inspired neural architecture to extract target information from the known activity of a context compound and use it to directly predict the activity of a query compound;
• show that OSLS outperforms both similarity-based and modern deep learning scoring techniques in settings relevant to compound retrieval and the related tasks of drug repositioning, patient-level drug efficacy prediction, and generative modeling.
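The contrast between the two scoring paradigms can be summarized informally as follows. The notation is assumed here purely for exposition: x_c denotes the context compound, y_c its measured activity against the target, x_q a query compound, s a fixed or learned similarity function, and f_θ a learned one-shot regressor.

```latex
\begin{align*}
\text{similarity scoring:} \quad
  & \mathrm{score}(x_q) = s(x_q, x_c) \\
\text{one-shot regression:} \quad
  & \mathrm{score}(x_q) = \hat{y}_q = f_\theta(x_q \mid x_c, y_c)
\end{align*}
```

Under this reading, similarity scoring ignores y_c and is maximized by compounds that resemble x_c, whereas one-shot regression conditions on both x_c and y_c and ranks queries by their predicted activity.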

2. RELATED WORK

In this work, we focus on cases where information about a targeted protein (e.g. amino acid sequence or 3D structure) is not used for compound scoring, i.e. so-called "ligand-based" or "phenotypic" drug discovery (Sharma et al., 2021; Swinney & Anthony, 2011; Zheng et al., 2013). In this setting, drug discovery begins from one or a few existing compounds with some known activity, often obtained from screening or known natural ligands. Scoring of additional compounds is then currently done using either chemical similarity or N-shot learning approaches.
Chemical similarity. Chemical similarity is commonly used when compounds with a desired activity are known, and involves computing the pairwise similarity between the known actives and each compound to be scored. When using a highly active known compound, similarity acts as a surrogate measure of activity (Johnson & Maggiora, 1990), although this approximation fails as the known compound becomes less active. Computing similarity is commonly done with binary fingerprints (e.g. circular fingerprints; Rogers & Hahn, 2010), although this approach will often undesirably miss compounds with similar activity but different chemical scaffolds. For this reason, many other chemical features have been suggested for representing molecules, including simple molecular weight as well as more complex representations that capture information about molecular topology and 3D shape/charge (Khan et al., 2016; Li et al., 2012; Kohlbacher et al., 2021; Kearnes & Pande, 2016). However, such approaches are, arguably, based more on chemical intuition than data, and choosing which of hundreds of molecular descriptors to use adds another level of uncertainty. For this reason, machine learning-based techniques have been proposed that derive chemical similarity in a data-driven fashion and thus promise to improve the quality of similarity measurement. In particular, much work has been dedicated to learning molecular featurizations in an unsupervised fashion, which can later be used for downstream tasks such as similarity (Jaeger et al., 2018; Huang et al., 2021; Li & Jiang, 2021; Morris et al., 2020). Due to their unsupervised nature, however, similarity measurements between machine learning-derived embeddings are not necessarily meaningful for activity, as structurally dissimilar molecules may have similar activities, and vice versa. Because of this, more direct N-shot learning approaches (Schimunek et al., 2021; Altae-Tran et al., 2017; Stanley et al., 2021; Lee et al., 2022) have been proposed to leverage vast amounts of existing activity data toward the scoring of new ligands against novel targets.
N-shot learning. N-shot learning techniques directly use existing compounds (the "context", also called the "support set") to predict the activity of unknown compounds (the "query set") without relying on similarity as an imperfect surrogate of activity. The application of Siamese networks to one-shot learning, a form of N-shot learning involving a single context example, was first introduced in computer vision (Koch et al., 2015), and existing Siamese-based techniques, such as those

