TARGET-FREE LIGAND SCORING VIA ONE-SHOT LEARNING

Abstract

Scoring ligands in a library based on their structural similarity to a known hit compound is widely used in drug discovery following high-throughput screening. However, such "similarity search" relies on the assumption that structurally similar compounds have similar activities, and will therefore only retrieve ligands with hit-like affinity, requiring resource-intensive tweaking by medicinal chemists to reach a more active lead compound. We propose a novel approach, One-Shot Ligand Scoring (OSLS), that is much more capable of directly retrieving lead-like compounds from a library using a novel one-shot learning technique. For this new task, we design a Siamese-inspired neural architecture using two Transformer encoders without tied weights, a novel positional encoding-like mechanism, and a final prediction head. OSLS is able to score ligands by activity against a target without any target-specific knowledge beyond a single known activity value, a cost-effective approach to ligand-based or phenotypic drug discovery. We show that OSLS surpasses traditional similarity search as well as modern deep learning baselines on a simulated ligand retrieval task. Furthermore, we demonstrate the applicability of our approach on various drug discovery tasks that also involve ligand scoring, including drug repositioning, precision patient-level drug efficacy prediction, and even molecular generative modeling.

1. INTRODUCTION

Contemporary drug discovery is a costly and time-consuming process requiring billions of dollars per new approved drug. A significant portion of the total development cost is incurred in preclinical stages, where medicinal chemists identify one or more "hit" compounds with activity against the target of interest from high-throughput screens, retrieve structural analogs of these hits from large molecular catalogs for further exploration and development, and eventually produce a lead compound after much optimization (de Souza Neto et al., 2020; Hughes et al., 2011). Currently, the concept of chemical similarity is critical to the retrieval of these structural analogs, which relies on pairwise similarity scoring between the hit compound and each compound in a library. Commonly used similarity metrics include the Tanimoto similarity computed between binary fingerprints of two molecules (Bajusz et al., 2015), as well as other more specialized metrics and molecular featurizations (Cereto-Massagué et al., 2015; Nikolova & Jaworska, 2003; Maziarka et al., 2020; Jaeger et al., 2018; Coupry & Pogány, 2022; Gandini et al., 2022). Retrieving structural analogs from a chemical library with such similarity scoring tends to yield compounds whose activity is similar to that of the initial hit, based on the concept that chemically similar compounds have similar activity (Johnson & Maggiora, 1990). However, hit compounds from experimental screens typically have low activities, e.g. binding affinities of 1-10 µM, compared to the typical 1-10 nM goal of preclinical drug development, a thousand-fold difference (Freire, 2015; Hughes et al., 2011). Thus, retrieving library compounds based purely on their similarity to an early-stage hit compound is not an optimal strategy, as this is expected to yield compounds with hit-like activity instead of the desired highly active compounds.
Compound scoring via similarity is also common in related drug discovery problems, such as drug repositioning (Jarada et al., 2020) and lead optimization (Hughes et al., 2011), which likewise suffer from the problem of measuring similarity to a weakly active compound.
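
For concreteness, the traditional similarity search described above can be sketched as follows. This is a minimal illustration, not the method proposed in this paper: fingerprints are assumed to be given as sets of on-bit indices, the compound names and bit patterns are made up for the example, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two binary fingerprints.

    Each fingerprint is represented by the set of indices of its on bits,
    so the score is |a AND b| / |a OR b|.
    """
    if not a and not b:
        # Assumed convention: two empty fingerprints score 0.0.
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(
    hit: Set[int], library: Dict[str, Set[int]]
) -> List[Tuple[str, float]]:
    """Score every library compound against the hit, most similar first."""
    scored = [(name, tanimoto(hit, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: on-bit indices of a hit and three library compounds.
hit_fp = {1, 4, 7, 9}
library = {
    "cpd_A": {1, 4, 7, 9, 12},  # shares 4 of 5 distinct bits with the hit
    "cpd_B": {1, 4, 20, 21},    # shares 2 of 6 distinct bits with the hit
    "cpd_C": {30, 31},          # shares no bits with the hit
}

ranking = rank_by_similarity(hit_fp, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Note that the ranking depends only on structural overlap with the hit and never uses the hit's measured activity, which is why such a search tends to return compounds with hit-like rather than lead-like activity.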

