HOW MUCH SPACE HAS BEEN EXPLORED? MEASURING THE CHEMICAL SPACE COVERED BY DATABASES AND MACHINE-GENERATED MOLECULES

Abstract

Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While most databases and machine-learning-based generation models aim to optimize particular chemical properties, there is limited literature on how to properly measure the coverage of the chemical space by those candidates included or generated. This problem is challenging due to the lack of formal criteria to select good measures of the chemical space. In this paper, we propose a novel evaluation framework for measures of the chemical space based on two analyses: an axiomatic analysis with three intuitive axioms that a good measure should obey, and an empirical analysis on the correlation between a measure and a proxy gold standard. Using this framework, we are able to identify #Circles, a new measure of chemical space coverage, which is superior to existing measures both analytically and empirically. We further evaluate how well the existing databases and generation models cover the chemical space in terms of #Circles. The results suggest that many generation models fail to explore a larger space over existing databases, which leads to new opportunities for improving generation models by encouraging exploration.

1. INTRODUCTION

To efficiently navigate the huge chemical space for drug discovery, machine learning (ML) based approaches have been broadly designed and deployed, especially de novo molecular generation methods (Elton et al., 2019; Schwalbe-Koda & Gómez-Bombarelli, 2020; Bian & Xie, 2021; Deng et al., 2022). Such generation models learn to generate candidate drug designs by optimizing various molecular property scores, such as binding affinity scores. In practice, these scores can be obtained computationally using biological activity prediction models (Olivecrona et al., 2017; Li et al., 2018), which is the key to obtaining massive labeled training data for machine learning. However, high in silico property scores are far from sufficient, as there is usually a considerable misalignment between these scores and in vivo behavior. Costly wet-lab experiments are still needed to verify potential drug hits, and only a limited number of drug candidates can be tested. In light of this cost constraint, it is critical to select or generate drug candidates that not only have high in silico scores but also cover a large portion of the chemical space. As the functional difference between molecules is closely related to their structural difference (Huggins et al., 2011; Wawer et al., 2014), better coverage of the chemical space will likely lead to a higher chance of hits in wet-lab experiments. For this purpose, quantitative coverage measures of the chemical space become crucial. Such measures can be used both to evaluate and compare candidate libraries and, when incorporated into training objectives, to encourage ML models to better explore the chemical space. In this paper, we investigate the problem of quantitatively measuring the coverage of the chemical space by a candidate library. A few coverage measures of the chemical space already exist.
For example, richness counts the number of unique compounds in a molecular set, and it has been used to describe how well a model is able to generate unique structures (Shi & von Itzstein, 2019; Polykovskiy et al., 2020). In addition, molecular fingerprints have been used to calculate the pairwise similarity or distance between two compounds, and the average of these pairwise distances has been used to describe the overall internal diversity of a molecular set (Brown et al., 2019; Polykovskiy et al., 2020). However, most existing coverage measures are proposed heuristically, and their validity is rarely justified. In fact, defining the "right" measure for the coverage of chemical space is challenging. Unlike molecular property scores, there is no obvious "ground truth" for the coverage of chemical space. Moreover, the chemical space is complex and combinatorial, making the design of a good measure even more difficult. To address the fundamental problem of properly measuring the coverage of chemical space for drug discovery, we propose a novel evaluation framework with two complementary criteria for evaluating the validity of coverage measures. We first formally define the concept of coverage measures on the chemical space (referred to as chemical space measures), a definition that covers many existing heuristic measures (Section 3). Then we introduce the two criteria and compare various (existing and new) chemical space measures based on these criteria (Section 4). Specifically, the first criterion (Section 4.1) is based on an axiomatic analysis with three intuitive axioms that a good chemical space measure should satisfy. Surprisingly, most heuristic measures that are commonly used in the literature, such as internal diversity, fail to satisfy these intuitive axioms.
The second criterion (Section 4.2) compares the chemical space measures with a proxy of the gold standard: the number of unique biological functionalities covered by the set of molecules. We find that #Circles, a new chemical space coverage measure (defined in Section 3.2.3) that has a strong basis in the mathematical literature, not only satisfies all three axioms but also correlates better with the proxy gold standard. Finally, we apply the #Circles measure to evaluate how well existing databases and ML models cover the chemical space (Section 5). Interestingly, the evaluation results suggest that many ML models fail to explore a larger portion of the chemical space than drug candidates obtained from virtual screening over existing databases. We believe these findings point to a new direction for improving ML-based drug candidate generation models: encouraging them to better explore the chemical space.
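
As an illustration of the two heuristic measures described in this section, richness and internal diversity, the following is a minimal Python sketch. It is not the paper's implementation. It assumes that each molecule is identified by a canonical string and that each fingerprint is represented as a set of integer feature indices; the function names and the toy data are hypothetical, and the Tanimoto distance is used here as one common choice of fingerprint distance.

```python
from itertools import combinations


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union


def richness(molecule_ids):
    """Number of unique compounds, assuming identifiers are already canonical."""
    return len(set(molecule_ids))


def internal_diversity(fingerprints):
    """Average pairwise distance over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        # Assumption: a set with fewer than two molecules has zero diversity.
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): three identifiers, two of which are duplicates,
# and three small fingerprints.
toy_ids = ["CCO", "CCO", "c1ccccc1"]
toy_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]

toy_richness = richness(toy_ids)
toy_diversity = internal_diversity(toy_fps)
```

In this toy example the pairwise distances are 0.5, 1.0, and 1.0, so the internal diversity is their mean. Because it is an average, internal diversity describes how spread out a set is rather than how much space it covers, which is one reason the choice of measure matters.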

2. RELATED WORK

Molecular databases and machine-generated compounds are rich sources of drug candidates for forming a candidate library in drug discovery. To evaluate the quality of molecular databases and molecular generation methods, a variety of metrics have been proposed. In general, four categories of evaluation metrics can be identified in the literature, related to: (1) bioactivities, (2) molecular properties, (3) data likelihood, and (4) the coverage of the chemical space. In this paper, we mainly focus on the fourth category, the metrics that are more or less related to the degree of coverage (or exploration) of the chemical space (other metrics are discussed in Appendix A). In this category, commonly used measures include richness, uniqueness, internal diversity, external diversity, KL divergence, and Fréchet ChemNet Distance (FCD) (Olivecrona et al., 2017; You et al., 2018; De Cao & Kipf, 2018; Elton et al., 2019; Brown et al., 2019; Popova et al., 2019; Polykovskiy et al., 2020; Shi et al., 2020; Jin et al., 2020; Xie et al., 2021). In addition, Zhang et al. (2021) propose using the number of unique functional groups or ring systems to estimate chemical space coverage and to compare several recent generative models. Similarly, Blaschke et al. (2020) use the number of unique Bemis-Murcko scaffolds to measure the variety of drug candidates. Koutsoukas et al. (2014) study the effect of molecular fingerprinting schemes on the internal diversity of compound selection. These measures usually mix the concepts of diversity, coverage, and novelty, and their validity as measures of exploration has not been justified. To the best of our knowledge, this is the first work that formally investigates the validity of chemical space measures.
Axiomatic approaches have been used to analytically evaluate various designs of a measurement, such as utility functions (Herstein & Milnor, 1953), cohesiveness (Alcalde-Unzu & Vorsatz, 2013), or document relevance (Fang et al., 2004). While one study applies axiomatic analysis to the design of diversity measures, with a particular focus on the domain of science of science (Yan, 2021), the axiomatic analysis of chemical space measures remains novel. With an empirical analysis in addition to the axiomatic analysis, we make practical recommendations on effective chemical space measures, including two novel measures.

