HOW MUCH SPACE HAS BEEN EXPLORED? MEASURING THE CHEMICAL SPACE COVERED BY DATABASES AND MACHINE-GENERATED MOLECULES

Abstract

Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While most databases and machine-learning-based generation models aim to optimize particular chemical properties, there is limited literature on how to properly measure the coverage of the chemical space by those candidates included or generated. This problem is challenging due to the lack of formal criteria to select good measures of the chemical space. In this paper, we propose a novel evaluation framework for measures of the chemical space based on two analyses: an axiomatic analysis with three intuitive axioms that a good measure should obey, and an empirical analysis on the correlation between a measure and a proxy gold standard. Using this framework, we are able to identify #Circles, a new measure of chemical space coverage, which is superior to existing measures both analytically and empirically. We further evaluate how well the existing databases and generation models cover the chemical space in terms of #Circles. The results suggest that many generation models fail to explore a larger space over existing databases, which leads to new opportunities for improving generation models by encouraging exploration.

1. INTRODUCTION

To efficiently navigate through the huge chemical space for drug discovery, machine learning (ML) based approaches have been broadly designed and deployed, especially de novo molecular generation methods (Elton et al., 2019; Schwalbe-Koda & Gómez-Bombarelli, 2020; Bian & Xie, 2021; Deng et al., 2022) . Such generation models learn to generate candidate drug designs by optimizing various molecular property scores, such as the binding affinity scores. In practice, these scores can be computationally obtained using biological activity prediction models (Olivecrona et al., 2017; Li et al., 2018) , which is the key to obtaining massive labeled training data for machine learning. However, high in silico property scores are far from sufficient, as there is usually a considerable misalignment between these scores and the in vivo behaviors. Costly wet-lab experiments are still needed to verify potential drug hits, where only a limited number of drug candidates can be tested. In light of this cost constraint, it is critical to select or generate drug candidates not only with high in silico scores, but also covering a large portion of the chemical space. As functional difference between molecules is closely related to their structural difference (Huggins et al., 2011; Wawer et al., 2014) , a better coverage of the chemical space will likely lead to a higher chance of hits in wet experiments. For this purpose, quantitative coverage measures of the chemical space become crucial. Such measures can both be used to evaluate and compare the candidate libraries 1 , and be incorporated into training objectives to encourage ML models better explore the chemical space. In this paper, we investigate the problem of quantitatively measuring the coverage of the chemical space by a candidate library. There have been a few such coverage measures of chemical space. For example, richness counts the number of unique compounds in a molecular set, and it has been used to describe how well a model is able to generate unique structures (Shi & von Itzstein, 2019; Polykovskiy et al., 2020) . In addition, molecular fingerprints have been used to calculate pairwise similarity or distance between two compounds, and the average of these pairwise distances has been used to describe the overall internal diversity of a molecular set (Brown et al., 2019; Polykovskiy et al., 2020) . However, most existing coverage measures are heuristically proposed and the validity of these measures is rarely justified. In fact, defining the "right" measure for the coverage of chemical space coverage is challenging. Unlike the molecular property scores, there is no obvious "ground truth" about the coverage of chemical space. Moreover, the chemical space is complex and combinatorial, making the design of a good measure even more difficult. To address the fundamental problem of properly measuring the coverage of chemical space for drug discovery, we propose a novel evaluation framework with two complementary criteria for evaluating the validity of coverage measures. We first formally define the concept of coverage measures on the chemical space (referred to as chemical space measures), where many existing heuristic measures fall into our definition (Section 3). Then we introduce the two criteria and compare various (existing and new) chemical space measures based on the criteria (Section 4). Specifically, the first criterion (Section 4.1) is based on an axiomatic analysis with three intuitive axioms that a good chemical space measure should satisfy. Surprisingly, most heuristic measures that are commonly used in literature, such as internal diversity, fail to satisfy these intuitive axioms. The second criterion (Section 4.2) compares the chemical space measures with a proxy of the gold standard: the number of unique biological functionalities covered by the set of molecules. We find that #Circles, a new chemical space coverage measure (defined in Section 3.2.3) that has a strong basis in the mathematical literature, not only satisfies both axioms but also better correlates with the gold standard. Finally, we apply the #Circles measure to evaluate how well the existing databases and ML models cover the chemical space (Section 5). Interestingly, the evaluation results suggest that many ML models fail to explore a larger portion of chemical space compared to drug candidates obtained from virtual screening over existing databases. We believe these findings lead to a new direction to improve ML-based drug candidate generation models on better exploring the chemical space.

2. RELATED WORK

Molecular databases and machine-generated compounds are rich sources of drug candidates for forming a candidate library in drug discovery. To evaluate the quality of molecular databases and molecular generation methods, a variety of metrics are proposed. In general, four categories of evaluation metrics can be identified in the literature, which are related to: (1) bioactivities, (2) molecular properties, (3) data likelihood, or (4) the coverage of the chemical space, respectively. In this paper, we mainly focus on the fourth category of metrics, the metrics that are more or less related to the degree of coverage (or exploration) in the chemical space (other metrics are discussed in Appendix A). In this category, commonly used measures include richness, uniqueness, internal diversity, external diversity, KL divergence, and Fréchet ChemNet Distance (FCD) (Olivecrona et al., 2017; You et al., 2018; De Cao & Kipf, 2018; Elton et al., 2019; Brown et al., 2019; Popova et al., 2019; Polykovskiy et al., 2020; Shi et al., 2020; Jin et al., 2020; Xie et al., 2021) . Besides, Zhang et al. (2021) propose to use the number of unique functional groups or ring systems to estimate the chemical space coverage and to compare several recent generative models. Similarly in Blaschke et al. (2020) , the number of unique Bemis-Murcko scaffolds is used to measure the variety of drug candidates. Koutsoukas et al. (2014) study the effect of molecular fingerprinting schemes on the internal diversity of compound selection. These measures usually mix the concepts of diversity, coverage, or novelty, and their validity as a measure of exploration is not justified. To the best of our knowledge, this is the first work that formally investigates the validity of molecular chemical space measures. In particular, axiomatic approaches are used to analytically evaluate various designs of a measurement, such as utility functions (Herstein & Milnor, 1953) , cohesiveness (Alcalde-Unzu & Vorsatz, 2013) , or document relevance (Fang et al., 2004) . While one study applies axiomatic analysis to the design of diversity measures, with a particular focus on the domain of science of science (Yan, 2021) , the analysis of chemical space measurements remains novel. Using axiomatic analysis to evaluate the chemical space measures in the chemical space is novel. With an empirical analysis in addition to the axiomatic analysis, we make practical recommendations on effective chemical space measures, including two novel measures.

3.1. DEFINITION OF CHEMICAL SPACE MEASURES

To define a chemical space measure, we first formalize the notion of chemical space by assuming a distance metric exists. This assumption is widely adopted in cheminformatics, using distance metrics such as the Tanimoto distance (Tanimoto, 1968; Bajusz et al., 2015) . Assumption 3.1 (Chemical space). The chemical space U contains all possible molecules and is a metric space with a distance metric function d : U × U → [0, +∞). Definition 3.2 (Tanimoto distance). For two molecules x 1 , x 2 ∈ U, whose binary molecular fingerprint vectors are x 1 , x 2 ∈ {0, 1} n where n is the dimensionality of the fingerprint, their Tanimoto distance is defined as d(x 1 , x 2 ) := 1 - n j=1 x 1j • x 2j n j=1 max(x 1j , x 2j ) . (1) The Tanimoto distance is also referred to as the Jaccard distance (Jaccard, 1912) in other domains. Its range is [0, 1]. For finite sets (e.g., molecular fingerprints), the Tanimoto distance is a metric function (Kosub, 2019; Lipkus, 1999) . Our definition and following analysis are generic to all distance functions and are not limited to the Tanimoto distance. One can also use the distance of latent hidden vectors (Preuer et al., 2018; Samanta et al., 2020) or the root-mean-square deviation (RMSD) of three-dimensional molecular conformers (Fukutani et al., 2021) . We then define a chemical space measure as the following. Definition 3.3 (chemical space measure). Given the universal chemical space U, a chemical space measure is a function that maps a set of molecules to a non-negative real number that reflects to what extent the set spans the chemical space, i.e., µ : P(U) → [0, ∞), where P(•) is the notation of power set. In particular, µ(∅) = 0.

3.2. EXAMPLES OF CHEMICAL SPACE MEASURES

Definition 3.3 is intentionally left general. Many measures related to the coverage, diversity, or novelty of a molecular set fall into this definition. We summarize these various measures used in literature into three categories: reference-based measures, aggregation-based measures, and locality-based measures. We are also able to define new chemical space measures under this formulation.

3.2.1. REFERENCE-BASED MEASURES

The first broad category of chemical space measures compare the generated molecules S with a reference set R. In such context, a reference-based measure can be defined as the coverage of references: Coverage(S, R) := y∈R max x∈S cover(x, y) , where cover(x, y) indicates how molecule x can cover the reference y. When the reference set R is taken as a collection of molecular fragments, the coverage function can be written as cover(x, y) := I[molecule x contains fragment y], where I[•] is the indication function. A large body of drug discovery literature uses the number of distinct functional groups (FG), ring systems (RS), or Bemis-Murcko scaffolds (BM) in S to gauge the size of explored chemical space (Zhang et al., 2021; Blaschke et al., 2020) , corresponding to the cases where R is the collection of all possible FG, RS, or BM fragments. We denote these specific reference-based measures as #FG, #RS, and #BM respectively. Another example is the Richness of the molecular set, i.e., Richness := |S| (Shi & von Itzstein, 2019; Polykovskiy et al., 2020) , where cover(x, y) := I[x = y] and R = U.

3.2.2. AGGREGATION-BASED MEASURES

Though reference-based measures estimate chemical space coverage in an intuitive way, such measures highly rely on the characteristics of the reference set and do not consider the relations between the compounds. In contrast, the aggregation-based measures utilize the pair-wise distances among molecules and reflect the extent of chemical space coverage by aggregating the distances. For a set S with n molecules, several aggregation-based measures can be defined as follow:  where x, y are molecules in S, d is a distance metric as defined previously. Among these measures, Diversity (also referred to as the internal diversity), the average distance between the molecules, is widely used in the literature (You et al., 2018; De Cao & Kipf, 2018; Popova et al., 2019; Polykovskiy et al., 2020; Shi et al., 2020; Xie et al., 2021) . We follow the topology theory and also introduce Diameter and Bottleneck as the maximum and minimum distance between any pair of molecules, respectively (Edelsbrunner & Harer, 2010) , since they can also reflect the dissimilarity between molecules in S. We further introduce three Sum-variants for the above measures. The Sum-variantsfoot_0 will tend to increase when new molecules are added into the set. In addition, the determinant of the similarity matrix of molecules is also a measure of dissimilarity, which is often employed in diverse subset selection as a key concept of the determinantal point processes (DPP) (Kulesza & Taskar, 2011; Kulesza et al., 2012) . The DPP measures is defined as DPP(S) := det(S), where S is the Tanimoto similarity matrix of candidate molecules (i.e., 1 -d(x, y) for Tanimoto distance).

3.2.3. LOCALITY-BASED MEASURES

Inspired by the sphere exclusion algorithm used in compound selection (Snarey et al., 1997; Gobbi & Lee, 2003)  where t ∈ [0, 1) is a distance threshold that corresponds to the diameter of a circle. Intuitively, #Circles counts the maximum number of mutually exclusive circles that can fit into S as neighborhoods, with a subset of its members C as the circle centers. When the threshold t = 0, #Circles becomes the richness measure Richness(S) := |S|, which is the number of unique molecules (Shi & von Itzstein, 2019; Polykovskiy et al., 2020) . Interestingly, #Circles has a close relation to the concepts of covering and packing in mathematics, which is in particular related to the definition of packing number in topology (Vershynin, 2018) . We discuss their detailed connections in Appendix B.

4. SELECTING THE RIGHT MEASURE

While all the aforementioned measures can heuristically reflect the degree of exploration, they do not always agree with each other. We need a principled way to select the most suitable measures from a variety of possible choices. Ideally, the selection criteria should not depend on the particular molecule properties of the target or the algorithm used to generate the candidates. We achieve this through a qualitative axiomatic analysis and a quantitative correlation comparison.

4.1. CRITERION #1: AN AXIOMATIC ANALYSIS OF CHEMICAL SPACE MEASURES

We propose three simple and intuitive principles that a good chemical space measure should satisfy. First, including more molecules should not decrease the degree of chemical space coverage. As it is easy to filter large candidate libraries into smaller subsets, including more molecules (even similar ones) is not harmful, and can render a higher probability of containing drug hits. Second, when adding new molecules, the coverage should increase properly, instead of inflating too much. Third, molecular sets with more dissimilar molecules should have a higher degree of chemical space coverage. These three principles are formalized below as three axioms and can be tested analytically. Axiom 4.1 (Monotonicity). A good chemical space measure µ should be monotonic, i.e., for any two molecular sets S 1 , S 2 ⊆ U, it holds that µ(S 1 ∪ S 2 ) ≥ max(µ(S 1 ), µ(S 2 )). (5) A monotonic measure tends to increase when more molecules are included. This tendency is intuitive in drug discovery as testing more molecules means having a higher probability to discover a drug hit. Axiom 4.2 (Subadditivity). A good chemical space measure µ should be subadditive, i.e., for any two molecular sets S 1 , S 2 ⊆ U, it holds that µ(S 1 ∪ S 2 ) ≤ µ(S 1 ) + µ(S 2 ). Note that the coverage of chemical space spanned by a molecular set may correlate with the probability of containing drug hits. So correspondingly, the coverage µ(S 1 ∪ S 2 ) should not exceed the sum of µ(S 1 ) and µ(S 2 ). Some direct corollaries and discussions on how the first and second axioms connect to mathematical measures are in Appendix C.

Axiom 4.3 (Dissimilarity).

A good chemical space measure should have a preference to dissimilar elements, i.e., for any three molecules Intuitively, considering adding a new molecule into an existing molecular set S = {x 0 }, where there are two choices, namely x 1 and x 2 , the more dissimilar one x 1 will be preferred. x 0 , x 1 , x 2 ∈ U, if d(x 0 , x 1 ) ≥ d(x 0 , x 2 ), it holds µ({x 0 , x 1 }) ≥ µ({x 0 , x 2 }), Depending on whether each chemical space measure satisfies monotocinicy, subadditivity and/or dissimilarity, we can put it into a Venn diagram as shown in Figure 1 . The formal proofs are provided in Appendix D. Despite that the three axioms are simple and intuitive, surprisingly, the proposed #Circles is the only measure that satisfies all these three axioms.

4.2. CRITERION #2: CORRELATION WITH BIOLOGICAL FUNCTIONALITY

While only one measure stands out in the axiomatic analysis, it does not guarantee its empirical performance. Meanwhile, there are still multiple measures that satisfy one or two of the three axioms. In this subsection, we further investigate the validity of the chemical space measures by testing their correlations with the biological functionality variety of molecules. Such functionalities can provide valuable information in distinguishing molecules and implicating chemical space coverage. We correlate the chemical space measures to the number of unique biological functionality labels covered by a molecular set. A better chemical space measure should have a higher correlation to the variety of biological functionalities, even though it is a proxy gold standard.

4.2.1. EXPERIMENT SETUP

We base the analysis on the BioActivity dataset that is also used to compare different compound selection algorithms (Koutsoukas et al., 2014) . This dataset contains 10,000 compound samples extracted from the ChEMBL database (Gaulton et al., 2017) with bio-activity labels, which annotate 50 activity classes with 200 samples each. Following Koutsoukas et al. (2014) , for a subset of this dataset S, we take the number of unique class labels as a proxy "gold standard" of the variety of the molecules in S, i.e., GS(S) := #unique labels in S, which represents the number of biological functionality types covered by S. We then compare the behavior of the gold standard and the chemical space measures in two settings to find out which measures have the highest correlations with the coverage (or variety) of biological functionalities: (1) a fixed-size setting where random subsets of the dataset with fixed sizes are measured, and (2) a growing-size setting where we sequentially add molecules into a molecular set. We compare the measuring results of GS and chemical space measures with Spearman's correlation coefficient and dynamic time warping (DTW) distance. The experiment details are listed in Appendix E.

4.2.2. RESULTS AND DISCUSSION

Table 1 : Regarding the two suggested criteria, #Circles is the most recommended chemical space measure. "Mono", "subadd", and "dissim" are abbreviations for monotonicity, subadditivity, and dissimilarity respectively. Experiment results in Figure 2 suggest that #Circles stands out for both empirical experiment settings. For the fixedsize setting, #Circles and SumBottleneck have notably higher correlations to the gold standard than other measures. This indicates that the locality information is critical, as these two measures both prefer new molecules that are at arm's length from their nearest neighbors. For the growing-size setting, #Circles surpasses all other chemical space measures. Reference-based measures such as #FG also perform prominently. Both #Circles and reference-based measures satisfy subadditivity, making them suitable for a "growth" setting. The Sum-variants of aggregation-based measures outperform their original forms, as they tend to increase when adding new molecules. More experiment results can be found in Appendix E, where we also discuss the impact of the distance metric d. We also study the sensitivity of the subset size n, the way of adding new molecules, and the choice of the #Circles's threshold t. Results suggest that t = 0.75 is a good choice for both two settings. Correlations between chemical space measures are visualized in Figure 8 and 11.

4.3. WHICH MEASURE IS THE RIGHT CHOICE?

The empirical analysis shows that the locality-based #Circles measure is a robust choice for all tested scenarios, which reconfirms the conclusion of the axiomatic analysis. Besides, SumBottleneck may be an effective choice when a fixed number of candidates are the target. Reference-based measures may be good alternatives if carefully-designed and comprehensive reference sets are available. Surprisingly, the widely used Diversity measure is rendered inferior both analytically and empirically, casting doubts on its efficiency in measuring and encouraging exploration for molecular generation. Our analysis and comparison results are summarized in Table 1 . We find a discrepancy between Richness and #Circles, meaning that a larger database does not necessarily span a wider range of the chemical space.

5.1. MEASURING CHEMICAL SPACE COVERED BY MOLECULAR DATABASES

Other than being a rich source for virtual screening for drug purposes, molecular databases are also bonanzas of data for training ML models. In both scenarios, the coverage of chemical space is a crucial descriptor for comparing various databases. With a higher coverage of the chemical space, we are able to construct candidate libraries with better varieties. Also, the ML model would benefit from the diverse training data, obtaining a better understanding of the overall chemical space. Here, we compare five commonly used molecular databases with the selected measures.

5.1.1. EXPERIMENT SETUP

Databases and filtering rules. We include five widely used drug and compound databases in our measurement: (1) ZINC-250k (a random subset of the ZINC database) (Irwin & Shoichet, 2005) , ( 2) MOSES (Polykovskiy et al., 2020) , (3) ChEMBL (Gaulton et al., 2017) , (4) GDB-17 (Ruddigkeit et al., 2012) , and (5) the Enamine Hit Locator Library (ENAMINE). More details about the databases are listed in Appendix F.1. We further apply two filtering rules to these databases to select potential drug candidates. In drug discovery, drug-likeness and synthesizability of compounds are both important considerations. The quantitative drug-likeness (QED) (Bickerton et al., 2012) and the synthetic accessibility (SA) (Ertl & Schuffenhauer, 2009) are computed. The molecules are filtered with QED ≥ 0.6 and SA ≤ 4.

5.1.2. RESULTS AND DISCUSSION

Figure 3 compares Richness and Circles of molecular databases. From the results, we conclude the findings as follows. (1) A larger database does not necessarily span a wider range of the chemical space: There is a discrepancy between Richness and #Circles, and the #Circles measuring results suggest that, in terms of chemical space coverage, ZINC-250k covers a larger area than ENAMINE and MOSES. (2) ZINC-250k and GDB-17 are recommended: With the smallest amount of compounds, ZINC-250k renders a relatively high coverage of the chemical space, which is highly recommended for model training with limited computation resources. Constructed by enumeration, GDB-17 substantially expands the known chemical space to a great level, which is suitable for training ML models for chemical space representations. More details and findings on other chemical measures are provided in Appendix F.

5.2.1. PROBLEM FORMULATION

In drug discovery, the search for active molecules towards a druggable target is often formulated as the following unconstrained optimization problem (Olivecrona et al., 2017; Gómez-Bombarelli et al., 2018; Liu et al., 2018; You et al., 2018; Jin et al., 2018; De Cao & Kipf, 2018; Popova et al., 2019; Shi et al., 2020; Xie et al., 2021) : arg max x∈U s 1 (x) • s 2 (x) • • • • s K (x), where x is a molecule in the chemical space U, s k : U → R is a function scoring particular biological properties, and the "•" operator indicates the combination of multiple scores (e.g., summation or multiplication). This scoring function can be binding affinity to protein targets, drug-likeness, synthesizability, etc. When wet-lab experiments are not available, these scores are obtained computationally. Unfortunately, the majority of the computational scoring functions are merely approximations and cannot accurately predict the wet-lab results. As pointed out in the chemistry literature, rather than focusing on optimizing the estimated property scores, it is crucial to generate a variety of compounds that span a wider range of the chemical space (Huggins et al., 2011; Wawer et al., 2014; Ashenden, 2018) . We therefore propose the following objective as the goal of molecular generation, and use this new objective to compare models: arg max S⊆U µ(S) s.t. s k (x) ≥ C k , ∀k ∈ [K], x ∈ S, where S ⊆ U is a molecular candidate set, µ(•) is a chemical space measure, s k (x) ≥ C k means the estimated property score of a molecule x is at or above a threshold C k . (Rogers & Hahn, 2010) . We also consider drug likeness (QED) and synthetic accessibility (SA) as they are important in practical drug discovery scenarios.

5.2.2. EXPERIMENT SETUP

Models, variations, and the baseline. We employ four recently proposed molecular generation models to optimize the generation objective (JNK3, QED, and SA): (1) RationaleRL (Jin et al., 2020) Evaluation. To validate the ability in exploring the chemical space, we compare the molecules generated by models using Equation 9 with multiple chemical space measures. The molecules (including molecules generated by models as well as compounds from databases) are filtered with the constraints JNK3 ≥ 0.5, QED ≥ 0.6, and SA ≤ 4, representing the positive candidates. We examine seven different measures, and report #Circles, Richness, and Diversity here; the others are listed in Appendix G). Note we cannot directly measure the biological functionality coverage as in Section 4.2, as many of the generated molecules are not in the database. For RationaleRL and DST, 5000 molecules are generated as suggested in their papers. For JANUS and MARS, the population size and the number of MCMC chains are set as 5000. JANUS evolves for 10 generations, and MARS iterates 2000 steps. More implementation details are listed in Appendix G. Table 2 lists the chemical space measures estimated for the positive candidate compounds (i.e., JNK3 ≥ 0.5, QED ≥ 0.6, and SA ≤ 4) generated by models or from databases. Based on these results, we conclude the following. (1) ML-based models fail to explore a larger effectual area compared to databases: Unexpectedly, we find the molecules obtained by virtual screening over databases span the largest area of the chemical space in terms of #Circles when t = 0.75. The most competitive model among the tested ones, MARS+#Circles, explores only about half of the space compared to virtual screening. (2) Chemical space measures can encourage the model to explore: By incorporating chemical space measures into the optimization objective of MARS, all measures are significantly improved during molecular generation. (3) Diversity should be avoided as a descriptor for exploration: During sampling, MARS and its variants all converge quickly in Diversity (Figure 19 ), despite a large number of new molecules discovered. Again, it suggests that the widely used Diversity is not suitable for measuring chemical space.

5.2.3. RESULTS AND DISCUSSION

We want to stress that the parameter t of #Circles has an interpretable meaning. When t is smaller, #Circles tends to reflect the "spread" of molecules at a finer granularity. When t is larger, the measure focuses more on the global picture. Therefore, we can intuitively characterize different molecular generation methods by examining #Circles values with different thresholds t as displayed in Figure 4 . From the results, we find MARS and JANUS present similar characteristics as they both have a relatively sharp decrease around t = 0.75. This might be because both two models are based on sampling algorithms and tend to exploit locally in the chemical space. In contrast, RationaleRL tends to explore globally because it is designed to combine active fragments (rationales). We can also compare the models by adopting the concept of Pareto optimization. We find the DST, JANUS, and the vanilla MARS model are completely dominated by the variant MARS+#Circles. Besides, the RationaleRL method is dominated by the virtual screening approach. A few limitations of the measurement with #Circles should be noted. For example the running time of calculating #Circles is exponential in theory. We implement a fast approximation by sacrificing accuracy (Appendix H). More experiment details and findings are provided in Appendix G.

6. CONCLUSION AND FUTURE WORK

In this paper, we have presented a systematic study on the coverage measures of chemical space in the context of drug discovery. The core contribution of this study is a novel evaluation framework for selecting proper coverage measures. Using the proposed framework, we have identified a new chemical space coverage measure, #Circles, which is superior to existing heuristic measures commonly used in the drug discovery literature. We have also quantitatively compared molecular databases and ML-based generation models in terms of #Circles. The results suggest that many ML-based generation models fail to explore a larger chemical space compared to virtual screening, since they tend to exploit locally instead of exploring widely. Measuring the chemical space is a fundamental problem in drug discovery. Our work in this direction opens even more research questions that are worth investigating in the future. For example, it would be interesting to design and evaluate more chemical space measures under our framework. Moreover, the newly suggested objective for molecular generation (Equation 9) inspires further study, since it is computationally challenging as a combinatorial constrained optimization problem. In general, the proposed chemical space measures and the generic measure selection framework are not restricted to the application of drug discovery, but can also be used in other domains, such as material design, science of science, and text or image generation.

A EVALUATION METRICS FOR MOLECULAR DATABASES AND MOLECULAR GENERATION METHODS

Molecular databases and machine-generated compounds are rich sources of drug candidates for forming a candidate library in drug discovery. To evaluate the quality of molecular databases and molecular generation methods, a variety of metrics are proposed. In general, four categories of evaluation metrics can be identified in the literature, which are related to: (1) bioactivities, (2) molecular properties, (3) data likelihood, or (4) the coverage of the chemical space, respectively. The first category describes the biological activities of the included or generated molecules toward certain protein targets. Using docking simulators, researchers can calculate docking scores to determine the binding affinity between the given compound structure and the target (Trott & Olson, 2010) . However, the simulation usually takes a long time to run. To obtain fast feedback and a sufficient amount of data to train molecular generation models, machine learning (ML) methods are also proposed to predict the bioactivity of molecules, including linear regression, supported vector machines (SVMs), random forests, and neural networks (NNs) (Shoichet et al., 1992; Olivecrona et al., 2017; Li et al., 2018; Ahmadi et al., 2021; Jiang et al., 2022) . The second category of metrics evaluates the molecular properties of compounds. Some of them are calculated based on heuristic rules, such as the validity of compounds, molecular weight, octanol-water partition coefficient (logP) (Wildman & Crippen, 1999) , drug-likeness score (QED) (Bickerton et al., 2012) , and synthetic accessibility (SA) (Ertl & Schuffenhauer, 2009) . With the advent of ML and deep learning, ML (especially deep learning)-based prediction models like SVMs, random forests, recurrent neural networks (RNNs), and graph neural networks (GNNs) are also employed to predict the properties of compounds (Shen et al., 2010; Wang et al., 2019; Gilmer et al., 2017) . Another branch of metrics uses data likelihood to evaluate the molecules output by generative models (Gómez-Bombarelli et al., 2018; Liu et al., 2018; Jin et al., 2018; De Cao & Kipf, 2018) . A few other metrics are particularly related to the coverage of the chemical space or the extent of exploration in the chemical space. These coverage-and-exploration-related metrics are introduced in Section 2.

B COVERING NUMBER AND PACKING NUMBER

In mathematics, given a metric space (T, d), where T is the universal set and d is the distance metric, the coverage of a subset K ⊂ T can be described in terms of ε-nets (Vershynin, 2018) defined below. Definition B.1 (ε-net). Given a metric space (T, d), for a subset K ⊆ T and a positive number ε > 0, an ε-net of K is a subset N ⊆ K, where every point in K is within distance ε of N , i.e., ∀x ∈ K, ∃x 0 ∈ N : d(x, x 0 ) ≤ ε. Two related concepts, covering number and the packing number, are then defined as follows. Definition B.2 (Covering number). For a subset K of the metric space (T, d), its covering number, denoted N (K, d, ε), is the smallest possible cardinality of an ε-net of K. Definition B.3 (Packing number). For a subset N in the metric space (T, d), it is called ε-separated if d(x, y) > ε for all points x ̸ = y ∈ N . The packing number of a subset K ⊆ T , denoted P(K, d, ε), is the largest possible cardinality of an ε-separated subset of K. Our definition of #Circles is equivalent to the packing number when setting the distance threshold t in #Circles equal to the diameter of the balls ε in the packing number. To our best knowledge, however, the use of this notion as a chemical space coverage measure is novel. In light of this connection, further leveraging insights from topology to investigate the chemical space would be an interesting future direction.

C AXIOM COROLLARIES AND DISCUSSIONS

Some direct corollaries of monotonicity and subadditivity are as follows. Corollary C.1 (Subtraction). If a chemical space measure µ is subadditive, then for any two molecular sets ∀S 1 , S 2 ⊆ U, we have µ(S 1 ) ≥ µ(S 1 \ S 2 ) ≥ µ(S 1 ) -µ(S 2 ).

Corollary C.2 (Monotonicity [Single Molecule

). ] If a chemical space measure µ is subadditive, then for any molecular set ∀S ⊆ U and a single molecule x ∈ U, we have µ(S ∪ {x}) ≥ µ(S). Corollary C.3 (Dominance). If a chemical space measure µ is subadditive, then for any two molecular sets ∀S 1 , S 2 ⊆ U, if S 1 ⊆ S 2 , then µ(S 1 ) ≤ µ(S 2 ). A subadditive chemical space measure is a special case of outer measures in the context of mathematical measure theory. Outer measures are relaxations of measures, where the latter requires a stricter additivity property, i.e., µ(S 1 ∪ S 2 ) = µ(S 1 ) + µ(S 2 ) for any two disjoint sets S 1 , S 2 (Halmos, 2013). We consider additivity to be too strong and can conflict with intuitions in drug discovery: an additive measure defined on a discrete space must take the form of µ(S) = x∈S w(x), in which w(•) is a weight function that independently assigns a score to each element in the space, meaning additive measures defined on the discrete chemical space cannot capture the interrelationship between molecules in a candidate set, which counters the reality, thus can hardly tell the variety of compounds and how much of the chemical space is been covered.

D PROOFS FOR MONOTONICITY, SUBADDITIVITY AND DISSIMILARITY

In Figure 1 we show the chemical space measures according to whether they will satisfy monotonicity, subadditivity, and/or dissimilarity. In this section, we provide proofs to verify each measure's monotonicity, subadditivity, and dissimilarity. Note that for aggregation-based measures, when |S| = 1, µ(S) is not defined. So without loss of generality, we assume µ({x}) = w(x), x ∈ U in such cases, where w : U → R is an importance function. Proposition D.1 (Monotonicity of chemical space measures.). Reference-based chemical space measures, SumDiversity, Diameter, SumDiameter, and #Circles are monotonic. Diversity, Bottleneck, SumBottleneck, and DPP are not monotonic. Proof. In the proof, for two arbitrary molecular sets S 1 and S 2 as stated in the monotonicity axiom, we consider combining them into S := S 1 ∪ S 2 . For the reference-based chemical space measures, We prove the monotonicity and subadditivity simultaneously for reference-based measures by first proving the monotonicity and subadditivity of the maximum of cover(•, •). To prove the monotonicity and subadditivity of the maximum of cover(•, •), we consider combining any two molecular sets S 1 , S 2 ⊆ U. For any reference y ∈ R, we have max max x∈S1 cover(x, y), max x∈S2 cover(x, y) ≤ max x∈S1∪S2 cover(x, y) ≤ max x∈S1 cover(x, y) + max x∈S2 cover(x, y). Therefore, max (Coverage(S 1 ), Coverage(S 2 )) ≤ Coverage(S 1 ∪ S 2 ) ≤ Coverage(S 1 ) + Coverage(S 1 ), thus proving the Coverage measure is monotonic and subadditive. For the SumDiversity chemical space measure, note that the monotonicity corollary (Corollary C.2) can also entail the monotonicity property defined in Axiom 4.1 (by sequentially adding molecules from S 2 to S 1 or from S 1 to S 2 ). So we prove the monotonicity (defined in Axiom 4.1) by proving the monotonicity [single molecule] (defined in C.2). For a molecule x ∈ U and a molecular set S ⊆ U with size n, using the fact that SumDiversity(S) = n • Diversity(S), we have SumDiversity(S ∪ {x}) -SumDiversity(S) =(n -1) • Diversity(S) + 2 n y∈S d(x, y) -n • Diversity(S) = -Diversity(S) + 2 n y∈S d(x, y). Comparing Diversity(S) and 2 n y∈S d(x, y) is equivalent to comparing y1̸ =y2∈S d(y 1 , y 2 ) and (n -1) y∈S d(x, y). Consider each "triangle" tuple (y 1 , y 2 , x), due to the metric characteristics of d, we have d(y 1 , y 2 ) ≤ d(x, y 1 ) + d(x, y 2 ), meaning that y1̸ =y2∈S d(y 1 , y 2 ) is less or equal to (n -1) y∈S d(x, y). When w(x) = 0, SumDiversity(S ∪ {x}) -SumDiversity(S) ≥ 0, proving the monotonicity of SumDiversity. For the Diameter chemical space measure, when the importance function satisfies w(x) = 0, ∀x ∈ U, we have Diameter(S) = max x,y∈S x̸ =y d(x, y) ≥ max   max x,y∈S1 x̸ =y d(x, y), max x,y∈S2 x̸ =y d(x, y)   = max(Diameter(S 1 ), Diameter(S 2 )), proving the monotonicity of Diameter. For the SumDiameter chemical space measure, when the importance function satisfies w(x) = 0, ∀x ∈ U, we have SumDiameter(S) = x∈S max y∈S y̸ =x d(x, y) ≥ max   x∈S1 max y∈S1 y̸ =x d(x, y), x∈S2 max y∈S2 y̸ =x d(x, y)   = max(SumDiameter(S 1 ), SumDiameter(S 2 )), proving the monotonicity of SumDiameter. For the #Circles chemical space measure, we define C * (S) ⊆ S as an arbitrary set that satisfies |C * (S)| = #Circles(S). For any two molecular sets S 1 , S 2 ⊆ U, since C * (S 1 ) ⊆ S 1 ∪ S 2 , according to the definition of #Circles, we have #Circles(S 1 ) ≤ #Circles(S 1 ∪ S 2 ). Similarly, we also have #Circles(S 2 ) ≤ #Circles(S 1 ∪ S 2 ), proving the monotonicity of #Circles. 

And the change in Diversity is

Diversity(S ∪ {x}) -Diversity(S) = n -1 n + 1 - n + 1 n + 1 Diversity(S) + 2 (n + 1)n y∈S d(x, y) = 2 n + 1   -Diversity(S) + 1 n y∈S d(x, y)   . When the average distance of x and S, i.e., 1 n y d(x, y), is less than Diversity(S) (e.g., adding a molecule on the "segment" between two existing molecules), Diversity would decrease, thus violating the monotonicity corollary and proving Diversity is not monotonic. When x introduces a more restricting bottleneck, i.e., min y d(x, y) < Bottleneck(S), we will have Bottleneck(S ∪ {x}) < Bottleneck(S), violating the monotonicity corollary, thus proving Bottleneck is not monotonic. When x introduces some more restricting bottlenecks, i.e., for many y ∈ S, d(x, y) is small (e.g., adding a molecule into a set whose size is two, and the new molecule is added near one of the two molecules), we will have SumBottleneck(S ∪ {x}) < SumBottleneck(S), violating the monotonicity corollary, thus proving SumBottleneck is not subadditive. For the DPP chemical space measure, we disprove its monotonicity by proving it violates the monotonicity [single molecule] corollary (Corrolory C.2). Consider adding x into {x 0 } where x ̸ = x 0 1 -d(x, x 0 ) is denoted as b. We have DPP({x 0 , x}) = 1 b b 1 = 1 -b 2 . When b > 0 we will have DPP({x 0 , x}) < DPP({x 0 }) = 1, violating the monotonicity corollary, thus proving DPP is not monotonic. For the #Circles chemical space measure, we prove its subadditivity by contradiction. For any two molecular sets S 1 , S 2 ⊆ U, we assume #Circles(S 1 ∪ S 2 ) > #Circles(S 1 ) + #Circles(S 1 ). Use the notations C 1 := C * (S 1 ∪ S 2 ) ∩ S 1 and C 2 := C * (S 1 ∪ S 2 ) ∩ S 2 . We have |C 1 | + |C 2 | ≥ #Circles(S 1 ∪ S 2 ) > #Circles(S 1 ) + #Circles(S 1 ). Since all values are non-negative, we must have |C 1 | > #Circles(S 1 ) or |C 2 | > #Circles(S 2 ), contradicting with the definition of #Circles(S 1 ) or #Circles(S 1 ), thus proving the subadditivity of #Circles. For the Diversity and SumDiversity chemical space measures, we disprove their subadditivity by providing a counter-example. For two disjoint molecular sets with two molecules in each, i.e., {x 1 , x 2 } and {x 3 , x 4 }, we denote d ij = d(x i , x j ). Then we have Diversity({x 1 , x 2 , x 3 , x 4 }) = Diversity({x 1 , x 2 , x 3 , x 4 }) = 2 4 • 3 (d 12 + d 13 + d 14 + d 23 + d 24 + d 34 ), Diversity({x 1 , x 2 }) + Diversity({x 3 , x 4 }) = d 12 + d 34 . Similarly,  SumDiversity({x 1 , x 2 , x 3 , x 4 }) = 4 • Diversity({x 1 , x 2 , x 3 , x 4 }) = 4 • 2 4 • 3 ( ) = max x,x ′ ∈S1 x̸ =x ′ d(x, x ′ ) + max y,y ′ ∈S2 y̸ =y ′ d(y, y ′ ). When the maximum inter-set distance is larger than the maximum inner-set distance, i.e., max x,y,∈S1∪S2 d(x, y) > max x,x ′ ,∈S1 d(x, x ′ ) + max y,y ′ ,∈S2 d(y, y ′ ), we will have Diameter(S 1 ∪ S 2 ) > Diameter(S 1 ) + Diameter(S 2 ), thus proving the Diameter measure is not subadditive. For the SumDiameter chemical space measure, we disprove its subadditivity by providing a counterexample. For two disjoint molecular sets with two molecules in each, i.e., {x 1 , x 2 } and {x 3 , x 4 }, we denote d ij = d(x i , x j ). Then we have proving the dissimilarity property of SumBottleneck. For the DPP chemical space measure defined with the Tanimoto similarity, denoting d(x 0 , x 1 ) and d(x 0 , x 2 ) as d 1 and d 2 respectively, when d 1 , d 2 ∈ [0, 1] (Tanimoto distances), we have DPP({x 0 , x 1 }) = 1 1 -d 1 1 -d 1 1 = 2d 1 -d 2 1 ≥ 2d 2 -d 2 2 = 1 1 -d 2 1 -d 2 1 = DPP({x 0 , x 2 }), proving the dissimilarity property of DPP. For the #Circles chemical space measure, we have #Circles({x 0 , x 1 }) = 1 + I[d(x 0 , x 1 ) > t] ≥ 1 + I[d(x 0 , x 2 ) > t] = #Circles({x 0 , x 2 }), proving the dissimilarity property of #Circles. Discussion on the dissimilarity property of reference-based measures. For a reference-based measure, considering adding a new molecule x 1 or x 2 to the existing moelculer set S = {x 0 }, we have Coverage({x 0 , x 1 }, R) = y∈R (max(cover(x 0 , y), cover(x 1 , y))) , Coverage({x 0 , x 2 }, R) = y∈R (max(cover(x 0 , y), cover(x 2 , y))) , where the values of Coverage({x 0 , x 1 }, R) and Coverage({x 0 , x 2 }, R) will depend on the particular definition of cover(•, •) and the choice of the reference set R. Generally, an arbitrary coverage function and an arbitrary reference set do not necessarily meet the dissimilarity requirement. In our study, we define the cover function as cover(x, y) := I[molecule x contains fragment y], where I[•] is the indicator function, and y is a fragment contained in the reference set R. Some counter-examples can be easily constructed, for instance, if x 1 is far away from the reference set R while x 2 is very close to R (even contained in R), then the Coverage measure will prefer the molecule x 2 instead of the more dissimilar molecule x 1 . Therefore, the dissimilarity property does not hold for chemical space measures #FG, #RS, and #BM. Discussion on a new definition of reference-based measure. In addition to counting the number of fragments in the reference set R contained in the molecular set S, another way to define a reference-based measure is to investigate the distance from S to R. Particularly, we could define such a reference-based measure as below: Coverage(S, R; d, t) := y∈R I[∃x ∈ S such that d(x, y) < t], where I[•] is the indicator function, d is the distance metric, and t is the distance threshold. Similar to the discussion on #FG, #RS, and #BM, the characteristics of the reference-based measure defined by Eq. 10 highly relies on the property of R, and an arbitrary reference set R can not guarantee the dissimilarity property of this measure. We now prove that the reference-based measure defined by Eq. 10 satisfies both monotonicity and subadditivity. In the proof, Coverage(•) refers to the definition in Eq. 10. Proof. We first prove the monotonicity and subadditivity of the indicator function I[•], where we consider combining any two molecular sets S 1 , S 2 ⊆ U. For any reference y ∈ R and a distance threshold t, we have max (I[∃x ∈ S 1 such that d(x, y) < t], I[∃x ∈ S 2 such that d(x, y) < t]) ≤ I[∃x ∈ S 1 ∪ S 2 such that d(x, y) < t] ≤ I[∃x ∈ S 1 such that d(x, y) < t] + I[∃x ∈ S 2 such that d(x, y) < t]. Therefore, max (Coverage(S 1 ), Coverage(S 2 )) ≤ Coverage(S 1 ∪ S 2 ) ≤ Coverage(S 1 ) + Coverage(S 1 ), thus proving the Coverage measure defined by Eq. 10 is monotonic and subadditive.

E RANDOM SUBSET EXPERIMENT DETAILS E.1 BIO-ACTIVITY DATASET

The 10K BioActivity dataset (Koutsoukas et al., 2014) contains 10,000 compound samples excerpted from the ChEMBL database (Gaulton et al., 2017) with bio-activity labels. These labels are the 50 largest ChEMBL activity classes, including enzymes (e.g., proteases, lyases, reductases, hydrolases, and kinases) and membrane receptors (e.g., GPCRs and non-GPCRs). The label distribution is shown in Figure 5 . (Koutsoukas et al., 2014) . 50 bio-activity functionality classes are included. We use UMAP (McInnes et al., 2018) to visualize the molecules in this dataset based on their Morgan fingerprints as displayed Figure 6 . From the visualization, we can see that fingerprint similarity is indeed correlated with bio-activity similarity. 

E.2 RANDOM SUBSETS WITH FIXED SIZES

We first consider randomly sampled molecular subsets of the BioActivity dataset with a fixed size n. We randomly sample n molecules S from the dataset and compute the biological functionality coverage GS(S) as well as each chemical space measure µ(S). By repeating the randomization, we can calculate Spearman's correlation between the gold standard GS and each individual measure µ. We run the experiment for three different fixed sizes n = 50, n = 200, and n = 1000 to represent different molecular distribution density. The pair-wise correlations between measures are displayed in Figure 8 . Furthermore, when n = 50 and the molecules are distributed sparsely, all chemical space measures are positively correlated with the gold standard. However, when the subset size increases to n = 200 and n = 1000, Bottleneck and DPP becomes negatively correlated, while other measure tend to perform better. This is because these two measures will be bounded by the most similar molecular pair, thus severely conflicting with the subadditivity axiom. In this experiment, we repeat Algorithm 1 for ten times to obtain reliable correlations. Algorithm 1 Calculating chemical space measures for random subsets with fixed sizes. 

Experiment results.

We show the experiment results for different fixed random set size n in Figure 7 . We find the #Circles and SumBottleneck measures perform constantly better than all other measures. When the fixed size n increases, most chemical space measures' performances also increase, except for Bottleneck and DPP, meaning they are not suitable for measuring the variety when the molecules are distributed crowdedly. 



The SumBottleneck measure is sometimes abbreviated as "SumBot" in the main paper. MEASURING CHEMICAL SPACE COVERAGEIn this section, we apply the chemical space measures to evaluate how well the existing databases and ML-based molecular generation models cover the chemical space. In the original text, the authors suggest a similarity threshold of 0.40 that is equivalent to a distance threshold of 0.60. The code will be released after publication. ZINC-250k: https://www.kaggle.com/datasets/1379f2461e75ef7a11d0c5ff3dd0c2 440a6bcf33531bc0d303e8eac81a3a4b17. ENAMINE diversity libraries: https://enamine.net/compound-libraries/diversity-l ibraries. RationaleRL: https://github.com/wengong-jin/multiobj-rationale. DST: https://github.com/futianfan/DST. JANUS: https://github.com/aspuru-guzik-group/JANUS. MARS: https://github.com/bytedance/markov-molecular-sampling. Equations 11-13 are approximations of [µ(S ∪ {x}) -µ(S)] defined to avoid numerical issues and for computational efficiency.



Figure1: Chemical space measures, categorized by whether they satisfy monotonicity, subadditivity, and/or dissimilarity. #Circles is the only measure that satisfies all simultaneously.

(b) DTW distances between the gold standard GS and chemical space measures in the growing-size setting, the smaller the better. A smaller distance is better.

Figure 2: Correlation between each chemical space measure and biological functionality coverage. The chemical space measure with the highest correlation is highlighted in green. Results are aggregated by running experiments independently ten times.

Generation objective. FollowingLi et al. (2018) andJin et al. (2020), we consider the inhibition against an Alzheimer-related target protein c-Jun N-terminal kinase-3 (JNK3) as the biological objective. The JNK3 binding affinity score is predicted by a random forest model based on Morgan fingerprint features of a molecule

Figure 4: #Circles in the log scale and its threshold t selection for various models. Curves of the #Circles measure reflects characteristics of molecular generation methods.

the Diversity chemical space measure, we disprove its monotonicity by proving it violates the monotonicity [single molecule] corollary (Corollary C.2). For a molecule x ∈ U and a molecular set S ⊆ U with size n > 1, if x ̸ ∈ S, we have Diversity(S ∪ {x})

the Bottleneck chemical space measure, we disprove its subadditivity by proving it violates the monotonicity [single molecule] corollary (Corrolary C.2). Consider adding a molecule x into a molecular set S with size n > 1. If x ̸ ∈ S, we have Bottleneck(S ∪ {x}) = min Bottleneck(S), min y∈S d(x, y) .

the SumBottleneck chemical space measure, we disprove its monotonicity by proving it violates the monotonicity [single molecule] corollary (Corrolory C.2). Consider adding a molecule x into a molecular set S with size n > 1. If x ̸ ∈ S, we have SumBottleneck(S ∪ {x}) y ′ ̸ =y d(y, y ′ ).

Subadditivity of chemical space measures.). Reference-based chemical space measures, Bottleneck, SumBottleneck, and #Circles are subadditive. Diversity, SumDiversity, Diameter, and SumDiameter are not subadditive.Proof. For the reference-based chemical space measures, the subadditivity is provided under Proposition D.1.For the Bottleneck chemical space measure, when the importance function satisfies w(x) ≥ M, ∀x ∈ U, where M is the upper bound of Bottlenkeck (e.g., w(x) = ∞), 1 ) + Bottleneck(S 2 )), proving the subadditivity of Bottleneck.For the SumBottleneck chemical space measure, when the importance function satisfies w(x) ≥ M, ∀x ∈ U, where M is the upper bound of SumBottleneck (e.g., w(x) = ∞),

Figure 5: Label distribution of the BioActivity dataset (Koutsoukas et al., 2014). 50 bio-activity functionality classes are included.

Figure 6: UMAP visualization of compounds in the BioActivity dataset. Different colors stand for different bio-activity labels.

The fixed subset size n; The bio-activity dataset {(x i , y i )} 10K i=1 where y i ∈ Y are bio-activity labels and |Y| = 50; K chemical space measures {µ k } K k=1 . repeat Sample a number m uniformly from {1, . . . , 50}. Sample m labels Y ′ uniformly from Y. Sample n molecules S with labels in Y ′ uniformly. Compute GS(S) and µ k (S) for k ∈ [K]. until repeated for 1000 times Calculate the correlations between GS and {µ k } K k=1 based on the 1000-times experiment results.

Figure16: Selected molecular property distributions. Except for SAS (normalized from 0 to 1, the larger the better), all property distributions do not show significant changes across different models. The downward shift of the SAS is likely due to the increased diversity in the molecular libraries.

Figure 17: Optimizing chemical space measures encourages the molecular generation model to explore a larger span of the chemical space. This figure shows principal component analysis (PCA) of functional groups discovered by different models.

Measuring results of the chemical space explored by molecular generation methods. Incorporating chemical space measures into the objectives increases the exploration of the chemical space measures, but none of the tested ML-based models surpasses the virtual screening baseline (Databases) in terms of #Circles when t = 0.75 (as suggested in Sec. 4.2, highlighted in grey). Numbers in the parentheses following #Circles are values for the threshold t. In each chemical space measure, the larger value the better. Bold indicates the best performance in each measure.

MARS(Xie et al., 2021) generates fragment-based molecular structures based on Markov chain Monte Carlo (MCMC) sampling. Due to MARS' great ability to simultaneously optimize multiple drug discovery objectives, we also consider three variants of it (MARS+Diversity, MARS+SumBottleneck, and MARS+#Circles), where the chemical space measures are incorporated into Equation 8 to encourage exploration. The virtual screening setting is introduced as a baseline, where the five aforementioned molecular databases are combined, forming a candidate set of 11M compounds.

When the inter-set distances are larger than the inner-set distances, i.e., d 13 +d 14 +d 23 +d 24 > 5•(d 12 + d 34 ), we will have Diversity({x 1 , x 2 , x 3 , x 4 }) > SumDiversity{x 1 , x 2 } + Diversity{x 3 , x 4 }. Similarly, when d 13 + d 14 + d 23 + d 24 > 2 • (d 12 + d 34 ), we will have SumDiversity({x 1 , x 2 , x 3 , x 4 }) > SumDiversity{x 1 , x 2 } + SumDiversity{x 3 , x 4 }, thus proving both Diversity and SumDiversity measures are not subadditive.For the Diameter chemical space measure, For two disjoint molecular sets S 1 , S 2 ⊆ U whose sizes are larger than one, we have

ACKNOWLEDGMENTS

We would like to thank Professor Kevyn Collins-Thompson, Professor Paramveer Dhillon, and Professor Aaron Frank for their helpful feedback and suggestions. We also thank the anonymous reviewers for their constructive comments, and in particular for pointing out the relation between #Circles and packing number. This work was in part supported by the National Science Foundation under grant number 1633370.

annex

When the inter-set distances are larger than the inner-set distances, i.e., d 13 , d 14 , d 23 , d 24 > d 12 , d 34 , we will have SumDiversity({x 1 , x 2 , x 3 , x 4 }) > SumDiversity{x 1 , x 2 } + SumDiversity{x 3 , x 4 }, thus proving the SumDiameter measure is not subadditive.Proposition D.3 (Dissimilarity property of chemical space measures.). Diversity, SumDiversity, Diameter, SumDiameter, Bottleneck, SumBottleneck, DPP, and #Circles have preferences to dissimilarity.Proof. In the proof, for the three molecules x 0 , x 1 , x 2 ∈ U as stated in the dissimilarity axiom in Section 4.1, we assume d(x 0 , x 1 ) ≥ d(x 0 , x 2 ).For the Diversity chemical space measure, we have Diversity({x 0 , x 1 }) = d(x 0 , x 1 ) ≥ d(x 0 , x 2 ) = Diversity({x 0 , x 2 }),proving the dissimilarity property of Diversity.For the SumDiversity chemical space measure, we have SumDiversity({x 0 , x 1 }) = 2 • d(x 0 , x 1 ) ≥ 2 • d(x 0 , x 2 ) = SumDiversity({x 0 , x 2 }),proving the dissimilarity property of SumDiversity.For the Diameter chemical space measure, we have Diameter({x 0 , x 1 }) = d(x 0 , x 1 ) ≥ d(x 0 , x 2 ) = Diameter({x 0 , x 2 }),proving the dissimilarity property of Diameter.For the SumDiameter chemical space measure, we haveproving the dissimilarity property of SumDiameter.For the Bottleneck chemical space measure, we have Bottleneck({x 0 , x 1 }) = d(x 0 , x 1 ) ≥ d(x 0 , x 2 ) = Bottleneck({x 0 , x 2 }),proving the dissimilarity property of Bottleneck.For the SumBottleneck chemical space measure, we have 

Correlation between chemical space measures.

We also visualize the pairwise correlation between chemical space measures in Figure 8 . From the figure we can see that, the gold standard GS, #Circles, and SumBottleneck are most similar with each other in the fixed-size setting.Threshold t for #Circles. The #Circles threshold t is selected to maximize the correlation to the gold standard GS. Taking n = 200 as an example, we test different t values as Figure 9 displays and select t = 0.70 as the threshold. We can see #Circles works well for a wide range of thresholds like [0.40, 0.70]. In Olivecrona et al. (2017) , the authors suggest to use a threshold t = 0.60 to decide whether two molecules are dissimilar with each other 3 , which aligns our results. For n = 50 and n = 1000, the threshold is set as t = 0.70 and t = 0.65 respectively. Distance metric d.

DPP

We also study the impact of distance metric d. In Table 3 , we listed the experiment results for both fingerprint-based Tanimoto distance and VAE-based latent space dissimilarity (Samanta et al., 2020) . We find the experiment results obtained with the VAE dissimilarity remain consistent with the results obtained with the Tanimoto distance.

E.3 RANDOM SUBSETS WITH GROWING SIZES

To mimic the molecular generation process, we also grow the size of the subsets. Specifically, for a maximum size n, we sequentially sample n molecules without replacement to form n subsets For both the gold standard GS and a chemical space measure µ, we record their values as S grows into a time series. , e.g., {(i, µ(S i ))} n i=1 . Comparing the trajectory of a chemical space measure with the trajectory of the gold standard, we can observe which measure behaves more similarly to GS. We quantitatively estimate the similarity of their trajectories with the dynamic time warping (DTW) distance of the two time series.In this experiment, we repeat Algorithm 2 for ten times to obtain reliable DTW distances.Algorithm 2 Calculating chemical space measures for random subsets with growing sizes.

Input:

The maximum subset size n; The bio-activity dataset {(x i , y i )} 10K i=1 where y i ∈ Y are bio-activity labels and |Y| = 50; K chemical space measures. end for Plot chemical space measure curves for GS and {µ k } K k=1 where the x axes are i ∈ {1, . . . , n} and the y axes are chemical space measure values GS(S i ) and µ k (S i ).Transform the cumulative curves into incremental ones. Calculate DTW distances between incremental curves.

Experiment results.

To mimic the way in which generation models propose new molecules, in Algorithm 2, we require the newly sampled molecule x i to be similar to the already sampled molecules {x 1 , . . . , x i-1 }. The specific implementation can be found in our code 4 . Moreover, we also test the following two cases: (1) All molecules are sampled uniformly; (2) The newly sampled molecule x i have to be most similar to the already sampled molecules {x 1 , . . . , x i-1 }. The results of DTW distances for these two cases are shown in Figure 10 . From Figure 10 we can see that, #Circles performs the best. Also, as the new molecules to add become more similar to the existing ones, the advantage of #Circles over other measures becomes larger. This makes #Circles especially suitable for measuring molecular generation models. DTW distances between chemical space measures. We visualize the pairwise DTW distances between chemical space measures in Figure 11 . From the figure we can see that, the gold standard GS and the #Circles measure are most similar with each other in the growing-size setting.

Bottleneck

In addition, we find the Richness, SumDiversity, SumDiameter, SumBottleneck, and #BM are forming a large cluster, while #FG and #RS tend to be similar with each other.Threshold t for #Circles. The #Circles threshold t is selected to minimize the DTW distance to the gold standard GS. Taking the second scenario as an example, we test different t values as Figure 12 displays and select t = 0.76 as the threshold. For the first and the third scenarios, the threshold is set as t = 0.84 and t = 0.78 respectively.

Distance metric d.

We also study the impact of distance metric d. In Table 4 , we listed the experiment results for both fingerprint-based Tanimoto distance and VAE-based latent space dissimilarity (Samanta et al., 2020) . We find the experiment results obtained with the VAE dissimilarity remain consistent with the results obtained with the Tanimoto distance.Table 4 : DTW distances between the gold standard and chemical space measures in the growing-size random subset setting. The maximum size is set as n = 1000, and the new molecule needs to be similar to the already samples ones. A smaller distance indicates the better. (3) ChEMBL (Gaulton et al., 2017) is a manually curated database of 2M bioactive molecules with known experimental data, especially with the inhibitory and binding properties against macromolecule targets. This database contains not only drug and drug-like molecules, but also natural products and biopolymers. (4) GDB-17 (Ruddigkeit et al., 2012) (5) Enamine is a company that provides compound libraries for high-throughput screening (HTS). The Enamine Hit Locator Library 6 is their largest diversity library with high MedChem tractability, and the compounds in the library are readily available for purchase and are guaranteed synthesizable at a reasonable cost.

F.2 RESULTS ON CHEMICAL SPACE MEASURES

The measuring results are listed in Figure 13 and 14.

G MOLECULAR GENERATION EXPERIMENT DETAILS G.1 MODEL IMPLEMENTATION

We implement the models with the official repositories 7 . All hyperparameters are set as default.

G.2 MARS VARIANTS

The terms incorporated into objectives are computed as follows 8 :Novelty SumBottleneck (x, S) := minNovelty #Circles (x, S) := minFor the model variants, we test different α values from {0.1, 0.3, 1.0, 3.0} and report the best performance. Specifically, we use α = 1.0 for MARS+Diversity, α = 0.3 for MARS+SumBottlenck, and α = 0.1 for MARS+#Circles. The threshold we use for the #Circles measure is t = 0.60. The results on other chemical space measures are listed in Table 5 .

G.4 MOLECULAR PROPERTY AND BINDING AFFINITY DISTRIBUTIONS

To investigate how incorporating chemical space measures into the objective as Eq. 11-13 can influence the optimization for molecular properties, we examine the property distributions of molecules generated by MARS and its variants.As shown in Figure 15 -16, our resulting libraries from the joint optimizations with chemical space measures (i.e., MARS+#Circles and MARS+Diversity) are not showing a significant downward shift on JNK3 binding affinity as well as QED scores and SA scores. It is expected that compared with vanilla MARS, the property scores of molecules generated by MARS variants would decrease slightly, because instead of solely optimizing the property scores, the model needs to make sacrifices for a more diverse library. However, this sacrifice would be insignificant compared to the diversity gain in the resulting library. In practice, these molecules would pass the scoring threshold and would be considered qualified in these regards. In Table 2 , the experiment results show that by incorporating chemical space measures into optimization objectives, the richness of discovered qualified molecules (JNK3 ≥ 0.5, QED ≥ 0.6, and SA ≤ 4) got significantly improved.

G.5 VISUALIZATION

To provide a more intuitive view of the effect that adding a joint objective of exploration can encourage the molecular generation model to discover a wider space, we visualize the chemical space explored by the baseline (MARS) and MARS+SumBottleneck as well as MARS+#Circles in Figure 17 , where We observe a downward trend in the MARS variant models, which are jointly optimized with a chemical space measure. However, all generated molecules are above the score threshold, and in practice, due to the limitation of the prediction model, they should be considered equally qualified candidates for further analysis. Right: Distributions of interaction scores of the ligands and the receptor docking (lower energy score indicates more favorable interactions). For the docking experiments, we randomly sampled 30K molecules from each generated library and performed molecular docking on the JNK3 target (PDB: 7KSI) using rDock (Ruiz-Carmona et al., 2014) . The distributions of docking scores show that the joint optimizations on chemical space measures do not affect the binding affinity significantly when predicted by in silico docking.we show the 2D layout of the Morgan fingerprints of the functional groups in the molecules generated through principal component analysis (PCA). The larger number of unique functional groups and the wider spread of the data points obtained by adding exploration-based novelty terms indicate more diverse structures being explored.We show the molecular clustering results in Figure 18 . Clusters are calculated based on Morgan fingerprints of the generated molecules and their Tanimoto similarity. Compared to the baseline model, a larger number of clusters can be obtained from MARS+SumBottleneck.Figure 19 shows the dynamics of measures for MARS and its variants.

H FAST APPROXIMATION OF #CIRCLES

The computation of #Circles is combinatorial and the running time is exponential in theory. We implement two fast approximations by sacrificing accuracy as Algorithm 3 and Algorithm 4.Algorithm 3 Approximation of #Circles (sequential).

Input:

The molecular set S to be measured, the distance metric d, the distance threshold t. 

