BIASES IN EVALUATION OF MOLECULAR OPTIMIZATION METHODS AND BIAS REDUCTION STRATEGIES

Anonymous authors
Paper under double-blind review

Abstract

We are interested in in silico evaluation methodology for molecular optimization methods. Given a sample of molecules and their properties of interest, we wish not only to train a generator of molecules that can find those optimized with respect to a target property, but also to evaluate its performance accurately. A common practice is to train a predictor of the target property on the sample and use it for both training and evaluating the generator. We theoretically investigate this evaluation methodology and show that it potentially suffers from two biases: one due to misspecification of the predictor, and the other due to reusing the same sample for training and evaluation. We discuss bias reduction methods for each of the two biases and empirically investigate their effectiveness.

1. INTRODUCTION

Molecular optimization aims to discover novel molecules with improved properties; it is often formulated as reinforcement learning by modeling the construction of a molecule as a Markov decision process. The performance of such agents is measured by the quality of the generated molecules. In the machine learning community, most molecular optimization methods have been validated by computer simulation. Since most of the generated molecules are novel, their properties are unknown, and we have to resort to a predictor to estimate them. However, little attention has been paid to how reliable such estimates are, except for a few empirical studies (Renz et al., 2019; Langevin et al., 2022), making the existing performance estimates less reliable. In this paper, we study the statistical properties of such performance estimators to enhance our understanding of the evaluation protocol, and we discuss several directions to improve it.

Let us first introduce a common practice to estimate the performance. Let $\mathcal{S}^\star$ be a set of molecules, $f^\star\colon \mathcal{S}^\star \to \mathbb{R}$ be a property function evaluating the target property of the input molecule, and $D = \{(m_n, f^\star(m_n)) \in \mathcal{S}^\star \times \mathbb{R}\}_{n=1}^{N}$ be a sample. We typically train a predictor $\hat f(m; D)$ using $D$, regard it as the true property function, and follow the standard evaluation protocol of online reinforcement learning. That is, an agent is trained to optimize the properties of discovered molecules as computed by $\hat f(m; D)$, and its performance is estimated by letting it generate novel molecules and estimating their properties with $\hat f(m; D)$. We call this a plug-in performance estimator (section 2.1). Our research question is how accurate the plug-in performance estimator is compared to the true performance computed with $f^\star$. We first point out that the plug-in performance estimator is biased in two ways, indicating that it is not reliable in general (section 2.2).
The first bias, called the model misspecification bias, comes from the deviation between the predictor and the true property function evaluated over the molecules discovered by the learned agent. This bias is closely related to the one encountered under covariate shift (Shimodaira, 2000). It grows as the molecules discovered by the agent become dissimilar to those used to train the predictor. The second bias, called the reusing bias, is caused by reusing the same dataset for training and testing the agent. Due to these biases, the plug-in performance estimator is not necessarily a good estimator of the true performance.

We then discuss strategies to reduce these two biases. Section 3.1 introduces three approaches to reducing the misspecification bias. Since it is caused by covariate shift, it can be reduced by training the predictor taking the shift into account (section 3.1.1) and/or by constraining the agent so that the generated molecules remain similar to those in the sample (section 3.1.2). Yet another approach is to use a more sophisticated estimator called a doubly-robust performance estimator (section 3.1.3). Our idea to correct the reusing bias comes from an analogy to model selection (Konishi & Kitagawa, 2007), whose objective is to estimate the test performance by correcting the bias of the training performance, i.e., the performance computed by reusing the same dataset for training and testing. Given the analogy, one may expect a train-test split to be the first choice. We argue, however, that it is not as effective here as it is in model selection, due to a key difference between our setting and model selection: the test set in model selection is used only to take an expectation, while the test set in our setting is used to train a predictor, which is a much more complex operation than taking an expectation. This complexity introduces a non-negligible bias into the train-test split estimator, resulting in a less accurate bias estimate (section 3.2.1).
We instead propose to use a bootstrap method in section 3.2.2, which is proven to estimate the reusing bias more accurately than the train-test split method does. We empirically validate our theory in section 4. First, we quantify the two biases and confirm that both of them are non-negligible, and that the reusing bias increases as the sample size decreases, as predicted by our theory. Second, we assess the effectiveness of the bias reduction methods and confirm that the reusing bias can be corrected, while the misspecification bias can be reduced, but at the cost of performance degradation of the agent.

Notation. For any distribution $G$, let $\hat G \sim G^N$ denote the empirical distribution of a sample of $N$ items independently drawn from $G$. For a set $\mathcal{X}$, let $\delta_x$ be Dirac's delta distribution at $x \in \mathcal{X}$. For any integer $M \in \mathbb{N}$, let $[M] := \{0, \ldots, M-1\}$. For any set $A$, $\mathcal{P}(A)$ denotes the set of probability distributions defined over $A$.

Problem setting. We define a molecular optimization problem using a Markov decision process (MDP) of length $H+1$ ($H \in \mathbb{N}$). See appendix A for concrete examples. Let $\mathcal{S}$ be a set of states and $s_\bot \in \mathcal{S}$ be the terminal state. Let $\mathcal{S}^\star \subseteq \mathcal{S}$ be the subset of states that correspond to valid molecules; the remaining states correspond to possibly incomplete representations of molecules (invalid molecules). Let $\mathcal{A}$ be a set of actions that transform a valid or invalid molecule into another one. There exists a terminal action $a_\bot \in \mathcal{A}$ that evaluates the property of the molecule at step $H$, after which the state transits to the terminal state $s_\bot$. For each step $h \in [H+1]$, let $T_h\colon \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ be a state transition distribution, $r_h\colon \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ be a reward function, and let $\rho_0 \in \mathcal{P}(\mathcal{S})$ be the initial state distribution. We assume that the set of states at step $H$ is limited to $\mathcal{S}^\star$, and the reward function is defined as $r_h(s, a) = 0$ for $h \in [H]$ and $r_H(s, a_\bot) = f^\star(s)$ for $s \in \mathcal{S}^\star$. Let $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \{T_h\}_{h=0}^{H}, \rho_0, H\}$ be the dynamical model of the MDP.
Throughout the paper, we assume we know $\mathcal{M}$ and omit the dependency on it in expressions. Let $\Pi$ be the set of policies and $\pi = \{\pi_h(\cdot \mid s)\}_{h=0}^{H} \in \Pi$ be a policy modeled by a probability distribution over $\mathcal{A}$ conditioned on $s \in \mathcal{S}$. At each step $h \in [H+1]$, the agent takes action $a_h$ sampled from $\pi_h(\cdot \mid s_h)$. The performance of a policy is measured by the expected cumulative reward,
$$J^\star(\pi) := \mathbb{E}_\pi\Big[\sum_{h=0}^{H} r_h(S_h, A_h)\Big] = \mathbb{E}_\pi[f^\star(S_H)],$$
where $\mathbb{E}_\pi[\cdot]$ is the expectation with respect to the Markov process induced by applying policy $\pi$ on $\mathcal{M}$. Letting $p^\pi_h \in \mathcal{P}(\mathcal{S}^\star)$ be the distribution of states visited by policy $\pi$ at step $h \in [H+1]$, the expected cumulative reward is alternatively expressed as $J^\star(\pi) = \mathbb{E}_{S \sim p^\pi_H} f^\star(S)$.

In practice, the property function is not available; instead, a sample from it, $D = \{(m_n, f^\star(m_n)) \in \mathcal{S}^\star \times \mathbb{R}\}_{n=1}^{N}$, is available. Let us assume that each tuple is independently distributed according to $G \in \mathcal{P}(\mathcal{S}^\star \times \mathbb{R})$. Let $G_S \in \mathcal{P}(\mathcal{S}^\star)$ be the marginal distribution over $\mathcal{S}^\star$ induced from $G$. For a theoretical reason clarified in appendix B, we use the empirical distribution of the sample, $\hat G \in \mathcal{P}(\mathcal{S}^\star \times \mathbb{R})$, rather than the sample itself (assumption 10), and we use the terms empirical distribution and sample interchangeably for $\hat G$. Let us define a policy learner $\alpha_\pi\colon \mathcal{P}(\mathcal{S}^\star \times \mathbb{R}) \to \Pi$, an algorithm that learns a policy from a distribution over $\mathcal{S}^\star \times \mathbb{R}$. It typically receives a sample $\hat G$ and outputs a policy, which we denote $\hat\pi := \alpha_\pi(\hat G)$. Our objective is to estimate its performance $J^\star(\hat\pi)$ given access only to $\alpha_\pi$, $\hat G$, and $\mathcal{M}$.
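As a concrete illustration of this setup, the following minimal Python sketch (all names here are hypothetical, not from the paper's code) samples one episode of the molecule-building MDP; since all intermediate rewards are zero, the episode return equals the property of the final molecule.

```python
def rollout_return(policy, transition, f_star, s0, H):
    """Sample one episode of the molecule-building MDP.

    Rewards are r_h = 0 for h < H; the terminal action a_bot at
    step H yields r_H = f_star(S_H), so the episode return equals
    the property of the final molecule.
    """
    s = s0
    for h in range(H + 1):
        if h == H:
            return f_star(s)  # terminal action a_bot: reward f*(S_H)
        a = policy(h, s)
        s = transition(h, s, a)

# Toy instance: states are integers, each action adds its value.
ret = rollout_return(
    policy=lambda h, s: 1,
    transition=lambda h, s, a: s + a,
    f_star=lambda s: float(s),
    s0=0,
    H=3,
)  # states 0 -> 1 -> 2 -> 3, return f*(3) = 3.0
```

In the real environment, `transition` would apply an atom addition or a chemical reaction, and `f_star` is exactly the unknown property function that the rest of the paper replaces with a learned predictor.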

2. BIASES OF PLUG-IN PERFORMANCE ESTIMATOR

A widely used approach to estimating $J^\star(\hat\pi)$ is the plug-in performance estimator (section 2.1). We point out that it is biased in two ways (section 2.2) and theoretically characterize these biases in sections 2.3 and 2.4.

2.1. PLUG-IN PERFORMANCE ESTIMATOR

For any function $f\colon \mathcal{S}^\star \to \mathbb{R}$ and policy $\pi$, let us define a plug-in performance function,
$$J_{\mathrm{PI}}(\pi, f) := \mathbb{E}_\pi[f(S_H)]. \quad (1)$$
Let $\alpha_f\colon \mathcal{P}(\mathcal{S}^\star \times \mathbb{R}) \to (\mathcal{S}^\star \to \mathbb{R})$ be an algorithm that learns a predictor, typically by minimizing a loss function averaged over the input distribution. Let $\hat\pi = \alpha_\pi(\hat G)$ be a policy trained using $\hat G$ and $\hat f := \alpha_f(\hat G)$ be a predictor trained using the same $\hat G$. Then, the plug-in performance estimator is defined as $J_{\mathrm{PI}}(\hat\pi, \hat f)$, which is often used as a proxy for the true performance $J^\star(\hat\pi)$.
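In practice, the expectation in equation 1 is approximated by Monte-Carlo rollouts (section 4 uses 1,000 trajectories). A minimal numpy sketch, with hypothetical names and a scalar toy "molecule" space:

```python
import numpy as np

def plugin_performance(final_states, predictor):
    """Monte-Carlo approximation of J_PI(pi, f) = E_pi[f(S_H)]:
    average the learned predictor over final states S_H obtained
    by rolling out the policy on the known dynamics M."""
    return float(np.mean([predictor(s) for s in final_states]))

# Toy example: final states are scalars, the surrogate is quadratic.
rng = np.random.default_rng(0)
final_states = rng.normal(loc=1.0, scale=0.1, size=1000)  # S_H ~ p_H^pi
f_hat = lambda s: -(s - 1.0) ** 2  # learned surrogate property
estimate = plugin_performance(final_states, f_hat)  # close to -Var = -0.01
```

The whole paper is about when `estimate` can and cannot be trusted as a stand-in for $J^\star(\hat\pi)$: here `f_hat` plays the role of $\hat f$, trained on the same sample that produced the policy.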

2.2. BIAS DECOMPOSITION

The plug-in performance estimator is biased in two ways: the first bias comes from model misspecification of the predictor, and the second is due to reusing the same sample for learning the policy and the predictor. Let us define $\hat J_{\mathrm{PI}}(G_1, G_2) := J_{\mathrm{PI}}(\alpha_\pi(G_1), \alpha_f(G_2))$ and $\Delta_{\mathrm{PI}}(G_1, G_2) := \hat J_{\mathrm{PI}}(G_1, G_2) - J^\star(\alpha_\pi(G_1))$. The quantity $\hat J_{\mathrm{PI}}(G_1, G_2)$ denotes the estimated performance of a policy trained with distribution $G_1$ and evaluated by a predictor trained with $G_2$, and $\Delta_{\mathrm{PI}}(G_1, G_2)$ denotes the deviation of the estimated performance from the ground truth. The bias we care about is then $\mathbb{E}_{\hat G \sim G^N} \Delta_{\mathrm{PI}}(\hat G, \hat G)$, which is decomposed as shown in theorem 1.

Theorem 1. The bias is decomposed into a reusing bias and a misspecification bias as follows:
$$\mathbb{E}_{\hat G \sim G^N} \Delta_{\mathrm{PI}}(\hat G, \hat G) = \mathbb{E}_{\hat G \sim G^N}\big[\hat J_{\mathrm{PI}}(\hat G, \hat G) - \hat J_{\mathrm{PI}}(\hat G, G) + \hat J_{\mathrm{PI}}(\hat G, G) - J^\star(\hat\pi)\big] = \underbrace{\mathbb{E}_{\hat G \sim G^N}\big[\hat J_{\mathrm{PI}}(\hat G, \hat G) - \hat J_{\mathrm{PI}}(\hat G, G)\big]}_{\text{Reusing bias}} + \underbrace{\mathbb{E}_{\hat G \sim G^N} \Delta_{\mathrm{PI}}(\hat G, G)}_{\text{Misspecification bias}}. \quad (2)$$

2.3. MISSPECIFICATION BIAS

Letting $f_\infty := \alpha_f(G)$, the squared misspecification bias $\Delta_{\mathrm{PI}}(\hat G, G)^2$ is upper-bounded via Jensen's inequality as
$$\Delta_{\mathrm{PI}}(\hat G, G)^2 = \big(\mathbb{E}_\pi[f_\infty(S_H) - f^\star(S_H)]\big)^2 \le \mathbb{E}_{S \sim p^{\hat\pi}_H}\big(f_\infty(S) - f^\star(S)\big)^2. \quad (3)$$
Assuming that $f_\infty = \operatorname{argmin}_f \mathbb{E}_{S \sim G_S}(f(S) - f^\star(S))^2$ holds, the bias increases when $f_\infty$ fails to predict the properties of molecules generated by the policy $\hat\pi$, which occurs when the predictor is misspecified (i.e., $f_\infty \ne f^\star$) and $p^{\hat\pi}_H$ and $G_S$ deviate largely (i.e., the discovered molecules are not similar to those in the sample).

2.4. REUSING BIAS

The former term of equation 2, $b^N_{\mathrm{PI}}(G) := \mathbb{E}_{\hat G \sim G^N}[\hat J_{\mathrm{PI}}(\hat G, \hat G) - \hat J_{\mathrm{PI}}(\hat G, G)]$, quantifies the bias caused by reusing the same finite sample for training and testing a policy, which we call a reusing bias. Let us analyze the reusing bias theoretically, assuming the sample size $N$ is moderately large, so that the asymptotic expansions are valid but the $O(1/N)$ term cannot be ignored. We show in proposition 2 that the reusing bias is $O(1/N)$. See appendix B for the assumptions and appendix D.2 for the proof.

Proposition 2. Under assumptions 10 and 12,
$$b^N_{\mathrm{PI}}(G) = \frac{1}{2N}\,\mathbb{E}_{X \sim G}\Big[2\,\hat J^{(1,1)}_{G,G}(\delta_X - G, \delta_X - G) + \hat J^{(0,2)}_{G,G}(\delta_X - G, \delta_X - G)\Big] + O(1/N^2)$$
holds, indicating that $b^N_{\mathrm{PI}}(G) = O(1/N)$, where $\hat J^{(1,1)}_{G,G}$ and $\hat J^{(0,2)}_{G,G}$ are the $(1,1)$-st and $(0,2)$-nd Fréchet derivatives of $\hat J_{\mathrm{PI}}(G_1, G_2)$ at $(G_1, G_2) = (G, G)$.

In particular, if the policy is optimal and the estimated property function is unbiased, i.e., $\mathbb{E}_{\hat G \sim G^N} \hat f = f_\infty$ (which is true at least for a linear model), we can prove that the bias is optimistic (proposition 3). See appendix E for the proof.

Proposition 3. Assume $\mathbb{E}_{\hat G \sim G^N} \hat f = f_\infty$ and $\hat\pi = \operatorname{argmax}_{\pi \in \Pi} J_{\mathrm{PI}}(\pi, \hat f)$ hold. Then, $b^N_{\mathrm{PI}}(G) \ge 0$.

3. BIAS REDUCTION STRATEGIES

We have witnessed that the plug-in performance estimator is biased in two ways. In this section, we discuss how to reduce these biases to obtain reliable performance estimates.

3.1. REDUCING MISSPECIFICATION BIAS

There are mainly three approaches to reducing the misspecification bias $\Delta_{\mathrm{PI}}(\hat G, G)$. The first is to train the predictor taking into account the covariate shift, i.e., the mismatch between the training and testing distributions (section 3.1.1). The second is to constrain the policy so that the molecules it discovers become similar to those in the sample $\hat G$ (section 3.1.2). These two are mainly motivated by minimizing the right-hand side of equation 3. The third is motivated by a standard technique in contextual bandits: using a doubly-robust performance estimator instead of the plug-in performance estimator (section 3.1.3).

Before going into details, let us introduce the notion of importance weight, which is used extensively to reduce the misspecification bias. Let $F \in \mathcal{P}(\mathcal{S}^\star)$ be any probability distribution over molecules whose support covers that of $p^{\hat\pi}_H$. Let $(p^{\hat\pi}_H / F)(s) := p^{\hat\pi}_H(s) / F(s)$ for $s \in \mathcal{S}^\star$ denote the importance weight between them, and let $\alpha_w\colon \Pi \times \mathcal{P}(\mathcal{S}^\star) \to (\mathcal{S}^\star \to \mathbb{R}_{\ge 0})$ denote an algorithm that receives a policy and a distribution over molecules and outputs the importance weight between the state distribution induced by the policy and the given distribution. We typically run the algorithm with the sample $\hat G$ drawn from $G$ in place of the distribution, expecting that $\alpha_w(\hat\pi, \hat G) \approx p^{\hat\pi}_H / G$.

3.1.1. COVARIATE SHIFT

The misspecification bias can be reduced by minimizing the right-hand side of equation 3, which is the mean squared error over $S \sim p^{\hat\pi}_H$. The predictor $f_\infty$ is usually trained by minimizing $\mathbb{E}_{S \sim G_S}(f(S) - f^\star(S))^2$ with respect to $f$ and does not necessarily minimize the right-hand side of equation 3, due to covariate shift (Shimodaira, 2000), i.e., the mismatch between the training and testing distributions. One approach to alleviating it is to train the predictor by weighted maximum-likelihood estimation. Let us define the algorithm as
$$\alpha^\lambda_f(w, G) = \operatorname{argmin}_{f \in \mathcal{F}}\, \mathbb{E}_{S \sim G}\, w(S)^\lambda (f(S) - f^\star(S))^2, \quad (5)$$
where $w$ is any importance weight and $\lambda \in [0, 1]$ controls the bias and variance of the estimated predictor. By substituting $\alpha^\lambda_f(w, G)$ for $\alpha_f(G)$, the misspecification bias will be reduced.
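For a linear model, the weighted objective in equation 5 reduces to weighted least squares. A small numpy sketch (a hypothetical helper, not the paper's implementation), where `lam = 0` recovers the unweighted fit and `lam = 1` fully reweights toward the test distribution:

```python
import numpy as np

def weighted_fit(X, y, w, lam):
    """alpha_f^lambda for a linear model f(s) = theta @ s:
    minimize the sample mean of w(S)**lam * (f(S) - y)**2."""
    sw = np.sqrt(w ** lam)  # fold the weights into a sqrt-weighted design
    theta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])        # noiseless linear ground truth
w = rng.uniform(0.5, 2.0, size=200)       # stand-in importance weights
theta_unweighted = weighted_fit(X, y, w, lam=0.0)
theta_weighted = weighted_fit(X, y, w, lam=1.0)
```

With noiseless data both fits recover the true coefficients; the trade-off controlled by $\lambda$ matters once the model is misspecified or the labels are noisy, which is exactly the regime of interest in the paper.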

3.1.2. CONSTRAIN A POLICY

The first approach does not always work. If $p^{\hat\pi}_H$ and $G$ are not close enough, the effective sample size of the weighted maximum-likelihood estimation becomes small, leading to poor estimation. This suggests that not all policy learners can be accurately evaluated: those whose state distribution $p^{\hat\pi}_H$ deviates from $G$ are difficult to evaluate. Let us assume that the policy is obtained by solving the following optimization problem: $\alpha_\pi(G) = \operatorname{argmin}_{\pi \in \Pi} \ell(\pi; G)$. While a natural approach is to add a divergence between the generator's distribution and the data distribution to the objective function as a regularization term, this is computationally expensive, especially when the length of the MDP, $H$, is large. We instead propose to regularize the policy, inspired by behavior cloning (Fujimoto & Gu, 2021). Let us first introduce behavior cloning and then discuss how to apply its idea to our problem setting.

Behavior cloning regularizes the policy so that it imitates the behavior policy that generated the data. Let us assume that there exists a behavior policy $\pi_b$ that induces the data distribution, i.e., $p^{\pi_b}_H(s) = G(s)$ for $s \in \mathcal{S}^\star$, which may not be available in our setting. Behavior cloning employs the following regularized objective function:
$$\ell(\pi; G) - \frac{\nu}{H+1} \sum_{h=0}^{H} \mathbb{E}_{S_h \sim p^{\pi_b}_h,\, A_h \sim \pi_b(\cdot \mid S_h)}[\log \pi(A_h \mid S_h)],$$
where $\nu \ge 0$ is a hyperparameter controlling the strength of behavior cloning. The larger $\nu$ is, the more the learned policy resembles the behavior policy, which in turn makes $p^{\hat\pi}_H$ close to the data distribution, and thus we expect to reduce the misspecification bias. A technical challenge in applying behavior cloning to our setting is that $\pi_b$ is not available. Our key observation is that, while $\pi_b$ itself is not available, a trajectory toward each molecule in the dataset can often be reconstructed.
For example, in an MDP that constructs a molecule atom by atom (You et al., 2018), such a trajectory is easily obtained by removing atoms one by one from the molecule; in an MDP that constructs a molecule by chemical reactions (Gottipati et al., 2020), since each molecule in the dataset is assumed to be synthesizable (the molecules in the dataset exist in reality and thus are synthesizable), such a trajectory can be obtained at least for the molecules in the dataset. Letting $\pi^{-1}_b(m) = (s_0, a_0, s_1, a_1, \ldots, s_H = m)$ be a (potentially random) function that reconstructs a trajectory from a molecule, we propose to train a policy regularized toward the data distribution by solving the following optimization problem:
$$\alpha^\nu_\pi(G) := \operatorname{argmin}_{\pi \in \Pi}\, \ell(\pi; G) - \frac{\nu}{H+1} \sum_{h=0}^{H} \mathbb{E}_{M \sim G}\, \mathbb{E}_{S_0, A_0, \ldots, S_H \sim \pi^{-1}_b(M)}[\log \pi(A_h \mid S_h)]. \quad (6)$$
Given the discussion above, at least $\alpha^\nu_\pi(\hat G)$ can be computed. Although this regularization is not sufficient to constrain the divergence between $p^{\hat\pi}_H$ and $G$ (as has been discussed in the imitation-learning literature), we consider the idea of behavior cloning a simple yet effective heuristic, which we investigate in the experiments.
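The regularizer in equation 6 is simply an average log-likelihood of reconstructed trajectories under the current policy. A minimal sketch (all names hypothetical) of the penalized objective:

```python
import numpy as np

def bc_regularized_loss(base_loss, log_pi, trajectories, nu, H):
    """Behavior-cloning-regularized objective of eq. (6): the base RL
    loss minus nu/(H+1) times the expected sum of log pi(A_h | S_h)
    along trajectories reconstructed from dataset molecules."""
    per_traj = [sum(log_pi(s, a) for (s, a) in traj) for traj in trajectories]
    return base_loss - nu / (H + 1) * float(np.mean(per_traj))

# Toy check: two trajectories of (state, action) pairs, H = 1, and a
# stand-in log-policy that assigns log-probability -1 to every action.
toy_trajs = [[(0, 0), (1, 1)], [(0, 1), (2, 0)]]
loss = bc_regularized_loss(
    base_loss=1.0,
    log_pi=lambda s, a: -1.0,
    trajectories=toy_trajs,
    nu=0.5,
    H=1,
)  # 1.0 - 0.5/2 * (-2.0) = 1.5
```

Increasing `nu` penalizes policies that assign low likelihood to the reconstructed dataset trajectories, pulling $p^{\hat\pi}_H$ toward the data distribution as described above.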

3.1.3. DOUBLY-ROBUST PERFORMANCE ESTIMATOR

The third approach to reducing the misspecification bias is a doubly-robust performance estimator, which has been applied in contextual bandits (Dudík et al., 2014) and offline reinforcement learning (Tang et al., 2020) as an alternative to the plug-in performance estimator. Noticing that the performance can also be estimated via importance sampling, which we call an importance-sampling performance estimator, the doubly-robust performance estimator combines the two estimators so as to inherit their benefits.

Importance-Sampling Performance Estimator. Given that $J^\star(\pi) = \mathbb{E}_\pi f^\star(S_H) = \mathbb{E}_{S \sim G_S}(p^\pi_H / G_S)(S) f^\star(S)$ holds, we obtain the importance-sampling performance estimator by substituting an importance weight model for the true importance weight. For any importance weight $w\colon \mathcal{S}^\star \to \mathbb{R}_{\ge 0}$ and distribution $F \in \mathcal{P}(\mathcal{S}^\star \times \mathbb{R})$, let us define an importance-sampling performance function as $J_{\mathrm{IS}}(w, F) := \mathbb{E}_{S \sim F_S} w(S) f^\star(S)$. Then, we obtain the importance-sampling performance estimator as $J_{\mathrm{IS}}(\hat w, \hat G)$, where $\hat w := \alpha_w(\hat\pi, \hat G)$.

Doubly-Robust Performance Estimator. The doubly-robust performance function combines the plug-in and importance-sampling performance functions as follows:
$$J_{\mathrm{DR}}(\pi, w, f, F) := \mathbb{E}_{S \sim F_S}[w(S)(f^\star(S) - f(S))] + \mathbb{E}_\pi f(S_H). \quad (7)$$
This performance function is a combination of the two in the sense that $J_{\mathrm{DR}}(\pi, 0, f, F) = J_{\mathrm{PI}}(\pi, f)$ and $J_{\mathrm{DR}}(\pi, w, 0, F) = J_{\mathrm{IS}}(w, F)$. By substituting $\hat\pi$, $\hat w$, $\hat f$, and $\hat G$ for the arguments, we obtain the doubly-robust performance estimator $J_{\mathrm{DR}}(\hat\pi, \hat w, \hat f, \hat G)$. Let us define $\hat J_{\mathrm{DR}}(G_1, G_2) := J_{\mathrm{DR}}(\alpha_\pi(G_1), \alpha_w(\alpha_\pi(G_1), G_2), \alpha_f(G_2), G_2)$. Then, the misspecification bias is expressed as
$$\Delta_{\mathrm{DR}}(\hat G, G) := \hat J_{\mathrm{DR}}(\hat G, G) - J^\star(\hat\pi) = \mathbb{E}_{S \sim G_S}\big[(w_\infty(S) - (p^{\hat\pi}_H / G)(S))(f^\star(S) - f_\infty(S))\big],$$
where $w_\infty := \alpha_w(\hat\pi, G)$. This suggests that the misspecification bias vanishes if either the predictor or the importance weight is well-specified.

Discussion.
Notice that the misspecification biases of $J_{\mathrm{PI}}$ and $J_{\mathrm{IS}}$ are given by
$$\Delta_{\mathrm{PI}}(\hat G, G) = J_{\mathrm{PI}}(\hat\pi, f_\infty) - J^\star(\hat\pi) = \mathbb{E}_{S \sim G_S}\big[(p^{\hat\pi}_H / G)(S)(f_\infty(S) - f^\star(S))\big],$$
$$\Delta_{\mathrm{IS}}(\hat G, G) := J_{\mathrm{IS}}(w_\infty, G) - J^\star(\hat\pi) = \mathbb{E}_{S \sim G_S}\big[(w_\infty(S) - (p^{\hat\pi}_H / G)(S)) f^\star(S)\big].$$
We can deduce that, for $S \sim G_S$, (i) if $|f^\star(S) - f_\infty(S)| \ll |f^\star(S)|$ holds, the misspecification bias of $J_{\mathrm{DR}}$ will be smaller than that of $J_{\mathrm{IS}}$, and (ii) if $|w_\infty(S) - (p^{\hat\pi}_H / G)(S)| \ll |(p^{\hat\pi}_H / G)(S)|$ holds, the misspecification bias of $J_{\mathrm{DR}}$ will be smaller than that of $J_{\mathrm{PI}}$. Therefore, if we can learn both the predictor and the importance weight well, the doubly-robust performance estimator is preferred to the other two. Otherwise, it can be worse than the others.
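Equation 7 and its two special cases can be checked numerically. A small numpy sketch with hypothetical names, where setting `w` to zero recovers the plug-in estimate and setting `f` to zero recovers the importance-sampling estimate:

```python
import numpy as np

def doubly_robust(sample_states, sample_labels, w, f, final_states):
    """J_DR(pi, w, f, G) of eq. (7): an importance-weighted correction
    over the labeled sample plus the plug-in term over policy rollouts."""
    correction = np.mean([w(s) * (y - f(s))
                          for s, y in zip(sample_states, sample_labels)])
    plugin = np.mean([f(s) for s in final_states])
    return float(correction + plugin)

# Toy data: true property f*(s) = s**2, surrogate underestimates by 10%.
states, labels = [1.0, 2.0, 3.0], [1.0, 4.0, 9.0]
finals = [2.0, 2.0]                       # rollout final states
f_hat = lambda s: 0.9 * s ** 2
j_pi = doubly_robust(states, labels, w=lambda s: 0.0, f=f_hat,
                     final_states=finals)                 # plug-in: 3.6
j_is = doubly_robust(states, labels, w=lambda s: 1.0, f=lambda s: 0.0,
                     final_states=finals)                 # IS: 14/3
```

With both a nonzero weight and the surrogate, the correction term compensates the surrogate's systematic error on the labeled sample, which is the mechanism behind the double robustness discussed above.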

3.1.4. SUMMARY

We have introduced three approaches to reducing the misspecification bias. The first trains the predictor by weighted maximum-likelihood estimation (equation 5). The second constrains the policy by behavior cloning (equation 6). The third is the doubly-robust performance estimator (equation 7). Taking these into consideration, let the combined performance function be
$$\hat J^{\lambda,\nu}_{\mathrm{DR}}(G_1, G_2) := J_{\mathrm{DR}}\big(\alpha^\nu_\pi(G_1),\; \alpha_w(\alpha^\nu_\pi(G_1), G_2),\; \alpha^\lambda_f(\alpha_w(\alpha^\nu_\pi(G_1), G_2), G_2),\; G_2\big),$$
and the combined performance estimator be $\hat J^{\lambda,\nu}_{\mathrm{DR}}(\hat G, \hat G)$. We call the importance weight and the predictor together an evaluator. Note that proposition 2 holds for the combined performance estimator under the additional assumption that $w$ is normalized and entire. Proposition 3 holds for the importance-sampling performance estimator under the additional assumption that the importance weight is unbiased, but we have not found natural assumptions for the doubly-robust one. See appendix E for details.

3.2. REDUCING REUSING BIAS

Given the discussion in the previous section, let us define the reusing bias for any $\hat J \in \{\hat J_{\mathrm{PI}}, \hat J_{\mathrm{IS}}, \hat J_{\mathrm{DR}}, \hat J^{\lambda,\nu}_{\mathrm{DR}}\}$ as
$$b^N(G) := \mathbb{E}_{\hat G \sim G^N}\big[\hat J(\hat G, \hat G) - \hat J(\hat G, G)\big], \quad (4)$$
and let us discuss how to reduce it. Our approach is to estimate the reusing bias and subtract it from the performance estimator. Such a bias correction has been extensively discussed in the literature on information criteria (Konishi & Kitagawa, 2007), which aim to estimate the test performance of a predictor in a supervised learning setting by correcting the bias of its training performance. There are mainly two approaches: the train-test split method and the bootstrap method.

3.2.1. TRAIN-TEST SPLIT BIAS ESTIMATION

The train-test split method randomly divides the sample $\hat G$ into a train part $\hat G_{\mathrm{train}}$ and a test part $\hat G_{\mathrm{test}}$, and estimates the bias as $\hat b_{\mathrm{split}}(\hat G) = \mathbb{E}[\hat J(\hat G_{\mathrm{train}}, \hat G_{\mathrm{train}}) - \hat J(\hat G_{\mathrm{train}}, \hat G_{\mathrm{test}})]$, where the expectation is with respect to the random split. While this estimator seems reasonable, it is not recommended for our problem setting because the bias estimator is itself biased. As demonstrated in proposition 4, the train-test split estimator has an $O(1/N)$ bias, the same order as the bias $b^N(G)$ itself, and therefore it is not reliable. This bias is due to the non-linearity of $\hat J(G_1, G_2)$ with respect to $G_2$, the distribution used for testing. See appendix D.2 for the proof and appendix G for a comparison with supervised learning.

Proposition 4. Suppose we randomly divide the sample such that $|D_{\mathrm{train}}| : |D_{\mathrm{test}}| = \lambda : (1 - \lambda)$ for some $\lambda \in (0, 1)$. Under assumptions 10 and 12, $\mathbb{E}_{\hat G \sim G^N}[\hat b_{\mathrm{split}}(\hat G)] = b^N(G) + O(1/N)$ holds.

Note that direct estimation of the test performance by $\hat J(\hat G_{\mathrm{train}}, \hat G_{\mathrm{test}})$ is similarly not recommended unless the test sample is sufficiently large. See appendix G for a detailed discussion.
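The split estimator itself is mechanically simple. A minimal numpy sketch (hypothetical names; `J_hat(train, evaluate)` stands for any of the performance functions above, training the policy on the first argument and the evaluator on the second):

```python
import numpy as np

def split_bias_estimate(data, J_hat, lam=0.5, n_splits=10, seed=0):
    """Train-test split estimate of the reusing bias:
    average J(G_train, G_train) - J(G_train, G_test) over random
    splits with |train| : |test| = lam : (1 - lam)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    k = int(lam * n)
    diffs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train = [data[i] for i in idx[:k]]
        test = [data[i] for i in idx[k:]]
        diffs.append(J_hat(train, train) - J_hat(train, test))
    return float(np.mean(diffs))
```

The point of proposition 4 is that even this faithful implementation carries an $O(1/N)$ bias of its own, because the test half is used to *train* an evaluator rather than merely to average.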

3.2.2. BOOTSTRAP BIAS ESTIMATION

An alternative approach to estimating the reusing bias (equation 4) is the bootstrap (Efron & Tibshirani, 1994). A bootstrap estimator of the reusing bias $b^N(G)$ is obtained by plugging $\hat G$ in for $G$:
$$\hat b^N(\hat G) = \mathbb{E}_{\hat G^\star \sim \hat G^N}\big[\hat J(\hat G^\star, \hat G^\star) - \hat J(\hat G^\star, \hat G)\big].$$
Let $\hat G^{(m)}$ ($m \in [M]$) be a bootstrap sample obtained by sampling data points uniformly at random $N$ times from the original sample $\hat G$ with replacement. Then, its Monte-Carlo approximation is
$$\hat b^N(\hat G) \approx \frac{1}{M} \sum_{m \in [M]} \big[\hat J(\hat G^{(m)}, \hat G^{(m)}) - \hat J(\hat G^{(m)}, \hat G)\big].$$
In contrast to the train-test split method, the bootstrap can estimate the bias accurately, as stated in proposition 5. See appendix D.2 for its proof.

Proposition 5. Under assumptions 10 and 12, $\mathbb{E}_{\hat G \sim G^N}[\hat b^N(\hat G)] = b^N(G) + O(1/N^2)$ holds.
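The Monte-Carlo approximation above can be sketched in a few lines of numpy (hypothetical names; `J_hat(train, evaluate)` again trains on its first argument and evaluates with its second):

```python
import numpy as np

def bootstrap_bias_estimate(data, J_hat, M=200, seed=0):
    """Bootstrap estimate of the reusing bias b^N(G): resample
    G^(m) from the empirical distribution with replacement and
    average J(G^(m), G^(m)) - J(G^(m), G_hat)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    diffs = []
    for _ in range(M):
        boot = [data[i] for i in rng.integers(0, n, size=n)]
        diffs.append(J_hat(boot, boot) - J_hat(boot, data))
    return float(np.mean(diffs))

# Toy J_hat: "training" picks the arm with the highest mean reward in
# the training sample; "evaluation" reports that arm's mean reward in
# the evaluation sample.
def J_hat(train, evaluate):
    best = max({a for a, _ in train},
               key=lambda a: np.mean([r for b, r in train if b == a]))
    return float(np.mean([r for a, r in evaluate if a == best]))

rng = np.random.default_rng(1)
data = [(a, rng.normal(0.0, 1.0)) for a in [0, 1] * 25]  # two equal arms
bias_hat = bootstrap_bias_estimate(data, J_hat)
```

In expectation this toy's reuse is optimistic, since the selected arm's bootstrap mean exceeds its full-sample mean on average, mirroring proposition 3; subtracting `bias_hat` from the naive reused estimate corrects for that optimism.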

3.2.3. SUMMARY

We have introduced two reusing-bias estimators, drawing on the literature on information criteria. We found that the train-test split estimator, one of the most popular estimators, cannot reliably estimate the bias in our problem setting, although it works in supervised learning. In contrast, the bootstrap bias estimator is provably less biased than the train-test split estimator and can estimate the reusing bias more reliably. We therefore conclude that the bootstrap bias estimator is preferable to the train-test split estimator. From a computational point of view, the bootstrap bias estimator requires training $M$ agents and $M+1$ evaluators. We set $M = 20$ in the experiments, based on a preliminary experiment. Since the bootstrap procedure is easily parallelized with low overhead, its wall-clock time can be reduced in proportion to the available computational resources.

4. EMPIRICAL STUDIES

Let us empirically quantify the two biases as well as the effectiveness of the bias reduction methods. We first describe our experimental setup. See appendix H for full details to ensure reproducibility.

Molecular representation. All functions defined over molecules use the 1024-bit Morgan fingerprint (Morgan, 1965; Rogers & Hahn, 2010) with radius 2 as a feature extractor.

Environment and Agent. We employ the environment and the agent of Gottipati et al. (2020) with minor modifications. The agent receives a molecule as the current state and outputs an action consisting of a reaction template and a reactant. The environment, receiving the action, applies the chemical reaction defined by the action to the current molecule to generate a product, which is then set as the next state. This procedure is repeated $H$ times, and lastly the agent takes action $a_\bot$ and is rewarded with the property of the final product. We set $H = 1$ to reduce the variance in the estimated performance and better highlight the biases and their reduction. The agent is implemented by actor-critic using fully-connected neural networks. We use the reaction templates curated by Button et al. (2019) and prepare the reactants from the set of commercially available substances in the same way as the original environment. The number of reaction templates is 64, of which 15 require one reactant and 49 require two reactants. The number of reactants is 150,560.

Evaluators. As the predictor, we use a fully-connected neural network with one hidden layer of 96 units and softplus activations except for the last layer. It is trained by minimizing the risk defined over $S \sim G_S$. As the importance weight, we use kernel unconstrained least-squares importance fitting (KuLSIF) (Kanamori et al., 2012). In particular, we use the trained predictor, except for its last linear transformation, as a feature extractor and compute a linear kernel on top of it.

Evaluation framework.
To evaluate the biases, we need the true property function $f^\star$, which is not available in general. We thus design a semi-synthetic experiment using a real-world dataset $D_0 = \{(m_n, f^\star(m_n)) \in \mathcal{S}^\star \times \mathbb{R}\}_{n=1}^{N_0}$. While any function $\mathcal{S}^\star \to \mathbb{R}$ could serve as the true property function $f^\star$, we substituted the predictor provided by Gottipati et al. (2020), which was trained on the ChEMBL database (Gaulton et al., 2017) to predict the pIC50 value associated with C-C chemokine receptor type 5 (CCR5). With this property function, we have full access to the environment, and we can construct an offline dataset $D$ of arbitrary sample size by running a random policy on $\mathcal{M}$, which is available in our setting. To decompose the bias into the misspecification bias and the reusing bias, we need $f_\infty$, the predictor obtained with full access to the data-generating distribution $G$. We approximate it by $\alpha_f(\hat G_{\mathrm{test}})$, where $\hat G_{\mathrm{test}}$ is the empirical distribution induced by a large sample $D_{\mathrm{test}}$ of size $10^5$ constructed independently of $D$. This approximation is valid if $|D_{\mathrm{test}}|$ is sufficiently large (see proposition 23). Then, the misspecification bias can be estimated by $\hat J(\hat G, \hat G_{\mathrm{test}}) - J^\star(\hat\pi)$ and the reusing bias by $\hat J(\hat G, \hat G) - \hat J(\hat G, \hat G_{\mathrm{test}})$. The performance estimators are defined by an expectation with respect to a trajectory of a policy, and we estimate them by Monte-Carlo approximation with 1,000 trajectories.

Quantifying the two biases. First, we quantify the misspecification and reusing biases. Specifically, we study the relationship between these biases and the sample size. We vary the training sample size $N$ in $\{2^6, 2^7, \ldots, 2^{13}\}$. For each $N$, we generate five pairs of train and test sets and evaluate the biases as indicated above. We report the means and standard deviations. Figure 1 (left) illustrates the result. We make three observations.
First, when $N = 2^7$, the misspecification bias, $J_{\mathrm{PI}}(\hat\pi, f_\infty) - J^\star(\hat\pi)$, was roughly twice as large as the reusing bias, $J_{\mathrm{PI}}(\hat\pi, \hat f) - J_{\mathrm{PI}}(\hat\pi, f_\infty)$, demonstrating that both are non-negligible. Second, for $N \ge 2^7$, the reusing bias increased as the training sample size decreased, which coincides with proposition 2. The results for $N < 2^7$ did not coincide with it because the sample size is not large enough for the asymptotic expansion to be justified. Third, the ground-truth performance of the policies was rather stable across different training sample sizes. We found that the policies were similar to each other, suggesting that this environment has a local optimum with reasonably good performance (cf. the performance of a random policy is around 5.8). This also suggests that the policy learner in our experiment was insensitive to the particular sample, and the reusing bias in this case is mainly caused by the finiteness of the sample used to train the predictor, not by reusing the same sample.

Quantifying Bias Reduction Methods. We then study the effectiveness of the bias reduction methods presented in section 3. Since the behavior cloning coefficient $\nu$ controls the trade-off between the misspecification bias and the performance of the learned policy, it should be determined according to the user's requirement, i.e., whether the accuracy of the performance estimation or the actual performance is prioritized. We therefore design an experiment to evaluate the effectiveness of the bias reduction methods, varying $\nu$ in the range $\{2^{-4}, \ldots, 2^{4}\}$. We write $\hat J^{\pm\pm}_{\mathrm{PI}}$ for the plug-in estimator, where the superscripts indicate which of the bias reduction strategies are applied, and $\hat J^{\pm\pm}_{\mathrm{DR}}$ accordingly for the doubly-robust performance estimator. We compare the performance estimates by $\hat J^{--}_{\mathrm{PI}}$, $\hat J^{+-}_{\mathrm{PI}}$, $\hat J^{-+}_{\mathrm{PI}}$, and $\hat J^{--}_{\mathrm{DR}}$ to see the effectiveness of each bias reduction strategy. Figure 1 (middle) illustrates the performance estimates for $N = 10^3$. Since $\hat J^{--}_{\mathrm{DR}}$ performs significantly worse than the baseline $\hat J^{--}_{\mathrm{PI}}$, we omit it from the figure. See appendix I for the full result.
We observe that the bootstrap bias correction worked well, while the benefit of the covariate shift strategy was marginal. This indicates that the importance weight estimation did not work well in this setting. Figure 1 (right) illustrates the biases of $\hat J^{--}_{\mathrm{PI}}$ and the reusing bias estimated by the bootstrap method. As expected, the misspecification bias tends to decrease as $\nu$ increases. The reusing bias is under-estimated, but the estimated reusing bias still contributes to the bias correction. In summary, we confirm that (i) behavior cloning can reduce the misspecification bias at the expense of performance degradation, (ii) the reusing bias can be estimated and corrected by the bootstrap, and (iii) the methods using importance weights did not perform well in our setting.

5. RELATED WORK

Our primary contribution is a comprehensive study of theoretically sound evaluation methodology for in silico molecular optimization algorithms using real-world data. Since the pioneering work by Gómez-Bombarelli et al. (2018), a number of studies on this topic have been published in the machine learning and cheminformatics communities to advance the state of the art. While some of them (Gómez-Bombarelli et al., 2016; Takeda et al., 2020; Das et al., 2021) have been validated in vitro, many others have been evaluated in silico. Early studies (Kusner et al., 2017) adopted the octanol-water partition coefficient, log P, penalized by the synthetic accessibility score (Ertl & Schuffenhauer, 2009) and the number of long rings, as the target property to be maximized. The score can be easily computed by RDKit and is often implicitly regarded as a reliable score computed by an accurate simulator. Recently, some have argued that log P optimization is not appropriate as a benchmark task because it is easy to optimize (Brown et al., 2019) or because its prediction can be inaccurate (Yang et al., 2021), and alternative benchmark tasks have been investigated; some propose suites of benchmark tasks (Brown et al., 2019; Polykovskiy et al., 2020) and others use property functions trained on real-world data (Olivecrona et al., 2017; Li et al., 2018a; Jin et al., 2020; Gottipati et al., 2020; Xie et al., 2021). However, most current evaluation protocols rely on the naive plug-in performance estimator.

As far as we are aware, there are at least two empirical studies concerning potential biases in the plug-in performance estimator. Renz et al. (2019) pointed out that the plug-in performance estimator is biased due to data reuse and random initialization of the predictor, while a follow-up study by Langevin et al. (2022) attributed the bias to the train-test split used by Renz et al. (2019): the train and test sets were far from identically distributed. While these two pioneering studies shed light on the potential flaws of the plug-in performance estimator, it has not been fully understood, partially because these studies are empirical. Our contribution to this line of work is that we not only empirically but also theoretically demonstrate potential biases in the current evaluation methodology and present bias reduction methods.

This also unveils why the log P optimization task has been hacked and suggests that the alternative benchmark tasks will be hacked as well, as long as no bias reduction method is applied. The log P function implemented in RDKit (Wildman & Crippen, 1999) is obtained by fitting a linear model to a dataset of experimental log P values, and is in fact a predictor. Our theory suggests that, unless bias reduction methods are applied, the learned agent generates unrealistic molecules that are far from those in the dataset (as has often been reported in log P optimization), and the resultant performance estimate is biased. This mechanism also applies to the alternative benchmark tasks, and we conjecture that they too will be hacked sooner or later. It also suggests that, by incorporating bias reduction methods, we can reliably estimate the performance and therefore safely compare different methods, even on the log P optimization task.

Our work shares a similar objective with the seminal work by Ito et al. (2018), which aims to reduce the reusing bias that appears when solving an optimization problem whose parameters are estimated from data. A major contribution of our work to this literature is to relax their assumption that the predictor is well-specified. This introduces the concept of misspecification bias, which we confirmed to be non-negligible in our application. Another, minor, contribution is to formalize their reusing-bias correction method via the bootstrap and investigate its theoretical properties.

6. CONCLUSION AND FUTURE WORK

We have shown that the plug-in performance estimator is biased in two ways: one bias is due to model misspecification, and the other is due to reusing the same dataset for training and testing. To reduce these biases and obtain more accurate estimates, we recommend (i) adding a constraint to the policy so that the state distribution stays close to the data distribution, and (ii) correcting the reusing bias by bootstrapping if it is non-negligible and we can afford the computation. A future research direction is to improve the importance weight estimation so that the other bias reduction methods become effective. Another is to constrain the policy with less performance degradation. Since methods using variational autoencoders (Gómez-Bombarelli et al., 2018; Jin et al., 2018; Kajino, 2019) can naturally generate molecules similar to those in the data, such methods could be reevaluated.
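For concreteness, recommendation (i) can be instantiated, for example, as a KL-regularized objective; this particular form is an illustrative assumption rather than a prescribed formula, with π_BC denoting a behavior-cloning policy fitted to D and ν ≥ 0 the regularization coefficient:

```latex
\max_{\pi}\; \mathbb{E}_{m \sim \pi}\bigl[f(m; D)\bigr] \;-\; \nu\, D_{\mathrm{KL}}\!\bigl(\pi \,\|\, \pi_{\mathrm{BC}}\bigr)
```

Larger ν keeps the generated molecules closer to the data distribution, reducing the misspecification bias at the cost of lower raw reward, consistent with the trade-off observed in our experiments.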



The reusing bias is caused not only by sample reuse but also by the finiteness of the sample, which becomes clear when the policy is independent of Ĝ; the reusing bias still exists in such a case if f ≠ f_∞. While λ = 1 is optimal as N → ∞, it increases the variance for a finite sample size N, and a smaller λ is favored. The standard supervised learning scenario does not suffer from this bias because the performance estimator is linear with respect to the testing distribution.
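The linearity point can be checked numerically: a functional that is linear in the empirical distribution (a mean, as in supervised test error) is unbiased under resampling, whereas a nonlinear functional (a max, as in optimization) is not. This toy check is ours for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

population = rng.normal(size=1000)        # stand-in for f over all molecules
linear_true = float(population.mean())    # linear functional of G
max_true = float(population.max())        # nonlinear functional of G

lin_est, max_est = [], []
for _ in range(500):
    sample = rng.choice(population, size=50)  # empirical distribution Ĝ
    lin_est.append(sample.mean())             # unbiased: linear in Ĝ
    max_est.append(sample.max())              # biased: nonlinear in Ĝ

print(np.mean(lin_est) - linear_true)     # close to zero
print(np.mean(max_est) - max_true)        # systematically nonzero
```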



3.2.1 BIAS ESTIMATION BY TRAIN-TEST SPLIT

The first approach estimates the bias via a train-test split of the sample. The sample D is randomly split into D_train and D_test such that D_train ∩ D_test = ∅ and D_train ∪ D_test = D. Let Ĝ_train and Ĝ_test denote the corresponding empirical distributions. The reusing bias is estimated by b̂ = J_PI(π, Ĝ_train) − J_PI(π, Ĝ_test), where π is trained using D_train.
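The train-test split estimate can be sketched numerically on a toy problem in which the predictor is a per-split sample mean and the policy is an argmax; all variable names are illustrative assumptions, not our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy sample D: 40 noisy measurements of 50 candidates, all truly worth 0
obs = rng.normal(scale=1.0, size=(40, 50))

est = []
for _ in range(100):                      # average over random splits
    perm = rng.permutation(obs.shape[0])
    train, test = obs[perm[:20]], obs[perm[20:]]
    f_train = train.mean(axis=0)          # predictor fitted on D_train
    f_test = test.mean(axis=0)            # independent refit on D_test
    m_hat = int(np.argmax(f_train))       # policy optimized against f_train
    est.append(f_train[m_hat] - f_test[m_hat])

bias_hat = float(np.mean(est))            # estimated reusing bias
print(bias_hat)                           # positive: training-split optimism
```

Because the policy is optimized against f_train, the training-split score of the selected candidate is systematically higher than its held-out score, and the difference estimates the reusing bias.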

Figure 1: Lines show means and shaded areas show standard deviations. (Left) Biases vs. the sample size; J_PI(π, f) − J_PI(π, f_∞) corresponds to the reusing bias and J_PI(π, f_∞) − J^⋆(π) to the misspecification bias. (Middle) Comparison between bias reduction methods. (Right) Comparison between the misspecification bias, the reusing bias, and the estimated reusing bias.

Let J^{b1 b2} (b1, b2 ∈ {+, −}) be the plug-in performance estimator with covariate shift (b1 = +) or without it (b1 = −), and with bootstrap bias reduction (b2 = +) or without it (b2 = −).

