BIASES IN EVALUATION OF MOLECULAR OPTIMIZATION METHODS AND BIAS REDUCTION STRATEGIES

Anonymous authors
Paper under double-blind review

Abstract

We study in silico evaluation methodology for molecular optimization methods. Given a sample of molecules labeled with a target property, we wish not only to train a generator that discovers molecules optimized with respect to that property, but also to evaluate its performance accurately. A common practice is to train a predictor of the target property on the sample and use it both to train and to evaluate the generator. We theoretically investigate this evaluation methodology and show that it potentially suffers from two biases: one due to misspecification of the predictor, and the other due to reusing the same sample for training and evaluation. We discuss bias reduction methods for each bias and empirically investigate their effectiveness.

1. INTRODUCTION

Molecular optimization aims to discover novel molecules with improved properties, and is often formulated as reinforcement learning by modeling the construction of a molecule as a Markov decision process. The performance of such agents is measured by the quality of the generated molecules. In the machine learning community, most molecular optimization methods have been validated by computer simulation. Since most of the generated molecules are novel, their properties are unknown, and we must resort to a predictor to estimate them. However, little attention has been paid to how reliable such estimates are, except for a few empirical studies (Renz et al., 2019; Langevin et al., 2022), making the existing performance estimates less trustworthy. In this paper, we study the statistical properties of such performance estimators to enhance our understanding of the evaluation protocol, and we discuss several directions to improve it.

Let us first introduce a common practice to estimate the performance. Let $S^\star$ be a set of molecules, $f^\star : S^\star \to \mathbb{R}$ be a property function evaluating the target property of the input molecule, and $D = \{(m_n, f^\star(m_n)) \in S^\star \times \mathbb{R}\}_{n=1}^{N}$ be a sample. We typically train a predictor $f(m; D)$ using $D$, regard it as the true property function, and follow the standard evaluation protocol of online reinforcement learning. That is, an agent is trained to optimize the properties of discovered molecules as computed by $f(m; D)$, and its performance is estimated by letting it generate novel molecules and estimating their properties, again by $f(m; D)$. We call this the plug-in performance estimator (section 2.1).

Our research question is how accurate the plug-in performance estimator is compared with the true performance computed by $f^\star$. We first point out that the plug-in performance estimator is biased in two ways, indicating that it is not reliable in general (section 2.2).
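As a concrete illustration of this protocol, the following toy sketch trains a (deliberately misspecified) linear predictor on a small sample, lets a greedy "agent" maximize it, and compares the plug-in estimate with the true performance. All names (`f_star`, `f_hat`, the integer "molecules") are hypothetical stand-ins, not the paper's actual setup.

```python
# Toy sketch of the plug-in performance estimator (illustrative names only).
def f_star(m):                      # true property, unknown to the agent
    return -(m - 7) ** 2            # peaked at m = 7

D = [(m, f_star(m)) for m in (2, 4, 6, 8, 10)]   # sample of labelled molecules

# Train a (misspecified) linear predictor f_hat(m; D) by least squares.
mx = sum(m for m, _ in D) / len(D)
my = sum(y for _, y in D) / len(D)
slope = sum((m - mx) * (y - my) for m, y in D) / sum((m - mx) ** 2 for m, _ in D)
f_hat = lambda m: slope * (m - mx) + my

# "Agent": greedily pick the candidate molecule that maximises the predictor.
m_best = max(range(20), key=f_hat)

J_plugin = f_hat(m_best)   # plug-in estimate of the agent's performance
J_true = f_star(m_best)    # true performance
# J_plugin far exceeds J_true: the agent exploits the predictor's
# misspecification outside the training sample.
```

In this sketch the agent is driven to the edge of the candidate space, where the linear predictor extrapolates optimistically, so the plug-in estimate grossly overstates the true performance.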
The first bias, called the model misspecification bias, comes from the deviation between the predictor and the true property function evaluated over the molecules discovered by the learned agent. This bias is closely related to the one encountered under covariate shift (Shimodaira, 2000). It grows as the molecules discovered by the agent become dissimilar to those used to train the predictor. The second bias, called the reusing bias, is caused by reusing the same dataset for training and testing the agent. Due to these biases, the plug-in performance estimator is not necessarily a good estimator of the true performance.

We then discuss strategies to reduce these two biases. Section 3.1 introduces three approaches to reducing the misspecification bias. Since it is caused by covariate shift, it can be reduced by training the predictor with covariate shift taken into account (section 3.1.1) and/or by constraining the agent so that the generated molecules remain similar to those in the sample (section 3.1.2). Yet another approach is to use a more sophisticated estimator called a doubly robust performance estimator (section 3.1.3).

Our idea for correcting the reusing bias comes from an analogy to model selection (Konishi & Kitagawa, 2007), whose objective is to estimate the test performance by correcting the bias of the training performance, i.e., the performance computed by reusing the same dataset for training and testing. Given the analogy, one may consider a train-test split as the first choice. We argue, however, that it is not as effective here as in model selection, owing to a key difference between our setting and model selection: the test set in model selection is used to take an expectation, while in our setting it is used to train a predictor, which is a much more complex operation than taking an expectation. This complexity introduces a non-negligible bias into the train-test split estimator, resulting in a less accurate bias estimate (section 3.2.1).
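The reusing bias can be made concrete with a toy "winner's curse" experiment: when the true property is identically zero, any positive plug-in estimate is pure bias, and a train-test split removes it in this simple case (the toy is too simple to exhibit the residual split bias discussed above, since here the "predictor" is a mere per-molecule average). The noise model and all names below are illustrative assumptions.

```python
import random

# Toy illustration of the reusing bias: f_star(m) = 0 for every molecule,
# so any positive performance estimate is optimistic bias.
random.seed(1)
MOLS = range(10)

def observe(m):                      # noisy measurement of f_star(m) = 0
    return random.gauss(0.0, 1.0)

def fit(sample):                     # "predictor": per-molecule mean label
    means = {}
    for m, y in sample:
        means.setdefault(m, []).append(y)
    return lambda m: sum(means[m]) / len(means[m])

trials = 2000
reused, split = 0.0, 0.0
for _ in range(trials):
    data = [(m, observe(m)) for m in MOLS for _ in range(2)]
    # Reusing: select and evaluate the best molecule with the same predictor.
    f_all = fit(data)
    reused += f_all(max(MOLS, key=f_all))
    # Train-test split: evaluate the chosen molecule with a fresh predictor.
    f_train, f_test = fit(data[::2]), fit(data[1::2])
    split += f_test(max(MOLS, key=f_train))
reused /= trials
split /= trials
# `reused` is clearly positive (optimistic bias); `split` stays near zero.
```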
We instead propose to use a bootstrap method in section 3.2.2, which is proven to estimate the reusing bias more accurately than the train-test split method does. We empirically validate our theory in section 4. First, we quantify the two biases and confirm that both are non-negligible and that the reusing bias increases as the sample size decreases, as predicted by our theory. Second, we assess the effectiveness of the bias reduction methods and confirm that the reusing bias can be corrected, while the misspecification bias can be reduced only at the cost of degrading the agent's performance.

Notation. For any distribution $G$, let $\hat{G} \sim G^N$ denote the empirical distribution of a sample of $N$ items independently drawn from $G$. For a set $X$, let $\delta_x$ be Dirac's delta distribution at $x \in X$. For any integer $M \in \mathbb{N}$, let $[M] := \{0, \ldots, M-1\}$. For any set $A$, $\mathcal{P}(A)$ denotes the set of probability distributions defined over $A$.

Problem setting. We define a molecular optimization problem using a Markov decision process (MDP) of length $H + 1$ ($H \in \mathbb{N}$). See appendix A for concrete examples. Let $S$ be a set of states, and $s_\bot \in S$ be the terminal state. Let $S^\star \subseteq S$ be the subset of states that correspond to valid molecules; the remaining states correspond to possibly incomplete representations of molecules (invalid molecules). Let $A$ be a set of actions that transform a valid or invalid molecule into another one. There exists a terminal action $a_\bot \in A$ that evaluates the property of the molecule at step $H$, after which the state transits to the terminal state $s_\bot$. For each step $h \in [H+1]$, let $T_h : S \times A \to \mathcal{P}(S)$ be a state transition distribution and $r_h : S \times A \to \mathbb{R}$ be a reward function, and let $\rho_0 \in \mathcal{P}(S)$ be the initial state distribution. We assume that the set of states at step $H$ is limited to $S^\star$, and the reward function is defined as $r_h(s, a) = 0$ for $h \in [H]$ and $r_H(s, a_\bot) = f^\star(s)$ for $s \in S^\star$. Let $M = \{S, A, \{T_h\}_{h=0}^{H}, \rho_0, H\}$ be the dynamical model of the MDP.
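The sparse reward structure above, $r_h = 0$ everywhere except at the terminal step $H$ under the terminal action, can be sketched as follows. The state strings, the value of `H`, and the stand-in property function are illustrative assumptions only.

```python
# Minimal sketch of the MDP's sparse reward structure (hypothetical names).
H = 3
TERMINAL_ACTION = "a_bot"

def f_star(s):                 # stand-in property function on final states
    return float(len(s))

def reward(h, s, a):
    """r_h(s, a) = 0 for h in [H]; r_H(s, a_bot) = f_star(s)."""
    if h == H and a == TERMINAL_ACTION:
        return f_star(s)
    return 0.0

# The return of any trajectory therefore equals f_star of its final state.
trajectory = [("C", "grow"), ("CC", "grow"), ("CCO", "grow"),
              ("CCO", TERMINAL_ACTION)]
total = sum(reward(h, s, a) for h, (s, a) in enumerate(trajectory))
```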
Throughout the paper, we assume that $M$ is known and omit the dependency on it in expressions. Let $\Pi$ be the set of policies and $\pi = \{\pi_h(\cdot \mid s)\}_{h=0}^{H} \in \Pi$ be a policy modeled by a probability distribution over $A$ conditioned on $s \in S$. At each step $h \in [H+1]$, the agent takes action $a_h$ sampled from $\pi_h(\cdot \mid s_h)$. The performance of a policy is measured by the expected cumulative reward, $J^\star(\pi) := \mathbb{E}_\pi[\sum_{h=0}^{H} r_h(S_h, A_h)] = \mathbb{E}_\pi[f^\star(S_H)]$, where $\mathbb{E}_\pi[\cdot]$ is the expectation with respect to the Markov process induced by applying policy $\pi$ on $M$. Letting $p^\pi_h \in \mathcal{P}(S^\star)$ be the distribution of states visited by policy $\pi$ at step $h \in [H+1]$, the expected cumulative reward can alternatively be expressed as $J^\star(\pi) = \mathbb{E}_{S \sim p^\pi_H}[f^\star(S)]$.

In practice, the property function is not available; instead, a sample of its evaluations, $D = \{(m_n, f^\star(m_n)) \in S^\star \times \mathbb{R}\}_{n=1}^{N}$, is available. Let us assume that each tuple is independently distributed according to $G \in \mathcal{P}(S^\star \times \mathbb{R})$. Let $G_S \in \mathcal{P}(S^\star)$ be the marginal distribution over $S^\star$ induced from $G$. For a theoretical reason clarified in appendix B, we use the empirical distribution of the sample, $\hat{G} \in \mathcal{P}(S^\star \times \mathbb{R})$, rather than the sample itself (assumption 10), and we refer to $\hat{G}$ as an empirical distribution and a sample interchangeably. Let us define a policy learner $\alpha_\pi : \mathcal{P}(S^\star \times \mathbb{R}) \to \Pi$, an algorithm that learns a policy from a distribution over $S^\star \times \mathbb{R}$. It typically receives a sample $\hat{G}$ and outputs a policy, which we denote $\pi := \alpha_\pi(\hat{G})$. Our objective is to evaluate its performance $J^\star(\pi)$ given access only to $\alpha_\pi$, $\hat{G}$, and $M$.
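Since $J^\star(\pi) = \mathbb{E}_\pi[f^\star(S_H)]$, the performance of a policy can be approximated by Monte Carlo rollouts whenever $f^\star$ is available, as in simulation studies. The toy chain MDP and uniform random policy below are hypothetical; the sketch only illustrates the definition.

```python
import random

# Monte Carlo sketch of J(pi) = E_pi[f_star(S_H)] on a toy chain MDP.
random.seed(0)
H = 4

def f_star(s):
    return -(s - 3) ** 2           # toy property of the final state

def policy(h, s):
    return random.choice([0, 1])   # uniform random action: stay or increment

def rollout():
    s = 0                          # initial state drawn from rho_0
    for h in range(H):
        s = s + policy(h, s)       # deterministic transition T_h
    return f_star(s)               # terminal reward r_H

n_rollouts = 10_000
J_hat = sum(rollout() for _ in range(n_rollouts)) / n_rollouts
# S_H ~ Binomial(4, 1/2), so the exact value is E[-(S_H - 3)^2] = -2.
```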

2. BIASES OF PLUG-IN PERFORMANCE ESTIMATOR

A widely used approach to estimating $J^\star(\pi)$ is the plug-in performance estimator (section 2.1). We point out that it is biased in two ways (section 2.2) and theoretically characterize these biases in sections 2.3 and 2.4.

