META-LEARNING ADAPTIVE DEEP KERNEL GAUSSIAN PROCESSES FOR MOLECULAR PROPERTY PREDICTION

Abstract

We propose Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework for learning deep kernel Gaussian processes (GPs) by interpolating between meta-learning and conventional deep kernel learning. Our approach employs a bilevel optimization objective where we meta-learn generally useful feature representations across tasks, in the sense that task-specific GP models estimated on top of such features achieve the lowest possible predictive loss on average. We solve the resulting nested optimization problem using the implicit function theorem (IFT). We show that our ADKF-IFT framework contains previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT) as special cases. Although ADKF-IFT is a completely general method, we argue that it is especially well-suited for drug discovery problems and demonstrate that it significantly outperforms previous state-of-the-art methods on a variety of real-world few-shot molecular property prediction tasks and out-of-domain molecular property prediction and optimization tasks.

1. INTRODUCTION

Many real-world applications require machine learning algorithms to make robust predictions with well-calibrated uncertainty given very limited training data. One important example is drug discovery, where practitioners not only want models to accurately predict biochemical/physicochemical properties of molecules, but also want to use models to guide the search for novel molecules with desirable properties, leveraging techniques such as Bayesian optimization (BO) which heavily rely on accurate uncertainty estimates (Frazier, 2018). Despite the meteoric rise of neural networks over the past decade, their notoriously overconfident and unreliable uncertainty estimates (Szegedy et al., 2013) make them generally ineffective surrogate models for BO. Instead, most contemporary BO implementations use Gaussian processes (GPs) (Rasmussen & Williams, 2006) as surrogate models due to their analytically-tractable and generally reliable uncertainty estimates, even on small datasets. Traditionally, GPs are fit on hand-engineered features (e.g., molecular fingerprints), which can limit their predictive performance on complex, structured, high-dimensional data where designing informative features is challenging (e.g., molecules). Naturally, a number of works have proposed to improve performance by instead fitting GPs on features learned by a deep neural network: a family of models generally called deep kernel GPs. However, there is no clear consensus about how to train these models: maximizing the GP marginal likelihood (Hinton & Salakhutdinov, 2007; Wilson et al., 2016b) has been shown to overfit on small datasets (Ober et al., 2021), while meta-learning (Patacchiola et al., 2020) and fully-Bayesian approaches (Ober et al., 2021) avoid this at the cost of making strong, often unrealistic assumptions. This suggests that there is demand for new, better techniques for training deep kernel GPs.
In this work, we present a novel, general framework called Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT) for training deep kernel GPs which we believe is especially well-suited to small datasets. ADKF-IFT essentially trains a subset of the model parameters with a meta-learning loss, and separately adapts the remaining parameters on each task using maximum marginal likelihood. In contrast to previous methods which use a single loss for all parameters, ADKF-IFT is able to utilize the implicit regularization of meta-learning to prevent overfitting while avoiding the strong assumptions of a pure meta-learning approach which may lead to underfitting. The key contributions and outline of the paper are as follows:

1. As our main technical contribution, we present the general ADKF-IFT framework and its natural formulation as a bilevel optimization problem (Section 3.1), then explain how the implicit function theorem (IFT) can be used to efficiently solve it with gradient-based methods in a few-shot learning setting (Section 3.2).
2. We show how ADKF-IFT can be viewed as a generalization and unification of previous approaches based purely on single-task learning (Wilson et al., 2016b) or purely on meta-learning (Patacchiola et al., 2020) for training deep kernel GPs (Section 3.3).
3. We propose a specific practical instantiation of ADKF-IFT wherein all feature extractor parameters are meta-learned, which has a clear interpretation and obviates the need for any Hessian approximations. We argue why this particular instantiation is well-suited to retain the best properties of previously proposed methods (Section 3.4).
4. Motivated by the general demand for better GP models in chemistry, we perform an extensive empirical evaluation of ADKF-IFT on several chemical tasks, finding that it significantly improves upon previous state-of-the-art methods (Section 5).

2. BACKGROUND AND NOTATION

Gaussian Processes (GPs) are tools for specifying Bayesian priors over functions (Rasmussen & Williams, 2006). A GP(m_θ(·), c_θ(·, ·)) is fully specified by a mean function m_θ(·) and a symmetric positive-definite covariance function c_θ(·, ·). The covariance function encodes the inductive bias (e.g., smoothness) of a GP. One advantage of GPs is that it is easy to perform principled model selection for the hyperparameters θ ∈ Θ using the marginal likelihood p(y | X, θ) evaluated on the training data (X, y), and to obtain closed-form probabilistic predictions p(y* | X*, X, y, θ) for the test data (X*, y*); we refer the readers to Rasmussen & Williams (2006) for more details.

Deep Kernel Gaussian Processes are GPs whose covariance function is constructed by first using a neural network feature extractor f_ϕ with parameters ϕ ∈ Φ to create feature representations h = f_ϕ(x), h′ = f_ϕ(x′) of the input points x, x′, then feeding these feature representations into a standard base kernel c_θ(h, h′) (e.g., an RBF kernel) (Hinton & Salakhutdinov, 2007; Wilson et al., 2016b;a; Bradshaw et al., 2017; Calandra et al., 2016). The complete covariance function is therefore k_ψ(x, x′) = c_θ(f_ϕ(x), f_ϕ(x′)), with learnable parameters ψ = (θ, ϕ).

Few-shot Learning refers to learning on many related tasks when each task has few labelled examples (Miller et al., 2000; Lake et al., 2011). In the standard problem setup, one is given a set of training tasks D = {T_t}_{t=1}^T (a meta-dataset) and some unseen test tasks D* = {T*}. Each task T = {(x_i, y_i)}_{i=1}^{N_T} is a set of points in the domain X (e.g., the space of molecules) with corresponding labels (continuous, categorical, etc.), and is partitioned into a support set S_T ⊆ T for training and a query set Q_T = T \ S_T for testing. Typically, the total number of training tasks T = |D| is large, while the size of each support set |S_T| is small.
Models for few-shot learning are typically trained to accurately predict Q_T given S_T for T ∈ D during a meta-training phase, then evaluated by their prediction error on Q_{T*} given S_{T*} for unseen test tasks T* ∈ D* during a meta-testing phase.
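As a concrete reference for the quantities above, the following NumPy sketch (our illustration, not the paper's code) computes the GP negative log marginal likelihood and the closed-form posterior predictive for an RBF base kernel applied to pre-extracted feature vectors:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, amplitude=1.0):
    """Base kernel c_theta(h, h') on feature representations."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amplitude * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gp_nlml(X, y, lengthscale=1.0, amplitude=1.0, noise=1e-2):
    """Negative log marginal likelihood -log p(y | X, theta)."""
    K = rbf_kernel(X, X, lengthscale, amplitude) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, y)  # z = L^{-1} y, so z @ z = y^T K^{-1} y
    return 0.5 * z @ z + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def gp_predict(X, y, X_star, lengthscale=1.0, amplitude=1.0, noise=1e-2):
    """Closed-form posterior predictive p(y* | X*, X, y, theta)."""
    K = rbf_kernel(X, X, lengthscale, amplitude) + noise * np.eye(len(X))
    K_star = rbf_kernel(X_star, X, lengthscale, amplitude)
    mean = K_star @ np.linalg.solve(K, y)
    cov = (rbf_kernel(X_star, X_star, lengthscale, amplitude)
           - K_star @ np.linalg.solve(K, K_star.T))
    return mean, cov
```

In a deep kernel GP, `X` would be replaced by the features f_ϕ(x); the closed-form predictive is what makes GP uncertainty estimates cheap and reliable on small support sets.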

3.1. THE GENERAL ADKF-IFT FRAMEWORK FOR LEARNING DEEP KERNEL GPS

Let A_Θ and A_Φ respectively be the sets of base kernel and feature extractor parameters of a deep kernel GP, and denote the set of all parameters by A_Ψ = A_Θ ∪ A_Φ. The key idea of the general ADKF-IFT framework is that only a subset of the parameters A_Ψadapt ⊆ A_Ψ is adapted to each individual task by minimizing a train loss L_T, while the remaining parameters A_Ψmeta = A_Ψ \ A_Ψadapt are meta-learned during a meta-training phase to yield the best possible validation loss L_V on average over many related training tasks (after A_Ψadapt is separately adapted to each of these tasks). This can be naturally formalized as the following bilevel optimization problem:

$$\psi^*_{\text{meta}} = \arg\min_{\psi_{\text{meta}}} \; \mathbb{E}_{p(\mathcal{T})}\big[\mathcal{L}_V\big(\psi_{\text{meta}},\, \psi^*_{\text{adapt}}(\psi_{\text{meta}}, S_\mathcal{T}),\, \mathcal{T}\big)\big], \tag{1}$$

$$\text{such that}\quad \psi^*_{\text{adapt}}(\psi_{\text{meta}}, S_\mathcal{T}) = \arg\min_{\psi_{\text{adapt}}} \; \mathcal{L}_T\big(\psi_{\text{meta}},\, \psi_{\text{adapt}},\, S_\mathcal{T}\big). \tag{2}$$

Equations (1) and (2) are most easily understood by considering the meta-learned parameters ψ_meta and the task-specific parameters ψ_adapt separately. For a given task T and an arbitrary value of the meta-learned parameters ψ_meta, Equation (2) chooses the task-specific parameters ψ_adapt to minimize the train loss L_T evaluated on the task's support set S_T. That is, ψ_adapt is adapted to the support set S_T of task T, with the aim of producing the best possible model on S_T for the given value of ψ_meta. The result is a model with optimal task-specific parameters ψ*_adapt(ψ_meta, S_T) for the given meta-learned parameters ψ_meta and task T. The remaining question is how to choose a value for the meta-learned parameters ψ_meta, knowing that ψ_adapt will be adapted separately to each task. In Equation (1), we propose to choose ψ_meta to minimize the expected validation loss L_V over a distribution of training tasks p(T). There are two reasons for this. First, on any given task T, the validation loss usually reflects the performance metric of interest on the query set Q_T of T (e.g., the prediction error).
Second, because the same value of ψ_meta will be used for all tasks, it makes sense to choose a value whose expected performance is good across many tasks drawn from p(T). That is, ψ_meta is chosen such that a GP achieves the lowest possible average validation loss on the query set Q_T of a random training task T ∼ p(T) after ψ_adapt is adapted to the task's support set S_T. In practice, ψ_meta is optimized during a meta-training phase using a set of training tasks D to approximate Equation (1). After meta-training (i.e., at meta-test time), we make predictions for each unseen test task T* using the joint GP posterior predictive distribution with the optimal parameters ψ*_meta and ψ*_adapt(ψ*_meta, S_{T*}):

$$p\big(Q^y_{\mathcal{T}^*} \mid Q^x_{\mathcal{T}^*},\, S_{\mathcal{T}^*},\, \psi^*_{\text{meta}},\, \psi^*_{\text{adapt}}(\psi^*_{\text{meta}}, S_{\mathcal{T}^*})\big). \tag{3}$$

Note that the description above does not specify a particular choice of A_Ψmeta, A_Ψadapt, L_T, or L_V. This is intentional, as there are many reasonable choices for these quantities. Because of this, we believe that ADKF-IFT should be considered a general framework, with a particular choice of these quantities being an instantiation of the framework. We give examples of this in Sections 3.3 and 3.4.

3.2. EFFICIENT META-TRAINING ALGORITHM

In general, bilevel objectives such as Equation (1) are computationally challenging to optimize, mainly because each evaluation of the objective requires solving a separate inner optimization problem (2). Calculating the hypergradient (i.e., total derivative) of the validation loss L_V w.r.t. the meta-learned parameters ψ_meta would allow Equation (1) to be solved with gradient-based optimization:

$$\frac{\mathrm{d}\mathcal{L}_V}{\mathrm{d}\psi_{\text{meta}}} = \frac{\partial \mathcal{L}_V}{\partial \psi_{\text{meta}}} + \frac{\partial \mathcal{L}_V}{\partial \psi^*_{\text{adapt}}}\, \frac{\partial \psi^*_{\text{adapt}}}{\partial \psi_{\text{meta}}}. \tag{4}$$

However, Equation (4) reveals that this requires calculating ∂ψ*_adapt/∂ψ_meta, i.e., how the optimal task-specific parameters ψ*_adapt(ψ_meta, S_T) change with respect to the meta-learned parameters ψ_meta. Calculating this naively with automatic differentiation platforms would require tracking gradients through many iterations of the inner optimization (2), which in practice requires too much memory to be feasible. Fortunately, because ψ*_adapt is an optimum of the train loss L_T, Cauchy's Implicit Function Theorem (IFT) provides a formula for ∂ψ*_adapt/∂ψ_meta at an arbitrary value ψ′_meta of the meta-learned parameters for a given task T′:

$$\left.\frac{\partial \psi^*_{\text{adapt}}}{\partial \psi_{\text{meta}}}\right|_{\psi'_{\text{meta}}} = -\left[\frac{\partial^2 \mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_{\mathcal{T}'})}{\partial \psi_{\text{adapt}}\, \partial \psi_{\text{adapt}}^\top}\right]^{-1} \left.\frac{\partial^2 \mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_{\mathcal{T}'})}{\partial \psi_{\text{adapt}}\, \partial \psi_{\text{meta}}^\top}\right|_{\psi'_{\text{meta}},\, \psi'_{\text{adapt}}}, \tag{5}$$

where ψ′_adapt = ψ*_adapt(ψ′_meta, S_T′). A full statement of the implicit function theorem in the context of ADKF-IFT can be found in Appendix A. The only potential difficulty with Equation (5) is the computation and inversion of the Hessian matrix ∂²L_T(ψ_meta, ψ_adapt, S_T)/∂ψ_adapt ∂ψ_adaptᵀ. This computation can be done exactly if |A_Ψadapt| is small, which is the case considered in this paper (as will be discussed in Section 3.4).
Otherwise, an approximation to the inverse Hessian (e.g., the Neumann approximation (Lorraine et al., 2020; Clarke et al., 2022)) could be used, which reduces both the memory and computational complexity to O(|A_Ψ|). Combining Equations (4) and (5), we obtain a recipe for computing the hypergradient dL_V/dψ_meta exactly for a single task, as summarized in Algorithm 1. The meta-learned parameters ψ_meta can then be updated with the expected hypergradient over p(T).

Algorithm 1 Exact hypergradient computation in ADKF-IFT.
1: Input: a training task T′ and the current meta-learned parameters ψ′_meta.
2: Solve Equation (2) to obtain ψ′_adapt = ψ*_adapt(ψ′_meta, S_T′).
3: Compute g₁ = ∂L_V(ψ_meta, ψ_adapt, T′)/∂ψ_meta and g₂ = ∂L_V(ψ_meta, ψ_adapt, T′)/∂ψ_adapt at (ψ′_meta, ψ′_adapt) by auto-diff.
4: Compute the Hessian H = ∂²L_T(ψ_meta, ψ_adapt, S_T′)/∂ψ_adapt ∂ψ_adaptᵀ at (ψ′_meta, ψ′_adapt) by auto-diff.
5: Solve the linear system vH = g₂ for v.
6: Compute the mixed partial derivatives P = ∂²L_T(ψ_meta, ψ_adapt, S_T′)/∂ψ_adapt ∂ψ_metaᵀ at (ψ′_meta, ψ′_adapt) by auto-diff.
7: Output: the hypergradient dL_V/dψ_meta = g₁ − vP. ▷ Equations (4) and (5)
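To make Algorithm 1 concrete, here is a minimal sketch (our illustration, not the paper's code) on a toy scalar bilevel problem where every quantity in steps 2–7 is available in closed form; the IFT hypergradient matches a finite-difference derivative of the outer objective:

```python
# Toy scalar bilevel problem (illustrative stand-ins for L_T and L_V):
#   inner: L_T(m, a) = 0.5 * (a - m)**2 + 0.5 * lam * a**2  =>  a*(m) = m / (1 + lam)
#   outer: L_V(m, a) = 0.5 * (a - 1.0)**2 + 0.1 * m**2
lam = 0.5

def inner_argmin(m):
    """Step 2: solve the inner problem (closed form for this toy L_T)."""
    return m / (1.0 + lam)

def hypergradient(m):
    """Steps 3-7 of Algorithm 1 in the scalar case."""
    a = inner_argmin(m)
    g1 = 0.2 * m          # direct partial  dL_V/dm
    g2 = a - 1.0          # partial         dL_V/da
    H = 1.0 + lam         # Hessian         d^2 L_T / da^2
    P = -1.0              # mixed partial   d^2 L_T / (da dm)
    v = g2 / H            # solve v H = g2
    return g1 - v * P     # Equations (4) and (5)

def outer_objective(m):
    """Total objective L_V(m, a*(m)), used only to check the hypergradient."""
    a = inner_argmin(m)
    return 0.5 * (a - 1.0) ** 2 + 0.1 * m ** 2
```

In ADKF-IFT proper, `a` becomes the base kernel parameters θ, `m` the feature extractor parameters ϕ, and steps 3–6 are computed by automatic differentiation instead of by hand.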

3.3. ADKF-IFT AS A UNIFICATION OF PREVIOUS METHODS

In prior work, the most common method used to train deep kernel GPs is to minimize the negative log marginal likelihood (NLML) on a single dataset (optionally with extra regularization terms). This is commonly referred to as Deep Kernel Learning (DKL) (Wilson et al., 2016b), and is stated explicitly in Equation (6). The most notable departure from DKL is Deep Kernel Transfer (DKT) (Patacchiola et al., 2020), which instead proposes to train deep kernel GPs entirely by meta-learning, minimizing the expected NLML over a distribution of training tasks, as stated explicitly in Equation (7):

$$\psi^* = \arg\min_{\psi} \operatorname{NLML}(\psi, S_\mathcal{T}), \tag{6}$$

$$\psi^* = \arg\min_{\psi} \mathbb{E}_{p(\mathcal{T})}\big[\operatorname{NLML}(\psi, \mathcal{T})\big]. \tag{7}$$

Interestingly, both DKL and DKT can be viewed as special cases of the general ADKF-IFT framework. It is simple to see that choosing the partition A_Ψmeta = ∅, A_Ψadapt = A_Ψ and the train loss L_T to be the NLML in Equations (1) and (2) yields Equation (6): DKL is just ADKF-IFT with no parameters meta-learned. Similarly, choosing the partition A_Ψmeta = A_Ψ, A_Ψadapt = ∅ and the validation loss L_V to be the NLML in Equations (1) and (2) yields Equation (7): DKT is just ADKF-IFT with all parameters meta-learned. This makes ADKF-IFT strictly more general than these two methods.

3.4. HIGHLIGHTED ADKF-IFT INSTANTIATION: META-LEARN ϕ, ADAPT θ

Among the many possible variations of ADKF-IFT, we wish to highlight the following instantiation:

• A_Ψmeta = A_Φ, i.e., all feature extractor parameters ϕ are meta-learned across tasks.
• A_Ψadapt = A_Θ, i.e., all base kernel parameters θ (e.g., noise, lengthscales, etc.) are adapted.
• The train loss L_T is the negative log GP marginal likelihood on S_T, and the validation loss L_V is the negative log joint GP predictive posterior on Q_T given S_T. Please refer to Appendix B for the equations of these loss functions.

There are several benefits to this choice. First, this particular choice of loss functions has the advantage that the prediction procedure during meta-testing (as defined in Equation (3)) exactly matches the meta-training procedure, thereby closely following the principle of learning to learn. Second, the partition of parameters can be intuitively understood as meta-learning a generally useful feature extractor f_ϕ such that it is possible, on average, to fit a low-loss GP to the feature representations extracted by f_ϕ for each individual task. This is very similar to previous transfer learning approaches. Third, since most GP base kernels have only a handful of parameters, the Hessian in Equation (5) can be computed and inverted exactly during meta-training using Algorithm 1; this removes any need for Hessian approximations. Fourth, the inner optimization (2) for ψ_adapt is computationally efficient, as it does not require backpropagating through the feature extractor f_ϕ. More generally, we conjecture that adapting just the base kernel parameters allows ADKF-IFT to achieve a better balance between overfitting and underfitting than either DKL or DKT. The relationship between these methods is visualized in Figure 1. Panel (c) shows DKL, which trains a separate deep kernel GP for each task. It is not hard to imagine that this can lead to severe overfitting on small datasets, which has been observed empirically by Ober et al. (2021).
Panel (b) shows DKT, which prevents overfitting by fitting one deep kernel GP to all tasks. However, this implicitly makes the strong assumption that all tasks come from an identical distribution over functions, with the same noise level, same amplitude, and same characteristic lengthscales, which is unlikely to hold in practice. Panel (a) shows ADKF-IFT, which allows these important parameters to be adapted while still regularizing the feature extractor with meta-learning. We conjecture that adapting the base kernel parameters is more appropriate given the expected differences between tasks: two related tasks are more likely to have different noise levels or characteristic lengthscales than to require substantially different feature representations. We refer the readers to Appendix H for further discussion.

4. RELATED WORK

Several prior works (e.g., Ober et al., 2021) propose adding regularization terms to the loss of DKL in order to mitigate overfitting. These works are better viewed as complementary to ADKF-IFT rather than as alternatives: their respective regularization terms could easily be added to L_T in Equation (2) to improve performance. However, the regularization strategies in these papers are designed for continuous inputs only, limiting their applicability to structured data like molecules. ADKF-IFT can also be viewed as a meta-learning algorithm comparable to many previously proposed methods (Lake et al., 2011; Vinyals et al., 2016; Garnelo et al., 2018; Triantafillou et al., 2019; Park & Oliva, 2019; Tian et al., 2020; Chen et al., 2021; Liu et al., 2021; Wistuba & Grabocka, 2021; Patacchiola et al., 2022). One distinguishing feature of ADKF-IFT is that it is specifically designed for deep kernel GPs, whereas most methods from computer vision are designed exclusively for neural network models, which, as previously stated, are unsuitable when reliable uncertainty estimates are required.
Furthermore, many of these algorithms, such as ProtoNet (Snell et al., 2017), are designed principally or exclusively for classification, while ADKF-IFT is suited to both regression and classification. Compared to model-agnostic frameworks like MAML (Finn et al., 2017), ADKF-IFT does not require coarse approximations of the hypergradient, thanks to its use of the implicit function theorem. For further discussion of related works, please refer to Appendix I.
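As a concrete illustration of the inner problem (2) under the instantiation of Section 3.4 — adapt only the base kernel parameters θ on fixed, already-extracted features — the following sketch fits θ on a single support set with L-BFGS (hypothetical toy data; our code, assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize

def nlml(log_theta, H, y):
    """Negative log marginal likelihood of a GP with an RBF base kernel on
    fixed feature representations H; theta = (log lengthscale, log noise)."""
    ls, noise = np.exp(log_theta)
    sq_dists = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    # Small jitter keeps the Cholesky factorization numerically stable.
    K = np.exp(-0.5 * sq_dists / ls ** 2) + (noise + 1e-6) * np.eye(len(H))
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, y)
    return 0.5 * z @ z + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def adapt_theta(H, y, init=(0.0, -2.0)):
    """Inner problem (Equation 2): fit theta on one task's support set,
    keeping the feature extractor (which produced H) fixed."""
    res = minimize(nlml, np.array(init), args=(H, y), method="L-BFGS-B")
    return res.x, res.fun
```

Note that the optimization is over only two parameters here, which is why the Hessian in Equation (5) can be formed and inverted exactly.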

5. EXPERIMENTAL EVALUATION ON MOLECULES

In this section, we evaluate the empirical performance of the ADKF-IFT instantiation from Section 3.4. We focus our experiments exclusively on molecular property prediction and optimization tasks because we believe this application would benefit greatly from better GP models: firstly, many existing methods struggle on small datasets of size ∼10², which are ubiquitous in chemistry; secondly, many tasks in chemistry require high-quality uncertainty estimates. First, we evaluate ADKF-IFT on four commonly used benchmark tasks from MoleculeNet (Wu et al., 2018), finding that ADKF-IFT achieves state-of-the-art results on most tasks (Section 5.1). Second, we evaluate ADKF-IFT on the larger-scale FS-Mol benchmark (Stanley et al., 2021), finding that ADKF-IFT is the best-performing method (Section 5.2). In particular, our results support the hypothesis from Section 3.4 that ADKF-IFT achieves a better balance between overfitting and underfitting than DKL and DKT. Finally, we show that the ADKF-IFT feature representation is transferable to out-of-domain molecular property prediction and optimization tasks (Section 5.3).

5.1. FEW-SHOT MOLECULAR PROPERTY PREDICTION ON MOLECULENET BENCHMARK TASKS

Benchmark. The benchmark tasks and evaluation setup follow Guo et al. (2021). The task-level metric is AUROC (area under the receiver operating characteristic curve). We report the averaged performance over ten runs with different random seeds for each compared method at support set size 20 (i.e., 2-way 10-shot, as the support sets in MoleculeNet are balanced). We did not perform 1-shot learning, as it is an unrealistic setting in real-world drug discovery tasks. All baseline results are taken from Wang et al. (2021).

Performance. Table 1 shows that ADKF-IFT and Pre-ADKF-IFT achieve the best performance on Tox21, MUV, and ToxCast. In general, the larger the dataset, the larger the performance gains of our method over the other baselines, highlighting the scalability of our method.
In particular, our method outperforms all baselines by a wide margin on MUV due to the relatively large number of available compounds, but underperforms many baselines on SIDER due to a lack of compounds.

5.2. FEW-SHOT MOLECULAR PROPERTY PREDICTION ON THE FS-MOL BENCHMARK

Benchmark. We further conduct our evaluation on the FS-Mol benchmark (Stanley et al., 2021), which contains a carefully constructed set of few-shot learning tasks for molecular property prediction. FS-Mol contains over 5,000 tasks with 233,786 unique compounds from ChEMBL27 (Mendez et al., 2019), split into training (4,938 tasks), validation (40 tasks), and test (157 tasks) sets. Each task is associated with a protein target. The original benchmark only considers binary classification of active/inactive compounds, but we also include the regression task (for the actual numeric activity value, IC50 or EC50) in our evaluation, as regression is often the preferred task in real-world drug discovery projects.

Baselines. We compare ADKF-IFT with four categories of baselines: 1) single-task methods: Random Forest (RF), k-Nearest Neighbors (kNN), single-task GP with Tanimoto kernel (GP-ST) (Ralaivola et al., 2005), single-task GNN (GNN-ST) (Gilmer et al., 2017), and Deep Kernel Learning (DKL) (Wilson et al., 2016b); 2) multi-task pretraining: multi-task GNN (GNN-MT) (Corso et al., 2020; Gilmer et al., 2017); 3) self-supervised pretraining: Molecule Attention Transformer (MAT) (Maziarka et al., 2020); 4) meta-learning methods: Property-Aware Relation Networks (PAR) (Wang et al., 2021), Prototypical Network with Mahalanobis distance (ProtoNet) (Snell et al., 2017), Model-Agnostic Meta-Learning (GNN-MAML) (Finn et al., 2017), Conditional Neural Process (CNP) (Garnelo et al., 2018), and Deep Kernel Transfer (DKT) (Patacchiola et al., 2020). The GNN feature extractor architecture f_ϕ used for DKL, PAR, CNP, DKT, and ADKF-IFT is the same as that used for ProtoNet, GNN-ST, GNN-MT, and GNN-MAML in Stanley et al. (2021). All multi-task and meta-learning methods are trained from scratch on FS-Mol training tasks. MAT is pretrained on 2 million molecules sampled from the ZINC15 dataset (Sterling & Irwin, 2015).
The classification results for RF, kNN, GNN-ST, GNN-MT, MAT, ProtoNet, and GNN-MAML are reproduced according to Stanley et al. (2021). Detailed configurations of all compared methods can be found in Appendix D.

Evaluation Procedure. The task-level metrics for binary classification and regression are ΔAUPRC (change in area under the precision-recall curve) and R²_os (predictive/out-of-sample coefficient of determination), respectively. Details of these metrics can be found in Appendix E. We follow exactly the same evaluation procedure as Stanley et al. (2021), reporting for each compared method the averaged performance over ten different stratified support/query random splits of every test task. This evaluation process is performed for five support set sizes: 16, 32, 64, 128, and 256. Note that the support sets are generally unbalanced for the classification task in FS-Mol, which is natural, as the majority of candidate molecules in drug discovery are inactive.

Overall Performance. Figure 2 shows the overall test performance of all compared methods. Note that RF is a strong baseline, as it is widely used in real-world drug discovery projects and has comparable performance to many pretraining methods. The results indicate that ADKF-IFT outperforms all other compared methods at all considered support set sizes for the classification task. For the regression task, the performance gains of ADKF-IFT over the second-best method, namely DKT, grow as the support set size increases. In Appendix F.1, we show that ADKF-IFT achieves the best mean rank for both classification and regression at all considered support set sizes.

Statistical Comparison. We perform two-sided Wilcoxon signed-rank tests (Wilcoxon, 1992) to compare the performance of ADKF-IFT and the next best method, namely DKT. The exact p-values from these statistical tests can be found in Appendix F.2. The results indicate that ADKF-IFT significantly outperforms DKT for the classification task at all considered support set sizes, and for the regression task at support set sizes 64, 128, and 256 (at significance level α = 0.05).

Ablation Study. To show that 1) the bilevel optimization objective of ADKF-IFT is essential for learning informative feature representations, and 2) the performance gains of ADKF-IFT are not simply caused by tuning the base kernel parameters θ at meta-test time, we consider two ablation models: DKT+ and ADKF. The test performance of these models is shown in Figure 3. For ADKF, we follow the ADKF-IFT training scheme but assume ∂θ*/∂ϕ = 0, i.e., we update the feature extractor parameters ϕ with the direct gradient ∂L_V/∂ϕ rather than dL_V/dϕ. The results show that ADKF consistently underperforms ADKF-IFT, indicating that the hypergradient of the bilevel optimization objective makes non-negligible contributions to learning better feature representations. For DKT+, we take a model trained by DKT and adapt the base kernel parameters θ on each task at meta-test time. The results show that DKT+ does not improve upon DKT, indicating that tuning the base kernel parameters θ at meta-test time alone is not sufficient for obtaining better test performance.

Sub-benchmark Performance. The tasks in FS-Mol can be partitioned into 7 sub-benchmarks by Enzyme Commission number (Webb et al., 1992). In Appendix F.3, we show the test performance of the top-performing methods on each sub-benchmark. The results indicate that, in addition to achieving the best overall performance, ADKF-IFT achieves the best performance on all sub-benchmarks for the regression task and on more than half of the sub-benchmarks for the classification task.
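For reference, the out-of-sample coefficient of determination R²_os can be sketched as below. We assume here the convention of benchmarking the squared error against the support-set (training) mean rather than the query-set mean, which is what makes the metric "out-of-sample"; see Appendix E for the exact definition used in the paper.

```python
def r2_os(y_true, y_pred, support_mean):
    """Out-of-sample R^2: 1 - SSE / SST, where SST is measured against the
    support-set mean (assumed convention; the naive baseline predictor)."""
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - support_mean) ** 2 for t in y_true)
    return 1.0 - sse / sst
```

Under this convention, a model that merely predicts the support-set mean scores 0, and values can be negative when predictions are worse than that baseline.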

5.3. OUT-OF-DOMAIN MOLECULAR PROPERTY PREDICTION AND OPTIMIZATION

Finally, we demonstrate that the feature representation learned by ADKF-IFT is useful not only for in-domain molecular property prediction tasks but also for out-of-domain molecular property prediction and optimization tasks. For this, we perform experiments that search for the molecules with the best desired target properties within given out-of-domain datasets using Bayesian optimization (BO), with a GP surrogate model operating on top of each compared feature representation. We use the expected improvement acquisition function (Jones et al., 1998) with query-batch size 1. All compared feature representations are extracted using the models trained from scratch on the FS-Mol dataset in Section 5.2, except for the pretrained MAT representation and the fingerprint. We compare them on four representative molecular design tasks outside of FS-Mol. Detailed configuration of the GP and descriptions of the tasks can be found in Appendix G. We repeat each BO experiment 20 times, each time starting from 16 molecules randomly sampled from the worst ∼700 molecules within the dataset. Figure 4 shows that the ADKF-IFT representation enables the fastest discovery of top-performing molecules for the molecular docking, antibiotic discovery, and material design tasks. For the antiviral drug design task, although the ADKF-IFT representation underperforms the MAT and GNN-MT representations, it still achieves competitive performance compared to the other baselines. Table 2 explicitly reports the regression predictive performance of a GP operating on top of each compared feature representation for these four out-of-domain molecular design tasks. The configuration of the GP is the same as in the BO experiments. We report test negative log likelihood (NLL) averaged over 200 support/query random splits (100 for each of the support set sizes 32 and 64).
The results show that the ADKF-IFT representation has the best test NLL on the molecular docking, antibiotic discovery, and material design tasks, and ranks second on the antiviral drug design task.
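The BO loop above relies on the expected improvement acquisition (Jones et al., 1998). For a maximization problem with Gaussian predictive marginals, EI has a well-known closed form, sketched below (our illustration; the function names are ours, not the paper's):

```python
import math

def _phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for maximisation under a Gaussian predictive N(mu, sigma^2):
    E[max(f - best, 0)] = (mu - best) * Phi(z) + sigma * phi(z)."""
    if sigma < 1e-12:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * _Phi(z) + sigma * _phi(z)

def select_next(means, stds, best):
    """Pick the candidate-pool index with the highest EI (query-batch size 1)."""
    scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because EI grows with both the predictive mean and the predictive standard deviation, the quality of the GP's uncertainty estimates (and hence of the underlying feature representation) directly determines how efficiently the loop trades off exploration against exploitation.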

6. CONCLUSION

We have proposed Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework for fitting deep kernels that interpolates between meta-learning and conventional deep kernel learning. ADKF-IFT meta-learns a feature extractor across tasks such that the task-specific GP models estimated on top of the extracted feature representations achieve the lowest possible prediction error on average. ADKF-IFT is implemented by solving a bilevel optimization objective via implicit differentiation. We have shown that ADKF-IFT is a unifying framework containing DKL and DKT as special cases. We have demonstrated that ADKF-IFT learns generally useful feature representations, achieving state-of-the-art performance on a variety of real-world few-shot molecular property prediction tasks and on out-of-domain molecular property prediction and optimization tasks. We believe that ADKF-IFT could become an important method for producing well-calibrated models for fully automated high-throughput experimentation in the future.

A CAUCHY'S IMPLICIT FUNCTION THEOREM

We state Cauchy's Implicit Function Theorem (IFT) in the context of ADKF-IFT in Theorem 1.

Theorem 1 (Implicit Function Theorem (IFT)). Let T′ be any given task. Suppose that for some ψ′_meta and ψ′_adapt,

$$\left.\frac{\partial \mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_{\mathcal{T}'})}{\partial \psi_{\text{adapt}}}\right|_{\psi'_{\text{meta}},\, \psi'_{\text{adapt}}} = 0.$$

Suppose further that the map (ψ_meta, ψ_adapt) ↦ ∂L_T(ψ_meta, ψ_adapt, S_T′)/∂ψ_adapt : Ψ_meta × Ψ_adapt → Ψ_adapt is continuously differentiable, and that the Hessian ∂²L_T/∂ψ_adapt ∂ψ_adaptᵀ is invertible at (ψ′_meta, ψ′_adapt). Then there exists an open set U containing ψ′_meta and a unique continuously differentiable function ψ*_adapt(·, S_T′) : U → Ψ_adapt such that ψ′_adapt = ψ*_adapt(ψ′_meta, S_T′) and

$$\left.\frac{\partial \mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_{\mathcal{T}'})}{\partial \psi_{\text{adapt}}}\right|_{\psi''_{\text{meta}},\, \psi^*_{\text{adapt}}(\psi''_{\text{meta}},\, S_{\mathcal{T}'})} = 0, \quad \forall\, \psi''_{\text{meta}} \in U.$$

Moreover, the rate at which ψ*_adapt(ψ_meta, S_T′) changes w.r.t. ψ_meta at any ψ″_meta ∈ U is given by

$$\left.\frac{\partial \psi^*_{\text{adapt}}(\psi_{\text{meta}}, S_{\mathcal{T}'})}{\partial \psi_{\text{meta}}}\right|_{\psi''_{\text{meta}}} = -\left[\frac{\partial^2 \mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_{\mathcal{T}'})}{\partial \psi_{\text{adapt}}\, \partial \psi_{\text{adapt}}^\top}\right]^{-1} \left.\frac{\partial^2 \mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_{\mathcal{T}'})}{\partial \psi_{\text{adapt}}\, \partial \psi_{\text{meta}}^\top}\right|_{\psi''_{\text{meta}},\, \psi^*_{\text{adapt}}(\psi''_{\text{meta}},\, S_{\mathcal{T}'})}.$$

B GENERAL CONFIGURATIONS OF ADKF-IFT FOR FEW-SHOT LEARNING EXPERIMENTS

In this paper, we consider the specific instantiation of ADKF-IFT from Section 3.4. Specifically, we set A_Ψadapt = A_Θ and A_Ψmeta = A_Φ, i.e., we meta-learn the feature extractor parameters ϕ across tasks and adapt the base kernel parameters θ for each individual task. We choose the train loss L_T to be the negative log GP marginal likelihood evaluated on the support set S_T, as is common practice for choosing GP base kernel parameters:

$$\mathcal{L}_T(\psi_{\text{meta}}, \psi_{\text{adapt}}, S_\mathcal{T}) = -\log p(S^y_\mathcal{T} \mid S^x_\mathcal{T}, \psi_{\text{meta}}, \psi_{\text{adapt}}) = \frac{1}{2} (S^y_\mathcal{T})^\top K_{S_\mathcal{T}}^{-1} S^y_\mathcal{T} + \frac{1}{2} \log \det(K_{S_\mathcal{T}}) + \frac{N_{S_\mathcal{T}}}{2} \log(2\pi),$$

where K_{S_T} = k_{ψ_meta, ψ_adapt}(S^x_T, S^x_T) + σ² I_{N_{S_T}}. We choose the validation loss L_V to be the negative log joint GP predictive posterior evaluated on the query set Q_T given the support set S_T, also due to its common usage for making predictions with GPs:

$$\mathcal{L}_V(\psi_{\text{meta}}, \psi_{\text{adapt}}, \mathcal{T}) = -\log p(Q^y_\mathcal{T} \mid Q^x_\mathcal{T}, S_\mathcal{T}, \psi_{\text{meta}}, \psi_{\text{adapt}}) = -\log \mathcal{N}\big(Q^y_\mathcal{T};\; K_{Q_\mathcal{T} S_\mathcal{T}} K_{S_\mathcal{T}}^{-1} S^y_\mathcal{T},\; K_{Q_\mathcal{T}} - K_{Q_\mathcal{T} S_\mathcal{T}} K_{S_\mathcal{T}}^{-1} K_{S_\mathcal{T} Q_\mathcal{T}}\big),$$

where K_{Q_T} = k_{ψ_meta, ψ_adapt}(Q^x_T, Q^x_T) + σ² I_{N_{Q_T}} and K_{S_T Q_T} = K_{Q_T S_T}ᵀ = k_{ψ_meta, ψ_adapt}(S^x_T, Q^x_T). We solve the inner optimization problem (2) using the L-BFGS optimizer (Liu & Nocedal, 1989), since L-BFGS is the default choice for optimizing base kernel parameters in the GP literature. For the outer optimization problem (1), we approximate the expected hypergradient over p(T) by averaging the hypergradients for a batch of K randomly sampled training tasks at each step, and update the meta-learned parameters ψ_meta with the averaged hypergradient using the Adam optimizer (Kingma & Ba, 2014) with learning rate 10⁻³ for MoleculeNet and 10⁻⁴ for FS-Mol. We set K = 10 for MoleculeNet and K = 16 for FS-Mol. For all experiments on FS-Mol, we evaluate the performance of our model on a small set of validation tasks during meta-training and use early stopping (Prechelt, 1998) to avoid overfitting of ψ_meta.
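The two losses above can be written down directly. The following NumPy sketch (illustrative, with a fixed RBF base kernel standing in for a deep kernel) also verifies the Gaussian chain rule −log p(S^y, Q^y) = L_T(S) + L_V(S, Q), a useful sanity check for any implementation:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Base kernel on feature representations."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / ls ** 2)

def mvn_nll(y, mean, cov):
    """Negative log density of N(y; mean, cov) via a Cholesky factor."""
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, y - mean)
    return 0.5 * z @ z + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def train_loss(Sx, Sy, ls, noise):
    """L_T: negative log GP marginal likelihood on the support set."""
    K = rbf(Sx, Sx, ls) + noise * np.eye(len(Sx))
    return mvn_nll(Sy, np.zeros(len(Sy)), K)

def val_loss(Sx, Sy, Qx, Qy, ls, noise):
    """L_V: negative log *joint* GP predictive posterior on the query set."""
    Ks = rbf(Sx, Sx, ls) + noise * np.eye(len(Sx))
    Kqs = rbf(Qx, Sx, ls)
    Kq = rbf(Qx, Qx, ls) + noise * np.eye(len(Qx))
    mean = Kqs @ np.linalg.solve(Ks, Sy)
    cov = Kq - Kqs @ np.linalg.solve(Ks, Kqs.T)
    return mvn_nll(Qy, mean, cov)
```

The chain-rule identity holds because the joint prior over support and query labels is one big Gaussian, of which L_V is exactly the conditional term.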
We use a zero mean function and a Matérn-5/2 base kernel without automatic relevance determination (ARD) (Neal, 1996) in ADKF-IFT, since the typical support set sizes in few-shot learning are too small to fit a comparatively large number of ARD lengthscales. The lengthscale of the base kernel is initialized using the median heuristic (Garreau et al., 2017) for each task, with a log-normal prior centered at the initialization. Following Patacchiola et al. (2020), we treat binary classification as ±1 label regression for ADKF-IFT.
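The median heuristic initialization mentioned above admits a short sketch; the function name is our own, and the exact variant used (median of distances vs. squared distances) may differ from our implementation.

```python
import numpy as np

def median_heuristic_lengthscale(X):
    """One common form of the median heuristic: set the lengthscale to
    the median pairwise Euclidean distance between the (featurized)
    support-set inputs X, shape (n, d)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    iu = np.triu_indices(len(X), k=1)  # distinct pairs only
    return float(np.sqrt(np.median(d2[iu])))
```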

C DETAILS OF MOLECULENET BENCHMARK TASKS

In Table 3, we summarize the four few-shot molecular property classification benchmark tasks (Tox21, SIDER, MUV, and ToxCast) from MoleculeNet (Wu et al., 2018).

Multi-task Pretraining. The implementation of GNN-MT is taken from Stanley et al. (2021). GNN-MT shares a GNN with a hidden dimension of 128 using principal neighborhood message aggregation (Corso et al., 2020) across tasks, and uses a task-specific gated readout function (Gilmer et al., 2017) and an MLP with one hidden layer on top for each individual task. The model is trained on the support sets of all training tasks with early stopping based on the validation performance on the validation tasks. The task-specific components of the model are fine-tuned for each test task.

Self-supervised Pretraining. The implementation of MAT is taken from Stanley et al. (2021). We use the official pretrained model parameters (Maziarka et al., 2020), which were pretrained on 2 million molecules sampled from the ZINC15 dataset (Sterling & Irwin, 2015). We fine-tune the model for each test task with hyperparameter search and early stopping based on 20% of the support set for each task.

Meta-learning Methods. Meta-learning methods (PAR, ProtoNet, GNN-MAML, CNP, DKT, and ADKF-IFT) enable knowledge transfer among related small datasets. The implementations of ProtoNet and GNN-MAML are taken from Stanley et al. (2021). The implementation of PAR is taken from its official implementation and integrated into the FS-Mol training and evaluation pipeline. PAR, ProtoNet, CNP, DKT, and ADKF-IFT operate on top of a combination of extended connectivity fingerprints (Rogers & Hahn, 2010) (count-based fingerprints with radius 2 and size 2,048) and features extracted by a GNN. The GNN feature extractor architecture used for DKL, PAR, CNP, DKT, and ADKF-IFT is the same as that used for ProtoNet, GNN-MAML, GNN-ST, and GNN-MT in Stanley et al. (2021), with the size of the feature representation tuned on the validation tasks.
The base kernel used in DKT is the same as that used in ADKF-IFT.

E TASK-LEVEL EVALUATION METRICS FOR FS-MOL

Binary Classification. Following Stanley et al. (2021), the task-level metric for the binary classification task in FS-Mol is the change in area under the precision-recall curve (ΔAUPRC), which is sensitive to the balance of the two classes in the query sets and allows for a comparison to the performance of a random classifier:

ΔAUPRC(target classifier, T) = AUPRC(target classifier, Q_T) - AUPRC(random classifier, Q_T)
                             = AUPRC(target classifier, Q_T) - (#positive data points in Q_T) / N_{Q_T}.

Regression. We propose to use the predictive/out-of-sample coefficient of determination (R²_os) as the task-level metric for the regression task in FS-Mol, which takes forecast errors into account:

R²_os(target regressor g, T) = 1 - [Σ_{(x_m, y_m) ∈ Q_T} (y_m - g(x_m))²] / [Σ_{y_m ∈ Q_T^y} (y_m - ȳ_{S_T})²],

where ȳ_{S_T} = (1/N_{S_T}) Σ_{y_n ∈ S_T^y} y_n is the mean target value in the support set S_T. This differs from the regular coefficient of determination (R²), in which the total sum of squares in the denominator is computed using the mean target value ȳ_{Q_T} = (1/N_{Q_T}) Σ_{y_m ∈ Q_T^y} y_m in the query set Q_T.

F FURTHER EXPERIMENTAL RESULTS ON FS-MOL

F.1 OVERALL PERFORMANCE

Table 4 shows that ADKF-IFT achieves the best mean rank for both classification and regression tasks at all considered support set sizes. The trends of these mean ranks are consistent with those in Figure 2. Figures 6 and 7 show box plots for the classification and regression performance of all compared methods on all FS-Mol test tasks, respectively. These plots are a disaggregated representation of the results in Figure 2.
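Both task-level metrics can be sketched in a few lines. The AP estimator below is one common way to compute AUPRC (ties in the scores are ignored for simplicity), and all function names are our own rather than from the FS-Mol codebase.

```python
import numpy as np

def average_precision(y_true, y_score):
    """One common AUPRC estimator (average precision): the mean of the
    precision values at the rank of each positive example."""
    order = np.argsort(-np.asarray(y_score, dtype=float))
    y = np.asarray(y_true, dtype=float)[order]
    tp = np.cumsum(y)
    precision = tp / (np.arange(len(y)) + 1)
    return float(np.sum(precision * y) / np.sum(y))

def delta_auprc(y_true, y_score):
    """AUPRC improvement over a random classifier, whose expected AUPRC
    on the query set equals the fraction of positive points."""
    return average_precision(y_true, y_score) - float(np.mean(y_true))

def r2_os(y_query, y_pred, y_support):
    """Out-of-sample R^2: residuals on the query set, with the total sum
    of squares centred at the support-set mean (not the query-set mean)."""
    y_query, y_pred = np.asarray(y_query), np.asarray(y_pred)
    ss_res = np.sum((y_query - y_pred) ** 2)
    ss_tot = np.sum((y_query - np.mean(y_support)) ** 2)
    return 1.0 - ss_res / ss_tot
```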

F.2 STATISTICAL COMPARISON

Table 5 shows the p-values from the two-sided Wilcoxon signed-rank test for statistical comparisons between ADKF-IFT and the second-best method, DKT. The test results indicate that their median performance difference is nonzero (i.e., ADKF-IFT significantly outperforms DKT) for the classification task at all considered support set sizes and for the regression task at support set sizes 64, 128, and 256 (at significance level α = 0.05). The p-values for statistical comparisons between ADKF-IFT and the two ablation models, DKT+ and ADKF, are also shown in Table 5, demonstrating that ADKF-IFT significantly outperforms these ablation models in most cases.
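A comparison of this kind can be reproduced with scipy.stats.wilcoxon applied to paired per-task scores. The arrays below are synthetic stand-ins for the real per-task results, purely to illustrate the test setup.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-task scores for two methods on the same 40 test tasks
# (paired by task). The names and values are illustrative only.
rng = np.random.default_rng(0)
scores_dkt = rng.normal(0.20, 0.10, size=40)                   # baseline
scores_adkf = scores_dkt + 0.03 + rng.normal(0, 0.01, size=40) # consistent gain

# Two-sided test of whether the median paired difference is zero.
stat, p_value = wilcoxon(scores_adkf, scores_dkt, alternative="two-sided")
```

A small p-value (below the chosen significance level, e.g. α = 0.05) rejects the null hypothesis of zero median difference.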

F.3 SUB-BENCHMARK PERFORMANCE

The tasks in FS-Mol can be partitioned into 7 sub-benchmarks by Enzyme Commission (EC) number (Webb et al., 1992), which enables evaluation within each sub-benchmark. Ideally, the best method should perform well across all sub-benchmarks. Table 6 shows the test performance of the top-performing methods on all sub-benchmarks at support set size 64 (the median of all considered support set sizes) for both the classification and regression tasks. The results indicate that, in addition to achieving the best overall performance, ADKF-IFT achieves the best performance on all sub-benchmarks for the regression task and on more than half of the sub-benchmarks for the classification task.

F.4 META-TESTING COSTS

Figure 5 shows the meta-testing costs of all compared meta-learning methods in terms of wall-clock time on a pre-defined set of FS-Mol classification tasks. These experiments are run on a single NVIDIA GeForce RTX 2080 Ti. ADKF-IFT is ∼2.5x slower than CNP, ProtoNet, and DKT, but still much faster than GNN-MAML. We did not report the wall-clock time for PAR because it is extremely memory intensive (PAR requires more than 10x the memory of ADKF-IFT) and thus could not be run on our GPU. We stress that this is not an important metric for this paper, as real-time adaptation is not required in drug discovery applications, but it could be of interest if ADKF-IFT were deployed in other settings.

G DETAILS OF THE OUT-OF-DOMAIN MOLECULAR OPTIMIZATION EXPERIMENTS

In Table 7, we summarize the four molecular design tasks considered in Section 5.3. Note that the datasets for the molecular docking and material design tasks are subsampled from the much larger datasets provided in DOCKSTRING (García-Ortegón et al., 2021) and the Harvard Clean Energy Project (Hachmann et al., 2011), respectively. The datasets for the antibiotic discovery and antiviral drug design tasks are taken from the antibiotic training set and the COVID Moonshot dataset provided in Stokes et al. (2020) and Consortium et al. (2022), respectively. For the configuration of the GP, we use the Tanimoto kernel for the fingerprint representation (count-simulated, with radius 2 and 2,048 bits) and the Matérn-5/2 kernel without ARD (with a log-normal prior over the lengthscale, centered at the median heuristic initialization) for all other compared feature representations. We re-fit the base kernel parameters using all available data points at the beginning of each BO iteration.

H.1 THE OVERFITTING ISSUE IN DKL

Deep Kernel Learning (DKL) (Wilson et al., 2016b) is a single-task method for fitting a deep kernel to a dataset. DKL jointly fits both the feature extractor parameters ϕ and the base kernel parameters θ by maximizing the GP marginal likelihood on a single dataset (i.e., DKL essentially fits a neural network with a GP "head" to a dataset). It is well known that neural networks easily overfit small datasets (Sarle, 1995). This overfitting also happens in DKL, despite the fact that it fits the neural network parameters using a type-II maximum likelihood approach: although early DKL papers suggested that the "model complexity" term (as measured by the log determinant of the kernel matrix) in the GP marginal likelihood objective would prevent this overfitting (Wilson et al., 2016b), recent follow-up work showed that this is not the case (Ober et al., 2021): a deep kernel GP can simultaneously overfit the training data and appear to have low "model complexity".

H.2 THE UNDERFITTING ISSUE IN DKT

Deep Kernel Transfer (DKT) (Patacchiola et al., 2020) is a meta-learning method for fitting a deep kernel to a distribution of datasets. DKT jointly fits both the feature extractor parameters ϕ and the base kernel parameters θ by maximizing the expected GP marginal likelihood over a distribution of datasets. To mitigate the overfitting issue of DKL using meta-learning, DKT makes a very strong assumption: that all tasks in the task distribution are drawn from an identical GP prior over functions. Explicitly, this means the data generating process is assumed to have the same noise level, the same amplitude, and the same "characteristic lengthscale" for every task in the meta-dataset, which is a restrictive assumption violated by most real-world problems. For example, different datasets in a meta-dataset may have:

• highly varying noise levels, so modelling all tasks with the same amount of observation noise is unrealistic;

• different output ranges and units for regression: for example, one task might have data in the range 1-20 µM, while another might have 0-100% inhibition, so a single signal variance (kernel amplitude) will not model the data well;

• different "characteristic lengthscales": for some tasks, structurally similar molecules have very strongly correlated output labels, while for other tasks the correlation is much weaker (i.e., there is much more variation in the labels of very similar molecules), suggesting that the "characteristic lengthscale" differs across tasks.

Inevitably, fitting such a misspecified model results in a set of compromised base kernel parameters θ that fit all datasets reasonably well on average but do not fit any individual dataset very well. This is the underfitting issue of DKT.

H.3 THE ADVANTAGE OF ADKF-IFT

ADKF-IFT combines DKL and DKT in a way that can potentially inherit the strengths of both methods and the weaknesses of neither. By adapting the base kernel parameters θ specifically to each task, it prevents underfitting due to varying ranges, lengthscales, or noise levels between datasets. By meta-learning the feature extractor on many datasets, it prevents overfitting as observed by Ober et al. (2021) . This advantage is both theoretically principled (by solving a bilevel optimization objective using the implicit function theorem) and empirically observable (we showed a statistically significant performance improvement of ADKF-IFT over DKT in Section 5.2).

I EXTENDED RELATED WORK

I.1 MULTI-TASK GAUSSIAN PROCESSES

ADKF-IFT can be considered a method for learning multi-task GPs. The dominant approach to this problem in prior work is to learn a shared kernel for all data points across all tasks, transmitting information by explicitly modelling the covariance between data from different tasks (Kennedy & O'Hagan, 2000; Forrester et al., 2007; Bonilla et al., 2007; Swersky et al., 2013; Poloczek et al., 2017). The main difference between methods in this family is the exact form of the kernel (Tighineanu et al., 2022), which is typically assumed to have a particular structure (e.g., a Kronecker product or a weighted sum of simpler kernels). ADKF-IFT cannot naturally be viewed in this way because the covariance between data points from separate tasks is always zero; information is instead transmitted between tasks via a shared set of kernel parameters. Therefore, we believe that ADKF-IFT is a significant departure from the dominant paradigm in GP transfer learning.
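The shared-kernel paradigm described above can be made concrete with a small sketch of an intrinsic-coregionalization-style covariance, where a task covariance matrix scales an input kernel. This is an illustrative construction of the prior-work paradigm, not part of ADKF-IFT.

```python
import numpy as np

def shared_multitask_kernel(Kx, Kt, tasks):
    """ICM-style multi-task covariance: for data points i, j belonging to
    tasks t_i, t_j, cov_ij = Kt[t_i, t_j] * Kx[i, j]. The nonzero
    cross-task entries are exactly how this family of models transmits
    information between tasks (it reduces to a Kronecker product when
    every task shares the same inputs)."""
    tasks = np.asarray(tasks)
    return Kt[np.ix_(tasks, tasks)] * Kx
```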

I.2 IMPLICIT FUNCTION THEOREM IN MACHINE LEARNING

The implicit function theorem employed in our work has been used in many previous machine learning papers in various contexts, e.g., neural architecture search (Zhang et al., 2021), hyperparameter tuning (Bengio, 2000; Luketina et al., 2016; Pedregosa, 2016; Lorraine et al., 2020; Clarke et al., 2022), and meta-learning (Rajeswaran et al., 2019; Lee et al., 2019; Chen et al., 2020).

Although some instantiations of ADKF-IFT could require this, in our highlighted version the Hessian can be computed exactly, which we view as a significant advantage.

3. Goal: The stated goal of Chen et al. (2020) is to decide which parameters should be meta-learned, whereas in ADKF-IFT this must be pre-specified, and we give guidance for doing so in a way that yields transferable meta-learned features.

I.4 COMMENTS ON ADKL-GP (TOSSOU ET AL., 2019)

The preprint of Tossou et al. (2019) proposes an alternative adaptive deep kernel GP trained with meta-learning, in which adaptation is performed by conditioning the feature extractor on an embedding of the entire support set rather than adapting a subset of the kernel parameters as in ADKF-IFT. In general, their empirical results were not very strong, and in our opinion the method is very prone to overfitting, as we explain below. The training objective for ADKL-GP is equivalent to the objective for DKT with an added contrastive loss, weighted by a sensitive hyperparameter γ (see Equation (13) of Tossou et al. (2019)). γ can be interpreted as balancing the degree of regularization between two extremes:

1. If γ = 0, there is no regularization of the task encoding network, making significant overfitting to the meta-dataset possible. This is effectively equivalent to standard DKL (Wilson et al., 2016b).

2. As γ → ∞, the regularization becomes infinitely strong, causing the task embeddings z_T to collapse and thereby preventing them from transmitting any information about specific datasets. With no information from z_T, the objective is essentially the same as that of DKT (Patacchiola et al., 2020).

For this method to be useful, γ would need to be carefully tuned to balance between these extremes. Tossou et al. (2019) perform a grid search over all hyperparameters, including γ ∈ {0, 0.01, 0.1}, but find no consistent trend besides γ > 0 being slightly helpful, and the differences in performance were small. This suggests that the method may be difficult to use in practice. ADKF-IFT, however, has no such tunable hyperparameter, which we view as a significant strength. Instead, the balance between DKL and DKT is controlled by selecting which parameters are adapted and which are meta-learned, which is much more interpretable and easier to use in practice.

J FUTURE WORK

Some directions for future work are as follows: 



We acknowledge that wall-clock time may not be the best metric for measuring meta-testing costs, since some meta-learning methods could be parallelized, which would reduce the wall-clock time accordingly. An alternative metric is the number of multiply-accumulate operations (MACs); however, it is difficult to obtain an accurate MAC count due to the opaqueness of the GP modules used.

They use the terms meta parameters and task-specific parameters instead.



Figure 1: A contrastive diagram illustrating the training procedures of ADKF-IFT, DKT, and DKL.

ADKF-IFT is part of a growing body of literature on techniques for training deep kernel GPs. As discussed in Section 3.3, ADKF-IFT generalizes DKL (Wilson et al., 2016b) and DKT (Patacchiola et al., 2020), which exclusively use single-task learning and meta-learning, respectively. Liu et al. (2020) and van Amersfoort et al. (

Figure 2: Mean performance with standard errors of all compared methods on all FS-Mol test tasks.

Figure 3: Mean performance with standard errors of ablation models on all FS-Mol test tasks. ADKF is like ADKF-IFT but assuming ∂ θ */∂ ϕ = 0, i.e., updating ϕ with the direct gradient ∂ L V /∂ ϕ. DKT+ is like DKT but tuning the base kernel parameters θ during meta-testing.

Figure 4: Mean top-1 target values with standard errors as a function of the number of molecules queried for all compared feature representations on four out-of-domain molecular optimization tasks.

Figure 5: Wall-clock time consumed (with standard errors) when meta-testing on a pre-defined set of FS-Mol classification tasks using each of the compared meta-learning methods.

I.3 MODULAR META-LEARNING WITH SHRINKAGE (CHEN ET AL., 2020)

The method proposed by Chen et al. (2020) shares many similarities with ADKF-IFT: at a high level, it also divides model parameters into meta-learned and adapted parameters, and optimizes the meta-learned parameters using the gradient of the validation loss, after the adapted parameters have been adjusted to minimize the training loss, via the implicit function theorem. The main differences between that work and ADKF-IFT are:

1. Model: Chen et al. (2020) consider a model where ϕ comprises the means and variances of a Gaussian prior over model parameters, whereas in ADKF-IFT ϕ is a subset of the parameters of an arbitrary deep kernel GP.

2. Hessian: Chen et al. (2020) consider the case where Θ is too large to form the exact Hessian for the implicit function theorem, and instead use a conjugate gradient approximation.
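The IFT-based hypergradient discussed above can be illustrated on a scalar toy bilevel problem where the inner argmin and all derivatives are available in closed form, so the hypergradient can be checked against the true gradient. This sketch is purely pedagogical and is not the ADKF-IFT objective.

```python
# Toy bilevel problem:
#   inner:  theta*(phi) = argmin_theta  L_T(phi, theta) = 0.5 * (theta - phi)**2
#   outer:  L_V(phi)    = 0.5 * (theta*(phi) - C)**2
# The IFT gives d theta*/d phi = -(d2 L_T/d theta^2)^{-1} (d2 L_T/d theta d phi).

C = 3.0  # outer-loss target (arbitrary constant for this toy)

def hypergradient(phi):
    theta_star = phi                     # closed-form inner argmin
    d2LT_dtheta2 = 1.0                   # Hessian of L_T w.r.t. theta
    d2LT_dtheta_dphi = -1.0              # mixed second derivative
    dtheta_dphi = -d2LT_dtheta_dphi / d2LT_dtheta2  # = 1, by the IFT
    dLV_dtheta = theta_star - C
    dLV_dphi_direct = 0.0                # L_V has no direct dependence on phi
    return dLV_dphi_direct + dLV_dtheta * dtheta_dphi
```

Since θ*(ϕ) = ϕ here, the outer objective reduces to 0.5 (ϕ - C)², whose true gradient ϕ - C is reproduced exactly by the IFT hypergradient.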


Figure 6: Box plots for the classification performance of all compared methods on 157 FS-Mol test tasks at different support set sizes.


Figure 7: Box plots for the regression performance of all compared methods on 111 FS-Mol test tasks at different support set sizes.

Mean test performance (AUROC%) with standard deviations of all compared methods on MoleculeNet benchmark tasks at support set size 20 (i.e., 2-way 10-shot).

Pre-ADKF-IFT refers to ADKF-IFT starting from a pretrained feature extractor. All compared methods in this section use GIN (Xu et al., 2019) as their feature extractors. The pretrained weights for the methods of the second type are provided by Hu* et al. (2020).

Mean predictive performance (test NLL) with standard errors of a GP operating on top of each compared feature representation on the four out-of-domain molecular design tasks.

a continuously differentiable function w.r.t. ψ_meta and ψ_adapt, and the Hessian ∂²L_T(ψ_meta, ψ_adapt, S_{T′}) / (∂ψ_adapt ∂ψ_adapt^⊤)

Statistics of four few-shot molecular property prediction benchmarks from MoleculeNet.

considered in Section 5.1.

D DETAILED CONFIGURATIONS OF ALL COMPARED METHODS FOR FS-MOL

Single-task Methods. Single-task methods (RF, kNN, GP-ST, GNN-ST, and DKL) are trained separately on the support set of each test task, without leveraging the knowledge contained in the training tasks. The implementations of RF, kNN, and GNN-ST are taken from Stanley et al. (2021).

Mean ranks of all compared methods in terms of their performance on all FS-Mol test tasks.

p-values from the two-sided Wilcoxon signed-rank test for statistical comparisons between ADKF-IFT and DKT/DKT+/ADKF. The null hypothesis is that the median of their performance differences on all FS-Mol test tasks is zero. The significance level is set to α = 0.05.

Mean performance with standard errors of the top-performing methods on FS-Mol test tasks within each sub-benchmark (broken down by EC category) at support set size 64 (the median of all considered support set sizes). Note that class 2 is the most common in the FS-Mol training set (∼1,500 training tasks), whereas classes 6 and 7 are the least common (< 50 training tasks each).

Descriptions of four out-of-domain molecular design tasks.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

We thank Massimiliano Patacchiola, John Bronskill, Marcin Sendera, and Richard E. Turner for helpful discussions and feedback. WC acknowledges funding via a Cambridge Trust Scholarship (supported by the Cambridge Trust) and a Cambridge University Engineering Department Studentship (under grant G105682 NMZR/089 supported by Huawei R&D UK). AT acknowledges funding via a C T Taylor Cambridge International Scholarship. JMHL acknowledges support from a Turing AI Fellowship under grant EP/V023756/1.

ETHICS STATEMENT

We believe that the ethical implications of this work are minimal: this research involves no human subjects, no sensitive data where privacy is a concern, and no domains where discrimination, bias, or fairness is a concern, and it is unlikely to have a noticeable social impact. Optimistically, our hope is that ADKF-IFT could one day be used in the drug discovery pipeline to create new beneficial medicines, giving it an overall positive ethical impact. As with most research in machine learning, new modelling techniques could be used by bad actors to cause harm more effectively, but we do not see how ADKF-IFT is more concerning than any other method in this regard.

REPRODUCIBILITY STATEMENT

Our implementation and experimental results can be found at https://github.com/Wenlin-Chen/ADKF-IFT, which is based on forks of FS-Mol (Stanley et al., 2021) and PAR (Wang et al., 2021). Details for the setup of ADKF-IFT can be found in Appendix B, details for the other FS-Mol baselines in Appendix D, and details for the other MoleculeNet baselines in Wang et al. (2021, Appendix B). The arXiv version of this paper can be found at https://arxiv.org/abs/2205.02708 and may be updated as needed.

