DeepPipe: DEEP, MODULAR AND EXTENDABLE REPRESENTATIONS OF MACHINE LEARNING PIPELINES

Abstract

Finding accurate Machine Learning pipelines is essential in achieving state-of-the-art AI predictive performance. Unfortunately, most existing Pipeline Optimization techniques rely on flavors of Bayesian Optimization that do not explore the deep interaction between pipeline stages/components (e.g. between hyperparameters of the deployed preprocessing algorithm and the hyperparameters of a classifier). In this paper, we are the first to capture the deep interaction between components of a Machine Learning pipeline. We propose embedding pipelines into a deep latent representation through a novel per-component encoder mechanism. Such pipeline embeddings are used with deep kernel Gaussian Process surrogates inside a Bayesian Optimization setup. Through extensive experiments on three large-scale meta-datasets, including Deep Learning pipelines for computer vision, we demonstrate that learning pipeline embeddings achieves state-of-the-art results in Pipeline Optimization.

1. INTRODUCTION

Machine Learning (ML) has proven to be successful in a wide range of tasks such as image classification, natural language processing, and time series forecasting. In a supervised learning setup, practitioners need to define a sequence of stages comprising algorithms that transform the data (e.g. imputation, scaling) and produce an estimation (e.g. through a classifier or regressor). The selection of the algorithms and their hyperparameters, known as Pipeline Optimization (Olson & Moore, 2016) or pipeline synthesis (Liu et al., 2020; Drori et al., 2021), is challenging. Firstly, the search space contains conditional hyperparameters, as only some of them are active depending on the selected algorithms. Secondly, this space is arguably bigger than the one for a single algorithm. Consequently, previous work demonstrates how this pipeline search can be automated to achieve competitive results (Feurer et al., 2015; Olson & Moore, 2016). Some of these approaches include Evolutionary Algorithms (Olson & Moore, 2016), Reinforcement Learning (Rakotoarison et al., 2019; Drori et al., 2021) or Bayesian Optimization (Feurer et al., 2015; Thornton et al., 2012; Alaa & van der Schaar, 2018). Pipeline Optimization (PO) techniques need to capture the complex interaction between the algorithms of a Machine Learning pipeline and their hyperparameter configurations. Unfortunately, no prior method uses Deep Learning to encapsulate the interaction between pipeline components. Prior work trains performance predictors (a.k.a. surrogates) on the concatenated hyperparameter space of all algorithms (raw search space), for instance, using random forests (Feurer et al., 2015) or finding groups of hyperparameters to use on kernels with additive structure (Alaa & van der Schaar, 2018). On the other hand, transfer learning has been shown to decisively improve PO by transferring efficient pipelines evaluated on other datasets (Fusi et al., 2018; Yang et al., 2019; 2020).
Our method is the first to introduce a deep pipeline representation that is meta-learned to achieve state-of-the-art results in terms of the quality of the discovered pipelines. We introduce DeepPipe, a neural network architecture for embedding pipeline configurations in a latent space. Such deep representations are combined with Gaussian Processes (GP) for tuning pipelines with Bayesian Optimization (BO). We exploit the knowledge of the hierarchical search space of pipelines by mapping the hyperparameters of every algorithm through per-algorithm encoders to a hidden representation, followed by a Fully Connected Network that receives the concatenated representations as input. Additionally, we show that meta-learning this network through evaluations on auxiliary tasks improves the quality of BO. Experiments on three large-scale meta-datasets show that our method achieves the new state-of-the-art in Pipeline Optimization. Our contributions are as follows:

• We introduce DeepPipe, a surrogate for BO that achieves peak performance when optimizing a pipeline for a new dataset through transfer learning.
• We present a novel and modular architecture that applies different encoders per stage and yields better generalization in low meta-data regimes, i.e. few/no auxiliary tasks.
• We conduct extensive evaluations against seven baselines on three large meta-datasets, and we further compare against rival methods on OpenML datasets to assess their performance under time constraints.
• We demonstrate that our pipeline representation helps achieve state-of-the-art results in optimizing pipelines for fine-tuning deep computer vision networks.

2. RELATED WORK

Full Model Selection (FMS) is also referred to as Combined Algorithm Selection and Hyperparameter optimization (CASH) (Hutter et al., 2019; Feurer et al., 2015). FMS aims to find the best model and its respective hyperparameter configuration (Hutter et al., 2019). A common approach is to use Bayesian Optimization with surrogates that can handle conditional hyperparameters, such as Random Forests (Feurer et al., 2015), tree-structured Parzen estimators (Thornton et al., 2012), or ensembles of neural networks (Schilling et al., 2015). Pipeline Optimization (PO) is a generalization of FMS where the goal is to find the algorithms and their hyperparameters for the different stages of a Machine Learning pipeline. Common approaches model the search space as a tree structure and use reinforcement learning (Rakotoarison et al., 2019; Drori et al., 2021; de Sá et al., 2017), evolutionary algorithms (Olson & Moore, 2016), or Hierarchical Task Networks (Mohr et al., 2018) to search for pipelines. Other approaches use Multi-Armed Bandit strategies to optimize the pipeline, and combine them with Bayesian Optimization (Swearingen et al., 2017) or multi-fidelity optimization (Kishimoto et al., 2021). Alaa & van der Schaar (2018) use additive kernels on a Gaussian Process surrogate to search pipelines with BO: the algorithms are grouped into clusters and their hyperparameters are fit with independent Gaussian Processes, achieving an effectively lower dimensionality per input. By formulating Pipeline Optimization as a constrained optimization problem, Liu et al. (2020) introduce a method based on the alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976). Transfer learning for Pipeline Optimization and CASH leverages information from previous (auxiliary) task evaluations. A few approaches use dataset meta-features to warm-start BO with good configurations from other datasets (Feurer et al., 2015; Alaa & van der Schaar, 2018).
As extracting meta-features demands computational time, follow-up works find a portfolio based on these auxiliary tasks (Feurer et al., 2020) . Another popular approach is to use collaborative filtering with a matrix of pipelines vs task evaluations to learn latent embeddings of pipelines. OBOE obtains the embeddings by applying a QR decomposition of the matrix on a time-constrained formulation (Yang et al., 2019) . By recasting the matrix as a tensor, Tensor-OBOE (Yang et al., 2020) finds latent representations via the Tucker decomposition. Furthermore, Fusi et al. (2018) apply probabilistic matrix factorization for finding the latent pipeline representations. Subsequently, they use the latent representations as inputs for a Gaussian Process, and explore the search space using BO.

3.1. PIPELINE OPTIMIZATION

The pipeline of a ML system consists of a sequence of N stages (e.g. dimensionality reducer, standardizer, encoder, estimator (Yang et al., 2020)). At each stage i ∈ {1 . . . N} a pipeline includes one algorithm from a set of M_i choices (e.g. the estimator stage can include the algorithms {SVM, MLP, RF}). Algorithms are tuned through their hyperparameter search spaces, where λ_{i,j} denotes the configuration of the j-th algorithm in the i-th stage. Furthermore, let us denote a pipeline p as the set of indices of the selected algorithm at each stage, i.e. p := (p_1, . . . , p_N), where p_i ∈ {1 . . . M_i} represents the index of the selected algorithm at the i-th pipeline stage. The hyperparameter configuration of a pipeline is the unified set of the configurations of all the algorithms in the pipeline, concretely λ(p) := (λ_{1,p_1}, . . . , λ_{N,p_N}), with λ_{i,p_i} ∈ Λ_{i,p_i}. Pipeline Optimization demands finding the optimal pipeline p* and its optimal configuration λ(p*) by minimizing the validation loss of a trained pipeline on a dataset D, as shown in Equation 1:

$(p^*, \lambda(p^*)) = \underset{p \in \{1 \dots M_1\} \times \cdots \times \{1 \dots M_N\},\; \lambda(p) \in \Lambda_{1,p_1} \times \cdots \times \Lambda_{N,p_N}}{\arg\min} \; \mathcal{L}_{\text{val}}\left(p, \lambda(p), D\right) \quad (1)$

From now on we will use the term pipeline configuration for the combination of a sequence of algorithms p and their hyperparameter configurations λ(p), and denote it simply as p_λ := (p, λ(p)).

3.2. BAYESIAN OPTIMIZATION

Bayesian optimization (BO) is a mainstream strategy for optimizing ML pipelines (Feurer et al., 2015; Hutter et al., 2011; Alaa & van der Schaar, 2018; Fusi et al., 2018; Schilling et al., 2015). Let us start by defining a history of Q evaluated pipeline configurations as H = {(p_λ^{(1)}, y^{(1)}), . . . , (p_λ^{(Q)}, y^{(Q)})}, where y^{(q)} ∼ N(f(p_λ^{(q)}), σ_q²) is a probabilistic modeling of the validation loss f(p_λ^{(q)}) achieved with the q-th evaluated pipeline configuration p_λ^{(q)}, q ∈ {1 . . . Q}. Such a validation loss is approximated with a surrogate model, which is typically a Gaussian Process (GP) regressor. We measure the similarity between pipelines via a kernel function k : dom(p_λ) × dom(p_λ) → R_{>0} parameterized by γ, and represent similarities as a matrix K'_{q,ℓ} := k(p_λ^{(q)}, p_λ^{(ℓ)}; γ), K' ∈ R^{Q×Q}_{>0}. Since we consider noise, we define K = K' + σ_p I. A GP computes the posterior mean and variance of the loss f_* for a new pipeline configuration p_λ^{(*)} as:

$\mathbb{E}\left[f_* \mid p_\lambda^{(*)}, H\right] = K_*^T K^{-1} y, \qquad \mathbb{V}\left[f_* \mid p_\lambda^{(*)}, H\right] = K_{**} - K_*^T K^{-1} K_* \quad (2)$

where K_{*,q} = k(p_λ^{(*)}, p_λ^{(q)}; γ), K_* ∈ R^Q_{>0}, and K_{**} = k(p_λ^{(*)}, p_λ^{(*)}; γ), K_{**} ∈ R_{>0}. BO is an iterative process that alternates between fitting a GP regressor as described above and selecting the next pipeline configuration to evaluate (Snoek et al., 2012). A description of how BO finds pipelines using a GP surrogate is provided in Appendix I.
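As a concrete sketch (not the paper's implementation), the GP posterior of Equation 2 with a Matérn 5/2 kernel can be computed with plain NumPy. The inputs `X_train`/`X_query` stand for whatever representation the kernel operates on (in DeepPipe, the learned pipeline embeddings); function names and the fixed lengthscale are illustrative assumptions:

```python
import numpy as np

def matern52(X1, X2, lengthscale=1.0):
    # Matérn 5/2 kernel between the rows of X1 and X2.
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)) / lengthscale
    return (1.0 + np.sqrt(5.0) * d + 5.0 * d ** 2 / 3.0) * np.exp(-np.sqrt(5.0) * d)

def gp_posterior(X_train, y_train, X_query, lengthscale=1.0, noise=1e-6):
    """Posterior of Equation 2: mean = K_*^T K^{-1} y,
    variance = diag(K_** - K_*^T K^{-1} K_*)."""
    K = matern52(X_train, X_train, lengthscale) + noise * np.eye(len(X_train))
    K_star = matern52(X_train, X_query, lengthscale)  # shape (Q, Q_query)
    K_ss = matern52(X_query, X_query, lengthscale)
    mean = K_star.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star.T @ np.linalg.solve(K, K_star)
    return mean, np.diag(cov)
```

With a small noise term, the posterior mean interpolates the observed losses at the evaluated configurations, while the variance grows away from them, which is what the acquisition function in BO exploits.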

4. DeepPipe: BO WITH DEEP PIPELINE CONFIGURATIONS

To apply BO to Pipeline Optimization (PO) we must define a kernel function that computes the similarity of pipeline configurations, i.e. k(p_λ^{(q)}, p_λ^{(ℓ)}) = ?. Prior work exploring BO for PO uses kernel functions directly on the raw concatenated vector space of selected algorithms and their hyperparameters (Alaa & van der Schaar, 2018), or uses surrogates without dedicated kernels for the conditional search space (Feurer et al., 2015; Olson & Moore, 2016; Schilling et al., 2015). However, we hypothesize that these approaches cannot capture the deep interaction between pipeline stages, between algorithms inside a stage, between algorithms across stages, and between different configurations of these algorithms. To address this issue we propose a simple, yet powerful solution to PO: learn a deep embedding of a pipeline configuration and apply BO with a deep kernel (Wistuba & Grabocka, 2021; Wilson et al., 2016). This is done by DeepPipe, which searches pipelines in a latent space using BO with Gaussian Processes. We use a neural network ϕ(p_λ; θ) : dom(p_λ) → R^Z with weights θ to project a pipeline configuration to a Z-dimensional space. Then, we measure the pipelines' similarity in this latent space as k(ϕ(p_λ^{(q)}; θ), ϕ(p_λ^{(ℓ)}; θ); γ) using the popular Matérn 5/2 kernel (Genton, 2002). Once we compute the parameters of the kernel similarity function, we can compute the GP's posterior and conduct PO with BO as specified in Section 3.2. In this work, we exploit existing deep kernel learning machinery (Wistuba & Grabocka, 2021; Wilson et al., 2016) to train the parameters θ of the pipeline embedding neural network ϕ, and the parameters γ of the kernel function k, by maximizing the log-likelihood of the observed validation losses y of the evaluated pipeline configurations p_λ.
The objective function for training a deep kernel is the log marginal likelihood of the Gaussian Process (Rasmussen & Williams, 2006) with covariance matrix entries k_{q,ℓ} = k(ϕ(p_λ^{(q)}; θ), ϕ(p_λ^{(ℓ)}; θ); γ). The main piece of the puzzle is: how to define the pipeline configuration embedding ϕ? Our DeepPipe embedding is composed of two parts: (i) per-algorithm encoders, and (ii) a pipeline aggregation network. A visualization example of our DeepPipe embedding architecture is provided in Figure 1. We define one encoder ξ^{(i,j)} for the hyperparameter configurations of each j-th algorithm, in each i-th stage, as a plain multi-layer perceptron (MLP). These encoders, each parameterized by weights θ^{enc}_{i,j}, map the algorithms' configurations to an L_i-dimensional vector space:

$\xi^{(i,j)}\left(\lambda_{i,j}; \theta^{\text{enc}}_{i,j}\right) = \text{MLP}\left(\lambda_{i,j}; \theta^{\text{enc}}_{i,j}\right), \quad \xi^{(i,j)} : \Lambda_{i,j} \to \mathbb{R}^{L_i}, \quad \forall i \in \{1 \dots N\},\; \forall j \in \{1 \dots M_i\} \quad (3)$

For a pipeline configuration p_λ, represented by the indices of its algorithms p and the configuration vectors of its algorithms λ(p), we project all the pipeline's algorithms' configurations to their latent space using the algorithm-specific encoders. Then, we concatenate their latent encoder vectors, where our concatenation notation is R^{L_i} ⊕ R^{L_k} := R^{L_i + L_k}. Finally, the concatenated representation is embedded into a final R^Z space via an aggregating MLP ψ with parameters θ^{aggr}, as denoted below:

$\phi(p_\lambda) := \psi\left(\xi^{(1,p_1)}(\lambda_{1,p_1}) \oplus \cdots \oplus \xi^{(N,p_N)}(\lambda_{N,p_N});\; \theta^{\text{aggr}}\right), \quad \psi : \mathbb{R}^{\sum_i L_i} \to \mathbb{R}^Z \quad (4)$

Within the i-th stage, only the output of one encoder is concatenated; the output of the Selector therefore corresponds to the active algorithm in the i-th stage and can be formalized as $\xi^{(i,p_i)}(\lambda_{i,p_i}) = \sum_{j=1}^{M_i} \mathbb{I}(j = p_i) \cdot \xi^{(i,j)}(\lambda_{i,j})$, where p_i is the index of the active algorithm and $\mathbb{I}$ denotes the indicator function.
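To make Equations 3-4 concrete, here is a minimal NumPy sketch of the forward pass of such a per-algorithm-encoder architecture. This is an illustration, not the paper's implementation: the toy stages, algorithm names, hyperparameter dimensions, and layer widths are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(sizes, rng):
    # One (weights, bias) pair per layer of a plain ReLU MLP.
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

# Toy search space: 2 stages; per-algorithm hyperparameter dimensions (assumed).
stage_algos = [{"pca": 2, "poly": 1},           # preprocessing stage
               {"svm": 3, "mlp": 4, "rf": 2}]   # estimator stage
L = [2, 4]  # per-stage encoder output dimensions L_i
Z = 8       # final embedding dimension

# One encoder xi^{(i,j)} per (stage, algorithm), plus one aggregation MLP psi.
encoders = {(i, name): mlp_init([dim, 4 * dim, L[i]], rng)
            for i, algos in enumerate(stage_algos)
            for name, dim in algos.items()}
psi = mlp_init([sum(L), 16, Z], rng)

def deep_pipe_embed(pipeline, configs):
    """Equation 4: encode each stage's active algorithm, concatenate, aggregate."""
    parts = [mlp_forward(encoders[(i, algo)], np.asarray(configs[i], dtype=float))
             for i, algo in enumerate(pipeline)]
    return mlp_forward(psi, np.concatenate(parts))

z = deep_pipe_embed(["pca", "svm"], [[0.5, 0.1], [1.0, 0.2, 0.3]])  # shape (8,)
```

Note how the selector of Equation 4 appears here as a dictionary lookup: only the encoder of the active algorithm in each stage contributes to the concatenated representation, so pipelines with different algorithms still map into the same R^Z space.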
Having defined the embedding ϕ in Equations 3-4, we can plug it into the kernel function, optimize it by minimizing the negative log-likelihood of the GP with respect to θ = {θ^enc, θ^aggr}, and conduct BO as in Section 3.2.
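The training loss just mentioned can be sketched as follows (a simplified NumPy illustration, not the paper's code; the kernel matrix K is assumed to have been built from the pipeline embeddings, and the additive constants of the log marginal likelihood are dropped):

```python
import numpy as np

def gp_negative_log_likelihood(K, y):
    """Negative log marginal likelihood of a zero-mean GP (constants dropped):
    y^T K^{-1} y + log |K|, the quantity minimized w.r.t. the embedding
    network weights theta and the kernel parameters gamma."""
    sign, logdet = np.linalg.slogdet(K)
    assert sign > 0, "K must be positive definite"
    return float(y @ np.linalg.solve(K, y) + logdet)
```

In practice this scalar would be backpropagated through the kernel and the embedding network with an autodiff framework; the NumPy version only shows what is being computed.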

4.2. META-LEARNING OUR PIPELINE EMBEDDING

In many practical applications, there exist computed evaluations of pipeline configurations on previous datasets, leading to the possibility of transfer learning for PO. Our DeepPipe can be easily meta-learned from such past evaluations by pre-training the pipeline embedding network. Let us denote the meta-dataset of pipeline evaluations on T datasets (a.k.a. auxiliary tasks) as H_t = {(p_λ^{(t,1)}, y^{(t,1)}), . . . , (p_λ^{(t,Q_t)}, y^{(t,Q_t)})}, t ∈ {1, . . . , T}, where Q_t is the number of existing evaluations for the t-th dataset. As a result, we meta-learn our method's parameters to minimize the meta-learning objective of Equation 5. This objective function corresponds to the negative marginal likelihood of the Gaussian Processes using DeepPipe's extracted features as input to the kernel (Wistuba & Grabocka, 2021; Patacchiola et al., 2020). Further details on the meta-learning procedure of our pipeline configuration weights are provided in Appendix J.

$\underset{\gamma, \theta}{\arg\min} \sum_{t=1}^{T} \left( y^{(t)T} K^{(t)}(\theta, \gamma)^{-1} y^{(t)} + \log \left| K^{(t)}(\theta, \gamma) \right| \right) \quad (5)$

5. EXPERIMENTS AND RESULTS

5.1. META-DATASETS

A meta-dataset is a collection of pipeline configurations and their respective performance evaluated on different tasks (i.e. datasets). Information about the meta-datasets is provided in Appendix P, their search spaces are clarified in Appendix R, and the splits of tasks per meta-dataset are found in Appendix T. All the tasks in the meta-datasets correspond to classification. We use the meta-training set for pre-training the Pipeline Optimization (PO) methods, the meta-validation set for tuning some of the hyperparameters of the PO methods, and we assess their performance on the meta-test set. In our experiments, we use the following meta-datasets. PMF contains 38151 pipelines (after filtering out all pipelines with only NaN entries) and 553 datasets (Fusi et al., 2018). Although not all the pipelines were evaluated on all tasks, it has a total of 16M evaluations. The pipeline search space has 2 stages (preprocessing and estimator) with 2 and 11 algorithms, respectively. Following the setup in the original paper (Fusi et al., 2018), we take 464 tasks for meta-training and 89 for meta-testing. As the authors do not specify a validation meta-dataset, we randomly sample 15 tasks out of the meta-training dataset. Tensor-OBOE provides 23424 pipelines and 551 tasks (Yang et al., 2020). It contains 11M evaluations, as there exist no evaluations for some combinations of pipelines and tasks. The pipelines include 5 stages: Imputator (1 algorithm), Dimensionality-Reducer (3 algorithms), Standardizer (1 algorithm), Encoder (1 algorithm), and Estimator (11 algorithms). We assign 331 tasks for meta-training, 110 tasks for meta-validation, and 110 tasks for meta-testing. ZAP is a benchmark that evaluates deep learning pipelines for fine-tuning on state-of-the-art computer vision tasks (Ozturk et al., 2022). The meta-dataset contains 275625 evaluated pipeline configurations on 525 datasets and 525 different Deep Learning pipelines (i.e.
the best pipeline of each dataset was also evaluated on all other datasets). From the set of datasets, we use 315 for meta-training, 45 for meta-validation and 105 for meta-testing, following the protocol of the original paper. In addition, we use OpenML datasets: a collection of 39 curated datasets (Gijsbers et al., 2019) that has been used in previous work for benchmarking (Erickson et al., 2020). This collection does not contain pipeline evaluations like the three meta-datasets above. However, we use it for evaluating Pipeline Optimization in time-constrained settings (Ozturk et al., 2022).

5.2. BASELINES

Random Search (RS) selects pipeline configurations by sampling randomly from the search space (Bergstra & Bengio, 2012). Probabilistic Matrix Factorization (PMF) uses a surrogate model that learns a latent representation for every pipeline using meta-training tasks (Fusi et al., 2018). We tuned this latent dimension for the Tensor-OBOE meta-dataset from a grid of {10, 15, 20} and found 20 to be the best setting. For the PMF meta-dataset, where the model was introduced, we used the default value of 20. We use the original PMF implementation (Sheth, 2018). OBOE also uses matrix factorization for optimizing pipelines, but aims to find fast and informative algorithms to initialize the matrix (Yang et al., 2019). We use the settings provided by the authors. Tensor-OBOE formulates PO as a tensor factorization, where the rank of the tensor is equal to 1 + N, for N being the number of stages in the pipeline (Yang et al., 2020). We tuned the rank of the Tucker decomposition from a grid of {20, 30, 40}, with 30 being the best value. All the other hyperparameters were set as in the original implementation (Yang et al., 2019). Factorized Multilayer Perceptron (FMLP) creates an ensemble of neural networks with a factorized layer (Schilling et al., 2015). The inputs of the neural network are the one-hot encodings of the algorithms and datasets, in addition to the algorithms' hyperparameters. We tuned the number of base estimators for the ensemble from a grid of {10, 50, 100}, with 100 being the optimal ensemble size. Each network layer has 5 neurons and ReLU activations, as highlighted in the authors' paper (Schilling et al., 2015). RGPE builds an ensemble of Gaussian Processes using auxiliary tasks (Feurer et al., 2018). The ensemble weights the contributions of every base model and the new model fit on the new task. We used the implementation from BoTorch (Balandat et al., 2020).
Gaussian Processes (GP) are a standard and strong baseline in hyperparameter optimization (Snoek et al., 2012). We tuned the kernel from {Gaussian, Matérn 5/2}, with Matérn 5/2 performing better. DNGO uses neural networks as basis functions with a Bayesian linear regressor at the output layer (Snoek et al., 2015). We use the implementation provided by Klein & Zela (2020) and its default hyperparameters. SMAC uses Random Forests to predict performance with uncertainty estimates (Hutter et al., 2011). After exploring a grid of {10, 50, 100} for the number of trees, we found 100 to be the best choice. TPOT is an AutoML system that conducts PO using evolutionary search (Olson & Moore, 2016). We use the original implementation but adapted the search space to fit the Tensor-OBOE meta-dataset (see Appendix R).

5.3. RESEARCH HYPOTHESES AND ASSOCIATED EXPERIMENTS

Hypothesis 1: DeepPipe outperforms standard PO baselines.

Experiment 1: We evaluate the performance of DeepPipe when no meta-training data is available. We compare against four baselines: Random Search (RS) (Bergstra et al., 2011), Gaussian Processes (GP) (Rasmussen & Williams, 2006), DNGO (Snoek et al., 2015) and SMAC (Hutter et al., 2011). We evaluate their performances on the aforementioned PMF, Tensor-OBOE and ZAP meta-datasets. In Experiments 1 and 2 (below) we select 5 initial observations to warm-start the BO, then we run 95 additional iterations. The procedure for choosing these configurations is detailed in Appendix G.

Hypothesis 2: Our meta-learned DeepPipe outperforms state-of-the-art transfer-learning PO methods.

Experiment 2: We compare our proposed method against baselines that use auxiliary tasks (a.k.a. meta-training data) for improving the performance of Pipeline Optimization: Probabilistic Matrix Factorization (PMF) (Fusi et al., 2018), Factorized Multilayer Perceptron (FMLP) (Schilling et al., 2015), OBOE (Yang et al., 2019) and Tensor-OBOE (Yang et al., 2020). Moreover, we compare to RGPE (Feurer et al., 2018), an effective baseline for transfer HPO (Arango et al., 2021). We evaluate the performances on the PMF and Tensor-OBOE meta-datasets.

Hypothesis 3: DeepPipe leads to strong any-time results in a time-constrained PO problem.

Experiment 3: Oftentimes practitioners need AutoML systems that discover efficient pipelines within a small time budget. To test the convergence speed of our PO method, we ran it on the aforementioned OpenML datasets for a budget of 10 minutes, and also 1 hour. We compare against five baselines: (i) TPOT (Olson & Moore, 2016) adapted to the search space of Tensor-OBOE (see Appendix R), (ii) OBOE and Tensor-OBOE (Yang et al., 2019; 2020) using the time-constrained version provided by the authors, (iii) SMAC (Hutter et al., 2011), and (iv) PMF (Fusi et al., 2018).
The last three had the same five initial configurations used to warm-start BO as detailed in Experiment 1. Moreover, they were pre-trained with the Tensor-OBOE meta-dataset, and all the method-specific settings are the same as in Experiment 2.

Hypothesis 4: Our novel encoder layers enable DeepPipe to conduct efficient PO when the pipeline search space changes, i.e. when developers add a new algorithm to a ML system.

Experiment 4: A major obstacle to meta-learning PO solutions is that they do not generalize when the search space changes, especially when the developers of ML systems add new algorithms. Our architecture quickly adapts to newly added algorithms because only an encoder sub-network for the new algorithm needs to be trained. To test this scenario, we ablate the performance of five versions of DeepPipe and try different settings where we remove a specific algorithm (an estimator) either from meta-training, meta-testing, or both.

Hypothesis 5: The encoders in DeepPipe introduce an inductive bias whereby the latent representation vectors of one algorithm's configurations are co-located, and located distantly from the representations of other algorithms' configurations. Formally, given three pipelines p^{(l)}, p^{(m)}, p^{(n)}: if p_i^{(l)} = p_i^{(m)} and p_i^{(l)} ≠ p_i^{(n)}, where p_i denotes the index of the algorithm in the i-th stage, then ||ϕ(p^{(l)}) − ϕ(p^{(m)})|| < ||ϕ(p^{(m)}) − ϕ(p^{(n)})|| holds with higher probability when using encoder layers. Furthermore, we hypothesize that the fewer tasks available during pre-training, the more necessary this inductive bias is.

Experiment 5: We sample 2000 pipelines with 5 estimation algorithms from the Tensor-OBOE dataset. Subsequently, we embed the pipelines using a DeepPipe with 0, 1 and 2 encoder layers, and weights θ initialized such that each θ_i ∈ θ is independently and identically distributed as θ_i ∼ N(0, 1).
Finally, we visualize the embeddings with T-SNE (Van der Maaten & Hinton, 2008) and compute a cluster metric to assess how close pipelines with the same algorithm lie in the latent space: $\mathbb{E}_{p^{(l)}, p^{(m)}, p^{(n)}}\left[\mathbb{I}\left(\|\phi(p^{(l)}) - \phi(p^{(m)})\| < \|\phi(p^{(m)}) - \phi(p^{(n)})\|\right)\right]$. To test the importance of the inductive bias vs. the number of pre-training tasks, we ablate the performance of DeepPipe for different percentages of pre-training tasks (0.5%, 1%, 5%, 10%, 50%, 100%) under different numbers of encoder layers.
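The cluster metric above can be estimated by sampling triplets of embedded pipelines. A small NumPy sketch (our illustration with assumed function and argument names, not the paper's code):

```python
import numpy as np

def cluster_metric(emb, labels, n_triplets=2000, seed=0):
    """Monte-Carlo estimate of the triplet metric of Experiment 5: the fraction
    of triplets (l, m, n) with label(l) == label(m) != label(n) for which the
    same-algorithm pair is closer than the cross-algorithm pair.
    `labels` must contain at least two distinct classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    hits, trials = 0, 0
    while trials < n_triplets:
        l, m, n = rng.integers(0, len(emb), size=3)
        if l == m or labels[l] != labels[m] or labels[l] == labels[n]:
            continue  # resample until the triplet matches the label pattern
        trials += 1
        same = np.linalg.norm(emb[l] - emb[m])   # ||phi(p_l) - phi(p_m)||
        cross = np.linalg.norm(emb[m] - emb[n])  # ||phi(p_m) - phi(p_n)||
        hits += int(same < cross)
    return hits / n_triplets
```

A value near 1.0 indicates that configurations of the same algorithm form compact clusters in the latent space; a value near 0.5 indicates no such structure.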

5.4. EXPERIMENTAL SETUP FOR DeepPipe

The encoders and the aggregation layers are Multilayer Perceptrons with ReLU activations. We keep an architecture that is proportional to the input size. The number of neurons in the hidden layers of the encoder for the j-th algorithm in the i-th stage with |Λ_{i,j}| hyperparameters is F · |Λ_{i,j}|, given an integer factor F. The output dimension of the encoders of the i-th stage is defined as Q_i = max_j |Λ_{i,j}|. The total number of layers (i.e. encoder and aggregation layers) is fixed to 4. The number of encoder layers is chosen from {0, 1, 2}. The specific values of the encoders' input dimensions are detailed in Appendix R. We choose F ∈ {4, 6, 8, 10} based on the performance on the validation split. Specifically, we use the following values for DeepPipe: (i) in Experiment 1: 1 encoder layer (all meta-datasets), F = 6 (PMF and ZAP) and F = 8 (Tensor-OBOE); (ii) in Experiment 2: F = 8, no encoder layer (PMF, Tensor-OBOE) and one encoder layer (ZAP); (iii) in Experiment 3: F = 8 and no encoder layers; (iv) in Experiment 4: F = 8 and {0, 1} encoder layers. Finally, (v) in Experiment 5 we use F = 8 and {0, 1, 2} encoder layers. Additional details on the setup can be found in Appendix G and our source code.
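The sizing rules above can be summarized in a short helper. This is a hypothetical bookkeeping function under our reading of the rules (the default Z and the exact split between hidden and output layers are assumptions, not the paper's code):

```python
def deep_pipe_sizes(stage_hparam_dims, F=8, enc_layers=1, total_layers=4, Z=20):
    """Layer-size bookkeeping following Section 5.4 (Z is an assumed final
    embedding dimension). Hidden encoder layers have width F * |Lambda_{i,j}|,
    each encoder of stage i outputs Q_i = max_j |Lambda_{i,j}|, and the
    remaining total_layers - enc_layers layers form the aggregation MLP."""
    Q = [max(dims.values()) for dims in stage_hparam_dims]
    encoders = [{alg: [d] + [F * d] * (enc_layers - 1) + [Q[i]]
                 for alg, d in dims.items()}
                for i, dims in enumerate(stage_hparam_dims)]
    agg_in = sum(Q)  # concatenated encoder outputs feed the aggregation MLP
    aggregation = [agg_in] * (total_layers - enc_layers) + [Z]
    return encoders, aggregation
```

For example, a two-stage toy space with algorithms of 2, 3 and 2 hyperparameters, F = 8 and 2 encoder layers yields encoder shapes like [3, 24, 3] and a 2-layer aggregation network, keeping the 4-layer total fixed.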

6. RESULTS

We present the results for Experiments 1 and 2 in Figures 2 and 3, respectively. In both cases, we compute the ranks of the classification accuracy achieved by the discovered pipelines of each technique, averaged across the meta-testing datasets. The shadowed bands correspond to 95% confidence intervals. Additional results showing the mean regret are included in Appendix L. In Experiment 1 (standard/non-transfer PO), DeepPipe achieved the best performance across the meta-datasets, whereas SMAC attained second place. In Experiment 2, DeepPipe strongly outperforms all the transfer-learning PO baselines on all meta-datasets. Given that DeepPipe yields state-of-the-art PO results in both the standard and the transfer-learning setup, we conclude that our pipeline embedding network computes efficient representations for PO with Bayesian Optimization. In particular, the results on the ZAP meta-dataset indicate the efficiency of DeepPipe in discovering state-of-the-art Deep Learning pipelines for computer vision. Furthermore, the results of Experiment 4 indicate that our DeepPipe embedding quickly adapts to incrementally-expanding search spaces, e.g. when the developers of a ML system add new algorithms. In this circumstance, existing transfer-learning PO baselines do not adapt easily, because they assume a static pipeline search space. As a remedy, when a new algorithm is added to the system after meta-learning our pipeline embedding network, we train only a new encoder for that new algorithm. In this experiment, we run our method on variants of the search space where one algorithm at a time is introduced (for instance, an estimator such as MLP or RF is not known during meta-training, but newly added at meta-testing). In Tables 2 and 5 (in the Appendix), we present the values of the average rank among five different configurations of DeepPipe. We compare among meta-trained versions (denoted by ✓ in the column MTd.)
that omit specific estimators during meta-training (MTr.=✓), or during meta-testing (MTe.=✓). We also account for versions with one encoder layer, denoted by ✓ in the column Enc. The best in all cases is the meta-learned model that did not omit the estimator (i.e. the algorithm is known and prior evaluations with it exist). Among the versions that omitted the estimator in the meta-training set (i.e. the algorithm is newly added), the best configuration was the DeepPipe that fine-tuned a new encoder for that algorithm (line Enc=✓, MTd.=✓, MTr.=✓, MTe.=✗). This version of DeepPipe performs better than ablations with no encoder layers (i.e. only aggregation layers ϕ), or the one omitting the algorithm during meta-testing (i.e. pipelines that do not use the new algorithm at all). The message of the results is simple: if we add a new algorithm to a ML system, instead of running PO without meta-learning (because the search space changes and existing transfer PO baselines are not applicable to the new space), we can use a meta-learned DeepPipe and only fine-tune an encoder for the new algorithm. The effect of the inductive bias introduced by the encoders (Experiment 5) can be appreciated in Figure 4. Pipelines with the same active algorithm in the estimation stage, but with different hyperparameters, lie closer together in the embedding space created by a randomly initialized DeepPipe, forming compact clusters as characterized by the defined cluster metric (value below the plots). We formally demonstrate in Appendix S that, in general, a single encoder layer creates more compact clusters than a fully connected linear layer. In additional results (Appendix F), we observe that the average rank on the test tasks improves for DeepPipe versions with deeper encoder layers (keeping the total number of layers fixed) as the number of meta-training tasks gets lower. This occurs because, given enough meta-training data, the objective function (Equation 5) makes it possible to learn embeddings where pipelines with similar performance are clustered together (see Appendix O). Otherwise, the inductive bias introduced by the encoders becomes more relevant.

7. CONCLUSION

In this paper, we have shown that efficient Machine Learning pipeline representations can be computed with deep modular networks. Such representations help discover more accurate pipelines compared to state-of-the-art approaches, because they capture the interactions between the different pipeline algorithms and their hyperparameters. Moreover, we show that introducing per-algorithm encoders helps in the case of limited meta-training data, or when a new algorithm is added to the search space. Overall, we demonstrate that our method DeepPipe achieves the new state-of-the-art in Pipeline Optimization.

Limitations. Our representation network does not model complex parallel pipelines (in that case the embedding would require a graph neural network), and/or pipelines involving ensembles. We plan to investigate these important points in future work.

Reproducibility Statement. To guarantee the reproducibility of our work, we include an anonymized repository with the related code. The code for the baselines is included as well, and we reference the original implementations where applicable. All the meta-datasets are publicly available and correspondingly referenced.

A POTENTIAL NEGATIVE SOCIETAL IMPACTS

Meta-training is the most demanding computational step, and can thus incur high energy consumption. Additionally, DeepPipe does not explicitly handle fairness, so it may find pipelines that are biased by the data.

B LICENCE CLARIFICATION

The results of this work (code, data) are under the BSD-3-Clause license. Both the PMF dataset (Sheth, 2018) and the Tensor-OBOE dataset (Akimoto & Yang, 2020) hold the same license.

C DISCUSSION ON NUMBER OF EVALUATED PIPELINES

Based on the results from Experiment 3, we report the average (and standard deviation) of the number of observed pipelines among all the compared methods in 10 and 60 minutes in Table 3. This is an important metric to understand the optimization overhead introduced by a method. For instance, a method that explores few pipelines during a fixed time window might use expensive computations during the pipeline optimization. We notice that DeepPipe achieves the best results (see Table 1) by using a reasonable amount of pipelines, i.e. the optimization overhead introduced by our method is small compared to other approaches such as TPOT and OBOE. The encoder and aggregation layers capture interactions among the pipeline components, and are therefore important to attain good performance. These interactions are reflected in the features extracted by these layers, i.e. the pipeline representations obtained by DeepPipe. These representations lie on a metric space that captures relevant information about the pipelines and which can be used in the kernel for the Gaussian Process. Using the original input space does not allow extracting rich representations. To test this idea, we meta-train four versions of DeepPipe with and without encoder and aggregation layers on our Tensor-OBOE meta-training set and then test on the meta-test split. In Figure 5, we show that the best version is obtained when using both encoder (Enc.) and aggregation (Agg.) layers (green line), whereas the worst version is obtained when using the original input space, i.e. no encoder and no aggregation layers. Having encoder layers helps more than not having them, so it is important to capture interactions among hyperparameters within the same stage.
As having an aggregation layer is better than not, it is important to capture interactions among components from different stages. The hyperparameters of every group of pipelines components is then passed through a kernel, and then the N resulting kernels are added. This effectively builds up a kernel with additive structure (Gardner et al., 2017) , however they are not using a feature extractor like DeepPipe. We compare SKL against a non-pretrained DeepPipe on Figure 6 on three meta-datasets, where it is noticeable that our method outperforms this strategy.

D DISCUSSION ON THE INTERACTIONS AMONG COMPONENTS

Additionally we compare DeepPipe with the whole algorithm introduced by AutoPrognosis 2.0 (Imrie et al., 2022) on the Open ML datasets for 50 and 100 BO iterations (E BO ). We report the average and standard deviation for the rank, accuracy and time. DeepPipe achieves the best average rank, ie. lower average rank than AutoPrognosis. This is complemented with the having the highest average accuracy. Interestingly, our method is approximately one order of magnitude faster than AutoPrognosis. We notice this is due to the time overhead introduced by the Gibbs sampling strategy for optimizing the structured kernel, whereas our approach uses gradient-based optimization. Experimental Set-Up for DeepPipe. For our comparison with SKL, we use the same hyperparameters and architecture as for the Experiment 1. When comparing with AutoPrognosis, we use the same hyperparmeters and architecture as for the Experiment 2, pre-trained on the Tensor-OBOE meta-train split. Experimental Set-Up for SKL and AutoPrognosis For SKL we used the default strategy with N = 3 (Alaa & van der Schaar, 2018) . For AutoPrognosis, we use the implementation in the respective author's repositoryfoot_2 . We ran it with the default configuration, but limited the search space of classifiers to match the classifiers on the Tensor-OBOE meta-datasetfoot_3 .

F DISCUSSION ON THE INDUCTIVE BIAS VS. PRE-TRAINING EFFECT

How shallow or deep should the encoder networks be compared to the aggregation network? We hypothesize that deeper encoders help in the transfer-learning setup where only a few evaluated pipeline configurations on past datasets exist. To test this hypothesis, we assess the performance of DeepPipe with different network sizes, meta-trained on different percentages of the meta-training tasks: 0.5%, 1%, 5%, 10%, 50%, and 100%. As we use the Tensor-OBOE meta-dataset, this corresponds to 1, 3, 16, 33, 165, and 330 tasks respectively. The average rank is computed across all meta-test tasks and across 100 BO iterations. The results, shown in Figure 7, indicate that deeper encoders achieve better performance when a small number of meta-training tasks is available, whereas shallower encoders suffice when more meta-training tasks are available. Apparently, the deep aggregation layers ϕ already capture the interaction between the hyperparameter configurations across algorithms when a large meta-dataset of evaluated pipelines is given. The smaller the meta-data of evaluated pipeline configurations, the more inductive bias we need to implant in the form of per-algorithm encoders.

G ADDITIONAL INFORMATION ON EXPERIMENTAL SET-UP

In all experiments (except Experiment 1), we meta-train the surrogate following Algorithm 1 in Appendix J for 10000 epochs with the Adam optimizer, a learning rate of 10^{-4}, batch size 1000, and the Matérn kernel. During meta-testing, when we perform BO to search for a pipeline, we fine-tune only the kernel parameters γ for 100 gradient steps. In the non-transfer experiments (Experiment 1) we use an architecture with F = 8 and fine-tune the network for 10000 iterations; the rest of the training settings are the same as in the transfer experiments. In Experiment 5 we fine-tune the whole network for 100 steps when no encoders are used; otherwise, we fine-tune only the encoder associated with the omitted estimator and freeze the rest of the network. We ran all experiments on a CPU cluster, where each node contains two Intel Xeon E5-2630v4 CPUs with 20 CPU cores each, running at 2.2 GHz; we reserved a maximum of 16 GB of memory. We discuss how we implemented DeepPipe efficiently as an MLP with masked layers (footnote 4) in Appendix N. We associate algorithms with no hyperparameters to the same encoder. We found that adding the one-hot encoding of the selected algorithms per stage as an additional input is helpful; therefore, the input dimensionality of the aggregation layers equals the dimension after concatenating the encoder outputs, F · Σ_i (Q_i + M_i). Further details on the architectures for each search space are specified in Appendix M. Finally, we use Expected Improvement as the acquisition function for DeepPipe and all baselines.

Initial Configurations

For the experiments with the PMF dataset, we choose the initial configurations following the same procedure as Fusi et al. (2018), who use dataset meta-features to find the most similar auxiliary task to initialize the search on the test task. Since we do not have meta-features for the Tensor-OBOE meta-dataset, we follow a greedy initialization approach (Metz et al., 2020), which we also apply to the ZAP dataset. Specifically, we select the best-performing pipeline configuration by ranking performances on the meta-training tasks. Subsequently, we iteratively choose four additional configurations that minimize Σ_{t ∈ Tasks} r_t, where r_t = min_{p ∈ X} r_{t,p} and r_{t,p} is the rank of pipeline p on task t.
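The greedy selection above can be sketched as follows. This is our own illustration (function name and toy performance matrix are hypothetical), assuming a tasks-by-pipelines matrix of meta-training performances where higher is better:

```python
# Hedged sketch of the greedy initialization: pick the pipeline with the best
# average rank, then iteratively add pipelines that minimize
# sum_t min_{p in X} r_{t,p}, where X is the chosen set.
import numpy as np

def greedy_init(perf: np.ndarray, k: int) -> list[int]:
    # ranks[t, p]: rank of pipeline p on task t (1 = best performance on t)
    order = np.argsort(-perf, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(perf.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, perf.shape[1] + 1)

    chosen = [int(ranks.mean(axis=0).argmin())]   # best average rank first
    while len(chosen) < k:
        best_p, best_score = None, np.inf
        for p in range(perf.shape[1]):
            if p in chosen:
                continue
            # objective: sum over tasks of the best rank achieved by X ∪ {p}
            score = ranks[:, chosen + [p]].min(axis=1).sum()
            if score < best_score:
                best_p, best_score = p, score
        chosen.append(best_p)
    return chosen

# Toy meta-training performances: 3 tasks x 3 pipelines.
perf = np.array([[0.9, 0.2, 0.8],
                 [0.1, 0.95, 0.3],
                 [0.85, 0.4, 0.6]])
init = greedy_init(perf, k=2)
```

In this toy example, pipeline 0 has the best average rank and pipeline 1 complements it, since pipeline 1 is the only one that ranks first on task 1.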

H ADDITIONAL RELATED WORK

Hyperparameter Optimization (HPO) has been well studied over the past decade (Bergstra & Bengio, 2012). Techniques relying on Bayesian Optimization (BO) employ surrogates to approximate the response function of Machine Learning models, such as Gaussian Processes (Snoek et al., 2012), Random Forests (Bergstra et al., 2011) or Bayesian Neural Networks (Snoek et al., 2015; Springenberg et al., 2016). Further improvements have been achieved by applying transfer learning, where existing evaluations on auxiliary tasks help pre-train or meta-learn the surrogate. In this sense, some approaches use pre-trained neural networks with uncertainty outputs (Wistuba & Grabocka, 2021; Perrone et al., 2018; Wei et al., 2021b), or ensembles of Gaussian Processes (Feurer et al., 2018). Deep kernels combine the benefits of stochastic processes such as Gaussian Processes with neural networks (Calandra et al., 2016; Garnelo et al., 2018; Wilson et al., 2016). Follow-up work has applied this combination to training few-shot classifiers (Patacchiola et al., 2020). In the area of Hyperparameter Optimization, Snoek et al. (2015) achieved success in BO by modeling the output layer of a deep neural network with Bayesian linear regression. Perrone et al. (2018) extended this work by pre-training the Bayesian network with auxiliary tasks. Recent work proposed using non-linear kernels, such as the Matérn kernel, on top of the pre-trained network to improve the performance of BO (Wistuba & Grabocka, 2021; Wei et al., 2021a).

I BAYESIAN OPTIMIZATION (BO)

In BO we iteratively fit a surrogate using the observed configurations and their responses. Subsequently, its probabilistic output is used to query the next configuration to evaluate (observe) by maximizing an acquisition function. A common choice for the acquisition is Expected Improvement, defined as EI(p_λ | H) = E[max{µ(p_λ) − y_max, 0}], where y_max is the largest observed response in the history H and µ is the posterior mean of the predicted performance given by the surrogate, computed using Equation 2. A common choice of surrogate is a Gaussian Process; for Pipeline Optimization we introduce DeepPipe.
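Under a Gaussian posterior with mean µ and standard deviation σ, the expectation above has the well-known closed form (µ − y_max)Φ(u) + σφ(u) with u = (µ − y_max)/σ. A minimal sketch, using only the standard library:

```python
# Closed-form Expected Improvement for a Gaussian posterior (maximization).
# Phi/phi are the standard normal CDF/PDF, written via math.erf.
import math

def expected_improvement(mu: float, sigma: float, y_max: float) -> float:
    if sigma <= 0.0:
        # Degenerate posterior: improvement is deterministic.
        return max(mu - y_max, 0.0)
    u = (mu - y_max) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return (mu - y_max) * cdf + sigma * pdf
```

For example, a candidate whose posterior mean equals the incumbent (u = 0) still has EI = σ·φ(0) ≈ 0.399σ, which is how EI rewards uncertainty.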

J DeepPipe META-TRAINING

Given a task t with observations H_t = {(p_λ^{(t,1)}, y^{(t,1)}), ..., (p_λ^{(t,Q_t)}, y^{(t,Q_t)})}, t ∈ {1, ..., T}, the objective function to minimize is the negative log marginal likelihood of the Gaussian Process p(H_t) ~ N(0, K^{(t)}), where K^{(t)} is the covariance matrix induced by DeepPipe with parameters θ, γ. Specifically, up to additive constants (Rasmussen & Williams, 2006):

−log p(H_t) = −log N(y^{(t)}; 0, K^{(t)}) ∝ y^{(t)T} K^{(t)}(θ, γ)^{−1} y^{(t)} + log |K^{(t)}(θ, γ)|

Equation 5 is the multi-task objective function that involves all meta-training tasks with indices t ∈ {1, ..., T}. We use auxiliary tasks to learn a good initialization for the surrogate: we sample batches from the meta-training tasks and take gradient steps that maximize the marginal log-likelihood in Equation 5, similar to previous work (Wistuba & Grabocka, 2021). The training algorithm for the surrogate is detailed in Algorithm 1. Additionally, we apply early stopping by monitoring the performance on the validation meta-dataset. Every epoch, we perform the following operations for every task t ∈ {1, ..., T}: i) draw a set of b observations (pipeline configuration and performance), ii) compute the negative log marginal likelihood (our loss function) as in Equation 5, iii) compute the gradient of the loss with respect to the DeepPipe parameters, and iv) update the DeepPipe parameters. When a pipeline is to be optimized on a new dataset (task), we apply BO (see Algorithm 2). Every iteration we update the surrogate by fine-tuning the kernel parameters. However, the parameters of the MLP layers θ can also be optimized, as we did in Experiment 1, in which case the parameters were randomly initialized.
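The per-task loss can be sketched numerically. The code below is our own illustration, not the authors' implementation: it uses a Matérn-5/2 kernel over fixed random embeddings standing in for DeepPipe's features, and includes the 1/2 factors and normalization constant that the proportionality above omits:

```python
# Negative log marginal likelihood of a zero-mean GP on one task,
# computed stably via a Cholesky factorization of the kernel matrix.
import numpy as np

def matern52(Z, lengthscale=1.0, noise=1e-6):
    # Pairwise distances between embeddings, then the Matérn-5/2 kernel.
    d = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)) / lengthscale
    K = (1 + np.sqrt(5) * d + 5 * d ** 2 / 3) * np.exp(-np.sqrt(5) * d)
    return K + noise * np.eye(len(Z))            # jitter for numerical stability

def neg_log_marginal_likelihood(Z, y):
    K = matern52(Z)
    L = np.linalg.cholesky(K)                    # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                      # (1/2) y^T K^{-1} y
            + np.log(np.diag(L)).sum()           # (1/2) log |K|
            + 0.5 * len(y) * np.log(2 * np.pi))  # normalization constant

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 4))                 # 10 pipelines, 4-dim embeddings
y = rng.standard_normal(10)                      # observed responses
nll = neg_log_marginal_likelihood(Z, y)
```

In meta-training, this scalar would be averaged over batches drawn from each task and backpropagated through the embedding network.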

L ADDITIONAL RESULTS

In this section, we present further results. Firstly, we show an ablation of the factor that determines the number of hidden units (F) in Figure 8: F = 8 attains the best performance after exploring 100 pipelines in both datasets. Additionally, we present the average regret for the ablation of F, and the results of Experiments 1 and 2, in Figures 9, 10 and 11 respectively. Table 5 presents the extended results of omitting estimators in the PMF dataset. From these, we draw the same conclusion as in the main paper: having encoders helps to obtain better performance when a new algorithm is added to a pipeline. We also carry out an ablation to understand the difference between the versions of DeepPipe with/without encoders and with/without transfer learning, using the ZAP meta-dataset. As shown in Figure 12, the version with transfer learning and encoders performs best, highlighting the importance of encoders in transfer learning our DeepPipe surrogate.

M ARCHITECTURE DETAILS

The input to the kernel has a dimensionality of Z = 20; we fix it to match the output dimension used for PMF. The number of neurons per layer, as mentioned in the main paper, depends on F. Consider an architecture with no encoder layers and ℓ_a aggregation layers, and hyperparameters Λ_{i,j}, i ∈ {1, ..., N}, j ∈ {1, ..., M_i} (following the notation in Section 4.1), with Q_i = max_j |Λ_{i,j}|. The number of weights (omitting biases for the sake of simplicity) is then:

$$\Big(\sum_{i,j} |\Lambda_{i,j}|\Big) \cdot \Big(F \cdot \sum_i Q_i\Big) + (\ell_a - 1)\Big(F \cdot \sum_i Q_i\Big)^2 \tag{8}$$

If the architecture has ℓ_e encoder layers and ℓ_a aggregation layers, the number of weights is given by:

$$\sum_{i,j} |\Lambda_{i,j}| \cdot (F \cdot Q_i) + (\ell_e - 1)\sum_i M_i \cdot (F \cdot Q_i)^2 + \ell_a \Big(F \cdot \sum_i Q_i\Big)^2 \tag{9}$$

In other words, the aggregation layers have F · Σ_i Q_i hidden neurons, whereas every encoder of the i-th stage has F · Q_i neurons per layer; the respective input sizes are Σ_{i,j} |Λ_{i,j}| and |Λ_{i,j}|. The specific values of |Λ_{i,j}| and Q_i per search space are specified in Appendix R. In the search space for PMF, we group the algorithms related to Naive Bayes (MultinomialNB, BernoulliNB, GaussianNB) in a single encoder; in this search space, we also group LDA and QDA. In the search space of TensorOboe, we group GaussianNB and Perceptron, as they have no hyperparameters.

Given these considerations, we can compute the input size and the number of weights per search space as a function of ℓ_a, ℓ_e and F as follows:

(i) Input size:
Input size (PMF) = Σ_{i,j} |Λ_{i,j}| = 72
Input size (TensorOboe) = Σ_{i,j} |Λ_{i,j}| = 37
Input size (ZAP) = Σ_{i,j} |Λ_{i,j}| = 35

(ii) Number of weights for the architecture without encoder layers:
Weights (PMF) = 720 · F + 256 · (ℓ_a − 1) · F²
Weights (TensorOboe) = 444 · F + 144 · (ℓ_a − 1) · F²
Weights (ZAP) = 1085 · F + 961 · (ℓ_a − 1) · F²

(iii) Number of weights for the architecture with encoder layers:
Weights (PMF) = 886 · F + (1376 · (ℓ_e − 1) + 256 · ℓ_a) · F²
Weights (TensorOboe) = 161 · F + (271 · (ℓ_e − 1) + 144 · ℓ_a) · F²
Weights (ZAP) = 35 · F + (965 · (ℓ_e − 1) + 961 · ℓ_a) · F²   (12)

According to the previous formulations, Figure 13 shows how many parameters (weights only) the MLP has for a given value of F and number of encoder layers, fixing the total number of layers to four. Notice that the difference in the number of parameters between an architecture with 1 and 2 encoder layers is small in both search spaces.
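As a sanity check, the closed-form weight counts can be evaluated programmatically. The helper below is our own sketch (function names are ours); the constants are the per-search-space coefficients listed above:

```python
# Weight counts for DeepPipe's MLP, following the closed-form expressions
# in this appendix (biases omitted). c_in, c_sq, c_lin, c_enc, c_agg are the
# per-search-space constants, e.g. (720, 256) for PMF without encoders.

def weights_without_encoders(c_in: int, c_sq: int, F: int, l_a: int) -> int:
    """c_in * F + c_sq * (l_a - 1) * F^2."""
    return c_in * F + c_sq * (l_a - 1) * F ** 2

def weights_with_encoders(c_lin: int, c_enc: int, c_agg: int,
                          F: int, l_e: int, l_a: int) -> int:
    """c_lin * F + (c_enc * (l_e - 1) + c_agg * l_a) * F^2."""
    return c_lin * F + (c_enc * (l_e - 1) + c_agg * l_a) * F ** 2

# PMF search space, no encoder layers, F = 8, four aggregation layers:
pmf = weights_without_encoders(720, 256, F=8, l_a=4)
```

Plugging in F = 8 with four layers total reproduces the parameter counts plotted in Figure 13.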

N COMPUTATIONAL IMPLEMENTATION

DeepPipe's architecture (encoder layers + aggregation layers) can be formulated as a Multilayer Perceptron (MLP) comprising three parts (Figure 14). The first part, which builds the encoders, is implemented as a layer with masked weights: the input values corresponding to the hyperparameters λ^{(i,j)} of the j-th algorithm of the i-th stage are connected to a fraction of the neurons in the following layer, which builds the encoder. This fraction of neurons, as explained in Section 5.4, is F · max_j |λ^{(i,j)}|; the rest of the connections are dropped. The second part is a layer that selects the outputs of the encoders associated with the active algorithms (one per stage) and concatenates them (Selection & Concatenation). This layer's connections are fixed to either one or zero during the forward and backward passes: one if they connect outputs of encoders of active algorithms, zero otherwise. The last part, the aggregation layers, are fully connected layers that learn interactions between the concatenated encoder outputs. By implementing the architecture as a single MLP instead of a multiplexed list of modules (e.g. with a module list in PyTorch), faster forward and backward passes are obtained: we only need to specify the selected algorithms in the forward pass so that the weights in the encoder layer are masked and the ones in Selection & Concatenation are set accordingly. With this implementation, DeepPipe is simply an MLP with sparse connections.
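The masked encoder layer can be sketched with a block-diagonal mask. This is our own NumPy illustration (toy dimensions, not the authors' code): stage 1 has two algorithms with 2 and 1 hyperparameters, stage 2 has one algorithm with 2 hyperparameters, and F = 2:

```python
# Illustrative sketch of DeepPipe's masked encoder layer. Each algorithm's
# inputs connect only to its own block of F * Q_i neurons; all cross-algorithm
# weights are masked to zero, yielding a block-diagonal weight matrix.
import numpy as np

rng = np.random.default_rng(0)

hp_dims = [[2, 1], [2]]               # |λ^(i,j)| per stage i and algorithm j
F = 2
Q = [max(dims) for dims in hp_dims]   # Q_i = max_j |λ^(i,j)|

# One (input_dim, output_dim) block per algorithm.
blocks = [(dim, F * Q[i]) for i, dims in enumerate(hp_dims) for dim in dims]
in_dim = sum(b[0] for b in blocks)    # 5 concatenated hyperparameters
out_dim = sum(b[1] for b in blocks)   # 12 encoder neurons in total

mask = np.zeros((in_dim, out_dim))
r = c = 0
for d_in, d_out in blocks:
    mask[r:r + d_in, c:c + d_out] = 1.0
    r += d_in
    c += d_out

W = rng.standard_normal((in_dim, out_dim))
x = rng.standard_normal(in_dim)                 # concatenated hyperparameters
z = np.maximum(x @ (W * mask), 0.0)             # masked linear layer + ReLU
```

The Selection & Concatenation layer would then zero out the blocks of inactive algorithms before the fully connected aggregation layers.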

O VISUALIZING THE LEARNT REPRESENTATIONS

We train a DeepPipe with 2-layer encoders, 2 aggregation layers, output size 20, and F = 8. To visualize the pipeline embeddings, we apply t-SNE (t-distributed Stochastic Neighbor Embedding). As plotted in Figure 15, pipelines with the same estimator and dimensionality reducer form clusters. The groups in this latent space are also indicative of the performance on a specific task. In Figure 16 we show the same pipeline embeddings with a color marker indicating their accuracy on two meta-test tasks. Top-performing pipelines (yellow) lie relatively close to each other in both tasks, building up regions of well-performing pipelines. These regions differ between the two tasks, which indicates that there is no single pipeline that works for all tasks. DeepPipe maps the pipelines to an embedding space where it is easier to assess the similarity between pipelines and to search for well-performing pipelines; which pipelines perform well, however, depends on the task.

Here, we formally demonstrate that a DeepPipe with encoder layers groups hyperparameters from the same algorithm in the latent space better than a DeepPipe without encoders. This is formulated in Corollary S.4, which is supported by Proposition S.3.

Lemma S.1. Given $w \in \mathbb{R}^M$, a vector of weights with independent and identically distributed components $w_i \sim p(w)$, the expected value of the squared norm is $\mathbb{E}_{p(w)}(\|w\|^2) = M \cdot (\mu_w^2 + \sigma_w^2)$, where $\mu_w$ and $\sigma_w$ are the mean and standard deviation of $p(w)$ respectively.

Proof.
$$\mathbb{E}_{p(w)}(\|w\|^2) = \sum_{i=1}^{M} \mathbb{E}_{p(w)}(w_i^2) = \sum_{i=1}^{M} (\mu_w^2 + \sigma_w^2) = M \cdot (\mu_w^2 + \sigma_w^2).$$

Proof of Lemma S.2. Expanding the square of the linear output $w^T x$:
$$\mathbb{E}_{p(w)}\big((w^T x)^2\big) = \mathbb{E}_{p(w)}\Big(\sum_{i=1}^{M} (w_i x_i)^2 + 2\sum_{i=1}^{M}\sum_{j=1}^{i-1} w_i w_j x_i x_j\Big) \tag{18}$$
$$= \sum_{i=1}^{M} \mathbb{E}_{p(w)}(w_i^2)\, x_i^2 + 2\sum_{i=1}^{M}\sum_{j=1}^{i-1} \mathbb{E}_{p(w)}(w_i w_j)\, x_i x_j \tag{19}$$
Since $w_i, w_j$ are independent for $i \neq j$, $\mathbb{E}_{p(w)}(w_i w_j) = \mathbb{E}_{p(w)}(w_i) \cdot \mathbb{E}_{p(w)}(w_j) = \mu_w^2$. Moreover, with a slight abuse of notation, we denote $\sum_{i=1}^{M}\sum_{j=1}^{i-1} x_i x_j = x \otimes x$. Given Lemma S.1, we obtain:
$$\mathbb{E}_{p(w)}\big((w^T x)^2\big) = (\mu_w^2 + \sigma_w^2) \cdot \|x\|^2 + 2\mu_w^2 \cdot (x \otimes x) =: D_w(x) \tag{21}$$
where $D_w(\cdot)$ is introduced as an operation to simplify the notation.

Proposition S.3. Consider two vectors $\hat{x}, x' \in \mathbb{R}^M$ and two weight vectors $\hat{w}, w'$ with iid components, such that $\hat{w}^T\hat{x} \in \mathbb{R}$ and $w'^T x' \in \mathbb{R}$. Then $\mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - w'^T x')^2\big) > \mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - \hat{w}^T x')^2\big)$.

Proof. Using Lemma S.2 and decomposing the square:
$$\mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - w'^T x')^2\big) = \mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x})^2\big) + \mathbb{E}_{p(w)}\big((w'^T x')^2\big) - 2\,\mathbb{E}_{p(w)}\big(\hat{w}^T\hat{x} \cdot w'^T x'\big) \tag{23}$$
$$= D_w(\hat{x}) + D_w(x') - 2\sum_{i=1}^{M}\sum_{j=1}^{M} \mathbb{E}_{p(w)}(w'_j \hat{w}_i)\, x'_j \hat{x}_i \tag{27}$$
Since $\hat{w}$ and $w'$ are independent, $\mathbb{E}_{p(w)}(w'_j \hat{w}_i) = \mathbb{E}_{p(w)}(w'_j) \cdot \mathbb{E}_{p(w)}(\hat{w}_i) = \mu_w^2$. Thus,
$$\mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - w'^T x')^2\big) = D_w(\hat{x}) + D_w(x') - 2\mu_w^2 \sum_{i=1}^{M}\sum_{j=1}^{M} x'_j \hat{x}_i \tag{28}$$
When computing $\mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - \hat{w}^T x')^2\big)$, the weights in the two terms are not independent; in particular $\mathbb{E}_{p(w)}(\hat{w}_i \hat{w}_i) = \mu_w^2 + \sigma_w^2$, and therefore
$$\mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - \hat{w}^T x')^2\big) = D_w(\hat{x}) + D_w(x') - 2(\mu_w^2 + \sigma_w^2) \sum_{i=1}^{M}\sum_{j=1}^{M} x'_j \hat{x}_i \tag{29}$$
$$< D_w(\hat{x}) + D_w(x') - 2\mu_w^2 \sum_{i=1}^{M}\sum_{j=1}^{M} x'_j \hat{x}_i = \mathbb{E}_{p(w)}\big((\hat{w}^T\hat{x} - w'^T x')^2\big) \tag{31}$$

Corollary S.4. A randomly initialized DeepPipe with encoder layers induces the assumption that two hyperparameter configurations of the same algorithm should have more similar performance than hyperparameter configurations from different algorithms.

Proof. Given two hyperparameter configurations λ^(l), λ^(m) of the same algorithm and a third hyperparameter configuration λ^(n) from a different algorithm, every randomly initialized encoder layer of DeepPipe maps λ^(l), λ^(m) to latent representations z^(l), z^(m) that are, in expectation, closer to each other than to z^(n), i.e. E_{p(w)}(||z^(l) − z^(m)||) < E_{p(w)}(||z^(l) − z^(n)||), based on Proposition S.3. Since DeepPipe uses a stationary kernel, κ(x, x') = κ(x − x'), the similarity increases as the distance between two configurations decreases. Thus, according to Equation 2, their performances will be correlated.



Footnotes (in order of appearance):

- AutoML systems might select multiple algorithms in a stage; however, our solution trivially generalizes by decomposing stages into new sub-stages with only a subset of algorithms.
- The code is available in this repository: https://anonymous.4open.science/r/DeepPipe-3DDF
- https://github.com/ahmedmalaa/AutoPrognosis
- Specifically, the list of classifiers is: Random Forest, Extra Trees, Gradient Boosting, Logistic Regression, MLP, linear SVM, kNN, Decision Trees, AdaBoost, Bernoulli Naive Bayes, Gaussian Naive Bayes, Perceptron.
- We make our code available at https://anonymous.4open.science/r/DeepPipe-E19E
- https://github.com/rsheth80/pmf-automl
- https://github.com/udellgroup/oboe/tree/master/oboe/defaults/TensorOboe



estimates the validation loss f_* of a new pipeline configuration p_{λ^(*)} by computing the posterior mean E[f_*] and posterior variance V[f_*] as:

Figure 1: An example architecture for DeepPipe on a search space with 2 stages {Preprocessing, Classification}.

Figure 2: Comparison of DeepPipe vs. standard PO methods (Experiment 1)

Figure 4: Embeddings of pipelines produced by a randomly initialized DeepPipe (after applying t-SNE). The color indicates the active algorithm in the Estimation stage of the Tensor-OBOE meta-dataset.

Figure 5: Average rank for DeepPipe with and without encoder and aggregation layers.

Figure 6: Comparison with Structured Kernel Learning (SKL)

Figure 7: Comparison of the average rank for DeepPipe with different numbers of encoder layers under different percentages of meta-train data. The total number of layers is always the same. We ran the experiment for three values of F; the presented scores are the average ranks among the three DeepPipe configurations (row-wise).

Algorithm 1: DeepPipe Meta-Training
Input: learning rate η, meta-training data with T tasks H = ∪_{t=1,...,T} H^(t), number of epochs E, batch size b
Output: parameters θ and γ
1: Initialize θ and γ at random
2: for E epochs do
3:   for t ∈ {1, ..., T} do
4:     Sample batch B = {(p_λ^(t,i), y^(t,i))}_{i=1,...,b} ~ H^(t)
5:     Compute the negative log-likelihood L on B (objective function in Equation 5)
6:     θ_agg ← θ_agg − η ∇_{θ_agg} L
7:     θ_enc ← θ_enc − η ∇_{θ_enc} L
8:     γ ← γ − η ∇_γ L

Algorithm 2: Bayesian Optimization (BO) with DeepPipe
Input: learning rate η, initial observations H = {(p_λ^(i), y^(i))}_{i=1,...,I}, pre-trained surrogate with parameters θ and γ, number of surrogate updates E_Test, BO iterations E_BO, search space of pipelines P
Output: pipeline configuration p_λ^*
1: Function FineTune(H, γ, η, E_Test):
2:   for E_Test times do
3:     Compute the negative log-likelihood L on H (objective function in Equation 5 with T = 1)
4:     γ ← γ − η ∇_γ L
5:   return γ
6: Function BO(H, η, θ, γ, E_Test, E_BO):
7:   for E_BO times do
8:     γ ← FineTune(H, γ, η, E_Test)
9:     Compute p'_λ ∈ argmax_{p_λ ∈ P} EI(p_λ, γ, θ)
10:    Observe performance y' of pipeline p'_λ
11:    Add the new observation: H ← H ∪ {(p'_λ, y')}
12:  Compute the best pipeline index i* ∈ argmax_{i ∈ {1,...,|H|}} y_i and return p_λ^* = p_λ^(i*)
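To make the BO loop concrete, here is a self-contained toy sketch (our own illustration, not the authors' code): a GP surrogate with a fixed RBF kernel, no fine-tuning step, and Expected Improvement over a discretized 1-D search space standing in for P:

```python
# Minimal BO loop mirroring the structure of Algorithm 2 on a toy problem.
import numpy as np

def rbf(A, B, ls=0.2):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def posterior(X, y, Xq, noise=1e-6):
    # GP posterior mean and standard deviation at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Ks.T, np.linalg.solve(K, Ks))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def ei(mu, sigma, y_max):
    from math import erf, sqrt, pi
    u = (mu - y_max) / sigma
    cdf = 0.5 * (1 + np.vectorize(erf)(u / sqrt(2)))
    pdf = np.exp(-0.5 * u ** 2) / sqrt(2 * pi)
    return (mu - y_max) * cdf + sigma * pdf

f = lambda x: -(x - 0.3) ** 2           # toy response; maximum at x = 0.3
cand = np.linspace(0, 1, 101)           # discretized search space P
X = np.array([0.0, 1.0])                # initial observations H
y = f(X)
for _ in range(10):                     # E_BO iterations
    mu, sd = posterior(X, y, cand)
    x_next = cand[np.argmax(ei(mu, sd, y.max()))]   # acquisition maximizer
    X, y = np.append(X, x_next), np.append(y, f(x_next))
best = X[np.argmax(y)]
```

In DeepPipe, the candidates would be pipeline configurations, the kernel would act on the learned embeddings, and the kernel parameters γ would be fine-tuned at each iteration.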

Figure 8: Comparison of different F values in DeepPipe (Rank).

Figure 9: Comparison of different F values in DeepPipe (Regret).

Figure 10: Comparison of DeepPipe vs. non transfer-learning PO methods in Experiment 1 (Regret)

Figure 12: Ablations on the ZAP meta-dataset

Figure 15: Learnt representations in 2 dimensions for estimators (left) and dimensionality reducers (right) from the Tensor-OBOE meta-dataset.

Figure 16: Learnt representations for two tasks with different accuracy levels.

Lemma S.2. Consider a linear function with scalar output z = w^T x, where w ∈ R^{M×1} is the vector of weights with components w_i, i ∈ {1, ..., M}, and x ∈ R^{M×1} are the input features. Moreover, assume the weights are independently and identically distributed, w_i ~ p(w). The expected value of the squared output is given by E_{p(w)}(||w^T x||²) = (µ_w² + σ_w²) · ||x||² + 2µ_w² · (x ⊗ x), with x ⊗ x := Σ_{i=1}^{M} Σ_{j<i} x_i x_j.

Average Rank of Accuracy on OpenML Datasets

Average rank among DeepPipe variants for newly-added algorithms (Tensor-OBOE)

Average Number of Observed Pipelines on OpenML Datasets

Comparison with AutoPrognosis (columns: E_BO, Method, Avg. Rank, Std. Rank, Avg. Acc., Std. Acc., Avg. Time, Std. Time)







Average rank among DeepPipe variants for newly-added algorithms (PMF)

Figure 14: Example of the implementation of DeepPipe as an MLP. λ^(k)_{(i,j)} indicates the k-th hyperparameter of the j-th algorithm in the i-th stage. In this architecture, the first stage has two algorithms, thus two encoders; algorithm 1 is active in stage 1. The second stage has only one algorithm.

Search Space for PMF Meta-Dataset

Search Space for Tensor-OBOE Meta-Dataset

Search Space for ZAP Meta-Dataset

P META-DATASET PREPROCESSING

We obtained the raw data for the meta-datasets from the repositories of PMF (footnote 6) and Tensor-OBOE (footnote 7). The PMF repository provides an accuracy matrix, while Tensor-OBOE specifies the error. We take the pipelines' configurations and concatenate the hyperparameters in both meta-datasets. Then we proceed with the following steps: 1) one-hot encode the categorical hyperparameters, 2) apply a log transformation x_new = ln(x) to the hyperparameters whose values exceed 3 standard deviations, 3) scale all values to the range [0, 1]. The variables coming from categorical hyperparameters are named original-variable-name_category.

Q ABBREVIATIONS

(i) Abbreviations in Table 2: 1) ET: ExtraTrees, 2) GBT: Gradient Boosting, 3) Logit: Logistic Regression, 4) MLP: Multilayer Perceptron, 5) RF: Random Forest, 6) lSVM: Linear Support Vector Machine, 7) kNN: k-Nearest Neighbours, 8) DT: Decision Trees, 9) AB: AdaBoost, 10) GB/PE: Gaussian Naive Bayes/Perceptron.

(ii) Abbreviations in Table 3: 1) ET: ExtraTrees, 2) RF: Random Forest , 3) XGBT: Extreme Gradient Boosting, 4) kNN: K-Nearest Neighbours, 5) GB: Gradient Boosting, 6) DT: Decision Trees, 7) Q/LDA: Quadratic Discriminant Analysis/ Linear Discriminant Analysis, 8) NB: Naive Bayes.

R META-DATASET SEARCH SPACES

We detail the composition of the search spaces in Tables 6 and 7. We specify the stages, algorithms, hyperparameters, the number of components per stage M_i, the number of hyperparameters per algorithm |λ_{i,j}|, and the maximum number of hyperparameters found in an algorithm per stage, Q_i. For the ZAP meta-dataset, we define a pipeline with two stages: (i) Architecture, which specifies the type of architecture used (i.e. ResNet18, EfficientNet-B0, EfficientNet-B1, EfficientNet-B2), and (ii) Optimization-related Hyperparameters, which are shared by all architectures.

