DeepPipe: DEEP, MODULAR AND EXTENDABLE REPRESENTATIONS OF MACHINE LEARNING PIPELINES

Abstract

Finding accurate Machine Learning pipelines is essential in achieving state-of-the-art AI predictive performance. Unfortunately, most existing Pipeline Optimization techniques rely on flavors of Bayesian Optimization that do not explore the deep interaction between pipeline stages/components (e.g. between the hyperparameters of the deployed preprocessing algorithm and the hyperparameters of a classifier). In this paper, we are the first to capture the deep interaction between components of a Machine Learning pipeline. We propose embedding pipelines into a deep latent representation through a novel per-component encoder mechanism. Such pipeline embeddings are used with deep-kernel Gaussian Process surrogates inside a Bayesian Optimization setup. Through extensive experiments on three large-scale meta-datasets, including Deep Learning pipelines for computer vision, we demonstrate that learning pipeline embeddings achieves state-of-the-art results in Pipeline Optimization.

1. INTRODUCTION

Machine Learning (ML) has proven to be successful in a wide range of tasks such as image classification, natural language processing, and time series forecasting. In a supervised learning setup, practitioners need to define a sequence of stages comprising algorithms that transform the data (e.g. imputation, scaling) and produce an estimation (e.g. through a classifier or regressor). The selection of the algorithms and their hyperparameters, known as Pipeline Optimization (Olson & Moore, 2016) or pipeline synthesis (Liu et al., 2020; Drori et al., 2021), is challenging. Firstly, the search space contains conditional hyperparameters, as only some of them are active depending on the selected algorithms. Secondly, this space is arguably bigger than the one for a single algorithm. Consequently, previous work demonstrates how this pipeline search can be automated while achieving competitive results (Feurer et al., 2015; Olson & Moore, 2016). Some of these approaches include Evolutionary Algorithms (Olson & Moore, 2016), Reinforcement Learning (Rakotoarison et al., 2019; Drori et al., 2021), or Bayesian Optimization (Feurer et al., 2015; Thornton et al., 2012; Alaa & van der Schaar, 2018).

Pipeline Optimization (PO) techniques need to capture the complex interaction between the algorithms of a Machine Learning pipeline and their hyperparameter configurations. Unfortunately, no prior method uses Deep Learning to encapsulate the interaction between pipeline components. Prior work trains performance predictors (a.k.a. surrogates) on the concatenated hyperparameter space of all algorithms (the raw search space), for instance, using random forests (Feurer et al., 2015) or finding groups of hyperparameters to use in kernels with additive structure (Alaa & van der Schaar, 2018). On the other hand, transfer learning has been shown to decisively improve PO by transferring efficient pipelines evaluated on other datasets (Fusi et al., 2018; Yang et al., 2019; 2020).
Our method is the first to introduce a deep pipeline representation that is meta-learned to achieve state-of-the-art results in terms of the quality of the discovered pipelines. We introduce DeepPipe, a neural network architecture for embedding pipeline configurations in a latent space. Such deep representations are combined with Gaussian Processes (GP) for tuning pipelines with Bayesian Optimization (BO). We exploit the knowledge of the hierarchical search space of pipelines by mapping the hyperparameters of every algorithm through per-algorithm encoders to a hidden representation, followed by a fully connected network that receives the concatenated representations as input. Additionally, we show that meta-learning this network through evaluations on auxiliary tasks improves the quality of BO. Experiments on three large-scale meta-datasets show that our method achieves the new state-of-the-art in Pipeline Optimization. Our contributions are as follows:

• We introduce DeepPipe, a surrogate for BO that achieves peak performance when optimizing a pipeline for a new dataset through transfer learning.
• We present a novel and modular architecture that applies different encoders per stage and yields better generalization in low meta-data regimes, i.e. few/no auxiliary tasks.
• We conduct extensive evaluations against seven baselines on three large meta-datasets, and we further compare against rival methods on OpenML datasets to assess their performance under time constraints.
• We demonstrate that our pipeline representation helps achieve state-of-the-art results in optimizing pipelines for fine-tuning deep computer vision networks.
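The per-algorithm-encoder architecture described above can be sketched in a few lines. The following is a minimal, untrained NumPy illustration, not the paper's actual implementation; the stage dimensionalities, hidden sizes, and embedding size are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim, hidden=16):
    """A single-hidden-layer MLP with random (untrained) weights."""
    W1 = rng.standard_normal((in_dim, hidden)); b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, out_dim)); b2 = np.zeros(out_dim)
    def forward(x):
        h = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation
        return h @ W2 + b2
    return forward

# Hypothetical two-stage search space: a preprocessor with 3 hyperparameters
# and an estimator with 5 hyperparameters. Each stage gets its own encoder.
stage_dims = [3, 5]
enc_dim = 8  # per-stage hidden representation size (hypothetical)
encoders = [mlp(d, enc_dim) for d in stage_dims]

# Fully connected aggregator over the concatenated per-stage representations.
aggregator = mlp(enc_dim * len(stage_dims), 10)

def embed_pipeline(stage_configs):
    """Map per-stage hyperparameter vectors to one deep pipeline embedding."""
    parts = [enc(cfg) for enc, cfg in zip(encoders, stage_configs)]
    return aggregator(np.concatenate(parts))

z = embed_pipeline([rng.random(3), rng.random(5)])
print(z.shape)  # (10,)
```

In the full method, such an embedding would be fed into a deep-kernel GP surrogate and the encoder weights meta-learned on auxiliary tasks; here the weights are simply random to keep the sketch self-contained.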

2. RELATED WORK

Full Model Selection (FMS) is also referred to as Combined Algorithm Selection and Hyperparameter optimization (CASH) (Hutter et al., 2019; Feurer et al., 2015). FMS aims to find the best model and its respective hyperparameter configuration (Hutter et al., 2019). A common approach is to use Bayesian Optimization with surrogates that can handle conditional hyperparameters, such as Random Forests (Feurer et al., 2015), tree-structured Parzen estimators (Thornton et al., 2012), or ensembles of neural networks (Schilling et al., 2015).

Pipeline Optimization (PO) is a generalization of FMS where the goal is to find the algorithms and their hyperparameters for the different stages of a Machine Learning pipeline. Alaa & van der Schaar (2018) use additive kernels on a Gaussian Process surrogate to search pipelines with BO: they group the algorithms into clusters and fit their hyperparameters with independent Gaussian Processes, achieving an effectively lower input dimensionality. By formulating Pipeline Optimization as a constrained optimization problem, Liu et al. (2020) introduce a method based on the alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976).

Transfer learning for Pipeline Optimization and CASH leverages information from previous (auxiliary) task evaluations. A few approaches use dataset meta-features to warm-start BO with good configurations from other datasets (Feurer et al., 2015; Alaa & van der Schaar, 2018). As extracting meta-features demands computational time, follow-up works build a portfolio from these auxiliary tasks (Feurer et al., 2020). Another popular approach is to use collaborative filtering on a matrix of pipeline-versus-task evaluations to learn latent embeddings of pipelines. OBOE obtains the embeddings by applying a QR decomposition of the matrix in a time-constrained formulation (Yang et al., 2019). By recasting the matrix as a tensor, Tensor-OBOE (Yang et al., 2020) finds latent representations via the Tucker decomposition.
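The additive-kernel idea mentioned above can be illustrated with a toy sketch: the covariance is a sum of RBF kernels, each restricted to one group of hyperparameters, so each summand only has to model a low-dimensional slice of the search space. The grouping below is hypothetical and not taken from any cited method:

```python
import numpy as np

def rbf(x, y, lengthscale=1.0):
    """Standard RBF (squared-exponential) kernel on a sub-vector."""
    d = x - y
    return np.exp(-0.5 * np.dot(d, d) / lengthscale**2)

# Hypothetical grouping: hyperparameters 0-2 belong to the preprocessor,
# 3-6 to the estimator. The kernel is a sum of per-group RBF kernels.
groups = [slice(0, 3), slice(3, 7)]

def additive_kernel(x, y):
    return sum(rbf(x[g], y[g]) for g in groups)

x, y = np.ones(7), np.zeros(7)
print(additive_kernel(x, y))  # exp(-1.5) + exp(-2.0) ≈ 0.358
```

Each group contributes independently to the similarity, which is what yields the "effectively lower dimensionality per input" discussed above.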
Furthermore, Fusi et al. (2018) apply probabilistic matrix factorization to find latent pipeline representations. Subsequently, they use the latent representations as inputs to a Gaussian Process and explore the search space using BO.
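As a toy illustration of matrix-factorization-based pipeline embeddings in the spirit of these methods: a low-rank factorization of a pipeline-by-task performance matrix yields one latent vector per pipeline. This sketch uses a plain truncated SVD on synthetic data, not any cited method's actual procedure (OBOE uses QR, Fusi et al. a probabilistic factorization):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical performance matrix: rows are pipelines, columns are tasks.
P = rng.random((20, 8))

k = 3  # latent dimensionality (hypothetical)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
pipeline_emb = U[:, :k] * s[:k]  # one k-dim latent vector per pipeline
task_emb = Vt[:k, :].T           # one k-dim latent vector per task

# The low-rank reconstruction predicts performance for unevaluated pairs.
P_hat = pipeline_emb @ task_emb.T
print(np.abs(P - P_hat).mean())
```

The latent `pipeline_emb` vectors play the role of the pipeline representations that are then fed to a GP surrogate for BO.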

3.1. PIPELINE OPTIMIZATION

The pipeline of an ML system consists of a sequence of N stages (e.g. dimensionality reducer, standardizer, encoder, estimator (Yang et al., 2020)). At each stage i ∈ {1, . . . , N} a pipeline includes one algorithm¹ from a set of M_i choices (e.g. the estimator stage can include the algorithms



¹ AutoML systems might select multiple algorithms in a stage; however, our solution trivially generalizes by decomposing stages into new sub-stages with only a subset of algorithms.
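A common way to represent such a conditional stage as a flat vector is a one-hot algorithm indicator plus zero-imputed hyperparameters for the inactive algorithms. A minimal sketch with a hypothetical two-algorithm estimator stage (the algorithm and hyperparameter names are illustrative only):

```python
# Hypothetical estimator stage with two algorithm choices; only the selected
# algorithm's hyperparameters are active, the rest are zero-imputed.
STAGE = {
    "svm": ["C", "gamma"],
    "rf":  ["n_estimators", "max_depth"],
}

def encode_stage(algorithm, hyperparams):
    """Flat vector: per-algorithm indicator followed by its hyperparameters."""
    vec = []
    for name, hp_names in STAGE.items():
        vec.append(1.0 if name == algorithm else 0.0)  # one-hot indicator
        for hp in hp_names:
            # Active algorithm keeps its values; inactive ones are zeroed.
            vec.append(hyperparams.get(hp, 0.0) if name == algorithm else 0.0)
    return vec

x = encode_stage("svm", {"C": 1.0, "gamma": 0.1})
print(x)  # [1.0, 1.0, 0.1, 0.0, 0.0, 0.0]
```

Per-stage vectors of this form are what a per-stage encoder would consume before the representations are concatenated and aggregated.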

