REAL-TIME AUTOML

Abstract

We present a new zero-shot approach to automated machine learning (AutoML) that predicts a high-quality model for a supervised learning task and dataset in real time without fitting a single model. In contrast, most AutoML systems require tens or hundreds of model evaluations. Hence, our approach accelerates AutoML by orders of magnitude. Our method uses a transformer-based language embedding to represent datasets and algorithms using their free-text descriptions and a meta-feature extractor to represent the data. We train a graph neural network in which each node represents a dataset to predict the best machine learning pipeline for a new test dataset. The graph neural network generalizes to new datasets and new sets of datasets. Our approach leverages the progress of unsupervised representation learning in natural language processing to provide a significant boost to AutoML. Performance is competitive with state-of-the-art AutoML systems while reducing running time from minutes to seconds and prediction time from minutes to milliseconds, providing AutoML in real time.

1. INTRODUCTION

A data scientist facing a challenging new supervised learning task does not generally invent a new algorithm. Instead, they consider what they know about the dataset and which algorithms have worked well for similar datasets in past experience. Automated machine learning (AutoML) seeks to automate these tasks to enable widespread use of machine learning by non-experts. A major challenge is to develop fast, efficient algorithms to accelerate applications of machine learning (Kokiopoulou et al., 2019). This work develops automated solutions that exploit human expertise to learn which datasets are similar and which algorithms perform best. We use a transformer-based language model (Devlin et al., 2018), allowing our AutoML system to process text descriptions of datasets and algorithms, and a feature extractor (BYU-DML, 2019) to represent the data itself. Using such pre-trained models for our representations brings in knowledge from large-scale data. We train our model on the solutions of other existing AutoML systems, specifically AutoSklearn (Feurer et al., 2015), AlphaD3M (Drori et al., 2018), OBOE (Yang et al., 2019), and TPOT (Olson & Moore, 2019), tapping into their diverse sets of solutions. Our approach fuses these representations (dataset description, data, and AutoML pipeline descriptions) and represents datasets as nodes in a graph of datasets. Generally, graph neural networks are used for three main tasks: (i) node prediction, (ii) link prediction, and (iii) sub-graph or entire-graph classification. In this work we use a GNN for node prediction, predicting the machine learning pipeline for an unseen dataset. Specifically, we use a graph attention network (GAT) (Veličković et al., 2018) with neighborhood aggregation, in which an attention function adaptively controls the contribution of each neighbor.
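To make the attention-based neighborhood aggregation concrete, the following is a minimal numpy sketch of a single GAT-style attention head over a graph of dataset nodes. The feature dimensions, weights, and fully connected toy graph are illustrative placeholders, not the paper's actual architecture or parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_aggregate(h, adj, W, a):
    """One attention head: h (n, d) node features, adj (n, n) 0/1
    adjacency with self-loops, W (d, d') projection, a (2*d',) scorer."""
    z = h @ W                                  # project node features
    out = np.zeros_like(z)
    for i in range(z.shape[0]):
        nbrs = np.flatnonzero(adj[i])
        # unnormalized attention score for each neighbor j of node i
        scores = np.array([a @ np.concatenate([z[i], z[j]]) for j in nbrs])
        alpha = softmax(np.maximum(scores, 0.2 * scores))  # LeakyReLU, then softmax
        out[i] = alpha @ z[nbrs]               # attention-weighted neighbor sum
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                    # 4 dataset nodes, 8 features each
adj = np.ones((4, 4))                          # small fully connected toy graph
out = gat_aggregate(h, adj, rng.normal(size=(8, 8)), rng.normal(size=16))
print(out.shape)                               # (4, 8)
```

Because the attention weights are computed from the node features themselves, each dataset node adaptively decides how much each neighboring dataset contributes to its updated representation.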
An advantage of using a GNN for AutoML is that it boosts performance by sharing information, including descriptions and algorithms, between datasets (graph nodes) through message passing. In addition, GNNs generalize well to a new, unseen dataset using the aggregated weights learned over the training datasets; the GNN weights are shared with the test dataset for prediction. GNNs also generalize to entirely new sets of datasets. Finally, prediction occurs in real time, within milliseconds. A simple idea is to use machine learning pipelines that performed well (for the same task) on similar datasets. What constitutes a similar dataset? The success of an AutoML system often hinges on this question, and different frameworks have different answers: for example, AutoSklearn (Feurer et al., 2015) computes a set of meta-features, which are features describing the data features, for each dataset, while OBOE (Yang et al., 2019) uses the performance of a few fast, informative models to compute latent features. More generally, for any supervised learning task, one can view the list of recommended algorithms generated by any AutoML system as a vector describing that task. This work is the first to use the information that a human would check first: a summary description of the dataset and algorithms, written in free text. These dataset features induce a metric structure on the space of datasets. Under an ideal metric, a model that performs well on one dataset would also perform well on nearby datasets. The methods we develop in this work show how to learn such a metric using the recommendations of an AutoML framework together with the dataset description. We provide a new zero-shot AutoML method that predicts accurate machine learning pipelines for an unseen dataset and classification task in real time and runs the pipeline in a few seconds.
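The "nearby datasets" idea can be sketched in a few lines: represent each dataset as a feature vector and recommend the pipeline that worked best on the nearest training dataset. The 2-D vectors and pipeline names below are fabricated placeholders standing in for the fused description-embedding and meta-feature representation, and plain Euclidean distance stands in for the learned metric.

```python
import numpy as np

def recommend(query, train_feats, train_pipelines):
    """Return the pipeline of the training dataset nearest to `query`
    (the role an ideal learned metric on datasets would play)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    return train_pipelines[int(np.argmin(dists))]

train_feats = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_pipelines = ["gradient_boosting", "random_forest", "linear_svm"]

pipeline = recommend(np.array([0.9, 1.2]), train_feats, train_pipelines)
print(pipeline)  # "random_forest": the nearest training dataset is the second one
```

Under an ideal metric, this nearest-neighbor lookup would already yield a strong pipeline; the GNN described next goes further by aggregating information from the whole neighborhood rather than a single nearest dataset.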
We use a transformer-based language model to embed the descriptions of the dataset and pipelines, and a feature extractor to compute meta-features from the data. Based on the description embedding and meta-features, we build a graph that serves as input to a graph neural network (GNN). Each dataset is represented as a node in the graph, together with its corresponding feature vector. The GNN is trained to predict a machine learning pipeline for a new node (dataset). Therefore, given a new dataset, our real-time AutoML method predicts a pipeline with good performance within milliseconds. The running time of our predicted pipeline is a few seconds, and the accuracy of the predicted pipeline is competitive with state-of-the-art AutoML methods that are given one minute. This work makes several contributions by using language embeddings and GNNs for AutoML for the first time, and by leveraging existing AutoML systems. The result is a real-time, high-quality AutoML system.

Real-time. Our system predicts a machine learning pipeline for a new dataset in milliseconds and then runs the pipeline and tunes its hyper-parameters within three seconds. This reduces computation time by orders of magnitude compared with state-of-the-art AutoML systems, while improving performance.

GNN architecture. Our work achieves real-time AutoML by introducing several architectural components that are new to AutoML. These include embeddings for dataset descriptions and algorithm descriptions using a state-of-the-art transformer-based language model, in addition to (standard) embeddings for data; a non-Euclidean embedding of datasets as a graph; and a predictive model employing a GNN on the graph of datasets. Importantly, the GNN recommends a pipeline for a new dataset by adding a node to the graph of datasets and sharing the GNN weights with the new node. Using the information and relationships between all datasets boosts AutoML performance.

Embeddings.
Bringing techniques from NLP to AutoML, specifically using a large-scale transformer-based language model to embed the descriptions of the dataset and algorithms, brings in information from a large corpus of text. This allows our zero-shot AutoML to train on a small set of datasets while achieving state-of-the-art test-set performance.

Leveraging existing AutoML systems. Our flexible architecture can use pipeline recommendations from any number of other AutoML systems to improve performance.
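The graph construction described above can be sketched as follows: fuse each dataset's description embedding with its meta-features into a node feature vector, then connect each dataset to its k nearest neighbors in that feature space. The random vectors stand in for real text embeddings and meta-features, and k-nearest-neighbor adjacency is one plausible choice of graph structure, not necessarily the paper's exact construction.

```python
import numpy as np

def build_knn_graph(desc_emb, meta_feats, k=2):
    """desc_emb (n, d1) text embeddings, meta_feats (n, d2) meta-features.
    Returns fused node features and a symmetric 0/1 adjacency matrix."""
    feats = np.concatenate([desc_emb, meta_feats], axis=1)   # fuse representations
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self from neighbor search
    adj = np.zeros_like(d)
    for i in range(len(feats)):
        adj[i, np.argsort(d[i])[:k]] = 1.0      # connect to k nearest datasets
    return feats, np.maximum(adj, adj.T)        # symmetrize the edges

rng = np.random.default_rng(1)
feats, adj = build_knn_graph(rng.normal(size=(5, 4)),   # 5 datasets, 4-dim embeddings
                             rng.normal(size=(5, 3)))   # 3 meta-features each
print(feats.shape, adj.shape)   # (5, 7) (5, 5)
```

A new test dataset is handled the same way: its fused feature vector is computed, it is attached to its nearest neighbors in the graph, and the trained GNN weights are applied to the new node to predict its pipeline.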

2. RELATED WORK

AutoML is an emerging field of machine learning with the potential to transform the practice of data science by automatically choosing a model to best fit the data. Several comprehensive surveys of the field are available (He et al., 2019; Zöller & Huber, 2019).

Processing each dataset in isolation. The most straightforward approach to AutoML considers each dataset in isolation and asks how to choose the best hyper-parameter settings for a given algorithm. While the most popular method is still grid search, other more efficient approaches include Bayesian optimization (Snoek et al., 2012) and random search (Solis & Wets, 1981).

Recommender systems. These methods learn (often, exhaustively) which algorithms and hyper-parameter settings performed best for a training set of datasets and use this information to select better algorithms on a test set without exhaustive search. This approach reduces the time required to find a good model. An example is OBOE (Yang et al., 2019; 2020), which fits a low-rank model to learn the low-dimensional representations for the models (or pipelines) and datasets that best predict the cross-validated errors, among all bilinear models. To find promising models for a new dataset,

