LEARNING A DATA-DRIVEN POLICY NETWORK FOR PRE-TRAINING AUTOMATED FEATURE ENGINEERING

Abstract

Feature engineering is widely acknowledged to be pivotal in tabular data analysis and prediction. Automated feature engineering (AutoFE) emerged to automate this process, which is conventionally managed by experienced data scientists and engineers. In this area, most (if not all) prior work adopted an identical framework from the neural architecture search (NAS) method. While this is feasible, we posit that the NAS framework contradicts the way human experts cope with the data, since the inherent Markov decision process (MDP) setup differs. We point out that its data-unobserved setup consequently results in an inability to generalize across different datasets as well as high computational cost. This paper proposes a novel AutoFE framework, Feature Set Data-Driven Search (FETCH), a pipeline mainly for feature generation and selection. Notably, FETCH is built on a brand-new data-driven MDP setup that uses the tabular dataset as the state fed into the policy network. Further, we posit that the crucial merit of FETCH is its transferability: the resulting policy network, trained on a variety of datasets, is capable of performing feature engineering on unseen data without requiring additional exploration. This is a pioneering attempt to build a tabular data pre-training paradigm via AutoFE. Extensive experiments show that FETCH systematically surpasses the current state-of-the-art AutoFE methods and validate the transferability of AutoFE pre-training.

1. INTRODUCTION

Tabular data, also known as structured data, abound in the extensive application of database management systems. Modeling tabular data with machine learning (ML) models has greatly influenced numerous domains, such as advertising (Evans, 2009), business intelligence (Quamar et al., 2020; Zhang et al., 2020), risk management (Babaev et al., 2019), and drug analysis (Vamathevan et al., 2019). As with other data forms such as images or text, building a proper representation for tabular data is crucial for guaranteeing decent system-wide performance. In this regime, this process is known as feature engineering (FE), which was conventionally conducted by highly experienced human experts. In other words, as many empirical studies show (Heaton, 2016), FE almost always serves as a necessary prerequisite step in ML modeling pipelines.

The recent advances in reinforcement learning (RL) have provided a new possibility for automated feature engineering (AutoFE) and automated machine learning (AutoML). Neural architecture search (NAS) (Zoph & Le, 2016) has nearly become a synonym for AutoML in the field of computer vision, based on an RL setup dedicated to searching for undesigned neural network architectures with excellent performance. As for tabular data, a series of well-known open-source packages, such as TPOT (Olson & Moore, 2016), AutoSklearn (Feurer et al., 2015) and AutoGluon (Erickson et al., 2020), claim to implement the AutoML pipeline. However, they do not generally cover AutoFE, especially feature construction and selection, which is supposed to be part of AutoML as shown in Figure 1. To date, AutoFE has been a significant and non-negligible component [...] extends its NAS-like setup to a differentiable one. However, a data scientist or engineer usually tends to investigate the data first (for example, by analyzing its distribution, identifying outliers, and measuring the correlation between columns) and then proposes an FE plan.
They may further use the derived plan to test the prediction performance and repeat this process in light of the evaluated score. Meanwhile, they can also accumulate knowledge to accelerate decision-making. As we scrutinize these works, we posit that existing NAS-like AutoFE frameworks on tabular data have two shortcomings that deviate largely from how human experts cope with the data. First, they are stuck with the data-unobserved paradigm: their policy network does not see the tabular data itself and therefore proposes data-unrelated FE plans. Second, the inherent data-unobserved setup makes them lack transferability: they cannot borrow knowledge from previous training experience to speed up the exploration process when facing a completely new dataset.

Based on the above discussion, this paper aims to bridge the methodology gap between human experts and data-unobserved methods for AutoFE and to validate the feasibility of doing so. In particular, we establish a new form of MDP setup where the state is defined simply as a processed dataset drawn from its original counterpart. The resulting policy network is a succinct mapping from the input data table directly to its (sub)optimal feature engineering action plan. To this end, we present FEature SET DaTa-Driven SearCH (FETCH), a brand-new RL-based framework for AutoFE with a completely distinct data-driven MDP setup that emulates human experts. As shown in Figure 2, FETCH outputs FE actions designed for the input data and iteratively constructs more appropriate actions based on the newly generated data. In contrast, traditional data-unobserved methods only take in the number of features to be processed and iteratively update with the sequence of past actions. Thanks to these design principles, another favorable by-product of FETCH is that it enables transferability through pre-training for the AutoFE workflow.
Simply put, we validate that FETCH can be pre-trained in a collaborative manner where we feed multiple tabular datasets and maximize





Figure 2: The difference between the data-driven FETCH and the data-unobserved approach. See text for details.
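
The contrast between the two MDP setups can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two-action set, the function names, and in particular `toy_policy`, a hand-written skewness rule that stands in for the learned policy network. The sketch only shows what each kind of state exposes to the policy and how a data-driven policy re-observes the table after every step.

```python
import numpy as np

# Hypothetical action set for illustration; the paper's operator set is not given here.
ACTIONS = ("keep", "log")


def data_unobserved_state(table, past_actions):
    """State of a NAS-like controller: feature count and action history, no data values."""
    return {"n_features": table.shape[1], "past_actions": tuple(past_actions)}


def data_driven_state(table):
    """State in the data-driven setup: the (processed) table itself."""
    return table


def toy_policy(table):
    """Stand-in for a policy network (assumed rule, not the paper's model).

    Chooses "log" for strictly positive, strongly right-skewed columns, else "keep".
    """
    actions = []
    for col in table.T:
        std = col.std()
        centered = col - col.mean()
        skew = 0.0 if std == 0 else float((centered ** 3).mean() / std ** 3)
        actions.append("log" if col.min() > 0 and skew > 1.0 else "keep")
    return actions


def apply_actions(table, actions):
    """Build the next table by applying one action per column."""
    cols = [np.log(col) if act == "log" else col for col, act in zip(table.T, actions)]
    return np.column_stack(cols)


def rollout(table, steps=2):
    """Iterate: observe the current table, choose actions, generate the new table."""
    history = []
    for _ in range(steps):
        actions = toy_policy(data_driven_state(table))
        table = apply_actions(table, actions)
        history.append(actions)
    return table, history


# Two tables with the same shape but different value distributions.
table_a = np.array([[1, 1], [2, 1], [3, 1], [4, 1], [5, 100]], dtype=float)
table_b = np.array([[1, -2], [2, -1], [3, 0], [4, 1], [5, 2]], dtype=float)

print(data_unobserved_state(table_a, []) == data_unobserved_state(table_b, []))
print(toy_policy(table_a), toy_policy(table_b))
print(rollout(table_a)[1])
```

The two tables are indistinguishable under the data-unobserved state because they have the same number of columns and the same empty action history, whereas the data-driven stand-in policy chooses different actions once it sees the values. The rollout also shows the second decision being made on the newly generated table rather than on the original one.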

