LEARNING A DATA-DRIVEN POLICY NETWORK FOR PRE-TRAINING AUTOMATED FEATURE ENGINEERING

Abstract

Feature engineering is widely acknowledged to be pivotal in tabular data analysis and prediction. Automated feature engineering (AutoFE) emerged to automate this process, which is conventionally managed by experienced data scientists and engineers. In this area, most, if not all, prior work adopted an identical framework from the neural architecture search (NAS) method. While feasible, we posit that the NAS framework contradicts the way human experts cope with the data, since the inherent Markov decision process (MDP) setup differs. We point out that its data-unobserved setup results in an inability to generalize across different datasets as well as high computational cost. This paper proposes a novel AutoFE framework, Feature Set Data-Driven Search (FETCH), a pipeline mainly for feature generation and selection. Notably, FETCH is built on a brand-new data-driven MDP setup that uses the tabular dataset as the state fed into the policy network. Further, we posit that the crucial merit of FETCH is its transferability: the resulting policy network, trained on a variety of datasets, is capable of performing feature engineering on unseen data without requiring additional exploration. This is a pioneering attempt to build a tabular data pre-training paradigm via AutoFE. Extensive experiments show that FETCH systematically surpasses the current state-of-the-art AutoFE methods and validate the transferability of AutoFE pre-training.

1. INTRODUCTION

Tabular data, also known as structured data, abound in the extensive application of database management systems. Modeling tabular data with machine learning (ML) models has greatly influenced numerous domains, such as advertising (Evans, 2009), business intelligence (Quamar et al., 2020; Zhang et al., 2020), risk management (Babaev et al., 2019), drug analysis (Vamathevan et al., 2019), etc. As with other data forms such as images or text, building a proper representation for tabular data is crucial for guaranteeing decent system-wide performance. For tabular data, this process is known as feature engineering (FE), which was conventionally conducted by highly experienced human experts. As many empirical studies show (Heaton, 2016), FE almost always serves as a necessary prerequisite step in ML modeling pipelines. The recent advances in reinforcement learning (RL) have provided a new possibility for automated feature engineering (AutoFE) and automated machine learning (AutoML). Neural architecture search (NAS) (Zoph & Le, 2016), which is based on an RL setup dedicated to searching for previously undesigned neural network architectures with excellent performance, has nearly become a synonym for AutoML in the field of computer vision. As for tabular data, a series of well-known open-source packages (such as TPOT (Olson & Moore, 2016), AutoSklearn (Feurer et al., 2015) and AutoGluon (Erickson et al., 2020)) claim to implement the AutoML pipeline. However, they generally do not cover AutoFE, especially feature construction and selection, which is supposed to be part of AutoML as shown in Figure 1. To date, AutoFE has been a significant and non-negligible compo-

