OPENFE: AUTOMATED FEATURE GENERATION BEYOND EXPERT-LEVEL PERFORMANCE

Abstract

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving learning performance on tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidates. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features, and 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3% of the data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting. Code and datasets are in the supplementary materials.

1. INTRODUCTION

Feature generation is an important yet challenging task when applying machine learning methods to tabular data. Tabular data, where each row represents an instance and each column corresponds to a distinct feature, is ubiquitous in industrial applications and machine learning competitions. It has been well recognized that the quality of features has a significant impact on learning performance on tabular data (Domingos, 2012). The goal of feature generation is to transform the base features into more informative ones that better describe the data and enhance learning performance. For example, the price-to-earnings ratio (P/E ratio), calculated as (share price)/(earnings per share), is derived from the base features "share price" and "earnings per share" in financial statements and informs investors about the value of a company. In practice, data scientists typically use their domain knowledge to find useful feature transformations in a trial-and-error manner, which requires tremendous human labor and expertise. Since manual feature generation is time-consuming and requires case-by-case domain knowledge, automated feature generation has emerged as an important topic in automated machine learning (Erickson et al., 2020; Lu, 2019).

Expand-and-reduce is arguably the most prevalent framework in automated feature generation: we first expand the set of candidate features and then eliminate redundant ones (Kanter & Veeramachaneni, 2015; Lam et al., 2021; Kaul et al., 2017; Shi et al., 2020; Katz et al., 2016). A typical expand-and-reduce practice faces two challenges. First, the number of candidate features is usually huge in many industrial applications. Computing all candidate features is not only computationally expensive but may also be infeasible due to the enormous amount of memory required.
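The "expand" step of expand-and-reduce can be illustrated with a minimal sketch: apply transformation operators (here, ratio and product as two representative binary operators) to pairs of base features. The function name, operator set, and tiny dataset below are illustrative assumptions, not OpenFE's actual operator set or API; the P/E ratio from the text is reproduced as one of the generated candidates.

```python
# Illustrative sketch of the "expand" step in expand-and-reduce feature
# generation: cross base features with simple binary operators.
# Operator choices and this tiny dataset are hypothetical.

def expand_candidates(rows, base_features):
    """Generate candidate features by applying ratio and product
    operators to every pair of base features."""
    candidates = {}
    for i, a in enumerate(base_features):
        for b in base_features[i + 1:]:
            candidates[f"{a}/{b}"] = [
                (r[a] / r[b]) if r[b] != 0 else 0.0 for r in rows
            ]
            candidates[f"{a}*{b}"] = [r[a] * r[b] for r in rows]
    return candidates

# The P/E ratio example from the text: share price / earnings per share.
rows = [
    {"share_price": 120.0, "eps": 6.0},
    {"share_price": 45.0, "eps": 1.5},
]
cands = expand_candidates(rows, ["share_price", "eps"])
print(cands["share_price/eps"])  # [20.0, 30.0]
```

Even with only two operators, the candidate pool grows quadratically in the number of base features, which is why computing and storing all candidates quickly becomes infeasible.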
The second challenge is how to efficiently and accurately estimate the incremental performance of a new feature, i.e., how much performance improvement a new candidate feature can offer when added to the base feature set. The majority of existing methods rely on statistical tests to determine whether a new feature should be included (Kanter & Veeramachaneni, 2015; Lam et al., 2021; Shi et al., 2020). However, statistically significant features do not always translate into good predictors (Lo et al., 2015). A feature may be significantly correlated with the target only for a small group of instances, thereby leading to poor prediction over the whole population (Lo et al., 2015; Ward et al., 2010; Welch & Goyal, 2008). Besides, the effectiveness of a new feature may be subsumed by the base feature set, even if the new feature is significantly correlated with the target.

In this paper, we propose OpenFE, a powerful automated feature generation algorithm that can effectively generate useful features to enhance learning performance. First, motivated by the gap between significant features and good predictors, we propose a feature boosting method that directly estimates the predictive power of new features on top of the base feature set. Second, inspired by the crucial fact that effective features are usually sparse among the huge number of candidate features, we propose a two-stage evaluation framework. In the first stage, a successive featurewise pruning algorithm quickly eliminates redundant candidate features by dynamically allocating computing resources to promising ones. In the second stage, a feature importance attribution method ranks the remaining candidates by their contributions to the improvement in learning performance and further eliminates redundant ones. We validate OpenFE on various datasets and Kaggle competitions, where OpenFE outperforms existing baseline methods.
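The first-stage pruning follows a successive-halving pattern: score every candidate on a small fraction of the data, keep the better half, and re-score the survivors on twice as much data until all data is used. The sketch below is a minimal illustration of that control flow only, with a generic `score_fn` as a stand-in for estimating a candidate's incremental performance; it is not OpenFE's exact implementation.

```python
import math

def successive_halving(candidates, score_fn, n_samples, min_fraction=0.125):
    """Sketch of successive featurewise halving: score candidates on a
    small data fraction, keep the better half, then re-score survivors
    on twice as much data, until the full dataset is reached.

    `score_fn(name, n)` is a hypothetical stand-in that estimates a
    candidate's incremental performance on the first `n` samples;
    higher is better."""
    survivors = list(candidates)
    fraction = min_fraction
    while fraction < 1.0 and len(survivors) > 1:
        n = max(1, int(fraction * n_samples))
        scored = sorted(survivors, key=lambda c: score_fn(c, n), reverse=True)
        survivors = scored[: math.ceil(len(scored) / 2)]
        fraction *= 2
    return survivors

# Toy scores: pretend candidate "x3" is the only informative feature.
toy_scores = {"x0": 0.1, "x1": 0.2, "x2": 0.15, "x3": 0.9}
best = successive_halving(toy_scores, lambda c, n: toy_scores[c],
                          n_samples=1000)
print(best)  # ['x3']
```

The design point is resource allocation: most candidates are discarded after being scored on only a small sample, so the expensive full-data evaluation is spent only on the few promising ones.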
In a famous Kaggle competition with thousands of data science teams participating, the baseline model with features generated by OpenFE beats 99.3% of the 6,351 data science teams. More importantly, the features generated by OpenFE yield comparable or even larger performance improvements than those provided by the competition's top winners, demonstrating for the first time that automated feature generation is competitive against machine learning experts.

In addition to proposing a novel method, this paper addresses two important problems that hinder research progress in automated feature generation. The first problem is that the majority of existing methods are evaluated on different datasets, and these studies do not open-source their code and datasets, which prevents new research from conducting fair comparisons. To facilitate fair comparisons in future research, we reproduce the main methods for automated feature generation and validate our reproduction by comparing the reproduced results with those in the corresponding papers. We will open-source the code and datasets (see supplementary materials). The second problem is the lack of evidence regarding the necessity of feature generation in the era of deep learning. Deep neural networks (DNNs) are widely recognized for their ability to extract feature representations. In recent years, a variety of DNNs have been carefully developed for modeling tabular data (Arık & Pfister, 2021; Gorishniy et al., 2021), and several of them have demonstrated their efficiency in feature interaction learning (Song et al., 2019; Wang et al., 2021). We extensively evaluate the effect of OpenFE on a variety of DNNs and demonstrate that generating new transformed features with OpenFE can further enhance the learning performance of existing DNN architectures.
In addition to the empirical results, we provide a theoretical justification of our feature generation procedure by presenting a simple yet representative transductive learning setting in which feature generation has provable benefits. We summarize the contributions of our paper as follows:

• We propose a novel automated feature generation method that can effectively identify useful new features to enhance learning performance. Extensive experiments show that OpenFE achieves state-of-the-art results on seven benchmark datasets. More importantly, we demonstrate for the first time that OpenFE is competitive against human experts in feature generation.

• We facilitate future research in feature generation by: 1) reproducing the main methods for automated feature generation and releasing the code and datasets, and 2) providing empirical and theoretical evidence that feature generation is a crucial component in modeling tabular data, even in the era of deep learning.



https://www.kaggle.com/competitions/ieee-fraud-detection/overview



For a given training dataset D, we split it into a sub-training set D_tr and a validation set D_vld. Assume D consists of a feature set T + S, where T is the base feature set and S is the generated feature set. We use a learning algorithm L to learn a model L(D_tr, T + S) and compute the evaluation metric E(L(D_tr, T + S), D_vld, T + S) to measure the model performance, with a larger value of E indicating better performance.
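Under this setup, the incremental performance of a candidate feature s is the change in E when s is added to T. The sketch below is a minimal illustration of the underlying idea of boosting a candidate on top of a fixed base model, assuming a one-variable least-squares correction fit to the base model's training residuals and negative squared error as E; the function name and this simplification are ours, not OpenFE's exact algorithm.

```python
def incremental_score(base_pred_tr, y_tr, cand_tr,
                      base_pred_vld, y_vld, cand_vld):
    """Estimate a candidate feature's incremental performance:
    fit a one-variable least-squares correction to the base model's
    training residuals, then measure how much it reduces squared
    error on the validation split. A positive return value means the
    candidate helps on top of the base feature set."""
    resid = [y - p for y, p in zip(y_tr, base_pred_tr)]
    mx = sum(cand_tr) / len(cand_tr)
    mr = sum(resid) / len(resid)
    cov = sum((x - mx) * (r - mr) for x, r in zip(cand_tr, resid))
    var = sum((x - mx) ** 2 for x in cand_tr)
    beta = cov / var if var > 0 else 0.0
    alpha = mr - beta * mx

    def mse(preds):
        return sum((y - p) ** 2 for y, p in zip(y_vld, preds)) / len(y_vld)

    base_mse = mse(base_pred_vld)
    new_mse = mse([p + alpha + beta * x
                   for p, x in zip(base_pred_vld, cand_vld)])
    return base_mse - new_mse

# Toy example: the base model predicts the training mean (2.5), and the
# candidate feature equals the target, so it explains all the residual.
delta = incremental_score([2.5] * 4, [1.0, 2.0, 3.0, 4.0],
                          [1.0, 2.0, 3.0, 4.0],
                          [2.5, 2.5], [5.0, 6.0], [5.0, 6.0])
print(delta)  # 9.25
```

Scoring candidates against residuals of a fixed base model, rather than retraining L(D_tr, T + {s}) from scratch for every candidate, is what makes evaluating a huge candidate pool tractable.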

