OPENFE: AUTOMATED FEATURE GENERATION BEYOND EXPERT-LEVEL PERFORMANCE

Abstract

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidates. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features, and 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two well-known Kaggle competitions with thousands of participating data science teams. In one of the competitions, features generated by OpenFE with a simple baseline model beat 99.3% of the data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting. Code and datasets are in the supplementary materials.

1. INTRODUCTION

Feature generation is an important yet challenging task when applying machine learning methods to tabular data. Tabular data, where each row represents an instance and each column corresponds to a distinct feature, is ubiquitous in industrial applications and machine learning competitions. It has been well recognized that the quality of features has a significant impact on the learning performance of tabular data (Domingos, 2012). The goal of feature generation is to transform the base features into more informative ones that better describe the data and enhance the learning performance. For example, the Price-to-Earnings ratio (P/E ratio), calculated as (share price)/(earnings per share), is derived from the base features "share price" and "earnings per share" in financial statements and informs investors about the value of a company. In practice, data scientists typically use their domain knowledge to find useful feature transformations in a trial-and-error manner, which requires tremendous human labor and expertise. Since manual feature generation is time-consuming and requires case-by-case domain knowledge, automated feature generation emerges as an important topic in automated machine learning (Erickson et al., 2020; Lu, 2019).

Expand-and-reduce is arguably the most prevalent framework in automated feature generation, in which we first expand the candidate features and then eliminate redundant ones (Kanter & Veeramachaneni, 2015; Lam et al., 2021; Kaul et al., 2017; Shi et al., 2020; Katz et al., 2016). There are two challenges in a typical expand-and-reduce practice. First, the number of candidate features is usually huge in many industrial applications. Calculating all candidate features is not only computationally expensive but also infeasible due to the enormous amount of memory required.
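To make the combinatorial blow-up concrete, the following toy sketch enumerates candidate features from a handful of base features using a tiny operator pool. The feature names, values, and operator set here are invented for illustration and are not OpenFE's actual operator pool; with n base features and a mix of commutative and non-commutative binary operators, the candidate count already grows on the order of n² per operator, which is why materializing every candidate is infeasible at industrial scale.

```python
from itertools import permutations

# Illustrative base features for one company (hypothetical values).
row = {"share_price": 120.0, "earnings_per_share": 6.0, "book_value": 40.0}

def expand(row):
    """Expand step of expand-and-reduce: enumerate candidate features.

    Non-commutative operators (here: division) are applied to every
    ordered pair, commutative ones (here: addition) to every unordered
    pair, so even this tiny pool yields k * O(n^2) candidates.
    """
    out = {}
    for f1, f2 in permutations(sorted(row), 2):
        # division: both a/b and b/a are distinct candidates
        out[f"div({f1},{f2})"] = row[f1] / row[f2] if row[f2] else float("nan")
        if f1 < f2:  # addition is commutative; emit each unordered pair once
            out[f"add({f1},{f2})"] = row[f1] + row[f2]
    return out

candidates = expand(row)
print(len(candidates))  # 3 base features -> 6 divisions + 3 additions = 9
# The P/E ratio from the text appears as one of the enumerated candidates:
print(candidates["div(share_price,earnings_per_share)"])  # 120 / 6 = 20.0
```

The reduce step would then score these candidates and keep only the useful ones; the point of the sketch is only that the expand step alone scales quadratically in the number of base features per binary operator, before any higher-order or group-by transformations are considered.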
The second challenge is how to efficiently and accurately estimate the incremental performance of a new feature, i.e., how much performance improvement a new candidate feature can offer when added to the base feature set. The majority of existing methods rely on statistical tests to determine if a new feature should be included (Kanter & Veeramachaneni, 2015; Lam et al., 2021; Shi et al., 2020). However, statistically significant features do not always translate into good predictors (Lo et al., 2015). Features may be significantly correlated with the target simply for a

