OPENFE: AUTOMATED FEATURE GENERATION BEYOND EXPERT-LEVEL PERFORMANCE

Abstract

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features, and 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3% of the data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting. Codes and datasets are in the supplementary materials.

1. INTRODUCTION

Feature generation is an important yet challenging task when applying machine learning methods to tabular data. Tabular data, where each row represents an instance and each column corresponds to a distinct feature, is ubiquitous in industrial applications and machine learning competitions. It has been well recognized that the quality of features has a significant impact on the learning performance of tabular data (Domingos, 2012). The goal of feature generation is to transform the base features into more informative ones that better describe the data and enhance the learning performance. For example, the Price-to-Earnings ratio (P/E ratio), calculated as (share price)/(earnings per share), is derived from the base features "share price" and "earnings per share" in financial statements and informs investors about the value of a company. In practice, data scientists typically use their domain knowledge to find useful feature transformations in a trial-and-error manner, which requires tremendous human labor and expertise. Since manual feature generation is time-consuming and requires case-by-case domain knowledge, automated feature generation has emerged as an important topic in automated machine learning (Erickson et al., 2020; Lu, 2019).

Expand-and-reduce is arguably the most prevalent framework in automated feature generation, in which we first expand the candidate features and then eliminate redundant ones (Kanter & Veeramachaneni, 2015; Lam et al., 2021; Kaul et al., 2017; Shi et al., 2020; Katz et al., 2016). There are two challenges in a typical expand-and-reduce practice. First, the number of candidate features is usually huge in many industrial applications. Calculating all candidate features is not only computationally expensive but also infeasible due to the enormous amount of memory required. The second challenge is how to efficiently and accurately estimate the incremental performance of a new feature, i.e., how much performance improvement a new candidate feature can offer when added to the base feature set. The majority of existing methods rely on statistical tests to determine whether a new feature should be included (Kanter & Veeramachaneni, 2015; Lam et al., 2021; Shi et al., 2020). However, statistically significant features do not always translate into good predictors (Lo et al., 2015). Features may be significantly correlated with the target only for a small group of instances, thereby leading to poor predictions over the whole population (Lo et al., 2015; Ward et al., 2010; Welch & Goyal, 2008). Moreover, the effectiveness of a new feature may be encompassed by the base feature set, even if the new feature is significantly correlated with the target.

In this paper, we propose OpenFE, a powerful automated feature generation algorithm that can effectively generate useful features to enhance learning performance. First, motivated by the gap between significant features and good predictors, we propose a feature boosting method that directly estimates the predictive power of new features in addition to the base feature set. Second, inspired by the crucial fact that effective features are usually sparse in the huge number of candidate features, we propose a two-stage evaluation framework. In the first stage, we propose a successive featurewise pruning algorithm to quickly eliminate redundant candidate features by dynamically allocating computing resources to promising ones.
In the second stage, we propose a feature importance attribution method to rank the remaining candidate features based on their contributions to the improvement in learning performance and to further eliminate redundant candidate features. We validate OpenFE on various datasets and Kaggle competitions, where OpenFE outperforms existing baseline methods. In a famous Kaggle competition with thousands of data science teams participating (foot_0), the baseline model with features generated by OpenFE beats 99.3% of the 6,351 data science teams. More importantly, the features generated by OpenFE result in comparable or even larger performance improvements than those provided by the competition's top winners, demonstrating for the first time that automated feature generation is competitive against machine learning experts.

In addition to proposing a novel method, this paper addresses two important problems that hinder research progress in automated feature generation. The first problem is that the majority of existing methods are evaluated on different datasets, and these studies do not open-source their codes and datasets, hindering new research from conducting fair comparisons. In order to facilitate fair comparisons in future research, we reproduce the main methods for automated feature generation and validate our reproduction by comparing the reproduced results with those in the corresponding papers. We will open-source the codes and datasets (see supplementary materials). The second problem is the lack of evidence regarding the necessity of feature generation in the era of deep learning. Deep neural networks (DNNs) are widely recognized for their ability to extract feature representations. In recent years, a variety of DNNs have been carefully developed for modeling tabular data (Arık & Pfister, 2021; Gorishniy et al., 2021), and several of them have demonstrated their efficiency in feature interaction learning (Song et al., 2019; Wang et al., 2021). We extensively evaluate the effect of OpenFE on a variety of DNNs. We demonstrate that generating new transformed features with OpenFE can further enhance the learning performance of existing DNN architectures. In addition to the empirical results, we provide a theoretical justification of our feature generation procedure by presenting a simple yet representative transductive learning setting in which feature generation has provable benefits.

We summarize the contributions of our paper as follows:
• We propose a novel automated feature generation method that can effectively identify useful new features to enhance learning performance. Extensive experiments show that OpenFE achieves state-of-the-art results on seven benchmark datasets. More importantly, we demonstrate for the first time that OpenFE is competitive against human experts in feature generation.
• We facilitate future research in feature generation by: 1) reproducing the main methods for automated feature generation and releasing the codes and datasets, and 2) providing empirical and theoretical evidence that feature generation is a crucial component in modeling tabular data, even in the era of deep learning.

2. PROBLEM DEFINITION

We present an overview of OpenFE in Figure 1. OpenFE follows the expand-and-reduce framework for automated feature generation. For expansion, we first classify all base features into numerical features and categorical features. Then we create a pool of candidate features by using operators to enumerate all the first-order transformations of the base features, where each transformation uses one operator. The challenge of automated feature generation often lies in the reduction after expansion, i.e., how to achieve both efficiency and effectiveness in eliminating redundant candidate features, especially when the number of candidate features is huge. For reduction, we propose a two-stage evaluation framework to quickly reduce the number of candidate features. Finally, we include the top-ranked candidate features in the base feature set. We employ a greedy approach and repeat the above procedure for high-order feature generation.
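To make the expand step concrete, the sketch below enumerates a handful of first-order candidate transformations with pandas. The DataFrame, the column lists, and the subset of operators shown are illustrative assumptions rather than the exact OpenFE interface.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def enumerate_first_order(df: pd.DataFrame, num_cols, cat_cols) -> pd.DataFrame:
    """Enumerate a few first-order candidate transformations (one operator each)."""
    candidates = {}
    # Unary operators on numerical features.
    for f in num_cols:
        candidates[f"Abs({f})"] = df[f].abs()
        candidates[f"Log({f})"] = np.log(df[f].clip(lower=1e-12))
    # Binary operators on pairs of numerical features.
    for f1, f2 in combinations(num_cols, 2):
        candidates[f"{f1}+{f2}"] = df[f1] + df[f2]
        candidates[f"{f1}/{f2}"] = df[f1] / df[f2]
    # GroupBy-style operators pairing a numerical and a categorical feature.
    for f in num_cols:
        for c in cat_cols:
            candidates[f"GroupByThenMean({f},{c})"] = df.groupby(c)[f].transform("mean")
    return pd.DataFrame(candidates)
```

In the full candidate pool the number of such columns grows roughly quadratically with the number of base features, which is why the two-stage evaluation below avoids materializing all of them on the full dataset at once.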

3.2. FEATURE BOOSTING

One of the key challenges in automated feature generation is to accurately estimate the incremental performance of a new feature, i.e., how much performance improvement the new feature can offer when added to the base feature set. A standard evaluation procedure involves including the new feature in the base feature set, retraining the machine learning model, and observing the change in validation loss (Katz et al., 2016). However, the standard evaluation procedure is prohibitively expensive due to the substantial computing resources required for training a model from scratch to convergence on the whole dataset. Therefore, many existing methods rely on statistical tests to determine if a new feature is effective (Christ et al., 2018; Kanter & Veeramachaneni, 2015; Shi et al., 2020). However, significant features do not always translate into good predictors (Lo et al., 2015). The effectiveness of a new feature may be encompassed by the base feature set. Motivated by this, we propose a feature boosting method to estimate the predictive power of new features in addition to the base feature set. The inputs of feature boosting are a dataset $D$, new features $T'$, and base predictions $\hat{y}$ (see the definitions below). Feature boosting outputs the reduction in loss $\Delta$ as an estimate of the incremental performance of the new features.

Feature boosting is motivated by gradient boosting algorithms (Friedman et al., 2000). Assume we have a dataset of $n$ samples $D = \{(x_i, y_i), i = 1, 2, \ldots, n\}$. Let $\mathcal{F}$ denote a set of models defined over the base features $T$. Assume we have a model $f \in \mathcal{F}$ that has been trained to convergence on $D$. We call the predictions $\hat{y}_i = f(x_i)$ the base predictions, and define the loss $L(f) = \sum_{i=1}^{n} l(y_i, f(x_i))$. Let $\mathcal{F}'$ denote a set of models defined over the new features $T'$. We want to find a new model $f' \in \mathcal{F}'$ to boost the performance of $f$ and minimize $L(f + f') = \sum_{i=1}^{n} l(y_i, f(x_i) + f'(x_i))$. We can fix $f$ and optimize $f'$ by gradient descent (see more details in Appendix D.3). The reduction in loss $\Delta = L(f) - L(f + f')$ is an estimate of the incremental performance of the new features. We call this method "feature boosting," since it uses new features to boost the performance of the base features. Initialized by the base predictions, feature boosting can estimate incremental performance by training on the new features alone, as opposed to the full feature set, resulting in faster convergence and better efficiency. We provide a case study in Appendix C.6 to illustrate that feature boosting can efficiently estimate the incremental performance of new features.
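The minimal sketch below illustrates the idea with LightGBM's init_score mechanism, assuming pre-computed out-of-fold base predictions (raw scores) and a squared-error loss; the function and variable names are illustrative, not the official OpenFE API.

```python
import lightgbm as lgb
import numpy as np


def feature_boosting(X_new_tr, y_tr, base_tr, X_new_va, y_va, base_va):
    """Return the reduction in validation loss Delta = L(f) - L(f + f')."""
    # The base predictions enter as init_score, so the booster only has to
    # learn a correction f' on top of the existing model f.
    train_set = lgb.Dataset(X_new_tr, label=y_tr, init_score=base_tr)
    valid_set = lgb.Dataset(X_new_va, label=y_va, init_score=base_va,
                            reference=train_set)
    params = {"objective": "regression", "learning_rate": 0.1,
              "num_leaves": 16, "verbosity": -1}
    booster = lgb.train(params, train_set, num_boost_round=1000,
                        valid_sets=[valid_set],
                        callbacks=[lgb.early_stopping(3, verbose=False)])
    # L(f): loss of the base predictions alone; L(f + f'): loss after boosting.
    loss_base = np.mean((y_va - base_va) ** 2)
    boosted = base_va + booster.predict(X_new_va, raw_score=True)
    loss_boosted = np.mean((y_va - boosted) ** 2)
    return loss_base - loss_boosted
```

Because the booster only sees the candidate features and starts from the base predictions, it typically converges in far fewer trees than retraining on the full feature set.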

3.3. A TWO-STAGE EVALUATION FRAMEWORK

Although feature boosting provides an efficient way to estimate the incremental performance of candidate features (defined in Section 3.1), it is still computationally infeasible to calculate and evaluate the huge number of candidate features. Inspired by the crucial fact that effective features are usually sparse in the vast pool of candidate features, we propose a two-stage evaluation framework based on feature boosting to efficiently eliminate redundant features and retrieve useful ones. Before introducing the two-stage evaluation framework, we present the overall framework of OpenFE in Algorithm 1. OpenFE starts by generating the base predictions, i.e., training a model on the base features $T$ for feature boosting. Then we create a pool of candidate features by applying operators to the base features. Next, we propose a two-stage evaluation framework to deal with the large number of candidate features. The first stage employs a successive featurewise pruning algorithm. By using a bandit-based algorithm (Jamieson & Talwalkar, 2016) to dynamically allocate computing resources to promising features, we quickly eliminate redundant features by examining the effectiveness of each feature alone using feature boosting. In the second stage, we further take interaction effects into account and propose a feature importance attribution algorithm to credit the remaining features for their contributions to reducing the loss function.

3.3.1. STAGE I: SUCCESSIVE FEATUREWISE PRUNING

In OpenFE, we first use a successive featurewise pruning method to quickly reduce the number of candidate features (see Algorithm 2). The method is motivated by the successive halving algorithm in multi-armed bandit problems (Even-Dar et al., 2006; Zhou et al., 2014). Successive halving dynamically allocates computing resources to promising arms. In our setting, we first split the dataset into $2^q$ data blocks, where each data block has both a training set and a validation set, with a total of $\lfloor n/2^q \rfloor$ samples. Then we consider each candidate feature as an arm, and a pull of the arm is to use feature boosting to evaluate the effectiveness of the single feature on the data blocks. The reward of pulling an arm is the reduction in loss in feature boosting, which is calculated on the validation set. We allocate more data blocks (resources) to more promising candidate features. After successive featurewise pruning, only candidate features with a positive reduction in loss $\Delta$ are returned. Besides successive halving, we use a simple trick in our algorithm to eliminate redundant candidate features. It is common in the candidate feature set for two features to have exactly the same values. For example, $\max(\tau_1, \tau_2)$ and $\max(\tau_1, \tau_3)$ are exactly the same if the minimum value of feature $\tau_1$ is larger than the maximum values of $\tau_2$ and $\tau_3$. Features with the same values have the same score (i.e., reduction in loss $\Delta$). Therefore, the delete_same algorithm first ranks all the new features by their scores, and then compares the scores of adjacent features to remove redundant ones.
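A compact sketch of this pruning loop is given below, assuming the data has already been split into $2^q$ blocks and that `score_feature` wraps the feature-boosting call from the previous section; candidate specifications (hashable identifiers) and helper names are illustrative.

```python
def successive_pruning(candidates, data_blocks, score_feature, q):
    """Successive featurewise pruning over a list of candidate specs.

    candidates: hashable candidate-feature identifiers.
    data_blocks: list of 2**q data blocks.
    score_feature(cand, blocks): returns the reduction in loss Delta.
    """
    survivors = {c: None for c in candidates}   # candidate -> last score
    for i in range(q + 1):
        subset = data_blocks[: 2 ** i]          # allocate 2**i blocks this round
        scored = sorted(((score_feature(c, subset), c) for c in survivors),
                        key=lambda t: t[0], reverse=True)
        # delete_same trick: adjacent candidates with identical scores are
        # treated as duplicate features and collapsed to one.
        deduped = []
        for s, c in scored:
            if not deduped or s != deduped[-1][0]:
                deduped.append((s, c))
        kept = deduped[: max(1, len(deduped) // 2)]   # top_half
        survivors = {c: s for s, c in kept}
    # Only candidates whose final reduction in loss is positive are returned.
    return [c for c, s in survivors.items() if s is not None and s > 0]
```

Each round doubles the data allocated to the remaining candidates while halving their number, which is what keeps the total cost roughly proportional to evaluating the full pool once on a small subset.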

3.3.2. STAGE II: FEATURE IMPORTANCE ATTRIBUTION

Algorithm 3: Feature Attribution
input : D: dataset, ŷ: base predictions, T: base features, A′(T): candidate feature set
output: Sorted A′(T).
∆ = FeatureBoosting(D, T + A′(T), ŷ);
Attribute the reduction in loss ∆ to each feature in A′(T) as its importance;
Sort A′(T) according to importance;
return sorted A′(T)

Two problems remain unaddressed in stage I: 1) there are still many redundant features in the candidate feature set after pruning, and 2) stage I only evaluates the effectiveness of each feature alone. Therefore, in stage II, we further consider interaction effects when evaluating the effectiveness of the remaining candidate features. We select the top-ranked candidate features and remove redundant ones to improve generalization performance. Assume the remaining candidate features after pruning are $A'(T)$. We use the candidate features $A'(T)$ and the base features $T$ together as the inputs to feature boosting. The reduction in loss $\Delta$ is an estimate of the incremental performance of the remaining candidate features $A'(T)$, which takes into account the interaction effects between $T$ and $A'(T)$. We attribute the reduction in loss $\Delta$ to each candidate feature as its importance. Popular methods for feature importance attribution include mean decrease in impurity (MDI) (Breiman, 2001), permutation feature importance (PFI) (Breiman, 2001), and SHAP (Lundberg et al., 2018). MDI is a popular method for feature attribution in tree ensembles, which sums up the total reduction of loss over all splits on a given feature. PFI measures the increase in the loss function after we randomly permute the feature's values. SHAP is a game-theoretic method for feature attribution.
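The sketch below shows one way to realize this stage with LightGBM, assuming the base and remaining candidate features are concatenated in a DataFrame and the base predictions are supplied as init_score; the importance attribution uses MDI (gain importance), and parameter values loosely follow the defaults reported later in the appendix.

```python
import lightgbm as lgb


def rank_candidates(X_all_tr, y_tr, base_tr, X_all_va, y_va, base_va,
                    candidate_cols, top_k=10):
    """Train once on base + candidate features, rank candidates by gain importance."""
    train_set = lgb.Dataset(X_all_tr, label=y_tr, init_score=base_tr)
    valid_set = lgb.Dataset(X_all_va, label=y_va, init_score=base_va,
                            reference=train_set)
    params = {"objective": "regression", "learning_rate": 0.1,
              "num_leaves": 16, "verbosity": -1}
    booster = lgb.train(params, train_set, num_boost_round=1000,
                        valid_sets=[valid_set],
                        callbacks=[lgb.early_stopping(50, verbose=False)])
    # MDI: total split gain accumulated by each feature across all trees.
    gain = dict(zip(booster.feature_name(),
                    booster.feature_importance(importance_type="gain")))
    ranked = sorted(candidate_cols, key=lambda c: gain.get(c, 0.0), reverse=True)
    return ranked[:top_k]
```

Because the booster sees base and candidate features jointly, a candidate whose signal is already covered by the base features accumulates little gain and is ranked low, which is exactly the interaction effect stage I ignores.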

3.4. IMPLEMENTATION

In OpenFE, we use gradient boosting decision trees (GBDT) (Friedman, 2001) to model tabular data for two reasons: 1) GBDT is usually the best performing model on tabular data where features are individually meaningful (Gorishniy et al., 2021; Borisov et al., 2021) . 2) GBDT can automatically handle missing values and categorical features, which is convenient for automation (Ke et al., 2017) . We use the popular LightGBM implementation (Ke et al., 2017) . We use MDI for feature importance attribution. We compare MDI, PFI, and SHAP in Appendix C.4. We show that MDI is fast to compute and provides comparable performance to PFI and SHAP. Even though the feature generation method relies on GBDT, the generated features can also enhance the learning performance of a variety of DNNs (see Section 5.4).

4. THEORETICAL ADVANTAGE OF FEATURE GENERATION

In this section, we study the advantage of feature generation from a theoretical perspective. We present a simple yet representative setting in which the test loss of empirical risk minimization augmented with feature generation provably converges to zero as the number of training samples increases, while the test loss of any learning model without feature generation is at least a positive constant. In particular, we present a transductive learning setting, which captures important characteristics of a class of datasets one may encounter frequently in data science applications, e.g., the IEEE-CIS Fraud Detection dataset (we also conduct experiments on this dataset in Section 5.5). Due to the space limit, a formal and detailed description of the model can be found in Appendix A. We briefly introduce the high-level idea here. Many tabular datasets contain both categorical and numerical attributes (i.e., features). A categorical feature partitions the dataset into groups (each associated with a distinct category). For a data point $(X, Y)$, the target $Y$ is correlated not only with the feature $X$, but also with certain statistics of the group containing $(X, Y)$. Datasets with such characteristics are abundant in data science applications. As a concrete example, one may think of each training data point as a transaction, and one categorical feature is the user id (each user may have many transactions in this table, and these transactions form a group). The target $Y$ we want to predict about the transaction (e.g., the probability of fraudulence) may depend not only on the features of this particular transaction, but also on some statistics of this user (e.g., the average size of his/her transactions). Hence, one can see that operations such as GroupByThenMean (group by the user id) can provide statistical information about the user by aggregating the information from all data points associated with this user. We present a theoretical data model, which is a two-phase data generation model, to capture the above characteristics. Under fairly standard learning-theoretic assumptions (i.e., bounded Rademacher complexity), we prove that empirical risk minimization augmented with feature generation (such as the GroupByThenMean operation) can achieve vanishing test loss as the sample size and group size increase.

Theorem 1. (informal) Assume the data set is generated according to the two-phase process described in Appendix A. Denote the number of groups in the training set and test set by $k_1$ and $k_2$ respectively, and the number of data points in each group by $h$. There is a feature generation function $H$ such that the test loss of the empirical risk minimizer $\hat{f}$ can be bounded by $L_{D_{test}}(H, \hat{f}) \leq O\big(\mathrm{Rad}_{k_1}(\mathcal{F}) + \sqrt{\ln(4\delta^{-1})/k_1} + \sqrt{d\ln(4d(k_1+k_2)\delta^{-1})/h}\big)$ with probability at least $1-\delta$. In particular, assuming the Rademacher complexity $\mathrm{Rad}_{k_1}(\mathcal{F}) \to 0$ as $k_1 \to \infty$, the test loss approaches 0 when $k_1, h \to \infty$.

On the other hand, if we do not use any feature generation, we prove that any predictor $f'$ (no matter how complicated $f'$ is) incurs a non-vanishing constant test loss.

Theorem 2. (informal) In case we do not use any feature generation, there exists a problem instance such that, no matter how large $k_1$, $k_2$, and $h$ are, for any predictor $f': \mathcal{X} \to \mathcal{Y}$, the test loss $L_{D_{test}}(f') \geq \frac{3}{64}$.

We use a diverse set of seven public datasets with two regression datasets, three binary classification datasets, and two multi-class classification datasets.

5.1. DATASETS AND EVALUATION METRICS

Each dataset has exactly one train-validation-test split, and all methods use the same split. The datasets we collect include: California Housing (CA, real estate data, (Pace & Barry, 1997)), Microsoft (MI, search queries, (Qin & Liu, 2013)), Diabetes (DI, hospital records, (Strack et al., 2014)), Nomao (NO, data deduplication, (Candillier & Lemaire, 2012)), Vehicle (VE, vehicle classification, (Siebert, 1987)), Jannis (JA, anonymized dataset, (Guyon et al., 2019)), and Covertype (CO, forest characteristics, (Blackard & Dean, 1999)). We summarize the dataset properties in Table 1. The features in each dataset are classified into numerical, categorical, and ordinal features. An ordinal feature can be treated as both categorical and numerical when generating transformed features.

5.2. BASELINE METHODS FOR COMPARISONS

The baseline methods for comparison include: Base (the base feature set without feature generation), FCTree (Fan et al., 2010), SAFE (Shi et al., 2020), AutoFeat (Horn et al., 2019), and AutoCross (Luo et al., 2019). We provide another baseline that uses the last hidden layer of DCN-V2 as new features. DCN-V2 (Wang et al., 2021) is a DNN architecture developed for modeling tabular data that claims to be able to capture feature interactions automatically. We denote this baseline as NN. Most existing automated feature engineering methods do not open-source their codes. We reproduce FCTree, SAFE, and AutoCross according to the descriptions in their papers. We validate our reproduction by comparing the reproduced results with those in the corresponding papers (see Appendix C.2). Some other learning-based methods, such as LFE (Nargesian et al., 2017) and ExploreKit (Katz et al., 2016) (meta learning), and NFS (Chen et al., 2019) and TransGraph (Khurana et al., 2018) (reinforcement learning), lack critical details for reproduction. For example, the choice of training datasets is crucial for generalization in learning-based methods; however, previous research does not describe which datasets were used for training (Nargesian et al., 2017; Khurana et al., 2018). We run OpenFE on the test datasets they use and compare our results with the numbers reported in their papers in Appendix C.1.

5.3. COMPARISON RESULTS

In this section, we compare OpenFE with other baseline methods. We use two standard learning algorithms to evaluate the effectiveness of the new features generated by different methods. For GBDT, we choose the LightGBM implementation (Ke et al., 2017). For neural networks, we choose FT-Transformer (Gorishniy et al., 2021). Hyperparameter tuning follows a standard benchmark study (Gorishniy et al., 2021), and we tune the hyperparameters using the base feature set. For each dataset, we generate first-order features and include the same number of new features for the different methods for a fair comparison. We discuss generating high-order features in Appendix E.2. We present the comparison results in Table 2. We can observe that OpenFE has clear advantages over the baseline methods and achieves SOTA in most cases. The features generated by OpenFE greatly enhance the performance of both LightGBM and FT-Transformer. When baseline methods fail to generate effective new features, they include uninformative features in the base feature set, which introduces additional noise and harms the generalization of the learning model. We present the running time of different methods in Appendix C.3.

5.4. FEATURE GENERATION FOR NEURAL NETWORKS

In this section, we show that new features generated by OpenFE can greatly improve the performance of a variety of neural networks designed specifically for tabular data. The models include: AutoInt (Song et al., 2019), DCN-V2 (Wang et al., 2021), NODE (Popov et al., 2019), TabNet (Arık & Pfister, 2021), and FT-Transformer (Gorishniy et al., 2021). We follow the same implementations and hyperparameter tuning of these networks as in (Gorishniy et al., 2021). We present the results in Table 3. The features generated by OpenFE greatly enhance the performance of different models in most cases. For AutoInt and DCN-V2, which claim to be able to learn feature interactions, feature generation can further improve the performance. Even though OpenFE relies on GBDT to measure the performance of new features, the generated features are also effective for a variety of DNNs.

5.5. KAGGLE COMPETITIONS

The first Kaggle competition is IEEE-CIS Fraud Detection (foot_1), where the goal is to predict whether an online transaction is fraudulent. This competition is one of the largest and most competitive tabular data competitions on Kaggle, with 6,351 data science teams participating. The competition's first-place team made public the features they generated after the competition ended (foot_2), which we refer to as Expert. This competition heavily relies on feature generation: a baseline XGBoost model without feature generation ranks at 2286 among 6351 teams on the private leaderboard. However, the baseline model with features generated by the team ranks at 76/6351. The same baseline model with features generated by OpenFE ranks at 42/6351, which outperforms the features generated by the first-place team. We present the results in Table 4. We investigate the influence of the number of new first-order features in Figure 2. The advantage of OpenFE is that it can automatically explore and generate more useful feature transformations than Expert. We explain the useful feature transformations generated by OpenFE in Appendix E.1. The second competition is BNP Paribas Cardif Claims Management (foot_3), where the goal is to predict whether an insurance claim should be approved. We present the details in Appendix C.5.

Table 5 rows (ablation results, one column per benchmark dataset):
OpenFE\FB: 0.4299±2.2e-3, 0.7441±4.0e-3, 0.8856±3.3e-3, 0.9959±2.0e-4, 0.9263±2.0e-4, 0.7251±1.2e-3, 0.9704±7.0e-4
OpenFE\SS: 0.4282±2.3e-3, 0.7439±5.0e-4, 0.8846±2.2e-3, 0.9952±3.0e-4, 0.9267±3.0e-4, 0.7316±1.1e-3, 0.9649±7.0e-4
OpenFE-MI: 0.4282±2.3e-3, 0.7439±3.0e-4, 0.8867±4.4e-3, 0.9959±1.0e-4, 0.9261±3.0e-4, 0.7293±1.2e-3, 0.9713±5.0e-4

5.6. ABLATION STUDY

We conduct an ablation study to justify the design choices of OpenFE. We name different variants of OpenFE as follows:
• OpenFE\FB. OpenFE without feature boosting. We train GBDT from scratch without using base predictions as the initial prediction. For example, in regression tasks the initial prediction is the average of the targets in the training set.
• OpenFE\SS. OpenFE without successive featurewise pruning. We subsample the data so that generating all the features can fit in the memory.
• OpenFE-MI. We use mutual information between the feature and the target as the scoring criterion in successive featurewise pruning.
We present the results in Table 5. The results show that: 1) Feature boosting significantly improves the results. 2) Directly subsampling the data usually hurts the performance. 3) Mutual information does not provide desirable results in eliminating features.

6. RELATED WORK

Expand-and-reduce is arguably the most popular framework in automated feature generation. Most existing automated feature generation approaches employ statistical methods to identify and remove redundant features. For example, Data Science Machine (DSM) (Kanter & Veeramachaneni, 2015) uses the f-value between the feature and the target, and One Button Machine (OneBM) (Lam et al., 2021) uses the Chi-square test to remove redundant features. DSM and OneBM focus on feature generation in relational databases, which is different from our setting in tabular data. FCTree (Fan et al., 2010) uses information gain to select new features during the construction of decision trees. SAFE (Shi et al., 2020) uses information value to exclude uninformative features. AutoCross (Luo et al., 2019) and AutoFeat (Horn et al., 2019) use the improvement on a linear regression model to determine whether a new feature should be included. LFE (Nargesian et al., 2017) and ExploreKit (Katz et al., 2016) design meta features based on statistical methods and use a meta learning approach to determine the effectiveness of new features. LFE and ExploreKit are confined to binary-classification tasks and a specific set of operators. How to search for high-order feature transformations is challenging in automated feature generation due to the explosion of the search space (Chen et al., 2019). Some learning-based methods directly search for high-order feature transformations (Xie et al., 2021). For example, Neural Feature Search (Chen et al., 2019) utilizes a recurrent neural network controller to search for a series of transformations. Khurana et al. (2018) use reinforcement learning to traverse a transformation graph for high-order features.

7. CONCLUSION

We propose OpenFE, a powerful automated feature generation method that can effectively generate useful features to enhance the learning performance of tabular data. Extensive experiments show that OpenFE achieves SOTA on seven benchmark datasets and is competitive against human experts in feature generation. We facilitate future research by: 1) reproducing the main methods for automated feature generation and releasing the codes and datasets for fair comparisons, and 2) providing empirical and theoretical evidence that feature generation is a crucial component in modeling tabular data.

REPRODUCIBILITY STATEMENT

All the codes and datasets can be found in the supplementary materials, including the implementation of OpenFE and other baseline methods. We also provide scripts (detailed experiment parameters) in the supplementary materials to reproduce the experiments. Readers can follow the "readme" to download the datasets and reproduce the results. Our theoretical contributions are clearly stated in Appendix A with definitions, assumptions, and complete proofs of all the theorems.

PROBLEM SETTING

In this section, we study the advantage of feature generation from a theoretical perspective. In particular, we present a simple yet representative setting in which empirical risk minimization augmented with feature generation can achieve a smaller test loss than any learning model without feature generation. Consider a regression problem with both numerical and categorical features in a transductive learning setting. Denote by $\mathcal{X}_{num} \subseteq \mathbb{R}^d$ the numerical feature space, by $\mathcal{X}_{cat} \subseteq \mathbb{N}$ the categorical feature space, and by $\mathcal{Y} \subseteq [0, 1]$ the target domain. We assume $\mathcal{X}_{num}$ is $d$-dimensional and convex. The training data $D_{train}$ consists of $n$ training points $\{(X_i, Y_i)\}_{i=1,\ldots,n}$ where $X_i = (x_{i0}, x_{i1}, \ldots, x_{id})$, $x_{i0} \in \mathcal{X}_{cat}$ is the categorical feature value, and $(x_{i1}, \ldots, x_{id}) \in \mathcal{X}_{num}$ are the $d$ numerical feature values. As a concrete example, one may think of each training data point as a transaction, where $x_{i0}$ is the user id, $(x_{i1}, \ldots, x_{id})$ are some features about the user and the transaction, and $Y$ is the target we want to predict about the transaction (e.g., lateness of payment, probability of fraudulence). The test data $D_{test}$ consists of $m$ data points $\{(X_i, Y_i)\}_{i=n+1,\ldots,n+m}$, but these $Y_i$ are the unknown targets that we try to predict.

The data model: The training set $D_{train}$ and test set $D_{test}$ are generated by a two-phase process. Let $\Delta(\mathcal{X}_{num})$ be the space of all probability distributions defined over $\mathcal{X}_{num}$, and $Q$ be a probability distribution over $\Delta(\mathcal{X}_{num})$. $D_{train}$ is generated by repeating the following $k_1$ times: in the $i$-th iteration, we first take a sample $P_i \in \Delta(\mathcal{X}_{num})$ from $Q$ ($P_i$ is a distribution over $\mathcal{X}_{num}$). Then, we sample a group of $h$ training data points, each with its categorical feature value being $c_i \in \mathcal{X}_{cat}$ and its $d$ numerical feature values being an i.i.d. sample from $P_i$. Hence, all these $h$ training data points $X_{ih+1}, \ldots, X_{(i+1)h}$ have the same categorical value $c_i \in \mathcal{X}_{cat}$, and they form the $i$-th group $G_i$ (e.g., the set of transactions of the $i$-th user). $D_{train}$ contains $k_1$ groups $G_1, \ldots, G_{k_1}$. For the $i$-th data point $(X_i, Y_i)$, we use $g(i)$ to denote the index of the group that contains $(X_i, Y_i)$. Hence, $(X_i, Y_i) \in G_{g(i)}$. Let $\mathcal{F}$ be the hypothesis class, and the true hypothesis is $f^*: \mathcal{X}_{num} \times \mathcal{X}_{cat} \to \mathcal{Y}$ such that $Y_i = f^*(x_{i1}, \ldots, x_{id}, Z_i)$ where $Z_i = \mathbb{E}_{X \sim P_{g(i)}}[X]$ ($Z_i$ is a $d$-dimensional vector); e.g., the target depends not only on the numerical feature values of this particular transaction, but also on the mean of the statistics of the user. The test dataset $D_{test}$ is generated in the same way and contains $k_2$ groups $G_{k_1+1}, \ldots, G_{k_1+k_2}$. We assume the categorical feature values $c_1, \ldots, c_{k_1}, c_{k_1+1}, \ldots, c_{k_1+k_2}$ are all distinct (e.g., a user id that appears in the test set does not appear in the training set).

Feature generation: Here we are interested in feature generation using operations such as GroupByThenMean. In our setting, the group-by operation is based on the categorical feature $\mathcal{X}_{cat}$. So, for a data point $X_i$, we can generate a set of new features $\tilde{X}_i$ which can be computed from all data points in the same group $G_{g(i)}$. Formally, let $\mathcal{H}$ be the set of possible feature generation functions; the new feature $\tilde{X}_i$ is computed as $\tilde{X}_i = H(G_{g(i)})$ for some $H \in \mathcal{H}$. For a feature generation function $H$ and predictor $f$, the loss on the data point $(X_i, Y_i)$ is $L(H, f, X_i, Y_i) = \|Y_i - f(X_i, \tilde{X}_i)\|^2 = \|Y_i - f(X_i, H(G_{g(i)}))\|^2$.
Our goal is to find a feature generation function $H \in \mathcal{H}$ and a predictor $f \in \mathcal{F}$ such that the test loss is minimized: $L_{D_{test}}(H, f) = \mathbb{E}_{(X_i, Y_i) \in D_{test}}[L(H, f, X_i, Y_i)] = \frac{1}{m}\sum_{(X_i, Y_i) \in D_{test}} \|Y_i - f(X_i, H(G_{g(i)}))\|^2$. If we do not use any feature generation, the loss of a predictor $f'$ (foot_4) is simply $L_{D_{test}}(f') = \mathbb{E}_{(X_i, Y_i) \in D_{test}}[\|Y_i - f'(X_i)\|^2]$. In the following, we show that under natural conditions, we can achieve vanishing test loss by using feature generation (as the number of training samples $n$ and the minimum size of each group $h$ become larger); see Theorem 3. But if we only use the raw features $X_i$ for predicting $Y_i$, the expected test loss of any predictor is at least a positive constant; see Theorem 4.

LEARNABILITY WITH FEATURE GENERATION

For a particular feature generation function $H$, we use $\mathrm{Rad}_k(\mathcal{F})$ to denote the empirical Rademacher complexity of $\mathcal{F}$ over $k$ random samples:
$\mathrm{Rad}_k(\mathcal{F}) = \mathbb{E}_{\{(X_i, Y_i)\}_{i=1}^{k}}\left[\frac{1}{k}\,\mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{k} \sigma_i\, L(H, f, X_i, Y_i)\right]\right]$,
where $\sigma = (\sigma_1, \cdots, \sigma_k)$ are independent Rademacher variables, $X_i$ is an i.i.d. sample from $P_i$, which is an i.i.d. sample from $Q$, and $Y_i = f^*(X_i, Z_i)$ where $Z_i = \mathbb{E}_{X \sim P_i}[X]$. We assume that $\mathrm{Rad}_{k_1}(\mathcal{F}) \to 0$ as $k_1 \to \infty$ (for many hypothesis classes, $\mathrm{Rad}_{k_1}$ scales as $O(\sqrt{1/k_1})$) (Mohri et al., 2018). We also assume any function $f(\cdot, \cdot) \in \mathcal{F}$ is Lipschitz in $Z$: there exists a constant $C_{\mathcal{F}}$ such that $|f(X, Z_1) - f(X, Z_2)| \leq C_{\mathcal{F}} \|Z_1 - Z_2\|$ for any $Z_1, Z_2$, $X \in \mathcal{X}$, and $f \in \mathcal{F}$. We further assume there exists a constant $B_X$ such that $\sup_{X \in \mathcal{X}} \|X\| \leq B_X$.

Theorem 3. There is a feature generation function $H$ such that the test loss of the empirical risk minimizer $\hat{f}$ can be bounded by
$L_{D_{test}}(H, \hat{f}) \leq 2\,\mathrm{Rad}_{k_1}(\mathcal{F}) + \sqrt{2\ln(4\delta^{-1})/k_1} + 6 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$
with probability at least $1 - \delta$. In particular, the test loss approaches 0 when $k_1, h \to \infty$.

Proof. We fix the feature generation function to be GroupByThenMean. Concretely, we have $\tilde{X}_i = H(G_{g(i)}) = \frac{1}{|G_{g(i)}|} \sum_{j \in G_{g(i)}} X_j$. The algorithm then finds $\hat{f}$ that minimizes the empirical risk: $\hat{f} = \arg\min_{f \in \mathcal{F}} L_{D_{train}}(H, f)$. According to Hoeffding's inequality and a union bound over each dimension, with probability at least $1 - \delta/(2(k_1+k_2))$, we have
$\|\tilde{X}_i - Z_i\| = \left\| \frac{1}{|G_{g(i)}|} \sum_{j \in G_{g(i)}} X_j - \mathbb{E}_{X \sim P_{g(i)}}[X] \right\| \leq B_X \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$
for a fixed $g(i)$. By applying a union bound over all groups, this statement holds for all $i$ with probability at least $1 - \delta/2$. In this case, for any function $f \in \mathcal{F}$, we have
$(f(X_i, Z_i) - Y_i)^2 - (f(X_i, \tilde{X}_i) - Y_i)^2 \leq |f(X_i, Z_i) - f(X_i, \tilde{X}_i)| \cdot \big(|f(X_i, Z_i) - Y_i| + |f(X_i, \tilde{X}_i) - Y_i|\big) \leq 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h} \quad (2)$
where the last inequality holds since $|f(X_i, Z_i) - f(X_i, \tilde{X}_i)| \leq C_{\mathcal{F}} \|\tilde{X}_i - Z_i\|$ and $\mathcal{Y} \subseteq [0, 1]$. Define $H^*$ to be the function $H^*(G_{g(i)}) = Z_i = \mathbb{E}_{X \sim P_{g(i)}}[X]$. We note this statement further implies $L_{D_{train}}(H^*, f) - L_{D_{train}}(H, f) \leq 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$ and $L_{D_{test}}(H^*, f) - L_{D_{test}}(H, f) \leq 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$ according to the definition of the loss function. Note that the data points are not i.i.d. samples (since two points in one group are correlated), so we need to define the following group Rademacher complexity to bound the generalization error. We define the group Rademacher complexity to be the Rademacher complexity defined over groups:
$\mathrm{Rad}^G_{k_1}(\mathcal{F}) := \mathbb{E}_{D_{train}}\left[\frac{1}{k_1}\,\mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{k_1} \sigma_i \cdot \frac{1}{h} \sum_{j \in G_i} L(H, f, X_j, Y_j)\right]\right]$,
where $\sigma = (\sigma_1, \cdots, \sigma_{k_1})$ are independent Rademacher variables. Note that if we view each group as one random sample, these groups are i.i.d. samples. Hence, we can apply the classical generalization bound via Rademacher complexity (Mohri et al., 2018), which asserts that with probability at least $1 - \delta/2$, for any function $f \in \mathcal{F}$ the following holds:
$L_{D_{test}}(H, f) - L_{D_{train}}(H, f) \leq 2\,\mathrm{Rad}^G_{k_1}(\mathcal{F}) + \sqrt{2\ln(4\delta^{-1})/k_1}$.
Moreover, one can see that the group Rademacher complexity can be upper bounded by the ordinary Rademacher complexity for i.i.d. samples:
$\mathrm{Rad}^G_{k_1}(\mathcal{F}) \leq \frac{1}{h} \sum_{i=1}^{h} \mathbb{E}_{D_{train}}\left[\frac{1}{k_1}\,\mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{j=1}^{k_1} \sigma_j \cdot L(H, f, X_{G_{i,j}}, Y_{G_{i,j}})\right]\right] = \mathrm{Rad}_{k_1}(\mathcal{F})$,
where $G_{i,j}$ is the $j$-th element in $G_i$. By a union bound, both equation 2 and equation 3 are satisfied for all $f \in \mathcal{F}$ with probability at least $1 - \delta$. In this case, since $f^*$ is the ground truth, its empirical risk satisfies $L_{D_{train}}(H^*, f^*) = 0$.
In this case, we have
$L_{D_{train}}(H, \hat{f}) \leq L_{D_{train}}(H, f^*) \leq \big(L_{D_{train}}(H, f^*) - L_{D_{train}}(H^*, f^*)\big) + L_{D_{train}}(H^*, f^*) \leq 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$,
where the first inequality is due to empirical risk minimization and the last inequality is due to equation 2. As a result, the population loss can then be bounded by
$L_{D_{test}}(H, \hat{f}) \leq \big(L_{D_{test}}(H, \hat{f}) - L_{D_{test}}(H^*, \hat{f})\big) + \big(L_{D_{test}}(H^*, \hat{f}) - L_{D_{train}}(H^*, \hat{f})\big) + \big(L_{D_{train}}(H^*, \hat{f}) - L_{D_{train}}(H, \hat{f})\big) + L_{D_{train}}(H, \hat{f})$
$\leq 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h} + 2\,\mathrm{Rad}^G_{k_1}(\mathcal{F}) + \sqrt{2\ln(4\delta^{-1})/k_1} + 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h} + 2 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$
$\leq 2\,\mathrm{Rad}_{k_1}(\mathcal{F}) + \sqrt{2\ln(4\delta^{-1})/k_1} + 6 B_X C_{\mathcal{F}} \sqrt{2d\ln(4d(k_1+k_2)\delta^{-1})/h}$,
where the second inequality holds due to equation 3 and equation 5 while the last inequality follows from equation 4. This proves the desired statement.

Remarks: We only consider the case where $\mathcal{H}$ contains one particular operation, GroupByThenMean. In fact, it is possible to extend the above setting to one with multiple feature generation functions. Here we require that such feature generation functions be statistics of the corresponding group that can be estimated using i.i.d. samples. In our two-phase generative process, each group contains exactly $h$ data points. We can easily extend our theorem to the setting where the size of each group is also a random variable, as long as it takes a value of at least $h$ with high probability. The main idea is the same, but the notation would become very tedious. Since our goal here is to illustrate the statistical advantage of feature generation, we choose to present a simplified yet representative setting.

WITHOUT FEATURE GENERATION

Theorem 4. In case we do not use any feature generation, there exists a problem instance such that, no matter how large $k_1$ (the number of groups in the training set), $k_2$ (the number of groups in the test set), and $h$ (the size of each group) are, for any function $f': \mathcal{X} \to \mathcal{Y}$, the test loss is at least $L_{D_{test}}(f') \geq \frac{3}{64}$.

Proof. Consider a problem instance with $\mathcal{X} = \mathcal{Y} = [0, 1]$. Each group contains $h$ data points and is generated in the following way:

C.1 COMPARISONS WITH OTHER BASELINES

We compare OpenFE with other baselines, including some learning-based methods that lack critical details for code reproduction. These methods include:
• Random (Khurana et al., 2018). Randomly include features from the candidate feature set multiple times and select new features with improvement according to CV scores.
• TransGraph (Khurana et al., 2018). TransGraph uses reinforcement learning to traverse a transformation graph for feature transformations.
• LFE (Nargesian et al., 2017). LFE recommends feature transformations by meta-learning approaches.
• NFS (Chen et al., 2019). NFS uses a recurrent neural network controller to search for a series of transformations.
Following previous studies (Chen et al., 2019), the metric for regression datasets is 1 − (relative absolute error) and the metric for classification datasets is the F1-score. We present the results in Table 7. OpenFE also surpasses the other baseline methods on these datasets.

C.2 REPRODUCTION

We reproduce FCTree, SAFE, and AutoCross according to the descriptions in their papers. In order to make sure that we reproduce a reasonable version of these methods, we compare the results of our reproduced methods with the ones reported in their papers. We present the results in Table 8. We can see that most of the results of the reproduced methods closely match or even outperform the results from the papers.

C.3 RUNNING TIME

We present the running time of different methods in Table 9. The experimental environment is the same for all the methods. We can see that OpenFE is a fast method that terminates within a reasonable amount of time even for large datasets.

C.4 COMPARISON OF FEATURE IMPORTANCE ATTRIBUTION METHODS

In this section, we compare MDI, permutation, and SHAP for feature importance attribution in OpenFE. We use the results of OpenFE on the seven benchmarking datasets to rank these methods. The training model is LightGBM. We present the results in Table 10. We also present the running time of each method on the Microsoft dataset, the benchmarking dataset with the largest number of samples. The different feature attribution methods do not differ much on most of the datasets. MDI can be obtained for free after the training process of LightGBM, while permutation and SHAP may require a longer running time, depending on the size of the dataset.

C.5 KAGGLE COMPETITION

We present the details of the BNP Paribas Cardif Claims Management competition in this section. The goal of the competition is to predict whether a personal insurance claim should be approved. The competition's 8th-place team made public their generated features after the competition ended (foot_5) (the winners with higher rankings did not share their codes). We evaluate the performance of OpenFE in a similar way. A baseline model using CatBoost (Prokhorenkova et al., 2018) without feature generation ranks at 31 among 2920 teams. The baseline model with features generated by experts ranks at 12/2920, while the baseline model with features generated by OpenFE ranks at 12/2920. We present the results in Table 4. OpenFE generates 200 first-order features and 100 second-order features, while Expert generates 156 first-order features and 132 high-order features.

We enumerate all the first-order transformations to form the candidate feature set. A first-order transformation uses one operator once to transform base features. For example, weight/height is a first-order transformation of the base features weight and height. BMI = weight/height^2 is a second-order transformation. We list the details of all the operators below; a small pandas sketch of a few of them follows the list. f stands for a numerical feature and c stands for a categorical feature.
• Freq(f). The frequency of feature f.
• Freq(c). The frequency of feature c.
• Abs(f). Element-wise absolute value.
• Log(f). Element-wise logarithm.
• Sqrt(f). Element-wise square root.
• Sigmoid(f). Element-wise application of the function $x \mapsto \frac{1}{1+e^{-x}}$.
• Round(f). Element-wise rounding.
• Residual(f). Element-wise decimal part.
• Min(f1, f2). Element-wise minimum.
• Max(f1, f2). Element-wise maximum.
• f1 + f2. Element-wise addition.
• f1 − f2. Element-wise subtraction.
• f1 × f2. Element-wise multiplication.
• f1 ÷ f2. Element-wise division.
• GroupByThenMin(f, c). The minimum value of f in each category of feature c.
• GroupByThenMax(f, c). The maximum value of f in each category of feature c.
• GroupByThenMean(f, c). The average value of f in each category of feature c.
• GroupByThenMedian(f, c). The median of f in each category of feature c.
• GroupByThenStd(f, c). The standard deviation of f in each category of feature c.
• GroupByThenRank(f, c). The ranking of f in each category of feature c. The rankings are normalized between 0 and 1.
• Combine(c1, c2). Combine the categories in features c1 and c2 into new categories. For example, if for a data point "vocation" is "doctor" and "hobby" is "football", then the value of Combine(vocation, hobby) for this data point is the category "doctor-football".
• CombineThenFreq(c1, c2). Same as Freq(Combine(c1, c2)).
• GroupByThenNUnique(c1, c2). The number of unique values of c1 in each category of feature c2.
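The following snippet gives illustrative pandas implementations of a few of the operators listed above; `df` is assumed to hold the base features, and `f`, `c`, `c1`, `c2` are column names. It is a sketch of the operator semantics, not the official OpenFE implementation.

```python
import pandas as pd


def freq(df: pd.DataFrame, c: str) -> pd.Series:
    """Freq(c): how often each value of c occurs in the dataset."""
    return df[c].map(df[c].value_counts())


def group_by_then_mean(df: pd.DataFrame, f: str, c: str) -> pd.Series:
    """GroupByThenMean(f, c): average of f within each category of c."""
    return df.groupby(c)[f].transform("mean")


def group_by_then_rank(df: pd.DataFrame, f: str, c: str) -> pd.Series:
    """GroupByThenRank(f, c): rank of f within each category, scaled to [0, 1]."""
    return df.groupby(c)[f].rank(pct=True)


def combine(df: pd.DataFrame, c1: str, c2: str) -> pd.Series:
    """Combine(c1, c2): cross the two categorical features into one category."""
    return df[c1].astype(str) + "-" + df[c2].astype(str)


def combine_then_freq(df: pd.DataFrame, c1: str, c2: str) -> pd.Series:
    """CombineThenFreq(c1, c2): frequency of the combined category."""
    combined = combine(df, c1, c2)
    return combined.map(combined.value_counts())
```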

D.3 FEATURE BOOSTING

We describe the main idea of feature boosting in Section 3.2. Here we present the details of the implementation. The first step in feature boosting is to generate the base predictions. In this paper, we use LightGBM to model tabular data and use 5-fold cross-validation to generate the base predictions. For each fold, we first train a LightGBM model on the training set until early stopping on the validation set. Then we generate predictions on the validation set. The predictions on each fold are concatenated to yield the base predictions. The LightGBM model is trained using a set of default parameters for all the datasets. The default parameters use 10,000 estimators, a 0.1 learning rate, 31 leaves, and 200 early-stopping rounds. Other parameters follow the default settings of LightGBM. The second step in feature boosting is to find a new model $f' \in \mathcal{F}'$ that boosts the performance of the base predictions and minimizes $L(f + f') = \sum_{i=1}^{n} l(y_i, f(x_i) + f'(x_i))$. In the implementation, we use the base predictions $\{f(x_i), i = 1, 2, \ldots, n\}$ as the initial predictions in LightGBM. The implementation of LightGBM can use the new features $T'$ as the inputs to continue training from the base predictions. The new model $f'$ is the new LightGBM model after training. Feature boosting can also be extended to neural networks. If $f'$ is a neural network parameterized by $\theta$, we can calculate $\frac{\partial L(f + f')}{\partial \theta}$ and optimize $\theta$ by back-propagation.

We present the hyperparameters of OpenFE for each dataset in Table 14. For datasets where the number of samples or the number of candidate features is not large, the number of data blocks for successive featurewise pruning is set to be small. When the number of data blocks is one, we use all the data to calculate and evaluate new features in successive featurewise pruning, and eliminate new features with a negative reduction in loss. For most datasets, the number of new features is set to 10. The effective new features are usually sparse in the vast pool of candidate new features. For the two multi-class classification datasets, the number of new features is set to 50. For a fair comparison, all the baseline methods include the same number of new features, except for the NN baseline. For the NN baseline, the number of new features is the number of hidden units in the last hidden layer, which is determined by hyperparameter tuning. In OpenFE, the default parameters of LightGBM in successive featurewise pruning use 1,000 estimators, a 0.1 learning rate, 16 leaves, and 3 early-stopping rounds. The default parameters in feature importance attribution are the same except for 50 early-stopping rounds.
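A minimal sketch of the first step (out-of-fold base predictions via 5-fold cross-validation) is shown below. The parameter values mirror the defaults reported above; the function signature, and the assumption that X is a pandas DataFrame and y a numpy array, are illustrative.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold


def base_predictions(X, y, params=None, n_splits=5, seed=0):
    """X: pandas DataFrame of base features, y: numpy array of targets."""
    params = params or {"objective": "regression", "learning_rate": 0.1,
                        "num_leaves": 31, "verbosity": -1}
    oof = np.zeros(len(y), dtype=float)
    for tr_idx, va_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        train_set = lgb.Dataset(X.iloc[tr_idx], label=y[tr_idx])
        valid_set = lgb.Dataset(X.iloc[va_idx], label=y[va_idx], reference=train_set)
        booster = lgb.train(params, train_set, num_boost_round=10000,
                            valid_sets=[valid_set],
                            callbacks=[lgb.early_stopping(200, verbose=False)])
        # Out-of-fold raw scores on the validation fold become the base predictions.
        oof[va_idx] = booster.predict(X.iloc[va_idx], raw_score=True)
    return oof
```

The concatenated out-of-fold predictions are then passed as init_score when continuing training on the new features, as sketched in Section 3.2.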

D.5 FEATURE IMPORTANCE ATTRIBUTION METHODS

MDI is the gain importance embedded in LightGBM. For permutation feature importance, we randomly shuffle the values of features on the validation set, and observe the change in validation loss. For SHAP feature importance, we average the SHAP values of each sample for each feature.
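A sketch of the permutation variant described above is given below: shuffle one feature on the validation set and measure the increase in validation loss. The fitted `booster`, the validation DataFrame, and `loss_fn` are assumed to be given; names are illustrative.

```python
import numpy as np


def permutation_importance(booster, X_va, y_va, loss_fn, n_repeats=5, seed=0):
    """Increase in validation loss after shuffling each feature's values."""
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(y_va, booster.predict(X_va))
    importances = {}
    for col in X_va.columns:
        losses = []
        for _ in range(n_repeats):
            X_perm = X_va.copy()
            # Shuffle only this column, breaking its relationship with the target.
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())
            losses.append(loss_fn(y_va, booster.predict(X_perm)))
        importances[col] = float(np.mean(losses) - base_loss)
    return importances
```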

E.1 DISCOVERED FEATURE TRANSFORMATIONS

In this section, we present and explain the useful transformations discovered by OpenFE for the IEEE-CIS Fraud Detection competition and the DI (Diabetes) dataset. The goal of the IEEE competition is to predict whether an online transaction is fraudulent. The top-1 feature discovered by OpenFE is CombineThenFreq(user id, month), which indicates how many transactions a user makes in a month. We present the SHAP values of the feature in the trained XGBoost model in Figure 3. The SHAP values tell us that when the user makes few transactions in a month, the transactions are more likely to be fraudulent. The top-2 feature is GroupByThenStd(day delta, user id), which is the standard deviation of the days between the current transaction and the first transaction. The SHAP values imply that when many transactions happen in a short period (which corresponds to a small standard deviation), the transactions are more likely to be fraudulent. The top-4 feature is CombineThenFreq(DeviceInfo, user id), which indicates how frequently the user switches devices to make transactions. The actual meanings are masked for most of the features to protect the privacy of users. One of the advantages of OpenFE is the ability to generate useful features without knowing their actual meanings. The DI (Diabetes) dataset collects 10 years (1999–2008) of clinical care at 130 US hospitals, where the goal is to predict readmission (Strack et al., 2014). One can see from Table 2 that OpenFE outperforms the other baseline methods and improves the learning performance by a large margin. The top-1 feature discovered by OpenFE is Freq(patient id), which is the number of times the patient has been admitted to the hospital and is highly predictive of whether the patient will be readmitted. However, other methods may fail to find this new feature since the feature patient id itself is not useful. For example, SAFE (Shi et al., 2020) only generates candidate features based on the base features that are used as splits in XGBoost. However, since patient id itself is not useful, it will not be used for splits in XGBoost. This example also demonstrates that reducing the number of candidate features via heuristic assumptions may risk missing useful candidate features.

E.2 HIGH-ORDER FEATURES

How to search for high-order feature transformations is challenging in automated feature generation due to the explosion of the search space (Chen et al., 2019). Some previous methods argue that high-order features are useful by directly searching for high-order features (Chen et al., 2019; Khurana et al., 2018). However, the effectiveness of a high-order feature transformation should be evaluated in light of all its low-order components. For example, a second-order feature transformation f1 × f2 × f3 is effective only if it offers additional effectiveness beyond all of its first-order components f1 × f2, f1 × f3, and f2 × f3. In addition, because high-order feature transformations are typically more difficult to interpret than low-order ones, low-order feature transformations are favoured in industrial applications where interpretability is important. In short, searching for high-order features in a hierarchical manner is more appropriate than directly searching for high-order features in two respects: 1) it evaluates the effectiveness of high-order features in a more accurate way, and 2) it generates low-order features with better interpretability. Whether it is necessary to generate high-order features must be decided case by case. Because the search space of high-order features is usually incredibly huge (even if we limit the order), none of the existing methods can enumerate all the high-order features within reasonable computational resources. We can hardly claim that high-order features are not useful for a dataset. However, we do not find that generating high-order features is useful for all the benchmarking datasets in our experiments. In the IEEE competition, generating high-order features does not seem to be useful for either Expert or OpenFE. In the BNP competition, generating high-order features provides a small improvement in the test score. We may conclude that first-order features are usually more important than high-order ones in feature generation.

F COMPLEXITY ANALYSIS

Complexity of Generating Base Predictions. Let $n$ be the number of samples, $m$ the number of base features of the dataset, and $k$ the number of folds. The complexity of generating base predictions is $k$ times the cost of GBDT training (foot_6), i.e., $O(kTdnm)$, where each GBDT contains $T$ trees with maximum depth $d$.

Complexity of STAGE I. Suppose we split the dataset into $2^q$ data blocks and the number of candidate features is $m^2$. There are $2^{-q}m^2$ features remaining after successive featurewise halving. The complexity is $\sum_{i=0}^{q} 2^{i-q} \cdot m^2 \cdot C(2^{-i}n, 1)$, where $C(n, m)$ is the complexity of GBDT training on data of shape $n \times m$. If the GBDT contains $T_1$ trees with maximum depth $d_1$, we have the time complexity $O(2^{-q} q T_1 d_1 n m^2)$.

Complexity of STAGE II. The complexity of stage II is dominated by a single GBDT training run. Suppose the GBDT has $T_2$ trees with maximum depth $d_2$; then the time complexity of stage II is $O(2^{-q} T_2 d_2 n m^2)$.

Overall Complexity. In the implementation of OpenFE, the number of trees and the maximum depth of the GBDT are predefined fixed constants. If we regard $T$ and $d$ as constants, the overall complexity of OpenFE is $O(2^{-q} q n m^2)$.
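To make the Stage I bound explicit, the display below works through the sum term by term, using the histogram-building cost from foot_6 so that a single-feature GBDT with $T_1$ trees of depth $d_1$ on $n'$ samples costs $O(T_1 d_1 n')$; the algebra is a sketch of the simplification, not an additional claim.

```latex
% Each term simplifies because the 2^{i-q} and 2^{-i} factors cancel to 2^{-q},
% leaving q+1 identical terms.
\sum_{i=0}^{q} 2^{\,i-q}\, m^2 \cdot C\!\left(2^{-i} n,\ 1\right)
  = \sum_{i=0}^{q} 2^{\,i-q}\, m^2 \cdot O\!\left(T_1 d_1\, 2^{-i} n\right)
  = \sum_{i=0}^{q} O\!\left(2^{-q}\, T_1 d_1\, n\, m^2\right)
  = O\!\left((q+1)\, 2^{-q}\, T_1 d_1\, n\, m^2\right)
  = O\!\left(2^{-q}\, q\, T_1 d_1\, n\, m^2\right).
```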



Footnotes:
foot_0: https://www.kaggle.com/competitions/ieee-fraud-detection/overview
foot_1: https://www.kaggle.com/competitions/ieee-fraud-detection/overview
foot_2: https://www.kaggle.com/code/cdeotte/xgb-fraud-with-magic-0-9600
foot_3: https://www.kaggle.com/competitions/bnp-paribas-cardif-claims-management/
foot_4: Note that the input dimension for the predictor f′ here is different since there is no generated feature. But this is not an issue for our following argument.
foot_5: https://www.kaggle.com/code/confirm/xfeat-cudf-lightgbm-catboost-wip
foot_6: In the complexity analysis, we refer to the implementation of LightGBM. The dominant term in the complexity is building the histogram, i.e., O(nm) per depth per tree.



Figure 1: The overview of OpenFE.

Algorithm 1: OpenFE
input : D: dataset, T: base features, O: operators
output: New feature set.
Initialize order ← 1;
while order < predefined max order do
    Generate base predictions ŷ by T and D using cross-validation;
    Enumerate candidate features A(T) by O and T;
    # two-stage evaluation
    A′(T) = SuccessivePruning(A(T), D, ŷ);
    A″(T) = FeatureAttribution(A′(T), D, ŷ);
    T = T + top_k(A″(T));
    order = order + 1;
return T

Algorithm 2: Successive Pruning
input : D: dataset, ŷ: base predictions, A(T): candidate feature set, 2^q: #data blocks
output: Pruned new feature set.
Divide D equally into 2^q data blocks;
Initialize A_0(T) = A(T);
for i ← 0 to q do
    for each new feature τ ∈ A_i(T) do
        Use a subset D_i with 2^i data blocks to calculate the values of τ;
        Calculate the score of τ as ∆ = FeatureBoosting(D_i, τ, ŷ);
    A_i(T) = delete_same(A_i(T));
    A_{i+1}(T) = top_half(A_i(T));
return A_{q+1}(T)

Figure 2: Comparison between Expert and OpenFE.

Figure 3: The SHAP values of discovered feature transformations. The x-axis is the feature values, and the y-axis is the SHAP values. Larger SHAP values indicate higher probability of fraudulent transactions.



Table 2: Comparisons between OpenFE and baseline methods. We report the mean and standard deviation of the evaluation metric on the test set over 10 random seeds. The top result for each dataset is marked in bold. SAFE and AutoCross can only run on binary-classification datasets. See Table 1 for the evaluation metric of different tasks.

Table 3: The effect of OpenFE on a variety of DNNs. We report the mean and standard deviation of the evaluation metric on the test set over 10 random seeds. The top results are shown in bold.

Table 4: Results of OpenFE and Expert (feature generation by experts) in two Kaggle competitions. Notation: pub. ∼ public, pri. ∼ private. The final standing is determined according to the scores on the private leaderboard.

Table 5: Results for the ablation study. The model is LightGBM.


Table 8: Comparisons between the results of the reproduced methods and the results from their papers. The results are averaged over 10 different random seeds.

Table 9: The running time of different methods in minutes. IEEE: IEEE-CIS Fraud Detection, BNP: BNP Paribas Cardif Claims Management.

Table 10: Comparisons between MDI, permutation, and SHAP in feature importance attribution.

Table 11: The results of validating feature boosting. The number is the reduction in RMSE on the validation set. FB: feature boosting.

We classify the operators into unary operators (Table 12) and binary operators (Table 13), where unary operators act on one feature and binary operators act on two features. Then the operators are applied according to the type of features they act on. For example, GroupByThenMean requires a categorical feature and a numerical feature, while Max requires two numerical features. All the features in a dataset are classified into numerical features, categorical features, and ordinal features. The difference between ordinal features and categorical features is that an ordinal feature has a clear ordering of categories (such as "age"). Ordinal features are treated as both numerical and categorical when generating feature transformations. For example, we can calculate GroupByThenMean(age, gender), which is the average age of each gender. We can also calculate GroupByThenMean(income, age), which is the average income of people of different ages. For anonymized datasets where the meanings of features are unknown, features with string values are treated as categorical features. Features with discrete values (the number of unique values is less than 100) are treated as ordinal features. Features with continuous values are treated as numerical features.

Table 14: The parameters of OpenFE for each dataset. IEEE: IEEE-CIS Fraud Detection. BNP: BNP Paribas Cardif Claims Management.


The distribution of the $i$-th group $P_{g(i)}$ is either $B(\frac{3}{4})$ or $B(\frac{1}{4})$ with equal probability, where $B(p)$ is the Bernoulli distribution such that $\Pr[B(p) = 1] = p = 1 - \Pr[B(p) = 0]$. Thus, $Z_i = \mathbb{E}_{X \sim P_{g(i)}}[X]$ is either $\frac{3}{4}$ or $\frac{1}{4}$ with equal probability for each group. $X_i$ is a 0/1 random variable, generated i.i.d. from $B(\frac{3}{4})$ if $Z_i = \frac{3}{4}$ and from $B(\frac{1}{4})$ if $Z_i = \frac{1}{4}$. We assume the target of each data point is given by ... Similarly, the probability of ... The last inequality holds because the left-hand side is a quadratic function. With the same reasoning, we have ... As a result, the test loss for any predictor $f'$ is at least ... The above argument does not depend on how large $k_1$, $k_2$, and $h$ are.

B DATA

We describe the details of the datasets in Table 6. All the datasets can be found in the supplementary materials.

C ADDITIONAL RESULTS

C.6 VALIDATE FEATURE BOOSTING

We conduct experiments to validate that feature boosting can estimate the incremental performance of new features. We show that:
• Features that are not effective in the presence of the base features have a zero or negative reduction in loss $\Delta$ when evaluated using feature boosting. For example, the base features are not effective in addition to themselves, and they should have a zero or negative $\Delta$.
• Features that are effective in the presence of the base features have a positive reduction in loss $\Delta$ when evaluated using feature boosting. For example, a synthetic feature generated using the information in the targets should have a positive $\Delta$.
We perform the experiment on the CA dataset ($D$) with 8 base features. We generate the base predictions $\hat{y}$ from the 8 base features for feature boosting. The experiments include:
• For each base feature, we train a GBDT $f'$ using the single base feature. We calculate $\Delta' = L(\emptyset) - L(f')$ for each feature. We present the results in the second row of Table 11. We show through this experiment that each base feature can indeed explain the targets.
• For each base feature $bf$, we calculate $\Delta = \mathrm{FeatureBoosting}(D, bf, \hat{y})$. We present the results in the fourth row of Table 11. We can see that all of the base features have a zero or negative $\Delta$.
• We generate 8 synthetic features using the information in the targets. The formula is $f = 0.3 \times y + \epsilon$, where $\epsilon$ follows a normal distribution with the same mean and standard deviation as $y$. For each synthetic feature $sf$, we calculate $\Delta = \mathrm{FeatureBoosting}(D, sf, \hat{y})$. We present the results in the fifth row of Table 11.
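A small sketch of the synthetic-feature construction used in this check is shown below, assuming the `feature_boosting` routine sketched in Section 3.2; the helper name is illustrative.

```python
import numpy as np


def make_synthetic_feature(y, seed=0):
    """f = 0.3 * y + eps, with eps ~ Normal(mean(y), std(y)) as described above."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    eps = rng.normal(loc=np.mean(y), scale=np.std(y), size=len(y))
    return 0.3 * y + eps


# Expected outcome, mirroring Table 11: each base feature evaluated against the
# base predictions yields Delta <= 0, while a synthetic feature built from the
# targets yields Delta > 0, because it carries information beyond the base features.
```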

