BEYOND DEEP LEARNING: AN EVOLUTIONARY FEATURE ENGINEERING APPROACH TO TABULAR DATA CLASSIFICATION

Abstract

In recent years, deep learning has achieved impressive performance in the computer vision and natural language processing domains. In the tabular data classification scenario, with the emergence of the transformer architecture, a number of algorithms have been reported to yield better results than conventional tree-based models. Most of these methods attribute the success of deep learning to the expressive feature construction capability of neural networks. Nonetheless, in practice, manually designed high-order features combined with traditional machine learning methods are still widely used, because neural-network-based features are prone to over-fitting. In this paper, we propose an evolution-based feature engineering algorithm that imitates the manual feature construction process through trial and improvement. Importantly, the evolutionary method makes it possible to optimize the cross-validation loss directly, which gradient-based methods cannot do. On a large-scale classification benchmark of 119 datasets, the experimental results demonstrate that the proposed method outperforms existing fine-tuned state-of-the-art tree-based and deep-learning-based classification algorithms.

1. INTRODUCTION

Tabular data is a typical way of storing information in a computer system, and tabular data learning methods have been widely employed in recommendation systems (Wang et al., 2021), advertising (Yang et al., 2021), and the medical industry (Spooner et al., 2020) to automate the decision process. Formally, tabular data learning techniques capture the association between a set of explanatory variables $\{x_1, \dots, x_m\}$ and a response variable $y$ for a given dataset $\{(\{x_1^1, \dots, x_m^1\}, y^1), \dots, (\{x_1^n, \dots, x_m^n\}, y^n)\}$, where $n$ is the number of data items. Various machine learning methods can address the tabular data learning problem, such as decision trees, linear models, and support vector machines. However, these learning algorithms have a limited capability to fully tackle this problem. For example, the linear model assumes that the relationship between the explanatory variables and the response variable is linear, whereas the decision tree posits that an axis-parallel decision boundary can divide distinct categories of samples. These assumptions are not always correct in the real world, and thus machine learning practitioners often employ feature engineering techniques to construct a set of high-order features $\{\{\phi_1^1, \dots, \phi_k^1\}, \dots, \{\phi_1^n, \dots, \phi_k^n\}\}$ to improve the learning capability of existing algorithms.

In real-world applications, many feature engineering methods can be used by industrial practitioners to construct high-order features before training a machine learning model. Conventional feature engineering strategies include manually designed features, dimensionality reduction methods (Colange et al., 2020), kernel methods (Allen-Zhu & Li, 2019), and so on. However, these methods have drawbacks.
Manual feature construction approaches require a large amount of human labor, kernel methods are difficult to combine with tree-based methods, and dimensionality reduction algorithms, especially unsupervised ones, are insufficient to enhance state-of-the-art learning algorithms (Lensen et al., 2022). With the success of deep learning, there is growing interest in using neural networks to build high-order features automatically, particularly in the field of recommendation systems (Song et al., 2019; Lian et al., 2018). However, it is debatable whether deep learning techniques can learn high-order features effectively (Wang et al., 2021). Although various neural-network-based tabular learning algorithms have shown better predictive accuracy than some tree-based ensemble models (Arık & Pfister, 2021), most of these methods only validate their algorithms on datasets with a large number of training samples and homogeneous features. It remains controversial whether deep learning methods outperform tree-based methods on datasets with heterogeneous features and a limited amount of data (Gorishniy et al., 2021), and we hope to explore this question further in this paper.

In real-world cases, machine learning engineers typically start from some randomly designed features, and then evaluate these features to obtain cross-validation losses and feature importance values. This feedback can be used by engineers to design better features. To imitate this workflow, we design a feature engineering method named evolutionary feature engineering automation tool (EvoFeat)1, which constructs nonlinear features to enhance an ensemble of decision tree and logistic regression models, thereby achieving better results than conventional ensemble learning methods.
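The trial-and-improvement workflow just described can be sketched in miniature. The snippet below is a hypothetical toy version, not the actual EvoFeat implementation: a constructed feature is simply a binary operator applied to two column indices, candidates are scored purely by cross-validation accuracy (the real method also uses feature importance values), and the search is a plain mutate-and-select loop rather than genetic programming. The operator set, population size, and generation count are arbitrary choices for illustration.

```python
import random

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

random.seed(0)
X, y = load_breast_cancer(return_X_y=True)

# A "feature" here is (operator name, column i, column j); the protected
# division mirrors the common evolutionary-computation trick of avoiding
# division by zero.
ops = {
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
    "div": lambda a, b: a / (np.abs(b) + 1e-6),
}

def random_feature():
    return (random.choice(list(ops)),
            random.randrange(X.shape[1]),
            random.randrange(X.shape[1]))

def build(features):
    # Append the constructed columns to the raw feature matrix.
    cols = [ops[op](X[:, i], X[:, j]) for op, i, j in features]
    return np.column_stack([X] + cols)

def cv_score(features):
    # Cross-validation loss (here, accuracy) is the search signal:
    # exactly the feedback an engineer would look at after each trial.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    return cross_val_score(model, build(features), y, cv=3).mean()

# Trial and improvement: keep the better half, refill with mutated copies.
pop = [[random_feature() for _ in range(3)] for _ in range(8)]
for gen in range(5):
    pop.sort(key=cv_score, reverse=True)
    survivors = pop[: len(pop) // 2]
    children = [s[:-1] + [random_feature()] for s in survivors]  # mutate one feature
    pop = survivors + children

baseline = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)), X, y, cv=3
).mean()
best = cv_score(pop[0])
print(f"baseline CV accuracy: {baseline:.3f}, with evolved features: {best:.3f}")
```

Because selection operates on cross-validation scores rather than a differentiable training loss, this loop can directly optimize a quantity that gradient-based feature learners cannot.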
The three main contributions of our paper are summarized as follows:

• Considering that cross-validation losses and feature importance values are both widely used by machine learning engineers in the feature engineering process, we propose an evolutionary feature construction method that uses both kinds of information to guide the search process.

• Given that the performance of the constructed features is closely tied to the base learner, we propose a mechanism that allows the evolutionary algorithm to automatically determine whether to use random decision trees or logistic regression models as base learners.

• We conduct experiments on 119 datasets and compare our method to six traditional machine learning algorithms, four deep learning algorithms, and two automated feature construction methods. The experimental results show that deep learning methods are not a silver bullet for classification tasks.

The remainder of the paper is structured as follows. Section 2 introduces related work on feature construction. Section 3 discusses the motivation behind our work. Section 4 then explains the details of the proposed method, followed by the experimental results in Section 5. Finally, Section 6 summarizes our paper and outlines some intriguing future directions.
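As a concrete illustration of the limitation discussed in the introduction, the toy example below shows how a linear model fails on a relationship driven by a product of features, while adding one hand-constructed high-order feature makes the task linearly separable. The synthetic data and the choice of logistic regression here are our own illustrative assumptions, not an experiment from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic task whose label depends on a feature *product*:
# y = 1 iff x1 * x2 > 0, which no linear boundary in (x1, x2) can express.
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Raw features: a linear model is close to chance.
acc_raw = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# One hand-constructed high-order feature, phi = x1 * x2, makes the
# problem linearly separable by the sign of phi.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
acc_aug = cross_val_score(LogisticRegression(), X_aug, y, cv=5).mean()

print(f"raw features:       {acc_raw:.2f}")
print(f"with x1*x2 added:   {acc_aug:.2f}")
```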

2. RELATED WORK

Evolutionary feature construction methods have received a lot of attention in the evolutionary learning domain. Nonetheless, the majority of these methods focus on simple learners, such as a single decision tree (Tran et al., 2019) or a single support vector machine (Nag & Pal, 2019). Research on how to apply evolutionary feature construction methods to improve state-of-the-art classification algorithms is still lacking. In recent years, ensemble-based evolutionary feature construction methods have produced encouraging results on both symbolic regression (Zhang et al., 2021) and image classification (Bi et al., 2020; 2021a) tasks. However, feature construction on these tasks differs from tabular data classification because domain knowledge for designing primitive functions is available, such as the analytical quotient (Ni et al., 2012) for symbolic regression and the histogram of oriented gradients (HOG) (Dalal & Triggs, 2005) for image classification.

In addition to evolution-based methods, beam search can also be used to construct high-order features (Katz et al., 2016; Luo et al., 2019; Shi et al., 2020). The main idea behind these methods is to start with some low-order features and iteratively generate higher-order features based on important low-order features. A meta learner (Katz et al., 2016), logistic regression accuracy (Luo et al., 2019), or XGBoost feature importance (Shi et al., 2020) can be used to determine the importance of each generated feature. However, these methods lack an effective mechanism to prevent over-fitting and thus may have restricted feature construction capabilities.

In the recommendation system domain, deep learning methods have been shown to be an effective way to construct high-order features for tabular data classification tasks (Guo et al., 2017). Even though the feed-forward neural network (FNN) is considered a universal function approximator (Kidger & Lyons, 2020), its ability to model high-order feature interactions has been questioned (Mudgal et al., 2018). In order to construct high-order features efficiently, a cross network is designed in Deep & Cross Network (DCN) to explicitly model feature interactions (Mudgal et al., 2018).

1 Source code: https://tinyurl.com/ICLR2023-EvoFEAT
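The beam-search construction strategy discussed in this section can be sketched as follows. This is a schematic under assumed design choices, not any of the cited systems' actual implementation: pairwise products serve as the candidate-generation operator, the cross-validation accuracy of a logistic regression serves as the scoring function (roughly in the spirit of Luo et al., 2019), and the beam width is fixed at two.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
n = X.shape[1]

def score(extra_cols):
    # Score a candidate feature set by the CV accuracy of a linear model
    # trained on the raw features plus the constructed columns.
    Xc = np.column_stack([X] + list(extra_cols)) if extra_cols else X
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    return cross_val_score(model, Xc, y, cv=3).mean()

beam_width = 2

# Round 1: generate all pairwise products of low-order (raw) features,
# keep only the top `beam_width` in the beam.
pairs = [(i, j) for i in range(n) for j in range(i, n)]
beam = sorted(pairs,
              key=lambda p: score([X[:, p[0]] * X[:, p[1]]]),
              reverse=True)[:beam_width]
best_cols = [X[:, i] * X[:, j] for i, j in beam]

# Round 2: expand only the surviving features with one more raw factor,
# i.e. higher-order candidates grow from important lower-order ones.
third = max((c * X[:, k] for c in best_cols for k in range(n)),
            key=lambda col: score(best_cols + [col]))

acc_base = score([])
acc_beam = score(best_cols + [third])
print(f"baseline: {acc_base:.3f}, with beam-searched features: {acc_beam:.3f}")
```

Note that every candidate is scored on the same cross-validation split with no held-out control, which illustrates why such pipelines can over-fit the selection signal without an explicit safeguard.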

