METAFS: AN EFFECTIVE WRAPPER FEATURE SELECTION VIA META LEARNING

Abstract

Feature selection is of great importance and is applied in many fields, such as medicine and commerce. Wrapper methods, which directly compare the performance of different feature combinations, are widely used in real-world applications. However, selecting effective features faces two main challenges: 1) feature combinations are distributed in a huge discrete space; and 2) evaluating combinations efficiently and precisely is hard. To tackle these challenges, we propose a novel deep meta-learning-based feature selection framework, termed MetaFS, consisting of a Feature Subset Sampler (FSS) and a Meta Feature Estimator (MetaFE), which transforms the discrete search space into a continuous one and adopts meta-learning for effective feature selection. Specifically, FSS parameterizes the distribution over the discrete search space, which can then be optimized with gradient-based methods. MetaFE learns the representations of different feature combinations and dynamically generates a unique model for each combination without retraining, enabling efficient and precise combination evaluation. We optimize MetaFS with a bi-level optimization strategy. After optimization, we evaluate multiple feature combinations sampled from the converged distribution (i.e., the condensed search space) and select the optimal one. Finally, we conduct extensive experiments on two datasets, illustrating the superiority of MetaFS over 7 state-of-the-art methods.

1. INTRODUCTION

Human activities and real-world applications generate huge amounts of data with large numbers of features. Often, we need to identify the effects of different features and select an optimal feature combination for further study. For example, on online shopping websites, marketers analyze purchase data to study the purchasing preferences of different groups of people and to construct user profiles Allegue et al. (2020), which requires selecting a small number of representative shopping categories for further study. Supervised feature selection (FS) Cai et al. (2018b), which makes full use of label information (e.g., the group classifications), is apposite for such problems.

Numerous supervised feature selection frameworks have been proposed, including filter Miao & Niu (2016), embedding Tibshirani (1996); Gui et al. (2019), and wrapper El Aboudi & Benhlima (2016) frameworks, shown in Figures 1(a, b, and c). Filter frameworks evaluate each feature individually, while embedding frameworks greedily weight features. Both evaluate individual features rather than combinations, so their selection results are less effective. Wrapper frameworks traverse feature combinations and directly score them with wrapped supervised algorithms, e.g., K-nearest neighbors (KNN) and linear regression (LR), which significantly improves the effectiveness of feature selection Sharma & Kaur (2021). However, the number of feature combinations grows exponentially with the number of features. Evaluating so many combinations is computationally intensive, and restricting the search to a small number of evaluations limits the final performance. In summary, two major problems remain in wrapper frameworks:

Problem 1: How to search feature combinations in a large discrete space? If we select k effective features from a feature set of size n, we must consider $\binom{n}{k}$ possible feature combinations (e.g., choosing k = 10 out of n = 100 features already yields roughly $1.7 \times 10^{13}$ candidates), which forms an exponentially large search space. It is impractical to traverse and compare all feature combinations in such a huge discrete space.

Problem 2: How to evaluate feature combinations efficiently and precisely? Training a dedicated model for every candidate combination is prohibitively expensive, so existing wrapper methods typically wrap simple models, e.g., KNN and LR. However, these models can hardly capture the complex correlations among features and struggle to evaluate different subsets precisely (see Section 5.3).

To address these problems, as shown in Figure 1(d), we propose a meta-learning-based feature selection framework (MetaFS). MetaFS first transforms the discrete search problem into a continuous optimization problem over a distribution of feature combinations, in which good combinations have higher sampling probabilities than bad ones. Specifically, we adopt differentiable parameters to represent the probability distribution function so that it can be optimized easily. Next, we sample feature combinations from this distribution and adopt a deep meta-learning technique, which maintains combination representations and generates a unique evaluation model for each combination without re-training, to evaluate feature combinations efficiently and precisely. Then, according to the evaluation results, we increase the probabilities of good combinations and decrease those of bad ones by optimizing the distribution function with gradient-based methods (illustrative sketches of both components and of the overall loop are given after the contribution list below). Finally, we search for the optimal combination in the condensed search space.

The main contributions of our work are four-fold:

• We propose a novel deep meta-learning-based feature selection framework, entitled MetaFS. It transforms the discrete search space into a continuous one and applies bi-level optimization, which alternately condenses the continuous search space and learns a meta-learning-based combination evaluation network for effective feature selection.
• We propose a Feature Subset Sampler (FSS) to parameterize the distribution of the search space, which allows us to condense the search space with gradient-based methods, effectively reducing the difficulty of the search process.
• We propose a Meta Feature Estimator (MetaFE) to evaluate feature combinations efficiently and precisely. It learns the representations of different feature combinations and generates a unique evaluation model for each feature combination without re-training.
• We conduct extensive experiments on two real-world benchmark datasets to verify our framework. The experimental results demonstrate the superiority of MetaFS.
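To make the FSS idea concrete, below is a minimal PyTorch sketch under our own assumptions: the distribution over k-feature subsets is parameterized by per-feature logits, subsets are sampled without replacement, and the distribution is condensed with a REINFORCE-style score-function gradient. The class and method names (FeatureSubsetSampler, sample, update) and the exact estimator are illustrative, not the paper's specification.

```python
import torch

class FeatureSubsetSampler:
    """Hypothetical sketch of FSS: differentiable logits parameterize a
    distribution over k-feature subsets; gradients raise the sampling
    probability of well-scoring combinations."""

    def __init__(self, n_features: int, k: int, lr: float = 0.1):
        self.logits = torch.zeros(n_features, requires_grad=True)
        self.k = k
        self.opt = torch.optim.Adam([self.logits], lr=lr)

    def sample(self) -> torch.Tensor:
        # Draw k distinct feature indices with probabilities from the logits.
        probs = torch.softmax(self.logits.detach(), dim=0)
        return torch.multinomial(probs, self.k, replacement=False)

    def update(self, idx: torch.Tensor, score: float, baseline: float):
        # REINFORCE-style update (our assumption): push up the log-probability
        # of sampled features in proportion to their advantage over a baseline.
        log_prob = torch.log_softmax(self.logits, dim=0)[idx].sum()
        loss = -(score - baseline) * log_prob
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```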
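MetaFE is described as generating a unique evaluation model for each combination without re-training. The paper gives no implementation details in this section, so the following is a hypernetwork-style sketch under our own assumptions: per-feature embeddings, a weight generator that maps them to first-layer weights, and the layer sizes are all illustrative choices.

```python
import torch
import torch.nn as nn

class MetaFeatureEstimator(nn.Module):
    """Hypothetical sketch of MetaFE: learned per-feature representations
    are mapped to the weights of a small evaluation head, so every sampled
    combination gets its own model parameters without re-training."""

    def __init__(self, n_features: int, emb_dim: int = 16, hidden: int = 32):
        super().__init__()
        self.feat_emb = nn.Embedding(n_features, emb_dim)  # feature representations
        self.hyper = nn.Linear(emb_dim, hidden)            # generates first-layer weights
        self.head = nn.Linear(hidden, 1)                   # shared output layer

    def forward(self, x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) raw inputs; idx: (k,) selected feature ids.
        w = self.hyper(self.feat_emb(idx))   # (k, hidden) combination-specific weights
        h = torch.relu(x[:, idx] @ w)        # (batch, hidden) hidden activations
        return self.head(h)                  # (batch, 1) prediction
```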
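Finally, a hedged sketch of how the bi-level optimization mentioned above might alternate the two components: an inner step that fits MetaFE on training batches and an outer step that scores a sampled combination on validation data and condenses the FSS distribution. The function name bilevel_search, the MSE loss, and the exponential-moving-average baseline are our assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def bilevel_search(fss, metafe, train_loader, val_x, val_y, epochs: int = 50):
    """Alternate fitting MetaFE (inner level) and condensing the FSS
    distribution (outer level); a sketch, not the paper's algorithm."""
    opt = torch.optim.Adam(metafe.parameters(), lr=1e-3)
    baseline = 0.0
    for _ in range(epochs):
        for x, y in train_loader:            # inner: train MetaFE on sampled subsets
            idx = fss.sample()
            loss = F.mse_loss(metafe(x, idx).squeeze(-1), y)
            opt.zero_grad(); loss.backward(); opt.step()
        idx = fss.sample()                   # outer: score on validation data
        with torch.no_grad():
            score = -F.mse_loss(metafe(val_x, idx).squeeze(-1), val_y).item()
        baseline = 0.9 * baseline + 0.1 * score  # running baseline for the update
        fss.update(idx, score, baseline)
    return fss, metafe
```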



Figure 1: Supervised Feature Selection Frameworks and our MetaFS

2. RELATED WORK

Feature selection methods Cai et al. (2018b) can be classified into supervised, semi-supervised Yu et al. (2018); Liu et al. (2019), and unsupervised methods Liu et al. (2011); Tang et al. (2019). Unsupervised and semi-supervised methods cannot make full use of label information and thus perform poorly on supervised tasks. Filter frameworks, e.g., MI Yang & Pedersen (1997), compute correlation metrics for each feature independently, while embedding frameworks Gui et al. (2019); Singh et al. (2020) estimate the contribution (e.g., attention or weight) of features during model construction. Several neural-network-based embedding methods have been proposed in recent years. For example, Gui et al. (2019) carries out feature selection in a learned latent representation space with an attention mechanism, Singh et al. (2020) selects features with a unique loss function and a well-designed neural network structure, and Wang et al. (2019) iteratively trains a hierarchical neural network for a better

