METAFS: AN EFFECTIVE WRAPPER FEATURE SELECTION VIA META LEARNING

Abstract

Feature selection is of great importance and is applied in many fields, such as medicine and commerce. Wrapper methods, which directly compare the performance of different feature combinations, are widely used in real-world applications. However, selecting effective features faces two main challenges: 1) feature combinations are distributed in a huge discrete space; and 2) evaluating combinations efficiently and precisely is hard. To tackle these challenges, we propose a novel deep meta-learning-based feature selection framework, termed MetaFS, containing a Feature Subset Sampler (FSS) and a Meta Feature Estimator (MetaFE), which transforms the discrete search space into a continuous one and adopts meta-learning techniques for effective feature selection. Specifically, FSS parameterizes the distribution over the discrete search space and optimizes it with gradient-based methods. MetaFE learns the representations of different feature combinations and dynamically generates unique models without retraining for efficient and precise combination evaluation. We adopt a bi-level optimization strategy to optimize MetaFS. After optimization, we evaluate multiple feature combinations sampled from the converged distribution (i.e., the condensed search space) and select the optimal one. Finally, we conduct extensive experiments on two datasets, illustrating the superiority of MetaFS over 7 state-of-the-art methods.

1. INTRODUCTION

Human activities and real-world applications generate huge amounts of data with large numbers of features. We often need to identify the effects of different features and select an optimal feature combination for further study. For example, on online shopping websites, marketers analyze purchase data to study the purchasing preferences of different groups of people and to construct user profiles Allegue et al. (2020), which requires selecting a small number of representative shopping categories for further study. Supervised feature selection (FS) Cai et al. (2018b), which makes full use of label information (e.g., the group classifications), is well suited to these problems.

Numerous supervised feature selection frameworks have been proposed, including filter Miao & Niu (2016), embedding Tibshirani (1996); Gui et al. (2019), and wrapper El Aboudi & Benhlima (2016) frameworks, shown in Figures 1(a), 1(b), and 1(c). Filter frameworks evaluate each feature individually, while embedding frameworks greedily weight features. Both evaluate individual features rather than combinations, so their selection results are less effective. Wrapper frameworks traverse feature combinations and score them directly with wrapped supervised algorithms, e.g., K-nearest neighbors (KNN) and Linear Regression (LR), which significantly improves the effectiveness of feature selection Sharma & Kaur (2021). However, the number of feature combinations grows exponentially with the number of features. Evaluating so many combinations is computationally intensive, and evaluating only a small number of them limits the improvement in final performance. In summary, two major problems remain in wrapper frameworks:

Problem 1: How to search feature combinations in a large discrete space? If we select k effective features from a feature set of size n, we need to consider $\binom{n}{k}$ possible feature combinations, which form an exponentially large search space. It is impractical to traverse and compare all feature combinations in such a huge discrete space.

2021). However, these models struggle to capture the complex correlations among features and to evaluate different subsets precisely (see Section 5.3).

To address these problems, as shown in Figure 1(d), we propose a meta-learning-based feature selection framework (MetaFS). MetaFS first transforms the discrete search problem into a continuous optimization problem over a feature combination distribution, in which good feature combinations have higher sampling probabilities than bad ones. Specifically, we adopt differentiable parameters to represent the probability distribution function, so that it can be easily optimized. Next, we sample feature combinations from this distribution and adopt a deep meta-learning technique, which maintains the combination representations and generates a unique evaluation model for each combination without re-training, to evaluate feature combinations efficiently and precisely. Then, according to the evaluation results, we increase the probabilities of good combinations and decrease those of bad ones by optimizing the probability distribution function with a gradient-based method. Finally, we search for the optimal combination in the condensed search space. The main contributions of our work are fourfold:

• We propose a novel deep meta-learning-based feature selection framework, entitled MetaFS. It transforms the discrete search space into a continuous one and applies bi-level optimization, which alternately condenses the continuous search space and learns a meta-learning-based combination evaluation network for effective feature selection.
• We propose the Feature Subset Sampler (FSS) to parameterize the distribution of the search space, which allows us to use gradient-based methods to condense the search space, effectively reducing the difficulty of the search process.
• We propose the Meta Feature Estimator (MetaFE) to evaluate feature combinations efficiently and precisely. It learns the representations of different feature combinations and generates a unique evaluation model for each feature combination without re-training.
• We conduct extensive experiments on two real-world benchmark datasets to verify our framework. The experimental results demonstrate the superiority of MetaFS.

3. PROBLEM FORMULATION

Suppose there are n samples and m features. Let $X = [x_1, \cdots, x_m]^\top \in \mathbb{R}^{n \times m}$ denote the input features, where each $x_k \in \mathbb{R}^n$ is the k-th input feature. Let $y = [y_1, \cdots, y_n] \in \mathbb{R}^n$ denote the labels. Our goal is to find a feature subset $I = \{i_1, \cdots, i_r\}$ with corresponding input data $X_I = [x_{i_1}, \cdots, x_{i_r}]^\top \in \mathbb{R}^{n \times r}$, such that the model trained on the training dataset achieves the minimum loss on the validation dataset. The optimization target can be formulated as:

$$\min \sum_{I \in \mathcal{I}} \lambda_I L\left(X^{val}_I, y^{val}; \Theta_I\right), \quad \text{s.t.} \quad \begin{cases} \Theta_I = \arg\min_{\Theta^\star} L\left(X^{train}_I, y^{train}; \Theta^\star\right) \\ \sum_{I \in \mathcal{I}} \lambda_I = 1, \quad \forall \lambda_I \in \{0, 1\} \end{cases} \tag{1}$$

where $\mathcal{I}$ represents the set of all subsets, $\lambda_I$ is a binary variable that denotes whether subset I is selected, $\Theta_I$ denotes the trainable parameters for subset I, and L denotes the loss function.
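To make the bi-level structure of Eq. 1 concrete, the following is a minimal sketch that solves it by brute force on a toy problem. It is not the method proposed in this paper: the linear least-squares model, the mean squared error loss, the synthetic data, and the helper names `fit_inner`, `val_loss`, and `exhaustive_wrapper` are all assumptions chosen for illustration, and rows of `X` are samples while columns are features.

```python
from itertools import combinations

import numpy as np


def fit_inner(X_train, y_train):
    """Inner problem: Theta_I = argmin_Theta L(X_train_I, y_train; Theta).

    Assumption: a linear least-squares model stands in for the wrapped model.
    """
    theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return theta


def val_loss(X_val, y_val, theta):
    """Outer objective for one subset: mean squared error on validation data."""
    return float(np.mean((X_val @ theta - y_val) ** 2))


def exhaustive_wrapper(X_train, y_train, X_val, y_val, r):
    """Score every size-r subset and return the best one (tiny m only)."""
    m = X_train.shape[1]
    best_subset, best_loss = None, np.inf
    for subset in combinations(range(m), r):
        cols = list(subset)
        theta = fit_inner(X_train[:, cols], y_train)
        loss = val_loss(X_val[:, cols], y_val, theta)
        if loss < best_loss:
            best_subset, best_loss = subset, loss
    return best_subset, best_loss


# Synthetic data (assumption): only features 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.01 * rng.normal(size=200)
best_subset, best_loss = exhaustive_wrapper(X[:150], y[:150], X[150:], y[150:], r=2)
```

With 6 features and r = 2 the loop visits only 15 subsets; the same enumeration is infeasible at the feature counts used in the experiments, which is the motivation for relaxing the binary coefficients.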

4. METHODOLOGIES

As the number of subsets is exponential and the $\lambda_I$ are discrete, Eq. 1 is hard to optimize. Inspired by Lovász (1975), we relax $\lambda_I$ to be continuous, i.e., $0 \le \lambda_I \le 1$. In this way, the subset with the maximum $\lambda_I$ is equivalent to the optimal result of Eq. 1, and we can apply gradient-based methods to optimize the feature selection continuously. In this paper, we propose MetaFS, consisting of two components: (1) the Feature Subset Sampler (FSS) and (2) the Meta Feature Estimator (MetaFE). Specifically, FSS parameterizes a subset sampling distribution to model the continuous coefficients $\lambda_I$ and samples subsets for MetaFE. Next, MetaFE applies meta-learning techniques to efficiently generate unique models without re-training, so that sampled subsets can be evaluated precisely. Then, FSS and MetaFE are collaboratively optimized to condense the search space (i.e., to increase the sampling probabilities of good subsets) and finally find the optimal subset. In the following subsections, we describe the structures of FSS and MetaFE and the collaborative optimization process.

4.1. FEATURE SUBSET SAMPLER

There are $\binom{m}{r}$ coefficients $\lambda_I$ in total, which is too many to maintain directly. As the constraint on $\lambda_I$ (after relaxation) is similar to that of a probability distribution, we propose the Feature Subset Sampler (FSS), which uses a probability distribution to model these coefficients:

$$\lambda_I \leftarrow P_{set}(I), \quad I = \{i_1, \ldots, i_r\} \in \mathbb{N}^r \tag{2}$$

where $P_{set}(I)$ is the sampling probability of subset I. For each sampling, we select r features without replacement according to the features' selection probabilities. Suppose $p_i$ is the probability that the i-th feature is selected in a single sampling; the subset sampling probabilities are then formulated as:

$$P_{set}(I_i) = \frac{\prod_{i_j \in I_i} p_{i_j}}{\sum_{I_k \in \mathcal{I}} \prod_{i_j \in I_k} p_{i_j}} \tag{3}$$

where $\mathcal{I}$ represents the set of all possible subsets. When the features' selection probabilities are fixed, e.g., while sampling subsets, the denominator is a constant and the subset sampling probability is proportional to the product of the features' selection probabilities, i.e., $P_{set}(I_i) \propto \prod_{i_j \in I_i} p_{i_j}$.

FSS maintains m trainable feature importances $c = [c_1, \ldots, c_m] \in \mathbb{R}^m$ to model the feature selection probabilities $[p_1, \ldots, p_m]$. In this paper, we employ the softmax function and define the selection probability of feature i as:

$$p_i = \frac{\exp(c_i)}{\sum_{k=1}^{m} \exp(c_k)} \tag{4}$$

Then, the sampling probability of subset $I_i$ is proportional to the exponential of the sum of the corresponding feature importances:

$$P_{set}(I_i) \propto \prod_{j \in I_i} p_j \propto \prod_{j \in I_i} \exp(c_j) = \exp\Big(\sum_{j \in I_i} c_j\Big) \tag{5}$$

Then, the meta learner generates the weight parameters $W_I \in \mathbb{R}^{d_{in} \times d_{out}}$ with a trainable fully connected neural network (FC network) according to the learned subset embedding $e_I$. Finally, we apply $W_I$ to compute $x_{out}$ according to Eq. 6. Note that instead of training the FC weight parameters from scratch, MetaFE generates them from the subset embeddings and meta parameters (Figure 2b). In this way, MetaFE can generate a unique fully connected neural network for each input subset, and this generation does not require re-training, which significantly improves the evaluation efficiency of feature selection.
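The weight-generation idea behind MetaFC can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the class name `MetaFC`, the use of the mean to fuse the feature embeddings into a subset embedding, the single linear layer used as the generator, and all layer sizes are hypothetical choices.

```python
import numpy as np


class MetaFC:
    """Fully connected layer whose weights are generated from a feature subset."""

    def __init__(self, m, d_e, d_in, d_out, rng):
        self.d_in, self.d_out = d_in, d_out
        # One trainable embedding per feature: [e_1, ..., e_m].
        self.E = rng.normal(scale=0.1, size=(m, d_e))
        # Meta parameters of the generator (assumption: one linear layer).
        self.G = rng.normal(scale=0.1, size=(d_e, d_in * d_out))
        self.g = np.zeros(d_in * d_out)
        # Bias b of the generated layer.
        self.b = np.zeros(d_out)

    def generate_weights(self, subset):
        """W_I = Learner(I): fuse the embeddings, then map them to a weight matrix."""
        # Assumption: fusion by averaging, so feature order does not matter.
        e_I = self.E[list(subset)].mean(axis=0)
        return (e_I @ self.G + self.g).reshape(self.d_in, self.d_out)

    def forward(self, x_in, subset):
        """x_out = x_in W_I + b, with W_I generated rather than trained per subset."""
        return x_in @ self.generate_weights(subset) + self.b


rng = np.random.default_rng(0)
layer = MetaFC(m=10, d_e=8, d_in=3, d_out=5, rng=rng)
x = rng.normal(size=(4, 3))           # 4 samples restricted to r = 3 features
out_a = layer.forward(x, (0, 2, 5))   # weights generated for subset {0, 2, 5}
out_b = layer.forward(x, (1, 3, 7))   # different subset, different weights
```

Because only `E`, `G`, `g`, and `b` are trainable, evaluating a new subset costs one forward pass through the generator instead of a full training run.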

4.3. SEARCHING PROCEDURE

Feature selection (Eq. 1) can be regarded as a bi-level optimization problem Colson et al. (2007). Inspired by Cai et al. (2018a), we employ a collaborative optimization procedure that optimizes MetaFE and FSS to finally search for the best subset. It contains three stages, shown in Figure 3(a).

Collaborative learning. FSS and MetaFE are alternately optimized to simultaneously condense the search space described by FSS and increase the evaluation accuracy of MetaFE:

• Updating FSS. FSS is trained to increase the sampling probability of good subsets and decrease that of bad ones, which condenses the search space. As shown in Figure 3(b), we first sample a batch of subsets and evaluate them with MetaFE, and then calculate the gradient descent direction of the feature importances to update FSS. Specifically, for a batch of sampled subsets $[I_1, \ldots, I_{b_s}]$ of size $b_s$, we select a batch of samples for each subset and feed these samples into MetaFE to calculate the supervised learning losses (subset evaluations) $[l_1, \ldots, l_{b_s}]$ of the different subsets. Finally, we update the feature importance c to increase the probability of good subsets, such that a subset with a lower loss has a higher sampling probability:

$$c^\star = \arg\min_c \sum_{k=1}^{b_s} l_k P_{set}(I_k) = \arg\min_c \sum_{k=1}^{b_s} l_k \exp\Big(\sum_{i \in I_k} c_i\Big),$$

where $P_{set}(I_k)$ is the sampling probability of subset $I_k$, which is proportional to the exponential of the sum of the corresponding feature importances (Eq. 5). In this work, we employ gradient descent to update the feature importance c.

• Updating MetaFE. Similar to FSS, MetaFE is updated by batch gradient descent. For each update, we first sample $b_s$ subsets and select a batch of samples $[X_{I_1}, \ldots, X_{I_{b_s}}] \in \mathbb{R}^{b_s \times b_d \times r}$ for each of them, where $b_d$ is the batch size of samples. Then, we update MetaFE to increase the accuracy of subset evaluation, such that

$$\min_{\Theta_{I_k}} \sum_{k=1}^{b_s} L(X_{I_k}, y; \Theta_{I_k}),$$

where L is the loss function and $\Theta_{I_k} = G(I_k; \Theta)$ denotes the parameters generated from subset $I_k$ and all trainable parameters $\Theta$ of MetaFE (including the subset embeddings and meta parameters). As a result, we can freeze the trainable parameters of FSS and optimize the trainable parameters $\Theta$ with the following formula:

$$\Theta^\star = \arg\min_\Theta \sum_{k=1}^{b_s} L(X_{I_k}, y; G(I_k; \Theta)).$$

We can then apply gradient descent to optimize all trainable parameters of MetaFE; the gradient with respect to $\Theta$ is computed by back-propagation according to the chain rule.

Sampling and Selecting. When the search space has converged, MetaFS samples and evaluates multiple subsets. Finally, we select the optimal subset as our result.
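The FSS update described above can be sketched as a single gradient step on the feature importances. The sketch below assumes the unnormalized objective $\sum_k l_k \exp(\sum_{i \in I_k} c_i)$ exactly as written in the text, plain gradient descent with a hypothetical learning rate, and NumPy's weighted sampling without replacement as the sampler; the function names are illustrative, and the losses in the usage example are made up rather than produced by MetaFE.

```python
import numpy as np


def softmax(c):
    """Feature selection probabilities p_i = exp(c_i) / sum_k exp(c_k)."""
    z = np.exp(c - c.max())
    return z / z.sum()


def sample_subsets(c, r, batch, rng):
    """Draw `batch` subsets of r distinct features, weighted by softmax(c)."""
    p = softmax(c)
    return [rng.choice(len(c), size=r, replace=False, p=p) for _ in range(batch)]


def fss_gradient(c, subsets, losses):
    """Gradient of sum_k l_k * exp(sum_{i in I_k} c_i) with respect to c."""
    grad = np.zeros_like(c)
    for subset, loss in zip(subsets, losses):
        weight = loss * np.exp(c[subset].sum())
        # The derivative w.r.t. c_j is `weight` for every j in I_k, zero otherwise.
        grad[subset] += weight
    return grad


def fss_step(c, subsets, losses, lr=0.1):
    """One gradient descent step on the feature importances."""
    return c - lr * fss_gradient(c, subsets, losses)


# Toy usage: subset {0, 1} has a low loss, subset {2, 3} a high one.
c0 = np.zeros(4)
c1 = fss_step(c0, subsets=[[0, 1], [2, 3]], losses=[0.1, 1.0])
p1 = softmax(c1)

# Sampling from the updated distribution (fixed seed only for reproducibility).
rng = np.random.default_rng(0)
batch = sample_subsets(c1, r=2, batch=3, rng=rng)
```

After the step, the importances of features 2 and 3 have dropped more than those of features 0 and 1, so the softmax assigns the low-loss features a higher selection probability.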

5. EXPERIMENTS

In this section, we conduct experiments on two real-world datasets to show our superiority in three aspects: 1) the performance of the selected feature subsets; 2) the efficiency and precision of subset evaluation; 3) the convergence of subset search stages.

5.1.1. DATASETS AND METRICS

The experiments are conducted on two datasets, one of which is public:

• Shopping: This dataset comes from an online shopping website and is used to estimate customers' gender from 21912 different shopping categories (features). The data is sparse and large, with more than 3 million samples. To reduce the time usage, the baselines use only 10 thousand samples to evaluate each subset (MetaFS uses all samples).
• Gisette Dua & Graff (2017): Gisette is a handwritten digit recognition dataset with 7000 samples and 5000 features. Most of the features are highly correlated.

For all datasets, we simply select 50 features in the experiments, so that they can be analyzed manually in downstream work. We divide all samples into two parts with a ratio of 8:2 for feature selection and testing. For the methods that require a validation dataset, we further divide the feature selection samples into two parts, with 80% for training and 20% for validation. To test the performance of a selected subset, we extract the selected features and train a fully connected neural network (FCN) from scratch with an early-stop mechanism, and then evaluate the subset performance on the test dataset. On the Gisette dataset, we run this re-training process 10 times and take the mean values as the final results to reduce the evaluation bias, as it has fewer samples and is prone to over-fitting. We adopt precision, recall, and F1-score to evaluate the results.
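The splitting protocol and the metrics can be sketched as follows. The random permutation, the seed, and the binary (positive class = 1) form of precision, recall, and F1 are assumptions, since the text does not specify how the split is drawn or how the metrics are averaged; the function names are illustrative.

```python
import numpy as np


def split_indices(n, rng):
    """Split n samples 8:2 into selection/test, then selection 8:2 into train/val."""
    perm = rng.permutation(n)
    n_fs = (n * 4) // 5                 # 80% for feature selection
    fs, test = perm[:n_fs], perm[n_fs:]
    n_train = (n_fs * 4) // 5           # 80% of those for training
    return fs[:n_train], fs[n_train:], test


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 with label 1 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = split_indices(7000, rng)   # Gisette-sized example
prec, rec, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For 7000 samples this yields 4480 training, 1120 validation, and 1400 test samples, i.e., 64%, 16%, and 20% of the data.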

5.1.2. BASELINES

We compare our MetaFS with 8 different methods, including four filter and embedding methods:

• MI-KBest Yang & Pedersen (1997): It ranks features by their Mutual Information with the label information. We then select the k features with the highest scores.
• Lasso Tibshirani (1996): It trains a linear classification model (logistic regression) with an $\ell_1$ penalty and selects the features whose corresponding weights are non-zero.
• AFS Gui et al. (2019): It trains a neural network and learns the feature importance with an attention mechanism. We select the features whose corresponding attention weights are non-zero.

In addition, we compare MetaFS with four wrapper methods, where we apply Decision Tree (DT) as the wrapped model, as it performs well on subset evaluation (see Section 5.3):

• PSO Chuang et al. (2008): Particle Swarm Optimization (PSO) is a heuristic method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. We use this method to search for the best subset.
• WOA Mafarja & Mirjalili (2018): Similar to PSO, we apply the Whale Optimization Algorithm (WOA), which mimics the hunting mechanism of humpback whales in nature, to search the subsets.
• HHO Too et al. (2019): Harris Hawks Optimization (HHO) is another heuristic method, which searches subsets with several active and time-varying phases of exploration and exploitation.

We also compare with a hybrid feature selection method, which is a mix of filter and wrapper methods:

• PPIMBC Hassan et al. (2021): It is a Markov Blanket based feature selection algorithm that selects a subset of features by considering their performance both individually and as a group.

For all baselines, we tune the hyper-parameters and constrain the number of selected features to be close to 50.

Experiment settings. In our implementation, we build a MetaFE model with two hidden layers, where the layer sizes are set to [64, 32]. The size of the feature embeddings in MetaFE is simply set to 256. The batch sizes of samples and subsets are set to 128 and 32, respectively. In the experiments, we train MetaFE on the training dataset, and update FSS and search for the optimal subset on the validation dataset. We employ cross entropy as the loss function, and all experiments are conducted on a single Nvidia 2070 GPU.

5.2. PERFORMANCE COMPARISON

Table 1 shows the evaluation results of the selected features. First, the filter and embedding methods, i.e., MI-KBest, Lasso, and AFS, perform worse on the Gisette dataset, as they evaluate features greedily and struggle to capture the cross-correlation among features. Second, the wrapper and hybrid methods, i.e., RFE, PSO, WOA, HHO, and PPIMBC, which need to train many models to evaluate different subsets, perform poorly on the Shopping dataset. The reason is that the Shopping dataset is hard to learn from only a few samples (see Section 5.3), and with imprecise evaluations these methods easily fall into local minima. Unlike the above methods, MetaFS achieves the best results on both datasets. The reason is that MetaFS, as a wrapper framework, can directly compare different feature subsets during the selection procedure. In addition, compared to the other wrapper methods, MetaFS applies meta-learning techniques and evaluates subsets efficiently without re-training. Moreover, the deep-learning-based evaluation model can capture complex correlations among features, improving the precision of the evaluation. With precise and efficient subset evaluations, MetaFS can quickly filter out bad subsets to condense the search space and finally test many more subsets to select the optimal result.

5.3. STUDIES ON FEATURE SUBSET EVALUATION

To study the precision and efficiency of different evaluation methods, we sample 100 different subsets each from the uniform distribution and from a converged subset space. Then, we compare the time usage of the evaluation process and the ranking similarities of the evaluation results. The compared evaluation methods include four popular re-trained learning methods that are widely used in wrapper methods: KNN, Linear Regression (LR), Support Vector Classification (SVC), and Decision Tree (DT). In addition, we compare them with our proposed MetaFE. To illustrate the effectiveness of MetaFE, we also compare it to an FCN model that is trained once for all subsets, without retraining. As shown in Table 2, we use Rank Biased Overlap (RBO) to evaluate the ranking similarities of different evaluation results, where a higher RBO indicates higher similarity. The ground truth is the result of an FCN model re-trained for each subset. For the Gisette dataset, as it is easy to learn, almost all re-trained methods perform well under the uniform sampling distribution. However, when the distribution converges, these models struggle to distinguish the nuances among good subsets precisely. For the sparse Shopping dataset, the re-trained methods produce worse results in all subset sampling spaces. The reason is that the efficiency of evaluation limits them (see 

5.4. CONVERGENCE ANALYSIS

In this subsection, we analyze the convergence of MetaFS during the training process. We first compare the convergence curve of our MetaFS (FSS+MetaFE) with two variants to show the effectiveness of our training process: 1) FSS+FCN, which replaces MetaFE with an FCN model, and 2) Init+MetaFE, which removes the FSS and samples subsets from an initial uniform distribution. We take the Gisette dataset as a typical case, and Figure 4 shows how the F1-score varies over time. In the pre-training stage, all models are trained with random subsets (sampled from the uniform distribution). MetaFE can capture the inherent correlation of different subsets and achieves a higher prediction precision than FCN. In the collaborative learning stage, using FSS increases the probabilities of good subsets and condenses the search space. As a result, the estimator (MetaFE or FCN) is trained with similar subsets, which finally improves the precision of model inference.

First, the RBO curves along the training iterations are unstable. One reason is that, since training FCN models for evaluating subsets is time-consuming, we only sample 100 subsets, which is too few to reflect the actual performance. Overall, the RBO curve evaluated on the converged subset space tends to increase with the training iterations. The reason is that MetaFE is collaboratively trained with the FSS and specifically increases its precision in evaluating good subsets as training proceeds. For the subsets that follow a uniform distribution, the performance of MetaFE decreases with the iterations, as it is based on a neural network and therefore tends to forget old knowledge Biesialska et al. (2020). However, our proposed collaborative training stage alternately optimizes the subset sampling space (FSS) and MetaFE, ensuring that MetaFE is always in the optimal state when the subset sampling space is updated, which effectively avoids this problem. As a result, MetaFS performs well throughout the training stages.

6. CONCLUSION AND FUTURE WORKS

In this paper, we proposed a novel Meta Feature Selection framework (MetaFS) for effective feature selection. MetaFS, consisting of FSS and MetaFE, transforms the discrete search problem into a continuous one and adopts meta-learning techniques to evaluate subsets without retraining. Specifically, the proposed FSS parameterizes the search space and adopts gradient-based methods to effectively reduce the search difficulty. Meanwhile, MetaFE significantly decreases the time spent on evaluation during the selection procedure while preserving evaluation precision. MetaFS adopts a bi-level optimization strategy to optimize FSS and MetaFE, evaluates multiple subsets sampled from FSS, and finally selects the optimal one. We conducted extensive experiments on two datasets, demonstrating that MetaFS finds promising subsets. In the future, we will study the hyper-parameters of MetaFS and simplify its operation to support rapid deployment in real-world scenarios.



Figure 1: Supervised Feature Selection Frameworks and our MetaFS

Figure 2: Meta Feature Estimator Framework

As shown in Figure 2(a), MetaFE is composed of activation functions (e.g., ReLU and Sigmoid) and multiple Meta Fully Connected layers (MetaFC). Figure 2(b) shows the structure of MetaFC, a fully connected layer that computes its output with weight parameters $W_I$ generated by the Meta Learner for subset I, formulated as:

$$x_{out} = x_{in} W_I + b, \quad W_I = \mathrm{Learner}(I) \tag{6}$$

The Meta Learner maintains the representations of different subsets through m trainable feature embeddings, and it generates the weight parameters $W_I$ according to the subset representation. As shown in Figure 2(c), let $[e_1, \ldots, e_m]^\top \in \mathbb{R}^{m \times d_e}$ denote the m trainable feature embeddings. For each input subset $I = \{i_1, \ldots, i_r\}$, we suppose that all features have a similar impact on model inference; the meta learner first extracts the corresponding feature embeddings $E = [e_{i_1}, \ldots, e_{i_r}]^\top \in \mathbb{R}^{r \times d_e}$ and fuses them

Figure 3: Training Process of Meta Feature Selection

Pre-training MetaFE. As MetaFE initially cannot compare the effectiveness of subsets without being fed any training data, we first pre-train MetaFE using subsets sampled from the uniform distribution. In this way, MetaFE can recognize the better feature combinations to guide the optimization of FSS. The detailed training process of MetaFE is the same as that in collaborative learning.

Figure 4: The Convergence Curves on the Gisette Dataset

Figure 5.4 shows how the ranking similarity (RBO) changes over the training iterations. The blue lines are the results of MetaFE models trained to different training stages and evaluated on subsets that follow the uniform distribution, corresponding to the RBO values on the left axis. The red lines are the results of MetaFE models evaluated on the converged subset space, corresponding to the right axis.
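For reference, Rank Biased Overlap between two rankings can be computed with a short function. The sketch below uses the extrapolated RBO formula for two equal-length rankings without ties; the persistence parameter `p = 0.9` and the example loss values are assumptions, since the paper does not state which RBO variant or parameter was used.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated Rank Biased Overlap for two rankings of unique items.

    Both rankings are truncated to the same depth k. Returns a value in
    [0, 1], where 1 means identical rankings and 0 means no shared items.
    """
    k = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    overlap = 0        # X_d: size of the intersection of the two depth-d prefixes
    weighted = 0.0     # running sum of (X_d / d) * p**d
    for d in range(1, k + 1):
        x, y = ranking_a[d - 1], ranking_b[d - 1]
        if x == y:
            overlap += 1
        else:
            overlap += (x in seen_b) + (y in seen_a)
        seen_a.add(x)
        seen_b.add(y)
        weighted += (overlap / d) * p ** d
    return (overlap / k) * p ** k + (1 - p) / p * weighted


# Rank four hypothetical subsets by ground-truth loss and by estimated loss.
true_losses = [0.10, 0.40, 0.25, 0.90]
est_losses = [0.12, 0.35, 0.30, 0.80]
rank_true = sorted(range(4), key=lambda i: true_losses[i])
rank_est = sorted(range(4), key=lambda i: est_losses[i])
score = rbo_ext(rank_true, rank_est)
```

Smaller values of `p` put more weight on the top of the ranking, which matters when only the best few subsets are of interest.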

Table 1: The Evaluation Performance of Selecting Results

Table 2: The Ranking Accuracy of Evaluation Results (RBO)

, and they could only evaluate 10 thousand samples for each subset. Unlike the above methods, MetaFE, which is based on a fully connected neural network, has an excellent ability to capture complex non-linear correlations among features. Moreover, it largely avoids the time-consuming training stage and can be trained with all visible samples to improve the inference performance. Compared to FCN, MetaFE can generate unique model parameters for different subsets, so it generalizes well across subsets. As a result, MetaFE evaluates subsets precisely in the experiments, especially for datasets with a large number of samples.

Table 3: Time Usage of Evaluating One Subset (ms)

Table 3 shows the time usage of different methods for evaluating one subset, where the re-trained methods only evaluate 10 thousand samples on the Shopping dataset. We also show the time usage of MetaFE on different devices, where GPU* represents the estimated time usage that takes the training process into account. The re-trained methods spend a lot of time training models. Worse, the time used increases with the number of training samples. Conversely, MetaFE does not need to re-train models and costs less time. As it is based on a neural network, it can make full use of GPU parallelism, which significantly accelerates the computation. Even when the training process is taken into account (GPU*), the evaluation process is still much more efficient than the re-trained methods. Therefore, MetaFE has clear advantages in both the precision and the efficiency of subset evaluation.

To show the change in model performance during the training stages, we compare the subset evaluations of MetaFE models trained to different training stages. The ground truth is the test from a

