IMPROVING MUTUAL INFORMATION BASED FEATURE SELECTION BY BOOSTING UNIQUE RELEVANCE

Anonymous

Abstract

Mutual Information (MI) based feature selection (MIBFS) uses MI to evaluate each feature and eventually shortlist a relevant feature subset, in order to address issues associated with high-dimensional datasets. Despite the effectiveness of MI in feature selection, we notice that many state-of-the-art algorithms disregard the so-called unique relevance (UR) of features, which is a necessary condition for the optimal feature subset. In fact, in our study of seven state-of-the-art and classical MIBFS algorithms, we find that all of them underperform: by ignoring the UR of features, they arrive at suboptimal feature subsets that contain a non-negligible number of redundant features. We argue that the heart of the problem is that all these MIBFS algorithms follow the criterion of Maximize Relevance with Minimum Redundancy (MRwMR), which does not explicitly target UR. This motivates us to augment the existing criterion with the objective of boosting unique relevance (BUR), leading to a new criterion called MRwMR-BUR. We conduct extensive experiments with several MIBFS algorithms, with and without incorporating UR. The results indicate that the algorithms that boost UR consistently outperform their unboosted counterparts in terms of both peak accuracy and the number of features required. Furthermore, we propose a classifier-based approach to estimating UR that further improves the performance of MRwMR-BUR based algorithms.

1. INTRODUCTION

High-dimensional datasets tend to contain irrelevant or redundant features, leading to extra computation, larger storage, and decreased performance (Bengio et al., 2013; Gao et al., 2016; Bermingham et al., 2015; Hoque et al., 2016). Mutual Information (MI) (Cover & Thomas, 2006) based feature selection, a classifier-independent filter method, addresses these issues by selecting a relevant feature subset. We start this paper by discussing the value of MI based feature selection (MIBFS).

Interpretability: Dimensionality reduction methods fall into two classes: feature extraction and feature selection. Feature extraction transforms the original features into new features of lower dimensionality (e.g., via PCA). Such methods may perform well at dimensionality reduction, but the extraction process (e.g., projection) loses the physical meaning of the features (Chandrashekar & Sahin, 2014; Sun & Xu, 2014; Nguyen et al., 2014; Gao et al., 2016). In contrast, feature selection preserves interpretability by selecting a relevant feature subset. This helps to uncover hidden relationships between variables and makes techniques such as MIBFS preferred in various domains (e.g., healthcare) (Kim et al., 2015; Liu et al., 2018; Chandrashekar & Sahin, 2014).

Generalization: Feature selection methods are either classifier dependent or classifier independent (Guyon & Elisseeff, 2003; Chandrashekar & Sahin, 2014). Examples of the former include the wrapper method and the embedded method (e.g., LASSO (Hastie et al., 2015)), which perform feature selection during the training of a pre-defined classifier. Classifier dependent methods tend to provide good performance, as they directly exploit the interaction between features and accuracy. However, the selected features are optimized for the pre-defined classifier and may not perform well with other classifiers.
The filter method, which is classifier independent, scores each feature according to its relevance to the label. As a filter method, MIBFS quantifies relevance using MI, since MI can capture dependencies between random variables (e.g., a feature and the label). Consequently, the feature subset selected by MIBFS is not tied to the bias of any particular classifier and is relatively easier to generalize (Bengio et al., 2013; L. et al., 2011; Meyer et al., 2008).
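As a concrete illustration of this scoring step (not code from the paper), the empirical MI between a discrete feature and the label can be computed directly from co-occurrence counts via I(X;Y) = Σ p(x,y) log₂[ p(x,y) / (p(x)p(y)) ]. The function name and toy data below are illustrative only:

```python
import math
from collections import Counter

def mutual_information(feature, label):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))   # counts of (x, y) pairs
    marg_x = Counter(feature)              # counts of x values
    marg_y = Counter(label)                # counts of y values
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) = (c/n) / ((marg_x[x]/n) * (marg_y[y]/n))
        mi += (c / n) * math.log2(c * n / (marg_x[x] * marg_y[y]))
    return mi

# A feature that fully determines a balanced binary label carries 1 bit of MI;
# an independent feature carries none.
label      = [0, 0, 1, 1]
relevant   = [0, 0, 1, 1]   # identical to the label
irrelevant = [0, 1, 0, 1]   # independent of the label
print(mutual_information(relevant, label))    # 1.0
print(mutual_information(irrelevant, label))  # 0.0
```

A filter method would compute such a score for every candidate feature and rank (or greedily select) features accordingly, without ever training a classifier.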

