IMPROVING MUTUAL INFORMATION BASED FEATURE SELECTION BY BOOSTING UNIQUE RELEVANCE

Anonymous

Abstract

Mutual Information (MI) based feature selection (MIBFS) uses MI to evaluate each feature and eventually shortlist a relevant feature subset, in order to address issues associated with high-dimensional datasets. Despite the effectiveness of MI in feature selection, we notice that many state-of-the-art algorithms disregard the so-called unique relevance (UR) of features, even though containing all features with UR is a necessary condition for the optimal feature subset. Indeed, in our study of seven state-of-the-art and classical MIBFS algorithms, we find that all of them underperform: by ignoring the UR of features, they arrive at suboptimal selected feature subsets that contain a non-negligible number of redundant features. We point out that the heart of the problem is that all these MIBFS algorithms follow the criterion of Maximize Relevance with Minimum Redundancy (MRwMR), which does not explicitly target UR. This motivates us to augment the existing criterion with the objective of boosting unique relevance (BUR), leading to a new criterion called MRwMR-BUR. We conduct extensive experiments with several MIBFS algorithms, with and without UR boosting. The results indicate that the algorithms that boost UR consistently outperform their unboosted counterparts in terms of both peak accuracy and the number of features required. Furthermore, we propose a classifier-based approach to estimating UR that further improves the performance of MRwMR-BUR based algorithms.

1. INTRODUCTION

High-dimensional datasets tend to contain irrelevant or redundant features, leading to extra computation, larger storage requirements, and decreased performance (Bengio et al., 2013; Gao et al., 2016; Bermingham et al., 2015; Hoque et al., 2016). Mutual Information (MI) (Cover & Thomas, 2006) based feature selection, a classifier-independent filter method, addresses these issues by selecting a relevant feature subset. We begin this paper by discussing the value of MI based feature selection (MIBFS).

Interpretability: Dimensionality reduction methods fall into two classes: feature extraction and feature selection. Feature extraction transforms the original features into new features of lower dimensionality (e.g., PCA). This approach may perform well in reducing dimensionality, but the extraction process (e.g., projection) discards the physical meaning of the features (Chandrashekar & Sahin, 2014; Sun & Xu, 2014; Nguyen et al., 2014; Gao et al., 2016). In contrast, feature selection preserves interpretability by selecting a relevant feature subset. This helps to uncover hidden relationships between variables and makes techniques such as MIBFS preferred in various domains (e.g., healthcare) (Kim et al., 2015; Liu et al., 2018; Chandrashekar & Sahin, 2014).

Generalization: Feature selection methods are either classifier dependent or classifier independent (Guyon & Elisseeff, 2003; Chandrashekar & Sahin, 2014). Examples of the former include the wrapper method and the embedded method (e.g., LASSO (Hastie et al., 2015)), which perform feature selection during the training of a pre-defined classifier. Classifier dependent methods tend to provide good performance, as they directly exploit the interaction between features and accuracy. However, the selected features are optimized for the pre-defined classifier and may not perform well for other classifiers.
The filter method, which is classifier independent, scores each feature according to its relevance to the label. As a filter method, MIBFS quantifies relevance using MI, since MI can capture the dependencies between random variables (e.g., feature and label). Consequently, the feature subset selected by MIBFS is not tied to the bias of any particular classifier and generalizes relatively easily (Bengio et al., 2013; L. et al., 2011; Meyer et al., 2008).

Performance: Although MIBFS is an old idea dating back to 1992 (Lewis, 1992), it still provides competitive performance in dimensionality reduction (see several recent surveys (Zebari & et al, 2020; Venkatesh & Anuradha, 2019)). We now provide a new perspective, based on the Information Bottleneck (IB) (Tishby et al., 2000), to explain the superior performance of MIBFS and to suggest why MI is the right metric for feature selection. IB was proposed to search for the solution that achieves the largest possible compression while retaining the essential information about the target, and in (Shwartz-Ziv & Tishby, 2017), IB is used to explain the behavior of neural networks. Specifically, let X be the input data to the neural network, Y be the corresponding label, and X̂ be the hidden representation of the network. Shwartz-Ziv & Tishby (2017) demonstrate that the learning process in neural networks consists of two phases: (i) empirical error minimization (ERM), where I(X̂; Y) gradually increases to capture relevant information about the label Y; and (ii) representation compression, where I(X; X̂) decreases while I(X̂; Y) remains almost unchanged, which may be responsible for the absence of overfitting in neural networks. We note that the goal of MIBFS is to find the minimal feature subset with maximum MI with respect to the label (Brown et al., 2012). Mathematically, the goal can be written as follows.
S* = arg min_{S ∈ A*} |S|, where A* = arg max_{S ⊆ Ω} I(S; Y),   (1)

where Ω is the set of all features, S ⊆ Ω is a selected feature subset, |S| denotes the number of features in S, and S* is the optimal feature subset. In such a manner, MIBFS naturally maps the representation learning process of neural networks onto the process of feature selection (if we regard S as a type of hidden representation X̂) and attempts to obtain an equivalent learning outcome. Specifically, maximizing I(S; Y) corresponds to the ERM phase, and minimizing the size of S corresponds to the representation compression phase. We believe this new perspective sheds light on the superior performance of MIBFS in dimensionality reduction and rationalizes the use of MI for feature selection.

We note that finding the optimal feature subset S* in (1) through exhaustive search is computationally intractable. Therefore, numerous MIBFS algorithms (Meyer et al., 2008; Yang & Moody, 2000; Nguyen et al., 2014; Bennasar et al., 2015; Peng et al., 2005) have been proposed that attempt to select the optimal feature subset following the criterion of Maximize Relevance with Minimum Redundancy (MRwMR) (Peng et al., 2005). In this paper, we explore a promising feature property, called Unique Relevance (UR), which is the key to selecting the optimal feature subset in (1). We note that UR was defined long ago and is also known as strong relevance (Kohavi & John, 1997). However, only very few works (Liu et al., 2018; Liu & Motani, 2020) have looked into it, and the use of UR for feature selection remains largely uninvestigated. We fill this gap and improve the performance of MIBFS by exploring the utility of UR. We describe the flow of the remaining paper, together with our contributions, as follows.

1. We shortlist seven state-of-the-art (SOTA) and classical MIBFS algorithms and uncover the fact that all of them ignore UR and end up underperforming, namely they select a non-negligible number of redundant features, contradicting the objective of the minimal feature subset in (1). In fact, it turns out that the minimal feature subset in (1) must contain all features with UR.

2. We point out that the heart of the problem is that existing MIBFS algorithms follow the criterion of MRwMR (Peng et al., 2005), which lacks a mechanism to explicitly identify the UR of features. This motivates us to augment MRwMR with the objective of boosting UR, leading to a new criterion for MIBFS, called MRwMR-BUR.

3. We estimate UR using the KSG estimator (Kraskov et al., 2004) and conduct experiments with five representative MIBFS algorithms on six datasets. The results indicate that the algorithms that boost UR consistently outperform their unboosted counterparts when tested with three classifiers.

4. We improve MRwMR-BUR by proposing a classifier based approach to estimate UR, and our experimental results indicate that this approach further improves the classification performance of MRwMR-BUR based algorithms.
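To make the MRwMR criterion discussed above concrete, the following is a minimal sketch of greedy forward selection with mRMR-style scoring (relevance minus mean redundancy), in the spirit of Peng et al. (2005). The function name `mrmr_select`, the use of scikit-learn's MI estimators, and the assumption of discrete-valued features are our own illustrative choices, not part of any of the cited implementations.

```python
# Hedged sketch: greedy forward feature selection under the MRwMR
# criterion (mRMR-style scoring). Assumes discrete (integer) features.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score


def mrmr_select(X, y, k):
    """Greedily pick k feature indices, each maximizing
    I(X_j; Y) - mean_{s in selected} I(X_j; X_s)."""
    n_features = X.shape[1]
    # Relevance I(X_j; Y) of each feature with the label.
    relevance = mutual_info_classif(X, y, discrete_features=True,
                                    random_state=0)
    selected, remaining = [], list(range(n_features))
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            # Mean redundancy with already-selected features (0 if none).
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, s])
                                   for s in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

On a toy dataset where one feature is a copy of the label and the rest are noise, the copy is selected first, since its relevance equals H(Y) while the noise features carry near-zero MI. Note that this greedy MRwMR procedure has no mechanism to detect unique relevance, which is precisely the gap that MRwMR-BUR targets.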

2. BACKGROUND AND DEFINITIONS

We now formally define the notation used in this paper. We denote the set of all features by Ω = {X_k, k = 1, …, M}, where M is the number of features. The feature X_k ∈ Ω and the label Y

