SCALABLE FEATURE SELECTION VIA SPARSE LEARNABLE MASKS

Abstract

We propose a canonical approach for feature selection, Sparse Learnable Masks (SLM). SLM integrates learnable sparse masks into end-to-end training. For the fundamental non-differentiability challenge of selecting a desired number of features, we propose dual mechanisms: automatic mask scaling to achieve the desired feature sparsity, and gradual tempering of this sparsity for effective learning. In addition, SLM employs a novel objective that maximizes the mutual information (MI) between the selected features and the labels, in an efficient and scalable way. Empirically, SLM achieves state-of-the-art results on several benchmark datasets, often by a significant margin, especially on challenging real-world datasets.

1. INTRODUCTION

In many machine learning scenarios, a significant portion of the input features may be irrelevant to the output, especially with modern data management tools allowing easy construction of large-scale datasets by combining different data sources. 'Feature selection', filtering the most relevant features for the downstream task, is a long-standing problem, with many methods proposed and used to date (Guyon & Elisseeff, 2003; Li et al., 2017; Dash & Liu, 1997). Feature selection can bring a multitude of benefits. A smaller number of features can yield superior generalization and hence better test accuracy, by minimizing reliance on spurious patterns that do not hold consistently (Sagawa et al., 2020) and by not wasting model capacity on irrelevant features. In addition, reducing the number of input features can decrease the computational complexity and cost of deployed models, as the models need to learn a mapping from lower-dimensional input data, and the infrastructure only needs to support the selected features. Lastly, feature selection helps with interpretability, as users can focus their efforts to understand the model on a smaller subset of input features.

How can we select the target number of features in an optimal way? Feature selection has been studied with numerous approaches, as summarized in §2. For superior task accuracy, the feature selection method should consider the predictive model itself, as the optimal set of features depends on how the mapping is done between the inputs and outputs. Such methods have been approached in different ways, such as via sparse regularization and its extensions (Lemhadri et al., 2019). In the context of deep learning, the fundamental challenge is that the selection operation (given the target number of selected features) is non-differentiable. This necessitates the design of soft approximations for the feature selection operator, incorporated into end-to-end task learning.


To address these fundamental challenges, we propose Sparse Learnable Masks (SLM), a novel approach for scalable feature selection. SLM can be integrated into any deep learning architecture whose optimization is gradient-descent based, for joint training. SLM proposes an effective way of adjusting the learnable masks to select the exact desired number of features, addressing the differentiability challenges. In addition, SLM uses a novel mutual information (MI) regularizer, based on a quadratic relaxation of the MI between the labels and the selected features, conditioned on the probability that a feature is selected. SLM comes with scaling benefits, yielding efficient feature selection even when the number of features or samples is very large. We demonstrate state-of-the-art feature selection results with SLM in different scenarios across a wide range of datasets.
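To make the mask-scaling idea concrete, below is a minimal numpy sketch, not the paper's exact algorithm: it uses sparsemax (Martins & Astudillo, 2016) as a sparse normalizer and binary-searches a scale factor so that the normalized mask keeps exactly the desired number of features. The function names `sparsemax` and `scale_for_k`, and the search bounds, are our own illustrative choices.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    # largest k such that 1 + k * z_(k) > sum of the top-k entries
    k = ks[1 + ks * z_sorted > cumsum][-1]
    tau = (cumsum[k - 1] - 1.0) / k
    return np.maximum(z - tau, 0.0)

def scale_for_k(z, k, lo=1e-4, hi=1e4, iters=100):
    """Binary-search a scale s so sparsemax(s * z) has exactly k nonzeros.

    Larger scales shrink the sparsemax support; if exactly k is not
    attainable (e.g. due to ties), the last midpoint is returned.
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint of the search interval
        nnz = np.count_nonzero(sparsemax(mid * z))
        if nnz > k:
            lo = mid  # too dense: increase the scale
        elif nnz < k:
            hi = mid  # too sparse: decrease the scale
        else:
            return mid
    return mid
```

For instance, with learnable scores `z = [0.1, 0.4, 0.3, 0.2, -0.5]` and a target of 2 features, `sparsemax(scale_for_k(z, 2) * z)` yields a mask with exactly two nonzero entries (on the two highest-scoring features) that still sums to one, so it remains differentiable almost everywhere with respect to the scores.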



2. RELATED WORK

• Wrappers recompute the predictive model for each subset of features. As exhaustive search is NP-hard and computationally intractable, efficient search strategies such as forward selection or backward elimination have been developed. For instance, HSIC-Lasso (Yamada et al., 2014) proposes a feature-wise kernelized Lasso for capturing non-linear dependencies. Wrappers are difficult to integrate with modern deep learning, as the training complexity gets prohibitively large.
• Filters select subsets of variables as a pre-processing step, independent of the predictive model. (Gu et al., 2012) developed the Fisher score, which selects features to maximize (minimize) the distances between data points in different (same) classes in the space spanned by the selected features. Principal feature analysis (PFA) (Lu et al., 2007b) selects features based on principal component analysis. (Pan et al., 2020) uses adversarial validation to select features based on how much their characteristics differ between training and test splits, as a way to improve robustness. There are also various methods based on MI maximization (Ding & Peng, 2005) that select features independently of the predictive model (unlike SLM). CMIM (Fleuret, 2004) maximizes the conditional MI between selected features and the class labels to account for feature inter-dependence. JMIM (Bennasar et al., 2015) maximizes the joint MI between class labels and the selected features, while addressing overconfidence in features that correlate with already-selected features, using a greedy search that selects features one at a time. (Zadeh et al., 2017) formulates feature selection as a diversity maximization problem using an MI-based metric amongst features. The fundamental disadvantage of filter-based methods, not being optimized with the predictive models, results in them often yielding suboptimal performance.
• Embedded methods combine selection into training and are usually specific to given predictive models.
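As an illustration of the filter family, here is a small numpy sketch of the Fisher score in our own simplified rendering of the criterion described above: each feature is scored by its between-class scatter divided by its within-class scatter, and the top-scoring features are kept without consulting any predictive model.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: between-class variance / within-class variance."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])  # between-class scatter per feature
    den = np.zeros(X.shape[1])  # within-class scatter per feature
    for c in np.unique(y):
        Xc = X[y == c]
        nc = len(Xc)
        num += nc * (Xc.mean(axis=0) - overall_mean) ** 2
        den += nc * Xc.var(axis=0)
    return num / (den + 1e-12)  # small epsilon guards against zero variance
```

On a toy dataset where feature 0 cleanly separates the two classes and feature 1 is pure noise, the score ranks feature 0 far above feature 1; selecting the top-k features by this score is the classic filter pipeline.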
Lasso regularization (Tibshirani, 1996) performs feature selection by varying the strength of the L1 regularization. (Feng & Simon, 2017) extends this idea by proposing an input-sparse neural network, where the input weights are penalized using the group Lasso penalty. (Lemhadri et al., 2019) selects only a subset of the features using input-to-output residual connections, allowing features to have non-zero weights only if their skip-layer connections are active. Concrete Autoencoder (Abid et al., 2019) proposes an unsupervised feature selector based on using a concrete selector layer as the encoder and a standard neural network as the decoder. FsNet (Singh et al., 2020) uses a concrete random variable for discrete feature selection in a selector layer and a supervised deep neural network regularized with a reconstruction loss. STG (Yamada et al., 2020) learns stochastic gates with a probabilistic relaxation of the count of selected features; it selects features and learns task prediction end-to-end.

Masking in deep neural networks: Masking the inputs to control information propagation is a commonly-used approach in deep learning. Attention-based architectures, such as the Transformer (Vaswani et al., 2017) and the Perceiver (Jaegle et al., 2021), show strong results across many domains, with learnable key and query representations whose alignment yields the masks that control the contribution of the corresponding value representations. While these effectively reweight the inputs, they typically do not completely mask out (i.e. yield zero attention weight for) the inputs. Towards this end, various works have focused on bringing sparsity into masking, such as via thresholding (Zhao et al., 2019) or sparse normalization (Correia et al., 2019). TabNet (Arik & Pfister, 2019) directly generates sparse attention masks and applies them sequentially to input data, which can perform sample-dependent feature selection.
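To make the embedded family concrete, below is a minimal numpy sketch of Lasso-style selection via proximal gradient descent (ISTA); the data, step size, and regularization strength are illustrative and not taken from any of the cited works. Features whose coefficients are driven exactly to zero by the L1 proximal step are deselected as part of training itself.

```python
import numpy as np

def soft_threshold(w, t):
    # proximal operator of the L1 norm
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1 with ISTA."""
    eta = 1.0 / np.linalg.norm(X, 2) ** 2  # step size from the Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = soft_threshold(w - eta * X.T @ (X @ w - y), eta * lam)
    return w

# Toy usage: the target depends only on feature 0, so Lasso should
# zero out the remaining coefficients.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0]
w = lasso_ista(X, y, lam=10.0)
selected = np.nonzero(np.abs(w) > 1e-6)[0]
```

Raising `lam` shrinks more coefficients to exactly zero, which is the knob the Lasso-based methods above vary to trade off sparsity against fit.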
(Correia et al., 2020) achieves sparsity in latent distributions in neural networks by using sparsemax and its structured analogs, allowing for efficient latent variable marginalization. (Lei et al., 2016) and (Bastings et al., 2019) learn Bernoulli variables, which are analogous to our feature mask but in a local setting, for extractive rationale prediction in text. (Paranjape et al., 2020) extends these ideas by proposing to control sparsity by optimizing the Kullback-Leibler (KL) divergence between the mask distribution and a prior distribution with controllable sparsity levels. (Guerreiro & Martins, 2021) develops a flexible rationale extraction mechanism using a constrained structured prediction algorithm on factor graphs. All of these perform sample-wise, not global, input selection. In this work, our goal is to explore global feature selection. When train and test sets perfectly align in distribution, local feature selection can give superior performance due to its input-dependence. However, there is rarely such perfect alignment, and global selection provides robustness benefits when there is distribution shift between train and test sets, in addition to allowing more computational efficiency by globally removing features.

