SCALABLE FEATURE SELECTION VIA SPARSE LEARNABLE MASKS

Abstract

We propose a canonical approach for feature selection, Sparse Learnable Masks (SLM). SLM integrates learnable sparse masks into end-to-end training. To address the fundamental non-differentiability challenge of selecting a desired number of features, we propose dual mechanisms: automatic mask scaling to achieve the desired feature sparsity, and gradual tempering of this sparsity for effective learning. In addition, SLM employs a novel objective that maximizes the mutual information (MI) between the selected features and the labels in an efficient and scalable way. Empirically, SLM achieves state-of-the-art results on several benchmark datasets, often by a significant margin, especially on challenging real-world datasets.

1. INTRODUCTION

In many machine learning scenarios, a significant portion of the input features may be irrelevant to the output, especially as modern data management tools allow easy construction of large-scale datasets by combining different data sources. 'Feature selection', filtering the most relevant features for the downstream task, is a long-standing problem, with many methods proposed and used to date (Guyon & Elisseeff, 2003; Li et al., 2017; Dash & Liu, 1997). Feature selection can bring a multitude of benefits. A smaller number of features can yield superior generalization and hence better test accuracy, by minimizing reliance on spurious patterns that do not hold consistently (Sagawa et al., 2020) and by not wasting model capacity on irrelevant features. In addition, reducing the number of input features can decrease the computational complexity and cost of deployed models, as the models learn a mapping from lower-dimensional input data and the infrastructure only needs to support the selected features. Lastly, feature selection helps with interpretability, as users can focus their efforts to understand the model on a smaller subset of input features.

How can we select the target number of features in an optimal way? Feature selection has been studied with numerous approaches, as summarized in §2. For superior task accuracy, the feature selection method should consider the predictive model itself, as the optimal set of features depends on how the mapping between inputs and outputs is learned. Such model-aware selection has been approached in different ways, for example via sparse regularization and its extensions (Lemhadri et al., 2019). In the context of deep learning, the fundamental challenge is that the selection operation (given a target number of selected features) is non-differentiable. This necessitates the design of soft approximations to the feature selection operator, incorporated into end-to-end task learning.
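To make the non-differentiability concrete, the following minimal numpy sketch (not SLM's actual mechanism, only a generic illustration) contrasts exact top-k selection, whose gradient is zero almost everywhere, with a common differentiable surrogate: a tempered softmax rescaled so the soft mask entries sum to k.

```python
import numpy as np

def hard_topk_mask(scores, k):
    """Exact top-k selection: a 0/1 mask. The argsort/threshold step has
    zero gradient almost everywhere, so it cannot be trained end to end."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

def soft_topk_mask(scores, k, temperature=1.0):
    """A simple differentiable surrogate: a tempered softmax rescaled so
    the mask entries sum to k. As temperature -> 0 it sharpens toward
    the hard mask."""
    z = (scores - scores.max()) / temperature  # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return k * p

scores = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
hard = hard_topk_mask(scores, k=2)
soft = soft_topk_mask(scores, k=2, temperature=0.1)
```

At low temperature, the soft mask concentrates its mass on the same two features the hard mask selects, while remaining differentiable with respect to the scores.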

To address these fundamental challenges, we propose Sparse Learnable Masks (SLM), a novel approach for scalable feature selection. SLM can be integrated into any deep learning architecture, provided the optimization is gradient-descent based, for joint training. SLM proposes an effective way of adjusting the learnable masks to select the exact desired number of features, addressing the differentiability challenge. In addition, SLM uses a novel mutual information (MI) regularizer, based on a quadratic relaxation of the MI between the labels and the selected features, conditioned on the probability that a feature is selected. SLM comes with scaling benefits, yielding efficient feature selection even when the number of features or samples is very large.
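The idea of training masks jointly with the predictor can be sketched as follows. This is a toy illustration under our own assumptions (a linear model, plain gradient descent, manually derived gradients), not SLM itself: a per-feature mask m and predictor weights w are updated jointly so that the effective coefficients m * w recover the features that actually drive the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only features 0 and 2 influence the target.
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2]

# A learnable mask m (one entry per feature) trained jointly with the
# predictor weights w by gradient descent on the MSE of (X * m) @ w.
m = np.ones(5)
w = np.zeros(5)
lr = 0.05
for _ in range(2000):
    err = (X * m) @ w - y                    # residuals, shape (200,)
    grad_w = 2.0 / len(y) * (X * m).T @ err  # dMSE/dw
    grad_m = 2.0 / len(y) * (X.T @ err) * w  # dMSE/dm
    w -= lr * grad_w
    m -= lr * grad_m

effective = m * w  # effective per-feature coefficient after training
```

After training, `effective` is close to [2, 0, -3, 0, 0]: the mask-weight products for the irrelevant features vanish while the relevant ones recover the true coefficients, illustrating why a selection mechanism benefits from being trained end to end with the predictive model.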

2. RELATED WORK

Feature selection methods: Numerous methods have been studied for feature selection, and they broadly fall under three categories (Guyon & Elisseeff, 2003):

