SCALABLE FEATURE SELECTION VIA SPARSE LEARNABLE MASKS

Abstract

We propose a canonical approach for feature selection, sparse learnable masks (SLM). SLM integrates learnable sparse masks into end-to-end training. To address the fundamental non-differentiability challenge of selecting a desired number of features, we propose dual mechanisms: automatic mask scaling to achieve the desired feature sparsity, and gradual tempering of this sparsity for effective learning. In addition, SLM employs a novel objective that maximizes the mutual information (MI) between the selected features and the labels in an efficient and scalable way. Empirically, SLM achieves state-of-the-art results on several benchmark datasets, often by a significant margin, especially on challenging real-world datasets.

1. INTRODUCTION

In many machine learning scenarios, a significant portion of the input features may be irrelevant to the output, especially as modern data management tools allow easy construction of large-scale datasets by combining different data sources. 'Feature selection', filtering the most relevant features for the downstream task, is a long-standing problem, with many methods proposed and used to date (Guyon & Elisseeff, 2003; Li et al., 2017; Dash & Liu, 1997). Feature selection can bring a multitude of benefits. A smaller number of features can yield superior generalization, and hence better test accuracy, by minimizing reliance on spurious patterns that do not hold consistently (Sagawa et al., 2020) and by not wasting model capacity on irrelevant features. In addition, reducing the number of input features can decrease the computational complexity and cost of deployed models, as the models need to learn the mapping from lower-dimensional input data, and the infrastructure only needs to support the selected features. Lastly, feature selection helps with interpretability, as users can focus their efforts to understand the model on a smaller subset of input features.

How can we select the target number of features in an optimal way? Feature selection has been studied with numerous approaches, as summarized in §2. For superior task accuracy, the feature selection method should consider the predictive model itself, as the optimal set of features depends on how the mapping between the inputs and outputs is done. Such methods have been approached in different ways, such as via sparse regularization and its extensions (Lemhadri et al., 2019). In the context of deep learning, the fundamental challenge is that the selection operation (given the target number of selected features) is non-differentiable. This necessitates the design of soft approximations of the feature selection operator that can be incorporated into end-to-end task learning.


To address these fundamental challenges, we propose Sparse Learnable Masks (SLM), a novel approach for scalable feature selection. SLM can be integrated into any deep learning architecture, provided that the optimization is gradient-descent based so that the mask and the model can be trained jointly. SLM proposes an effective way of adjusting the learnable masks to select the exact number of desired features, addressing the differentiability challenges. In addition, SLM uses a novel mutual information (MI) regularizer, based on a quadratic relaxation of the MI between the labels and the selected features, conditioned on the probability that a feature is selected. SLM comes with scaling benefits, yielding efficient feature selection even when the number of features or samples is very large. We demonstrate state-of-the-art feature selection results with SLM in different scenarios across a wide range of datasets.

2. RELATED WORK

Feature selection methods: Numerous methods have been studied for feature selection, and they broadly fall under three categories (Guyon & Elisseeff, 2003):
• Wrappers recompute the predictive model for each subset of features. As exhaustive search is NP-hard and computationally intractable, efficient search strategies such as forward selection or backward elimination have been developed. For instance, HSIC-Lasso (Yamada et al., 2014) proposes a feature-wise kernelized Lasso for capturing non-linear dependencies. Wrappers are difficult to integrate with modern deep learning, as the training complexity gets prohibitively large.
• Filters select subsets of variables as a pre-processing step, independent of the predictive model. Gu et al. (2012) developed the Fisher score, which selects features to maximize (minimize) the distances between data points in different (same) classes in the space spanned by the selected features. Principal feature analysis (PFA) (Lu et al., 2007b) selects features based on principal component analysis. Pan et al. (2020) use adversarial validation to select features based on how much their characteristics differ between training and test splits, as a way to improve robustness. There are also various methods based on MI maximization (Ding & Peng, 2005), which select features independent of the predictive model (unlike SLM). CMIM (Fleuret, 2004) maximizes the conditional MI between selected features and the class labels to account for feature inter-dependence. JMIM (Bennasar et al., 2015) maximizes the joint MI between class labels and the selected features, while addressing overconfidence in features that correlate with already-selected features, using a greedy search that selects features one at a time. Zadeh et al. (2017) formulate feature selection as a diversity maximization problem using an MI-based metric amongst features. The fundamental disadvantage of filter-based methods is that they are not optimized together with the predictive model, which often results in suboptimal performance.
• Embedded methods combine selection into training and are usually specific to given predictive models. Lasso regularization (Tibshirani, 1996) performs feature selection by varying the strength of the L1 regularization. Feng & Simon (2017) extend this idea by proposing an input-sparse neural network, where the input weights are penalized using the group Lasso penalty. Lemhadri et al. (2019) select only a subset of the features using input-to-output residual connections, allowing features to have non-zero weights only if their skip-layer connections are active. Concrete Autoencoder (Abid et al., 2019) proposes an unsupervised feature selector that uses a concrete selector layer as the encoder and a standard neural network as the decoder. FsNet (Singh et al., 2020) uses a concrete random variable for discrete feature selection in a selector layer, together with a supervised deep neural network regularized with the reconstruction loss. STG (Yamada et al., 2020) learns stochastic gates with a probabilistic relaxation of the count of selected features; it selects features and learns task prediction end-to-end.

Masking in deep neural networks: Masking the inputs to control information propagation is a commonly used approach in deep learning. Attention-based architectures, such as the Transformer (Vaswani et al., 2017) and the Perceiver (Jaegle et al., 2021), show strong results across many domains, with learnable key and query representations whose alignment yields the masks that control the contribution of the corresponding value representations. While these effectively reweight the inputs, they typically do not completely mask out the inputs (i.e. they do not yield zero attention weights). Towards this end, various works have focused on bringing sparsity into masking, for example based on thresholding (Zhao et al., 2019) or sparse normalization (Correia et al., 2019). TabNet (Arik & Pfister, 2019) directly generates sparse attention masks and applies them sequentially to input data, which can perform sample-dependent feature selection. Correia et al. (2020) achieve sparsity in latent distributions in neural networks by using sparsemax and its structured analogs, allowing for efficient latent variable marginalization. Lei et al. (2016) and Bastings et al. (2019) learn Bernoulli variables, which are analogous to our feature mask but in a local setting, for extractive rationale prediction in text. Paranjape et al. (2020) extend these ideas by proposing to control sparsity by optimizing the Kullback-Leibler (KL) divergence between the mask distribution and a prior distribution with controllable sparsity levels. Guerreiro & Martins (2021) develop a flexible rationale extraction mechanism using a constrained structured prediction algorithm on factor graphs. All of these perform sample-wise, not global, input selection.

In this work, our goal is to explore global feature selection. When train and test sets perfectly align in distribution, local feature selection can give superior performance due to its input-dependence. However, there is rarely such perfect alignment, and global selection provides robustness benefits when there is distribution shift between train and test sets, in addition to allowing more computational efficiency by globally removing features.

3. METHODS

Algorithm 1 describes SLM's end-to-end feature selection and task learning. The predictor $f_\theta$ can be any gradient-descent based model, such as an MLP, with a task-specific loss function $l$, such as cross entropy for classification or MAE for regression. The following sections present SLM's key components in detail.

Notation. Throughout this work, we let $X \in \mathbb{R}^{n \times d}$ denote the input data, $X_{sp} \in \mathbb{R}^{n \times d}$ the selected features, and $m_{sp} \in \mathbb{R}^{d}$ the learned feature selection mask. We use $\odot$ to denote element-wise multiplication between each input sample and $m_{sp}$: $X_{sp} = X \odot m_{sp}$. We let $F_t$ denote the number of selected features at step $t$, and $N$ the total number of training steps. Furthermore, $I(X, Y)$ denotes the mutual information between $X$ and $Y$, and $I_q(X, Y)$ its quadratic relaxation.

ALGORITHM 1 Training for SLM-based feature selection.
Input: Input data $X$ with target labels $Y$
Input: Total training steps $N$
Initialize: Learnable mask argument $m \leftarrow$ all-ones vector
for $t = 1$ to $N$ do
    Obtain the number of selected features $F_t$ using Eq 4 for step $t$.
    Generate sparse mask $m_{sp} = \mathrm{sparsemax}(m)$.
    Select and weight input features: $X_{sp} = X \odot m_{sp}$. Non-selected features are zeroed out.
    Input the selected features into the predictor $f_\theta(X_{sp})$ for the downstream task.
    Compute training task loss $l(X_{sp}, Y)$ and MI loss $E(X_{sp}, Y)$ in Eq 9.
    Update the parameters $\theta$ and $m$, using task loss $l$ and MI loss $E$.

3.1. MASK SPARSITY

SLM selects features by learning a mask $m_{sp} \in \mathbb{R}^{d}$ and zeroing out the features in the input $X \in \mathbb{R}^{n \times d}$ whose corresponding mask entries are zero. We use sparsemax normalization (Martins & Astudillo, 2016) to achieve sparsity in $m$. Sparsemax achieves sparsity in its output by returning the Euclidean projection of the input vector $v \in \mathbb{R}^{d}$ onto the probability simplex $\Delta^{d-1} := \{f \in \mathbb{R}^{d}_{\geq 0} \mid \sum_k f_k = 1\}$:

$$\mathrm{sparsemax}(v) := \operatorname*{argmin}_{p \in \Delta^{d-1}} \|p - v\|^2. \tag{1}$$

We apply sparsemax to the mask argument $m \in \mathbb{R}^{d}$ to obtain the sparse feature mask:

$$m_{sp} := \mathrm{sparsemax}(m) \in \mathbb{R}^{d}_{\geq 0}. \tag{2}$$

Compared with the commonly used softmax normalization followed by thresholding, the probability simplex projection in $\mathrm{sparsemax}(v)$ scales the top values in $v$ so that they are more equidistributed over $[0, 1]$. This equidistribution leads to greater feature weight separation, encouraging the model to discriminate amongst the features.
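The projection in Eq 1 has a well-known closed form based on sorting. The following is a minimal NumPy sketch of that closed form, not the authors' implementation; the function name `sparsemax` and the example mask argument `m` are illustrative choices.

```python
import numpy as np

def sparsemax(v):
    """Euclidean projection of a 1-D vector v onto the probability simplex."""
    v = np.asarray(v, dtype=float)
    z = np.sort(v)[::-1]                  # entries in descending order
    cumsum = np.cumsum(z)
    k = np.arange(1, v.size + 1)
    support = 1.0 + k * z > cumsum        # holds exactly for the first k* sorted entries
    k_star = k[support][-1]               # number of non-zero outputs
    tau = (cumsum[k_star - 1] - 1.0) / k_star   # threshold subtracted from every entry
    return np.maximum(v - tau, 0.0)

# Illustrative mask argument (assumed values, not taken from the paper).
m = np.array([0.5, 0.3, 0.1, -1.0])
m_sp = sparsemax(m)        # sparse mask: non-negative, sums to one, last entry is zero
uniform = sparsemax(np.ones(4))   # an all-ones argument maps to the uniform mask
```

Unlike softmax, the output contains exact zeros, which is what lets the mask remove features outright rather than merely down-weighting them.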

3.2. MASK SCALING TO YIELD DESIRED NUMBER OF SELECTED FEATURES

By its formulation, sparsemax does not yield a predetermined number of non-zero elements, as the sparsity depends on the location on the probability simplex $\Delta^{d-1}$ that $v$ projects onto. For a non-uniform vector $v \in \mathbb{R}^{d}$, we can adjust its projection onto $\Delta^{d-1}$ by multiplying $v$ by a positive scalar. In particular, a sufficiently large scalar increases the sparsity, while a sufficiently small scalar decreases the sparsity. To illustrate this, we give a simple example in Fig 1.

Example 3.1 (Adjusting sparsemax(v) sparsity by scaling). The probability simplex $\Delta^{1}$ in $\mathbb{R}^{2}$ is the line segment connecting $(0, 1)$ and $(1, 0)$, with these two points as the simplex boundary. Let $v = (x, y)$ be a point in $\mathbb{R}^{2}$, and $(z, w)$ its projection onto $\Delta^{1}$. We show that by varying the multiplier $m$, $\mathrm{sparsemax}(mv)$ has a varying degree of sparsity. The projection $(z, w) = \mathrm{sparsemax}((x, y))$ is the unique point that satisfies $(z, w) = \operatorname{argmin}_{(z,w)} \left( (y - w)^2 + (x - z)^2 \right)$, with $(z, w)$ elementwise nonnegative and $z + w = 1$. As we scale $(x, y)$ with $m$, $\mathrm{sparsemax}(m(x, y)) = \operatorname{argmin}_{(z,w)} \left( (my - w)^2 + (mx - z)^2 \right)$. This projection distance expands to

$$d(z, w) := (my - w)^2 + (mx - z)^2 = m^2 y^2 - 2myw + w^2 + m^2 x^2 - 2mxz + z^2.$$

Hence, $d(0, 1) - d(0.5, 0.5) = mx - my + 0.5$, which means that for any $(x, y)$ and $m$ with $y > x$, $\mathrm{sparsemax}(m(x, y))$ is closer to $(0, 1) \in \Delta^{1}$ whenever $m > 1/(2(y - x))$, and closer to $(0.5, 0.5)$ otherwise. Since projection is linear, this means varying the multiplier $m$ varies the sparsity of $\mathrm{sparsemax}((x, y))$.

This example conveys the intuition that larger multipliers lead to sparser outputs. More generally, one can show:

Lemma 3.2. Given a non-uniform vector $v \in \mathbb{R}^{d}$, to obtain $F$ nonzero elements in $\mathrm{sparsemax}(v)$, $v$ should be multiplied with the scalar

$$m = \begin{cases} \left( \sum_{i=1}^{F+1} v_{(i)} - (F+1) \cdot v_{(F+1)} \right)^{-1} & \text{if } |\mathrm{sparsemax}(v) > 0| > F, \\ \left( \sum_{i=1}^{F} v_{(i)} - F \cdot v_{(F)} \right)^{-1} & \text{if } |\mathrm{sparsemax}(v) > 0| < F, \end{cases} \tag{3}$$

where $v_{(1)} \geq v_{(2)} \geq \ldots \geq v_{(d)}$ denote the elements of $v$ sorted in descending order, and $|\mathrm{sparsemax}(v) > 0|$ is the number of nonzero elements of $\mathrm{sparsemax}(v)$. The proof can be found in §A.3.

Lemma 3.2 allows us to scale the mask to achieve the desired number of non-zero features. Note that since sparsemax has a particular Fenchel-Young loss (Blondel et al., 2020), scaling its argument by $m$ is equivalent to scaling the regularizer by $1/m$ in the Fenchel-Young formulation (Blondel et al., 2020; Peters et al., 2019).
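The scaling in Lemma 3.2 can be illustrated with a short NumPy sketch. It implements only the first case of Eq 3 (the mask currently has more than $F$ non-zeros) and assumes that the $F$-th and $(F+1)$-th largest entries of $v$ are distinct. The helper names `scale_for_support` and `count_nonzero`, the tolerance, and the example vector are assumptions made for illustration.

```python
import numpy as np

def sparsemax(v):
    """Sorting-based Euclidean projection onto the probability simplex."""
    v = np.asarray(v, dtype=float)
    z = np.sort(v)[::-1]
    cumsum = np.cumsum(z)
    k = np.arange(1, v.size + 1)
    k_star = k[1.0 + k * z > cumsum][-1]
    tau = (cumsum[k_star - 1] - 1.0) / k_star
    return np.maximum(v - tau, 0.0)

def scale_for_support(v, F):
    """First case of Eq 3: 1 / (sum of the top F+1 entries - (F+1) * v_(F+1))."""
    z = np.sort(np.asarray(v, dtype=float))[::-1]
    gap = z[:F + 1].sum() - (F + 1) * z[F]   # z[F] is the (F+1)-th largest entry
    return 1.0 / gap

def count_nonzero(p, tol=1e-9):
    """Count entries above a small tolerance to absorb floating-point round-off."""
    return int(np.sum(p > tol))

v = np.array([0.5, 0.25, 0.0, -1.0])   # illustrative, non-uniform
F = 2
before = count_nonzero(sparsemax(v))               # unscaled mask keeps 3 entries
scale = scale_for_support(v, F)
after = count_nonzero(sparsemax(scale * v))        # scaled mask keeps exactly F entries
```

The tolerance is needed because this multiplier places the $(F+1)$-th entry exactly on the boundary of the support, where round-off can leave a residue on the order of machine precision.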

3.3. TEMPERING FEATURE SPARSITY

Starting training on only a randomly selected subset of features likely leads to suboptimal learning in the initial steps, and if feature selection converges before the predictor converges, the predictor would be trained with suboptimal features. To alleviate these issues and improve training stability, we propose gradually decreasing the number of features selected until reaching the target $F_N$:

$$F_t = \begin{cases} F_0 - \dfrac{t}{N_{tmp}} (F_0 - F_N) & \text{if } t < N_{tmp}, \\ F_N & \text{if } t \geq N_{tmp}, \end{cases} \tag{4}$$

where $F_t$ is the number of selected features at step $t$ and $N_{tmp}$ is the tempering threshold. In our experiments, we simply set $N_{tmp} = N/2$, as it was observed to be a reasonable value across a wide range of datasets. To further stabilize training, instead of continuously decreasing the number of features, we decrease the number of features at five evenly spaced steps. This tempering allows the model to learn from more than the final target number of features during training, an advantage not shared by baseline methods. Furthermore, learning from all features initially likely provides a more robust initialization compared to starting learning with the target number of features, as the randomness in the initial selection is seldom optimal.
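A small Python sketch of the schedule in Eq 4 follows. The linear function mirrors the equation directly (with rounding to an integer count). The stepped variant is one possible reading of "five evenly spaced steps", namely snapping $t$ down to the most recent of five equally spaced drop points; that snapping rule, the function names, and the example values are assumptions rather than details given in the text.

```python
import math

def num_features_linear(t, N_tmp, F_0, F_N):
    """Eq 4: linear decrease from F_0 to F_N over the first N_tmp steps."""
    if t >= N_tmp:
        return F_N
    return round(F_0 - (t / N_tmp) * (F_0 - F_N))

def num_features_stepped(t, N_tmp, F_0, F_N, num_drops=5):
    """Piecewise-constant variant: Eq 4 evaluated only at evenly spaced drop points."""
    if t >= N_tmp:
        return F_N
    interval = N_tmp / num_drops
    t_snapped = math.floor(t / interval) * interval   # most recent drop point
    return num_features_linear(t_snapped, N_tmp, F_0, F_N)

# Illustrative setting: N = 1000 steps, N_tmp = N / 2, 400 features reduced to 50.
N, F_0, F_N = 1000, 400, 50
N_tmp = N // 2
schedule = [num_features_stepped(t, N_tmp, F_0, F_N) for t in range(N)]
levels = sorted(set(schedule), reverse=True)   # distinct feature counts visited
```

With these values the schedule holds each feature count for 100 steps, drops five times, and stays at the target for the second half of training.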

3.4. MUTUAL INFORMATION MAXIMIZATION

As an inductive bias that accounts for sample labels during feature selection, we propose to maximize the mutual information (MI) between the distribution of the selected features and the distribution of the labels. Specifically, we condition the MI on the probability that a feature is selected, as given by the mask $m$. This stands in contrast to prior MI-based feature selection works such as (Fleuret, 2004; Bennasar et al., 2015), which yield binary decisions on whether to select a feature. Let $X$ denote the random variable representing the features, and $Y$ the random variable representing the labels, with value spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively. Methods based on maximizing either the conditional or the joint MI between selected features and labels require the computation of an exponential number of probabilities, the optimization of which is intractable (Fleuret, 2004). Therefore, we propose a quadratic relaxation of MI, which is end-to-end differentiable. When we model $X$ and $Y$ as random variables, their MI $I(X, Y)$ can be defined and reformulated as:

$$I(X, Y) := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{P_{X,Y}(x, y)}{P_X(x)} - \sum_{y \in \mathcal{Y}} P_Y(y) \log P_Y(y), \tag{5}$$

where the second step derives from marginalizing over $\mathcal{X}$. Since the second term above does not depend on the features $X$, it can be ignored during optimization.

Quadratic relaxation. We propose a quadratic relaxation $I_q(X, Y)$ of Eq 5 to simplify $I(X, Y)$ and its optimization, while retaining many of its properties:

$$I_q(X, Y) := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{P_{X,Y}(x, y)^2}{P_X(x)} - \sum_{y \in \mathcal{Y}} P_Y(y)^2. \tag{6}$$

Here, terms of the form $p \log q$ are relaxed to $pq$. Note that both $p \log q$ and $pq$ are convex with respect to $p$ and $q$, and hence have the same correlation behavior with respect to $p$ and $q$. From an optimization perspective, $I_q(X, Y)$ is a good approximation of $I(X, Y)$ where $P_{X,Y}(x, y)/P_X(x)$ and $P_Y(y)$ in Eq 6 lie in the neighborhood $(1-\delta, 1+\delta)$.
In this neighborhood, using Taylor expansion:

$$\log(q) = \log(q_0) + \frac{q - q_0}{q_0} - \frac{(q - q_0)^2}{2 q_0^2} + \cdots$$

When $q_0 = 1$, this becomes $\log(q) \approx (q - 1) - (q - 1)^2/2 = -3/2 + 2q - q^2/2$; hence, $p \log(q)$ has the second order approximation $-3p/2 + 2pq$ (or $-3p/2 + 2p^2$ when $p = q$). Applying this to Eq 5, $p$ is $P_{X,Y}(x, y)$ in the first term and $P_Y(y)$ in the second. Since both $P_{X,Y}(x, y)$ and $P_Y(y)$ are probabilities, and hence must sum to 1 across the label space for any given sample, the linear term $-3p/2$ does not affect gradient descent optimization. Normalization is a hard constraint enforced during training that supersedes this linear term in the objective. Therefore, during optimization, $P_{X,Y}(x, y) \log(P_{X,Y}(x, y)/P_X(x))$ and $P_{X,Y}(x, y)^2/P_X(x)$, and thus $I_q(X, Y)$ and $I(X, Y)$, agree on their second order approximation. Note that the proposed relaxation is a variant of the commonly used quadratic approximation based on Taylor's theorem (Shafer, 1974; Hsieh et al., 2011).

Relating MI $I_q(X, Y)$ to model error $E(X, Y)$. Next we connect $I_q(X, Y)$ with the model's predictions using Lagrange multipliers. Let $R(x, y) : \mathcal{X} \times \mathcal{Y} \to [0, 1]$ denote the model's probability output for sample $x$ and outcome $y$. Below we model the discrete label case, e.g. for classification; the case where labels are continuous can be reduced to the discrete case by quantization (Fleuret, 2004). First, we define the quadratic error term $E(X, Y)$ in terms of $R(x, y)$, and expand:

$$\begin{aligned} E(X, Y) &:= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) \Big[ (1 - R(x, y))^2 + \sum_{y' \in \mathcal{Y} \setminus y} R(x, y')^2 \Big] \\ &= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) \Big[ 1 - 2R(x, y) + R(x, y)^2 + \sum_{y' \in \mathcal{Y} \setminus y} R(x, y')^2 \Big] \\ &= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) - 2 \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) R(x, y) + \sum_{x \in \mathcal{X}, y \in \mathcal{Y}, y' \in \mathcal{Y}} P_{X,Y}(x, y) R(x, y')^2 && \text{(combine last two terms and expand)} \\ &= 1 - 2 \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) R(x, y) + \sum_{x \in \mathcal{X}, y' \in \mathcal{Y}} P_X(x) R(x, y')^2 && \text{(marginalize)} \end{aligned} \tag{7}$$

Theorem 3.3.
Let $X$ and $Y$ denote the random variables representing the features and labels, respectively, and $\mathcal{Y}$ the value space for $Y$. Then maximizing the quadratic relaxation of mutual information $I_q(X, Y)$ is equivalent to minimizing the error $E(X, Y)$. More specifically,

$$E(X, Y) = 1 - \sum_{y \in \mathcal{Y}} P_Y(y)^2 - I_q(X, Y).$$

The proof uses Lagrange multipliers to solve for the optimal model predictions in terms of $P_{X,Y}(x, y)$ and $P_X(x)$; these can then be used to express the objective $E(X, Y)$ as a function of $I_q(X, Y)$. The full proof can be found in §A.4.

Application to feature selection. Now, we apply this finding concretely to feature selection, by selecting a given number of features that minimize $E(X, Y)$. Given a dataset, let $I$ denote the index set of the dataset samples, $J$ the index set of the features, and $L$ the set of possible labels. Let $S \subset J$ denote the index set of features selected, $X^S_i$ the random variable representing a selected subset of features for the $i$-th sample, and $Y_i$ the random variable representing the label for the $i$-th sample. Then, the joint probability can be written as $P_{X,Y}(x, y) = |\{i \in I \mid X^S_i = x, Y_i = y\}| / |I|$. Plugging this into the definition of $E(X, Y)$ we obtain:

$$\begin{aligned} E(X, Y) &:= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) \Big[ (1 - R(x, y))^2 + \sum_{y' \neq y} R(x, y')^2 \Big] \\ &= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \frac{|\{i \in I \mid X^S_i = x, Y_i = y\}|}{|I|} \Big[ (1 - R(X^S_i, Y_i))^2 + \sum_{y \neq Y_i} R(X^S_i, y)^2 \Big] \\ &= \sum_{i \in I} \Big[ (1 - R(X^S_i, Y_i))^2 + \sum_{y \neq Y_i} R(X^S_i, y)^2 \Big] / |I| \end{aligned} \tag{8}$$

During training, Eq. 8 is minimized under the following consistency constraint: for two samples $i_1$ and $i_2$ that have the same values in the selected features, i.e. $X^S_{i_1} = X^S_{i_2}$, their model predictions must be the same, i.e. $R(X^S_{i_1}, Y_{i_1}) = R(X^S_{i_2}, Y_{i_2})$.
To encourage the model to satisfy this constraint, we turn it into a soft consistency regularization term $r_{cs}$, converting constrained optimization to unconstrained optimization with regularization:

$$r_{cs} := \sum_{\{i_1, i_2\} \in I^2,\, i_1 < i_2} P(X^S_{i_1} = X^S_{i_2}) \left( R(X^S_{i_1}, Y_{i_1}) - R(X^S_{i_2}, Y_{i_2}) \right)^2,$$

where $P(X^S_{i_1} = X^S_{i_2})$ is the probability that the samples $X_{i_1}$ and $X_{i_2}$ take the same values in the selected feature set $S$. Let the learned mask consist of probabilities $m = \{p_j\}_{j \in J}$, i.e. $p_j$ is the probability that feature $j$ is selected. Then

$$P(X^S_{i_1} = X^S_{i_2}) = \prod_{j:\, X^{(j)}_{i_1} \neq X^{(j)}_{i_2}} (1 - p_j),$$

i.e. $P(X^S_{i_1} = X^S_{i_2})$ is the product, over the features $j$ at which $X_{i_1}$ and $X_{i_2}$ differ, of the probabilities that feature $j$ is not selected. (A difference in a feature that is not selected does not contribute to $P(X^S_{i_1} = X^S_{i_2})$.) In this probabilistic form, the consistency regularizer also encourages the selection of features with diverse ranges, since it encourages high $p_j$ for the features with many $X^{(j)}_{i_1} \neq X^{(j)}_{i_2}$ pairs. Therefore, the regularized objective to maximize the MI $I(X, Y)$ between the selected features and the labels becomes:

$$E(X, Y) = \sum_{i \in I} \Big[ (1 - R(X^S_i, Y_i))^2 + \sum_{y \neq Y_i} R(X^S_i, y)^2 \Big] / |I| + r_{cs}, \quad \text{where } r_{cs} = \sum_{\{i_1, i_2\} \in I^2,\, i_1 < i_2} \; \prod_{j:\, X^{(j)}_{i_1} \neq X^{(j)}_{i_2}} (1 - p_j) \left( R(X^S_{i_1}, Y_{i_1}) - R(X^S_{i_2}, Y_{i_2}) \right)^2.$$

In practice, $r_{cs}$ can be enforced batch-wise, and can be efficiently vectorized for the parallel computation of all $X^{(j)}_{i_1} \neq X^{(j)}_{i_2}$ pairs per batch using tensor operations. Note that since $R(X^S_i, Y_i)$ are just model predictions, and $p_j$ are learned feature mask probabilities, each component in $E(X, Y)$ is easily accessible. When the labels are in the continuous space, the minimization objective with the consistency regularizer is derived in exactly the same way to yield:

$$E(X, Y) = \sum_{i \in I} \left( Y_i - R(X^S_i) \right)^2 / |I| + r_{cs}.$$
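The batch-wise, vectorized computation described above can be sketched in NumPy for the classification case. This is an illustrative forward computation only, not the authors' code: it assumes discrete (or binned) feature values so that exact equality between samples is meaningful, adds $r_{cs}$ with unit weight as in the objective above, and uses hypothetical names (`mi_loss`, `p`, `R`) and toy inputs.

```python
import numpy as np

def mi_loss(X, y, R, p):
    """Quadratic error plus consistency regularizer r_cs for one batch.

    X: (b, d) feature values, y: (b,) integer labels,
    R: (b, c) predicted class probabilities, p: (d,) selection probabilities.
    """
    b = X.shape[0]
    onehot = np.eye(R.shape[1])[y]
    # (1 - R(x_i, y_i))^2 + sum over y' != y_i of R(x_i, y')^2, averaged over the batch
    error = np.sum((onehot - R) ** 2) / b
    # differs[i1, i2, j] is True where samples i1 and i2 differ at feature j
    differs = X[:, None, :] != X[None, :, :]
    # P(X^S_i1 = X^S_i2): product of (1 - p_j) over the differing features j
    p_same = np.prod(np.where(differs, 1.0 - p, 1.0), axis=2)
    r_true = R[np.arange(b), y]                       # R(X^S_i, Y_i)
    sq_diff = (r_true[:, None] - r_true[None, :]) ** 2
    i1, i2 = np.triu_indices(b, k=1)                  # all pairs with i1 < i2
    r_cs = np.sum(p_same[i1, i2] * sq_diff[i1, i2])
    return error + r_cs

# Toy batch (assumed values): feature 0 is always selected, feature 1 never.
X = np.array([[1.0, 0.0], [1.0, 5.0], [2.0, 0.0]])
y = np.array([0, 1, 0])
R = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
loss_selected = mi_loss(X, y, R, p=np.array([1.0, 0.0]))
loss_none = mi_loss(X, y, R, p=np.array([0.0, 0.0]))
```

In the toy batch, only the first two samples agree on the selected feature, so only that pair is penalized when `p = [1, 0]`; with `p = [0, 0]` every pair is treated as identical and the penalty grows.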
Our analysis is done with random variables $X$ and $Y$ in order to apply tools from probability theory. The data samples $X$ and labels $Y$ can be thought of as samples drawn from the distributions to which $X$ and $Y$ belong, where in the limit of infinitely many samples, $X$ and $Y$ perfectly reflect these distributions.

3.5. COMPUTATIONAL COMPLEXITY

As above, let $h$ be the hidden dimension, $n$ the number of samples, $b$ the batch size, and $N$ the total number of train steps; let $F_0$ be the total number of features, and $F_N$ the target number of features. We first discuss the complexity of individual components. The sparsemax operation is dominated by sorting, and hence has complexity $O(F_0 \log F_0)$ per sample, with an overall complexity of $O(n F_0 \log F_0)$. The consistency regularizer $r_{cs}$ in the MI-maximizing objective $E(X, Y)$ has complexity $O(n b F_N)$, as the calculation $\prod_{j:\, X^{(j)}_{i_1} \neq X^{(j)}_{i_2}} (1 - p_j) \left( R(X^S_{i_1}, Y_{i_1}) - R(X^S_{i_2}, Y_{i_2}) \right)^2$ in Eq 10 occurs over the selected feature index set $j \in S$, and is done between each sample and the others in its batch. The non-regularizer component in $E(X, Y)$ has complexity $O(nc)$, where $c$ is the constant for the number of discrete or binned labels. Assuming an MLP classifier with $h$ hidden units, which has complexity $O(n h^2)$, the overall algorithm has complexity $O(n F_0 \log F_0 + n b F_N + nc + n h^2)$, making SLM amenable to scaling to a large number of features. In addition, SLM amortizes the cost of feature selection across batches throughout training, making it more scalable with respect to the number of samples. This is in contrast to PFA (Lu et al., 2007a) or many other MI-based methods such as CMIM (Fleuret, 2004) or JMIM (Bennasar et al., 2015), which place the memory and compute burden of selection for the entire dataset in the same step.

4.1. DATASETS AND SETTINGS

We demonstrate the efficacy of SLM in feature selection on a wide range of datasets from numerous domains. For all experiments, we ensure fair comparison by employing a similar hyperparameter search space and budget: to search for hyperparameters such as batch size and learning rate for each baseline method and dataset, we conduct an extensive random search within the search grid, randomly drawing each value from a plausible range. We run a total of 300 trials for each method-dataset combination to ensure sufficient coverage, and tune all hyperparameters based on the validation accuracy. Additional experiments on selected feature interpretability, compute timings, and synthetic data to demonstrate SLM's scalability, as well as a comparison with further end-to-end baselines, can be found in §A.6, §A.7, §A.8, and §A.9, respectively.

We benchmark on a variety of real-world datasets across many domains, including computer vision, biological data, financial data, etc. Concretely, we benchmark on the Mice, MNIST, Fashion-MNIST, Isolet, Coil-20, Activity, Ames Housing, and IEEE-CIS Fraud datasets. We use a 70-10-20 train/validation/test split; when available, we use the exact same train/validation/test samples as (Lemhadri et al., 2019) for fair comparison. We give further detailed descriptions in §A.1. Cross entropy is used as the task loss function for classification tasks, and MAE as the task loss for regression.

We benchmark SLM against a variety of competitive methods. The mutual information (MI) based feature selection baseline uses entropy estimation from k-nearest neighbors distances, as described in (Kraskov et al., 2004; Ross, 2014), to estimate MI. Tree-based methods yield Gini importance scores, which can be used for feature selection. For this we benchmark two commonly used methods: random forest (RF) (Breiman, 2001), an ensemble of independent trees, and XGBoost (Chen & Guestrin, 2016), a scalable end-to-end tree boosting system.
We furthermore benchmark against methods discussed in §2: LassoNet (Lemhadri et al., 2019), which uses residual connections to allow the network to learn whether to use any given feature in a particular layer; feature importance ranking based on the Fisher score (Gu et al., 2012); principal feature analysis (PFA) (Lu et al., 2007a), a PCA-based method; and HSIC-Lasso (Yamada et al., 2014), which uses kernel learning to find non-linear feature interactions. Lastly, we benchmark against linear regression, where feature importance is determined by the learned feature coefficients. When available, we use results from (Lemhadri et al., 2019). For consistency and fairness, each baseline method uses the same input as SLM to select features, which are then passed to an MLP to compute the task metric.

4.2. TASK PERFORMANCE WITH FEATURE SELECTION

In this work, we consider feature importance to be measured by contribution towards the task metric, as accurate predictor performance is typically the end goal, and the importance of each individual feature is not well-defined due to feature interactions. Therefore, we use the task predictive accuracy given the selected features as the benchmark metric. First, we study selecting a fixed number of features across a wide range of high dimensional datasets (most with >400 features) and feature selection methods. We consistently select 50 features, as this represents a small fraction of the total features for most datasets, as is often done in practice. This number is kept consistent without tuning for any given method, to avoid favoring any one of them. Table 1 shows that SLM consistently yields competitive performance, outperforming all methods in all cases except on Mice and Ames, for which the performance is saturated due to the small number of original features, making feature selection less relevant. Most feature selection methods are not consistent in their performance. In contrast, SLM's strong performance is consistent. Interestingly, we observe that there are cases where SLM even outperforms the baseline of using all features, which can likely be attributed to superior generalization when the limited model capacity is focused on the most salient features. Next, we focus on the Fraud dataset, a large-scale dataset with a complex task and many heterogeneous features. It is highly non-i.i.d. (Grover et al., 2022), a setting in which high-capacity models can be prone to overfitting and poor generalization. Table 2 shows that SLM consistently outperforms other methods for different numbers of selected features, and that its performance degrades less when far fewer features are selected. Indeed, the AUC with 20 features out of 432 is >10% better than using all features, indicating improved generalization.
Fig 3 shows the task accuracy as a function of the number of features selected on the Activity dataset. The dark line shows the average of ten random hyperparameter trials, which are shown in a light hue, demonstrating that task performance can be near-optimal even with a small subset of features.


We study the utility of SLM components in this section, in particular the effects of the MI regularizer and of tempering the number of features, which gradually decreases the number of selected features from the full feature set to the target number. The effects are measured by randomly selecting ten hyperparameter settings and a seed, and recording the average performance with or without either the MI regularizer or tempering ("without tempering" refers to keeping the number of selected features constant throughout training). Fig 2 shows that both MI regularization and tempering positively affect task performance. This is consistent with the theory developed in §3: the MI regularizer encourages maximal mutual information sharing between the labels and the selected features, and tempering allows the model to initialize learning based on all features, rather than a randomly selected subset.

5. DISCUSSION

Feature importance interpretability. SLM learns a sparse mask $m_{sp}$ that contains the feature selection coefficients. We show that this approach yields superior results with end-to-end learning by allowing a smooth transition between selecting and un-selecting features. In addition, SLM can also be used for interpretation of global feature importance during inference, yielding the importance ranking of selected features, similar to other commonly used methods like SHAP (Lundberg & Lee, 2017). This can be highly desirable in high-stakes applications such as healthcare or finance, where an importance score can be more useful than simply knowing whether a feature is selected or not.

Feature interdependence during selection. Compared to prior MI-based feature selectors (Ding & Peng, 2005; Fleuret, 2004; Bennasar et al., 2015), SLM accounts for feature inter-dependence by learning inter-dependent probabilities $\{p_j\}_j$ for the selected features, where the $\{p_j\}_j$ jointly maximize the MI between features and labels. Furthermore, SLM learns feature selection and the task objective in an end-to-end way, which alleviates the selection of repetitive features that may individually be predictive, as gradient descent favors increasing the probability of a non-redundant, loss-decreasing but less predictive feature over that of an individually predictive but redundant feature.

Improved model generalization via feature selection. Feature selection can help improve generalization beyond the training set, especially for high capacity models like deep neural networks, which can easily overfit patterns from spurious features that do not hold across training and test data splits (Arjovsky et al., 2019). For instance, Table 1 shows that on some datasets, especially with SLM, prediction on a subset of features can outperform that on all features. Furthermore, Fig 2 shows that task performance can reach near-optimum with even a small subset of all features.
Therefore, feature selection is a potential alternative for alleviating compute cost during training and inference, without sacrificing on accuracy. Relation to other MI estimations in deep learning. R2: MI-based objectives have been used in other deep learning methods, such as InfoNCE (Oord et al., 2018) , InfoGAN (Chen et al., 2016) , and Deep Graph Infomax (Velickovic et al., 2019) . To estimate MI, these typically train classifiers on samples drawn from the joint distribution and the product of the marginals, whose exact distributions can be intractable. In contrast, for feature selection, while the exact distributions of the features and the labels are known, the computation of their mutual information and its maximization is computationally intractable. To address this, SLM proposes a quadratic relaxation of MI optimization, applied to feature selection by converting MI maximization to minimizing a loss function. SLM does not need to sample from the joint or marginal distributions, a potentially computationally intensive process. Furthermore, prior works (Chen et al., 2016; Velickovic et al., 2019) often require a contrastive term in estimation of MI with negative sampling, a process that is not needed in SLM. Future work. SLM can be integrated into unsupervised or semi-supervised learning, with modified objectives. In addition, our results indicate more significant outperformance for datasets with non i.i.d. characteristics as feature selection can effectively reduce the feature dimensionality and reduce the risk of overfitting to the spurious correlations of irrelevant features. Lastly, feature selection for data with structure (e.g. temporal or graph) is an interesting extension, which might be based on modifying SLM to apply masking to entire time-series or graph data.

6. CONCLUSION

We introduce SLM, a sparse learnable mask based feature selection framework that maximizes the MI between features and labels, while optimizing the training objective end-to-end. Learning the feature masks allows a smooth, probabilistic selection of features as well as insights on feature importance. SLM demonstrates competitive performance against SOTA baselines, and opens the door to future applications in domains such as graph or time series representation learning.

REPRODUCIBILITY STATEMENT

§3 gives a detailed description of the methodology used. All results derive from repeated cross validation. In particular, the main results reported in Table 1 derive from an extensive grid search over 300 trials for each method-dataset combination, where the test result based on the best validation performance is reported. Code will be open-sourced after the review process.

A APPENDIX

A.1 DATASET DETAILS

This section provides additional details on the experimental data. We first consider the real-world benchmark datasets in (Lemhadri et al., 2019). Mice consists of protein expression levels measured in the cortex of normal and trisomic mice that had been exposed to different experimental conditions. Each feature is the expression level of one protein. MNIST and Fashion-MNIST consist of 28-by-28 grayscale images of hand-written digits and clothing items, respectively. The images are converted to tabular data by treating each pixel as a separate feature. Isolet consists of preprocessed speech data of people speaking the names of the letters in the English alphabet, with each feature being one of the preprocessed quantities, including spectral coefficients and sonorant features. Coil-20 consists of centered grayscale images of 20 objects taken at certain pose intervals, hence the features are image pixels. Activity consists of sensor data collected from a smartphone mounted on subjects while they performed several activities such as walking or standing. For these datasets, we use exactly the same data splits and preprocessing approaches as (Lemhadri et al., 2019) for fair comparison, as well as the same model hyperparameter search space. In addition, we consider the Ames housing dataset (Cock, 2011), with the goal of predicting residential housing prices based on each home's features, as well as the IEEE-CIS Fraud Detection dataset (Kaggle, 2022), with the goal of identifying fraudulent transactions from numerous transaction- and identity-dependent features.

A.2 EXPERIMENTAL DETAILS

As described, we use hyperparameter tuning based on the validation accuracy for all cases. We use the Adam optimizer for training, with exponential decay. For benchmarks from (Lemhadri et al., 2019), for a fair comparison, our hyperparameter search space is the same as in the original paper. For Fraud, which is larger and more complex, we extend the search space as in Table 4. For baselines, we tune additional method-specific hyperparameters. For instance, for LassoNet, in addition to the shared hyperparameters, we also tune the $\ell_2$ penalization on the skip connection, the hierarchy parameter, and the dropout rate. For XGBoost, we also tune the number of estimators and the maximum tree depth.


A.3 PROOF OF LEMMA 3.2

Lemma 3.2. Given a nonuniform vector $v \in \mathbb{R}^K$, to obtain $F$ nonzero elements in $\mathrm{sparsemax}(v)$, $v$ should be multiplied with the scalar

$$m = \begin{cases} \left( \sum_{i=1}^{F+1} v_{(i)} - (F+1)\, v_{(F+1)} \right)^{-1} & \text{if } |\mathrm{sparsemax}(v) > 0| > F, \\ \left( \sum_{i=1}^{F} v_{(i)} - F\, v_{(F)} \right)^{-1} & \text{if } |\mathrm{sparsemax}(v) > 0| < F, \end{cases}$$

where $v_{(1)} \geq v_{(2)} \geq \ldots \geq v_{(K)}$ denote the elements of $v$ sorted in descending order.

Proof. We first show the case when $|\mathrm{sparsemax}(v) > 0| > F$, i.e. the sparsity needs to be increased (the case where sparsity needs to be decreased works analogously). By (Martins & Astudillo, 2016), the projection of $v$ onto $\Delta^{K-1}$ in Eq 1 takes the form

$$\mathrm{sparsemax}(v) = [v - \tau(v)]_+,$$

where $[x]_+ = \max\{0, x\}$, and $\tau$ takes the form

$$\tau(v) = \frac{\left( \sum_{i \leq k(v)} v_{(i)} \right) - 1}{k(v)},$$

with $k(v)$ defined as the index

$$k(v) := \max \left\{ k \in \{1, \ldots, K\} \;\middle|\; 1 + k\, v_{(k)} > \sum_{i \leq k} v_{(i)} \right\}.$$

Hence, increasing the sparsity such that sparsemax outputs only $F$ nonzero elements, i.e. decreasing the index $k(v)$ to $F$, requires finding the smallest $m$ such that $1 + (F+1)\, m\, v_{(F+1)} > \sum_{i \leq F+1} m\, v_{(i)}$ does not hold, i.e. $F + 1$ must be the first $k$ to fail the condition $1 + k\, v_{(k)} > \sum_{i \leq k} v_{(i)}$. Rewriting this condition in terms of $F$ we obtain:

$$1 + (F+1)\, m\, v_{(F+1)} > \sum_{i \leq F+1} m\, v_{(i)} \quad \text{implies} \quad 1 > m \left( \sum_{i \leq F+1} v_{(i)} - (F+1)\, v_{(F+1)} \right). \tag{13}$$

The smallest $m$ such that condition Eq. 13 does not hold is

$$m = \left( \sum_{i=1}^{F+1} v_{(i)} - (F+1)\, v_{(F+1)} \right)^{-1},$$

which given Eq 12 implies $mv$ has $F$ nonzero elements. Analogously, to derive the multiplier for $v$ to decrease $\mathrm{sparsemax}(v)$ sparsity, we need to increase the index $k(v)$ to $F$. This requires finding the largest $m$ such that $1 + F\, m\, v_{(F)} > \sum_{i \leq F} m\, v_{(i)}$ holds, which implies:

$$m = \left( \sum_{i=1}^{F} v_{(i)} - F\, v_{(F)} \right)^{-1}.$$

A.4 PROOF OF THEOREM 3.3

Theorem 3.3. Let $X$ and $Y$ denote the random variables representing the features and labels, respectively, and $\mathcal{Y}$ the value space for $Y$. Then maximizing the quadratic relaxation of mutual information $I_q(X, Y)$ is equivalent to minimizing the error $E(X, Y)$. More specifically,

$$E(X, Y) = 1 - \sum_{y \in \mathcal{Y}} P_Y(y)^2 - I_q(X, Y).$$

Proof. During training, the model seeks to produce the optimal predictions $R(x, y)$ that minimize $E(X, Y)$, while satisfying the constraint $\sum_{y \in \mathcal{Y}} R(x, y) = 1$. Hence we can apply Lagrange multipliers to solve for the optimal $R(x, y)$. Taking the derivatives of $E(X, Y)$ and of the constraint $g(X, Y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} R(x, y) - |\mathcal{X}|$ with respect to $R(x, y)$:

$$E'(X, Y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left( -2 P_{X,Y}(x, y) + 2 P_X(x) R(x, y) \right), \tag{14}$$

$$g'(X, Y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} 1.$$

By Lagrange multiplier theory, for an optimum set of model predictions $R^*(X, Y)$, there exists some $\lambda$ such that $E'(X, Y) = \lambda g'(X, Y)$. Marginalizing $E'(X, Y)$ over $Y$ yields:

$$E'(X, Y) = \sum_{x \in \mathcal{X}} \left( -2 P_X(x) + 2 P_X(x) \right) = 0,$$

since $\sum_{y \in \mathcal{Y}} R(x, y) = 1$. Therefore, by Eq 14, $R(x, y) = P_{X,Y}(x, y) / P_X(x)$. Plugging this into Eq 7, we obtain an expression relating the mutual information $I_q(X, Y)$ and the error $E(X, Y)$:

$$\begin{aligned} E(X, Y) &= 1 - 2 \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) R(x, y) + \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_X(x) R(x, y)^2 \\ &= 1 - 2 \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_{X,Y}(x, y) \frac{P_{X,Y}(x, y)}{P_X(x)} + \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \frac{P_{X,Y}(x, y)^2}{P_X(x)} \\ &= 1 - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \frac{P_{X,Y}(x, y)^2}{P_X(x)} \\ &= 1 - \sum_{y \in \mathcal{Y}} P_Y(y)^2 - I_q(X, Y) \quad \text{(by Eq 6).} \end{aligned}$$

Since $P_Y(y)$ is fixed for a given dataset, maximizing $I_q(X, Y)$ is equivalent to minimizing $E(X, Y)$.
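The identity in Theorem 3.3 can be checked numerically on a small discrete example. The sketch below is illustrative only: the joint distribution is a randomly generated, hypothetical one, and $I_q$ is computed as $\sum_{x,y} P_{X,Y}(x,y)^2 / P_X(x) - \sum_y P_Y(y)^2$, which is the form implied by the last step of the proof (Eq 6 itself is not restated in this appendix, so this form is an assumption). The variable names are not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P_{X,Y} over 5 feature values and 3 labels.
P_xy = rng.random((5, 3))
P_xy /= P_xy.sum()
P_x = P_xy.sum(axis=1)  # marginal over features
P_y = P_xy.sum(axis=0)  # marginal over labels


def error(R):
    """E(X, Y) = 1 - 2 * sum P_xy * R + sum P_x * R^2, as used in the proof."""
    return 1.0 - 2.0 * np.sum(P_xy * R) + np.sum(P_x[:, None] * R ** 2)


# Optimal predictions from the proof: R*(x, y) = P_{X,Y}(x, y) / P_X(x).
R_star = P_xy / P_x[:, None]

# Quadratic MI relaxation, in the form implied by the last line of the proof.
I_q = np.sum(P_xy ** 2 / P_x[:, None]) - np.sum(P_y ** 2)

lhs = error(R_star)
rhs = 1.0 - np.sum(P_y ** 2) - I_q

# Any other prediction table whose rows sum to one should not beat R*.
R_other = rng.random((5, 3))
R_other /= R_other.sum(axis=1, keepdims=True)

print(np.isclose(lhs, rhs), error(R_other) >= lhs)
```

The comparison against `R_other` reflects that $E(R) - E(R^*) = \sum_{x,y} P_X(x)\,(R(x,y) - R^*(x,y))^2 \geq 0$ for any row-normalized $R$.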

A.5 SPARSEMAX VS SOFTMAX WITH THRESHOLDING

Besides using sparsemax, an alternative method for learning the sparse mask $M_{sp}$ is to apply softmax normalization, followed by a top-k operation and an additional normalization to render it a probability mask. This method is unwieldy because of the additional steps. Moreover, from an optimization point of view, the top-k operation can only pass gradients through the top k values of softmax(v), whereas sparsemax can pass gradients through all of sparsemax(v). Furthermore, because the softmax-top-k normalization normalizes with respect to the absolute value of v, whereas sparsemax normalizes with respect to its relative values (by subtracting a v-dependent threshold), sparsemax(v) is more equi-distributed over the interval [0, 1] than softmax-top-k normalization (i.e. sparsemax(v) has lower entropy than softmax-top-k normalization), making it more discriminatory for feature selection.

A.6 FEATURE INTERPRETABILITY RESULTS

While SLM optimizes feature selection for the task metric, the fact that the selected features are global readily opens the door to feature importance interpretability applications, as the chosen features can give insights about the task. To this end, we focus on the Ames housing dataset (Cock, 2011), as its features are easily understandable. As mentioned in §A.1, the features in the Ames dataset consist of characteristics of houses, and the prediction target is the house price. We use the model parameters found in the best validation trial reported in Table 1, and select the top ten out of the 81 features. To obtain importance scores of the selected features, we study the selection probabilities learned in the feature mask.
Using this, the ten highest-probability features for determining a house's price are, with learned feature probabilities: 'OverallQual' (0.211), 'FullBath' (0.182), 'GarageCars' (0.124), 'BsmtFullBath' (0.0795), 'MSSubClass' (0.0758), 'GarageFinish' (0.0739), 'HalfBath' (0.0718), 'PoolArea' (0.0562), 'Fireplaces' (0.0473), 'HouseStyle' (0.0403). Some aspects of this selection conform to common sense: the overall quality of the property, the number of bathrooms, and the size of the garage or pool are good predictors of housing value. Other aspects are more surprising. For instance, the feature 'BedroomAbvGr' (the number of bedrooms above ground) is not selected, even though one would expect the number of bedrooms to be an important selling factor. However, as the number of bedrooms is positively correlated with the number of bathrooms (Eggers & Moumen, 2013), SLM is avoiding feature redundancy by selecting only one of the correlated features. The same reasoning applies to the features 'OverallQual', the overall quality, which is selected, and 'OverallCond', the overall condition, which is not selected.

A.7 COMPUTATIONAL COMPLEXITY EXPERIMENTS

As stated in §3.5, with $F_0$ the total number of features and $n$ the number of samples, SLM has $O(n F_0 \log F_0)$ dependence on $F_0$. To test whether this low theoretical complexity translates to fast feature selection in practice, we present the wall clock timing of SLM. We compare specifically against LassoNet, a strong baseline that also selects features end-to-end. Table 5 shows the timing results for one epoch on the Mice dataset, demonstrating that SLM's low theoretical complexity also translates to fast execution in practice.

Train-validation-test sets are split with a 0.7-0.1-0.2 ratio, as in all other experiments, and hyperparameter tuning is done with the search space presented in Table 6.
We compare SLM with other feature selection methods when they are used to select 300 features.

A.9 FURTHER COMPARISON WITH END-TO-END BASELINES

One of SLM's strengths is end-to-end feature selection along with task learning, which allows the model to incorporate inductive biases from the task directly into feature selection. Therefore, we specifically focus on comparing SLM with additional end-to-end feature selection methods, beyond the results in Table 1. As discussed in §2, Concrete Autoencoder (Abid et al., 2019) proposes an unsupervised feature selector that uses a concrete selector layer as the encoder and a deep neural network as the decoder. FsNet (Singh et al., 2020) uses a concrete random variable for discrete feature selection in a selector layer and a supervised deep neural network regularized with the reconstruction loss, with a focus on biological data, which are often high-dimensional with limited sample size. STG (Yamada et al., 2020) develops a fully embedded supervised method that learns stochastic gates with a probabilistic relaxation of the count of selected features. While all these works select features and learn task prediction end-to-end, given that SLM is a supervised model with a general focus beyond the high-dimensionality and low-sample-size setting, STG (Yamada et al., 2020) is the strongest and most closely related baseline to compare SLM with. Table 9 shows the comparison between SLM and STG on the Isolet and Activity datasets with 50 selected features. There are certain similarities between how SLM and STG control which features to select: SLM learns a sparse probability mask m for the features, whereas STG learns the parameters of approximate Bernoulli distributions via gradient descent for each feature.
While STG learns the parameters for each Bernoulli variable independently, one advantage of SLM is that it accounts for interdependence amongst selected features, both because the probabilities in m are interdependent and through the MI regularizer (further details are discussed in §5).

Table 9: Test accuracy (%) comparison between SLM and a closely related end-to-end feature selection baseline, STG, which controls feature selection via learned stochastic gates, on the Isolet and Activity datasets with 50 features selected. The two methods are compared under the same conditions to the largest extent possible: using the same hidden dimension, number of epochs, batch size, learning rate, etc., all randomly generated from within a feasible range. The non-shared hyperparameters are also randomly generated from within a feasible range. The results are averaged over ten different runs. SLM is able to account for interdependence amongst selected features, through the learned mask m and the MI regularizer.



We use a multi-layer perceptron (MLP) with a single hidden layer as the predictor, where the number of hidden units is chosen from [M/3, 2M/3, M, 4M/3].



Figure 1: Scaling v from the black to the red point moves its projection (green dotted line) onto ∆ 1 closer to the simplex boundary, increasing sparsemax(v) sparsity.
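The scaling illustrated in Figure 1 and stated in Lemma 3.2 can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it implements sparsemax following the closed form of (Martins & Astudillo, 2016) and only the first case of Lemma 3.2, where sparsity must be increased. The helper names `support_size` and `scale_to_increase_sparsity` and the example vector are hypothetical, and nonzero entries are counted up to a small tolerance because the multiplier places the (F+1)-th largest entry exactly at the threshold.

```python
import numpy as np


def sparsemax(v):
    """Euclidean projection of v onto the probability simplex."""
    v = np.asarray(v, dtype=float)
    z = np.sort(v)[::-1]                 # v_(1) >= v_(2) >= ... >= v_(K)
    cumsum = np.cumsum(z)
    k = np.arange(1, v.size + 1)
    holds = 1.0 + k * z > cumsum         # condition that defines k(v)
    k_v = k[holds][-1]                   # largest k for which it holds
    tau = (cumsum[k_v - 1] - 1.0) / k_v  # threshold tau(v)
    return np.maximum(v - tau, 0.0)


def support_size(p, tol=1e-9):
    """Number of entries of p that are positive up to a tolerance."""
    return int(np.sum(p > tol))


def scale_to_increase_sparsity(v, F):
    """Multiplier m from the first case of Lemma 3.2.

    Assumes sparsemax(v) currently has more than F nonzero entries.
    """
    z = np.sort(np.asarray(v, dtype=float))[::-1]
    return 1.0 / (z[:F + 1].sum() - (F + 1) * z[F])


v = np.array([0.5, 0.4, 0.3, 0.1])       # hypothetical mask logits
m = scale_to_increase_sparsity(v, F=2)
print(support_size(sparsemax(v)), support_size(sparsemax(m * v)))
```

In this example all four entries of sparsemax(v) are nonzero, and after multiplying v by m only two entries remain above the tolerance.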

Figure 2: (1) and (2) show ablation studies on the effect of MI regularization and tempering the number of features. Both ablation studies have the same number (50) of selected features on all datasets. (3) shows the task accuracy as a function of the number of features selected on the Activity dataset. The dark line shows the average of ten random hyperparameter trials, shown with light hue, demonstrating that task performance can be near-optimal even with a small subset of features.

Test performance on real-world benchmarks with 50 selected features. SLM outperforms competitive baselines. The reported metrics are AUC for Fraud, because of its high class imbalance; the median MAE on standard-normalized labels for Ames; and accuracy for all other datasets. The arrow next to each dataset indicates whether a higher or lower value is better. These test results are selected based on the best validation set performance during 300 hyperparameter grid search trials.



Table 3 summarizes the characteristics of the datasets used in the experiments.

Attributes of datasets used in experiments.

Hyperparameter tuning search space for experiments on the Fraud dataset.

Tables 7 and 8 highlight the superior performance of SLM compared to the alternative methods on challenging datasets with a very large number of features.

Test accuracy (%) on the Synthetic dataset with 300 salient features (L = 60) and 14000 training samples.

Test accuracy (%) on the Synthetic dataset with 100 salient features (L = 20) and 35000 training samples.


We demonstrate the performance of SLM on a synthetic dataset that is specifically constructed such that only a small subset of features affects the output value, while the vast majority are not useful for the task. All input features $X_{i,j}$ are sampled from the uniform distribution $U[-1, 1]$, and the noise terms $\epsilon_{i,j}$ are sampled from a standard Gaussian distribution with zero mean and unit variance. The input-output relationship is governed by the equations shown below:

As can be seen, the function depends on the input features in a highly nonlinear way, and in total 5L features are salient.

