SEQUENTIAL ATTENTION FOR FEATURE SELECTION

Abstract

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on ℓ₁ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. The algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.

1. INTRODUCTION

Feature selection is a classic problem in machine learning and statistics where one is asked to find a subset of 𝑘 features from a larger set of 𝑑 features, such that the prediction quality of the model trained on the subset is maximized. Finding a small, high-quality feature subset is desirable for many reasons: improving model interpretability, reducing inference latency, decreasing model size, regularization, and removing redundant or noisy features to improve generalization. We direct the reader to Li et al. (2017b) for a survey on the role of feature selection in machine learning.

The widespread success of deep learning has prompted an intense study of feature selection algorithms for neural networks, especially in the supervised setting. While many methods have been proposed, we focus on a line of work that studies the use of attention for feature selection. The attention mechanism in machine learning roughly refers to applying a trainable softmax mask to a given layer, which allows the model to "focus" on certain important signals during training. Attention has recently led to major breakthroughs in computer vision, natural language processing, and several other areas of machine learning (Vaswani et al., 2017). For feature selection, the works of Wang et al. (2014); Gui et al. (2019); Skrlj et al. (2020); Wojtas & Chen (2020); Liao et al. (2021) all present new approaches for feature attribution, ranking, and selection that are inspired by attention.

One problem with naively using attention for feature selection is that it can ignore the residual values of features, i.e., the marginal contribution a feature has on the loss conditioned on previously selected features being in the model. This can lead to several problems, such as selecting redundant features or ignoring features that are uninformative in isolation but valuable in the presence of others.

Sequential Attention.
Our starting point for Sequential Attention is the well-known greedy forward selection algorithm, which repeatedly selects the feature with the largest marginal improvement in model loss when added to the set of currently selected features (see, e.g., Das & Kempe (2011) and Elenberg et al. (2018)). Greedy forward selection is known to select high-quality features, but it requires training 𝑂(𝑘𝑑) models and is thus impractical for many modern machine learning problems. To reduce this cost, one natural idea is to train only 𝑘 models, where the model trained in each step approximates the marginal gains of all 𝑂(𝑑) unselected features. Said another way, we relax the greedy algorithm to fractionally consider all 𝑂(𝑑) candidate features simultaneously rather than computing their exact marginal gains one by one with separate models.

We implement this idea by introducing a new set of trainable variables w ∈ R^𝑑 that represent feature importance, or attention logits. In each step, we select the feature with maximum importance and add it to the selected set. To ensure that the score-augmented models (1) have differentiable architectures and (2) are encouraged to home in on the best unselected feature, we take the softmax of the importance scores and multiply each input feature value by its corresponding softmax value, as illustrated in Figure 1.

Formally, given a dataset X ∈ R^(𝑛×𝑑) represented as a matrix with 𝑛 rows of examples and 𝑑 feature columns, suppose we want to select 𝑘 features. Let 𝑓(·; 𝜃) be a differentiable model, e.g., a neural network, that outputs the predictions 𝑓(X; 𝜃). Let y ∈ R^𝑛 be the labels, ℓ(𝑓(X; 𝜃), y) be the loss between the model's predictions and the labels, and ∘ be the Hadamard product. Sequential Attention outputs a subset 𝑆 ⊆ [𝑑] := {1, 2, . . . , 𝑑} of 𝑘 feature indices, and is presented below in Algorithm 1.

Theoretical guarantees.
We give provable guarantees for Sequential Attention for least-squares linear regression by analyzing a variant of the algorithm called regularized linear Sequential Attention. This variant (1) uses Hadamard product overparameterization directly between the attention weights and the feature values, without normalizing the attention weights via softmax(w, 𝑆), and (2) adds ℓ₂ regularization to the objective, hence the terms "linear" and "regularized". Note that ℓ₂ regularization, or weight decay, is common practice when using gradient-based optimizers (Tibshirani, 2021). We give theoretical and empirical evidence that replacing the softmax with different overparameterization schemes leads to similar results (Section 4.2) while offering a more tractable analysis. In particular, our main result shows that regularized linear Sequential Attention has the same provable guarantees as the celebrated Orthogonal Matching Pursuit (OMP) algorithm of Pati et al. (1993) for sparse linear regression, without making any assumptions on the design matrix or the response vector.

Theorem 1.1. For linear regression, regularized linear Sequential Attention is equivalent to OMP.¹

¹ The code is available at: github.com/google-research/google-research/tree/master/sequential_attention
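To make the OMP side of this equivalence concrete, the following is a standard NumPy sketch of OMP for least-squares regression. This is the textbook formulation, not the paper's released implementation; the function name `omp` and the variable names are illustrative.

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: greedily add the feature most
    correlated with the current residual, then refit all selected
    features by least squares."""
    selected = []
    resid = y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ resid)
        scores[selected] = -np.inf          # never re-pick a selected feature
        selected.append(int(np.argmax(scores)))
        # Refit on all selected features, orthogonalizing the residual
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        resid = y - X[:, selected] @ coef
    return selected
```

On noiseless data generated from a sparse linear model with Gaussian features, this procedure recovers the true support, which is the kind of behavior Theorem 1.1 transfers to regularized linear Sequential Attention.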




Figure 1: Sequential Attention applied to a model 𝑓(·; 𝜃). At each step, the selected features 𝑖 ∈ 𝑆 are used as direct inputs to the model, while each unselected feature 𝑖 ∉ 𝑆 is downscaled by the scalar value softmax 𝑖 (w, 𝑆̄), where w ∈ R^𝑑 is the vector of learned attention weights and 𝑆̄ = [𝑑] ∖ 𝑆 is the set of unselected features.
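The masking depicted in Figure 1 and the greedy outer loop can be sketched together for the simplest case of a linear model trained by plain gradient descent. This is a minimal illustration under assumed hyperparameters (`steps`, `lr`, the initialization scale), not the paper's implementation; the gradient of the loss with respect to the logits is written out by hand via the softmax Jacobian.

```python
import numpy as np

def sequential_attention(X, y, k, steps=500, lr=0.01, seed=0):
    """Greedy one-pass sketch: in each of k rounds, jointly train a linear
    model and attention logits on softmax-masked inputs, then select the
    unselected feature with the largest logit."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    selected = []
    for _ in range(k):
        unsel = np.array([i for i in range(d) if i not in selected])
        theta = rng.normal(scale=0.01, size=d)  # model weights
        w = np.zeros(len(unsel))                # attention logits (unselected only)
        for _ in range(steps):
            s = np.exp(w - w.max())
            s /= s.sum()                        # softmax over unselected features
            mask = np.ones(d)
            mask[unsel] = s                     # selected features pass through
            resid = (X * mask) @ theta - y      # squared-loss residual
            grad_theta = (X * mask).T @ resid / n
            g = (X.T @ resid) * theta / n       # dLoss/dmask, per feature
            grad_w = s * (g[unsel] - s @ g[unsel])  # chain rule through softmax
            theta -= lr * grad_theta
            w -= lr * grad_w
        selected.append(int(unsel[np.argmax(w)]))
    return selected
```

Informative features accumulate large logits because downweighting them would increase the loss, so the argmax over w serves as the cheap stand-in for the exact marginal gain computed by classical greedy forward selection.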

