SEQUENTIAL ATTENTION FOR FEATURE SELECTION

Abstract

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on ℓ1 regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.

1. INTRODUCTION

Feature selection is a classic problem in machine learning and statistics where one is asked to find a subset of 𝑘 features from a larger set of 𝑑 features, such that the prediction quality of the model trained using the subset of features is maximized. Finding a small and high-quality feature subset is desirable for many reasons: improving model interpretability, reducing inference latency, decreasing model size, regularization, and removing redundant or noisy features to improve generalization. We direct the reader to Li et al. (2017b) for a survey on the role of feature selection in machine learning.

The widespread success of deep learning has prompted an intense study of feature selection algorithms for neural networks, especially in the supervised setting. While many methods have been proposed, we focus on a line of work that studies the use of attention for feature selection. The attention mechanism in machine learning roughly refers to applying a trainable softmax mask to a given layer, which allows the model to "focus" on certain important signals during training. Attention has recently led to major breakthroughs in computer vision, natural language processing, and several other areas of machine learning (Vaswani et al., 2017). For feature selection, the works of Wang et al. (2014); Gui et al. (2019); Skrlj et al. (2020); Wojtas & Chen (2020); Liao et al. (2021) all present new approaches for feature attribution, ranking, and selection that are inspired by attention.

One problem with naively using attention for feature selection is that it can ignore the residual values of features, i.e., the marginal contribution of a feature to the loss conditioned on previously-selected features being in the model. This can lead to several problems, such as selecting redundant features or ignoring features that are uninformative in isolation but valuable in the presence of others.
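
To make the attention-based approach concrete, the following is a minimal sketch of a single-round attention selector: a trainable softmax mask over the 𝑑 input features is learned jointly with the model, and the 𝑘 features with the largest weights are kept. This is an illustrative sketch in PyTorch, not the implementation of any of the cited works; all names here are our own.

    import torch
    import torch.nn as nn

    class AttentionFeatureMask(nn.Module):
        """Scales each input feature by a trainable softmax weight (illustrative)."""

        def __init__(self, num_features: int):
            super().__init__()
            # One trainable logit per feature; softmax turns the logits into a mask.
            self.logits = nn.Parameter(torch.zeros(num_features))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The softmax mask lets the model "focus" on high-weight features.
            mask = torch.softmax(self.logits, dim=0)
            return x * mask

    # After end-to-end training, a single-round selector would simply keep the
    # k features with the largest mask weights:
    #   weights = torch.softmax(mask_layer.logits, dim=0)
    #   selected = torch.topk(weights, k).indices

Because all 𝑘 features are chosen from one set of learned weights, such a selector never asks how much a feature adds on top of the others it has already chosen, which is precisely the issue raised above.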


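To make "residual value" concrete, the following is a minimal sketch of classical greedy forward selection, the procedure that Sequential Attention approximates in a single training pass by using attention weights as a proxy for these marginal gains. Here train_and_score is a hypothetical helper, assumed to train a model on the given feature subset and return its validation quality (higher is better).

    def greedy_forward_selection(train_and_score, num_features: int, k: int):
        """Select k features one at a time by their marginal gain (illustrative)."""
        selected = []
        remaining = set(range(num_features))
        for _ in range(k):
            # Residual value: the gain of each candidate feature conditioned
            # on the features that have already been selected.
            best = max(remaining, key=lambda j: train_and_score(selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected

The price of this adaptivity is cost: the sketch above retrains the model O(𝑘𝑑) times, which motivates one-pass approximations such as the algorithm studied in this paper.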