GREEDY INFORMATION MAXIMIZATION FOR ACTIVE FEATURE ACQUISITION

Abstract

Feature selection is commonly used to reduce feature acquisition costs, but the standard approach is to train models with static feature subsets. Here, we consider the active feature acquisition problem where the model sequentially queries features based on the presently available information. Active feature acquisition has frequently been addressed using reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This approach is theoretically appealing but difficult to implement in practice, so we introduce a learning algorithm based on amortized optimization that recovers the greedy policy when perfectly optimized. We find that the greedy method outperforms both RL-based and static feature selection methods across numerous datasets, which validates our approach as a simple but powerful baseline for this problem.

1. INTRODUCTION

Machine learning models require informative inputs to make accurate predictions, but a model's input features can be costly to acquire. In settings where information is gathered sequentially, and particularly when obtaining features requires time or money, it is reasonable to query features adaptively based on the presently available information. We refer to this as active feature acquisition¹ (Saar-Tsechansky et al., 2009), and it has been considered by many works in the last decade (Dulac-Arnold et al., 2011; Chen et al., 2015b; Early et al., 2016a; He et al., 2016a; Kachuee et al., 2018). Feature selection with fixed feature sets (static feature selection) has received more attention (see reviews by Li et al. 2017; Cai et al. 2018), but active approaches offer the potential for better performance given a fixed budget. This is easy to see, because selecting the same features for all instances (e.g., all patients visiting a doctor's office) is one possible policy but likely suboptimal in many situations. On the other hand, active approaches are also more challenging because they require both learning a selection policy and making predictions with multiple feature sets.

Prior work has approached active feature acquisition in several ways, but often using reinforcement learning (RL) (Dulac-Arnold et al., 2011; Shim et al., 2018; Kachuee et al., 2018; Janisch et al., 2019; Li & Oliva, 2021). RL is a natural approach for sequential decision-making problems, but current methods are difficult to train and do not reliably outperform static feature selection (Henderson et al., 2018; Erion et al., 2021). Our work therefore explores a simpler approach: sequentially selecting features based on their conditional mutual information (CMI) with the response variable.
The greedy CMI approach is discussed in prior work (Fleuret, 2004; Chen et al., 2015b; Ma et al., 2018), but it remains difficult to implement because it requires perfect knowledge of the joint data distribution. The focus of this work is therefore developing a simple method to approximate the greedy policy. Our main insight is to leverage amortized optimization (Amos, 2022): by developing an optimization-based characterization of the greedy CMI policy, we design an end-to-end learning approach that recovers the policy when it is perfectly optimized. Our contributions in this work are the following:

1. We develop an optimization-based characterization of the locally optimal selection policy, and we show that this yields greedy selections based on CMI for classification problems. These preliminary results later provide the foundation for our learning approach.

2. We describe a procedure to implement the greedy policy given perfect knowledge of the data distribution. While impractical, this procedure relates our approach to a prior CMI approximation (Ma et al., 2018) and shows that the greedy policy is a special case of existing expected utility frameworks (Saar-Tsechansky et al., 2009; Early et al., 2016a).

3. We develop an amortized optimization approach to approximate the greedy policy using deep learning. Our method permits simple end-to-end learning via stochastic gradient descent, and we prove that our objective recovers the greedy CMI policy when it is perfectly optimized.

We demonstrate the effectiveness of our approach on numerous datasets, and the results show that our method outperforms both RL-based and static feature selection methods. Overall, our work shows that a greedy policy is a simple and powerful method for active feature acquisition.

¹ The problem is also sometimes referred to as sequential or dynamic feature selection.
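To make the greedy CMI selection rule concrete, the sketch below computes I(y; x_i | x_s) exactly for a tiny, fully known discrete distribution and performs one greedy step. The joint probability table is a hypothetical toy example; with real data the distribution is unknown, which is exactly why the paper develops a learned approximation.

```python
import numpy as np

# Toy joint distribution p(x1, x2, y) over binary variables (hypothetical numbers);
# axes 0 and 1 index the two features, axis 2 indexes the label y.
p = np.array([[[0.20, 0.02], [0.20, 0.03]],
              [[0.03, 0.22], [0.05, 0.25]]])

def cmi(p_joint, observed, candidate):
    """I(y; x_candidate | x_s = observed values), with observed = {feature axis: value}."""
    idx = [slice(None)] * p_joint.ndim
    for axis, val in observed.items():
        idx[axis] = [val]                      # keep the axis, now of length 1
    cond = p_joint[tuple(idx)]
    cond = cond / cond.sum()                   # p(remaining features, y | x_s)
    rest = [a for a in range(p_joint.ndim - 1) if a != candidate and a not in observed]
    cond = cond.sum(axis=tuple(rest), keepdims=True)
    joint_iy = np.squeeze(cond)                # p(x_candidate, y | x_s)
    p_i = joint_iy.sum(axis=1, keepdims=True)  # p(x_candidate | x_s)
    p_y = joint_iy.sum(axis=0, keepdims=True)  # p(y | x_s)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint_iy * np.log(joint_iy / (p_i * p_y))
    return np.nansum(terms)                    # treat 0 * log 0 as 0

# One greedy step: pick the feature with the highest CMI given nothing observed yet.
first_pick = max(range(2), key=lambda i: cmi(p, {}, i))
```

In this toy distribution the label depends strongly on the first feature and barely on the second, so the greedy step selects feature 0; subsequent steps would condition on the acquired value and repeat the argmax over the remaining features.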

2. PROBLEM FORMULATION

In this section, we introduce notation used throughout the paper and describe the active feature acquisition problem.

2.1. NOTATION

Let x denote a vector of input features and y a response variable for a supervised learning task. The input consists of d distinct features, or x = (x_1, ..., x_d). We use s ⊆ [d] ≡ {1, ..., d} to denote a subset of indices and x_s = {x_i : i ∈ s} a subset of features. Bold symbols x, y represent random variables, the symbols x, y are possible values, and p(x, y) denotes the data distribution. Our goal is to design a policy that controls which features are selected. The feature selection policy can be viewed as a function π(x_s) ∈ [d], meaning that it receives a subset of features as its input and outputs the next feature index to query. The policy is accompanied by a predictor f(x_s) that can make predictions given the set of available features; for example, if y is discrete with K classes, then predictions lie in the probability simplex, or f(x_s) ∈ Δ^{K−1}. The notation f(x_s ∪ x_i) represents the prediction given the combined features x_{s∪{i}}. The paper initially considers policy and predictor functions that operate on feature subsets because these are simpler to analyze, and Section 4 proposes an implementation using a mask variable m ∈ [0, 1]^d where the functions operate on x ⊙ m.
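A minimal sketch of the mask-based encoding mentioned above: the subset x_s becomes a fixed-length vector x ⊙ m that any standard network can consume. Appending the mask m itself to the input is an assumption on our part (a common practical choice, not something specified in this section) so that an observed value of zero can be distinguished from a masked-out feature.

```python
import numpy as np

def masked_input(x, s):
    """Encode the feature subset x_s as a fixed-length vector via x ⊙ m."""
    m = np.zeros_like(x, dtype=float)
    m[list(s)] = 1.0                      # m ∈ [0, 1]^d, one entry per feature
    return np.concatenate([x * m, m])     # appending m disambiguates observed zeros

x = np.array([3.0, 0.0, -1.5, 2.0])       # d = 4 toy feature values
z = masked_input(x, {0, 2})               # observe the features at indices 0 and 2
```

Here z has length 2d: the first half is the zero-masked feature vector and the second half indicates which entries are actually observed.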

2.2. ACTIVE FEATURE ACQUISITION

The goal of active feature acquisition is to achieve maximum predictive accuracy while selecting features within a minimal budget. Access to more features generally makes prediction easier, so the challenge is selecting a small number of informative features. There are multiple formulations for this problem, including non-uniform feature costs and different budgets for each sample (Kachuee et al., 2018), but we focus on the setting with a fixed budget and uniform costs. Our goal is to begin with no features for each data example, sequentially select a feature set x_s such that |s| = k for a fixed k < d, and make accurate predictions for the response y.

Given a loss function ℓ(ŷ, y) that measures the discrepancy between predictions and labels, a natural scoring criterion is the expected loss after selecting k features. The scoring is applied to a policy-predictor pair (π, f), and we define the score for a fixed budget k as follows:

v_k(π, f) = E_{p(x,y)} [ ℓ( f({x_{i_t}}_{t=1}^k), y ) ],    (1)

where feature indices are chosen sequentially for each (x, y) according to i_n = π({x_{i_t}}_{t=1}^{n-1}). Our goal is to minimize v_k(π, f), or equivalently to maximize our final predictive accuracy.

One approach is to frame this as a Markov decision process (MDP) and solve it using standard RL techniques, where π and f are trained to optimize a reward function based on eq. (1). Indeed, several works have designed such formulations (Dulac-Arnold et al., 2011; Shim et al., 2018; Kachuee et al., 2018; Janisch et al., 2019; Li & Oliva, 2021). Our work instead focuses on a simpler greedy approach, which is easier to interpret and easier to train in practice.
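The score v_k(π, f) can be estimated by Monte Carlo over a dataset, rolling the policy forward k steps per example and averaging the final loss. The sketch below makes this evaluation loop concrete; the policy, predictor, and data are hypothetical toy stand-ins, not the learned models discussed in this paper.

```python
import numpy as np

def evaluate(policy, predictor, loss, X, Y, k):
    """Monte Carlo estimate of v_k(π, f): mean loss after k sequential selections."""
    total = 0.0
    for x, y in zip(X, Y):
        s = []                                # indices selected so far
        for _ in range(k):
            s.append(policy(x, s))            # i_n = π({x_{i_t}}_{t=1}^{n-1})
        total += loss(predictor(x, s), y)     # ℓ(f({x_{i_t}}_{t=1}^k), y)
    return total / len(X)

# Toy stand-ins: select the lowest unobserved index, predict the mean of observed values.
policy = lambda x, s: min(i for i in range(len(x)) if i not in s)
predictor = lambda x, s: float(np.mean([x[i] for i in s]))
loss = lambda y_hat, y: (y_hat - y) ** 2

X = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
Y = [2.0, 5.0]
v2 = evaluate(policy, predictor, loss, X, Y, k=2)   # selects indices 0 and 1 per example
```

Note that the policy sees only the already-acquired subset at each step; a learned policy would replace the toy `policy` while the evaluation loop itself is unchanged.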

