EVALUATION OF ACTIVE FEATURE ACQUISITION METHODS UNDER MISSING DATA

Anonymous authors
Paper under double-blind review

Abstract

Machine learning (ML) methods generally assume that the full set of features is available at no cost. If acquiring a feature is costly at run-time, one might want to balance its acquisition cost against its predictive value for the ML task. The task of training an AI agent to decide which features to acquire is called active feature acquisition (AFA). Current AFA methods, however, are challenged when the AFA agent has to be trained or tested on datasets that contain missing data. We formulate, for the first time, the problem of active feature acquisition performance evaluation (AFAPE) under missing data, i.e. the problem of adjusting for the inevitable missingness distribution shift between train/test time and run-time. We first propose a new causal graph, the AFA graph, that characterizes the AFAPE problem as an intervention on the environment used to train AFA agents. We then show that the conventional approaches to handling missing data in AFAPE (off-policy policy evaluation, blocked feature acquisitions, imputation, and inverse probability weighting (IPW)) often lead to biased results or are data inefficient. We therefore propose active feature acquisition importance sampling (AFAIS), a novel estimator that is more data efficient than IPW. We demonstrate the detrimental conclusions to which biased estimators can lead, as well as the high data efficiency of AFAIS, in multiple experiments using simulated and real-world data under induced MCAR, MAR and MNAR missingness.

1. INTRODUCTION

Machine learning methods generally assume the full set of input features is available at run-time at little to no cost. This is, however, not always the case, as acquiring features may impose a significant cost. For example, in medical diagnosis, the cost of a feature acquisition (e.g. a biopsy test) could include both its monetary cost and the potential harm to the patient. In this case, the predictive value of a feature should be balanced against its acquisition cost: physicians order biopsies, MRI scans, or lab tests only if their diagnostic value outweighs their cost or risk. This challenge becomes more critical when physicians aim to predict a large number of diverse outcomes, each of which has a different set of informative features. Returning to the medical example, a typical emergency department (ED) is able to diagnose thousands of different diseases based on a large set of possible observations. For every new emergency patient entering the ED with an unknown diagnosis, clinicians must narrow down their search for the proper diagnosis via step-by-step feature acquisitions. In this setting, an ML model that requires the entire feature set as input is infeasible. Active feature acquisition (AFA) addresses this problem by designing two AI systems: i) a so-called AFA agent that decides which features should be observed, balancing information gain against feature cost; and ii) an ML prediction model, often a classifier, that solves the prediction task based on the acquired set of features. An AFA agent, by definition, induces missingness by selecting only a subset of features. We call this AFA missingness; it occurs at run-time (e.g. when the AFA agent is deployed at the hospital). In addition, in many AFA applications, the retrospective data used for model training and evaluation also contain missing entries. This missingness is induced by a different feature acquisition process (e.g. 
by physicians ordering from a wide range of diagnostic tests). We call this retrospective missingness. When using retrospective data (during training and evaluation), the agent can only choose among the available features. At run-time, however, we assume the agent has the freedom to choose from all features. This corresponds to a feature "availability" distribution shift that requires adjustment. Apart from the difficulties of training an AFA agent on incomplete data, estimating the real-world performance of agents (at run-time) from incomplete retrospective data is nontrivial and challenging. Evaluation biases might lead to false promises about an agent's performance, a serious risk especially in safety-critical applications. This paper is the first, to our knowledge, to systematically study the evaluation of AFA agents under this distribution shift. We call this problem active feature acquisition performance evaluation (AFAPE) under missing data. Our work on the AFAPE problem makes four major contributions to the field of AFA. First, we propose the AFA graph, a new causal graph that characterizes the AFAPE problem under retrospective missingness as off-environment policy evaluation under an environment intervention. This is, in our opinion, an important connection between the fields of missing data and policy evaluation (including reinforcement learning (RL)) that will enable cross-fertilization of ideas between these two areas. Second, we show that the AFA literature only contains approaches to handling missing data that are either derived from a pure RL perspective (off-policy policy evaluation (OPE) (Chang et al., 2019), blocking of acquisition actions (Janisch et al., 2020; Yoon et al., 2018)) or correspond to biased methods for missing data (conditional mean imputation (An et al., 2022; Erion et al., 2021; Janisch et al., 2020)). 
We show that these approaches will almost certainly lead to biased results and/or can be extremely data inefficient. We demonstrate in experiments with exemplary missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) patterns that these biased evaluation methods can lead to detrimental conclusions about which AFA agent performs best, which can, for example, put patients' lives at high risk if such methods are deployed without proper evaluation. Third, we draw the reader's attention to unbiased estimators from the missing data literature (inverse probability weighting (IPW) (Seaman & White, 2013) and multiple imputation (MI) (Sterne et al., 2009)) which have not previously been applied to AFA. These methods, however, do not account for the special structure of the AFAPE problem, in which not every feature might be acquired by the AFA agent. We show that missing data methods can therefore be data inefficient. Fourth, we instead propose AFAIS (active feature acquisition importance sampling), a new estimator based on the off-environment policy evaluation view. AFAIS is more data efficient than IPW, but cannot always be used in complex MNAR scenarios. For these cases, we propose a modification to AFAIS that allows it to move closer to IPW when required, at the cost of some data efficiency. We demonstrate the improved data efficiency of AFAIS over IPW in multiple experiments.
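To make the contrast between biased and unbiased estimators concrete, the following toy sketch (our own illustrative example, not this paper's estimator) shows how IPW removes the bias of a complete-case analysis under a simple MAR pattern. The variables `W`, `X` and the observation model `pi` are hypothetical: `W` is always observed, while `X` is observed only when the indicator `R` equals one, with a known observation probability that depends on `W`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical MAR setup: W is always observed; X is observed (R = 1)
# with a known probability pi(W) that depends only on W.
W = rng.normal(size=n)
X = W + rng.normal(size=n)            # partially observed feature
pi = 1.0 / (1.0 + np.exp(-W))         # observation probability pi(W)
R = rng.binomial(1, pi)               # missingness indicator

# Goal: estimate E[X] (true value 0) from the incomplete data.
cc_mean = X[R == 1].mean()            # complete-case mean: biased upward,
                                      # since R is more likely when W is large
ipw_mean = np.mean(R * X / pi)        # IPW mean: unbiased
```

The complete-case estimate drifts visibly away from the true mean of zero, whereas reweighting each observed case by the inverse of its observation probability recovers an unbiased estimate; this is the mechanism behind IPW, at the cost of increased variance when some probabilities are small.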

2. RELATED METHODS

AFA: Various approaches have been proposed for designing AFA agents and prediction models for active feature acquisition (AFA) (An et al., 2006; Li & Oliva, 2021a; Li et al., 2021; Chang et al., 2019; Shim et al., 2018; Yin et al., 2020). This work, however, focuses not on any particular AFA method, but on the evaluation of any AFA method under missingness. Nevertheless, we refer the interested reader to Appendix A.1 for a more detailed review of existing AFA methods and a distinction between AFA and other related fields. Missing data: AFAPE under missingness can be viewed as a missing data problem, and hence methods from the missing data literature can be adopted. There are, in general, two difficulties in solving missing data problems. The first is identification, i.e. determining whether the estimand of the full (unknown) data distribution can be computed from the observed data distribution. The second is estimation, for which there exist two general strategies: inverse probability weighting (IPW) (Seaman & White, 2013), which is based on importance sampling (IS), and multiple imputation (MI) (Sterne et al., 2009). See Appendix A.2 for an in-depth review of missing data. Off-policy policy evaluation (OPE): As we show in Section 3, the AFAPE problem can be formulated as an off-policy policy evaluation (OPE) (Dudik et al., 2011; Kallus & Uehara, 2020) problem. The goal in OPE is to evaluate the performance of a "target" policy (here the AFA policy) from data collected under a "behavior" policy (here the retrospective missingness induced by e.g. the doctor).
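As a minimal illustration of the OPE idea (a one-step toy example of our own; the AFA setting discussed in this paper is sequential), the value of a target policy can be estimated from data logged under a behavior policy by weighting each observed reward with the ratio of the two policies' action probabilities. All names and probabilities below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# One-step toy problem with two actions; reward is a (noisy) +1 for action 1.
behavior = np.array([0.8, 0.2])        # behavior policy b(a), e.g. the doctor
target = np.array([0.3, 0.7])          # target policy e(a), e.g. the AFA agent

a = rng.choice(2, size=n, p=behavior)  # actions logged under the behavior policy
r = a + rng.normal(scale=0.1, size=n)  # observed rewards

# Importance-sampling estimate of the target policy's value:
w = target[a] / behavior[a]            # per-sample importance weight e(a)/b(a)
v_is = np.mean(w * r)

# Ground truth for this toy problem: E_e[r] = 0.3 * 0 + 0.7 * 1 = 0.7
```

The estimate converges to the target policy's value even though no data were collected under it; the price, as with IPW, is higher variance whenever the target policy favors actions the behavior policy rarely takes.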

