DYNAMIC FEATURE SELECTION FOR EFFICIENT AND INTERPRETABLE HUMAN ACTIVITY RECOGNITION

Anonymous authors
Paper under double-blind review

Abstract

In many machine learning tasks, input features with varying degrees of predictive capability are acquired at some cost. For example, in human activity recognition (HAR) and mobile health (mHealth) applications, good monitoring performance should be achieved at a low cost of gathering different sensory features, as maintaining sensors incurs monetary, computational, and energy costs. We propose an adaptive feature selection method that dynamically selects the features used for prediction at any given time point. We formulate this problem as an $\ell_0$ minimization problem across time, and cast the resulting combinatorial optimization problem into a stochastic optimization formulation. We then utilize a differentiable relaxation to make the problem amenable to gradient-based optimization. Our evaluations on four activity recognition datasets show that our method achieves a favorable trade-off between performance and the number of features used. Moreover, the dynamically selected features of our approach are shown to be interpretable and associated with the actual activity types.

1. INTRODUCTION

Acquiring predictive features is critical for building trustworthy machine learning systems, but this often comes at a daunting cost. Such a cost can take the form of the energy needed to maintain an ambient sensor (Ardywibowo et al., 2019; Yang et al., 2020), the time needed to complete an experiment (Kiefer, 1959), or the manpower required to monitor a hospital patient (Pierskalla & Brailer, 1994). Therefore, it becomes important not only to maintain good performance on the specified task, but also to keep the cost of gathering these features low. Indeed, existing Human Activity Recognition (HAR) methods typically use a fixed set of sensors, potentially collecting redundant features to discriminate contexts (Shen & Varshney, 2013; Aziz et al., 2016; Ertuǧrul & Kaya, 2017; Cheng et al., 2018). Classic feature selection methods such as the LASSO and its variants can address the performance-cost trade-off by optimizing an objective penalized by a term that promotes feature sparsity (Tibshirani, 1996; Friedman et al., 2008; 2010; Zou & Hastie, 2005). Such feature selection formulations are often static; that is, a fixed set of features is selected a priori. However, different features may offer different predictive power under different contexts. For example, a health worker may not need to monitor a recovering patient as frequently as a patient with a declining condition; an experiment performed twice may be redundant; or a smartphone sensor may be predictive when the user is walking but not when the user is in a car. By adaptively selecting which sensor(s) to observe at any given time point, one can further reduce the inherent cost of prediction and achieve a better trade-off between cost and prediction accuracy. In addition to cost-efficiency, an adaptive feature selection formulation can also lead to more interpretable and trustworthy predictions.
Specifically, the predictions made by the model are based only on the selected features, providing a clear relationship between input features and model predictions. Existing efforts on interpreting models are usually based on post-hoc analyses of the predictions, including approaches that (1) visualize higher-level representations or reconstructions of inputs based on them (Li et al., 2016; Mahendran & Vedaldi, 2015), (2) evaluate the sensitivity of predictions to local perturbations of inputs or to input gradients (Selvaraju et al., 2017; Ribeiro et al., 2016), and (3) extract parts of inputs as justifications for predictions (Lei et al., 2016). Another related but orthogonal direction is model compression, which trains sparse neural networks with the goal of memory and computational efficiency (Louizos et al., 2017; Tartaglione et al., 2018; Han et al., 2015). All of these works require collecting all features first and then provide post-hoc feature-relevance justifications or network pruning. Recent efforts on dynamic feature selection adaptively assign features based on immediate statistics (Gordon et al., 2012; Bloom et al., 2013; Ardywibowo et al., 2019; Zappi et al., 2008), ignoring the information a feature may carry about future predictions. Others treat feature selection as a Markov Decision Process (MDP) and use Reinforcement Learning (RL) to solve it (He & Eisner, 2012; Karayev et al., 2013; Kolamunna et al., 2016; Spaan & Lima, 2009; Satsangi et al., 2015; Yang et al., 2020). However, solving the RL objective is not straightforward: besides being sensitive to hyperparameter settings in general, these methods rely on approximations such as state-space discretization and greedy approximations of the combinatorial objective to make the RL problem tractable. To this end, we propose a dynamic feature selection method that can be easily integrated into existing deep architectures and trained end to end, enabling task-driven dynamic feature selection.
To achieve this, we define a feature selection module that dynamically selects which features to use at any given time point. We then formulate a sequential combinatorial optimization that minimizes the trade-off between the learning task performance and the number of features selected at each time point. To make this problem tractable, we cast the combinatorial optimization problem into a stochastic optimization formulation. We then adopt a differentiable relaxation of the discrete feature selection variables to make it amenable to stochastic-gradient-based optimization. The module can therefore be plugged in and jointly optimized with state-of-the-art neural networks, achieving task-driven feature selection over time. To show our method's ability to adaptively select features while maintaining good performance, we evaluate it on four time-series activity recognition datasets: the UCI Human Activity Recognition (HAR) dataset (Anguita et al., 2013), the OPPORTUNITY dataset (Roggen et al., 2010), the ExtraSensory dataset (Vaizman et al., 2017), and the NTU-RGB-D dataset (Shahroudy et al., 2016). Several ablation studies and comparisons with other dynamic and static feature selection methods demonstrate the efficacy of our proposed method. Specifically, our dynamic feature selection is able to use as little as 0.28% of the sensor features while still maintaining good human activity monitoring accuracy. Moreover, our dynamically selected features are shown to be interpretable, with direct correspondence to different contexts and activity types.

2. METHODOLOGY

2.1 THE $\ell_0$-NORM MINIMIZATION PROBLEM

Many regularization methods have been developed to perform simultaneous feature selection and model parameter estimation (Tibshirani, 1996; Zou & Hastie, 2005; Tibshirani, 1997; Sun et al., 2014; Simon et al., 2011). The ideal penalty for the purpose of feature selection is the $\ell_0$-norm of the model coefficients for all predictors, which equals the number of nonzero terms among the model coefficients. Given a dataset $\mathcal{D}$ containing $N$ independent and identically distributed (iid) input-output pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ with each $x_i$ containing $P$ features, a hypothesis class of predictor functions $f(\cdot\,; \theta)$, and a loss function $L(\hat{y}, y)$ between prediction $\hat{y}$ and true output $y$, the $\ell_0$-norm regularized optimization problem can be written as

$$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; \theta), y_i) + \lambda \|\theta\|_0,$$

where $\|\theta\|_0 = \sum_{j=1}^{P} \mathbb{I}[\theta_j \neq 0]$ penalizes the number of nonzero model coefficients. In models that linearly transform the input features $x_i$, penalizing the weights associated with each feature of $x_i$ enables sparse feature subset selection. However, such a selection is static: it does not adaptively select features appropriate for a given context. Moreover, the optimization above is computationally prohibitive, as it involves combinatorial optimization to select the subset of nonzero model coefficients corresponding to the input features. In the following, we formulate our adaptive dynamic feature selection problem for learning with multivariate time series. Coupled with training recurrent neural networks, this adaptive feature
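The $\ell_0$-penalized objective can be made concrete with a small numerical sketch. The snippet below assumes a linear predictor with squared loss and an illustrative $\lambda = 0.1$ (both our own choices for the example); it only evaluates the objective and does not attempt the combinatorial minimization:

```python
import numpy as np

def l0_penalized_loss(theta, X, y, lam=0.1):
    """Evaluate an l0-regularized least-squares objective:
    mean squared error plus lam times the number of nonzero coefficients.
    The penalty is a discrete count, not a differentiable term.
    """
    preds = X @ theta
    data_loss = np.mean((preds - y) ** 2)
    l0_norm = np.count_nonzero(theta)  # ||theta||_0 = #{j : theta_j != 0}
    return data_loss + lam * l0_norm

# A sparse coefficient vector is penalized only for its nonzero entries.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0])
theta_sparse = np.array([1.0, 0.0, 0.0])  # one active feature
loss = l0_penalized_loss(theta_sparse, X, y, lam=0.1)  # 0.5 + 0.1 * 1 = 0.6
```

Because the penalty counts nonzero coefficients, gradient descent cannot reduce it directly, which is what motivates the stochastic formulation with a differentiable relaxation pursued in this work.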

