PARAMETER AVERAGING FOR FEATURE RANKING

Abstract

Neural networks are known to be sensitive to initialisation. Methods that rely on neural networks for feature ranking are therefore not robust: their rankings can vary when the model is initialised and trained with different random seeds. In this work, we introduce a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab. We first initialise and train multiple instances of a shallow network (referred to as local masks) with different random seeds on a downstream task. We then obtain a global mask model by averaging the parameters of the local masks. We show that although parameter averaging might result in a global model with higher loss, it still discovers the ground-truth feature importance more consistently than an individual model does. We conduct extensive experiments on a variety of synthetic and real-world data, demonstrating that XTab can be used to obtain global feature importance that is not sensitive to sub-optimal model initialisation.
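The train-then-average procedure described above can be sketched as follows. This is a minimal NumPy illustration only: the shallow-network shapes, the helper names (`init_local_mask`, `average_parameters`), and the hyperparameters are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def init_local_mask(n_features, hidden, seed):
    """Initialise one shallow 'local mask' network with a given random seed.

    In the full method each of these models would also be trained on the
    downstream task before averaging; training is omitted here for brevity.
    """
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_features, hidden)),
        "W2": rng.normal(0.0, 0.1, size=(hidden, 1)),
    }

def average_parameters(models):
    """Build the global mask by averaging each parameter across local masks."""
    return {k: np.mean([m[k] for m in models], axis=0) for k in models[0]}

# Several local masks, each initialised with a different seed, are collapsed
# into a single global mask model by element-wise parameter averaging.
local_masks = [init_local_mask(n_features=8, hidden=4, seed=s) for s in range(5)]
global_mask = average_parameters(local_masks)
```

Because all local masks share one architecture, the averaged parameters define a single model of the same shape that can be stored and queried on its own at test time.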

1. INTRODUCTION

Neural networks (NNs) have gained wide adoption across many fields and applications. However, one of the major drawbacks of NNs is their sensitivity to weight initialisation (McMahan et al., 2017). This drawback is not critical for most classification and regression tasks, and it is less obvious in applications such as explainability for most computer vision (CV) tasks. The problem is more apparent in settings in which we pay attention to individual features (e.g., a feature in tabular data, or a pixel in an image) rather than groups of features (e.g., a region in an image). It becomes critical in settings in which we might need to make costly decisions based on the explanation that the model gives for its outcomes. Such applications include disease diagnosis in clinical settings, drug repurposing in drug discovery, and sub-population discovery for clinical trials, in all of which the discovery of important features is critical. In this work, we investigate the robustness of neural networks to model initialisation in the context of feature ranking, and conduct our experiments in the tabular data setting.
The methods developed to explain predictions should ideally be robust to model initialisation. This is especially important for building trust with stakeholders in fields such as healthcare. In this work, we define "robustness" to mean that the feature ranking obtained from the model is not sensitive to sub-optimal model initialisation. Some examples of robust models are tree-based approaches such as the random forest (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016), especially when they are used together with methods such as permutation importance. In these methods, each tree is grown by splitting the samples at each decision point using an impurity metric such as the Gini index for the classification task.
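The Gini-based splitting described above can be made concrete with a small sketch. This is a toy, pure-Python illustration of the impurity decrease for a single binary split, not any specific library's implementation:

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in probs)

def impurity_decrease(parent, left, right):
    """Weighted Gini decrease from splitting `parent` into `left` and `right`.

    Accumulating this quantity per feature over every node that splits on it,
    and then averaging over all trees, is how forests rank features.
    """
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# A split that separates the two classes perfectly achieves the maximal
# decrease for this node: 0.5 - 0.0 - 0.0 = 0.5.
print(impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5
```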
The importance of a feature in a single tree is typically computed as how much splitting on that feature reduces the impurity, weighted by the number of samples the node is responsible for. The importance scores of the features are then averaged across all of the trees within the model to obtain their final scores. This averaging might be one of the reasons why these models are robust and consistent when used for feature ranking. However, we should distinguish between the robustness of a method and the correctness of its feature ranking, as tree-based methods are known to have their shortcomings (Strobl et al., 2007; Li et al., 2019; Zhou & Hooker, 2021). To get a robust explanation using neural networks, we could use an ensemble approach: train multiple neural network-based models to obtain feature importance, and use the majority rule to rank the features. However, ranking features with an ensemble of models may still be difficult when the same feature(s) are ranked at different positions with similar likelihood across the models. Moreover, the ensemble approach requires us to store all of the models so that we can use them to explain a prediction at test time, which is not ideal. Instead, in this work, we propose a novel

