PARAMETER AVERAGING FOR FEATURE RANKING

Abstract

Neural networks are known to be sensitive to initialisation. Methods that rely on neural networks for feature ranking are therefore not robust: their rankings can vary when the model is initialised and trained with different random seeds. In this work, we introduce a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab. We first initialise and train multiple instances of a shallow network (referred to as local masks) with different random seeds for a downstream task. We then obtain a global mask model by averaging the parameters of the local masks. We show that although parameter averaging might result in a global model with a higher loss, it still leads to the discovery of the ground-truth feature importance more consistently than an individual model does. We conduct extensive experiments on a variety of synthetic and real-world data, demonstrating that XTab can be used to obtain global feature importance that is not sensitive to sub-optimal model initialisation.

1. INTRODUCTION

Neural networks (NNs) have gained wide adoption across many fields and applications. However, one of the major drawbacks of NNs is their sensitivity to weight initialisation (McMahan et al., 2017). This drawback is not critical for most classification and regression tasks, and is less obvious in applications such as explainability in most computer vision (CV) tasks. The problem is more pronounced in settings in which we pay attention to individual features (e.g., a feature in tabular data, or a pixel in an image) rather than groups of features (e.g., a region in an image), and it becomes critical in settings in which we might need to make costly decisions based on the explanation that the model gives for its outcomes. Such applications include disease diagnosis in the clinical setting, drug repurposing in drug discovery, and sub-population discovery for clinical trials, in all of which the discovery of important features is critical. In this work, we investigate the robustness of neural networks to model initialisation in the context of feature ranking, and conduct our experiments in the tabular data setting. The methods developed to explain predictions should ideally be robust to model initialisation. This is especially important for building trust with stakeholders in fields such as healthcare. In this work, we define "robustness" as the property that the feature ranking produced by the model is not sensitive to sub-optimal model initialisation. Some examples of robust models are tree-based approaches such as the random forest (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016), especially when they are used together with methods such as permutation importance. In these methods, each tree is grown by splitting samples at each decision point using an impurity metric such as the Gini index for the classification task.
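For reference, the Gini index mentioned above measures how mixed the class labels at a node are; a split is chosen to reduce it. A minimal sketch:

```python
def gini(proportions):
    """Gini impurity of a node: 1 - sum(p_k^2) over class proportions p_k.
    0 means the node is pure; higher values mean more class mixing."""
    return 1.0 - sum(p * p for p in proportions)

# A pure node has impurity 0; a balanced binary node has impurity 0.5
print(gini([1.0, 0.0]))  # 0.0
print(gini([0.5, 0.5]))  # 0.5
```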
The importance of a feature in a single tree is typically computed from how much splitting on that feature reduces the impurity, weighted by the number of samples the node is responsible for. The importance scores of the features are then averaged across all trees within the model to obtain their final scores. This averaging may be one of the reasons why these models are robust and consistent when used for feature ranking. However, we should distinguish between the robustness of a method and the correctness of its feature ranking, as tree-based methods are known to have shortcomings (Strobl et al., 2007; Li et al., 2019; Zhou & Hooker, 2021). To obtain a robust explanation using neural networks, we could use an ensemble approach: train multiple neural network-based models to get feature importance, and use the majority rule to rank features. However, ranking features with an ensemble of models may still be difficult in cases where the same feature(s) are ranked equally often across different positions by the models. Moreover, the ensemble approach requires us to store all models so that we can use them to explain a prediction at test time, which is not ideal. Instead, in this work, we propose a novel method in which we obtain a single global mask model by averaging the parameters of multiple instances (local masks) of the same model. We take advantage of the sensitivity of NNs to initialisation by initialising and training each local mask with a different random seed. We show that although the global model might have a higher loss than an individual model, it ranks features more correctly and consistently, and hence can be used to extract feature importance.
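The per-tree averaging described above can be sketched as follows; the importance arrays are illustrative placeholders rather than values from a fitted forest.

```python
import numpy as np

# Hypothetical impurity-based importance scores from three trees
# (rows: trees, columns: features). In a real forest these come from
# how much each split on a feature reduces impurity, weighted by the
# number of samples at the node.
per_tree_importance = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.45, 0.35, 0.10, 0.10],
    [0.55, 0.25, 0.12, 0.08],
])

# Final score of each feature: average across all trees in the model
forest_importance = per_tree_importance.mean(axis=0)

# Rank features from most to least important
ranking = np.argsort(forest_importance)[::-1]
print(ranking.tolist())  # [0, 1, 2, 3]
```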
Our primary contributions in this work are the following:
- We obtain a global model by averaging the parameters of multiple instances of a shallow neural network trained with different random initialisations, and use it to extract feature importance.
- The global model obtained in this manner might have a higher loss than any of the individual models (McMahan et al., 2017). We show that, despite this, the global model is still able to discover the ground-truth feature importance more consistently than an individual model does.
- We demonstrate that weight regularisation such as dropout and weight clipping can improve the robustness and consistency of the global model.
- We show that existing state-of-the-art (SOTA) methods proposed for feature ranking or selection are not robust to model initialisation.
- Finally, we provide insights via an extensive empirical study of parameter averaging using both synthetic and real tabular datasets.

2. METHOD

Parameter averaging has been extensively studied in the context of federated learning (McMahan et al., 2017), in which individual models are trained on datasets stored on different devices, and a global model is obtained by averaging the individual models in various ways. For example, naive parameter averaging has been shown to give a lower loss on the full training set than any individual model trained on a different subset of the data when the individual models are initialised with the same random seed (McMahan et al., 2017). It is well known that the loss surface of typical neural networks is non-convex (McMahan et al., 2017), and hence averaging the parameters of models can result in a sub-optimal global model, especially when their parameters are initialised differently. However, the loss surfaces of over-parameterised NNs have been shown to be well behaved and less prone to bad local minima in practice (Choromanska et al., 2015; Dauphin et al., 2014; Goodfellow et al., 2014). In light of these observations, we investigate settings in which we can combine multiple models that are initialised and trained with different random seeds to obtain a global model that is less sensitive to the sub-optimal initialisation of any individual model. In this work, we propose a framework to obtain such a global model that can be used for both feature ranking and selection. We show that the global model is able to extract feature importance correctly and consistently, especially when the network architecture is shallow. We also show that this behaviour breaks down for deep architectures, although regularising their weights still helps.
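The core averaging step can be sketched as follows. This is a minimal illustration with randomly initialised weight dictionaries standing in for the K trained local masks (training itself is omitted); the layer names and sizes are hypothetical.

```python
import numpy as np

def init_local_mask(seed, in_dim=4, hidden=8):
    """Initialise one shallow 'local mask' network with its own seed.
    In the real method each instance would also be trained on the
    downstream task before averaging."""
    r = np.random.default_rng(seed)
    return {
        "W1": r.standard_normal((in_dim, hidden)),
        "W2": r.standard_normal((hidden, in_dim)),
    }

# K local masks, each initialised with a different random seed
K = 5
local_masks = [init_local_mask(seed) for seed in range(K)]

# Global mask: element-wise average of the local masks' parameters
global_mask = {
    name: np.mean([m[name] for m in local_masks], axis=0)
    for name in local_masks[0]
}
print(global_mask["W1"].shape)  # (4, 8)
```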

2.1. TRAINING

Figure 1 shows our framework, in which we use a shallow neural network as a mask generator that in turn is used to learn important features and their weights for a downstream task. The hidden layer



Figure 1: Left: Framework; a) Train the models K-times with different random seeds, b) Obtain the global mask, c) Final training by using global mask (frozen weights) and a new local mask (trained). Right: Details of each training instance; d) Generating mask from input, e) Feature bagging using masked input, f) Aggregating the embeddings of the subsets. E: Encoder, M: Mask, C: Classifier.

