DATA-EFFICIENT AND INTERPRETABLE TABULAR ANOMALY DETECTION

Abstract

Anomaly detection (AD) plays an important role in numerous applications. In this paper, we focus on two understudied aspects of AD that are critical for its integration into real-world applications. First, most AD methods cannot incorporate labeled data, which are often available in practice in small quantities and can be crucial for achieving high accuracy. Second, most AD methods are not interpretable, a bottleneck that prevents stakeholders from understanding why samples are flagged as anomalies. We propose a novel AD framework, DIAD, that adapts a white-box model class, Generalized Additive Models, to detect anomalies using a partial identification objective that naturally handles noisy and heterogeneous features. DIAD can incorporate a small amount of labeled data to further boost AD performance in semi-supervised settings. We demonstrate the superiority of DIAD over previous work in both unsupervised and semi-supervised settings on multiple datasets. We also present the explainability capabilities of DIAD, showing its rationale for predicting certain samples as anomalies.

1. INTRODUCTION

Anomaly detection (AD) has numerous real-world applications, especially for tabular data, including detection of fraudulent transactions, intrusions related to cybersecurity, and adverse outcomes in healthcare. For real-world tabular AD applications, various challenges constitute a fundamental bottleneck for the penetration of fully-automated machine learning solutions:

• Noisy and irrelevant features: Tabular data often contain noisy or irrelevant features caused by measurement noise, outlier features, and inconsistent units. Even a change in a small subset of features may trigger anomaly identification.

• Heterogeneous features: Unlike image or text, tabular data features can have values with significantly different types (numerical, boolean, categorical, and ordinal), ranges and distributions.

• Small labeled data: In many applications, only a small portion of the data is labeled. AD accuracy can be significantly boosted with information from these labeled samples, as they may contain crucial information on representative anomalies and help ignore irrelevant ones.

• Interpretability: Without interpretable outputs, humans cannot understand the rationale behind anomaly predictions, which would enable more trust and actions to improve model performance. Verification of model accuracy is particularly challenging for high-dimensional tabular data, which humans cannot easily visualize. An interpretable AD model should be able to identify the important features used to predict anomalies. Conventional explainability methods like SHAP (Lundberg & Lee, 2017) and LIME (Ribeiro et al., 2016) are designed for supervised learning and are not straightforward to generalize to unsupervised or semi-supervised AD.

Conventional AD methods fail to address the above: their performance often deteriorates with noisy features (Sec. 6), they cannot incorporate labeled data, and they cannot provide interpretability.
In this paper, we aim to address these challenges by proposing a Data-efficient Interpretable AD framework, DIAD. DIAD's model architecture is inspired by Generalized Additive Models (GAMs) and GA²M (see Sec. 3), which have been shown to obtain high accuracy and interpretability for tabular data (Caruana et al., 2015; Chang et al., 2021b; Liu et al., 2021), and have been used in applications like finding outlier patterns and auditing fairness (Tan et al., 2018). We propose to employ the intuitive notion of Partial Identification (PID) as an AD objective and learn it with a differentiable GA²M (NodeGA²M, Chang et al. (2021a)). Our design is based on the principle that PID scales to high-dimensional features and handles heterogeneous features well, while the differentiable GAM allows fine-tuning with labeled data and retains interpretability. In addition, PID requires clear-cut thresholds like trees, which are provided by NodeGA²M. While combining PID with NodeGA²M, we introduce multiple methodological innovations, such as estimating and normalizing a sparsity metric as the anomaly score, integrating a regularization for an inductive bias appropriate for AD, and using deep representation learning via fine-tuning with a differentiable AUC loss. The latter is crucial for taking advantage of a small number of labeled samples and makes DIAD more 'data-efficient' than other AD approaches: for example, DIAD improves from 87.1% to 89.4% AUC with 5 labeled anomalies compared to unsupervised AD. Overall, our innovations lead to strong empirical results: DIAD outperforms alternatives significantly, both in unsupervised and semi-supervised settings. DIAD's outperformance is especially prominent on large-scale datasets containing heterogeneous features with complex relationships between them.
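Since AUC itself is a non-differentiable ranking statistic, fine-tuning with an AUC loss typically relies on a smooth pairwise surrogate. The following is a minimal NumPy sketch of one common surrogate (a pairwise hinge); the function name, margin, and toy scores are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def pairwise_auc_loss(scores_pos, scores_neg, margin=1.0):
    """Differentiable surrogate for AUC.

    AUC is the fraction of (anomaly, normal) pairs ranked correctly,
    i.e. score(anomaly) > score(normal). Replacing the 0/1 indicator
    with a smooth hinge on the score difference makes it trainable
    by gradient descent.
    """
    # All pairwise score differences: shape (n_pos, n_neg).
    diffs = scores_pos[:, None] - scores_neg[None, :]
    # Hinge: zero loss once a pair is correctly ranked by at least `margin`.
    return np.mean(np.maximum(0.0, margin - diffs))

# Toy example: labeled anomalies should get higher anomaly scores.
pos = np.array([2.0, 1.5])   # labeled anomalies
neg = np.array([0.2, -0.5])  # labeled normal samples
loss_good = pairwise_auc_loss(pos, neg)      # correctly ranked pairs
loss_bad = pairwise_auc_loss(neg, pos)       # reversed ranking, larger loss
```

Minimizing such a loss directly pushes anomaly scores of labeled anomalies above those of labeled normal samples, which matches the AUC evaluation metric better than a pointwise classification loss when labels are few and imbalanced.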
In addition to accuracy gains, DIAD also provides a rationale for why an example is classified as anomalous using the GA²M graphs, and insights on the impact of labeled data on the decision boundary, a novel explainability capability that provides both local and global understanding of AD tasks.
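This per-feature explainability follows from the additive structure of GA²M, which restricts the model to main effects and pairwise interactions. In the standard notation of the GAM literature (symbols here are the conventional ones, not necessarily the paper's):

```latex
g\big(\mathbb{E}[y]\big) \;=\; \beta_0 \;+\; \sum_i f_i(x_i) \;+\; \sum_{i \neq j} f_{ij}(x_i, x_j)
```

Because each term depends on at most two features, every shape function $f_i$ or $f_{ij}$ can be plotted directly, and a sample's anomaly score decomposes exactly into per-feature and per-pair contributions.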



Figure 1: Overview of the proposed DIAD framework. During training, an unsupervised AD model is first fitted employing interpretable GA²M models and the PID loss with unlabeled data. Then, the trained unsupervised model is fine-tuned with a small amount of labeled data using a differentiable AUC loss. At inference, both the anomaly score and explanations are provided, based on visualizations of the top contributing features. The example sample in the figure is assigned a high anomaly score, explained by the high value of its cell size feature.

Comparison of AD approaches. Table 1 summarizes representative AD works and compares them to DIAD. AD methods for training with only normal data have been widely studied (Pang & Aggarwal, 2021b). Isolation Forest (IF) (Liu et al., 2008) grows decision trees with random splits: the shallower the depth at which a sample is isolated, the more anomalous it is predicted to be. However, its performance degrades as feature dimensionality increases. Robust Random Cut Forest (RRCF) (Guha et al., 2016) improves IF by choosing features to split based on their range, but is sensitive to scale. PIDForest (Gopalan et al., 2019) zooms in on the features with large variance, for more robustness to noisy or irrelevant features. There are also AD methods based on generative approaches, which learn to reconstruct input features and use the reconstruction error or density to identify anomalies. Bergmann et al. (2019) employ auto-encoders for image data. DAGMM (Zong et al., 2018) first
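The isolation principle behind IF can be illustrated with a minimal single-tree sketch. This is a toy illustration only (real Isolation Forests subsample the data, use a normalized expected path length, and average over an ensemble); all names and the toy data are assumptions for demonstration:

```python
import random

def isolation_depth(x, data, depth=0, max_depth=10):
    """Depth at which point x is isolated by random axis-aligned splits.

    Points in sparse regions get separated from the rest after few
    splits, so a shallow isolation depth signals a likely anomaly.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(x))                 # pick a random feature
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:                                   # feature is constant
        return depth
    split = random.uniform(lo, hi)                 # random cut point
    # Keep only the side of the split that contains x.
    side = [p for p in data if (p[dim] < split) == (x[dim] < split)]
    return isolation_depth(x, side, depth + 1, max_depth)

random.seed(0)
cluster = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

# Average depth over many random trees, as a forest would.
def avg_depth(pt, trees=50):
    return sum(isolation_depth(pt, data) for _ in range(trees)) / trees
```

Running `avg_depth` on the outlier versus a cluster point shows the outlier isolating at a markedly shallower average depth, which is exactly the signal IF thresholds to flag anomalies.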

