DATA-EFFICIENT AND INTERPRETABLE TABULAR ANOMALY DETECTION

Abstract

Anomaly detection (AD) plays an important role in numerous applications. In this paper, we focus on two understudied aspects of AD that are critical for integration into real-world applications. First, most AD methods cannot incorporate labeled data, which are often available in practice in small quantities and can be crucial for achieving high accuracy. Second, most AD methods are not interpretable, a bottleneck that prevents stakeholders from understanding the reasons behind the anomalies. In this paper, we propose a novel AD framework, DIAD, that adapts a white-box model class, Generalized Additive Models, to detect anomalies using a partial identification objective which naturally handles noisy or heterogeneous features. DIAD can incorporate a small amount of labeled data to further boost AD performance in semi-supervised settings. We demonstrate the superiority of DIAD compared to previous work in both unsupervised and semi-supervised settings on multiple datasets. We also demonstrate the interpretability capabilities of DIAD, explaining its rationale for predicting certain samples as anomalies.

1. INTRODUCTION

Anomaly detection (AD) has numerous real-world applications, especially for tabular data, including detection of fraudulent transactions, cybersecurity intrusions, and adverse outcomes in healthcare. For real-world tabular AD applications, various challenges constitute a fundamental bottleneck for the adoption of fully-automated machine learning solutions:

• Noisy and irrelevant features: Tabular data often contain noisy or irrelevant features caused by measurement noise, outlier features, and inconsistent units, while a change in even a small subset of features may signal an anomaly.

• Heterogeneous features: Unlike image or text data, tabular features can have values with significantly different types (numerical, boolean, categorical, and ordinal), ranges, and distributions.

• Small labeled data: In many applications, only a small portion of labeled data is available. AD accuracy can be significantly boosted with information from these labeled samples, as they may contain crucial information on representative anomalies and help the model ignore irrelevant ones.

• Interpretability: Without interpretable outputs, humans cannot understand the rationale behind anomaly predictions, which would enable greater trust and actions to improve model performance. Verifying model accuracy is particularly challenging for high-dimensional tabular data, which humans cannot easily visualize. An interpretable AD model should be able to identify the important features used to predict anomalies. Conventional explainability methods like SHAP (Lundberg & Lee, 2017) and LIME (Ribeiro et al., 2016) were proposed for supervised learning and do not generalize straightforwardly to unsupervised or semi-supervised AD.

Conventional AD methods fail to address the challenges above: their performance often deteriorates with noisy features (Sec. 6), they cannot incorporate labeled data, and they cannot provide interpretability.
In this paper, we aim to address these challenges by proposing a Data-efficient Interpretable AD framework, DIAD. DIAD's model architecture is inspired by Generalized Additive Models (GAMs) and GA²M (see Sec. 3), which have been shown to achieve high accuracy and interpretability for tabular data (Caruana et al., 2015; Chang et al., 2021b; Liu et al., 2021), and have been used in applications such as finding outlier patterns and auditing fairness (Tan et al., 2018). We propose to employ intuitive notions of Partial Identification (PID) as an AD objective and learn them with a differentiable GA²M
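For context, a GAM models the (transformed) expected target as a sum of per-feature shape functions, and GA²M additionally includes pairwise interaction terms; a standard textbook formulation (the symbols here are generic notation, not this paper's) is:

```latex
% GAM: link function g, intercept f_0, one shape function f_j per feature x_j
g\big(\mathbb{E}[y]\big) = f_0 + \sum_{j} f_j(x_j)

% GA^2M: GAM plus pairwise interaction terms f_{jk}
g\big(\mathbb{E}[y]\big) = f_0 + \sum_{j} f_j(x_j) + \sum_{j < k} f_{jk}(x_j, x_k)
```

Because each term depends on at most two features, every $f_j$ and $f_{jk}$ can be plotted directly, which is the source of the interpretability these model classes provide.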

