A CAUSAL APPROACH TO DETECTING MULTIVARIATE TIME-SERIES ANOMALIES AND ROOT CAUSES

Anonymous

Abstract

Detecting anomalies and their root causes in multivariate time series plays an important role in monitoring the behaviors of various real-world systems, e.g., IT system operations or the manufacturing industry. Previous anomaly detection approaches model the joint distribution without considering the underlying mechanism of multivariate time series, making them computationally expensive and unable to identify root causes. In this paper, we formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism that generates the multivariate data. We then propose a causality-based framework for detecting anomalies and root causes. It first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism, whose conditional distribution can be directly estimated from data. In light of the modularity property of causal systems (the causal processes that generate different variables are independent modules), the original problem is divided into a series of separate, simpler, low-dimensional anomaly detection problems, so that where an anomaly happens (the root cause) can be identified directly. We evaluate our approach on both simulated and public datasets, as well as through a case study on real-world AIOps applications, showing its efficacy, robustness, and practical feasibility.
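The modularity-based decomposition described in the abstract can be sketched in a few lines of code. The following is a hypothetical illustration, not the paper's actual method: it assumes a known three-variable causal chain (x0 -> x1 -> x2) with linear mechanisms, whereas the proposed framework learns the causal structure from data and estimates the conditional distributions more generally. Each variable gets its own low-dimensional anomaly score based on the residual of its local mechanism, so the variable whose mechanism is violated (the root cause) stands out even when the anomaly propagates downstream.

```python
import numpy as np

# Hypothetical causal chain x0 -> x1 -> x2 with linear mechanisms.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.5, size=n)
x2 = -1.5 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x0, x1, x2])

# Assumed known (or previously learned) causal graph: variable -> parents.
parents = {0: [], 1: [0], 2: [1]}

# Fit one linear conditional model per variable given its parents, and
# record the residual scale on normal training data (the local mechanism).
models = {}
for i, pa in parents.items():
    if pa:
        A = np.column_stack([X[:, pa], np.ones(n)])
        w, *_ = np.linalg.lstsq(A, X[:, i], rcond=None)
        resid = X[:, i] - A @ w
    else:
        w = None
        resid = X[:, i] - X[:, i].mean()
    models[i] = (w, resid.std())

def node_scores(x):
    """Per-variable anomaly score: |residual| / training residual std."""
    scores = {}
    for i, pa in parents.items():
        w, s = models[i]
        if pa:
            pred = np.append(x[pa], 1.0) @ w
        else:
            pred = X[:, i].mean()
        scores[i] = abs(x[i] - pred) / s
    return scores

# Inject an anomaly at x1: its own mechanism breaks, while x2 still follows
# its regular mechanism given the (anomalous) value of x1. Only x1's local
# score should be large, identifying it as the root cause.
x_anom = np.array([0.5, 6.0, -9.0])
print(node_scores(x_anom))
```

Because each conditional model is a separate module, the joint-distribution problem never has to be modeled directly, and the argmax of the per-node scores points at the root cause rather than at every downstream variable the anomaly touched.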

1. INTRODUCTION

Multivariate time series are ubiquitous in monitoring the behavior of complex systems in real-world applications, such as IT operations management, the manufacturing industry, and cyber security (Hundman et al., 2018; Mathur & Tippenhauer, 2016; Audibert et al., 2020). Such data includes measurements of the monitored components, e.g., operational KPI metrics such as CPU/database usage in an IT system. An important task in managing these complex systems is to detect unexpected observations that deviate from normal behavior, figure out the root causes of the abnormal behavior, and notify the operators in a timely manner so that they can resolve the underlying issues. Detecting anomalies and the corresponding root causes in multivariate time series aims to accomplish this task and has been actively studied in machine learning, automating the identification of issues and incidents to improve system availability in AIOps (AI for IT Operations) (Dang et al., 2019).

Various algorithms have been developed to detect anomalies in multivariate time series data. In general, two directions are commonly explored: treating each dimension individually with univariate time series anomaly detection algorithms (Hamilton, 1994; Taylor & Letham, 2018; Ren et al., 2019), or treating all the dimensions as a single entity with multivariate time series anomaly detection algorithms (Zong et al., 2018; Park et al., 2017; Su et al., 2019). The first direction ignores the dependencies between different time series, which may be problematic, especially when sudden changes in a certain dimension do not necessarily indicate failure of the whole system, or when the relations among the time series become anomalous (Zhao et al., 2020). The second direction takes these dependencies into consideration and is more suitable for real-world applications, where the overall status of a system is of greater concern than any single dimension.
In recent years, deep learning has received much attention in anomaly detection, e.g., DAGMM (Zong et al., 2018), LSTM-VAE (Park et al., 2017), and OmniAnomaly (Su et al., 2019), which implicitly infer the dependencies between different time series and the temporal patterns within each time series. More recently, Dai & Chen (2022) developed a graph-augmented normalizing flow approach that models the joint distribution via a learned DAG. However, the dependencies inferred by deep learning models do not represent the underlying pro-

