A CAUSAL APPROACH TO DETECTING MULTIVARIATE TIME-SERIES ANOMALIES AND ROOT CAUSES

Anonymous

Abstract

Detecting anomalies and their root causes in multivariate time series plays an important role in monitoring the behaviors of various real-world systems, e.g., IT system operations or manufacturing. Previous anomaly detection approaches model the joint distribution without considering the underlying mechanism of multivariate time series, making them computationally expensive and unable to identify root causes. In this paper, we formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism that generates the multivariate data. We then propose a causality-based framework for detecting anomalies and root causes. It first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism, whose conditional distribution can be directly estimated from data. In light of the modularity property of causal systems (the causal processes generating different variables are autonomous modules), the original problem is divided into a series of separate, simpler, low-dimensional anomaly detection problems, so that where an anomaly happens (its root cause) can be identified directly. We evaluate our approach on both simulated and public datasets as well as a case study on real-world AIOps applications, demonstrating its efficacy, robustness, and practical feasibility.

1. INTRODUCTION

Multivariate time series are ubiquitous in monitoring the behavior of complex systems in real-world applications, such as IT operations management, manufacturing, and cyber security (Hundman et al., 2018; Mathur & Tippenhauer, 2016; Audibert et al., 2020). Such data includes measurements of the monitored components, e.g., operational KPI metrics such as CPU or database usage in an IT system. An important task in managing these complex systems is to detect unexpected observations that deviate from normal behavior, figure out the root causes of the abnormal behavior, and notify operators in a timely manner so that the underlying issues can be resolved. Detecting anomalies and their root causes in multivariate time series aims to accomplish this task and has been actively studied in machine learning, automating the identification of issues and incidents to improve system availability in AIOps (AI for IT Operations) (Dang et al., 2019). Various algorithms have been developed to detect anomalies in multivariate time series data. In general, two directions are commonly explored: treating each dimension individually with univariate time series anomaly detection algorithms (Hamilton, 1994; Taylor & Letham, 2018; Ren et al., 2019), or treating all the dimensions as a single entity with multivariate time series anomaly detection algorithms (Zong et al., 2018; Park et al., 2017; Su et al., 2019). The first direction ignores the dependencies between different time series, which can be problematic: a sudden change in a single dimension does not necessarily mean a failure of the whole system, and the relations among the time series themselves can become anomalous (Zhao et al., 2020). The second direction takes these dependencies into consideration and is thus more suitable for real-world applications, where the overall status of a system matters more than any single dimension.
Recently, deep learning has received much attention in anomaly detection, e.g., DAGMM (Zong et al., 2018), LSTM-VAE (Park et al., 2017), and OmniAnomaly (Su et al., 2019), which implicitly infer the dependencies between different time series and the temporal patterns within each time series. Dai & Chen (2022) developed a graph-augmented normalizing flow approach that models the joint distribution via a learned DAG. However, the dependencies inferred by deep learning models do not represent the underlying process that generates the observed data, and the causal relationships between time series are ignored; such methods do not provide a mechanistic understanding of anomalies, and it is hard for them to identify the root causes when an anomaly occurs. In real-world applications, root cause analysis (RCA) is traditionally treated as a module separate from anomaly detection: given the detected anomalous metrics, it identifies potential root causes by analyzing the dependencies between the monitored metrics (Soldani & Brogi, 2021). Because RCA requires knowing which metric is anomalous, univariate (rather than multivariate) time series anomaly detection algorithms are mostly applied to detect anomalies, after which RCA analyzes system/service graphs obtained via domain knowledge or observed data to determine the root causes. Both univariate and multivariate algorithms have drawbacks and cannot be integrated with RCA seamlessly. To overcome these issues, we take a causal perspective (Pearl, 2009; Spirtes et al., 1993) and naturally view anomalies in multivariate time series as instances that do not follow the regular causal mechanism, proposing a novel causality-based framework for detecting anomalies and root causes simultaneously.
Specifically, our approach leverages the causal structure discovered from data so that the joint distribution of the multivariate time series is factorized into simpler modules, where each module corresponds to a local causal mechanism, reflected by the corresponding conditional distribution. These local mechanisms are modular or autonomous (Pearl, 2009) and can be handled separately, which is known as the modularity property of causal systems. In light of this property, the problem naturally decomposes into a series of low-dimensional anomaly detection problems, each concerned with one local mechanism. Because we focus on issues with separate local causal mechanisms, the root causes of an anomaly can be identified at the same time. The main contributions of this paper are summarized below.

• We reformulate anomaly detection and root cause analysis of multivariate time series from a causality perspective, which helps understand where and how anomalies happen and facilitates anomaly detection in light of this understanding.

• We propose a novel framework that decomposes the multivariate time series anomaly detection problem into a series of separate low-dimensional anomaly detection problems by exploiting the causal structure discovered from data, which not only detects anomalies more accurately but also offers a natural way to find their root causes.

• We perform empirical studies evaluating our approach on both simulated and public datasets as well as a case study of an internal real-world AIOps application, validating its efficacy and its robustness to different causal discovery techniques and settings.

Our formulation offers an alternative understanding of anomalies: an anomaly is a data point that does not follow the regular data-generating process. The modularity property makes our approach simpler to train, suitable for real-world applications, and easier to use for root cause analysis.
Our method can detect anomalies that are hard to find for approaches based on modeling only the marginal or joint distributions, illustrating the benefit of the causal view and treatment of anomalies.
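The decomposition described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the causal DAG is assumed to be already known, each local mechanism is modeled as linear-Gaussian, and all function names are hypothetical.

```python
import numpy as np

def fit_local_mechanisms(X, parents):
    """Fit a linear-Gaussian model x_i = w . [pa(x_i), 1] + eps for each
    variable i, given a (T, d) data matrix and a dict of parent indices."""
    models = {}
    for i, pa in parents.items():
        if pa:
            A = np.column_stack([X[:, pa], np.ones(len(X))])
            w, *_ = np.linalg.lstsq(A, X[:, i], rcond=None)
            resid = X[:, i] - A @ w
            models[i] = (w, resid.mean(), resid.std())
        else:  # root node: score against its marginal distribution
            models[i] = (None, X[:, i].mean(), X[:, i].std())
    return models

def anomaly_scores(x, parents, models):
    """Score a new observation x (length-d vector) against each local
    mechanism p(x_i | parents(x_i)) via the standardized residual;
    a high score on variable i marks i as a root-cause candidate."""
    scores = {}
    for i, pa in parents.items():
        w, mu, sd = models[i]
        pred = np.append(x[pa], 1.0) @ w if pa else 0.0
        scores[i] = abs((x[i] - pred - mu) / sd)
    return scores
```

On a toy chain x_0 → x_1 → x_2, a point that breaks the x_0 → x_1 mechanism receives its highest score on x_1, flagging the anomaly and its root cause at the same time; each sub-problem involves only a variable and its parents, mirroring the modularity property.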

2. THE CAUSAL APPROACH

Given a multivariate time series X with length T and d variables, i.e., X = {x_1, x_2, ..., x_d} ∈ R^{T×d}, let x_i(t) be the observation of the i-th variable measured at time t. The task in this paper is to detect anomalies after time step T that differ significantly from the regular points in X and to identify the corresponding root causes, i.e., to test whether the observation x(j) for j > T follows its regular distribution or not.
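In code, this setup corresponds to a simple indexing convention (the array contents below are placeholder data, and the helper name is hypothetical):

```python
import numpy as np

# X is a (T, d) array: rows index time steps, columns index the
# d monitored variables (e.g., CPU usage, database usage, ...).
T, d = 100, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))  # placeholder measurements

def x_obs(i, t):
    """x_i(t): observation of the i-th variable at time step t (0-indexed)."""
    return X[t, i]
```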

2.1. WHY THE CAUSAL VIEW MATTERS

Let us consider the simple example shown in Figure 1: the measurements of three components x, y, z with causal structure x → y → z. An anomaly, marked by a black triangle, happens at time step 40, where the causal mechanism between x and y becomes abnormal. It is typically hard to find such an anomaly from the marginal distributions or the joint distribution, but under the local causal mechanism p(y|x) the anomaly becomes obvious, e.g., p(y|x) is much lower than its normal values. In this example, at time step 40 the probability densities are p(x) = 0.786, p(y) = 1.563, p(z) = 1.695, and p(x, y, z) = p(x)p(y|x)p(z|y) = 0.046, while p(y|x) = 0.011.
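The same effect can be reproduced numerically (the coefficients and the anomalous point below are assumptions for illustration, not the values behind Figure 1): for a linear-Gaussian chain x → y → z, a point whose coordinates are each marginally typical can still have a tiny conditional density p(y|x).

```python
import numpy as np

def gauss_pdf(v, mu, sigma):
    """Density of N(mu, sigma^2) at v."""
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed regular mechanism: x ~ N(0, 1), y = 0.9 x + N(0, 0.5).
# Anomalous point: x and y are each typical under their marginals,
# but the pair violates y ≈ 0.9 x.
x0, y0 = 1.0, -1.0

p_x = gauss_pdf(x0, 0.0, 1.0)                       # marginal density of x
p_y = gauss_pdf(y0, 0.0, np.sqrt(0.9**2 + 0.5**2))  # marginal density of y
p_y_given_x = gauss_pdf(y0, 0.9 * x0, 0.5)          # local causal mechanism

print(p_x, p_y, p_y_given_x)  # p(y|x) is orders of magnitude smaller
```

Both marginal densities are unremarkable, while p(y|x) collapses, which is exactly why scoring each variable against its local mechanism exposes anomalies that marginal or joint modeling misses.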

