A CAUSAL APPROACH TO DETECTING MULTIVARIATE TIME-SERIES ANOMALIES AND ROOT CAUSES

Anonymous

Abstract

Detecting anomalies and their root causes in multivariate time series plays an important role in monitoring the behavior of real-world systems, e.g., IT system operations or manufacturing. Previous anomaly detection approaches model the joint distribution without considering the underlying mechanism that generates the multivariate time series, which makes them computationally expensive and ill-suited to identifying root causes. In this paper, we formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism that generates the multivariate data. We then propose a causality-based framework for detecting anomalies and root causes. It first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism, whose conditional distribution can be directly estimated from data. In light of the modularity property of causal systems (the causal processes that generate different variables are independent modules), the original problem is divided into a series of separate, simpler, low-dimensional anomaly detection problems, so that where an anomaly happens (the root causes) can be directly identified. We evaluate our approach on both simulated and public datasets as well as in a case study on real-world AIOps applications, showing its efficacy, robustness, and practical feasibility.

1. INTRODUCTION

Multivariate time series are ubiquitous in monitoring the behavior of complex systems in real-world applications, such as IT operations management, manufacturing and cyber security (Hundman et al., 2018; Mathur & Tippenhauer, 2016; Audibert et al., 2020). Such data includes the measurements of the monitored components, e.g., operational KPI metrics such as CPU/database usage in an IT system. An important task in managing these complex systems is to detect unexpected observations that deviate from normal behavior, figure out the root causes of the abnormal behavior, and notify the operators in a timely manner so that they can resolve the underlying issues. Detecting anomalies and the corresponding root causes in multivariate time series aims to accomplish this task and has been actively studied in machine learning; it automates the identification of issues and incidents to improve system availability in AIOps (AI for IT Operations) (Dang et al., 2019). Various algorithms have been developed to detect anomalies in multivariate time series data. In general, two directions are commonly explored: treating each dimension individually using univariate time series anomaly detection algorithms (Hamilton, 1994; Taylor & Letham, 2018; Ren et al., 2019), and treating all the dimensions as a single entity using multivariate time series anomaly detection algorithms (Zong et al., 2018; Park et al., 2017; Su et al., 2019). The first direction ignores the dependencies between different time series, so it may be problematic, especially when sudden changes in a certain dimension do not necessarily mean failures of the whole system, or when the relations among the time series become anomalous (Zhao et al., 2020). The second direction takes the dependencies into consideration, which makes it more suitable for real-world applications where the overall status of a system matters more than a single dimension.
Deep learning has recently received much attention in anomaly detection, e.g., DAGMM (Zong et al., 2018), LSTM-VAE (Park et al., 2017) and OmniAnomaly (Su et al., 2019), which implicitly infer dependencies between different time series and temporal patterns within one time series. More recently, Dai & Chen (2022) developed a graph-augmented normalizing flow approach that models the joint distribution via a learned DAG. However, the dependencies inferred by deep learning models do not represent the underlying process that generates the observed data, and the causal relationships between time series are ignored; such methods do not provide a mechanistic understanding of anomalies, and it is hard for them to identify the root causes when an anomaly occurs. In real-world applications, root cause analysis (RCA) is traditionally treated as a module separate from anomaly detection: it identifies potential root causes given the detected anomalous metrics by analyzing the dependencies between the monitored metrics (Soldani & Brogi, 2021). Because RCA needs to know which metric is anomalous, univariate (instead of multivariate) time series anomaly detection algorithms are mostly applied to detect anomalies, and RCA then analyzes system/service graphs obtained via domain knowledge or observed data to determine root causes. Both univariate and multivariate algorithms have drawbacks and cannot be integrated with RCA seamlessly. To overcome these issues, we take a causal perspective (Pearl, 2009; Spirtes et al., 1993) and naturally view anomalies in multivariate time series as instances that do not follow the regular causal mechanism, and we propose a novel causality-based framework for detecting anomalies and root causes simultaneously.
Specifically, our approach leverages the causal structure discovered from data so that the joint distribution of multivariate time series is factorized into simpler modules, where each module corresponds to a local causal mechanism, reflected by the corresponding conditional distribution. Those local mechanisms are modular or autonomous (Pearl, 2009) and can then be handled separately, which is known as the modularity property of causal systems. In light of this property, the problem is naturally decomposed into a series of low-dimensional anomaly detection problems. Each sub-problem is concerned with a local mechanism. Because we focus on issues with separate local causal mechanisms, the root causes of an anomaly can be identified at the same time. The main contributions of this paper are summarized below.
• We reformulate anomaly detection and root cause analysis of multivariate time series from a causality perspective, which helps understand where and how anomalies happen and facilitates anomaly detection in light of that understanding.
• We propose a novel framework that decomposes the multivariate time series anomaly detection problem into a series of separate low-dimensional anomaly detection problems by exploiting the causal structure discovered from data, which not only detects the anomalies more accurately but also offers a natural way to find their root causes.
• We perform empirical studies that evaluate our approach on both simulated and public datasets as well as in a case study of an internal real-world AIOps application, validating its efficacy and its robustness to different causal discovery techniques and settings.
Our formulation offers an alternative understanding of anomalies: an anomaly is a data point that does not follow the regular data-generating process. The modularity property makes our approach simpler to train, suitable for real-world applications and easier to use for root cause analysis.
Our method can detect those anomalies that are hard for the approaches based on modeling marginal/joint distributions only, illustrating the benefit of the causal view and treatment of anomalies.

2. THE CAUSAL APPROACH

Given a multivariate time series $X$ with length $T$ and $d$ variables, i.e., $X = \{x_1, x_2, \cdots, x_d\} \in \mathbb{R}^{T \times d}$, let $x_i(t)$ be the observation of the $i$th variable measured at time $t$. The task in this paper is to detect anomalies after time step $T$ that differ significantly from the regular points in $X$ and to identify the corresponding root causes, i.e., to test whether $X_j$ for $j > T$ follows its regular distribution or not.

2.1. WHY THE CAUSAL VIEW MATTERS

Let us consider the simple example shown in Figure 1, i.e., the measurements of three components $x, y, z$ with causal structure $x \to y \to z$. An anomaly labeled by a black triangle happens at time step 40, where the causal mechanism between $x$ and $y$ becomes abnormal. Typically it is hard to find such an anomaly based on the marginal distributions or the joint distribution. But from the local causal mechanism $p(y|x)$, such an anomaly becomes obvious, e.g., $p(y|x)$ is much lower than its normal values. In this example, at time step 40 the probability densities are $p(x) = 0.786$, $p(y) = 1.563$, $p(z) = 1.695$ and $p(x, y, z) = p(x)p(y|x)p(z|y) = 0.046$, while $p(y|x) = 0.011$, meaning that it is easier to find this anomaly by examining the local causal mechanism $p(y|x)$. A real-world motivating example can be found in Figure 8. If the causal structure of the underlying process is given, we can examine whether each variable in the time series follows its regular causal mechanism. The causal mechanism can be represented by the structural equation model, i.e.,

$x_i(t) = f_i(PA(x_i(t)), \epsilon_i(t)), \quad \forall i = 1, \cdots, d,$

where $f_i$ are arbitrary measurable functions, $\epsilon_i(t)$ are independent noises and $PA(x_i(t))$ represents the causal parents of $x_i(t)$, including both lagged and contemporaneous ones (Pearl, 2009). This causal structure can also be represented by a causal graph $G$ whose nodes correspond to the variables $x_i(t)$ at different time lags. In this paper, we assume the graph $G$ is a directed acyclic graph (DAG) and that the causal relationships are stationary unless an anomaly occurs. According to the Markov factorization, the joint distribution of $x(t)$ can be factored as

$P[x(t)] = \prod_{i=1}^{d} P_G[x_i(t) \mid PA(x_i(t))],$

where $P_G$ denotes the conditional distribution. The local causal mechanisms, corresponding to these conditional distribution terms, are known to be independent of each other in a causal system (Pearl, 2009). An anomaly can then be identified according to the local causal mechanism.
Therefore, we define anomalies as follows.

Definition 1

A point $x(t)$ at time step $t$ is an anomaly if there exists at least one variable $x_i$ such that $x_i(t)$ violates the local generating mechanism, i.e., given $PA(x_i(t))$, $x_i(t)$ does not follow $P_G[x_i(t) \mid PA(x_i(t))]$, which is the conditional distribution corresponding to the regular causal mechanism. This definition states that an anomaly happens in the system if the causal mechanism between a variable and its causal parents is violated, e.g., the local causal effect varies dramatically (Figure 1), or a big change happens to a variable and this change propagates to its children. Based on Definition 1, the anomaly detection problem can be divided into several low-dimensional subproblems, e.g., by checking whether each variable follows its regular conditional distribution. Thanks to this modularity property, the root causes can be naturally identified when an anomaly event occurs. Here is our definition of root causes.

Definition 2

The root causes of an anomaly point $x(t)$ are those variables $x_i$ such that, given $PA(x_i(t))$, $x_i(t)$ does not follow $P_G[x_i(t) \mid PA(x_i(t))]$, i.e., an anomaly happens in the local causal mechanism related to those variables. Definition 2 indicates that $x_i$ is one of the root causes if the local causal mechanism of variable $x_i(t)$ is violated. In Figure 1, variable $y$ is the root cause by our definition because the causal mechanism between $y$ and $z$ is normal while the causal mechanism between $x$ and $y$ is violated.
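To make the argument of this section concrete, the following minimal sketch evaluates a marginal density and a conditional density for a toy linear-Gaussian version of the chain x → y → z. The mechanism y = x + noise, the noise scale `NOISE_STD` and the observed values are illustrative assumptions; they are not the data behind Figure 1.

```python
import math


def gaussian_pdf(value, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at `value`."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Assumed toy mechanism (not from the paper): x ~ N(0, 1), y = x + N(0, NOISE_STD^2).
NOISE_STD = 0.1

# A point where the mechanism x -> y is broken: y should be close to x, but is not.
x_obs, y_obs = 1.0, -0.5

# Marginally, y ~ N(0, 1 + NOISE_STD^2), so y_obs = -0.5 looks unremarkable.
p_y_marginal = gaussian_pdf(y_obs, 0.0, math.sqrt(1.0 + NOISE_STD ** 2))

# Conditionally, y | x ~ N(x, NOISE_STD^2), so the same value is extremely unlikely.
p_y_given_x = gaussian_pdf(y_obs, x_obs, NOISE_STD)

print(f"p(y) = {p_y_marginal:.3f}, p(y|x) = {p_y_given_x:.3e}")
```

In this toy setting the marginal density of y stays in its usual range while the conditional density collapses, which is the pattern that Definition 1 and Definition 2 rely on.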

2.2. METHOD

We consider the unsupervised learning setting where $X$ is given as the training data for learning the graph structure and the conditional distributions. (We discuss the effects of possible anomalies on the learned causal structure in Section 2.2.4.) For learning causal graphs, we exploit suitable causal discovery methods, as discussed in Section 2.2.1. For learning conditional distributions, we maximize the log likelihoods given the observed data, i.e., we maximize

$L_i(X) = \sum_{t=1}^{T} \log P_G[x_i(t) \mid PA(x_i(t))], \quad \forall i = 1, \cdots, d.$

Hence the anomaly score is defined as one minus the minimum value of these estimated probabilities. Intuitively, the purpose of using the minimum function is that we expect the algorithm to report an anomaly if any of the metrics (root variables) or local causal mechanisms (conditional probabilities) becomes abnormal, i.e., a data point is labeled as an anomaly if its anomaly score is larger than a certain threshold. If an anomaly event is detected, the root cause scores are computed for each variable and then the variables with the top scores are selected as the root causes (Section 2.2.3). Algorithm 1 outlines our approach. Anomalies in the training data may decrease the performance. We discuss this issue and provide a solution for handling training anomalies in Section 2.2.4.

Algorithm 1 The causality-based approach for detecting anomalies and root causes
Input: training data $X = \{x_i\}_{i=1}^{d} \in \mathbb{R}^{T \times d}$, test data $Y = \{y_i\}_{i=1}^{d} \in \mathbb{R}^{T \times d}$, and threshold $\lambda$;
Training procedure:
1: Infer the causal graph $G$ via causal discovery techniques, e.g., FGES (Chickering, 2003; 2002; Meek, 1995) and PC (Spirtes & Glymour, 1991). If $G$ is a partial DAG, convert it into a DAG by the method of Dor & Tarsi (1992).
1: for $t = 1$ to $T$ do
2: Compute the anomaly score: $A(y(t)) = 1 - \min\{M_i(y_i(t)) \mid i = 1, \cdots, d\}$;
3: Set the anomaly label $l_t = 1$ if $A(y(t)) > \lambda$, or $0$ otherwise;
4: If the anomaly label $l_t$ is 1, compute the root cause scores $RS(x_i(t))$ via Eq (1) for each variable $i$, and set the root causes $R_t$ to be the variables with the top-$k$ root cause scores (Section 2.2.3).
5: end for

2.2.1. CAUSAL DISCOVERY

Our approach needs to exploit the causal structure underlying the data. A traditional way to find causal relations is to use interventions or randomized experiments, which are generally too expensive and time-consuming. Discovering causal information by analyzing purely observational data, known as causal discovery, is therefore an important problem (Spirtes & Glymour, 1991; Peters et al., 2017; Spirtes & Zhang, 2016). Multiple algorithms have been developed for causal discovery from independent and identically distributed (i.i.d.) or time series data, and their results are asymptotically guaranteed under the corresponding assumptions. In this paper, we choose causal discovery algorithms such as PC (Spirtes & Glymour, 1991) and FGES (Chickering, 2003; 2002; Meek, 1995), depending on whether we are given temporal data (with time-delayed causal relations) and whether the causal relations are linear or nonlinear. For example, we apply FGES with the SEM-BIC score if the variables are linearly related and FGES with the generalized score function (Huang et al., 2018) if they are non-linearly correlated. One concern is whether missing or incorrect causal links in the inferred causal graph have a big impact on the performance of our approach. We performed an empirical study of this impact with public datasets, which shows that, interestingly, our approach is robust to errors in the inferred causal graph. The complexity of PC and GES depends heavily on the density of the causal graph.
Specifically, FGES is highly scalable when dealing with linear models (Ramsey et al., 2016) . In real-world applications, e.g., the public datasets in the experiments, even though the variables may not be exactly linearly correlated, FGES can still generate reasonable causal graphs that are good enough for our approach.

2.2.2. ANOMALY DETECTION

After the causal Markov factorization, the joint distribution becomes easier to model than in previous approaches, e.g., the conditional distributions representing local causal mechanisms can be estimated using simpler ML models. For modeling $P_G[x_i(t) \mid PA(x_i(t))]$, one can apply kernel conditional density estimation (Hastie et al., 2009), conditional VAE (CVAE) (Sohn et al., 2015) or even prediction models such as MLP or CNN (Binkowski et al., 2018). Let $\tau_j$ be the causal time lag for a parent $x_j$ and $\tau^*$ be the maximum time lag in $G$; then we define $PA^*(x_i(t)) = \{x_j(t - \tau^*), \cdots, x_j(t - \tau_j) \mid j \in PA\}$. The time lag is $\tau_j = 0$ if $x_j$ is a contemporaneous causal parent of $x_i$. For a causal parent $x_j$, more of its historical data can also be included, e.g., a window of size $k$: $\{x_j(t - \tau_j - k + 1), \cdots, x_j(t - \tau_j) \mid j \in PA\}$. Therefore, the problem becomes estimating the conditional distribution from the empirical observations $\{(x_i(t), c_i(t))\}_{t=1}^{T}$, where $c_i(t) = PA^*(x_i(t))$. In this paper, we apply CVAE to model this conditional distribution. We choose CVAE because it can be trained quickly with a simple architecture and achieves good performance, as shown in our experiments. The empirical variational lower bound of CVAE is

$L(x, c; \theta, \phi) = \frac{1}{n} \sum_{k=1}^{n} \log p_\theta(x \mid c, z_k) - KL(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)),$

where $q_\phi(z \mid x, c)$ and $p_\theta(x \mid c, z_k)$ are MLPs and $p_\theta(z \mid c)$ is a Gaussian distribution. Given $(x_i(t), c_i(t))$, CVAE outputs $\hat{x}_i(t)$, the reconstruction of $x_i(t)$, and then $P[x_i(t) \mid c_i(t)]$ is measured by the distribution of $|x_i(t) - \hat{x}_i(t)|$. If $PA(x_i(t))$ is empty, i.e., $i \in C_R$, one way to estimate the distribution $P[x_i(t)]$ is to handle $x_i(t)$ via univariate time series models, e.g., ARIMA (Hamilton, 1994) or SARIMA (Hyndman & Athanasopoulos, 2018). The other way is to handle the variables in $C_R$ together by utilizing models for multivariate time series anomaly detection, e.g., Isolation Forest (IF) (Liu et al., 2008), AE (Baldi, 2012) or LSTM-VAE (Park et al., 2017). The training data for such models includes all the observations of the variables in $C_R$, i.e., $\{x_i(t) \mid i \in C_R\}_{t=1}^{T}$. For example, the training data for a forecasting-based method is $\{(x_i(t), \{x_i(t - k), \cdots, x_i(t - 1)\}) \mid i \in C_R\}_{t=1}^{T}$, where $x_i(t)$ is predicted from a window of its previous data points. Our approach reduces to the previous univariate/multivariate time series AD approaches if the causal graph is empty, i.e., no causal relationships are considered. When causal relationships are available, obtained from domain knowledge or data-driven causal discovery techniques, our approach can easily utilize such information and reduce the effort required for joint distribution estimation.
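The scoring step of Algorithm 1 (one minus the minimum estimated probability, followed by thresholding) can be sketched as follows. How each M_i turns a residual into a probability is not spelled out in this section, so the sketch assumes an empirical tail probability over absolute training residuals; `fit_residual_model`, `tail_probability` and the sample numbers are hypothetical names and values used only for illustration.

```python
from bisect import bisect_left


def fit_residual_model(train_residuals):
    """Store the sorted absolute training residuals of one variable (assumed form of M_i)."""
    return sorted(abs(r) for r in train_residuals)


def tail_probability(sorted_abs_residuals, residual):
    """Fraction of training residuals at least as large as |residual|."""
    first_not_smaller = bisect_left(sorted_abs_residuals, abs(residual))
    return (len(sorted_abs_residuals) - first_not_smaller) / len(sorted_abs_residuals)


def anomaly_score(models, residuals):
    """A = 1 - min_i M_i, with one model and one test residual per variable."""
    probabilities = [tail_probability(m, r) for m, r in zip(models, residuals)]
    return 1.0 - min(probabilities)


def anomaly_label(models, residuals, threshold):
    """Return 1 if the anomaly score exceeds the threshold, else 0."""
    return 1 if anomaly_score(models, residuals) > threshold else 0


# Two variables with identical (made-up) training residuals.
models = [fit_residual_model([0.1, 0.2, 0.3, 0.4]) for _ in range(2)]
normal_score = anomaly_score(models, [0.05, 0.05])
broken_score = anomaly_score(models, [0.05, 5.0])
```

Because the minimum is taken over variables, a single violated local mechanism is enough to push the score above the threshold, which matches the intuition given in Section 2.2.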

2.2.3. ROOT CAUSE ANALYSIS

Root cause analysis (RCA) aims to identify root causes when an anomaly event happens. RCA in real-world applications such as AIOps can be very challenging. One practical issue in identifying root causes is that an anomaly occurring in a variable often makes its causal children abnormal due to anomaly propagation. Based on Definition 2, we propose the following practical algorithm. For variable $x_i$, define its initial root cause score at time $t$ by $S(x_i(t)) = 1 - M_i(x_i(t))$. Let $N(x_i(t))$ be the set of the causal children of $x_i(t)$; the final root cause score is defined in a PageRank-like way (Page et al., 1999; Wu et al., 2020):

$RS(x_i(t)) = S(x_i(t)) + \alpha \frac{1}{|N(x_i)|} \sum_{x_j(t) \in N(x_i)} RS(x_j(t)), \quad \forall i = 1, \cdots, d, \quad (1)$

where $\alpha$ is a weight parameter satisfying $0 \le \alpha < 1$. When $N(x_i)$ is empty, we set $RS(x_i(t)) = S(x_i(t))$. Here the final root cause score of a variable is the weighted combination of its initial root cause score and the final root cause scores of its children, which handles the anomaly propagation issue. The root causes at time $t$ are identified by picking the variables with the top RS scores.
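A minimal sketch of the root cause scoring recursion is given below, assuming the causal graph is a DAG stored as a dictionary from each variable to its list of causal children. The function names, the value α = 0.5 and the initial scores are illustrative assumptions rather than settings taken from the experiments.

```python
def root_cause_scores(initial_scores, children, alpha=0.5):
    """Compute RS(x_i) = S(x_i) + alpha * (mean of RS over the causal children of x_i).

    The recursion terminates because the causal graph is assumed to be a DAG.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    final_scores = {}

    def visit(node):
        if node in final_scores:  # already computed (memoized)
            return final_scores[node]
        kids = children.get(node, [])
        score = initial_scores[node]
        if kids:  # variables without children keep RS = S
            score += alpha * sum(visit(kid) for kid in kids) / len(kids)
        final_scores[node] = score
        return score

    for node in initial_scores:
        visit(node)
    return final_scores


def top_k_root_causes(final_scores, k):
    """Return the k variables with the largest final root cause scores."""
    return sorted(final_scores, key=final_scores.get, reverse=True)[:k]


# Chain x -> y -> z with made-up initial scores S = 1 - M_i.
initial = {"x": 0.1, "y": 0.9, "z": 0.8}
children = {"x": ["y"], "y": ["z"], "z": []}
scores = root_cause_scores(initial, children, alpha=0.5)
ranking = top_k_root_causes(scores, k=1)
```

In this toy chain, z is abnormal only because y is, and the children term raises the score of y above that of z, so y is ranked first.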

2.2.4. NEGATIVE EFFECT OF TRAINING ANOMALIES

The existence of anomalies in the training data may decrease the detection performance. Our empirical results show that this issue does not affect the anomaly detection performance much, which is expected to be the case when there are relatively few anomalies in the data. Typically, there are two possible cases where anomalies in training data may have obvious negative impacts on performance. One case is that the value of a metric at certain timestamps becomes extremely large, which can affect the conditional probability estimation. In this case, one can simply remove those values based on statistical rules, e.g., removing them if the absolute value is larger than some threshold in the preprocessing step. The other case is that the proportion of anomalies is relatively large. In this case, we can consider an iterative solution that iteratively updates the causal graph and anomaly detection model, i.e., 1) estimate causal graph G and train models M i with the training data, and 2) remove the anomalies detected by M i in the training data and then go to Step (1). We repeat the above two steps until the estimated causal model (including the estimated causal structure and quantitative model, e.g., causal coefficients in the linear case) converges. We conducted an experiment with a simulation dataset to empirically study this iterative solution (Section A.7).
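The iterative solution can be sketched as a generic loop. Here `fit_fn` and `is_anomaly_fn` are hypothetical placeholders for "estimate the causal graph and train the models" and "detect anomalies with the trained models", and the stopping rule (no further points removed) is an assumed proxy for convergence of the estimated causal model. The toy usage replaces the causal model with a mean and standard deviation purely to keep the example self-contained.

```python
import statistics


def iterative_fit(data, fit_fn, is_anomaly_fn, max_rounds=10):
    """Alternate between fitting a model and removing the training points it flags."""
    kept = list(data)
    model = fit_fn(kept)
    for _ in range(max_rounds):
        cleaned = [point for point in kept if not is_anomaly_fn(model, point)]
        if len(cleaned) == len(kept):  # nothing removed: treat as converged
            break
        kept = cleaned
        model = fit_fn(kept)  # re-estimate on the cleaned training data
    return model, kept


# Toy stand-ins (assumptions, not the paper's models): a mean/std "model"
# and a two-standard-deviation rule for flagging training anomalies.
def fit_mean_std(points):
    return statistics.fmean(points), statistics.pstdev(points)


def outside_two_std(model, point):
    mean, std = model
    return abs(point - mean) > 2.0 * std


train = [0.9, 1.0, 1.1, 1.0, 0.9, 1.1, 1.0, 1.0, 0.9, 1.1, 50.0]
model, kept = iterative_fit(train, fit_mean_std, outside_two_std)
```

In the toy run the extreme value 50.0 is removed in the first round and the second round removes nothing, so the loop stops with a model fitted on the remaining ten points.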

3. EXPERIMENTS

This section evaluates the performance of our proposed approach and compares it to several other approaches. The experiments include: 1) evaluating our approach with simulated and public datasets, 2) analyzing how different causal graphs affect the performance, and 3) a case study demonstrating the application of our approach to real-world anomaly detection in AIOps. The anomaly detection performance is assessed by the precision, recall and F1-score metrics in a point-adjust manner, i.e., all the anomalies of an anomalous segment are considered correctly detected if at least one anomaly of this segment is correctly detected, while the anomalies outside the ground-truth anomaly segments are treated as usual. By default, we apply FGES (Chickering, 2003) to discover the causal graph. For $i \notin C_R$, we choose CVAE (Sohn et al., 2015). For $i \in C_R$, we tested the univariate model and other methods such as IF (Liu et al., 2008), AE (Baldi, 2012) and LSTM-VAE (Park et al., 2017) in our experiments. We compare our approach with several unsupervised approaches, e.g., AE (Baldi, 2012), DAGMM (Zong et al., 2018), OmniAnomaly (Su et al., 2019), USAD (Audibert et al., 2020) and GANF (Dai & Chen, 2022).
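The point-adjust protocol described above can be sketched as follows. This is one common reading of the protocol and not necessarily the exact evaluation code used for the reported tables; the label and prediction sequences are made up.

```python
def point_adjust(labels, predictions):
    """Mark a whole ground-truth anomaly segment as detected if any point in it is detected."""
    adjusted = list(predictions)
    n = len(labels)
    start = 0
    while start < n:
        if labels[start] == 1:
            end = start
            while end < n and labels[end] == 1:  # find the end of this segment
                end += 1
            if any(predictions[start:end]):
                adjusted[start:end] = [1] * (end - start)
            start = end
        else:
            start += 1  # predictions outside segments are left unchanged
    return adjusted


def precision_recall_f1(labels, predictions):
    """Point-wise precision, recall and F1 for binary labels."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = [0, 1, 1, 1, 0, 0, 1, 1, 0]
raw_predictions = [0, 0, 1, 0, 1, 0, 0, 0, 0]
adjusted_predictions = point_adjust(labels, raw_predictions)
precision, recall, f1 = precision_recall_f1(labels, adjusted_predictions)
```

In this example the first segment counts as fully detected after adjustment, the second segment is missed entirely, and the isolated detection outside any segment remains a false positive.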

3.1. SIMULATION DATASETS

Section A.3 discusses how to generate the simulation datasets. We consider linear/nonlinear causal relationships and three types of anomalies. The first type is a "measurement" anomaly, where the causal mechanism is normal but the observation is abnormal due to measurement errors, i.e., we randomly pick a node $x_i$, a time step $t$ and a scale $s$, and then set $x_i(t) = [x_i(t) - \mathrm{median}(x_i)] \cdot s + \mathrm{median}(x_i)$. The second type is an "intervention" anomaly, i.e., after anomalies are generated for some nodes, those anomalous values propagate to the children nodes according to the causal relationships. The third type is an "effect" anomaly, where anomalies only happen on nodes with no causal children.
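The "measurement" anomaly rule can be sketched as a small helper. The function name and the sample series are illustrative; the random choice of node, time step and scale is left to the caller.

```python
import statistics


def inject_measurement_anomaly(series, t, scale):
    """Return a copy of `series` with x(t) replaced by (x(t) - median) * scale + median."""
    median = statistics.median(series)
    corrupted = list(series)
    corrupted[t] = (series[t] - median) * scale + median
    return corrupted


series = [1.0, 2.0, 3.0, 4.0, 5.0]
corrupted = inject_measurement_anomaly(series, t=4, scale=3.0)
```

Only the chosen observation changes; its distance from the median is stretched by the scale factor, while the underlying causal mechanism is left untouched.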

Performance comparison.

In the experiments, we consider six settings derived from the combinations of "linear/nonlinear" and "measurement/intervention/effect". The simulated time series has 15 variables and length 20000, where the first half is the training data and the rest is the test data. The percentage of anomalies is 10%. Table 1 shows the performance of different unsupervised multivariate time series anomaly detection methods on the generated simulated datasets. Our method outperforms all the other methods. It achieves significantly better F1 scores when the relationships are nonlinear or the anomaly type is "intervention", e.g., ours obtains an F1 score of 0.759 for "nonlinear, intervention", while the best F1 score achieved by the others is 0.589. In "linear, measurement/effect", DAGMM performs similarly to ours because the data can be modeled well by applying dimension reduction followed by a Gaussian mixture model. But when the relationships become nonlinear, it becomes harder for DAGMM to model the data. This experiment shows that the causal mechanism plays an important role in anomaly detection. Modeling the joint distribution without considering causality can lead to a significant performance drop. We use the same simulation datasets as for anomaly detection to evaluate the RCA performance, measured by the top-k hit ratio, i.e., the predicted top-k root causes are counted as correct as long as one of them is the ground-truth root cause. Table 2 shows the RCA performance of our approach and the baseline. The baseline ignores the causal relationships and samples root causes with probabilities proportional to the anomaly scores. Our approach achieves HR@3 >= 0.95 and HR@3 >= 0.75 for the "linear" and "nonlinear" settings, respectively, which is significantly better than the baseline.
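The top-k hit ratio used for the RCA evaluation can be sketched as follows, assuming each anomaly event comes with a ranked list of predicted root causes and a set of ground-truth root causes. The variable names and the example events are hypothetical.

```python
def hit_ratio_at_k(predicted_rankings, true_root_causes, k):
    """Fraction of events whose top-k predicted root causes contain a ground-truth root cause."""
    hits = sum(
        1
        for ranking, truth in zip(predicted_rankings, true_root_causes)
        if set(ranking[:k]) & set(truth)
    )
    return hits / len(predicted_rankings)


# Two made-up anomaly events whose true root cause is variable "y".
rankings = [["y", "x", "z"], ["z", "x", "y"]]
truths = [{"y"}, {"y"}]
hr_at_1 = hit_ratio_at_k(rankings, truths, k=1)
hr_at_3 = hit_ratio_at_k(rankings, truths, k=3)
```

Here the first event is a hit at k = 1 and the second only becomes a hit at k = 3, so the hit ratio rises from 0.5 to 1.0 as k grows.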

3.2. PUBLIC DATASETS

Four public datasets were used in our experiments: 1) Secure Water Treatment (SWaT) (Mathur & Tippenhauer, 2016): it consists of 11 days of continuous operation, i.e., 7 days collected under normal operation and 4 days collected with attacks; 2) Water Distribution (WADI) (Mathur & Tippenhauer, 2016): it consists of 16 days of continuous operation, of which 14 days were collected under normal operation and 2 days with attacks; 3) the Soil Moisture Active Passive (SMAP) satellite and Mars Science Laboratory (MSL) rover datasets (Hundman et al., 2018), which are two public datasets expert-labeled by NASA. Performance comparison. Table 3 shows the results on the four representative datasets. Overall, IF, AE, VAE and DAGMM have relatively lower performance because they neither exploit the temporal information nor leverage the causal relationships between the variables. LSTM-VAE, OmniAnomaly and USAD perform better than these four methods since they utilize the temporal information by modeling the current observations with the historical data, while the DAG-based method GANF does not perform well except on SWaT. Our approach exploits the causal relationships in addition to the temporal information, achieving significantly better results than the other methods on all the datasets, e.g., ours has the best F1 score of 0.918 for SWaT and 0.818 for WADI, while the best F1 scores for SWaT and WADI by other methods are 0.846 and 0.767, respectively. For each public dataset, Table 4 reports the best metrics that can be achieved by choosing the best thresholds on the test datasets. If we are allowed to choose better thresholds, the metrics achieved by our approach can be much higher, e.g., an F1 score of 0.946 for SMAP and 0.913 for MSL. We also report the running time of our approach in Section A.8. Ablation study on $M_i$ for $i \in C_R$. This experiment evaluates the effect of the causal information on anomaly detection.
For an anomaly detection method A such as IF or AE, we compare A with our approach "ours + A", which uses CVAE for $i \notin C_R$ (estimating $P_G[x_i(t) \mid PA(x_i(t))]$) and A for $i \in C_R$ (estimating $\prod_{i \in C_R} P[x_i(t)]$). We report the metrics mentioned above and the best metrics achieved by choosing the best thresholds on the test datasets. Table 5 shows the performance of our approach with different A, where A = ∅ means that the anomalies are detected by the variables with $i \notin C_R$ only, without using those with $i \in C_R$. By comparing this table with Table 3, we can observe that "ours + A" performs much better than using A only, e.g., "ours + AE" achieves an F1 score of 0.850 for WADI, while AE obtains 0.668 for WADI. If A is not used in anomaly detection, we get a performance drop in terms of F1 score. For example, the best F1 score drops from 0.934 to 0.923 for WADI. Ablation study on causal discovery and causal mechanism estimation. We also studied the effects of different parameters for discovering causal graphs on the performance of our approach. The experiments (in Section A.9) show that our approach is robust to changes in the inferred causal graph. In practice, the causal graph is not required to be accurate; we just need to ensure that it does not contain too many missing links or false positive links. Besides FGES, other methods such as the PC algorithm (Spirtes & Glymour, 1991) can also be applied to infer the causal graphs. The causal graphs inferred by PC are likely to differ from those computed by FGES. Our experiments show that our anomaly detection approach is stable even though the causal graphs are different. Table 6 compares the performance of our approach with FGES, GES and PC. For SWaT, using FGES, GES and PC yields similar performance. For WADI, using PC performs worse than using GES and FGES, but the F1 score of 0.768 is still better than those of the other approaches. The performance drop occurs because the causal graph discovered by FGES is more accurate than the one discovered by PC on WADI.
As shown in Table 6 , we also tested different methods for estimating causal mechanisms (conditional distributions), e.g., CVAE, MLP and KMN (Ambrogioni et al., 2017) . CVAE works better than the others in SWaT and WADI so we choose CVAE by default.

3.3. CASE STUDY: REAL-WORLD APPLICATION IN AIOPS

Our last experiment applies our method to a real-world anomaly detection task in AIOps, where the goal is to monitor the operational key performance indicator (KPI) metrics of database services for alerting on anomalies and identifying root causes, in order to automate remediation strategies and improve database availability in cloud-based services. In this application, we monitor a total of 61 time series variables measuring the KPI metrics of database services, e.g., read/write IO requests, CPU usage and DB time. The data in this case study consists of one month of measurements. According to the feedback from domain experts, most of the inferred causal relationships shown in Figure 2 are consistent with the known domain knowledge. For example, the discovered links Redo (redo size) -> Lfpw (log file parallel write) -> Lfs (log file sync) -> COMT (commit) are exactly the same as the domain knowledge. The incidents that happened are relatively rare, e.g., 2 major incidents in one month, and our anomaly detection approach correctly detects these incidents. Therefore, we focus on root cause analysis in this case study. Figure 2 shows an example of one major incident, with several abnormal metrics such as DBt (DB time), Lfs (log file sync), APPL (application) and TotPGA (total PGA allocated), together with a part of the causal graph. The root cause scores computed by our method are highlighted. We can observe that the top root cause metrics are APPL, DBt and TotPGA, all of which correspond to application or database related issues for the incident, as validated by domain experts. More results can be found in the Appendix.

4. CONCLUSIONS

Most previous approaches for multivariate time series anomaly detection model the joint distribution directly without considering the underlying causal process of the observed time series data. This paper presented a new definition and formulation of anomalies in multivariate time series from a causal perspective, and proposed a novel approach that exploits the causal structures discovered from data to help detect anomalies more accurately and identify the root causes robustly according to the local causal mechanism. Our experiments on both simulated and real datasets demonstrated the efficacy, robustness and practical feasibility of the proposed approach in real-world applications.

dependencies through LSTM. OmniAnomaly (Su et al., 2019) learns robust time series representations with a stochastic variable connection and a planar normalizing flow. USAD (Audibert et al., 2020) uses adversarially trained autoencoders inspired by GANs, providing fast training. However, these methods model the joint distribution directly without considering the process behind multivariate time series, and an anomaly that happens to a local mechanism in the process might not change the joint distribution dramatically. Besides, it is difficult for them to leverage the domain knowledge of the monitored system, e.g., the known causal dependencies between time series, and to provide explanations that are crucial for root cause analysis and remediation when an anomaly occurs. Finally, our work also differs substantially from existing studies (Qiu et al., 2012; 2020) which, though they explore causality in anomaly detection in different ways, do not use the causal mechanism to model anomalies in time series. Root cause analysis (RCA) methods leverage the KPI metrics monitored on those services to determine the root causes when an anomaly event is detected.
The key idea behind RCA with KPI metrics is to analyze the relationships or dependencies between these metrics and then utilize these relationships to identify root causes when an anomaly occurs. Typically, there are two types of approaches: 1) identifying the anomalous metrics in parallel with the observed anomaly via metric data analysis, and 2) discovering topology/causal graphs that represent the causal relationships between the services. Nguyen et al. (2011; 2013) propose two similar RCA methods that analyze low-level system metrics, e.g., CPU, memory and network statistics. When a performance anomaly is detected, both methods first detect abnormal behaviors for each component via a change point detection algorithm, and then determine the root causes based on the propagation patterns obtained by sorting all critical change points in chronological order. Shan et al. (2019) developed a low-cost RCA method called ϵ-Diagnosis to detect root causes of small-window long-tail latency for web services. ϵ-Diagnosis assumes that the root cause metrics of an abnormal service change significantly between the abnormal and normal periods. However, these methods do not consider the causal relationships between KPI metrics or the dependencies between services in an application. The second type of RCA approach leverages such dependencies and usually involves two steps, i.e., constructing topology/causal graphs given the KPI metrics and domain knowledge, and extracting anomalous subgraphs or paths given the observed anomalies. Such graphs can either be reconstructed from the topology (domain knowledge) of a certain application (Thalheim et al., 2017; Wu et al., 2020; Álvaro Brandón et al., 2020; Samir & Pahl, 2019) or automatically estimated from the metrics via causal discovery techniques (Wang et al., 2018; Mariani et al., 2018; Chen et al., 2019; Meng et al., 2020; Lin et al., 2018; Ma et al., 2019; 2020).
To identify the root causes of the observed anomalies, random walk (e.g., Kim et al. (2013); Meng et al. (2020); Wang et al. (2018)), PageRank (e.g., Wu et al. (2020)) or other analysis methods can be applied over the discovered topology/causal graphs. Recently, Budhathoki et al. (2022) proposed a method based on counterfactual analysis which identifies the root cause of a detected anomaly/outlier by computing the contribution of each noise term to the anomaly score. However, these methods only accept univariate time series anomaly detectors, i.e., they detect anomalies for each metric separately.

A.2 EXPERIMENTAL SETUP AND PARAMETERS SETTINGS

For the implementation of our approach, we employ the CVAE to model the conditional distributions $P_G[x_i(t) \mid PA(x_i(t))]$. We choose the same parameters for all the experiments on both simulated and public real datasets. The encoder and decoder in the CVAE are both MLPs with hidden sizes [10, 20, 10]. The latent size is 5 and the prior distribution $p_\theta(z \mid c)$ is assumed to be the standard normal distribution (i.e., it does not depend on $c$). For training the CVAE, the optimizer is ADAM with learning rate 0.001, batch size 1024 and 80 epochs.

For modeling $P[x_i(t)]$ with $i \in R$, there are several options to choose from in practical applications. For the simulated datasets and our internal AIOps dataset, we choose a univariate anomaly detection method based on a CNN forecasting model. The CNN forecasting model consists of 4 residual blocks with 1D convolutional layers, i.e., the "(input channels, output channels)" pairs are (1, 8), (8, 16), (16, 32), (32, 64), followed by the concatenation of 1D adaptive average pooling and 1D adaptive max pooling. The output layer is a linear layer. Each residual block has two convolutional layers "(input channels, output channels) -> (output channels, output channels)". We choose ADAM as the optimizer with learning rate 0.001 (with decay), batch size 1024 and 50 epochs. The window size of the historical data for prediction is 20. For the public datasets, besides this CNN forecasting model, we can also choose isolation forest (IF) and autoencoder (AE). For IF, the max number of samples is 10000. For AE, the hidden sizes of the encoder are [25, 10, 5], the latent size is 5, and the hidden sizes of the decoder are [5, 10, 25].

For the simulated datasets, we apply FGES and set "max degree = 5" and "penalty discount = 20". For the public datasets SWaT and WADI, we apply FGES and set "max degree = 10" and "penalty discount = 100". For SMAP and MSL, we apply the PC algorithm with the default parameters.
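For quick reference, the settings stated above can be collected in one place. The following Python fragment is only an illustrative summary: the dictionary layout and every key name are hypothetical and are not taken from any released code; only the values come from the text.

```python
# Illustrative summary of the hyperparameters stated in this section.
# The layout and all key names are hypothetical; only the values come from the text.
EXPERIMENT_CONFIG = {
    "cvae": {
        "encoder_hidden_sizes": [10, 20, 10],
        "decoder_hidden_sizes": [10, 20, 10],
        "latent_size": 5,
        "prior": "standard_normal",  # p_theta(z|c) does not depend on c
        "optimizer": "adam",
        "learning_rate": 1e-3,
        "batch_size": 1024,
        "epochs": 80,
    },
    "cnn_forecaster": {
        # (input channels, output channels) of the 4 residual blocks
        "residual_block_channels": [(1, 8), (8, 16), (16, 32), (32, 64)],
        "pooling": ["adaptive_avg_1d", "adaptive_max_1d"],  # concatenated
        "optimizer": "adam",
        "learning_rate": 1e-3,  # with decay
        "batch_size": 1024,
        "epochs": 50,
        "window_size": 20,
    },
    "isolation_forest": {"max_samples": 10000},
    "autoencoder": {
        "encoder_hidden_sizes": [25, 10, 5],
        "latent_size": 5,
        "decoder_hidden_sizes": [5, 10, 25],
    },
    "causal_discovery": {
        "simulated": {"algorithm": "FGES", "max_degree": 5, "penalty_discount": 20},
        "SWaT": {"algorithm": "FGES", "max_degree": 10, "penalty_discount": 100},
        "WADI": {"algorithm": "FGES", "max_degree": 10, "penalty_discount": 100},
        "SMAP": {"algorithm": "PC"},  # default parameters
        "MSL": {"algorithm": "PC"},   # default parameters
    },
}
```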
The library for causal discovery we used in this project is Tetrad (https://github.com/cmu-phil/tetrad). A smaller "max degree" or a larger "penalty discount" in FGES leads to sparser graphs with fewer edges. Table 7 lists the number of edges in the causal graphs discovered with different parameters.

We choose parameters such as CVAE hidden sizes = [10, 20, 10] for the following reason. For all the simulation datasets, "max degree" is set to 5 in FGES and the causal relations are instantaneous, meaning that each variable has at most 5 causal parents, so the input dimension of the parent variables in the CVAEs for modeling conditional probabilities is at most 5. For the public datasets, "max degree" is 10 and we found that there are instantaneous causal influences but no time-delayed ones, so the input dimension of the parent variables in the CVAEs is at most 10. This is why we choose those parameters for the encoder and decoder. For a new dataset, if one considers a similar setting for causal discovery, one can use our parameters as defaults. In general, the input dimension of the CVAEs is at most "max degree" * "time-lag", so one can choose hidden sizes around this number. For modeling conditional probabilities, one can construct a validation set by splitting the training dataset. Under the Gaussian distribution assumption in CVAE, overfitting can be detected and avoided by monitoring the reconstruction MSE loss.

For all the experiments, the detection thresholds are inferred by taking the nth percentile of the detection scores in the test data, e.g., we choose n = 95 for SWaT and WADI, and n = 98 for SMAP and MSL. For the other methods (except ours) on the simulated datasets, the reported precision, recall and F1-score are the best values that can be achieved on the test datasets (by choosing the best threshold).

The simulated time series data can be generated in the following steps:

1. Generate an Erdős–Rényi random graph $G$ with number of nodes/variables $n$ and edge creation probability $p$, then convert it into a DAG. We choose $p = 0.1$.

2. For the variables with no parents in $G$, randomly pick a signal type from "harmonic", "pseudo periodic" and "autoregressive" and generate a time series with length $T$ according to this type. We use the Python library "TimeSynth" (https://github.com/TimeSynth/TimeSynth) to generate such signals. When generating these signals, the stop time is set to 100. For "harmonic", the frequency and the noise std are uniformly drawn from [0.1, 1.0] and [0.1, 0.3], respectively. For "pseudo periodic", the frequency is uniformly drawn from [1.0, 6.0], and "freqSD" and "ampSD" are set to 0.0 and 0.1. For "autoregressive", "ar param" is uniformly sampled from [0.3, 1.0] and "sigma" is uniformly sampled from [0.01, 0.1].

3. For a variable $x_i$ with parents $PA(x_i)$ in $G$, we consider both the linear relationship $x_i = \sum_{j \in PA(x_i)} w_j x_j + \epsilon$ and the nonlinear relationship $x_i = \sum_{j \in PA(x_i)} w_j \tanh(x_j) + \epsilon$, where $w_j$ is uniformly sampled from [0.5, 2.0] and $\epsilon$ is uniformly sampled from [-0.1, 0.1]. The time series for those variables are generated in a topological order.

4. Add anomalies into the generated time series. We consider three types of anomalies. The first one is a "measurement" anomaly, i.e., randomly pick a variable $x_i$, a time step $t$, a scale $s$ (uniformly sampled from [0, 3]) and a duration $d$ (uniformly sampled from [5, 20]), and then set $x_i(t : t + d) = [x_i(t) - \mathrm{median}(x_i)] \cdot s + \mathrm{median}(x_i)$. The second one is an "intervention" anomaly, i.e., after generating "measurement" anomalies for some variables, those anomalous values propagate to the children according to the causal mechanisms. The third one is an "effect" anomaly, where anomalies only happen on the variables with no causal children.

Figure 3 shows the generated causal graph which contains 15 variables.
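The generation steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the code used for the experiments: the function names are hypothetical, the DAG is obtained by orienting each random edge from the lower to the higher node index (one simple way to convert a random graph into a DAG), a noisy sinusoid stands in for the TimeSynth "harmonic" signal, and the measurement anomaly formula is applied to each time step in the window, which is one reading of the expression in step 4.

```python
import math
import random
from statistics import median


def random_dag(n, p, rng):
    """Erdos-Renyi edges oriented from lower to higher index, which yields a DAG."""
    parents = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                parents[j].append(i)
    return parents


def simulate(parents, length, rng, nonlinear=False):
    """Generate one series per node; index order is a topological order here."""
    series = {}
    for i in sorted(parents):
        if not parents[i]:
            # Stand-in for a "harmonic" signal with stop time 100 (assumption).
            freq = rng.uniform(0.1, 1.0)
            std = rng.uniform(0.1, 0.3)
            series[i] = [
                math.sin(2 * math.pi * freq * (100.0 * t / length)) + rng.gauss(0, std)
                for t in range(length)
            ]
        else:
            weights = {j: rng.uniform(0.5, 2.0) for j in parents[i]}
            f = math.tanh if nonlinear else (lambda v: v)
            series[i] = [
                sum(w * f(series[j][t]) for j, w in weights.items())
                + rng.uniform(-0.1, 0.1)
                for t in range(length)
            ]
    return series


def add_measurement_anomaly(x, t, d, s):
    """Return a copy of x with x(t : t+d) rescaled around median(x) by scale s."""
    m = median(x)
    out = list(x)
    for k in range(t, min(t + d, len(x))):
        out[k] = (x[k] - m) * s + m
    return out


rng = random.Random(0)
graph = random_dag(n=15, p=0.1, rng=rng)
data = simulate(graph, length=500, rng=rng)
# Inject one "measurement" anomaly into a randomly chosen variable.
target = rng.randrange(15)
data[target] = add_measurement_anomaly(
    data[target], t=rng.randrange(400), d=rng.randint(5, 20), s=rng.uniform(0, 3)
)
```

An "intervention" anomaly could be obtained in the same sketch by injecting the anomaly into a parent before its children are generated, so that the anomalous values propagate through the mechanisms.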
According to this causal graph, we can generate multivariate time series by following the procedure mentioned above, as shown in Figure 4 and Figure 5, where some of the abnormal time steps in the test dataset are labeled by blue triangles.

We also consider a mix of linear and nonlinear relationships, i.e., we randomly pick a relationship from [linear, nonlinear] during time series generation, where the probability of selecting "linear" is 0.7. Table 8 shows the experimental results when the relationships are a mix of linear and nonlinear. Compared with the linear case, a mixed relationship makes the anomaly detection problem harder, e.g., the performance of most approaches drops. Our approach still has the best performance compared to the other methods.

This experiment considers the case where the generated multivariate time series contains both discrete and continuous values. The generation procedure for this kind of time series data has the following differences. For the variables without parents in $G$, we randomly pick a signal type from "harmonic", "pseudo periodic" and "autoregressive". For a variable $x_i$ with parents $PA(x_i)$ in $G$, we first randomly pick a data type, i.e., choosing "discrete" with probability 0.4 and "continuous" with probability 0.6. For "discrete", $x_i = \mathbb{1}[\sum_{j \in PA(x_i)} w_j x_j + \epsilon > 0]$, i.e., logistic regression. For "continuous", $x_i = \sum_{j \in PA(x_i)} w_j x_j + \epsilon$, i.e., a linear relationship. When generating anomalies for "discrete", we take $x_i(t : t + d) = \lceil [x_i(t) - \mathrm{median}(x_i)] \cdot s + \mathrm{median}(x_i) \rceil$.

Table 9 shows the experimental results with the "discrete/continuous" datasets. Compared with the "continuous" case, the performance of all the methods decreases because inferring correlations or doing causal discovery becomes relatively harder with the mixed data types. Our approach still significantly outperforms the other ones in this case.
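The two rules specific to discrete variables, the thresholded mechanism and the rounded anomaly formula, can be illustrated with a short Python sketch. The function names are hypothetical, and the anomaly formula is applied per time step in the window, which is an assumption about how the expression is meant to be read.

```python
import math
from statistics import median


def discrete_value(parent_values, weights, noise):
    """x_i = 1[sum_j w_j * x_j + noise > 0] for a discrete child variable."""
    total = sum(w * v for w, v in zip(weights, parent_values)) + noise
    return 1 if total > 0 else 0


def add_discrete_anomaly(x, t, d, s):
    """Return a copy of x with ceil((x - median) * s + median) on the window t : t+d."""
    m = median(x)
    out = list(x)
    for k in range(t, min(t + d, len(x))):
        out[k] = math.ceil((x[k] - m) * s + m)
    return out
```

Because of the ceiling, the perturbed values remain integers, so an injected anomaly keeps the data type of the original discrete series.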

A.6 REAL-WORLD MOTIVATING EXAMPLE

Figure 8 shows why causality matters with a real-world example. In SWaT (Mathur & Tippenhauer, 2016), at timestamp 491, our causality-based approach detects a true anomaly where the causal mechanism between Metrics 1, 0 and 9 is violated (Metrics 0 and 9 are the causal parents of Metric 1). We plot the probability density of the reconstruction error based on the causal mechanism (top left figure), where the black triangle is the anomaly. Clearly, this anomaly can be easily identified w.r.t. the p-value. But if we check the probability density of the "joint" reconstruction error by AE (top right figure), this anomaly cannot be found w.r.t. the p-value. The bottom figure plots the time series data of these three metrics. Intuitively, we can observe that Metric 1 has a peak value when Metric 0 is low, and Metric 1 is low when Metric 0 is high. In the range (450, 550), this causal mechanism is violated, e.g., Metric 0 is high while Metric 1 is also high. This type of anomaly is hard to identify by checking joint or marginal distributions.

Figure 8: A motivating example in the real-world dataset SWaT (Mathur & Tippenhauer, 2016). At timestamp 491, our causality-based approach detects a true anomaly where the causal mechanism between metrics 1, 0 and 9 is violated (metrics 0 and 9 are the causal parents of metric 1).
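The intuition behind this example can be reproduced on synthetic data. The sketch below is an illustration only and does not use SWaT: a child variable is generated to be low when its parent is high, a least-squares line stands in for the learned causal mechanism (the paper uses a CVAE), and an empirical p-value is computed from the training errors. All names and the data-generating process are assumptions made for the illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def empirical_p_value(reference, value):
    """Share of reference errors at least as large as value (with +1 smoothing)."""
    return (1 + sum(r >= value for r in reference)) / (1 + len(reference))


rng = random.Random(0)
parent = [rng.uniform(0.0, 1.0) for _ in range(2000)]
child = [1.0 - x + rng.gauss(0.0, 0.05) for x in parent]  # child is low when parent is high

slope, intercept = fit_line(parent, child)
conditional_errors = [abs(y - (slope * x + intercept)) for x, y in zip(parent, child)]
child_mean = sum(child) / len(child)
marginal_errors = [abs(y - child_mean) for y in child]

# Both values are individually ordinary, but the parent-child relation is violated.
x_new, y_new = 0.9, 0.9
p_conditional = empirical_p_value(conditional_errors, abs(y_new - (slope * x_new + intercept)))
p_marginal = empirical_p_value(marginal_errors, abs(y_new - child_mean))
print(f"p-value w.r.t. mechanism: {p_conditional:.4f}, w.r.t. marginal: {p_marginal:.4f}")
```

In this toy setting the point is flagged only when it is tested against the conditional mechanism, since its value alone lies well within the normal range of the child variable.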

A.7 TRAINING ANOMALIES

When the fraction of anomalous points in the training data is large, these anomalies may degrade detection performance because the discovered causal graph may not be accurate. In this case, we can apply the solution discussed in Section 2.2.4, updating the causal graph and anomaly detection model iteratively. In this experiment, the training and test data are generated under the setting "linear/measurement", and a large amount of noise is added to the training data, i.e., additional Gaussian noise is added to the first 20% of the data points in the training data. These noisy data points make it harder for causal discovery algorithms to estimate accurate causal graphs. In each iteration, 3% of the data points are detected as anomalies and removed. Figure 9 (a) shows the detection performance on the test dataset measured by the F1 score at each iteration. In the beginning, the discovered causal graph has more errors due to the noise in the training data, leading to a low F1 score. After each iteration, our approach removes the detected anomalies from the training data, making the discovered causal graph more accurate in the next iteration, so that the detection performance increases consistently. This experiment empirically verifies our "iterative updates" approach in the case where the training data has a large portion of anomalies, i.e., where both the proportion and the magnitude of anomalies are large. In practical applications where the proportion of anomalies in the training data is relatively small, e.g., the public datasets, there is no need to apply this iterative approach, i.e., one iteration is good enough.

A.8 RUNNING TIME

Table 10 shows the running time of our approach. The most time-consuming step is local causal mechanism estimation (conditional distribution estimation). After training, our approach detects anomalies and root causes quickly.

We also studied the effects of different parameters for discovering causal graphs on the performance of our approach.
The parameters that we investigated are "max degree" and "penalty discount" in FGES, both of which affect the structure of the causal graph, e.g., sparsity, indegree, outdegree. In this experiment, we use 6 different values of "max degree" [5, 6, 7, 8, 9, 10] and 6 different values of "penalty discount" [20, 40, 60, 80, 100, 120]. A smaller "max degree" or a larger "penalty discount" leads to sparser graphs with fewer edges, e.g., for SWaT, the number of edges in $G$ is [70, 79, 88, 95, 98, 102] when "max degree" = [5, 6, 7, 8, 9, 10], respectively. Figure 10 plots the detection precision, recall and F1 score obtained with different "max degree" and "penalty discount". For SWaT, these two parameters do not affect the performance much. For WADI, when "max degree" decreases (the causal graph becomes sparser) or "penalty discount" decreases (the causal graph has more false positive links), the performance also decreases but does not drop much, i.e., the worst F1 score is still above 0.65. When "max degree" > 6 and "penalty discount" > 40, we obtain similar performance, e.g., the F1 score is around 0.8, showing that our approach is robust to changes in the inferred causal graph. In practice, the causal graph is not required to be accurate, namely, we just need to ensure that it does not contain too many missing links or false positive links.

Specifically, for a variable $x_i$, recall that its root cause score at time $t$ is

$$S(x_i(t)) = 1 - M_i(x_i(t)). \tag{2}$$

Suppose that $N(x_i)$ is the set of the contemporaneous causal children of $x_i(t)$. The final root cause score is defined by

$$RS(x_i(t)) = S(x_i(t)) + \alpha \frac{1}{|N(x_i)|} \sum_{x_j(t) \in N(x_i)} RS(x_j(t)), \quad \forall i = 1, \cdots, d,$$

where $\alpha$ is a weight parameter satisfying $0 \le \alpha < 1$. When $N(x_i)$ is empty, we set $RS(x_i(t)) = S(x_i(t))$. Here the final root cause score of a variable is the combination of its original root cause score and the root cause scores of its children.
The final root cause scores for all the variables can be computed by solving these linear equations. When $\alpha = 0$, the approach reduces to the ideal scenario discussed above. When $\alpha \neq 0$, the above approach improves the ranking of root causes from a global view. The root causes at time $t$ can then be identified by picking the variables with the top root cause scores.
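The final root cause scores defined above can be computed with a short recursion when the contemporaneous causal graph is acyclic, since the score of a variable only depends on the scores of its children. The following Python sketch is an illustration under that assumption; the function name, the dictionary-based graph representation and the example scores are hypothetical.

```python
def final_root_cause_scores(scores, children, alpha=0.5):
    """Compute RS_i = S_i + alpha * mean(RS_j for children j); RS_i = S_i without children.

    scores:   dict mapping each variable to its original root cause score S at time t
    children: dict mapping a variable to the list of its contemporaneous causal children
    Assumes the contemporaneous graph is acyclic, so the recursion terminates.
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    memo = {}

    def rs(i):
        if i not in memo:
            kids = children.get(i, [])
            if kids:
                memo[i] = scores[i] + alpha * sum(rs(j) for j in kids) / len(kids)
            else:
                memo[i] = scores[i]
        return memo[i]

    return {i: rs(i) for i in scores}


# Hypothetical chain a -> b -> c with made-up original scores.
example_scores = {"a": 0.9, "b": 0.6, "c": 0.2}
example_children = {"a": ["b"], "b": ["c"]}
final = final_root_cause_scores(example_scores, example_children, alpha=0.5)
ranking = sorted(final, key=final.get, reverse=True)
```

For this chain, RS(c) = 0.2, RS(b) = 0.6 + 0.5 · 0.2 = 0.7 and RS(a) = 0.9 + 0.5 · 0.7 = 1.25, so the upstream variable is ranked first; with alpha = 0 the final scores equal the original ones.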



We assume the reconstruction error is additive, e.g., $x = f(c) + e$, so that $P(x \mid c) = P_e(x - f(c))$. Hence we use the distribution of the reconstruction error for detecting anomalies.

https://github.com/EnyanDai/GANF

https://github.com/cmu-phil/tetrad

https://github.com/TimeSynth/TimeSynth



Figure 1: The causal mechanism between $x$, $y$, $z$ is $y = 0.5x + \epsilon_1$, $z = \tanh(y^2 - y) + \epsilon_2$. An anomaly occurs at time step 40, labeled by a black triangle. The causal mechanism helps find the anomaly easily as the p-value w.r.t. $y$ conditioned on $x$ is 0.011 (testing whether $y$ follows its normal conditional distribution $p(y \mid x)$). The root cause of this anomaly is $y$ because only $p(y \mid x)$ is anomalous.

Figure 2: Example on a real-world AIOps case, showing 8 out of 61 time variables (left) and a part of the causal graph (right). The deeper colors of nodes in the graph indicate higher root cause scores.

Figure 3: An example of the ground truth causal graph in the simulated datasets.

Figure 4: An example of the simulated training datasets.

A.4 MIXED RELATIONSHIPS: LINEAR + NONLINEAR

Figure 5: An example of the simulated test datasets. Some of the abnormal time steps are labeled by blue triangles.

for time step $t$, duration $d$ and scale $s$. Figures 6 and 7 give an example of the generated training data and test data in this case.

Figure 6: An example of the simulated training datasets with mixed data types.

Figure 7: An example of the simulated test datasets with mixed data types. Some of the abnormal time steps are labeled by blue triangles.

Figure 9: Empirical study on our "iterative updates" approach discussed in Section 2.2.4 for handling large noise in the training data. (a) The detection performance (F1 scores) on the test data over iterations. (b) The difference between the adjacency matrices of two consecutive discovered causal graphs.

show the detection results by our approach where we did downsampling for better demonstration. The left figures plot the ground truth labels. The right figures plot the detected anomalies in a point-adjust way.

Figure 13: SMAP detection results. Left: Ground-truth labels. Right: Detected anomalies.

Figure 14: MSL detection results. Left: Ground-truth labels. Right: Detected anomalies.

Figure 15 shows another major incident. The top abnormal variables are SYIO (system I/O), USIO (user I/O), Lfpw (log file parallel write), and UTIL (I/O utilization). All of them are related to I/O issues, meaning that the root causes are the components related to I/O.

Figure 15: Case study in AIOps. 8 out of 61 time variables (left) and a part of the causal graph (right). The anomaly scores are indicated by the colors, e.g., deeper colors indicate larger anomaly scores.

Specifically, let $C_R$ be the set of variables with no causal parents in $G$. There are two cases to be considered:

• Variable with parents ($i \notin C_R$): The conditional distribution of $x_i(t)$ given its causal parents needs to be estimated, i.e., $P_G[x_i(t) \mid PA(x_i(t))]$ is modeled via conditional density estimation, which can be learned in a supervised manner.

Performance comparison (F1-scores) on the simulation datasets. Columns: Lin./Measu., Lin./Inter., Lin./Effect, Nonlin./Measu., Nonlin./Inter., Nonlin.

RCA performance comparison (top-k hit-ratio) on the simulation datasets.

Performance comparison of our approach and other methods on the public datasets.

The best performance of our approach with the public datasets.



Performance comparison (F1-score) with the public datasets, e.g., GES vs PC vs FGES for causal discovery, and CVAE vs MLP vs KMN for causal mechanism estimation.

The number of the edges in the causal graphs generated by FGES with different parameters.

Performance comparison with the simulated dataset. The relationships are a mix of linear and nonlinear.

Performance comparison with the simulated dataset. The data types are a mix of discrete and continuous.

The running time of our approach (wall clock time).

B CASE STUDY: REAL-WORLD APPLICATIONS IN AIOPS

Root cause analysis (RCA) in real-world applications such as AIOps can be very challenging. One practical issue in identifying root causes is that an anomaly occurring in a parent often makes its contemporaneous causal children appear abnormal as well, due to the estimation errors in the conditional distributions. To handle this issue, we developed the following practical algorithm.

