VARIATIONAL ADAPTIVE GRAPH TRANSFORMER FOR MULTIVARIATE TIME SERIES MODELING

Anonymous authors. Paper under double-blind review.

Abstract

Multivariate time series (MTS) are widely collected by large-scale complex systems, such as internet services, IT infrastructures, and wearable devices. Modeling MTS has long been an important but challenging task. To capture complex long-range dynamics, Transformers have been applied to MTS modeling and have achieved attractive performance. However, Transformers generally do not capture well the diverse relationships between different channels within MTS, and they have difficulty modeling MTS with complex distributions due to their lack of stochasticity. In this paper, we first incorporate relational modeling into the Transformer to develop an adaptive Graph Transformer (G-Trans) module for MTS. We then introduce stochasticity through a powerful embedding-guided probabilistic generative module for G-Trans, constructing the Variational adaptive Graph Transformer (VG-Trans), a well-defined variational generative dynamic model. VG-Trans learns expressive representations of MTS and serves as a plug-and-play framework applicable to both forecasting and anomaly detection tasks. For efficient inference, we develop an autoencoding variational inference scheme with a combined prediction and reconstruction loss. Extensive experiments on diverse datasets show the effectiveness of VG-Trans on MTS modeling and its ability to improve existing methods on a variety of MTS modeling tasks.

1. INTRODUCTION

Multivariate time series (MTS) are an important type of data arising from a wide variety of domains, including internet services (Dai et al., 2021; 2022), industrial devices (Finn et al., 2016; Oh et al., 2015), health care (Choi et al., 2016b;a), and finance (Maeda et al., 2019; Gu et al., 2020), to name a few. However, modeling MTS has always been challenging: there exist not only complex temporal dependencies, as shown in the red box in Fig. 1, but also diverse cross-channel dependencies, as shown in the blue box in Fig. 1. Moreover, inherently stochastic components remain, as shown in the green box in Fig. 1, even if one can fully capture both temporal and cross-channel dependencies. To address these challenges, many deep learning based methods have been proposed for various MTS tasks, such as forecasting, anomaly detection, and classification. To model the temporal dependencies of MTS, many dynamic methods based on recurrent neural networks (RNNs) have been developed (Malhotra et al., 2016; Zhang et al., 2019; Bai et al., 2019b; Tang et al., 2020; Yao et al., 2018). Meanwhile, to take stochasticity into consideration, probabilistic dynamic methods have also been developed (Dai et al., 2021; 2022; Chen et al., 2020; 2022; Salinas et al., 2020). With the development of the Transformer (Vaswani et al., 2017) and its ability to capture long-range dependencies and interactions (Wen et al., 2022; Dosovitskiy et al., 2021; Dong et al., 2018; Chen et al., 2021), which is especially attractive for time series modeling, there is a recent trend toward Transformer-based MTS modeling methods, which have achieved promising results in learning expressive representations for downstream tasks. For example, for forecasting, LogTrans (Li et al., 2019) incorporates causal convolutions into the self-attention layer to consider local temporal dependencies of MTS.
Informer (Zhou et al., 2021) develops a ProbSparse self-attention mechanism for long-sequence forecasting. AST (Wu et al., 2020) further constructs a generative adversarial encoder-decoder framework for better predicting the output distribution. In addition, there are other efficient Transformer-based forecasting methods, such as Autoformer (Xu et al., 2021), FEDformer (Zhou et al., 2022), and TFT (Lim et al., 2021). For anomaly detection, Meng et al. (2019) illustrate the superiority of the Transformer over traditional RNN-based methods. Following it, several modified Transformer-based methods have been proposed for anomaly detection, such as TransAnomaly (Zhang et al., 2021), ADTrans (Tuli et al., 2022), and Anomaly Transformer (Xu et al., 2022). To address non-deterministic temporal dependence within MTS, Tang & Matteson (2021) further incorporate the Transformer structure into state-space models and develop ProTrans. Despite the attractive performance of existing Transformer-based models, their ultimate potential is limited by ignoring the cross-channel dependence of MTS. To consider the relationships of different channels within MTS, MSCRED (Zhang et al., 2019) introduces a multi-scale convolutional recurrent encoder-decoder to learn spatial correlations and temporal characteristics in MTS and detects anomalies via the residual signature matrices. InterFusion (Li et al., 2021) incorporates recurrent and convolutional structures into a unified framework to capture both temporal and inter-metric information. Recently, graph neural networks (GNNs) have attracted increasing attention for exploring such relationships, and several GNN-based methods have been developed to discover expressive representations of MTS (Deng & Hooi, 2021; Zhao et al., 2020). The deep variational graph convolutional recurrent network (DVGCRN) incorporates relationship modeling into a hierarchical generative process.
Moreover, graph-based methods (Li et al., 2018; Bai et al., 2019a; Yu et al., 2018; Wu et al., 2019; Guo et al., 2019; Pan et al., 2019) have also been developed for MTS forecasting. The adaptive graph convolutional recurrent network (AGCRN) (Bai et al., 2020) further learns node-specific patterns for MTS forecasting without requiring a pre-defined graph. However, these methods are all non-dynamic or RNN-based models, limiting their power in capturing complex relationships across long-distance time steps. Moving beyond the constraints of previous work, we first propose an adaptive graph Transformer (G-Trans) module by incorporating a graph into the Transformer structure, which can model both temporal and cross-channel dependencies within MTS. Then, considering the stochasticity within MTS and to enhance the representative power of G-Trans, we further develop a Variational adaptive Graph Transformer (VG-Trans), a well-defined probabilistic dynamic model obtained by combining G-Trans with a proposed Embedding-guided Probabilistic generative Module (EPM), as illustrated in Fig. 2 (b). We note that VG-Trans learns robust representations of MTS, which enables it to be combined with existing methods and applied to both anomaly detection and forecasting tasks. In addition, we introduce an autoencoding variational inference scheme for efficient inference and a joint optimization objective that combines forecasting and reconstruction losses to ensure expressive time-series representation learning. The main contributions of our work are summarized as follows:
• For MTS modeling, we propose the G-Trans module, which incorporates channel-relationship learning into the Transformer structure.
• We develop VG-Trans, a VAE-structured probabilistic dynamic model with G-Trans as encoder and EPM as decoder, which accounts for the non-determinism within both the temporal and cross-channel dependencies of MTS.
VG-Trans can be combined with different methods and applied to different MTS tasks.
• To achieve scalable training, we introduce an autoencoding inference scheme with a combined prediction and reconstruction loss to enhance the representational power for MTS.
• Experiments on both anomaly detection and forecasting tasks illustrate the effectiveness of our model for MTS modeling.

2. METHOD

We first present the problem definition, then introduce the probabilistic channel embedding for measuring the relationships between different channels, and then present G-Trans.

Problem Definition: Define the n-th MTS as x_n = {x_{1,n}, x_{2,n}, ..., x_{T,n}}, where n = 1, ..., N and N is the number of MTS. T is the duration of x_n, and the observation at time t, x_{t,n} ∈ R^V, is a V-dimensional vector, where V denotes the number of channels; thus x_n ∈ R^{T×V}. The goal of MTS modeling is to learn robust representations of the input with a powerful method that can serve as a plug-and-play framework applicable to different tasks.

2.1. PROBABILISTIC CHANNEL EMBEDDING

To reflect the characteristics of different channels in MTS and capture their non-deterministic relationships, we introduce a probabilistic embedding vector for each channel:

α_i ∼ N(µ_i, diag(σ_i)) ∈ R^d,  α = [α_1, ..., α_V]

where N(·) denotes the Gaussian distribution and α is the channel embedding for the inputs. To take full advantage of the uncertainty carried by Gaussian-distributed embeddings, we measure the distance between different channels with the expected likelihood kernel (Jebara et al., 2004):

s(α_i, α_j) = ∫_{x∈R^d} N(x; µ_i, diag(σ_i)) N(x; µ_j, diag(σ_j)) dx = N(0; µ_i − µ_j, diag(σ_i + σ_j))

Clearly, as a symmetric similarity function, s(·) incorporates both the means and covariances of the two Gaussians, accounting for the uncertainty within them. The parameters of the channel embeddings are initialized randomly and then trained along with the rest of the model.
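The expected likelihood kernel above has a simple closed form for diagonal Gaussians. A minimal numpy sketch (the function name is ours, not from the paper's code), working in log-space for numerical stability:

```python
import numpy as np

def expected_likelihood_kernel(mu_i, var_i, mu_j, var_j):
    """Closed-form integral of the product of two diagonal Gaussians:
    s(a_i, a_j) = N(0; mu_i - mu_j, diag(var_i + var_j))."""
    var = var_i + var_j
    diff = mu_i - mu_j
    log_s = -0.5 * np.sum(np.log(2 * np.pi * var) + diff ** 2 / var)
    return np.exp(log_s)
```

As the formula suggests, the kernel is symmetric in its two arguments and decays as the embedding means move apart, while larger variances flatten the decay.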

2.2. ADAPTIVE GRAPH TRANSFORMER

To consider both temporal and cross-channel dependencies, we introduce graph structure into the Transformer framework and develop the G-Trans module; its detailed structure is shown in Fig. 2 (c). Specifically, we first introduce a Multi-head Self-Attention (MSA) block to capture the temporal dependence:

O = MSA(X) = con(H_1, ..., H_m),  H_i = SA(Q_i, K_i, V_i) = softmax(Q_i^T K_i / √d_K) V_i   (1)

where X ∈ R^{V×T} denotes the input, con(·) is the concatenation operation, Q_i = W_Q^i X ∈ R^{d_k×T}, K_i = W_K^i X ∈ R^{d_k×T}, i ∈ {1, 2, ..., m}, and m is the number of heads. It is worth noting that we set V_i = X to keep the meaning of the channels within MTS unchanged, so that their relationships can be captured subsequently. O ∈ R^{V×T×m} is the output of the MSA block. Then, unlike the traditional Transformer, which applies a feed-forward network after the multi-head self-attention block and therefore cannot capture the diverse relationships between channels well, we introduce an Adaptive Graph Convolutional Network (AGCN) block. Specifically, with the output O of the MSA block and the channel embeddings α defined in Section 2.1, AGCN discovers channel dependencies automatically:

A = log(s(α, α)),  H = ln(softplus(Conv(O))),  Ã = softmax(ReLU(A)),  h = AGCN(α, O) = ln(1 + exp(W Ã H))

where A ∈ R^{V×V} is the relational matrix computed by the symmetric similarity function, and α is updated to be adaptive to the MTS data. Conv(·) is a convolutional operation that summarizes the multi-head information in O into H ∈ R^{V×T}. Ã is the normalized symmetric adjacency matrix, and W ∈ R^{K′×V} is the GCN filter. After combining the temporal and channel-wise relational information of MTS into h ∈ R^{K′×T}, multi-head self-attention and feed-forward network blocks are further applied to explore expressive representations of MTS, as shown in the red box in Fig. 2 (c), yielding the dynamic latent states h ∈ R^{K×T}.
Then, we can select a reconstruction or forecasting decoder for different tasks.
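The AGCN computation described above can be sketched as follows. This is an illustrative numpy version under stated assumptions: the mean over heads stands in for the unspecified Conv(·), the pairwise kernel is computed naively in a double loop, and ReLU/softmax/softplus follow the equations in Section 2.2:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agcn_block(O, mu, var, W):
    """Adaptive graph convolution over the channel dimension.
    O:  (V, T, m) output of the MSA block
    mu, var: (V, d) parameters of the Gaussian channel embeddings
    W:  (Kp, V) GCN filter; returns h of shape (Kp, T)."""
    V = mu.shape[0]
    # A = log s(alpha, alpha): pairwise log expected-likelihood kernel
    A = np.empty((V, V))
    for i in range(V):
        for j in range(V):
            s = var[i] + var[j]
            d = mu[i] - mu[j]
            A[i, j] = -0.5 * np.sum(np.log(2 * np.pi * s) + d ** 2 / s)
    A_tilde = softmax(np.maximum(A, 0.0), axis=1)     # Ã = softmax(ReLU(A))
    H = np.log(np.log1p(np.exp(O.mean(axis=2))))      # ln(softplus(.)); head-mean stands in for Conv
    return np.log1p(np.exp(W @ A_tilde @ H))          # h = softplus(W Ã H)
```

The final softplus keeps the aggregated states strictly positive, and the row-wise softmax makes each channel's neighborhood weights sum to one.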

2.3. VARIATIONAL ADAPTIVE GRAPH TRANSFORMER

Previous Transformer-based MTS modeling methods are typically equipped with deterministic generative models, which ignore the stochasticity within MTS and have difficulty modeling complex MTS with sophisticated distributions. To address this issue, we further develop VG-Trans, a novel VAE-structured dynamic model equipped with a powerful Embedding-guided Probabilistic generative Module (EPM) as decoder and G-Trans as encoder. The graphical illustration of the generation (decoder) and inference (encoder) operations of VG-Trans is shown in Fig. 2 (b).

Embedding-guided probabilistic generation: To account for the stochasticity within MTS, we first introduce a Gaussian-distributed latent variable z_{t,n} ∈ R^K at each timestep and define the generative process as

z_{t,n} ∼ N(µ_{t,n}, diag(σ_{t,n})),  µ_{t,n} = f(W_{h,µ} h_{t−1,n}),
x_{t,n} ∼ N(µ^x_{t,n}, diag(σ^x_{t,n})),  µ^x_{t,n} = f(W^x_{zµ} z_{t,n} + W^x_{hµ} h_{t−1,n})

as illustrated in Fig. 2 (a) and (b). µ_{t,n} and σ_{t,n} are the mean and covariance parameters of z_{t,n}, and h_{t−1,n} ∈ R^{K′} denotes the deterministic latent states of the G-Trans module. We combine z_{t,n} and h_{t,n} in the generative process to capture both temporal dependencies and stochasticity; f(·) is a non-linear activation function. Moreover, to further capture the cross-channel dependencies of the inputs and latent variables, we introduce the channel embeddings into the generative process by defining

W^x_{zµ} = log(s(α, β_z)),  W^x_{hµ} = softmax(α β_h)   (4)

The channel embeddings thus enter the generative process through the factor loading matrices W^x_{zµ} and W^x_{hµ}, defined as mappings of the embeddings, which can capture the non-deterministic inter-relationships between channels, as introduced in Dieng et al. (2020), thereby improving the model's capacity on complex MTS.
β_z ∈ R^{d×K} is the mapping matrix that transforms z_{t,n} into the embedding space of x_{t,n}, while β_h ∈ R^{d×K′} is the mapping matrix that transforms h_{t,n} into the embedding space of x_{t,n}. We call the resulting generative model EPM. Compared with the generative modules of previous Transformer-based MTS methods, EPM represents the latent semantic structure of each channel as a probabilistic embedding vector and captures the relationships between channels according to the similarity of their embeddings, while accounting for the non-determinism within them, thereby improving representational capacity.

Inference: The purpose of the inference module is to map the inputs x_n to z_n. To capture local temporal dependencies, we first apply a convolutional operation to x_{t,n}: x̃_{t,n} = Conv(x_{t,n}) ∈ R^V. Then, with the MSA and AGCN blocks, we summarize the temporal and channel-wise relational information of the input MTS as

O_n = MSA(x̃_n),  h̃_n = AGCN(O_n, α)   (5)

Given the series h̃_{1:T,n}, we apply a linear projection and combine it with a positional embedding to obtain

h^(0)_{t,n} = LayerNorm(MLP(h̃_{t,n}) + Position(t))   (6)

With h^(0)_{t,n} as input, we then apply multi-head self-attention and feed-forward network blocks to obtain the dynamic latent states h_t, as shown in Fig. 2 (c). Following VAE-based models (Kingma & Welling, 2014), we define a Gaussian variational distribution q(z_{t,n}) = N(µ_{t,n}, diag(σ_{t,n})) to approximate the true posterior p(z_{t,n} | −), and map the dynamic latent states h_t to its parameters:

µ_{t,n} = f(C_{xµ} h_{t,n} + b_{xµ}),  σ_{t,n} = Softplus(f(C_{xσ} h_{t,n} + b_{xσ}))   (7)

where C_{xµ}, C_{xσ} ∈ R^{K×V} and b_{xµ}, b_{xσ} ∈ R^K are learnable parameters of the inference network. As noted in Cao et al. (2020) and Wen et al. (2022), prediction-based models excel at capturing the periodic information of MTS, while reconstruction-based models can explore the global distribution of the time series.
To combine the complementary advantages of the two models and strengthen the representation capability for MTS, we formulate the optimization objective as the combination of prediction and reconstruction losses and define the marginal likelihood as

P(D | α, W) = ∫ ∏_{t=1}^{T} p(x_{t,n} | z_{t,n}, α) p(x_{T,n} | h_{1:T−1,n}, α) dz_{1:T,n}

where the first and second terms correspond to the reconstruction and prediction losses, respectively. As in VAEs, with the inference network and the variational distribution in Eq. 7, VG-Trans is optimized by maximizing the evidence lower bound (ELBO) of the log marginal likelihood:

L = Σ_{n=1}^{N} Σ_{t=1}^{T} E_{q(z_{t,n})} [ ln p(x_{t,n} | z_{t,n}) + γ ln p(x_{T,n} | h_{1:T−1,n}) − ln ( q(z_{t,n} | x̃_{t,n}) / p(z_{t,n} | h_{t−1,n}) ) ]   (8)

where γ > 0 is a hyper-parameter balancing the prediction and reconstruction losses, chosen by grid search on the validation set. The parameters of the channel embedding α can be learned with Bayes by Backprop (Blundell et al., 2015), since s(α_i, α_j) can be reparameterized via (µ_i − µ_j) + diag(σ_i + σ_j) * ϵ_{ij}. The detailed optimization procedure of VG-Trans is summarized in the Appendix.
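Under the Gaussian assumptions above, the per-window ELBO with the combined reconstruction and prediction terms can be sketched as follows. This is an illustrative numpy version; the KL term is computed in closed form for diagonal Gaussians (an implementation choice the text does not specify), and all array shapes are our assumptions:

```python
import numpy as np

def gaussian_log_prob(x, mu, var):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ), diagonal case."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0, axis=-1)

def elbo(x, recon_mu, recon_var, pred_mu, pred_var, mu_q, var_q, mu_p, var_p, gamma=0.5):
    """Eq. 8 for a single window: reconstruction + gamma * prediction - KL.
    x: (T, V) observations; recon_*: (T, V); pred_*: (V,) for the last step x_T;
    *_q, *_p: (T, K) variational and prior parameters of z_t."""
    recon = gaussian_log_prob(x, recon_mu, recon_var).sum()   # sum over t
    pred = gaussian_log_prob(x[-1], pred_mu, pred_var)        # x_T given h_{1:T-1}
    kl = kl_diag_gaussians(mu_q, var_q, mu_p, var_p).sum()
    return recon + gamma * pred - kl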

4. APPLICATION TO ANOMALY DETECTION TASK

Anomaly detection of MTS is the problem of determining whether an observation at a certain time is anomalous or not. The model is trained to learn the normal patterns of multivariate time series: the more an observation follows the normal patterns, the more likely it can be reconstructed and predicted well with high confidence. Our model is an unsupervised probabilistic generative model and can therefore be applied to unsupervised anomaly detection directly. Specifically, we use the reconstruction and prediction errors of x_t as the anomaly score:

S_{t,n} = (S^r_{t,n} + γ(−S^p_{t,n})) / (1 + γ),  S^r_{t,n} = log p(x_{t,n} | z_{t,n}),  S^p_{t,n} = (x_{t,n} − x̂_{t,n})²   (9)

where S^r_{t,n} and S^p_{t,n} are the reconstruction and prediction scores, respectively. An observation x_t is classified as anomalous when S_{t,n} falls below a specific threshold. From a practical point of view, we use the Peaks-Over-Threshold (POT) approach (Siffer et al., 2017) to select the threshold. Moreover, we note that after collecting both the non-deterministic temporal and channel dependency information within VG-Trans and obtaining latent representations of MTS, we can introduce more powerful structures for specific tasks. In particular, we combine our model with Anomaly Transformer (Xu et al., 2022) by introducing anomaly attention and association discrepancy into VG-Trans, obtaining VG-Anomaly-Trans.
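A minimal sketch of the anomaly score in Eq. 9, with a simple percentile threshold standing in for the POT procedure (POT itself is a separate extreme-value fitting step not reproduced here); shapes and helper names are our assumptions:

```python
import numpy as np

def anomaly_scores(x, recon_mu, recon_var, x_pred, gamma=0.1):
    """S_t = (S^r_t - gamma * S^p_t) / (1 + gamma); a LOW score suggests an anomaly.
    x, recon_mu, recon_var, x_pred: (T, V) arrays."""
    s_r = -0.5 * np.sum(np.log(2 * np.pi * recon_var)
                        + (x - recon_mu) ** 2 / recon_var, axis=-1)  # log-likelihood
    s_p = np.sum((x - x_pred) ** 2, axis=-1)                         # prediction error
    return (s_r - gamma * s_p) / (1.0 + gamma)

def detect(scores, threshold):
    """Flag timesteps whose score falls below the threshold."""
    return scores < threshold
```

A percentile of the score distribution is a crude but serviceable stand-in for the POT-selected threshold when prototyping.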

5. APPLICATION TO FORECASTING TASK

The forecasting task of MTS can be formulated as finding a function that predicts the next τ time steps given the past T time steps (Bai et al., 2020):

{x_{:,t+1}, x_{:,t+2}, ..., x_{:,t+τ}} = F_θ(x_{:,t}, x_{:,t−1}, ..., x_{:,t−T+1})

where θ denotes the learnable parameters. Since VG-Trans is a representation learning method, it serves as a plug-and-play framework that can be applied to forecasting by combining it with corresponding methods. Inspired by the effectiveness of Autoformer in forecasting, we combine it with our proposed VG-Trans to obtain VG-Autoformer. As shown in Fig. 3, VG-Autoformer first obtains latent representations of MTS, where o^{seasonal}_{T+1:T+τ} ∈ R^{V×τ} and o^{trend}_{T+1:T+τ} ∈ R^{V×τ} represent the seasonal and trend-cyclical outputs of the Autoformer decoder, respectively; the latter is refined by the channel embedding α via a learnable parameter β_x ∈ R^{d×V}. The intuition is that long-term channel relationships are more likely to lie in the trend-cyclical part than in the seasonal part. The detailed structure of VG-Autoformer is introduced in the Appendix. To optimize VG-Autoformer for forecasting, we modify the loss in Eq. 8 as

L = Σ_{n=1}^{N} Σ_{t=1}^{T} E_{q(z_{t,n})} [ ln p(x_{t,n} | z_{t,n}) + γ ln p(x_{T+1:T+τ,n} | h_{1:T,n}) − ln ( q(z_{t,n} | x_{t,n}) / p(z_{t,n} | h_{t−1,n}) ) ]
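The trend refinement in VG-Autoformer is described only at a high level, so the following is a hypothetical sketch: a channel-relationship matrix R (which in the paper's design would be derived from the channel embeddings α and β_x) mixes the trend-cyclical output across related channels, while the seasonal part is left untouched. The function name and the identity-mixing default are our assumptions:

```python
import numpy as np

def refine_forecast(o_seasonal, o_trend, R):
    """Combine Autoformer's decomposition outputs, refining only the trend.
    o_seasonal, o_trend: (V, tau) decoder outputs
    R: (V, V) channel-relationship mixing matrix (e.g. row-normalized)."""
    return o_seasonal + R @ o_trend
```

With R set to the identity, this degenerates to Autoformer's plain seasonal-plus-trend sum, which makes the role of the cross-channel refinement easy to ablate.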

6. EXPERIMENT

We conduct extensive experiments to evaluate the performance of our proposed models on forecasting and anomaly detection tasks of MTS.

Datasets and set up:

We evaluate the effectiveness of our model on two groups of datasets: 1) four real-world datasets for anomaly detection, including CDN (Dai et al., 2021) and SMD, MSL, and SMAP (Xu et al., 2022); 2) four datasets for forecasting, including ETTh, ETTm, Weather, and ECL (Zhou et al., 2021). The results are either quoted from the original papers or reproduced with the code provided by the authors. Data preprocessing follows Dai et al. (2021), where the window size T and overlap o are set as T = 20, o = 5 for anomaly detection and T = 20, o = 0 for forecasting. The Adam optimizer (Kingma & Ba, 2015) is employed with a learning rate of 0.0002 and a batch size of 64. The number of heads in VG-Trans is set to 8. The summary statistics of these datasets and other implementation details are described in the Appendix.
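The sliding-window preprocessing (window size T, overlap o, hence stride T − o) can be sketched as follows; the function name is ours, and trailing samples that do not fill a complete window are dropped, which is one common convention the text does not pin down:

```python
import numpy as np

def make_windows(x, T=20, o=5):
    """Split a (length, V) series into overlapping windows of size T.
    Consecutive windows share o timesteps, i.e. the stride is T - o."""
    stride = T - o
    n = (len(x) - T) // stride + 1
    return np.stack([x[i * stride : i * stride + T] for i in range(n)])
```

For forecasting (o = 0) this reduces to non-overlapping, back-to-back windows.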

6.1. ANOMALY DETECTION

Following previous studies (Dai et al., 2021), we employ Precision, Recall, and F1-score as the evaluation metrics for anomaly detection. In particular, F1-score is regarded as a comprehensive indicator since it balances precision and recall. We extensively compare our model against baselines, including the classic methods OC-SVM (Tax & Duin, 2004), IsolationForest (Liu et al., 2008), and LOF (Breunig et al., 2000); recurrent structure methods LSTM (Hundman et al., 2018) and THOC (Shen et al., 2020); probabilistic dynamic models LSTM-VAE (Park et al., 2018), VRNN (Chung et al., 2015), DOMI (Su et al., 2021), BeatGAN (Zhou et al., 2019), OmniAnomaly (Su et al., 2019), and SDFVAE (Dai et al., 2021); methods considering cross-channel dependency, InterFusion (Li et al., 2021) and GNN (Deng & Hooi, 2021); and Transformer-based methods, Transformer (Zerveas et al., 2021) and Anomaly-Trans (Xu et al., 2022). Anomaly-Trans is the state-of-the-art Transformer-based method. We first compare the proposed models with the baselines on detection performance and report the average F1-scores over five independent runs in Table 1, with the best results highlighted in boldface. Deep learning based methods clearly outperform the classic methods thanks to their stronger representational power. Probabilistic dynamic methods achieve better results than LSTM, since they consider the stochasticity within MTS. Both recurrent and graph structures boost performance, indicating the effectiveness of temporal and cross-channel relationships for learning the normal patterns of MTS. The original Transformer is not well suited to MTS modeling and performs poorly, while Anomaly-Trans achieves the best results prior to our models, showing the efficiency of the Transformer structure in modeling long and complex temporal dependencies.
Our proposed VG-Trans achieves the best detection performance among all methods on most datasets, demonstrating its effectiveness in modeling non-deterministic temporal and cross-channel dependencies and thus learning more expressive representations of the normal patterns of MTS. Finally, the performance of VG-Trans can be further improved by combining it with Anomaly-Trans into VG-Anomaly-Trans. To better demonstrate the effectiveness of VG-Trans in capturing channel-wise relationships, we visualize some channels of x_n in Fig. 4 and present the subset of the relational matrices corresponding to these channels in Fig. 4 (middle). As we can see, the relational matrices effectively reflect the similarity and correlation between channels, such as channels 21, 23, and 25 of x_n, which illustrates the capacity of VG-Trans to capture the cross-channel dependencies within MTS.

6.2. FORECASTING

We adopt two widely used metrics, Mean Absolute Error (MAE) and Mean Squared Error (MSE) (Zhou et al., 2021), to measure forecasting performance. Six popular methods are compared, including the recurrent structure models LSTnet (Lai et al., 2018) and LSTMa (Bahdanau et al., 2015). Our model performs best on almost all datasets, suggesting the effectiveness of VG-Trans in capturing the non-deterministic complex temporal dependencies and complex cross-channel relationships in MTS, which yields robust representations and thus promising predictions. Besides, we test the effect of the embedding size on forecasting performance with the Weather dataset, as shown in Fig. 6: we set d = 4, 8, 16, 24, 32 and report the MAE of VG-Trans for each d. We find that the choice of embedding size is affected by the forecasting horizon; larger embedding sizes perform better on longer horizons, where channel-wise relationships are more complex. In addition to the quantitative evaluation, we also present a case study of the predictions of Autoformer and VG-Autoformer in Fig. 5; VG-Autoformer achieves more accurate predictions than Autoformer.

6.3. ABLATION STUDY

We conduct an ablation study to analyze the importance of each component of our model, including the graph structure, the variational scheme, the embedding-guided generation, and the combined optimization objective. The results on both anomaly detection and forecasting tasks are listed in Table 3. Firstly, on both tasks, all structural components contribute to the performance of the framework. Specifically, introducing a powerful variational generative module into the Transformer or the graph Transformer yields a significant improvement, since it enables robust representations of MTS with complex distributions. Meanwhile, as shown in the last two rows, incorporating cross-channel dependency modeling into the generative process further improves the generative capacity of the models, resulting in better performance on both tasks. In addition, comparing the 1st and 5th rows, and the 3rd and 6th rows, demonstrates the benefit of the graph structure. The combined optimization objective is also beneficial for learning expressive representations of MTS, as shown in the 5th and 7th rows. These results verify that each module of our design is effective and necessary. To further show intuitively how the different components of VG-Trans contribute to robust representations of MTS, we focus on the anomaly detection task and visualize the anomaly scores of a case study on machine 2-3 of the SMD dataset. We compare the anomaly scores of Transformer, G-Trans, and VG-Trans in Fig. 7, where regions highlighted in red and blue represent the ground-truth anomaly segments and the segments misjudged by each method, and red lines denote the threshold selected according to the rule that all anomalies can be detected. As deterministic methods, Transformer and G-Trans produce turbulent anomaly scores since they ignore the stochasticity of MTS.
By considering the inter-relationships within MTS, G-Trans exhibits more distinct spikes in the anomalous regions, leading to fewer regions misjudged as anomalies. As a probabilistic method, VG-Trans produces smoother anomaly scores than the previous methods by accounting for the stochasticity within MTS, while still exhibiting pronounced spikes in the anomalous regions thanks to modeling both temporal and cross-channel dependencies. These properties mean that only one region is misjudged by VG-Trans, leading to a better F1-score. This case further demonstrates the capability of the graph structure and the powerful probabilistic generative process of VG-Trans in learning expressive representations of complex MTS, echoing the numerical results in Table 3.

7. CONCLUSION

In this paper, towards MTS modeling tasks, we propose a novel variational dynamic model named VG-Trans, which consists of a G-Trans module that captures both cross-channel and long-distance temporal dependencies by incorporating a graph into the Transformer, and an embedding-guided probabilistic generative module that accounts for the stochasticity within MTS and enhances the capacity to model MTS with complex distributions. For efficient optimization, we introduce an autoencoding variational inference scheme with a combined prediction and reconstruction loss. VG-Trans learns robust representations of MTS, which enables it to be combined with existing methods and applied to different MTS tasks. Experiments on both anomaly detection and forecasting tasks illustrate the effectiveness of our model in extracting expressive representations of MTS.

A APPENDIX

A.1 ALGORITHM

The autoencoding variational inference algorithm for VG-Trans is described in Algorithm 1. Training a dynamic VAE-structured model is a considerable optimization challenge in practice, due to the well-known posterior collapse and the unbounded KL divergence in the objective. We utilize two techniques to stabilize training.

Warm-up: The variational training criterion in Eq. 12 contains the likelihood terms and the variational regularization term. During the early training stage, the regularization term causes some of the latent units to become inactive before they learn useful representations. We address this by first training the parameters using only the reconstruction error, and then gradually adding the KL loss with a temperature coefficient:

L = Σ_{n=1}^{N} Σ_{t=1}^{T} E_{q(z_{t,n})} [ ln p(x_{t,n} | z_{t,n}) + γ ln p(x_{T,n} | h_{1:T−1,n}) − β ln ( q(z_{t,n} | x̃_{t,n}) / p(z_{t,n} | h_{t−1,n}) ) ]   (12)

where β is increased from 0 to 1 over the first N training epochs.

Gradient clipping: Optimizing the unbounded KL loss often causes sharp gradients during training. We address this by clipping gradients whose L2-norm exceeds a certain threshold, set to 20 in all experiments. This technique is easy to implement and allows the networks to train smoothly.
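The two stabilization techniques can be sketched as follows. The linear annealing schedule is one common choice (the text does not specify the schedule's exact shape), and the clipping follows the global L2-norm convention:

```python
import numpy as np

def kl_weight(epoch, warmup_epochs):
    """Linear KL annealing: beta goes from 0 to 1 over the first warmup_epochs."""
    return min(1.0, epoch / warmup_epochs)

def clip_gradients(grads, max_norm=20.0):
    """Global L2-norm gradient clipping, used here to tame the unbounded KL term.
    grads: list of gradient arrays; rescaled in proportion if the total norm
    exceeds max_norm, left untouched otherwise."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = max_norm / total if total > max_norm else 1.0
    return [g * scale for g in grads]
```

In a framework setting, `clip_gradients` would typically be replaced by the framework's built-in global-norm clipping utility applied just before the optimizer step.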

A.2 THE DERIVATION OF ELBO

P(D | α, W) = ∫ ∏_{n=1}^{N} ∏_{t=1}^{T} p(x_{t,n} | z_{t,n}, α) p(x_{T,n} | h_{1:T−1,n}, α) p(z_{t,n} | h_{t−1,n}) dz_{1:T,n}

ln P(D | α, W) = ln ∫ ∏_{n=1}^{N} ∏_{t=1}^{T} p(x_{t,n} | z_{t,n}, α) p(x_{T,n} | h_{1:T−1,n}, α) p(z_{t,n} | h_{t−1,n}) (q(z_{t,n} | x_{t,n}) / q(z_{t,n} | x_{t,n})) dz_{1:T,n}

= ln E_{q(z_{t,n})} [ ∏_{n=1}^{N} ∏_{t=1}^{T} p(x_{t,n} | z_{t,n}) p(x_{T,n} | h_{1:T−1,n}) p(z_{t,n} | h_{t−1,n}) / q(z_{t,n} | x_{t,n}) ]

≥ Σ_{n=1}^{N} Σ_{t=1}^{T} E_{q(z_{t,n})} [ ln p(x_{t,n} | z_{t,n}) + γ ln p(x_{T,n} | h_{1:T−1,n}) + ln ( p(z_{t,n} | h_{t−1,n}) / q(z_{t,n} | x_{t,n}) ) ]

= Σ_{n=1}^{N} Σ_{t=1}^{T} E_{q(z_{t,n})} [ ln p(x_{t,n} | z_{t,n}) + γ ln p(x_{T,n} | h_{1:T−1,n}) − ln ( q(z_{t,n} | x_{t,n}) / p(z_{t,n} | h_{t−1,n}) ) ]

where the inequality follows from Jensen's inequality.

A.3 THE NOTATION TABLE OF OUR PAPER

To better understand the proposed model, we summarize the notations used in this paper in Table 4 .

A.4 DATASETS

Anomaly detection: For the anomaly detection task, we conduct extensive experiments on four datasets: one real-world dataset, the CDN multivariate KPI dataset, and three public datasets, SMD, MSL, and SMAP, released by Su et al. (2019) and Hundman et al. (2018), respectively. The basic statistics of the datasets are reported in Table 5 and Table 6.

Hyper-parameters:

The hyper-parameters of data preprocessing are set as T = 20, o = 4 for anomaly detection and T = 20, o = 0 for forecasting. We employ the Adam optimizer with a learning rate of 0.0002, a batch size of 64, and 90 training epochs for all tasks. The balance factor γ is set to 0.1 for anomaly detection and 0.5 for forecasting. The forecasting length is set to 12, 24, 48, 96, and 168 in turn.

Model architecture: For the anomaly detection experiments, the dimensions of the channel embeddings α_i for the inputs are all set to 256, that is, d = 256. The dimension of the mapping matrix β is set to 256 × 10, that is, K′ = 10. We employ a deconvolutional neural network (DCNN) and a CNN in the generation and inference processes, respectively; both the CNN encoder and the DCNN decoder have 3 convolutional layers, whose filters and strides are set according to the number of timesteps of the observed variables. The latent variables and inputs are then fed into MSA blocks, respectively. Thirdly, the aggregated dynamic information h_n of the inputs x and the latent variables z is calculated by AGCN blocks. The architectures of MSA and AGCN are shown in Table 7, where V represents the input dimension, which varies across datasets as discussed in A.6. Finally, we employ a well-developed Transformer architecture (Li et al., 2019) to derive the dynamic latent states h. The specific architecture can be found in Table 8. The number of attention heads of VG-Trans is set to 8.

For the forecasting experiments, we select the hyper-parameters, including the embedding size, latent size, latent dimension, and so on, with the validation data. We find that it is important to keep the Seasonal Init and Trend-cyclical Init directly derived from the raw input x_n to maintain Autoformer's forecasting capability. On the other hand, considering the full ELBO and channel-wise relationships enhances the robustness of capturing long-term dependencies and thus achieves better prediction results, as shown in Table 2, in which u/v represents the author-reported result u and our reproduced result v.
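As a minimal sketch of the windowing implied by the preprocessing hyper-parameters above (assuming, as our reading, that T is the window length and o the overlap shared by consecutive windows, so that o = 0 yields non-overlapping windows as in the forecasting setting):

```python
import numpy as np

def make_windows(x, T=20, o=4):
    """Split an MTS of shape (L, V) into overlapping windows of length T.

    Consecutive windows share `o` timesteps (stride = T - o); o = 0 gives
    non-overlapping windows, matching the forecasting setting.
    """
    stride = T - o
    n = (len(x) - T) // stride + 1
    return np.stack([x[i * stride : i * stride + T] for i in range(n)])

x = np.random.randn(100, 36)                # 100 timesteps, 36 KPI channels
print(make_windows(x, T=20, o=4).shape)     # → (6, 20, 36)
```

The exact preprocessing in the paper may differ; this only illustrates how T and o jointly determine the number of training subsequences.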

A.8 MORE VISUALIZATION RESULTS

We illustrate more prediction cases of simple and complex MTS on the forecasting task in Fig. 9 and Fig. 10, respectively, with the improved regions highlighted by red dotted boxes. Our model gives the best performance on different datasets compared with Autoformer. Specifically, VG-Autoformer recovers more of the details missed by Autoformer, thanks to its design of capturing both temporal and channel dependencies at the same time.

A.9 BALANCE BETWEEN PREDICTION AND RECONSTRUCTION

As discussed in Sec. 3, we combine prediction and reconstruction losses in VG-Trans, which enables our model to learn expressive representations efficiently, and introduce a parameter γ to balance their effects.



Figure 1: The temporal dependency, channel relationship and stochasticity within MTS.

Figure 2: (a) The whole framework of VG-Trans for MTS modeling; (b) Graphical illustration of the inference (encoder) and generative (decoder) models (blocks in gray and green represent input and probabilistic latent variables, while the colourless blocks represent deterministic dynamic latent states); (c) The detailed structure of the adaptive graph Transformer module.

Figure 3: The framework of VG-Autoformer.

Figure 4: Visualization of example channels and corresponding relational matrices of x n on SMD machine-2-3 dataset.

Figure 5: Visualizations on the ETTh1 dataset of (a) the input MTS under the input-200 setting and the learned relational matrix among different channels; (b), (c), (d): predictions on different channels by Autoformer and VG-Autoformer.

Figure 6: The influence of embedding size on forecasting.

As shown in Fig. 5(a), similar to the anomaly detection task, the relational matrices in the forecasting task also effectively reflect the similarity and correlation between channels, such as channels 1, 2, and 3. With the aid of the relationships between different channels captured by VG-Autoformer, the prediction results of the related channels can guide and revise each other, thus ensuring the forecasting performance. As illustrated in the red boxes of Fig. 5(b), (c), and (d), VG-Autoformer yields more accurate predictions than Autoformer in these regions.

Figure 7: Case study of anomaly scores by Transformer (left), G-Trans (middle), and VG-Trans (right) on machine 2-3 of the SMD dataset. Regions highlighted in red and blue represent the ground-truth anomaly segments and the segments misjudged by the methods, respectively; the red lines refer to the threshold, selected according to the rule that all anomalies can be detected.
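The threshold rule in the caption above can be sketched as follows (our reading of "all anomalies can be detected": the largest threshold that still flags every labelled anomaly, i.e. the minimum score over ground-truth anomaly points; the paper's exact selection procedure may differ):

```python
import numpy as np

def best_threshold(scores, labels):
    """Largest threshold under which every labelled anomaly is flagged.

    scores: per-timestep anomaly scores; labels: 1 for ground-truth
    anomalies, 0 for normal points.
    """
    return scores[labels == 1].min()

scores = np.array([0.1, 0.9, 0.8, 0.2])
labels = np.array([0, 1, 1, 0])
print(best_threshold(scores, labels))   # → 0.8
```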

Figure 9: Prediction cases of simple MTS from the ETT, Weather and ECL datasets (from left to right). Top: results of Autoformer; Bottom: results of VG-Autoformer.

Figure 10: Prediction cases of complex MTS from the ETT dataset. Top: results of Autoformer; Bottom: results of VG-Autoformer.

F1-score performance of different methods.

Multivariate results with prediction lengths {96, 168, 288, 336} on the five datasets and {24, 48, 96, 168} on the Weather dataset; lower scores are better. Metrics are averaged over 5 runs, and the best results are highlighted in bold.

The table presents the overall prediction performance, reported as the MAE and MSE averaged over five independent runs, with the best results highlighted in boldface. As we can see, RNN-based methods underperform Transformer-based methods in most settings, especially on long-term forecasting, suggesting the effectiveness of the Transformer structure in modeling complex and long-distance temporal dependencies.
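The reported metrics are the standard MAE and MSE, averaged over independent runs; a minimal sketch of how they would be computed:

```python
import numpy as np

def mae_mse(y_pred, y_true):
    """Return (MAE, MSE) for one run's predictions."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.mean(np.abs(err))), float(np.mean(err ** 2))

def averaged_metrics(runs, y_true):
    """Average MAE/MSE over independent runs, as reported in the tables."""
    maes, mses = zip(*(mae_mse(r, y_true) for r in runs))
    return float(np.mean(maes)), float(np.mean(mses))
```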

Ablation study of VG-Trans on anomaly detection and forecasting tasks.

The CDN multivariate KPI dataset is collected from a large internet company in China and contains 12 websites, each monitored with 36 KPIs. These websites are different from each other in

Basic statistics of anomaly detection datasets

Basic statistics of the forecasting datasets. ETT* denotes the full benchmark of ETTh1, ETTh2, ETTm1, and ETTm2.

MSA and AGCN blocks for inputs and latent variables

Algorithm 1 Autoencoding Variational Inference of VG-Trans

Set the mini-batch size M, the number of convolutional filters K, and the hyper-parameters;
Initialize the parameters of the inference networks Ω, EPM Ψ, G-Trans θ, and the channel embeddings α(0:1);
repeat
    Randomly select a mini-batch of M MTS consisting of T subsequences to form a subset {x_1:T,i} for i = 1, ..., M;
    Draw random noise {ϵ_t,n} for t = 1, ..., T, n = 1, ..., M and {ϵ_i} for i = 1, ..., V from a uniform distribution for sampling the latent states {z_t,n} and channel embeddings {α_i};
    Calculate ∇L(Ω, Ψ; X, ϵ_t,n, ϵ_i, θ, α) according to Eq. (8), and jointly update the parameters of the inference module Ω, EPM Ψ, the G-Trans module θ, and the channel embeddings α;
until convergence
return the global parameters {Ω, Ψ, θ, α}.

All of these properties ensure more expressive and robust representations of MTS and enhance the generative capacity for complex MTS, thus achieving accurate detection and forecasting.
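The loop of Algorithm 1 can be sketched as a plain training skeleton (all model components are stand-in stubs: `grad_elbo` and the flat parameter dictionary are hypothetical placeholders for Eq. (8) and the modules Ω, Ψ, θ, α, which are neural networks in the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_elbo(params, batch, noise_z, noise_alpha):
    # Placeholder for ∇L(Ω, Ψ; X, ϵ, θ, α) from Eq. (8): here just a
    # dummy gradient pulling every parameter toward zero.
    return {k: -v for k, v in params.items()}

def train(X, M=64, T=20, V=36, lr=2e-4, steps=100):
    # Initialize inference nets Ω, EPM Ψ, G-Trans θ, channel embeddings α.
    params = {k: rng.standard_normal(8) for k in ("omega", "psi", "theta", "alpha")}
    for _ in range(steps):                     # "repeat ... until convergence"
        idx = rng.choice(len(X), size=M)       # mini-batch of M series
        batch = X[idx]
        noise_z = rng.uniform(size=(T, M))     # ϵ for latent states z
        noise_alpha = rng.uniform(size=V)      # ϵ for channel embeddings α
        g = grad_elbo(params, batch, noise_z, noise_alpha)
        # Joint gradient step on Ω, Ψ, θ, and α (Adam in the paper; plain
        # SGD here for brevity).
        params = {k: params[k] + lr * g[k] for k in params}
    return params
```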

A.7 VARIATIONAL GRAPH AUTOFORMER

We claim that the proposed Variational adaptive Graph Transformer (VG-Trans) is a flexible framework with good compatibility with other methods, such as Anomaly-Trans (please refer to the results of Table 1 in the main text) and Autoformer (Wu et al., 2021). In this section, we mainly discuss how to extend VG-Trans with the help of Autoformer; the pipeline is illustrated in Fig. 8.

Here, we evaluate the influence of γ on the anomaly detection task; the results are reported in Fig. 11. As we can see, an excessively small or large γ leads to weaker performance, illustrating the effectiveness of both the reconstruction and prediction losses. Besides, a relatively small weight on the prediction loss benefits detection, as it allows the model to explore the global distribution of the MTS.
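One plausible form of the γ-balanced objective discussed above is a reconstruction loss plus a γ-weighted prediction loss (a sketch under that assumption; the paper's exact weighting in Eq. (8) may differ, and MSE stands in for the actual likelihood terms):

```python
import numpy as np

def combined_loss(x_recon, x_true, x_pred, x_future, gamma=0.1):
    """Reconstruction loss plus gamma-weighted prediction loss (MSE both).

    A small gamma emphasises reconstruction, which the ablation suggests
    helps anomaly detection by modeling the global distribution of the MTS.
    """
    recon = np.mean((x_recon - x_true) ** 2)
    pred = np.mean((x_pred - x_future) ** 2)
    return recon + gamma * pred
```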

