MULTIVARIATE TIME-SERIES IMPUTATION WITH DISENTANGLED TEMPORAL REPRESENTATIONS

Abstract

Multivariate time series often face the problem of missing values. Many time series imputation methods have been developed in the literature; however, they all rely on an entangled representation to model the dynamics of time series, which may fail to fully exploit the multiple factors (e.g., periodic patterns) present in the data. Moreover, entangled representations usually have no semantic meaning and thus lack interpretability. In addition, many recent models must process the entire time series to identify temporal dynamics, so they do not scale to long time series. Different from existing approaches, we propose TIDER, a novel matrix factorization-based method with disentangled temporal representations that account for multiple factors, namely trend, seasonality, and local bias, to model complex dynamics. The learned disentanglement makes the imputation process more reliable and offers explainability for imputation results. Moreover, TIDER is scalable to long time series. Empirical results show that our method outperforms existing approaches on three typical real-world datasets, especially on long time series, reducing mean absolute error by up to 50%. It also scales well to long datasets on which existing deep learning based methods struggle. Disentanglement validation experiments further highlight the robustness and accuracy of our model.

1. INTRODUCTION

Multivariate time series analysis (e.g., forecasting (Zeng et al., 2021) and classification (Li et al., 2022)) has a wide spectrum of applications like traffic flow forecasting (Liu et al., 2020), electricity demand prediction (Kaur et al., 2021), motion detection (Laddha et al., 2021), health monitoring (Tonekaboni et al., 2021), etc. Most of these multivariate time series analysis approaches typically assume intact input for building models. However, real-world multivariate time series tend to have missing values caused by factors like device malfunction, communication failure, or costly measurement, which impairs the performance of these approaches or even renders them inapplicable. In light of this, many time series imputation methods have been proposed to infer missing values from the observed ones. A multivariate time series, denoted as X ∈ R^{N×T}, consists of N univariate time series (called channels) spanning T time steps. Hence, it offers two perspectives for imputation: modeling cross-channel correlations and exploiting temporal dynamics. Earlier methods (Batista et al., 2002; Acuna & Rodriguez, 2004; Box et al., 2015) either aggregate observed entries across channels by estimating similarity between distinct channels, or solely exploit local smoothness or linearity assumptions within the same channel to fill in missing values. Since these methods lack the ability to model nonlinear dynamics and complex correlations, they may not perform well in practice. In recent years, deep learning based methods (Liu et al., 2019; Tashiro et al., 2021; Cini et al., 2022) were proposed for time series imputation. These methods typically employ Recurrent Neural Networks (RNNs) as the backbone to jointly model nonlinear dynamics along the temporal dimension (by updating hidden states nonlinearly over time steps) and cross-channel correlations (by mapping x_t ∈ R^N from data space to hidden state space).
These RNN-based imputation models achieve better results by further combining with multi-task loss (Cao et al., 2018; Cini et al., 2022) or adversarial training (Liu et al., 2019; Miao et al., 2021). Despite their success, these methods rely solely on a single entangled representation (hidden state) to model dynamics. However, for many real-world multivariate time series, the dynamics are rich combinations of multiple independent factors like trend, seasonality (periodicity), and local idiosyncrasy (Woo et al., 2022). Modeling combinations of these factors with a single entangled representation may not give good performance, as the entangled representation has to compromise itself to explain multiple orthogonal patterns (like local changes or exogenous interventions vs. global patterns) together (Bengio et al., 2013). This issue is further exacerbated when seasonal patterns dominate, as RNNs lack the inductive bias to proactively capture periodicity (Hewamalage et al., 2021). In addition, the hidden states these models learn are entangled, complex combinations of various components, so it is difficult for them to provide interpretable information to explain imputation. Moreover, these methods require the entire time series to be fed into their models at each forward step to capture temporal dynamics, which is prohibitively costly for large T. Therefore, these methods are not applicable to long datasets. To address these limitations, in this paper we propose a novel multivariate time series imputation method, Time-series Imputation with Disentangled tEmporal Representations (TIDER), in which we explicitly model the complex dynamics of multivariate time series with disentangled representations to account for different factors. We employ a low-rank matrix decomposition framework and achieve the disentanglement by imposing different forms of constraints on the different representations that compose the low-rank matrix.
In particular, we introduce a neighboring-smoothness representation matrix to explain the trend, a Fourier series-based representation matrix to define a periodic inductive bias, and a local bias representation matrix to capture idiosyncrasies specific to individual time steps. These disentangled representations offer greater flexibility and robustness in modeling the factors contributing to the dynamics of time series. Another notable benefit of these semantically meaningful disentangled representations is that they offer TIDER an interpretable perspective on imputation. Moreover, TIDER is a scalable model: it can be applied to time-series datasets with large T and achieves much better imputation performance. In summary, our contributions are as follows.
• We propose TIDER, a new multivariate time series imputation model, featuring effective and explainable disentangled representations that account for the various factors characterizing the complex dynamics of time series. To the best of our knowledge, TIDER is the first model to learn disentangled representations for multivariate time series imputation.
• TIDER is the first imputation model that introduces a learnable Fourier series-based representation to capture periodic patterns inherent in time series data.
• Extensive experiments show that our proposed method outperforms the baseline imputation methods in terms of effectiveness and scalability. In particular, for imputation on long time series, TIDER achieves more than 50% improvement in MAE compared with the best imputation baseline. Furthermore, TIDER is scalable: it easily handles long multivariate time series on which existing deep-learning methods struggle. We also demonstrate the explainability of the disentangled representations with several case studies.

2. RELATED WORK

Early time series imputation methods based on simple statistical strategies focus on exploiting local smoothness along the temporal dimension as well as the similarity between different channels. For example, SimpleMean/SimpleMedian (Fung, 2006) imputes missing values by averaging, and KNN (Batista et al., 2002) aggregates cross-channel observations to fill in missing slots with k-nearest neighbors. Linear dynamics-based imputation methods, including linear imputation and state-space models (Durbin & Koopman, 2012), have also been employed. MICE (Van Buuren & Groothuis-Oudshoorn, 2011) explores the idea of estimating missing slots with chained equations. These methods typically lack the ability to exploit nonlinear dynamics and complex correlations across channels. More recently, deep learning methods have been proposed; for instance, one such method (Cini et al., 2022) fuses graph message passing into a GRU structure to learn spatial-temporal patterns. All these methods model dynamics based on a single entangled representation, which is insufficient to capture the multiple factors underlying time series, especially when seasonality emerges, since RNNs lack the inductive bias to proactively capture periodicity (Hewamalage et al., 2021). Moreover, these approaches are not scalable to long datasets since they must process the time series over its whole length T at each forward step to capture temporal dynamics. Thus, they do not perform well on, or are even inapplicable to, long time-series data. Our proposed TIDER is based on low-rank matrix factorization (MF) (Yu et al., 2016; Bjorck et al., 2021). Vanilla MF-based models impute missing entries by learning latent factors U, V to exploit low-rank structure. However, they overlook temporal continuity. To address this, TRMF (Yu et al., 2016) imposes autoregressive constraints on the temporal factor V. However, similar to the aforementioned methods, TRMF also relies on an entangled representation to account for all factors underlying the dynamics. In contrast, TIDER introduces multiple disentangled representations, and it achieves the disentanglement by enforcing different forms of constraints on different representations.

3. PROBLEM STATEMENT

Given N univariate time series x_1, x_2, ..., x_N ∈ R^T collected over T time steps, we represent them with a multivariate time series matrix X ∈ R^{N×T}, whose n-th row represents the n-th univariate time series (channel) x_n and whose t-th column denotes the observations of all time series at time step t. The multivariate time series matrix X is incomplete and a fraction of its entries are missing. We aim to infer the missing values from the observed ones, and we denote the mask matrix as M ∈ {0,1}^{N×T}, where

$$M_{ij} = \begin{cases} 1, & \text{if } X_{ij} \text{ is observed}, \\ 0, & \text{otherwise.} \end{cases}$$
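As a small illustration of this setup (a sketch, not from the paper's code, and assuming missing entries are encoded as NaN), the mask matrix can be derived directly from the data matrix:

```python
import numpy as np

def make_mask(X):
    """Mask matrix M in {0,1}^{N x T}: M_ij = 1 iff X_ij is observed."""
    return (~np.isnan(X)).astype(int)

# Toy 2-channel series with two missing observations.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 5.0, 6.0]])
M = make_mask(X)  # [[1, 0, 1], [0, 1, 1]]
```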

4. METHODOLOGY

4.1. METHOD OVERVIEW

The core idea of TIDER is to decompose the multivariate time series X into two latent factors U and V, such that U only preserves features unique to each channel whereas V is determined by multiple disentangled representations that jointly capture temporal dynamics. We employ such a factorization for two main reasons: 1) univariate time series channels (rows of X) are usually highly correlated; 2) observations at different time steps (columns of X) exhibit strong temporal dynamics. The benefits of our design are twofold: 1) cross-channel correlations are decoupled since U only preserves channel-specific features; 2) time-related information is isolated into V, which enables us to model the potentially complex temporal dynamics with multiple explainable disentangled representations. Figure 1 shows the architecture of TIDER. We adopt a low-rank matrix decomposition framework to factorize the multivariate time series matrix X into two latent factors U, V. U is a correlation-decoupled matrix which accounts for channel-specific patterns, while V is a matrix for time-related information. Since the temporal dynamics underlying real-world time series can be rich and complex combinations of multiple factors, e.g., trend and seasonality, modeling them merely through an entangled representation matrix will lead to model degradation. Woo et al. (2022) propose that, under mild assumptions, seasonality and trend can be treated as independent factors in the time series generating process, and Cohen (2013) suggests that independence can be used as a proxy criterion for disentanglement. Inspired by them, we propose to model V with multiple disentangled representations, each accounting for one particular factor. We achieve the disentanglement by enforcing distinct forms of constraints on different representations, which introduces distinct inductive biases into these representations and makes them more likely to capture specific, semantically independent patterns.
More specifically, we consider three important factors: trend, seasonality, and bias, which are specified by three representation matrices $V_t$, $V_s$, $V_b$, respectively. Figure 2 illustrates an example of decomposing a time series into these three factors. The trend representation matrix $V_t \in \mathbb{R}^{D \times T}$ captures the intrinsic trend, which changes gradually and smoothly, and the seasonality representation matrix $V_s \in \mathbb{R}^{D \times T}$ captures the periodic patterns hidden in the temporal dynamics. $V_t$ and $V_s$ jointly determine the dynamics driven by endogenous factors. The bias representation matrix $V_b \in \mathbb{R}^{D_b \times T}$ characterizes variations specific to each time step. Intuitively, $V_b$ explains the individual idiosyncratic behaviors of time steps, which are orthogonal to global dynamics but shared across channels at a given time step. Hence, we treat it differently from $V_t$, $V_s$ and interpret it as a residual term matrix, i.e., $X - U(V_t + V_s) \approx \mathbf{1}_{N \times D_b} V_b$. Mathematically, we formulate the objective of TIDER as

$$\min \; \|(X - U_a V) \odot M\|^2 + \lambda_t f_t(V_t) + \lambda_b f_b(V_b) + \eta_1 \|U\|^2 + \eta_2 \|V\|^2, \tag{1}$$

$$U_a = \begin{bmatrix} U & \mathbf{1} \end{bmatrix} \in \mathbb{R}^{N \times (D + D_b)}, \qquad V = \begin{bmatrix} V_t + V_s \\ V_b \end{bmatrix} \in \mathbb{R}^{(D + D_b) \times T}, \tag{2}$$

where $U_a$ is the augmented matrix of the latent factor $U \in \mathbb{R}^{N \times D}$, and $f_t$ and $f_b$ are the inductive-bias constraint functions imposed on $V_t$ and $V_b$. $M$ is the mask matrix introduced in Section 3. The terms $\eta_1 \|U\|^2$ and $\eta_2 \|V\|^2$ regularize the magnitude of the latent factors, and $\lambda_t, \lambda_b, \eta_1, \eta_2$ are the corresponding weights for each term. When training is completed, we use the learned $U$ to form $U_a$, and the learned $V_t$, $V_s$, and $V_b$ to form $V$. $U_a$ and $V$ are then used to generate the imputed time series $\hat{X}$ as

$$\hat{X}_{ij} = \begin{cases} X_{ij}, & M_{ij} = 1, \\ (U_a V)_{ij}, & M_{ij} = 0. \end{cases} \tag{3}$$

We are now ready to elaborate on the details of the three disentangled representation matrices.
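The factorization and the masked objective can be sketched in NumPy. This is an illustrative toy, with random factors instead of learned ones and the bias constraint f_b omitted for brevity; it is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D, Db = 4, 50, 3, 2

# Toy data matrix with a random observation mask M.
X = rng.normal(size=(N, T))
M = (rng.random((N, T)) > 0.2).astype(float)

# Latent factors (learned by gradient descent in TIDER; random here).
U  = rng.normal(size=(N, D))
Vt = rng.normal(size=(D, T))
Vs = rng.normal(size=(D, T))
Vb = rng.normal(size=(Db, T))

def assemble(U, Vt, Vs, Vb):
    """Build U_a = [U | 1] and V = [[Vt + Vs], [Vb]] as in the objective."""
    Ua = np.hstack([U, np.ones((U.shape[0], Vb.shape[0]))])
    V  = np.vstack([Vt + Vs, Vb])
    return Ua, V

def tider_objective(X, M, U, Vt, Vs, Vb, lt=0.1, e1=0.01, e2=0.01):
    Ua, V = assemble(U, Vt, Vs, Vb)
    data = np.sum(((X - Ua @ V) * M) ** 2)        # masked reconstruction
    ft   = np.sum((Vt[:, 1:] - Vt[:, :-1]) ** 2)  # trend smoothness (Sec 4.2)
    reg  = e1 * np.sum(U ** 2) + e2 * np.sum(V ** 2)
    return data + lt * ft + reg                   # bias term f_b omitted here

loss = tider_objective(X, M, U, Vt, Vs, Vb)
Ua, V = assemble(U, Vt, Vs, Vb)
X_hat = np.where(M == 1, X, Ua @ V)  # Eq. (3): keep observed, impute the rest
```

Note how the imputation rule leaves every observed entry untouched and only fills masked positions with the low-rank reconstruction.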

4.2. TREND REPRESENTATION MATRIX

The trend representation matrix $V_t$ characterizes the intrinsic trend of the time series. The evolution patterns dominated by $V_t$ are supposed to change gradually and smoothly in the absence of accidents or extreme events such as holidays (we consider these external interventions and exogenous influences in $V_b$). Based on this, we impose a smoothness constraint on $V_t$ as

$$f_t(V_t) = \sum_{j=2}^{T} \|v_t^j - v_t^{j-1}\|^2, \tag{4}$$

where $v_t^j$ is the $j$-th column of $V_t$. Equation 4 encourages close representations of two adjacent time steps in latent space, which results in a smooth change in data space. We only impose constraints on consecutive time steps to account for short-term patterns here, whereas long-term patterns are explained by $V_s$. This is in contrast with TRMF (Yu et al., 2016), which uses one temporal matrix to account for both short-term and long-term patterns by imposing a regression constraint.
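The smoothness penalty of Equation 4 is a simple sum of squared column differences; a minimal sketch (illustrative, not the released code):

```python
import numpy as np

def f_t(Vt):
    """Trend smoothness: sum over j=2..T of ||v_t^j - v_t^{j-1}||^2 (Eq. 4)."""
    return float(np.sum((Vt[:, 1:] - Vt[:, :-1]) ** 2))

flat = np.ones((3, 10))                     # constant trend: zero penalty
ramp = np.arange(5, dtype=float)[None, :]   # unit steps: penalty 1+1+1+1 = 4
```

A constant latent trend incurs no penalty, while abrupt column-to-column jumps are penalized quadratically.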

4.3. SEASONALITY REPRESENTATION MATRIX

Real-world time series often demonstrate seasonal patterns. For instance, traffic flow exhibits strong daily and weekly seasonality due to the regularity of human activities, such as commuting patterns. Solar power production presents clear periodic characteristics caused by climate seasonality and meteorological conditions. Motivated by this and the Fourier analysis in Section A.1.2, we propose to model the seasonality of time series by parameterizing the representation matrix $V_s$ with a Fourier basis. $V_s$ is a matrix of size $D \times T$; we represent each row as a superposition of $2K$ sinusoidal waves ($K \ll T$). More formally, let $A, B \in \mathbb{R}^{D \times K}$ be two learnable coefficient matrices, and $\phi_{\sin}, \phi_{\cos} \in \mathbb{R}^{T \times K}$ be the corresponding Fourier basis matrices, defined as

$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_K \end{bmatrix}, \qquad \phi_{\sin} = \begin{bmatrix} \sin(1\omega \mathbf{t}) & \sin(2\omega \mathbf{t}) & \cdots & \sin(K\omega \mathbf{t}) \end{bmatrix}, \tag{5}$$

$$B = \begin{bmatrix} b_1 & b_2 & \cdots & b_K \end{bmatrix}, \qquad \phi_{\cos} = \begin{bmatrix} \cos(1\omega \mathbf{t}) & \cos(2\omega \mathbf{t}) & \cdots & \cos(K\omega \mathbf{t}) \end{bmatrix}, \tag{6}$$

where $\mathbf{t} = [1, \ldots, T]^\top$. For a specific time series with period $P$, $\omega$ is calculated as $2\pi/P$. The seasonality representation matrix $V_s$ is then defined as

$$V_s = A \phi_{\sin}^\top + B \phi_{\cos}^\top. \tag{7}$$

In other words, $V_s$ is spanned by the Fourier bases $\phi_{\sin}$ and $\phi_{\cos}$. In particular, the $d$-th row of $V_s$, denoted by $(v_s^d)^\top$, has the form

$$(v_s^d)^\top = \sum_{k=1}^{K} A_{d,k} \sin(k\omega \mathbf{t})^\top + \sum_{k=1}^{K} B_{d,k} \cos(k\omega \mathbf{t})^\top,$$

which is a truncated Fourier series with coefficients $A_{d,k}, B_{d,k}$. This design of $V_s$ provides a meaningful periodic inductive bias, which enables our model to capture seasonal patterns more accurately and effectively by learning the coefficient matrices $A, B$ from data.
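Equation 7 can be sketched directly in NumPy; the coefficient matrices A, B would be learnable parameters in TIDER, but are fixed here for illustration:

```python
import numpy as np

def seasonality_matrix(A, B, T, P):
    """V_s = A phi_sin^T + B phi_cos^T (Eq. 7), K harmonics of base freq 2*pi/P."""
    D, K = A.shape
    t = np.arange(1, T + 1)                   # t = [1, ..., T]
    omega = 2.0 * np.pi / P
    k = np.arange(1, K + 1)
    phi_sin = np.sin(omega * np.outer(t, k))  # (T, K) sine basis
    phi_cos = np.cos(omega * np.outer(t, k))  # (T, K) cosine basis
    return A @ phi_sin.T + B @ phi_cos.T      # (D, T)

A = np.array([[1.0, 0.5]])   # D = 1 row, K = 2 harmonics
B = np.array([[0.0, 0.2]])
Vs = seasonality_matrix(A, B, T=24, P=12)
```

By construction every row of V_s is P-periodic, which is exactly the inductive bias the section describes: the second half of the 24-step toy series repeats the first.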

4.4. BIAS TEMPORAL REPRESENTATION MATRIX

The representation matrices $V_t$, $V_s$ presented so far jointly determine the dynamics driven by endogenous factors of the multivariate time series. However, there are also various external factors (e.g., holidays, weekdays/weekends) that can affect real-world time series. These external factors usually occur at specific time points and yield local variations within a short time period; thus, they are independent of the endogenous dynamics and cannot be captured by $V_t$ and $V_s$. Moreover, these factors impact all channels nearly equally. To account for these local variations, inspired by the idea of user and item biases explored in collaborative filtering (Lü et al., 2012), we propose to learn another bias representation matrix $V_b \in \mathbb{R}^{D_b \times T}$, where the representation of a specific time step is shared by all channels. In addition, we impose an autoregressive constraint on $V_b$ along the temporal dimension, since the impact caused by a local variation usually lasts for a short duration. Let $v_b^t$ be the $t$-th column of $V_b$ and $L$ be the maximum time lag indicating the duration; we define the constraint function as

$$f_b(V_b) = \sum_{t=L+1}^{T} \Big\| v_b^t - \sum_{l=1}^{L} W_l \, v_b^{t-l} \Big\|^2,$$

where $\mathcal{W} = \{W_l \in \mathbb{R}^{D_b \times D_b} \mid l = 1, \ldots, L\}$ are learnable autoregressive weight matrices.
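A minimal sketch of this autoregressive penalty (illustrative; in TIDER the lag matrices W_l are learned jointly with the factors):

```python
import numpy as np

def f_b(Vb, W):
    """AR constraint on V_b: sum_{t=L+1}^{T} ||v_b^t - sum_l W_l v_b^{t-l}||^2."""
    L = len(W)
    T = Vb.shape[1]
    total = 0.0
    for t in range(L, T):  # 0-indexed columns L..T-1 correspond to t = L+1..T
        pred = sum(W[l] @ Vb[:, t - 1 - l] for l in range(L))
        total += float(np.sum((Vb[:, t] - pred) ** 2))
    return total

# With L = 1 and W_1 = I, a constant V_b is perfectly predicted: zero penalty.
Vb = np.ones((2, 6))
penalty = f_b(Vb, [np.eye(2)])
```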

4.5. ADAPTIVE WEIGHT FOR TREND AND SEASONALITY

In Section 4.1, we use the additive form $V_t + V_s$ to characterize the influence of endogenous impacting factors. This implicitly assumes that the trend and seasonality components contribute equally to the endogenous dynamics. In practice, however, the relative importance of trend and seasonality can vary drastically across data sources, as illustrated in Figure 2. To address this, we adopt a learnable parameter $\alpha \in (0, 1)$ to adaptively adjust the weight of these two components, which leads to the weighted additive form $\alpha V_t + (1 - \alpha) V_s$. The temporal matrix $V$ in Equation 2 then becomes

$$V = \begin{bmatrix} \alpha V_t + (1 - \alpha) V_s \\ V_b \end{bmatrix}.$$
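Assembling the weighted temporal matrix is a one-liner; as a sketch (the sigmoid parameterization of alpha below is an assumption for keeping it in (0, 1), not something the paper specifies):

```python
import numpy as np

def assemble_V(Vt, Vs, Vb, alpha):
    """V = [[alpha*Vt + (1 - alpha)*Vs], [Vb]] (weighted form of Eq. 2)."""
    return np.vstack([alpha * Vt + (1.0 - alpha) * Vs, Vb])

# One way to keep the learnable alpha in (0, 1) during training would be
# to parameterize it as a sigmoid of an unconstrained scalar (assumption).
raw = 0.0
alpha = 1.0 / (1.0 + np.exp(-raw))  # sigmoid(0) = 0.5

Vt = np.zeros((2, 4))
Vs = np.ones((2, 4))
Vb = np.full((1, 4), 7.0)
V = assemble_V(Vt, Vs, Vb, alpha)   # top block: 0.5*0 + 0.5*1 = 0.5
```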

5. EXPERIMENTS

In this section, we evaluate the performance of TIDER by comparing it with existing multivariate time series imputation methods in terms of imputation accuracy and scalability. We also show the explainability of TIDER with several case studies. Hyperparameter sensitivity experiments are included as well to show that TIDER performs stably under different hyperparameter settings. The code of TIDER is available at https://github.com/liuwj2000/TIDER.

5.1. EXPERIMENTAL SETUP

Baseline Methods We compare our model with popular baselines used in the literature and recently proposed methods, including statistical models (SimpleMean, KNN, MICE), MF-based methods (MF, MF+L2, SoftImpute, TRMF), and deep learning approaches (BRITS, GAIN, NAOMI, SingleRes, SAITS, CSDI). Details and settings of these baselines can be found in Section A.3. In addition, we also include TIDER (no W), a variant of TIDER without the learnable parameter α, to verify the effectiveness of the adaptive weight introduced in Section 4.5.
Datasets We use three typical real-world datasets. Guangzhou is a small dataset on which all methods can fit. Solar-energy has a long time span (large T), while Westminster has a large number of channels (large N). These datasets represent three typical types of multivariate time series (small, large T, large N), which cover most multivariate time series encountered in practice. For more details, please refer to Section A.2.
Evaluation Metrics We adopt RMSE, MAE, and MAPE to evaluate the imputation accuracy of all compared methods. Details of these three metrics can be found in Section A.4.

Training Setup

We randomly remove subsets of entries from X as validation and test sets, respectively. Let r be the missing rate; the ratio of training/validation/test is then (0.9 - r)/0.1/r. For each model, we run 7 rounds of experiments on every dataset and report results averaged over these 7 runs. All experiments are conducted on a Linux workstation with a 32GB Tesla V100 GPU. For more detailed hyperparameter settings, please refer to Table 7.
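The stated (0.9 - r)/0.1/r split can be realized entrywise with one uniform draw per cell; a sketch of one plausible way to do it (the paper does not specify the exact mechanism):

```python
import numpy as np

def split_entries(shape, r, seed=0):
    """Randomly partition matrix entries into disjoint train/val/test masks
    with fractions (0.9 - r) / 0.1 / r, matching the stated protocol."""
    u = np.random.default_rng(seed).random(shape)
    train = (u < 0.9 - r).astype(int)
    val   = ((u >= 0.9 - r) & (u < 1.0 - r)).astype(int)
    test  = (u >= 1.0 - r).astype(int)
    return train, val, test

train, val, test = split_entries((200, 300), r=0.2)
```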

5.2. IMPUTATION ACCURACY COMPARISON

Table 1 and Table 2 show the imputation accuracy of all methods on the three datasets under different missing rates r; OOM indicates out of memory. The meaning of the asterisk and of Improvement is explained in Section A.3. Our proposed TIDER achieves the best performance in most cases in terms of the three metrics. The superiority of TIDER is much more significant on Solar-energy, the dataset with large T. In addition, we obtain similar improvements on other long time series; we report experiments on another two datasets in Section A.5. We also observe that the imputation accuracy of most methods drops as the missing rate r increases, which is expected since fewer observations are available. Among all baseline methods, CSDI has the best performance on most datasets, and deep-learning methods (BRITS, GAIN, SAITS, CSDI) usually perform better than the other baselines. However, TIDER outperforms them on every dataset, especially on the Solar-energy dataset, where the time span T is 52,560 and many deep learning methods run into OOM on a 32GB-memory GPU while TIDER remains applicable. We observe that MF-based models all work for long time series; the difference between these models lies in the constraints on U and V. Compared with the other MF-based baselines, TRMF performs best in most cases, since it intuitively captures autoregressive dynamics. However, since TRMF employs merely an entangled representation, it cannot model complex dynamics well. TIDER (no W) and TIDER utilize disentangled features, namely trend, seasonality, and local bias, to rebuild the patterns of the time series; they exploit explanatory temporal factors and thus achieve better imputation results. Furthermore, TIDER performs better than TIDER (no W). This is consistent with our previous analysis that the trend and seasonality components might not contribute equally to the global dynamics.

5.3. SCALABILITY ANALYSIS

We study the scalability of different methods in terms of memory usage and training time. In particular, we compare our method with the state-of-the-art models BRITS, NAOMI, SingleRes, SAITS, and CSDI. Figure 3 shows that the memory usage of NAOMI, SingleRes, and SAITS grows fast as T increases; again, TIDER needs the least amount of memory. There is also a gap between the curves of CSDI (BRITS) and TIDER in Figure 3, which is overwhelmed by the magnitude of NAOMI and SingleRes; we present these gaps in Section A.9. Lastly, Figure 3-(c) shows the running time of different methods for 100 epochs when processing a 100 × 100 matrix. TIDER runs much faster than RNN-based methods; notably, it outperforms BRITS by almost an order of magnitude, and this also holds for the total time taken by the entire training process. Furthermore, the space complexity of BRITS is O(N(N + T)) (Vaswani et al., 2017), while that of TIDER is O(N + T). The complexity of NAOMI has not been established and is tricky to analyze due to its complex divide-and-conquer strategy, but its complicated procedure yields more intermediate variables and thus results in more memory consumption, as empirically validated by our experiments. In conclusion, this experiment demonstrates the scalability of our model.

5.4. ABLATION ANALYSIS

To verify the effectiveness of our proposed disentangled representation matrices, we conduct an ablation study on Guangzhou dataset by removing one of the trend, seasonality, and bias representation matrices while leaving the rest of the model unchanged. The ablative results with missing rate r = 0.2 (0.4) are presented in Table 3 . We find that the model performance drops no matter which representation is removed, which validates that all of our proposed disentangled representations play an important role in imputation and jointly enhance the performance of the final model.

5.5. CASE STUDY OF DISENTANGLEMENT

This section demonstrates that the learned representations achieve disentanglement and offer explainability. We first visualize a time series from the Guangzhou dataset and its learned patterns, namely, the trend patterns u⊤V_t, seasonality patterns u⊤V_s, and bias patterns 1⊤V_b, where u⊤ is the time-series representation (a row of U). Figure 4 shows that the learned representations generate semantically meaningful patterns; for example, Figure 4-(b) shows a gradually changing curve that corresponds to the trend component. To further examine TIDER's interpretability on synthetic time series with known ground-truth components, please refer to Section A.6.

5.6. HYPERPARAMETER SENSITIVITY

We analyze the impact of four key hyperparameters (D, K, D_b, and P) on the performance of TIDER, and present the results in Figures 7, 8, 9, and 10, respectively. We observe that TIDER is relatively stable under different hyperparameter settings. Even for the hyperparameter P (period), TIDER exhibits consistent imputation performance, although interpretability may be compromised. Therefore, if accurate imputation is the primary objective, extensive hyperparameter tuning may not be necessary, as TIDER is robust to variations in hyperparameters. However, when both accuracy and interpretability are important, determining the appropriate time series period is critical. This can often be achieved through prior knowledge or period-detection models (Fan et al., 2022; Wang et al., 2022).

6. CONCLUSION

In this paper, we propose a scalable multivariate time series imputation method, TIDER, with multiple novel inductive biases under the framework of low-rank matrix factorization. In contrast to existing imputation approaches, TIDER adopts semantically meaningful disentangled representations to account for the multiple factors of a time series. In particular, it captures periodicity with a novel Fourier basis-based representation and identifies local time variation with a time bias representation. TIDER's superiority is verified by experimental results. Moreover, it scales well to long time series, on which existing methods struggle or cannot even fit. In the future, it would be interesting to investigate how to use the meaningful disentangled representations for forecasting tasks and how to design constraints on U when channel information is available. Furthermore, since we currently rely on hyperparameter tuning to decide P in V_s, obtaining P adaptively in a data-driven way is also worth further investigation.

7. REPRODUCIBILITY STATEMENT

We provide an open-source implementation of our proposed model, TIDER, at https://github.com/liuwj2000/TIDER. The hyperparameter settings are shown in Section A.8. Users can download the code and run TIDER easily.

8. ACKNOWLEDGEMENT

This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU). This research/project is also supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-TC-2021-001). Xiucheng Li is supported in part by the National Natural Science Foundation of China under Grant No. 62206074 and the Shenzhen College Stability Support Plan under Grant No. GXWD20220811173233001. The main part of his work was done while he was at Nanyang Technological University.

A.1.2 FOURIER ANALYSIS

Fourier's theorem (Stein & Shakarchi, 2011) states that any real-valued periodic function f(x) can be represented by an infinite series of sinusoidal functions, known as the Fourier series. These sinusoidal functions have their own coefficients and distinct frequencies. More formally, the Fourier series of a function f(x) with period P can be written as

$$f(x) = \sum_{n=0}^{\infty} a_n \cos(n\omega x) + \sum_{n=0}^{\infty} b_n \sin(n\omega x),$$

where ω = 2π/P, and a_n, b_n ∈ R are the corresponding coefficients. Since the sin(nωx) and cos(nωx) functions are orthogonal to each other and span the entire function space, they are also called the Fourier basis.

A.2 DATASET DETAILS

• Guangzhou Traffic Data. This dataset (Chen et al., 2018) contains the traffic speed of 214 anonymous urban road segments in Guangzhou, China, over 5 days with a 10-minute sampling rate. It results in a 214 × 500 multivariate time series matrix.
• Solar-Energy Production Data. This dataset consists of the solar power production records of 137 PV plants in Alabama, USA, sampled every 10 minutes. It results in a 137 × 52,560 data matrix.
• Westminster Uber Movement Data. This dataset contains the hourly averaged speed of road segments in Westminster in January 2020, released by Uber. The dimension of its data matrix is 7,489 × 744.
The dataset statistics are summarized in Table 4.

A.3 DETAILS OF BASELINE MODELS

The details of the baseline methods are briefly summarized as follows. For SimpleMean, KNN, and SoftImpute, we use the implementations provided by the package fancyimpute, whereas for BRITS, GAIN, SAITS, CSDI, NAOMI, and SingleRes, we use the source code released by their authors.
• SimpleMean (Acuna & Rodriguez, 2004). It imputes missing entries with the mean values of the corresponding columns.
• KNN (Batista et al., 2002). It first finds the k rows with the highest similarity scores to the target row, and then uses the weighted sum of these k rows for imputation.
• SoftImpute (Mazumder et al., 2010). It is a matrix completion approach based on iterative soft thresholding of Singular Value Decomposition (SVD).
• MICE (Azur et al., 2011). It stands for Multivariate Imputation by Chained Equations, a widely-used R package for imputation.
• MF (Takács et al., 2008). It applies low-rank matrix factorization without any constraint on the latent factors.
• MF+L2 (Takács et al., 2008). It applies low-rank matrix factorization with L2 regularization on U and V.
• TRMF (Yu et al., 2016). It applies low-rank matrix factorization with an autoregressive constraint imposed on the temporal matrix V.
• BRITS (Cao et al., 2018). It is a time series imputation method based on bidirectional recurrent neural networks and a time-decay mechanism.
• GAIN (Yoon et al., 2018). It employs Generative Adversarial Nets (GANs) for imputation.
• SAITS (Du et al., 2022). It imputes missing values based on the self-attention mechanism.
• CSDI (Tashiro et al., 2021). It utilizes score-based diffusion models to exploit correlations between observed values for imputation.
• NAOMI (Liu et al., 2019). It combines bidirectional recurrent neural networks with adversarial training to offer non-autoregressive imputation, adopting a divide-and-conquer strategy.
• SingleRes (Liu et al., 2019). It is the autoregressive counterpart of NAOMI.
In our experiments, we find that BRITS with the suggested batch size and RNN hidden size results in OOM (Out of Memory) on the Westminster dataset. Thus we reduce the batch size to 1 and the RNN dimension to 10 so that it just fits the 32 GB GPU on that dataset; we use an asterisk to indicate its results in Table 1 and Table 2. The Improvement is calculated as

$$\text{Improvement} = \frac{\text{best\_baseline} - \text{TIDER}^*}{\text{best\_baseline}} \times 100\%,$$

where best_baseline represents the best performance among all compared baseline models, and TIDER* stands for the better of TIDER and TIDER (no W).

A.4 DETAILS OF METRICS

Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are adopted to evaluate the imputation accuracy of all compared methods. These three metrics are defined as
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{(i,j)\in\Omega} (X_{ij} - \hat{X}_{ij})^{2}}{|\Omega|}}, \quad \mathrm{MAE} = \frac{\sum_{(i,j)\in\Omega} |X_{ij} - \hat{X}_{ij}|}{|\Omega|}, \quad \mathrm{MAPE} = \sum_{(i,j)\in\Omega} \frac{|X_{ij} - \hat{X}_{ij}|}{|\Omega| \cdot |X_{ij}|}, \tag{15}$$
where $X_{ij}$ denotes the ground-truth values, $\hat{X}_{ij}$ denotes the imputed values, and $\Omega$ is the index set of missing entries to be evaluated.
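For concreteness, the three metrics in Eq. (15) can be computed over the evaluation set Ω as follows (an illustrative NumPy sketch; the boolean mask marks the held-out entries):

```python
import numpy as np

def masked_metrics(X, X_hat, mask):
    """RMSE, MAE, MAPE of Eq. (15) over the index set Omega given by `mask`."""
    diff = X[mask] - X_hat[mask]
    n = mask.sum()
    rmse = np.sqrt((diff ** 2).sum() / n)
    mae = np.abs(diff).sum() / n
    mape = (np.abs(diff) / np.abs(X[mask])).sum() / n
    return rmse, mae, mape
```

Note that MAPE divides by |X_ij|, so it is undefined when a held-out ground-truth value is exactly zero.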

A.5 ADDITIONAL EXPERIMENTS FOR IMPUTATION ACCURACY COMPARISON ON LONG TIME SERIES

As Table 1 and Table 2 show, TIDER outperforms the baseline models by large margins on the long time-series dataset Solar-Energy. To further verify its stability in imputing long multivariate time-series data, we conduct additional experiments on another two long time series:
• HouseHold Power dataset. This dataset contains per-minute household electric consumption measurements gathered in a house located in Sceaux from December 2006 to November 2010. The size of this long time-series matrix is 7 × 2075259.
The imputation accuracy of all models on these two long time series is shown in Table 5 and Table 6. Similar to the observation made in Section 5.2, many deep-learning-based models that perform well on the short time-series dataset Guangzhou (e.g., CSDI, SAITS) run out of memory on these two datasets, while our proposed method can easily handle them and achieves significantly better performance. These results further support the scalability of our model, as well as its effectiveness in imputing long time-series datasets.

A.6 DISENTANGLEMENT VALIDATION ON SYNTHETIC DATASET

To further demonstrate TIDER's interpretability, we conduct experiments on synthetic time series where one pair of series shares the same trend but different seasonalities, while another pair shares the same seasonality but different trends. We visualize the ground-truth trend and seasonality patterns together with the learned ones; the results are depicted in Figure 5 and Figure 6. We can clearly see that the learned disentangled components are very similar to the ground truth, and that the disentangled patterns which are supposed to be close to each other indeed look similar (similar slopes in the trend components, together with similar amplitudes and periods in the seasonality components). This further supports the explainability of TIDER's disentanglement.
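A minimal way to construct such synthetic series is to add a linear trend to a single sinusoid; the recipe below is our illustrative version, and the exact construction used for Figures 5 and 6 may differ:

```python
import numpy as np

def trend_plus_season(T=200, slope=0.05, period=20, amp=1.0):
    """One synthetic channel = linear trend + one sinusoidal seasonality."""
    t = np.arange(T)
    trend = slope * t
    season = amp * np.sin(2 * np.pi * t / period)
    return trend, season, trend + season

# A pair with the same trend but different seasonalities:
tr_a, se_a, a = trend_plus_season(period=20)
tr_b, se_b, b = trend_plus_season(period=35)
```

Because the ground-truth components are known by construction, the learned trend and seasonality representations can be compared against them directly.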

A.7 HYPERPARAMETER SENSITIVITY

In this section, we study the performance change of TIDER under different hyperparameter settings. Figure 7 shows TIDER's hyperparameter sensitivity under different D. Figure 8 draws the performance of TIDER changing against K, the number of sinusoidal waves in V_s.

A.8 HYPERPARAMETER SETTING

We implement TIDER using Python 3.6 and PyTorch 1.9, and optimize the model parameters using Adam (Kingma & Ba, 2014) with a learning rate of 1e-3. We use grid search to select the optimal hyperparameters (D, K, D_d, P) on the validation datasets. However, as observed in Section 5.6 and Section A.8, the imputation performance of TIDER is relatively steady across different hyperparameter settings, so other hyperparameter sets might also offer desirable results. The general principles and guidelines on parameter setting are as follows:
• ω: This hyperparameter is involved in the seasonality representation matrix and is closely related to the time-series period P. At present, P is chosen by jointly using our prior knowledge of different time series and hyperparameter tuning: we first construct a candidate list of P based on our knowledge of these time series and select the best one according to performance on the validation datasets. We set P for the Guangzhou, Solar-Energy, and Westminster datasets as 168, 168, and 24, respectively. In other words, the optimal ω for these datasets are π/84, π/84, and π/12.
• D: The dimension of matrices U and V_t is also an important hyperparameter. A small D lacks the capacity to learn enough information, while a large D is prone to overfitting.
• λ_t, λ_b: These are the weights for the constraint functions of the trend representation matrix and the bias representation matrix, respectively. We tune them by choosing from the scale set {0.01, 0.1, 0.2, 0.5, 1.0}.
By following these principles, the optimal hyperparameters we obtained are listed in Table 7.
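The stated relation between ω and P is ω = 2π/P (so P = 168 yields ω = π/84). As a hypothetical sketch of how K sinusoidal waves at harmonics of this base frequency could populate a seasonality basis like V_s (the exact parameterization inside TIDER may differ):

```python
import math
import numpy as np

def seasonality_basis(T, P, K):
    """Build a (2K, T) matrix whose rows are sin/cos waves at harmonics
    k * omega of the base frequency omega = 2 * pi / P."""
    omega = 2 * math.pi / P
    t = np.arange(T)
    rows = []
    for k in range(1, K + 1):
        rows.append(np.sin(k * omega * t))
        rows.append(np.cos(k * omega * t))
    return np.stack(rows)
```

For the Guangzhou setting (P = 168, i.e., one week of hourly steps), the first row completes exactly one cycle every 168 time steps.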
A.9 SUPPLEMENTAL SCALABILITY ANALYSIS

In Section 5.3, there is a small gap between the memory usage of CSDI and TIDER in Figure 3-(a), and a small gap between the memory usage of BRITS and TIDER in Figure 3-(b). In Figure 3-(a) (resp. Figure 3-(b)), the memory usage gap between CSDI (resp. BRITS) and TIDER is overwhelmed by the large magnitudes of the other methods. Thus, we zoom in to give a closer comparison in this section. As depicted in Figure 11, there are still large memory usage gaps between TIDER and these two methods, which demonstrates the scalability superiority of our proposed TIDER.



https://github.com/iskandr/fancyimpute
https://github.com/amices/mice
https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data



Figure 1: Architecture of the proposed method TIDER. X is the multivariate time series matrix. U represents a correlation-decoupled matrix. V t , V s , and V b denote trend representation matrix, seasonality representation matrix and bias representation matrix, respectively.

Figure 2: Visualization of time series decomposition. (a) is the raw time series, whereas (b), (c), and (d) are its decomposed trend, seasonality, and residual components. The grey bars on the right of each subplot show the relative scales of the components. Each grey bar represents the same length.

Figure 3: Scalability test on Westminster dataset: (a) Memory usage of different methods varies over N from 0 to 5000 with fixed T = 100; (b) Memory usage of different methods varies over T from 100 to 700 with fixed N = 100; (c) The average running time taken by different methods for every 100 epochs when processing a 100 × 100 matrix.

Figure 3-(a) and Figure 3-(b) present the memory footprints of different methods against channel number N and time span T. It can be seen from Figure 3-(a) that the memory usage of NAOMI and SingleRes grows rapidly with very steep slopes, and that of BRITS also grows quickly, whereas TIDER needs much less memory. Similarly, Figure 3-(b) shows that TIDER requires much less memory than these methods as T grows.

Figure 4: Case study of disentanglement. (a) is the original time-series, whereas (b),(c) and (d) represent the generated intrinsic trend, seasonality, and local variation within a period of time.

Figure 5: Disentanglement validation on synthetic dataset. Time series are composed of the same trend but different seasonalities. (a) shows the raw time series. (b) and (d) show the trend and seasonality components TIDER has learned. (c) and (e) depict the ground-truth trend and seasonality components which compose the raw time series.

Figure 6: Disentanglement validation on synthetic dataset. Time series are composed of different trends but the same seasonality. (a) shows the raw time series. (b) and (d) show the trend and seasonality components TIDER has learned. (c) and (e) depict the ground-truth trend and seasonality components which compose the raw time series.

Figure 7: The performance of TIDER changes against D.

Figure 9 depicts the accuracy curves of TIDER varying over different D_d. Figure 10 is used to study the performance stability of TIDER under different periods in V_s. It can be seen that TIDER is rather stable under different hyperparameter settings. Therefore, if our target is merely accurate imputation, we need not spend much effort on hyperparameter tuning, since TIDER is stable across different settings. If we target both accuracy and interpretability, then the only hyperparameter we need to determine is the period of the time series, which can often be acquired from prior knowledge or from periodicity detection models (Wen et al., 2021).

Figure 9: The performance of TIDER changes against D d .

Figure 10: The performance of TIDER changes against P .

Figure 11: Supplementary scalability test on Westminster dataset. (a) Memory usage between TIDER and CSDI varies over N from 0 to 5000 with fixed T = 100; (b) Memory usage between TIDER and BRITS varies over T from 100 to 700 with fixed N = 100.


W is a group of learnable parameters. In our setting, D_b and L are small numbers, thus the group W only incurs very few extra parameters.

Table 1: Imputation accuracy of different methods with missing rate r = 0.2.

Table 2: Imputation accuracy of different methods with missing rate r = 0.4.

Table 3: Ablation analysis of TIDER on Guangzhou dataset.

Table 4: Summary of dataset statistics.

Table 5: Imputation accuracy of different methods with missing rate r = 0.2, for long time series.

Table 6: Imputation accuracy of different methods with missing rate r = 0.4, for long time series.

Table 7: Hyperparameters of TIDER for the three datasets.

A APPENDIX

A.1 PRELIMINARY

A.1.1 LOW-RANK MATRIX FACTORIZATION

Given a matrix X ∈ R^{N×T} with rank k ≪ min{N, T}, low-rank matrix factorization aims to factorize X as the product of two low-rank matrices U ∈ R^{N×k} and V ∈ R^{k×T}:
$$X = UV. \tag{11}$$
This factorization can be extended to the circumstance where the true rank of X is larger than k. In such a case, we use a metric ℓ to measure the discrepancy between X and the approximated product UV:
$$\ell(X, UV) = \|X - UV\|^{2}, \tag{12}$$
where ∥·∥ denotes the Frobenius norm of a matrix. This low-rank approximation is fairly reasonable in practice, since many real-world structural observation matrices are low rank; it has been widely adopted and extensively studied in recommender systems (Takács et al., 2008; Chen et al., 2020). In the context of multivariate time series data, we propose to adopt low-rank matrix factorization for two reasons: 1) univariate time series channels (the rows of X) are usually highly correlated; 2) observations at different time steps (the columns of X) exhibit strong dependencies.
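As a concrete illustration of Eq. (12) restricted to observed entries, the sketch below fits U and V by plain gradient descent on the masked Frobenius loss. This is a minimal illustration of the factorization principle, not the optimizer or model used by TIDER:

```python
import numpy as np

def factorize(X, mask, k, lr=0.01, steps=5000, seed=0):
    """Fit X ~ UV (U: N x k, V: k x T) by minimizing the Frobenius
    discrepancy of Eq. (12) over observed entries (mask = 1 if observed)."""
    rng = np.random.default_rng(seed)
    N, T = X.shape
    U = 0.1 * rng.standard_normal((N, k))
    V = 0.1 * rng.standard_normal((k, T))
    for _ in range(steps):
        R = mask * (U @ V - X)  # residual on observed entries only
        U, V = U - lr * R @ V.T, V - lr * U.T @ R  # simultaneous gradient step
    return U, V
```

After fitting, missing entries are imputed simply by reading off the corresponding entries of the product UV.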

