NON-PARAMETRIC STATE-SPACE MODELS: IDENTIFIABILITY, ESTIMATION AND FORECASTING

Abstract

State-space models (SSMs) provide a standard methodology for time series analysis and prediction. While recent works use nonlinear functions to parameterize the transition and emission processes to enhance their expressivity, the assumed additive noise still limits their applicability in real-world scenarios. In this work, we propose a general formulation of SSMs with a completely non-parametric transition model and a flexible emission model that can account for sensor distortion. In addition, to handle more general scenarios (e.g., non-stationary time series), we add a higher-level model to capture the time-varying characteristics of the process. Interestingly, we find that even though the proposed model is remarkably flexible, the latent processes are generally identifiable. Given this, we further propose the corresponding estimation procedure and use it for the forecasting task. Our model can recover the latent processes and their relations from observed sequential data; accordingly, the proposed procedure can also be viewed as a method for causal representation learning. We argue that forecasting can benefit from causal representation learning, since the estimated latent variables are generally identifiable. Empirical comparisons on various datasets validate that our model not only reliably identifies the latent processes from the observed data, but also consistently outperforms baselines in the forecasting task.

1. INTRODUCTION

Time series forecasting plays a crucial role in the automation and optimization of business processes (Petropoulos et al., 2022; Benidis et al., 2020; Lim & Zohren, 2021). State-space models (SSMs) (Durbin & Koopman, 2012) are among the most commonly used generative forecasting models, providing a unified methodology to model the dynamic behavior of time series. Formally, given observations x_t, they describe a dynamical system with latent processes z_t as

z_t = f(z_{t-1}) + ε_t,   (Transition)
x_t = g(z_t) + η_t,   (Emission)    (1)

where ε_t and η_t denote the i.i.d. Gaussian process and measurement noise terms, and f(·) and g(·) are the nonlinear transition model and the nonlinear emission model, respectively. The transition model captures the latent dynamics underlying the observed data, while the emission model learns the mapping from the latent processes to the observations. Recently, more expressive and scalable deep learning architectures have been leveraged to model nonlinear transition and emission models effectively (Fraccaro et al., 2017; Castrejon et al., 2019; Saxena et al., 2021; Tang & Matteson, 2021). However, these SSMs face two issues. First, they are not guaranteed to recover the underlying latent processes and their relations from observations, and the stringent assumption of additive noise in both the transition and emission models may not hold in practice. In particular, additive noise terms cannot capture nonlinear distortions in the observed or latent values of the variables, which frequently arise in real-world applications (Zhang & Hyvarinen, 2012; Yao et al., 2021) such as sensor distortion and motion capture. If we directly apply SSMs with this constrained additive-noise form, the model misspecification can lead to biased estimation. Second, identification of SSMs is very challenging when both the states and the transition model are unknown; most work so far has focused on developing efficient estimation methods.
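As a concrete reference point, the additive-noise SSM in Eq. (1) can be simulated in a few lines. This is a toy sketch: the tanh transition, cubic emission, dimensions, and noise scales are all hypothetical stand-ins for f, g, ε_t and η_t.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):                                # hypothetical nonlinear transition
    return np.tanh(z)

def g(z):                                # hypothetical nonlinear emission
    return z ** 3 + 0.5 * z

T, d = 100, 2
z = np.zeros((T, d))                     # latent process z_t
x = np.zeros((T, d))                     # observations x_t
for t in range(1, T):
    eps = rng.normal(0.0, 0.1, size=d)   # process noise eps_t
    eta = rng.normal(0.0, 0.1, size=d)   # measurement noise eta_t
    z[t] = f(z[t - 1]) + eps             # transition: z_t = f(z_{t-1}) + eps_t
    x[t] = g(z[t]) + eta                 # emission:   x_t = g(z_t) + eta_t
```

Note how the noise enters strictly additively in both equations; this is exactly the restriction NPSSM removes.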
We argue that this issue should not be ignored, and it becomes more severe when nonlinear transition and emission models are implemented with deep learning techniques: as the parameter space grows significantly, SSMs are prone to capturing spurious causal relations and strengths, and thus identifiability of SSMs is vital. Furthermore, the transition model is usually assumed to be constant across the measured time period. This stationarity assumption hardly holds in many real-life problems due to changes in the dynamics. For example, the unemployment rate tends to rise much faster at the start of a recession than it drops at the beginning of a recovery (Lubik & Matthes, 2015). In this setting, SSMs should adapt appropriately to the time-varying characteristics of the latent processes to be applicable in general non-stationary scenarios. In this work, in contrast to state-of-the-art approaches that follow the additive form of transition/emission models, we propose a general formulation of SSMs, called the Non-Parametric State-Space Model (NPSSM). In particular, we leverage the non-parametric functional causal model (Pearl, 2009) for the transition process and the post-nonlinear model (Zhang & Hyvarinen, 2012) to capture nonlinear distortion effects in the emission model. In addition, we add a higher-level model to NPSSM, called N-NPSSM, to capture the potential time-varying change of the latent processes in more general scenarios (e.g., non-stationary time series). Interestingly, although the proposed NPSSM is remarkably flexible, the latent processes are generally identifiable. Given this, we develop a novel estimation framework built upon a structural variational autoencoder (VAE) for the proposed NPSSMs. It allows us to recover the latent processes and their time-delayed causal relations from observed sequential data and, simultaneously, use them to build the latent prediction model (illustrated in Figure 1 (left)).
Accordingly, the proposed procedure can be viewed as a method for causal representation learning, or latent causal model learning, from time series data. We argue that forecasting tasks can benefit from causal representation learning, as the latent processes are generally identifiable in NPSSM. As shown in Figure 1 (right), first, it provides a compact structure for forecasting, whereas vanilla predictors (bottom), which directly learn a mapping function in the observation space, face the issue of complicated and spurious dependencies. Second, predictions that follow the correct causal factorization are expected to be more robust to distribution shifts affecting some of the modules in the system: if a local intervention acts on one mechanism, it will not affect the other modules, and those modules will still contribute correctly to the final prediction. Although formulating this problem and providing quantitative theoretical results seems challenging, our empirical studies illustrate it well. Third, it gives a compact way to model distribution changes. In realistic situations, the data distribution may change over time. Fortunately, given high-dimensional input, the changes often occur in a relatively small subspace of a causally-factorized system, which is known as the minimal change principle (Ghassami et al., 2018; Huang et al., 2020) or sparse mechanism shift (Schölkopf et al., 2021). We can thus capture the distribution changes with low-dimensional change factors in a causal system instead of in the high-dimensional input space.
In summary, our main contributions are as follows:
• We propose a general formulation of SSMs, namely NPSSM, together with an extension that allows non-stationarity of the latent process over time; this provides a flexible form for the transition and emission models that is expected to be widely applicable;
• We establish the identifiability of the time-lagged latent variables and their influencing strengths for NPSSM under relatively mild conditions;
• Based on our identifiability analysis, we propose a new structural VAE for model estimation and use it for forecasting tasks;
• Estimation of the proposed model can be seen as a way to learn the underlying temporal causal processes, which further facilitates forecasting of the time series;
• We evaluate the proposed method on a number of synthetic and real-world datasets. Experimental results demonstrate that the latent causal dynamics can be reliably identified from observed data under various settings, and further verify that identifying and using the latent temporal causal processes consistently improves prediction performance.

2.1. NPSSM: NON-PARAMETRIC STATE-SPACE MODEL AND IDENTIFIABILITY

To make the SSMs in Eq. (1) flexible, we adopt the functional causal model (Pearl, 2009) to characterize the transition process. Specifically, each latent factor z_it is represented with a general structural causal model z_it = f_i({z_{j,t-τ} | z_{j,t-τ} ∈ Pa(z_it)}, ε_it), where i, j denote variable indices, Pa(z_it) (parents) denotes the set of time-lagged variables that directly determine the latent factor z_it, and τ denotes the time-lag index. In this way, the noise ε_it together with the parents of z_it generates z_it via the unknown function f_i(·). Formally, NPSSM is formulated as

z_it = f_i({z_{j,t-τ} | z_{j,t-τ} ∈ Pa(z_it)}, ε_it),   (Structural causal latent transition)
x_t = g(z_t, η_t) = g_1(g_2(z_t) + η_t),   (Post-nonlinear emission)    (2)

where the ε_it are mutually (i.e., spatially and temporally) independent random noises sampled from the noise distribution p(ε_it), g_1(·) is the invertible post-nonlinear distortion function, g_2(·) is the nonlinear mixing function, and η_t are independent noises (detailed notation is given in Appendix A2.1). To the best of our knowledge, this is the most general form of SSMs. In the transition function, the effect z_it is just a smooth function (in the sense of condition 3 of Theorem 1, the core condition guaranteeing the identifiability of NPSSM) of its parents Pa(z_it) and the noise ε_it; it contains linear models, nonlinear models with additive noise, and even multiplicative-noise models as special cases. The independent-noise and conditional-independence conditions (Pearl, 2009) are widely satisfied in time series data. Furthermore, in the emission function, the post-nonlinear transformation g_1(·) can model the sensor or measurement distortion that usually arises when the underlying processes are measured with instruments (Zhang & Hyvarinen, 2012; Zhang & Hyvärinen, 2010). We now define the identifiability of NPSSM in the function space. Once the latent variables z_1, ..., z_T are identifiable up to component-wise transformations and permutation, the latent transitions (causal relationships) are also identifiable, because conditional independence relations fully characterize time-delayed causal relations in a time-delayed causally sufficient system. Therefore, NPSSM is identifiable if the latent variables are identifiable. Definition 1 (Identifiability of NPSSM). Consider a ground-truth model (f, g, p(ε)) and a learned model (f̂, ĝ, p̂(ε)) as defined in Eq.
(2). If the joint distributions of the observed variables p_{f,g,p(ε)}(x_t) and p_{f̂,ĝ,p̂(ε)}(x_t) match almost everywhere, then we say NPSSM is identifiable if observational equivalence always leads to identifiability of the latent variables up to a permutation π and a component-wise invertible transformation T:

p_{ĝ,f̂,p̂_ε}(x_t) = p_{g,f,p_ε}(x_t)  ⇒  g^{-1} = ĝ^{-1} ∘ T ∘ π,

where g^{-1} and ĝ^{-1} are the invertible functions that map x_t to z_t and ẑ_t, respectively. We now present the identifiability result for the proposed model. W.l.o.g., we assume the maximum time lag L = 1 in our analysis; it is straightforward to extend the analysis to longer lags L > 1. We can see that, somewhat surprisingly, although NPSSM is remarkably flexible, it is actually identifiable up to relatively minor indeterminacies: each latent process can be recovered up to a component-wise invertible transformation. In many real-world time series applications, these indeterminacies may be inconsequential.

Theorem 1. Suppose that we observe data sampled from a generative model defined according to Eq. (2) with parameters (f̂, ĝ, p̂(ε)). Assume the following holds:
1. The set {x_t ∈ X | φ_{η_t}(x_t) = 0} has measure zero, where φ_{η_t} is the characteristic function of the density p(η_t) = p_g(x_t | z_t). The post-nonlinear functions g_1, ĝ_1 are invertible, and the mixing functions g_2, ĝ_2 are injective and differentiable almost everywhere.
2. The process noise terms ε_it are mutually independent.
3. Let η_kt ≜ log p(z_kt | z_{t-1}); η_kt is twice differentiable in z_kt and differentiable in z_{l,t-1}, l = 1, 2, ..., n. For each value of z_t, the 2n vector functions v_{1,t}, v̊_{1,t}, v_{2,t}, v̊_{2,t}, ..., v_{n,t}, v̊_{n,t} in z_{1,t-1}, z_{2,t-1}, ..., z_{n,t-1} are linearly independent, with v_{k,t} and v̊_{k,t} defined as

v_{k,t} ≜ (∂²η_kt / ∂z_{k,t}∂z_{1,t-1}, ∂²η_kt / ∂z_{k,t}∂z_{2,t-1}, ..., ∂²η_kt / ∂z_{k,t}∂z_{n,t-1})^⊤,
v̊_{k,t} ≜ (∂³η_kt / ∂z²_{k,t}∂z_{1,t-1}, ∂³η_kt / ∂z²_{k,t}∂z_{2,t-1}, ..., ∂³η_kt / ∂z²_{k,t}∂z_{n,t-1})^⊤.

Then z_t must be an invertible, component-wise transformation of a permuted version of ẑ_t.

The proofs are provided in Appendix A2.2. Theorem 1 indicates that we can recover the underlying causal latent processes from the observed data. The differentiability and linear-independence requirements in condition 3 are the core conditions ensuring the identifiability of the latent factors z_t from the observed x_t. They require the time-lagged variables to have a sufficiently complex and diverse effect on the transition distributions in terms of the second- and third-order partial derivatives. From this condition, we can see that the linear-Gaussian SSM is unidentifiable: its second- and third-order partial derivatives would be constant, which violates the linear-independence assumption.

2.2. N-NPSSM: NON-STATIONARY NON-PARAMETRIC STATE SPACE MODEL

Considering that time series are non-stationary in many real situations, we now add a higher-level model to NPSSM that allows it to capture the time-varying characteristics of the process. We propose the Non-stationary Non-Parametric State-Space Model (N-NPSSM), which is formulated as

x_t = g_1(g_2(z_t) + η_t),   (Post-nonlinear emission)
z_it = f_i({z_{j,t-τ} | z_{j,t-τ} ∈ Pa(z_it)}, c_t, ε_it),   (Structural causal latent transition)    (4)
c_t = f_c({c_{t-τ}}_{τ=1}^{L_c}, ζ_t),   (Time-varying change factors)

where ζ_t, similar to ε_it, are mutually (i.e., spatially and temporally) independent random noises, and f_c(·) is the transition function of the time-varying change factors, which is also formulated as a general structural causal model. N-NPSSM includes the vanilla SSMs in Eq. (1) as a particular case in which the time-varying change factors do not exist. It also includes the time-varying-parameter vector autoregressive model (Lubik & Matthes, 2015) as a special case, which allows the coefficients or the noise variances of the linear autoregressive model to vary over time following a specified law of motion. In contrast to explicitly specifying how the time-varying change factors affect the transition process, our model is quite general: we use a low-dimensional vector c_t to characterize the time-varying information and feed it as an input to the transition model. Establishing the theoretical identifiability of this model is technically even more challenging, but our empirical results on various simulated datasets strongly suggest that it is indeed identifiable.
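The hierarchy in Eq. (4) can be sketched as follows: a slow, low-dimensional change-factor process c_t modulates the latent transition of z_t. The AR(1) change dynamics and the particular way c_t enters the transition are hypothetical choices for illustration; the model itself leaves f_c and f_i non-parametric.

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_c, T = 4, 2, 300
z = np.zeros((T, n))
c = np.zeros((T, n_c))
for t in range(1, T):
    zeta = 0.05 * rng.normal(size=n_c)
    c[t] = 0.95 * c[t - 1] + zeta          # change factors: c_t = f_c(c_{t-1}, zeta_t)
    eps = rng.normal(size=n)
    # c_t modulates both the effective causal strength and the noise scale
    z[t] = np.tanh(z[t - 1]) * (1.0 + c[t, 0]) + (0.1 + 0.05 * np.abs(c[t, 1])) * eps
```

When c_t is held at zero, the process reduces to a stationary NPSSM, matching the claim that Eq. (1)-style models are a special case.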

3. ESTIMATION FRAMEWORK

Given our identifiability results, we propose estimation procedures for NPSSM in Eq. (2) and N-NPSSM in Eq. (4). Since NPSSM is a special case of N-NPSSM, below we consider only the estimation framework of N-NPSSM; a properly constrained version applies to NPSSM. The model architecture is shown in Fig. 2 (a). Here x_t and x̂_t are the observed and reconstructed variables; similarly, z_t and ẑ_t denote the true and estimated latent variables. The overall framework is a structural variational autoencoder that learns the underlying latent temporal process via the latent causal model and then builds the auxiliary latent prediction model on the uncovered latent variables. The implementation details are in Appendix A4.

Latent Causal Model. To facilitate our implementation, we adopt the Variational Auto-Encoder (Hsu et al., 2017), which implicitly assumes that the measurement noise is additive; this is a particular case of the post-nonlinear mixing procedure given in Eq. (2). It is challenging to model the causal dependencies among observed and latent variables, especially in the design of the encoder/decoder. One alternative is to follow the dynamic VAE (Girin et al., 2020) and encode the latent causal relationships explicitly in the encoder. To make the estimation more efficient, inspired by (Klindt et al., 2020; Yao et al., 2022), we use the transition priors

p(ẑ_1, ..., ẑ_T) = p(ẑ_1) ··· p(ẑ_L) ∏_{t=L+1}^{T} p(ẑ_t | {ẑ_{t-τ}}_{τ=1}^{L}, ĉ_t)  and  p(ĉ_1, ..., ĉ_T) = p(ĉ_1) ∏_{t=2}^{T} p(ĉ_t | ĉ_{t-1})

to encode the latent causal relationships, and approximate the joint posteriors on z_{1:T} and c_{1:T} in factorized form. Specifically, the posterior (encoder) for z_{1:T} is defined as ∏_{t=1}^{T} q(ẑ_t | x_t), and similarly the posterior (encoder) for c_{1:T} is defined as ∏_{t=1}^{T} q_c(ĉ_t | {ẑ_{t-τ}}_{τ=0}^{L}).
One way to model the transition prior p(ẑ_t | {ẑ_{t-τ}}_{τ=1}^{L}, ĉ_t) is to leverage a forward prediction function ẑ_it = f̂_i({ẑ_{j,t-τ}}_{τ=1}^{L}, ĉ_t, ε̂_it). However, we argue that forward prediction with a fixed loss cannot model latent processes of non-parametric form. For example, the latent process z_{k,t} = q_k({z_{t-τ}}_{τ=1}^{L}) + ε_{k,t} / b_k({z_{t-τ}}_{τ=1}^{L}) cannot be estimated by a forward prediction function with squared loss, because of the coupling between the noise variable and the cause variables. We therefore propose to obtain the transition prior by explicitly modeling the noise function, which can be treated as the inverse latent transition function, i.e., ε̂_it = r_i(ẑ_it, ĉ_t, {ẑ_{t-τ}}_{τ=1}^{L}). These functions are implemented by a set of separate MLP networks {r_i} (to satisfy the independent-noise condition in Theorem 1), which take the estimated latent causal variables and time-varying change factors as input and output the noise terms. By applying the change-of-variables formula to this transformation, the transition probability can be formulated as

p(ẑ_it | {ẑ_{t-τ}}_{τ=1}^{L}, ĉ_t) = p_{ε_it}(r_i(ẑ_it, ĉ_t, {ẑ_{t-τ}}_{τ=1}^{L})) |∂r_i / ∂ẑ_it|.

Because of the mutually-independent-noise assumption, the Jacobian is lower-triangular, and we can efficiently calculate its determinant as the product of its diagonal elements. Applying the independent-noise assumption, the transition probability becomes

log p(ẑ_t | {ẑ_{t-τ}}_{τ=1}^{L}, ĉ_t) = Σ_{i=1}^{n} log p(ε̂_it) + Σ_{i=1}^{n} log |∂r_i / ∂ẑ_it|.

Given this, the transition probability p(ẑ_t | {ẑ_{t-τ}}_{τ=1}^{L}, ĉ_t) can be efficiently evaluated using the factorized noise distribution Σ_{i=1}^{n} log p(ε̂_it). To fit the estimated noise terms, we model each noise distribution p(ε̂_it) as a transformation of the standard normal N(0, 1) through a function s(·), which can be formulated as p(ε̂_it) = p_{N(0,1)}(s^{-1}(ε̂_it)) |∂s^{-1}(ε̂_it) / ∂ε̂_it|.
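The change-of-variables computation above can be checked on a closed-form toy case. Here r_i is a hand-written inverse transition (a state-dependent location q and scale b, both hypothetical stand-ins for the paper's per-component MLPs), so the Jacobian diagonal ∂r_i/∂ẑ_it is available analytically and the resulting log-probability matches the corresponding Gaussian density exactly.

```python
import numpy as np

def log_normal(e):
    # log density of the standard normal N(0, 1)
    return -0.5 * (e ** 2 + np.log(2 * np.pi))

def transition_log_prob(z_t, z_prev, a, b):
    # toy inverse transition r_i: eps_i = b_i(z_prev) * (z_it - q_i(z_prev)),
    # so the Jacobian diagonal is dr_i/dz_it = b_i(z_prev)
    q = a * np.tanh(z_prev)            # location as a function of the lags
    scale = 1.0 + b * z_prev ** 2      # positive, state-dependent scale
    eps = scale * (z_t - q)            # recovered noise terms
    # log p(z_t | z_prev) = sum_i log p(eps_i) + sum_i log |dr_i/dz_it|
    return np.sum(log_normal(eps)) + np.sum(np.log(np.abs(scale)))

z_prev = np.array([0.3, -1.2])
z_t = np.array([0.1, 0.4])
lp = transition_log_prob(z_t, z_prev, a=0.8, b=0.5)
```

In the actual model the r_i are learned MLPs and the derivative is obtained by automatic differentiation; the formula being evaluated is the same.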
Fortunately, we do not need to explicitly estimate the term ∂s^{-1}(ε̂_it)/∂ε̂_it, since the inverse causal transition functions {r_i} can absorb it. Similarly, we define the transition probability of the change factors c_t as log p_c(ĉ_t | ĉ_{t-1}) = Σ_{i=1}^{n} log p(ζ̂_it) + Σ_{i=1}^{n} log |∂u_i / ∂ĉ_it|, where u_i denotes the inverse change transition function.

Auxiliary Latent Prediction Model. While the latent causal model above can estimate the latent variables ẑ_t in non-parametric form, it cannot explicitly model the forward prediction relationship required for the forecasting task. We therefore propose to train auxiliary latent prediction models. With a penalization hyperparameter, this module can be viewed as a regularizer that enforces the temporal predictability of the learned latent processes for time series forecasting. Formally, the auxiliary latent prediction model is defined as p_pred(ẑ_t | {ẑ_{t-τ}}_{τ=1}^{L}, ĉ_t, ε̂_t), which takes the recovered latent variables {ẑ_t}_{t=1}^{T}, the change factor ĉ_t, and the noise ε̂_t as input. Note that ĉ_t is not available at time t-1 in prediction mode. One straightforward solution is to build an extra prediction model for the change factor ĉ_t. Interestingly, we can skip this step: the change factor c_t is itself inferred from the latent variables {ẑ_{t-τ}}_{τ=0}^{L}, as in the posterior (encoder) q_c(ĉ_t | {ẑ_{t-τ}}_{τ=0}^{L}). Therefore, we can directly learn the auxiliary latent predictor via p_pred(ẑ_t | {ẑ_{t-τ}}_{τ=1}^{L}, ε̂_t). Specifically, we implement this predictor with an LSTM network. The noise ε̂_t is generated by the inverse latent transition functions r_i(ẑ_it, ĉ_t, {ẑ_{t-τ}}_{τ=1}^{L}) in the training phase, while it is sampled from the standard normal distribution N(0, 1) in the forecasting phase. In this way, the prediction procedure decouples the forecasting task into three steps: (1) The encoder recovers the latent causal representation from the observed data; (2)
Next-step prediction is generated via the latent prediction model in the latent space; (3) the prediction results are transformed back to the observation space by the decoder.

Optimization. Taking the above two components into account, we jointly train the latent causal model and the latent prediction model with the following objective L:

L = (1/T) Σ_{t=1}^{T} log p_z(x_t | z_t) − β D_KL(q_z(ẑ_{1:T} | x_{1:T}) || p(ẑ_{1:T})) − γ D_KL(q_c(ĉ_{1:T} | ẑ_{1:T}) || p(ĉ_{1:T}))   [latent causal model]
  + (σ/T) Σ_{t=1}^{T} log p_pred(ẑ_t | ε_t, {ẑ_{t-τ}}_{τ=1}^{L}),   [auxiliary latent predictor]    (7)

where p_z(x_t | z_t) and p_pred(ẑ_t | ε_t, {ẑ_{t-τ}}_{τ=1}^{L}) denote the decoder distribution and the prediction distribution, for both of which we use the MSE loss as the likelihood.
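The three-step forecasting loop (encode, predict in latent space, decode) can be sketched as follows. The encoder, decoder, and predictor here are hypothetical closed-form stand-ins for the learned networks, chosen only so the sketch is self-contained; at test time the noise is drawn from a standard normal, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for the learned networks
encoder = lambda x: np.arctanh(np.clip(x, -0.99, 0.99))   # plays the role of g^{-1}
decoder = lambda z: np.tanh(z)                            # plays the role of g
predictor = lambda z_lags, eps: 0.9 * z_lags[-1] + eps    # plays the role of p_pred

def forecast_one_step(x_hist, L=2):
    z_hist = np.stack([encoder(x) for x in x_hist[-L:]])  # (1) encode observations to latents
    eps = 0.1 * rng.normal(size=z_hist.shape[1])          # noise sampled at forecast time
    z_next = predictor(z_hist, eps)                       # (2) one-step prediction in latent space
    return decoder(z_next)                                # (3) decode back to observation space

x_hist = [np.array([0.10, 0.20]), np.array([0.15, 0.10])]
x_next = forecast_one_step(x_hist)
```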

4. RELATED WORK

Identifiability of State-Space Models. It is well known that the linear state-space model with additive Gaussian noise is unidentifiable (Arun & Kung, 1990) and thus cannot recover the latent process. Under specific structural constraints on the transition matrix, Xu (2002) finds it identifiable. Zhang & Hyvärinen (2011) further consider the linear non-Gaussian setting and prove that when the emission matrix has full column rank and the transition matrix has full rank, the model is fully identifiable. In the non-stationary setting, Huang et al. (2019) prove that the time-varying linear causal model is identifiable if the additive noise is a stationary zero-mean white-noise process. For the vector autoregressive model with a latent process, Jalali & Sanghavi (2011) show that the transition matrix can be identified if the interactions between observed variables are sparse and the interactions between latent and observed variables are sufficient. Geiger et al. (2015) find that if additional genericity assumptions hold and the exogenous noises are independent and non-Gaussian, the transition matrix is uniquely identifiable. In contrast, our work considers a remarkably flexible state-space model, which does not require constraints such as a linear transition or additive noise; even so, we find that the latent process is generally identifiable.

Deep State-Space Models. To leverage advances in deep learning, several works (Chung et al., 2015; Fraccaro et al., 2016; Karl et al., 2016; Krishnan et al., 2017) draw connections between state-space models and RNNs and propose the dynamic VAE framework to model temporal data. Chung et al. (2015), for instance, associate the latent variables of the state-space model with the deterministic hidden states of an RNN, so that the transition and observation models are nonlinearly determined by the RNN.
These works propose different deep learning architectures to parameterize the transition and emission models to enhance expressiveness. The models vary in how they define the generative and inference models and in how they combine the latent dynamic variables with an RNN to capture temporal dependencies (Girin et al., 2020). Meanwhile, their training paradigm is similar to the VAE methodology: inference networks define a variational approximation to the intractable posterior distribution of the latent variables, and this approximate inference may lead to sub-optimal performance. To address this, (Fraccaro et al., 2017; Rangapuram et al., 2018; Becker et al., 2019) take advantage of Kalman filters/smoothers to compute the exact posterior distribution. Fraccaro et al. (2017), for example, use a standard Gaussian linear dynamical system to model the latent temporal process; the hidden states of an RNN predict the parameters of this dynamical system to enable closed-form Bayesian inference. However, these methods require expensive matrix-inversion operations, and the linear transition model limits their expressiveness. An alternative (Zheng et al., 2017) is to use variational sequential Monte Carlo to draw samples from the posterior directly. Recently, Klushyn et al. (2021) proposed a constrained optimization framework that obtains accurate predictions of the dynamical system by combining amortized variational inference with classic Bayesian filtering/smoothing. These works present different methods to infer the latent variables more accurately. In addition, some works leverage neural SDEs to model the transition process (Yildiz et al., 2019). While all these works enhance the expressivity of the transition model with deep architectures, they are still constrained by the additive-noise form and can be treated as special cases of our work.

5. EXPERIMENTS

To show the efficacy of N-NPSSM for identifying latent processes and forecasting, we apply it to various synthetic and real-world datasets with one-step-ahead forecasting tasks.

Evaluation Metrics. To evaluate the identifiability of the learned latent variables, we report the Mean Correlation Coefficient (MCC), a standard metric in the ICA literature for continuous variables. We use Spearman's rank correlation coefficients to measure the discrepancy between the ground-truth and estimated latent factors after component-wise transformation and permutation are adjusted (details are given in Appendix A3.2). MCC reaches 1 when the latent variables are identifiable up to component-wise invertible transformation and permutation. To evaluate forecasting performance, we report the Mean Absolute Error (MAE) and the ρ-risk, which quantifies the accuracy of a quantile ρ of the predictive distribution. Formally, they are defined as

MAE = Σ_{i,t} |x_it − x̂_it|,
R_ρ-loss = Σ_{i,t} (x̂^ρ_it − x_it)(ρ I[x̂^ρ_it > x_it] − (1 − ρ) I[x̂^ρ_it ≤ x_it]),

where x̂^ρ_it is the empirical ρ-quantile of the prediction distribution and I is the indicator function. For the probabilistic forecasting models, the forecast distribution is estimated from 50 sampling trials, and x̂_it is the predicted median value.

Baselines. We compare N-NPSSM with typical deep forecasting models and deep state-space models: (1) LSTM (Hochreiter & Schmidhuber, 1997), a deterministic deep forecasting baseline; (2) DeepAR (Salinas et al., 2020), an encoder-based probabilistic deep forecasting model; (3) VRNN (Chung et al., 2015) and (4) KVAE (Fraccaro et al., 2017), which are deep state-space models. Note that KVAE implicitly considers time-varying change factors by formulating the transition matrix as a weighted average of a set of base matrices and using an RNN to predict the combination weights at each step.
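The two forecasting metrics above are straightforward to compute; a minimal sketch, following the (un-normalized) formulas as written in the text:

```python
import numpy as np

def mae(x, x_hat):
    # MAE = sum_{i,t} |x_it - x_hat_it|
    return np.sum(np.abs(x - x_hat))

def rho_risk(x, x_q, rho=0.9):
    # R_rho-loss = sum_{i,t} (x_q - x) * (rho * I[x_q > x] - (1 - rho) * I[x_q <= x]),
    # where x_q is the empirical rho-quantile of the predictive distribution
    over = x_q > x
    return np.sum((x_q - x) * (rho * over - (1 - rho) * ~over))
```

Both terms of the ρ-risk are non-negative: over-prediction is penalized with weight ρ and under-prediction with weight 1 − ρ, so a ρ = 0.9 forecast is pushed toward the 90th percentile.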

5.1. SYNTHETIC EXPERIMENTS

We generate synthetic datasets that satisfy the identifiability conditions in the theorems. In particular, we consider four representative simulation settings to validate identifiability and forecasting performance under fixed causal dynamics (Synthetic1), fixed causal dynamics with distribution shift (Synthetic2), time-varying causal dynamics with inter-dependent change factors (Synthetic3), and time-varying causal dynamics with changing causal strengths (Synthetic4) (more details of the data generation are given in Appendix A3.1.1). For all the synthetic datasets, we set the latent size to n = 8 and the maximum latent process lag to L = 2. For the time-varying settings, the dimension of the change variables is set to 4. The emission function g(·) is a random three-layer MLP with LeakyReLU units. As shown in Table 1, N-NPSSM successfully recovers the latent processes under the different settings, as indicated by the highest MCC (close to 1). In contrast, the baseline models, including the deep forecasting model and the deep state-space models, cannot recover the latent processes. Moreover, our method gives the best forecasting accuracy, as indicated by the lowest MAE and R_0.9-loss. In Figure 4, each left sub-figure shows the MCC correlation matrix of each factor, while each right sub-figure shows the scatter plot of the recovered factors against the true factors. We find that the time-delayed causal relationships are successfully recovered, as indicated by the high MCC for the causally related factors, and that the latent causal variables are estimated up to permutation and component-wise invertible transformation (more empirical results are given in Appendix A3.3). To investigate the consequences of violating the critical assumptions, we create another two datasets: (1) with dependent process noise terms, and (2) with additive Gaussian noise terms, where (1) violates the mutually-independent-noise condition and (2) violates the linear-independence condition.
From Figure 3, we find that violating the independent-noise condition deteriorates the identifiability results significantly. Additionally, when the latent processes follow a linear, additive-Gaussian temporal model (so that the linear-independence condition is violated), the identifiability results are also distorted. However, if the noise terms are made slightly non-Gaussian (we change the shape parameter β of the generalized Gaussian noise distribution from β = 2.0 to β = 1.5 or β = 2.5), the final MCC scores increase significantly and the underlying latent processes become identifiable in both non-Gaussian noise scenarios.
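For reference, generalized Gaussian noise with shape parameter β (β = 2 is Gaussian, β = 1 is Laplace) can be sampled with the standard gamma-based method; this is a sketch of how such an ablation's noise could be generated, with the scale alpha and the seed as arbitrary choices.

```python
import numpy as np

def sample_gen_gaussian(beta, size, alpha=1.0, rng=None):
    # generalized Gaussian p(x) proportional to exp(-|x/alpha|^beta);
    # standard method: X = alpha * sign * G^(1/beta), G ~ Gamma(1/beta, 1)
    rng = rng or np.random.default_rng()
    g = rng.gamma(shape=1.0 / beta, scale=1.0, size=size)
    s = rng.choice([-1.0, 1.0], size=size)
    return alpha * s * g ** (1.0 / beta)

eps = sample_gen_gaussian(beta=1.5, size=10_000, rng=np.random.default_rng(0))
```

Sweeping β away from 2.0 in either direction, as in the ablation, changes only the shape parameter of this sampler.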

5.2. REAL DATA EXPERIMENTS

We evaluate N-NPSSM on three real-world datasets: Economics, Bitcoin and FRED. Economics and FRED contain sets of macroeconomic indicators, while Bitcoin includes potential influencers of the bitcoin price (detailed data descriptions and preprocessing are given in Appendix A3.1.2). As shown in Table 2, N-NPSSM outperforms all competitors in terms of both MAE and R_0.9-loss, which verifies its effectiveness (more qualitative experiments are given in Appendix A3.3).

6. CONCLUSION AND FUTURE WORK

In this work, we propose a general formulation of state-space models called NPSSM, which includes a completely non-parametric transition model and a flexible emission model. We prove that, despite this flexibility, it is generally identifiable. Moreover, we propose N-NPSSM to capture the possible time-varying change of the latent processes. We further develop a VAE-based estimation procedure and use it for forecasting tasks. Empirical studies on both synthetic and real-world datasets validate that our model can not only identify the latent processes but also outperform baselines in the forecasting task. While we do not establish theory for the time-varying change factors, our experiments demonstrate the possibility of generalizing our identifiability results to this setting. Extending our theory to a completely non-parametric emission model is one line of future work. Another interesting direction is to apply this framework to other time series analysis tasks, such as anomaly detection and change-point detection.

REPRODUCIBILITY STATEMENT

Our code for NPSSM is attached as supplementary material. The implementation details can be found in A4. For theoretical results, the assumptions and complete proof of the claims are in A2.2. For synthetic experiments, the data generation process is described in A3.1.1. 

A1 EXTENDED RELATED WORK

Time-Varying State-Space Models In many real situations, the temporal process may vary over time. This inspired early efforts to allow the parameters of vector autoregressive models to change over time (Sodsri, 2003; Luo, 2005), which consider the effect of time variation in the coefficients and the noise variances. These works can be treated as special cases of state-space models that directly learn the transition in observation space. Time-varying linear state-space models (Luttinen et al., 2014; Holmes et al., 2012) go one step further, as they are more powerful and general than vector autoregressive models. A related research topic is switching-regime state-space models (Ghahramani & Hinton, 1996; 2000; Glaser et al., 2020), which assume the transition lies in a set of linear dynamical models and model the switching process with hidden Markov models; thus, these models cannot capture continuous change over time. Recently, some deep state-space models have implicitly considered the time-varying characteristics of data. Both (Rangapuram et al., 2018) and (Fraccaro et al., 2017) consider Gaussian linear dynamical systems in the latent space. In (Rangapuram et al., 2018), the transition/emission matrices and the two noise covariance matrices are predicted by an RNN at each step. In (Fraccaro et al., 2017), the transition/emission matrices are assumed to be a weighted average of a set of base matrices, where an RNN predicts the weights at each step. Note that all these existing works require specifying how time-varying change factors affect the transition process, which may not be applicable in practice without prior knowledge. In contrast, our model is more flexible: we consider a more general transition model, and the time-varying change factors are treated as inputs to the transition process.

A2 IDENTIFIABILITY THEORY

A2.1 NOTATIONS

We summarize the notations used throughout the paper in Table A1. Before the proof, we first reproduce Lemma 1, which presents the identifiability of latent variables in fixed latent dynamics. This result will be used in the proof of Theorem 1.

Lemma 1. (Theorem 1 in (Yao et al., 2022)) The fixed latent causal dynamics takes the following form:
$$x_t = g(z_t), \qquad z_{it} = f_i\big(\{z_{j,t-1} \mid z_{j,t-1} \in \mathrm{Pa}(z_{it})\},\, \epsilon_{it}\big).$$
Let $\eta_{kt} \triangleq \log p(z_{kt} \mid z_{t-1})$, where $\eta_{kt}$ is twice differentiable in $z_{kt}$ and differentiable in $z_{l,t-1}$, $l = 1, 2, \ldots, n$. Suppose there exists an invertible function $\hat{g}$ that maps $x_t$ to $\hat{z}_t$, i.e., $\hat{z}_t = \hat{g}(x_t)$, such that the components of $\hat{z}_t$ are mutually independent conditional on $\hat{z}_{t-1}$. Let
$$v_{k,t} \triangleq \Big(\frac{\partial^2 \eta_{kt}}{\partial z_{k,t}\,\partial z_{1,t-1}}, \frac{\partial^2 \eta_{kt}}{\partial z_{k,t}\,\partial z_{2,t-1}}, \ldots, \frac{\partial^2 \eta_{kt}}{\partial z_{k,t}\,\partial z_{n,t-1}}\Big)^{\intercal}, \qquad \mathring{v}_{k,t} \triangleq \Big(\frac{\partial^3 \eta_{kt}}{\partial z_{k,t}^2\,\partial z_{1,t-1}}, \frac{\partial^3 \eta_{kt}}{\partial z_{k,t}^2\,\partial z_{2,t-1}}, \ldots, \frac{\partial^3 \eta_{kt}}{\partial z_{k,t}^2\,\partial z_{n,t-1}}\Big)^{\intercal}.$$
If, for each value of $z_t$, the $2n$ vector functions $v_{1,t}, \mathring{v}_{1,t}, v_{2,t}, \mathring{v}_{2,t}, \ldots, v_{n,t}, \mathring{v}_{n,t}$ in $z_{1,t-1}, z_{2,t-1}, \ldots, z_{n,t-1}$ are linearly independent, then $z_t$ must be an invertible, component-wise transformation of a permuted version of $\hat{z}_t$.

Second, we consider the additive noise model, in which $g_1$ is the identity mapping. To identify the noise-free distribution $g(z_t)$ from noisy data under Assumption 1, we follow the idea of using the convolution theorem to decouple the measurement error (Khemakhem et al., 2020). The volume of a matrix $A$, $\operatorname{vol} A$, is defined as the product of the singular values of $A$, so $\operatorname{vol} A = |\det A|$ when $A$ is invertible. We use $\operatorname{vol} A$ in the change-of-variables formula in place of the absolute determinant of the Jacobian (Ben-Israel, 1999). Suppose the conditional distributions of the observed variables $p_{f,g,p(\epsilon)}(x_t \mid z_{t-1})$ and $p_{\hat{f},\hat{g},\hat{p}(\epsilon)}(x_t \mid \hat{z}_{t-1})$ match almost everywhere. Then:
$$\int_{\mathcal{Z}} p_{f,p(\epsilon)}(z_t \mid z_{t-1})\, p_g(x_t \mid z_t)\, dz_t = \int_{\mathcal{Z}} p_{\hat{f},\hat{p}(\epsilon)}(z_t \mid \hat{z}_{t-1})\, p_{\hat{g}}(x_t \mid z_t)\, dz_t,$$
$$\int_{\mathcal{Z}} p_{f,p(\epsilon)}(z_t \mid z_{t-1})\, p_{\eta_t}\big(x_t - g(z_t)\big)\, dz_t = \int_{\mathcal{Z}} p_{\hat{f},\hat{p}(\epsilon)}(z_t \mid \hat{z}_{t-1})\, p_{\eta_t}\big(x_t - \hat{g}(z_t)\big)\, dz_t.$$
Applying the change of variables $\bar{x}_t = g(z_t)$ on the left and $\bar{x}_t = \hat{g}(z_t)$ on the right, with the corresponding Jacobians, we have
$$\int_{\mathcal{X}} p_{f,p(\epsilon)}\big(g^{-1}(\bar{x}_t) \mid z_{t-1}\big)\, \operatorname{vol} J_{g^{-1}}(\bar{x}_t)\, p_{\eta_t}(x_t - \bar{x}_t)\, d\bar{x}_t = \int_{\mathcal{X}} p_{\hat{f},\hat{p}(\epsilon)}\big(\hat{g}^{-1}(\bar{x}_t) \mid \hat{z}_{t-1}\big)\, \operatorname{vol} J_{\hat{g}^{-1}}(\bar{x}_t)\, p_{\eta_t}(x_t - \bar{x}_t)\, d\bar{x}_t.$$
Define
$$\tilde{p}_{f,p(\epsilon),g,z_{t-1}}(\bar{x}_t) \triangleq p_{f,p(\epsilon)}\big(g^{-1}(\bar{x}_t) \mid z_{t-1}\big)\, \operatorname{vol} J_{g^{-1}}(\bar{x}_t)\, \mathbb{1}_{\mathcal{X}}(\bar{x}_t),$$
and analogously for $\tilde{p}_{\hat{f},\hat{p}(\epsilon),\hat{g},\hat{z}_{t-1}}$; then
$$\int_{\mathcal{X}} \tilde{p}_{f,p(\epsilon),g,z_{t-1}}(\bar{x}_t)\, p_{\eta_t}(x_t - \bar{x}_t)\, d\bar{x}_t = \int_{\mathcal{X}} \tilde{p}_{\hat{f},\hat{p}(\epsilon),\hat{g},\hat{z}_{t-1}}(\bar{x}_t)\, p_{\eta_t}(x_t - \bar{x}_t)\, d\bar{x}_t.$$
By the convolution theorem (Katznelson, 2004), convolution in one domain (e.g., the time domain) equals point-wise multiplication in the other domain (e.g., the frequency domain). We thus obtain
$$(\tilde{p}_{f,p(\epsilon),g,z_{t-1}} \star p_{\eta_t})(x_t) = (\tilde{p}_{\hat{f},\hat{p}(\epsilon),\hat{g},\hat{z}_{t-1}} \star p_{\eta_t})(x_t),$$
$$\mathcal{F}[\tilde{p}_{f,p(\epsilon),g,z_{t-1}}](\omega)\, \varphi_{\eta_t}(\omega) = \mathcal{F}[\tilde{p}_{\hat{f},\hat{p}(\epsilon),\hat{g},\hat{z}_{t-1}}](\omega)\, \varphi_{\eta_t}(\omega),$$
where $\star$ denotes the convolution operator and $\mathcal{F}[\cdot]$ denotes the Fourier transform. By the definition of the characteristic function in Assumption 1, $\varphi_{\eta_t} = \mathcal{F}[p_{\eta_t}]$. We can then remove the term $\varphi_{\eta_t}(\omega)$ from both sides, since it is non-zero almost everywhere, which gives
$$\mathcal{F}[\tilde{p}_{f,p(\epsilon),g,z_{t-1}}](\omega) = \mathcal{F}[\tilde{p}_{\hat{f},\hat{p}(\epsilon),\hat{g},\hat{z}_{t-1}}](\omega), \qquad \tilde{p}_{f,p(\epsilon),g,z_{t-1}}(\bar{x}_t) = \tilde{p}_{\hat{f},\hat{p}(\epsilon),\hat{g},\hat{z}_{t-1}}(\bar{x}_t).$$
Thus, if the observed distributions match under additive noise, the noise-free distributions also match. Combining this with Lemma 1, the latent variables are identifiable up to permutation and component-wise invertible transformation.

Lastly, we consider the effect of the post-nonlinear function $g_1(\cdot)$. Denote $\bar{x}_t = g_2(z_t) + \eta_t$; then the learned post-nonlinear function $x_t = \hat{g}_1(\bar{x}_t)$ can be written as $x_t = \big(g_1 \circ g_1^{-1} \circ \hat{g}_1\big)(\bar{x}_t)$. We can further write $\hat{g}_1 = g_1 \circ (g_1^{-1} \circ \hat{g}_1) = g_1 \circ g_3$, in which $g_3$ represents the indeterminacy in the space of $\bar{x}_t$. Following the proof of Theorem 1 of (Klindt et al., 2020), $g_3$ can only be a bijection if both $g_2$ and $\hat{g}_1$ are injective. Thus, we can treat it as applying a component-wise invertible nonlinear function $g_3^{-1}$ to $\bar{x}_t$, which does not affect the identifiability of $z_t$ up to permutation and component-wise invertible transformation. Therefore, NPSSM in Eq. (9) is identifiable.
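The deconvolution argument above hinges on the convolution theorem and on the noise characteristic function being non-zero almost everywhere, so it can be divided out. The following minimal numerical sketch (illustrative only, using discrete circular convolution as a stand-in for the continuous densities) checks both facts:

```python
import numpy as np

# Discrete sanity check of the convolution theorem used in the proof:
# the Fourier transform of a convolution equals the pointwise product
# of the Fourier transforms.
rng = np.random.default_rng(0)
n = 64
p = rng.random(n)  # stand-in for the noise-free density p~
q = rng.random(n)  # stand-in for the noise density p_eta

# Circular convolution via FFT vs. direct summation.
conv = np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(q)))
direct = np.array([sum(p[j] * q[(i - j) % n] for j in range(n))
                   for i in range(n)])
assert np.allclose(conv, direct)

# If the (discrete) characteristic function of the noise is non-zero
# everywhere, dividing it out uniquely recovers the noise-free part.
recovered = np.real(np.fft.ifft(np.fft.fft(conv) / np.fft.fft(q)))
assert np.allclose(recovered, p)
```

This mirrors the step in which $\varphi_{\eta_t}(\omega)$ is cancelled from both sides: once the noisy distributions match, so do the noise-free ones.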

A3 EXPERIMENT DETAILS

A3.1 DATASETS

A3.1.1 SYNTHETIC DATASET GENERATION

To evaluate the identifiability and forecasting capability of our model under different scenarios, we generate synthetic data with 1) fixed causal dynamics; 2) fixed causal dynamics with distribution shift; 3) time-varying causal dynamics with changing noise variances; and 4) time-varying causal dynamics with changing causal strengths. We use the first 80% of the data for training and the remaining 20% for evaluation.

Stationary Causal Dynamics For the fixed causal dynamics, we generate 100,000 data points based on the following equation:
$$z_{k,t} = q_k(\{z_{t-\tau}\}) + \frac{1}{b_k(\{z_{t-\tau}\})}\, \epsilon_{k,t}.$$
Here, $\epsilon_{k,t}$ is the process noise, sampled from an i.i.d. Gaussian distribution ($\sigma = 0.1$); $\epsilon_{1,t}, \epsilon_{2,t}, \ldots, \epsilon_{n,t}$ are mutually independent and independent of $z_{t-1}$. The process noise terms are coupled with the history information through multiplication with the average value of all the time-lagged latent variables. We set the latent size $n = 8$ and the lag number of the process $L = 2$. We apply a 2-layer MLP with LeakyReLU units as the state transition function. The emission function is a random three-layer MLP with LeakyReLU units.
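A minimal sketch of this generation process is given below. It is illustrative only: the random MLP weights, their scale, and the exact form of the coupling function `b` are assumptions (the paper specifies only that the noise is coupled with the average of the lagged latents), and we generate a short sequence rather than 100,000 points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lag, T = 8, 2, 1000  # latent size, time lags, sequence length

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def mlp(x, weights):
    """Random MLP with LeakyReLU units on all but the last layer."""
    for W in weights[:-1]:
        x = leaky_relu(x @ W)
    return x @ weights[-1]

# Random 2-layer transition MLP and 3-layer emission MLP; the small
# weight scale is an assumption to keep the toy process stable.
trans_w = [rng.normal(scale=0.1, size=(n * lag, n)),
           rng.normal(scale=0.1, size=(n, n))]
emis_w = [rng.normal(scale=0.1, size=(n, n)) for _ in range(3)]

z = rng.normal(size=(lag, n))
for t in range(lag, T):
    hist = z[t - lag:t].reshape(-1)
    eps = rng.normal(scale=0.1, size=n)  # i.i.d. Gaussian process noise
    # Hypothetical positive coupling b_k({z_{t-tau}}) built from the
    # average of the lagged latents; the exact b_k is unspecified.
    b = np.abs(hist.mean()) + 1.0
    z_t = mlp(hist, trans_w) + eps / b   # z_{k,t} = q_k(.) + eps_{k,t}/b_k(.)
    z = np.vstack([z, z_t])

x = mlp(z, emis_w)  # observations via the random emission MLP
```

The resulting `x` plays the role of the observed series; the stored `z` gives the ground-truth latents used for computing MCC.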

Stationary Causal Dynamics with Distribution Shift

We follow the same setting as the fixed causal dynamics and generate 80,000 data points for the training set. To simulate distribution shift in the test phase, we vary the values of the first layer of the MLP in the test set and generate 20,000 samples. The entries of the kernel matrix of the first layer are uniformly distributed in [-1, 1].

Time-Varying Causal Dynamics with Changing Causal Strengths

For the time-varying causal dynamics with changing causal strengths, we generate 100,000 data points based on the following equations:
$$c_{k,t} = c_{k,t-1} + \zeta_{k,t}, \qquad z_{k,t} = q_k(\{z_{t-\tau}\}, c_t) + \frac{1}{b_k(\{z_{t-\tau}\})}\, \epsilon_{k,t},$$
where the noises $\zeta_{k,t}$ are sampled from an i.i.d. Laplace distribution ($\sigma = 1$). We take the change factor $c_t$ as an input to the latent transition function for $z_t$.

Time-Varying Causal Dynamics with Inter-Dependent Change Factors For the time-varying causal dynamics with inter-dependent change factors, instead of considering independent sources with temporal dependencies, we consider inter-dependence across the variable indices. Formally, we generate 100,000 data points based on the following equations:
$$c_t = C c_{t-1} + \zeta_t, \qquad z_{k,t} = q_k(\{z_{t-\tau}\}) + \frac{1}{b_k(\{z_{t-\tau}\}, c_t)}\, \epsilon_{k,t},$$
where $C$ is the transition matrix for the change factors, and the noises $\zeta_{k,t}$ are sampled from an i.i.d. Laplace distribution ($\sigma = 1$). In the latent transition process for $z_t$, the noise terms are coupled with the history information and change factors through multiplication with the average value of all the time-lagged latent variables $z_{t-\tau}$ and the current time-varying change factor $c_t$.

A3.1.2 REAL-WORLD DATASETS

Three real-world datasets are used to evaluate the forecasting performance of the proposed model. We use the first 80% of the data for training and the remaining 20% for evaluation.

Economics The economics dataset was used in (Huang et al., 2019). We investigate the time-lagged causal relationships among 10 macroeconomic variables, ranging from CPI and inflation to the unemployment rate, with monthly data from 1965 to 2017 in the USA. The data are normalized by subtracting the mean and dividing by the standard deviation.

Bitcoin The bitcoin dataset was used in (Godahewa et al., 2021). We investigate the time-lagged causal relationships among 16 daily time series that have potential influences on the bitcoin price.
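The two change-factor processes above can be sketched as follows. This is an illustrative fragment, not the paper's code: the stabilizing rescaling of the transition matrix `C` is an assumption added so the toy simulation stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, T = 4, 500  # number of change factors, sequence length

# (a) Changing causal strengths: each change factor follows an
# independent random walk, c_{k,t} = c_{k,t-1} + zeta_{k,t},
# with i.i.d. Laplace innovations.
c_walk = np.cumsum(rng.laplace(scale=1.0, size=(T, n_c)), axis=0)

# (b) Inter-dependent change factors: c_t = C c_{t-1} + zeta_t,
# where C couples the factors across variable indices. We rescale C
# to spectral radius 0.5 (an assumption) so the process is stable.
C = rng.normal(size=(n_c, n_c))
C = 0.5 * C / np.abs(np.linalg.eigvals(C)).max()
c = np.zeros((T, n_c))
for t in range(1, T):
    c[t] = C @ c[t - 1] + rng.laplace(scale=1.0, size=n_c)
```

In case (a), `c_walk[t]` would enter the transition function `q_k` as an extra input; in case (b), `c[t]` would instead modulate the noise coupling `b_k`.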
Specifically, it includes hash rate, block size, mining difficulty, public opinion, etc. The data are normalized by subtracting the mean and dividing by the standard deviation.

FRED The FRED dataset was used in (Godahewa et al., 2021). We investigate the time-lagged causal relationships among 107 monthly time series. It contains a set of macroeconomic indicators from the Federal Reserve Bank. The data are normalized by subtracting the mean and dividing by the standard deviation.
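The normalization applied to all three datasets is the standard per-variable z-score, which can be written as:

```python
import numpy as np

def normalize(x):
    """Per-variable z-score: subtract the mean and divide by the
    standard deviation, computed per column as described above."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Toy example with two series (columns) of three time steps each.
series = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
out = normalize(series)  # each column now has mean 0 and std 1
```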

A3.2 EVALUATION METRIC

Mean Correlation Coefficient (MCC) MCC is a standard metric for evaluating the recovery of latent factors in the ICA literature. For each possible pair of an estimated factor and a true one, we first apply a nonlinear regression to the recovered factor to remove the component-wise transformation indeterminacy. Then, we calculate the correlation coefficients (the absolute values of Spearman's rank correlation coefficients) between all pairs of ground-truth and estimated latent factors (after the component-wise transformation). We then solve a linear sum assignment problem to assign each estimated component to the ground-truth component that best correlates with it, thus finding the correspondence between the estimated and true factors in the latent space. A high MCC means the true latent factors were successfully recovered, up to invertible, component-wise transformation and permutation.
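The steps above can be sketched as follows. This is a simplified version: the per-pair nonlinear regression step is omitted, since Spearman's rank correlation already absorbs monotonic component-wise transforms; only the correlation matrix and the assignment step are shown.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import spearmanr

def mcc(z_true, z_est):
    """Mean Correlation Coefficient between true and estimated factors.

    Simplified sketch: uses |Spearman rho| for every (true, estimated)
    pair, then solves a linear sum assignment problem to find the best
    permutation, and averages the matched correlations.
    """
    n = z_true.shape[1]
    # spearmanr on two 2-D arrays returns a (2n x 2n) correlation
    # matrix over the stacked columns; take the cross block.
    corr = np.abs(spearmanr(z_true, z_est)[0][:n, n:])
    row, col = linear_sum_assignment(-corr)  # maximize total correlation
    return corr[row, col].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 3))
z_hat = np.tanh(z[:, [2, 0, 1]])  # permuted + monotone component-wise transform
print(round(mcc(z, z_hat), 3))    # -> 1.0, i.e., perfect recovery
```

Because `z_hat` is only a permutation plus a monotone transform of `z`, the metric correctly reports perfect identification.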

A3.3.1 ADDITIONAL TIME SERIES FORECASTING RESULTS

We list additional comparisons with more baselines among typical deep state-space models, including DKF (Krishnan et al., 2015), SRNN (Fraccaro et al., 2016), RVAE (Leglaive et al., 2020), and DSAE (Yingzhen & Mandt, 2018). As shown in Table A2, our method still consistently outperforms these additional baselines. In Table A3, we evaluate the proposed model N-NPSSM and the baselines on additional real-world datasets: NN5, OIKOLAB and Pedestrian. N-NPSSM again outperforms the baselines. In Table A4, we compare N-NPSSM with the baselines in the multi-step (3-step) forecasting setting; our proposed model consistently outperforms the baselines.

A3.3.2 ABLATION STUDIES

In Table A5, we show the performance of N-NPSSM and NPSSM on the synthetic datasets. N-NPSSM achieves performance comparable to NPSSM in the fixed causal dynamics settings, while achieving a higher MCC score in the time-varying causal dynamics settings. In Table A6, we compare the identifiability results of NPSSM with and without the transition prior network; this module is critical to the identifiability of NPSSM. To verify the effectiveness of NPSSM when the noise function is non-invertible, we consider a generation process that replaces the noise term of the stationary causal dynamics in Section A3.1.1 with squared noise, i.e., $z_{k,t} = q_k(\{z_{t-\tau}\}) + \frac{1}{b_k(\{z_{t-\tau}\})}\, \epsilon_{k,t}^2$. We show the MCC trajectories of NPSSM in Figure A1; NPSSM still achieves a high MCC of around 0.9. Figure A3 presents some showcases of different models on the Economics dataset for qualitative evaluation. We observe that N-NPSSM predicts well under various temporal data characteristics. To visualize nonlinear relations, we use LassoNet (Lemhadri et al., 2021) as a post-processing tool to remove weak edges and generate a sparse causal relation graph from the results on the Economics dataset. This method prunes input nodes by jointly feeding the first hidden layer and the residual layer through a hierarchical threshold-based optimizer. We first fit the LassoNet on the emission model. As shown in Figure A4, industrial production and the business confidence survey are simultaneously correlated, as both are affected by latent factor '1'. Additionally, foreign exchange reserves, CPI and money supply are simultaneously correlated, as all of them are affected by latent factor '4'. In Figure A5, we use LassoNet again to extract the sparse time-lagged causal relations in the latent space. We observe that most latent factors are affected by their time-lagged parent nodes. Meanwhile, our model can also recover the cross relations between latent variables.

A4 IMPLEMENTATION DETAILS

A4.1 NETWORK ARCHITECTURE

We summarize our network architecture in Table A8 .

A4.2 TRAINING DETAILS

The models were implemented in PyTorch 1.9.0. The VAE network is trained using the AdamW optimizer and stops early if the ELBO loss does not decrease. The maximum number of epochs is 200 for the synthetic datasets and 700 for the real-world datasets. A mini-batch size of 64 is used. We used three random seeds in each experiment and report the mean performance with standard deviation across seeds. The hyperparameters of N-NPSSM include [β, γ, σ], which are the weights for the transition prior of the latent variables z, the transition prior of the change factors c, and the auxiliary predictor, respectively. Since the objective of the transition prior does not cover the initial time-lagged variables, we follow the conventional VAE and use the standard normal distribution N(0, 1) as the prior for these initial latent variables instead. Therefore, we augment the hyperparameters to [β, β_init, γ, γ_init, σ]. We performed a grid search to select these hyperparameters: lr ∈ [1e-3, 5e-3, 2e-2], β ∈ [8e-3, 1e-2, 2e-2], β_init ∈ [5e-4, 2e-3], γ ∈ [1e-4, 5e-3, 1e-2, 2e-2], γ_init ∈ [3e-3, 5e-3, 2e-2], and σ ∈ [0.1, 0.5, 1]. To facilitate comparison, the training parameters of the baselines, e.g., optimizer and batch size, as well as the encoder and decoder architectures, are identical to those of N-NPSSM. Similarly, we performed a grid search to select the learning rate, lr ∈ [5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1], and the weight of the KL divergence term, α ∈ [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1]. For all experiments, we use z ∈ R^8 and c ∈ R^4 and set the maximum time lag L = 2 by rule of thumb. For the initialization of the VAE, we follow β-VAE (Higgins et al., 2016) and adopt He initialization. For the remaining modules/networks, we adopt uniform initialization.
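The grid search over [β, β_init, γ, γ_init, σ] and the learning rate can be sketched as below. The `train_and_validate` function is a hypothetical placeholder standing in for one full training run of N-NPSSM; only the enumeration of configurations reflects the text.

```python
import itertools

# Hyperparameter grid from the text above.
grid = {
    "lr":         [1e-3, 5e-3, 2e-2],
    "beta":       [8e-3, 1e-2, 2e-2],
    "beta_init":  [5e-4, 2e-3],
    "gamma":      [1e-4, 5e-3, 1e-2, 2e-2],
    "gamma_init": [3e-3, 5e-3, 2e-2],
    "sigma":      [0.1, 0.5, 1.0],
}

def train_and_validate(cfg):
    """Placeholder: would train N-NPSSM with cfg and return a
    validation score (e.g., negative MAE). Dummy value here."""
    return -sum(cfg.values())

best_cfg, best_score = None, float("-inf")
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    score = train_and_validate(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

n_configs = len(list(itertools.product(*grid.values())))
print(n_configs)  # 3*3*2*4*3*3 = 648 configurations
```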
Training Stability We used several standard tricks to improve training stability: (1) we use the AdamW optimizer, whose decoupled weight decay acts as a regularizer, to prevent training from being interrupted by overflow or underflow of the variance terms of the VAE; (2) for the experiments on synthetic datasets, we separate the learning procedure into two phases: we first focus on the reconstruction task to uncover the latent process, and then learn the latent predictor. This allows the model to first find identifiable latent representations and then learn how to utilize them for the forecasting task. For the real-world datasets, we learn the two components jointly.

Computation Hardware We use an Nvidia A100 GPU to run our experiments.
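The two-phase schedule can be illustrated with a deliberately tiny linear toy problem (an assumed setup, not the actual N-NPSSM code): phase 1 fits an encoder/decoder pair by reconstruction alone, phase 2 freezes the encoder and fits only a predictor head on top of the learned representation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))              # toy "observations"
Y = (X @ rng.normal(size=(8, 4))).sum(axis=1, keepdims=True)  # toy target

# Phase 1: learn a linear encoder/decoder by minimizing ||X enc dec - X||^2.
enc = rng.normal(size=(8, 4)) * 0.1
dec = rng.normal(size=(4, 8)) * 0.1
for _ in range(500):
    Z = X @ enc
    err = Z @ dec - X
    g_dec = Z.T @ err / len(X)
    g_enc = X.T @ (err @ dec.T) / len(X)
    dec -= 1e-3 * g_dec
    enc -= 1e-3 * g_enc

# Phase 2: freeze the encoder; train only the predictor head on Z.
head = np.zeros((4, 1))
for _ in range(500):
    Z = X @ enc                            # encoder is fixed now
    head -= 1e-2 * (Z.T @ (Z @ head - Y) / len(X))

mse = np.mean((X @ enc @ head - Y) ** 2)   # lower than predicting zeros
```

The point of the schedule is the same as in the text: the representation is settled before the predictor starts consuming it, rather than both being optimized through each other from the start.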



Here, the definition of "non-parametric" is not about the general form of the mapping function, but indicates a functional causal model that takes the cause variables and errors as inputs of a general function. Unlike the additive noise form, there is no constraint on how the noise interacts with the cause variables. A formal definition can be found in line 4 below Eq. (1.40) in (Pearl, 2009).

Dataset sources: Economics is downloaded from https://www.theglobaleconomy.com/; Bitcoin from https://zenodo.org/record/5122101#.YzPm7exBz0o; FRED from https://zenodo.org/record/4654833#.YzPo1exBz0o; NN5 from https://zenodo.org/record/4656117; OIKOLAB from https://zenodo.org/record/5184708; Pedestrian from https://zenodo.org/record/4656626.



Figure 1: Left: The proposed estimation framework, which mainly includes latent causal model learning and the prediction model. Right: Motivational examples demonstrating the benefit of latent causal model learning for forecasting: (1) it provides compact representations for forecasting, whereas vanilla predictors must model complicated dependencies; (2) the prediction model is more robust to distribution shift (red circles indicate distribution change); (3) it gives a compact way to model the change factors to address non-stationary forecasting issues.

Figure 2: Fig. (a) shows an overview of our structural VAE framework, which mainly includes the latent causal model and the latent prediction model. The latent causal model recovers the latent process by minimizing the reconstruction error and the regularization between the factorized posteriors q(ẑ 1:T ), q(ĉ 1:T ) and the transition priors p(ẑ 1:T ), p(ĉ 1:T ), which implicitly model the temporal dynamics. Fig. (b) shows the transition prior model, representing the latent causal processes ẑt and ĉt.

Figure 3: MCC trajectories of NPSSM for temporal data with clear assumption violations.

Figure 4: MCC for causally-related factors and scatter plots between estimated factors and true factors on four synthetic datasets.

Figure A1: MCC trajectories of NPSSM for temporal data with non-invertible noise.

Figure A2: MCC for causally-related factors and scatter plots between estimated factors and true factors on two synthetic datasets for NPSSM.

Figure A3: The observations of each model on economics dataset

Figure A4: The causal relations between latent variables and observed variables. The blue circles with numbers indicate latent factors, while the green circles indicate the observed variables. Note that latent factors '0', '2' and '5' have been removed by the pruning step when constructing this relation graph, meaning these factors do not demonstrate strong causal strengths.

Figure A5: The time-lagged causal relations graph for latent variables. The blue circles indicate the time-lagged source latent factors, while the green circles indicate the target latent factors.

Identifiability and forecasting performance on the four synthetic datasets (more empirical results can be found in A3.3). Note: "N/A" indicates that the deterministic model LSTM cannot output a predictive distribution.

Forecasting performance on three real-world datasets


Notations.

Comparison with additional baselines on real-world datasets.

Comparison with baselines on additional real-world datasets.

Comparison with baselines on multi-step (3-step) time series forecasting setting.

Comparison between N-NPSSM and NPSSM for MCC scores and forecasting performance on synthetic datasets

Ablation study for the effectiveness of the transition prior network

We report the total number of parameters of the different methods in our synthetic experiments. Compared to the baseline models, the proposed NPSSM and N-NPSSM require extra parameters because of their additional transition prior models. Compared to NPSSM, N-NPSSM has more parameters since it needs to explicitly model the encoder for ĉ1:T conditioned on {ẑ t-τ } Lc τ =0 .

Model size (Total parameters) of different methods in synthetic experiments.

