GATED INFERENCE NETWORK: INFERENCING AND LEARNING STATE-SPACE MODELS

Abstract

State-space models (SSMs) perform predictions by learning the underlying dynamics of an observed sequence. We propose a new SSM approach for both high- and low-dimensional observation spaces that utilizes Bayesian filtering-smoothing to model the system's dynamics more accurately than RNN-based SSMs and can be learned in an end-to-end manner. The designed architecture, which we call the Gated Inference Network (GIN), integrates uncertainty estimates and learns the complicated dynamics of the system, enabling estimation and imputation tasks both in the presence and in the absence of data. The proposed model uses GRU cells in its structure to complete the data flow while avoiding expensive computations and potentially unstable matrix inversions. The GIN can deal with any time-series data and is strongly robust to observational noise. In the numerical experiments, we show that the GIN reduces the uncertainty of estimates and outperforms its counterparts: LSTMs, GRUs and variational approaches.

1. INTRODUCTION

State estimation and inference over the states of dynamical systems is one of the most interesting problems, with many applications in signal processing and time series Rauch et al. (1965). In some cases, learning the state space is a complicated task due to the relatively high dimension of the observations and measurements, which provide only partial information about the states. Noise is another significant issue in this scenario, since observations are likely to be noisy. Time-series prediction and estimating the next scene, e.g., state prediction or next-observation prediction, is another substantial application that again requires inference over the states from the observations. Classical memory networks such as LSTMs (Hochreiter & Schmidhuber, 1997), GRUs (Cho et al., 2014) and simple RNNs like (Wilson & Finkel, 2009) and (Yadaiah & Sowmya, 2006) fail to give any intuition about the uncertainties and dynamics. A group of approaches performs Kalman filtering (KF) in the latent state, which usually requires a deep encoder for feature extraction; Krishnan et al. (2017), Ghalamzan et al. (2021) and Hashempour et al. (2020) belong to this group of works. However, these solutions have some restrictions: they are not able to deal with high-dimensional non-linear systems, and the classic KF approach is computationally expensive, e.g., due to matrix inversion. Likewise, indirect optimization of an objective function by using variational inference, as in the work of Kingma & Welling (2013), increases the complexity of the model. Moreover, variational inference approaches, usually implemented in the context of variational autoencoders for dimension reduction, do not have access to the loss directly and have to minimize its lower bound instead, which reduces the ability to learn the dynamics and affects the performance of the model. KalmanNet Revach et al. (2021) and Ruhe & Forré (2021) use a GRU in their structure for the state update. However, they can only deal with low-dimensional state spaces and cannot handle complex high-dimensional inputs, because they directly use the classic Bayesian equations and suffer from the matrix inversion issue. Moreover, their structures require full, or at least partial, dynamics information. The mentioned restrictions of KF and its variants and of variational models, in addition to the necessity of having a metric to measure the uncertainty, motivate us to introduce the GIN, an end-to-end structure with dynamics-learning ability that uses Bayesian properties for filtering-smoothing. The contributions of the GIN are: (i) Modeling high- and low-dimensional sequences: we show that the GIN can infer both cases by a simple adjustment of the observation-transferring functions in the proposed structure, where we conduct three experiments with high-dimensional non-linear dynamics and two experiments with low-dimensional chaotic observations. (ii) Learning/using dynamics: the ability to learn the dynamics (in their absence) and to utilize available dynamics (in their presence), alongside modeling high- and low-dimensional observations, makes the GIN applicable to a wide range of applications. To attain more accurate inference of the observed dynamical system, we apply GRU cells that increase the modeling capability of Kalman filtering-smoothing. By conducting an ablation study in which the GIN is replaced by a linearized Gaussian state transition, we show that the GIN learns a better state-space representation with disentangled dynamics features. (iii) Direct optimization: we show that the posterior and smoothing inference distributions of the state-space model are tractable while the dynamics and parameters are estimated by neural networks. Unlike variational approaches, this allows us to use recursive Bayesian updates for direct likelihood maximization. (iv) Noise robustness: as verified by the numerical results, inference for highly distorted sequences is feasible with the GIN. (v) Missing-data imputation: by using Bayesian properties, the GIN decides whether to keep the previous information in the memory cell or to update it with the obtained observation. Experimental results show that the GIN outperforms the SOTA studies in the imputation task.

2. RELATED WORKS

To deal with complex sensory inputs, some approaches integrate deep autoencoders into their architecture. Among these works, Embed to Control (E2C) (Watter et al., 2015) uses a deep encoder to obtain the observation and a variational inference over the states. However, such methods cannot deal with the missing-data problem and the imputation task, since they do not rely on memory cells and are not recurrent. Another group of works, like BackpropKF (Haarnoja et al., 2016) and RKN (Becker et al., 2019), applies CNNs for dimension reduction and outputs both the uncertainty vector and the observation; these methods move away from variational inference and borrow Bayesian properties for the inference. However, they cannot handle cases where knowledge of the dynamics is available and impose restrictive assumptions on the covariance matrices, while the GIN provides a principled way to use available partial dynamics information and releases any assumption on the covariance. Toward learning the state space (system identification), a group of works like Wang et al. (2007), Ko & Fox (2011) and Frigola et al. (2013) proposes algorithms to learn GPSSMs based on maximum likelihood estimation with the iterative EM algorithm. Frigola et al. (2013) obtain sample trajectories from the smoothing distribution, then, conditioned on these trajectories, conduct the M step for the model's parameters. Switching linear dynamical systems (SLDS) (Ghahramani & Hinton, 2000) use additional latent variables to switch among different linear dynamics, where approximate inference algorithms can be utilized to model switching linearity and reduce approximation errors; however, this approach is not as flexible as general non-linear dynamical systems because the switch transition model is assumed independent of the states and observations. To address this problem, Linderman et al. (2017) perform the SLDS method through augmentation with a Polya-gamma-distributed variable and a stick-breaking process; however, this approach employs Gibbs sampling for inferring the parameters and is therefore not scalable to large datasets. Auto-regressive hidden Markov models (ARHMMs) explain time-series structures by defining a mapping from past observations to the current observation. (Salinas et al., 2020) is an ARHMM approach in which target values are used directly as inputs. However, this dependency of the model on the targets makes it more vulnerable to noise. This issue is addressed in DSSM (Rangapuram et al., 2018), another ARHMM approach, where the target values are incorporated only through the likelihood term. Another group of works considers EM-based variational inference, like Structured Inference Networks (SIN) (Krishnan et al., 2017), which utilizes an RNN to update the state. Kalman Variational Autoencoder (KVAE) (Fraccaro et al., 2017) and Extended KVAE (EKVAE) (Klushyn et al., 2021) use the original KF equations and apply both filtering and smoothing. However, these EM-based variational inference methods cannot estimate the states directly because they optimize a lower bound of the likelihood. Extra complexity is another issue with these approaches; both issues are addressed by the proposed structure and direct end-to-end optimization in the GIN. We compare the GIN with these approaches in the experiment section and provide an empirical complexity analysis in appendix A.8.2. We provide a detailed discussion of recent related work in appendix A.8.3.

3. GATED INFERENCE NETWORK FOR SYSTEM IDENTIFICATION

In the context of system identification (SI), the GIN is similar to a Hammerstein-Wiener (HW) model (Schoukens & Tiels, 2017) (Gilabert et al., 2005) in the sense that it estimates the system parameters directly from the observations, as shown in figure 2. e(.) and d(.) are implemented with non-linear functions, e.g., autoencoders or MLPs. The transition block in figure 2 represents the dynamics of the system and allows for inference using the Gaussian state-space filtering-smoothing equations. However, unlike a HW model, we employ non-linear GRU cells in the transition block that calculate the Kalman gain (KG) and smoothing gain (SG) in an appropriate manner, circumventing the computational complexity, i.e., matrix inversion issues. The GRU cells empower the whole system by applying non-linearity to linearized Gaussian state-space models (LGSSMs). Numerical results indicate that with the proposed structure, good inference is feasible even for complex non-linear systems with high-dimensional observations. To achieve this, we assume the state fits into Gaussian state-space models (GSSMs), which are commonly used to model sequences of vectors.

Figure 2: The GIN as a HW model for system identification. By appropriate structure selection for e(.) and d(.), the GIN can handle high- and low-dimensional observations. The proposed architectures for each case are depicted separately with further details in appendix figures 10 and 11. The relation between the internal variables, w_t and x_t, is simulated by the transition block.

In most cases, the dynamics of the system may be unavailable or hard to obtain, while the process noise and observation noise are unknown (our first three experiments). Accordingly, we construct the GIN to learn the unknown variables from data in an end-to-end fashion; we then utilize the constructed KG and SG during inference time to obtain the filtered-smoothed states. The proposed architecture is depicted in figure 4. In the presence of dynamics (our last two experiments), the autoencoder and the Dynamics Network in figure 4 are replaced by MLPs to model the observation noise.

4. PARAMETERIZATION

Figure 3: Graphical model for high-dimensional observations. dyn_t is the model's dynamics at time t.

In this section we show the parameterization of the inference model. First, for completeness, we refer the readers to the summary of the Kalman filtering-smoothing background in appendix A.2. In the rest of the paper we use the following notation: the original observations are o_{1:T}; the transferred observations are w_{1:T}; and R_{1:T} are diagonal matrices with diagonal elements r_{1:T} that correspond to the covariance of the transferred-observation noise. x_{1:T} are the states, and Q_{1:T} are diagonal matrices with diagonal elements q_{1:T}, the covariance of the state process noise. (x_t^-, Σ_t^-) are the mean vector and covariance matrix of the prior state at time step t, i.e. p(x_t|w_{1:t-1}), and (x_t^+, Σ_t^+) are the mean vector and covariance matrix of the posterior state at time step t, i.e. p(x_t|w_{1:t}). We define the transition matrices F_{1:T} and emission matrices H_{1:T} as the dynamics of the model. Lack of dynamics in the first three experiments means that (F_{1:T}, H_{1:T}) are not known and must be trained (graphical model in figure 3), while presence of dynamics in our last two experiments means that (F_{1:T}, H_{1:T}) are known (graphical models in figures 9a and 9b).

Figure 4: The high-level structure of the GIN for high-dimensional observations in the lack of dynamics; for low-dimensional cases the autoencoder is replaced by MLPs and the dynamics are used directly. The transferred observation w_t and its uncertainty r_t are obtained from the encoder (MLPs). In each time step, the last posterior mean x_{t-1}^+ is fed to the Dynamics Network to compute F̂_t and Ĥ_t. In the prediction step, the next priors (x_t^-, Σ_t^-) are obtained by using the new dynamics and the last posteriors. In the filtering step, by using the priors (x_t^-, Σ_t^-) and the observation (w_t, r_t), the next posteriors (x_t^+, Σ_t^+) are obtained. Applying the smoothing operation over the obtained posteriors (x_t^+, Σ_t^+) is feasible in the smoothing step. Finally, the decoder (MLP) is utilized to produce o_t^+, which can be the high- or low-dimensional noise-free estimate.

The dynamics (F_{1:T}, H_{1:T}) and noise matrices (R_{1:T}, Q_{1:T}) construct the system parameters γ_{1:T} = (F_{1:T}, H_{1:T}, Q_{1:T}, R_{1:T}). Given original observations o_{1:T} and transferred observations w_{1:T}, we want to find a good estimate of the latent states x_{1:T}. To achieve this, we infer the marginal distributions p(x_t|w_{1:t}) for the online inference approach, or filtering, and p(x_t|w_{1:T}) for the full inference approach, or smoothing. We introduce an advantageous prediction parameterization as p_{γ_t}(x_t|x_{t-1}, w_{1:t-1}) = N(F_t x_{t-1}, Q_t), where x_{t-1} ~ p_{γ_{t-1}}(x_{t-1}|w_{1:t-1}) = N(x_{t-1}^+, Σ_{t-1}^+). Then p_{γ_t}(x_t|w_{1:t-1}) = N(F_t x_{t-1}^+, F_t Σ_{t-1}^+ F_t^T + Q_t) = N(x_t^-, Σ_t^-) is obtained by marginalizing out x_{t-1}, and the Gaussianity of p_{γ_t}(x_t|w_{1:t-1}) results from the Gaussianity of the prediction parameterization. Having p_{γ_t}(x_t|w_{1:t-1}) and observing w_t, the filtering parameterization is introduced as:

$$p_{\gamma_t}(x_t|w_{1:t}) = \mathcal{N}\big(x_t^- + K_t[w_t - H_t x_t^-],\; \Sigma_t^- - K_t[H_t \Sigma_t^- H_t^T + R_t]K_t^T\big) = \mathcal{N}(x_t^+, \Sigma_t^+) \quad (1)$$

where K_t is the KG. After observing all transferred observations w_{1:T}, one can do backward induction and propagate to the previous states using the chain rule. This procedure, known as smoothing, can be parameterized as:

$$p_{\gamma_t}(x_t|w_{1:T}) = \mathcal{N}\big(x_t^+ + J_t[x_{t+1|T} - F_{t+1} x_t^+],\; \Sigma_t^+ + J_t[\Sigma_{t+1|T} - \Sigma_{t+1}^-]J_t^T\big) \quad (2)$$

where J_t is the SG; we use the shorthand N(x_{t|T}, Σ_{t|T}) for (2). These parameterizations give us some insight to 1) illustrate a tractable way to construct p_γ(x|w) and accordingly obtain the posterior and smoothed states, from which o^+ is constructed, and 2) appropriately model γ and the KG (SG) with neural networks. To construct the KG and SG networks, we have to find appropriate inputs containing the information needed to attain the KG and SG. In (1) and (2), the KG and SG are given by (3) and (4), respectively:

$$K_t = \Sigma_t^- H_t^T \big[H_t \Sigma_t^- H_t^T + R_t\big]^{-1} \propto (\Sigma_t^-, R_t) \quad (3)$$

$$J_t = \Sigma_t^+ F_{t+1}^T \big[F_{t+1} \Sigma_t^+ F_{t+1}^T + Q_{t+1}\big]^{-1} = \Sigma_t^+ F_{t+1}^T \big[\Sigma_{t+1}^-\big]^{-1} \propto \Sigma_{t+1}^- \quad (4)$$

(3) is proportional to the prior covariance at time t, Σ_t^-, and the observation noise matrix R_t, while (4) is proportional to the prior covariance matrix at time t+1, Σ_{t+1}^-. Our encoder (MLP) directly maps the observation noise matrix from the observation space, but the state covariance is a recursive function of the previous states. Consequently, we consider GRU_KG and GRU_SG, networks including GRUs that map [f(Σ_t^-), R_t] and f(Σ_{t+1}^-) to the KG and SG, respectively. GRU_KG considers R_t, a diagonal matrix with elements r_t in figure 4, as a part of its input to incorporate the effects of the observation noise. In the case of a high-dimensional state space, due to the high dimension of Σ_t^- and Σ_{t+1}^-, f is a convolutional layer with pooling that extracts the valuable information of the covariance matrix while reducing its size; for low-dimensional Σ_t^- and Σ_{t+1}^-, f is the identity function.

Learning The Process Noise. In the filtering procedure, the process noise at time t is obtained as

$$Q_t = \Sigma_t^- - F_t \Sigma_{t-1}^+ F_t^T \quad (5)$$

where Σ_t^-, F_t and Σ_{t-1}^+ are the prior state covariance, the transition matrix and the posterior state covariance at time t. It can be shown that Q_t can be written as a function of F_t as

$$Q_t = \Sigma_t^- - F_t \Sigma_{t-1}^+ F_t^T = \Sigma_t^- - F_t \big[\Sigma_{t-1}^- - K_{t-1}[H_{t-1} \Sigma_{t-1}^- H_{t-1}^T + R_{t-1}]K_{t-1}^T\big] F_t^T \quad (6)$$

The derivations are rather lengthy, so we refer to the appendix material A.3. From (32), the relation of the process noise to the transition matrix indicates that F_t can absorb the effects of Q_t if we learn it in an appropriate manner. The notation F̂_t(Q_t) means that the learned transition matrix F̂_t comprises the effects of Q_t; for simplicity we use the abbreviation F̂_t. Therefore, it is possible to rewrite (5) as

$$\Sigma_t^- = \hat{F}_t \Sigma_{t-1}^+ \hat{F}_t^T. \quad (7)$$

Another way to make a meaningful inference about the process noise matrix is to obtain it from (30) as a recursive function of x_{t-1}^+ and Q_{t-1}. Intuitively, the function g in (30), which we call the Q Network, can be implemented by a memory cell, e.g., a GRU cell, to keep the past status of Q; however, this increases the complexity of the model. Equivalently, one can obtain Q_t directly from x_{t-1}^+ with an MLP as stated in (32). Either solution can be utilized when the dynamics are known, i.e., when we cannot learn the effects of Q_t jointly with F_t because F_t is not trainable.
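As a numerical illustration of (7), the sketch below constructs, for given F, Σ_{t-1}^+ and Q_t, a matrix F̂_t satisfying F̂_t Σ_{t-1}^+ F̂_t^T = F_t Σ_{t-1}^+ F_t^T + Q_t via Cholesky factors. This confirms that a transition matrix can absorb the process noise; the GIN learns such an F̂_t end to end rather than computing it in closed form, so this is only an existence check, not the training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def absorb_process_noise(F, Sigma_post, Q):
    """Build F_hat with F_hat @ Sigma_post @ F_hat.T == F @ Sigma_post @ F.T + Q.

    With Sigma_post = Lp Lp^T and the target prior covariance Lm Lm^T,
    the choice F_hat = Lm @ inv(Lp) satisfies the identity exactly.
    """
    Sigma_prior = F @ Sigma_post @ F.T + Q                # classic prediction, eq. (5)
    Lm = np.linalg.cholesky(Sigma_prior)
    Lp = np.linalg.cholesky(Sigma_post)
    F_hat = np.linalg.solve(Lp.T, Lm.T).T                 # F_hat = Lm @ Lp^{-1}
    return F_hat, Sigma_prior

# Random SPD posterior covariance, random transition, diagonal process noise.
A = rng.standard_normal((4, 4))
Sigma_post = A @ A.T + 4 * np.eye(4)
F = rng.standard_normal((4, 4))
Q = np.diag(rng.uniform(0.1, 1.0, 4))

F_hat, Sigma_prior = absorb_process_noise(F, Sigma_post, Q)
assert np.allclose(F_hat @ Sigma_post @ F_hat.T, Sigma_prior)
```

The construction is not unique (any orthogonal rotation of Lm also works); the learned F̂_t is free to pick whichever factorization fits the data.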

Prediction Step. Similar to the model-based Kalman filter, by using the dynamics of the system, the next priors are obtained from the current posterior by

$$x_t^- = \hat{F}_t x_{t-1}^+, \qquad \Sigma_t^- = \hat{F}_t \Sigma_{t-1}^+ \hat{F}_t^T \quad (8)$$

where F̂_t is the learned transition matrix comprising the effects of the process noise from the previous section. By this, it is feasible to predict the state mean and the state covariance matrix.

Filtering Step. To obtain the next posteriors based on the new observation (w_t, r_t), i.e., the output of e(.) in figure 2, we use the KG matrix obtained from the GRU_KG network and the learned emission matrix Ĥ_t to update the state mean vector and state covariance matrix. This procedure is given by

$$S_t^- = \hat{H}_t \Sigma_t^- \hat{H}_t^T + R_t, \qquad K_t = \Sigma_t^- \hat{H}_t^T M_t M_t^T, \qquad M_t = \mathrm{GRU_{KG}}\big(f(\Sigma_t^-), R_t\big) \quad (9)$$

$$x_t^+ = x_t^- + K_t\big[w_t - \hat{H}_t x_t^-\big], \qquad \Sigma_t^+ = \Sigma_t^- - K_t S_t^- K_t^T. \quad (10)$$

In addition to avoiding the matrix inversion that arises in the computation of the Kalman gain and applying non-linearity to handle more complex dynamics, the architecture of the KG network, GRU_KG, can reduce the dimension of the input to its corresponding GRU cell, and thus reduces the total number of parameters quadratically. Additionally, the positive r_t vector and the Cholesky-factor consideration, M_t M_t^T in (9), guarantee the positive definiteness of the resulting covariance matrices.
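The filtering step of (9)-(10) can be sketched as follows. The trained GRU_KG is replaced here by a stand-in factor M chosen such that M M^T = (S_t^-)^{-1}; in that special case the gain must coincide with the classic Kalman gain, which gives a useful consistency check on the parameterization (it does not reflect the GIN's learned behavior).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3                                    # state and observation dimensions

def filter_step(x_prior, Sigma_prior, w, H, R, M):
    """GIN-style filtering step, eq. (9)-(10): gain built from a factor M."""
    S = H @ Sigma_prior @ H.T + R              # innovation covariance S_t^-
    K = Sigma_prior @ H.T @ (M @ M.T)          # K_t = Sigma^- H^T M M^T, no inversion
    x_post = x_prior + K @ (w - H @ x_prior)
    Sigma_post = Sigma_prior - K @ S @ K.T
    return x_post, Sigma_post, S

A = rng.standard_normal((n, n))
Sigma_prior = A @ A.T + n * np.eye(n)
x_prior = rng.standard_normal(n)
H = rng.standard_normal((m, n))
R = np.diag(rng.uniform(0.1, 1.0, m))
w = rng.standard_normal(m)

# Stand-in for GRU_KG: pick M so that M M^T = S^{-1} exactly.
S = H @ Sigma_prior @ H.T + R
M = np.linalg.cholesky(np.linalg.inv(S))
x_post, Sigma_post, _ = filter_step(x_prior, Sigma_prior, w, H, R, M)

# The result then matches the classic Kalman update.
K_classic = Sigma_prior @ H.T @ np.linalg.inv(S)
assert np.allclose(x_post, x_prior + K_classic @ (w - H @ x_prior))
assert np.allclose(Sigma_post, Sigma_prior - K_classic @ S @ K_classic.T)
```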

Smoothing Step. After obtaining the filtered states (x_{1:T}^+, Σ_{1:T}^+) in the filtering step, we employ Bayesian smoothing properties to get a smoothed version of the states. In this stage, we use the J_{1:T} matrices obtained from the GRU_SG network, the learned transition matrices F̂_{1:T} and the filtered states (x_{1:T}^+, Σ_{1:T}^+). The procedure in each smoothing step is given by:

$$J_t = \Sigma_t^+ \hat{F}_{t+1}^T N_t N_t^T, \qquad N_t = \mathrm{GRU_{SG}}\big(f(\Sigma_{t+1}^-)\big) \quad (11)$$

$$x_{t|T} = x_t^+ + J_t\big[x_{t+1|T} - \hat{F}_{t+1} x_t^+\big], \qquad \Sigma_{t|T} = \Sigma_t^+ + J_t\big[\Sigma_{t+1|T} - \hat{F}_{t+1} \Sigma_t^+ \hat{F}_{t+1}^T\big] J_t^T \quad (12)$$

where the first smoothing state is set to the last filtering state, i.e., (x_{T|T}, Σ_{T|T}) = (x_T^+, Σ_T^+).

Learning Dynamics. We can model the dynamics at each time step t as a function of the transferred observations w_{1:t-1}. However, conditioning on the noisy observations can distort the procedure of learning the dynamics. Instead, we use the state x_{t-1}^+ in the GSSM, which includes the history of the system with considerably lower noise distortion, to increase the system's noise robustness; we generate time-correlated noise in our experiments to show this robustness (see appendix A.5). In other words, the original transition and emission equations, x_t = f(x_{t-1}) + q_t and w_t = h(x_t) + r_t, are modeled as x_t = F̂_t(x_{t-1}^+) x_{t-1} + q_t and w_t = Ĥ_t(x_{t-1}^+) x_t + r_t. We learn K state transition and emission matrices F̂^k and Ĥ^k, and combine each one with the state-dependent coefficient α^k(x_{t-1}^+):

$$\hat{F}_t(x_{t-1}^+) = \sum_{k=1}^{K} \alpha_t^k(x_{t-1}^+)\, \hat{F}_t^k, \qquad \hat{H}_t(x_{t-1}^+) = \sum_{k=1}^{K} \alpha_t^k(x_{t-1}^+)\, \hat{H}_t^k \quad (13)$$

where a separate neural network with a softmax output, which we call the Dynamics Network, is utilized to learn α^k(x_{t-1}^+). This formulation enables us to follow the Bayesian methodology. Unlike classic LGSSMs that are not able to learn the dynamics, e.g., the EKF and UKF, the trainable dynamics in the GIN are functions of the states. For notational simplicity, we use (F̂_t, Ĥ_t) instead of (F̂_t(x_{t-1}^+), Ĥ_t(x_{t-1}^+)) in the paper.
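A minimal numpy sketch of the locally linear dynamics of eq. (13) and one smoothing step in the spirit of eq. (11)-(12). Random weights stand in for the trained Dynamics Network, and the exact RTS gain replaces the output N_t N_t^T of GRU_SG, so this illustrates only the data flow, not the learned networks; the toy smoothed moments for t+1 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 4, 3                                     # state dim, number of base matrices

def softmax(a):
    e = np.exp(a - a.max())                     # shift for numerical stability
    return e / e.sum()

# Base transition matrices and a stand-in Dynamics Network
# (a random linear map feeding a softmax; the paper trains an MLP here).
F_bases = rng.standard_normal((K, n, n))
W_alpha = rng.standard_normal((K, n))

def mixed_transition(x_post_prev):
    """Eq. (13): softmax-weighted combination of the K base matrices."""
    alpha = softmax(W_alpha @ x_post_prev)      # alpha_t^k(x_{t-1}^+)
    F_hat = np.tensordot(alpha, F_bases, axes=1)
    return alpha, F_hat

x_post = rng.standard_normal(n)
alpha, F_hat = mixed_transition(x_post)
assert np.isclose(alpha.sum(), 1.0) and F_hat.shape == (n, n)

# One smoothing step with the exact RTS gain standing in for GRU_SG.
A = rng.standard_normal((n, n))
Sigma_post = A @ A.T + n * np.eye(n)
Sigma_prior_next = F_hat @ Sigma_post @ F_hat.T + 0.1 * np.eye(n)  # jitter for invertibility
J = Sigma_post @ F_hat.T @ np.linalg.inv(Sigma_prior_next)
x_next_sm = rng.standard_normal(n)              # toy smoothed mean at t+1
Sigma_next_sm = 0.5 * Sigma_prior_next          # toy smoothed covariance at t+1
x_sm = x_post + J @ (x_next_sm - F_hat @ x_post)
Sigma_sm = Sigma_post + J @ (Sigma_next_sm - Sigma_prior_next) @ J.T
```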

5. FITTING

For the state estimation task, by implementing p(w_{1:T}|o_{1:T}), p(x_t|w_{1:T}) and p(s_t|x_t) with the encoder, the smoothing parameterization and the decoder, we maximize the log-likelihood of the output p(s_t|o_{1:T}) = ∫ p(s_t|x_t) p(x_t|w_{1:T}) p(w_{1:T}|o_{1:T}) dx_t dw_{1:T}, where s_t is the estimated state, i.e., equal to o_t^+ in figure 4. For the image imputation task, in addition to the state likelihood, we add the reconstruction pseudo-likelihood for inferring images by using Bernoulli distributions as p(i_t|x_t), i.e., the decoder in figure 4 maps to both the state s_t and the image i_t: o_t^+ = [i_t, s_t]. Further details of the distribution assumptions and hyper-parameter optimization can be found in appendices A.4 and A.8. After the training phase, forecasting a desired number of time steps is possible by plugging the new value x_t = F̂_t x_{t-1} into the model recursively.
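To make the two likelihood terms concrete, the sketch below evaluates a diagonal-Gaussian state log-likelihood and a Bernoulli reconstruction pseudo-likelihood, combined with a weight lambda. The decoder outputs are random stand-in arrays and all shapes are illustrative assumptions; the actual GIN decoder is a neural network.

```python
import numpy as np

def gaussian_loglik(s, mean, var):
    """Diagonal-Gaussian log N(s | mean, diag(var)), summed over dims and time."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (s - mean) ** 2 / var)

def bernoulli_loglik(i, p, eps=1e-6):
    """Pixel-wise Bernoulli log-likelihood for the image reconstruction term."""
    p = np.clip(p, eps, 1 - eps)                # guard against log(0)
    return np.sum(i * np.log(p) + (1 - i) * np.log(1 - p))

rng = np.random.default_rng(3)
T, ds, dp = 5, 2, 8                             # time steps, state dim, pixel count
s = rng.standard_normal((T, ds))                # ground-truth states
mean = rng.standard_normal((T, ds))             # stand-in for dec_mean(x_{t|T})
var = rng.uniform(0.5, 1.5, (T, ds))            # stand-in for dec_covar(Sigma_{t|T})
i = rng.integers(0, 2, (T, dp)).astype(float)   # ground-truth binary pixels
p = rng.uniform(0.01, 0.99, (T, dp))            # stand-in decoder pixel probabilities

lam = 0.5                                       # reconstruction weight lambda (illustrative)
total = gaussian_loglik(s, mean, var) + lam * bernoulli_loglik(i, p)
assert np.isfinite(total)
```

In training one would maximize `total` (minimize its negative) with respect to the network parameters.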

Likelihood for Inferring States.

Consider the ground-truth sequence defined as s_{1:T}. We determine the log-likelihood of the states as:

$$\mathcal{L}(s_{1:T}) = \sum_{t=1}^{T} \log \mathcal{N}\big(s_t \mid \mathrm{dec_{mean}}(x_{t|T}), \mathrm{dec_{covar}}(\Sigma_{t|T})\big) \quad (14)$$

where dec_mean(.) and dec_covar(.) denote the parts of the decoder used to obtain the state mean and state variance, respectively. We use a Wishart distribution as a prior for the estimated covariance matrix, which pushes the estimated covariance toward a scaled identity matrix, where the scale is a hyper-parameter. Such a prior prevents obtaining a high log-likelihood due to high uncertainty.

Likelihood for Inferring Images. For the imputation task, consider the ground truth as the sequence of images and their corresponding states, defined as [s_{1:T}, i_{1:T}], where the dimension of i_t is D_o. We determine the log-likelihood:

$$\mathcal{L}(o_{1:T}^+) = \mathcal{L}(s_{1:T}) + \lambda \sum_{t=1}^{T} \sum_{k=1}^{D_o} \Big[ i_t^{(k)} \log \mathrm{dec}_k(x_{t|T}) + \big(1 - i_t^{(k)}\big) \log\big(1 - \mathrm{dec}_k(x_{t|T})\big) \Big] \quad (15)$$

dec_k(x_t) denotes the part of the decoder that maps to the k-th pixel of the image i_t, and the constant λ determines the importance of the reconstruction. The first term on the RHS is obtained from (14) and we abbreviate the second term as L(i_{1:T}).

6. EXPERIMENTS

6.1. HIGH DIMENSIONAL OBSERVATION

We include three high-dimensional experiments. The first two experiments are the single pendulum and the double pendulum, where the dynamics of the latter are more complicated. The last experiment is a visual odometry task. Intuitive python code is in appendix A.12.1.

6.1.1. SINGLE PENDULUM AND DOUBLE PENDULUM

The inputs of the encoder are images of size 24 × 24. The angular velocity is disturbed by transition noise that follows a Normal distribution with standard deviation σ = 0.1 at each step. In the pendulum experiments, we perform filtering-smoothing with the GIN where the observation is distorted by high observation noise. Furthermore, we compare the GIN with an LGSSM, where the GRU cells are omitted from the GIN structure and the classic filtering-smoothing equations are used instead. The log-likelihood and squared error (SE) of positions for the single and double pendulum are given in tables 2 and 1, respectively. Generated samples from the trained smooth-filter distributions are in appendix figures 16-33. By randomly deleting half of the images from the generated sequences, we conduct the image imputation task with our model by predicting the missing parts, where the missingness applied to train and test is random and not the same. The results are in tables 3 and 4. The GIN outperforms all the other models, although the variational inference models, KVAE and EKVAE, have more complex structures. We include results using the MSE as well, to illustrate that our approach is also competitive in prediction accuracy (see A.10).

Table 3: Image imputation task for the different models. Models contain boolean masks determining the available and missing images. For uninformed masks, a black image is fed to the cell whenever an image is missing, which requires the model to infer accurate dynamics for the generation. We conduct the uninformed experiment as well.

6.1.2. VISUAL ODOMETRY OF KITTI DATASET

We also evaluate the GIN with higher-dimensional observations on the visual odometry task on the KITTI dataset Geiger et al. (2012). This dataset consists of 11 separate image sequences with their corresponding labels. In order to extract the positional features, we use the feature extractor network proposed by Zhou et al. (2017). The obtained features are considered the observations of the GIN, i.e., (w, r). Additionally, we compare the results with LSTM, GRU, DeepVO Wang et al. (2017) and KVAE. The results are in table 5 and figure 8, where the common evaluation scheme for the KITTI dataset is exploited. The results of the KVAE degrade substantially, as we have to reduce the size of the transferred observation to prevent the complexity of matrix inversion.

6.2. LOW DIMENSIONAL OBSERVATION WITH PRESENCE OF DYNAMICS

We conduct two experiments, the Lorenz attractor problem and the real-world NCLT dataset, where we are aware of the dynamics. Intuitive python code is in appendix A.12.2. However, the GIN is also able to deal with cases in which the dynamics are not known. To show this, we conduct additional experiments in the appendix, where we do not give the dynamics information of the Lorenz attractor and NCLT dataset to the model, so that they are learned (see figures 42-46).

6.2.1. LORENZ ATTRACTOR

Figure 7: MSE of the Lorenz attractor.

The Lorenz system is a system of ordinary differential equations that describes a non-linear dynamical system used for atmospheric convection. Due to the non-linear dynamics of this chaotic system (see A.6), it is a good evaluation for the GIN cell. We evaluate the performance of the GIN on a trajectory of length 5k. Each point in the generated trajectories is distorted with observation noise that follows a Gaussian distribution with standard deviation σ = 0.5. The likelihood with a Gaussian distribution is calculated and maximized in the training phase. The mean square error (MSE) of the test data for various numbers of training samples is reported in figure 7. DSSM (Rangapuram et al., 2018) is a version of LGSSM using LSTM cells. Due to the non-linearity of the dynamics of this system, the LGSSM has to linearize and then use the linearized dynamics to model the transition. The DSSM model performs better for larger amounts of data (>10K) because it needs to learn the dynamics. The results of the Hybrid GNN and the GIN are similar, while the results of the GIN are slightly improved. Although the core of both models is based on the GRU cell, this enhancement may come from the structure of the GIN, which learns the observation and process noises separately. Inferred trajectories are in figure 1.
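A plausible data-generation setup for this experiment (the paper does not state its integration scheme, so step size and initial condition are assumptions) is to Euler-integrate the Lorenz equations with the standard parameters and add Gaussian observation noise with σ = 0.5:

```python
import numpy as np

def lorenz_trajectory(T=5000, dt=0.01, sigma_obs=0.5, seed=0):
    """Euler-integrate the Lorenz system and add Gaussian observation noise.

    Standard parameters (sigma=10, rho=28, beta=8/3) are assumed.
    Returns the clean states and the noisy observations, both (T, 3).
    """
    rng = np.random.default_rng(seed)
    s, r, b = 10.0, 28.0, 8.0 / 3.0
    x = np.empty((T, 3))
    x[0] = (1.0, 1.0, 1.0)
    for t in range(1, T):
        px, py, pz = x[t - 1]
        dx = s * (py - px)
        dy = px * (r - pz) - py
        dz = px * py - b * pz
        x[t] = x[t - 1] + dt * np.array([dx, dy, dz])
    obs = x + rng.normal(0.0, sigma_obs, size=x.shape)
    return x, obs

states, observations = lorenz_trajectory()
assert states.shape == observations.shape == (5000, 3)
```

The model sees only `observations`; `states` serves as the ground truth for the MSE.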

6.2.2. REAL WORLD DYNAMICS: MICHIGAN NCLT DATASET

To evaluate the performance of the GIN on a real-world dataset, we use the Michigan NCLT dataset Carlevaris-Bianco et al. (2016), which encompasses a collection of navigation data gathered by a Segway robot moving inside the University of Michigan's North Campus. The states at each time, x_t ∈ R^4, comprise the position and the velocity in each direction, and the observations, y_t ∈ R^2, include noisy positions. The ultimate purpose is to localize the real position of the Segway robot while only noisy GPS observations are available. We apply the GIN to find the current location of the Segway robot.

7. CONCLUSION

The GIN, an approach for representation learning in both high- and low-dimensional SSMs, is introduced in this paper. The data flow is conducted by Bayesian filtering-smoothing, while, due to the usage of the GRU-based KG and SG networks, the computational issues are tackled, resulting in an efficient model with numerically stable results. In the presence of dynamics, the GIN directly uses them; otherwise it learns them in an end-to-end manner, which makes the GIN a HW model with strong system identification abilities. This approach incorporates an insightful representation of the uncertainty of the predictions, while it outperforms its counterparts including LSTMs, GRUs and several generative models with variational inference.

A.1 GAUSSIAN STATE SPACE MODELS

In order to model the vectors of a time series w = w_{1:T} = [w_1, ..., w_T], Gaussian state-space models (GSSMs) are commonly applied due to their filtering-smoothing ability. In fact, GSSMs model a first-order Markov process on the state space x = [x_1, ..., x_T], which can also include an external control input u = [u_1, ..., u_T], by the multivariate normality assumption of the state:

$$p_{\gamma_t}(x_t|x_{t-1}, u_t) = \mathcal{N}(x_t;\, F_t x_{t-1} + B_t u_t,\, Q), \qquad p_{\gamma_t}(w_t|x_t) = \mathcal{N}(w_t;\, H_t x_t,\, R) \quad (16)$$

For cases that are not controlled via an external input, the matrix B_t is simply the zero matrix. Defining γ_t as the parameters which explain how the state changes over time, it contains the information of F_t, B_t and H_t, which are the state transition, control and emission matrices. In each step, the procedure is distorted via Q and R, the process noise and observation noise, respectively. It is common to initialize the first state as x_1 ~ N(0, Σ_0); then the joint probability distribution of the GSSM is

$$p_\gamma(w, x|u) = p_\gamma(w|x)\, p_\gamma(x|u) = \prod_{t=1}^{T} p_{\gamma_t}(w_t|x_t) \cdot p(x_1) \prod_{t=2}^{T} p_{\gamma_t}(x_t|x_{t-1}, u_t).$$

GSSMs have substantial properties that we can utilize. Filtering and smoothing are among these properties, which allow us to obtain the filtered and smoothed posteriors based on the priors and observations. By applying classic Bayesian properties, we have a strong tool to handle missing data in the image imputation task.
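The generative model above can be sampled directly. The sketch below draws one trajectory from a time-invariant linear GSSM with B_t u_t = 0 (the uncontrolled case); the constant-velocity F and position-only H are illustrative choices, not parameters from the paper's experiments.

```python
import numpy as np

def sample_gssm(F, H, B, Q, R, Sigma0, u, seed=0):
    """Draw one trajectory (x_{1:T}, w_{1:T}) from the linear GSSM of eq. (16).

    Time-invariant F, H, B are assumed for brevity; x_1 ~ N(0, Sigma0).
    """
    rng = np.random.default_rng(seed)
    T, n, m = len(u), F.shape[0], H.shape[0]
    x = np.empty((T, n))
    w = np.empty((T, m))
    x[0] = rng.multivariate_normal(np.zeros(n), Sigma0)
    for t in range(1, T):
        x[t] = rng.multivariate_normal(F @ x[t - 1] + B @ u[t], Q)   # transition
    for t in range(T):
        w[t] = rng.multivariate_normal(H @ x[t], R)                  # emission
    return x, w

n, m, T = 2, 1, 50
F = np.array([[1.0, 0.1], [0.0, 1.0]])       # constant-velocity dynamics
H = np.array([[1.0, 0.0]])                   # observe position only
B = np.zeros((n, 1))
u = np.zeros((T, 1))                         # uncontrolled case: B u = 0
Q, R, Sigma0 = 0.01 * np.eye(n), 0.25 * np.eye(m), np.eye(n)
x, w = sample_gssm(F, H, B, Q, R, Sigma0, u)
assert x.shape == (T, n) and w.shape == (T, m)
```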

A.2 FILTERING AND SMOOTHING PARAMETERIZATION

The Kalman filter applies two iterative steps: in the former, a prediction is made from the prior state information, while in the latter, an update is made based on the obtained observation. Under the normality assumption of known additive process and observation noise, the filter can go through the two mentioned steps. In the prediction step, the filter uses the transition matrix F to estimate the next priors (x_{t+1}^-, Σ_{t+1}^-), which are the estimates of the next states without any observation:

$$x_{t+1}^- = F x_t^+, \qquad \Sigma_{t+1}^- = F \Sigma_t^+ F^T + Q, \qquad Q = \sigma_{\mathrm{trans}}^2 I$$

In the presence of a new observation, the Kalman filter goes through the second step and modifies the predicted prior based on the new observation and the emission matrix H, resulting in the next posterior (x_{t+1}^+, Σ_{t+1}^+):

$$K_{t+1} = \Sigma_{t+1}^- H^T \big[H \Sigma_{t+1}^- H^T + R\big]^{-1}$$

$$x_{t+1}^+ = x_{t+1}^- + K_{t+1}\big(w_{t+1} - H x_{t+1}^-\big), \qquad \Sigma_{t+1}^+ = \Sigma_{t+1}^- - K_{t+1} H \Sigma_{t+1}^-$$

The whole observation update can be considered a weighted mean between the next prior, which comes from the state update, and the new observation, where the weighting is a function of Q and R and reflects the uncertainties. We derive the smoothing parameterization, where the key idea is to use the Markov property, which states that x_t is independent of the future observations w_{t+1:T} as long as x_{t+1} is known. However, we are not aware of x_{t+1}, but there is a distribution over it. So by conditioning on x_{t+1} and then marginalizing it out, it is possible to obtain x_t conditioned on w_{1:T}.
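The two steps read as follows in a minimal numpy sketch with time-invariant F, H, Q, R; the constant-velocity system is an illustrative example, not one of the paper's experiments.

```python
import numpy as np

def kf_predict(x_post, Sigma_post, F, Q):
    """Prediction step: next priors from the current posteriors."""
    return F @ x_post, F @ Sigma_post @ F.T + Q

def kf_update(x_prior, Sigma_prior, w, H, R):
    """Observation update: classic Kalman gain with explicit matrix inversion."""
    S = H @ Sigma_prior @ H.T + R
    K = Sigma_prior @ H.T @ np.linalg.inv(S)
    x_post = x_prior + K @ (w - H @ x_prior)
    Sigma_post = Sigma_prior - K @ H @ Sigma_prior
    return x_post, Sigma_post

# 1-D constant-velocity example: position observed, velocity latent.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
x, Sigma = np.zeros(2), np.eye(2)
for w in [np.array([0.9]), np.array([2.1]), np.array([2.8])]:
    x, Sigma = kf_predict(x, Sigma, F, Q)
    prior_trace = np.trace(Sigma)
    x, Sigma = kf_update(x, Sigma, w, H, R)
    assert np.trace(Sigma) < prior_trace    # the update never adds uncertainty
```

The explicit `np.linalg.inv` here is exactly the operation the GIN's GRU_KG replaces.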
p(x_t | w_1:T) = ∫ p(x_t | x_{t+1}, w_1:T) p(x_{t+1} | w_1:T) dx_{t+1} = ∫ p(x_t | x_{t+1}, w_1:t) p(x_{t+1} | w_1:T) dx_{t+1},

where the Markov property removes w_{t+1:T} from the first factor. Using induction and the smoothed distribution for t+1, p(x_{t+1} | w_1:T) = N(x_{t+1|T}, Σ_{t+1|T}), we calculate the filtered two-slice distribution as

p(x_t, x_{t+1} | w_1:t) = N( [x_t^+ ; x_{t+1}^-],  [Σ_t^+ , Σ_t^+ F_{t+1}^T ; F_{t+1} Σ_t^+ , Σ_{t+1}^-] ),

and by Gaussian conditioning we have

p(x_t | x_{t+1}, w_1:t) = N( x_t^+ + J_t (x_{t+1} - F_{t+1} x_t^+),  Σ_t^+ - J_t Σ_{t+1}^- J_t^T ),   where J_t = Σ_t^+ F_{t+1}^T [Σ_{t+1}^-]^{-1}.

We calculate the smoothed distribution for t using the laws of iterated expectation and covariance:

x_{t|T} = E[ E[x_t | x_{t+1}, w_1:T] | w_1:T ]
        = E[ E[x_t | x_{t+1}, w_1:t] | w_1:T ]
        = E[ x_t^+ + J_t (x_{t+1} - F_{t+1} x_t^+) | w_1:T ]
        = x_t^+ + J_t (x_{t+1|T} - F_{t+1} x_t^+),

Σ_{t|T} = cov[ E[x_t | x_{t+1}, w_1:T] | w_1:T ] + E[ cov[x_t | x_{t+1}, w_1:T] | w_1:T ]
        = cov[ E[x_t | x_{t+1}, w_1:t] | w_1:T ] + E[ cov[x_t | x_{t+1}, w_1:t] | w_1:T ]
        = cov[ x_t^+ + J_t (x_{t+1} - F_{t+1} x_t^+) | w_1:T ] + E[ Σ_t^+ - J_t Σ_{t+1}^- J_t^T | w_1:T ]
        = J_t cov[ x_{t+1} - F_{t+1} x_t^+ | w_1:T ] J_t^T + Σ_t^+ - J_t Σ_{t+1}^- J_t^T
        = J_t Σ_{t+1|T} J_t^T + Σ_t^+ - J_t Σ_{t+1}^- J_t^T
        = Σ_t^+ + J_t (Σ_{t+1|T} - Σ_{t+1}^-) J_t^T.
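One backward step of this smoothing recursion can be sketched as follows (a minimal NumPy illustration of the derivation above, under the assumption of known F and Q; `rts_step` is a hypothetical helper name):

```python
import numpy as np

def rts_step(x_post, S_post, x_sm_next, S_sm_next, F, Q):
    """One backward smoothing step: (x_t|T, Sigma_t|T) from the filtered
    posterior at t and the smoothed estimate at t+1."""
    x_pri_next = F @ x_post                # x_{t+1}^-
    S_pri_next = F @ S_post @ F.T + Q      # Sigma_{t+1}^-
    # smoothing gain J_t = Sigma_t^+ F^T [Sigma_{t+1}^-]^{-1}
    J = S_post @ F.T @ np.linalg.inv(S_pri_next)
    x_sm = x_post + J @ (x_sm_next - x_pri_next)
    S_sm = S_post + J @ (S_sm_next - S_pri_next) @ J.T
    return x_sm, S_sm
```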

A.3 PROCESS NOISE MATRIX

As stated in (18), we can write the process noise matrix at time t in more detail:

Q_t = Σ_t^- - F_t Σ_{t-1}^+ F_t^T
    = Σ_t^- - F_t ( Σ_{t-1}^- - K_{t-1} [H_{t-1} Σ_{t-1}^- H_{t-1}^T + R_{t-1}] K_{t-1}^T ) F_t^T.   (28)

Substituting (18) into (28) results in

Q_t = Σ_t^- - F_t ( [F_{t-1} Σ_{t-2}^+ F_{t-1}^T + Q_{t-1}] - K_{t-1} [H_{t-1} (F_{t-1} Σ_{t-2}^+ F_{t-1}^T + Q_{t-1}) H_{t-1}^T + R_{t-1}] K_{t-1}^T ) F_t^T,   (29)

which is a function of F_t, Q_{t-1}, F_{t-1} and H_{t-1}. In the GIN, F̂_t and Ĥ_t are learned by the Dynamics Network with input x_{t-1}^+. From (20), x_{t-1}^+ is derived as a function of both F_{t-1} and H_{t-1}, meaning the learned F̂_t carries the information of both H_{t-1} and F_{t-1}. Therefore, one can rewrite equation (29) as

Q_t = g( F̂_t(x_{t-1}^+), Q_{t-1} ),   where F̂_t = Dynamics Network( x_{t-1}^+(H_{t-1}, F_{t-1}) ),   (30)

where g is a nonlinear function mapping x_{t-1}^+ and Q_{t-1} to Q_t; the graphical model for this choice of structure is shown in figure 9b. It is possible to go one step further and expand x_{t-1}^+, as it contains a Σ_{t-1}^- term in (20); combining it with (18) results in

x_{t-1}^+ = x_{t-1}^- + [F_{t-1} Σ_{t-2}^+ F_{t-1}^T + Q_{t-1}] H_{t-1}^T ( H_{t-1} [F_{t-1} Σ_{t-2}^+ F_{t-1}^T + Q_{t-1}] H_{t-1}^T + R_{t-1} )^{-1} (w_{t-1} - H_{t-1} x_{t-1}^-),

indicating that not only F_{t-1} and H_{t-1}, but also Q_{t-1}, is contained in x_{t-1}^+, meaning that Q_t can be written solely as a function of x_{t-1}^+; the graphical model for this choice is shown in figure 9a:

Q_t = g( F̂_t(x_{t-1}^+) ),   where F̂_t = Dynamics Network( x_{t-1}^+(H_{t-1}, F_{t-1}, Q_{t-1}) ).   (32)

We call g the Q Network; based on the above, g can be modeled by an MLP (32) or a recurrent network (30). Figure 11 shows how the Q Network is integrated into the whole model structure.
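A toy sketch of the MLP variant of g in (32) is given below. This is a NumPy illustration only, with hypothetical weights W1, b1, W2, b2 (the paper's Q Network is a learned Tensorflow module); the Elu + 1 output keeps the diagonal of Q strictly positive:

```python
import numpy as np

def elu_plus_one(z):
    # Elu + 1: z + 1 for z > 0, exp(z) otherwise -- always strictly positive
    return np.where(z > 0, z + 1.0, np.exp(z))

def q_network_mlp(x_post, W1, b1, W2, b2):
    """Toy MLP version of g: posterior state x_{t-1}^+ -> diagonal Q_t.
    All weights are hypothetical placeholders."""
    h = np.maximum(0.0, W1 @ x_post + b1)  # ReLU hidden layer
    q_diag = elu_plus_one(W2 @ h + b2)     # positive diagonal entries of Q_t
    return np.diag(q_diag)
```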

A.4 OUTPUT DISTRIBUTION

In the case of grayscale images, consider each pixel y_i to be one or zero with probability p_i or 1 - p_i respectively, i.e. P(Y = y) = p^y (1 - p)^{1-y}. Rewriting this probability in exponential family form,

f_θ(y) = h(y) exp(θ y - ψ(θ))  →  e^{log(p^y (1-p)^{1-y})} = e^{y log(p/(1-p)) + log(1-p)},

and choosing θ = log(p/(1-p)) and ψ(θ) = -log(1-p), we obtain p = 1/(1 + e^{-θ}). This means that by taking θ as the last layer of the decoder and applying a sigmoid, p is obtained. Equivalently, one can calculate the deviance between the real p and its estimate p̂,

D(p, p̂) = p log(p/p̂) + (1-p) log((1-p)/(1-p̂)),

and minimize the deviance with respect to p̂ as we did in (15). Similarly, let x, x̂_θ and θ denote the ground truth state, the estimated state and the model variables respectively, where the residual follows a Gaussian distribution, x = x̂_θ + ε ∼ N(x̂_θ, σ̂_θ), with σ̂_θ the estimated variance. Then the negative log likelihood is given by (35), as obtained in (14):

-log(L) ∝ (1/2) log(σ̂_θ) + (x - x̂_θ)^2 / (2 σ̂_θ).   (35)

A.5 NOISE GENERATION PROCESS

In the high dimensional observation experiments, to show the noise robustness of the system, we use a time-correlated noise generation scheme. It makes the noise factors correlated over time by introducing a sequence of factors f_t of the same length as the data sequence. Let f_0 ∼ U(0, 1) and f_{t+1} = min(max(0, f_t + r_t), 1) with r_t ∼ U(-0.2, 0.2), where f_0 is the initial factor and U is the uniform distribution. Then, defining two thresholds, t_1 ∼ U(0, 0.25) and t_2 ∼ U(0.75, 1), values f_t < t_1 are set to 0, values f_t > t_2 are set to 1, and the rest are mapped linearly onto the range [0, 1]. The t-th observation is given by o_t = f_t i_t + (1 - f_t) i_t^pn, where i_t is the t-th true image and i_t^pn is the t-th generated pure-noise image.
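The factor sequence and the mixing step described above can be sketched as follows (a minimal NumPy rendering of the scheme; the function names are hypothetical):

```python
import numpy as np

def correlated_noise_factors(T, rng):
    """Generate the time-correlated mixing factors f_1:T."""
    f = np.empty(T)
    f[0] = rng.uniform(0.0, 1.0)
    for t in range(T - 1):
        # random walk with U(-0.2, 0.2) increments, clipped to [0, 1]
        f[t + 1] = min(max(0.0, f[t] + rng.uniform(-0.2, 0.2)), 1.0)
    t1, t2 = rng.uniform(0.0, 0.25), rng.uniform(0.75, 1.0)
    # below t1 -> 0, above t2 -> 1, linear in between
    return np.clip((f - t1) / (t2 - t1), 0.0, 1.0)

def distort(images, pure_noise, f):
    """o_t = f_t * i_t + (1 - f_t) * i_t^pn, applied per frame."""
    return f[:, None, None] * images + (1.0 - f)[:, None, None] * pure_noise
```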

A.6 LORENZ ATTRACTOR DYNAMICS

A Lorenz system is modeled by three differential equations in x, the convection rate, y, the horizontal temperature variation, and z, the vertical temperature variation:

dx/dt = σ(y - x),    dy/dt = x(ρ - z) - y,    dz/dt = xy - βz,   (36)

where the constants σ, ρ and β are 10, 28 and 8/3, respectively. To construct a trajectory we integrate the Lorenz equations (36) with dt = 10^{-5}, then sample from the result with a step time of Δt = 0.01. Based on the system equations (36), the state is s_t = [x_t, y_t, z_t]; we can write the dynamics of the system as A_t and obtain the transition matrix F_t = Exp[A_t Δt], using the Taylor expansion of the matrix exponential to degree 5:

ṡ_t = A_t s_t = [ -10, 10, 0 ; 28 - z, -1, 0 ; y, 0, -8/3 ] [x ; y ; z],    F_t = Exp[A_t Δt] = I + Σ_{j=1}^{J} (A_t Δt)^j / j!,   (37)

where J is the degree of the expansion and I is the identity matrix. For the emission matrix we use H_t = I, and for the process and observation noise we use Q_t = (1/100) σ^2 I and R_t = σ^2 I, respectively.
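The state-dependent matrix A_t and the truncated matrix exponential in (37) can be sketched as (a minimal NumPy illustration of the construction above):

```python
import numpy as np

def lorenz_A(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """State-dependent dynamics matrix A_t, so that s_dot = A_t s."""
    x, y, z = s
    return np.array([[-sigma, sigma, 0.0],
                     [rho - z, -1.0, 0.0],
                     [y, 0.0, -beta]])

def transition_matrix(A, dt=0.01, J=5):
    """F_t = Exp[A_t dt] approximated by a degree-J Taylor expansion."""
    F = np.eye(3)
    term = np.eye(3)
    for j in range(1, J + 1):
        term = term @ (A * dt) / j   # (A dt)^j / j!, built incrementally
        F = F + term
    return F
```

For Δt = 0.01 the series converges rapidly, so the degree-5 truncation is already close to the exact matrix exponential.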

A.7 MOVEMENT MODEL DETAILS FOR THE NCLT EXPERIMENT

We assume that the segway robot moves with a constant velocity; the equations for such dynamics are

∂p_1/∂t = v_1,  ∂p_2/∂t = v_2,  ∂v_1/∂t = 0,  ∂v_2/∂t = 0,    x_t = [p_1, v_1, p_2, v_2],  y_t = [p_1, p_2].   (38)

Under these motion equations, the transition, process noise distribution, emission and measurement noise distribution matrices are

F = [ 1, Δt, 0, 0 ; 0, 1, 0, 0 ; 0, 0, 1, Δt ; 0, 0, 0, 1 ],    Q = σ^2 [ Δt, 0, 0, 0 ; 0, Δt, 0, 0 ; 0, 0, Δt, 0 ; 0, 0, 0, Δt ],
H = [ 1, 0, 0, 0 ; 0, 0, 1, 0 ],    R = λ^2 [ 1, 0 ; 0, 1 ],   (39)

where Δt = 1 since the sampling frequency is 1 Hz. The process and measurement variance parameters, σ and λ, are unknown and are learned by the model. We split the whole sequence into training, testing and validation folds of length 3600 (18 sequences of length T = 200), 280 (1 sequence of length T = 280) and 400 (2 sequences of length T = 200), respectively.
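The constant-velocity matrices above can be built directly (a minimal NumPy sketch; the noise scales sigma and lam are placeholders, since in the experiment the model learns them):

```python
import numpy as np

def cv_model(dt=1.0, sigma=0.1, lam=0.5):
    """Constant-velocity model matrices for state [p1, v1, p2, v2].
    sigma and lam are hypothetical values standing in for learned parameters."""
    F = np.array([[1.0, dt, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, dt],
                  [0.0, 0.0, 0.0, 1.0]])
    Q = sigma ** 2 * dt * np.eye(4)        # process noise covariance
    H = np.array([[1.0, 0.0, 0.0, 0.0],    # observe positions only
                  [0.0, 0.0, 1.0, 0.0]])
    R = lam ** 2 * np.eye(2)               # measurement noise covariance
    return F, Q, H, R
```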

A.8 NETWORK STRUCTURE AND PARAMETERS

In all experiments, the Adam optimizer Kingma & Ba (2014) has been used on an NVIDIA GeForce GTX 1050 Ti. We conduct a grid search over the hyperparameters to rule out the possibility of the models being trained with suboptimal hyperparameters. To find the initial learning rate, we run a grid search between 0.001 and 0.2 with an increment of 0.005 and select the value that yields the highest log-likelihood. With an initial learning rate of 0.006 and an exponential decay of rate 0.9 every 10 epochs, we employ backpropagation through time Werbos (1990) to compute the gradients, since we deploy GRU cells in the structure. Layer normalization Ba et al. (2016) is used to stabilize the dynamics in the recurrent structure and to normalize the filter response. An Elu + 1 activation function ensures the positiveness of the diagonal elements of the process noise and covariance matrices. To prevent the model from getting stuck in poor local minima, e.g. focusing on reconstruction instead of learning the dynamics obtained by filtering-smoothing, we find two training tricks useful for end-to-end learning: 1- Generating time-correlated noisy sequences as consecutive observations forces the model to learn the dynamics instead of focusing on reconstruction, e.g. figures 13 and 15. 2- For the first few epochs, only the auto-encoder (MLPs) and the globally learned parameters, e.g. F^(k) and H^(k), are trained, but not the Dynamics Network parameters α_t(x_{t-1}); afterwards, all parameters are learned jointly. This allows the system to first learn good embeddings and meaningful latent vectors, and then learn how to employ the K different dynamics variables. In the absence of known dynamics, we use K = 5 for the low dimensional observations and K = 15 for the high dimensional observations, as the latter need to learn more complex dynamics.
In general, if the GIN is flexible enough, tuning the parameters is not difficult, as the GIN is capable of learning how to prune unused elements via the Dynamics Network. To prevent the model from falling into mode collapse, we provide two solutions: 1- By introducing K sets of (F^(k), H^(k)), where each set models a different dynamic, we add a loss term with a small constant factor that tries to increase the distance between each pair of (F^(k), H^(k)) sets. Intuitively, the presence of different dynamics can easily modify the states in each update; we found this method a viable way to prevent the model from collapsing. 2- Adding the negative distance of consecutive pairs of states as an additional loss term with a small constant factor (the distance can be the Euclidean difference of the means or the KL divergence of two consecutive states). Intuitively, this solution forces consecutive states not to overlap and pushes them to change at each update step. In the simulation results, we have used the first option.
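The first option can be sketched as a simple penalty term (a NumPy illustration of the idea only; the weight value is a hypothetical placeholder for the paper's small constant factor, and for brevity only the F bases are penalized here):

```python
import numpy as np

def diversity_penalty(F_bases, weight=1e-3):
    """Negative summed pairwise squared distance between the K transition
    bases F^(k); minimizing the total loss therefore pushes the bases apart,
    discouraging mode collapse."""
    K = len(F_bases)
    dist = 0.0
    for i in range(K):
        for j in range(i + 1, K):
            dist += np.sum((F_bases[i] - F_bases[j]) ** 2)
    return -weight * dist
```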

A.8.1 PROPOSED ARCHITECTURE FOR HIGH AND LOW DIMENSIONAL OBSERVATIONS

The proposed structure for handling high dimensional observations in the absence of the dynamics (the first three experiments in the paper) is shown in figure 10, while the proposed structure for handling low dimensional observations in the presence of the dynamics (the last two experiments in the paper) is shown in figure 11.

A.8.2 EMPIRICAL RUNNING TIMES AND PARAMETERS

We present the number of parameters of the cell structures used in our experiments, and their corresponding empirical running times for one epoch, in tables 7 and 8. In the first row of each model

A.8.4 ENCODER/DECODER AND THE DYNAMICS NETWORK ARCHITECTURE

To design the dynamics network, we use an MLP with 60 hidden units, a ReLU activation function and a softmax activation in the last layer. The state mean, of size n, and the weights over the bases, of size K, are the input and output of the dynamics network, respectively. The structures of the encoder and decoder are given in table 10, where m is the transferred observation dimension; various values of this parameter are considered in the results. In the state estimation tasks, out_dim is 4 for the single-pendulum and 8 for the double-pendulum experiment. For the imputation task, the number of hidden units of the KG and SG networks is set to 40 and 30, respectively. The convolutional layer applied over the covariance matrix has 8 filters with a kernel size of 5.
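A minimal sketch of this dynamics network follows (NumPy only, with hypothetical placeholder weights; the softmax output weights the K global bases F^(k), so that F_t = Σ_k α_k F^(k)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamics_network(x_post, W1, b1, W2, b2, F_bases):
    """Sketch of the dynamics network: a ReLU hidden layer followed by a
    softmax head over the K bases; W1, b1, W2, b2 are hypothetical weights."""
    h = np.maximum(0.0, W1 @ x_post + b1)       # 60 hidden units in the paper
    alpha = softmax(W2 @ h + b2)                # mixing coefficients, size K
    F_t = np.tensordot(alpha, F_bases, axes=1)  # F_t = sum_k alpha_k F^(k)
    return F_t, alpha
```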

A.8.5 LORENZ ATTRACTOR AND NCLT EXPERIMENTS

In these two experiments, where we know the form of the dynamics, we employ a fully connected layer with the observations as input and output dimension 3 and 2 for the Lorenz attractor and NCLT experiments, respectively, to obtain the observation noise r; the activation function is Elu + 1. Similarly, another fully connected layer takes the posterior state as input, with output dimension 3 and 2 for the Lorenz attractor and NCLT experiments respectively, to obtain the uncertainty estimates o_σ^+. To estimate the process noise matrix, a fully connected layer with the posterior state as input and Elu + 1 activation is used; alternatively, a GRU cell with 10 hidden units that maps the posterior states to the process noise matrix can be used.

Table 9: Qualitative comparison of the GIN to recent related work (the compared properties are listed in A.8.3).

LSTM (Hochreiter & Schmidhuber, 1997)      ✓ ✓ ✓ ✓ ✓ × ✓ ✓
GRU (Cho et al., 2014)                     ✓ ✓ ✓ ✓ ✓ × ✓ ✓
P2T (Wahlström et al., 2015)               ✓ ✓ ✓ × ✓ × × ✓
E2C (Watter et al., 2015)                  ✓ ✓ ✓ × × ✓ × ×
BB-VI (Archer et al., 2015)                × ✓ ✓ × × ✓ ✓ ×
SIN (Krishnan et al., 2017)                ✓ ✓ ✓ × × ✓ ✓ ×
DVBF (Karl et al., 2016)                   ✓ ✓ ✓ × × ✓ ✓ ×
VSMC (Naesseth et al., 2018)               ✓ ✓ ✓ × × ✓ ✓ ×
DSA (Li & Mandt, 2018)                     ✓ ✓ ✓ × × ✓ × ×
KVAE (Fraccaro et al., 2017)               × ✓ ✓ × × ✓ ✓ ×
EKVAE (Klushyn et al., 2021)               × ✓ ✓ × × ✓ ✓ ×
rSLSD (Linderman et al., 2017)             × ✓ × ✓ ✓ ✓ × ×
DeepAR (Salinas et al., 2020)              × ✓ ✓ × ✓ ✓ × ✓
DSSM (Rangapuram et al., 2018)             × ✓ ✓ × ✓ ✓ × ✓
HybridGNN (Garcia Satorras et al., 2019)   × ✓ × ✓ ✓ × ✓ ✓
KalmanNet (Revach et al., 2021)            × ✓ × ✓ ✓ × ✓ ✓
SSI (Ruhe & Forré, 2021)                   × ✓ × ✓ ✓ ✓ ✓ ✓
LGSSM                                      × ✓ ×/✓ ✓ ✓ ✓ ✓ ✓
GIN                                        ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

A.9 VISUALIZATION AND THE IMPUTATION

Graphical results of informed, uninformed and noisy observations for the image imputation task, for both the single and double pendulum experiments, can be found in figures 12, 13, 14 and 15.
Inference results for the trained smoothing and filtering distributions of all high dimensional experiments are shown in figures 16-39, where we generate samples from the smoothing distribution, f(x_t | w_1:T), and the filtering distribution, f(x_t | w_1:t), and then fit a density to the generated samples. This visualization shows the effectiveness of the GIN in reducing the uncertainty of the estimates compared to LGSSM and KVAE. Finally, the results of the NCLT experiment are shown in figure 47.

We compare the learned KG-SG matrices produced by the GRU cells with their corresponding ground truth for the first 100 time steps of the low dimensional experiments. We calculate the element-wise squared difference of the learned KG and its ground truth, ΔKG_t = Tr[ (K̂G_t - KG_t)^T (K̂G_t - KG_t) ], and take the average over all ΔKG_t; a similar procedure holds for the SG. The results are provided in table 11. The MSE results for the single and double pendulum experiments are given in tables 12 and 13. In addition to (7), where the F matrix includes the effects of the process noise, the two other solutions introduced in section 4 are included in the MSE results as well: a GRU cell and an MLP mapping x^+ (their input) to Q (their output), denoted GRU(Q) and MLP(Q) in the tables, respectively.
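The discrepancy measure used for this comparison can be computed as follows (a minimal NumPy sketch; note that Tr[D^T D] is simply the squared Frobenius norm of D):

```python
import numpy as np

def kg_discrepancy(learned, truth):
    """Average of Delta_KG_t = Tr[(KG_hat_t - KG_t)^T (KG_hat_t - KG_t)]
    over time, i.e. the mean squared Frobenius distance between the
    learned and ground-truth gain matrices."""
    deltas = [np.trace((L - G).T @ (L - G)) for L, G in zip(learned, truth)]
    return float(np.mean(deltas))
```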



EVALUATION AND EXPERIMENTS

We divide our experiments into two parts: first, tasks in which the observation space is high dimensional, like sequences of images; second, applications where the observation is low dimensional by itself, so there is no need for an encoder for dimension reduction. The training algorithms for both cases are given in appendix section A.11.



Figure 1: Inferred 5k length trajectories for Lorenz attractor.

Figure 5: Pendulum image imputation. Each figure, from top to bottom, shows the ground truth, the uninformed observation and the imputation results of the GIN (smoothed). Missingness is applied randomly for train and test.

Figure 6: Double pendulum image imputation. Each figure, from top to bottom, shows the ground truth, the uninformed observation and the imputation results of the GIN (smoothed).

Figure 8: Generated samples from smoothing distribution for the joint position (x1, x2), equivalent to (o1 + , o2 + ) in figure 4, at 100-th time step of visual odometry experiment. The ground truth is shown with a black point.

(a) Without recurrent dependency on qt. (b) With recurrent dependency on qt.

Figure 9: Graphical models for low dimensional observations experiments.

Figure 10: Proposed architecture for operating high dimensional observations in the lack of dynamics.

Figure 11: Proposed architecture for operating low dimensional observations in the presence of dynamics.

Figure 12: Informed(left column) and uninformed(right column) image imputation task for the single pendulum experiments.

Figure 13: Image imputation task for the single pendulum experiment exposed to noisy observations, where the generated noise is correlated over time. Each figure, from top to bottom, shows the ground truth, the noisy observation and the imputation results of the GIN.

Figure 16: Inference for the single pendulum x1 position at the 100-th time step. Generated samples from the smoothed distribution, f(x1_100 | w_1:150), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x1 Pos_100 | w_1:150) is the ground truth state with distribution δ(x1_100 - 0.7). We calculate the sample mean and fit a distribution to the samples for further visualization and comparison purposes.

Figure 17: Inference for the single pendulum x2 position at the 100-th time step. Generated samples from the smoothed distribution, f(x2_100 | w_1:150), trained by the GIN, LGSSM and KVAE, respectively.

Figure 18: Generated samples from the trained smoothed joint distribution of the single pendulum position, (x1, x2), at the 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 19: Inference for the single pendulum x1 position at 100-th time step. Generated samples from filter distribution, f (x1 100 |w 1:100 ), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x1 Pos 100 |w 1:100 ) is the ground truth state with distribution of δ(x1 100 -0.7).

Figure 20: Inference for the single pendulum x2 position at 100-th time step. Generated samples from filter distribution, f (x2 100 |w 1:100 ), trained by the GIN, LGSSM and KVAE, respectively.

Figure 21: Generated samples from the trained filter joint distribution of the single pendulum position, (x1, x2), at 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 22: Inference for the double pendulum x1 position at the 100-th time step. Generated samples from the smoothed distribution, f(x1_100 | w_1:150), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x1 Pos_100 | w_1:150) is the ground truth state with distribution δ(x1_100 - 0.35).

Figure 23: Inference for the double pendulum x2 position at the 100-th time step. Generated samples from the smoothed distribution, f(x2_100 | w_1:150), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x2 Pos_100 | w_1:150) is the ground truth state with distribution δ(x2_100 - 0.35).

Figure 24: Generated samples from the trained smoothed joint distribution of the double pendulum first joint position, (x1, x2), at the 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 25: Inference for the double pendulum x3 position at the 100-th time step. Generated samples from the smoothed distribution, f(x3_100 | w_1:150), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x3 Pos_100 | w_1:150) is the ground truth state with distribution δ(x3_100 - 1).

Figure 26: Inference for the double pendulum x4 position at the 100-th time step. Generated samples from the smoothed distribution, f(x4_100 | w_1:150), trained by the GIN, LGSSM and KVAE, respectively.

Figure 27: Generated samples from the trained smoothed joint distribution of the double pendulum second joint position, (x3, x4), at the 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 28: Inference for the double pendulum x1 position at 100-th time step. Generated samples from filter distribution, f (x1 100 |w 1:100 ), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x1 Pos 100 |w 1:100 ) is the ground truth state with distribution of δ(x1 100 -0.35).

Figure 29: Inference for the double pendulum x2 position at 100-th time step. Generated samples from filter distribution, f (x2 100 |w 1:100 ), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x2 Pos 100 |w 1:100 ) is the ground truth state with distribution of δ(x2 100 -0.35).

Figure 30: Generated samples from the trained filter joint distribution of the double pendulum first joint position, (x1, x2), at 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 31: Inference for the double pendulum x3 position at 100-th time step. Generated samples from filter distribution, f (x3 100 |w 1:100 ), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x3 Pos 100 |w 1:100 ) is the ground truth state with distribution of δ(x3 100 -1).

Figure 32: Inference for the double pendulum x4 position at 100-th time step. Generated samples from filter distribution, f (x4 100 |w 1:100 ), trained by the GIN, LGSSM and KVAE, respectively.

Figure 33: Generated samples from the trained filter joint distribution of the double pendulum second joint position, (x3, x4), at 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 34: Inference for the visual odometry x1 position at the 100-th time step. Generated samples from the smoothed distribution, f(x1_100 | w_1:500), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x1 Pos_100 | w_1:500) is the ground truth state with distribution δ(x1_100 + 50).

Figure 35: Inference for the visual odometry x2 position at the 100-th time step. Generated samples from the smoothed distribution, f(x2_100 | w_1:500), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x2 Pos_100 | w_1:500) is the ground truth state with distribution δ(x2_100 - 10).

Figure 36: Generated samples from the trained smoothed joint distribution of the visual odometry position, (x1, x2), at the 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 37: Inference for the visual odometry x1 position at the 100-th time step. Generated samples from the filter distribution, f(x1_100 | w_1:100), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x1 Pos_100 | w_1:100) is the ground truth state with distribution δ(x1_100 + 50).

Figure 38: Inference for the visual odometry x2 position at the 100-th time step. Generated samples from the filter distribution, f(x2_100 | w_1:100), trained by the GIN, LGSSM and KVAE, respectively. The dashed red line (x2 Pos_100 | w_1:100) is the ground truth state with distribution δ(x2_100 - 10).

Figure 39: Generated samples from the trained filter joint distribution of the visual odometry position, (x1, x2), at the 100-th time step for the GIN, LGSSM and KVAE, respectively. The ground truth is shown with a black point.

Figure 42: Eigenvalues of the learned transition matrix F̂_t and their corresponding true values over the first 100 time steps for the Lorenz attractor experiment. Unlike the low dimensional experiments in the paper, where we give the dynamics (F, H) to the model, here we show the GIN's ability to learn the dynamics when the dynamics information, i.e. (F_t, H_t) in (37), is not provided.

Figure 44: Eigenvalues of the learned transition matrix F̂_t and their corresponding true values over the first 100 time steps for the NCLT dataset experiment. Unlike the low dimensional experiments in the paper, where we provided the dynamics (F, H) to the model, here we show the GIN's ability to learn the dynamics when the dynamics information, i.e. (F_t, H_t) in (39), is not provided.

Figure 46: Eigenvalues of the learned transition matrix F̂_t and their corresponding true values over the first 100 time steps for the NCLT dataset experiment. Unlike the low dimensional experiments in the paper, where we provided the dynamics (F, H) to the model, here we show the GIN's ability to learn the dynamics when the dynamics information, i.e. (F_t, H_t) in (39), is not provided.

Figure 47: NCLT dataset position for the first 50 observations: ground truth positions and the generated trajectories with the GIN, LGSSM, KalmanNet and DSSM approaches are illustrated.

Model                             F(Q)         MLP(Q)       GRU(Q)
LGSSM filter(m = 15, n = 30, K = 10)   0.089±0.009  0.088±0.005  0.088±0.006
LGSSM filter(m = 15, n = 30, K = 15)   0.088±0.011  0.087±0.007  0.086±0.004
LGSSM filter(m = 15, n = 45, K = 10)   0.085±0.004  0.084±0.007  0.084±0.009
LGSSM filter(m = 15, n = 45, K = 15)   0.084±0.005  0.083±0.004  0.082±0.004
GIN filter(m = 15, n = 30, K = 10)     0.078±0.013  0.076±0.005  0.075±0.004
GIN filter(m = 15, n = 30, K = 15)     0.078±0.014  0.075±0.009  0.074±0.012
GIN filter(m = 15, n = 45, K = 10)     0.074±0.010  0.073±0.008  0.072±0.009
GIN filter(m = 15, n = 45, K = 15)     0.073±0.015  0.074±0.011  0.071±0.005
GIN filter(m = 20, n = 40, K = 10)     0.072±0.005  0.072±0.008  0.070±0.002
GRU (units = 50, m = 40)               0.188±0.015
GRU (units = 100, m = 15)              0.173±0.009
GRU (units = 100, m = 20)              0.169±0.014
GRU (units = 100, m = 40)              0.166±0.018

Model                             F(Q)         MLP(Q)       GRU(Q)
LGSSM filter(m = 15, n = 30, K = 10)   0.154±0.013  0.159±0.021  0.153±0.009
LGSSM smooth(m = 20, n = 40, K = 15)   0.134±0.011  0.129±0.014  0.129±0.022
LGSSM smooth(m = 20, n = 60, K = 10)   0.123±0.019  0.116±0.016  0.115±0.013
LGSSM smooth(m = 20, n = 60, K = 15)   0.120±0.010  0.112±0.009  0.108±0.014
GIN filter(m = 15, n = 30, K = 10)     0.126±0.014  0.125±0.012  0.125±0.011
GIN smooth(m = 20, n = 60, K = 10)     0.086±0.013  0.081±0.008  0.079±0.009
GIN smooth(m = 20, n = 60, K = 15)     0.

r_1:T, F_1:T, H_1:T = inputs
for w_t, r_t, F_t, H_t in (w_1:T, r_1:T, F_1:T, H_1:T):
    x_t_-, Sigma_t_- = Prediction(F_t, H_t, self.x_tm1_+, self.Sigma_tm1_+)
    x_t_+, Sigma_t_+ = Filtering(x_t_-, Sigma_t_-, w_t, r_t, H_t)
    self.filter_states.append([x_t_+, Sigma_t_+])
    self.x_tm1_+, self.Sigma_tm1_+ = x_t_+, Sigma_t_+
x_1:T_T, Sigma_1:T_T = Smoothing(self.filter_states, ...)
...
x_1:T_T, Sigma_1:T_T = self.GIN_CELL_OBJ(o_1:T, r_1:T, ...)
o_1:T_+, Sigma_o_1:T_+ = GIN(Data, Dynamics_Matrices)

Double pendulum state estimation. (x1, x2) refers to the position of the first joint, while (x3, x4) is for the second joint.

Pendulum state estimation. By considering n = 3m, intuitively the last part of the state is dedicated to the acceleration information, leading to a higher likelihood.

Image imputation for double pendulum.

Comparison of model performance on the KITTI dataset. See figures 34, 35, 36, 37, 38 and 39 in A.9 for the visualization results.

MSE for NCLT experiment.


A.8.3 QUALITATIVE COMPARISON OF THE GIN TO RECENT RELATED WORK

In table 9, we compare whether algorithms are able to handle high and low dimensional observations, learn dynamics, use available partial dynamics, estimate the state appropriately, provide the model's uncertainty estimates while handling noisy data, handle missing data and perform direct optimization. Classic LGSSMs, e.g. EKF and UKF, work by linearizing the transition and emission equations and applying classic Bayesian updates to the linearized system with respect to the states. In other words, (F, H) in classic LGSSMs are neither data-driven nor trainable. Unlike classic LGSSMs, in the GIN we use a data-driven network to learn the dynamics.

The dataset consists of 1000 train, 100 validation and 100 test sequences of length 150. The sequences are distorted via generated noise; in the informed imputation task, half of the images are removed and boolean flags indicating the availability of the observations are passed to the cell instead. In the uninformed imputation task, black images are used as the observations instead of informing the cell with boolean flags.

Empirical running times and parameters of high-low dimensional experiments.

Low dimensional experiments.

Learning the dynamics in LGSSM is shown with ×/✓ because general LGSSMs, e.g. UKF and EKF, are not able to learn the dynamics. However, in our setting and parameterization we use a data-driven network for obtaining (F, H), to make LGSSMs comparable with the GIN in the high dimensional observation experiments.

The structure of the encoder and decoder for single and double pendulum experiments.

Comparison of learned KG-SG matrices with the ground truth KG-SG. For the Lorenz attractor and NCLT experiments, "with dynamics" refers to the situation where we are given the form of the dynamics, i.e. (37)-(39).

MSE for single pendulum experiment.

To demonstrate the simplicity of our proposed GIN, we include intuitive inference code written with the Tensorflow library for both the high dimensional and low dimensional experiments. The code runs with Python 3.6+. The entire code to reproduce the experiments is available in a Github repository.

A.12.1 PYTHON INTUITIVE CODE FOR HIGH DIMENSIONAL EXPERIMENTS.

A.12.2 PYTHON INTUITIVE CODE FOR LOW DIMENSIONAL EXPERIMENTS.


Algorithm: Low-Dimensional Observations Training

Input: Ground truth gt_1:T, observations y_1:T, last posteriors (x_1:T^+, Σ_1:T^+), initial posterior (x_0^+, Σ_0^+)
if dynamics are not known then
    α_1:T = Dynamics Network(x_0:T-1^+)
    Obtain F̂_1:T and Ĥ_1:T by (13)
(w_1:T, r_1:T) = MLP(y_1:T)

