SPECTRANET: MULTIVARIATE FORECASTING AND IMPUTATION UNDER DISTRIBUTION SHIFTS AND MISSING DATA

Abstract

In this work, we tackle two widespread challenges in real applications of time-series forecasting that have been largely understudied: distribution shifts and missing data. We propose SpectraNet, a novel multivariate time-series forecasting model that dynamically infers a latent space spectral decomposition to capture current temporal dynamics and correlations on the recent observed history. A Convolutional Neural Network maps the learned representation by sequentially mixing its components and refining the output. Our proposed approach can simultaneously produce forecasts and interpolate past observations and can, therefore, greatly simplify production systems by unifying imputation and forecasting tasks into a single model. SpectraNet achieves SoTA performance simultaneously on both tasks on five benchmark datasets, compared to forecasting and imputation models, with up to 92% fewer parameters and comparable training times. In settings with up to 80% missing data, SpectraNet has average performance improvements of almost 50% over the second-best alternative.

1. INTRODUCTION

Multivariate time-series forecasting is an essential task in a wide range of domains. Forecasts are a key input to optimize the production and distribution of goods (Böse et al., 2017), predict healthcare patient outcomes (Chen et al., 2015), plan electricity production (Olivares et al., 2022), and build financial portfolios (Emerson et al., 2019), among other examples. Due to its high potential benefits, researchers have dedicated many efforts to improving the capabilities of forecasting models, with breakthroughs in model architectures and performance (Benidis et al., 2022). The main focus of research in multivariate forecasting has been on accuracy and scalability, to which the present paper contributes. In addition, we identify two widespread challenges for real applications which have been largely understudied: distribution shifts and missing data. We refer to distribution shifts as changes in the time-series behavior. In particular, we focus on discrepancies in distribution between the train and test data, which can considerably degrade the accuracy (Kuznetsov & Mohri, 2014; Du et al., 2021; Xu et al., 2022; Ivanovic et al., 2022). This has become an increasing problem in recent years with the COVID-19 pandemic, which disrupted all aspects of human activities. Missing values are a widespread problem. Some common causes include faulty sensors, the impossibility of gathering data, and misplacement of information. As we demonstrate in our experiments, these challenges hinder the performance of current state-of-the-art (SoTA) models, limiting their use in applications where these problems are predominant. In this work, we propose SpectraNet, a novel multivariate forecasting model that achieves SoTA performance in benchmark datasets and is also intrinsically robust to distribution shifts and extreme cases of missing data.
SpectraNet achieves its high accuracy and robustness by dynamically inferring a latent vector projected on a temporal basis, a process we name latent space spectral decomposition (LSSD). A series of convolution layers then synthesizes both the reference window, which is used to infer the latent vectors, and the forecast window. To the best of our knowledge, SpectraNet is also the first solution that can simultaneously forecast the future values of a multivariate time series and accurately impute the past missing data. In practice, imputation models are first used to fill the missing information for all downstream tasks, including forecasting. SpectraNet can greatly simplify production systems by unifying imputation and forecasting tasks into a single model.

The main contributions are:
• Latent Vector Inference: a methodology to dynamically capture the current dynamics of the target time-series in a latent space, replacing parametric encoders.
• Latent Space Spectral Decomposition: a representation of a multivariate time-series window on a shared latent space with temporal dynamics.
• SpectraNet: a novel multivariate forecasting model that simultaneously imputes missing data and forecasts future values, with SoTA performance on several benchmark datasets and demonstrated robustness to distribution shifts and missing values.

We will make our code publicly available upon acceptance. The remainder of this paper is structured as follows. Section 2 introduces notation and the problem definition, Section 3 presents our method, and Section 4 describes our experimental setup and presents our empirical findings. Finally, Section 5 concludes the paper. The literature review is included in Appendix A.1.

2. NOTATION AND PROBLEM DEFINITION

We introduce a new notation that we believe is lighter than the standard notation while being intuitive and formally correct. Let $Y \in \mathbb{R}^{M \times T}$ be a multivariate time-series with $M$ features and $T$ timestamps. Let $Y_{a:b} \in \mathbb{R}^{M \times (b-a)}$ be the observed values for the interval $[a, b)$, that is, $Y_{0:t}$ is the set of $t$ observations of $Y$ from timestamp $0$ to timestamp $t-1$, while $Y_{t:t+H}$ is the set of $H$ observations of $Y$ from timestamp $t$ to timestamp $t+H-1$. Let $y_{m,t} \in \mathbb{R}$ be the value of feature $m$ at timestamp $t$.

In this work we consider the multivariate point forecasting task, which consists of predicting the future values of a multivariate time-series sequence based on past observations. The main task of a model $F_\Theta$ with parameters $\Theta$ at a timestamp $t$ is to produce forecasts for the future $H$ values, denoted by $\hat{Y}_{t:t+H}$, based on the previous history $Y_{0:t}$:

$$\hat{Y}_{t:t+H} = F_\Theta(Y_{0:t}) \tag{1}$$

For the imputation task, to impute a missing value $y_{m,t}$, models are not constrained to only use past observations. Moreover, they are evaluated on how well they approximate only the missing values. We evaluate the performance with two common metrics used in the literature, mean squared error (MSE) and mean absolute error (MAE), given by equation 2 (Hyndman & Athanasopoulos, 2018).

$$\mathrm{MSE} = \frac{1}{MH} \sum_{h=0}^{H-1} \sum_{m=1}^{M} (y_{m,t+h} - \hat{y}_{m,t+h})^2, \qquad \mathrm{MAE} = \frac{1}{MH} \sum_{h=0}^{H-1} \sum_{m=1}^{M} |y_{m,t+h} - \hat{y}_{m,t+h}| \tag{2}$$

3. SPECTRANET

We start the description of our approach with a general outline of the model and explain each major component in detail in the following subsections. The overall architecture is illustrated in Figure 1. SpectraNet is a top-down model that generates a multivariate time-series window of fixed size $s_w = L + H$, where $L$ is the length of the reference window and $H$ is the forecast horizon, from a latent vector $z \in \mathbb{R}^d$. To produce the forecasts at timestamp $t$, the model first infers the optimal latent vector on the reference window, consisting of the last $L$ values, $Y_{t-L:t}$, by minimizing the reconstruction error.
This inference step is the main difference between our approach and existing models, which map the input into an embedding or latent space using an encoder network. The model generates the full time-series window of size $s_w$, which includes the forecast window $Y_{t:t+H}$, with a spectral decomposition and a Convolutional Neural Network (CNN). The main steps of SpectraNet are given by

$$z^* = \arg\min_z \mathcal{L}(Y_{t-L:t}, \hat{Y}_{t-L:t}(z)) \tag{3}$$

where $z^*$ is the inferred latent vector, $\mathcal{L}$ is a reconstruction error metric, and $\hat{Y}_{t-L:t}(z)$ is given by

$$E = \mathrm{LSSD}(z, B) \tag{4}$$

$$\hat{Y}_{t-L:t}, \hat{Y}_{t:t+H} = \mathrm{CNN}_\Theta(E) \tag{5}$$

where LSSD (latent space spectral decomposition) is a basis expansion operation of $z$ over the predefined temporal basis $B$ to produce a temporal embedding $E \in \mathbb{R}^{d \times d_t}$, and CNN is a Top-Down Convolutional Neural Network with learnable parameters $\Theta$. The CNN simultaneously produces both the reconstruction of the past reference window $\hat{Y}_{t-L:t}$, used to find the optimal latent vector for the full window, and the forecast $\hat{Y}_{t:t+H}$.

3.1. LATENT VECTOR INFERENCE

The proposed latent vector inference is based on the Alternating back-propagation algorithm (ABP) (Han et al., 2017) for training generative models without encoders. A single generator (decoder) architecture is trained by maximizing the observed likelihood directly. To achieve this, at every step, ABP samples latent vectors from the posterior distribution $P(Z|Y)$ using MCMC methods such as Langevin dynamics. Generative models trained with ABP demonstrated superior performance in recovering missing segments of images and videos over Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). To our knowledge, SpectraNet is the first approach that uses and extends the latent vector inference principle of ABP for time-series forecasting. We reformulate the posterior distribution sampling of the latent vectors as the non-convex minimization problem presented in equation 3, which aims to minimize the generator's reconstruction error on the reference window $Y_{t-L:t}$. We use the mean squared error (MSE) as the reconstruction loss since it is differentiable, fast to compute, and theoretically grounded. Our method corresponds to the maximum a posteriori estimation (MAP) in the ABP framework. Appendix A.2 formally presents the ABP algorithm and the connections to our method. To solve the reconstruction loss minimization problem, we use gradient-based methods. In particular, we rely on gradient descent (GD), randomly initializing the latent vector with independent and identically distributed Gaussian entries and fixing the learning rate and the number of iterations as hyperparameters. Figure 2 demonstrates how SpectraNet's output evolves during the inference of $z$, adapting to current behaviors on the reference window. Section 3.4 presents how SpectraNet is trained by alternating the inference and parameter learning steps.
The main advantage of inferring the latent vector is the robustness to handle both distribution shifts and missing values. During this step, the model will infer the latent vector that best fits the current dynamics on the reference window, even if they follow a behavior unseen during training. The ability to dynamically adjust the temporal embedding for each window gives SpectraNet more robustness to changes in distribution, as shown in Figure 3. To deal with missing data, the reconstruction loss is masked. Following the notation in (Tashiro et al., 2021), let $\mathbf{M} \in \{0, 1\}^{M \times T}$ be the observation mask which indicates the availability of data. The latent vectors are inferred following

$$z^* = \arg\min_z \mathcal{L}(\mathbf{M}_{t-L:t} \odot Y_{t-L:t}, \mathbf{M}_{t-L:t} \odot \hat{Y}_{t-L:t}(z)) \tag{6}$$

where $\odot$ is the element-wise matrix multiplication. As demonstrated by the experimental results in Tables 1, 2 and 4, and in agreement with previous studies such as (Han et al., 2017; Pang et al., 2020), inferring latent vectors provides superior robustness to missing data.
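
The masked inference step can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the CNN decoder is replaced by a hypothetical linear decoder over a three-row basis (constant, cosine, sine), a single feature is used, and the names `infer_latent` and `decode`, the learning rate, and the number of steps are assumptions chosen for illustration. What it shows is the mechanism described above: gradient descent on the masked MSE over the reference window only, after which the same latent vector yields the forecast.

```python
import math
import random


def decode(z, basis):
    """Toy stand-in for the decoder: y_hat[t] = sum_i z[i] * basis[i][t]."""
    return [sum(z[i] * row[t] for i, row in enumerate(basis))
            for t in range(len(basis[0]))]


def infer_latent(y_ref, mask, basis, steps=1000, lr=0.05, seed=0):
    """Gradient descent on the masked MSE over the reference window.

    y_ref, mask: lists of length L (one feature, for brevity).
    basis: d rows of length L + H; only the first L columns receive gradients.
    """
    rng = random.Random(seed)
    d, L = len(basis), len(y_ref)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # i.i.d. Gaussian initialization
    n_obs = max(1, sum(mask))
    for _ in range(steps):
        # Occluded entries are multiplied by 0 and do not contribute to the loss.
        resid = [mask[t] * (sum(z[i] * basis[i][t] for i in range(d)) - y_ref[t])
                 for t in range(L)]
        grad = [2.0 / n_obs * sum(resid[t] * basis[i][t] for t in range(L))
                for i in range(d)]
        z = [z[i] - lr * grad[i] for i in range(d)]
    return z


# Hypothetical sizes and coefficients, chosen only for this illustration.
L, H, period = 48, 12, 24
basis = [
    [1.0] * (L + H),
    [math.cos(2 * math.pi * t / period) for t in range(L + H)],
    [math.sin(2 * math.pi * t / period) for t in range(L + H)],
]
z_true = [0.5, 2.0, -1.0]
y_full = decode(z_true, basis)
mask = [0 if (t // 6) % 2 == 1 else 1 for t in range(L)]  # half of the window occluded
z_star = infer_latent(y_full[:L], mask, basis)
forecast = decode(z_star, basis)[L:]  # forecast window, never seen during inference
```

Because the basis ties the reference and forecast windows together, the coefficients recovered from the observed half of the reference window also determine the forecast; with a non-linear decoder the problem becomes non-convex and this simple convergence behavior is no longer guaranteed.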

3.2. LATENT SPACE SPECTRAL DECOMPOSITION

The second component of SpectraNet is the mapping from the latent vectors $z$ into the temporal embedding $E$. Each element of $z$ corresponds to the coefficient of one element of the temporal basis $B \in \mathbb{R}^{d \times d_t}$, where $d$ is the number of elements and $d_t$ is the temporal length. The $i$-th row of $E$ is given by

$$E_{i,:} = z^*_i B_{i,:} \tag{7}$$

Each element of the basis consists of a predefined template function. Similarly to previous work including (Oreshkin et al., 2019; Engel et al., 2020; Shan et al., 2022), the basis includes patterns commonly found in time series: trends, represented by polynomial functions, and seasonalities, represented by harmonic functions. The final basis matrix $B$ is the row-wise concatenation of the three following matrices.

$$\begin{aligned}
B^{trd}_{i,t} &= t^i, && i \in \{0, \dots, p\},\ t \in \{0, \dots, d_t\} \\
B^{cos}_{i,t} &= \cos(2\pi i t), && i \in \{1, \dots, s_w/2\},\ t \in \{0, \dots, d_t\} \\
B^{sin}_{i,t} &= \sin(2\pi i t), && i \in \{1, \dots, s_w/2\},\ t \in \{0, \dots, d_t\}
\end{aligned} \tag{8}$$

where $p$ is a hyperparameter that controls the maximum degree of the polynomial basis. The final size of the basis $d$ is equal to $s_w + p + 1$. The temporal embedding $E$ corresponds to a latent space spectral decomposition that encodes shared temporal dynamics of all features in the target window, as the latent vector $z^*$ selects the relevant trend and frequency bands. Another crucial reason behind using a predefined basis is to impose strict temporal dependencies between the reference and forecasting windows. While inferring the latent vector, the forecasting window does not provide information (gradients). If all the entries of $E$ were freely inferred, the last values of the temporal embedding, which determine the forecasting window, could not be optimized.
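
The basis construction and the row-wise scaling of equation 7 can be sketched in a few lines. This is an illustrative sketch rather than the paper's code: the function names `build_basis` and `lssd` are hypothetical, and the time index is assumed to be normalized to $[0, 1)$ so that the harmonics have distinct frequencies (the text does not state the time scaling). An even $s_w$ is also assumed, so that the number of rows is $s_w + p + 1$.

```python
import math


def build_basis(s_w, p, d_t):
    """Row-wise concatenation of trend, cosine and sine templates.

    Assumes an even s_w, giving (p + 1) + s_w / 2 + s_w / 2 = s_w + p + 1 rows.
    """
    grid = [t / d_t for t in range(d_t)]  # assumption: time normalized to [0, 1)
    trend = [[t ** i for t in grid] for i in range(p + 1)]
    cos = [[math.cos(2 * math.pi * i * t) for t in grid]
           for i in range(1, s_w // 2 + 1)]
    sin = [[math.sin(2 * math.pi * i * t) for t in grid]
           for i in range(1, s_w // 2 + 1)]
    return trend + cos + sin


def lssd(z, basis):
    """E[i][:] = z[i] * B[i][:]: each latent coefficient scales one template."""
    return [[z_i * b for b in row] for z_i, row in zip(z, basis)]


# Hypothetical sizes for illustration.
B = build_basis(s_w=8, p=2, d_t=16)
z = [0.5] * len(B)
E = lssd(z, B)
```

Since every row of `E` is a fixed template scaled by one coefficient, a latent vector inferred from the reference portion of the window fully determines the embedding over the forecast portion as well.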

3.3. TOP-DOWN CONVOLUTION NETWORK

SpectraNet uses a Top-Down Convolution Network (CNN) as decoder, which produces the final forecast and reconstruction of the reference window $\hat{Y}_{t-L:t+H}$ from the temporal embedding $E$. The full CNN architecture is presented in Figure 1, and can be formalized as:

$$\begin{aligned}
h_1 &= \mathrm{ReLU}(\mathrm{BN}(\mathrm{TransposedConvolution}(E))) \\
h_2 &= \mathrm{ReLU}(\mathrm{BN}(\mathrm{TransposedConvolution}(h_1))) \\
\hat{Y}_{t-L:t+H} &= \mathrm{TransposedConvolution}(h_2)
\end{aligned} \tag{9}$$

where TransposedConvolution is a transposed convolutional layer on the temporal dimension, BN is a batch normalization layer, and ReLU activations introduce non-linearity. The number of filters at each layer, kernel size, and stride are given in Appendix A.5. The convolutional filters are not causal as in a Temporal Convolution Network (TCN) (Bai et al., 2018). Instead, SpectraNet's causality, which allows the model to forecast future data using past observations, comes from inferring the latent vector $z^*$ with the reference window. The first layers of the CNN learn a common representation for all features from the temporal embedding $E$. Additionally, the second layer refines the temporal resolution to the final size of the window $s_w$. The last layer produces the final output for all $M$ features from $h_2$. While equation 9 presents the default configuration for SpectraNet used in the experiments, additional layers can be added to increase the expressivity. Appendix A.10 presents ablation studies on the number of layers.
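
To make the role of the transposed convolution concrete, the following sketch implements a single-channel, unpadded transposed convolution along time in plain Python. It is only meant to show how such a layer increases the temporal resolution; the kernel and stride below are hypothetical values, not the ones reported in Appendix A.5, and the real layers operate on multiple channels with batch normalization.

```python
def transposed_conv1d(x, kernel, stride):
    """Single-channel transposed convolution along time (no padding).

    Each input value is spread over len(kernel) output positions, so the
    output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, x_i in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += x_i * w
    return out


# With stride 2 and a kernel of length 2, the temporal length doubles.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)
# upsampled == [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

A stride-2 layer of this kind is one way to go from a resolution of $s_w / 2$ to $s_w$; with a kernel longer than the stride, neighbouring input positions overlap in the output and their contributions are summed.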

3.4. TRAINING PROCEDURE

Each training iteration consists of two steps: the inference step and the learning step. During the inference step, the optimal latent vector $z^*$ is inferred using Gradient Descent, solving the optimization problem given in equation 3 with the current parameters $\Theta$. During the learning step, the latent vector is used as the input to the model, and the parameters $\Theta$ are updated using the Adam optimizer (Kingma & Ba, 2014) and MSE loss. For each iteration, we sample a small batch (with replacement) of multivariate time-series windows $Y_{t:t+s_w}$ from the training data, each starting at a random timestamp $t$. Each sampled observation is first normalized with the mean and standard deviation on the reference window, to decouple the scale and patterns (the output is scaled back before evaluation). The full training procedure is presented in Appendix A.6. One potential drawback of inferring the latent vectors is the computational cost. We tackle this in several ways. First, by solving the optimization problem using GD, we can rely on automatic differentiation libraries such as PyTorch to efficiently compute the gradients. Second, backpropagation on the CNN is parallelizable on GPU since it does not require sequential computation. The forecasts are independent between windows so they can be computed in parallel, unlike RNN models, which must produce the forecasts sequentially. Finally, during training, we persist the optimal latent vectors to future iterations on the same window. For a given observation $t$, the final latent vector $z^*_t$ is stored and used as a warm start when observation $t$ is sampled again, considerably reducing the number of iterations during the inference step. We discuss training times and memory complexity in Section 4.
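
The window normalization mentioned above can be sketched as follows. This is a minimal illustration for a single feature under stated assumptions: the helper names `normalize_window` and `denormalize` and the small `eps` constant are hypothetical, and how occluded values enter the statistics is not specified in the text, so the sketch assumes a fully observed reference window.

```python
import math


def normalize_window(window, L, eps=1e-8):
    """Scale a window of length L + H using statistics of its first L values only."""
    ref = window[:L]
    mean = sum(ref) / L
    std = math.sqrt(sum((v - mean) ** 2 for v in ref) / L)
    scale = std + eps  # eps guards against a constant reference window
    return [(v - mean) / scale for v in window], mean, scale


def denormalize(values, mean, scale):
    """Map model outputs back to the original scale before evaluation."""
    return [v * scale + mean for v in values]


# Hypothetical window with L = 4 reference values and H = 2 future values.
window = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
norm, mean, scale = normalize_window(window, L=4)
restored = denormalize(norm, mean, scale)
```

Using only the reference window for the statistics keeps the procedure usable at forecast time, when the last $H$ values are not available.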

3.5. ROBUSTNESS TO MISSING VALUES AND DISTRIBUTIONAL SHIFTS

We first test SpectraNet's robustness on synthetic data, which we name Simulated7. This dataset consists of seven time series of length 20,000, each generated as the sum of two cosines with random frequencies and small Gaussian noise. We inject missing values into both the train and test sets, following the procedure described in Section 4. For the distribution shift experiments, we modify the test set with two common distribution shifts: changes in trend and in magnitude. Figure 3 presents the forecasts of SpectraNet for the three settings considered. We include the first feature in this figure and show the complete data in Appendix A.7. As seen in panel (a), our method can accurately forecast the ground truth even with 80% of missing values, a challenging setting. In Appendix A.9 we demonstrate how SpectraNet is also robust to changes over time of the missing data regime (how much or for how long missing data occurs). While SpectraNet is intrinsically robust to missing data, current SoTA models require either modifications to the architecture or an interpolation model to first impute the missing values. Another advantage of our method is that it can not only forecast directly on the raw data but, by generating the entire window, it simultaneously interpolates the past missing data and forecasts the future values.
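
A series in the spirit of Simulated7, together with the two test-set shifts, could be generated along the following lines. This is only a sketch: the frequency range, the noise level, and the convention that the trend slope is expressed per unit of a time axis running from 0 to `t_max` are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def simulate_series(n, rng, t_max=5.0, freq_range=(1.0, 20.0), noise_std=0.05):
    """Sum of two cosines with random frequencies plus small Gaussian noise."""
    f1, f2 = rng.uniform(*freq_range), rng.uniform(*freq_range)
    ts = [t_max * k / (n - 1) for k in range(n)]
    return [math.cos(2 * math.pi * f1 * t) + math.cos(2 * math.pi * f2 * t)
            + rng.gauss(0.0, noise_std) for t in ts]


def add_trend(series, slope, t_max=5.0):
    """Trend shift: add slope * t, with t on a regular grid over [0, t_max]."""
    n = len(series)
    return [v + slope * t_max * k / (n - 1) for k, v in enumerate(series)]


def scale_magnitude(series, factor):
    """Magnitude shift: multiply every value by a constant factor."""
    return [factor * v for v in series]


rng = random.Random(0)
series = simulate_series(1000, rng)
with_trend = add_trend(series, slope=6.0)
with_scale = scale_magnitude(series, factor=0.5)
```

In the experiments only the test portion of each series is perturbed, so a model trained on the unshifted data has to adapt at inference time.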

4. EXPERIMENTAL RESULTS

We base our experimental setting, benchmark datasets, train/validation/test splits, hyperparameter tuning, and data processing on previous works on multivariate forecasting (Zhou et al., 2021; Wu et al., 2021; Challu et al., 2022a).

4.1. DATASETS

We evaluate our model on synthetic data and five popular benchmark datasets commonly used in the forecasting literature, comprising various applications and domains. All datasets are normalized with the mean and standard deviation computed on the train set. We split each dataset into train, validation, and test sets, following different proportions based on the number of timestamps. We present summary statistics for each dataset in Appendix A. 

4.2. TRAINING AND EVALUATION SETUP

We train models on the training set comprising the earliest history of each dataset. We select the optimal hyperparameters for all models (including the baselines to ensure fair comparisons) on each dataset and occlusion percentage based on the performance (measured by MSE) on the validation set, using a Bayesian optimization algorithm, HYPEROPT (Bergstra et al., 2011). For SpectraNet we tune the number of convolutional filters, temporal size of the latent vectors, window normalization, and optimization hyperparameters for both the inference and learning steps. The complete list of hyperparameters is included in Appendix A.5. For the main results, we use a multi-step forecast with a horizon of size 24, and forecasts are produced in a rolling window strategy. We repeat the experiment five times with different random seeds and report average performances. We run the experiments on an AWS g3.4xlarge EC2 instance with NVIDIA Tesla M60 GPUs. We compare our proposed model against several univariate and multivariate state-of-the-art baselines based on different architectures: Transformers, feed-forward networks (MLP), Graph Neural Networks (GNN), and Recurrent Neural Networks (RNN). For Transformer-based models we include the Informer (Zhou et al., 2021) and Autoformer (Wu et al., 2021), two recent models for multi-step long-horizon forecasting; for MLP we consider the univariate N-HiTS (Challu et al., 2022a) and N-BEATS (Oreshkin et al., 2019) models; for GNN models we include StemGNN (Cao et al., 2020); and lastly we include a multi-step univariate RNN with dilations (Chang et al., 2017).

4.3. MISSING DATA SETUP

We present a new experimental setting for testing the forecasting models' robustness to missing data. We parameterize the experiments with two parameters: the size of missing segments $s$ and the probability $p_o$ that each segment is missing. First, the original time series is divided into disjoint segments of length $s$. Second, each feature $m$ in each segment is occluded with probability $p_o$. We repeat the experiment with different probabilities $p_o$: 0% (no missing values), 20%, 40%, 60%, and 80%, to test models' performance under different proportions of missing values. The size of segments $s$ is fixed at ten timestamps for ILI and at 100 for all other datasets. Figure 3 shows an example of the setting for the Simulated7 dataset with 80% missing data (occluded data in grey).
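
The two-step occlusion procedure can be sketched as a small mask generator. The function name `occlusion_mask` and the seeding scheme are hypothetical; the logic follows the description above, with the series cut into disjoint segments of length $s$ and each feature in each segment dropped independently with probability $p_o$.

```python
import random


def occlusion_mask(n_features, n_timestamps, s, p_o, seed=0):
    """Return mask[m][t] in {0, 1}, where 1 means observed and 0 means occluded."""
    rng = random.Random(seed)
    mask = [[1] * n_timestamps for _ in range(n_features)]
    for start in range(0, n_timestamps, s):       # disjoint segments of length s
        for m in range(n_features):               # independent draw per feature
            if rng.random() < p_o:
                for t in range(start, min(start + s, n_timestamps)):
                    mask[m][t] = 0
    return mask


# Hypothetical sizes: 7 features, 2000 timestamps, segments of 10, 80% occlusion.
mask = occlusion_mask(n_features=7, n_timestamps=2000, s=10, p_o=0.8)
missing_fraction = 1 - sum(map(sum, mask)) / (7 * 2000)
```

Occluding whole segments rather than isolated points produces long gaps, which is what makes simple interpolation between neighbouring observations unreliable in this setting.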

4.4. KEY RESULTS

Forecasting accuracy. Table 2 presents the forecasting accuracy. First, SpectraNet achieved SoTA performance even on full data, showing that the advantages of our method are not limited to the robustness to missing data and distribution shifts. SpectraNet outperformed N-HiTS/N-BEATS in two of the five datasets and placed in the Top-3 models across all datasets, outperforming Transformer-based models, StemGNN, and DilRNN. SpectraNet also achieves the best performance among all multivariate models. The superior robustness of SpectraNet to missing values is evident in all datasets and proportions. In datasets with strong seasonalities, such as Simulated7 and Solar, the accuracy of SpectraNet degrades only marginally even with 80% of missing values. The relative performance of our method improves with the proportion of missing data. For example, for 80%, the average MSE across datasets is 48% lower than that of N-HiTS and 65% lower than that of the Autoformer. The improvement on ILI with complete data of almost 50% against the second-best model is explained by the distribution shift between the train and test set, where the latter presents larger spikes and a clear positive trend on some features. Appendix A.8 presents a plot of the ILI dataset.

Imputation accuracy. Table 3 presents the results for the imputation task. We include a SoTA imputation diffusion model, CSDI (Tashiro et al., 2021), and three simple baselines: (i) imputation with the mean of each feature, (ii) imputation with the last available value (Naive), and (iii) linear interpolation between past and future available values. SpectraNet consistently achieves the best performance across all datasets. CSDI outperforms all simpler baselines on Simulated7 and Solar, but its performance degrades on datasets with distribution shifts such as Simulated7-Trend and ILI.

Memory and time complexity. We compare the training time and memory usage (as the number of learnable parameters) of SpectraNet and baseline models as a function of the input size on the ETTm2 dataset in Figure 4, using the optimal hyperparameters. Panel (a) shows our method has the lowest memory footprint, with up to 85% fewer parameters than N-HiTS (the second-best alternative in most datasets) and 92% fewer than the Autoformer. Although SpectraNet performs the inference step in each iteration, thanks to the improvements discussed in Section 3 and the reduced number of parameters, training times are comparable to other baseline models.

Ablation studies. Finally, we test the contribution of the two major components proposed in this work, the dynamic inference of latent vectors and the LSSD. We compare SpectraNet against two versions without these components. In SpectraNet$_1$ we add a CNN encoder to map inputs to a standard embedding, and in SpectraNet$_2$ we keep the inference step but remove the LSSD. Table 4 presents the average performance across the five benchmark datasets. As expected, SpectraNet$_1$ with the parametric encoder is not robust to missing data, with similar performance to other SoTA models. On top of improving the performance over SpectraNet$_2$, the LSSD reduces the number of parameters by 70% since fewer layers are needed to generate the complete window.

5. DISCUSSIONS AND CONCLUSION

This work proposes a novel multivariate time-series forecasting model that uses a Top-Down CNN to generate time-series windows from a novel latent space spectral decomposition. It dynamically infers the latent vectors that best explain the current dynamics, considerably improving the robustness to distribution shifts and missing data. We compare the accuracy of our method with SoTA models based on several architectures for forecasting and interpolation tasks. We demonstrate that SpectraNet not only achieves SoTA performance on benchmark datasets but can also produce accurate forecasts under some forms of distribution shifts and extreme cases of missing data. Our experiments provide evidence that, as expected, the performance of current SoTA forecasting models significantly degrades in the presence of missing data and distribution shifts. These challenges are commonly present in high-stakes settings such as healthcare (for instance, cases with up to 80% missing data are common) and are becoming more predominant in many domains with recent global events such as the COVID-19 pandemic. Designing robust algorithms and solutions that tackle these challenges can significantly improve their adoption and increase their benefits. We believe that dynamically inferring latent vectors can have multiple other applications in time-series forecasting. For instance, it can have applications in transfer learning and few-shot learning. A pre-trained decoder could forecast completely unseen time series, while the inference step would allow the model to adapt to different temporal patterns. Moreover, the inference step can be adapted to produce multiple samples and calibrated to learn the target time series distribution.

A APPENDIX

A.1 RELATED WORK

Time-series forecasting. Time-series forecasting has long been an active area of research in academia and industry due to its broad range of high-impact applications. The latest developments in machine learning have been combined with classical methods or used to develop new models (Januschowski et al., 2020) which have been successful in recent large-scale forecasting competitions (Makridakis et al., 2020; 2022). The earlier breakthroughs in modeling time-series with deep learning (Januschowski et al., 2018) include the widely used Recurrent Neural Networks (RNN) (Chang et al., 2017; Salinas et al., 2020; 2019) and Temporal Convolution Networks (TCN) (Bai et al., 2018). The success of Transformers (Vaswani et al., 2017) in sequential data, such as natural language processing (NLP) and audio processing, inspired many recent models with attention mechanisms. The Informer (Zhou et al., 2021) introduces a Prob-sparse self-attention to reduce the quadratic complexity of vanilla Transformers; the Autoformer (Wu et al., 2021) proposes a decomposition architecture into trend and seasonal components, and the Auto-correlation mechanism. Simpler approaches based on feed-forward networks have also shown remarkable performance. The N-BEATS model (Oreshkin et al., 2019) uses a deep stack of fully-connected layers to decompose the forecast into both learnable and predefined basis functions; the NBEATSx extension incorporates interpretable exogenous variable decomposition (Olivares et al., 2022); the N-HiTS model (Challu et al., 2022a) generalizes N-BEATS by introducing mixed-data sampling and hierarchical interpolation to decompose the forecast into different frequencies. Finally, Graph Neural Networks (GNN) have been used to incorporate complex relations between a large number of time series, including the GraphWaveNet (Wu et al., 2019) and StemGNN (Cao et al., 2020) models.

Time-series imputation. The standard practice to handle missing data is filling the missing information, a process called interpolation. Simple interpolation alternatives include replacing missing values with zeros, the mean, the most recent value (naive), and linear interpolation. Most recent deep learning approaches consist of Generative Adversarial Networks (GANs) and RNN-based architectures. Some notable examples are E2GAN (Luo et al., 2019), BRITS (Cao et al., 2018), and NAOMI (Liu et al., 2019). More recent approaches include the CSDI (Tashiro et al., 2021) model, a score-based diffusion auto-regressive architecture that produces a distribution for the imputed values.

Alternating back-propagation. The method we propose to infer latent vectors is inspired by the Alternating back-propagation algorithm (ABP) for generative models (Han et al., 2017). The key idea of this algorithm is to sample latent vectors from the posterior distribution with MCMC methods and train a generative model by maximizing the observed likelihood directly. Generative models trained with ABP do not need an encoder, as in Variational Autoencoders (VAE), or a discriminator network, as in GANs. More recent work extends the original architecture with energy-based models for CV (Pang et al., 2020), LSTM networks for text generation (Pang et al., 2021), and a hierarchical latent space for anomaly detection (Challu et al., 2022b).

A.2 ALTERNATING BACK-PROPAGATION

This section presents the principles of Alternating back-propagation (ABP) to train a generative model, as proposed in (Han et al., 2017) and presented in (Challu et al., 2022b), and the connection with our proposed latent vector inference. Let $y \in \mathbb{R}^D$ be a $D$-dimensional data vector (such as an image or time-series window) and $f$ be a generative model with parameters $\theta$,

$$y = f(z, \theta) + \epsilon, \qquad z \sim \mathcal{N}(0, I_d), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_D) \tag{10}$$

where $z \in \mathbb{R}^d$ are latent factors and $d < D$. Let $\{y^{(i)}, i = 1, \dots, n\}$ be $n$ training observations. In Alternating Back-Propagation, parameters $\theta$ are trained by maximizing the observed log-likelihood

$$L(\theta) = \sum_{i=1}^{n} \log p_\theta(y^{(i)}) = \sum_{i=1}^{n} \log \int p_\theta(y^{(i)}, z^{(i)}) \, dz^{(i)} \tag{11}$$

The gradients $L'(\theta)$ for observation $i$ are given by

$$\frac{\partial}{\partial \theta} \log p_\theta(y^{(i)}) = \mathbb{E}_{p_\theta(z^{(i)} | y^{(i)})} \left[ \frac{\partial}{\partial \theta} \log p_\theta(y^{(i)}, z^{(i)}) \right] \tag{12}$$

where $p_\theta(z^{(i)} | y^{(i)})$ is the posterior distribution. ABP approximates this intractable expectation with a Monte Carlo average by taking a single sample $z^*$ from the posterior distribution with approximate Langevin Dynamics, by iterating

$$z^{(i)}_{t+1} = z^{(i)}_t + s \sigma_z \frac{\partial}{\partial z^{(i)}} \log p_\theta(z^{(i)}_t | y^{(i)}) + \sqrt{2s} \, \epsilon_t \tag{13}$$

where $\epsilon_t \sim \mathcal{N}(0, I_d)$, $s$ is the step size, and $\sigma_z$ controls the annealing or tempering. Finally, the Monte Carlo approximation of the gradient is given by

$$L'(\theta) \approx \frac{\partial}{\partial \theta} \log p_\theta(z^{*(i)}, y^{(i)}) \tag{14}$$

For our proposed latent vector inference, we reformulate the posterior sampling as the non-convex minimization problem presented in equation 3, which aims to minimize the mean squared error (MSE) between the reconstruction and the ground truth $y^{(i)}$. Minimizing the MSE is equivalent to the maximum a posteriori estimation (MAP) of $p_\theta(z^{(i)} | y^{(i)})$ assuming Gaussian distributions as in equation 10. This formulation also allows more sophisticated gradient-based methods than equation 13, such as Adam, to be used to improve convergence.
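
The link between the squared reconstruction error and the MAP estimate can be spelled out with a short derivation under the Gaussian model of equation 10. The steps below are a sketch added for clarity; in particular, treating the prior term as negligible is an assumption of this sketch and is not stated in the text above.

```latex
\begin{aligned}
\log p_\theta(z \mid y)
  &= \log p_\theta(y \mid z) + \log p(z) - \log p_\theta(y) \\
  &= -\frac{1}{2\sigma^2} \lVert y - f(z, \theta) \rVert^2
     - \frac{1}{2} \lVert z \rVert^2 + \text{const}, \\
\arg\max_z \log p_\theta(z \mid y)
  &= \arg\min_z \left[ \frac{1}{2\sigma^2} \lVert y - f(z, \theta) \rVert^2
     + \frac{1}{2} \lVert z \rVert^2 \right].
\end{aligned}
```

The first term is proportional to the MSE between $y$ and its reconstruction. When the prior term $\frac{1}{2} \lVert z \rVert^2$ is dropped or is negligible relative to the likelihood term (for example, for small $\sigma^2$), the MAP problem reduces to minimizing the reconstruction MSE, which is the form of the objective in equation 3.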

A.3 BENCHMARK DATASETS

Simulated7 is a synthetic dataset that we design to evaluate forecasting models' robustness to missing data and distribution shifts in a controlled environment. It consists of 7 time-series, each generated independently as the sum of two cosine functions with different frequencies and a small Gaussian noise. In particular, each series is composed as the sum of three elements: the two cosine terms and the noise term. Each time-series consists of 20,000 timestamps, sampled at regular intervals on [0, 5]. For the missing data we occlude random timestamps following the procedure described in Section 4, using $s = 10$ and $p_o \in \{0, 0.2, 0.4, 0.6, 0.8\}$. For simulating distribution shifts we perturb only the test set with two transformations: adding a linear trend with slope 6, and scaling the magnitude by 0.5.

The main paper presents the default recommended configuration for SpectraNet. In this section we explore different configurations for the CNN, in particular for the number of hidden layers. For the model SpectraNet$_{i,j}$, $i$ refers to the number of layers with temporal resolution $s_w / 2$, and $j$ to the number of layers with $s_w$ resolution. The following table presents the results on the ILI and Solar datasets. The results suggest SpectraNet's performance is not significantly affected by the number of layers in these datasets for several occlusion probabilities. On average, SpectraNet$_{2,2}$ achieves the best performance.



Figure 1: SpectraNet architecture. The Latent Space Spectral Decomposition (LSSD) encodes shared temporal dynamics of the target window into Fourier waves and polynomial functions. Latent vector z is inferred with Gradient Descent minimizing reconstruction error on the reference window. The Convolution Network (CNN) generates the time-series window by sequentially mixing the components of the embedding and refining the output.

Figure 2: SpectraNet's output evolution during latent vector inference with Gradient Descent. The model maps the latent vector to the complete window, including both reference (of size 104) and forecasting windows (of size 24), using only information from the former. The temporal basis B imposes strict dependencies between both windows. This inference process allows SpectraNet to dynamically adapt to new behaviours and forecast with missing data.
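The inference step illustrated in Figure 2 can be sketched in a few lines. The example below is a simplified illustration, not the SpectraNet implementation: it replaces the LSSD and CNN with a fixed two-column Fourier basis (the hypothetical `make_basis`), and the period, the missing-data mask, and the step size are assumptions chosen for the demo. The latent vector is fitted by gradient descent on the observed points of the 104-step reference window only, and the same vector is then decoded over the 24-step forecasting window.

```python
import math

REF, HORIZON = 104, 24   # window sizes quoted in the Figure 2 caption
PERIOD = 32              # assumed period of the toy signal


def make_basis(n_steps, period=PERIOD):
    """Fixed temporal basis with one cosine and one sine column (assumption)."""
    w = 2.0 * math.pi / period
    return [[math.cos(w * t), math.sin(w * t)] for t in range(n_steps)]


def decode(basis, z):
    """Linear stand-in for the decoder: y_hat[t] = basis[t] . z."""
    return [sum(b * v for b, v in zip(row, z)) for row in basis]


def infer_latent(basis_ref, y_ref, mask, steps=100, lr=0.5):
    """Gradient descent on the MSE over observed reference points only."""
    z = [0.0] * len(basis_ref[0])
    n_obs = sum(mask)
    for _ in range(steps):
        y_hat = decode(basis_ref, z)
        grad = [0.0] * len(z)
        for row, pred, target, m in zip(basis_ref, y_hat, y_ref, mask):
            if m:  # missing points contribute nothing to the loss
                for j, b in enumerate(row):
                    grad[j] += 2.0 / n_obs * b * (pred - target)
        z = [v - lr * g for v, g in zip(z, grad)]
    return z


basis = make_basis(REF + HORIZON)
z_true = [2.0, -1.0]
window = decode(basis, z_true)  # noise-free toy window
# Two missing regions inside the reference window (1 = observed, 0 = missing).
mask = [0 if 40 <= t < 60 or 80 <= t < 90 else 1 for t in range(REF)]

z_star = infer_latent(basis[:REF], window[:REF], mask)
forecast = decode(basis[REF:], z_star)
max_error = max(abs(f - y) for f, y in zip(forecast, window[REF:]))
```

Because the basis spans both windows, the forecast is fully determined by the latent vector fitted on the reference window, which is the dependency the caption describes.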

Figure 3: Forecasts for the first feature of Simulated7 dataset using SpectraNet with (a) 80% of missing data (missing regions in grey), (b) change in trend, and (c) change in magnitude. Forecasts are produced every 24 timestamps in a rolling window strategy.

3. The Influenza-like illness (ILI) dataset contains the weekly proportion of patients with influenza-like symptoms in the US, reported by the Centers for Disease Control and Prevention, from 2002 to 2021. Exchange reports the daily exchange rates of eight currencies relative to the US dollar from 1990 to 2016. The Solar dataset contains the hourly photovoltaic production of 32 solar stations in Wisconsin during 2016. Finally, the Electricity Transformer Temperature (ETTm2) dataset is a stream of eight sensor measurements of an electricity transformer in China from July 2016 to July 2018. Weather contains 21 meteorological conditions recorded at the German Max Planck Biogeochemistry Institute in 2021.

Figure 4: Memory efficiency and train time analysis on ETTm2. Memory efficiency is measured as the number of parameters; train time includes the complete training procedure. We use the best hyperparameter configuration for each model, based on model accuracy.

Forecasting accuracy on Simulated7 dataset with missing values and distribution shifts between the train and test sets, forecasting horizon of 24 timestamps. Lower scores are better. Metrics are averaged over five runs, best model highlighted in bold.

Main forecasting accuracy results on benchmark datasets with different proportions of missing values ($p_o$), forecasting horizon of 24 timestamps. Lower scores are better. Metrics are averaged over five runs, best model highlighted in bold.

Imputation accuracy on the test set for SpectraNet and baselines with different proportions of missing data ($p_o$). Lower scores are better, best model highlighted in bold.



presents summary information for the benchmark datasets, including the granularity (frequency), number of features and timestamps, and train/validation/test proportions. All datasets are public and are available at the following links:
• ILI: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
• Exchange: https://github.com/laiguokun/multivariate-time-series-data
• Solar: https://www.nrel.gov/grid/solar-power-data.html
• ETTm2: https://github.com/zhouhaoyi/ETDataset
• Weather: https://www.bgc-jena.mpg.de/wetter/

A.4 SIMULATED7 DATASET
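The construction of Simulated7 described in this appendix (seven independent series, each the sum of two cosines and small Gaussian noise, 20,000 timestamps on [0, 5], and two test-set perturbations) can be sketched as follows. This is a minimal illustration, not the authors' generator: the frequency pairs, the noise level, the train/test boundary, and the choice to measure the trend slope against the time axis are assumptions, since the source does not state them, and all function names are hypothetical.

```python
import math
import random

N_SERIES, N_STEPS = 7, 20_000  # sizes stated in the text
T_MAX = 5.0                    # the time grid spans [0, 5]


def time_grid(n_steps=N_STEPS, t_max=T_MAX):
    """Regularly spaced timestamps on [0, t_max]."""
    return [t_max * i / (n_steps - 1) for i in range(n_steps)]


def simulate_series(times, freq_a, freq_b, noise_std, rng):
    """Sum of two cosines with different frequencies plus Gaussian noise."""
    return [math.cos(freq_a * t) + math.cos(freq_b * t) + rng.gauss(0.0, noise_std)
            for t in times]


def add_trend(values, times, slope=6.0):
    """Shift 1: add a linear trend that starts at the first timestamp given."""
    t0 = times[0]
    return [v + slope * (t - t0) for v, t in zip(values, times)]


def scale_magnitude(values, factor=0.5):
    """Shift 2: rescale the magnitude of the series."""
    return [factor * v for v in values]


rng = random.Random(0)
times = time_grid()
# Hypothetical frequency pairs; the actual values are not given in the source.
freqs = [(10.0 * (k + 1), 35.0 * (k + 1)) for k in range(N_SERIES)]
data = [simulate_series(times, fa, fb, 0.05, rng) for fa, fb in freqs]

split = int(0.8 * N_STEPS)  # assumed train/test boundary
test_times = times[split:]
# Only the test portion is perturbed; the training portion is left unchanged.
test_trend = [add_trend(s[split:], test_times) for s in data]
test_scaled = [scale_magnitude(s[split:]) for s in data]
```

Keeping the perturbations as separate functions applied only to the test slice mirrors the setup in which the training data never exhibits the shifted behaviour.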

A.6 SPECTRANET TRAINING ALGORITHM

Algorithm 1 presents the training procedure of SpectraNet. The model is trained for a fixed number of iterations $n_{iters}$, randomly sampling $b$ windows in each iteration. Parameters $\theta$ are optimized with the Adam optimizer.

Algorithm 1: Training procedure
input : multivariate time-series $Y \in \mathbb{R}^{m \times T}$, model $F_\theta$, learning iterations $n_{iters}$
output: $F_{\theta^*}$, inferred latent vectors $\{z^*_t, t = 0, ..., T\}$
Let $i \leftarrow 0$, initialize $\theta$
Initialize $z_t$, for $t = 0, ..., T$
while $i < n_{iters}$ do
    Take a random mini-batch of windows $\{Y_{j_k-L:j_k+H}, j_k \sim U(L, T-H), k \in \{1, ..., b\}\}$
    $z^*_{j_k} \leftarrow \arg\min_z \mathcal{L}(Y_{j_k-L:j_k}, \hat{Y}_{j_k-L:j_k}(z)), k \in \{1, ..., b\}$
    Update $\theta$ with Adam using $z^*$ as input
    Store $z^*_{j_k}, k \in \{1, ..., b\}$
    $i \leftarrow i + 1$
end

A.10 CNN ABLATION STUDIES

Forecasting accuracy results of SpectraNet with different CNN architectures on benchmark datasets with different proportions of missing values ($p_o$), forecasting horizon of 24 timestamps. Lower scores are better. Metrics are averaged over five runs, best model highlighted in bold.


A.5 HYPERPARAMETER OPTIMIZATION

SpectraNet hyperparameters are tuned on the validation set of each dataset using the HYPEROPT algorithm with 30 iterations; Table 6 presents the hyperparameter grid. To ensure a fair comparison with baseline models, we also tuned their respective hyperparameters with the same procedure, as the default configuration in their implementations might not perform well in the different settings we explore. We detail the hyperparameter grid for each baseline model next. For N-HiTS, we use the hyperparameter grid described in their paper (Challu et al., 2022a). N-HiTS is a generalization of N-BEATS, which is already included in the hyperparameter grid as a possible configuration. For the Informer and Autoformer, we explore different values for the dropout probability, number of heads, learning rate (same as SpectraNet), batch size (same as SpectraNet), size of embedding, and sequence length. For StemGNN, we explore different optimization parameters, including the learning rate, batch size, and epochs. Finally, for RNN, we explore different dilations, number of layers, hidden size, and optimization parameters. For the N-HiTS, transformer, and RNN models, we used the Neuralforecast library (available in PyPI and Conda).

A.7 FORECASTS ON SIMULATED7

Figure 1 presents the complete forecasts on the test set for SpectraNet, N-HiTS, Informer, and RNN models. SpectraNet is the only model that can accurately forecast with up to 80% missing data and changes in distribution.

A.9 ADDITIONAL OCCLUSION EXPERIMENTS

In this section, we present additional occlusion experiments to analyze the case where the missing data regime (how much or for how long missing data occurs) differs between the training and test sets. In this setting, we train models on complete data and inject missing values only during inference on the test set. The following table presents SpectraNet and N-HiTS (best baseline) results on the Simulated7 dataset. SpectraNet's performance is almost identical to the original results, demonstrating the method's robustness to changes in the missing-data regime.
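Injecting missing values only at inference time can be illustrated with a small masking utility. The sketch below is an assumption about the procedure rather than a transcription of it: it treats the series as consecutive segments of a fixed size (10 here, on the assumption that this is what $s = 10$ denotes) and drops each segment with probability $p_o$. The function names and the placeholder fill value are hypothetical.

```python
import random


def occlusion_mask(n_steps, segment_size, p_occlude, rng):
    """Return a 0/1 availability mask; each segment is dropped with prob. p_occlude."""
    mask = []
    for start in range(0, n_steps, segment_size):
        observed = 0 if rng.random() < p_occlude else 1
        mask.extend([observed] * min(segment_size, n_steps - start))
    return mask


def apply_mask(series, mask, fill=0.0):
    """Replace occluded values with a placeholder (the fill value is an assumption)."""
    return [v if m else fill for v, m in zip(series, mask)]


rng = random.Random(0)
test_series = [float(t % 24) for t in range(240)]  # toy test set
occluded_by_level = {}
for p_o in (0.0, 0.2, 0.4, 0.6, 0.8):
    # The training data stays complete; only the test series is masked.
    mask = occlusion_mask(len(test_series), segment_size=10, p_occlude=p_o, rng=rng)
    occluded_by_level[p_o] = (mask, apply_mask(test_series, mask))
```

Returning the mask alongside the occluded series lets a model ignore the missing positions during inference and lets the evaluation report the realised fraction of missing data.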

