NOVEL FEATURE REPRESENTATION STRATEGIES FOR TIME SERIES FORECASTING WITH PREDICTED FUTURE COVARIATES

Abstract

Accurate time series forecasting is a fundamental challenge in data science. Unlike traditional statistical methods, conventional machine learning models, such as RNNs and CNNs, use historical data consisting of previously measured variables, including the forecast variable and all its covariates. However, in many applications, some of the covariates can be predicted with reasonable accuracy for the immediate future; we refer to such covariates as predictable future covariates. The input may also contain covariates that cannot be accurately predicted. We consider the problem of predicting water levels at a given location in a river or canal system using historical data and future covariates, some of which (precipitation, tide) may be predictable. Traditional methods for incorporating predictable future covariates have major limitations: simply concatenating the predicted future covariates to the input vector is highly likely to miss the past-future connection, while iteratively predicting one step at a time leads to prediction error accumulation. We propose two novel feature representation strategies, shifting and padding, that address these limitations by creating a framework for contextually linking the past with the predicted future while avoiding any accumulation of prediction errors. Extensive experiments on three well-known datasets reveal that our strategies, when applied to RNN and CNN backbones, outperform existing methods. Our experiments also suggest a relationship between the amount of shifting and padding and the periodicity of the time series.

1. INTRODUCTION

Conventional time series forecasting predicts a set of target variables at a future time point based on past data collected over a predetermined length. Next-step forecasting (Montgomery et al., 2015; Shi et al., 2022) refers to predicting the target variables one step into the future, where the unit of time is the granularity of the measurements. Multi-horizon forecasting (Quaedvlieg, 2021; Capistrán et al., 2010) predicts the target variables multiple steps into the future. Accurate forecasting enables better resource management and optimization decisions for critical processes (Cinar et al., 2017; Salinas et al., 2020; Rangapuram et al., 2018). Applications include probabilistic demand forecasting in retail (Böse et al., 2017), dynamic assignment of beds to patients (Zhang & Nawata, 2018), monthly inflation forecasting, and much more. Good multi-horizon forecasting requires historical data of the target variables from which to learn long-term patterns. It also requires measurements from heterogeneous data sources of useful covariates, often from the recent past. In many applications, however, some of the covariates can also be predicted with reasonable accuracy for the immediate future. We refer to such covariates as predictable future covariates. For example, a covariate of interest could be precipitation, for which it is possible to obtain both historical data and reasonably accurate near-future predictions from the weather service. Despite their importance, only limited approaches exist that use future covariates to improve time series predictions. Related methods can be mainly categorized into direct strategies using sequence-to-sequence models (Mariet & Kuznetsov, 2019) and iterated strategies using autoregressive models (Sahoo et al., 2020). Traditional methods for incorporating future covariates have major limitations.
We propose two novel feature representation strategies, shifting and padding, that address these limitations by creating a framework for contextually linking the past with the predicted future while avoiding any accumulation of prediction errors. Extensive experiments on three well-known datasets reveal that our strategies, when applied to RNN and CNN backbones, outperform existing methods.

Iterative methods. The iterative strategy recursively applies a next-step model, where the predicted values for the previous time step are used as input to forecast the next time step, as in Salinas et al. (2020). For the prediction at time step t, the target values z_{t-1} at the previous time step, the (predicted) covariates x_t for the current time step, and the context vectors h_{t-1} that summarize the representation of all past time steps are taken as input to predict the target values z_t at the current time step using RNNs. Rangapuram et al. (2018) adopted a similar approach by parameterizing a per-time-series linear state space model with recurrent neural networks. Related work with the iterative approach includes Li et al. (2019), where the basic architecture was a transformer model with convolutional layers.

Direct methods. The direct method typically uses an encoder model to learn a feature representation of past data, which is saved as context vectors in a hidden state. A decoder model takes the future covariates and the context vectors from the encoder and predicts the outputs for multi-horizon forecasting, as shown in Figure 1. The multi-horizon Quantile Recurrent Forecaster of Wen et al. (2017) used an LSTM as the encoder to generate context vectors, which are combined with predicted future covariates and fed into a multi-layer perceptron (MLP) to predict the future horizon. Some works (Fan et al., 2019; Du et al., 2020) have applied a temporal attention mechanism between the encoder and the decoder.
This architecture learns the relevance of different parts of the feature representations from historical data by computing attention weights. The weighted feature representations are then passed into the decoder to make predictions for future time steps. In Fan et al. (2019), bi-directional LSTMs are used as the decoder backbone, allowing past and predicted features to be considered at every future time step. The Temporal Fusion Transformer (Lim et al., 2021) combined gated residual networks (GRNs) and an attention mechanism (Vaswani et al., 2017) as an additional decoder on top of the traditional encoder-decoder model. GRNs filter unnecessary information, while the additional attention decoder captures long-term dependencies between time steps. Both the iterative and the direct methods aim to incorporate future covariates as inputs, but both suffer from shortcomings. Iterative methods accumulate prediction errors because the input to each time step is the output of the previous step, causing model performance to degrade quickly for longer forecasting horizons. Direct methods, on the other hand, are prone to missing interactions between past and future time points: the encoder processes only past data, while the decoder merely concatenates past representations with future covariates, which may miss specific relationships between past and future time points. In this paper, we aim to resolve the shortcomings of both approaches with a novel architecture.

Our contributions: We present two novel strategies, shifting and padding, that combine historical data and predicted future covariates in a meaningful manner. The architecture facilitates forecasting by learning useful feature representations from both the past and the predicted future. These two strategies transform the dataset and construct the training pairs (input features, labels) used for learning.
The input features comprise past information on all features, predicted future information on some covariates (with appropriate time labels), and a set of covariates for which no accurately predicted future information is available. The goal is to directly predict the target variables for the given horizons. The two strategies are briefly described below.
• Shifting: Predicted future covariates are shifted back in time and paired with appropriate past time points to create a modified input vector for the predictions. The shift length is a hyperparameter of the strategy.
• Padding: Covariates that cannot be accurately predicted are simply copied over from the recent past and provided as additional inputs after combining them with the predicted future covariates. The length of the padding is a hyperparameter of the strategy.

2. NOTATION

Let Z^N_{1:t} = (z^n_{1:t})_{n=1}^N be N target time series, X^M_{1:t} = (x^m_{1:t})_{m=1}^M be M time series covariates that cannot be accurately predicted, and Y^Q_{1:t} = (y^q_{1:t})_{q=1}^Q = (y^q_1, y^q_2, ..., y^q_t)_{q=1}^Q be Q time series covariates that can be accurately predicted in the near future. To predict the target time series at multiple future horizons from t+1 to t+k, we define forecasting models that use past data of a fixed length w, i.e., for times t-w+1 : t. The goal of such a forecasting model is to predict the trajectory Z^N_{t+1:t+k} of the target variables at the next k time points using the past w time points of all time series (targets and covariates) and the k future time points of the predictable covariates. The model is mathematically expressed as follows:

Z^N_{t+1:t+k} = F(Z^N_{t-w+1:t}, X^M_{t-w+1:t}, Y^Q_{t-w+1:t+k}; Θ),   (1)

where F(·) is a function, Θ denotes the learnable parameters, w represents the length of the history used, and k is the length of the forecasting horizon.

3. METHODOLOGY

In this section, we provide details on the learning model mentioned in Eq. (1) above. We present two different strategies, shifting and padding, to achieve this.

3.1. THE Shifting STRATEGY

In the shifting strategy, we focus on exploiting the predicted covariates for the future time period of interest. This is achieved by shifting the predictions of the future covariates back in time by s time steps, so that each present covariate is aligned and fused with a predicted covariate s time points into the future, producing distinct feature vectors at each time point (Fig. 2). Thus, the input features are composed of all time series (targets and covariates) aligned from time points t-w+1 to t, together with future covariates from time points t-w+1+s to t+s. Target variables are predicted for the forecasting horizon from t+1 to t+k. This explicit use of the predicted future covariates differentiates our approach from traditional forecasting approaches and is expected to improve deep learning models, since they learn simultaneously from the past and the future. The predicted future covariates are shown as a blue dashed trajectory in Fig. 2. The target values at time t+1 and later are computed under the shifting approach as follows:

Z^N_{t+1:t+k} = G(Z^N_{t-w+1:t}, X^M_{t-w+1:t}, Y^Q_{t-w+1:t}, Y^Q_{t-w+1+s:t+s}; Θ),   (2)

where G(·) is the model function, Z^N_{t+1:t+k} represents the N target variables to be predicted at the future time points t+1 : t+k (region shaded brown in Fig. 2), Z^N_{t-w+1:t} represents the target variables from the past w time points (brown trajectories in Fig. 2), X^M_{t-w+1:t} and Y^Q_{t-w+1:t} represent the covariates from the past w time points (green and red trajectories in Fig. 2), Y^Q_{t-w+1+s:t+s} represents the predictable future covariates, including predictions up to s time points into the future and shifted back by s time steps (green trajectories merged with dashed blue trajectories in Fig. 2), and Θ is the set of learnable parameters. The shift amount s is a hyperparameter of the model.
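The shifted feature construction above can be sketched in a few lines of numpy. All variable names, shapes, and random data below are illustrative assumptions, not the paper's code:

```python
import numpy as np

w, s, T = 72, 24, 300          # history window, shift length, series length (assumed)
N, M, Q = 1, 3, 2              # targets, unpredictable covariates, predictable covariates

z = np.random.randn(T, N)      # Z: target series
x = np.random.randn(T, M)      # X: covariates without reliable forecasts
y = np.random.randn(T, Q)      # Y: covariates with forecasts up to s steps ahead

def shifted_features(t):
    """Input matrix for one training example ending at time t (0-indexed)."""
    hist = slice(t - w + 1, t + 1)            # past window t-w+1 : t
    fut = slice(t - w + 1 + s, t + 1 + s)     # the same window shifted forward by s
    # each row j now holds (z_j, x_j, y_j, y_{j+s}): past fused with predicted future
    return np.concatenate([z[hist], x[hist], y[hist], y[fut]], axis=1)

feats = shifted_features(100)
assert feats.shape == (w, N + M + 2 * Q)      # one row per past step, fused columns
```

The label for this example would be the target slice z[t+1 : t+k+1], so the model maps a (w, N+M+2Q) matrix to the k future target values in one shot.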

3.2. THE Padding STRATEGY

The padding approach incorporates future covariates that cannot be accurately predicted by simply copying their values from the previous s time points. The padded values are then combined with the values of future covariates that can be accurately predicted. More formally, the padding method makes copies of X^M_{t-s:t} and Z^N_{t-s:t} and uses them as the padded values (Fig. 3). This manipulation can be viewed as creating a pseudo-future time range T = (t+1, ..., t+s) in which the target variables Z̃^N_{t+1:t+s} (brown dashed line in Fig. 3) and covariates X̃^M_{t+1:t+s} (red dashed line in Fig. 3) repeat the previous pattern from t-s to t. For the predictable future covariates, the padding uses the best available predictions instead of copies from the recent past. Eq. (3) provides a mathematical description of the padding forecasting model:

Z^N_{t+1:t+k} = H(Z̃^N_{t-w+1:t+s}, X̃^M_{t-w+1:t+s}, Y^Q_{t-w+1:t+s}; Θ),   (3)

where H(·) is the model function, Z^N_{t+1:t+k} represents the N target variables to be predicted at the future time points t+1 : t+k (region shaded brown in Fig. 3), Z̃^N_{t-w+1:t+s} = Z^N_{t-w+1:t} + Z^N_{t+1-s:t} is the concatenation of the time series in the range t-w+1 : t with a copy of the time series in the range t+1-s : t (brown trajectories in Fig. 3), X̃^M_{t-w+1:t+s} = X^M_{t-w+1:t} + X^M_{t+1-s:t} is the set of padded covariates corresponding to covariates that cannot be accurately predicted (pink trajectories in Fig. 3), Y^Q_{t-w+1:t+s} is the set of predictable future covariates padded with the predictions for the time range t+1 : t+s (green trajectory in Fig. 3), and Θ is the set of learnable parameters.

Figure 3: Input data transformed using the padding strategy. Pink and brown dotted trajectories represent the padded information of the covariates that cannot be accurately predicted and of the target variables being modeled, while the blue dotted trajectories represent the predicted future covariates.
Left: Original trajectories of all variables with the predicted future covariates. Right: All input features with the padded values and predicted future covariates and the output trajectory.
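The padded input of Eq. (3) can be sketched analogously; again, the names, shapes, and random data are illustrative assumptions:

```python
import numpy as np

w, s, T = 72, 24, 300          # history window, padding length, series length (assumed)
N, M, Q = 1, 3, 2              # targets, unpredictable covariates, predictable covariates

z = np.random.randn(T, N)      # targets
x = np.random.randn(T, M)      # covariates without reliable forecasts
y = np.random.randn(T, Q)      # covariates whose future part holds the available predictions

def padded_features(t):
    """Length-(w+s) input: real history plus a pseudo-future of s copied steps."""
    hist = slice(t - w + 1, t + 1)
    copy = slice(t - s + 1, t + 1)          # last s steps, repeated as the pseudo-future
    z_pad = np.concatenate([z[hist], z[copy]], axis=0)   # tilde-Z in Eq. (3)
    x_pad = np.concatenate([x[hist], x[copy]], axis=0)   # tilde-X in Eq. (3)
    y_full = y[t - w + 1 : t + 1 + s]       # predictions fill the future part of Y
    return np.concatenate([z_pad, x_pad, y_full], axis=1)

feats = padded_features(100)
assert feats.shape == (w + s, N + M + Q)    # w real steps plus s pseudo-steps
```

Compared with shifting, the feature matrix is narrower (no duplicated Y columns) but taller, which matches the longer filters and extra hidden states described for the padding backbones below.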

3.3. NETWORK ARCHITECTURES

To validate the effectiveness of the shifting and padding strategies for feature representation, we constructed simple RNN and CNN architectures that take as input the covariates described in Sections 3.1 and 3.2. The networks forecast all k future steps in one shot, instead of the traditional sequential prediction, to avoid accumulation of errors.

3.3.1. RNN MODELS WITH SHIFTING

Shifting is implemented by providing the RNN backbone with w hidden states, as shown in Fig. 4. The standard RNN is further modified to remove the hidden states h_{t+1}, ..., h_{t+k}, enabling a one-shot prediction of the k future time points. The input to each hidden state h_j consists of the variables associated with time j (i.e., the target variable z_j and covariates x_j, y_j), as well as the predicted covariates shifted by s, y_{j+s}, as shown in Fig. 4. The RNN generates the output target variables (z_{t+1}, ..., z_{t+k}) in a one-shot manner. The hidden states are described as follows:

h_j = f(h_{j-1}, z_j, x_j, y_j, y_{j+s}),   (4)

where f is an activation function; h_j and h_{j-1} refer to the current and previous hidden states; z_j, x_j, and y_j represent the target time series, the covariates that cannot be accurately predicted, and the predictable covariates over the past w steps; and y_{j+s} denotes the predicted future covariates from s steps into the future.

Figure 4: RNN models with the shifting strategy. Dashed ovals represent predicted future covariates that have been shifted. Solid ovals are historical data or data for which no future predictions are available. Colors are the same as before.
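A hand-rolled numpy sketch of this recurrence and the one-shot output head follows. The hidden size, weight shapes, tanh activation, and linear readout are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
w, s, k = 72, 24, 24                      # history, shift, horizon (assumed values)
N, M, Q, H = 1, 3, 2, 16                  # feature dims and hidden size
d_in = N + M + Q + Q                      # per-step input (z_j, x_j, y_j, y_{j+s})

W_h = rng.normal(0, 0.1, (H, H))          # recurrent weights
W_in = rng.normal(0, 0.1, (H, d_in))      # input weights
b = np.zeros(H)
W_out = rng.normal(0, 0.1, (k * N, H))    # one-shot head: all k steps at once

def forward(inputs):
    """inputs: (w, d_in) matrix aligned as in Fig. 4; returns (k, N) predictions."""
    h = np.zeros(H)
    for j in range(w):                     # only the w past hidden states are kept
        h = np.tanh(W_h @ h + W_in @ inputs[j] + b)
    return (W_out @ h).reshape(k, N)       # z_{t+1:t+k} predicted in one shot

preds = forward(rng.normal(size=(w, d_in)))
assert preds.shape == (k, N)
```

The key points the sketch illustrates are that the recurrence stops at time t (no hidden states for the forecast horizon) and that the readout emits all k targets simultaneously, avoiding iterative error accumulation.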

3.3.2. RNN MODELS WITH PADDING

The RNN backbone for padding is similar to that for shifting, but extended with s extra states for the pseudo-times, as shown in Fig. 5. A precise mathematical formulation is given below:

h_j = f(h_{j-1}, z_j, x_j, y_j),   j ∈ [t-w+1, t],
h_j = f(h_{j-1}, z̃_j, x̃_j, ỹ_j),   j ∈ [t+1, t+s],

where f(·) is the activation function; h_j and h_{j-1} refer to the current and previous hidden states; z_j, x_j, y_j represent the target variables, the covariates that cannot be accurately predicted, and the covariates that can be accurately predicted, all for the past time points j ∈ [t-w+1, t]; and z̃_j, x̃_j, ỹ_j represent the padded versions of the same variables for the future time range.

3.3.3. CNN MODELS WITH SHIFTING

Convolutional Neural Networks (CNNs) summarize and learn from the input data using sliding filters that extract features through convolution operations. The time series are aligned in the same manner as for RNNs with shifting. Each CNN filter extracts features in parallel by sliding from the first time point to the last, as shown on the left in Fig. 6. The parallel feature maps produced by the filters are then used to output the target variables for the k future time points.
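A minimal numpy sketch of one such sliding filter over the shifting-aligned input; the filter width, channel count, and random data are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
w, C, F = 72, 8, 5                 # time steps, input channels (z, x, y, shifted y), filter width

inputs = rng.normal(size=(w, C))   # rows aligned as in the shifting strategy
kernel = rng.normal(size=(F, C))   # one filter spanning all channels

def conv1d_valid(inputs, kernel):
    """Slide the filter from the first to the last time point ('valid' positions)."""
    w, _ = inputs.shape
    F, _ = kernel.shape
    return np.array([np.sum(inputs[i:i + F] * kernel) for i in range(w - F + 1)])

feat_map = conv1d_valid(inputs, kernel)
assert feat_map.shape == (w - F + 1,)
```

In a full model, many such filters run in parallel and their feature maps feed the layers that emit the k future targets; because the shifted covariates sit in the same rows as the past data, each filter sees past and predicted-future values jointly.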

3.3.4. CNN MODELS WITH PADDING

The time series are aligned in the same manner as for RNNs with padding. The sliding filters are similar to those used in CNNs with shifting, but they span w + s time points, as shown in Fig. 7. CNNs with padding require more scanning because of the extra padded pseudo-times (t+1 : t+s), but the filters themselves are more compact because fewer time series are processed.

4. EXPERIMENTS

4.1. DATASETS

Three real-world datasets were used for the time series forecasting tasks in this paper. The Beijing PM2.5 and Electricity price datasets are publicly available from the UCI and Kaggle repositories, respectively. The third is the Water stage dataset, downloaded from the South Florida Water Management District website. More detailed dataset descriptions can be found in Appendix A. Beijing PM2.5 includes hourly observations from January 1, 2010, to December 31, 2014. We consider PM2.5 as the target variable to predict; other variables, such as dew point, temperature, pressure, and wind speed, are prior-known covariates.

4.2. TRAINING AND EVALUATION

For each dataset, we used the first 80% as a training set to train the model and the last 20% as a test set to evaluate the performance. During the training phase, common techniques such as normalization, dropout, and regularization were used to avoid overfitting. Grid search was used to tune the models for optimal hyperparameters, including the number of layers, the number of neurons, the learning rate, the batch size, the regularization factor, and the number of epochs. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) were used to evaluate the models. Details on the training and evaluation process are in Appendix B. The four proposed deep learning models (RNNs or CNNs, with shifting or padding) were tested on the three datasets. Seq-to-Seq models (Du et al., 2020), MQRNN (Wen et al., 2017), DeepAR (Salinas et al., 2020), and the Temporal Fusion Transformer (TFT) (Lim et al., 2021) were compared against the four proposed methods. We used horizon values of k = 6, 12, 24, and 48 hours with input windows of size w = 72 hours and predictable future covariates from s time steps ahead.
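The evaluation protocol can be sketched as a chronological split plus a grid search; `train_and_score` below is a hypothetical stand-in for fitting a model and returning its validation error, and the grid values are illustrative:

```python
from itertools import product

def chrono_split(series, frac=0.8):
    cut = int(len(series) * frac)       # no shuffling: the test set is the last 20%
    return series[:cut], series[cut:]

# illustrative grid; the paper also tuned neurons, regularization, and epochs
grid = {"lr": [1e-3, 1e-4], "batch_size": [32, 64], "layers": [1, 2]}

def grid_search(train, test, train_and_score):
    best_params, best_score = None, float("inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = train_and_score(train, test, **params)  # e.g. validation MAE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

series = list(range(100))
train, test = chrono_split(series)
assert len(train) == 80 and len(test) == 20
```

Keeping the split chronological matters for time series: shuffling before splitting would leak future information into training.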

4.3. RESULTS

The results are summarized in Tables 1, 2, and 3, with the lowest errors in each column shown in bold. The shifting and padding methods with CNNs outperform all the other methods compared in this paper on two of the three datasets. The methods with RNNs are sensitive to the forecasting length k: they outperform the "Baseline" methods for short-term predictions (k = 6, 12 hrs) but often fail to do so for long-term forecasting (k = 24, 48 hrs). Our experiments suggest that the future covariates play a significant role in time series predictions. It is also surprising that the simple models (RNNs and CNNs) outperform the sophisticated and complicated deep neural networks (labeled "Baseline" in the tables). We also experimented with the hyperparameter s, which refers to the extent of shifting or padding. As expected, performance is sensitive to the choice of s, and the best performance appears to be achieved when s = k. In many applications, we may be able to reliably predict future covariates for a window much larger than the k used here. However, when s > k, the performance is either flat or deteriorates as s increases, and when s > w, the performance appears to deteriorate rapidly. Our experiments also allowed us to consider whether the best s is influenced by the periodicity of the datasets. The graphs in Fig. 8 show the MAE and RMSE values as a function of the shift length s for different values of k, with s ranging from smaller than k (the prediction horizon) to larger than w (the input window size).
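The sweep over s can be sketched as a simple loop. `toy_eval` below is a hypothetical stand-in for training a model with shift length s and returning its test (MAE, RMSE); its shape merely mimics the observed trend of errors being smallest near s = k:

```python
w, k = 72, 24   # input window and one of the forecast horizons used in the experiments

def sweep_shift(evaluate, s_values):
    """Collect (MAE, RMSE) for each candidate shift length."""
    return {s: evaluate(s) for s in s_values}

# span the three regimes studied: s < k, k <= s <= w, and s > w
s_values = range(1, w + k + 1)

# toy stand-in, NOT real results: error grows with distance from s = k
toy_eval = lambda s: (abs(s - k) / w, 1.3 * abs(s - k) / w)
results = sweep_shift(toy_eval, s_values)
best_s = min(results, key=lambda s: results[s][0])  # smallest MAE
assert best_s == k
```

In the real experiments each call to `evaluate` would retrain the CNN from scratch, so the sweep is the dominant computational cost of this analysis.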

5. DISCUSSION AND CONCLUSIONS

Simple modifications to the basic RNN and CNN architectures have allowed us to build deep learning models that outperform more sophisticated models. Our experiments suggest that the utilization of future covariates (whether predictable or not) can enhance performance considerably. Furthermore, our experiments have helped us delineate the relationship between the shifting and padding lengths and the model performance. We observed that s = k gives the best performance, since the predicted future covariates then cover exactly the forecasting horizon. If s < k, only some of the predicted covariates for the k-step prediction horizon are utilized, resulting in considerably lower performance. If s > w, some of the future covariates are dropped in order to align with the input window, which again results in considerably lower performance. We also observed considerable periodicity and seasonality in the datasets used in our experiments. However, in the range k <= s <= w, the variations in performance were too small to be significant. While there were local minima in the error when s was a multiple of the period p, the improvements were not significant. In conclusion, the shifting- and padding-based CNNs proposed in this paper outperformed all the baseline deep learning methods considered in our experiments. The corresponding RNN methods outperformed the baselines for relatively short-term prediction (k = 6, 12 hrs) but were not as accurate for longer-term forecasting (k = 24, 48 hrs). The critical feature of our methods appears to be the use of future covariates, whether by shifting or padding. Shifting-based CNNs performed best when the shift length s lies between the forecast horizon and the length of the input window (k ≤ s ≤ w), suggesting a sweet spot for how much of the future is needed for good performance.
Periodicity in the datasets appears to have a small but insignificant influence on the performance of the shift-based methods.

A APPENDIX: DATASET

Water stage prediction: We downloaded a real-world dataset containing the water stage, the height of the gate opening, the amount of water flowing through the gate, the amount of water pumped, and the rainfall, from January 1, 2010, to December 31, 2020. The water stage is the target variable, while the other variables are covariates. Rainfall, gate position, and pump control are prior-known covariates. The water stage lies in [-1.25, 4.05] feet in this dataset.

C APPENDIX: SHIFTING STRATEGY

For the shifting strategy shown in Figure 2, to predict the target variables k time steps into the future, we shifted the future covariates over the same horizon back by exactly k steps to incorporate sufficient feature representations of the future covariates, because the future covariates from time t+1 to t+k have the most influence on the target variables over that same horizon. However, in the real world, future covariates may be available over a longer range (> k). Therefore, we also tried different shifting lengths using CNNs to explore the relationship between the shifting length s, the past length w, the forecasting length k, and the possible periodicity p of the dataset. Here we used the water stage dataset, since the water stage is influenced by the tide, whose periodicity is roughly 12 hours. The length of the past data is still 3 days (i.e., w = 72 hours). The corresponding MAEs and RMSEs are listed in Tables 11 and 12 and visualized in Figure 12.

C.1 s < k

If s < k, a shorter range of future covariates is shifted into the past, which causes the covariates from t+s to t+k within the k future time steps to be missed. However, the future covariates from time t+1 to t+k have the greatest influence on the target variables over that same horizon. These missing covariates result in the higher MAEs and RMSEs shown in the left part of each subplot, where s < k (Figure 12).

C.2 w < s <= w + k

If w < s <= w + k, some of the future covariates from time t+1 to t+s-w are also missed because they are shifted so many steps into the past that they fall outside the window of past data. They would be entirely outside the window when s = w + k. These missing covariates result in the higher MAEs and RMSEs shown in the right part of each subplot, where w < s <= w + k (Figure 12).



Although this has no bearing on the analysis or conclusions, the currency for the price column in this dataset was unavailable. The $ sign is used as a proxy for whatever currency was intended.



Figure 1: Direct method using sequence-to-sequence models.

Figure 2: Input data transformed by shifting the predicted future covariates. Left: Original trajectories of all variables. Right: All input features with the shifted predicted future covariates included.

Figure 5: RNN models with the padding strategy. Dashed ovals represent padded features. Solid ovals before time t are historical data. Colors are the same as before.

Figure 6: CNN models with the shifting strategy. Left: Input from the original past and the shifted future. Right: Actual future target time series. The blue line represents prior-known future covariates. The pink and brown lines are past covariates without future predictions and the target time series, respectively.

Figure 7: CNN models with the padding strategy. Left shows inputs from the past with padded series. Filters create convolutional layers, which are then used to output the target variables on the right. Colors are used as in previous figures.

Figure 8: MAE & RMSE for different forecasting lengths (k) and shift lengths (s). The left red point of each subplot represents the errors when s = k, while the right red point denotes the errors when s = w.

Figure 9: Input data transformed by shifting connection (s < k) for future covariates. Left: Original heterogeneous inputs and output. Right: Shifted heterogeneous inputs and output. Dotted lines represent prior-known future covariates.

Figure 10: Input data transformed by shifting connection (w < s <= w + k) for future covariates. Left: Original heterogeneous inputs and output. Right: Shifted heterogeneous inputs and output. Dotted lines represent prior-known future covariates.

Figure 11: Input data transformed by shifting connection (k <= s <= w) for future covariates. Left: Original heterogeneous inputs and output. Right: Shifted heterogeneous inputs and output. Dotted lines represent prior-known future covariates.

MAE & RMSE for the Beijing PM2.5 dataset. (PM2.5 values ∈ [0, 671] µg/m^3)

A.1 BEIJING PM2.5 DATASET

This is the Beijing Air Quality data set from the public UCI website https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data. It includes hourly observations from January 1, 2010, to December 31, 2014. The data set has 43,824 rows and 13 columns. The first column is simply an index and was ignored in the analysis. The four columns labeled year, month, day, and hour were combined into a single feature called "year-month-day-hour". We consider PM2.5 as the target variable to predict; other variables, such as dew point, temperature, pressure, wind speed, wind direction, snow, and rain, are prior-known covariates that can influence PM2.5 values. PM2.5 ∈ [0, 671] µg/m^3 in this dataset.

A.2 ELECTRICITY PRICE DATASET

This dataset is publicly available on the Kaggle website: https://www.kaggle.com/datasets/?search=hourly+energy+demand. It contains two hourly datasets spanning January 1, 2015, to December 31, 2018. Energy dataset.csv includes information on energy demand, generation, and prices, while weather features.csv gives weather features such as temperature and humidity. The electricity price is the target variable to predict, while the prior-known covariates are energy demand, generation, and the weather features. The electricity price ∈ [9.33, 116.8] in this dataset.

Description of Energy (electricity) price dataset

Description of weather feature dataset

Description of the water stage dataset. MEAN RAIN: mean value of radar rainfall, in inches.

MAE & RMSE for water stage dataset with different shifting lengths.

MAE & RMSE for water stage dataset with different shifting lengths.

B APPENDIX: TRAINING AND EVALUATION

For each dataset, we used the first 80% as a training set to train the model and the last 20% as a test set to evaluate the performance. During the training phase, common techniques such as normalization, dropout, and regularization were used to avoid overfitting. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are the metrics used to evaluate the models.

B.1 TRAINING DETAILS

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|,   RMSE = sqrt( (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)^2 ),

where N is the number of samples in the test set, ŷ_i is the value predicted by the model, and y_i is the actual value in the test set.
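The two measures can be transcribed directly into numpy (the sample arrays below are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    # mean absolute deviation between predictions and ground truth
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    # square root of the mean squared deviation
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
# absolute errors are [0, 0, 2], so MAE = 2/3 and RMSE = sqrt(4/3)
assert np.isclose(mae(y_true, y_pred), 2.0 / 3.0)
assert np.isclose(rmse(y_true, y_pred), np.sqrt(4.0 / 3.0))
```

RMSE penalizes large deviations more heavily than MAE, which is why the two tables can rank methods slightly differently.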

C.4 VISUALIZATION OF MAE & RMSE WITH DIFFERENT SHIFTING LENGTHS

With the Water Stage dataset, we tried different shifting lengths, from 1 to w + k, for each forecasting length.

