COPULA CONFORMAL PREDICTION FOR MULTI-STEP TIME SERIES FORECASTING

Abstract

Accurate uncertainty measurement is a key step in building robust and reliable machine learning systems. Conformal prediction is a distribution-free uncertainty quantification algorithm popular for its ease of implementation, statistical coverage guarantees, and versatility with respect to the underlying forecaster. However, existing conformal prediction algorithms for time series are limited to single-step prediction and do not consider temporal dependency. In this paper we propose a Copula Conformal Prediction algorithm for multivariate, multi-step Time Series forecasting, CopulaCPTS. On several synthetic and real-world multivariate time series datasets, we show that CopulaCPTS produces more calibrated and sharper confidence intervals for multi-step prediction tasks than existing techniques.

• We present a re-calibration technique such that CopulaCPTS can produce valid confidence intervals for time series forecasts of varying lengths.

Uncertainty Quantification for Deep Time-Series Forecasting. The two major paradigms of Uncertainty Quantification (UQ) methods for deep neural networks are Bayesian and Frequentist. Bayesian approaches estimate a distribution over the model parameters given data, and then marginalize these parameters to form output distributions via Markov Chain Monte Carlo (MCMC) sampling (

1. INTRODUCTION

Deep learning models are becoming widely used in high-risk settings such as healthcare, transportation, and finance. In these settings, it is important that a model produces calibrated uncertainty to reflect its own confidence and to assist decision making. Confidence regions are a common approach to quantifying forecast uncertainty (Khosravi et al., 2011). A $(1-\alpha)$-confidence region $\Gamma^{1-\alpha}$ for a random variable $y$ is valid if it contains $y$'s true value with high probability: $P(y \in \Gamma^{1-\alpha}) \geq 1-\alpha$. Note that one can make $\Gamma^{1-\alpha}$ infinitely wide to satisfy validity. For the confidence region to be useful, we want to minimize its area while remaining valid; this is known as the efficiency of the region.

Conformal prediction is a powerful method that produces confidence regions with finite-sample guarantees of validity (Vovk et al., 2005; Lei et al., 2018). Furthermore, it makes no assumptions about the forecast model or the underlying data distribution. Its generality, simplicity, and statistical guarantees have made conformal prediction popular for many real-world applications, including drug discovery (Eklund et al., 2015) and robotics (Luo et al., 2021).

In this paper we present a more calibrated and efficient conformal prediction algorithm for multi-step time series forecasting. This type of problem is ubiquitous: examples include predicting hurricane paths, vehicle trajectories, and financial or epidemic forecasts. To quantify uncertainty, we want a "cone of uncertainty" that covers the entirety of the forecast horizon. Stankevičiūtė et al. (2021) presented an algorithm, CF-RNN, for multi-step time series with the assumption that all time steps are modeled independently. In practice, however, we found that CF-RNN produces confidence regions that are often too large to be useful, especially when the problem is multivariate or has a long forecast horizon. We introduce CopulaCPTS, a Copula-based Conformal Prediction algorithm for multi-step Time Series forecasting.
We improve efficiency by utilizing copulas to model the dependency between forecasted time steps. A copula is a multivariate cumulative distribution function that models the dependence between multiple random variables. It is widely used in economic forecasting (Nelsen, 2007; Patton, 2012) and has been introduced to conformal algorithms by Messoudi et al. (2021) to model correlation between multiple targets in non-temporal settings. Copula processes have been explored in generative models for time series (Salinas et al., 2019; Drouin et al., 2022). We found that copulas are effective in capturing uncertainty relations between time steps.

Our contributions are:
• CopulaCPTS is a general uncertainty quantification algorithm that can be applied to any multivariate multi-step forecaster, with statistical guarantees of validity.
• CopulaCPTS produces significantly sharper and more calibrated uncertainty estimates than state-of-the-art baselines on 4 benchmark datasets, both synthetic and real.

Frequentist UQ methods emphasize robustness against variations in the data. These approaches either rely on resampling the data or learn an interval bound to encompass the dataset. For time series forecasting UQ, frequentist approaches include ensemble methods such as bootstrap (Efron & Hastie, 2016; Alaa & Van Der Schaar, 2020) and jackknife methods (Kim et al., 2020; Alaa & Van Der Schaar, 2020); interval prediction methods include interval regression through proper scoring rules (Kivaranovic et al., 2020; Wu et al., 2021) and quantile regression (Takeuchi et al., 2006), with many recent advances for time series UQ (Tagasovska & Lopez-Paz, 2019; Gasthaus et al., 2019; Park et al., 2022; Kan et al., 2022). Many of the frequentist methods produce asymptotically valid confidence regions and can be categorized as distribution-free UQ techniques, as they are (1) agnostic to the underlying model and (2) agnostic to the data distribution.

Conformal Prediction.
Conformal prediction (CP) is an important member of distribution-free UQ methods; we refer readers to Angelopoulos & Bates (2021) for a comprehensive introduction and survey. CP has become popular because of its simplicity, theoretical soundness, and low computational cost. A key difference between CP and other UQ methods is that, under the exchangeability assumption, conformal methods guarantee validity in finite samples (Vovk et al., 2005).

Most relevant to our work is recent progress on extending conformal prediction to time series forecasting. According to Stankevičiūtė et al. (2021) there are two settings: the data is generated either from a single time series or from multiple independent time series. For the first setting, ACI (Gibbs & Candes, 2021) and EnbPI (Xu & Xie, 2021) are CP algorithms that relax the exchangeability assumption while maintaining asymptotic validity via online learning (the former) and ensembling (the latter); Zaffran et al. (2022) further improve on the online adaptation, and Sousa et al. (2022) combine EnbPI with conformal quantile regression (Romano et al., 2019) to model heteroscedastic time series. These algorithms for a single time series are not designed to model multiple independent time series, where extracting common patterns can improve forecast and UQ results. The validity guaranteed by these methods holds on average over time steps and is often asymptotic, rather than covering the full horizon as in our setting.

Stankevičiūtė et al. (2021) share with us the setting of multi-step forecasting for multiple time series, though they focus on univariate medical time series. We show that their method of applying Bonferroni correction produces inefficient confidence regions, especially when the data is multidimensional. Neeven & Smirnov (2018) and Messoudi et al. (2020) propose CP algorithms for multi-target regression in the non-temporal setting, creating box-like regions to account for the correlations between the labels.
In our work we develop CopulaCPTS for multivariate multi-step time series forecasting.

3.1. INDUCTIVE CONFORMAL PREDICTION (ICP)

Let $D = (z_1, \ldots, z_n)$ be a dataset of size $n$. We denote by $z_i = (x_i, y_i)$ a sample of an input and output pair that follows the distribution $\mathcal{P}$, where $i$ is the data index. The input space $X$ and target space $Y$ can be two arbitrary measurable spaces; their Cartesian product $Z = X \times Y$ is the sample space. We begin the algorithm by splitting the dataset into a proper training set $D_{\text{train}}$ with $|D_{\text{train}}| = m$ and a calibration set $D_{\text{cal}}$ with $|D_{\text{cal}}| = n - m$. The objective of conformal prediction is to produce a valid confidence region (Definition 1).

Definition 1 (Validity). Given a new data pair $(X, Y) \sim \mathcal{P}$ and a desired coverage rate $1-\alpha \in (0, 1)$, the region $\Gamma^{1-\alpha}(X)$ is valid if

$P_{(X,Y) \sim \mathcal{P}}(Y \in \Gamma^{1-\alpha}(X)) \geq 1 - \alpha$

An important component of conformal prediction is the nonconformity score $A : Z \times Z^m \to \mathbb{R}$, a function that captures how well a sample $z$ conforms to the proper training set. For example, we may choose the nonconformity score $A(z, D_{\text{train}})$ to be the L2 distance between the target and the forecast for the new sample:

$A(z, D_{\text{train}}) := \|y - f(x)\|$ (1)

where $f : X \to Y$ is a forecasting model trained with the proper training set $D_{\text{train}}$. In the following sections, we refer to the nonconformity score of a sample $z$ given forecasting model $f$ as $A(z, f)$ for simplicity. Let $s_i = A(z_i, f)$ denote the nonconformity score of a sample $z_i$ and let $s_{\text{cal}} = \{A(z_i, f)\}_{z_i \in D_{\text{cal}}}$ denote the set of nonconformity scores of all samples in $D_{\text{cal}}$. We define the empirical $p$-quantile of a set of nonconformity scores $S$ as:

$Q(p, S) := \inf\{ s' : (\frac{1}{|S|} \sum_{s_i \in S} 1_{s_i \leq s'}) \geq p \}$ (2)

Definition 2 (Exchangeability). A dataset $\{z_1, z_2, \ldots, z_n\}$ of length $n$ is exchangeable if all of its $n!$ permutations are equally probable.

Under the exchangeability assumption (Definition 2), conformal prediction is theoretically guaranteed to be valid (Definition 1). Let $z_{n+1} = (x_{n+1}, y_{n+1}) \in Z$ be a new sample, and let $\alpha \in [0, 1]$ be a chosen significance level (and $1-\alpha$ the confidence level).
Since $D_{\text{cal}} \cup \{z_{n+1}\}$ is exchangeable, the rank of $A(z_{n+1}, f)$ among $\{A(z_i, f)\}_{z_i \in D_{\text{cal}}}$ is uniformly distributed. Therefore,

$P(A(z_{n+1}, f) \leq Q(1-\alpha, s_{\text{cal}} \cup \{\infty\})) = \frac{\lceil (|D_{\text{cal}}| + 1)(1-\alpha) \rceil}{|D_{\text{cal}}| + 1} \geq 1 - \alpha$ (3)

The conformal confidence region is constructed as in equation 4. We say a sample is covered if the true value is in the confidence region, $y_{n+1} \in \Gamma^{1-\alpha}(x_{n+1})$. Equation 3 means that the probability of $y_{n+1}$ being covered is at least $1-\alpha$, hence the confidence region is valid.

$\Gamma^{1-\alpha}(x_{n+1}) := \{ y : A((x_{n+1}, y), f) \leq Q(1-\alpha, s_{\text{cal}} \cup \{\infty\}) \}$ (4)

The procedure introduced above is known as inductive or split conformal prediction, as it splits the dataset into training and calibration sets to reduce the amount of computation (Vovk et al., 2005; Lei & Wasserman, 2012). We introduce it because it is commonly used for uncertainty quantification for machine learning models (Papadopoulos, 2008). Although our method in this paper, CopulaCPTS, is implemented based on inductive CP, it can be easily adjusted for other CP variants.
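The split procedure above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation; the function names (`empirical_quantile`, `l2_score`, `conformal_radius`) and the list-based data layout are assumptions made for the example.

```python
import math


def empirical_quantile(p, scores):
    """Q(p, S): smallest element s' of S such that at least a fraction p of S is <= s'."""
    ordered = sorted(scores)
    # Smallest 1-indexed rank r with r / |S| >= p (tiny tolerance for float round-off).
    rank = max(1, math.ceil(p * len(ordered) - 1e-12))
    return ordered[rank - 1]


def l2_score(y, y_hat):
    """Nonconformity score A(z, f) = ||y - f(x)|| for vector-valued targets."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)))


def conformal_radius(cal_inputs, cal_targets, forecaster, alpha):
    """Radius s such that {y : ||y - f(x)|| <= s} is the (1 - alpha) conformal region."""
    scores = [l2_score(y, forecaster(x)) for x, y in zip(cal_inputs, cal_targets)]
    # Appending infinity mirrors Q(1 - alpha, s_cal U {inf}) in equation 4.
    return empirical_quantile(1 - alpha, scores + [math.inf])
```

With nine calibration scores 1, 2, ..., 9 and alpha = 0.1, the quantile is taken at rank 9 of the ten augmented scores, so the radius is 9; with alpha = 0.05 the rank is 10 and the radius is infinite, reflecting that nine calibration points cannot support a 95% guarantee.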

3.2. COPULA AND ITS PROPERTIES

In this paper we model the dependency between the uncertainty of different time steps with copulas. A copula is a concept from statistics that describes the dependency structure in a multivariate distribution. We can use copulas to capture the joint distribution for multiple future time steps. This section briefly introduces the notation and concepts.

Definition 3 (Copula). Given a random vector $(X_1, \cdots, X_k)$, define the marginal cumulative distribution function (CDF) as $F_i(x) = P(X_i \leq x)$. The copula of $(X_1, \cdots, X_k)$ is the joint CDF of $(U_1, \cdots, U_k) = (F_1(X_1), \cdots, F_k(X_k))$, meaning

$C(u_1, \cdots, u_k) = P(U_1 \leq u_1, \cdots, U_k \leq u_k)$

Alternatively,

$C(u_1, \cdots, u_k) = P(X_1 \leq F_1^{-1}(u_1), \cdots, X_k \leq F_k^{-1}(u_k))$

In other words, the copula function captures the dependency structure between the variables $X_i$; we can view a $k$-dimensional copula $C : [0, 1]^k \to [0, 1]$ as a CDF with uniform marginals. A fundamental result in the theory of copulas is Sklar's theorem.

Theorem 1 (Sklar's theorem). Given a joint CDF $F(x_1, \cdots, x_k)$ and the marginals $F_i(x)$, there exists a copula $C$ such that

$F(x_1, \cdots, x_k) = C(F_1(x_1), \cdots, F_k(x_k))$

for all $x_i \in [-\infty, \infty]$ and $i = 1, \cdots, k$.

Sklar's theorem states that for every multivariate distribution function, there exists a copula function such that the distribution can be expressed using the copula and the univariate marginal distributions. To give an example, when all the $X_i$ are independent, the copula function is known as the product copula: $C(u_1, \cdots, u_k) = \prod_{i=1}^{k} u_i$.
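To make the definition concrete, the following sketch draws samples from a bivariate Gaussian with correlation 0.8 (the value used in Figure 1), maps them to uniforms through the standard normal CDF, and compares an empirical estimate of $C(0.5, 0.5)$ with the product copula. It is an illustrative example only; all function names are hypothetical, and the sample size and seed are arbitrary choices.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, used here as the known marginal F_i."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_uniforms(rho, n, seed=0):
    """Draw n pairs (U_1, U_2) = (F(X_1), F(X_2)) from a correlated Gaussian pair."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2  # corr(x1, x2) = rho
        pairs.append((normal_cdf(x1), normal_cdf(x2)))
    return pairs


def empirical_copula(u_samples, u):
    """Fraction of samples that are component-wise <= u."""
    hits = sum(all(a <= b for a, b in zip(row, u)) for row in u_samples)
    return hits / len(u_samples)


def product_copula(u):
    """Copula of independent variables: the product of the coordinates."""
    return math.prod(u)
```

For a Gaussian pair with correlation $\rho$, $C(0.5, 0.5) = 1/4 + \arcsin(\rho)/(2\pi)$, which is about 0.40 for $\rho = 0.8$, compared with 0.25 under the product copula. The gap is the dependency that the copula captures and the marginals alone do not.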

4. COPULA CONFORMAL PREDICTION FOR TIME SERIES (COPULACPTS)

Many real-world decision making tasks make use of time series forecasts that are multivariate and predict multiple steps into the future. In these applications, we want a "cone of uncertainty" that covers the entire course of our forecasts. Existing time series conformal prediction methods either only provide a coverage guarantee for individual time step forecasts (Gibbs & Candes, 2021; Xu & Xie, 2021) or produce confidence regions that are often too inefficient to be useful, especially in multivariate settings (Stankevičiūtė et al., 2021). Hence, we propose a multi-step conformal prediction algorithm for time series, CopulaCPTS, that guarantees validity over multi-step forecasts. We improve the efficiency of the confidence regions by modeling the dependency of the time steps using a copula function.

We denote the time series dataset as $D = \{(x^{(i)}_{1:t}, y^{(i)}_{t+1:t+k})\}_{i=1}^{l}$, where $x_{1:t} \in \mathbb{R}^{t \times d_x}$ is $t$ time steps of input of dimension $d_x$, and $y_{t+1:t+k} \in \mathbb{R}^{k \times d_y}$ is $k$ time steps of output of dimension $d_y$. In the traditional time series forecasting setting we have $d_x = d_y$, but they do not have to be equal. Given the dataset $D$, a new test sample $x_{1:t}$, and a confidence level $1-\alpha$, our algorithm returns $k$ confidence intervals, $[\Gamma_1^{1-\alpha}, \ldots, \Gamma_k^{1-\alpha}]$, one for each time step, such that:

$P[\forall h \in \{1, \ldots, k\},\ y_{t+h} \in \Gamma_h^{1-\alpha}] \geq 1 - \alpha$ (5)

for any underlying predictive model. We define equation 5 as the validity condition for the multi-step time series setting. We use superscripts, as in $x^i$, to index data, and subscripts, as in $y_t$, to index the time steps within the multi-horizon $y$.

There are two characteristics desired in uncertainty quantification methods: calibration and efficiency. A model is calibrated when the predicted confidence level corresponds to the probability of events falling into the predicted range. This is reflected when equality holds in the validity condition, $P(y \in \Gamma^{1-\alpha}) = 1-\alpha$. Efficiency, on the other hand, refers to the size of the confidence region. There is a trade-off between validity and efficiency, as one can always set $\Gamma^{1-\alpha}$ to be infinitely large to satisfy the validity condition.
In practice, we want the measure of the confidence region (e.g. its area or length) to be as small as possible, given that the validity condition holds. In the following sections we introduce CopulaCPTS, a conformal prediction algorithm that is both calibrated and efficient for multivariate time series forecasts.

4.1. INDUCTIVE CONFORMAL PREDICTION (ICP) FOR MULTIVARIATE FORECASTS

We view the multidimensional target of dimension $d$ as a point in $\mathbb{R}^d$, so we simply have $Y_t = \mathbb{R}^d$ for each time step $t$. In this paper we choose the nonconformity score to be the L2 distance $A((x, y), f) := \|y - f(x)\|$, where $f$ is the forecast model trained on the proper training set $D_{\text{train}}$. The confidence region $\Gamma^{1-\alpha}(D, x)$ is therefore a $d$-dimensional ball. We choose this metric because we are forecasting trajectories in space. Since the conformal prediction algorithm produces valid confidence regions regardless of the choice of nonconformity score, one can choose other metrics such as the Mahalanobis (Johnstone & Cox, 2021) or L1 (Messoudi et al., 2021) distance based on domain needs, and our algorithm will still hold.

4.2. COPULA MULTI-STEP CONFORMAL REGRESSION

We showed in section 3.1 that, given a confidence level $1-\alpha$ and a forecast model $f$, the ICP algorithm finds a nonconformity score $s^{1-\alpha}$ such that for a new sample $z$ drawn from the data distribution, $z \sim \mathcal{P}$:

$P(A(z, f) \leq s^{1-\alpha}) \geq 1 - \alpha$

We can use the algorithm to estimate an empirical cumulative distribution function (CDF) for the random variable $A(z, f)$:

$\hat{F}(s) := P(A(z, f) \leq s) = \frac{1}{|D_{\text{cal}}|} \sum_{z_i \in D_{\text{cal}}} 1_{A(z_i, f) \leq s}$ (6)

For the multi-step conformal prediction algorithm, we estimate this empirical CDF for each step of the time series, denoted as

$\hat{F}_h(s_h) := P(A(z_h, f) \leq s_h)$ for $h \in 1, \ldots, k$ (7)

Let $1-\alpha_h$ denote the probability produced by each time step's CDF $\hat{F}_h(s_h)$, $h \in 1, \ldots, k$. By Sklar's theorem, we have:

$F(s_1, \ldots, s_k) = C(\hat{F}_1(s_1), \ldots, \hat{F}_k(s_k)) = C(1-\alpha_1, \ldots, 1-\alpha_k)$ (8)

We want to find the set of confidence levels $1-\alpha_h$ such that the entire predicted time series trajectory is covered by the intervals with confidence $1-\alpha$, i.e. $C(1-\alpha_1, \ldots, 1-\alpha_k) \geq 1-\alpha$. The purpose of using a copula is to model the dependency between the multiple predicted time steps, so that we can better capture the confidence region for the joint probability. This allows our algorithm to produce more efficient confidence regions. We estimate the copula $C$ with the calibration set $D_{\text{cal}}$, and then search for values $\alpha_h$ for each time step to obtain the desired multi-step coverage.

We adopt the empirical copula (Ruschendorf, 1976) as our default copula. One may pick other, parametric copula functions, such as the Gaussian copula, to introduce inductive bias and improve sample efficiency when calibration data is scarce. The empirical copula is a non-parametric method that estimates the marginals directly from observations, and hence does not introduce any bias.
For the joint distribution of a time series with $k$ time steps, the copula of a vector of probabilities $u \in [0, 1]^k$ is defined as

$C_{\text{empirical}}(u) = \frac{1}{n-m} \sum_{i=m+1}^{n} 1_{u^i < u} = \frac{1}{n-m} \sum_{i=m+1}^{n} \prod_{h=1}^{k} 1_{u^i_h < u_h}$ (9)

where $n-m$ is the size of the calibration set. Here, the $u^i$ are the cumulative probabilities for each data point in the calibration set $D_{\text{cal}}$:

$u^i = (u^i_1, \ldots, u^i_k) = (\hat{F}_1(s^i_1), \ldots, \hat{F}_k(s^i_k)), \quad i \in \{m+1, \ldots, n\}$

Note that to fulfill the validity condition of equation 5, we only need to find $u^* = (\hat{F}_1(s^*_1), \ldots, \hat{F}_k(s^*_k))$ such that

$C_{\text{empirical}}(u^*) \geq 1 - \alpha$

We can find $s^*_1, \ldots, s^*_k$ through any search algorithm. As exhaustive search is exponential in the prediction horizon $k$, we implement the search with stochastic gradient descent using PyTorch. We study the effectiveness of this method in Appendix C.6. The confidence region for each time step is constructed as the set of all $y_t \in Y_t$ such that the nonconformity score is less than $s^*_t$. Algorithm 1 summarizes the CopulaCPTS procedure. We prove that CopulaCPTS produces valid confidence regions for multi-step time series forecasting (Proposition 1) in Appendix A.

Proposition 1 (Validity of CopulaCPTS). The confidence regions provided by CopulaCPTS (Algorithm 1) are valid, i.e. $P[\forall h \in \{1, \ldots, k\},\ y_{t+h} \in \Gamma_h^{1-\alpha}] \geq 1 - \alpha$.

Algorithm 1 CopulaCPTS
  [...] $\geq 1 - \alpha$
  // Prediction
  $\hat{y}^{n+1}_{t+1:t+k} \leftarrow f(x^{n+1}_{1:t})$
  for $h = 1$ to $k$ do
    $\Gamma_h^{1-\alpha} \leftarrow \{ y : \|y - \hat{y}^{n+1}_{t+h}\| < s^*_h \}$
  end for
  return $\Gamma_1^{1-\alpha}, \ldots, \Gamma_k^{1-\alpha}$
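The calibration search can be illustrated with a short sketch. It is not the authors' implementation and makes several simplifying assumptions: it searches for one shared per-step level by bisection (a dichotomy search over a constant level) instead of the SGD search over varying levels, it works directly on a calibration score matrix with one row per trajectory and one column per time step, it uses non-strict comparisons, and it omits any finite-sample correction. The names `joint_coverage` and `search_shared_level` are hypothetical.

```python
def joint_coverage(scores, radii):
    """Fraction of calibration trajectories whose score is within the radius at every step."""
    covered = sum(all(s <= r for s, r in zip(row, radii)) for row in scores)
    return covered / len(scores)


def search_shared_level(scores, alpha):
    """Bisection for the smallest shared per-step level whose joint coverage is >= 1 - alpha.

    scores[i][h] is the nonconformity score of calibration trajectory i at step h.
    Returns (level, radii), where radii[h] is the per-step quantile at that level.
    """
    n, k = len(scores), len(scores[0])
    columns = [sorted(row[h] for row in scores) for h in range(k)]
    lo, hi = 1, n  # ranks; rank r corresponds to the per-step level r / n
    # Joint coverage is non-decreasing in the rank and equals 1 at rank n,
    # so bisection over ranks finds the smallest feasible rank.
    while lo < hi:
        mid = (lo + hi) // 2
        radii = [col[mid - 1] for col in columns]
        if joint_coverage(scores, radii) >= 1 - alpha - 1e-12:
            hi = mid
        else:
            lo = mid + 1
    return lo / n, [col[lo - 1] for col in columns]
```

When the step-wise scores are perfectly dependent, the shared level returned equals the joint target $1-\alpha$ itself; the weaker the dependence between steps, the higher the per-step level has to be, and the larger the resulting radii.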

4.3. COPULACPTS IN AUTO-REGRESSIVE FORECASTING

Auto-regressive forecasting is a common framework in time series forecasting. So far, we have considered forecasts for a predetermined number of time steps $k$. One can use a fixed-length model to forecast longer horizons $k'$ autoregressively, by taking model output as part of the input. In the conformal prediction setting, we want not only to use the point forecasts autoregressively, but also to propagate the uncertainty measurement.

Suppose we are given a model $f$ that predicts $k$ time steps. If we assume the time series to be stationary, then the copula remains the same for any sliding window of $k$ steps, i.e. $C(u_1, \ldots, u_k) = C(u_2, \ldots, u_{k+1})$. Hence, after we have found $u^*_1, \ldots, u^*_k$ such that $C(u^*_1, \ldots, u^*_k) \geq 1-\alpha$, we simply have to search for $u^*_{k+1}$ such that $C(u^*_2, \ldots, u^*_k, u^*_{k+1}) \geq 1-\alpha$. The guarantee proven in Proposition 1 still holds for the new estimation.

On the other hand, if the time series is non-stationary, we can fit copulas $C_1(u_1, \ldots, u_k), \ldots, C_{k'-k}(u_{k'-k}, \ldots, u_{k'})$, one for each autoregressive prediction, given that we have $k'$ steps of data in our calibration set. This way, we transform the autoregressive problem into $k'-k$ multi-step problems that can be solved by CopulaCPTS. It follows that each of the autoregressive predictions is valid. Appendix B.3 provides an example scenario where re-estimating the copula is necessary and improves validity.
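For the stationary case, the sliding-window step can be sketched as follows. This is an illustrative example, not the authors' code: `next_level` is a hypothetical helper, the empirical copula uses non-strict comparisons, candidate levels are restricted to values observed in the last column of the calibration probabilities, and returning `None` when no candidate suffices is a choice made for the sketch.

```python
def empirical_copula(u_samples, u):
    """Fraction of calibration probability vectors that are component-wise <= u."""
    hits = sum(all(a <= b for a, b in zip(row, u)) for row in u_samples)
    return hits / len(u_samples)


def next_level(u_samples, prev_levels, alpha):
    """Search u*_{k+1} such that C(u*_2, ..., u*_k, u*_{k+1}) >= 1 - alpha.

    u_samples: calibration probability vectors of length k (one per trajectory).
    prev_levels: the levels (u*_1, ..., u*_k) found for the first k steps.
    Returns the smallest feasible candidate, or None if the retained levels
    u*_2, ..., u*_k already cap the copula value below 1 - alpha.
    """
    window = list(prev_levels[1:])  # drop u*_1, keep u*_2, ..., u*_k
    candidates = sorted({row[-1] for row in u_samples})
    for cand in candidates:
        if empirical_copula(u_samples, window + [cand]) >= 1 - alpha - 1e-12:
            return cand
    return None
```

Because the first $k-1$ arguments are fixed, the copula value is non-decreasing in the last argument, so the first feasible candidate in sorted order is also the smallest.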

5. EXPERIMENTS

In this section, we show that CopulaCPTS produces more calibrated and efficient confidence regions than existing methods on two synthetic datasets and two real-world datasets. We also demonstrate that CopulaCPTS's advantage is more evident over longer prediction horizons, and we show its effectiveness in the autoregressive prediction setting.

Baselines. We compare our model with three representative works in different paradigms of uncertainty quantification for neural network time series prediction: the Bayesian-motivated Monte Carlo dropout RNN (MC-dropout) by Gal & Ghahramani (2016a), the frequentist blockwise jackknife RNN (BJRNN) by Alaa & Van Der Schaar (2020), and a conformal forecasting RNN (CF-RNN) by Stankevičiūtė et al. (2021). We use the same underlying prediction model for the post-hoc uncertainty quantification methods BJRNN, CF-RNN, and CopulaCPTS. The MC-dropout RNN has the same architecture but is trained separately, as it requires an extra dropout step during training and inference.

Metrics. We evaluate calibration and efficiency for each method. For calibration, we report the empirical coverage on the held-out test set. Coverage should be as close to the desired confidence level $1-\alpha$ as possible. Coverage is calculated as

$\text{coverage}_{1-\alpha} = E_{(x,y) \sim X \times Y}\, P(y \in \Gamma^{1-\alpha}(x)) \approx \frac{1}{n} \sum_{i=1}^{n} 1(y^i \in \Gamma^{1-\alpha}(x^i))$

For efficiency, we report the average area (2D) or volume (3D) of the confidence region. The metric should be as small as possible while the region remains valid (coverage stays at or above the pre-specified confidence level). The area or volume is calculated as

$\text{area}_{1-\alpha} = E_{x \sim X}\, \|\Gamma^{1-\alpha}(x)\| \approx \frac{1}{n} \sum_{i=1}^{n} \|\Gamma^{1-\alpha}(x^i)\|$
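The two metrics can be computed along the following lines for ball-shaped regions. This is a sketch under stated assumptions rather than the evaluation code used for the reported numbers: it assumes one radius per forecast step shared across test samples, counts a trajectory as covered only if every step lies inside its ball, and averages the ball size over the forecast steps; the function names are hypothetical.

```python
import math


def ball_volume(radius, dim):
    """Volume of a dim-dimensional L2 ball (area of a disc when dim == 2)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def joint_coverage(targets, forecasts, radii):
    """Fraction of test trajectories whose every step lies within its region."""
    covered = 0
    for y, y_hat in zip(targets, forecasts):
        # y and y_hat are sequences of k points; radii holds one radius per step.
        if all(math.dist(p, q) <= r for p, q, r in zip(y, y_hat, radii)):
            covered += 1
    return covered / len(targets)


def mean_region_size(radii, dim):
    """Average area or volume of the per-step confidence regions."""
    return sum(ball_volume(r, dim) for r in radii) / len(radii)
```

A method is then judged by whether `joint_coverage` stays at or above the target level and, among valid methods, by how small `mean_region_size` is.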

5.1. SYNTHETIC DATASETS

We first test the effectiveness of our models with two synthetic spatiotemporal datasets: interacting particle systems (Kipf et al., 2018) and drone trajectory following (simulated with PythonRobotics (Sakai et al., 2018)). For the particle simulation we predict $y_{t+1:t+h}$ where $t = 35$, $h = 25$ and $y_t \in \mathbb{R}^2$; for the drone simulation $t = 60$, $h = 10$, and $y_t \in \mathbb{R}^3$. To add randomness to the tasks, we added Gaussian noise of $\sigma = 0.01$ and $0.05$ to the dynamics of the particle simulation, and $\sigma = 0.02$ to the drone dynamics. We generate 5000 samples for each dataset, and split the data 45/45/10 into train, calibration, and test sets respectively. Please see Appendix C.1 for forecaster model details.

We visualize the calibration and efficiency of the methods in Figure 2 for confidence levels $1-\alpha = 0.5$ to $0.99$. Copula-RNN (red lines) is more calibrated and efficient than the baseline methods, especially in tasks with large noise (particle dataset, $\sigma = 0.05$). For harder tasks (particle $\sigma = 0.05$, and drone trajectory prediction), MC-Dropout is overconfident, whereas BJ-RNN and CF-RNN produce very large (hence inefficient) confidence regions. This behavior of CF-RNN is expected because it applies Bonferroni correction to account for joint prediction over multiple time steps, which is an upper bound of copula functions. The numbers for confidence level 90% are presented in Table 1. A qualitative comparison of confidence regions for the drone simulation can be found in Figure 8 in the appendix.

5.2. REAL WORLD DATASETS

COVID-19. We replicate the experiment setting of Stankevičiūtė et al. (2021) and predict new daily cases of COVID-19 in regions of the UK. The models take 100 days of data as input and forecast 50 days into the future. We used 200 time series for training, 100 for calibration, and 80 for testing.

Vehicle trajectory prediction. The Argoverse autonomous vehicle motion forecasting dataset (Chang et al., 2019) is a widely used vehicle trajectory prediction benchmark. The task is to predict 3-second trajectories based on all vehicle motion in the past 2 seconds, sampled at 10 Hz. Because trajectory prediction is a challenging task, we utilize a state-of-the-art prediction algorithm, LaneGCN (Liang et al., 2020), as the underlying model for CF-RNN and Copula-RNN (details in Appendix C.1). Flexibility in the choice of underlying forecasting model is an advantage of post-hoc conformal prediction methods.

Figure 2: Calibration (upper row) and efficiency (lower row) comparison at different $1-\alpha$ levels for simulated datasets. For calibration, the goal is to coincide with the green dotted calibration line as closely as possible while staying valid (above the green line). Note that copula methods are more calibrated across different significance levels. For efficiency, we want the metric to be as low as possible. Copula-RNN outperforms the baselines consistently. (MC-dropout for the right two experiments produces invalid regions, so we do not consider its efficiency.)

Table 1: Performance on synthetic and real-world datasets with target confidence $1-\alpha = 0.9$. Methods that are invalid (coverage below 90%) are greyed out. CopulaCPTS achieves a high level of calibration (coverage is close to 90%) while producing more efficient confidence regions (small area).

For the model-dependent baselines MC-dropout and BJRNN, we have to train an RNN forecasting model from scratch for each method, which incurs additional computational cost.
Results in Table 1 show that, as for the synthetic datasets, CopulaCPTS is both more calibrated and more efficient than the baseline models on real-world datasets. For the trajectory prediction task, learning the copula results in a 77% sharper confidence region while still remaining valid for the 90% confidence interval. We visualize two samples from each dataset in Figure 3. The importance of efficiency in these scenarios is clear: the confidence regions need to be narrow enough to be useful for decision making. Given the same underlying prediction model, CopulaCPTS produces a much sharper region while still remaining valid.

Comparison of models at different horizon lengths. CopulaCPTS is an algorithm designed to produce calibrated and efficient confidence regions for multi-step time series. When the prediction horizon is long, CopulaCPTS's advantage is more pronounced. Figure 4 shows a performance comparison across increasing time horizons on the particle dataset. Additional experiment results can be found in Table 3 of Appendix C. CopulaCPTS achieves a 30% decrease in area at 20 time steps compared to CF-RNN, the best performing baseline; the decrease is above 50% at 25 time steps. This experiment shows the significant improvement gained by using a copula to model the joint distribution of future time steps.

CopulaCPTS for the autoregressive prediction setting. We test the autoregressive setting (section 4.3) on the COVID-19 dataset. We train an RNN model with $k = 7$ and use it to autoregressively forecast the next 14 steps. Table 2 compares the performance of re-estimating the copula for each 7-step forecast versus using a fixed copula calibrated on the first 7 steps. We also compare the model to a 14-step joint forecaster using CopulaCPTS.

Figure 4: CopulaCPTS remains more calibrated and efficient than baselines over increasing forecast horizons.

6. CONCLUSION AND DISCUSSION

In this paper we present CopulaCPTS, a conformal prediction algorithm for multidimensional and multi-step time series prediction. CopulaCPTS significantly improves the calibration and efficiency of multi-step conformal confidence intervals by incorporating copulas to model the joint distribution of multiple time steps. We prove that CopulaCPTS has a finite-sample validity guarantee. In our experiments we show that the algorithm outperforms state-of-the-art models on all 4 benchmark datasets and on varying prediction horizons. On the flip side, CopulaCPTS assumes that the copula estimation is accurate and that we have calibration data for the entirety of the prediction horizon, even in the autoregressive case. We leave it to future work to relax these assumptions (for example by adjusting $u^*$ online given prediction errors, as in Gibbs & Candes (2021)) to make CopulaCPTS applicable to the online setting where distribution shift is present.

REPRODUCIBILITY STATEMENT

All experiments included in this paper are repeated over 3 runs to account for randomness in neural network training. We report the mean and the standard deviation of the experiment results in our tables. We have included code for constructing the synthetic datasets, and source code for our model implementation, experiments, and visualizations in the supplementary material. For real-world datasets, we refer readers to Appendix C for detailed descriptions of how to obtain and preprocess the data.

Argoverse. As highlighted in the main text, we utilize a state-of-the-art prediction algorithm, LaneGCN (Liang et al., 2020), as the underlying forecaster model for CF-RNN and Copula-RNN. We refer readers to their paper and code base for model details. The architecture of the RNN network used for MC-Dropout and BJRNN is an encoder-decoder network. Both the encoder and the decoder contain an LSTM layer with encoding size 8 and hidden size 16. We chose this architecture because it is part of the official Argoverse baselines (https://github.com/jagjeet-singh/argoverse-forecasting) and demonstrates competitive performance.

C.2 COVID-19 DATASET

The COVID-19 dataset is downloaded directly from the official UK government website https://coronavirus.data.gov.uk/details/download by selecting region for area type and newCasesByPublishDate for metric. There are in total 380 regions and over 500 days of data, depending on when it is downloaded. We selected 150-day time series from the collection to construct our dataset.

C.3 CALIBRATION AND EFFICIENCY CHART FOR COVID-19

Figure 6 shows a comparison of calibration and efficiency for the daily new COVID-19 cases forecasting task. The copula methods (orange and red lines) are more calibrated (coinciding with the green dotted line) and sharper (lower width) than the baselines. To see whether the daily fluctuation due to testing behaviour disrupts the other methods, we also ran the same experiment on weekly aggregated new case forecasts. We take 14 weeks of data as input and output forecasts for the next 6 weeks. The results are illustrated in Figure 7. The weekly forecasting scenario gives us similar insights to the daily forecasts.

The official validation set of size 39,472 is used for testing and reporting performance. We preprocess the scenes to filter out incomplete trajectories and cap the number of vehicles modeled at 60. If there are fewer than 60 cars in a scenario, we insert dummy cars to achieve consistent car numbers. For map information, we only include center lanes with lane directions as features. As with vehicles, we introduce dummy lane nodes into each scene to make the number of lanes consistently equal to 650.

C.5 ADDITIONAL EXPERIMENT RESULTS

We present in figure 8 and 9 some qualitative results for uncertainty estimation. To test how the effects of copulaCPTS compare with baseline on other base forecasters, we also include an encoder-decoder architecture with the same embedding size as the RNN models introduced in appendix C.1 for each dataset. The results are presented in Table 3 . We omit these results in the main text because we found that they do did not bring significant improvement to time series forecasting UQ. Table 4 compares model performance compared across different prediction horizons. We show that the advantage of our method is more pronounced for longer horizon forecasts. Figure 9 : Illustrations for confidence regions given by CF-RNN (blue) and CopulaCPTS (orange) in at time steps 0, 10, 20, and 30. Note that in order to achieve 90% coverage, the regions are larger than needed, especially in straight-lane cases like the middle two. Using copulas to couple together time steps results in a much smaller region, while achieving similarly good coverage. Figure 10 shows the α h values for each 1 -α h = Fh (s * h ) used in Copula CPTS as outlined in line 15 of Algorithm 1. We present α h values searched using two methods of searching, with dichotomy search for a constant α value for the horizon as in Messoudi et al. (2021) , and by stochastic gradient descent as outlined in section 4.2. The α h values are an indicator of how interrelated the uncertainty between each time step are: Bonferroni Correction used in Stankevičiūtė et al. (2021) (grey dotted line in Figure 10 ) assumes that the time steps are independent, with CopulaCPTS we have lower 1 -α h levels while having valid coverage (blue and orange lines in Figure 10 ). This shows that the uncertainty of the time steps are not independent, and we are able to utilize this dependency to shrink the confidence region and still maintaining the coverage guarantee. 
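As a worked illustration of how far per-step levels can move with dependence, consider the horizon $k = 25$ of the particle dataset at joint target $1 - \alpha = 0.9$. The comparison below is a back-of-the-envelope calculation with a shared level $u$ for every step, not a result from the experiments; the product copula and the comonotone copula $\min(u_1, \ldots, u_k)$ are used as the two reference extremes of independence and perfect positive dependence.

```latex
\begin{align*}
\text{Bonferroni:} \quad & u = 1 - \frac{\alpha}{k} = 1 - \frac{0.1}{25} = 0.996 \\
\text{Independence (product copula):} \quad & u^{k} = 1 - \alpha
  \;\Rightarrow\; u = 0.9^{1/25} \approx 0.9958 \\
\text{Perfect dependence } (C = \min_h u_h): \quad & u = 1 - \alpha = 0.9
\end{align*}
```

The Bonferroni level nearly coincides with the level required under independence, while any positive dependence between steps permits a shared level between 0.9 and roughly 0.996, which is the room that an estimated copula can exploit.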
Table 5 shows that there is no significant difference in coverage or area between the two search methods within the scope of the datasets we study in this paper. However, we want to highlight that SGD search has $O(n)$ complexity in the number of optimization steps, regardless of the prediction horizon. SGD also allows for varying $\alpha_h$, which might be useful in some settings, for example to capture uncertainty spikes at some time steps, as seen in the COVID-19 dataset in Figure 10. Dichotomy search, on the other hand, has $O(n \log n)$ complexity in the size of the search space, which depends on the granularity, and becomes $O(kn \log(kn))$ if we want to search for varying $\alpha_h$.
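The gap between the Bonferroni level and a copula-based level can be illustrated with a small sketch. It is not the paper's implementation: the synthetic scores, the helper names (`empirical_cdf_values`, `empirical_copula`, `search_common_level`), and the plain bisection are assumptions made for illustration, and the finite-sample conformal correction is omitted. The sketch searches for one common per-step level, in the spirit of the dichotomy search for a constant level described above.

```python
import random

def empirical_cdf_values(scores):
    """Empirical CDF value (rank / n) of each score within one time step."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / n
    return u

def empirical_copula(U, level):
    """Fraction of samples whose CDF values are <= level at every time step."""
    return sum(all(u <= level for u in row) for row in U) / len(U)

def search_common_level(U, alpha, tol=1e-4):
    """Bisection for the smallest common level whose empirical copula value
    is at least 1 - alpha (level 1.0 always satisfies this)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if empirical_copula(U, mid) >= 1 - alpha:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical calibration scores: k time steps sharing a common noise source,
# so the per-step nonconformity scores are strongly positively dependent.
random.seed(0)
n, k, alpha = 2000, 10, 0.1
scores = []
for _ in range(n):
    z = random.gauss(0, 1)
    scores.append([abs(z + 0.3 * random.gauss(0, 1)) for _ in range(k)])

columns = [empirical_cdf_values([row[h] for row in scores]) for h in range(k)]
U = [[columns[h][i] for h in range(k)] for i in range(n)]

copula_level = search_common_level(U, alpha)
bonferroni_level = 1 - alpha / k
print("per-step level from empirical copula:", round(copula_level, 4))
print("per-step level from Bonferroni:      ", bonferroni_level)
```

When the scores are dependent across time steps, the copula-based per-step level falls below the Bonferroni level while the joint empirical coverage on the calibration scores still reaches 1 − α, which is the effect shown in Figure 10.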



Figure 1: An example copula, where we express a multivariate Gaussian with correlation ρ = 0.8 using two univariate distributions and a copula function $C(u_1, u_2)$.
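The construction in Figure 1 can be reproduced numerically with a short sketch. It assumes standard normal marginals and uses hypothetical helper names (`sample_gaussian_copula`, `empirical_copula`); the closed-form value used for comparison is the standard orthant probability of a bivariate Gaussian, $C(0.5, 0.5) = 1/4 + \arcsin(\rho)/(2\pi)$.

```python
import math
import random
from statistics import NormalDist

rho = 0.8
std_normal = NormalDist()  # standard normal, used as both marginals

def sample_gaussian_copula(n, rho, seed=0):
    """Draw correlated Gaussian pairs and map each coordinate through its
    marginal CDF, giving points (u1, u2) on the unit square."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs

def empirical_copula(pairs, u1, u2):
    """Estimate C(u1, u2) = P(U1 <= u1, U2 <= u2) from samples."""
    return sum(a <= u1 and b <= u2 for a, b in pairs) / len(pairs)

pairs = sample_gaussian_copula(20000, rho)
c_half = empirical_copula(pairs, 0.5, 0.5)
c_exact = 0.25 + math.asin(rho) / (2 * math.pi)
mean_u1 = sum(a for a, _ in pairs) / len(pairs)
print("empirical C(0.5, 0.5):", round(c_half, 3))
print("closed form C(0.5, 0.5):", round(c_exact, 3))
```

With independent coordinates the value at (0.5, 0.5) would be 0.25; the larger value under ρ = 0.8 reflects the dependence that the copula captures separately from the marginals.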

is k time steps of output of dimension d y . In the traditional time series forecasting setting we have d x = d y , but they do not have to be necessarily equal. Given the dataset D, a new test sample x

Figure 3: Illustrations of 90% confidence regions given by CF-RNN (blue) and CopulaCPTS (orange) on two real-world datasets. For the COVID-19 forecast (left two panels), we see that CopulaCPTS produces sharper confidence regions that still cover the target. For Argoverse (right two panels), we illustrate regions at time steps 1, 10, 20, and 30. Note that the confidence region produced by CF-RNN is uninformatively large, as it covers all the lanes. Overall, CopulaCPTS is able to produce much more efficient confidence regions while maintaining valid coverage. These examples also illustrate the importance of efficiency.

Figure 6: Calibration and efficiency comparison at different ϵ levels for COVID-19 daily forecasts. The copula methods (orange and red lines) are better calibrated (coinciding with the green dotted line) and sharper (lower width) than the baselines.

Figure 7: Covid Weekly Forecasts

Figure 8: 99% confidence regions produced by three methods for the drone dataset. The copula method (a) produces a more consistent, expanding cone of uncertainty than MC-Dropout (b) and a sharper one than CF-RNN (c).

Figure 10: Comparison between dichotomy search for fixed $\alpha_h$ values (blue) and stochastic gradient search for varying $\alpha_h$ (orange) across time steps. Shaded regions show the standard deviation of the values over 3 runs.

Algorithm 1: Copula Conformal Time Series Prediction

Input: Dataset $D = \{(x^i, y^i)\}_{i=1,\dots,n}$, test input $x^{n+1}_{1:t}$, target confidence level $1 - \alpha$.
Randomly split dataset $D$ into training and calibration datasets $D = D_{train} \cup D_{cal}$, where $|D_{train}| = m$ and $|D_{cal}| = n - m$.
Train k-step forecasting model $f$ with training set $D_{train}$.
// Calibration
Initialize the set of nonconformity scores for each time step, $S_h = \emptyset$ for $h = 1, \dots, k$.
Construct CDFs $\hat{F}_1, \dots, \hat{F}_k$ as in Equation 6.
Construct copula $C(\hat{F}_1(\cdot), \dots, \hat{F}_k(\cdot))$ as in Equation 9.
Search for s * 1 , . . . , s

Performance of autoregressive (AR) CopulaCPTS. Re-estimating the copula gives valid confidence regions over time and is more efficient than the joint CopulaCPTS forecast.

± 0.04 | 88.0 ± 7.0 | 0.69 ± 0.25 | 52.3 ± 1.4 | 0.94 ± 0.2
BJRNN | 45.3 ± 39.4 | 0.27 ± 0.18 | 97.7 ± 2.1 | 2.69 ± 1.79 | 95.5 ± 2.8 | 19.99 ± 4.83
CF-RNN | 100.0 ± 0.0 | 0.01 ± 0.01 | 77.8 ± 19.2 | 0.8 ± 0.64 | 66.7 ± 0.0 | 18.82 ± 3.73
CF-EncDec | 89.9 ± 19.2 | 0.01 ± 0.01 | 100.0 ± 0.0 | 0.75 ± 0.99 | 88.9 ± 19.2 | 13.07 ± 16.1
Copula-RNN | 90.1 ± 0.2 | 0.01 ± 0.01 | 89.8 ± 0.6 | 0.54 ± 0.45 | 90.1 ± 1.2 | 8.25 ± 3.44
Copula-EncDec | 90.0 ± 0.3 | 0.01 ± 0.0 | 90.3 ± 0.6 | 0.67 ± 1.01 | 90.5 ± 0.5 | 7.13 ± 9.5

Performance comparison across different horizons at 90% confidence level on the drone simulation dataset. The improvement in efficiency is more pronounced when the horizon is longer.

C.6 STUDY ON α h SEARCH

Coverage and area comparison between dichotomy search for fixed $\alpha_h$ and SGD for varying $\alpha_h$. We do not see a significant difference between the performance of the two.

APPENDIX A PROOF OF PROPOSITION 1

Proposition 1 (Validity of CopulaCPTS). The confidence region provided by CopulaCPTS (Algorithm 1) is valid, i.e., $P[\forall h \in \{1, \dots, k\},\ y_{t+h} \in \Gamma^{1-\alpha_h}] \ge 1 - \alpha$.

Proof. We estimated $u^* = (\hat{F}_1(s^*_1), \dots, \hat{F}_k(s^*_k))$ such that

Let $(x^{n+1}_{1:t}, y^{n+1}_{t+1:t+h}) \sim X \times Y$ be a new data point. Denote by $\hat{y} = f(x^{n+1}_{1:t})$ the forecast given by the trained model. Because the $\hat{F}_h$ functions and $C_{empirical}$ are monotonically increasing, we have:

1. UPPER AND LOWER BOUNDS FOR COPULAS

To provide a better understanding of the properties of copulas, consider the Fréchet-Hoeffding bounds (Theorem 2). The upper bound is itself a copula, as is the lower bound when k = 2. The lower bound is precisely the Bonferroni correction used in Stankevičiūtė et al. (2021); therefore, by estimating the copula more precisely instead of using a lower bound, we have a guaranteed efficiency improvement for the confidence region.

Theorem 2 (The Fréchet-Hoeffding Bounds). Consider a copula $C(u_1, \dots, u_k)$. Then

$$\max\Big\{1 - k + \sum_{i=1}^{k} u_i,\ 0\Big\} \le C(u_1, \dots, u_k) \le \min\{u_1, \dots, u_k\}.$$

The empirical inverse CDF modeled the same way as Equation 6 is the same as the empirical quantile estimation (Equation 2).

We find the optimal $s^*_h$ in Equation 10 and Algorithm 1 by minimizing the following loss:

We use the Adam optimizer and optimize for 500 steps. An alternative and faster way to search is to assume $s_1 = s_2 = \dots = s_k$, which simplifies the search from exponential time to constant time.
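Returning to the Fréchet-Hoeffding lower bound, a short worked check shows why it corresponds to the Bonferroni correction. The symbol $W$ for the lower bound and the choice of a common per-step level $u_h = 1 - \alpha/k$ are notation introduced here for illustration.

```latex
W(u_1, \dots, u_k) = \max\Big\{ 1 - k + \sum_{h=1}^{k} u_h,\ 0 \Big\},
\qquad
u_h = 1 - \frac{\alpha}{k}
\;\Longrightarrow\;
W(u_1, \dots, u_k) = \max\Big\{ 1 - k + k\Big(1 - \frac{\alpha}{k}\Big),\ 0 \Big\} = 1 - \alpha .
```

Since every copula satisfies $C \ge W$, the Bonferroni levels give joint coverage of at least $1 - \alpha$ under any dependence structure. When the estimated copula lies strictly above $W$, the same joint coverage is reached with smaller per-step levels, which is the source of the efficiency gain.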

B.3 AUTOREGRESSIVE PREDICTION

In the context of this paper, to forecast autoregressively means the following: given input $x_{1:t}$ and a k-step forecasting model $f$, perform prediction

until all k′ time steps are predicted.

We now provide a toy scenario to illustrate when re-estimating the copula is necessary and improves validity. Consider a time series of three time steps $t_0, t_1, t_2$. The two scenarios are illustrated in Figure 5. In both scenarios, the mean and variance of all time steps are 0 and 1, respectively.

In scenario (a), $t_0 = t_1$ and hence their covariance is 1. The copula estimated on $t_0$ and $t_1$ is

This copula will significantly underestimate the confidence region of $t_2$, whose covariance with $t_1$ is −1. In fact, the coverage of $C_{0:1}(F_1(t_1), F_2(t_2))$ is 0.74.

On the other hand, (b) illustrates a scenario where the copula for any two consecutive time steps remains the same, $C_0 = C_1$. In this case, applying $C_0$ directly to forecast $C_1$ achieves precisely 90% coverage.
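The autoregressive rollout described at the start of this section can be sketched as follows. This is one plausible reading, not the paper's code: it assumes $d_x = d_y$ so that predictions can be fed back as inputs, a scalar series, and a hypothetical stand-in forecaster `toy_model` that extends a linear trend. Re-estimation of the copula at each rollout step is not shown.

```python
def autoregressive_forecast(history, model, k, k_prime):
    """Roll a k-step forecaster forward until k_prime steps are predicted.

    history : list of the t observed values x_{1:t}
    model   : callable mapping the last t values to the next k values
    """
    t = len(history)
    window = list(history)
    predictions = []
    while len(predictions) < k_prime:
        next_steps = model(window[-t:])  # forecast k steps from the latest t values
        if len(next_steps) != k:
            raise ValueError("model must return exactly k steps")
        predictions.extend(next_steps)
        window.extend(next_steps)        # feed the predictions back as input
    return predictions[:k_prime]         # drop any overshoot past k_prime

def toy_model(window, k=3):
    """Hypothetical 3-step forecaster: continue the last observed slope."""
    slope = window[-1] - window[-2]
    return [window[-1] + slope * (i + 1) for i in range(k)]

rollout = autoregressive_forecast([0, 1, 2, 3], toy_model, k=3, k_prime=7)
print(rollout)
```

Because each later block of k steps is conditioned on earlier predictions rather than observations, the dependence between time steps can differ from what was seen in calibration, which is the situation the toy scenario above is meant to illustrate.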

C.1 UNDERLYING FORECASTING MODELS

Particle Dataset

The underlying forecasting model for the particle experiments is a 1-layer LSTM network with embedding size 24. The hidden state is then passed through a linear network to forecast all timesteps concurrently (the output has dimension $k \times d_y$). We train the model for 150 epochs with batch size 128. Hyperparameters of the network are selected through a model search based on performance on a 5-fold cross-validation split of the dataset. The architecture and hyperparameters are shared by all baselines and CopulaCPTS in Table 1.

Drone

For the drone trajectory forecasting task, we use the same LSTM forecasting network as for the particle dataset, but with the hidden size increased to 128. We train the model for 500 epochs with batch size 128. The same architecture and hyperparameters are shared by all baselines and CopulaCPTS reported in Table 1.

Covid-19

The base forecasting model for the Covid-19 dataset is the same as for the synthetic datasets, with hidden size 128, and is trained for 150 epochs with batch size 128. The same architecture and hyperparameters are shared by all baselines and CopulaCPTS reported in Table 1.

C.7 COMPARISON TO ADDITIONAL BASELINES

We include a comparison to two additional simple UQ baselines on the particle simulation dataset.

L2-Conformal. L2-Conformal uses the same underlying RNN forecaster as CF-RNN and Copula-RNN. We use as the nonconformity score the vector norm of the errors at all timesteps concatenated together, $\|\hat{y}_{t+1:t+k} - y_{t+1:t+k}\|$, to perform ICP. As there is no analytic way to represent a $k \times d_y$-dimensional uncertainty region in 2-D space, we calculate the area and plot the region for the L2-Conformal baseline using the maximum deviation at each timestep such that the vector norm still stays within range.

Direct Gaussian. Direct Gaussian uses the same model architecture and training hyperparameters, with the addition of a linear layer that outputs the variance for each timestep, and is optimized using the negative log-likelihood loss, a proper scoring rule for probabilistic forecasting. We obtain the area by analytically calculating the 90% confidence interval for each variable.

Results in Table 6 show that L2-Conformal produces inefficient confidence areas, and that directly outputting the variance under-covers the test data. These results align with previous findings and motivate our method, which is both better calibrated and sharper than these baselines. We show a visualization in Figure 11.
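The L2-Conformal baseline can be sketched in a few lines. The code below is an illustration under stated assumptions, not the experimental code: the forecaster is replaced by a hypothetical constant zero forecast, the targets are synthetic Gaussian noise, and the helper names (`l2_score`, `conformal_quantile`) are introduced here. The quantile rule is the standard split (inductive) conformal one, taking the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score.

```python
import math
import random

def l2_score(y_pred, y_true):
    """Vector norm of the forecast error over all timesteps and output
    dimensions concatenated together."""
    return math.sqrt(sum((p - t) ** 2
                         for step_p, step_t in zip(y_pred, y_true)
                         for p, t in zip(step_p, step_t)))

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest
    calibration score, or infinity if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

# Hypothetical setup: k = 5 timesteps, d_y = 2 outputs, a forecaster that
# always predicts zero, and targets drawn as independent Gaussian noise.
random.seed(1)
k, d_y, alpha = 5, 2, 0.1

def draw_truth():
    return [[random.gauss(0, 1) for _ in range(d_y)] for _ in range(k)]

zero_forecast = [[0.0] * d_y for _ in range(k)]
cal_scores = [l2_score(zero_forecast, draw_truth()) for _ in range(999)]
radius = conformal_quantile(cal_scores, alpha)

test_scores = [l2_score(zero_forecast, draw_truth()) for _ in range(2000)]
coverage = sum(s <= radius for s in test_scores) / len(test_scores)
print("L2 radius:", round(radius, 3), "test coverage:", round(coverage, 3))
```

The resulting region is a single ball of radius `radius` in the concatenated $k \times d_y$-dimensional space, which is why projecting it onto each timestep, as described above, tends to give a large area per step.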

