WHEN RIGID COHERENCY HURTS: DISTRIBUTIONAL COHERENCY REGULARIZATION FOR PROBABILISTIC HIERARCHICAL TIME SERIES FORECASTING

Abstract

Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have hierarchical relations. Previous works all assume rigid consistency over the given hierarchies and do not adapt to real-world data that deviate from this assumption. Moreover, recent state-of-the-art neural probabilistic methods impose hierarchical relations only on point predictions or on samples of the predictive distribution. This does not ensure that the full forecast distributions are coherent with the hierarchy and leads to poorly calibrated forecasts. We close both these gaps and propose PROFHIT, a probabilistic hierarchical forecasting model that jointly models forecast distributions over the entire hierarchy. PROFHIT (1) uses a flexible probabilistic Bayesian approach and (2) introduces a soft distributional coherency regularization that enables end-to-end learning of the entire forecast distribution by leveraging information from the underlying hierarchy. This yields robust and calibrated forecasts as well as adaptation to real-life data with varied hierarchical consistency. PROFHIT provides 41-88% better accuracy and 23-33% better calibration over a wide range of dataset consistency. Furthermore, PROFHIT can robustly provide reliable forecasts even when up to 10% of the input time-series data is missing, whereas the performance of other methods degrades by over 70%.

1. INTRODUCTION

Time-series forecasting is an important problem that impacts decision-making in a wide range of applications. In many real-world situations, the time-series have inherent hierarchical relations and structures. Examples include forecasting employment time-series (Taieb et al., 2017) measured at different geographical scales, and epidemic forecasting (Reich et al., 2019) at the county, state, and country levels. Given a time-series dataset with underlying hierarchical relations, the goal of Hierarchical Time-series Forecasting (HTSF) is to generate accurate forecasts for all time-series by leveraging the hierarchical relations between them (Hyndman et al., 2011). Most previous methods do not provide well-calibrated forecasts for both so-called "strongly" and "weakly" consistent datasets. Previous HTSF methods assume that the time-series values strictly satisfy the underlying hierarchical constraints and impose rigid coherency on the generated forecasts, i.e., the forecasts strictly satisfy the hierarchical relations of the dataset. These methods can model datasets generated (Taieb et al., 2017) by first collecting data for the leaf-level time-series and deriving the time-series of higher-level nodes. We call such data strongly consistent. For example, classical HTSF methods (Hyndman & Athanasopoulos, 2018) use a bottom-up or top-down approach where all time-series at a single level of the hierarchy are modeled independently and the values of the other levels are derived using the aggregation function governing the hierarchy. In contrast, many real-world datasets are weakly consistent, i.e., they do not strictly follow the constraints of the hierarchy. Such data have an underlying data-generation process that may follow a hierarchical set of constraints but may contain some deviations.
These deviations can be caused by factors such as measurement or reporting error and asynchrony in data aggregation and revision pipelines, as frequently observed in epidemic forecasting (Adhikari et al., 2019). Most state-of-the-art HTSF methods are designed for strongly consistent datasets and impose rigid coherency constraints; they thus may not adapt to such deviations and can provide poor forecasts for weakly consistent datasets. Post-processing reconciliation methods adjust independently generated raw forecasts after the fact, which does not enable the models generating the raw forecasts to leverage the underlying hierarchical relations across time-series. End-to-end neural methods such as HIERE2E (Rangapuram et al., 2021) and SHARQ (Han et al., 2021) instead incorporate hierarchical relations directly into the model architecture or learning algorithm and usually outperform post-processing methods, but they impose hierarchical constraints only on the mean or on fixed quantiles of the forecast distributions. Since these methods do not enforce hierarchical coherency on the full distributions, the forecasts may not be well-calibrated (Kuleshov et al., 2018), i.e., they produce unreliable prediction intervals that may not match observed probabilities from the ground truth (Fisch et al., 2022). In this work, we fill this gap of learning well-calibrated and accurate forecasts for both strongly and weakly consistent datasets by leveraging the underlying hierarchical relations.

[Table 1: comparison of methods on distributional coherency and use of hierarchical relations]

We propose PROFHIT (Probabilistic Robust Forecasting for Hierarchical Time-series), a neural probabilistic HTSF method that provides an end-to-end Bayesian approach to jointly model the forecast distributions of all time-series (see Table 1 for a comparison). Specifically, we introduce a novel Soft Distributional Coherency Regularization (SDCR) to tackle this challenge. First, SDCR enables PROFHIT to leverage hierarchical relations over entire forecast distributions to generate calibrated forecasts by encouraging the forecast distribution of any parent node to be similar to the aggregation of its children's forecast distributions (Figure 1). Second, since SDCR is a soft constraint, the model is trained to adapt to datasets with varying hierarchical consistency, allowing it to trade off coherency for better accuracy and calibration on weakly consistent datasets. Our main contributions are: (1) Accurate and Calibrated Probabilistic Hierarchical Time-Series Forecasting: We propose PROFHIT, a deep probabilistic framework that jointly models the distributions of all time-series using soft distributional coherency regularization (SDCR). PROFHIT leverages probabilistic deep-learning models to learn priors for individual time-series and refines the priors of all time-series using the hierarchy to provide accurate and well-calibrated forecasts. (2) Adaptation to Strong and Weak Consistency via Soft Distributional Coherency Regularization: SDCR imposes soft hierarchical constraints on the full forecast distributions to help the model adapt to varying levels of hierarchical consistency. We build a novel refinement module over raw forecast priors and leverage multi-task learning over shared parameters, enabling PROFHIT to perform consistently well across the hierarchy.
(3) Evaluation Across Multiple Datasets and with Missing Data: We show that PROFHIT outperforms a wide variety of state-of-the-art baselines on both accuracy and calibration, at all levels of the hierarchy, for both strongly and weakly consistent datasets. We also show that training with SDCR enables PROFHIT to leverage hierarchical relations to provide robust predictions that can handle missing values in the time-series.

2. PROBLEM STATEMENT

Consider a dataset $D$ of $N$ time-series over the time horizon $1, 2, \ldots, T$. Let $y_i \in \mathbb{R}^T$ be time-series $i$ and $y_i^{(t)}$ its value at time $t$. The time-series have a hierarchical relationship denoted as $\mathcal{T} = (G_\mathcal{T}, \mathcal{H}_\mathcal{T})$, where $G_\mathcal{T}$ is a tree of $N$ nodes rooted at time-series 1. For a non-leaf node (time-series) $i$, we denote its children as $C_i$. The node values are related via the set of relations
$$\mathcal{H}_\mathcal{T} = \Big\{\, y_i = \sum_{j \in C_i} \phi_{ij}\, y_j \;:\; \forall i \in \{1, 2, \ldots, N\},\ |C_i| > 0 \,\Big\},$$
where the $\phi_{ij}$ are known, time-independent, real-valued constants.
Definition 1 (Consistency Error, CE). Given a dataset $D$ of $N$ time-series over the time horizon $1, 2, \ldots, T$ and aggregation relations $\mathcal{H}_\mathcal{T}$ as above, the dataset consistency error (CE) is defined as
$$E_\mathcal{T}(D) = \sum_{i \in \{1,\ldots,N\},\, C_i \neq \emptyset} \Big\| y_i - \sum_{j \in C_i} \phi_{ij}\, y_j \Big\|^2. \qquad (1)$$
(Intuitively, datasets with lower CE have time-series values that more strictly follow the relations $\mathcal{H}_\mathcal{T}$.)
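To make Definition 1 concrete, here is a minimal NumPy sketch that computes the consistency error; the three-node toy hierarchy, array layout, and function names are our own illustration, not from the paper:

```python
import numpy as np

def consistency_error(Y, children, phi):
    """Consistency error of Definition 1.

    Y        : (N, T) array, one row per time-series.
    children : dict mapping node i -> list of child nodes C_i.
    phi      : dict mapping (i, j) -> aggregation weight phi_ij.
    """
    err = 0.0
    for i, C in children.items():
        if not C:
            continue
        agg = sum(phi[(i, j)] * Y[j] for j in C)   # sum_j phi_ij * y_j
        err += np.sum((Y[i] - agg) ** 2)           # squared deviation over time
    return err

# Toy hierarchy: node 0 aggregates nodes 1 and 2 with phi_ij = 1.
children = {0: [1, 2], 1: [], 2: []}
phi = {(0, 1): 1.0, (0, 2): 1.0}

leaves = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
Y_strong = np.vstack([leaves.sum(axis=0), leaves])   # root derived from leaves
Y_weak = Y_strong.copy()
Y_weak[0] += 0.5                                     # e.g. reporting noise at the root

print(consistency_error(Y_strong, children, phi))    # 0.0 -> strongly consistent
print(consistency_error(Y_weak, children, phi))      # 0.75 -> weakly consistent
```

A dataset whose root is derived exactly from its leaves has zero CE; any per-node perturbation (here 0.5 at each of 3 time-steps) makes it weakly consistent.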

Definition 2 (Strong and weak consistency). A dataset $D$ is strongly consistent if $E_\mathcal{T}(D) = 0$; otherwise, $D$ is said to be weakly consistent.

Let the current time-step be $t$. For any $1 \le t_1 < t_2 \le t$, we denote $y_i^{(t_1:t_2)} = \{y_i^{(t_1)}, y_i^{(t_1+1)}, \ldots, y_i^{(t_2)}\}$. Given the data $D_t = [y_1^{(1:t)}, y_2^{(1:t)}, \ldots, y_N^{(1:t)}]$ and hierarchical relations $\mathcal{H}_\mathcal{T}$, a model $M$ is trained to predict the marginal forecast distributions at time $t + \tau$ for all time-series of the hierarchy, leveraging the past values of all time-series: $\{p_M(y_1^{(t+\tau)} | D_t), \ldots, p_M(y_N^{(t+\tau)} | D_t)\}$. Along with the accuracy of probabilistic forecasts, we also evaluate forecast distributions for calibration, which we define following previous work (Kamarthi et al., 2021).

3. METHODOLOGY

Overview: PROFHIT models the forecast distributions $\{P(y_i^{(t+\tau)} | D_t)\}_{i=1}^N$ of all time-series nodes of the hierarchy by leveraging the relations from the hierarchy to provide accurate and well-calibrated forecasts that adapt to varying hierarchical consistency. Most existing methods do not attempt to model the entire probabilistic distribution but focus on coherency of point forecasts, samples, or fixed quantiles of the distribution (Rangapuram et al., 2021; Han et al., 2021). This does not fully capture the uncertainty of the forecasts and in turn does not provide calibrated predictions. Moreover, most methods assume datasets that are strongly consistent over the hierarchical relations. However, many real-world datasets are weakly consistent: time-series values at all nodes of the hierarchy are observed simultaneously and may not strictly follow the hierarchical relations due to noise and discrepancies in collecting data at different levels. Therefore, most previous works may not adapt well to such deviations from these constraints.
PROFHIT, on the other hand, reconciles the need to model coherency between entire forecast distributions with the need for a soft, adaptable coherency constraint via a two-step stochastic process that is trained end-to-end. PROFHIT first produces a raw forecast distribution for each node, parameterized by $\{(\bar\mu_i, \bar\sigma_i)\}_{i=1}^N$, using the past values of the time-series via a neural probabilistic forecasting model. The raw forecasts of all nodes are then used as priors to derive a refined set of forecast distributions, parameterized by $\{(\mu_i, \sigma_i)\}_{i=1}^N$, via the refinement module. The full probabilistic process of PROFHIT is depicted in Figure 2 and formally summarized as:
$$P(\{y_i^{(t+\tau)}\}_{i=1}^N | D_t) = \int \underbrace{P(z | \{y_i^{(1:t)}\}_{i=1}^N) \prod_{i=1}^N P(z_i, u_i | \{y_i^{(1:t)}\}_{i=1}^N)\, P(\bar\mu_i, \bar\sigma_i | z_i, z, u_i)}_{\text{TSFNP (raw forecasts)}} \;\underbrace{\prod_{i=1}^N P(\mu_i, \sigma_i | \{\bar\mu_j, \bar\sigma_j\}_{j=1}^N)\, P(y_i^{(t+\tau)} | \mu_i, \sigma_i)}_{\text{Refinement module}} \; d\{u_i\}_{i=1}^N\, d\{z_i\}_{i=1}^N, \qquad (2)$$
where $z_i$, $u_i$, and $z$ are intermediate latent variables of our probabilistic raw forecasting model TSFNP (Section 3.1). PROFHIT's SDCR regularizes the parameters $\{(\mu_i, \sigma_i)\}_{i=1}^N$ to leverage the hierarchical relations by minimizing the Distributional Coherency Error (DCE), defined as follows:
Definition 4 (Distributional Coherency Error, DCE). Given the forecasts at time $t + \tau$ as $\{p_M(y_1^{(t+\tau)} | D_t), \ldots, p_M(y_N^{(t+\tau)} | D_t)\}$, the distributional coherency error (DCE) is defined as
$$\sum_{i \in \{1,\ldots,N\},\, C_i \neq \emptyset} \mathrm{Dist}\Big( p_M(y_i^{(t+\tau)} | D_t),\; p_M\Big( \sum_{j \in C_i} \phi_{ij}\, y_j^{(t+\tau)} \,\Big|\, D_t \Big) \Big),$$
where Dist is a distributional distance metric. Leveraging the distributional coherency error as a soft regularizer enforces forecast distributions to be well-calibrated while adaptively adhering to the hierarchical relations of the dataset.

3.1. RAW FORECAST DISTRIBUTIONS WITH TSFNP

1) Probabilistic Neural Encoder: The encoder maps the input sequence of each node to a latent embedding:
$$[\mu_i^{(u)}, \log \sigma_i^{(u)}] = \text{Self-Atten}(\text{GRU}(y_i^{(t':t)})), \quad u_i \sim \mathcal{N}(\mu_i^{(u)}, \sigma_i^{(u)}).$$
2) Stochastic Data Correlation Graph: We further leverage similar patterns in past time-series data and aggregate them into a local latent variable. Unlike EPIFNP, which uses past time-series information from the same node, in our multivariate case TSFNP uses past information from all nodes. Formally, for the input sequence $y_i^{(t':t)}$ and each past sequence $y_j$, $j \in \{1, \ldots, N\}$, we sample $y_j$ into the set $\mathcal{N}_i$ with probability $\exp(-\gamma \|u_i - u_j\|_2^2)$. We then derive the local latent variable as
$$z_i \sim \mathcal{N}\Big( \sum_{j \in \mathcal{N}_i} \Theta_1(u_j),\; \exp\Big( \sum_{j \in \mathcal{N}_i} \Theta_2(u_j) \Big) \Big),$$
where $\Theta_1$ and $\Theta_2$ are feed-forward networks.
3) Predictive Distribution Decoder: Finally, we combine the latent embedding of the input time-series, the local latent variable, and a global latent variable to derive the parameters of the output distribution via a simple feed-forward network. We first derive the global latent variable, which combines the information from the latent embeddings of all past sequences via self-attention:
$$\{\beta_i\}_{i=1}^N = \text{Self-Atten}(\{u_i\}_{i=1}^N), \quad z = \sum_{i=1}^N \beta_i u_i.$$
We then derive the raw forecast distribution, modelled as a Gaussian $\mathcal{N}(\bar\mu_i, \bar\sigma_i)$, as
$$e = \text{concat}(u_i, z_i, z), \quad [\bar\mu_i, \log \bar\sigma_i] = \Theta_3(e),$$
where $\Theta_3$ is a feed-forward network.
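The distinctive step here is the stochastic data correlation graph, which can be sketched in NumPy as follows. This is a simplified illustration, not the paper's implementation: $\Theta_1$ and $\Theta_2$ are stand-in linear maps, and all dimensions and constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_latents(U, gamma, Theta1, Theta2, rng):
    """Sketch of TSFNP's stochastic data correlation graph (step 2).

    U      : (N, d) latent embeddings u_i from the probabilistic encoder.
    gamma  : kernel bandwidth; node j joins N_i with prob exp(-gamma * ||u_i - u_j||^2).
    Theta1, Theta2 : stand-ins for the feed-forward networks (linear maps here).
    Returns one sample z_i per node.
    """
    N, d = U.shape
    Z = np.zeros((N, d))
    for i in range(N):
        p = np.exp(-gamma * np.sum((U - U[i]) ** 2, axis=1))  # p[i] = 1: node keeps itself
        mask = rng.random(N) < p                               # sample neighbour set N_i
        mu = Theta1(U[mask]).sum(axis=0)                       # mean: sum of Theta1(u_j)
        sigma = np.exp(Theta2(U[mask]).sum(axis=0))            # scale: exp of summed Theta2
        Z[i] = rng.normal(mu, sigma)                           # z_i ~ N(mu, sigma)
    return Z

N, d = 5, 4
U = rng.normal(size=(N, d))
W1, W2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Z = local_latents(U, gamma=1.0,
                  Theta1=lambda x: x @ W1, Theta2=lambda x: x @ W2, rng=rng)
print(Z.shape)  # (5, 4)
```

Note that node $i$ always includes itself in $\mathcal{N}_i$ (its sampling probability is $\exp(0) = 1$), so the neighbour set is never empty.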

3.2. REFINEMENT MODULE

The refinement module leverages the raw distributions of all nodes of the hierarchy to produce refined forecast distributions using the hierarchical relations. Given the parameters of the raw forecast distributions $\{\bar\mu_i, \bar\sigma_i\}_{i=1}^N$ derived from TSFNP for all time-series $\{y_i^{(t':t)}\}_{i=1}^N$, the refinement module derives the refined forecast distributions, denoted by parameters $\{\mu_i, \sigma_i\}_{i=1}^N$, as functions of the raw forecast parameters of all time-series. The refined forecasts are optimized to be more coherent using SDCR. Since we drive the full distributions of the refined forecasts to be coherent, rather than just samples or mean statistics, the calibration of the refined distributions is also consistent with the hierarchical relations. Let $\bar\mu = [\bar\mu_1, \ldots, \bar\mu_N]$ and $\bar\sigma = [\bar\sigma_1, \ldots, \bar\sigma_N]$ be the vectors of means and standard deviations of the raw distributions. We model the refined mean as a function of the raw means of all the nodes. Formally, we derive the mean $\mu_i$ of the refined distribution as a weighted sum of two terms: (a) $\bar\mu_i$, the raw mean of time-series $i$, and (b) a linear combination of the raw means of all time-series:
$$\gamma_i = \mathrm{sigmoid}(\hat w_i), \quad \mu_i = \gamma_i \bar\mu_i + (1 - \gamma_i)\, w_i^T \bar\mu.$$
$\{\hat w_i\}_{i=1}^N$ and $\{w_i\}_{i=1}^N$ are learnable parameters of the model, and $\mathrm{sigmoid}(\cdot)$ denotes the sigmoid function. $\gamma_i$ models the trade-off between the influence of the raw distribution of node $i$ and the influence of the other nodes of the hierarchy; this helps the model automatically adapt to datasets with varying hierarchical consistency. Similarly, we assume the variance of the refined distribution depends on the raw means and variances of all time-series. The scale parameter $\sigma_i$ of the refined distribution is derived from the raw distribution parameters $\bar\mu$ and $\bar\sigma$ as
$$\sigma_i = c\, \bar\sigma_i\, \mathrm{sigmoid}(v_{1i}^T \bar\mu + v_{2i}^T \bar\sigma + b_i),$$
where $\{v_{1i}\}_{i=1}^N$, $\{v_{2i}\}_{i=1}^N$, and $\{b_i\}_{i=1}^N$ are learnable parameters and $c$ is a positive constant hyperparameter.
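The two refinement equations can be sketched directly in NumPy; the parameter values below are a toy illustration (in the model they are learned), and the function name `refine` is our own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(mu_raw, sig_raw, w_hat, W, V1, V2, b, c=5.0):
    """Sketch of the refinement module (Section 3.2).

    mu_raw, sig_raw : (N,) raw forecast means and std-devs from TSFNP.
    w_hat           : (N,) gating parameters; gamma_i = sigmoid(w_hat_i).
    W               : (N, N) mixing weights w_i over all raw means.
    V1, V2, b       : parameters of the scale head; c is a positive constant.
    """
    gamma = sigmoid(w_hat)
    mu = gamma * mu_raw + (1.0 - gamma) * (W @ mu_raw)           # own vs. hierarchy-wide info
    sig = c * sig_raw * sigmoid(V1 @ mu_raw + V2 @ sig_raw + b)  # strictly positive scale
    return mu, sig

N = 3
rng = np.random.default_rng(1)
mu_raw, sig_raw = rng.normal(size=N), rng.uniform(0.5, 1.5, size=N)
mu, sig = refine(mu_raw, sig_raw,
                 w_hat=np.full(N, 10.0),          # gamma ~ 1: keep the raw means
                 W=rng.normal(size=(N, N)),
                 V1=rng.normal(size=(N, N)), V2=rng.normal(size=(N, N)),
                 b=np.zeros(N))
print(np.allclose(mu, mu_raw, atol=1e-3), np.all(sig > 0))  # True True
```

Setting the gate parameters large ($\gamma_i \approx 1$) recovers the raw means, illustrating how the gate interpolates between trusting a node's own forecast and borrowing from the rest of the hierarchy.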

3.3. LIKELIHOOD LOSS AND REGULARIZATION OVER HIERARCHY

We optimize the probabilistic process of Equation 2 for accuracy and calibration, leveraging the hierarchical relations, by training on a likelihood loss over ground-truth data as well as on SDCR.

Likelihood Loss: To maximise the likelihood of forecasts over the ground truth $P(\{y_i^{(t+\tau)}\}_{i=1}^N | D_t)$, we use variational inference, approximating the posterior $\prod_{i=1}^N P(z_i, u_i | D_t)$ with the variational distribution $\prod_{i=1}^N P(u_i | y_i^{(t':t)})\, q_i(z_i | y_i^{(t':t)})$, where $q_i$ is a feed-forward network over the GRU hidden embeddings of the Probabilistic Neural Encoder that parameterizes the Gaussian distribution $q_i(z_i | y_i^{(t':t)})$. We derive the ELBO (detailed derivation in the Appendix) as
$$\mathcal{L}_1 = -\mathbb{E}_{\prod_i q_i(z_i, u_i | D_t)} \Big[ \log P(\{y_i^{(t+\tau)}\}_{i=1}^N | \{u_i, z_i\}_{i=1}^N, z) + \sum_{i=1}^N \log P(z_i | u_i, \{u_j\}_{j=1}^N) - \log q_i(z_i | y_i^{(t':t)}) \Big].$$
Soft Distributional Coherency Regularization: PROFHIT leverages the hierarchical relations in $\mathcal{T}$ and regularizes the refined distributions to be coherent. Since PROFHIT aims to leverage hierarchical coherency for improved robustness and calibration, we regularize the full distributions by using the distributional coherency error as part of the loss function. We use the Jensen-Shannon Divergence (JSD) (Endres & Schindelin, 2003) as the distance metric since it is a symmetric and bounded variant of the popular KL-divergence and assumes a closed form for many widely used distributions. We derive the distributional coherency error on $\{(\mu_i, \sigma_i)\}_{i=1}^N$ as
$$\mathcal{L}_2 = \sum_{i: C_i \neq \emptyset} \Big( 2\, \mathrm{JSD}\Big( P(y_i^{(t+\tau)} | \mu_i, \sigma_i),\; P\Big( \sum_{j \in C_i} \phi_{ij}\, y_j^{(t+\tau)} \,\Big|\, \{\mu_j, \sigma_j\}_{j \in C_i} \Big) \Big) + 1 \Big).$$
Computation of the JSD is generally intractable. However, in our case, due to the parameterization of each time-series distribution as a Gaussian, we get a closed-form differentiable expression:
$$\mathcal{L}_2 = \sum_{i: C_i \neq \emptyset} \left[ \frac{\sigma_i^2 + \big( \mu_i - \sum_{j \in C_i} \phi_{ij} \mu_j \big)^2}{2 \sum_{j \in C_i} \phi_{ij}^2 \sigma_j^2} + \frac{\sum_{j \in C_i} \phi_{ij}^2 \sigma_j^2 + \big( \mu_i - \sum_{j \in C_i} \phi_{ij} \mu_j \big)^2}{2 \sigma_i^2} \right].$$
We provide the derivation of this expression (Equation 12) in the Appendix. We use the distributional coherency error as a soft regularization term to enable PROFHIT to leverage the constraints $\mathcal{H}_\mathcal{T}$ when generating forecast distributions.
Thus, the total loss for training is $\mathcal{L} = \mathcal{L}_1 + \lambda \mathcal{L}_2$, where the hyperparameter $\lambda$ controls the trade-off between data likelihood and coherency. We also use the reparameterization trick to make the sampling process differentiable, and we learn the parameters of all modules via stochastic variational Bayes (Kingma & Welling, 2013). The full pipeline of PROFHIT is summarized in Figure 2.
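Assuming Gaussian refined forecasts as above, the closed-form coherency regularizer can be sketched as follows; the hierarchy and parameter values are a toy illustration of our own:

```python
import numpy as np

def sdcr_loss(mu, sig, children, phi):
    """Closed-form distributional coherency regularizer L2 (our reading of
    the Gaussian closed form; hierarchy and parameters are illustrative)."""
    L2 = 0.0
    for i, C in children.items():
        if not C:
            continue
        m_agg = sum(phi[(i, j)] * mu[j] for j in C)             # mean of aggregated children
        v_agg = sum(phi[(i, j)] ** 2 * sig[j] ** 2 for j in C)  # variance of the aggregation
        d2 = (mu[i] - m_agg) ** 2
        L2 += (sig[i] ** 2 + d2) / (2 * v_agg) + (v_agg + d2) / (2 * sig[i] ** 2)
    return L2

children = {0: [1, 2], 1: [], 2: []}
phi = {(0, 1): 1.0, (0, 2): 1.0}

# Perfectly coherent forecasts: parent equals the sum of children in distribution.
mu = np.array([5.0, 2.0, 3.0])
sig = np.array([np.sqrt(1.0 + 4.0), 1.0, 2.0])
print(sdcr_loss(mu, sig, children, phi))  # 1.0, the minimum (one per non-leaf node)

# Total training loss would then be L = L1 + lam * sdcr_loss(...) for some lam > 0.
```

When the parent's distribution exactly matches the aggregated children, each non-leaf term attains its minimum of 1 (i.e., the JSD term is 0); any mean or variance mismatch increases the loss smoothly, which is what makes the constraint soft.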

3.4. DETAILS ON TRAINING

Parameter sharing across nodes: Since PROFHIT's TSFNP module forecasts for multiple nodes, we leverage the hard-parameter sharing paradigm of multi-task learning (Caruana, 1997).

4. EXPERIMENTS

We evaluate PROFHIT over multiple datasets and compare it with state-of-the-art baselines.

4.1. SETUP

Baselines: We compare PROFHIT's performance against state-of-the-art HTSF methods. We also compare against state-of-the-art general probabilistic forecasting methods to study the importance of modeling the hierarchy for both weakly and strongly consistent datasets. (1) PEMBU is a post-processing method that refines raw forecasts to be distributionally coherent; we use the mean forecasts from MINT and ERM as input forecasts for PEMBU. Note that we fine-tune the hyperparameters of PROFHIT and each baseline for each benchmark; more details on hyperparameters are in the Appendix. We also evaluate the efficacy and contribution of our modeling choices with an ablation study using the following variants of PROFHIT: (7) P-GLOBAL: We study the effect of our multi-task hard-parameter sharing approach (Section 3.4) by training a variant where all parameters are shared across all nodes. (8) P-FINETUNE: We examine the efficacy of our soft regularization with both losses, which adapts to optimize for both coherency and training accuracy, by comparing it with a variant where the predictive distribution decoder parameters are further fine-tuned for individual nodes using only the likelihood loss. (9) P-DEEPAR: We evaluate our choice of TSFNP, a previous state-of-the-art univariate model for accurate and calibrated forecasts, against DeepAR, another popular probabilistic forecasting model used by HIERE2E. (10) P-NOCOHERENT: This variant is trained with SDCR completely removed from training. Note that, unlike P-FINETUNE, which is initially trained with SDCR before fine-tuning, P-NOCOHERENT never uses SDCR at any point of the training routine; it therefore measures the importance of explicitly regularizing with the information from the hierarchy (Table 2).

Evaluation metrics: For a ground truth $y^{(t)}$, let the predicted probability distribution be $p_{y^{(t)}}$ with mean $\hat y^{(t)}$ and CDF $F_{y^{(t)}}$.
We evaluate our model and the baselines using metrics that are widely used in the literature to measure accuracy and calibration.
1. Mean Absolute Percentage Error (MAPE) is a commonly used score for point predictions: $\mathrm{MAPE} = \frac{1}{N_t} \sum_{t} \big| \frac{y^{(t)} - \hat y^{(t)}}{y^{(t)}} \big|$, averaged over the $N_t$ evaluation time-steps.
2. Continuous Ranked Probability Score (CRPS) is a widely used standard metric for evaluating probabilistic forecasts that measures both accuracy and calibration. Given ground truth $y$ and the predicted probability distribution $p_y$ with CDF $F_y$, CRPS is defined as $\mathrm{CRPS}(F_y, y) = \int_{-\infty}^{\infty} \big( F_y(\hat y) - \mathbb{1}\{\hat y > y\} \big)^2\, d\hat y$. We approximate $F_y$ as a Gaussian formed from model samples to derive CRPS.
3. Log Score (LS) is a standard score used to measure the accuracy of probabilistic forecasts in epidemiology (Reich et al., 2019). LS measures the negative log-likelihood of a fixed-size interval around the ground truth under the predictive distribution: $\mathrm{LS}(p_y, y) = -\log \int_{y-L}^{y+L} p_y(\hat y)\, d\hat y$. Following Reich et al. (2019), the log-likelihood of a forecast is capped at $-10$.
4. Calibration Score (CS): To measure calibration of forecasts, we use the calibration score defined in Section 2.
5. Distributional Coherency Error (DCE): We compute the distributional coherency error (Equation 11) on the output forecast distributions during inference to study how PROFHIT and the baselines leverage SDCR to learn from hierarchical relations across datasets of varying consistency and to trade off coherency, calibration, and accuracy, especially for weakly consistent data (Section 4.2, Q3).
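For reference, under a Gaussian approximation of the forecast both CRPS and the interval log score are easy to compute in closed form (the paper evaluates CRPS from samples via ensemble scoring; the closed form below is the equivalent for a Gaussian forecast, and the interval half-width `L` is an illustrative choice of ours):

```python
import math

def crps_gaussian(y, mu, sigma):
    """CRPS of a Gaussian predictive distribution N(mu, sigma^2) at truth y
    (standard closed form)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

def log_score(y, mu, sigma, L=0.5, cap=-10.0):
    """Interval log score: negative log-probability assigned to [y-L, y+L],
    with the log-likelihood capped at -10 as in the text."""
    Phi = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    ll = math.log(max(Phi(y + L) - Phi(y - L), 1e-300))
    return -max(ll, cap)

print(round(crps_gaussian(0.0, 0.0, 1.0), 4))  # 0.2337: sharp, centred forecast
print(crps_gaussian(0.0, 5.0, 1.0) > crps_gaussian(0.0, 0.0, 1.0))  # True: a miss costs more
```

The first call reproduces the known value $\sigma\,(2\varphi(0) - 1/\sqrt{\pi}) \approx 0.2337\sigma$ for a perfectly centred Gaussian; a forecast whose mean misses the truth by five standard deviations scores far worse, illustrating how CRPS penalizes both miscalibration and inaccuracy.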

4.2. RESULTS

We comprehensively evaluate PROFHIT through the following questions: Q1: Does PROFHIT produce accurate, calibrated forecasts? Q2: Does PROFHIT provide consistently better performance across all levels of the hierarchy? Q3: Does SDCR help PROFHIT outperform baselines on both strongly and weakly consistent datasets? Q4: How does improved calibration affect the robustness of PROFHIT's forecasts?

Table 3: Average scores (across 5 runs) across all levels of the hierarchy for all baselines, PROFHIT, and its variants. PROFHIT provides 54% better accuracy and 32% better calibration.

Accuracy and calibration performance (Q1): We evaluate all baselines, PROFHIT, and its variants on all datasets over 5 independent runs. The average scores across all levels of the hierarchy are shown in Table 3. PROFHIT significantly outperforms all baselines, with 13% better MAPE and 14%-550% better LS. In terms of calibration, we observe on average 32% lower CS scores. Finally, PROFHIT shows 41-88% better CRPS scores. Thus, PROFHIT adapts well to varied kinds of datasets and outperforms all baselines in both accuracy and calibration. Performing a t-test at significance level α = 1%, we find that the improvements in CRPS, LS, and CS over the baselines are statistically significant. Comparing PROFHIT with its variants, PROFHIT is comparable to or better than the best-performing variant on most benchmarks. This shows that all the important design choices (multi-task parameter sharing, distributional coherency, and joint training on both losses) contribute to PROFHIT's consistently superior performance.

Performance across the hierarchy (Q2): Next, we examine the performance of all models at each level of the hierarchy. We compared PROFHIT with the best-performing baselines, HIERE2E and SHARQ, on all datasets. PROFHIT significantly outperforms the best baselines.
At the leaf nodes, which contain most of the data, PROFHIT outperforms the best baselines by 7% on Wiki to 100% on FB-Survey. For the top node, the performance improvement is largest, ranging from 35% (Wiki) to 962% (FB-Survey). Similarly, for calibration score, we observe an average improvement of 12% for top nodes and 18% for bottom nodes. We show detailed results in the Appendix. PROFHIT also performs better than its variants at most higher levels of the hierarchy, and its performance is comparable to the best variants (P-FINETUNE and P-GLOBAL) at the leaf nodes. P-NOCOHERENT performs worst among all variants and PROFHIT, showing that SDCR is an important contributor to consistent performance across the hierarchy on all datasets.

SDCR leads to consistently better performance across varying data consistency (Q3): As discussed in Section 4.2, we evaluated on both strongly and weakly consistent datasets. Since most previous state-of-the-art models assume datasets to be strongly consistent, deviations from this assumption can cause under-performance on weakly consistent datasets. This is evident in Table 3, where most of the baselines explicitly optimize for hierarchical coherency as a hard constraint on the forecasts. For example, PEMBU's forecasts have better distributional coherency error (DCE) for weakly consistent datasets, yet they perform much worse in both accuracy and calibration than even TSFNP, which does not leverage hierarchical relations at all. Since we use SDCR as a soft learning constraint, PROFHIT can learn to trade off coherency for accuracy and calibration. Therefore, PROFHIT provides 93% better CRPS and 33% better calibration scores than the best HTSF baselines. These improvements are more pronounced at non-leaf nodes of the hierarchy, where PROFHIT improves by 2.8 times for Flu-Symptoms and 9.2 times for FB-Survey.
For strongly consistent datasets, PROFHIT provides 54% better CRPS and 23% better calibration scores while having DCE comparable to PEMBU. We observed that soft coherency regularization and parameter sharing across nodes are vital for PROFHIT's adaptability to varying levels of consistency. We provide a detailed analysis of these observations in the Appendix.

Calibration enables robustness (Q4): Accurate and well-calibrated models that effectively leverage knowledge of the hierarchy should intuitively adapt better to noise and missing data. Hence, we introduce the task of Hierarchical Forecasting with Missing Values to study the robustness of models when there are missing values in the time-series. We model a situation encountered in many real-world applications, such as epidemic forecasting, where the most recent values of time-series are missing due to factors like data reporting delays (Chakraborty et al., 2018). Formally, at time-period $t$, we are given full data up to time $t - \rho$. We set $\rho = 5$ since it is the average forecast horizon of all datasets. For sequence values in the time period between $t - \rho$ and $t$, we randomly remove $k\%$ of the values across all time-series. The models are trained on the complete time-series dataset up to time $t' = t - \rho$. The models' predictions are then used to fill in the missing values between $t'$ and $t$. Finally, we input the filled time-series to generate forecasts for future time-steps. We measure the relative decrease in performance as the percentage of missing data $k$ increases (Figure 3). We observe that PROFHIT's performance degrades much more slowly with a larger fraction of missing values than that of the baselines. Even at $k = 10\%$, PROFHIT's performance decreases by only 10.45-26.8%, whereas other baselines typically degrade by over 70%. Thus, PROFHIT effectively uses coherency to generate robust predictions on both strongly and weakly consistent datasets.
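The masking protocol above can be sketched as follows. This is our own NumPy illustration: the forward-fill imputer is a simple stand-in for the model-based imputation the paper actually uses (filling gaps with the trained model's own predictions):

```python
import numpy as np

def mask_recent_values(Y, rho, k, rng):
    """Hide k% of the values in the last rho time-steps of every series."""
    Y_obs = Y.astype(float).copy()
    mask = rng.random((Y.shape[0], rho)) < k / 100.0
    Y_obs[:, -rho:][mask] = np.nan
    return Y_obs, mask

def impute_persistence(Y_obs):
    """Stand-in imputer: forward-fill each series with its last observed value."""
    Y_fill = Y_obs.copy()
    for i in range(Y_fill.shape[0]):
        for t in range(1, Y_fill.shape[1]):
            if np.isnan(Y_fill[i, t]):
                Y_fill[i, t] = Y_fill[i, t - 1]
    return Y_fill

rng = np.random.default_rng(0)
Y = np.arange(40, dtype=float).reshape(4, 10)        # 4 series, 10 time-steps
Y_obs, mask = mask_recent_values(Y, rho=5, k=10, rng=rng)
Y_fill = impute_persistence(Y_obs)                   # filled input for forecasting
print(np.isnan(Y_obs).sum() == mask.sum(), not np.isnan(Y_fill).any())  # True True
```

Only the last $\rho$ columns can be masked, so the forward-fill always has an observed value to copy; in the paper's setup, the filled series `Y_fill` would then be fed back to the model to forecast $t + \tau$.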

5. CONCLUSION AND DISCUSSION

We introduced PROFHIT, a probabilistic hierarchical forecasting model that produces accurate and well-calibrated forecasts using soft distributional coherency regularization (SDCR), which enables adaptability to datasets with varying levels of hierarchical consistency. We evaluated PROFHIT against previous state-of-the-art hierarchical forecasting baselines over a wide variety of datasets and observed an average improvement of 41-88% in accuracy and 23-33% in calibration. PROFHIT provided the best performance across the entire hierarchy and significantly outperformed other models in providing robust predictions under missing data, where other baselines' performance degraded by over 70%. Our work opens new possibilities, such as extending to domains where time-series values across the hierarchy may not be continuous real numbers, cannot be modelled as Gaussian distributions, or may have different sampling rates. We can also explore modeling more complex structures between time-series with different aggregation relations. PROFHIT can also be used to study anomaly detection in time-series, especially in time-periods where there are deviations from the assumed coherency relations. Similar to Kamarthi et al. (2022), we can extend our work to include multiple sources of features and modalities of data, both specific to each time-series and global to the entire hierarchy.

For the remaining datasets (Flu-Symptoms, FB-Survey), the time-series values are normalized by default and thus require no extra pre-processing.

C.2 MODEL ARCHITECTURE

The architecture of TSFNP used in PROFHIT is similar to that of the original implementation (Kamarthi et al., 2021). The GRU unit is bi-directional with 60 hidden units; thus the local latent variable is also of dimension 60. $\Theta_1$ and $\Theta_2$ are both 2-layer neural networks with the first layer shared between them; both layers have 60 hidden units. Finally, $\Theta_3$ is a three-layer neural network with an input layer of 180 units (for the concatenated input of three 60-dimensional vectors) and two further layers of 60 hidden units each. We found that results are not very sensitive to the value of $c$ in Equation 9 and usually set it to 5. Note that we do not explicitly model the covariance between every pair of time-series (unlike MINT and ERM) and instead use a weighted combination of raw forecast parameters to derive refined forecasts. The refinement module's complexity (Section 3.2) is therefore $O(N^2)$, on par with previous methods like HIERE2E.

C.3 TRAINING AND EVALUATION

Given the training dataset $D_t$, we extract a training set for each node as the set of prefix sequences $\{(y_i^{(t_1:t_2)}, y_i^{(t_2+\tau)}) : 1 \le t_1 \le t_2 < t - \tau\}$ and train the full model (TSFNP and the refinement module). We tune hyperparameters by backtesting, validating on the window $t - \tau$ to $t$, and finally train on the entire training set with the best hyperparameters. For each benchmark, we used the validation set mainly to find the optimal batch size and learning rate. We searched over batch sizes of {10, 50, 100, 200}, and the optimal learning rate was usually around 0.001. We also found the optimal $\lambda$ to be around 0.01 for strongly consistent datasets and 0.001 for weakly consistent datasets. We used early stopping with a patience of 150 epochs to prevent overfitting. For each independent run of a model, we initialized the random seeds from 0 to 5 for PyTorch and NumPy; we did not observe large variations due to randomness for PROFHIT or any baseline. During evaluation, we drew 2000 Monte-Carlo samples from the forecast distribution and used them to estimate the mean for MAPE. We also used the samples' mean and variance to evaluate LS and CS, whereas we used ensemble scoring to evaluate CRPS directly from the samples via the properscoring package.

D DERIVATION OF LIKELIHOOD ELBO LOSS

The full predictive distribution of PROFHIT from Equation 2 can be further expanded as:
$$P(\{y_i^{(t+\tau)}\}_{i=1}^N | D_t) = \int \underbrace{\prod_{i=1}^N P(u_i | y_i^{(1:t)})}_{\text{Probabilistic Encoder}} \;\underbrace{\prod_{i=1}^N P(\mathcal{N}_i | \{u_i\}_{i=1}^N)\, P(z_i | \mathcal{N}_i)}_{\text{SDCG}} \;\underbrace{P(z | \{u_i\}_{i=1}^N)}_{\text{Global latent variable}} \;\underbrace{\prod_{i=1}^N P(\bar\mu_i, \bar\sigma_i | z, z_i, u_i)}_{\text{Raw forecasts}} \;\underbrace{\prod_{i=1}^N P(\mu_i, \sigma_i | \{\bar\mu_j, \bar\sigma_j\}_{j=1}^N)\, P(y_i^{(t+\tau)} | \mu_i, \sigma_i)}_{\text{Refinement module}} \; d\{u_i\}_{i=1}^N\, d\{z_i\}_{i=1}^N.$$
Computing the data likelihood $P(\{y_i^{(t+\tau)}\}_{i=1}^N | D_t)$ requires integration over the latent variables $\{u_i\}_{i=1}^N$ and $\{z_i\}_{i=1}^N$. We instead perform amortized variational inference on the latent variables, similar to a VAE (Kingma & Welling, 2013).

We approximate the posterior of the latent variables $P(\{u_i\}_{i=1}^N, \{z_i\}_{i=1}^N, \{\mathcal{N}_i\}_{i=1}^N, z \mid \{y_i^{(t+\tau)}\}_{i=1}^N)$ with a variational distribution $Q$ expressed as:
$$Q(\{u_i\}_{i=1}^N, \{z_i\}_{i=1}^N, \{\mathcal{N}_i\}_{i=1}^N, z \mid \{y_i^{(t+\tau)}\}_{i=1}^N) = \prod_{i=1}^N P(u_i | y_i^{(1:t)}) \prod_{i=1}^N P(\mathcal{N}_i | \{u_i\}_{i=1}^N)\, \prod_{i=1}^N q_\phi(z_i | y_i^{(1:t)})\; P(z | \{u_i\}_{i=1}^N),$$
where $q_\phi$ is a feed-forward network over the GRU embeddings of the Probabilistic Neural Encoder that parameterizes a Gaussian distribution over $z_i$. The ELBO loss
$$\mathbb{E}_{Q(\{u_i, z_i, \mathcal{N}_i\}_{i=1}^N, z \mid \{y_i^{(t+\tau)}\}_{i=1}^N)} \big[ \log P(\{y_i^{(t+\tau)}\}_{i=1}^N \mid \{u_i, z_i, \mathcal{N}_i\}_{i=1}^N, z) + \log P(\{u_i\}_{i=1}^N, \{z_i\}_{i=1}^N, \{\mathcal{N}_i\}_{i=1}^N, z \mid \{y_i^{(t+\tau)}\}_{i=1}^N) - \log Q(\{u_i\}_{i=1}^N, \{z_i\}_{i=1}^N, \{\mathcal{N}_i\}_{i=1}^N, z \mid \{y_i^{(t+\tau)}\}_{i=1}^N) \big]$$
simplifies to Equation 10 by cancellation of matching terms between the variational and true distributions of the latent variables.

E DERIVATION OF DISTRIBUTIONAL COHERENCY ERROR

The Distributional Coherency Error (Equation 11) can be exactly expressed as:

$$\mathcal{L}_2 = \sum_{i=1}^N \frac{\sigma_i^2 + \big(\mu_i - \sum_{j\in C_i}\phi_{ij}\mu_j\big)^2}{2\sum_{j\in C_i}\phi_{ij}^2\sigma_j^2} + \sum_{i=1}^N \frac{\sum_{j\in C_i}\phi_{ij}^2\sigma_j^2 + \big(\mu_i - \sum_{j\in C_i}\phi_{ij}\mu_j\big)^2}{2\sigma_i^2}.$$

To derive Equation 12, we use the following well-known result for the JSD of two Gaussian distributions (Nielsen, 2019):

Lemma 1. Given two univariate Normal distributions $P_1 = \mathcal{N}(\mu_1, \sigma_1)$ and $P_2 = \mathcal{N}(\mu_2, \sigma_2)$, the JSD is

$$JSD(P_1, P_2) = \frac{1}{2}\left[\frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} + \frac{\sigma_2^2 + (\mu_1-\mu_2)^2}{2\sigma_1^2} - 1\right].$$

Note that $P(y_i^{t+\tau} \mid \hat{\mu}_i, \hat{\sigma}_i) = \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i)$ and $P(\sum_{j\in C_i}\phi_{ij} y_j^{t+\tau} \mid \{\hat{\mu}_j, \hat{\sigma}_j\}_{j\in C_i})$ is a weighted sum of the Gaussian variables $\{\mathcal{N}(\hat{\mu}_j, \hat{\sigma}_j)\}_{j\in C_i}$. Therefore,

$$P\Big(\sum_{j\in C_i}\phi_{ij} y_j^{t+\tau} \,\Big|\, \{\hat{\mu}_j, \hat{\sigma}_j\}_{j\in C_i}\Big) = \mathcal{N}\Big(\sum_{j\in C_i}\phi_{ij}\hat{\mu}_j,\; \sqrt{\textstyle\sum_{j\in C_i}\phi_{ij}^2\hat{\sigma}_j^2}\Big).$$

Using Lemma 1 along with Equations 18 and 19, we get the desired result in Equation 16.

We also empirically measure the Consistency Errors (Definition 1) of all datasets, both over the entire hierarchy and at each level of the hierarchy. The results are in Table 4. As expected, there are no deviations for the strongly consistent datasets, whereas the deviation is significant for the weakly consistent data.

In real-world applications, recent values of some time-series may be missing or unreliable due to reporting discrepancies and delays (Chakraborty et al., 2018). Therefore, one approach to forecasting in such situations is to first impute the missing values based on past data and then use the imputed values as part of the input for forecasting.
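Lemma 1 can be turned into a per-node coherency error in a few lines. The sketch below (function names and interface are illustrative, not the paper's implementation) computes the closed-form divergence between a parent's Gaussian forecast and the Gaussian implied by the weighted sum of its children's forecasts:

```python
import numpy as np

def gaussian_jsd(mu1, sig1, mu2, sig2):
    """Closed-form divergence from Lemma 1 (Nielsen, 2019) between two
    univariate Gaussians N(mu1, sig1) and N(mu2, sig2)."""
    return 0.5 * ((sig1**2 + (mu1 - mu2)**2) / (2 * sig2**2)
                  + (sig2**2 + (mu1 - mu2)**2) / (2 * sig1**2) - 1)

def coherency_error(mu, sig, children, phi):
    """Distributional coherency error for one parent node: divergence between
    its forecast N(mu, sig) and the weighted sum of its children's Gaussian
    forecasts, which is N(sum_j phi_j mu_j, sqrt(sum_j phi_j^2 sig_j^2)).
    `children` is a list of (mu_j, sig_j) pairs; `phi` the matching weights."""
    mu_c = sum(p * m for p, (m, s) in zip(phi, children))
    var_c = sum(p**2 * s**2 for p, (m, s) in zip(phi, children))
    return gaussian_jsd(mu, sig, mu_c, np.sqrt(var_c))
```

A perfectly coherent parent (its forecast distribution equals the aggregated children's distribution) has error exactly zero; any mismatch in mean or variance makes the error strictly positive, which is what SDCR penalizes softly.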

G PERFORMANCE ACROSS EACH LEVEL OF HIERARCHY

We compared the performance of PROFHIT with the best performing baselines HIERE2E and SHARQ at each level of the hierarchy for all datasets. PROFHIT significantly outperforms the best baselines as well as the variants. At the leaf nodes, which contain most of the data, PROFHIT outperforms the best baselines by 7% on Wiki to 100% on FB-Survey. For the top node, the performance improvement is largest: 35% (Wiki) to 962% (FB-Survey). We show detailed results in the per-level CRPS and CS tables.

H HIERARCHICAL FORECASTING WITH MISSING VALUES

Task: To simulate such scenarios of missing data and evaluate the robustness of PROFHIT and all baselines, we design a task called Hierarchical Forecasting with Missing Values (HFMV). Formally, at time-period t, we are given full data up to time t − ρ. We show results here for ρ = 5. Of the sequence values in the time period between t − ρ and t, we randomly remove k% across all time-series. The goal of the HFMV task is to use the resulting partial data from t − ρ to t, along with the complete data for the time-period before t − ρ, to predict future values at t + τ. Therefore, success on HFMV implies that a model is robust to missing data from the recent past by effectively leveraging hierarchical relations.

Setup: We first train PROFHIT and the baselines on the complete dataset up to time t′ and then fill in the missing values of the input sequence using the trained model. Using the predicted missing values, we again forecast the output distribution. For each baseline and for PROFHIT, we perform multiple iterations of Monte-Carlo sampling of the missing values followed by forecasting of future values to generate the forecast distribution. We estimate the evaluation scores using sample forecasts from all sampling iterations.
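The masking step of the HFMV task can be sketched as follows (the function name, array layout, and default values are our own assumptions, not the paper's code):

```python
import numpy as np

def mask_recent_values(data, rho=5, k=10, seed=0):
    """Simulate the HFMV input: randomly mark k% of the values in the last
    rho time steps of a (num_series, T) array as missing (NaN)."""
    rng = np.random.default_rng(seed)
    masked = np.asarray(data, dtype=float).copy()
    n_series, t_len = masked.shape
    n_missing = int(round(n_series * rho * k / 100))
    # Choose flat positions within the recent window, then map to 2-D indices.
    flat = rng.choice(n_series * rho, size=n_missing, replace=False)
    rows, cols = np.unravel_index(flat, (n_series, rho))
    masked[rows, t_len - rho + cols] = np.nan
    return masked
```

The NaN entries would then be imputed by the trained model (Monte-Carlo sampled) before forecasting, as described in the setup above.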

Robustness of PROFHIT variants:

We compare the relative performance decrease with increasing percentage of missing data for PROFHIT and its variants in Figure 4. We observe that P-NOCOHERENT's performance deteriorates very rapidly on most benchmarks, showing the importance of SDCR for learning robust, calibrated, coherent forecasts. The second worst-performing variant across all datasets is P-FINETUNE, which also relies less on the hierarchical relations due to fine-tuning of parameters for specific time-series. Finally, we observe that PROFHIT and P-GLOBAL suffer the least degradation in performance, since both models prioritize integrating hierarchical coherency information, which enables them to provide better estimates for imputed values of the missing input and use them to generate more accurate and calibrated forecasts.

I ADAPTING TO VARYING DATASET CONSISTENCY

Observation 1. The average improvement in performance of PROFHIT over the best HTSF baselines is 72% higher for weakly consistent datasets than for strongly consistent datasets. Since most previous state-of-the-art models assume datasets to be strongly consistent, deviations from this assumption can cause under-performance on weakly consistent datasets. This is evidenced in Table 3, where some baselines that explicitly optimize for hierarchical coherency, like MINT and ERM, perform worse than even TSFNP, which does not leverage hierarchical relations, on Flu-Symptoms and FB-Survey. Overall, we found that for weakly consistent datasets PROFHIT provides a much larger 93% average improvement in CRPS scores over the best HTSF baselines, compared to a 54% average improvement for strongly consistent datasets. These improvements are more pronounced at non-leaf nodes of the hierarchy, where PROFHIT improves by 2.8 times for Flu-Symptoms and 9.2 times for FB-Survey. This is because HTSF baselines that assume strong consistency do not adapt to noise at leaf nodes, which compounds into errors at higher levels of the hierarchy.

Observation 2. PROFHIT's approach to parameter sharing and soft coherency regularization helps it adapt to varying hierarchical consistency. We observe that the best performing variant for strongly consistent datasets is P-GLOBAL, which is trained with both the likelihood loss and SDCR (Table 3). But its performance severely degrades for weakly consistent datasets, since sharing all model parameters across all time-series makes it inflexible to patterns and deviations specific to individual nodes. In contrast, P-FINETUNE and P-NOCOHERENT perform best among the variants for weakly consistent datasets, since they train separate sets of decoder parameters for each node. But they perform poorly for strongly consistent datasets, since they do not leverage Distributional Coherency effectively. PROFHIT combines the flexible parameter learning of P-FINETUNE with the joint optimization via Distributional Coherency of P-GLOBAL, providing performance comparable to the best variant on all datasets.

Observation 3. PROFHIT's Refinement Module automatically learns to adapt to varying hierarchical consistency. The design choices of the Refinement Module help PROFHIT adapt to datasets with different levels of hierarchical consistency. Specifically, by optimizing the values of {γ_i}_{i=1}^N in Equation 8, PROFHIT aims to learn a good trade-off between leveraging the prior forecasts of a time-series and the forecasts from the entire hierarchy. We study the learned values of {γ_i}_{i=1}^N in Equation 8 used to derive the refined mean. Note that higher values of γ_i indicate larger dependence on the node's raw forecasts and smaller dependence on the forecasts of the entire hierarchy. We report the average values of γ_i for each dataset in Table 7. We observe that strongly consistent datasets have lower values of γ_i, indicating that PROFHIT's Refinement Module automatically learns to leverage the hierarchy more strongly for these datasets than for weakly consistent ones.

Table 8: Std. dev. of CRPS and LS (across 5 runs) across all levels for all baselines, PROFHIT and its variants. PROFHIT performs significantly better than all baselines, as verified using a t-test with α = 1%.




¹ Note that we describe consistency over a dataset and coherency over model forecasts.
Code and datasets: https://anonymous.4open.science/r/PROFHiT-6F2F
² https://github.com/properscoring/properscoring



Figure 1: Regularizing forecasts using Distributional Coherency.

Moreover, previous methods do not focus on providing calibrated forecasts with precise uncertainty measures. Traditional methods focus only on point predictions. Recent methods like MINT (Wickramasuriya et al., 2019), ERM (Ben Taieb & Koo, 2019) and PEMBU (Taieb et al., 2017) refine raw independent forecast distributions as a post-processing step. This does not let the models generating the raw forecasts leverage the underlying hierarchical relations across time-series. End-to-end neural methods such as HIERE2E (Rangapuram et al., 2021) and SHARQ (Han et al., 2021) directly leverage hierarchical relations as part of the model architecture or learning algorithm, imposing hierarchical constraints on the mean or fixed quantiles of the forecast distributions, and usually outperform post-processing methods. However, these methods do not enforce hierarchical coherency on the full distributions. Therefore, the forecasts may not be well-calibrated (Kuleshov et al., 2018), i.e., they produce unreliable prediction intervals that may not match the observed probabilities from the ground truth (Fisch et al., 2022).

Figure 2: Overview of the PROFHIT pipeline. The input time-series are ingested by TSFNP, a Neural Gaussian Process based probabilistic forecasting model, to output the raw forecast distributions. The parameters of the raw forecasts are refined by the Refinement Module using predictions from all time-series. Training is driven by a likelihood loss that learns from the ground truth and by Soft Distributional Coherency Regularization, which regularizes the forecast distributions to follow the hierarchical relations.

Figure 3: % increase in CRPS for all models with increase in proportion of missing data.


Figure 4: % increase in CRPS for PROFHIT and variants with increase in proportion of missing data.

Comparison of PROFHIT with state-of-the-art methods.

MINT (Wickramasuriya et al., 2019) and ERM (Ben Taieb & Koo, 2019) are methods that convert incoherent forecasts into coherent ones as a post-processing step by framing reconciliation as an optimization problem. Since TSFNP provided better evaluation scores than DEEPAR, we performed ERM and MINT on Monte-Carlo samples of TSFNP's predictive distribution. (5) HIERE2E (Rangapuram et al., 2021) is a recent state-of-the-art deep-learning based approach that projects the raw predictions onto a space of coherent forecasts and trains the model in an end-to-end manner. (6) SHARQ (Han et al., 2021) is another state-of-the-art deep-learning based approach that reconciles forecast distributions by using quantile regressions and making the quantile values coherent. (7) PEMBU (Taieb et al., 2017)

Dataset Characteristics and Consistency: (1) Labour dataset contains monthly employment data from Feb 1978 to Dec 2020 collected from the Australian Bureau of Statistics. (2) Tourism-L (Wickramasuriya et al., 2019) contains tourism flows in different regions of Australia grouped by region and demographic. It has two sets of hierarchies (with four and five levels), one for the mode of travel and the other for geography, with the top node being the only node common to both hierarchies. (3) Wiki dataset collects the number of daily views of 145000 Wikipedia articles aggregated into 150

Average deviation of observed values in time-series from the hierarchical relations. Flu-Symptoms and FB-Survey are weakly consistent datasets since they do not strictly follow the aggregation relations of the hierarchy, unlike the strongly consistent datasets Tourism-L, Labour and Wiki.

Average CRPS scores at each level of hierarchy. PROFHIT significantly outperforms best baselines across all benchmarks. Note that P-Finetune's performance decreases at higher levels of hierarchy compared to other variants whereas P-Global's performance is worse at lower levels.

Average CS scores at each level of hierarchy. PROFHIT significantly outperforms best baselines across all benchmarks.

During real-time forecasting in real-world applications such as epidemic or sales forecasting, we encounter situations where the past few values of some time-series are missing or unreliable for some of the nodes. This is observed specifically at lower levels, due to discrepancies or delays during reporting and other factors (Chakraborty et al., 2018).



Average value of γ_i for all datasets. Note that weakly consistent datasets have higher γ_i (depending more on past data of the same time-series) whereas strongly consistent datasets have lower γ_i (leveraging the hierarchical relations).

A ADDITIONAL RELATED WORK

Probabilistic time-series forecasting: Classical probabilistic time-series forecasting methods include exponential smoothing and ARIMA (Hyndman & Athanasopoulos, 2018). They are simple but focus on univariate time-series and model each sequence independently. Recently, deep learning based methods, such as DeepVAR (Salinas et al., 2020), have been successfully applied in this area. HIERE2E (Rangapuram et al., 2021) uses a deep-learning based end-to-end approach to directly train on the projected forecasts. SHARQ (Han et al., 2021) is another recent probabilistic deep-learning based method that uses quantile regression and regularizes for coherency at different quantiles of the forecast distribution. However, unlike our approach, these end-to-end methods do not regularize for coherency over the entire distribution (Distributional Coherency) but only over fixed quantiles. Most of these methods are also not designed for cases where the hierarchical constraints are not consistently followed.

B CODE AND DATASET

We evaluated all models on a system with a 64-core Intel Xeon processor, 128 GB memory, and an Nvidia Tesla V100 GPU with 32 GB VRAM. We provide an anonymized repository of our implementation of PROFHIT along with the datasets used at https://anonymous.4open.science/r/PROFHiT-6F2F. We will release the code and data publicly after acceptance.

C HYPERPARAMETERS

C.1 DATA PREPROCESSING Most datasets used in our work assume the aggregation function to be simple summation (i.e., $\phi_{ij} = 1$ for all weights). We first normalize the values of the leaf time-series training data to have zero mean and unit variance. Since the aggregation of values at higher levels of the hierarchy can lead to very large values in the time-series, we instead divide each non-leaf time-series by its number of children. The weights of the hierarchical relations then become $\phi_{ij} = \frac{1}{|C_i|}$, where $C_i$ is the set of all children of node $i$.
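The preprocessing above can be sketched in a few lines (function names are illustrative, not from the paper's code):

```python
import numpy as np

def normalize_leaf(y):
    """Standardize a leaf time-series to zero mean and unit variance."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

def scale_nonleaf(y, num_children):
    """Divide a non-leaf series by its number of children, so that the
    effective hierarchy weights become phi_ij = 1 / |C_i|."""
    return np.asarray(y, dtype=float) / num_children
```

Dividing non-leaf series by their child counts keeps magnitudes comparable across hierarchy levels after the leaves have been standardized.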

