TAMING THE LONG TAIL OF DEEP PROBABILISTIC FORECASTING

Abstract

Deep probabilistic forecasting is gaining attention in numerous applications, ranging from weather prognosis and electricity consumption estimation to autonomous vehicle trajectory prediction. However, existing approaches focus on improving average metrics without addressing the long-tailed distribution of errors. In this work, we observe long-tail behavior in the error distribution of state-of-the-art deep learning methods for probabilistic forecasting. We present two loss augmentation methods to reduce tailedness: Pareto Loss and Kurtosis Loss. Both methods are related to the concept of moments, which measure the shape of a distribution. Kurtosis Loss is based on a symmetric measure, the fourth moment. Pareto Loss is based on an asymmetric measure of right-tailedness and models the loss using a Generalized Pareto Distribution (GPD). We demonstrate the performance of our methods on several real-world datasets, including time series and spatiotemporal trajectories, achieving significant improvements on tail error metrics while maintaining, and even improving upon, average error metrics.

1. INTRODUCTION

Probabilistic forecasting is one of the most fundamental problems in time series and spatiotemporal data analysis, with broad applications in energy, finance, and transportation. Deep learning models Li et al. (2019); Salinas et al. (2020); Rasul et al. (2021a) have emerged as state-of-the-art approaches for forecasting rich time series and spatiotemporal data with uncertainty. In several forecasting competitions, such as the M5 forecasting competition Makridakis et al. (2020), the Argoverse motion forecasting challenge Chang et al. (2019), and the IARAI Traffic4cast contest Kreil et al. (2020), almost all the winning solutions are based on deep neural networks.

Figure 1: Log-log error distribution plot for trajectory prediction on the ETH-UCY dataset using the SoTA model (Traj++EWTA). We see a long tail in error, up to 2 orders of magnitude higher than the average. Also shown is a tail sample with predictions from our method (teal) and Traj++EWTA (purple).

Despite encouraging progress, we observe that the forecasting error of deep learning models has long-tail behavior: a significant number of samples are very difficult to forecast and have errors much larger than the average. Figure 1 visualizes an example of long-tail behavior for a motion forecasting task. Existing works often measure forecasting performance by averaging across test samples. However, average performance measured by metrics such as root mean square error (RMSE) or mean absolute error (MAE) can be misleading. A low RMSE or MAE may indicate good average performance, but it does not prevent the model from behaving disastrously in critical scenarios. From a practical perspective, the long-tail behavior in forecasting error is alarming. In motion forecasting, the long tail can correspond to crucial driving events, such as turning maneuvers and sudden stops. Failure to forecast accurately in these scenarios would pose paramount safety risks in route planning.
In electricity forecasting, these high errors could occur during short circuits, power outages, grid failures, or sudden behavior changes. Focusing solely on average performance would ignore these electric load anomalies, significantly increasing maintenance and operational costs.

Under review as a conference paper at ICLR 2023

Long-tailed learning is heavily studied in classification settings, with a focus on class imbalance. There is also rich literature on heavy-tailed time series Kulik & Soulier (2020). However, the long tail there usually refers to the distribution of the data, not the distribution of the error. We refer the reader to Table 2 in Menon et al. (2020) and the survey paper Zhang et al. (2021) for a complete review. The most common approaches to address a long-tailed data distribution include post-hoc normalization Pan et al. (2021), data resampling Chawla et al. (2002); Torgo et al. (2013), loss engineering Lin et al. (2017); Lu et al. (2018), and learning class-agnostic representations Tiong et al. (2021). These approaches implicitly assume a strong correspondence between data and error. Hence, they are not directly applicable to forecasting, as we have neither pre-defined classes nor the prediction error before training. Makansi et al. (2021) observed a similar long-tail error in trajectory prediction and proposed using Kalman filter prediction performance to measure sample difficulty. However, the Kalman filter is a different model class, and its difficulties do not translate to the deep neural networks used for forecasting.

In this paper, we address the long-tail behavior in prediction error for deep probabilistic forecasting. We present two loss augmentation methods: Pareto Loss and Kurtosis Loss. Kurtosis Loss is based on a symmetric measure of tailedness, the scaled fourth moment of a distribution. Pareto Loss uses the Generalized Pareto Distribution (GPD) to fit the long-tailed error distribution.
The GPD can be described as a weighted summation of shifted moments, which is an asymmetric measure of tailedness. We investigate these measures as loss regularization and reweighting approaches for probabilistic forecasting tasks. We achieve significantly improved tail performance compared to the base model and baselines. Interestingly, we also observe better average performance in most settings. In summary, our contributions are:

• We identify long-tail behavior in the forecasting error of deep probabilistic models.
• We investigate principled approaches to address this long-tail behavior and propose two novel methods: Pareto Loss and Kurtosis Loss.
• We significantly improve the tail errors on four real-world forecasting tasks, including two time series and two spatiotemporal trajectory forecasting datasets.

2. RELATED WORK

Imbalanced regression. A few methods have been developed for imbalanced regression. Many are modifications of SMOTE (Synthetic Minority Oversampling Technique): SMOTER adapts it to regression Torgo et al. (2013), SMOGN augments it with Gaussian noise Branco et al. (2017), and Ribeiro & Moniz (2020) extend it to the prediction of extremely rare values. Steininger et al. (2021) proposed DenseWeight, a method based on kernel density estimation for better assessment of the relevance function for sample reweighing. Yang et al. (2021) proposed distribution smoothing over label (LDS) and feature (FDS) space for imbalanced regression. Prasad et al. (2018); Zhu & Zhou (2021) worked on robust regression approaches applicable to point forecasts. GARCH Bollerslev (1986) and AFTER Cheng et al. (2015) addressed heavy-tailed errors in forecasting, but both are statistical models and not applicable to deep learning. A concurrent work is Makansi et al. (2021), which also notices the long-tail error distribution for trajectory prediction. They use Kalman filter Kalman (1960) performance as a difficulty measure and propose contrastive learning to mitigate the tail problem.
However, the tail samples of the Kalman filter differ from those of deep learning models.


Most methods in long-tailed learning operate on known heavy-tailedness in data, whereas our focus is to mitigate the unknown long tail in the error distribution of test samples without any specific assumption on the data distribution. This is essential to our problem setting and techniques.

3. METHODOLOGY

We first identify the long-tail error distribution in probabilistic forecasting. Then, we propose two novel methods, Pareto Loss and Kurtosis Loss, to mitigate the long tail in error.

3.1. LONG-TAIL IN PROBABILISTIC FORECASTING

Given input $x_t \in \mathbb{R}^{d_{in}}$ and output $y_t \in \mathbb{R}^{d_{out}}$, the probabilistic forecasting task aims to predict the conditional distribution of future states $y = (y_{t+1}, \dots, y_{t+h})$ given current and past observations $x = (x_{t-k}, \dots, x_t)$:

$$p(y_{t+1}, \dots, y_{t+h} \mid x_{t-k}, \dots, x_t) \quad (1)$$

where $k$ is the length of the history and $h$ is the prediction horizon. The maximum likelihood prediction (the mean when the predicted distribution is Gaussian) is denoted $\hat{y} = (\hat{y}_{t+1}, \dots, \hat{y}_{t+h})$.

Long-tailed error distributions of deep learning models manifest in numerous real-world datasets. This is evident in the four benchmark forecasting datasets studied in this work (time series: Electricity Dua & Graff (2017), Traffic Dua & Graff (2017); trajectory: ETH-UCY Pellegrini et al. (2009); Lerner et al. (2007), nuScenes Caesar et al. (2020)). Fig. 2 shows the long-tailed error distribution for the time series datasets using DeepAR Salinas et al. (2020) and for the trajectory datasets using Trajectron++EWTA Makansi et al. (2019). Following the literature, we use Normalized Deviation (ND) and Final Displacement Error (FDE) to measure performance.

We also observe that the samples forming the tail in error vary across methods and even across different runs of the same model. For example, we trained two DeepAR Salinas et al. (2020) models on the same Electricity forecasting dataset from the UCI repository Dua & Graff (2017) and observed that the sets of samples with the top 5% error values have only 3.5% of samples common to both models. This shows that the tail in the data does not necessarily correspond to the tail in error. The fact that it is impossible to identify a fixed set of tail samples means that we cannot simply reweigh (Cui et al. (2019); Fan et al. (2017)) or resample (Torgo et al. (2013); Branco et al. (2017)) these samples before training. The variation of tail samples between models also invalidates the approach taken by Makansi et al. (2021).
Mitigating the long tail in error requires an approach that is independent of the data distribution and is adaptive during training. Thus, we propose using tail-sensitive loss augmentations that adapt the model to also improve on samples with tail errors.
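The top-5% overlap observation above can be made concrete with a small sketch. This is illustrative only: the `tail_overlap` helper and the simulated error arrays are our own constructions, not part of the paper's pipeline.

```python
import numpy as np

def tail_overlap(err_a, err_b, q=0.95):
    """Fraction of samples in the top (1-q) error quantile of BOTH models,
    relative to the size of one model's tail set."""
    tail_a = set(np.flatnonzero(err_a >= np.quantile(err_a, q)))
    tail_b = set(np.flatnonzero(err_b >= np.quantile(err_b, q)))
    return len(tail_a & tail_b) / max(len(tail_a), 1)

# Errors of two independently trained models on the same test set,
# simulated here as independent long-tailed draws.
rng = np.random.default_rng(0)
e1 = rng.pareto(3.0, 10_000)
e2 = rng.pareto(3.0, 10_000)
print(f"overlap: {tail_overlap(e1, e2):.3f}")  # small when tails are independent
```

When the two tail sets are drawn independently, the expected overlap is only about 5% of the tail set, mirroring the 3.5% overlap reported for the two DeepAR runs.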

3.2. PARETO LOSS

Long-tailed distributions naturally lend themselves to analysis using Extreme Value Theory (EVT). EVT McNeil (1997) shows that the long-tail behavior of a distribution can be modeled by a generalized Pareto distribution (GPD). The probability density function (pdf) of the GPD is

$$f_{(\xi,\eta,\mu)}(a) = \frac{1}{\eta}\left(1 + \xi\,\frac{a-\mu}{\eta}\right)^{-\left(\frac{1}{\xi}+1\right)} \;\Rightarrow\; f_{(\xi,\eta)}(a) = \left(1 + \frac{\xi a}{\eta}\right)^{-\left(\frac{1}{\xi}+1\right)} \quad (2)$$

where the parameters are location ($\mu$), scale ($\eta$), and shape ($\xi$). Without loss of generality, $\mu$ can be set to 0, and we can drop the scaling term $1/\eta$ as the pdf will be scaled by a hyperparameter.

The idea behind our Pareto Loss is to fit the GPD pdf in equation 2 to the final loss distribution and use it to increase the emphasis placed on tail samples during training. We denote the loss function of a given model, the base loss, as $l$. In probabilistic forecasting, a commonly used loss is the Negative Log Likelihood (NLL) loss $l_i = -\log(p(y^{(i)}|x^{(i)}))$, where $\langle x^{(i)}, y^{(i)} \rangle$ is the $i$-th training sample. Our goal is to reduce the long-tail error measured by, e.g., MSE. This means that using NLL to fit the GPD might not lead to the intended prioritization of samples. Thus, we propose fitting the GPD on an auxiliary loss $\tilde{l}$ that is better correlated with the evaluation metric used. The choice of auxiliary loss is completely up to the model designer and could be the base loss itself in settings where it correlates well with the evaluation metric. See Appendix F for further details.

There are two main classes of loss augmentation methods to mitigate tail errors: regularization Ren et al. (2020); Makansi et al. (2021) and reweighting Lin et al. (2017); Lu et al. (2018); Yang et al. (2021). Inspired by these, we propose two variations of the Pareto Loss using the GPD fitted on $\tilde{l}$: Pareto Loss Margin (PLM) and Pareto Loss Weighted (PLW). PLM follows the principles of margin-based regularization Ren et al. (2020); Liu et al. (2016) and assigns larger additive penalties to tail samples using the fitted GPD. For a given hyperparameter $\lambda$, PLM is defined as

$$l_{plm} = l + \lambda \cdot r_{plm}(\tilde{l}), \qquad r_{plm}(\tilde{l}) = 1 - f_{(\xi,\eta)}(\tilde{l}) \quad (3)$$

An alternative is to reweigh the loss terms using the fitted GPD. For a given $\lambda$, PLW is defined as

$$l_{plw} = w_{plw}(\tilde{l}) \cdot l, \qquad w_{plw}(\tilde{l}) = 1 - \lambda \cdot f_{(\xi,\eta)}(\tilde{l}) \quad (4)$$
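The two Pareto Loss variants can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names and example parameter values are ours, and in training these would operate on per-sample loss tensors inside the learning framework.

```python
import numpy as np

def gpd_pdf(a, xi, eta):
    """Unnormalized GPD pdf of equation 2 (mu = 0, 1/eta factor dropped).
    Decreasing in a: small losses get pdf near 1, tail losses near 0."""
    return (1.0 + xi * a / eta) ** -(1.0 / xi + 1.0)

def pareto_loss_margin(base_loss, aux_loss, xi, eta, lam):
    """PLM: additive margin 1 - f(aux_loss), larger for tail samples."""
    return base_loss + lam * (1.0 - gpd_pdf(aux_loss, xi, eta))

def pareto_loss_weighted(base_loss, aux_loss, xi, eta, lam):
    """PLW: multiplicative weight 1 - lam * f(aux_loss), bounded above by 1."""
    return (1.0 - lam * gpd_pdf(aux_loss, xi, eta)) * base_loss
```

Because the GPD pdf decreases in the auxiliary loss, PLM adds an extra penalty that approaches $\lambda$ for tail samples, while PLW's weights stay bounded by 1, which matches the bounded-weights discussion in the results analysis.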

3.3. KURTOSIS LOSS

Use cases requiring higher emphasis on the extreme tail need an even more skewed measure of heavy-tailedness. For such cases, we propose using kurtosis, the scaled fourth moment of a distribution about its mean, which assesses the propensity of a distribution to have extreme values in its tails. To increase the emphasis on tail samples, we use this measure as a margin-based regularization term in our proposed Kurtosis Loss. For a given hyperparameter $\lambda$, and using the same notation as Sec. 3.2, Kurtosis Loss is defined as

$$l_{kurt} = l + \lambda \cdot r_{kurt}(\tilde{l}), \qquad r_{kurt}(\tilde{l}) = \left(\frac{\tilde{l} - \mu_{\tilde{l}}}{\sigma_{\tilde{l}}}\right)^4 \quad (5)$$

where $\mu_{\tilde{l}}$ and $\sigma_{\tilde{l}}$ are the mean and standard deviation of the auxiliary loss $\tilde{l}$ over a batch of samples. We do not use a reweighting-based approach with kurtosis, as the kurtosis value has no upper bound; very high weights for some samples could lead to convergence issues.

3.4. CONNECTION BETWEEN PARETO AND KURTOSIS LOSS

Kurtosis Loss and Pareto Loss are both based on moments of a distribution: Pareto Loss is a weighted sum of shifted moments, while Kurtosis Loss is the scaled fourth moment. Specifically, let $b = \xi a/\eta$ and $c = -(\frac{1}{\xi}+1)$; then the Taylor expansion of the GPD pdf in equation 2 is

$$(1+b)^c = 1 + cb + \frac{c(c-1)}{2!}b^2 + \frac{c(c-1)(c-2)}{3!}b^3 + \cdots \quad (6)$$

For $c < 0$, or equivalently $\xi < -1$ or $\xi > 0$, the coefficients are positive for even moments and negative for odd moments (even and odd powers of $b$). Even moments are always symmetric and positive, whereas odd moments are positive only for right-tailed distributions. Since we use the negative of the pdf, this yields an asymmetric measure of the right-tailedness of the distribution. Kurtosis Loss uses the fourth moment, a symmetric and positive measure. The GPD and kurtosis are visualized in Appendix E. Kurtosis emphasizes extreme values in the tail; our experiments also show that it is more effective at controlling the extremes of the error distribution.
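The batch-level regularizer in equation 5 can be sketched as follows. This is an illustrative NumPy version under the assumption that losses arrive as per-batch arrays; in practice it would be computed on the framework's loss tensors.

```python
import numpy as np

def kurtosis_loss(base_loss, aux_loss, lam, eps=1e-8):
    """Kurtosis Loss (equation 5): add the 4th power of the batch-standardized
    auxiliary loss as a margin. mu and sigma are batch statistics."""
    mu = aux_loss.mean()
    sigma = aux_loss.std() + eps  # eps guards against a zero-variance batch
    return base_loss + lam * ((aux_loss - mu) / sigma) ** 4

batch_aux = np.array([1.0, 1.0, 1.0, 1.0, 100.0])  # one extreme tail sample
augmented = kurtosis_loss(np.zeros(5), batch_aux, lam=1.0)
```

The fourth power makes the margin negligible for near-average samples and very large for the batch outlier, which is exactly the "emphasize the extremes" behavior discussed above.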

4. EXPERIMENTS

We evaluate our methods on multiple benchmark datasets from two probabilistic forecasting tasks: time series forecasting (1D) and trajectory prediction (2D).

4.1. SETUP

Datasets. For time series forecasting, we use the electricity and traffic datasets from the UCI ML repository Dua & Graff (2017).

Metrics. Apart from the above-mentioned average performance metrics, we introduce metrics to capture the tail errors. We adapt the Value-at-Risk (VaR) tail metric from the financial domain:

$$\text{VaR}_\alpha(E) = \inf\{e : P(E \ge e) \le 1 - \alpha\} \quad (7)$$

VaR at level $\alpha \in (0, 1)$ is the smallest error $e$ such that the probability of observing an error greater than $e$ is at most $1 - \alpha$, where $E$ is the error distribution; it evaluates to the $\alpha$-th quantile of the error distribution. We measure VaR at three levels: 0.95, 0.98, and 0.99. Additionally, we report the maximum error, representing worst-case performance. We present tail metrics on the complete error distribution, as there is no fixed set of tail samples across different methods (see Sec. 3.1).
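The tail metrics reduce to empirical quantiles of the per-sample errors; a minimal sketch (the `var_alpha` helper name is ours):

```python
import numpy as np

def var_alpha(errors, alpha):
    """VaR_alpha of equation 7: the alpha-quantile of the error distribution."""
    return np.quantile(errors, alpha)

# Example: tail metrics on a simulated long-tailed error distribution.
errors = np.random.default_rng(0).pareto(2.0, 100_000)
for alpha in (0.95, 0.98, 0.99):
    print(f"VaR_{alpha}: {var_alpha(errors, alpha):.2f}")
print("Max:", errors.max())  # worst-case performance
```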

4.2. SYNTHETIC DATASET EXPERIMENTS

To better understand the long tail in error, we perform experiments on three synthetic datasets. The task is to forecast 8 steps ahead given a history of 8 time steps. We use autoregression (AR) and DeepAR Salinas et al. (2020) as models for this task. The top row in Figure 3 shows that, among the datasets, only Gaussian and Pareto exhibit a tail in the data distribution. The data distribution is available here only because the datasets were generated synthetically. On the Sine dataset, we observe long-tail error for DeepAR but not for AR. This is especially significant as there is no long tail in the data distribution. On the Gaussian and Pareto datasets, DeepAR leads to a heavier tail than AR, suggesting that the long tail in data also contributes to the long tail in error. The difference between the AR and DeepAR error distributions also invalidates the assumption made by Makansi et al. (2021): the prediction performance of a Kalman filter is not a good indicator of sample tailedness for deep neural networks. The complete results for the synthetic datasets are available in Appendix K.

4.3. REAL-WORLD EXPERIMENTS

Time Series Forecasting. We present average and tail metrics using ND and NRMSE for the time series forecasting task on the electricity and traffic datasets in Tables 1 and 3, respectively. All methods use DeepAR Salinas et al. (2020), one of the state-of-the-art models in probabilistic time series forecasting, as the base model. The task for both datasets is to use a 1-week history (168 hours) to forecast 1 day (24 hours) ahead at an hourly frequency. The base model exhibits long-tail behavior in error on both datasets (see Fig. 2). The tail of the error distribution is significantly longer for the traffic dataset than for the electricity dataset, as is evident from comparing the tail error values to the average error. The auxiliary loss used here is MAE, to correlate with L1 metrics like ND. DeepAR can have intrinsic variation on re-training, so the results in Table 1 are averaged over 3 runs.

Trajectory Forecasting. We present experimental results on the ETH-UCY and nuScenes datasets in Tables 2 and 4, respectively. Following Salzmann et al. (2020) and Makansi et al. (2021), we calculate model performance based on the best of 20 guesses. On both datasets, we compare with several long-tail baselines using Trajectron++EWTA Makansi et al. (2021) as the base model, due to its state-of-the-art average performance on these datasets. The auxiliary loss used here is MAE with MSE, to correlate with L2 metrics like ADE and FDE.

4.4. RESULTS ANALYSIS

Cross-task consistency. As shown in Tables 1, 3, 2, and 4, our proposed approaches, Kurtosis Loss and PLM, are the only methods improving on tail metrics across all tasks. Our methods typically deliver 10-15% improvement on tail metrics, and sometimes as much as 40% (see Appendix G). These are significant improvements with no sacrifice in average performance on any task; in fact, on some tasks our methods achieve better average performance as well. The generality of our methods is shown by their success on all studied tasks, which have different base models (DeepAR, Trajectron++EWTA), data representations (1D: time series, 2D: trajectory), base losses (Gaussian NLL for time series, EWTA for trajectory), and forecasting horizons. Our methods provide consistent improvement on tail metrics for all tasks. In comparison, Focal Loss performs well on the trajectory datasets but fails on the time series datasets. Contrastive Loss only performs well on the Traffic dataset. LDS and Shrinkage Loss do not reach the best results on any dataset and perform worse than the base model on the time series datasets. In Figure 4, we illustrate difficult examples, i.e., examples with large errors common across methods, for all real-world datasets to demonstrate the improvement in forecast quality for our methods.

Re-weighting vs. regularization. As mentioned in Section 3.2, we can categorize loss-modifying methods into two classes: re-weighting (Focal Loss, Shrinkage Loss, LDS, and PLW) and regularization (Contrastive Loss, PLM, and Kurtosis Loss). Re-weighting multiplies the loss of tail samples by higher weights; regularization adds higher regularization values for samples with higher loss. We notice that re-weighting methods perform worse as the long tail in error worsens. In scenarios with longer tails, the weights of tail samples can be very high, and overemphasizing tail examples might hamper the learning for other samples.
Notice the significantly worse average performance of Focal Loss on the traffic dataset in Table 3. Shrinkage Loss limits this issue by bounding the weights but fails to show tail improvements in longer-tail scenarios (the electricity and traffic datasets). Our proposed PLW is the best re-weighting method on most datasets, likely due to its bounded weights. In contrast, regularization methods are consistent across all tasks on both tail and average metrics. The additive nature of regularization limits the impact tail samples have on learning, which enables these methods to handle different severities of long tail without degrading the average performance.

Difficult examples. The difficulty here is a departure from historical behavior. This manifests as sudden increases or decreases in the 1D time series datasets and as high-velocity trajectories with sharp turns in the trajectory datasets. These samples represent critical events in real-world scenarios where the performance of the model is of utmost importance. Our methods perform significantly better on such samples.

Choosing between PLM and Kurtosis Loss

As discussed in Sec. 3.3, Kurtosis Loss places greater emphasis on the extreme samples in the tail. This shows in the results, with Kurtosis Loss performing better on VaR 99 and Max, and PLM performing better on VaR 95 and VaR 98. The choice between the methods depends on the objective: if the preference is to mitigate the most extreme samples, Kurtosis Loss is better; if the preference is to improve on the tail overall, PLM is better.

Tail error and long-term forecasting. Based on the trajectory prediction results in Tables 2 and 4, we observe that the error reduction for tail samples is more visible in FDE than in ADE. This indicates that the magnitude of the observed error increases with the forecasting horizon: the error compounds through prediction steps, making far-future predictions inherently more difficult. Larger improvements in FDE indicate that both Kurtosis and Pareto Loss reduce high tail errors stemming from large, far-future prediction errors measured by FDE. Accurate long-term forecasting is a central goal of deep probabilistic forecasting. As we can see in Fig. 5, the tail of the error distribution is more pronounced at longer horizons. Thus, methods addressing tail performance are necessary to ensure the practical applicability and reliability of future long-term prediction models.

5. CONCLUSION

We identify and address the problem of the long tail in the error distribution of deep probabilistic forecasting. We propose Pareto Loss (Margin and Weighted) and Kurtosis Loss, two novel moment-based loss augmentation approaches that adaptively increase the emphasis on tail samples. We demonstrate their practical effect on two spatiotemporal trajectory datasets and two time series datasets using different base models. Our methods achieve significant improvements on tail metrics over existing baselines without degrading average performance. Both proposed losses can be easily integrated with existing deep probabilistic forecasting approaches to improve their performance on tail metrics. Future directions include more principled ways to tune hyperparameters, extensions to deterministic time series forecasting models, and theoretical analysis of the source of the long-tail error. Based on our observations, we suggest evaluating tail metrics in addition to average performance in machine learning tasks to identify potential long-tail issues across different tasks and domains.

A DATASET DESCRIPTION

The ETH-UCY dataset consists of five subdatasets, each with bird's-eye views: ETH, Hotel, Univ, Zara1, and Zara2. As is common in the literature Makansi et al. (2021); Salzmann et al. (2020), we present macro-averaged 5-fold cross-validation results in our experiment section. The nuScenes dataset includes 1000 scenes of 20-second length with trajectories recorded in Boston and Singapore. The electricity dataset contains electricity consumption data for 370 homes over the period Jan 1st, 2011 to Dec 31st, 2014, at a sampling interval of 15 minutes. We use data from Jan 1st, 2011 to Aug 31st, 2011 for training and data from Sep 1st, 2011 to Sep 7th, 2011 for testing.

The synthetic datasets are generated as 100 different time series consisting of 960 time steps. Each time series in the Sine dataset is generated using a random offset θ and a random frequency ν, both selected from a uniform distribution U[0, 1]; the time series is then sin(2πνt + θ), where t is the index of the time step. The Gaussian and Pareto datasets are generated as order-1 lag autoregressive time series with randomly sampled Gaussian and Pareto noise, respectively. Gaussian noise is sampled from a Gaussian distribution with mean 1 and standard deviation 1. Pareto noise is sampled from a Pareto distribution with shape 10 and scale 1.
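The synthetic generation above can be sketched as follows. This is an illustrative reconstruction: the AR lag coefficient `phi` is an assumed value (the text does not state it), and NumPy's `pareto` draws the Lomax form, so 1 is added to obtain a classical Pareto with scale 1.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 960, 100  # 960 time steps, 100 series per dataset

def sine_series():
    theta, nu = rng.uniform(0.0, 1.0, size=2)  # offset and frequency ~ U[0, 1]
    t = np.arange(T)
    return np.sin(2.0 * np.pi * nu * t + theta)

def ar1_series(noise, phi=0.5):
    # Order-1 lag autoregressive recursion driven by the given noise;
    # phi is an assumed coefficient, not specified in the text.
    x = np.zeros(T)
    for i in range(1, T):
        x[i] = phi * x[i - 1] + noise[i]
    return x

sine = np.stack([sine_series() for _ in range(N)])
gaussian = np.stack([ar1_series(rng.normal(1.0, 1.0, T)) for _ in range(N)])
pareto = np.stack([ar1_series(rng.pareto(10.0, T) + 1.0) for _ in range(N)])
```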

B METHOD ADAPTATION

Time series forecasting. DeepAR uses the Gaussian Negative Log Likelihood as its loss, which is unbounded. Because of this, many baseline methods need to be adapted in order to be usable, and for the same reason we also need an auxiliary loss $\tilde{l}$. We use the MAE loss to fit the GPD, to calculate kurtosis, and to calculate the weight terms for Focal and Shrinkage Loss. For LDS, we treat all labels across time steps as part of a single distribution. Additionally, to avoid extremely high weights (O(10^8)) in LDS due to the nature of the long tail, we ensure a minimum probability of 0.001 for all labels.

Trajectory forecasting. We adapt Focal Loss and Shrinkage Loss to use the EWTA loss Makansi et al. (2019) in order to be compatible with the Trajectron++EWTA base model. LDS was originally proposed for a regression task, and we adapt it to the trajectory prediction task in the same way as for the time series task. We use MAE to fit the GPD, due to the evolving property of the EWTA loss.
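The LDS adaptation with a probability floor can be sketched as follows. This is a simplified illustration: the bin count and kernel width are assumed values; only the 0.001 floor comes from the text above.

```python
import numpy as np

def lds_weights(labels, bins=100, sigma=2.0, floor=1e-3):
    """LDS-style weights: inverse of a Gaussian-smoothed label density,
    with a probability floor (0.001 in our adaptation) so that rare
    labels cannot receive extreme weights."""
    hist, edges = np.histogram(labels, bins=bins)
    p = hist / hist.sum()
    # Gaussian kernel smoothing of the empirical label density.
    k = np.exp(-0.5 * (np.arange(-3, 4) / sigma) ** 2)
    k /= k.sum()
    p_smooth = np.maximum(np.convolve(p, k, mode="same"), floor)
    # Map each label back to its (smoothed, floored) bin probability.
    idx = np.clip(np.digitize(labels, edges[1:-1]), 0, bins - 1)
    return 1.0 / p_smooth[idx]
```

The floor caps the largest possible weight at 1/0.001 = 1000, which avoids the O(10^8) weights mentioned above.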

Time series forecasting. We use the DeepAR implementation from https://github.com/zhykoties/TimeSeries as the base code for all time series experiments; the original implementation is part of an AWS API and is not publicly available. The implementation of contrastive loss is taken directly from the source code of Makansi et al. (2021).

Trajectory forecasting. For the Trajectron++EWTA Makansi et al. (2021) base model, we use the implementation provided by the original authors. The implementation of contrastive loss is taken directly from the source code of Makansi et al. (2021). The experiments were conducted on a machine with 7 RTX 2080 Ti GPUs.

D HYPERPARAMETER TUNING

We observe during our experiments that the performance of Kurtosis Loss is highly dependent on the hyperparameter λ (see equation 5). Results for different values of λ on the electricity dataset for Kurtosis Loss are shown in Table 5. We also show the variation of ND and NRMSE with the hyperparameter value in Figure 6. We can see that there is an optimal value of the hyperparameter, and the approach performs worse at both higher and lower values. For the ETH-UCY and nuScenes datasets, we use λ = 0.1 for Kurtosis Loss and λ = 1 for PLM and PLW. For the electricity and traffic datasets, we use λ = 1 for PLM, λ = 0.5 for PLW, and λ = 0.01 for Kurtosis Loss.

$$\frac{1}{2} - m + \frac{1}{2}\ln(km) = \frac{1}{2}\ln\frac{km}{e^{2m-1}}$$

Since the numerator inside the logarithm is linear in m and the denominator is exponential in m, the minimum can be less than zero for suitable values of m. This shows that there can be pairs of samples with loss-metric inversion, meaning that regularization and reweighting values can be completely different from what is intended unless an auxiliary loss that preserves the order w.r.t. the evaluation metric is used. This lack of correlation is illustrated in Fig. 8 for the DeepAR model on the electricity dataset.
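The logarithmic identity used in this derivation follows by writing the linear term as a logarithm:

```latex
\begin{aligned}
\tfrac{1}{2} - m + \tfrac{1}{2}\ln(km)
  &= \tfrac{1}{2}\ln e^{\,1-2m} + \tfrac{1}{2}\ln(km) \\
  &= \tfrac{1}{2}\ln\!\left(km\,e^{\,1-2m}\right)
   = \tfrac{1}{2}\ln\frac{km}{e^{\,2m-1}} .
\end{aligned}
```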

G PERCENTAGE IMPROVEMENTS

We present percentage improvements over the base model for the different datasets.

H STANDARD DEVIATIONS

Due to space limitations, we were not able to report the standard deviation across the 3 runs for the electricity dataset in the main text. We present it in Table 11.

I TRAINING DETAILS

The training procedure employed for the Pareto Losses is as follows:

• Train the base model until convergence.
• Fit the Pareto distribution to the loss distribution of the trained model. This is done on the auxiliary loss if one is being used.
• Use the fitted Pareto distribution to implement PLM or PLW and retrain the model.
• The retrained model is the one employing PLM or PLW, as per choice.

The training process for Kurtosis Loss is straightforward: we use the loss function in equation 5 directly, with one round of training.
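The fitting step can be sketched with a method-of-moments estimate of the GPD parameters. This estimator is an assumption on our part (the text does not specify the fitting procedure; maximum-likelihood fits, e.g. via `scipy.stats.genpareto.fit`, are an alternative).

```python
import numpy as np

def fit_gpd_moments(aux_losses):
    """Method-of-moments estimate of the GPD shape (xi) and scale (eta)
    from a sample of auxiliary losses; valid when the fitted xi < 1/2.
    Derived from GPD mean eta/(1-xi) and variance eta^2/((1-xi)^2 (1-2 xi))."""
    m, v = aux_losses.mean(), aux_losses.var()
    xi = 0.5 * (1.0 - m * m / v)
    eta = 0.5 * m * (1.0 + m * m / v)
    return xi, eta

# Sanity check: exponential losses are a GPD with xi = 0, eta = 1,
# so the estimates should land near those values.
losses = np.random.default_rng(0).exponential(1.0, 200_000)
xi, eta = fit_gpd_moments(losses)
```

The fitted `(xi, eta)` pair is then plugged into the GPD pdf of equation 2 to define the PLM margin or PLW weights for retraining.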

J ROBUST STATISTICS METHODS

We ran robust regression methods on our tasks and found that the results do not show improvements on the long tail of error. The methods examined are Huber Loss and MSLE (mean squared logarithmic error).
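For reference, the two robust objectives can be sketched as follows (standard textbook forms; `delta = 1.0` is an assumed default):

```python
import numpy as np

def huber(err, delta=1.0):
    """Huber loss: quadratic for |err| <= delta, linear beyond."""
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def msle(y_true, y_pred):
    """Mean squared logarithmic error (for non-negative targets)."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
```

Both grow more slowly than squared error for large residuals, i.e., they damp rather than emphasize tail samples, which may explain why they do not improve tail errors here.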

K SYNTHETIC DATASETS

We present the complete results of our experiments on the synthetic datasets in Table 12. We also ran our methods, Kurtosis Loss and PLM, on these datasets. Both methods show significant tail improvements over the base model across all datasets.



Deep probabilistic forecasting. There is a flurry of work on probabilistic forecasting using deep neural networks. A common practice is to combine classic time series models with deep learning, resulting in DeepAR Salinas et al. (2020), Deep State Space Rangapuram et al. (2018), Deep Factors Wang et al. (2019), and the normalizing Kalman filter de Bézenac et al. (2020). Others introduce normalizing flows Rasul et al. (2021b), denoising diffusion Rasul et al. (2021a), and particle filters Pal et al. (2021) to deep learning. For probabilistic trajectory forecasting, a few recent works propose to approximate the conditional distribution of future trajectories given the past with explicit parameterization Tang & Salakhutdinov (2019); Luo et al. (2020), CVAEs Sohn et al. (2015); Lee et al. (2017); Salzmann et al. (2020), or implicit models such as GANs Gupta et al. (2018); Liu et al. (2019a). Nevertheless, most existing works focus on average performance; the issue of a long tail in the error distribution is largely overlooked in the community.

Long-tailed learning. The main efforts to address the long tail in learning revolve around reweighing, resampling, loss function engineering, and two-stage training, but mainly for classification. Rebalancing during training is done in the form of synthetic minority oversampling Chawla et al. (2002), oversampling with adversarial examples Kozerawski et al. (2020), inverse class frequency balancing Liu et al. (2019b), balancing using the effective number of samples Cui et al. (2019), or balance-oriented mixup augmentation Xu et al. (2021). Another direction involves post-processing, either in the form of normalized calibration Pan et al. (2021) or logit adjustment Menon et al. (2020). An important direction is loss modification approaches such as Focal Loss Lin et al. (2017), Shrinkage Loss Lu et al. (2018), and Balanced Meta-Softmax Ren et al. (2020). Others use two-stage training Liu et al. (2019b); Cao et al. (2019) or separate expert networks Zhou et al. (2020); Li et al. (2020); Wang et al. (2021). We refer the readers to Zhang et al. (2021) for an extensive survey. Tang et al. (2020) indicated that SGD momentum can contribute to the aggravation of the long-tail problem and suggested de-confounded training to mitigate its effects. Feldman (2020); Feldman & Zhang (2020) performed theoretical analyses and suggested label memorization in a long-tail distribution as a necessity for the network to generalize.

Figure 2: Log-log error distribution plots. Time series datasets (left half) use DeepAR, trajectory datasets (right half) use Traj++EWTA. This clearly illustrates the long tail in error distribution.



Figure 3: Top Row: Ground truth distribution for synthetic datasets. Middle Row: ND error distribution using AR. Bottom Row : ND error distribution using DeepAR. Datasets (L to R): Sine, Gaussian, Pareto. Note: the x-axes for plots in the same column or y-axes for plots in the same row are not for the same range of values.

Figure 4: Visualization of overlapping tail samples for Electricity (top row, left half), Traffic (top row, right half), ETH-UCY (bottom row, left half), and nuScenes (bottom row, right half) datasets. The shaded region represents the confidence interval of the prediction.

Figure 5: Distribution of the top 5% error values (FDE) for different horizons for the ETH-UCY (Zara1) dataset. Predictions obtained using Trajectron++EWTA. The trend shows that the long tail in error worsens as the forecasting horizon increases, due to compounding errors.

1st, 2011 to Aug 31st, 2011 for training and data from Sep 1st, 2011 to Sep 7th, 2011 for testing. The traffic dataset consists of occupancy values recorded by 963 sensors at a sampling interval of 10 minutes ranging from Jan 1st, 2008 to Mar 30th, 2009. We use data from Jan 1st, 2008 to Jun 15th, 2008 for training and data from Jun 16th, 2008 to Jul 15th, 2008 for testing. Both time series datasets are downsampled to 1 hour for generating examples.
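The hourly downsampling described above can be sketched with pandas; the mean aggregation and the synthetic stand-in readings below are our assumptions, not necessarily the paper's exact preprocessing.

```python
import numpy as np
import pandas as pd

# 10-minute occupancy readings for one sensor (synthetic stand-in data)
idx = pd.date_range("2008-01-01", periods=12, freq="10min")
readings = pd.Series(np.linspace(0.0, 1.1, 12), index=idx)

# Downsample to 1-hour resolution by averaging the six 10-minute readings per hour
hourly = readings.resample("60min").mean()
```

Each hourly value aggregates six consecutive 10-minute readings, reducing the two hours of raw data above to two examples at the 1-hour resolution used for training.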

Figure 6: Left: Variation of ND with the Kurtosis Loss hyperparameter. Right: Variation of NRMSE with the Kurtosis Loss hyperparameter.

Figure 7: Left: Generalized Pareto distributions with different shape parameters (η = 1). Right: Illustrating the variation of kurtosis on distributions with the same mean.

Figure 7 illustrates GPDs for different shape parameter values. A higher shape value models more severe tail behavior.


Figure 8: Comparing GaussianNLL loss to the Normalized Deviation metric for DeepAR on the electricity dataset. A large number of samples have high GaussianNLL but low ND, and vice versa. This illustrates the need for an auxiliary loss to place the correct emphasis on samples.

• Label Distribution Smoothing (LDS): Yang et al. (2021) use a symmetric kernel to smooth the label distribution and use its inverse to reweigh the loss terms.
• Shrinkage Loss: Lu et al. (2018) use a sigmoid-based function to reweigh loss terms, deprioritizing lower loss values.
• Focal Loss: Lin et al. (2017) use the L1 loss to reweigh the loss terms; the additional power of the loss term increases the steepness of the loss function.
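The two loss-based reweighting baselines above can be sketched as follows. The hyperparameter values and the exact regression adaptations are our assumptions (the original Shrinkage and Focal losses were proposed for tracking and detection, respectively), so this is an illustration of the reweighting idea rather than the baselines' official implementations.

```python
import numpy as np

def shrinkage_weights(losses, a=10.0, c=0.2):
    """Shrinkage-style weights: a sigmoid that down-weights losses below threshold c."""
    return 1.0 / (1.0 + np.exp(a * (c - losses)))

def focal_weights(losses, gamma=1.0):
    """Focal-style weights for regression: scale each term by a power of its own loss."""
    return losses ** gamma

losses = np.array([0.05, 0.5, 5.0])
shrunk = shrinkage_weights(losses) * losses
focal = focal_weights(losses) * losses
```

Both schemes leave large losses nearly intact while suppressing small ones, which is how reweighting shifts the training emphasis toward tail samples.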

Performance on the Electricity Dataset (ND/NRMSE/CRPS). All our methods improve on the average as well as the tail metrics. Baseline methods are worse on average and inconsistent on the tail. All methods use DeepAR as the base model. Results indicated as Top 3 and Best. All results are averaged across 3 runs with different seeds; standard deviations are available in Appendix H.



Performance on the Traffic Dataset (ND/NRMSE/CRPS). PLM (Ours) delivers the best overall results, improving on both average and tail metrics. Among baseline methods, contrastive loss is the most consistent. Regularization methods generally fare better than reweighting methods due to the very long tail. All methods use DeepAR as the base model. Results indicated as Top 3 and Best.



Kurtosis Loss performs better on the extreme tail metrics, VaR 99 and Max, since higher kurtosis puts more emphasis on extreme samples in the tail. It is also important to note that the magnitude of kurtosis varies significantly across distributions, making the choice of the hyperparameter λ (see Equation 5) critical. Further analysis is available in Appendix D.
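The claim that kurtosis magnitude varies significantly across distributions is easy to verify numerically; the sketch below is ours and uses synthetic samples rather than the paper's error distributions.

```python
import numpy as np

def sample_kurtosis(x):
    """Fourth standardized moment of a sample."""
    mu, var = x.mean(), x.var()
    return ((x - mu) ** 4).mean() / var ** 2

rng = np.random.default_rng(0)
gauss = rng.normal(size=100_000)             # kurtosis close to 3
heavy = rng.standard_t(df=3, size=100_000)   # heavy-tailed: far larger kurtosis
```

Because the heavy-tailed sample's kurtosis can be an order of magnitude larger than the Gaussian's, a λ tuned for one error distribution can badly over- or under-weight the kurtosis term on another, which is why the hyperparameter choice is critical.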

Electricity Dataset evaluation for the base model (ND/NRMSE) and different Kurtosis Loss hyperparameters. The value of λ is denoted in brackets with the method name. The base model is DeepAR. Results indicated as Better than base model and Best.

Percentage improvements over the base method (DeepAR) on Electricity Dataset (ND/NRMSE). Results indicated as error reduction and increase in %.

Percentage improvements over the base method (DeepAR) on Traffic Dataset (ND/NRMSE). Results indicated as error reduction and increase in %.

Percentage improvements over the base method (Trajectron++EWTA) on ETH-UCY Dataset (ADE/FDE). Results indicated as error reduction and increase in %.

Percentage improvements over the base method (Trajectron++EWTA) on nuScenes Dataset (ADE/FDE). Results indicated as error reduction and increase in %.

Std deviation of results for Electricity Dataset (ND/NRMSE/CRPS). All results have been computed across 3 runs with different seeds. Results corresponding to Table 1.

Results for robust-statistics losses on the Electricity dataset. Results indicated as Best. Huber Loss and MSLE both fail to provide any meaningful improvement over the base model. Moreover, their performance on CRPS is significantly worse, illustrating their poor fit for the task.

Performance on the Synthetic Datasets (ND/NRMSE). Results indicated as Better than DeepAR and Best for each dataset.

REPRODUCIBILITY STATEMENT

The datasets used in the paper are cited and the preprocessing is described in Appendix A. We have released the code to run experiments on both time series and trajectory datasets in the supplementary material. Both folders include a step-by-step README file that guides through the process of running our methods and baselines. Hyperparameter values to be used are provided in the appendix.

F AUXILIARY LOSS

In this section, we present the mathematical intuition behind the use of an auxiliary loss in our methods. We examine a setting where the base loss for a probabilistic model is the GaussianNLL loss and the evaluation metric is MSE. For simplicity, we assume 1-step prediction on 1D data; the analysis extends easily to multi-step prediction and multi-dimensional data. Consider two training samples with past observations and 1-step prediction ground truths y^(1) and y^(2). We drop the subscript t+1 from the notation for simplicity and clarity, as there is only a one-step prediction. Since the maximum likelihood prediction for a Gaussian is its mean, the MSE is calculated using the predicted mean.

MSE^(i) = (y^(i) - μ^(i))^2, where μ^(i) is the predicted mean for sample i.

The GaussianNLL loss is calculated as the negative log likelihood of the ground truth under the predicted distribution. Simplifying the expression gives, for sample i,

NLL^(i) = (1/2) log(2π σ^(i)2) + (y^(i) - μ^(i))^2 / (2 σ^(i)2).

We want to determine the conditions under which the GaussianNLL loss is higher for sample 1 than for sample 2 while the MSE for sample 2 is higher than for sample 1, or vice versa. We call this a loss-metric inversion. Consider the scenario where MSE^(1) > MSE^(2); this can be expressed as MSE^(1) = MSE^(2) + c, where c > 0. The corresponding condition for an inversion is then NLL^(1) < NLL^(2). For simplicity, let us denote σ^(1) and σ^(2) by σ1 and σ2.
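A loss-metric inversion is easy to exhibit numerically. The two samples below use our own illustrative values, not numbers from the paper: the first prediction has a larger error but an honest, wide variance, while the second has a smaller error but is overconfident.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def mse(y, mu):
    return (y - mu) ** 2

# Sample 1: large error but high predicted variance -> low NLL
nll1, mse1 = gaussian_nll(0.0, 1.0, 2.0), mse(0.0, 1.0)
# Sample 2: smaller error but overconfident (tiny variance) -> high NLL
nll2, mse2 = gaussian_nll(0.0, 0.5, 0.1), mse(0.0, 0.5)
# mse1 > mse2, yet nll1 < nll2: a loss-metric inversion
```

Here the base NLL loss would focus training on sample 2 even though sample 1 is the one the MSE-style metric penalizes, which motivates an auxiliary loss that tracks the evaluation metric.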

