TAMING THE LONG TAIL OF DEEP PROBABILISTIC FORECASTING

Abstract

Deep probabilistic forecasting is gaining attention in numerous applications, from weather prognosis through electricity consumption estimation to autonomous vehicle trajectory prediction. However, existing approaches focus on improving average metrics without addressing the long-tailed distribution of errors. In this work, we observe long-tail behavior in the error distribution of state-of-the-art deep learning methods for probabilistic forecasting. We present two loss augmentation methods to reduce tailedness: Pareto Loss and Kurtosis Loss. Both methods are related to the concept of moments, which measure the shape of a distribution. Kurtosis Loss is based on a symmetric measure, the fourth moment. Pareto Loss is based on an asymmetric measure of right tailedness and models the loss using a Generalized Pareto Distribution (GPD). We demonstrate the performance of our methods on several real-world datasets, including time series and spatiotemporal trajectories, achieving significant improvements on tail error metrics while maintaining, and even improving upon, average error metrics.



Despite encouraging progress, we observe that the forecasting error of deep learning models exhibits long-tail behavior: a significant number of samples are very difficult to forecast, with errors much larger than the average. Figure 1 visualizes an example of this long-tail behavior for a motion forecasting task. Existing works often measure forecasting performance by averaging across test samples. However, average performance measured by metrics such as root mean square error (RMSE) or mean absolute error (MAE) can be misleading. A low RMSE or MAE may indicate good average performance, but it does not prevent the model from behaving disastrously in critical scenarios.

From a practical perspective, the long-tail behavior in forecasting error is alarming. In motion forecasting, the tail could correspond to crucial driving events, such as turning maneuvers and sudden stops. Failure to forecast accurately in these scenarios would pose serious safety risks in route planning. In electricity forecasting, high errors could occur during short circuits, power outages, grid failures, or sudden behavior changes. Focusing solely on average performance would ignore these electric load anomalies, significantly increasing maintenance and operational costs.

Long-tailed learning is heavily studied in classification settings, with a focus on class imbalance. There is also a rich literature on heavy-tailed time series Kulik & Soulier (2020). However, long tail there usually refers to the distribution of the data, not the distribution of the error. We refer the reader to Table 2 in Zhang et al. (2021). These approaches implicitly assume a strong correspondence between data and error.
Hence, they are not directly applicable to forecasting, as we have neither pre-defined classes nor the prediction error before training. Makansi et al. (2021) observed similar long-tail errors in trajectory forecasting and proposed to use Kalman filter prediction performance to measure sample difficulty. However, the Kalman filter is a different model class, and its difficulties do not translate to the deep neural networks used for forecasting.

In this paper, we address the long-tail behavior in prediction error for deep probabilistic forecasting. We present two loss augmentation methods: Pareto Loss and Kurtosis Loss. Kurtosis Loss is based on a symmetric measure of tailedness: the scaled fourth moment of a distribution. Pareto Loss uses the Generalized Pareto Distribution (GPD) to fit the long-tailed error distribution. The GPD can be described as a weighted summation of shifted moments, which is an asymmetric measure of tailedness. We investigate these measures as loss regularization and reweighting approaches for probabilistic forecasting tasks. We achieve significantly improved tail performance compared to the base model and baselines. Interestingly, we also observe better average performance in most settings.

In summary, our contributions are:
• We identify long-tail behavior in forecasting error for deep probabilistic models.
• We investigate principled approaches to address this long-tail behavior and propose two novel methods: Pareto Loss and Kurtosis Loss.
• We significantly improve tail errors on four real-world forecasting tasks, including two time series and two spatiotemporal trajectory forecasting datasets.
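The two loss augmentations can be sketched as follows. This is a minimal illustration rather than the paper's exact formulation: the function names, the fixed GPD parameters `xi` and `sigma`, and the particular reweighting scheme are our own assumptions for exposition.

```python
import numpy as np

def kurtosis_loss(errors, lam=1.0):
    """Sketch of a kurtosis-style regularizer: augment the mean error
    with the sample kurtosis (fourth central moment scaled by the
    squared variance), penalizing heavy-tailed error distributions."""
    mu = errors.mean()
    var = errors.var() + 1e-8  # small epsilon for numerical stability
    kurt = ((errors - mu) ** 4).mean() / var ** 2
    return mu + lam * kurt

def gpd_weights(errors, xi=0.1, sigma=1.0):
    """Sketch of GPD-based reweighting: map each sample's error through
    the GPD CDF (shape xi != 0, scale sigma; both illustrative here),
    so tail samples receive weights approaching 2 and easy samples
    weights near 1."""
    z = np.maximum(errors, 0.0) / sigma
    cdf = 1.0 - (1.0 + xi * z) ** (-1.0 / xi)  # GPD CDF for xi != 0
    return 1.0 + cdf

def pareto_weighted_loss(errors):
    """Reweight per-sample errors by their GPD tail weight."""
    return (gpd_weights(errors) * errors).mean()
```

With `lam = 0` (or constant weights of 1) both reduce to the ordinary mean error, so the augmentations can be annealed in or out during training.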



Forecasting is one of the most fundamental problems in time series and spatiotemporal data analysis, with broad applications in energy, finance, and transportation. Deep learning models Li et al. (2019); Salinas et al. (2020); Rasul et al. (2021a) have emerged as state-of-the-art approaches for forecasting rich time series and spatiotemporal data with uncertainty. In several forecasting competitions, such as the M5 forecasting competition Makridakis et al. (2020), the Argoverse motion forecasting challenge Chang et al. (2019), and the IARAI Traffic4cast contest Kreil et al. (2020), almost all of the winning solutions are based on deep neural networks.

Figure 1: Log-log error distribution plot for trajectory prediction on the ETH-UCY dataset using the state-of-the-art model Traj++EWTA. We see a long tail in error, up to two orders of magnitude higher than the average. Also shown is a tail sample with predictions from our method (teal) and Traj++EWTA (purple).
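A log-log error distribution plot like the one in Figure 1 can be produced by binning per-sample errors into logarithmically spaced bins and plotting counts against bin centers on log-log axes. The sketch below is a minimal illustration; the function name and bin count are our own choices.

```python
import numpy as np

def log_log_error_hist(errors, n_bins=30):
    """Bin positive per-sample errors into logarithmically spaced bins.
    Plotting counts against the geometric bin centers on log-log axes
    exposes how far the tail extends beyond the average error."""
    errors = errors[errors > 0]
    bins = np.logspace(np.log10(errors.min()),
                       np.log10(errors.max()), n_bins + 1)
    counts, edges = np.histogram(errors, bins=bins)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
    return centers, counts
```

On such axes, a roughly linear decay of counts over several decades of error is the visual signature of a heavy (power-law-like) tail.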

We refer the reader to Menon et al. (2020) and the survey paper Zhang et al. (2021) for a complete review. The most common approaches to address a long-tailed data distribution include post-hoc normalization Pan et al. (2021), data resampling Chawla et al. (2002); Torgo et al. (2013), loss engineering Lin et al. (2017); Lu et al. (2018), and learning class-agnostic representations Tiong et al. (2021).

Deep probabilistic forecasting. There is a flurry of work on probabilistic forecasting using deep neural networks. A common practice is to combine classic time series models with deep learning, resulting in DeepAR Salinas et al. (2020), Deep State Space Rangapuram et al. (2018), Deep Factors Wang et al. (2019), and the normalizing Kalman filter de Bézenac et al. (2020). Others introduce normalizing flows Rasul et al. (2021b), denoising diffusion Rasul et al. (2021a), and particle filters Pal et al. (2021) to deep learning. For probabilistic trajectory forecasting, a few recent works propose to approximate the conditional distribution of future trajectories given the past with explicit parameterization Tang & Salakhutdinov (2019); Luo et al. (2020), CVAEs Sohn et al. (2015); Lee et al. (2017); Salzmann et al. (2020), or implicit models such as GANs Gupta et al. (2018); Liu et al. (2019a). Nevertheless, most existing works focus on average performance; the long tail of the error distribution is largely overlooked in the community.

Long-tailed learning. The main efforts to address long-tail error in learning revolve around reweighting, resampling, loss function engineering, and two-stage training, but mainly for classification. Rebalancing during training is done in the form of synthetic minority oversampling Chawla et al. (2002), oversampling with adversarial examples Kozerawski et al. (2020), inverse class frequency balancing Liu et al. (2019b), balancing using the effective number of samples Cui et al. (2019), or balance-oriented mixup augmentation Xu et al. (2021). Another direction involves post-processing, either in the form of normalized calibration Pan et al. (2021) or logit adjustment Menon et al. (2020). An important direction is loss modification approaches such as Focal Loss Lin et al. (2017), Shrinkage Loss Lu et al. (2018), and Balanced Meta-Softmax Ren et al. (2020). Others use two-stage training Liu et al. (2019b); Cao et al. (2019) or separate expert networks Zhou et al. (2020); Li et al. (2020); Wang et al. (2021). We refer the readers to Zhang et al. (2021) for an extensive survey. Tang et al. (2020) indicated that SGD momentum can aggravate the long-tail problem and suggested de-confounded training to mitigate its effects. Feldman (2020); Feldman & Zhang (2020) performed theoretical analysis and suggested that label memorization under a long-tailed distribution is necessary for the network to generalize.
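As an illustration of the loss-modification family, Focal Loss Lin et al. (2017) down-weights well-classified examples by a modulating factor. Below is a minimal sketch of its binary form, with `gamma` denoting the standard focusing parameter; `p_true` is the probability the model assigns to the true class.

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Focal Loss for the probability p_true assigned to the true class:
    -(1 - p_true)^gamma * log(p_true). With gamma > 0, confident (easy)
    examples contribute far less than hard ones, shifting training
    focus toward the tail of difficult samples."""
    p_true = np.clip(p_true, 1e-12, 1.0)  # guard against log(0)
    return -((1.0 - p_true) ** gamma) * np.log(p_true)
```

With `gamma = 0` this reduces to standard cross-entropy; larger `gamma` suppresses easy examples more aggressively.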

