DATEFORMER: TRANSFORMER EXTENDS LOOK-BACK HORIZON TO PREDICT LONGER-TERM TIME SERIES

Abstract

Transformers have demonstrated impressive strength in long-term series forecasting. Existing prediction research has mostly focused on mapping a short past sub-series (the lookback window) to the future series (the forecast window). Once training is completed, the longer time series in the training set are discarded, so models can rely only on lookback-window information for inference, which prevents them from analyzing time series from a global perspective. Moreover, the windows used by Transformers are quite narrow because they must model each time-step therein: under this point-wise processing style, broadening the windows rapidly exhausts model capacity. For fine-grained time series, this leads to a bottleneck in information input and prediction output, which is fatal to long-term series forecasting. To overcome this barrier, we propose a brand-new methodology for utilizing Transformers in time series forecasting. Specifically, we split time series into patches by day and reform point-wise into patch-wise processing, which considerably enhances the information input and output of Transformers. To further help models leverage the whole training set's global information during inference, we distill that information, store it in time representations, and replace the series with these time representations as the main modeling entities. Our time-modeling Transformer, Dateformer, yields state-of-the-art accuracy on 7 real-world datasets with a 33.6% relative improvement and extends the maximum forecast range to half a year.¹

1. INTRODUCTION

Time series forecasting is a critical demand across many application domains, such as energy consumption, economic planning, and traffic and weather prediction. The task can be roughly summed up as predicting future time series by observing their past values. In this paper, we study long-term forecasting, which involves a longer-range forecast horizon than regular time series forecasting. Logically, historical observations are always available. But most models (including various Transformers) infer the future by analyzing only the part of the past sub-series closest to the present; longer historical series are merely used to train the model. For short-term forecasting, which is more concerned with a series' local (i.e., short-term) patterns, the information carried by the closest sub-series is enough. But not for long-term forecasting, which requires models to grasp a time series' global patterns: overall trend, long-term seasonality, etc. Methods that observe only the recent sub-series cannot accurately distinguish the two kinds of pattern and hence produce sub-optimal predictions (see Figure 1a: models observe an obvious upward trend in the zoomed-in window, but zooming out we see that it is a yearly seasonality, along with a slight overall upward trend across the two years of power-load series). However, it is impracticable to thoughtlessly input the entire training-set series as the lookback window: not only can no existing model tackle such a lengthy series, but learning dependencies from it is also hard. Thus, we ask: how can models inexpensively use the global information in the training set during inference? In addition, the throughput of Transformers (Zhou et al., 2022; Liu et al., 2021; Wu et al., 2021; Zhou et al., 2021; Kitaev et al., 2020; Vaswani et al., 2017), which show the best performance in long-term forecasting, is relatively limited, especially for fine-grained time series (e.g., recorded every 15 minutes, half-hour, or hour).
Given a common time series recorded every 15 minutes (96 time-steps per day) and 24 GB of memory, most of them fail to predict the next month from the past 3 months of series, even though they have struggled to reduce self-attention's computational complexity. They still cannot afford series of such length and thus cut down the lookback window to trade for a flimsy prediction. If they are requested to predict 3 months, how should they respond? Such demands are quite frequent and important in many application fields. For fine-grained series, we argue, extending the forecast horizon by making self-attention more efficient has reached a bottleneck. So, beyond modifying self-attention, how can we enhance the time series information input and output of Transformers? We study this second question first. Prior works process time series in a point-wise style: each time-step in a time series is modeled individually. For Transformers, each time-step value is mapped to a token and then processed. This style wears out models that endeavor to predict fine-grained time series over the longer term, yet it has never been questioned. In fact, similar to images (He et al., 2022), many time series are natural signals with temporal information redundancy: e.g., a missing time-step value can be interpolated from neighboring time-steps. The finer a time series' granularity, the higher its redundancy and the more accurate the interpolation. Therefore, the point-wise style is information-inefficient and wasteful. To improve the information density of fine-grained time series, we split them into patches and reform the point-wise into patch-wise processing, which considerably reduces the number of tokens and enhances information efficiency. To maintain equivalent token information across time series with different granularities, we fix the patch size to one day. We choose the day as the patch size because it is moderate, close to people's habits, and convenient for modeling.
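To make the token-count arithmetic concrete, here is a minimal sketch of day-wise patching for a 15-minute series (the function name `to_day_patches` is ours, not from the paper): 3 months of data shrink from 8640 point-wise tokens to 90 day-patch tokens.

```python
import numpy as np

STEPS_PER_DAY = 96  # 15-minute granularity: 4 steps/hour * 24 hours

def to_day_patches(series: np.ndarray) -> np.ndarray:
    """Reshape a (T,) fine-grained series into (T // STEPS_PER_DAY, STEPS_PER_DAY)
    day patches; each row can become one Transformer token instead of 96."""
    n_days = len(series) // STEPS_PER_DAY
    return series[: n_days * STEPS_PER_DAY].reshape(n_days, STEPS_PER_DAY)

# 3 months of 15-minute data: 90 days * 96 steps = 8640 time-steps
series = np.random.randn(90 * STEPS_PER_DAY)
patches = to_day_patches(series)

print(series.shape)   # 8640 tokens under point-wise processing
print(patches.shape)  # only 90 tokens under patch-wise processing
```

A lookback window of the same memory budget can therefore cover roughly two orders of magnitude more history at this granularity.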
Other patch sizes are also practicable; we discuss the details in Appendix G. Nevertheless, splitting time series into patches is not a silver bullet for the first question. Even if we do so, the patch series of the whole training set is still too lengthy. And since we only want the global information therein, modeling the whole series for that purpose alone is not a good deal. Time is one of the most important properties of time series, and plenty of series characteristics are determined or affected by it. Can time be used as a container to store a time series' global information? For the whole historical series, time is a general feature that is very appropriate for modeling the persistent temporal patterns therein. Driven by this, we try to distill the training set's global information into time representations and further substitute these time representations for the time series as the main modeling entities. But how should time be represented? In Section 2, we provide a superior method to represent time. In this work, we challenge ourselves to predict long-term series using merely vanilla Transformers. Based on vanilla Transformers, we design a brand-new forecasting framework named Dateformer. It regards the day as the atomic unit of time series, which remarkably reduces the number of Transformer tokens and hence improves series information input and output, particularly for fine-grained time series. This also benefits autoregressive prediction: less error accumulation and faster inference. Besides, to better tap the training set's global information, we distill it, store it in the container of time representations, and take these time representations as the main modeling entities for Transformers. Dateformer achieves state-of-the-art performance on 7 benchmark datasets. Our main contributions are summarized as follows:

• We analyze the information characteristics of time series and propose splitting time series into patches. This considerably reduces the number of tokens and improves series information input and output, thereby enabling vanilla Transformers to tackle long-term series forecasting problems.

• To better tap the training set's global information, we use time representations as containers to distill it and take time as the main modeling object. Accordingly, we design the first time-modeling time series forecasting framework exploiting vanilla Transformers: Dateformer.

• As preliminary work, we also provide a superior time-representing method to support the time-modeling strategy; please see Section 2 for details.
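As a rough intuition for the time-modeling idea (the paper's actual time-representing method is in Section 2; the function `date_features` below is our illustrative assumption, not the paper's), a calendar date can be encoded into periodic features that a learnable network could then map to a time representation. Note how Jan 2, 2013 shares its yearly phase exactly with Jan 2, 2012, matching the observation in Figure 1b that a day can be semantically closer to the same date a year earlier than to a week earlier.

```python
import math
from datetime import date

def date_features(d: date) -> list:
    """Encode a calendar date as periodic (sin, cos) features for its
    day-of-week and day-of-year phases; a learnable network could map
    such raw features to a time representation."""
    dow = d.weekday() / 7.0                      # day-of-week phase in [0, 1)
    doy = (d.timetuple().tm_yday - 1) / 365.25   # day-of-year phase in [0, 1)
    feats = []
    for frac in (dow, doy):
        feats += [math.sin(2 * math.pi * frac), math.cos(2 * math.pi * frac)]
    return feats

# Same calendar date one year apart: identical yearly phase,
# different weekday phase (Monday in 2012, Wednesday in 2013).
print(date_features(date(2012, 1, 2)))
print(date_features(date(2013, 1, 2)))
```

Such hand-crafted encodings are only a starting point; the appeal of learned time representations is that persistent patterns (holidays, seasonal regimes) distilled from the whole training set can be stored in them and reused at inference time.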



¹ Code will be released soon. Related work is discussed in Appendix B.



Figure 1: (a) depicts the two-year power load of an area. (b) illustrates the area's full-day load on Jan 2, 2012, together with its loads a week earlier, a week later, and a year earlier: compared to a week earlier or later, the power-load series of Jan 2, 2012 is closer to that of a year earlier, which indicates that the day's time semantics has shifted closer to a year earlier and further from a week earlier or later.

