DATEFORMER: TRANSFORMER EXTENDS LOOK-BACK HORIZON TO PREDICT LONGER-TERM TIME SERIES

Abstract

Transformers have demonstrated impressive strength in long-term series forecasting. Existing prediction research has mostly focused on mapping a short past sub-series (the lookback window) to the future series (the forecast window). The longer time series in the training set is discarded once training is completed, so during inference models can rely only on the lookback window, which prevents them from analyzing time series from a global perspective. Moreover, the windows used by Transformers are quite narrow because every time-step within them must be modeled. Under this point-wise processing style, broadening the windows rapidly exhausts model capacity, which, for fine-grained time series, creates a bottleneck in information input and prediction output that is fatal to long-term series forecasting. To overcome this barrier, we propose a brand-new methodology for applying Transformers to time series forecasting. Specifically, we split time series into patches by day and switch from point-wise to patch-wise processing, which considerably enhances the information input and output of Transformers. To further help models leverage the whole training set's global information during inference, we distill that information, store it in time representations, and replace the series with these time representations as the main modeling entities. Our time-modeling Transformer, Dateformer, yields state-of-the-art accuracy on 7 real-world datasets with a 33.6% relative improvement and extends the maximum forecast range to half a year.

1. INTRODUCTION

Time series forecasting is a critical demand across many application domains, such as energy consumption, economic planning, traffic, and weather prediction. The task can be roughly summed up as predicting future time series by observing their past values. In this paper, we study long-term forecasting, which involves a longer forecast horizon than regular time series forecasting. Logically, historical observations are always available, but most models (including various Transformers) infer the future by analyzing only the part of the past sub-series closest to the present; longer historical series is merely used to train the model. For short-term forecasting, which concerns a series' local (or short-term) pattern, the information carried by the closest sub-series is enough. But not for long-term forecasting, which requires models to grasp a time series' global pattern: overall trend, long-term seasonality, etc. Methods that only observe the recent sub-series cannot accurately distinguish the 2 patterns and hence produce sub-optimal predictions (see Figure 1a: models observe an obvious upward trend in the zoomed-in window, but zooming out, we know that is a yearly seasonality, with only a slight overall upward trend across the 2 years of power load series). However, it is impracticable to thoughtlessly input the entire training set series as the lookback window: not only can no model yet tackle such a lengthy series, but learning dependence from it is also tough. Thus, we ask: how can models inexpensively use the global information in the training set during inference? In addition, the throughput of Transformers (Zhou et al., 2022; Liu et al., 2021; Wu et al., 2021; Zhou et al., 2021; Kitaev et al., 2020; Vaswani et al., 2017), which show the best performance in long-term forecasting, is relatively limited, especially for fine-grained time series (e.g., recorded every 15 minutes, half-hour, or hour).
Given a common time series recorded every 15 minutes (96 time-steps per day), with 24GB of memory they mostly fail to predict the next month from 3 past months of series, even though they have struggled to reduce self-attention's computational complexity. They still cannot afford series of such length and thus cut down the lookback window, trading it off for a flimsy prediction. If they were asked to predict 3 months, how would they respond? Such demands are frequent and important in many application fields. For fine-grained series, we argue, extending the forecast horizon by making self-attention more efficient has reached its bottleneck. So, beyond modifying self-attention, how can we enhance the time series information input and output of Transformers? (Figure 1b illustrates the area's full-day load on Jan 2, 2012, together with the loads a week before, a week after, and a year before: the power load series of Jan 2, 2012 is closer to that of a year before than to a week before or after, indicating the day's time semantics has shifted closer to a year ago and further from the neighboring weeks.) We study the second question first. Prior works process time series in a point-wise style: each time-step in the series is modeled individually. For Transformers, each time-step value is mapped to a token and then computed over. This style wears out models that try to predict fine-grained time series over the longer term, yet it has never been questioned. In fact, similar to images (He et al., 2022), many time series are natural signals with temporal information redundancy: e.g., a missing time-step value can be interpolated from neighboring time-steps. The finer a time series' granularity, the higher its redundancy and the more accurate such interpolations. The point-wise style is therefore information-inefficient and wasteful.
To improve the information density of fine-grained time series, we split them into patches and reform point-wise into patch-wise processing, which considerably reduces the number of tokens and enhances information efficiency. To keep token information equivalent across time series of different granularity, we fix the patch size to one day. We choose the day as the patch size because it is moderate, close to people's habits, and convenient for modeling; other patch sizes are also practicable, as discussed in Appendix G. Nevertheless, splitting time series into patches is not a silver bullet for the first question: even after doing so, the whole training set's patch series is still too long, and since we only want the global information therein, modeling the whole series is a poor bargain. Time is one of the most important properties of time series, and plenty of series characteristics are determined or affected by it. Can time be used as a container to store a time series' global information? Over the whole historical series, time is a general feature that is very appropriate for modeling persistent temporal patterns. Driven by this, we distill the training set's global information into time representations and substitute these time representations for the series as the main modeling entities. But how should time be represented? In Section 2, we provide a superior method to represent time. In this work, we take up the challenge of predicting long-term series with merely vanilla Transformers. Building on vanilla Transformers, we design a brand-new forecast framework named Dateformer that regards the day as the atomic unit of a time series, which remarkably reduces Transformer tokens and hence improves series information input and output, particularly for fine-grained time series. This also benefits autoregressive prediction: less error accumulation and faster inference.
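The day-patching idea above can be sketched in a few lines. This is an illustrative example (not the paper's released code), assuming a 15-minute series with 96 steps per day:

```python
import numpy as np

def split_into_day_patches(series, steps_per_day=96):
    """Split a fine-grained series into day-sized patches.

    A series recorded every 15 minutes has 96 steps per day, so each
    patch (token) carries one full day of values instead of one step.
    Any trailing partial day is dropped.
    """
    n_days = len(series) // steps_per_day
    trimmed = series[:n_days * steps_per_day]
    return trimmed.reshape(n_days, steps_per_day)  # (tokens, patch_size)

# 30 days of 15-minute data: 2880 point-wise tokens become 30 patch-wise tokens
series = np.arange(30 * 96, dtype=float)
patches = split_into_day_patches(series)
print(patches.shape)  # (30, 96)
```

For a Transformer, this shrinks the sequence length by the intraday step count (here 96x), which is what lets the windows grow without exhausting model capacity.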
Besides, to better tap the training set's global information, we distill it, store it in the container of time representations, and take these time representations as the main modeling entities for Transformers. Dateformer achieves state-of-the-art performance on 7 benchmark datasets. Our main contributions are summarized as follows:

• We analyze the information characteristics of time series and propose splitting time series into patches. This considerably reduces tokens and improves series information input and output, thereby enabling vanilla Transformers to tackle long-term series forecasting problems.

• To better tap the training set's global information, we use time representations as containers to distill it and take time as the main modeling object. Accordingly, we design the first time-modeling time series forecast framework exploiting vanilla Transformers: Dateformer.

• As preliminary work, we also provide a superior time-representing method to support the time-modeling strategy; see Section 2 for details.

To encode Jan 2, DERT selects the date-embeddings of some days before and after it as contextual padding (the green block). Concretely, DERT slides a fixed-window Transformer encoder along the date axis to select the date-embedding input. The input window contains the date-embeddings of: some days before the target day (pre-days padding), the target day to be represented, and some days after the target day (post-days padding). After encoding, the tokens of the pre-days and post-days paddings are discarded; only the target day's token is picked out as the corresponding dynamic date-representation. We design 2 tasks, implemented by linear layers, to pre-train DERT.

Mean value prediction

Predicting the mean values of time series patches is a direct and effective pre-training task: we require DERT to predict the target day's time series mean value from the corresponding date-representation. This task loses the intraday information of the time series patches, so we design the following task to mitigate the problem.

Auto-correlation prediction

We observed that many time series display varied intraday trends and periodicity on different kinds of days, e.g., steadier on holidays, or the contrary. To leverage this, we utilize the circular auto-correlation from stochastic process theory (Chatfield, 2003; Papoulis & Pillai, 2002) to assess a sequence's stability. According to the Wiener-Khinchin theorem (Wiener, 1930), a series' auto-correlation can be calculated by Fast Fourier Transforms. Thus, we score the stability of the target day's time series patch $\mathcal{X}$ of length $G$ by the following equations:

$$S_{XX}(f) = \mathcal{F}(X_t)\,\mathcal{F}^*(X_t) = \int_{-\infty}^{\infty} X_t e^{-i2\pi tf}\,dt \cdot \overline{\int_{-\infty}^{\infty} X_t e^{-i2\pi tf}\,dt}$$

$$R_{XX}(\tau) = \frac{\mathcal{F}^{-1}(S_{XX}(f))}{\ell_2\,Norm} = \frac{\int_{-\infty}^{\infty} S_{XX}(f) e^{i2\pi f\tau}\,df}{X_1^2 + X_2^2 + \cdots + X_G^2}$$

$$Score = Score(\mathcal{X}) = \frac{1}{G}\sum_{\tau=0}^{G-1} R_{XX}(\tau) \tag{1}$$

where $\mathcal{F}$ denotes the Fast Fourier Transform, and $*$ and $\mathcal{F}^{-1}$ denote the conjugate and the inverse transform, respectively. We ask DERT to predict the Score of the target day's time series patch from the day's date-representation.
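The Score above can be computed directly with FFTs. The following is a sketch based on our reading of the equations (the energy normalization makes R(0) = 1); the function name is illustrative, not the authors' implementation:

```python
import numpy as np

def stability_score(x):
    """Score a day patch's stability via circular auto-correlation.

    Following Wiener-Khinchin: the power spectrum is F(x) * conj(F(x)),
    the auto-correlation is its inverse FFT, normalized here by the
    patch's squared l2 norm so that R(0) == 1. The Score is the mean
    of R over all G lags.
    """
    spectrum = np.fft.fft(x) * np.conj(np.fft.fft(x))   # S_XX(f)
    r = np.fft.ifft(spectrum).real / np.sum(x ** 2)     # R_XX(tau), R[0] == 1
    return r.mean()

# A perfectly flat day scores 1; an erratic day scores much lower
flat = np.ones(96)
noisy = np.random.default_rng(0).normal(size=96)
print(stability_score(flat), stability_score(noisy))
```

A steadier intraday patch yields a Score closer to 1, matching the intuition that holidays (or their opposite) stand out through their intraday stability.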

Pre-training

We use the whole training set's time series patches to pre-train DERT; each day's time series patch is a training sample, and the loss is calculated by:

$$Pretraining\ Loss = \mathrm{MSE}(linearLayer_1(d), Mean) + \mathrm{MSE}(linearLayer_2(d), Score)$$

where $d$ denotes the target day's (patch's) dynamic date-representation produced by DERT, $Mean$ denotes the day's true time series mean value, and $Score$ is calculated by equation 1. Our DERT relies on supervised pre-training tasks, so it is series-specific: different time series datasets need to train their respective DERTs. Note that the 2 tasks are employed only for pre-training DERT; they are removed after pre-training and do not participate in the downstream forecasting task.
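A minimal sketch of this loss, with plain weight vectors standing in for the two linear heads (all names and shapes here are illustrative, not from the paper's code):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two scalars or arrays."""
    return np.mean((np.asarray(pred) - np.asarray(target)) ** 2)

def pretraining_loss(d, w_mean, w_score, true_mean, true_score):
    """Sum of the two task losses for one day's date-representation d.

    w_mean / w_score stand in for linearLayer_1 / linearLayer_2.
    """
    pred_mean = d @ w_mean      # mean value prediction head
    pred_score = d @ w_score    # auto-correlation Score prediction head
    return mse(pred_mean, true_mean) + mse(pred_score, true_score)

rng = np.random.default_rng(0)
d = rng.normal(size=16)         # a toy 16-dim date-representation
loss = pretraining_loss(d, rng.normal(size=16), rng.normal(size=16),
                        true_mean=0.7, true_score=0.9)
print(loss)
```

In training, `true_mean` and `true_score` would come from the day's ground-truth patch, and the two heads would be fitted jointly with the DERT encoder.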

3. DATEFORMER

We believe that time series fuse 2 patterns of components: a global and a local pattern. The global pattern derives from a time series' inherent characteristics and hence does not fluctuate violently, while the local pattern is caused by accidental factors such as sudden changes in the weather, so it is not long-lasting. As aforementioned, the global pattern should be grasped from the entire training set series, while the local pattern can be captured from the lookback window series. Based on this, we design Dateformer: it distills the training set's global information into date-representations and leverages that information to produce a global prediction; the lookback window provides local pattern information and contributes a local prediction; Dateformer then fuses the 2 predictions into a final prediction. To enhance the Transformer's throughput, we split time series into patches by day and apply Dateformer in a patch-wise processing style. Therefore, we deal with time series forecasting at the day level: predicting the next H days of a time series from historical series.

3.1. GLOBAL PREDICTION

We distill the whole training set's global information and store it in date-representations. Then, during inference, we can draw a global prediction therefrom to represent the learned global pattern. The local pattern information is volatile and hence only available from recent observations. To better capture the local pattern from the lookback window, we eliminate the learned global pattern from the lookback window series. As described in the previous section, for a day in the lookback window, we can get the day's global prediction from its date-representation. Then the day's Series Residual r, carrying only local pattern information, is produced by:

$$Date\ Representation = \{d\}$$
$$Positional\ Encodings = \{1, 2, 3, \cdots, G\}$$
$$\{t_1, t_2, t_3, \cdots, t_G\} = Duplicate(d) + Positional\ Encodings = \{\underbrace{d, d, d, \cdots, d}_{G\ copies}\} + \{1, 2, 3, \cdots, G\}$$
$$Global\ Prediction = \mathrm{FFN}(t_1, t_2, t_3, \cdots, t_G)$$
$$Series\ Residual\ r = Series - Global\ Prediction \tag{3}$$

where $Series$ denotes the day's ground-truth time series. We apply this operation to all days in the lookback window to produce the Series Residuals $\{r_1, r_2, \cdots\}$ of the whole lookback window.

3.2. LOCAL PREDICTION

Subsequently, we utilize a vanilla Transformer to learn a local prediction from these Series Residuals. Concretely, as shown in Figure 3, we feed the lookback window's Series Residuals into the Transformer encoder. For multivariate time series, pre-flattening the series is required. The encoder output is sent into the Transformer decoder as cross information to help the decoder refine the input date-representations. To learn pair-wise similarities between the forecast day and each day in the lookback window, the decoder consumes the date-representations of the lookback window and the forecast day and exchanges information between them. Given a lookback window of P days, the similarities are calculated by:

$$Series\ Residuals = \{r_1, r_2, \cdots, r_P\}$$
$$Date\ Representations = \{d_1, d_2, \cdots, d_P, d_{P+1}\}$$
$$\{\hat{d}_1, \hat{d}_2, \cdots, \hat{d}_P, \hat{d}_{P+1}\} = Decoder(Date\ Representations, Encoder(Series\ Residuals))$$
$$Similarities = \{W_1, W_2, \cdots, W_P\} = \mathrm{SoftMax}(\mathrm{FFN}(\hat{d}_1, \hat{d}_2, \cdots, \hat{d}_P)) \tag{4}$$

Then we can get a local prediction corresponding to $d_{P+1}$ by aggregating the Series Residuals:

$$Local\ Prediction = Aggregate(Series\ Residuals) = W_1 \times r_1 + W_2 \times r_2 + \cdots + W_P \times r_P \tag{5}$$

3.3. FINAL PREDICTION

Now we can obtain a final prediction by adding the global prediction and the local prediction:

$$Final\ Prediction = Global\ Prediction + Local\ Prediction \tag{6}$$

For multi-day forecasts, we need to encode multi-day date-representations and produce their corresponding global predictions and local predictions.
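Equations (3) through (5) can be mimicked with a toy numpy sketch. The FFN, decoder, and learned weights are replaced by random stand-ins here, so this only illustrates the shapes and data flow, not the trained model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

G, P, D = 96, 7, 16   # intraday steps, lookback days, representation dim
rng = np.random.default_rng(0)

# Eq. (3): duplicate the date-representation G times, add positional
# encodings 1..G, and map each token to one intraday step's value.
def global_prediction(d, w):          # w is a toy stand-in for the FFN
    tokens = np.tile(d, (G, 1)) + np.arange(1, G + 1)[:, None]
    return tokens @ w                 # shape (G,): one value per step

w = rng.normal(size=D) / D
days = rng.normal(size=(P, G))        # lookback ground-truth day patches
reps = rng.normal(size=(P, D))        # the days' date-representations
residuals = np.stack([days[i] - global_prediction(reps[i], w)
                      for i in range(P)])   # Series Residuals r_1..r_P

# Eq. (4)-(5): softmax similarity weights over the lookback days; the local
# prediction is the weighted sum of their residuals.
weights = softmax(rng.normal(size=P))  # stand-in for SoftMax(FFN(decoder(...)))
local_pred = weights @ residuals       # shape (G,)
print(local_pred.shape)
```

Note that since the weights are a softmax, the local prediction is a convex combination of the lookback days' residuals, i.e., a similarity-weighted blend of recent local patterns.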
Thus, we design Dateformer to automatically conduct these procedures and fuse the results into the final predictions, just like a scheduler. Referring to Figure 4, for a multi-day forecast we input into Dateformer: i) the static date-embeddings of the pre-days padding, lookback window, forecast window, and post-days padding, to encode multi-day date-representations; ii) the lookback window series, to provide the local information supporting the local prediction. Dateformer slides DERT to encode the dynamic date-representations of the lookback and forecast windows. Then the global predictions can be produced by a single feed-forward propagation. For the local predictions, Dateformer recursively calls the Transformer to predict day by day. Autoregression is applied in this procedure: the previous prediction is used for subsequent predictions:

$$Series\ Residuals = \{r_1, r_2, \cdots, r_P\} \;\Rightarrow\; Local\ Prediction\ \hat{r}_{P+1} = Aggregate(Series\ Residuals)$$
$$Series\ Residuals = \{r_0, r_1, r_2, \cdots, r_P, \hat{r}_{P+1}\} \;\Rightarrow\; Local\ Prediction\ \hat{r}_{P+2} = Aggregate(Series\ Residuals)$$
$$\cdots \tag{7}$$

To retard error accumulation, we adopt 3 tricks: 1) Dateformer employs autoregression only for the local predictions; the global predictions, which carry most of the numerical scale of the time series, are generated in a single feed-forward propagation, which restricts the errors' scale; 2) Dateformer regards the day as the basic unit of a time series, so for fine-grained time series, the day-wise autoregression remarkably reduces the number of error propagation steps and accelerates prediction; 3) during the local prediction autoregression, when a day's predicted local prediction patch (such as $\hat{r}_{P+1}$ in equation 7) is appended to the tail of the Series Residuals, Dateformer inserts the ground-truth residual of one earlier past day (such as $r_0$ in equation 7) at their head, to balance the proportion between true local information and predicted residuals. We did not adopt the one-step forward generative prediction style proposed by Zhou et al. (2021) and followed by other Transformer-based forecast models, because that style lacks scalability: for forecast demands of different lengths, those models have to be re-trained. We are committed to providing a scalable forecast style that tackles forecast demands of various lengths. Although it has the defects of slow prediction and error accumulation, autoregression can achieve this idea, so we adopt it and fix its defects. Dateformer has excellent scalability: it can be trained on a short-term forecasting setup while robustly generalizing to long-term forecasting tasks. For each dataset, we train Dateformer only once to deal with forecast demands for any number of days.
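The day-wise autoregressive rollout with the head-insertion trick (trick 3) can be sketched as follows; `aggregate` stands in for the learned Aggregate of equation (5), and all names are illustrative:

```python
import numpy as np

def rollout(residuals, forecast_days, aggregate, history):
    """Day-wise autoregressive local prediction (sketch of Eq. 7).

    After each predicted residual is appended to the window's tail, one
    earlier ground-truth residual from `history` is inserted at the head,
    balancing true local information against predicted residuals.
    """
    window = list(residuals)      # r_1..r_P, one (G,) patch per day
    preds = []
    for step in range(forecast_days):
        r_next = aggregate(np.stack(window))   # predicted day patch
        preds.append(r_next)
        window.append(r_next)                  # predicted residual at tail
        if step < len(history):
            window.insert(0, history[step])    # ground truth at head
    return np.stack(preds)

rng = np.random.default_rng(0)
mean_agg = lambda w: w.mean(axis=0)            # toy stand-in for Aggregate
preds = rollout(rng.normal(size=(7, 96)), forecast_days=3,
                aggregate=mean_agg, history=rng.normal(size=(3, 96)))
print(preds.shape)  # (3, 96)
```

Because each autoregressive step emits a full day patch rather than a single time-step, a 90-day forecast of a 15-minute series takes 90 steps instead of 8640, which is the source of both the speed-up and the reduced error propagation.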

4. EXPERIMENTS

Datasets We extensively perform experiments on 7 real-world datasets covering energy, traffic, economics, and weather, including: (1) ETT (Zhou et al., 2021). Baselines We compare against strong baselines including Pyraformer (Liu et al., 2021) (ICLR 2022 Oral), LSTM (Hochreiter & Schmidhuber, 1997), and TCN (Bai et al., 2018).

Implementation details

The details of our models and the Transformer-based baseline models are in Appendix I. Our code will be released after the paper's acceptance.

4.1. MAIN RESULTS

We show Dateformer's performance here. Since time series are split by day, we evaluate models over a wide range of forecast days: 1d, 3d, 7d, 30d, 90d (or 60d), and 180d, covering short-, medium-, and long-term forecast demands. For the coarse-grained ER series, we still follow the forecast horizon setups proposed by Wu et al. (2021). Our idea is to break the information bottleneck for models, so we do not restrict the models' input: any input length is allowed as long as the model can afford it. For the baselines, we choose the input length recommended in their papers where available; otherwise, we give an estimated input length. We empirically select 7d, 9d, 12d, 26d, 46d (or 39d), and 60d as the corresponding lookback days for Dateformer, and we train it only once on the 7d-predict-1d task to test all setups for each dataset. For fairness, all sequence-modeling baselines are trained time-step-wise but tested day-wise. Due to space limitations, only multivariate comparisons are shown here; see Appendix D for univariate comparisons and Table 12 for standard deviations.

Multivariate results

Our proposed Dateformer achieves consistent state-of-the-art performance under all forecast setups on all 7 benchmark datasets. The longer the forecast horizon, the more significant the improvement. For the long-term forecasting setting (>60 days), Dateformer gives MSE reductions: 82.5% (0.585 → 0.…) forecast. Note that our Dateformer still contributes the best predictions on the ER series, a coarse-grained time series without obvious periodicity.

4.2. ANALYSIS

We analyze the 2 forecast components' contributions to the final prediction and explain why Dateformer's predictive errors rise so slowly as the forecast days extend. We use the separate global prediction or local prediction component to predict and compare their results, as shown below. It can be seen that the errors of the separate global prediction hardly rise as the forecast horizon grows, while the separate local prediction deteriorates rapidly. This supports our hypothesis that time series comprise 2 patterns of components: a global and a local pattern. As stated in Section 3, the global pattern does not fluctuate violently, while the local pattern is not long-lasting. Therefore, as the forecast horizon grows, the global prediction, which keeps stable errors, is especially important for long-term forecasting. Although it does best in short-term forecasting, the local prediction's errors increase rapidly as the forecast days extend, because the local pattern is not long-lasting. As time goes on, the local pattern shifts gradually; for distant future days, the current local pattern even degenerates into noise that encumbers predictions: comparing the 180-day prediction cases on PL, the best prediction is contributed by the separate global prediction rather than the final prediction, which is disturbed by that noise. Dateformer grasps not only the local pattern but also the global pattern, so its errors rise mostly steadily. To intuitively understand the 2 patterns' contributions, we use the Auto-correlation, which presents the same periodicity as the source series, to analyze the seasonality and trend of the ETT oil temperature series. As shown in Figure 5a, the ETT oil temperature series mixes complex multi-scale seasonality, and the global prediction captures it almost perfectly in 5b. But in Figure 5c, we find the global prediction fails to accurately estimate the series mean: although a sharp descent occurs at the left end after eliminating the global prediction, it does not immediately drop to near 0, because the local pattern also affects the series trend. The local pattern is not long-lasting, so we can only approximate it from the lookback window. After subtracting the final prediction, we get an Auto-correlation series of random oscillations around 0 in 5d, which indicates the remaining series is as unpredictable as white noise.

4.3. ABLATION STUDIES

We conduct ablation studies on the input lookback window length and the proposed time-representing method; related results and discussions are in Appendix E. Due to space limitations, we only list the major findings here. (1) Our Dateformer benefits from longer lookback windows, but other Transformers do not; (2) the proposed dynamic time-representing method effectively enhances Dateformer's performance; (3) but sequence-modeling Transformers, which underestimate the significance of time and let it play only an auxiliary role, may not benefit from better time-encodings. In addition, we also conduct ablation studies on the 2 pre-training tasks presented in Section 2. Beyond all doubt, the mean prediction pre-training task is naturally effective: removing it from pre-training, Dateformer's predictions remarkably deteriorate. But for some datasets, pre-training DERT on only this task also damages Dateformer's performance. Auto-correlation prediction induces DERT to attend to the time series' intraday information, thereby encoding finer date-representations.

5. CONCLUSIONS

In 

A FUTURE WORKS

In this paper, as preliminary work, we briefly introduced the dynamic date-representation and DERT, but did not discuss their transferability. Actually, under certain conditions, a DERT pre-trained on one time series dataset can transfer to other similar datasets to encode date-representations. For example, PL and ETT are similar electrical series datasets recorded in different cities of China; PL ranges from 2009 to 2015 while ETT ranges from 2016 to 2018, but we found that the DERT pre-trained on PL can directly transfer to forecast tasks on ETT. More interestingly, with the transferred DERT, Dateformer demonstrates fairly good few-sample learning potential. On the ETT series, even though only one-third of the training samples are provided, Dateformer can still converge to the same performance level as with the full training samples (see Table 3). The details are not completely clear yet; we will study them further in future works.

B RELATED WORKS

LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Chung et al., 2014) employ gating mechanisms to extend the series dependence learning distance and relieve the gradient vanishing or explosion of RNNs. DeepAR (Salinas et al., 2020) further combines LSTM and autoregression for time series probabilistic forecasting. LSTNet (Lai et al., 2018) adopts CNNs and recurrent-skip connections to model short- and long-term patterns of time series. Some RNN-based works introduced the attention mechanism to capture long-range temporal dependence, so as to improve predictions (Qin et al., 2017). However, RNNs' intrinsic flaws (difficulty capturing long-range dependence, slow inference, and accumulating error) prevent them from predicting long-term series. The temporal convolutional network (TCN) (Bai et al., 2018) is the representative CNN-based work, which models temporal dependence with causal convolutions. DeepGLO (Sen et al., 2019) also mentioned the concepts of global and local, and employs TCNs to model them.
Nevertheless, the global concept in their paper refers to relationships between related time series, which is distinct from what this paper refers to: the global historical observations of the series itself. The Transformer (Vaswani et al., 2017) has recently become popular in long-term series forecasting owing to its excellent long-range dependence capturing capability. But directly applying the Transformer to long-term series forecasting is computationally prohibitive due to self-attention's quadratic complexity in sequence length, in both memory and time. Many studies have been proposed to resolve this problem. LogTrans (Li et al., 2019) presented LogSparse attention that reduces the computational complexity to O(L(log L)^2), and Reformer (Kitaev et al., 2020) presented local-sensitive hashing (LSH) attention with O(L log L) complexity. Informer (Zhou et al., 2021) proposed probSparse attention of O(L log L) complexity and further renovated the vanilla Transformer architecture. Autoformer (Wu et al., 2021) and FEDformer (Zhou et al., 2022) built decomposition blocks into the Transformer and introduced low-complexity enhanced frequency-domain attention or Auto-Correlation to replace self-attention. Pyraformer (Liu et al., 2021) adopted hierarchical multi-resolution PAM attention to achieve O(L) complexity and learn multi-scale temporal dependence. To extend the forecast horizon, all these variant Transformers designed various modified self-attentions or substitutes to reduce computational complexity, so as to tackle longer time series. In the CV community, ViT (Dosovitskiy et al., 2020) splits images into patches and treats each patch as a token.

C DATE EMBEDDING

Observing the movement of celestial bodies, changes in temperature, and the rise and fall of plants, our forefathers distilled a set of laws: calendars. A wealth of wisdom is encapsulated in the calendar; people's activities are guided by it, and we believe that is the fundamental source of seasonality in many time series. We try to introduce this wisdom into our models as a priori knowledge. We use the Gregorian calendar as the solar calendar and the traditional Chinese calendar as the lunar calendar to deduce our date-embedding. Besides, vacation and weekday information is also taken into account. In our code, we provide the date-embeddings from 2001 to 2023 for 12 countries or regions: Australia, Britain, Canada, China, Germany, Japan, New Zealand, Portugal, Singapore, Switzerland, USA, and San Francisco. For example:

$$day\ of\ jieqi = \frac{days\ that\ have\ passed\ in\ this\ solar\ term - 1}{total\ days\ in\ this\ solar\ term - 1} - 0.5$$

We standardize these date features by their respective maxima. For unbounded features (e.g., abs day and year), we use a large number to standardize them; for example, we can use 100 to standardize year and then need not worry about it in our lifetime. With DERT encoding, our model shows excellent generalization when facing dates never seen before (see Table 2). Because they are too complicated, some date features' equations are omitted; email the authors if interested.
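The day-of-jieqi feature and the year standardization above can be transcribed directly (this is an illustrative transcription of the formulas, not the released code):

```python
def day_of_jieqi(days_passed, total_days):
    """Normalized position of a day within its solar term, centered at 0.

    (days_passed - 1) / (total_days - 1) - 0.5, so the first day of the
    term maps to -0.5 and the last day maps to +0.5.
    """
    return (days_passed - 1) / (total_days - 1) - 0.5

def standardize_year(year, big=100):
    """Standardize the unbounded 'year' feature by a large constant."""
    return year / big

print(day_of_jieqi(1, 15), day_of_jieqi(15, 15))  # -0.5 0.5
```

Centering at 0 and bounding in [-0.5, 0.5] keeps every date feature on a comparable scale before they are stacked into the static date-embedding.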

D UNIVARIATE RESULTS

Baselines We also select 6 strong baseline models for univariate forecast comparisons, covering state-of-the-art deep learning models and classic statistical methods: FEDformer (Zhou et al., 2022), Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), N-BEATS (Oreshkin et al., 2019), DeepAR (Salinas et al., 2020), and ARIMA (Box & Jenkins, 1968).

E.1 IMPACT OF INPUT LENGTH

In this section, we study the impact of different input lengths on several Transformer-based models. On 8 representative forecast tasks, which cover short-, middle-, and long-term forecasting on various time series datasets, we gradually extend the lookback window and record the models' predictive errors at different input window sizes. The results are shown as line charts. As shown in Figure 6, as the lookback window extends, Dateformer's predictive errors gradually reduce until stable, which indicates that Dateformer can leverage more information to continuously improve its predictions. Although not the best under narrow lookback windows, Dateformer finally outperforms all other baseline Transformers as the lookback window grows. But the other Transformers may not benefit from larger input windows: their performance is unstable and even deteriorates as the input length extends. This is against common sense: more information intake should lead to better predictions. Compared to them, Dateformer's information utilization upper limit is higher: we can always expect better predictions by feeding a longer lookback window series, or at least not worse ones.

E.2 IMPACT OF TIME ENCODING

In this paper, we contribute 2 time-encodings: the static date-embedding, generated by straightforwardly stacking more date features, and the dynamic date-representation, which further considers a date's contextual information. So, we try to clarify which time-encoding is better, and how well other Transformers perform when better time-encodings are provided.

E.2.1 WHICH TIME ENCODING IS BETTER?

The better time-encoding should represent time more accurately and accommodate more of the time series' global information. Thus, we employ the 2 time-encodings to distill the time series' global information through 2 global prediction components with roughly the same number of parameters, and then check their global prediction quality. The results are shown below. In Figure 7, when approaching important festivals, DERT pays more attention to them: the more important the festival, the more concentrated the attention. In China, the Spring Festival is the most important festival of the year, so we see the most intensive attention on that day in Figure 7b, while other days receive barely any attention. Referring to Figure 7c, although attention on Apr 4 is very intensive, some attention is still distributed to other days, because the Qingming Festival is not as important as the Spring Festival, so DERT does not pay excessive attention to it. This is completely in line with Chinese habits. Besides, DERT can also tackle the changeable date semantics of different calendars. In the traditional Chinese lunar calendar, Sep 30, 2012 is the Mid-Autumn Festival, while Oct 1 is Chinese National Day in the Gregorian calendar. The interval between the 2 festivals changes every year, and in some years they even fall on the same day. In Figure 7d, they both exert influence on Oct 2. We cannot be sure how their influence changes when they fall on the same or adjacent days, so we let DERT learn it from data. We call such problems date polysemy. Note that we did not tell DERT which day is a festival; all knowledge about festivals is learned by itself.

E.2.2 HOW WELL OTHER TRANSFORMERS PERFORM USING BETTER TIME ENCODINGS?

The Transformer-based baselines used for comparison also leverage time features to assist predictions: they adopt the time embedding proposed by Zhou et al. (2021). We were inspired by it too, and follow it by stacking more date features into our static date-embedding. In this section, we try to clarify how well these baselines perform when better time-encodings are provided. We modify these Transformers to employ the time-encodings mentioned in the previous section and compare the impact of the different time-encodings on them. To better understand the role of time for these sequence-modeling Transformers, we also eliminate the time embedding as a comparison. The results are as follows. Referring to Tables 6 and 7, as a time-modeling method that mainly models time, Dateformer consistently benefits from better time-encodings. However, better time-encodings may not enhance the other Transformers, which mainly model series and only leverage the time-encoding in an auxiliary style, just like positional encoding. In some forecast cases, better time-encodings improve their predictions, but in other cases, eliminating the time-encoding results in the best prediction: their utilization of time is unstable. Besides, simply collecting more time features does not necessarily work, and sometimes even encumbers predictions (see the 3-day prediction case in Table 6).

E.3 ABLATION STUDY ABOUT PRE-TRAINING TASKS

In Section 2, we design 2 pre-training tasks for DERT: mean prediction and Auto-correlation prediction. Here, we check their effectiveness. We train DERT with each pre-training task separately and compare the prediction results, shown as follows. As shown in Table 8, either the mean prediction or the Auto-correlation prediction task alone is effective for pre-training DERT, but combining them obtains better date-representations. The mean prediction loses some intraday information of the time series, so we design the Auto-correlation prediction to mitigate this problem: it induces DERT to capture some of the time series' intraday information.

F TRAINING DETAILS

There are 3 relatively independent components in Dateformer: the DERT encoder, global prediction, and local prediction. They can all be trained separately. We found that Dateformers trained with different stage strategies have different predictive performances. For some datasets, pre-training the DERT or global prediction component is necessary; for other datasets, end-to-end training of the entire Dateformer is also feasible. Thus, we design a 3-stage training methodology to tap the full capacity of Dateformer and best serve the various forecast horizon demands of different datasets.

Pre-training

In Section 2, we introduce 2 tasks to pre-train DERT, and the first step is to pre-train the DERT encoder. Time series are split into patches by day to supervise the learning of DERT. Each task contributes half of the pre-training loss. The pre-training stage aims to provide superior time representations as a container to distill and store global information from the training set.
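The equal-weight combination of the two pre-training objectives can be sketched as follows. This is a minimal sketch: the function and tensor names are illustrative, not the paper's actual implementation, and we assume both tasks are supervised with an L2 loss as in the rest of the paper.

```python
import torch

def pretrain_loss(pred_mean, true_mean, pred_acf, true_acf):
    """Equal-weight combination of the two DERT pre-training objectives.

    pred_mean/true_mean: per-day series means predicted from the
    date-representation vs. computed from the ground-truth daily patch.
    pred_acf/true_acf: predicted vs. ground-truth intraday
    Auto-correlation series. All names here are illustrative.
    """
    mse = torch.nn.functional.mse_loss
    # Each task contributes half of the pre-training loss.
    return 0.5 * mse(pred_mean, true_mean) + 0.5 * mse(pred_acf, true_acf)
```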

Warm-up

The second step uses the separate global prediction component to distill the time series' global information from the training set; we call it the warm-up phase. In warm-up, the pre-trained DERT encoder is loaded, and then we train the separate global prediction component on the training set. We insert this stage to force the global prediction component to memorize the global characteristics of the time series, so as to help Dateformer produce more robust long-range predictions. Furthermore, it also serves as an adapter when transferring a DERT from other datasets. This phase is optional: we observed that a better short-term forecast is usually provided by a Dateformer that skips warm-up, while more credible long-term predictions always come from the preheated one.

Formal training

Dateformer loads the pre-trained DERT or global prediction component and then starts training. We use a small learning rate to fine-tune the pre-trained parameters. With enough memory, Dateformer can extrapolate to any number of days in the future, once trained.
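The per-group learning-rate treatment in the formal training stage can be sketched as below. This is only an illustration of the idea of fine-tuning pre-trained parts with a smaller learning rate than fresh ones; the function name, the `fine_tune_lr` value, and the grouping are assumptions, not the paper's exact setup.

```python
import torch

def formal_training_optimizer(dert, global_head, fresh_params, fine_tune_lr=1e-4):
    """Stage-3 optimizer sketch: pre-trained (stage 1) and warmed-up
    (stage 2) parameters are fine-tuned with a small learning rate,
    while freshly initialized parameters use the full learning rate."""
    return torch.optim.AdamW([
        {"params": dert.parameters(), "lr": fine_tune_lr},         # pre-trained DERT
        {"params": global_head.parameters(), "lr": fine_tune_lr},  # warmed-up global head
        {"params": fresh_params, "lr": 1e-3},                      # fresh parameters
    ])
```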

G PATCH SIZE CHOICE

In the main text, we choose the "day" as the patch size because it is moderate, close to people's habits, and convenient for modeling. Here, we try some other patch sizes: 1/3-day, half-day, and 3-day. The results are shown below. As shown in Table 10, for some time series, finer patch sizes lead to better short-term predictions, because closer time series carry more accurate local pattern information: compared to a day ago, the local pattern of the present is more similar to that of half a day ago. But their mid- and long-term predictions deteriorate, because finer patch sizes result in more local prediction recursions for the same forecast horizon, which means more error accumulation. In addition, for time series whose daily periodicity is dominant (like Traffic), the "day" is the best patch size. These time series are usually closely related to human activities and hence more concerned with daily patterns. A coarse patch size beyond a "day" does not match people's most important activity period, so it is difficult to construct sufficiently accurate time-representations: if a holiday falls inside a 3-day patch, how can the time-embedding encode which day it is, using the method mentioned at the beginning of Section 2? This makes over-coarse patch sizes inapplicable. For most time series related to human activity, we recommend the "day" as the patch size; it is moderate and close to people's habits.
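The error-accumulation argument above reduces to simple arithmetic on the number of autoregressive recursions; a small helper (the function name is ours, not the paper's) makes this concrete:

```python
def local_recursions(horizon_steps: int, steps_per_day: int, patch_days: float) -> int:
    """Number of autoregressive local-prediction recursions needed to
    cover a forecast horizon, given a patch size measured in days.
    Finer patches mean more recursions, hence more error accumulation."""
    patch_steps = int(steps_per_day * patch_days)
    assert horizon_steps % patch_steps == 0, "horizon must align with patch size"
    return horizon_steps // patch_steps

# e.g. a 30-day horizon on a 15-min series (96 steps/day):
# half-day patches need 60 recursions, day patches 30, 3-day patches 10.
```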

H HYPER PARAMETER SENSITIVITY

We check the robustness with respect to the hyper-parameters: the pre-days and post-days paddings. We select 6 groups of paddings: (1, 1), (3, 3), (7, 7), (7, 14), (14, 14), and (30, 30). To be more intuitive, we test the global prediction component with the above paddings on the PL dataset. As shown in Table 11, padding sizes that are too large or too small worsen prediction performance. The moderate padding size (7, 7) leads to the best global prediction. This is also consistent with how people generally think: we don't only consider the here and now, but we also don't look too far ahead. In the main text, to balance predictive performance across various time series datasets, we use the padding size (7, 14) instead of the best (7, 7).

I IMPLEMENTATION DETAILS

Our proposed models are trained with L2 loss, using the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 7e-4. We adjust the learning rate with OneCycleLR (Smith & Topin, 2019), using a 3-phase schedule with 40% of the cycle spent increasing the learning rate and a max learning rate of 1e-3. The batch size is 64 for pre-training and 32 otherwise. The total number of epochs is 100, but the models normally super-converge very quickly. All experiments are repeated 3 times, implemented in PyTorch (Paszke et al., 2019), and trained on an NVIDIA RTX 3090 24GB GPU. The pre-days and post-days paddings are 7 and 14 respectively. There is 1 layer in the DERT encoder, and the inner Transformer contains 4 layers in both the encoder and decoder. The implementations of Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), and Reformer (Kitaev et al., 2020) are from the Autoformer repository, and the implementations of FEDformer (Zhou et al., 2022) and Pyraformer (Liu et al., 2021) are from their respective repositories. We adopt the hyper-parameter settings recommended in their repositories but unify the token dimension d_model as 512. We fix the input series length at 96 time-steps for FEDformer, Autoformer, Informer, and Reformer. This is recommended by them or by empirical results, and we found that extending their input length results in unstable performance for these baselines (see Section E.1). For Pyraformer, facing longer forecast horizons, we extend its input length as its paper recommends. For more detailed hyper-parameter settings, please refer to their code repositories. For the other baseline models, we also use grid search to select hyper-parameters.
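The optimization setup described above can be sketched with standard PyTorch APIs. The helper name and the `steps_per_epoch`/`epochs` derivation of `total_steps` are our assumptions; the optimizer, weight decay, scheduler, and its `pct_start`/`three_phase` settings follow the text.

```python
import torch

def make_optimizer_and_scheduler(model, steps_per_epoch, epochs=100):
    """AdamW with weight decay 7e-4, scheduled by 3-phase OneCycleLR
    spending 40% of the cycle increasing the LR up to 1e-3."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=7e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=1e-3,
        total_steps=steps_per_epoch * epochs,
        pct_start=0.4,      # 40% of the cycle spent increasing the LR
        three_phase=True,   # 3-phase schedule as in Smith & Topin (2019)
    )
    return optimizer, scheduler
```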

J COMPLEXITY ANALYSIS

We also provide the complexity analysis of the global prediction and local prediction components. The overall time complexity of the local prediction component is O((L_o/G)(L_i + L_o + ((L_i + L_o)/G)^2)), where the ((L_i + L_o)/G)^2 term arises in the last day's prediction. In practice, G^2 is a big number for fine-grained time series, which tames the quadratic term, and L_i + L_o takes up the most memory; this becomes the main factor restricting how far away we can predict. Under 24GB of memory, Dateformer's maximum forecast horizon exceeds that of all baseline models, and its inference speed is only slightly slower than Transformers that adopt one-step forward generative-style inference (Zhou et al., 2021), because the local prediction component requires a few recursions.

K MAIN RESULTS WITH STANDARD DEVIATIONS

We repeat all experiments 3 times, and the results with standard deviations are shown in Table 12 . 

L SHOWCASES

In order to show Dateformer's prediction results more intuitively, we visualize the time series ground truth and predictions of several forecast tasks. The charts are shown as follows.

N OTHER MODELS' REMAINDERS

We show 4 auto-correlation series in Figure 5 to provide an intuitive view of the characteristics of the global and local patterns. That is not for comparison with other models, just to intuitively explain each component's function. But out of additional interest, we also provide the auto-correlation series of other Transformers' remainders here. As shown in Figures 13a and 13b, the auto-correlation series of the remainders produced by Informer and Reformer do not drop to 0 immediately at the left end, which indicates they fail to accurately predict the time series' mean. The remainders of Informer, Reformer, and Autoformer still exhibit obvious seasonality that they did not capture. Referring to Figure 13d, though less obvious, there is also weak seasonality in FEDformer's remainder.
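The remainder diagnostic used here can be reproduced with a few lines of NumPy. This is a generic sketch of the normalized auto-correlation computation (the function name is ours), not the paper's exact plotting code: a remainder whose auto-correlation drops to roughly 0 after lag 0 indicates the mean and seasonal components were captured well.

```python
import numpy as np

def autocorrelation(series: np.ndarray, max_lag: int) -> np.ndarray:
    """Normalized auto-correlation of a (remainder) signal for lags
    0..max_lag, as used to check whether a model's residuals still
    contain uncaptured seasonality."""
    x = series - series.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])
```

Applied to a remainder with uncaptured daily seasonality, this curve shows a clear peak at the daily lag instead of dropping to 0.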



Footnotes:
• Code will be released soon.
• Related works are in Appendix B; see Appendix C for details.
• We use the canonical Positional Encoding proposed by Vaswani et al. (2017); other PEs are also practicable. We discuss it in Appendix F.
• ECL: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
• PL: provided by the 9th China Society of Electrical Engineering cup competition.
• Traffic: http://pems.dot.ca.gov/
• Weather: https://www.bgc-jena.mpg.de/wetter/
• ER: https://fred.stlouisfed.org/categories/158
• Autoformer: https://github.com/thuml/Autoformer
• FEDformer: https://github.com/MAZiqing/FEDformer
• Pyraformer: https://github.com/alipay/Pyraformer



Figure 1: (a) depicts the two-year power load of an area. (b) illustrates the area's full-day load on Jan 2, 2012, a week earlier, a week later, and a year earlier: compared to a week earlier or later, the power load series of Jan 2, 2012, is closer to that of a year earlier, which indicates the day's time semantics has shifted closer to a year ago and further away from a week ago or later.

Figure 3: Take Jan 2 as an example: to get the day's local prediction, we utilize Transformer to learn series residuals similarities W i between Jan 2 and the lookback window (left), and use the similarities to aggregate Series Residuals of lookback window into the local prediction (right).

Figure 4: Workflow of Dateformer (4-predict-2 days case), with 2 pre(post)-days padding. (1): sliding DERT to encode multi-day date-representations and building their global predictions in a single feed-forward propagation; (2): making multi-day series residuals of the lookback window; (3), (4): autoregressive local predictions; (5): fusing the global and local predictions into the final predictions.

dataset collects 2 years (July 2016 to July 2018) of electricity data from 2 transformers located in China, including oil temperature and load recorded every 15 minutes or hourly; (2) the ECL 6 dataset collects hourly electricity consumption data of 321 Portuguese clients from 2012 to 2015; (3) the PL 7 dataset contains power load series of 2 areas in China from 2009 to 2015, recorded every 15 minutes and carrying incomplete weather data. We eliminate the climate information and stack the 2 series into a multivariate time series; (4) Traffic 8 contains hourly road occupancy rates in San Francisco from 2015 to 2016; (5) Weather 9 is a collection of 21 meteorological indicator series recorded every 10 minutes by a German station for the whole year 2020; (6) ER 10 collects the daily exchange rates of 8 countries from March 2001 to March 2022. We split all datasets into training, validation, and test sets by the ratio of 6:2:2 for the ETT and PL datasets and 7:1:2 for the others, following the standard protocol. Baselines We select 6 strong baseline models for multivariate forecast comparisons, including 4 state-of-the-art Transformer-based, 1 RNN-based, and 1 CNN-based models: FEDformer (Zhou et al., 2022) (ICML 2022), Autoformer (Wu et al., 2021) (NeurIPS 2021), Informer (Zhou et al., 2021) (AAAI 2021 Best Paper Award), Pyraformer

Figure 5: Auto-correlation of four time series from ETT oil temperature series: (a) Ground-truth; (b) Global Prediction; (c) Ground-truth -Global Prediction; (d) Ground-truth -Final Prediction.

Figure 6: The error curves of several Transformers with different input lookback days, predicting the next: (a) 7 days on ETTh1, (b) 1 day on ETTh2, (c) 3 days on ETTm1, (d) 7 days on PL, (e) 30 days on ECL, (f) 3 days on Traffic, (g) 1 day on Weather, (h) 720 days on ER time series.

Figure 7: DERT encoder attention weight distributions from PL dataset during 2012, the darker the cell, the greater the attention weight. (a) is encoding Jan 2, Jan 1 is New Year's day; (b) is encoding Jan 24, in traditional Chinese lunar calendar, Jan 22 is New Year's Eve and Jan 23 is the Spring Festival; (c) is encoding Apr 5, Apr 4 is the Qingming Festival. (d) is encoding Oct 2, Sep 30 is the Mid-Autumn Festival in Chinese lunar calendar, and Oct 1 is National Day in Gregorian calendar.

Figure 8: 3 days prediction cases from ETTm1 oil temperature series

Figure 9: 3 days prediction cases from ETTh2 oil temperature series

Figure 10: 7 days prediction cases from Traffic series

Figure 13: Auto-correlation of 4 time series remainders from ETT oil temperature series.

2 TIME-REPRESENTING

To facilitate distilling the training set's global information, we should establish appropriate time representations as the container to distill and store it. In this paper, we split time series into patches by day, so we mostly study a special case of the problem: how to represent a date? Informer (Zhou et al., 2021) provides a time-encoding that embeds time by stacking several time features into a vector. These time features contain rich time semantics and hence can represent time appropriately. We follow this approach and collect more date features via a set of simple algorithms to generate our date-embedding. But this date-embedding is static. In practice, people pay different attention to various date features, and this attention is dynamically adjusted as the date context changes. At ordinary times, people are more concerned about what day of the week or month it is. When approaching important festivals, attention may shift to other date features. For example, few people care that Jan 2, 2012, is a Monday, because the day before it is New Year's Day (see Figure 1b for an example). This semantic shifting is similar to polysemy in the NLP field, so we call it date polysemy. It reminds us that a superior date-representation should consider the contextual information of the date. However, the date axis extends infinitely. The contextual boundaries of a date are open, which is distinct from words, which are usually located in sentences and hence have a comparatively closed context. Intuitively,
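The static date-embedding built by stacking simple calendar features can be sketched as follows. The feature set here is illustrative (the paper's actual feature list is richer), and the scaling to roughly [-0.5, 0.5] is an assumption modeled on common time-feature conventions.

```python
import datetime

def date_features(d: datetime.date) -> list:
    """Static date-embedding sketch: stack simple calendar features into
    a vector, each scaled to roughly [-0.5, 0.5]. Illustrative only."""
    return [
        d.weekday() / 6.0 - 0.5,                     # day of week
        (d.day - 1) / 30.0 - 0.5,                    # day of month
        (d.month - 1) / 11.0 - 0.5,                  # month of year
        (d.timetuple().tm_yday - 1) / 365.0 - 0.5,   # day of year
        float(d.weekday() >= 5) - 0.5,               # weekend flag
    ]
```

Such a vector is static by construction: Jan 2, 2012, gets the same embedding regardless of its proximity to New Year's Day, which is exactly the limitation the dynamic date-representation addresses.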

Predictive Network As shown in Equation 2, given a day's (patch's) dynamic date-representation d, we duplicate it into G copies (G denotes the number of series time-steps per day) and add sequential positional encodings to these copies, thereby obtaining finer time-representations {t 1 , t 2 , t 3 , ..., t G } that represent each series time-step of the day. Then, to get the day's global prediction, we employ a position-wise feed-forward network to map these time-representations into time series.
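The duplicate-add-map procedure above can be sketched as a small PyTorch module. This is a minimal sketch: the class name and layer sizes are assumptions, and learned positional encodings stand in for the paper's canonical sinusoidal ones for brevity.

```python
import torch
import torch.nn as nn

class GlobalPredictionHead(nn.Module):
    """Sketch of the predictive network: duplicate a day's dynamic
    date-representation d into G copies, add sequential positional
    encodings, and map each finer time-representation to a series value
    with a position-wise feed-forward network."""
    def __init__(self, d_model: int, steps_per_day: int):
        super().__init__()
        self.G = steps_per_day
        # learned positional encodings for the G intraday steps
        # (the paper uses the canonical sinusoidal PE)
        self.pos = nn.Parameter(torch.randn(steps_per_day, d_model) * 0.02)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model),
                                 nn.ReLU(),
                                 nn.Linear(d_model, 1))

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        # d: (batch, d_model) -> (batch, G, d_model) time-representations
        t = d.unsqueeze(1).expand(-1, self.G, -1) + self.pos
        return self.ffn(t).squeeze(-1)  # (batch, G) global prediction
```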









144, 1.672 → 0.176) in PL, 42% (1.056 → 0.690) in ETTm1, 38.6% (0.702 → 0.431) in ETTh2, 59.8% (0.316 → 0.220, 2.231 → 0.240) in ECL, 79.4% (2.557 → 0.526) in Traffic, and 31.5% in ER. Overall, Dateformer yields a 33.6% averaged accuracy improvement among all setups. Compared with other models, its errors rise the most steadily as the forecast horizon grows, which means Dateformer gives the most credible and robust long-range multivariate time series forecasting results. OOM: Out Of Memory. "-" means the model fails to train because the validation set is too short to provide even one sample. The (number) in parentheses denotes each dataset's number of time-steps per day. A lower MSE or MAE indicates a better prediction. The best results are highlighted in bold and the second best with an underline.

Multivariate time series forecasting comparison of different forecast components results. The results highlighted in bold indicate which component contributes the best prediction.

Few sample learning multivariate results on ETTm1, only MSE is reported.

split images into patches, thereby enabling vanilla Transformers to tackle abundant pixels. Compared to highly abstract human language, we argue that the information characteristics of time series are more similar to those of images: both are natural signals, with numerical continuity between adjacent signals. Inspired by this, we follow ViT in splitting time series into patches, thereby enhancing the throughput of time series forecasting Transformers and enabling vanilla Transformers to predict long-term series. In addition to these structures, there are also other neural network methods, such as N-BEATS (Oreshkin et al., 2019), which uses decomposition. Most of the above forecast methods can be summarized as learning a mapping from the lookback window to the forecast window, and some of them leverage time features to assist prediction. In the end, however, they are information-constrained because they can only rely on the local information in the lookback window. To the best of our knowledge, ours is the first time series forecasting work that distills the whole training set's global information into time-representations and takes time, instead of series, as the primary modeling entity: the time-modeling strategy.

Univariate series forecast results. A lower MSE or MAE indicates a better prediction. The best results are highlighted in bold and the second best with an underline. Univariate results In the univariate time series forecasting setting, Dateformer still achieves consistent state-of-the-art performance under all forecast setups. For the long-term forecasting setting, Dateformer gives MSE reductions of 82.3% (0.442 → 0.135, 1.932 → 0.162) in PL, 77.4% (0.695 → 0.157) in Traffic, and 23.7% in ER. Overall, Dateformer yields a 41.6% averaged accuracy improvement among all univariate forecast setups on the 3 representative datasets.

Global prediction multivariate comparison using the 2 time-encodings. As shown in Table 5, the global prediction component that employs the dynamic date-representation always contributes a better global prediction. This suffices to prove the effectiveness of our proposed dynamic time-representing method. To show more intuitively the importance of date contextual information, and how DERT attends to the date context, we visualize some attention weight distributions of the DERT encoder. The figures are shown as follows.

Ablation of time-encoding on ETTh1 multivariate forecast tasks with MSE metric. Without eliminates time-encoding. Origin adopts time-encoding proposed by Zhou et al. (2021). Static and Dynamic denote the static date-embedding and dynamic date-representation respectively. Note that Origin and Static are essentially the same, Static just simply collects more time features.

Ablation of time-encoding on Traffic multivariate forecast tasks with the MSE metric. "-" means the model cannot be trained, because a time-encoding is necessary for Dateformer. The best results are highlighted in bold.

Multivariate forecast comparison of Dateformer using different pre-training tasks. Mean adopts only pre-training task of mean prediction. Auto employs separate Auto-correlation prediction to pre-train DERT. Both combines the 2 pre-training tasks together, and it's adopted by us. The best results are highlighted in bold. "-" denotes lacking test samples to report result.

Multivariate time series forecasting comparisons of different training strategies, where Dateformer 11 goes through all 3 stages, Dateformer 10 skips the warm-up, Dateformer 01 skips the pre-training stage, and Dateformer 00 is trained end-to-end. The * marked group of results is selected to compare with baseline models in the main text. The best results are highlighted in bold. We train Dateformer with different training stages, and the results are shown in Table 9. It can be seen that different training stage setups lead to different Dateformer performances. No single training strategy does best on all forecast horizons of all datasets. Generally, pre-training and warm-up induce Dateformers to be more concerned with the global pattern of time series and to produce robust long-term predictions. For some datasets, pre-training a DERT encoder before distilling global information from the training set enhances the distilling effect; for other datasets, the 2 stages can be combined into one. Actually, the global prediction is itself a fairly good pre-training task for DERT. The end-to-end trained Dateformer always contributes the best short-term predictions: because we train Dateformer on the short-term forecasting task of 7d-predict-1d, direct end-to-end training makes Dateformer care more about the local pattern of time series.

Multivariate forecast comparison of Dateformer using different patch sizes. "-" denotes that the patch size is too coarse to align with the forecast horizon.

Dateformer's multivariate global prediction results on PL dataset using several paddings.

For an input lookback window of length L_i and output forecast window of length L_o, both are divisible by G under the setting of splitting time series into patches by day, where G denotes the number of time-steps per day of the time series dataset. The complexity of the global prediction component is O(L_i + L_o). The local prediction component recurses L_o/G times, and the attention cost of each recursion does not exceed O(((L_i + L_o)/G)^2). So, the time complexity of the local prediction component is O((L_o/G)(L_i + L_o + ((L_i + L_o)/G)^2)).
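The bound above can be made concrete with a small helper (the function name is ours, and the per-recursion cost follows the reconstructed bound: a linear term over all time-steps plus quadratic attention over the patch tokens).

```python
def local_prediction_cost(L_i: int, L_o: int, G: int) -> int:
    """Upper bound on the local prediction component's cost: it recurses
    L_o/G times, and each recursion costs at most
    L_i + L_o + ((L_i + L_o) / G) ** 2, where the quadratic term is the
    attention over patch tokens (largest at the last day's prediction)."""
    assert L_i % G == 0 and L_o % G == 0, "windows must align with whole days"
    per_recursion = L_i + L_o + ((L_i + L_o) // G) ** 2
    return (L_o // G) * per_recursion
```

For example, a 7-day lookback and 7-day horizon on an hourly-like series with G = 96 gives 7 recursions over 14 patch tokens each.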

Quantitative results with fluctuations of different forecast days for multivariate forecast.

M LOCAL PREDICTIONS WITH GENERATIVE STYLE

In Section 3.2, we introduce Dateformer's local prediction component, implemented with the vanilla Transformer. Note that the vanilla Transformer only produces similarities used to aggregate series residuals; it does not directly generate the local prediction. How does Dateformer perform if we employ a generative-style local prediction component instead? As a basic comparison, we modify Dateformer's local prediction component to be generative:

• Dateformer-GD: removing Equation 4's FFN and SoftMax that calculate similarities and attaching an FFN to d_{P+1} to directly generate the local prediction corresponding to d_{P+1}.

• Dateformer-GR: inputting Date Representations to the Transformer's encoder and Series Residuals to the Transformer's decoder. For the decoder's output {r_1, r_2, ..., r_P}, we use:

r̄ = AveragePooling(r_1, r_2, ..., r_P), Local Prediction = FFN(r̄)    (8)

to directly generate the local prediction corresponding to d_{P+1}.

Table 13 gives the multivariate forecast comparison of these Dateformer variants ("-" denotes lacking test samples to report a result). As shown in Table 13, the local prediction component with aggregating style always outperforms the generative style. Splitting time series into patches considerably reduces the number of training samples, which makes the generative-style local prediction component prone to overfitting. To mitigate the problem, we introduce an inductive bias (the local patterns of adjacent time series are similar) and hence design the aggregating-style local prediction component to aggregate similar local pattern information.
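Equation 8's output head for the Dateformer-GR variant can be sketched as a small module. The class name and hidden sizes are illustrative, not the paper's implementation; only the average-pooling-then-FFN structure follows Equation 8.

```python
import torch
import torch.nn as nn

class GenerativeLocalHead(nn.Module):
    """Sketch of the Dateformer-GR output head (Equation 8):
    average-pool the decoder outputs over the P lookback days, then map
    the pooled vector to one day's local prediction with an FFN."""
    def __init__(self, d_model: int, steps_per_day: int):
        super().__init__()
        self.ffn = nn.Linear(d_model, steps_per_day)

    def forward(self, decoder_out: torch.Tensor) -> torch.Tensor:
        # decoder_out: (batch, P, d_model)
        r_bar = decoder_out.mean(dim=1)   # AveragePooling over the P days
        return self.ffn(r_bar)            # (batch, G) local prediction
```

Unlike the aggregating-style component, this head must generate the prediction directly, which is why it overfits more easily on the reduced number of patch-level training samples.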



