IRREGULARITY REFLECTION NEURAL NETWORK FOR TIME SERIES FORECASTING

Abstract

Time series forecasting is a long-standing challenge in a variety of industries, and deep learning stands as the mainstream paradigm for this forecasting problem. Following recent success, representations of time series components (e.g., trend and seasonality) are also considered in the learning process of these models. However, the residual remains underexplored due to the difficulty of formulating its inherent complexity. In this study, we propose a novel Irregularity Reflection Neural Network (IRN) that reflects the residual for time series forecasting. First, we redefine the residual as the irregularity and express it as a sum of individual, short regular waves, considering the Fourier series from a micro perspective. Second, we design a module based on convolutional architectures, named the Irregularity Representation Block (IRB), to mimic the variables of the derived irregularity representation. IRN comprises IRB on top of a forecasting model to learn the irregularity representation of time series. Extensive experiments on multiple real-world datasets demonstrate that IRN outperforms state-of-the-art benchmarks in time series forecasting tasks.

1. INTRODUCTION

Figure 1: The Traffic data and its time series components (i.e., trend, seasonality, and irregularity).

Owing to ubiquitous computing systems, time series data are available in a wide range of domains, including traffic (Chen et al., 2001), power plants (Gensler et al., 2016), stock market indices (Song et al., 2021), and more (Liu et al., 2015; Duan et al., 2021). Naturally, interest in time series forecasting has grown, resulting in intensive research toward more accurate prediction. In recent literature, many deep learning models have been favored for forecasting problems (Lim & Zohren, 2021). Recurrent Neural Networks (RNNs) and their extensions, such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Chung et al., 2014), are popular choices for analyzing long sequences. Nevertheless, these models tend to be restricted in handling multivariate time series. As a powerful alternative, Convolutional Neural Networks (CNNs) have been introduced to capture the overall characteristics of time series through parallel calculations and filter operations. Building on this success, CNN-based models have been proposed according to the type of time series data: the Temporal Convolutional Network (TCN) was applied to audio datasets (Oord et al., 2016), whereas the Graph Convolutional Network (GCN) was utilized for time series with graph characteristics (e.g., human skeleton-based action recognition (Zhang et al., 2020) and traffic datasets (Bai et al., 2020)). Attention models have also been applied to emphasize the specific sequence data that are primarily referenced when making predictions (Liu et al., 2021b). Despite these great efforts, forecasting performance has room for further improvement, as the aforementioned models learn feature representations directly from complex real-world time series, often overlooking essential information.
Recently, incorporating representations of time series components (e.g., trend, seasonality) used in conventional econometric approaches has been shown to improve the performance of learning models. For instance, N-BEATS (Oreshkin et al., 2019), Autoformer (Wu et al., 2021), and CoST (Woo et al., 2022) reflected the trend and seasonality of the time series and achieved improvements. However, as shown in Figure 1, time series also include an irregularity that is not accounted for by the trend and seasonality and is as yet underexplored (Woo et al., 2022). To address this challenge, we show how to handle the irregularity of time series data to improve the forecasting performance of deep learning models. To this end, we express the irregularity as an encodable representation on the basis of the Fourier series viewed from a micro perspective. The derived representation is encoded using convolutional architectures in a module named the Irregularity Representation Block (IRB). IRB embedded on a base model then builds the Irregularity Reflection Neural Network (IRN). We demonstrate that IRN outperforms existing state-of-the-art forecasting models on eleven popular real-world datasets.

2. RELATED WORK

2.1. DEEP LEARNING FOR TIME SERIES FORECASTING

Sequential deep learning models such as RNNs, LSTM, and GRU have long been used for time series forecasting (Elman, 1990; Hochreiter & Schmidhuber, 1997; Chung et al., 2014). Although effective in capturing the temporal dependencies of time series, RNN-based models neglect the correlations between time series. To tackle this issue, Liu et al. (2020) propose a dual-stage two-phase (DSTP) model to extract spatial and temporal features, and Shi et al. (2015) present ConvLSTM, replacing the states of the LSTM block with convolutional states. Another limitation of sequential models is that the discrepancy between ground truth and prediction accumulates over time, as earlier predictions are used to predict further into the future (Liu et al., 2021a). More recent works have demonstrated that CNNs can be applied to multivariate time series problems as well. Ravi et al. (2016) introduce 1D convolution for human activity recognition, whereas Zhao et al. (2017) suggest the use of 2D convolution. CNN models are parallelizable and hence offer the following advantages: consideration of the correlation between variates and prevention of error accumulation (Liu et al., 2019). A downside is the limited receptive field when predicting long sequences, due to the increasing number of parameters (Zhao et al., 2017). Wang et al. (2019) tackle this challenge by decomposing long sequences into long-term, short-term, and closeness components. CNN-based models have received increasing attention for enhancing forecasting performance. For example, the dilated causal convolutional layer is used to increase the receptive field by downsampling and to improve long-sequence prediction (Sen et al., 2019; Oord et al., 2016). Another approach is the Graph Convolutional Network (GCN), which analyzes the relations between nodes with specific positions and edge relations, especially in traffic data (Fang et al., 2021; Song et al., 2020) and human body skeleton data (Yoon et al., 2022; Chen et al., 2021).
Attention-based models have also been adopted (Liu et al., 2019) and further developed into Transformers (Zhou et al., 2021; Liu et al., 2021b). However, these approaches do not take into account characteristics of time series such as trend, seasonality, and irregularity.

2.2. REFLECTING THE REPRESENTATIVE COMPONENTS OF TIME SERIES

Considerable studies on time series analysis have relied on the decomposition of time series into non-random components. For instance, DeJong et al. (1992) analyzed the trends of macroeconomic time series, and Lee & Shen (2009) emphasized the importance of obtaining significant trend relationships in linear time complexity. Jonsson & Eklundh (2002) extracted and analyzed the seasonality of time series data, and Taylor & Letham (2018) considered both trend and seasonality. When these non-random components are extracted, a non-stationary time series becomes stationary, meaning time-independent. As conventional statistical methods such as ARIMA (Autoregressive Integrated Moving Average) (Williams & Hoel, 2003) and GP (Gaussian Process) (Van Der Voort et al., 1996) perform better on stationary data (Cheng, 2018), differencing for stationarity has been conducted (Atique et al., 2019). As such, extraction of representative time series components for forecasting problems has been a major research topic (Brockwell & Davis, 2009; Cleveland et al., 1990). Recently, direct learning from input sequences by deep forecasting models is regarded as sufficient, so researchers focus on how to incorporate the components of time series into the learning process. For instance, Oreshkin et al. (2019) proposed a hierarchical doubly residual topology as an interpretable architecture to extract time series representations of trend and seasonality. Wu et al. (2021) proposed a Transformer-based model that decomposes and reflects the trend and seasonality by using an auto-correlation mechanism. Woo et al. (2022) introduced disentangled seasonal-trend representation learning using independent mechanisms; they devised disentanglers for the trend and seasonal features, mainly composed of a discrete Fourier transform to map the intermediate features to the frequency domain.
These studies successfully reflect representations of trend and seasonality, which are time-dependent components, and improve forecasting performance. However, the irregularity, which cannot be explained by the trend or seasonality and is time-independent, is not sufficiently addressed. In this paper, we build and reflect the irregularity representation to complement previous research on forecasting tasks.

3. METHODOLOGY

In this section, we discuss how to reinterpret the irregularity of a time series in terms of the Fourier series, and how to extract and reflect the irregularity representation using convolutional architectures. The proposed model, IRN, is shown in Figure 2.

3.1. THEORETICAL APPROACH

A time series is generally in the form of an irregularity; hence, representing the irregularity is essential for time series forecasting. Among the many existing approaches to represent irregularity, the Fourier series is perhaps the most widely used. It approximates an irregularity by the linear superposition of multiple regular waves with varying height, period, and direction, as depicted in Figure 3(a) (Bloomfield, 2004). The irregularity $\psi(t)$ can be expressed as:

$$\psi(t) = \sum_{n=0}^{\infty} C_n r_n(t) \tag{1}$$

where $r_n(t)$ is the $n$-th regular wave, $C_n$ is the coefficient of $r_n(t)$, and $t$ is the time. The infinite sum in Equation 1 is challenging for a learning model to handle. Therefore, we reinterpret $\psi(t)$ into an encodable equation by viewing it at the micro level. When the irregularity $\psi(t)$ in the time domain is observed over a short interval $t_{a\sim b}$, it can be interpreted as a regular wave $\psi(t_{a\sim b})$ with a vertical shift, which is the average value of the remaining regular waves. Under this concept, Equation 1 is rewritten as:

$$\psi(t_{a\sim b}) = C_0 r_0(t_{a\sim b}) + \sum_{n=1}^{\infty} C_n r_n(t_{a\sim b}) \tag{2}$$

where $C_0 r_0(t_{a\sim b})$ is the representative regular wave, i.e., a regular wave with a mean of 0 and no vertical shift. The representative regular wave $r_0(t_{a\sim b})$ oscillates between constant maximum and minimum values over a period of time and can be defined as $\text{Amplitude} \times \sin(\omega \times t_{a\sim b})$, where the angular velocity $\omega$ is constant due to the periodicity of the wave, $\omega \times t_{a\sim b}$ is denoted as the angle $\theta$ of $r_0(t_{a\sim b})$, and $\sin(\omega \times t_{a\sim b})$ is the phase of $r_0(t_{a\sim b})$. The amplitude is calculated from the peaks of the wave. Accordingly, the representative regular wave $r_0(t_{a\sim b})$ can be rewritten as:

$$r_0(t_{a\sim b}) = \frac{\max(t_{a\sim b}) - \min(t_{a\sim b})}{2} \times \sin(\theta(t_{a\sim b})) \tag{3}$$

where $\sin(\theta(t_{a\sim b}))$ is the phase of $r_0(t_{a\sim b})$ at $t_{a\sim b}$. Therefore, $C_0 r_0(t_{a\sim b})$ in Equation 2 is redefined by referring to Equation 3. The remaining infinite term $\sum_{n=1}^{\infty} C_n r_n(t_{a\sim b})$ in Equation 2 corresponds to the vertical shift and can be expressed as the average value $A(t_{a\sim b})$ of $\psi(t_{a\sim b})$, as depicted in Figure 3(b). Substituting the representative regular wave and the average value converts Equation 2 into:

$$\psi(t_{a\sim b}) \approx A(t_{a\sim b}) + \frac{\max(t_{a\sim b}) - \min(t_{a\sim b})}{2} \times \sin(\theta(t_{a\sim b})) \tag{4}$$

When these regular waves are sequentially connected, we obtain the irregularity $\psi(t)$ consisting of the regular waves that change with time $t_{a\sim b}$. We thus redefine Equation 4 over the whole series as:

$$\psi(t) \approx A(t) + \frac{\max(t) - \min(t)}{2} \times \sin(\theta(t)) \tag{5}$$

where $\frac{\max(t)-\min(t)}{2}$ is the amplitude of the regular wave $C_0 r_0(t_{a\sim b})$ at $t_{a\sim b}$, $\sin(\theta(t))$ is its phase, and $A(t)$ is the average, i.e., the sum of the remaining regular waves $\sum_{n=1}^{\infty} C_n r_n(t_{a\sim b})$ at $t_{a\sim b}$. According to Equation 5, the irregularity $\psi(t)$ can be represented by a combination of the minimum, maximum, average, and phase values.
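The window-wise decomposition of Equation 5 can be sketched numerically. The snippet below is an illustrative sketch, not the paper's learned model: it splits a series into short windows, takes the window mean as $A$, half the range as the amplitude, and recovers the phase term by normalizing the de-meaned window (in IRN this term is instead learned by convolutional layers). The window length and the test signal are arbitrary assumptions.

```python
import numpy as np

# Sketch of Eq. (5): within a short window t_{a~b}, the signal is approximated
# by its mean (vertical shift) plus a representative regular wave of amplitude
# (max - min)/2; the normalized de-meaned window stands in for sin(theta).
def window_decomposition(window: np.ndarray):
    avg = window.mean()                          # A(t_{a~b})
    amp = (window.max() - window.min()) / 2.0    # (max - min)/2
    phase = (window - avg) / (amp + 1e-8)        # proxy for sin(theta(t_{a~b}))
    return avg, amp, phase

def reconstruct(signal: np.ndarray, win: int) -> np.ndarray:
    out = np.empty_like(signal, dtype=float)
    for s in range(0, len(signal), win):
        w = signal[s:s + win]
        avg, amp, phase = window_decomposition(w)
        out[s:s + win] = avg + amp * phase       # Eq. (5) per window
    return out

t = np.linspace(0, 10, 500)
x = np.sin(2 * np.pi * t) + 0.3 * np.sin(11 * t) + 0.05 * t  # irregular series
approx = reconstruct(x, win=25)
print(np.abs(x - approx).max())  # near-zero: the decomposition is near-exact here
```

The reconstruction is near-exact only because the phase proxy is taken directly from the data; the point of IRB is to make the average, amplitude, and phase terms trainable instead.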

3.2. IRREGULARITY REPRESENTATION BLOCK

Based on Equation 5, the irregularity is encoded so that it can be incorporated into deep learning models. In this paper, convolutional architectures are adopted since convolutional layers allow parallel prediction as well as analysis of the relations in multivariate time series through filter operations. We input the multivariate time series $x_{input} \in \mathbb{R}^{T \times d}$, where $T$ is a look-back window of fixed length and $d$ is the number of variates. Our model stacks multiple convolution layers with the ReLU activation function, a dilation filter, and same padding. The ReLU activation increases the model complexity through space folding (Montufar et al., 2014), and the dilation operation helps expand the receptive fields (Oord et al., 2016). With same padding, each convolution layer extracts a feature of the same size as $x_{input}$. Accordingly, Equation 5 is transformed into:

$$x_{irregular} = A(x_{input}) \oplus \frac{\max(x_{input}) - \min(x_{input})}{2} \otimes \sin(\theta(x_{input})) \tag{6}$$

where $\oplus$ is the pointwise summation, $\otimes$ is the pointwise multiplication, and $x_{irregular}$ is the irregularity. Through this transformation, $x_{irregular}$ is converted from time dependent to data dependent, and the main operations (i.e., $A(\cdot)$, $\max(\cdot)$, $\min(\cdot)$, and $\sin(\theta(\cdot))$) are expressed using convolution and pooling layers, which extract the average, maximum, minimum, and phase values from $x_{input}$ under the microscopic perspective condition. The main operations are encoded as in Figure 4:

$$M_{aver}(x_{input}) = R_{C_{av}}(P_{av}(R_{C_0}(x_{input}))) \approx A(x_{input}) \tag{7}$$

$$M_{amp}(x_{input}) = \frac{R_{C_{max}}(P_{max}(R_{C_0}(x_{input}))) - R_{C_{min}}(P_{min}(R_{C_0}(x_{input})))}{2} \approx \frac{\max(x_{input}) - \min(x_{input})}{2} \tag{8}$$

$$M_{phase}(x_{input}) = T_{tanh}(C_{phase}(R_{C_0}(x_{input}))) \approx \sin(\theta(x_{input})) \tag{9}$$

where $P_{max}$, $P_{min}$, and $P_{av}$ are the max, min, and average pooling operations, respectively, $C$ is a convolution layer without activation, $R$ is a convolution layer with ReLU activation, and $T_{tanh}$ is the hyperbolic tangent (tanh) activation.

Through this process, the average, amplitude, and phase values in Equation 6 are converted to trainable values. To extract the representation of the average value from $x_{input}$, we stack a 2D convolution filter and 2D average pooling, as in Equation 7. To decompose the representation of the amplitude from $x_{input}$, we construct the structure of Equation 8 with 2D max and min pooling. To obtain the representation of the adaptive phase value from $x_{input}$ under the microscopic aspect condition, we use the tanh activation after a convolution layer, referring to Equation 9. Consequently, these operations (i.e., $M_{aver}(x_{input})$, $M_{amp}(x_{input})$, and $M_{phase}(x_{input})$) extract the average, amplitude, and phase values from $x_{input}$, and we redefine Equation 6 as follows:

$$x_{irregular} = M_{aver}(x_{input}) \oplus M_{amp}(x_{input}) \otimes M_{phase}(x_{input}) \tag{10}$$

To exploit $x_{irregular}$, we apply the residual stacking principle, which enables complex interpretation by combining features in a hierarchical form at each step (Oreshkin et al., 2019). Therefore, we design the IRB architecture as follows:

$$x_{IRB_{output}} = R_{C_{out}}(R_{C_{irr}}(x_{irregular}) \oplus R_{C_0}(x_{input}) \oplus M_{amp}(x_{input}) \otimes M_{phase}(x_{input})) \tag{11}$$

The output of IRB, $x_{IRB_{output}}$, is the representation of the irregularity, which considers the average, amplitude, phase, and input components. Furthermore, these components are trainable because they consist of convolution layers.
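Equations 7 to 11 can be sketched as a small PyTorch module. This is a hedged illustration, not the paper's exact implementation: the hidden width, kernel size, 1D (rather than 2D) convolutions, and stride-1 pooling are all assumptions chosen to keep the temporal length unchanged, matching the "same padding" requirement; min pooling is obtained by negating max pooling.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the Irregularity Representation Block (Eqs. 7-11).
# Layer sizes and kernel sizes are assumptions, not the paper's hyperparameters.
class IRB(nn.Module):
    def __init__(self, d: int, hidden: int = 16, k: int = 3):
        super().__init__()
        pad = k // 2  # "same" padding for stride-1 convolutions
        self.c0 = nn.Conv1d(d, hidden, k, padding=pad)        # shared R_{C0}
        self.c_av = nn.Conv1d(hidden, hidden, k, padding=pad)
        self.c_max = nn.Conv1d(hidden, hidden, k, padding=pad)
        self.c_min = nn.Conv1d(hidden, hidden, k, padding=pad)
        self.c_phase = nn.Conv1d(hidden, hidden, k, padding=pad)
        self.c_irr = nn.Conv1d(hidden, hidden, k, padding=pad)
        self.c_out = nn.Conv1d(hidden, d, k, padding=pad)
        # stride-1 pooling with same padding keeps the temporal length T
        self.pool_max = nn.MaxPool1d(k, stride=1, padding=pad)
        self.pool_av = nn.AvgPool1d(k, stride=1, padding=pad)
        self.relu = nn.ReLU()

    def forward(self, x):                                     # x: (batch, d, T)
        h = self.relu(self.c0(x))                             # R_{C0}(x_input)
        m_aver = self.relu(self.c_av(self.pool_av(h)))        # Eq. 7
        h_max = self.relu(self.c_max(self.pool_max(h)))
        h_min = self.relu(self.c_min(-self.pool_max(-h)))     # min-pool via negated max-pool
        m_amp = (h_max - h_min) / 2                           # Eq. 8
        m_phase = torch.tanh(self.c_phase(h))                 # Eq. 9
        x_irr = m_aver + m_amp * m_phase                      # Eq. 10
        res = self.relu(self.c_irr(x_irr)) + h + m_amp * m_phase
        return self.relu(self.c_out(res))                     # Eq. 11

x = torch.randn(2, 7, 48)   # (batch, variates d, look-back window T)
irb = IRB(d=7)
print(irb(x).shape)         # torch.Size([2, 7, 48])
```

Because every operation preserves the temporal length, the block's output can be fed to any base forecasting model alongside the original input.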

3.3. IRREGULARITY REFLECTION NEURAL NETWORK

IRN consists of IRB and an irregularity reflection module, as shown in Figure 2. For the forecasting in this study, a recent model that reflects trend and seasonality, SCINet (Liu et al., 2021a), is used as the base model. $x_{IRB_{output}}$ is passed to the base model through the irregularity reflection module:

$$x_{IR_{output}} = S_{sig}(x_{IRB_{output}}) \otimes x_{input} \oplus x_{input} \tag{12}$$

where $x_{IR_{output}}$ is the output of the irregularity reflection module and $S_{sig}$ is the sigmoid activation. The pointwise multiplication emphasizes the irregularity of $x_{input}$ by gating it with $S_{sig}(x_{IRB_{output}})$. If we used $x_{IRB_{output}}$ directly as the input of the time series model, some information (e.g., trend, seasonality) could be omitted. To alleviate this problem, we preserve the original information through a residual connection, which also prevents gradient vanishing (He et al., 2016).
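Equation 12 is a one-line gating operation; the sketch below illustrates it in isolation (the tensor shapes and the zero-valued gate used for the check are assumptions for the example, not values from the paper).

```python
import torch

# Sketch of the irregularity reflection step (Eq. 12): the IRB output gates the
# input through a sigmoid, and a residual connection preserves the original series.
def irregularity_reflection(x_input, x_irb_output):
    return torch.sigmoid(x_irb_output) * x_input + x_input

x = torch.randn(2, 7, 48)
gate = torch.zeros_like(x)   # sigmoid(0) = 0.5, so the output is exactly 1.5 * x
print(torch.allclose(irregularity_reflection(x, gate), 1.5 * x))  # True
```

A zero gate halves the emphasis term while the residual path keeps the input intact, which is why trend and seasonality information survives even when IRB contributes little.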

4. EXPERIMENTS

We conduct experiments on 11 real-world time series datasets and compare the performance with the latest baselines, analyzing the circumstances in which the proposed IRB improves the forecasting performance. We follow the base model (Liu et al., 2021a) for the experiment settings. Due to page limits, implementation details including the loss function, datasets, and metrics are reported in the Appendix.

4.1. DATASET

Experiments are conducted on the following time series datasets: Electricity Transformer Temperature (ETT) (Zhou et al., 2021), PEMS (Chen et al., 2001), Solar-Energy, Traffic, Electricity, and Exchange-rate (Lai et al., 2018). The datasets, experiment settings, and metrics are summarized in Table 1.

4.2. BASELINES

For each dataset, we compare IRN with the latest baselines: (1) for ETT, Transformer-based methods (i.e., LogTrans (Li et al., 2019), Informer (Zhou et al., 2021), Autoformer (Wu et al., 2021), Reformer (Kitaev et al., 2020), TST (Zerveas et al., 2021), and Pyraformer (Liu et al., 2021b)) and feature representation learning based methods (i.e., TCC (Eldele et al., 2021), N-BEATS (Oreshkin et al., 2019), CPC (Oord et al., 2018), Triplet (Franceschi et al., 2019), MoCo (He et al., 2020), TNC (Tonekaboni et al., 2021), TS2Vec (Yue et al., 2022), SCInet (Liu et al., 2021a), and CoST (Woo et al., 2022)); (2) for PEMS, LSTM (Hochreiter & Schmidhuber, 1997), CNN-based methods (i.e., TCN and DCRNN (Li et al., 2017)), SCInet, and graph-based methods (i.e., STGCN (Yu et al., 2017), AST-GCNr (Guo et al., 2019), STSGCN (Song et al., 2020), STFGNN (Li & Zhu, 2021), AGCRN (Bai et al., 2020), and DSTAGNN (Lan et al., 2022)); (3) for Solar-Energy, Traffic, Electricity, and Exchange-rate, AR, VAR-MLP (Zhang, 2003), GP (Frigola, 2015), GRU, LSTNet (Lai et al., 2018), TPA-LSTM (Shih et al., 2019), SCInet, and MTGNN (Wu et al., 2020).

4.3. EXPERIMENTAL RESULTS

We summarize the performances of IRN and the baseline models in Tables 2 to 5. IRN demonstrates state-of-the-art performance in 36 cases and near-best performance in 14 cases. Autoformer performs better for long-term forecasting on the ETTh2 dataset, as it shows strengths in reflecting trend and seasonality, which are more apparent in longer sequences (Wu et al., 2021). In a similar vein, features are more evident in univariate time series, which explains the higher performance of MoCo (He et al., 2020) and CoST (Woo et al., 2022), both feature representation learning models, on 6 cases of the ETT univariate datasets. MTGNN (Wu et al., 2020), a model specialized for analyzing edge relations, yields the best performance on the Traffic and Electricity datasets, as both contain complex edge relations between nodes. Finally, compared to attention-based models, IRN shows lower performance on the Exchange-rate dataset due to the strong random-walk property of the time series (Wright, 2008). Overall, IRN successfully reflects the irregularity representation and complements the base model to achieve higher forecasting performance.

4.4. ABLATION STUDY

We perform an ablation study to demonstrate the benefit obtained by IRB. We plot the ground truths and the corresponding predictions of IRN and the base model at the 499th, 500th, 501st, and 510th sequences of the ETTh1 data, as shown in Figure 5. In Figure 5(a), the original time series has a peak in the predicting region shaded in grey. Up to sequence 499, IRN and the base model make similar predictions with large errors. When a sequence is added, as in Figure 5(b), the discrepancy between the ground truth and the predicted values of IRN decreases. With an additional sequence, in Figure 5(c), IRN quickly reflects the change and makes a better forecast than the base model. The base model is less sensitive to the change of the input sequences, giving similar predictions from sequence 499 to sequence 501; only after 11 sequences have passed does it consider the actual changes, as in Figure 5(d). This verifies that IRN can reflect irregular features (instantaneous changes).

Next, we observe the 24-horizon forecasts on the Traffic dataset for further analysis. In Figure 6(a), cycles 2 to 7 consist of values lower than 0.2, whereas cycle 1 includes irregular values greater than 0.3. IRN has larger errors than the base model because IRN instantaneously reflects the irregularity. In contrast, IRN performs better than the base model when the irregularity persists, as shown in Figure 6(b). Reflecting the irregularity does not always end in a better forecast, but IRB consistently improves the forecasting performance of the base model, which confirms the effectiveness of our model. The difference in average MSE between IRN and the base model according to the variation of irregularity is reported in Table 6. The results indicate that higher performance improvement is attained in case 2 than in case 1 for both datasets, implying that the higher the irregularity variation, the higher the performance improvement that can be achieved.

5. CONCLUSION

In this paper, we propose the Irregularity Reflection Neural Network (IRN), a deep learning based model for time series forecasting that reflects the irregularity in time series. We introduce a novel expression of the irregularity based on the Fourier series under the microscopic perspective condition and employ it to design the Irregularity Representation Block (IRB), which captures, preserves, and learns the irregularity representation of time series data. By embedding IRB on a base model, IRN is further proposed. Experiments on a variety of real-world datasets show that IRN consistently outperforms existing state-of-the-art baselines, and the ablation study confirms that the proposed methodology reflects the irregularity. Accordingly, we argue that the irregularity representation is essential for improving the performance of machine learning models.

B IMPLEMENTATION DETAILS

Our model and framework are implemented with PyTorch. We train IRN with the Adam optimizer, using eight NVIDIA 2080Ti GPUs to allow a sufficient batch size. Other parameters such as the learning rate, level, stack, single, and multi are changed according to the dataset characteristics, referring to the base model (Liu et al., 2021a).

C DATASETS AND METRICS

C.1 ELECTRICITY TRANSFORMER TEMPERATURE

ETT contains two years of electric power data gathered from two counties in China (hourly subsets ETTh1 and ETTh2, and the 15-minute subset ETTm1). Each data point contains an oil temperature value and six power load components. The train, validation, and test sets consist of 12, 4, and 4 months of data, respectively. We apply zero-mean normalization for data pre-processing. Mean Absolute Error (MAE) (Hyndman & Koehler, 2006) and Mean Squared Error (MSE) (Makridakis et al., 1982) are used as evaluation metrics:

$$\mathrm{MAE} = \frac{1}{h}\sum_{i=0}^{h}|x_i - \hat{x}_i|, \qquad \mathrm{MSE} = \frac{1}{h}\sum_{i=0}^{h}(x_i - \hat{x}_i)^2$$

where $x_i$ is the true value, $\hat{x}_i$ is the predicted value, and $h$ is the prediction horizon size.
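The two metrics above are straightforward; a minimal sketch (aggregation here is over a single series, whereas the reported numbers may also average over variates and windows) is:

```python
import numpy as np

# Minimal MAE / MSE implementations matching the formulas above.
def mae(x, x_hat):
    return np.mean(np.abs(x - x_hat))

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)

x = np.array([1.0, 2.0, 4.0])       # true values
x_hat = np.array([1.0, 3.0, 2.0])   # predictions
print(mae(x, x_hat))  # (0 + 1 + 2) / 3 = 1.0
print(mse(x, x_hat))  # (0 + 1 + 4) / 3 ≈ 1.667
```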

C.2 PEMS

PEMS consists of four public datasets (i.e., PEMS03, PEMS04, PEMS07, and PEMS08), which are separately collected by the Caltrans Performance Measurement System (PeMS) in four sections of California. The data are collected every five minutes. We predict one hour, which consists of 12 data points. Zero-mean normalization is applied for data pre-processing. The evaluation metrics are MAE, Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) (Lai et al., 2018).

$$\mathrm{RMSE} = \sqrt{\frac{1}{h}\sum_{i=0}^{h}(x_i - \hat{x}_i)^2}, \qquad \mathrm{MAPE} = \frac{1}{h}\sum_{i=0}^{h}\left|\frac{x_i - \hat{x}_i}{x_i}\right|$$

For the datasets in Section C.3, the Root Relative Squared Error (RSE) is used:

$$\mathrm{RSE} = \frac{\sqrt{\sum_{i=0}^{h}(x_i - \hat{x}_i)^2}}{\sqrt{\sum_{i=0}^{h}(x_i - \mathrm{mean}(x))^2}}$$
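The remaining metrics can be sketched similarly (the square roots in RMSE and RSE are implied by the metric names; per-dataset aggregation details are assumptions):

```python
import numpy as np

# Minimal RMSE / MAPE / RSE implementations matching the formulas above.
def rmse(x, x_hat):
    return np.sqrt(np.mean((x - x_hat) ** 2))

def mape(x, x_hat):
    return np.mean(np.abs((x - x_hat) / x))

def rse(x, x_hat):
    return np.sqrt(np.sum((x - x_hat) ** 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])       # true values
x_hat = np.array([1.0, 2.0, 3.0, 2.0])   # predictions, one error of 2
print(rmse(x, x_hat))  # sqrt(4/4) = 1.0
print(mape(x, x_hat))  # (0 + 0 + 0 + 0.5) / 4 = 0.125
```

Unlike RMSE, RSE normalizes by the spread of the true series, which makes scores comparable across datasets with different scales.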



Figure 2: An overview of IRN framework. In IRN, (a) IRB extracts the irregularity feature from the input sequences and (b) Irregularity Reflection module conducts the time series forecasting.

Figure 3: (a) The irregular wave ψ(t) consisting of multiple regular waves and (b) the irregular wave from a micro perspective ψ (t a∼b ).

Figure 4: Architecture of the Irregularity Representation Block.

Figure 5: Forecasting results of IRN and the base model from (a) sequence 499 to 547, (b) sequence 500 to 548, (c) sequence 501 to 549, and (d) sequence 510 to 558 in the ETTm1 data. The ground truths are shown in a solid line. The dotted and dashed lines represent the predicted values of the base model and IRN, respectively. The predicting region is shaded in grey. The bar graph shows the absolute MAE difference between the base model and IRN.


Figure 6: Forecasting results of (a) sequence 71 to 95 and (b) sequence 217 to 241 in traffic data using IRN and base model.


Table 1: Summary of datasets and evaluation metrics used for time series forecasting.

Table 2: Multivariate forecasting performance of IRN and baseline models on the ETT datasets. Best results are highlighted in bold.

Table 3: Univariate forecasting performance of IRN and baseline models on the ETT datasets. Best results are highlighted in bold.

Table 4: Forecasting performance of IRN and baseline models on the PEMS datasets. Best results are highlighted in bold.

Table 5: Forecasting performance comparison of IRN and baseline models on the Solar-Energy, Traffic, Electricity, and Exchange-rate datasets. Best results are highlighted in bold.

Table 6: The difference in average MSE between IRN and the base model according to the variation of irregularity on the ETTm1 and Traffic datasets. Case 1 and Case 2 refer to 500 data points with low and high variation in irregularity, respectively.

C.3 TRAFFIC, SOLAR ENERGY, ELECTRICITY AND EXCHANGE RATE

Traffic includes hourly road occupancy rates ranging from 0 to 1, gathered by sensors from 2015 to 2016. Solar Energy contains the solar power production records of 2016, sampled every 10 minutes from 137 PV plants in Alabama. Electricity collects the hourly electricity consumption (kWh) of 321 clients from 2012 to 2014. Exchange-Rate consists of the daily exchange rates of 8 foreign countries from 1990 to 2016. For these four datasets, the size of the look-back window is 168 and the horizon sizes are 3, 6, 12, and 24. The evaluation metrics are the Root Relative Squared Error (RSE) and the Empirical Correlation Coefficient (CORR):

$$\mathrm{CORR} = \frac{1}{d}\sum_{j=1}^{d}\frac{\sum_{i=0}^{h}(x_{i,j} - \mathrm{mean}(x_j))(\hat{x}_{i,j} - \mathrm{mean}(\hat{x}_j))}{\sqrt{\sum_{i=0}^{h}(x_{i,j} - \mathrm{mean}(x_j))^2 \sum_{i=0}^{h}(\hat{x}_{i,j} - \mathrm{mean}(\hat{x}_j))^2}}$$

where $d$ is the number of variates.

A LOSS FUNCTION

We cover stacked cases in which losses are accumulated. When the dataset has enough training data, we apply $K$ stacked layers (Liu et al., 2021a). To train the $K$-stacked IRN, for the $k$-th intermediate prediction we compute the L1 loss between the $k$-th prediction and the true value:

$$\mathcal{L}_k = \frac{1}{h}\sum_{i=0}^{h}\left|\hat{y}_i^k - y_i\right|$$

where $h$ is the horizon size, $k$ is the stack index, $\hat{y}_i^k$ is the $i$-th horizon prediction of the $k$-th stack, and $y$ is the true value. We apply this loss to the output of each stacked layer, and the total loss of the stacked IRN is the accumulation over stacks:

$$\mathcal{L} = \sum_{k=1}^{K}\mathcal{L}_k$$
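The accumulation over stacks can be sketched as follows (the per-stack L1 form and the plain sum are as described above; the toy predictions are assumptions for illustration):

```python
import numpy as np

# Sketch of the stacked L1 loss: each of the K stacks produces an intermediate
# prediction; its L1 loss against the truth is computed and the per-stack
# losses are summed into the total training loss.
def stacked_l1_loss(preds, y):
    # preds: list of K arrays of shape (h,); y: array of shape (h,)
    return sum(np.mean(np.abs(p - y)) for p in preds)

y = np.array([1.0, 2.0, 3.0])
preds = [np.array([1.0, 2.0, 3.0]),   # perfect stack: loss 0
         np.array([2.0, 2.0, 3.0])]   # one error of 1 over h=3: loss 1/3
print(stacked_l1_loss(preds, y))      # ≈ 0.3333
```

Supervising every intermediate stack, rather than only the final one, gives each stacked layer a direct gradient signal.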

