CASCADED TEACHING TRANSFORMERS WITH DATA REWEIGHTING FOR LONG SEQUENCE TIME-SERIES FORECASTING

Abstract

Transformer-based models have shown superior performance on the long sequence time-series forecasting (LSTF) problem. The sparsity assumption on the self-attention dot-product reveals that not all inputs are equally significant to a Transformer. Instead of implicitly utilizing weighted time-series, we build a new learning framework in which cascaded teaching Transformers reweight the training samples. We formulate the framework as a multi-level optimization problem and design three dataset-weight generators. Extensive experiments on five datasets show that the proposed method significantly outperforms state-of-the-art Transformers.

1. INTRODUCTION

Long sequence time-series forecasting (LSTF) has drawn particular attention with Transformer-based models, for applications such as electricity prediction Li et al. (2019), financial prediction Zhang et al. (2022), and weather prediction Pan et al. (2022). Pairwise self-attention allows time-series points to attend directly to each other, which helps forecast the future from historical observations. The sparsity assumption on the pairwise computation Child et al. (2019) reveals that not all samples are equally important. If we drop unnecessary pairwise connections without violating the problem setting, we can obtain a stronger model with better generalization. For example, (i) Reformer Kitaev et al. (2020) uses hashing buckets to select important query-key pairs; (ii) the LogSparse Transformer Li et al. (2019) only computes attention pairs lying a log-sized step away from the diagonal. The sparsity assumption can be roughly viewed as manipulating the weights within a time-series, or more specifically, as internal reweighting. A key technical challenge preventing further performance improvement from internal reweighting alone is the lack of a specific design of time-series weights under the sparsity-oriented framework. Instead, we can perform the reweighting explicitly: reweight the whole training dataset so that outlier samples are excluded from training, while assigning larger dataset-weights to the samples belonging to the main patterns of the dataset. Inspired by knowledge distillation Hinton et al. (2015) and teacher-student learning Li et al. (2014), we can teach a Transformer to learn from weighted time-series. One common method is to use pseudo-labels Pham et al. (2021) so that the student learns from the teacher's outputs. If the teacher cannot assign accurate labels, the student can hardly learn as well as the teacher does, a phenomenon known as confirmation bias.
To alleviate this drawback, we propose to reweight the inputs in a soft manner. In this paper, we design a cascaded teaching framework with a sequence of Transformer models, where model i teaches model i+1. Specifically, model i generates a pseudo time-series dataset, which is used to train model i+1. Only the first teacher model uses the reweighted dataset; every other model uses the pseudo time-series dataset generated by its predecessor. Finally, the teacher model updates its dataset-weights based on the performance of the student models. The experimental results show that we can significantly improve the performance of a time-series prediction model simply by reweighting the dataset while maintaining the sparsity assumption. More importantly, this cascaded learning framework generalizes easily to other sequence-to-sequence models. Our contributions are summarized as follows:
• We propose a cascaded teaching framework that reweights the teacher model's dataset based on the evaluation of student models, which forces the teacher model to discard noisy data by generating proper dataset-weights.
• We design three dataset-weight generators to compress the trainable dataset-weight parameters.
• Extensive experiments on three datasets (five cases) demonstrate the improvement in time-series learning.

2. RELATED WORK

2.1 TEACHER-STUDENT LEARNING Teacher-student learning has been investigated in knowledge distillation Hinton et al. (2015), adversarial robustness Carlini & Wagner (2017), self-supervised learning Xie et al. (2020), etc. Most of these methods are based on pseudo-labeling, and their focus is to learn a student model with the help of a trained and fixed teacher model; the teacher is not updated. In contrast, our method focuses on learning a teacher model by letting it teach a student model: the teacher constantly updates itself based on the teaching outcome. Teacher-student learning has also been investigated in several neural architecture search works Li et al. (2020); Trofimov et al. (2021); Gu & Tresp (2020). In these works, when searching the architecture of a student model, pseudo-labels generated by a trained teacher model with a fixed architecture are leveraged. Our work differs in that we search the dataset-architecture of a teacher model by letting it teach a student model whose dataset-architecture is fixed, whereas the existing works search the architecture of a student model taught by a teacher whose architecture is fixed. In a recent work Pham et al. (2021), conducted independently of and in parallel to ours, the teacher model is also updated based on the student's performance. Our work differs in that it is based on a three-level optimization framework which searches the teacher's architecture by minimizing the student's validation loss and trains the teacher's network parameters before using the teacher to generate pseudo-labels, whereas the framework in Pham et al. (2021) is based on a two-level optimization with no architecture search and does not train the teacher before using it for pseudo-labeling. A related line of work trains models to generate weights, e.g., Hwang (2019).
In these works, only one model is trained to generate weights on the corresponding components of the network. The difference between our method and the above works is that the dataset-weights trained by our framework are not coupled with the model, so they can be applied to other models trained on the same dataset. The weights for time-series can also be extracted in a Bayesian non-parametric way Saad & Mansinghka (2018) or even from the dataset dynamics Zhang et al. (2021). Our framework differs from these works in that it reweights the entire training dataset by introducing teacher-student learning.

3. TEACHER-STUDENT LEARNING FRAMEWORK

Notation. We start with a teacher model with parameters T and a student model with parameters S. We feed the teacher model with a training dataset $D_t^{(trn)} = \{(s_i, t_i)\}_{i=1}^N$, where $s_i$ denotes the i-th input series and $t_i$ the corresponding outputs. Following the multi-task learning paradigm Caruana (1997), we assign a dataset-weight $p_i \in [0,1]$ to each input sample $(s_i, t_i)$, which forms the dataset-weights $P = \{p_i\}_{i=1}^N$. To avoid introducing prior knowledge, we initialize $P(A)$ to 1, so that all samples are treated equally at the beginning. We emphasize that the reweighting target $(s_i, t_i)$ is not a single time point but a complete time-series sample containing several time points, depending on the input/output length setting. Therefore, the dataset-weights have no effect on the multi-head attention mechanism. Without loss of generality, we assume the end task is time-series prediction.

(Figure 1: the student models perform one-step gradient descent on the pseudo-labeled dataset; the validation loss of the student models is then used to update the dataset-weights.)

Our proposed cascaded teaching strategy can be divided into three stages. First, we train a Transformer as the teacher model T on the reweighted time-series. The training process can be formulated as:

$$T^*(A) \stackrel{\text{def}}{=} \arg\min_T L(T, P(A), D_t^{(trn)}) = \arg\min_T \sum_{i=1}^N p_i(A)\, l(T, s_i, t_i), \qquad (1)$$

where the dataset-weights $P(A) = \{p_i(A)\}_{i=1}^N \in [0,1]^N$ are generated by a group of parameters $A = \{a_j\}_{j=1}^{L_A} \in \mathbb{R}^{L_A}$, with $L_A$ the length of the dataset-weight parameters A. If $p_i$ is close to 0, meaning $(s_i, t_i)$ is unimportant, the corresponding loss term $p_i(A)\, l(T, s_i, t_i)$ shrinks to 0, so $(s_i, t_i)$ is effectively excluded from training, analogous to a sparsity assumption. The parameters A are fixed during this step; otherwise, a trivial solution would be obtained in which all entries of P are set to zero.
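As a concrete illustration of the reweighted objective in Eq. (1), the following minimal sketch computes the $p_i$-weighted sum of per-sample MSE losses. The `model` callable and the toy data are hypothetical stand-ins for a Transformer forecaster; only the weighting logic reflects the paper.

```python
import numpy as np

def reweighted_loss(model, weights, series_in, series_out):
    """Eq. (1): sum_i p_i * l(T, s_i, t_i) with a per-sample MSE loss.

    model:     callable mapping an input window to a predicted output window
    weights:   dataset-weights p_i in [0, 1], one scalar per training sample
    series_in, series_out: input windows s_i and target windows t_i
    """
    total = 0.0
    for p, s, t in zip(weights, series_in, series_out):
        residual = model(s) - t
        total += p * np.mean(residual ** 2)   # p_i * l(T, s_i, t_i)
    return total

# Toy check: a zero-weighted sample is excluded from the loss entirely.
identity = lambda s: s
s_in  = [np.ones(4), np.zeros(4)]
s_out = [np.zeros(4), np.zeros(4)]
full = reweighted_loss(identity, [1.0, 1.0], s_in, s_out)  # both samples count
drop = reweighted_loss(identity, [0.0, 1.0], s_in, s_out)  # first sample dropped
```

Setting $p_i = 0$ removes the sample's gradient contribution, which is how the framework realizes an explicit (dataset-level) sparsity assumption.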
The optimally trained $T^*(A)$ is a function of A, since it is obtained from the reweighted loss, which is itself a function of A. Second, we use the learned $T^*(A)$ to generate a pseudo time-series dataset. We use another, unlabeled time-series dataset $D^{(unl)} = \{s_i\}_{i=1}^{L_u}$, where $L_u$ is the length of $D^{(unl)}$. For each $s_i \in D^{(unl)}$, we use $T^*(A)$ to predict the outputs $\tilde{t}_i$. This yields a pseudo-labeled time-series dataset $D^{(pse)}(T^*(A)) = \{(s_i, \tilde{t}_i)\}_{i=1}^{L_p}$, where $L_p$ is the length of $D^{(pse)}$. We use this pseudo-labeled dataset to train another Transformer as the student model S by minimizing

$$S^*(T^*(A)) = \arg\min_S \left[\gamma L(S, D_s^{(trn)}) + (1-\gamma)\, L(S, D^{(pse)}(T^*(A)))\right], \qquad (2)$$

where $\gamma \in [0,1]$ denotes the self-study rate. If $\gamma$ is close to 1, the student model S studies mainly on its own without consulting the "handouts" $D^{(pse)}(T^*)$ prepared by its teacher. For simplicity, we can use the inputs of $D_s^{(trn)}$ as the unlabeled dataset E. The optimal parameters $S^*(T^*(A))$ are functions of $T^*(A)$. Third, we evaluate $S^*(T^*(A))$ on $D^{(val)}$ and update A by minimizing the validation loss. Putting the pieces together, we obtain the following optimization framework:

$$\begin{aligned} \min_A \;& L(S^*(T^*(A)), D^{(val)}) \\ \text{s.t.}\;& S^*(T^*(A)) = \arg\min_S \left[\gamma L(S, D_s^{(trn)}) + (1-\gamma)\, L(S, D^{(pse)}(T^*(A)))\right] \\ & T^*(A) = \arg\min_T L(T, P(A), D_t^{(trn)}). \end{aligned} \qquad (3)$$

In this framework there are three optimization problems. From bottom to top, they correspond to learning stages 1, 2, and 3 respectively. The first two optimization problems appear as constraints of the third. These three stages are conducted end-to-end in this unified framework. The solution $T^*(A)$ obtained in the first stage is used to create a pseudo time-series dataset in the second stage.
The time-series prediction model trained in the second stage is used to make predictions in the third stage. The dataset-weight generator parameters A updated in the third stage change the training loss in the first stage and consequently change the solution $T^*(A)$, which in turn changes $S^*(T^*(A))$.

3.1. OPTIMIZATION ALGORITHM

In this section, we develop a gradient-based optimization algorithm to solve the problem defined in Eq. (3). We approximate $T^*(A)$ by one-step gradient descent w.r.t. $L(T, P(A), D_t^{(trn)})$:

$$T^*(A) \approx T' = T - \eta_t \nabla_T L(T, P(A), D_t^{(trn)}).$$

We then plug $T'$ into $L(S, D^{(pse)}(T'))$ and obtain an approximated objective. We approximate $S^*(T^*(A))$ by one-step gradient descent w.r.t. this approximated objective:

$$S^*(T^*(A)) \approx S' = S - \eta_s \gamma \nabla_S L(S, D_s^{(trn)}) - \eta_s (1-\gamma) \nabla_S L(S, D^{(pse)}(T')).$$

Finally, we plug $S'$ into the validation loss and update A by gradient descent:

$$A \leftarrow A - \eta_a \nabla_A L(S', D^{(val)}),$$

where, by the chain rule,

$$\nabla_A L(S', D^{(val)}) = \frac{\partial T'}{\partial A} \frac{\partial D^{(pse)}}{\partial T'} \frac{\partial S'}{\partial D^{(pse)}} \frac{\partial L(S', D^{(val)})}{\partial S'} \approx \eta_t \eta_s (1-\gamma)\, \nabla_A P \,\nabla^2_{P,T} L(T, P(A), D_t^{(trn)})\, \nabla_{T'} D^{(pse)}\, H.$$

The second-order term $\nabla^2_{D^{(pse)},S} L(S, D^{(pse)})\, \nabla_{S'} L(S', D^{(val)})$ can be approximated by the finite-difference Hessian product

$$H = \left[\nabla_{D^{(pse)}} L(S^+, D^{(pse)}) - \nabla_{D^{(pse)}} L(S^-, D^{(pse)})\right] / (2\epsilon),$$

where $S^{\pm} = S \pm \epsilon \nabla_{S'} L(S', D^{(val)})$ and $\epsilon$ is a small scalar.
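To make the three one-step approximations concrete, here is a deliberately tiny scalar sketch: the "models" are single parameters, the losses are squared errors, and the hypergradient w.r.t. A is computed by plain finite differences rather than the paper's Hessian-product approximation. All names and constants are illustrative, not the paper's implementation.

```python
import numpy as np

# Scalar toy version of Sec. 3.1: teacher step, student step, then a
# finite-difference hypergradient for the weight-generator parameter a.
eta_t, eta_s, eta_a, gamma, eps = 0.1, 0.1, 0.01, 0.2, 1e-4

def teacher_step(T, p, x, y):
    # One step on the p-weighted loss: T' = T - eta_t * d/dT [p (Tx - y)^2]
    grad = 2 * p * (T * x - y) * x
    return T - eta_t * grad

def student_step(S, T1, x_trn, y_trn, x_unl):
    pseudo = T1 * x_unl                      # teacher generates a pseudo-label
    g_trn = 2 * (S * x_trn - y_trn) * x_trn  # self-study gradient
    g_pse = 2 * (S * x_unl - pseudo) * x_unl # pseudo-label gradient
    return S - eta_s * (gamma * g_trn + (1 - gamma) * g_pse)

def hypergrad(a, T, S, data):
    # d L_val / d a through both unrolled one-step updates.
    def val_loss(a_):
        p = 1 / (1 + np.exp(-a_))            # p(a) via a sigmoid generator
        T1 = teacher_step(T, p, *data["trn_t"])
        S1 = student_step(S, T1, *data["trn_s"], data["unl"])
        xv, yv = data["val"]
        return (S1 * xv - yv) ** 2
    return (val_loss(a + eps) - val_loss(a - eps)) / (2 * eps)

data = {"trn_t": (1.0, 2.0), "trn_s": (1.0, 2.0), "unl": 1.5, "val": (1.0, 2.0)}
a = 0.0
g = hypergrad(a, T=0.5, S=0.5, data=data)
a_new = a - eta_a * g
```

In this toy setup the training sample is consistent with the validation target, so the hypergradient pushes the weight $p$ upward; with noisy samples the same mechanism pushes $p$ toward zero.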

3.2. DATA-REWEIGHTING GENERATION

Recall that the dataset-weights P(A) are generated from learnable parameters A. The proposed framework allows the generator to take various forms. With the conventional sigmoid activation function σ, different dataset-weight generators can handle different types of data noise, which is especially relevant for long sequence time-series forecasting. We propose three options:

(a) Identity function: $p_i = \sigma(a_i)$;

(b) Normal distribution: $p_i = \sigma\left(\sum_{j=1}^{N/2} \frac{1}{\sqrt{2\pi}\, a_j} \exp\left(-\frac{(i - a_{j+N/2})^2}{2 a_j^2}\right)\right)$;

(c) Fourier series: $p_i = \sigma\left(\sum_{j=1}^{N/2b} a_j \sin\left(\frac{2\pi j i}{N}\right) + \sum_{k=N/(2b)+1}^{N/b} a_k \cos\left(\frac{2\pi k i}{N}\right)\right)$.

Figure 2 presents a sample of the ETTh1 dataset reweighted by the three proposed dataset-weight generators. Lighter colors indicate that training data sampled from that area receive lower weights (from one down to zero). We can witness two noise areas for long sequence forecasting, around 700-800 and 1400-1500. Among the three methods, the Fourier generator captures both data outliers through its decomposition in the frequency domain. The normal generator cannot filter outliers in a fine-grained manner, while the identity generator cannot produce highly discriminating weights due to a sparse-training problem: each parameter in A is trained only once per epoch, and since most time-series Transformers converge within a few epochs, A cannot be fully trained. Based on the empirical evaluation in our experiments, the Fourier generator is the default option.
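The three generators can be sketched as follows. The exact index bookkeeping in the paper's formulas (which half of A holds Gaussian scales vs. centres, and the Fourier coefficient ranges) is ambiguous in the extracted text, so this is one vectorized interpretation under those assumptions.

```python
import numpy as np

def sigmoid(x, temp=5.0):
    # Temperature-scaled sigmoid used by all three generators.
    return 1.0 / (1.0 + np.exp(-temp * x))

def identity_weights(a):
    # (a) One parameter per sample: p_i = sigma(a_i).
    return sigmoid(a)

def normal_weights(a, n):
    # (b) Mixture of N/2 Gaussian bumps over sample indices; first half of
    # `a` holds the scales a_j, second half the centres a_{j+N/2}.
    half = len(a) // 2
    scales, centres = a[:half], a[half:]
    i = np.arange(n)[:, None]
    bumps = np.exp(-(i - centres) ** 2 / (2 * scales ** 2)) \
            / (np.sqrt(2 * np.pi) * np.abs(scales))
    return sigmoid(bumps.sum(axis=1))

def fourier_weights(a, n):
    # (c) Truncated Fourier series: half the parameters weight sines,
    # the other half cosines, over low frequencies j = 1..len(a)//2.
    half = len(a) // 2
    i = np.arange(n)[:, None]
    j = np.arange(1, half + 1)[None, :]
    sin_part = (a[:half] * np.sin(2 * np.pi * j * i / n)).sum(axis=1)
    cos_part = (a[half:] * np.cos(2 * np.pi * j * i / n)).sum(axis=1)
    return sigmoid(sin_part + cos_part)

n = 16
w1 = identity_weights(np.zeros(n))                                   # all 0.5
w2 = normal_weights(np.concatenate([np.ones(2), np.array([4.0, 12.0])]), n)
w3 = fourier_weights(np.zeros(4), n)                                 # all 0.5
```

With zero-initialized A, every sample starts at weight 0.5 under the sigmoid, consistent with the "no prior knowledge" initialization; the generators then redistribute the weights during training.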

4. CASCADED TEACHING

In this section, we propose a generalized cascaded teaching approach for time-series prediction. There is a sequence of models $1, \dots, K$. The first model is a teacher, the K-th model is a student, and every other model is both a teacher and a student: model i teaches model i+1. The teaching mechanism is the same as described in Section 3: given model i, we use it to generate a pseudo time-series dataset, then use this pseudo dataset to train model i+1. The framework has K+1 learning stages. For $1 \le k \le K$, the k-th learning stage trains the k-th model; the (K+1)-th learning stage trains A via model validation. In the first stage, we train the network $T_1$ (including encoder and decoder) of the first model by solving

$$T_1^*(A) = \arg\min_{T_1} L(T_1, P(A), D^{(trn)}),$$

where A contains the parameters used to generate the dataset-weights on the training set $D^{(trn)}$. In the second stage, model 1 teaches model 2. Given the optimal parameters $T_1^*(A)$ of model 1, we use them to generate a pseudo-labeled dataset $D^{(pse)}(T_1^*)$ in the same way as described in Section 3, and train the network parameters $T_2$ of model 2 by solving

$$T_2^*(T_1^*(A)) = \arg\min_{T_2} L(T_2, D^{(pse)}(T_1^*(A))).$$

In the k-th stage, model k-1 teaches model k. Let $T_{k-1}^*(\cdots T_1^*(A))$ denote the optimal parameters of model k-1 trained at stage k-1. We use them to generate a pseudo-labeled dataset $D^{(pse)}(T_{k-1}^*(\cdots T_1^*(A)))$ and train the parameters $T_k$ of model k by solving

$$T_k^* = \arg\min_{T_k} L(T_k, D^{(pse)}(T_{k-1}^*(\cdots T_1^*(A)))).$$

The process continues until all K models are trained. At the (K+1)-th stage, we validate the trained models on the validation dataset and learn the architecture A of the first model by minimizing the validation loss.
The corresponding optimization problem is

$$\min_A \; L(T_1^*(A), P(A), D^{(val)}) + \lambda \sum_{i=2}^K L(T_i^*, D^{(val)}),$$

where λ is the cascaded teaching rate. Putting these pieces together, we obtain the following multi-level optimization problem:

$$\begin{aligned} \min_A \;& L(T_1^*(A), P(A), D^{(val)}) + \lambda \sum_{i=2}^K L(T_i^*, D^{(val)}) \\ \text{s.t.}\;& T_K^* = \arg\min_{T_K} L(T_K, D^{(pse)}(T_{K-1}^*)) \\ & \;\;\vdots \\ & T_2^* = \arg\min_{T_2} L(T_2, D^{(pse)}(T_1^*(A))) \\ & T_1^* = \arg\min_{T_1} L(T_1, P(A), D^{(trn)}). \end{aligned}$$

From bottom to top, the K+1 optimization problems correspond to learning stages $1, \dots, K+1$ respectively. The first K optimization problems appear as constraints of the (K+1)-th. The stages are performed end-to-end in a joint manner, where different stages mutually influence each other. The detailed optimization algorithm can be found in the Appendix.
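The K-stage forward pass can be sketched as a short loop. As before, this is a scalar toy (models $y = T_k x$, squared losses, one gradient step each); the names and constants are illustrative, not the paper's code.

```python
import numpy as np

# Minimal runnable sketch of the K-model cascade (Sec. 4): the teacher
# trains on reweighted real data, then each model pseudo-labels the
# unlabeled input for its successor.
eta = 0.1

def train_one_step(T, x, y, p=1.0):
    grad = 2 * p * (T * x - y) * x          # weighted-MSE gradient
    return T - eta * grad

def cascade(params, a, x_trn, y_trn, x_unl):
    p = 1 / (1 + np.exp(-a))                # dataset-weight from A
    params = list(params)
    # Stage 1: teacher trains on the reweighted real data.
    params[0] = train_one_step(params[0], x_trn, y_trn, p)
    # Stages 2..K: model k-1 pseudo-labels x_unl for model k.
    for k in range(1, len(params)):
        pseudo = params[k - 1] * x_unl
        params[k] = train_one_step(params[k], x_unl, pseudo)
    return params

# Three learners, all initialized at 0.5, chasing the target y = 2x.
params = cascade([0.5, 0.5, 0.5], a=0.0, x_trn=1.0, y_trn=2.0, x_unl=1.0)
```

After one pass, the improvement flows down the cascade with attenuation: the teacher moves most, each student a little less, which is why stage (K+1) backs the validation signal all the way up to A.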

5. EXPERIMENT

We perform experiments on five datasets: ETT (ETTh1, ETTh2, ETTm1), WTH, and Traffic. Since the proposed framework only modifies the weights on the dataset, it can be applied to most time-series prediction models under the end-to-end deep learning paradigm; we build on Informer Zhou et al. (2021) and Query Selector Klimek et al. (2021). Hyperparameters: By default, the input length of the encoder is set to 96 and the input length of the decoder to 48 for Informer. The other hyperparameters are consistent with the two original papers. Due to limited computing resources, the learner number K is set to 2, which follows the canonical teacher-student learning paradigm. The self-study rate γ is set to 0.2 so that the student model relies mainly on the quality of the pseudo-labeled dataset. Platform: All models were trained and tested on 2 Nvidia V100 32GB GPUs. More information about the network setup can be found in Appendix C.3.

5.3. RESULTS AND ANALYSES

The main results are summarized in Table 1, with two separate comparisons of cascaded teaching's performance on Informer and Query-Selector. The winners are in boldface. As the last row of Table 1 shows, the cascading framework outperforms the baselines on all datasets, and the improvement is most significant on the ETT datasets (up to a 66% improvement on ETTh2). For Informer, the student model performs slightly better than the teacher (count 26 > 22): training A helps reduce the student's validation loss. For Query-Selector, the student model performs worse than the teacher (count 10 < 38), but their metrics are very close. This gap derives from the strong input-sparsity assumption in the Query-Selector model, where the selected queries weaken the effect of reweighting. For the fine-grained case (ETTm1), the improvement is less pronounced than on the hourly datasets, which indicates that reweighting is more suitable for longer inputs. It is also worth noting that the performance improvement brought by the cascading framework does not decrease significantly as the prediction length increases, which reveals its potential for predicting even longer sequences.

5.4. ABLATION STUDY

We design two ablation studies to further validate the effectiveness of the framework's components: the dataset-weights, and the teacher and student models.

5.4.1. ABLATION STUDY FOR DATASET-WEIGHTS

Unlike conventional teacher-student learning, our framework introduces a data-reweighting mechanism to find valuable time-series samples. Without data-reweighting, the cascaded framework degenerates into knowledge distillation and performs similarly to the baseline model. We perform experiments on Cas-Informer and KD-Informer (Informer with knowledge distillation Hinton et al. (2015)). KD-Informer simply sets $p_i(A) = 1$ and removes the update of A from the optimization algorithm. The experimental results are summarized in Table 2. As we can see, the performance of the model trained by the knowledge distillation framework is consistent with the baseline. We believe the main reason is that the performance of the student model depends mainly on the quality of the pseudo-labels generated by the teacher model, so its MSE score is close to the teacher's. The knowledge distillation framework does not improve the performance of the teacher model and therefore shows no significant improvement over the baseline.

5.4.2. ABLATION STUDY FOR TEACHER-STUDENT MODEL

The teacher model generates pseudo-labels to help the student learn through the dataset $D^{(pse)}$, whereas the student model updates A by passing the partial gradient back to the teacher model. Therefore, the teacher and student models can have different sizes. We evaluate the effect of model size by reducing the number of heads and hidden dimensions. Tables 3 and 4 show that the teacher model maintains good performance even when the size of the student model decreases significantly, while the student model does not exhibit the same robustness to the teacher's model size. This is because the teacher model can always improve its prediction ability by updating A and $T_1$: although the gradient quality of A decreases to some extent, there is still a significant improvement over the baseline. Conversely, the update of the student model is mainly driven by the quality of the pseudo-labels provided by the teacher model, and thus the student degrades at a rate similar to the teacher's.

5.5. PARAMETER SENSITIVITY

We analyze the sensitivity of three hyperparameters of the framework: the weight decay $d_W$, the Fourier divider b, and the temperature T of the sigmoid function. Weight decay: In Figure 4(a), we can see that tasks with longer prediction lengths are more sensitive to the weight decay rate; a smaller $d_W$ is therefore recommended for stable performance across prediction lengths. Fourier divider: This hyperparameter controls the number of parameters in A. Figure 4(b) shows that as the prediction task becomes more difficult, the framework requires more parameters in A to control the dataset-weight of each sample at a finer-grained level. Sigmoid temperature: We use the sigmoid activation function $\sigma(x) = (1 + e^{-Tx})^{-1}$. A larger temperature T sharpens the activation σ, which allows training a more selective A for long-sequence predictions.
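The temperature's effect on selectivity can be seen directly from the activation: for the same pre-activation value, a larger T pushes the resulting weight further toward 0 or 1. A two-line check (the value 0.5 and the temperatures are arbitrary illustrations):

```python
import numpy as np

def sigma(x, T=5.0):
    # Temperature-scaled sigmoid: sigma(x) = 1 / (1 + exp(-T * x)).
    return 1.0 / (1.0 + np.exp(-T * x))

# The same pre-activation maps to a more extreme weight as T grows,
# so a larger T yields more selective (closer to 0/1) dataset-weights.
w_low_T  = sigma(0.5, T=1.0)
w_high_T = sigma(0.5, T=5.0)
```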

5.6. DISCUSSION

A detailed analysis is given in Appendix B, which shows theoretically that the dataset-weights directly control the magnitude of the student model's weight updates. The approximated gradient step for the student model is $S' = S + g_0 + \sum_{i=1}^B p_i g_i$, where B denotes the batch size and the $g_i$ are gradient terms independent of the dataset-weights. The optimization goal thus becomes: make the student model perform better on the validation dataset by adjusting the update magnitudes $p_i$ (i.e., the dataset-weights). The experimental results show that a mixture of the reweighted dataset and the real-world dataset can even outperform the real-world dataset alone, similar to the results in Hwang et al. (2022). The reason for adopting the teacher-student method is that we want to force the teacher to generate more valuable teaching sequences rather than simply finding which samples give the teacher model a lower validation loss. The latter may make the model perform better on a specific validation set, while the former aims to improve the model's prediction quality on a different dataset and to find samples that better reflect the underlying features of the data. A teacher model trained in this way generalizes better to other datasets, here the unlabeled dataset E. In addition, this also prevents the model from taking shortcuts to reduce the loss, such as directly predicting zero for targets with small absolute values.

6. CONCLUSIONS

In this paper, we proposed a cascaded learning framework to reweight the training samples of the teacher model. By using one-step approximations and pseudo datasets, we establish a gradient flow between the dataset-weights and the validation loss of the last model in the framework.

7. ETHICS STATEMENT

The proposed cascaded framework can help most time-series forecasting models achieve better performance by reweighting the training dataset. The framework can be applied to many significant real-world cases, such as economics and finance forecasting, climate forecasting, disease propagation analysis, traffic management, etc. It also reveals latent patterns within the dataset, which is inspirational for feature engineering. Our contributions are not limited to the LSTF problem; the framework can also be applied to other tasks such as long text generation, anomaly detection, and neural architecture search. In addition, the introduction of teacher-student learning requires more computing resources. In our case, we use two V100 32GB GPUs to achieve a training speed comparable to the baseline model. The gradient calculation between the two models also necessitates distributed training, which increases the implementation difficulty. Therefore, we provide a decoupled distributed implementation of cascaded learning so that researchers can plug their own models into our framework.

B DETAILED ANALYSIS

To better understand why our cascaded framework works, we use the Lipschitz condition of the Transformer-based model. As discussed in Kim et al. (2021), scaled dot-product self-attention is Lipschitz if the input space is compact. In our case, the compact input space of the time-series dataset can be written as $[-M, M]^{B \times L \times D}$, where M is an upper bound on the whole dataset and B, L, D denote the batch size, input length, and hidden dimension respectively. Most Transformer variants, including Informer and Query-Selector, are still continuously differentiable, so the Lipschitz condition applies. Assume the Lipschitz constant of the cascaded models is K. In the case of two learners, the teacher and student models can be written as

$$F_T(T, X_i) = Y_i^T, \qquad F_S(S, X_i) = Y_i^S, \qquad (4)$$

where $F_T$ denotes the teacher model and $F_S$ the student model. For simplicity, we assume the batch size is 3, γ = 0, and that after several iterations the models are close to convergence, so that their gradients are much smaller than their parameters. We perform one-step gradient descent on the teacher model in Eq. (4) and use the new parameters to generate a pseudo-label for a specific sample $e_j$ in the unlabeled dataset E:

$$F_T(T', e_j) = F_T(T + p_1 t_1 + p_2 t_2 + p_3 t_3,\; e_j),$$

where $t_i = -\eta_T \frac{\partial}{\partial T} l(T, X_i)$ denotes the parameter update brought by sample $X_i$. Since $t_i \ll 1$, we have the first-order approximation

$$F_T(T', e_j) \approx F_T(T, e_j) + (p_1 t_1 + p_2 t_2 + p_3 t_3)\, \frac{\partial}{\partial T} F_T(T, e_j).$$

Therefore, when we train the student model on the generated pseudo-labeled dataset $D^{(pse)}$, the MSE loss can be written as

$$l_j(S, e_j) = \left(F_S(S, e_j) - F_T(T', e_j)\right)^2 = \left(F_S(S, e_j) - F_T(T, e_j) - (p_1 t_1 + p_2 t_2 + p_3 t_3)\, \frac{\partial}{\partial T} F_T(T, e_j)\right)^2.$$

After omitting the second-order terms, we have

$$l_j(S, e_j) = l_{j0} + p_1 l_{j1} + p_2 l_{j2} + p_3 l_{j3}, \quad \text{where} \quad l_{j0} = \left(F_S(S, e_j) - F_T(T, e_j)\right)^2, \quad l_{ji} = -2 t_i\, \frac{\partial}{\partial T} F_T(T, e_j)\, \left(F_S(S, e_j) - F_T(T, e_j)\right).$$

The one-step gradient descent for the student model in Eq. (5) can then be rewritten as

$$S' = S - \eta_S \frac{\partial}{\partial S} l_j(S, e_j) = S + g_{j0} + p_1 g_{j1} + p_2 g_{j2} + p_3 g_{j3}, \quad \text{where} \quad g_{j0} = -\eta_S \frac{\partial}{\partial S} l_{j0}, \quad g_{ji} = -\eta_S \frac{\partial}{\partial S} l_{ji}.$$

This shows that the dataset-weights directly control the magnitude of the corresponding update of the student model's parameters.
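The first-order expansion above can be checked numerically: for small teacher updates $t_i$, the pseudo-label loss is approximately linear in the weights $p_i$. The sketch below uses a scalar model $F(\theta, x) = \theta x$ with arbitrary toy numbers.

```python
import numpy as np

# Numeric check of Appendix B's expansion: l_j(p) ~ l_j0 + sum_i p_i l_ji
# up to terms of second order in the (small) per-sample updates t_i.
T, S, e_j = 1.0, 0.8, 1.5
t = np.array([1e-3, -2e-3, 5e-4])       # small per-sample teacher updates t_i
p = np.array([0.3, 0.9, 0.6])           # dataset-weights p_i

def loss(p_vec):
    T_new = T + np.dot(p_vec, t)        # T' = T + sum_i p_i t_i
    return (S * e_j - T_new * e_j) ** 2 # (F_S(S, e_j) - F_T(T', e_j))^2

# Build the linearised form from one evaluation per unit weight vector.
l0 = loss(np.zeros(3))                               # l_j0
l_i = np.array([loss(np.eye(3)[i]) - l0 for i in range(3)])  # ~ l_ji
approx = l0 + np.dot(p, l_i)                         # linear-in-p surrogate
exact = loss(p)                                      # true loss
```

The linear term dominates the residual by orders of magnitude, which is what justifies dropping the second-order terms in the derivation.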

C.1 SCINET AND REFORMER

In this section, we show the performance of the cascaded framework applied to Reformer Kitaev et al. (2020) and SCINet Liu et al. (2021), a CNN-based neural network for time-series forecasting. The results are summarized in Table 5. As we can see, the cascaded framework yields about a 13% improvement for both models. This demonstrates the scalability of the cascaded framework to convolutional neural networks and shows that data-reweighting does not interfere with the sparse attention mechanism in Transformer-based models. We also notice that the student SCINet model performs slightly better than the teacher. Nevertheless, since the self-study rate is set to 0.2, the two models show very similar improvements over the baseline.

C.2 DATASET-WEIGHT GENERATOR

We evaluated Cas-Informer with different dataset-weight generators. Due to limited time, we could only test the normal-distribution-based generator on three datasets, among which the Exchange dataset (financial) records the daily exchange rates of eight different countries from 1990 to 2016. As shown in Table 6, Cas-Informer with the Fourier dataset-weight generator performs better in almost all settings except on the Exchange dataset. The results confirm the statement about financial datasets in Section 3.2.

C.3 HYPERPARAMETERS

The structure of the time-series forecasting model within our cascaded framework is the same as the original settings in Informer Zhou et al. (2021), Query Selector Klimek et al. (2021), Reformer Kitaev et al. (2020), and SCINet Liu et al. (2021). By default, the input length of the encoder is set to 96 and the input length of the decoder to 48. We use the Adam optimizer to train for 10 epochs with an initial learning rate of 0.0001, halved every 4 epochs. The dataset-weights are also trained with the Adam optimizer, with an initial learning rate of 0.0002. Due to GPU memory limitations, we adopt a batch size of 32 and a hidden dimension of 512. Unless otherwise specified, the self-study rate γ = 0.2 and the sigmoid temperature T = 5. An early-stopping algorithm is also adopted to prevent overfitting. We implemented the cascaded framework in Python 3.8 with PyTorch 1.10 so that the recently released distributed data parallel package can be used.
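The step schedule above (initial learning rate 1e-4, halved every 4 epochs over 10 epochs) corresponds to PyTorch's `torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.5)`; a dependency-free sketch of the resulting schedule:

```python
# Learning-rate schedule sketch: base lr halved every `step` epochs,
# matching StepLR(step_size=4, gamma=0.5) on the Adam optimizer.
def lr_at_epoch(epoch, base_lr=1e-4, step=4, gamma=0.5):
    return base_lr * gamma ** (epoch // step)

lrs = [lr_at_epoch(e) for e in range(10)]  # lr used at each of the 10 epochs
```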



REPRODUCIBILITY STATEMENT

We provide an open-source implementation of our cascaded framework at https://anonymous.4open.science/r/cascaded-framework-6830/. The detailed hyperparameter settings and analysis can be found in the Appendix.



Figure 2: The ETTh1 dataset reweighted by the three proposed dataset-weight generators. Lighter colors indicate lower dataset-weights, i.e., the targeted noise.

Figure 4: The parameter sensitivity of three hyperparameters in Cas-informer.

Multivariate long sequence time-series forecasting results on two datasets (four cases).

The last two models are presented in Appendix C.1 to show the generalizability of the proposed framework. We also present a comprehensive ablation study to test the effectiveness of the different parts of our framework. More details about parameter sensitivity are included in Appendix C.3. Each data point of WTH consists of the target value "wet bulb" and 11 other climate features. Traffic is a collection of hourly data from the California Department of Transportation that describes road occupancy measured by different sensors on highways in the San Francisco Bay Area.

Experimental results on ETT dataset with Cas-Informer, KD-Informer and Informer.

Ablation study of student model's size on ETTh1.

Ablation study of teacher model's size on ETTh1.

Multivariate time-series forecasting results on ETT dataset with SCINet and Reformer.

Experimental results on dataset-weight generator.

A OPTIMIZATION ALGORITHM OF CASCADED TEACHING

In this section, we develop an optimization algorithm to solve the multi-level optimization problem. We first approximate $T_1^*(A)$ by one-step gradient descent w.r.t. $L(T_1, P(A), D^{(trn)})$:

$$T_1^*(A) \approx T_1' = T_1 - \eta \nabla_{T_1} L(T_1, P(A), D^{(trn)}).$$

We substitute $T_1'$ into the second-stage loss and obtain an approximated objective, then approximate $T_2^*$ by one-step gradient descent on it:

$$T_2^* \approx T_2' = T_2 - \eta \nabla_{T_2} L(T_2, D^{(pse)}(T_1')).$$

Proceeding in the same way for each subsequent $T_k^*$, we approximate it by one-step gradient descent on the loss defined by its predecessor's pseudo-labeled dataset:

$$T_k^* \approx T_k' = T_k - \eta \nabla_{T_k} L(T_k, D^{(pse)}(T_{k-1}')).$$

In this way, we can plug $\{T_k'\}_{k=1}^K$ into the validation loss and update A by gradient descent w.r.t. the approximated validation loss:

$$A \leftarrow A - \eta_a \nabla_A \left[ L(T_1'(A), P(A), D^{(val)}) + \lambda \sum_{i=2}^K L(T_i', D^{(val)}) \right]. \qquad (14)$$

