CASCADED TEACHING TRANSFORMERS WITH DATA REWEIGHTING FOR LONG SEQUENCE TIME-SERIES FORECASTING

Abstract

Transformer-based models have shown superior performance on the long sequence time-series forecasting problem. The sparsity assumption on the self-attention dot-product reveals that not all inputs are equally significant to a Transformer. Instead of implicitly utilizing weighted time-series, we build a new learning framework in which cascaded teaching Transformers learn to reweight samples. We formulate the framework as a multi-level optimization problem and design three different dataset-weight generators. Extensive experiments on five datasets show that our proposed method significantly outperforms state-of-the-art Transformers.

1. INTRODUCTION

Long sequence time-series forecasting (LSTF) has drawn particular attention with Transformer-based models, in applications such as electricity prediction Li et al. (2019), financial prediction Zhang et al. (2022), and weather prediction Pan et al. (2022). Pairwise self-attention allows time-series points to attend directly to each other, which helps forecast the future from historical observations. The sparsity assumption on this pairwise computation Child et al. (2019) reveals that not every attention pair is equally important. If we drop unnecessary pairwise connections without violating the problem setting, we can obtain a stronger model with better generalization. For example, (i) Reformer Kitaev et al. (2020) uses hashing buckets to select important query-key pairs, and (ii) the LogSparse Transformer Li et al. (2019) only computes attention pairs lying a log-spaced step away from the diagonal. The sparsity assumption can be roughly viewed as manipulating the weights within a time-series, or more specifically, internal reweighting. A key technical challenge preventing further performance improvement from internal reweighting alone is the lack of a specific design of time-series weights under the sparsity-oriented framework. Instead, we can perform the reweighting explicitly: we reweight the whole training dataset so that outlier samples are excluded from training, while assigning larger dataset-weights to samples that belong to the main patterns of the dataset. Inspired by knowledge distillation Hinton et al. (2015) and teacher-student learning Li et al. (2014), we can teach the Transformer to learn to reweight time-series. One common method uses pseudo labels Pham et al. (2021) to let the student learn from the teacher's outputs. If we cannot assign accurate labels, the student can hardly learn as well as the teacher does, a phenomenon known as confirmation bias.
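To make the sparsity idea concrete, the following is a minimal sketch of a log-sparse causal attention mask in the spirit of the LogSparse Transformer. The exact index selection in Li et al. (2019) differs in detail; the function name and the powers-of-two offsets here are illustrative.

```python
import numpy as np

def log_sparse_mask(seq_len: int) -> np.ndarray:
    """Boolean causal mask: entry [i, j] is True if query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, i] = True          # every position attends to itself
        offset = 1
        while i - offset >= 0:     # keep positions at offsets 1, 2, 4, 8, ... behind i
            mask[i, i - offset] = True
            offset *= 2
    return mask

mask = log_sparse_mask(8)
# Row 7 keeps positions 7, 6, 5, 3 (offsets 0, 1, 2, 4)
```

Each query row keeps only O(log L) of the L possible connections, so the per-layer attention cost drops from O(L^2) to roughly O(L log L), which is what makes such models attractive for long sequences.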
To alleviate this drawback, we propose to reweight the inputs in a soft manner. In this paper, we design a cascaded teaching framework: a sequence of Transformer models in which model i teaches model i + 1. Specifically, model i generates a pseudo time-series dataset, which is used to train model i + 1. Only the first teacher model uses the reweighted dataset; every other model uses the pseudo time-series dataset generated by its predecessor. Finally, the teacher model updates its dataset-weights based on the performance of the student models. The experimental results show that we can significantly improve the performance of a time-series prediction model by simply reweighting the dataset while maintaining the sparsity assumption at the same time. More importantly, this cascaded learning framework is easily generalized to other seq-to-seq models. The contributions are summarized as follows:

• We propose a cascaded teaching framework that reweights the teacher model's dataset based on the evaluation of student models, which forces the teacher model to discard noisy data by generating proper dataset-weights.

• We design three dataset-weight generators to compress the trainable dataset-weight parameters.

• Extensive experiments on three datasets (five cases) demonstrate the improvement in time-series learning.

2. RELATED WORK

2.1 TEACHER-STUDENT LEARNING

Teacher-student learning has been investigated in knowledge distillation Hinton et al. (2015), adversarial robustness Carlini & Wagner (2017), self-supervised learning Xie et al. (2020), etc. Most of these methods are based on pseudo-labeling, and their focus is to learn a student model with the help of a trained and fixed teacher model; the teacher model is not updated. In contrast, our method focuses on learning a teacher model by letting it teach a student model: the teacher constantly updates itself based on the teaching outcome. Teacher-student learning has also been investigated in several neural architecture search works Li et al. (2020); Trofimov et al. (2021); Gu & Tresp (2020). In these works, when searching the architecture of a student model, pseudo-labels generated by a trained teacher model whose architecture is fixed are leveraged. Our work differs in that we search the dataset-architecture of a teacher model by letting it teach a student model whose dataset-architecture is fixed, whereas the existing works search the architecture of a student model taught by a teacher whose architecture is fixed. In a recent work Pham et al. (2021), conducted independently of and in parallel to ours, the teacher model is also updated based on the student's performance. Our work differs in that it is based on a three-level optimization framework, which searches the teacher's architecture by minimizing the student's validation loss and trains the teacher's network parameters before using the teacher to generate pseudo-labels, whereas the framework in Pham et al. (2021) is based on a two-level optimization that has no architecture search and does not train the teacher before using it for pseudo-labeling. In Such et al. (2019), a meta-learning method is developed to learn a deep generative model that generates synthetic labeled data, and a student model leverages the synthesized data to search its architecture. Our work differs from this method in that we search the teacher's architecture via three-level optimization, while Such et al. (2019) searches the student's architecture via meta-learning.

2.2 REWEIGHTING TIME-SERIES

Reweighting is a technique widely used in time-series forecasting, but the objects of weighting are diverse: features Zhao et al. (2018), fuzzy relationships Yu (2005), gradients Zhang et al. (2019), etc. Beyond forecasting, reweighting is also used in periodic pattern mining Chanda et al. (2017), fault diagnosis Lv et al. (2017), and time-series classification Sellami & Hwang (2019). In these works, only one model is trained to generate weights on the corresponding components of the network. The difference between our method and the above works is that the dataset-weights trained by our framework are not coupled with the model, so they can be applied to other models trained on the same dataset. The weights for time-series can also be extracted in a Bayesian non-parametric way Saad & Mansinghka (2018) or even from the dataset dynamics Zhang et al. (2021). Our framework differs from these works in that our model reweights the entire training dataset by introducing teacher-student learning.

3. TEACHER-STUDENT LEARNING FRAMEWORK

Notations We start with a teacher model with parameters T and a student model with parameters S. We feed the teacher model with a training dataset D_t^(trn) = {(s_i, t_i)}_{i=1}^N, where s_i denotes the i-th input series and t_i is the corresponding output. Following the multi-task learning paradigm Caruana (1997), we assign a different dataset-weight p_i ∈ [0, 1] to each input sample (s_i, t_i), which forms the dataset-weights P = {p_i}_{i=1}^N. In order to avoid introducing prior knowledge, we initialize
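As a toy illustration of explicit dataset reweighting and cascaded teaching, the sketch below trains a "teacher" on a per-sample weighted loss and then a "student" on the teacher's pseudo targets. Everything here is an illustrative stand-in, not the paper's actual design: linear models replace Transformers, `dataset_weights` is one plausible softmax-style weight generator, and the third optimization level (updating the weight logits through the student's validation loss) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: N input windows s_i with scalar targets t_i.
N = 100
s = rng.normal(size=(N, 4))
t = s.sum(axis=1) + 0.1 * rng.normal(size=N)

def dataset_weights(logits):
    """Softmax weight generator: maps N trainable logits to weights p_i in [0, 1]."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def train(data, targets, p, lr=0.1, steps=500):
    """Fit a linear model by gradient descent on the weighted MSE: sum_i p_i (pred_i - t_i)^2."""
    w = np.zeros(data.shape[1])
    for _ in range(steps):
        residual = data @ w - targets
        w -= lr * 2 * data.T @ (p * residual)   # gradient of the weighted loss
    return w

logits = np.zeros(N)           # in the full framework, updated via the student's validation loss
p = dataset_weights(logits)    # uniform at initialization (no prior knowledge)

# Cascade: only model 1 sees the real, reweighted targets; model i's
# predictions become the pseudo targets for model i + 1.
w1 = train(s, t, p)
pseudo = s @ w1
w2 = train(s, pseudo, np.full(N, 1.0 / N))
```

Because the weights p_i enter the loss rather than the data pipeline, the reweighting is soft: a near-zero p_i effectively excludes an outlier sample without hard pruning, and the learned weights are decoupled from any one model.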

