CASCADED TEACHING TRANSFORMERS WITH DATA REWEIGHTING FOR LONG SEQUENCE TIME-SERIES FORECASTING

Abstract

Transformer-based models have shown superior performance on the long sequence time-series forecasting problem. The sparsity assumption on the self-attention dot-product reveals that not all inputs are equally significant to a Transformer. Instead of implicitly utilizing weighted time-series, we build a new learning framework that teaches a cascade of Transformers to reweight samples. We formulate the framework as a multi-level optimization problem and design three different dataset-weight generators. Extensive experiments on five datasets show that our proposed method significantly outperforms state-of-the-art Transformers.

1. INTRODUCTION

Long sequence time-series forecasting (LSTF) has drawn particular attention with Transformer-based models, in applications such as electricity prediction Li et al. (2019), financial prediction Zhang et al. (2022), and weather prediction Pan et al. (2022). The pairwise self-attention allows time-series points to directly attend to each other, which helps forecast the future from historical observations. The sparsity assumption on this pairwise computation Child et al. (2019) reveals that not every input pair is equally important. If we drop unnecessary pairwise connections without violating the problem setting, we can acquire a stronger model with better generalization. For example, (i) Reformer Kitaev et al. (2020) uses hashing buckets to select important query-key pairs, and (ii) the LogSparse Transformer Li et al. (2019) only computes attention pairs lying a log-sized step away from the diagonal. The sparsity assumption can thus be roughly viewed as manipulating weights within a time-series, i.e., internal reweighting. A key technical obstacle to further improvement based solely on internal reweighting is the lack of a concrete design for time-series weights under the sparsity-oriented framework.

Instead, we can perform the reweighting explicitly: we reweight the whole training dataset so that outlier samples are excluded from training, while samples belonging to the dataset's main patterns receive larger dataset-weights. Inspired by knowledge distillation Hinton et al. (2015) and teacher-student learning Li et al. (2014), we can teach a Transformer to learn from weighted time-series. One common method uses pseudo labels Pham et al. (2021) to let a student learn from teacher outputs. However, if the teacher cannot assign accurate labels, the student can hardly learn as well as the teacher does, a phenomenon known as confirmation bias.
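In its simplest form, explicit dataset reweighting multiplies each training sample's loss by a per-sample dataset-weight before averaging. The following is a minimal sketch in plain Python (the function name and the specific normalization are illustrative assumptions, not the paper's exact formulation):

```python
def weighted_mse(preds, targets, weights):
    """Per-sample weighted MSE: outliers receive small weights,
    samples on the dataset's main patterns receive large ones."""
    assert len(preds) == len(targets) == len(weights)
    total = sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)  # normalize so the weight scale cancels

# Uniform weights recover ordinary MSE; down-weighting the outlier
# (the last sample) pulls the loss toward the well-behaved samples.
preds = [1.0, 2.0, 10.0]
targets = [1.1, 2.1, 0.0]
uniform = weighted_mse(preds, targets, [1.0, 1.0, 1.0])
reweighted = weighted_mse(preds, targets, [1.0, 1.0, 0.1])
```

With uniform weights the outlier dominates the loss; shrinking its weight lets the optimizer focus on the dominant patterns, which is the effect the dataset-weight generators are designed to produce automatically.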
To alleviate this drawback, we propose to reweight the inputs in a soft manner. In this paper, we design a cascaded teaching framework with a sequence of Transformer models, where model i teaches model i + 1. Specifically, model i generates a pseudo time-series dataset, which is used to train model i + 1. Only the first teacher model uses the reweighted dataset; every other model uses the pseudo time-series dataset generated by its predecessor. Finally, the teacher model updates its dataset-weights based on the performance of the student models. The experimental results show that we can significantly improve the performance of a time-series prediction model simply by reweighting the dataset while maintaining the sparsity assumption. More importantly, this cascaded learning framework generalizes easily to other sequence-to-sequence models. The contributions are summarized as follows:
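The cascade can be sketched as a loop in which each model's predictions become the next model's training targets. Below is a toy sketch in plain Python, with a scalar linear model y = a*x standing in for each Transformer; the multi-level optimization that updates the teacher's dataset-weights from student performance is omitted, and all names (`train`, `predict`) are hypothetical:

```python
def train(a, dataset, weights, lr=0.05, epochs=200):
    """Fit y = a*x by weighted gradient descent (a stand-in for
    training a Transformer on a weighted time-series dataset)."""
    for _ in range(epochs):
        grad = sum(w * 2 * (a * x - y) * x for (x, y), w in zip(dataset, weights))
        a -= lr * grad / sum(weights)
    return a

def predict(a, dataset):
    """Generate a pseudo dataset: same inputs, model outputs as targets."""
    return [(x, a * x) for x, _ in dataset]

# Toy data: y = 2x with one outlier; the teacher's dataset-weights
# down-weight the outlier (here set by hand, not learned).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 0.0)]
weights = [1.0, 1.0, 0.1]

# Cascade: model 0 trains on the reweighted real data, then each
# model i generates pseudo targets that train model i + 1.
models = [train(0.0, data, weights)]
for _ in range(2):  # two student levels
    pseudo = predict(models[-1], data)
    models.append(train(0.0, pseudo, [1.0] * len(pseudo)))
```

In this degenerate linear case each student simply recovers its teacher; the interesting behavior in the full framework comes from the teacher adjusting its dataset-weights after observing how well the students perform.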

