ONLINE CONTINUAL LEARNING WITH FEEDFORWARD ADAPTATION

Abstract

Recently, deep learning has been widely used in time-series prediction tasks. Although a trained deep neural network model typically performs well on the training set, its performance can drop significantly on a test set under slight distribution shifts. This challenge motivates the adoption of online test-time adaptation algorithms, which update the prediction model in real time to improve prediction performance. Existing online adaptation methods optimize the prediction model by feeding back the latest prediction error, computed with respect to the latest observation. However, this feedback-based approach is prone to forgetting past information. In this work, we propose an online adaptation method with feedforward compensation, which uses critical data samples from a memory buffer, instead of the latest samples, to optimize the prediction model. We prove that the proposed approach has a smaller error bound than previously used approaches in slowly time-varying systems. Experiments on several time-series prediction tasks show that the proposed feedforward adaptation outperforms previous adaptation methods by 12%. In addition, the proposed feedforward adaptation method is able to estimate an uncertainty bound of the prediction that is agnostic to the specific optimizer, which existing feedback adaptation cannot.

1. INTRODUCTION

Time-series prediction (or forecasting) has been widely studied in many fields, including control, energy management, and financial investment Box et al. (2015); Brockwell & Davis (2002). Among these applications, acquiring future trends and tendencies of the time-series data is one of the most important subjects. With the emergence of deep learning, many deep neural network models have been proposed to solve this problem Lim & Zohren (2021), e.g., Recurrent Neural Networks Lai et al. (2018) and Temporal Convolutional Networks Bai et al. (2018).

In practical time-series prediction problems, there are often significant distributional discrepancies between the offline training set and the real-time testing set. These differences may be attributed to multiple factors. In some cases, it is too expensive to collect large unbiased training datasets, e.g., for weather prediction or medical time-series prediction. In other cases, it may be difficult to obtain training instances from a specific domain. For example, in human-robot collaboration, it is hard to collect data from all potential future users. In these cases, adaptation techniques are applied to deal with the distribution mismatch between offline training and real-time testing Blum (1998). Moreover, some tasks require the system to adapt itself after every observation; for example, in human-robot collaboration, the robot needs to continually adapt its behavior to different users. In these scenarios, online adaptation techniques are often embraced Abuduweili et al. (2019).

Online adaptation is a special case of online continual learning, which continually learns from real-time streaming data. In online adaptation, a prediction model receives sequential observations, and an online optimization algorithm (e.g., SGD) updates the prediction model according to the prediction loss measured on the observed data. The goal of online adaptation is to improve prediction accuracy in subsequent rounds.
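As a concrete illustration, the feedback-style online update described above can be sketched as a streaming SGD loop. The linear predictor, variable names, and learning rate below are illustrative assumptions for this sketch, not the paper's model.

```python
import numpy as np

def feedback_adaptation_step(w, x, y, lr=0.05):
    """One round of feedback online adaptation: predict with the current
    weights, observe the ground truth, then take an SGD step on the
    squared error of the *latest* sample only."""
    y_hat = w @ x                      # prediction made before y is revealed
    grad = 2.0 * (y_hat - y) * x      # gradient of (y_hat - y)^2 w.r.t. w
    return y_hat, w - lr * grad       # feedback update from the newest error

# Streaming loop: samples arrive one at a time (noiseless linear target
# here, purely for illustration).
rng = np.random.default_rng(0)
w = np.zeros(3)
w_true = np.array([1.0, -2.0, 0.5])
errors = []
for t in range(500):
    x = rng.normal(size=3)
    y = w_true @ x                    # ground truth revealed after prediction
    y_hat, w = feedback_adaptation_step(w, x, y)
    errors.append((y_hat - y) ** 2)

# Later prediction errors shrink as the model adapts online.
assert np.mean(errors[-50:]) < np.mean(errors[:50])
```

In this sketch the model only ever sees the newest sample, which is exactly the "passive" behavior the feedforward approach aims to improve on.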
Online adaptation is currently applied in many areas of research, such as time-series prediction Pastor et al. (2011); Abuduweili & Liu (2020), image recognition Lee & Kriegman (2005); Chen et al. (2022), and machine translation Martínez-Gómez et al. (2012). In this paper, we mainly focus on time-series prediction tasks, but the proposed methods can also be used for other online adaptation (or online learning) tasks. Most existing online adaptation approaches are based on feedback compensation Tonioni et al. (2019), analogous to feedback control. In feedback adaptation, a prediction model only utilizes the latest received data: after observing a new sample, the online optimization algorithm updates the prediction model according to the prediction loss measured between the last prediction and the latest ground truth. However, this kind of passive feedback compensation is not efficient.

In this work, we propose feedforward compensation in online adaptation to maximize information extraction from existing data, especially samples that are more critical. A critical sample is one that, when selected for training, is more helpful in reducing the objective (loss) of the model. In the proposed feedforward adaptation, we not only have forgetting, as in conventional online adaptation Paleologu et al. (2008), but also enable recalling to compensate for potential shortsighted behaviors caused by forgetting. There is a balance between forgetting and recalling. On the one hand, to rapidly learn the new function value in a time-varying system, we need to forget some of the old data. On the other hand, too much forgetting may cause unstable and incorrect predictions when we encounter a pattern similar to historical data. To achieve this balance, we design a novel mechanism for feedforward compensation using a memory buffer, similar in functionality to the hippocampus in the human brain Barron (2021). We maintain the memory buffer by storing the observations (or hidden features) of the most recent L steps.
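The forgetting/recalling balance described above can be sketched with a fixed-size FIFO buffer: pushing a new sample forgets the oldest one, and retrieval recalls the stored sample nearest to the current one. The class and method names (`MemoryBuffer`, `push`, `retrieve`) are illustrative, not the paper's API.

```python
from collections import deque
import numpy as np

class MemoryBuffer:
    """FIFO buffer holding the most recent L samples; older samples are
    forgotten automatically, and `retrieve` recalls the stored sample
    closest to the current one (a 'critical sample')."""

    def __init__(self, L):
        self.buf = deque(maxlen=L)        # forgetting: keep only L entries

    def push(self, feature, target):
        self.buf.append((np.asarray(feature, dtype=float),
                         np.asarray(target, dtype=float)))

    def retrieve(self, query):
        # Recalling: nearest stored feature by Euclidean distance.
        query = np.asarray(query, dtype=float)
        dists = [float(np.linalg.norm(f - query)) for f, _ in self.buf]
        i = int(np.argmin(dists))
        return self.buf[i], dists[i]

buf = MemoryBuffer(L=3)
buf.push([0.0, 0.0], 0.0)
buf.push([1.0, 1.0], 1.0)
buf.push([5.0, 5.0], 2.0)
buf.push([9.0, 9.0], 3.0)                 # buffer full: [0, 0] is forgotten

(feature, target), dist = buf.retrieve([0.9, 1.1])
assert float(target) == 1.0               # the recalled critical sample
```

Here Euclidean distance in feature space is an assumed similarity measure; any similarity over the hidden features Z_t would fit the same structure.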
When the prediction model encounters observations similar to past ones, it pulls the corresponding data (critical samples) from the memory buffer to enhance learning. For example, in human behavior prediction tasks, a human subject may exhibit similar behavior patterns on different days. Such patterns would be extremely difficult to discover if we only learned from the most recent data, as in conventional online adaptation, but they can be identified by the feedforward adaptation method with its memory buffer. We can also use the relation between the current sample and the critical samples to estimate an uncertainty bound on the current prediction. Our main contributions can be summarized as follows.

• By unifying feedforward and feedback adaptation methods, we provide a general online test-time adaptation framework and prove its error bound.
• We propose feedforward compensation for online test-time adaptation problems. We prove that the proposed feedforward adaptation method has a smaller error bound than previously used feedback methods.
• We propose an uncertainty-bound estimation related to the feedforward approach, which is agnostic to the specific optimizer.
• We conduct extensive experiments to show that the proposed feedforward adaptation is superior to conventional feedback adaptation.
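The optimizer-agnostic uncertainty idea can be sketched in one line: if the underlying function is assumed Lipschitz-continuous, the current prediction error is bounded by the recalled critical sample's error plus a term growing with the feature-space distance between the two samples. The function name and the Lipschitz constant below are assumptions for illustration, not the paper's exact formulation.

```python
def uncertainty_bound(critical_error, feature_distance, lipschitz_const):
    """Uncertainty proxy under an assumed Lipschitz condition: the bound
    tightens as the current sample gets closer to its critical sample,
    independent of which optimizer updates the model."""
    return abs(critical_error) + lipschitz_const * feature_distance

# A sample close to its critical sample gets a tighter bound than a
# distant one.
tight = uncertainty_bound(0.1, 0.2, 2.0)   # nearby critical sample
loose = uncertainty_bound(0.1, 2.0, 2.0)   # distant critical sample
assert tight < loose
```

Note that the bound uses only quantities already produced by the retrieval step, which is why it does not depend on the specific optimizer.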

2.1. TIME-SERIES PREDICTION

The time-series prediction problem is to make inferences about future time-series data given past and current observations. We consider a multi-step prediction problem: using the most recent $I$ steps of observations to predict the next $O$ steps of data. Assume the transition model is composed of a feature extractor (or encoder) $E$ and a predictor (or decoder) $f$. At time step $t$, the input to the model is $X_t = [x_{t-I+1}, x_{t-I+2}, \cdots, x_t]$, which denotes the stack of the $I$ most recent observations. The output of the model is $Y_{t+1} = [y_{t+1}, y_{t+2}, \cdots, y_{t+O}]$, which denotes the stack of the $O$-step future predictions. The observations $x_t, y_t$ are vectors that may contain trajectories or features, and $x_t = y_t$ in some cases (e.g., univariate prediction). The transition model for time-series prediction can be formulated as

$$Z_t = E(X_t), \quad (1)$$
$$Y_{t+1} = f_t(Z_t), \quad (2)$$

where $Z_t$ is a hidden feature representation of the input $X_t$. The feature extractor $E$ does not change over time, while the predictor $f_t$ changes over time. Let $f_t$ denote the ground-truth predictor that generates the true future data $Y_{t+1}$.
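The transition model in Eqs. (1)-(2) can be sketched with placeholder linear maps for $E$ and $f_t$; in the paper these are neural networks, and the dimensions below are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative dimensions: I input steps, O output steps,
# d_x-dimensional observations, d_z-dimensional hidden features.
I, O, d_x, d_z = 4, 2, 3, 5

rng = np.random.default_rng(0)
W_E = rng.normal(size=(d_z, I * d_x))   # fixed encoder weights
W_f = rng.normal(size=(O * d_x, d_z))   # predictor weights (updated online)

def E(X):
    """Encoder, Eq. (1): stack of I recent observations -> feature Z_t."""
    return W_E @ X.reshape(-1)

def f(Z):
    """Predictor, Eq. (2): feature Z_t -> stack of O future predictions."""
    return (W_f @ Z).reshape(O, d_x)

X_t = rng.normal(size=(I, d_x))   # X_t = [x_{t-I+1}, ..., x_t]
Y_pred = f(E(X_t))                # Y_{t+1} = f_t(E(X_t))
assert Y_pred.shape == (O, d_x)
```

The split matters for online adaptation: only the lightweight predictor $f_t$ is updated in real time, while the encoder $E$ stays fixed.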



Inspired by the great success of Transformers in the NLP and CV communities Vaswani et al. (2017); Dosovitskiy et al. (2020), Transformer-style methods have been introduced to capture long-term dependencies in time-series prediction tasks Zhou et al. (2021). Benefiting from the self-attention mechanism, Transformers have a great advantage in modeling long-term dependencies in sequential data Brown et al. (2020). However, although a trained Transformer (or other large deep neural network model) typically performs well on the training set, its performance can drop significantly in a slightly different test domain or under a slightly different data distribution Popel & Bojar (2018); Si et al. (2019).


