

Abstract

We introduce Active Tuning, a novel paradigm for optimizing the internal dynamics of recurrent neural networks (RNNs) on the fly. In contrast to the conventional sequence-to-sequence mapping scheme, Active Tuning decouples the RNN's recurrent neural activities from the input stream, using the unfolding temporal gradient signal to tune the internal dynamics into the data stream. As a consequence, the model output depends only on its internal hidden dynamics and the closed-loop feedback of its own predictions; its hidden state is continuously adapted by means of the temporal gradient that results from backpropagating the discrepancy between the signal observations and the model outputs through time. In this way, Active Tuning infers the signal actively but indirectly, based on the originally learned temporal patterns, fitting the most plausible hidden state sequence to the observations. We demonstrate the effectiveness of Active Tuning on several time series prediction benchmarks, including multiple superimposed sine waves, a chaotic double pendulum, and spatiotemporal wave dynamics. Active Tuning consistently improves the robustness, accuracy, and generalization abilities of all evaluated models. Moreover, networks trained for signal prediction and denoising can be successfully applied to a much larger range of noise conditions with the help of Active Tuning. Thus, given a capable time series predictor, Active Tuning enhances its online signal filtering, denoising, and reconstruction abilities without the need for additional training.

1. INTRODUCTION

Recurrent neural networks (RNNs) are inherently robust against noise only to a limited extent, and they often generate unsuitable predictions when confronted with corrupted or missing data (cf., e.g., Otte et al., 2015). To tackle noise, an explicit noise-aware training procedure can be employed, yielding denoising networks that are targeted at particular noise types and levels. Recurrent oscillators, such as echo state networks (ESNs) (Jaeger, 2001; Koryakin et al., 2012; Otte et al., 2016), however, are highly dependent on a clean and accurate target signal when initialized with teacher forcing. Given an overly noisy signal, the system is often not able to tune its neural activities into the desired target dynamics at all. Here, we present a method that can be seen as an alternative to regular teacher forcing and, moreover, as a general tool for more robustly tuning, and thus synchronizing, the dynamics of a generative differentiable temporal forward model (such as a standard RNN, ESN, or LSTM-like RNN; Hochreiter & Schmidhuber, 1997; Otte et al., 2014; Chung et al., 2014; Otte et al., 2016) into the observed data stream. The proposed method, which we call Active Tuning, uses gradient backpropagation through time (BPTT) (Werbos, 1990), where the backpropagated gradient signal is used to tune the hidden activities of a neural network instead of adapting its weights. The way we utilize the temporal gradient signal is related to learning parametric biases (Sugita et al., 2011) and applying dynamic context inference (Butz et al., 2019). With Active Tuning, two essential aspects apply: First, during signal inference, the model is not driven by the observations directly, but indirectly via prediction error-induced temporal gradient information, which is used to infer the hidden state activation sequence that best explains the observed signal.
Second, the general stabilization ability of propagating signal hypotheses through the network is exploited, effectively washing out activity components (such as noise) that cannot be modeled with the learned temporal structures within the network. As a result, the vulnerable internal dynamics are kept within a system-consistent activity milieu, effectively decoupling them from noise and other unknown distortions present in the defective actual signal. In this work, we show that Active Tuning elicits enhanced signal filtering abilities without the need to explicitly train distinct models for exactly such purposes. Remarkably, this method allows, for instance, the successful application of an entirely noise-unaware RNN (trained on clean, ideal data) under highly noisy and unknown conditions. In the following, we first detail the Active Tuning algorithm. We then evaluate the RNN on three time series benchmarks: multiple superimposed sine waves, a chaotic pendulum, and spatiotemporal wave dynamics. The results confirm that Active Tuning enhances noise robustness in all cases. In most cases, the mechanism even beats the performance of networks that were explicitly trained to handle a particular noise level. It can also cope with missing data when tuning the predictor's state into the observations. In conclusion, we recommend employing Active Tuning in all time series prediction cases in which the data is known to be noisy, corrupted, or to contain missing values, and in which the generative differentiable temporal forward model (typically a particular RNN architecture) knows about the potential underlying system dynamics. The resulting data processing system will be able to handle a larger range of noise and corrupted data, filtering the signal, generating more accurate predictions, and thus identifying the underlying data patterns more accurately and reliably.

2. ACTIVE TUNING

The starting point for the application of Active Tuning is a trained temporal forward model. This may be, as mentioned earlier, an RNN, but could also be another type of temporal model. The prerequisite is, however, a differentiable model that implements dependencies over time, such that BPTT can be used to route gradient information in reverse through the computational forward chain. Without loss of generality, we assume that the model of interest, whose forward function we refer to as f_M, has the following structure: f_M: (σ_t, x_t) ↦ (σ_{t+1}, x̂_{t+1}), where σ_t is the current latent hidden state of the model (e.g., the hidden outputs of LSTM units, their cell states, or any other latent variable of interest) and x_t is the current signal observation. Based on this information, f_M generates a prediction x̂_{t+1} of the next input and updates its latent state to σ_{t+1} accordingly. Following the conventional inference scheme, we feed a given sequence time step by time step into the network and receive a one-step-ahead prediction after each particular step. Over time, this effectively synchronizes the network with the observed signal. Once the network dynamics are initialized, which is typically realized by teacher forcing, the network can generate predictions and its dynamics can be driven further into the future in a closed-loop manner, whereby the network feeds itself with its own predictions. To realize both next-step and closed-loop predictions, direct contact with the signal is inevitable to drive the teacher forcing process. In contrast, Active Tuning decouples the network from the direct influence of the signal. Instead, the model is permanently kept in closed-loop mode, which initially prevents the network from generating meaningful predictions. Over a certain time frame, Active Tuning keeps track of the recent signal history, the recent hidden states of the model, as well as its recent predictions.
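As a minimal, hedged sketch of this scheme, the following toy example implements the assumed interface f_M: (σ_t, x_t) ↦ (σ_{t+1}, x̂_{t+1}), a closed-loop rollout, and a gradient-based adaptation of the seed hidden state of the kind the tuning cycles described in the next paragraph perform. The 2-D linear oscillator is a hypothetical stand-in for a trained RNN/ESN/LSTM, the names (`f_M`, `closed_loop`, `tune_seed`) are illustrative assumptions, gradients are written out analytically instead of via BPTT, and plain gradient descent replaces the Adam optimizer used in the paper.

```python
import numpy as np

# Toy forward model: a 2-D linear oscillator whose hidden state encodes
# the phase of a sine wave (a stand-in for a trained recurrent predictor).
OMEGA = 0.3  # per-step phase increment the toy model has "learned"
A = np.array([[np.cos(OMEGA),  np.sin(OMEGA)],
              [-np.sin(OMEGA), np.cos(OMEGA)]])

def f_M(sigma, x):
    """(sigma_t, x_t) -> (sigma_{t+1}, x_hat_{t+1}). The toy dynamics
    depend only on sigma; x is kept for interface parity with the text."""
    sigma_next = A @ sigma
    return sigma_next, sigma_next[0]  # observable = first state component

def closed_loop(sigma, n_steps):
    """Roll the model forward, feeding it its own predictions (no signal)."""
    x_hat, preds = sigma[0], []
    for _ in range(n_steps):
        sigma, x_hat = f_M(sigma, x_hat)
        preds.append(x_hat)
    return sigma, np.array(preds)

def tune_seed(sigma_seed, observations, lr=0.05, cycles=200):
    """Adapt the seed state sigma_{t-R} so the closed-loop predictions over
    the tuning horizon R = len(observations) minimize the prediction error.
    Gradients are analytic here; a real RNN would obtain them via BPTT."""
    R = len(observations)
    # First row of A^(k+1) maps the seed state to prediction x_hat_{k+1}.
    rows = [np.linalg.matrix_power(A, k + 1)[0] for k in range(R)]
    sigma = np.asarray(sigma_seed, dtype=float).copy()
    for _ in range(cycles):
        grad = np.zeros(2)
        for row, obs in zip(rows, observations):
            grad += 2.0 * (row @ sigma - obs) * row  # d/dsigma (x_hat - x)^2
        sigma -= lr * grad  # gradient step on the seed hidden state
    return sigma

rng = np.random.default_rng(0)
R = 20  # retrospective tuning horizon
truth = np.sin(OMEGA * np.arange(1, R + 1))
noisy = truth + 0.3 * rng.normal(size=R)  # corrupted observations

# Sanity check: a correctly seeded state reproduces the clean signal.
_, ideal = closed_loop(np.array([0.0, 1.0]), R)
print(np.allclose(ideal, truth))  # True

# Tune the seed against the noisy history, then re-unfold in closed loop:
# the regenerated sequence explains the observations while filtering noise.
sigma = tune_seed(np.zeros(2), noisy)
_, denoised = closed_loop(sigma, R)
print(np.mean((denoised - truth) ** 2) < np.mean((noisy - truth) ** 2))  # True
```

Because the model can only express trajectories consistent with its learned dynamics, the tuned rollout necessarily washes out the noise components, which is the stabilization effect described above.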
We call this time frame the (retrospective) tuning horizon or tuning length (denoted by R). The principle of Active Tuning can best be explained with the help of Figure 1 and Algorithm 1; the latter gives a more formal perspective on the principle. Note that for every invocation of the procedure, a previously unrolled forward chain (from the previous invocation or an initial unrolling) is assumed. L refers to the prediction error between the entire unrolled prediction sequence and the respective observations, whereas L_t refers to the local prediction error for a single time step t. With every newly perceived and potentially noise-affected signal observation x_t, one or multiple tuning cycles are performed. Every tuning cycle consists of the following stages: First, from the currently believed sequence of signal predictions (which is in turn based on a sequence of hidden states) and the actually observed recent inputs, a prediction error is calculated and propagated back into the past along the unfolded forward computation sequence. The temporal gradient travels to the very left of the tuning horizon and is finally projected onto the seed hidden state σ_{t-R}, which is then adapted by applying the gradient signal in order to minimize the encountered prediction error. This adaptation can be done using any gradient-based optimizer. Note that in this paper, we exclusively use Adam (Kingma & Ba, 2015), but other optimizers are possible as well. Second, after the adap-

