LEARNING IN TEMPORALLY STRUCTURED ENVIRONMENTS

Abstract

Natural environments have temporal structure at multiple timescales. This property is reflected in biological learning and memory but typically not in machine learning systems. We advance a multiscale learning method in which each weight in a neural network is decomposed as a sum of subweights with different learning and decay rates. Thus knowledge becomes distributed across different timescales, enabling rapid adaptation to task changes while avoiding catastrophic interference. First, we prove previous models that learn at multiple timescales, but with complex coupling between timescales, are equivalent to multiscale learning via a reparameterization that eliminates this coupling. The same analysis yields a new characterization of momentum learning, as a fast weight with a negative learning rate. Second, we derive a model of Bayesian inference over 1/f noise, a common temporal pattern in many online learning domains that involves long-range (power law) autocorrelations. The generative side of the model expresses 1/f noise as a sum of diffusion processes at different timescales, and the inferential side tracks these latent processes using a Kalman filter. We then derive a variational approximation to the Bayesian model and show how it is an extension of the multiscale learner. The result is an optimizer that can be used as a drop-in replacement in an arbitrary neural network architecture. Third, we evaluate the ability of these methods to handle nonstationarity by testing them in online prediction tasks characterized by 1/f noise in the latent parameters. We find that the Bayesian model significantly outperforms online stochastic gradient descent and two batch heuristics that rely preferentially or exclusively on more recent data. Moreover, the variational approximation performs nearly as well as the full Bayesian model, and with memory requirements that are linear in the size of the network.

1. INTRODUCTION

Many online tasks facing both biological and artificial intelligence systems involve changes in data distribution over time. Natural environments exhibit correlations at a wide range of timescales, a pattern variously referred to as self-similarity, power-law correlations, and 1/f noise (Keshner, 1982). This is in stark contrast with the iid environments assumed by many machine learning (ML) methods, and with diffusion or random-walk environments that exhibit only short-range correlations. Moreover, biological learning systems are well-tuned to the temporal statistics of natural environments, as seen in phenomena of human cognition including power laws in learning (Anderson, 1982), power-law forgetting (Wixted & Ebbesen, 1997), long-range sequential effects (Wilder et al., 2013), and spacing effects (Anderson & Schooler, 1991; Cepeda et al., 2008). An important goal is to incorporate similar inductive biases into ML systems for online or continual learning.

This paper analyzes a framework for learning in temporally structured environments, multiscale learning, which for neural networks (NNs) can be implemented as a new kind of optimizer. A common explanation for self-similar temporal structure in nature is that it arises from a mixture of events at various timescales. Indeed, many generative models of 1/f noise involve summing independent stochastic processes with varying time constants (Eliazar & Klafter, 2009). Accordingly, the multiscale optimizer comprises multiple learning processes operating in parallel at different timescales. In a NN, every weight w_j is replaced by a family of subweights ω_ij, each with its own learning rate and decay rate, that sum to determine the weight as a whole.
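The subweight decomposition can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: each subweight ω_ij decays toward zero at its own rate and is updated by the shared gradient with its own learning rate, and the effective weight is the sum over timescales. The specific learning and decay rates are placeholder values.

```python
import numpy as np

def multiscale_step(subweights, grad, lrs, decays):
    """One update of a multiscale-decomposed weight vector.

    subweights: (n_scales, n_params) array; the effective weight is subweights.sum(0)
    grad: (n_params,) gradient of the loss w.r.t. the summed weight
    lrs, decays: per-timescale learning and decay rates (illustrative values)
    """
    for i in range(subweights.shape[0]):
        # Each timescale decays independently and takes its own gradient step.
        subweights[i] = decays[i] * subweights[i] - lrs[i] * grad
    return subweights.sum(axis=0)  # effective weight seen by the network
```

A fast timescale (large learning rate, strong decay) adapts quickly and forgets quickly, while a slow timescale accumulates durable knowledge, which is how the decomposition supports rapid adaptation without catastrophic interference.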
Learning at multiple timescales is a key idea in several theories in neuroscience, including conditioning (Staddon et al., 2002), learning (Benna & Fusi, 2016), memory (Howard & Kahana, 2002; Mozer et al., 2009), and motor control (Kording et al., 2007), and has also been exploited in ML (Hinton & Plaut, 1987; Rusch et al., 2022). The multiscale learner isolates and simplifies this idea, by assuming knowledge at different timescales evolves independently and that credit assignment follows gradient descent. The first part of this paper (Sections 2 and 3) proves three other models are formally equivalent to instances of the multiscale optimizer: a new variant of fast weights (cf. Ba et al., 2016; Hinton & Plaut, 1987), the model synapse of Benna & Fusi (2016), and momentum learning (Rumelhart et al., 1986; Qian, 1999). The insight behind these proofs is that each of these models can be written in terms of a linear update rule with diagonalizable transition matrix. Thus the eigenvectors of this matrix correspond to states that evolve independently. By writing the state of the model as a mixture of eigenvectors, we effect a coordinate transformation that exactly yields the multiscale optimizer. These results imply that the complicated coupling among timescales assumed by some models can be superfluous. They also provide a new perspective on momentum learning, with implications for how and when it is beneficial and how it interacts with nonstationarity in the task environment.

In Section 4, we provide a normative grounding for multiscale learning in terms of Bayesian inference over 1/f noise. Our starting point is a generative model of 1/f noise as a sum of diffusion processes at different timescales. Exact Bayesian inference with respect to this generative process is possible using a Kalman filter (KF) that tracks the component processes jointly (Kording et al., 2007).
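The diagonalization argument can be checked numerically. The sketch below uses an arbitrary diagonalizable transition matrix (not the Benna-Fusi matrix itself) to show that coupled linear dynamics s ← As + gb are exactly reproduced by independent scalar updates in the eigenbasis, where each coordinate decays at its own eigenvalue, i.e., behaves as an uncoupled subweight.

```python
import numpy as np

# Coupled linear dynamics: state s evolves as s <- A s + g * b,
# with A diagonalizable. A and b are illustrative choices.
A = np.array([[0.9, 0.05],
              [0.1, 0.8]])
b = np.array([1.0, 0.0])
g = 0.3  # a gradient-like input signal, held fixed for simplicity

evals, V = np.linalg.eig(A)   # A = V diag(evals) V^{-1}
Vinv = np.linalg.inv(V)

# Coupled update, run directly
s = np.zeros(2)
for _ in range(5):
    s = A @ s + g * b

# Decoupled update in the eigenbasis: each coordinate is an
# independent "subweight" with its own decay rate (eigenvalue)
z = np.zeros(2)
b_eig = Vinv @ b
for _ in range(5):
    z = evals * z + g * b_eig

assert np.allclose(V @ z, s)  # the two trajectories coincide
```

The same change of coordinates, applied to the momentum update, exposes momentum as a fast subweight with a negative learning rate, as the paper's equivalence results describe.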
When learning a single environmental parameter θ, such as mean reward for some action in a bandit task, this amounts to modeling θ(t) = Σ_{i=1}^{n} z_i(t), where each z_i is a diffusion process with a different characteristic timescale τ_i, and doing joint inference over Z = (z_1, ..., z_n). We then generalize this approach to an arbitrary statistical model, h(x, θ), where x is the input and θ ∈ R^m is a parameter vector to be estimated. For instance, h might be a NN with parameters θ. Our Bayesian model places a 1/f prior on θ (as a stochastic process), by assuming θ(t) = Σ_{i=1}^{n} z_i(t) for diffusion processes z_i ∈ R^m with characteristic timescales τ_i. We then do approximate inference over the joint state Z = (z_1, ..., z_n), using an extended Kalman filter (EKF) that linearizes h by calculating its Jacobian at each step (Singhal & Wu, 1989; Puskorius & Feldkamp, 2003). Next, we derive a variational approximation to the EKF that constrains the covariance matrix to be diagonal, and show how it extends the multiscale optimizer. Specifically, writing w_j and ω_ij as the current mean estimates of θ_j and z_ij (for weight j and timescale i), the variational update to each ω_ij follows that of the multiscale optimizer, with additional machinery for determining decay rates based on τ_i and adapting learning rates based on the current prior variance s²_ij(t).

In Section 5, we test our methods in online prediction and classification tasks with nonstationary distributions. In online learning, nonstationarity often manifests as poorer generalization performance on future data versus held-out data from within the training interval. Common solutions are to train on a window of fixed length (to exclude "stale" data) or to use stochastic gradient descent (SGD) with fixed learning rate and weight decay, which leads older observations to have less influence (Ditzler et al., 2015).
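For the single-parameter case, both sides of the model fit in a short script. The sketch below is an illustrative instance, not the paper's code: θ(t) is generated as a sum of discretized diffusion (Ornstein-Uhlenbeck) processes at three timescales, and a Kalman filter with diagonal transition and process-noise matrices jointly tracks the latent components from noisy observations of their sum. The timescales, noise levels, and horizon are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generative side: theta(t) = sum_i z_i(t), each z_i a discretized
# diffusion process with characteristic timescale tau_i.
taus = np.array([1.0, 10.0, 100.0])
decay = np.exp(-1.0 / taus)      # per-step retention factor
q = 1.0 - decay**2               # process noise keeping each z_i at unit variance

T = 500
z = np.zeros(len(taus))
theta = np.empty(T)
for t in range(T):
    z = decay * z + rng.normal(0.0, np.sqrt(q))
    theta[t] = z.sum()

# Inferential side: a Kalman filter over the joint state Z, given noisy
# observations y(t) = theta(t) + noise. The component processes are
# independent, so F and Q are diagonal; H reads out their sum.
F = np.diag(decay)
Q = np.diag(q)
H = np.ones((1, len(taus)))
R = np.array([[0.25]])           # assumed observation noise variance

mu = np.zeros(len(taus))
P = np.eye(len(taus))
ys = np.empty(T)
est = np.empty(T)
for t in range(T):
    # Predict step
    mu = F @ mu
    P = F @ P @ F.T + Q
    # Update step on a noisy observation of theta(t)
    ys[t] = theta[t] + rng.normal(0.0, 0.5)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    mu = mu + K.ravel() * (ys[t] - H @ mu).item()
    P = (np.eye(len(taus)) - K @ H) @ P
    est[t] = mu.sum()
```

Because the filter exploits the temporal correlations at each timescale, its estimate of θ(t) is more accurate than the raw observations; the EKF generalization replaces H with the Jacobian of h at each step.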
Here, we demonstrate that performance can be significantly improved by retaining all data and using a learning model that accounts for the temporal structure of the environment. We introduce nonstationarity in our simulations by varying the latent data-generating parameters according to 1/f noise. Thus an important caveat is that the task domains are matched to the Bayesian model. Notwithstanding, we test robustness by using a different set of timescales for task generation versus learning (Section 5.1), a generative process that mismatches the NN architecture (Section 5.2), and a construction of 1/f noise that differs from the sum-of-diffusions process the model assumes (Section 5.3). Results show the Bayesian methods (KF and EKF) outperform windowing and online SGD, as well as a novel heuristic of training the network on all past data with gradients weighted by recency. We also find the variational approximation performs nearly as well as the full model (Section 5.1) and scales well to a multilayer NN trained on real data (Section 5.3).

2. MULTISCALE OPTIMIZER

Assume a statistical model ŷ(t) = h(x(t), w(t)) and loss function L(y, ŷ), where x(t) is the input on step t, w(t) is the parameter estimate, ŷ(t) is the model output, and y(t) is the target output. In a

