LEARNING REPRESENTATIONS FROM TEMPORALLY SMOOTH DATA

Abstract

Events in the real world are correlated across nearby points in time, and we must learn from this temporally "smooth" data. However, when neural networks are trained to categorize or reconstruct single items, the common practice is to randomize the order of training items. What are the effects of temporally smooth training data on the efficiency of learning? We first tested the effects of smoothness in training data on incremental learning in feedforward nets and found that smoother data slowed learning. Moreover, sampling so as to minimize temporal smoothness produced more efficient learning than sampling randomly. If smoothness generally impairs incremental learning, then how can networks be modified to benefit from smoothness in the training data? We hypothesized that two simple brain-inspired mechanisms, leaky memory in activation units and memory gating, could enable networks to rapidly extract useful representations from smooth data. Across all levels of data smoothness, these brain-inspired architectures achieved more efficient category learning than feedforward networks. This advantage persisted even when leaky memory networks with gating were trained on smooth data and tested on randomly ordered data. Finally, we investigated how these brain-inspired mechanisms altered the internal representations learned by the networks. We found that networks with multi-scale leaky memory and memory gating could learn internal representations that "un-mixed" data sources which vary on fast and slow timescales across training samples. Altogether, we identified simple mechanisms enabling neural networks to learn more quickly from temporally smooth data, and to generate internal representations that separate timescales in the training signal.

1. INTRODUCTION

Events in the world are correlated in time: the information that we receive at one moment is usually similar to the information that we receive at the next. For example, when having a conversation with someone, we see multiple samples of the same face from different angles over the course of several seconds. However, when we train neural networks for categorization or reconstruction tasks, we commonly ignore the temporal ordering of samples and use randomly ordered data. Given that humans can learn robustly and efficiently from sequentially correlated data presented incrementally, it is important to examine what kinds of architectures and inductive biases may support such learning (Hadsell et al., 2020). We therefore asked: how does the sequential correlation structure of the data affect learning in neural networks that perform categorization or reconstruction of one input at a time? Moreover, which mechanisms can a network employ to exploit the temporal autocorrelation ("smoothness") of data, without needing to perform backpropagation through time (BPTT) (Sutskever, 2013)? We investigated this question in three stages. In the first stage, we examined the effects of temporally smooth training data on feedforward neural networks performing category learning. Here we confirmed that autocorrelation in the training data slows learning in feedforward nets. In the second stage, we investigated conditions under which these classifier networks might take advantage of smooth data. We hypothesized that human brains may possess mechanisms (or inductive biases) that maximize the benefits of learning from temporally smooth data. We therefore tested two network mechanisms inspired by properties of cortical circuits: leaky memory (associated with autocorrelated brain dynamics) and memory gating (associated with rapid changes of brain states at event boundaries).
We compared the performance of these mechanisms against memoryless networks and against a long short-term memory (LSTM) architecture trained using BPTT. Finally, having demonstrated that leaky memory can speed learning from temporally smooth data, we studied the internal representations learned by these networks. In particular, we showed that networks with multi-scale leaky memory and resetting could learn internal representations that separate fast-changing and slow-changing data sources.
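To make the two mechanisms concrete, the following is a minimal sketch of a leaky-memory layer with an optional memory gate. The update rule h_t = lam * h_{t-1} + (1 - lam) * relu(W x_t) and the reset criterion (flushing memory when the input changes sharply) are illustrative assumptions, not the paper's exact equations; the function name `leaky_forward` and the threshold parameter are hypothetical.

```python
import numpy as np

def leaky_forward(x_seq, W, lam=0.9, reset_threshold=None):
    """Hypothetical leaky-memory layer. The hidden state is an exponentially
    weighted average of current and past feedforward activations:
        h_t = lam * h_{t-1} + (1 - lam) * relu(W @ x_t)
    Optionally, the state is reset to zero when the input changes sharply,
    a stand-in for memory gating at event boundaries."""
    h = np.zeros(W.shape[0])
    prev_x = None
    states = []
    for x in x_seq:
        if reset_threshold is not None and prev_x is not None:
            if np.linalg.norm(x - prev_x) > reset_threshold:
                h = np.zeros_like(h)   # gate: flush memory at an event boundary
        a = np.maximum(0.0, W @ x)     # feedforward activation
        h = lam * h + (1.0 - lam) * a  # leaky integration; no BPTT required
        states.append(h.copy())
        prev_x = x
    return np.stack(states)
```

Note that the memory here is a forward-pass property only: gradients need not flow through time, which is what distinguishes this mechanism from an LSTM trained with BPTT.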

2. RELATED WORK

Effects of sampling strategies on incremental learning. The ordering of training examples affects the speed and quality of learning. For example, learning can be sped up by presenting "easier" examples earlier and then gradually increasing difficulty (Elman, 1993; Bengio et al., 2009; Kumar et al., 2010; Lee & Grauman, 2011). Similarly, learning can be more efficient if the training data are organized so that the magnitude of weight updates increases over training samples (Gao & Jojic, 2016). Here, we do not manipulate the order based on item difficulty or proximity to category boundaries; we only explore the effects of ordering similar items nearby in time. We aim to identify mechanisms that can aid efficient learning across many levels of temporal autocorrelation, adapting to whatever level is present in the data. This ability to adapt is important in real-world settings, where a learner may lack control over the training order or prior knowledge of item difficulty.

Potential costs and benefits of training with smooth data. In machine learning research, it is often assumed that training samples are independent and identically distributed (iid) (Dundar et al., 2007). When training with random sampling, one can approximately satisfy the iid assumption, because shuffling eliminates sequential correlations. However, in many real-world situations the iid assumption is violated and consecutive training samples are strongly correlated. Temporally correlated data may slow learning in feedforward neural networks. If consecutive items are similar, then the gradients they induce will be related, especially early in training. If we consider the average of the gradients induced by the whole training set as the "ideal" gradient, then subsets of similar samples provide a higher-variance (i.e., noisier) estimate of this ideal. Moreover, smoothness in the data may slow learning via catastrophic forgetting (French, 1999).
Suppose that, to make the training data smoother, we sample multiple times from a category before moving on to another category. The next presentation of each category will then be, on average, farther in time from its previous appearance. This increased separation could lead to greater forgetting of that category, thus slowing learning overall. On the other hand, smoother training data might also benefit learning. For example, some category-diagnostic features may not be reliably extracted by a learning algorithm unless multiple weight updates occur for that feature nearby in time; smoother training data would be more likely to present such features nearby in time.
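The sampling scheme above can be sketched as follows. This is an illustrative implementation, not the paper's exact procedure: the function `smooth_order` and its `repeat` parameter (the number of consecutive draws per category before switching) are assumed names, with `repeat=1` recovering a maximally interleaved ordering.

```python
import random

def smooth_order(items_by_class, repeat=3, seed=0):
    """Hypothetical sampler: draw `repeat` consecutive items from a class
    before moving to the next class, yielding temporally 'smooth' training
    data. Larger `repeat` means longer same-class runs (smoother data)."""
    rng = random.Random(seed)
    pools = {c: list(items) for c, items in items_by_class.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    order = []
    while any(pools.values()):
        # Visit the remaining classes in a random order each round.
        classes = [c for c, pool in pools.items() if pool]
        rng.shuffle(classes)
        for c in classes:
            for _ in range(min(repeat, len(pools[c]))):
                order.append((c, pools[c].pop()))
    return order

# Two classes with four items each; repeat=2 produces runs of two
# same-class items before the category switches.
data = {"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}
order = smooth_order(data, repeat=2)
```

Under this scheme, each category's next run is pushed farther into the future as `repeat` grows, which is exactly the spacing that could amplify forgetting between revisits.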

3. RESEARCH QUESTIONS AND HYPOTHESES

1. How does training with temporally smooth data affect learning in feedforward networks? In light of the work reviewed above, we hypothesized that temporally smooth data would slow learning in feedforward nets.

2. How can neural networks benefit from temporally smooth data, in terms of either learning efficiency or learning more meaningfully structured representations? We hypothesized that a combination of two brain-inspired mechanisms, leaky memory and memory-resetting, could enable networks to learn more efficiently from temporally smooth data, even without BPTT.

4. EFFECTS OF TEMPORAL SMOOTHNESS IN TRAINING DATA ON LEARNING IN FEEDFORWARD NEURAL NETWORKS

We first explored how smoothness of data affects the speed and accuracy of category learning (classification) in feedforward networks. See Appendix A.1 for similar results with unsupervised learning.

