DYNAMICAL SIGNATURES OF LEARNING IN RECURRENT NETWORKS

Abstract

Recurrent neural networks (RNNs) are powerful computational tools that operate best near the edge of chaos, where small perturbations in neuronal firing are transmitted between neurons with minimal amplification or loss. In this article, we depart from the observation that both stimulus and noise can be seen as perturbations to the intrinsic dynamics of a recurrent network; however, stimulus information must be reliably preserved, while noise must be discarded. First, we show that self-organizing recurrent networks (SORNs) that learn the spatiotemporal structure of their inputs increase their recurrent memory by preferentially propagating the relevant stimulus-specific structured signals, while becoming more robust to random perturbations. We find that the computational advantages gained through self-supervised learning are accompanied by a shift from critical to ordered dynamics, and that this dynamical shift varies with the structure of the stimulus. Next, we show that SORNs with subcritical dynamics can outperform their random RNN counterparts with critical dynamics on a range of tasks, including a temporal MNIST task and a sequential shape-rotation task. Interestingly, when a shape is rotated, learning in the subcritical SORNs improves both the invariant (shape) and the variant (motion direction) aspects of the stimulus sequence. We propose that this shift in criticality is a signature of specialization, and we expect it to be found in all cases in which general-purpose recurrent networks acquire self-correcting properties by internalizing the statistical structure of their inputs.

1. INTRODUCTION

Randomly-connected recurrent neural networks (RNNs), often referred to as 'reservoir networks' Maass et al. (2002); Jaeger & Haas (2004), are powerful computational tools that mimic the brain's mastery of sequential processing Buonomano & Maass (2009), while achieving impressive performance across a wide variety of applications. As dynamical systems, RNNs exhibit a fading memory of recent inputs which, in the limit, can be mapped through a simple readout to approximate any target function Maass & Markram (2004). In their seminal work, Bertschinger and Natschläger established a relationship between the computational power of recurrent networks and their dynamical properties Bertschinger & Natschläger (2004). Specifically, they showed that randomly connected networks perform best when they operate at the critical edge between stability and chaos. While random RNNs with critical dynamics exhibit a strong memory of recent inputs for arbitrary input sequences, they differ substantially from real cortical networks, which are remarkably structured and specialized. 'Why did our brains evolve so as to contain so many specialized parts?' ask Minsky and Papert in Perceptrons Minsky & Papert (1969). In order for a machine to learn to recognize or perform X, be it a pattern or a process, that machine must in one sense or another learn to represent or embody X. It is therefore sensible to believe that a good model for, for example, visual processing is one that internally captures the common contingencies among features present in the natural visual environment. It is currently unknown how such learning or specialization impacts the dynamics of an RNN. In this paper, we ask how network dynamics relate to performance when RNNs learn the spatiotemporal structure of their inputs.
We contrast random RNNs, tuned close to criticality, with self-organizing recurrent networks (SORNs) that learn the sequential structure of their inputs via unsupervised, biologically plausible plasticity mechanisms Lazar et al. (2009); Hartmann et al. (2015).

2.1. RECURRENT NEURAL NETWORK

The network model consists of 80% excitatory units (N^E) and 20% inhibitory units (N^I). W^IE, W^EI and W^II are dense connectivity matrices with entries drawn randomly from the interval [0, 1] and normalized so that the incoming connections to each neuron sum up to a constant (Σ_j W_ij = 1). The connections between excitatory units, W^EE, are random and sparse (p^EE is the probability of a connection), with no self-recurrence. For each discrete time step t, the network state is given by the two binary vectors x(t) ∈ {0, 1}^(N^E) and y(t) ∈ {0, 1}^(N^I), representing the activity of the excitatory and inhibitory units, respectively. The network evolves using the following update functions:

x(t + 1) = Θ(W^EE(t) x(t) − W^EI y(t) + U(t) − T^E(t))   (1)
y(t + 1) = Θ(W^IE(t) x(t) − W^II y(t) − T^I)   (2)

The Heaviside step function Θ constrains the network activation at time t to a binary representation: a neuron fires if the total drive it receives is greater than its threshold. The threshold values for excitatory (T^E) and inhibitory (T^I) units are drawn from uniform distributions on the intervals [0, T^E_max] and [0, T^I_max], respectively. The initial values for excitatory weights (W^EE) and thresholds (T^E) stay constant for RNNs. These parameters have been chosen such that the dynamics of the RNNs is close to criticality and their memory performance is high. SORNs change their weights and thresholds during stimulation in an unsupervised fashion, following the rules described in Section 2.3. The input U(t) varies as a function of time and is delivered to a subset N^U of the excitatory units.
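As an illustration, the update equations (1)–(2) can be sketched as follows. The network sizes, sparsity, and threshold ranges below are placeholder values chosen for the sketch, not the exact parameters of the model.

```python
# Minimal sketch of one update step of the binary E-I network (Eqs. 1-2).
# Sizes, p_EE, and threshold ranges are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
N_E, N_I, N_U = 200, 50, 10          # excitatory, inhibitory, input units

def normalize_rows(W):
    """Scale incoming connections so they sum to a constant (here 1) per neuron."""
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

# Dense inhibitory-related matrices; sparse, non-self-recurrent E->E matrix.
W_EI = normalize_rows(rng.uniform(0, 1, (N_E, N_I)))
W_IE = normalize_rows(rng.uniform(0, 1, (N_I, N_E)))
W_II = normalize_rows(rng.uniform(0, 1, (N_I, N_I)))
p_EE = 0.1                           # connection probability for W_EE
W_EE = rng.uniform(0, 1, (N_E, N_E)) * (rng.random((N_E, N_E)) < p_EE)
np.fill_diagonal(W_EE, 0.0)          # no self-recurrence
W_EE = normalize_rows(W_EE)

T_E = rng.uniform(0, 0.5, N_E)       # thresholds uniform in [0, T_max]
T_I = rng.uniform(0, 0.35, N_I)

def step(x, y, u):
    """One discrete-time update: Heaviside threshold of the total drive."""
    x_new = (W_EE @ x - W_EI @ y + u - T_E > 0).astype(int)
    y_new = (W_IE @ x - W_II @ y - T_I > 0).astype(int)
    return x_new, y_new

x = (rng.random(N_E) < 0.1).astype(int)
y = (rng.random(N_I) < 0.1).astype(int)
u = np.zeros(N_E); u[:N_U] = 1.0     # input drives a subset N_U of E units
x, y = step(x, y, u)
```

Because the states are binary and the update is a thresholded linear drive, a single step is just two matrix-vector products followed by comparisons.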

2.2. PERTURBATION ANALYSIS

To determine whether an RNN is in the chaotic or the ordered phase, we use the perturbation approach proposed by Derrida and Pomeau for autonomous systems Derrida & Pomeau (1986) and employed by Bertschinger and Natschläger for RNNs Bertschinger & Natschläger (2004). At each time step t, we flip the activity of one of the reservoir units (excitatory, non-input), giving a perturbed state vector x′(t) at Hamming distance 1 from the original state x(t). We map both states, x(t) and x′(t), to their corresponding successor states x(t + 1) and x′(t + 1) using the same weights and thresholds, and we quantify the change in the Hamming distance. Distances that are amplified are a signature of chaos, whereas distances that shrink are a signature of order.
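The probe can be sketched as follows. The small random network below is only a stand-in so that the sketch runs; the probe itself follows the text: flip one excitatory unit, advance both states with identical weights and thresholds, and compare the successors.

```python
# Sketch of the one-bit perturbation probe of Sec. 2.2 (Derrida & Pomeau).
# The network parameters here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)
N_E, N_I = 100, 25

def normalize_rows(W):
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

W_EE = rng.uniform(0, 1, (N_E, N_E)) * (rng.random((N_E, N_E)) < 0.1)
np.fill_diagonal(W_EE, 0.0)
W_EE = normalize_rows(W_EE)
W_EI = normalize_rows(rng.uniform(0, 1, (N_E, N_I)))
W_IE = normalize_rows(rng.uniform(0, 1, (N_I, N_E)))
W_II = normalize_rows(rng.uniform(0, 1, (N_I, N_I)))
T_E, T_I = rng.uniform(0, 0.5, N_E), rng.uniform(0, 0.35, N_I)

def step(x, y, u):
    x_new = (W_EE @ x - W_EI @ y + u - T_E > 0).astype(int)
    y_new = (W_IE @ x - W_II @ y - T_I > 0).astype(int)
    return x_new, y_new

def perturbation_probe(x, y, u, unit):
    """Flip `unit` in a copy of x (Hamming distance 1 at time t), map both
    states forward with the same weights/thresholds, and return the
    Hamming distance of the successors. Growth above 1 signals chaos;
    shrinkage below 1 signals order."""
    x_pert = x.copy()
    x_pert[unit] ^= 1
    x1, _ = step(x, y, u)
    x1_pert, _ = step(x_pert, y, u)
    return int(np.sum(x1 != x1_pert))

x = (rng.random(N_E) < 0.1).astype(int)
y = (rng.random(N_I) < 0.1).astype(int)
d = perturbation_probe(x, y, np.zeros(N_E), unit=N_E - 1)
```

In practice the distance is averaged over many time steps and perturbed units before classifying the dynamics as ordered or chaotic.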

2.3. SELF-ORGANIZING RECURRENT NETWORKS

For learning, we utilize a simple model of spike-timing-dependent plasticity (STDP) that strengthens (weakens) the synaptic weight W^EE_ij by a fixed amount η_STDP = 0.001 whenever unit i is active in the time step following (preceding) activation of unit j:

∆W^EE_ij(t) = η_STDP (x_i(t) x_j(t − 1) − x_i(t − 1) x_j(t))   (3)

In addition, synaptic normalization is used to proportionally adjust the values of incoming connections to a neuron so that they sum up to a constant value:

W^EE_ij(t) ← W^EE_ij(t) / Σ_j W^EE_ij(t)

To stabilize learning, we utilize a homeostatic intrinsic plasticity (IP) rule that spreads the activity evenly across units, by modulating their excitability using a learning rate η_IP = 0.001. At each time step, the threshold of each excitatory unit is adjusted in proportion to the difference between its current activity and a target firing rate H_IP:

T^E_i(t + 1) = T^E_i(t) + η_IP (x_i(t) − H_IP)
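The three plasticity rules can be sketched as follows. The target-rate symbol H_IP and its value, and the restriction of STDP to existing synapses, are assumptions of the sketch standing in for the homeostatic set point and sparse connectivity described in the text.

```python
# Sketch of the SORN plasticity rules of Sec. 2.3: discrete-time STDP
# (Eq. 3), synaptic normalization, and homeostatic intrinsic plasticity.
# H_ip (target firing rate) is an assumed placeholder value.
import numpy as np

eta_stdp = 0.001   # STDP learning rate (Eq. 3)
eta_ip = 0.001     # intrinsic plasticity learning rate
H_ip = 0.1         # assumed homeostatic target firing rate

def stdp(W_EE, x_now, x_prev):
    """Eq. 3: potentiate W_ij when j fired one step before i, depress the
    reverse order; updates act only on existing (nonzero) synapses."""
    dW = eta_stdp * (np.outer(x_now, x_prev) - np.outer(x_prev, x_now))
    return np.clip(W_EE + dW * (W_EE > 0), 0.0, None)

def synaptic_normalization(W_EE):
    """Proportionally rescale each neuron's incoming E->E weights to sum to 1."""
    sums = W_EE.sum(axis=1, keepdims=True)
    return np.where(sums > 0, W_EE / sums, W_EE)

def intrinsic_plasticity(T_E, x_now):
    """Raise thresholds of units firing above the target rate, lower the rest."""
    return T_E + eta_ip * (x_now - H_ip)

# One plasticity update on a toy 4-unit excitatory network
rng = np.random.default_rng(2)
W = synaptic_normalization(rng.uniform(0, 1, (4, 4)) * (1 - np.eye(4)))
x_prev, x_now = np.array([1, 0, 1, 0]), np.array([0, 1, 0, 1])
W = synaptic_normalization(stdp(W, x_now, x_prev))
T = intrinsic_plasticity(np.full(4, 0.5), x_now)
```

Note that normalization is applied after each STDP step, so the total incoming excitatory weight per neuron is conserved while STDP redistributes it among synapses.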

