THE INFLUENCE OF LEARNING RULE ON REPRESENTATION DYNAMICS IN WIDE NEURAL NETWORKS

Abstract

It is unclear how changing the learning rule of a deep neural network alters its learning dynamics and representations. To gain insight into the relationship between learned features, function approximation, and the learning rule, we analyze infinite-width deep networks trained with gradient descent (GD) and biologically-plausible alternatives including feedback alignment (FA), direct feedback alignment (DFA), and error-modulated Hebbian learning (Hebb), as well as gated linear networks (GLNs). We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time-varying effective neural tangent kernel (eNTK). In the lazy training limit, this eNTK is static and does not evolve, while in the rich mean-field regime the kernel's evolution can be determined self-consistently with dynamical mean field theory (DMFT). This DMFT enables comparisons of the feature and prediction dynamics induced by each of these learning rules. In the lazy limit, we find that DFA and Hebb can learn only through their last-layer features, while full FA can utilize earlier layers with a scale determined by the initial correlation between the feedforward and feedback weight matrices. In the rich regime, DFA and FA utilize a temporally evolving and depth-dependent NTK. Counterintuitively, we find that FA networks trained in the rich regime exhibit more feature learning if initialized with smaller correlation between the forward- and backward-pass weights. GLNs admit a very simple formula for their lazy-limit kernel and preserve conditional Gaussianity of their preactivations under gating functions. Error-modulated Hebb rules show very little task-relevant alignment of their kernels and perform most task-relevant learning in the last layer.

1. INTRODUCTION

Deep neural networks have attained state-of-the-art performance across a variety of domains, including computer vision and natural language processing (Goodfellow et al., 2016; LeCun et al., 2015). Central to the power and transferability of neural networks is their ability to flexibly adapt their layer-wise internal representations to the structure of the data distribution during learning. In this paper, we explore how the learning rule used to train a deep network affects its learning dynamics and representations. Our primary motivation for studying different rules is that exact gradient descent (GD) training with the back-propagation algorithm is thought to be biologically implausible (Crick, 1989). While many alternatives to standard GD training have been proposed (Whittington & Bogacz, 2019), it is unclear how modifying the learning rule changes the functional inductive bias and the learned representations of the network. Further, understanding the learned representations could offer insight into which learning rules account for representational changes observed in the brain (Poort et al., 2015; Kriegeskorte & Wei, 2021; Schumacher et al., 2022). Our current study is a step in these directions. The alternative learning rules we study are error-modulated Hebbian learning (Hebb), feedback alignment (FA) (Lillicrap et al., 2016), and direct feedback alignment (DFA) (Nøkland, 2016). FA and DFA circumvent one of the biologically implausible features of GD, the weight transport problem: the requirement that the weights used in the backward-pass computation of error signals be identical to the weights used in the forward pass. Instead, they compute an approximate backward pass with independent weights that are frozen throughout training. The Hebb rule uses only a global error signal.
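As a concrete illustration, the single-step weight updates under these rules can be sketched for a two-layer scalar-output network. This is a minimal numpy sketch with squared-error loss; the width, nonlinearity, and all variable names are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 128, 10                                  # hidden width, input dimension
x = rng.standard_normal(D)
y_target = 1.0

W1 = rng.standard_normal((N, D)) / np.sqrt(D)   # first-layer weights
w2 = rng.standard_normal(N) / np.sqrt(N)        # readout weights
B  = rng.standard_normal(N) / np.sqrt(N)        # fixed random feedback weights (FA)

def forward(W1, w2, x):
    h = np.tanh(W1 @ x)                         # hidden activations
    return h, w2 @ h                            # scalar output

h, f = forward(W1, w2, x)
err = f - y_target                              # scalar error signal

# Gradient descent: the backward pass reuses the forward readout weights w2.
gd_dW1 = err * np.outer(w2 * (1 - h**2), x)

# Feedback alignment: replace w2 in the backward pass with the frozen random B,
# avoiding weight transport.  (For a two-layer net, DFA coincides with FA,
# since the error is sent directly through B either way.)
fa_dW1 = err * np.outer(B * (1 - h**2), x)

# Error-modulated Hebb: only the global error modulates a pre/post product.
hebb_dW1 = err * np.outer(h, x)
```

Note how the three rules differ only in what multiplies the error on the backward pass, which is what makes a unified kernel description possible.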
While these learning rules do not perform exact GD, they can still evolve their internal representations and eventually fit the training data. Experiments have shown that FA and DFA scale to certain problems such as view synthesis, recommendation systems, and small-scale image tasks (Launay et al., 2020), but they do not perform as well in convolutional architectures on more complex image datasets (Bartunov et al., 2018). However, significant improvements to FA can be achieved if the feedback weights are partially correlated with the feedforward weights (Xiao et al., 2018; Moskovitz et al., 2018; Boopathy & Fiete, 2022). We also study gated linear networks (GLNs), which use frozen gating functions in place of nonlinearities (Fiat et al., 2019). Variants of these networks have biologically plausible interpretations in terms of dendritic gates (Sezener et al., 2021). Fixed gating can mitigate catastrophic forgetting (Veness et al., 2021; Budden et al., 2020) and enable efficient transfer and multi-task learning (Saxe et al., 2022).

Here, we explore how the choice of learning rule modifies the representations, functional biases, and dynamics of deep networks in the infinite-width limit, which allows a precise analytical description of the network dynamics in terms of a collection of evolving kernels. At infinite width, the network can operate in the lazy regime, where the feature embeddings at each layer are constant through time, or in the rich/feature-learning regime (Chizat et al., 2019; Yang & Hu, 2021; Bordelon & Pehlevan, 2022). The richness is controlled by a scalar parameter related to the initial scale of the output function.

In summary, our novel contributions are the following:

1. We identify a class of learning rules for which function evolution is described by a dynamical effective neural tangent kernel (eNTK). We provide a dynamical mean field theory (DMFT) for these learning rules which can be used to compute this eNTK.
We show both theoretically and empirically that convergence to this DMFT occurs at large width N with error O(N^{-1/2}).

2. We precisely characterize the inductive biases of infinite-width networks in the lazy limit by computing their eNTKs at initialization. We generalize FA to allow partial correlation between the feedback weights and the initial feedforward weights, and show how this alters the eNTK.

3. We then study the rich regime, in which the features are allowed to adapt during training. In this regime the eNTK is dynamical, and we give a DMFT to compute it. For deep linear networks the DMFT equations close algebraically, while for nonlinear networks we provide a numerical procedure to solve them.

4. We compare the learned features and dynamics among these rules, analyzing the effects of richness, initial feedback correlation, and depth. We find that rich training enhances gradient-pseudogradient alignment for both FA and DFA. Counterintuitively, smaller initial feedback correlation generates more dramatic feature evolution for FA. GLNs have dynamics comparable to GD, while Hebb networks, as expected, do not exhibit task-relevant adaptation of their feature kernels, but rather evolve according to the input statistics.

Lillicrap et al. (2016) showed that, in a two-layer linear network, the forward weights evolve to align with the frozen feedback weights under FA dynamics, allowing the network to converge to a loss minimizer. This result was extended to deep networks by Frenkel et al. (2019), who also introduced a variant of FA in which only the direction of the target is used. Refinetti et al. (2021) studied DFA in a two-layer student-teacher online learning setup, showing that the network first undergoes an alignment phase before converging to one of the degenerate global minima of the loss.
They argued that FA's worse performance in CNNs is due to the inability of the forward-pass gradients to align under the block-Toeplitz connectivity structure that arises from enforced weight sharing (d'Ascoli et al., 2019). Garg & Vempala (2022) analyzed matrix factorization with FA,
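The eNTK that organizes this analysis can be estimated empirically for a finite network by taking inner products of parameter gradients of the output. Below is a minimal numpy sketch for a two-layer tanh network; the sizes, architecture, and names are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 512, 5                                   # hidden width, input dimension
W1 = rng.standard_normal((N, D)) / np.sqrt(D)   # first-layer weights
w2 = rng.standard_normal(N) / np.sqrt(N)        # readout weights

def grads(x):
    # Gradient of the scalar output f(x) = w2 . tanh(W1 x) w.r.t. all parameters,
    # flattened into a single vector.
    h = np.tanh(W1 @ x)
    df_dw2 = h
    df_dW1 = np.outer(w2 * (1 - h**2), x)
    return np.concatenate([df_dW1.ravel(), df_dw2])

def entk(x1, x2):
    # Empirical NTK entry: inner product of output gradients at two inputs.
    return grads(x1) @ grads(x2)

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
K = np.array([[entk(x1, x1), entk(x1, x2)],
              [entk(x2, x1), entk(x2, x2)]])
# In the lazy infinite-width limit this kernel stays fixed during training;
# in the rich regime it evolves in time, which the DMFT tracks.
```

For non-GD rules the backward-pass gradients above would be replaced by the rule's pseudogradients, yielding the effective (rather than exact) NTK.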

2. RELATED WORKS

GLNs were introduced by Fiat et al. (2019) as a simplified model of ReLU networks, allowing the analysis of convergence and generalization in the lazy kernel limit. Veness et al. (2021) provided a simplified and biologically-plausible learning rule for deep GLNs, which was extended by Budden et al. (2020) and given an interpretation in terms of dendritic gating by Sezener et al. (2021). These works demonstrated benefits for continual learning due to the fixed gating. Saxe et al. (2022) derived exact dynamical equations for a GLN with gates operating at each node and each edge of the network graph. Krishnamurthy et al. (2022) provided a theory of gating in recurrent networks.

