THE INFLUENCE OF LEARNING RULE ON REPRESENTATION DYNAMICS IN WIDE NEURAL NETWORKS

Abstract

It is unclear how changing the learning rule of a deep neural network alters its learning dynamics and representations. To gain insight into the relationship between learned features, function approximation, and the learning rule, we analyze infinite-width deep networks trained with gradient descent (GD) and biologically plausible alternatives including feedback alignment (FA), direct feedback alignment (DFA), and error-modulated Hebbian learning (Hebb), as well as gated linear networks (GLNs). We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time-varying effective neural tangent kernel (eNTK). In the lazy training limit, this eNTK is static and does not evolve, while in the rich, mean-field regime its evolution can be determined self-consistently with dynamical mean-field theory (DMFT). This DMFT enables comparison of the feature and prediction dynamics induced by each of these learning rules. In the lazy limit, we find that DFA and Hebb can only learn using last-layer features, while full FA can utilize earlier layers with a scale determined by the initial correlation between feedforward and feedback weight matrices. In the rich regime, DFA and FA utilize a temporally evolving and depth-dependent NTK. Counterintuitively, we find that FA networks trained in the rich regime exhibit more feature learning if initialized with smaller correlation between the forward and backward pass weights. GLNs admit a very simple formula for their lazy-limit kernel and preserve conditional Gaussianity of their preactivations under gating functions. Error-modulated Hebbian rules show very small task-relevant alignment of their kernels and perform most task-relevant learning in the last layer.

1. INTRODUCTION

Deep neural networks have now attained state-of-the-art performance across a variety of domains including computer vision and natural language processing (Goodfellow et al., 2016; LeCun et al., 2015). Central to the power and transferability of neural networks is their ability to flexibly adapt their layer-wise internal representations to the structure of the data distribution during learning. In this paper, we explore how the learning rule that is used to train a deep network affects its learning dynamics and representations. Our primary motivation for studying different rules is that exact gradient descent (GD) training with the back-propagation algorithm is thought to be biologically implausible (Crick, 1989). While many alternatives to standard GD training have been proposed (Whittington & Bogacz, 2019), it is unclear how modifying the learning rule changes the functional inductive bias and the learned representations of the network. Further, understanding the learned representations could potentially offer more insight into which learning rules account for representational changes observed in the brain (Poort et al., 2015; Kriegeskorte & Wei, 2021; Schumacher et al., 2022). Our current study is a step towards these directions.

The alternative learning rules we study are error-modulated Hebbian learning (Hebb), feedback alignment (FA) (Lillicrap et al., 2016), and direct feedback alignment (DFA) (Nøkland, 2016). These rules circumvent one of the biologically implausible features of GD: the requirement that the weights used in the backward-pass computation of error signals be identical to the weights used in the forward pass.
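To make the distinction concrete, the following minimal sketch (not the paper's setup; a hypothetical two-layer linear network with squared loss, written with numpy) contrasts the backprop weight update with the FA update of Lillicrap et al. (2016), where the transpose of the forward weight matrix in the backward pass is replaced by a fixed random feedback matrix B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer linear network y = W2 @ W1 @ x (dimensions are arbitrary).
d_in, d_h, d_out, n = 5, 8, 2, 20
W1 = rng.normal(size=(d_h, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(d_out, d_h)) / np.sqrt(d_h)
# FA propagates errors through a fixed random matrix B instead of W2.T.
B = rng.normal(size=(d_h, d_out)) / np.sqrt(d_out)

X = rng.normal(size=(d_in, n))   # inputs, one column per sample
Y = rng.normal(size=(d_out, n))  # targets

def updates(W1, W2, X, Y, B=None):
    """Return (dW1, dW2) for squared loss; B=None gives exact backprop."""
    h = W1 @ X                      # hidden activations
    e = W2 @ h - Y                  # output-layer error
    back = W2.T if B is None else B # backward-pass weights: tied vs. random
    dW2 = -e @ h.T / n              # last-layer update is identical for both rules
    dW1 = -(back @ e) @ X.T / n     # first-layer update depends on the backward weights
    return dW1, dW2

dW1_gd, dW2_gd = updates(W1, W2, X, Y)        # backprop (GD)
dW1_fa, dW2_fa = updates(W1, W2, X, Y, B=B)   # feedback alignment
```

The last-layer updates coincide; only the error signal reaching earlier layers differs, which is why the degree of correlation between the forward weights and B governs how much the earlier layers can learn.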

