THE INFLUENCE OF LEARNING RULE ON REPRESENTATION DYNAMICS IN WIDE NEURAL NETWORKS

Abstract

It is unclear how changing the learning rule of a deep neural network alters its learning dynamics and representations. To gain insight into the relationship between learned features, function approximation, and the learning rule, we analyze infinite-width deep networks trained with gradient descent (GD) and biologically-plausible alternatives including feedback alignment (FA), direct feedback alignment (DFA), and error-modulated Hebbian learning (Hebb), as well as gated linear networks (GLNs). We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time-varying effective neural tangent kernel (eNTK). In the lazy training limit, this eNTK is static and does not evolve, while in the rich mean-field regime its evolution can be determined self-consistently with dynamical mean field theory (DMFT). This DMFT enables comparison of the feature and prediction dynamics induced by each learning rule. In the lazy limit, we find that DFA and Hebb can only learn using the last-layer features, while full FA can utilize earlier layers with a scale determined by the initial correlation between feedforward and feedback weight matrices. In the rich regime, DFA and FA utilize a temporally evolving and depth-dependent NTK. Counterintuitively, we find that FA networks trained in the rich regime exhibit more feature learning if initialized with smaller correlation between the forward- and backward-pass weights. GLNs admit a very simple formula for their lazy-limit kernel and preserve conditional Gaussianity of their preactivations under gating functions. Error-modulated Hebb rules show very small task-relevant alignment of their kernels and perform most task-relevant learning in the last layer.

1. INTRODUCTION

Deep neural networks have attained state-of-the-art performance across a variety of domains including computer vision and natural language processing (Goodfellow et al., 2016; LeCun et al., 2015). Central to the power and transferability of neural networks is their ability to flexibly adapt their layer-wise internal representations to the structure of the data distribution during learning. In this paper, we explore how the learning rule used to train a deep network affects its learning dynamics and representations. Our primary motivation for studying different rules is that exact gradient descent (GD) training with the back-propagation algorithm is thought to be biologically implausible (Crick, 1989). While many alternatives to standard GD training have been proposed (Whittington & Bogacz, 2019), it is unclear how modifying the learning rule changes the functional inductive bias and the learned representations of the network. Further, understanding the learned representations could offer insight into which learning rules account for representational changes observed in the brain (Poort et al., 2015; Kriegeskorte & Wei, 2021; Schumacher et al., 2022). Our current study is a step in these directions. The alternative learning rules we study are error-modulated Hebbian learning (Hebb), feedback alignment (FA) (Lillicrap et al., 2016), and direct feedback alignment (DFA) (Nøkland, 2016). These rules circumvent one of the biologically implausible features of GD: the requirement that the weights used in the backward-pass computation of error signals be dynamically identical to the weights used on the forward pass, known as the weight transport problem. Instead, FA and DFA compute an approximate backward pass with independent weights that are frozen through training, and the Hebb rule uses only a global error signal.
While these learning rules do not perform exact GD, they are still able to evolve their internal representations and eventually fit the training data. Further, experiments have shown that FA and DFA can scale to certain problems such as view synthesis, recommendation systems, and small-scale image problems (Launay et al., 2020), but they do not perform as well in convolutional architectures on more complex image datasets (Bartunov et al., 2018). However, significant improvements to FA can be achieved if the feedback weights have partial correlation with the feedforward weights (Xiao et al., 2018; Moskovitz et al., 2018; Boopathy & Fiete, 2022). We also study gated linear networks (GLNs), which use frozen gating functions for nonlinearity (Fiat et al., 2019). Variants of these networks have biologically-plausible interpretations in terms of dendritic gates (Sezener et al., 2021). Fixed gating can mitigate catastrophic forgetting (Veness et al., 2021; Budden et al., 2020) and enable efficient transfer and multi-task learning (Saxe et al., 2022). Here, we explore how the choice of learning rule modifies the representations, functional biases, and dynamics of deep networks in the infinite-width limit, which allows a precise analytical description of the network dynamics in terms of a collection of evolving kernels. At infinite width, the network can operate in the lazy regime, where the feature embeddings at each layer are constant through time, or in the rich/feature-learning regime (Chizat et al., 2019; Yang & Hu, 2021; Bordelon & Pehlevan, 2022). The richness is controlled by a scalar parameter related to the initial scale of the output function. In summary, our novel contributions are the following:

1. We identify a class of learning rules for which function evolution is described by a dynamical effective neural tangent kernel (eNTK). We provide a dynamical mean field theory (DMFT) for these learning rules which can be used to compute this eNTK. We show both theoretically and empirically that convergence to this DMFT occurs at large width N with error O(N^{-1/2}).
2. We characterize precisely the inductive biases of infinite-width networks in the lazy limit by computing their eNTKs at initialization. We generalize FA to allow partial correlation between the feedback weights and the initial feedforward weights and show how this alters the eNTK.
3. We then study the rich regime, in which features adapt during training. In this regime the eNTK is dynamical, and we give a DMFT to compute it. For deep linear networks the DMFT equations close algebraically, while for nonlinear networks we provide a numerical procedure to solve them.
4. We compare the learned features and dynamics among these rules, analyzing the effect of richness, initial feedback correlation, and depth. We find that rich training enhances gradient/pseudo-gradient alignment for both FA and DFA. Counterintuitively, smaller initial feedback correlation generates more dramatic feature evolution for FA. GLN networks have dynamics comparable to GD, while Hebb networks, as expected, do not exhibit task-relevant adaptation of their feature kernels, but rather evolve according to the input statistics.

1.1. RELATED WORKS

GLNs were introduced by Fiat et al. (2019) as a simplified model of ReLU networks, allowing the analysis of convergence and generalization in the lazy kernel limit. Veness et al. (2021) provided a simplified and biologically-plausible learning rule for deep GLNs, which was extended by Budden et al. (2020) and given an interpretation in terms of dendritic gating (Sezener et al., 2021). These works demonstrated benefits to continual learning due to the fixed gating. Saxe et al. (2022) derived exact dynamical equations for a GLN with gates operating at each node and each edge of the network graph. Krishnamurthy et al. (2022) provided a theory of gating in recurrent networks. Lillicrap et al. (2016) showed that, in a two-layer linear network, the forward weights evolve to align with the frozen feedback weights under FA dynamics, allowing convergence of the network to a loss minimizer. This result was extended to deep networks by Frenkel et al. (2019), who also introduced a variant of FA where only the direction of the target is used. Refinetti et al. (2021) studied DFA in a two-layer student-teacher online learning setup, showing that the network first undergoes an alignment phase before converging to one of the degenerate global minima of the loss. They argued that FA's worse performance in CNNs is due to the inability of the forward-pass gradients to align under the block-Toeplitz connectivity structure that arises from enforced weight sharing (d'Ascoli et al., 2019). Garg & Vempala (2022) analyzed matrix factorization with FA, proving that, when overparameterized, it converges to a minimizer under standard conditions, albeit more slowly than GD. Cao et al. (2020) analyzed the kernel and loss dynamics of linear networks trained with learning rules from a space that includes GD, contrastive Hebbian, and predictive coding rules, showing strong dependence of hierarchical representations on the learning rule.
Recent works have utilized DMFT techniques to analyze the typical performance of algorithms trained on high-dimensional random data (Agoritsas et al., 2018; Mignacco et al., 2020; Celentano et al., 2021; Gerbelot et al., 2022). In the present work, we do not average over random datasets, but rather over initial random weights, and treat the data as an input to the theory. Wide NNs have been analyzed at infinite width in both the lazy regime with the NTK (Jacot et al., 2018; Lee et al., 2019) and rich feature-learning regimes (Mei et al., 2018). In the feature-learning limit, the evolution of the kernel order parameters has been obtained with both the Tensor Programs framework (Yang & Hu, 2021) and with DMFT (Bordelon & Pehlevan, 2022). Song et al. (2021) recently analyzed the lazy infinite-width limit of two-layer networks trained with FA and weight decay, finding that only one layer effectively contributes to the two-layer NTK. Boopathy & Fiete (2022) proposed alignment-based learning rules for networks at large width in the lazy regime, which perform comparably to GD and outperform standard FA. Their Align-Ada rule corresponds to our ρ-FA with ρ = 1 in lazy large-width networks.

2. EFFECTIVE NEURAL TANGENT KERNEL FOR A LEARNING RULE

We denote the output of a neural network for input x_µ ∈ R^D as f_µ. For concreteness, in the main text we focus on scalar targets f_µ ∈ R and MLP architectures; multi-class outputs and CNN architectures with infinite channel count can also be analyzed, as we show in Appendix C. For the moment, we let the function be computed recursively from a collection of weight matrices θ = Vec{W^0, W^1, ..., w^L} in terms of preactivation vectors h^ℓ_µ ∈ R^N, where

$$f_\mu = \frac{1}{\gamma_0 N} w^L \cdot \phi(h^L_\mu), \qquad h^{\ell+1}_\mu = \frac{1}{\sqrt{N}} W^\ell \phi(h^\ell_\mu), \qquad h^1_\mu = \frac{1}{\sqrt{D}} W^0 x_\mu \tag{1}$$

and the nonlinearity ϕ is applied element-wise. The scalar parameter γ_0 controls how rich the network training is: small γ_0 corresponds to lazy learning, while large γ_0 generates large changes to the features (Chizat et al., 2019). For gated linear networks, we follow Fiat et al. (2019) and modify the forward pass by replacing ϕ(h^ℓ_µ) with a multiplicative gating function ϕ(m^ℓ_µ) ⊙ h^ℓ_µ, where the gating variables m^ℓ_µ = (1/√D) M^ℓ x_µ are fixed through training with M^ℓ_ij ∼ N(0, 1). To minimize the loss L = Σ_µ ℓ(f_µ, y_µ), we consider learning rules for the parameters θ of the form

$$\frac{d}{dt} w^L = \gamma_0 \sum_\mu \phi(h^L_\mu(t)) \Delta_\mu, \qquad \frac{d}{dt} W^\ell = \frac{\gamma_0}{\sqrt{N}} \sum_\mu \Delta_\mu \, \tilde g^{\ell+1}_\mu \phi(h^\ell_\mu)^\top, \qquad \frac{d}{dt} W^0 = \frac{\gamma_0}{\sqrt{D}} \sum_\mu \Delta_\mu \, \tilde g^1_\mu x_\mu^\top \tag{2}$$

where the error signal is ∆_µ(t) = -∂L/∂f_µ |_{f_µ(t)}. The last-layer weights w^L are always updated with their true gradient. This corresponds to the biologically-plausible and local delta rule, which merely correlates the error signals ∆_µ and the last-layer features ϕ(h^L_µ) (Widrow & Hoff, 1960). In intermediate layers, the pseudo-gradient vectors g̃^ℓ_µ are determined by the choice of learning rule. For concreteness, we provide below the recursive definitions of g̃^ℓ for our five learning rules of interest.
$$\tilde g^\ell_\mu = \begin{cases} \phi'(h^\ell_\mu) \odot \frac{1}{\sqrt{N}} W^\ell(t)^\top \tilde g^{\ell+1}_\mu, & \tilde g^L_\mu = \phi'(h^L_\mu) \odot w^L & \text{GD} \\ \phi'(h^\ell_\mu) \odot \frac{1}{\sqrt{N}} \left( \rho W^\ell(0) + \sqrt{1-\rho^2}\, \tilde W^\ell \right)^\top \tilde g^{\ell+1}_\mu, & \tilde W^\ell_{ij} \sim \mathcal N(0,1) & \rho\text{-FA} \\ \phi'(h^\ell_\mu) \odot \tilde z^\ell, & \tilde z^\ell_i \sim \mathcal N(0,1) & \text{DFA} \\ \phi(m^\ell_\mu) \odot \frac{1}{\sqrt{N}} W^\ell(t)^\top \tilde g^{\ell+1}_\mu, & \tilde g^L_\mu = \phi(m^L_\mu) \odot w^L(t) & \text{GLN} \\ \Delta_\mu(t)\, \phi(h^\ell_\mu(t)) & & \text{Hebb} \end{cases} \tag{3}$$

While GD uses the instantaneous feedforward weights on the backward pass, ρ-FA uses backward weight matrices which do not evolve throughout training; these have correlation ρ with the initial forward-pass weights W^ℓ(0). This choice is motivated by the observation that partial correlation between forward- and backward-pass weights at initialization can improve training (Liao et al., 2016; Xiao et al., 2018; Moskovitz et al., 2018), though the cost is partial weight transport at initialization. However, we consider partial correlation at initialization more biologically plausible than the demanding weight transport at each step of training, as in GD. For DFA, the weight vectors z̃^ℓ are sampled randomly at initialization and do not evolve in time. For GLN, the gating variables m^ℓ_µ are frozen through time, but the exact feedforward weights are used in the backward pass. Lastly, we modify the classic Hebb rule (Hebb, 1949) to ∆W^ℓ ∝ Σ_µ ∆_µ(t)² ϕ(h^{ℓ+1}_µ) ϕ(h^ℓ_µ)^⊤, which weighs each example by its current error. Unlike standard Hebbian updates, this learning rule gives stable dynamics without regularization (App. G). For all rules, the evolution of the output function is determined by a time-dependent eNTK K_µν, defined by

$$\frac{\partial f_\mu}{\partial t} = \frac{\partial f_\mu}{\partial \theta} \cdot \frac{d\theta}{dt} = \sum_\nu \Delta_\nu K_{\mu\nu}(t,t), \qquad K_{\mu\nu}(t,s) = \sum_{\ell=0}^{L} \tilde G^{\ell+1}_{\mu\nu}(t,s)\, \Phi^\ell_{\mu\nu}(t,s)$$

$$\tilde G^\ell_{\mu\nu}(t,s) = \frac{1}{N}\, g^\ell_\mu(t) \cdot \tilde g^\ell_\nu(s), \qquad \Phi^\ell_{\mu\nu}(t,s) = \frac{1}{N}\, \phi(h^\ell_\mu(t)) \cdot \phi(h^\ell_\nu(s)),$$

where the base cases G̃^{L+1}_µν(t,s) = 1 and Φ^0_µν(t,s) = (1/D) x_µ · x_ν are time-invariant.
The kernel G̃^ℓ computes an inner product between the true gradient signals g^ℓ_µ = γ_0 N ∂f_µ/∂h^ℓ_µ and the pseudo-gradients g̃^ℓ_ν set by the chosen learning rule. Because G̃^ℓ is not necessarily symmetric, K is also not necessarily symmetric. The matrix G̃^ℓ quantifies pseudo-gradient/gradient alignment.
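As a concrete finite-width illustration, the forward pass of Eq. (1), the pseudo-gradient recursions of Eq. (3), and the resulting empirical eNTK K = Σ_ℓ G̃^{ℓ+1} ⊙ Φ^ℓ can be sketched in numpy for the GD and DFA rules. This is a sketch with hypothetical helper names (`entk`) and illustrative shapes, assuming tanh activations, not the paper's implementation:

```python
import numpy as np

phi = np.tanh
dphi = lambda h: 1.0 / np.cosh(h) ** 2   # tanh derivative

def entk(X, Ws, wL, rule="GD", rng=None):
    """Empirical eNTK K = sum_l Gtilde^{l+1} * Phi^l (elementwise), with
    Gtilde^l = g^l . gtilde^l / N, for rule in {"GD", "DFA"}.
    Shapes (our choice): X is (P, D), Ws[0] is (N, D), deeper Ws are (N, N),
    wL is (N,)."""
    rng = rng or np.random.default_rng(0)
    P, D = X.shape
    N = wL.shape[0]
    # forward pass, storing preactivations h^l, each of shape (P, N)
    hs = [X @ Ws[0].T / np.sqrt(D)]
    for W in Ws[1:]:
        hs.append(phi(hs[-1]) @ W.T / np.sqrt(N))
    L = len(hs)
    # true gradient fields g^l via the exact backward recursion
    g = [dphi(hs[-1]) * wL]
    for l in range(L - 1, 0, -1):
        g.insert(0, dphi(hs[l - 1]) * (g[0] @ Ws[l] / np.sqrt(N)))
    # pseudo-gradients gtilde^l: equal to g for GD; frozen random
    # feedback vectors for DFA (last layer always uses the true gradient)
    if rule == "GD":
        gt = g
    else:
        gt = [dphi(hs[-1]) * wL]
        for l in range(L - 1, 0, -1):
            z = rng.standard_normal(N)          # frozen feedback vector
            gt.insert(0, dphi(hs[l - 1]) * z)
    # feature kernels Phi^0..Phi^L and kernels Gtilde^1..Gtilde^{L+1}
    Phis = [X @ X.T / D] + [phi(h) @ phi(h).T / N for h in hs]
    Gts = [a @ b.T / N for a, b in zip(g, gt)] + [np.ones((P, P))]
    return sum(Gt * F for Gt, F in zip(Gts, Phis))
```

For GD the pseudo-gradients coincide with the true gradients, so the returned kernel is the usual (symmetric, PSD) empirical NTK; for DFA the kernel is generically asymmetric, as noted above.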

3. DYNAMICAL MEAN FIELD THEORY FOR VARIOUS LEARNING RULES

For each of the learning rules considered, the infinite-width N → ∞ limit of network learning can be described by a dynamical mean field theory (DMFT) (Bordelon & Pehlevan, 2022). At infinite width, the dynamics of the kernels Φ^ℓ and G̃^ℓ become deterministic over random Gaussian initializations of the parameters θ. The activities of the neurons in each layer become i.i.d. random variables drawn from a distribution defined by these kernels, which are themselves averages over these single-site distributions. Below, we provide DMFT formulas valid for all of our learning rules:

$$h^\ell_\mu(t) = u^\ell_\mu(t) + \gamma_0 \int_0^t ds \sum_{\nu=1}^P \left[ A^{\ell-1}_{\mu\nu}(t,s)\, g^\ell_\nu(s) + C^{\ell-1}_{\mu\nu}(t,s)\, \tilde g^\ell_\nu(s) + \Phi^{\ell-1}_{\mu\nu}(t,s)\, \Delta_\nu(s)\, \tilde g^\ell_\nu(s) \right]$$

$$z^\ell_\mu(t) = r^\ell_\mu(t) + \gamma_0 \int_0^t ds \sum_{\nu=1}^P \left[ B^\ell_{\mu\nu}(t,s) + \tilde G^{\ell+1}_{\mu\nu}(t,s)\, \Delta_\nu(s) \right] \phi(h^\ell_\nu(s)), \qquad g^\ell_\mu(t) = \phi'(h^\ell_\mu(t))\, z^\ell_\mu(t)$$

$$\{u^\ell_\mu(t)\} \sim \mathcal{GP}(0, \Phi^{\ell-1}), \qquad \Phi^\ell_{\mu\nu}(t,s) = \langle \phi(h^\ell_\mu(t)) \phi(h^\ell_\nu(s)) \rangle, \qquad A^\ell_{\mu\nu}(t,s) = \gamma_0^{-1} \left\langle \frac{\delta \phi(h^\ell_\mu(t))}{\delta r^\ell_\nu(s)} \right\rangle$$

$$\{r^\ell_\mu(t)\} \sim \mathcal{GP}(0, G^{\ell+1}), \qquad G^\ell_{\mu\nu}(t,s) = \langle g^\ell_\mu(t)\, g^\ell_\nu(s) \rangle, \qquad B^\ell_{\mu\nu}(t,s) = \gamma_0^{-1} \left\langle \frac{\delta g^{\ell+1}_\mu(t)}{\delta u^{\ell+1}_\nu(s)} \right\rangle$$

The definitions of g̃^ℓ_µ(t) depend on the learning rule and are given in Table 1. The field z^ℓ_µ(t) is the pre-gradient field, defined so that g^ℓ_µ(t) = ϕ'(h^ℓ_µ(t)) z^ℓ_µ(t). The dependence of these DMFT equations on the data enters through the base case Φ^0_µν(t,s) = (1/D) x_µ · x_ν and the error signal ∆_µ = -∂L/∂f_µ.

Table 1: The pseudo-gradient field g̃^ℓ_µ(t) for each learning rule.
  GD:    ϕ'(h^ℓ_µ(t)) z^ℓ_µ(t)
  ρ-FA:  ϕ'(h^ℓ_µ(t)) z̃^ℓ_µ(t)
  DFA:   ϕ'(h^ℓ_µ(t)) z̃^ℓ
  GLN:   ϕ(m^ℓ_µ) z^ℓ_µ(t)
  Hebb:  ∆_µ(t) ϕ(h^ℓ_µ(t))

For ρ-FA, the field has definition z̃^ℓ_µ(t) = ρ v^ℓ_µ(t) + √(1-ρ²) ζ^ℓ_µ(t) + γ_0 ∫_0^t ds Σ_ν D^ℓ_µν(t,s) ϕ(h^ℓ_ν(s)), where {v^ℓ_µ(t), ζ^ℓ_µ(t)} are Gaussian with ⟨r^ℓ_µ(t) v^ℓ_ν(s)⟩ = G̃^{ℓ+1}_µν(t,s), and ζ^ℓ is an independent Gaussian field with correlation ⟨ζ^ℓ_µ(t) ζ^ℓ_ν(s)⟩ = ⟨g̃^{ℓ+1}_µ(t) g̃^{ℓ+1}_ν(s)⟩. For DFA, the z̃^ℓ field is static: z̃^ℓ ∼ N(0, 1). For GLN, we use {m^ℓ_µ} ∼ N(0, K^x) as the gating variable.
For all algorithms except ρ-FA with ρ > 0, C^ℓ = 0; for ρ-FA we have

$$C^\ell_{\mu\nu}(t,s) = \gamma_0^{-1} \left\langle \frac{\delta \phi(h^\ell_\mu(t))}{\delta v^\ell_\nu(s)} \right\rangle.$$

We see that, for {GD, ρ-FA, DFA, Hebb}, the distributions of h^ℓ_µ(t), z^ℓ_µ(t) are Gaussian throughout training only in the lazy γ_0 → 0 limit for general nonlinear activation functions ϕ(h). However, conditional on {m^ℓ_µ}, the {h^ℓ, z^ℓ} fields are all Gaussian for GLNs. As described in prior results for the GD case (Bordelon & Pehlevan, 2022), the above equations can be solved self-consistently in time polynomial in the train-set size P and the number of training steps T. With an estimate of the dynamical kernels {Φ^ℓ_µν(t,s), G̃^ℓ_µν(t,s), G^ℓ_µν(t,s)}, one computes the eNTK K_µν(t) and the error dynamics ∆_µ(t). From these objects, we can sample the stochastic processes {h^ℓ, z^ℓ, z̃^ℓ}, which are then used to derive new, refined estimates of the kernels. This procedure is repeated until convergence; the algorithm can be found in App. A. An example of such a solution is provided in Figure 1 for two-layer ReLU networks trained with GD, FA, GLN, and Hebb. We show that our self-consistent DMFT accurately predicts training and kernel dynamics, as well as the density of preactivations {h_µ(t)} and the final kernels {Φ_µν, G̃_µν} for each learning rule. We observe substantial differences in the learned representations (Figure 1e), all predicted by our DMFT.
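Conditional on the kernels, the DMFT fields are Gaussian processes over the joint (sample, time) index, so the sampling step of the self-consistent solver reduces to a Cholesky draw from the current kernel estimate. A minimal sketch (the helper name `sample_gp_fields`, the jitter, and the toy covariance are our choices):

```python
import numpy as np

def sample_gp_fields(Phi, S, jitter=1e-9, rng=None):
    """Draw S i.i.d. single-site fields u ~ GP(0, Phi), where Phi is the
    (P*T) x (P*T) covariance over flattened sample/time indices.
    This is the sampling step of the DMFT solver sketched in App. A."""
    rng = rng or np.random.default_rng(0)
    n = Phi.shape[0]
    Lc = np.linalg.cholesky(Phi + jitter * np.eye(n))  # stabilized factor
    return rng.standard_normal((S, n)) @ Lc.T          # rows are samples

# hedged usage: P=2 inputs, T=3 time points, a toy PSD covariance
P, T = 2, 3
A = np.random.default_rng(1).standard_normal((P * T, P * T))
Phi = A @ A.T / (P * T)
u = sample_gp_fields(Phi, S=50000)
emp = u.T @ u / u.shape[0]   # empirical covariance, should approximate Phi
```

With enough samples the empirical second moment of the drawn fields matches the input kernel, which is exactly what the solver relies on when it re-estimates {Φ, G} from sampled processes.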

3.1. LAZY OR EARLY TIME STATIC-KERNEL LIMITS

When γ_0 → 0, the fields h^ℓ_µ(t) and z^ℓ_µ(t) are equal to the Gaussian variables u^ℓ_µ(0) and r^ℓ_µ(0). In this limit, the eNTK K_µν remains static and has the form summarized in Table 2 in terms of the initial feature kernels Φ^ℓ and gradient kernels G^ℓ, which we derive in Appendix D. The P × P matrices Φ^ℓ, G^ℓ in Table 2 are computed recursively as

$$\Phi^\ell = \left\langle \phi(u)\phi(u)^\top \right\rangle_{u \sim \mathcal N(0, \Phi^{\ell-1})}, \qquad G^\ell = G^{\ell+1} \odot \left\langle \phi'(u)\phi'(u)^\top \right\rangle_{u \sim \mathcal N(0, \Phi^{\ell-1})}$$

with base cases Φ^0 = K^x and G^{L+1} = 11^⊤.

[Figure 2: Lazy-limit kernels K(θ) as a function of the angle θ between inputs for (b) ReLU FA and (c) ReLU GLN networks at varying depth L; (d-f) convergence of the width-N kernels Φ^ℓ, G̃^ℓ and the eNTK to their infinite-width limits, with squared error |K_N - K_∞|² ∼ O(1/N).]

Table 2: The initial eNTK K_µν for each learning rule.
  GD:    Σ_{ℓ=0}^L G^{ℓ+1}_µν Φ^ℓ_µν
  ρ-FA:  Σ_{ℓ=0}^L ρ^{L-ℓ} G^{ℓ+1}_µν Φ^ℓ_µν
  DFA:   Φ^L_µν
  GLN:   ⟨ϕ(m_µ) ϕ(m_ν)⟩^L K^x_µν
  Hebb:  Φ^L_µν

The GD kernel is the usual initial NTK of Jacot et al. (2018). For ρ-aligned FA, each layer ℓ's contribution to the eNTK is suppressed by a factor ρ^{L-ℓ}. For DFA and Hebb, only the last-layer feature kernel Φ^L contributes to the NTK. For GLN, each layer has an identical contribution. We provide interpretations of these results below.

• Backpropagation (GD) and ρ = 1 FA recover the usual depth-L NTK, with contributions from every layer: K_µν = Σ_ℓ G^{ℓ+1}_µν Φ^ℓ_µν at initialization. This kernel governs both training dynamics and test predictions in the lazy limit γ_0 → 0 (Jacot et al., 2018; Lou et al., 2022).

• ρ = 0 FA, DFA and Hebb are equivalent to using the NNGP kernel K_µν ∼ Φ^L_µν, giving the Bayes posterior mean (Matthews et al., 2018; Lee et al., 2018; Hron et al., 2020).
In the γ_0, ρ → 0 limit, only the dynamics of the readout weights w^L contribute to the evolution of f_µ, since error signals cannot successfully propagate backward and gradients cannot align with pseudo-gradients (App. D). Standard ρ = 0 FA is thus indistinguishable from merely training w^L with the delta rule, unless the network is trained in the rich feature-learning regime γ_0 > 0, where G̃^ℓ can evolve. This effect was also noted in two-layer networks by Song et al. (2021).

• ρ-FA weighs each layer ℓ with scale ρ^{L-ℓ}, since each layer's pseudo-gradient is only partially correlated with the true gradient, giving the recursion G̃^ℓ = ρ G̃^{ℓ+1} with base case G̃^{L+1} = G^{L+1}.

• The GLN kernel in the lazy limit is determined by the Gaussian gating variables {m^ℓ_µ} ∼ N(0, K^x).

We visualize these kernels for deep ReLU networks and ReLU GLNs with normalized inputs |x|² = |x′|² = D by plotting the kernel as a function of the angle θ separating two inputs, cos(θ) = (1/D) x^⊤ x′. We find that the kernels develop a sharp discontinuity at the origin θ = 0, which becomes more exaggerated as ρ and L increase. We further show that the squared difference between width-N kernels and infinite-width kernels scales as O(N^{-1}). We derive this scaling with a perturbative argument in App. H, which enables analytical prediction of the leading-order finite-size effects (Figure 7). In the lazy γ_0 → 0 limit, these kernels define the eNTK and the network prediction dynamics.

3.2. FEATURE LEARNING ENABLES GRADIENT/PSEUDO-GRADIENT ALIGNMENT AND KERNEL/TASK ALIGNMENT

In the last section, we saw that, in the γ_0 → 0 limit, all algorithms have frozen preactivations and pre-gradient features {h^ℓ_µ(t), z^ℓ_µ(t)}. A consequence of this fact is that FA and DFA cannot increase their gradient/pseudo-gradient alignment throughout training in the lazy limit γ_0 = 0. However, if we increase γ_0, then the gradient features g^ℓ_µ(t) and pseudo-gradients g̃^ℓ_µ(t) evolve in time and can increase their alignment.
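Returning to the static γ_0 → 0 limit, the kernel recursions behind Table 2 can be estimated by straightforward Monte Carlo over the Gaussian averages. A hedged numpy sketch (function names and sample counts are ours; tanh is used as the activation for smoothness, and only the GD, ρ-FA, and DFA rows of Table 2 are implemented):

```python
import numpy as np

def lazy_kernels(Kx, L, phi=np.tanh, dphi=lambda u: 1 / np.cosh(u) ** 2,
                 S=100000, rng=None):
    """Monte-Carlo estimate of the lazy-limit recursions
    Phi^l = <phi(u)phi(u)^T>, G^l = G^{l+1} * <phi'(u)phi'(u)^T> (elementwise),
    with u ~ N(0, Phi^{l-1}), Phi^0 = K^x, and G^{L+1} = 1 1^T."""
    rng = rng or np.random.default_rng(0)
    P = Kx.shape[0]
    Phis, dPhis = [np.array(Kx, float)], []
    for _ in range(L):
        u = rng.multivariate_normal(np.zeros(P), Phis[-1], size=S)
        Phis.append(phi(u).T @ phi(u) / S)
        dPhis.append(dphi(u).T @ dphi(u) / S)
    Gs = [np.ones((P, P))]                   # G^{L+1}
    for dP in reversed(dPhis):
        Gs.insert(0, Gs[0] * dP)             # G^l = G^{l+1} * <phi' phi'>
    return Phis, Gs                          # Gs[l] holds G^{l+1}

def lazy_entk(Kx, L, rule="GD", rho=0.5, **kw):
    """Initial eNTK of Table 2 for rule in {"GD", "rho-FA", "DFA"}."""
    Phis, Gs = lazy_kernels(Kx, L, **kw)
    if rule == "DFA":
        return Phis[-1]                      # NNGP kernel Phi^L
    w = [rho ** (L - l) if rule == "rho-FA" else 1.0 for l in range(L + 1)]
    return sum(wl * G * F for wl, G, F in zip(w, Gs, Phis))
```

Note that setting ρ = 0 in the ρ-FA sum leaves only the ℓ = L term, recovering the NNGP kernel Φ^L, exactly as the second bullet above states.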
In Figure 3, we show the effect of increasing γ_0 on alignment dynamics in a depth-4 tanh network trained with DFA. In (b), we see that larger γ_0 is associated with higher task-alignment of the last-layer feature kernel Φ^L, which becomes essentially rank one and aligned with yy^⊤. The asymptotic cosine similarity between gradients and pseudo-gradients also increases with γ_0. The eNTK also becomes aligned with the task-relevant directions (Figure 3c), as has been observed in GD training (Baratin et al., 2021; Shan & Bordelon, 2021; Geiger et al., 2021; Atanasov et al., 2022). We see that width-N networks have a dynamical eNTK K_N(t) which deviates from the DMFT eNTK K_∞(t) by O(1/N) in squared norm,

$$\frac{\langle |K_N(t) - K_\infty(t)|^2 \rangle_t}{\langle |K_\infty(t)|^2 \rangle_t} \sim O(N^{-1}),$$

where the averages are computed over the time interval of training. This error is smaller for larger feature-learning strength γ_0: the DMFT is more predictive for larger-γ_0 networks, suggesting a reduction in finite-size variability due to task-relevant feature evolution.
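The two alignment measures used here, cosine similarity A(K, yy^⊤) between a kernel and the task and cosine similarity between gradients and pseudo-gradients, can be written compactly. A sketch (the function names and the Frobenius normalization convention are our assumptions):

```python
import numpy as np

def kernel_task_alignment(K, y):
    """Cosine similarity A(K, yy^T) = y^T K y / (||K||_F ||y||^2),
    the task-alignment measure applied to Phi^L and the eNTK."""
    return float(y @ K @ y) / (np.linalg.norm(K) * float(y @ y))

def grad_pseudograd_alignment(g, gt):
    """Cosine similarity between gradient g and pseudo-gradient gt,
    flattened over neuron/sample indices."""
    g, gt = np.ravel(g), np.ravel(gt)
    return float(g @ gt) / (np.linalg.norm(g) * np.linalg.norm(gt))

# a rank-one kernel proportional to yy^T is perfectly task-aligned
y = np.array([1.0, -2.0, 3.0])
K = np.outer(y, y)
```

A kernel exactly proportional to yy^⊤ has alignment 1, which is the limiting behavior Figure 3(b) reports for Φ^L at large γ_0.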

3.3. DEEP LINEAR NETWORK KERNEL DYNAMICS

When γ_0 > 0, the kernels and features in the network evolve according to the DMFT equations. For deep linear networks, we can analyze the kernel equations in closed form without sampling, since the correlation functions close algebraically (App. E). In Figure 4, we utilize our algebraic DMFT equations to explore ρ-FA dynamics in a depth-4 linear network. Networks with larger ρ train faster, which can be intuited by noting that the initial function time derivative df/dt |_{t=0} ∼ Σ_{ℓ=0}^L ρ^{L-ℓ} = (1 - ρ^{L+1})/(1 - ρ) is an increasing function of ρ. We observe higher final gradient/pseudo-gradient alignment in each layer for larger ρ, which is also intuitive from the initial condition G̃^ℓ(0) = ρ^{L-ℓ}. Surprisingly, however, for large initial correlation ρ the NTK achieves lower task alignment, despite the larger G̃^ℓ(t). We show that this is caused by a smaller overlap of each layer's feature kernel H^ℓ(t) with yy^⊤. Though this phenomenon is counterintuitive, we gain more insight in the next section by studying an even simpler two-layer model.

[Figure 4: ρ-FA dynamics in a depth-4 linear network. (b) Larger ρ yields higher gradient/pseudo-gradient alignment (1/N) g^ℓ(t) · g̃^ℓ(t). (c) However, smaller ρ leads to more alignment of the NTK K(t) with the task-relevant subspace, measured with the cosine similarity A(K, yy^⊤). (d) The overlaps of the feature kernels H^ℓ(t) with y reveal that H^ℓ(t) aligns more significantly in the small-ρ networks.]
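The geometric sum controlling the initial training speed can be checked numerically; a trivial sketch (`initial_speed` is a hypothetical helper of ours):

```python
import numpy as np

def initial_speed(rho, L):
    """Initial rate sum_{l=0}^{L} rho^{L-l}, which telescopes to
    (1 - rho^{L+1}) / (1 - rho) for rho != 1."""
    return sum(rho ** (L - l) for l in range(L + 1))
```

The sum grows monotonically with ρ, consistent with the observation that larger-ρ networks train faster at early times.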

3.3.1. EXACTLY SOLVABLE DYNAMICS IN A TWO-LAYER LINEAR NETWORK

We can provide exact solutions to the infinite-width GD and ρ-FA dynamics in the setting of Saxe et al. (2013): a two-layer linear network trained on whitened data K^x_µν = δ_µν. Unlike Saxe et al. (2013)'s result, however, we do not demand a small initialization scale (or, equivalently, large γ_0), but rather provide the exact solution for all positive γ_0. We will establish that large initial correlation ρ results in higher gradient/pseudo-gradient alignment but lower alignment of the hidden feature kernel H(t) with the task-relevant subspace yy^⊤. We first note that, when K^x = I, the GD or FA hidden feature kernel H(t) only evolves in the rank-one yy^⊤ subspace, so it suffices to track the projection of H(t) onto this subspace, which we call H_y(t). In App. F we derive the dynamics of H_y for GD and ρ-FA:

$$H_y(t) = G(t) = 1 + \gamma_0^2 \left(y - \Delta(t)\right)^2, \qquad \frac{d\Delta}{dt} = -\left[1 + \gamma_0^2 \left(y - \Delta(t)\right)^2\right] \Delta(t) \qquad \text{(GD)}$$

$$H_y(t) = 1 + a(t)^2, \qquad G(t) = \rho + \tfrac{1}{2} a(t)^2, \qquad \frac{da}{dt} = \gamma_0 y - \tfrac{1}{2} a^3 - (1+\rho)\, a \qquad (\rho\text{-FA})$$

We illustrate these dynamics in Figure 5. The fixed points are H_y = 1 + γ_0² y² for GD and, for ρ-FA, H_y = 1 + a², where a is the smallest positive root of (1/2) a³ + (1+ρ) a = γ_0 y. For both GD and FA, increasing γ_0 results in larger asymptotic values of H_y and G. For ρ-FA, the fixed point of a's dynamics is a strictly decreasing function of ρ since da/dρ < 0, showing that the final value of H_y is smaller for larger ρ. On the contrary, the final G = ρ + (1/2) a² is a strictly increasing function of ρ, since dG/dρ = 1 - a² / ((3/2) a² + (1+ρ)) > 1/3 > 0. Thus, this simple model replicates the phenomenon of increasing G and decreasing H_y as ρ increases. For the Hebb rule with K^x = I, the story is different: instead of aligning H along the rank-one task-relevant subspace, the dynamics decouple over samples, giving the P separate equations

$$\frac{d}{dt} \Delta_\mu = -\left[H_{\mu\mu}(t) + \gamma_0 \Delta_\mu \left(y_\mu - \Delta_\mu\right)\right] \Delta_\mu(t), \qquad \frac{d}{dt} H_{\mu\mu} = 2 \gamma_0 \Delta_\mu(t)^2 H_{\mu\mu}.$$
From this perspective, we see that the hidden feature kernel does not align to the task, but rather increases the overall scale of its entries, as illustrated in Figure 5(b).
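The ρ-FA order-parameter ODE above can be integrated directly to confirm the claimed trend: larger ρ lowers the final H_y but raises the final G. A simple explicit-Euler sketch (the step size, horizon, and function name are our choices):

```python
import numpy as np

def fa_fixed_point(gamma0, y, rho, t_max=50.0, dt=1e-3):
    """Euler-integrate da/dt = gamma0*y - a^3/2 - (1+rho)*a for the
    two-layer linear rho-FA model from a = 0, returning the late-time
    order parameters (H_y, G) = (1 + a^2, rho + a^2/2)."""
    a = 0.0
    for _ in range(int(t_max / dt)):
        a += dt * (gamma0 * y - 0.5 * a ** 3 - (1 + rho) * a)
    return 1 + a ** 2, rho + 0.5 * a ** 2

# compare a weakly and a strongly aligned initialization at fixed gamma0*y
Hy_lo, G_lo = fa_fixed_point(gamma0=2.0, y=1.0, rho=0.1)
Hy_hi, G_hi = fa_fixed_point(gamma0=2.0, y=1.0, rho=0.9)
```

With γ_0 y = 2, the ρ = 0.1 network ends with a larger H_y but a smaller G than the ρ = 0.9 network, reproducing the counterintuitive trade-off derived analytically above.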

4. DISCUSSION

We provided an analysis of the training dynamics of a wide range of learning rules at infinite width. This set of rules includes (but is not limited to) GD, ρ-FA, DFA, GLN, and Hebb. We showed that each of these learning rules has a dynamical effective NTK which concentrates over initializations at infinite width. In the lazy γ_0 → 0 regime, it suffices to compute the initial NTK, while in the rich regime, we provide a dynamical mean field theory to compute the NTK's dynamics. We showed that, in the rich regime, FA learning rules do indeed align the network's gradient vectors to their pseudo-gradients, and that this alignment improves with γ_0. We also showed that the initial correlation ρ between forward- and backward-pass weights alters the inductive bias of FA in both lazy and rich regimes; in the rich regime, larger-ρ networks have smaller eNTK evolution. Overall, our study is a step toward understanding learned representations in neural networks, and toward the quest to reverse-engineer learning rules from observations of evolving neural representations during learning in the brain. Many open problems remain. We have currently only implemented our theory in MLPs; an implementation in CNNs could explain some of the observed advantages of partial initial alignment in ρ-FA (Xiao et al., 2018; Moskovitz et al., 2018; Bartunov et al., 2018; Refinetti et al., 2021). In addition, our framework is sufficiently flexible to propose and test new learning rules by providing new g̃^ℓ_µ(t) formulas: our DMFT gives a recipe to compute their initial kernels and function dynamics and to analyze their learned representations. The generalization performance of these learning rules at varying γ_0 is yet to be explored. Lastly, our DMFT is numerically expensive for large datasets and training intervals, making it difficult to scale up to realistic datasets. Future work could provide theoretical convergence guarantees for our DMFT solver.

APPENDIX A ALGORITHM TO SOLVE NONLINEAR DMFT EQUATIONS

Algorithm 1: Alternating Monte-Carlo Solution to the Saddle-Point Equations

Data: K^x, y, initial guesses {Φ^ℓ, G^ℓ, G̃^ℓ, Ḡ^ℓ}_{ℓ=1}^L and {A^ℓ, B^ℓ, C^ℓ, D^ℓ}_{ℓ=1}^{L-1}, sample count S, update speed β.
Result: Network predictions through training f_µ(t), correlation functions {Φ^ℓ, G^ℓ, G̃^ℓ, Ḡ^ℓ}_{ℓ=1}^L, response functions {A^ℓ, B^ℓ, C^ℓ, D^ℓ}_{ℓ=1}^{L-1}.

Set Φ^0 = K^x ⊗ 11^⊤ and G^{L+1} = 11^⊤.
while kernels not converged do
  From {Φ^ℓ, G̃^ℓ} compute K^NTK(t, t) and solve d/dt f_µ(t) = Σ_α ∆_α(t) K^NTK_µα(t, t).
  for ℓ = 1, ..., L do
    Draw S samples {u^ℓ_{µ,n}(t)} ∼ GP(0, Φ^{ℓ-1}) and {r^ℓ_{µ,n}(t), v^ℓ_{µ,n}(t)} ∼ GP(0, [[G^{ℓ+1}, G̃^{ℓ+1}], [G̃^{ℓ+1 ⊤}, Ḡ^{ℓ+1}]]).
    Solve equation (5) for each sample to get {h^ℓ_{µ,n}(t), z^ℓ_{µ,n}(t), g̃^ℓ_{µ,n}(t)}.
    Use the learning rule (Table 1) to compute {g^ℓ_{µ,n}(t)}.
    Compute new correlation-function estimates:
      Φ^{ℓ,new}_µν(t, s) = (1/S) Σ_n ϕ(h^ℓ_{µ,n}(t)) ϕ(h^ℓ_{ν,n}(s)),
      G^{ℓ,new}_µν(t, s) = (1/S) Σ_n g^ℓ_{µ,n}(t) g^ℓ_{ν,n}(s),
      G̃^{ℓ,new}_µν(t, s) = (1/S) Σ_n g^ℓ_{µ,n}(t) g̃^ℓ_{ν,n}(s),
      Ḡ^{ℓ,new}_µν(t, s) = (1/S) Σ_n g̃^ℓ_{µ,n}(t) g̃^ℓ_{ν,n}(s).
    Solve for the Jacobians on each sample: ∂ϕ(h^ℓ_n)/∂r^{ℓ⊤}_n, ∂ϕ(h^ℓ_n)/∂v^{ℓ⊤}_n, ∂g^ℓ_n/∂u^{ℓ⊤}_n, ∂g̃^ℓ_n/∂u^{ℓ⊤}_n.
    Compute new response-function estimates:
      A^{ℓ,new} = (1/S) Σ_n ∂ϕ(h^ℓ_n)/∂r^{ℓ⊤}_n,  B^{ℓ-1,new} = (1/S) Σ_n ∂g^ℓ_n/∂u^{ℓ⊤}_n,
      C^{ℓ,new} = (1/S) Σ_n ∂ϕ(h^ℓ_n)/∂v^{ℓ⊤}_n,  D^{ℓ-1,new} = (1/S) Σ_n ∂g̃^ℓ_n/∂u^{ℓ⊤}_n.
  end
  for ℓ = 1, ..., L do
    Update correlation functions: Φ^ℓ ← (1-β)Φ^ℓ + βΦ^{ℓ,new}, G^ℓ ← (1-β)G^ℓ + βG^{ℓ,new}, G̃^ℓ ← (1-β)G̃^ℓ + βG̃^{ℓ,new}, Ḡ^ℓ ← (1-β)Ḡ^ℓ + βḠ^{ℓ,new}.
    if ℓ < L then update response functions: A^ℓ ← (1-β)A^ℓ + βA^{ℓ,new}, B^ℓ ← (1-β)B^ℓ + βB^{ℓ,new}, C^ℓ ← (1-β)C^ℓ + βC^{ℓ,new}, D^ℓ ← (1-β)D^ℓ + βD^{ℓ,new}.
  end
end
return {f_µ(t)}_{µ=1}^P, {Φ^ℓ, G^ℓ, G̃^ℓ, Ḡ^ℓ}_{ℓ=1}^L, {A^ℓ, B^ℓ, C^ℓ, D^ℓ}_{ℓ=1}^{L-1}

The sample-and-solve procedure we developed and describe below for nonlinear networks is based on numerical recipes
used in dynamical mean field simulations in computational physics (Manacorda et al., 2020) and is similar to recent work in the GD case (Bordelon & Pehlevan, 2022). The basic principle is to leverage the fact that, conditional on the order parameters, we can easily draw samples of the single-site stochastic processes. We enforce these definitions for all order parameters {Φ^ℓ_µν, G^ℓ_µν, G̃^ℓ_µν, Ḡ^ℓ_µν, A^ℓ_µν, C^ℓ_µν}. We let the corresponding Fourier duals for each of these order parameters be {Φ̂^ℓ_µν, Ĝ^ℓ_µν, Ĝ̃^ℓ_µν, Ĝ̄^ℓ_µν, -B^ℓ_µν, -D^ℓ_µν}. In the next section, we show the resulting formula for the moment generating function and take the N → ∞ limit to derive our DMFT equations.
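The outer loop of Algorithm 1 is a damped fixed-point iteration q ← (1-β) q + β q_new on the kernels. A generic sketch of that update scheme, applied here to a scalar toy problem rather than the DMFT kernels (function name and tolerances are our choices):

```python
import numpy as np

def damped_fixed_point(update, q0, beta=0.3, tol=1e-8, max_iter=500):
    """Damped fixed-point iteration q <- (1 - beta) q + beta * update(q),
    the outer-loop update rule of Algorithm 1 (there, the correlation and
    response functions play the role of q)."""
    q = np.asarray(q0, dtype=float)
    for it in range(max_iter):
        q_new = (1 - beta) * q + beta * update(q)
        if np.max(np.abs(q_new - q)) < tol:   # successive-iterate test
            return q_new, it
        q = q_new
    return q, max_iter

# toy usage: solve q = cos(q) by damped iteration
q_star, iters = damped_fixed_point(np.cos, 0.0)
```

The damping factor β < 1 trades convergence speed for stability, which matters in the DMFT setting where each `update` evaluation is itself a noisy Monte-Carlo estimate.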

B.4 DMFT ACTION

After inserting Dirac delta functions to enforce the definitions of the order parameters, we derive the following moment generating functional in terms of q = {Φ, Φ̂, G, Ĝ, G̃, Ĝ̃, Ḡ, Ĝ̄, A, B, C, D, j, k, n, p}:

$$Z = \int \prod_{\ell,\mu,\nu} \frac{d\Phi^\ell_{\mu\nu}\, d\hat\Phi^\ell_{\mu\nu}}{2\pi N^{-1}} \frac{dG^\ell_{\mu\nu}\, d\hat G^\ell_{\mu\nu}}{2\pi N^{-1}} \frac{d\tilde G^\ell_{\mu\nu}\, d\hat{\tilde G}^\ell_{\mu\nu}}{2\pi N^{-1}} \frac{d\bar G^\ell_{\mu\nu}\, d\hat{\bar G}^\ell_{\mu\nu}}{2\pi N^{-1}} \frac{dA^\ell_{\mu\nu}\, dB^\ell_{\mu\nu}}{2\pi N^{-1}} \frac{dC^\ell_{\mu\nu}\, dD^\ell_{\mu\nu}}{2\pi N^{-1}} \exp\left(N S[q]\right)$$

where S[q] is the O_N(1) DMFT action, which takes the form

$$S[q] = \sum_{\ell\mu\nu} \left[ \Phi^\ell_{\mu\nu}\hat\Phi^\ell_{\mu\nu} + G^\ell_{\mu\nu}\hat G^\ell_{\mu\nu} + \tilde G^\ell_{\mu\nu}\hat{\tilde G}^\ell_{\mu\nu} + \bar G^\ell_{\mu\nu}\hat{\bar G}^\ell_{\mu\nu} - A^\ell_{\mu\nu}B^\ell_{\mu\nu} - C^\ell_{\mu\nu}D^\ell_{\mu\nu} \right] + \frac{1}{N}\sum_{i=1}^N \sum_{\ell=1}^L \ln Z^\ell_i[q].$$

The single-site moment generating functionals (MGFs) Z^ℓ_i involve only the integrals with sources {j^ℓ_i, k^ℓ_i, n^ℓ_i, p^ℓ_i} for neuron i ∈ [N] in layer ℓ. For a given set of order parameters q at zero source, these functionals become identical across all neuron sites i.

B.5 SADDLE POINT EQUATIONS

Let the full collection of concatenated order parameters q be indexed by b. We now take the N → ∞ limit, using the method of steepest descent:

$$Z = \int \prod_b \frac{\sqrt{N}\, dq_b}{\sqrt{2\pi}} \exp\left(N S[q]\right) \sim \exp\left(N S[q^*]\right), \qquad \nabla S[q]\big|_{q^*} = 0, \qquad N \to \infty.$$

We see that the integral over q is exponentially dominated by the saddle point where ∇S[q] = 0, so we must solve the saddle-point equations for q^*. To do this, we introduce some notation. Let O(χ, χ̂, ξ, ξ̂, ζ, ζ̂, ζ̃, ζ̄) be an arbitrary function of the single-site stochastic processes. We define the ℓ-th layer, i-th single-site average ⟨O⟩_{ℓ,i} as

$$\langle O \rangle_{\ell,i} = \frac{1}{Z^\ell_i} \int \prod_\mu \frac{d\chi_\mu\, d\hat\chi_\mu}{2\pi} \frac{d\xi_\mu\, d\hat\xi_\mu}{2\pi} \frac{d\zeta_\mu\, d\hat\zeta_\mu}{2\pi} \frac{d\tilde\zeta_\mu\, d\bar\zeta_\mu}{2\pi} \exp\left(-H^\ell_i\right) O,$$

which can be interpreted as an average over the Gibbs measure defined by the energy H^ℓ_i. With this notation, we now compute the saddle-point equations that define the primal order parameters {Φ, G, G̃, Ḡ}.

$$\frac{\partial S}{\partial \hat\Phi^\ell_{\mu\nu}} = \Phi^\ell_{\mu\nu} - \frac{1}{N}\sum_{i=1}^{N} \langle \phi(h^\ell_\mu)\phi(h^\ell_\nu) \rangle_{\ell,i} = 0, \qquad \frac{\partial S}{\partial \hat G^\ell_{\mu\nu}} = G^\ell_{\mu\nu} - \frac{1}{N}\sum_{i=1}^{N} \langle g^\ell_\mu g^\ell_\nu \rangle_{\ell,i} = 0,$$
$$\frac{\partial S}{\partial \hat{\tilde G}^\ell_{\mu\nu}} = \tilde G^\ell_{\mu\nu} - \frac{1}{N}\sum_{i=1}^{N} \langle g^\ell_\mu \tilde g^\ell_\nu \rangle_{\ell,i} = 0, \qquad \frac{\partial S}{\partial \hat{\tilde{\tilde G}}^\ell_{\mu\nu}} = \tilde{\tilde G}^\ell_{\mu\nu} - \frac{1}{N}\sum_{i=1}^{N} \langle \tilde g^\ell_\mu \tilde g^\ell_\nu \rangle_{\ell,i} = 0.$$

C.2 CNN

The DMFT described for each of these learning rules can also be extended to CNNs with infinitely many channels. Following Appendix G of Bordelon & Pehlevan (2022) on the GD DMFT limit for CNNs, we let $W^\ell_{ij,a}$ represent the value of the filter at spatial displacement $a$ from the center of the filter, which maps activity at channel $j$ of layer $\ell$ to channel $i$ of layer $\ell+1$. The fields $h^\ell_{\mu,i,a}$ satisfy the recursion
$$h^{\ell+1}_{\mu,i,a} = \frac{1}{\sqrt N}\sum_{j=1}^{N}\sum_{b \in S_\ell} W^\ell_{ij,b}\, \phi(h^\ell_{\mu,j,a+b}), \qquad i \in [N],$$
where $S_\ell$ is the spatial receptive field at layer $\ell$. For example, a $(2k+1)\times(2k+1)$ convolution has $S_\ell = \{(i,j) \in \mathbb{Z}^2 : -k \le i \le k,\ -k \le j \le k\}$. The output function is obtained from the last layer as $f_\mu = \frac{1}{\gamma_0 N}\sum_{i=1}^{N}\sum_a w^L_{i,a}\, \phi(h^L_{\mu,i,a})$. The true gradient fields have the same definition as before, $g^\ell_{\mu,a} = \gamma_0 N \frac{\partial f_\mu}{\partial h^\ell_{\mu,a}} \in \mathbb{R}^N$, which as before enjoy the recursion
$$g^\ell_{\mu,a} = \gamma_0 N \sum_b \frac{\partial f_\mu}{\partial h^{\ell+1}_{\mu,b}} \cdot \frac{\partial h^{\ell+1}_{\mu,b}}{\partial h^\ell_{\mu,a}} = \dot\phi(h^\ell_{\mu,a}) \odot \left[ \frac{1}{\sqrt N}\sum_{j=1}^{N}\sum_{b \in S_\ell} W^{\ell\top}_b\, g^{\ell+1}_{\mu,a-b} \right].$$
We consider the following learning dynamics for the filters
$$\frac{d}{dt} W^\ell_b = \frac{\gamma_0}{\sqrt N}\sum_{\mu,c} \Delta_\mu\, \tilde g^{\ell+1}_{\mu,c}\, \phi(h^\ell_{\mu,c+b})^\top, \tag{46}$$
where, as before, $\tilde g^\ell$ is determined by the learning rule. The relevant kernel order parameters now carry spatial indices. For instance, the feature kernel at each layer has the form $\Phi^\ell_{\mu\nu,ab}(t,s) = \frac{1}{N}\phi(h^\ell_{\mu,a}(t)) \cdot \phi(h^\ell_{\nu,b}(s))$.
At infinite width $N \to \infty$, the order parameters and field dynamics take the form
$$h^\ell_{\mu,a}(t) = u^\ell_{\mu,a}(t) + \gamma_0 \int_0^t ds \sum_{\nu,b,c} \Delta_\nu(s)\, \Phi^{\ell-1}_{\mu\nu,a+b,b+c}(t,s)\, \tilde g^\ell_{\nu,c}(s) + \gamma_0 \int_0^t ds \sum_{\nu,b} \left[ A^{\ell-1}_{\mu\nu,ab}(t,s)\, g^\ell_{\nu,b}(s) + C^{\ell-1}_{\mu\nu,ab}(t,s)\, \tilde g^\ell_{\nu,b}(s) \right],$$
$$z^\ell_{\mu,a}(t) = r^\ell_{\mu,a}(t) + \gamma_0 \int_0^t ds \sum_{\nu,b,c} \tilde G^{\ell+1}_{\mu\nu,a-b,c-b}(t,s)\, \phi(h^\ell_{\nu,c}(s)) + \gamma_0 \int_0^t ds \sum_{\nu,b} B^\ell_{\mu\nu,ab}(t,s)\, \phi(h^\ell_{\nu,b}(s)),$$
where the correlation and response functions have their usual definitions, e.g.
$$\Phi^\ell_{\mu\nu,ab}(t,s) = \langle \phi(h^\ell_{\mu,a}(t))\, \phi(h^\ell_{\nu,b}(s)) \rangle, \quad G^\ell_{\mu\nu,ab}(t,s) = \langle g^\ell_{\mu,a}(t)\, g^\ell_{\nu,b}(s) \rangle, \quad \tilde G^\ell_{\mu\nu,ab}(t,s) = \langle g^\ell_{\mu,a}(t)\, \tilde g^\ell_{\nu,b}(s) \rangle.$$

D.1 LAZY LIMIT PERFORMANCES ON REALISTIC TASKS

We note that, while the DMFT equations on $P$ datapoints and $T$ timesteps require $O(P^3 T^3)$ time to solve in the rich regime, the lazy limit gives neural network predictions in $O(P^3)$ time, since the predictor can be obtained by solving a linear system of $P$ equations. The performance of these lazy-limit kernels on realistic tasks would match the performance reported by Lee et al. (2020). Specifically, GD and $\rho = 1$ FA would match the test accuracy reported for "infinite width GD", while $\rho = 0$ FA, DFA, and the Hebbian rule would match the "infinite width Bayesian" networks in Figure 1 of Lee et al. (2020).
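As a concrete illustration of the $O(P^3)$ claim, here is a minimal sketch of the lazy-limit predictor (assuming an interpolating $t \to \infty$ solution of gradient flow under squared loss; `K_train` and `k_test` stand in for eNTK evaluations, and all names are ours):

```python
import numpy as np

# With a static eNTK K, gradient flow on the squared loss drives the train-set
# predictions to interpolate the targets, and the t -> infinity predictor on a
# test point reduces to kernel regression: a single P x P linear solve.

def lazy_predict(K_train, k_test, y):
    """K_train: (P, P) eNTK on the training data; k_test: (P,) eNTK between a
    test point and the training set; y: (P,) targets."""
    alpha = np.linalg.solve(K_train, y)   # O(P^3), no time discretization
    return k_test @ alpha
```

When the test point coincides with a training point, the predictor returns that point's target, reflecting interpolation of the training set at late times.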

E DEEP LINEAR NETWORKS

In deep linear networks, the DMFT equations close without any numerical sampling procedure, as was shown in prior work on the GD case (Yang & Hu, 2021; Bordelon & Pehlevan, 2022). The key observation is that, for all of the learning rules considered, the fields $\{h, g, \tilde g\}$ are linear combinations of the Gaussian sources $\{u, r, v\}$ and are thus Gaussian themselves. Concretely, we introduce the vector notation $h^\ell = \mathrm{Vec}\{h^\ell_\mu(t)\}$, $g^\ell = \mathrm{Vec}\{g^\ell_\mu(t)\}$, etc. We have in each layer
$$h^\ell = R^{h,u} u^\ell + R^{h,r} r^\ell + R^{h,v} v^\ell + R^{h,\tilde\zeta} \tilde\zeta^\ell,$$
$$g^\ell = R^{g,u} u^\ell + R^{g,r} r^\ell + R^{g,v} v^\ell + R^{g,\tilde\zeta} \tilde\zeta^\ell,$$
$$\tilde g^\ell = R^{\tilde g,u} u^\ell + R^{\tilde g,r} r^\ell + R^{\tilde g,v} v^\ell + R^{\tilde g,\tilde\zeta} \tilde\zeta^\ell,$$
where the matrices $R$ depend on the learning rule and the data. The necessary kernels, e.g. $H^\ell = \langle h^\ell h^{\ell\top} \rangle$, can thus be closed algebraically, since the sources $\{u, r, v, \tilde\zeta\}$ have known two-point correlation statistics.
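As a minimal numerical sketch of how such kernels close: once the fields are linear in jointly Gaussian sources, $H = \sum_{a,b} R_a \Sigma_{ab} R_b^\top$ follows algebraically. The $R$ matrices below are random placeholders, not the actual rule-dependent response matrices, and we use two independent sources for simplicity:

```python
import numpy as np

# If h = R_u u + R_r r with independent Gaussian sources u ~ N(0, Su) and
# r ~ N(0, Sr), then the kernel H = <h h^T> closes without any sampling:
def closed_form_kernel(R_u, R_r, Su, Sr):
    return R_u @ Su @ R_u.T + R_r @ Sr @ R_r.T

rng = np.random.default_rng(0)
n = 5
R_u, R_r = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Su, Sr = np.eye(n), 0.5 * np.eye(n)
H = closed_form_kernel(R_u, R_r, Su, Sr)

# Monte Carlo check: sample h and estimate <h h^T> empirically
M = 200_000
u = rng.standard_normal((M, n))
r = np.sqrt(0.5) * rng.standard_normal((M, n))
h = u @ R_u.T + r @ R_r.T
H_mc = h.T @ h / M
```

The empirical estimate `H_mc` converges to the closed-form `H` as the number of samples grows, which is the computational advantage of the linear-network case.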

E.1 LINEAR NETWORK TRAINED WITH GD

The $R$ matrices for GD were provided in Bordelon & Pehlevan (2022). We start from the DMFT equations for $h^\ell, g^\ell$:
$$h^\ell = u^\ell + \gamma_0 (A^{\ell-1} + H^{\ell-1}_\Delta) g^\ell, \qquad g^\ell = r^\ell + \gamma_0 (B^\ell + G^{\ell+1}_\Delta) h^\ell,$$
where $[H^{\ell-1}_\Delta]_{\mu\nu,ts} = H^{\ell-1}_{\mu\nu}(t,s)\Delta_\nu(s)$. Isolating the dependence of these equations on $u$ and $r$, we have
$$\left[I - \gamma_0^2 (A^{\ell-1} + H^{\ell-1}_\Delta)(B^\ell + G^{\ell+1}_\Delta)\right] h^\ell = u^\ell + \gamma_0 (A^{\ell-1} + H^{\ell-1}_\Delta)\, r^\ell,$$
$$\left[I - \gamma_0^2 (B^\ell + G^{\ell+1}_\Delta)(A^{\ell-1} + H^{\ell-1}_\Delta)\right] g^\ell = r^\ell + \gamma_0 (B^\ell + G^{\ell+1}_\Delta)\, u^\ell.$$
These equations can easily be closed for $H^\ell$ and $G^\ell$.

E.2 ρ-ALIGNED FEEDBACK ALIGNMENT

In $\rho$-FA, we define the pseudo-gradient fields
$$\tilde g^\ell = \sqrt{1-\rho^2}\, \tilde\zeta^\ell + \rho v^\ell + \rho \gamma_0 D^\ell h^\ell.$$
Next, we note that at initialization the $\tilde{\tilde G}^\ell$ can be computed recursively: $\tilde{\tilde G}^\ell = \rho\, \tilde{\tilde G}^{\ell+1}$. We also note that $\frac{\partial h^1}{\partial r^1} = 0$, which implies $A^1 = 0$. Similarly, $\frac{\partial h^2}{\partial r^2} = 0$, so $A^2 = 0$. Proceeding inductively, we find $A^\ell = 0$ for all $\ell$. Likewise, $\frac{\partial \tilde g^L}{\partial u^L} = 0$, so $D^{L-1} = 0$, and inductively $D^\ell = 0$ for all $\ell$. Using these facts, we find the following equations:
$$h^\ell = u^\ell + \gamma_0 (C^{\ell-1} + H^{\ell-1}_\Delta)\, \tilde g^\ell, \tag{61}$$
$$g^\ell = r^\ell + \gamma_0 (B^\ell + \tilde G^{\ell+1}_\Delta)\, h^\ell, \tag{62}$$
$$\tilde g^\ell = \sqrt{1-\rho^2}\, \tilde\zeta^\ell + \rho v^\ell.$$
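A small sketch of the lazy eNTK this recursion implies in the deep linear case. Assuming (as above) $\tilde{\tilde G}^\ell = \rho\, \tilde{\tilde G}^{\ell+1}$ with $\tilde{\tilde G}^{L+1} = 1$, and layer-independent $\Phi^\ell = \Phi^0$ for a linear network, the lazy eNTK $K = \sum_{\ell=0}^{L} \tilde{\tilde G}^{\ell+1} \Phi^\ell$ interpolates between the NNGP kernel ($\rho = 0$) and the backprop NTK ($\rho = 1$):

```python
import numpy as np

# Hedged sketch: lazy rho-FA eNTK for a deep *linear* network, using
# Gtilde^{l+1} = rho^(L - l) and Phi^l = Phi0 (input Gram matrix x.x'/D).
def lazy_fa_kernel_linear(Phi0, L, rho):
    return sum(rho ** (L - l) * Phi0 for l in range(L + 1))

Phi0 = np.array([[1.0, 0.3], [0.3, 1.0]])
L = 3
K_nngp = lazy_fa_kernel_linear(Phi0, L, rho=0.0)   # only the last-layer term survives
K_ntk  = lazy_fa_kernel_linear(Phi0, L, rho=1.0)   # all L + 1 terms contribute equally
```

At $\rho = 0$ only the $\ell = L$ term survives, recovering $\Phi^0$, while $\rho = 1$ gives $(L+1)\Phi^0$, the linear-network NTK.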

H.3 SINGLE SAMPLE NEXT-TO-LEADING ORDER PERTURBATION THEORY

In order to obtain exact analytical expressions, we consider $L$-hidden-layer ReLU and linear networks in the lazy regime, trained on a single sample with $K^x = \frac{|x|^2}{D} = 1$. To ensure preservation of the norm, we use $\phi(h) = \sqrt 2\, h\, \Theta(h)$ for ReLU and $\phi(h) = h$ for linear networks. First, we note that in either case the infinite-width saddle point equations give
$$\Phi^\ell = \langle \phi(h)^2 \rangle_{h \sim \mathcal N(0, \Phi^{\ell-1})} = \Phi^{\ell-1}, \quad \Phi^0 = 1; \qquad G^\ell = \langle \dot\phi(h)^2 z^2 \rangle_{h,z} = G^{\ell+1}, \quad G^{L+1} = 1,$$
$$\implies \Phi^\ell = 1, \quad G^\ell = 1, \quad \forall \ell \in \{1, \dots, L\}.$$
At large but finite width, the kernels therefore fluctuate around this typical mean value of $\Phi^\ell = 1$ and $G^\ell = 1$. We now compute the ingredients needed to invert the Hessian:
$$V_\phi = \langle \phi(h)^4 \rangle - \Phi^2 = \begin{cases} 5 & \text{ReLU} \\ 2 & \text{Linear} \end{cases}, \qquad V_g = \langle \dot\phi(h)^4 z^4 \rangle - G^2 = \begin{cases} 5 & \text{ReLU} \\ 2 & \text{Linear} \end{cases}.$$
Next, we compute the sensitivity of each layer's kernel to the previous layer: $\frac{\partial \Phi^{\ell+1}}{\partial \Phi^\ell} = 1$ and $\frac{\partial G^\ell}{\partial G^{\ell+1}} = 1$. Let us first analyze the marginal covariance statistics for $\Phi = \mathrm{Vec}\{\Phi^\ell\}_{\ell=1}^L$ and $\hat\Phi = \mathrm{Vec}\{\hat\Phi^\ell\}_{\ell=1}^L$. The relevant block of the DMFT action's Hessian has the form
$$H_\Phi = \begin{pmatrix} \nabla^2_{\hat\Phi} S & \nabla^2_{\hat\Phi\Phi} S \\ \nabla^2_{\Phi\hat\Phi} S & \nabla^2_{\Phi} S \end{pmatrix} = \begin{pmatrix} 0 & U \\ U^\top & V_\phi I \end{pmatrix}, \qquad U = \begin{pmatrix} 1 & -1 & & \\ & 1 & -1 & \\ & & \ddots & \ddots \\ & & & 1 \end{pmatrix}.$$
We seek a (physical) inverse $C$ with vanishing lower-diagonal block, indicating zero variance in the dual order parameters $\hat\Phi$. This gives the following linear equations:
$$H_\Phi C = \begin{pmatrix} 0 & U \\ U^\top & V_\phi I \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{12}^\top & 0 \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} \implies U C_{12}^\top = I, \quad U^\top C_{11} + V_\phi C_{12}^\top = 0.$$
The relevant block is $C_{11} = -V_\phi\, [U^\top]^{-1} U^{-1}$. Using the fact that the covariance is the negative of the inverse Hessian multiplied by $1/N$, we obtain the following covariance structure for $\{\Phi^\ell\}$:
$$\mathrm{Cov}(\Phi^\ell, \Phi^{\ell'}) = \frac{1}{N} V_\phi \min\{\ell, \ell'\}.$$
This result can be interpreted as the covariance of a Brownian motion across depth. Following an identical argument, we find
$$\mathrm{Cov}(G^\ell, G^{\ell'}) = \frac{1}{N} V_g \min\{L+1-\ell,\, L+1-\ell'\}.$$
We verify these scalings against experiments below in Figure 7.
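The Brownian-motion covariance can be checked by direct Monte Carlo over random finite-width networks; a sketch (the widths and trial counts below are illustrative, not those used in the paper):

```python
import numpy as np

# Monte Carlo check of Cov(Phi^l, Phi^l') = (V_phi / N) min(l, l') for the
# norm-preserving ReLU phi(h) = sqrt(2) h Theta(h) on a single |x|^2 = D input.

def sample_Phis(N, L, D, rng):
    x = rng.standard_normal(D)
    x *= np.sqrt(D) / np.linalg.norm(x)                 # enforce |x|^2 = D
    h = rng.standard_normal((N, D)) @ x / np.sqrt(D)    # first-layer preactivations
    Phis = []
    for _ in range(L):
        a = np.sqrt(2) * np.maximum(h, 0)               # phi(h)
        Phis.append(a @ a / N)                          # Phi^l = |phi(h)|^2 / N
        h = rng.standard_normal((N, N)) @ a / np.sqrt(N)
    return Phis

rng = np.random.default_rng(0)
N, L, D, trials = 100, 4, 100, 2000
Phis_all = np.array([sample_Phis(N, L, D, rng) for _ in range(trials)])  # (trials, L)
C = np.cov(Phis_all.T)   # empirical cross-layer covariance, shape (L, L)
# prediction: C[l, l'] ~ (5 / N) * min(l + 1, l' + 1)
```

The first-layer check is exact at any width (the $h_i$ are i.i.d. standard normal, so $\mathrm{Var}(\Phi^1) = 5/N$), while deeper layers accumulate variance linearly as the Brownian picture predicts.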



Figure 1: The DMFT predicts the feature dynamics of wide networks trained with gradient descent (GD), feedback alignment (FA) with ρ = 0, a gated linear network (GLN), and an error-modulated β = 1 Hebb rule (Hebb) in the feature learning regime. (a) The loss dynamics in a two-layer (L = 1, N = 2000) network trained with these learning rules at richness γ₀ = 2. The network is trained on a collection of P = 10 random vectors in D = 50 dimensions. (b) The cosine similarity of the eNTK with the targets, $A(K, yy^\top) = \frac{y^\top K y}{\|K\|_F |y|^2}$, reveals increasing alignment for all algorithms. Though FA starts with the lowest alignment, its final NTK task alignment exceeds that of GD. (c) The dynamics of the gradient-pseudogradient kernel $\tilde G$ also reveal increasing correlation of $g$ with $\tilde g$. FA starts with $\tilde G = 0$, but $\tilde G$ increases to a non-zero value. (d) The distribution of hidden-layer preactivations after training reveals non-Gaussian statistics for both GD and FA, but approximately Gaussian statistics for the GLN. (e)-(f) The final $\Phi$ and $\tilde G$ kernels from theory and experiment.

Figure 2: The lazy infinite-width limits of the various learning rules are fully summarized by their initial eNTK. The kernels shown are for ρ-aligned ReLU FA and ReLU GLN with inputs separated by angle θ. (a) The kernels for varying ρ in ρ-aligned FA. Larger ρ yields a sharper peak in the kernel around θ = 0. The ρ → 0 limit recovers the NNGP kernel $\Phi^L$, while the ρ → 1 limit gives the backprop NTK. (b) Deeper networks with partial alignment ρ = 0.5. (c) The ReLU-GLN kernel sharpens with depth. (d)-(e) The relative error of the infinite-width $\Phi^\ell$, $G^\ell$ kernels in a width-N ReLU neural network. The late-layer $\Phi^\ell$ and early-layer $G^\ell$ kernels have the highest errors, since finite-size effects accumulate on the forward and backward passes respectively. (f) Finite-width corrections to the eNTK are larger for small ρ and large depth L. All squared errors scale as $|K_N - K_\infty|^2 \sim O_N(1/N)$.

Figure 3: Feature learning enables alignment for a depth 4 (L = 3 hidden layers) tanh network trained with direct feedback alignment (DFA) with varying γ₀. (a) Training loss for DFA networks of width N = 4000 with varying richness γ₀ shows that feature learning accelerates training, as predicted by DMFT (black). (b) The alignment (cosine similarity) of the last-layer kernel $\Phi^L$ with the target function reveals successful task-dependent feature learning at large γ₀. (c) The dynamics of the pseudo-gradient/gradient correlation $\mathrm{corr}(g, \tilde g) = \frac{1}{LP}\sum_{\ell,\mu} \frac{g^\ell_\mu(t) \cdot \tilde g^\ell_\mu(t)}{|g^\ell_\mu(t)||\tilde g^\ell_\mu(t)|}$, averaged over layers ℓ and datapoints µ. Larger γ₀ generates more significant alignment between pseudo-gradients and gradients. (d) The final NTKs as a function of γ₀ reveal increasing clustering of the data points by class. (e) The error of the DMFT approximation for the kernel dynamics as a function of N: $\langle |K_N(t) - K_\infty(t)|^2 \rangle_t$.

Figure 4: The initial feedback correlation ρ alters the alignment dynamics of FA in a depth 4 (L = 3 hidden layer) linear network. (a) Larger ρ leads to faster initial training, since the scale of the eNTK is larger. (b) Further, larger ρ leads to larger scales of $\tilde G(t) = \frac{1}{N} g^\ell(t) \cdot \tilde g^\ell(t)$. (c) However, smaller ρ leads to more alignment of the NTK K(t) with the task-relevant subspace, measured with the cosine similarity $A(K, yy^\top)$. (d) The overlaps of the feature kernels $H^\ell(t)$ with y reveal that $H^\ell(t)$ aligns more significantly in the small-ρ networks.

Figure 5: The feature kernel dynamics and their scaling with γ₀ for GD, ρ-FA, and Hebbian rules in an exactly solvable two-layer linear network. (a) The loss dynamics for all algorithms reveal that ρ = 0 FA and Hebb have the same early-time dynamics, as do ρ = 1 FA and GD. However, all loss curves become distinct at late times due to different eNTK dynamics. (b) The alignment of the kernel to the target function, $H_y(t) = \frac{y^\top H(t) y}{|y|^2 \mathrm{Tr}\, H(t)}$, increases significantly for GD and FA, but not for Hebb, reflecting the task-independence of the Hebb-learned representation. (c) The movement of the feature kernel $\Delta H_y = \lim_{t\to\infty} H_y(t) - H_y(0)$ as a function of γ₀ for GD and ρ = 0, 1 FA. At small feature-learning strength, all algorithms have updates of order $\Delta H_y \sim \gamma_0^2$. At large γ₀, GD has $\Delta H_y \sim \gamma_0$ while FA has $\Delta H_y \sim \gamma_0^{2/3}$. The ρ = 1 FA (green) has lower $\Delta H_y$ than the ρ = 0 FA across all γ₀.

Figure 7: Verification of kernel fluctuations through next-to-leading-order (NLO) perturbation theory within the DMFT formalism. (a) The cross-layer covariance structure of $\{\Phi^\ell\}$ in an L = 10 hidden-layer ReLU MLP. The empirical covariance was estimated by initializing a large number (500) of random networks and computing their $\Phi^\ell$ kernels. The variance of $\Phi^\ell$ increases with ℓ. (b) The cross-layer covariance structure of $\{G^\ell\}$. The variance of $G^\ell$ is larger for smaller ℓ. (c) The predicted variance of $\Phi^\ell$ for different layers ℓ and widths N. All layers have variance scaling as 1/N, consistent with NLO perturbation theory. (d) The scaling of the $G^\ell$ variance.


Published as a conference paper at ICLR 2023

Conditional on the order parameters, samples $\{u^\ell_\mu(t), r^\ell_\mu(t), \zeta^\ell_\mu(t), \tilde\zeta^\ell_\mu(t)\}$ can be drawn from their appropriate GPs. From these sampled fields, we can identify the kernel order parameters by simple estimation of the appropriate moments. The algorithm is provided in Algorithm 1. The parameter $\beta$ controls the recency weighting of the samples obtained at each iteration. If $\beta = 1$, the rank of the kernel estimates is limited to the number of samples $S$ used in a single iteration, but with $\beta < 1$, smaller sample sizes $S$ can still yield accurate results. We used $\beta = 0.6$ in our deep network experiments.
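A sketch of the recency-weighted kernel estimate described above (Algorithm 1 is not reproduced here; the sampler below is a hypothetical stand-in for the DMFT single-site sampler, and the exponential-moving-average update is our reading of the role of $\beta$):

```python
import numpy as np

def ema_kernel_estimate(sample_fields, beta=0.6, n_iters=50):
    """Recency-weighted estimate of a feature kernel Phi = <phi phi^T>.

    sample_fields: callable returning an (S, P) array of S sampled fields
    over P datapoints. With beta = 1 the estimate is just the latest batch
    (rank <= S); beta < 1 averages over past batches, lifting the rank.
    """
    Phi = None
    for _ in range(n_iters):
        phi = sample_fields()                    # (S, P) fresh samples
        Phi_batch = phi.T @ phi / phi.shape[0]   # rank <= S batch estimate
        Phi = Phi_batch if Phi is None else (1 - beta) * Phi + beta * Phi_batch
    return Phi

# toy check: fields drawn from a known Gaussian with covariance Sigma
rng = np.random.default_rng(0)
P, S = 4, 3                          # note S < P: a single batch is rank-deficient
A = rng.standard_normal((P, P))
Sigma = A @ A.T / P
Lchol = np.linalg.cholesky(Sigma)
sampler = lambda: rng.standard_normal((S, P)) @ Lchol.T
Phi_hat = ema_kernel_estimate(sampler, beta=0.6, n_iters=2000)
```

Even with $S < P$ per iteration, the averaged estimate becomes full rank; smaller $\beta$ averages over more history and reduces the estimation noise.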

B DERIVATION OF DMFT EQUATIONS

In this section, we derive the DMFT description of infinite-width network dynamics. The path integral theory we develop is based on the Martin-Siggia-Rose-De Dominicis-Janssen (MSRDJ) framework (Martin et al., 1973). A useful review of this technique applied to random recurrent networks can be found in Crisanti & Sompolinsky (2018). The framework was recently extended to deep learning with GD in Bordelon & Pehlevan (2022).

B.1 WRITING EVOLUTION EQUATIONS IN FEATURE SPACE

First, we express all of the learning dynamics in terms of the preactivation features $h^{\ell+1}_\mu(t) = \frac{1}{\sqrt N} W^\ell(t)\, \phi(h^\ell_\mu(t))$, the pre-gradient features $z^\ell_\mu(t) = \frac{1}{\sqrt N} W^\ell(t)^\top g^{\ell+1}_\mu(t)$, and the pseudo-gradient features $\tilde g^\ell_\mu(t)$. Since we would like to understand typical behavior over random initializations of the weights $\theta(0) = \{W^0(0), W^1(0), \dots, w^L(0)\}$, we want to isolate the dependence of the evolution equations on $W^\ell(0)$. We achieve this separation by integrating the learning dynamics for $W^\ell(t)$. The inclusion of the prefactor $\frac{\gamma_0}{\sqrt N}$ in the weight dynamics ensures that $\frac{d}{dt} f = O_{\gamma_0,N}(1)$ and $\frac{d}{dt} h^\ell = O_{\gamma_0,N}(\gamma_0)$ at initialization (Chizat et al., 2019; Bordelon & Pehlevan, 2022). Using the forward and backward pass equations, we obtain evolution equations for the feature vectors in terms of the feature and gradient/pseudo-gradient kernels. The particular learning rule fixes the definition of the pseudo-gradient $\tilde g^\ell_\mu(t)$. We note that, for all learning rules considered, the pseudo-gradient $\tilde g^\ell_{i,\mu}(t)$ is a function of the fields $\{h^\ell_{i,\mu}(t), z^\ell_{i,\mu}(t), m^\ell_{i,\mu}(t), \zeta^\ell_{i,\mu}(t), \tilde\zeta^\ell_{i,\mu}(t)\}_{\mu \in [P], t \in \mathbb{R}_+}$, conditional on the values of the kernels $\{\Phi^\ell, \tilde{\tilde G}^\ell\}$. The additional fields $\zeta, \tilde\zeta$ are specifically required for $\rho$-FA with $\rho > 0$, while the gating fields $m^\ell_\mu$ are required for GLNs. All of the necessary fields $\{h^\ell_\mu(t), z^\ell_\mu(t), \tilde g^\ell_\mu(t)\}$ are thus causal functions of the stochastic fields $\{\chi^\ell_\mu(t), \xi^\ell_\mu(t), m^\ell_\mu, \zeta^\ell_\mu(t), \tilde\zeta^\ell_\mu(t)\}$ and the kernels $\{\Phi^\ell, \tilde{\tilde G}^\ell\}$. It therefore suffices to characterize the distribution of these latter objects over random initializations of $\theta(0)$ in the $N \to \infty$ limit, which we do in the next section.

B.2 MOMENT GENERATING FUNCTIONAL

We now characterize the probability density of the random fields. It is readily apparent that the fields $m^\ell_\mu$ are independent of the others and have a Gaussian distribution over the random Gaussian gating weights $M^\ell$. These fields can therefore be handled independently from the others, which are statistically coupled through the initial conditions. We thus characterize the moment generating functional of the remaining fields $\{\chi^\ell_\mu(t), \xi^\ell_\mu(t), \zeta^\ell_\mu(t), \tilde\zeta^\ell_\mu(t)\}$ over the random initial conditions and random backward-pass weights, where $\chi, \xi, \zeta, \tilde\zeta$ are regarded as functions of $\theta(0)$ and $\{\tilde W^\ell\}$. Arbitrary moments of these random variables can be computed by differentiating $Z$ near zero source; for example, a two-point correlation function is obtained from two source derivatives. More generally, we let $\mu = (i, \mu, t)$ be a tuple containing the neuron, sample, and time index for an entry of one of these fields, so that $\chi^\ell_\mu = \chi^\ell_{i,\mu}(t)$. Further, we let $N_{\chi^\ell}, N_{\xi^\ell}, N_{\zeta^\ell}, N_{\tilde\zeta^\ell}$ be index sets $N_\chi = \{\mu^\chi_1, \dots, \mu^\chi_{|N_\chi|}\}$ containing the sample, time, and neuron indices over which we wish to compute an average; arbitrary moments then follow by differentiating $Z$ with respect to the corresponding sources. We now study this moment generating functional $Z$ in the large width $N \to \infty$ limit.

B.3 PATH INTEGRAL FORMULATION AND INTEGRATION OVER WEIGHTS

To enable the average over the weights, we multiply $Z$ by an integral representation of unity that enforces the relationship between $\chi^{\ell+1}_\mu(t)$, $W^\ell(0)$, and $\phi(h^\ell_\mu(t))$. In the second line, we use the Fourier representation of the Dirac-Delta function for each of the $N$ neuron indices, $\delta(r) = \int_{-\infty}^{\infty} \frac{d\hat r}{2\pi} \exp(i \hat r r)$. We repeat this procedure for the other fields $\xi^\ell_\mu(t), \zeta^\ell_\mu(t), \tilde\zeta^\ell_\mu(t)$ at each time $t$ and each sample $\mu$. After inserting these delta functions, the moment generating functional contains simultaneous integrals over time $t$ and sums over samples $\mu$, so we again adopt the shorthand index $\mu = (\mu, t)$ and a corresponding summation convention. To perform the averages over weights, we use the moment generating function of a standard normal variable. Applying this fact to each of the weight matrix averages introduces a collection of order parameters $\{\Phi, G, \tilde G, \tilde{\tilde G}, A, C\}$, which will correspond to the correlation and response functions of our DMFT. A similar average over $\tilde W^\ell$ can be performed directly. Now that we have defined our collection of order parameters, we enforce their definitions with Dirac-Delta functions by multiplying by one. As promised, the only terms in $Z_i$ which vary over the site index $i$ are the sources $\{j, k, n, p\}$. To simplify the later saddle point equations, we abstract the notation for the single-site MGF, writing it in terms of a single-site effective Hamiltonian $H^\ell_i$ for neuron $i$ in layer $\ell$; note that at zero source, these single-site MGFs are identical across sites. We further compute the saddle point equations for the dual order parameters. The correlation functions involving the real variables $\{h, g, \tilde g\}$ have a straightforward interpretation. However, it is not immediately clear what to do with the terms involving the dual fields $\{\hat\chi, \hat\xi, \hat\zeta\}$. As a starting example, let us consider one of the terms for $B^{\ell-1}_{\mu\nu}$, namely $\langle -i \hat\chi^\ell_\mu\, g^\ell_\nu \rangle$.
We make progress by inserting a fictitious source term $u^\ell_\mu$ and differentiating near zero source. Introducing a vectorized notation, we compute the derivative of the single-site MGF with respect to $u^\ell$ at $u^\ell = 0$. From the same reasoning, we can also obtain $\hat\Phi^{\ell-1}$. Performing a similar analysis, we insert the source fields $r^\ell_+ = \begin{pmatrix} r^\ell \\ v^\ell \end{pmatrix}$ for $\hat\xi^\ell_+ = \begin{pmatrix} \hat\xi^\ell \\ \hat\zeta^\ell \end{pmatrix}$, and define
$$G^{\ell+1}_+ = \begin{pmatrix} G^{\ell+1} & \tilde G^{\ell+1} \\ \tilde G^{\ell+1\top} & \tilde{\tilde G}^{\ell+1} \end{pmatrix}, \qquad B^\ell_+ = \begin{pmatrix} B^\ell & D^\ell \end{pmatrix},$$
and then compute the necessary averages using the same technique. We now have formulas for all of the necessary averages entirely in terms of the primal fields $\{\chi, \xi, \zeta, \tilde\zeta\}$.

B.6 LINEARIZING WITH THE HUBBARD TRICK

Now, using the fact that in the $N \to \infty$ limit $q$ concentrates around $q^*$, we simplify the single-site stochastic processes to obtain a final formula for $\{A, B, C, D, \hat\Phi, \hat G, \hat{\tilde G}, \hat{\tilde{\tilde G}}\}$. To do so, we use the Hubbard-Stratonovich identity, which is merely a consequence of the Fourier transform of the Gaussian distribution. This is often referred to as "linearizing" the action, since a term quadratic in the dual variable is replaced by an average of an action which is linear in that variable. In our setting, we perform this trick on the collections of variables which appear in the quadratic forms of our single-site MGFs $Z^\ell_i$: first for the $\hat\chi^{\ell+1}$ fields, and then through a joint decomposition for the $\{\hat\xi^\ell, \hat\zeta^\ell\}$ fields. The resulting Gaussian sources $\{r^\ell_\mu\}_\mu$ and $\{v^\ell_\mu\}_\mu$ are mean zero with correlation given by $\Sigma^{\ell+1}$. Having linearized the quadratic components involving each of the dual fields $\{\hat\chi, \hat\xi, \hat\zeta\}$, we perform the integration over these variables, which reveals a set of identities. Since we know by construction that $u^\ell = \chi^\ell - A^{\ell-1} g^\ell - C^{\ell-1} \tilde g^\ell$ is a zero-mean Gaussian with covariance $\Phi^{\ell-1}$, we can simplify our expressions for $B^{\ell-1}$ and $\hat\Phi^{\ell-1}$ using Stein's Lemma. Similarly, we use the Gaussianity of $r^\ell, \zeta^\ell, \tilde\zeta^\ell$.

B.7 FINAL DMFT EQUATIONS

We now take the limit of zero sources $j^\ell, k^\ell, n^\ell, p^\ell \to 0$. In this limit, all single-site averages $\langle \cdot \rangle_i$ become identical, so we can simplify the expressions for the order parameters. To "symmetrize" the equations, we make the substitutions $B \to B^\top$ and $D \to D^\top$. Next, we rescale all of the response functions $\{A^\ell, B^\ell, C^\ell, D^\ell\}$ by $\gamma_0^{-1}$ so that they are $O_{\gamma_0}(1)$ at small $\gamma_0$. This gives the final set of equations for the order parameters and for the fields $h^\ell_\mu(t), z^\ell_\mu(t), \tilde z^\ell_\mu(t)$.

C EXTENSION TO OTHER ARCHITECTURES AND OPTIMIZERS

In this section, we consider the effect of changing architectural details (multiple output channels and convolutional structure) and also optimization choices (momentum, regularization).

C.1 MULTIPLE OUTPUT CLASSES

Similar to pre-existing work on the GD case (Bordelon & Pehlevan, 2022), our generalized DMFT can easily be extended to $C$ output channels, provided the number of channels $C$ is not taken to infinity simultaneously with the network width $N$. The outputs of the network are now vectors $f_\mu \in \mathbb{R}^C$, and each eNTK entry is a $C \times C$ matrix $K_{\mu\nu}(t,s) \in \mathbb{R}^{C \times C}$. The relevant true gradient fields are vectors $g^\ell_{c,\mu} = \frac{\partial f_{c,\mu}}{\partial h^\ell_\mu}$, and we construct the pseudo-gradients $\tilde g^\ell_{c,\mu}$ as before using each of our learning rules. The gradient-pseudogradient kernels, contracted against the feature kernels $\Phi^\ell_{\mu\nu}$, again determine the function dynamics. At infinite width $N \to \infty$, the field dynamics take the same form as in the single-output case, and the pseudo-gradient fields $\tilde g^\ell$ are defined analogously for each learning rule.

C.3 L2 REGULARIZATION (WEIGHT DECAY)

L2 regularization on the weights $W^\ell$ (weight decay) can also be modeled within the DMFT. We start from the weight dynamics with an added decay term and integrate them using the integrating factor $e^{\lambda t}$. This yields the corresponding feature dynamics in the DMFT limit, and the $\tilde g$ dynamics are modified analogously, with factors of $e^{-\lambda t}$ and $e^{-\lambda(t-s)}$ appearing for each of our learning rules. We see that the contribution from the initial conditions $\chi, \xi$ is suppressed at late times, while the feature-learning update, which is $O(\gamma_0/\lambda)$ in the first layer, dominates the scale of the final features.
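The integrating-factor step can be sketched as follows, assuming weight dynamics of the standard form used in this appendix with an added decay term (the prefactors follow the conventions stated in Appendix B.1):

```latex
\frac{d}{dt}W^\ell(t) = -\lambda W^\ell(t) + \frac{\gamma_0}{\sqrt N}\sum_\mu \Delta_\mu(t)\,\tilde g^{\ell+1}_\mu(t)\,\phi(h^\ell_\mu(t))^\top
\;\Longrightarrow\;
W^\ell(t) = e^{-\lambda t}\,W^\ell(0) + \frac{\gamma_0}{\sqrt N}\int_0^t ds\, e^{-\lambda(t-s)}\sum_\mu \Delta_\mu(s)\,\tilde g^{\ell+1}_\mu(s)\,\phi(h^\ell_\mu(s))^\top
```

The $e^{-\lambda t}$ factor in front of $W^\ell(0)$ is what suppresses the initial-condition fields $\chi, \xi$ at late times, while the memory kernel $e^{-\lambda(t-s)}$ weights recent feature-learning updates most heavily.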

C.4 MOMENTUM

Momentum uses a low-pass-filtered version of the gradients to update the weights (Goh, 2017). A continuous-time limit of momentum dynamics on the trainable parameters $\{W^\ell\}$ gives a pair of differential equations for the filtered gradient $Q^\ell(t)$ and the weights. We write the dynamics this way so that the small time-constant limit $\tau \to 0$ corresponds to classic gradient descent. Integrating the $Q^\ell(t)$ dynamics gives an integral expression for the weights, and these weight dynamics give rise to a corresponding field evolution. In the $\tau \to 0$ limit, the $t''$ integral is dominated by the contribution at $t'' \sim t'$, recovering the usual gradient-descent dynamics. For $\tau \gg 0$, the integral accumulates additional contributions from past values of the fields and kernels.
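A numerical sketch of the low-pass-filter view of momentum (the signal and time constants below are illustrative):

```python
import numpy as np

# Momentum as a low-pass filter: tau dQ/dt = -Q + g(t), so
# Q(t) = (1/tau) int_0^t e^{-(t-t')/tau} g(t') dt'.
# As tau -> 0, Q(t) -> g(t), recovering plain gradient flow.

def low_pass(g, tau, dt):
    """Euler integration of tau dQ/dt = -Q + g for a sampled signal g."""
    Q = np.zeros_like(g)
    for t in range(1, len(g)):
        Q[t] = Q[t - 1] + dt / tau * (g[t - 1] - Q[t - 1])
    return Q

dt = 1e-3
ts = np.arange(0, 4, dt)
g = np.sin(ts)                               # stand-in gradient signal
Q_small_tau = low_pass(g, tau=1e-2, dt=dt)   # tracks g closely
Q_large_tau = low_pass(g, tau=0.5, dt=dt)    # lags and smooths g
```

The small-$\tau$ filter reproduces the instantaneous gradient after a short transient, while the large-$\tau$ filter visibly accumulates contributions from the signal's past, mirroring the memory integral in the field evolution above.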

D LAZY LIMITS

In this section we discuss the lazy $\gamma_0 \to 0$ limit. In this limit, $h^\ell(t) = u^\ell(t)$ and $z^\ell(t) = r^\ell(t)$ for all times $t$. Since the input data Gram matrix $\Phi^0_{\mu\nu} = \frac{1}{D} x_\mu \cdot x_\nu$ is constant in time, the sources $u^1_\mu$ in the first hidden layer are constant in time, and consequently so is the first-layer feature kernel. This argument proceeds inductively: since $\Phi^1$ is time-independent, the second-layer fields $h^2 = u^2 \sim \mathcal N(0, \Phi^1)$ are also constant in time, implying $\Phi^2$ is constant in time, and so on for all layers $\ell \in [L]$. Similarly, we can analyze the backward-pass fields $z^\ell$: since $z^L \sim \mathcal N(0, G^{L+1})$ is constant, the $z^\ell$ are time-independent for all $\ell$. It thus suffices to compute the static kernels $\{\Phi^\ell, G^\ell, \tilde{\tilde G}^\ell\}$ at initialization, using the independence of $u^\ell$ and $r^\ell$. These equations give a forward-pass recursion for the $\Phi^\ell$ kernels and a backward-pass recursion for the $G^\ell$ kernels. Lastly, depending on the learning rule, we obtain the corresponding definitions of $\tilde{\tilde G}^\ell$ for $\ell \in \{1, \dots, L\}$. Using these results, we can compute the initial eNTK
$$K = \sum_{\ell=0}^{L} \tilde{\tilde G}^{\ell+1} \odot \Phi^\ell,$$
which governs the prediction dynamics.

We can close these equations for $H^\ell$ and $\tilde{\tilde G}^\ell$. The matrices $\tilde{\tilde G}^\ell = \mathbf 1 \mathbf 1^\top$ are all rank one, so it suffices to compute the corresponding vectors. The analysis for the DFA and Hebb rules is very similar.

F EXACTLY SOLVABLE 2-LAYER LINEAR MODEL

F.1 GRADIENT FLOW

Based on prior results from Bordelon & Pehlevan (2022), the $H_y = y^\top H y / |y|^2$ dynamics for GD are coupled to the dynamics of the error $\Delta(t)$. These dynamics admit a conservation law; integrating it from time $0$ to time $t$, we find $H_y(t)^2 = 1 + \gamma_0^2 (y - \Delta(t))^2$. We can therefore solve a single ODE for $\Delta(t)$. The resulting dynamics interpolate between exponential convergence (at small $\gamma_0$) and logistic convergence (at large $\gamma_0$) of $\Delta(t)$ to zero. Since $\Delta \to 0$ at late times, the final value of the kernel alignment follows from the conservation law.

F.2 FEEDBACK ALIGNMENT

For the two-layer linear network trained with $\rho$-FA, the pseudo-gradient is constant in time: $\tilde g_\mu(t) = \tilde g \sim \mathcal N(0,1)$, a constant standard normal. We let $a_\mu(t) = \langle \tilde g\, h_\mu(t) \rangle$. The dynamics for $H_{\mu\nu}$ and $a_\mu$ are coupled. Whitening the dataset, $K^x = I$, and projecting all dynamics onto the $\hat y$ subspace gives a reduced set of dynamics with an associated set of conservation laws. Writing everything in terms of $\Delta$ and $a$, and integrating both sides from $0$ to $t$, we find
$$\Delta = y - \gamma_0^{-1}\left[ \tfrac{1}{2} a^3 + (1+\rho)\, a \right].$$
The $a$ dynamics are thus one-dimensional. When run from the initial condition $a = 0$, they converge to the smallest positive root of the cubic equation $\frac{1}{2} a^3 + (1+\rho)\, a = \gamma_0 y$. This implies that for small $\gamma_0$ we have $a \sim \frac{\gamma_0 y}{1+\rho}$, so that $\Delta H = 2\Delta \tilde{\tilde G} \sim \frac{\gamma_0^2 y^2}{(1+\rho)^2}$: larger initial alignment $\rho$ leads to smaller changes in the feature kernel and pseudo-gradient alignment kernel. At large $\gamma_0 y$, we have $a \sim (2\gamma_0 y)^{1/3}$, so that $\Delta H = 2\Delta \tilde{\tilde G} \sim (2\gamma_0 y)^{2/3}$.
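The two scaling regimes of the cubic equation $\frac{1}{2}a^3 + (1+\rho)a = \gamma_0 y$ can be checked numerically (a short sketch):

```python
import numpy as np

def fa_root(gamma0, y, rho):
    """Real root of (1/2) a^3 + (1 + rho) a - gamma0 y = 0.

    The cubic is monotone increasing (positive linear coefficient), so it
    has exactly one real root, which is positive for gamma0 y > 0.
    """
    roots = np.roots([0.5, 0.0, 1.0 + rho, -gamma0 * y])
    return min(roots, key=lambda z: abs(z.imag)).real

y, rho = 1.0, 0.0
a_small = fa_root(1e-4, y, rho)   # ~ gamma0 y / (1 + rho)
a_large = fa_root(1e4, y, rho)    # ~ (2 gamma0 y)^(1/3)
```

At small $\gamma_0$ the linear term dominates and the root scales linearly with $\gamma_0$, while at large $\gamma_0$ the cubic term dominates and the $\gamma_0^{1/3}$ scaling of $a$ gives the $\gamma_0^{2/3}$ scaling of $\Delta H$ quoted above.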

F.3 HEBB

For the Hebb rule, $\tilde{\tilde G}_\mu = \langle \tilde g\, h_\mu \rangle \Delta_\mu = \gamma_0 f_\mu \Delta_\mu = \gamma_0 (y_\mu - \Delta_\mu)\Delta_\mu$. Under the whitening assumption $K^x_{\mu\nu} = \delta_{\mu\nu}$, the dynamics decouple over samples. We see that $H_{\mu\mu}$ strictly increases. The possible fixed points for $\Delta_\mu$ are $\Delta_\mu = 0$ or $\Delta_\mu = \frac{1}{2}\left( y_\mu \pm \sqrt{y_\mu^2 + \gamma_0^{-1} H_{\mu\mu}} \right)$. One of these roots shares its sign with $y_\mu$ and has larger absolute value; the other root has the opposite sign from $y_\mu$. From the initial condition $\Delta_\mu = y_\mu$ and $H_{\mu\mu} = 1$, $\Delta_\mu$ initially decreases in absolute value, so that $|\Delta_\mu| \in (0, |y_\mu|)$ and $\Delta_\mu$ shares its sign with $y_\mu$. In this regime $\frac{d}{dt}|\Delta_\mu| < 0$. Thus, the system eventually reaches the fixed point at $\Delta_\mu = 0$, rather than increasing in magnitude towards the root which shares a sign with $y_\mu$, or continuing to the root with the opposite sign.

G DISCUSSION OF MODIFIED HEBB RULE

We chose to modify the traditional Hebb rule to include a weighting of each example by its instantaneous error. In this section we discuss this choice and briefly compare alternatives.

• No error modulation: in the absence of regularization or normalization, a pure Hebbian rule continues to update the weights even once the task is fully learned, leading to divergences at infinite time $t \to \infty$.

• Single power of the error: $\frac{d}{dt} W^\ell \propto \sum_\mu \Delta_\mu\, \phi(h^{\ell+1})\phi(h^\ell)^\top$. While this rule may naively appear plausible, in a linear network it can only learn training points with positive target values $y_\mu$ if $\gamma_0 > 0$. Further, this rule gives Hebbian updates only when $\Delta_\mu > 0$.

• Two powers of the error: $\frac{d}{dt} W^\ell \propto \sum_\mu \Delta_\mu^2\, \phi(h^{\ell+1})\phi(h^\ell)^\top$. This is our error-modulated Hebb rule. The update always has the correct sign for a Hebbian update, and the updates stop when the network converges to zero error, preventing divergence of the features at late times.

H FINITE SIZE EFFECTS

We can reason about the fluctuations of $q$ around the saddle point $q^*$ at large but finite $N$ using a Taylor expansion of the DMFT action $S$ around the saddle point. This argument shows that, at large but finite $N$, we can treat $q$ as fluctuating over initializations with mean $q^*$ and variance $O(N^{-1})$. We first illustrate the mechanics of this computation for an arbitrary observable with a scalar example before applying it to the DMFT.

H.1 SCALAR EXAMPLE

Suppose we have a scalar variable $q$ with distribution defined by the Gibbs measure $p(q) = \frac{e^{-N S[q]}}{\int dq\, e^{-N S[q]}}$ for an action $S$. We consider averaging an arbitrary observable $O(q)$ over this distribution. We Taylor expand $S$ around its saddle point $q^*$; the $\exp(-N S[q^*])$ factors cancel between numerator and denominator. We then change variables to $\delta = \sqrt N (q - q^*)$, and note that all higher-order derivative terms ($k \ge 3$) are suppressed by at least $N^{-1/2}$ compared to the quadratic term. Letting $U = \sum_{k=3}^{\infty} N^{1-k/2} S^{(k)}[q^*]\, \delta^k$ represent the perturbation to the potential, we Taylor expand the exponential around the unperturbed Gaussian potential $\exp\left(-\frac{1}{2}\delta^2 S''[q^*]\right)$, and let $\langle O(\delta) \rangle_0 = \mathbb{E}_{\delta \sim \mathcal N(0, S''[q^*]^{-1})} O(\delta)$ denote an average over this unperturbed potential. Truncating the series in numerator and denominator at a fixed order in $1/N$ gives a Padé approximant to the full observable average (Bender et al., 1999). Alternatively, this can be expressed as a cumulant expansion (Kardar, 2007) in terms of the connected correlations $\langle O U^k \rangle^c_0$. Using Stein's Lemma, we can extract the leading $O(N^{-1})$ behavior from each term, starting from the useful identity
$$\langle O(q)\, \delta^k \rangle = N^{k/2} \langle O(q)(q - q^*)^k \rangle. \tag{81}$$
Using this fact, the first few correlation functions of interest are
$$\langle O(q) U \rangle = 3 N^{-1} S^{(3)} [S'']^{-2} \langle O'(q) \rangle_0 + 3 N^{-1} S^{(4)} [S'']^{-2} \langle O(q) \rangle_0 + O(N^{-2}),$$
$$\langle O(q) U^2 \rangle = 15 N^{-1} [S^{(3)}]^2 [S'']^{-3} \langle O(q) \rangle_0 + O(N^{-2}).$$
Thus, the leading-order Padé approximant has the form
$$\langle O(q) \rangle = \langle O \rangle_0 - \frac{3}{N} S^{(3)} [S'']^{-2} \langle O'(q) \rangle_0 - \frac{3}{N} S^{(4)} [S'']^{-2} \langle O(q) \rangle_0 + \frac{15}{2N} [S^{(3)}]^2 [S'']^{-3} \langle O(q) \rangle_0 + O(N^{-2}).$$
The logic of this section can be extended to our DMFT. We first redefine the action as its negation, $S \to -S$, to simplify the argument.
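A numerical sanity check of this type of expansion on a toy action $S(q) = \frac{q^2}{2} + \frac{q^4}{4}$ (for which $q^* = 0$, $S'' = 1$, $S''' = 0$; computing the first correction directly for this measure gives $\langle q^2 \rangle = \frac{1}{N} - \frac{3}{N^2} + O(N^{-3})$):

```python
import numpy as np

# Exact <q^2> under the Gibbs measure e^{-N S(q)}, by dense quadrature.
# The Gaussian (leading-order) value is 1/N; the quartic term contributes
# a -3/N^2 correction, so N^2 (<q^2> - 1/N) should approach -3.

def moment(N, O=lambda q: q ** 2):
    q = np.linspace(-2.0, 2.0, 200_001)
    w = np.exp(-N * (q ** 2 / 2 + q ** 4 / 4))
    return float((O(q) * w).sum() / w.sum())   # shared grid, ratio of sums

for N in (50, 200):
    corr = N ** 2 * (moment(N) - 1 / N)
    print(N, corr)   # approaches -3 as N grows
```

The residual deviation from $-3$ shrinks as $1/N$, consistent with the higher-order terms being suppressed by additional powers of $N^{-1}$.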
Concretely, this action $S[q]$ defines a Gibbs measure over the order parameters $q$, which we can use to compute observable averages
$$\langle O(q) \rangle = \frac{\int \exp(-N S[q])\, O(q)}{\int \exp(-N S[q])}. \tag{85}$$
As before, one can Taylor expand the action around the saddle point $q^*$,
$$S[q] \sim S[q^*] + \frac{1}{2}(q - q^*)^\top \nabla^2 S[q^*]\,(q - q^*) + \dots,$$
where the linear term vanishes since $\nabla_q S[q^*] = 0$ at the saddle point $q^*$. We again change variables to $\delta = \sqrt N (q - q^*)$ and express the average through $\langle \cdot \rangle_0$, which denotes a Gaussian average over $q \sim \mathcal N\left(q^*, \frac{1}{N}[\nabla^2 S]^{-1}\right)$.

H.2 HESSIAN COMPONENTS OF DMFT ACTION

To gain insight into the Hessian, we first restrict our attention to the subset of Hessian entries related to $\Phi^\ell, \hat\Phi^\ell$. We again adopt a multi-index notation $\mu = (\mu, \nu, t, s)$, so that $\Phi^\ell_\mu = \Phi^\ell_{\mu\nu}(t,s)$. The vanishing of the $\hat\Phi\hat\Phi$ block follows from the fact that $\hat\chi$ has vanishing moments, due to the normalization of the probability distribution induced by $Z^\ell$. A similar statement holds for the $G, \hat G$ kernels. Proceeding in this manner, we can compute all off-diagonal components, such as $\frac{\partial^2 S}{\partial \Phi \partial \hat G}$ and $\frac{\partial^2 S}{\partial \hat\Phi \partial \hat G}$. Once all entries are computed, one can seek an inverse of the Hessian to obtain the covariance of the order parameters.

