SYNAPTIC DYNAMICS REALIZE FIRST-ORDER ADAPTIVE LEARNING AND WEIGHT SYMMETRY

Anonymous

Abstract

Gradient-based first-order adaptive optimization methods such as the Adam optimizer are prevalent in training artificial neural networks, achieving state-of-the-art results. This work attempts to answer the question of whether it is viable for biological neural systems to adopt such optimization methods. To this end, we demonstrate a realization of the Adam optimizer using biologically plausible mechanisms in synapses. The proposed learning rule has a clear biological correspondence, runs continuously in time, and achieves performance comparable to Adam's. In addition, we present a new approach, inspired by the predisposition property of synapses observed in neuroscience, to circumvent the biological implausibility of the weight transport problem in backpropagation (BP). With only local information and no separate training phases, this method establishes and maintains weight symmetry between the forward and backward signaling paths, and is applicable to the proposed biologically plausible Adam learning rule. These mechanisms may shed light on the way in which biological synaptic dynamics facilitate learning.

1. INTRODUCTION

Gradient-based adaptive optimization is widely used in many science and engineering applications, including the training of artificial neural networks (ANNs) (Rumelhart et al., 1986). In particular, first-order methods are preferred over higher-order methods because their memory overhead is significantly lower, considering that ANNs are often characterized by high-dimensional feature spaces and large numbers of parameters. Among the most well-known ANN training methods are stochastic gradient descent (SGD) with momentum (Rumelhart et al., 1986), root mean square propagation (RMSProp) (Tieleman & Hinton, 2012), and adaptive moment estimation (Adam) (Kingma & Ba, 2014). Unlike gradient descent, which optimizes a loss function over the complete dataset, SGD runs on a mini-batch. The momentum term accelerates the adjustment of SGD along the direction toward a minimum, while RMSProp impedes the search in the direction of oscillation. The Adam optimizer can be considered a combination of these two ideas; it is computationally efficient, converges quickly, works well with noisy/sparse gradients, and achieves state-of-the-art results in many AI applications (Dosovitskiy et al., 2020; Wang et al., 2022). Given the success of gradient-based adaptive optimization techniques, particularly Adam, in training ANNs, it is natural to ask whether it is viable for biological neural systems to adopt such optimization strategies. We attempt to answer this question by demonstrating an implementation of the Adam optimizer based on biologically plausible synaptic dynamics, together with a new solution to the well-known weight transport problem. We call our implementation Bio-Adam. Nevertheless, given its intricacies, it is not immediately clear how to realize the Adam optimizer in a biologically realistic way.
Compared with the classical SGD method, Adam has two major new ingredients: the use of a momentum term m to smooth the gradient g over multiple batches, and division by a smooth estimate √v of the root mean square of the gradient to constrain the step size, also known as the RMSProp term 1/(√v_t + ϵ), in which ϵ is a small number to prevent division by zero. Although signal smoothing is common in biological modeling, for example in the leaky integrate-and-fire (LIF) model of spiking neurons (Gerstner et al., 2014), we identify the root mean square calculation √v and the division operator in Adam as biologically problematic. To address these difficulties, we define a new variable ρ to mimic the dynamics of RMSProp. Biologically, this newly defined variable ρ may be thought of as the concentration of a certain synaptic substance consumed during the learning process. Essentially, ρ is designed according to the following biologically motivated ideas. A large weight-updating signal accelerates the consumption of the substance and lowers its concentration, which in turn slows down the pace of weight updates. Conversely, under a small weight-updating signal, the synapse gradually replenishes the substance and restores the fast weight-update pace. This behavior mimics the RMSProp term 1/(√v_t + ϵ) in Adam. The overall weight-update speed of our biological Adam is proportional to the product of m and ρ. Another roadblock to a biologically plausible realization of Adam is the weight transport problem, which is in fact an issue common to the wider family of all BP methods. BP is generally believed to be biologically implausible in the brain for several reasons (Stork, 1989; Illing et al., 2019). One key issue is that BP requires symmetric weights between the forward and backward paths to propagate correct error information.
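As an illustration, the consume-and-replenish behavior described above can be sketched as a simple substance-pool model. The specific equation, the time constant tau, the consumption rate k, and the resting concentration rho_max below are hypothetical choices made for this sketch, not quantities defined in the text; the point is only the qualitative behavior that mimics 1/(√v_t + ϵ):

```python
def rho_step(rho, m, dt=0.01, tau=1.0, k=5.0, rho_max=1.0):
    """One Euler step of a hypothetical substance-pool model.

    A large update signal |m| consumes the substance (rho drops, which
    slows weight updates); a small signal lets rho recover toward the
    resting level rho_max, restoring the fast update pace.
    """
    drho = (rho_max - rho) / tau - k * abs(m) * rho
    return rho + dt * drho

# A sustained large signal depletes rho; removing the signal restores it.
rho = 1.0
for _ in range(500):               # sustained large update signal
    rho = rho_step(rho, m=2.0)
depleted = rho
for _ in range(1000):              # signal removed: pool replenishes
    rho = rho_step(rho, m=0.0)
recovered = rho
```

With these illustrative constants, rho settles near 1/(1 + k·|m|·tau) under a constant signal and relaxes back toward rho_max with time constant tau once the signal is removed, qualitatively reproducing the step-size modulation of the RMSProp term.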
This is also known as the weight transport problem (Grossberg, 1987; Ororbia II et al., 2017), which we address with a new approach inspired by the predisposition property of synapses observed in neuroscience. According to predisposition (Chistiakova et al., 2014), weak synapses that have been depressed are more prone to potentiation, whereas strong ones are more prone to depression. When the forward and backward synapses share local updating signals, a potentiation signal potentiates the stronger synapses less than the weaker ones; conversely, a depression signal depresses the weaker synapses less than the stronger ones. By leveraging the biological predisposition property, the proposed mechanism eventually aligns the forward weights with the backward weights. Other methods have also been proposed to share gradients, such as the weight mirror and the modified Kolen-Pollack (KP) algorithm presented in (Akrout et al., 2019); yet we believe our method is more biologically plausible because it is based on a well-observed synaptic property. Our solution to the weight transport problem is immediately applicable to the proposed Adam learning rule. In summary, the presented Bio-Adam learning rule is based on biologically plausible synaptic dynamics, with an underlying biological mechanism addressing the weight transport problem. Moreover, our biological Adam optimizer delivers a performance level that is on a par with the original implementation (Kingma & Ba, 2014). Our findings may shed light on the way in which biological neural systems facilitate powerful learning processes.
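The alignment argument above can be illustrated with a toy simulation. The saturating update rule below, with gain (w_max − w) for potentiation and (w − w_min) for depression, is our own minimal formalization of the predisposition property for this sketch, not the paper's actual learning rule; it shows how two weights receiving the same shared signal contract toward each other:

```python
import random

def predisposition_step(w, s, eta=0.05, w_min=0.0, w_max=1.0):
    """Update weight w with a shared signal s, scaled by predisposition:
    potentiation (s > 0) acts more weakly on strong synapses,
    depression (s < 0) acts more weakly on weak synapses."""
    if s > 0:
        return w + eta * s * (w_max - w)   # strong synapses potentiate less
    return w + eta * s * (w - w_min)       # weak synapses depress less

random.seed(0)
w_fwd, w_bwd = 0.9, 0.2                    # initially misaligned weights
gap0 = abs(w_fwd - w_bwd)
for _ in range(2000):
    s = random.uniform(-1.0, 1.0)          # shared local updating signal
    w_fwd = predisposition_step(w_fwd, s)
    w_bwd = predisposition_step(w_bwd, s)
gap = abs(w_fwd - w_bwd)                   # shrinks toward zero
```

Under this rule each shared signal multiplies the weight difference by 1 − eta·|s|, so the gap between forward and backward weights decays geometrically regardless of the signal sequence, while both weights stay within [w_min, w_max].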

2.1. PRELIMINARIES

The classical stochastic gradient descent (SGD) follows θ_t ← θ_{t-1} − γ g_t, where θ denotes the weights, γ is the learning rate, and g is the gradient. SGD with momentum (Rumelhart et al., 1986) further introduces a momentum term m to smooth the gradient g over multiple batches according to the dynamics m_t ← β_1 m_{t-1} + (1 − β_1) g_t, where β_1 is a hyperparameter in the range [0, 1), usually close to 1.0. The final updating rule is obtained by replacing g with m in the SGD rule, yielding θ_t ← θ_{t-1} − γ m_t. RMSProp (Tieleman & Hinton, 2012) computes a smooth estimate √v of the root mean square of the gradient and uses it in the denominator to constrain the step size in the oscillation direction. The dynamics of v follow v_t ← β_2 v_{t-1} + (1 − β_2) g_t², where β_2 is a hyperparameter. Similar to β_1, β_2 ∈ [0, 1) typically has a value near 1.0. The updating rule of RMSProp is θ_t ← θ_{t-1} − γ g_t/(√v_t + ϵ), where ϵ is a small number to prevent division by zero. Adam (Kingma & Ba, 2014) can be viewed as a combination of momentum and RMSProp, omitting minor adjustments such as the bias correction and the use of the infinity norm. Adam's main updating rule is θ_t ← θ_{t-1} − γ m_t/(√v_t + ϵ).
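The simplified Adam update above (bias correction omitted) can be written directly as a short sketch in plain Python; the function name adam_step and the scalar toy problem are our own choices for illustration:

```python
import math

def adam_step(theta, m, v, g, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam step, without bias correction, as stated above."""
    m = beta1 * m + (1 - beta1) * g        # momentum: smoothed gradient
    v = beta2 * v + (1 - beta2) * g * g    # RMSProp: smoothed squared gradient
    theta = theta - gamma * m / (math.sqrt(v) + eps)
    return theta, m, v

# Minimizing f(theta) = theta^2 (gradient 2*theta) from theta = 5.0:
theta, m, v = 5.0, 0.0, 0.0
for _ in range(3000):
    theta, m, v = adam_step(theta, m, v, 2 * theta, gamma=0.01)
```

Setting beta1 = 0 recovers RMSProp's update, and dropping the divisor recovers SGD with momentum, matching the view of Adam as the combination of the two.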

