NMDA RECEPTOR NONLINEARITY CONTRIBUTES TO MEMORY CONSOLIDATION IN TRANSFORMERS

Abstract

The NMDA receptor (NMDAR) in the hippocampus is essential for learning and memory. We find an interesting resemblance between the nonlinear activation functions of deep models and the NMDAR's nonlinear dynamics. In light of a recent study comparing the transformer architecture to hippocampal memory formation, this paper presents new findings that NMDAR-like nonlinearity may be essential for consolidating short-term working memory into long-term reference memory. We design a navigation task that assesses these two memory functions and show that manipulating the activation function (i.e., mimicking the Mg2+ gating of the NMDAR) disrupts long-term memory formation. Our experimental data suggest that place cell representations and reference memory may reside in the feed-forward network layer of transformers and that nonlinearity plays a key role in these processes. Our findings suggest that the transformer architecture and hippocampal spatial representation resemble each other in that both rely on an NMDAR-like nonlinearity.

1. INTRODUCTION

In the hippocampus, the NMDAR is regarded as an essential component that mediates synaptic plasticity, memory formation, and spatial representation (Li & Tsien, 2009; Tsien et al., 1996; Kentros et al., 1998). The NMDAR serves as a switch for synaptic plasticity and long-term memory formation (Bliss & Collingridge, 1993; Slutsky et al., 2010; Miyashita et al., 2012). In addition, the NMDAR has been highlighted for its importance in place cell representations in hippocampal CA1 (McHugh et al., 1996; Kentros et al., 1998). Place cells in the hippocampus (O'Keefe & Dostrovsky, 1971) and grid cells in the entorhinal cortex (Hafting et al., 2005) are thought to be crucial for spatial navigation in animals. These discoveries have triggered recent efforts to replicate these spatial representations with deep neural networks (Banino et al., 2018; Cueva & Wei, 2018; Whittington et al., 2022). As depicted in Fig. 1a, the NMDAR ion channels residing in the post-synaptic region have unique characteristics that distinguish them from other ion channels in the brain. Their nonlinear dynamics are modulated by Mg2+ blockade at the pore region. The NMDAR requires activity-dependent repulsion of the Mg2+ ion (Nowak et al., 1984; Mayer et al., 1984) to be functional, and this phenomenon is particularly interesting because it serves as a self-gating of ion influx in the post-synaptic region. In particular, the Mg2+-gated nonlinear dynamics of the NMDAR play a key role in synaptic plasticity and memory formation (Slutsky et al., 2010; Miyashita et al., 2012). Recently, a relationship between the transformer (Vaswani et al., 2017) and models of hippocampal formation has been reported (Whittington et al., 2022).
The transformer is among the most advanced deep learning models, showing unprecedented results in tasks such as language modeling (Devlin et al., 2018; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Radford et al., 2021), and art generation (Ramesh et al., 2022). The model has two consecutive modules, a self-attention layer and a feed-forward network (see Fig. 1b). Whittington et al. (2022) show that the self-attention layer is closely related to a state-of-the-art neuroscience model (Whittington et al., 2020) and claim that softmax neurons in the self-attention layer behave like place cells in a navigation task. However, studies on the role of neurons in the feed-forward network have been absent. We find an interesting resemblance between NMDAR nonlinearity and the Gaussian Error Linear Unit (GELU), a nonlinear activation function widely used in the transformer's feed-forward network (Fig. 1). Similar to the NMDAR's activity-dependent gating of ion influx, the GELU function combines its input with a self-gating term. Biological experiments have shown the critical consequences of changing the NMDAR's nonlinearity for synaptic plasticity and long-term memory formation (Slutsky et al., 2010; Miyashita et al., 2012), while the role of NMDAR-like nonlinearity in place cell representation remains unclear. This work is inspired by the fascinating resemblance of the NMDAR's nonlinear dynamics to the GELU activation function and by the recent model relating the transformer's self-attention mechanism to hippocampal formation (Whittington et al., 2020; 2022). These findings motivated us to ask a question: can NMDAR-like nonlinearity in the feed-forward network layer of transformers enhance the formation of long-term memory and spatial place cell representation?
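The self-gating form of GELU noted above can be made concrete with a short sketch. Below is the exact Gaussian-CDF formulation of GELU; like the NMDAR's Mg2+ block, the gate depends on the input itself, so the function passes large positive inputs almost unchanged while suppressing strongly negative ones:

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: the input x gated by the Gaussian CDF Phi(x)."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # Gaussian CDF, the self-gate
    return x * phi

print(gelu(2.0))   # ~1.954: nearly linear for large positive inputs
print(gelu(-2.0))  # ~-0.045: almost fully gated off
```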
To address this question, we propose a novel NMDAR-like activation function derived from the NMDAR IV curve and design a spatial navigation task in a 2D grid environment that can assess two different memory types well formulated in neuroscience experiments (Olton et al., 1977; 1979): working memory and reference memory. Working memory handles events within a trial, while reference memory captures information that remains constant across trials in an unchanging environment. We evaluate the transformer model with the NMDAR-like activation function on this task; the results show that 1) place cell representations emerge in feed-forward networks, 2) reference memory can be controlled by the nonlinearity of the NMDAR-like activation function, 3) place cells in feed-forward networks are strongly correlated with reference memory, while place cells in self-attention layers show no such correlation, and 4) the proposed NMDAR-like activation achieves the best reference memory performance compared to other widely used nonlinear activation functions. Our experimental data suggest that NMDAR-like nonlinearity in the feed-forward network layer of the transformer can enhance long-term memory formation and place cell representation.
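As a rough illustration of the idea (not the paper's exact derivation from the NMDAR IV curve), a sigmoid-gated activation with a hypothetical steepness parameter alpha mimics the voltage-dependent Mg2+ gating:

```python
import math

def nmda_alpha(x: float, alpha: float = 1.0) -> float:
    """Sketch of an NMDAR-like activation: the input gated by a sigmoid.

    alpha is an illustrative gating-steepness parameter standing in for the
    voltage sensitivity of the Mg2+ block; the paper's function is derived
    from the NMDAR IV curve, which this sketch only approximates.
    """
    return x / (1.0 + math.exp(-alpha * x))  # x * sigmoid(alpha * x)
```

Varying alpha is one way "manipulating the nonlinearity" can be operationalized: alpha = 0 removes the gate entirely (the function reduces to the linear map x/2), while large alpha approaches a hard ReLU-like cutoff.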

2. TRANSFORMER

The transformer architecture (Vaswani et al., 2017) is constructed by stacking multiple blocks of self-attention layers and feed-forward networks (see Fig. 1b). Here we briefly review the self-attention mechanism and the feed-forward network.
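The two modules can be sketched in a few lines of NumPy. This is a deliberately simplified single-head block that omits multi-head splitting, residual connections, and layer normalization, and uses the common tanh approximation of GELU; all names are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU, standard in many implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """One simplified block: single-head self-attention followed by a
    two-layer feed-forward network with a GELU nonlinearity.
    x: (seq_len, d_model)
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v  # self-attention layer
    return gelu(attn @ W1) @ W2                         # feed-forward network

rng = np.random.default_rng(0)
d, h, n = 8, 32, 5  # model dim, hidden dim, sequence length
x = rng.standard_normal((n, d))
Ws = [rng.standard_normal(s) * 0.1 for s in [(d, d)] * 3 + [(d, h), (h, d)]]
y = transformer_block(x, *Ws)
print(y.shape)  # (5, 8)
```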



Figure 1: (a) Schematic diagram of the Mg2+-gated NMDAR modulating synaptic plasticity (left), its current-voltage (IV) curve (right top), and the NMDAR-inspired activation function NMDAα(x) (right bottom). (b) Transformer architecture and its feed-forward network's activation function, the Gaussian Error Linear Unit (GELU; left bottom).

