UNIFIED NEURAL REPRESENTATION MODEL FOR PHYSICAL SPACE AND LINGUISTIC CONCEPTS

Abstract

The spatial processing system of the brain uses grid-like neural representations (grid cells) for supporting vector-based navigation. Experiments also suggest that neural representations for concepts (concept cells) exist in the human brain, and conceptual inference relies on navigation in conceptual spaces. We propose a unified model called "disentangled successor information (DSI)" that explains neural representations for physical space and linguistic concepts. DSI generates grid-like representations in a 2-dimensional space that highly resemble those observed in the brain. Moreover, the same model creates concept-specific representations from linguistic inputs, corresponding to concept cells. Mathematically, DSI vectors approximate value functions for navigation and word vectors obtained by word embedding methods, thus enabling both spatial navigation and conceptual inference based on vector-based calculation. Our results suggest that representations for space and concepts can emerge from a shared mechanism in the human brain.

1. INTRODUCTION

In the brain, grid cells in the entorhinal cortex (EC) represent space with grid-like representations (Hafting et al., 2005; Doeller et al., 2010; Jacobs et al., 2013). This neural representation is often related to vector-based spatial navigation because grid cells provide a global metric over space. Theoretically, an animal can estimate the direction to a goal when representations of a current position and a goal position are given (Fiete et al., 2008; Bush et al., 2015). Furthermore, self-position can be estimated by integrating self-motions when sensory information is not available (McNaughton et al., 2006). These functions are the basis of robust spatial navigation by animals.

There are not only spatial but also conceptual representations in EC. Neurons called "concept cells" have been found in the human medial temporal lobe including EC (Quiroga, 2012; Reber et al., 2019). Concept cells respond to specific concepts, namely, stimuli related to a specific person, a famous place, or a specific category like "foods" and "clothes". Furthermore, recent experiments also suggest that grid-like representations appear not only for physical space but also for conceptual space if there is a 2-dimensional structure (e.g. lengths of a neck and legs, intensity of two odors), and those representations are the basis of vector-based conceptual inference (Bao et al., 2019; Constantinescu et al., 2016; Park et al., 2021). Thus, it is expected that there is a shared processing mechanism for physical and conceptual spaces in EC. The existence of a shared neural mechanism may also explain why humans use a sense of physical space (such as directionality) to communicate abstract concepts (conceptual metaphor (Lakoff & Johnson, 1980)). However, a principle behind such universal computation in the brain is still unclear.

In this paper, we propose a representation model which we call the disentangled successor information (DSI) model. DSI is an extension of successor representation (SR), which stems from a theory of reinforcement learning and has become one of the promising computational models of the hippocampus and EC (Dayan, 1993; Stachenfeld et al., 2017; Momennejad et al., 2017; Momennejad, 2020). Like eigenvectors of SR, DSI forms grid-like codes in a 2-D space, and those representations support vector-based spatial navigation because DSI approximates value functions for navigation in the framework of linear reinforcement learning (Todorov, 2006; 2009; Piray & Daw, 2021). Remarkably, when we apply DSI to text data by regarding a sequence of words as a sequence of states, DSI forms concept-specific representations like concept cells. Furthermore, we show a mathematical correspondence between DSI and word embedding models in natural language processing (NLP) (Mikolov et al., 2013a; b; Pennington et al., 2014; Levy & Goldberg, 2014), so that we can perform intuitive vector-based conceptual inference as in those models. Our model reveals a new theoretical relationship between spatial and linguistic representation learning, and suggests a hypothesis that there is a shared computational principle behind grid-like and concept-specific representations in the hippocampal system.

2. CONTRIBUTIONS AND RELATED WORKS

We summarize contributions of this work as follows. (1) We extended SR to successor information (SI), by which we theoretically connected reinforcement learning and word embedding, and thus spatial navigation and conceptual inference. (2) We found that dimension reduction with constraints for grid-like representations (decorrelative NMF) generates disentangled word vectors with concept-specific units, which has not been reported previously. (3) Combining these results, we demonstrated that a computational model for grid cells can be extended to represent and compute linguistic concepts in an intuitive and biologically plausible manner, which has not been shown in previous studies.

Our model is an extension of successor representation (SR), which has recently been viewed as a plausible model of the hippocampus and EC (Dayan, 1993; Stachenfeld et al., 2017; Momennejad et al., 2017; Momennejad, 2020). Furthermore, default representation (DR), which is based on linear reinforcement learning theory, has also been proposed as a model of EC (Piray & Daw, 2021). We show that our model can extract linguistic concepts, which has not been shown for SR and DR. Furthermore, we demonstrate vector-based compositionality of words in our model, which expands the range of compositionality of EC representations (Piray & Daw, 2021) to semantic processing.

Our model produces biologically plausible grid-like representations in 2-D space, which support spatial navigation. Previous studies have revealed that non-negative and orthogonal constraints are important to obtain realistic grid-like representations (Dordek et al., 2016; Sorscher et al., 2019). Furthermore, recurrent neural networks form grid-like representations through learning path integration, and those representations support efficient spatial navigation (Banino et al., 2018; Cueva & Wei, 2018; Gao et al., 2019). Some of those models have reproduced experimentally observed scaling ratios between grid cell modules (Banino et al., 2018; Sorscher et al., 2019). However, previous models have not been applied to learning of linguistic concepts, or other complex conceptual spaces in real-world data. Whittington et al. (2020) proposed a unified model for spatial and nonspatial cognition. However, their model was applied only to simple graph structures, and conceptual specificity like that of our model was not observed.

Analogical inference by our model is the same function as in word embedding methods in NLP (Mikolov et al., 2013a; b; Pennington et al., 2014; Levy & Goldberg, 2014). However, a unique feature of DSI representations is that each dimension of the vectors corresponds to a specific concept, like concept cells in the human brain (Quiroga, 2012; Reber et al., 2019). Our model provides a biologically plausible interpretation of word embedding: each word is represented by a combination of disentangled conceptual units, inference is recombination of those concepts, and such representations emerge through the same constraints as for grid cells. It was recently shown that transformer-based models (Vaswani et al., 2017; Brown et al., 2020), which are currently state-of-the-art models in NLP, generate grid-like representations when applied to spatial learning (Whittington et al., 2022). Like our model, this finding implies a relationship between spatial and linguistic processing in the brain. However, concept-specific representations have not been found in such models. Furthermore, the clear theoretical interpretation in this study depends on the analytical solution for skip-gram (Levy & Goldberg, 2014). Such an analytical solution is currently unknown for transformer-based models.

3.1. DISENTANGLED SUCCESSOR INFORMATION

Let us assume $N_s$ discrete states exist in the environment. The successor representation (SR) between two states $s$ and $s'$ is defined as

$$\mathrm{SR}(s, s') = E\left[\sum_{t=0}^{\infty} \gamma^t \delta(s_t, s') \,\middle|\, s_0 = s\right] = \sum_{t=0}^{\infty} \gamma^t P(s_t = s' \mid s_0 = s), \tag{1}$$

where $\delta(i, j)$ is Kronecker's delta and $\gamma$ is a discount factor. We describe how we calculate SR in this study in Appendix A.1. SR and its dimension-reduced representations have been viewed as models of the hippocampus and entorhinal cortex, respectively (Stachenfeld et al., 2017).

Based on SR, we define successor information (SI) and positive successor information (PSI) as

$$\mathrm{SI}(s, s') = \log(\mathrm{SR}(s, s')) - \log(P(s')), \tag{2}$$

$$\mathrm{PSI}(s, s') = \max\{\mathrm{SI}(s, s'), 0\}. \tag{3}$$

In this study, we regard this quantity as a hippocampal model instead of SR (Figure 1A).

Next, we introduce a novel dimension reduction method which we call decorrelative non-negative matrix factorization (decorrelative NMF). Decorrelative NMF can be regarded as a variant of NMF (Lee & Seung, 1999) with additional constraints of decorrelation. By applying decorrelative NMF to PSI, we obtain representation vectors called disentangled successor information (DSI), which we regard as a model of EC (Figure 1A). In decorrelative NMF, we obtain $D$-dimensional vectors $x(s)$ and $w(s)$ ($D < N_s$) by minimizing the following objective function

$$J = \frac{1}{2} \sum_{s, s'} \rho(s, s') \left(\mathrm{PSI}(s, s') - x(s) \cdot w(s')\right)^2 + \frac{1}{2} \beta_{cor} \sum_{i \neq j} \left(\mathrm{Corr}(i, j)\right)^2 + \frac{1}{2} \beta_{reg} \sum_{s} \left(\|x(s)\|^2 + \|w(s)\|^2\right), \tag{4}$$

subject to non-negative constraints $\forall i,\ x_i(s) \geq 0,\ w_i(s) \geq 0$. $\rho(s, s')$ is a weight for the square error

$$\rho(s, s') = \frac{1}{N_s V} \left(\frac{1}{M} \mathrm{PSI}(s, s') + \rho_{min}\right), \tag{5}$$

where $M$ and $V$ are the mean and variance of PSI, respectively, and $\rho_{min}$ is a small value to avoid zero weight. $\mathrm{Corr}(i, j)$ is the correlation between two dimensions in $x(s)$

$$\mathrm{Corr}(i, j) = \frac{\sum_{s} \tilde{x}_i(s) \tilde{x}_j(s)}{\sqrt{\sum_{s} (\tilde{x}_i(s))^2 \sum_{s} (\tilde{x}_j(s))^2}}, \tag{6}$$

where $\tilde{x}_i(s) = x_i(s) - \frac{1}{N_s} \sum_{s} x_i(s)$.

The first term of the objective function minimizes the weighted approximation error, the second term decorrelates the dimensions, and the third term regularizes the representation vectors. Optimization was performed by Nesterov's accelerated gradient descent method (Nesterov, 1983) with rectification of $x_i(s)$ and $w_i(s)$ at every iteration. We describe additional details in Appendix A.2.
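As an illustration of the definitions in this section, the following is a minimal NumPy sketch that computes PSI from an SR matrix and evaluates the decorrelative NMF objective for given vectors. It is not the authors' implementation: the function names, the small constant `eps` that guards against log(0), and the reading of the error weight as (PSI/M + ρ_min)/(N_s V) are assumptions made for this example, and the optimizer itself (Nesterov's method with rectification) is omitted.

```python
import numpy as np

def positive_successor_information(sr, p, eps=1e-12):
    """PSI(s, s') = max{log SR(s, s') - log P(s'), 0}.

    `eps` is an assumption of this sketch (it avoids log(0)).
    """
    si = np.log(sr + eps) - np.log(p + eps)[None, :]
    return np.maximum(si, 0.0)

def decorrelative_nmf_objective(psi, x, w, beta_cor=1.0, beta_reg=0.001,
                                rho_min=0.001):
    """Evaluate the objective J for state vectors x and goal vectors w.

    psi: (N_s, N_s) matrix; x, w: (N_s, D) non-negative matrices.
    Assumes every column of x has non-zero variance.
    """
    n_s = psi.shape[0]
    # Error weight, read here as (PSI / M + rho_min) / (N_s * V).
    rho = (psi / psi.mean() + rho_min) / (n_s * psi.var())
    fit = 0.5 * np.sum(rho * (psi - x @ w.T) ** 2)

    # Correlation between dimensions (columns) of x.
    xc = x - x.mean(axis=0, keepdims=True)
    norm = np.sqrt(np.sum(xc ** 2, axis=0))
    corr = (xc.T @ xc) / np.outer(norm, norm)
    off_diag = corr - np.diag(np.diag(corr))
    decor = 0.5 * beta_cor * np.sum(off_diag ** 2)

    reg = 0.5 * beta_reg * (np.sum(x ** 2) + np.sum(w ** 2))
    return fit + decor + reg
```

With a perfect, decorrelated factorization and `beta_reg=0`, the objective evaluates to zero; two perfectly correlated dimensions contribute a penalty of `beta_cor` through the two off-diagonal entries.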

3.2. RELATIONSHIPS WITH REINFORCEMENT LEARNING AND WORD EMBEDDING

We show a dual interpretation of our model. On the one hand, DSI approximates value estimation of linear reinforcement learning, and thus supports goal-directed decision making and navigation. On the other hand, the same representation approximates word embedding in NLP, and thus supports semantic computation.

First, our model approximates value functions of linear reinforcement learning (Todorov, 2006; 2009; Piray & Daw, 2021) in the setting of spatial navigation. Linear reinforcement learning assumes a default policy and imposes an additional penalty on deviation from the default policy, so that value functions can be obtained explicitly by solving linear equations. Let us consider a specific condition in which the environment consists of non-terminal states, and a virtual terminal state is attached to a goal state $s_G$ arbitrarily chosen from the non-terminal states (Figure 1B). When the agent gets to the goal, it transits to the terminal state with a probability $p_{NT}$. Furthermore, we assume that rewards at non-terminal states are uniformly negative and the reward at the terminal state is positive, so that the agent has to take a short path to the goal to maximize reward. In this setting, we can obtain value functions $v^*(s)$ in linear reinforcement learning as

$$\lambda^{-1} v^*(s) = \log(\mathrm{SR}_d(s, s_G)) - \log P_d(s_G) = \mathrm{SI}_d(s, s_G) \approx x(s) \cdot w(s_G),$$

where $\mathrm{SR}_d(s, s_G)$ and $\mathrm{SI}_d(s, s_G)$ are SR and SI under the default policy, respectively. We describe details of the derivation in Appendix A.3. Therefore, SI is proportional to value functions for spatial navigation, and inner products of DSI vectors approximate value functions. Based on this interpretation, we basically regard $x(s)$ as a representation of each state, and $w(s)$ as a representation of a temporary goal.

Second, DSI is related to word embedding methods in NLP (Mikolov et al., 2013a; b; Pennington et al., 2014; Levy & Goldberg, 2014). In linguistics, pointwise mutual information (PMI) and positive pointwise mutual information (PPMI) are used to measure the degree of coincidence between two words (Levy & Goldberg, 2014). They are defined as

$$\mathrm{PMI} = \log \frac{P(\mathrm{word}_i, \mathrm{word}_j)}{P(\mathrm{word}_i) P(\mathrm{word}_j)},$$

$$\mathrm{PPMI} = \max\{\mathrm{PMI}, 0\},$$

where $P(\mathrm{word}_i, \mathrm{word}_j)$ is a coincidence probability of two words (in a certain temporal window). It has been proven that dimension reduction of PMI approximates the word embedding method skip-gram (Mikolov et al., 2013a; b), and similar performance is obtained using PPMI (Levy & Goldberg, 2014). GloVe (Pennington et al., 2014) is also based on this perspective. SI can be written as

$$\mathrm{SI}(s, s') = \log(\mathrm{SR}(s, s')) - \log(P(s')) = \log \frac{\sum_{t=0}^{\infty} \gamma^t P(s_t = s', s_0 = s)}{P(s) P(s')}.$$

In this formulation, we can see the mathematical similarity between PMI and SI by regarding words as states ($s = \mathrm{word}_i$, $s' = \mathrm{word}_j$), and thus the correspondence between PPMI and PSI. Because of this relationship, we can expect that DSI, which is obtained through dimension reduction of PSI, has similar properties to word embedding methods. The difference lies in how coincidence is counted: the coincidence in SI is evaluated with an asymmetric exponential kernel as in SR, whereas a symmetric rectangular temporal window is often used in typical word embedding (see Appendix A.4 for further detail).

3.3. DECORRELATIVE NMF RELATES TO GRID CELLS AND DISENTANGLEMENT

Constraints in decorrelative NMF (non-negativity, decorrelation (or orthogonality), and regularization) are important for generation of grid cells, as shown in previous theoretical studies on grid cells (Dordek et al., 2016; Cueva & Wei, 2018; Banino et al., 2018; Gao et al., 2019; Sorscher et al., 2019). They are also biologically plausible because neural activity is basically non-negative and decorrelation is possible through lateral inhibition. On the other hand, non-negativity (Oja & Plumbley, 2004) and decorrelation (Hyvärinen & Oja, 2000) are also important for extraction of independent components, and it is known that imposing independence in the latent space of deep generative models results in the emergence of disentangled representations for visual features (Higgins et al., 2017). Therefore, in word embedding, we expected that those constraints help the emergence of independent and disentangled units for linguistic concepts. Constraints in decorrelative NMF are in fact crucial for the results obtained in this study (Appendix A.5). As disentangled visual representations explain single-cell activities in higher-order visual cortex (Higgins et al., 2021), we may similarly interpret conceptual representations in our model as concept cells in the human medial temporal lobe (Quiroga, 2012). Previous studies suggest that each concept cell responds to a specific concept, whereas population-level activity patterns represent abstract semantic structures (Reber et al., 2019). Such a property is consistent with the factorized and distributed nature of disentangled representation vectors.

4. LEARNING REPRESENTATIONS OF PHYSICAL SPACES

In this section, we empirically show that DSI model forms biologically plausible grid-like representations in a 2-D physical space, and they support spatial navigation. These results also apply to conceptual spaces with the 2-D structure, depending on the definition of states.

4.1. LEARNING PROCEDURE

As an environment, we assumed a square room tiled with 30×30 discrete states (Figure 2A ). In each simulation trial, an agent starts at one of those 900 states and transits to one of eight surrounding states each time except that transitions are limited at states along the boundary (the structure was not a torus). Transitions to all directions occur with an equal probability. We performed 500 simulation trials and obtained a sequence of 100,000 time steps in each trial. We calculated occurrence probabilities (P (s)) and a successor representation matrix (SR(s, s ′ )) of 900 states from those sequences, and calculated PSI and DSI (100-dimensional) as described in the Model section. The discount factor γ was set to 0.99. We additionally tested spatial navigation in a structure with separated and interconnected rooms (see Figure 3C ). In that case, we used the discount factor γ = 0.999.
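The environment described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function names are hypothetical, states are indexed row by row, and the uniform choice among the available neighbours at boundary states is an assumption of this sketch.

```python
import numpy as np

def grid_neighbors(size=30):
    """List the (up to eight) neighbouring states of each cell in a size x size room."""
    neighbors = []
    for r in range(size):
        for c in range(size):
            cell = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Skip the cell itself and anything outside the walls (no torus).
                    if (dr, dc) != (0, 0) and 0 <= rr < size and 0 <= cc < size:
                        cell.append(rr * size + cc)
            neighbors.append(cell)
    return neighbors

def random_walk(neighbors, n_steps, rng):
    """Sample a random walk of n_steps states, starting at a random state."""
    s = int(rng.integers(len(neighbors)))
    states = [s]
    for _ in range(n_steps - 1):
        # Assumption: all available directions are equally likely at every state.
        s = int(rng.choice(neighbors[s]))
        states.append(s)
    return states
```

A call such as `random_walk(grid_neighbors(30), 100_000, np.random.default_rng(0))` would produce one state sequence of the length used per trial in this section.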

4.2. EMERGENCE OF GRID-LIKE REPRESENTATIONS

Here we call each dimension of DSI representation vectors x(s) as a neural "unit", and we regard a value in each dimension at each state as a neural activity (or a neural representation). As shown in Figure 2B , many units exhibited grid-like activity patterns in the space. We performed a gridness analysis that has been used in animal experiments (Sargolini et al., 2006) and found that 51% of units were classified as grid cells. Similarly, 53% of units in w(s) were classified as grid cells. Furthermore, we checked whether DSI representations in the physical space reproduce a property of biological grid cells. Actual grid cells in the rat brain exhibit multiple discrete spatial scales and the ratio between grid scales of adjacent modules is √ 2 (Stensola et al., 2012) . We constructed a distribution of grid scales of DSI units by kernel density estimation, which revealed that multiple peaks of grid scales existed and the ratio between grid scales of adjacent peaks was √ 2 (Figure 2C ). These results show that DSI model constructs biologically plausible grid-like representations in the 2-D physical space. We describe details of analysis methods in Appendix A.6.

4.3. NEAR-OPTIMAL SPATIAL NAVIGATION BY DSI VECTORS

As discussed in Section 3.2, the inner product of DSI representations approximates value functions for spatial navigation. Therefore, we tested whether DSI representations actually enable near-optimal navigation in the space. We assume that a start location (state $s_{init}$) and a goal location (state $s_G$) are randomly given in each trial such that the shortest path length is at least 10, and an agent has to navigate between them. To solve the task, we define a vector-based state transition rule. Suppose that the agent is at a state $s$, and the set of neighboring states of $s$ is $A(s)$. Given the goal representation vector $w(s_G)$, the value function of a neighboring state $s_{next} \in A(s)$ is estimated by $x(s_{next}) \cdot w(s_G)$, and the agent transits to the state that has the maximum value. This state transition rule can be geometrically interpreted as the choice of the movement that has the closest angle with the goal vector in the representation space (Figure 3A). Alternatively, we can interpret it as the agent estimating value functions by linear readout from grid-like DSI representations.

Because of the approximation error, this rule did not always give optimal navigation (the shortest path from the start to the goal). However, the agent took the shortest path to the goal in 93.9% of the 1,000 trials we tested (an example is shown in Figure 3B). Furthermore, 97.2% of trials were near-optimal navigation in which the actual path length was shorter than 1.1 times the shortest path length. The same framework also worked in a relatively complex environment with separated rooms (Figure 3C). In this environment, the ratios of optimal and near-optimal navigation were 68% and 82.6%, respectively. We also confirmed that we can perform path integration based on DSI representations using movement-conditional recurrent weights (McNaughton et al., 2006; Burak & Fiete, 2009; Oh et al., 2015; Gao et al., 2019) (Appendix A.7). These results show that DSI representations can support spatial navigation, which corresponds to the contribution of biological grid cells to spatial navigation.
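The vector-based transition rule of this section can be sketched in a few lines. The function name `navigate`, the `neighbors` list-of-lists layout, and the `max_steps` cut-off are hypothetical choices for this example; the source does not state how failed trials were terminated.

```python
import numpy as np

def navigate(x, w, neighbors, start, goal, max_steps=1000):
    """Greedy navigation: move to the neighbour with the largest x(s_next) . w(s_G).

    x, w: (N_s, D) arrays of state and goal vectors.
    neighbors: neighbors[s] is the list of states reachable from s.
    max_steps is an assumption of this sketch (it stops endless loops).
    """
    path = [start]
    s = start
    while s != goal and len(path) <= max_steps:
        candidates = neighbors[s]
        values = x[candidates] @ w[goal]   # estimated values of the neighbours
        s = candidates[int(np.argmax(values))]
        path.append(s)
    return path
```

Comparing `len(path) - 1` with the shortest path length between `start` and `goal` gives the optimal and near-optimal criteria used above.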

4.4. VECTOR-BASED INFERENCE OF SPATIAL CONTEXTS

We additionally found that we can perform vector-based inference for spatial navigation in a novel context. First, we constructed DSI representation vectors in spatial contexts A and B, each of which has a barrier (Figure 4A). Then, we created representation vectors for a novel context A+B with two barriers by simply adding the representation vectors for the familiar contexts A and B (Figure 4A). We tested vector-based navigation (described in Section 4.3) in the three spatial contexts A, B, and A+B, using one of the three representations for A, B, and A+B. Naturally, representation vectors for A and B gave the best performance in contexts A and B, respectively (Figure 4B). Notably, composite representation vectors for A+B achieved the best performance in the context A+B (Figure 4B). This result suggests that we can utilize vector-based composition of representations for a novel spatial context. We describe details of the simulation in Appendix A.8. Additional analysis by multidimensional scaling (MDS) suggests that summing DSI vectors leads to composition of an appropriate metric space for the novel context (Appendix A.9). This is potentially useful for composing multiple constraints that change reachability between states in various tasks (such as control of robotic arms and playing computer games), like composition of tasks in soft-Q learning (Haarnoja et al., 2018; Makino).

5. LEARNING REPRESENTATIONS OF CONCEPTUAL SPACES

In this section, we show that the same DSI model can learn representations for a complex conceptual space from linguistic inputs, and those representations support vector-based conceptual inference.

5.1. LEARNING PROCEDURE

We used text data taken from English Wikipedia, which contains 124M tokens and 9376 words (see Appendix A.10 for the detail of preprocessing). To construct DSI representations, we regarded each word as a "state", and considered the text data as a sequence of 9376 states (N s = 9376). Then, we applied the exactly same learning procedure as in the experiment of physical spaces. We obtained 300-dimensional DSI representation vectors for each word. The discount factor γ was set to 0.9. The setting of other parameters was the same as the experiment of physical spaces.

5.2. EMERGENCE OF CONCEPT-SPECIFIC REPRESENTATIONS

As in the previous section, we regard each dimension of representation vectors as a neural unit, and checked how various words activate those units. Specifically, we listed the ten words that elicited the highest activities in each unit (TOP-10 words). Consequently, we found that many units are activated by words related to specific concepts (Figure 5; other examples in Appendix A.12), which could be named "game cell" or "president cell", for example. We quantified this conceptual specificity through WordNet-based semantic similarity between words (Princeton University, 2010). We compared the mean similarity among TOP-10 words with a null distribution of similarity between random word pairs, by which we determined statistically significant concept-specific units and quantified the degree of conceptual specificity of each unit (see Appendix A.11 for details). DSI exhibited a larger number of significantly concept-specific units and higher average conceptual specificity than other well-established word embedding methods such as skip-gram and GloVe (Table 1) (Mikolov et al., 2013a; b; Pennington et al., 2014; Levy & Goldberg, 2014). We also analyzed the conceptual specificity of representations in the embedding layer of a pretrained BERT model (bert-base-uncased in Hugging Face transformers) (Devlin et al., 2018; Wolf et al., 2020), which was lower than that of DSI (Table 1). This result shows that our DSI model forms more concept-specific representations than other models. Additional analyses revealed that word representation vectors are non-sparse and distributed (Appendix A.15). Therefore, each word is represented by the combination of concept-specific units shared by several related words. For example, "France" can be represented by the combination of units which we could name French cell and country cell (Appendix A.15).
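Listing the TOP-10 words of a unit amounts to sorting one column of the representation matrix. A minimal sketch, assuming a `(N_s, D)` array `vectors` whose rows follow the order of a word list `vocab` (both names are hypothetical):

```python
import numpy as np

def top_words(vectors, vocab, unit, k=10):
    """Return the k words that elicit the highest activity in one unit (dimension)."""
    order = np.argsort(vectors[:, unit])[::-1][:k]   # indices sorted by descending activity
    return [vocab[i] for i in order]
```

The WordNet-based similarity test that follows this step in the analysis is described in Appendix A.11 and is not reproduced here.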

5.3. VECTOR-BASED COMPUTATION IN THE CONCEPTUAL SPACE

Given that DSI and word embedding methods are mathematically similar (Appendix A.4), we expect that DSI vectors have similar properties to representation vectors learned by those word embedding methods. We evaluated the performance of DSI vectors in two tasks that have been used to evaluate word embedding methods: word similarity and analogical inference (Mikolov et al., 2013a; b; Pennington et al., 2014; Levy & Goldberg, 2014). In the word similarity task, we calculated the cosine similarity between representation vectors of word pairs, and evaluated the rank correlation between those cosine similarities and human word similarities (WS353 dataset (Agirre et al., 2009); 248/345 word pairs were used). In the analogical inference task, we performed calculation of vectors such as x(king) - x(man) + x(woman) and checked whether the resultant vector has the maximum cosine similarity with x(queen) (Mikolov's dataset (Mikolov et al., 2013a; b); 3157/19544 questions were used; examples in Appendix A.13). The result shows that DSI vectors achieved performance comparable to that of other well-established word embedding methods (Table 2).

This result indicates that similarity of DSI representation vectors corresponds to semantic similarity. This property is consistent with the experimental observation that population-level pattern similarity of concept cell activities represents semantic categories (Reber et al., 2019). By visualizing the structure of DSI representations by MDS, we can see clustering of words corresponding to the 10 semantic categories used in Reber et al. (2019) (Appendix A.14). Furthermore, conceptual inference is possible through arithmetic composition of DSI vectors. We additionally found that this inference is an intuitive recombination of concept-specific units in some cases. For example, the transformation from "Paris" to "France" corresponds to activation of the country cell and deactivation of the capital cell, which is achieved by adding the difference between the Germany and Berlin vectors (Appendix A.15).
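The analogical inference test can be sketched as below. The function name and argument layout are hypothetical, and excluding the three query words from the candidates is an assumption of this sketch (a common convention in analogy evaluation that the text above does not state).

```python
import numpy as np

def analogy(vectors, vocab, a, b, c):
    """Return the word whose vector has maximum cosine similarity to x(a) - x(b) + x(c).

    vectors: (N_s, D) array with non-zero rows; vocab: list of N_s words.
    """
    index = {word: i for i, word in enumerate(vocab)}
    query = vectors[index[a]] - vectors[index[b]] + vectors[index[c]]
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ (query / np.linalg.norm(query))
    for word in (a, b, c):
        sims[index[word]] = -np.inf   # assumption: query words are not valid answers
    return vocab[int(np.argmax(sims))]
```

For instance, `analogy(vectors, vocab, "king", "man", "woman")` corresponds to the x(king) - x(man) + x(woman) example, and `analogy(vectors, vocab, "Paris", "Berlin", "Germany")` to the Paris-to-France transformation.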

6. DISCUSSION

In this paper, we proposed a theoretically interpretable and biologically plausible neural representation model for physical and conceptual spaces. We demonstrated that our DSI model forms grid-like representations in the physical space and concept-specific representations in the linguistic space, which are assumed to correspond to neural representations in EC. Furthermore, we showed that SI is mathematically related to linear reinforcement learning and word embedding methods, so that DSI representations support spatial navigation and conceptual inference. These results suggest that we can extend the spatial representation model of EC to learn and compute linguistic concepts, which seemingly belong to a different computational domain from physical space.

In Section 5.2, we demonstrated concept-specific representations created from text data. To the best of our knowledge, such a property has not been reported for any word embedding method. However, we unexpectedly found that continuous-bag-of-words (CBOW) showed relatively high conceptual specificity. Although DSI is related to PMI, skip-gram and GloVe, we have not found a relationship to CBOW. Further clarification of the necessary conditions for conceptual specificity remains an open problem.

Although DSI has a clear mathematical interpretation, how biological neural networks can learn DSI is still unclear. A possible solution is an extension of the skip-gram neural network with SR, non-negativity, and decorrelation. Because SI corresponds to PMI, which is the optimum of the skip-gram neural network, we can expect that an extended skip-gram network learns DSI. Building such a model and relating it to the circuit mechanism in the hippocampus and EC are left for future research.

Our model relates word embedding to conceptual representations in the brain. A previous study showed that skip-gram representations support high-performance decoding of semantic information from fMRI data (Nishida & Nishimoto, 2018). Another study revealed that hippocampal theta oscillation codes semantic distances between words measured in a word2vec subspace (Solomon et al., 2019). These experimental results support our hypothesis. However, recent studies have shown that representations in transformer-based models (Vaswani et al., 2017) such as GPT (Brown et al., 2020) achieve remarkable performance in linear fitting to neural recordings during linguistic processing (Goldstein et al., 2022; Schrimpf et al., 2021). A major difference between our DSI model and transformer-based models is that DSI representations are basically fixed (static embedding) whereas transformer-based models flexibly create context-dependent representations (dynamic embedding). Conceptual interpretation obviously depends on the context, and thus activities of concept cells are context-dependent (Bausch et al., 2021). Therefore, our DSI model should be extended to process context-dependence, possibly in combination with other models for learning context-dependent latent cognitive states (Uria et al., 2020; George et al., 2021; Whittington et al., 2020).

Another direction of future research is application to general conceptual spaces by learning DSI representations from low-level sensory inputs, like spatial learning from visual and auditory inputs in previous models (Banino et al., 2018; Taniguchi et al., 2018; Uria et al., 2020). This may be possible by learning discrete states through unsupervised clustering for deep networks (Caron et al., 2018). As for the human brain, infants probably form primitive spatial and conceptual representations from sensory signals, and later linguistic inputs enrich those representations. We speculate that real-world sensory data also contain information about the conceptual space, and that DSI can be extended to learn those structures. Such a model would clarify the role of the hippocampal system in computation of general conceptual spaces.

Mariama Drame, Quentin Lhoest, and Alexander M Rush. Transformers: State-of-the-art natural language processing. pp. 38-45, 2020. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.

A APPENDIX

A.1 CALCULATION OF SR

SR is a variant of value functions in reinforcement learning, and thus we can use various methods such as temporal-difference (TD) learning for its construction. Throughout this study, we used a direct count method because we performed only offline processing of finite data. In a sequence of states $\{s_1, \ldots, s_t, \ldots, s_T\}$, we recursively calculated exponential traces of past states $z(s, t) = \sum_{\tau=0}^{t-1} \gamma^{\tau} \delta(s_{t-\tau}, s)$ as

$$z(s, t) = \gamma z(s, t-1) + \delta(s_t, s),$$

and calculated SR from state counts and coincidence counts as

$$\mathrm{SR}(s, s') = \frac{\sum_{t=1}^{T} z(s, t) \delta(s_t, s')}{\sum_{t=1}^{T} \delta(s_t, s)}.$$

A.2 DETAILS OF DECORRELATIVE NMF

In decorrelative NMF, we iteratively updated vectors $x(s)$ and $w(s)$ by Nesterov's accelerated gradient descent method to minimize the objective function (Eq. 4), rectifying all elements at every iteration. Gradients are

$$\frac{\partial J}{\partial x_k(s)} = -\sum_{s'} \rho(s, s') \left(\mathrm{PSI}(s, s') - x(s) \cdot w(s')\right) w_k(s') + \beta_{cor} \sum_{j \neq k} \frac{\mathrm{Corr}(k, j)\, \tilde{x}_j(s)}{\sqrt{\sum_{s} (\tilde{x}_k(s))^2 \sum_{s} (\tilde{x}_j(s))^2}} + \beta_{reg} x_k(s),$$

$$\frac{\partial J}{\partial w_k(s')} = -\sum_{s} \rho(s, s') \left(\mathrm{PSI}(s, s') - x(s) \cdot w(s')\right) x_k(s) + \beta_{reg} w_k(s').$$

We note that we regarded the mean and variance of $x_i(s)$ in the correlation ($\frac{1}{N_s} \sum_{s} x_i(s)$ and $\sum_{s} (\tilde{x}_i(s))^2$ in Eq. 6) as constants in the calculation of these gradients. Practically, this heuristic did not affect the performance of decorrelation. Throughout this paper, the learning rate was 0.05 and the number of iterations was 10,000. Parameters were $\beta_{cor} = 1$, $\beta_{reg} = 0.001$, and $\rho_{min} = 0.001$.
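The direct count method of A.1 can be sketched as a single pass over a state sequence. This is an illustrative reading of the recursion, not the authors' code: the function name is hypothetical, states are assumed to be integer indices, and returning the occurrence probabilities alongside SR is a convenience added for this example.

```python
import numpy as np

def successor_representation(states, n_states, gamma):
    """Estimate SR and occurrence probabilities from one sequence of state indices."""
    z = np.zeros(n_states)                        # exponential traces z(s, t)
    coincidence = np.zeros((n_states, n_states))  # sum_t z(s, t) * delta(s_t, s')
    counts = np.zeros(n_states)                   # sum_t delta(s_t, s)
    for s_t in states:
        z *= gamma                                # z(s, t) = gamma * z(s, t-1) ...
        z[s_t] += 1.0                             # ... + delta(s_t, s)
        coincidence[:, s_t] += z
        counts[s_t] += 1.0
    # Rows of never-visited states stay zero (their coincidence counts are zero).
    sr = coincidence / np.maximum(counts, 1.0)[:, None]
    return sr, counts / counts.sum()
```

For the sequence `[0, 1, 0, 1]` with `gamma = 0.5`, the traces give `SR(0, 0) = (1 + 1.25) / 2 = 1.125` and `SR(1, 0) = 0.5 / 2 = 0.25`.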

A.3 MATHEMATICAL RELATIONSHIP OF DSI AND REINFORCEMENT LEARNING

In this section, we show that our model approximates value functions of linear reinforcement learning (Todorov, 2006; 2009; Piray & Daw, 2021) in the setting of spatial navigation. In linear reinforcement learning, an agent aims to maximize "gain" instead of reward. Assuming a default policy $\pi_d(s)$ (any policy may be used; typically a random walk in an exploration task), the gain function is defined as

$$g(s) = r(s) - \lambda\, KL(\pi(s) \,|\, \pi_d(s)),$$

where $r(s)$ is the expected reward at the state $s$ and $\lambda\, KL(\pi(s) \,|\, \pi_d(s))$ is the cost imposed on the difference between the current policy $\pi(s)$ and the default policy $\pi_d(s)$ ($\lambda$ is the relative weight of the cost). Previous works have shown that the optimal policy and the corresponding value functions can then be determined explicitly by solving linear equations (Todorov, 2006; 2009; Piray & Daw, 2021).

Here we consider an environment that consists of $N_N$ non-terminal states and $N_T$ terminal states. We define two transition probability matrices under the default policy: $P_{NT}$ is an $N_N \times N_T$ matrix for transitions from non-terminal states to terminal states, and $P_{NN}$ is an $N_N \times N_N$ matrix for transitions across non-terminal states. Furthermore, $r_N$ and $r_T$ are vectors of rewards at non-terminal states and terminal states, respectively. In this condition, a vector of value functions under the optimal policy $v^* = (v^*(s_1), \dots, v^*(s_{N_N}))$ is obtained as

$$\exp(\lambda^{-1} v^*) = M P_{NT} \exp(\lambda^{-1} r_T),$$

where $M = (\mathrm{diag}(\exp(-\lambda^{-1} r_N)) - P_{NN})^{-1}$ is the default representation (DR) (Piray & Daw, 2021).

To relate $v^*$ to SI, we consider a specific condition in which the environment consists of non-terminal states, and a virtual terminal state is attached to a goal state $s_G$ arbitrarily chosen from the non-terminal states (Figure 1B). When the agent gets to the goal, it transits to the terminal state with a probability $p_{NT}$. Furthermore, we assume that rewards at non-terminal states are uniformly negative and the reward at the terminal state is positive, so that the agent has to take a short path to the goal to maximize reward. Specifically, we assume that all elements of $r_N$ are $\lambda \log \gamma$, and $r_T = -\lambda (\log \gamma + \log p_{NT} + \log P_d(s_G))$, where $\gamma$ is an arbitrary value in the range $(0, 1)$ and $P_d(s_G)$ is the probability of visiting the state $s_G$ under the default policy. Then, we obtain

$$\exp(\lambda^{-1} v^*) = \frac{1}{P_d(s_G)} (I - \gamma P_{NN})^{-1} e^{(i_G)},$$

where $e^{(i_G)} = (0, \dots, 0, 1, 0, \dots, 0)^T$ ($i_G$ is the index of the goal state). Because $(I - \gamma P_{NN})^{-1}$ is equivalent to a successor representation matrix with a discount factor $\gamma$ (Dayan, 1993; Stachenfeld et al., 2017), we finally obtain

$$\lambda^{-1} v^*(s) = \log(SR_d(s, s_G)) - \log P_d(s_G) = SI_d(s, s_G) \approx x(s) \cdot w(s_G),$$

where $SR_d(s, s_G)$ and $SI_d(s, s_G)$ are SR and SI under the default policy, respectively. Thus, SI is proportional to value functions for spatial navigation, and inner products of DSI vectors approximate value functions. Based on this interpretation, we regard $x(s)$ as a representation of each state and $w(s)$ as a representation of a temporary goal.
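The step from the general solution to the SR form can be checked by direct substitution. The sketch below assumes, as the setting with a single virtual terminal state implies, that $P_{NT}$ is a column vector whose only nonzero entry is $p_{NT}$ at the goal index, i.e. $P_{NT} = p_{NT}\, e^{(i_G)}$.

```latex
\begin{align*}
M &= \left( \gamma^{-1} I - P_{NN} \right)^{-1}
   = \gamma \left( I - \gamma P_{NN} \right)^{-1}
   && \text{since } \exp(-\lambda^{-1} r_N) = \gamma^{-1} \mathbf{1}, \\
\exp(\lambda^{-1} r_T) &= \frac{1}{\gamma \, p_{NT} \, P_d(s_G)}, \\
\exp(\lambda^{-1} v^*) &= \gamma \left( I - \gamma P_{NN} \right)^{-1}
   \, p_{NT} \, e^{(i_G)} \, \frac{1}{\gamma \, p_{NT} \, P_d(s_G)}
   = \frac{1}{P_d(s_G)} \left( I - \gamma P_{NN} \right)^{-1} e^{(i_G)}.
\end{align*}
```

The factors $\gamma$ and $p_{NT}$ cancel, which is why the particular choice of $r_T$ leaves only the SR column of the goal state divided by $P_d(s_G)$. Taking the logarithm of the component for state $s$ gives $\log SR_d(s, s_G) - \log P_d(s_G)$.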

A.4 MATHEMATICAL RELATIONSHIP OF DSI AND WORD EMBEDDING

In this section, we discuss the relationship of SI and PMI (Levy & Goldberg, 2014) in detail. PMI is

$$PMI = \log \frac{P(word_i, word_j)}{P(word_i) P(word_j)},$$

where $P(word_i, word_j)$ is a coincidence probability of two words (in a certain temporal window). To relate PMI to SI, we regard words as states: $s = word_i$, $s' = word_j$. Furthermore, we consider a specific way to count coincidence probability. In typical word embedding, a finite symmetric rectangular window is often used:

$$P(s, s') = \sum_{t=0}^{W} P(s_t = s', s_0 = s),$$

where $W$ is the window size. Here, we implicitly assumed that the same state (word) is not repeated in the temporal window, to guarantee that $P(s, s')$ is a probability. However, we may calculate coincidence for $P(s, s')$ in an arbitrary way. Here we evaluate coincidence with an infinite asymmetric exponential kernel as in SR:

$$P(s, s') = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s', s_0 = s).$$

We introduced the normalization factor $(1 - \gamma)$ to guarantee that $P(s, s')$ is less than one ($(1 - \gamma) \sum_{t=0}^{\infty} \gamma^t = 1$). Then, PMI becomes

$$PMI = \log \frac{(1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s', s_0 = s)}{P(s) P(s')} = \log(SR(s, s')) - \log(P(s')) + \log(1 - \gamma) \tag{23}$$

$$= SI(s, s') + \log(1 - \gamma). \tag{24}$$

If we perform dimension reduction, $\log(1 - \gamma)$ can be ignored because it is a constant. Therefore, we can interpret SI as a special case of PMI in our model.
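The identity in Eqs. 23 and 24 can be verified numerically on a small Markov chain. The following is a sketch under assumptions introduced here: the transition matrix `P` is a hypothetical 3-state chain, $P(s)$ is taken to be its stationary distribution, and the infinite sum is truncated at 2,000 steps.

```python
import numpy as np

# Hypothetical 3-state Markov chain (rows sum to one); any ergodic chain would do.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
gamma = 0.9

# Stationary distribution p(s): left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
p = np.real(evecs[:, np.argmax(np.real(evals))])
p /= p.sum()

# SR(s, s') = sum_t gamma^t P(s_t = s' | s_0 = s) = (I - gamma P)^{-1}.
SR = np.linalg.inv(np.eye(3) - gamma * P)
SI = np.log(SR) - np.log(p)[None, :]

# Joint probability under the normalised exponential kernel (truncated sum).
joint = np.zeros((3, 3))
Pt = np.eye(3)                      # P^t, starting from t = 0
for t in range(2000):
    joint += (1 - gamma) * gamma**t * p[:, None] * Pt
    Pt = Pt @ P
PMI = np.log(joint / (p[:, None] * p[None, :]))

# PMI and SI differ only by the constant log(1 - gamma).
offset = PMI - SI
```

Every entry of `offset` equals $\log(1 - \gamma)$ up to numerical error, which is why the constant can be dropped before dimension reduction.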

A.5 RELATIONSHIP BETWEEN MODEL COMPONENTS AND REPRESENTATIONS

To clarify the contribution of each model component to the results in this study, we performed a "lesion study" in which we removed some components of DSI and repeated the same evaluation procedure as in the main text. We summarize the results in Table 3. First, we tested representations obtained by singular value decomposition of successor representation (SR-SVD), which was regarded as a model of grid cells in a previous study (Stachenfeld et al., 2017). The DSI model exceeded SR-SVD in all aspects shown in this study. Next, we tested the DSI model without decorrelation ($\beta_{cor} = 0$) and the DSI model without non-negativity (no rectification of representation vectors). Neither modification impaired the performance of navigation and inference, showing the importance of using SI for vector-based computations, as theoretically expected. In contrast, removing decorrelation and non-negativity significantly impaired the emergence of grid-like units and concept-specific units, respectively. Thus, decorrelative NMF is crucial to obtain biologically plausible representations.

In Section 4.2, we performed the gridness analysis following a previous experimental study (Sargolini et al., 2006). For each unit, we rotated the spatial autocorrelation map (Figure 2B, lower) and calculated correlations between the original and rotated maps. Gridness was defined as the difference between the lowest correlation at 60° and 120° and the highest correlation at 30°, 90° and 150°. A unit was classified as a grid cell when its gridness exceeded zero. In Figure 2C, we constructed a distribution of grid scales. Grid scales were determined as the median of distances between the central peak and the six closest peaks (vertices of the inner hexagon) in the spatial autocorrelation map. The kernel function for kernel density estimation was Gaussian with a standard deviation of 1.
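The gridness score described above can be sketched as follows. This is an illustration under assumptions made here, not the analysis code of the study: rotation uses nearest-neighbour sampling, correlations are taken over a central disc, and the two test maps (`hex_map`, `square_map`) are synthetic patterns standing in for autocorrelation maps.

```python
import numpy as np

def rotate_nearest(img, angle_deg):
    """Rotate a square map about its centre (nearest-neighbour sampling)."""
    n = img.shape[0]
    c = (n - 1) / 2.0
    yy, xx = np.mgrid[0:n, 0:n]
    th = np.deg2rad(angle_deg)
    # Inverse mapping: for every output pixel, find the source pixel.
    xs = np.cos(th) * (xx - c) + np.sin(th) * (yy - c) + c
    ys = -np.sin(th) * (xx - c) + np.cos(th) * (yy - c) + c
    xi = np.clip(np.rint(xs).astype(int), 0, n - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, n - 1)
    return img[yi, xi]

def gridness(autocorr):
    """Lowest correlation at 60/120 deg minus highest at 30/90/150 deg."""
    n = autocorr.shape[0]
    c = (n - 1) / 2.0
    yy, xx = np.mgrid[0:n, 0:n]
    mask = (xx - c) ** 2 + (yy - c) ** 2 <= c ** 2   # disc keeps rotated pixels in bounds

    def corr(angle):
        rotated = rotate_nearest(autocorr, angle)
        return np.corrcoef(autocorr[mask], rotated[mask])[0, 1]

    return min(corr(60), corr(120)) - max(corr(30), corr(90), corr(150))

# Synthetic maps: three plane waves 60 deg apart (hexagonal) and two 90 deg apart (square).
n = 61
yy, xx = np.mgrid[0:n, 0:n] - (n - 1) / 2.0
k = 2 * np.pi / 10.0
hex_map = sum(np.cos(k * (xx * np.cos(a) + yy * np.sin(a))) for a in np.deg2rad([0, 60, 120]))
square_map = np.cos(k * xx) + np.cos(k * yy)
```

Under the criterion in the text, `hex_map` would be classified as grid-like (positive gridness), whereas `square_map` would not, because its 90° rotation correlates almost perfectly with the original.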

A.7 PATH INTEGRATION BY DSI

We performed path integration based on DSI representations using movement-conditional recurrent weights. This strategy has been used in previous studies such as grid cell modeling (Gao et al., 2019) and action-conditional video prediction (Oh et al., 2015). This mechanism is also consistent with a conventional biological model for path integration in which head direction signals activate one of the attractor networks specialized for different directional shifts of grid patterns (McNaughton et al., 2006; Burak & Fiete, 2009). We estimated the next representation vector $\hat{x}_{t+1}$ by a linear transformation of the current representation vector $x(s_t)$ as

$$\hat{x}_{t+1} = M(a_t)\, x(s_t),$$

where $a_t$ represents a movement (one of eight directional movements in this study) and $M(a_t)$ is a movement-conditional recurrent weight matrix. Here, $x(s_t)$ was a DSI representation vector, and we optimized the matrix $M(a_t)$ by minimizing the prediction error $\| x(s_{t+1}) - M(a_t)\, x(s_t) \|_2^2$ by stochastic gradient descent during a random walk on the state transition graph (20 simulation trials of 100,000 time steps). After optimization, we set an initial state $s_0$ and a sequence of movements $\{a_0, a_1, \dots, a_{T-1}\}$, and performed path integration by the recursive estimation $\hat{x}_{t+1} = M(a_t)\, \hat{x}_t$. We determined a position at each time step by searching for the state representation vector that has the minimum Euclidean distance to the estimated vector ($s_t = \arg\min_s \| x(s) - \hat{x}_t \|_2$). As shown in Figure 6, this strategy gave an accurate estimation of the spatial path from movement signals.

... where $i$ is a positional index which indicates the same position in all contexts ($i = 1, 2, \dots, 900$). We constructed representation vectors $x(s_i^A)$, $w(s_i^A)$, $x(s_i^B)$, and $w(s_i^B)$ through direct experiences, then we created $x(s_i^{A+B})$ and $w(s_i^{A+B})$ as $x(s_i^{A+B}) = x(s_i^A) + x(s_i^B)$, $w(s_i^{A+B}) = w(s_i^A) + w(s_i^B)$.
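The path-integration procedure of A.7 (movement-conditional matrices followed by nearest-neighbour decoding) can be illustrated on a toy problem. The sketch below makes several assumptions that differ from the study: states lie on a ring with two movements instead of a 2-D room with eight, the vectors `X` are random non-negative stand-ins for DSI vectors, and each matrix is fitted by ordinary least squares rather than stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, dim = 12, 16            # toy ring; dim >= n_states makes the linear fit exact
X = rng.random((n_states, dim))   # hypothetical stand-in for DSI vectors x(s)

def fit_move(a):
    """Least-squares M(a) such that x(s_{t+1}) ~= M(a) x(s_t) for movement a."""
    src = X
    dst = X[(np.arange(n_states) + a) % n_states]
    M_T, *_ = np.linalg.lstsq(src, dst, rcond=None)   # solves src @ M_T ~= dst
    return M_T.T

M = {a: fit_move(a) for a in (+1, -1)}   # one matrix per movement

def path_integrate(s0, actions):
    """Recursively update x_hat from movements only and decode the nearest state."""
    x_hat, path = X[s0].copy(), [s0]
    for a in actions:
        x_hat = M[a] @ x_hat
        path.append(int(np.argmin(np.linalg.norm(X - x_hat, axis=1))))
    return path

decoded = path_integrate(0, [1, 1, -1, 1, 1, 1, -1, -1, 1])
```

Because the estimate is never corrected by sensory input, any error in the fitted matrices accumulates over steps; in this toy setting the fit is exact, so the decoded path follows the true positions on the ring.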
We performed spatial navigation in a given context using one of three representations $\{x(s_i^A), w(s_i^A)\}$, $\{x(s_i^B), w(s_i^B)\}$, and $\{x(s_i^{A+B}), w(s_i^{A+B})\}$ for corresponding positions, following the rule described in Section 4.3.

In Figure 7, we show the structures of the state transition graphs for the three contexts A, B, and A+B. To learn representations in contexts A and B, we sampled sequences of $\{s_i^A\}_{i=1,\dots,900}$ and $\{s_i^B\}_{i=1,\dots,900}$ by random walk in contexts A and B. The procedure was basically the same as in Section 4, except that we increased the number of simulation trials from 500 to 1,000, and a state transition to the same position in the other context occurred every 5,000 time steps (transition between $s_i^A$ and $s_i^B$). We added this transition to associate the same position in different contexts. This amounts to assuming that the setting of barriers can change during the experience, although this temporal association may be substituted by the similarity of sensory inputs across contexts. From the sampled sequences, we calculated PSI for all combinations of $\{s_i^A\}_{i=1,\dots,900}$ and $\{s_i^B\}_{i=1,\dots,900}$, and calculated 100-dimensional DSI vectors for 1,800 states by simultaneous compression of all states. The discount factor $\gamma$ was set to 0.999.

In Section 5, we used text data taken from the English Wikipedia dump (enwiki-latest-pages-articles, 22-May-2020). We first generated text files from raw data using wikiextractor (https://github.com/attardi/wikiextractor). We tokenized texts by the nltk Punkt sentence tokenizer, and randomly sampled 100,000 articles containing at least 1,000 tokens. We lowercased all characters and removed punctuation characters in the data. After that, we selected words that appeared more than 1,000 times in the data, and substituted all other rare words with the <unk> symbol. Finally, we obtained data that contains 124M tokens and 9,376 words.
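The text normalization steps in the last paragraph (lowercasing, punctuation removal, and replacement of rare words) can be sketched as below. This is an illustrative stand-in, not the pipeline of the study: the function name `preprocess` is hypothetical, whitespace splitting replaces the tokenizer, and article sampling is omitted.

```python
import string
from collections import Counter

def preprocess(articles, min_count):
    """Lowercase, strip punctuation, and replace rare words with <unk>."""
    table = str.maketrans("", "", string.punctuation)
    # Assumption: whitespace splitting stands in for the tokenizer used in the study.
    tokenised = [a.lower().translate(table).split() for a in articles]
    counts = Counter(tok for doc in tokenised for tok in doc)
    # Keep words that appear MORE than min_count times, as in the text.
    vocab = {w for w, c in counts.items() if c > min_count}
    docs = [[tok if tok in vocab else "<unk>" for tok in doc] for doc in tokenised]
    return docs, vocab

toy_docs, toy_vocab = preprocess(
    ["The cat sat. The cat ran!", "A dog sat, the dog ran."], min_count=1
)
```

With `min_count=1000` this mirrors the threshold stated in the text; the toy call uses `min_count=1` only so that the effect is visible on two short sentences.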

A.11 DETAILS OF EVALUATION OF CONCEPTUAL SPECIFICITY

In Section 5.2, the conceptual specificity of each unit was evaluated using the WordNet database (Princeton University, 2010). In WordNet, a word belongs to several synsets (sets of cognitive synonyms), and the semantic similarity of two synsets can be evaluated from the shortest path length between them in the WordNet structure (we used the path similarity function in the nltk library). We defined the similarity of two words as the highest similarity among all combinations of synsets of those words. For each unit, we calculated the mean similarity over all combinations of TOP-10 words (the ten words that most highly activated the unit; Figure 5A) that are available in WordNet. We evaluated only units which had at least five TOP-10 words available in WordNet. Furthermore, we randomly generated 1,000 pairs of words available in WordNet, and generated a null distribution of similarity between words. We defined a significance threshold of similarity as the 95th percentile of the null distribution, and a unit was classified as a significantly concept-specific unit if the mean similarity of its TOP-10 words exceeded the threshold. Furthermore, we quantitatively defined the conceptual specificity of each unit as $s_{unit} / s_{null} - 1$, where $s_{unit}$ is the mean similarity of TOP-10 words and $s_{null}$ is the mean of the null distribution. This quantity becomes zero if the similarity between TOP-10 words does not differ from that of random pairs, and becomes positive if TOP-10 words are semantically similar. This conceptual specificity was averaged over all evaluated units.
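The scoring rule above can be sketched independently of WordNet. In the following illustration the similarity function `toy_sim`, the word categories, and the null sample are all hypothetical placeholders; in the study, the similarity comes from WordNet path similarity and the null distribution from 1,000 random word pairs.

```python
import itertools
import numpy as np

def mean_pairwise_similarity(words, sim):
    """Mean similarity over all unordered pairs of the given words."""
    pairs = list(itertools.combinations(words, 2))
    return float(np.mean([sim(a, b) for a, b in pairs]))

def conceptual_specificity(top_words, null_sims, sim):
    """Return (s_unit / s_null - 1, whether s_unit exceeds the 95th percentile)."""
    s_unit = mean_pairwise_similarity(top_words, sim)
    s_null = float(np.mean(null_sims))
    threshold = float(np.percentile(null_sims, 95))
    return s_unit / s_null - 1.0, s_unit > threshold

# Hypothetical similarity: high within a category, low across categories.
category = {"apple": "food", "bread": "food", "rice": "food",
            "shirt": "clothes", "hat": "clothes"}

def toy_sim(a, b):
    return 0.5 if category[a] == category[b] else 0.1

null_sims = [0.1] * 19 + [0.5]   # placeholder null sample of random-pair similarities
spec_food, sig_food = conceptual_specificity(["apple", "bread", "rice"], null_sims, toy_sim)
spec_mixed, sig_mixed = conceptual_specificity(["apple", "shirt"], null_sims, toy_sim)
```

A unit whose top words share a category gets a positive score and passes the threshold, while a unit with unrelated top words scores at or below zero.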

A.13 EXAMPLES OF THE ANALOGICAL INFERENCE TASK

In Table 4, we show some examples of the analogical inference in Mikolov's dataset. Each item has the relationship "WORD1 is to WORD2 as WORD3 is to WORD4", so the expected relationship in the vector space is WORD2 - WORD1 = WORD4 - WORD3. In this study, we inferred WORD4 as WORD3 + WORD2 - WORD1. We regarded an inference as correct if the actual vector of WORD4 had the largest cosine similarity to the inferred vector among all word representation vectors (except those for WORD1, WORD2, and WORD3). If the number of words is 10,000, the chance level of the correct answer rate is 0.01%. Therefore, the performance shown in this study (more than 50%) is far above the chance level.

"Berlin" is represented by the combination of German cell and capital cell, and so on (Figure 12). This example also gives a simple interpretation of word similarity in the DSI vector space. If words are similar, they share a large number of active units, like the country cell shared by the representations of France and Germany. Thus, semantic similarity between words increases cosine similarity between word vectors. Furthermore, we also identified the largest elements (the largest absolute values) in the difference vectors between words, and found that they correspond to semantic differences between words (Figure 12). Thus, we can regard analogical inference by DSI vectors as recombination of conceptual units. For example, adding the Germany - Berlin vector to the Paris vector deactivates the capital cell and activates the country cell, which transforms Paris into France. This property of the vector space is the same as in conventional word embedding methods, but a unique feature of our model is that those analogical relationships are factorized into separate units. We speculate that the constraints of decorrelative NMF are sufficient conditions to align each semantic factor with an axis of the word vector space, and the mechanism is probably related to how disentangled representations emerge in visual feature learning models (Higgins et al., 2017; Carbonneau et al., 2020).
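The recombination view of analogical inference can be made concrete with a toy example. The four-dimensional vectors below are hypothetical, hand-built stand-ins in which each axis plays the role of one conceptual unit (French, German, country, capital); they are not DSI vectors from the study. The inference rule is the one stated in A.13: WORD3 + WORD2 - WORD1, scored by cosine similarity with the three query words excluded.

```python
import numpy as np

# Hypothetical units: [French, German, country, capital]; values are illustrative only.
vectors = {
    "france":  np.array([1.0, 0.0, 1.0, 0.0]),
    "paris":   np.array([1.0, 0.0, 0.0, 1.0]),
    "germany": np.array([0.0, 1.0, 1.0, 0.0]),
    "berlin":  np.array([0.0, 1.0, 0.0, 1.0]),
    "tokyo":   np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(word1, word2, word3):
    """Infer WORD4 from WORD3 + WORD2 - WORD1, excluding the three query words."""
    query = vectors[word3] + vectors[word2] - vectors[word1]
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# "Berlin is to Germany as Paris is to ?"
answer = analogy("berlin", "germany", "paris")
```

Here the difference Germany - Berlin is zero on the language axes, negative on the capital axis, and positive on the country axis, so adding it to Paris switches off the capital unit and switches on the country unit, which yields the France vector.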



Figure 1: Interpretation of our model. (A) Biological interpretation of our model. PSI as a model of the hippocampus and DSI as a model of EC. (B) The setting of a state transition map in which DSI approximates value functions for spatial navigation.

Figure 2: DSI representations for the physical space. (A) The square room tiled with 30x30 discrete states. (B) Grid-like DSI representations in the space (upper) and their spatial autocorrelation (lower). (C) The distribution of grid scales of DSI representations. Red and blue broken lines indicate positions of peaks and $30(\sqrt{2})^n$, respectively.

Figure 3: Spatial navigation using DSI representation vectors. (A) Geometric interpretation of the state transition rule in the DSI representation space. (B) An example spatial path obtained by DSI-based navigation. (C) Example spatial paths by DSI-based navigation in the four-room environment.

Figure 4: Vector-based inference of a novel spatial context. (A) We constructed representation vectors for a novel context A+B (a square room with two barriers) by the sum of DSI representation vectors for two familiar contexts A and B (with either of two barriers). (B) Rates of near-optimal navigation in various settings of representation vectors and contexts.

Figure 5: Concept-specific representations formed by DSI. Ten words that gave the highest activation (TOP-10 words) are shown. We also marked each unit with a descriptive label.

Figure 6: Path integration using DSI model. (Left) Actual path. (Right) Path estimated from DSI vectors updated by movement information.

Figure 7: Structures of state transition graphs used in the section 4.4.

Figure 8: Visualization of metric spaces defined by representation vectors for contexts A and B, and composite vectors for the context A+B using MDS. Red, blue, and green dots correspond to states (positions) in the left, center, and right parts of the original 2-D space (Figure 7), respectively.

Figure 10 shows the structure of DSI word representations visualized by MDS. We arbitrarily chose words based on the 10 semantic categories used in Reber et al. (2019). We used the same dissimilarity metric as Reber et al. (2019) (1 - Pearson's correlation coefficient).
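A two-dimensional embedding from this dissimilarity metric can be sketched with classical (Torgerson) MDS. Which MDS variant was used for Figure 10 is not stated here, so the choice of classical MDS is an assumption, as are the function name `classical_mds` and the toy vectors.

```python
import numpy as np

def classical_mds(vectors, n_components=2):
    """Embed rows of `vectors` using 1 - Pearson correlation as dissimilarity."""
    D = 1.0 - np.corrcoef(vectors)                   # pairwise dissimilarity matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n              # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                      # double-centred squared dissimilarities
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]   # largest eigenvalues first
    # Negative eigenvalues (non-Euclidean dissimilarities) are clipped to zero.
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Toy vectors: two pairs with nearly identical activation profiles.
toy = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.1],
                [1.0, 0.1, 1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 0.0, 1.0, 0.1, 1.0],
                [0.0, 1.0, 0.1, 1.0, 0.0, 1.0]])
embedding = classical_mds(toy)
```

Rows with correlated activation profiles land close together in the embedding, while anticorrelated rows are placed far apart.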

Figure 10: Visualization of the representational structure of DSI using MDS. Words are plotted in coordinates embedded in 2-D DSI space. Each color corresponds to a semantic category used in Reber et al. (2019).

Evaluation of conceptual specificity. Numbers of evaluated and significantly concept-specific units, the ratio of concept-specific units, and average conceptual specificity are summarized.

Performances of vector-based computations in the conceptual space. The correlation of word similarity evaluated by vectors and humans, and the rate of correct analogical inference.

Contribution of each model component to the performance. Rates of grid-like units, near-optimal navigation, significantly concept-specific units, and correct analogical inference.

Examples of analogical relationships in Mikolov's dataset.

A.12 EXAMPLE DSI REPRESENTATIONS FOR WORDS

In Figure 9, we show the TOP-10 words of DSI units without manual selection. According to manual inspection, several units that were not classified as significant nevertheless exhibit conceptual specificity (for example, unit4 may be named a university cell). This is probably because of the limited knowledge covered by WordNet. Therefore, we suppose that the current evaluation method tends to underestimate the number of concept-specific units. However, the comparison across models was fair because we used the same procedure and criteria for all models.

A.15 INTUITIVE MECHANISM OF WORD REPRESENTATIONS BY DSI

In this section, we discuss how DSI vectors represent and compute words. First, we analyzed the ratio of each element to the sum of all elements in DSI vectors. We found that even the largest element accounted for only 5% of the sum of all elements on average (Figure 11). This result shows that DSI vectors for words are non-sparse and distributed, so each word is represented by a combination of multiple conceptual units. Next, for further clarification, we inspected representations of an example set of words: France, Paris, Germany and Berlin. These words contain two analogical relationships (the country-capital and French-German relationships). We identified the most active units (TOP-2) in the DSI vectors for those words, and listed the TOP-10 words for the identified units. As a result, we could see that "France" is represented by the combination of units that we could name French cell and country cell, whereas

