A REPRESENTATIONAL MODEL OF GRID CELLS' PATH INTEGRATION BASED ON MATRIX LIE ALGEBRAS

Anonymous

Abstract

The grid cells in the mammalian medial entorhinal cortex exhibit striking hexagonal firing patterns when the agent navigates in the open field. It is hypothesized that the grid cells are involved in path integration, so that the agent is aware of its self-position by accumulating its self-motion. Assuming the grid cells form a vector representation of self-position, we elucidate a minimally simple recurrent model for grid cells' path integration based on two coupled matrix Lie algebras that underlie two coupled rotation systems mirroring the agent's self-motion: (1) when the agent moves along a certain direction, the vector is rotated by a generator matrix; (2) when the agent changes direction, the generator matrix is itself rotated by another generator matrix. Our experiments show that our model learns hexagonal grid response patterns that resemble the firing patterns observed in grid cells in the brain. Furthermore, the learned model is capable of near-exact path integration, and it is also capable of error correction. Our model is novel and simple, with explicit geometric and algebraic structures.

1. INTRODUCTION

Imagine walking in the darkness. Purely based on your sense of self-motion, you can gain a sense of self-position by integrating your self-movement, a process often referred to as path integration (Darwin, 1873; Etienne & Jeffery, 2004; Hafting et al., 2005; Fiete et al., 2008; McNaughton et al., 2006). While the exact neural underpinning of path integration remains unclear, it has been hypothesized that the grid cells (Hafting et al., 2005; Fyhn et al., 2008; Yartsev et al., 2011; Killian et al., 2012; Jacobs et al., 2013; Doeller et al., 2010) in the mammalian medial entorhinal cortex (mEC) may be involved in this process (Gil et al., 2018; Ridler et al., 2019; Horner et al., 2016). The grid cells are so named because individual neurons exhibit striking firing patterns that form hexagonal grids when the agent (such as a rat) navigates in a 2D open field (Fyhn et al., 2004; Hafting et al., 2005; Fuhs & Touretzky, 2006; Burak & Fiete, 2009; Sreenivasan & Fiete, 2011; Blair et al., 2007; Couey et al., 2013; de Almeida et al., 2009; Pastoll et al., 2013; Agmon & Burak, 2020). The grid cells interact with the place cells in the hippocampus (O'Keefe, 1979). Unlike a grid cell, which fires at the vertices of a lattice, a place cell often fires at a single location or a few locations. The purpose of this paper is to understand how the grid cells may perform path integration. We propose a representational model in which the self-position is represented by the population activity vector formed by the grid cells, and the self-motion is represented by the rotation of this vector. Specifically, our model consists of two coupled systems: (1) when the agent moves along a certain direction, the vector is rotated by a generator matrix of a Lie algebra; (2) when the agent changes movement direction, the generator matrix itself is rotated by the generator matrix of another Lie algebra.
Our numerical experiments demonstrate that our model learns hexagonal grid patterns that share many properties with the grid cells in the rodent brain. Furthermore, the learned model is capable of near-exact path integration, and it is also capable of error correction. Our model is novel and simple, with explicit geometric and algebraic structures. The population activity vector formed by the grid cells rotates in the "mental" or neural space, tracking the egocentric self-motion of the agent in the physical space. The model also connects naturally to the basis expansion model that decomposes the response maps of place cells as linear expansions of the response maps of grid cells (Dordek et al., 2016; Sorscher et al., 2019). Overall, our model provides a new conceptual framework for studying the grid cell system in the brain by considering the intrinsic symmetry (through Lie algebras) of the task that the path integration system is solving.

2. REPRESENTATIONAL MODEL FOR PATH INTEGRATION

Consider an agent navigating within a square domain (theoretically, the domain can be R^2). Let x = (x_1, x_2) be the self-position of the agent in the 2D environment. At self-position x, if the agent makes a displacement δr along direction θ ∈ [0, 2π], then the self-position changes to x + δx, where δx = (δx_1, δx_2) = (δr cos θ, δr sin θ). In our model, we use a polar coordinate system (see figure 1a, b) by working directly with (θ, δr), keeping (δx_1, δx_2) implicit; (θ, δr) is the biologically plausible egocentric representation of self-motion. We assume that the location x in the 2D environment is encoded by the response pattern of a population of d neurons (e.g., d = 200), which corresponds to a d-dimensional vector v(x) = (v_i(x), i = 1, ..., d), with each element representing the firing rate of one neuron when the animal is at location x. From the embedding point of view, we essentially embed the 2D domain in R^2 as a 2D manifold in the higher-dimensional space R^d. Locally, we embed the 2D local polar system centered at x (see figure 1a, b) into R^d so that it becomes a local system around v(x) (see figure 1c).

2.1. THE PROPOSED REPRESENTATIONAL MODEL: COUPLING TWO ROTATION SYSTEMS

Assuming δr to be infinitesimal, we propose the following model:

v(x + δx) = (I + B(θ)δr) v(x) + o(δr),    (1)

which parameterizes a recurrent neural network (Hochreiter & Schmidhuber, 1997), where I is the identity matrix and B(θ) is a d × d matrix that depends on the direction θ and needs to be learned.

Rotation. We assume B(θ)ᵀ = -B(θ), i.e., B(θ) is skew-symmetric, so that I + B(θ)δr is a rotation (orthogonal) matrix, because (I + B(θ)δr)(I + B(θ)δr)ᵀ = I + O(δr^2). Because the upper-triangular part of B(θ) is the negative transpose of the lower-triangular part (and the diagonal elements are zero), we only need to learn the lower-triangular part. The geometric interpretation is that, if the agent moves along direction θ, the vector v(x) is rotated by the matrix B(θ), while the ℓ_2 norm ||v(x)|| remains stable (figure 1c). We may interpret ||v(x)||^2 = Σ_{i=1}^d v_i(x)^2 as the total energy of the grid cells, which is stable across locations. From the embedding point of view, the local polar system in figure 1a is embedded into a d-dimensional sphere in the neural response space.

When the agent makes an infinitesimal change of direction from θ to θ + δθ, B(θ) is changed to B(θ + δθ). We assume

B(θ + δθ) = (I + Cδθ) B(θ) + o(δθ),    (2)

where C is a d × d matrix, which is also to be learned. We again assume Cᵀ = -C, so that I + Cδθ is a rotation matrix. The geometric interpretation is that if the agent changes direction, B(θ) is rotated by C. Equations (1) and (2) together define our proposed model for path integration, which couples two rotation systems.
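As a quick numerical sanity check of the rotation property, the NumPy sketch below builds a random skew-symmetric generator (illustrative only, not one learned by the model; the dimension d and step size δr are arbitrary choices) and verifies that one update step I + Bδr preserves the norm of v up to second order in δr:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dr = 16, 1e-3

# Random skew-symmetric generator: B = A - A^T satisfies B^T = -B.
A = rng.standard_normal((d, d))
B = A - A.T
v = rng.standard_normal(d)

v_new = (np.eye(d) + B * dr) @ v  # one step of the recurrent update

# Cross term v^T B v vanishes by skew-symmetry, so the norm drifts
# only at second order: |v'|^2 = |v|^2 + dr^2 |Bv|^2.
norm_drift = abs(v_new @ v_new - v @ v)
print(norm_drift)
```

The drift scales like δr^2, which is why the infinitesimal update behaves as a rotation.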

2.2. A MIRROR OF EGOCENTRIC MOTION: PRESERVING LOCAL GEOMETRIC RELATIONS

As a representational model, equations (1) and (2) form a mirror in the d-dimensional "mental" (or neural) space of the egocentric motion in the 2D physical space. Importantly, the embedding preserves the local geometric relations of the local polar system. Let δ_θ v(x) = v(x + δx) - v(x) be the displacement of v(x) when the agent moves from x by δr along direction θ. It follows from equation (1) that δ_θ v(x) = (B(θ)δr + o(δr)) v(x). Ignoring higher-order terms, we obtain δ_{θ+δθ} v(x) = (I + Cδθ) δ_θ v(x). That is, with δr fixed, the local changes of v(x) along different directions θ are rotated versions of each other, mirroring the local polar system at x. See figure 1 for an illustration. As for the angle between v(x) and v(x + δx), i.e., how much the vector v rotates in the neural space as the agent moves by δr in the 2D physical space, we have

Proposition 1. In the above notation, let δα be the angle between v(x) and v(x + δx). Then δα = βδr + O(δr^2), where δr = ||δx|| and β = ||B(θ)v(x)|| / ||v(x)|| is independent of θ.

See Supplementary A.1 for a proof. That means the angle δα in the d-dimensional neural space is proportional to the Euclidean distance δr in the 2D space, and, more importantly, δα is independent of the direction θ, i.e., β is isotropic.
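The proportionality δα ≈ βδr is easy to verify numerically for a single generator. The sketch below uses a random skew-symmetric B (illustrative; the θ-independence of β additionally requires the coupling through C, which a single random generator does not capture):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d, dr = 12, 1e-4

A = rng.standard_normal((d, d))
B = A - A.T                      # skew-symmetric generator
v = rng.standard_normal(d)

v_new = expm(B * dr) @ v         # finite rotation via the exponential map
cosine = v @ v_new / (np.linalg.norm(v) * np.linalg.norm(v_new))
d_alpha = np.arccos(np.clip(cosine, -1.0, 1.0))

# Proposition 1: d_alpha = beta * dr + O(dr^2) with beta = |Bv| / |v|.
beta = np.linalg.norm(B @ v) / np.linalg.norm(v)
rel_err = abs(d_alpha - beta * dr) / (beta * dr)
print(rel_err)  # shrinks as dr -> 0
```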

2.3. HEXAGON GRID PATTERNS

For the learned model, β can be much larger than 1, so that the vector v(x) rotates back to itself within a short distance, producing the periodic patterns of v(x). Moreover, β does not depend on the direction of self-motion, and this isotropic property appears to underlie the emergent hexagonal periodic patterns, as suggested by the following result. Hexagonal grid patterns can be created by linearly mixing three Fourier plane waves whose directions are 2π/3 apart. We now state a theoretical result, adapted from Gao et al. (2018), that connects such linearly mixed Fourier waves to the geometric property in Proposition 1.

Proposition 2. Let e(x) = (exp(i⟨a_j, x⟩), j = 1, 2, 3), where (a_j, j = 1, 2, 3) are three 2D vectors of equal norm such that the angle between every pair of them is 2π/3. Let v(x) = U e(x), where U is an arbitrary unitary matrix, i.e., U*U = I. Let δα be the angle between v(x) and v(x + δx). Then δα = βδr + O(δr^2), where δr = ||δx|| and β ∝ ||a_j|| is independent of the direction of δx.

See Supplementary A.1 for a proof, which relies on the fact that (a_j, j = 1, 2, 3) forms a tight frame in 2D. Proposition 2 says that the geometric property that emerges from our model, as elucidated by Proposition 1, is satisfied by the orthogonal mixing of three Fourier plane waves that creates hexagonal grid patterns. We are currently pursuing a more general analysis of our model, i.e., equations (1) and (2).
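The construction in Proposition 2 is easy to visualize: superposing three plane waves whose wave vectors are 2π/3 apart yields a hexagonal lattice. A minimal sketch (frequency and grid resolution are arbitrary choices):

```python
import numpy as np

# Three wave vectors of equal norm, directions 2*pi/3 apart.
angles = (0.0, 2*np.pi/3, 4*np.pi/3)
freq = 3.0
a = [freq * np.array([np.cos(t), np.sin(t)]) for t in angles]

# Superpose the (real parts of the) three plane waves on a grid.
n = 65
xs = np.linspace(-np.pi, np.pi, n)
X, Y = np.meshgrid(xs, xs)
pattern = sum(np.cos(aj[0]*X + aj[1]*Y) for aj in a)

# The maximum value 3 is attained at the lattice vertices, where all three
# waves are in phase; the vertices form a hexagonal lattice.
center = n // 2   # x = (0, 0) lies exactly on this grid since n is odd
print(pattern[center, center])
```

Plotting `pattern` (e.g. with matplotlib's `imshow`) displays the hexagonal firing map directly.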

2.4. JUSTIFICATION AS MINIMALLY SIMPLE RECURRENT MODEL

We now justify equation (1) as a minimally simple recurrent model. To start, the general form of the model is v(x + δx) = F(v(x), δr, θ), where the function F(v, δ, θ) satisfies F(v, 0, θ) = v, i.e., the vector representation stays the same if there is no self-displacement. For infinitesimal δr, a first-order Taylor expansion gives v(x + δx) = v(x) + f(v(x), θ)δr + o(δr), where f(v, θ) = ∂F(v, δ, θ)/∂δ |_{δ=0} is the first derivative at δ = 0. The function f(v, θ) transforms v into another vector of the same dimension, and the transformation depends on θ. A minimally simple choice is a linear transformation that depends on θ, i.e., v(x + δx) = v(x) + B(θ)v(x)δr + o(δr), which leads to equation (1) with the linear transformation B(θ). Equation (2) can be justified similarly. In this paper, we assume a linear recurrent model for its simplicity and for its explicit geometric meaning as a rotation. It is important to emphasize that our approach is not mutually exclusive with work based on non-linear recurrent neural network models (Burak & Fiete, 2009; Couey et al., 2013). In fact, our linear rotation model may serve as a prototype approximation that could help better understand these non-linear models, a direction we do not pursue here.

2.5. MATRIX LIE ALGEBRAS AND GROUPS: FROM INFINITESIMAL TO FINITE

For a finite (non-infinitesimal) self-displacement Δr, we can divide Δr into N steps, so that δr = Δr/N → 0 as N → ∞, and

v(x + Δx) = (I + B(θ)(Δr/N) + o(1/N))^N v(x) → exp(B(θ)Δr) v(x).    (4)

This limit underlies the relationship between a matrix Lie algebra and a matrix Lie group (Taylor, 2002). For a fixed θ, the set of M_θ(Δr) = exp(B(θ)Δr) for Δr ∈ R forms a matrix Lie group, which is both a group and a manifold. The tangent space of M_θ(Δr) at the identity I is the corresponding matrix Lie algebra; B(θ) is a basis of this tangent space, and is often referred to as the generator matrix. Similarly, for a finite change of direction Δθ, we obtain

B(θ + Δθ) = exp(CΔθ) B(θ).    (5)

The set of R(Δθ) = exp(CΔθ) for Δθ ∈ [0, 2π] (with mod-2π addition) forms another matrix Lie group, with C the generator matrix of its matrix Lie algebra.

Approximation to the exponential map. For a finite but small Δr, exp(B(θ)Δr) can be approximated by a second-order Taylor expansion,

exp(B(θ)Δr) = I + B(θ)Δr + B(θ)^2 Δr^2 / 2 + o(Δr^2).    (6)

Similarly, exp(CΔθ) can be approximated by exp(CΔθ) = I + CΔθ + C^2 Δθ^2 / 2 + o(Δθ^2).

Path integration. We can now cast path integration in the language of Lie groups. Specifically, the input consists of the initial position x^(0) and the self-motions (θ^(t), Δr^(t)) for t = 1, ..., T. Initializing v^(0) = v(x^(0)), the vector is updated recurrently according to

v^(t) = exp(B(θ^(t)) Δr^(t)) v^(t-1).    (7)

That is, geometrically, the vector v^(t-1) is rotated by B(θ^(t)) through a distance Δr^(t).

Modules. Experimentally, it is well established that grid cells are organized in discrete modules (Barry et al., 2007; Stensola et al., 2012) or blocks. We thus partition the vector v(x) into K blocks, v(x) = (v_k(x), k = 1, ..., K). Correspondingly, the generator matrices B(θ) = diag(B_k(θ), k = 1, ..., K) and C = diag(C_k, k = 1, ..., K) are block-diagonal. This greatly reduces the number of parameters to be learned.
Note that each sub-vector v_k(x) is rotated by a sub-matrix B_k(θ), which is in turn rotated by C_k.

Metric. By the same argument as in Proposition 1, for a module k, let δα_k be the angle between v_k(x) and v_k(x + δx); then δα_k = β_k δr, where δr = ||δx|| and β_k is independent of θ. That is, if the agent moves by δr, the sub-vector v_k(x) rotates by an angle β_k δr. Thus β_k determines the metric, or scale, of the response maps of the k-th block of grid cells: it tells us how fast the sub-vector v_k rotates as the agent moves.
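The limit in equation (4) can be checked directly in code: many first-order steps along a fixed direction converge to a single finite rotation by the matrix exponential. The generator here is random and illustrative, not a learned B(θ):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
d, N, total_dr = 10, 5000, 0.5

A = rng.standard_normal((d, d))
B = A - A.T                       # skew-symmetric generator for one direction
v0 = rng.standard_normal(d)

# N small first-order steps (I + B * dr/N) ...
step = np.eye(d) + B * (total_dr / N)
v_steps = v0.copy()
for _ in range(N):
    v_steps = step @ v_steps

# ... approach the single finite rotation exp(B * total_dr).
v_exact = expm(B * total_dr) @ v0
err = np.linalg.norm(v_steps - v_exact) / np.linalg.norm(v_exact)
print(err)  # O(1/N)
```

Because B is skew-symmetric, exp(B Δr) is orthogonal, so the exact update also preserves the norm of v0.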

3. INTEGRATION WITH BASIS EXPANSION MODEL

For each v(x), we need to be able to uniquely decode x. We thus integrate the path integration model with the basis expansion model that connects grid cells to place cells. Each place cell fires when the agent is at a specific position. Let A_{x'}(x) be the response map of the place cell associated with position x'; it measures the adjacency between x and x'. A commonly used form of A_{x'}(x) for an agent navigating in the open field is the Gaussian adjacency kernel A_{x'}(x) = exp(-||x - x'||^2 / (2σ^2)).

3.1. BASIS EXPANSION

A popular model connecting place cells and grid cells is the following basis expansion model (or PCA-based model) (Dordek et al., 2016):

A_{x'}(x) = Σ_{i=1}^d u_{i,x'} v_i(x) = ⟨v(x), u(x')⟩,    (8)

where v(x) = (v_i(x), i = 1, ..., d) and u(x') = (u_{i,x'}, i = 1, ..., d). Here (v_i(x), i = 1, ..., d) forms a set of d basis functions for expanding A_{x'}(x) over all places x', while u(x') is the read-out weight vector for the place cell at x', which needs to be learned. Experimental results have shown that the connections from grid cells to place cells are excitatory (Zhang et al., 2013; Rowland et al., 2018). We thus assume that u_{i,x'} ≥ 0 for all i and x'. We can also make v(x) non-negative by adding a bias term; see Supplementary A.4 for details.
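A 1D toy version of the expansion in equation (8) can be sketched as follows. For illustration we expand a Gaussian place field over a fixed Fourier basis with unconstrained least squares (the paper instead learns both v and u, and imposes the non-negativity constraint on u, which this toy omits; all sizes here are arbitrary):

```python
import numpy as np

# Gaussian place field A_{x'}(x) on [0, 1), centered at x' = 0.5.
n, sigma, x_place = 200, 0.08, 0.5
xs = np.linspace(0, 1, n, endpoint=False)
A = np.exp(-(xs - x_place)**2 / (2 * sigma**2))

# Fixed periodic basis: constant plus cos/sin pairs at frequencies 1..8.
cols = [np.ones(n)]
for f in range(1, 9):
    cols += [np.cos(2*np.pi*f*xs), np.sin(2*np.pi*f*xs)]
V = np.stack(cols, axis=1)                    # (n, d) basis functions v_i(x)

# Read-out weights u for this place cell, by least squares.
u, *_ = np.linalg.lstsq(V, A, rcond=None)
rel_err = np.linalg.norm(V @ u - A) / np.linalg.norm(A)
print(rel_err)  # small: the place field lies near the span of the basis
```

The same idea in 2D, with learned hexagonal basis maps in place of the cosines, is what the loss term L_0 in Section 4 optimizes.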

3.2. DECODING, RE-ENCODING, AND ERROR CORRECTION

For a neural response vector v, such as v^(t) in equation (7), the response of the place cell centered at location x' is ⟨v, u(x')⟩. We can decode the position x̂ by finding the place cell with the maximal response, i.e.,

x̂ = arg max_{x'} ⟨v, u(x')⟩.    (9)

After decoding x̂, we can re-encode v ← v(x̂), which amounts to projecting v onto the 2D manifold formed by v(x) over all x. The set of v(x) forms a codebook, and the projection via re-encoding enables error correction by removing possible errors or noise in v (see Supplementary A.2).
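The decode-then-re-encode loop can be sketched on a 1D ring. For simplicity we take the readout u(x') equal to the code v(x') (in the paper u is learned and non-negative), choosing Fourier features whose spectrum makes ⟨v(x), v(x')⟩ approximate a Gaussian kernel; all sizes are illustrative:

```python
import numpy as np

# Codebook v(x) on a ring of n positions: Fourier features weighted so that
# <v(x), v(x')> ~ Gaussian adjacency kernel A(x, x').
n, n_freq, sigma = 256, 12, 0.05
xs = np.linspace(0, 1, n, endpoint=False)
freqs = np.arange(1, n_freq + 1)
w = np.exp(-2 * (np.pi * sigma * freqs)**2)   # Gaussian spectrum weights
phases = 2*np.pi * np.outer(xs, freqs)
V = np.concatenate([np.sqrt(w) * np.cos(phases),
                    np.sqrt(w) * np.sin(phases)], axis=1)  # (n, d) codebook

def decode(v):
    """Decode position as the index with maximal place-cell response."""
    return int(np.argmax(V @ v))

rng = np.random.default_rng(3)
true_idx = 100
noise_scale = 0.2 * np.abs(V[true_idx]).mean()
v_noisy = V[true_idx] + noise_scale * rng.standard_normal(V.shape[1])

decoded = decode(v_noisy)          # equation (9) on the toy codebook
v_corrected = V[decoded]           # re-encode: project back onto the codebook
print(decoded)                     # close to true_idx despite the noise
```

Re-encoding replaces the noisy vector with the nearest codeword, which is the error-correction mechanism described above.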

3.3. UNITARY REPRESENTATION AND HARMONIC ANALYSIS

Underlying the integration of our path integration model (equations (1) and (2)) with the basis expansion model (equation (8)) is group representation theory. Our path integration model leads to a unitary group representation. Let M_θ(Δr) = exp(B(θ)Δr) for finite (non-infinitesimal) Δr, let x = (Δr cos θ, Δr sin θ), and let M(x) = M_θ(Δr). For each x, M(x) is an orthogonal matrix. Collectively, the M(x) form a unitary representation of x ∈ R^2, i.e., of the 2D additive group, where addition in R^2 is represented by matrix multiplication. Each matrix element M_ij(x) is a function of x. According to the fundamental theorems of Schur (Zee, 2016) and Peter-Weyl (Taylor, 2002), if M is an irreducible representation of a finite group or a compact Lie group, then {M_ij(x)} form a set of orthogonal basis functions of x. This leads to a deep generalization of harmonic, or Fourier, analysis. Let v(x) = M(x)v(0) (where we choose the origin 0 as the reference point). The elements of v(x), i.e., (v_i(x), i = 1, ..., d), are then also basis functions of x. These basis functions serve to expand (A_{x'}(x), ∀x'), which parametrizes the place cells, and they are generated by the matrix Lie algebras of our path integration model. Thus group representation provides a unifying theoretical framework for the two hypothesized roles of grid cells, namely path integration and basis expansion. In our work, we do not assume each block matrix M_k(x) to be irreducible. Thus each learned v_i(x) within a block is a linear mixture of the orthogonal basis functions of an irreducible representation, and different v_i(x) within the same block are not necessarily orthogonal. However, v_i(x) in different blocks are close to orthogonal in our experiments (see Supplementary A.3 for details).

4. LEARNING

The unknown parameters are (1) (v(x), ∀x); (2) (u(x'), ∀x'); (3) (B(θ), ∀θ); and (4) C. To learn these parameters, we define a loss function L = L_0 + λ_1 L_1 + λ_2 L_2, where

L_0 = E_{x,x'} [A_{x'}(x) - ⟨v(x), u(x')⟩]^2,    (10)
L_1 = E_{x,Δx} ||v(x + Δx) - exp(B(θ)Δr) v(x)||^2,    (11)
L_2 = E_{θ,Δθ} ||B(θ + Δθ) - exp(CΔθ) B(θ)||^2,    (12)

where ||·||^2 denotes the sum of squares of the elements of the vector or matrix. λ_1 and λ_2 are chosen so that the three loss terms are of similar magnitudes. In L_0, the A_{x'}(x) are given Gaussian adjacency kernels, and we aim to learn the basis functions v(x). L_1 and L_2 serve to constrain the basis functions v(x) so that path integration can be performed according to our proposed model (equations (1) and (2)). In L_1, Δx = (Δr cos θ, Δr sin θ). As the generator matrices of the two coupled rotation systems, B(θ) and C are both assumed to be skew-symmetric, so that only their lower-triangular parts need to be learned. We further assume B(θ) and C to be block-diagonal, with each block corresponding to a module, consistent with experimental observations (Stensola et al., 2012). For regularization, we add a penalty on ||u(x')||^2, and we further assume u(x') ≥ 0 so that the connections from grid cells to place cells are excitatory (Zhang et al., 2013; Rowland et al., 2018). We minimize the loss function by stochastic gradient descent, specifically the Adam optimizer (Kingma & Ba, 2014), where the expectations are approximated by Monte Carlo samples of (x, x'), (x, Δx), and (θ, Δθ). See Supplementary B.1 for details on generating the Monte Carlo samples. Unlike previous work on the basis expansion model (or PCA-based model; Dordek et al., 2016), we do not constrain the basis functions v(x) = (v_i(x), i = 1, ..., d) to be orthogonal to each other. Instead, we constrain them through our path integration model (equations (1) and (2)) via the loss terms L_1 and L_2.
In fact, the learned v_i(x) within the same block are not orthogonal, although v_i(x) from different blocks tend to be orthogonal (see Supplementary A.3). L_2 constrains the B(θ) for different θ to be rotated versions of each other; according to our experiments, it is important for the emergence of hexagonal grid patterns, and it also reduces model complexity by tying the B(θ) together across θ. Because A_{x'}(x) contains a whole range of frequencies in the Fourier domain, the learned response maps of the grid cells span a range of scales too. It is also worth noting that, consistent with experimental observations, we assume each place field A_{x'}(x) to have a Gaussian shape, rather than the Mexican-hat pattern (with balanced excitatory center and inhibitory surround) assumed in previous basis expansion models of grid cells (Dordek et al., 2016; Sorscher et al., 2019).
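Single-sample versions of the three loss terms can be sketched directly from their definitions (the paper trains minibatch Monte Carlo estimates with learnable v, u, B, C; here random skew-symmetric generators stand in for learned ones, and we check only that L_1 and L_2 vanish when the model relations hold exactly):

```python
import numpy as np
from scipy.linalg import expm

def loss0(A_val, v_x, u_xp):
    """Basis-expansion term: (A_{x'}(x) - <v(x), u(x')>)^2."""
    return (A_val - v_x @ u_xp)**2

def loss1(v_x_dx, v_x, B_theta, dr):
    """Path-integration term: |v(x+dx) - exp(B(theta) dr) v(x)|^2."""
    return np.sum((v_x_dx - expm(B_theta * dr) @ v_x)**2)

def loss2(B_next, B_theta, C, dtheta):
    """Direction-rotation term: |B(theta+dtheta) - exp(C dtheta) B(theta)|^2."""
    return np.sum((B_next - expm(C * dtheta) @ B_theta)**2)

# Consistency check: data generated exactly by the model gives zero loss.
rng = np.random.default_rng(4)
d, dr, dtheta = 8, 0.1, 0.05
B = rng.standard_normal((d, d)); B = B - B.T   # skew-symmetric
C = rng.standard_normal((d, d)); C = C - C.T
v = rng.standard_normal(d)
u = rng.standard_normal(d)

v_next = expm(B * dr) @ v
B_next = expm(C * dtheta) @ B
print(loss0(v @ u, v, u), loss1(v_next, v, B, dr), loss2(B_next, B, C, dtheta))
```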

5. EXPERIMENTS

5.1. HEXAGONAL GRID PATTERNS

Every row shows the learned units belonging to the same block or module. Regular hexagonal grid patterns emerge for both v(x) and u(x'). Within each block or module, the scales and orientations are roughly the same, but with different phases or spatial shifts. Notably, the emergence of hexagonal patterns does not rely on a specific block size or number of blocks. See Supplementary B.2 for the patterns of learned v(x) with different block sizes, as well as the learned B(θ) and u(x'). For the learned B(θ), each element shows regular sine/cosine tuning over θ. We further investigate the characteristics of the learned firing patterns (i.e., v(x)) using measures adopted from the grid cell literature. Specifically, the hexagonal regularity, scale, and orientation of grid-like patterns are quantified by the gridness score, grid scale, and grid orientation (Langston et al., 2010; Sargolini et al., 2006), which are determined by taking a circular sample of the autocorrelogram of the response map. All learned patterns exhibit significant hexagonal periodicity in terms of gridness scores (mean 1.08, range 0.60 to 1.57). Specifically, a unit is considered grid-like if its gridness score exceeds the 95th percentile of a null distribution obtained by applying spatial field shuffles to the response map, following the standard procedure in (Hafting et al., 2005; Barry & Burgess, 2017). On average, the 95th percentile is 0.35 across units. Figure 4a shows six examples of autocorrelograms of the response maps and the corresponding gridness scores, each from a different module. The grid scales of the learned patterns (mean 0.39, range 0.24 to 0.61), shown in Figure 4b, follow a multi-modal distribution. The ratios between neighboring modes are roughly 1.44 and 1.51, closely matching theoretical predictions (Wei et al., 2015; Stemmler et al., 2015) as well as empirical results from rodent grid cells (Stensola et al., 2012).
The grid orientations of the learned patterns, shown in Figure 4c, are also multi-modally distributed, consistent with observations in rat grid cells (Stensola et al., 2012). See Supplementary B.3 for the detailed spatial profile of every unit of v(x). Collectively, these results reveal a striking, quantitative correspondence between the properties of our model neurons and those of grid cells in the brain.
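A common form of the gridness score from the literature compares rotated copies of a map: correlation should be high at 60° and 120° rotations and low at 30°, 90°, and 150°. The sketch below implements that definition (for brevity it is applied to synthetic maps directly rather than to autocorrelograms of learned responses, and is not the paper's exact evaluation pipeline):

```python
import numpy as np
from scipy.ndimage import rotate

def gridness(m):
    """min correlation at 60/120 deg minus max at 30/90/150 deg."""
    def corr(angle):
        r = rotate(m, angle, reshape=False, mode='constant')
        a, b = m.ravel() - m.mean(), r.ravel() - r.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return min(corr(60), corr(120)) - max(corr(30), corr(90), corr(150))

# A hexagonal map (three plane waves 2*pi/3 apart) scores high;
# a square-lattice map does not.
n = 64
xs = np.linspace(-np.pi, np.pi, n)
X, Y = np.meshgrid(xs, xs)
angles = (0.0, 2*np.pi/3, 4*np.pi/3)
hexmap = sum(np.cos(5*(np.cos(t)*X + np.sin(t)*Y)) for t in angles)
squaremap = np.cos(5*X) + np.cos(5*Y)
print(gridness(hexmap), gridness(squaremap))  # hexagonal >> square
```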

5.2. PATH INTEGRATION AND ERROR CORRECTION

We then examine the ability of the learned system to perform path integration, by recurrently updating v^(t) as in equation (7) and decoding v^(t) to x^(t) for t = 1, ..., T using equation (9). Re-encoding v^(t) ← v(x̂^(t)) after decoding is adopted. Figure 5a shows an example trajectory of accurate path integration for T = 80. As shown in figure 5b, with re-encoding, the path integration error remains close to zero over 500 time steps (< 0.01 cm, averaged over 1,000 episodes), even though the model is trained with the single-time-step loss in equation (11). Without re-encoding, the error grows slightly but still remains reasonable (ranging from 0.0 to 5.4 cm, mean 3.8 cm). Path integration performance improves as the block size becomes larger, i.e., with more units or cells in each module (figure 5c); when the block size exceeds 20, path integration is almost exact for the time steps tested. We further assess the error correction ability of the learned system. Specifically, along the path, at every time step t, one of two types of errors is introduced into v^(t): (1) Gaussian noise, or (2) dropout masks, i.e., a certain percentage of units are randomly set to zero. Figure 5d summarizes the path integration performance under different levels of introduced error for T = 100. For Gaussian noise, we use the average magnitude of the units of v(x) as the reference standard deviation s, i.e., s^2 = ||v(x)||^2 / d. The results show that re-encoding is crucial for error correction. Notably, with re-encoding, path integration works reasonably well even when Gaussian noise of magnitude s is added or 50% of the units are dropped out at each step, indicating that the learned system is quite robust to different sources of error.

5.3. ABLATION STUDY

We conduct a systematic ablation study to assess the impact of individual model components on learning hexagonal grid patterns (measured by the gridness score) and on the ability to perform path integration (T = 100 time steps). Table 1 summarizes the results when the model is trained with a given component removed. For the emergence of regular hexagonal patterns, all components appear important, and L_2 in particular is crucial. In comparison, the constraint u(x') ≥ 0 is not entirely critical. Another observation is that L_0, L_1, and the penalty on ||u(x')||^2 are crucial for accurate path integration. See Supplementary B.4 for details.

6.1. RELATED WORK

Our work is related to several lines of previous research. First, RNN models have been used to model grid cells and path integration. The traditional approach uses simulation-based models with hand-crafted connectivity (Zhang, 1996; Burak & Fiete, 2009; Couey et al., 2013; Pastoll et al., 2013; Agmon & Burak, 2020). More recently, two pioneering papers (Cueva & Wei, 2018; Banino et al., 2018) developed an optimization-based RNN approach to learning path integration and discovered that grid-like response patterns could emerge in the optimized network. These results were further substantiated in follow-up research (Sorscher et al., 2019; Cueva et al., 2020). Compared to these studies, our path integration model is more explicit, coupling two rotation systems in the neural space to mirror egocentric motion in the physical space. In doing so, our results reveal new insights into why lattice, and in particular hexagonal, response patterns may emerge in neural networks trained to perform path integration. Second, as discussed in Section 3, our model is naturally connected to basis expansion models of grid cells. A key result of this line of work (Dordek et al., 2016; Sorscher et al., 2019; Stachenfeld et al., 2017) is that, under certain conditions, the principal components of the place cell activities exhibit grid patterns.
Importantly, our work differs from these models in that, unlike PCA, we make no assumption about the orthogonality of the basis functions, and the basis expansion formulation is obtained via group representation from our path integration model. Group representation unifies path integration and basis expansion, the two roles hypothesized for grid cells. Furthermore, previous basis expansion models (Dordek et al., 2016; Sorscher et al., 2019) assumed place fields with Mexican-hat patterns (with balanced excitatory center and inhibitory surround) in order to obtain hexagonal grid firing patterns, whereas experimentally measured place fields are instead well characterized by Gaussian functions. Crucially, in our model, hexagonal grids emerge after learning with Gaussian place fields, with no need to assume any additional surround mechanism. In another related paper, Gao et al. (2018) proposed a vector representation of 2D position and a matrix representation of 2D displacement. Our work goes beyond that work by revealing the two coupled Lie group and Lie algebra structures, and by further integrating the path integration model with the basis expansion model.

6.2. CONCLUSION

This paper entertains a representational model of grid cells' path integration, which mirrors the egocentric self-motion by coupling two rotation systems. The proposed model can be justified as a minimally simple recurrent model, and it has explicit geometric and algebraic structures. As we have shown, this simple framework leads to a system that captures many of the experimentally observed properties of the grid cell system in the brain. Our path integration model is linear in the vector v(x), but it is non-linear in the displacement Δx. Since the basis expansion model is linear in v(x), it may be preferable to have the path integration model linear in v(x) as well. The rotation of v(x) is capable of path integration, and the way v(x) rotates explains the hexagonal periodic patterns and the metric of each module. The connection between the two models rests on group representation theory. As for modularity, in the context of our path integration model it means that each sub-vector v_k(x) is rotated by a separate generator sub-matrix, or driven by a separate recurrent network, so that the dynamics of the sub-vectors are disentangled. This appears to be biologically plausible, and we believe modularity is part of the design of the network. In terms of representation learning, it is common to represent a physical state by a vector, i.e., to embed the physical state into a higher-dimensional neural space. However, the problem of representing continuous motion in the physical space, or continuous transformation of the physical state, or a continuous relation between states, has not received an in-depth treatment. Continuous motion in the physical space often has a natural Lie algebra and Lie group structure. Our work provides an explicit representation of the continuous motion and its algebraic structure by a matrix Lie algebra and matrix Lie group acting in the neural space.
We believe our method can be applied to modeling the head direction system, as well as to modeling the motor cortex for representing the continuous motions of the arm, hand, pose, etc. Last but not least, the proposed model can also be used for path planning (see Supplementary C for preliminary results).

A THEORETICAL ANALYSIS

A.1 PROOF OF PROPOSITIONS

Proof of Proposition 1. For an infinitesimal δx = δr(cos θ, sin θ), v(x) changes to v(x + δx). Let δα be the angle between v(x) and v(x + δx). It can be obtained from

⟨v(x), v(x + δx)⟩ = v(x)ᵀ exp(B(θ)δr) v(x)    (13)
= v(x)ᵀ (I + B(θ)δr + B(θ)^2 δr^2/2 + o(δr^2)) v(x)    (14)
= ||v(x)||^2 - ||B(θ)v(x)||^2 δr^2/2 + o(δr^2),    (15)

where v(x)ᵀ B(θ) v(x) = 0 because B(θ)ᵀ = -B(θ). Let

β^2 = ||B(θ)v(x)||^2 / ||v(x)||^2.    (16)

It is independent of θ, because for any Δθ, B(θ + Δθ)v = R(Δθ)B(θ)v, and R(Δθ) is orthogonal, so ||B(θ + Δθ)v||^2 = ||B(θ)v||^2 for any Δθ. Noting that ||v(x)||^2 = ||v(x + δx)||^2 because M_θ(δr) is orthogonal, we have

cos(δα) = 1 - δα^2/2 + o(δα^2)    (17)
= ⟨v(x), v(x + δx)⟩ / ||v(x)||^2    (18)
= 1 - (βδr)^2/2 + o(δr^2).    (19)

Thus the angle δα = βδr + O(δr^2).

Proof of Proposition 2. Suppose v(x) is defined as in Proposition 2. That is, v(x) = U e(x), where U is an arbitrary unitary matrix, and e(x) = (exp(i⟨a_j, x⟩), j = 1, 2, 3), where (a_j, j = 1, 2, 3) are three 2D vectors of equal norm such that the angle between every pair of them is 2π/3. Then ||v(x)||^2 = 3, and the angle δα between v(x) and v(x + δx) satisfies

cos(δα) = Re⟨v(x), v(x + δx)⟩ / (||v(x)|| ||v(x + δx)||)    (20)
= (1/3) Re⟨v(x), v(x + δx)⟩    (21)
= (1/3) Re(e(x)* U* U e(x + δx))    (22)
= (1/3) Re(Σ_{j=1}^3 exp(i⟨a_j, δx⟩))    (23)
= (1/3) Σ_{j=1}^3 cos(⟨a_j, δx⟩)    (24)
= (1/3) Σ_{j=1}^3 (1 - ⟨a_j, δx⟩^2/2) + o(δr^2)    (25)
= 1 - (β^2/2) δr^2 + o(δr^2)    (26)
= cos(βδr) + o(δr^2).

Thus δα = βδr + O(δr^2). The key is that (a_j, j = 1, 2, 3) forms a tight frame in 2D: for any 2D vector δx, Σ_{j=1}^3 ⟨a_j, δx⟩^2 ∝ ||a_j||^2 ||δx||^2. Thus β ∝ ||a_j||.

A.2 ERROR CORRECTION

Suppose v = v(x) + ε is a noisy version of v(x); can we still decode x accurately from v? Here we assume ε ~ N(0, τ^2 (||v(x)||^2/d) I), where d is the dimensionality of v, and τ^2 measures the variance of the noise relative to ||v(x)||^2/d, i.e., the average of (v_i(x)^2, i = 1, ..., d). The heat map is

h(x') = ⟨v, u(x')⟩ = ⟨v(x), u(x')⟩ + ⟨ε, u(x')⟩ = A(x, x') + e(x'),

where e(x') ~ N(0, τ^2 ||v(x)||^2 ||u(x')||^2 / d).
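The tight-frame identity used at the end of the proof of Proposition 2 can be verified numerically: for three equal-norm 2D vectors spaced 2π/3 apart, Σ_j ⟨a_j, u⟩^2 = (3/2)||a_j||^2 for every unit vector u, so the quantity is isotropic. A minimal check (the norm 2.5 is an arbitrary choice):

```python
import numpy as np

norm_a = 2.5
a = norm_a * np.array([[np.cos(t), np.sin(t)]
                       for t in (0.0, 2*np.pi/3, 4*np.pi/3)])  # (3, 2)

# sum_j <a_j, u>^2 is the same for every direction u: no angular dependence.
for phi in np.linspace(0, 2*np.pi, 7):
    u = np.array([np.cos(phi), np.sin(phi)])   # unit vector
    energy = ((a @ u)**2).sum()
    assert np.isclose(energy, 1.5 * norm_a**2)
print("tight frame verified")
```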
For $A(x, x') = \exp(-|x'-x|^2/(2\sigma^2)) = \langle v(x), u(x')\rangle$, if $\sigma^2$ is small, then $A(x, x')$ decreases to 0 quickly: if $|x'-x| > \delta$, then $A(x, x') < \exp(-\delta^2/(2\sigma^2))$, and the chance that the maximum of $h(x')$ is achieved at an $x'$ with $|x'-x| > \delta$ can be made very small. The above analysis also provides a justification for regularizing $\|u(x')\|^2$ in learning. For error correction, we want $d$ to be big and $\sigma^2$ to be small; but for path planning, we also need big $\sigma^2$. That is, we need $A(x, x')$ at multiple scales.

A.3 ORTHOGONALITY RELATIONS

For elements $x$ that form a group, a matrix representation $M(x)$ is equivalent to another representation $M'(x)$ if there exists a matrix $P$ such that $M'(x) = P M(x) P^{-1}$ for each $x$. A matrix representation is reducible if it is equivalent to a block-diagonal matrix representation, i.e., we can find a matrix $P$ such that $P M(x) P^{-1}$ is block diagonal for every $x$. Suppose the group is a finite group or a compact Lie group, and $M$ is a unitary representation. If $M$ is block diagonal, $M = \mathrm{diag}(M_k, k = 1, \dots, K)$, with non-equivalent blocks, and each block $M_k$ cannot be further reduced, then the matrix elements $(M_{kij}(x))$ are orthogonal basis functions of $x$. Such orthogonality relations were proved by Schur for finite groups (Zee, 2016) and by Peter and Weyl for compact Lie groups (Taylor, 2002). In our case, the group of displacements in the 2D domain is theoretically $\mathbb{R}^2$, but we learn our model within a finite range, and we further discretize the range into a lattice; thus the above orthogonality relations are relevant. In our model, we also assume a block-diagonal $M$, and we call each block a module. However, we do not assume that each module is irreducible, i.e., each module itself may be further diagonalized into a block-diagonal matrix of irreducible blocks. Thus the elements within the same module $v_k(x)$ may be linear mixings of orthogonal basis functions, and they themselves may not be orthogonal.
However, different modules may be linear mixings of different sets of irreducible blocks, and thus different modules can be orthogonal to each other. Figure 6 visualizes the correlations between each pair of the learned $v_i(x)$ and $v_j(x)$, $i, j = 1, \dots, d$. For $v_i(x)$ and $v_j(x)$ from different modules, the correlations are close to zero; i.e., $v_i(x)$ and $v_j(x)$ from different blocks are approximately orthogonal to each other. In contrast, $v_i(x)$ and $v_j(x)$ from the same block are not orthogonal to each other.

B.1 SAMPLING OF TRAINING EXAMPLES

For the pairs $(x, x')$ used in $L_0 = \mathbb{E}_{x, x'}\left[A(x, x') - \langle v(x), u(x')\rangle\right]^2$, $x$ is first sampled uniformly within the entire domain, and then the displacement $dx$ between $x'$ and $x$ is sampled from a normal distribution $N(0, \sigma^2 I_2)$, where $\sigma = 0.48$. This ensures that nearby pairs are given more emphasis. We let $x' = x + dx$, and those pairs $(x, x')$ within the range of the domain (i.e., a $2\,\mathrm{m} \times 2\,\mathrm{m}$ area discretized into an $80 \times 80$ lattice) are kept as valid data. For $(x, \Delta x)$ used in $L_1 = \mathbb{E}_{x, \Delta x}\left|v(x+\Delta x) - \exp(B(\theta)\Delta r)v(x)\right|^2$, $\Delta x$ is sampled uniformly within a circular domain with radius equal to 3 grids and $(0, 0)$ as the center. Specifically, $\Delta r^2$, the squared length of $\Delta x$, is sampled uniformly from $[0, 3]$ grids, and $\theta$ is sampled uniformly from $[0, 2\pi]$. We take the square root of the sampled $\Delta r^2$ as $\Delta r$ and let $\Delta x = (\Delta r\cos\theta, \Delta r\sin\theta)$. Then $x$ is sampled uniformly from the region such that both $x$ and $x+\Delta x$ are within the range of the domain. For $(\theta, \Delta\theta)$ used in $L_2 = \mathbb{E}_{\theta, \Delta\theta}\left|B(\theta+\Delta\theta) - \exp(C\Delta\theta)B(\theta)\right|^2$, we enumerate all pairs of discretized $\theta$ (144 directions over $[0, 2\pi]$) and $\Delta\theta$ (5 values within the range $[0, 12.5]$ degrees) as samples.
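The sampling scheme described above can be sketched in NumPy as follows. This is a paraphrase, not the authors' code: the function names, the rejection-sampling implementation of the in-domain constraint, and the conversion of $\sigma = 0.48$ (domain units, 2 m across) into grid units are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 80  # the 2m x 2m domain is discretized into an N x N lattice (grid units below)

def sample_place_pairs(n, sigma):
    """Pairs (x, x') for the A(x, x') term: x uniform over the domain,
    x' = x + dx with dx ~ N(0, sigma^2 I_2); only in-domain pairs are kept."""
    x = rng.uniform(0, N, size=(n, 2))
    xp = x + rng.normal(0.0, sigma, size=(n, 2))
    keep = np.all((xp >= 0) & (xp < N), axis=1)
    return x[keep], xp[keep]

def sample_motion(n, dr2_max=3.0):
    """(x, dx, theta) for the motion term: dr^2 ~ Uniform[0, dr2_max] (grids),
    theta ~ Uniform[0, 2*pi); keep x such that x and x + dx are in-domain
    (rejection sampling in place of sampling x from the valid region)."""
    dr = np.sqrt(rng.uniform(0.0, dr2_max, size=n))
    theta = rng.uniform(0.0, 2 * np.pi, size=n)
    dx = np.stack([dr * np.cos(theta), dr * np.sin(theta)], axis=1)
    x = rng.uniform(0, N, size=(n, 2))
    keep = np.all((x + dx >= 0) & (x + dx < N), axis=1)
    return x[keep], dx[keep], theta[keep]

def enumerate_rotation_pairs(n_theta=144, n_dtheta=5, dtheta_max_deg=12.5):
    """All (theta, dtheta) pairs: 144 directions over [0, 2*pi) and
    5 values of dtheta over [0, 12.5] degrees."""
    thetas = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    dthetas = np.deg2rad(np.linspace(0.0, dtheta_max_deg, n_dtheta))
    return [(t, d) for t in thetas for d in dthetas]

# sigma = 0.48 in domain units of 2 m is roughly 19.2 grids (our conversion).
x0, xp0 = sample_place_pairs(1000, sigma=19.2)
xm, dxm, thm = sample_motion(1000)
pairs = enumerate_rotation_pairs()
print(len(x0), len(xm), len(pairs))
```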

B.2 LEARNED PATTERNS

Figure 7 shows the learned patterns of $u(x)$ with 6 blocks of 32 cells each. Figure 8 shows the learned patterns of $v(x)$ and $u(x)$ with 6 blocks of size 16. We further evaluate the spatial profile of the learned patterns with 6 blocks of size 16 using the same measures as in the main text. All learned patterns exhibit significant hexagonal periodicity in terms of gridness scores (mean 1.06, std 0.27, range 0.58 to 1.48), which exceed the 95th percentile of null distributions obtained by applying spatial field shuffling to each response map. The grid scales of the learned patterns (mean 0.38, range 0.27 to 0.56), as shown in Figure 10a, follow a multi-modal distribution, and the ratios between neighbouring modes are roughly 1.37 and 1.38. As shown in Figure 10b, the grid orientations of the learned patterns are also multi-modally distributed. Figure 11 shows the learned patterns of a block of $B(\theta)$ over $\theta$ from 0 to $2\pi$; regular sine/cosine tunings emerge. $(B(\theta))$ can be isometric to $(\theta)$, in the sense that for each column $i$, the angle between $B_i(\theta)$ and $B_i(\theta+\Delta\theta)$ is $\Delta\theta$.
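The isometry property of $B(\theta)$ can be checked numerically in a simple setting. The sketch below is our construction, not the learned model: we take a generator $C$ that is block diagonal in 2D rotation generators, so that $C^2 = -I$ and $\exp(C\Delta\theta) = \cos(\Delta\theta) I + \sin(\Delta\theta) C$ rotates every vector by exactly $\Delta\theta$. Under this assumption, each column of $B(\theta) = \exp(C\theta)B(0)$ changes by exactly the angle $\Delta\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3                                  # number of 2D rotation planes
d = 2 * m
J = np.array([[0.0, -1.0], [1.0, 0.0]])
C = np.kron(np.eye(m), J)              # skew-symmetric and C @ C = -I

def R(t):
    # exp(C t) = cos(t) I + sin(t) C, valid because C^2 = -I.
    return np.cos(t) * np.eye(d) + np.sin(t) * C

# An arbitrary skew-symmetric B(0); B(theta) = exp(C theta) B(0).
A0 = rng.standard_normal((d, d))
B0 = A0 - A0.T

def angle(u, w):
    c = u @ w / (np.linalg.norm(u) * np.linalg.norm(w))
    return np.arccos(np.clip(c, -1.0, 1.0))

theta, dtheta = 0.7, 0.3
B_t = R(theta) @ B0
B_td = R(theta + dtheta) @ B0          # equals R(dtheta) @ B_t
col_angles = [angle(B_t[:, i], B_td[:, i]) for i in range(d)]
print(col_angles)                      # each column rotates by dtheta
```

The exactness follows from $\langle b, \exp(C\Delta\theta) b\rangle = \cos(\Delta\theta)\|b\|^2$, since $\langle b, Cb\rangle = 0$ for skew-symmetric $C$; for a learned $C$ that does not satisfy $C^2 = -I$, the relation holds only approximately.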

B.3 SPATIAL PROFILE OF LEARNED HEXAGON GRID PATTERNS

Figure 12 shows the spatial profile of the patterns of $v(x)$ over the $80 \times 80$ lattice.

B.4 PATH PLANNING

For path planning, we consider the discounted occupancy
\begin{align}
A_\gamma(x, x') = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, 1(x_t = x') \,\Big|\, x_0 = x\Big] = \langle v(x), u_\gamma(x')\rangle, \tag{29}
\end{align}
where $\mathbb{E}$ is with respect to a random-walk exploration policy, and $\gamma$ is the discount factor that controls the temporal and spatial scales. We can discretize $\gamma$ into a finite list of scales. The above model, i.e., $v(x)$ and $u_\gamma(x')$, can be learned by temporal-difference learning. The basis expansion model with $d \ll N^2$ (where $N^2$ is the number of lattice sites) enables efficient learning from a small amount of exploration, so that we can fill in unexplored $A_\gamma(x, x')$ based on the learned $v(x)$ and $u_\gamma(x')$. For random-walk diffusion in the open field, $A_\gamma(x, x') \propto \exp(-|x'-x|^2/(2\sigma_\gamma^2))$, where $\sigma_\gamma^2$ depends on $\gamma$. For a random walk in a field with obstacles or non-Euclidean geometry, by Varadhan's formula (Varadhan, 1967), $A_\gamma(x, x')$ can still be approximated by a Gaussian kernel, except that $|x'-x|$ is replaced by the geodesic distance.

After learning $v(x)$ and $u_\gamma(x')$, we propose the following method for path planning. Let $\hat x$ be the target, and let $x^{(t)}$ be the current position, encoded by $v(x^{(t)})$. We propose to plan the next displacement by
\begin{align}
\Delta x^{(t+1)} = \arg\max_{\Delta x}\, \langle M(\Delta x)\, v(x^{(t)}), u_\gamma(\hat x)\rangle, \tag{30}
\end{align}
and let $x^{(t+1)} = x^{(t)} + \Delta x^{(t+1)}$, encoded by $v(x^{(t+1)}) = M(\Delta x^{(t+1)})\, v(x^{(t)})$. In the above maximization, $\Delta x$ is chosen from all the allowed displacements for a single step, and we also need to select an optimal $\gamma$ that is most sensitive to the change of $A$. One scale-selection scheme is to choose the smallest $\sigma_\gamma$ that satisfies $\max_{\Delta x}\langle M(\Delta x)\, v(x^{(t)}), u_\gamma(\hat x)\rangle > 0.2$. When $x^{(t)}$ is far from $\hat x$, the selected $\sigma_\gamma^2$ is big; when $x^{(t)}$ is close to $\hat x$, the selected $\sigma_\gamma^2$ is small. This method thus enables automatic selection of scale. We shall explore other schemes for selecting $\gamma$ in future work. We test path planning in the open field using the learned model. Specifically, the model is first learned using a single scale $A_\gamma(x, x')$, where $\sigma_\gamma = 0.07$.
Then we assume a list of three scales of $A_\gamma(x, x')$, i.e., $\sigma_\gamma \in \{0.07, 0.14, 0.28\}$, and learn three corresponding sets of $u_\gamma(x')$. For planning, we create a pool of allowed displacements from which $\Delta x$ is chosen: the length of $\Delta x$ can be 1 or 2 grids, and the direction is chosen from 200 discretized angles over $[0, 2\pi]$. As the examples in Figure 14 show, when $x^{(t)}$ is far from the target, a kernel with large $\sigma_\gamma$ is chosen, and as $x^{(t)}$ approaches the target, a kernel with smaller $\sigma_\gamma$ is chosen. A planning episode is treated as successful if the distance between $x^{(t)}$ and the target becomes smaller than 0.5 grid within 40 time steps. In the cases where the distance between the agent's starting point and the target is smaller than 20 grids, the success rate is 100% (tested over 10,000 episodes). We shall explore this method in irregular fields with obstacles in future work.
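The greedy planner with scale selection can be sketched end to end. The sketch below skips learning and uses the idealized open-field kernel $A_\gamma(x, x') = \exp(-|x'-x|^2/(2\sigma_\gamma^2))$ directly, so that $\langle M(\Delta x)v(x^{(t)}), u_\gamma(\hat x)\rangle$ reduces to evaluating the kernel at $x^{(t)} + \Delta x$. The unit-square arena, reuse of the numbers 0.07/0.14/0.28 as scales in that arena, and the stopping tolerance are our assumptions.

```python
import numpy as np

# Three kernel scales (the paper's values, reused here in a unit-square arena).
SIGMAS = (0.07, 0.14, 0.28)
GRID = 1.0 / 80.0   # lattice spacing, assuming an 80 x 80 discretization

# Pool of allowed one-step displacements: lengths of 1 or 2 grids,
# 200 discretized directions over [0, 2*pi).
angles = np.linspace(0.0, 2 * np.pi, 200, endpoint=False)
STEPS = np.concatenate([r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
                        for r in (GRID, 2 * GRID)])

def kernel(a, b, sigma):
    """Idealized open-field A_gamma(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

def plan(start, target, tol=0.01, max_steps=60):
    """Greedy planner: at each step, pick the smallest sigma whose best one-step
    score exceeds 0.2 (falling back to the largest), then take the argmax step."""
    x = np.asarray(start, dtype=float)
    path = [x.copy()]
    for _ in range(max_steps):
        if np.linalg.norm(x - target) < tol:
            break
        for sigma in SIGMAS:                 # smallest adequate scale first
            scores = kernel(x + STEPS, target, sigma)
            if scores.max() > 0.2:
                break
        x = x + STEPS[np.argmax(scores)]
        path.append(x.copy())
    return np.array(path)

target = np.array([0.9, 0.8])
path = plan(start=(0.1, 0.15), target=target)
print(len(path), np.linalg.norm(path[-1] - target))
```

Because the Gaussian kernel is monotone in distance, the argmax step always moves toward the target; the scale-selection loop reproduces the behavior described above, with the large kernel active far from the target and the small kernel taking over near it.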



Figure 1: Illustration of the proposed representational model. (a) 2D local polar system centered at x for egocentric self-motion, to be embedded in R d . (b) 2D local displacement δr and local change of direction δθ. (c) Mirroring relations in (b). x is mirrored by v(x). Local displacement δr from x along direction θ is mirrored by B(θ)δr applied to v(x). Local change of direction δθ is mirrored by Cδθ applied to B(θ).

Figure 2: Illustration of the basis expansion model $A_{x'}(x) = \sum_{i=1}^{d} u_{i,x'}\, v_i(x)$, where $v_i(x)$ is the response map of the $i$-th grid cell, shown at the bottom for 5 different $i$. $A_{x'}(x)$ is the response map of the place cell associated with $x'$, shown at the top for 3 different $x'$. $u_{i,x'}$ is the connection weight.

Figure 3: Grid firing patterns emerge in the learned network. Every response map shows the firing pattern of one neuron (i.e., one element of $v$) in the 2D environment. Every row shows the firing patterns of the neurons within the same block or module. (Zoom in for high quality.)

Figure 4: Model grid cells exhibit modular structure that is consistent with experimental data. (a) Examples of autocorrelograms of the response maps and the corresponding gridness scores, each from a different module. (b) Multi-modal distribution of grid scales. The scale ratios closely match the real data (Stensola et al., 2012). (c) Multi-modal distribution of grid orientations.

Figure 5: The learned model can perform path integration. (a) Black: example trajectory. The decoded self-positions (red) accurately match the real path. (b) Path integration error over the number of time steps. (c) Path integration error over different block sizes, for 50 and 100 time steps. For (b) and (c), the averaged error and a ±1 standard deviation band over 1,000 episodes are shown. (d) Path integration error with introduced errors. Left: Gaussian noise. Right: dropout mask.

Figure 6: Correlation heatmap for each pair of the learned $v_i(x)$ and $v_j(x)$. The correlations are computed over the 80 × 80 lattice of $x$.

Figure 9 shows the learned patterns of $v(x)$ with 16 blocks of size 12. Regular hexagonal patterns emerge in all of these settings.

Figure 7: Learned patterns of u(x) with 6 blocks of size 32 cells in each block. Every row shows the learned patterns within the same block.

Figure 8: Learned patterns with 6 blocks of size 16. Top: v(x). Bottom: u(x). Every row shows the learned patterns within the same block.

Figure 9: Learned patterns with 16 blocks of size 12. Left: v(x). Right: u(x). Every row shows the learned patterns within the same block.

Figure 10: (a) Multi-modal distribution of grid scales. (b) Multi-modal distribution of grid orientations.

Figure 14: Examples of planned paths.

Figure 11: Learned patterns of a block of $B(\theta)$. Each curve shows the pattern of one element of $B(\theta)$ over $\theta \in [0, 2\pi]$. Since $B(\theta)$ is skew-symmetric, the diagonal elements are zero and $B_{ij}(\theta) = -B_{ji}(\theta)$. Regular sine/cosine tunings emerge.

Figure 12: Spatial profile of the patterns of $v(x)$ over the 80 × 80 lattice. For each unit, the autocorrelogram is visualized. Gridness score, scale and orientation are listed sequentially on top of the autocorrelogram.

Ablation study. One component of the model is removed at a time, and the learned model is evaluated in terms of gridness score and path integration error over T = 100 time steps. "Skew-symmetry" refers to the skew-symmetry constraints on B(θ) and C.

