ASSOCIATIVE MEMORY AUGMENTED ASYNCHRONOUS SPATIOTEMPORAL REPRESENTATION LEARNING FOR EVENT-BASED PERCEPTION

Abstract

We propose EventFormer, a computationally efficient event-based representation learning framework for asynchronously processing event camera data. EventFormer treats sparse input events as a spatially unordered set and models their spatial interactions using a self-attention mechanism. An associative memory-augmented recurrent module correlates the current events with the stored representation computed from past events. A memory addressing mechanism is proposed to store and retrieve the latent states only where events occur and to update them only when they occur. Shifting representation learning from the input space to the latent memory space reduces the computation cost of processing each event. We show that EventFormer achieves 0.5% and 9% better accuracy with 30000× and 200× less computation than the state-of-the-art dense and event-based methods, respectively, on event-based object recognition datasets.

1. INTRODUCTION

Ultra-low power, high dynamic range (> 120 dB), high temporal resolution, and low latency make event-based cameras (Brandli et al., 2014; Suh et al., 2020; Son et al., 2017; Finateu et al., 2020) attractive for real-time machine vision applications such as robotics and autonomous driving (Falanga et al., 2020; Hagenaars et al., 2020; Sun et al., 2021; Zhu et al., 2018; Gehrig et al., 2021). Convolutional and recurrent neural network-based methods, originally developed for frame-based cameras, have demonstrated good perception accuracy on event cameras (Gehrig et al., 2019; Baldwin et al., 2022; Cannici et al., 2020b). But they rely on temporal aggregation of the events to create a frame-like dense representation as input, thereby discarding the inherent sparsity of event data and resulting in high computational cost (Figure 2). Recent works have explored event-based processing methods for object recognition to exploit data sparsity. Examples include time-surface representations relying on hand-crafted features (Lagorce et al., 2016; Sironi et al., 2018; Ramesh et al., 2019), 3D space-time event clouds (Wang et al., 2019), and graph-based methods (Li et al., 2021c; Schaefer et al., 2022). These methods adopt event-based processing to achieve lower computational cost but do not match the performance of dense-representation based methods (Figure 2). This necessitates computationally efficient algorithms that exploit sparsity and achieve high accuracy. We propose an associative memory-augmented asynchronous representation learning framework for event-based perception, hereafter referred to as EventFormer, that enables computationally efficient event-based processing with high performance (Figure 1).
As events are triggered asynchronously, an event-based processing algorithm must generate and maintain a higher-order representation from the events, and efficiently update that representation to correlate a new event with past events across space and time. One way to address this is to include a recurrent module at each pixel to track the history of past events (Cannici et al., 2020a). However, the associated processing and memory requirement of such a method grows with the number of pixels. Motivated by recent works in memory augmented neural networks (Kumar et al., 2016; Ma et al., 2018; Karunaratne et al., 2021), we address this challenge by maintaining a spatiotemporal representation associated with past events, occurring at various pixels, as the hidden states of an associative memory. As learning spatiotemporal correlation is shifted from the high-dimensional input (pixel) space to a compact latent memory space, EventFormer requires an order of magnitude fewer floating point operations (FLOPs) per event to update the memory. To the best of our knowledge, EventFormer is the first associative memory-augmented spatiotemporal representation learning method for event-based perception.

Figure 1: (left) Dense processing that aggregates events into a frame and uses a CNN; event-based processing including (mid) GNNs on a spatiotemporal event graph, or (right) PointNet-like architectures treating events as a point cloud. These methods either re-process (frame and point cloud) or store the past events (graph) to spatiotemporally correlate with new events. EventFormer encodes the spatiotemporal interaction of the past events into a compact latent memory space to efficiently retrieve and correlate with new events without requiring to store or re-process the past events.
The key contributions of this paper are:

• EventFormer maps the spatiotemporal representation of the incoming event stream into the hidden states of an associative memory and uses a lightweight perception head operating directly on these states to generate perception decisions.

• The spatiotemporal representation update mechanism activates only 'when' and only 'where' there is a new event, using unstructured set-based processing without storing or re-processing the past events.

• We propose a new query-key association-based memory access mechanism that enables spatial location-aware memory access to retrieve the past states of the current event locations.

Given a new event (or a set of events), our model generates a spatial representation by computing their spatial interaction through a self-attention mechanism (Vaswani et al., 2017) and retrieves the past spatiotemporal states related to the corresponding pixel location(s) from the memory. The retrieval process exploits a novel location-based query and content-based key-value association mechanism to extract correlations among past events at neighboring pixels. A recurrent module then takes the retrieved past states, along with the present spatial representation, as input and generates refined state information. The associated hidden states of the memory are updated with this new information. We evaluate EventFormer on object recognition from event-camera data. In our experiments on the benchmark datasets N-Caltech101 (Orchard et al., 2015) and N-Cars (Sironi et al., 2018), EventFormer shows an excellent computational advantage over existing methods (both dense and event-based) while achieving performance comparable to dense methods (Figure 2).

Related Work:

Dense methods convert events to a dense, frame-like representation and process them with standard deep learning models such as CNNs (Maqueda et al., 2018; Gehrig et al., 2019; Cannici et al., 2020a). These methods are synchronous in nature as they generate dense inputs by binning and aggregating events in time and produce output only when the entire bin is processed. Event-based methods update their representations and generate new output with each new event. Methods such as (Lagorce et al., 2016; Sironi et al., 2018; Ramesh et al., 2019) compute a time-ordered representation (also known as a time-surface) in an event-based manner with fewer computations. However, their reliance on fixed, hand-tuned representations results in sub-optimal performance compared to data-driven methods (Gehrig et al., 2019). Recently, graph-based methods (Messikommer et al., 2020; Li et al., 2021b; Schaefer et al., 2022) have been considered that generate a spatiotemporal event graph and process it with a GNN. For a new event, the graph is updated by removing nodes associated with past events and adding new ones. As these methods require subsampling the events to reduce the size of the graph (and hence the computation), performance is reduced as well. They also need to store past events inside the graph to correlate with new events, causing additional computation and memory overhead. Point cloud-based methods can inherently process unstructured event camera data in a permutation-invariant manner (Wang et al., 2019; Vemprala et al., 2021). However, they operate over a sliding time window τ and must re-process past events for every new event to establish correlations, leading to redundant computation. Memory augmented neural networks typically combine an external associative memory with a recurrent controller (Kumar et al., 2016; Ma et al., 2018; Karunaratne et al., 2021), resulting in enhanced memory capacity compared to a vanilla recurrent module (Cho et al., 2014).
Cannici et al. (2020) propose a pixel-wise memory-based representation learning method for event-camera data that stores states at each pixel location (Cannici et al., 2020a). However, it requires a computationally expensive feature extractor because spatiotemporal correlations among these states are not considered. Event cameras have also been applied to more complex tasks including egomotion estimation, motion segmentation (Parameshwara et al., 2021b), and depth prediction (Hidalgo-Carrió et al., 2020). Broadly, these methods either use a dense frame-based representation, to which classical image processing (Mitrokhin et al., 2018) and end-to-end learning-based methods (Zhu et al., 2019; Mitrokhin et al., 2019) have been applied, or a 3D point-cloud representation, for which GNNs have been adopted (Mitrokhin et al., 2020) to learn these dense prediction tasks. While most existing works in this direction adopt a dense processing-based encoder-decoder structure, a spike-based asynchronous formulation has also been explored in (Parameshwara et al., 2021a).


Figure 2: Recognition performance vs. computation. On the N-Caltech101 dataset, EventFormer has about a 30000× lower compute cost than the SoTA dense method (Cannici et al., 2020b) with a 9% improvement in accuracy compared to the best-performing event-based method (Li et al., 2021c).

Preliminaries and Problem Formulation:

An event-based camera consists of sensors at every pixel location that respond to changes of brightness asynchronously and independently (Gallego et al., 2020). Mathematically, a set of events produced at pixel locations (x, y)_i (0 ≤ x ≤ W, 0 ≤ y ≤ H) during a time interval τ can be defined as a time-ordered sequence, E_τ = {(x_i, y_i, t_i, p_i) | t_i ∈ τ}, where t_i denotes the event triggering timestamp, p_i ∈ {-1, 1} the polarity (relative change in brightness), and T = sup_i t_i the total time of observation. In this work, we consider the spatial locations of the events as the input, that is, E_τ = {(x_i, y_i)}. Our goal is to learn a compute-efficient parametric mapping F : E_τ → M_{E_τ} that converts the raw event sequence E_τ into a suitable representation M_{E_τ} ∈ R^{m×d} (where m × d defines the dimension of the feature space) with event-based processing capability. This implies that F must be able to update its representation as soon as a new event (or a new event sequence) arrives without storing or recomputing past events.

Overview of the Framework: EventFormer consists of a positional encoder (Li et al., 2021a) followed by a pairwise interaction module to compute the spatial interactions among the events (Figure 3). A recurrent module takes this output Z_t as the current input state and computes the spatiotemporal representation X_t by retrieving the past hidden states H_{t-1} stored in an associative memory M. The output of this recurrent module is used to update the memory representation M_t, which is used by a classification head for the recognition task.

Figure 3: EventFormer generates positional embeddings π_t from a list of new events x_t at time t and calculates their pairwise interaction Z_t using self-attention. A recurrent unit extracts the past states H_{t-1} associated with x_t from an associative memory and computes the current states X_t. These are further encoded into the latent memory space M_t to process future events. A linear layer uses M_t to predict the target class.

Algorithm 1: Compute the event-based representation through EventFormer
Input: A list of n 2-dimensional pixel coordinates of the events at time t: x_t = {(x_i, y_i) | i ≤ n}
Output: Prediction vector Y_t ∈ R^c, where c denotes the number of classes.
Hyperparameters: The dimension of the representation vector d, row size of the memory m, number of stacks for the Refine operator R.
Initialize the memory representation with the learned initializer, M_0 ∈ R^{m×d}.
1:  π_t ← Π(x_t) := (1/√d) [cos(x_t W_p^T); sin(x_t W_p^T)];
2:  H_{t-1} ← Read(π_t, M_{t-1});              ⊲ Read the past memory representation
3:  foreach r ∈ R do
4:      π_t ← Refine_r(π_t, π_t);              ⊲ Compute event interactions
5:  Z_t ← π_t;
6:  X_t ← R(Z_t, H_{t-1});                     ⊲ Compute the current states
7:  M′_t ← Write(M_{t-1}, X_t);                ⊲ Compute new memory representation
8:  α_t ← sigmoid(Erase(M_{t-1}, X_t));        ⊲ Compute the update factor
9:  M_t ← α_t M_{t-1} + (1 - α_t) M′_t;        ⊲ Update the memory
10: M_t^flat ← Reshape M_t into a 1-dimensional vector;
11: Y_t ← FF_pred(M_t^flat);                   ⊲ Prediction at time t
12: return Y_t

To facilitate these operations, EventFormer has three unique Operators: Read to retrieve the past representation, Write/Erase to update the memory with new representation, and Refine to compute the spatial correlation among the events. Both the Read and Write/Erase operators use multihead residual attention (Vaswani et al., 2017) as a building block. The Refine operator adopts an efficient version of this building block (Shen et al., 2021) to address the quadratic memory and compute requirement of the traditional dot-product attention mechanism (Vaswani et al., 2017), which can become computationally prohibitive for a very fast-moving object generating a large number of events at a given time. Algorithm 1 shows an implementation of the overall EventFormer architecture.
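Algorithm 1 can be sketched end to end with toy, single-head stand-ins for the Operators. This is a hedged illustration, not the paper's implementation: the actual model uses multihead residual attention blocks with LayerNorm and trained parameters, whereas all weights and sizes below are arbitrary random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 4  # toy sizes; the paper uses d = m = 32 for N-Caltech101

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_p = rng.normal(size=(d // 2, 2))            # positional-encoder parameters

def pos_encode(x):                            # line 1: pi_t = Pi(x_t)
    proj = x @ W_p.T                          # (n, d/2)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(d)

def attention(A, B, Wq, Wk, Wv):              # single-head stand-in for an Operator
    Q, K, V = A @ Wq, B @ Wk, B @ Wv
    return softmax(Q @ K.T) @ V

W = {name: tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3))
     for name in ("read", "refine", "write", "erase")}
Wz, Wr, Wh = (0.1 * rng.normal(size=(2 * d, d)) for _ in range(3))  # GRU weights

def gru(x, h):                                # recurrent module R
    z = sigmoid(np.concatenate([x, h], axis=-1) @ Wz)
    r = sigmoid(np.concatenate([x, h], axis=-1) @ Wr)
    h_new = np.tanh(np.concatenate([x, r * h], axis=-1) @ Wh)
    return (1 - z) * h + z * h_new

def eventformer_step(M_prev, events):
    pi = pos_encode(events)                             # line 1
    H_prev = attention(pi, M_prev, *W["read"])          # line 2: Read
    Z = attention(pi, pi, *W["refine"])                 # lines 3-5: Refine (self-attention)
    X = gru(Z, H_prev)                                  # line 6: current states
    M_cand = attention(M_prev, X, *W["write"])          # line 7: Write
    alpha = sigmoid(attention(M_prev, X, *W["erase"]))  # line 8: Erase gate
    return alpha * M_prev + (1 - alpha) * M_cand        # line 9: memory update

M = np.zeros((m, d))                                    # stands in for the learned M_0
for _ in range(3):                                      # three batches of incoming events
    M = eventformer_step(M, rng.random(size=(5, 2)))    # 5 events, coords in [0, 1)
print(M.shape)                                          # (4, 8)
```

Note that the memory shape (m, d) is independent of the number of events n in each batch, which is what lets the per-event update cost stay bounded.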
Operator details: An Operator(A, B) takes A ∈ R^{n×d} and B ∈ R^{m×d} as input and maps them into the Q (query), K (key), and V (value) spaces using linear transformations. The attention mechanism maps the queries to outputs as follows:

Attn(Q, K, V; a) = a(QK^T)V  (1)

where QK^T ∈ R^{n×m} and a denotes the softmax activation. Multihead attention projects Q, K, and V onto h separate d/h-dimensional spaces. The output is a linear combination of h different attentions (Attn(·, ·, ·; a)) applied to each of these h projections:

MultiHead(Q, K, V; w, a) = concatenate(O_1, ..., O_h) W_o, where O_i = Attn(Q W_i^q, K W_i^k, V W_i^v; a_i)

with learnable parameters w = {W_i^q, W_i^k, W_i^v}_{i=1}^h ∈ R^{d×(d/h)} and W_o ∈ R^{h d_v × d}. A following residual block completes the operation and generates the output O:

O = Operator(A, B) := LayerNorm(O_h + FF_o(O_h)), where O_h = LayerNorm(A + MultiHead(Q, K, V; w, a))

LayerNorm is the layer normalization function (Ba et al., 2016). Figure 4 illustrates these operations: the Read operator computes the query from π_t and the key-value pair from M_{t-1}; the Write and Erase operators compute the query from M_{t-1} and the key-value pair from X_t; and an update to the associative memory occurs as a linear combination of the new value M′_t and the past representation M_{t-1}, weighted by an elementwise scaling factor generated by the Erase operator.

Positional encoder: EventFormer generates positional embeddings by mapping the event sequence from a low-dimensional (2D) space to a higher-dimensional feature space using a learnable Fourier feature-based positional encoder. At any given time t, the positional encoder Π : x_t ∈ R^{n×2} → π_t ∈ R^{n×d}, parameterized by W_p ∈ R^{(d/2)×2}, maps a list of n 2-dimensional events x_t to a d-dimensional feature space π_t using the following equation:

π_t = Π(x_t) := (1/√d) concatenate[cos(x_t W_p^T), sin(x_t W_p^T)]

Spatial correlation: To compute a refined representation that considers spatial correlation among the events, we compute self-attention among the positional embeddings π_t using the Refine operator.
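The Operator building block above can be sketched as follows. This is a minimal sketch under assumptions: the paper does not spell out the internals of FF_o, so a two-layer ReLU feedforward is assumed here, and the toy weight scales and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, h = 6, 4, 8, 2          # toy sizes (the paper uses h = 4, d = m = 32)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multihead(A, B, Wq, Wk, Wv, Wo):
    # h heads of width d/h: queries from A, keys/values from B (Equation 1 per head)
    heads = [softmax((A @ Wq[i]) @ (B @ Wk[i]).T) @ (B @ Wv[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo

def operator(A, B, Wq, Wk, Wv, Wo, W1, W2):
    # O_h = LayerNorm(A + MultiHead(...)); O = LayerNorm(O_h + FF_o(O_h))
    O_h = layer_norm(A + multihead(A, B, Wq, Wk, Wv, Wo))
    return layer_norm(O_h + np.maximum(O_h @ W1, 0.0) @ W2)   # assumed ReLU FF_o

proj = lambda: [0.3 * rng.normal(size=(d, d // h)) for _ in range(h)]
Wq, Wk, Wv = proj(), proj(), proj()
Wo, W1, W2 = (0.3 * rng.normal(size=(d, d)) for _ in range(3))

A = rng.normal(size=(n, d))      # e.g. positional embeddings pi_t (Read queries)
B = rng.normal(size=(m, d))      # e.g. stored memory M_{t-1} (Read keys/values)
O = operator(A, B, Wq, Wk, Wv, Wo, W1, W2)
print(O.shape)                   # (6, 8): one output row per query row of A
```

The output always has one row per row of A, which is why Read(π_t, M_{t-1}) returns n retrieved states while Write(M_{t-1}, X_t) returns m memory rows.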
The efficient attention mechanism modifies Equation 1 as follows:

EfficientAttn(Q, K, V; a_q, a_k) = a_q(Q) (a_k(K)^T V)  (7)

where a_q and a_k denote the row-wise and column-wise scaled softmax activations, respectively. A closer look at Equation 7 reveals that the memory and computation complexity is reduced to O(dn + d²) and O(d²n), respectively, which is linear with respect to the number of events n. Finally, we stack multiple Refine operators to compute Z_t ∈ R^{n×d} as their higher-order interactions.

Associative memory augmented recurrent module: The proposed associative memory has a query-key association-based memory access architecture with separate association mechanisms for the state retrieval and update operations (Figure 4). We define M_t ∈ R^{m×d} as the stored memory representation at time t. The Read operator retrieves the past representation H_{t-1} ∈ R^{n×d} using the current event locations π_t. It computes the query vectors from π_t and the key and value vectors from the past stored representation M_{t-1}. The final retrieved state is a weighted sum of the projected M_{t-1}, where the weights are computed through the association between the event locations (π_t) and the abstract memory addresses (keys projected from M_{t-1}). This implies that a particular memory representation gets more weight if its corresponding address (key) has a higher similarity with the query (i.e., the positional embedding). The retrieved H_{t-1} is used as the past hidden state of the recurrent module R to compute the current spatiotemporal representation X_t ∈ R^{n×d}. We use a gated recurrent unit (GRU) as the recurrent module. The Write and Erase operators jointly update M_t with new information. The Write operator computes the query vectors from M_{t-1} and the key-value pair from X_t to query the locations of the memory that need to be updated, and generates the new representation M′_t ∈ R^{m×d}.
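Equation 7 can be sketched directly in a few lines, showing why the n × n attention map never materializes. A hedged sketch: the two variants below are not numerically equal (efficient attention is an approximation with different normalization), so only shapes and complexity are illustrated.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attn(Q, K, V):
    # standard attention (Equation 1): materializes an n x n map, O(n^2 d) compute
    return softmax(Q @ K.T, axis=-1) @ V

def efficient_attn(Q, K, V):
    # Equation 7: a_q is the row-wise softmax (over features, per query), a_k is the
    # column-wise softmax (over positions, per feature); the d x d context matrix
    # a_k(K)^T V replaces the n x n attention map, giving O(d^2 n) compute
    return softmax(Q, axis=-1) @ (softmax(K, axis=0).T @ V)

rng = np.random.default_rng(2)
n, d = 1000, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

context = softmax(K, axis=0).T @ V       # (d, d): size independent of n
out = efficient_attn(Q, K, V)            # (n, d)
print(context.shape, out.shape)
```

Because the context matrix is only d × d, doubling the number of events in a burst doubles the cost at most linearly, which is the property the Refine operator relies on for fast-moving objects.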
The purpose of the Erase operator is to calculate a set of elementwise scaling factors, α_t ∈ {R^{m×d} | 0 ≤ α_t ≤ 1}, to control the amount of update at each memory location. Additional details can be found in Appendix I. Finally, we update the memory using:

M_t = α_t M_{t-1} + (1 - α_t) M′_t  (8)

Classification head: A single feedforward layer FF_pred(·) maps the flattened memory representation M_t^flat ∈ R^{md} to the prediction vector Y_t ∈ R^c, where c is the number of classes.
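The gated update of Equation 8 and the classification head can be illustrated with a small sketch (random placeholder tensors; m, d, c are toy values, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, c = 4, 8, 5                  # toy memory size and class count

M_prev = rng.normal(size=(m, d))   # M_{t-1}: stored representation
M_cand = rng.normal(size=(m, d))   # M'_t: candidate from the Write operator
alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=(m, d))))   # Erase gate, in (0, 1)

M_t = alpha * M_prev + (1.0 - alpha) * M_cand            # Equation 8

# Each entry of M_t is a convex combination of the old and candidate entries:
# alpha -> 1 preserves the stored content, alpha -> 0 overwrites it.
assert np.all(M_t <= np.maximum(M_prev, M_cand) + 1e-12)
assert np.all(M_t >= np.minimum(M_prev, M_cand) - 1e-12)

# Classification head: a single linear layer on the flattened memory
W_pred = rng.normal(size=(m * d, c))
Y_t = M_t.reshape(-1) @ W_pred     # prediction vector in R^c
print(Y_t.shape)                   # (5,)
```

The elementwise gate is what lets the memory erase stale content at some locations while leaving others untouched, rather than overwriting the whole memory on every event batch.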

3. EXPERIMENTS AND RESULTS

Datasets and metrics: We evaluate the performance of our method on standard event-based object recognition benchmarks: N-Caltech101 and N-Cars (see Appendix D for details). We consider two metrics: recognition accuracy and the number of floating point operations required to update the representation for each new event (MFLOPs/ev).

Implementation: EventFormer follows an end-to-end feed-forward layer-based implementation. We use 32×32 and 16×16 as the memory dimensions (R^{m×d}) for the N-Caltech101 and N-Cars datasets, respectively. Unless otherwise specified, all multihead residual attention blocks use h (number of heads) = 4, and the number of stacks for the Refine operator is R = 2. Additional details, including training hyperparameters and time intervals, can be found in Appendix E.

Representation Learning in Associative Memory: We first visualize the evolution of the learned representation in the associative memory for different classes of N-Caltech101 (Figure 5). Initially, the memory representation starts from the same initial state for all classes. As events are observed over time, the memory states start to form unique patterns for different classes, as also reflected in the accompanying tSNE plot.

Temporal Update of the Learned Representation with Class Change: We study the temporal updates of the memory states when samples from two different classes are passed to EventFormer in a streaming manner (Figure 6). The memory updates its representation as the object class changes, enabling accurate classification results over time. We randomly take 3000 samples from the N-Caltech101 dataset consisting of sequential class changes (3000 unique class combinations). We plot the distribution of the number of events required for the target class output to reach a 0.6 confidence, hereafter referred to as the perception latency (Figure 6(d)). The median perception latency is ∼120 and ∼60 events for the initial and new class, respectively.
The perception latency for the initial class (red) has a wider distribution since the memory has to adapt its representation from the initial state. We also plot the mean and standard deviation of the Frobenius norm of the difference between two successive memory representations, ‖M_{t+1} - M_t‖_2, hereafter referred to as the memory update activity, for the same experiment (Figure 6(c)). Higher memory update activity, observed while developing the representation of the initial class, stabilizes with more events of the same class. A change in class also increases memory update activity, but less than in the initial case. This shows that EventFormer can adapt to new representations in real time and re-use important features already computed in the past, thereby reducing perception latency.

Temporal Update of Learned Representation with Input Transformation: We study the temporal update of the memory states while objects from the same class but with different transformations (rotation) are passed to EventFormer in a streaming manner (Figure 7). With the incoming events from the rotated sample, we observe a degradation in prediction probability due to the representation update, which stabilizes over time. We consider 3000 random samples from N-Caltech101 with combinations of two random rotations between 10 and 30 degrees. We observe that the memory update activity is less for an input transformation of the same example (Figure 7(c)) than for a completely new example (Figure 6(c)). This shows that EventFormer can preserve the learned representation of a class even under input transformation, requiring less memory update activity.

Computational Complexity: We analytically derive the computational complexity model of EventFormer (details in Appendix C). We study the FLOP contributions from the key components in Figure 8. For a single event (n=1), most of the FLOP cost comes from the memory update block due to its O(md² + md + d²) compute complexity.
The cost of memory update diminishes with larger n, while the Read and Refine operators start to dominate (Figure 8).

Ablation Study: We ablate the key components of EventFormer (Table 1). For experiment (a), which removes the associative memory, we pool X_t into a 1-dimensional vector (Vemprala et al., 2021) to deal with the variable-sized sequence, and we make the hidden size of the recurrent module m×d so that FF_pred receives the same-sized input. In experiment (b), which merges the write and erase operations, we linearly project the output of the Write operator into an R^{m×2d} space and split it into two equal-sized vectors to obtain M′_t and α_t. The results are shown in Table 1. Merging the write and erase operations marginally reduces the MFLOPs/ev, but at the expense of less flexibility during memory updates and hence lower accuracy. Without the Refine operator, the model saves more computation but lacks the higher-order interaction modeling among the events, causing a marginal accuracy loss. Finally, the absence of the associative memory yields the largest computation savings but also causes the model to lose spatial information (due to the pooling operation), resulting in the lowest performance.
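The analytical cost model discussed above (derived in full in Appendix C) can be collected into a small Python function. This is a sketch that transcribes the appendix's per-component counts; it is useful for reproducing the trend that per-event cost falls as more events are processed together, but the printed numbers are from this transcription, not the paper's tables.

```python
def eventformer_flops(n, d, m, h, r, c):
    """Per-update FLOP model following the Appendix C derivation (a sketch)."""
    assert d % h == 0
    dh = d // h
    pos = 3 * n * d                                           # positional encoder
    read = (n * d * (2 * dh - 1) + 2 * m * d * (2 * dh - 1)
            + h * m * n * (2 * dh - 1) + n * d * (2 * m - 1)
            + 2 * n * d + 2 * n * d * (2 * d - 1))            # Read operator
    gru = 6 * n * d * (2 * d - 1) + 7 * n * d                 # recurrent module
    refine = r * (3 * n * d * (2 * dh - 1) + (d * d // h) * (2 * n - 1)
                  + n * d * (2 * dh - 1) + 2 * n * d
                  + 2 * n * d * (2 * d - 1))                  # stacked Refine operators
    write_erase = 2 * (m * d * (2 * dh - 1) + 2 * n * d * (2 * dh - 1)
                       + h * m * n * (2 * dh - 1) + m * d * (2 * n - 1)
                       + 2 * m * d + 2 * m * d * (2 * d - 1)) + 3 * m * d
    out = c * (2 * m * d - 1)                                 # classification head
    return pos + read + gru + refine + write_erase + out

# N-Caltech101 setting from the paper: m = d = 32, h = 4, R = 2, c = 101
single = eventformer_flops(n=1, d=32, m=32, h=4, r=2, c=101)
batch = eventformer_flops(n=100, d=32, m=32, h=4, r=2, c=101)
print(f"{single / 1e6:.3f} MFLOPs/ev at n=1, {batch / (100 * 1e6):.3f} MFLOPs/ev at n=100")
```

The terms that do not depend on n (memory update and classification head) amortize over larger event batches, which matches the Figure 8 observation that the memory update dominates at n=1 but diminishes as n grows.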

Effect of Hyperparameters:

We study the impact of the hyperparameters R and m(=d) (Table 2). Increasing their values results in better performance but at the cost of higher MFLOPs/ev. Therefore, we keep our initial setting (R=2 and m(=d)=32) to balance accuracy and compute cost when comparing with the state-of-the-art methods.

Comparison with the state-of-the-art (SoTA): We compare EventFormer with SoTA dense and event-based methods on these two datasets (Table 3). Following previous work (Schaefer et al., 2022), we report the average MFLOPs/ev on a window of 25000 events. Methods utilizing hand-crafted features, such as H-First (Orchard et al., 2015), Gabor-SNN (Bovik et al., 1990), HOTS (Lagorce et al., 2016), HATS (Sironi et al., 2018), and DART (Ramesh et al., 2019), require fewer MFLOPs/ev while performing worse than data-driven methods. EST (Gehrig et al., 2019) and Matrix-LSTM (Cannici et al., 2020b) are synchronous methods that learn an optimal representation and achieve higher accuracy. However, they require compute-intensive feature extractors to work on the learned representations. EventFormer, in contrast, does not require a complex feature extractor, since the learned representation in the memory captures both spatial and temporal information; it achieves similar performance with 30000× less computation. YOLE (Cannici et al., 2019) and AsyNet (Messikommer et al., 2020) modify CNNs to enable asynchronous, sparse processing and reduce compute cost, but they do not consider the temporal information in the events. EvS-S (Li et al., 2021c) and AEGNN (Schaefer et al., 2022) are graph-based methods with asynchronous and efficient graph-node updates. However, they use sub-sampling to restrict the graph size from growing prohibitively large, leading to suboptimal performance.
Our method, on the other hand, inherently learns to leverage useful events to update the memory without requiring any redundant computation (such as radius search). The memory mechanism enables learning better temporal correlations than the graph-based methods. As a result, our method achieves 9% better accuracy on N-Caltech101 while being 200× more efficient compared to (Li et al., 2021c).

Computation Latency: The latency of EventFormer implemented in PyTorch on an Nvidia RTX 3090 is 4.5 ms/event on the N-Caltech101 dataset, which is 10× faster than the most efficient existing method (Schaefer et al., 2022) (52 ms/event). Unlike their method, we do not require searching over a large graph to update the representation, which reduces computational latency.

4. CONCLUSION AND FUTURE WORK

We propose EventFormer, a novel memory-augmented representation learning framework for asynchronous and efficient event-based perception. EventFormer learns to store, retrieve, and update its memory representation as a latent form of the higher-order spatiotemporal dynamics of the events, which allows it to achieve high performance with minimal compute cost. Future work includes applying EventFormer to more complex tasks, including object detection, depth estimation, and optical flow prediction.

Supplement to "Associative Memory Augmented Asynchronous Spatiotemporal Representation Learning for Event-based Perception"

A APPENDIX

In this supplementary material, we provide experimental and additional details on EventFormer.

C DERIVATION OF COMPUTATIONAL COMPLEXITY MODEL OF EVENTFORMER

Here we derive the computational complexity model of the proposed EventFormer architecture. Let n be the number of events to be processed within the time interval τ, d the dimension of the representation, m the number of rows in the memory, h the number of heads in the residual attention blocks (we use the same h in all blocks), r the number of stacks for the Refine operator, and c the number of classes for the recognition task.

(i) Positional Encoder: The positional encoder block consists of a single matrix multiplication with the input. To compute x_t W_p^T, we need 2nd multiplications and nd(2 - 1) additions, for a total of 3nd FLOPs.

(ii) Read Operator: The Read operator consists of a residual multihead attention block with h heads. Calculating the query matrix for each head requires n(d/h)² multiplications and n(d/h)(d/h - 1) additions, for a total of n(d/h)(2d/h - 1) FLOPs. Similarly, the key and value matrices together require 2m(d/h)(2d/h - 1) FLOPs. Computing QK^T requires an additional mn(2d/h - 1) FLOPs, and the product (QK^T)V requires n(d/h)(2m - 1) FLOPs. Up to this point, each head requires n(d/h)(2d/h - 1) + 2m(d/h)(2d/h - 1) + mn(2d/h - 1) + n(d/h)(2m - 1) FLOPs; for h heads, this becomes nd(2d/h - 1) + 2md(2d/h - 1) + hmn(2d/h - 1) + nd(2m - 1). The two residual connections that follow involve pointwise additions of n × d matrices, and the out-projection and the following feedforward layer involve nd(2d - 1) FLOPs each. The total FLOP count for Read(π_t, M_{t-1}) is therefore:

nd(2d/h - 1) + 2md(2d/h - 1) + hmn(2d/h - 1) + nd(2m - 1) + 2nd + 2nd(2d - 1)

(iii) Recurrent Module: The recurrent module is a GRU involving 3 matrix multiplications (update, reset, hidden) each for the current input π_t and the hidden state H_{t-1}, i.e., 6 matrix multiplications in total, followed by 5 elementwise additions and 2 elementwise multiplications.
The total computation is therefore 6nd(2d - 1) + 7nd.

(iv) Refine Operator: Unlike the Read(·,·) operator, it projects the same input to the query, key, and value spaces, requiring 3n(d/h)(2d/h - 1) FLOPs, followed by an additional (d/h)²(2n - 1) FLOPs for K^T V and n(d/h)(2d/h - 1) FLOPs for Q(K^T V). For h heads, and considering the additional cost of the residual operations and out-projections, the final cost is:

r [ 3nd(2d/h - 1) + (d²/h)(2n - 1) + nd(2d/h - 1) + 2nd + 2nd(2d - 1) ]

where r accounts for the number of stacks.

(v) Write and Erase Operators: Similar to the Read(·,·) operator, the total FLOPs for the combined Write(M_{t-1}, X_t) and Erase(M_{t-1}, X_t) operations are:

2 [ md(2d/h - 1) + 2nd(2d/h - 1) + hmn(2d/h - 1) + md(2n - 1) + 2md + 2md(2d - 1) ] + 3md

where the additional 3md accounts for the 2 pointwise additions and 1 multiplication in Equation 8.

(vi) Perception Output: The final output is calculated by a vector (∈ R^{md}) - matrix (∈ R^{md×c}) multiplication, which requires an additional c(2md - 1) FLOPs.

Total FLOPs: The total FLOP count, accounting for all of the above, is the sum of these components:

3nd + [nd(2d/h - 1) + 2md(2d/h - 1) + hmn(2d/h - 1) + nd(2m - 1) + 2nd + 2nd(2d - 1)] + [6nd(2d - 1) + 7nd] + r[3nd(2d/h - 1) + (d²/h)(2n - 1) + nd(2d/h - 1) + 2nd + 2nd(2d - 1)] + 2[md(2d/h - 1) + 2nd(2d/h - 1) + hmn(2d/h - 1) + md(2n - 1) + 2md + 2md(2d - 1)] + 3md + c(2md - 1)  (9)

D DETAILS OF THE DATASETS

N-Caltech101 (Orchard et al., 2015) is converted to event space from the standard frame-based Caltech101 dataset. It has 8246 event sequences from 101 class categories following the original Caltech101 dataset. N-Caltech101 uses an event camera to record event streams generated by moving images displayed on a monitor. N-Cars (Sironi et al., 2018) is a real-world event-camera dataset with 24039 event sequences from 2 class instances (car or background). N-Cars utilizes an event camera mounted on a moving car recording its surroundings in a real-world setting.

E TRAINING DETAILS

Data Preparation: For both datasets, we use the official test set to report results and split the remainder into 90%/10% for training and validation. The event coordinates are normalized by dividing them by the height and width (180 × 240 for N-Caltech101 and 100 × 120 for N-Cars) of the frame resolution. We use a 50 ms long sequence for each sample to reduce training complexity. Each sample is chunked by a 1 ms window, resulting in 50 sequences per sample.

Optimization: We use the standard categorical cross-entropy loss to train the network with Adam (Kingma & Ba, 2014), with batch size 128 for N-Cars and 64 for N-Caltech101, and an initial learning rate of 1e-3 that decreases by a factor of 5 every 25 epochs. The memory representation is initialized by a set of learnable parameters M_0 ∈ R^{m×d}. We also use a dropout of 0.2 on the memory representation before passing it to the final classification layer. To avoid gradient explosion during training, we use gradient clipping with a maximum gradient norm of 100.
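The data preparation described above can be sketched as follows. This is a hedged illustration on synthetic events (the paper's actual loader and event files are not shown); the 50 ms / 1 ms chunking and coordinate normalization follow the text above, while the event stream itself is randomly generated for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
H, W = 180, 240                  # N-Caltech101 frame resolution
SAMPLE_MS, CHUNK_MS = 50, 1      # 50 ms sample, 1 ms chunks -> 50 sequences

# synthetic event stream for illustration: columns are (x, y, t[ms])
n_events = 2000
events = np.column_stack([rng.integers(0, W, n_events),
                          rng.integers(0, H, n_events),
                          np.sort(rng.uniform(0, SAMPLE_MS, n_events))])

def prepare_sample(events):
    xy = events[:, :2] / np.array([W, H])            # normalize coordinates to [0, 1)
    bins = (events[:, 2] // CHUNK_MS).astype(int)    # assign each event to a 1 ms window
    return [xy[bins == b] for b in range(SAMPLE_MS // CHUNK_MS)]

chunks = prepare_sample(events)
print(len(chunks), sum(len(c) for c in chunks))      # 50 chunks covering all events
```

Each chunk is then one variable-sized event set x_t fed to EventFormer, so a single training sample unrolls the recurrent memory over 50 steps.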

F ADDITIONAL ABLATION EXPERIMENT

Impact of polarity: In our current formulation, we did not consider the polarity information of the events during event encoding. We conduct an additional experiment to evaluate the impact of polarity by taking it as an additional input alongside the positional coordinates of the events. In this formulation, we consider E_τ = {(x_i, y_i, p_i)}, where p_i denotes the polarity of the event at location (x_i, y_i). At a given time t, while we use the same positional encoder Π to process x_t, we use a separate MLP layer parameterized by W_pol ∈ R^{d×1} to map the list of n event polarities p_t to a d-dimensional feature space po_t. Finally, we add π_t and po_t together to generate a refined π_t that contains both polarity and positional information. We keep the subsequent operations unchanged. Following the same training and evaluation procedure on the N-Caltech101 dataset, this modified architecture achieves a slightly better accuracy of 0.849 (compared to the original 0.848) with negligible additional compute cost.

Motion Segmentation: Our primary focus in this work is to develop an event-based spatiotemporal representation learning framework, and we used classification as an example task to demonstrate the efficacy of our method. In this section, we further evaluate our method on motion segmentation, a dense prediction task, as an example of a task more complex than classification. Existing works (Mitrokhin et al., 2019; Sanket et al., 2020; Parameshwara et al., 2021) on such dense tasks leverage hierarchical encoder-decoder network architectures, where the encoder encodes the high-dimensional space into a compact latent space while the decoder decodes it back to its original shape. Therefore, applying EventFormer to such dense tasks requires its own decoder to reconstruct the dense output from its latent memory representation.
While designing a novel decoder architecture is out of the scope of this work, we use an existing convolution-based decoder structure for this experiment. We reshape the memory representation M ∈ R^{m×d} into M_n ∈ R^{a×b×d} (where m = a × b) to give it a 2D positional bias and pass it to the decoder as shown in Figure 9, so that end-to-end training with the decoder enables it to learn the required 2D latent representation in the encoded space.

Dataset and Experimental Setup: We train and evaluate our method on a subset of the EV-IMO dataset (Mitrokhin et al., 2019). EV-IMO consists of challenging scenarios where different objects move at varying speeds and directions; the motions are captured by a monocular event-based camera. Since our primary objective for this experiment is only to provide some insight into how to scale our method to complex tasks, we use a subset (boxes and floor) of the five available sequences to reduce training time. We center-crop the events with a 256 × 256 spatial window (i.e., we discard events that fall outside this window) and use a 20 ms long sequence for every sample. Each sequence is chunked by a 1 ms temporal window, resulting in 20 sequences per sample. We use Adam with a learning rate of 5 × 10^-3 and train for 80 epochs. We consider the Intersection over Union (IoU) and MFLOPs/ev as the performance metrics. Since all the methods (including ours) employ similar decoder structures, we compare MFLOPs/ev for the encoder part only.

For more complex and dense tasks with higher spatial resolution, EventFormer may need to scale up its memory size at the cost of additional compute overhead. We conduct an additional experiment to better understand how the computational cost increases with memory size; the result is shown in Figure 12. We observe that the compute cost increases at a significantly lower rate with higher m than with higher d.
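The memory reshaping step can be sketched as follows (the sizes are illustrative, not the values used in the experiment):

```python
import numpy as np

m, d = 256, 128                 # memory slots and feature width (illustrative)
a = b = 16                      # grid shape satisfying m = a × b
M = np.random.default_rng(0).normal(size=(m, d))

# Reshape the flat memory into a 2D grid so a convolutional decoder
# sees a spatial layout (the "2D positional bias" described above):
M_n = M.reshape(a, b, d)
```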
With m = d = 2048, our EventFormer formulation requires about 210 MFLOPs/ev, which is still 10× lower than the existing dense encoder architecture of EVDodgeNet (Sanket et al., 2020). We perform additional timing experiments to understand how latency changes with an increasing number of events; Table 6 shows the results. We observe that our latency has a highly sub-linear relationship with the number of events. Moreover, throughput increases with the number of events thanks to the parallel processing capabilities of modern hardware accelerators (a GPU in our case). We also measure the time required for a dense method (Matrix-LSTM (Cannici et al., 2020b)) to process a block of 30000 events, which is 35.93 ms on the N-Caltech101 dataset. Notably, Matrix-LSTM uses its own highly optimized CUDA kernels to significantly reduce its latency. While developing such an optimized kernel for EventFormer is out of the scope of this work, we expect a similar latency reduction for our method through such software optimization.
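A minimal timing harness along the lines of these measurements might look like this; the `model` callable and batch construction are placeholders, not the EventFormer implementation:

```python
import time

def measure_latency(model, event_batches, warmup=3, repeats=10):
    """Report mean latency (ms) and throughput (events/s) per batch size.

    `model` is any callable taking a list of events. Warm-up runs are
    discarded to exclude one-time costs (JIT compilation, cache fills)."""
    results = {}
    for events in event_batches:
        for _ in range(warmup):
            model(events)
        t0 = time.perf_counter()
        for _ in range(repeats):
            model(events)
        dt = (time.perf_counter() - t0) / repeats
        results[len(events)] = {
            "latency_ms": dt * 1e3,
            "throughput_ev_per_s": len(events) / dt,
        }
    return results
```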



Figure 1: Comparison with existing works. (a) Event-based perception algorithms, where perception latency k∆ is proportional to the event-generation rate (∆). (b) Existing works include (left) dense processing that aggregates events into a frame and uses a CNN, and event-based processing using either (mid) a GNN on a spatiotemporal event-graph or (right) PointNet-like architectures treating events as a point cloud. These methods either re-process (frame and point cloud) or store (graph) the past events to spatiotemporally correlate them with new events. (c) EventFormer encodes the spatiotemporal interactions of the past events into a compact latent memory space to efficiently retrieve and correlate them with new events, without needing to store or re-process the past events.

Figure 3: EventFormer architecture. EventFormer generates positional embeddings π_t from a list of new events x_t at time t and calculates their pairwise interactions Z_t using self-attention. A recurrent unit extracts the past states H_{t-1} associated with x_t from an associative memory and computes the current states X_t. These are further encoded into the latent memory space M_t to process future events. A linear layer uses M_t to predict the target class.

Figure 4: Details of the memory operations. (a) Block diagram of the multihead residual attention, (b) composition of an Operator(A, B), (c) the Read operator calculates queries from the positional embedding π_t and key-value pairs from the past memory representation M_{t-1}, while the Write and Erase operators compute queries from M_{t-1} and key-value pairs from X_t, and (d) an update to the associative memory is a linear combination of the new value to be added, M′_t, and the past representation M_{t-1}, weighted by an element-wise scaling factor generated by the Erase operator.

Figure 5: Temporal evolution of memory representation. (a) and (b) Change in memory representation over time for samples from two different classes. (c) tSNE plot of the memory representation for 10 randomly selected classes from N-Caltech101 dataset.

Figure 6: Memory representation dynamics over sequential class update. (a) Change in prediction probability of the target class over time. (b) Change in memory representation when class changes occur. (c) Average memory update activity for a large-scale experiment similar to (a) and (b), and (d) their corresponding perception latency distribution.

Figure 7: Memory representation dynamics over sequential input transformation. (a) Change in prediction probability of the target class over time. (b) Change in memory representation when transformed input appears. (c) Average memory update activity for a large-scale experiment similar to (a) and (b), and (d) their corresponding perception latency distribution.

Figure 5(c) shows that the learned representations for various classes form distinct clusters, enabling separation of the class boundaries even with a relatively simple classifier head.

Figure 8: FLOP analysis. (a) Normalized FLOP contribution from the key components of Event-Former (for n=1). (b) Change in relative compute cost for the Operators with increasing n.

Figure 9: Network architectures for dense prediction tasks. (a) Convolution-based hierarchical encoder-decoder structure with skip connections in between. (b) Possible adaptation of EventFormer to work with existing decoder structure with no skip-connection.

Figure 12: Scaling of EventFormer compute cost with increasing m and d from 16 to 2048.

Performance comparison with state-of-the-art event-based and dense methods.

Different notations used to describe the operations of EventFormer

G FEASIBILITY STUDY OF APPLYING EVENTFORMER ON COMPLEX TASK

Table 6: EventFormer latency and throughput with an increasing number of events.


Figure 11: Incremental prediction performance comparison. Our method converges to its peak performance significantly faster than the existing methods.

Preliminary Performance: We compare our performance with existing state-of-the-art dense and event-based motion-segmentation methods (Table 5). Both EV-IMO (Mitrokhin et al., 2019) and EVDodgeNet (Sanket et al., 2020) are dense methods. We observe that their better performance comes at the cost of much higher computational complexity. Our method achieves comparable performance with SpikeMS while being 115× more efficient. We also observe that the performance of our method improves with higher memory capacity across the space. Another interesting observation is that the performance gain from a higher m is much larger than from a higher d, which implies that representations with more fine-grained spatial information are necessary for dense prediction tasks. This is also beneficial in terms of computational cost, since our FLOP model has a sub-linear relationship with m.

Our unstructured formulation enables us to perform incremental prediction even for dense tasks. Figure 11 shows the comparison with existing methods on such incremental prediction. For the dense methods, we integrate events over incremental time windows starting from 1 ms up to 20 ms at 1 ms intervals. As shown, our method outperforms both methods at the early stage, requiring far fewer events. We also visualize the qualitative performance of our method in Figure 10. Although our method achieves high IoU very quickly, it fails to capture the local details of the object. This is because, unlike existing methods, our decoder cannot utilize any skip connections. Such skip connections help existing encoder-decoder architectures recover the local spatial information that may have been lost during encoding (Drozdzal et al., 2016).
These results indicate that existing decoder architectures may not be optimal for our representation, and novel architectural innovations are required in this regard. We leave this as a potential future research direction.
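For reference, the IoU metric used throughout this section can be computed for binary masks as follows (a minimal sketch; the eps guard for empty masks is our addition):

```python
import numpy as np

def iou(pred, target, eps=1e-8):
    """Binary-mask Intersection over Union: |pred ∧ target| / |pred ∨ target|."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / (union + eps)
```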

I DETAILS OF THE MEMORY OPERATORS

Read: During the Read operation, we want to know: what are the past states at the current event locations? To do so, we use the multihead residual attention block to perform query-key based association in the memory. Specifically, we query the past memory representation M_{t-1} using the positional embedding of the current event locations, π_t. The complete retrieval of the past hidden representation H_{t-1} involves the following operations:

H_{t-1} = π_t + MHA(Q_r, K_r, V_r), where Q_r = π_t W^r_q, K_r = M_{t-1} W^r_k, V_r = M_{t-1} W^r_v.

Here, Q_r represents the query vectors calculated from π_t, and K_r, V_r represent the key and value vectors computed from M_{t-1}.

Write and Erase: Similar to the Read operator, we adopt a residual multihead attention block for the Write operator to calculate the new memory representation, M′_t. However, this time we compute the query vectors from M_{t-1} and the key-value pairs from the refined spatiotemporal representation X_t. The idea is that we query the locations of the memory that need to be updated, while the contents to be written are provided by the new representation:

M′_t = M_{t-1} + MHA(Q_w, K_w, V_w), where Q_w = M_{t-1} W^w_q, K_w = X_t W^w_k, V_w = X_t W^w_v.

Here, Q_w represents the query vectors calculated from M_{t-1}, and K_w, V_w represent the key and value vectors computed from X_t. We also introduce the Erase operator (which follows the same operations as the Write operator) to calculate a set of element-wise scaling factors, α_t ∈ {R^{m×d} | 0 ≤ α_t ≤ 1}, that control the strength of the update.

J LATENCY COMPARISON

A direct comparison of the latency of our method with others in Table 7 is challenging, as different methods report their latency measurements using different hardware configurations and software setups. However, we still show the latency/event reported in the original papers along with the corresponding hardware configurations. A common methodology was used in all these works: the average latency/event is the average run time per sample divided by the average number of events per sample.
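A single-head NumPy sketch may help make the Read operator from Appendix I concrete; head splitting, normalization layers, and the exact weight shapes are assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def read(pi_t, M_prev, Wq, Wk, Wv):
    """Single-head sketch of the Read operator: query the past memory
    M_{t-1} with the positional embeddings π_t of the current events
    and retrieve the associated past states H_{t-1}."""
    Q = pi_t @ Wq                                  # queries, one per event (n × d)
    K = M_prev @ Wk                                # keys from past memory   (m × d)
    V = M_prev @ Wv                                # values from past memory (m × d)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # soft memory addressing  (n × m)
    return pi_t + A @ V                            # residual path around attention
```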

