ASSOCIATIVE MEMORY-AUGMENTED ASYNCHRONOUS SPATIOTEMPORAL REPRESENTATION LEARNING FOR EVENT-BASED PERCEPTION

Abstract

We propose EventFormer, a computationally efficient event-based representation learning framework for asynchronously processing event camera data. EventFormer treats sparse input events as a spatially unordered set and models their spatial interactions using a self-attention mechanism. An associative memory-augmented recurrent module correlates new events with the stored representation computed from past events. A memory addressing mechanism is proposed to store and retrieve the latent states only where events occur and to update them only when they occur. Shifting representation learning from the input space to the latent memory space reduces the computation cost of processing each event. We show that EventFormer achieves 0.5% and 9% better accuracy with 30000× and 200× less computation compared to the state-of-the-art dense and event-based methods, respectively, on event-based object recognition datasets.

1. INTRODUCTION

Ultra-low power, high dynamic range (> 120 dB), high temporal resolution, and low latency make event-based cameras (Brandli et al., 2014; Suh et al., 2020; Son et al., 2017; Finateu et al., 2020) attractive for real-time machine vision applications such as robotics and autonomous driving (Falanga et al., 2020; Hagenaars et al., 2020; Sun et al., 2021; Zhu et al., 2018; Gehrig et al., 2021). Convolutional and recurrent neural network-based methods, originally developed for frame-based cameras, have demonstrated good perception accuracy on event camera data (Gehrig et al., 2019; Baldwin et al., 2022; Cannici et al., 2020b). However, they rely on temporal aggregation of events to create a frame-like dense representation as input, thereby discarding the inherent sparsity of event data and resulting in high computational cost (Figure 2). Recent works have explored event-based processing methods for object recognition to exploit data sparsity. Examples include time-surface-based representations relying on hand-crafted features (Lagorce et al., 2016; Sironi et al., 2018; Ramesh et al., 2019), 3D space-time event clouds (Wang et al., 2019), and graph-based methods (Li et al., 2021c; Schaefer et al., 2022). These methods adopt event-based processing to achieve lower computational cost but do not match the performance of dense-representation-based methods (Figure 2). This necessitates computationally efficient algorithms that exploit sparsity while achieving high accuracy. We propose an associative memory-augmented asynchronous representation learning framework for event-based perception, hereafter referred to as EventFormer, that enables computationally efficient event-based processing with high performance (Figure 1).
As events are triggered asynchronously, an event-based processing algorithm must generate and maintain a higher-order representation from the events and efficiently update that representation to correlate a new event with past events across space and time. One way to address this is to include a recurrent module at each pixel to track the history of past events (Cannici et al., 2020a). However, the processing and memory requirements of such a method grow with the number of pixels. Motivated by recent works on memory-augmented neural networks (Kumar et al., 2016; Ma et al., 2018; Karunaratne et al., 2021), we address the preceding challenge by maintaining a spatiotemporal representation associated with past events, occurring at various pixels, as the hidden states of an associative memory. As learning spatiotemporal correlation is shifted from the high-dimensional input (pixel) space to a compact latent memory space, EventFormer requires an order of magnitude fewer floating-point operations (FLOPs) per event to update the memory. To the best of our knowledge, EventFormer is the first associative memory-augmented spatiotemporal representation learning method for event-based perception.
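To make the "store and retrieve only where events occur" idea concrete, the toy sketch below keeps latent states in a dictionary keyed by pixel coordinate, so memory footprint and per-event work scale with the number of active pixels rather than the full sensor resolution. This is an illustrative simplification, not the learned query-key addressing used in EventFormer; `state_dim` and the blending update rule are placeholders of our own.

```python
import numpy as np

class SparseEventMemory:
    """Toy latent memory: one state vector per *active* pixel only."""

    def __init__(self, state_dim=8):
        self.state_dim = state_dim
        self.states = {}  # (x, y) -> latent state vector

    def read(self, x, y):
        # Retrieve the stored state, or a zero state for unseen pixels.
        return self.states.get((x, y), np.zeros(self.state_dim))

    def write(self, x, y, event_feat):
        # Placeholder update: blend the old state with the new event feature.
        old = self.read(x, y)
        self.states[(x, y)] = 0.9 * old + 0.1 * event_feat

mem = SparseEventMemory(state_dim=4)
mem.write(3, 7, np.ones(4))  # only pixel (3, 7) now occupies memory
```

Updates touch only the entries for pixels that actually fired, which is the "where" and "when" gating described above in miniature.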
The key contributions of this paper are:

• EventFormer maps the spatiotemporal representation of the incoming event stream into the hidden states of an associative memory and uses a lightweight perception head operating directly on these states to generate perception decisions.

• The spatiotemporal representation update mechanism activates only 'when' and only 'where' there is a new event, using unstructured set-based processing without storing or re-processing past events.

• We propose a new query-key association-based memory access mechanism that enables spatial location-aware memory access to retrieve the past states of the current event locations.

Given a new event (or a set of events), our model generates a spatial representation by computing the events' spatial interactions through a self-attention mechanism (Vaswani et al., 2017) and retrieves the past spatiotemporal states related to the corresponding pixel location(s) from the memory. The retrieval process exploits a novel location-based query and content-based key-value association mechanism to extract correlations among past events at neighboring pixels. A recurrent module then takes the retrieved past states, along with the present spatial representation, as input and generates refined state information, and the associated hidden states of the memory are updated with this new information. We evaluate EventFormer on the object recognition task from event-camera data. In our experiments on the existing benchmark datasets N-Caltech101 (Orchard et al., 2015) and N-Cars (Sironi et al., 2018), EventFormer shows an excellent computational advantage over existing methods (both dense and event-based) while achieving performance comparable to the dense methods (Figure 2).

Related Work: Dense methods convert events to a dense, frame-like representation and process them with standard deep learning models such as CNNs (Maqueda et al., 2018; Gehrig et al., 2019; Cannici et al., 2020a).
These methods are synchronous in nature, as they generate dense inputs by binning and aggregating events in time and generate output only when the entire bin is processed. Event-based methods, in contrast, update their representations with each new event and generate new output. Methods such as (Lagorce et al., 2016; Sironi et al., 2018; Ramesh et al., 2019) compute a time-ordered representation (also known as a time-surface) in an event-based manner with fewer computations. However, their reliance on fixed, hand-tuned representations results in sub-optimal performance.
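For reference, a time-surface in the style of Lagorce et al. (2016) keeps the timestamp of the most recent event at each pixel and exponentially decays it toward a reference time. The sketch below is a minimal illustrative version; the variable names and the decay constant `tau` are our own choices, not taken from any cited implementation.

```python
import numpy as np

def time_surface(events, width, height, t_ref, tau):
    """Minimal time-surface: exp(-(t_ref - t_last) / tau) at each pixel.

    events: iterable of (x, y, t) tuples, assumed sorted by t with t <= t_ref.
    tau: decay time constant, in the same units as the timestamps.
    """
    t_last = np.full((height, width), -np.inf)
    for x, y, t in events:
        t_last[y, x] = t  # keep the most recent event time per pixel
    surface = np.exp(-(t_ref - t_last) / tau)
    surface[np.isinf(t_last)] = 0.0  # pixels that never fired
    return surface

events = [(0, 0, 0.0), (1, 0, 100.0), (0, 0, 200.0)]
ts = time_surface(events, width=2, height=1, t_ref=200.0, tau=100.0)
# pixel (0,0) fired at t_ref: exp(0) = 1.0; pixel (1,0): exp(-1)
```

The hand-tuned part criticized above is visible here: `tau` is fixed ahead of time rather than learned from data.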



Figure 1: Comparison with existing works. (a) Event-based perception algorithms, where perception latency k∆ is proportional to the event-generation rate (∆). (b) Existing works include (left) dense processing that aggregates events into a frame and uses a CNN, and event-based processing including (mid) a GNN on a spatiotemporal event-graph or (right) PointNet-like architectures treating events as a point cloud. These methods either re-process (frame and point cloud) or store (graph) the past events to spatiotemporally correlate them with new events. (c) EventFormer encodes the spatiotemporal interaction of past events into a compact latent memory space to efficiently retrieve and correlate with new events without storing or re-processing the past events.
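The set-based spatial interaction step described in Section 1 treats concurrent events as an unordered set, so its output must not depend on event ordering. The single-head sketch below illustrates this property; the random weights and the 8-dimensional embedding are placeholders, not the learned multi-head attention used in EventFormer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def event_self_attention(feats, Wq, Wk, Wv):
    """Single-head self-attention over an unordered set of N event features.

    feats: (N, d) array, one embedded (x, y, t, p) event per row.
    Permuting the rows permutes the outputs identically (equivariance).
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (N, N)
    return attn @ V  # each event attends to all co-occurring events

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))  # 5 events, 8-dim placeholder embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = event_self_attention(feats, Wq, Wk, Wv)  # shape (5, 8)
```

Because attention scores are computed pairwise and normalized per row, shuffling the input events simply shuffles the output rows, which is what makes set-based (rather than frame-based) processing of sparse events well-defined.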

