ASSOCIATIVE MEMORY AUGMENTED ASYNCHRONOUS SPATIOTEMPORAL REPRESENTATION LEARNING FOR EVENT-BASED PERCEPTION

Abstract

We propose EventFormer, a computationally efficient event-based representation learning framework for asynchronously processing event camera data. EventFormer treats sparse input events as a spatially unordered set and models their spatial interactions using a self-attention mechanism. An associative memory-augmented recurrent module correlates incoming events with the stored representation computed from past events. A memory addressing mechanism is proposed to store and retrieve the latent states only where events occur and to update them only when they occur. Shifting representation learning from the input space to the latent memory space reduces the computation cost of processing each event. We show that EventFormer achieves 0.5% and 9% better accuracy with 30000× and 200× less computation compared to the state-of-the-art dense and event-based methods, respectively, on event-based object recognition datasets.

1. INTRODUCTION

Ultra-low power consumption, high dynamic range (> 120 dB), high temporal resolution, and low latency make event-based cameras (Brandli et al., 2014; Suh et al., 2020; Son et al., 2017; Finateu et al., 2020) attractive for real-time machine vision applications such as robotics and autonomous driving (Falanga et al., 2020; Hagenaars et al., 2020; Sun et al., 2021; Zhu et al., 2018; Gehrig et al., 2021). Convolutional and recurrent neural network-based methods, originally developed for frame-based cameras, have demonstrated good perception accuracy on event camera data (Gehrig et al., 2019; Baldwin et al., 2022; Cannici et al., 2020b). However, they rely on temporal aggregation of events to create a frame-like dense representation as input, thereby discarding the inherent sparsity of event data and incurring high computational cost (Figure 2). Recent works have explored event-based processing methods for object recognition to exploit data sparsity. Examples include time-surface-based representations relying on hand-crafted features (Lagorce et al., 2016; Sironi et al., 2018; Ramesh et al., 2019), 3D space-time event clouds (Wang et al., 2019), and graph-based methods (Li et al., 2021c; Schaefer et al., 2022). These methods adopt event-based processing to lower computational cost but do not match the performance of dense-representation-based methods (Figure 2). This necessitates computationally efficient algorithms that exploit sparsity while achieving high accuracy. We propose an associative memory-augmented asynchronous representation learning framework for event-based perception, hereafter referred to as EventFormer, that enables computationally efficient event-based processing with high performance (Figure 1).
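To make the set-based view concrete, the following is a minimal sketch (not the authors' implementation) of self-attention applied to a batch of events treated as a spatially unordered set; the event encoding (x, y, t, polarity), dimensions, and randomly initialized projection matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N events, each encoded as (x, y, t, polarity).
N, d_in, d_model = 16, 4, 32
events = rng.standard_normal((N, d_in))  # unordered event set

# Projection matrices stand in for learned weights.
W_embed = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

x = events @ W_embed                        # (N, d_model) event embeddings
q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = softmax(q @ k.T / np.sqrt(d_model))  # pairwise interactions between events
out = attn @ v                              # (N, d_model) attended representation
```

Because attention weights depend only on pairwise similarity, the output is equivariant to any permutation of the input events, which is what makes the unordered-set treatment viable.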
As events are triggered asynchronously, an event-based processing algorithm must generate and maintain a higher-order representation from the events and efficiently update that representation to correlate a new event with past events across space and time. One way to address this is to include a recurrent module at each pixel to track the history of past events (Cannici et al., 2020a). However, the processing and memory requirements of such a method scale with the number of pixels. Motivated by recent works in memory-augmented neural networks (Kumar et al., 2016; Ma et al.,

