MEMONAV: WORKING MEMORY MODEL FOR VISUAL NAVIGATION

Abstract

We present MemoNav, a novel memory model for image-goal navigation, which utilizes a working-memory-inspired pipeline to improve navigation performance. Specifically, the node features on the topological map are stored in short-term memory (STM), as these features are dynamically updated. The MemoNav retains the informative fraction of the STM via a forgetting module to improve navigation efficiency. To learn a global representation of 3D scenes, we introduce long-term memory (LTM) that continuously aggregates the STM. Afterward, a graph attention module encodes the retained STM and the LTM to generate working memory (WM). After encoding, the WM contains the informative features of the retained STM and the scene-level feature of the LTM, and is finally used to generate actions. Consequently, the synergy of these three types of memory improves navigation performance by selectively retaining goal-relevant information and learning a high-level scene feature. When evaluated on multi-goal tasks, the MemoNav outperforms the SoTA methods at all difficulty levels in both Gibson and Matterport3D scenes. The MemoNav also achieves consistent improvements on traditional 1-goal tasks. Moreover, the qualitative results show that our model is less likely to be trapped in a deadlock.

1. INTRODUCTION

This paper studies the image-goal navigation (ImageNav) problem, which aims to steer an agent towards a destination specified by a goal image in unseen environments. ImageNav has recently received much attention due to its promising applications in open-world robot navigation. Scene memory is essential for ImageNav, as it provides indispensable historical information for decision-making in unseen environments (Savinov et al., 2018). During navigation, this memory typically stores both scene features and the agent's navigation history (Kwon et al., 2021). These two types of information in turn help the agent generate more reasonable navigation actions by lessening the negative impact of partial observability (Parisotto & Salakhutdinov, 2018).

In the literature, various memory mechanisms have been introduced for ImageNav, which can be classified into three categories according to memory structure: (a) metric map-based methods (Chaplot et al., 2020a; Chen et al., 2019) that reconstruct local top-down maps and aggregate them into a global map, (b) stacked memory-based methods (Pashevich et al., 2021; Mezghani et al., 2021; Fang et al., 2019) that stack past observations in chronological order, and (c) topological map-based methods (Kwon et al., 2021; Chaplot et al., 2020b; Beeching et al., 2020; Savinov et al., 2018) that store sparse landmark features in graph nodes. Topological map-based methods benefit from the memory sparsity of topological maps and have achieved impressive performance in ImageNav. However, existing topological map-based methods still suffer from two major limitations: (a) Unawareness of useful nodes. They generally use all node features for generating actions without considering the contribution of each node, and are thus easily misled by redundant nodes that are uninformative of the goal. (b) Local representation.
Each node feature only represents a small local area in a large scene, which restricts the agent's ability to learn a higher-level semantic and geometric representation of the scene.

To overcome these two limitations, we present a novel ImageNav method named MemoNav, which is motivated by the classical concept of working memory in cognitive neuroscience (Cowan, 2008) and is in loose analogy with the working memory model in human navigation (Blacker et al., 2017). The MemoNav learns three types of scene representations: (a) Short-term memory (STM) represents the local and transient features of the nodes in a topological map. (b) Long-term memory (LTM) is a global node that learns a scene-level representation by continuously aggregating the STM. (c) Working memory (WM) learns goal-relevant features of 3D scenes and is used by a policy network to generate actions. The WM is formed by encoding the informative fraction of the STM together with the LTM.

Based on the above three representations, the MemoNav navigation pipeline contains five steps: (1) STM generation. The map update module stores landmark features on the map as the STM. (2) Selective forgetting. To incorporate goal-relevant STM into the WM, a forgetting module temporarily removes the nodes whose attention scores, assigned by a memory decoder, rank below a predefined percentage. After forgetting, the navigation pipeline does not process the forgotten node features at subsequent time steps. (3) LTM generation. To assist the STM, we add a global node to the map as the LTM. The global node links to all map nodes and continuously aggregates the features of these nodes at each time step. (4) WM generation. A graph attention module encodes the retained STM and the LTM to generate the WM. The WM combines the goal-relevant information in the retained STM with the scene-level feature in the LTM, thus providing the agent with informative scene representations that improve navigation performance. (5) Action generation.
Two Transformer decoders use the embeddings of the goal image and the current observation to decode the WM. The decoded features are then used to generate navigation actions. Consequently, with the synergy of the three representations, the MemoNav noticeably outperforms the SoTA method (Kwon et al., 2021) in the Gibson scenes (Xia et al., 2018), increasing the navigation success rate by approximately 2.9%, 1.4%, 2.4%, and 1.7% on the 1-, 2-, 3-, and 4-goal test datasets, respectively. The comparison in the Matterport3D scenes (Chang et al., 2017) shows that the MemoNav also exhibits better transferability.

The main contributions of this paper are as follows:
• We propose the MemoNav, which learns three types of scene representations (STM, LTM, and WM) to improve navigation performance on the ImageNav task.
• We use a forgetting module to retain only the informative fraction of the STM, thereby reducing redundancy in the map and improving navigation efficiency. We also introduce a global node as the LTM; the LTM connects to all nodes in the STM and learns a scene-level representation that provides the agent with a global view.
• We adopt a graph attention module to generate the WM from the retained STM and the LTM. This module flexibly adjusts the weights used for aggregating node features, which helps the agent use an adaptive WM to improve performance.
• The experimental results demonstrate that our model outperforms the SoTA baseline on both 1-goal and multi-goal tasks in two popular scene datasets.
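As a loose, self-contained illustration of pipeline steps (2)–(4), the sketch below prunes STM nodes by attention-score rank, aggregates the remaining map into a global LTM vector, and mixes both through one attention pass. All function names, the keep ratio, the mean aggregation for the LTM, and the dot-product attention scoring are illustrative simplifications; the paper's actual modules are learned networks, not these hand-written operations.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forget(nodes, attn_scores, keep_ratio):
    """Step (2), simplified: retain only the top-ranked fraction of STM nodes."""
    n_keep = max(1, math.ceil(len(nodes) * keep_ratio))
    ranked = sorted(range(len(nodes)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original node order
    return [nodes[i] for i in kept]

def aggregate_ltm(nodes):
    """Step (3), simplified: the global node as a running aggregate (a plain mean here)."""
    d = len(nodes[0])
    return [sum(n[k] for n in nodes) / len(nodes) for k in range(d)]

def working_memory(retained, ltm):
    """Step (4), simplified: one attention pass over the retained STM plus the
    LTM node; each output feature is an attention-weighted mix of all nodes."""
    feats = retained + [ltm]
    out = []
    for fi in feats:
        scores = [dot(fi, fj) / math.sqrt(len(fi)) for fj in feats]
        w = softmax(scores)
        out.append([sum(wk * fj[d] for wk, fj in zip(w, feats))
                    for d in range(len(fi))])
    return out

# Toy 2-D node features and decoder attention scores (hypothetical values).
stm = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
scores = [0.4, 0.1, 0.2, 0.3]
retained = forget(stm, scores, keep_ratio=0.75)  # the lowest-scoring node is dropped
wm = working_memory(retained, aggregate_ltm(stm))
```

The point of the sketch is the data flow: the forgotten node no longer participates in any later computation, while the LTM vector still summarizes the full map, so the WM keeps a global view even after pruning.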

2. RELATED WORK

ImageNav methods. Since an early attempt (Zhu et al., 2017) to train agents in a simulator for ImageNav, rapid progress has been made on this task (Beeching et al., 2020; Chen et al., 2021; Wasserman et al., 2022; Al-Halah et al., 2022). Several methods have utilized topological scene representations for visual navigation, of which SPTM (Savinov et al., 2018) is an early work. NTS (Chaplot et al., 2020b) and VGM (Kwon et al., 2021) incrementally build a topological map during navigation and generalize to unseen environments without exploring the scenes in advance. These methods utilize all features in the map, whereas the MemoNav flexibly utilizes only the informative fraction of these features. Another line of work (Yadav et al., 2022; Majumdar et al., 2022) has introduced self-supervised learning to enhance the scene representations, achieving a promising navigation success rate. In contrast, we enhance the scene representations using a global node that aggregates the agent's local observation features.

Memory mechanisms for reinforcement learning. Several studies (Ritter et al., 2021; Lampinen et al., 2021; Sukhbaatar et al., 2021; Loynd et al., 2020) draw inspiration from memory mechanisms of the human brain and design reinforcement learning models for reasoning over long time horizons. Ritter et al. (2021) proposed an episodic memory storing state transitions for navigation.

