ACTIVE TOPOLOGICAL MAPPING BY METRIC-FREE EXPLORATION VIA TASK AND MOTION IMITATION
Anonymous

Abstract

A topological map is an effective environment representation for visual navigation. It is a graph of image nodes and spatial neighborhood edges without metric information such as global or relative agent poses. However, constructing such a map currently relies on either inefficient random exploration or more demanding training that involves metric information. To overcome these issues, we propose active topological mapping (ATM), consisting of active visual exploration and topological mapping by visual place recognition. Our main novelty is a simple and lightweight active exploration policy that works entirely in the image feature space and involves no metric information. More specifically, ATM's metric-free exploration is based on task and motion planning (TAMP). The task planner is a recurrent neural network that uses the latest local image observation sequence to hallucinate a feature as the best next-step exploration goal. The motion planner then fuses the current and the hallucinated features to generate an action that takes the agent towards the hallucinated feature goal. The two planners are jointly trained via deeply-supervised imitation learning from expert exploration demonstrations. Extensive experiments in both exploration and navigation tasks on the photo-realistic Gibson and MP3D datasets validate ATM's effectiveness and generalizability.

1. INTRODUCTION

Mobile agents often create maps to represent their surrounding environments [6]. Typically, such a map is either topological or metric (including hybrid ones). We consider a topological map to be metric-free, which means it does not explicitly store global or relative position and orientation information with measurable geometric accuracy [39, 38]. Instead, it is a graph that stores local sensor observations, such as RGB images, as graph nodes and the spatial neighborhood structure (and often navigation actions) as graph edges that connect observations taken from nearby locations. While metric maps are often reconstructed by optimizing geometric constraints between landmarks and sensor poses, as in classic simultaneous localization and mapping (SLAM), topological maps have recently attracted attention in visual navigation tasks due to their simplicity, flexibility, scalability, and interpretability [58, 13, 27, 40, 12]. A topological map used for visual navigation can be constructed in two ways. The first and simplest way is to let the agent explore the new environment through metric-free random walks, after which the map is built by projecting the recorded observations into a feature space and adding edges between nearby or sequentially obtained features [58]. However, random walks are very inefficient, especially in large or complex rooms, and lead to repeated revisits of nearby locations in the same area. The other way is to design a navigation policy that controls the agent to explore the area more effectively while creating the map. This is known as active SLAM and often involves metric information as either required input [42, 12] or intermediate estimates [13]. As illustrated in Fig. 1, we ask: can we combine the merits of the two ways by finding a metric-free exploration policy (one that uses metric information neither as input nor as an intermediate estimate) that discovers informative traversal trajectories in unknown environments, so that a topological map can be constructed after exploration?
To achieve this objective, we propose Active Topological Mapping (ATM), as shown in Fig. 2. It contains two stages: active exploration through a learned metric-free policy, and topological mapping through visual place recognition (VPR) [51]. The first stage adopts the task and motion planning (TAMP) formalism [26, 55] and imitation learning [63] from expert demonstrations, which could come from either an oracle policy with full access to virtual environments or simply a human expert in the real world. Our main novelty is to design this imitation at both the task and the motion levels with joint end-to-end training. Our task planner, a two-layer LSTM [31] network trained with deep supervision, hallucinates the next best goal feature to explore from the current and historical image features. Our motion planner, a simple multi-layer perceptron, fuses the current and the hallucinated features and generates the best action, that is, the one that moves the agent to a location whose feature is closer to the goal. The second stage of ATM takes all observations recorded during the active exploration stage to create the topological map. This stage could be solved similarly to [58], where nodes are first connected by the sequential visiting order, and additional node connections are then discovered by a binary classifier that estimates the spatial adjacency between two nodes from their image similarity.
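The data flow of the first stage can be sketched as a short control loop. This is only an illustrative sketch, not the trained system: the names `explore`, `LineWorld`, `toy_task_planner`, and `toy_motion_planner` are hypothetical, and the two hand-written planners stand in for the two-layer LSTM and the multi-layer perceptron described above. The toy feature is a single number chosen so that the behaviour is easy to check; the policy itself only ever reads features, never poses.

```python
from typing import Callable, List, Sequence

Feature = List[float]


def explore(
    observe: Callable[[], Feature],
    act: Callable[[int], None],
    task_planner: Callable[[Sequence[Feature]], Feature],
    motion_planner: Callable[[Feature, Feature], int],
    steps: int,
    history_len: int = 10,  # assumed window size, not taken from the paper
) -> List[Feature]:
    """Run a feature-space exploration loop and return the recorded features."""
    recorded: List[Feature] = []
    for _ in range(steps):
        current = observe()
        recorded.append(current)
        # Task level: hallucinate a goal feature from the recent feature history.
        goal = task_planner(recorded[-history_len:])
        # Motion level: fuse current and goal features into a discrete action.
        action = motion_planner(current, goal)
        act(action)
    return recorded


class LineWorld:
    """Toy 1-D environment (hypothetical); action 0 moves forward, 1 backward."""

    def __init__(self) -> None:
        self._pos = 0  # hidden state, never exposed to the planners directly

    def observe(self) -> Feature:
        return [float(self._pos)]

    def act(self, action: int) -> None:
        self._pos += 1 if action == 0 else -1


def toy_task_planner(history: Sequence[Feature]) -> Feature:
    # Stand-in for the recurrent task planner: propose a feature one unit ahead.
    return [history[-1][0] + 1.0]


def toy_motion_planner(current: Feature, goal: Feature) -> int:
    # Stand-in for the MLP motion planner: pick the action that approaches the goal.
    return 0 if goal[0] > current[0] else 1


world = LineWorld()
trajectory = explore(
    world.observe, world.act, toy_task_planner, toy_motion_planner, steps=5
)
```

In a learned version, the two stand-in functions would be replaced by networks trained jointly from expert demonstrations; the loop structure itself would stay the same.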
Unlike [58], we adopt VPR, a classic technique in SLAM for loop closure detection, to discover additional edges more effectively. We further train an action assigner to assign each newly added edge the corresponding actions that move the agent between the two connected nodes. Finally, the topological map becomes our efficient environment representation for visual navigation, as in [58]. We validate the efficacy of ATM on two tasks: exploration, in which the goal is to maximize the explored area within a fixed step budget, and navigation, in which the goal is to use the ATM-constructed topological map to navigate the agent to a target image. In summary, our contributions are:
• We propose a simple and effective framework named active topological mapping (ATM) for efficient and lightweight visual exploration. The topological map it constructs can be used for efficient visual navigation.
• We develop jointly trainable feature-space task and motion planning (TAMP) networks to achieve metric-free and generalizable exploration.
• We design a deeply-supervised imitation learning strategy to train the feature-space TAMP networks with better data efficiency.
• We validate our method on the photo-realistic Gibson [72] and MP3D [9] datasets in both visual exploration and navigation.
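The second-stage map construction and its use for navigation can be illustrated with a small sketch. It makes several assumptions that are not from the paper: a thresholded cosine similarity stands in for a real VPR model, the function names and the `sim_threshold` and `min_gap` parameters are hypothetical, and edges carry no actions (the action assigner is omitted). Breadth-first search is used here simply as one way to plan over the resulting graph.

```python
import math
from collections import deque
from typing import Dict, List, Optional, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_topological_map(
    features: Sequence[Sequence[float]],
    sim_threshold: float = 0.95,  # assumed value, for illustration only
    min_gap: int = 2,             # skip pairs already linked sequentially
) -> Dict[int, Set[int]]:
    """Link nodes by visiting order, then add loop-closure-style edges."""
    n = len(features)
    graph: Dict[int, Set[int]] = {i: set() for i in range(n)}
    # Edges from the sequential visiting order.
    for i in range(n - 1):
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    # Additional edges between visually similar, non-consecutive nodes.
    for i in range(n):
        for j in range(i + min_gap, n):
            if cosine(features[i], features[j]) >= sim_threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def shortest_path(
    graph: Dict[int, Set[int]], start: int, goal: int
) -> Optional[List[int]]:
    """Breadth-first search; returns the node sequence or None if unreachable."""
    parents: Dict[int, Optional[int]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path: List[int] = []
            cur: Optional[int] = node
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for nb in sorted(graph[node]):
            if nb not in parents:
                parents[nb] = node
                queue.append(nb)
    return None


# Toy features from a walk that returns close to where it started.
toy_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 0.01]]
toy_graph = build_topological_map(toy_features)
toy_path = shortest_path(toy_graph, 0, 3)
```

In this toy walk the last observation resembles the first, so an extra edge links nodes 0 and 4, and the planned route from node 0 to node 3 uses that shortcut instead of retracing the whole trajectory.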

2. RELATED WORK

Topological map in exploration and navigation. Previous works tried to tackle navigation with end-to-end learning of sensorimotor control by directly mapping visual observations to the action space [57]. However, such purely reactive RL-based methods have no explicit memory and struggle to navigate in complex scenarios [20, 74]. Newer methods that tackle this problem with scene memory [23, 29] often rely on metric information. An explicit metric map is commonly used for localization and navigation in the literature [22, 21, 6], but may face robustness and computation challenges, especially in dynamic and complex scenes, due to the need for accurate geometric constraints during map and pose optimization. Later, inspired by animal and human psychology [70], researchers showed that topological maps may aid robot navigation [13, 15, 11, 5, 29]. In the literature, many topological mapping solutions either use a random walkthrough sequence [58] or incrementally construct a topological graph during the navigation task [13, 40]. However, random exploration is inefficient at creating a comprehensive map within a limited exploration time, and the existing exploring-while-mapping solutions still involve metric information, either as required input or as intermediate estimation. Instead, we propose a two-stage solution that (1) learns an efficient and generalizable exploration policy to collect visual observations of a novel environment, and (2) uses VPR to construct a topological map for future navigation. Similar exploration-before-navigation pipelines include [5]



Figure 1: Problem overview. We focus on the active mapping problem where a mobile agent needs to decide how to efficiently explore a novel environment. For planning and navigation, we embrace the topological feature space where each feature corresponds to an image observation, while the metric space involves distance/pose information which is onerous to obtain accurately. Our main idea is to hallucinate goal features to guide exploration actions, learned by imitating expert demonstrations.

