ACTIVE TOPOLOGICAL MAPPING BY METRIC-FREE EXPLORATION VIA TASK AND MOTION IMITATION

Anonymous

Abstract

A topological map is an effective environment representation for visual navigation. It is a graph of image nodes and spatial neighborhood edges, without metric information such as global or relative agent poses. However, constructing such a map currently relies on either less efficient random exploration or more demanding training that involves metric information. To overcome these issues, we propose active topological mapping (ATM), consisting of active visual exploration followed by topological mapping via visual place recognition. Our main novelty is a simple and lightweight active exploration policy that works entirely in the image feature space and involves no metric information. More specifically, ATM's metric-free exploration is based on task and motion planning (TAMP). The task planner is a recurrent neural network that uses the latest local image observation sequence to hallucinate a feature as the next-step best exploration goal. The motion planner then fuses the current and hallucinated features to generate an action that takes the agent toward the hallucinated feature goal. The two planners are jointly trained via deeply-supervised imitation learning from expert exploration demonstrations. Extensive experiments on both exploration and navigation tasks in the photo-realistic Gibson and MP3D datasets validate ATM's effectiveness and generalizability.

1. INTRODUCTION

Mobile agents often create maps to represent their surrounding environments [6]. Typically, such a map is either topological or metric (including hybrids of the two). We consider a topological map to be metric-free, meaning it does not explicitly store global/relative position/orientation information with measurable geometric accuracy [39, 38]. Instead, it is a graph that stores local sensor observations, such as RGB images, as graph nodes, and the spatial neighborhood structure (and often navigation actions) as graph edges that connect observations taken from nearby locations. While metric maps are often reconstructed by optimizing geometric constraints between landmarks and sensor poses via classic simultaneous localization and mapping (SLAM), topological maps have recently attracted attention in visual navigation tasks due to their simplicity, flexibility, scalability, and interpretability [58, 13, 27, 40, 12].

A topological map used for visual navigation can be constructed in two ways. The first and simplest way is to let the agent explore the new environment through metric-free random walks, after which the map is built by projecting the recorded observations into a feature space and adding edges between nearby or sequentially obtained features [58]. However, random walk is very inefficient, especially in large or complex rooms, leading to repeated revisits of nearby locations in the same area. The other way is to design a navigation policy that controls the agent to explore the area more effectively while creating the map. This is known as active SLAM and often involves metric information, either as required input [42, 12] or as intermediate estimates [13]. As shown in Fig. 1, could we combine the merits of the two by finding a metric-free (neither input nor estimated) exploration policy that discovers informative traversal trajectories in unknown environments, for topological map construction after exploration?
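The first construction scheme described above, projecting recorded observations into a feature space and linking nearby or sequential ones, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cosine-similarity matcher, and the threshold value are all assumptions standing in for a learned visual place recognition model.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_topological_map(features, sim_threshold=0.9):
    """Build a metric-free graph from a trajectory of image features.

    nodes: list of feature vectors, one per distinct place
    edges: set of (i, j) node-index pairs for temporal adjacency
    """
    nodes, edges = [], set()
    prev = None
    for f in features:
        # Localize against existing nodes (place-recognition stand-in).
        match = None
        for i, n in enumerate(nodes):
            if cosine(f, n) >= sim_threshold:
                match = i
                break
        if match is None:  # unseen place -> create a new node
            nodes.append(f)
            match = len(nodes) - 1
        if prev is not None and prev != match:
            # Consecutive observations from different places imply traversability.
            edges.add((min(prev, match), max(prev, match)))
        prev = match
    return nodes, edges
```

For example, a trajectory that revisits its starting place yields two nodes and one edge: `build_topological_map([[1, 0], [0.99, 0.1], [0, 1], [1, 0]])` merges the first two observations into one node, creates a second node for the distinct view, and links the two.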
To achieve this objective, we propose Active Topological Mapping (ATM), shown in Fig. 2. It contains two stages: active exploration through a learned metric-free policy, and topological mapping through visual place recognition (VPR) [51]. The first stage adopts the task and motion planning (TAMP) formalism [26, 55] and imitation learning [63] from expert demonstrations, which can come either from an oracle policy with full access to virtual environments or simply from a human expert in the real world. Our main novelty is to design such imitation at both the task and motion levels with joint end-to-end training. Our task planner, a two-layer LSTM [31] network trained with deep supervision, conceives the next best goal feature to be
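The TAMP-style interaction between the two planners can be sketched as a feature-space control loop. The sketch below is purely illustrative: in ATM both planners are learned networks (a two-layer LSTM task planner and a jointly trained motion policy), whereas here they are hypothetical hand-written stand-ins operating on toy 2-D features, and the action-effect table is an assumption.

```python
def task_planner(history):
    """Hallucinate the next-step goal feature from recent observations.

    Toy rule: extrapolate the latest feature along the mean frame-to-frame
    change. The learned LSTM in ATM replaces this heuristic.
    """
    if len(history) < 2:
        return history[-1]
    deltas = [[b - a for a, b in zip(f0, f1)]
              for f0, f1 in zip(history, history[1:])]
    mean_delta = [sum(col) / len(deltas) for col in zip(*deltas)]
    return [x + dx for x, dx in zip(history[-1], mean_delta)]

def motion_planner(current, goal, action_effects):
    """Pick the action whose assumed feature-space effect best approaches goal.

    action_effects is a hypothetical map from action name to the change it
    induces in feature space; ATM instead learns this mapping end to end.
    """
    def dist2(f):
        return sum((a - b) ** 2 for a, b in zip(f, goal))
    return min(action_effects,
               key=lambda a: dist2([c + e for c, e in
                                    zip(current, action_effects[a])]))

def explore_step(history, action_effects):
    # Task level: where to go next (in feature space).
    goal = task_planner(history)
    # Motion level: which action moves us toward that goal.
    return motion_planner(history[-1], goal, action_effects)
```

For instance, with a history of features drifting steadily along one axis, `explore_step([[0, 0], [1, 0], [2, 0]], {"forward": [1, 0], "left": [0, 1], "right": [0, -1]})` extrapolates the goal to `[3, 0]` and selects `"forward"`, mirroring how ATM's motion planner steers the agent toward the hallucinated feature.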

