MODEL-BASED NAVIGATION IN ENVIRONMENTS WITH NOVEL LAYOUTS USING ABSTRACT 2-D MAPS

Abstract

Efficiently training agents with planning capabilities has long been a major challenge in decision-making. In this work, we focus on zero-shot navigation ability on a given abstract 2-D occupancy map, treated as an image, much as a human navigates by reading a paper map. To learn this ability, the agent must be trained efficiently on a small set of training maps and share knowledge effectively across environments. We hypothesize that model-based navigation can better adapt an agent's behaviors to a task, since it disentangles the variations in map layout and goal location and enables longer-term planning at novel locations compared to reactive policies. We propose to learn a hypermodel that recognizes patterns from a limited number of abstract maps and goal locations, to maximize alignment between hypermodel predictions and real trajectories in order to extract information from multi-task off-policy experience, and to construct denser feedback for planners via n-step goal relabelling. We train our approach on DeepMind Lab environments with layouts from different maps, and demonstrate superior performance on zero-shot transfer to novel maps and goals.

1. INTRODUCTION

If we provide a rough solution to a problem, can an agent learn to follow that solution effectively? In this paper, we study this question in the context of maze navigation, where an agent is situated within a maze whose layout it has never seen before, and the agent is expected to navigate to a goal without first training on, or even exploring, this novel maze. This task may appear impossible without further guidance, but we provide the agent with additional information: an abstract 2-D occupancy map illustrating the rough layout of the environment, as well as indicators of its start and goal locations ("task context" in Figure 1). This is akin to a tourist attempting to find a landmark in a new city: without any further help, this would be very challenging; but when equipped with a 2-D map bearing a "you are here" symbol and an indicator of the landmark, the tourist can easily plan a path to reach the landmark without needing to explore or train excessively.

Navigation is a fundamental capability of all embodied agents, both artificial and natural, and has therefore been studied in many settings. In our case, we are most concerned with zero-shot navigation in novel environments, where the agent cannot perform further training or even exploration of the new environment; everything needed to accomplish the task is technically provided by the abstract 2-D map. This differs from the large family of approaches based on simultaneous localization and mapping (SLAM) typically used in robot navigation (Thrun et al., 2005), where the agent can explore and build an accurate occupancy map of the environment prior to navigation. Recently, navigation approaches based on deep reinforcement learning (RL) have also emerged, although they often require extensive training in the same environment (Mirowski et al., 2017; 2018). Some deep RL approaches are even capable of navigating novel environments with new layouts without further training; however, these approaches typically learn the strategy of efficiently exploring the new environment to understand the layout and find the goal, then exploiting that knowledge for the remainder of the episode to repeatedly reach that goal quickly (Jaderberg et al., 2017). In contrast, since the solution is essentially provided to the agent via the abstract 2-D map, we require a more stringent version of zero-shot navigation in which the agent should not explore the new environment; instead, we expect the agent to produce a near-optimal path on its first (and only) approach to the goal.

Although the solution is technically accessible via the abstract 2-D map, several challenges remain in using it effectively. First, although we assume that the layout in the 2-D map is accurate, the map does not correspond to the state space of the agent in the environment, so the agent must learn the correspondence between its state and locations in the 2-D map. Second, actions in the 2-D map also cannot be directly mapped into the agent's action space; moving between adjacent "cells" in the 2-D map requires a sequence of many actions, specified in terms of the agent's translational and rotational velocities. Hence, one cannot simply perform graph search on the 2-D map and then execute the abstract solution directly on the agent. Instead, we propose approaches that learn to use the provided abstract 2-D map via end-to-end learning.
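To make the second challenge concrete: computing an abstract solution on the 2-D map is straightforward graph search, but its output is a sequence of map cells, not agent actions. Below is a minimal sketch under our own assumptions of breadth-first search and 4-connectivity (the paper prescribes neither):

```python
# Sketch: BFS shortest path over the abstract occupancy grid (1 = wall,
# 0 = navigable). The result is a list of cells; note that it says nothing
# about the velocity commands the agent must issue to actually follow it.
from collections import deque

def abstract_path(grid, start, goal):
    parents = {start: None}
    frontier = deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:                      # reconstruct the cell sequence
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in parents):
                parents[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None  # unreachable; cannot occur in our single-component maps
```

Each consecutive pair of cells in the returned path still requires many low-level velocity commands to traverse, which is precisely the correspondence our agents must learn.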
Concretely, we propose two approaches for navigation using abstract 2-D maps:

• MMN (Map-conditioned Multi-task Navigator): a model-based approach that learns a hypermodel (Ha et al., 2016), which uses the provided 2-D map to produce a parameterized latent-space transition function $f_\phi$ for that map. This transition function is jointly trained with Monte-Carlo tree search (MCTS) to plan, in latent space, to reach the specified goal (Schrittwieser et al., 2019); see the sketch after this paragraph.
• MAH (Map-conditioned Ape-X HER DQN): a model-free approach based on Ape-X Deep Q-Networks (DQN) (Horgan et al., 2018), a high-performing distributed variant of DQN, that takes the provided 2-D map as additional input. We further supplement it with our proposed n-step modification of hindsight experience replay (HER) (Andrychowicz et al., 2017), sketched in Section 2.

In experiments performed in DeepMind Lab (Beattie et al., 2016), a 3-D maze simulation environment shown in Figure 1, we show that both approaches achieve effective zero-shot navigation in novel environment layouts, and that the model-based MMN is better at long-distance navigation.
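To ground the MMN bullet above, here is a minimal sketch of the hypermodel idea in the spirit of Ha et al. (2016): a network that reads the task context and emits the parameters of a small latent-space transition function. All sizes, layer choices, and names below (HyperModel, LATENT, ACTIONS) are illustrative assumptions, not the paper's exact architecture:

```python
# Sketch: a hypermodel that encodes the abstract map (plus start/goal
# channels) and generates the weights of a one-layer latent transition
# function f_phi(z, a) -> z'. Sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

LATENT, ACTIONS, N = 32, 4, 16  # latent dim, action count, map size

class HyperModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Task-context encoder: 3 channels = map + start + goal grids.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(16 * N * N, 256), nn.ReLU())
        # Heads that emit the transition function's weights and biases.
        in_dim = LATENT + ACTIONS
        self.w_head = nn.Linear(256, LATENT * in_dim)
        self.b_head = nn.Linear(256, LATENT)

    def forward(self, context, z, a_onehot):
        h = self.encoder(context)                        # (B, 256)
        W = self.w_head(h).view(-1, LATENT, LATENT + ACTIONS)
        b = self.b_head(h)
        x = torch.cat([z, a_onehot], dim=-1).unsqueeze(-1)
        return torch.tanh((W @ x).squeeze(-1) + b)       # next latent state
```

In MMN, rollouts of such a map-conditioned transition function are what MCTS would score in latent space (Schrittwieser et al., 2019); we omit the search loop itself for brevity.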

2. BACKGROUND

We consider a distribution of navigation tasks $\rho(\mathcal{T})$. Each task differs in two aspects: map layout and goal location.

(1) Abstract map. The layout of each navigation task is specified by an abstract map. Specifically, an abstract map $m \in \mathbb{R}^{N \times N}$ is a 2-D occupancy grid, where cells with 1s (black) indicate walls and cells with 0s (white) indicate navigable space. A cell does not directly correspond to a state in the agent's world, so the agent needs to learn to localize itself given an abstract 2-D map. We generate a set of maps and guarantee that all valid positions are reachable, i.e., the navigable cells of each map form a single connected component.

(2) Goal position. Given a map, we then specify a pair of start and goal positions. Both start and goal are represented as "one-hot" occupancy grids $g \in \mathbb{R}^{2 \times N \times N}$ provided to the agent. For simplicity, we use $g$ to refer to both start and goal, and we denote the provided map and start-goal positions $c = (m, g)$ as the task context.

We formulate each navigation task as a goal-reaching Markov decision process (MDP), consisting of a tuple $\langle \mathcal{S}, \mathcal{A}, P, R_G, \rho_0, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, $R_G$ is the goal-reaching reward, $\rho_0 = \rho(s_0)$ is the initial state distribution, and $\gamma \in (0, 1]$ is the discount factor. We assume transitions are deterministic. For each task, the objective is to reach the specified goal position.
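The goal-reaching reward $R_G$ is sparse, which is what MAH's n-step modification of hindsight experience replay targets. Here is a minimal sketch of n-step goal relabelling under our own simplifying data layout (a flat list of transitions with an 'achieved_goal' field; the paper's buffer format may differ):

```python
# Sketch: hindsight relabelling with an n-step horizon. A transition at
# time t gets its goal replaced by the state actually achieved n steps
# later, turning failed trajectories into successful goal-reaching data.
def n_step_relabel(trajectory, n):
    """trajectory: list of dicts with keys 'obs', 'action', 'next_obs',
    'achieved_goal' (the goal attained at next_obs).
    Returns relabelled transitions with hindsight goals and rewards."""
    out = []
    for t, tr in enumerate(trajectory):
        future = min(t + n, len(trajectory) - 1)    # n-step lookahead, clipped
        goal = trajectory[future]['achieved_goal']  # pretend this was the goal
        reward = 1.0 if tr['achieved_goal'] == goal else 0.0
        out.append({**tr, 'goal': goal, 'reward': reward})
    return out
```

Relabelled transitions provide dense positive feedback even when the episode's original goal was never reached.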

Figure 1: In each training episode, we randomly select a task $\mathcal{T}$ to initialize the environment simulation and feed the corresponding task context $c_\mathcal{T}$ to the agent. We use a joint state space $o \in \mathbb{R}^{12}$ as input to the agent, consisting of position ($\mathbb{R}^3$), orientation ($\mathbb{R}^3$), and translational and rotational velocity ($\mathbb{R}^6$). Each cell on the abstract map corresponds to 100 units in the agent world.
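Since each map cell spans 100 world units, the ground-truth correspondence between an agent position and a map cell is simple, even though the agent must learn it rather than being given it. A minimal sketch (the axis convention and function name are our own assumptions):

```python
# Sketch: ground-truth world-to-cell correspondence, assuming the world
# origin aligns with cell (0, 0) and each cell spans 100 x 100 world units.
CELL_SIZE = 100.0

def world_to_cell(x: float, y: float) -> tuple[int, int]:
    """Map a 2-D world position to its abstract-map cell index."""
    return int(x // CELL_SIZE), int(y // CELL_SIZE)
```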


