VTNET: VISUAL TRANSFORMER NETWORK FOR OBJECT GOAL NAVIGATION

Abstract

Object goal navigation aims to steer an agent towards a target object based on the agent's observations. Designing an effective visual representation of the observed scene is pivotal to determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representations for navigation. VTNet is a highly effective structure that embodies two key properties: first, the relationships among all the object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we develop a pre-training scheme to associate the visual representations with navigation signals, and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors and then incorporates all the encoded descriptors through attention operations to achieve an informative representation for navigation. Given such visual representations, agents are able to explore the correlations between visual observations and navigation actions. For example, an agent would prioritize "turning right" over "turning left" when the visual representation emphasizes the right side of the activation map. Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.

1. INTRODUCTION

The goal of target-driven visual navigation is to guide an agent to reach instances of a given target category based on its monocular observations of an environment. Thus, it is highly desirable to achieve an informative visual representation of the observation that is correlated to directional navigation signals. In this paper, we propose a Visual Transformer Network (VTNet) to achieve an expressive visual representation. In our VTNet, we develop a Visual Transformer (VT) to extract image descriptors from visual observations and then decode visual representations of the observed scenes. Then, we present a pre-training scheme to associate visual representations with directional navigation signals, thus making the representations informative for navigation. After pre-training, our visual representations are fed to a navigation policy network and we train our entire network in an end-to-end manner. In particular, our VT exploits two newly designed spatial-aware descriptors as the key and query (i.e., a spatial-enhanced local descriptor and a positional global descriptor) and then encodes them to achieve an expressive visual representation. Our spatial-enhanced local descriptor is developed to take full advantage of all detected objects for the exploration of spatial and category relationships among instances. Unlike the prior work (Du et al., 2020) that only leverages one instance per class to mine the category relationship, our VT is able to exploit the relationships of all the detected instances. To this end, we employ the object detector DETR (Carion et al., 2020), since features extracted from DETR not only encode object appearance information, such as class labels and bounding boxes, but also contain the relations between instances and global contexts. Moreover, DETR features are scale-invariant (output from the
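To make the descriptor fusion concrete, the following is a minimal sketch of the idea described above: detected objects become spatial-enhanced local descriptors (appearance features concatenated with location cues), image regions become a positional global descriptor, and the two are combined through attention. All dimensions, the use of NumPy, and the single-head formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention: each query row attends over all keys.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

rng = np.random.default_rng(0)
N, d_app = 5, 32  # hypothetical: 5 detected objects, 32-dim appearance features

# Spatial-enhanced local descriptors: appearance features concatenated with
# normalized bounding-box coordinates (x1, y1, x2, y2) and detection confidence,
# so each object's position is part of its descriptor.
appearance = rng.standard_normal((N, d_app))
boxes_conf = rng.random((N, 5))
local_desc = np.concatenate([appearance, boxes_conf], axis=1)  # (N, 37)

# Positional global descriptor: a 7x7 grid of region features with a
# positional embedding added so location information survives the attention
# (the real model learns this embedding; here it is random for illustration).
G, d_model = 49, local_desc.shape[1]
regions = rng.standard_normal((G, d_model))
pos_embed = rng.standard_normal((G, d_model))
global_desc = regions + pos_embed

# Fuse the two descriptors: global regions (query) attend over the
# object descriptors (key/value), yielding the visual representation
# that is passed on to the navigation policy.
visual_rep = attention(global_desc, local_desc, local_desc)
print(visual_rep.shape)  # (49, 37)
```

Because locations are embedded in both descriptors, a strong activation on one side of the grid can be correlated with a directional action (e.g., "turning right") during policy learning.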

