VTNET: VISUAL TRANSFORMER NETWORK FOR OBJECT GOAL NAVIGATION

Abstract

Object goal navigation aims to steer an agent towards a target object based on the agent's observations. Designing an effective visual representation of the observed scene is pivotal for determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representations for navigation. VTNet is a highly effective structure that embodies two key properties: first, the relationships among all object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we develop a pre-training scheme to associate the visual representations with navigation signals and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors and then incorporates all the encoded descriptors through attention operations to achieve an informative representation for navigation. Given such visual representations, agents can explore the correlations between visual observations and navigation actions; for example, an agent would prioritize "turning right" over "turning left" when the visual representation emphasizes the right side of the activation map. Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.

1. INTRODUCTION

The goal of target-driven visual navigation is to guide an agent to reach instances of a given target category based on its monocular observations of an environment. It is therefore highly desirable to obtain an informative visual representation of the observation that is correlated with directional navigation signals. In this paper, we propose a Visual Transformer Network (VTNet) to achieve such an expressive visual representation. In our VTNet, we develop a Visual Transformer (VT) to extract image descriptors from visual observations and then decode visual representations of the observed scenes. We also present a pre-training scheme to associate visual representations with directional navigation signals, thus making the representations informative for navigation. After pre-training, our visual representations are fed to a navigation policy network, and the entire network is trained end-to-end.

In particular, our VT exploits two newly designed spatial-aware descriptors as the key and query, i.e., a spatial-enhanced local descriptor and a positional global descriptor, and encodes them to achieve an expressive visual representation. The spatial-enhanced local descriptor is designed to take full advantage of all detected objects when exploring the spatial and category relationships among instances. Unlike prior work (Du et al., 2020) that only leverages one instance per class to mine category relationships, our VT exploits the relationships among all detected instances. To this end, we employ the object detector DETR (Carion et al., 2020), since features extracted from DETR not only encode object appearance information, such as class labels and bounding boxes, but also capture the relations between instances and global contexts. Moreover, DETR features are scale-invariant. Furthermore, we introduce a positional global descriptor as the query for our VT decoder.
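To make the spatial-enhanced local descriptor concrete, a minimal sketch is given below: DETR instance features are fused with their bounding-box coordinates, detection confidences, and class labels. All dimensions, the class count, and the fusion-by-summation choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalDescriptor(nn.Module):
    """Fuses DETR instance features with spatial and category cues
    (a sketch; dims and the number of classes are assumptions)."""
    def __init__(self, feat_dim=256, num_classes=22, out_dim=256):
        super().__init__()
        # bbox (4 coords) + detection confidence (1) -> spatial embedding
        self.spatial = nn.Linear(5, out_dim)
        self.label = nn.Embedding(num_classes, out_dim)
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, feats, boxes, scores, labels):
        # feats:  (N, feat_dim) DETR decoder outputs for N detections
        # boxes:  (N, 4) normalized (cx, cy, w, h)
        # scores: (N,) confidences; labels: (N,) class indices
        spatial = self.spatial(torch.cat([boxes, scores[:, None]], dim=-1))
        return self.proj(feats) + spatial + self.label(labels)

desc = LocalDescriptor()
x = desc(torch.randn(100, 256), torch.rand(100, 4),
         torch.rand(100), torch.randint(0, 22, (100,)))
print(x.shape)  # torch.Size([100, 256])
```

Summing the three embeddings keeps each descriptor the same size as the raw DETR feature; concatenation followed by a projection would be an equally plausible design.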
In particular, we associate the region features with image region positions (such as bottom and top) and thus facilitate exploring the correspondences between navigation actions and image regions. To do so, we divide a global observation into multiple regions based on spatial layouts and assign a positional embedding to each region feature, yielding our positional global descriptor. Given this global query descriptor, our VT decoder attends the spatial-enhanced local descriptors to the positional global descriptor to learn the relationships between instances and observation regions.

However, we found that directly training our VTNet together with a navigation policy network fails to converge, due to the training difficulty of transformers (Vaswani et al., 2017). Therefore, we present a pre-training scheme to associate visual representations with directional navigation signals: we endow our VT with the capability of encoding such signals by imitating expert experience. After this warm-up on expert instructions, VT learns instructive representations for navigation, as illustrated in Figure 1.

After pre-training our VT, we employ a standard Long Short-Term Memory (LSTM) network to map the current visual representation and previous states to an agent action, and we adopt the A3C architecture (Mnih et al., 2016) to learn the navigation policy. Once VTNet has been fully trained, our agent can exploit the correlations between observations and navigation actions to improve navigation efficiency. In the widely-used navigation environment AI2-Thor (Kolve et al., 2017), our method significantly outperforms the state of the art.

Our contributions are summarized as follows:

• We propose a novel Visual Transformer Network (VTNet) to extract informative feature representations for visual navigation. Our visual representations not only encode relationships among objects but also establish strong correlations with navigation signals.
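The positional global descriptor and the decoder's cross-attention can be sketched as follows: a global feature map is flattened into region tokens, learned position embeddings are added, and the regions then query the encoded object descriptors. Grid size, feature dimension, and head count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PositionalGlobalDescriptor(nn.Module):
    """Splits a global feature map into regions and adds learned
    position embeddings (a sketch; the 7x7 grid is an assumption)."""
    def __init__(self, dim=256, grid=7):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(grid * grid, dim))

    def forward(self, fmap):
        # fmap: (dim, grid, grid) CNN feature map of the observation
        regions = fmap.flatten(1).t()   # (grid*grid, dim) region tokens
        return regions + self.pos       # position-aware query

dim = 256
glob = PositionalGlobalDescriptor(dim)
# Cross-attention stands in for the VT decoder: global regions (queries)
# attend over the encoded object descriptors (keys/values).
decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

queries = glob(torch.randn(dim, 7, 7)).unsqueeze(0)  # (1, 49, dim)
local = torch.randn(1, 100, dim)                     # encoded detections
rep, attn = decoder(queries, local, local)
print(rep.shape)   # torch.Size([1, 49, 256])
print(attn.shape)  # torch.Size([1, 49, 100]) region-to-object attention
```

The attention map `attn` is what Figure 1 visualizes: each image region's weight over the detected objects, from which directional preferences can be read off.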
• We introduce a positional global descriptor and a spatial-enhanced local descriptor as the query and key for our visual transformer (VT); the visual representations decoded by our VT are then attended to navigation actions via our pre-training scheme, providing a good initialization for VT.

• Experimental results demonstrate that our learned visual representations improve the efficiency of state-of-the-art visual navigation systems in unseen environments by a relative 14.0% in Success weighted by Path Length (SPL).
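The pre-training scheme described above, in which the VT imitates expert experience before policy learning, amounts to a behavior-cloning step: the decoded representation is supervised with expert actions via a cross-entropy loss. The action set, batch size, and head architecture below are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Hypothetical imitation pre-training step: supervise the visual
# representation with expert actions before A3C policy learning.
num_actions = 6  # assumed action set, e.g. MoveAhead, RotateLeft, ...
head = nn.Sequential(nn.Flatten(), nn.Linear(49 * 256, num_actions))
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(head.parameters(), lr=1e-4)

rep = torch.randn(8, 49, 256)                 # batch of VT outputs
expert = torch.randint(0, num_actions, (8,))  # expert action labels

logits = head(rep)
loss = criterion(logits, expert)     # imitation (behavior-cloning) loss
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```

In practice the gradient would also flow into the VT itself, so that the pre-trained representation, not just the action head, encodes directional signals.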



As observed in DETR, the number of objects of interest in a scene is usually fewer than 100. Thus, we set the number of keys to 100 in the VT encoder.
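With this fixed budget, per-frame detections can be padded (or truncated) to a constant set of 100 keys before self-attention; a minimal sketch, with layer sizes assumed for illustration:

```python
import torch
import torch.nn as nn

NUM_KEYS = 100  # fixed key number in the VT encoder, as stated above

def pad_detections(desc, num_keys=NUM_KEYS):
    """Pads or truncates per-frame detection descriptors to a fixed
    number of keys so the encoder sees a constant-size set (a sketch)."""
    n, d = desc.shape
    if n >= num_keys:
        return desc[:num_keys]
    return torch.cat([desc, desc.new_zeros(num_keys - n, d)], dim=0)

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

dets = torch.randn(37, 256)               # 37 detections this frame
keys = pad_detections(dets).unsqueeze(0)  # (1, 100, 256)
out = encoder(keys)
print(out.shape)  # torch.Size([1, 100, 256])
```

A key-padding mask over the zero slots would keep the padded entries from influencing attention; it is omitted here for brevity.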



Figure 1: Motivation of the Visual Transformer Network (VTNet). The target class (cellphone) is highlighted by green bounding boxes. An agent first detects objects of interest from its observation. Then, the agent attends the detected objects to the global observation via the visual transformer (VT). High attention scores are obtained on the left side of the observation, which corresponds to the target (cellphone). The agent therefore chooses RotateLeft to reach the target.

