GENERATIVE LANGUAGE-GROUNDED POLICY IN VISION-AND-LANGUAGE NAVIGATION WITH BAYES' RULE

Abstract

Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach a goal node. While most previous studies have built and investigated a discriminative approach, we observe that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions, i.e., all possible sequences of vocabulary tokens, given the action and the transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach on the Room-to-Room (R2R) and Room-for-Room (R4R) datasets, especially in unseen environments. We further show that the combination of the generative and discriminative policies achieves close to state-of-the-art results on the R2R dataset, demonstrating that the generative and discriminative policies capture different aspects of VLN.

1. INTRODUCTION

Vision-and-language navigation (Anderson et al., 2018b) is a task in which a computational model follows an instruction and performs a sequence of actions to reach the final objective. The agent is embodied in a realistic 3D environment, such as that of the Matterport3D Simulator (Chang et al., 2017), receives a textual instruction to follow before execution, and then observes the surrounding environment and moves around. Success on this task is measured by how accurately and quickly the agent reaches the destination specified in the instruction. VLN is a sequential decision-making problem: at each step, the embodied agent makes a decision considering the current observation, the transition history, and the initial instruction. Previous studies address VLN by building a language-grounded policy which computes a distribution over all possible actions given the current state and the language instruction. In this paper, we note that there are two ways to formulate the relationship between the action and the instruction. First, the action may be assumed to be generated from the instruction, as in most existing approaches (Anderson et al., 2018b; Ma et al., 2019; Wang et al., 2019; Hu et al., 2019; Huang et al., 2019). This is often called a follower model (Fried et al., 2018). We call it a discriminative approach, analogous to logistic regression in binary classification. On the other hand, the action may be assumed to generate the instruction. In this case, we build a neural network to compute the distribution over all possible instructions given an action and the transition history. With this neural network, we use Bayes' rule to build a language-grounded policy. We call this the generative approach, analogous to naïve Bayes in binary classification.
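The core of the generative approach is Bayes' rule: given a language model that scores the instruction under each candidate action, the posterior over actions is proportional to the instruction likelihood times an action prior, p(a_t | X, h_t) ∝ p(X | a_t, h_t) p(a_t | h_t). A minimal NumPy sketch of this combination step (the function and variable names here are illustrative, and the language-model log-likelihoods are assumed to be given):

```python
import numpy as np

def generative_policy_posterior(log_p_instruction_given_action, log_prior_actions):
    """Combine per-action instruction log-likelihoods with an action prior
    via Bayes' rule: p(a_t | X, h_t) ∝ p(X | a_t, h_t) p(a_t | h_t)."""
    logits = log_p_instruction_given_action + log_prior_actions
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()              # normalize over the candidate actions

# Hypothetical scores for three candidate actions:
log_lik = np.array([-10.2, -9.1, -12.5])    # log p(X | a, h) from a language model
log_prior = np.log(np.ones(3) / 3)          # uniform prior p(a | h)
posterior = generative_policy_posterior(log_lik, log_prior)
```

With a uniform prior, the posterior simply renormalizes the instruction likelihoods, so the action under which the instruction is most probable receives the highest posterior mass.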
The generative language-grounded policy only considers what is available at each time step and chooses one of the potential actions to generate the instruction. We then apply Bayes' rule to obtain the posterior distribution over actions given the instruction. Despite its similarity to the speaker model of Fried et al. (2018), there is a stark difference: the speaker model of Fried et al. (2018) cannot be used for navigation on its own due to its formulation, while our generative language-grounded policy can. The speaker model of Fried et al. (2018) takes as input the entire sequence of actions and predicts the entire instruction, which is not the case in ours.

[Figure 1: The generative language-grounded policy in vision-and-language navigation, scoring the next actions given an example instruction: "Walk between the tub and bathroom counter. Go through the opening next to the bathroom counter and into the little hallway. Wait next to the family picture hanging on the left wall."]

Given these discriminative and generative parameterizations of the language-grounded policy, we hypothesize that the generative parameterization works better than the discriminative one, because the former benefits from a richer learning signal arising from scoring the entire instruction rather than predicting a single action. Such a rich learning signal arises because the generative policy must learn to associate all salient features of a language instruction with an intended action in order to learn the distribution over language instructions. This is unlike the discriminative policy, which may rely on only a minimal subset of salient features of the language instruction in order to model the distribution over a much smaller set of actions. Furthermore, the generative policy enables us to more readily encode a prior over the action distribution when deemed necessary. We empirically show that the proposed generative approach indeed outperforms the discriminative approach on both the R2R and R4R datasets, especially in unseen environments. Figure 1 illustrates the proposed generative approach to VLN. We also show that the combination of the generative and discriminative policies results in near state-of-the-art results on R2R and R4R, demonstrating that they capture two different aspects of VLN. Finally, we demonstrate that the proposed generative policy is more interpretable than the conventional discriminative policy by introducing a token-level prediction entropy as a way to measure the influence of each token in the instruction on the policy's decision. The source code is available at https://github.com/shuheikurita/glgp.
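One way to realize such a token-level measure is to form, for each instruction token, a distribution over candidate actions from that token's per-action log-probabilities under the generative model, and take its entropy: a low-entropy token strongly discriminates between actions. The sketch below illustrates this idea; it is an illustrative formulation, not necessarily the paper's exact definition, and all names are hypothetical:

```python
import numpy as np

def token_level_entropy(logprobs_per_token):
    """Entropy of the per-token action distribution.

    logprobs_per_token: array of shape (num_actions, num_tokens) holding
    log p(w_k | w_{<k}, a, h) for each candidate action a and token w_k.
    Returns one entropy value per token; low entropy means the token
    strongly favors one action over the others.
    """
    entropies = []
    for k in range(logprobs_per_token.shape[1]):
        logits = logprobs_per_token[:, k] - logprobs_per_token[:, k].max()
        p = np.exp(logits)
        p = p / p.sum()                              # softmax over actions
        entropies.append(-(p * np.log(p + 1e-12)).sum())
    return np.array(entropies)

# Hypothetical log-probabilities for 2 candidate actions x 2 tokens:
logp = np.array([[-1.0, -1.0],
                 [-1.0, -10.0]])
H = token_level_entropy(logp)   # token 1 discriminates between actions more than token 0
```

Here the first token is equally probable under both actions (entropy ≈ log 2), while the second token is far more probable under the first action, driving its entropy toward zero.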

2. DISCRIMINATIVE AND GENERATIVE PARAMETERIZATIONS OF LANGUAGE-GROUNDED POLICY

Vision-and-language navigation (VLN) is a sequential decision-making task, where an agent performs a series of actions based on the initially-given instruction, visual features, and past actions. Given the instruction X, the past and current observations s_{:t}, and the past actions a_{:t-1}, the agent computes the distribution p(a_t | X, s_{:t}, a_{:t-1}) at time t. For brevity, we write the current state, consisting of the current and past scene observations and the past actions, as h_t = {s_{:t}, a_{:t-1}}, and the next-action prediction as p(a_t | X, h_t). The instruction X is a sequence of tokens X = (w_0, w_1, ..., w_k, ...). The relationship between these notations is also presented in Appendix B. In VLN, the goal is to model p(a_t | h_t, X) so as to maximize the success rate of reaching the goal while faithfully following the instruction X. In doing so, there are two approaches, generative and discriminative, analogous to solving classification with either naïve Bayes or logistic regression. In the discriminative approach, we build a neural network to directly estimate p(a_t | h_t, X). This neural network takes as input the current state h_t and the language instruction X and outputs a distribution over the action set. Learning corresponds to

max_θ Σ_{n=1}^{N} Σ_{t=1}^{T_n} log p(a_t^n | h_t^n, X^n),

where N is the number of training trajectories and T_n is the length of the n-th trajectory.
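The maximum-likelihood objective above is, for each trajectory, a sum of per-step cross-entropy terms over the action set. A minimal NumPy sketch of the per-trajectory negative log-likelihood (the policy network producing the action logits is abstracted away, and all names are illustrative):

```python
import numpy as np

def discriminative_nll(action_logits, gold_actions):
    """Sum over t of -log p(a_t | h_t, X) for one trajectory.

    action_logits: (T, num_actions) unnormalized policy scores at each step.
    gold_actions:  (T,) indices of the demonstrated actions.
    """
    # Log-softmax over the action dimension, shifted by the row max for stability.
    logits = action_logits - action_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick out the log-probability of each demonstrated action and sum.
    return -log_probs[np.arange(len(gold_actions)), gold_actions].sum()

# Two steps, two actions, uniform scores at each step:
loss = discriminative_nll(np.zeros((2, 2)), np.array([0, 1]))  # equals 2 * log(2)
```

Training maximizes the log-likelihood, i.e., minimizes this loss averaged over the N demonstration trajectories; the generative approach instead trains the instruction model p(X | a_t, h_t) and recovers the action posterior via Bayes' rule.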




