KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION

Anonymous authors
Paper under double-blind review

Abstract

Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes and generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks, all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task. We will release all code, knowledge graphs, and pre-training datasets upon acceptance.

1. INTRODUCTION

Humans are able to use background experience when navigating unseen or partially-observable environments. Prior experience informs their world model of the semantic relationships between objects commonly found in an indoor scene, the likely object placements, and the properties of the sounds those objects emit throughout their object-object and object-scene interactions. Artificial embodied agents, constructed to perform goal-directed behaviour in indoor scenes, should be endowed with similar capabilities; indeed, as autonomous agents enter our homes, they will need an intuitive understanding of how objects are placed in different regions of houses, for better interaction with the environment. While external (domain) knowledge can yield improvements in agent sample-efficiency during learning, generalisability to unseen environments during inference, and overall interpretability in decision-making, the goal of finding generalisable solutions by injecting knowledge into embodied agents remains elusive (Oltramari et al., 2020; Francis et al., 2022). The task of semantic audio-visual navigation (shown in Fig. 1) lends itself especially well to the use of domain knowledge, e.g., in the form of human-inspired background experience, encapsulated as a prior over regions and the semantically-related objects contained therein. Certain sounds can be associated with particular places: a smoke alarm, for example, is more likely to originate in the kitchen. To infer such semantic information from sounds in an environment, we propose the idea of a knowledge-enhanced prior. By using a prior enriched with general experience, we hypothesise that the learned model will generalise to novel sound sources. We adopt a modular training paradigm, which has been shown to lead to improvements in cross-domain generalisability and more tractable optimisation (Chen et al., 2021b; Chaplot et al., 2020b; Francis et al., 2022).
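The notion of a prior over regions given a sound can be made concrete with a small sketch: score candidate regions by how often the sounding object co-occurs with each region, then normalise. All object names, region names, and counts below are illustrative assumptions for exposition, not the paper's actual prior or data.

```python
from collections import defaultdict

# Hypothetical (object, region) -> co-occurrence counts, e.g., as could be
# mined from annotated indoor scenes. Purely illustrative numbers.
CO_OCCURRENCE = {
    ("smoke_alarm", "kitchen"): 18,
    ("smoke_alarm", "hallway"): 4,
    ("washing_machine", "laundry_room"): 21,
    ("washing_machine", "bathroom"): 9,
    ("washing_machine", "kitchen"): 3,
}

def region_prior(sounding_object):
    """Return a normalised distribution over regions for a sounding object."""
    counts = defaultdict(float)
    for (obj, region), n in CO_OCCURRENCE.items():
        if obj == sounding_object:
            counts[region] += n
    total = sum(counts.values())
    if total == 0:
        return {}  # unknown object: no prior available
    return {region: n / total for region, n in counts.items()}

prior = region_prior("washing_machine")
best_region = max(prior, key=prior.get)  # region the agent should search first
```

Under this sketch, a washing-machine sound concentrates probability mass on the laundry room, which is the kind of region-level bias the knowledge-enhanced prior is meant to supply to the navigation policy.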
To verify our hypotheses, we evaluate the agent's performance on a set of novel sounding objects that were not introduced during training. Because the sound signal may stop while the agent is navigating (e.g., a washing machine finishes its cycle), the agent must jointly reason over sound and visual semantics to decide where to search for the sounding object.

Contributions. First, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks, all within a reinforcement learning (RL) framework. Second, we define a knowledge graph that encodes object-object, object-region, and region-region relations in house environments. Next, we curate a multimodal dataset for pre-training a visual encoder, in order to encourage object-awareness in visual scene understanding. Finally, we define a new semantic audio-visual navigation task, wherein we assess agents on the basis of their generalisation to truly novel sounding objects. We offer experimental results against strong baselines, and show improvements over these models on various performance metrics in unseen contexts. We will provide all code, dataset-generation utilities, and knowledge graphs upon acceptance of the manuscript.
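A knowledge graph of the kind described above can be represented as (head, relation, tail) triples spanning object-object, object-region, and region-region relations. The following is a minimal sketch under that assumption; the entities and relation names are hypothetical, not the released graph.

```python
# Illustrative triples covering the three relation families named above.
TRIPLES = [
    ("washing_machine", "found_in", "laundry_room"),   # object-region
    ("washing_machine", "near", "dryer"),              # object-object
    ("laundry_room", "adjacent_to", "bathroom"),       # region-region
    ("smoke_alarm", "found_in", "kitchen"),
    ("kitchen", "adjacent_to", "dining_room"),
]

def neighbours(entity, relation=None):
    """Entities directly linked to `entity`, optionally filtered by relation."""
    out = set()
    for head, rel, tail in TRIPLES:
        if relation is not None and rel != relation:
            continue
        if head == entity:
            out.add(tail)
        elif tail == entity:
            out.add(head)
    return out

# E.g., regions where a washing machine is expected to be found:
candidate_rooms = neighbours("washing_machine", relation="found_in")
```

A graph encoder operating over such triples could then propagate information between related objects and regions, which is the role the dual Graph Encoder Networks play in our framework.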

2. RELATED WORK

Modularity in goal-driven robot navigation. Goal-oriented navigation tasks have long been a topic of research in robotics (Kavraki et al., 1996; Lavalle et al., 2000; Canny, 1988; Koenig & Likhachev, 2006). Classical approaches generally tackle such tasks through non-learning techniques for searching and planning, e.g., heuristic-based search (Koenig & Likhachev, 2006) and probabilistic planning (Kavraki et al., 1996). Although classical approaches may offer better generalisation and optimality guarantees in low-dimensional settings, they often assume accurate state estimation and cannot operate on high-dimensional raw sensor inputs (Gordon et al., 2019). More recently, researchers have pursued data-driven techniques, e.g., deep reinforcement learning (Wijmans et al., 2020; Batra et al., 2020; Chaplot et al., 2020a; Yang et al., 2019; Chen et al., 2021b;a; Gan et al., 2020) and imitation learning (Irshad et al., 2021; Krantz et al., 2020), to design goal-driven navigation policies. End-to-end mechanisms have proven to be powerful tools for extracting meaningful features from raw sensor data, and are thus often favoured in settings where agents must learn to navigate toward goals in unknown environments using mainly raw sensory inputs. However, as task complexity increases, such systems generally exhibit significant performance drops, especially in unseen scenarios and in long-horizon tasks (Gordon et al., 2019; Saha et al., 2021). To address these limitations, modular decomposition has been explored in recent embodied tasks. Chaplot et al. (2020c) design a modular approach for visual navigation, consisting of a mapping module, a global policy, and a local policy, which, respectively, builds and updates a map of the environment, predicts the next sub-goal using the map, and predicts low-level actions to reach the sub-goal. Irshad et al.
(2021) also define a hierarchical setup for Vision-Language Navigation (VLN) (Anderson et al., 2018), where a global policy performs waypoint prediction, given the observations, and a local policy performs low-level motion control. Gordon et al. (2019) design a hierarchical controller that invokes different low-level controllers in charge of different tasks, such as planning, exploration, and perception. Similarly, Saha et al. (2021) design a modular mechanism for mobile manipulation that decomposes the task into mapping, language understanding, modality grounding, and planning. These modular designs have been shown to increase task performance and generalisability, especially in unexplored scenarios, compared to their end-to-end counterparts. Motivated by these findings, we develop a modular framework for semantic audio-visual navigation, which includes pre-trained and knowledge-enhanced scene priors, enabling improved generalisation to unseen contexts.

Knowledge graphs in visual navigation. Combining prior knowledge with machine learning systems remains a widely-investigated topic in various research fields, such as natural language processing (Ma et al., 2021; 2019; Francis et al., 2022), due to the improvements in generalisability and sample-efficiency that symbolic representation promises for learning-based approaches.



Figure 1: Illustration of the proposed semantic audio-visual navigation task. The agent is initialised at a random location in a 3D environment and tasked to navigate to the sounding object based on audio and visual signals. The sound signal may stop while the agent is navigating (e.g., the sound produced by the washing machine stops). Thus, the agent is encouraged to understand the sound and visual semantics to reason about where to search for the sounding object. For example, in the image above, the agent hears the washing machine sound and decides to navigate near the bathroom to search for the washing machine.

