KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION

Anonymous authors
Paper under double-blind review

Abstract

Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes and generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks, all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, in which agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task. We will release all code, knowledge graphs, and pre-training datasets upon acceptance.

1. INTRODUCTION

Humans are able to use background experience when navigating unseen or partially observable environments. Prior experience informs their world model of the semantic relationships between objects commonly found in an indoor scene, the likely object placements, and the properties of the sounds those objects emit throughout their object-object and object-scene interactions. Artificial embodied agents, constructed to perform goal-directed behaviour in indoor scenes, should be endowed with similar capabilities; indeed, as autonomous agents enter our homes, they will need an intuitive understanding of how objects are placed in different regions of houses, for better interaction with the environment. Whereas external (domain) knowledge can yield improvements in agent sample-efficiency during learning, generalisability to unseen environments during inference, and overall interpretability in decision-making, the goal of finding generalisable solutions by injecting knowledge into embodied agents remains elusive (Oltramari et al., 2020; Francis et al., 2022).

The task of semantic audio-visual navigation (shown in Fig. 1) lends itself especially well to the use of domain knowledge, e.g., in the form of human-inspired background experience (encapsulated as a prior over regions and the semantically-related objects contained therein). Certain sounds can be associated with particular places, e.g., a smoke alarm is more likely to originate in the kitchen. To infer such semantic information from sounds in an environment, we propose the idea of a knowledge-enhanced prior. By using a prior enriched with general experiences, we hypothesise that the learned model would generalise to novel sound sources. We adopt a modular training paradigm, which has been shown to lead to improvements in cross-domain generalisability and more tractable optimisation (Chen et al., 2021b; Chaplot et al., 2020b; Francis et al., 2022).
To verify our hypotheses, we evaluate the agent's performance on a set of novel sounding objects that were not introduced during training.
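To make the notion of an object-region prior concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation): a toy knowledge graph whose weighted edges encode how strongly each sounding object co-occurs with each indoor region, normalised into a distribution over regions given a heard sound. All object names, region names, and edge weights here are hypothetical examples.

```python
# Toy object-region knowledge graph: (object, region) -> co-occurrence weight.
# Weights are hypothetical stand-ins for domain knowledge such as
# "a smoke alarm is more likely to originate in the kitchen".
OBJECT_REGION_EDGES = {
    ("smoke_alarm", "kitchen"): 3.0,
    ("smoke_alarm", "hallway"): 1.0,
    ("television", "living_room"): 4.0,
    ("television", "bedroom"): 1.5,
    ("shower", "bathroom"): 5.0,
}


def region_prior(sounding_object: str) -> dict:
    """Normalise edge weights into P(region | heard object)."""
    weights = {
        region: w
        for (obj, region), w in OBJECT_REGION_EDGES.items()
        if obj == sounding_object
    }
    total = sum(weights.values())
    if total == 0.0:
        return {}  # unknown object: no informative prior
    return {region: w / total for region, w in weights.items()}


prior = region_prior("smoke_alarm")
# e.g. {"kitchen": 0.75, "hallway": 0.25}
```

In the full method, such a prior would be learned and encoded by graph networks rather than hand-specified, but the sketch conveys how sound-to-region knowledge can bias where an agent searches.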

