NOISY AGENTS: SELF-SUPERVISED EXPLORATION BY PREDICTING AUDITORY EVENTS

Abstract

Humans integrate multiple sensory modalities (e.g., vision and audio) to build a causal understanding of the physical world. In this work, we propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions through auditory event prediction. First, we allow the agent to collect a small amount of acoustic data and use K-means to discover the underlying auditory event clusters. We then train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration. We first conduct an in-depth analysis of our module on a set of Atari games. We then apply our model to audio-visual exploration using the Habitat simulator and to active learning using the ThreeDWorld (TDW) simulator. Experimental results demonstrate the advantages of audio signals over vision-based models as intrinsic rewards for guiding RL exploration.

1. INTRODUCTION

Deep Reinforcement Learning algorithms aim to learn a policy that maximizes an agent's cumulative rewards through interaction with an environment, and have demonstrated substantial success in a wide range of application domains, such as video games (Mnih et al., 2015), board games (Silver et al., 2016), and visual navigation (Zhu et al., 2017). While these results are remarkable, a critical constraint is the prerequisite of carefully engineered dense reward signals, which are not always accessible. To overcome this constraint, researchers have proposed a range of intrinsic reward functions. For example, curiosity-driven intrinsic rewards based on the prediction error of the current (Burda et al., 2018b) or future state (Pathak et al., 2017) in a latent feature space have shown promising results. Nevertheless, visual state prediction is a non-trivial problem, as the visual state is high-dimensional and tends to be highly stochastic in real-world environments.

The occurrence of physical events (e.g., objects coming into contact with each other, or changing state) often correlates with both visual and auditory signals. Both sensory modalities should thus offer useful cues to agents learning how to act in the world. Indeed, classic experiments in cognitive and developmental psychology show that humans naturally attend to both visual and auditory cues, and to their temporal coincidence, to arrive at a rich understanding of physical events and human activities such as speech (Spelke, 1976; McGurk & MacDonald, 1976). In artificial intelligence, however, much more attention has been paid to the ways visual signals (e.g., patterns in pixels) can drive learning. We believe this misses important structure that learners could exploit. Compared to visual cues, sounds are often more directly or easily observable causal effects of actions and interactions. This is clearly true when agents interact: most communication uses speech or other nonverbal but audible signals.
The same holds in physics. Almost any time two objects collide, rub or slide against each other, or touch in any way, they make a sound. That sound is often clearly distinct from background auditory textures and localized in both time and spectral properties, hence relatively easy to detect and identify; in contrast, specific visual events can be much harder to separate from all the ways high-dimensional pixel inputs change over the course of a scene. The sounds that result from object interactions also allow us to estimate underlying causally relevant variables, such as material properties (e.g., whether objects are hard or soft, solid or hollow, smooth or rough), which can be critical for planning actions. These facts raise a natural question: how can audio signals benefit policy learning in RL? In this paper, our main idea is to use sound prediction as an intrinsic reward to guide RL exploration. Intuitively, we want to exploit the fact that sounds are frequently made when objects interact, or when other causally significant events occur, as cues to causal structure or candidate subgoals an agent could discover and aim for. A naïve strategy would be to directly regress feature embeddings of audio clips and use the feature prediction errors as intrinsic rewards. However, prediction errors in feature space do not accurately reflect how well an agent understands the underlying causal structure of events and goals, and how to normalize such rewards to counter the diminishing-intrinsic-reward problem remains open. To bypass these limitations, we formulate the sound-prediction task as a classification problem, in which we train a neural network to predict the auditory event that occurs after an action is applied to a visual scene. We use classification errors as an exploration bonus for deep reinforcement learning. Concretely, our pipeline consists of two exploration phases.
In the beginning, the agent receives an incentive to actively collect a small amount of auditory data by interacting with the environment. We then cluster the sound data into auditory events using K-means. In the second phase, we train a neural network to predict the auditory events conditioned on the embeddings of visual observations and actions. States with wrong predictions are rewarded and encouraged to be visited more. We demonstrate the effectiveness of our intrinsic module on game playing in Atari (Bellemare et al., 2013), audio-visual exploration in Habitat (Savva et al., 2019), and active learning using a rolling robot in ThreeDWorld (TDW) (Gan et al., 2020a). In summary, our work makes the following contributions:
• We introduce a novel and effective auditory event prediction (AEP) framework that uses auditory signals as intrinsic rewards for RL exploration.
• We perform an in-depth study on Atari games to understand under what circumstances our audio-driven exploration works well.
• We demonstrate that our new model enables a more efficient exploration strategy for audio-visual embodied navigation in the Habitat environment.
• We show that our new intrinsic module is more stable in a 3D multi-modal physical world environment and can encourage interesting actions that involve physical interactions.
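As a concrete illustration, the two-phase pipeline above can be sketched in a few lines. Everything here is an assumed toy, not the paper's actual implementation: the feature dimensions and `n_clusters` are made up, K-means is written out in plain NumPy for self-containedness, and the event predictor is abstracted into its logits.

```python
import numpy as np

def kmeans(features, n_clusters=8, n_iters=50, seed=0):
    """Phase 1: cluster collected audio features into auditory event clusters."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each audio feature to its nearest cluster center.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return centers, labels

def auditory_event_label(audio_feat, centers):
    """Map a newly heard clip's features to the nearest event cluster."""
    return int(np.linalg.norm(centers - audio_feat, axis=1).argmin())

def intrinsic_reward(pred_logits, true_event):
    """Phase 2: reward 1.0 when the event predictor is wrong, else 0.0.
    A cross-entropy bonus would be a natural soft variant."""
    return float(pred_logits.argmax() != true_event)
```

In this sketch, `pred_logits` stands in for the output of the network that predicts the auditory event from the visual-observation embedding and the action; states where the prediction is wrong earn a bonus and are therefore visited more.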

2. RELATED WORK

Audio-Visual Learning. In recent years, audio-visual learning has been studied extensively. By leveraging audio-visual correspondences in videos, one can learn powerful audio and visual representations through self-supervised learning (Owens et al., 2016b; Aytar et al., 2016; Arandjelovic & Zisserman, 2017; Korbar et al., 2018; Owens & Efros, 2018). In contrast to the widely used correspondences between these two modalities, we take a step further by considering sound as a causal effect of actions.

RL Explorations. The problem of exploration in Reinforcement Learning (RL) has been an active research topic for decades. A variety of signals have been investigated for encouraging the agent to explore novel states, including information gain (Little & Sommer, 2013), surprise (Schmidhuber, 1991; 2010), state visitation counts (Tang et al., 2017; Bellemare et al., 2016), empowerment (Klyubin et al., 2005), curiosity (Pathak et al., 2017; Burda et al., 2018a), disagreement (Pathak et al., 2019), and so on. A separate line of work (Osband et al., 2016; 2019) adopts parameter noise and Thompson sampling heuristics for exploration; for example, Osband et al. (2019) train multiple value functions and use the bootstraps for deep exploration. Here, we mainly focus on the problem of using intrinsic rewards to drive exploration. The most widely used intrinsic motivations can be roughly divided into two families. The first is count-based approaches (Strehl & Littman, 2008; Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017; Martin et al., 2017; Burda et al., 2018b), which encourage the agent to visit novel states. For example, Burda et al. (2018b) employ the prediction errors of state features extracted from a fixed, randomly initialized network as exploration bonuses, encouraging the agent to visit previously unseen states. The second is curiosity-based approaches (Stadie et al., 2015; Pathak et al., 2017; Haber et al., 2018; Burda et al., 2018a), which formulate the bonus as the uncertainty in predicting the consequences of the agent's actions. For instance, Pathak et al. (2017) and Burda et al. (2018a) use the error of predicting the next state in a latent feature space as a reward, encouraging the agent to improve its knowledge of the environment dynamics. In contrast to previous work that purely uses vision, we make use of sound signals as rewards for RL exploration.
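For reference, the curiosity-style bonus used by the vision-based baselines can be sketched as follows. This is a deliberately simplified stand-in, not any cited paper's actual model: the forward dynamics model is a single linear map `W` (rather than a neural network), and the reward is the squared error of predicting the next state's features.

```python
import numpy as np

def curiosity_bonus(phi_s, phi_next, action_onehot, W, lr=1e-2):
    """Intrinsic reward = error of a forward model predicting the next
    state's features from the current features and action. The model is
    updated with one SGD step, so familiar transitions earn less reward."""
    x = np.concatenate([phi_s, action_onehot])
    pred = W @ x                        # predicted next-state features
    err = pred - phi_next
    reward = 0.5 * float(err @ err)     # squared prediction error as bonus
    W -= lr * np.outer(err, x)          # one gradient step on the model
    return reward
```

Repeatedly visiting the same transition drives this bonus toward zero, which is exactly the "diminishing intrinsic reward" behavior that makes normalization tricky for regression-based rewards, and part of what motivates the classification formulation above.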

