QUERY THE AGENT: IMPROVING SAMPLE EFFICIENCY THROUGH EPISTEMIC UNCERTAINTY ESTIMATION

Anonymous

Abstract

Curricula for goal-conditioned reinforcement learning agents typically rely on poor estimates of the agent's epistemic uncertainty, or fail to consider it altogether, resulting in poor sample efficiency. We propose a novel algorithm, Query The Agent (QTA), which significantly improves sample efficiency by estimating the agent's epistemic uncertainty throughout the state space and setting goals in highly uncertain areas. Encouraging the agent to collect data in highly uncertain states allows it to rapidly improve its estimate of the value function. QTA utilizes a novel technique for estimating epistemic uncertainty, Predictive Uncertainty Networks (PUN), to assess the agent's uncertainty in all previously observed states. We demonstrate that QTA offers decisive sample efficiency improvements over preexisting methods.

1. INTRODUCTION

Deep reinforcement learning has been demonstrated to be highly effective in a diverse array of sequential decision-making tasks (Silver et al., 2016; Berner et al., 2019). However, deep reinforcement learning remains challenging to deploy in the real world, in part because of the massive amount of data required for training. This challenge is acute in robotics (Sünderhauf et al., 2018; Dulac-Arnold et al., 2019), in tasks such as manipulation (Liu et al.), and in self-driving cars (Kothari et al.).

Existing curriculum methods for training goal-conditioned reinforcement learning (RL) agents suffer from poor sample efficiency (Dulac-Arnold et al., 2019) and often fail to consider an agent's specific deficiencies and epistemic uncertainty when selecting goals. Instead, they rely on poor proxy estimates of epistemic uncertainty or on high-level statistics from rollouts, such as the task success rate. Without tailoring learning to the agent's epistemic uncertainty, existing methods inhibit learning through three failure modes. First, a curriculum may not be sufficiently challenging for an agent, thus using timesteps inefficiently. Second, a curriculum may be too challenging for an agent, causing it to learn more slowly than it otherwise could. Third, a curriculum may fail to account for the agent catastrophically forgetting the value manifold in a previously learned region of the state space. To maximize learning, curriculum algorithms need a detailed estimate of the agent's epistemic uncertainty throughout the state space, so that they can direct the agent toward the regions it understands least (Kaelbling, 1993; Plappert et al., 2018).

We propose a novel curriculum algorithm, Query The Agent (QTA), to accelerate learning in goal-conditioned settings.
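The uncertainty-driven goal selection described above can be sketched in a few lines. The details of PUN are not given in this section, so the sketch below uses ensemble disagreement (the standard deviation of value predictions across ensemble members) as an illustrative stand-in for the epistemic uncertainty estimate; the function names and array shapes are our own assumptions, not QTA's actual interface.

```python
import numpy as np

def ensemble_disagreement(value_estimates):
    """Illustrative epistemic-uncertainty proxy: per-state standard
    deviation of value predictions across ensemble members.

    value_estimates: array of shape (n_members, n_states).
    Returns an array of shape (n_states,).
    """
    return value_estimates.std(axis=0)

def select_goal(observed_states, value_estimates):
    """Pick, from all previously observed states, the one where the
    uncertainty proxy is highest, and return it as the next goal."""
    uncertainty = ensemble_disagreement(value_estimates)
    return observed_states[int(np.argmax(uncertainty))]
```

A curriculum built on this loop would periodically re-score all observed states and issue the argmax-uncertainty state as the next goal, so that regions the agent has catastrophically forgotten (where disagreement rises again) are automatically revisited.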
QTA estimates the agent's epistemic uncertainty in all previously observed states, then drives the agent to reduce this uncertainty as quickly as possible by setting goals in states with high epistemic uncertainty. QTA estimates epistemic uncertainty using a novel neural architecture, Predictive Uncertainty Networks (PUN). By accounting for the agent's epistemic uncertainty throughout the state space, QTA aims to explore neither too quickly nor too slowly, and to revisit previously explored states when catastrophic forgetting occurs. We demonstrate in a 2D continuous maze environment that QTA is significantly more sample efficient than preexisting methods. We also provide a detailed analysis of how QTA's approximation of the optimal value manifold evolves over time, demonstrating that QTA's learning dynamics are meaningfully driven by epistemic uncertainty estimation. An overview of QTA and our maze environments is shown in Figure 1. We further demonstrate the importance of utilizing the agent's epistemic uncertainty by extending QTA with a modified Prioritized Experience Replay (PER) buffer (Schaul et al., 2016). This modified

