DEEP COHERENT EXPLORATION FOR CONTINUOUS CONTROL

Abstract

In policy search methods for reinforcement learning (RL), exploration is often performed by injecting noise either in action space at each step independently or in parameter space over each full trajectory. Prior work has shown that, with linear policies, a more balanced trade-off between these two exploration strategies is beneficial; however, that method did not scale to policies parameterized by deep neural networks. In this paper, we introduce Deep Coherent Exploration, a general and scalable exploration framework for deep RL algorithms on continuous control that generalizes step-based and trajectory-based exploration. The framework models the last-layer parameters of the policy network as latent variables and uses a recursive inference step within the policy update to handle these latent variables in a scalable manner. We find that Deep Coherent Exploration improves the learning speed and stability of A2C, PPO, and SAC on several continuous control tasks.

1. INTRODUCTION

The balance of exploration and exploitation (Kearns & Singh, 2002; Jaksch et al., 2010) is a longstanding challenge in reinforcement learning (RL). With insufficient exploration, states and actions with high rewards can be missed, causing policies to converge prematurely to bad local optima. In contrast, with too much exploration, agents may waste resources trying suboptimal states and actions without leveraging their experience efficiently. To learn successful strategies, this trade-off must be balanced well; this is known as the exploration vs. exploitation dilemma. At a high level, exploration strategies can be divided into directed and undirected ones (Thrun, 1992; Plappert et al., 2018). While directed strategies aim to extract useful information from existing experience to guide exploration, undirected strategies rely on injecting randomness into the agent's decision-making. Over the years, many sophisticated directed exploration strategies have been proposed (Tang et al., 2016; Ostrovski et al., 2017; Houthooft et al., 2016; Pathak et al., 2017). However, since these strategies still require lower-level exploration to collect experience, and are often complicated or computationally intensive, undirected exploration strategies remain common in practice; well-known examples are ε-greedy (Sutton, 1995) for discrete action spaces and additive Gaussian noise (Williams, 1992) for continuous action spaces. Such strategies explore by randomly perturbing the agent's actions at each step independently and are hence said to perform step-based exploration in action space (Deisenroth et al., 2013). As an alternative to exploration in action space, exploration by perturbing the weights of linear policies has been proposed (Rückstieß et al., 2010; Sehnke et al., 2010; Kober & Peters, 2008).
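To make the step-based notion concrete, the following minimal sketch (not the paper's implementation; the linear policy, noise scale, and dimensions are illustrative assumptions) shows additive Gaussian action noise: fresh noise is drawn independently at every step, so the same state can yield different actions.

```python
# Hedged sketch of step-based exploration in action space: additive
# Gaussian noise drawn independently at each step (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def policy_mean(state, W):
    # Illustrative linear policy: the mean action is W @ state.
    return W @ state

def step_based_action(state, W, sigma=0.1):
    mean = policy_mean(state, W)
    # Fresh, independent noise at every step.
    return mean + sigma * rng.normal(size=mean.shape)

W = np.ones((2, 3))
s = np.array([1.0, 0.5, -0.5])
a1 = step_based_action(s, W)
a2 = step_based_action(s, W)
# The same state generally produces different actions, since the
# perturbations are independent across steps.
```

Because the noise is uncorrelated across steps, exploration of this kind can look like jitter around the mean action rather than a coherent deviation from the current policy.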
Since these parameter-space strategies naturally explore conditioned on the state and are usually trajectory-based (perturbing the weights only at the beginning of each trajectory), they have the advantage of being more consistent, structured, and global (Deisenroth et al., 2013). Later, van Hoof et al. (2017) proposed a generalized exploration (GE) scheme, bridging the gap between step-based and trajectory-based exploration in parameter space. With the advance of deep RL, NoisyNet (Fortunato et al., 2018) and Parameter Space Noise for Exploration (PSNE) (Plappert et al., 2018) extended parameter-space exploration strategies to policies using deep neural networks.
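The trajectory-based alternative can be sketched as follows (again an illustrative assumption, not the paper's code): the weights of a linear policy are perturbed once at the start of a trajectory, and the agent then acts deterministically with the perturbed weights for the whole episode.

```python
# Hedged sketch of trajectory-based exploration in parameter space:
# the policy weights are perturbed once per trajectory (illustrative only).
import numpy as np

rng = np.random.default_rng(1)

def sample_perturbed_weights(W, sigma=0.1):
    # One Gaussian perturbation of the weights per trajectory.
    return W + sigma * rng.normal(size=W.shape)

def act(state, W):
    # Deterministic given the (perturbed) weights.
    return W @ state

W = np.zeros((2, 3))
W_ep = sample_perturbed_weights(W)  # drawn once, at the trajectory start
s = np.array([1.0, 0.5, -0.5])
a1 = act(s, W_ep)
a2 = act(s, W_ep)
# Revisiting the same state within a trajectory yields the same action:
# the exploration is state-conditioned and consistent across the episode,
# unlike independent per-step action noise.
```

This consistency is what makes parameter-space exploration more structured and global: a single weight perturbation shifts the agent's behavior coherently across all states in the trajectory.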

