DEEP COHERENT EXPLORATION FOR CONTINUOUS CONTROL

Abstract

In policy search methods for reinforcement learning (RL), exploration is often performed by injecting noise either in action space at each step independently or in parameter space over each full trajectory. Prior work has shown that, with linear policies, a more balanced trade-off between these two exploration strategies is beneficial. However, that method did not scale to policies using deep neural networks. In this paper, we introduce Deep Coherent Exploration, a general and scalable exploration framework for deep RL algorithms for continuous control that generalizes step-based and trajectory-based exploration. This framework models the last-layer parameters of the policy network as latent variables and uses a recursive inference step within the policy update to handle these latent variables in a scalable manner. We find that Deep Coherent Exploration improves the learning speed and stability of A2C, PPO, and SAC on several continuous control tasks.

1. INTRODUCTION

The balance of exploration and exploitation (Kearns & Singh, 2002; Jaksch et al., 2010) is a longstanding challenge in reinforcement learning (RL). With insufficient exploration, states and actions with high rewards can be missed, resulting in policies prematurely converging to bad local optima. In contrast, with too much exploration, agents can waste their resources trying suboptimal states and actions without leveraging their experiences efficiently. To learn successful strategies, this trade-off between exploration and exploitation must be balanced well; this is known as the exploration vs. exploitation dilemma. At a high level, exploration can be divided into directed strategies and undirected strategies (Thrun, 1992; Plappert et al., 2018). While directed strategies aim to extract useful information from existing experiences for better exploration, undirected strategies rely on injecting randomness into the agent's decision-making. Over the years, many sophisticated directed exploration strategies have been proposed (Tang et al., 2016; Ostrovski et al., 2017; Houthooft et al., 2016; Pathak et al., 2017). However, since these strategies still require lower-level exploration to collect the experiences, and are often complicated or computationally intensive, undirected exploration strategies remain common in practice; well-known examples are ε-greedy (Sutton, 1995) for discrete action spaces and additive Gaussian noise (Williams, 1992) for continuous action spaces. Such strategies explore by randomly perturbing agents' actions at different steps independently and hence are referred to as performing step-based exploration in action space (Deisenroth et al., 2013). As an alternative to these exploration strategies in action space, exploration by perturbing the weights of linear policies has been proposed (Rückstieß et al., 2010; Sehnke et al., 2010; Kober & Peters, 2008).
Since these strategies in parameter space naturally explore conditioned on the states and are usually trajectory-based (perturbing the weights only at the beginning of each trajectory) (Deisenroth et al., 2013), they have the advantage of being more consistent, structured, and global (Deisenroth et al., 2013). Later, van Hoof et al. (2017) proposed a generalized exploration (GE) scheme, bridging the gap between step-based and trajectory-based exploration in parameter space. With the advance of deep RL, NoisyNet (Fortunato et al., 2018) and Parameter Space Noise for Exploration (PSNE) (Plappert et al., 2018) were introduced, extending parameter-space exploration strategies to policies using deep neural networks. Although GE, NoisyNet, and PSNE improved over the vanilla exploration strategies in parameter space and were shown to lead to more global and consistent exploration, they still suffer from several limitations. Given this, we propose a new exploration scheme with the following characteristics.

1. Generalizing Step-based and Trajectory-based Exploration (van Hoof et al., 2017). Since both NoisyNet and PSNE are trajectory-based exploration strategies, they are considered relatively inefficient and bring insufficient stochasticity (Deisenroth et al., 2013). Following van Hoof et al. (2017), our method improves on them by interpolating between step-based and trajectory-based exploration in parameter space, achieving a more balanced trade-off between stability and stochasticity.

2. Recursive Analytical Integration of Latent Exploring Policies. NoisyNet and PSNE address the uncertainty from sampling exploring policies using Monte Carlo integration, while GE uses analytical integration over full trajectories, which scales poorly in the number of time steps. In contrast, we apply analytical and recursive integration after each step, which leads to low-variance and scalable updates.

3. Perturbing Only the Last Layer of the Policy Network. Both NoisyNet and PSNE perturb all layers of the policy network. However, in general, only the uncertainty in the parameters of the last (linear) layer can be integrated analytically. Furthermore, it is not clear that deep neural networks can be perturbed in meaningful ways for exploration (Plappert et al., 2018). We thus propose and evaluate an architecture where perturbation is applied only to the parameters of the last layer.

These characteristics define our contribution, which we refer to as Deep Coherent Exploration. We evaluate the coherent versions of A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), where experiments on OpenAI MuJoCo (Todorov et al., 2012; Brockman et al., 2016) tasks show that Deep Coherent Exploration outperforms other exploration strategies in terms of both learning speed and stability.
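To make the interpolation between step-based and trajectory-based exploration concrete, the sketch below samples a temporally coherent perturbation sequence for last-layer parameters as a first-order autoregressive (AR(1)) Gaussian process. This is an illustration under our own assumptions, not the paper's implementation: the function name and the mixing coefficient `beta` are hypothetical, chosen so that `beta = 0` recovers independent step-based noise and `beta = 1` holds one perturbation for the whole trajectory, while each marginal remains N(0, sigma^2 I).

```python
import numpy as np

def coherent_noise_sequence(dim, n_steps, beta, sigma=1.0, rng=None):
    """Sample a temporally coherent noise sequence for last-layer parameters.

    z_t = beta * z_{t-1} + sqrt(1 - beta^2) * eps_t,   eps_t ~ N(0, sigma^2 I)

    beta = 0 gives independent (step-based) perturbations at every step;
    beta = 1 keeps a single perturbation fixed over the trajectory
    (trajectory-based); intermediate beta interpolates between the two.
    The scaling sqrt(1 - beta^2) keeps each marginal z_t ~ N(0, sigma^2 I).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(0.0, sigma, size=dim)
    sequence = [z]
    for _ in range(n_steps - 1):
        eps = rng.normal(0.0, sigma, size=dim)
        z = beta * z + np.sqrt(1.0 - beta**2) * eps
        sequence.append(z)
    return np.stack(sequence)  # shape: (n_steps, dim)
```

Each row of the returned array would be added to the last-layer weights (flattened to `dim` entries) before computing the action at that step; `beta` controls how slowly the perturbation drifts.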

2. RELATED WORK

As discussed, exploration can broadly be classified into directed and undirected strategies (Thrun, 1992; Plappert et al., 2018), with undirected strategies being commonly used in practice because of their simplicity. Well-known methods such as ε-greedy (Sutton, 1995) or additive Gaussian noise (Williams, 1992) randomly perturb the action at each time step independently. These high-frequency perturbations, however, can result in poor coverage of the state-action space due to random-walk behavior (Rückstieß et al., 2010; Deisenroth et al., 2013), washing-out of exploration by the environment dynamics (Kober & Peters, 2008; Rückstieß et al., 2010; Deisenroth et al., 2013), and potential damage to mechanical systems (Koryakovskiy et al., 2017). One alternative is to instead perturb the policy in parameter space, with the perturbation held constant for the duration of a trajectory. Rückstieß et al. (2010) and Sehnke et al. (2010) showed that such parameter-space methods can improve exploration behavior through reduced variance and faster convergence when combined with REINFORCE (Williams, 1992) or Natural Actor-Critic (Peters et al., 2005). Another alternative to independent action-space perturbation is to correlate the noise applied at subsequent actions (Morimoto & Doya, 2000; Wawrzynski, 2015; Lillicrap et al., 2016), for example by generating perturbations from an Ornstein-Uhlenbeck (OU) process (Uhlenbeck & Ornstein, 1930). Later, van Hoof et al. (2017) used the same stochastic process, but in the parameter space of the policy. This approach uses a temporally coherent exploring policy, which unifies step-based and trajectory-based exploration. Moreover, the authors showed that, with linear policies, a more delicate balance between these two extreme strategies can yield better performance.
However, this approach was derived in a batch-mode setting and requires storing the full trajectory history and inverting a matrix that grows with the number of time steps. Thus, it does not scale well to long trajectories or complex models. Although these methods pioneered the research of exploration in parameter space, their applicability is limited. More precisely, these methods were only evaluated with extremely shallow (often linear) policies and relatively simple tasks with low-dimensional state and action spaces. Given this,
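As an illustration of the temporally correlated action noise discussed above (not the authors' implementation), the OU process can be discretized as below. The class name and the default parameter values (`theta`, `sigma`, `dt`) are assumptions here, following values commonly used for OU action noise in deep RL; the noise reverts toward its mean while remaining correlated across consecutive steps.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Discretized Ornstein-Uhlenbeck process for correlated exploration noise:

        x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, I)

    Unlike independent Gaussian noise, consecutive samples are correlated,
    producing smoother, more consistent perturbations.
    """

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, rng=None):
        self.mu = mu * np.ones(dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng() if rng is None else rng
        self.reset()

    def reset(self):
        """Reset the process to its mean at the start of each trajectory."""
        self.x = np.copy(self.mu)

    def __call__(self):
        """Advance the process one step and return the new noise sample."""
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x
```

In an action-space scheme the sample is added to the policy's action at each step; van Hoof et al. (2017) instead apply the same kind of process to the policy's parameters.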

