DEEP COHERENT EXPLORATION FOR CONTINUOUS CONTROL

Abstract

In policy search methods for reinforcement learning (RL), exploration is often performed by injecting noise either in action space at each step independently or in parameter space over each full trajectory. In prior work, it has been shown that with linear policies, a more balanced trade-off between these two exploration strategies is beneficial. However, that method did not scale to policies using deep neural networks. In this paper, we introduce Deep Coherent Exploration, a general and scalable exploration framework for deep RL algorithms on continuous control, that generalizes step-based and trajectory-based exploration. This framework models the last layer parameters of the policy network as latent variables and uses a recursive inference step within the policy update to handle these latent variables in a scalable manner. We find that Deep Coherent Exploration improves the speed and stability of learning of A2C, PPO, and SAC on several continuous control tasks.

1. INTRODUCTION

The balance of exploration and exploitation (Kearns & Singh, 2002; Jaksch et al., 2010) is a longstanding challenge in reinforcement learning (RL). With insufficient exploration, states and actions with high rewards can be missed, and policies prematurely converge to bad local optima. In contrast, with too much exploration, agents waste resources on suboptimal states and actions without leveraging their experience efficiently. To learn successful strategies, this trade-off must be balanced well, and this is known as the exploration vs. exploitation dilemma. At a high level, exploration can be divided into directed strategies and undirected strategies (Thrun, 1992; Plappert et al., 2018). While directed strategies aim to extract useful information from existing experience to guide exploration, undirected strategies rely on injecting randomness into the agent's decision-making. Over the years, many sophisticated directed exploration strategies have been proposed (Tang et al., 2016; Ostrovski et al., 2017; Houthooft et al., 2016; Pathak et al., 2017). However, since these strategies still require lower-level exploration to collect the experiences, and are often complicated or computationally intensive, undirected exploration strategies are still commonly used in practice; well-known examples are ε-greedy (Sutton, 1995) for discrete action spaces and additive Gaussian noise for continuous action spaces (Williams, 1992). Such strategies explore by randomly perturbing the agent's actions at different steps independently, and are hence referred to as performing step-based exploration in action space (Deisenroth et al., 2013). As an alternative to exploration in action space, exploration by perturbing the weights of linear policies has been proposed (Rückstieß et al., 2010; Sehnke et al., 2010; Kober & Peters, 2008).
Since these strategies in parameter space naturally explore conditioned on the states and are usually trajectory-based (perturbing the weights only at the beginning of each trajectory), they have the advantages of being more consistent, structured, and global (Deisenroth et al., 2013). Later, van Hoof et al. (2017) proposed a generalized exploration (GE) scheme, bridging the gap between step-based and trajectory-based exploration in parameter space. With the advance of deep RL, NoisyNet (Fortunato et al., 2018) and Parameter Space Noise for Exploration (PSNE) (Plappert et al., 2018) were introduced, extending parameter-space exploration to policies using deep neural networks. Although GE, NoisyNet, and PSNE improved over vanilla exploration in parameter space and were shown to lead to more global and consistent exploration, they still suffer from several limitations. Given this, we propose a new exploration scheme with the following characteristics. 1. Generalizing Step-based and Trajectory-based Exploration (van Hoof et al., 2017) Since both NoisyNet and PSNE are trajectory-based exploration strategies, they are considered relatively inefficient and bring insufficient stochasticity (Deisenroth et al., 2013). Following van Hoof et al. (2017), our method improves on them by interpolating between step-based and trajectory-based exploration in parameter space, where a more balanced trade-off between stability and stochasticity can be achieved.

2. Recursive Analytical Integration of Latent Exploring Policies NoisyNet and PSNE address the uncertainty from sampling exploring policies using Monte Carlo integration, while GE uses analytical integration over full trajectories, which scales poorly in the number of time steps. In contrast, we apply analytical and recursive integration after each step, which leads to low-variance and scalable updates. 3. Perturbing Last Layers of Policy Networks Both NoisyNet and PSNE perturb all layers of the policy network. However, in general, only the uncertainty in the parameters of the last (linear) layer can be integrated analytically. Furthermore, it is not clear that deep neural networks can be perturbed in meaningful ways for exploration (Plappert et al., 2018). We thus propose and evaluate an architecture where perturbation is applied only to the parameters of the last layer. These characteristics define our contribution, which we will refer to as Deep Coherent Exploration. We evaluate the coherent versions of A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), where experiments on OpenAI MuJoCo (Todorov et al., 2012; Brockman et al., 2016) tasks show that Deep Coherent Exploration outperforms other exploration strategies in terms of both learning speed and stability.

2. RELATED WORK

As discussed, exploration can broadly be classified into directed and undirected strategies (Thrun, 1992; Plappert et al., 2018), with undirected strategies being commonly used in practice because of their simplicity. Well-known methods such as ε-greedy (Sutton, 1995) or additive Gaussian noise (Williams, 1992) randomly perturb the action at each time step independently. These high-frequency perturbations, however, can result in poor coverage of the state-action space due to random-walk behavior (Rückstieß et al., 2010; Deisenroth et al., 2013), washing-out of exploration by the environment dynamics (Kober & Peters, 2008; Rückstieß et al., 2010; Deisenroth et al., 2013), and potential damage to mechanical systems (Koryakovskiy et al., 2017). One alternative is to instead perturb the policy in parameter space, with the perturbation held constant for the duration of a trajectory. Rückstieß et al. (2010) and Sehnke et al. (2010) showed that such parameter-space methods can bring improved exploration behavior, with reduced variance and faster convergence, when combined with REINFORCE (Williams, 1992) or Natural Actor-Critic (Peters et al., 2005). Another alternative to independent action-space perturbation is to correlate the noise applied at subsequent actions (Morimoto & Doya, 2000; Wawrzynski, 2015; Lillicrap et al., 2016), for example by generating perturbations from an Ornstein-Uhlenbeck (OU) process (Uhlenbeck & Ornstein, 1930). Later, van Hoof et al. (2017) used the same stochastic process in the parameter space of the policy. This approach uses a temporally coherent exploring policy, which unifies step-based and trajectory-based exploration. Moreover, the authors showed that, with linear policies, a more delicate balance between these two extreme strategies can yield better performance.
However, this approach was derived in a batch setting and requires storing the full trajectory history and inverting a matrix that grows with the number of time steps. Thus, it does not scale well to long trajectories or complex models. Although these methods pioneered research on exploration in parameter space, their applicability is limited: they were only evaluated with extremely shallow (often linear) policies on relatively simple tasks with low-dimensional state and action spaces. Given this, NoisyNet (Fortunato et al., 2018), PSNE (Plappert et al., 2018), and Stochastic A3C (SA3C) (Shang et al., 2019) were proposed, introducing more general and scalable methods for deep RL algorithms. All three methods can be seen as learning a distribution over policies for trajectory-based exploration in parameter space. Exploring policies are sampled by perturbing the weights across all layers of a deep neural network, with the uncertainty from sampling addressed by Monte Carlo integration. Whereas NoisyNet learns the magnitude of the noise for each parameter, PSNE heuristically adapts a single magnitude for all parameters. While showing good performance in practice (Fortunato et al., 2018; Plappert et al., 2018), these methods suffer from two potential limitations. Firstly, trajectory-based strategies can be inefficient, as only one exploring policy can be evaluated per (potentially long) trajectory (Deisenroth et al., 2013), which could result in a failure to escape local optima. Secondly, Monte Carlo integration results in high-variance gradient estimates, which could lead to oscillating updates.
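As an illustration of the temporally correlated action noise discussed above, an Ornstein-Uhlenbeck path can be sampled with a few lines of NumPy. This is a minimal sketch with illustrative parameter defaults (theta, sigma, dt are ours, not values from any cited paper):

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, dt=1e-2, x0=0.0, rng=None):
    """Sample a 1-D Ornstein-Uhlenbeck path:
        dx = -theta * x * dt + sigma * sqrt(dt) * dW.
    Successive values are temporally correlated, unlike i.i.d. Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n_steps)
    x_prev = x0
    for t in range(n_steps):
        x_prev = x_prev - theta * x_prev * dt + sigma * np.sqrt(dt) * rng.normal()
        x[t] = x_prev
    return x
```

Adding such a path to a deterministic policy output yields smoothly varying, rather than independent, action perturbations.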

3. BACKGROUND

This section provides background for reinforcement learning and related deep RL algorithms.

3.1. REINFORCEMENT LEARNING

Reinforcement learning is a sub-field of machine learning that studies how an agent learns strategies with high returns through trial-and-error interaction with an environment. This interaction is described using Markov decision processes (MDPs). An MDP is a tuple (S, A, r, P, γ), where S is the state space, A is the action space, r : S × A × S → R is the reward function with r_t = r(s_t, a_t, s_{t+1}), P : S × A × S → R_+ is the transition probability function, and γ is a discount factor indicating the preference for short-term rewards. In RL with continuous action spaces, an agent aims to learn a parametrized (e.g. Gaussian) policy π_θ(a|s) : S × A → R_+, with parameters θ, that maximizes the expected return over trajectories: J(θ) = E_{τ∼p(τ|π_θ)}[R(τ)], where τ = (s_0, a_0, ..., a_{T-1}, s_T) is a trajectory and R(τ) = Σ_{t=0}^{T} γ^t r_t is the discounted return.
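The discounted return R(τ) above is a plain geometric-weighted sum and can be computed directly (a minimal sketch):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one trajectory,
    given as a sequence of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```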

3.2. DEEP REINFORCEMENT LEARNING ALGORITHMS

Deep reinforcement learning combines deep learning and reinforcement learning, where policies and value functions are represented by deep neural networks for more powerful function approximation. In our experiments, we consider the following three deep RL algorithms.

Advantage Actor-Critic (A2C) Closely related to REINFORCE (Williams, 1992), A2C is an on-policy algorithm proposed as the synchronous version of the original Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016). The gradient of A2C can be written as:

∇_θ J(θ) = E_{τ∼p(τ|θ)}[ Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t|s_t) A^{π_θ}(s_t, a_t) ],

where A^{π_θ}(s_t, a_t) is the estimated advantage following policy π_θ.

Proximal Policy Optimization (PPO) PPO is an on-policy algorithm developed to take the largest update step that still keeps the updated policy close to the old policy in terms of Kullback-Leibler (KL) divergence. Instead of using a second-order method as in Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), PPO applies a first-order method and combines several tricks to reduce complexity. We consider the primary variant, PPO-Clip, with the following surrogate objective:

L^CLIP_{θ_k}(θ) = E_{τ∼p(τ|θ_k)}[ Σ_{t=0}^{T-1} min( r_t(θ) A^{π_{θ_k}}_t, clip(r_t(θ), 1 - ε, 1 + ε) A^{π_{θ_k}}_t ) ],

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_k}(a_t|s_t) and ε is a small threshold that approximately restricts the distance between the new policy and the old policy. In practice, to prevent the new policy from changing too fast, the KL divergence from the new policy to the old policy, approximated on a sampled batch, is often used as a further constraint.
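The PPO-Clip surrogate can be sketched numerically as follows. This is an illustrative NumPy version (function and argument names are ours, not the authors' implementation):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch PPO-Clip surrogate: mean over min(r*A, clip(r, 1-eps, 1+eps)*A),
    with r = pi_new / pi_old computed from log-probabilities for stability."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

For an unchanged policy the ratio is 1 and the surrogate reduces to the mean advantage; for a ratio of 2 with positive advantage, the clip caps the contribution at 1 + eps.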
Soft Actor-Critic (SAC) As an entropy-regularized (Ziebart et al., 2008) off-policy actor-critic method (Lillicrap et al., 2016; Fujimoto et al., 2018) with a stochastic policy, SAC (Haarnoja et al., 2018) learns the optimal entropy-regularized Q-function through 'soft' Bellman back-ups with off-policy data:

Q^π(s, a) = E_{s'∼P(·|s,a), ã'∼π(·|s')}[ r + γ (Q^π(s', ã') + α H(π(·|s'))) ],

where H is the entropy and α is the temperature parameter. The policy is then learned by maximizing the expected maximum-entropy V-function via the reparameterization trick (Kingma et al., 2015).
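The soft Bellman back-up can be illustrated with the sampled-action target commonly used in SAC implementations, where -log π(ã'|s') serves as a one-sample estimate of the entropy H. A hedged sketch (names are ours):

```python
def soft_bellman_target(r, q_next, logp_next, gamma=0.99, alpha=0.2):
    """Entropy-regularized target for one transition:
        r + gamma * (Q(s', a') - alpha * log pi(a'|s')),
    using -log pi(a'|s') as a single-sample entropy estimate."""
    return r + gamma * (q_next - alpha * logp_next)
```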

4. DEEP COHERENT EXPLORATION FOR CONTINUOUS CONTROL

To achieve the desiderata in Section 1, we propose Deep Coherent Exploration, a method that models the policy as a generative model with latent variables. This policy is represented as π_{w_t,θ}(a_t|s_t) = N(W_t f_θ(s_t) + b_t, Λ_a^{-1}). Here w_t denotes all last-layer parameters of the policy network at step t, combining W_t and b_t, θ denotes the parameters of the policy network except for the last layer, and Λ_a is a fixed, diagonal precision matrix. Our method treats the last-layer parameters w_t as latent variables with marginal distribution w_t ∼ N(μ_t, Λ_t^{-1}), where μ_t and Λ_t are functions of the learnable parameters μ and Λ respectively. In this model, all learnable parameters are denoted ζ = (μ, Λ, θ). We provide a graphical model of Deep Coherent Exploration in Appendix A. As in van Hoof et al. (2017), Deep Coherent Exploration generalizes step-based and trajectory-based exploration by constructing a Markov chain over w_t. This Markov chain specifies joint probabilities through an initial distribution p_0(w_0) = N(μ, Λ^{-1}) and the conditional distribution p(w_t|w_{t-1}). This latter term explicitly expresses temporal coherence between subsequent parameter vectors. In this setting, step-based exploration corresponds to the extreme case p(w_t|w_{t-1}) = p_0(w_t), and trajectory-based exploration corresponds to the other extreme case p(w_t|w_{t-1}) = δ(w_t - w_{t-1}), where δ is the Dirac delta function. To ensure that the marginal distribution of w_t equals the initial distribution p_0 at any step t, we directly follow van Hoof et al. (2017) with the following transition distribution for w_t:

p(w_t|w_{t-1}) = N((1 - β) w_{t-1} + βμ, (2β - β²) Λ^{-1}),   (5)

where β is a hyperparameter that controls the temporal coherence of w_t and w_{t-1}. The two extreme cases then correspond to β = 0 for trajectory-based exploration and β = 1 for step-based exploration, while intermediate exploration corresponds to β ∈ (0, 1).
For intermediate values of β, we obtain smoothly changing policies that sufficiently explore, while reducing high-frequency perturbations.
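The transition distribution p(w_t|w_{t-1}) above can be sketched in a few lines; the snippet also makes it easy to verify empirically that the marginal N(μ, Σ) is preserved for any β. This is an illustrative NumPy sketch (names and shapes are ours, not the authors' code):

```python
import numpy as np

def coherent_step(w_prev, mu, cov_chol, beta, rng):
    """One draw from p(w_t | w_{t-1}) = N((1-beta) w_{t-1} + beta mu,
    (2 beta - beta^2) Sigma), with Sigma = cov_chol @ cov_chol.T.

    beta=1 resamples independently (step-based exploration); beta=0 keeps
    w_t = w_{t-1} (trajectory-based). For any beta, if w_{t-1} ~ N(mu, Sigma)
    then w_t ~ N(mu, Sigma): (1-beta)^2 + (2 beta - beta^2) = 1."""
    scale = np.sqrt(2.0 * beta - beta ** 2)
    noise = cov_chol @ rng.standard_normal(mu.shape)
    return (1.0 - beta) * w_prev + beta * mu + scale * noise
```

The lag-1 correlation of the chain is 1 - β, which is the knob trading off coherence against stochasticity.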

4.1. ON-POLICY DEEP COHERENT EXPLORATION

Our method can be combined with any on-policy policy gradient method; here we present this adaptation with REINFORCE (Williams, 1992). Starting from the RL objective in Section 3.1, the policy gradient is ∇_ζ J(ζ) = E_{τ∼p(τ|ζ)}[∇_ζ log p(τ|ζ) R(τ)], where the gradient w.r.t. the sampled trajectory can be obtained using the standard chain rule: ∇_ζ log p(τ|ζ) = Σ_{t=0}^{T-1} ∇_ζ log p(a_t|s_[0:t], a_[0:t-1], ζ). Here, since information can still flow through the unobserved latent variable w_t, our policy is no longer Markov. The marginal action probability given the history at step t is

p(a_t|s_[0:t], a_[0:t-1], ζ) = ∫ π_{w_t,θ}(a_t|s_t) p(w_t|s_[0:t-1], a_[0:t-1], μ, Λ, θ) dw_t,   (7)

where the first factor is the Gaussian action probability and the second factor can be interpreted as the forward message α(w_t) along the chain. We decompose α(w_t) by introducing w_{t-1}:

α(w_t) = ∫ p(w_t, w_{t-1}|s_[0:t-1], a_[0:t-1], ζ) dw_{t-1}   (10)
= ∫ p(w_t|w_{t-1}; μ, Λ) p(a_{t-1}|s_{t-1}; w_{t-1}, θ) α(w_{t-1}) / Z_{t-1} dw_{t-1},   (11)

where the first factor is the transition probability of w_t (Equation 5) and Z_{t-1} is a normalizing constant. Since Gaussians are closed under marginalization and conditioning, the second factor can be obtained analytically without computing Z_{t-1} explicitly. Moreover, α(w_{t-1}) is a Gaussian by induction from the initial step. As a result, we arrive at an efficient recursive expression for exact inference of w_t. Again, by the properties of Gaussians, all integrals above can be solved analytically, so the marginal action probability given the history at each step t can be obtained and used for policy updates. Summarizing, non-Markov policies require substituting the regular p(a_t|s_t; θ) term in the update equation with p(a_t|s_[0:t], a_[0:t-1], ζ) (Equation 7). Equations 10-11 show how this expression can be calculated recursively. Except for this substitution, learning algorithms like A2C and PPO proceed as normal.
For detailed mathematical derivation, please refer to Appendix B.
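The recursive inference step described above resembles a one-dimensional-in-time Kalman filter: compute the marginal action likelihood, condition the message on the observed action, then propagate through the coherent transition. The following NumPy sketch is our illustrative reconstruction of one such step (variable names and shapes are ours, not the authors' code):

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(mean, cov) at x."""
    d = x.size
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def coherent_step_logprob(a, X, v, P, mu, Sigma, Lam_a_inv, beta):
    """One recursion step: returns log p(a_t | history) and the next
    forward message (v', P') after conditioning on (s_t, a_t).

    v, P: mean/covariance of the message alpha(w_t); X: stacked feature
    matrix of s_t; mu, Sigma: prior over last-layer weights;
    Lam_a_inv: action-noise covariance."""
    # marginal action probability: N(X v, Lam_a_inv + X P X^T)
    logp = gaussian_logpdf(a, X @ v, Lam_a_inv + X @ P @ X.T)
    # Gaussian posterior over the weights given the observed action
    Lam_a = np.linalg.inv(Lam_a_inv)
    Sig_post = np.linalg.inv(np.linalg.inv(P) + X.T @ Lam_a @ X)
    u = Sig_post @ (X.T @ Lam_a @ a + np.linalg.solve(P, v))
    # propagate through the coherent transition
    v_next = (1 - beta) * u + beta * mu
    P_next = (2 * beta - beta ** 2) * Sigma + (1 - beta) ** 2 * Sig_post
    return logp, v_next, P_next
```

Note that for β = 1 the propagated message collapses back to the prior N(μ, Σ), recovering step-based exploration.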

4.2. OFF-POLICY DEEP COHERENT EXPLORATION

Combining our method with off-policy methods (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018) requires defining both the behavior policy and the update equation. The behavior policy is the same as in the on-policy methods discussed earlier (Equation 5). The policy update procedure may require adjustments for specific algorithms; here, we show how to adapt our method for SAC (Haarnoja et al., 2018). In the SAC policy update, the target policy is adapted towards the exponential of the new Q-function. The target policy here is the marginal policy, rather than the policy conditioned on the sampled w, as the second option would ignore the dependence on μ and Λ. This consideration leads to the following objective for the policy update:

J(ζ) = E_{s_t∼D}[ KL( p(a_t|s_t, ζ) ∥ exp(Q_φ(s_t, a_t)) / Z_φ(s_t) ) ]   (12)
= E_{s_t∼D, a_t∼p(a_t|s_t,ζ)}[ log p(a_t|s_t, ζ) - Q_φ(s_t, a_t) ],   (13)

where φ denotes the parameters of the Q-function and p(a_t|s_t, ζ) is the marginal policy, which can again be obtained analytically (Equation 22):

p(a_t|s_t, ζ) = ∫ π_{w_0,θ}(a_t|s_t) p(w_0|μ, Λ) dw_0,

where all parameters can be learned via the reparameterization trick (Kingma et al., 2015).

5. EXPERIMENTS

For the experiments, we compare our method with NoisyNet (Fortunato et al., 2018), PSNE (Plappert et al., 2018), and standard action noise. This comparison is evaluated in combination with A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018) on OpenAI Gym MuJoCo (Todorov et al., 2012; Brockman et al., 2016) continuous control tasks. For exploration in parameter space, we use a fixed action noise with a standard deviation of 0.1. For A2C and PPO, the standard deviations of parameter noise are all initialized at 0.017, as suggested in Fortunato et al. (2018). For SAC, we initialize the standard deviation of parameter noise at 0.034 for both our method and PSNE, as it gave better results in practice. Deep Coherent Exploration learns the logarithm of the parameter noise, while NoisyNet learns the parameter noise directly and PSNE adapts it heuristically. To protect the policies from changing too dramatically, we consider three small values of β (0.0, 0.01, and 0.1) for Deep Coherent Exploration, and use β = 0.01 for the comparative evaluation against other exploration strategies. Our implementation of NoisyNet is based on the code from Kaixhin[1]. For PSNE, we refer to the authors' implementation in OpenAI Baselines[2] (Dhariwal et al., 2017) and the original paper, setting the KL threshold for A2C and PPO to 0.01 and the MSE threshold for SAC to 0.1. Exploration with action noise uses the default setting proposed by Achiam (2018), where the standard deviation of action noise is initialized at around 0.6 for A2C and PPO; for SAC, the standard deviation of action noise is output by the policy network in the baseline setting. In all experiments, agents are trained for a total of 10^6 environment steps and are updated after each epoch. A2C and PPO use four parallel workers, where each worker collects a trajectory of 1000 steps per epoch, resulting in epochs of 4000 steps in total.
After each epoch, both A2C and PPO update their value functions for 80 gradient steps. At the same time, A2C updates its policy for one gradient step, while PPO updates its policy for up to 80 gradient steps, stopping early once the KL constraint is exceeded. SAC uses a single worker with 4000 environment steps per epoch. After every 50 environment steps, both the policy and the value functions are updated for 50 gradient steps. To keep the parameter noise stable, we adapt its standard deviation after each epoch. Our implementations of A2C, PPO, and SAC with different exploration strategies are adapted from OpenAI Spinning Up (Achiam, 2018) with default settings. All three algorithms use two-layer feedforward neural networks with the same architectures for both policy and value function: A2C and PPO use 64 and 64 hidden units with tanh activations, while SAC uses 256 and 256 hidden units with rectified linear units (ReLU). Parameters of policies and value functions in all three algorithms are updated using Adam (Kingma & Ba, 2015). A2C and PPO use a learning rate of 3·10^{-4} for the policies and a learning rate of 10^{-3} for the value functions. SAC uses a single learning rate of 10^{-3} for both policy and value function. For each task, we evaluate the performance of agents after every 5 epochs, with no exploration noise, where each evaluation reports the average reward of 10 episodes per worker. To mitigate the randomness within environments and policies, we report our results as averages over 10 random seeds. All settings not explicitly described in this section are set to the code-base defaults; for more details, please refer to the documentation and source code of OpenAI Spinning Up (Achiam, 2018) and our implementation[3].

5.1. COMPARATIVE EVALUATION

In this section, we present the results for A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018) on three control tasks, with additional results shown in Appendix D. Figure 1 shows that, overall, Coherent-A2C outperforms all other A2C-based methods in terms of learning speed, final performance, and stability. In particular, given that Ant-v2 is considered a challenging task, Deep Coherent Exploration considerably accelerates learning on it. For the PPO-based methods, our method still outperforms NoisyNet and PSNE significantly on all tasks. However, our method's advantage over standard action noise is smaller; in particular, Coherent-PPO underperforms PPO on Walker2d-v2. Two reasons might explain this. Firstly, some environments might be more unstable and require a larger degree of exploration, which favors PPO as it initializes its action noise at a much greater value. Secondly, because of their extra parameters, the policies of Coherent-PPO, NoisyNet-PPO, and PSNE-PPO tend to reach the KL threshold in fewer update steps, which leads to slower learning. For SAC, the advantages of our method are smaller than for A2C and PPO. More specifically, Coherent-SAC learns slightly faster than SAC on HalfCheetah-v2 and achieves the highest average returns on Walker2d-v2 and Ant-v2. Furthermore, Coherent-SAC shows variance lower than PSNE-SAC but higher than the baseline SAC.

5.2. ABLATION STUDIES

In this section, we present three separate ablation studies to clarify the effect of each characteristic discussed in Section 1. These ablation studies are performed with A2C, to ensure that all characteristics are applicable and because the fixed number of gradient steps puts the methods on equal footing. Here we show the results for HalfCheetah-v2, where the full results can be found in Appendix D.

Generalizing Step-based and Trajectory-based Exploration As shown in Figure 2a, we evaluate three values of β (0.0, 0.01, and 0.1) for Coherent-A2C. Both intermediate strategies (β = 0.01 and β = 0.1) outperform the trajectory-based strategy (β = 0.0). Coherent-A2C with β = 0.01 seems to achieve the best balance between randomness and stability, with a considerably higher return than the other two.

Analytical Integration of Latent Exploring Policies

We introduce OurNoisyNet for comparison. OurNoisyNet uses a noisy linear layer only for its last layer, and this layer learns the logarithm of the noise standard deviation, as in Deep Coherent Exploration. We compare Coherent-A2C using β = 0.0 with OurNoisyNet-A2C, the only difference thus being whether we integrate analytically or use the reparameterization trick (Kingma et al., 2015). We first measure the variance of the gradient estimates of both methods, computed as the trace of the covariance matrix over 10 gradient samples. We report this measure at six stages during training, as shown in Figure 2c. We observe that analytical integration leads to lower-variance gradient estimates across all training stages for HalfCheetah-v2. We further present the learning curves of both methods in Figure 2b, where Coherent-A2C with β = 0.0 shows a higher return than OurNoisyNet-A2C. Interestingly, Coherent-A2C also displays a much lower standard deviation across random seeds. Furthermore, the lower-variance gradient estimates of Coherent-A2C could enable a larger learning rate for faster training without making policy updates unstable.

Perturbing Last Layers of Policy Networks In this part, we compare OurNoisyNet-A2C perturbed over all layers with OurNoisyNet-A2C perturbed over only the last layer. The result is shown in Figure 2b. Somewhat to our surprise, the latter seems to perform much better. There are several possible reasons. Firstly, since it is unknown how the parameter noise (especially in lower layers) is realized in action noise, perturbing all layers of the policy network may lead to uncontrollable perturbations.
Such excess exploratory noise could inhibit exploitation. Secondly, perturbing all layers might disturb the representation learning of states, which is undesirable for learning a good policy. Thirdly, perturbing only the last layer could also lead to fewer parameters for NoisyNet.

6. CONCLUSION

In this paper, we have presented a general and scalable exploration framework that extends the generalized exploration scheme (van Hoof et al., 2017) to continuous-control deep RL algorithms. In particular, the recursive calculation of marginal action probabilities allows handling long trajectories and high-dimensional parameter vectors. Compared with NoisyNet (Fortunato et al., 2018) and PSNE (Plappert et al., 2018), our method offers three improvements. Firstly, Deep Coherent Exploration generalizes step-based and trajectory-based exploration in parameter space, allowing a more balanced trade-off between stochasticity and coherence. Secondly, it analytically marginalizes the latent policy parameters, yielding lower-variance gradient estimates that stabilize and accelerate learning. Thirdly, by perturbing only the last layer of the policy network, it provides better control of the injected noise. When combined with A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), Deep Coherent Exploration empirically outperforms other exploration strategies on most of the MuJoCo continuous control tasks tested. Furthermore, the ablation studies show that, while each improvement is beneficial on its own, combining them leads to even faster and more stable learning. For future work, since Deep Coherent Exploration uses a fixed and small action noise, one interesting direction is to study whether learnable perturbations in action space can be combined with our method in a meaningful way for even more effective exploration.

A GRAPHICAL MODEL OF DEEP COHERENT EXPLORATION

In this appendix, we provide a graphical model representation of Deep Coherent Exploration, shown in Figure 3 . This graphical model uses the same conventions as in Bishop (2007) , where empty circles denote latent random variables, shaded circles denote observed random variables, and dots denote deterministic variables. 

B MARGINAL ACTION PROBABILITY FOR ON-POLICY COHERENT EXPLORATION

As discussed in Section 4.1, the forward message α(w_t) is used to compute the marginal action probability given the history at step t for the final learning objective. Suppose we have the Gaussian policy π_{w_t,θ}(a_t|s_t) = N(W_t x_t + b_t, Λ_a^{-1}), where a_t ∈ R^p, x_t = f_θ(s_t) ∈ R^q, W_t ∈ R^{p×q} is the coefficient matrix, b_t ∈ R^p is the bias vector, and Λ_a is a constant precision matrix for the Gaussian policy. It is helpful to represent w_t ∈ R^{pq+p} by flattening W_t row by row and appending b_t:

w_t = (w_11, ..., w_1q, ..., w_p1, ..., w_pq, b_1, ..., b_p)^T,

such that the parameters can still be sampled from a multivariate Gaussian. Moreover, we stack x_t into X_t ∈ R^{p×(pq+p)}, the block matrix whose i-th row contains x_t^T in the columns corresponding to the i-th row of W_t, and a 1 in the column corresponding to b_i:

X_t = [ blockdiag(x_t^T, ..., x_t^T) | I_p ],

where the left block is p×pq and I_p is the p×p identity acting on the biases. After this transformation, the Gaussian policy is represented equivalently as π_{w_t,θ}(a_t|s_t) = N(X_t w_t, Λ_a^{-1}).

B.1 BASE CASE

For the base case t = 0, the forward message α(w_0) and the initial distribution of w_0 are identical by definition:

α(w_0) = p_0(w_0; μ, Λ) = N(μ, Λ^{-1}).   (19)

Additionally, the action probability is given by:

π_{w_0,θ}(a_0|s_0) = N(X_0 w_0, Λ_a^{-1}).   (20)

With the properties of multivariate Gaussians, we obtain the marginal action probability given the history at t = 0:

log p(a_0|s_0, μ, Λ, θ) = log ∫ π_{w_0,θ}(a_0|s_0) α(w_0) dw_0   (21)
= log N(a_0; X_0 μ, Λ_a^{-1} + X_0 Λ^{-1} X_0^T).   (22)
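The flattening of (W, b) into w and the stacking of x into X, together with the resulting marginal of Equation 21, can be checked with a small NumPy sketch (helper names are ours):

```python
import numpy as np

def stack_features(x, p):
    """Build X in R^{p x (p q + p)} from features x in R^q so that
    X @ w_flat == W @ x + b, where w_flat stacks the rows of W and then b."""
    q = x.size
    X = np.zeros((p, p * q + p))
    for i in range(p):
        X[i, i * q:(i + 1) * q] = x   # row i of W
        X[i, p * q + i] = 1.0          # bias b_i
    return X

def marginal_action_dist(X, mu, Sigma, Lam_a_inv):
    """Marginal over actions after integrating out w ~ N(mu, Sigma):
    mean X mu, covariance Lam_a_inv + X Sigma X^T (cf. Eqs. 21-22)."""
    return X @ mu, Lam_a_inv + X @ Sigma @ X.T
```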

B.2 GENERAL CASE

For the general case of step t > 0, we need the state s_{t-1}, action a_{t-1}, as well as the mean and covariance of the forward message α(w_{t-1}) stored from the previous step. Suppose α(w_{t-1}) is written as:

α(w_{t-1}) = N(v_{t-1}, L_{t-1}^{-1}),   (23)

and the action probability from the previous step is given by:

π_{w_{t-1},θ}(a_{t-1}|s_{t-1}) = N(X_{t-1} w_{t-1}, Λ_a^{-1}).   (24)

We directly have:

p(w_{t-1}|s_[0:t-1], a_[0:t-1], μ, Λ, θ) = N(u_{t-1}, Σ_{t-1}),   (25)

with

u_{t-1} = Σ_{t-1} (X_{t-1}^T Λ_a a_{t-1} + L_{t-1} v_{t-1}),   (26)
Σ_{t-1} = (L_{t-1} + X_{t-1}^T Λ_a X_{t-1})^{-1}.   (27)

Combining the transition probability of w_t:

p(w_t|w_{t-1}; μ, Λ) = N((1-β) w_{t-1} + βμ, (2β-β²) Λ^{-1}),   (28)

we obtain the forward message α(w_t):

α(w_t) = N(v_t, L_t^{-1}),   (29)

where

v_t = (1-β) u_{t-1} + βμ,   (30)
L_t^{-1} = (2β-β²) Λ^{-1} + (1-β)² Σ_{t-1}.   (31)

Here, v_t and L_t^{-1} should be stored and used for exact inference of α(w_{t+1}) at the next step. Finally, the marginal action probability given the history at step t > 0 is:

log p(a_t|s_[0:t], a_[0:t-1], μ, Λ, θ) = log ∫ π_{w_t,θ}(a_t|s_t) α(w_t) dw_t   (32)
= log N(a_t; X_t v_t, Λ_a^{-1} + X_t L_t^{-1} X_t^T).   (33)

C.2 COHERENT PROXIMAL POLICY OPTIMIZATION (COHERENT-PPO)

Coherent-PPO can be implemented in a similar way. As in Coherent-A2C, we substitute the original objective L^CLIP_{θ_k}(θ) with L^CLIP_{μ_k,Λ_k,θ_k}(μ, Λ, θ), which is given by:

L^CLIP_{μ_k,Λ_k,θ_k}(μ, Λ, θ) = E_{τ∼p(τ|μ_k,Λ_k,θ_k)}[ Σ_{t=0}^{T-1} min( r_t(μ,Λ,θ) A^{π_{μ,Λ,θ}}_t, clip(r_t(μ,Λ,θ), 1-ε, 1+ε) A^{π_{μ,Λ,θ}}_t ) ],   (34)

where r_t(μ,Λ,θ) = p(a_t|s_[0:t], a_[0:t-1], μ, Λ, θ) / p(a_t|s_[0:t], a_[0:t-1], μ_k, Λ_k, θ_k). Here, after each policy update step, p(a_t|s_[0:t], a_[0:t-1], μ, Λ, θ) from the new policy should be evaluated on the most recent trajectory τ_k, for both the next update and the approximated KL divergence. However, this quantity cannot be calculated directly, but only through sampling w_t and then integrating w_t out. Since w_t is integrated out in the end, it does not matter which specific w_t is sampled.
So one could sample a new set of w t , or use a fixed w along the recent trajectory τ k . The second way is often faster because sampling is avoided. The pseudo-code of single-worker Coherent-PPO is shown in Algorithm 2.
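As a minimal sketch, the recursion of Appendix B.2 can be written in NumPy as follows, assuming the per-step last-layer feature matrix X_t is available; all function and variable names are ours, not from the paper:

```python
import numpy as np

def forward_message_update(v_prev, L_prev_inv, X_prev, a_prev,
                           mu, Lam_inv, Lam_a, beta):
    """One step of the exact forward-message recursion (Eqs. 25-31).

    v_prev, L_prev_inv : mean and covariance of alpha(w_{t-1})
    X_prev, a_prev     : last-layer features and action of step t-1
    mu, Lam_inv        : mean and covariance of the weight prior
    Lam_a              : action-noise precision
    beta               : coherence parameter in [0, 1]
    """
    L_prev = np.linalg.inv(L_prev_inv)
    # Condition alpha(w_{t-1}) on the observed action a_{t-1} (Eqs. 25-27).
    Sigma = np.linalg.inv(L_prev + X_prev.T @ Lam_a @ X_prev)
    u = Sigma @ (X_prev.T @ Lam_a @ a_prev + L_prev @ v_prev)
    # Propagate through the weight transition p(w_t | w_{t-1}) (Eqs. 29-31).
    v = (1.0 - beta) * u + beta * mu
    L_inv = (2.0 * beta - beta**2) * Lam_inv + (1.0 - beta)**2 * Sigma
    return v, L_inv

def marginal_log_prob(a, X, v, L_inv, Lam_a_inv):
    """log p(a_t | history) (Eqs. 32-33): a Gaussian with mean X v_t
    and covariance Lam_a^{-1} + X L_t^{-1} X^T."""
    mean = X @ v
    cov = Lam_a_inv + X @ L_inv @ X.T
    diff = a - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (a.shape[0] * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))
```

Note that each update costs only small matrix inverses in the last-layer weight dimension, which is what makes the inference step scalable.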

C.3 COHERENT SOFT ACTOR-CRITIC (COHERENT-SAC)

For Coherent-SAC, only two changes are needed. First, we sample the last-layer parameters w_t of the policy network at each step t for exploration. Second, after each epoch we improve the marginal policy rather than the actual policy that performed the exploration. The pseudo-code of single-worker Coherent-SAC is shown in Algorithm 3.
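The two sampling branches used for exploration (from the prior p_0 right after an update, and from the transition p(w | w_prev) otherwise) can be sketched as follows; the class name, the flattened weight vector, and the NumPy representation are our assumptions:

```python
import numpy as np

class CoherentLastLayerSampler:
    """Per-step coherent sampling of the last-layer weights w:
    w ~ p_0(w; mu, Lam) = N(mu, Lam^{-1}) after a policy update, and
    w ~ p(w | w_prev) = N((1-b) w_prev + b mu, (2b - b^2) Lam^{-1}) otherwise."""

    def __init__(self, mu, Lam_inv, beta, rng=None):
        self.mu, self.Lam_inv, self.beta = mu, Lam_inv, beta
        self.chol = np.linalg.cholesky(Lam_inv)  # for drawing N(0, Lam^{-1}) noise
        self.rng = rng or np.random.default_rng()
        self.w_prev = None

    def reset(self):
        """Draw fresh weights from the prior, right after a policy update."""
        self.w_prev = self.mu + self.chol @ self.rng.normal(size=self.mu.shape)
        return self.w_prev

    def step(self):
        """Draw the next weights from the transition distribution."""
        if self.w_prev is None:
            return self.reset()
        b = self.beta
        mean = (1.0 - b) * self.w_prev + b * self.mu
        scale = np.sqrt(2.0 * b - b**2)
        self.w_prev = mean + scale * (self.chol @ self.rng.normal(size=self.mu.shape))
        return self.w_prev
```

With β = 0 the weights stay fixed over the whole trajectory (trajectory-based exploration), with β = 1 they are resampled i.i.d. from N(μ, Λ^{-1}) at every step (step-based exploration), and intermediate β interpolates between the two.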

D ADDITIONAL RESULTS

In this appendix, we provide additional results for both comparative evaluation and ablation studies.

D.1 COMPARATIVE EVALUATION

The results of the comparative evaluation for A2C (Mnih et al., 2016) and PPO (Schulman et al., 2017) on Reacher-v2, InvertedDoublePendulum-v2, and Hopper-v2 are shown in Figure 4. For SAC (Haarnoja et al., 2018), since it is a state-of-the-art deep RL algorithm, we test it only on the three of our six OpenAI MuJoCo continuous control tasks with the highest state and action dimensions, as shown in Figure 1.

D.2 ABLATION STUDIES

The full results of all three ablation studies on all six OpenAI MuJoCo continuous control tasks are shown in Figure 5 , Figure 6 , Figure 7 and Figure 8 . 



https://github.com/Kaixhin/NoisyNet-A3C
https://github.com/openai/baselines/tree/master/baselines/ddpg
Source code to be released after review.



Figure 1: Learning curves for deep RL algorithms with different exploration strategies on OpenAI MuJoCo continuous control tasks, where the top, middle, and bottom rows correspond to the results of A2C, PPO, and SAC, respectively. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average return over 10 random seeds.

Figure 2: Results of Coherent-A2C with different settings on HalfCheetah-v2, where Figures 2a and 2b show the learning curves and Figure 2c shows the average log variance of gradients during six stages of learning. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average return over 10 random seeds.

Figure 3: Graphical model of Deep Coherent Exploration.

Figure 4: Learning curves for deep RL algorithms with different exploration strategies, where the top and bottom rows correspond to the results of A2C and PPO, respectively. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average return over 10 random seeds.

Figure 5: Learning curves for Coherent-A2C on OpenAI MuJoCo continuous control tasks. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average return over 10 random seeds.

Figure 6: Log variance of gradient estimates for Coherent-A2C and OurNoisyNet-A2C on OpenAI MuJoCo continuous control tasks. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average log variance over 10 random seeds.

Figure 7: Learning curves for Coherent-A2C and OurNoisyNet-A2C on OpenAI MuJoCo continuous control tasks. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average return over 10 random seeds.

Figure 8: Learning curves for OurNoisyNet-A2C on OpenAI MuJoCo continuous control tasks. The solid curves correspond to the mean, and the shaded region represents half a standard deviation of the average return over 10 random seeds.

To simplify this dependency, we introduce w_t into p(a_t | s_{[0:t]}, a_{[0:t-1]}, ζ):

p(a_t | s_{[0:t]}, a_{[0:t-1]}, ζ) = ∫ p(a_t, w_t | s_{[0:t]}, a_{[0:t-1]}, ζ) dw_t.

C DEEP COHERENT REINFORCEMENT LEARNING

Here, we briefly describe how to adapt Deep Coherent Exploration to A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018); we call the resulting algorithms Coherent-A2C, Coherent-PPO, and Coherent-SAC, respectively.

C.1 COHERENT ADVANTAGE ACTOR-CRITIC (COHERENT-A2C)

Coherent-A2C is straightforward to implement: one simply replaces the original A2C gradient estimate ∇_θ J(θ) with the on-policy coherent gradient estimate ∇_{μ,Λ,θ} J(μ, Λ, θ). The pseudo-code of single-worker Coherent-A2C is shown in Algorithm 1.

Algorithm 1: Coherent-A2C
Input: initial policy parameters μ_0, Λ_0, θ_0; initial value function parameters φ_0.
1:  for k = 0, 1, 2, ..., K do
2:      Create a buffer D_k for collecting a trajectory τ_k with T steps.
        ...
12:     Infer the forward message α(w_t) using the previous state s_{t-1}, the previous action a_{t-1}, and the mean v_{t-1} and covariance L_{t-1}^{-1} of the previous forward message α(w_{t-1}).
13:     Store the mean v_t and covariance L_t^{-1} of the current forward message α(w_t).
14:     Compute the marginal action probability p(a_t | s_{[0:t]}, a_{[0:t-1]}, μ, Λ, θ).
15:     Compute rewards-to-go R_t and advantage estimates Â_t based on the current value function V_{φ_k} for all steps t.
16:     Estimate the gradient of the policy and update the policy by performing a gradient step.
17:     Learn the value function by minimizing the regression mean-squared error and update it by performing a gradient step.

Algorithm 2: Coherent-PPO
Input: initial policy parameters μ_0, Λ_0, θ_0; initial value function parameters φ_0.
1:  for k = 0, 1, 2, ..., K do
2:      Create a buffer D_k for collecting a trajectory τ_k with T steps.
        ...
12:     Infer the forward message α(w_t) using the previous state s_{t-1}, the previous action a_{t-1}, and the mean v_{t-1} and covariance L_{t-1}^{-1} of the previous forward message α(w_{t-1}).
13:     Store the mean v_t and covariance L_t^{-1} of the current forward message α(w_t).
14:     Compute the marginal action probability p(a_t | s_{[0:t]}, a_{[0:t-1]}, μ, Λ, θ).
15:     Compute rewards-to-go R_t and advantage estimates Â_t based on the current value function V_{φ_k} for all steps t.
16:     Learn the policy by maximizing the PPO-Clip objective, performing multiple gradient steps until the constraint on the approximated KL divergence is satisfied.
17:     Learn the value function by minimizing the regression mean-squared error and update it by performing a gradient step.

Algorithm 3: Coherent-SAC
Input: initial policy parameters μ, Λ, θ; initial Q-function parameters φ_1, φ_2; empty replay buffer D.
1:  Set target parameters equal to main parameters: φ_targ,1 ← φ_1, φ_targ,2 ← φ_2.
2:  for each step do
3:      if just updated then
4:          Sample last-layer parameters of the policy network w ∼ p_0(w; μ, Λ) and store w as w_prev.
5:      else
6:          Sample last-layer parameters of the policy network w ∼ p(w | w_prev; μ, Λ) and store w as w_prev.
7:      Observe state s and select action a ∼ π_{w,θ}(a | s).
        ...
16:     Update the policy by one step of gradient ascent.
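As an illustrative sketch, the clipped objective of Coherent-PPO can be evaluated from precomputed marginal action log-probabilities; here we use the standard PPO-Clip form with the marginal ratio r_t substituted in, and all names are our assumptions:

```python
import numpy as np

def coherent_ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative PPO-Clip objective where the ratio r_t uses the marginal
    action probabilities p(a_t | history) under the new and old policies.

    logp_new, logp_old : per-step marginal log-probabilities (precomputed)
    adv                : per-step advantage estimates
    eps                : clipping parameter epsilon
    """
    ratio = np.exp(logp_new - logp_old)          # r_t(mu, Lam, theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Maximizing the objective corresponds to minimizing its negation.
    return -np.mean(np.minimum(ratio * adv, clipped * adv))
```

Because the last-layer weights w_t are marginalized out of both the numerator and denominator of r_t, this loss is deterministic given the trajectory, regardless of which w_t were sampled during collection.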

