DECORRELATED DOUBLE Q-LEARNING

Abstract

Q-learning with value function approximation may perform poorly because of overestimation bias and imprecise estimates. Specifically, overestimation bias arises from the maximum operator applied to noisy estimates, and is exaggerated when bootstrapping from the estimate of a subsequent state. Inspired by recent advances in deep reinforcement learning and Double Q-learning, we introduce decorrelated double Q-learning (D2Q). Specifically, we introduce a Q-value function utilizing control variates and a decorrelated regularization to reduce the correlation between the value function approximators, which leads to less biased estimation and lower variance. Experimental results on a suite of MuJoCo continuous control tasks demonstrate that our decorrelated double Q-learning can effectively improve performance.

1. INTRODUCTION

Q-learning Watkins & Dayan (1992), as a model-free reinforcement learning approach, has gained popularity, especially with the advance of deep neural networks Mnih et al. (2013). In general, it combines neural network approximators with actor-critic architectures Witten (1977); Konda & Tsitsiklis (1999), which have an actor network to control how the agent behaves and a critic to evaluate how good the taken action is. The Deep Q-Network (DQN) algorithm Mnih et al. (2013) first applied a deep neural network to approximate the action-value function in Q-learning, and showed remarkably good and stable results by introducing a target network and an experience replay buffer to stabilize training. DDPG Lillicrap et al. (2015) extends Q-learning to continuous action spaces with target networks. Besides training stability, another issue Q-learning suffers from is overestimation bias, which was first investigated in Thrun & Schwartz (1993). Because of noise in function approximation, the maximum operator in Q-learning can lead to overestimation of state-action values. The overestimation property is also observed in deterministic continuous policy control Silver & Lever (2014). In particular, with imprecise function approximation, maximizing over a noisy value induces overestimation in the action-value function. This inaccuracy can be even worse (e.g., error accumulation) under temporal difference learning Sutton & Barto (1998), in which bootstrapping updates the value function using the estimate of a subsequent state. Given the overestimation bias caused by the maximum operator over noisy estimates, many methods have been proposed to address this issue, such as Double Q-learning van Hasselt (2010). This work suggests an alternative solution to the overestimation phenomenon, called decorrelated double Q-learning, based on reducing the noise in Q-value estimates.
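The upward bias introduced by the maximum operator over noisy estimates can be demonstrated with a small simulation (a toy sketch, not from the paper, with all values hypothetical): even when every true Q-value is exactly zero, the maximum of the noisy estimates is positive in expectation.

```python
import random

random.seed(0)

# True Q-values for 5 actions are all exactly 0, so max_a E[Q(s, a)] = 0.
# Each estimate is corrupted by zero-mean Gaussian noise; taking the max
# over the noisy estimates yields a positively biased value.
NUM_ACTIONS = 5
NUM_TRIALS = 10_000

bias = 0.0
for _ in range(NUM_TRIALS):
    noisy_estimates = [random.gauss(0.0, 1.0) for _ in range(NUM_ACTIONS)]
    bias += max(noisy_estimates)
bias /= NUM_TRIALS

print(f"average max over noisy estimates: {bias:.3f}")  # clearly above 0
```

Averaging the maximum over many trials makes the bias visible: with five standard-normal estimates, the expected maximum is roughly 1.16 rather than 0.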
On the one hand, we want to make the two value function approximators as independent as possible to mitigate overestimation bias. On the other hand, we should reduce the variance caused by imprecise estimates. Our decorrelated double Q-learning proposes an objective function that minimizes the correlation of the two critics, and meanwhile reduces the variance of the target approximation error with control variate methods. Finally, we provide experimental results on MuJoCo games, showing significant improvement compared to competitive baselines. The paper is organized as follows. In Section 2, we introduce the reinforcement learning problem, notation, and two existing Q-learning variants that address overestimation bias. We then present our D2Q algorithm in Section 3 and prove that, in the limit, it converges to the optimal solution. In Section 4 we show experimental results on MuJoCo continuous control tasks and compare against the current state of the art. Related work and discussion are presented in Section 5, and Section 6 concludes the paper.
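A control variate keeps an estimator unbiased while reducing its variance, which is the role it plays in the target-error reduction mentioned above. A minimal sketch on a toy integration problem (purely illustrative, unrelated to the paper's actual estimator): we estimate E[X²] for X uniform on [0, 1], using g(X) = X with known mean 1/2 as the control variate.

```python
import random

random.seed(1)

# Estimate E[f(X)] for f(X) = X**2, X ~ Uniform[0, 1] (true value 1/3),
# with control variate g(X) = X (known mean 1/2) and coefficient c = 1:
#   estimator = f(X) - c * (g(X) - E[g(X)])
# This stays unbiased for any c but can have much lower variance.
N = 10_000
xs = [random.random() for _ in range(N)]
plain = [x * x for x in xs]
cv = [x * x - 1.0 * (x - 0.5) for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(mean(plain), var(plain))  # both means are near 1/3 ...
print(mean(cv), var(cv))        # ... but the control-variate variance is smaller
```

Analytically, the variance drops from 4/45 ≈ 0.089 to 1/180 ≈ 0.006 here while the mean is unchanged, which is the same trade the paper seeks for the Q-value targets.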

2. BACKGROUND

In this section, we introduce the reinforcement learning problem and Q-learning, as well as the notation used in the following sections.

2.1. PROBLEM SETTING AND NOTATIONS

We consider the model-free reinforcement learning problem (i.e., an optimal policy exists) with sequential interactions between an agent and its environment Sutton & Barto (1998), in order to maximize a cumulative return. At every time step t, the agent selects an action a_t in state s_t according to its policy, receives a scalar reward r(s_t, a_t), and then transitions to the next state s_{t+1}. The problem is modeled as a Markov decision process (MDP) with tuple (S, A, p(s_0), p(s_{t+1}|s_t, a_t), r(s_t, a_t), γ). Here, S and A denote the state and action spaces respectively, and p(s_0) is the initial state distribution. p(s_{t+1}|s_t, a_t) is the transition probability to s_{t+1} given the current state s_t and action a_t, r(s_t, a_t) is the reward from the environment after the agent takes action a_t in state s_t, and γ is the discount factor, which decays future rewards to ensure finite returns. We model the agent's behavior with π_θ(a|s), a parametric distribution given by a neural network. Suppose we have a finite-length trajectory τ = (s_t, a_t)_{t=0}^{T} generated while the agent interacts with the environment. The expected return under the policy π_θ is

J(θ) = E_{τ∼π_θ(τ)}[r(τ)] = E_{τ∼π_θ(τ)}[R_0^T] = E_{τ∼π_θ(τ)}[ Σ_{t=0}^{T} γ^t r(s_t, a_t) ],

where π_θ(τ) denotes the distribution of trajectories,

p(τ) = p(s_0) Π_{t=0}^{T} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t).

The goal of reinforcement learning is to learn a policy that maximizes the expected return:

θ* = arg max_θ J(θ) = arg max_θ E_{τ∼π_θ(τ)}[R_0^T].

The action-value function describes the expected return of the agent in state s taking action a under the policy π. The advantage of the action-value function is that it makes actions explicit, so we can select actions even in the model-free setting.
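As a toy illustration of the quantities defined above (the rewards and discount below are made-up numbers), the discounted return R_0^T of a sampled trajectory, and a Monte Carlo estimate of J(θ) from several trajectories, can be computed as:

```python
def discounted_return(rewards, gamma):
    """R_0^T = sum_t gamma^t * r_t, the return of a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# J(theta) is the expectation of this return over trajectories drawn from
# pi_theta; with sampled trajectories it is estimated by a simple average.
trajectory_rewards = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]  # toy reward sequences
gamma = 0.9
returns = [discounted_return(r, gamma) for r in trajectory_rewards]
j_estimate = sum(returns) / len(returns)

print(returns)      # ≈ [1 + 0.9 + 0.81, 0 + 1.8 + 0] = [2.71, 1.8]
print(j_estimate)   # ≈ 2.255
```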
After taking an action a_t in state s_t and thereafter following policy π, the action-value function is defined as:

Q^π(s_t, a_t) = E_{s_i∼p_π, a_i∼π}[R_t | s_t, a_t] = E_{s_i∼p_π, a_i∼π}[ Σ_{i=t}^{T} γ^{i-t} r(s_i, a_i) | s_t, a_t ].

To get the optimal value function, we can take the maximum over policies, denoted Q*(s_t, a_t) = max_π Q^π(s_t, a_t), and the corresponding optimal policy π* can be easily derived by π*(s_t) ∈ arg max_{a_t} Q*(s_t, a_t).
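Greedy policy extraction from the optimal action-value function, π*(s) ∈ arg max_a Q*(s, a), can be sketched for a toy tabular case (the states, actions, and Q-values below are made up for illustration):

```python
# Toy tabular Q-function: Q[state][action]; all values hypothetical.
Q = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.7, "right": -0.3},
}

def greedy_policy(state):
    """pi*(s) = argmax_a Q*(s, a): pick the action with the largest Q-value."""
    actions = Q[state]
    return max(actions, key=actions.get)

print(greedy_policy("s0"))  # right
print(greedy_policy("s1"))  # left
```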



Double Q-learning van Hasselt (2010) mitigates the overestimation effect by introducing two independent critics to estimate the maximum value of a set of stochastic values. Averaged-DQN Anschel et al. (2017) takes the average of previously learned Q-value estimates, which results in a more stable training procedure and reduces the approximation error variance in the target values. Recently, Twin Delayed Deep Deterministic Policy Gradient (TD3) Fujimoto et al. (2018) extended Double Q-learning by using the minimum of two critics to limit overestimation bias in the actor-critic setting. A soft Q-learning algorithm Haarnoja et al. (2018), called soft actor-critic, leverages a similar strategy to TD3, while including a maximum-entropy term to balance exploration and exploitation. Maxmin Q-learning Lan et al. (2020) proposes an ensembling scheme to handle overestimation bias in Q-learning.
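The clipped double-Q target popularized by TD3 can be sketched as follows (a minimal illustration under toy values, not the authors' implementation): the bootstrap term uses the minimum of the two critics' estimates of the next state-action value.

```python
def clipped_double_q_target(reward, q1_next, q2_next, gamma, done):
    """TD3-style target: bootstrap with the minimum of two critic estimates
    of the next state-action value to limit overestimation bias."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap

# The two critics disagree; the pessimistic min suppresses the upward bias.
target = clipped_double_q_target(reward=1.0, q1_next=5.0, q2_next=3.0,
                                 gamma=0.99, done=False)
print(target)  # ≈ 1.0 + 0.99 * 3.0 = 3.97
```

Taking the minimum trades some underestimation for protection against the error accumulation that overestimation causes under bootstrapped updates.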


