ERROR CONTROLLED ACTOR-CRITIC METHOD FOR REINFORCEMENT LEARNING

Abstract

In reinforcement learning (RL) algorithms that incorporate function approximation, the approximation error of the value function inevitably causes the overestimation phenomenon and degrades algorithm performance. To mitigate the negative effects of approximation error, we propose a new actor-critic algorithm called Error Controlled Actor-critic, which confines the approximation error in the value function. In this paper, we derive an upper bound on the approximation error of the Q function approximator in actor-critic methods, and find that the error can be lowered by keeping the new policy close to the previous one during policy training. The results of experiments on a range of continuous control tasks from the OpenAI Gym suite demonstrate that the proposed actor-critic algorithm noticeably reduces the approximation error and significantly outperforms other model-free RL algorithms.

1. INTRODUCTION

Reinforcement learning (RL) algorithms are combined with function approximation methods to adapt to application scenarios whose state spaces are combinatorial, large, or even continuous. Many function approximation methods, including the Fourier basis (Konidaris et al., 2011), kernel regression (Xu, 2006; Barreto et al., 2011; Bhat et al., 2012), and neural networks (Barto et al., 1982; Tesauro, 1992; Boyan et al., 1992; Gullapalli, 1992), have been used to learn value functions. In recent years, many deep reinforcement learning (DRL) methods have been implemented by incorporating deep learning into RL. The Deep Q-Network (DQN) (Mnih et al., 2013), reported in 2013, is a representative work that uses a deep convolutional neural network (CNN) to represent an action value function estimating future rewards (returns); it successfully learned end-to-end control policies for seven Atari 2600 games directly from large state spaces. Thereafter, deep RL methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), Proximal Policy Optimization (PPO) (Schulman et al., 2017), Twin Delayed Deep Deterministic policy gradient (TD3) (Fujimoto et al., 2018), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018) became mainstream in the field of RL. Although function approximation methods have enabled RL algorithms to perform well on complex problems by providing great representational power, they also cause an issue called the overestimation phenomenon that jeopardizes the optimization process of RL algorithms. Thrun & Schwartz (1993) presented a theoretical analysis of this systematic overestimation in Q-learning methods that use function approximation. A similar problem persists in actor-critic methods that employ function approximation.
Thomas (2014) reported that several natural actor-critic algorithms use biased estimates of the policy gradient to update parameters when function approximation is used for the action value function. Fujimoto et al. (2018) proved that value estimation in the deterministic policy gradient method also leads to overestimation. In brief, the approximation errors of value functions make the estimated values inaccurate, and this inaccuracy induces overestimation of the value function, so that high values may be assigned to actions with poor performance; as a result, policies with poor performance may be obtained. Previous works attempted to reduce the overestimation directly. Hasselt (2010) proposed Double Q-learning, in which the samples are divided into two sets to train two independent Q-function estimators; to diminish overestimation, one Q-function estimator is used to select actions, and the other is applied to estimate their values. Fujimoto et al. (2018) proposed mechanisms, including clipped double Q-learning and delayed policy updates, to minimize the overestimation. In contrast to these methods, we focus on the actor-critic setting and reduce the approximation error of the value function, which is the source of the overestimation, in an indirect but effective way. We use concepts from domain adaptation (Ben-David et al., 2010) to derive an upper bound on the approximation error of the Q function approximator. We then find that the least upper bound of this error can be obtained by minimizing the Kullback-Leibler (KL) divergence between the new policy and the previous one. This means that minimizing the KL divergence when training the policy stabilizes the critic and thereby confines the approximation error in the Q function. Interestingly, we arrive at a conclusion similar to that of Geist et al. (2019) and Vieillard et al. (2020) by a somewhat different route.
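To make the direct strategies concrete, the following is a minimal numpy sketch of the clipped double Q-learning target from Fujimoto et al. (2018): the TD target bootstraps from the element-wise minimum of two target critics, so an error that inflates one critic is discounted unless the other critic agrees. The function name and array-based interface are illustrative, not from the original work.

```python
import numpy as np

def clipped_double_q_target(rewards, q1_next, q2_next, dones, gamma=0.99):
    """TD targets using the minimum of two target-critic estimates.

    Taking the element-wise minimum counteracts overestimation:
    an optimistic error in one critic only propagates if the
    other critic makes the same error.
    """
    min_q = np.minimum(q1_next, q2_next)
    return rewards + gamma * (1.0 - dones) * min_q

# Example batch of two transitions; the second is terminal,
# so it bootstraps nothing beyond its immediate reward.
r = np.array([1.0, 0.5])
q1 = np.array([10.0, 3.0])
q2 = np.array([8.0, 4.0])
done = np.array([0.0, 1.0])
targets = clipped_double_q_target(r, q1, q2, done)  # -> [8.92, 0.5]
```

In practice the two critics are trained on the same data toward these shared targets, which is what distinguishes this scheme from the two independently trained estimators of Double Q-learning.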
In their works, the authors directly studied the effect of KL and entropy regularization in RL and proved that KL regularization indeed leads to an averaging of the errors made at each iteration of the value function update. Our approach, however, is quite different: it is impracticable to minimize the approximation error directly, so we instead minimize an upper bound on it. This is similar to the Expectation-Maximization algorithm (Bishop, 2006), which maximizes a lower bound of the log-likelihood rather than the log-likelihood itself. We derive an upper bound on the approximation error of the Q function approximator in actor-critic methods and arrive at a more general conclusion: the approximation error can be reduced by keeping the new policy close to the previous one. Note that a KL penalty is an effective way to achieve this, but not the only way. Furthermore, this indirect operation (i.e., the KL penalty) can work together with the direct strategies for reducing overestimation mentioned above, such as clipped double Q-learning. We then establish a new actor-critic method called Error Controlled Actor-critic (ECAC), which minimizes the KL divergence to keep the upper bound as low as possible. In other words, this method ensures the similarity between every two consecutive policies during training and reduces the optimization difficulty of the value function, so that the error in the Q function approximator decreases. Ablation studies were performed to examine the effectiveness of the proposed strategy for decreasing the approximation error, and comparative evaluations were conducted to verify that our method outperforms other mainstream RL algorithms.
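The idea of keeping consecutive policies close via a KL penalty can be sketched as follows. This is an illustrative numpy example under simplifying assumptions (diagonal Gaussian policies, a batch of precomputed Q estimates), not the paper's actual training code; the function names and the fixed coefficient `beta` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_new, std_new, mu_old, std_old):
    """KL(new || old) between diagonal Gaussian policies, summed over action dims."""
    var_new, var_old = std_new ** 2, std_old ** 2
    kl = (np.log(std_old / std_new)
          + (var_new + (mu_new - mu_old) ** 2) / (2.0 * var_old)
          - 0.5)
    return kl.sum(axis=-1)

def penalized_actor_objective(q_values, kl_terms, beta):
    """Actor objective to maximize: expected Q minus a KL penalty.

    The penalty discourages the new policy from drifting far from
    the previous one, which keeps the critic's training targets
    (and thus its approximation error) under control.
    """
    return np.mean(q_values - beta * kl_terms)

# When the new policy equals the old one, the KL penalty vanishes.
mu = np.zeros(2)
std = np.ones(2)
kl = gaussian_kl(mu, std, mu, std)           # -> 0.0
obj = penalized_actor_objective(np.array([1.0, 2.0]), np.array([kl, kl]), beta=0.1)
```

A larger `beta` ties the new policy more tightly to its predecessor, trading learning speed for a lower error bound; the paper's ECAC adjusts this coefficient automatically rather than fixing it by hand.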
The main contributions of this paper are summarized as follows: (1) we present an upper bound on the approximation error of the Q function approximator; (2) we propose a practical actor-critic method, ECAC, which decreases the approximation error by restricting the KL divergence between every two consecutive policies and adopts a mechanism to automatically adjust the coefficient of the KL term.

2. PRELIMINARIES

2.1 REINFORCEMENT LEARNING

Reinforcement learning (RL) algorithms are modeled within the mathematical framework of the Markov Decision Process (MDP). At each time step of an MDP, an agent generates an action based on the current state of its environment, then receives a reward and a new state from the environment. The environmental state and the agent's action at time t are denoted s_t ∈ S and a_t ∈ A, respectively, where S and A denote the state and action spaces, which may be either discrete or continuous. The environment is described by a reward function, r(s_t, a_t), and a transition probability distribution, Pr(s_{t+1} = s' | s_t = s, a_t = a), which specifies the probability that the environment transitions to the next state. The initial state distribution is denoted Pr_0(s). Let π denote a policy and η(π) its expected discounted return:

η(π) = E_π[R_1 + γR_2 + γ^2 R_3 + ⋯] = E_π[Σ_{t=0}^∞ γ^t R_{t+1}],

where γ is a discount rate with 0 ≤ γ ≤ 1. The goal of RL is to find a policy, π*, that maximizes a performance function over policies, J(π):

π* = arg max_π J(π).
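The discounted return defined above can be computed from a finite reward sequence with a short backward recursion, since G_t = R_{t+1} + γ G_{t+1}. The following is a minimal illustrative sketch, not code from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_{t=0}^{T-1} gamma^t * rewards[t], computed backward.

    Iterating from the last reward exploits the recursion
    G_t = R_{t+1} + gamma * G_{t+1}, avoiding explicit powers of gamma.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # -> 1.75
```

For γ < 1 the infinite sum in η(π) converges for bounded rewards, which is why the expected discounted return is a well-defined objective even on non-terminating tasks.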

