PARTIAL ADVANTAGE ESTIMATOR FOR PROXIMAL POLICY OPTIMIZATION

Abstract

Estimating values in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of the advantage function, analogous to the λ-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used because trajectories are incomplete, which introduces a large bias into the estimate. To address this challenge, we propose using only part of the truncated GAE, rather than all of it, when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and µRTS to investigate the effect of different partial coefficients and sampling lengths, and show that our partial GAE approach yields better empirical results in both environments.

1. INTRODUCTION

In reinforcement learning, an agent observes its state in the environment, selects an action, and repeats these steps, constantly learning from the environment. Let S be the set of states and A the set of actions. State transitions are governed by the transition probability P. When the agent executes an action, it receives a reward from the environment, r : S × A → [R_min, R_max]. In this way, the agent produces a trajectory τ : (s_1, a_1, ..., s_T, a_T). This trajectory has a cumulative return ∑_{t=1}^{T} γ^{t-1} r_t, where γ is the discount factor and T is the number of steps performed. The goal of reinforcement learning is to find an optimal policy π under which the agent obtains the maximum expected cumulative reward. A policy specifies the probability of each action in each state, π(a|s) = p(A_t = a | S_t = s).

Under this policy, the cumulative return follows a distribution, and the expected cumulative return from state s is defined as the state-value function

    v_π(s) = E_π[ ∑_{k=0}^{∞} γ^k r_{t+k+1} | S_t = s ].

Accordingly, the expected cumulative return after executing action a in state s is defined as the state-action value function

    q_π(s, a) = E_π[ ∑_{k=0}^{∞} γ^k r_{t+k+1} | S_t = s, A_t = a ].

The value function is often estimated with the temporal-difference (TD) algorithm or a Monte Carlo (MC) method. However, each of these methods has its advantages and disadvantages: the TD value estimator has high bias and low variance, whereas the MC estimator has low bias and high variance. Kimura et al. (1998) put forward a method that skillfully strikes a balance between bias and variance, called the λ-return. TD(λ), proposed by Sutton (1988), is a variant of the λ-return that provides a more balanced value estimate.
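To illustrate the bias-variance trade-off, the λ-return over a complete episode can be computed by a backward recursion that interpolates between the one-step TD target (λ = 0) and the full Monte Carlo return (λ = 1). The following is a minimal NumPy sketch; the function name and array conventions are ours, not from the paper:

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-returns G_t^lambda for a complete (terminated) episode.

    rewards[t] is r_{t+1}; values[t] is the current estimate V(s_t).
    Backward recursion:
        G_T     = r_T                                 (terminal step)
        G_t     = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})
    lam = 0 gives the one-step TD target (high bias, low variance);
    lam = 1 gives the Monte Carlo return (low bias, high variance).
    """
    T = len(rewards)
    G = np.zeros(T)
    for t in reversed(range(T)):
        if t == T - 1:
            G[t] = rewards[t]  # episode terminates after step T
        else:
            G[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G[t + 1])
    return G
```

With γ = 1 and λ = 1 this reduces to the undiscounted Monte Carlo return; with λ = 0 each target bootstraps only from the next state's value estimate.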
Generalized Advantage Estimation (GAE), proposed by Schulman et al. (2015b), is a method for estimating the advantage function; in effect, it applies the λ-return idea to advantage estimation. As noted in the PPO paper (Schulman et al., 2017), in practical applications a truncated GAE is used because trajectories are incomplete, which leads to a large bias in the estimation process. We therefore propose partial GAE, which uses only part of the computed truncated GAE rather than all of it, significantly reducing the bias caused by the incomplete trajectory.
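For concreteness, the truncated GAE used in PPO-style implementations can be sketched as a backward pass over an incomplete trajectory of length T, bootstrapping from the value of the state reached after the last step. This is a minimal illustration of the standard truncated estimator only; the partial-GAE modification proposed in this paper is not shown here:

```python
import numpy as np

def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE over an incomplete trajectory of length T.

    rewards[t] is r_t, values[t] is V(s_t) for t = 0..T-1, and
    last_value is the bootstrap estimate V(s_T).
    TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    Advantage:   A_t = sum_{l=0}^{T-1-t} (gamma * lam)^l * delta_{t+l}
    (the sum is cut off at step T, which is the source of the
    truncation bias discussed above).
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value = last_value
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```

The later an index t is in the segment, the fewer TD residuals its sum contains, so advantages near the truncation point rely most heavily on the bootstrapped V(s_T) and carry the largest bias.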

