PARTIAL ADVANTAGE ESTIMATOR FOR PROXIMAL POLICY OPTIMIZATION

Abstract

Estimation of value in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of the advantage function, similar to the λ-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used due to the incompleteness of the trajectory, which introduces a large bias into the estimate. To address this challenge, instead of using the entire truncated GAE, we propose to use only a part of it when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and µRTS to investigate the effect of different partial coefficients and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.

1. INTRODUCTION

In reinforcement learning, an agent observes the state of its environment, selects an action, and iterates these steps, constantly learning from the environment. Let S be the set of states and A the set of actions. Transitions between states are governed by the transition probability P. When the agent executes an action, it receives a reward from the environment, $r : S \times A \to [R_{\min}, R_{\max}]$. In this way, the model produces a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$. This trajectory has a cumulative return $\sum_{t=1}^{T} \gamma^{t-1} r_t$, where γ is the discount factor and T is the number of steps performed. The goal of reinforcement learning is to find an optimal policy π under which the agent obtains the maximum expected cumulative reward. A policy specifies the probability of each action in each state, $\pi(a|s) = p(A_t = a \mid S_t = s)$. Under this policy, the cumulative return follows a distribution, and the expected value of the cumulative return at state s is defined as the state-value function:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s\right]$$

Accordingly, the expected cumulative return after executing action a in state s is defined as the state-action value function:

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

Often the temporal difference (TD) algorithm or a Monte Carlo (MC) method is used to estimate the value function. Each of these methods has advantages and disadvantages: the TD value estimator has high bias and low variance, whereas the MC estimator has low bias and high variance. Kimura et al. (1998) put forward a method that skillfully finds a balance between bias and variance, called the λ-return. TD(λ), proposed by Sutton (1988), is a variant of the λ-return that provides a more balanced value estimation.
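As a concrete illustration of the cumulative return defined above, the following minimal Python sketch (the function name is ours, not from the paper) computes $\sum_{t=1}^{T} \gamma^{t-1} r_t$ for a finite reward sequence:

```python
def discounted_return(rewards, gamma):
    # Cumulative return of a trajectory: sum_{t=1}^{T} gamma^(t-1) * r_t,
    # accumulated backwards so each step is a single multiply-add.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 1 the return is simply the sum of rewards.
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0
```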
Generalized Advantage Estimation (GAE) is a method proposed by Schulman et al. (2015b) to estimate the advantage function; in effect, it applies the λ-return idea to advantage estimation. As noted in the PPO paper (Schulman et al., 2017), in practical applications a truncated GAE is used due to the incompleteness of the trajectory, which leads to large bias in the estimation process. To address this, we propose partial GAE, which uses only partially computed GAE terms rather than the entire truncated GAE, significantly reducing the bias caused by incomplete trajectories. In addition to experiments in common MuJoCo environments, we also conduct experiments in the complex and challenging µRTS environment. Our method is empirically successful in these environments.

2. VALUE ESTIMATOR

The gradient in the policy gradient algorithm is generally written in the following form:

$$g = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \Psi_t \nabla_\theta \log \pi_\theta(a_t | s_t)\right]$$

where $\Psi_t$ controls the magnitude of the policy update in the gradient direction. In the basic policy gradient algorithm, the action-value function $q_\pi(s, a)$ is used as $\Psi_t$, and $q_\pi(s, a)$ is estimated by the cumulative return $G_t$.

One of the most direct methods of estimating the value function is the Monte Carlo (MC) method. MC starts directly from the definition of the value function and takes the accumulated return as the estimator. The accumulated discounted reward $\sum_{n=0}^{N} \gamma^n r_{t+n}$ of a reward sequence $(r_t, r_{t+1}, \ldots, r_{t+N})$ is taken as the estimator of the state value at state $s_t$. The MC state-value estimator is unbiased. However, because the random variables in the estimator comprise all returns after time t, and their dimension is high, the estimator has high variance.

The TD algorithm uses $r_t + \gamma V_\theta(s_{t+1})$ as the estimator of the value function $V_\pi(s_t)$. There is a certain error, denoted $e_\theta$, between the approximate value function and the true value function. This yields Equation 4, which shows that the TD estimator carries a bias of $\gamma \mathbb{E}_{S_{t+1}}[e_\theta(S_{t+1})]$. In addition, because the estimator involves fewer random variables, its variance remains low.

$$\mathbb{E}_{(r_t, S_{t+1})}\!\left[r_t + \gamma V_\theta(S_{t+1})\right] = V_\pi(s_t) + \gamma \mathbb{E}_{S_{t+1}}\!\left[e_\theta(S_{t+1})\right] \tag{4}$$

Compared with the TD and MC methods, the λ-return method seeks a balance between bias and variance. In Equation 6, if λ is 0 the estimator reduces to the TD estimator, and if λ is 1 it reduces to the MC estimator.
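The two estimators discussed above can be written as a short sketch (function names are ours; the bias/variance trade-off is from the text, not demonstrated here):

```python
def mc_estimate(rewards, gamma):
    # Monte Carlo estimator of V(s_t): the full discounted return
    # sum_{n=0}^{N} gamma^n * r_{t+n}. Unbiased, but high variance,
    # since it sums over every reward after time t.
    return sum(gamma ** n * r for n, r in enumerate(rewards))

def td_estimate(reward, next_value, gamma):
    # One-step TD estimator of V(s_t): r_t + gamma * V_theta(s_{t+1}).
    # Low variance (only one sampled reward), but biased by the
    # value-function error e_theta at s_{t+1}.
    return reward + gamma * next_value
```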
$$G_t^{(n)} = \gamma^n V(s_{t+n}) + \sum_{l=0}^{n-1} \gamma^l r_{t+l} \tag{5}$$

$$G_t^\lambda = \lambda^{N-1} G_t^{(N)} + (1 - \lambda) \sum_{n=1}^{N-1} \lambda^{n-1} G_t^{(n)} \tag{6}$$

One disadvantage of using $q_\pi(s, a)$ is that it has a large variance, which can be reduced by introducing a baseline $b(t)$:

$$g = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \left(q_\pi(s, a) - b(t)\right) \nabla_\theta \log \pi_\theta(a_t | s_t)\right]$$

Using the average return from the samples as a baseline offers some improvement. However, for a Markov process the baseline should change according to the state: it should be larger in states where all actions have large value, and vice versa. The A2C and A3C algorithms use $V_t$ as the baseline, and $q_\pi(s, a) - V_t$ is called the advantage. Since $Q(s, a)$ can be approximated as $r + \gamma V'$, the advantage is $A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$.
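A minimal sketch of Equations 5 and 6 in Python (our naming; `values[n]` is assumed to hold the approximate $V(s_{t+n})$, with one bootstrap value past the last reward):

```python
def lambda_return(rewards, values, gamma, lam):
    # Computes G_t^lambda over a length-N segment, per Eqs. (5)-(6):
    #   G_t^(n)    = sum_{l=0}^{n-1} gamma^l r_{t+l} + gamma^n V(s_{t+n})
    #   G_t^lambda = lam^(N-1) G_t^(N) + (1-lam) sum_{n=1}^{N-1} lam^(n-1) G_t^(n)
    # rewards has length N; values has length N + 1 (bootstrap at the end).
    N = len(rewards)
    g_n = []
    partial = 0.0
    for n in range(1, N + 1):
        partial += gamma ** (n - 1) * rewards[n - 1]
        g_n.append(partial + gamma ** n * values[n])
    ret = lam ** (N - 1) * g_n[-1]
    ret += (1 - lam) * sum(lam ** (n - 1) * g_n[n - 1] for n in range(1, N))
    return ret
```

With `lam=0` this collapses to the one-step TD target $r_t + \gamma V(s_{t+1})$, and with `lam=1` to the (bootstrapped) MC return, matching the limiting cases described above.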



In the Generalized Advantage Estimation (GAE) paper (Schulman et al., 2015b), a TD(λ)-like method is proposed, in which a weighted average of estimates of different lengths provides the advantage estimate. This method of calculating the advantage is adopted by powerful reinforcement learning algorithms such as TRPO (Schulman et al., 2015a) and PPO (Schulman et al., 2017). It is built from the TD residuals

$$\delta_{t+l}^V = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})$$
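The standard truncated GAE built from these TD residuals can be sketched as follows (a minimal reference implementation in our notation, not the paper's code; `values` is assumed to include one bootstrap value past the segment end):

```python
def gae_advantages(rewards, values, gamma, lam):
    # Truncated GAE over a length-T segment:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t     = sum_{l=0}^{T-t-1} (gamma * lam)^l * delta_{t+l}
    # computed backwards via A_t = delta_t + gamma * lam * A_{t+1}.
    # rewards has length T; values has length T + 1.
    T = len(rewards)
    advantages = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages
```

Because the backward recursion stops at the segment boundary, later residuals beyond the truncation point are simply missing, which is the source of the bias that the partial GAE proposed here aims to reduce.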

