PARTIAL ADVANTAGE ESTIMATOR FOR PROXIMAL POLICY OPTIMIZATION

Abstract

Estimation of value in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of the advantage function, similar to the λ-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used due to the incompleteness of the trajectory, which introduces a large bias into the estimate. To address this challenge, instead of using the entire truncated GAE, we propose to take only a part of it when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and µRTS to investigate the effect of different partial coefficients and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.

1. INTRODUCTION

In reinforcement learning, an agent observes the state of its environment, selects an action, and iterates these steps, constantly learning from the environment. Let $S$ be the set of states and $A$ the set of actions. State transitions are governed by the transition probability $P$. When the agent executes an action, it receives a reward from the environment, $r : S \times A \to [R_{\min}, R_{\max}]$. In this way, the agent produces a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ with cumulative return $\sum_{t=1}^{T} \gamma^{t-1} r_t$, where $\gamma$ is the discount factor and $T$ is the number of steps performed. The goal of reinforcement learning is to find an optimal policy $\pi$ under which the agent obtains the maximum expected cumulative reward. A policy specifies the probability of each action in each state, $\pi(a|s) = p(A_t = a \mid S_t = s)$. Under this policy, the cumulative return follows a distribution, and its expected value at state $s$ is defined as the state-value function:
$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s\right]$$
Accordingly, the expected cumulative return after executing action $a$ in state $s$ is defined as the state-action value function:
$$q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
The value function is commonly estimated with the temporal difference (TD) algorithm or a Monte Carlo (MC) method. However, each of these methods has its advantages and disadvantages: the TD estimator has high bias and low variance, while the MC estimator has low bias and high variance. Kimura et al. (1998) put forward a method that skillfully finds a balance between bias and variance, called the λ-return. TD(λ), proposed by Sutton (1988), is a variant of the λ-return that provides a more balanced value estimate.
Generalized Advantage Estimation (GAE) is a method proposed by Schulman et al. (2015b) to estimate the advantage function; in effect, it applies the λ-return idea to advantage estimation. As noted in the PPO paper (Schulman et al., 2017), in practical applications a truncated GAE is used due to the incompleteness of the trajectory, which leads to large bias in the estimation process. To address this, we propose partial GAE, which uses only part of each truncated GAE rather than the whole, significantly reducing the bias caused by incomplete trajectories. In addition to experiments in common MuJoCo environments, we also conduct experiments in the complex and challenging µRTS environment. Our method is empirically successful in both settings.

2. VALUE ESTIMATOR

The gradient in the policy gradient algorithm is generally written in the following form:
$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \quad (1)$$
where $\Psi_t$ controls the amplitude of the policy update in the gradient direction. In the basic policy gradient algorithm, the action-value function $q_\pi(s,a)$ is used as $\Psi_t$, and $q_\pi(s,a)$ is estimated by the cumulative return $G_t$. One of the most direct ways to estimate the value function is the Monte Carlo (MC) method. MC starts directly from the definition of the value function and takes the accumulated return as the estimator. The accumulated discounted reward of a reward sequence $(r_t, r_{t+1}, \ldots, r_{t+N})$,
$$\sum_{n=0}^{N} \gamma^n r_{t+n}, \quad (2)$$
is taken as the estimator of the value of state $s_t$. The MC state-value estimator is unbiased. However, because the random variables in the estimator are all the rewards after time $t$, and their number is large, the estimator has high variance. The TD algorithm instead uses
$$r_t + \gamma V_\theta(s_{t+1}) \quad (3)$$
as the estimator of the value function $V_\pi(s_t)$. There is a certain error between the approximate value function and the real value function, denoted $e_\theta$. Taking expectations yields Equation 4, which shows that the TD estimator carries the bias $\gamma \mathbb{E}_{S_{t+1}}[e_\theta(S_{t+1})]$. Because the estimator involves few random variables, its variance remains low.
$$\mathbb{E}_{(r_t, S_{t+1})}[r_t + \gamma V_\theta(S_{t+1})] = V_\pi(S_t) + \gamma \mathbb{E}_{S_{t+1}}[e_\theta(S_{t+1})] \quad (4)$$
Compared with the TD and MC methods, the λ-return method seeks a balance between bias and variance. In Equation 6, setting $\lambda = 0$ recovers the TD estimator, and setting $\lambda = 1$ recovers the MC estimator.
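As a concrete illustration of the two estimators, consider the following minimal sketch (the function names and toy inputs are ours, not from the paper):

```python
def mc_return(rewards, gamma=0.99):
    """Monte Carlo estimator: the full discounted return after time t.
    Unbiased, but high variance, since it sums many random rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(r_t, v_next, gamma=0.99):
    """TD(0) estimator r_t + gamma * V(s_{t+1}).
    Low variance (a single random reward), but biased by the value-function error."""
    return r_t + gamma * v_next
```

The variance contrast follows directly from the number of random terms each estimator sums.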
$$G_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}) \quad (5)$$
$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{N-1} \lambda^{n-1} G_t^{(n)} + \lambda^{N-1} G_t^{(N)} \quad (6)$$
One disadvantage of using $q_\pi(s,a)$ is its large variance, which can be reduced by introducing a baseline $b(t)$:
$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \left(q_\pi(s,a) - b(t)\right) \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \quad (7)$$
Using the average return of the samples as a baseline offers some improvement. However, for a Markov process the baseline should change with the state: the baseline of a state in which all actions have large values should itself be larger, and vice versa. The A2C and A3C algorithms use $V_t$ as the baseline, and $q_\pi(s,a) - V_t$ is called the advantage. Approximating $Q(s,a)$ by $r + \gamma V(s_{t+1})$ gives the advantage $A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. The Generalized Advantage Estimation (GAE) paper (Schulman et al., 2015b) proposes a TD(λ)-like method, in which a weighted average of estimates of different lengths provides the estimated advantage. This way of calculating the advantage is adopted by powerful reinforcement learning algorithms such as TRPO (Schulman et al., 2015a) and PPO (Schulman et al., 2017):
$$\delta_{t+l}^{V} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}) \quad (8)$$
$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^{V} \quad (9)$$
When the parameter $\gamma$ is introduced to estimate the policy gradient, $g^\gamma$ is biased with respect to $g$; such a policy gradient estimate actually targets the discounted cumulative reward. In the remainder of this paper, the GAE is the main focus. Besides the above methods for calculating $\Psi_t$, there are other methods for estimating the value function (Bertsekas et al., 2011). The value function itself is typically fit by minimizing the squared error $L_t^{VF} = (V_{\theta_t} - V_t^{target})^2$; Schulman et al. (2015b) optimize it with a trust region method in each iteration of the batch optimization process.
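Equations 5 and 6 can be read off directly in code. The sketch below is our own toy implementation (assuming a `values` array with one bootstrap entry per state up to $s_{t+N}$); it reduces to the one-step TD target at λ = 0 and to the bootstrapped Monte Carlo return at λ = 1:

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_t^(n) = sum_{l<n} gamma^l r_{t+l} + gamma^n V(s_{t+n})  (Eq. 5)."""
    g = sum(gamma ** l * rewards[t + l] for l in range(n))
    return g + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, gamma=0.99, lam=0.95):
    """Truncated lambda-return (Eq. 6): exponentially weighted n-step
    returns, with the leftover weight lam^(N-1) on the longest one."""
    N = len(rewards) - t                      # steps remaining after t
    mix = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
                          for n in range(1, N))
    return mix + lam ** (N - 1) * n_step_return(rewards, values, t, N, gamma)
```

The weights $(1-\lambda)\lambda^{n-1}$ plus the tail weight $\lambda^{N-1}$ sum to 1, so the estimator is a proper convex mixture of the n-step returns.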
The GAE paper solves
$$\underset{\phi}{\text{minimize}} \sum_{n=1}^{N} \left\| V_\phi(s_n) - \hat{V}_n \right\|^2 \quad \text{subject to} \quad \frac{1}{N} \sum_{n=1}^{N} \frac{\left\| V_\phi(s_n) - V_{\phi_{old}}(s_n) \right\|^2}{2\sigma^2} \leq \epsilon \quad (10)$$
In addition, most implementations clip $V_{\theta_t}$ around the value estimate of the previous iteration:
$$L_t^{VF} = \max\left[ \left(V_{\theta_t} - V_t^{target}\right)^2, \left( \mathrm{clip}\left(V_{\theta_t}, V_{\theta_{t-1}} - \epsilon, V_{\theta_{t-1}} + \epsilon\right) - V_t^{target} \right)^2 \right] \quad (11)$$
The work of Tucker et al. (2018) proposes normalizing the advantage, which is shown to improve the performance of the policy gradient algorithm. After the GAE calculates the advantages in a batch, their mean and standard deviation are computed; then each advantage has the mean subtracted and is divided by the standard deviation. In Eq. 12, $\hat{A}_i^{norm}$ is the normalized advantage, $\hat{A}_i$ the advantage, $\hat{A}_{mean}$ the mean of the advantages, and $\hat{A}_{std}$ their standard deviation:
$$\hat{A}_i^{norm} = \frac{\hat{A}_i - \hat{A}_{mean}}{\hat{A}_{std}} \quad (12)$$
Estimating a more instructive value function is very important for the policy gradient algorithm. In the following sections, we discuss the practical application of GAE and how it can be improved.
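The normalization of Eq. 12 is a one-liner in code; the small `eps` in the denominator is our own numerical-stability guard, not part of the equation:

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Eq. 12: subtract the batch mean, divide by the batch standard deviation."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```

After normalization, the batch of advantages has zero mean and unit standard deviation, which keeps the scale of the policy gradient stable across batches.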

3. PARTIAL GAE

In real environments, a task often has terminal states, which result in a finite trajectory length. For a trajectory terminating at time $D$, the GAE is computed by iterating from back to front. Denote the GAE of the complete trajectory by $\hat{A}_t^{GAE(\gamma,\lambda,D)}$ as in Eq. 13; $\hat{A}_t^{GAE(\gamma,\lambda,\infty)}$ is the generalization of $\hat{A}_t^{GAE(\gamma,\lambda,D)}$ as $D \to \infty$.
$$\hat{A}_t^{GAE(\gamma,\lambda,D)} = \sum_{l=0}^{D-t} (\gamma\lambda)^l \delta_{t+l}^{V} \quad (13)$$
In practical applications, the length of one sample is fixed in order to parallelize computation more efficiently. In other cases, sampling a complete trajectory takes a long time, so to make training more efficient only part of the complete trajectory is sampled at a time. As discussed in the PPO paper (Schulman et al., 2017), a truncated GAE is used for fixed-length trajectory segments, as shown in Figure 1. The GAE of an incomplete trajectory of length $T$ is calculated as:
$$\hat{A}_T^{GAE(\gamma,\lambda,T)} = \delta_T^{V} = r_T + \gamma \cdot 0 - V(s_T) \quad (14)$$
$$\hat{A}_t^{GAE(\gamma,\lambda,T)} = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}^{V} \quad (15)$$
Following the GAE paper (Schulman et al., 2015b), denote the sum of $k$ of the $\delta$ terms by $\hat{A}_t^{(k)}$, an estimator of the advantage function. When $k = 1$, $\hat{A}_t^{(1)} = \delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)$, which has large bias and low variance; the bias becomes smaller as $k$ grows.
$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^{V} = -V(s_t) + \gamma^k V(s_{t+k}) + \sum_{l=0}^{k-1} \gamma^l r_{t+l} \quad (16)$$
$$\hat{A}_t^{GAE(\gamma,\lambda,T)} = (1-\lambda) \sum_{k=1}^{T-t} \lambda^{k-1} \hat{A}_t^{(k)} + \lambda^{T-t} \hat{A}_t^{(T-t+1)} \quad (17)$$
$$\hat{A}_t^{GAE(\gamma,\lambda,D)} = (1-\lambda) \sum_{k=1}^{D-t} \lambda^{k-1} \hat{A}_t^{(k)} + \lambda^{D-t} \hat{A}_t^{(D-t+1)} \quad (18)$$
The truncated GAE, like the infinite $\hat{A}^{GAE(\gamma,\lambda)}$, balances bias and variance through $\lambda$: $\hat{A}^{GAE(\gamma,1,T)}$ has low bias and large variance, while $\hat{A}^{GAE(\gamma,0,T)}$ has low variance and large bias. When $\lambda < 1$, the variance is substantially reduced at the cost of bias. Compared with the infinite GAE, the truncated GAE has larger bias and lower variance. Here it is relatively more important to reduce the bias, which means that for a specific trajectory, $\hat{A}_t^{GAE(\gamma,\lambda,T)}$ becomes more instructive as the step $t$ decreases.
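A minimal sketch of the truncated computation of Eqs. 14–15 follows (our own implementation; it follows the paper's convention that the unobserved value after the segment is replaced by 0, whereas many PPO codebases instead bootstrap with $V(s_{T+1})$):

```python
import numpy as np

def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma*lam*A_{t+1} over a
    fixed-length segment; at the last step delta_T = r_T - V(s_T)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0   # gamma * 0 at the cut (Eq. 14)
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

The backward recursion computes all $T$ estimates in one pass, which is why sampling and GAE computation are performed from back to front in practice.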
Compared with the GAE of a complete trajectory $\hat{A}_t^{GAE(\gamma,\lambda,D)}$, the truncated GAE differs by
$$B_t = \hat{A}_t^{GAE(\gamma,\lambda,D)} - \hat{A}_t^{GAE(\gamma,\lambda,T)} = \sum_{l=T-t}^{D-t} (\gamma\lambda)^l \delta_{t+l}^{V} \quad (19)$$
where $B_T$ is a constant for a specific trajectory of length $D$:
$$B_t = \sum_{l=0}^{D-T} (\gamma\lambda)^{l+T-t} \delta_{T+l}^{V} \quad (20)$$
$$B_t = (\gamma\lambda)^{T-t} B_T \quad (21)$$
In conclusion, in practical applications the fixed-length sampling of trajectories truncates the calculation of the GAE, which leads to a large bias in the computed GAE when $t$ is near the end of the trajectory. We therefore propose to keep only part of each computed GAE and drop the remainder, which carries large bias. The resulting PPO algorithm with partial GAE is described in Algorithm 1, with partial coefficient $\epsilon$ and sample length $T$. In each iteration, each sampler collects $T$ samples and calculates $T$ truncated GAE values. We take only part of each GAE: if $t > \epsilon$, the advantage estimates at time $t$

4. EXPERIMENTS

We conducted a set of experiments to verify the effect of our proposed method and to investigate two questions in more detail:
• As discussed above, can smaller partial coefficients reduce the bias of GAE and improve training?
• Can a larger sampling length reduce the bias of GAE and improve training?
We evaluated our method in MuJoCo (OpenAI, 2022) and µRTS (Villar, 2017). MuJoCo is a physics simulation environment. We mainly conducted experiments in . µRTS is a simplified real-time strategy (RTS) game environment; unlike MuJoCo, it has discrete states and discrete actions, and episodes span a large number of game steps. In the experiments with MuJoCo, we use version 1.31 of MuJoCo (distributed under an MIT License). We set hyper-parameters to values common in applications: the discount factor γ is 0.99, λ for the GAE is 0.95, the clip coefficient of PPO is 0.2, the value coefficient is 1, the learning rate of the optimizer is 2.5e-4, the number of environments is 64, and the number of epochs is 2. To represent the policy, we use a three-layer fully connected MLP (multi-layer perceptron) with a 64-unit hidden layer, followed by two additional noise layers for exploration during training. In the experiments with µRTS, we play on the 16×16 map against CoacAI, which won the 2020 µRTS competition. In this experiment, the discount factor γ is 0.99, λ for the GAE is 0.95, the clip coefficient of PPO is 0.2, the entropy regularization coefficient is 0.01, the value coefficient is 1, the learning rate of the optimizer is 2.5e-4, the number of environments is 64, and the number of epochs is 2. To represent the policy, we use a two-layer convolutional neural network connected to a three-layer fully connected MLP with a 512-unit hidden layer. In this paper, we evaluate algorithms by their training curves over a fixed training time rather than over a fixed number of steps.
This is because different algorithms and parameters lead to differences in update time and training time. In practical applications, training time should be considered in addition to training effect. In the MuJoCo environments, we take the average total reward over 100 episodes after a certain training time as the performance score. In µRTS, we take the winning rate over the most recent 100 episodes after a certain training time as the evaluation index. For each set of variables, we ran the experiment with 3 random seeds. We used an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz for our experiments.
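The evaluation statistic (an average over the most recent 100 episodes) might be tracked as follows; this helper is illustrative, not the paper's code:

```python
from collections import deque

class RecentScore:
    """Track the average episode return (or win flag, for the µRTS
    winning rate) over the most recent `window` episodes."""
    def __init__(self, window=100):
        self.episodes = deque(maxlen=window)  # old entries fall off automatically

    def add(self, episode_return):
        self.episodes.append(episode_return)

    def score(self):
        return sum(self.episodes) / len(self.episodes) if self.episodes else 0.0
```

Feeding 1.0 for a win and 0.0 for a loss makes `score()` the recent winning rate used in µRTS.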

4.1. WHAT IS THE EMPIRICAL EFFECT OF PARTIAL COEFFICIENT AND SAMPLE LENGTH?

As discussed above, the smaller the value of $t$ in the truncated GAE $\hat{A}_t^{GAE(\gamma,\lambda,T)}$, the smaller the bias of the estimate and the larger its variance. However, the variance of $\hat{A}_t^{GAE(\gamma,\lambda,T)}$ is always smaller than that of $\hat{A}_t^{GAE(\gamma,\lambda,D)}$, so in theory the partial coefficient should be as small as possible. In practical applications, a small partial coefficient increases the number of GAE calculations, while a larger sampling length lengthens the sequence processed in a single GAE calculation; both increase computation time. The experimental results are shown in Figure 2 and Figure 3. In Figure 2 and Figure 3(a), partial GAE achieves better performance than the baseline, and smaller partial coefficients yield further improvement. Note that when the partial coefficient is already small or the sampling length already large, performance does not improve by further decreasing the partial coefficient or increasing the sampling length. As shown in Equation 21, when $T - t$ is large enough, $B_t$ becomes very small, and further increasing $T - t$ changes little. Although GAE significantly reduces variance by sacrificing bias, and the variance of the truncated GAE is smaller than that of the complete trajectory, this does not mean that bias should be minimized while ignoring the impact of variance. In practical applications, intermediate values of the partial coefficient $\epsilon$ and sample length $T$ should be chosen to balance bias and variance: there is an intermediate setting that trains best, rather than a maximal $T - \epsilon$. We conducted additional experiments on the 16×16 map of µRTS, a sparsely rewarded environment with long episodes. In this task, agents need to learn how to win real-time strategy (RTS) games: collect resources, build units, and destroy enemy units and bases. The sample length $T$ is 512 in these experiments.

4.2. TRUNCATED ADVANTAGE ESTIMATOR VARIANCE INVESTIGATION

In the experiments above, we conjectured that blindly reducing the partial coefficient $\epsilon$ increases the variance of the truncated GAE and worsens training. In the MuJoCo environment, we recorded 2000 truncated GAE values $\hat{A}_t$ for each $t$ to calculate the standard deviation, with sampling length $T = 500$, as shown in Figure 6; the red curve is the standard deviation of the truncated GAE. Overall, the figure shows that when $t$ is small, the GAE has a larger standard deviation (and hence variance). However, contrary to the theory mentioned previously, the variance does not decrease monotonically as $t$ increases; in particular, near the end of a sampling sequence the variance grows with $t$. Writing the truncated GAE in the form of Eq. 22, it can be divided into two parts: an accumulated-reward part $\hat{A}_t^r$ and a value-estimate part $\hat{A}_t^v$. Figure 6 shows that the variance is affected more by the value-estimate part $\hat{A}_t^v$; because of the uncertainty introduced by the deviation of the fitted value function from the actual value, the variance of the truncated GAE does not simply decrease as $t$ increases. Empirically, as shown in Figure 6, smaller $t$ results in larger variance of the truncated GAE, so an intermediate partial coefficient $\epsilon$ must be selected to balance the variance and bias of the truncated GAE for the best training effect.

$$\hat{A}_t^{GAE(\gamma,\lambda,T)} = \hat{A}_t^r + \hat{A}_t^v \quad (22)$$
$$\hat{A}_t^r = \sum_{l=0}^{T-t} (\gamma\lambda)^l r_{t+l} \quad (23)$$
$$\hat{A}_t^v = \gamma(1-\lambda)(\gamma\lambda)^{T-t-1} V(s_T) - V(s_t) + \gamma(1-\lambda) \sum_{l=0}^{T-t-2} (\gamma\lambda)^l V(s_{t+l+1}) \quad (24)$$
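The split of Eqs. 22–24 can be checked numerically against the backward recursion of the truncated GAE. The sketch below is our own and follows the zero-tail convention of Eq. 14 (the value after the segment is treated as 0):

```python
def reward_part(rewards, values, t, gamma=0.99, lam=0.95):
    """A^r_t (Eq. 23): the accumulated-reward part of the truncated GAE."""
    return sum((gamma * lam) ** l * rewards[t + l]
               for l in range(len(rewards) - t))

def value_part(rewards, values, t, gamma=0.99, lam=0.95):
    """A^v_t (Eq. 24): the value-estimate part, obtained by collecting
    the V terms of the deltas; the value after the segment is 0 (Eq. 14)."""
    T = len(rewards)
    s = sum((gamma * lam) ** l * values[t + l + 1] for l in range(T - 1 - t))
    return -values[t] + gamma * (1 - lam) * s
```

Summing the two parts reproduces the truncated GAE exactly, so the variance of $\hat{A}_t$ can be attributed to the two components separately, as done for Figure 6.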

5. CONCLUSION

How to estimate the value function is particularly important in policy gradient algorithms. GAE provides a way to balance the bias and variance of the value estimate. In practical applications, however, a truncated GAE is often used, which leads to excessive bias in the value estimate. We propose to use partial GAE in training, discarding the truncated GAE estimates with excessive bias, which reduces the bias and makes the value estimate more instructive. We conducted experiments in MuJoCo environments, and the results show that using partial GAE consistently improves training. For the parameters of partial GAE, the sampling length $T$ and partial coefficient $\epsilon$, although theoretically increasing $T - \epsilon$ reduces bias, the combined influence of variance means an intermediate value gives the best training effect. Adapting the sampling length $T$ and partial coefficient $\epsilon$ automatically may be a direction for future research. We also conducted additional experiments in µRTS; in this environment with sparse rewards and long episodes, partial GAE also performed well.

6. ETHICS STATEMENT

We note that our method does not introduce new potential societal harms, as it is an improvement to value estimation; however, it inherits the potential societal harms of deep reinforcement learning methods, which are well documented in Whittlestone et al. (2021). Note that the policy of an agent trained by deep reinforcement learning depends strongly on the states it has explored and on its training environment, which can cause the agent to perform unexpected actions when it encounters a state it has not seen before. Relying on the generalization of neural networks alone is not dependable, so it is necessary to consider how to handle unexpected agent actions, especially when DRL agents are deployed in the real world.



Figure 1: In practical applications, considering parallel computing and avoiding overly long episodes, a fixed sampling length is used. Since the last step of the sampled trajectory is not the terminus of the complete trajectory, the truncated GAE must be used as the estimator for all of the calculated $\hat{A}_t^{GAE(\gamma,\lambda,T)}$.

Figure 2: Training curves in different MuJoCo environments.

Figure 3: (a) Training curves with different partial coefficients in Ant-v3, sample length T = 512. (b) Training curves with different sample lengths in Ant-v3, partial coefficient ϵ = 64.

Figure 4: Performance after 1 hour of training in Ant-v3. Since the sample length T is greater than or equal to the partial coefficient ϵ, the white region has no data. T = ϵ means no GAE is discarded, which is the baseline.

In Figure 3(b), using a larger sample length improves performance. Figure 4 shows the performance after one hour of training as T and ϵ are varied. The highest performance scores in Ant-v3 occur with partial coefficient ϵ ∈ [64, 128] and sample length T ∈ [384, 512].

Figure 5: Winning rate during training on the µRTS 16×16 map against CoacAI.

Figure 6: The standard deviation at different sample times t during baseline training. For each t, 2000 truncated GAE values recorded after one million time steps are used to calculate the standard deviation.

in the trajectory are discarded during training. We then construct the surrogate loss of Schulman et al. (2017) on the remaining $N\epsilon$ data points and optimize the policy with minibatch Adam for $K$ epochs. For Algorithm 1, according to Equation 17, the bias of $\hat{A}_t^{GAE(\gamma,\lambda,T)}$ decreases as $T - t$ increases; as the partial coefficient decreases, the bias of the retained $\hat{A}_1, \ldots, \hat{A}_\epsilon$ decreases.
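The sampling-and-filtering step above might be sketched as follows (our own illustrative code, following Algorithm 1 only loosely; the helper name and batch layout are assumptions): each of the N samplers collects a length-T segment, the truncated GAE is computed backward over it, and only the first ϵ estimates, whose truncation bias $(\gamma\lambda)^{T-t} B_T$ is smallest, are kept for the PPO update.

```python
import numpy as np

def partial_gae_batch(rewards_batch, values_batch, epsilon, gamma=0.99, lam=0.95):
    """Compute truncated GAE over each length-T segment, then keep only
    the first `epsilon` advantage estimates (t <= epsilon).
    Returns an array of shape (N, epsilon)."""
    kept = []
    for rewards, values in zip(rewards_batch, values_batch):
        T = len(rewards)
        adv = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            next_v = values[t + 1] if t + 1 < T else 0.0
            delta = rewards[t] + gamma * next_v - values[t]
            running = delta + gamma * lam * running
            adv[t] = running
        kept.append(adv[:epsilon])       # drop the biased tail t > epsilon
    return np.stack(kept)
```

With `epsilon = T` no estimates are discarded, recovering the baseline of Figure 4.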

7. REPRODUCIBILITY STATEMENT

Our experiments were repeated three times with different random seeds. We will upload our code in the supplemental materials for verification.

