A STRONG ON-POLICY COMPETITOR TO PPO

Anonymous authors
Paper under double-blind review

Abstract

As a recognized variant of and improvement over Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely used thanks to several advantages: efficient data utilization, easy implementation, and good parallelism. In this paper, we propose another powerful variant: a first-order policy gradient algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D), whose penalty term is a lower bound to the square of the total variation divergence. The penalty has dual effects: it prohibits policy updates from overshooting and encourages more exploration. Carefully controlled experiments on both discrete and continuous benchmarks show that our approach is highly competitive with PPO.

1. INTRODUCTION

With the development of deep reinforcement learning, impressive results have been produced in a wide range of fields such as playing Atari games (Mnih et al., 2015; Hessel et al., 2018), controlling robots (Lillicrap et al., 2015), Go (Silver et al., 2017), and neural architecture search (Tan et al., 2019; Pham et al., 2018). The basis of a reinforcement learning algorithm is generalized policy iteration (Sutton & Barto, 2018), which consists of two essential iterative steps: policy evaluation and policy improvement. Among various algorithms, policy gradient methods form an active branch of reinforcement learning, whose foundations are the Policy Gradient Theorem and the classical algorithm REINFORCE (Sutton & Barto, 2018). Since then, a handful of policy gradient variants have been proposed, such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Actor-Critic using Kronecker-factored Trust Region (ACKTR) (Wu et al., 2017), and Proximal Policy Optimization (PPO) (Schulman et al., 2017). Improving the policy monotonically had been nontrivial until Schulman et al. (2015) proposed Trust Region Policy Optimization (TRPO), in which the Fisher vector product is utilized to cut down the computational burden. Specifically, the Kullback-Leibler divergence (KLD) acts as a hard constraint rather than appearing in the objective, because its corresponding coefficient is difficult to set for different problems. However, TRPO still has several drawbacks: it is complicated to implement and uses data inefficiently. Considerable effort has been devoted to improving TRPO since then, and the most widely used result is PPO. PPO can be regarded as a first-order variant of TRPO with clear improvements in several facets. In particular, it proposes a pessimistic clipped surrogate objective in which TRPO's hard constraint is replaced by clipping the action probability ratio.
In such a way, PPO constructs an unconstrained optimization problem so that any first-order stochastic gradient optimizer can be directly applied. Besides, it is easier to implement and more robust across various problems, achieving impressive results on Atari games (Brockman et al., 2016). However, the cost of data sampling is not always cheap. Haarnoja et al. (2018) design an off-policy algorithm called Soft Actor-Critic and achieve state-of-the-art results by encouraging better exploration via maximum entropy. In this paper, we focus on on-policy improvements to PPO and answer the question: how can penalized optimization successfully solve the constrained problem formulated by Schulman et al. (2015)? Our contributions are threefold:
1. It proposes a simple variant of TRPO called POP3D, along with a new surrogate objective containing a point probability penalty term, which is a symmetric lower bound to the square of the total variation divergence between policy distributions. This penalty helps to stabilize the learning process and encourages exploration. Furthermore, it escapes the coefficient-setting headache of the penalized version of TRPO, where it is arduous to select one fixed value for various environments.
2. It achieves state-of-the-art results among on-policy algorithms by a clear margin on 49 Atari games within 40 million frames, based on two shared metrics, and achieves competitive results compared with PPO in the continuous domain. It also dives into the mechanism behind PPO's improvement over TRPO from the perspective of the solution manifold, which plays an important role in our method as well.
3. It enjoys almost all of PPO's advantages, such as easy implementation and fast learning. We provide the code and training logs to make our work reproducible.

2. PRELIMINARY KNOWLEDGE AND RELATED WORK

2.1 POLICY GRADIENT

Agents interact with the environment and receive rewards, which are used in turn to adjust their policy. At state $s_t$, an agent follows strategy $\pi$, transfers to a new state $s_{t+1}$, and is rewarded $r_t$ by the environment. Its objective is to maximize the discounted return (accumulated reward) $R_t$. In particular, given a policy $\pi$, $R_t$ is defined as

$$R_t = \sum_{n=0}^{\infty} \gamma^n r_{t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \qquad (1)$$

where $\gamma \in (0, 1)$ is the discount coefficient that controls the weight of future rewards. For a neural network with parameters $\theta$, the policy $\pi_\theta(a|s)$ can be learned by maximizing Equation 1 using the back-propagation algorithm. In particular, given $Q(s, a)$, which represents the agent's return in state $s$ after taking action $a$, the objective function can be written as

$$\max_\theta \; \mathbb{E}_{s,a}\left[\log \pi_\theta(a|s)\, Q(s, a)\right]. \qquad (2)$$

Equation 2 lays the foundation for a handful of policy-gradient-based algorithms. Another variant can be deduced by using

$$A(s, a) = Q(s, a) - V(s) \qquad (3)$$

to replace $Q(s, a)$ in Equation 2 equivalently; $V(s)$ can be any function so long as it depends on $s$ but not on $a$. In most cases, the state value function is used for $V$, which not only helps to reduce variance but also has a clear physical meaning. Formally, the objective becomes

$$\max_\theta \; \mathbb{E}_{s,a}\left[\log \pi_\theta(a|s)\, A(s, a)\right]. \qquad (4)$$
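As a minimal illustration (not the paper's released code), the advantage-weighted objective of Equation 4 can be sketched as a Monte-Carlo loss over sampled steps; the loss is negated so that a gradient minimizer performs ascent on the objective:

```python
import math

def pg_loss(action_probs, advantages):
    """Monte-Carlo estimate of Equation 4, negated for a minimizer:
    -mean_t [ log pi_theta(a_t|s_t) * A(s_t, a_t) ].

    action_probs: pi_theta(a_t|s_t) for the sampled actions.
    advantages:   estimated A(s_t, a_t), treated as constants.
    """
    terms = [math.log(p) * adv for p, adv in zip(action_probs, advantages)]
    return -sum(terms) / len(terms)
```

In an actual implementation the advantages would be detached from the computation graph so that gradients flow only through $\log \pi_\theta$.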

2.2. ADVANTAGE ESTIMATE

A commonly used method for advantage calculation is the one-step estimate, $A(s_t, a_t) = Q(s_t, a_t) - V(s_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. A more accurate method called generalized advantage estimation is proposed in Schulman et al. (2016), where the estimates from all time steps are combined with $\lambda$-based weights. The generalized advantage estimator $\hat{A}_t^{GAE(\gamma,\lambda)}$ is defined by Schulman et al. (2016) as

$$\hat{A}_t^{GAE(\gamma,\lambda)} := (1 - \lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots\right) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l}, \qquad (5)$$

$$\delta^V_{t+l} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}),$$

$$\hat{A}_t^{(k)} := \sum_{l=0}^{k-1} \gamma^l \delta^V_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}).$$

The parameter $\lambda$ satisfies $0 \le \lambda \le 1$ and controls the trade-off between bias and variance. All methods in this paper use $\hat{A}_t^{GAE(\gamma,\lambda)}$ to estimate the advantage.

2.3 TRUST REGION POLICY OPTIMIZATION

Schulman et al. (2015) propose TRPO to update the policy monotonically. Its mathematical form is

$$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t\right] - C\, \mathbb{E}_t\left[\mathrm{KL}[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right], \qquad (6)$$

where $C$ is the penalty coefficient,

$$C = \frac{2\epsilon\gamma}{(1-\gamma)^2}. \qquad (7)$$

In practice, the policy update steps would be too small if $C$ were set as in Equation 7. In fact, it is intractable to compute $C$ beforehand, since doing so requires traversing all states to find the maximum. Moreover, estimating the advantages of the old policy during training inevitably introduces bias and variance. Instead, a surrogate objective is maximized subject to a KLD constraint between the old and new policies:

$$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\left[\mathrm{KL}[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right] \le \delta, \qquad (8)$$

where $\delta$ is the upper limit on the KLD. In addition, the conjugate gradient algorithm is applied to solve Equation 8 more efficiently.
Two major problems remain: one is its complexity even when using the conjugate gradient approach; the other is its incompatibility with architectures that involve noise or parameter-sharing tricks (Schulman et al., 2017).
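The $\lambda$-weighted estimator of Section 2.2 is usually computed with a single backward recursion, since $\hat{A}_t = \delta^V_t + \gamma\lambda \hat{A}_{t+1}$. A minimal sketch (assuming a bootstrap value $V(s_T)$ is appended to the value list; this is illustrative code, not the paper's implementation):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation via the backward recursion
    A_t = delta_t + gamma*lam*A_{t+1},
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (the last entry bootstraps the tail)
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With $\lambda = 0$ this reduces to the one-step estimate $\delta^V_t$; with $\lambda = 1$ it reduces to the discounted return minus the value baseline.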

2.4. PROXIMAL POLICY OPTIMIZATION

To overcome the shortcomings of TRPO, PPO replaces the original constrained problem with a pessimistic clipped surrogate objective in which the KL constraint is implicitly imposed. The loss function can be written as

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right], \quad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, \qquad (9)$$

where $\epsilon$ is a hyper-parameter that controls the clipping range. Besides the clipped version, Schulman et al. (2017) also present KL penalty versions with fixed and adaptive KLD coefficients; their experiments show that clipped PPO performs best by an obvious margin across various domains.
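The per-step clipped term of Equation 9 can be sketched as follows (an illustrative function, not PPO's reference implementation); note how the outer minimum makes the objective pessimistic in both advantage signs:

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-step pessimistic clipped objective from Equation 9.

    ratio = pi_theta(a|s) / pi_theta_old(a|s).
    The min over the unclipped and clipped terms zeroes the gradient
    once the ratio moves past the clip range in the profitable direction.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

For example, with a positive advantage the term saturates at $(1+\epsilon)\hat{A}_t$ no matter how large the ratio grows, which is exactly the behavior discussed in Section 3.2.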

3. POLICY OPTIMIZATION WITH PENALIZED POINT PROBABILITY DISTANCE

Before diving into the details of POP3D, we review some drawbacks of several methods, which partly motivate us.

3.1. DISADVANTAGES OF KULLBACK-LEIBLER DIVERGENCE

TRPO (Schulman et al., 2015) induced the following inequality (foot_0):

$$\eta(\pi_\theta) \le L_{\pi_{\theta_{old}}}(\pi_\theta) + \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \alpha = D^{max}_{TV}(\pi_{\theta_{old}}, \pi_\theta), \quad D^{max}_{TV}(\pi_{\theta_{old}}, \pi_\theta) = \max_s D_{TV}(\pi_{\theta_{old}} \| \pi_\theta). \qquad (10)$$

TRPO then replaces the square of the maximum total variation divergence by $D^{max}_{KL}(\pi_{\theta_{old}}, \pi_\theta) = \max_s D_{KL}(\pi_{\theta_{old}} \| \pi_\theta)$. Given discrete distributions $p$ and $q$, their total variation divergence $D_{TV}(p\|q)$ is defined as

$$D_{TV}(p\|q) := \frac{1}{2}\sum_i |p_i - q_i| \qquad (11)$$

in TRPO (Schulman et al., 2015). Obviously, $D_{TV}$ is symmetric by definition, while the KLD is asymmetric. Formally, given state $s$, the KLD of $\pi_{\theta_{old}}(\cdot|s)$ relative to $\pi_\theta(\cdot|s)$ can be written as

$$D_{KL}(\pi_{\theta_{old}}(\cdot|s) \| \pi_\theta(\cdot|s)) := \sum_a \pi_{\theta_{old}}(a|s) \ln\frac{\pi_{\theta_{old}}(a|s)}{\pi_\theta(a|s)}. \qquad (12)$$

The KLD in the continuous domain is defined similarly by replacing summation with integration. The asymmetry of the KLD leads to a non-negligible difference between choosing $D_{KL}(\pi_{\theta_{old}} \| \pi_\theta)$ and $D_{KL}(\pi_\theta \| \pi_{\theta_{old}})$; sometimes the two choices result in quite different solutions. Murphy (2012) compares the forward and reverse KL on a bimodal distribution: one solution matches only one of the modes, while the other covers both. Therefore, the KLD is not an ideal bound or approximation for the expected discounted cost.

3.2. DISCUSSION ABOUT PESSIMISTIC PROXIMAL POLICY

In fact, PPO is called pessimistic proximal policy optimization (foot_2) because of the way its objective is constructed. Without loss of generality, suppose $\hat{A}_t > 0$ for a given state $s_t$ and action $a_t$, and that $a_t$ is the optimal choice. In this case a good policy update is to increase the probability of $a_t$ to a relatively high value by adjusting $\theta$. However, once the ratio leaves the clipping range, the clipped term $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t$ fully determines the loss through the minimum operation, which forgoes further reward with zero gradients even though $a_t$ is the optimal action. The situation with $\hat{A}_t < 0$ can be analyzed in the same manner. However, if the pessimistic limitation is removed, PPO's performance decreases dramatically (Schulman et al., 2017), which is again confirmed by our preliminary experiments. In a word, the pessimistic mechanism plays a critical role for PPO: it expresses a relatively weak preference for a good action decision at a given state, which in turn affects learning efficiency.

3.3. RESTRICTED SOLUTION MANIFOLD FOR EXACT DISTRIBUTION MATCHING

For simplicity, we do not take into account the model identifiability issues that come with deep neural networks, because they do not affect the following discussion much (LeCun et al., 2015). Suppose $\pi_\theta$ is the optimal policy for a given environment. In most cases, more than one parameter set $\theta$ can generate the ideal policy, especially when $\pi_\theta$ is represented by a deep neural network; in other words, the relationship between $\theta$ and $\pi_\theta$ is many-to-one. On the other hand, when agents interact with the environment using a policy represented by a neural network, they prefer to take the action with the highest probability. Although some exploration-enhancing strategies are applied, they do not affect the policy much in expectation. RL methods can help agents learn useful policies after fully interacting with the environment. Take the Atari Pong game for example: when an agent sees the ball coming close to the right (state $s_1$), its optimal policy is to move the racket to the right position (say, the "RIGHT" action) with a distribution such as $p^{s_1}_{\theta_1} = [0.05, 0.05, 0.1, 0.7, 0.05, 0.05]$ (foot_3). The probability of selecting "RIGHT" is a relatively high value such as 0.7. It is almost impossible to push it to exactly 1.0, since it is produced by a softmax over several discrete actions. In fact, we can hardly obtain the optimal solution exactly; instead, our goal is to find a good enough policy. In this case, any policy that pushes $p(\mathrm{RIGHT}|s_1)$ above a threshold is sufficient to be a good one. In other words, paying attention to the most critical actions is sufficient, and we do not care much about the probability values of the other, non-critical actions. For example, a good policy at $s_1$ is any one with $p(\mathrm{RIGHT}|s_1) \ge 0.7$, regardless of the remaining entries. Note that $\pi_\theta(a|s)$ is represented by a neural network parameterized by $\theta$, and a good policy for the whole game means that the network performs well across the whole state space.
Focusing on the critical actions at each state (foot_4) and ignoring non-critical ones can help the network learn better and more easily. A penalty such as the KLD cannot exploit this property, because it involves all of the actions' probabilities. Moreover, it does not stop penalizing until the two distributions become exactly identical, or until the advantage term is large enough to compensate for the KLD cost. Therefore, even if $\pi_\theta$ assigns the same high probability to the right action as $\pi_{\theta_{old}}$, the penalization persists. Suppose two other parameter sets $\theta_2$ and $\theta_3$ give $p^{s_1}_{\theta_2} = [0.01, 0.15, 0.05, 0.7, 0.01, 0.08]$ and $p^{s_1}_{\theta_3} = [0.01, 0.01, 0.01, 0.7, 0.26, 0.01]$. When the agent already chooses RIGHT at $s_1$, the loss term from a good penalized distance should be small. However, $D_{KL}(\pi_{\theta_1}(\cdot|s_1) \| \pi_{\theta_2}(\cdot|s_1)) = 0.15$ and $D_{KL}(\pi_{\theta_1}(\cdot|s_1) \| \pi_{\theta_3}(\cdot|s_1)) = 0.39$. Yet it is not necessary to require the probabilities of the other actions ('NOOP', 'FIRE', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE') under $p^{s_1}_{\theta_2}$ to be near those of $p^{s_1}_{\theta_1}$. It is better to relax this requirement, enlarging the network's degrees of freedom and focusing on learning the important actions. Doing so brings another advantage: the agent can explore more among non-critical actions. From the perspective of manifolds, the optimal parameters constitute a solution manifold. The KLD penalty keeps acting until $\theta$ lands exactly on the solution if possible, akin to mapping a point onto a curve. Instead, if the agent concentrates only on critical actions, as a human does, it is much easier to approach the manifold in a higher dimension. This is comparable to expanding the solution manifold by at least one dimension, e.g. from curves to surfaces or from surfaces to spheres.
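The toy numbers above are easy to verify. The sketch below (illustrative only; the index of the chosen action follows the position of the 0.7 entry in these vectors) reproduces the two KL values and shows that the point probability distance of Section 3.5 is zero for both alternatives:

```python
import math

def kl(p, q):
    # D_KL(p || q) for discrete distributions, as in Equation 12
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def d_pp(p, q, a):
    # point probability distance for the sampled action a (Equation 13)
    return (p[a] - q[a]) ** 2

p1 = [0.05, 0.05, 0.1, 0.7, 0.05, 0.05]   # pi_theta1(.|s1)
p2 = [0.01, 0.15, 0.05, 0.7, 0.01, 0.08]  # pi_theta2(.|s1)
p3 = [0.01, 0.01, 0.01, 0.7, 0.26, 0.01]  # pi_theta3(.|s1)
A = 3  # index of the 0.7-probability (chosen) action in these vectors
```

Running `kl(p1, p2)` and `kl(p1, p3)` gives roughly 0.15 and 0.39, matching the text, while `d_pp` is exactly 0 for both, so the point probability penalty does not punish either alternative.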

3.4. EXPLORATION

A shared highlight of reinforcement learning is the balance between exploitation and exploration. For policy-gradient algorithms, an entropy term is usually added to the total loss to encourage exploration. When included in the loss function, the KLD penalizes the probability mismatch between the old and new policies over all possible actions given a state $s$, as in Equation 12. This strict punishment of every action's probability mismatch discourages exploration.

3.5. POINT PROBABILITY DISTANCE

To overcome the above-mentioned shortcomings, we propose a surrogate objective with a point probability distance penalty, which is symmetric and more optimistic than PPO's. In the discrete domain, when the agent takes action $a$, the point probability distance between $\pi_{\theta_{old}}(\cdot|s)$ and $\pi_\theta(\cdot|s)$ is defined by

$$D^a_{pp}(\pi_{\theta_{old}}(\cdot|s), \pi_\theta(\cdot|s)) = (\pi_{\theta_{old}}(a|s) - \pi_\theta(a|s))^2. \qquad (13)$$

Attention should be paid to how the penalty is defined: the distance is measured by the point probability, which emphasizes the mismatch only for the sampled action at a state. When it would not lead to confusion, we omit $a$ for simplicity in the following sections. Clearly, $D_{pp}$ is symmetric by definition. Furthermore, it can be proved that $D_{pp}$ is indeed a lower bound for the square of the total variation divergence $D_{TV}$. As a special case, for a binary distribution it is easy to show that $D^2_{TV}(p\|q) = D_{pp}(p\|q)$.

Theorem 3.1. For two discrete probability distributions $p$ and $q$ over $K$ values, $D^2_{TV}(p\|q) \ge D^a_{pp}(p\|q)$ holds for any action $a$, and $\mathbb{E}_a D^a_{pp}(p\|q)$ is a lower bound for $D^2_{TV}(p\|q)$.

Proof. Let $p_l = \alpha$ and $q_l = \beta$ for the $l$-th action $a$, and suppose $\alpha \ge \beta$ without loss of generality. Then

$$D^2_{TV}(p\|q) = \left(\frac{1}{2}\sum_{i=1}^{K}|p_i - q_i|\right)^2 = \left(\frac{1}{2}\sum_{i\ne l}|p_i - q_i| + \frac{1}{2}|p_l - q_l|\right)^2 \ge \left(\frac{1}{2}\Big|\sum_{i\ne l}(p_i - q_i)\Big| + \frac{1}{2}(\alpha - \beta)\right)^2$$
$$= \left(\frac{1}{2}|1 - \alpha - (1 - \beta)| + \frac{1}{2}(\alpha - \beta)\right)^2 = \left(\frac{1}{2}(\alpha - \beta) + \frac{1}{2}(\alpha - \beta)\right)^2 = D^a_{pp}(p\|q),$$

$$\mathbb{E}_a D^a_{pp}(p\|q) = \sum_a p(a) D^a_{pp}(p\|q) \le \sum_a p(a) D^2_{TV}(p\|q) = D^2_{TV}(p\|q).$$

Since $0 \le \pi_\theta(a|s) \le 1$ for a discrete action space, $D_{pp}$ is bounded: $0 \le D_{pp} \le 1$. Moreover, $D_{pp}$ is less sensitive to the action-space dimension than the KLD, which has an effect similar to PPO's clipped ratio in increasing robustness and enhancing stability. Equation 13 stays unchanged in the continuous domain; the only difference is that $\pi_\theta(a|s)$ represents a point probability density instead of a probability.
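Theorem 3.1 is also easy to check empirically. The sketch below (illustrative, not part of the paper's code) samples random distribution pairs and confirms that $D^a_{pp}$ never exceeds $D^2_{TV}$ for any action:

```python
import random

def d_tv(p, q):
    # total variation divergence (Equation 11)
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def d_pp(p, q, a):
    # point probability distance for action a (Equation 13)
    return (p[a] - q[a]) ** 2

def random_dist(k, rng):
    xs = [rng.random() for _ in range(k)]
    s = sum(xs)
    return [x / s for x in xs]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    p, q = random_dist(5, rng), random_dist(5, rng)
    bound = d_tv(p, q) ** 2
    # a tiny tolerance absorbs floating-point noise
    if any(d_pp(p, q, a) > bound + 1e-12 for a in range(5)):
        violations += 1
```

After the loop, `violations` stays at zero, as the theorem guarantees.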

3.6. POP3D

Having defined the point probability distance, we use a new surrogate objective $f(\theta)$ for POP3D:

$$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t - \beta D^{a_t}_{pp}(\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t))\right], \qquad (14)$$

where $\beta$ is the penalty coefficient. These combined advantages lead to considerable performance improvement and escape the dilemma of choosing a preferable penalty coefficient. Besides, we use generalized advantage estimation to calculate $\hat{A}_t$. Algorithm 1 shows the complete iteration process of POP3D, which possesses the same computing cost and data efficiency as PPO.

Algorithm 1 POP3D
1: Input: max iterations $L$, actors $N$, epochs $K$
2: for iteration = 1 to $L$ do
3:   for actor = 1 to $N$ do
4:     Run policy $\pi_{\theta_{old}}$ for $T$ time steps

The penalty therefore helps the agent to focus on the important action. When updating $\theta$ from $\theta_{old}$ as in Equation 14, differentiating the objective shows that the penalty term contributes a damping force proportional to $\beta$. In the early stage of learning, $\pi_{\theta_{old}}(a_t|s_t)$ is near $1/K$ (taking a $K$-dimensional discrete action space for example) and the magnitude of $\hat{A}_t$ is large, while the damping force is weak; therefore the agent learns fast. In the middle stage, $\beta$ exerts a relatively stronger force that avoids overshooting in action selection and encourages more exploration. In the final stage, the policy changes slowly because the learning rate is low, $\delta(a_t|s_t)$ is small, and the algorithm therefore converges.
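The per-step objective of Equation 14 can be sketched as follows (a minimal illustration, not the released implementation; `beta=5.0` is the value used for Atari in Section 4.3):

```python
def pop3d_term(prob_new, prob_old, advantage, beta=5.0):
    """Per-step POP3D surrogate (Equation 14), to be maximized:
    ratio * A_hat - beta * (pi_new(a|s) - pi_old(a|s))^2,
    where the squared term is the point probability distance D_pp
    for the sampled action.
    """
    ratio = prob_new / prob_old
    penalty = (prob_new - prob_old) ** 2
    return ratio * advantage - beta * penalty
```

When the new and old probabilities agree, the penalty vanishes and the term reduces to the plain importance-weighted advantage; as the probability moves away from its old value, the quadratic penalty damps the update.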

3.8. RELATIONSHIP WITH PPO

To conclude this section, we examine why PPO works, taking the above viewpoints into account. Looking more closely at Equation 9, the ratio $r_t(\theta)$ involves only the probability of the given action $a$, chosen by policy $\pi$. In other words, the probabilities of all actions other than $a$ are not activated: they no longer contribute to back-propagation, and the resulting tolerance of probability mismatch encourages exploration. This behaves similarly to POP3D and helps the network learn more easily. Above all, POP3D is designed to overcome the problems discussed above, and in the next section experiments on commonly used benchmarks evaluate its performance.

4. EXPERIMENTS

4.1 CONTROLLED EXPERIMENT SETUP

OpenAI Gym is a well-known simulation environment for testing and evaluating reinforcement learning algorithms, comprising both discrete (Atari) and continuous (Mujoco) domains (Brockman et al., 2016). Most recent deep reinforcement learning methods, such as DQN variants (Van Hasselt et al., 2016; Wang et al., 2016; Schaul et al., 2015; Bellemare et al., 2017; Hessel et al., 2018), A3C, ACKTR, and PPO, are evaluated using only one set of hyper-parameters (foot_5). Therefore, we evaluate POP3D's performance on 49 Atari games (v4, discrete action space) and 7 Mujoco tasks (v2, continuous). Since PPO is a distinguished RL algorithm that defeats various methods such as A3C, A2C, and ACKTR, we focus on a detailed quantitative comparison with fine-tuned PPO. We do not consider large-scale distributed algorithms such as Ape-X DQN (Horgan et al., 2018) and IMPALA (Espeholt et al., 2018), because we concentrate on comparable and fair evaluation, while those methods are designed for large-scale parallelism. Nevertheless, some orthogonal improvements from them have the potential to improve our method further. Furthermore, we include TRPO as a baseline method. Engstrom et al. (2020) carefully study the underlying factors that help PPO outperform TRPO. To avoid unfair comparisons, we carefully control the settings. In addition, a quantitative comparison between the KLD and point probability penalties helps confirm the critical role of the latter; the former strategy is named fixed KLD in Schulman et al. (2017) and acts as another good baseline in this context, named BASELINE below. In particular, we retrained one agent for each game with fine-tuned hyper-parameters (foot_6). To avoid the reproduction problems of reinforcement learning algorithms mentioned in Henderson et al. (2018), we take the following measures:
• Use the same training steps and the same amount of game frames (40M for Atari and 10M for Mujoco).
• Use the same neural network structures: the CNN model with one action head and one value head for Atari, and a fully-connected model with one value head and one action head producing the mean and standard deviation of a diagonal Gaussian distribution, as in PPO.
• Initialize parameters using the same strategy as PPO.
• Keep the Gym wrappers from DeepMind, such as reward clipping and frame stacking, unchanged for the Atari domain, and enable 30 no-ops at the beginning of each episode.
• Use the Adam optimizer (Kingma & Ba, 2014) and decrease the learning rate multiplier α linearly from 1 to 0 for the Atari domain, as in PPO.
To facilitate further comparisons with other approaches, we release the seeds and detailed results (foot_7) (across the entire training process for different trials). In addition, we randomly select three seeds from {0, 10, 100, 1000, 10000} for the two domains, {10, 100, 1000} for Atari and {0, 10, 100} for Mujoco, in order to decrease the unfavorable subjective bias described in Henderson et al. (2018).

4.2. EVALUATION METRICS

PPO utilizes two score metrics for evaluating agents trained with various RL algorithms. One is the mean score of the last 100 episodes, Score_100, which measures how high a strategy can eventually reach. The other is the average score across all episodes, Score_all, which evaluates how fast an agent learns. In this paper, we follow this routine and compute each metric by averaging over three seeds in the same way.
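Given a per-episode score history, the two metrics above can be computed as follows (a trivial sketch for concreteness, not the evaluation script itself):

```python
def score_metrics(episode_scores):
    """Score_100: mean of the last 100 episodes (final performance).
    Score_all:  mean over all episodes (learning speed)."""
    last = episode_scores[-100:]
    score_100 = sum(last) / len(last)
    score_all = sum(episode_scores) / len(episode_scores)
    return score_100, score_all
```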

4.3. DISCRETE DOMAIN COMPARISONS

Hyper-parameters. We searched the penalty coefficient β four times on four Atari games while keeping the other hyper-parameters the same as PPO's, and fixed β = 5.0 to train all Atari games. For BASELINE, we also searched its penalty coefficient four times and chose β = 10.0. To save space, detailed hyper-parameter settings can be found in Tables 6 and 7. This process is not favorable to POP3D, since the remaining hyper-parameters were not optimized for it. There are two reasons for this choice. On the one hand, it is the simplest way we know of to form a relatively fair comparison group, e.g. keeping the same iterations and epochs within one loop. On the other hand, it imposes low search requirements in time and resources. That is to say, we can conclude that our method is at least competitive with PPO if it performs better on the benchmarks.

Comparisons. The final score of each game is averaged over three different seeds, and the highest is shown in bold.

4.4 CONTINUOUS DOMAIN COMPARISONS

Hyper-parameters. For PPO, we use the same hyper-parameter configuration as Schulman et al. (2017). Regarding POP3D, we searched on two games three times and selected 5.0 as the penalty coefficient. More details about the hyper-parameters for PPO and POP3D are listed in Table 8. Unlike the Atari domain, we use a constant learning rate, as Schulman et al. (2017) do in the continuous domain, instead of the linear decrease strategy.

Comparison Results

The scores are also averaged over three trials and summarized in Table 1. POP3D wins 6 out of 7 games on Score_100. Both evaluation metrics across the different games are illustrated in Tables 2 and 5. In summary, both metrics indicate that POP3D is competitive with PPO in the continuous domain. More interestingly, it not only suffers less from the penalty-coefficient-setting headache that accompanies TRPO, where it is arduous to select one fixed value for various environments, but also outperforms the fixed-KLD baseline from PPO. In summary, POP3D is highly competitive and a viable alternative to PPO.

A SCORE TABLES AND CURVES

Mean scores of the various methods for the Atari domain are listed in Tables 3 and 4.



foot_0: Note that η means loss instead of return, as in the ICML version (Schulman et al., 2015).
foot_2: The word "pessimistic" is used by the PPO paper.
foot_3: The action space is described as ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE'].
foot_4: Note that some states have no critical action. Taking Pong for example, when the ball has just been shot back, the agent can choose any action.
foot_5: DQN variants are evaluated in the Atari environment since they are designed to solve problems with discrete action spaces, whereas policy-gradient-based algorithms can handle both continuous and discrete problems.
foot_6: We use OpenAI's PPO and TRPO code: https://github.com/openai/baselines.git
foot_7: https://drive.google.com/file/d/1c79TqWn74mHXhLjoTWaBKfKaQOsfD2hg/view



(Algorithm 1, continued) Optimize $f(\theta)$ w.r.t. $\theta$ with mini-batch size $M \le NT$, then update $\theta_{old} \leftarrow \theta$.

3.7 MECHANISM OF POP3D

Note that for the toy example in Section 3.3, $D^{RIGHT}_{pp}(\pi_{\theta_1}(\cdot|s_1), \pi_{\theta_2}(\cdot|s_1)) = D^{RIGHT}_{pp}(\pi_{\theta_1}(\cdot|s_1), \pi_{\theta_3}(\cdot|s_1)) = 0$.

The gradient of the per-step objective in Equation 14 with respect to $\theta$ is

$$\nabla_\theta f(\theta) = \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t - 2\beta\left[\pi_\theta(a_t|s_t) - \pi_{\theta_{old}}(a_t|s_t)\right]\nabla_\theta \pi_\theta(a_t|s_t) = \nabla_\theta \pi_\theta(a_t|s_t)\left[\frac{\hat{A}_t}{\pi_{\theta_{old}}(a_t|s_t)} - 2\beta\,\delta(a_t|s_t)\right],$$

where $\delta(a_t|s_t) := \pi_\theta(a_t|s_t) - \pi_{\theta_{old}}(a_t|s_t)$. Suppose the agent selects $a_t$ at $s_t$ using $\pi_{\theta_{old}}$ and obtains a positive advantage $\hat{A}_t$. If $\pi_\theta(a_t|s_t)$ grows larger than $\pi_{\theta_{old}}(a_t|s_t)$, then $2\beta\delta(a_t|s_t)$ plays a damping role that avoids too greedy a preference for $a_t$ (i.e., too large a probability), which in turn leaves more room for other actions to be explored. Other cases, such as negative $\hat{A}_t$, can be analyzed similarly. The hyper-parameter $\beta$ controls the damping force.

As Table 1 shows, POP3D wins 32 of the 49 Atari games on the final-score metric, followed by PPO with 11, BASELINE with 5, and TRPO with 1. Interestingly, for games where POP3D scores highest, BASELINE scores worse than PPO more often than the other way round, which means that POP3D is not just an approximate version of BASELINE. On the other metric, POP3D wins 20 out of 49 Atari games, roughly matching PPO with 18, followed by BASELINE with 6 and TRPO with 5. If we measure the stability of an algorithm by the score variance across trials, POP3D scores high with good stability across seeds, whereas PPO behaves worse on Kangaroo and UpNDown. Interestingly, BASELINE shows a large variance across seeds on several games such as BattleZone, Freeway, Pitfall, and Seaquest. POP3D thus reveals a better capacity to score high, together with a similarly fast learning ability, in this domain. The detailed metrics for each game are listed in Tables 3 and 4.

Top: the number of games "won" by each algorithm for Atari games. Bottom: the number of games won by each algorithm for Mujoco games. Each experiment is averaged across three seeds.

Mean final scores (last 100 episodes) of PPO and POP3D on Mujoco games after 10M frames. The results are averaged over three trials.

Mean final scores (last 100 episodes) of PPO, POP3D, BASELINE and TRPO on Atari games after 40M frames. The results are averaged on three trials.

Mean scores over all episodes of PPO, POP3D, BASELINE, and TRPO on Atari games after 40M frames. The results are averaged over three trials.

