OPTIMIZING SUCCESS RATE IN REINFORCEMENT LEARNING VIA LOOP PENALTY

Abstract

Current reinforcement learning methods generally use discounted return as their learning objective. However, real-world tasks often demand a high success rate, which can differ substantially from optimizing rewards. In this paper, we explicitly formulate the success rate as an undiscounted return with a {0, 1}-binary reward function. Unfortunately, applying traditional Bellman updates to value-function learning is problematic for undiscounted return, and thus unsuitable for optimizing success rate. Our theoretical analysis shows that values across different states tend to converge to the same value, causing the agent to wander among those states without making any actual progress. This reduces learning efficiency and prevents the agent from completing a task in time. To combat this issue, we propose a new method that introduces a Loop Penalty (LP) into value-function learning to penalize disoriented cycling behaviors in the agent's decision-making. We demonstrate the effectiveness of LP on three environments, including grid-world cliff-walking, Doom first-person navigation and robot arm control, and compare our method with Q-learning, Monte Carlo and Proximal Policy Optimization (PPO). Empirically, LP improves training convergence and achieves a higher success rate.

1. INTRODUCTION

Reinforcement learning usually adopts the expected discounted return as its objective, and has been applied in many tasks to find the best solution, e.g. finding the shortest path or achieving the highest score (Sutton & Barto, 2018; Mnih et al., 2015; Shao et al., 2018). However, many real-world tasks, such as robot control or autonomous driving, may care more about the success rate (i.e. the probability that the agent fulfills the task requirements), since failures in these tasks can cause severe damage or consequences. Previous works commonly treat optimizing rewards as equivalent to maximizing success rate (Zhu et al., 2018; Peng et al., 2018; Kalashnikov et al., 2018), but the results can be error-prone when applied to real-world applications. We believe that success rate is different from expected discounted return, for the following reasons: 1) expected discounted return commonly provides dense reward signals for transitions within an episode, while success or failure is a sparse binary signal obtained only at the end of an episode; 2) expected discounted return weights results in the immediate future more than potential rewards in the distant future, whereas success has no such weighting and is concerned only with the overall or final result. Policies with high expected discounted returns are often more demanding in short-term performance than those with high success rates, and optimizing success rate often admits multiple solutions. As a result, policies with high success rates tend to be reliable and risk-averse, while policies with high expected discounted returns tend to be risk-seeking. Consider the cliff-walking example in Fig. 1, where the objective is to walk from the origin state marked with a triangle to the destination state marked with a circle. In the light-grey "Slip" area, wind makes the agent uncontrollably move down with probability p_fall = 0.1; the dark-grey area in the bottom row denotes the "Cliff". In Fig. 1, the blue trajectory shown on the left is shorter but riskier than the green one shown on the right. Under commonly-used hyperparameter settings, such as γ = 0.9, the agent tends to follow the blue trajectory rather than the green one, although the green trajectory has a higher success rate. We acknowledge that, for this simple example, optimizing expected discounted return with a careful choice of γ satisfying (1 - p_fall)^4 < γ^(9-5) can produce the policy with the highest success rate. However, this relies on task-specific knowledge about the environment, generally unavailable in more complex tasks.

These observations lead us to the following question: can we express success rate in a general form so that it can be directly optimized? In this paper, we show that a universal way of representing success rate is to 1) use a {0, 1}-binary reward that indicates whether a trajectory is successful, and 2) set γ = 1 so that the binary signal back-propagates without any discount. Unfortunately, this expression belongs to the class of undiscounted problems, for which the convergence of value iteration often cannot be guaranteed (Xu et al., 2018). Nevertheless, we can still explicitly solve the Bellman equation in matrix form for this special undiscounted return (the success rate). We derive that if the transition dynamics of the environment permit the existence of an irreducible ergodic set of states, γ = 1 leads to an undesirable situation: state or state-action values tend to converge to the same value, which we refer to as uniformity. As shown in Fig. 2, which plots the contours of state values in our cliff-walking example, uniformity appears as a plateau in the right panel; it is caused by the absence of discounting and does not occur in the discounted case (left panel). Uniformity makes the selection of actions purposeless within the plateau, resulting in disoriented and time-consuming behaviors in the agent's decision-making, and unsatisfactory success rates.
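The trade-off above can be sanity-checked numerically. The sketch below assumes an illustrative geometry for Fig. 1 (a 5-step risky path crossing 4 slip cells versus a 9-step safe path crossing none, with reward 1 only at the goal); the path lengths are our assumption, not taken from the paper:

```python
# Numeric sanity check for the Fig. 1 trade-off. Assumed geometry (for
# illustration only): the risky "blue" path takes 5 steps and crosses 4
# slip cells; the safe "green" path takes 9 steps and crosses none.
# Reward is 1 on reaching the goal and 0 otherwise.
p_fall = 0.1

def expected_discounted_return(gamma, length, slip_cells):
    # Success probability times the discount accumulated over the path.
    return (1 - p_fall) ** slip_cells * gamma ** length

def success_rate(slip_cells):
    # gamma plays no role in the success rate.
    return (1 - p_fall) ** slip_cells

for gamma in (0.8, 0.99):
    blue = expected_discounted_return(gamma, length=5, slip_cells=4)
    green = expected_discounted_return(gamma, length=9, slip_cells=0)
    print(f"gamma={gamma}: blue={blue:.3f} green={green:.3f}")

print("success rates:", success_rate(4), success_rate(0))
```

With γ = 0.8 the risky path has the higher expected discounted return; with γ = 0.99 the safe path does. The success rates (≈ 0.656 versus 1.0) do not depend on γ, matching the condition (1 - p_fall)^4 < γ^(9-5) under the assumed path lengths.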
Based on the above analysis, we introduce the Loop Penalty (LP) into value-function learning to penalize disoriented, cycling behaviors in trajectories. We derive that this penalty can be realized by multiplying the original value function by a special mask function. Note that our strategy is general and applicable to many RL algorithms; we provide concrete loss functions for three popular algorithms in this paper: Monte Carlo, deep Q-learning and Proximal Policy Optimization (Schulman et al., 2017). We verify the effectiveness in three representative environments: grid-world cliff-walking, vision-based robot grasping, and first-person navigation in 3D ViZDoom (Kempka et al., 2016), showing that LP can alleviate the uniformity problem and achieve better performance. Finally, we summarize the major contributions of our paper as follows:

• We formally introduce the objective of "success rate" in reinforcement learning. Our formulation of success rate is general and applicable to many different RL tasks.

2. RELATED WORK

To the best of our knowledge, no existing research adopts success rate directly as the learning objective. The reason is that success rate is usually not the main criterion in the tasks investigated in RL, e.g. video games and simulated robot control. Although some studies used success rate to evaluate the performance of policies (Andrychowicz et al., 2017; Tobin et al., 2018; Ghosh et al., 2018; Kalashnikov et al., 2018), they used task-specific reward design and discounted return during training, rather than directly optimizing success rate. The notion of "success" may be reflected in the constraints considered in the domain of safe RL (García & Fernández, 2015). Geibel & Wysotzki (2005) considered constraints on the agent's behavior and discouraged the agent from moving to error states. Geibel (2006) studied constraints on the expected return to ensure acceptable performance. A. & Ghavamzadeh (2013) proposed constraints on the variance of certain measurements to pursue invariant performance. Previous studies have also considered safety in the exploration process (García & Fernández-Rebollo, 2012; Mannucci et al., 2018). Although these studies treated success as an additional constraint in learning, they either simply assumed that the constraint can always be satisfied or penalized constraint violations. The deficiency of expected discounted return as a training objective has been recognized by many studies. Instead of simply optimizing the expected return, Heger (1994) and Tamar et al. (2013) adopted the minimax criterion, which optimizes the worst possible value of the return; by doing so, occasionally small returns are not ignored at test time. Gilbert & Weng (2016) and Chow et al. (2017) extended this idea to arbitrary quantiles of the return. However, none of these studies optimize success rate directly, since they are based on a quantitative measurement of performance and can be unnecessarily sensitive to the worst cases.
In contrast, success rate is based on a binary signal that only distinguishes success from failure. Our work involves the optimization of an undiscounted return. The instability of training towards an undiscounted return has been noted by Schwartz (1993) and Xu et al. (2018). However, most studies on undiscounted return focus on continuing settings and take the average reward as the objective (Schwartz, 1993; Ortner & Ryabko, 2012; Zahavy et al., 2020). There seems to be a general view that the instability of training towards undiscounted return exists only in continuing cases but not in episodic cases (Pitis, 2019). Contrary to this view, we show that training instability also exists in episodic cases. For success-rate optimization, we provide a theoretical analysis of this instability and propose a practical method that alleviates it.

3. SUCCESS RATE IN REINFORCEMENT LEARNING

In this section we provide a formal definition of success rate, explain its relationship with expected discounted sum of rewards, and analyze the problems in optimizing success rate.

3.1. SUCCESS RATE

In RL, given a policy π, the success rate refers to the ratio of successful trajectories to all trajectories. As in the general setting of RL, a trajectory is expressed as τ = {(s_0, a_0, r_0), ..., (s_T, a_T, r_T), s_{T+1}}, rolled out by following policy π, where s_t ∈ S is the state, a_t ∈ A the action, r_t the immediate reward, and T the length of the trajectory. Because the notion of success should only depend on the states visited in a trajectory, we concisely express "success" by defining a set of desired states S_g ⊂ S that denote task completion, e.g. the destination state in our cliff-walking example. At a high level, the goal of the agent is to reach any state in S_g within a given planning horizon T, and the environment terminates either upon arrival at a desired state or upon reaching the maximum allocated timestep T. Without loss of generality, we say that "a trajectory τ is successful" if and only if τ_{-1} ∈ S_g, where τ_{-1} is the last state in τ. Formally, we use an indicator function I(s ∈ S_g) to denote success, where I(·) takes the value 1 when the input statement is true and 0 otherwise. Since this expression is task-independent, our analysis is widely applicable. Accordingly, we formally define the success rate as follows:

Definition 1. The success rate of a given policy π is defined as

β_π(s_0) = Σ_τ p_π(τ | s_0) I(τ_{-1} ∈ S_g),

where p_π(τ | s_0) = Π_{t=0}^{T} π(a_t | s_t) p(s_{t+1} | s_t, a_t) is the probability of observing trajectory τ.

To find a policy that optimizes success rate, we derive a recursive form of policy evaluation similar to the Bellman equation (Sutton & Barto, 2018), as shown in Theorem 1.

Theorem 1. The success rate is a state-value function represented as an expected sum of undiscounted return, with the reward function R(s) defined to take the value 1 if s ∈ S_g and 0 otherwise.

Proof sketch:

We segment trajectories into sub-trajectories: for τ ∈ Γ and k ∈ (0, T], let τ_{0:k} ∈ Γ'. Note that Γ = Γ', because 1) ∀τ ∈ Γ, τ_{0:T} = τ ∈ Γ', so Γ ⊆ Γ'; and 2) every τ_{0:k} is itself a trajectory, so Γ' ⊆ Γ. Then the success rate β_π(s_t) can be rewritten as the sum over reachable states of the probability of reaching s_{t+k} times the indicator I(s_{t+k} ∈ S_g):

β_π(s_t) = Σ_{k=1}^{T-t} Σ_{s_{t+k}} p_π(s_{t+k} | s_t) I(s_{t+k} ∈ S_g),   (2)

where p_π(s_{t+k} | s_t) is the probability of reaching s_{t+k} from s_t. The complete proof is in the appendix. Therefore, we can optimize success rate by adopting the above {0, 1}-binary reward function and an undiscounted form of return. The difficulty is that this formulation falls into the class of undiscounted-return problems and may suffer from training instability (Xu et al., 2018).
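Theorem 1 can be checked empirically. The following sketch uses a toy, assumed 1-D chain (not one of the paper's environments) with a 0.1 per-step failure probability; it estimates both the expected undiscounted return under the {0, 1}-binary reward and the empirical success rate, which coincide:

```python
import random

# Empirical check of Theorem 1 on a toy, assumed 1-D chain: states 0..4,
# S_g = {4}, and each step fails (terminates off the cliff) with
# probability 0.1. The reward is the {0, 1}-indicator of reaching a goal
# state, and gamma = 1.
random.seed(0)

def rollout():
    s, undiscounted_return = 0, 0.0
    for _ in range(10):            # planning horizon T
        if random.random() < 0.1:  # slip: terminal failure, reward 0
            return undiscounted_return
        s += 1
        if s == 4:                 # goal reached: reward 1, terminate
            return undiscounted_return + 1.0
    return undiscounted_return

returns = [rollout() for _ in range(100_000)]
mc_return = sum(returns) / len(returns)                    # E[return], gamma = 1
successes = sum(r == 1.0 for r in returns) / len(returns)  # empirical success rate
print(mc_return, successes)
```

Here the undiscounted binary return of a trajectory is exactly its success indicator, so the Monte-Carlo return estimate equals the empirical success rate (≈ 0.9^4 ≈ 0.656 for this chain).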

3.2. UNIFORMITY IN SUCCESS RATE OPTIMIZATION

In the following part, we will show that γ = 1 can cause uniformity among state values, resulting in possible loops in trajectories, which hurts training stability.

A. The concept of uniformity

First, we define the concept of uniformity. Given a policy π, we say that uniformity arises when the state-value estimates of a set of strongly connected states become the same. Here we say two states are strongly connected if one state is reachable from the other and vice versa, e.g. the first two rows in the grid-world example (Fig. 1 ). Since state value represents the expected sum of available rewards (Sutton & Barto, 2018) , uniformity means that moving in this connected area/region will potentially lead to the same amount of return. This phenomenon can hardly occur with discounted return since the discounting poses a preference for time-efficiency in collecting rewards and penalizes purposeless wandering. However, uniformity may happen when the objective is success rate since efficient trajectories and inefficient ones become indistinguishable.

B. Proof of the existence of uniformity

In this section, we theoretically prove that γ = 1 in the expression of success rate can cause uniformity. Because uniformity is a phenomenon concerning concrete state values, common techniques for analyzing overall performance, such as regret bounds and contraction mappings, do not apply here. Hence, we directly solve the Bellman equation to obtain state values. As for the reward function, we are fortunate that in our case it only takes {0, 1}-binary values, which makes the analysis tractable. As for the optimization process, we analyze state values at convergence by first assuming a policy with uniformity, and then showing that this policy is kept during optimization. For succinctness, we assume S is finite and write the Bellman equation in matrix form:

V = P^π R + γ P^π V,

where V, R ∈ R^{|S|}, P^π ∈ R^{|S|×|S|} and |S| is the cardinality of the state space. Without loss of generality, we place the desired states at the bottom of each vector, so R = [0, ..., 0, 1, ..., 1]^T. We then formalize the notion of "area" as a set of states S_e ⊂ S that is irreducible and ergodic in the Markov process induced by policy π. Assuming π and S_e exist, and placing the states of S_e first in the vectors, the π-conditioned transition probability matrix decomposes as

P^π = [ P^π_ee  O ;  P^π_oe  P^π_oo ],

where P^π_ee is the transition probability matrix for s ∈ S_e. Accordingly, the Bellman equation for s ∈ S_e reads

V_e = P^π_ee R_e + γ P^π_ee V_e = γ P^π_ee V_e.   (4)

Analyzing uniformity requires solving Eq. 4. For γ < 1, the solution is unique, V_e = [0, ..., 0]^T, because P^π_ee is a stochastic matrix and (I - γ P^π_ee) must be non-singular; the value 0 then drives the agent to leave S_e in future policy updates. However, when γ = 1 there are infinitely many solutions, as established in the following theorem.

Theorem 2. For γ = 1, if S_e exists, the solution space of Eq. 4 is {V_e = m · [1, ..., 1]^T | m ∈ R}.

Proof: Because the states in S_e are ergodic, for any start distributions u_1^T and u_2^T over S_e we have

u_1^T lim_{i→∞} (P^π_ee)^i = u_2^T lim_{i→∞} (P^π_ee)^i.

Thus lim_{i→∞} (P^π_ee)^i must have identical rows:

lim_{i→∞} (P^π_ee)^i = [ x_1 x_2 ... x_{|S_e|} ; x_1 x_2 ... x_{|S_e|} ; ... ; x_1 x_2 ... x_{|S_e|} ].

Note that all the elements are non-zero because S_e is irreducible. Hence, for the equation V_e = (lim_{i→∞} (P^π_ee)^i) V_e, the solutions are m · [1, ..., 1]^T, m ∈ R. Because P^π_ee is a stochastic matrix, these solutions also satisfy Eq. 4. Conversely, any solution of Eq. 4 also satisfies V_e = (lim_{i→∞} (P^π_ee)^i) V_e, so the two solution spaces coincide. Therefore, the solution space of Eq. 4 is {V_e = m · [1, ..., 1]^T | m ∈ R}, which completes the proof.

This theorem demonstrates that when evaluating a policy in terms of success rate, the converged values of the states in S_e are identical and may take an arbitrary value. This proves the existence of uniformity among state values. We now argue that a policy π producing S_e can exist and can be kept by the agent during policy optimization. (1) As for S_e: it is common in RL environments that a set of two or more states are reachable from each other without randomness. If the policy is initialized (or disturbed by random sampling during learning) to stay within this set of states, it yields S_e. Note that the desired states are not in S_e because they are absorbing and cannot reach other states. This ensures that R_e = [0, ..., 0]^T, which validates Eq. 4. (2) As for the agent keeping π during policy optimization: we check whether the state values satisfy the Bellman optimality equation. We have derived that any m may be the value of a state in S_e. If the value m is larger than the values of the states reachable from S_e (e.g. due to the initialization of the value function), then the update target for the values of states in S_e remains m. This means that m satisfies the Bellman optimality equation at states in S_e, and that the policy on S_e is kept during policy updates. So far, we have proved that the objective of success rate can cause uniformity in state values.
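Theorem 2 can be illustrated numerically. The sketch below uses an assumed 3-state irreducible ergodic transition matrix P standing in for P^π_ee, and checks that with γ = 1 every constant vector is a fixed point of V_e = P V_e, while with γ < 1 the only solution is the zero vector:

```python
import numpy as np

# Numeric illustration of Theorem 2 on an assumed 3-state irreducible
# ergodic set S_e; P stands in for the block P^pi_ee (rows sum to 1, and
# the reward inside S_e is zero).
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])
ones = np.ones(3)

# gamma = 1: every constant vector m * [1, 1, 1]^T solves V_e = P V_e.
for m in (-2.0, 0.0, 5.0):
    assert np.allclose(P @ (m * ones), m * ones)

# Ergodicity: lim_i P^i has identical rows, so the fixed points of
# V_e = (lim_i P^i) V_e are exactly the constant vectors.
P_inf = np.linalg.matrix_power(P, 50)
assert np.allclose(P_inf, P_inf[0])

# gamma < 1: (I - gamma * P) is non-singular, so V_e = gamma * P V_e has
# only the zero solution, and the agent is driven to leave S_e.
gamma = 0.9
V = np.linalg.solve(np.eye(3) - gamma * P, np.zeros(3))
print(V)
```

The uniform fixed points at γ = 1 are exactly the "plateau" values m of Theorem 2; the γ < 1 case recovers the unique zero solution discussed above.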

C. Problems caused by uniformity

In RL, the agent selects actions based on the evaluation of future returns. When uniformity happens, the evaluations of different actions become the same, so the agent can only make random selections. This leads to disoriented, time-consuming and meaningless behaviors and an unsatisfactory success rate. In practice, because of disturbances due to random exploration, there may be slight differences between state values. Although this makes action selection deterministic, it may result in undesirable policies, which causes instability in training. Fig. 3 shows a numeric example. We adopt Q-learning and illustrate the trained Q-values and the preferred actions in (a) and (b), respectively. The Q-values are almost the same in the upper grids, and there are several potential loops in the agent's trajectory. If the agent enters a loop, it will keep repeating the loop and fail to reach the target.

4. METHOD: LOOP PENALTY

So far we have shown the problems in optimizing success rate. As for the solution, our insight is to suppress the generation of "loops" in order to penalize disoriented, cycling behaviors in the agent's decision-making. In this section, we derive the cost function for minimizing the probability of loops, which introduces the Loop Penalty (LP) into value-function learning. We then introduce a practical algorithm that implements this framework for reinforcement learning problems.

4.1. LOOP PENALTY

Our idea is that the agent should not only maximize the success rate but also minimize the probability of "loops". This is formalized as follows:

π* = argmax_π p_π(τ^{no-loop}_{-1} ∈ S_g),   (6)

where τ^{no-loop} denotes a trajectory without loops, i.e. one in which the agent never revisits a previously visited state. We now derive the recursive state-value function β^{loop-penalty}_π(s_t) with our loop penalty for the optimization of Eq. 6.

Theorem 3. The state-value function for Eq. 6 is

β^{loop-penalty}_π(s_t) = E_{τ∼π}[ I(s_{t+1} ∈ S_g) φ(s_t) + β^{loop-penalty}_π(s_{t+1}) ],

where φ(s_t) := I(s_i ≠ s_j, ∀ 0 ≤ i < t < j ≤ T) is an indicator that judges whether a loop passes through s_t in the trajectory τ.
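The mask φ(s_t) admits a direct implementation: it is 0 exactly at the steps strictly inside a loop, i.e. whenever some state visited before step t reappears after step t. A minimal O(T^2) sketch (`loop_mask` is an illustrative name, not from the paper):

```python
def loop_mask(states):
    """phi(s_t) for each step t of a trajectory: 1.0 if no state visited
    before step t reappears after step t (no loop through t), else 0.0."""
    T = len(states)
    phi = []
    for t in range(T):
        before = set(states[:t])
        no_loop = all(states[j] not in before for j in range(t + 1, T))
        phi.append(1.0 if no_loop else 0.0)
    return phi

# The trajectory below revisits state "B" (steps 1 and 3), so phi is 0 at
# step 2, the only step strictly inside the loop.
print(loop_mask(["A", "B", "C", "B", "D"]))  # [1.0, 1.0, 0.0, 1.0, 1.0]
```

Multiplying value targets by this mask zeroes the contribution of exactly the loop-interior steps, which is how LP enters the losses of Sec. 4.2.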

Proof sketch:

The key idea is to convert Eq. 6 into a sum over trajectories of products of p_π(τ) with I(s ∈ S_g) I(τ is loop-free), where I(τ is loop-free) judges whether τ contains no loop. In addition, we write the probability of reaching a state s as ρ_π(s) = P_π(s_0 = s) + P_π(s_1 = s, s_0 ≠ s) + · · · and have:

p_π(τ^{no-loop}_{-1} ∈ S_g) = Σ_{s_i} ρ_π(s_i) Σ_{t=i+1}^{T} I(s_t ∈ S_g) I(s_j ≠ s_i, ∀ i+1 < j < T).

We postpone the complete proof to the appendix. We have thus derived that reducing the probability of loops can be achieved during sampling by multiplying by φ(s_t) according to the success signal of each collected trajectory, which yields an online policy-evaluation method for state values.

4.2. ALGORITHM

In this subsection, we design three implementations by substituting the state-value function with LP into the loss functions of three commonly used RL algorithms: Monte Carlo (MC), Q-learning (QL), and Proximal Policy Optimization (PPO) (Schulman et al., 2017). As discussed above, LP takes the form of multiplying the original value target by φ(s_t), as shown in Fig. 4. Note that the indicator φ(s_t) can be implemented with many well-known methods for measuring state similarity, such as GANs or VAEs (Yu et al., 2019; Chen et al., 2016; Pathak et al., 2017). To that end, we derive three adjusted loss functions, MC with Loop Penalty (MC-LP), QL with Loop Penalty (QL-LP), and PPO with Loop Penalty (PPO-LP), as follows:

L_MC-LP(π_{Q,ε}, s_t) ∝ E_{τ∼π_{Q,ε}} [ ( (Σ_{k=t+1}^{T} γ^k r_k) φ(s_t) - Q(s_t, a_t) )^2 ], with γ = 1,

L_QL-LP(π_Q, s_t) ∝ E_{τ∼π_Q} [ ( (r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})) φ(s_t) - Q(s_t, a_t) )^2 ], with γ = 1,

L_PPO-LP(π, s_t) ∝ -E_{τ∼π_old} [ min( A(s_t, a) φ(s_t) π_k(a|s_t)/π_{k,old}(a|s_t), A(s_t, a) φ(s_t) clip{π_k(a|s_t)/π_{k,old}(a|s_t)} ) ],

where ε is the exploration rate of MC, A(s_t, a) is the advantage function and clip{·} is the clipping function. Note that these algorithms all adopt online evaluation of the value functions, because the probability of loops depends on the current policy. We choose QL-LP as a representative to present our algorithm (Alg. 1). The agent stores the state transitions collected in an episode in an online buffer D and learns from it at the end of the episode. The loss function of QL-LP takes the product of (r_t + max_a Q(s_{t+1}, a)) and φ(s_t) as the target Q-value y_t in our algorithm.
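As an illustration of how φ(s_t) enters these losses, the sketch below computes MC-LP regression targets for one trajectory, under the assumptions of Sec. 3 (γ = 1 and a binary reward granted only at a successful terminal step, so the return from every step equals the success indicator); `mc_lp_targets` is an illustrative name:

```python
def mc_lp_targets(states, success):
    """MC-LP targets for one trajectory, assuming gamma = 1 and a {0, 1}
    reward granted only at a successful terminal step, so the undiscounted
    return from every step equals the success indicator:
    target_t = success * phi(s_t)."""
    T = len(states)
    targets = []
    for t in range(T):
        before = set(states[:t])
        phi = float(all(states[j] not in before for j in range(t + 1, T)))
        targets.append(float(success) * phi)
    return targets

# A successful but loopy trajectory: the step inside the A-loop gets
# target 0, so the regression no longer reinforces cycling through it.
print(mc_lp_targets(["A", "B", "A", "G"], success=True))  # [1.0, 0.0, 1.0, 1.0]
```

Without the mask, every step of this trajectory would receive target 1, including the wasteful detour; the mask is what breaks the uniformity.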

5. EMPIRICAL RESULTS

In this section we aim to answer the following three questions: 1) Does LP alleviate the uniformity of state values in success-rate optimization? 2) Does LP achieve better performance in terms of success rate, and moreover come close to the highest possible success rate? 3) What is the difference between a policy with a high success rate and one with a high expected return?

5.1. TASK DESIGN

We design three environments to exhibit the problem and examine the effectiveness of our algorithm: 1) the aforementioned cliff-walking grid-world, to show how our algorithm works in detail; 2) a 3D first-person navigation task based on ViZDoom (Kempka et al., 2016), to examine whether LP is suitable for complex tasks; and 3) a robot (Kinova Jaco2) grasping task built with CoppeliaSim (formerly V-REP), to examine practicality. In each task we construct a dangerous area in which the agent fails with a certain probability: 1) a windy area in the grid-world that makes the agent uncontrollably move down with probability p_fall = 0.1 and fall off the cliff; 2) an area in the ViZDoom environment with a monster shooting at the agent, where the probability of failure depends on the monster's behavior and the agent's random initial health; and 3) a noisy area in the robot grasping task in which the arm is disturbed with probability 0.2 and may collide with the obstacle. These environments are illustrated in Fig. 5(a, b, c). The ViZDoom and robot grasping tasks provide only visual inputs for decision-making. To show that our method is compatible with different RL algorithms, we use three RL algorithms across the three experiments: 1) QL and QL-LP in the grid-world; 2) MC and MC-LP in ViZDoom; and 3) PPO and PPO-LP in robot grasping. Other details are included in the appendix.

5.2. RESULTS ON CONVERGENCE AND SUCCESS RATE

First, we focus on the first question, i.e. whether our method alleviates the convergence problem of success-rate optimization. To reflect convergence, we plot the change of success rate during training in Fig. 6, obtained by testing the policy ten times at the end of each training episode to calculate the success rate. It shows that there is high variance when using MC (γ = 1.0) and QL (γ = 1.0) to optimize success rate, while our methods (marked LP) converge stably to a high success rate. These results indicate that: 1) the difficulty of convergence exists when optimizing success rate, and 2) LP can stably optimize success rate. We then turn to the second question, i.e. whether our method achieves better performance than optimizing the expected discounted return, and furthermore whether its success rate can be close to 1. We test the model 1000 times at the end of training and calculate the success rate, as shown in Table 1. In our experiments, PPO-LP with γ = 1.0 (success rate 0.987) clearly outperforms PPO optimized with expected discounted return at γ = 0.7 (0.761) and at γ = 1.0 (0.109), and is moreover close to 1. These results show that: 1) optimizing expected discounted return does not achieve the highest success rate in our experiments, and 2) the success rate of our method can approach the highest possible.

5.3. VISUALIZATION OF STATE VALUES AND POLICIES

Lastly, we focus on our third question: what are the characteristics of policies trained with our method? We visualize the state values and the policy of our method in the grid-world task. As shown in Fig. 5(d), there is no uniformity in the state values, and the trajectory bypasses the dangerous area. We then visualize our policies and the policies obtained by optimizing expected discounted return in ViZDoom and robot grasping, as shown in Fig. 5(e, f). They show that policies trained by maximizing success rate with LP tend to be reliable and risk-averse, whereas policies trained by maximizing expected discounted return tend to be risk-seeking.

6. DISCUSSION

This paper formally introduces the objective of success rate, analyzes the uniformity problem in directly optimizing success rate in RL, and proposes LP to alleviate it. As a potential impact, we think the discovery of the relationship between success rate and expected undiscounted return may imply that expected undiscounted return has some useful properties. As for future work, we hope to investigate different methods for measuring state similarity to improve the efficiency of LP. In addition, we think it is also beneficial to develop methods that alleviate the sparse-reward problem in optimizing success rate.

A APPENDIX

A.1 PROOF OF THEOREM 1

In this section, we complete the proof of Theorem 1.

Theorem 1. The success rate is a state-value function represented as an expected sum of undiscounted return, with the reward function R(s) defined to take the value 1 if s ∈ S_g and 0 otherwise.

Proof: We segment trajectories into sub-trajectories: for τ ∈ Γ and k ∈ (0, T], let τ_{0:k} ∈ Γ'. Note that Γ = Γ', because 1) ∀τ ∈ Γ, τ_{0:T} = τ ∈ Γ', so Γ ⊆ Γ'; and 2) every τ_{0:k} is itself a trajectory, so Γ' ⊆ Γ. Then the success rate β_π(s_t) can be rewritten as the sum over reachable states of the probability of reaching s_{t+k} times the indicator I(s_{t+k} ∈ S_g):

β_π(s_t) = Σ_{k=1}^{T-t} Σ_{s_{t+k}} p_π(s_{t+k} | s_t) I(s_{t+k} ∈ S_g),   (12)

where p_π(s_{t+k} | s_t) is the probability of reaching s_{t+k} from s_t. Substituting p_π(s_{t+k} | s_t) = Π_{t'=t}^{t+k-1} π(a_{t'} | s_{t'}) p(s_{t'+1} | s_{t'}, a_{t'}) into β_π(s_t) gives the recursion

β_π(s_t) = Σ_{a_t} π(a_t | s_t) Σ_{s_{t+1}} p(s_{t+1} | s_t, a_t) [ I(s_{t+1} ∈ S_g) + β_π(s_{t+1}) ].

Comparing β_π(s_t) with the state-value recursion V_π(s_t) = Σ_{a_t} π(a_t | s_t) Σ_{s_{t+1}} p(s_{t+1} | s_t, a_t) [ r_{t+1} + V_π(s_{t+1}) ], we find that the success rate is an undiscounted return with the {0, 1}-binary reward I(s_t ∈ S_g), which completes the proof.

A.2 PROOF OF THEOREM 3

In this section, we complete the proof of Theorem 3.

Theorem 3. The state-value function for Eq. 6 is β^{loop-penalty}_π(s_t) = E_{τ∼π}[ I(s_{t+1} ∈ S_g) φ(s_t) + β^{loop-penalty}_π(s_{t+1}) ], where φ(s_t) := I(s_i ≠ s_j, ∀ 0 ≤ i < t < j ≤ T) is an indicator that judges whether a loop passes through s_t in the trajectory τ.

Proof: The key idea is to convert Eq. 6 into a sum over trajectories of products of p_π(τ) with I(s ∈ S_g) I(τ is loop-free), where I(τ is loop-free) judges whether τ contains no loop. In addition, we write the probability of reaching a state s as ρ_π(s) = P_π(s_0 = s) + P_π(s_1 = s, s_0 ≠ s) + · · · and have:

p_π(τ^{no-loop}_{-1} ∈ S_g)
= Σ_{s_i} ρ_π(s_i) Σ_{t=i+1}^{T} I(s_t ∈ S_g) I(s_j ≠ s_i, ∀ i+1 < j < T)
= Σ_{s_i} ρ_π(s_i) Σ_{a_{i+1}} π(a_{i+1} | s_{i+1}) Σ_{s_{i+2}} p(s_{i+2} | s_{i+1}, a_{i+1}) [ I(s_{i+2} ∈ S_g) I(s_j ≠ s_i, ∀ i < j < T) + Σ_{a_{i+2}} π(a_{i+2} | s_{i+2}) Σ_{s_{i+3}} p(s_{i+3} | s_{i+2}, a_{i+2}) Σ_{t=i+3}^{T} I(s_t ∈ S_g) I(s_j ≠ s_i, ∀ i+1 < j < T) ]
= Σ_{s_i} ρ_π(s_i) { E_{τ∼π}[ I(s_{i+1} ∈ S_g) I(s_j ≠ s_i, ∀ i+1 < j < T) ] + E_{τ∼π}[ Σ_{t=i+2}^{T} I(s_t ∈ S_g) I(s_j ≠ s_i, ∀ i+2 < j < T) ] }
= E_{τ∼π}[ I(s_t ∈ S_g) I(s_j ≠ s_i, ∀ 0 ≤ i < t < j ≤ T) ] + E_{τ∼π}[ Σ_{k=t+1}^{T} I(s_k ∈ S_g) I(s_j ≠ s_i, ∀ 0 ≤ i < k < j ≤ T) ].

Letting φ(s_t) = I(s_j ≠ s_i, ∀ 0 ≤ i < t < j ≤ T), we have

p_π(τ^{no-loop}_{-1} ∈ S_g) = E_{τ∼π}[ I(s_{t+1} ∈ S_g) φ(s_t) ] + E_{τ∼π}[ Σ_{k=t+2}^{T} I(s_k ∈ S_g) φ(s_{k-1}) ].

Considering β^{loop-penalty}_π(s_t) = E_{τ∼π}[ Σ_{k=t}^{T} I(s_{k+1} ∈ S_g) φ(s_k) ], β^{loop-penalty}_π(s_t) can also be written in the recursive form

β^{loop-penalty}_π(s_t) = E_{τ∼π}[ I(s_{t+1} ∈ S_g) φ(s_t) + β^{loop-penalty}_π(s_{t+1}) ],

which completes the proof.

We set the same hyper-parameters for our algorithms and the baselines; the learning rates and exploration rates are listed in Table 2 and Table 3.



Figure 1: Cliff-walking example

Figure 3: Numeric example of uniformity and loop



Figure 5: Illustration of environments, value functions and policies

Figure 7: Schematics of the networks: PPO (left) and MC (right)

The exploration rate is expressed as the coefficient of the policy entropy. The gradient-clipping parameter of PPO, clip, is set to 0.1. We implement φ(s_t) with environmental information (position information in ViZDoom and the arm pose in robot grasping) as signals during training, but do not use this information at test time. The models used in the grid-world tasks are tables with the same shape as the environment; those in the ViZDoom and robot tasks are neural networks. The network architectures are shown in Fig. 7 and their inputs in Fig. 8.

A.4 NUMERICAL RESULTS ON STATE-ACTION VALUES

We show the state-action values of MC-LP (γ = 1.0), QL (γ = 0.6) and QL (γ = 1.0) after training in Fig. 9. The state-action values of QL (γ = 1.0) show the phenomenon of uniformity. Those of QL (γ = 0.6) show no uniformity but tend to be risk-seeking. Those of MC-LP (γ = 1.0) show no uniformity and are conservative, with a high success rate.

Figure 8: Exhibition of the inputs of networks


Figure 9: Illustration of state-action values

• We theoretically analyze the difficulty in optimizing success rate and show that uniformity among state values, and the resulting loops in trajectories, are the key challenges.

• We propose LP, which can be combined with general RL algorithms. We demonstrate empirically that LP can alleviate the problem of "uniformity" among state values and significantly improve success rates in both discrete and continuous control tasks.

Algorithm 1 Loop-Penalty Q-Learning
Initialize: action-value function Q
for episode = 1, M do
    Initialize episode buffer D
    for t = 1, T do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)
        Execute action a_t in the emulator; get and store transition (s_t, a_t, r_t, s_{t+1}) in D
    end for
    for each transition (s_t, a_t, r_t, s_{t+1}) in D do
        Initialize the loop marker φ_t ← 1
        for each {i, j | 0 ≤ i < t < j ≤ T} do
            φ_t ← φ_t · I(s_i ≠ s_j)
        end for
        Set y_t = r_t for terminal s_{t+1}, y_t = (r_t + max_a Q(s_{t+1}, a)) · φ_t for non-terminal s_{t+1}
        Perform a gradient descent step on (y_t - Q(s_t, a_t))^2
    end for
end for
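A minimal tabular sketch of Algorithm 1 is given below. The environment (a 4-state chain with the goal at one end) and all hyper-parameters are assumptions chosen for illustration, not the paper's settings:

```python
import random
from collections import defaultdict

# Tabular sketch of Algorithm 1 (QL-LP) on an assumed toy environment: a
# 4-state chain 0-1-2-3 with the goal at state 3, reward 1 on reaching it,
# undiscounted (gamma = 1). Hyper-parameters are illustrative only.
random.seed(1)
N_STATES, GOAL, T_MAX = 4, 3, 20
ACTIONS = (-1, +1)  # move left / right, clipped at the chain ends

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = defaultdict(float)
eps, alpha = 0.2, 0.5

for episode in range(500):
    # Roll out one episode into the online buffer D (epsilon-greedy).
    D, s = [], 0
    for t in range(T_MAX):
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2, r, done = step(s, a)
        D.append((s, a, r, s2, done))
        s = s2
        if done:
            break
    # Loop-penalty pass: phi_t = 0 iff a state visited before step t
    # reappears after step t in the full state sequence.
    states = [tr[0] for tr in D] + [D[-1][3]]
    for t, (st, at, rt, st1, done) in enumerate(D):
        before = set(states[:t])
        phi = float(all(states[j] not in before
                        for j in range(t + 1, len(states))))
        if done:
            y = rt  # terminal target
        else:
            y = (rt + max(Q[(st1, a_)] for a_ in ACTIONS)) * phi
        Q[(st, at)] += alpha * (y - Q[(st, at)])  # step on (y - Q)^2

# Any left move immediately revisits a state, so phi zeroes its targets;
# the greedy policy at the interior states points toward the goal.
policy = {s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in (1, 2)}
print(policy)
```

Without the φ factor (γ = 1, binary reward) all Q-values on this chain would drift toward the same value, reproducing the uniformity of Sec. 3.2; with it, the left-action values at the interior states stay at zero while the right-action values are reinforced.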

Table 1: Success rate in robot grasping


Table 2: Learning rates

Table 3: Exploration rates

