OPTIMIZING SUCCESS RATE IN REINFORCEMENT LEARNING VIA LOOP PENALTY

Abstract

Current reinforcement learning generally uses discounted return as its learning objective. However, real-world tasks often demand a high success rate, which can be quite different from optimizing rewards. In this paper, we explicitly formulate the success rate as an undiscounted return with a {0, 1}-binary reward function. Unfortunately, applying traditional Bellman updates to value function learning is problematic for undiscounted returns, and thus unsuitable for optimizing success rate. Our theoretical analysis shows that values across different states tend to converge to the same value, so the agent wanders among those states without making any actual progress. This reduces learning efficiency and leaves the agent unable to complete the task in time. To combat this issue, we propose a new method that introduces a Loop Penalty (LP) into value function learning to penalize disoriented, cycling behaviors in the agent's decision-making. We demonstrate the effectiveness of LP on three environments, namely grid-world cliff-walking, Doom first-person navigation, and robot arm control, and compare our method with Q-learning, Monte-Carlo and Proximal Policy Optimization (PPO). Empirically, LP improves the convergence of training and achieves a higher success rate.

1. INTRODUCTION

Reinforcement learning usually adopts the expected discounted return as its objective and has been applied to many tasks to find the best solution, e.g., finding the shortest path or achieving the highest score (Sutton & Barto, 2018; Mnih et al., 2015; Shao et al., 2018). However, many real-world tasks, such as robot control or autonomous driving, place greater demands on success rate (i.e., the probability that the agent fulfills the task requirements), since failures in these tasks may cause severe damage or consequences. Previous works commonly treat optimizing rewards as equivalent to maximizing success rate (Zhu et al., 2018; Peng et al., 2018; Kalashnikov et al., 2018), but the resulting policies can be error-prone when applied to real-world applications. We believe that success rate is different from expected discounted return, for the following reasons: 1) expected discounted return commonly relies on dense reward signals for transitions within an episode, while success or failure is a sparse binary signal obtained only at the end of an episode; 2) expected discounted return weights results in the immediate future more heavily than potential rewards in the distant future, whereas success or failure carries no such weighting and concerns only the overall or final result. Policies with high expected discounted returns are often more demanding in short-term performance than those with high success rates, and optimizing success rate often admits multiple solutions. As a result, policies with high success rates tend to be more reliable and risk-averse, while policies with high expected discounted returns tend to be risk-seeking. See the cliff-walking example in Fig. 1, where the objective is to walk from the origin state marked with a triangle to the destination state marked with a circle. In the light-grey "Slip" area, wind blows with probability p_fall = 0.1, making the agent uncontrollably move down; the dark-grey area along the bottom row denotes the "Cliff". In Fig.
1, the blue trajectory shown on the left is shorter but riskier than the green one shown on the right. Under commonly used hyperparameter settings, such as γ = 0.9, the agent tends to follow the blue trajectory rather than the green one, although the green trajectory has a higher success rate. We acknowledge that, for this simple example, optimizing the expected discounted return with a careful choice of γ satisfying (1 - p_fall)^4 < γ^(9-5) can produce a policy with the highest success rate. However, this relies on task-specific knowledge about the environment that is generally unavailable in more complex tasks. These findings lead us to the following question: can we express success rate in a general form so that it can be directly optimized? In this paper, we show that a universal way of representing success rate is to 1) use a {0, 1}-binary reward that indicates whether or not a trajectory is successful, and 2) set γ = 1 so that the binary signal back-propagates without any discount. Unfortunately, this formulation is an undiscounted problem, for which the convergence of value iteration cannot in general be guaranteed (Xu et al., 2018). Nevertheless, we can still explicitly solve the Bellman equation in matrix form for this special undiscounted return (the success rate). We derive that if the transition dynamics of the environment permit the existence of an irreducible ergodic set of states, γ = 1 leads to an undesirable situation: state or state-action values tend to converge to the same value, which we refer to as uniformity. As shown in Fig. 2, which plots the contours of state values in our cliff-walking example, uniformity appears as a plateau in the right panel; it is caused by the absence of discounting and does not arise in the discounted case (left panel). Uniformity makes the selection of actions purposeless within the plateau, resulting in disoriented and time-consuming decision-making and unsatisfactory success rates.
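The trade-off in the cliff-walking example can be made concrete with a back-of-the-envelope calculation. The sketch below is our own illustration: the path lengths of 5 and 9 steps and the four slip cells follow the figure, while the convention of reward 1 at the goal and 0 on failure is an assumption.

```python
P_FALL = 0.1                  # probability of being pushed down in each "Slip" cell
N_SLIP = 4                    # slip cells crossed by the risky (blue) trajectory
LEN_RISKY, LEN_SAFE = 5, 9    # path lengths taken from the figure

def success_prob(n_slip, p_fall=P_FALL):
    """Probability of reaching the goal without ever being blown off the path."""
    return (1.0 - p_fall) ** n_slip

def discounted_return(gamma, length, n_slip):
    """Expected discounted return, assuming reward 1 at the goal and 0 on failure."""
    return success_prob(n_slip) * gamma ** length

# Success rate always prefers the safe (green) path: 0.9^4 ≈ 0.656 vs 1.0.
# The discounted objective, however, flips its preference with gamma:
risky_85 = discounted_return(0.85, LEN_RISKY, N_SLIP)
safe_85 = discounted_return(0.85, LEN_SAFE, 0)
risky_99 = discounted_return(0.99, LEN_RISKY, N_SLIP)
safe_99 = discounted_return(0.99, LEN_SAFE, 0)
```

With γ = 0.85 the risky path has the higher expected return (≈0.291 vs ≈0.232) even though its success rate is only about 65.6%; with γ = 0.99 the ordering flips, consistent with the condition (1 - p_fall)^4 < γ^(9-5) above.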
Based on the above analysis, we introduce the Loop Penalty (LP) into value function learning to penalize disoriented, cycling behaviors in trajectories. We derive that this penalty can be realized by multiplying the original value function by a special mask function. Note that our strategy is general and applicable to many RL algorithms; we provide concrete loss functions for three popular algorithms in this paper: Monte Carlo, Deep Q-learning and Proximal Policy Optimization (Schulman et al., 2017). We verify the effectiveness of LP in three representative environments: grid-world cliff-walking, vision-based robot grasping, and first-person navigation in 3D ViZDoom (Kempka et al., 2016), showing that LP can alleviate the uniformity problem and achieve better performance. Finally, we summarize the major contributions of our paper as follows:
• We formally introduce the objective of "success rate" in reinforcement learning. Our formulation of success rate is general and applicable to many different RL tasks.
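To give a flavor of the idea, the sketch below shows one way a loop penalty could shape a Monte-Carlo-style success-rate target. The exact mask function is derived later in the paper; here the revisit count and the decay factor β are our own simplifying assumptions, not the paper's actual formulation.

```python
def lp_return(trajectory, success, beta=0.9):
    """Illustrative loop-penalized success-rate target for one episode.

    trajectory: sequence of hashable states visited by the agent.
    success: whether the episode ended in success (binary outcome).
    Each revisit of an already-seen state shrinks the target by beta < 1,
    so looping trajectories propagate a weaker success signal than direct ones.
    """
    seen = set()
    revisits = 0
    for state in trajectory:
        if state in seen:
            revisits += 1
        else:
            seen.add(state)
    return (1.0 if success else 0.0) * (beta ** revisits)
```

Under this stand-in, a loop-free successful trajectory keeps its full target of 1, while a successful trajectory that cycles through states receives a strictly smaller target, breaking the tie that uniformity would otherwise create between looping and direct policies.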

2. RELATED WORK

To the best of our knowledge, no existing research adopts success rate directly as the learning objective. The reason is that success rate is usually not the main criterion in tasks studied in RL, e.g., video games and simulated robot control. Although some studies used success rate to evaluate the performance of policies (Andrychowicz et al., 2017; Tobin et al., 2018; Ghosh et al., 2018; Kalashnikov et al., 2018), they relied on task-specific reward design and discounted return during training instead of directly optimizing success rate. The notion of "success" may also be reflected in the constraints considered in the domain of safe RL (García & Fernández, 2015). Geibel & Wysotzki (2005) considered constraints on the agent's behavior and



Figure 1: Cliff-walking example

• We theoretically analyze the difficulty of optimizing success rate and show that the uniformity among state values, and the resulting loops in trajectories, are the key challenges.
• We propose LP, which can be combined with any general RL algorithm. We demonstrate empirically that LP can alleviate the problem of "uniformity" among state values and significantly improve success rates in both discrete and continuous control tasks.
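The uniformity phenomenon described above can be reproduced in a toy setting. The sketch below is our own minimal construction (a deterministic 1-D chain MDP with a {0, 1} terminal reward, not an example from the paper): tabular value iteration with γ = 1 drives every non-terminal state to the same value, the "plateau" of Fig. 2, whereas any γ < 1 restores a gradient toward the goal.

```python
def value_iteration(gamma, n=5, iters=200):
    """Bellman updates on a 1-D chain: states 0..n-1, goal = state n-1 (terminal).

    Actions move one step left or right deterministically; entering the goal
    pays reward 1, everything else pays 0. Returns the converged state values.
    """
    V = [0.0] * n
    for _ in range(iters):
        for s in range(n - 1):  # the terminal goal state is never updated
            best = float("-inf")
            for nxt in (max(s - 1, 0), min(s + 1, n - 1)):
                reward = 1.0 if nxt == n - 1 else 0.0
                # no bootstrapping from the terminal state
                value = reward + gamma * (0.0 if nxt == n - 1 else V[nxt])
                best = max(best, value)
            V[s] = best
    return V
```

With γ = 1, every non-terminal state converges to value 1 (its success probability), so a greedy policy has no signal distinguishing a step toward the goal from a step away from it; with γ = 0.9 the values decay geometrically with distance from the goal, and greedy action selection makes progress.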

