OPTIMIZING SUCCESS RATE IN REINFORCEMENT LEARNING VIA LOOP PENALTY

Abstract

Current reinforcement learning generally uses the discounted return as its learning objective. However, real-world tasks often demand a high success rate, which can be quite different from optimizing rewards. In this paper, we explicitly formulate the success rate as an undiscounted return with a {0, 1}-binary reward function. Unfortunately, applying traditional Bellman updates to value function learning is problematic for undiscounted returns, and thus unsuitable for optimizing success rate. Our theoretical analysis shows that the values of different states tend to converge to the same value, causing the agent to wander among those states without making actual progress. This in turn reduces learning efficiency and prevents the agent from completing a task in time. To address this issue, we propose a new method that introduces a Loop Penalty (LP) into value function learning to penalize disoriented cycling behaviors in the agent's decision-making. We demonstrate the effectiveness of the proposed LP on three environments (grid-world cliff-walking, Doom first-person navigation, and robot arm control), and compare our method with Q-learning, Monte Carlo, and Proximal Policy Optimization (PPO). Empirically, LP improves the convergence of training and achieves a higher success rate.

1. INTRODUCTION

Reinforcement learning usually adopts the expected discounted return as its objective, and has been applied in many tasks to find the best solution, e.g., finding the shortest path or achieving the highest score (Sutton & Barto, 2018; Mnih et al., 2015; Shao et al., 2018). However, many real-world tasks, such as robot control or autonomous driving, place greater demands on success rate (i.e., the probability that the agent fulfills the task requirements), since failures in these tasks may cause severe damage or consequences. Previous works commonly treat optimizing rewards as equivalent to maximizing success rate (Zhu et al., 2018; Peng et al., 2018; Kalashnikov et al., 2018), but their results can be error-prone when applied to real-world applications. We believe that success rate is different from the expected discounted return, for the following reasons: 1) the expected discounted return commonly provides dense reward signals for transitions within an episode, whereas success or failure is a sparse binary signal obtained only at the end of an episode; 2) the expected discounted return weights results in the immediate future more than potential rewards in the distant future, whereas success or failure involves no such weighting and is concerned only with the overall or final result. Policies with high expected discounted returns are often more demanding in short-term performance than those with high success rates, and optimizing success rate often admits multiple solutions. As a result, policies with high success rates tend to be more reliable and risk-averse, while policies with high expected discounted returns tend to be risk-seeking. Consider the cliff-walking example in Fig. 1, where the objective is to walk from the origin state marked with a triangle to the destination state marked with a circle. In the "Slip" area in light grey, wind blows with probability p_fall = 0.1, pushing the agent uncontrollably downward; the dark grey area in the bottom row denotes the "Cliff". In Fig. 1, the blue trajectory shown on the left is shorter but riskier than the green one shown on the right. Under commonly used hyperparameter settings, such as γ = 0.9, the agent tends to follow the blue trajectory rather than the green one, although the green trajectory has a higher success rate.
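The divergence between the two objectives can be checked numerically. The sketch below simulates a hypothetical miniature version of the cliff-walk (the exact grid layout, path lengths, and number of slip cells are assumptions for illustration, not the paper's figure): the short route crosses several "Slip" cells, each pushing the agent into the cliff with probability p_fall = 0.1, while the long route avoids them entirely; success pays a terminal reward of 1.

```python
import random

P_FALL = 0.1                 # slip probability from the paper's example
GAMMA = 0.9                  # common discount factor cited in the paper
SHORT_LEN, LONG_LEN = 5, 11  # assumed step counts for the two routes
SLIP_STEPS = 3               # assumed number of slip cells on the short route

def rollout(path_len, slip_steps, rng):
    """Run one episode; return (success, discounted_return) with a
    terminal reward of 1 on reaching the goal and 0 on falling."""
    for _ in range(slip_steps):
        if rng.random() < P_FALL:
            return 0, 0.0                 # blown into the cliff
    return 1, GAMMA ** (path_len - 1)     # reward discounted by path length

def evaluate(path_len, slip_steps, episodes=100_000, seed=0):
    rng = random.Random(seed)
    wins, ret = 0, 0.0
    for _ in range(episodes):
        s, g = rollout(path_len, slip_steps, rng)
        wins += s
        ret += g
    return wins / episodes, ret / episodes

short_sr, short_ret = evaluate(SHORT_LEN, SLIP_STEPS)  # risky blue route
long_sr, long_ret = evaluate(LONG_LEN, 0)              # safe green route
```

With these assumed numbers, the short route achieves a success rate of roughly (1 - 0.1)^3 ≈ 0.73 but a higher discounted return (≈ 0.73 · 0.9^4), while the long route succeeds every time yet earns only 0.9^10 ≈ 0.35 in discounted return, so a return-maximizing agent prefers the riskier path.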

