SHORTEST-PATH CONSTRAINED REINFORCEMENT LEARNING FOR SPARSE REWARD TASKS

Abstract

We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the agent's trajectory that improves sample efficiency in sparse-reward MDPs. We show that any optimal policy necessarily satisfies the k-SP constraint. Notably, the k-SP constraint prevents the policy from exploring state-action pairs along non-k-SP trajectories (e.g., going back and forth). However, in practice, excluding state-action pairs may hinder the convergence of RL algorithms. To overcome this, we propose a novel cost function that penalizes violations of the SP constraint, instead of completely excluding the violating policies. Our numerical experiments in a tabular RL setting demonstrate that the SP constraint can significantly reduce the trajectory space of the policy. As a result, our constraint enables more sample-efficient learning by suppressing redundant exploration and exploitation. Our experiments on MiniGrid and DeepMind Lab show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods including count-based exploration, indicating that it improves sample efficiency by preventing the agent from taking redundant actions.

1. INTRODUCTION

Recently, deep reinforcement learning (RL) has achieved a number of breakthroughs in many domains including video games (Mnih et al., 2015; Vinyals et al., 2019) and board games (Silver et al., 2017). Nonetheless, a central challenge in RL is sample efficiency (Kakade et al., 2003); it has been shown that RL algorithms require a large number of samples for successful learning in MDPs with large state and action spaces. Moreover, the success of an RL algorithm heavily hinges on the quality of collected samples; the algorithm tends to fail if the collected trajectories do not contain enough evaluative feedback (e.g., sparse or delayed reward). To circumvent this challenge, planning-based methods utilize a model of the environment to improve or create a policy instead of interacting with the environment. Recently, combining planning with an efficient path-search algorithm, such as Monte-Carlo tree search (MCTS) (Norvig, 2002; Coulom, 2006), has demonstrated successful results (Guo et al., 2016; Vodopivec et al., 2017; Silver et al., 2017). However, such tree-search methods require an accurate model of the MDP, and the complexity of planning may grow intractably large for complex domains. Model-based RL methods attempt to learn a model instead of assuming one is given, but learning an accurate model also requires a large number of samples, which is often even harder than solving the given task. Model-free RL methods can learn solely from the environment reward, without the need for a (learned) model. However, both value-based and policy-based methods suffer from poor sample efficiency, especially in sparse-reward tasks. To tackle sparse-reward problems, researchers have proposed to learn an intrinsic bonus function that measures the novelty of the states the agent visits (Schmidhuber, 1991; Oudeyer & Kaplan, 2009; Pathak et al., 2017; Savinov et al., 2018b; Choi et al., 2018; Burda et al., 2018).
However, when such an intrinsic bonus is added to the reward, careful balancing between the environment reward and the bonus, as well as scheduling of the bonus scale, is often required to guarantee convergence to the optimal solution. To tackle the aforementioned challenge of sample efficiency in sparse-reward tasks, we introduce a constrained-RL framework that improves the sample efficiency of any model-free RL algorithm in sparse-reward tasks, under mild assumptions on the MDP (see Appendix G). Of note, though our framework is formulated for policy-based methods, our final form of the cost function (Eq. (10) in Section 4) is applicable to both policy-based and value-based methods. We propose a novel k-shortest-path (k-SP) constraint (Definition 7) that improves the sample efficiency of policy learning (see Figure 1). The k-SP constraint is applied to a trajectory rolled out by a policy: every sub-path of length k is required to be a shortest path under the π-distance metric, which we define in Section 3.1. We prove that applying our constraint preserves optimality for any MDP (Theorem 3), except for stochastic and multi-goal MDPs, which require additional assumptions.
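To make the constraint concrete, here is a minimal tabular sketch that checks the k-SP condition on a rolled-out state sequence. It uses plain BFS step-count distance on a known deterministic transition graph as a stand-in for the π-distance of Section 3.1; the function names (`bfs_dists`, `satisfies_k_sp`) and the toy chain MDP are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def bfs_dists(adj, src):
    """Step-count shortest-path distances from src over a directed graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def satisfies_k_sp(states, adj, k):
    """True iff every length-k sub-path of the state sequence is a shortest path."""
    for i in range(len(states) - k):
        src, dst = states[i], states[i + k]
        if bfs_dists(adj, src).get(dst, float("inf")) < k:
            return False  # a shorter route from src to dst exists
    return True

# Toy 4-state chain 0-1-2-3 (deterministic, illustrative).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert satisfies_k_sp([0, 1, 2, 3], adj, k=2)            # direct route: admissible
assert not satisfies_k_sp([0, 1, 0, 1, 2, 3], adj, k=2)  # going back and forth: violation
```

As in Figure 1, the back-and-forth rollout is rejected because its first length-2 sub-path returns to the start state, which is trivially reachable in fewer than 2 steps.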
We relax the hard constraint into a soft cost formulation (Tessler et al., 2019), and use a reachability network (RNet) (Savinov et al., 2018b) to efficiently learn the cost function in an off-policy manner. We summarize our contributions as follows: (1) We propose a novel constraint that can improve the sample efficiency of any model-free RL method in sparse-reward tasks. (2) We present several theoretical results, including a proof that our proposed constraint preserves the optimal policy of the given MDP. (3) We present a numerical result in a tabular RL setting to precisely evaluate the effectiveness of the proposed method. (4) We propose a practical way to implement the proposed constraint, and demonstrate that it provides a significant improvement on two complex deep RL domains. (5) We demonstrate that our method significantly improves the sample efficiency of PPO, and outperforms existing novelty-seeking methods on two complex domains in the sparse-reward setting.
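The soft-cost relaxation can be sketched as follows, assuming a learned reachability predictor `rnet(s, s_goal, n)` that returns the probability that `s_goal` is reachable from `s` within `n` steps. The name, signature, threshold, and penalty weight are all assumptions for illustration; the paper's RNet and cost (Eq. (10)) are defined in Section 4.

```python
def sp_cost(rnet, s_start, s_end, k, thresh=0.5):
    """Cost 1.0 if the length-k sub-path (s_start -> s_end) is judged NOT to be a
    shortest path, i.e. s_end looks reachable from s_start in fewer than k steps."""
    return 1.0 if rnet(s_start, s_end, k - 1) > thresh else 0.0

def shaped_reward(r, cost, lam=0.1):
    """Soft relaxation: penalize SP-constraint violations rather than excluding
    the violating trajectories outright (lam is a hypothetical weight)."""
    return r - lam * cost

# Dummy reachability predictor on a 1-D chain: state g is reachable from state s
# within n steps iff |s - g| <= n (purely illustrative stand-in for a trained RNet).
dummy_rnet = lambda s, g, n: 1.0 if abs(s - g) <= n else 0.0

assert sp_cost(dummy_rnet, 0, 3, k=3) == 0.0  # exactly 3 steps away: sub-path is shortest
assert sp_cost(dummy_rnet, 0, 1, k=3) == 1.0  # reachable in < 3 steps: violation
```

Because the penalty enters through the reward, the same shaping applies unchanged to both policy-based and value-based learners.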

2. PRELIMINARIES

Markov Decision Process (MDP). We model a task as an MDP tuple $M = (S, A, P, R, \rho, \gamma)$, where $S$ is a state set, $A$ is an action set, $P$ is a transition probability, $R$ is a reward function, $\rho$ is an initial state distribution, and $\gamma \in [0, 1)$ is a discount factor. For each state $s$, the value of a policy $\pi$ is denoted by $V^\pi(s) = \mathbb{E}_\pi\left[\sum_t \gamma^t r_t \mid s_0 = s\right]$. Then, the goal is to find the optimal policy $\pi^*$ that maximizes the expected return: $\pi^* = \arg\max_\pi \mathbb{E}_{s\sim\rho}\,\mathbb{E}_\pi\left[\sum_t \gamma^t r_t \mid s_0 = s\right] = \arg\max_\pi \mathbb{E}_{s\sim\rho}\left[V^\pi(s)\right]$.

Constrained MDP. A constrained Markov Decision Process (CMDP) is an MDP with extra constraints that restrict the domain of allowed policies (Altman, 1999). Specifically, a CMDP introduces a constraint function $C(\pi)$ that maps a policy to a scalar, and a threshold $\alpha \in \mathbb{R}$. The objective of a CMDP is to maximize the expected return $R(\tau) = \sum_t \gamma^t r_t$ of a trajectory $\tau = \{s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots\}$ subject to a constraint: $\pi^* = \arg\max_\pi \mathbb{E}_{\tau\sim\pi}\left[R(\tau)\right]$, s.t. $C(\pi) \leq \alpha$. A popular choice of constraint is based on a transition cost function (Tessler et al., 2019) $c(s, a, r, s') \in \mathbb{R}$, which assigns a scalar-valued cost to each transition. The constraint function for a policy $\pi$ is then defined as the discounted sum of the cost under the policy: $C(\pi) = \mathbb{E}_{\tau\sim\pi}\left[\sum_t \gamma^t c(s_t, a_t, r_{t+1}, s_{t+1})\right]$. In this work, we propose a shortest-path constraint that provably preserves the optimal policy of the original unconstrained MDP while reducing the trajectory space. We use a cost-function-based formulation to implement our constraint (see Sections 3 and 4).
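The constraint function $C(\pi)$ above can be estimated by Monte Carlo from sampled rollouts; a minimal sketch (the helper names and the sample trajectories are illustrative):

```python
def discounted_return(values, gamma=0.99):
    """Backward recursion g_t = v_t + gamma * g_{t+1} for sum_t gamma^t v_t."""
    g = 0.0
    for v in reversed(values):
        g = v + gamma * g
    return g

def constraint_value(cost_trajs, gamma=0.99):
    """Monte-Carlo estimate of C(pi) = E_{tau~pi}[ sum_t gamma^t c_t ]
    from a list of per-trajectory cost sequences."""
    return sum(discounted_return(c, gamma) for c in cost_trajs) / len(cost_trajs)

# Two sampled cost sequences with gamma = 0.5 (illustrative numbers).
assert discounted_return([1, 0, 0], gamma=0.5) == 1.0   # cost at t=0 is undiscounted
assert discounted_return([0, 0, 1], gamma=0.5) == 0.25  # cost at t=2 scaled by 0.5^2
assert constraint_value([[1, 0], [0, 1]], gamma=0.5) == 0.75
```

In a CMDP solver, this estimate would be compared against the threshold $\alpha$ (e.g., inside a Lagrangian update); here the same routine applies equally to returns $R(\tau)$ by passing rewards instead of costs.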



Figure 1: The k-SP constraint improves sample efficiency of RL methods in sparse-reward tasks by pruning out suboptimal trajectories from the trajectory space. Intuitively, the k-SP constraint means that when a policy rolls out trajectories, every sub-path of length k must be a shortest path (under a distance metric defined in terms of the policy, discount factor, and transition probability; see Section 3.2 for the formal definition). (Left) An MDP and a rollout tree are given. (Middle) The paths that satisfy the k-SP constraint; the number of admissible trajectories is drastically reduced. (Right) A path rolled out by a policy satisfies the k-SP constraint if all sub-paths of length k are shortest paths and have not received a non-zero reward. We use a reachability network to test whether a given (sub-)path is a shortest path (see Section 4 for details).

