SHORTEST-PATH CONSTRAINED REINFORCEMENT LEARNING FOR SPARSE REWARD TASKS

Abstract

We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the agent's trajectory that improves sample efficiency in sparse-reward MDPs. We show that any optimal policy necessarily satisfies the k-SP constraint. Notably, the k-SP constraint prevents the policy from exploring state-action pairs along non-k-SP trajectories (e.g., going back and forth). In practice, however, excluding state-action pairs outright may hinder the convergence of RL algorithms. To overcome this, we propose a novel cost function that penalizes the policy for violating the SP constraint, instead of completely excluding such trajectories. Our numerical experiments in a tabular RL setting demonstrate that the SP constraint can significantly reduce the trajectory space of a policy. As a result, our constraint enables more sample-efficient learning by suppressing redundant exploration and exploitation. Our experiments on MiniGrid and DeepMind Lab show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods, including count-based exploration, indicating that it improves sample efficiency by preventing the agent from taking redundant actions.

1. INTRODUCTION

Recently, deep reinforcement learning (RL) has achieved a large number of breakthroughs in many domains, including video games (Mnih et al., 2015; Vinyals et al., 2019) and board games (Silver et al., 2017). Nonetheless, a central challenge in RL is sample efficiency (Kakade et al., 2003): RL algorithms have been shown to require a large number of samples for successful learning in MDPs with large state and action spaces. Moreover, the success of an RL algorithm hinges heavily on the quality of the collected samples; RL algorithms tend to fail if the collected trajectories do not contain enough evaluative feedback (e.g., sparse or delayed reward). To circumvent this challenge, planning-based methods utilize the environment's model to improve or create a policy instead of interacting with the environment. Recently, combining planning with an efficient path-search algorithm such as Monte-Carlo tree search (MCTS) (Norvig, 2002; Coulom, 2006) has demonstrated successful results (Guo et al., 2016; Vodopivec et al., 2017; Silver et al., 2017). However, such tree-search methods require an accurate model of the MDP, and the complexity of planning may grow intractably large for complex domains. Model-based RL methods attempt to learn a model instead of assuming that one is given, but learning an accurate model also requires a large number of samples, which is often even harder than solving the given task. Model-free RL methods can be learned solely from the environment reward, without the need for a (learned) model. However, both value-based and policy-based methods suffer from poor sample efficiency, especially in sparse-reward tasks. To tackle sparse-reward problems, researchers have proposed learning an intrinsic bonus function that measures the novelty of the states the agent visits (Schmidhuber, 1991; Oudeyer & Kaplan, 2009; Pathak et al., 2017; Savinov et al., 2018b; Choi et al., 2018; Burda et al., 2018).
However, when such an intrinsic bonus is added to the reward, it often requires careful balancing between the environment reward and the bonus, and scheduling of the bonus scale, in order to guarantee convergence to an optimal solution. To tackle the aforementioned challenge of sample efficiency in sparse-reward tasks, we introduce a constrained-RL framework that improves the sample efficiency of any model-free RL algorithm in sparse-reward tasks, under mild assumptions on the MDP (see Appendix G). Of note, though our framework is formulated for policy-based methods, our final form of the cost function (Eq. (10) in Section 4) is applicable to both policy-based and value-based methods. We propose a novel k-shortest-path (k-SP) constraint (Definition 7) that improves the sample efficiency of policy learning (see Figure 1). The k-SP constraint is applied to a trajectory rolled out by a policy: all of its sub-paths of length k are required to be shortest paths under the π-distance metric, which we define in Section 3.1.

Figure 1: The k-SP constraint improves the sample efficiency of RL methods in sparse-reward tasks by pruning suboptimal trajectories from the trajectory space. Intuitively, the k-SP constraint means that when a policy rolls out trajectories, every sub-path of length k is a shortest path (under a distance metric defined in terms of the policy, discount factor, and transition probability; see Section 3.2 for the formal definition). (Left) An MDP and a rollout tree. (Middle) The paths that satisfy the k-SP constraint; the number of admissible trajectories is drastically reduced. (Right) A path rolled out by a policy satisfies the k-SP constraint if all sub-paths of length k are shortest paths and have not received non-zero reward. We use a reachability network to test whether a given (sub-)path is a shortest path (see Section 4 for details).
We prove that applying our constraint preserves optimality for any MDP (Theorem 3), except for stochastic and multi-goal MDPs, which require additional assumptions. We relax the hard constraint into a soft cost formulation (Tessler et al., 2019), and use a reachability network (RNet) (Savinov et al., 2018b) to efficiently learn the cost function in an off-policy manner. We summarize our contributions as follows: (1) We propose a novel constraint that can improve the sample efficiency of any model-free RL method in sparse-reward tasks. (2) We present several theoretical results, including a proof that our proposed constraint preserves the optimal policy of a given MDP. (3) We present a numerical result in a tabular RL setting to precisely evaluate the effectiveness of the proposed method. (4) We propose a practical way to implement our proposed constraint, and demonstrate that it provides a significant improvement on two complex deep RL domains. (5) We demonstrate that our method significantly improves the sample efficiency of PPO and outperforms existing novelty-seeking methods on two complex domains in the sparse-reward setting.

2. PRELIMINARIES

Markov Decision Process (MDP). We model a task as an MDP tuple M = (S, A, P, R, ρ, γ), where S is a state set, A is an action set, P is a transition probability, R is a reward function, ρ is an initial state distribution, and γ ∈ [0, 1) is a discount factor. For each state s, the value of a policy π is denoted by $V^\pi(s) = \mathbb{E}_\pi\left[\sum_t \gamma^t r_t \mid s_0 = s\right]$. Then, the goal is to find the optimal policy π* that maximizes the expected return: $\pi^* = \arg\max_\pi \mathbb{E}_{s\sim\rho}\left[\mathbb{E}_\pi\left[\sum_t \gamma^t r_t \mid s_0 = s\right]\right] = \arg\max_\pi \mathbb{E}_{s\sim\rho}[V^\pi(s)]$.

Constrained MDP. A constrained Markov Decision Process (CMDP) is an MDP with extra constraints that restrict the domain of allowed policies (Altman, 1999). Specifically, a CMDP introduces a constraint function C(π) that maps a policy to a scalar, and a threshold α ∈ ℝ. The objective of a CMDP is to maximize the expected return $R(\tau) = \sum_t \gamma^t r_t$ of a trajectory τ = {s_0, a_0, r_1, s_1, a_1, r_2, s_2, ...} subject to a constraint: $\pi^* = \arg\max_\pi \mathbb{E}_{\tau\sim\pi}[R(\tau)]$, s.t. $C(\pi) \le \alpha$. A popular choice of constraint is based on a transition cost function (Tessler et al., 2019) $c(s, a, r, s') \in \mathbb{R}$, which assigns a scalar-valued cost to each transition. The constraint function for a policy π is then defined as the discounted sum of the cost under the policy: $C(\pi) = \mathbb{E}_{\tau\sim\pi}\left[\sum_t \gamma^t c(s_t, a_t, r_{t+1}, s_{t+1})\right]$. In this work, we propose a shortest-path constraint that provably preserves the optimal policy of the original unconstrained MDP while reducing the trajectory space. We will use a cost-function-based formulation to implement our constraint (see Sections 3 and 4).
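As a concrete illustration, the CMDP objective and its discounted-cost constraint can be evaluated on a sampled trajectory. The sketch below is illustrative only (not from the paper): `trajectory` is a list of (reward, cost) pairs standing in for sampled transitions, with the transition cost c(s, a, r, s′) abstracted to a precomputed scalar.

```python
def discounted_sum(values, gamma):
    """Compute sum_t gamma^t * values[t]."""
    return sum((gamma ** t) * v for t, v in enumerate(values))

def cmdp_objective(trajectory, gamma):
    """Return (discounted return R(tau), discounted cost C(pi) estimate)
    for a single trajectory given as (reward, cost) pairs."""
    rewards = [r for r, _ in trajectory]
    costs = [c for _, c in trajectory]
    return discounted_sum(rewards, gamma), discounted_sum(costs, gamma)
```

A policy satisfies the CMDP constraint when the expectation of the second returned value over its trajectories stays below the threshold α.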

3. FORMULATION: k-SHORTEST PATH CONSTRAINT

We define the k-shortest-path (k-SP) constraint to remove redundant transitions (e.g., unnecessarily going back and forth), leading to faster policy learning. We show two important properties of our constraint: (1) the optimal policy is preserved, and (2) the policy search space is reduced. In this work, we limit our focus to MDPs satisfying $R(s) + \gamma V^*(s) > 0$ for all initial states $s \sim \rho$ and all rewarding states that the optimal policy visits with non-zero probability, $s \in \{s \mid r(s) \neq 0, \pi^*(s) > 0\}$. We exploit this mild assumption to prove that our constraint preserves optimality. Intuitively, we exclude the case where the optimal strategy for the agent is at best choosing a "lesser of evils" (i.e., the largest but still negative value), which often still means failure. We note that this case is often caused by unnatural reward-function design; in principle, we can avoid it by simply offsetting the reward function by a constant $|\min_{s\in\{s \mid \pi^*(s)>0\}} V^*(s)|$ for every transition, assuming the policy is proper. Goal-conditioned RL (Nachum et al., 2018) and most well-known domains such as Atari (Bellemare et al., 2013), DeepMind Lab (Beattie et al., 2016), MiniGrid (Chevalier-Boisvert et al., 2018), etc., satisfy this assumption. For general settings with stochastic MDPs and multiple goals, we require additional assumptions to prove the optimality guarantee (see Appendix G for details).

3.1 SHORTEST-PATH POLICY AND SHORTEST-PATH CONSTRAINT

Let τ be a path defined by a sequence of states: $\tau = \{s_0, \ldots, s_{\ell(\tau)}\}$, where $\ell(\tau)$ is the length of the path τ (i.e., $\ell(\tau) = |\tau| - 1$). We denote the set of all paths from s to s′ by $T_{s,s'}$. A path τ* from s to s′ is called a shortest path from s to s′ if its length is minimal, i.e., $\ell(\tau^*) = \min_{\tau \in T_{s,s'}} \ell(\tau)$. Next, we define analogous concepts (length, shortest path, etc.) with respect to a policy.
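For deterministic transition graphs, the textbook shortest-path length ℓ(τ*) can be computed with breadth-first search. The helper below is a hypothetical sketch (not part of the paper's method) used only to make the definition concrete; `adj` maps each state to an iterable of its successor states.

```python
from collections import deque

def shortest_path_length(adj, s, s_prime):
    """Length of a shortest path from s to s_prime in a deterministic
    transition graph. Returns float('inf') if s_prime is unreachable,
    matching the convention that the distance is infinite when the
    path set is empty."""
    if s == s_prime:
        return 0
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == s_prime:
                    return dist[v]
                queue.append(v)
    return float('inf')
```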
Intuitively, the counterpart should be a policy that rolls out shortest paths (up to some stochasticity) to a goal state or between state pairs. We consider the set of all admissible paths from s to s′ under a policy π:

Definition 1 (Path set). $T^\pi_{s,s'} = \{\tau \mid s_0 = s,\ s_{\ell(\tau)} = s',\ p_\pi(\tau) > 0,\ s' \notin \{s_t\}_{t < \ell(\tau)}\}$.

That is, $T^\pi_{s,s'}$ is the set of all paths that policy π may roll out from s, terminated upon visiting s′. If the MDP is a single-goal task, i.e., there exists a unique (rewarding) goal state $s_g \in S$ such that $s_g$ is a terminal state and R(s) > 0 if and only if $s = s_g$, then any shortest path from an initial state to the goal state is an optimal path with the highest return R(τ), and a policy that rolls out a shortest path is therefore optimal (see Lemma 4). This is because all states except $s_g$ are non-rewarding; in general MDPs this is not necessarily true. This motivates us to restrict the domain of shortest paths to paths between non-rewarding states. We define the non-rewarding paths from s to s′ as follows:

Definition 2 (Non-rewarding path set). $T^\pi_{s,s',nr} = \{\tau \mid \tau \in T^\pi_{s,s'},\ \{r_t\}_{t < \ell(\tau)} = 0\}$.

In words, $T^\pi_{s,s',nr}$ is the set of all non-rewarding paths from s to s′ rolled out by policy π (i.e., $\tau \in T^\pi_{s,s'}$) with no associated reward except possibly at the last step (i.e., $\{r_t\}_{t<|\tau|} = 0$). We are now ready to define a notion of length with respect to a policy, and the shortest-path policy:

Definition 3 (π-distance from s to s′). $D^\pi_{nr}(s, s') = \log_\gamma \mathbb{E}_{\tau\sim\pi:\ \tau\in T^\pi_{s,s',nr}}\left[\gamma^{\ell(\tau)}\right]$.

Definition 4 (Shortest-path distance from s to s′). $D_{nr}(s, s') = \min_\pi D^\pi_{nr}(s, s')$.

We define the π-distance to be the log-mean-exponential of the lengths $\ell(\tau)$ of the non-rewarding paths $\tau \in T^\pi_{s,s',nr}$. When there exists no admissible path from s to s′ under policy π, the path length is defined to be ∞: $D^\pi_{nr}(s, s') = \infty$ if $T^\pi_{s,s',nr} = \emptyset$.
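Definition 3 can be estimated from Monte-Carlo roll-outs by a log-mean-exp over sampled path lengths. The function below is an illustrative estimator under the assumption that the lengths of sampled non-rewarding paths are already available; for a deterministic policy and MDP (all lengths equal) it recovers the ordinary path length.

```python
import math

def pi_distance(path_lengths, gamma):
    """Monte-Carlo estimate of the pi-distance
        D = log_gamma E[gamma^{l(tau)}]
    from sampled non-rewarding path lengths. An empty sample set maps
    to infinity, matching the definition for an empty path set."""
    if not path_lengths:
        return float('inf')
    mean_exp = sum(gamma ** l for l in path_lengths) / len(path_lengths)
    return math.log(mean_exp) / math.log(gamma)  # log base gamma
```

Because γ^ℓ is convex in ℓ and log_γ is decreasing, mixing path lengths yields a π-distance no larger than the arithmetic-mean length.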
We note that when both the MDP and the policy are deterministic, $D^\pi_{nr}(s, s')$ recovers the natural definition of path length, $D^\pi_{nr}(s, s') = \ell(\tau)$. We call a policy a shortest-path policy from s to s′ if it rolls out a path with the smallest π-distance:

Definition 5 (Shortest-path policy from s to s′). $\pi \in \Pi^{SP}_{s\to s'} = \{\pi \in \Pi \mid D^\pi_{nr}(s, s') = D_{nr}(s, s')\}$.

Finally, we define the shortest-path (SP) constraint. Let $S_{IR} = \{s \mid R(s) > 0 \text{ or } \rho(s) > 0\}$ be the union of all initial and rewarding states, and $\Phi^\pi = \{(s, s') \mid s, s' \in S_{IR},\ \rho(s) > 0,\ T^\pi_{s,s',nr} \neq \emptyset\}$ be the subset of pairs in $S_{IR}$ between which the agent may roll out. The SP constraint is then applied to the non-rewarding sub-paths between states in $\Phi^\pi$: $T^\pi_{\Phi,nr} = \bigcup_{(s,s')\in\Phi^\pi} T^\pi_{s,s',nr}$. We note that these definitions are used in the proofs (Appendix G). Now, we define the shortest-path constraint as follows:

Definition 6 (Shortest-path constraint). A policy π satisfies the shortest-path (SP) constraint if $\pi \in \Pi^{SP}$, where $\Pi^{SP} = \{\pi \mid \text{for all } s, s' \in T^\pi_{\Phi,nr},\ \pi \in \Pi^{SP}_{s\to s'}\}$.

Intuitively, the SP constraint forces a policy to transition between initial and rewarding states via shortest paths. The SP constraint should be particularly effective in sparse-reward settings, where the distance between rewarding states is large. Given these definitions, we can show that an optimal policy indeed satisfies the SP constraint in a general MDP setting; in other words, the shortest-path constraint does not change optimality:

Theorem 1. For any MDP, an optimal policy π* satisfies the shortest-path constraint: $\pi^* \in \Pi^{SP}$.

Proof. See Appendix G for the proof.

3.2 RELAXATION: k-SHORTEST-PATH CONSTRAINT

Implementing the shortest-path constraint directly is, however, intractable, since it requires a distance predictor $D_{nr}(s, s')$. Note that the distance predictor itself solves an optimization problem that may be as difficult as the given task.
To circumvent this challenge, we consider a more tractable version, namely the k-shortest-path constraint, which reduces the shortest-path problem $D_{nr}(s, s')$ to a binary decision problem -- is the state s′ reachable from s within k steps? -- also known as k-reachability (Savinov et al., 2018b). The k-shortest-path constraint is defined as follows:

Definition 7 (k-shortest-path constraint). A policy π satisfies the k-shortest-path constraint if $\pi \in \Pi^{SP}_k$, where $\Pi^{SP}_k = \{\pi \mid \text{for all } s, s' \in T^\pi_{\Phi,nr} \text{ with } D^\pi_{nr}(s, s') \le k,\ \pi \in \Pi^{SP}_{s\to s'}\}$.

Note that the SP constraint (Definition 6) is relaxed by adding the condition $D^\pi_{nr}(s, s') \le k$; in other words, the k-SP constraint is imposed only on s-s′ paths whose length is not greater than k. From this definition, we can prove a containment property and then Theorem 3 (optimality):

Lemma 2. For an MDP M, $\Pi^{SP}_m \subset \Pi^{SP}_k$ if k < m.

Proof. This holds since $\{(s, s') \mid D^\pi_{nr}(s, s') \le k\} \subset \{(s, s') \mid D^\pi_{nr}(s, s') \le m\}$ for k < m.

Theorem 3. For an MDP M and any k ∈ ℝ, an optimal policy π* is a k-shortest-path policy.

Proof. Theorem 1 gives $\pi^* \in \Pi^{SP}$. By definition, $\Pi^{SP} = \Pi^{SP}_\infty$, and Lemma 2 gives $\Pi^{SP}_\infty \subset \Pi^{SP}_k$. Collectively, $\pi^* \in \Pi^{SP} = \Pi^{SP}_\infty \subset \Pi^{SP}_k$.

In conclusion, Theorem 3 states that the k-SP constraint does not change the optimality of the policy, and Lemma 2 states that a larger k yields a larger reduction in the policy search space. This motivates us to apply the k-SP constraint in policy search to find an optimal policy more efficiently. For a numerical experiment measuring the reduction in the policy roll-out space, please refer to Section 6.4.

4. SPRL: SHORTEST-PATH REINFORCEMENT LEARNING

k-Shortest-Path Cost. The objective of RL with the k-SP constraint $\Pi^{SP}_k$ can be written as:
$$\pi^* = \arg\max_\pi \mathbb{E}_\pi[R(\tau)] \quad \text{s.t.} \quad \pi \in \Pi^{SP}_k, \tag{3}$$
where $\Pi^{SP}_k = \{\pi \mid \text{for all } s, s' \in T^\pi_{\Phi,nr} \text{ with } D^\pi_{nr}(s, s') \le k,\ \pi \in \Pi^{SP}_{s\to s'}\}$ (Definition 7). We want to formulate the constraint $\pi \in \Pi^{SP}_k$ in the form of a constrained MDP (Section 2), i.e., as $C(\pi) \le \alpha$. We begin by rewriting the k-SP constraint in a cost-based form:
$$\Pi^{SP}_k = \{\pi \mid C^{SP}_k(\pi) = 0\}, \quad \text{where} \quad C^{SP}_k(\pi) = \sum_{(s,s'\in T^\pi_{\Phi,nr}):\, D^\pi_{nr}(s,s')\le k} \mathbb{I}\left[D_{nr}(s, s') < D^\pi_{nr}(s, s')\right]. \tag{4}$$
Note that $\mathbb{I}[D_{nr}(s, s') < D^\pi_{nr}(s, s')] = 0 \Leftrightarrow D_{nr}(s, s') = D^\pi_{nr}(s, s')$, since $D_{nr}(s, s') \le D^\pi_{nr}(s, s')$ by Definition 4. Similar to Tessler et al. (2019), we apply the constraint to the on-policy trajectory τ = (s_0, s_1, ...) with discounting, replacing (s, s′) with $(s_t, s_{t+l})$, where [t, t+l] indexes each segment of τ of length l:
$$C^{SP}_k(\pi) \simeq \mathbb{E}_{\tau\sim\pi}\Big[\sum_{(t,l):\, t\ge 0,\, l\le k} \gamma^t \cdot \mathbb{I}\left[D_{nr}(s_t, s_{t+l}) < D^\pi_{nr}(s_t, s_{t+l})\right] \cdot \mathbb{I}\left[\{r_j\}_{j=t}^{t+l-1} = 0\right]\Big]$$
$$= \mathbb{E}_{\tau\sim\pi}\Big[\sum_{(t,l):\, t\ge 0,\, l\le k} \gamma^t \cdot \mathbb{I}\Big[D_{nr}(s_t, s_{t+l}) < \log_\gamma \mathbb{E}_{\tau\in T^\pi_{s_t,s_{t+l},nr}}\big[\gamma^{|\tau|}\big]\Big] \cdot \mathbb{I}\left[\{r_j\}_{j=t}^{t+l-1} = 0\right]\Big]$$
$$\le \mathbb{E}_{\tau\sim\pi}\Big[\sum_{(t,l):\, t\ge 0,\, l\le k} \gamma^t \cdot \mathbb{I}\left[D_{nr}(s_t, s_{t+l}) < k\right] \cdot \mathbb{I}\left[\{r_j\}_{j=t}^{t+l-1} = 0\right]\Big] \triangleq \hat{C}^{SP}_k(\pi). \tag{5}$$
Note that it suffices to consider only the case l = k (because for l < k, given $D_{nr}(s_t, s_{t+k}) < k$, we have $D_{nr}(s_t, s_{t+l}) \le l < k$). Then we simplify $\hat{C}^{SP}_k(\pi)$ as
$$\hat{C}^{SP}_k(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_t \gamma^t\, \mathbb{I}\left[D_{nr}(s_t, s_{t+k}) < k\right] \cdot \mathbb{I}\left[\{r_j\}_{j=t}^{t+k-1} = 0\right]\Big] \tag{6}$$
$$= \mathbb{E}_{\tau\sim\pi}\Big[\sum_t \gamma^t\, \mathbb{I}[t \ge k] \cdot \mathbb{I}\left[D_{nr}(s_{t-k}, s_t) < k\right] \cdot \mathbb{I}\left[\{r_j\}_{j=t-k}^{t-1} = 0\right]\Big]. \tag{7}$$
Finally, the per-time-step cost $c_t$ is given as:
$$c_t = \mathbb{I}[t \ge k] \cdot \mathbb{I}\left[D_{nr}(s_{t-k}, s_t) < k\right] \cdot \mathbb{I}\left[\{r_j\}_{j=t-k}^{t-1} = 0\right], \quad \text{where} \quad \hat{C}^{SP}_k(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_t \gamma^t c_t\Big]. \tag{8}$$
(Training alternates between updating the policy π to maximize the objective in Eq. (10), e.g., via PPO train steps, and updating the RNet to minimize $L_{Rnet}$ in Eq. (14).)
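The per-step cost c_t in Eq. (8) can be computed directly once a distance oracle is available. The sketch below is a hypothetical reference implementation (in the paper, the oracle is replaced by a learned reachability network): `dist` is any callable estimating D_nr(s, s′), and `rewards[j]` holds r_j.

```python
def ksp_costs(states, rewards, dist, k):
    """Per-step k-SP costs (Eq. (8)):
        c_t = I[t >= k] * I[dist(s_{t-k}, s_t) < k] * I[r_{t-k..t-1} all zero]
    `states[t]` is s_t and `rewards[j]` is r_j; `dist` is a
    shortest-distance oracle D_nr(s, s')."""
    costs = []
    for t in range(len(states)):
        if t < k:
            costs.append(0)
            continue
        non_rewarding = all(r == 0 for r in rewards[t - k:t])
        shortcut_exists = dist(states[t - k], states[t]) < k
        costs.append(int(shortcut_exists and non_rewarding))
    return costs
```

On a 1-D line with `dist = abs(a - b)`, a back-and-forth walk is penalized at every step t ≥ k, while a straight walk incurs zero cost.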
Note that $\hat{C}^{SP}_k(\pi)$ is an upper bound of $C^{SP}_k(\pi)$; minimizing this bound makes the policy violate the shortest-path constraint as little as possible. Intuitively, $c_t$ penalizes the agent for taking a non-k-shortest path at each step, so minimizing these penalties drives the policy toward satisfying the k-shortest-path constraint. In Eq. (8), $c_t$ depends on the previous k steps; hence, the resulting CMDP becomes a (k+1)-th order MDP. In practice, however, we empirically found that feeding only the current time-step observation to the policy performs better than stacking the previous k steps of observations (see Appendix A.3 for details). Thus, we did not stack observations in any of the experiments. We use the Lagrange multiplier method to convert the objective (3) into an equivalent unconstrained problem:
$$\min_{\lambda>0} \max_\theta L(\lambda, \theta) = \min_{\lambda>0} \max_\theta \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_t \gamma^t (r_t - \lambda c_t)\Big], \tag{9}$$
where L is the Lagrangian, θ is the parameter of the policy π, and λ > 0 is the Lagrange multiplier. Since Theorem 3 shows that the shortest-path constraint preserves optimality, we are free to set any λ > 0. Thus, we simply treat λ as a tunable positive hyperparameter and simplify the min-max problem (9) to an RL objective with the cost $c_t$ added:
$$\max_\theta \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_t \gamma^t (r_t - \lambda c_t)\Big]. \tag{10}$$
Practical implementation of the cost function. We implement the binary distance discriminator $\mathbb{I}[D_{nr}(s_{t-k}, s_t) < k]$ in Eq. (8) using a k-reachability network (Savinov et al., 2018b). The k-reachability network $\mathrm{Rnet}_k(s, s')$ is trained to output 1 if the state s′ is reachable from the state s within k consecutive actions, and 0 otherwise. Formally, we take the functional form $\mathrm{Rnet}_k(s, s') \simeq \mathbb{I}\left[D_{nr}(s, s') < k+1\right]$. We then estimate the cost term $c_t$ using the (k-1)-reachability network as follows:
$$c_t = \mathbb{I}\left[D_{nr}(s_{t-k}, s_t) < k\right] \cdot \mathbb{I}\left[\{r_l\}_{l=t-k}^{t-1} = 0\right] \cdot \mathbb{I}[t \ge k] \tag{11}$$
$$\simeq \mathrm{Rnet}_{k-1}(s_{t-k}, s_t) \cdot \mathbb{I}\left[\{r_l\}_{l=t-k}^{t-1} = 0\right] \cdot \mathbb{I}[t \ge k]. \tag{12}$$
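Putting the pieces together, the objective of Eq. (10) amounts to training any RL method on the shaped reward r_t − λc_t. The discounted objective for a single trajectory can be written as the following minimal sketch (λ fixed as a hyperparameter, as in the paper):

```python
def lagrangian_return(rewards, costs, gamma, lam):
    """Discounted objective of Eq. (10) for one trajectory:
        sum_t gamma^t * (r_t - lam * c_t),
    with lam treated as a fixed positive hyperparameter."""
    return sum((gamma ** t) * (r - lam * c)
               for t, (r, c) in enumerate(zip(rewards, costs)))
```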
Intuitively, if the agent takes a k-shortest path, the distance between $s_{t-k}$ and $s_t$ is exactly k, hence $c_t = 0$; if it does not, the distance between $s_{t-k}$ and $s_t$ is less than k, so $c_t > 0$. In practice, due to error in the reachability network, we add a small tolerance ∆t ∈ ℕ to ignore outliers. This leads to the empirical version of the cost:
$$c_t \simeq \mathrm{Rnet}_{k-1}(s_{t-k-\Delta t}, s_t) \cdot \mathbb{I}\left[\{r_l\}_{l=t-k-\Delta t}^{t-1} = 0\right] \cdot \mathbb{I}[t \ge k + \Delta t]. \tag{13}$$
In our experiments, we found that a small tolerance $\Delta t \approx k/5$ works well in general. Similar to Savinov et al. (2018b), we used the following contrastive loss for training the reachability network:
$$L_{Rnet} = -\log\left(\mathrm{Rnet}_{k-1}(s_{anc}, s_+)\right) - \log\left(1 - \mathrm{Rnet}_{k-1}(s_{anc}, s_-)\right), \tag{14}$$
where $s_{anc}$, $s_+$, $s_-$ are the anchor, positive, and negative samples, respectively (see Appendix D.3 for the details of training).
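Per (anchor, positive, negative) triple, the contrastive loss above reduces to a sum of two binary cross-entropy terms. A minimal numerical sketch follows; `p_pos` and `p_neg` stand for the network's sigmoid outputs on the positive and negative pairs, and the `eps` guard is an implementation detail we add, not from the paper.

```python
import math

def rnet_loss(p_pos, p_neg, eps=1e-8):
    """Contrastive loss of Eq. (14):
        L = -log(Rnet(s_anc, s+)) - log(1 - Rnet(s_anc, s-)),
    evaluated on the network's probability outputs for one triple."""
    return -math.log(p_pos + eps) - math.log(1 - p_neg + eps)
```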

5. RELATED WORK

Connection between the shortest-path problem and planning. Many early works (Bellman, 1958; Ford Jr, 1956; Bertsekas & Tsitsiklis, 1991; 1995) discussed (stochastic) shortest-path problems in the context of MDPs. They viewed the shortest-path problem as a planning problem and proposed dynamic-programming-based algorithms similar to value iteration (Sutton & Barto, 2018) to solve it. Our main idea is inspired by (but not based on) this viewpoint. Specifically, our method does not directly solve the shortest-path problem via planning; hence, it does not require a model. Our method only exploits the optimality guarantee of the shortest path under the π-distance to prune out sub-optimal policies (i.e., non-shortest paths).

Distance metrics in goal-conditioned RL. In goal-conditioned RL, there has been a recent surge of interest in learning a distance metric over the state (or goal) space in order to construct a high-level MDP graph and perform planning to find a shortest path to the goal state. Huang et al. (2019) and Laskin et al. (2020) used the universal value function (UVF) (Schaul et al., 2015) with a constant step penalty as a distance function. Zhang et al. (2018) and Laskin et al. (2020) used the success rate of transitions between nodes as a distance and searched for the longest path to find the plan with the highest success rate. SPTM (Savinov et al., 2018a) defined a binary distance based on a reachability network (RNet) to connect nearby nodes in the graph. However, these distance metrics and methods apply only to goal-conditioned tasks and lack theoretical guarantees in general MDPs, while our theory and framework are applicable to general MDPs (see Section 3.1).

Reachability network. The reachability network (RNet) was first proposed by Savinov et al. (2018b) as a way to measure the novelty of a state for exploration. Intuitively, if the current state is not reachable from the previous states in episodic memory, it is considered novel.
SPTM (Savinov et al., 2018a) used the RNet to predict the local connectivity (i.e., binary distance) between observations in memory for graph-based planning in navigation tasks. In contrast, we use the RNet to constrain the policy (i.e., to remove sub-optimal policies from the policy space). Thus, in our work and in the two compared works, the RNet serves fundamentally different purposes.

More related work. Please refer to Appendix H for further discussion of other related works.

6. EXPERIMENTS

6.1 SETTINGS

Environments. We evaluate our SPRL on two challenging domains: MiniGrid (Chevalier-Boisvert et al., 2018) and DeepMind Lab (Beattie et al., 2016). MiniGrid is a 2D grid-world environment with challenging features such as pictorial observations, random initialization of the agent and the goal, and a complex state and action space involving coordinates, directions, and other object statuses (e.g., key-door). We conducted experiments on four standard MiniGrid tasks: FourRooms-7×7, FourRooms-11×11, KeyDoors-7×7, and KeyDoors-11×11. DeepMind Lab is a 3D environment with a first-person view. In addition to being a partially observed MDP, at each episode the agent's initial location and the goal location are reset randomly, with changes of texture, maze structure, and colors. We conducted experiments on three standard DeepMind Lab tasks: GoalSmall, GoalLarge, and ObjectMany. We refer the reader to Figure 2 for example observations.

Baselines. We compared our method with four baselines: PPO (Schulman et al., 2017), episodic curiosity (ECO) (Savinov et al., 2018b), the intrinsic curiosity module (ICM) (Pathak et al., 2017), and GT-Grid (Savinov et al., 2018b). PPO is used as the base RL algorithm for all other agents. The ECO agent is rewarded when it visits a state that is not reachable from the states in episodic memory within a certain number of actions; thus, novelty is only measured within an episode. Following Savinov et al. (2018b), we trained the RNet in an off-policy manner from the agent's experience and used it for our SPRL and for ECO on both MiniGrid (Section 6.2) and DeepMind Lab (Section 6.3). For the accuracy of the learned RNet on each task, please refer to Appendix B. The GT-Grid agent has access to the agent's (x, y) coordinates; it uniformly divides the world into 2D grid cells, and the agent is rewarded for visiting a novel grid cell. The ICM agent learns forward and inverse dynamics models and uses the prediction error of the forward model to measure novelty. We used the publicly available codebase of Savinov et al. (2018b) to obtain the baseline results. We used the same hyperparameters for all tasks within a given domain; the details are described in the Appendix. We used standard domains and tasks for reproducibility.

6.2 RESULTS ON MiniGrid

Figure 3 shows the performance of the compared methods on the MiniGrid domain. SPRL consistently outperforms all baseline methods over all tasks. We observe that the exploration-based methods (i.e., ECO, ICM, and GT-Grid) perform similarly to PPO in tasks with small state spaces (e.g., FourRooms-7×7 and KeyDoors-7×7). SPRL, however, demonstrates a significant performance gain, since it improves exploitation by avoiding the sub-optimality caused by taking non-shortest paths.
6.3 RESULTS ON DeepMind Lab

Figure 4 summarizes the performance of all methods on the DeepMind Lab tasks. Overall, our SPRL method achieves superior results compared to the other methods. By the design of the tasks, the difficulty of exploration increases in the order of GoalSmall, ObjectMany, and GoalLarge, and we observe a coherent trend in the results. For harder exploration tasks, the exploration-based methods (GT-Grid, ICM, and ECO) achieve a larger improvement over PPO: e.g., roughly 20%, 50%, and 100% improvement in GoalSmall, ObjectMany, and GoalLarge, respectively. As suggested by Lemma 2, our SPRL is expected to yield a larger improvement for larger trajectory spaces (or state and action spaces) and sparser reward settings. The results verify this: SPRL shows the largest improvement in the GoalLarge task, where the map layout is largest and the reward is sparsest. Interestingly, SPRL even outperforms GT-Grid, which simulates the upper-bound performance of novelty-seeking exploration methods. This is possible because SPRL improves exploration by suppressing unnecessary exploration, which differs from novelty-seeking methods, and also improves exploitation by reducing the policy search space.

6.4 ANALYSIS: REDUCTION IN TRAJECTORY SPACE

Experimental setup. We implemented a simple tabular 7×7 Four-rooms domain where each state maps to a unique (x, y) location of the agent. The agent can take the up, down, left, and right primitive actions to move to a neighboring state, and the episode horizon is set to 14 steps. The goal of the agent is to reach the goal state, which gives +1 reward and terminates the episode. We computed the ground-truth distance between each pair of states to implement the k-shortest-path constraint; i.e., we used the ground-truth distance function instead of the learned RNet to implement an exact SPRL agent.

Results. Figure 5 summarizes the reduction in the size of the trajectory space. We enumerated all possible trajectories of length 14 using breadth-first search (BFS), counted the number of trajectories satisfying our k-SP constraint while varying the parameter k and the tolerance ∆t, and divided by the total number of trajectories (i.e., 4^14 ≈ 268M). The results show that our k-SP constraint drastically reduces the trajectory space even in a simple 2D grid domain: with a very small k = 3 and tolerance ∆t = 2, only 24 of the 268M trajectories remain. As we increase k, we see a further reduction in the trajectory space, consistent with Lemma 2. Increasing the tolerance ∆t slightly weakens the reduction, but still achieves a large one (see Appendix A for more analysis of the effect of k and the tolerance).

Qualitative behavior. We counted all the state transitions (s_t → s_{t+1}) of each agent's roll-outs, averaged over 4 random seeds. Random relies on a random walk and cannot explore further than the initial few rooms. GT-UCB seeks novel states and visits all states uniformly. SPRL learns to take the longest possible shortest path, which results in a "straight" path across the rooms. Note that this represents only part of SPRL's behavior, since our cost also depends on the presence of non-zero reward (see Eq. (8)). Thus, in panel (d), we tested SPRL while providing only the existence of non-zero reward (but not the reward magnitude). SPRL learns to take a shortest path between rewarding and initial states, consistent with the shortest-path definition in Definition 7.
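The brute-force measurement above can be reproduced in miniature. The sketch below is a simplified, hypothetical version of the Section 6.4 experiment: an open n×n grid with no walls (unlike the Four-rooms layout), a short horizon so that full enumeration stays cheap, and the deterministic k-SP condition that every k-step window covers Manhattan distance exactly k, with no rewards and no tolerance.

```python
from itertools import product

def count_ksp_trajectories(n, horizon, k):
    """Enumerate all action sequences of length `horizon` on an open
    n x n grid (moves clamp at the walls) and count the roll-outs whose
    every k-step window covers Manhattan distance exactly k (the
    deterministic k-SP condition). Returns (satisfying, total)."""
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

    def step(s, a):
        x, y = s[0] + a[0], s[1] + a[1]
        return (min(max(x, 0), n - 1), min(max(y, 0), n - 1))

    total = satisfying = 0
    for actions in product(moves, repeat=horizon):
        traj = [(0, 0)]
        for a in actions:
            traj.append(step(traj[-1], a))
        total += 1
        satisfying += all(
            abs(traj[t][0] - traj[t - k][0])
            + abs(traj[t][1] - traj[t - k][1]) == k
            for t in range(k, len(traj))
        )
    return satisfying, total
```

Even at this toy scale, the satisfying fraction shrinks as k grows, mirroring Lemma 2.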

7. CONCLUSION

We presented the k-shortest-path constraint, which can improve the sample efficiency of any model-free RL method by preventing the agent from taking sub-optimal transitions. We empirically showed that our SPRL outperforms vanilla RL and strong novelty-seeking exploration baselines on two challenging domains. We believe our framework opens a unique direction for improving sample efficiency in reinforcement learning; combining our work with other sample-efficiency techniques is interesting future work that could benefit many practical tasks.

Appendix: Shortest-path reinforcement learning

A MORE ABLATION STUDY 

A.1 EFFECT OF k

As proven in Lemma 2 and shown in Section 6.4, the larger k is, the larger the reduction in policy space promised by the k-SP constraint, which results in faster learning. However, with our practical implementation of SPRL using a learned (imperfect) reachability network, an overly large k has a drawback: intuitively, it is harder for the policy to satisfy the k-SP constraint, and the supervision signal given by our cost function becomes sparser (i.e., the agent is almost always penalized). Figures 7 and 8 show the performance of SPRL on the DeepMind Lab and MiniGrid domains with varying k. In both domains, we can see that there exists a "sweet spot" that balances the reduction in policy space against the sparsity of the supervision (e.g., k = 10 for DeepMind Lab and k = 5 for MiniGrid).

A.2 EFFECT OF TOLERANCE ∆t

Adding the tolerance ∆t makes our k-SP constraint "softer" by allowing ∆t steps of redundancy in transitions (see Eq. (13)). Intuitively, a small tolerance may improve stability by absorbing possible noise in the RNet prediction, but too large a tolerance makes the constraint less effective at removing sub-optimal transitions. Figures 9 and 10 show the performance of SPRL on the DeepMind Lab and MiniGrid domains with varying tolerance ∆t. Similar to k, there exists a "sweet spot" that balances the reduction in policy space against the stabilization of noisy RNet outputs (e.g., ∆t = 25 in MiniGrid). Note that the best tolerance values for DeepMind Lab and MiniGrid are vastly different. This is mainly because we used multiple-tolerance sampling (see Appendix D) for DeepMind Lab but not for MiniGrid; since multiple-tolerance sampling also improves the stability of the RNet, a larger tolerance brings less benefit relative to its disadvantage.

Figure 12: Average episode reward of SPRL with observation stacking of 0, 1, or 5 frames, as a function of environment steps, on the MiniGrid tasks. Other hyperparameters are kept at their best values. The best performance is obtained without stacking (i.e., #stack = 0).

A.3 STACKING OBSERVATION

The CMDP with the k-SP constraint becomes a (k+1)-th order MDP, as shown in Eq. (8). Thus, in theory, the policy should take the current state $s_t$ augmented by stacking the k previous states as input: $[s_{t-k}, s_{t-k+1}, \ldots, s_t]$, where $[\cdot]$ stacks the pixel observations along the channel (i.e., color) dimension. However, stacking the observations may not lead to the best empirical results in practice. Figures 11 and 12 show the performance of SPRL on the DeepMind Lab and MiniGrid domains with varying stacking dimensions. For stack = m, we stacked the observations from t − m to t: $[s_{t-m}, s_{t-m+1}, \ldots, s_t]$. We experimented up to m = k: m = 10 for DeepMind Lab and m = 5 for MiniGrid. The results show that stacking the observations does not necessarily improve performance for MDPs of order greater than 1, which is often observed when function approximation is used (e.g., Savinov et al. (2018b)). Thus, we did not augment the observations in any of the experiments.

Under review as a conference paper at ICLR 2021
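For reference, the channel-wise stacking $[s_{t-m}, \ldots, s_t]$ described above can be sketched as follows. This is a hypothetical helper: the paper does not specify how the first m steps are handled, so padding by repeating the earliest frame is our assumption.

```python
import numpy as np

def stack_observations(frames, m):
    """Stack the current frame with the m previous frames along the
    channel axis: [s_{t-m}, ..., s_t], giving shape (H, W, C*(m+1)).
    Early steps are padded by repeating the earliest available frame
    (an assumption); m = 0 returns s_t unchanged."""
    t = len(frames) - 1
    window = [frames[max(0, t - i)] for i in range(m, -1, -1)]
    return np.concatenate(window, axis=-1)
```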

B ANALYSIS ON THE REACHABILITY NETWORK (RNET)

B.1 ACCURACY OF THE REACHABILITY NETWORK

We measured the accuracy of the reachability network on DeepMind Lab and MiniGrid in Figure 14 and Figure 13, respectively. The accuracy was measured on a validation set; we sampled 15,000 positive and 15,000 negative samples from the replay buffer of size 60,000. Specifically, for an anchor s_t, we sample the positive sample s_{t+} with t+ ∈ [t+1, t+k+∆t], and the negative sample s_{t-} with t- ∈ [t+k+∆t+∆-, t+k+∆t+2∆-]. The RNet reaches an accuracy higher than 80% in only 0.4M steps on both MiniGrid and DeepMind Lab. We note that this is quite high considering the unavoidable noise in the negative samples: since the negative samples are chosen based on temporal distance rather than actual reachability, they have a non-zero probability of being reachable, in which case they are in fact positive samples.

B.2 ABLATION STUDY: COMPARISON BETWEEN THE LEARNED RNET AND THE GT-RNET

In this section, we study the effect of RNet's accuracy on SPRL's performance. To this end, we implement a ground-truth reachability network by computing the ground-truth distance between a pair of states in MiniGrid and comparing it with k. For state inputs s and s', we roll out all possible k-step trajectories starting from s using the ground-truth single-step forward model. If s' is ever visited during the roll-out, the output of the k-reachability network is 1; otherwise, the output is 0.

Result. We compared the performance of our SPRL with the learned RNet and with the ground-truth RNet (GT-SPRL) in Figure 15, using the best hyperparameters. Overall, the performance of SPRL and GT-SPRL is similar. This is partly because the learned RNet achieves quite high accuracy in the early stage of learning (see Figure 13). Interestingly, SPRL with the learned RNet performs better than SPRL with the GT-RNet on FourRooms-7×7 and KeyDoors-7×7. This is possible because a small amount of noise in the RNet output can have an effect similar to an increased tolerance ∆t, which makes the resulting cost denser; this may be helpful depending on the task and hyperparameters.
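The ground-truth k-reachability check described above can be sketched with a breadth-first search over the ground-truth single-step forward model (a sketch under assumed interfaces: `step_fn(state, action)` is the hypothetical deterministic forward model and `actions` its action set; neither name is from the paper):

```python
from collections import deque

def gt_reachable(s, s_prime, step_fn, actions, k):
    """Return 1 if s' is visited on some <= k-step roll-out from s under
    the ground-truth forward model, else 0. BFS visits states at their
    minimal depth, so pruning already-seen states is safe."""
    frontier, seen = deque([(s, 0)]), {s}
    while frontier:
        state, depth = frontier.popleft()
        if state == s_prime:
            return 1
        if depth == k:
            continue  # do not expand beyond k steps
        for a in actions:
            nxt = step_fn(state, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return 0

# toy 1-D chain with 6 cells, actions move left/right with clamping
step = lambda s, a: max(0, min(5, s + a))
```

On the toy chain, cell 3 is reachable from cell 0 within k = 3 steps but not within k = 2, matching the intended semantics of the GT-RNet output.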

B.3 ABLATION STUDY: CURRICULUM LEARNING OF RNET

We applied linear scheduling of k to see whether curriculum learning can improve RNet training, and eventually improve SPRL's performance. For example, in the case of FourRooms-7×7, we used k = 1 until 0.1M steps, k = 2 until 0.2M steps, and so on, finally using k = 5 from 0.4M to 0.5M steps. Figure 16 and Figure 17 compare the performance of SPRL with and without curriculum learning on four MiniGrid tasks and three DeepMind Lab tasks, respectively. The result shows, however, that curriculum learning is not helpful for RNet training. We conjecture that changing k during RNet training makes the learning unstable, since changing k completely flips the learning target for some inputs; e.g., two states that are 2 steps apart are unreachable for k = 1 but reachable for k = 2. In Figure 18, we compare the RNet accuracy when RNet is trained with and without curriculum learning. Curriculum learning achieves similar or lower RNet accuracy with higher standard error, which indicates that it makes RNet training unstable.

C EXPERIMENT DETAILS OF MiniGrid DOMAIN

C.1 ENVIRONMENT

MiniGrid is a 2D grid-world environment with diverse predefined tasks (Chevalier-Boisvert et al., 2018). It has several challenging features such as pictorial observations, random initialization of the agent and the goal, a complex action space, and transition dynamics involving the agent's orientation of movement and changing object status via interaction (e.g., key-door).

State Space. An observation s_t is represented as an H × W × C tensor, where H and W are the height and width of the map, respectively, and C holds the features of the objects in the grid. The (h, w)-th element of the observation tensor is the (type, color, status) of the object, and for the coordinate of the agent, the (h, w)-th element is (type, 0, direction). The map size (i.e., H × W) varies depending on the task; e.g., for the FourRooms-7×7 task, the map size is 7 × 7.
Action Space and transition dynamics. The episode terminates in 100 steps, or earlier if the agent reaches the goal. The action space consists of seven discrete actions with the following transitions:
• Turn-Counter-Clockwise: change the direction counter-clockwise by 90 degrees.
• Turn-Clockwise: change the direction clockwise by 90 degrees.
• Move-Forward: move one step in the facing direction unless blocked by another object.
• Pick-up-key: pick up the key if the key is in front of the agent.
• Drop-the-key: drop the key in front of the agent.
• Open/Close-doors: open/close the door if the door is in front of the agent.
• Optional-action: not used.

Reward function. The reward is given only if the agent reaches the goal location, and the reward magnitude is 1 - 0.9 · (length of episode / maximum steps per episode). Thus, the agent can maximize the reward by reaching the goal location in the shortest time.
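The goal-reaching reward above is a simple linear decay in the episode length; a one-line sketch (the function name is ours, not MiniGrid's API):

```python
def minigrid_reward(episode_length, max_steps=100):
    """Reward on reaching the goal: 1 - 0.9 * (episode length / max steps).
    Zero reward is given if the goal is never reached."""
    return 1.0 - 0.9 * (episode_length / max_steps)
```

Reaching the goal in 10 steps yields 0.91, while using all 100 steps yields only 0.1, so shorter successful episodes are strictly preferred.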

C.2 TASKS

In FourRooms-7×7 and FourRooms-11×11, the map consists of four large rooms, and the agent needs to reach the goal. In KeyDoors-7×7 and KeyDoors-11×11, the agent needs to pick up the key, go to the door, and open the door before reaching the goal location.

C.3 ARCHITECTURE AND HYPER-PARAMETERS

We used a simple CNN architecture similar to (Mnih et al., 2015) for the policy network. The network consists of Conv1(16x2x2-1/SAME)-CReLU-Conv2(8x2x2-1/SAME)-CReLU-Conv3(8x2x2-1/SAME)-CReLU-FC(512)-FC(action-dimension), where SAME padding ensures the input and output have the same size (i.e., width and height) and CReLU (Shang et al., 2016) is a non-linear activation function applied after each layer. We used the Adam (Kingma & Ba, 2014) optimizer for the policy network. For the hyper-parameter search, we swept over the set of hyper-parameters specified in Table 1 and chose the best one in terms of the mean AUC over all the tasks, which is also summarized in Table 1.

D EXPERIMENT DETAILS OF DeepMind Lab DOMAIN

D.1 ENVIRONMENT

DeepMind Lab is a 3D game environment with a first-person view. Random initialization of the agent and the goal, a complex action space including directional changes, and random changes of texture, color, and maze structure make tasks in DeepMind Lab hard to learn.

State Space. An observation s_t is a 120 × 160 × 3 tensor, given as a first-person view of the map.

Action Space and transition dynamics. The episode terminates after a fixed number of steps regardless of whether the goal is achieved. The original action space consists of the discrete actions Move-Forward, Move-Backward, Strafe-Left, Strafe-Right, Look-Left, Look-Right, Look-Left-and-Move-Forward, and Look-Right-and-Move-Forward. In our experiment, we additionally used the action Fire, as in (Higgins et al., 2017; Vezhnevets et al., 2017; Savinov et al., 2018b; Espeholt et al., 2018; Khetarpal & Precup, 2018).

Algorithm 2 Sampling the triplet data from an episode for RNet training
Require: Hyperparameters: k ∈ N, positive bias ∆+ ∈ N, negative bias ∆- ∈ N
1: Initialize t_anc ← 0.
2: Initialize S_anc = ∅, S+ = ∅, S- = ∅.
3: while t_anc < T do
4:   S_anc = S_anc ∪ {s_{t_anc}}.
5:   t+ = Uniform(t_anc + 1, t_anc + k).
6:   t- = Uniform(t_anc + k + ∆-, T).
7:   S+ = S+ ∪ {s_{t+}}.
8:   S- = S- ∪ {s_{t-}}.
9:   t_anc = Uniform(t+ + 1, t+ + ∆+).
10: return S_anc, S+, S-
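Algorithm 2 can be sketched directly in Python as follows (a faithful sketch of the pseudo-code; the early-exit guard when the negative window would run past the episode end is our own addition, since the pseudo-code leaves that boundary case implicit):

```python
import random

def sample_triplets(T, k, pos_bias, neg_bias, rng=None):
    """Sample anchor/positive/negative time indices from an episode of
    length T for RNet training (Algorithm 2)."""
    rng = rng or random.Random()
    t_anc = 0
    anchors, positives, negatives = [], [], []
    while t_anc < T:
        if t_anc + k + neg_bias > T:  # guard (assumption): no valid negative left
            break
        anchors.append(t_anc)                              # line 4
        t_pos = rng.randint(t_anc + 1, t_anc + k)          # line 5
        t_neg = rng.randint(t_anc + k + neg_bias, T)       # line 6
        positives.append(t_pos)                            # line 7
        negatives.append(t_neg)                            # line 8
        t_anc = rng.randint(t_pos + 1, t_pos + pos_bias)   # line 9
    return anchors, positives, negatives
```

Each positive lies within k steps of its anchor, each negative at least k + ∆- steps away, and the next anchor is placed shortly after the previous positive, so the loop always terminates.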

D.3 REACHABILITY NETWORK TRAINING

Similar to Savinov et al. (2018b), we used the following contrastive loss for training the reachability network: L_RNet = -log(RNet_{k-1}(s_anc, s+)) - log(1 - RNet_{k-1}(s_anc, s-)), where s_anc, s+, s- are the anchor, positive, and negative samples, respectively. The anchor, positive, and negative samples are sampled from the same episode, and their time steps are sampled according to Algorithm 2. The RNet is trained in an off-policy manner from a replay buffer of 60K environment steps that collects the agent's online experience. We found that adaptive scheduling of the RNet updates helps the RNet converge faster: out of 20M total environment steps, for the first 1M, the next 1M, and the remaining 18M environment steps, we updated the RNet every 6K, 12K, and 36K environment steps, respectively. For all three DeepMind Lab environments, the RNet accuracy was ∼0.9 after 1M steps.

Multiple tolerance. To improve the stability of reachability prediction, we used statistics over multiple samples rather than a single-sample estimate, as suggested in Eq. (13). As the sampling method, we simply used multiples of the tolerance: instead of feeding s_{t-(k+∆t)} and s_t to the reachability network, we used s_{t-(k+n∆t)} and s_t for 1 ≤ n ≤ N_∆t, n ∈ N, where N_∆t is the number of tolerance samples. We used the 90th percentile of the N_∆t reachability network outputs, RNet_{k-1}(s_{t-(k+n∆t)}, s_t), as in (Savinov et al., 2018b), as the representative of the samples.
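The contrastive loss and the multiple-tolerance aggregation above can be sketched numerically (a numpy sketch; function names and the epsilon clamp are our own assumptions, and the inputs stand for the sigmoid outputs of the network):

```python
import numpy as np

def rnet_loss(p_pos, p_neg, eps=1e-8):
    """L_RNet = -log RNet(s_anc, s+) - log(1 - RNet(s_anc, s-)),
    averaged over a batch of sigmoid outputs."""
    p_pos, p_neg = np.asarray(p_pos), np.asarray(p_neg)
    return float(np.mean(-np.log(p_pos + eps) - np.log(1.0 - p_neg + eps)))

def aggregate_tolerance(rnet_outputs, q=90):
    """Multiple-tolerance aggregation: the 90th percentile of the N_dt
    RNet outputs RNet_{k-1}(s_{t-(k+n*dt)}, s_t), n = 1..N_dt."""
    return float(np.percentile(rnet_outputs, q))
```

The loss is smaller when positives score near 1 and negatives near 0, and the percentile makes the reachability estimate robust to a few noisy outputs among the N_∆t samples.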

D.4 ARCHITECTURE AND HYPER-PARAMETERS

Following (Savinov et al., 2018b), we used the same CNN architecture as in (Mnih et al., 2015). To fit the architecture, we resized the input observation image to 84 × 84 × 3 and normalized it by dividing the pixel values by 255. For SPRL, we used a smaller reachability network (RNet) architecture than ECO to reduce training time. The RNet is based on a siamese architecture with two branches. Following (Savinov et al., 2018b), ECO used a Resnet-18 (He et al., 2016) architecture with 2-2-2-2 residual blocks and a 512-dimensional output fully-connected layer to implement each branch. For SPRL, we used a Resnet-12 with 2-2-1 residual blocks and a 512-dimensional output fully-connected layer for each branch. The RNet takes two states as inputs, and each state is fed into one branch. The outputs of the two branches are concatenated and forwarded to three 512-dimensional fully-connected layers to produce a one-dimensional sigmoid output, which predicts the reachability between the two state inputs. We also resized the observation to the same dimension as for the policy (i.e., 84 × 84 × 3, which is smaller than the original 120 × 160 × 3 used in (Savinov et al., 2018b)). For all the baselines (i.e., PPO, ECO, ICM, and GT-Grid), we used the best hyperparameters from (Savinov et al., 2018b). For SPRL, we searched over the set of hyperparameters specified in Table 2 and chose the best one in terms of the mean AUC over all the tasks, which is also summarized in Table 2.

E OPTION FRAMEWORK

E.1 PRELIMINARIES

The options framework (Sutton, 1998) defines options as a generalization of actions to temporally extended sequences of actions. Formally, an option consists of three components: a policy π : S × A → [0, 1], a termination condition β : S+ → [0, 1], and an initiation set I ⊆ S. An option (I, π, β) is available in state s if and only if s ∈ I. If the option is taken, actions are selected according to π until the option terminates stochastically according to β.
Then, the option-reward and option-transition models are defined as

r^o_s = E[r_{t+1} + γ r_{t+2} + ... + γ^{k-1} r_{t+k} | E(o, s, t)], (16)
P^o_{ss'} = Σ_{k=1}^{∞} p(s', k) γ^k, (17)

where t + k is the random time at which option o terminates, E(o, s, t) is the event that option o is initiated in state s at time t, and p(s', k) is the probability that the option terminates in s' after k steps. Using the option models, we can re-write the Bellman equation as follows:

V^π(s) = E[r_{t+1} + ... + γ^{k-1} r_{t+k} + γ^k V^π(s_{t+k})] (18)
= Σ_{o∈O} Pr[E(o, s)] [ r^o_s + Σ_{s'} P^o_{ss'} V^π(s') ], (19)

where t + k is the random time at which option o terminates and E(o, s) is the event that option o is initiated in state s.
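As a quick numeric illustration of the option-transition model P^o_{ss'} = Σ_k p(s', k) γ^k (a toy example of ours, not from the paper):

```python
def option_transition(term_probs, gamma):
    """P^o_{ss'} for an option whose termination-time distribution is
    given as {k: p(s', k)} (Eq. (17))."""
    return sum(p * gamma ** k for k, p in term_probs.items())

gamma = 0.99
# toy option: terminates in s' after 2 steps w.p. 0.7, after 3 steps w.p. 0.3
P = option_transition({2: 0.7, 3: 0.3}, gamma)
```

P is strictly below the total termination probability (here 1.0) because each termination time is discounted by γ^k; longer options therefore contribute less.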

E.2 OPTION-BASED VIEW-POINT OF SHORTEST-PATH CONSTRAINT

In this section, we present an option framework-based viewpoint of our shortest-path (SP) constraint. We first show that a (sparse-reward) MDP can be represented as a weighted directed graph whose nodes are rewarding states and whose edges are options. Then, we show that a policy satisfying the SP constraint also maximizes the option-transition probability P^o_{ss'}.

For a given MDP M = (S, A, R, P, ρ, S), let S_R = {s | R(s) ≠ 0} ⊂ S be the set of all rewarding states, where R(s) is the reward upon arrival at state s. In sparse-reward tasks, it is assumed that |S_R| ≪ |S|. Then, we can form a weighted directed graph G^π = (V, E) from the policy π and the given MDP. The vertex set is defined as V = S_R ∪ ρ_0 ∪ S, where S_R is the set of rewarding states, ρ_0 the set of initial states, and S the set of terminal states. Similar to the path set in Definition 2, let T_{s→s'} denote the set of paths transitioning from one vertex s ∈ V to another vertex s' ∈ V:

T_{s→s'} = {τ | s_0 = s, s_{ℓ(τ)} = s', {s_t}_{0<t<ℓ(τ)} ∩ V = ∅}.

Then, the edge from a vertex s ∈ V to another vertex s' ∈ V is defined by an (implicit) option tuple o(s, s') = (I, π, β)_{(s,s')}, where I = {s}, β(s'') = I(s'' = s'), and

π_{(s,s')}(τ) = (1/Z) π(τ) for τ ∈ T_{s→s'}, and 0 otherwise, (20)

where Z is the partition function ensuring ∫ π_{(s,s')}(τ) dτ = 1. Following Eq. (16), the option-reward is given as

r^π_{s,s'} = E_{π_{(s,s')}}[r_{t+1} + γ r_{t+2} + ... + γ^{k-1} r_{t+k} | E(o(s,s'), s, t)]
= E_{π_{(s,s')}}[γ^{k-1} r_{t+k} | E(o(s,s'), s, t)], (23)

where t + k is the random time at which option o(s, s') terminates, and E(o, s, t) is the event that option o(s, s') is initiated in state s at time t. Note that the last equality holds since r_{t+1} = ... = r_{t+k-1} = 0, as {s_{t+1}, ..., s_{t+k-1}} ∩ V = ∅ by the definition of the option policy π_{(s,s')}. Following Eq. (17), the option-transition is given as

P^π_{s,s'} = Σ_{k=1}^{∞} p(s', k) γ^k (24)
= E_π[γ^k | s_0 = s, s_k = s', {r_t}_{t<k} = 0] (25)
= γ^{D^π_nr(s,s')}. (26)
where p(s', k) is the probability that the option terminates in s' after k steps, and D^π_nr(s, s') is the π-distance in Definition 3. Then, we can re-write the shortest-path constraint in terms of P^π_{s,s'} as follows:

Π_SP = {π | ∀ s, s' on a path in T^π_{ŝ,ŝ',nr} s.t. (ŝ, ŝ') ∈ Φ^π : D^π_nr(s, s') = min_{π'} D^{π'}_nr(s, s')} (27)
= {π | ∀ s, s' on a path in T^π_{ŝ,ŝ',nr} s.t. (ŝ, ŝ') ∈ Φ^π : P^π_{s,s'} = max_{π'} P^{π'}_{s,s'}}. (28)

Thus, a policy satisfying the SP constraint also maximizes the option-transition probability. We will use this result in Appendix F.

F SHORTEST-PATH CONSTRAINT: A SINGLE-GOAL CASE

In this section, we discuss a special case of the shortest-path constraint (Section 3.1) where the (stochastic) MDP defines a single-goal task: i.e., there exist a unique initial state s_init ∈ S and a unique goal state s_g ∈ S such that s_g is a terminal state, and R(s) > 0 if and only if s = s_g. We first note that the non-rewarding path set is identical to the path set in this setting, because the condition r_t = 0 (t < ℓ(τ)) from Definition 2 is always satisfied, as R(s) > 0 ⇔ s = s_g and s_{ℓ(τ)} = s_g:

T^π_{s,s',nr} = T^π_{s,s'} = {τ | s_0 = s, s_{ℓ(τ)} = s', p_π(τ) > 0, s' ∉ {s_t}_{t<ℓ(τ)}}.

Again, T^π_{s,s'} is the set of all paths starting from s (i.e., s_0 = s) and ending at s' (i.e., s_{ℓ(τ)} = s'), where the agent visits s' only at the end (i.e., s' ∉ {s_t}_{t<ℓ(τ)}), that can be rolled out by the policy with non-zero probability (i.e., p_π(τ) > 0). We now claim that an optimal policy satisfies the shortest-path constraint. The idea is that, since s_g is the only rewarding and terminal state, maximizing R(τ) = γ^T R(s_g), where s_T = s_g, corresponds to minimizing the number of time steps T to reach s_g. In this setting, a shortest-path policy is indeed optimal.

Lemma 4. For a single-goal MDP, any optimal policy satisfies the shortest-path constraint.

Proof. Let s_init be the initial state and s_g be the goal state. We prove that any optimal policy is a shortest-path policy from the initial state to the goal state, using the fact that s_g is the only rewarding state, i.e., R(s) > 0 entails s = s_g:

π* = argmax_π E_{τ∼π, s∼ρ}[Σ_t γ^t r_t | s_0 = s] (30)
= argmax_π E_{τ∼π}[Σ_t γ^t r_t | s_0 = s_init] (31)
= argmax_π E_{τ∼π}[γ^T R(s_g) | s_0 = s_init, s_{ℓ(τ)} = s_g] (32)
= argmax_π E_{τ∼π}[γ^T | s_0 = s_init, s_{ℓ(τ)} = s_g] (33)
= argmin_π log_γ E_{τ∼π}[γ^T | s_0 = s_init, s_{ℓ(τ)} = s_g] (34)
= argmin_π D^π_nr(s_init, s_g),

where Eq. (33) holds since R(s_g) > 0 from our assumption that R(s) + V*(s) > 0.
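The step from Eq. (34) to the π-distance can be checked numerically: since log_γ is strictly decreasing for γ < 1, maximizing E[γ^T] is the same as minimizing D^π_nr = log_γ E[γ^T]. A toy check of ours (not from the paper), comparing a policy that always reaches the goal in 3 steps against one that sometimes detours:

```python
import math

def pi_distance(T_probs, gamma):
    """D^pi_nr = log_gamma E[gamma^T] for a termination-time
    distribution {T: probability} (single-goal case of Definition 3)."""
    expect = sum(p * gamma ** T for T, p in T_probs.items())
    return math.log(expect) / math.log(gamma)

gamma = 0.9
fast = {3: 1.0}           # always reaches the goal in 3 steps
slow = {3: 0.5, 7: 0.5}   # detours to 7 steps half the time
```

The deterministic 3-step policy attains π-distance exactly 3, while the detouring policy's distance exceeds 3, so the shortest-path policy is the argmin, as Lemma 4 asserts.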

G PROOF OF THEOREM 1

We make the following mild-stochasticity assumptions on the Markov Decision Process (MDP) M (Definitions 8 and 9).

Definition 8 (Mild stochasticity (1)). In the MDP M, there exist an optimal policy π* and a corresponding shortest-path policy π_sp ∈ Π_SP such that for all (s, s') ∈ Φ^π, it holds that p_{π*}(s̄ = s' | s_0 = s) = p_{π_sp}(s̄ = s' | s_0 = s).

Definition 9 (Mild stochasticity (2)). In the MDP M, the optimal policy π* does not visit the same state more than once: for all s ∈ S such that ρ_{π*}(s) > 0, it holds that ρ_{π*}(s) = 1, where ρ_π(s) ≜ E_{s_0∼ρ_0, a∼π(·|s), s'∼P(·|s,a)}[Σ_{t=1}^{T} I(s_t = s)] is the state-visitation count.

In other words, we assume that the optimal policy does not have a cycle. One common property of MDPs that meets this condition is that a reward disappears after being acquired by the agent; this assumption holds in many practical environments, including Atari and DeepMind Lab.

Theorem 1. For any MDP satisfying the mild-stochasticity conditions, an optimal policy π* satisfies the shortest-path constraint: π* ∈ Π_SP.

Proof. For simplicity, we argue from the option-based viewpoint (see Appendix E). By plugging Eq. (23) and Eq. (25) into Eq. (19), we can re-write the Bellman equation of the value function V^π(s) as in Eqs. (37)–(39), where s̄ is the first rewarding state the agent encounters; intuitively, p_π(s̄ = s' | s_0 = s) is the probability that s' is the first rewarding state that policy π encounters when starting from s. From Eq. (28), our goal is to show

π* ∈ Π_SP = {π | ∀(s, s') ∈ T^π_{Φ,nr}, P^π_{s,s'} = P*_{s,s'}}, (40)

where P*_{s,s'} = max_π P^π_{s,s'}. We prove Eq. (40) by contradiction. Suppose π* is an optimal policy such that π* ∉ Π_SP. Then there exists (ŝ, ŝ') ∈ T^{π*}_{Φ,nr} such that P^{π*}_{ŝ,ŝ'} ≠ P*_{ŝ,ŝ'} (Eq. (41)). Now consider a path τ_{ŝ→ŝ'} along which the agent visits ŝ at time t = i and transitions to ŝ' at time t = j > i while not visiting any rewarding state from t = i to t = j, with non-zero probability (i.e., p_{π_sp}(τ) > 0).
We can define the set of such paths as follows:

T_{ŝ→ŝ'} = {τ | ∃(i < j), s_i = ŝ, s_j = ŝ', {s_t}_{i<t<j} ∩ S_IR = ∅, p_{π_sp}(τ) > 0}. (46)

To reiterate the definitions from Definition 6: S_IR = {s | R(s) > 0 or ρ(s) > 0} is the union of all initial and rewarding states, and Φ^π = {(s, s') | s, s' ∈ S_IR, ρ(s) > 0, T^π_{s,s',nr} ≠ ∅} is the subset of S_IR × S_IR that the agent may roll out. From Definition 9 and Eq. (45), the likelihood of a path τ under the policy π is given by Eq. (47), where p_π(τ) is the likelihood of trajectory τ under π and p_π(τ ∈ T_{ŝ→ŝ'}) = ∫_{τ∈T_{ŝ→ŝ'}} p_π(τ) dτ ensures that p_π(τ) is a valid probability density function (i.e., ∫ p_π(τ) dτ = 1). From the path τ_{ŝ→ŝ'} and i, j, we choose two states s_ir, s'_ir ∼ τ_{ŝ→ŝ'}, where

s_ir = max_t {s_t | s_t ∈ S_IR, t ≤ i}, s'_ir = min_t {s_t | s_t ∈ S_IR, j ≤ t}. (48)

Note that such s_ir and s'_ir always exist in τ_{ŝ→ŝ'}, since the initial state and the terminal state satisfy the conditions for s_ir and s'_ir, respectively. Then, we can show that the path between s_ir and s'_ir is not a shortest path. Recalling the definition of D^π_nr(s, s') (Definition 3) and following Eqs. (50)–(56), where Ineq. (56) follows from the fact that γ < 1, we obtain P^{π*}_{s_ir,s'_ir} < P^π_{s_ir,s'_ir}.

H EXTENDED RELATED WORKS

Approximate state abstraction. Approximate state abstraction approaches investigate partitioning an MDP's state space into clusters of similar states while preserving the optimal solution. Researchers have proposed several state-similarity metrics for MDPs. Dean et al. (2013) proposed to use bisimulation metrics (Givan et al., 2003; Ferns et al., 2004), which measure the difference in transition and reward functions. Bertsekas et al. (1988) used the magnitude of the Bellman residual as a metric. Abel et al. (2016; 2018) and Li et al. (2006) used different types of distances between optimal Q-values to measure the similarity between states, bounding the sub-optimality of the optimal value after abstraction. Recently, Castro (2019) extended bisimulation metrics to an approximate version for the deep-RL setting, where a tabular representation of states is not available. Our shortest-path constraint can be seen as a form of state abstraction, in that it also aims to reduce the size of the MDP (i.e., the state and action space) while preserving the "solution quality". However, our method does so by removing sub-optimal policies, not by aggregating similar states (or policies).

Connection to the option framework. Our shortest-path constraint restricts the policy space to the set of shortest-path policies (see Definition 5) between initial and rewarding states. It can be seen as a set of options (Sutton, 1998) transitioning between initial and rewarding states. We refer the readers to Appendix E for a detailed option framework-based formulation of our framework.



It is an instance of potential-based reward shaping, which has an optimality guarantee (Ng et al., 1999). We refer the readers to Appendix F for more detailed discussion and proofs for single-goal MDPs. The GoalLarge task corresponds to the Sparse task in Savinov et al. (2018b), and our figure reproduces the result reported in Savinov et al. (2018b). Savinov et al. (2018b) used four 512-dimensional fully-connected layers.



Figure 1: The k-SP constraint improves the sample efficiency of RL methods in sparse-reward tasks by pruning out suboptimal trajectories from the trajectory space. Intuitively, the k-SP constraint means that when a policy rolls out a trajectory, every sub-path of length k is a shortest path (under a distance metric defined in terms of the policy, discount factor, and transition probability; see Section 3.2 for the formal definition). (Left) An MDP and a rollout tree. (Middle) The paths that satisfy the k-SP constraint; the number of admissible trajectories is drastically reduced. (Right) A path rolled out by a policy satisfies the k-SP constraint if all sub-paths of length k are shortest paths along which no non-zero reward is received. We use a reachability network to test whether a given (sub-)path is a shortest path (see Section 4 for details).


Figure 2: An example observation of (a) FourRooms-11×11, (b) KeyDoors-11×11 in MiniGrid, (c) GoalLarge in DeepMind Lab, and (d) the maze layout (not available to the agent) of GoalLarge.

Figure 3: Progress of average episode reward on MiniGrid tasks. We report the mean (solid curve) and standard error (shaded area) of the performance over six random seeds.

Figure 4: Progress of average episode reward on DeepMind Lab tasks. We report the mean (solid curve) and standard error (shaded area) of the performance over four random seeds.

Figure 5: (Left) 7×7 Tabular four-rooms domain with initial agent location (red) and the goal location (green). (Right) The trajectory space reduction ratio (%) before and after constraining the trajectory space for various k and ∆t with k-SP constraint. Even a small k can greatly reduce the trajectory space with a reasonable tolerance ∆t.

Figure 6: Transition count maps for baselines and SPRL: (a), (b), and (c) are in the reward-free (light green) setting while (d) is in the reward-aware (dark green) setting. In the reward-free settings (a-c), rewarding states are shown in light green only for demonstration purposes. The locations of the agent's initial state (orange) and rewarding states (dark green) are fixed. The episode length is limited to 500 steps.

We qualitatively study what type of policy is learned with the k-SP constraint using the ground-truth RNet in the NineRooms domain of MiniGrid. Figure 6 (a-c) shows the converged behavior of our SPRL (k = 15), a ground-truth count-based exploration (Lai & Robbins, 1985) agent (GT-UCB), and a uniformly random policy (Random) in the reward-free setting. We counted all state transitions (s_t → s_{t+1}) of each agent's roll-outs and averaged over 4 random seeds. Random relies on a random walk and cannot explore further than the initial few rooms. GT-UCB seeks novel states and visits all states uniformly. SPRL learns to take the longest possible shortest path, which results in a "straight" path across the rooms. Note that this represents only part of SPRL's behavior, since our cost also accounts for the existence of non-zero reward (see Eq. (8)). Thus, in (d), we tested SPRL while providing only the existence of non-zero reward (but not the reward magnitude). SPRL learns to take a shortest path between rewarding and initial states, consistent with the shortest-path definition in Definition 7.

Figure 7: Average episode reward of SPRL with varying k = 1, 3, 10, 30 as a function of environment steps for DeepMind Lab tasks. Other hyper-parameters are kept the same as the best hyper-parameter. The best performance is obtained with k = 10.

Figure 9: Average episode reward of SPRL with varying ∆t = 1, 2, 3, 5 as a function of environment steps for DeepMind Lab tasks. Other hyper-parameters are kept the same as the best hyper-parameter. The best performance is obtained with ∆t = 1.

Figure 10: Average episode reward of SPRL with varying ∆t = 10, 15, 25, 50 as a function of environment steps for MiniGrid tasks. Other hyper-parameters are kept the same as the best hyper-parameter. The best performance is obtained with ∆t = 25.

Figure 11: Average episode reward of SPRL with varying observation stacking dimension of 0, 1, 5, 10 as a function of environment steps for DeepMind Lab tasks. Other hyper-parameters are kept the same as the best hyper-parameter. The best performance is obtained without stacking (i.e., #stack=0).

Figure 13: The accuracy of the learned reachability network on (a) FourRooms-7×7 (b) FourRooms-11×11, (c) KeyDoors-7×7 and (d) KeyDoors-11×11 in MiniGrid in terms of environment steps.

Figure 14: The accuracy of the learned reachability network on (a) GoalSmall (b) ObjectMany, and (c) GoalLarge in DeepMind Lab in terms of environment steps.

Figure 15: The performance of SPRL with the learned RNet and with the ground-truth RNet (GT-SPRL) on (a) FourRooms-7×7, (b) FourRooms-11×11, (c) KeyDoors-7×7 and (d) KeyDoors-11×11 in MiniGrid.

Figure 16: The performance of SPRL with and without curriculum learning of RNet on (a) FourRooms-7×7 (b) FourRooms-11×11, (c) KeyDoors-7×7 and (d) KeyDoors-11×11 in MiniGrid.

Figure 17: The performance of SPRL with and without curriculum learning of RNet on (a) GoalSmall (b) ObjectMany, and (c) GoalLarge in DeepMind Lab in terms of environment steps.

Figure 18: The accuracy of the reachability network with and without curriculum learning of RNet on (a) GoalSmall (b) ObjectMany, and (c) GoalLarge in DeepMind Lab in terms of environment steps.


V^π(s) = Σ_{s'∈S_IR} p_π(s̄ = s' | s_0 = s) [R(s') E_{τ∼π}(γ^{ℓ(τ)} | s_0 = s, s̄ = s') + γ P^π_{s,s'} V^π(s')] (37)
= Σ_{s'∈S_IR} p_π(s̄ = s' | s_0 = s) [R(s') P^π_{s,s'} + γ P^π_{s,s'} V^π(s')] (38)
= Σ_{s'∈S_IR} p_π(s̄ = s' | s_0 = s) P^π_{s,s'} [R(s') + γ V^π(s')] (39)

p_π(τ) = p_{π*}(τ ∈ T_{ŝ→ŝ'}) p_{π_sp}(τ | τ ∈ T_{ŝ→ŝ'}) for τ ∈ T_{ŝ→ŝ'}, and p_π(τ) = p_{π*}(τ) otherwise. (47)

D^{π*}_nr(s_ir, s'_ir) := log_γ E_{τ∼π*}[γ^{ℓ(τ)} | τ ∈ T^{π*}_{s_ir,s'_ir,nr}]. (50)

We use ♣ := γ^{ℓ(τ)} | τ ∈ T^{π*}_{s_ir,s'_ir,nr} as a shorthand notation. Then, we have

γ^{D^{π*}_nr(s_ir,s'_ir)} = E_{τ∼π*}[♣] (51)
= p_{π*}(τ ∈ T_{ŝ→ŝ'}) E_{τ∼π*}[♣ | τ ∈ T_{ŝ→ŝ'}] + p_{π*}(τ ∉ T_{ŝ→ŝ'}) E_{τ∼π*}[♣ | τ ∉ T_{ŝ→ŝ'}] (52)
< p_{π*}(τ ∈ T_{ŝ→ŝ'}) E_{τ∼π_sp}[♣ | τ ∈ T_{ŝ→ŝ'}] + p_{π*}(τ ∉ T_{ŝ→ŝ'}) E_{τ∼π*}[♣ | τ ∉ T_{ŝ→ŝ'}] (from Definition 5) (53)
= p_π(τ ∈ T_{ŝ→ŝ'}) E_{τ∼π}[♣ | τ ∈ T_{ŝ→ŝ'}] + p_π(τ ∉ T_{ŝ→ŝ'}) E_{τ∼π}[♣ | τ ∉ T_{ŝ→ŝ'}] (from Eq. (47)) (54)
= E_{τ∼π}[♣] = γ^{D^π_nr(s_ir,s'_ir)} (55)
⟺ D^{π*}_nr(s_ir, s'_ir) > D^π_nr(s_ir, s'_ir) (56)


Table 2: The range of hyperparameters swept over and the final hyperparameters used for our SPRL method in the DeepMind Lab domain.

∃(ŝ, ŝ') ∈ T^{π*}_{Φ,nr} s.t. P^{π*}_{ŝ,ŝ'} ≠ P*_{ŝ,ŝ'}. (41)

Recall the definition P*_{s,s'} = max_π P^π_{s,s'}. Then, for any π, the following statement is true.

D.2 TASKS

We tested our agent and the compared methods on three standard tasks in DeepMind Lab: GoalSmall, GoalLarge, and ObjectMany, which correspond to explore goal locations small, explore goal locations large, and explore object rewards many, respectively. GoalSmall and GoalLarge have a single goal in the maze, but the maze is larger in GoalLarge than in GoalSmall. The agent and goal locations are set randomly at the beginning of each episode, and the episode length is fixed to 1,350 steps for GoalSmall and 1,800 steps for GoalLarge. When the agent reaches the goal, it receives a positive reward and is re-spawned at a random location without terminating the episode, so the agent can reach the goal multiple times within a single episode. Thus, the agent's objective is to reach the goal location as many times as possible within the episode length. ObjectMany has multiple objects in the maze; reaching an object positively rewards the agent, and the object then disappears. The episode length is fixed to 1,800 steps. The agent's objective is to gather as many objects as possible within the episode length.

From Eq. (39), we have V^{π*}(s_ir) < V^π(s_ir), where Eq. (59) holds from the mild-stochasticity (1) and (2) assumptions, and Ineq. (60) holds because P^π_{s_ir,s'_ir} > P^{π*}_{s_ir,s'_ir} and R(s') + γV^{π*}(s') > 0 from the non-negative optimal value assumption (see Section 2). Finally, this is a contradiction, since the optimal value function V*(s) should be the maximum.

