HIT-MDP: LEARNING THE SMDP OPTION FRAMEWORK ON MDPS WITH HIDDEN TEMPORAL EMBEDDINGS

Abstract

The standard option framework is developed on the Semi-Markov Decision Process (SMDP), which is unstable to optimize and sample-inefficient. To this end, we propose the Hidden Temporal MDP (HiT-MDP) and prove that the option-induced HiT-MDP is homomorphically equivalent to the option-induced SMDP. A novel transformer-based framework is introduced to learn options' embedding vectors (rather than conventional option tuples) on HiT-MDPs. We then derive a stable and sample-efficient option-discovery method under the maximum-entropy policy gradient framework. Extensive experiments on challenging Mujoco environments demonstrate HiT-MDP's efficiency and effectiveness: under widely used configurations, HiT-MDP achieves competitive, if not better, performance compared to state-of-the-art baselines on all finite-horizon and transfer-learning environments. Moreover, HiT-MDP significantly outperforms all baselines on infinite-horizon environments while exhibiting smaller variance, faster convergence, and better interpretability. Our work potentially sheds light on the theoretical ground for extending the option framework into a large-scale foundation model.

1. INTRODUCTION

The option framework (Sutton et al., 1999) is one of the most promising frameworks for enabling RL methods to conduct lifelong learning (Mankowitz et al., 2016), with proven benefits in speeding up learning (Bacon, 2018), improving exploration (Harb et al., 2018), and facilitating transfer learning (Zhang & Whiteson, 2019). The standard option framework is developed on the Semi-Markov Decision Process; we refer to it as SMDP-Option. In SMDP-Option, an option is a temporally abstracted action whose execution spans a variable number of time steps. A master policy is employed to compose these options, determining which option should be executed and when it should be stopped. The SMDP formulation has two deficiencies that severely impair options' applicability in a broader context (Jong et al., 2008). The first is sample inefficiency: since the execution of an option persists over multiple time steps, one update of the master policy consumes multiple steps of samples (Levy & Shimkin, 2011; Daniel et al., 2016; Bacon et al., 2017). The second is unstable optimization: SMDP-based optimization algorithms are notoriously sensitive to hyperparameters, and therefore often exhibit large variance (Wulfmeier et al., 2020) and encounter convergence issues (Klissarov et al., 2017). Extensive research has tried to tackle these issues from aspects such as improving the policy iteration procedure (Sutton et al., 1999; Daniel et al., 2016; Bacon et al., 2017) and adding extra constraints to option-discovery objectives (Khetarpal et al., 2020; Wulfmeier et al., 2020; Hyun et al., 2019). However, few works (Levy & Shimkin, 2011; Smith et al., 2018; Zhang & Whiteson, 2019) approach them from the perspective of improving the underlying decision process. Our work is largely different from the literature above; more details are discussed in Section 6.
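The temporally extended execution described above can be sketched in a few lines of code. The sketch below is our own minimal illustration, not the paper's implementation; `env_step`, `action_policy`, and `termination` are hypothetical callables standing in for the environment, the option's action policy, and its termination function:

```python
import random

def run_option(state, option, env_step, action_policy, termination):
    """Execute one option SMDP-style until its termination function fires.

    state         : current environment state
    option        : index o of the option being executed
    env_step      : env_step(state, action) -> (next_state, reward)
    action_policy : action_policy(o, state) -> action sampled from the option's policy
    termination   : termination(o, state) -> probability of stopping in this state
    """
    total_reward, steps = 0.0, 0
    while True:
        action = action_policy(option, state)
        state, reward = env_step(state, action)
        total_reward += reward
        steps += 1
        # Sample b ~ Bernoulli(termination prob); the option stops when b = 1.
        if random.random() < termination(option, state):
            return state, total_reward, steps
```

Note that the master policy makes only one decision per call, however many environment steps the option consumes — which is exactly the sample-inefficiency issue discussed above.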
In this work, we present a counterintuitive finding: the SMDP-formulated option framework has an MDP equivalent that is still able to temporally extend the execution of abstracted actions. MDP-based options address the deficiencies of SMDP-based ones in two respects: (1) sample efficiency, i.e., MDP policies can be optimized at every sampling step (Bacon, 2018); and (2) more stable optimization, i.e., the convergence of MDP algorithms is well justified theoretically and they exhibit smaller variance (Schulman et al., 2015). In this paper, we propose the Hidden Temporal MDP (HiT-MDP) and theoretically prove its equivalence to SMDP-Option. We first formulate HiT-MDP as an HMM-like probabilistic graphical model (PGM), introducing temporally dependent latent variables into the MDP to preserve temporal abstractions. By exploiting conditional independencies in the PGM, we prove that HiT-MDP is homomorphically equivalent (Ravindran, 2003) to SMDP-Option. To the best of our knowledge, this is the first work proposing an MDP equivalent of the standard option framework. To solve for optimal values on HiT-MDPs, we devise a Markovian option-value function V[s_t, o_{t-1}] and prove that it is an unbiased estimator of the standard value function V[s_t]. We further develop the Hidden Temporal Bellman Equation in order to derive the policy evaluation theorem for HiT-MDPs. We also show that the Markovian option-value function has a variance reduction effect. As a result, HiT-MDP is a general-purpose MDP that can be updated at every sampling step, which naturally addresses the sample inefficiency issue. We solve the learning problem by deriving a stable on-policy policy gradient method under the maximum entropy reinforcement learning framework. One difficulty of learning standard option frameworks is that they place no constraint on the quality of options (Harb et al., 2018).
Standard option frameworks tend to learn either degenerate options (Harb et al., 2018) (short execution time) that switch back and forth frequently, or dominant options (Zhang & Whiteson, 2019) (long execution time) that execute through the whole episode. We tackle this problem by proposing the Maximum entropy Options Policy Gradient (MOPG) algorithm. MOPG includes an information-theoretic intrinsic reward to encourage consecutive executions of options and entropy terms to encourage exploration over options. The whole algorithm can be solved in an end-to-end manner under the structured variational inference framework. We theoretically prove that optimizing through MOPG converges to the optimal trajectory. We conduct experiments on challenging Mujoco (Todorov et al., 2012; Brockman et al., 2016b; Tunyasuvunakool et al., 2020) environments. Thorough empirical results demonstrate that under widely used configurations, HiT-MDP achieves competitive, if not better, performance compared to state-of-the-art baselines on all finite-horizon and transfer-learning environments. Moreover, HiT-MDP significantly outperforms all baselines on infinite-horizon environments while exhibiting smaller variance, faster convergence, and better interpretability.
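The intuition behind such reward shaping can be illustrated schematically. The sketch below is our own simplification, not the actual MOPG objective (all names and coefficient values are ours): a bonus for keeping the previous option active discourages degenerate back-and-forth switching, while an entropy term over the master policy's option distribution discourages a single dominant option:

```python
import math

def shaped_reward(env_reward, option_t, option_prev, master_probs,
                  alpha=0.1, beta=0.01):
    """Schematic reward shaping in the spirit of MOPG (names are ours).

    env_reward   : scalar reward from the environment
    option_t     : option index chosen at time t
    option_prev  : option index active at time t-1
    master_probs : master policy's distribution over options at time t
    alpha, beta  : illustrative trade-off coefficients
    """
    # Intrinsic term: reward consecutive execution of the same option,
    # penalizing (by omission) frequent back-and-forth switching.
    consistency_bonus = alpha if option_t == option_prev else 0.0
    # Entropy term: reward a diverse master policy so that no single
    # option dominates the whole episode.
    entropy = -sum(p * math.log(p) for p in master_probs if p > 0)
    return env_reward + consistency_bonus + beta * entropy
```

A uniform master policy earns the full entropy bonus, while a near-deterministic one earns almost none, so the two terms pull against both failure modes described above.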

2. BACKGROUND

Markov Decision Process: A Markov Decision Process (Puterman, 1994) M = {S, A, r, P, γ} consists of a state space S, an action space A, a state transition function P(s_{t+1}|s_t, a_t) : S × A → S, a discount factor γ ∈ (0, 1), and a reward function r(s, a) = E[r|s, a] : S × A → R, which is the expectation of the reward r_{t+1} ∈ R received from the environment after executing action a_t at state s_t. A policy π = P(a|s) : A × S → [0, 1] is a probability distribution over actions conditioned on states. The discounted return is defined as G_t = Σ_{k=0}^{N} γ^k r_{t+k+1}. The value function V[s_t] = E_{τ∼π}[G_t|s_t] is the expected return starting at state s_t, where the trajectory τ = {s_t, a_t, r_{t+1}, s_{t+1}, . . .} follows policy π thereafter. The action-value function is defined as Q[s_t, a_t] = E_{τ∼π}[G_t|s_t, a_t]. Homomorphic Equivalence: Givan et al. (2003) define the equivalence relation between MDPs as symmetric equivalence (the bisimulation relation). Ravindran (2003) extends their work to homomorphic equivalence, which allows defining symmetries between an MDP and an SMDP. Consider two processes: an MDP M = {S, A, R, P, γ} with trajectory τ and an SMDP M̃ = {S̃, A, R̃, P̃, γ} with trajectory τ̃, where M and M̃ share the same action space A. A homomorphism B is a tuple of surjective partition functions. M and M̃ are homomorphically equivalent if 1) for every state-action pair {s, a} there exists a many-to-one corresponding equivalent state-action pair {s̃, ã} such that {s, a}/B = {s̃, ã}/B, also denoted B({s̃, ã}) = {s, a}, and 2) the following conditions hold:



1. P̃(τ̃/B) ≡ P(τ/B), and B is a surjection;
2. r̃(τ̃/B) ≡ r(τ/B).

The SMDP-based Option Framework: In SMDP-Option (Sutton et al., 1999; Bacon, 2018), an option is a triple (I_o, π_o, β_o) ∈ O, where O denotes the option set; the subscript o ∈ O = {1, 2, . . . , K} is a positive integer index denoting the o-th triple, where K is the number of options; I_o is an initiation set indicating where the option can be initiated; π_o = P_o(a|s) : A × S → [0, 1] is the action policy of the o-th option; and β_o = P_o(b = 1|s) : S → [0, 1], where b ∈ {0, 1}, is a termination function. For clarity, we use P_o(b = 1|s) instead of β_o, which is widely used in previous

