HIT-MDP: LEARNING THE SMDP OPTION FRAMEWORK ON MDPS WITH HIDDEN TEMPORAL EMBEDDINGS

Abstract

The standard option framework is developed on the Semi-Markov Decision Process (SMDP), which is unstable to optimize and sample inefficient. To this end, we propose the Hidden Temporal MDP (HiT-MDP) and prove that the option-induced HiT-MDP is homomorphically equivalent to the option-induced SMDP. A novel transformer-based framework is introduced to learn options' embedding vectors (rather than conventional option tuples) on HiT-MDPs. We then derive a stable and sample-efficient option discovery method under the maximum-entropy policy gradient framework. Extensive experiments on challenging MuJoCo environments demonstrate HiT-MDP's efficiency and effectiveness: under widely used configurations, HiT-MDP achieves competitive, if not better, performance compared to state-of-the-art baselines on all finite-horizon and transfer-learning environments. Moreover, HiT-MDP significantly outperforms all baselines on infinite-horizon environments while exhibiting smaller variance, faster convergence, and better interpretability. Our work potentially sheds light on the theoretical groundwork for extending the option framework into a large-scale foundation model.

1. INTRODUCTION

The option framework (Sutton et al., 1999) is one of the most promising frameworks for enabling RL methods to conduct lifelong learning (Mankowitz et al., 2016) and has proven benefits in speeding up learning (Bacon, 2018), improving exploration (Harb et al., 2018), and facilitating transfer learning (Zhang & Whiteson, 2019). The standard option framework is developed on the Semi-Markov Decision Process; we refer to it as the SMDP-Option. In the SMDP-Option, an option is a temporally abstracted action whose execution spans a variable number of time steps. A master policy is employed to compose these options, determining which option should be executed and when it should be terminated. The SMDP formulation has two deficiencies that severely impair options' applicability in a broader context (Jong et al., 2008). The first deficiency is sample inefficiency: since the execution of an option persists over multiple time steps, one update of the master policy consumes many steps' worth of samples (Levy & Shimkin, 2011; Daniel et al., 2016; Bacon et al., 2017). The second deficiency is unstable optimization: SMDP-based optimization algorithms are notoriously sensitive to hyperparameters, and they therefore often exhibit large variance (Wulfmeier et al., 2020) and encounter convergence issues (Klissarov et al., 2017). Extensive research has tried to tackle these issues from aspects such as improving the policy iteration procedure (Sutton et al., 1999; Daniel et al., 2016; Bacon et al., 2017) and adding extra constraints to option discovery objectives (Khetarpal et al., 2020; Wulfmeier et al., 2020; Hyun et al., 2019). However, few works (Levy & Shimkin, 2011; Smith et al., 2018; Zhang & Whiteson, 2019) approach the problem from the perspective of improving the underlying decision process. Our work differs substantially from the literature above; more details are discussed in Section 6.
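To make the sample-inefficiency argument concrete, the following is a minimal, purely illustrative sketch of the call-and-return execution model described above: a master policy chooses an option only at termination points, and the option's intra-option policy then acts for a variable number of environment steps. All names, the toy chain environment, and the Bernoulli termination rule are our own illustrative assumptions, not components of the paper's method.

```python
import random

random.seed(0)


class Option:
    """Illustrative option: a fixed intra-option action plus a
    Bernoulli termination condition (both are toy assumptions)."""

    def __init__(self, name, action, term_prob):
        self.name = name
        self.action = action        # intra-option policy: always this action
        self.term_prob = term_prob  # probability of terminating each step

    def policy(self, state):
        return self.action

    def terminates(self, state):
        return random.random() < self.term_prob


def run_episode(master_policy, step, state, horizon=20):
    """Call-and-return execution: the master policy is consulted only
    when the current option terminates, so a single master-level
    decision spans many environment steps -- the root of the
    sample-inefficiency issue discussed in the text."""
    trajectory = []
    option = None
    for t in range(horizon):
        if option is None or option.terminates(state):
            option = master_policy(state)  # SMDP-level decision point
        action = option.policy(state)
        state = step(state, action)
        trajectory.append((t, option.name, action, state))
    return trajectory


# Toy chain environment: the state is an integer position on a line.
options = [Option("left", -1, 0.3), Option("right", +1, 0.3)]
master = lambda s: options[0] if s > 0 else options[1]
traj = run_episode(master, step=lambda s, a: s + a, state=0)
print(len(traj))  # one episode of `horizon` environment steps
```

Counting the distinct option segments in `traj` shows how few master-level decisions occur per episode, which is exactly why SMDP-level updates consume many environment samples each.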
In this work, we present a counterintuitive finding: the SMDP-formulated option framework has an MDP equivalent that is still able to temporally extend the execution of abstracted actions. MDP-based options address the deficiencies of SMDP-based ones from two aspects: (1) sample efficiency, i.e., MDP policies can be optimized at every sampling step (Bacon, 2018); and (2) optimization stability, i.e., the convergence of MDP algorithms is theoretically well justified and they exhibit smaller variance (Schulman et al., 2015). In this paper, we propose the Hidden Temporal MDP (HiT-MDP) and theoretically prove its equivalence to the SMDP-Option. We first formulate HiT-

