HIT-MDP: LEARNING THE SMDP OPTION FRAMEWORK ON MDPS WITH HIDDEN TEMPORAL EMBEDDINGS

Abstract

The standard option framework is developed on the Semi-Markov Decision Process (SMDP), which is unstable to optimize and sample inefficient. To address this, we propose the Hidden Temporal MDP (HiT-MDP) and prove that the option-induced HiT-MDP is homomorphically equivalent to the option-induced SMDP. A novel transformer-based framework is introduced to learn options' embedding vectors (rather than conventional option tuples) on HiT-MDPs. We then derive a stable and sample-efficient option discovery method under the maximum-entropy policy gradient framework. Extensive experiments on challenging MuJoCo environments demonstrate HiT-MDP's efficiency and effectiveness: under widely used configurations, HiT-MDP achieves competitive, if not better, performance compared to state-of-the-art baselines on all finite horizon and transfer learning environments. Moreover, HiT-MDP significantly outperforms all baselines on infinite horizon environments while exhibiting smaller variance, faster convergence, and better interpretability. Our work potentially sheds light on the theoretical grounds for extending the option framework into a large-scale foundation model.

1. INTRODUCTION

The option framework (Sutton et al., 1999) is one of the most promising frameworks for enabling RL methods to conduct lifelong learning (Mankowitz et al., 2016) and has proven benefits in speeding up learning (Bacon, 2018), improving exploration (Harb et al., 2018), and facilitating transfer learning (Zhang & Whiteson, 2019). The standard option framework is developed on the Semi-Markov Decision Process (SMDP); we refer to it as SMDP-Option. In SMDP-Option, an option is a temporally abstracted action whose execution spans a variable number of time steps. A master policy composes these options and determines which option should be executed and when it should stop. The SMDP formulation has two deficiencies that severely impair options' applicability in a broader context (Jong et al., 2008). The first is sample inefficiency. Since the execution of an option persists over multiple time steps, one update of the master policy consumes several steps of samples (Levy & Shimkin, 2011; Daniel et al., 2016; Bacon et al., 2017). The second is unstable optimization. SMDP-based optimization algorithms are notoriously sensitive to hyperparameters; they therefore often exhibit large variance (Wulfmeier et al., 2020) and encounter convergence issues (Klissarov et al., 2017). Extensive research has tried to tackle these issues by improving the policy iteration procedure (Sutton et al., 1999; Daniel et al., 2016; Bacon et al., 2017) and by adding extra constraints to option discovery objectives (Khetarpal et al., 2020; Wulfmeier et al., 2020; Hyun et al., 2019). However, few works (Levy & Shimkin, 2011; Smith et al., 2018; Zhang & Whiteson, 2019) approach the problem by improving the underlying decision process. Our work differs substantially from the literature above; more details are discussed in Section 6.
In this work, we present a counterintuitive finding: the SMDP-formulated option framework has an MDP equivalent that is still able to temporally extend the execution of abstracted actions. MDP-based options address the deficiencies of SMDP-based ones in two respects: (1) sample efficiency, i.e., MDP policies can be optimized at every sampling step (Bacon, 2018); and (2) more stable optimization, i.e., the convergence of MDP algorithms is theoretically well justified and they exhibit smaller variance (Schulman et al., 2015). In this paper, we propose the Hidden Temporal MDP (HiT-MDP) and theoretically prove its equivalence to the SMDP-Option. We first formulate HiT-MDP as an HMM-like probabilistic graphical model (PGM) and introduce temporally dependent latent variables into the MDP to preserve temporal abstractions. By exploiting conditional independencies in the PGM, we prove that the HiT-MDP is homomorphically equivalent (Ravindran, 2003) to SMDP-Option. To the best of our knowledge, this is the first work proposing an MDP equivalent of the standard option framework. To solve for the optimal values of HiT-MDPs, we devise a Markovian Option-Value Function $V[s_t, o_{t-1}]$ and prove that it is an unbiased estimate of the standard value function $V[s_t]$. We further develop the Hidden Temporal Bellman Equation in order to derive the policy evaluation theorem for HiT-MDPs. We also show that the Markovian Option-Value Function has a variance reduction effect. As a result, HiT-MDP is a general-purpose MDP that can be updated at every sampling step and thus naturally addresses the sample inefficiency issue. We solve the learning problem by deriving a stable on-policy policy gradient method under the maximum entropy reinforcement learning framework. One difficulty of learning standard option frameworks is that they place no constraint on the quality of options (Harb et al., 2018).
Standard option frameworks tend to learn either degenerate options (Harb et al., 2018) (short execution time) that switch back and forth frequently, or dominant options (Zhang & Whiteson, 2019) (long execution time) that execute through the whole episode. We tackle this problem by proposing the Maximum entropy Options Policy Gradient (MOPG) algorithm. MOPG includes an information-theoretic intrinsic reward that encourages consecutive executions of options, and entropy terms that encourage exploration of options. The whole algorithm can be solved end-to-end under the structural variational inference framework. We theoretically prove that optimizing through MOPG converges to the optimal trajectory. We conduct experiments on challenging MuJoCo (Todorov et al., 2012; Brockman et al., 2016b; Tunyasuvunakool et al., 2020) environments. Thorough empirical results demonstrate that, under widely used configurations, HiT-MDP achieves competitive, if not better, performance compared to state-of-the-art baselines on all finite horizon and transfer learning environments. Moreover, HiT-MDP significantly outperforms all baselines on infinite horizon environments while exhibiting smaller variance, faster convergence, and better interpretability.

2. BACKGROUND

Markov Decision Process: A Markov Decision Process (Puterman, 1994) $M = \{S, A, r, P, \gamma\}$ consists of a state space $S$, an action space $A$, a state transition function $P(s_{t+1} \mid s_t, a_t) : S \times A \to S$, a discount factor $\gamma \in \mathbb{R}$, and a reward function $r(s, a) = \mathbb{E}[r \mid s, a] : S \times A \to \mathbb{R}$, which is the expectation of the reward $r_{t+1} \in \mathbb{R}$ received from the environment after executing action $a_t$ at state $s_t$. A policy $\pi = P(a \mid s) : A \times S \to [0, 1]$ is a probability distribution over actions conditioned on states. The discounted return is defined as $G_t = \sum_{k=0}^{N} \gamma^k r_{t+k+1}$, where $\gamma \in (0, 1)$ is the discount factor. The value function $V[s_t] = \mathbb{E}_{\tau \sim \pi}[G_t \mid s_t]$ is the expected return when starting at state $s_t$ and following policy $\pi$ thereafter along the trajectory $\tau = \{s_t, a_t, r_{t+1}, s_{t+1}, \dots\}$. The action-value function is defined as $Q[s_t, a_t] = \mathbb{E}_{\tau \sim \pi}[G_t \mid s_t, a_t]$.

Homomorphic Equivalence: Givan et al. (2003) define the equivalence relation between MDPs as symmetric equivalence (bisimulation relation). Ravindran (2003) extends their work to homomorphic equivalence, which allows defining symmetries between an MDP and an SMDP. Consider two processes, an MDP $M = \{S, A, R, P, \gamma\}$ with trajectory $\tau$ and an SMDP $\tilde{M} = \{\tilde{S}, A, \tilde{R}, \tilde{P}, \gamma\}$ with trajectory $\tilde{\tau}$, and assume that $M$ and $\tilde{M}$ share the same action space $A$. A homomorphism $B$ is a tuple of surjective partition functions. $M$ and $\tilde{M}$ are homomorphically equivalent if (1) for every state-action pair $\{s, a\}$ there exists a many-to-one corresponding equivalent state-action pair $\{\tilde{s}, \tilde{a}\}$ such that $\{s, a\}/B = \{\tilde{s}, \tilde{a}\}/B$, also denoted $B(\{\tilde{s}, \tilde{a}\}) = \{s, a\}$, and (2) the following conditions hold:
1. $P(\tau / B) \equiv P(\tilde{\tau} / B)$, and $B$ is a surjection,
2. $r(\tau / B) \equiv r(\tilde{\tau} / B)$.

The SMDP-based Option Framework: In SMDP-Option (Sutton et al., 1999; Bacon, 2018), a master policy $P(o_t \mid s_t)$, where $o \in O$, is used to sample which option will be executed.
Therefore, the dynamics (stochastic process) of the option framework are written as:

$$P(\tau) = P(s_0) P(o_0) P_{o_0}(a_0 \mid s_0) \prod_{t=1}^{\infty} P(s_t \mid s_{t-1}, a_{t-1}) P_{o_t}(a_t \mid s_t) \left[ P_{o_{t-1}}(b_t = 0 \mid s_t) \mathbb{1}_{o_t = o_{t-1}} + P_{o_{t-1}}(b_t = 1 \mid s_t) P(o_t \mid s_t) \right], \tag{1}$$

where $\tau = \{s_0, o_0, a_0, s_1, o_1, a_1, \dots\}$ denotes the trajectory of the option framework. $\mathbb{1}$ is an indicator function that is true only when $o_t = o_{t-1}$ (notice that $o_{t-1}$ is the realization at time $t-1$). Therefore, under this formulation the option framework is defined as a Semi-Markov process, since the dependency on an activated option $o$ can span a variable amount of time (Sutton et al., 1999).
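To make the factorization in Eq. 1 concrete, the following minimal Python sketch samples a trajectory step by step in the same order as the factors appear. It is an illustration only: the function names, the uniform prior over the initial option, and the toy components (integer states, three options, options that emit their own index as the action) are assumptions introduced here and are not part of the original formulation.

```python
import random


def sample_option_trajectory(s0, n_options, master_policy, action_policy,
                             termination_prob, transition, n_steps, rng):
    """Sample [(s_t, o_t, a_t), ...] following the factorization in Eq. 1."""
    s = s0
    o = rng.randrange(n_options)          # o_0 ~ P(o_0); a uniform prior is assumed here
    a = action_policy(o, s, rng)          # a_0 ~ P_{o_0}(a_0 | s_0)
    trajectory = [(s, o, a)]
    for _ in range(n_steps):
        s_next = transition(s, a, rng)    # s_t ~ P(s_t | s_{t-1}, a_{t-1})
        # b_t ~ P_{o_{t-1}}(b_t | s_t): does the previous option terminate?
        terminated = rng.random() < termination_prob(o, s_next)
        if terminated:
            o = master_policy(s_next, rng)   # b_t = 1: o_t ~ P(o_t | s_t)
        # b_t = 0: o_t = o_{t-1}, the indicator branch of Eq. 1
        a = action_policy(o, s_next, rng)    # a_t ~ P_{o_t}(a_t | s_t)
        s = s_next
        trajectory.append((s, o, a))
    return trajectory


# Toy components (all hypothetical, for illustration only).
def toy_master(s, rng):
    return rng.randrange(3)


def toy_action(o, s, rng):
    return o


def toy_transition(s, a, rng):
    return s + 1


def never_terminate(o, s):
    return 0.0


traj = sample_option_trajectory(0, 3, toy_master, toy_action,
                                never_terminate, toy_transition, 10,
                                random.Random(0))
```

With a termination probability of zero, the initial option is held for the whole trajectory, which is the semi-Markov behavior described above: the active option can persist over an arbitrary number of steps.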

3. AN MDP EQUIVALENCE OF THE SMDP-BASED OPTION FRAMEWORK

In this section, we propose the Hidden Temporal MDP (HiT-MDP). We first reformulate the SMDP-Option as an HMM-like Probabilistic Graphical Model (PGM). We then propose the HiT-MDP as a marginalization of the PGM and prove that the HiT-MDP is homomorphically equivalent (Ravindran, 2003) to the SMDP-Option. We also derive the HiT-MDP's Bellman Equation and prove its convergence by presenting the policy evaluation theorem. As a result, HiT-MDP is a general-purpose MDP that can be combined off-the-shelf with any policy optimization algorithm. We propose an efficient learning algorithm and complete the proof of the policy iteration theorem in Section 4.

3.1. AN MDP FORMULATION OF THE OPTION FRAMEWORK

Following Bishop (2006)'s formulation of mixture distributions, we redefine the option random variable $o \in O = \{1, 2, \dots, K\}$, originally an integer index, as a $K$-dimensional one-hot vector $\bar{o} \in \bar{O} = \{0, 1\}^K$, where $K$ is the number of options. We further employ the one-hot vector to reformulate the termination function and action function of each option into two mixture distributions by introducing extra dependencies on $\bar{o}$:

$$P(a_t \mid s_t, \bar{o}_t) = \prod_{o \in \bar{o}_t} P_o(a_t \mid s_t)^{o}, \qquad P(b_t \mid s_t, \bar{o}_{t-1}) = \prod_{o \in \bar{o}_{t-1}} P_o(b_t \mid s_t)^{o} \tag{2}$$

Since the option random variable $\bar{o}$ is now a one-hot vector, an instantiation $\bar{o} \triangleq k$ denotes the activation of option $k$; by definition, only the $k$-th entry of $\bar{o}_t$ is 1 and all other entries are 0. Therefore, we have $P_{o_t}(a_t \mid s_t) = P(a_t \mid s_t, \bar{o}_t = \bar{o}_t)$ and $\beta_{o_{t-1}} = P_{o_{t-1}}(b_t = 1 \mid s_t) = P(b_t = 1 \mid s_t, \bar{o}_{t-1} = \bar{o}_{t-1})$. The third reformulation is a novel MDP mixture master policy $P(\bar{o}_t \mid s_t, b_t, \bar{o}_{t-1})$, a mixture distribution that contains the SMDP master policy $P(\bar{o}_t \mid s_t)$ and a degenerate probability as mixture components, obtained by adding two extra dependencies on $b_t$ and $\bar{o}_{t-1}$:

$$P(\bar{o}_t \mid s_t, b_t, \bar{o}_{t-1}) = P(\bar{o}_t \mid s_t)^{b_t} \, P(\bar{o}_t \mid \bar{o}_{t-1})^{1 - b_t},$$

where the indicator function $\mathbb{1}_{o_t = o_{t-1}}$ used in Eq. 1 is now redefined as a degenerate probability distribution (Puterman, 1994):

$$P(\bar{o}_t \mid \bar{o}_{t-1}) = \begin{cases} 1 & \text{if } \bar{o}_t = \bar{o}_{t-1}, \\ 0 & \text{if } \bar{o}_t \neq \bar{o}_{t-1}, \end{cases}$$

and the joint distribution can be written as:

$$P(\tau) = P(s_0) P(\bar{o}_0) P(a_0 \mid s_0, \bar{o}_0) \prod_{t=1}^{\infty} P(s_t \mid s_{t-1}, a_{t-1}) P(a_t \mid s_t, \bar{o}_t) \sum_{b_t} P(b_t \mid s_t, \bar{o}_{t-1}) P(\bar{o}_t \mid b_t, s_t, \bar{o}_{t-1}) \tag{4}$$

Although the mixture master policy in Eq. 4 is MDP-formulated, the master policy $P(\bar{o}_t \mid s_t)$, a mixture component within it, is still SMDP-formulated and hence cannot be updated by MDP-based algorithms. By marginalizing over the termination variable $b_t$ in Eq. 4, i.e., $\sum_{b_t} P(b_t \mid s_t, \bar{o}_{t-1}) P(\bar{o}_t \mid b_t, s_t, \bar{o}_{t-1})$, we propose the Markovian master policy $P(\bar{o}_t \mid s_t, \bar{o}_{t-1})$ to model this marginal distribution explicitly:

$$P(\tau) = P(s_0) P(\bar{o}_0) P(a_0 \mid s_0, \bar{o}_0) \prod_{t=1}^{\infty} P(s_t \mid s_{t-1}, a_{t-1}) P(a_t \mid s_t, \bar{o}_t) P(\bar{o}_t \mid s_t, \bar{o}_{t-1}) \tag{5}$$

where Eq. 5 denotes the joint distribution of the PGM. In this formulation, $P(\tau)$ is actually an HMM with $s_t, a_t$ as observable random variables and $\bar{o}_t$ as latent variables.
We use $\pi^O(s_t, \bar{o}_{t-1}) = P(\bar{o}_t \mid s_t, \bar{o}_{t-1})$ to denote the master policy and $\pi^A(s_t, \bar{o}_t) = P(a_t \mid s_t, \bar{o}_t)$ to denote the action policy. Figure 1 shows the PGM (Eq. 5).
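The marginalization over the termination variable $b_t$ that yields the Markovian master policy can be checked numerically for a single state. The sketch below assumes a discrete option set and takes the SMDP master policy and the termination probability of the previous option as given numbers; the function name and the example values are hypothetical.

```python
def markovian_master_policy(master_probs, beta_prev, prev_option):
    """Marginalize b_t out of the mixture master policy.

    master_probs: P(o_t | s_t) for each option (SMDP master policy).
    beta_prev:    P(b_t = 1 | s_t, o_{t-1}), termination probability
                  of the previously active option.
    prev_option:  index of o_{t-1}.

    Returns P(o_t | s_t, o_{t-1})
        = (1 - beta) * 1[o_t = o_{t-1}] + beta * P(o_t | s_t).
    """
    return [
        (1.0 - beta_prev) * (1.0 if k == prev_option else 0.0) + beta_prev * p
        for k, p in enumerate(master_probs)
    ]


# Example values (assumed): three options, option 2 was active and
# terminates with probability 0.1 in the current state.
probs = markovian_master_policy([0.5, 0.3, 0.2], beta_prev=0.1, prev_option=2)
```

In this example most of the probability mass stays on the previously active option, because a low termination probability places most of the weight on the degenerate component.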

3.2. THE HIDDEN TEMPORAL MDPS (HIT-MDPS)

Given the PGM (Eq. 5, Figure 1), the family of Hidden Temporal MDPs (HiT-MDPs) can be described by a tuple $M = \{\bar{S}, \bar{A}, r, P, \phi, \gamma\}$, where $\bar{S} \doteq S \times \bar{O}$ is an augmented state space, $\bar{A} \doteq A \times \bar{O}$ is an augmented action space, and $\phi = P(\bar{o}_t \mid s_t) = P(\bar{o}_t \mid s_t, \bar{o}_{t-1})$ is the emission function for hidden variables. The joint distribution of HiT-MDPs factorizes as in Eq. 5 (see Appendix C for derivations). We define a partition function $B(\bar{s}, \bar{a}) = B(\{\bar{o}_{t-1}, a_t, \bar{o}_t, s_t\}) = (o_{t-1}, a_t, o_t, s_t)$ (proofs in Appendix C), which maps the HiT-MDP trajectory $\tau = \{s_0, \bar{o}_0, a_0, s_1, \bar{o}_1, a_1, \dots\}$ to the SMDP-Option trajectory $\tilde{\tau} = \{s_0, o_0, a_0, s_1, o_1, a_1, \dots\}$. Therefore, under the surjection $B$, the dynamics of the SMDP-Option in Eq. 1 are equivalent to those of the HiT-MDP: $P(\tau / B) = P(\tilde{\tau} / B)$. With this equality in hand, proving the equivalence between the SMDP-Option and the HiT-MDP requires showing that both share the same expected reward. This is non-trivial since, compared to the SMDP-Option, the MDP formulation introduces extra dependencies on $\bar{o}$. However, in Appendix C, by exploiting conditional independencies, we prove that they share the same expected return under the surjection $B$. Therefore, the SMDP-based option framework has an MDP-based equivalent:

Theorem 3.1. By the definition of Bisimulation Relation, the SMDP-based option framework, which employs Markovian options, has an underlying MDP equivalent because:
1. $P(\tau / B) = P(\tilde{\tau} / B)$,
2. $r(\tau / B) = r(\tilde{\tau} / B)$.

The Markovian option-value function and the option-action value function are defined as:

$$V[s_t, \bar{o}_{t-1}] = \mathbb{E}[G_t \mid s_t, \bar{o}_{t-1}] = \sum_{\bar{o}_t} P(\bar{o}_t \mid s_t, \bar{o}_{t-1}) \, \mathbb{E}_{a_t \sim \pi^A}\left[ Q_A[s_t, \bar{o}_t, a_t] \right], \tag{6}$$

$$Q_A[s_t, \bar{o}_t, a_t] = \mathbb{E}[G_t \mid s_t, \bar{o}_t, a_t] = r(s, a) + \mathbb{E}_{s_{t+1} \sim \pi^A}\left[ \sum_{l=1} \gamma^l r_{t+l} \right]. \tag{7}$$

As with the standard Q-function and value function, we can relate the Q-function to the Markovian option-value function at a future state via a Hidden Temporal Bellman Operator $T_H$.

Theorem 3.2. The option-action value function Eq. 7 satisfies the Bellman Operator $T_H$:

$$T_H Q_A[s_t, \bar{o}_t, a_t] = \mathbb{E}[G_t \mid s_t, \bar{o}_t, a_t] = r(s, a) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \, V[s_{t+1}, \bar{o}_t], \tag{8}$$

where the Markovian option-value function is given by Eq. 6.

Proof. See Appendix D.

We can obtain the option-action value function for any policy by repeatedly applying $T_H$, and the sequence converges to the optimal value function:

Theorem 3.3. (Markovian Option Policy Evaluation Theorem). Assume that throughout our computation $Q_A[\cdot, \cdot]$ and $V[\cdot]$ are bounded and $|A| < \infty$. Then the sequence $Q^k_A$ defined by $Q^{k+1}_A = T_H Q^k_A$ converges to the option-action value function $Q^{\pi^A}_A$ as $k \to \infty$.

Proof. As with the standard convergence results for policy evaluation (Sutton & Barto, 2018), by the definition of $T_H$ (Eq. 8) the option-action value function $Q^{\pi^A}_A$ is a fixed point. In Appendix D we show that $T_H$ is a contraction, and Theorem 3.3 then follows immediately.

In Appendix D, we further prove that $V[s_t, \bar{o}_{t-1}]$ is an unbiased estimate of $V[s_t]$ and has a variance-reduction effect compared to the conventional value function. This property is observed empirically and further discussed in the experiments (Section 5.1).

Proposition 3.4. $V[s_t, \bar{o}_{t-1}]$ is an unbiased estimate of $V[s_t]$.

Proposition 3.5. The variance of $V[s_t, \bar{o}_{t-1}]$ is upper-bounded by that of $V[s_t]$.

Proof. See Appendix D for the proofs of Propositions 3.4 and 3.5.

The theoretical analysis above presents a counterintuitive fact: the SMDP-formulated option framework has an MDP-formulated homomorphic equivalent (HiT-MDPs), and both converge to the same optimal value function. A natural question is how an MDP could temporally extend the execution of options. Note that the Markovian master policy $P(\bar{o}_t \mid s_t, \bar{o}_{t-1})$ results from marginalizing over the termination variable $b_t$ (see Eq. 5 and Eq. 1) and has an extra dependency on $\bar{o}_{t-1}$. Therefore, the Markovian master policy acts like a distance measure.
The decision of selecting an option $\bar{o}_t$ can be made by simply comparing which $\bar{o} \in \bar{O}$ is closest to the vector $[s_t, \bar{o}_{t-1}]$. Because a vector is closest to itself, this mechanism has a natural tendency to assign $o_t$ to $o_{t-1}$, and thus extends the execution of $\bar{o}_{t-1}$. On the other hand, a significantly different state $s_t$ will pull the distance far enough from $o_{t-1}$ that other options are assigned. The effectiveness of the Markovian master policy is also examined empirically in Section 5.4.

4. LEARNING HIT-MDPS UNDER THE MAXIMUM ENTROPY FRAMEWORK

In this section, we propose a stable and sample-efficient Maximum entropy Option Policy Gradient (MOPG) algorithm to learn HiT-MDPs. Learning temporal abstractions such as options has been a long-standing challenge (Sutton et al., 1999; Givan et al., 2003; Kolobov et al., 2012; Bacon et al., 2017). In theory (Sutton et al., 1999), options are not generally necessary for learning optimal policies, since MDPs are sufficient. Therefore, standard option learning algorithms that optimize Eq. 1 often lead to sub-optimal results: either degenerate options (Harb et al., 2018) (short execution time) that switch back and forth frequently, or dominant options (Zhang & Whiteson, 2019) (long execution time) that execute through the whole episode. Although there have been various attempts (Harb et al., 2018; Hyun et al., 2019; Smith et al., 2018) to tackle this issue, they often introduce bias that accumulates along the length of the trajectory and results in sub-optimal solutions.

Recently, the maximum entropy reinforcement learning framework (Todorov, 2006; Ziebart et al., 2010; Haarnoja et al., 2017; 2018) has seen success in encouraging exploration and improving sample complexity. With the PGM of the HiT-MDP in hand, we tackle the above issues by deriving the Maximum entropy Option Policy Gradient (MOPG), a stable algorithm for learning diversified options and preventing degeneracy. Following the control-as-inference framework (Levine, 2018; Haarnoja et al., 2018), we introduce the concept of "optimality" (Todorov, 2006) into the HiT-MDP and formulate the conventional RL problem of Eq. 5 as a probabilistic inference problem (Kappen et al., 2012). Specifically, we follow Levine (2018) and Koller & Friedman (2009) and factorize Eq. 5 into another PGM (as shown in Figure 2):

$$q(\tau, e^A_{1:T}, e^O_{1:T}) = P(s_0) q(\bar{o}_0) \prod_{t=1}^{T} q(e^A_t \mid s_t, a_t) \, q(e^O_t \mid s_t, a_t, \bar{o}_t, \bar{o}_{t-1}) \, P(s_{t+1} \mid s_t, a_t) \, q(\bar{o}_t) q(a_t) \propto P(s_0) \prod_{t=1}^{T} q(e^A_t \mid s_t, a_t) \, q(e^O_t \mid s_t, a_t, \bar{o}_t, \bar{o}_{t-1}) \, P(s_{t+1} \mid s_t, a_t), \tag{9}$$

where $e \in \{0, 1\}$ are observable binary "optimal random variables" (Levine, 2018). The agent is optimal at time step $t$ when $e^A_t = 1$ and $e^O_t = 1$, which occur with probabilities $q(e^A_t = 1 \mid s_t, a_t)$ and $q(e^O_t = 1 \mid s_t, a_t, \bar{o}_t, \bar{o}_{t-1})$ (to keep the notation uncluttered, we use $e_t$ to denote $e_t = 1$). To simplify the derivation, the priors $q(\bar{o})$ and $q(a)$ can be assumed to be uniform distributions without loss of generality (Levine, 2018), which gives the proportionality in Eq. 9. Note that Eq. 9 shares the same state-action dynamics as Eq. 5. With the optimal random variables $e^O$ and $e^A$, the conditional probability that a state-action pair $\{s_t, a_t\}$ is optimal is defined as:

$$q(e^A_t \mid s_t, a_t) = \exp(r(s_t, a_t)) \left[ \int \exp(r(s_t, a_t)) \, da \right]^{-1}, \tag{10}$$

which is an energy function that follows the Boltzmann distribution (Levine, 2018). This specific design facilitates recovering the value function at the later structural variational inference stage.
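For intuition, the Boltzmann form of the optimality probability can be evaluated directly when the action set is finite, in which case the normalizing integral over actions becomes a sum. The sketch below makes that discrete-action assumption; the function name and the reward values are hypothetical.

```python
import math


def optimality_probabilities(rewards):
    """Boltzmann distribution over a finite action set.

    rewards: r(s_t, a) for each candidate action a in a fixed state s_t.
    Returns exp(r) / sum_a exp(r), computed in a numerically stable way
    by subtracting the maximum reward before exponentiating.
    """
    max_reward = max(rewards)
    weights = [math.exp(r - max_reward) for r in rewards]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Example (assumed rewards for three actions in one state).
q_opt = optimality_probabilities([1.0, 0.0, -1.0])
```

Actions with higher reward receive exponentially more probability of being labeled optimal, and equal rewards yield a uniform distribution.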
Based on the same motivation, the conditional probability that an option-state-action tuple $\{o_t, s_t, a_t, o_{t-1}\}$ is optimal is defined as:

$$q(e^O_t \mid s_t, a_t, \bar{o}_t, \bar{o}_{t-1}) = \exp(I[\bar{o}_t \mid s_t, a_t, \bar{o}_{t-1}]) \left[ \int \exp(I[\bar{o}_t \mid s_t, a_t, \bar{o}_{t-1}]) \, do \right]^{-1}, \tag{11}$$

where the mutual information $I[\bar{o}_t \mid s_t, a_t, \bar{o}_{t-1}]$ is chosen as the energy function of the Boltzmann distribution. This design choice arises from the fact that, when the uniform prior assumption on $q(o)$ is relaxed, the optimization introduces a mutual-information term as a regularizer in the Evidence Lower BOund (ELBO) (proof in Appendix E). When the optimal random variables are observed, substituting Eq. 10 and Eq. 11 into Eq. 9 gives the conditional probability of any feasible trajectory $\tau$:

$$q(\tau \mid e_{1:T}) \propto P(s_0) \prod_{t=1}^{T} P(s_{t+1} \mid s_t, a_t) \exp\left( \sum_{t=1}^{T} r(s_t, a_t) \right) \exp\left( \sum_{t=1}^{T} I[\bar{o}_t \mid s_t, a_t, \bar{o}_{t-1}] \right), \tag{12}$$

where the optimal random variables $e_{1:T}$ are treated as observed random variables (evidence) and the trajectory $\tau$ is treated as a latent variable. Therefore, the problem of fitting the optimal trajectory is equivalent to maximizing the ELBO: $\log q(e_{1:T}) \geq -D_{KL}[P(\tau) \| q(\tau \mid e_{1:T})]$ (proof in Appendix E). This observation immediately gives rise to a structural variational inference solution: keep the system dynamics $P(s_0) \prod_{t=1}^{T} P(s_{t+1} \mid s_t, a_t)$ fixed in both Eq. 12 and Eq. 5 while optimizing the action and master policies in the variational distribution of Eq. 5:

Theorem 4.1. (Markovian Option Policy Improvement Theorem). The problem of learning optimal action and master policies reduces to minimizing the KL divergence $D_{KL}[P(\tau) \| q(\tau \mid e_{1:T})]$:

$$\pi^{O*}, \pi^{A*} = \arg\max_{\pi^O, \pi^A} -D_{KL}[P(\tau) \| q(\tau \mid e_{1:T})] = \arg\max_{\pi^O, \pi^A} \sum_t \mathbb{E}_{\pi^O, \pi^A}\left[ r(s_t, a_t) + I(o_t \mid s_t, a_t, o_{t-1}) + H(\pi^O) + H(\pi^A) \right],$$

where $H(\cdot)$ denotes the entropy term.

Proof. See Appendix E.
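The per-step quantity inside the expectation of Theorem 4.1 can be evaluated for discrete policies as a sum of four terms. The sketch below is only an illustration under assumptions: both policies are categorical, the mutual-information term is supplied as a precomputed number (how it is estimated is outside the scope of this sketch), and all names and values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mopg_step_objective(reward, mutual_info, master_probs, action_probs):
    """r(s_t, a_t) + I(o_t | s_t, a_t, o_{t-1}) + H(pi^O) + H(pi^A).

    mutual_info is assumed to be given; master_probs and action_probs are
    the categorical master and action policies at the current step.
    """
    return reward + mutual_info + entropy(master_probs) + entropy(action_probs)


# Example (assumed values): a deterministic master policy over four options
# and a uniform action policy over two actions.
value = mopg_step_objective(1.0, 0.5, [1.0, 0.0, 0.0, 0.0], [0.5, 0.5])
```

A deterministic master policy contributes zero entropy, so an option that is selected with certainty everywhere earns no entropy bonus, which is the mechanism the text credits with discouraging dominant options.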
With the policy evaluation and improvement theorems, we further prove the convergence of MOPG by proposing the Markovian Option Policy Iteration Theorem (Appendix E). Standard option frameworks (as shown in Eq. 1 and Eq. 5) only consider maximizing the value function and place no constraints on the quality of options, e.g., how good options should behave (Harb et al., 2018). As a result, previous approaches often yield suboptimal results. In Eq. 44, the mutual information is introduced as an intrinsic reward to encourage consecutive executions of options and thus prevent degenerate options, while the entropy terms encourage exploration of options and thus prevent dominant options. Using the value functions from Section 3.2, we directly optimize the ELBO with respect to the variational distribution and derive the Maximum entropy Options Policy Gradients (MOPG) theorems:

Theorem 4.2. Master Policy Gradient Theorem: Given a stochastic master policy that is differentiable w.r.t. its parameter vector $\theta_{\bar{o}}$, the gradient of the expected discounted return w.r.t. $\theta_{\bar{o}}$ is:

$$\frac{\partial V[s_t, \bar{o}_{t-1}]}{\partial \theta_{\bar{o}}} = \mathbb{E}\left[ \frac{\partial P(\bar{o}' \mid s', \bar{o})}{\partial \theta_{\bar{o}}} Q_O[s', \bar{o}'] + I(\bar{o}' \mid s, a, \bar{o}) + H(\pi^O_{\theta_{\bar{o}}}) \,\Big|\, s_t, \bar{o}_{t-1} \right],$$

where $Q_O[s_t, \bar{o}_t] = \mathbb{E}[G_t \mid s_t, \bar{o}_t]$ (Appendix D, Eq. 28).

Proof. See Appendix E.

Theorem 4.3. Action Policy Gradient Theorem: Given a stochastic action policy that is differentiable w.r.t. its parameter vector $\theta_a$, the gradient of the expected discounted return w.r.t. $\theta_a$ is:

$$\frac{\partial Q_O[s_t, \bar{o}_t]}{\partial \theta_a} = \mathbb{E}\left[ \frac{\partial P(a \mid s, \bar{o})}{\partial \theta_a} Q_A[s, \bar{o}, a] + H(\pi^A_{\theta_a}) \,\Big|\, s_t, \bar{o}_t \right].$$

Proof. See Appendix E.

To keep the notation uncluttered, we use $\theta_{\bar{o}}$ to denote the parameters of the master policy $\pi^O_{\theta_{\bar{o}}}$ and $\theta_a$ to denote the parameters of the action policy $\pi^A_{\theta_a}$. The algorithm is summarized in Appendix F.

5. EXPERIMENTS

In this section, we design experiments to answer the following questions: (Q1) Can MOPG achieve better performance than other option variants and non-option baselines? (Q2) Does MOPG have a performance boost over other option variants in transfer learning settings? (Q3) Is HiT-MDP interpretable? (Q4) Can MOPG encourage consecutive executions of options while preventing the dominant option problem? For baselines, we follow the open-source implementation of DAC (Zhang & Whiteson, 2019) and compare our algorithm with six baselines, five of which are option variants, i.e., DAC+PPO, AHP+PPO (Levy & Shimkin, 2011), IOPG (Smith et al., 2018), PPOC (Klissarov et al., 2017), and OC (Bacon et al., 2017). The non-option baseline is PPO (Schulman et al., 2017). All baseline parameters used by DAC remain unchanged except the maximum number of training steps: MOPG needs only 1 million steps to converge rather than the 2 million used in DAC. For single-task learning, experiments are conducted on all OpenAI Gym MuJoCo environments (10 environments) (Brockman et al., 2016a). For transfer learning, we follow DAC and run 6 pairs of transfer learning tasks based on the DeepMind Control Suite (Tassa et al., 2020). Figures are plotted following DAC's protocol: curves are averaged over 10 independent runs and smoothed by a sliding window of size 20. Shaded regions indicate standard deviations. All experiments are run on an Intel® Core™ i9-9900X CPU @ 3.50GHz with a single thread and process. Our implementation details are summarized in Appendix G. For a fair comparison, we follow DAC and use four options in all implementations. Our code is available in the supplemental materials.
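The plotting protocol described above (averaging over independent runs, smoothing with a sliding window, and shading by standard deviation) can be sketched in a few lines. This is a minimal illustration and not DAC's plotting code: the trailing-window convention, the shorter window at the start of the curve, and the use of the population standard deviation are assumptions made here.

```python
import statistics


def mean_and_std_across_runs(runs):
    """runs: list of equal-length return curves, one per independent run.

    Returns the per-step mean and (population) standard deviation.
    """
    means = [statistics.fmean(step_values) for step_values in zip(*runs)]
    stds = [statistics.pstdev(step_values) for step_values in zip(*runs)]
    return means, stds


def sliding_window_smooth(curve, window=20):
    """Trailing moving average; early points use a shorter window."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(curve):
        running_sum += value
        if i >= window:
            running_sum -= curve[i - window]   # drop the value leaving the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Tiny example with two runs and a window of 2 (values are made up).
means, stds = mean_and_std_across_runs([[1.0, 2.0], [3.0, 4.0]])
smooth = sliding_window_smooth([1.0, 2.0, 3.0, 4.0], window=2)
```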

5.1. SINGLE-TASK LEARNING (Q1)

To answer Q1, we compare MOPG against five option variants (i.e., DAC+PPO (Zhang & Whiteson, 2019), AHP+PPO (Levy & Shimkin, 2011), IOPG (Smith et al., 2018), PPOC (Klissarov et al., 2017), and OC (Bacon et al., 2017)) and PPO (Schulman et al., 2017). We observe that MOPG behaves differently on infinite (Figure 3a) and finite (Figure 3b) horizon environments. Previous studies (Klissarov et al., 2017; Smith et al., 2018; Harb et al., 2018; Zhang & Whiteson, 2019) find that option-based algorithms have no advantage over hierarchy-free algorithms on single-task environments. Despite this, MOPG still achieves performance comparable to the state of the art (Figure 3b). More importantly, on infinite horizon environments (Figure 3a), MOPG significantly outperforms all baselines with respect to episodic return, convergence speed, variance between steps, and variance across the 10 runs (Proposition 3.5). Since infinite and finite horizon environments are theoretically identical (Sutton & Barto, 2018), we do not have a theoretical explanation for this phenomenon. In Appendix A, we conceptually explain that this might be because conventional value functions are insufficient to approximate environments in which hidden variables $o$ affect only rewards but not states. In conclusion, our experimental results show that MOPG is at least as effective as other option variants while being significantly more sample efficient. Furthermore, it has significant advantages in infinite horizon environments.

5.2. TRANSFER LEARNING (Q2)

We run 6 pairs of transfer learning tasks based on the DeepMind Control Suite (Tassa et al., 2020). Each pair contains two different tasks. MOPG's performance ranks first in 5 out of 6 environments. This demonstrates MOPG's advantages in knowledge reuse and shows that its performance is at least comparable with other option variants.

5.3. INTERPRETATION OF OPTION EMBEDDINGS (Q3)

Interpretability is a key property for applying RL agents in real-world applications. Our implementation follows the Skill-Action (SA) architecture (Li et al., 2020) and introduces the transformer's decoder as a distance measure to learn option embeddings. One unique advantage of HiT-MDP is that the hidden variables $\bar{o}$ are interpretable. In the implementation, the master policy is defined as $\pi^O = P(\bar{o}_t \mid s_t, \bar{o}_{t-1}; W_o)$, where $W_o = [\hat{o}_1, \dots, \hat{o}_K]$ is an embedding matrix, $\hat{o} \in \mathbb{R}^N$ is the embedding vector of an option with dimension $N$, and $K$ is the total number of options ($N = 40$ and $K = 4$ in Figure 5(a)). More implementation details are described in Appendix G. Similar to word embeddings (Vaswani et al., 2017), option embeddings learn a semantic space of options in which each dimension encodes a particular property and can be interpreted explicitly. As in the capsule network (Sabour et al., 2017), we first infer the semantics of each embedding dimension by adding perturbations to it: $\hat{o}^{\text{perturb}}_{[i,j]} = \epsilon_{[i]} + \hat{o}_{[:,j]}$, where $\epsilon_i \in [-0.1, -0.09, \dots, 0.09]$ is an array ranging from $-0.1$ to $0.09$ with an interval of $0.01$ and $j \in \mathbb{N}$ indexes the $j$-th dimension of the option embedding. We then sample primary actions $a_{[:,j]} \sim \pi^A(s_t, \hat{o}^{\text{perturb}}_{[i,j]})$ from the perturbed options to observe how actions change with the perturbations.

One significant advantage of MOPG over other algorithms is that the intrinsic reward and entropy terms (Eq. 44) enable MOPG to automatically learn to balance between degenerate and dominant options. In this subsection, we demonstrate this empirically in Figure 6 (experimental results from HalfCheetah; more details in Appendix B.2). At the start of training, the durations of all options are short, while Option 3's duration grows quickly. This shows that MOPG can indeed temporally extend an option. Beyond temporal extension, as shown in the video, MOPG learns to compose distinguishable options. Option 3 (green background in the video) is a running-forward skill and is therefore executed most of the time. Option 2 (blue background) is mainly used to recover from falling down, so its duration decreases with training. Therefore, the empirical study also shows that MOPG can compose disentangled options while preventing the dominant option problem.
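The embedding perturbation procedure used in Section 5.3 to probe individual dimensions can be sketched as follows. This is an assumption-laden illustration: it perturbs one dimension of a single option embedding at a time, uses a plain Python list for the embedding, and leaves out the action policy that would consume the perturbed embeddings; the function names are hypothetical.

```python
def perturbation_grid(start=-0.10, step=0.01, count=20):
    """Perturbation values -0.10, -0.09, ..., 0.09 (rounded to 2 decimals)."""
    return [round(start + i * step, 2) for i in range(count)]


def perturb_dimension(embedding, j, epsilons):
    """Return one perturbed copy of `embedding` per epsilon.

    Only dimension j is changed; all other dimensions are kept as they are,
    so differences in sampled actions can be attributed to dimension j.
    """
    perturbed = []
    for eps in epsilons:
        copy = list(embedding)
        copy[j] += eps
        perturbed.append(copy)
    return perturbed


# Example: a 40-dimensional embedding (N = 40 as in the text), all zeros here
# purely for illustration, perturbed along dimension 3.
grid = perturbation_grid()
variants = perturb_dimension([0.0] * 40, 3, grid)
```

Each perturbed embedding would then be passed to the action policy in place of the original one, and the resulting actions compared across the grid.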

6. RELATED WORKS

To discover options automatically, Sutton et al. (1999) proposed Intra-option Q-learning to update the master Q-value function at every time step. However, all policies under this formulation are approximated implicitly using the Q-learning method. AHP (Levy & Shimkin, 2011) unifies the Semi-Markov process into an augmented Markov process and explicitly learns an "overall policy" by applying MDP-based policy gradient algorithms. However, its method for updating the master policy is still SMDP-style and thus sample inefficient (Zhang & Whiteson, 2019). OC (Bacon et al., 2017) proposes a policy gradient based framework for explicitly learning intra-option policies and termination functions in an intra-option manner. However, for the policy gradient learning of the master policy, OC remains SMDP-style. Haarnoja et al. (2018) combined the idea of soft optimality (Haarnoja et al., 2017) from earlier control-as-inference frameworks (Todorov, 2006; Ziebart et al., 2010; Kappen et al., 2012) with the actor-critic architecture (Konda & Tsitsiklis, 2000) and proposed the state-of-the-art off-policy baseline SAC. Recent work shows that the option framework trained with soft-optimality off-policy algorithms outperforms on-policy methods (Wulfmeier et al., 2020). HiT-MDPs are general MDPs that can be trained by both on-policy and off-policy algorithms: the ELBO proposed in Section 4 can easily be extended to a SAC-like algorithm, which we leave to future work.

7. CONCLUSIONS

In this paper, we propose the Hidden Temporal MDP (HiT-MDP), which is, to the best of our knowledge, the first work proving that an option-induced MDP is homomorphically equivalent to the standard SMDP option framework. As for the learning algorithm, we propose the Maximum entropy Option Policy Gradients (MOPG) algorithm. Unlike conventional option learning algorithms that only maximize the value function without any constraint on the quality of options, MOPG optimizes the Evidence Lower BOund (ELBO) with an intrinsic reward to prevent degenerate options and entropy terms to prevent dominant options. It is also theoretically proven to fit optimal trajectories without adding any bias. On challenging MuJoCo (Todorov et al., 2012; Brockman et al., 2016b; Tunyasuvunakool et al., 2020) environments, thorough empirical results demonstrate that, under widely used configurations, HiT-MDP achieves results comparable to all baselines on finite horizon and transfer learning environments, and significantly outperforms all baselines on infinite horizon environments. We also demonstrate that hidden temporal embeddings are interpretable, which is a key property for applying RL agents to real-world applications. Embeddings exist in various applied ML domains such as CV and NLP; our work shows that the idea also works for HRL and potentially lays the theoretical ground for building a large-scale HRL foundation model.

A. LEARNING OPTIONS AT MULTI-LEVELS OF GRANULARITY

Implementations of the option framework share some common limitations. When proposing the option framework, Sutton et al. (1999) expected that learning at multiple levels of temporal abstraction would favor faster convergence and better exploration. On the contrary, significant improvements on single-task environments have not been observed in most option implementations (Klissarov et al., 2017; Smith et al., 2018; Harb et al., 2018; Zhang & Whiteson, 2019).
To the best of our knowledge, MOPG is the first option implementation in which these properties are clearly observed, although only on infinite horizon environments. In this section, we study this problem by explaining an intuition of why the value function is the main reason for this deficiency and how deep wide value functions could solve it. The expected improvement of the option framework on single-task environments builds on the assumption that, by exploiting hierarchical action and state spaces, an agent's search space can be greatly reduced, which accelerates learning and improves exploration. However, as reported in Section 5.4, most option frameworks suffer from "the dominant option problem" (Zhang & Whiteson, 2019), which prevents them from effectively learning hierarchy in the action and state spaces as well as coordinating between options. One intuition behind this problem is that conventional value functions V[S_t, O_{t-1} = o_1] = 10 and V[S_t, O_{t-1} = o_2] = -10. Because they arrive at the same state S_t, they have identical values under the conventional value function (V[S_t] = 0). Because of this deficiency, option frameworks can only learn options at a very coarse level and thus fail to exploit hierarchical information. The solution might be a deep wide value function: enabling the framework to learn fine-grained options at multiple levels of granularity (deep) and making value functions depend on latent variables with longer (wide) dependencies (e.g., V[S_t, O_{t-1}] and Q[S_t, O_t, A_t, O_{t-1}]). To better understand the importance of the deep wide value function, let us consider a simple environment which can be easily solved by Q[s_t, a_t, a_{t-1}] but not by Q[s_t, a_t]. Suppose we are training a robot that has only a camera sensor to cook a Thanksgiving turkey. In this setting there are only two states: S = {Raw Turkey Image, Cooked Turkey Image}.
The robot's action space only consists of two actions: A = {Stuff turkey, Roast turkey}. As for reward, if the robot roasted a stuffed turkey, then the reward is 10. However, if the robot roasted an un-stuffed turkey, then the reward is -10. The stuff turkey action receives 0 reward. The option selection is very random at the beginning of the episode as well as around the switching point (the 16th second). These are exactly the most confusing moments of conventional value functions. Thanks to proposition 3.4 and 3.5, one unbiased solution might be employing the deep wide value function: the higher the order of the MDPs, the smaller the variance will be. We will explore this direction in our future work.
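The turkey example can be made concrete with a few lines of code. The sketch below is only an illustration under stated assumptions: it considers the immediate reward in the raw-turkey state, encodes "no stuffing happened" as a hypothetical previous action `None`, and assumes the two histories are equally likely. The function names are ours, not the paper's.

```python
# Hypothetical encoding of the turkey example. The camera cannot see whether
# the turkey has been stuffed, so both histories lead to the same observed
# state ("Raw Turkey Image").
ACTIONS = ("stuff", "roast")
PREVIOUS = ("stuff", None)  # None: the turkey has not been stuffed (assumption)


def reward(prev_action, action):
    """Immediate reward in the raw-turkey state, as described in the text."""
    if action == "stuff":
        return 0.0
    # Roasting pays off only if the previous action stuffed the turkey.
    return 10.0 if prev_action == "stuff" else -10.0


def first_order_q(action, prev_probs):
    """Q[s, a]: the previous action is averaged out of the estimate."""
    return sum(p * reward(prev, action) for prev, p in prev_probs.items())


def second_order_q(action, prev_action):
    """Q[s, a, a_prev]: the previous action is part of the input."""
    return reward(prev_action, action)


uniform = {prev: 0.5 for prev in PREVIOUS}  # assumed history distribution
q1 = {a: first_order_q(a, uniform) for a in ACTIONS}
q2 = {(a, prev): second_order_q(a, prev) for a in ACTIONS for prev in PREVIOUS}
```

Under these assumptions `q1` assigns the same value to both actions, so a greedy agent cannot prefer either, while `q2` separates roasting a stuffed turkey from roasting an un-stuffed one.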

B.1 PERFORMANCE

In this section we provide results for all ten OpenAI Gym Mujoco environments (Brockman et al., 2016b; Todorov et al., 2012). These environments can be classified into two categories: infinite horizon environments (i.e., HalfCheetah, Swimmer, HumanoidStandup and Reacher) and finite horizon environments (the others).

To illustrate how MOPG composes options, we take the HalfCheetah model trained for 1 million steps and independently run it 4 times (4 episodes, each containing 512 time steps). The option activation sequences of the 4 runs are plotted in Figure 7. There are some common patterns across all 4 independent runs. For example, all runs start with Option 1 and use Option 2 at the early stage. After executing Option 2 for a short period, they all switch to Option 3, which has the longest durations in all 4 runs. From time to time they fall back to Option 2 for short periods and quickly switch to Option 3 again. This pattern of coordination indicates one significant advantage of MOPG over other algorithms: because of the intrinsic reward and entropy terms in Eq. 44, the ELBO objective function automatically learns to balance between degenerate options and dominant options. As shown in the video (https://youtube.com/shorts/M06BPqit7l4?feature=share), Option 3 (green background in the video) is a running-forward option and is thus executed most of the time. Option 2 (blue background) is mainly used to recover from falling down, so its duration decreases with training. Option 2 and Option 3 have completely different functionality. Therefore, the empirical study also shows that MOPG is able to compose disentangled options while preventing the dominant option problem.

One interesting question is whether the performance boost of MOPG comes from an increase or a decrease in option switching frequency compared to other option variants. In Figure 8 we plot the option durations of one run for all option variants.
As shown in Figure 8, there appears to be no significant difference in option duration between our method and other option variants. All of these methods are able to learn options that execute across varying amounts of time and to compose those options. It is also worth mentioning that, compared with Figure 3, there is no apparent monotonic relation between option switching frequency and performance. Therefore, on simple environments like HalfCheetah, the main performance difference comes from the sample efficiency of training intra-option policies. In contrast, in harder environments like HumanoidStandupV2, there is a significant difference in option composition between our method and other variants. As shown in Figure 3, all option variants except ours have a rather low score in HumanoidStandupV2. As shown in Figure 9, all of these option variants tend to switch between options quickly and randomly. This means that they fail to learn useful options that focus on specific sub-tasks with longer durations. This is because standard option frameworks tend to learn either degenerate options (Harb et al., 2018) (short execution time) that switch back and forth too frequently, or dominant options (Zhang & Whiteson, 2019) (long execution time) that execute through the whole episode. Both cases severely impair performance. Figure 10 shows our method's option composition in HumanoidStandupV2. As shown in the figure, our method employs more options as the environment gets harder: it successfully learns 4 useful options and has a clear option composition schedule between the stages of standing up. For option (a), the humanoid is lying down and trying to sit up. Option (b) and option (c) are used more interchangeably: option (b) is more about sitting and swirling, while option (c) is more about trying to stand up and recovering from failed trials.
After the humanoid stands up, the agent constantly calls option (d), which is running around and trying to balance itself. In conclusion, our method can learn distinguishable options and a composition schedule in harder environments. Because of MOPG's maximum entropy framework, our method is able to adapt to the environment's complexity in an end-to-end manner and is more robust to degenerate and dominant options. One last thing worth mentioning is that the number of options used is related not only to the complexity of the environment, but also to the coefficient of the entropy terms. Figure 11 shows a histogram of the magnitude of entropy weights and how many options are used over a total of 20 runs. As the entropy weight increases, more options are used, since there is more randomness in the option policy: when the entropy weight is 0.01, all 20 runs use only 2 options; when entropy weight

B.3 INTERPRETATION OF OPTION CONTEXT VECTORS

In this section we continue with the HalfCheetah model used in Section B.2 and demonstrate how to interpret option context vectors as well as option activation sequences (Figure 7). In HalfCheetah, the agent learns to run half of a Cheetah by controlling 6 joints: back thigh, back shin, back foot, front thigh, front shin, and front foot. The faster the Cheetah runs forward, the higher the return it gets from the environment. We interpret option context vectors and activation patterns by first inspecting what property each dimension of the option context vector encodes (Figure 13). Once each dimension is understood, options (Figure 12) become straightforward to interpret by simply inspecting on which dimensions (properties) they have the most significant weights (Figure 14). These interpretations can further be used to explain the option activation patterns in Figure 7. As the first step, we follow Sabour et al. (2017) to interpret what property each dimension of the option context vector in Figure 12 encodes by perturbing each dimension and decoding the perturbed option context vectors into primary actions. Specifically, we perturb one dimension by adding perturbations in the range [-0.1, 0.09] at intervals of 0.01 while keeping the other dimensions fixed. After perturbation, each option context vector dimension has 20 perturbed vectors. We then use the action policy decoder to decode all those vectors into primary actions and observe how the perturbation affects the primary action. As an illustration, we plot all 20 perturbed results of Dimension 4 in Figure 13. With the visualization of perturbation results in hand, we can interpret what property each dimension encodes by inspecting the relationship between perturbations and primary actions. In Figure 13, for example, it is clear that changes on Dim 4 have a consistent direction: as the magnitude of Dim 4 increases, all actuators move in the same direction. This dimension can be seen as having the effect of accelerating forward running.
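The perturbation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_decoder` is a hypothetical stand-in for the learned action policy decoder, the 8-dimensional zero vector is an assumed placeholder for a real option context vector, and dimension indices are 0-based here.

```python
NUM_ACTUATORS = 6  # HalfCheetah controls 6 joints


def perturbation_offsets(start=-0.1, step=0.01, count=20):
    """Offsets -0.10, -0.09, ..., 0.09 (20 values at intervals of 0.01)."""
    return [round(start + i * step, 2) for i in range(count)]


def perturb_dimension(vector, dim, offsets):
    """Return one copy of `vector` per offset, changing only entry `dim`."""
    perturbed = []
    for delta in offsets:
        copy = list(vector)
        copy[dim] += delta
        perturbed.append(copy)
    return perturbed


def toy_decoder(vector):
    """Hypothetical fixed linear map standing in for the action decoder."""
    return [
        sum(((i + j) % 3 - 1) * x for i, x in enumerate(vector))
        for j in range(NUM_ACTUATORS)
    ]


option_vector = [0.0] * 8  # assumed placeholder option context vector
perturbed = perturb_dimension(option_vector, dim=4, offsets=perturbation_offsets())
actions = [toy_decoder(v) for v in perturbed]  # one action vector per offset
```

Plotting each actuator's entry of `actions` against the offsets would give a figure in the spirit of Figure 13, with the learned decoder in place of `toy_decoder`.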
Once we know how to interpret one dimension, we can move on to interpret the whole option context vector. Since Option 1 and Option 2 are the two main options employed in Figure 7, here we provide an example of how to interpret them. Figure 12 shows that Option 1 has significant values on dimensions 11, 15 and 22. Option 2 is significant on dimensions 2, 5 and 36. We visualize these dimensions in the same manner as Figure 13. Subfigures in Figure 14 can be interpreted in the same manner as Figure 13. As an example, from Figure 12 we can see that Option 1 has a significantly small value on Dim 11. Figure 14 shows that a smaller Dim 11 twists the front leg and back foot forward while twisting the back thigh and back shin backward. The composition of these movements is a back-leg landing property. Similarly, we can interpret Dim 15 as a front-leg landing property and Dim 22 as a balancing property. Therefore, Option 1 focuses on landing from all positions. Unlike other option context vectors, which have clearly dominant dimensions, Option 2 has a rather balanced option context vector with no apparently dominant dimension. It only has slightly more significant values on Dim 2, 5 and 36, which focus on jumping and running properties. Therefore, Option 2 is more like an "all-weather" option: an option with very balanced properties and a slight emphasis on running and jumping. The interpretations of Option 1 and 2 above can then be used to understand the option activation patterns in Figure 7: as an all-weather option, Option 2 is the most frequently executed one and has the longest duration. From time to time, when the Cheetah needs to land and balance itself, Option 1 is executed. However, since the landing option does not provide forward power and thus has lower returns if continued, once the body is balanced the Cheetah quickly stops Option 1's execution and keeps running with Option 2.

B.4 TRANSFER LEARNING RESULTS

Below are statistics for Deepmind Control Suite (Tunyasuvunakool et al., 2020) transfer learning environments. 

C AN MDP EQUIVALENCE TO THE SMDP OPTION FRAMEWORK

In this section, we show that the conventional Semi-Markov Decision Problem (SMDP) option framework, which employs Markovian options, has an option-induced MDP equivalence. We first introduce the conventional SMDP formulated option framework (SMDP-Option). In Section C.2 we follow Bishop (2006)'s method and formulate the dynamics of the SMDP-Option framework as a Hidden Markov Model (HMM) style Probabilistic Graphical Model (PGM) (Koller & Friedman, 2009). With the PGM and its conditional independence relationships (Chapter 8.2.1 (Bishop, 2006)) in hand, in Section C.3 we propose the Hidden Temporal MDPs (HiT-MDPs) and prove their equivalence to the SMDP-Option under the definition of homomorphic equivalence (Ravindran, 2003). To the best of our knowledge, this is the first work discovering the option framework's MDP equivalence.

Since an option's execution may persist over a variable period of time, a set of options' execution together with its value functions constitutes a Semi-Markov Decision Problem (SMDP) (Puterman, 1994). When an old option is terminated, a new option will be sampled from the master policy (policy-over-options) $o \sim P(\mathbf{o}_{t+1} \mid s_{t+1}) : S \to O$. A master policy $\pi(\mathbf{o} \mid s) = P(\mathbf{o} \mid s)$, where $\mathbf{o} \in O$, is used to sample which option will be executed. Note that we use the bold-case $\mathbf{o}$ to denote unrealized random variables and the light-italic-case $o$ to denote a realized instantiation. Conventionally, the execution of an option employs the call-and-return model. Therefore, the dynamics (stochastic process) of the option framework is written as:

$$
\begin{aligned}
P(\tau) &= P(s_0) P(o_0) P_{o_0}(a_0 \mid s_0) \prod_{t=1}^{\infty} P(s_t \mid s_{t-1}, a_{t-1}) P_{o_t}(a_t \mid s_t) \big[ (1 - \beta_{o_{t-1}}(s_t)) \mathbb{1}_{o_t = o_{t-1}} + \beta_{o_{t-1}}(s_t) P(o_t \mid s_t) \big] \\
&= P(s_0) P(o_0) P_{o_0}(a_0 \mid s_0) \prod_{t=1}^{\infty} P(s_t \mid s_{t-1}, a_{t-1}) P_{o_t}(a_t \mid s_t) \big[ P_{o_{t-1}}(b_t = 0 \mid s_t) \mathbb{1}_{o_t = o_{t-1}} + P_{o_{t-1}}(b_t = 1 \mid s_t) P(o_t \mid s_t) \big],
\end{aligned}
$$

where $\tau = \{s_0, o_0, a_0, s_1, o_1, a_1, \ldots\}$ denotes the trajectory of the option framework.
Here $\mathbb{1}_{o_t = o_{t-1}}$ is an indicator function that equals 1 only when $o_t = o_{t-1}$ (notice that $o_{t-1}$ is the realization of the random variable $\mathbf{o}_{t-1}$). For clarity, we use $P_o(b = 1 \mid s)$ instead of $\beta_o$, which is widely used in the previous option literature (e.g., Sutton et al., 1999; Bacon et al., 2017). Under this formulation the option framework is defined as a Semi-Markov process, since the dependency on an activated option $o$ can span a variable number of time steps (Sutton et al., 1999).
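The bracketed term in the dynamics above is the distribution of the next option under call-and-return execution. A minimal sketch of that term is given below; the function name and the list-based representation of the master policy are our own illustrative choices, and `beta` stands for the termination probability $P_{o_{t-1}}(b_t = 1 \mid s_t)$ already evaluated at the current state.

```python
import random


def option_transition_probs(prev_option, beta, master_policy):
    """Return P(o_t | s_t, o_{t-1}) as a list indexed by option.

    Implements (1 - beta) * 1[o_t = o_{t-1}] + beta * P(o_t | s_t), where
    `master_policy[k]` is P(o_t = k | s_t) and `prev_option` is o_{t-1}.
    """
    probs = [beta * p for p in master_policy]  # terminate, then resample
    probs[prev_option] += 1.0 - beta           # continue the previous option
    return probs


def sample_next_option(prev_option, beta, master_policy, rng=random):
    """Draw o_t from the transition distribution above."""
    probs = option_transition_probs(prev_option, beta, master_policy)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example = option_transition_probs(prev_option=0, beta=0.25,
                                  master_policy=[0.5, 0.3, 0.2])
```

With `beta = 0` the previous option is kept with probability one, and with `beta = 1` the distribution reduces to the master policy, which are the two branches of the indicator formulation.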

C.2 AN MDP FORMULATION OF THE OPTION FRAMEWORK

We follow Bishop (2006)'s formulation of mixture distributions and Probabilistic Graphical Models (PGMs). By introducing option variables as latent variables and adding extra dependencies into the termination function and master policy, we show that the conventional SMDP version of the option framework (Bacon et al., 2017; Sutton & Barto, 2018; Sutton et al., 1999; Harb et al., 2018; Zhang & Whiteson, 2019) can be re-formulated as an MDP. We first redefine the option random variable $\mathbf{o} \in O = \{1, 2, \ldots, K\}$, which was originally defined as an integer index, as a K-dimensional one-hot vector $\bar{o} \in \bar{O} = \{0, 1\}^K$, where K is the number of options and each entry $o \in \{0, 1\}$ is a binary random variable. $P(\bar{o}_t \mid s_t)$ denotes the probability distribution over the one-hot vector $\bar{o}$ at time step t conditioned on state $s_t$. $P(\bar{o}_t = o_t \mid s_t)$ denotes a probability entry (a scalar value) of the random variable $\bar{o}_t$ with a realization at time step t in which the entry $o_t = 1$ and every other entry $o \in \bar{o}_t \setminus o_t$ equals 0. In Figure 16, $s \in S$, $\bar{o} \in \bar{O}^K$, $b \in B^K$ and $a \in A$ denote the state, option, termination and action random variables respectively. $\bar{o}$ is a K-dimensional one-hot vector and $b$ is a K-dimensional binary vector where each entry $b \in \{0, 1\}$. $R_{t+1}$ is the actual reward received from the environment after executing action $a_t$ in state $s_t$. $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ is the discounted expected return, where $\gamma \in \mathbb{R}$ is a discount factor.
The termination policy distribution $P(b_t \mid s_t, \bar{o}_{t-1}) : S \times \bar{O} \to B$ can be formulated as a mixture distribution conditioned on the option vector (the one-hot vector) $\bar{o}_{t-1}$ and state $s_t$:

$$P(b_t \mid s_t, \bar{o}_{t-1}) = \prod_{i \in \bar{o}_{t-1}} P_i(b_t \mid s_t)^{i}. \quad (17)$$

Because each option has its own termination policy $P_o(b \mid s)$, with a slight abuse of notation, in equation (17) we use $P(b_t \mid s_t, \bar{o}_{t-1})$ to denote the termination policy activated at time step t by the previously chosen option $\bar{o}_{t-1}$. To keep notation uncluttered, we use $\beta_t = P(b_t = 1 \mid s_t, \bar{o}_{t-1})$ to denote the probability that option $\bar{o}_{t-1}$ terminates at time step t and $(1 - \beta_t) = P(b_t = 0 \mid s_t, \bar{o}_{t-1})$ to denote the probability of continuation. Conventionally, the master policy (Zhang & Whiteson, 2019) (also called "policy-over-options" (Sutton et al., 1999; Bacon et al., 2017)) is defined as $P(\bar{o}_t \mid s_t)$. Similarly, we propose a novel mixture master policy as a mixture distribution:

$$P(\bar{o}_t \mid s_t, b_t, \bar{o}_{t-1}) = P(\bar{o}_t \mid s_t)^{b_t} \, P(\bar{o}_t \mid \bar{o}_{t-1})^{1 - b_t}, \quad (19)$$

where $P(\bar{o}_t \mid \bar{o}_{t-1})$ is a degenerate probability distribution (Puterman, 1994):

$$P(\bar{o}_t \mid \bar{o}_{t-1}) = \begin{cases} 1 & \text{if } \bar{o}_t = \bar{o}_{t-1}, \\ 0 & \text{if } \bar{o}_t \neq \bar{o}_{t-1}. \end{cases} \quad (20)$$

As shown in equation (19), the master policy only exists when $b_t = 1$, i.e., when the option terminates. Therefore, PPOC (Klissarov et al., 2017) uses inaccurate gradients for updating the master policy during an option's execution. According to the conditional dependency relationships in the PGM (Figure 16), the joint probability distribution of $\bar{o}_t$ and $b_t$ can be written as: $P(\bar{o}_t,$ where $P(\tau) = P(s_0, \bar{o}_0, a_0, s_1, b_1, \bar{o}_1, a_1, \ldots)$ denotes the joint distribution of the PGM. Notice that under this formulation, $P(\tau)$ is actually an HMM with $s_t, a_t$ as observable random variables and $b_t, \bar{o}_t$ as latent variables.
It is worth mentioning that equation (20) is essentially the indicator function $\mathbb{1}_{\bar{o}_t = \bar{o}_{t-1}}$ used in conventional SMDP option framework papers, and the last line in equation (22) is identical to the transition probability distribution in their formulation. However, as we show in this section, by adding latent variables $\bar{o}_{t-1}$ and introducing the dependency between $\bar{o}_t$ and $b_t$, our formulation is essentially an HMM. This opens the door to introducing many well-developed PGM algorithms, such as message passing (Forney, 1973) and variational inference (Hoffman et al., 2013), to the reinforcement learning framework. As we show below, the conditional independence relationships enjoyed by this model also enable us to prove the equivalence between the option framework's SMDP and MDP formulations.

C.3 HIDDEN TEMPORAL MDPS (HIT-MDPS)

In this section we propose the Hidden Temporal MDPs (HiT-MDPs) and prove their equivalence to the SMDP-Option under the definition of homomorphic equivalence. Following notations from Section C.2, the HiT-MDP family can be described by a tuple $M = \{\bar{S}, \bar{A}, r, P, \phi, \gamma\}$, where $\bar{S} \doteq S \times \bar{O}$ is an augmented state space, $\bar{A} \doteq A \times \bar{O}$ is an augmented action space, and $\phi = P(\bar{o}_t \mid \bar{s}_t) = P(\bar{o}_t \mid s_t, \bar{o}_{t-1})$ is the emit function for hidden variables. The joint distribution of HiT-MDPs is:

$$P(\bar{\tau}) = P(\bar{s}_1) \prod_{t=1}^{\infty} P(\bar{s}_{t+1}, \bar{a}_t \mid \bar{s}_t) = P(\bar{o}_0, s_1) \prod_{t=1}^{\infty} P(s_{t+1}, a_t, \bar{o}_t \mid s_t, \bar{o}_{t-1}).$$

Notice that the Markovian master policy $P(\bar{o}_t \mid s_t, \bar{o}_{t-1})$ is actually the marginalization over the termination variable $b_t$ in Eq. 24:

$$P(\bar{o}_t \mid s_t, \bar{o}_{t-1}) = \sum_{b_t} P(b_t \mid s_t, \bar{o}_{t-1}) \, P(\bar{o}_t \mid b_t, s_t, \bar{o}_{t-1}).$$

The joint distribution is factorized from Eq. 25 to Eq. 26 by following conditional independences in the PGM (Figure 17):

We define a linear function $f(\bar{o}) = \bar{o} \cdot d^T : \bar{O} \to O$ which maps $\bar{o}$ to $o$, where $d = [1, 2, \ldots, K]^T$ is a K-dimensional constant integer vector, and hence $f(\bar{o}) = o$. Note that $f$ is a bijection since it is a linear function defined on a finite integer space. We define a tuple of partition functions $B = \langle f, \bar{I}, \bar{I}, f \rangle$, where $\bar{I}$ is the identity function and is also a bijection. Therefore, the partition function $B$ is a tuple of bijections $B(\bar{s}, \bar{a}) = B(\{\bar{o}_{t-1}, s_t, a_t, \bar{o}_t\}) = \{o_{t-1}, s_t, a_t, o_t\}$, which maps $\bar{\tau}$ to $\tau$, where $\bar{\tau} = \{s_0, \bar{o}_0, a_0, s_1, \bar{o}_1, a_1, \ldots\}$ is the trajectory of the HiT-MDP. As for the reward function $r$, notice that from the PGM we have the conditional independence $r_{t+1} \perp\!\!\!\perp \bar{o}_t \mid a_t$. Therefore, SMDP-Option and the HiT-MDP also share the same expected reward
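The mapping $f$ between one-hot option vectors and integer option indices can be checked directly. The sketch below is an illustration only: the helper names `one_hot`, `f` and `partition_map` are ours, and options are indexed from 1 to K as in the definition of $d$.

```python
def one_hot(option_index, num_options):
    """One-hot vector for an option index in {1, ..., K}."""
    return [1 if k == option_index else 0 for k in range(1, num_options + 1)]


def f(one_hot_vector):
    """f(o_bar) = o_bar . d with d = [1, 2, ..., K]: one-hot -> integer index."""
    d = range(1, len(one_hot_vector) + 1)
    return sum(entry * weight for entry, weight in zip(one_hot_vector, d))


def partition_map(prev_option_vec, state, action, option_vec):
    """B = <f, I, I, f>: map (o_bar_{t-1}, s_t, a_t, o_bar_t) to index form."""
    return (f(prev_option_vec), state, action, f(option_vec))


K = 4
indices = [f(one_hot(o, K)) for o in range(1, K + 1)]
mapped = partition_map(one_hot(2, K), "s_t", "a_t", one_hot(3, K))
```

Each one-hot vector maps to a distinct index and every index in {1, ..., K} is reached, which is the bijection property used in the argument above; the state and action pass through unchanged.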

D PROOF OF THE OPTIMAL VALUE EQUIVALENCE

In this section we derive the value functions of HiT-MDPs and prove their optimal value equivalence to the option framework. Following the structure of Section C, we first reformulate the value functions of SMDP-Option into an MDP formulation by following the dynamics in Section C.2. We then follow the HiT-MDPs' dynamics proposed in Section C.3 to derive a Markovian Option-Value function of HiT-MDPs, and prove that it converges to the optimal value. In the end we prove that MDP-Option is optimal value equivalent to the SMDP-Option.

D.1 THE MDP FORMULATED OPTION FRAMEWORK'S VALUE FUNCTIONS

With the dynamics proposed in Section C.2, we now derive the MDP formulated value functions for the option framework (Bacon et al., 2017; Sutton et al., 1999). We follow Sutton & Barto (2018)'s notation in this section and write value functions for MDP below: Proof. $Q_U[o_t,$

$$
\begin{aligned}
U[s_{t+1}, o_t] &= (1 - \beta_{t+1}) Q_O[o_{t+1} = o_t, s_{t+1}] + \beta_{t+1} V[s_{t+1}] && (36) \\
&= Q_O[o_{t+1} = o_t, s_{t+1}] - \beta_{t+1} A[o_{t+1} = o_t, s_{t+1}], && (37)
\end{aligned}
$$

where from line 2 to line 3 we use the conditional independence property in PGM that $R_{t+1} \perp\!\!\!\perp$

$[\cdot, \cdot]$ and $V[\cdot]$ are bounded and $A < \infty$, the sequence $Q^k_A$ defined by $Q^{k+1}_A = T_H Q^k_A$ will converge to the option-action value function $Q^{\pi_A}_A$ as $k \to \infty$.

Proof. As with the standard convergence results for policy evaluation (Sutton & Barto, 2018), by the definition of $T_H$ (Eq. 8) the option-action value function $Q^{\pi_A}_A$ is a fixed point. To prove that $T_H$ is a contraction, define a norm on $\bar{V}$-value functions $\bar{V}$ and $\bar{U}$:

$$\|\bar{V} - \bar{U}\|_\infty \triangleq \max_{\bar{s} \in \bar{S}} |\bar{V}(\bar{s}) - \bar{U}(\bar{s})|,$$

where $\bar{s} = \{s, o\}$. By recursively applying the Hidden Temporal Bellman Operator $T_H$, we have:

$$\bar{V}[s_t, \bar{o}_{t-1}] = E[G_t \mid s_t, \bar{o}_{t-1}] = r(s, a) + \gamma \sum_{s_{t+1}, \bar{o}_t} P(s_{t+1}, \bar{o}_t \mid s_t, \bar{o}_{t-1}) \bar{V}[s_{t+1}, \bar{o}_t] = r(s, a) + \gamma E_{s_{t+1}, \bar{o}_t} \bar{V}[s_{t+1}, \bar{o}_t]. \quad (41)$$

Therefore, by applying Eq. 41 to $\bar{V}$ and $\bar{U}$ we have:

$$
\begin{aligned}
\|T^\pi \bar{V} - T^\pi \bar{U}\|_\infty &= \max_{\bar{s} \in \bar{S}} \big| \gamma E_{s_{t+1}, \bar{o}_t} \bar{V}[s_{t+1}, \bar{o}_t] - \gamma E_{s_{t+1}, \bar{o}_t} \bar{U}[s_{t+1}, \bar{o}_t] \big| \\
&= \gamma \max_{\bar{s} \in \bar{S}} \big| E_{s_{t+1}, \bar{o}_t} \big[ \bar{V}[s_{t+1}, \bar{o}_t] - \bar{U}[s_{t+1}, \bar{o}_t] \big] \big| \\
&\le \gamma \max_{\bar{s} \in \bar{S}} E_{s_{t+1}, \bar{o}_t} \Big[ \max_{\bar{s} \in \bar{S}} \big| \bar{V}[s_{t+1}, \bar{o}_t] - \bar{U}[s_{t+1}, \bar{o}_t] \big| \Big] \\
&\le \gamma \max_{\bar{s} \in \bar{S}} |\bar{V}[\bar{s}] - \bar{U}[\bar{s}]| = \gamma \|\bar{V} - \bar{U}\|_\infty. && (42)
\end{aligned}
$$

Therefore, $T_H$ is a contraction and by the fixed point theorem, Theorem 3.3 follows immediately.

Proposition D.2.1. The option-induced Markovian Option-Value Function $\bar{V}$ is equivalent to the conventional value function $V$.

Proof for Proposition 3.4: By the law of total expectation:

$$E_{\bar{o}_{t-1}}[\bar{V}[s_t, \bar{o}_{t-1}]] = E_{\bar{o}_{t-1}}[E[G_t \mid s_t, \bar{o}_{t-1}]] = E[G_t \mid s_t] = V[s_t],$$

thus $\bar{V}[s_t, \bar{o}_{t-1}]$ is an unbiased estimator of $V[s_t]$, with conditional independences defined in the PGM (Figure 17).

Proposition D.2.2. The option-induced Markovian Option-Value Function $\bar{V}$ has smaller variance than the conventional value function $V$.

Proof for Proposition 3.5: By the law of total conditional variance:

$$
\begin{aligned}
\mathrm{Var}(V[s_t]) = \mathrm{Var}([E[G_t \mid s_t]]) &= E[\mathrm{Var}(E[G_t \mid s_t, \bar{o}_{t-1}]) \mid s_t] + \mathrm{Var}(E[E[G_t \mid s_t, \bar{o}_{t-1}]] \mid s_t) \\
&= E[\mathrm{Var}(\bar{V}[s_t, \bar{o}_{t-1}]) \mid s_t] + \mathrm{Var}(E[\bar{V}[s_t, \bar{o}_{t-1}]] \mid s_t) \\
&\ge \mathrm{Var}(E[\bar{V}[s_t, \bar{o}_{t-1}]] \mid s_t),
\end{aligned}
$$

with conditional independences defined in the PGM (Figure 17).
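The variance argument rests on the law of total variance. As a numerical illustration only (it is not part of the proof), the sketch below uses made-up returns for one state reached under two equally likely previous options, mirroring the 10 / -10 example of Appendix A, and checks that the variance of the return splits into a within-option part and a between-option part.

```python
from statistics import mean, pvariance

# Made-up returns observed from the same state s_t, grouped by the previous
# option. Groups have equal size and the options are assumed equally likely.
returns_by_option = {
    "o1": [9.0, 11.0],
    "o2": [-11.0, -9.0],
}

pooled = [g for returns in returns_by_option.values() for g in returns]
total = pvariance(pooled)  # Var(G | s)

# E[Var(G | s, o)]: average spread around each option-conditional value.
within = mean(pvariance(returns) for returns in returns_by_option.values())

# Var(E[G | s, o]): spread of the option-conditional values themselves.
between = pvariance([mean(returns) for returns in returns_by_option.values()])
```

In this toy data almost all of the spread seen by a state-only value function comes from mixing the two options (`between`), while the spread that remains once the previous option is known (`within`) is small.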

E MAX-ENTROPY OPTION POLICY GRADIENTS

In the main text Section 4 we solve the HiT-MDPs under the maximum entropy framework (Levine, 2018). Here we prove the Evidence Lower BOund (ELBO) proposed in the main text and derive the Maximum entropy Option Policy Gradient (MOPG) theorems.

E.1 THE MAXIMUM ENTROPY OBJECTIVE FUNCTION

Following notations defined in the main text Section 4, optimality variables $e_{1:T}$ (e denotes the realization of e when e = 1) are treated as observed variables, while random variables from the trajectory $\tau = \{s, a, \bar{o}, \ldots\}$ are treated as latent variables. The variational Evidence Lower BOund (ELBO) is given by:

Theorem E.1. The problem of learning optimal action and master policies can be simplified as shrinking the KL-Divergence $D_{KL}[P(\tau) \| q(\tau \mid e_{1:T})]$.

Proof.

$$
\begin{aligned}
\log q(e_{1:T}) &= \log \int q(e_{1:T}, s_{1:T}, a_{1:T}, \bar{o}_{1:T}) \, ds_{1:T} \, da_{1:T} \, d\bar{o}_{1:T} \\
&= \log \int q(e_{1:T}, s_{1:T}, a_{1:T}, \bar{o}_{1:T}) \frac{P(\tau)}{P(\tau)} \, ds_{1:T} \, da_{1:T} \, d\bar{o}_{1:T} \\
&= \log E_{(s_{1:T}, a_{1:T}, \bar{o}_{1:T}) \sim P(\tau)} \left[ \frac{q(s_{1:T}, a_{1:T}, \bar{o}_{1:T} \mid e_{1:T}) \, q(e_{1:T})}{P(\tau)} \right] \\
&\ge E_{(s_{1:T}, a_{1:T}, \bar{o}_{1:T}) \sim P(\tau)} \left[ \log q(s_{1:T}, a_{1:T}, \bar{o}_{1:T} \mid e_{1:T}) - \log P(\tau) \right] \\
&= -D_{KL}[P(\tau) \| q(\tau \mid e_{1:T})],
\end{aligned}
$$

where the step from the third-to-last line to the second-to-last line follows from Jensen's inequality and $P(\tau)$ is the dynamics of the HiT-MDPs defined in Eq. 26.

Theorem E.2. (Markovian Option Policy Iteration Theorem). Repeated application of Markovian Option evaluation and improvement to any $\pi_{O,A} \in \Pi$ converges to a policy $\pi^*_{O,A}$ such that $Q^{\pi^*_{O,A}}(\bar{s}_t, \bar{a}_t) \ge Q^{\pi_{O,A}}(\bar{s}_t, \bar{a}_t)$ for all $\pi_{O,A} \in \Pi$ and $(\bar{s}_t, \bar{a}_t) \in \bar{S} \times \bar{A}$, assuming $|\bar{A}| < \infty$.

Proof. In this proof we use $\pi_A$ for illustration; the proof for $\pi_O$ follows the same derivation. From Theorem 4.1 we have that:

$$
\begin{aligned}
\pi^*_A &= \arg\max_{\pi_A} -D_{KL}[P(\tau) \| q(\tau \mid e_{1:T})] \\
&= \arg\max_{\pi_A} \sum_t E_{\pi_A}[r(s_t, a_t) + I[o_t] + H(\pi_O) + H(\pi_A)] \\
&= \arg\max_{\pi_A} \sum_t E_{\pi_A}[r(s_t, a_t) + I[o_t] + H(\pi_A)].
\end{aligned}
$$

Therefore, it must be the case that

$$E_{a_t \sim \pi^{new}_A}[r(s_t, a_t) + I(o_t) + H(\pi^{old}_A)] \ge E_{a_t \sim \pi^{old}_A}[r(s_t, a_t) + I[o_t] + H(\pi^{old}_A)] = \bar{V}^{\pi^{old}_A}[s, \bar{o}],$$

since we can always choose $\pi^{new}_A = \pi^{old}_A \in \Pi$.
Substituting this inequality into Theorem 3.2 leads to:

$$
\begin{aligned}
Q^{\pi^{old}_A}_A[s, \bar{o}, a] &= r(s, a) + I[o] + \gamma E_s[\bar{V}^{\pi^{old}_A}[s, \bar{o}]] \\
&\le r(s, a) + I[o] + \gamma E_s[E_{a \sim \pi^{new}_A}[r(s_{t+1}, a_{t+1}) + I[o_{t+1}] + H(\pi^{old}_A)]] \\
&\le Q^{\pi^{new}_A}_A[s, \bar{o}, a],
\end{aligned}
$$

where the convergence to Q is monotonically increasing and we get $Q^{\pi^*_A}_A > Q^{\pi_A}_A$ for all $\pi_A \neq \pi^*_A$ and $(s, o, a) \in S \times O \times A$.

Interpretability is a key property for applying RL agents in real-world applications. Attention Option Critic (AOC) (Chunduru & Precup, 2020) first introduced the attention mechanism into options. In AOC, each option has a unique attention mask over state s = Ẅo ⊙ s. Therefore, each option has a unique activation field (e.g., initiation set) over the observation space. Options learned by AOC are also interpretable since each option attends to a different context in the state vector.


Our method contributes to the option framework from a different aspect. In AOC, an option is still a tuple $(I, \beta, \pi_A)$, while in our work an option is an embedding vector $\hat{o}$. The attention mechanism is used as a distance measure to compare which option is closer to the concatenated state-option pair (as shown in Eq. 48).

As explained in the main text Section 5.3, options in HiT-MDP can be learned as option embeddings. Inspired by the Transformer (Vaswani et al., 2017), in this section we implement the HiT-MDP as a simple yet effective Multi-Head Attention (MHA) (Vaswani et al., 2017) based Encoder-Decoder architecture, as shown in Figure 18. Specifically, an attention mechanism is described as the mapping from a query $q \in \mathbb{R}^E$ and a set of key-value pairs, i.e., $K \in \mathbb{R}^{M \times E}$ and $V \in \mathbb{R}^{M \times E}$ (M and E are the total number of options and the embedding dimension), to an output:

$$\mathrm{Attention}(q, K, V) = \mathrm{softmax}\left(\frac{q K^T}{\sqrt{E}}\right) V.$$

A Multi-Head Attention $\mathrm{MHA}(q, K, V)$ is a linear projection of h (number of heads) concatenated, linearly projected Attention outputs:

$$\mathrm{MHA}(q, K, V) = \mathrm{Concat}[\mathrm{head}_1, \ldots, \mathrm{head}_h] W^H, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(q W^q_i, K W^K_i, V W^V_i), \quad (47)$$

where the projections are parameter matrices $W^q_i \in \mathbb{R}^{E \times E}$, $W^K_i \in \mathbb{R}^{E \times E}$, $W^V_i \in \mathbb{R}^{E \times E}$, $W^O_i \in \mathbb{R}^{hE \times E}$. In this paper we use MHA as one building block, as illustrated in Figure 18. We implement the option policy $P(\bar{o}_t \mid s_t, \bar{o}_{t-1}; W_o)$ as the encoder and treat the option embedding matrix $W_o$ as the encoder's parameters. Specifically, we define the option encoder as: The action policy can be simply implemented as one decoder, which learns to decode $\hat{o}_t$ and $s_t$ into primary actions $a_t$. where $Q_O$ is also a decoder of $s_t$ and $\bar{o}_t$. We summarize the detailed algorithm in Appendix F and upload our code in the supplementary materials.
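The attention formula above can be written out in a few lines of standard-library Python. The sketch below is a simplification under stated assumptions: it implements single-head scaled dot-product attention without the learned projections, FFN layers or multiple heads of the actual architecture, and the toy option embeddings and query are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, K, V):
    """Attention(q, K, V) = softmax(q K^T / sqrt(E)) V for a single query.

    q has length E; K and V are lists of M rows of length E.
    Returns (output vector, attention weights over the M rows).
    """
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale for key in K]
    weights = softmax(scores)
    output = [sum(w * row[j] for w, row in zip(weights, V))
              for j in range(len(V[0]))]
    return output, weights


# Toy option embedding matrix (assumed values): one row per option.
option_embeddings = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]  # stands in for the projected state-option pair
output, option_weights = attention(query, option_embeddings, option_embeddings)
```

Here `option_weights` plays the role of a similarity-based distribution over options: the embedding closest to the query (in dot-product terms) receives the largest weight, which is the "options as clustering centroids" reading given in the text.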

G.2 HYPERPARAMETERS

In this section we summarize our implementation details. For a fair comparison, all baselines: DAC+PPO (Zhang & Whiteson, 2019), AHP+PPO (Levy & Shimkin, 2011), PPOC (Klissarov et al., 2017) , OC (Bacon et al., 2017) and PPO (Schulman et al., 2017) 



https://youtube.com/shorts/M06BPqit7l4?feature=share

Different from the conventional formulation, which depends only on state s_t, our termination function has an extra dependence on ō_{t-1}.

Different from the conventional formulation, which depends only on state s_t, our mixture master policy has extra dependencies on ō_{t-1} and b_t.

Both equations (36) and (37) are widely used in the conventional SMDP papers (Sutton et al., 1999; Bacon et al., 2017).

https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html

https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html

https://pytorch.org/docs/stable/generated/torch.nn.Linear.html

https://github.com/pytorch/pytorch/blob/master/torch/distributions/categorical.py



, an option is a triple (I o , π o , β o ) ∈ O, where O denotes the option set; the subscript o ∈ O = {1, 2, . . . , K} is a positive integer index which denotes the o-th triple where K is the number of options; I o is an initiation set indicating where the option can be initiated; π o = P o (a|s) : A × S → [0, 1] is the action policy of the oth option; β o = P o (b = 1|s) : S → [0, 1] where b ∈ {0, 1} is a termination function. For clarity, we use P o (b = 1|s) instead of β o which is widely used in previous option literatures (e.g., Sutton et al. (1999); Bacon et al. (2017)). A master policy π(o|s) = P (o|s)

Figure 1: Probabilistic Graphical Model (PGM) of HiT-MDP.

Figure 2: PGM of Eq. 5.

Figure 3: Single-task episodic returns in 4 different environments (i.e., HalfCheetah-v2, HumanoidStandup-v2, Walker2d-v2, and Hopper-v2). Results in all 10 environments are available in Appendix B.1.

Figure 4: Transfer learning results on three pairs of tasks. All 6 pairs of results are available in Appendix. B.1

Figure 5: Interpretation of option embeddings learned in HalfCheetah

Figure 6: Options Composing Patterns

Figure 5: Performance of Ten OpenAI Gym MuJoCo Environments.

Figure 7: Activated option sequences of 4 independent HalfCheetah runs.

Figure 8: All option baselines' option durations for comparison in the HalfCheetah environment

Figure 9: All option baselines failed to learn useful options in harder environment (HumanoidStandupV2)

Figure 11: Histogram of entropy weights v.s. number of options used in HalfCheetah environment. The larger the entropy weight, more options will be used.

Figure 12: Heatmap of all 4 option context vectors

Figure 13: Perturbation on the Dim 4

Following Bishop (2006)'s notation, we use A, B and C to denote three non-overlapping sets of arbitrarily many random variables. Sets A and B are conditionally independent given set C if P(A, B|C) = P(A|C)P(B|C), denoted as A ⊥⊥ B | C. We mainly use head-to-tail conditional independence properties (Chapter 8.2.1 (Bishop, 2006)) in this section.

C.1 THE SMDP FORMULATED OPTION FRAMEWORK

Sutton et al. (1999) proposed the option framework to demonstrate the temporal abstraction problem. Following Bishop (2006)'s notation, we use a bold letter s ∈ S to denote a random variable and a normal letter s to denote its realization. Unless otherwise stated, a random vector can have either continuous or discrete entries. A scalar o ∈ Z denotes the index of an option, where O ⊆ {1, 2, . . . , M} and M is the number of options. A Markovian option is a triple (I_o, P_o(a|s), P_o(b|s)) in which I_o ⊆ S is an initiation set where the option o can be initiated. P_o(a|s) : S → A is the intra-option policy, which maps environment states s ∈ S to an action vector a ∈ A. P_o(b|s) : S → Z_2 is a termination function, where b is a binary random variable. It is used to determine whether to terminate (b = 1) the policy P_o(a|s) or not (b = 0). Conventionally, β_o = P_o(b = 1|s).

Figure 15: An illustration of the SMDP option framework. An option o_{t-1} is selected by the master policy P(o_{t-1}|s_{t-1}) at time step t-1. At time step t, the termination function β_{o_{t-1}}(s_t) decides to continue option o_{t-1}. Hence there is no random variable o_t at time step t, whereas in the MDP formulation (Figure 16) a random variable o exists at every time step.

Figure 16: PGM of the MDP option framework. Notice the difference from Figure 15, which is SMDP formulated: if the agent decides to continue executing o_{t-1}, the random variable o_t does not exist, and thus o_{t+1} depends directly on the random variable o_{t-1}. Here, in contrast, the PGM is MDP formulated, and the random variable ō exists at every time step.

Figure 17: PGM of the HiT-MDP

is monotonically increasing and we get Q

π

$$\bar{o}_t \sim \mathrm{Categorical}\big(\mathrm{Clustering}(s_t, \bar{o}_{t-1}, W_o)\big) \tag{48}$$

where Categorical(·) is a K-dimensional categorical distribution, K is the number of options, and distances between the pair $[s_t, \hat{o}_{t-1}]$ (where $\hat{o}_{t-1} = W_o \bar{o}_{t-1}$). Under this configuration, options can be seen as clustering centroids. The problem of selecting the next option $o_t$ is equivalent to finding which clustering centroid $\hat{o}$ in the embedding matrix $W_o = [\hat{o}_1, \ldots, \hat{o}_K]$ is closest to the projected state-option pair $\mathrm{FFN}([s_t, \hat{o}_{t-1}])$, computed by an efficient MHA-based clustering module:

$$\mathrm{Clustering}(s_t, \bar{o}_{t-1}, W_o) = \mathrm{FFN}\big(\mathrm{MHA}(\mathrm{Query} = \mathrm{FFN}([s_t, \hat{o}_{t-1}]),\ \mathrm{Key} = \mathrm{Value} = W_o)\big) \tag{49}$$
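The centroid view of Eq. (49) can be illustrated with a deliberately simplified sketch. It assumes a single attention head, replaces both FFN layers with the identity, and uses the attention weights directly as the categorical parameters of Eq. (48). The actual module passes the attention output through an FFN, so this illustrates the idea and does not reproduce the architecture. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def option_probabilities(
    query: Sequence[float], centroids: Sequence[Sequence[float]]
) -> List[float]:
    """Scaled dot-product attention weights of one query over K option centroids.

    `query` stands in for FFN([s_t, o_hat_{t-1}]) and each row of `centroids`
    stands in for one option embedding in W_o (both are assumptions).
    """
    dim = len(query)
    scores = [
        sum(q * c for q, c in zip(query, row)) / math.sqrt(dim)
        for row in centroids
    ]
    return softmax(scores)


def most_likely_option(probs: Sequence[float]) -> int:
    """Index of the centroid with the largest attention weight."""
    return max(range(len(probs)), key=lambda k: probs[k])
```

A query that is aligned with the first centroid receives the largest weight for option 0, and a zero query yields a uniform distribution over the K options.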

$$a_t \sim \mathrm{Gaussian}\big(\mathrm{FFN}([s_t, \hat{o}_t])\big) \tag{50}$$

Because the Markovian option-value function $V[s_{t+1}, \bar{o}_t]$ is an expectation of the option-value function $Q_O[s_{t+1}, \bar{o}_{t+1}]$ in Eq. (6), we need to model only one critic function: $Q_O = \mathrm{FFN}(s_t, \bar{o}_t)$,

and the option-action value function Q A [s t , ōt , a t ] be defined by:

One closely related work is DAC (Zhang & Whiteson, 2019), which reformulates the option framework into two augmented MDPs. Under this formulation, all policies can be modeled explicitly and learned in MDP style. However, DAC is not theoretically proven to be homomorphic equivalent to the SMDP-Option. Moreover, the reward function r(τ) used in Lemmas 1 and 2 (Zhang & Whiteson, 2019) is undefined, so the equivalence in Proposition 1 (Zhang & Whiteson, 2019) does not hold. To the best of our knowledge, this is the first work proving MDPs' equivalence to option-induced SMDPs under the homomorphic equivalence framework.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Markus Wulfmeier, Dushyant Rao, Roland Hafner, Thomas Lampe, Abbas Abdolmaleki, Tim Hertweck, Michael Neunert, Dhruva Tirumala, Noah Siegel, Nicolas Heess, et al. Data-efficient hindsight off-policy option learning. arXiv preprint arXiv:2007.15588, 2020.

Shangtong Zhang and Shimon Whiteson. DAC: The double actor-critic architecture for learning options. In Advances in Neural Information Processing Systems, pp. 2012-2022, 2019.

Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.

The difficulty in this environment is that, since the robot only has a camera to capture an image of the turkey, it can only observe either {Raw Turkey Image} or {Cooked Turkey Image}. There is

This is true in all reinforcement learning settings without dependencies on latent variables. However, it goes much deeper in HRL settings. In HRL, a common formulation is to estimate a latent variable O that encodes hierarchical information and to make the policy depend on it, P(A_t|S_t, O_t). Since O is a latent variable, it is highly likely that at state S_t, different latent values, P(A_t|S_t, O_t = o_x) and P(A_t|S_t, O_t = o_y), emit the same action A_t = A_1, which makes the conventional value function unable to distinguish between o_x and o_y. This phenomenon is especially common around the switching time step of two options: around a switching point, states are usually compatible with both the old and the new option, so conventional value functions are especially confused at those moments. This is exactly what we observed in Figure5.3: overall, option 3 is executed consistently, but there are some random switches to option 2, and the randomness increases around switching time steps. To show this explicitly, we visualized "Run4" as a video: https://www.youtube.com/shorts/M06BPqit7l4?feature=share.

Performance of Infinite Horizon Environments

Performance of Finite Horizon Environments

Performance of Deepmind Control Suite Transfer Learning Environments

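$$P(\bar{o}_t, b_t \mid s_t, \bar{o}_{t-1}) = P(b_t \mid s_t, \bar{o}_{t-1})\, P(\bar{o}_t \mid s_t, b_t, \bar{o}_{t-1}), \tag{21}$$

and the marginal probability distribution can be written as:

$$\begin{aligned}
P(\bar{o}_t \mid s_t, \bar{o}_{t-1}) &= \sum_{b_t} P(b_t \mid s_t, \bar{o}_{t-1})\, P(\bar{o}_t \mid s_t, b_t, \bar{o}_{t-1}) \\
&= P(b_t = 0 \mid s_t, \bar{o}_{t-1})\, P(\bar{o}_t \mid \bar{o}_{t-1}) + P(b_t = 1 \mid s_t, \bar{o}_{t-1})\, P(\bar{o}_t \mid s_t) \\
&= (1 - \beta_t)\, P(\bar{o}_t \mid \bar{o}_{t-1}) + \beta_t\, P(\bar{o}_t \mid s_t) \\
&= (1 - \beta_t)\, \mathbb{1}_{\bar{o}_t = \bar{o}_{t-1}} + \beta_t\, P(\bar{o}_t \mid s_t).
\end{aligned} \tag{22}$$

t |s t-1 , a t-1 )P (a t |s t , ōt ) (b t |s t , ōt-1 )P (ō t |b t , s t , ōt-1 )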
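The closed form of Eq. (22) is a mixture of "keep the previous option" and "ask the master policy". A minimal numerical sketch follows. The function name `option_transition` and the toy probabilities are illustrative assumptions, and the master distribution P(ō_t|s_t) is passed in as a plain dictionary.

```python
from typing import Dict, Hashable


def option_transition(
    prev_option: Hashable, beta: float, master_probs: Dict[Hashable, float]
) -> Dict[Hashable, float]:
    """Return P(o_t | s_t, o_{t-1}) = (1 - beta) * 1[o_t = o_{t-1}] + beta * P(o_t | s_t).

    `beta` is the termination probability beta_t of the previous option in s_t,
    and `master_probs` maps each option index to P(o_t | s_t).
    """
    return {
        option: (1.0 - beta) * (1.0 if option == prev_option else 0.0) + beta * prob
        for option, prob in master_probs.items()
    }


# Toy values (assumed): three options, previous option 1, beta_t = 0.4.
toy_master = {0: 0.2, 1: 0.3, 2: 0.5}
toy_transition = option_transition(prev_option=1, beta=0.4, master_probs=toy_master)
```

With beta = 0 the result collapses to the indicator on the previous option, and with beta = 1 it equals the master policy, matching the two branches summed over b_t.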

t+1 |s t , a t )P (a t |s t , ōt )P (ō t |s t , ōt-1 ),

function r_{t+1}(s_t, a_t, o_t) = E[r_{t+1}|s_t, a_t, o_t] = E[r_{t+1}|s_t, a_t] = r_{t+1}(s_t, a_t). Therefore, the SMDP-based option framework has an MDP-based equivalence:

Theorem C.1. By the definition of Bisimulation Relation, the SMDP-based option framework, which employs Markovian options, has an underlying MDP equivalence because:

Although the reward function r is equivalent, in Section D we identify a problem: under the configuration of HiT-MDPs, the conventional value function V[s_t] does not yield a Bellman equation. We therefore propose a Markovian Option-Value Function V[s_t, ō_{t-1}] to solve this problem and prove that it is equivalent to V[s_t]. This is non-trivial since, compared to the SMDP-Option, the MDP formulation introduces extra dependencies on ō. In Section D.2, by exploiting conditional independencies, we prove that they do have the same expected return under the Bijection B.

$$\begin{aligned}
Q_U[o_t, s_t, a_t] &= E[G_t \mid o_t, s_t, a_t] \\
&= E[R_{t+1} + \gamma G_{t+1} \mid o_t, s_t, a_t] && \text{by definition of } G_t \\
&= E[R_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} \sum_{G_{t+1}} G_{t+1}\, P(s_{t+1} \mid s_t, o_t, a_t)\, P(G_{t+1} \mid s_{t+1}, o_t, s_t, a_t) \\
&= r(s_t, a_t) + \gamma \sum_{s_{t+1}} \sum_{G_{t+1}} G_{t+1}\, P(s_{t+1} \mid s_t, a_t)\, P(G_{t+1} \mid s_{t+1}, o_t) && \text{using Eqs. (30), (31) and (33)} \\
&= r(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, E[G_{t+1} \mid s_{t+1}, o_t] \\
&= r(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, U[s_{t+1}, o_t].
\end{aligned}$$

However, unlike the conventional value functions $V[s_t]$, $Q_O[o_t, s_t]$, $Q_U[o_t, s_t, a_t]$ of the option framework derived in Section D.1, the value function $U[s_{t+1}, \bar{o}_t]$ has an extra dependency on the hidden variable $\bar{o}_t$. Therefore, the conventional value function $V[s_t]$ does not yield a Bellman equation under the configuration of HiT-MDPs. To tackle this issue, we propose the Markovian Option-Value Function V (derivations of Eq. (6) in the main text):

O [ō t , s t ], (35)

where from line 2 to line 3 we use the conditional independence property in the PGM that $G_t \perp\!\!\!\perp \bar{o}_{t-1} \mid \{s_t, \bar{o}_t\}$.
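The last line of the derivation above, Q_U[o_t, s_t, a_t] = r(s_t, a_t) + γ Σ P(s_{t+1}|s_t, a_t) U[s_{t+1}, o_t], is a one-step backup and can be checked numerically. The sketch below assumes a finite state space represented by dictionaries; the function name `q_u_backup` and the toy numbers are illustrative and not taken from the paper.

```python
from typing import Dict, Hashable


def q_u_backup(
    reward: float,
    gamma: float,
    next_state_probs: Dict[Hashable, float],
    u_values: Dict[Hashable, float],
) -> float:
    """One-step backup r(s_t, a_t) + gamma * sum_{s'} P(s' | s_t, a_t) * U[s', o_t].

    `next_state_probs` holds P(s_{t+1} | s_t, a_t) for a fixed (s_t, a_t), and
    `u_values` holds U[s_{t+1}, o_t] for the currently active option o_t.
    """
    expected_u = sum(
        prob * u_values[next_state] for next_state, prob in next_state_probs.items()
    )
    return reward + gamma * expected_u


# Toy values (assumed): two successor states with equal probability.
toy_q_u = q_u_backup(
    reward=1.0,
    gamma=0.9,
    next_state_probs={"s1": 0.5, "s2": 0.5},
    u_values={"s1": 2.0, "s2": 4.0},
)
```

Here the expected value of U upon arrival is 3.0, so the backup evaluates to 1.0 + 0.9 × 3.0 = 3.7 up to floating-point rounding.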

ō_t | a_t, G_{t+1} ⊥⊥ s_t | {s_{t+1}, ō_t} and G_{t+1} ⊥⊥ a_t | s_{t+1}. γ ∈ R is a discount factor. The last line uses the definition of the Markovian option-value function (Eq. 35).

Theorem D.2. (Markovian Option Policy Evaluation Theorem). Assume that throughout our computation the Q A

2 PROOF FOR THE MASTER POLICY GRADIENT THEOREM

Similar to the first equation above, we continue expanding the gradients of $\frac{\partial Q_O}{\partial \theta_a}$ by equations (6), (28) and (8):

$$\frac{\partial Q_O[s_t, \bar{o}_t]}{\partial \theta_a} = \sum_{a_t} \frac{\partial P(a_t \mid s_t, \bar{o}_t)}{\partial \theta_a}\, Q_A[s_t, \bar{o}_t, a_t] + H(\pi^A_{\theta_a}) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, \bar{o}_t)\, \frac{\partial V[s_{t+1}, \bar{o}_t]}{\partial \theta_a}$$

$$= \sum_{a_t} \frac{\partial P(a_t \mid s_t, \bar{o}_t)}{\partial \theta_a}\, Q_A[s_t, \bar{o}_t, a_t] + H(\pi^A_{\theta_a}) + \gamma$$

LEARNING ALGORITHM FOR MOPG

Algorithm 1: The MOPG Algorithm
1 Initialize the option embedding matrix W_S
2 Assign Initial State: s_t ← s_0
3 Assign Initial Option: ô_{t-1} ← ô_0
Sample ô_t ∼ P(ô_t | s_t, ô_{t-1})
10 Retrieve the option context vector ô_t = W_S^T · ô_t
11 Sample a_t ∼ P(a_t | s_t, ô_t)
12 Compute Q_O[s_t, ô_t] and V[s_t, ô_{t-1}]
13 Take action a_t in s_t, observe new state s_{t+1} and reward R_{t+1}

are from DAC's open source Github repo: https://github.com/ShangtongZhang/DeepRL/tree/DAC. Hyper-parameters used in DAC (Zhang & Whiteson, 2019) for all these baselines are kept unchanged.

MOPG Architecture: For all experiments, our implementation of MOPG is exactly the same as Figure 18(b). We use PyTorch to build the neural networks. Specifically, for the skill policy module, we use a skill context matrix W_S ∈ R^{4×40}, which has 4 skills (4 rows) and an embedding size of 40 (40 columns). For Multi-Head Attention, we use PyTorch's built-in MultiheadAttention module with num_heads = 1 and embed_dim = 40. For layer normalization, we use PyTorch's built-in LayerNorm. For Feed Forward Networks (FFN), we use a 2-layer FFN with ReLU activation, an input size of 40, a hidden size of 64, and an output size of 64 neurons. For the Linear layer, we use the built-in Linear module to map the FFN's outputs to 4 dimensions. Each dimension acts

ACKNOWLEDGEMENT

CL offers his sincerest gratitude to Shangtong Zhang, Pierre-Luc Bacon and Doina Precup for their time and patience in answering his questions. We deeply appreciate Shangtong Zhang and Jiayi Weng for their exceptional open source projects DeepRL https://github.com/ShangtongZhang/DeepRL (the original implementation in the supplementary material and all experiment results in this paper are based on DeepRL) and tianshou https://github.com/thu-ml/tianshou (the latest GitHub repo is based on tianshou).


(Bacon et al., 2017; Sutton et al., 1999), G t = t r t+1 is the return. Note that in deriving equation (27) we only use the sum rule and the product rule; the conditional dependency relationships in the PGM (Figure 16) are not used.

The option value function Q_O[o_t, s_t] can be further expanded as:

where

Proof. Note that in the derivations above we only use the sum and product rules. Both equations (27) and (28) are identical to the conventional SMDP option framework.

From now on, we continue the derivations with the conditional independence relationships encoded in the PGM (Chapter 8.2.1, Bishop, 2006). We have the following conditional independence relationships from the PGM (Figure 16):

With the above conditional independence relationships in hand, we now show that the MDP formulation has identical value functions to the conventional SMDP formulation (Sutton et al., 1999; Bacon et al., 2017).

Proof. From line 3 to line 4 we use equations (29) and (32). From line 4 to line 5 we use equation (22) and the definition of Q_O. The second-to-last line uses equation (27). The last line uses the definition of the advantage function A.

Under our MDP formulation, we also propose Proposition D.1.4. We derive our gradient theorems based on equation (38) in Section E. This important relationship greatly simplifies the derivations compared with the original paper (Bacon et al., 2017) and gives rise to the HiT-MDP.

Proposition D.1.4. The option-value function upon arrival

Proof. Following the proof of Proposition D.1.3,

D.2 THE OPTIMAL VALUE EQUIVALENCE BETWEEN HIT-MDP AND OPTION FRAMEWORKS

Theorem D.1. The option-action value function Eq. 7 satisfies the Bellman Operator T H where the Markovian option-value function is given by Eq. 6.

Proof. Following the dynamics derived in Section C.3, we can define value functions on HiT-MDPs as (derivations of Eq. (8) in the main text):

Published as a conference paper at ICLR 2023

like a logit for a skill and is used as the density of a Categorical distribution. For both the action policy and critic modules, the FFNs are of the same size as the one used in the skill policy.

Preprocessing: States are normalized by a running estimate of the mean and std.

Hyperparameters of PPO: For a fair comparison, we use exactly the same PPO parameters as DAC. Specifically:
• Optimizer: Adam with ϵ = 10^-5 and an initial learning rate of 3 × 10^-4
• Discount ratio γ: 0.99
• GAE coefficient: 0.95
• Gradient clip by norm: 0.5
• Rollout length: 2048 environment steps
• Optimization epochs: 10
• Optimization batch size: 64
• Action probability ratio clip: 0.2

Computing Infrastructure: We conducted our experiments on an Intel® Core™ i9-9900X CPU @ 3.50GHz with a single thread and process with PyTorch.
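The preprocessing step above normalizes states by running estimates of the mean and std. The text does not say which estimator is used, so the sketch below assumes Welford's online algorithm with the population variance, tracks a single scalar feature for brevity, and uses a hypothetical class name `RunningNormalizer`. A state vector would keep one such tracker per dimension.

```python
import math


class RunningNormalizer:
    """Running mean/std of one scalar feature (Welford's algorithm, assumed)."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero for constant features

    def update(self, value: float) -> None:
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def std(self) -> float:
        """Population standard deviation; 1.0 until two samples are seen."""
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, value: float) -> float:
        """Center and scale `value` with the current running statistics."""
        return (value - self.mean) / (self.std + self.eps)


# Toy usage (assumed data): feed four observations of one state dimension.
toy_normalizer = RunningNormalizer()
for observation in [1.0, 2.0, 3.0, 4.0]:
    toy_normalizer.update(observation)
```

After these four updates the running mean is 2.5 and the running variance is 1.25, so an observation equal to the mean is mapped to 0.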

