HAMILTONIAN Q-LEARNING: LEVERAGING IMPORTANCE-SAMPLING FOR DATA EFFICIENT RL

Abstract

Model-free reinforcement learning (RL), and in particular Q-learning, is widely used to learn optimal policies for a variety of planning and control problems. However, when the underlying state-transition dynamics are stochastic and high-dimensional, Q-learning requires a large amount of data and incurs prohibitively high computational cost. In this paper, we introduce Hamiltonian Q-Learning, a data-efficient modification of the Q-learning approach that adopts an importance-sampling based technique for computing the Q function. To exploit the stochastic structure of the state-transition dynamics, we employ Hamiltonian Monte Carlo to update Q function estimates, approximating the expected future reward using Q values associated with a subset of next states. Further, to exploit the latent low-rank structure of the dynamical system, Hamiltonian Q-Learning uses a matrix completion algorithm to reconstruct the updated Q function from Q value updates over a much smaller subset of state-action pairs. By providing an efficient way to apply Q-learning in stochastic, high-dimensional settings, the proposed approach broadens the scope of RL algorithms for real-world applications, including classical control tasks and environmental monitoring.

1. INTRODUCTION

In recent years, reinforcement learning (Sutton & Barto, 2018) has achieved remarkable success in sequential decision making tasks, especially in complex, uncertain environments. RL algorithms have been widely applied to a variety of real-world problems, such as resource allocation (Mao et al., 2016), chemical process optimization (Zhou et al., 2017), automatic control (Duan et al., 2016), and robotics (Kober et al., 2013). Existing RL techniques often offer satisfactory performance only when the agent is allowed to explore the environment long enough, generating a large amount of data in the process (Mnih et al., 2015; Kamthe & Deisenroth, 2018; Yang et al., 2020a). This can be prohibitively expensive and thereby limits the use of RL for complex decision support problems. Q-Learning (Watkins, 1989; Watkins & Dayan, 1992) is a model-free RL framework that captures the salient features of sequential decision making, where an agent, after observing the current state of the environment, chooses an action and receives a reward. The action chosen by the agent is based on a policy defined by the state-action value function, also called the Q function. The performance of such policies depends strongly on the accessibility of a sufficiently large data set covering the space spanned by the state-action pairs. In particular, for high-dimensional problems, existing model-free RL methods relying on random sampling lead to poor performance and high computational cost. To overcome this challenge, in this paper we propose an intelligent sampling technique that exploits the inherent structure of the underlying space related to the dynamics of the system.
It has been observed that formulating planning and control tasks in a variety of dynamical systems, such as video games (Atari games), classical control problems (simple pendulum, cart pole and double integrator) and adaptive sampling (ocean sampling, environmental monitoring), as Q-Learning problems leads to low-rank structures in the Q matrix (Ong, 2015; Yang et al., 2020b; Shah et al., 2020). Since these systems naturally involve a large number of states, efficient exploitation of the low-rank structure of the Q matrix can lead to significant reductions in computational complexity and improved performance. However, when the state space is high-dimensional and, further, the state transitions are probabilistic, the high computational complexity of calculating the expected Q values of next states renders existing Q-Learning methods impractical. A potential solution lies in approximating the expectation of the Q values of next states with the sample mean of Q values over a subset of next states. A natural way to select such a subset is to draw IID samples from the transition probability distribution. However, this straightforward approach becomes challenging when the state transition probability distribution is high-dimensional and known only up to a constant. We address this problem by using Hamiltonian Monte Carlo (HMC) to sample next states; HMC draws samples by integrating Hamiltonian dynamics governed by the transition probability (Neal et al., 2011). We further improve data efficiency by using matrix completion methods to exploit the low-rank structure of the Q matrix.

RELATED WORK

Data efficient Reinforcement Learning: The last decade has witnessed growing interest in improving the data efficiency of RL methods by exploiting global structures that emerge from the underlying system dynamics. Deisenroth & Rasmussen (2011); Pan & Theodorou (2014); Kamthe & Deisenroth (2018); Buckman et al. (2018) have proposed model-based RL methods that improve data efficiency by explicitly incorporating prior knowledge about the state transition dynamics of the underlying system. Dearden et al. (1998); Koppel et al. (2018); Jeong et al. (2017) propose Bayesian methods to approximate the Q function. Ong (2015) and Yang et al. (2020b) consider model-free RL approaches that exploit the structure of the state-action value function. The work by Ong (2015) decomposes the Q matrix into a low-rank plus sparse model and uses matrix completion methods (Candes & Plan, 2010; Wen et al., 2012; Chen & Chi, 2018) to improve data efficiency. More recently, Yang et al. (2020b) have shown that incorporating low-rank matrix completion methods to recover the Q matrix from a small subset of Q values can improve the learning of optimal policies. At each time step the agent chooses a subset of state-action pairs and updates their Q values according to the Bellman optimality equation, which combines the immediate reward with a discounted expectation over the Q values of next states. Shah et al. (2020) extend this work by proposing a novel matrix estimation method and providing theoretical guarantees for convergence to an ε-optimal Q function. On the other hand, entropy regularization (Ahmed et al., 2019; Yang et al., 2019; Smirnova & Dohmatob, 2020), by penalizing excessive randomness in the conditional distribution of actions for a given state, provides an alternative means to implicitly exploit the underlying low-dimensional structure of the value function. Lee et al. (2019) propose an approach that samples a whole episode and then updates values in a recursive, backward manner.

CONTRIBUTION

The main contribution of this work is three-fold. First, we introduce a modified Q-Learning framework, called Hamiltonian Q-Learning, which uses HMC sampling for efficient computation of Q values. This innovation, by sampling Q values from the region with the dominant contribution to the expectation of the discounted reward, provides a data-efficient approach to Q-Learning in real-world problems with high-dimensional state spaces and probabilistic state transitions. Furthermore, integrating this sampling approach with matrix completion enables us to update Q values for only a small subset of state-action pairs and thereafter reconstruct the complete Q matrix. Second, we provide theoretical guarantees that the error between the optimal Q function and the Q function obtained by updating Q values using HMC sampling can be made arbitrarily small. This result also holds when only a handful of Q values are updated using HMC and the rest are estimated using matrix completion. Further, we provide theoretical guarantees that the sampling complexity of our algorithm matches the minimax sampling complexity proposed by Tsybakov (2008). Finally, we demonstrate the effectiveness of Hamiltonian Q-Learning by applying it to a cart-pole stabilization problem and an adaptive ocean sampling problem. Our results also indicate that the proposed approach becomes more effective as the state space dimension increases.

2. PRELIMINARY CONCEPTS

In this section, we provide a brief background on Q-Learning, HMC sampling and matrix completion, and introduce the mathematical notation. In this paper, |Z| denotes the cardinality of a set Z. Moreover, R represents the real line and A^T denotes the transpose of a matrix A.

2.1. Q-LEARNING

A Markov decision process (MDP) is a mathematical formulation that captures the salient features of sequential decision making (Bertsekas, 1995). In particular, a finite MDP is defined by the tuple (S, A, P, r, γ), where S is the finite set of system states, A is the finite set of actions, P : S × A × S → [0, 1] is the transition probability kernel, r : S × A → R is a bounded reward function, and γ ∈ [0, 1) is a discount factor. Without loss of generality, states s ∈ S and actions a ∈ A can be assumed to be D_s-dimensional and D_a-dimensional real vectors, respectively. Moreover, letting s_i denote the ith element of a state vector, we define the range of the state space in terms of intervals [d_i^-, d_i^+] such that s_i ∈ [d_i^-, d_i^+] for all i ∈ {1, . . . , D_s}. At each time t ∈ {1, . . . , T} over the decision making horizon, an agent observes the state of the environment s_t ∈ S and takes an action a_t according to some policy π which maximizes the discounted cumulative reward. Once this action has been executed, the agent receives a reward r(s_t, a_t) from the environment and the state of the environment changes to s_{t+1} according to the transition probability kernel P(·|s_t, a_t). The Q function, which represents the expected discounted reward for taking a specific action at the current time and following the policy thereafter, is defined as a mapping from the space of state-action pairs to the real line, i.e., Q : S × A → R. Then, letting Q_t denote the Q matrix at time t, i.e., the tabulation of the Q function over all possible state-action pairs of the finite MDP, we can express the Q value iteration over time steps as

Q_{t+1}(s_t, a_t) = Σ_{s∈S} P(s|s_t, a_t) [ r(s_t, a_t) + γ max_a Q_t(s, a) ].    (1)

Under this update rule, the Q function converges to its unique optimal value Q* (Melo, 2001).
However, computing the sum in (1) over all possible next states is computationally expensive in many problems; in these cases, taking the summation over a subset of next states provides an efficient alternative for updating the Q values.
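To make the update concrete, the following sketch implements one sweep of the exact value iteration in (1) for a small tabular MDP. All names, shapes and parameter values are our own illustrative choices, not from the paper:

```python
import numpy as np

def q_value_iteration_step(Q, P, r, gamma):
    """One sweep of the exact Q update in Eq. (1):
    Q'(s, a) = sum_{s'} P(s'|s, a) * (r(s, a) + gamma * max_a' Q(s', a')).

    Q : (n_states, n_actions) current Q matrix
    P : (n_states, n_actions, n_states) transition kernel
    r : (n_states, n_actions) reward table
    """
    # Value of each next state under the greedy action
    V = Q.max(axis=1)                      # (n_states,)
    # Expected next-state value for every (s, a) pair
    EV = np.einsum('san,n->sa', P, V)      # (n_states, n_actions)
    # Since sum_s' P(s'|s,a) = 1, the reward term factors out of the sum
    return r + gamma * EV
```

Note that the full sweep touches every entry of P, which is exactly the cost the paper's sampling scheme is designed to avoid.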

2.2. HAMILTONIAN MONTE CARLO

Hamiltonian Monte Carlo is an approach for drawing samples from probability distributions known up to a constant. It offers faster convergence than random-walk Markov Chain Monte Carlo (MCMC) sampling (Neal et al., 2011; Betancourt; Betancourt et al., 2017; Neklyudov et al., 2020). To draw samples from a smooth target distribution P(s), defined on a Euclidean space and assumed known up to a constant, HMC extends the target distribution to a joint distribution over the target variable s (viewed as position within the HMC context) and an auxiliary variable v (viewed as momentum within the HMC context). The Hamiltonian of the system is defined as

H(s, v) = -log P(s, v) = -log P(s) - log P(v|s) = U(s) + K(v, s),

where U(s) = -log P(s) and K(v, s) = -log P(v|s) = (1/2) v^T M^{-1} v represent the potential and kinetic energy, respectively, and M is a suitable choice of mass matrix. HMC sampling consists of three steps: (i) a new momentum variable v is drawn from a fixed probability distribution, typically a multivariate Gaussian; (ii) a new proposal (s', v') is obtained by generating a trajectory that starts from (s, v) and obeys Hamiltonian dynamics, i.e., ṡ = ∂H/∂v, v̇ = -∂H/∂s; and (iii) the proposal is accepted with probability min{1, exp(H(s, v) - H(s', -v'))}, following the Metropolis-Hastings acceptance/rejection rule.
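The three steps above can be sketched as a minimal HMC sampler with an identity mass matrix (M = I) and a leapfrog integrator for the Hamiltonian dynamics. Step size and trajectory length are illustrative choices:

```python
import numpy as np

def hmc_sample(log_p, grad_log_p, s0, n_samples, step=0.1, n_leap=20, rng=None):
    """Minimal HMC sketch with unit mass matrix, so U(s) = -log P(s)
    and K(v) = 0.5 * v @ v."""
    rng = rng if rng is not None else np.random.default_rng()
    s = np.asarray(s0, dtype=float)
    samples = []
    for _ in range(n_samples):
        v = rng.standard_normal(s.shape)          # (i) resample momentum
        s_new, v_new = s.copy(), v.copy()
        # (ii) leapfrog integration of the Hamiltonian dynamics
        v_new += 0.5 * step * grad_log_p(s_new)
        for _ in range(n_leap - 1):
            s_new += step * v_new
            v_new += step * grad_log_p(s_new)
        s_new += step * v_new
        v_new += 0.5 * step * grad_log_p(s_new)
        # (iii) Metropolis-Hastings accept/reject on the Hamiltonian
        h_old = -log_p(s) + 0.5 * v @ v
        h_new = -log_p(s_new) + 0.5 * v_new @ v_new
        if rng.random() < min(1.0, np.exp(h_old - h_new)):
            s = s_new
        samples.append(s.copy())
    return np.array(samples)
```

Because the kinetic energy is symmetric in v, accepting (s', v') is equivalent to accepting (s', -v') as written in the text.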

2.3. LOW-RANK STRUCTURE IN Q-LEARNING AND MATRIX COMPLETION

Prior work (Johns & Mahadevan, 2007; Geist & Pietquin, 2013; Ong, 2015; Shah et al., 2020) on value function approximation based approaches for RL has implicitly assumed that the state-action value functions are low-dimensional and used various basis functions to represent them, e.g. CMAC, radial basis functions, etc. This can be attributed to the fact that the underlying state transition and reward function are often endowed with some structure. More recently, Yang et al. (2020b) provide empirical evidence that the Q matrices for benchmark Atari games and classical control tasks exhibit low-rank structure. Therefore, using matrix completion techniques (Xu et al., 2013; Chen & Chi, 2018) to recover Q ∈ R^{|S|×|A|} from a small number of observed Q values constitutes a viable approach for improving data efficiency. As low-rank matrix structure can be recovered by constraining the nuclear norm, the Q matrix can be reconstructed from its observed values Q̂ by solving

Q = arg min_{Q̃ ∈ R^{|S|×|A|}} ||Q̃||_*   subject to   J_Ω(Q̃) = J_Ω(Q̂),    (2)

where ||·||_* denotes the nuclear norm (i.e., the sum of singular values), Ω is the set of observed elements, and J_Ω is the observation operator, i.e., J_Ω(x) = x if x ∈ Ω and zero otherwise.
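As a rough illustration of nuclear-norm based completion, the soft-impute style iteration below fills missing entries by repeatedly soft-thresholding singular values while keeping observed entries fixed. It is a simple proxy for the constrained problem above, not the solver used in the paper, and all parameter values are illustrative:

```python
import numpy as np

def complete_matrix(Q_obs, mask, tau=0.1, n_iters=200):
    """Soft-impute style sketch of nuclear-norm matrix completion.

    Q_obs : matrix with valid values on observed entries
    mask  : boolean matrix, True where an entry is observed
    tau   : soft-threshold applied to singular values (small tau
            approximates the hard nuclear-norm-minimization constraint)
    """
    X = np.where(mask, Q_obs, 0.0)
    for _ in range(n_iters):
        U, sig, Vt = np.linalg.svd(X, full_matrices=False)
        sig = np.maximum(sig - tau, 0.0)       # shrink singular values
        X_low = (U * sig) @ Vt
        X = np.where(mask, Q_obs, X_low)       # re-impose observed entries
    return np.where(mask, Q_obs, X_low)
```

With tau small, the fixed point approximates the minimum-nuclear-norm completion that agrees with the observations.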

3. HAMILTONIAN Q-LEARNING

A large class of real-world sequential decision making problems, for example board/video games, control of robot movement, and portfolio optimization, involves high-dimensional state spaces and often a large number of distinct states along each individual dimension. As using a Q-Learning based approach to train RL agents for these problems typically requires tens to hundreds of millions of samples (Mnih et al., 2015; Silver et al., 2017), there is a strong need for data-efficient Q-Learning algorithms. In addition, state transitions in such systems are often probabilistic in nature; even when the underlying dynamics of the system is inherently deterministic, the presence of external disturbances and parameter variations/uncertainties leads to probabilistic state transitions. Learning an optimal Q* function through value iteration requires updating Q values of state-action pairs using a sum of the reward and a discounted expectation of Q values associated with next states. In this work, we assume the reward to be a deterministic function of state-action pairs. However, when the reward is stochastic, these results extend by replacing the reward with its expectation. Subsequently, we can express (1) as

Q_{t+1}(s_t, a_t) = r(s_t, a_t) + γ E[max_a Q_t(s, a)],    (3)

where E denotes the expectation over the discrete probability measure P. When the underlying state space is high-dimensional and has a large number of states, obtaining an accurate estimate of this expectation is computationally very expensive. The complexity increases quadratically with the number of states and linearly with the number of actions, rendering existing algorithms impractical. In this work, we propose a solution to this issue by introducing an importance-sampling based method that approximates the aforementioned expectation by a sample mean of Q values over a subset of next states.
A natural way to sample a subset of all possible next states is to draw independent and identically distributed (IID) samples from the transition probability distribution P(·|s_t, a_t). However, when the transition probability distribution is high-dimensional and known only up to a constant, drawing IID samples incurs a very high computational cost.

Figure 1: The first row illustrates that, as the dimension of the space increases, the volume inside a partition shrinks relative to the volume outside it; when the dimension increases from 1 through 3, the relative volume of the red partition decreases as 1/3, 1/9 and 1/27, respectively. The second row illustrates that HMC samples concentrate in the region that maximizes probability mass. Here P(s), s_Mode and ds represent the probability density, the mode of the distribution and a volume element, respectively.

3.1. DATA EFFICIENCY THROUGH HMC SAMPLING

A number of importance-sampling methods (Liu, 1996; Betancourt) have been developed for estimating the expectation of a function by drawing samples from the region with the dominant contribution to the expectation. HMC is one such method: it draws samples from the typical set, i.e., the region that maximizes probability mass and therefore provides the dominant contribution to the expectation. As shown in the second row of Figure 1, most of the samples in a limited pool of HMC samples indeed concentrate around the region of high probability mass. Since the decay of the Q function is significantly slower than the typical exponential or power-law decay of the transition probability function, HMC provides a good approximation for the expectation of the Q values of next states (Yang et al., 2020b; Shah et al., 2020). Letting H_t denote the set of HMC samples drawn at time step t, we update the Q values as

Q_{t+1}(s_t, a_t) = r(s_t, a_t) + (γ/|H_t|) Σ_{s∈H_t} max_a Q_t(s, a).    (4)

HMC for a smooth truncated target distribution: Recall that the region of states is a subset of a Euclidean space, s ∈ [d_1^-, d_1^+] × . . . × [d_{D_s}^-, d_{D_s}^+] ⊂ R^{D_s}. Thus the main challenge in using HMC sampling is to define a smooth continuous target distribution P̃(s|s_t, a_t) on R^{D_s} with a sharp decay at the boundary of the region of states (Yi & Doshi-Velez, 2017; Chevallier et al., 2018). In this work, we generate the target distribution by first defining the transition probability kernel from a conditional probability density on R^{D_s} and then multiplying it by a smooth cut-off function. We first consider a probability density P(·|s_t, a_t) : R^{D_s} → R such that the discrete transition kernel satisfies

P(s'|s_t, a_t) ∝ ∫_{s'-ε}^{s'+ε} P(s|s_t, a_t) ds    (5)

for some arbitrarily small ε > 0. Then the target distribution can be defined as

P̃(s|s_t, a_t) = P(s|s_t, a_t) Π_{i=1}^{D_s} [1/(1 + exp(-κ(d_i^+ - s_i)))] [1/(1 + exp(-κ(s_i - d_i^-)))].    (6)
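Given a set of HMC next-state samples, the sample-mean update in (4) is a one-line computation. The sketch below uses our own illustrative names; next states are represented as row indices into the tabular Q matrix:

```python
import numpy as np

def hmc_q_update(Q, r_sa, gamma, sampled_next_states):
    """Eq. (4): approximate the expected next-state value with the
    sample mean of max_a Q over HMC-sampled next states.

    Q : (n_states, n_actions) Q matrix
    r_sa : scalar reward r(s_t, a_t)
    sampled_next_states : array of row indices into Q for the HMC samples
    """
    v = Q[sampled_next_states].max(axis=1)   # max over actions, per sample
    return r_sa + gamma * v.mean()
```

The key point is that the cost is linear in the number of HMC samples rather than in the size of the state space.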
Note that there exists a sufficiently large κ > 0 such that P̃(s|s_t, a_t) ∝ P(s|s_t, a_t) if s ∈ [d_1^-, d_1^+] × . . . × [d_{D_s}^-, d_{D_s}^+], and P̃(s|s_t, a_t) ≈ 0 otherwise. Let μ(s_t, a_t) and Σ(s_t, a_t) be the mean and covariance of the transition probability kernel. In this paper we consider transition probability kernels of the form

P(s|s_t, a_t) ∝ exp( -(1/2) (s - μ(s_t, a_t))^T Σ^{-1}(s_t, a_t) (s - μ(s_t, a_t)) ).

Then from (5) the corresponding density is the multivariate Gaussian P(s|s_t, a_t) = N(μ(s_t, a_t), Σ(s_t, a_t)), and from (6) it follows that the target distribution is

P̃(s|s_t, a_t) = N(μ(s_t, a_t), Σ(s_t, a_t)) Π_{i=1}^{D_s} [1/(1 + exp(-κ(d_i^+ - s_i)))] [1/(1 + exp(-κ(s_i - d_i^-)))].

Choice of potential energy, kinetic energy and mass matrix: Recall that the target distribution P̃(s|s_t, a_t) is defined over the Euclidean space R^{D_s}. For brevity of notation we drop the explicit dependence on (s_t, a_t) and denote the target distribution by P̃(s). As explained in Section 2.2, we choose the potential energy U(s) = -log P̃(s). We consider a Euclidean metric M that induces the distance d(s, s̃) = ((s - s̃)^T M (s - s̃))^{1/2} between s and s̃. Letting M_s ∈ R^{D_s×D_s} be a diagonal scaling matrix and M_r ∈ R^{D_s×D_s} a rotation matrix in dimension D_s, we can define M as M = M_r M_s M_s^T M_r^T. Any metric M that defines a Euclidean structure on the target variable space induces an inverse structure on the momentum variable space, with d(ṽ, v) = ((ṽ - v)^T M^{-1} (ṽ - v))^{1/2}. This generates a natural family of multivariate Gaussian distributions P(v|s) = N(0, M), leading to the kinetic energy K(v, s) = -log P(v|s) = (1/2) v^T M^{-1} v, where M^{-1} is the covariance of the target distribution.
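The truncated target density in (6) can be evaluated in log space, which is exactly what HMC needs for the potential energy U(s) = -log P̃(s). A sketch, up to an additive constant; the names and the value of κ are illustrative:

```python
import numpy as np

def log_target(s, mu, Sigma_inv, d_lo, d_hi, kappa=50.0):
    """Log of the truncated target density in Eq. (6): a Gaussian
    transition kernel multiplied by smooth sigmoid cut-offs at the
    state-space boundaries [d_lo, d_hi] in each dimension.
    Returned up to an additive constant."""
    diff = s - mu
    log_gauss = -0.5 * diff @ Sigma_inv @ diff
    # log(1 / (1 + exp(-x))) = -logaddexp(0, -x), numerically stable
    log_cut_hi = -np.logaddexp(0.0, -kappa * (d_hi - s)).sum()
    log_cut_lo = -np.logaddexp(0.0, -kappa * (s - d_lo)).sum()
    return log_gauss + log_cut_hi + log_cut_lo
```

Inside the state region the cut-off terms are nearly zero, so the density matches the Gaussian; outside, they impose a sharp penalty that keeps HMC trajectories within the region.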

3.2. Q-LEARNING WITH HMC AND MATRIX COMPLETION

In this work we consider problems with a high-dimensional state space and a large number of distinct states along individual dimensions. Although these problems admit a large Q matrix, we can exploit the low-rank structure of the Q matrix to further improve data efficiency. At each time step t we randomly sample a subset Ω_t of state-action pairs (each state-action pair is sampled independently with some probability p) and update the Q function for state-action pairs in Ω_t. Let Q̂_{t+1} be the updated Q matrix at time t. Then from (4) we have

Q̂_{t+1}(s_t, a_t) = r(s_t, a_t) + (γ/|H_t|) Σ_{s∈H_t} max_a Q_t(s, a),  ∀(s_t, a_t) ∈ Ω_t.    (9)

We then recover the complete matrix Q_{t+1} using the method given in (2):

Q_{t+1} = arg min_{Q̃ ∈ R^{|S|×|A|}} ||Q̃||_*  subject to  J_{Ω_t}(Q̃) = J_{Ω_t}(Q̂_{t+1}).    (10)

Algorithm 1: Hamiltonian Q-Learning
Inputs: discount factor γ; range of state space; time horizon T.
Initialization: randomly initialize Q_0.
for t = 1 to T do
  Step 1 (pair selection): randomly sample a subset of state-action pairs Ω_t.
  Step 2 (HMC sampling phase): sample a set of next states H_t according to the target distribution defined in (6).
  Step 3 (update phase): for all (s_t, a_t) ∈ Ω_t, set Q̂_{t+1}(s_t, a_t) = r(s_t, a_t) + (γ/|H_t|) Σ_{s∈H_t} max_a Q_t(s, a).
  Step 4 (matrix completion phase): Q_{t+1} = arg min_{Q̃} ||Q̃||_* subject to J_{Ω_t}(Q̃) = J_{Ω_t}(Q̂_{t+1}).
end for

Similar to the approach used by Yang et al. (2020b), we approximate the rank of the Q matrix as the minimum number of singular values needed to capture 99% of its nuclear norm.
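The overall loop of Algorithm 1 can be sketched on a toy discrete MDP. Two stand-ins are used and should be read as assumptions: next states are drawn directly from the known transition kernel (playing the role of the HMC sampling phase), and the matrix completion phase is approximated by a truncated SVD of the partially updated Q matrix:

```python
import numpy as np

def hamiltonian_q_learning(P, r, gamma, T=100, p=0.5, n_samples=8,
                           rank=2, rng=None):
    """Toy sketch of Algorithm 1 on a small discrete MDP.

    P : (n_states, n_actions, n_states) transition kernel
    r : (n_states, n_actions) reward table
    p : per-entry probability of inclusion in Omega_t
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_s, n_a = r.shape
    Q = rng.random((n_s, n_a))
    for _ in range(T):
        # Step 1: random subset Omega_t of state-action pairs
        omega = rng.random((n_s, n_a)) < p
        Q_new = Q.copy()
        for s in range(n_s):
            for a in range(n_a):
                if not omega[s, a]:
                    continue
                # Steps 2-3: sample next states (stand-in for HMC),
                # then apply the sample-mean Bellman update of Eq. (9)
                nxt = rng.choice(n_s, size=n_samples, p=P[s, a])
                Q_new[s, a] = r[s, a] + gamma * Q[nxt].max(axis=1).mean()
        # Step 4: low-rank reconstruction (truncated SVD as a proxy
        # for the nuclear-norm completion of Eq. (10))
        U, sig, Vt = np.linalg.svd(Q_new, full_matrices=False)
        Q = (U[:, :rank] * sig[:rank]) @ Vt[:rank]
    return Q
```

With enough samples per update, the iterates settle near the fixed point of the exact Bellman operator.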

3.3. CONVERGENCE, BOUNDEDNESS AND SAMPLING COMPLEXITY

In this section we provide the main theoretical results of this paper. First, we formally introduce the following regularity assumptions:
(A1) The state space S ⊆ R^{D_s} and the action space A ⊆ R^{D_a} are compact subsets.
(A2) The reward function is bounded, i.e., r(s, a) ∈ [R_min, R_max] for all (s, a) ∈ S × A.
(A3) The optimal value function Q* is C-Lipschitz, i.e., |Q*(s, a) - Q*(s', a')| ≤ C (||s - s'||_F + ||a - a'||_F), where ||·||_F is the Frobenius norm (equal to the Euclidean norm for vectors).
We provide theoretical guarantees that Hamiltonian Q-Learning converges to an ε-optimal Q function with O(1/ε^{D_s+D_a+2}) samples. This matches the minimax lower bound Ω(1/ε^{D_s+D_a+2}) proposed in Tsybakov (2008). First we define the family of ε-optimal Q functions.

Definition 1 (ε-optimal Q functions). Let Q* be the unique fixed point of the Bellman optimality equation

(T Q)(s', a') = Σ_{s∈S} P(s|s', a') (r(s', a') + γ max_a Q(s, a))  ∀(s', a') ∈ S × A,

where T denotes the Bellman operator. Then, under update rule (3), the Q function almost surely converges to the optimal Q*. We define the family of ε-optimal Q functions as the set of functions Q_ε such that ||Q_ε - Q*||_∞ ≤ ε. As ||Q_ε - Q*||_∞ = max_{(s,a)∈S×A} |Q_ε(s, a) - Q*(s, a)|, any ε-optimal Q function is element-wise ε-optimal.

Our next result shows that under the HMC sampling rule given in Step 3 of the Hamiltonian Q-Learning algorithm (Algorithm 1), the Q function converges to the family of ε-optimal Q functions.

Theorem 1 (Convergence of Q function under HMC). Let T̂ be the optimality operator under HMC given as

(T̂ Q)(s', a') = r(s', a') + (γ/|H|) Σ_{s∈H} max_a Q(s, a),  ∀(s', a') ∈ S × A,

where H is a subset of next states sampled using HMC from the target distribution given in (6). Then, under update rule (4) and for any given ε ≥ 0, there exist n_H, t_ε > 0 such that ||Q_t - Q*||_∞ ≤ ε for all t ≥ t_ε. Refer to Appendix A.1 for the proof of this theorem.
The next theorem shows that the Q matrix estimated via a suitable matrix completion technique lies in an ε-neighborhood of the corresponding Q function obtained via exhaustive sampling.

Theorem 2 (Bounded error under HMC with matrix completion). Let

Q_{t+1}^E(s_t, a_t) = r(s_t, a_t) + γ Σ_{s∈S} P(s|s_t, a_t) max_a Q_t^E(s, a),  ∀(s_t, a_t) ∈ S × A,

be the update rule under exhaustive sampling, and let Q_t be the Q function updated according to Hamiltonian Q-Learning (9)-(10). Then, for any given ε̃ ≥ 0, there exist n_H = min_τ |H_τ| and t_ε̃ > 0 such that ||Q_t - Q_t^E||_∞ ≤ ε̃ for all t ≥ t_ε̃. Please refer to Appendix A.2 for the proof of this theorem.

Finally, we provide guarantees on the sampling complexity of the Hamiltonian Q-Learning algorithm.

Theorem 3 (Sampling complexity of Hamiltonian Q-Learning). Let D_s, D_a be the dimensions of the state space and action space, respectively. Consider the Hamiltonian Q-Learning algorithm presented in Algorithm 1. Then, under a suitable matrix completion method, the Q function converges to the family of ε-optimal Q functions with O(ε^{-(D_s+D_a+2)}) samples. The proof of Theorem 3 is given in Appendix B.

4.1. EMPIRICAL EVALUATION FOR CART-POLE

Experimental setup: Letting θ, θ̇ denote the angle and angular velocity of the pole and x, ẋ denote the position and velocity of the cart, the 4-dimensional state vector of the cart-pole system can be defined as s = (θ, θ̇, x, ẋ). After defining the range of the state space as θ ∈ [-π/2, π/2], θ̇ ∈ [-3.0, 3.0], x ∈ [-2.4, 2.4] and ẋ ∈ [-3.5, 3.5], we define the range of the scalar action as a ∈ [-10, 10]. Each state space dimension is then discretized into 5 distinct values and the action space into 10 distinct values, leading to a Q matrix of size 625 × 10. To capture parameter uncertainties and external disturbances, we assume that the probabilistic state transition is governed by a multivariate Gaussian with zero mean and covariance Σ = diag[0.143, 0.990, 0.635, 1.346]. To drive the pole to an upright position, we define the reward function as r(s, a) = cos^4(15θ) (Yang et al., 2020b). After initializing the Q matrix with values chosen randomly from [0, 1], we sample state-action pairs independently with probability p = 0.5 at each iteration. Additional experimental details and results are provided in Appendix C.

Results: As it is difficult to visualize a heat map for a 4-dimensional state space, we show results for the first two dimensions θ, θ̇ with fixed x, ẋ. The color in each cell of the heat maps shown in Figures 2(a), 2(b) and 2(c) indicates the value of the optimal action associated with that state. These figures illustrate that the policy heat map for Hamiltonian Q-Learning is closer to that of Q-Learning with exhaustive sampling than is the heat map for Q-Learning with IID sampling. The two curves in Figure 2(d), which show the Frobenius norm of the difference between the learned Q function and the optimal Q*, illustrate that Hamiltonian Q-Learning achieves better convergence than Q-Learning with IID sampling. We also show that the sampling efficiency of any Q-Learning algorithm can be significantly improved by incorporating Hamiltonian Q-Learning.
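The discretization described above can be reproduced in a few lines. The grid construction below is our own illustrative sketch of the setup, not the authors' code:

```python
import numpy as np
from itertools import product

# Cart-pole discretization from the experimental setup: 5 bins per state
# dimension, 10 actions, giving a 5**4 x 10 = 625 x 10 Q matrix.
ranges = [(-np.pi / 2, np.pi / 2),   # theta
          (-3.0, 3.0),               # theta_dot
          (-2.4, 2.4),               # x
          (-3.5, 3.5)]               # x_dot
n_bins, n_actions = 5, 10

grids = [np.linspace(lo, hi, n_bins) for lo, hi in ranges]
states = np.array(list(product(*grids)))      # (625, 4) discrete states
actions = np.linspace(-10, 10, n_actions)     # 10 discrete actions
# Q matrix initialized with values drawn randomly from [0, 1]
Q = np.random.default_rng(0).random((len(states), n_actions))
```

Each row of `states` indexes one row of the Q matrix, so the tabular updates and the matrix completion step operate on the same 625 x 10 array.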
We illustrate this by incorporating Hamiltonian Q-Learning into vanilla Q-Learning, DQN, Dueling DQN and DDPG. Figure 3 shows how the Frobenius norm of the error between the Q function and the optimal Q* varies with the number of samples. Red solid curves correspond to exhaustive sampling and black dotted curves correspond to Hamiltonian Q-Learning. These results illustrate that Hamiltonian Q-Learning converges to an optimal Q function with a significantly smaller number of samples than exhaustive sampling.

4.2. EMPIRICAL EVALUATION FOR ACROBOT (I.E., DOUBLE PENDULUM)

Experimental setup: Letting θ_1, θ̇_1, θ_2, θ̇_2 denote the angle and angular velocity of the first pole and the angle and angular velocity of the second pole, respectively, the 4-dimensional state vector of the acrobot can be defined as s = (θ_1, θ̇_1, θ_2, θ̇_2). After defining the range of the state space as θ_1 ∈ [-π, π], θ̇_1 ∈ [-3.0, 3.0], θ_2 ∈ [-π, π] and θ̇_2 ∈ [-3.0, 3.0], we define the range of the scalar action as a ∈ [-10, 10]. Each state space dimension is then discretized into 5 distinct values and the action space into 10 distinct values, leading to a Q matrix of size 625 × 10. Furthermore, we assume that the probabilistic state transition is governed by a multivariate Gaussian with zero mean and covariance Σ = diag[0.143, 0.990, 0.635, 1.346]. Following Sutton & Barto (2018), we define an appropriate reward function for stabilizing the acrobot in the upright position. After initializing the Q matrix with values chosen randomly from [0, 1], we sample state-action pairs independently with probability p = 0.5 at each iteration.

Results: Figure 4 illustrates how the Frobenius norm of the error between the Q function and the optimal Q* varies with the number of samples. Red solid curves correspond to exhaustive sampling and black dotted curves correspond to Hamiltonian Q-Learning. These results show that, for the same level of error, Hamiltonian Q-Learning requires a significantly smaller number of samples than exhaustive sampling.

4.3. APPLICATION TO OCEAN SAMPLING

Ocean sampling plays a major role in a variety of science and engineering problems, ranging from modeling marine ecosystems to predicting global climate. Here, we consider the problem of using an underwater glider to obtain measurements of a scalar field (e.g., temperature, salinity or concentration of a certain zooplankton) and illustrate how the use of Hamiltonian Q-Learning in planning the glider trajectory can lead to measurements that minimize the uncertainty associated with the field.

States, actions and state transition: Assuming that the glider's motion is restricted to a horizontal plane (Refael & Degani, 2019), we let x, y and θ denote its center-of-mass position and heading angle, respectively. We can then define the 6-dimensional state vector of this system as s = (x, y, ẋ, ẏ, θ, θ̇) and the action a as a scalar control input to the glider. Also, to accommodate dynamic perturbations due to the ocean current, other external disturbances and parameter uncertainties, we assume that the probabilistic state transition is governed by a multivariate Gaussian.

Reward: As ocean fields often exhibit temporal and spatial correlations (Leonard et al., 2007), this work focuses on spatially correlated scalar fields. Following the approach of Leonard et al. (2007), we define the ocean statistics correlation between two positions q = (x, y) and q' = (x', y') as B(q, q') = exp(-||q - q'||^2 / σ^2), where σ is the spatial decorrelation scale. The goal of the task is to take measurements that reduce the uncertainty associated with the field. Assume the glider takes N measurements at positions {q_1, . . . , q_N}. The covariance of the collected data set is then given by an N × N matrix W whose (i, j)th element is W_ij = η δ_ij + B(q_i, q_j), where δ_ij is the Kronecker delta and η is the variance of the uniform and uncorrelated measurement noise.
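The correlation function B and the measurement covariance W can be computed directly from their definitions. A sketch with illustrative parameter values:

```python
import numpy as np

def correlation(q1, q2, sigma=1.0):
    """Ocean statistics correlation B(q, q') = exp(-||q - q'||^2 / sigma^2),
    where sigma is the spatial decorrelation scale."""
    d = np.asarray(q1, dtype=float) - np.asarray(q2, dtype=float)
    return np.exp(-(d @ d) / sigma**2)

def measurement_covariance(positions, eta=0.1, sigma=1.0):
    """W_ij = eta * delta_ij + B(q_i, q_j) over measurement positions,
    with eta the variance of the uncorrelated measurement noise."""
    n = len(positions)
    W = np.array([[correlation(positions[i], positions[j], sigma)
                   for j in range(n)] for i in range(n)])
    return W + eta * np.eye(n)
```

Since B is a Gaussian kernel, W is symmetric positive definite for any η > 0, so the inverse needed in the uncertainty computation always exists.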
Then, using the objective analysis data assimilation scheme (Kalnay, 2003; Bennett, 2005), the total reduction in uncertainty of the field after taking measurements at positions {q_1, . . . , q_N} can be expressed as

U = Σ_q Σ_{i,j=1}^N B(q, q_i) (W^{-1})_{ij} B(q_j, q),    (11)

where the outer sum runs over the positions q of interest in the field. By substituting the formulas from Kalnay (2003) and Bennett (2005) into (11), this formulation can be generalized to Gaussian measurement noise. Recall that the task objective is to guide the glider to take measurements at locations which maximize the reduction in uncertainty associated with the scalar field. Therefore, the reward assigned to each state-action pair (s, a) is designed to reflect the amount of uncertainty that can be reduced by taking a measurement at the position corresponding to the state and at the positions corresponding to the set of maximally probable next states, i.e., arg max_{s'} P(s'|s, a). Letting Z_s = {s} ∪ (∪_{a∈A} {arg max_{s'} P(s'|s, a)}) denote the set consisting of the current state s and the maximally probable next states over all possible actions, the reward function associated with reducing uncertainty is

r_u(s, a) = Σ_q Σ_{i,j∈Z_s} B(q, q_i) (W^{-1})_{ij} B(q_j, q).

Without loss of generality, we assume that the glider is deployed from q = (0, 0) and that retrieving the glider incurs a cost depending on its position. To promote trajectories that do not incur a high retrieval cost, we define the reward function r_c(s, a) = -q^T C q, where C = C^T ≥ 0. We then introduce the total reward, which aims to reduce the uncertainty of the scalar field while penalizing movements away from the origin:

r(s, a) = r_u(s, a) + λ r_c(s, a) = -λ q^T C q + Σ_q Σ_{i,j∈Z_s} B(q, q_i) (W^{-1})_{ij} B(q_j, q),

where λ > 0 is a trade-off parameter that maintains a balance between the two objectives.

Results: Figures 5(a), 5(b) and 5(c) show the policy heat map over the first two dimensions x, y with fixed ẋ, ẏ, θ and θ̇.
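The uncertainty reduction in (11) follows by assembling the correlation vectors and the inverse of W. Again an illustrative sketch; the field is represented by a finite list of query positions:

```python
import numpy as np

def uncertainty_reduction(field_points, measured, W, sigma=1.0):
    """Total reduction of field uncertainty after measuring at `measured`,
    following the objective-analysis expression of Eq. (11):
    U = sum_q sum_{i,j} B(q, q_i) (W^-1)_{ij} B(q_j, q)."""
    W_inv = np.linalg.inv(W)
    total = 0.0
    for q in field_points:
        # correlation vector between the query point and all measurements
        b = np.array([np.exp(-np.sum((np.asarray(q) - np.asarray(qi))**2)
                             / sigma**2) for qi in measured])
        total += b @ W_inv @ b
    return total
```

The reward r_u(s, a) follows the same pattern with the measurement set restricted to the positions associated with Z_s.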
The color of each cell indicates the optimal action associated with that state. These figures illustrate that the policy heat map for Hamiltonian Q-Learning differs less from the heat map for Q-Learning with exhaustive sampling than the heat map for Q-Learning with IID sampling does. The two curves in Figure 5(d), which show the Frobenius norm of the difference between the learned Q function and the optimal Q*, illustrate that Hamiltonian Q-Learning achieves better convergence than Q-Learning with IID sampling. A comparison between the results of the ocean sampling problem and the cart-pole stabilization problem indicates that the advantage of Hamiltonian Q-Learning grows with the dimension of the state space.
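To make the reward construction above concrete, the following sketch computes the correlation B, the measurement covariance W and the uncertainty reduction U for a small set of measurement positions. This is a minimal illustration: the grid of field points and the values of η and σ below are assumptions, not the settings used in the experiments.

```python
import numpy as np

def field_correlation(q1, q2, sigma=2.5):
    """Spatial correlation B(q, q') = exp(-||q - q'||^2 / sigma^2)."""
    d = np.asarray(q1, dtype=float) - np.asarray(q2, dtype=float)
    return np.exp(-np.dot(d, d) / sigma**2)

def uncertainty_reduction(measured, field_points, eta=0.1, sigma=2.5):
    """U = sum_q sum_{i,j} B(q, q_i) (W^-1)_{ij} B(q_j, q), where
    W_ij = eta * delta_ij + B(q_i, q_j) is the measurement covariance."""
    N = len(measured)
    W = eta * np.eye(N) + np.array(
        [[field_correlation(qi, qj, sigma) for qj in measured] for qi in measured])
    W_inv = np.linalg.inv(W)
    U = 0.0
    for q in field_points:
        b = np.array([field_correlation(q, qi, sigma) for qi in measured])
        U += b @ W_inv @ b  # contribution of field point q
    return U

# Illustrative 5 x 5 grid of field points around the deployment position (0, 0)
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
U_one = uncertainty_reduction([(0.0, 0.0)], grid)
U_two = uncertainty_reduction([(0.0, 0.0), (2.0, 2.0)], grid)
```

Adding a second, well-separated measurement increases the total reduction U, which is exactly the behavior the reward r_u promotes.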

5. DISCUSSION AND CONCLUSION

Here we have introduced Hamiltonian Q-Learning, which combines HMC sampling with matrix completion methods to improve data efficiency. We show, both theoretically and empirically, that the proposed approach can learn accurate estimates of the optimal Q function with far fewer data points. We also demonstrate that Hamiltonian Q-Learning performs significantly better than Q-Learning with IID sampling when the underlying state space dimension is large. Building on this aspect, future work will investigate how importance-sampling based methods can improve data efficiency in multi-agent Q-learning with agents coupled through both actions and rewards.

A CONVERGENCE AND BOUNDEDNESS RESULTS

We proceed to prove the theorems by first stating the convergence properties of HMC. In the initial sampling stage, starting from its initial position, the Markov chain converges toward the typical set. In the next stage, the Markov chain quickly traverses the typical set and improves the estimate by removing bias. In the final stage, the Markov chain refines its exploration of the typical set, providing improved estimates. The number of samples taken during this final stage is referred to as the effective sample size.

A.1 PROOF OF THEOREM 1

Theorem 1. Let T be the optimality operator under HMC sampling, given as

(TQ)(s′, a′) = r(s′, a′) + (γ/|H|) Σ_{s∈H} max_a Q(s, a), ∀(s′, a′) ∈ S × A,

where H is a subset of next states sampled using HMC from the target distribution given in (6). Then, under update rule (4) and for any given ε ≥ 0, there exist n_H, t_ε > 0 such that ‖Q_t − Q*‖_∞ ≤ ε for all t ≥ t_ε.

Proof of Theorem 1. Let Q̂_t(s′, a′) = (1/n_H) Σ_{s∈H} max_a Q_t(s, a), ∀(s′, a′) ∈ S × A, where n_H = |H| is the effective number of samples. Let E_P[Q_t] and Var_P[Q_t] be the expectation and variance of max_a Q_t(s, a) with respect to the target distribution. From the Central Limit Theorem for HMC we have Q̂_t ∼ N(E_P[Q_t], Var_P[Q_t]/n_H). Since the Q function does not decay fast, we provide a proof for the case where Q_t is C-Lipschitz. From Theorem 6.5 in (Holmes et al., 2014) there exists a constant c_0 > 0 such that

‖Q̂_t − E_P[Q_t]‖ ≤ c_0. (12)

Recall that the Bellman optimality operator T is a contraction mapping. Thus, from the triangle inequality, we have

‖TQ_1 − TQ_2‖_∞ ≤ max_{s′,a′} | r(s′, a′) + (γ/|H_1|) Σ_{s∈H_1} max_a Q_1(s, a) − r(s′, a′) − (γ/|H_2|) Σ_{s∈H_2} max_a Q_2(s, a) |
= max_{s′,a′} | (γ/|H_1|) Σ_{s∈H_1} max_a Q_1(s, a) − (γ/|H_2|) Σ_{s∈H_2} max_a Q_2(s, a) |.

Let |H_1| = |H_2| = n_H. Then, using the triangle inequality again, we have

‖TQ_1 − TQ_2‖_∞ ≤ max_{s′,a′} γ(‖Q̂_1 − E_P[Q_1]‖ + ‖Q̂_2 − E_P[Q_2]‖) + max_{s′,a′} γ|E_P[Q_1] − E_P[Q_2]|.

Since the Q function almost surely converges under exhaustive sampling, we have

max_{s′,a′} γ|E_P[Q_1] − E_P[Q_2]| ≤ γ‖Q_1 − Q_2‖_∞. (13)

From equation 12 and equation 13 we have, after t time steps,

‖TQ_1 − TQ_2‖_∞ ≤ 2γc_0 + γ‖Q_1 − Q_2‖_∞.

Let R_max and R_min be the maximum and minimum reward values. Then ‖Q_1 − Q_2‖_∞ ≤ (γ/(1 − γ))(R_max − R_min). Thus, for any given ε ≥ 0, by choosing n_H large enough that c_0 is sufficiently small, there exists a t_ε such that ‖Q_t − Q*‖_∞ ≤ ε for all t ≥ t_ε. This concludes the proof of Theorem 1.

A.2 PROOF OF THEOREM 2

Theorem 2.
Let Q_E^{t+1}(s_t, a_t) = r(s_t, a_t) + γ Σ_{s∈S} P(s|s_t, a_t) max_a Q_E^t(s, a), ∀(s_t, a_t) ∈ S × A, be the update rule under exhaustive sampling, and let Q_t be the Q function updated according to Hamiltonian Q-Learning, i.e., by (9)-(10). Then, for any given ε̃ ≥ 0, there exist n_H, t′ > 0 such that ‖Q_t − Q_E^t‖_∞ ≤ ε̃ for all t ≥ t′.

Proof of Theorem 2. Note that at each time step we attempt to recover the matrix Q_E^t, i.e., the Q function at time t under exhaustive sampling, through a matrix completion method starting from Q̂_t, the Q function updated at time t using Hamiltonian Q-Learning. From Theorem 4 in (Chen & Chi, 2018) we have that, for all t ≥ t′, there exists some constant δ > 0 such that, when the updated Q function Q̂_t satisfies ‖Q̂_t − Q_E^t‖_∞ ≤ c for some positive constant c, the reconstructed (completed) matrix Q_t satisfies

‖Q_t − Q_E^t‖_∞ ≤ δ‖Q̂_t − Q_E^t‖_∞ (14)

for some δ > 0. This implies that when the initial matrix used for matrix completion is sufficiently close to the matrix we are trying to recover, the matrix completion iterations converge to a global optimum. From the result of Theorem 1 we have that, for any given ε ≥ 0, there exist n_H, t_ε > 0 such that for all t ≥ t_ε

‖Q̂_t − Q*‖ ≤ ε. (15)

Recall that under the update equation Q_E^{t+1}(s_t, a_t) = r(s_t, a_t) + γ Σ_{s∈S} P(s|s_t, a_t) max_a Q_E^t(s, a), ∀(s_t, a_t) ∈ S × A, the function Q_E almost surely converges to the optimal Q*. Thus there exists a t† such that ‖Q_E^t − Q*‖ ≤ ε for all t ≥ t†. Let t‡ = max{t†, t_ε}. Then from the triangle inequality we have ‖Q̂_t − Q_E^t‖ ≤ ‖Q̂_t − Q*‖ + ‖Q_E^t − Q*‖ ≤ 2ε. Thus from equation 14 we have ‖Q_t − Q_E^t‖_∞ ≤ 2δε. This concludes the proof of Theorem 2.
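The matrix completion step in Theorem 2 can be illustrated with a simple projection-based scheme. This is a minimal stand-in, not the algorithm of Chen & Chi (2018); the rank, sampling rate and iteration count below are illustrative assumptions.

```python
import numpy as np

def complete_low_rank(observed, mask, rank, n_iter=500):
    """Alternate between projecting onto rank-r matrices (truncated SVD)
    and re-imposing the observed entries."""
    X = np.where(mask, observed, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-r projection
        X[mask] = observed[mask]                  # data consistency on observed entries
    return X

rng = np.random.default_rng(1)
Q_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 8))  # rank-2 "Q matrix"
mask = rng.random(Q_true.shape) < 0.7                        # observe ~70% of entries
Q_hat = complete_low_rank(Q_true, mask, rank=2)
max_err = np.abs(Q_hat - Q_true).max()
```

Because the observed entries are re-imposed after every projection, they are reproduced exactly, and the unobserved entries are filled in by the low-rank structure.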

B SAMPLING COMPLEXITY

In this section we provide theoretical results on sampling complexity of Hamiltonian Q-Learning. For brevity of notation we define MQ(s) = max a Q(s, a). Note that we have the following regularity conditions on the MDP studied in this paper.

Regularity Conditions

1. The state space S and action space A are compact subsets of R^{D_s} and R^{D_a}, respectively.
2. All rewards are bounded, i.e., r(s, a) ∈ [R_min, R_max] for all (s, a) ∈ S × A.
3. The optimal Q* is C-Lipschitz, i.e., |Q*(s, a) − Q*(s′, a′)| ≤ C(‖s − s′‖_F + ‖a − a′‖_F).

We now prove some useful lemmas for establishing the sampling complexity of Hamiltonian Q-Learning.

Lemma 1. For some constant c_1, if |Ω_t| ≥ c_1 max{|S|², |A|²} |S||A| D_s D_a log(D_s + D_a) and ‖Q̂_t(s, a) − Q*(s, a)‖_∞ ≤ ε, then there exists a constant c_2 such that ‖Q_t(s, a) − Q*(s, a)‖_∞ ≤ c_2 ε.

Proof of Lemma 1. Recall that in order to complete a low-rank matrix using matrix estimation methods, the mass of the matrix cannot be concentrated in only a few rows or columns. This condition can be formalized using the notion of incoherence. Let Q be a matrix of rank r_Q with singular value decomposition Q = UΣVᵀ. Then the incoherence parameter of Q can be given as

φ(Q) = max{ (|S|/r_Q) max_{1≤i≤|S|} ‖Uᵀe_i‖²_F , (|A|/r_Q) max_{1≤i≤|A|} ‖Vᵀe_i‖²_F },

where the e_i are the standard basis vectors. Recall that Q_t is the matrix generated in the matrix completion phase from Q̂_t. From Theorem 4 in Chen & Chi (2018) we have that, for some constant C_1, if a fraction p of the entries of the matrix is observed such that

p ≥ C_1 φ_t² r_Q² D_s D_a log(D_s + D_a),

where φ_t is the incoherence parameter of Q_t, then with probability at least 1 − C_2(D_s + D_a)⁻¹ for some constant C_2, ‖Q̂_t(s, a) − Q*(s, a)‖_∞ ≤ ε implies that there exists a constant c_2 such that ‖Q_t(s, a) − Q*(s, a)‖_∞ ≤ c_2 ε. Note that p ≈ |Ω_t|/(|S||A|).
Further, for some constant c_3 we have

φ_t² r_Q² D_s D_a log(D_s + D_a) = c_3 max{|S|², |A|²} D_s D_a log(D_s + D_a).

Thus it follows that, for some constant c_1, if |Ω_t| ≥ c_1 max{|S|², |A|²} |S||A| D_s D_a log(D_s + D_a) and ‖Q̂_t(s, a) − Q*(s, a)‖_∞ ≤ ε, then there exists a constant c_2 such that ‖Q_t(s, a) − Q*(s, a)‖_∞ ≤ c_2 ε. This concludes the proof of Lemma 1.

Lemma 2. Let 1 − ξ be the spectral gap of the Markov chain under Hamiltonian sampling, where ξ ∈ [0, 1], and let ∆R = R_max − R_min be the maximum reward gap. Then, for all (s′, a′) ∈ S × A, we have

|Q̂(s′, a′) − Q*(s′, a′)| ≤ (γ²/(1 − γ))∆R + √( ((1 + ξ)/(1 − ξ)) (2/|H|) log(2/δ) ) (γR_max/(1 − γ))

with probability at least 1 − δ.

Proof of Lemma 2. Let Q̂(s′, a′) = r(s′, a′) + (γ/|H|) Σ_{s∈H} max_a Q(s, a). Recall that MQ(s) = max_a Q(s, a), so that Q̂(s′, a′) = r(s′, a′) + (γ/|H|) Σ_{s∈H} MQ(s). Then it follows that

|Q̂(s′, a′) − Q*(s′, a′)| = | r(s′, a′) + (γ/|H|) Σ_{s∈H} MQ(s) − r(s′, a′) − γE_P[MQ*(s)] |
= | (γ/|H|) Σ_{i=1}^{|H|} MQ(s_i) − γE_P[MQ*(s)] |
= | (γ/|H|) Σ_{i=1}^{|H|} MQ(s_i) − (γ/|H|) Σ_{i=1}^{|H|} MQ*(s_i) + (γ/|H|) Σ_{i=1}^{|H|} MQ*(s_i) − γE_P[MQ*(s)] |. (16)

Recall that all rewards are bounded, i.e., r(s, a) ∈ [R_min, R_max] for all (s, a) ∈ S × A. Thus, for all s we have MQ(s) ≤ (γ/(1 − γ))R_max. With ∆R = R_max − R_min, it follows that

| (γ/|H|) Σ_{i=1}^{|H|} MQ(s_i) − (γ/|H|) Σ_{i=1}^{|H|} MQ*(s_i) | ≤ (γ²/(1 − γ))∆R. (17)

Let ξ ∈ [0, 1] be a constant such that 1 − ξ is the spectral gap of the Markov chain under Hamiltonian sampling. Then from Fan et al. (2018) we have

P( | (1/|H|) Σ_{i=1}^{|H|} MQ*(s_i) − E_P[MQ*(s)] | ≥ ϑ ) ≤ exp( −((1 − ξ)/(1 + ξ)) (|H|ϑ²/(2R_max²)) ((1 − γ)/γ)² ).

Let δ = exp( −((1 − ξ)/(1 + ξ)) (|H|ϑ²/(2R_max²)) ((1 − γ)/γ)² ). Then we have

ϑ = √( ((1 + ξ)/(1 − ξ)) (2/|H|) log(2/δ) ) (γR_max/(1 − γ)).

Thus we see that

| (1/|H|) Σ_{i=1}^{|H|} MQ*(s_i) − E_P[MQ*(s)] | ≤ √( ((1 + ξ)/(1 − ξ)) (2/|H|) log(2/δ) ) (γR_max/(1 − γ)) (18)

with probability at least 1 − δ. Thus it follows from equations 16, 17 and 18 that

|Q̂(s′, a′) − Q*(s′, a′)| ≤ (γ²/(1 − γ))∆R + √( ((1 + ξ)/(1 − ξ)) (2/|H|) log(2/δ) ) (γR_max/(1 − γ))
with probability at least 1 − δ. This concludes the proof of Lemma 2.

Lemma 3. For all (s, a) ∈ S × A we have

|Q_t(s, a) − Q*(s, a)| ≤ 2c_1 γ²R_max/(1 − γ)

with probability at least 1 − δ.

Proof of Lemma 3. From Lemma 2 and Shah et al. (2020) we have that, for all (s, a) ∈ Ω_t,

|Q̂_t(s, a) − Q*(s, a)| ≤ (γ²/(1 − γ))∆R + √( ((1 + ξ)/(1 − ξ)) (2/|H_t|) log(2|Ω_t|T/δ) ) (γR_max/(1 − γ))

with probability at least 1 − δ/T. Thus we have

|Q_t(s, a) − Q*(s, a)| ≤ c_1(γ²/(1 − γ))∆R + c_1 √( ((1 + ξ)/(1 − ξ)) (2/|H_t|) log(2|Ω_t|T/δ) ) (γR_max/(1 − γ))

with probability at least 1 − δ/T. For all 1 ≤ t ≤ T, letting

|H_t| = ((1 + ξ)/(1 − ξ)) (2/γ²) log(2|Ω_t|T/δ)

we obtain

(γ²/(1 − γ))R_max ≥ √( ((1 + ξ)/(1 − ξ)) (2/|H_t|) log(2|Ω_t|T/δ) ) (γR_max/(1 − γ)).

Thus we have

|Q_t(s, a) − Q*(s, a)| ≤ 2c_1 γ²R_max/(1 − γ)

with probability at least 1 − δ. Recall that for all (s, a) ∈ S × A we have MQ(s) ≤ γ∆R/(1 − γ). Thus this also proves that

|Q_t(s, a) − Q*(s, a)| ≤ 2c_1 γ |Q_{t−1}(s, a) − Q*(s, a)|.

This concludes the proof of Lemma 3. We now proceed to prove the main theorem for sampling complexity.

Theorem 3. Let D_s, D_a be the dimensions of the state space and action space, respectively. Consider the Hamiltonian Q-Learning algorithm presented in Algorithm 1. Under a suitable matrix completion method, the Q function converges to the family of ε-optimal Q functions with O(ε^{−(D_s+D_a+2)}) samples.

Proof of Theorem 3. Note that the sample complexity of Hamiltonian Q-Learning can be given as Σ_{t=1}^{T} |Ω_t||H_t| ≤ T|Ω_T||H_T|. Let β_t be the discretization parameter at time t. Then from Lemmas 1, 2 and 3 it follows that

Σ_{t=1}^{T} |Ω_t||H_t| = O(ε^{−(D_s+D_a+2)}).

This concludes the proof of Theorem 3.
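The incoherence parameter φ(Q) used in the proof of Lemma 1 can be computed directly from the SVD. The matrices below are illustrative, not taken from the experiments.

```python
import numpy as np

def incoherence(Q, tol=1e-10):
    """phi(Q) = max( n_rows/r * max_i ||U^T e_i||^2,
                     n_cols/r * max_j ||V^T e_j||^2 )."""
    n_rows, n_cols = Q.shape
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))          # numerical rank r_Q
    row_lev = np.sum(U[:, :r]**2, axis=1)    # leverage scores ||U^T e_i||^2
    col_lev = np.sum(Vt[:r, :].T**2, axis=1) # leverage scores ||V^T e_j||^2
    return max(n_rows / r * row_lev.max(), n_cols / r * col_lev.max())

flat = np.outer(np.ones(20), np.ones(5))  # rank-1, mass evenly spread: phi = 1
spiked = flat.copy()
spiked[0] *= 50.0                         # same rank, mass concentrated in one row
phi_flat = incoherence(flat)
phi_spiked = incoherence(spiked)
```

A small φ (near 1) is what makes a low-rank matrix recoverable from few entries; the spiked matrix is much harder to complete because uniform sampling can miss the dominant row entirely.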

C ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS FOR CART-POLE

Let θ, θ̇ be the angle and angular velocity of the pole, respectively, and let x, ẋ be the position and linear velocity of the cart, respectively. Let a be the control force applied to the cart. Then, defining m, M, l and g as the mass of the pole, the mass of the cart, the length of the pole and the gravitational acceleration, respectively, the dynamics of the cart-pole system can be expressed as

θ̈ = [g sin θ − cos θ (a + m l θ̇² sin θ)/(m + M)] / [l (4/3 − m cos²θ/(m + M))],
ẍ = [a + m l (θ̇² sin θ − θ̈ cos θ)] / (m + M). (20)

The state space of the cart-pole system is 4-dimensional (D_s = 4) and any state s ∈ S is given by s = (θ, θ̇, x, ẋ). We define the range of the state space as θ ∈ [−π/2, π/2], θ̇ ∈ [−3.0, 3.0], x ∈ [−2.4, 2.4] and ẋ ∈ [−3.5, 3.5]. We consider the action space to be 1-dimensional (D_a = 1) with a ∈ [−10, 10]. We discretize each dimension of the state space into 5 values and the action space into 10 values. This forms a Q matrix of dimensions 625 × 10. Although the differential equations (20) governing the dynamics of the pendulum-on-a-cart system are deterministic, parameter uncertainty and external disturbances cause the cart-pole to deviate from these dynamics, leading to stochastic state transitions. Following the conventional approach, we model these parameter uncertainties and external disturbances using a multivariate Gaussian perturbation (Maithripala et al., 2016; Madhushani et al., 2017; McAllister & Rasmussen, 2017). Here we consider the covariance of the Gaussian perturbation to be Σ = diag[0.143, 0.990, 0.635, 1.346]. Let s_t = (θ_t, θ̇_t, x_t, ẋ_t) and a_t be the state and the action at time t. Then the state transition probability kernel and the corresponding target distribution can be given by (7) and (8), respectively, with mean µ(s_t, a_t) = (θ_t + θ̇_t τ, θ̇_t + θ̈_t τ, x_t + ẋ_t τ, ẋ_t + ẍ_t τ), where θ̈_t, ẍ_t are obtained from (20) by substituting θ_t, θ̇_t, a_t, and covariance Σ(s_t, a_t) = Σ.
Our simulation results use the following values for the system parameters: m = 0.1 kg, M = 1 kg, l = 0.5 m and g = 9.8 m s⁻². We take 100 HMC samples during the update phase, with trajectory length L = 100 and step size ∆l = 0.02. We randomly initialize the Q matrix using values between 0 and 1. We provide additional comparison heat maps for the first two dimensions θ, θ̇ with fixed x, ẋ, as well as for the last two dimensions x, ẋ with fixed θ, θ̇.
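As a sanity check on the dynamics above, the following sketch evaluates the transition mean using the parameter values just listed. The Euler integration step τ = 0.02 is an assumption here, since τ is not stated explicitly in the text.

```python
import numpy as np

m, M, l, g = 0.1, 1.0, 0.5, 9.8  # pole mass, cart mass, pole length, gravity
tau = 0.02                       # assumed Euler integration step

def accelerations(theta, theta_dot, a):
    """Angular and linear accelerations from the cart-pole dynamics (20)."""
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    tmp = (a + m * l * theta_dot**2 * sin_t) / (m + M)
    theta_dd = (g * sin_t - cos_t * tmp) / (l * (4.0 / 3.0 - m * cos_t**2 / (m + M)))
    x_dd = tmp - m * l * theta_dd * cos_t / (m + M)
    return theta_dd, x_dd

def transition_mean(s, a):
    """Mean mu(s_t, a_t) of the Gaussian state transition kernel."""
    theta, theta_dot, x, x_dot = s
    theta_dd, x_dd = accelerations(theta, theta_dot, a)
    return np.array([theta + tau * theta_dot, theta_dot + tau * theta_dd,
                     x + tau * x_dot, x_dot + tau * x_dd])

mu_eq = transition_mean(np.zeros(4), 0.0)     # upright equilibrium, no force
mu_push = transition_mean(np.zeros(4), 10.0)  # push the cart to the right
```

At the upright equilibrium with zero force the mean transition stays at the equilibrium; a positive force accelerates the cart to the right while the pole starts to tip in the opposite direction.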

D ADDITIONAL DETAILS FOR OCEAN SAMPLING APPLICATION

Glider dynamics: We consider that the glider's motion is restricted to a horizontal plane (Refael & Degani, 2019). Let x, y and θ denote the coordinates of the glider's center of mass and its heading angle, respectively. Then the 6-dimensional state vector for this system is s = (x, y, ẋ, ẏ, θ, θ̇) and the action a is a scalar control input to the glider. To accommodate dynamic perturbations due to the ocean current, other external disturbances and parameter uncertainties, we assume that the probabilistic state transition is governed by a multivariate Gaussian. By introducing q = [x y θ]ᵀ, the dynamics of the glider can be expressed as M q̈ = R F_f + F_b + τ. Our simulation results use the system parameter values from Table 1. We define the ranges of the state and action spaces as x, y ∈ [−10, 10], ẋ, ẏ ∈ [−25, 25], θ ∈ [−π, π], θ̇ ∈ [−3, 3] and a ∈ [−1, 1], respectively; discretizing each state dimension into 5 distinct values and the action space into 5 distinct values yields a Q matrix of size 15625 × 5. Also, we assume that the state transition kernel is given by a multivariate Gaussian with zero mean and covariance Σ = diag[11.111, 69.444, 11.111, 69.444, 0.143, 0.990]. After initializing the Q matrix with values chosen uniformly at random from [0, 1], we sample state-action pairs independently with probability p = 0.5 at each iteration. Also, we set σ = 2.5, λ = 0.1 and C = diag[1, 0]. We take 100 HMC samples during the update phase, with trajectory length L = 100 and step size ∆l = 0.02. Additional experimental results: We provide additional comparison heat maps for the first two dimensions x, y with fixed ẋ, ẏ, θ, θ̇.

E APPLICATION OF HMC SAMPLING FOR Q-LEARNING: FURTHER DETAILS

In this section we provide a detailed explanation of drawing HMC samples from a given state transition probability distribution.
Let (s t , a t ) be the current state action pair. Let µ(s t , a t ), Σ(s t , a t ) be the mean and covariance of the transition probability kernel. In order to draw HMC samples



Figure 1: The first row illustrates that, as the dimension of the space increases, the relative volume inside

Figure 2: Figure 2(a), 2(b) and 2(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling, respectively. Figure 2(d) provides a comparison for convergence of Q function with Hamiltonian Q-Learning and Q-Learning with IID sampling.

Figure 4: Mean square error vs. number of samples of the Q function with exhaustive sampling and Hamiltonian Q-Learning for vanilla Q-Learning, DQN, Dueling DQN and DDPG.


Figure 5: Figure 5(a), 5(b) and 5(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling respectively. Figure 5(d) provides a comparison for convergence of Q function with Hamiltonian Q-Learning and Q-Learning with IID sampling.


Figure 6: Figure 6(a), 6(b) and 6(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling, respectively.

Figure 7: Figure 7(a), 7(b) and 7(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling, respectively.

Figure 8: Figure 8(a), 8(b) and 8(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling, respectively.

M q̈ = R F_f + F_b + τ, with parameter values α_b = 0.005, α_f = 0.062, µ_f = 0.0074 and σ = 2.5.

Figure 9: Figure 9(a), 9(b) and 9(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling, respectively.

Figure 11: Figure 11(a), 11(b) and 11(c) show policy heat maps for Q-Learning with exhaustive sampling, Hamiltonian Q-Learning and Q-Learning with IID sampling, respectively.

Table 1: Notation, description and value of the system parameters.


we are required to define the corresponding potential energy and kinetic energy of the system. Let P(s|s_t, a_t) be the smooth target state transition distribution. Potential energy, kinetic energy and mass: In this work we consider P(s|s_t, a_t) to be a truncated multivariate Gaussian as given in equation 8. Thus the potential energy can be explicitly given as

U(s) = (1/2)(s − µ)ᵀΣ⁻¹(s − µ),

where µ and Σ correspond to the mean and covariance of the transition probability kernel. In the context of HMC, s is referred to as the position variable. The kinetic energy can be given as

K(v) = (1/2)vᵀM⁻¹v,

where v is the momentum variable and M = Σ⁻¹ corresponds to the mass/inertia matrix associated with the Hamiltonian. Hamiltonian dynamics: As the Hamiltonian is the sum of the kinetic and the potential energy, i.e., H(s, v) = U(s) + K(v), the Hamiltonian dynamics can be expressed as

ds/dt = ∂H/∂v = M⁻¹v,  dv/dt = −∂H/∂s = −∇U(s).

We initialize HMC sampling by drawing a random sample s from the transition probability distribution and a new momentum variable v from the multivariate Gaussian N(0, Σ⁻¹). We integrate the Hamiltonian dynamics for L steps with step size ∆l to generate the trajectory from (s, v) to (s′, v′). To ensure that the Hamiltonian is conserved along the trajectory, we use a volume-preserving symplectic integrator, in particular a leapfrog integrator, which uses the following update rule to go from step l to l + ∆l:

v(l + ∆l/2) = v(l) − (∆l/2)∇U(s(l)),
s(l + ∆l) = s(l) + ∆l M⁻¹v(l + ∆l/2),
v(l + ∆l) = v(l + ∆l/2) − (∆l/2)∇U(s(l + ∆l)).

Acceptance of the new proposal: Following the Metropolis-Hastings acceptance/rejection rule, the new proposal (s′, v′) is accepted with probability min{1, exp(H(s, v) − H(s′, v′))}. Updating Q function using HMC samples: Let H_t be the set of next states obtained via HMC sampling, i.e., position variables from the accepted set of proposals. Then we update Q(s_t, a_t) using equation 9.
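The full procedure above (potential and kinetic energy, leapfrog integration, Metropolis-Hastings accept/reject) can be sketched for a plain, untruncated Gaussian target. The target mean, covariance, trajectory length and step size below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def hmc_gaussian(mu, Sigma, n_samples=500, L=20, dl=0.05, seed=0):
    """Minimal HMC sampler for the target N(mu, Sigma):
    U(s) = 0.5 (s-mu)^T Sigma^-1 (s-mu), K(v) = 0.5 v^T M^-1 v, M = Sigma^-1."""
    rng = np.random.default_rng(seed)
    Sigma_inv = np.linalg.inv(Sigma)
    U = lambda s: 0.5 * (s - mu) @ Sigma_inv @ (s - mu)
    grad_U = lambda s: Sigma_inv @ (s - mu)
    K = lambda v: 0.5 * v @ Sigma @ v  # M^-1 = Sigma
    s = np.array(mu, dtype=float)
    samples = []
    for _ in range(n_samples):
        v = rng.multivariate_normal(np.zeros(len(mu)), Sigma_inv)  # v ~ N(0, M)
        s_prop, v_prop = s.copy(), v.copy()
        # Leapfrog integration: half kick, L alternating drifts/kicks, half kick
        v_prop = v_prop - 0.5 * dl * grad_U(s_prop)
        for _ in range(L):
            s_prop = s_prop + dl * (Sigma @ v_prop)  # ds/dt = M^-1 v
            v_prop = v_prop - dl * grad_U(s_prop)
        v_prop = v_prop + 0.5 * dl * grad_U(s_prop)  # restore half of last kick
        # Metropolis-Hastings acceptance with probability min{1, exp(H - H')}
        if rng.random() < np.exp(U(s) + K(v) - U(s_prop) - K(v_prop)):
            s = s_prop
        samples.append(s.copy())
    return np.array(samples)

mu = np.array([1.0, -1.0])
Sigma = np.diag([0.5, 2.0])
samples = hmc_gaussian(mu, Sigma)
```

Choosing the mass matrix M = Σ⁻¹ preconditions the dynamics so that all directions of the Gaussian target are explored at comparable rates, which is why the sample mean concentrates around µ quickly.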

