STATIONARY DEEP REINFORCEMENT LEARNING WITH QUANTUM K-SPIN HAMILTONIAN EQUATION

Abstract

Instability is a major issue of deep reinforcement learning (DRL) algorithms: the cumulative rewards exhibit a high variance over multiple runs. The instability is mainly caused by the existence of many local minima and is worsened by the multiple-fixed-points issue of Bellman's optimality equation. As a fix, we propose a quantum K-spin Hamiltonian regularization term (called the H-term) to help a policy network converge to a high-quality local minimum. First, we take a quantum perspective by modeling a policy as a K-spin Ising model and employ a Hamiltonian equation to measure the energy of a policy. Then, we derive a novel Hamiltonian policy gradient theorem and design a generic actor-critic algorithm that utilizes the H-term to regularize the policy network. Finally, the proposed method reduces the variance of cumulative rewards by 65.2% ∼ 85.6% on six MuJoCo tasks; on two combinatorial optimization tasks and two non-convex optimization tasks, it achieves an approximation ratio ≤ 1.05 over 90% of test cases and reduces the variance of the approximation ratio by 60.16% ∼ 94.52%, compared with existing algorithms over 20 runs.

1. INTRODUCTION

Instability is a major issue of deep reinforcement learning (DRL) [44] algorithms: agents trained with different random seeds may have dramatically different performance. Existing works [1, 8, 16, 28, 31, 53] empirically reported a high variance over multiple runs. Hence, in practice one has to train tens of agents and pick the best one. Such a high variance largely contributes to the RL community's dispute over reliability and reproducibility [17, 18], limiting wider adoption in real-world tasks.

The instability issue is mainly caused by the existence of many local minima (see footnote 1) and is worsened by the multiple-fixed-points issue of Bellman's optimality equation [5, 21, 26, 39]. In Fig. 1, we adapt dynamic programming examples [5, 39] to reinforcement learning settings; detailed descriptions are given in Appx. A.
• Shortest path problem (deterministic) in Fig. 1(a): two policies, 1) transiting back to state 1; 2) driving to terminal state 0.
• Blackmailer's problem (stochastic) in Fig. 1(b): two policies, 1) demanding a → 0 to keep the victim at state 1; 2) demanding a = 1, which drives the victim to terminal state 0.
• Optimal stopping problem (terminating policies) in Fig. 1(c): two policies, 1) continuing inside the sphere of radius (1 - α)c and stopping outside; 2) jumping to point 0 at any point in region C.

The instability problem has been partially addressed by ensemble methods [2, 10], regularization approaches [11, 46], and baseline-correction approaches [41, 50]. In particular, Generalized Advantage Estimation (GAE) [41] is a widely used technique that significantly reduces the variance of the advantage function. However, these methods do not fix the issue of local minima or the multiple-fixed-points issue of the Bellman equation in Fig. 1: existing methods randomly converge to different local minima. For practical usage, we often expect a DRL algorithm to converge stably to a certain policy, independent of initialization and noise.
As a fix, we propose a quantum K-spin Hamiltonian regularization term (H-term) to help a policy network converge to a high-quality local minimum. We take a novel quantum perspective by modeling a policy as a K-spin Ising model [15, 30] and employ a Hamiltonian equation to measure the energy of a policy, namely an H-term. We hypothesize that a stationary policy has a low energy. Our contributions can be summarized as follows: 1) we take a quantum perspective by modeling a policy as a K-spin Ising model and employ a Hamiltonian equation to measure the energy of a policy, which becomes an add-on term to DRL algorithms; 2) we derive a novel Hamiltonian policy gradient theorem and design a generic actor-critic algorithm that utilizes the H-term to regularize the policy/actor network; 3) we show that the proposed method reduces the variance of cumulative rewards by 65.2% ∼ 85.6% on six challenging MuJoCo tasks [47]; it achieves an approximation ratio ≤ 1.05 over 90% of test cases and reduces the variance of the approximation ratio by 60.16% ∼ 94.52% on two combinatorial optimization tasks (travelling salesman problem [31], graph maxcut [14]) and two non-convex optimization tasks (MIMO beamforming in 5G/6G [7], non-convex deep learning classifier [33]), compared with existing algorithms over 20 runs.

2. RELATED WORKS

The existence of many local minima has been theoretically pointed out in robotic control tasks [16], combinatorial optimization tasks [25][36], and non-convex optimization tasks [3][52]. Existing solutions fall into three categories: ensemble methods, regularizers, and baseline correction. Ensemble methods [2, 10] reduce the variance by using multiple critic networks to approximate the value function more accurately. However, they still encounter the multiple-fixed-points issue of Bellman's optimality equation. Regularization methods [11, 46] guide the updating process of a policy network. Adding a regularizer essentially helps find a local minimum with a preferred structure, but it cannot help escape from local minima. Baseline-correction approaches [41, 50] reduce the bias of Monte Carlo estimation. In particular, Generalized Advantage Estimation (GAE) [41] is a widely used technique that significantly reduces the variance of the advantage function. However, the method is restricted by the accuracy of the baseline, which suffers from the local minima issue as well. None of these approaches fixes the issue of many local minima or the multiple-fixed-points issue of the Bellman equation in Fig. 1. In contrast, we propose a physically inspired DRL algorithm that converges stably to a certain policy, independent of initialization and noise.

Different from our quantum K-spin perspective, several recent papers utilized the (classical) Hamiltonian equation to endow RL agents with inductive biases. For example, [24, 48] used Hamiltonian mechanics to train an agent that learns and respects conservation laws; [51] applied a Hamiltonian Monte Carlo (HMC) simulator to approximate the posterior action probability; and [35] proposed an unbiased estimator for stochastic Hamiltonian gradient methods for min-max optimization problems.

3. THE PROBLEM OF MANY LOCAL MINIMA

First, we show the existence of many local minima in many tasks. Then, we provide observational experiments to empirically verify the existence of multiple policies.

Figure 2: Different policies for MuJoCo tasks [47]. The bold ones are physically stationary policies.

3.1. EXISTENCE OF MANY LOCAL MINIMA

We point out that control, combinatorial optimization, and non-convex optimization tasks all have many local minima.
• MuJoCo tasks [47]: agents randomly converge to policies of different gaits, as shown in Fig. 2.
• Travelling salesman problem (TSP) [31]: a case of 8 cities has 2 local minima (in Appx. K).
• Graph max-cut [14]: an example graph of 20 nodes has 390 local minima (in Appx. K).
• MIMO beamforming [7]: a case of 2 users and 2 antennas has 3 local minima (in Appx. L).
• Non-convex deep learning classifier [33]: an example problem has 25 local minima [3].

3.2. MULTIPLE POLICIES OF EXISTING DRL ALGORITHMS

We provide observational experiments on four challenging MuJoCo tasks [47], namely Humanoid, Hopper, HalfCheetah, and Ant (details given in Appx. B.1), which are typical examples of the locomotion control of a robot. We render the policies obtained over multiple runs and then identify the physically stationary ones. We observe various types of moving strategies, as shown in Fig. 2, which verifies that multiple policies are very common. For example, the Humanoid agent learns either to jump on a single leg or to run on two legs, as shown in Fig. 2 (top-left); another interesting example is HalfCheetah, in which an agent can run normally or in a flipped manner, as shown in Fig. 2 (bottom-left). Among the obtained policies, one can easily identify the physically stationary policies, which control the robot to move forward with a stable gait (defined as a gait that does not lead to a fall).

4. MODELING POLICY AS K-SPIN ISING MODEL

We take a novel quantum perspective by modeling a policy as a K-spin Ising model and employ a Hamiltonian equation to measure the energy of a policy, namely an H-term.

4.1. MOTIVATION

Our modeling of a policy as a quantum K-spin Ising model is inspired by simulated annealing algorithms and the analogy in Table 1. Simulated annealing algorithms randomly transit to a neighbor solution with a probability determined by the energy gap between the current state and the new state. Here, a state can be modeled as a spin configuration, and a Hamiltonian equation is used to measure the energy of a spin system. Taking the graph maxcut problem as an example, a spin configuration assigns each node to one side of the cut, while a transition (in terms of a state-action pair) is taken according to a policy. Using the modeling in Table 1, we learn a policy network that encodes the transition probability of a simulated annealing algorithm.
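To make the simulated annealing analogy concrete, the following is a minimal sketch of classical annealing on graph maxcut, where each node carries a spin in {-1, +1}. The Metropolis acceptance rule, the linear cooling schedule, and the names `cut_value` and `anneal_maxcut` are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def cut_value(edges, spins):
    """Number of edges whose endpoints carry opposite spins."""
    return sum(1 for u, v in edges if spins[u] != spins[v])


def anneal_maxcut(edges, n, steps=2000, t0=1.0, seed=0):
    """Simulated annealing for maxcut (hypothetical helper, Metropolis rule assumed)."""
    rng = random.Random(seed)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    best = list(spins)
    for step in range(steps):
        # Linear cooling schedule; the small offset avoids division by zero.
        temp = t0 * (1.0 - step / steps) + 1e-9
        i = rng.randrange(n)
        before = cut_value(edges, spins)
        spins[i] = -spins[i]  # propose flipping one spin
        delta = cut_value(edges, spins) - before  # gain in cut size
        # Always accept improvements; accept a worse move with prob. exp(delta / temp).
        if delta < 0 and rng.random() >= math.exp(delta / temp):
            spins[i] = -spins[i]  # reject: undo the flip
        if cut_value(edges, spins) > cut_value(edges, best):
            best = list(spins)
    return best


# Usage: a 4-cycle, whose maximum cut contains all 4 edges.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
solution = anneal_maxcut(cycle, n=4)
print(solution, cut_value(cycle, solution))
```

In the paper's analogy, the hand-written acceptance rule above is the part that a learned policy network would replace.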

4.2. QUANTIFYING ENERGY OF A POLICY VIA K-SPIN HAMILTONIAN EQUATION

The Hamiltonian equation for a quantum K-spin Ising model [15, 30] measures the energy of a particular configuration and takes the following form:

$$H = -\sum_{k=0}^{K-1} \sum_{j_0=1}^{N} \cdots \sum_{j_k=1}^{N} L_{j_0 \cdots j_k}\, \sigma_{j_0} \cdots \sigma_{j_k}, \tag{1}$$

where $N$ is the number of spins in the $k$-th configuration, $\sigma_{j_k} = \pm 1$ are spin variables, and $L_{j_0 \cdots j_k}$ is an energy density function for the configuration $(\sigma_{j_0}, \ldots, \sigma_{j_k})$ of $k$ nearest spins.

Table 1: Modeling a policy as a quantum K-spin Ising model.

| Policy in (3) | Quantum K-spin Ising model [15, 30] in (1) |
|---|---|
| State-action $\mu_k \in S \times A$, $k = 0, \ldots, K$ | Spins $j_k \in \{1, \ldots, N\}$, $k = 0, \ldots, K$ |
| Policy $\pi_{\mu_0} \times \pi_{\mu_1} \times \cdots \times \pi_{\mu_{K-1}} \in [0, 1]^K$ | Configuration $\sigma_{j_0} \times \sigma_{j_1} \times \cdots \times \sigma_{j_{K-1}} \in [-1, +1]^K$ |
| Optimal policy $\pi^*_{\mu_0} \times \pi^*_{\mu_1} \times \cdots \times \pi^*_{\mu_{K-1}} \in \{0, 1\}^K$ | Optimal configuration $\sigma^*_{j_0} \times \sigma^*_{j_1} \times \cdots \times \sigma^*_{j_{K-1}} \in \{-1, +1\}^K$ |
| Discounted reward $L_{\mu_0 \ldots \mu_{K-1}}$ | Density function $L_{j_0 \cdots j_{K-1}}$ |
| Energy of policy $H(\pi_{\mu_0}, \ldots, \pi_{\mu_{K-1}})$ | Energy $H(\sigma_{j_0}, \ldots, \sigma_{j_{K-1}})$ |

Modeling in Table 1. Starting from an analogy between a state-action pair $\mu_k = (S_k, A_k)$ and a spin $j_k$, we can map an optimal policy $\pi^*(\mu_k) \in \{0, 1\}$ to the optimal single-qubit spin operator $\sigma^*_{j_k} \in \{-1, 1\}$ via

$$\pi^*(\mu_k) \longleftrightarrow (\mathbb{1}_{\mu_k} - \sigma^*_{\mu_k})/2,$$

where $\pi(\mu_k)$ denotes the probability of taking action $A_k$ at state $S_k$ when following policy $\pi$. The energy density function $L_{j_0 \cdots j_k}$ can be defined as the discounted reward on a path $(\mu_0, \ldots, \mu_{k-1})$ of length $k$,

$$L_{\mu_0, \ldots, \mu_k} = \gamma^k \cdot R(\mu_k) \cdot d_0(s_0) \cdot \prod_{\ell=0}^{k-1} P(s_{\ell+1} \mid \mu_\ell) \quad \text{(obtained via Monte Carlo simulation)}, \tag{2}$$

where $d_0(s_0)$ denotes the distribution of the initial state $s_0$.

By analogy to the quantum K-spin Ising model, we can derive the energy of an RL policy, $H(\pi_{\mu_0}, \ldots, \pi_{\mu_{K-1}})$. We formally express the objective of reinforcement learning (background is given in Appx. C) as a K-spin Hamiltonian equation (inspired by [20]):

$$H(\theta) \triangleq -\mathbb{E}_{S_0, A_0}\left[ Q^{\pi_\theta}(S_0, A_0) \right] = -\lim_{K \to \infty} \sum_{k=0}^{K-1} \sum_{\mu_0 \in S \times A} \cdots \sum_{\mu_k \in S \times A} L_{\mu_0, \ldots, \mu_k}\, \pi_\theta(\mu_0) \cdots \pi_\theta(\mu_k), \tag{3}$$

where the expectation is taken over $S_0 \sim d_0(\cdot)$ and $A_0 \sim \pi_\theta(S_0, \cdot)$, and $L_{\mu_0, \ldots, \mu_k}$ is given in (2).

Physical interpretation: By analogy to a quantum K-spin system, $H(\theta)$ in (3) measures a random path's discounted reward (the "energy") without following any policy, and the Hamiltonian equation combinatorially enumerates all possible paths of length $K$ over the state-action space. The joint probability distribution, $\pi(\mu_0) \times \pi(\mu_1) \times \cdots \times \pi(\mu_{K-1})$, is decided by the policy $\pi$. The energy of a policy is a favorable criterion, since an optimal policy with minimum energy 1) achieves a relatively high reward independent of the initialization; and 2) is robust to interference/noise in the inference stage. In other words, the simulation process of the Hamiltonian term does not rely on any policy. Therefore, the Hamiltonian term is a suitable regularizer for both on-policy and off-policy algorithms.

K-step truncation in practice. Minimizing (3) is NP-hard [13]. Since $\gamma \in (0, 1)$, $\gamma^K$ decreases monotonically with the number of look-ahead steps $K$; therefore, we truncate (3) to a finite number $K$ of terms. One can show that these $K$ terms in (3) form a geometric sequence with truncation ratio $1 - \gamma^K$, i.e., the fraction of the infinite sum retained by the first $K$ terms. Requiring $1 - \gamma^K \geq 1 - \epsilon$, where $\epsilon > 0$ is small, gives $\gamma^K \leq \epsilon$ and hence the look-ahead steps $K \geq \log_\gamma \epsilon$.
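The truncation rule K ≥ log_γ ε can be checked numerically. The sketch below computes the smallest integer K satisfying γ^K ≤ ε; the function names `min_lookahead_steps` and `retained_fraction` are hypothetical helpers introduced here for illustration.

```python
import math


def min_lookahead_steps(gamma, eps):
    """Smallest integer K with gamma**K <= eps, i.e. K >= log_gamma(eps)."""
    if not (0.0 < gamma < 1.0) or not (0.0 < eps < 1.0):
        raise ValueError("gamma and eps must both lie strictly between 0 and 1")
    return math.ceil(math.log(eps) / math.log(gamma))


def retained_fraction(gamma, k):
    """Fraction 1 - gamma**k of a geometric series kept by its first k terms."""
    return 1.0 - gamma ** k


k_min = min_lookahead_steps(gamma=0.99, eps=0.01)
print(k_min, retained_fraction(0.99, k_min))
```

For a discount factor close to 1, the bound grows quickly, which suggests that in practice the look-ahead step is better treated as a tunable hyperparameter than derived from a tight ε.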

4.3. H-TERM HELPS CONVERGE TO A HIGH-QUALITY LOCAL MINIMUM

We elaborate how adding the energy in (3) to each state can help drive the agent to the terminal state (a stationary policy), which fixes the issue illustrated in Fig. 1 of Section 1. We have $H(0) = 0$ for the terminal state 0.
• (a) Shortest path problem (deterministic): $H(1) = -\sum_{k=1}^{\infty} b = -\infty$. At state 1, Bellman's optimality equation becomes $V(1) = \max\{V(1) + \lambda H(1),\, b\}$. Independent of the initial value $V_0(1)$, an agent obtains a policy that always transits to terminal state 0.
• (b) Blackmailer's problem (stochastic): $H(1) = -\infty$. Bellman's optimality equation becomes $V(1) = \max_a \{a + (1 - a)(V(1) + \lambda H(1))\}$ for state 1. For any $V_0(1) < \infty$, the optimal policy becomes $a = 1$, which drives to the terminal state 0.
• (c) Optimal stopping problem (terminating policies): any policy that takes infinitely many steps has $H(x) = -\infty$, since at each step $k$ there are always trajectories that jump to point 0 with reward $-c$; a direct jumping policy has $H(x) = -c$. Therefore, adding $H(x)$ to each point $x \neq 0$ leads to a policy of jumping back to point 0.
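The shortest path case (a) can be illustrated with a few lines of fixed-point iteration. This is a sketch under stated assumptions: the reward b = 1.0, the weight λ = 1.0, and the helper name `iterate_value` are chosen here for illustration, and the terminal state contributes no further value.

```python
import math


def iterate_value(v0, b, lam=1.0, h=0.0, steps=100):
    """Iterate V(1) <- max{V(1) + lam * H(1), b} starting from V(1) = v0."""
    v = v0
    for _ in range(steps):
        v = max(v + lam * h, b)
    return v


# Without the H-term (h = 0) the update is V <- max(V, b), so every
# initial value v0 >= b is its own fixed point. With H(1) = -inf the
# first branch is never selected and the iteration returns b.
for v0 in (-3.0, 0.0, 5.0):
    plain = iterate_value(v0, b=1.0)
    regularized = iterate_value(v0, b=1.0, h=-math.inf)
    print(v0, plain, regularized)
```

The plain iteration ends at a value that depends on the initialization, while the regularized one returns b for every finite starting value, which mirrors the claim that the H-term removes the spurious fixed points.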

5. STATIONARY DEEP REINFORCEMENT LEARNING

First, we propose a novel Hamiltonian policy gradient and the corresponding Monte Carlo estimator. Then, we present a stationary actor-critic algorithm with H-term.

5.1. HAMILTONIAN POLICY GRADIENT AND MONTE CARLO-BASED GRADIENT ESTIMATOR

We provide the policy gradients of the quantum K-spin Hamiltonian equation in (3), which are variants of the well-known policy gradient theorem [44]. Detailed derivations are given in Appx. E.

Theorem 1. (Stochastic version)

The Hamiltonian stochastic gradient of (3) is

$$\nabla_\theta H(\theta) = -\mathbb{E}_{\mu_0, \ldots, \mu_{K-1}} \left[ \sum_{k=0}^{K-1} \gamma^k \cdot R(\mu_k) \cdot \nabla_\theta \log\big( \pi_\theta(\mu_0) \cdot \pi_\theta(\mu_1) \cdots \pi_\theta(\mu_k) \big) \right]. \tag{4}$$

Let $\eta_\theta(\cdot) : S \to A$ denote a deterministic policy, and let $\pi_{\theta,\delta}(\mu)$ denote the policy obtained when Gaussian noise (a.k.a. exploration noise) with standard deviation $\delta > 0$ is added in the exploration process.

Theorem 2. (Deterministic version)

The Hamiltonian deterministic gradient of (3) is

$$\nabla_\theta H'(\theta) = -\mathbb{E}_{\mu_0, \ldots, \mu_{K-1}} \left[ \sum_{k=0}^{K-1} \gamma^k \cdot R(\mu_k) \cdot \nabla_\theta \log\big( \pi_{\theta,\delta}(\mu_0) \cdot \pi_{\theta,\delta}(\mu_1) \cdots \pi_{\theta,\delta}(\mu_k) \big) \right]. \tag{5}$$

The quantum K-spin Hamiltonian equation in (3) is a reformulation of (15). We verify the gradient calculation by showing that, when $K \to \infty$, the Hamiltonian stochastic and deterministic policy gradients $\nabla_\theta H(\theta)$ and $\nabla_\theta H'(\theta)$ are equal to the stochastic policy gradient $\nabla_\theta J(\theta)$ in [45] and the deterministic policy gradient $\nabla_\theta J'(\theta)$ in [43], respectively.

Note that the gradients $\nabla_\theta H(\theta)$ in (4) and $\nabla_\theta H'(\theta)$ in (5) w.r.t. a distributional parameter $\theta$ take an expectation form. Thus, a Monte Carlo gradient estimator is practically useful. We obtain the Monte Carlo gradient estimator of $\nabla_\theta H(\theta)$, illustrated in Fig. 3 (right), as follows:

$$\nabla_\theta H(\theta) = -\frac{1}{N'} \sum_{i=1}^{N'} \sum_{k=0}^{K-1} \gamma^k \cdot R(\mu_k^i) \cdot \nabla_\theta \log\big( \pi_\theta(\mu_0^i) \cdots \pi_\theta(\mu_k^i) \big). \tag{6}$$

In contrast, the Monte Carlo gradient estimator of REINFORCE's [45] policy gradient, illustrated in Fig. 3 (left), is

$$\nabla_\theta J(\theta) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T-1} G_t^i \cdot \nabla_\theta \log \pi_\theta(\mu_t^i), \tag{7}$$

where $G_t^i = \sum_{t'=t+1}^{T} \gamma^{t'-t-1} R(\mu_{t'}^i)$.

An interesting observation is that both gradient calculations follow a similar pattern, as shown in Fig. 3. REINFORCE's policy gradient [45] in Fig. 3 (left) employs an estimate of future rewards, while the Hamiltonian policy gradient in Fig. 3 (right) uses trajectories in replay buffer $D_2$.

Computational complexity: we measure the computational complexity by the number of evaluations of $\nabla_\theta \log \pi_\theta(\mu)$. Assume $N = B$ and $N' = B'$, since most DRL algorithms use mini-batch stochastic gradient descent. REINFORCE's [45] policy gradient in (7) takes $O(BT)$ computations, while Alg. 1 adds $O(B'K(K+1)/2)$ computations in each gradient update step, for a total complexity of $O(BT + B'K(K+1)/2)$.
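As an illustration of the Monte Carlo estimator in (6), the following sketch evaluates it for a tabular softmax policy, for which the score function has the closed form onehot(a) - π(·|s). The tabular parameterization, the `(state, action, reward)` tuple format, and the name `hamiltonian_gradient` are assumptions made here; the paper itself uses neural policy networks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hamiltonian_gradient(theta, trajectories, gamma):
    """Estimate the gradient in (6) for a tabular softmax policy.

    theta[s][a] holds the logit of action a in state s (assumed form);
    each trajectory is a list of (state, action, reward) tuples.
    """
    n_states, n_actions = len(theta), len(theta[0])
    grad = [[0.0] * n_actions for _ in range(n_states)]
    for traj in trajectories:
        # Running sum of grad log pi(mu_0) + ... + grad log pi(mu_k).
        score = [[0.0] * n_actions for _ in range(n_states)]
        for k, (s, a, r) in enumerate(traj):
            probs = softmax(theta[s])
            for b in range(n_actions):
                score[s][b] += (1.0 if b == a else 0.0) - probs[b]
            weight = (gamma ** k) * r
            for i in range(n_states):
                for b in range(n_actions):
                    grad[i][b] -= weight * score[i][b]
    n = len(trajectories)
    return [[g / n for g in row] for row in grad]


theta = [[0.0, 0.0], [0.0, 0.0]]
batch = [[(0, 1, 1.0), (1, 0, 2.0)]]
print(hamiltonian_gradient(theta, batch, gamma=0.9))
```

The running sum reuses the score terms of earlier steps. Counting every score term separately for each k instead gives the K(K+1)/2 evaluations per trajectory used in the complexity estimate.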

5.2. STATIONARY ACTOR-CRITIC ALGORITHM WITH H-TERM

Actor-critic algorithms in reinforcement learning perform a bilevel optimization, namely alternating between approximating a value function and optimizing a policy. In practice, a critic network with parameter ϕ approximates the Q-value function, and an actor network with parameter θ approximates the policy π; details are given in Appx. G. However, since the critic's update is governed by Bellman's optimality equation, actor-critic algorithms suffer from the multiple-fixed-points problem. Motivated by Section 4.3, we propose a novel H-term for both deterministic and stochastic actor-critic algorithms. Similar to the entropy term in [27], the proposed H-term is an add-on term that regularizes the actor network and helps it converge to a stationary policy. Specifically, the objective functions of the actor and critic networks become:

$$\begin{cases} \text{Actor:} & \max_\theta J_\pi(\theta, \phi) \triangleq (1 - \gamma)\, \mathbb{E}_{S_0 \sim d_0,\, A_0 \sim \pi_\theta(S_0, \cdot)} \left[ Q_\phi(S_0, A_0) \right] - \lambda H(\theta), \\ \text{Critic:} & \min_\phi J_Q(\theta, \phi) \triangleq \frac{1}{2}\, \mathbb{E}_{S \sim d_\theta(\cdot),\, A \sim \pi_\theta(S, \cdot)} \left[ \big( Q_\phi(S, A) - y(S, A) \big)^2 \right], \end{cases}$$

where the target Q-value is $y(S_k, A_k) = R(S_k, A_k) + \gamma Q_\phi(S_{k+1}, A_{k+1})$, and λ > 0 is a temperature parameter. As an interpretation, the second term $-\lambda H(\theta)$ in the maximization objective of the actor network aims to find a minimum-energy configuration for the MDP problem, namely, a policy π that adds a minimum amount of energy to each state's value function (as in Section 4.3).

Under review as a conference paper at 2023

Algorithm 1 (lines 11-17)
11: for g = 1, ..., G do
12: Randomly sample a mini-batch of B transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{B}$ from $D_1$
13: Randomly sample a mini-batch of B′ trajectories (of length K) $\{\tau_j\}_{j=1}^{B'}$ from $D_2$
14: Update critic network using a conventional method
15: Update actor network as $\theta \leftarrow \theta + \alpha \big( \nabla_\theta J(\theta) - \lambda \nabla_\theta H(\theta) \big)$.
16: end
17: end

New algorithm. In Alg. 1, an agent interacts with an environment and alternately updates its actor network and critic network.
The algorithm has M episodes, and each episode consists of a (Monte Carlo) simulation process and a learning process (gradient estimation) as follows:
• During the (Monte Carlo) simulation process (lines 5-10 of Alg. 1), an agent takes action $a_t$ according to a policy $\pi_\theta(\cdot \mid s_t)$, $t = 0, \ldots, T-1$, generating a trajectory of T steps/transitions. Then, these T transitions are stored in a replay buffer $D_1$, while the full trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$ is stored in replay buffer $D_2$.
• During the learning process (G ≥ 1 updates in one episode) (lines 11-16 of Alg. 1), a mini-batch of B transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{B}$ and a mini-batch of B′ trajectories (of length K) $\{\tau_j = (s_0^j, a_0^j, r_0^j, s_1^j, \ldots, s_{K-1}^j, a_{K-1}^j, r_{K-1}^j, s_K^j)\}_{j=1}^{B'}$ are sampled from $D_1$ and $D_2$, respectively. The critic network is updated by a conventional method, e.g., minimizing the mean squared error (MSE) between an estimated Q-value and a target value. The actor is updated by a Monte Carlo gradient estimator over B transitions and B′ trajectories.

Two new hyperparameters. We introduce two hyperparameters: a temperature λ > 0 that is the relative weight of the H-term, and a look-ahead step K ≤ T that defines the horizon of the H-term.

Implementation of replay buffer $D_2$. After a full trajectory τ of length T is generated, it is partitioned into T - K + 1 trajectories of length K. We rank them according to the cumulative reward and store the top portion, say 80%, in a new replay buffer $D_2$ (line 10 of Alg. 1). We randomly sample a mini-batch of B′ trajectories from $D_2$ (line 13 of Alg. 1) to compute the H-term.
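The construction of replay buffer D_2 described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: trajectories are lists of `(state, action, reward)` tuples, windows are ranked by their undiscounted reward sum, the kept count is rounded down, and the names `top_windows` and `sample_batch` are hypothetical.

```python
import random


def top_windows(trajectory, k, keep_fraction=0.8):
    """Split a length-T trajectory into T - K + 1 windows and keep the best ones."""
    if k < 1 or k > len(trajectory):
        raise ValueError("window length k must satisfy 1 <= k <= T")
    windows = [trajectory[i:i + k] for i in range(len(trajectory) - k + 1)]
    # Rank by cumulative (undiscounted) reward, highest first.
    windows.sort(key=lambda w: sum(r for _, _, r in w), reverse=True)
    n_keep = max(1, int(keep_fraction * len(windows)))  # rounding is an assumption
    return windows[:n_keep]


def sample_batch(buffer_d2, batch_size, rng=random):
    """Uniformly sample a mini-batch of stored windows (with replacement)."""
    return [rng.choice(buffer_d2) for _ in range(batch_size)]


# Usage: one trajectory with T = 5 steps, split into windows of length K = 3.
trajectory = [(t, 0, float(t + 1)) for t in range(5)]
d2 = top_windows(trajectory, k=3)
print(len(d2), [r for _, _, r in d2[0]])
```

Whether sampling from D_2 is done with or without replacement is not stated in the text; the sketch samples with replacement so that it also works when the buffer holds fewer than B′ windows.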

6. PERFORMANCE EVALUATION

We evaluate the proposed H-term from four aspects: 1) converging to a high-quality local minima, 2) reducing variance, 3) driving to a stationary policy, and 4) the impact of trajectory length K. All experiments were executed on an NVIDIA DGX-2 server [12] . The server contains 8 A100 GPUs, 320 GB GPU memory, and 128 CPU cores running at 2.25 GHz.

6.1. EXPERIMENTAL SETTINGS

Environments (tasks). We consider six challenging MuJoCo tasks [47], two combinatorial optimization tasks (TSP and graph maxcut, described in Appx. K), and two non-convex optimization tasks (MIMO beamforming in 5G/6G and a non-convex deep learning classifier, described in Appx. L). For the MuJoCo tasks, the agent learns to control the locomotion of a robot and aims to move forward as quickly as possible. For graph maxcut and TSP, the agent learns to find a near-optimal solution. For MIMO beamforming and the non-convex deep learning classifier, the agent learns to optimize the objective function. These tasks have high-dimensional continuous state and action spaces, in which there exist multiple locally optimal policies, as revealed in Section 3.1.

Compared methods. To evaluate both deterministic and stochastic algorithms, we choose Deep Deterministic Policy Gradient (DDPG) [34] and Proximal Policy Optimization (PPO) [42] for MuJoCo tasks. For TSP, graph maxcut, MIMO beamforming, and the non-convex deep learning classifier, we choose REINFORCE [45]. We implement the PPO algorithm with the GAE trick [41]. For a fair comparison, we keep the hyperparameters (listed in Appx. I) the same and make sure that the obtained results reproduce existing benchmark tests [16]. For combinatorial optimization, we use the same datasets as [14, 31].

Performance metrics. For MuJoCo tasks, we employ two performance metrics, the cumulative reward and its variance, while in Section 6.4 we further consider different policies and report the number of convergences to each. For TSP, graph maxcut, MIMO beamforming, and the non-convex deep learning classifier, we employ the approximation ratio max(Optimal/Obj, Obj/Optimal) and its variance as the performance metrics. We run each experiment with 20 random seeds, and in each run we test 100 episodes.
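The approximation ratio defined above is symmetric, so it applies to both minimization tasks (e.g., TSP tour length) and maximization tasks (e.g., cut size). A minimal sketch follows; it assumes strictly positive objective values, and the names `approximation_ratio` and `fraction_within` are hypothetical helpers.

```python
def approximation_ratio(obj, optimal):
    """Return max(Optimal/Obj, Obj/Optimal); equals 1.0 only at the optimum."""
    if obj <= 0 or optimal <= 0:
        raise ValueError("objective values are assumed to be strictly positive")
    return max(optimal / obj, obj / optimal)


def fraction_within(ratios, threshold=1.05):
    """Fraction of test cases whose approximation ratio is at most the threshold."""
    return sum(1 for r in ratios if r <= threshold) / len(ratios)


# Made-up objective values for illustration, with an optimum of 100.0.
ratios = [approximation_ratio(obj, 100.0) for obj in (100.0, 102.0, 120.0, 96.0)]
print(ratios, fraction_within(ratios))
```

A claim such as "an approximation ratio ≤ 1.05 over 90% of test cases" then corresponds to `fraction_within(ratios, 1.05) >= 0.9` on the test set.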

6.2. H-TERM CONVERGES TO A HIGH-QUALITY LOCAL MINIMUM

Experience replay is crucial for improving performance in terms of cumulative reward. The proposed H-term in (6) can be viewed as a novel experience replay technique for an actor network. Here, we add a compared algorithm, DDPG with Prioritized Experience Replay [40] (DDPG+PER), where PER prioritizes experience by the TD error to update a critic network. In Fig. 4, both DDPG+PER and DDPG+H achieve a substantial improvement in cumulative reward. In particular, DDPG+H achieves the highest cumulative rewards in all six tasks, which are comparable to PPO's performance in Fig. 4. It is worthwhile to discuss the advantage of DDPG+H over DDPG+PER. DDPG+PER utilizes a prioritized replay strategy to obtain a more accurate critic network; however, the critic is updated via the Bellman equation and therefore suffers from multiple fixed points. In contrast, the H-term in DDPG+H acts on the actor network. Our results indicate that an experience replay technique on the actor network may be much more powerful.

In Fig. 5, REINFORCE+H improves the approximation ratio substantially. In particular, REINFORCE+H achieves an approximation ratio ≤ 1.05 over 90% of test cases in TSP, graph maxcut, MIMO beamforming, and the non-convex deep learning classifier. Specifically, the H-term achieves an approximation ratio of 1.01 (near-optimal) over 90% of tests on the MIMO beamforming task. Our results indicate that the H-term helps the policies converge to high-quality local minima over multiple runs.

6.3. H-TERM REDUCES VARIANCE

The PPO algorithm with GAE is regarded as the state-of-the-art algorithm in MuJoCo environments. However, it still has a very high variance (the shaded area) after the policies have converged, as shown in Fig. 4. We observe that, at the end of training, the PPO algorithm has a variance of 969.2, 1563.4, 2513.5, 905.3, 60.7, and 1290.1 in the six tasks, respectively. Such a high variance is mainly due to the fact that the agent may converge to a random one of multiple policies. In Fig. 4, the shaded areas of PPO+H (K = 16) are dramatically smaller, i.e., a variance of 228.4, 225.4, 683.7, 184.2, 31.6, and 296.8, respectively. The variance has been reduced by 65.2% ∼ 85.6%, which verifies the effectiveness of the proposed H-term. In Fig. 4, we also observe that the H-term helps the DDPG algorithm reduce its variance, namely, the variances of DDPG+H are much smaller than those of vanilla DDPG and DDPG+PER. Therefore, we may conclude that the H-term guides the agent to search for a stationary policy among multiple feasible ones.

In Fig. 5, the variance of the approximation ratio of REINFORCE+H (K = 16) is substantially smaller. The variance has been reduced by 60.16% ∼ 94.52%. The results indicate that the H-term guides policies to high-quality minima over multiple runs. More experimental results are given in Appx. I due to the space limit, including the cases of K = 8 and K = 24, and the H-term value during the training process. One may verify that the stationary policies have relatively lower H-values.

6.4. H-TERM DRIVES TO PHYSICALLY STATIONARY POLICY

A key question needs to be answered: does the H-term really guide the agent to converge to a physically stationary policy? Similar to Section 3.2, we perform observational experiments on MuJoCo tasks and measure the number of convergences to different policies over 20 runs. As shown in Table 2, the vanilla PPO algorithm converges to the physically stationary policy (bold) 13, 17, 7, 10, 14, and 5 times for the six tasks, while PPO+H (K = 16) converges to the stationary policy 20, 20, 16, 20, 20, and 16 times, respectively. From this empirical observation, we find that PPO gets stuck in locally optimal policies, failing to find a consistent one. As expected, PPO+H converges to the stationary policy at a substantially higher rate, which verifies the effectiveness of the proposed H-term in finding a physically stationary policy.

6.5. IMPACT OF TRAJECTORY LENGTH K

We investigate the impact of trajectory length K. From (6), we know that a larger K means a more accurate estimation of $\nabla_\theta H(\theta)$, but at the price of more computation. Here, we evaluate PPO+H with K = 8 and K = 16 and set the size of replay buffer $D_2$ to 1,000. In Table 2, we observe that the cumulative reward increases and the variance decreases as K increases from 8 to 16. However, for the case K = 24, both metrics get worse: because of an out-of-memory issue, we had to reduce the replay buffer size to 800. The smaller replay buffer hurts the diversity of the trajectories and may lead to a performance drop. Appx. I.2 provides results for replay buffer size 800.
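To give a feel for how the cost grows with K, the sketch below evaluates the K(K+1)/2 count of score-function evaluations per sampled trajectory from the complexity estimate in Section 5.1, together with the T - K + 1 windows obtained from one trajectory of length T. The function names are hypothetical, and the count ignores batch size and network size.

```python
def score_evaluations_per_trajectory(k):
    """Number of grad-log-policy terms, K(K+1)/2, for one length-K trajectory."""
    return k * (k + 1) // 2


def windows_per_trajectory(t, k):
    """Number of length-K windows, T - K + 1, cut from one length-T trajectory."""
    if k > t:
        raise ValueError("look-ahead step K must not exceed trajectory length T")
    return t - k + 1


for k in (8, 16, 24):
    print(k, score_evaluations_per_trajectory(k))
```

Going from K = 8 to K = 24 multiplies the per-trajectory count by roughly eight (36 versus 300 terms), which is consistent with the trade-off between estimation accuracy and computation discussed above.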

7. CONCLUSIONS

In this paper, we have addressed the foundational issue of many local minima. This issue leads to the instability of DRL algorithms, challenges their reliability and reproducibility, and thus limits the wider adoption of DRL algorithms in real-world tasks. As a fix, we propose a physically inspired regularizer obtained by modeling a policy as a quantum K-spin Ising model. Experimental results show that the H-term helps DRL algorithms converge to a high-quality local minimum, reduces the variance of cumulative rewards by 65.2% ∼ 85.6% on six MuJoCo tasks, and, on two combinatorial tasks and two non-convex optimization tasks, achieves an approximation ratio ≤ 1.05 over 90% of test cases and reduces the variance of the approximation ratio by 60.16% ∼ 94.52%, compared with existing algorithms over 20 runs. In future work, we will explore the potential of directly training a policy network using (3) as in Appx. J, of quantum simulators [29], and of quantum reinforcement learning [6][19]. It is also interesting to apply the Monte Carlo estimator to unbiased policy gradient calculations. We would like to show that the proposed H-term can help distributional RL algorithms [4] find a stationary policy, since the distributional Bellman optimality operator is not a contraction and thus there is also no unique policy.



Footnote 1: Without explicit clarification, both "local minima" and "fixed points" in this paper refer to policies.



Figure 1: Examples with γ = 1. Examples with γ < 1 are given in Fig. 6 of Appx. A.

Figure 3: REINFORCE's policy gradient (left) vs. Hamiltonian policy gradient (right).

Figure 4: Cumulative rewards vs. #samples for compared DRL algorithms on six MuJoCo tasks.

Figure 5: Frequency of approx. ratio for a) TSP with N = 100, b) Graph maxcut with N = 100, c) MIMO beamforming with N = 4, and d) Non-convex deep learning classifier. A lower approximation ratio is better.

Algorithm 1: Stationary Actor-Critic Algorithm with H-term
1: Input: learning rate α, temperature λ, look-ahead step K, and parameters M, T, G, B, B′
2: Initialize actor network π and critic network Q with parameters θ, ϕ, and replay buffers $D_1$, $D_2$
3: for episode = 1, ..., M do

Experimental results on six challenging MuJoCo tasks.

