STATIONARY DEEP REINFORCEMENT LEARNING WITH QUANTUM K-SPIN HAMILTONIAN EQUATION

Abstract

Instability is a major issue of deep reinforcement learning (DRL) algorithms: cumulative rewards exhibit high variance over multiple runs. The instability is mainly caused by the existence of many local minima and is worsened by the multiple-fixed-points issue of Bellman's optimality equation. As a fix, we propose a quantum K-spin Hamiltonian regularization term (called the H-term) to help a policy network converge to a high-quality local minimum. First, we take a quantum perspective by modeling a policy as a K-spin Ising model and employ a Hamiltonian equation to measure the energy of a policy. Then, we derive a novel Hamiltonian policy gradient theorem and design a generic actor-critic algorithm that utilizes the H-term to regularize the policy network. Finally, compared with existing algorithms over 20 runs, the proposed method significantly reduces the variance of cumulative rewards by 65.2% ∼ 85.6% on six MuJoCo tasks, and achieves an approximation ratio ≤ 1.05 on over 90% of test cases while reducing variance by 60.16% ∼ 94.52% on two combinatorial optimization tasks and two non-convex optimization tasks.

1. INTRODUCTION

Instability is a major issue of deep reinforcement learning (DRL) [44] algorithms: agents trained with different random seeds may have dramatically different performance. Existing works [1, 8, 16, 28, 31, 53] empirically reported a high variance over multiple runs. Hence, in practice one has to train tens of agents and pick the best one. Such a high variance largely contributes to the RL community's dispute over reliability and reproducibility [17, 18], limiting wider adoption in real-world tasks. The instability issue is mainly caused by the existence of many local minima¹ and is worsened by the multiple-fixed-points issue of Bellman's optimality equation [5, 21, 26, 39]. In Fig. 1, we adapt dynamic programming examples [5, 39] into reinforcement learning settings; detailed descriptions are given in Appx. A.
• Shortest path problem (deterministic) in Fig. 1(a): two policies, 1) transiting back to state 1; 2) driving to terminal state 0.
• Blackmailer's problem (stochastic) in Fig. 1(b): two policies, 1) demanding a → 0 to keep the victim at state 1; 2) demanding a = 1, which drives the victim to terminal state 0.
• Optimal stopping problem (terminating policies) in Fig. 1(c): two policies, 1) continuing inside the sphere of radius (1 - α)c and stopping outside; 2) jumping to point 0 from any point in region C.
The instability problem has been partially addressed by ensemble methods [2, 10], regularization approaches [11, 46], and baseline-correction approaches [41, 50]. In particular, Generalized Advantage Estimation (GAE) [41] is widely used and significantly reduces the variance of the advantage function. However, these methods do NOT fix the issue of many local minima and the multiple-fixed-points issue of the Bellman equation in Fig. 1; they randomly converge to different local minima. For practical usage, we often expect a DRL algorithm to stably converge to a certain policy, independent of initialization and noise.
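The multiple-fixed-points issue in the shortest path example can be made concrete with a minimal sketch (our own illustrative construction, not code from the paper): a two-state MDP with γ = 1, where state 1 may loop to itself at cost 0 or exit to terminal state 0 at cost 1. The Bellman optimality operator then admits infinitely many fixed points.

```python
# Illustrative two-state shortest-path MDP (in the spirit of Fig. 1(a)), gamma = 1.
# State 1 can loop to itself at cost 0 or move to terminal state 0 at cost 1.
# The Bellman operator T J(1) = min(0 + J(1), 1 + J(0)), with J(0) = 0,
# has infinitely many fixed points: every J(1) <= 1 satisfies J(1) = T J(1).

def bellman(J1, loop_cost=0.0, exit_cost=1.0):
    """One application of the Bellman optimality operator at state 1 (J(0) = 0)."""
    return min(loop_cost + J1, exit_cost + 0.0)

# Every candidate value <= 1 is left unchanged by the operator.
for J1 in [-2.0, 0.0, 0.5, 1.0]:
    assert bellman(J1) == J1

# Values above 1 are contracted to the exit cost.
assert bellman(3.0) == 1.0
```

Each fixed point corresponds to a different policy preference (looping forever versus exiting), which is why value-based updates alone cannot single out one policy here.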
As a fix, we propose a quantum K-spin Hamiltonian regularization term (H-term) to help a policy network converge to a high-quality local minimum. We take a novel quantum perspective by modeling a policy as a K-spin Ising model.

2. RELATED WORKS

The existence of many local minima has been theoretically pointed out in robotic control tasks [16], combinatorial optimization tasks [25, 36], and non-convex optimization tasks [3, 52]. Existing solutions can be classified into three approaches: ensemble methods, regularizers, and baseline correction. Ensemble methods [2, 10] reduce the variance by using multiple critic networks to approximate an accurate value function; however, they still encounter the multiple-fixed-points issue of Bellman's optimality equation. Regularization methods [11, 46] guide the updating process of a policy network; adding a regularizer essentially helps find a local minimum with a preferred structure, but it cannot help escape from local minima. Baseline-correction approaches [41, 50] reduce the bias of Monte Carlo estimation. In particular, Generalized Advantage Estimation (GAE) [41] is widely used and significantly reduces the variance of the advantage function; however, it is restricted by the accuracy of the baseline, which suffers from the local minima issue as well. In short, none of these methods fixes the issue of many local minima or the multiple-fixed-points issue of the Bellman equation in Fig. 1. In contrast, we propose a physically inspired DRL algorithm that stably converges to a certain policy, independent of initialization and noise. Different from our quantum K-spin perspective, several recent papers utilized the (classical) Hamiltonian equation to endow RL agents with inductive biases. For example, [24, 48] used Hamiltonian mechanics to train an agent that learns and respects conservation laws; [51] applied a Hamiltonian Monte Carlo (HMC) simulator to approximate the posterior action probability; and [35] proposed an unbiased estimator for stochastic Hamiltonian gradient methods for min-max optimization problems.
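For reference, the GAE estimator [41] discussed above can be sketched as the standard backward recursion over TD residuals, A_t = δ_t + γλ A_{t+1} with δ_t = r_t + γV(s_{t+1}) - V(s_t). The rollout arrays below are hypothetical; this is a minimal sketch of the published estimator, not the implementation evaluated in this paper.

```python
# Minimal sketch of Generalized Advantage Estimation (GAE) [41].
# `rewards` has one entry per step; `values` has one extra bootstrap entry V(s_T).
# lambda interpolates between the low-variance TD(0) advantage (lam = 0)
# and the high-variance Monte Carlo advantage (lam = 1).

def gae(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-step rollout.
adv = gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0])
```

Note that the quality of `values` (the baseline) directly determines the bias of the estimate, which is the limitation noted above: if the critic sits at a poor local minimum, GAE inherits that error.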

3. THE PROBLEM OF MANY LOCAL MINIMAS

First, we show the existence of many local minima in many tasks. Then, we present observational experiments that empirically verify the existence of multiple policies.
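A toy analogue (our own illustration, not one of the paper's benchmark tasks) conveys the flavor of such observational experiments: running the same gradient-descent procedure from different random seeds on a non-convex objective f(x) = (x² - 1)² lands in one of two distinct minima, x = -1 or x = +1, just as DRL runs with different seeds settle into different policies.

```python
import random

# Toy analogue of seed-dependent convergence: gradient descent on the
# non-convex objective f(x) = (x^2 - 1)^2, whose gradient is 4x(x^2 - 1),
# started from a seed-dependent random point in [-2, 2].

def descend(seed, steps=2000, lr=0.01):
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        x -= lr * 4.0 * x * (x * x - 1.0)  # gradient step on (x^2 - 1)^2
    return x

# Different seeds converge to different minima of the same objective.
minima = {round(descend(seed)) for seed in range(20)}
print(minima)
```

With 20 seeds, both minima {-1, 1} appear, so "which solution you get" is decided purely by initialization, mirroring the multiple-policies phenomenon verified in this section.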



¹ Unless explicitly stated otherwise, both "local minima" and "fixed points" in this paper refer to policies.



Figure 1: Examples with γ = 1. Examples with γ < 1 are given in Fig. 6 of Appx. A.

