STATIONARY DEEP REINFORCEMENT LEARNING WITH QUANTUM K-SPIN HAMILTONIAN EQUATION

Abstract

Instability is a major issue of deep reinforcement learning (DRL) algorithms: high variance of cumulative rewards over multiple runs. The instability is mainly caused by the existence of many local minima and worsened by the multiple-fixed-point issue of Bellman's optimality equation. As a fix, we propose a quantum K-spin Hamiltonian regularization term (called the H-term) to help a policy network converge to a high-quality local minimum. First, we take a quantum perspective by modeling a policy as a K-spin Ising model and employing a Hamiltonian equation to measure the energy of a policy. Then, we derive a novel Hamiltonian policy gradient theorem and design a generic actor-critic algorithm that utilizes the H-term to regularize the policy network. Finally, compared with existing algorithms over 20 runs, the proposed method significantly reduces the variance of cumulative rewards by 65.2% ∼ 85.6% on six MuJoCo tasks; on two combinatorial optimization tasks and two non-convex optimization tasks, it achieves an approximation ratio ≤ 1.05 on over 90% of test cases and reduces variance by 60.16% ∼ 94.52%.

1. INTRODUCTION

Instability is a major issue of deep reinforcement learning (DRL) [44] algorithms: agents trained with different random seeds may have dramatically different performance. Existing works [1, 8, 16, 28, 31, 53] empirically reported a high variance over multiple runs. Hence, in practice one has to train tens of agents and pick the best one. Such a high variance largely contributes to the RL community's dispute over reliability and reproducibility [17, 18], limiting wider adoption in real-world tasks. The instability issue is mainly caused by the existence of many local minima^1 and worsened by the multiple-fixed-point issue of Bellman's optimality equation [5, 21, 26, 39]. In Fig. 1, we adapt dynamic programming examples [5, 39] into reinforcement learning settings; detailed descriptions are given in Appx. A.
• Shortest path problem (deterministic) in Fig. 1(a): two policies, 1) transiting back to state 1; 2) driving to terminal state 0.
• Blackmailer's problem (stochastic) in Fig. 1(b): two policies, 1) demanding a → 0 to keep the victim at state 1; 2) demanding a = 1, which drives the victim to terminal state 0.
• Optimal stopping problem (terminating policies) in Fig. 1(c): two policies, 1) continuing inside the sphere of radius (1 − α)c and stopping outside; 2) jumping to point 0 from any point in region C.
The instability problem has been partially addressed by ensemble methods [2, 10], regularization approaches [11, 46], and baseline-correction approaches [41, 50]. In particular, Generalized Advantage Estimation (GAE) [41] is widely used and significantly reduces the variance of the advantage function. However, these methods do NOT fix the issue of local minima and the multiple fixed points of the Bellman equation in Fig. 1; they randomly converge to different local minima. For practical usage, we often expect a DRL algorithm to stably converge to a certain policy, independent of initialization and noise.
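The multiple-fixed-point issue behind the deterministic shortest path example of Fig. 1(a) can be reproduced in a few lines. The sketch below is our own illustration, not the paper's code: we assume an undiscounted two-state problem where state 1 either pays a cost b to reach terminal state 0 or self-loops at zero cost, so every value J(1) ≤ b is a fixed point of the Bellman update and value iteration converges to different answers depending on initialization.

```python
def bellman_update(J1, b):
    """One Bellman backup for state 1: either pay cost b to reach the
    terminal state 0, or self-loop at state 1 with zero cost (no discount)."""
    return min(b, 0.0 + J1)

def value_iteration(J1_init, b, iters=100):
    """Run repeated Bellman backups from a given initial value."""
    J1 = J1_init
    for _ in range(iters):
        J1 = bellman_update(J1, b)
    return J1

# Two initializations converge to two different fixed points:
print(value_iteration(10.0, b=1.0))   # 1.0  (the optimal cost-to-go)
print(value_iteration(-3.0, b=1.0))   # -3.0 (a spurious fixed point)
```

Initializing above b recovers the optimal cost-to-go, while any initialization below b is itself a fixed point, mirroring the two policies of Fig. 1(a).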
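Since GAE [41] is the most prominent of the variance-reduction baselines above, a minimal sketch of its advantage computation is shown here (the function name and toy inputs are ours; γ and λ are GAE's standard discount and smoothing parameters):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    `values` has length len(rewards) + 1, with the bootstrap value of the
    final state at the end. Each advantage is a (gamma * lam)-discounted
    sum of TD residuals, accumulated backward in time."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting λ = 0 recovers the one-step TD residual (low variance, high bias), while λ = 1 recovers the full Monte Carlo advantage; intermediate values trade off the two.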
As a fix, we propose a quantum K-spin Hamiltonian regularization term (H-term) to help a policy network converge to a high-quality local minimum. We take a novel quantum perspective by modeling
^1 Without explicit clarification, both "local minima" and "fixed points" in this paper refer to policies.

