HAMILTONIAN Q-LEARNING: LEVERAGING IMPORTANCE-SAMPLING FOR DATA EFFICIENT RL

Abstract

Model-free reinforcement learning (RL), in particular Q-learning, is widely used to learn optimal policies for a variety of planning and control problems. However, when the underlying state-transition dynamics are stochastic and high-dimensional, Q-learning requires a large amount of data and incurs a prohibitively high computational cost. In this paper, we introduce Hamiltonian Q-Learning, a data-efficient modification of the Q-learning approach that adopts an importance-sampling-based technique for computing the Q function. To exploit the stochastic structure of the state-transition dynamics, we employ Hamiltonian Monte Carlo to update Q function estimates by approximating the expected future rewards using Q values associated with a subset of next states. Further, to exploit the latent low-rank structure of the dynamical system, Hamiltonian Q-Learning uses a matrix completion algorithm to reconstruct the updated Q function from Q value updates over a much smaller subset of state-action pairs. By providing an efficient way to apply Q-learning in stochastic, high-dimensional problems, the proposed approach broadens the scope of RL algorithms for real-world applications, including classical control tasks and environmental monitoring.

1. INTRODUCTION

In recent years, reinforcement learning (RL) (Sutton & Barto, 2018) has achieved remarkable success on sequential decision-making tasks, especially in complex, uncertain environments. RL algorithms have been widely applied to a variety of real-world problems, such as resource allocation (Mao et al., 2016), chemical process optimization (Zhou et al., 2017), automatic control (Duan et al., 2016), and robotics (Kober et al., 2013). However, existing RL techniques often offer satisfactory performance only when the agent is allowed to explore the environment long enough, generating a large amount of data in the process (Mnih et al., 2015; Kamthe & Deisenroth, 2018; Yang et al., 2020a). This can be prohibitively expensive and thereby limits the use of RL for complex decision support problems.

Q-learning (Watkins, 1989; Watkins & Dayan, 1992) is a model-free RL framework that captures the salient features of sequential decision making: an agent, after observing the current state of the environment, chooses an action and receives a reward. The action chosen by the agent is based on a policy defined by the state-action value function, also called the Q function. The performance of such policies strongly depends on the availability of a sufficiently large data set covering the space spanned by the state-action pairs. In particular, for high-dimensional problems, existing model-free RL methods that rely on random sampling techniques lead to poor performance and high computational cost. To overcome this challenge, in this paper we propose an intelligent sampling technique that exploits the inherent structure of the underlying space related to the dynamics of the system.
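The Q-learning framework described above can be illustrated with a minimal tabular sketch; the environment size, learning rate, and the single transition below are hypothetical and serve only to show the standard one-step update, not the method proposed in this paper.

```python
import numpy as np

# Minimal tabular Q-learning update (toy setup; all names and
# parameters here are illustrative, not taken from the paper).
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9          # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One-step Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# One illustrative transition: state 0, action 1, reward 1.0, next state 2.
q_update(0, 1, 1.0, 2)
print(Q[0, 1])  # 0.1 after a single update from zero initialization
```

For stochastic transitions, the `max` term above is replaced by an expectation over next states, which is exactly the quantity whose computation becomes expensive in high dimensions.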
It has been observed that formulating planning and control tasks in a variety of dynamical systems, such as video games (Atari games), classical control problems (simple pendulum, cart pole, and double integrator), and adaptive sampling (ocean sampling, environmental monitoring), as Q-learning problems leads to low-rank structures in the Q matrix (Ong, 2015; Yang et al., 2020b; Shah et al., 2020). Since these systems naturally consist of a large number of states, efficient exploitation of the low-rank structure of the Q matrix can potentially lead to a significant reduction in computational complexity and improved performance. However, when the state space is high-dimensional and, further, the state transitions are probabilistic, the high computational complexity associated with calculating the expected Q values of next states renders existing Q-learning methods impractical.
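The cost of computing expected Q values over all next states can be reduced by estimating the expectation from a sampled subset. The paper uses Hamiltonian Monte Carlo to choose these samples; the sketch below substitutes plain i.i.d. sampling from the transition distribution as a simpler stand-in, with all sizes and values synthetic.

```python
import numpy as np

# Estimate E[max_a Q(s', a)] from a sampled subset of next states,
# rather than summing over all of them. Plain i.i.d. sampling here is
# a simplified stand-in for the paper's HMC-based sampling.
rng = np.random.default_rng(1)
n_next, n_actions = 1000, 4
Q_next = rng.standard_normal((n_next, n_actions))   # Q values of next states
p = rng.random(n_next)
p /= p.sum()                                        # transition probabilities

exact = (p * Q_next.max(axis=1)).sum()              # full expectation (O(n_next))

idx = rng.choice(n_next, size=100, p=p)             # sample 100 next states
estimate = Q_next[idx].max(axis=1).mean()           # subset-based estimate

print(abs(exact - estimate))                        # small estimation error
```

The point of the sampling-based view is that the estimate touches only 100 of the 1000 next states, which is where the data and compute savings come from when the state space is large.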
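The low-rank structure of the Q matrix can likewise be illustrated with a small matrix-completion sketch. The rank-2 "Q matrix", the observation mask, and the iterative truncated-SVD ("hard impute") scheme below are all illustrative stand-ins, not the specific completion algorithm used in the paper.

```python
import numpy as np

# Complete a synthetic rank-2 matrix from ~60% of its entries by
# alternating rank-2 SVD truncation with re-imposing observed values.
rng = np.random.default_rng(0)
U = rng.standard_normal((20, 2))
V = rng.standard_normal((2, 15))
Q_true = U @ V                         # exactly rank-2 "Q matrix"

mask = rng.random(Q_true.shape) < 0.6  # observe ~60% of entries
X = np.where(mask, Q_true, 0.0)        # unobserved entries start at zero

for _ in range(200):
    # Project onto rank-2 matrices, then restore the observed entries.
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    X = (u[:, :2] * s[:2]) @ vt[:2]
    X[mask] = Q_true[mask]

err = np.linalg.norm(X - Q_true) / np.linalg.norm(Q_true)
print(err)                             # relative reconstruction error
```

Because a rank-2 matrix of this size has far fewer degrees of freedom than entries, a fraction of the Q values suffices to recover the whole matrix, which is the mechanism Hamiltonian Q-Learning exploits to update Q values only on a small subset of state-action pairs.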

