HAMILTONIAN Q-LEARNING: LEVERAGING IMPORTANCE-SAMPLING FOR DATA EFFICIENT RL

Abstract

Model-free reinforcement learning (RL), in particular Q-learning, is widely used to learn optimal policies for a variety of planning and control problems. However, when the underlying state-transition dynamics are stochastic and high-dimensional, Q-learning requires a large amount of data and incurs a prohibitively high computational cost. In this paper, we introduce Hamiltonian Q-Learning, a data-efficient modification of the Q-learning approach, which adopts an importance-sampling-based technique for computing the Q function. To exploit the stochastic structure of the state-transition dynamics, we employ Hamiltonian Monte Carlo to update Q function estimates by approximating the expected future rewards using Q values associated with a subset of next states. Further, to exploit the latent low-rank structure of the dynamical system, Hamiltonian Q-Learning uses a matrix completion algorithm to reconstruct the updated Q function from Q value updates over a much smaller subset of state-action pairs. By providing an efficient way to apply Q-learning in stochastic, high-dimensional problems, the proposed approach broadens the scope of RL algorithms for real-world applications, including classical control tasks and environmental monitoring.

1. INTRODUCTION

In recent years, reinforcement learning (Sutton & Barto, 2018) has achieved remarkable success in sequential decision making tasks, especially in complex, uncertain environments. RL algorithms have been widely applied to a variety of real-world problems, such as resource allocation (Mao et al., 2016), chemical process optimization (Zhou et al., 2017), automatic control (Duan et al., 2016), and robotics (Kober et al., 2013). However, existing RL techniques often offer satisfactory performance only when the agent is allowed to explore the environment long enough, generating a large amount of data in the process (Mnih et al., 2015; Kamthe & Deisenroth, 2018; Yang et al., 2020a). This can be prohibitively expensive and thereby limits the use of RL for complex decision support problems.

Q-learning (Watkins, 1989; Watkins & Dayan, 1992) is a model-free RL framework that captures the salient features of sequential decision making, where an agent, after observing the current state of the environment, chooses an action and receives a reward. The action chosen by the agent is based on a policy defined by the state-action value function, also called the Q function. The performance of such policies strongly depends on the accessibility of a sufficiently large data set covering the space spanned by the state-action pairs. In particular, for high-dimensional problems, existing model-free RL methods using random sampling techniques lead to poor performance and high computational cost.

To overcome this challenge, in this paper we propose an intelligent sampling technique that exploits the inherent structure of the underlying space related to the dynamics of the system. It has been observed that formulating planning and control tasks in a variety of dynamical systems, such as video games (Atari games), classical control problems (simple pendulum, cart pole and double integrator) and adaptive sampling (ocean sampling, environmental monitoring), as Q-learning problems leads to low-rank structures in the Q matrix (Ong, 2015; Yang et al., 2020b; Shah et al., 2020). Since these systems naturally consist of a large number of states, efficient exploitation of the low-rank structure of the Q matrix can potentially lead to a significant reduction in computational complexity and improved performance. However, when the state space is high-dimensional and, further, the state transitions are probabilistic, the high computational complexity associated with calculating the expected Q values of next states renders existing Q-learning methods impractical.
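
To fix notation for what follows, the sketch below illustrates the standard tabular Q-learning update that the rest of the paper builds on; the problem sizes, learning rate, and discount factor are illustrative placeholders rather than values used in our experiments.

```python
import numpy as np

# Illustrative problem sizes; in practice |S| and |A| come from the underlying MDP.
n_states, n_actions = 100, 4
gamma, alpha = 0.95, 0.1             # discount factor and learning rate (placeholders)

Q = np.zeros((n_states, n_actions))  # Q matrix: rows index states, columns index actions

def q_learning_step(s, a, r, s_next):
    """One tabular Q-learning update for an observed transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap with the greedy next-state value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the current estimate toward the TD target
```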


A potential solution to this problem lies in approximating the expectation of Q values of next states with the sample mean of Q values over a subset of next states. A natural way to select a subset of next states is to draw IID samples from the transition probability distribution. However, this straightforward approach becomes challenging when the state transition probability distribution is high-dimensional and is known only up to a constant. We address this problem by using Hamiltonian Monte Carlo (HMC) to sample next states; HMC draws samples by integrating a Hamiltonian dynamics governed by the transition probability (Neal et al., 2011). We improve the data efficiency further by using matrix completion methods to exploit the low-rank structure of the Q matrix. Shah et al. (2020) propose a novel matrix estimation method and provide theoretical guarantees for convergence to an ε-optimal Q function. On the other hand, entropy regularization (Ahmed et al., 2019; Yang et al., 2019; Smirnova & Dohmatob, 2020), by penalizing excessive randomness in the conditional distribution of actions for a given state, provides an alternative means to implicitly exploit the underlying low-dimensional structure of the value function. Lee et al. (2019) propose an approach that samples a whole episode and then updates values in a recursive, backward manner.
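
As a concrete illustration of this idea, the following sketch draws next states with a basic leapfrog-based HMC sampler and uses their sample mean to approximate the expected next-state value. The Gaussian form of the transition density, the tuning constants, and the `state_index` function that maps continuous samples onto a discrete state grid are assumptions made for illustration; they are not prescribed by the method itself.

```python
import numpy as np

def neg_log_prob(x, mean, cov_inv):
    """Negative log of an (unnormalized) Gaussian transition density p(s' | s, a)."""
    d = x - mean
    return 0.5 * d @ cov_inv @ d

def grad_neg_log_prob(x, mean, cov_inv):
    return cov_inv @ (x - mean)

def hmc_sample(mean, cov_inv, n_samples=50, step=0.1, n_leapfrog=20, rng=None):
    """Draw samples from the transition density with Hamiltonian Monte Carlo."""
    rng = np.random.default_rng() if rng is None else rng
    dim = mean.shape[0]
    x = mean.copy()                      # start the chain at the mode (no burn-in, for brevity)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(dim)     # resample the auxiliary momentum
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration of the Hamiltonian dynamics
        p_new -= 0.5 * step * grad_neg_log_prob(x_new, mean, cov_inv)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new -= step * grad_neg_log_prob(x_new, mean, cov_inv)
        x_new += step * p_new
        p_new -= 0.5 * step * grad_neg_log_prob(x_new, mean, cov_inv)
        # Metropolis acceptance using the Hamiltonian (potential + kinetic energy)
        h_old = neg_log_prob(x, mean, cov_inv) + 0.5 * p @ p
        h_new = neg_log_prob(x_new, mean, cov_inv) + 0.5 * p_new @ p_new
        if rng.random() < np.exp(h_old - h_new):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

def expected_max_q(Q, samples, state_index):
    """Approximate E[max_a Q(s', a)] by averaging over HMC-sampled next states."""
    idx = np.array([state_index(s) for s in samples])  # map continuous samples to discrete states
    return np.mean(np.max(Q[idx], axis=1))
```

In practice the step size and number of leapfrog steps would be tuned so that the acceptance rate remains reasonably high, and the sampled states would be mapped onto whatever discretization the Q matrix uses.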

CONTRIBUTION

The main contribution of this work is three-fold. First, we introduce a modified Q-learning framework, called Hamiltonian Q-Learning, which uses HMC sampling for efficient computation of Q values. This innovation, by proposing to sample Q values from the region with the dominant contribution to the expectation of discounted reward, provides a data-efficient approach for using Q-learning in real-world problems with high-dimensional state spaces and probabilistic state transitions. Furthermore, integrating this sampling approach with matrix completion enables us to update Q values for only a small subset of state-action pairs and thereafter reconstruct the complete Q matrix. Second, we provide theoretical guarantees that the error between the optimal Q function and the Q function obtained by updating Q values using HMC sampling can be made arbitrarily small. This result also holds when only a handful of Q values are updated using HMC and the rest are estimated using matrix completion. Further, we provide a theoretical guarantee that the sampling complexity of our algorithm matches the minimax sampling complexity proposed by Tsybakov (2008). Finally, we demonstrate the effectiveness of Hamiltonian Q-Learning by applying it to a cart-pole stabilization problem and an adaptive ocean sampling problem. Our results also indicate that the proposed approach becomes more effective as the state space dimension increases.
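
To make the reconstruction step concrete, the sketch below fills in the unobserved entries of a Q matrix from a small set of updated entries using soft-thresholding of singular values (a SoftImpute-style iteration). This particular completion algorithm, the threshold `tau`, and the iteration count are illustrative choices, not the only way to realize the matrix completion step.

```python
import numpy as np

def complete_q_matrix(Q_obs, mask, tau=1.0, n_iters=200):
    """Reconstruct a low-rank Q matrix from the entries observed where mask is True,
    by iteratively soft-thresholding the singular values of the filled-in matrix."""
    X = np.where(mask, Q_obs, 0.0)                 # start with zeros in the unobserved entries
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - tau, 0.0)               # shrink singular values -> low-rank estimate
        X_low = (U * s) @ Vt
        X = np.where(mask, Q_obs, X_low)           # keep observed entries, fill the rest
    return np.where(mask, Q_obs, X_low)
```

Here `Q_obs` holds the HMC-updated Q values on the sampled state-action pairs (and arbitrary values elsewhere), and `mask` marks which entries were actually updated; the returned matrix agrees with the updated entries and completes the remainder with a low-rank estimate.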

2. PRELIMINARY CONCEPTS

In this section, we provide a brief background on Q-learning, HMC sampling and matrix completion, and introduce the mathematical notation. In this paper, $|Z|$ denotes the cardinality of a set $Z$. Moreover, $\mathbb{R}$ represents the real line and $A^T$ denotes the transpose of a matrix $A$.

2.1. Q-LEARNING

A Markov Decision Process (MDP) is a mathematical formulation that captures salient features of sequential decision making (Bertsekas, 1995). In particular, a finite MDP is defined by the tuple

