A SAMPLING FRAMEWORK FOR VALUE-BASED REINFORCEMENT LEARNING

Abstract

Value-based algorithms have achieved great success in solving Reinforcement Learning problems by minimizing the mean squared Bellman error (MSBE). Temporal-difference (TD) algorithms such as Q-learning and SARSA often use stochastic gradient descent (SGD) based optimization approaches to estimate the value function parameters, but fail to quantify the uncertainty of these estimates. In our work, under the Kalman filtering paradigm, we establish a novel and scalable sampling framework based on stochastic gradient Markov chain Monte Carlo, which allows us to efficiently generate samples from the posterior distribution of deep neural network parameters. For TD learning with both linear and nonlinear function approximation, we prove that the proposed algorithm converges to a stationary distribution, which allows us to measure uncertainties of the value function and its parameters.

1. INTRODUCTION

Reinforcement learning (RL) aims to learn an optimal policy for sequential decision problems in order to maximize the expected future reward. Value-based algorithms such as Temporal-difference (TD) learning (Sutton, 1988), State-action-reward-state-action (SARSA) (Sutton & Barto, 2018), and Q-learning are frequently used and play a crucial role in policy improvement. TD learning aims to estimate the value functions, including the state-value function and the action-value function, by minimizing the mean squared Bellman error, where the value functions are often approximated by a function family with unknown parameters. Hence, it is critical to evaluate the accuracy and uncertainty of the parameter estimates, which enables uncertainty quantification for the sequential decisions made at a sequence of states. In function-approximation TD algorithms such as the Deep Q-Network, the parameters are commonly optimized by stochastic gradient descent (SGD) based algorithms. The convergence of these algorithms, with both linear function approximation (Schoknecht, 2002) and nonlinear function approximation (Fan et al., 2020; Cai et al., 2019), has been extensively studied in the literature. However, SGD suffers from the local-trap issue when dealing with nonconvex function approximations such as deep neural networks (DNNs). To efficiently and effectively explore the landscape of a complex DNN model, Monte Carlo algorithms such as Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011; Aicher et al., 2019; Kamalaruban et al., 2020) have shown great potential in escaping from local traps. Moreover, under the Bayesian framework, Monte Carlo algorithms generate samples from the posterior distribution, which naturally describes the uncertainty of the estimates. Toward uncertainty quantification for reinforcement learning, it is important to note that the reinforcement learning problem can generally be reformulated as a state-space model.
In consequence, the value function parameters can be estimated with Kalman filtering methods such as Kalman Temporal Difference (KTD) (Geist & Pietquin, 2010) and the KOVA algorithm (Shashua & Mannor, 2020). Under the normality assumption and with linear function approximation, the Kalman filter approaches provide the correct mean and variance of the value function, which enables uncertainty quantification for the sequential decision. However, for nonlinear function approximation, the KTD and KOVA algorithms adopt the unscented Kalman filter (UKF) (Wan & Van Der Merwe, 2000) and extended Kalman filter (EKF) techniques, respectively, to approximate the covariance matrices. Both algorithms are computationally inefficient for large-scale neural networks; in particular, KTD requires $O(p^2)$ operations for the covariance update, where $p$ is the dimension of the parameter space.

In this paper, we make two major contributions: (i) We develop a new Kalman filter-type algorithm for value-based policy evaluation based on the Langevinized Ensemble Kalman filter (Zhang et al., 2021; Dong et al., 2022). The new algorithm is scalable with respect to the dimension of the parameter space, with a computational complexity of $O(p)$ per iteration. (ii) We prove that, even when the policy is not fixed, under some regularity conditions the proposed algorithm converges to a stationary distribution.

2.1. MARKOV DECISION PROCESS FRAMEWORK

The standard RL procedure aims to learn an optimal policy from the interaction experiences between an agent and an environment, where the optimal policy maximizes the agent's expected total reward. The RL procedure can be described by a Markov decision process (MDP) represented by $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma\}$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the state transition probability from state $s$ to state $s'$ by taking action $a$, denoted by $\mathcal{P}(s'|s,a)$, $r(s,a)$ is a random reward received from taking action $a$ at state $s$, and $\gamma \in (0,1)$ is a discount factor. At each time stage $t$, the agent observes state $s_t \in \mathcal{S}$ and takes action $a_t \in \mathcal{A}$ according to policy $\rho$ with probability $P_\rho(a|s)$; the environment then returns a reward $r_t = r(s_t, a_t)$ and a new state $s_{t+1} \in \mathcal{S}$. For a given policy $\rho$, the performance is measured by the state value function ($V$-function) $V^\rho(s) = \mathbb{E}_\rho[\sum_{t=0}^{\infty} \gamma^t r_t \,|\, s_0 = s]$ and the state-action value function ($Q$-function) $Q^\rho(s,a) = \mathbb{E}_\rho[\sum_{t=0}^{\infty} \gamma^t r_t \,|\, s_0 = s, a_0 = a]$. Both functions satisfy the following Bellman equations:
$$V^\rho(s) = \mathbb{E}_\rho[r(s,a) + \gamma V^\rho(s')], \qquad Q^\rho(s,a) = \mathbb{E}_\rho[r(s,a) + \gamma Q^\rho(s',a')],$$
where $s' \sim \mathcal{P}(\cdot|s,a)$, $a \sim P_\rho(\cdot|s)$, $a' \sim P_\rho(\cdot|s')$, and the expectations are taken over the transition probability $\mathcal{P}$ for the given policy $\rho$.
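These Bellman equations define $V^\rho$ as the fixed point of a $\gamma$-contraction, so in the tabular case it can be computed by simple fixed-point iteration. A minimal sketch for a hypothetical 3-state, 2-action MDP follows; the transition tensor `P`, reward table `r`, and policy `rho` are made-up illustration values, not from the paper:

```python
import numpy as np

gamma = 0.9  # discount factor

# Hypothetical MDP: P[a, s, s'] = P(s' | s, a); each P[a, s, :] sums to 1.
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],  # action 0
    [[0.0, 0.5, 0.5], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],  # action 1
])
r = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])        # r[s, a]
rho = np.array([[0.6, 0.4], [0.5, 0.5], [0.9, 0.1]])      # rho[s, a] = P_rho(a | s)

# Policy-averaged dynamics: P_rho[s, s'] = sum_a rho[s, a] * P[a, s, s'],
# and expected one-step reward r_rho[s] = sum_a rho[s, a] * r[s, a].
P_rho = np.einsum('sa,asp->sp', rho, P)
r_rho = (rho * r).sum(axis=1)

# Fixed-point iteration on the Bellman equation V = r_rho + gamma * P_rho V.
V = np.zeros(3)
for _ in range(500):
    V = r_rho + gamma * P_rho @ V
```

Because the Bellman operator is a $\gamma$-contraction, the iterates converge geometrically to the unique solution $V^\rho = (I - \gamma P_\rho)^{-1} r_\rho$.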

2.2. BAYESIAN FORMULATION

In this paper, we focus on learning the optimal policy $\rho$ via estimating $Q^\rho$. Suppose that the $Q$-function is parameterized as $Q(\cdot; \theta)$ with parameter $\theta \in \Theta \subset \mathbb{R}^p$. Let $\mu_\rho$ be the stationary distribution of the transition tuple $z = (s, a, r, s', a')$ with respect to policy $\rho$. Then $Q^\rho$ can be estimated by minimizing the mean squared Bellman error (MSBE),
$$\min_\theta \mathrm{MSBE}(\theta) = \min_\theta \mathbb{E}_{z \sim \mu_\rho}\big(Q(s,a;\theta) - r - \gamma Q(s',a';\theta)\big)^2,$$
where the expectation is taken over the fixed stationary distribution $\mu_\rho$. By imposing a prior density function $\pi(\theta)$ on $\theta$, we define a new objective function
$$\mathcal{F}(\theta) = \mathbb{E}_{z \sim \mu_\rho}[F(\theta, z)] = \mathbb{E}_{z \sim \mu_\rho}\Big[\big(Q(s,a;\theta) - r - \gamma Q(s',a';\theta)\big)^2 - \frac{1}{n}\log \pi(\theta)\Big],$$
where $F(\theta, z) = \big(Q(s,a;\theta) - r - \gamma Q(s',a';\theta)\big)^2 - \frac{1}{n}\log \pi(\theta)$. Since the stationary distribution $\mu_\rho$ is unknown, we consider the empirical objective function $\hat{\mathcal{F}}_z(\theta) = \frac{1}{n}\sum_{i=1}^n F(\theta, z_i)$ on a set of transition tuples $z = \{z_i\}_{i=1}^n$. Instead of minimizing $\hat{\mathcal{F}}_z$ directly, one can simulate a sequence of $\theta$ values using the SGLD algorithm by iterating the following equation:
$$\theta_t = \theta_{t-1} - \epsilon_t n \nabla \hat{\mathcal{F}}_t(\theta_{t-1}) + \sqrt{2\epsilon_t \beta^{-1}}\, \omega_t,$$
where $\epsilon_t$ is the step size, $\beta$ is the inverse temperature, $\nabla \hat{\mathcal{F}}_t$ denotes a mini-batch stochastic gradient of $\hat{\mathcal{F}}_z$, and $\omega_t \sim N(0, I_p)$.
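A minimal sketch of this SGLD update for the empirical objective, assuming a linear $Q$-function $Q(s,a;\theta) = \phi(s,a)^\top \theta$ and a standard normal prior $\pi(\theta)$; the feature matrices, noise level, and "true" parameter below are synthetic illustration values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, gamma, beta = 4, 256, 0.9, 1.0   # dimension, sample size, discount, inverse temperature

# Synthetic transition tuples: phi[i] = phi(s_i, a_i), phi_next[i] = phi(s'_i, a'_i).
phi = rng.normal(size=(n, p))
phi_next = rng.normal(size=(n, p))
theta_star = rng.normal(size=p)                       # hypothetical "true" parameter
D = phi - gamma * phi_next                            # Bellman-residual features
rewards = D @ theta_star + 0.1 * rng.normal(size=n)

def grad_F_hat(theta, idx):
    """Mini-batch stochastic gradient of F_hat_z(theta) = (1/n) sum_i F(theta, z_i),
    where F(theta, z) = (Q(s,a;theta) - r - gamma * Q(s',a';theta))^2 - (1/n) log pi(theta)."""
    residual = D[idx] @ theta - rewards[idx]          # Q(s,a) - r - gamma * Q(s',a')
    grad_mse = 2.0 * D[idx].T @ residual / len(idx)   # gradient of the squared Bellman error
    grad_prior = theta / n                            # -(1/n) * grad log pi(theta) for N(0, I)
    return grad_mse + grad_prior

theta, eps = np.zeros(p), 1e-4
for t in range(2000):
    idx = rng.choice(n, size=32, replace=False)
    omega = rng.normal(size=p)                        # omega_t ~ N(0, I_p)
    # SGLD: theta_t = theta_{t-1} - eps * n * grad + sqrt(2 * eps / beta) * omega_t
    theta = theta - eps * n * grad_F_hat(theta, idx) + np.sqrt(2.0 * eps / beta) * omega
```

After a burn-in period, the iterates $\theta_t$ approximate draws from the posterior proportional to $\exp\{-\beta n \hat{\mathcal{F}}_z(\theta)\}$, so their spread across iterations can be used to quantify parameter uncertainty rather than only reporting a point estimate.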

