A SAMPLING FRAMEWORK FOR VALUE-BASED REINFORCEMENT LEARNING

Abstract

Value-based algorithms have achieved great success in solving reinforcement learning problems by minimizing the mean squared Bellman error (MSBE). Temporal-difference (TD) algorithms such as Q-learning and SARSA typically estimate the value function parameters with stochastic gradient descent based optimization, but fail to quantify the uncertainty of these estimates. In this work, under the Kalman filtering paradigm, we establish a novel and scalable sampling framework based on stochastic gradient Markov chain Monte Carlo, which allows us to efficiently generate samples from the posterior distribution of deep neural network parameters. For TD-learning with both linear and nonlinear function approximation, we prove that the proposed algorithm converges to a stationary distribution, which allows us to measure the uncertainty of the value function and its parameters.

1. INTRODUCTION

Reinforcement learning (RL) aims to learn an optimal policy for sequential decision problems in order to maximize the expected future reward. Value-based algorithms such as temporal-difference (TD) learning (Sutton, 1988), State-action-reward-state-action (SARSA) (Sutton & Barto, 2018), and Q-learning are frequently used and play a crucial role in policy improvement. TD-learning aims to estimate the value functions, including the state-value function and the action-value function, by minimizing the mean squared Bellman error, where the value functions are often approximated by a function family with unknown parameters. Hence, it is critical to evaluate the accuracy and uncertainty of the parameter estimates, which enables uncertainty quantification for the sequential decisions made along a sequence of states. In function-approximation TD algorithms such as the Deep Q-Network, the parameters are commonly optimized by stochastic gradient descent (SGD) based algorithms. The convergence of these algorithms, with both linear function approximation (Schoknecht, 2002) and nonlinear function approximation (Fan et al., 2020; Cai et al., 2019), has been extensively studied in the literature. However, SGD suffers from the local-trap issue when dealing with nonconvex function approximations such as deep neural networks (DNNs). In order to efficiently and effectively explore the landscape of a complex DNN model, Monte Carlo algorithms such as Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011; Aicher et al., 2019; Kamalaruban et al., 2020) have shown great potential in escaping from local traps. Moreover, under the Bayesian framework, Monte Carlo algorithms generate samples from the posterior distribution, which naturally describes the uncertainty of the estimates. Toward uncertainty quantification for reinforcement learning, it is important to note that the reinforcement learning problem can generally be reformulated as a state-space model.
In consequence, the value function parameters can be estimated with Kalman filtering methods such as Kalman Temporal Difference (KTD) (Geist & Pietquin, 2010) and the KOVA algorithm (Shashua & Mannor, 2020). Under the normality assumption and with linear function approximation, the Kalman filter approaches provide the correct mean and variance of the value function, which enables uncertainty quantification for the sequential decision. However, for nonlinear function approximation, the KTD and KOVA algorithms adopt the unscented Kalman filter (UKF) (Wan & Van Der Merwe, 2000) and extended Kalman filter (EKF) techniques, respectively, to approximate the covariance matrices. Both algorithms are computationally inefficient for large-scale neural networks: KTD requires $O(p^2)$ operations for the covariance update, where $p$ is the number of parameters, and in each iteration KOVA calculates a Jacobian matrix that grows linearly with the batch size. In this paper, we make two major contributions: (i) We develop a new Kalman filter-type algorithm for value-based policy evaluation based on the Langevinized Ensemble Kalman Filter (Zhang et al., 2021; Dong et al., 2022). The new algorithm is scalable with respect to the dimension of the parameter space, with a computational complexity of $O(p)$ per iteration. (ii) We prove that even when the policy is not fixed, under some regularity conditions, the proposed algorithm eventually converges to a stationary distribution.

2.1. MARKOV DECISION PROCESS FRAMEWORK

The standard RL procedure aims to learn an optimal policy from the interaction experiences between an agent and an environment, where the optimal policy maximizes the agent's expected total reward. The RL procedure can be described by a Markov decision process (MDP) represented by $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma\}$, where $\mathcal{S}$ is the set of states; $\mathcal{A}$ is a finite set of actions; $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the state transition probability from state $s$ to state $s'$ by taking action $a$, denoted by $\mathcal{P}(s'|s,a)$; $r(s,a)$ is a random reward received from taking action $a$ at state $s$; and $\gamma \in (0,1)$ is a discount factor. At each time stage $t$, the agent observes state $s_t \in \mathcal{S}$ and takes action $a_t \in \mathcal{A}$ according to policy $\rho$ with probability $P^\rho(a|s)$; then the environment returns a reward $r_t = r(s_t, a_t)$ and a new state $s_{t+1} \in \mathcal{S}$. For a given policy $\rho$, the performance is measured by the state value function ($V$-function) $V^\rho(s) = \mathbb{E}_\rho[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s]$ and the state-action value function ($Q$-function) $Q^\rho(s,a) = \mathbb{E}_\rho[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a]$. Both functions satisfy the following Bellman equations:
$$V^\rho(s) = \mathbb{E}_\rho[r(s,a) + \gamma V^\rho(s')], \qquad Q^\rho(s,a) = \mathbb{E}_\rho[r(s,a) + \gamma Q^\rho(s',a')],$$
where $s' \sim \mathcal{P}(\cdot|s,a)$, $a \sim P^\rho(\cdot|s)$, $a' \sim P^\rho(\cdot|s')$, and the expectations are taken over the transition probability $\mathcal{P}$ for the given policy $\rho$.
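For a finite MDP under a fixed policy, the Bellman equation for $V^\rho$ is a linear system that can be solved directly. The sketch below uses a hypothetical two-state chain (transition matrix and rewards are illustrative, not from the paper) and checks that the solution is a fixed point of the Bellman operator:

```python
import numpy as np

# Toy 2-state MDP under a fixed policy rho: P_rho[s, s'] is the
# policy-induced transition matrix, r_rho[s] the expected one-step reward.
gamma = 0.9
P_rho = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
r_rho = np.array([1.0, -0.5])

# Bellman equation V = r_rho + gamma * P_rho @ V  =>  V = (I - gamma*P_rho)^{-1} r_rho
V = np.linalg.solve(np.eye(2) - gamma * P_rho, r_rho)

# V is a fixed point of the Bellman operator: the residual vanishes
residual = np.max(np.abs(r_rho + gamma * P_rho @ V - V))
```

The same fixed-point property underlies the MSBE minimization used throughout the paper.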

2.2. BAYESIAN FORMULATION

In this paper, we focus on learning an optimal policy $\rho$ via estimating $Q^\rho$. Suppose that the $Q$-functions are parameterized by $Q(\cdot;\theta)$ with parameter $\theta \in \Theta \subset \mathbb{R}^p$. Let $\mu_\rho$ be the stationary distribution of the transition tuple $z = (s,a,r,s',a')$ with respect to policy $\rho$. Then $Q^\rho$ can be estimated by minimizing the mean squared Bellman error (MSBE),
$$\min_\theta \mathrm{MSBE}(\theta) = \min_\theta \mathbb{E}_{z\sim\mu_\rho}\big(Q(s,a;\theta) - r - \gamma Q(s',a';\theta)\big)^2, \quad (1)$$
where the expectation is taken over the fixed stationary distribution $\mu_\rho$. By imposing a prior density function $\pi(\theta)$ on $\theta$, we define a new objective function
$$F(\theta) = \mathbb{E}_{z\sim\mu_\rho}[F(\theta, z)] = \mathbb{E}_{z\sim\mu_\rho}\Big[\big(Q(s,a;\theta) - r - \gamma Q(s',a';\theta)\big)^2 - \frac{1}{n}\log\pi(\theta)\Big],$$
where $F(\theta, z) = (Q(s,a;\theta) - r - \gamma Q(s',a';\theta))^2 - \frac{1}{n}\log\pi(\theta)$. Since the stationary distribution $\mu_\rho$ is unknown, we consider the empirical objective function $\hat F_{\mathbf z}(\theta) = \frac{1}{n}\sum_{i=1}^n F(\theta, z_i)$ on a set of transition tuples $\mathbf z = \{z_i\}_{i=1}^n$. Instead of minimizing $\hat F_{\mathbf z}$ directly, one can simulate a sequence of $\theta$ values using the SGLD algorithm by iterating
$$\theta_t = \theta_{t-1} - \epsilon_t\, n\,\hat F_t(\theta_{t-1}) + \sqrt{2\epsilon_t\beta^{-1}}\,\omega_t,$$
where $\hat F_t(\theta_{t-1})$ is a conditionally unbiased estimator of $\nabla\hat F_{\mathbf z}(\theta_{t-1})$, $\omega_t \sim N(0, I_p)$ is a standard Gaussian random vector of dimension $p$, $\epsilon_t > 0$ is the learning rate at time $t$, and $\beta > 0$ is the constant inverse temperature. It has been proven that, under some regularity assumptions, $\theta_t$ converges weakly to the unique Gibbs measure $\pi_{\mathbf z} \propto \exp(-\beta n \hat F_{\mathbf z})$. However, in value-based RL algorithms, the policy $\rho$ is dynamically updated along with the parameter $\theta_t$; therefore, the distribution $\mu_\rho$ of the transition tuple $z$ also evolves as $\theta_t$ changes. In Section 3, we develop a new sampling algorithm, the Langevinized Kalman Temporal Difference (LKTD) algorithm, for value-based RL and establish the convergence of the proposed algorithm under the dynamic policy setting.
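To make the SGLD iteration above concrete, the sketch below runs it on a fixed batch of linear-Q transition tuples (i.e., a frozen policy) with a hypothetical Gaussian prior $N(0, \sigma_\theta^2 I)$; all data here are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma_th, beta = 0.9, 5.0, 1.0
n, p = 32, 4

# Fixed batch of transition tuples for a linear Q(s, a) = phi(s, a)^T theta
phi = rng.normal(size=(n, p))          # phi(s_i, a_i)
phi_next = rng.normal(size=(n, p))     # phi(s'_i, a'_i)
r = rng.normal(size=n)
Phi = phi - gamma * phi_next           # Bellman-error design matrix

def grad_nF(theta):
    """Gradient of n * F_hat(theta) = sum_i (Phi_i theta - r_i)^2 - log pi(theta)."""
    return 2.0 * Phi.T @ (Phi @ theta - r) + theta / sigma_th**2

# SGLD: theta_t = theta_{t-1} - eps * grad + sqrt(2 * eps / beta) * omega_t
theta, eps = np.zeros(p), 1e-3
for t in range(5000):
    theta = theta - eps * grad_nF(theta) + np.sqrt(2 * eps / beta) * rng.normal(size=p)

# Posterior mode for this Gaussian model, for reference
theta_star = np.linalg.solve(2 * Phi.T @ Phi + np.eye(p) / sigma_th**2, 2 * Phi.T @ r)
```

After burn-in, the chain hovers around `theta_star` with fluctuations reflecting the posterior covariance, which is exactly the uncertainty information a point-estimate optimizer discards.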

3. MAIN RESULTS

In this section, we first introduce the state-space model formulation and the proposed sampling algorithm under the setting of linear function approximation, and then extend the proposed sampling algorithm to the setting of nonlinear function approximation. For simplicity, a full transition tuple with reward and a reduced transition tuple without reward are denoted, respectively, by
$$z = \begin{cases} (s,a,r,s',a') \\ (s,a,r,s') \end{cases} \quad \text{and} \quad x = \begin{cases} (s,a,s',a') \\ (s,a,s') \end{cases} \quad (5)$$
for the SARSA and Q-learning algorithms, respectively, and we often write $z = (r, x)$. The observation function $h(x;\theta)$ is defined as
$$h(x;\theta) = \begin{cases} Q(s,a;\theta) - \gamma\, Q(s',a';\theta), & \text{(SARSA)} \\ Q(s,a;\theta) - \gamma \max_{b\in\mathcal{A}} Q(s',b;\theta), & \text{(Q-learning)} \end{cases} \quad (6)$$
where $\gamma$ is the discount factor.
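The two observation functions in (6) differ only in how the successor value is selected. The sketch below contrasts them on a hypothetical tabular Q (values are illustrative):

```python
import numpy as np

def h_sarsa(Q, s, a, s_next, a_next, gamma=0.99):
    """SARSA observation: Q(s, a) - gamma * Q(s', a')."""
    return Q[s, a] - gamma * Q[s_next, a_next]

def h_qlearning(Q, s, a, s_next, gamma=0.99):
    """Q-learning observation: Q(s, a) - gamma * max_b Q(s', b)."""
    return Q[s, a] - gamma * np.max(Q[s_next])

Q = np.array([[1.0, 2.0],
              [0.5, 3.0]])
# When a' happens to be the greedy action at s', the two observations coincide.
same = h_qlearning(Q, 0, 1, 1) == h_sarsa(Q, 0, 1, 1, int(np.argmax(Q[1])))
```

This is why the linear-approximation section below can treat both algorithms with the same observation-matrix machinery, changing only the row definition.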

3.1. LINEAR FUNCTION APPROXIMATION

Suppose that $\mathcal{Q} = \{Q(\cdot;\theta)\}$ is a family of linear $Q$-functions, where every $Q$-function takes the form
$$Q(s,a;\theta) = \phi(s,a)^\top\theta, \quad (7)$$
where $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^p$ is a $p$-dimensional vector-valued feature map. For example, $\phi$ can be a polynomial kernel, a Gaussian kernel, etc. Let $\mathbf z_t = (\mathbf r_t, \mathbf x_t) = \{(r_{t,i}, x_{t,i})\}_{i=1}^n$ be a batch of transition tuples of size $n$ generated at stage $t$. For convenience, we use bold symbols to represent either vectors or sets of transition tuples, depending on the context. By combining (6) and (7), we define the observation matrix as
$$\Phi(\mathbf x_t) = \begin{pmatrix} \Phi(x_{t,1}) \\ \vdots \\ \Phi(x_{t,n}) \end{pmatrix},$$
where each row vector is defined as
$$\Phi(x_{t,i}) = \begin{cases} \big(\phi(s_{t,i},a_{t,i}) - \gamma\,\phi(s'_{t,i},a'_{t,i})\big)^\top, & \text{(SARSA)} \\ \big(\phi(s_{t,i},a_{t,i}) - \gamma\,\phi(s'_{t,i},\, \operatorname{argmax}_{b\in\mathcal{A}}\phi(s'_{t,i},b)^\top\theta)\big)^\top, & \text{(Q-learning)} \end{cases}$$
Then it is easy to see that the minimization of the MSBE in (1) can be reformulated as a Bayesian linear inverse problem
$$\mathbf r_t = \Phi(\mathbf x_t)\theta + \eta_t, \quad \eta_t \sim N(0, \sigma^2 I), \quad t = 1, 2, \ldots, \quad (10)$$
where $\eta_t$ is additive Gaussian white noise with covariance matrix $\sigma^2 I$, and $\theta$ follows the prior distribution $\pi(\theta)$. The corresponding posterior distribution is given by $\pi_*(\theta) \propto e^{-F(\theta)}$. To develop an efficient algorithm for simulating samples from the target distribution $\pi_*^\beta(\theta) \propto e^{-\beta F(\theta)}$, where $\beta$ denotes the inverse temperature, we further reformulate the Bayesian linear inverse model (10) as a state-space model through Langevin diffusion, following Zhang et al. (2021) and Dong et al. (2022):
$$\theta_t = \theta_{t-1} + \frac{\epsilon_t}{2}\nabla\log\pi(\theta_{t-1}) + w_t, \qquad \mathbf r_t = \Phi(\mathbf x_t)\theta_t + \eta_t, \quad (11)$$
where $w_t \sim N(0, \epsilon_t I_p) = N(0, \Omega_t)$, i.e., $\Omega_t = \epsilon_t I_p$, and $\eta_t \sim N(0, \sigma^2 I)$. In the state-space model (11), the state $\theta_t$ evolves in a diffusion process that converges to the prior distribution $\pi(\theta)$. The new formulation not only allows us to solve the Bayesian inverse problem by subsampling $(\mathbf r_t, \mathbf x_t)$ from a given dataset at each stage $t$ (see Theorem S1 of Zhang et al. (2021)), but also allows us to model a dynamic system where the policy $\rho_{\theta_{t-1}}$ changes along with the stage $t$. In order to establish the convergence theory of the entire RL algorithm, we impose two fundamental assumptions on the data generating process and the normality structure.

Assumption 1 (Data generating process) For each $t$, we are able to generate a tuple $z_t \sim \mu_{\theta_{t-1}}$ according to policy $\rho_{\theta_{t-1}}$. Moreover, the stationary distribution $\mu_{\theta_{t-1}}$ has density function $\pi(z_t|\theta_{t-1})$, which is differentiable with respect to $\theta$.

Assumption 2 (Normality structure) For each $t$, let $\mathbf z_t = \{z_{t,i}\}_{i=1}^n$ be a set of full transition tuples sampled from $\mu_{\theta_{t-1}}$. The conditional distribution $\pi(\mathbf r_t|\mathbf x_t, \theta_t)$ is Gaussian:
$$\mathbf r_t \mid \mathbf x_t, \theta_t \sim N(\Phi(\mathbf x_t)\theta_t, \sigma^2 I). \quad (12)$$
That is, the $r_{t,i} \mid x_{t,i}, \theta_t$ are independent Gaussian random variables.

Figure 1 depicts the RL updating scheme and data generating process. At each stage $t$, the agent interacts with the environment according to the policy $\rho_{\theta_{t-1}}$ and generates a batch of transition tuples $\mathbf z_t = (\mathbf r_t, \mathbf x_t)$ from the stationary distribution $\mu_{\theta_{t-1}}$, which refers to Assumption 1. With the artificial normality structure in Assumption 2 and the state-space model (11), we combine the RL setting with the forecast-analysis procedure proposed in Zhang et al. (2021) and introduce Algorithm 1. In Theorem 3.1, we prove that our algorithm is equivalent to an accelerated preconditioned SGLD algorithm (Li et al., 2016). Then, with some adjustment of Theorem 10 in Raginsky et al. (2017) and the general recipe of stochastic gradient MCMC (Ma et al., 2015), the chain $\{\theta_t^a\}_{t=1}^T$ generated by Algorithm 1 converges to a stationary distribution $p_*(\theta) \propto \exp(-\beta G(\theta))$ as defined in Lemma A.1 in the Appendix. When $\rho$ is fixed for policy evaluation, Algorithm 1 converges to the target distribution $\pi_*^\beta(\theta)$.

Algorithm 1 (Langevinized Kalman temporal difference learning for linear approximation)

0. (Initialization) Start with an initial $Q$-function parameter $\theta_0^a \in \mathbb{R}^p$, drawn from the prior distribution $\pi(\theta)$. For each stage $t = 1, 2, \ldots, T$, do steps 1-3:
1. (Sampling) With policy $\rho_{\theta_{t-1}^a}$, generate a set of $n$ transition tuples from the stationary distribution $\mu_{\theta_{t-1}^a}$, denoted by $\mathbf z_t = (\mathbf r_t, \mathbf x_t) = \{z_{t,j}\}_{j=1}^n$, where $z_{t,j}$ has the form of (5). Let $\Phi_t = \Phi(\mathbf x_t)$.
• Set $\Omega_t = \epsilon_t I_p$, $R_t = 2\sigma^2 I_n$, and the Kalman gain matrix $K_t = \Omega_t\Phi_t^\top(\Phi_t\Omega_t\Phi_t^\top + R_t)^{-1}$.
2. (Forecast) Draw $w_t \sim N_p(0, \Omega_t)$ and calculate
$$\theta_t^f = \theta_{t-1}^a + \frac{\epsilon_t}{2}\nabla\log\pi(\theta_{t-1}^a) + w_t. \quad (13)$$
3. (Analysis) Draw $v_t \sim N_n(0, R_t)$ and calculate
$$\theta_t^a = \theta_t^f + K_t(\mathbf r_t - \Phi_t\theta_t^f - v_t) = \theta_t^f + K_t(\mathbf r_t - \mathbf r_t^f). \quad (14)$$

Theorem 3.1 Algorithm 1 can be reduced to a preconditioned SGLD algorithm:
$$\theta_t^a = \theta_{t-1}^a + \frac{\epsilon_t}{2}\Sigma_t\sum_{i=1}^n \nabla\log\pi(\theta_{t-1}^a|z_{t,i}) + e_t,$$
where $\Sigma_t = (I - K_t\Phi(\mathbf x_t))$ is a constant matrix given $\mathbf x_t$, $e_t \sim N(0, \epsilon_t\Sigma_t)$, and $\nabla\log\pi(\theta_{t-1}^a|z_{t,i}) = \frac{1}{\sigma^2}\Phi(x_{t,i})^\top\big(r_{t,i} - \Phi(x_{t,i})\theta_{t-1}^a\big) + \frac{1}{n}\nabla\log\pi(\theta_{t-1}^a)$.

Theorem 3.2 Consider the preconditioned SGLD algorithm
$$\theta_t = \theta_{t-1} - \epsilon_T\Sigma_t G(\theta_{t-1}, \mathbf z_t) + \sqrt{2\epsilon_T\beta^{-1}}\,e_t, \quad t = 1, 2, \ldots, T, \quad (16)$$
where $e_t \sim N(0, \Sigma_t)$, $\beta$ is the inverse temperature, $T$ is the total number of iterations, and $\epsilon_T$ is a constant learning rate depending on $T$. For this algorithm, we assume the conditions of Lemma A.1 (of the Appendix) hold. Further, if we choose the learning rate $\epsilon_T$ such that $T\epsilon_T \to \infty$, $T\epsilon_T^{5/4} \to 0$ and $T\epsilon_T\delta^{1/4} \to 0$, then $W_2(p_T, p_*) \to 0$ as $T \to \infty$, where $p_T$ and $p_*$ are as defined in Lemma A.1 of the Appendix.

Remark 1 For the LKTD algorithm, we have $G(\theta_{t-1}, \mathbf z_t) = -\frac{1}{\sigma^2}\sum_{i=1}^n \Phi(x_{t,i})^\top\big(r_{t,i} - \Phi(x_{t,i})\theta_{t-1}\big) + \frac{1}{\sigma_\theta^2}\theta_{t-1}$. In addition, we set $\epsilon_T = t_0/T^\alpha$ for some constant $t_0 > 0$ and $\alpha \in (4/5, 1)$, such that $T\epsilon_T\delta^{1/4} \to 0$. The conditions required by Theorem 3.1 and Theorem 3.2 are verified by Lemma 3.1 given below.
Lemma 3.1 Let $G(\theta, \mathbf z) = -\frac{1}{\sigma^2}\sum_{i=1}^n \Phi(x_i)^\top\big(r_i - \Phi(x_i)\theta\big) + \frac{1}{\sigma_\theta^2}\theta$. Assume (i) $\mathcal{Z}$ is compact and $\bar r = \sup\{|r| : r \in \mathcal{R}\} < \infty$; (ii) $\|\phi(s,a)\| \le 1$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$; and (iii) $\theta_0 \sim N(0, \sigma_0^2 I_p)$ with $\sigma_0^2 < \frac{1}{2}$. Then conditions (A1)-(A6) are satisfied.

The LKTD algorithm is very flexible. It is not necessary to use all $n$ samples (collected at each stage $t$) at each iteration. Instead, a subsample can be used and multiple iterations can be performed, integrating the available data information in the manner of SGLD. Moreover, as explained in Zhang et al. (2021), the forecast-analysis procedure makes the algorithm scalable with respect to the dimension of $\theta_t$, while enjoying the computational acceleration brought by the preconditioner. In summary, the LKTD algorithm is scalable with respect to both the data sample size and the dimension of the parameter space.
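One stage of Algorithm 1 (Kalman gain, forecast, analysis) can be sketched in a few lines of numpy. This is an illustrative sketch assuming a Gaussian prior $N(0, \sigma_\theta^2 I_p)$ and synthetic linear data; the names and dimensions are not from the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 8, 16
sigma, sigma_th, eps = 1.0, 5.0, 1e-3

theta_a = rng.normal(size=p)          # current analysis state theta^a_{t-1}
Phi_t = rng.normal(size=(n, p))       # observation matrix Phi(x_t)
r_t = Phi_t @ rng.normal(size=p) + sigma * rng.normal(size=n)

# Step 1 (sampling): Kalman gain with Omega_t = eps*I_p and R_t = 2*sigma^2*I_n
Omega = eps * np.eye(p)
R = 2 * sigma**2 * np.eye(n)
K = Omega @ Phi_t.T @ np.linalg.inv(Phi_t @ Omega @ Phi_t.T + R)

# Step 2 (forecast): Langevin step toward the prior pi(theta) = N(0, sigma_th^2 I)
grad_log_prior = -theta_a / sigma_th**2
theta_f = theta_a + 0.5 * eps * grad_log_prior + np.sqrt(eps) * rng.normal(size=p)

# Step 3 (analysis): Kalman update with a perturbed observation
v = rng.multivariate_normal(np.zeros(n), R)
theta_a_new = theta_f + K @ (r_t - Phi_t @ theta_f - v)
```

Note that the matrix being inverted is $n \times n$ and $K$ is applied as a matrix-vector product, which is the source of the $O(p)$ per-iteration scaling in the parameter dimension claimed in the introduction.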

3.2. NONLINEAR FUNCTION APPROXIMATION

In this section, we further extend our algorithm to the setting of nonlinear function approximation. For each stage $t$, we consider the nonlinear inverse problem
$$\mathbf r_t = h(\mathbf x_t; \theta) + \eta_t, \quad \eta_t \sim N(0, \sigma^2 I),$$
where $h(x; \cdot): \Theta \to \mathbb{R}$ is a nonlinear, differentiable observation function of $\theta$. With a state augmentation approach similar to that of the LEnKF algorithm, we define the augmented state vector by
$$\varphi_t = \begin{pmatrix} \theta_t \\ \xi_t \end{pmatrix}, \quad \xi_t = h(\mathbf x_t; \theta_t) + u_t, \quad u_t \sim N(0, \alpha\sigma^2 I), \quad (17)$$
where $\xi_t$ is an $n$-dimensional vector and $0 < \alpha < 1$ is a pre-specified constant. Suppose that $\theta_t$ has the prior distribution $\pi(\theta)$ defined in the previous section; the joint density function of $\varphi_t = (\theta_t^\top, \xi_t^\top)^\top$ can then be written as $\pi(\varphi_t) = \pi(\theta_t)\pi(\xi_t|\theta_t)$, where $\xi_t|\theta_t \sim N(h(\mathbf x_t;\theta_t), \alpha\sigma^2 I)$. Based on Langevin dynamics, we can reformulate (17) as the following dynamic system:
$$\varphi_t = \varphi_{t-1} + \frac{\epsilon_t}{2}\nabla_\varphi\log\pi(\varphi_{t-1}) + w_t, \qquad \mathbf r_t = H_t\varphi_t + v_t, \quad (18)$$
where $w_t \sim N(0, \Omega_t)$ with $\Omega_t = \epsilon_t I_{\tilde p}$ and $\tilde p$ the dimension of $\varphi_t$; $H_t = (0, I)$ such that $H_t\varphi_t = \xi_t$; and $v_t \sim N(0, (1-\alpha)\sigma^2 I)$, which is independent of $w_t$ for all $t$. With the formulation in (18), we have transformed a nonlinear inverse problem into a linear state-space model, and thus the previous theoretical results still hold for the nonlinear inverse problem. The target distribution $p_*(\theta)$ can be easily obtained from $p_*(\varphi)$ by marginalization.

Algorithm 2 (Langevinized Kalman temporal difference for nonlinear approximation)

0. (Initialization) Start with an initial $Q$-function parameter $\theta_0^a \in \mathbb{R}^p$, drawn from the prior distribution $\pi(\theta)$. For each stage $t = 1, 2, \ldots, T$, do the following steps 1-3:
1. (Sampling) With policy $\rho_{\theta_{t-1}^a}$ defined in (40), generate a set of $n$ transition tuples from the stationary distribution $\mu_{\theta_{t-1}^a}$, denoted by $\mathbf z_t = (\mathbf r_t, \mathbf x_t) = \{z_{t,j}\}_{j=1}^n$, where $z_{t,j}$ has the form of (5). Let $H_t = (0, I)$.
• For each iteration $k = 1, 2, \ldots, K$, set $Q_{t,k} = \epsilon_{t,k} I$, $R_t = 2(1-\alpha)\sigma^2 I$, and the Kalman gain matrix $K_{t,k} = Q_{t,k}H_t^\top(H_tQ_{t,k}H_t^\top + R_t)^{-1}$, and do steps 2-3.
2. (Forecast) Draw $w_{t,k} \sim N(0, Q_{t,k})$ and calculate
$$\varphi_{t,k}^f = \varphi_{t,k-1}^a + \frac{\epsilon_{t,k}}{2}\nabla\log\pi(\varphi_{t,k-1}^a) + w_{t,k},$$
where, if $k = 1$, we set $\varphi_{t,0}^a = (\theta_{t-1,K}^{a\;\top}, \mathbf r_t^\top)^\top$. More precisely, the gradients of the two components can be written as
$$\nabla\log\pi(\varphi_{t,k-1}^a) = \begin{pmatrix} \nabla_\theta\log\pi(\theta_{t,k-1}) + \frac{1}{\alpha\sigma^2}\nabla_\theta h(\mathbf x_t;\theta_{t,k-1})\big(\xi_{t,k-1} - h(\mathbf x_t;\theta_{t,k-1})\big) \\ -\frac{1}{\alpha\sigma^2}\big(\xi_{t,k-1} - h(\mathbf x_t;\theta_{t,k-1})\big) \end{pmatrix}. \quad (20)$$
3. (Analysis) Draw $v_{t,k} \sim N_n(0, R_t)$ and calculate
$$\varphi_{t,k}^a = \varphi_{t,k}^f + K_{t,k}(\mathbf r_t - H_t\varphi_{t,k}^f - v_{t,k}) = \varphi_{t,k}^f + K_{t,k}(\mathbf r_t - \mathbf r_{t,k}^f). \quad (21)$$
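The inner loop of Algorithm 2 can be sketched as follows. The observation function `h` here is a hypothetical smooth map (a tanh of a fixed linear layer), chosen only so that its Jacobian is easy to write down; the constants and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 4
alpha, sigma, sigma_th = 0.5, 1.0, 5.0

W = rng.normal(size=(n, p))            # fixed features defining a toy h

def h(theta):
    """Hypothetical nonlinear observation function h(x_t; theta)."""
    return np.tanh(W @ theta)

def jac_h(theta):
    """Jacobian of h with respect to theta (n x p)."""
    return (1.0 - np.tanh(W @ theta)**2)[:, None] * W

r_t = rng.normal(size=n)
theta = rng.normal(size=p)
xi = r_t.copy()                        # phi^a_{t,0} = (theta^a_{t-1}, r_t)

H = np.hstack([np.zeros((n, p)), np.eye(n)])   # H_t = (0, I)
R = 2 * (1 - alpha) * sigma**2 * np.eye(n)

for k in range(20):
    eps = 1e-3
    Q = eps * np.eye(p + n)
    K = Q @ H.T @ np.linalg.inv(H @ Q @ H.T + R)
    # Forecast: Langevin step on pi(phi) = pi(theta) * N(xi; h(theta), alpha*sigma^2*I)
    resid = xi - h(theta)
    g_theta = -theta / sigma_th**2 + jac_h(theta).T @ resid / (alpha * sigma**2)
    g_xi = -resid / (alpha * sigma**2)
    phi_f = (np.concatenate([theta, xi])
             + 0.5 * eps * np.concatenate([g_theta, g_xi])
             + np.sqrt(eps) * rng.normal(size=p + n))
    # Analysis: Kalman update with a perturbed observation
    v = rng.multivariate_normal(np.zeros(n), R)
    phi = phi_f + K @ (r_t - H @ phi_f - v)
    theta, xi = phi[:p], phi[p:]
```

Because only the augmented block $\xi_t$ is observed, the Kalman update stays linear even though `h` is nonlinear, which is precisely the point of the state augmentation.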

4. EXPERIMENTS

In this section, we compare LKTD with the Adam algorithm (Kingma & Ba, 2014). With a simple indoor escape environment, we show the ability of LKTD in uncertainty quantification and policy exploration. Further, with more complicated environments from the OpenAI gym, we show that LKTD is able to learn better and more stable policies in both training and testing.

4.1. INDOOR ESCAPE ENVIRONMENT

Consider a simple indoor escape environment, as shown in Figure 2. The environment is the square $[0,1]\times[0,1]$, and the goal is to reach the top-right corner region of size $0.1\times0.1$ as fast as possible. At the beginning (time 0), the agent is placed in the square uniformly at random. At each time $t$, the agent observes its location coordinate $s = (x, y)$ as the current state, then chooses an action $a \in \{N, S, E, W\}$ according to policy $\rho$, moving with a step size drawn from $\mathrm{Unif}([0, 0.3])$. The reward at each time $t$ is $-1$ until the agent reaches the goal. The indoor escape environment is an example where the optimal policy is not unique: the $Q$-values of $N$ and $E$ have no difference in most states, except near the top and right borders. Hence, the ability to explore various optimal policies is critical for learning a stable and robust policy. Through this experiment, we show the ability of the LKTD algorithm to learn a mixture of optimal policies in a single run.

We compare LKTD with the widely used Adam algorithm on training a deep neural network with three hidden layers of sizes (16, 16, 16). Agents update the network parameters every 50 interactions, with 5 gradient steps per update, for a total of 10000 episodes. For action selection, the $\epsilon$-Boltzmann exploration defined in B.1 is used with exploration rate $\epsilon = 0.1$ and inverse action temperature $\beta_{\mathrm{act}} = 5$. The batch size is 250. The last 1000 parameter updates are collected as a parameter ensemble, which induces a $Q$-value ensemble and a policy ensemble. With the $Q$-value ensemble, we are able to draw the density plot of the $Q$-values at each point of the square, as illustrated by Figure 3. To quantify the uncertainty of the policy, we define the mean policy probability by
$$p_\varrho(a|s) = \frac{1}{|\varrho|}\sum_{\rho\in\varrho} 1_a(\rho(s)), \quad (22)$$
where $\varrho$ is the policy ensemble induced from the parameter ensemble. Intuitively, the mean policy probability is the proportion of the policy ensemble that takes a given action at a given state. We further define the mean optimal policy probability of an environment by taking the expectation over all optimal policies. The mean optimal policy probability of the indoor escape environment is shown in Figure 4a. In Figure 4, we divide the state space into 100 grid cells of size $0.1\times0.1$, then compute the mean action probabilities at each grid center for each action.
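The mean policy probability defined above is simply the fraction of greedy-action votes across the ensemble. The sketch below computes it at a single state from a hypothetical Q-value ensemble (random values standing in for the sampled parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions, ensemble_size = 4, 1000     # actions indexed 0..3, e.g. (N, S, E, W)

# Hypothetical Q-value ensemble at one state s: one row per ensemble member
Q_ensemble = rng.normal(size=(ensemble_size, n_actions))

def mean_policy_probability(Q_ens):
    """p(a|s) = fraction of ensemble members whose greedy policy picks a at s."""
    greedy = np.argmax(Q_ens, axis=1)
    return np.bincount(greedy, minlength=Q_ens.shape[1]) / Q_ens.shape[0]

p = mean_policy_probability(Q_ensemble)
```

Applying this at every grid center yields the heatmaps compared against the mean optimal policy probability in the experiments.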
For a further comparison of the two algorithms, we calculate two metrics in Table 1: (1) the MSE between the mean action probability and the mean optimal action probability, denoted by MSE(p), where the MSE is taken over all grid cells; and (2) the sub-optimal action rate (SOAR), defined as the probability of choosing an action that is sub-optimal for a state. Table 1 shows that, in terms of MSE(p), LKTD and Adam with a large learning rate are more efficient in sample-space exploration; however, in terms of SOAR, LKTD is much smaller than Adam for actions $\{N, E\}$, where the high SOAR comes from the top and right boundaries, as in Figure 4c. In other words, LKTD can efficiently explore the optimal policies while retaining its accuracy. In Figure 3b, the mean policy probability shows that LKTD chooses the correct policy on the boundary grid cells.

In this section, we consider four classic control problems in the OpenAI gym (Brockman et al., 2016): CartPole-v1, MountainCar-v0, LunarLander-v2 and Acrobot-v1. We compare LKTD with Adam under the framework and parameter settings of RL Baselines3 Zoo (Raffin, 2020). Each experiment is replicated 500 times, and the training progress is recorded in Figure 5. At each time step, the best and the worst 1% of the rewards are treated as outliers and thus ignored in the plots. LKTD can also be applied to the DQN algorithm by modifying the state-space model in equation (11) as
$$\theta_t = \theta_{t-1} + \frac{\epsilon_t}{2}\nabla\log\pi(\theta_{t-1}) + w_t, \qquad \mathbf y_t = \phi(\mathbf s_t, \mathbf a_t)^\top\theta_t + \eta_t,$$
where $\phi(\mathbf s_t, \mathbf a_t) = [\phi(s_{t,1}, a_{t,1}), \ldots, \phi(s_{t,n}, a_{t,n})]$ and $\mathbf y_t = \mathbf r_t + \gamma\,\phi(\mathbf s'_t, \mathbf a'_t)^\top\theta_{t-1}$. The new gradient can be written as
$$G(\theta_{t-1}, \mathbf z_t) = -\frac{1}{\sigma^2}\sum_{i=1}^n \phi(s_{t,i}, a_{t,i})\big(y_{t,i} - \phi(s_{t,i}, a_{t,i})^\top\theta_{t-1}\big) + \frac{1}{\sigma_\theta^2}\theta_{t-1},$$
where the first term corresponds to the semi-gradient in the DQN algorithm. With suitable constraints on the semi-gradient, we can modify Lemma 3.1 to guarantee the convergence.
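The key point of the DQN-style modification above is that the target is built from the previous parameter and treated as a constant when differentiating. A minimal sketch with linear features and synthetic data (all names and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, gamma, sigma, sigma_th = 16, 5, 0.9, 1.0, 5.0

phi = rng.normal(size=(n, p))          # phi(s_i, a_i)
phi_next = rng.normal(size=(n, p))     # phi(s'_i, a'_i)
r = rng.normal(size=n)
theta = rng.normal(size=p)

# Semi-gradient: the target y is computed with theta_{t-1} and held constant
y = r + gamma * phi_next @ theta
G = -(phi.T @ (y - phi @ theta)) / sigma**2 + theta / sigma_th**2
```

Replacing `G` in the preconditioned SGLD update by this semi-gradient is the only change needed to run the DQN variant.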
In the four classic control problems, LKTD shows its strength in efficient exploration and robustness without adopting common RL tricks such as gradient clipping and a target network; the updating period of the target network is set to 1 for LKTD. The detailed hyperparameter settings are given in Section B.4. In Figure 5, the solid and dashed lines represent the median and mean rewards, respectively. For each algorithm, the colored area covers 98% of the reward curves. We consider three types of reward measurements: the training reward, the evaluation reward and the best evaluation reward. The training reward records the cumulative reward during training, which includes the $\epsilon$-exploration errors. The evaluation reward is the mean reward over 10 testing trials at each time $t$. The best evaluation reward records the best evaluation reward obtained up to time $t$. In CartPole-v1, LKTD outperforms Adam in all three measurements, especially on the training and best evaluation rewards. During training, LKTD receives significantly higher rewards than Adam. In optimal policy exploration, almost 99% of the time LKTD achieves the optimal policy faster than the median of Adam. In MountainCar-v0, the mean and median reward curves of LKTD and Adam are similar. However, LKTD is more robust during training and more efficient in the exploration of good policies. From the best evaluation reward plot, we can observe that the 1% reward lower bound is close to -200 for Adam, which indicates that the agent fails to find any good policy during training. In order to learn the optimal policies in LunarLander-v2, the agent has to learn a correct way of landing instead of staying in the air. Due to the sampling nature of LKTD, the exploration rate $\epsilon$ is increased from 0.12 to 0.25 for the agent to collect enough landing experiences. Hence, LKTD converges slightly slower than Adam.
However, even with a large exploration rate, LKTD is still able to obtain stable training rewards close to those of Adam, with a much higher lower bound. Moreover, with a longer training period, LKTD will eventually perform better in evaluation. In Acrobot-v1, the training reward of LKTD converges more slowly in some experiments, but in most cases the performance of LKTD dominates that of Adam. According to the experiments, LKTD has a more robust training process and finds the optimal policies faster than Adam. The experiments also indicate that Adam uses rare experiences more efficiently, whereas LKTD needs to trade training performance for the exploration of rare experiences.

5. CONCLUSION

This paper proposes LKTD as a new sampling framework for deep RL problems via a state-space model reformulation. LKTD is equivalent to an accelerated preconditioned SGLD algorithm, but with a self-dependent data generating process. For both linear and nonlinear function approximations, LKTD is guaranteed to converge to a stationary distribution $p_*(\theta)$ under mild conditions. Our numerical experiments indicate that LKTD is comparable with the Adam algorithm in optimal policy search, while outperforming Adam in robustness and optimal policy exploration. This implies a great potential of LKTD in uncertainty quantification.

(A3) There exists some constant $M > 0$ such that, for any $z \in \mathcal{Z}$, $\|G(\theta, z) - G(\vartheta, z)\| \le M\|\theta - \vartheta\|$ for all $\theta, \vartheta \in \Theta$.

(A4) For each $z \in \mathcal{Z}$, the function $G(\cdot, z)$ is $(m, b)$-dissipative: for some $m > 0$ and $b \ge 0$, $\langle\theta, G(\theta, z)\rangle \ge m\|\theta\|^2 - b$ for all $\theta \in \Theta$.

(A5) There exist a constant $\delta \in [0, 1)$ and some constants $M$ and $B$ such that $\mathbb{E}\|G(\theta, z) - g(\theta)\|^2 \le 2\delta(M^2\|\theta\|^2 + B^2)$ for all $\theta \in \Theta$.

(A6) The probability law $\mu_0$ of the initial hypothesis $\theta_0$ has a bounded and strictly positive density $p_0$ with respect to the Lebesgue measure on $\Theta$, and $\kappa_0 := \log\int_\Theta e^{\|\theta\|^2} p_0(\theta)\,d\theta < \infty$.

Lemma A.1 (Proposition 10 of Raginsky et al. (2017)) Consider the SGLD algorithm with a constant learning rate $\epsilon$,
$$\theta_t = \theta_{t-1} - \epsilon G(\theta_{t-1}, z_t) + \sqrt{2\epsilon\beta^{-1}}\,e_t, \quad (29)$$
where $e_t \sim N(0, I_d)$, $d$ is the dimension of $\theta$, and $\beta$ is the inverse temperature. Assume the conditions (A1)-(A6) hold.
If $\mathbb{E}G(\theta_{t-1}, z_t) = g(\theta_{t-1})$ holds for every step $t \in \mathbb{N}$, $\beta \ge 1 \vee \frac{2}{m}$, and $0 < \epsilon < 1 \wedge \frac{m}{4M^2}$, then
$$W_2(p_t, p_*) \le \big(\bar C_0\,\delta^{1/4} + \bar C_1\,\epsilon^{1/4}\big)\,t\epsilon + \bar C_2\, e^{-t\epsilon/(\beta C_{LS})}, \quad (30)$$
where $p_t(\theta)$ denotes the density function of $\theta_t$; $p_*(\theta) \propto \exp(-\beta G(\theta))$, with $G(\theta)$ the antiderivative of $g(\theta)$, i.e., $\nabla_\theta G(\theta) = g(\theta)$; $C_{LS}$ denotes a logarithmic Sobolev constant satisfied by $p_*$; and the constants $\bar C_0$, $\bar C_1$ and $\bar C_2$ are given by
$$C_0 = M^2\Big(\kappa_0 + 2\big(1 \vee \tfrac{1}{m}\big)\big(b + 2B^2 + \tfrac{d}{\beta}\big)\Big) + B^2, \qquad C_1 = 6M^2(\beta C_0 + d),$$
$$\bar C_0 = \Big(12 + 8\big(\kappa_0 + 2b + \tfrac{2d}{\beta}\big)\Big)(\beta C_0 + \beta C_1), \qquad \bar C_1 = \Big(12 + 8\big(\kappa_0 + 2b + \tfrac{2d}{\beta}\big)\Big)(C_0 + C_1),$$
$$\bar C_2 = 2C_{LS}\Big(\log\|p_0\|_\infty + \frac{d}{2}\log\frac{3\pi}{m\beta} + \beta\Big(\frac{M\kappa_0}{3} + B\sqrt{\kappa_0} + A + \frac{b}{2}\log 3\Big)\Big).$$

PROOF: The proof of Lemma A.1 follows from Proposition 10 of Raginsky et al. (2017). □

A.3 PROOF OF THEOREM 3.2

PROOF: By Theorem 1 of Ma et al. (2015), algorithm (16) (of the main text) works as a preconditioned SGLD algorithm with the preconditioner $\Sigma_t$, and it has the same stationary distribution as algorithm (29). By (30), we have $W_2(p_T, p_*) \to 0$ for algorithm (29) under the given settings of $\epsilon$ and $T$. Therefore, for algorithm (16), we also have $W_2(p_T, p_*) \to 0$ as $T \to \infty$, noting that $\Sigma_t$ is positive definite for every $t$. □

A.4 PROOF OF LEMMA 3.1

PROOF: Since all samples are assumed i.i.d., it suffices to prove the lemma for the case $n = 1$. In this case, $G(\theta, z) = -\frac{1}{\sigma^2}\Phi(x)^\top(r - \Phi(x)\theta) + \frac{1}{\sigma_\theta^2}\theta$. Let $g(\theta) = \mathbb{E}_{z\sim\mu_\theta}[G(\theta, z)] = \int_{\mathcal{Z}} G(\theta, z)\pi(z|\theta)\,dz$ be the expected gradient with respect to the stationary distribution $\mu_\theta$. In particular, one can take $M^2 = \big(\frac{2}{\sigma^2}(1+\gamma)\big)^2 + 1$ and $B^2 = M^2\big(\frac{2}{\sigma^2}(1+\gamma)\bar r\big)^2$. Since the bound is uniform for all $z \in \mathcal{Z}$, we can derive the desired bound for the expectation.

(A6) By assumption (iii),
$$\kappa_0 = \log\int_\Theta e^{\|\theta\|^2} e^{-\frac{1}{2\sigma_0^2}\|\theta\|^2}\,d\theta - \frac{p}{2}\log(2\pi\sigma_0^2) < \log\int_\Theta e^{(1 - \frac{1}{2\sigma_0^2})\|\theta\|^2}\,d\theta < \infty,$$
for any $\sigma_0^2 < \frac{1}{2}$. □

B.1 SOFTMAX PROBABILISTIC POLICY

For the indoor escape environment, we adopt Boltzmann exploration, which selects action $a$ with probability
$$P_{\rho_\theta}(a|s) = \frac{\exp\{\beta_{\mathrm{act}} Q(s,a;\theta)\}}{\sum_{a'\in\mathcal{A}}\exp\{\beta_{\mathrm{act}} Q(s,a';\theta)\}},$$
where $\beta_{\mathrm{act}}$ is the action inverse temperature. When $\beta_{\mathrm{act}}$ is small, the agent tends to explore random actions; in contrast, when $\beta_{\mathrm{act}}$ is large, the agent acts greedily. Greedy Q-learning can be viewed as a special case of Boltzmann exploration, since $\rho_\theta(s) = \operatorname{argmax}_a Q(s,a;\theta)$ with probability 1 as $\beta_{\mathrm{act}} \to \infty$. Moreover, the action probability $P(\rho_\theta(s) = a)$ is differentiable with respect to $\theta$. In this paper, we assume all RL algorithms follow $\epsilon$-Boltzmann exploration, with the action probability given by
$$P_{\rho_\theta^\epsilon}(a|s) = \frac{\epsilon}{|\mathcal{A}|} + (1-\epsilon)\,\frac{\exp\{\beta_{\mathrm{act}} Q(s,a;\theta)\}}{\sum_{a'\in\mathcal{A}}\exp\{\beta_{\mathrm{act}} Q(s,a';\theta)\}}, \quad (40)$$
where $\epsilon$ is the random exploration rate. Note that as $\beta_{\mathrm{act}} \to \infty$, $\epsilon$-Boltzmann exploration converges to $\epsilon$-greedy exploration.
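The $\epsilon$-Boltzmann probabilities above can be sketched as follows, assuming the uniform-mixture form $\epsilon/|\mathcal{A}| + (1-\epsilon)\cdot\text{softmax}$; a numerically stabilized softmax is used:

```python
import numpy as np

def eps_boltzmann(q_values, beta_act=5.0, eps=0.1):
    """epsilon-Boltzmann action probabilities: uniform exploration with
    probability eps, Boltzmann (softmax with inverse temperature beta_act)
    with probability 1 - eps."""
    z = beta_act * (q_values - np.max(q_values))   # stabilized softmax
    soft = np.exp(z) / np.sum(np.exp(z))
    return eps / len(q_values) + (1.0 - eps) * soft

probs = eps_boltzmann(np.array([1.0, 2.0, 0.5, 1.5]))
```

As `beta_act` grows, the softmax term collapses onto the greedy action, recovering $\epsilon$-greedy exploration.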

B.2 STATE VALUE VISUALIZATION FOR THE INDOOR ESCAPING EXAMPLE

In this section, we demonstrate the state-value approximation for both linear and nonlinear function approximations. Following Melo & Ribeiro (2007), we define the $Q$-function and the feature map as
$$Q(s,a) = \phi(s,a)^\top\theta = \sum_{a'\in\mathcal{A}}\varphi(s)^\top\theta_{a'}\,1_{a'}(a) \quad \text{and} \quad \varphi(s) = \big(\sigma_1(x,y), \sigma_2(x,y), \ldots, \sigma_4(x,y)\big)^\top, \quad (41)$$
where the parameter vector $\theta = (\theta_N^\top, \theta_S^\top, \theta_E^\top, \theta_W^\top)^\top$ is partitioned into four subspaces, $1_{a'}(\cdot)$ is the indicator function, and $\sigma_i: \mathcal{S} \to \mathbb{R}$ is a basis function, which can be a linear basis function, a Gaussian kernel, etc. For each set of basis functions, the agent is allowed to update its parameters every 60 steps, with 20 transition tuples as training data, for 30,000 episodes. In every experiment, we collect the last 1,000 updates as samples from the stationary distribution. In Figure 6, we compare different function approximations for the SARSA-type LKTD. The optimal policy used to simulate the true state value is defined as
$$\rho^*(x,y) = \begin{cases} N, & \text{if } x \ge y, \\ E, & \text{if } x < y, \end{cases} \quad (42)$$
where the discount factor $\gamma = 0.9$. In Figures 6b, 6c and 6d, the state-value surfaces are estimated using the average state values of the last 1000 updates with respect to linear-basis approximation, kernel-based approximation and deep neural network approximation, respectively. We show that the LKTD algorithm successfully estimates the state values for all three function approximations.
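The per-action parameter partition in (41) can be sketched as follows, using Gaussian-kernel basis functions with hypothetical centers and width (an illustrative choice, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(5)
actions = ["N", "S", "E", "W"]
k = 4                                  # number of basis functions sigma_i

def features(s, centers, width=0.25):
    """Gaussian-kernel basis functions sigma_i(x, y) (illustrative choice)."""
    d2 = np.sum((centers - np.asarray(s))**2, axis=1)
    return np.exp(-d2 / (2 * width**2))

centers = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
theta = {a: rng.normal(size=k) for a in actions}   # theta partitioned by action

def Q(s, a):
    """Q(s, a) = varphi(s)^T theta_a, as in (41)."""
    return features(s, centers) @ theta[a]

q = Q((0.5, 0.5), "N")
```

The indicator in (41) simply selects which block $\theta_{a}$ of the partitioned parameter vector is active, which is what the dictionary lookup implements here.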

B.3 MORE NUMERICAL RESULTS FOR THE INDOOR ESCAPING EXAMPLE

This section supplements Section 4.1 of the main text with more numerical results comparing the proposed LKTD algorithm and the popular Adam algorithm (Kingma & Ba, 2014). In each run of LKTD, we set the batch size to 250, fix the inverse temperature $\beta = 1$, and update the network parameters every 50 steps. For $\sigma$ and $\sigma_\theta$ in Remark 1, we set $\sigma^2 = 1$ and $\sigma_\theta^2 = 25$, where $\sigma^2$ is estimated by the mean squared TD error according to Assumption 2, and $\sigma_\theta^2$ is chosen suitably for convergence and negligible prior effects. Under the setting $\beta = 1$, LKTD converges to the stationary distribution $p_*(\theta) \propto \exp(-G(\theta))$. The parameters obtained in the last 1000 parameter updates are used as a parameter ensemble for the subsequent Bayesian inference tasks. The parameter ensemble naturally induces a $Q$-value ensemble and a policy ensemble; more precisely, each parameter vector corresponds to a $Q$-function and the greedy policy induced from that $Q$-function. The experimental results are reported in Tables 2, 3 and 5, where each measurement is obtained by averaging over 200 independent runs. For comparison, Adam has been run with different learning rates, including 1.0e-2, 1.0e-3 and 1.0e-4. For each learning rate, it is run 200 times independently, each run consisting of 10000 episodes. As with LKTD, we use the parameters obtained in the last 1000 parameter updating steps as a parameter ensemble for the subsequent statistical inference tasks. The results are also summarized in Tables 2, 3 and 5. Table 2 reports the estimation accuracy of the $Q$-values, measured by the mean squared error between the mean of the $Q$-value ensemble and the $Q$-value of the optimal policy as defined in equation (42). Adam produces about the same MSEs for all four actions with a learning rate of 1e-2, and more varied MSEs with the other learning rates. LKTD produces almost the same MSEs as the best run of Adam.
Table 3 compares the performance of the two algorithms in optimal policy exploration, measured by MSE(p), the mean squared error between the proportions of action votes from the policy ensemble and the probabilities of mean optimal actions. The probabilities of mean optimal actions describe the variety of optimal actions at a state: if multiple actions are all optimal, they share the same mean optimal action probability. For example, if both action N and action E are optimal at a state, then each has a mean optimal action probability of 0.5; a policy ensemble that fails to explore all optimal policies (due to a local-trap issue) might vote for only one of the two actions. For LKTD, the smaller values of MSE(p) for the north and east actions imply that it provides better optimal policy exploration than Adam. It is worth mentioning that Adam with a learning rate of 1e-2 also produces MSE(p) values similar to LKTD, but its sub-optimal action rate (SOAR) in Table 4 is worse than that of LKTD, which implies that LKTD provides a more reliable policy than Adam. Table 5 shows that LKTD has consistent coverage rates of around 95% for all actions, while Adam with small learning rates fails to cover over 50% of the optimal Q-values. Although Adam with a large learning rate (1e-2) provides good exploration of optimal policies, due to its optimization nature it cannot provide correct confidence coverage for the Q-values. Our experiments are based on the framework of RL Baselines3 Zoo. For the Adam optimizer, the hyperparameters are provided by the Zoo package. For LKTD, we set σ = 1 and 1/β = 0.01, and σ_θ can be chosen suitably according to the parameter size.
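The MSE(p) metric can be sketched as below, under the stated convention that all optimal actions at a state share equal mean optimal action probability. The argument names and shapes are assumptions for illustration:

```python
import numpy as np

# Sketch of MSE(p). greedy_actions: (n_samples, n_states) integer actions
# voted by the policy ensemble; optimal_mask: (n_states, n_actions) boolean,
# True for each optimal action at that state. Names are illustrative.
def mse_p(greedy_actions, optimal_mask):
    n_samples, n_states = greedy_actions.shape
    n_actions = optimal_mask.shape[1]
    # proportions of action votes from the policy ensemble
    votes = np.zeros((n_states, n_actions))
    for a in range(n_actions):
        votes[:, a] = (greedy_actions == a).mean(axis=0)
    # e.g. if N and E are both optimal, each gets mean optimal probability 0.5
    p_opt = optimal_mask / optimal_mask.sum(axis=1, keepdims=True)
    return float(np.mean((votes - p_opt) ** 2))

# one state where actions 0 (N) and 2 (E) are both optimal
mask = np.array([[True, False, True, False]])
balanced = np.array([[0], [2]] * 5)        # ensemble splits its votes 50/50
collapsed = np.zeros((10, 1), dtype=int)   # ensemble locked onto one action
print(mse_p(balanced, mask))               # 0.0
print(mse_p(collapsed, mask))              # 0.125: local-trap signature
```

An ensemble that explores both optimal actions attains MSE(p) = 0, while a locally trapped ensemble that always votes for one action incurs a positive error, which is exactly what Table 3 measures.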



Figure 1: Data generating process

Figure 2: Indoor escape environment

Consider a simple indoor escape environment as shown in Figure 2. The environment is the unit square [0, 1] × [0, 1], and the goal is to reach the top-right corner of size 0.1 × 0.1 as fast as possible. At time 0, the agent is placed at a random location in the square. At each time t, the agent observes its location coordinates s = (x, y) as the current state, then chooses an action a ∈ {N, S, E, W} according to policy ρ, moving a step of size drawn from Unif([0, 0.3]). The reward at each time t is −1 until the agent reaches the goal. The indoor escape environment is an example in which the optimal policy is not unique: the Q-values of N and E are identical in most states, except near the top and right borders. Hence, the ability to explore multiple optimal policies is critical for learning a stable and robust policy. Through this experiment we show the ability of the LKTD algorithm to learn a mixture of optimal policies in a single run. We compare LKTD with the widely used Adam algorithm on training a deep neural network with three hidden layers of sizes (16, 16, 16). Agents update the network parameters every 50 interactions, with 5 gradient steps per update, for a total of 10000 episodes. For action selection, the ϵ-Boltzmann exploration defined in B.1 is used, with exploration rate ϵ = 0.1 and inverse action temperature β_act = 5. The batch size is 250. The last 1000 parameter updates are collected as a parameter ensemble, which induces a Q-value ensemble and a policy ensemble. With the Q-value ensemble, we are able to draw the density plot of the Q-values at each point of the square, as illustrated in Figure 3. To quantify the uncertainty of the policy, we define the mean policy probability of an action at a state as the proportion of policies in the ensemble that select that action at the state.
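The environment dynamics described above can be sketched as a minimal simulator; the goal region, step-size distribution, and reward follow the text, while the class and method names are assumptions:

```python
import numpy as np

# Minimal sketch of the indoor escape environment. Goal region, uniform step
# size, and -1 reward follow the description; the API itself is hypothetical.
class IndoorEscape:
    MOVES = {"N": (0.0, 1.0), "S": (0.0, -1.0), "E": (1.0, 0.0), "W": (-1.0, 0.0)}

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.uniform(0.0, 1.0, size=2)  # random start in the unit square

    def done(self):
        # goal: the top-right corner of size 0.1 x 0.1
        return bool(np.all(self.state >= 0.9))

    def step(self, action):
        dx, dy = self.MOVES[action]
        size = self.rng.uniform(0.0, 0.3)                # step size ~ Unif([0, 0.3])
        self.state = np.clip(self.state + size * np.array([dx, dy]), 0.0, 1.0)
        return self.state.copy(), -1.0, self.done()      # reward is -1 each step

env = IndoorEscape(seed=1)
state, reward, done = env.step("N")
print(state, reward, done)
```

Because every step costs −1, any policy that moves north and east as directly as possible is optimal, which is why N and E tie in Q-value over most of the square.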

Figure 3: Q-value density plots and mean policy probabilities of LKTD

Figure 5: The first column shows the cumulative rewards obtained during training, the second column shows the testing performance without random exploration, and the third column shows the performance of the best model learned up to time t.

Figure 6: State-value surfaces: (a) state-values of the center point of each grid cell under the optimal policy; (b) linear function approximation with a linear basis; (c) linear function approximation with Gaussian kernels; (d) deep neural network with hidden layers (16, 16, 16).



Table 2: MSE(Q̂) for the indoor escaping example

Table 3: MSE(p) for the indoor escaping example

Table 4: Sub-optimal action rate for the indoor escaping example

Table 5 compares the coverage rates of the optimal Q-values obtained by the different algorithms. Considering 100 grid points over the entire state space, we calculate the coverage rate of the optimal Q-values.
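The coverage-rate computation can be sketched as follows; forming the 95% interval from the 2.5%/97.5% quantiles of the Q-value ensemble is an assumption about how the credible interval is constructed:

```python
import numpy as np

# Sketch of the Table 5 coverage rate: at each grid point, check whether the
# optimal Q-value lies inside the ensemble's 95% credible interval.
def coverage_rate(q_ensemble, q_opt, level=0.95):
    """q_ensemble: (n_samples, n_points); q_opt: (n_points,)."""
    lo = np.quantile(q_ensemble, (1 - level) / 2, axis=0)
    hi = np.quantile(q_ensemble, 1 - (1 - level) / 2, axis=0)
    covered = (q_opt >= lo) & (q_opt <= hi)
    return float(covered.mean())

rng = np.random.default_rng(0)
q_opt = rng.normal(size=100)                     # optimal Q at 100 grid points
post_mean = q_opt + rng.normal(size=100)         # estimates off by unit noise
samples = post_mean + rng.normal(size=(1000, 100))  # well-calibrated ensemble
print(coverage_rate(samples, q_opt))             # typically close to 0.95
```

A well-calibrated sampler gives coverage near the nominal 95% level, whereas a point-estimate method like Adam concentrates its ensemble too tightly and under-covers, as reported for the small learning rates.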

Table 5: Coverage rate of the optimal Q-value for the indoor escaping example

Instead of the semi-gradient, the true gradient is used in all of the indoor escape experiments for both LKTD and Adam. Note that the semi-gradient is biased, which can lead to an incorrect stationary distribution. In order to keep the comparison objective, the objective function for Adam at each time t is the mean squared Bellman error over the batch,

G_t(θ) = Σᵢ (rᵢ + γ Q_θ(s′ᵢ, a′ᵢ) − Q_θ(sᵢ, aᵢ))²,

where the gradient is taken through both the prediction and the target. The DQN agents are trained using a 2-layer dense neural network with hidden layers of size (256, 256). All the hyperparameters are shown in Table 6. Note that if the gradient step is set to −1, the agent performs as many gradient steps as steps taken in the environment between two updates.

Appendix A COMPLETE PROOFS

A.1 PROOF OF THEOREM 3.1

Proof: Following the proof of Theorem 3.1 for LEnKF (Zhang et al., 2021), we consider the Kalman gain matrix

K_t = Ω_t Φ(x_t)⊤ (R_t + Φ(x_t) Ω_t Φ(x_t)⊤)⁻¹,

which can be rewritten as identity (24). With identity (24), the conditional expectation of θ_t^a can be written as (25). For LKTD, the perturbation is e_t = w_t − K_t(Φ(x_t) w_t + v_t), with mean E(e_t) = 0. By combining (24), (25), (26) and the assumption V = σ²I, the update of θ_t^a can be rewritten in the SGLD form.

We assume the following conditions hold:

(A1) For any θ ∈ Θ, the Markov transition kernel Π_θ has a unique stationary distribution π_θ(z); G(θ, z) is measurable on Θ × Z; and ∥g(θ)∥ = ∥∫_Z G(θ, z) π(z|θ) dz∥ < ∞.

(A2) There exists a function Ḡ(θ, z), an anti-derivative of G(θ, z) with respect to θ, i.e., ∇_θ Ḡ(θ, z) = G(θ, z), such that |Ḡ(0, z)| ≤ A for some constant A > 0 and any z ∈ Z; in addition, there exists a constant B > 0 such that ∥G(0, z)∥ ≤ B for any z ∈ Z.
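For intuition, the SGLD-type recursion into which the update of θ_t^a is rewritten has the generic form θ ← θ − (ε/2)∇G(θ) + N(0, ε β⁻¹ I). A minimal sketch on a toy objective follows; the step size, quadratic objective, and temperature are assumptions, not the LKTD recursion itself (whose gradient arises from the Kalman-filter reformulation):

```python
import numpy as np

# Generic SGLD step: theta <- theta - (eps/2) * grad + sqrt(eps / beta) * noise.
# Demonstrated on G(theta) = 0.5 * ||theta||^2, whose stationary distribution
# at beta = 1 is the standard Gaussian exp(-G(theta)). A toy sketch only.
def sgld_step(theta, grad, eps=0.01, beta=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=theta.shape)
    return theta - 0.5 * eps * grad + np.sqrt(eps / beta) * noise

rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = []
for t in range(20000):
    theta = sgld_step(theta, grad=theta, eps=0.05, rng=rng)  # grad of 0.5*||theta||^2
    if t >= 10000:                       # discard burn-in, keep the tail
        samples.append(theta.copy())
print(np.std(samples))  # roughly 1, matching the N(0, I) target
```

This mirrors the paper's practice of discarding early updates and treating the last block of iterates as samples from the stationary distribution p*(θ) ∝ exp(−G(θ)).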

