ONLINE REINFORCEMENT LEARNING VIA POSTERIOR SAMPLING OF POLICY

Anonymous authors. Paper under double-blind review.

Abstract

We propose a Reward-Weighted Posterior Sampling of Policy (RWPSP) algorithm to tackle the classic trade-off between exploration and exploitation in finite Markov decision processes (MDPs). Thompson sampling methods have so far only considered posterior sampling over transition probabilities, which makes it hard to attain globally near-optimal rewards. RWPSP instead runs posterior sampling over stationary policy distributions while keeping the transition probabilities updated. In particular, we leverage both relevant count functions and reward weighting to update the policy posterior online, aiming to balance between local and long-term policy distributions for a globally near-optimal game value. Theoretically, we establish a bound of Õ(Γ√T/S²)¹ on the total regret over time horizon T, where Γ/S² < D√SA holds in general, S and A represent the sizes of the state and action spaces, respectively, and D is the diameter. This matches the best regret bound for MDPs thus far. Experimental results corroborate our theoretical analysis and show the advantage of our algorithm over baselines in terms of efficiency.

¹ The symbol Õ hides logarithmic factors.

1. INTRODUCTION

Online reinforcement learning (Wei et al., 2017) addresses the problem of learning and planning in real-time sequential decision-making systems where the environment is partially or fully observed. The decision maker tries to maximize the cumulative reward during the interaction with the environment, which inevitably leads to a trade-off between exploration and exploitation. Many attempts have been made to mitigate this dilemma by improving the underlying regret bounds (Zhang et al., 2020b; Ménard et al., 2021; Zhang et al., 2021b; Zhang et al., 2022; Agrawal et al., 2021). The trade-off between exploration and exploitation has been studied extensively in various scenarios. The goal of exploration is to gather as much information about the environment as possible, while exploitation aims to maximize the long-term total reward based on the part of the environment explored so far. One popular way to handle the trade-off is naive exploration, such as adaptive ϵ-greedy exploration (Tokic, 2010), which adjusts the exploration parameter adaptively depending on the temporal-difference (TD) error observed from the value function. Optimistic initialisation methods have also been studied in factored MDPs (Szita & Lörincz, 2009; Brafman & Tennenholtz, 2003); they encourage systematic exploration in the early stage. Another common approach is the optimism in the face of uncertainty (OFU) principle (Lai & Robbins, 1985), where the agent constructs confidence sets to search for the optimistic parameters associated with the maximum reward. Thompson sampling, an OFU-based approach, was originally presented for stochastic bandit scenarios (Thompson, 1933).
It has been applied in various MDP contexts (Osband et al., 2013; Agrawal & Goyal, 2012) since it can achieve tighter bounds (Ding et al., 2021; Oh & Iyengar, 2019; Moradipari et al., 2019) and better compatibility with other structures in both theory and practice (Chapelle & Li, 2011; Zhang et al., 2021a; Agrawal & Goyal, 2013). It has also achieved great performance on contextual bandit problems (Agrawal & Jia, 2017; Osband & Van Roy, 2017; Osband et al., 2019). General optimistic algorithms require solving all MDPs lying within the confidence sets, while Thompson sampling-based algorithms only need to solve the sampled MDPs to achieve similar results (Russo & Van Roy, 2014). Thompson sampling thus offers a speedup on one hand, but results in biased estimates of the transition matrix on the other.

This paper addresses the trade-off between exploration and exploitation in finite MDPs. We propose a reward-weighted posterior sampling of policy (RWPSP) algorithm that samples posterior policy distributions rather than posterior transition distributions, thereby optimizing the long-term policy probability distribution. While updating the posterior policy distribution, we use the count functions of the state-action pairs to capture the importance of each sampled episode. This way, we manage to optimize the policy distribution over time horizon T and achieve a total regret bound of Õ(Γ√T/S²) with Γ/S² < D√SA, where S and A represent the sizes of the state and action spaces, respectively, and D is the diameter of the finite MDP. In addition, we propose a new Bayesian method to update transition probabilities which also achieves a state-of-the-art regret bound. In comparison, existing model-based methods like the Upper Confidence Stochastic Game algorithm (UCSG) achieve a regret bound of Õ(∛(DS²AT²)) on stochastic MDPs (Wei et al., 2017), while model-free methods like optimistic Q-learning achieve a regret bound of Õ(T^(2/3)) under infinite-horizon average-reward MDPs (Wei et al., 2020). To summarize, this work makes the following contributions:

• We propose a reward-weighted posterior sampling of policy (RWPSP) algorithm that strikes a balance between the posterior projection of the long-term policy and the local policy.

• RWPSP is the first posterior sampling method that samples posterior policy distributions while Bayesian-updating transition probabilities. It achieves a regret bound of Õ(Γ√T/S²), where Γ/S² < D√SA. To the best of our knowledge, this total regret bound is below the state of the art, i.e., Õ(D√(SAT)).

• We conduct experimental studies to verify our theoretical results and demonstrate that our RWPSP algorithm outperforms other online learning methods in complex MDP environments.

2. RELATED WORK

Regret Bound Analysis. In the finite-horizon setting, most Thompson sampling-based algorithms follow a model-based approach (Abbasi Yadkori et al., 2013; Xu & Tewari, 2020; Auer et al., 2008; Fruit et al., 2020; Dong et al., 2020; Agarwal et al., 2020), as model-based reinforcement learning methods are required to approximate the optimal transition matrix of an MDP. In Xu & Tewari (2020), non-episodic factored Markov decision processes are sampled using extreme transition dynamics, which encourages visiting new states in order to minimize regret. Although various approaches have been used to minimize the regret bound, current methods still do so by updating the transition matrix. A good comparison among existing Thompson sampling-based methods can be found in Zhang et al. (2021c); Wei et al. (2020). In contrast to existing works focused on posterior sampling over transition matrices, our work only considers posterior sampling over policy distributions in a finite-horizon MDP; the transition probabilities are updated based on the real trajectory. On the other hand, while existing model-free methods have not yet achieved a state-of-the-art regret bound (Jin et al., 2018; Strehl et al., 2006), some of them have improved the total regret bound (Zhang et al., 2020a).

Intrinsic Reward Shaping. Intrinsic reward shaping was first introduced in 1999 (Ng et al., 1999) as a generic idea to guide policy iteration with an intrinsic reward. Count-based methods were then proposed and reach nearly state-of-the-art performance in high-dimensional environments (Tang et al., 2017). Intrinsic reward is also used in Du et al. (2019) to compute a distinct proxy critic for the agent to guide the update of its individual policy. To shape the reward during policy iteration, we adopt a reward-weighted update to realize the intrinsic reward.
Count functions of states and/or actions are usually used in an agent's exploration process to help build the intrinsic reward (Tang et al., 2017; Bellemare et al., 2016; Burda et al., 2018). In our algorithm, we treat the count function as the posterior projection of the intrinsic reward, and then use the generated reward to update the posterior distribution. Previous methods mainly focus on the instantaneous rewards generated during exploration, while our method uses a reward-weighted count function to generate long-term rewards that can guide the policy towards a globally optimal value.

3. PRELIMINARIES

3.1. MARKOV DECISION PROCESS

A finite stochastic Markov decision process (fMDP) (Ferns et al., 2004) is defined by a tuple M = (S, A, r, θ). Denote the sizes of the state and action spaces as S = |S| and A = |A|, respectively. r is the reward function r : S × A → [0, 1]. Let θ : S × A × S → [0, 1] be the transition probability such that θ(s′ | s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a). The ground-truth transition probability θ* is randomly generated before the game starts, after which it is fixed and unknown to the agent. For model-based agents, the transition probability at time step t within episode k is denoted θ_{t_k}; for the whole episode, it is denoted θ_k. A stationary policy π : S → A is a deterministic function that maps a state to an action. We define the instantaneous policy under transition probability θ_{t_k} as π_{θ_{t_k}}. The globally optimal policies under the locally optimal transition probability and the true transition probability are denoted π*_{θ_{t_k}} and π*_{θ*}, respectively. For notational brevity, let π_{t_k} ≜ π_{θ_{t_k}}, π*_{t_k} ≜ π*_{θ_{t_k}}, and π⋆ ≜ π*_{θ*}. In the fMDP, the average reward per time step under stationary policy π is defined as:

J(π_θ) = lim_{T→∞} (1/T) E[ Σ_{t′=t}^{t+T} r(s_{t′}, a_{t′}) ]. (1)

Therefore, we denote the instantaneous average reward under transition probability θ_{t_k} as J(π_{t_k}). Note that J(π_{t_k}) is a hypothetical average reward generated from θ_{t_k} and π_{t_k}. The locally optimal average reward J(π*_{t_k}) is derived from the corresponding locally optimal policy π*_{t_k}, and the globally optimal average reward is J(π⋆). Define the maximum average reward as Γ = max_π J(π_θ), i.e., the maximum average reward that an agent can achieve during its exploration of an fMDP. The maximum value Γ is achieved under the true transition probability with the optimal stationary policy, i.e., Γ = J(π⋆).
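As a concrete illustration of Eq. (1), the average reward J(π) of a deterministic stationary policy in a small fMDP can be computed from the stationary distribution of the Markov chain the policy induces. The sketch below uses a hypothetical two-state, two-action example; all names and numbers are illustrative and not taken from the paper.

```python
# Hypothetical two-state, two-action fMDP; all numbers are illustrative.

def average_reward(theta, r, policy, n_iter=10_000):
    """Estimate J(pi) of Eq. (1) by power-iterating the state distribution
    of the Markov chain induced by a deterministic policy."""
    S = len(theta)
    mu = [1.0 / S] * S                       # start from the uniform distribution
    for _ in range(n_iter):                  # converge to the stationary distribution
        mu = [sum(mu[s] * theta[s][policy[s]][s2] for s in range(S))
              for s2 in range(S)]
    return sum(mu[s] * r[s][policy[s]] for s in range(S))

# theta[s][a][s'] = P(s' | s, a); rewards r[s][a] lie in [0, 1]
theta = [[[0.9, 0.1], [0.2, 0.8]],           # transitions from state 0
         [[0.5, 0.5], [0.1, 0.9]]]           # transitions from state 1
r = [[0.0, 0.1], [0.2, 1.0]]
pi = [1, 1]                                  # deterministic policy s -> a
J = average_reward(theta, r, pi)             # stationary dist. (1/9, 8/9) gives J = 0.9
```

Power iteration suffices here because the induced chain is aperiodic and irreducible; in general one would verify this before relying on a unique stationary distribution.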
In the online learning setting, the total regret is defined as the difference between the optimal total game value and the actual game value:

Reg = max_a Σ_{t=1}^{T} r(s_t, a) − Σ_{t=1}^{T} r(s_t, a_t). (2)

It measures the performance of a decision maker. Since this metric is hard to compute in general, we define the following bias vector b(θ, π, s) (Wei et al., 2017) as the relative advantage of each state, which helps us measure the total regret:

b(θ, π, s) ≜ E[ Σ_{t=1}^{∞} ( r(s_t, a_t) − J(π) ) | s_1 = s, a_t ∼ π(·|s_t) ].

Under stationary policy π, the advantage of one state s over another state s′ is defined as the difference between their accumulated rewards with initial states s and s′, respectively, which eventually converges to the difference of their bias vectors, i.e., b(θ, π, s) − b(θ, π, s′). The bias vector satisfies the Bellman equation:

J(π_θ) + b(θ, π, s) = r(s, π) + Σ_{s′} p_θ(s′ | s, π) b(θ, π, s′).

Define the span of a vector x as sp(x) = max_s x(s) − min_s x(s).
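The Bellman equation above can be solved numerically by relative value iteration, pinning the bias of a reference state to zero. The following is a minimal sketch on a made-up two-state chain; the function name and numbers are assumptions for illustration.

```python
# Relative value iteration on a made-up two-state chain induced by a fixed
# policy; P[s][s'] and r[s] are illustrative numbers.

def bias_vector(P, r, ref=0, n_iter=200):
    """Solve J + b(s) = r(s) + sum_s' P[s][s'] b(s') with b[ref] pinned to 0."""
    S = len(P)
    b = [0.0] * S
    for _ in range(n_iter):
        nb = [r[s] + sum(P[s][s2] * b[s2] for s2 in range(S)) for s in range(S)]
        J = nb[ref]                    # current gain estimate
        b = [v - J for v in nb]        # re-center so the reference bias stays 0
    return J, b

P = [[0.2, 0.8], [0.1, 0.9]]
r = [0.1, 1.0]
J, b = bias_vector(P, r)               # J -> 0.9, b -> [0.0, 1.0]
span = max(b) - min(b)                 # sp(b) = 1.0
```

One can check the fixed point by hand: J + b(0) = 0.9 = 0.1 + 0.8·b(1) and J + b(1) = 1.9 = 1.0 + 0.1·b(0) + 0.9·b(1), so (J, b) = (0.9, [0, 1]) satisfies the Bellman equation exactly.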

3.2. ASSUMPTIONS

The globally optimal policy is hard to learn for MDPs in online settings, as algorithms often get stuck in locally optimal results. The ϵ-tolerance is therefore introduced to help measure the performance of an algorithm: when the difference between the current average reward and the optimal average reward is less than a constant ϵ, the current policy is said to be ϵ-optimal.

Assumption 3.1 (ϵ-Optimal policy) (Hartman, 1975). Under the sub-optimal and optimal transition probabilities, if policy π_{t_k} satisfies J_{π*}(θ_{t_k}) − J_{π_{t_k}}(θ_{t_k}) ≤ ϵ, then policy π_{t_k} is ϵ-optimal.

Assumption 3.2 (Expected transition time). When conducting a stationary policy π, we assume that the maximum expected time to reach state s′ from state s under both sub-optimal and optimal transition probabilities is at most a constant D:

max T^{π⋆}_{s→s′}(θ*) ≤ max T^{π*_{t_k}}_{s→s′}(θ_{t_k}) ≤ max T^{π_{t_k}}_{s→s′}(θ_{t_k}) = D.

Assumption 3.2 implies that, under all circumstances, all states can be visited within D steps on average. When the agent conducts the optimal policy under the optimal transition probability, the transition time T^{π⋆}_{s→s′}(θ*) should be the shortest, because the agent tends to explore the fewest unrelated states under the optimal stationary policy. Similarly, the transition time T^{π*_{t_k}}_{s→s′}(θ_{t_k}) for an agent conducting the optimal policy under a sub-optimal transition probability is assumed to be at most the maximum transition time D = max_{s,s′} T^{π_{t_k}}_{s→s′}(θ_{t_k}) in normal settings.

Let e(t) ≜ k denote the episode to which time instant t belongs. When conducting stationary policy π, we define the count function for the episode number e(t) as N(π_{e(t)}). Define H_{(s1,s2)}(k, π) as the number of time instants at which the state transition s_1 → s_2 occurs in the first k episodes with stationary policy π used:

H_{(s1,s2)}(k, π) ≜ Σ_{t=1}^{∞} 1{ π_{e(t)} = π, (S_t, S_{t+1}) = (s_1, s_2), N(π_{e(t)}) ≤ k }. (5)

Under transition probability θ_{t_k}, the expected transition time from state s back to itself with stationary policy π_{t_k} is denoted τ̄_{π_{t_k}}, i.e., τ̄_{π_{t_k}} ≜ T^{π_{t_k}}_{s→s}(θ_{t_k}). The posterior probability of the stationary policy π can therefore be related to the difference between the empirical state-pair frequency H_{(s1,s2)}(k, π)/k and the corresponding expected value τ̄_{π_{t_k}}.

Assumption 3.3 (Posterior distribution under sub-optimal trajectories) (Gopalan & Mannor, 2015). For any given e_1, e_2 ≥ 0, there exists p ≜ p(e_1, e_2) > 0 satisfying θ_{t_k}(π*_{t_k}) ≥ p for any episode index k at which sub-optimal transition frequencies have been observed:

| H_{(s1,s2)}(k, π)/k − τ̄_{π_{t_k}} θ(s_1 | s_2) | ≤ √( e_1 log(e_2 log k) / k ), ∀ s_1, s_2 ∈ S, k ≥ 1.
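The counter H_{(s1,s2)}(k, π) of Eq. (5) is straightforward to maintain online. A minimal sketch, with a made-up trajectory and a made-up policy label:

```python
from collections import defaultdict

# Online maintenance of the pair counter H_{(s1,s2)}(k, pi) from Eq. (5);
# the trajectories and the policy label "pi_a" are made up.

H = defaultdict(int)                   # key: (policy label, s1, s2)
episodes = defaultdict(int)            # episodes run under each policy

def record_episode(policy, trajectory):
    """trajectory: list of states visited in one episode under `policy`."""
    episodes[policy] += 1
    for s1, s2 in zip(trajectory, trajectory[1:]):
        H[(policy, s1, s2)] += 1

record_episode("pi_a", [0, 1, 1, 0, 1])
record_episode("pi_a", [1, 1, 0, 0, 1])
# empirical frequency H_{(s1,s2)}(k, pi)/k used in Assumption 3.3
empirical = H[("pi_a", 0, 1)] / episodes["pi_a"]
```

Assumption 3.3 then compares such empirical frequencies against their expected values, so this counter is the only trajectory statistic the posterior update needs.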

4. PROPOSED ALGORITHMS

In this section, we propose a new algorithm to tackle the trade-off between exploration and exploitation. One parameter needed under the posterior sampling setting is the prior distribution, denoted µ_0. Note that we generate prior distributions for both transition probabilities and stationary policies, but only perform posterior sampling over stationary policy distributions, while the transition probabilities are Bayesian-updated using the trajectory generated by the posterior policy. In each episode k, at each time step t, the action is sampled from the posterior policy distribution, and this policy distribution µ_{t_k}(π) is updated based on the previous history h_{t_k}. Let N_t(s, a) be the number of visits to a state-action pair (s, a) during a period of time t:

N_t(s, a) = |{τ < t : (s_τ, a_τ) = (s, a)}|.

Algorithm 1 Reward-Weighted Posterior Sampling of Policy (RWPSP)
Input: Game environment, prior distribution for stationary policy µ_{π_0}, transition probability θ_0, initial state s_0 ∈ S, time step t = 0.
Output: Stationary policy π_K
1: for episode k = 0, 1, 2, . . . , K do
2:   T_{k−1} ← t − t_k
3:   t_k ← t
4:   Generate µ_k(π_k) based on the prior distribution
5:   Update θ_k using θ_k = [θ_{k−1}(s_1 | s_2, a_2) + H_{s1,s2}(N_{π_t}(k), π)] / N_{π_t}(k)
6:   for t ≤ t_k + T_{k−1} and N_t(s, a) ≤ 2N_{t_k}(s, a) do
7:     Sample π_{t_k} ∼ µ_{t_k}(π)
8:     Take action a_t ∼ π_{t_k}(·|s_t), observe s_{t+1} and r_{t+1}
9:     Update the posterior distribution µ_{(t+1)_k}(π) using RWPI
10:    t ← t + 1
11:   end for
12: end for

We thus obtain our algorithm, called Reward-Weighted Posterior Sampling of Policy (RWPSP for short), described in Algorithm 1. At the beginning of each episode k, RWPSP samples a policy distribution from the prior distribution µ_{(t−1)_k}(π_{k−1}) (Line 4), which equals the updated posterior policy distribution from the last episode (Line 9).
Then, the transition probability distribution is generated from the historical transition matrix θ_{k−1} and the count functions H_{s1,s2}(N_{π_t}(k), π) and N_{π_t}(k) (defined in Section 4.1) (Line 5). We use two stopping criteria to limit the agent's exploration direction: the first stopping criterion stops meaningless exploration, while the second ensures that the visit count of any state-action pair (s, a) does not double within the same episode (Line 6). At each time step t_k, actions are generated from the instantaneous policy π_{t_k} (Line 7), which follows the posterior distribution µ_{t_k}(π). These actions are then used by the agent to interact with the environment and observe the next state s_{t+1} and the reward r_{t+1} (Line 8). The observations are then used to find the optimal posterior distribution for policy π_{(t+1)_k} (Line 9).
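The episode structure described above can be sketched as a skeleton of the control flow only. The environment, policy sampler, and posterior update below are illustrative stubs (not the paper's implementation), and the first stopping criterion (episode length capped by the previous episode's length) is omitted for brevity.

```python
def rwpsp(env_step, sample_policy, posterior_update, S, A, horizon):
    """Skeleton of the RWPSP episode loop (a sketch, not the paper's code)."""
    N = {(s, a): 0 for s in range(S) for a in range(A)}    # visit counts N_t(s, a)
    s, t = 0, 0
    while t < horizon:                                     # episodes k = 0, 1, ...
        N_start = dict(N)                                  # counts N_{t_k} frozen at episode start
        while t < horizon:
            pi = sample_policy()                           # sample pi ~ mu_t(pi)
            a = pi(s)
            s_next, r = env_step(s, a)                     # act and observe
            N[(s, a)] += 1
            posterior_update(s, a, s_next, r)              # RWPI posterior update
            t += 1
            # second stopping criterion: end the episode once some pair's
            # visit count doubles relative to the episode start
            stop = N[(s, a)] > 2 * max(N_start[(s, a)], 1)
            s = s_next
            if stop:
                break
    return N

# Illustrative stubs: a 3-state cyclic environment and a fixed policy.
S_, A_ = 3, 2
def env_step(s, a):
    return (s + a) % S_, 0.0
def sample_policy():
    return lambda s: 1
def posterior_update(s, a, s_next, r):
    pass                                                   # placeholder for the RWPI step

counts = rwpsp(env_step, sample_policy, posterior_update, S_, A_, horizon=100)
```

The doubling criterion is what keeps the number of episodes logarithmic in the visit counts, which is the standard device behind episode-based regret decompositions.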

4.1. UPDATE RULE

In previous Bayesian methods, the transition matrix is updated with Thompson/posterior sampling; in our case, we apply posterior sampling over the policy distributions. Based on Bayes' rule, the posterior distribution of the policy can be written as:

µ_{t+1}(π) = θ(s_{t+1} | s_t, a_t) µ_t(π) / Σ_{π′} θ′(s_{t+1} | s_t, a_t) µ_t(π′).

The way we update the stationary policy resembles how Thompson sampling updates transition probabilities, as our algorithm uses the prior policy to guide the current policy. The key difference is that our Reward-Weighted Policy Iteration (RWPI) algorithm, shown in Algorithm 2, is able to balance between the instantaneous action and the historical actions. This helps our method approximate the long-term maximum reward, which is the globally optimal value in this scenario. We define W_{t_k} as the posterior weight in episode k at time t (Line 2 in Algorithm 2). Let J_{π_{t_k}} and J_{π*_{t_k}} denote the instantaneous average reward and the locally optimal value, respectively; the latter is induced by adopting the greedy policy on the transition probabilities θ_t. The value of W_{t_k} is proportional to the log difference between the average reward of the locally optimal policy and that of the current policy. Finally, we generate the policy distribution µ_t(π) from the previous policy distribution µ_{t−1}(π) and the current locally optimal policy π*_t(s, θ_t) (Line 3 in Algorithm 2).
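The Bayes update above is a standard discrete posterior update over a finite candidate set of policies. A minimal sketch, assuming the likelihood of each candidate π is the probability it assigns to the observed transition (an illustrative modeling choice; the paper does not spell out the likelihood model here):

```python
# One step of the policy-posterior update over a finite candidate set; the
# likelihood values stand in for theta(s_{t+1} | s_t, a_t) under each
# candidate policy, which is an assumption made for illustration.

def bayes_update(mu, likelihood):
    """mu_{t+1}(pi) proportional to likelihood(pi) * mu_t(pi), renormalized."""
    post = {pi: likelihood[pi] * p for pi, p in mu.items()}
    Z = sum(post.values())                  # normalizing constant (the denominator)
    return {pi: p / Z for pi, p in post.items()}

mu = {"pi1": 0.5, "pi2": 0.5}               # prior over two candidate policies
lik = {"pi1": 0.8, "pi2": 0.2}              # prob. each assigns to the observed step
mu = bayes_update(mu, lik)                  # -> {"pi1": 0.8, "pi2": 0.2}
```

Because the candidate set is finite, the normalizer is an exact sum rather than the integral a continuous policy posterior would require.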
We measure the distance between the historical optimal policy and the instantaneous policy using the marginal Kullback-Leibler divergence (marginal KL divergence), a widely used metric for characterizing the difference between two probability distributions:

D_π(µ*(π) ∥ µ_{t_k}(π)) ≜ Σ_{s1∈S} θ^π_{s1} Σ_{s2∈S} µ*(π) log( µ*(π) / µ_{t_k}(π) ) = Σ_{s1∈S} θ^π_{s1} KL( µ*(π) ∥ µ_{t_k}(π) ).

Algorithm 2 Reward-Weighted Policy Iteration (RWPI)
Input: Game environment, prior distribution for stationary policy µ_t(π)
Output: Stationary policy π_i
1: repeat
2:   W_{t_k}(π) = exp{ Σ_{π,s} H_s(N_π(k), π) log( J_{π_t} / J_{π*}(s) ) }
3:   µ_t(π) = W_t µ_{t−1}(π) + (1 − W_t) π*_t(s, θ_t)
4: until D_π(µ*(π) ∥ µ_{t_k}(π)) ≤ ϵ

The parameter ϵ in Algorithm 2 represents the tolerance between the optimal policy and the instantaneous policy. RWPI updates the policy dynamically with the posterior weight, and the policy converges to an ϵ-optimal value after a certain number of iterations under this update rule. In the following section, we analyze the convergence of this posterior update method and the total regret bound of our method.
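Lines 2-3 of Algorithm 2 can be sketched as two small functions: one computing the posterior weight W_t from count-weighted log reward ratios, and one mixing the previous policy distribution with the greedy policy. The dict-based policy distributions and the flat count list are illustrative data structures, not the paper's.

```python
import math

# Sketch of the RWPI weight (Line 2) and mixing step (Line 3); data
# structures are assumptions for illustration.

def rwpi_weight(counts, J_cur, J_opt):
    """Posterior weight W_t: exponential of the count-weighted log-ratio
    between the current and the locally optimal average reward."""
    return math.exp(sum(counts) * math.log(J_cur / J_opt))

def rwpi_step(mu_prev, pi_star, W):
    """Mix the previous policy distribution with the greedy policy."""
    keys = set(mu_prev) | set(pi_star)
    return {a: W * mu_prev.get(a, 0.0) + (1 - W) * pi_star.get(a, 0.0)
            for a in keys}

W = rwpi_weight([2], J_cur=0.5, J_opt=1.0)     # exp(2 * ln 0.5) = 0.25
mixed = rwpi_step({"L": 1.0}, {"R": 1.0}, W)   # {"L": 0.25, "R": 0.75}
```

Note that W ≤ 1 exactly when the current average reward is below the locally optimal one, so the mixture leans towards the greedy policy precisely when the current policy underperforms.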

5.1. CONVERGENCE OF THE UPDATE RULE

We now show the convergence of our posterior policy update method. To this end, we need the following three lemmas. Lemma 5.1 shows that RWPI enjoys asymptotic convergence. Lemma 5.2 then shows that the output policy of this policy iteration method updates monotonically towards the optimal direction, which is vital evidence for the global optimality of our update method. Finally, Lemma 5.3 proves that under MDP M, the output policy generated by RWPI reaches ϵ-optimality after a constant number of iterations.

Lemma 5.1. Suppose Assumption 3.2 holds for some stochastic MDP M. Then the policy iteration algorithm on M converges asymptotically.

Proof. If Assumption 3.2 holds, then by Theorem 4 in van der Wal (1977), the successive policy approximation process yields an ϵ-band and stationary ϵ-optimal strategies for the agent. This result matches Assumption 3.1, and the convergence of the policy follows.

Lemma 5.2. The average reward produced by Algorithm 2 is monotonically increasing.

Proof. From Algorithm 2, we can write the update rule of the average reward as:

J_{π_t}(θ) − J_{π_{t−1}}(θ) = (W_t − 1) J_{π_{t−1}}(θ) + (1 − W_t) J_{π*}(s, θ) = (1 − W_t)( J_{π*}(s, θ) − J_{π_{t−1}}(θ) ).

If J_{π_{t−1}}(θ) ≤ J_{π*}(s, θ), then W_t ≤ 1 since log( J_{π_t}(θ) / J_{π*}(s, θ) ) ≤ 0; otherwise W_t ≥ 1. Thus,

J_{π_t}(θ) − J_{π_{t−1}}(θ) = (1 − W_t)( J_{π*}(s, θ) − J_{π_{t−1}}(θ) ) ≥ 0.

That is, the sequence J_{π_t}(θ) is monotonically increasing in time step t.

Lemma 5.3. Suppose Assumptions 3.1-3.2 hold for some stochastic MDP M. Let u_i be the state value in iteration i, and define N as the maximum iteration number of the algorithm. Then π_{t_k} is ϵ-optimal after N iterations.

Proof. The detailed proof is given in Appendix A.2.
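The convex-combination recursion in the proof of Lemma 5.2 can be checked numerically. The sketch below collapses the weight of Algorithm 2 to the scalar W_t = (J_{t−1}/J*)^c, so that W_t ≤ 1 exactly when J_{t−1} ≤ J*, matching the log-ratio sign argument; the constant c stands in for the count term and is an assumption made for illustration.

```python
def reward_iterates(J0, J_star, c=1.0, n=20):
    """Iterate J_t = W_t * J_{t-1} + (1 - W_t) * J_star with the scalar weight
    W_t = (J_{t-1}/J_star)**c.  Lemma 5.2 predicts a monotonically increasing
    sequence whenever J0 <= J_star."""
    J, out = J0, [J0]
    for _ in range(n):
        W = (J / J_star) ** c
        J = W * J + (1 - W) * J_star
        out.append(J)
    return out

seq = reward_iterates(0.2, 1.0)   # increases monotonically towards J* = 1
```

With c = 1 the recursion becomes 1 − J_t = J_{t−1}(1 − J_{t−1}), so the gap to J* shrinks at every step but never overshoots, which is exactly the behavior the monotonicity lemma asserts.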

5.2. REGRET BOUND ANALYSIS

Throughout the regret-bound proof, we consider the worst case; the randomness of the algorithm is reduced to the minimum level in order to obtain a fair measurement of performance. Having proved the convergence of RWPI, we now turn to the total regret bound. The regret over time horizon T can be written as:

Reg_T = T J_{π_K}(θ_K) − Σ_{t=1}^{T} r_{π_t}(s_t, a_t) ≈ [ T J_{π_K}(θ_K) − Σ_{t=1}^{T} J_{π_t}(θ_K) ] + [ Σ_{t=1}^{T} J_{π_t}(θ_K) − Σ_{t=1}^{T} J_{π_t}(θ_t) ] = Reg¹_T + Reg²_T.

Let K be the number of episodes in time horizon T. The two regret terms are defined separately as Reg¹_T = T J_{π_K}(θ_K) − Σ_{t=1}^{T} J_{π_t}(θ_K) and Reg²_T = Σ_{t=1}^{T} J_{π_t}(θ_K) − Σ_{t=1}^{T} J_{π_t}(θ_t), where J_{π_K}(θ_K) is the final average reward under the final policy π_K and the final transition matrix θ_K. Reg¹_T represents the posterior policy regret and Reg²_T the posterior transition probability regret. For any measurable function f and any h_{t_k}-measurable random variable X, E[ f(θ*, X) | h_{t_k} ] = E[ f(θ_k, X) | h_{t_k} ] (Osband et al., 2013).

To bound the first regret term Reg¹_T, we first bound the ratio between the expected optimal average reward and the instantaneous reward. Based on Assumptions 3.1-3.3, the expected optimal reward that an agent can achieve in the fMDP can be bounded using the parameters Γ and ϵ.

Lemma 5.4. log( J_{π*}(θ) / J_{π_t}(θ) ) ≤ ϵ/Γ.

Proof. The detailed proof is given in Appendix A.2.

After bounding this log-ratio, we bound the instantaneous posterior weight W_{t_k}, which is central to our proof. At each time step, the posterior weight is updated based on the previous policy and the observed data. First, we define the counter N_π(t) := Σ_{τ=0}^{t−1} 1{ π_{e(τ)} = π } as the total number of time instants during the period t at which policy π was conducted.
When Assumption 3.3 holds, we can bound the posterior weight in terms of the count function in episode k and the average transition time τ̄.

Lemma 5.5. Under Assumption 3.3, for each stationary near-optimal policy π and episode k ≥ 1, the following upper bound holds for the negative log-density:

−log W_{t_k}(π) ≤ (ϵ/Γ) |S|² ( ρ(k_π) √k_π + k_π τ̄_{t_k,k_π} ).

The real reward is expected to approach the expected reward via an optimization method, which generally requires a large number of iterations. From the convergence proof in Section 5.1, we can therefore derive a bound on the expected convergence time of the optimization process. In Lemma 5.6, we bound the instantaneous difference between the real reward and the expected reward in terms of √T. The bound is inversely proportional to √T, since our update method moves towards the optimal direction (see Lemma 5.2). For brevity, the full proof is given in Appendix A.5.

Lemma 5.6. The difference between the locally optimal average reward and the instantaneous average reward is bounded as |J_{π_t} − J*| ≤ Õ( Γ / (S²√T) ).

Combining the previous lemmas yields the bound on the first regret term.

Theorem 5.7. The first part of the regret over time horizon T is bounded by: Reg¹_T ≤ Õ( Γ√T / S² ).

It is not immediately clear whether this result improves over state-of-the-art methods. The following lemma gives a tighter characterization, showing that under the fMDP our method has a lower regret bound than the current state of the art, Õ(D√(SAT)).

Lemma 5.8. Γ/S² < D√(SA) when |S| ≥ 2 or |A| ≥ 2.

The second regret term Reg²_T represents the posterior difference generated by the update method of the transition probability. First, we use the definition of the Bellman operator of the average reward to convert the one-step posterior transition difference into a difference between transition probabilities; then we apply Assumption 3.3 to bound this difference.
Finally, the regret is bounded by summing all the one-step posterior transition differences.

Theorem 5.9. The regret caused by the transition matrix update is bounded by: Reg²_T ≤ Õ( D (SAT)^{1/4} ).

6. EXPERIMENT

In this section, we compare our method with various state-of-the-art methods: SCAL (Fruit et al., 2018a), UCRL2 (Auer et al., 2008), UCRL2B (Fruit et al., 2020), UCRL3 (Bourel et al., 2020), and KL-UCRL (Talebi & Maillard, 2018). SCAL is an exploration-based method that uses a proper exploration bonus to solve any discrete unknown weakly-communicating MDP; it admits a regret bound of O( D √( Σ_{s,a} K_{s,a} T log(T/δ) ) ) (Fruit et al., 2018b). UCRL2, UCRL2B, and UCRL3 are three optimistic methods that use confidence bounds to minimize the total regret. The UCRL2 algorithm performs regret minimization in unknown discrete MDPs under the average-reward criterion. UCRL2B refines UCRL2 by exploiting empirical Bernstein inequalities to prove a regret bound of O( D √(ΓSAT) ), where max_{s,a} Γ(s, a) ≤ S. UCRL3 modifies the previous algorithms by using time-uniform concentration inequalities to compute confidence sets on the reward and transition distributions for each state-action pair. Finally, KL-UCRL studies ergodic MDPs and establishes a high-probability regret bound of O( √( S Σ_{s,a} V⋆_{s,a} T ) ), where V⋆_{s,a} is the variance of the bias function with respect to the next-state distribution following action a in state s.

To measure the performance of our method empirically, we consider several classic game environments: RiverSwim, 4-room, and three-state. RiverSwim is one of the most widely used benchmarks for online learning algorithms and was first proposed by Strehl & Littman (2008). The basic RiverSwim consists of six states. The agent starts from a random state, and the two actions available are to swim left or right, while the current pushes the agent to the left. The agent receives a much larger reward for swimming upstream and reaching the rightmost state. In our experiments we use two enhanced versions: RiverSwim25-Biclass, a 25-state communicating RiverSwim environment whose middle-state transition probabilities are cut into two subsets, and RiverSwimErgo50, a 50-state ergodic RiverSwim environment. The three-state environment was first proposed in Fruit et al. (2018b) as the benchmark for SCAL; it is an environment with random rewards that contains three states and two actions. 4-room is a classic reinforcement learning environment in which the agent must navigate a maze composed of four rooms interconnected by four gaps in the walls; to obtain a reward, the agent must reach the green goal square. In the experiments, we use the cumulative reward as the metric.
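For reference, the six-state RiverSwim dynamics can be sketched as below. The transition and reward numbers are the commonly used ones, stated here as assumptions rather than taken from the paper's experimental setup; the 25- and 50-state variants extend the same chain.

```python
import random

class RiverSwim:
    """Six-state RiverSwim sketch.  The probabilities and rewards below are
    the commonly used values, stated as assumptions, not the paper's exact
    configuration."""

    def __init__(self, n=6, seed=0):
        self.n, self.rng, self.s = n, random.Random(seed), 0

    def step(self, a):
        s, r = self.s, 0.0
        if a == 0:                        # swim left: always succeeds
            if s == 0:
                r = 0.005                 # small reward for staying leftmost
            self.s = max(s - 1, 0)
        else:                             # swim right: fights the current
            if s == self.n - 1:
                r = 1.0                   # large reward at the rightmost state
            u = self.rng.random()
            if u < 0.35:
                self.s = min(s + 1, self.n - 1)
            elif u < 0.95:
                self.s = s                # held in place by the current
            else:
                self.s = max(s - 1, 0)
        return self.s, r

env = RiverSwim()
total = sum(env.step(1)[1] for _ in range(10_000))   # always-right rollout
```

The small left-end reward versus the large right-end reward is what makes RiverSwim a hard exploration benchmark: a greedy agent settles for the left-end trickle instead of fighting the current.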
Figures 1 and 2 show that RWPSP tends to perform better than the other state-of-the-art methods in high-dimensional games like RiverSwim25-Biclass and RiverSwimErgo50, which matches our theoretical analysis, since RWPSP is designed to discover the long-term average reward in finite-horizon MDPs. Our algorithm also performs well in the three-state case and surpasses the performance of SCAL. We observe that RWPSP shows significant improvements over the other methods in RiverSwim25-Biclass, RiverSwimErgo50, and 4-room. This is because our total regret bound Õ(Γ√T/S²) indicates that the regret decreases as the number of states of the environment increases. Thus, RWPSP performs well on complex online learning environments.

7. CONCLUSION

In this work, we propose a policy-based posterior sampling method that achieves the best total regret bound Õ(Γ√T/S²) in finite-horizon stochastic MDPs. The algorithm provides a new way to trade off exploration and exploitation by sampling from posterior distributions over policies; the posterior policy can be updated to balance between the long-term policy and the current greedy policy. Our study shows that this posterior sampling method outperforms other optimization algorithms both theoretically and empirically. Although sampling methods are known to be efficient in discrete environments, our work shows that they can be further improved with count functions and reward re-weighting for posterior updates. It remains unknown whether similar ideas apply to continuous environments as well, which we leave to future work; for example, one may use a metric to accommodate the difference between states of a continuous space, and then apply our algorithm to such environments.

A DETAILS OF PROOFS

This appendix provides the complete proofs of the previous lemmas and theorems.

A.1 THE CONVERGENCE OF PI

Lemma A.1. Under the update algorithm RWPI, the average reward is monotonically increasing.

Proof. From Algorithm 2, we can deduce the update rule of the average reward:

J_{π_t}(θ) − J_{π_{t−1}}(θ) = (W_t − 1) J_{π_{t−1}}(θ) + (1 − W_t) J_{π*}(s, θ) = (1 − W_t)( J_{π*}(s, θ) − J_{π_{t−1}}(θ) ).

When J_{π*}(s, θ) ≥ J_{π_{t−1}}(θ), we have log( J_{π_t}(θ) / J_{π*}(s, θ) ) ≤ 0, so the posterior weight W_t is at most 1; the converse also holds, and the factor 1 − W_t ≤ 0 when J_{π*}(s, θ) ≤ J_{π_{t−1}}(θ). Therefore,

J_{π_t}(θ) − J_{π_{t−1}}(θ) = (1 − W_t)( J_{π*}(s, θ) − J_{π_{t−1}}(θ) ) ≥ 0,

i.e., the sequence J_{π_t}(θ) is monotonically increasing in time step t.

Lemma A.2. Suppose Assumptions 3.1 and 3.2 hold for some stochastic MDP M. Let u_i be the state value in iteration i, and define N as the maximum iteration number of the algorithm. Then π_{t_k} is ϵ-optimal after N iterations.

Proof. Define D = min_s { µ_{i+1}(π) − µ_i(π) } and U = max_s { µ_{i+1}(π) − µ_i(π) }. Then we can deduce:

D + µ_N(π) ≤ µ_{N+1} ≤ W_i µ_N + (1 − W_i) π*_i(s, θ) ≤ W_i µ_N + (1 − W_i)( r_N + θ u_N ).

Since 0 < W_i ≤ 1, the above yields D ≤ (1 − W_i) J_{π_i}(θ). Based on the definition in the preliminaries, let π⋆ be the optimal policy over all states, i.e., π⋆ := ∪_{s∈S} π⋆_i(s, θ). Then D ≤ (1 − W_i) J_{π_i}(θ) ≤ (1 − W_i) J_{π⋆}(θ). In a similar way, we can prove U ≥ (1 − W_i) J_{π⋆}(θ). From the stopping criterion of the policy iteration algorithm, we may assume U − D ≤ (1 − W_i) ϵ. Therefore,

U ≤ D + (1 − W_i) ϵ ≤ (1 − W_i) J_{π_i}(θ) + (1 − W_i) ϵ = (1 − W_i)( J_{π_i}(θ) + ϵ ),
(1 − W_i) J_{π⋆} ≤ (1 − W_i)( J_{π_i}(θ) + ϵ ),
J_{π⋆} ≤ J_{π_i}(θ) + ϵ.

Hence the stationary policy π is ϵ-optimal after N iterations.

A.2 REGRET BOUND ANALYSIS

Lemma A.3. log( J_{π*}(θ) / J_{π_t}(θ) ) ≤ ϵ/Γ.

Proof. First, we multiply by J_{π_t}(θ) in order to construct the inequality. Let J_{π_t}(θ) = n and ϵ = x. Then

lim_{n→+∞} (1 + x/n)^n = lim_{n→+∞} e^{n ln(1 + x/n)} = e^{ lim_{n→+∞} ln(1 + x/n) / (1/n) }.

Applying L'Hôpital's rule,

lim_{n→+∞} (1 + x/n)^n = e^{ lim_{n→+∞} [ (−x/n²) · 1/(1 + x/n) ] / ( −1/n² ) } = e^{ lim_{n→+∞} x / (1 + x/n) } = e^x.

Next, we show that (1 + x/n)^n is monotonically increasing in n. Applying the arithmetic-geometric mean inequality to n copies of (1 + x/n) and one copy of 1,

( (1 + x/n)^n · 1 )^{1/(n+1)} ≤ ( n(1 + x/n) + 1 ) / (n + 1) = 1 + x/(n + 1),

so (1 + x/n)^n ≤ (1 + x/(n+1))^{n+1}. Being increasing and convergent to e^x, the sequence satisfies (1 + x/n)^n ≤ e^x, i.e., n ln(1 + x/n) ≤ x. Therefore,

J_{π_t}(θ) log( J_{π*}(θ) / J_{π_t}(θ) ) ≤ ϵ.

Since Γ upper-bounds the average reward, dividing by J_{π_t}(θ) and bounding it by Γ proves the lemma.

Lemma A.4. Under Assumption 3.3, for each stationary near-optimal policy π and epoch counter k ≥ 1, let ρ(x) satisfy ρ(x) := O(√(log log x)). The following upper bound holds for the negative log-density:

−log W_{t_k}(π) ≤ (ϵ/Γ) |S|² ( ρ(k_π) √k_π + k_π τ̄_{t_k,k_π} ).

Proof. When W_{t_k} ≤ 1, we have:

W_{t_k}(θ) := exp{ Σ_{π,s1,s2} H(N_π(k), π) log( J_{π_t}(θ) / J_{π*}(θ) ) }.

Based on the definition of the counter H, we can expand the posterior weight in a single epoch:

W_{t_k}(θ) = exp{ Σ_{t=1}^{∞} 1{ π_{e(t)} = π, (S_t, S_{t+1}) = (s_1, s_2), N(e(t)) ≤ k } log( J_{π_t}(θ) / J_{π*}(θ) ) }
= exp{ Σ_{π∈Π} Σ_{(s1,s2)∈S²} Σ_{t=1}^{T} 1{ π_{e(t)} = π, (S_t, S_{t+1}) = (s_1, s_2) } log( J_{π_t}(θ) / J_{π*}(θ) ) }
= exp{ N_π(t) Σ_{(s1,s2)∈S²} [ Σ_{τ=0}^{t−1} 1{ π_{e(τ)} = π, (S_τ, S_{τ+1}) = (s_1, s_2) } / N_π(t) ] log( J_{π_t}(θ) / J_{π*}(θ) ) },

where N_π(t) := Σ_{τ=0}^{t−1} 1{ π_{e(τ)} = π } is the total number of time instants during the period t at which policy π was conducted.
When Assumption 3 holds, $N_\pi(t) = \tau_{\pi_{t_k},N_\pi(k)}$, where $N_\pi(k) := \sum_{k'=0}^{K}\sum_{\pi\in\Pi}\mathbb{1}\{\pi_{e(k')}=\pi\}$ counts the epochs in which policy $\pi$ was chosen (we abbreviate $N_\pi(k) = k_\pi$ and $\tau_{\pi_{t_k},N_\pi(k)} = \tau_{t_k,k_\pi}$). Therefore:
$$-\log W_{t_k}(\pi) = -N_\pi(t)\sum_{(s_1,s_2)\in S^2}\frac{\sum_{t'=0}^{t-1}\mathbb{1}\{\pi_{e(t')}=\pi,\ (S_{t'},S_{t'+1})=(s_1,s_2)\}}{N_\pi(t)}\,\log\frac{J_{\pi_t}(\theta)}{J_{\pi^*}(\theta)}$$
$$= -\sum_{(s_1,s_2)\in S^2}\tau_{t_k,k_\pi}\,\frac{H_{(s_1,s_2)}(\tau_{t_k,k_\pi},\pi)}{\tau_{t_k,k_\pi}}\,\log\frac{J_{\pi_t}(\theta)}{J_{\pi^*}(\theta)}$$
$$= \sum_{(s_1,s_2)\in S^2}\Big(\frac{H_{(s_1,s_2)}(\tau_{t_k,k_\pi},\pi)}{\tau_{t_k,k_\pi}} - \frac{k_\pi}{\tau_{t_k,k_\pi}}\theta_\pi(s_1|s_2)\Big)\log\frac{J_{\pi^*}(\theta)}{J_{\pi_t}(\theta)} + \sum_{(s_1,s_2)\in S^2}\frac{k_\pi}{\tau_{t_k,k_\pi}}\,\theta(s_1|s_2)\,\log\frac{J_{\pi^*}(\theta)}{J_{\pi_t}(\theta)}.$$
Bounding the two terms, the first with the count deviation $\rho(k_\pi)\sqrt{k_\pi}$ and the second with $\sum_{s_1,s_2}\theta(s_1|s_2) \le |S|^2$, then applying Lemma A.3 to $\log\frac{J_{\pi^*}(\theta)}{J_{\pi_t}(\theta)}$:
$$-\log W_{t_k}(\pi) \le \sum_{(s_1,s_2)\in S^2}\rho(k_\pi)\sqrt{k_\pi}\,\log\frac{J_{\pi^*}(\theta)}{J_{\pi_t}(\theta)} + \frac{k_\pi}{\tau_{t_k,k_\pi}}\sum_{(s_1,s_2)\in S^2}\theta(s_1|s_2)\log\frac{J_{\pi^*}(\theta)}{J_{\pi_t}(\theta)} \le \frac{\epsilon}{\Gamma}|S|^2\Big(\rho(k_\pi)\sqrt{k_\pi} + \frac{k_\pi}{\tau_{t_k,k_\pi}}\Big).$$

Lemma A.5. The difference between the local optimal average reward and the instantaneous average reward can be bounded by $|J^* - J_{\pi_t}| \le \tilde O(\frac{1}{\sqrt{T}})$.

Proof. The current policy probability distribution is updated from the previous distribution and the current local optimal policy distribution:
$$\mu_t(\pi) = W_t\mu_{t-1}(\pi) + (1 - W_t)\pi^*_t(s,\theta_{t_k}). \tag{25}$$
Extending this result to the reward function:
$$J_{\pi_t} = W_t J_{\pi_{t-1}} + (1 - W_t)J^*(\theta_t)$$
$$J_{\pi_t}^2 = W_t^2 J_{\pi_{t-1}}^2 + (1 - W_t)^2 J^{*2} + 2W_t(1 - W_t)J^*J_{\pi_{t-1}} \le W_t^2 J_{\pi_t}^2 + (1 - W_t)^2 J^{*2} + 2W_t(1 - W_t)J^*J_{\pi_t}. \tag{26}$$
The inequality is based on the monotonicity of the algorithm (Lemma A.1). Simplifying Equation 26:
$$\big(1 - W_t^2\big)J_{\pi_t}^2 \le (1 - W_t)^2 J^{*2} + 2W_t(1 - W_t)J^*J_{\pi_t}$$
$$(1 + W_t)J_{\pi_t}^2 \le (1 - W_t)J^{*2} + 2W_tJ^*J_{\pi_t}$$
$$J_{\pi_t}^2 + W_tJ_{\pi_t}^2 \le J^{*2} - W_tJ^{*2} + 2W_tJ^*J_{\pi_t}$$
$$W_t\big(J_{\pi_t}^2 + J^{*2}\big) \le J^{*2} - J_{\pi_t}^2 + 2W_tJ^*J_{\pi_t}$$
$$J_{\pi_t}^2 + J^{*2} \le \frac{1}{W_t}\big(J^{*2} - J_{\pi_t}^2\big) + 2J^*J_{\pi_t}.$$
Based on the definition of the per-step regret, the instantaneous regret can be bounded as:
$$(J_{\pi_t} - J^*)^2 = J_{\pi_t}^2 + J^{*2} - 2J_{\pi_t}J^* \le \frac{1}{W_t}\big(J^{*2} - J_{\pi_t}^2\big) + 2J^*J_{\pi_t} - 2J_{\pi_t}J^* = \frac{1}{W_t}\big(J^* - J_{\pi_t}\big)\big(J^* + J_{\pi_t}\big).$$
Therefore:
$$|J_{\pi_t} - J^*| \le \frac{1}{W_t}\,|J_{\pi_t} + J^*|.$$
From Lemma A.4, $-\log W_{t_k}(\pi)$ is bounded by $B$, with $B = \frac{\epsilon}{\Gamma}|S|^2\big(\rho(k_\pi)\sqrt{k_\pi} + \frac{k_\pi}{\tau_{t_k,k_\pi}}\big)$. We can therefore construct the following inequalities:
$$W_{t_k} - 1 \ge \log W_{t_k} \ge -B \quad\Longrightarrow\quad \frac{1}{W_{t_k}} \le \frac{1}{1 - B}.$$
The factor $B$ is proportional to the parameter $k_\pi$, which is bounded by the total number of episodes within the horizon $T$. Therefore $\frac{1}{W_t}$ can be bounded in terms of $T$ (ignoring constants):
$$\frac{1}{W_t} \le \frac{1}{1 - \frac{\epsilon}{\Gamma}|S|^2\big(\rho(k_\pi)\sqrt{k_\pi} + \frac{k_\pi}{\tau_{t_k,k_\pi}}\big)}.$$
Since the average reward function is bounded by $\Gamma$ by definition, the difference between the local optimal average reward and the instantaneous average reward is bounded by:
$$|J^* - J_{\pi_t}| = |J_{\pi_t} - J^*| \le \frac{1}{W_t}|J^* + J_{\pi_t}| \le \frac{2\Gamma}{1 - \frac{\epsilon}{\Gamma}|S|^2\big(\rho(k_\pi)\sqrt{k_\pi} + \frac{k_\pi}{\tau_{t_k,k_\pi}}\big)} \le \tilde O\Big(\frac{2\Gamma}{S^2\sqrt{T}}\Big).$$

Lemma A.6. $\frac{\Gamma}{S^2} < D\sqrt{SA}$ when $|S| \ge 2$ or $|A| \ge 2$.

Proof. Since $\Gamma$ is defined as the upper bound of the average reward, $\Gamma \ge J_{\pi^*}(\theta^*)$, while the diameter satisfies $D \ge \max T^{\pi_{t_k}}_{s\to s'}(\theta_{t_k})$. Hence:
$$\Gamma \le \max T^{\pi^*}_{s\to s'}(\theta^*) \le \max T^{\pi_{t_k}}_{s\to s'}(\theta_{t_k}) \le D.$$
It follows that:
$$\frac{\Gamma}{S^2} \le \frac{D}{S^2} \le D\sqrt{SA} \iff D \le DS^2\sqrt{SA} \iff S^2\sqrt{SA} \ge 1.$$
The last condition holds whenever the MDP has more than one state or more than one action, which proves the lemma.

Theorem A.7. The first part of the regret over the time horizon $T$ is bounded by $\mathrm{Reg}^1_T \le \tilde O\big(\frac{\sqrt{T}}{S^2}\big)$.

Proof. From the earlier definition, $\mathrm{Reg}^1_T$ can be represented as:
$$\mathrm{Reg}^1_T = T\,J_{\pi_k}(\theta_K) - \sum_{t=1}^{T}J_{\pi_t}(\theta_K).$$
Since this theorem does not involve the transformation of the transition probability, we write $J_\pi(\theta) = J_\pi$. The argument proceeds from the update rule of the posterior distribution $\mu_{t+1}(\pi)$ of policy $\pi$.
We divide the average reward into several parts. At time step $t = T$, the instantaneous regret equals zero:
$$\mathrm{Reg}^1_{t_T} = J_{\pi_k} - J_{\pi_T} = 0.$$
At time step $t = T-1$, define the local optimal average reward as $J^*_\pi$ (note that this local optimal value is virtual). The instantaneous regret can be represented as:
$$\mathrm{Reg}^1_{t_{T-1}} = J_{\pi_k} - J_{\pi_{T-1}} = W_{t-1}J_{\pi_{T-1}} + (1 - W_{t-1})J^*_\pi - J_{\pi_{T-1}} = (W_{t-1} - 1)J_{\pi_{T-1}} + (1 - W_{t-1})J^*_\pi = (1 - W_{t-1})\big(J^*_\pi - J_{\pi_{T-1}}\big).$$
In a similar fashion, at time step $t = T-2$:
$$\mathrm{Reg}^1_{t_{T-2}} = J_{\pi_k} - J_{\pi_{T-2}} = J_{\pi_k} - J_{\pi_{T-1}} + J_{\pi_{T-1}} - J_{\pi_{T-2}} = (1 - W_{t-1})\big(J^*_\pi - J_{\pi_{T-1}}\big) + (1 - W_{t-2})\big(J^*_\pi - J_{\pi_{T-2}}\big).$$
Based on Lemma A.5, the difference between the local optimal value and the current average reward is bounded by $|J^*_\pi - J_{\pi_t}| \le \tilde O(\frac{1}{\sqrt{T}})$. Sub-optimal models are sampled only when their posterior probability exceeds $\frac{1}{T}$, which ensures the time complexity of the Thompson sampling step is no more than $O(1)$. We can thus bound the total first-part regret over the horizon $T$:
$$\mathrm{Reg}^1_T = \frac{1}{T}\big(\mathrm{Reg}^1_{t_{T-1}} + \mathrm{Reg}^1_{t_{T-2}} + \cdots + \mathrm{Reg}^1_{t_1}\big) \le \tilde O\Big(\frac{2\Gamma}{S^2\sqrt{T}}\Big)\Big(\frac{T-1}{T} + \frac{T-2}{T} + \cdots + \frac{1}{T}\Big) \le \tilde O\Big(\frac{2\Gamma\sqrt{T}}{S^2}\Big).$$
To derive the second regret bound, generated by the transition probability, we analyze the algorithm's performance over the $T$ time steps. Define the number of macro episodes $M = \sum_k \mathbb{1}\{t_k \le T\}$, where an episode is the set of time steps between two triggers of the stopping criterion. We first bound the number of episodes.

Lemma A.8. Under the stopping criterion, the number of episodes $M$ is bounded by $M \le SA\log(T)$ (Wei et al., 2017).

Proof. The stopping criterion is triggered whenever the number of visits to the initial state-action pair doubles, so:
$$M_{(s,a)} = \big\{k \le K_T : N_{t_k}(s,a) > 2N_{t_{k-1}}(s,a)\big\}. \tag{41}$$
The number of visits to the state-action pair $(s,a)$ thus doubles at the beginning of every epoch $k \in M_{(s,a)}$.
The size of $M_{(s,a)}$ is therefore no larger than $O(\log T)$. Assume for contradiction that $|M_{(s,a)}| \ge \log(N_{T+1}(s,a)) + 1$. Multiplying the doubling ratios over all epochs then forces $N_{t_{K_T}}(s,a) > N_{T+1}(s,a)$ (Equation 42), which contradicts $N_{t_{K_T}}(s,a) \le N_{T+1}(s,a)$. Hence $|M_{(s,a)}| \le \log(N_{T+1}(s,a))$, and since the logarithmic function is concave, summing over all state-action pairs yields $M \le SA\log(T)$.

Lemma A.9. The total number of episodes over the time horizon $T$ is bounded by $K_T \le \sqrt{2SAT\log(T)}$.

(Wei et al., 2017)

Proof. Define macro episodes with start times $t_{n_i}$, $i = 1, 2, \dots$, where $t_{n_1} = t_1$ and
$$t_{n_{i+1}} = \min\big\{t_k > t_{n_i} : N_{t_k}(s,a) > 2N_{t_{k-1}}(s,a)\big\}.$$
Let $\tilde T_i = \sum_{k=n_i}^{n_{i+1}-1} T_k$ be the length of the $i$th macro episode. Within the $i$th macro episode, $T_k = T_{k-1} + 1$ for all $k = n_i, n_i + 1, \dots, n_{i+1} - 2$. Hence:
$$\tilde T_i = \sum_{k=n_i}^{n_{i+1}-1} T_k = \sum_{j=1}^{n_{i+1}-n_i-1}\big(T_{n_i-1} + j\big) + T_{n_{i+1}-1} \ge \sum_{j=1}^{n_{i+1}-n_i-1}(j+1) + 1 = 0.5\,(n_{i+1}-n_i)(n_{i+1}-n_i+1). \tag{45}$$
Consequently, $n_{i+1} - n_i \le \sqrt{2\tilde T_i}$ for all $i = 1, \dots, M$. From this property we obtain:
$$K_T = n_{M+1} - 1 = \sum_{i=1}^{M}\big(n_{i+1} - n_i\big) \le \sum_{i=1}^{M}\sqrt{2\tilde T_i}. \tag{46}$$
Based on Equation 46 and $\sum_{i=1}^{M}\tilde T_i = T$, we get:
$$K_T \le \sum_{i=1}^{M}\sqrt{2\tilde T_i} \le \sqrt{M\sum_{i=1}^{M}2\tilde T_i} = \sqrt{2MT}.$$
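The counting argument behind Lemmas A.8-A.9 can be illustrated with a short simulation. In the sketch below, episode lengths grow by one within each macro episode (the schedule $T_k = T_{k-1}+1$) and reset at each macro-episode start; splitting the horizon $T$ evenly across the $M$ macro episodes is our own simplifying assumption. The resulting episode count stays below $\sqrt{2MT}$ up to integer-rounding slack.

```python
import math

def episode_count(T: int, M: int) -> int:
    """Count episodes when lengths follow T_k = T_{k-1} + 1 inside each of
    M macro episodes, resetting to 1 at every macro-episode start."""
    t = k = 0
    budget = T // M  # hypothetical even time budget per macro episode
    while t < T:
        length, used = 1, 0
        while t < T and used < budget:
            t += length
            used += length
            k += 1
            length += 1
    return k

for T, M in [(10_000, 5), (100_000, 20)]:
    K = episode_count(T, M)
    # Lemma A.9: K_T <= sqrt(2 M T); allow +M slack for rounding effects.
    assert K <= math.sqrt(2 * M * T) + M
```

Within one macro episode of budget $B$ the lengths $1, 2, \dots, m$ satisfy $m(m+1)/2 \ge B$, so $m \lesssim \sqrt{2B}$; summing over $M$ macro episodes with $B = T/M$ recovers $K_T \lesssim M\sqrt{2T/M} = \sqrt{2MT}$.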



In a finite MDP, the reward in each episode is confined to $[0, 1]$.



i.e., $b(\theta, \pi, s) - b(\theta, \pi, s')$. Denote the expected reward under stationary policy $\pi$ by $\bar r(s,\pi) = \mathbb{E}_{a\sim\pi(\cdot|s)}[\bar r(s,a)]$, and the expected transition probability by $p_\theta(s' \mid s, \pi) = \mathbb{E}_{a\sim\pi(\cdot|s)}[p_\theta(s'|s,a)]$. The bias vector then satisfies the Bellman equation below:
$$J_\pi(\theta) + b(\theta,\pi,s) = \bar r(s,\pi) + \sum_{s'\in S} p_\theta(s' \mid s, \pi)\, b(\theta,\pi,s').$$
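The average-reward Bellman equation above can be verified on a toy example. The sketch below uses NumPy; the two-state chain and its numbers are illustrative, not from the paper. It computes the average reward $J$ from the stationary distribution, solves for the bias vector $b$ under the gauge $b(s_0) = 0$, and checks $J + b(s) = \bar r(s,\pi) + \sum_{s'} p_\theta(s'|s,\pi)\,b(s')$.

```python
import numpy as np

# Toy two-state chain under a fixed policy pi:
# P[s, s'] = p_theta(s' | s, pi), r[s] = rbar(s, pi).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([0.0, 1.0])

# Stationary distribution mu solves mu P = mu; then J = mu . r.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()
J = float(mu @ r)

# The bias b solves (I - P) b = r - J up to an additive constant (the null
# space of I - P is the constant vector). Pin the gauge b[0] = 0 by swapping
# the redundant first column for e_0; its coefficient comes out 0 because
# the system is consistent.
A = np.eye(2) - P
A[:, 0] = [1.0, 0.0]
b = np.linalg.solve(A, r - J)
b[0] = 0.0

# Check the Bellman equation J + b(s) = r(s) + sum_s' P[s, s'] b(s').
assert np.allclose(J + b, r + P @ b, atol=1e-8)
```

For these numbers $\mu = (2/3, 1/3)$, $J = 1/3$, and $b = (0, 10/3)$, so both sides of the Bellman equation equal $(1/3,\ 11/3)$.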

Figure 1: RiverSwim25-biClass

The last equality is based on the logarithmic property $\log\frac{A}{B} = -\log\frac{B}{A}$. Based on Assumption 3, define $\rho(x) := O(\sqrt{\log\log x})$.

$$N_{t_{K_T}}(s,a) = \prod_{k\le K_T,\ N_{t_{k-1}}(s,a)\ge 1}\frac{N_{t_k}(s,a)}{N_{t_{k-1}}(s,a)} > \prod_{k\in M_{(s,a)},\ N_{t_{k-1}}(s,a)\ge 1} 2 \ge N_{T+1}(s,a). \tag{42}$$
This contradicts the fact that $N_{t_{K_T}}(s,a) \le N_{T+1}(s,a)$, which leads to $|M_{(s,a)}| \le \log(N_{T+1}(s,a))$. Therefore we obtain the bound $SA\log(T/SA)$ on the number of episodes.



The second inequality above is based on the Cauchy-Schwarz inequality. From Lemma A.8, the number of macro episodes until time $T$ is bounded by $M \le SA\log(T)$. Therefore the lemma is proved.

Theorem A.10. The regret caused by transition matrix sampling can be bounded as follows.

Proof. In this theorem we focus on the difference between transition probabilities. From the earlier definition of the Bellman iterator of the average reward, the difference between the average reward under the near-optimal transition probability and under the instantaneous transition probability can be decomposed into two terms. The first term is bounded by the largest difference between states, and based on Equation 50, the second term is bounded in a similar way. We then define the total transition difference (Equation 52), from which the one-step transition difference can be deduced (Equation 53). Based on Assumption 3, the one-step transition difference is bounded (Equation 54). When the stationary policy $\pi$ is used in epoch $k$, the count function of $\pi$ is at most the total number of epochs; therefore, again by Assumption 3, the bound on $\sum_{s_1,s_2}(\theta_t - \theta_{t-1})$ follows (Equation 57). Combining Equation 57 with Equation 52, and since the update of the transition matrix happens only once in each epoch, the difference between the periodic transition matrix and the instantaneous transition matrix follows from Lemma A.9 (Equation 59).

