ONLINE RESTLESS BANDITS WITH UNOBSERVED STATES

Abstract

We study the online restless bandit problem, where each arm evolves independently according to a Markov chain and the reward of pulling an arm depends on both the action and the current state of the corresponding Markov chain. The agent (decision maker) knows neither the transition kernels nor the reward functions, and cannot observe the states of the arms even after pulling. The goal is to sequentially choose which arms to pull so as to maximize the expected cumulative reward. In this paper, we propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore-Then-Commit. The algorithm proceeds in episodes of increasing length, and each episode is divided into an exploration phase and an exploitation phase. In the exploration phase of each episode, action-reward samples are collected in a round-robin way and then used to update the posterior, a mixture of Dirichlet distributions. At the beginning of the exploitation phase, TSEETC draws a sample from the posterior distribution and treats it as the true parameters, then follows the optimal policy for the sampled model for the rest of the episode. We establish a Bayesian regret bound of Õ(√T) for TSEETC, where T is the time horizon. This is the first bound that is close to the lower bound for restless bandits, especially in the unobserved-state setting. We show through simulations that TSEETC outperforms existing algorithms in regret.

1. INTRODUCTION

The restless multi-armed bandit (RMAB) problem is a general framework for modeling many sequential decision-making problems, ranging from wireless communication (Tekin & Liu, 2011; Sheng et al., 2014) and sensor/machine maintenance (Ahmad et al., 2009; Akbarzadeh & Mahajan, 2021) to healthcare (Mate et al., 2020; 2021). The problem involves one agent and N arms. Each arm i is modulated by a Markov chain M^i with state transition function P^i and reward function R^i. At each time, the agent decides which arm to pull. After the pull, all arms undergo an action-dependent Markovian state transition. The goal is to decide which arm to pull so as to maximize the expected cumulative reward E[Σ_{t=1}^T r_t], where r_t is the reward at time t and T is the time horizon. In this paper, we consider the online restless bandit problem with unknown parameters (transition and reward functions) and unobserved states. Many works concentrate on learning the unknown parameters (Liu et al., 2010; 2011; Ortner et al., 2012; Wang et al., 2020; Xiong et al., 2022a;b) while ignoring the possibility that the states are also unknown. Unobserved states are common in real-world applications such as cache access (Paria & Sinha, 2021) and recommender systems (Peng et al., 2020). In the cache access problem, the user observes only the perceived delay and cannot tell whether the requested content was stored in the cache before or after the access. Likewise, in a recommender system, we do not know the user's preference over the items. Some studies do consider unobserved states, but they often assume known parameters (Mate et al., 2020; Meshram et al., 2018; Akbarzadeh & Mahajan, 2021) or lack theoretical results (Peng et al., 2020; Hu et al., 2020), and the existing algorithms with theoretical guarantees (Zhou et al., 2021; Jahromi et al., 2022) do not match the lower regret bound of RMABs (Ortner et al., 2012).
One common way to handle unknown parameters with observed states is the optimism in the face of uncertainty (OFU) principle (Liu et al., 2010; Ortner et al., 2012; Wang et al., 2020). The regret bounds in these works are sometimes weak because the baseline they consider, such as always pulling a fixed arm (Liu et al., 2010), is not optimal for the RMAB problem. Ortner et al. (2012) derive the lower bound Õ(√T) for the RMAB problem. However, it is unclear whether there is a computationally efficient method to find the optimistic model within the confidence region (Lakshmanan et al., 2015). Another way to estimate the unknown parameters is Thompson Sampling (TS) (Jung & Tewari, 2019; Jung et al., 2019; Jahromi et al., 2022; Hong et al., 2022). A TS algorithm does not need to solve for all instances lying within the confidence sets, as OFU-based algorithms do (Ouyang et al., 2017). Moreover, empirical studies suggest that TS algorithms outperform OFU-based algorithms in bandit and Markov decision process (MDP) problems (Scott, 2010; Chapelle & Li, 2011; Osband & Van Roy, 2017). Some studies assume that only the states of pulled arms are observable (Mate et al., 2020; Liu & Zhao, 2010; Wang et al., 2020; Jung & Tewari, 2019). They translate the partially observable Markov decision process (POMDP) into a fully observable MDP by treating the last observed state and the time elapsed since then as a meta-state (Mate et al., 2020; Jung & Tewari, 2019), which is much simpler thanks to the additional observations of pulled arms. Mate et al. (2020) and Liu & Zhao (2010) derive optimal index policies, but they assume known parameters. Restless-UCB (Wang et al., 2020) achieves a regret bound of Õ(T^{2/3}), which does not match the Õ(√T) lower bound, and is also restricted to a specific Markov model.
There are also works in which an arm's state is not visible even after pulling (Meshram et al., 2018; Akbarzadeh & Mahajan, 2021; Peng et al., 2020; Hu et al., 2020; Zhou et al., 2021; Yemini et al., 2021), as well as the classic POMDP setting (Jahromi et al., 2022). However, several challenges remain unresolved. First, Meshram et al. (2018) and Akbarzadeh & Mahajan (2021) study the RMAB problem with unobserved states but known parameters, while the true parameter values are often unavailable in practice. Second, works that study RMABs from a learning perspective, e.g., Peng et al. (2020) and Hu et al. (2020), offer no regret analysis. Third, existing policies with regret guarantees (Zhou et al., 2021; Jahromi et al., 2022) achieve Õ(T^{2/3}), which does not match the Õ(√T) lower bound of the RMAB problem (Ortner et al., 2012). Yemini et al. (2021) consider arms modulated by two unobserved states with linear rewards; this linear structure is substantial side information that the decision maker can exploit, and a problem-dependent log(T) bound is given. To the best of our knowledge, there are no provably optimal policies that perform close to the offline optimum and match the lower bound for restless bandits, especially in the unobserved-state setting. The unobserved states bring several challenges. First, we must control the estimation error of the states, which are not directly observed. Second, this error depends on the model parameters in a complex way via Bayesian updating, and the parameters themselves are unknown. Third, since the states are not fully observable, the decision maker cannot keep track of the number of visits to state-action pairs, a quantity that is crucial in the theoretical analysis.
We design a learning algorithm, TSEETC, to estimate the unknown parameters, and, benchmarked against a stronger oracle, we show that it achieves a tighter regret bound. In summary, we make the following contributions. Problem formulation. We consider online restless bandit problems with unobserved states and unknown parameters. Compared with Jahromi et al. (2022), our reward functions are also unknown. Algorithmic design. We propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore-Then-Commit. The learning horizon is divided into episodes of increasing length, and each episode is split into exploration and exploitation phases. In the exploration phase, to estimate the unknown parameters, we update their posterior distributions as a mixture of Dirichlet distributions; for the unobserved states, we use the belief state to encode the historical information. In the exploitation phase, we sample parameters from the posterior distribution and derive an optimal policy based on the sampled parameters. Moreover, we let the deterministic episode length grow over time to control the total number of episodes, which is crucial for bounding the regret caused by exploration. Regret analysis. We consider a stronger oracle, which solves the POMDP based on our belief state, and we define a pseudo count to track the state-action pairs. Under a Bayesian framework, we show that the expected regret of TSEETC accumulated up to time T is bounded by Õ(√T), where Õ hides logarithmic factors. This bound improves on existing results (Zhou et al., 2021; Jahromi et al., 2022). Experimental results. We conduct proof-of-concept experiments and compare our policy with existing baseline algorithms. Our results show that TSEETC outperforms existing algorithms and achieves near-optimal regret.

2. RELATED WORK

We review related work in two main areas: learning algorithms for unknown parameters, and methods for handling unknown states. Unknown parameters. Since the system parameters are unknown in advance, it is essential to study RMAB problems from a learning perspective. Broadly, these works fall into two categories: OFU-based (Ortner et al., 2012; Wang et al., 2020; Xiong et al., 2022a; Zhou et al., 2021; Xiong et al., 2022b) and TS-based (Jung et al., 2019; Jung & Tewari, 2019; Jahromi et al., 2022; Hong et al., 2022). OFU-based algorithms construct confidence sets for the system parameters at each time, find the optimistic estimator associated with the maximum reward, and then select an action based on that estimator. However, these methods may not perform close to the offline optimum because the baseline policy they consider, such as pulling only one arm, is often heuristic and not optimal; in that case the regret bound O(log T) (Liu et al., 2010) is less meaningful. Apart from these works, posterior sampling (Jung & Tewari, 2019; Jung et al., 2019) has been used to solve this problem. A TS algorithm samples a set of MDP parameters from the posterior distribution and then selects actions based on the sampled model. Jung & Tewari (2019) and Jung et al. (2019) provide an Õ(√T) guarantee in the Bayesian setting. TS algorithms have been shown to outperform optimistic algorithms in bandit and MDP problems (Scott, 2010; Chapelle & Li, 2011; Osband & Van Roy, 2017). Unknown states. Some works assume the state of the pulled arm is observed (Mate et al., 2020; Liu & Zhao, 2010; Wang et al., 2020; Jung & Tewari, 2019). Mate et al. (2020) and Liu & Zhao (2010) consider unobserved states but with known parameters. Wang et al. (2020) construct an offline instance and give a regret bound of Õ(T^{2/3}).
Jung & Tewari (2019) consider episodic RMAB problems and guarantee an Õ(√T) regret bound in the Bayesian setting. Some studies assume the states are unobserved even after pulling. Akbarzadeh & Mahajan (2021) and Meshram et al. (2018) consider the RMAB problem with unknown states but known system parameters, and provide no regret guarantee. Peng et al. (2020) and Hu et al. (2020) consider unknown parameters but also provide no theoretical results. The works most similar to ours are Zhou et al. (2021) and Jahromi et al. (2022). Zhou et al. (2021) consider all arms modulated by a common unobserved Markov chain; they propose an estimation method based on the spectral method (Anandkumar et al., 2012) and a learning algorithm based on the upper confidence bound (UCB) strategy (Auer et al., 2002). They obtain a regret bound of Õ(T^{2/3}), leaving a gap to the lower bound Õ(√T) (Ortner et al., 2012). Jahromi et al. (2022) consider the POMDP setting and propose pseudo counts to track the state-action pairs. Their learning algorithm is based on Ouyang et al. (2017), and its regret bound is also Õ(T^{2/3}). Moreover, their algorithm is not implementable because their pseudo counts are conditioned on the true counts, which are unobservable.

3. PROBLEM SETTING

Consider a restless bandit problem with one agent and N arms. Each arm i ∈ [N] := {1, 2, ..., N} is associated with an independent discrete-time Markov chain M^i = (S^i, P^i), where S^i is the state space and P^i ∈ R^{S^i×S^i} is the transition function. Let s_t^i denote the state of arm i at time t and s_t = (s_t^1, s_t^2, ..., s_t^N) the joint state of all arms. Each arm i is also associated with a reward function R^i ∈ R^{S^i×R}, where R^i(r | s) is the probability that the agent receives reward r ∈ R when he pulls arm i in state s. We assume the state spaces S^i and the reward set R are finite and known to the agent. The parameters P^i and R^i, i ∈ [N], are unknown, and the state s_t is also unobserved by the agent. For notational simplicity, we assume all arms share the same state space S of size S; our results generalize straightforwardly to different state spaces. The game is divided into T time steps. The initial state s_1^i of each arm i ∈ [N] is drawn independently from a distribution h^i, which we assume to be known to the agent. At each time t, the agent chooses one arm a_t ∈ [N] to pull and receives a reward r_t ∈ R with probability R^{a_t}(r_t | s_t^{a_t}). Note that only the pulled arm yields reward feedback. The decision on which arm a_t to pull is based on the observed history H_t = [a_1, r_1, a_2, r_2, ..., a_{t-1}, r_{t-1}]. The states of the arms are never observable, even after pulling. Each arm i makes a state transition independently according to P^i, whether it is pulled or not. This process continues until the end of the game. The goal of the agent is to maximize the total expected reward. We use θ to denote the unknown P^i and R^i, i ∈ [N], collectively.
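To make the setting concrete, the following minimal sketch simulates the environment just described: each arm is a Markov chain whose state is never revealed, only the pulled arm emits a reward, and every arm transitions at every step. The class name `HiddenArm` and the example matrices are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class HiddenArm:
    """One restless arm: a Markov chain whose state is never revealed.

    P[s, s'] is the transition probability; R[s, r] is the probability of
    emitting reward index r when the arm is pulled in state s.
    """
    def __init__(self, P, R, h):
        self.P, self.R = np.asarray(P), np.asarray(R)
        self.s = rng.choice(len(h), p=h)  # hidden state, drawn from the prior h

    def step(self, pulled):
        # A reward (if pulled) is emitted from the *current* state ...
        r = rng.choice(self.R.shape[1], p=self.R[self.s]) if pulled else None
        # ... then the arm transitions, whether pulled or not.
        self.s = rng.choice(self.P.shape[1], p=self.P[self.s])
        return r

# Two arms, two states, binary rewards (reward index = reward value here).
P = [[0.7, 0.3], [0.4, 0.6]]
R = [[0.9, 0.1], [0.2, 0.8]]
arms = [HiddenArm(P, R, [0.5, 0.5]) for _ in range(2)]
reward = arms[0].step(pulled=True)   # only the pulled arm gives feedback
arms[1].step(pulled=False)           # the other arm still transitions
```

The agent sees only `reward`; the attribute `s` exists in the simulator but would be hidden from any learning algorithm.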
Since the true states are unobservable, the agent maintains a belief state b_t^i = [b_t^i(s, θ), s ∈ S^i] ∈ Δ_{S^i} for each arm i, where b_t^i(s, θ) := P(s_t^i = s | H_t, θ) and Δ_{S^i} := {b ∈ R_+^{S^i} : Σ_{s∈S^i} b(s) = 1} is the probability simplex in R^{S^i}. Note that b_t^i(s, θ) depends on the unknown model parameter θ, which itself has to be learned by the agent. We aggregate all arms into a single Markov chain M and denote its transition matrix and reward function by P and R, respectively. For a given θ, the overall belief state b_t = (b_t^1, b_t^2, ..., b_t^N) is a sufficient statistic for H_{t-1} (Smallwood & Sondik, 1973), so the agent can base his decision at time t on b_t only. Let Δ_b := Δ_{S^1} × ... × Δ_{S^N}. A deterministic stationary policy π : Δ_b → [N] maps a belief state to an action. The long-term average reward of a policy π is defined as

J_π(h, θ) := lim sup_{T→∞} (1/T) E[ Σ_{t=1}^T r_t | h, θ ].

We use J(h, θ) = sup_π J_π(h, θ) to denote the optimal long-term average reward. As in Jahromi et al. (2022), we assume J(h, θ) is independent of the initial distribution h and denote it by J(θ). We make the following assumptions.

Assumption 1. The smallest element ε_1 of the transition functions P^i, i ∈ [N], is strictly positive.

Assumption 2. The smallest element ε_2 of the reward functions R^i, i ∈ [N], is strictly positive.

Assumptions 1 and 2 are strong in general, but they help us bound the belief estimation error (De Castro et al., 2017). Assumption 1 also makes the MDP weakly communicating (Bertsekas et al., 2011).
For a weakly communicating MDP, it is known that there exists a bounded function v(·, θ) : Δ_b → R such that for all b ∈ Δ_b (Bertsekas et al., 2011),

J(θ) + v(b, θ) = max_a { r(b, a) + Σ_r P(r | b, a, θ) v(b′, θ) },   (2)

where v is the relative value function, r(b, a) = Σ_s Σ_r b^a(s, θ) R^a(r | s) r is the expected reward, b′ is the updated belief after observing reward r, and P(r | b, a, θ) is the probability of observing r in the next step, conditioned on the current belief b and action a. The corresponding optimal policy is the maximizer of the right-hand side of equation 2. Since the value function v(·, θ) is finite, we can bound the span function sp(θ) := max_b v(b, θ) − min_b v(b, θ) as in Zhou et al. (2021). We give the details of this bound in Proposition 1 and denote the bound by H. We consider Bayesian regret. The parameter θ* is randomly generated from a known prior distribution Q at the beginning and then fixed, but unknown to the agent. We measure the efficiency of a policy π by its regret, defined as the expected gap between the cumulative reward of an offline oracle and that of π, where the oracle is the optimal policy with full knowledge of θ* but unobserved states. This oracle is similar to that of Zhou et al. (2021) and is stronger than those considered in Azizzadenesheli et al. (2016) and Fiez et al. (2018). We focus on the Bayesian regret of policy π (Ouyang et al., 2017; Jung & Tewari, 2019):

R_T := E_{θ*∼Q}[ Σ_{t=1}^T (J(θ*) − r_t) ].

The expectation is with respect to the prior distribution over θ*, the randomness in state transitions, and the random rewards.
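The two per-step quantities in the Bellman equation, the expected reward r(b, a) and the observation probability P(r | b, a, θ), can be computed directly from the pulled arm's belief. A minimal sketch, with hypothetical two-state numbers:

```python
import numpy as np

def expected_reward(b, R, r_vals):
    """r(b, a) = sum_s sum_r b(s) R(r|s) r, for the pulled arm's belief b."""
    return float(b @ (R @ r_vals))

def obs_prob(b, R, r_idx):
    """P(r | b, a, theta): probability that reward index r is observed next."""
    return float(b @ R[:, r_idx])

b = np.array([0.5, 0.5])                # uniform belief over two states
R = np.array([[0.9, 0.1], [0.2, 0.8]])  # R[s, r]: reward-emission probabilities
r_vals = np.array([0.0, 1.0])           # numeric reward values

er = expected_reward(b, R, r_vals)      # 0.5*0.1 + 0.5*0.8 = 0.45
pr = obs_prob(b, R, 1)                  # 0.5*0.1 + 0.5*0.8 = 0.45
```

With binary 0/1 rewards the two quantities coincide, since the expected reward is exactly the probability of observing reward 1.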

4. THE TSEETC ALGORITHM

In Section 4.1, we define the belief state and show how to update it with new observations. In Section 4.2, we show how to update the posterior distributions under unknown states. In Section 4.3, we present the details of our learning algorithm TSEETC.

4.1. BELIEF ENCODER FOR UNOBSERVED STATE

Here we focus on the belief update for arm i under the true parameters θ*. At time t, the belief that arm i is in state s is b_t^i(s, θ*). If arm i is pulled and the reward r_t is observed, the belief is updated as

b_{t+1}^i(s′, θ*) = [ Σ_s b_t^i(s, θ*) R_*^i(r_t | s) P_*^i(s′ | s) ] / [ Σ_s b_t^i(s, θ*) R_*^i(r_t | s) ],   (4)

where P_*^i(s′ | s) is the probability of transitioning from state s at time t to state s′ and R_*^i(r_t | s) is the probability of obtaining reward r_t in state s. If arm i is not pulled, its belief is updated as

b_{t+1}^i(s′, θ*) = Σ_s b_t^i(s, θ*) P_*^i(s′ | s).   (5)

At each time, we aggregate the beliefs of all arms into b_t. Based on equation 2, we can then derive the optimal action a_t for the current belief b_t.

4.2. MIXTURE OF DIRICHLET DISTRIBUTION

In this section, we estimate the unknown P^i and R^i based on the Dirichlet distribution. The Dirichlet distribution is parameterized by a count vector ϕ = (ϕ_1, ..., ϕ_k), ϕ_i ≥ 0, such that the density is f(p | ϕ) ∝ Π_{i=1}^k p_i^{ϕ_i−1} (Ghavamzadeh et al., 2015). Since the true states are unobserved, all state sequences must be considered, each weighted in proportion to its likelihood (Ross et al., 2011). Denote the reward history of arm i from time t_1 to t_2 by r_{t_1:t_2}^i, the state history by s_{t_1:t_2}^i, and the belief history by b_{t_1:t_2}^i. With this history information, the posterior distributions g_t(P^i) and g_t(R^i) at time t can be updated as in Lemma 1.

Lemma 1. Under the unobserved-state setting, assume the transition function P^i has prior g_0(P^i) = f((P^i − ε_1 1)/(1 − ε_1) | ϕ^i) and the reward function R^i has prior g_0(R^i) = f((R^i − ε_2 1)/(1 − ε_2) | ψ^i). Given the information r_{0:t}^i and b_{0:t}^i, the posterior distributions are

g_t(P^i) ∝ Σ_{s_{0:t}^i ∈ S_i^t} g_0(P^i) w(s_{0:t}^i) Π_{s,s′} ( (P^i(s′ | s) − ε_1)/(1 − ε_1) )^{N_{s,s′}^i(s_{0:t}^i) + ϕ_{s,s′}^i − 1},

g_t(R^i) ∝ Σ_{s_{0:t}^i ∈ S_i^t} g_0(R^i) w(s_{0:t}^i) Π_{s,r} ( (R^i(r | s) − ε_2)/(1 − ε_2) )^{N_{s,r}^i(s_{0:t}^i) + ψ_{s,r}^i − 1},   (6)

where w(s_{0:t}^i) is the likelihood of the state sequence s_{0:t}^i and 1 is the all-ones vector, whose length matches the dimension of P or R as appropriate. This procedure is summarized in Algorithm 1.

Algorithm 1 Posterior Update for R^i(s, ·) and P^i(s, ·)
1: Input: the history length τ_1, the state space S^i, the belief history b_{0:τ_1}^i, the reward history r_{0:τ_1}^i, the initial parameters ϕ_{s,s′}^i, ψ_{s,r}^i for s, s′ ∈ S^i, r ∈ R
2: generate the S_i^{τ_1} possible state sequences
3: calculate the weight w(j) = Π_{t=1}^{τ_1} b_t^i(s_t, θ) for each sequence j ∈ [S_i^{τ_1}]
4: for j = 1, ..., S_i^{τ_1} do
5:   count the occurrences of the events (s, s′) and (s, r) in sequence j as N_{s,s′}^i and N_{s,r}^i
6:   update ϕ_{s,s′}^i ← ϕ_{s,s′}^i + N_{s,s′}^i, ψ_{s,r}^i ← ψ_{s,r}^i + N_{s,r}^i
7:   aggregate the ϕ_{s,s′}^i as ϕ(j) and the ψ_{s,r}^i as ψ(j) for all s, s′ ∈ S^i, r ∈ R
8: end for
9: update the mixture Dirichlet distributions g_{τ_1}(P^i) ∝ Σ_{j=1}^{S_i^{τ_1}} w(j) f((P^i − ε_1 1)/(1 − ε_1) | ϕ(j)) and g_{τ_1}(R^i) ∝ Σ_{j=1}^{S_i^{τ_1}} w(j) f((R^i − ε_2 1)/(1 − ε_2) | ψ(j))

With Algorithm 1, we can update the posterior distribution of the unknown parameters and sample from it to obtain estimates of the true parameters. The belief estimation error can be bounded by the distance between the sampled parameters and the true values (Proposition 2). A theoretical guarantee on the estimation errors of the unknown parameters is provided in Lemma 2.
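The enumeration in Algorithm 1 can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it assumes a reward observation at every step of the history and returns the mixture components (weight, transition counts, emission counts) rather than densities; the function name `mixture_posterior` is hypothetical.

```python
import itertools
import numpy as np

def mixture_posterior(beliefs, rewards, S, phi0, psi0):
    """Enumerate all S^tau state sequences, weight each by its belief
    likelihood w(j) = prod_t b_t(s_t), and return normalized Dirichlet
    mixture components (w, phi, psi)."""
    comps = []
    tau = len(beliefs)
    for seq in itertools.product(range(S), repeat=tau):
        w = float(np.prod([beliefs[t][seq[t]] for t in range(tau)]))
        if w == 0.0:
            continue                      # impossible sequence, zero weight
        phi, psi = phi0.copy(), psi0.copy()
        for t in range(tau - 1):
            phi[seq[t], seq[t + 1]] += 1  # count transition event (s, s')
        for t, r in enumerate(rewards):
            psi[seq[t], r] += 1           # count emission event (s, r)
        comps.append((w, phi, psi))
    Z = sum(w for w, _, _ in comps)
    return [(w / Z, phi, psi) for w, phi, psi in comps]

beliefs = [np.array([0.5, 0.5]), np.array([0.2, 0.8])]  # two time steps
rewards = [1, 0]                                         # observed reward indices
phi0 = np.ones((2, 2))                                   # uniform Dirichlet priors
psi0 = np.ones((2, 2))
comps = mixture_posterior(beliefs, rewards, S=2, phi0=phi0, psi0=psi0)
```

Sampling from the mixture then amounts to picking a component with probability w(j) and drawing P^i and R^i rows from the corresponding Dirichlet distributions (after the ε-shift of Lemma 1). Note the S^τ cost of exact enumeration, which is why τ_1 is kept small.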

4.3. OUR ALGORITHM

Our algorithm, TSEETC, operates in episodes of varying lengths. Each episode is split into an exploration phase and an exploitation phase. Denote the number of episodes by K_T and the first time step of episode k by t_k. We use T_k to denote the length of episode k, determined as T_k = T_1 + k − 1, where T_1 = (√T + 1)/2. The exploration phase in each episode has fixed length τ_1, which satisfies τ_1 K_T = O(√T) and τ_1 ≤ (T_1 + K_T − 1)/2. With these notations, our algorithm is shown below.

Algorithm 2 Thompson Sampling with Episodic Explore-Then-Commit
1: Input: priors g_0(P), g_0(R), initial belief b_0, exploration length τ_1, first episode length T_1
2: for episode k = 1, 2, ... do
3:   set the first time step of episode k: t_k := t
4:   sample R(t_k) ∼ g_{t_{k−1}+τ_1}(R) and P(t_k) ∼ g_{t_{k−1}+τ_1}(P)
5:   for t = t_k, t_k + 1, ..., t_k + τ_1 do
6:     pull each arm i for τ_1/N times in a round-robin way, updating beliefs with the sampled parameters
7:   end for
8:   update the posteriors with Algorithm 1 and sample R(t_k + τ_1), P(t_k + τ_1)
9:   compute π_k*(·) = Oracle(·, R(t_k + τ_1), P(t_k + τ_1))
10:  for t = t_k + τ_1 + 1, ..., t_{k+1} − 1 do
11:    pull arm a_t = π_k*(b_t) and update the belief
12:  end for
13: end for

In episode k, during the exploration phase, we first sample θ_{t_k} from the distributions g_{t_{k−1}+τ_1}(P) and g_{t_{k−1}+τ_1}(R). We pull each arm τ_1/N times in a round-robin way. For the pulled arm, we update its belief based on equation 4 using θ_{t_k}; for the arms that are not pulled, we update their beliefs based on equation 5 using θ_{t_k}. After the exploration phase, the reward and belief history of each arm are fed into Algorithm 1 to update the posterior distribution. We then sample a new θ_{t_k+τ_1} from the posterior and re-calibrate the belief b_t based on this most recent estimate. Next we enter the exploitation phase: we first derive the optimal policy π_k for the sampled parameter θ_{t_k+τ_1}, and then follow π_k for the rest of episode k. We control the growth of the episode length in a deterministic manner. Specifically, each episode is exactly one step longer than the previous one. With this deterministic increase, the number of episodes K_T is bounded by O(√T), as in Lemma 10. The regret caused by the exploration phases can then be bounded by O(√T), which is a crucial part of Theorem 1. In TSEETC, we use the belief state to estimate the unknown states. Moreover, under the unobserved-state setting, we consider all possible state sequences and update the posterior distribution of the unknown parameters as a mixture of the resulting distributions, each weighted by the likelihood of its state sequence. Remark 1. We use an Oracle to derive the optimal policy for the sampled parameters in Algorithm 2. The Oracle can be the Bellman equation for the POMDP introduced in equation 2, or approximation methods (Pineau et al., 2003; Silver & Veness, 2010), etc. The approximation error is discussed in Remark 3.
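The deterministic episode schedule can be checked numerically. The sketch below (a hypothetical helper, not the paper's code) counts how many episodes of length T_k = T_1 + k − 1 are needed to cover horizon T, illustrating that K_T grows like √T.

```python
import math

def episode_schedule(T):
    """Count episodes under the schedule T_k = T_1 + k - 1, T_1 = (sqrt(T)+1)/2.
    Returns K_T, the number of episodes needed to cover horizon T."""
    T1 = (math.sqrt(T) + 1) / 2
    k, elapsed = 0, 0.0
    while elapsed < T:
        k += 1
        elapsed += T1 + k - 1  # episode k is one step longer than episode k-1
    return k

T = 50000
K_T = episode_schedule(T)      # on the order of sqrt(T) ≈ 224 episodes
```

Since the episode lengths sum to K_T·T_1 + K_T(K_T − 1)/2 ≈ K_T√T/2 + K_T²/2, covering T forces K_T = O(√T), and with fixed exploration length τ_1 per episode the exploration regret is at most τ_1·ΔR·K_T = O(√T).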

5. PERFORMANCE ANALYSIS

In Section 5.1, we present our theoretical results and discuss them. In Section 5.2, we provide a proof sketch; the detailed proof is in Appendix B.

5.1. REGRET BOUND AND DISCUSSIONS

Theorem 1. Suppose Assumptions 1 and 2 hold and the Oracle returns the optimal policy in each episode. The Bayesian regret of our algorithm satisfies

R_T ≤ 48 C_1 C_2 S √(N T log(N T)) + (τ_1 ΔR + H + 4 C_1 C_2 S N) √T + C_1 C_2,

where C_1 = L_1 + L_2 N + N² + S² and C_2 = r_max + H are constants independent of the time horizon T, L_1 = 4(1 − ε_1)² N / (ε_1² ε_2), L_2 = 4(1 − ε_1)² / ε_1³, ε_1 and ε_2 are the minimum elements of the functions P_* and R_*, respectively, τ_1 is the fixed exploration length in each episode, ΔR is the largest gap between rewards obtained at any two times, H is the span bound, r_max is the maximum reward obtainable in one step, N is the number of arms, and S is the state-space size of each arm.

Remark 2. Theorem 1 shows that the regret of TSEETC is upper bounded by Õ(√T). This is the first bound that matches the lower bound for the restless bandit problem (Ortner et al., 2012) in the unobserved-state setting. Although TSEETC looks similar to explore-then-commit (Lattimore & Szepesvári, 2020), a key novelty of TSEETC lies in using posterior sampling to update the posterior distribution of the unknown parameters as a mixture of Dirichlet distributions. Our algorithm balances exploration and exploitation in a deterministic-episode manner and ensures the episode length grows at a linear rate, which guarantees that the total number of episodes is bounded by O(√T). Therefore the total regret caused by exploration is well controlled at O(√T), improving on the O(T^{2/3}) bound of Zhou et al. (2021). Moreover, in the exploitation phase, our regret bound Õ(√T) is also better than the Õ(T^{2/3}) of Zhou et al. (2021), showing that our posterior-sampling-based method is superior to the UCB-based solution (Osband & Van Roy, 2017). In Jahromi et al. (2022), the pseudo count of a state-action pair is only guaranteed to be smaller than the true count with some probability at any time. In our algorithm, by contrast, the sampled parameters concentrate on the true values as the posterior is updated, so our belief-based pseudo count (defined in equation 13) approximates the true count more closely, which helps us obtain a tighter bound.

5.2. PROOF SKETCH

In our algorithm, the total regret can be decomposed as

R_T = E_{θ*}[ Σ_{k=1}^{K_T} Σ_{t=t_k}^{t_k+τ_1} (J(θ*) − r_t) ] + E_{θ*}[ Σ_{k=1}^{K_T} Σ_{t=t_k+τ_1+1}^{t_{k+1}−1} (J(θ*) − r_t) ],   (8)

where the first term is Regret (A) and the second is Regret (B).

Bounding Regret (A). Regret (A) is the regret incurred in the exploration phase of each episode. It can be bounded simply as

Regret (A) ≤ E_{θ*}[ Σ_{k=1}^{K_T} τ_1 ΔR ] ≤ τ_1 ΔR K_T,   (9)

where ΔR = r_max − r_min is the largest gap between rewards received at any two times. The bound in equation 9 depends on the number of episodes K_T, which is bounded by O(√T) in Lemma 10.

Bounding Regret (B). Next we bound Regret (B), the regret in the exploitation phases. Let b̂_t be the belief updated with the sampled parameter θ_k and b_t* the belief under θ*. During episode k, applying equation 2 to the sampled parameter θ_k with a_t = π*(b̂_t), we can write

J(θ_k) + v(b̂_t, θ_k) = r(b̂_t, a_t) + Σ_r P(r | b̂_t, a_t, θ_k) v(b′, θ_k).

With this equation, we decompose the regret as

Regret (B) = R_1 + R_2 + R_3 + R_4,   (11)

where each term is defined as follows:

R_1 = E_{θ*}[ Σ_{k=1}^{K_T} (T_k − τ_1 − 1)(J(θ*) − J(θ_k)) ],
R_2 = E_{θ*}[ Σ_{k=1}^{K_T} Σ_{t=t_k+τ_1+1}^{t_{k+1}−1} ( v(b̂_{t+1}, θ_k) − v(b̂_t, θ_k) ) ],
R_3 = E_{θ*}[ Σ_{k=1}^{K_T} Σ_{t=t_k+τ_1+1}^{t_{k+1}−1} ( Σ_r P(r | b̂_t, a_t, θ_k) v(b′, θ_k) − v(b̂_{t+1}, θ_k) ) ],
R_4 = E_{θ*}[ Σ_{k=1}^{K_T} Σ_{t=t_k+τ_1+1}^{t_{k+1}−1} ( r(b̂_t, a_t) − r(b_t*, a_t) ) ].

Bounding R_1. A key property of posterior sampling algorithms is that, given the history H_{t_k}, the true parameter θ* and the sampled θ_k are identically distributed at time t_k, as stated in Lemma 13. Since the episode length T_k is deterministic and independent of θ_k, R_1 is zero thanks to this key property.

Bounding R_2. The term R_2 is a telescoping sum of value functions and can be bounded as R_2 ≤ H K_T. It depends only on the number of episodes and the upper bound H of the span function. As a result, R_2 reduces to a bound on the number of episodes K_T, which is given in Lemma 10.
Bounding R_3 and R_4. The terms R_3 and R_4 are related to the estimation error of θ. We must therefore bound the parameter errors, especially in our unobserved-state setting. Recalling the definitions of ϕ and ψ, we define the posterior means of P^i(s′ | s) and R^i(r | s) for arm i at time t as

P̄^i(s′ | s)(t) = ( ε_1 + (1 − ε_1) ϕ_{s,s′}^i(t) ) / ( S ε_1 + (1 − ε_1) ‖ϕ_{s,·}^i(t)‖_1 ),
R̄^i(r | s)(t) = ( ε_2 + (1 − ε_2) ψ_{s,r}^i(t) ) / ( S ε_2 + (1 − ε_2) ‖ψ_{s,·}^i(t)‖_1 ).

We also define the pseudo count of the state-action pair (s, a) before episode k as

Ñ_{t_k}^i(s, a) = ‖ψ_{s,·}^i(t_k)‖_1 − ‖ψ_{s,·}^i(0)‖_1,

and consider the confidence set M_k^i of parameters satisfying

Σ_{s′∈S} | P̄(s′ | z) − P_k^i(s′ | z) | ≤ β_k(z),   Σ_{r∈R} | R̄(r | z) − R̄_k^i(r | z) | ≤ β_k(z),

where β_k^i(s, a) := √( 14 S log(2 N t_k T) / max{1, Ñ_{t_k}^i(s, a)} ) is chosen conservatively (Auer et al., 2008) so that M_k^i contains both P_*^i and P_k^i, and both R_*^i and R_k^i, with high probability. Here P_*^i and R_*^i are the true parameters defined in Section 4.1. In particular, in the unobserved-state setting, the belief error under different parameters is upper bounded by the gap between the estimators, as in Proposition 2. The core of the proof lies in deriving a high-probability confidence set with our pseudo counts and showing that the estimation error accumulated up to time T for each arm is bounded by √T. With the error bound for each arm, we can then derive the error bound for the MDP aggregated over all arms, as stated in Lemma 2.

Lemma 2. (estimation errors) The total estimation error of the transition functions accumulated over all exploitation phases satisfies

E_{θ*}[ Σ_{k=1}^{K_T} Σ_{t=t_k+τ_1+1}^{t_{k+1}−1} ‖P_* − P_k‖_1 ] ≤ 48 S N √(N T log(N T)) + 4 S N² √T + N.

Lemma 2 shows that the accumulated error is bounded by Õ(√T), which is crucial for obtaining the final bound, as in the observed-state setting (Ortner et al., 2012; Jung & Tewari, 2019).

Lemma 3. With C_1 = L_1 + L_2 N + N² + S², R_3 satisfies the following bound:

R_3 ≤ 48 C_1 S H √(N T log(N T)) + 4 C_1 S N H √T + C_1 H.

Lemma 4. R_4 satisfies the following bound:

R_4 ≤ 48 C_1 S r_max √(N T log(N T)) + 4 C_1 S N r_max √T + C_1 r_max.
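The posterior mean and confidence width above are straightforward to compute from the Dirichlet counts. A minimal sketch with hypothetical counts (the function names are illustrative):

```python
import math
import numpy as np

def posterior_mean_P(phi, eps1):
    """Posterior-mean transition matrix: each row is the shifted/rescaled
    Dirichlet mean (eps1 + (1-eps1)*phi) / (S*eps1 + (1-eps1)*||phi_row||_1)."""
    S = phi.shape[0]
    denom = S * eps1 + (1 - eps1) * phi.sum(axis=1, keepdims=True)
    return (eps1 + (1 - eps1) * phi) / denom

def beta_width(pseudo_count, S, N, t_k, T):
    """Confidence width beta_k(s, a) built from the pseudo count."""
    return math.sqrt(14 * S * math.log(2 * N * t_k * T) / max(1, pseudo_count))

phi = np.array([[3.0, 1.0], [2.0, 6.0]])  # hypothetical Dirichlet counts
P_bar = posterior_mean_P(phi, eps1=0.05)  # rows sum to 1 by construction
width = beta_width(pseudo_count=10, S=2, N=2, t_k=100, T=50000)
```

Each row of the numerator sums to exactly the denominator, so `P_bar` is a valid stochastic matrix; the width shrinks at rate 1/√(pseudo count), which drives the √T accumulation argument.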

6. NUMERICAL EXPERIMENTS

In this section, we present proof-of-concept experiments with an approximate implementation of TSEETC. We consider two arms, each with two hidden states, and pull one arm at each time. The learning horizon is T = 50000, and each algorithm is run for 100 iterations. The transition and reward functions are the same for all arms. We initialize the algorithm with an uninformative Dirichlet prior on the unknown parameters. We compare our algorithm with the simple heuristic ε-greedy (Lattimore & Szepesvári, 2020) (ε = 0.01), Sliding-Window UCB (Garivier & Moulines, 2011) with a specified window size, RUCB (Liu et al., 2010), Q-learning (Hu et al., 2020), and SEEU (Zhou et al., 2021). The results are shown in Figure 1: TSEETC achieves the lowest regret among these algorithms. In Figure 2, we plot the cumulative regret versus T for the six algorithms on a log-log scale. The slopes of all algorithms except TSEETC and SEEU are close to one, suggesting that they incur linear regret. Moreover, the slope of TSEETC is close to 0.5, which is better than SEEU and consistent with our theoretical result.

7. CONCLUSION

In this paper, we consider the restless bandit problem with unknown states and unknown dynamics. We propose the TSEETC algorithm to estimate the unknown parameters and derive the optimal policy. We establish a Bayesian regret bound of Õ(√T) for our algorithm, the first bound that matches the lower bound for restless bandit problems with unobserved states. Numerical results validate that TSEETC outperforms other learning algorithms in regret. A related open question is whether our method can be applied to the setting where the transition functions are action-dependent. We leave this for future research.

A TABLE OF NOTATIONS

$T$ — the length of the horizon
$K_T$ — the number of episodes up to time $T$
$T_k$ — the length of episode $k$
$\tau_1$ — the fixed exploration length in each episode
$P^i$ — the transition function of arm $i$
$R^i$ — the reward function of arm $i$
$P_k$ — the sampled transition function of the aggregated MDP
$R_k$ — the sampled reward function of the aggregated MDP
$r_t$ — the reward obtained at time $t$
$a_t$ — the action at time $t$
$J(\theta_k)$ — the optimal long-term average reward under parameter $\theta_k$
$r_{\max}$ — the maximum reward obtainable at each time
$r_{\min}$ — the minimum reward obtainable at each time
$\Delta R$ — the largest gap between obtained rewards

B PROOF OF THEOREM 1

Recall that our goal is to minimize the regret
$$R_T:=\mathbb{E}_{\theta^*}\left[\sum_{t=1}^{T}\big(J(\theta^*)-r_t\big)\right].$$
The reward $r_t$ depends on the state $s_t$ and the action $a_t$, so it can be written as $r(s_t,a_t)$. Since $\mathbb{E}_{\theta^*}[r(s_t,a_t)\mid H_{t-1}]=r(b^*_t,a_t)$ for any $t$, we have
$$R_T=\mathbb{E}_{\theta^*}\left[\sum_{t=1}^{T}\big(J(\theta^*)-r(b^*_t,a_t)\big)\right].$$
In our algorithm each episode is split into an exploration phase and an exploitation phase, so the regret can be rewritten as
$$R_T=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_k+\tau_1}\big(J(\theta^*)-r(b^*_t,a_t)\big)+\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\big(J(\theta^*)-r(b^*_t,a_t)\big)\right],$$
where $\tau_1$ is the constant exploration length of each episode and $t_k$ is the start time of episode $k$. Define the first part, caused by the exploration steps, as Regret(A), and the second part as Regret(B):
$$\mathrm{Regret}(A)=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_k+\tau_1}\big(J(\theta^*)-r(b^*_t,a_t)\big)\right],\qquad\mathrm{Regret}(B)=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\big(J(\theta^*)-r(b^*_t,a_t)\big)\right].$$
Recall that the reward set is $\mathcal{R}$ and the maximum reward gap in $\mathcal{R}$ is $\Delta R=r_{\max}-r_{\min}$, so $J(\theta^*)-r(b^*_t,a_t)\le\Delta R$. Hence Regret(A) admits the simple upper bound
$$\mathrm{Regret}(A)\le\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\tau_1\Delta R\right]\le\tau_1\,\Delta R\,K_T.$$
Regret(A) is thus proportional to the number of episodes $K_T$, which is bounded in Lemma 10. Next we bound Regret(B). During episode $k$, based on equation 2, we have
$$J(\theta_k)+v(\tilde b_t,\theta_k)=r(\tilde b_t,a_t)+\sum_{r}P(r\mid\tilde b_t,a_t,\theta_k)\,v(b',\theta_k),$$
where $J(\theta_k)$ is the optimal long-term average reward under parameter $\theta_k$, $\tilde b_t$ is the belief at time $t$ updated with parameter $\theta_k$, $r(\tilde b_t,a_t)$ is the expected reward when action $a_t$ is taken at the current belief $\tilde b_t$, and $b'$ is the belief updated via equation 4 with parameter $\theta_k$ when reward $r$ is received.
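The belief update in equation 4 is a standard hidden-Markov-model filter step: predict the state distribution through the transition matrix, then reweight by the likelihood of the observed reward. A minimal sketch under our own toy parameterization (action-independent transitions, as assumed in Appendix C; all names and numbers are illustrative):

```python
import numpy as np

def belief_update(b, P, R, r_idx):
    """One Bayes-filter step: b'(s') ∝ R(r | s') * sum_s b(s) P(s' | s).
    Returns the updated belief and the normalizer P(r | b)."""
    pred = b @ P                # predict: distribution over next states
    post = R[:, r_idx] * pred   # correct: reweight by reward likelihood
    z = post.sum()              # normalizer = probability of observing r
    return post / z, z

b = np.array([0.5, 0.5])                 # current belief over 2 hidden states
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # row-stochastic transition matrix
R = np.array([[0.8, 0.2], [0.3, 0.7]])   # R[s, r]: likelihood of reward r in state s
b_next, p_r = belief_update(b, P, R, r_idx=1)
```

The normalizer `p_r` is exactly the quantity $P(r\mid b,a,\theta)$ that appears in the Bellman equation above.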
Using this equation, we proceed by decomposing the regret as $\mathrm{Regret}(B)=R_1+R_2+R_3+R_4$, where
$$R_1=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}(T_k-\tau_1-1)\big(J(\theta^*)-J(\theta_k)\big)\right],\qquad R_2=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\big(v(\tilde b_{t+1},\theta_k)-v(\tilde b_t,\theta_k)\big)\right],$$
$$R_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\Big(\sum_{r}P(r\mid\tilde b_t,a_t,\theta_k)\,v(b',\theta_k)-v(\tilde b_{t+1},\theta_k)\Big)\right],\qquad R_4=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\big(r(\tilde b_t,a_t)-r(b^*_t,a_t)\big)\right].$$
Next we bound the four parts one by one.

B.1 BOUND R1

Lemma 5. $R_1$ satisfies $R_1=0$.

Proof. Recall that $R_1=\mathbb{E}_{\theta^*}\big[\sum_{k=1}^{K_T}(T_k-\tau_1-1)(J(\theta^*)-J(\theta_k))\big]$. For each episode, $T_k$ is deterministic and independent of $\theta_k$. By Lemma 13, $\mathbb{E}_{\theta^*}[J(\theta^*)]=\mathbb{E}_{\theta^*}[J(\theta_k)]$; therefore $R_1=0$.

B.2 BOUND R2

Lemma 6. $R_2$ satisfies $R_2\le HK_T$, where $K_T$ is the total number of episodes up to time $T$.

Proof. Recall that $R_2$ is a telescoping sum of the value function at times $t+1$ and $t$:
$$R_2=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\big(v(\tilde b_{t+1},\theta_k)-v(\tilde b_t,\theta_k)\big)\right].$$
Summing within each episode $k$, $R_2$ can be rewritten as
$$R_2=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\big(v(\tilde b_{t_{k+1}},\theta_k)-v(\tilde b_{t_k+\tau_1+1},\theta_k)\big)\right].$$
Since the span of $v(\cdot,\theta)$ is bounded by $H$ (Proposition 1), we obtain the final bound $R_2\le HK_T$.

B.3 BOUND R 3

In this section, we first rewrite $R_3$ (Section B.3.1) and then show how to bound it (Section B.3.2).

B.3.1 REWRITE R3

Lemma 7 (Rewrite $R_3$). The regret term $R_3$ can be bounded as
$$R_3\le H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|P^*-P_k\|_1\right]+H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|b^*_t-\tilde b_t\|_1\right]+S^2H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|R^*-R_k\|_1\right],$$
where $P_k$ and $R_k$ are the transition and reward functions sampled in episode $k$, $b^*_t$ is the belief at time $t$ updated with the true $P^*$ and $R^*$, and $\tilde b_t$ is the belief at time $t$ updated with the sampled $P_k$ and $R_k$.

Proof. Most of the proof is similar to Jahromi et al. (2022), except that we must also handle the unknown reward functions. Recall that
$$R_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\Big(\sum_{r}P(r\mid\tilde b_t,a_t,\theta_k)\,v(b',\theta_k)-v(\tilde b_{t+1},\theta_k)\Big)\right],$$
and that $H_t$ is the history of actions and observations prior to action $a_t$. Conditioned on $H_t$, $\theta^*$ and $\theta_k$, the only random variable in $\tilde b_{t+1}$ is $r_{t+1}$, so
$$\mathbb{E}_{\theta^*}\big[v(\tilde b_{t+1},\theta_k)\mid H_t,\theta_k\big]=\sum_{r\in\mathcal{R}}v(b',\theta_k)\,P(r\mid b^*_t,a_t,\theta^*),$$
where $P(r\mid b^*_t,a_t,\theta^*)$ is the probability of receiving reward $r$ given $b^*_t$, $a_t$, $\theta^*$. By the law of total probability,
$$P(r\mid b^*_t,a_t,\theta^*)=\sum_{s'}R^*(r\mid s')\,P(s_{t+1}=s'\mid H_t,\theta^*)=\sum_{s'}R^*(r\mid s')\sum_{s}P^*(s_{t+1}=s'\mid s_t=s,H_t,a_t,\theta^*)\,P(s_t=s\mid H_t,\theta^*)=\sum_{s}\sum_{s'}b^*_t(s)\,P^*(s'\mid s)\,R^*(r\mid s'),$$
where $P^*$ and $R^*$ are the transition and reward functions of the MDP aggregated over all arms. Therefore, we can rewrite $R_3$ as
$$R_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r\in\mathcal{R}}\big(P(r\mid\tilde b_t,a_t,\theta_k)-P(r\mid b^*_t,a_t,\theta^*)\big)\,v(b',\theta_k)\right].$$
Based on equation 23, we get
$$R_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\Big(R_k(r\mid s')\sum_{s}\tilde b_t(s)P_k(s'\mid s)-R^*(r\mid s')\sum_{s}b^*_t(s)P^*(s'\mid s)\Big)\right],$$
where $R_k$ and $P_k$ are the sampled reward and transition functions of the aggregated MDP. Adding and subtracting the cross term $R_k(r\mid s')\sum_s b^*_t(s)P^*(s'\mid s)$ splits $R_3$ as $R_3=R'_3+R''_3$, where
$$R'_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\,R_k(r\mid s')\Big(\sum_{s}\tilde b_t(s)P_k(s'\mid s)-\sum_{s}b^*_t(s)P^*(s'\mid s)\Big)\right],$$
$$R''_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\,\big(R_k(r\mid s')-R^*(r\mid s')\big)\sum_{s}b^*_t(s)P^*(s'\mid s)\right].$$
Bounding $R'_3$. The part $R'_3$ can be bounded as in Jahromi et al. (2022).
We further split $R'_3=R'_3(0)+R'_3(1)$ by adding and subtracting the term with $b^*_t$ and $P_k$:
$$R'_3(0)=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\,R_k(r\mid s')\sum_{s}\big(\tilde b_t(s)-b^*_t(s)\big)P_k(s'\mid s)\right],$$
$$R'_3(1)=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\,R_k(r\mid s')\sum_{s}b^*_t(s)\big(P_k(s'\mid s)-P^*(s'\mid s)\big)\right].$$
For $R'_3(0)$, since $\sum_r R_k(r\mid s')=1$, $\sum_{s'}P_k(s'\mid s)=1$ and $v(b',\theta_k)\le H$, we have
$$R'_3(0)\le\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\,R_k(r\mid s')\sum_{s}\big|\tilde b_t(s)-b^*_t(s)\big|\,P_k(s'\mid s)\right]\le H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\big\|\tilde b_t-b^*_t\big\|_1\right],$$
where the first inequality uses $\tilde b_t(s)-b^*_t(s)\le|\tilde b_t(s)-b^*_t(s)|$ and the second uses $\sum_r R_k(r\mid s')=1$, $\sum_{s'}P_k(s'\mid s)=1$ and $v(b',\theta_k)\le H$. For the first term in $R'_3(1)$, note that conditioned on $H_t$ and $\theta^*$, the distribution of $s_t$ is $b^*_t$. Furthermore, $a_t$ is measurable with respect to the sigma algebra generated by $H_t$ and $\theta_k$, since $a_t=\pi^*(\tilde b_t,\theta_k)$. Thus, we have
$$\mathbb{E}_{\theta^*}\Big[v(b',\theta_k)\sum_{s}P^*(s'\mid s)\,b^*_t(s)\,\Big|\,H_t,\theta_k\Big]=v(b',\theta_k)\,\mathbb{E}_{\theta^*}\big[P^*(s'\mid s_t)\mid H_t,\theta_k\big],$$
$$\mathbb{E}_{\theta^*}\Big[v(b',\theta_k)\sum_{s}P_k(s'\mid s)\,b^*_t(s)\,\Big|\,H_t,\theta_k\Big]=v(b',\theta_k)\,\mathbb{E}_{\theta^*}\big[P_k(s'\mid s_t)\mid H_t,\theta_k\big].$$
Substituting equation 25 and equation 26 into $R'_3(1)$, we have
$$R'_3(1)=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{r}\sum_{s'}v(b',\theta_k)\,R_k(r\mid s')\big(P_k(s'\mid s)-P^*(s'\mid s)\big)\right]\le H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|P_k-P^*\|_1\right],$$
where the first inequality uses $P_k(s'\mid s)-P^*(s'\mid s)\le|P_k(s'\mid s)-P^*(s'\mid s)|$ and the last step uses $v(b',\theta_k)\le H$ and $\sum_r R_k(r\mid s')=1$. Therefore, we obtain
$$R'_3\le H\,\mathbb{E}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|P^*-P_k\|_1\right]+H\,\mathbb{E}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|b^*_t-\tilde b_t\|_1\right].$$
Bounding $R''_3$. For $R''_3$, note that for any fixed $s'$, $\sum_s b^*_t(s)P^*(s'\mid s)\le S$, so we can bound $R''_3$ as
$$R''_3\le SH\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{s'}\sum_{r}\big|R_k(r\mid s')-R^*(r\mid s')\big|\right]\le S^2H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|R_k-R^*\|_1\right],$$
where the first inequality uses $v(b',\theta_k)\le H$ and $\sum_s b^*_t(s)P^*(s'\mid s)\le S$, and the second uses the fact that for any fixed $s'$, $\sum_r|R_k(r\mid s')-R^*(r\mid s')|\le\|R_k-R^*\|_1$.

B.3.2 BOUND R3

Lemma 8. $R_3$ satisfies the following bound:
$$R_3\le 48(L_1+L_2N+N+S^2)SH\sqrt{NT\log(NT)}+(L_1+L_2N+N+S^2)H+4(L_1+L_2N+N^2+S^2)SNH(T_1+K_T-\tau_1-1).$$
Proof. Recall that
$$R_3=\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\Big(\sum_{r}P(r\mid\tilde b_t,a_t,\theta_k)\,v(b',\theta_k)-v(\tilde b_{t+1},\theta_k)\Big)\right].$$
This regret term captures the model estimation errors; that is, it depends on the on-policy error between the sampled and true transition functions, and between the sampled and true reward functions.
Thus we must bound the parameter errors, especially in our unobserved-state setting. Based on the parameters of the Dirichlet distributions, we define the empirical estimates of the transition and reward functions for arm $i$ as
$$\hat P^i(s'\mid s)(t)=\frac{\epsilon_1+(1-\epsilon_1)\phi^i_{s,s'}(t)}{S\epsilon_1+(1-\epsilon_1)\|\phi^i_{s,\cdot}(t)\|_1},\qquad \hat R^i(r\mid s)(t)=\frac{\epsilon_2+(1-\epsilon_2)\psi^i_{s,r}(t)}{S\epsilon_2+(1-\epsilon_2)\|\psi^i_{s,\cdot}(t)\|_1},$$
where $\phi^i_{s,s'}(t)$ and $\psi^i_{s,r}(t)$ are the parameters of the posterior distributions of $P^i$ and $R^i$ at time $t$. We also define the pseudo count $N^i_{t_k}(s,a)$ of the state-action pair $(s,a)$ before episode $k$ for arm $i$ as $N^i_{t_k}(s,a)=\|\psi^i_{s,\cdot}(t_k)\|_1-\|\psi^i_{s,\cdot}(0)\|_1$. For notational simplicity, we write $z=(s,a)\in\mathcal{S}\times\mathcal{A}$ and $z_t=(s_t,a_t)$ for the corresponding state-action pair. Then, based on Lemma 7, we can decompose $R_3$ as $R_3\le R^0_3+R^1_3+R^2_3$, where
$$R^0_3=H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|P^*-P_k\|_1\right],\quad R^1_3=H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|b^*_t-\tilde b_t\|_1\right],\quad R^2_3=S^2H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|R^*-R_k\|_1\right].$$
The following results focus on a single arm: let $P^i_*$ be the true transition function of arm $i$ and $P^i_k$ the sampled transition function of arm $i$. The single-arm results extend to the aggregated MDP by Lemma 11.

Bounding $R^0_3$. Since $0\le v(b',\theta_k)\le H$ by our assumption, each term in the inner summation is bounded by
$$\sum_{s'\in\mathcal{S}}\big|P^i_*(s'\mid z_t)-P^i_k(s'\mid z_t)\big|\,v(s',\theta_k)\le H\sum_{s'\in\mathcal{S}}\big|P^i_*(s'\mid z_t)-\hat P^i_k(s'\mid z_t)\big|+H\sum_{s'\in\mathcal{S}}\big|\hat P^i_k(s'\mid z_t)-P^i_k(s'\mid z_t)\big|,$$
where $P^i_*(s'\mid z_t)$ is the true transition function, $P^i_k(s'\mid z_t)$ is the sampled transition function, and $\hat P^i_k(s'\mid z_t)$ is the posterior mean; the second step is the triangle inequality. Let $\mathcal{M}^i_k$ be the set of plausible MDPs in episode $k$, with reward function $R(r\mid z)$ and transition function $P(s'\mid z)$ satisfying
$$\sum_{s'\in\mathcal{S}}\big|P(s'\mid z)-\hat P^i_k(s'\mid z)\big|\le\beta^i_k(z),\qquad\sum_{r\in\mathcal{R}}\big|R(r\mid z)-\hat R^i_k(r\mid z)\big|\le\beta^i_k(z),$$
where $\beta^i_k(s,a):=\sqrt{\frac{14S\log(2Nt_kT)}{\max\{1,N^i_{t_k}(s,a)\}}}$ is chosen conservatively (Auer et al., 2008) so that $\mathcal{M}^i_k$ contains both $P^i_*$ and $P^i_k$, and both $R^i_*$ and $R^i_k$, with high probability. Here $P^i_*$ and $R^i_*$ are the true parameters defined in Section 4.1, and $\beta^i_k(z)$ is the confidence radius with $\delta=1/t_k$. Recalling the definition of $\psi$, the pseudo count of the state-action pair $(s,a)$ is $N^i_{t_k}(s,a)=\|\psi^i_{s,\cdot}(t_k)\|_1-\|\psi^i_{s,\cdot}(0)\|_1$. We then obtain
$$\sum_{s'\in\mathcal{S}}\big|P^i_*(s'\mid z_t)-\hat P^i_k(s'\mid z_t)\big|+\sum_{s'\in\mathcal{S}}\big|\hat P^i_k(s'\mid z_t)-P^i_k(s'\mid z_t)\big|\le2\beta^i_k(z_t)+2\big(\mathbb{I}_{\{P^i_*\notin\mathcal{M}^i_k\}}+\mathbb{I}_{\{P^i_k\notin\mathcal{M}^i_k\}}\big).$$
We assume the last episode is the longest; even if this assumption fails, we can enlarge the summands to $T_{K_T}-\tau_1$ terms, which does not affect the order of our regret bound. Under this assumption, since no episode is longer than the last one, i.e., $t_{k+1}-1-(t_k+\tau_1)\le T_{K_T}-\tau_1$, we obtain
$$\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1}^{t_{k+1}-1}\beta^i_k(z_t)\le\sum_{k=1}^{K_T}\sum_{t=1}^{T_{K_T}-\tau_1}\beta^i_k(z_t).$$
Note that $\sum_{s'\in\mathcal{S}}|P^i_*(s'\mid z_t)-\hat P^i_k(s'\mid z_t)|\le2$ always holds, and with our assumption $\tau_1\le\frac{T_1+K_T-1}{2}$, it is easy to show that $\beta^i_k(z_t)\le2$ whenever $N^i_{t_k}\ge T_{K_T}-\tau_1$. Then we obtain
$$\sum_{k=1}^{K_T}\sum_{t=1}^{T_{K_T}-\tau_1}\min\{2,\beta^i_k(z_t)\}\le\sum_{k=1}^{K_T}\sum_{t=1}^{T_{K_T}-\tau_1}2\,\mathbb{I}\big(N^i_{t_k}<T_{K_T}-\tau_1\big)+\sum_{k=1}^{K_T}\sum_{t=1}^{T_{K_T}-\tau_1}\mathbb{I}\big(N^i_{t_k}\ge T_{K_T}-\tau_1\big)\sqrt{\frac{14S\log(2Nt_kT)}{\max\{1,N^i_{t_k}(z_t)\}}}.$$
Consider the first part of equation 31.
Obviously, the maximum of $N^i_{t_k}$ is $T_{K_T}-\tau_1$. Since there are $SA$ state-action pairs in total, the first part of equation 31 can be bounded as
$$\sum_{k=1}^{K_T}\sum_{t=1}^{T_{K_T}-\tau_1}2\,\mathbb{I}\big(N^i_{t_k}<T_{K_T}-\tau_1\big)\le2(T_{K_T}-\tau_1)SA.$$
Since $T_{K_T}=T_1+K_T-1$ and by Lemma 10, we get $2(T_{K_T}-\tau_1)SA=2(T_1+K_T-\tau_1-1)SA=O(\sqrt{T})$. Consider the second part of equation 31. Let $N^i_t(s,a)$ denote the count of visits to $(s,a)$ before time $t$ (not including $t$). Since we only count the exploration phase of each episode, $N^i_t(s,a)$ is
$$N^i_t(s,a)=\big|\{\tau<t,\ \tau\in[t_k,t_k+\tau_1],\ k\le k(t):\ (s^i_\tau,a^i_\tau)=(s,a)\}\big|,$$
where $k(t)$ is the index of the episode containing time $t$. In the second part of equation 31, when $N^i_{t_k}\ge T_{K_T}-\tau_1$, our assumption $\tau_1\le\frac{T_1+K_T-1}{2}$ gives $2\tau_1\le T_1+K_T-1=T_{K_T}$, hence $T_{K_T}-\tau_1\ge\tau_1$ and thus $N^i_{t_k}(s,a)\ge\tau_1$. For any $t\in[t_k,t_k+\tau_1]$, we then have $N^i_t(s,a)\le N^i_{t_k}(s,a)+\tau_1\le2N^i_{t_k}(s,a)$. Using $N^i_t(s,a)\le2N^i_{t_k}(s,a)$, we can bound the confidence terms as
$$\sum_{k=1}^{K_T}\sum_{t=1}^{T_{K_T}-\tau_1}\beta^i_k(z_t)\le\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\sqrt{\frac{14S\log(2Nt_kT)}{\max\{1,N^i_{t_k}(z_t)\}}}\le\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\sqrt{\frac{14S\log(2NT^2)}{\max\{1,N^i_{t_k}(z_t)\}}}=\sum_{t=1}^{T}\sqrt{\frac{28S\log(2NT^2)}{\max\{1,N^i_t(z_t)\}}}\le\sqrt{56S\log(2NT)}\sum_{t=1}^{T}\frac{1}{\sqrt{\max\{1,N^i_t(z_t)\}}},$$
where the second inequality holds because $t_k\le T$ for all episodes and the equality uses $N^i_t(s,a)\le2N^i_{t_k}(s,a)$. Then, similarly to Ouyang et al. (2017), since $N^i_t(z_t)$ counts the visits to $z_t$, we have
$$\sum_{t=1}^{T}\frac{1}{\sqrt{\max\{1,N^i_t(z_t)\}}}=\sum_{z}\sum_{t=1}^{T}\frac{\mathbb{I}_{\{z_t=z\}}}{\sqrt{\max\{1,N^i_t(z)\}}}=\sum_{z}\Bigg(\mathbb{I}_{\{N^i_{T+1}(z)>0\}}+\sum_{j=1}^{N^i_{T+1}(z)-1}\frac{1}{\sqrt{j}}\Bigg)\le\sum_{z}\Big(\mathbb{I}_{\{N^i_{T+1}(z)>0\}}+2\sqrt{N^i_{T+1}(z)}\Big)\le3\sum_{z}\sqrt{N^i_{T+1}(z)}.$$
Since $\sum_z N^i_{T+1}(z)\le T$, by Cauchy–Schwarz we have $3\sum_z\sqrt{N^i_{T+1}(z)}\le3\sqrt{SN}\sqrt{\sum_z N^i_{T+1}(z)}=3\sqrt{SNT}$.
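The pigeonhole step above (bounding $\sum_t 1/\sqrt{\max\{1,N_t(z_t)\}}$ by a constant times $\sqrt{|Z|T}$ over $|Z|$ state-action pairs) can be sanity-checked numerically. This sketch is our own illustration, not part of the proof:

```python
import math
from collections import defaultdict

def inverse_sqrt_visit_sum(zs):
    """Compute sum_t 1/sqrt(max(1, N_t(z_t))), where N_t(z) counts visits
    to z strictly before time t."""
    counts = defaultdict(int)
    total = 0.0
    for z in zs:
        total += 1.0 / math.sqrt(max(1, counts[z]))
        counts[z] += 1
    return total

# A round-robin visit sequence over 6 state-action pairs, horizon T = 1000.
zs = [(t % 3, t % 2) for t in range(1000)]
total = inverse_sqrt_visit_sum(zs)
num_pairs, T = 6, len(zs)
# total stays below 3 * sqrt(num_pairs * T), matching the bound in the text.
```

Per-pair, the sum telescopes into $1+\sum_{j<n_z} 1/\sqrt{j}\le 1+2\sqrt{n_z}$, which is where the factor 3 comes from.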
Combining equation 32 and equation 33, we get
$$2H\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\beta^i_k(z_t)\le6\sqrt{56}\,HS\sqrt{NT\log(NT)}\le48HS\sqrt{NT\log(NT)},$$
and hence equation 30 can be bounded as
$$\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\beta^i_k(z_t)\le24S\sqrt{NT\log(NT)}+2SA(T_1+K_T-\tau_1-1).$$
Choosing $\delta=1/T$ in Lemma 12, and using Lemma 13, we obtain $P\big(P^i_k\notin\mathcal{M}_k\big)=P\big(P^i_*\notin\mathcal{M}_k\big)\le\frac{1}{15Tt_k^6}$. Then
$$2\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}T_k\big(\mathbb{I}_{\{\theta^*\notin\mathcal{M}_k\}}+\mathbb{I}_{\{\theta_k\notin\mathcal{M}_k\}}\big)\right]\le\frac{4}{15}\sum_{k=1}^{\infty}t_k^{-6}\le\frac{4}{15}\sum_{k=1}^{\infty}k^{-6}\le1,$$
and therefore $2H\,\mathbb{E}_{\theta^*}\big[\sum_{k=1}^{K_T}T_k(\mathbb{I}_{\{\theta^*\notin\mathcal{M}_k\}}+\mathbb{I}_{\{\theta_k\notin\mathcal{M}_k\}})\big]\le H$. Hence, for one arm,
$$\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\sum_{s'\in\mathcal{S}}\big|P^i_*(s'\mid z_t)-P^i_k(s'\mid z_t)\big|\,v(s',\theta_k)\right]\le H+4SNH(T_1+K_T-\tau_1-1)+48HS\sqrt{NT\log(NT)}.$$
Next we consider the state transitions of all arms. Recall that the joint state of all arms at time $t$ is $s_t$. Because every arm evolves independently, the transition probability from $s_t$ to $s_{t+1}$ is
$$P(s_{t+1}\mid s_t,\theta^*)=\prod_{i=1}^{N}P^i_*\big(s^i_{t+1}\mid s^i_t\big),$$
where $P^i_*$ is the true transition function of arm $i$. By Lemma 11 and our assumption that all arms share the same state space $\mathcal{S}$, we obtain
$$\sum_{s_{t+1}}\big|P(s_{t+1}\mid s_t,\theta^*)-P(s_{t+1}\mid s_t,\theta_k)\big|\le\sum_{i=1}^{N}\big\|P^i_*(\cdot\mid s^i_t)-P^i_k(\cdot\mid s^i_t)\big\|_1\le N\,\big\|P^i_*(\cdot\mid s^i_t)-P^i_k(\cdot\mid s^i_t)\big\|_1.$$
Therefore, we can bound $R^0_3$ as
$$R^0_3\le NH+4SN^2H(T_1+K_T-\tau_1-1)+48SNH\sqrt{NT\log(NT)}.$$
Bounding $R^1_3$. By Proposition 2, we know that
$$\|b^*_t-\tilde b_t\|_1\le L_1\|R^*-R_k\|_1+L_2\max_{s}\|P^*(s,:)-P_k(s,:)\|_2.$$
Since the entries of the true transition matrix $P^*$ and the sampled matrix $P_k$ lie in the interval $(0,1)$, standard norm inequalities give $\max_s\|P^*(s,:)-P_k(s,:)\|_2\le\|P^*-P_k\|_1$. Therefore, the belief error at any time satisfies
$$\|b^*_t-\tilde b_t\|_1\le L_1\|R^*-R_k\|_1+L_2\|P^*-P_k\|_1.$$
Recall that in the confidence set $\mathcal{M}_k$ the error bound is the same for $\|R^*-R_k\|_1$ and $\|P^*-P_k\|_1$; based on the bounds in equation 34 and equation 35, $R^1_3$ satisfies
$$R^1_3\le H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(L_1\|R^*-R_k\|_1+L_2\|P^*-P_k\|_1\big)\right]\le(L_1+L_2N)H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\Big(2\beta^i_k(z_t)+2\big(\mathbb{I}_{\{P^*\notin\mathcal{M}_k\}}+\mathbb{I}_{\{P_k\notin\mathcal{M}_k\}}\big)\Big)\right]$$
$$\le48(L_1+L_2N)SH\sqrt{NT\log(NT)}+(L_1+L_2N)H+4(L_1+L_2N)SNH(T_1+K_T-\tau_1-1).$$
Bounding $R^2_3$. Based on equation 34 and equation 35, we can bound $R^2_3$ as
$$R^2_3=S^2H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\|R^*(\cdot\mid s)-R_k(\cdot\mid s)\|_1\right]\le S^2H\,\mathbb{E}_{\theta^*}\left[\sum_{k=1}^{K_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1}\Big(2\beta^i_k(z_t)+2\big(\mathbb{I}_{\{R^*\notin\mathcal{M}_k\}}+\mathbb{I}_{\{R_k\notin\mathcal{M}_k\}}\big)\Big)\right]$$
$$\le HS^2+4S^3NH(T_1+K_T-\tau_1-1)+48HS^3\sqrt{NT\log(NT)}.$$
Combining the bounds in equation 39, equation 41 and equation 42 yields the bound on $R_3$ stated in Lemma 8. Summing all regret terms and using Lemma 10, we get the final theorem.

Theorem 2. Suppose Assumptions 1 and 2 hold and the Oracle returns the optimal policy in each episode. The Bayesian regret of our algorithm satisfies
$$R_T\le48C_1C_2S\sqrt{NT\log(NT)}+\big(\tau_1\Delta R+H+4C_1C_2SN\big)\sqrt{T}+C_1C_2,$$
where $C_1=L_1+L_2N+N^2+S^2$ and $C_2=r_{\max}+H$ are constants independent of the time horizon $T$, $L_1=\frac{4(1-\epsilon_1)^2N}{\epsilon_1^2\epsilon_2}$, $L_2=\frac{4(1-\epsilon_1)^2}{\epsilon_1^3}$, and $\epsilon_1$ and $\epsilon_2$ are the minimum elements of the functions $P^*$ and $R^*$, respectively.
Here $\tau_1$ is the fixed exploration length in each episode, $\Delta R$ is the largest gap between rewards obtained at any two times, $H$ is the bound on the span, $r_{\max}$ is the maximum reward obtainable at each time, $N$ is the number of arms, and $S$ is the state-space size of each arm.
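The smoothed estimators $\hat P^i$ and $\hat R^i$ defined in the proof of Lemma 8 shift the Dirichlet posterior mean so that every entry stays above a positive floor, keeping the estimate inside the assumed parameter class. A minimal sketch (the function name and the counts are ours, for illustration only):

```python
import numpy as np

def smoothed_mean(counts, eps):
    """Map each row of Dirichlet counts to the smoothed estimate
    (eps + (1 - eps) * count) / (S * eps + (1 - eps) * row_sum),
    mirroring the hat-P / hat-R definitions in the proof of Lemma 8."""
    S = counts.shape[1]
    num = eps + (1.0 - eps) * counts
    den = S * eps + (1.0 - eps) * counts.sum(axis=1, keepdims=True)
    return num / den

phi = np.array([[8.0, 2.0], [1.0, 3.0]])   # posterior counts phi[s, s'] for one arm
P_hat = smoothed_mean(phi, eps=0.1)        # rows are stochastic, entries bounded away from 0
```

With zero counts the estimate collapses to the uniform distribution, and as counts grow it approaches the empirical frequencies.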



$a_t=\pi^*_k(b_t)$; observe the new reward $r_{t+1}$; update the belief $b_t$ of all arms based on equation 4 and equation 5.

where $\psi^i_{s,\cdot}(t_k)$ represents the count of the state-action pair $z=(s,a)$ before episode $k$. Let $\mathcal{M}^i_k$ be the set of plausible MDPs in episode $k$, with reward function $R(r\mid z)$ and transition function $P(s'\mid z)$ satisfying

We state the final bounds on $R_3$ and $R_4$ here and give the detailed proofs in Appendix B.3 and B.4. Lemma 3. $R_3$ satisfies the following bound:

Figure 1: The cumulative regret.
Figure 2: The log-log regret.

$$R_3\le48(L_1+L_2N+N+S^2)SH\sqrt{NT\log(NT)}+(L_1+L_2N+N+S^2)H+4(L_1+L_2N+N^2+S^2)SNH(T_1+K_T-\tau_1-1).$$
Lemma 9. $R_4$ satisfies the following bound:
$$R_4\le48(L_1+L_2N+N+S^2)Sr_{\max}\sqrt{NT\log(NT)}+(L_1+L_2N+N+S^2)r_{\max}+4(L_1+L_2N+N+S^2)SAr_{\max}(T_1+K_T-\tau_1-1).$$

$$R^0_4\le\mathbb{E}_{\theta^*}\left[\sum_{t=1}^{T}\sum_{s}\big|r_k(s,a_t)-r_*(s,a_t)\big|\right]\le\mathbb{E}_{\theta^*}\left[\sum_{t=1}^{T}\sum_{s}\sum_{r}r\,\big|R^{a_t}_k(r\mid s)-R^{a_t}_*(r\mid s)\big|\right]\le Sr_{\max}\,\mathbb{E}_{\theta^*}\left[\sum_{t=1}^{T}\big\|R^{a_t}_k-R^{a_t}_*\big\|_1\right],$$
where the first inequality in equation 46 uses $b^*_t(s)\le1$ and $r_k(s,a_t)-r_*(s,a_t)\le|r_k(s,a_t)-r_*(s,a_t)|$, and the second uses $\sum_r\big(R^{a_t}_k(r\mid s)-R^{a_t}_*(r\mid s)\big)\le\|R^{a_t}_k-R^{a_t}_*\|_1$ together with $r\le r_{\max}$. Based on equation 41, we can bound $R^0_4$ as
$$R^0_4\le48(L_1+L_2N)Sr_{\max}\sqrt{NT\log(NT)}+(L_1+L_2N)r_{\max}+4(L_1+L_2N)SNr_{\max}(T_1+K_T-\tau_1-1).$$
Note that any reward function $R(r\mid z)$ in the confidence set $\mathcal{M}_k$ satisfies $\sum_{r\in\mathcal{R}}|R(r\mid z)-\hat R^i_k(r\mid z)|\le\beta^i_k(z)$. Then, based on equation 42, we get
$$R^1_4\le48S^2r_{\max}\sqrt{NT\log(NT)}+2S^2Nr_{\max}(T_1+K_T-\tau_1-1)+Sr_{\max}.$$
We then obtain the final bound
$$R_4\le48(L_1+L_2N+S)Sr_{\max}\sqrt{NT\log(NT)}+4(L_1+L_2N+S)SNr_{\max}(T_1+K_T-\tau_1-1)+(L_1+L_2N+S)r_{\max}$$
$$\le48(L_1+L_2N+N+S^2)Sr_{\max}\sqrt{NT\log(NT)}+(L_1+L_2N+N+S^2)r_{\max}+4(L_1+L_2N+N+S^2)SNr_{\max}(T_1+K_T-\tau_1-1),$$
where the last inequality uses $S\le N+S^2$.

B.5 THE TOTAL REGRET

Next we bound the number of episodes.

Lemma 10 (Bound on the episode number). With the convention $T_1=\frac{\sqrt{T}+1}{2}$ and $T_k=T_{k-1}+1$, the number of episodes is bounded by $K_T=O(\sqrt{T})$.

Proof. The total horizon is $T$ and the length of episode $k$ is $T_k=T_1+k-1$. Then we get
$$T=T_1+T_2+\cdots+T_{K_T}=T_1+(T_1+1)+\cdots+(T_1+K_T-1)=K_TT_1+\frac{K_T(K_T-1)}{2},$$
which gives $K_T=O(\sqrt{T})$.

Denote $C_1=L_1+L_2N+N^2+S^2$, $C_2=H+r_{\max}$ and $C_3=T_1+K_T-\tau_1-1$. Then we can get the final regret:
$$R_T=\mathrm{Regret}(A)+R_1+R_2+R_3+R_4\le\tau_1\Delta RK_T+HK_T+48C_1SH\sqrt{NT\log(NT)}+4C_1C_3SAH+C_1H+48C_1Sr_{\max}\sqrt{NT\log(NT)}+4C_1C_3SAr_{\max}+C_1r_{\max}$$
$$\le(\tau_1\Delta R+H)\sqrt{T}+48C_1S(H+r_{\max})\sqrt{NT\log(NT)}+4C_1SA(r_{\max}+H)\sqrt{T}+C_1(H+r_{\max})=48C_1C_2S\sqrt{NT\log(NT)}+\big(\tau_1\Delta R+H+4C_1C_2SN\big)\sqrt{T}+C_1C_2.$$
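Lemma 10's episode schedule is concrete enough to check directly: with $T_1=(\sqrt{T}+1)/2$ and lengths increasing by one, the number of episodes needed to cover horizon $T$ is essentially $\sqrt{T}$. A quick numerical check (our own sketch, not the paper's code):

```python
import math

def episode_count(T):
    """Count episodes of lengths T1, T1 + 1, T1 + 2, ... until they cover horizon T."""
    T1 = (math.sqrt(T) + 1) / 2
    covered, k = 0.0, 0
    while covered < T:
        covered += T1 + k  # episode k+1 has length T1 + k
        k += 1
    return k

# Solving K_T * T1 + K_T * (K_T - 1) / 2 = T for this schedule gives K_T = sqrt(T).
```

For perfect-square horizons the count is exactly $\sqrt{T}$, matching the $K_T=O(\sqrt{T})$ claim.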

$b^i_t(s,\theta)$ — the belief of arm $i$ being in state $s$ at time $t$ under parameter $\theta$
$\tilde b_t$ — the belief over all arms at time $t$ under parameter $\theta_k$
$b^*_t$ — the belief over all arms at time $t$ under the true parameter $\theta^*$

Here $r_k(s,a_t)=\sum_r r\,R^{a_t}_k(r\mid s)$ is the expected reward conditioned on the state $s$ of the pulled arm and the action $a_t$ when the reward function is $R^{a_t}_k$, and $r_*(s,a_t)=\sum_r r\,R^{a_t}_*(r\mid s)$ is the expected reward conditioned on $s$ and $a_t$ under the true reward function $R^{a_t}_*$. Equation 44 follows by adding and subtracting the term $\sum_s r_k(s,a_t)\,b^*_t(s)$.


Remark 3 (Approximation error). Suppose the oracle returns an $\epsilon_k$-approximate policy $\tilde\pi_k$ in each episode instead of the optimal policy, i.e.,
$$r(b,\tilde\pi_k(b))+\sum_{r}P(r\mid b,\tilde\pi_k(b),\theta)\,v(b',\theta)\ge\max_{a}\Big\{r(b,a)+\sum_{r}P(r\mid b,a,\theta)\,v(b',\theta)\Big\}-\epsilon_k.$$
Then we must account for the extra regret $\mathbb{E}\big[\sum_{k:t_k\le T}(T_k-\tau_1)\epsilon_k\big]$ incurred in the exploitation phases. If we control the error as $\epsilon_k\le\frac{1}{T_k-\tau_1}$, then the extra regret is bounded by $\mathbb{E}\big[\sum_{k:t_k\le T}(T_k-\tau_1)\epsilon_k\big]\le K_T=O(\sqrt{T})$ (Lemma 10). Thus the approximation error in the computation of the optimal policy is only additive to the regret of our algorithm.

C POSTERIOR DISTRIBUTION

Note that we assume the state transition of each arm is independent of the action. Denote by $s^i_{0:t}$ the history of states visited by arm $i$ from time 0 to $t$, by $r^i_{0:t}$ the corresponding reward history, and by $a^i_{0:t}$ the action history. Let $N^i_{s,s'}(s^i_{0:t})$ be the number of times the state transitions from $s$ to $s'$ for arm $i$ in the state history $s^i_{0:t}$. Hence, if the prior $g(P^i(s,\cdot))$ is Dirichlet$(\phi^i_{s,s_1},\ldots,\phi^i_{s,s_S})$, then after observing the history $s^i_{0:t}$, the posterior $g(P^i(s,\cdot)\mid s^i_{0:t})$ is Dirichlet$(\phi^i_{s,s_1}+N^i_{s,s_1}(s^i_{0:t}),\ldots,\phi^i_{s,s_S}+N^i_{s,s_S}(s^i_{0:t}))$ (Ross et al., 2011). Similarly, if the prior $g(R^i(s,\cdot))$ is Dirichlet$(\psi^i_{s,r_1},\ldots,\psi^i_{s,r_k})$, then after observing the reward history $r^i_{0:t}$ and the state history $s^i_{0:t}$, the posterior $g(R^i(s,\cdot)\mid r^i_{0:t},s^i_{0:t})$ is Dirichlet$(\psi^i_{s,r_1}+N^i_{s,r_1}(s^i_{0:t},r^i_{0:t}),\ldots,\psi^i_{s,r_k}+N^i_{s,r_k}(s^i_{0:t},r^i_{0:t}))$, where $N^i_{s,r}$ is the number of times the observation $(s,r)$ appears in the history $(s^i_{0:t},r^i_{0:t})$. In what follows, we drop the arm index and consider a fixed arm. For the unknown transition function, we adopt the prior $g_0(P^i)=f\big(\frac{P^i-\epsilon_1\mathbf{1}}{1-\epsilon_1}\mid\phi^i\big)$; we consider this special prior because the minimum element of the transition matrix is at least $\epsilon_1$. Next we show how to update the posterior distribution for the unknown $P$ (the update for the unknown reward function $R$ is analogous):
$$g(P\mid a_{0:t-1},r_{0:t-1})=\frac{P(r_{0:t-1},s_t\mid P,a_{0:t-1})\,g(P)}{\int P(r_{0:t-1},s_t\mid P,a_{0:t-1})\,g(P)\,dP}=\frac{\sum_{s_{0:t-1}\in\mathcal{S}^t}P(r_{0:t-1},s_{0:t}\mid P,a_{0:t-1})\,g(P)}{\int P(r_{0:t-1},s_t\mid P,a_{0:t-1})\,g(P)\,dP},$$
where the last equality uses the prior $g_0(P^i)=f\big(\frac{P^i-\epsilon_1\mathbf{1}}{1-\epsilon_1}\mid\phi^i\big)$ for the unknown $P^i$. Next we present the Bayesian approach to learning the unknown $P$ and $R$ from the history $(a_{0:t-1},r_{0:t})$.
Since the current state $s_t$ of the agent at time $t$ is unobserved, we consider a joint posterior $g(s_t,P,R\mid a_{0:t-1},r_{0:t})$ over $s_t$, $P$ and $R$ (Ross et al., 2011). Most parts are similar to Ross et al. (2011), except for our special priors:
$$g(s_t,P,R\mid a_{0:t-1},r_{0:t})\propto P(r_{0:t},s_t\mid P,R,a_{0:t-1})\,g(P,R)\propto\sum_{s_{0:t-1}\in\mathcal{S}^t}P(r_{0:t},s_{0:t}\mid P,R,a_{0:t-1})\,g(P,R)\propto\sum_{s_{0:t-1}\in\mathcal{S}^t}g(s_0,P,R),$$
where $g(s_0,P,R)$ is the joint prior over the initial state $s_0$, the transition function $P$, and the reward function $R$; $N_{ss'}(s_{0:t})$ is the number of times the transition $(s,s')$ appears in the state history $s_{0:t}$; and $N_{sr}(s_{0:t},r_{0:t-1})$ is the number of times the observation $(s,r)$ appears in the state-reward history $(s_{0:t},r_{0:t-1})$.
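When a state trajectory is fully observed, the Dirichlet update above is simply "add the transition counts to the prior parameters"; with unobserved states, the posterior becomes a mixture of such terms, one per candidate trajectory. A minimal sketch of the fully observed conjugate step that each mixture component uses (names and the toy trajectory are illustrative):

```python
import numpy as np

def dirichlet_posterior(prior, states):
    """Conjugate Dirichlet update for row-wise transition counts: add
    N_{s,s'}(s_{0:t}) from an observed state trajectory to the prior."""
    post = np.array(prior, dtype=float)
    for s, s_next in zip(states[:-1], states[1:]):
        post[s, s_next] += 1.0
    return post

prior = np.ones((2, 2))        # Dirichlet(1, 1) prior on each row of P
traj = [0, 0, 1, 1, 0, 1]      # observed trajectory s_0, ..., s_5
post = dirichlet_posterior(prior, traj)
# post[s] parameterizes the Dirichlet posterior of row P(s, .)
```

In TSEETC proper, the exploration-phase samples play the role of this trajectory, and the posterior over $(s_t, P, R)$ weights one such update per hypothesized state sequence.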

D TECHNICAL RESULTS

Proposition 1 (Uniform bound on the bias span (Zhou et al., 2021)). If the belief MDP satisfies Assumptions 1 and 2, then for $(J(\theta),v(\cdot,\theta))$ satisfying the Bellman equation (2), the span of the bias function $\mathrm{span}(v,\theta)$ is bounded by a function $H(\alpha)$. It is easy to check that $H(\alpha)$ is increasing in $\alpha$; since $\alpha$ is decreasing in $\epsilon$ and we assume the smallest element of the transition matrices is $\epsilon_1$, the span function is bounded by $H(\epsilon_1)$.

Proposition 2 (Controlling the belief error (Xiong et al., 2022c)). Suppose Assumptions 1 and 2 hold. Given $(R_k,P_k)$, an estimator of the true model parameters $(R^*,P^*)$, and an arbitrary reward-action sequence $(\tilde r_t,\tilde a_t)$, let $\tilde b_t(\cdot,R_k,P_k)$ and $b_t(\cdot,R^*,P^*)$ be the corresponding beliefs in period $t$ under $(R_k,P_k)$ and $(R^*,P^*)$, respectively. Then there exist constants $L_1$ and $L_2$ such that
$$\|b_t-\tilde b_t\|_1\le L_1\|R^*-R_k\|_1+L_2\max_{s}\|P^*(s,:)-P_k(s,:)\|_2.$$

Lemma 12 (Lemma 17 in Auer et al. (2008)). For any $t\ge1$, the probability that the true MDP $M$ is not contained in the set of plausible MDPs $\mathcal{M}(t)$ at time $t$ is at most $\frac{\delta}{15t^6}$, that is, $P\{M\notin\mathcal{M}(t)\}<\frac{\delta}{15t^6}$.

Lemma 13 (Posterior sampling (Ouyang et al., 2017)). In TSEETC, $t_k$ is an almost surely finite $\sigma(H_{t_k})$-stopping time. If the prior distributions $g_0(P)$ and $g_0(R)$ are the distributions of $\theta^*$, then for any measurable function $g$, $\mathbb{E}[g(\theta^*)\mid H_{t_k}]=\mathbb{E}[g(\theta_k)\mid H_{t_k}]$.

