PROVABLY EFFICIENT LIFELONG REINFORCEMENT LEARNING WITH LINEAR REPRESENTATION

Abstract

We theoretically study lifelong reinforcement learning (RL) with linear representation in a regret minimization setting. The goal of the agent is to learn a multi-task policy based on a linear representation while solving a sequence of tasks that may be adaptively chosen based on the agent's past behaviors. We frame the problem as a linearly parameterized contextual Markov decision process (MDP), where each task is specified by a context and the transition dynamics are context-independent, and we introduce a new completeness-style assumption on the representation which is sufficient to ensure the optimal multi-task policy is realizable under the linear representation. Under this assumption, we propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks while using only sublinear planning calls. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound $\tilde{O}(\sqrt{(d^3 + d'd)H^4K})$ based on $O(dH\log(K))$ planning calls, where $d$ and $d'$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to learn to internalize experiences into a multi-task policy and rapidly solve new tasks.

1. INTRODUCTION

Recently, there has been a surge of interest in designing lifelong learning agents that can continuously learn to solve multiple sequential decision making problems in their lifetimes (Thrun & Mitchell, 1995; Khetarpal et al., 2020; Silver et al., 2013; Xie & Finn, 2021). This scenario is in particular motivated by building multi-purpose embodied intelligence, such as robots working in a weakly structured environment (Roy et al., 2021). Typically, curating all tasks beforehand for such problems is nearly infeasible, and the problems the agent is tasked with may be adaptively selected based on the agent's past behaviors. Consider a household robot as an example. Since each household is unique, it is difficult to anticipate upfront all scenarios the robot would encounter. Moreover, the tasks the robot faces are not independent and identically distributed (i.i.d.). Instead, what the robot has done before can affect the next task and its starting state; e.g., if the robot fails to bring a glass of water and breaks it, then the user is likely to command the robot to clean up the mess. Thus, it is critical that the agent continuously improves and generalizes learned abilities to different tasks, regardless of their order. In this work, we theoretically study lifelong reinforcement learning (RL) in a regret minimization setting (Thrun & Mitchell, 1995; Ammar et al., 2015), where the agent needs to solve a sequence of tasks using rewards in an unknown environment while balancing exploration and exploitation. Motivated by the embodied intelligence scenario, we suppose that tasks differ in rewards, but share the same state and action spaces and transition dynamics (Xie & Finn, 2021). To be realistic, we make no assumptions on how the tasks and initial states are selected; generally we allow them to be chosen from a continuous set by an adversary based on the agent's past behaviors.
Once a task is specified and revealed, the agent has one chance (i.e., executing one rollout from its current state) to complete the task, and then it moves to the next task. The agent's goal is to perform near optimally for the tasks it faces, despite the online nature of the problem. This means that the accumulated regret of the learner compared with the best policy for each task should be sublinear in its lifetime. We assume that there is no memory constraint; this is usually the case for robotics applications where real-world interactions are the main bottleneck (Xie & Finn, 2021). Nonetheless, we require that the agent eventually learns to make decisions without frequent deliberate planning, because planning is time consuming and creates undesirable wait time for user-interactive scenarios. In other words, the agent needs to learn a multi-task policy, generalizing from not only past samples but also past computation, to solve new tasks. Formally, we consider an episodic setup based on the framework of contextual Markov decision processes (CMDPs) (Abbasi-Yadkori & Neu, 2014; Hallak et al., 2015). It repeats the following steps: 1) At the beginning of an episode, the agent is set to an initial state and receives a context specifying the task reward, both of which can be arbitrarily chosen. 2) When needed, the agent uses its past experiences to plan for the current task. 3) The agent runs a policy in the environment for a fixed horizon in an attempt to solve the assigned task and gains experience from its policy execution. The agent's performance is measured as the regret with respect to the optimal policy of the corresponding task. We require that, for any task sequence, both the agent's overall regret and its number of planning calls be sublinear in the number of episodes.
While lifelong RL is not new, the realistic need of simultaneously achieving 1) sublinear regret and 2) a sublinear number of planning calls for 3) a potentially adversarial sequence of tasks and initial states makes the setup considered here particularly challenging. To our knowledge, existing works only address a strict subset of these requirements; in particular, the computation aspect is often ignored. Most provable works in lifelong RL assume that the tasks are finitely many (Ammar et al., 2015; Zhan et al., 2017; Brunskill & Li, 2015) or are i.i.d. (Ammar et al., 2014; Brunskill & Li, 2014; Abel et al., 2018a;b; Lecarpentier et al., 2021), while others considering similar setups to ours do not provide regret guarantees (Isele et al., 2016; Xie & Finn, 2021). On the technical side, the closest lines of work are Modi & Tewari (2020); Abbasi-Yadkori & Neu (2014); Hallak et al. (2015); Modi et al. (2018); Kakade et al. (2020) for contextual MDPs and Wu et al. (2021); Abels et al. (2019) for the dynamic setting of multi-objective RL, which study the sample complexity for arbitrary task sequences; however, they either assume the problem is tabular or require a model-based planning oracle with unknown complexity. Importantly, none of the existing works properly addresses the need for sublinear planning calls, which creates a large gap between the abstract setup and practical needs. In this paper, we aim to establish a foundation for designing agents meeting these three practically important requirements, a problem which has been overlooked in the literature. As the first step, here we study lifelong RL with linear representation. We suppose that the contextual MDP is linearly parameterized (Yang & Wang, 2019; Jin et al., 2020) and the agent needs to learn a multi-task policy based on this linear representation.
To make this possible, we introduce a new completeness-style assumption on the representation which is sufficient to ensure the optimal multi-task policy is realizable under the linear representation. Under these assumptions, we propose the first provably efficient lifelong RL algorithm, Upper Confidence Bound Lifelong Value Distillation (UCBlvd, pronounced as "UC Boulevard"), that possesses all three desired qualities. Specifically, for $K$ episodes of horizon $H$, we prove a regret bound $\tilde{O}(\sqrt{(d^3 + d'd)H^4K})$ using $\tilde{O}(dH\log(K))$ planning calls, where $d$ and $d'$ are the feature dimensions of the dynamics and rewards, respectively. From a high-level viewpoint, UCBlvd uses a linear structure to identify what to transfer and operates by interleaving 1) independent planning for a set of representative tasks and 2) distilling the planned results into a multi-task value-based policy. UCBlvd also constantly monitors whether the new experiences it has gained are sufficiently significant, based on a doubling schedule, to avoid unnecessary planning. On the technical side, UCBlvd's design is inspired by single-task LSVI-UCB (Jin et al., 2020); however, we introduce a novel distillation step based on a QCQP, along with a new completeness assumption, to enable computation sharing across tasks; we also extend the low-switching-cost technique (Abbasi-Yadkori et al., 2011; Gao et al., 2021; Wang et al., 2021) for single-task RL to the lifelong setup to achieve a sublinear number of planning calls. Notation. Throughout the paper, we use lower-case letters for scalars, lower-case bold letters for vectors, and upper-case bold letters for matrices. The Euclidean norm of $x$ is denoted by $\|x\|_2$. We denote the transpose of a vector $x$ by $x^\top$. For any vectors $x$ and $y$, we use $\langle x, y\rangle$ to denote their inner product. We denote the Kronecker product by $A \otimes B$. Let $A \in \mathbb{R}^{d\times d}$ be positive definite and $\nu \in \mathbb{R}^d$. The weighted 2-norm of $\nu$ with respect to $A$ is defined by $\|\nu\|_A := \sqrt{\nu^\top A \nu}$.
For a positive integer $n$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$. For a real number $\alpha$, we denote $\{\alpha\}^+ := \max\{\alpha, 0\}$. Finally, we use the notation $\tilde{O}$ for big-O notation that ignores logarithmic factors.

2. PRELIMINARIES

We formulate lifelong RL as a regret minimization problem in contextual MDPs (Abbasi-Yadkori & Neu, 2014; Hallak et al., 2015) with adversarial context and initial state sequences. We suppose that a context determines the task reward but does not affect the dynamics. Such a context dependency is common for the lifelong learning scenario where an embodied agent consecutively solves multiple tasks. Below we give the formal problem definition. Finite-horizon contextual MDP. We consider a finite-horizon contextual MDP denoted by $M = (S, A, W, H, P, r)$, where $S$ is the state space, $A$ is the action space, $W$ is the task context space, $H$ is the horizon (length of each episode), $P = \{P_h\}_{h=1}^H$ are the transition probabilities, and $r = \{r_h\}_{h=1}^H$ are the reward functions. We allow $S$ and $W$ to be continuous or infinitely large, while we assume $A$ is finite such that $\max_{a\in A}$ can be performed easily. For $h \in [H]$, $r_h(s,a,w)$ denotes the reward function, whose range is assumed to be in $[0,1]$, and $P_h(s'|s,a)$ denotes the probability of transitioning to state $s'$ upon playing action $a$ at state $s$. In short, a contextual MDP can be viewed as an MDP with state space $S \times W$ and action space $A$, where the context part of the state remains constant in an episode. To simplify the notation, for any function $f$, we write $P_h[f](s,a) := \mathbb{E}_{s' \sim P_h(\cdot|s,a)}[f(s')]$. Policy and value functions. In a finite-horizon contextual MDP, a policy $\pi = \{\pi_h\}_{h=1}^H$ is a sequence where $\pi_h : S \times W \to A$ determines the agent's action at time-step $h$. Given $\pi$, we define its state value function as $V^\pi_h(s,w) := \mathbb{E}[\sum_{h'=h}^H r_{h'}(s_{h'}, \pi_{h'}(s_{h'}, w), w) \mid s_h = s]$ and its action-value function as $Q^\pi_h(s,a,w) := r_h(s,a,w) + P_h[V^\pi_{h+1}(\cdot,w)](s,a)$, where $Q^\pi_{H+1} = 0$. We denote the optimal policy as $\pi^*_h(s,w) := \arg\sup_\pi V^\pi_h(s,w)$, and let $V^*_h := V^{\pi^*}_h$ and $Q^*_h := Q^{\pi^*}_h$ denote the optimal value functions.
Lastly, we recall the Bellman equation of the optimal policy:
$Q^*_h(s,a,w) = r_h(s,a,w) + P_h[V^*_{h+1}(\cdot,w)](s,a), \qquad V^*_h(s,w) = \max_{a\in A} Q^*_h(s,a,w)$. (1)
Interaction protocol of lifelong RL. The agent interacts with a contextual MDP $M$ in episodes. For presentation simplicity, we assume that the reward functions $r$ are known, while the transition probabilities $P$ are unknown and must be learned online; we will discuss how reward learning can be naturally incorporated in Section 4.3. At the beginning of episode $k$, the agent receives a task context $w^k \in W$ and is set to an initial state $s^k_1$, both of which can be adversarially chosen. The agent can use past experiences to plan for the current task, if needed. Then the agent executes its policy $\pi^k$: at each time-step $h \in [H]$, it observes the state $s^k_h$, plays an action $a^k_h = \pi^k_h(s^k_h, w^k)$, observes a reward $r^k_h := r_h(s^k_h, a^k_h, w^k)$, and goes to the next state $s^k_{h+1}$ according to $P_h(\cdot|s^k_h, a^k_h)$. Let $K$ be the total number of episodes. The agent's goal is to achieve sublinear regret, where the regret is defined as $R_K := \sum_{k=1}^K V^*_1(s^k_1, w^k) - V^{\pi^k}_1(s^k_1, w^k)$. As the comparator policy above (namely $\pi^*$ that defines $V^*_1$) also knows the task context, achieving sublinear regret implies that the agent would attain near task-specific optimal performance on average. Linear Model Representation. We focus on MDPs with linear transition kernels and reward functions (Jin et al., 2020; Yang & Wang, 2019) that are encapsulated in the following assumption. Assumption 1 (Linear MDPs). $M = (S, A, W, H, P, r)$ is a linear MDP with feature maps $\phi : S \times A \to \mathbb{R}^d$ and $\psi : S \times A \times W \to \mathbb{R}^{d'}$. That is, for any $h \in [H]$, there exist a vector $\eta_h \in \mathbb{R}^{d'}$ and $d$ measures $\mu_h := [\mu_h^{(1)}, \ldots, \mu_h^{(d)}]^\top$ over $S$ such that $P_h(\cdot|s,a) = \langle \mu_h(\cdot), \phi(s,a)\rangle$ and $r_h(s,a,w) = \langle \eta_h, \psi(s,a,w)\rangle$ for all $(s,a,w) \in S \times A \times W$.
Without loss of generality, $\|\phi(s,a)\|_2 \le 1$, $\|\psi(s,a,w)\|_2 \le 1$, $\|\mu_h(S)\|_2 \le \sqrt{d}$, and $\|\eta_h\|_2 \le \sqrt{d'}$ for all $(s,a,w,h) \in S \times A \times W \times [H]$. In real-world problems, we can use the context to model the task specification of a problem. For example, if we want to design household robots to assist humans with a series of tasks like cooking, cleaning, washing dishes, lawn mowing, and vacuuming, we can treat the context as a natural language instruction that the human user would give to the robot, and we can view the representations $\psi$ and $\phi$ as the embeddings of a pre-trained deep neural network model. Example 1 (Weighted Rewards). An interesting and common special case is $\psi(s,a,w) = \phi(s,a) \otimes \rho(w)$, for some mapping $\rho : W \to \mathbb{R}^m$. In this case, it holds that $d' = md$ and $r_h(s,a,w) = \langle \rho(w), \mathbf{r}_h(s,a)\rangle$, where $\mathbf{r}_h(s,a) = A_h \phi(s,a) \in \mathbb{R}^m$, for some $A_h \in \mathbb{R}^{m\times d}$, is the vector of reward functions at time-step $h$. We can view $r_h(s,a,w)$ as a weighted reward with weights $\rho(w)$ that depend on the task $w$. This setting is closely related to multi-objective RL, studied for the tabular case in Wu et al. (2021), which considers $\rho(w) = w \in \mathbb{R}^m$ along with tabular $S$ and $A$.
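The Kronecker structure in Example 1 is easy to check numerically. The sketch below uses made-up dimensions and a random matrix $A_h$ (stand-ins for illustration, not values from the paper) to verify that the weighted reward $\langle \rho(w), A_h\phi(s,a)\rangle$ equals a single inner product $\langle \eta_h, \psi(s,a,w)\rangle$ with $\psi = \phi \otimes \rho$:

```python
import numpy as np

# Minimal sketch of Example 1 (weighted rewards): psi(s,a,w) = phi(s,a) (x) rho(w).
# Dimensions and A_h are illustrative stand-ins, not values from the paper.
rng = np.random.default_rng(0)
d, m = 4, 3                        # dynamics feature dim d; task-weight dim m (d' = m*d)

phi = rng.normal(size=d)           # phi(s, a): a state-action feature vector
rho = rng.normal(size=m)           # rho(w): embedding of the task context w
psi = np.kron(phi, rho)            # psi(s, a, w), dimension d' = m*d

A_h = rng.normal(size=(m, d))      # vector reward map: r_h(s, a) = A_h @ phi(s, a)
eta_h = A_h.T.reshape(-1)          # stack columns of A_h so <eta_h, psi> matches

weighted_reward = rho @ (A_h @ phi)   # <rho(w), r_h(s, a)>
linear_reward = eta_h @ psi           # <eta_h, psi(s, a, w)>
assert np.isclose(weighted_reward, linear_reward)
```

The block ordering matters: `np.kron(phi, rho)` places one copy of `rho` per entry of `phi`, so `eta_h` must stack the columns of `A_h` in the same order.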

3. A WARM-UP ALGORITHM FOR LIFELONG RL

We first present a warm-up algorithm based on linear representation, termed Lifelong Least-Squares Value Iteration (Lifelong-LSVI), in Algorithm 1; it is a straightforward extension of the single-task LSVI-UCB algorithm proposed by Jin et al. (2020) to the lifelong learning setting. The motivation of this warm-up algorithm is to give intuition on how the problem structure in Assumption 1 can be used to achieve small regret, and to discuss the computational difficulty in lifelong learning. We will show that Lifelong-LSVI has a sublinear regret bound, which matches the minimax optimal rate in the special case studied by Wu et al. (2021) in terms of the number of objectives, $m$ (see Example 1). However, we will also show that Lifelong-LSVI is not computationally efficient, in the sense that the number of planning calls it requires grows linearly with the number of episodes, which means the overall computational complexity grows quadratically. This high computation cost arises because the agent never learns to internalize the task-solving skills; instead, it must go through all past experiences for planning every time a new task arrives. Importantly, we will discuss why it cannot be made computationally efficient in an easy manner without further assumptions on the representation. This drawback motivates our new completeness assumption and our main algorithm, UCBlvd, which is provably efficient in terms of both regret and number of planning calls, presented in Section 4. We remark that Lifelong-LSVI is only a warm-up algorithm that guides the reader to understand the mechanisms used for addressing the problem, motivates the need for UCBlvd, and shows what regret bound is possible when computational complexity is not a concern (though it is impractical).

3.1. ALGORITHMIC NOTATIONS

To begin, we introduce the template and the notations that will be used commonly in presenting the warm-up algorithm, Lifelong-LSVI, and later our main algorithm, UCBlvd. For each algorithm, we first define an algorithm-specific action-value function $Q^k_h : S \times A \times W \to \mathbb{R}$, which determines the agent's policy at time-step $h$ in episode $k$; then we present the full algorithm and its analysis using the quantities below, which are defined with respect to each algorithm's definition of $Q^k_h$. Given $\{Q^k_h\}_{h\in[H]}$, we define state value functions and their backups as $V^k_h(s,w) := \min\{\max_{a\in A} Q^k_h(s,a,w), H\}$ and $\theta^k_h(w) := \int_S V^k_{h+1}(s', w)\, d\mu_h(s')$. Thanks to the linear MDP structure in Assumption 1, it holds that $P_h[V^k_{h+1}(\cdot,w)](s,a) = \langle \theta^k_h(w), \phi(s,a)\rangle$. Let $\lambda > 0$ be a constant. We define the $\lambda$-regularized least squares estimator of $\theta^k_h(w)$ as
$\hat\theta^k_h(w) := (\Lambda^k_h)^{-1} \sum_{\tau=1}^{k-1} \phi^\tau_h V^k_{h+1}(s^\tau_{h+1}, w)$, where $\Lambda^k_h := \lambda I_d + \sum_{\tau=1}^{k-1} \phi^\tau_h (\phi^\tau_h)^\top$, (5)
i.e., $\hat\theta^k_h(w)$ is the solution to $\min_{\theta\in\mathbb{R}^d} \sum_{\tau=1}^{k-1} (\langle\theta, \phi(s^\tau_h, a^\tau_h)\rangle - V^k_{h+1}(s^\tau_{h+1}, w))^2 + \lambda\|\theta\|_2^2$. Here $\phi^\tau_h := \phi(s^\tau_h, a^\tau_h)$ and $I_d \in \mathbb{R}^{d\times d}$ is the identity matrix.
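As a numerical sanity check of the estimator in (5) (with synthetic features and targets, not the paper's data), the closed form $(\Lambda)^{-1}\sum_\tau \phi^\tau y^\tau$ coincides with the minimizer of the regularized least-squares objective:

```python
import numpy as np

# Check (on synthetic data) that theta_hat = Lambda^{-1} sum_tau phi_tau * y_tau,
# with Lambda = lam*I + sum_tau phi_tau phi_tau^T, solves the ridge regression
# min_theta sum_tau (<theta, phi_tau> - y_tau)^2 + lam * ||theta||_2^2.
rng = np.random.default_rng(1)
d, n, lam = 5, 50, 1.0

Phi = rng.normal(size=(n, d))      # rows play the role of phi(s_tau, a_tau)
y = rng.normal(size=n)             # targets play the role of V^k_{h+1}(s_{h+1}^tau, w)

Lam = lam * np.eye(d) + Phi.T @ Phi
theta_closed = np.linalg.solve(Lam, Phi.T @ y)

# The same ridge problem written as an augmented least squares:
# [Phi; sqrt(lam) I] theta ~ [y; 0]
Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
theta_lstsq, *_ = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)

assert np.allclose(theta_closed, theta_lstsq)
```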

3.2. DETAILS OF LIFELONG-LSVI AND ITS THEORETICAL GUARANTEES

We define the upper confidence bound (UCB) style action-value function of Lifelong-LSVI as
$Q^k_h(s,a,w) := r_h(s,a,w) + \langle\hat\theta^k_h(w), \phi(s,a)\rangle + \beta\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}$, (6)
where $Q^k_{H+1} = 0$ and $\hat\theta^k_h(w)$ and $\Lambda^k_h$ are defined in (5). Here, $\beta$ is an exploration factor that will be appropriately chosen in Theorem 1. At episode $k$, given $w^k$, Lifelong-LSVI first performs planning backward in time based on past data to compute $\hat\theta^k_h(w^k)$ in (5) using $Q^k_{h+1}$ defined in (6) (Lines 4-5). Then, in execution, it uses $\hat\theta^k_h(w^k)$ to compute $Q^k_h(s^k_h, a, w^k)$ for the current state and all $a \in A$ (Line 7) and executes the action with the highest value (Line 8). We show that Lifelong-LSVI achieves sublinear regret for our lifelong RL setup. The complete proof is reported in Appx. A, which follows the ideas of LSVI-UCB (Jin et al., 2020). Theorem 1. Let $T = KH$. Under Assumption 1, there exists an absolute constant $c > 0$ such that for any fixed $\delta \in (0, 0.5)$, if we set $\lambda = 1$ and $\beta = cH(d + \sqrt{d'})\sqrt{\log(dd'T/\delta)}$ in Algorithm 1, then with probability at least $1 - 2\delta$, it holds that $R_K \le \tilde{O}(\sqrt{(d^3 + dd')H^3T})$. Before introducing our main algorithm in Section 4, we make a few remarks on the regret and number of planning calls of Lifelong-LSVI. First, Theorem 1 implies that for the special case studied by Wu et al. (2021) (Example 1), the regret bound of Lifelong-LSVI becomes $\tilde{O}(\sqrt{md^3H^3T})$. This rate is optimal in terms of its dependency on $m$, as shown in Wu et al. (2021). Furthermore, this rate matches LSVI-UCB's regret dependencies on $d$ and $H$ for the single-task setting (Jin et al., 2020). While Lifelong-LSVI has a decent regret guarantee, it requires computing $\hat\theta^k_h(w^k)$ for all $h \in [H]$ whenever a distinct new task $w^k$ arrives. Since the number of unique tasks may be as large as $K$, the total number of planning calls required by Lifelong-LSVI is $K$ in the worst case.
Unfortunately, the number of planning calls of Lifelong-LSVI cannot be easily improved, because under Assumption 1 alone, the optimal Q-function $Q^*_h(s,a,w)$ of the CMDP can be nonlinear in the representation $\psi$. As a result, for any algorithm that represents its policy linearly based on both $\psi$ and $\phi$, it is in general necessary to recompute the coefficients for every new $w$ to be optimal. For Lifelong-LSVI specifically, this nonlinear dependency shows up in the term $\hat\theta^k_h(w)$ of $Q^k_h(s,a,w)$ in (6). In the next section, we discuss how placing a completeness-style assumption, which ensures $Q^*_h(s,a,w)$ can be linearly parameterized by $\psi$, circumvents the issue of the nonlinear dependency of the action-value functions on $w$, and consequently enables computation sharing to decrease the number of planning calls to $O(dH\log(K))$.
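To make the per-task cost concrete, the following is a self-contained toy sketch of Lifelong-LSVI's planning routine. The environment, feature map, reward, and all constants are invented for illustration and are not from the paper; the point is that every distinct context forces a full backward pass over all stored transitions:

```python
import numpy as np

# Toy sketch of Lifelong-LSVI planning: one full backward pass per new context w.
# All quantities (features, rewards, data) are synthetic stand-ins.
rng = np.random.default_rng(2)
d, H, num_actions, lam, beta = 3, 4, 2, 1.0, 0.1

def phi(s, a):                        # toy feature map over integer states/actions
    v = np.cos(np.arange(1, d + 1) * (s + 2 * a + 1.0))
    return v / np.linalg.norm(v)

def reward(s, a, w):                  # toy context-dependent reward in [0, 1]
    return 0.5 * (1 + np.sin(s + a + w))

# Replay buffer of past transitions, per time-step h: (s, a, s')
data = {h: [(rng.integers(5), rng.integers(num_actions), rng.integers(5))
            for _ in range(30)] for h in range(H)}

def plan(w):
    """One planning call: backward LSVI pass (eqs. (5)-(6)) for task context w."""
    V_next = lambda s: 0.0            # V_{H+1} = 0
    thetas = [None] * H
    for h in reversed(range(H)):
        Phi = np.array([phi(s, a) for s, a, _ in data[h]])
        y = np.array([V_next(sp) for _, _, sp in data[h]])
        Lam = lam * np.eye(d) + Phi.T @ Phi
        Lam_inv = np.linalg.inv(Lam)
        theta = Lam_inv @ Phi.T @ y   # ridge estimate of theta_h(w)
        thetas[h] = (theta, Lam_inv)

        def V_next(s, theta=theta, Lam_inv=Lam_inv):  # V_h, truncated at H
            q = [reward(s, a, w) + phi(s, a) @ theta
                 + beta * np.sqrt(phi(s, a) @ Lam_inv @ phi(s, a))
                 for a in range(num_actions)]
            return min(max(q), H)
    return thetas

# A distinct context means a fresh planning call: K tasks -> K backward passes.
for w in [0.1, 0.7, 1.3]:
    assert len(plan(w)) == H
```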

4. UCB LIFELONG VALUE DISTILLATION (UCBLVD)

In this section, we present our main algorithm, UCB Lifelong Value Distillation (UCBlvd), in Algorithm 2. Under a new completeness-style assumption that we will introduce in Section 4.1, we show that UCBlvd shares the same regret bound as Lifelong-LSVI but significantly reduces the number of planning calls to be logarithmic in $K$. In contrast to Lifelong-LSVI, which learns an individual action-value function for each $w^k$, UCBlvd learns a single action-value function for all $w \in W$ based on $\psi(s,a,w)$ to enable computation sharing across tasks, which is made possible by the extra completeness-style assumption. In general, in order to directly extend Lifelong-LSVI to only use the feature $\psi(s,a,w) \in \mathbb{R}^{d'}$ with $d' \ge d$, we would need a context-dependent dynamics structure, which would eventually increase the regret. UCBlvd maintains the same order of regret as Lifelong-LSVI by separating the planning into a novel two-step process: 1) independent planning with $\phi$ for a set of representative task contexts and 2) distilling the planned results into a multi-task value function parameterized by $\psi$. In addition, UCBlvd runs a doubling schedule to decide whether replanning is necessary, which makes the total number of planning calls logarithmic in $K$.

4.1. ENABLING COMPUTATION SHARING

As lifelong RL with Assumption 1 alone would require replanning in every episode in general (see Section 3), here we introduce new structural assumptions on $\psi$ to enable computation sharing across tasks. First, we define the following class of functions:
$\mathcal{F} = \{f : f(s,w) = \min\{\{\max_{a\in A} \langle\nu, \psi(s,a,w)\rangle + \beta\|\phi(s,a)\|_{\Lambda^{-1}}\}^+, H\},\ \nu \in \mathbb{R}^{d'},\ \Lambda \in \mathcal{S}^d_{++},\ \beta \ge 0\}$,
where $\mathcal{S}^d_{++}$ denotes the set of symmetric positive definite matrices. We now state our main completeness-style assumption. Assumption 2 (Completeness). For any $f \in \mathcal{F}$ and $h \in [H]$, there exists a vector $\xi^f_h \in \mathbb{R}^{d'}$ with $\|\xi^f_h\| \le H\sqrt{d'}$ such that $P_h[f(\cdot,w)](s,a) = \langle \xi^f_h, \psi(s,a,w)\rangle$. This assumption says that the backups of functions in $\mathcal{F}$ are captured by the feature $\psi$ with bounded parameters. The definition of $\mathcal{F}$ closely models the structure of the action-value function used by Lifelong-LSVI in (6), except that $\langle\hat\theta^k_h(w), \phi(s,a)\rangle$ there is replaced by functions linear in $\psi(s,a,w)$. We will see that the action-value function used by UCBlvd defined in the next section is contained in $\mathcal{F}$. In addition, by setting $\beta = 0$ in $\mathcal{F}$ and (1), we see that $Q^*_h(s,a,w)$ is linearly realizable by $\psi$ under Assumption 2. We note that a similar notion of this assumption is mentioned in previous work for single-task settings under the name of "optimistic closure" (Wang et al., 2020). Inspired by Example 1, we now introduce the next assumption on the structure of $\psi$. Assumption 3 (Mappings). We assume $\psi(s,a,w) = \phi(s,a) \otimes \rho(w)$, for some mapping $\rho : W \to \mathbb{R}^m$, i.e., $d' = md$. We assume that there is a known set $\{w^{(1)}, w^{(2)}, \ldots, w^{(n)}\}$ of $n \le m$ task contexts such that $\rho(w) \in \mathrm{Span}(\{\rho(w^{(j)})\}_{j\in[n]})$ for all $w \in W$. That is, for any $w \in W$, there exist coefficients $\{c_j(w)\}_{j\in[n]}$ such that $\rho(w) = \sum_{j\in[n]} c_j(w)\rho(w^{(j)})$. We assume $\sum_{j\in[n]} |c_j(w)| \le L$ for all $w \in W$ and some $L < \infty$. Note that, for finite-dimensional representations, such a set $\{\rho(w^{(j)})\}_{j\in[n]}$ always exists; we assume the set $\{w^{(1)}, w^{(2)}, \ldots, w^{(n)}\}$ is known to the algorithm.

4.2. DETAILS OF UCBLVD

We define the UCB style action-value function of UCBlvd as
$Q^k_h(s,a,w) := \{r_h(s,a,w) + \langle\hat\xi^k_h, \psi(s,a,w)\rangle + 2L\beta\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}\}^+$. (7)
The parameter $\hat\xi^k_h$ is computed by solving the convex quadratically constrained quadratic program (QCQP) in (8), which is defined on the set of representative task contexts $\{w^{(1)}, w^{(2)}, \ldots, w^{(n)}\}$ from Assumption 3 and a set of state-action pairs $D := \{(s,a)\}$ chosen such that $\{\phi(s,a) : (s,a) \in D\}$ are $d$ linearly independent vectors:
$(\hat\xi^k_h, \{\hat\theta^{k,(j)}_h\}_{j\in[n]}) = \arg\min_{\xi, \{\theta^{(j)}\}_{j\in[n]}} \sum_{j\in[n]} \sum_{(s,a)\in D} (\langle\theta^{(j)}, \phi(s,a)\rangle - \langle\xi, \psi(s,a,w^{(j)})\rangle)^2$ (8)
s.t. $\|\theta^{(j)} - \hat\theta^k_h(w^{(j)})\|_{\Lambda^k_h} \le \beta,\ \forall j \in [n]$, and $\|\xi\|_2 \le H\sqrt{md}$,
where $\hat\theta^k_h(w)$ and $\Lambda^k_h$ are defined in (5). In Appx. B.3, we show that the action-value function in (7) is an optimistic estimate of the optimal action-value function. UCBlvd also uses the linear dependency of $Q^k_h$ on $\psi$ to reduce calls of the planning step in (8). The agent triggers replanning only when it has gathered enough new information compared to the last planning update, which happened at some episode $\tilde{k}$. This is measured by tracking the variations in the Gram matrices $\{\Lambda^k_h\}_{h\in[H]}$ (Line 4 of Algorithm 2). Finally, when executing the policy at episode $k$, the agent chooses actions greedily with respect to $Q^{\tilde{k}}_h$ (Line 10).

Algorithm 2: UCBlvd (UCB Lifelong Value Distillation)
  Set $Q^k_{H+1}(\cdot,\cdot,\cdot) = 0$ for all $k \in [K]$; $\tilde{k} = 1$.
  for episodes $k = 1, \ldots, K$ do
    Observe the initial state $s^k_1$ and the task context $w^k$.
    if $\exists h \in [H]$ such that $\log\det\Lambda^k_h - \log\det\Lambda^{\tilde{k}}_h > 1$ then
      $\tilde{k} = k$
      for time-steps $h = H, \ldots, 1$ do
        Compute $\hat\xi^{\tilde{k}}_h$ as in (8).
    for time-steps $h = 1, \ldots, H$ do
      Compute $Q^{\tilde{k}}_h(s^k_h, a, w^k)$ for all $a \in A$ as in (7).
      Play $a^k_h = \arg\max_{a\in A} Q^{\tilde{k}}_h(s^k_h, a, w^k)$ and observe $s^k_{h+1}$ and $r^k_h$.
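The replanning test in Line 4 of Algorithm 2 can be simulated in isolation. The numpy sketch below (synthetic unit-norm features standing in for $\phi(s^k_h, a^k_h)$, a single time-step $h$, and one feature per episode, a simplification of the real update) counts how often the log-det condition fires over $K$ episodes:

```python
import numpy as np

# Sketch of UCBlvd's replanning trigger: replan only when
# log det(Lambda^k) - log det(Lambda^{k_tilde}) > 1, i.e., when the Gram matrix
# has gained enough information since the last planning episode k_tilde.
# The feature stream is synthetic; real UCBlvd checks this for every h in [H].
rng = np.random.default_rng(3)
d, lam, K = 3, 1.0, 200

Lam = lam * np.eye(d)                     # Lambda at a single time-step h
logdet_at_last_plan = np.linalg.slogdet(Lam)[1]
plan_episodes = []

for k in range(1, K + 1):
    f = rng.normal(size=d)
    f /= np.linalg.norm(f)                # ||phi||_2 <= 1, as in Assumption 1
    Lam += np.outer(f, f)                 # rank-one update from the episode's data
    logdet = np.linalg.slogdet(Lam)[1]
    if logdet - logdet_at_last_plan > 1.0:
        plan_episodes.append(k)           # trigger a planning call
        logdet_at_last_plan = logdet

# Planning fires on a doubling-like schedule: O(d log K) calls, not K.
print(len(plan_episodes), "planning calls over", K, "episodes")
assert len(plan_episodes) < K // 4
```

Since the total log-det growth is at most $d\log(1 + K/(d\lambda))$ and each trigger consumes at least one unit of it, the number of triggers is bounded exactly as in Theorem 2.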

4.3. THEORETICAL ANALYSIS OF UCBLVD

We present our main theoretical result, which shows that UCBlvd achieves sublinear regret in lifelong RL using a sublinear number of planning calls, for any sequence of tasks. The proof is given in Appx. B. Theorem 2. Let $T = KH$. Under Assumptions 1, 2, and 3, the number of planning calls in Algorithm 2 is at most $dH\log(1 + \frac{K}{d\lambda})$, and there exists an absolute constant $c > 0$ such that for any fixed $\delta \in (0, 0.5)$, if we set $\lambda = 1$ and $\beta = cH(d + \sqrt{md})\sqrt{\log(mdT/\delta)}$ in Algorithm 2, then with probability at least $1 - 2\delta$, it holds that $R_K \le \tilde{O}(L\sqrt{(d^3 + md^2)H^3T})$. Theorem 2 shows that UCBlvd has the same regret bound as Lifelong-LSVI in Theorem 1, but reduces the number of planning calls from $K$ to $dH\log(1 + K/(d\lambda))$. As we discussed before, this is made possible by the unique QCQP-based distillation step of UCBlvd in (8). If we were to simply perform least-squares regression to fit $\langle\psi(s,a,w), \xi^k_h\rangle$ to $\{\langle\phi(s,a), \hat\theta^k_h(w^{(j)})\rangle\}_{j\in[n]}$ for distillation, we could not guarantee the required optimism, because $\langle\phi(s,a), \hat\theta^k_h(w)\rangle$ computed based on finite samples can be an irregular function that cannot be modeled by $\psi(s,a,w)$. Remark 1. If the rewards are unknown, we can adopt a slightly different completeness assumption with an extra bonus in terms of $\psi$, and then combine tools from linear bandits (Abbasi-Yadkori et al., 2011) and our proof of Theorem 2. Because reward learning affects the radius of the confidence intervals for $\theta^k_h(w)$, the number of planning calls and regret would increase by factors of $O(m)$ and $O(\sqrt{m})$, respectively, compared to those in Theorem 2. See Appx. C for details. Remark 2. It is possible to eliminate the assumption that $\psi(s,a,w) = \phi(s,a) \otimes \rho(w)$. In this case, our analysis would instead require a set $\{w^{(1)}, w^{(2)}, \ldots, w^{(n)}\}$ of $n$ tasks such that $\psi(s,a,w) \in \mathrm{Span}(\{\psi(s,a,w^{(j)})\}_{j\in[n]})$ for all $(s,a,w) \in S \times A \times W$. In Appx. D, we provide details of this relaxation, and show that this version still enjoys the same planning calls and regret as in Theorem 2. Remark 3. We can eliminate Assumptions 1 and 3 and instead design a computation-sharing version of Lifelong-LSVI under a slightly different completeness assumption with a class $\mathcal{F}$ whose exploration bonus is $\beta\|\psi(s,a,w)\|_{\tilde\Lambda^{-1}}$. This assumption naturally includes settings with linear MDPs in which the dynamics also change with the task context, i.e., for all $h \in [H]$, it holds that $P_h(\cdot|s,a,w) = \langle\tilde\mu_h(\cdot), \psi(s,a,w)\rangle$ for $d'$ unknown measures $\tilde\mu_h := [\tilde\mu_h^{(1)}, \ldots, \tilde\mu_h^{(d')}]^\top$. Under this assumption, a slightly modified version of Lifelong-LSVI would use $Q^k_h(s,a,w) = \{r_h(s,a,w) + \langle\hat\nu^k_h, \psi(s,a,w)\rangle + \beta\|\psi(s,a,w)\|_{(\tilde\Lambda^k_h)^{-1}}\}^+$, where $\hat\nu^k_h = (\tilde\Lambda^k_h)^{-1}\sum_{\tau=1}^{k-1}\psi^\tau_h \min\{\max_{a\in A} Q^k_{h+1}(s^\tau_{h+1}, a, w^\tau), H\}$, $\tilde\Lambda^k_h = \lambda I_{d'} + \sum_{\tau=1}^{k-1}\psi^\tau_h(\psi^\tau_h)^\top$, $\psi^\tau_h = \psi(s^\tau_h, a^\tau_h, w^\tau)$, and $\beta = \tilde{O}(d')$. However, in Appx. E, we show that this new algorithm and assumption result in an $\tilde{O}(mdH)$ number of planning calls and a regret scaling with $\tilde{O}(\sqrt{m^3d^3})$ for settings with $\psi(s,a,w) = \phi(s,a) \otimes \rho(w)$. These are worse than the number of planning calls and regret of UCBlvd in Theorem 2 by a factor of $O(m)$. (While in both the setting of Remark 1 and that of this remark the action-value functions contain an exploration bonus in terms of $\psi$, the regret in Remark 1 is better by a factor of $\sqrt{m}$; this is because the multiplicative factor $\beta$ there saves a factor of $\sqrt{m}$ compared to that of this remark.) Remark 4. A natural follow-up relaxation of Assumption 2 is when the equality holds up to an error of $\zeta$. In Appx. F, we show that this relaxation results in a regret $\tilde{O}(\sqrt{md}\,T\zeta + \sqrt{\lambda(d^3 + md^2)H^3T})$ and the same number of planning calls as in Theorem 2. When $\zeta$ is sufficiently small, i.e., $\zeta = O(\sqrt{d^2H^3/(mT)})$, UCBlvd still enjoys a regret of the same order as that in Theorem 2.
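The span condition in Assumption 3 (and its relaxation in Remark 2) is straightforward to operationalize: given the representative embeddings, the coefficients $c_j(w)$ are recovered by a least-squares solve. A minimal numpy sketch with synthetic embeddings (dimensions are illustrative):

```python
import numpy as np

# Sketch of the span condition: any rho(w) in the span of n representative
# embeddings {rho(w^(j))} is expressed via coefficients c_j(w) from least squares.
rng = np.random.default_rng(4)
m, n = 5, 3

B = rng.normal(size=(m, n))            # columns: rho(w^(1)), ..., rho(w^(n))
c_true = rng.normal(size=n)
rho_w = B @ c_true                     # a new task whose embedding lies in the span

c, *_ = np.linalg.lstsq(B, rho_w, rcond=None)
assert np.allclose(B @ c, rho_w)       # exact reconstruction inside the span
L_w = np.abs(c).sum()                  # the quantity bounded by L in Assumption 3
print("coefficients:", np.round(c, 3), " sum |c_j| =", round(L_w, 3))
```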

4.4. PROOF SKETCH OF THEOREM 2

Because the proof of the planning calls' upper bound follows standard arguments in the low-switching-cost analysis of Abbasi-Yadkori et al. (2011), in this section we focus on the proof sketch for the regret bound. We start by introducing the high-probability event $E_1$, which is the foundation of our analysis:
$E_1(w) := \{\|\theta^k_h(w) - \hat\theta^k_h(w)\|_{\Lambda^k_h} \le \beta,\ \forall (h,k) \in [H] \times [K]\}$. (9)
The following lemma highlights the importance of the carefully designed planning step in (8), which ensures good estimators for $\xi^{V^*_{h+1}}_h$ without the need for a bonus term $\|\psi(s,a,w)\|_{(\tilde\Lambda^k_h)^{-1}}$. This step saves a factor of $O(m)$ in planning calls and regret. Lemma 1. Let $\bar{W} = \{w^\tau : \tau \in [K]\} \cup \{w^{(j)} : j \in [n]\}$. Under the setting of Theorem 2 and conditioned on the events $\{E_1(w)\}_{w\in\bar{W}}$ defined in (9), for all $(s,a,w,h,k) \in S \times A \times W \times [H] \times [K]$, it holds that $|\langle\hat\xi^k_h, \psi(s,a,w)\rangle - P_h[V^k_{h+1}(\cdot,w)](s,a)| \le 2L\beta\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}$. As the final step in the regret analysis, we use Lemma 1 to prove the optimistic nature of UCBlvd, i.e., $Q^k_h(s,a,w^k) \ge Q^*_h(s,a,w^k)$ for all $(s,a,h,k) \in S \times A \times [H] \times [K]$. Then, following the standard analysis of single-task LSVI-UCB, we derive the regret bound in Theorem 2.
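For intuition, the final chain can be summarized schematically as follows (our summary with constants suppressed; $\zeta^k_h$ is our notation for the martingale noise terms). Conditioned on $E_1$ and using optimism,

```latex
R_K \;\le\; \sum_{k=1}^{K}\Bigl(V^{k}_1(s^k_1,w^k)-V^{\pi^k}_1(s^k_1,w^k)\Bigr)
    \;\le\; \underbrace{\sum_{k=1}^{K}\sum_{h=1}^{H}\zeta^k_h}_{\tilde O(H\sqrt{T})\text{ by Azuma--Hoeffding}}
      \;+\; 4L\beta\sum_{k=1}^{K}\sum_{h=1}^{H}\bigl\|\phi(s^k_h,a^k_h)\bigr\|_{(\Lambda^{k}_h)^{-1}},
\qquad
\sum_{k,h}\bigl\|\phi(s^k_h,a^k_h)\bigr\|_{(\Lambda^{k}_h)^{-1}}
    \;\le\; \sqrt{T\sum_{k,h}\bigl\|\phi(s^k_h,a^k_h)\bigr\|^2_{(\Lambda^{k}_h)^{-1}}}
    \;\le\; \tilde O\bigl(\sqrt{dHT}\bigr),
```

where the last step uses Cauchy-Schwarz and the elliptical potential lemma. Plugging in $\beta = \tilde{O}(H(d+\sqrt{md}))$ gives $R_K \le \tilde{O}(L\sqrt{(d^3+md^2)H^3T})$, matching Theorem 2.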

4.5. EXPERIMENTS

We implemented our main algorithm, UCBlvd, on synthetic environments and compared its performance with the warm-up algorithm Lifelong-LSVI, which serves as an idealized baseline that ignores computational complexity. In all the experiments, the same setting, task sequences, and feature mappings were used for both UCBlvd and Lifelong-LSVI. Figure 1a depicts per-episode rewards for the main setup considered throughout the paper, and Figure 1b shows those for the setup in Remark 2. The plots verify that Lifelong-LSVI and UCBlvd perform statistically almost the same, while UCBlvd uses far fewer planning calls (1000 vs. ∼20). We remark that Lifelong-LSVI has an overall computation complexity of $O(K^2)$, which makes it impractical for the lifelong learning setting, as its planning complexity increases linearly with the number of samples. The details on the parameters of the simulations are deferred to Appx. H.

5. RELATED WORK

We consider the regret minimization setup of lifelong RL under the contextual MDP framework, where the agent receives tasks specified by contexts in sequence and needs to achieve sublinear regret for any task sequence. Below, we contrast our work with related work in the literature. Lifelong RL. Generally, lifelong RL studies how to learn to solve a streaming sequence of tasks using rewards. While it was originally motivated by the need for endless learning in robots (Thrun & Mitchell, 1995), historically many works on lifelong RL (Ammar et al., 2014; Brunskill & Li, 2014; Abel et al., 2018a;b; Lecarpentier et al., 2021) assume that the tasks are i.i.d. (similar to multi-task RL; see below). There are works for adversarial sequences, but most of them assume a finite number of tasks (Brunskill & Li, 2015; Ammar et al., 2015; Zhan et al., 2017) or are purely empirical (Xie & Finn, 2021). The work by Isele et al. (2016) uses contexts to enable zero-shot learning as we do here, but it (as well as most works above) does not provide formal regret guarantees. Brunskill & Li (2015) and Xie & Finn (2021) assume the task identity is latent, which requires additional exploration; in this sense, their problem is harder than the setup here, where the task context is revealed. Extending the setup here to consider latent contexts is an important future direction. Contextual MDP and multi-objective RL. Our setup is closely related to the exploration problem studied in the contextual MDP literature, though contextual MDPs were not originally motivated from the lifelong learning perspective. A similar mathematical problem appears in the dynamic setup of multi-objective RL (Wu et al., 2021; Abels et al., 2019), which can be viewed as a special case of contextual MDPs where the context linearly determines the reward function but not the dynamics.
Most contextual MDP works allow adversarial contexts and initial states, but most of them focus on the tabular setting (Abbasi-Yadkori & Neu, 2014; Hallak et al., 2015; Modi et al., 2018; Modi & Tewari, 2020; Levy & Mansour, 2022; Wu et al., 2021), whereas our setup allows continuous states. Kakade et al. (2020) and Du et al. (2019) allow continuous state and action spaces, but the former assumes a planning oracle with unclear computational complexity and the latter focuses only on LQG problems. While contextual MDPs generally allow both the reward and the dynamics to vary with contexts, we focus on the effects of context-independent dynamics, similar to Kakade et al. (2020); Wu et al. (2021). In particular, the recent work of Wu et al. (2021) is the closest to ours, but they study the sample complexity in the tabular setup with linearly parameterized rewards. In view of Example 1, their proposed algorithm has a regret bound $\tilde O(\min\{m,|S|\}H|S||A|\sqrt{K})$. However, they require a number of planning calls linear in $K$. In contrast, our algorithm, UCBlvd, allows continuous states and nonlinear context dependency, and achieves both sublinear regret and a sublinear number of planning calls. Multi-task RL. Another closely related line of work is multi-task RL. Compared to our setting, multi-task RL assumes that the tasks are finitely many and known beforehand, and/or are i.i.d. samples from a fixed distribution. For example, in Yang et al. (2020

6. DISCUSSION

In this paper, we frame lifelong RL as a contextual MDP and identify a new completeness-style assumption that enables provably efficient lifelong RL with linear representation. We propose UCBlvd, an algorithm that simultaneously satisfies the practical needs of achieving 1) sublinear regret and 2) a sublinear number of planning calls for 3) any sequence of tasks and initial states. Specifically, for $K$ task episodes of horizon $H$, we prove that UCBlvd has a regret bound $\tilde O(\sqrt{(d^3+d\,d')H^4K})$ based on $\tilde O(dH\log K)$ planning calls, where $d$ and $d'$ are the feature dimensions of the dynamics and rewards, respectively. We believe that our results will inspire new research directions in the contextual MDP and multi-objective RL literature, as existing work to our knowledge does not cover the computation-sharing aspect of lifelong RL. That said, our work's limitations motivate further investigation in the following directions: 1) extension to more general classes of MDPs, potentially using general function approximation/representation tools, and 2) establishing an information-theoretic lower bound on the number of planning calls/computational complexity.

A PROOFS OF SECTION 3

To prove Theorem 1, we will use the high-probability event $\mathcal{E}_2$ defined in Lemma 3 to establish the UCB (optimism) property of Lifelong-LSVI in Lemma 4, which is the key to controlling the regret. We first state a lemma used in the proof of Lemma 3.

Lemma 2. Under the setting of Theorem 1, let $c_\beta$ be the constant in the definition of $\beta$. Then, for a fixed $w$, there is an absolute constant $c_0$ independent of $c_\beta$ such that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|_{(\Lambda_h^k)^{-1}}\le c_0H\big(d+\sqrt{d'}\big)\sqrt{\log\big((c_\beta+1)dd'T/\delta\big)}.$$

Proof. We note that $\|\eta_h\|_2\le\sqrt{d'}$ (Assumption 1), $\|\theta_h^k(w)\|_2\le H\sqrt{d}$ (Lemma 18), and $\|(\Lambda_h^k)^{-1}\|\le\frac{1}{\lambda}$. Thus, Lemmas 19 and 21 together imply that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\Lambda_h^k)^{-1}}\le 4H^2\Big[\frac{d}{2}\log\frac{k+\lambda}{\lambda}+d'\log\big(1+4\sqrt{d'}/\epsilon\big)+d\log\big(1+4H\sqrt{d}/\epsilon\big)+d^2\log\Big(1+\frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big)+\log\frac{1}{\delta}\Big]+\frac{8k^2\epsilon^2}{\lambda}.$$
If we let $\epsilon=\frac{dH}{k}$ and $\beta=c_\beta(d+\sqrt{d'})H\sqrt{\log(dT/\delta)}$, then there exists an absolute constant $C>0$ independent of $c_\beta$ such that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\Lambda_h^k)^{-1}}\le C(d^2+d')H^2\log\big((c_\beta+1)dd'T/\delta\big).\qquad\square$$

Lemma 3. Let the setting of Theorem 1 hold. For a fixed $w$, the event
$$\mathcal{E}_2(w):=\Big\{\big\|\theta_h^k(w)-\hat\theta_h^k(w)\big\|_{\Lambda_h^k}\le\beta,\ \forall(h,k)\in[H]\times[K]\Big\}\tag{10}$$
holds with probability at least $1-\delta$.

Proof. We decompose
$$\theta_h^k(w)-\hat\theta_h^k(w)=\theta_h^k(w)-(\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\phi_h^\tau V_{h+1}^k(s_{h+1}^\tau,w)=(\Lambda_h^k)^{-1}\Big[\Lambda_h^k\theta_h^k(w)-\sum_{\tau=1}^{k-1}\phi_h^\tau V_{h+1}^k(s_{h+1}^\tau,w)\Big]=\underbrace{\lambda(\Lambda_h^k)^{-1}\theta_h^k(w)}_{q_1}-\underbrace{(\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]}_{q_2}.$$
Thus, in order to upper bound $\|\theta_h^k(w)-\hat\theta_h^k(w)\|_{\Lambda_h^k}$, we bound $\|q_1\|_{\Lambda_h^k}$ and $\|q_2\|_{\Lambda_h^k}$ separately.
From Lemma 18, we have
$$\|q_1\|_{\Lambda_h^k}=\lambda\big\|\theta_h^k(w)\big\|_{(\Lambda_h^k)^{-1}}\le\sqrt{\lambda}\,\big\|\theta_h^k(w)\big\|_2\le H\sqrt{\lambda d}.\tag{11}$$
Thanks to Lemma 2, for all $(w,h,k)$, with probability at least $1-\delta$, it holds that
$$\|q_2\|_{\Lambda_h^k}\le\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|_{(\Lambda_h^k)^{-1}}\le c_0H\big(d+\sqrt{d'}\big)\sqrt{\log\big((c_\beta+1)dd'T/\delta\big)},\tag{12}$$
where $c_0$ and $c_\beta$ are two independent absolute constants. Combining (11) and (12), for all $(w,h,k)$, with probability at least $1-\delta$, it holds that
$$\big\|\theta_h^k(w)-\hat\theta_h^k(w)\big\|_{\Lambda_h^k}\le cH\big(d+\sqrt{d'}\big)\sqrt{\lambda\log(dd'T/\delta)}$$
for some absolute constant $c>0$. $\square$

Lemma 4. Let $\mathcal{W}=\{w^1,w^2,\dots,w^K\}$. Under the setting of Theorem 1, conditioned on the events $\{\mathcal{E}_2(w)\}_{w\in\mathcal{W}}$ defined in (10), and with $Q_h^k$ computed as in (6), it holds that $Q_h^k(s,a,w)\ge Q_h^*(s,a,w)$ for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$.

Proof. We first note that conditioned on the events $\{\mathcal{E}_2(w)\}_{w\in\mathcal{W}}$, for any policy $\pi$ and all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$r_h(s,a,w)+\big\langle\hat\theta_h^k(w),\phi(s,a)\big\rangle-Q_h^\pi(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^\pi(\cdot,w)\big](s,a)=\big\langle\hat\theta_h^k(w)-\theta_h^k(w),\phi(s,a)\big\rangle\le\big\|\hat\theta_h^k(w)-\theta_h^k(w)\big\|_{\Lambda_h^k}\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}\le\beta\,\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}.$$
Now we prove the lemma by induction. The statement holds at step $H$ because $Q_{H+1}^k(\cdot,\cdot,\cdot)=Q_{H+1}^*(\cdot,\cdot,\cdot)=0$, and thus, conditioned on the events $\{\mathcal{E}_2(w)\}_{w\in\mathcal{W}}$ defined in (10), for all $(s,a,w,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$r_H(s,a,w)+\big\langle\hat\theta_H^k(w),\phi(s,a)\big\rangle-Q_H^*(s,a,w)\le\beta\,\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}.$$
Therefore, conditioned on the same events, for all $(s,a,w,k)$, we have $Q_H^*(s,a,w)\le r_H(s,a,w)+\langle\hat\theta_H^k(w),\phi(s,a)\rangle+\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}=Q_H^k(s,a,w)$. Now, suppose the statement holds at time-step $h+1$ and consider time-step $h$.
Conditioned on the events $\{\mathcal{E}_2(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$0\le r_h(s,a,w)+\big\langle\hat\theta_h^k(w),\phi(s,a)\big\rangle-Q_h^*(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^*(\cdot,w)\big](s,a)+\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}\le r_h(s,a,w)+\big\langle\hat\theta_h^k(w),\phi(s,a)\big\rangle-Q_h^*(s,a,w)+\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}},$$
where the second inequality uses the induction assumption $V_{h+1}^k\ge V_{h+1}^*$. Therefore, conditioned on the events $\{\mathcal{E}_2(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)$, we have $Q_h^*(s,a,w)\le r_h(s,a,w)+\langle\hat\theta_h^k(w),\phi(s,a)\rangle+\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}=Q_h^k(s,a,w)$. This completes the proof. $\square$

A.1 PROOF OF THEOREM 1

Let $\delta_h^k=V_h^k(s_h^k,w^k)-V_h^{\pi^k}(s_h^k,w^k)$ and $\xi_{h+1}^k=\mathbb{E}[\delta_{h+1}^k\,|\,s_h^k,a_h^k]-\delta_{h+1}^k$. Conditioned on the events $\{\mathcal{E}_2(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q_h^k(s,a,w)-Q_h^{\pi^k}(s,a,w)=r_h(s,a,w)+\big\langle\hat\theta_h^k(w),\phi(s,a)\big\rangle-Q_h^{\pi^k}(s,a,w)+\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}\le\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^{\pi^k}(\cdot,w)\big](s,a)+2\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}.\tag{13}$$
Note that $\delta_h^k\le Q_h^k(s_h^k,a_h^k,w^k)-Q_h^{\pi^k}(s_h^k,a_h^k,w^k)$. Thus, combining (13), Lemma 3, and a union bound over $\mathcal{W}$, we conclude that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\delta_h^k\le\xi_{h+1}^k+\delta_{h+1}^k+2\beta\|\phi(s_h^k,a_h^k)\|_{(\Lambda_h^k)^{-1}}.$$
Now, we complete the regret analysis:
$$R_K=\sum_{k=1}^K\big[V_1^*(s_1^k,w^k)-V_1^{\pi^k}(s_1^k,w^k)\big]\le\sum_{k=1}^K\big[V_1^k(s_1^k,w^k)-V_1^{\pi^k}(s_1^k,w^k)\big]\quad\text{(Lemma 4)}$$
$$=\sum_{k=1}^K\delta_1^k\le\sum_{k=1}^K\sum_{h=1}^H\xi_h^k+2\beta\sum_{k=1}^K\sum_{h=1}^H\|\phi(s_h^k,a_h^k)\|_{(\Lambda_h^k)^{-1}}\le 2H\sqrt{T\log(dT/\delta)}+2H\beta\sqrt{2dK\log(1+K/\lambda)}\le\tilde O\big(\sqrt{\lambda(d^3+dd')H^3T}\big).$$
The third inequality holds for the following reason: $\{\xi_h^k\}$ is a martingale difference sequence satisfying $|\xi_h^k|\le 2H$; thus, by the Azuma–Hoeffding inequality,
$$\mathbb{P}\Big[\sum_{k=1}^K\sum_{h=1}^H\xi_h^k\le 2H\sqrt{T\log(dT/\delta)}\Big]\ge 1-\delta.$$
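The Azuma–Hoeffding step above can be illustrated numerically. The following sketch uses synthetic i.i.d. mean-zero increments (a special case of a martingale difference sequence) and illustrative parameter values of our own choosing; the fixed seed makes the check reproducible.

```python
import numpy as np

# Illustration of the Azuma-Hoeffding step in the proof of Theorem 1:
# for a martingale difference sequence {xi} with |xi| <= 2H, the sum over
# T = KH terms is at most 2H*sqrt(T*log(d*T/delta)) with prob. >= 1 - delta.
# All parameters below are illustrative, not taken from the paper.
rng = np.random.default_rng(5)
H, K, d, delta = 5, 200, 4, 0.1
T = K * H
# i.i.d. mean-zero increments bounded by 2H (a martingale difference sequence).
xi = rng.uniform(-2 * H, 2 * H, size=T)
bound = 2 * H * np.sqrt(T * np.log(d * T / delta))
# The high-probability bound holds for this draw (and w.p. >= 1 - delta in general).
assert abs(xi.sum()) <= bound
```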
In order to bound $\sum_{k=1}^K\sum_{h=1}^H\|\phi_h^k\|_{(\Lambda_h^k)^{-1}}$, note that for any $h\in[H]$, we have
$$\sum_{k=1}^K\|\phi_h^k\|_{(\Lambda_h^k)^{-1}}\le\sqrt{K\sum_{k=1}^K\|\phi_h^k\|^2_{(\Lambda_h^k)^{-1}}}\quad\text{(Cauchy–Schwarz inequality)}$$
$$\le\sqrt{2K\log\frac{\det\Lambda_h^K}{\det\Lambda_h^1}}\tag{15}$$
$$\le\sqrt{2dK\log\Big(1+\frac{K}{d\lambda}\Big)}.\tag{16}$$
In inequality (15), we used the standard argument in the regret analysis of linear bandits (Abbasi-Yadkori et al., 2011, Lemma 11):
$$\sum_{t=1}^n\min\big\{\|y_t\|^2_{V_t^{-1}},1\big\}\le 2\log\frac{\det V_{n+1}}{\det V_1},\quad\text{where }V_n=V_1+\sum_{t=1}^{n-1}y_ty_t^\top.$$
In inequality (16), we used Assumption 1 and the fact that $\det(A)=\prod_{i=1}^d\lambda_i(A)\le(\mathrm{trace}(A)/d)^d$.
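The elliptical potential inequality in (15)–(16) can be checked numerically. The sketch below uses a synthetic stream of random unit-norm features (dimensions and regularizer are illustrative choices, not from the paper) and verifies both the log-determinant bound and the trace bound.

```python
import numpy as np

# Numerical check of the elliptical potential lemma used in (15)-(16):
# sum_t min(||y_t||^2_{V_t^{-1}}, 1) <= 2 log(det V_{n+1} / det V_1),
# with V_1 = lam * I and V_{t+1} = V_t + y_t y_t^T.
rng = np.random.default_rng(0)
d, lam, n = 5, 1.0, 200
V = lam * np.eye(d)
log_det_v1 = np.linalg.slogdet(V)[1]
potential = 0.0
for _ in range(n):
    y = rng.normal(size=d)
    y /= np.linalg.norm(y)            # ||y||_2 <= 1, as in Assumption 1
    potential += min(y @ np.linalg.solve(V, y), 1.0)   # ||y||^2_{V_t^{-1}}
    V += np.outer(y, y)               # rank-one update of the Gram matrix
log_det_ratio = np.linalg.slogdet(V)[1] - log_det_v1
assert potential <= 2 * log_det_ratio                      # inequality (15)
assert log_det_ratio <= d * np.log(1 + n / (d * lam))      # trace bound (16)
```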

B PROOFS OF SECTION 4

We start by introducing the high-probability event $\mathcal{E}_1$, which is the foundation of our analysis, in the following lemma.

Lemma 5. Consider the setting of Theorem 2. For a fixed $w$, the event
$$\mathcal{E}_1(w):=\Big\{\big\|\theta_h^k(w)-\hat\theta_h^k(w)\big\|_{\Lambda_h^k}\le\beta,\ \forall(h,k)\in[H]\times[K]\Big\}\tag{9}$$
holds with probability at least $1-\delta$.

The proof of Lemma 5 is given in Appx. B.1.

B.1 PROOF OF LEMMA 5

First, we state the following lemma, which will be used in the proof of Lemma 5.

Lemma 6. Under the setting of Lemma 5, let $c_\beta$ be the constant in the definition of $\beta$. Then, for a fixed $w$, there is an absolute constant $c_0$ independent of $c_\beta$ such that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|_{(\Lambda_h^k)^{-1}}\le c_0H\big(d+\sqrt{md}\big)\sqrt{\log\big((c_\beta+1)mdT/\delta\big)}.$$

Proof. We note that $\|\eta_h+\hat\xi_h^k\|_2\le(1+H)\sqrt{md}$ and $\|(\Lambda_h^k)^{-1}\|\le\frac{1}{\lambda}$. Thus, Lemmas 19 and 22 together imply that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\Lambda_h^k)^{-1}}\le 4H^2\Big[\frac{d}{2}\log\frac{k+\lambda}{\lambda}+md\log\big(1+8H\sqrt{md}/\epsilon\big)+d^2\log\Big(1+\frac{32L^2\beta^2\sqrt{d}}{\lambda\epsilon^2}\Big)+\log\frac{1}{\delta}\Big]+\frac{8k^2\epsilon^2}{\lambda}.$$
If we let $\epsilon=\frac{dH}{k}$ and $\beta=c_\beta(d+\sqrt{md})H\sqrt{\log(dT/\delta)}$, then there exists an absolute constant $C>0$ independent of $c_\beta$ such that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\Lambda_h^k)^{-1}}\le C(md+d^2)H^2\log\big((c_\beta+1)mdT/\delta\big).\qquad\square$$

Now we begin the formal proof of Lemma 5. We decompose
$$\theta_h^k(w)-\hat\theta_h^k(w)=\theta_h^k(w)-(\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\phi_h^\tau V_{h+1}^k(s_{h+1}^\tau,w)=(\Lambda_h^k)^{-1}\Big[\Lambda_h^k\theta_h^k(w)-\sum_{\tau=1}^{k-1}\phi_h^\tau V_{h+1}^k(s_{h+1}^\tau,w)\Big]=\underbrace{\lambda(\Lambda_h^k)^{-1}\theta_h^k(w)}_{q_1}-\underbrace{(\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]}_{q_2}.$$
Thus, in order to upper bound $\|\theta_h^k(w)-\hat\theta_h^k(w)\|_{\Lambda_h^k}$, we bound $\|q_1\|_{\Lambda_h^k}$ and $\|q_2\|_{\Lambda_h^k}$ separately. From Lemma 18, we have
$$\|q_1\|_{\Lambda_h^k}=\lambda\big\|\theta_h^k(w)\big\|_{(\Lambda_h^k)^{-1}}\le\sqrt{\lambda}\,\big\|\theta_h^k(w)\big\|_2\le H\sqrt{\lambda d}.\tag{19}$$
Thanks to Lemma 6, for all $(w,h,k)$, with probability at least $1-\delta$, it holds that
$$\|q_2\|_{\Lambda_h^k}\le\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|_{(\Lambda_h^k)^{-1}}\le c_0H\big(d+\sqrt{md}\big)\sqrt{\log\big((c_\beta+1)mdT/\delta\big)},\tag{20}$$
where $c_0$ and $c_\beta$ are two independent absolute constants.
Combining (19) and (20), for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\big\|\theta_h^k(w)-\hat\theta_h^k(w)\big\|_{\Lambda_h^k}\le cH\big(d+\sqrt{md}\big)\sqrt{\lambda\log(mdT/\delta)}$$
for some absolute constant $c>0$. $\square$

B.2 PROOF OF LEMMA 1

Thanks to Assumption 2 and conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, one feasible solution of (8) is $\big(\xi_h^{V_{h+1}^k},\{\theta_h^k(w^{(j)})\}_{j\in[n]}\big)$, with corresponding zero optimal objective value. Therefore, the minimizer satisfies
$$\big\langle\hat\theta_h^{k,(j)},\phi(s,a)\big\rangle=\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle,\quad\forall(j,(s,a))\in[n]\times\mathcal{D}.\tag{21}$$
Let $(s^{(i)},a^{(i)})$ be the $i$-th element of $\mathcal{D}$ and $\{c_i(s,a)\}_{i\in[d]}$ be coefficients such that $\phi(s,a)=\sum_{i\in[d]}c_i(s,a)\phi(s^{(i)},a^{(i)})$. For any triple $(s,a,j)\in\mathcal{S}\times\mathcal{A}\times[n]$, we have
$$\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle=\big\langle\hat\xi_h^k,\phi(s,a)\otimes\rho(w^{(j)})\big\rangle=\Big\langle\hat\xi_h^k,\sum_{i\in[d]}c_i(s,a)\phi(s^{(i)},a^{(i)})\otimes\rho(w^{(j)})\Big\rangle=\sum_{i\in[d]}c_i(s,a)\big\langle\hat\xi_h^k,\psi(s^{(i)},a^{(i)},w^{(j)})\big\rangle\quad\text{(Assumption 3)}$$
$$=\sum_{i\in[d]}c_i(s,a)\big\langle\hat\theta_h^{k,(j)},\phi(s^{(i)},a^{(i)})\big\rangle\quad\text{(Eqn. (21))}\qquad=\big\langle\hat\theta_h^{k,(j)},\phi(s,a)\big\rangle.\tag{23}$$
For any $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$, it holds that
$$\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)\big](s,a)=\big\langle\theta_h^k(w),\phi(s,a)\big\rangle\ \text{(Eqn. (4))}=\big\langle\xi_h^{V_{h+1}^k},\psi(s,a,w)\big\rangle\ \text{(Assumption 2)}=\sum_{j\in[n]}c_j(w)\big\langle\xi_h^{V_{h+1}^k},\psi(s,a,w^{(j)})\big\rangle\ \text{(Assumption 3)}$$
$$=\sum_{j\in[n]}c_j(w)\,\mathbb{P}_h\big[V_{h+1}^k(\cdot,w^{(j)})\big](s,a)\ \text{(Assumption 2)}=\sum_{j\in[n]}c_j(w)\big\langle\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle.$$
Finally, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$\big|\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s,a)\big|=\Big|\sum_{j\in[n]}c_j(w)\Big[\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle-\big\langle\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big]\Big|\quad\text{(Assumption 3 and Eqn. (23))}$$
$$\le\sum_{j\in[n]}|c_j(w)|\Big|\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle-\big\langle\hat\theta_h^{k,(j)},\phi(s,a)\big\rangle\Big|+\sum_{j\in[n]}|c_j(w)|\Big|\big\langle\hat\theta_h^{k,(j)}-\hat\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|+\sum_{j\in[n]}|c_j(w)|\Big|\big\langle\hat\theta_h^k(w^{(j)})-\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|$$
$$=\sum_{j\in[n]}|c_j(w)|\Big|\big\langle\hat\theta_h^{k,(j)}-\hat\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|+\sum_{j\in[n]}|c_j(w)|\Big|\big\langle\hat\theta_h^k(w^{(j)})-\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|\quad\text{(Eqn. (21))}\le 2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}},$$
where the last inequality uses the constraint in (8) together with Lemma 5. $\square$

B.3 PROOF OF OPTIMISTIC NATURE OF UCBLVD

Lemma 7. Let $\mathcal{W}=\{w^\tau:\tau\in[K]\}\cup\{w^{(j)}:j\in[n]\}$. Under the setting of Theorem 2, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$ defined in (9), and with $Q_h^k$ computed as in (7), it holds that $Q_h^k(s,a,w)\ge Q_h^*(s,a,w)$ for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$.

Proof.
We first note that conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, for any policy $\pi$ and all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$r_h(s,a,w)+\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^\pi(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^\pi(\cdot,w)\big](s,a)=\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)\big](s,a)\le 2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}},\tag{24}$$
where the last inequality is Lemma 1. Now we prove the lemma by induction. The statement holds at step $H$ because $Q_{H+1}^k(\cdot,\cdot,\cdot)=Q_{H+1}^*(\cdot,\cdot,\cdot)=0$, and thus, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$ defined in (9), for all $(s,a,w,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$r_H(s,a,w)+\big\langle\hat\xi_H^k,\psi(s,a,w)\big\rangle-Q_H^*(s,a,w)\le 2L\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}.$$
Therefore, conditioned on the same events, for all $(s,a,w,k)$, we have
$$Q_H^*(s,a,w)\le r_H(s,a,w)+\big\langle\hat\xi_H^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}=\Big[r_H(s,a,w)+\big\langle\hat\xi_H^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}\Big]^+=Q_H^k(s,a,w),$$
where the first equality holds because the truncation $[\cdot]^+$ can be inserted without affecting the bound, since $Q_H^*(s,a,w)\ge0$. Now, suppose the statement holds at time-step $h+1$ and consider time-step $h$. Conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$0\le r_h(s,a,w)+\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^*(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^*(\cdot,w)\big](s,a)+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}\le r_h(s,a,w)+\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^*(s,a,w)+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}},$$
where the second inequality uses the induction assumption. Therefore, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)$, we have
$$Q_h^*(s,a,w)\le r_h(s,a,w)+\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}=\Big[r_h(s,a,w)+\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}\Big]^+=Q_h^k(s,a,w),$$
again because $Q_h^*(s,a,w)\ge0$. This completes the proof. $\square$

B.4 PROOF OF THEOREM 2

First, we bound the number of times Algorithm 2 updates $\hat\xi_h^k$, i.e., the number of planning calls. Let $P$ be the total number of updates and $k_p$ the episode at which the agent replans for the $p$-th time. Note that $\det\Lambda_h^1=\lambda^d$ and $\det\Lambda_h^K\le(\mathrm{trace}(\Lambda_h^K)/d)^d\le(\lambda+\frac{K}{d})^d$, and consequently
$$\frac{\det\Lambda_h^K}{\det\Lambda_h^1}=\prod_{p=1}^P\frac{\det\Lambda_h^{k_p}}{\det\Lambda_h^{k_{p-1}}}\le\Big(1+\frac{K}{d\lambda}\Big)^d,\quad\text{and therefore}\quad\prod_{h=1}^H\frac{\det\Lambda_h^K}{\det\Lambda_h^1}=\prod_{h=1}^H\prod_{p=1}^P\frac{\det\Lambda_h^{k_p}}{\det\Lambda_h^{k_{p-1}}}\le\Big(1+\frac{K}{d\lambda}\Big)^{dH}.$$
Since $1\le\det\Lambda_h^{k_p}/\det\Lambda_h^{k_p-1}$ for all $(h,p)$, and replanning is triggered only when one of these determinant ratios has grown by a constant factor, it follows that replanning can occur in at most $dH\log(1+\frac{K}{d\lambda})$ episodes $k\in[K]$.

Now, we prove the regret bound. Let $\delta_h^k=V_h^k(s_h^k,w^k)-V_h^{\pi^k}(s_h^k,w^k)$ and $\xi_{h+1}^k=\mathbb{E}[\delta_{h+1}^k\,|\,s_h^k,a_h^k]-\delta_{h+1}^k$. Conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q_h^k(s,a,w)-Q_h^{\pi^k}(s,a,w)=r_h(s,a,w)+\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^{\pi^k}(s,a,w)+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}\le\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^{\pi^k}(\cdot,w)\big](s,a)+4L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}.\tag{25}$$
Note that $\delta_h^k\le Q_h^k(s_h^k,a_h^k,w^k)-Q_h^{\pi^k}(s_h^k,a_h^k,w^k)$. Thus, combining (25), Lemma 5, and a union bound over $\mathcal{W}$, we conclude that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\delta_h^k\le\xi_{h+1}^k+\delta_{h+1}^k+4L\beta\|\phi(s_h^k,a_h^k)\|_{(\Lambda_h^k)^{-1}}.$$
Note that for any positive semi-definite matrices $A$, $B$, and $C$ such that $A=B+C$, we have $\det(A)\ge\det(B)$, $\det(A)\ge\det(C)$, and for any $x\ne0$ (Abbasi-Yadkori et al., 2011, Lemma 12):
$$\frac{\|x\|_A^2}{\|x\|_B^2}\le\frac{\det(A)}{\det(B)}\quad\text{and}\quad\frac{\|x\|_{B^{-1}}^2}{\|x\|_{A^{-1}}^2}\le\frac{\det(A)}{\det(B)}.$$
Now, we complete the regret analysis following similar steps as in the proof of Theorem 1:
$$R_K=\sum_{k=1}^K\big[V_1^*(s_1^k,w^k)-V_1^{\pi^k}(s_1^k,w^k)\big]\le\sum_{k=1}^K\big[V_1^k(s_1^k,w^k)-V_1^{\pi^k}(s_1^k,w^k)\big]\quad\text{(Lemma 7)}$$
$$=\sum_{k=1}^K\delta_1^k\le\sum_{k=1}^K\sum_{h=1}^H\xi_h^k+4L\beta\sum_{k=1}^K\sum_{h=1}^H\|\phi(s_h^k,a_h^k)\|_{(\bar\Lambda_h^k)^{-1}}\le\sum_{k=1}^K\sum_{h=1}^H\xi_h^k+4L\beta\sum_{k=1}^K\sum_{h=1}^H\|\phi(s_h^k,a_h^k)\|_{(\Lambda_h^k)^{-1}}\sqrt{\frac{\det\Lambda_h^k}{\det\bar\Lambda_h^k}}$$
$$\le 2H\sqrt{T\log(dT/\delta)}+8HL\beta\sqrt{2dK\log(1+K/\lambda)}\le\tilde O\big(L\sqrt{\lambda(d^3+md^2)H^3T}\big),$$
where $\bar\Lambda_h^k$ denotes the covariance matrix at the most recent planning episode, so that by the replanning rule the determinant ratio above is bounded by a constant.
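The determinant-based counting argument above can be illustrated numerically: if the agent replans only when $\det\Lambda$ has doubled since the last planning call, the number of replans over $K$ episodes is at most $d\log_2(1+\frac{K}{d\lambda})$. The feature stream and parameters below are synthetic and purely illustrative.

```python
import numpy as np

# Sketch of the planning-call bound in Appx. B.4: replan only when det(Lambda)
# has doubled since the last planning call; count the resulting replans.
rng = np.random.default_rng(1)
d, lam, K = 4, 1.0, 500
Lam = lam * np.eye(d)
logdet_last_plan = np.linalg.slogdet(Lam)[1]
replans = 0
for _ in range(K):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)        # ||phi||_2 <= 1
    Lam += np.outer(phi, phi)
    if np.linalg.slogdet(Lam)[1] > logdet_last_plan + np.log(2):  # det doubled
        replans += 1
        logdet_last_plan = np.linalg.slogdet(Lam)[1]
# det(Lambda_K)/det(Lambda_1) <= (1 + K/(d*lam))^d, and each replan doubles
# the determinant, so:
bound = d * np.log2(1 + K / (d * lam))
assert replans <= bound
```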

B.5 DISCUSSION ON THE TIME COMPLEXITY OF UCBLVD AND LIFELONG-LSVI

In what follows, we clarify how the time complexity of UCBlvd compares to that of Lifelong-LSVI. When we maintain $(\Lambda_h^k)^{-1}$ incrementally via the Sherman–Morrison formula, the computational complexity of Lifelong-LSVI is dominated by Line 5, computing $\max_{a\in\mathcal{A}}Q_{h+1}^k(s_{h+1}^\tau,a)$ for all $\tau\in[k]$. This takes $O(d^2|\mathcal{A}|K)$ per step, which gives a total runtime of $O(d^2|\mathcal{A}|HK^2)$. In UCBlvd, every planning call takes $\tilde O(md^2|\mathcal{A}|K+m^3d^3)$, where the second term is the time complexity of the convex QCQP with $m+1$ constraints and $2md$ variables. This gives a total runtime of $\tilde O(H^2(md^3|\mathcal{A}|K+m^3d^4))$. Therefore, UCBlvd enjoys a time complexity smaller by a factor of $K$ than that of Lifelong-LSVI, which is a significant reduction in practical scenarios where $K\gg d'=md$.
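The Sherman–Morrison device referenced above maintains $\Lambda^{-1}$ after a rank-one update $\Lambda\leftarrow\Lambda+\phi\phi^\top$ in $O(d^2)$ instead of a fresh $O(d^3)$ inversion. A minimal sketch with synthetic features (dimensions and data are illustrative):

```python
import numpy as np

def sherman_morrison_update(Lam_inv: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Return (Lam + phi phi^T)^{-1} given Lam_inv = Lam^{-1}; costs O(d^2)."""
    v = Lam_inv @ phi
    return Lam_inv - np.outer(v, v) / (1.0 + phi @ v)

rng = np.random.default_rng(2)
d, lam = 6, 1.0
Lam = lam * np.eye(d)
Lam_inv = np.linalg.inv(Lam)
for _ in range(50):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)
    Lam += np.outer(phi, phi)                       # explicit Gram update
    Lam_inv = sherman_morrison_update(Lam_inv, phi) # incremental inverse
# The incrementally maintained inverse matches the direct inverse.
assert np.allclose(Lam_inv, np.linalg.inv(Lam), atol=1e-8)
```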

C DETAILS OF REMARK 1: UCBLVD WITH UNKNOWN REWARDS

In order for our analysis to go through, we need a slightly different completeness assumption:

Assumption 4. Given feature maps $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ and $\psi:\mathcal{S}\times\mathcal{A}\times\mathcal{W}\to\mathbb{R}^{md}$, consider the function class
$$\mathcal{F}=\Big\{f:f(s,w)=\Big[\min\Big\{\max_{a\in\mathcal{A}}\big\langle\nu,\psi(s,a,w)\big\rangle+\beta\|\phi(s,a)\|_{\Lambda^{-1}}+\bar\beta\|\psi(s,a,w)\|_{\tilde\Lambda^{-1}},\,H\Big\}\Big]^+,\ \nu\in\mathbb{R}^{md},\ \Lambda\in\mathbb{S}_{++}^{d},\ \tilde\Lambda\in\mathbb{S}_{++}^{md},\ \beta\ge0,\ \bar\beta\ge0\Big\}.$$
Then for any $f\in\mathcal{F}$ and $h\in[H]$, there exists a vector $\xi_h^f\in\mathbb{R}^{md}$ with $\|\xi_h^f\|\le H\sqrt{md}$ such that $\mathbb{P}_h[f(\cdot,w)](s,a)=\langle\xi_h^f,\psi(s,a,w)\rangle$.

C.1 OVERVIEW

Let $\psi_h^\tau=\psi(s_h^\tau,a_h^\tau,w^\tau)$. UCBlvd with unknown rewards works with the following action-value functions:
$$Q_h^k(s,a,w)=\Big[\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle+\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}\Big]^+,\tag{28}$$
where $\hat\eta_h^k=(\tilde\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\psi_h^\tau r_h^\tau$ and $\tilde\Lambda_h^k=\lambda I_{md}+\sum_{\tau=1}^{k-1}\psi_h^\tau(\psi_h^\tau)^\top$. We note that compared to (7), the action-value function defined in (28) involves the extra term $\langle\hat\eta_h^k,\psi(s,a,w)\rangle+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}$. This term is in fact an upper bound on $r_h(s,a,w)$. Specifically, from Theorem 2 of Abbasi-Yadkori et al. (2011), we know that for $\bar\beta=\sqrt{\lambda md}$, the event
$$\big\|\eta_h-\hat\eta_h^k\big\|_{\tilde\Lambda_h^k}\le\bar\beta,\quad\forall(h,k)\in[H]\times[K]$$
holds with high probability. The planning step becomes
$$\big(\hat\xi_h^k,\{\hat\theta_h^{k,(j)}\}_{j\in[n]}\big)=\arg\min_{\xi,\{\theta^{(j)}\}_{j\in[n]}}\sum_{j\in[n]}\sum_{(s,a)\in\mathcal{D}}\Big(\big\langle\theta^{(j)},\phi(s,a)\big\rangle-\big\langle\xi,\psi(s,a,w^{(j)})\big\rangle\Big)^2\tag{30}$$
$$\text{s.t.}\quad\big\|\theta^{(j)}-\hat\theta_h^k(w^{(j)})\big\|_{\Lambda_h^k}\le\beta,\ \forall j\in[n],\quad\text{and}\quad\|\xi\|_2\le H\sqrt{md}.$$

Theorem 3. Let $T=KH$. Under Assumptions 1, 3, and 4, the number of planning calls in Algorithm 3 is at most $dH\log(1+\frac{K}{d\lambda})+mdH\log(1+\frac{K}{md\lambda})$, and there exists an absolute constant $c>0$ such that for any fixed $\delta\in(0,0.5)$, if we set $\lambda=1$, $\beta=c\,mdH\sqrt{\log(mdT/\delta)}$, and $\bar\beta=\sqrt{md}$ in Algorithm 3, then with probability at least $1-2\delta$, it holds that
$$R_K\le 2H\sqrt{T\log(dT/\delta)}+4H\sqrt{K}\Big(L\beta\sqrt{2d\log(1+K/\lambda)}+\bar\beta\sqrt{2md\log(1+K/\lambda)}\Big)\le\tilde O\big(L\sqrt{m^2d^3H^3T}\big).$$
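The reward-optimism device above (ridge estimate plus confidence-width bonus) can be sketched numerically. The data below are synthetic and noiseless, so the confidence event holds with the simple choice $\bar\beta=\sqrt{\lambda}\,\|\eta\|_2$; all names and parameters are illustrative, not the paper's.

```python
import numpy as np

# Sketch of reward optimism in Appx. C: with ridge estimate
# eta_hat = (lam I + Psi^T Psi)^{-1} Psi^T r, the quantity
# <eta_hat, psi> + beta_bar * ||psi||_{Lambda^{-1}} upper-bounds <eta, psi>
# whenever ||eta - eta_hat||_{Lambda} <= beta_bar.
rng = np.random.default_rng(3)
dim, lam, n = 8, 1.0, 100
eta = rng.normal(size=dim)
eta /= np.linalg.norm(eta)                     # ||eta||_2 = 1
Psi = rng.normal(size=(n, dim))
Psi /= np.linalg.norm(Psi, axis=1, keepdims=True)
r = Psi @ eta                                  # noiseless linear rewards
Lam = lam * np.eye(dim) + Psi.T @ Psi
eta_hat = np.linalg.solve(Lam, Psi.T @ r)
beta_bar = np.sqrt(lam)                        # valid radius in the noiseless case
for _ in range(20):                            # check optimism at fresh features
    psi = rng.normal(size=dim)
    psi /= np.linalg.norm(psi)
    ucb = eta_hat @ psi + beta_bar * np.sqrt(psi @ np.linalg.solve(Lam, psi))
    assert ucb >= eta @ psi - 1e-9             # UCB dominates the true reward
```

In the noiseless case $\eta-\hat\eta=\lambda\Lambda^{-1}\eta$, so $\|\eta-\hat\eta\|_{\Lambda}\le\sqrt{\lambda}\|\eta\|_2$, which is why the small radius suffices here; with noisy rewards one would use the self-normalized radius from Abbasi-Yadkori et al. (2011) instead.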

C.2 NECESSARY ANALYSIS FOR THE PROOF OF THEOREM 3

Lemma 8. Let $c_\beta$ be the constant in the definition of $\beta$. Then, under Assumptions 1, 3, and 4, for a fixed $w$, there is an absolute constant $c_0$ independent of $c_\beta$ such that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|_{(\Lambda_h^k)^{-1}}\le c_0\,mdH\sqrt{\log\big((c_\beta+1)mdT/\delta\big)}.$$

Proof. We note that $\|\hat\eta_h^k+\hat\xi_h^k\|_2\le H\sqrt{md}+K/\lambda$, $\|(\Lambda_h^k)^{-1}\|\le\frac1\lambda$, and $\|(\tilde\Lambda_h^k)^{-1}\|\le\frac1\lambda$. Thus, Lemmas 19 and 23 together imply that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\Lambda_h^k)^{-1}}\le 4H^2\Big[\frac{d}{2}\log\frac{k+\lambda}{\lambda}+md\log\big(1+8H\sqrt{md}/\epsilon\big)+d^2\log\Big(1+\frac{32L^2\beta^2\sqrt{d}}{\lambda\epsilon^2}\Big)+m^2d^2\log\Big(1+\frac{8\bar\beta^2\sqrt{md}}{\lambda\epsilon^2}\Big)+\log\frac1\delta\Big]+\frac{8k^2\epsilon^2}{\lambda}.$$
If we let $\epsilon=\frac{dH}{k}$ and $\beta=c_\beta\,mdH\sqrt{\log(mdT/\delta)}$, then there exists an absolute constant $C>0$ independent of $c_\beta$ such that
$$\Big\|\sum_{\tau=1}^{k-1}\phi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w)-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\Lambda_h^k)^{-1}}\le C\,m^2d^2H^2\log\big((c_\beta+1)mdT/\delta\big).\qquad\square$$

Lemma 9. Under Assumptions 1, 3, and 4, if we let $\beta=c\,mdH\sqrt{\lambda\log(mdT/\delta)}$ with an absolute constant $c>0$, then the event
$$\mathcal{E}_3(w):=\Big\{\big\|\theta_h^k(w)-\hat\theta_h^k(w)\big\|_{\Lambda_h^k}\le\beta,\ \forall(h,k)\in[H]\times[K]\Big\}\tag{32}$$
holds with probability at least $1-\delta$ for a fixed $w$.

Proof. The proof follows the same steps as that of Lemma 5, except that it uses Lemma 8 instead of Lemma 6, due to the different structure of the action-value functions $Q_h^k$ in this section. $\square$

Lemma 10. Let $\mathcal{W}=\{w^\tau:\tau\in[K]\}\cup\{w^{(j)}:j\in[n]\}$. Under the setting of Theorem 3 and conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$ defined in (32), for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$\big|\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s,a)\big|\le 2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}.$$

Proof. The proof follows exactly the same steps as that of Lemma 1. $\square$

Lemma 11. Let $\mathcal{W}=\{w^\tau:\tau\in[K]\}\cup\{w^{(j)}:j\in[n]\}$.
Under the setting of Theorem 3, conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$ defined in (32), and with $Q_h^k$ computed as in (28), it holds that $Q_h^k(s,a,w)\ge Q_h^*(s,a,w)$ for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$.

Proof. We first note that conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$, for any policy $\pi$ and all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^\pi(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^\pi(\cdot,w)\big](s,a)=\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle-r_h(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)\big](s,a)$$
$$\le\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)\big](s,a)+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}\le 2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}},\tag{33}$$
where the first inequality uses the reward confidence bound stated in C.1 and the second is Lemma 10. Now we prove the lemma by induction. The statement holds at step $H$ because $Q_{H+1}^k(\cdot,\cdot,\cdot)=Q_{H+1}^*(\cdot,\cdot,\cdot)=0$, and thus, conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$ defined in (32), for all $(s,a,w,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$\big\langle\hat\eta_H^k+\hat\xi_H^k,\psi(s,a,w)\big\rangle-Q_H^*(s,a,w)\le 2L\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_H^k)^{-1}}.\tag{34}$$
Therefore, conditioned on the same events, for all $(s,a,w,k)$, we have
$$Q_H^*(s,a,w)\le\big\langle\hat\eta_H^k+\hat\xi_H^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_H^k)^{-1}}=\Big[\big\langle\hat\eta_H^k+\hat\xi_H^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_H^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_H^k)^{-1}}\Big]^+=Q_H^k(s,a,w),$$
where the first equality holds because the truncation $[\cdot]^+$ can be inserted, since $Q_H^*(s,a,w)\ge0$. Now, suppose the statement holds at time-step $h+1$ and consider time-step $h$. Conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$0\le\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^*(s,a,w)-\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^*(\cdot,w)\big](s,a)+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}$$
$$\le\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^*(s,a,w)+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}},$$
where the second inequality uses the induction assumption. Therefore, conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)$, we have
$$Q_h^*(s,a,w)\le\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}=\Big[\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}\Big]^+=Q_h^k(s,a,w),$$
again because $Q_h^*(s,a,w)\ge0$. This completes the proof. $\square$

C.3 PROOF OF THEOREM 3

First, we bound the number of times Algorithm 3 updates $\hat\xi_h^k$, i.e., the number of planning calls. Let $P$ be the total number of policy updates and $k_p$ the episode at which the agent replans for the $p$-th time. Note that $\det\Lambda_h^1=\lambda^d$ and $\det\Lambda_h^K\le(\mathrm{trace}(\Lambda_h^K)/d)^d\le(\lambda+\frac{K}{d})^d$, and consequently
$$\frac{\det\Lambda_h^K}{\det\Lambda_h^1}=\prod_{p=1}^P\frac{\det\Lambda_h^{k_p}}{\det\Lambda_h^{k_{p-1}}}\le\Big(1+\frac{K}{d\lambda}\Big)^d,\quad\text{and therefore}\quad\prod_{h=1}^H\frac{\det\Lambda_h^K}{\det\Lambda_h^1}=\prod_{h=1}^H\prod_{p=1}^P\frac{\det\Lambda_h^{k_p}}{\det\Lambda_h^{k_{p-1}}}\le\Big(1+\frac{K}{d\lambda}\Big)^{dH}.\tag{35}$$
We similarly have
$$\prod_{h=1}^H\frac{\det\tilde\Lambda_h^K}{\det\tilde\Lambda_h^1}=\prod_{h=1}^H\prod_{p=1}^P\frac{\det\tilde\Lambda_h^{k_p}}{\det\tilde\Lambda_h^{k_{p-1}}}\le\Big(1+\frac{K}{md\lambda}\Big)^{mdH}.\tag{36}$$
Since $1\le\det\Lambda_h^{k_p}/\det\Lambda_h^{k_p-1}$ for all $(h,p)$, and replanning is triggered only when one of these determinant ratios has grown by a constant factor, replanning can occur in at most $dH\log(1+\frac{K}{d\lambda})+mdH\log(1+\frac{K}{md\lambda})$ episodes $k\in[K]$. This concludes that the number of planning calls in Algorithm 3 is at most $dH\log(1+\frac{K}{d\lambda})+mdH\log(1+\frac{K}{md\lambda})$.

Now, we prove the regret bound. Let $\delta_h^k=V_h^k(s_h^k,w^k)-V_h^{\pi^k}(s_h^k,w^k)$ and $\xi_{h+1}^k=\mathbb{E}[\delta_{h+1}^k\,|\,s_h^k,a_h^k]-\delta_{h+1}^k$. Conditioned on the events $\{\mathcal{E}_3(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q_h^k(s,a,w)-Q_h^{\pi^k}(s,a,w)=\big\langle\hat\eta_h^k+\hat\xi_h^k,\psi(s,a,w)\big\rangle-Q_h^{\pi^k}(s,a,w)+2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}$$
$$\le\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)-V_{h+1}^{\pi^k}(\cdot,w)\big](s,a)+4L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}+2\bar\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}.\tag{37}$$
Note that $\delta_h^k\le Q_h^k(s_h^k,a_h^k,w^k)-Q_h^{\pi^k}(s_h^k,a_h^k,w^k)$. Thus, combining (37), Lemma 9, and a union bound over $\mathcal{W}$, we conclude that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\delta_h^k\le\xi_{h+1}^k+\delta_{h+1}^k+4L\beta\|\phi(s_h^k,a_h^k)\|_{(\Lambda_h^k)^{-1}}+2\bar\beta\|\psi(s_h^k,a_h^k,w^k)\|_{(\tilde\Lambda_h^k)^{-1}}.$$
Now, we complete the regret analysis following similar steps as in the proof of Theorem 1:
$$R_K=\sum_{k=1}^K\big[V_1^*(s_1^k,w^k)-V_1^{\pi^k}(s_1^k,w^k)\big]\le\sum_{k=1}^K\big[V_1^k(s_1^k,w^k)-V_1^{\pi^k}(s_1^k,w^k)\big]\quad\text{(Lemma 11)}$$
$$=\sum_{k=1}^K\delta_1^k\le\sum_{k=1}^K\sum_{h=1}^H\xi_h^k+4L\beta\sum_{k=1}^K\sum_{h=1}^H\|\phi(s_h^k,a_h^k)\|_{(\bar\Lambda_h^k)^{-1}}+2\bar\beta\sum_{k=1}^K\sum_{h=1}^H\|\psi(s_h^k,a_h^k,w^k)\|_{(\bar{\tilde\Lambda}{}_h^k)^{-1}}$$
$$\le\sum_{k=1}^K\sum_{h=1}^H\xi_h^k+4L\beta\sum_{k=1}^K\sum_{h=1}^H\|\phi(s_h^k,a_h^k)\|_{(\Lambda_h^k)^{-1}}\sqrt{\frac{\det\Lambda_h^k}{\det\bar\Lambda_h^k}}+2\bar\beta\sum_{k=1}^K\sum_{h=1}^H\|\psi(s_h^k,a_h^k,w^k)\|_{(\tilde\Lambda_h^k)^{-1}}\sqrt{\frac{\det\tilde\Lambda_h^k}{\det\bar{\tilde\Lambda}{}_h^k}}\quad\text{(Eqns. (35) and (36))}$$
$$\le 2H\sqrt{T\log(dT/\delta)}+4H\sqrt{K}\Big(L\beta\sqrt{2d\log(1+K/\lambda)}+\bar\beta\sqrt{2md\log(1+K/\lambda)}\Big)\le\tilde O\big(L\sqrt{\lambda m^2d^3H^3T}\big),$$
where $\bar\Lambda_h^k$ and $\bar{\tilde\Lambda}{}_h^k$ denote the covariance matrices at the most recent planning episode, so that by the replanning rule the determinant ratios above are bounded by a constant.

D DETAILS OF REMARK 2: RELAXATION OF ASSUMPTION 3

In this section, we replace Assumption 3 with the following assumption:

Assumption 5. There is a known set $\{w^{(1)},w^{(2)},\dots,w^{(n)}\}$ of $n\le d$ tasks such that $\psi(s,a,w)\in\mathrm{Span}\{\psi(s,a,w^{(j)})\}_{j\in[n]}$ for all $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$. This implies that for any $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$, there exist coefficients $\{c_j(s,a,w)\}_{j\in[n]}$ such that
$$\psi(s,a,w)=\sum_{j\in[n]}c_j(s,a,w)\,\psi\big(s,a,w^{(j)}\big).\tag{38}$$
Moreover, $\sum_{j\in[n]}|c_j(s,a,w)|\le L$ for all $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$.

Define the concatenated mapping $\bar\psi:\mathcal{S}\times\mathcal{A}\times\mathcal{W}\to\mathbb{R}^{d+d'}$ such that $\bar\psi(s,a,w)=\big(\phi(s,a),\psi(s,a,w)\big)$. We modify the planning step of UCBlvd to the following:
$$\big(\hat\xi_h^k,\{\hat\theta_h^{k,(j)}\}_{j\in[n]}\big)=\arg\min_{\xi,\{\theta^{(j)}\}_{j\in[n]}}\sum_{j\in[n]}\sum_{(s,a)\in\mathcal{D}(w^{(j)})}\Big(\big\langle\theta^{(j)},\phi(s,a)\big\rangle-\big\langle\xi,\psi(s,a,w^{(j)})\big\rangle\Big)^2\tag{39}$$
$$\text{s.t.}\quad\big\|\theta^{(j)}-\hat\theta_h^k(w^{(j)})\big\|_{\Lambda_h^k}\le\beta,\ \forall j\in[n],\quad\text{and}\quad\|\xi\|_2\le H\sqrt{d'}.$$
The only change we make in Algorithm 2 is in Line 9, in which $\hat\xi_h^k$ is now computed as defined in (39). We present this modification in Algorithm 4 for completeness.

Theorem 4. Let $T=KH$. Under Assumptions 1, 2, and 5, the number of planning calls in Algorithm 4 is at most $dH\log(1+\frac{K}{d\lambda})$, and there exists an absolute constant $c>0$ such that for any fixed $\delta\in(0,0.5)$, if we set $\lambda=1$ and $\beta=cH(d+\sqrt{d'})\sqrt{\log(dd'T/\delta)}$ in Algorithm 4, then with probability at least $1-2\delta$, it holds that
$$R_K\le 2H\sqrt{T\log(dT/\delta)}+8HL\beta\sqrt{2dK\log(K)}\le\tilde O\big(L\sqrt{(d^3+dd')H^3T}\big).$$
The proof of Theorem 4 follows exactly the same steps as that of Theorem 2. The only difference is the proof of Lemma 1, which we address in the proof of the following lemma.

Lemma 12. Let $\mathcal{W}=\{w^\tau:\tau\in[K]\}\cup\{w^{(j)}:j\in[n]\}$. Under Assumptions 1, 2, and 5, if we let $\beta=cH(d+\sqrt{d'})\sqrt{\lambda\log(dd'T/\delta)}$ with an absolute constant $c>0$, then for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\big|\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s,a)\big|\le 2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}}.$$

Proof.
We let $\bar\psi_i(w)=(\phi_i,\psi_i(w))$ be the $i$-th element of $\bar{\mathcal{D}}(w)=\{\bar\psi(s,a,w):(s,a)\in\mathcal{D}(w)\}$, and for any triple $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$, we let $\{c_i(s,a,w)\}_{i\in[d+d']}$ be coefficients such that $\bar\psi(s,a,w)=\sum_{i\in[d+d']}c_i(s,a,w)\bar\psi_i(w)$, which implies that
$$\phi(s,a)=\sum_{i\in[d+d']}c_i(s,a,w)\phi_i\quad\text{and}\quad\psi(s,a,w)=\sum_{i\in[d+d']}c_i(s,a,w)\psi_i(w).\tag{41}$$
Thanks to Assumption 2 and conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, one feasible solution of (39) is $\big(\xi_h^{V_{h+1}^k},\{\theta_h^k(w^{(j)})\}_{j\in[n]}\big)$, with corresponding zero optimal objective value. Therefore, it holds that
$$\big\langle\hat\theta_h^{k,(j)},\phi_i\big\rangle=\big\langle\hat\xi_h^k,\psi_i(w^{(j)})\big\rangle,\quad\forall(i,j)\in[d+d']\times[n].\tag{42}$$
Moreover, for any triple $(s,a,j)\in\mathcal{S}\times\mathcal{A}\times[n]$, we have
$$\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle=\sum_{i\in[d+d']}c_i\big(s,a,w^{(j)}\big)\big\langle\hat\xi_h^k,\psi_i(w^{(j)})\big\rangle\ \text{(Eqn. (41))}=\sum_{i\in[d+d']}c_i\big(s,a,w^{(j)}\big)\big\langle\hat\theta_h^{k,(j)},\phi_i\big\rangle\ \text{(Eqn. (42))}=\big\langle\hat\theta_h^{k,(j)},\phi(s,a)\big\rangle.$$
For any $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$, it holds that
$$\mathbb{P}_h\big[V_{h+1}^k(\cdot,w)\big](s,a)=\big\langle\theta_h^k(w),\phi(s,a)\big\rangle\ \text{(Eqn. (4))}=\big\langle\xi_h^{V_{h+1}^k},\psi(s,a,w)\big\rangle\ \text{(Assumption 2)}=\sum_{j\in[n]}c_j(s,a,w)\big\langle\xi_h^{V_{h+1}^k},\psi(s,a,w^{(j)})\big\rangle\ \text{(Eqn. (38))}$$
$$=\sum_{j\in[n]}c_j(s,a,w)\,\mathbb{P}_h\big[V_{h+1}^k(\cdot,w^{(j)})\big](s,a)\ \text{(Assumption 2)}=\sum_{j\in[n]}c_j(s,a,w)\big\langle\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle.$$
Finally, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\mathcal{W}}$, for all $(s,a,w,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$\big|\big\langle\hat\xi_h^k,\psi(s,a,w)\big\rangle-\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s,a)\big|=\Big|\sum_{j\in[n]}c_j(s,a,w)\Big[\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle-\big\langle\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big]\Big|\quad\text{(Eqns. (38) and (23))}$$
$$\le\sum_{j\in[n]}|c_j(s,a,w)|\Big|\big\langle\hat\xi_h^k,\psi(s,a,w^{(j)})\big\rangle-\big\langle\hat\theta_h^{k,(j)},\phi(s,a)\big\rangle\Big|+\sum_{j\in[n]}|c_j(s,a,w)|\Big|\big\langle\hat\theta_h^{k,(j)}-\hat\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|+\sum_{j\in[n]}|c_j(s,a,w)|\Big|\big\langle\hat\theta_h^k(w^{(j)})-\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|$$
$$=\sum_{j\in[n]}|c_j(s,a,w)|\Big|\big\langle\hat\theta_h^{k,(j)}-\hat\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|+\sum_{j\in[n]}|c_j(s,a,w)|\Big|\big\langle\hat\theta_h^k(w^{(j)})-\theta_h^k(w^{(j)}),\phi(s,a)\big\rangle\Big|\le 2L\beta\|\phi(s,a)\|_{(\Lambda_h^k)^{-1}},$$
where the last equality uses the display above and the final inequality uses the constraint in (39) together with Lemma 5. $\square$
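The span condition of Assumption 5 can be sketched numerically: for a fixed $(s,a)$, the feature of an arbitrary task is expressed as a combination of the $n$ anchor-task features, and the coefficient-sum constant $L$ is read off. The tensor-product form $\psi(s,a,w)=\phi(s,a)\otimes\rho(w)$ matches the paper's main setting, but $\phi$, $\rho$, and the tasks below are synthetic.

```python
import numpy as np

# Sketch of Assumption 5 (Appx. D): psi(s,a,w) lies in the span of the
# anchor-task features {psi(s,a,w^(j))}_{j in [n]}.
rng = np.random.default_rng(4)
d, m, n = 3, 4, 4
phi = rng.normal(size=d)                      # phi(s, a) for a fixed (s, a)
rho_anchors = rng.normal(size=(n, m))         # rho(w^(1)), ..., rho(w^(n))
coef = rng.uniform(-1, 1, size=n)
rho_w = coef @ rho_anchors                    # a new task in the anchors' span
psi = lambda r: np.kron(phi, r)               # psi(s, a, w) = phi (x) rho(w)
A = np.stack([psi(r) for r in rho_anchors], axis=1)   # columns: anchor features
c, *_ = np.linalg.lstsq(A, psi(rho_w), rcond=None)    # recover c_j(s, a, w)
assert np.allclose(A @ c, psi(rho_w), atol=1e-8)      # exact span representation
L = np.abs(c).sum()                           # the constant L of Assumption 5
assert L >= 0
```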

E DETAILS OF REMARK 3

In this section, we rely only on the following two assumptions:

Assumption 6. Given a feature map $\psi:\mathcal{S}\times\mathcal{A}\times\mathcal{W}\to\mathbb{R}^{d'}$, consider the function class
$$\mathcal{F}=\Big\{f:f(s,w)=\Big[\min\Big\{\max_{a\in\mathcal{A}}\big\langle\nu,\psi(s,a,w)\big\rangle+\beta\|\psi(s,a,w)\|_{\Lambda^{-1}},\,H\Big\}\Big]^+,\ \nu\in\mathbb{R}^{d'},\ \beta\ge0,\ \Lambda\in\mathbb{S}_{++}^{d'}\Big\}.$$
Then for any $f\in\mathcal{F}$ and $h\in[H]$, there exists a vector $\nu_h^f\in\mathbb{R}^{d'}$ with $\|\nu_h^f\|_2\le H\sqrt{d'}$ such that $\mathbb{P}_h[f(\cdot,w)](s,a)=\langle\psi(s,a,w),\nu_h^f\rangle$. Moreover, for every $h\in[H]$, there exists a vector $\eta_h$ such that $r_h(s,a,w)=\langle\eta_h,\psi(s,a,w)\rangle$.

Assumption 7. Without loss of generality, $\|\psi(s,a,w)\|_2\le1$ for all $(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}$, and $\|\eta_h\|_2\le\sqrt{d'}$ for all $h\in[H]$.

E.1 OVERVIEW

Let $\psi_h^\tau=\psi(s_h^\tau,a_h^\tau,w^\tau)$. Standard Lifelong-LSVI with computation sharing works with the following action-value functions:
$$Q_h^k(s,a,w)=\Big[r_h(s,a,w)+\big\langle\hat\nu_h^k,\psi(s,a,w)\big\rangle+\beta\|\psi(s,a,w)\|_{(\tilde\Lambda_h^k)^{-1}}\Big]^+,\tag{48}$$
where $\hat\nu_h^k=(\tilde\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\psi_h^\tau\min\big\{\max_{a\in\mathcal{A}}Q_{h+1}^k(s_{h+1}^\tau,a,w^\tau),H\big\}$ and $\tilde\Lambda_h^k=\lambda I_{d'}+\sum_{\tau=1}^{k-1}\psi_h^\tau(\psi_h^\tau)^\top$.

Theorem 5. Let $T=KH$. Under Assumptions 6 and 7, the number of planning calls in Algorithm 5 is at most $d'H\log(1+\frac{K}{d'\lambda})$, and there exists an absolute constant $c>0$ such that for any fixed $\delta\in(0,0.5)$, if we set $\lambda=1$ and $\beta=cd'H\sqrt{\log(d'T/\delta)}$ in Algorithm 5, then with probability at least $1-2\delta$, it holds that
$$R_K\le 2H\sqrt{T\log(d'T/\delta)}+4H\beta\sqrt{2d'K\log(K)}\le\tilde O\big(\sqrt{d'^3H^3T}\big).$$

E.2 NECESSARY ANALYSIS FOR THE PROOF OF THEOREM 5

Thanks to Assumption 6, we have $\mathbb{P}_h[V_{h+1}^k(\cdot,w)](s,a)=\langle\nu_h^k,\psi(s,a,w)\rangle$, where $\nu_h^k=\nu_h^{V_{h+1}^k}$.

Lemma 13. Let $c_\beta$ be the constant in the definition of $\beta$. Then, under Assumption 7, there is an absolute constant $c_0$ independent of $c_\beta$ such that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\psi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w^\tau)-\mathbb{P}_h[V_{h+1}^k(\cdot,w^\tau)](s_h^\tau,a_h^\tau)\big]\Big\|_{(\tilde\Lambda_h^k)^{-1}}\le c_0d'H\sqrt{\log\big((c_\beta+1)d'T/\delta\big)}.$$

Proof. We note that $\|\eta_h+\hat\nu_h^k\|_2\le(1+H)\sqrt{d'}$ and $\|(\tilde\Lambda_h^k)^{-1}\|\le\frac1\lambda$. Thus, Lemmas 19 and 24 together imply that for all $(h,k)\in[H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\Big\|\sum_{\tau=1}^{k-1}\psi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w^\tau)-\mathbb{P}_h[V_{h+1}^k(\cdot,w^\tau)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\tilde\Lambda_h^k)^{-1}}\le 4H^2\Big[\frac{d'}{2}\log\frac{k+\lambda}{\lambda}+d'\log\big(1+8H\sqrt{d'}/\epsilon\big)+d'^2\log\Big(1+\frac{32\beta^2\sqrt{d'}}{\lambda\epsilon^2}\Big)+\log\frac1\delta\Big]+\frac{8k^2\epsilon^2}{\lambda}.$$
If we let $\epsilon=\frac{d'H}{k}$ and $\beta=c_\beta d'H\sqrt{\log(d'T/\delta)}$, then there exists an absolute constant $C>0$ independent of $c_\beta$ such that
$$\Big\|\sum_{\tau=1}^{k-1}\psi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w^\tau)-\mathbb{P}_h[V_{h+1}^k(\cdot,w^\tau)](s_h^\tau,a_h^\tau)\big]\Big\|^2_{(\tilde\Lambda_h^k)^{-1}}\le C(d'+d'^2)H^2\log\big((c_\beta+1)d'T/\delta\big).\qquad\square$$

Lemma 14. Under Assumptions 6 and 7, if we let $\beta=cd'H\sqrt{\lambda\log(d'T/\delta)}$ with an absolute constant $c>0$, then the event
$$\mathcal{E}_4:=\Big\{\big\|\nu_h^k-\hat\nu_h^k\big\|_{\tilde\Lambda_h^k}\le\beta,\ \forall(h,k)\in[H]\times[K]\Big\}\tag{51}$$
holds with probability at least $1-\delta$.

Proof. We decompose
$$\nu_h^k-\hat\nu_h^k=\nu_h^k-(\tilde\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\psi_h^\tau V_{h+1}^k(s_{h+1}^\tau,w^\tau)=(\tilde\Lambda_h^k)^{-1}\Big[\tilde\Lambda_h^k\nu_h^k-\sum_{\tau=1}^{k-1}\psi_h^\tau V_{h+1}^k(s_{h+1}^\tau,w^\tau)\Big]=\underbrace{\lambda(\tilde\Lambda_h^k)^{-1}\nu_h^k}_{q_1}-\underbrace{(\tilde\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\psi_h^\tau\big[V_{h+1}^k(s_{h+1}^\tau,w^\tau)-\mathbb{P}_h[V_{h+1}^k(\cdot,w^\tau)](s_h^\tau,a_h^\tau)\big]}_{q_2}\quad\text{(Eqn. (50))}.$$
Thus, in order to upper bound $\|\nu_h^k-\hat\nu_h^k\|_{\tilde\Lambda_h^k}$, we bound $\|q_1\|_{\tilde\Lambda_h^k}$ and $\|q_2\|_{\tilde\Lambda_h^k}$ separately. From the bound $\|\nu_h^k\|_2\le H\sqrt{d'}$ in Assumption 6, we have
$$\|q_1\|_{\tilde\Lambda_h^k}=\lambda\big\|\nu_h^k\big\|_{(\tilde\Lambda_h^k)^{-1}}\le\sqrt{\lambda}\,\big\|\nu_h^k\big\|_2\le H\sqrt{\lambda d'}.\tag{52}$$
Thanks to Lemma 13, for all $(h,k) \in [H]\times[K]$, with probability at least $1-\delta$ it holds that
$$\|q_2\|_{\hat\Lambda^k_h} \le \Big\| \sum_{\tau=1}^{k-1} \psi^\tau_h \big[ V^k_{h+1}(s^\tau_{h+1}, w_\tau) - \mathbb{P}_h[V^k_{h+1}(\cdot,w_\tau)](s^\tau_h, a^\tau_h) \big] \Big\|_{(\hat\Lambda^k_h)^{-1}} \le c_0 d' H \sqrt{\log\big((c_\beta+1)d'T/\delta\big)}, \quad (53)$$
where $c_0$ and $c_\beta$ are two independent absolute constants. Combining (52) and (53), for all $(h,k) \in [H]\times[K]$, with probability at least $1-\delta$, it holds that $\|\nu^k_h - \hat\nu^k_h\|_{\hat\Lambda^k_h} \le c d' H \sqrt{\lambda\log(d'T/\delta)}$ for some absolute constant $c > 0$.

Lemma 15. Let the setting of Lemma 14 hold. Conditioned on the event $\mathcal{E}_4$ defined in (51), and with $Q^k_h$ computed as in (48), it holds that $Q^k_h(s,a,w) \ge Q^*_h(s,a,w)$ for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$.

Proof. We first note that conditioned on the event $\mathcal{E}_4$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$ and any policy $\pi$, it holds that
$$r_h(s,a,w) + \langle \hat\nu^k_h, \psi(s,a,w)\rangle - Q^\pi_h(s,a,w) - \big[\mathbb{P}_h\big(V^k_{h+1}(\cdot,w) - V^\pi_{h+1}(\cdot,w)\big)\big](s,a)$$
$$= r_h(s,a,w) + \langle \hat\nu^k_h, \psi(s,a,w)\rangle - r_h(s,a,w) - \big[\mathbb{P}_h V^k_{h+1}(\cdot,w)\big](s,a) = \langle \hat\nu^k_h - \nu^k_h, \psi(s,a,w)\rangle$$
$$\le \|\hat\nu^k_h - \nu^k_h\|_{\hat\Lambda^k_h}\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}} \le \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}}. \quad (54)$$
Now, we prove the lemma by induction. The statement holds at time-step $H$ because $Q^k_{H+1}(\cdot,\cdot,\cdot) = Q^*_{H+1}(\cdot,\cdot,\cdot) = 0$, and thus conditioned on the event $\mathcal{E}_4$ defined in (51), for all $(s,a,w,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$\big| r_H(s,a,w) + \langle \hat\nu^k_H, \psi(s,a,w)\rangle - Q^*_H(s,a,w) \big| \le \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_H)^{-1}}.$$
Therefore, conditioned on the event $\mathcal{E}_4$, for all $(s,a,w,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$Q^*_H(s,a,w) \le r_H(s,a,w) + \langle \hat\nu^k_H, \psi(s,a,w)\rangle + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_H)^{-1}} = r_H(s,a,w) + \big[\langle \hat\nu^k_H, \psi(s,a,w)\rangle + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_H)^{-1}}\big]_+ = Q^k_H(s,a,w),$$
where the first equality follows from the fact that $Q^*_H(s,a,w) \ge 0$. Now, suppose the statement holds at time-step $h+1$ and consider time-step $h$.
Conditioned on the event $\mathcal{E}_4$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$0 \le r_h(s,a,w) + \langle \hat\nu^k_h, \psi(s,a,w)\rangle - Q^*_h(s,a,w) - \big[\mathbb{P}_h\big(V^k_{h+1}(\cdot,w) - V^*_{h+1}(\cdot,w)\big)\big](s,a) + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}}$$
$$\le r_h(s,a,w) + \langle \hat\nu^k_h, \psi(s,a,w)\rangle - Q^*_h(s,a,w) + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}},$$
where the first inequality follows from (54) with $\pi = \pi^*$ and the second from the induction assumption $V^k_{h+1} \ge V^*_{h+1}$. Therefore, conditioned on the event $\mathcal{E}_4$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q^*_h(s,a,w) \le r_h(s,a,w) + \langle \hat\nu^k_h, \psi(s,a,w)\rangle + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}} = r_h(s,a,w) + \big[\langle \hat\nu^k_h, \psi(s,a,w)\rangle + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}}\big]_+ = Q^k_h(s,a,w),$$
where the first equality follows from the fact that $Q^*_h(s,a,w) \ge 0$. This completes the proof.
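The optimism argument above hinges on a single inequality, the Cauchy–Schwarz step $\langle \hat\nu - \nu, \psi\rangle \le \|\hat\nu - \nu\|_{\hat\Lambda}\,\|\psi\|_{\hat\Lambda^{-1}}$ used in (54). A quick numerical sanity check of this step, with randomly generated (hypothetical) quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.normal(size=(d, d))
Lam = A @ A.T + np.eye(d)   # a positive-definite surrogate for Lambda_hat
x = rng.normal(size=d)      # plays the role of nu_hat - nu
psi = rng.normal(size=d)

lhs = abs(x @ psi)
# Cauchy-Schwarz in the Lambda inner product: <x, psi> = <L^{1/2} x, L^{-1/2} psi>
rhs = np.sqrt(x @ Lam @ x) * np.sqrt(psi @ np.linalg.solve(Lam, psi))
assert lhs <= rhs + 1e-9
```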

E.3 PROOF OF THEOREM 5

First, we bound the number of times Algorithm 5 updates $\hat\nu^k_h$. Let $P$ be the total number of updates and $k_p$ be the episode at which the agent replans for the $p$-th time. Since replanning at step $h$ is triggered only when $\det(\hat\Lambda^k_h)$ has doubled since the last planning episode, and $\det(\hat\Lambda^K_h) \le (\lambda + K/d')^{d'}$, each step $h$ contributes at most $d'\log\big(1 + \frac{K}{d'\lambda}\big)$ updates, which yields the claimed bound of $d'H\log\big(1 + \frac{K}{d'\lambda}\big)$ planning calls.

Now, we prove the regret bound. Let $\delta^k_h = V^k_h(s^k_h, w_k) - V^{\pi_k}_h(s^k_h, w_k)$ and $\xi^k_{h+1} = \mathbb{E}\big[\delta^k_{h+1} \mid s^k_h, a^k_h\big] - \delta^k_{h+1}$. Conditioned on $\mathcal{E}_4$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q^k_h(s,a,w) - Q^{\pi_k}_h(s,a,w) = r_h(s,a,w) + \langle \hat\nu^k_h, \psi(s,a,w)\rangle - Q^{\pi_k}_h(s,a,w) + \beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}}$$
$$\le \big[\mathbb{P}_h\big(V^k_{h+1}(\cdot,w) - V^{\pi_k}_{h+1}(\cdot,w)\big)\big](s,a) + 2\beta\,\|\psi(s,a,w)\|_{(\hat\Lambda^k_h)^{-1}}. \quad (55)$$
Note that $\delta^k_h \le Q^k_h(s^k_h, a^k_h, w_k) - Q^{\pi_k}_h(s^k_h, a^k_h, w_k)$. Thus, (55) and Lemma 14 imply that for all $(h,k) \in [H]\times[K]$, it holds that
$$\delta^k_h \le \xi^k_{h+1} + \delta^k_{h+1} + 2\beta\,\|\psi(s^k_h, a^k_h, w_k)\|_{(\hat\Lambda^k_h)^{-1}}.$$
Now, we complete the regret analysis following similar steps as those of Theorem 1's proof:
$$\mathcal{R}_K = \sum_{k=1}^K \big[ V^*_1(s^k_1, w_k) - V^{\pi_k}_1(s^k_1, w_k) \big] \le \sum_{k=1}^K \big[ V^k_1(s^k_1, w_k) - V^{\pi_k}_1(s^k_1, w_k) \big] \quad \text{(Lemma 15)}$$
$$= \sum_{k=1}^K \delta^k_1 \le \sum_{k=1}^K \sum_{h=1}^H \xi^k_h + 2\beta \sum_{k=1}^K \sum_{h=1}^H \|\psi(s^k_h, a^k_h, w_k)\|_{(\hat\Lambda^{\tilde k}_h)^{-1}} \le \sum_{k=1}^K \sum_{h=1}^H \xi^k_h + 2\beta \sum_{k=1}^K \sum_{h=1}^H \|\psi(s^k_h, a^k_h, w_k)\|_{(\hat\Lambda^k_h)^{-1}} \sqrt{\frac{\det \hat\Lambda^k_h}{\det \hat\Lambda^{\tilde k}_h}}$$
$$\le 2H\sqrt{T\log(d'T/\delta)} + 4H\beta\sqrt{2\lambda d' K\log(1+K/\lambda)} \le \tilde{O}\big(\sqrt{\lambda d'^3 H^3 T}\big),$$
where $\tilde k$ denotes the most recent planning episode before $k$, and the replanning rule guarantees $\det \hat\Lambda^k_h \le 2\det \hat\Lambda^{\tilde k}_h$.
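The planning-call bound above is the standard determinant-doubling argument: a replanning is triggered only when some $\det(\hat\Lambda^k_h)$ has doubled since the last plan, and the determinant can double at most $d'\log_2(1+K/(d'\lambda))$ times. A numerical sketch of this counting argument, for a single time-step $h$ with hypothetical feature data:

```python
import numpy as np

def count_replans(features, lam=1.0):
    """Count determinant-doubling triggers over one time-step h.
    features: (K, d) array of per-episode feature vectors with norm <= 1."""
    K, d = features.shape
    Lam = lam * np.eye(d)
    _, last_logdet = np.linalg.slogdet(Lam)
    replans = 0
    for k in range(K):
        Lam += np.outer(features[k], features[k])
        _, logdet = np.linalg.slogdet(Lam)
        if logdet > last_logdet + np.log(2):   # det doubled since last plan
            replans += 1
            last_logdet = logdet
    return replans

rng = np.random.default_rng(0)
K, d = 2000, 5
feats = rng.normal(size=(K, d))
feats /= np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1.0)
n = count_replans(feats)
bound = d * np.log2(1 + K / d)   # d' log(1 + K/(d' lambda)) with lambda = 1
assert n <= bound
```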

F DETAILS OF REMARK 4: A MISSPECIFIED SETTING

We first present a definition for an approximate completeness model.

Assumption 8 ($\zeta$-Approximate Completeness). Given the feature maps $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ and $\psi: \mathcal{S}\times\mathcal{A}\times\mathcal{W} \to \mathbb{R}^{d'}$ in Assumption 1, consider the function class
$$\mathcal{F} = \Big\{ f : f(s,w) = \min\Big( \max_{a\in\mathcal{A}} \big[\langle \nu, \psi(s,a,w)\rangle + \beta\,\|\phi(s,a)\|_{\Lambda^{-1}}\big]_+,\ H \Big),\ \nu \in \mathbb{R}^{d'},\ \Lambda \in \mathbb{S}^d_{++},\ \beta \ge 0 \Big\}.$$
For any $f \in \mathcal{F}$ and $h \in [H]$, there exists a vector $\xi^f_h \in \mathbb{R}^{d'}$ with $\|\xi^f_h\| \le H\sqrt{d'}$ such that for all $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$,
$$\big| [\mathbb{P}_h f(\cdot,w)](s,a) - \langle \xi^f_h, \psi(s,a,w)\rangle \big| \le \zeta.$$

Theorem 6. Let $T = KH$. Under Assumptions 1, 8, and 3, the number of planning calls in Algorithm 2 is at most $dH\log\big(1 + \frac{K}{d\lambda}\big)$, and there exists an absolute constant $c > 0$ such that for any fixed $\delta \in (0, 0.5)$, if we set $\lambda = 1$ and $\beta = cH(d + \sqrt{md})\sqrt{\log(mdT/\delta)}$ in Algorithm 2, then with probability at least $1 - 2\delta$, it holds that
$$\mathcal{R}_K \le \tilde{O}\big( \sqrt{md}\, T\zeta + \sqrt{(d^3 + md^2)H^3 T} \big).$$

F.1 NECESSARY ANALYSIS FOR THE PROOF OF THEOREM 6

Let $(s^{(i)}, a^{(i)})$ be the $i$-th element of $\mathcal{D}$ and $\{c_i(s,a)\}_{i\in[d]}$ be the coefficients such that $\phi(s,a) = \sum_{i\in[d]} c_i(s,a)\,\phi(s^{(i)}, a^{(i)})$. Then $L_\phi$ is a positive constant such that $\sum_{i\in[d]} |c_i(s,a)| \le L_\phi$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$.

Lemma 16. Let $\bar{\mathcal{W}} = \{w_\tau : \tau \in [K]\} \cup \{w^{(j)} : j \in [n]\}$. Under the setting of Theorem 6 and conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$ defined in (9), for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$\big| \langle \hat\xi^k_h, \psi(s,a,w)\rangle - \mathbb{P}_h[V^k_{h+1}(\cdot,w)](s,a) \big| \le (2L + L_\phi\sqrt{md})\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}.$$

Proof. Thanks to Assumption 8 and conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, one set of feasible parameters for (8) is $\{\hat\theta^k_h(w^{(j)})\}_{j\in[n]}$ and $\xi^{V^k_{h+1}}_h$ such that
$$\big| \langle \hat\theta^{k,(j)}_h, \phi(s,a)\rangle - \langle \hat\xi^k_h, \psi(s,a,w^{(j)})\rangle \big| \le \zeta\sqrt{md}, \quad \forall (j,(s,a)) \in [n]\times\mathcal{D}. \quad (56)$$
For any triple $(s,a,j) \in \mathcal{S}\times\mathcal{A}\times[n]$, we have
$$\langle \hat\xi^k_h, \psi(s,a,w^{(j)})\rangle = \langle \hat\xi^k_h, \phi(s,a)\otimes\rho(w^{(j)})\rangle = \Big\langle \hat\xi^k_h, \sum_{i\in[d]} c_i(s,a)\,\phi(s^{(i)}, a^{(i)})\otimes\rho(w^{(j)}) \Big\rangle = \sum_{i\in[d]} c_i(s,a)\,\langle \hat\xi^k_h, \psi(s^{(i)}, a^{(i)}, w^{(j)})\rangle \quad \text{(Assumption 3)}$$
$$\le \sqrt{md}\,\zeta \sum_{i\in[d]} |c_i(s,a)| + \sum_{i\in[d]} c_i(s,a)\,\langle \hat\theta^{k,(j)}_h, \phi(s^{(i)}, a^{(i)})\rangle \quad \text{(Eqn. (56))}$$
$$\le L_\phi\sqrt{md}\,\zeta + \langle \hat\theta^{k,(j)}_h, \phi(s,a)\rangle.$$
Similarly, it holds that $\langle \hat\xi^k_h, \psi(s,a,w^{(j)})\rangle \ge -L_\phi\sqrt{md}\,\zeta + \langle \hat\theta^{k,(j)}_h, \phi(s,a)\rangle$. Therefore, for any $(s,a,j) \in \mathcal{S}\times\mathcal{A}\times[n]$, it holds that
$$\big| \langle \hat\xi^k_h, \psi(s,a,w^{(j)})\rangle - \langle \hat\theta^{k,(j)}_h, \phi(s,a)\rangle \big| \le L_\phi\sqrt{md}\,\zeta. \quad (57)$$
For any $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$, it holds that
$$\mathbb{P}_h[V^k_{h+1}(\cdot,w)](s,a) = \langle \theta^k_h(w), \phi(s,a)\rangle \quad \text{(Eqn. (4))}$$
$$\le \zeta + \langle \xi^{V^k_{h+1}}_h, \psi(s,a,w)\rangle \quad \text{(Assumption 8)}$$
$$= \zeta + \sum_{j\in[n]} c_j(w)\,\langle \xi^{V^k_{h+1}}_h, \psi(s,a,w^{(j)})\rangle \quad \text{(Assumption 3)}$$
$$\le \zeta\Big(1 + \sum_{j\in[n]} |c_j(w)|\Big) + \sum_{j\in[n]} c_j(w)\,\mathbb{P}_h[V^k_{h+1}(\cdot,w^{(j)})](s,a) \quad \text{(Assumption 8)}$$
$$\le 2L\zeta + \sum_{j\in[n]} c_j(w)\,\langle \theta^k_h(w^{(j)}), \phi(s,a)\rangle.$$
Similarly, it holds that $\mathbb{P}_h[V^k_{h+1}(\cdot,w)](s,a) \ge -2L\zeta + \sum_{j\in[n]} c_j(w)\,\langle \theta^k_h(w^{(j)}), \phi(s,a)\rangle$. Therefore, for any $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$, it holds that
$$\Big| \mathbb{P}_h[V^k_{h+1}(\cdot,w)](s,a) - \sum_{j\in[n]} c_j(w)\,\langle \theta^k_h(w^{(j)}), \phi(s,a)\rangle \Big| \le 2L\zeta. \quad (58)$$
Finally, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, it holds that
$$\big| \langle \hat\xi^k_h, \psi(s,a,w)\rangle - \mathbb{P}_h[V^k_{h+1}(\cdot,w)](s,a) \big| \le 2L\zeta + \sum_{j\in[n]} |c_j(w)|\,\big| \langle \hat\xi^k_h, \psi(s,a,w^{(j)})\rangle - \langle \theta^k_h(w^{(j)}), \phi(s,a)\rangle \big| \quad \text{(Assumption 3 and Eqn. (58))}$$
$$\le 2L\zeta + \sum_{j\in[n]} |c_j(w)|\,\big| \langle \hat\xi^k_h, \psi(s,a,w^{(j)})\rangle - \langle \hat\theta^{k,(j)}_h, \phi(s,a)\rangle \big| + \sum_{j\in[n]} |c_j(w)|\,\big| \langle \hat\theta^{k,(j)}_h - \hat\theta^k_h(w^{(j)}), \phi(s,a)\rangle \big| + \sum_{j\in[n]} |c_j(w)|\,\big| \langle \hat\theta^k_h(w^{(j)}) - \theta^k_h(w^{(j)}), \phi(s,a)\rangle \big|$$
$$\le (2L + L_\phi\sqrt{md})\zeta + \sum_{j\in[n]} |c_j(w)|\,\big| \langle \hat\theta^{k,(j)}_h - \hat\theta^k_h(w^{(j)}), \phi(s,a)\rangle \big| + \sum_{j\in[n]} |c_j(w)|\,\big| \langle \hat\theta^k_h(w^{(j)}) - \theta^k_h(w^{(j)}), \phi(s,a)\rangle \big| \quad \text{(Eqn. (57))}$$
$$\le (2L + L_\phi\sqrt{md})\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}.$$
As the final step in the regret analysis, we state the following lemma, which uses Lemma 16 to prove the optimistic nature of UCBlvd. Then, following the standard analysis of single-task LSVI-UCB, we derive the regret bound for misspecified settings.

Lemma 17. Let $\bar{\mathcal{W}} = \{w_\tau : \tau \in [K]\} \cup \{w^{(j)} : j \in [n]\}$. Under the setting of Theorem 6 and conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$ defined in (9), and with $Q^k_h$ computed as in (7), it holds that $(2L + L_\phi\sqrt{md})(H-h+1)\zeta + Q^k_h(s,a,w) \ge Q^*_h(s,a,w)$ for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$.

Proof.
We first note that conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$ and any policy $\pi$, it holds that
$$r_h(s,a,w) + \langle \hat\xi^k_h, \psi(s,a,w)\rangle - Q^\pi_h(s,a,w) - \big[\mathbb{P}_h\big(V^k_{h+1}(\cdot,w) - V^\pi_{h+1}(\cdot,w)\big)\big](s,a)$$
$$= r_h(s,a,w) + \langle \hat\xi^k_h, \psi(s,a,w)\rangle - r_h(s,a,w) - \big[\mathbb{P}_h V^k_{h+1}(\cdot,w)\big](s,a) = \langle \hat\xi^k_h, \psi(s,a,w)\rangle - \big[\mathbb{P}_h V^k_{h+1}(\cdot,w)\big](s,a)$$
$$\le (2L + L_\phi\sqrt{md})\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}.$$
Now, we prove the lemma by induction. The statement holds at time-step $H$ because $Q^k_{H+1}(\cdot,\cdot,\cdot) = Q^*_{H+1}(\cdot,\cdot,\cdot) = 0$, and thus conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$ defined in (9), for all $(s,a,w,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$\big| r_H(s,a,w) + \langle \hat\xi^k_H, \psi(s,a,w)\rangle - Q^*_H(s,a,w) \big| \le (2L + L_\phi\sqrt{md})\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_H)^{-1}}.$$
Therefore, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, for all $(s,a,w,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[K]$, we have
$$Q^*_H(s,a,w) \le r_H(s,a,w) + \langle \hat\xi^k_H, \psi(s,a,w)\rangle + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_H)^{-1}} + (2L + L_\phi\sqrt{md})\zeta$$
$$= r_H(s,a,w) + \big[\langle \hat\xi^k_H, \psi(s,a,w)\rangle + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_H)^{-1}}\big]_+ + (2L + L_\phi\sqrt{md})\zeta = Q^k_H(s,a,w) + (2L + L_\phi\sqrt{md})\zeta,$$
where the first equality follows from the fact that $Q^*_H(s,a,w) \ge 0$. Now, suppose the statement holds at time-step $h+1$ and consider time-step $h$. Conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$0 \le r_h(s,a,w) + \langle \hat\xi^k_h, \psi(s,a,w)\rangle - Q^*_h(s,a,w) - \big[\mathbb{P}_h\big(V^k_{h+1}(\cdot,w) - V^*_{h+1}(\cdot,w)\big)\big](s,a) + (2L + L_\phi\sqrt{md})\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}$$
$$\le r_h(s,a,w) + \langle \hat\xi^k_h, \psi(s,a,w)\rangle - Q^*_h(s,a,w) + (2L + L_\phi\sqrt{md})(H-h+1)\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}. \quad \text{(induction assumption)}$$
Therefore, conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q^*_h(s,a,w) \le r_h(s,a,w) + \langle \hat\xi^k_h, \psi(s,a,w)\rangle + (2L + L_\phi\sqrt{md})(H-h+1)\zeta + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}$$
$$= r_h(s,a,w) + \big[\langle \hat\xi^k_h, \psi(s,a,w)\rangle + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}\big]_+ + (2L + L_\phi\sqrt{md})(H-h+1)\zeta = Q^k_h(s,a,w) + (2L + L_\phi\sqrt{md})(H-h+1)\zeta,$$
where the first equality follows from the fact that $Q^*_h(s,a,w) \ge 0$.
This completes the proof.

F.2 PROOF OF THEOREM 6

The proof for establishing the upper bound on the number of planning calls in the misspecified setting follows exactly the same steps as those in the proof of Theorem 2. Now, we prove the regret bound. Let $\delta^k_h = V^k_h(s^k_h, w_k) - V^{\pi_k}_h(s^k_h, w_k)$ and $\xi^k_{h+1} = \mathbb{E}\big[\delta^k_{h+1} \mid s^k_h, a^k_h\big] - \delta^k_{h+1}$. Conditioned on the events $\{\mathcal{E}_1(w)\}_{w\in\bar{\mathcal{W}}}$, for all $(s,a,w,h,k) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}\times[H]\times[K]$, we have
$$Q^k_h(s,a,w) - Q^{\pi_k}_h(s,a,w) = r_h(s,a,w) + \langle \hat\xi^k_h, \psi(s,a,w)\rangle - Q^{\pi_k}_h(s,a,w) + 2L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}$$
$$\le \big[\mathbb{P}_h\big(V^k_{h+1}(\cdot,w) - V^{\pi_k}_{h+1}(\cdot,w)\big)\big](s,a) + (2L + L_\phi\sqrt{md})\zeta + 4L\beta\,\|\phi(s,a)\|_{(\Lambda^k_h)^{-1}}. \quad (59)$$
Note that $\delta^k_h \le Q^k_h(s^k_h, a^k_h, w_k) - Q^{\pi_k}_h(s^k_h, a^k_h, w_k)$. Thus, combining (59), Lemma 5, and a union bound over $\bar{\mathcal{W}}$, we conclude that for all $(h,k) \in [H]\times[K]$, with probability at least $1-\delta$, it holds that
$$\delta^k_h \le \xi^k_{h+1} + \delta^k_{h+1} + (2L + L_\phi\sqrt{md})\zeta + 4L\beta\,\|\phi(s^k_h, a^k_h)\|_{(\Lambda^k_h)^{-1}}.$$
Now, we complete the regret analysis following similar steps as those of Theorem 1's proof:
$$\mathcal{R}_K = \sum_{k=1}^K \big[ V^*_1(s^k_1, w_k) - V^{\pi_k}_1(s^k_1, w_k) \big] \le (2L + L_\phi\sqrt{md})HK\zeta + \sum_{k=1}^K \big[ V^k_1(s^k_1, w_k) - V^{\pi_k}_1(s^k_1, w_k) \big] \quad \text{(Lemma 17)}$$
$$= (2L + L_\phi\sqrt{md})HK\zeta + \sum_{k=1}^K \delta^k_1 \le (4L + 2L_\phi\sqrt{md})HK\zeta + \sum_{k=1}^K \sum_{h=1}^H \xi^k_h + 4L\beta \sum_{k=1}^K \sum_{h=1}^H \|\phi(s^k_h, a^k_h)\|_{(\Lambda^{\tilde k}_h)^{-1}}$$
$$\le (4L + 2L_\phi\sqrt{md})HK\zeta + \sum_{k=1}^K \sum_{h=1}^H \xi^k_h + 4L\beta \sum_{k=1}^K \sum_{h=1}^H \|\phi(s^k_h, a^k_h)\|_{(\Lambda^k_h)^{-1}} \sqrt{\frac{\det \Lambda^k_h}{\det \Lambda^{\tilde k}_h}}$$
$$\le (4L + 2L_\phi\sqrt{md})HK\zeta + 2H\sqrt{T\log(dT/\delta)} + 8HL\beta\sqrt{2dK\log(1+K/\lambda)} \le \tilde{O}\big( (L + L_\phi\sqrt{md})HK\zeta + L\sqrt{\lambda(d^3 + md^2)H^3 T} \big),$$
where $\tilde k$ denotes the most recent planning episode before $k$, and the last two inequalities follow from similar steps in the proof of Theorem 1.
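The final chain of inequalities, here and in the proof of Theorem 5, rests on the elliptical potential bound $\sum_{k=1}^K \|\phi^k\|^2_{(\Lambda^k)^{-1}} \le 2d\log(1+K/(d\lambda))$, which holds when $\lambda \ge 1$ and $\|\phi^k\| \le 1$. A numerical check under those assumptions (all data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, lam = 500, 4, 1.0
Lam = lam * np.eye(d)
total = 0.0
for _ in range(K):
    phi = rng.normal(size=d)
    phi /= max(np.linalg.norm(phi), 1.0)       # enforce ||phi|| <= 1
    total += phi @ np.linalg.solve(Lam, phi)   # ||phi_k||^2 in (Lambda^k)^{-1}
    Lam += np.outer(phi, phi)                  # rank-one update

bound = 2 * d * np.log(1 + K / (d * lam))
assert total <= bound
```

By Cauchy–Schwarz, the same bound controls the linear sum: $\sum_k \|\phi^k\|_{(\Lambda^k)^{-1}} \le \sqrt{K \sum_k \|\phi^k\|^2_{(\Lambda^k)^{-1}}} \le \sqrt{2dK\log(1+K/(d\lambda))}$, which is exactly the $\sqrt{2dK\log(1+K/\lambda)}$-type factor appearing in the displays above.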

G AUXILIARY LEMMAS

Notations. $N_\epsilon(\mathcal{V})$ denotes the $\epsilon$-covering number of the class $\mathcal{V}$ of functions mapping $\mathcal{S}$ to $\mathbb{R}$ with respect to the distance $\mathrm{dist}(V, V') = \sup_s |V(s) - V'(s)|$.

Lemma 18 (Bound on weights $\theta^k_h(w)$). Under Assumption 1, for any set of action-value functions $\{Q^k_h\}_{h\in[H]}$ and any $(w,h,k) \in \mathcal{W}\times[H]\times[K]$, it holds that $\|\theta^k_h(w)\|_2 \le H\sqrt{d}$.

Proof. Recall that $V^k_h(s,w) = \min\big(\max_{a\in\mathcal{A}} Q^k_h(s,a,w),\ H\big)$ and $\theta^k_h(w) := \int_{\mathcal{S}} V^k_{h+1}(s', w)\, d\mu_h(s')$. Thus, we have $\|\theta^k_h(w)\|_2 = \big\| \int_{\mathcal{S}} V^k_{h+1}(s', w)\, d\mu_h(s') \big\| \le H\sqrt{d}$.

Lemma 19 (Lemma D.4 in Jin et al. (2020)). Let $\{s_\tau\}_{\tau=1}^\infty$ be a stochastic process on state space $\mathcal{S}$ with corresponding filtration $\{\mathcal{F}_\tau\}_{\tau=0}^\infty$. Let $\{\phi_\tau\}_{\tau=0}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $\phi_\tau \in \mathcal{F}_{\tau-1}$ and $\|\phi_\tau\| \le 1$. Let $\Lambda_k = \lambda I_d + \sum_{\tau=1}^{k-1} \phi_\tau\phi_\tau^\top$. Then with probability at least $1-\delta$, for all $k \ge 0$ and all $V \in \mathcal{V}$ such that $\sup_{s\in\mathcal{S}} |V(s)| \le H$, we have
$$\Big\| \sum_{\tau=1}^{k} \phi_\tau \big[ V(s_\tau) - \mathbb{E}[V(s_\tau) \mid \mathcal{F}_{\tau-1}] \big] \Big\|^2_{\Lambda_k^{-1}} \le 4H^2 \Big[ \frac{d}{2}\log\frac{k+\lambda}{\lambda} + \log\frac{N_\epsilon(\mathcal{V})}{\delta} \Big] + \frac{8k^2\epsilon^2}{\lambda}.$$

Lemma 20. For any $\epsilon > 0$, the $\epsilon$-covering number of the Euclidean ball in $\mathbb{R}^d$ with radius $R > 0$ is upper bounded by $(1 + 2R/\epsilon)^d$.

Lemma 21. For a fixed $w$, let $\mathcal{V}$ denote a class of functions mapping from $\mathcal{S}$ to $\mathbb{R}$ of the parametric form
$$V(\cdot) = \min\Big( \max_{a\in\mathcal{A}} \langle z, \psi(\cdot,a,w)\rangle + \langle y, \phi(\cdot,a)\rangle + \beta\sqrt{\phi(\cdot,a)^\top Y \phi(\cdot,a)},\ H \Big),$$
where the parameters $\beta \in \mathbb{R}$, $z \in \mathbb{R}^{d'}$, $y \in \mathbb{R}^d$, and $Y \in \mathbb{R}^{d\times d}$ satisfy $0 \le \beta \le B$, $\|z\| \le \bar z$, $\|y\| \le \bar y$, and $\|Y\| \le \lambda^{-1}$. Then
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d\log(1 + 4\bar y/\epsilon) + d^2\log\Big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big).$$

Proof. First, we reparametrize $\mathcal{V}$ by letting $\tilde Y = \beta^2 Y$. We have $V(\cdot) = \min\big( \max_{a\in\mathcal{A}} \langle z, \psi(\cdot,a,w)\rangle + \langle y, \phi(\cdot,a)\rangle + \sqrt{\phi(\cdot,a)^\top \tilde Y \phi(\cdot,a)},\ H \big)$ for $\|z\| \le \bar z$, $\|y\| \le \bar y$, and $\|\tilde Y\| \le \frac{B^2}{\lambda}$. For any two functions $V_1, V_2 \in \mathcal{V}$ with parameters $(z_1, y_1, \tilde Y_1)$ and $(z_2, y_2, \tilde Y_2)$, respectively, we have
$$\mathrm{dist}(V_1, V_2) \le \sup_{(s,a)\in\mathcal{S}\times\mathcal{A}} \Big| \langle z_1, \psi(s,a,w)\rangle + \langle y_1, \phi(s,a)\rangle + \sqrt{\phi(s,a)^\top \tilde Y_1 \phi(s,a)} - \langle z_2, \psi(s,a,w)\rangle - \langle y_2, \phi(s,a)\rangle - \sqrt{\phi(s,a)^\top \tilde Y_2 \phi(s,a)} \Big|$$
$$\le \sup_{\|\psi\|\le 1} \langle z_1 - z_2, \psi\rangle + \sup_{\|\phi\|\le 1} \langle y_1 - y_2, \phi\rangle + \sup_{\|\phi\|\le 1} \sqrt{\big|\phi^\top(\tilde Y_1 - \tilde Y_2)\phi\big|} \quad \text{(because } |\sqrt{a}-\sqrt{b}| \le \sqrt{|a-b|} \text{ for } a,b \ge 0\text{)}$$
$$= \|z_1 - z_2\| + \|y_1 - y_2\| + \sqrt{\|\tilde Y_1 - \tilde Y_2\|} \le \|z_1 - z_2\| + \|y_1 - y_2\| + \sqrt{\|\tilde Y_1 - \tilde Y_2\|_F}. \quad (60)$$
Proof of Lemma 22. First, we reparametrize $\mathcal{V}$ by letting $\tilde Y = \beta^2 Y$. We have $V(\cdot) = \min\big(\max_{a\in\mathcal{A}} \langle z, \psi(\cdot,a,w)\rangle + \sqrt{\phi(\cdot,a)^\top \tilde Y \phi(\cdot,a)},\ H\big)$ for $\|z\| \le \bar z$ and $\|\tilde Y\| \le \frac{B^2}{\lambda}$. For any two functions $V_1, V_2 \in \mathcal{V}$ with parameters $(z_1, \tilde Y_1)$ and $(z_2, \tilde Y_2)$, respectively, we have
$$\mathrm{dist}(V_1, V_2) \le \sup_{\|\psi\|\le1,\|\phi\|\le1} \big| \langle z_1, \psi\rangle + \sqrt{\phi^\top \tilde Y_1 \phi} - \langle z_2, \psi\rangle - \sqrt{\phi^\top \tilde Y_2 \phi} \big| \le \sup_{\|\psi\|\le1} \langle z_1 - z_2, \psi\rangle + \sup_{\|\phi\|\le1} \sqrt{\big|\phi^\top(\tilde Y_1 - \tilde Y_2)\phi\big|}$$
$$= \|z_1 - z_2\| + \sqrt{\|\tilde Y_1 - \tilde Y_2\|} \le \|z_1 - z_2\| + \sqrt{\|\tilde Y_1 - \tilde Y_2\|_F}, \quad (61)$$
where we used $|\sqrt{a} - \sqrt{b}| \le \sqrt{|a-b|}$ for $a, b \ge 0$.

Proof of Lemma 23. First, we reparametrize $\mathcal{V}$ by letting $Z = \beta^2 Y$ and $\tilde Z = \tilde\beta^2 \tilde Y$. For any two functions $V_1, V_2 \in \mathcal{V}$ with parameters $(z_1, Z_1, \tilde Z_1)$ and $(z_2, Z_2, \tilde Z_2)$, respectively, we have
$$\mathrm{dist}(V_1, V_2) \le \sup_{(s,a)\in\mathcal{S}\times\mathcal{A}} \big| \langle z_1, \psi(s,a,w)\rangle + \sqrt{\phi(s,a)^\top Z_1 \phi(s,a)} + \sqrt{\psi(s,a,w)^\top \tilde Z_1 \psi(s,a,w)} - \langle z_2, \psi(s,a,w)\rangle - \sqrt{\phi(s,a)^\top Z_2 \phi(s,a)} - \sqrt{\psi(s,a,w)^\top \tilde Z_2 \psi(s,a,w)} \big|$$
$$\le \sup_{\|\psi\|\le1} \langle z_1 - z_2, \psi\rangle + \sup_{\|\phi\|\le1} \sqrt{\big|\phi^\top(Z_1 - Z_2)\phi\big|} + \sup_{\|\psi\|\le1} \sqrt{\big|\psi^\top(\tilde Z_1 - \tilde Z_2)\psi\big|} \quad \text{(because } |\sqrt{a}-\sqrt{b}| \le \sqrt{|a-b|}\text{)}$$
$$= \|z_1 - z_2\| + \sqrt{\|Z_1 - Z_2\|} + \sqrt{\|\tilde Z_1 - \tilde Z_2\|} \le \|z_1 - z_2\| + \sqrt{\|Z_1 - Z_2\|_F} + \sqrt{\|\tilde Z_1 - \tilde Z_2\|_F}. \quad (62)$$

Proof of Lemma 24. First, we reparametrize $\mathcal{V}$ by letting $\tilde Y = \beta^2 Y$. We have $V(\cdot,\cdot) = \min\big(\max_{a\in\mathcal{A}} \langle z, \psi(\cdot,a,\cdot)\rangle + \sqrt{\psi(\cdot,a,\cdot)^\top \tilde Y \psi(\cdot,a,\cdot)},\ H\big)$ for $\|z\| \le \bar z$ and $\|\tilde Y\| \le \frac{B^2}{\lambda}$. For any two functions $V_1, V_2 \in \mathcal{V}$ with parameters $(z_1, \tilde Y_1)$ and $(z_2, \tilde Y_2)$, respectively, we have
$$\mathrm{dist}(V_1, V_2) \le \sup_{(s,a,w)\in\mathcal{S}\times\mathcal{A}\times\mathcal{W}} \big| \langle z_1, \psi(s,a,w)\rangle + \sqrt{\psi(s,a,w)^\top \tilde Y_1 \psi(s,a,w)} - \langle z_2, \psi(s,a,w)\rangle - \sqrt{\psi(s,a,w)^\top \tilde Y_2 \psi(s,a,w)} \big|$$
$$\le \sup_{\|\psi\|\le1} \langle z_1 - z_2, \psi\rangle + \sup_{\|\psi\|\le1} \sqrt{\big|\psi^\top(\tilde Y_1 - \tilde Y_2)\psi\big|} = \|z_1 - z_2\| + \sqrt{\|\tilde Y_1 - \tilde Y_2\|} \le \|z_1 - z_2\| + \sqrt{\|\tilde Y_1 - \tilde Y_2\|_F}. \quad (63)$$

In Figure 3, we plot UCBlvd's number of planning calls for different numbers of task episodes $K$, in the same setting as that of Figure 2a. This figure empirically verifies the logarithmic dependence of the number of planning calls on $K$, as suggested by Theorem 2.
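Each covering argument above reduces the distance between two value functions to parameter distances via two elementary facts: $|\sqrt{a} - \sqrt{b}| \le \sqrt{|a-b|}$ for $a, b \ge 0$, and $\sup_{\|\phi\|\le1} |\phi^\top(Y_1 - Y_2)\phi| \le \|Y_1 - Y_2\| \le \|Y_1 - Y_2\|_F$. Both are easy to sanity-check numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

# |sqrt(a) - sqrt(b)| <= sqrt(|a - b|) for a, b >= 0
a, b = rng.uniform(0.0, 10.0, size=(2, 1000))
assert np.all(np.abs(np.sqrt(a) - np.sqrt(b)) <= np.sqrt(np.abs(a - b)) + 1e-12)

# sup over the unit ball of |phi^T (Y1 - Y2) phi| is at most the spectral norm,
# which is in turn at most the Frobenius norm.
d = 5
Y1, Y2 = rng.normal(size=(2, d, d))
D = Y1 - Y2
spectral = np.linalg.norm(D, 2)        # largest singular value
fro = np.linalg.norm(D, 'fro')
assert spectral <= fro + 1e-12
```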



We adopt a stricter definition of lifelong RL here to distinguish it from multi-task RL; some existing works on lifelong RL (e.g., Brunskill & Li (2014); Lecarpentier et al. (2021)) assume i.i.d. tasks. In general, a context-dependent dynamics would take the form $P_h(s'|s,a,w)$. Ammar et al. (2015) give regret bounds only for a linearized value difference; Brunskill & Li (2015) show regret bounds only for a finite number of tasks.



[Figure panels: per-episode reward vs. episode $k$. (a) Setting of Theorem 2, $d = 5$, $m = 5$, $d' = 25$. (b) Setting of Remark 2, $d = 5$, $d' = 10$.]

Figure 1: UCBlvd vs. Lifelong-LSVI. The experimental results are averaged over 50 random seeds.

…); Hessel et al. (2019); Brunskill & Li (2013); Fifty et al. (2021); Zhang & Wang (2021); Sodhani et al. (2021), tasks are assumed to be chosen from a known finite set, and in Yang et al. (2020); Wilson et al. (2007); Brunskill & Li (2013); Sun et al. (2021), tasks are sampled from a fixed distribution. By contrast, our setting provides guarantees on regret and the number of planning calls for adversarial task sequences.

$\mathcal{D} = \{(s,a) : \phi(s,a) \text{ are } d \text{ linearly independent vectors}\}$, and $\hat\theta^k_h(w)$ and $\Lambda^k_h$ are defined in (5).

…$\psi(s,a,w)\rangle$. For any $w \in \mathcal{W}$, define $\mathcal{D}(w) = \{(s,a) : \psi(s,a,w) \text{ are } d+d' \text{ linearly independent vectors}\}$. Given Assumption 5, we modify UCBlvd as follows.

Algorithm 4: Modified UCBlvd
Set: $Q^k_{H+1}(\cdot,\cdot,\cdot) = 0$ for all $k \in [K]$; $\tilde k = 1$.
for episodes $k = 1, \ldots, K$ do
  Observe the initial state $s^k_1$ and the task context $w_k$.
  if $\exists h \in [H]$ such that $\det(\Lambda^k_h) > 2\det(\Lambda^{\tilde k}_h)$ then
    for time-steps $h = H, \ldots, 1$ do
      Compute $\hat\xi^k_h$ as in (39).

Algorithm 5: Standard Lifelong-LSVI with Computation Sharing
Set: $Q^k_{H+1}(\cdot,\cdot,\cdot) = 0$ for all $k \in [K]$; $\tilde k = 1$.
for episodes $k = 1, \ldots, K$ do
  Observe the initial state $s^k_1$ and the task context $w_k$.
  if $\exists h \in [H]$ such that $\det(\hat\Lambda^k_h) > 2\det(\hat\Lambda^{\tilde k}_h)$ then …
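The algorithm box above is truncated in this copy, but its control flow is the usual "rare-switching" pattern: act with cached parameters, and replan (recompute all $\hat\nu_h$ backwards) only when some covariance determinant has doubled. The following Python sketch shows only that control flow; the class and method names are ours, and the planner is a placeholder, not the paper's implementation.

```python
import numpy as np

class LifelongLSVI:
    """Control-flow sketch of the lazy-replanning loop (computation sharing)."""

    def __init__(self, d, H, lam=1.0):
        self.H = H
        self.Lam = [lam * np.eye(d) for _ in range(H)]       # current Lambda_hat_h
        self.Lam_plan = [lam * np.eye(d) for _ in range(H)]  # Lambda_hat_h at last plan
        self.nu = [np.zeros(d) for _ in range(H)]            # cached nu_hat_h

    def needs_replan(self):
        # Replan iff det(Lambda_hat_h) has doubled since the last planning call.
        return any(
            np.linalg.slogdet(L)[1] > np.linalg.slogdet(Lp)[1] + np.log(2)
            for L, Lp in zip(self.Lam, self.Lam_plan)
        )

    def observe(self, h, psi):
        """Rank-one covariance update after observing feature psi at step h."""
        self.Lam[h] += np.outer(psi, psi)

    def maybe_replan(self, backup_fn):
        """backup_fn(h) returns the new nu_hat_h (e.g. a ridge backup on replay)."""
        if not self.needs_replan():
            return False
        for h in reversed(range(self.H)):    # backward value distillation
            self.nu[h] = backup_fn(h)
        self.Lam_plan = [L.copy() for L in self.Lam]
        return True
```

Between replans the agent keeps acting greedily with respect to the cached $\hat\nu_h$, which is what makes the number of planning calls logarithmic in $K$ rather than linear.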

…$\max_{a\in\mathcal{A}} \langle z, \psi(\cdot,a,w)\rangle + \langle y, \phi(\cdot,a)\rangle + \beta\sqrt{\phi(\cdot,a)^\top Y \phi(\cdot,a)}$, $H\big)$, where the parameters $\beta \in \mathbb{R}$, $z \in \mathbb{R}^{d'}$, $y \in \mathbb{R}^d$, and $Y \in \mathbb{R}^{d\times d}$ satisfy $0 \le \beta \le B$, $\|z\| \le \bar z$, $\|y\| \le \bar y$, and $\|Y\| \le \lambda^{-1}$. Assume $\|\phi(s,a)\| \le 1$ and $\|\psi(s,a,w)\| \le 1$ for all $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$. Then $\log$ …

Let $\mathcal{C}_z$ and $\mathcal{C}_y$ be $\epsilon/2$-covers of $\{z \in \mathbb{R}^{d'} : \|z\| \le \bar z\}$ and $\{y \in \mathbb{R}^d : \|y\| \le \bar y\}$, respectively, with respect to the 2-norm, and let $\mathcal{C}_Y$ be an $\epsilon^2/4$-cover of $\{Y \in \mathbb{R}^{d\times d} : \|Y\|_F \le \frac{B^2\sqrt{d}}{\lambda}\}$ with respect to the Frobenius norm. By Lemma 20, we know $|\mathcal{C}_z| \le (1 + 4\bar z/\epsilon)^{d'}$, $|\mathcal{C}_y| \le (1 + 4\bar y/\epsilon)^d$, and $|\mathcal{C}_Y| \le \big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\big)^{d^2}$. According to (60), it holds that $N_\epsilon(\mathcal{V}) \le |\mathcal{C}_z||\mathcal{C}_y||\mathcal{C}_Y|$, and therefore
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d\log(1 + 4\bar y/\epsilon) + d^2\log\Big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big).$$

Lemma 22. For a fixed $w$, let $\mathcal{V}$ denote a class of functions mapping from $\mathcal{S}$ to $\mathbb{R}$ of the parametric form
$$V(\cdot) = \min\Big( \max_{a\in\mathcal{A}} \big[\langle z, \psi(\cdot,a,w)\rangle + 2L\beta\sqrt{\phi(\cdot,a)^\top Y \phi(\cdot,a)}\big]_+,\ H \Big),$$
where the parameters $\beta \in \mathbb{R}$, $z \in \mathbb{R}^{d'}$, and $Y \in \mathbb{R}^{d\times d}$ satisfy $0 \le \beta \le B$, $\|z\| \le \bar z$, and $\|Y\| \le \lambda^{-1}$. Assume $\|\phi(s,a)\| \le 1$ and $\|\psi(s,a,w)\| \le 1$ for all $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$. Then
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d^2\log\Big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big).$$

Let $\mathcal{C}_z$ be an $\epsilon/2$-cover of $\{z \in \mathbb{R}^{d'} : \|z\| \le \bar z\}$ with respect to the 2-norm, and let $\mathcal{C}_Y$ be an $\epsilon^2/4$-cover of $\{Y \in \mathbb{R}^{d\times d} : \|Y\|_F \le \frac{B^2\sqrt{d}}{\lambda}\}$ with respect to the Frobenius norm. By Lemma 20, we know $|\mathcal{C}_z| \le (1 + 4\bar z/\epsilon)^{d'}$ and $|\mathcal{C}_Y| \le \big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\big)^{d^2}$. According to (61), it holds that $N_\epsilon(\mathcal{V}) \le |\mathcal{C}_z||\mathcal{C}_Y|$, and therefore
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d^2\log\Big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big).$$

Lemma 23. For a fixed $w$, let $\mathcal{V}$ denote a class of functions mapping from $\mathcal{S}$ to $\mathbb{R}$ of the parametric form
$$V(\cdot) = \min\Big( \max_{a\in\mathcal{A}} \big[\langle z, \psi(\cdot,a,w)\rangle + 2L\beta\sqrt{\phi(\cdot,a)^\top Y \phi(\cdot,a)} + \tilde\beta\sqrt{\psi(\cdot,a,w)^\top \tilde Y \psi(\cdot,a,w)}\big]_+,\ H \Big),$$
where the parameters $\beta, \tilde\beta \in \mathbb{R}$, $z \in \mathbb{R}^{d'}$, $Y \in \mathbb{R}^{d\times d}$, and $\tilde Y \in \mathbb{R}^{d'\times d'}$ satisfy $0 \le \beta \le B$, $0 \le \tilde\beta \le \tilde B$, $\|z\| \le \bar z$, $\|Y\| \le \lambda^{-1}$, and $\|\tilde Y\| \le \lambda^{-1}$. Assume $\|\phi(s,a)\| \le 1$ and $\|\psi(s,a,w)\| \le 1$ for all $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$. Then
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d^2\log\Big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big) + d'^2\log\Big(1 + \frac{8\tilde B^2\sqrt{d'}}{\lambda\epsilon^2}\Big).$$
First, we reparametrize $\mathcal{V}$ by letting $Z = \beta^2 Y$ and $\tilde Z = \tilde\beta^2 \tilde Y$. We have $V(\cdot) = \min\big(\max_{a\in\mathcal{A}} \langle z, \psi(\cdot,a,w)\rangle + \sqrt{\phi(\cdot,a)^\top Z \phi(\cdot,a)} + \sqrt{\psi(\cdot,a,w)^\top \tilde Z \psi(\cdot,a,w)},\ H\big)$ for $\|z\| \le \bar z$, $\|Z\| \le \frac{B^2}{\lambda}$, and $\|\tilde Z\| \le \frac{\tilde B^2}{\lambda}$.

Let $\mathcal{C}_z$ be an $\epsilon/2$-cover of $\{z \in \mathbb{R}^{d'} : \|z\| \le \bar z\}$ with respect to the 2-norm, $\mathcal{C}_Z$ an $\epsilon^2/4$-cover of $\{Z \in \mathbb{R}^{d\times d} : \|Z\|_F \le \frac{B^2\sqrt{d}}{\lambda}\}$, and $\mathcal{C}_{\tilde Z}$ an $\epsilon^2/4$-cover of $\{\tilde Z \in \mathbb{R}^{d'\times d'} : \|\tilde Z\|_F \le \frac{\tilde B^2\sqrt{d'}}{\lambda}\}$, the latter two with respect to the Frobenius norm. By Lemma 20, we know $|\mathcal{C}_z| \le (1 + 4\bar z/\epsilon)^{d'}$, $|\mathcal{C}_Z| \le \big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\big)^{d^2}$, and $|\mathcal{C}_{\tilde Z}| \le \big(1 + \frac{8\tilde B^2\sqrt{d'}}{\lambda\epsilon^2}\big)^{d'^2}$. According to (62), it holds that $N_\epsilon(\mathcal{V}) \le |\mathcal{C}_z||\mathcal{C}_Z||\mathcal{C}_{\tilde Z}|$, and therefore
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d^2\log\Big(1 + \frac{8B^2\sqrt{d}}{\lambda\epsilon^2}\Big) + d'^2\log\Big(1 + \frac{8\tilde B^2\sqrt{d'}}{\lambda\epsilon^2}\Big).$$

Lemma 24. Let $\mathcal{V}$ denote a class of functions mapping from $\mathcal{S}\times\mathcal{W}$ to $\mathbb{R}$ of the parametric form
$$V(\cdot,\cdot) = \min\Big( \max_{a\in\mathcal{A}} \big[\langle z, \psi(\cdot,a,\cdot)\rangle + 2L\beta\sqrt{\psi(\cdot,a,\cdot)^\top Y \psi(\cdot,a,\cdot)}\big]_+,\ H \Big),$$
where the parameters $\beta \in \mathbb{R}$, $z \in \mathbb{R}^{d'}$, and $Y \in \mathbb{R}^{d'\times d'}$ satisfy $0 \le \beta \le B$, $\|z\| \le \bar z$, and $\|Y\| \le \lambda^{-1}$. Assume $\|\psi(s,a,w)\| \le 1$ for all $(s,a,w) \in \mathcal{S}\times\mathcal{A}\times\mathcal{W}$. Then
$$\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d'^2\log\Big(1 + \frac{8B^2\sqrt{d'}}{\lambda\epsilon^2}\Big).$$

Let $\mathcal{C}_z$ be an $\epsilon/2$-cover of $\{z \in \mathbb{R}^{d'} : \|z\| \le \bar z\}$ with respect to the 2-norm, and $\mathcal{C}_Y$ an $\epsilon^2/4$-cover of $\{Y \in \mathbb{R}^{d'\times d'} : \|Y\|_F \le \frac{B^2\sqrt{d'}}{\lambda}\}$ with respect to the Frobenius norm. By Lemma 20, we know $|\mathcal{C}_z| \le (1 + 4\bar z/\epsilon)^{d'}$ and $|\mathcal{C}_Y| \le \big(1 + \frac{8B^2\sqrt{d'}}{\lambda\epsilon^2}\big)^{d'^2}$. According to (63), it holds that $N_\epsilon(\mathcal{V}) \le |\mathcal{C}_z||\mathcal{C}_Y|$, and therefore $\log N_\epsilon(\mathcal{V}) \le d'\log(1 + 4\bar z/\epsilon) + d'^2\log\big(1 + \frac{8B^2\sqrt{d'}}{\lambda\epsilon^2}\big)$.

In our experiments, we chose $\delta = 0.01$, $\lambda = 1$, $d = 5$, and $H = 5$. The parameters $\{\eta_h\}_{h\in[H]}$ are drawn from $N(0, I_{d'})$. In order to tune the parameters $\{\mu_h(\cdot)\}_{h\in[H]}$ and the feature mapping $\phi$ so that they are compatible with Assumption 1, we let the feature space $\{\phi(s,a) : (s,a) \in \mathcal{S}\times\mathcal{A}\}$ be a subset of the $d$-dimensional simplex $\{\phi \in \mathbb{R}^d : \sum_{i=1}^d \phi_i = 1,\ \phi_i \ge 0\ \forall i \in [d]\}$, and let $e_i^\top\mu_h(\cdot)$ be an arbitrary probability measure over $\mathcal{S}$ for all $i \in [d]$. The results shown in Figure 2a depict averages over 50 realizations of the main setup considered throughout the paper with $m = 5$, and the results shown in Figure 2b depict averages over 50 realizations of the more general setup of Remark 2 with $d' = 10$. For the results shown in Figure 2a, the mappings $\rho(w)$ are drawn from $N(0, I_m)$, except for the $n = m$ representative tasks $\{w^{(j)}\}_{j\in[m]}$ introduced in Assumption 3, for which we set $\rho(w^{(j)}) = e_j$ for $j \in [m]$. For the results shown in Figure 2b, the mappings $\psi(s,a,w)$ are drawn from $N(0, I_{d'})$, and we set $\psi(s,a,w^{(j)}) = e_j$ for $j \in [d']$, where $\{w^{(j)}\}_{j\in[d']}$ are the $n = d'$ representative tasks introduced in Assumption 5 in Appx. D. The parameters $\{\eta_h\}_{h\in[H]}$ are drawn from $N(0, I_{d'})$, where $d' = m \times d = 25$ in Figure 2a. In our experiments, the exact same settings are used for both UCBlvd and Lifelong-LSVI in Figures 2a and 2b. We chose fairly large $d$, $m$, and $d'$, and by checking during the online runs, we observed that the optimal value of the QCQP in (8) is always zero. Together, these suggest that the assumptions made in the paper approximately hold.
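The simplex construction described above guarantees that $P_h(s'|s,a) = \langle \phi(s,a), \mu_h(s')\rangle$ is a valid transition kernel: if $\phi(s,a)$ lies on the simplex and each $e_i^\top\mu_h(\cdot)$ is a probability measure, the resulting mixture sums to one. A minimal sketch of this construction (Dirichlet sampling is our choice of how to place features on the simplex, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_states, n_actions = 5, 8, 3

# phi(s, a): points on the d-dimensional simplex (Dirichlet samples).
phi = rng.dirichlet(np.ones(d), size=(n_states, n_actions))

# e_i^T mu_h(.) is a probability measure over S for each i in [d]:
# mu is a (d, n_states) row-stochastic matrix.
mu = rng.dirichlet(np.ones(n_states), size=d)

# P_h(s' | s, a) = <phi(s, a), mu(s')> is then a valid transition kernel.
P = phi @ mu                      # shape (n_states, n_actions, n_states)
assert np.allclose(P.sum(axis=-1), 1.0)
assert np.all(P >= 0)
```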
Figures 2a and 2b depict the average per-episode reward of UCBlvd, state its average number of planning calls, and compare both to those of the baseline algorithm Lifelong-LSVI, a direct extension of LSVI-UCB in Jin et al. (2020). The results highlight the value of UCBlvd in requiring far fewer planning calls. The plots verify that the performance of Lifelong-LSVI and UCBlvd is statistically almost identical, while UCBlvd uses far fewer planning calls (1000 vs. ~20).

[Figure panels: per-episode reward vs. episode $k$. (a) Setting of Theorem 2, $d = 5$, $m = 5$, $d' = 25$. (b) Setting of Remark 2, $d = 5$, $d' = 10$.]

Figure 2: UCBlvd vs Lifelong-LSVI

Algorithm 3: UCBlvd with Unknown Rewards
Set: $Q^k_{H+1}(\cdot,\cdot,\cdot) = 0$ for all $k \in [K]$; $\tilde k = 1$.
for episodes $k = 1, \ldots, K$ do
  Observe the initial state $s^k_1$ and the task context $w_k$.

ACKNOWLEDGEMENT

This work is partially supported by DARPA grant HR00112190130 and NSF grant 2221871. Sanae Amani is partially supported by an Amazon Science Hub fellowship.

