PROVABLY EFFICIENT MULTI-TASK REINFORCEMENT LEARNING IN LARGE STATE SPACES

Abstract

We study multi-task Reinforcement Learning, where shared knowledge among different environments is distilled to enable scalable generalization to a variety of problem instances. In the context of general function approximation, Markov Decision Processes (MDPs) with low Bilinear rank encapsulate a wide range of structural conditions that permit polynomial sample complexity in large state spaces, where the Bellman errors are related to bilinear forms of features with low intrinsic dimensions. To achieve multi-task learning in MDPs, we propose online representation learning algorithms that capture the shared features in the different task-specific bilinear forms. We show that in the presence of low-rank structures in the features of the bilinear forms, the algorithms benefit from sample complexity improvements compared to single-task learning. Therefore, we achieve the first sample efficient multi-task reinforcement learning algorithm with general function approximation.

1. INTRODUCTION

The ability to capture informative representations that generalize among multiple tasks has become significant in various machine learning applications Li et al. (2014); Tsiakas et al. (2016); Baevski et al. (2019); D'Eramo et al. (2019); Kubota et al. (2020); Liu et al. (2019b). In the context of multi-task learning Caruana (1997); Baxter (2000); Yu et al. (2005), this ability is highly desirable and becomes vital for learning with fewer samples than learning each single task individually. Representation learning Bengio et al. (2013) is a powerful approach for achieving such sample efficiency improvement. This paper considers representation learning in multi-task Reinforcement Learning, an important class of meta Reinforcement Learning (meta-RL) Wang et al. (2016); Finn et al. (2017); Ritter et al. (2018). Reinforcement learning (RL) is a sequential decision-making problem where an agent aims to learn the optimal decisions by interacting with an unknown environment Sutton & Barto (2018). Empowered by representation learning with deep neural networks LeCun et al. (2015); Goodfellow et al. (2016), RL has achieved tremendous success in various real-world applications, such as Go Silver et al. (2016), Atari Mnih et al. (2013), Dota2 Berner et al. (2019), Texas Hold'em poker Moravčík et al. (2017), and autonomous driving Shalev-Shwartz et al. (2016). Therefore, the benefit of using representation learning to extract a joint feature embedding from different but related tasks has emerged as an essential problem to investigate. Specifically, this paper studies the problem of learning multiple RL problems jointly with the help of representation learning. Although multi-task learning in online decision-making problems has received increasing research interest Lazaric & Ghavamzadeh (2010); Mutti et al. (2021); Maurer et al. (2016); Qin et al. (2021); Yang et al. (2021); Hu et al. (2021), most existing works focus on tabular or linear models.
Indeed, how general function approximations extrapolate across huge state spaces remains largely an open problem itself. Recently, Bilinear class Du et al. (2021) was proposed as a promising structural framework for generalization in reinforcement learning through the use of function approximation. Bilinear class postulates that the Bellman error can be related to a bilinear form depending on the hypothesis, and it captures nearly all existing function approximation models, e.g., Jin et al. (2020a); Zanette et al. (2020); Yang & Wang (2020); Jiang et al. (2017); Sun et al. (2019); Kakade et al. (2020); Agarwal et al. (2020). However, in the presence of shared information in the bilinear forms across multiple tasks, the Bilin-UCB algorithm proposed in Du et al. (2021) is not able to adapt to such knowledge, and challenges abound in adopting representation learning to find nearly-optimal policies with limited data. In this paper, we give the first sample efficient algorithm for multi-task RL with general function approximation through the use of representation learning in Bilinear class. In particular, to study representation learning, we propose Low-rank Multi-task Bilinear class, a structural framework that permits generalization both within and across tasks in multi-task RL. Specifically, such a model class specifies $M$ MDP instances, where $M > 0$ is a fixed integer and each instance belongs to the Bilinear class Du et al. (2021), i.e., the Bellman error admits a low-rank factorization in $\mathbb{R}^d$. Since our multi-task setting has $M$ MDP instances, there are $M$ different feature maps specified by the definition of Bilinear class, each corresponding to one MDP task and taking values in $\mathbb{R}^d$. We additionally assume that these $M$ feature maps, when packed together as a matrix-valued mapping in $\mathbb{R}^{d \times M}$, have rank $k \ll d$. In other words, in Low-rank Multi-task Bilinear class, the bilinear form of each task possesses a low-dimensional task-specific feature and a shared representation.
Under this setting, it is desirable that the RL algorithm utilize the intrinsic low-dimensional structure to achieve an improved sample efficiency compared to solving each task separately. To this end, under the online setting where the agent learns from its past experiences without knowing the model, we design a sample efficient algorithm that provably finds nearly-optimal policies for all tasks. Our algorithm is based on the principle of Optimism in the Face of Uncertainty (OFU), which constructs a confidence region that contains the true hypothesis based on the historical data across the $M$ tasks, and then updates the policy according to the most optimistic hypothesis within the confidence region. In particular, here the hypothesis can denote the true transition models or optimal value functions of these $M$ tasks. When constructing the confidence region, we explicitly utilize the low-dimensional structure by jointly learning the task-specific features and the shared representation via Empirical Risk Minimization (ERM) with multi-task data. Moreover, as for planning, we find the hypothesis in the confidence region that leads to the highest aggregated value in these $M$ tasks. In the analysis, we show a concentration result where the estimation noise can be embedded into a low-dimensional space, and thus prove that our algorithm is able to find nearly-optimal policies within limited samples. Concretely, compared to learning each task separately using Bilin-UCB Du et al. (2021), an algorithm designed for Bilinear class without utilizing the shared representation, our algorithm enjoys a $(d/k)$-fold improvement in the sample efficiency whenever the feature classes are small. To the best of our knowledge, our work proposes the first provably sample efficient multi-task RL algorithm with general function approximation.
Notations Let $\mathbb{R}^d$ denote the $d$-dimensional Euclidean space and $\mathbb{R}^{d \times k}$ denote the space of real $d$-by-$k$ matrices. The inner product of two vectors $x, y \in \mathbb{R}^d$ is denoted as $\langle x, y \rangle$. For sets $A_1, \ldots, A_n$, define $\otimes_{k \in [n]} A_k = A_1 \otimes \cdots \otimes A_n = \{(a_1, \ldots, a_n) : a_k \in A_k, k \in [n]\}$. Given scalars $a_1, \ldots, a_n$, let $a_{1:n}$ denote the vector $(a_1, \ldots, a_n)$. Also let $(a_\omega)_{\omega \in \Omega}$ denote the tuple consisting of $a_\omega$ where $\omega$ comes from a countable set $\Omega$. For variables $v_1, \ldots, v_k$, we denote by $v_{1:k}$ the $k$-tuple $(v_1, \ldots, v_k)$.
Roadmap In Section 2 we introduce the basic problem setup and notations. In Section 3 we introduce Low-rank Multi-task Bilinear class, a framework that captures shared information in bilinear class. Next, in Section 4, we present the main algorithm for learning Low-rank Multi-task Bilinear class models, empowered by representation learning and the optimism principle in decision making. We show the main theoretical result in Section 5 and give an overview of techniques in Section 6. A couple of examples are given in Appendix A. We conclude with discussions of further directions.

1.1. RELATED WORK

General function approximation in Reinforcement Learning Theoretical understanding of the sample complexity of RL with general function approximation remains relatively scarce. In recent years, there has been a surge of theoretical insights on linear and non-linear function approximation Jin et al. (2020b; c).
Multi-task learning and Meta learning Meta learning Thrun & Pratt (2012) and multi-task representation learning Caruana (1997) are important tools for capturing shared knowledge across tasks and achieving generalization to a new task. Theoretical analysis dates back to Baxter (2000); Maurer (2005); Yu et al. (2005). Maurer et al. (2016); Amit & Meir (2018); Konobeev et al. (2020) consider generalization errors averaged over the meta distribution under the assumption of a shared distribution for sampling the source tasks. Recently, Du et al. (2020); Tripuraneni et al. (2020b; a); Chua et al. (2021) consider the diversity of task distributions and establish risk bounds based on learning a shared representation across tasks and transferring it to new tasks.

2. PRELIMINARIES

We consider learning a set of $M$ problem instances $\mathcal{P} = \{\mathcal{M}_m = (\mathcal{S}_m, \mathcal{A}_m, \{P_{m,h}\}_{h=1}^H, r_m, s_{m,1})\}_{m \in [M]}$, where each $\mathcal{M}_m \in \mathcal{P}$ denotes an episodic Markov Decision Process (MDP) in which $\mathcal{S}_m$ denotes the state space, $\mathcal{A}_m$ denotes the action set, $H$ denotes the number of time steps in each episode, $\{P_{m,h}\}_{h=1}^H$ denotes the transition kernel, $r_m$ denotes the reward function (written $r_{m,h}$ at step $h$), and $s_{m,1}$ denotes the fixed initial state. We assume $r_m \in (0, 1)$ without loss of generality. For MDP $\mathcal{M}_m$, let $\mathbb{E}^m$ denote the expectation under $\{P_{m,h}\}_{h=1}^H$. A (deterministic) policy $\pi_m$ is a length-$H$ sequence of functions $\pi_m = \{\pi_{m,h} : \mathcal{S}_m \to \mathcal{A}_m\}_{h=1}^H$. To interact with $\mathcal{M}_m$, the agent starts at the fixed initial state $s_{m,1}$ and, at each time step $h \in [H]$, takes action $a_{m,h} \sim \pi_{m,h}$, receives reward $r_{m,h}(s_{m,h}, a_{m,h})$, and transits to $s_{m,h+1} \sim P_{m,h}(\cdot \mid s_{m,h}, a_{m,h})$. Let $\mathbb{E}^m_{\pi_m}$ denote the expectation under MDP $\mathcal{M}_m$ when following policy $\pi_m$. We use $o_{m,h} = (s_{m,h}, a_{m,h}, s_{m,h+1})$ to denote the history at the $h$-th time step in MDP $\mathcal{M}_m$.
Given a policy $\pi_m$, we define the value function $V^{\pi_m}_{m,h}(s)$ as the expected sum of rewards under policy $\pi_m$ starting from $s_{m,h} = s$ at time step $h$:
$$V^{\pi_m}_{m,h}(s) := \mathbb{E}^m_{\pi_m}\Big[\sum_{t=h}^{H} r_{m,t}(s_{m,t}, a_{m,t}) \,\Big|\, s_{m,h} = s\Big].$$
Similarly, we define the Q-function $Q^{\pi_m}_{m,h}(s, a)$ as the expected sum of rewards when taking action $a$ in state $s_{m,h} = s$ and then following $\pi_m$:
$$Q^{\pi_m}_{m,h}(s, a) := \mathbb{E}^m_{\pi_m}\Big[\sum_{t=h}^{H} r_{m,t}(s_{m,t}, a_{m,t}) \,\Big|\, s_{m,h} = s, a_{m,h} = a\Big].$$
The Bellman operator $T_{m,h}$ applied to a Q-function $Q : \mathcal{S}_m \times \mathcal{A}_m \to \mathbb{R}$ is defined via
$$T_{m,h}(Q)(s, a) := r_{m,h}(s, a) + \mathbb{E}_{s' \sim P_{m,h}(\cdot \mid s, a)}\big[\max_{a'} Q(s', a')\big].$$
There exists an optimal policy $\pi^*_m$ that gives the optimal value function for all states, i.e., $V^{\pi^*_m}_{m,h}(s) = \sup_{\pi} V^{\pi}_{m,h}(s)$ holds for all $h \in [H]$ and $s \in \mathcal{S}_m$. Therefore $Q^{\pi^*_m}_{m,h}$ satisfies the following Bellman optimality equations for all $s \in \mathcal{S}_m$, $a \in \mathcal{A}_m$ and $h \in [H]$:
$$Q^{\pi^*_m}_{m,h}(s, a) = T_{m,h}\big(Q^{\pi^*_m}_{m,h+1}\big)(s, a).$$
The agent aims at using as few samples as possible to find a set of policies $\{\pi_m\}_{m=1}^M$ such that
$$\sum_{m=1}^{M} \Big(V^{\pi^*_m}_{m,1}(s_{m,1}) - V^{\pi_m}_{m,1}(s_{m,1})\Big) \le \epsilon$$
holds with probability at least $1 - \delta$.
In the following, we define the filtration $\mathcal{H}_t$ to be the $\sigma$-field induced by all the random variables up to round $t$.
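To make the Bellman operator and the Bellman optimality equations above concrete, the following is a minimal sketch of backward induction on a single tabular episodic MDP. The two-state MDP, its rewards, and its transitions are hypothetical and chosen only for illustration; they do not come from the paper, which targets large state spaces where such enumeration is infeasible.

```python
# Illustrative sketch only: a hypothetical 2-state, 2-action episodic MDP.
H = 3
states = [0, 1]
actions = [0, 1]


def reward(h, s, a):
    # Reward in [0, 1]: only action 1 taken in state 1 pays off.
    return 1.0 if (s == 1 and a == 1) else 0.0


def transition(h, s, a):
    # Returns the next-state distribution as {next_state: probability}.
    if a == 1:
        return {1: 0.9, 0: 0.1}
    return {s: 1.0}


def bellman_backup(h, q_next):
    # (T_h Q)(s, a) = r_h(s, a) + E_{s' ~ P_h(.|s, a)}[max_{a'} Q(s', a')]
    q = {}
    for s in states:
        for a in actions:
            expected_next = sum(
                p * max(q_next[(s2, a2)] for a2 in actions)
                for s2, p in transition(h, s, a).items()
            )
            q[(s, a)] = reward(h, s, a) + expected_next
    return q


def optimal_q_tables():
    # Backward induction: Q_{H+1} = 0 and Q_h = T_h(Q_{h+1}) for h = H, ..., 1.
    q = {(s, a): 0.0 for s in states for a in actions}
    tables = {}
    for h in range(H, 0, -1):
        q = bellman_backup(h, q)
        tables[h] = q
    return tables


q_star = optimal_q_tables()
v_star_initial = max(q_star[1][(0, a)] for a in actions)  # V*_1 at state 0
```

The optimal value at the initial state is then obtained by maximizing the step-1 Q-table over actions, which mirrors the relation $V = \max_a Q$ used for the hypotheses in the next section.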

3. MULTI-TASK BILINEAR CLASS

In general function approximation, a hypothesis class $\mathcal{G}$ is applied to address the set $\mathcal{P}$ of $M$ problem instances. Here, each hypothesis $g \in \mathcal{G}$ is a function approximator that captures both the task-specific information of all $M$ problem instances and the shared knowledge across tasks. For example, in model-based RL, $g$ may denote the transition models of the $M$ tasks, that is, $\{P_{m,h}\}_{m=1,h=1}^{M,H}$, whereas in model-free RL, $g$ can be the optimal value functions of the $M$ tasks, i.e., $\{V^{\pi^*_m}_{m,h}\}_{m=1,h=1}^{M,H}$. Using the notion of hypothesis $g \in \mathcal{G}$, we aim to cover a large class of MDP models studied under the function approximation setting. We will introduce concrete examples in Appendix A.

Under our setting, each hypothesis $g \in \mathcal{G}$ is associated with Q-functions $\{Q_{m,h,g}(\cdot, \cdot)\}_{m \in [M], h \in [H]}$ and value functions $\{V_{m,h,g}(\cdot)\}_{m \in [M], h \in [H]}$ in the $M$ tasks such that $V_{m,h,g}(\cdot) = \max_a Q_{m,h,g}(\cdot, a)$ holds for all $m \in [M]$ and $h \in [H]$. Bilinear class Du et al. (2021) is a general framework that allows generalization in RL for a wide range of function approximators. In the following, we develop the low-rank multi-task bilinear framework so that it permits generalization in meta-RL across different tasks. The key intuition behind this framework is that it captures the common representation via low-rank structures in the features of task-specific bilinear forms.
Definition 3.1 (Low-rank Multi-task Bilinear class). We say the tuple $(\mathcal{G}, \{\mathcal{M}_m = (\mathcal{S}_m, \mathcal{A}_m, P_m, r_m, s_{m,1})\}_{m \in [M]}, l_{m,h,f})$ is a multi-task Bilinear class with rank $k$ if there exists $g^* \in \mathcal{G}$ such that $Q_{m,h,g^*} = Q^{\pi^*_m}_{m,h}$ and $V_{m,h,g^*} = V^{\pi^*_m}_{m,h}$ hold for all $m \in [M]$, $h \in [H]$, and there exist functions $W^*_{m,h} : \mathcal{G} \to \mathbb{R}^d$, $v^*_{m,h} : \mathcal{G} \to \mathbb{R}^k$, $B^*_h : \mathcal{G} \to \mathbb{R}^{d \times k}$, and $X^*_{m,h} : \mathcal{G} \to \mathbb{R}^d$ (with $d \gg k$) such that for each $g \in \mathcal{G}$, $m \in [M]$ and $h \in [H]$:
1. The features $W^*_{m,h}$ possess low-rank structures:
$$W^*_{m,h}(g) = B^*_h(g)\, v^*_{m,h}(g). \tag{1}$$
2. We can upper bound the expected Bellman error as follows:
$$\mathbb{E}^m_{a_{m,1:h} \sim \pi_{m,g,1:h}}\big[Q_{m,h,g}(s_{m,h}, a_{m,h}) - r_{m,h}(s_{m,h}, a_{m,h}) - V_{m,h,f}(s_{m,h+1})\big] \le \big|\langle W^*_{m,h}(g) - W^*_{m,h}(g^*), X^*_{m,h}(g) \rangle\big|. \tag{2}$$
Here $\pi_{m,g,h}(s) = \arg\max_a Q_{m,h,g}(s, a)$ is the optimal policy in MDP $\mathcal{M}_m$ under hypothesis $g$.
3. For any $f \in \mathcal{G}$ there exist a policy $\pi_{est,m}(f)$ and a discrepancy measure $l_{m,h,f}(o_{m,h}, g)$ that can be used for estimating $\langle B^*_h(g) v^*_{m,h}(g) - B^*_h(g^*) v^*_{m,h}(g^*), X^*_{m,h}(f) \rangle$ for any $g \in \mathcal{G}$, such that:
$$\mathbb{E}^m_{a_{m,1:h-1} \sim \pi_{m,f},\, a_{m,h} \sim \pi_{est,m}(f)}\big[l_{m,h,f}(o_{m,h}, g)\big] = \langle W^*_{m,h}(g) - W^*_{m,h}(g^*), X^*_{m,h}(f) \rangle, \tag{3}$$
where $o_{m,h} = (s_{m,h}, a_{m,h}, s_{m,h+1})$.
Remark 3.2. A few remarks are in order. First, when $d = k$ and $M = 1$, we recover the Bilinear class introduced in Du et al. (2021). Thus, our model can be viewed as a multi-task extension of Bilinear class with a shared low-dimensional representation. Second, here the hypothesis class $\mathcal{G}$ can be either a model-based or a value-based function approximation, as in either case the Q-functions $Q_{m,h,g}$ and the value functions $V_{m,h,g}$ are well defined. Third, the common representation $B^*_h(\cdot)$ captures the knowledge shared in all problem instances and enables multi-task learning with fewer samples. Finally, the discrepancy measure $l_{m,h,f}(o_{m,h}, \cdot)$ can be computed for all hypotheses $g$, an important property that facilitates data reuse as explained below. Examples of Low-rank Multi-task Bilinear class can be found below and in Appendix A.
To understand this framework, we first notice that the expected Bellman errors of the form
$$\mathbb{E}^m_{a_{m,1:h} \sim \pi_{m,g,1:h}}\big[Q_{m,h,g}(s_{m,h}, a_{m,h}) - r_{m,h}(s_{m,h}, a_{m,h}) - V_{m,h,f}(s_{m,h+1})\big]$$
serve as upper bounds for the sub-optimality of policies $\pi_{m,g}$. Thus Eq. (2) indicates that the sub-optimality of policies $\pi_{m,g}$ can be controlled if the feature $W^*_{m,h}(g)$ is close to the feature $W^*_{m,h}(g^*)$ corresponding to the best-in-class function approximator $g^*$. Eq. (3) further establishes the connection between the error $W^*_{m,h}(g) - W^*_{m,h}(g^*)$ and the discrepancy measure $l_{m,h,f}(o_{m,h}, g)$.
One important observation is information sharing among the function class through the feature $X^*_{m,h}(f)$: given samples from a single function approximator $f$, the quantity $\mathbb{E}^m_{a_{m,1:h-1} \sim \pi_{m,f},\, a_{m,h} \sim \pi_{est,m}(f)}[l_{m,h,f}(o_{m,h}, g)]$ can be estimated for all $g \in \mathcal{G}$. This allows data reuse for each on-policy sample in our algorithm. Eq. (1) postulates that the features $W^*_{m,h}(g)$ share a low-dimensional structure. Note that this structure is not fixed across function approximators $g$; in fact, the common representation $B^*_h(g)$ may also be a function of $g$. This allows for much generality in multi-task RL models. We will show that the low-rank multi-task bilinear class covers many existing function approximation models in the multi-task setting. Compared with the many multi-task/meta RL algorithms that use a single representation to solve multiple tasks, our function approximators allow for more generality and therefore handle more structural conditions of multi-task RL, such as those studied in Yang et al. (2021) and Hu et al. (2021). Moreover, our algorithm reduces to using a single representation for the feature $W$ in many special cases, for example when the feature $v^*_{m,h}$ is trivial. In this sense, our algorithm can be seen as encompassing the method of using a single feature for multiple tasks.
Now we give an example of the low-rank multi-task bilinear framework. This example shows that the proposed framework captures the linear mixture model Modi et al. (2020) in multi-task RL where the mixing coefficients of different tasks lie in the same low-dimensional space. First, we recall the definition of the Linear Mixture Model.
Definition 3.3 (Linear Mixture Model). An MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \{P_h\}_{h=1}^H, r, s_1)$ is a linear mixture model if there exist known feature maps $\phi : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d$ and $\psi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and unknown $\theta^*_h \in \mathbb{R}^d$, $h \in [H]$, such that
$$P_h(s, a, s') = \langle \theta^*_h, \phi(s, a, s') \rangle, \qquad r_h(s, a) = \langle \theta^*_h, \psi(s, a) \rangle.$$
It is known that the linear mixture model is a Bilinear class with hypothesis $g = (\theta_1, \ldots, \theta_H)$ and
$$X_h(g) = \mathbb{E}_{\pi_g}\Big[\psi(s_h, a_h) + \sum_{s' \in \mathcal{S}} \phi(s_h, a_h, s')\, V_{h+1,g}(s')\Big], \qquad W_h(g) = \theta_h.$$
The discrepancy measure can be chosen as
$$l_f(o_h, g) = \Big\langle \theta_h, \psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s')\, V_{h+1,f}(s') \Big\rangle - \big(V_{h+1,f}(s_{h+1}) + r_h\big)$$
and the estimation policies can be chosen as $\pi_{est}(f) = \pi_f$.
We consider learning the special case of Low-rank Multi-task Bilinear class where each MDP $\mathcal{M}_m$ is a linear mixture model with $P_{m,h}(s, a, s') = \langle \theta^*_{m,h}, \phi_m(s, a, s') \rangle$, $r_{m,h}(s, a) = \langle \theta^*_{m,h}, \psi_m(s, a) \rangle$, and there exist $\nu^*_{m,h} \in \mathbb{R}^k$, $B^*_h \in \mathbb{R}^{d \times k}$ such that $\theta^*_{m,h} = B^*_h \nu^*_{m,h}$ for all $h \in [H]$, $m \in [M]$. Then we let $g = (\nu_{m,h}, B_h)_{h \in [H], m \in [M]}$ and use the (fixed) features $v_{m,h}(g) = \nu_{m,h}$, $B_h(g) = B_h$. Notice that each $(\nu_{m,h}, B_h)_{h \in [H], m \in [M]}$ defines the expectation $\mathbb{E}^m_{\pi_{m,g}}$ (via $P_{m,h}(s, a, s') = \langle \theta_{m,h}, \phi_m(s, a, s') \rangle$ with $\theta_{m,h} = B_h \nu_{m,h}$). Thus we can set the feature class $\mathcal{X}$ induced by $\mathcal{G}$, in which the feature $X_{m,h}(g)$ can be computed for each $g \in \mathcal{G}$ as follows:
$$X_{m,h}(g) = \mathbb{E}^m_{\pi_{m,g}}\Big[\psi_m(s_{m,h}, a_{m,h}) + \sum_{s' \in \mathcal{S}_m} \phi_m(s_{m,h}, a_{m,h}, s')\, V_{m,h+1,g}(s')\Big].$$
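The low-rank linear mixture structure above can be illustrated numerically. The sketch below is a toy instance under stated assumptions: the sizes ($d = 4$, $k = 2$, $M = 6$), the matrix `B`, the task coefficients `nus`, and the base-model table `phi` are all hypothetical values picked so that each $\theta_m = B \nu_m$ is a valid mixture weight. It builds the task parameters from a shared representation, evaluates $P(s' \mid s, a) = \langle \theta_m, \phi(s, a, s') \rangle$ for one fixed $(s, a)$, and compares the number of free parameters with and without sharing.

```python
# Illustrative sketch only: all numbers below are hypothetical.
# Shared representation B (d x k); each column sums to 1.
B = [
    [0.4, 0.1],
    [0.3, 0.2],
    [0.2, 0.3],
    [0.1, 0.4],
]

# Task-specific low-dimensional coefficients nu_m in R^k (each sums to 1).
nus = [[w, 1.0 - w] for w in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)]


def matvec(mat, vec):
    # Plain matrix-vector product, kept dependency-free.
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in mat]


# theta_m = B nu_m lies in R^d but only varies within a k-dimensional subspace.
thetas = [matvec(B, nu) for nu in nus]

# phi(s, a, .) for one fixed (s, a): row i is the next-state distribution
# of the i-th base model over two next states.
phi = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.0, 1.0],
]


def next_state_distribution(theta):
    # P(s' | s, a) = <theta, phi(s, a, s')> for each next state s'.
    return [
        sum(theta[i] * phi[i][s2] for i in range(len(theta)))
        for s2 in range(len(phi[0]))
    ]


dists = [next_state_distribution(theta) for theta in thetas]

d, k, M = len(B), len(B[0]), len(nus)
params_separate = M * d        # learning every theta_m independently
params_shared = d * k + M * k  # learning B once plus one nu_m per task
```

With these toy sizes the shared parameterization needs 20 numbers instead of 24; the gap widens as $d$ and $M$ grow relative to $k$, which is the regime the framework is aimed at.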

Algorithm 1 Representation Learning in Low-rank Multi-task Bilinear class

Input: $\mathcal{V}, \mathcal{B}, \mathcal{X}$
for $t \leftarrow 1, \ldots, T$ do
    Find $g^{(t)}$ as the solution of the optimization problem in Eq. (4)
    For all $h \in [H]$, $m \in [M]$, sample $n_0$ times from $a_{m,1:h-1} \sim \pi_{m,g^{(t)}}$, $a_{m,h} \sim \pi_{m,est}$ and create datasets $D^{(t)}_{m,h}$
end for
Let $t_0 \leftarrow \arg\max_{t \in [T]} \sum_{m=1}^{M} V^{\pi_{m,g^{(t)}}}_{m,1}(s_{m,1})$
Return $\pi_{g^{(t_0)}}$

4. REPRESENTATION LEARNING IN LOW-RANK MULTI-TASK BILINEAR CLASS

In this section we present the main algorithm to learn Low-rank Multi-task Bilinear class. First, we define the feature classes used to capture $v^*_{m,h}$, $B^*_h$, $X^*_{m,h}$ for $m \in [M]$, $h \in [H]$.
Definition 4.1 (Function approximations). We define the feature class $\mathcal{V} = \mathcal{V}_1 \otimes \cdots \otimes \mathcal{V}_H$ where $\mathcal{V}_h \subset \{v : \mathcal{G} \to \mathbb{R}^k\}^{\otimes M}$ for all $h \in [H]$, the representation class $\mathcal{B} = \mathcal{B}_1 \otimes \cdots \otimes \mathcal{B}_H$ where $\mathcal{B}_h \subset \{B : \mathcal{G} \to \mathbb{R}^{d \times k}\}$, and the feature class $\mathcal{X} = \mathcal{X}_1 \otimes \cdots \otimes \mathcal{X}_H$ where $\mathcal{X}_h \subset \{X : \mathcal{G} \to \mathbb{R}^d\}^{\otimes M}$ for all $h \in [H]$.
Here we assume that the function classes $\mathcal{V}$, $\mathcal{B}$, and $\mathcal{X}$ completely capture the mappings specified in the multi-task Bilinear class. Therefore, given expressive $\mathcal{V}, \mathcal{B}, \mathcal{X}$ as inputs, we can learn the representations $v_{1:M,1:H} \in \mathcal{V}$, $B_{1:H} \in \mathcal{B}$, $X_{1:M,1:H} \in \mathcal{X}$ to approximate $v^*_{1:M,1:H}$, $B^*_{1:H}$, $X^*_{1:M,1:H}$ by minimizing a proper loss function. Based on this assumption, we propose an algorithm based on the OFU principle while leveraging representation learning to improve sample efficiency. The procedure is shown in Algorithm 1. In particular, in the $t$-th iteration, an (optimistic) hypothesis $g^{(t)} \in \mathcal{G}$ is computed using the Upper Confidence Bound (UCB) principle by finding a hypothesis that achieves the highest total value in these $M$ tasks. Specifically, consider the following constrained optimization problem:
$$g^{(t)} = \arg\max_{g \in \mathcal{G}^{(t)}} \sum_{m=1}^{M} V_{m,1,g}(s_{m,1}), \tag{4}$$
which maximizes the sum of candidate value functions $V_{m,1,g}(s_{m,1})$ subject to the constraint that the hypothesis $g$ belongs to the confidence set $\mathcal{G}^{(t)}$. As we will show in the proof, $\mathcal{G}^{(t)}$ contains the true hypothesis $g^*$ for all $t \in [T]$ with high probability. Thus, by solving (4), the sum of value functions of $g^{(t)}$ serves as an upper bound of that of $g^*$. The key issue now is how the confidence set is chosen. In Bilin-UCB, the confidence set is chosen to contain all hypotheses that have small average discrepancy measures across the available batch data.
Since the discrepancy measures serve as unbiased estimates for the bilinear forms which upper bound the Bellman error, this confidence set essentially finds all hypotheses with low Bellman error. However, this approach fails to exploit the shared information among tasks. Instead, we make use of the feature classes $\mathcal{V}, \mathcal{B}, \mathcal{X}$ and learn the common representation $\{B_h\}_{h \in [H]}$ by Empirical Risk Minimization (ERM). For each hypothesis $g \in \mathcal{G}$ and $h \in [H]$, let $(v^{(g)}_{1:M,h}, B^{(g)}_h, X^{(g)}_{1:M,h}, \tilde{g}^{(g)})$ be the solution of the following optimization problem:
$$\big(v^{(g)}_{1:M,h}, B^{(g)}_h, X^{(g)}_{1:M,h}, \tilde{g}^{(g)}\big) = \arg\min_{v_{1:M,h} \in \mathcal{V}_h,\, B_h \in \mathcal{B}_h,\, X_{1:M,h} \in \mathcal{X}_h,\, \tilde{g} \in \mathcal{G}} \sum_{\tau=1}^{t-1} \sum_{m=1}^{M} \Big(\mathbb{E}_{(s,a,s') \sim D^{(\tau)}_{m,h}}\big[l_{m,h,g^{(\tau)}}(s, a, s', g)\big] - \big\langle B_h(g) v_{m,h}(g) - B_h(\tilde{g}) v_{m,h}(\tilde{g}), X_{m,h}(g^{(\tau)}) \big\rangle\Big)^2.$$
Notice that $\tilde{g}^{(g)}$ depends on $g$; in the rest of this paper we write $\tilde{g}$ for simplicity of notation. With the features $(v^{(g)}_{1:M,h}, B^{(g)}_h, X^{(g)}_{1:M,h})$, the confidence set $\mathcal{G}^{(t)}$ is then given as the collection of all hypotheses $g$ such that the sum of squares of the bilinear forms across the available batch data does not exceed a threshold set by a pre-defined parameter $R^2$:
$$\mathcal{G}^{(t)} = \Big\{g \in \mathcal{G} : \sum_{\tau=1}^{t-1} \sum_{m=1}^{M} \big\langle B^{(g)}_h(g) v^{(g)}_{m,h}(g) - B^{(g)}_h(\tilde{g}) v^{(g)}_{m,h}(\tilde{g}), X^{(g)}_{m,h}(g^{(\tau)}) \big\rangle^2 \le 4R^2\Big\}. \tag{5}$$
By the definition of $(v^{(g)}_{1:M,h}, B^{(g)}_h, X^{(g)}_{1:M,h}, \tilde{g})$, the bilinear form $\langle B^{(g)}_h(g) v^{(g)}_{m,h}(g) - B^{(g)}_h(\tilde{g}) v^{(g)}_{m,h}(\tilde{g}), X^{(g)}_{m,h}(g^{(\tau)}) \rangle$ approximates $\langle W^*_{m,h}(g) - W^*_{m,h}(g^*), X^*_{m,h}(g^{(\tau)}) \rangle$ for all $g \in \mathcal{G}$. Therefore the confidence set $\mathcal{G}^{(t)}$ contains all the hypotheses for which the sum of $\langle W^*_{m,h}(g) - W^*_{m,h}(g^*), X^*_{m,h}(g^{(\tau)}) \rangle^2$ over all history data and tasks is upper bounded by a constant multiple of $R^2$. We will show that this can be used to quantify the Bellman error of the candidate action-value functions under the corresponding greedy policy, which can then be related to the sub-optimality of the corresponding greedy policy in the true environment.
The width parameter $R^2$ can be enlarged to handle the cases where the realizability assumption is not strictly satisfied. Since ERM is robust to misspecification, optimism is still guaranteed and the analysis follows similarly. The resulting algorithm would thus have sub-optimality depending additively on the misspecification error. This confidence set captures the shared information $B_h(\cdot)$ across tasks and contains all hypotheses with low Bellman errors within a smaller $R^2$. Algorithm 1 then samples trajectories according to the greedy policies of the chosen hypothesis $g^{(t)}$ and the estimation policy $\pi_{est}$, and augments the available batch data with $D^{(t)}_{m,h}$, $h \in [H]$, $m \in [M]$. We note that the main computational workload of Algorithm 1 is the ERM step in Line 3. This means that Algorithm 1 is oracle-efficient given access to an ERM oracle. In general, Line 3 pertains to a difficult optimization problem over the set of candidate function approximators $\mathcal{G}^{(t)}$. Although no efficient algorithms are currently known to solve this problem, we note that for certain instances of the bilinear class (e.g., Yang et al. (2021)) where $\mathcal{G}$ is parameterized by variables in a Euclidean space, computationally efficient gradient-based algorithms may exist to solve Line 3.
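The interplay between the confidence set in Eq. (5) and the optimistic choice in Eq. (4) can be sketched for a finite hypothesis class. The code below is a simplified illustration, not the paper's implementation: it assumes the ERM step has already produced, for each candidate, the estimated bilinear-form values over past rounds and tasks, and the `Hypothesis` record, the candidate names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    # Hypothetical container; the field names are not from the paper.
    name: str
    total_value: float               # sum over tasks of V_{m,1,g}(s_{m,1})
    bilinear_estimates: List[float]  # ERM estimates, one per (round, task) pair


def in_confidence_set(g: Hypothesis, r_squared: float) -> bool:
    # Membership test in the spirit of Eq. (5): sum of squares at most 4 R^2.
    return sum(x * x for x in g.bilinear_estimates) <= 4.0 * r_squared


def optimistic_hypothesis(candidates: List[Hypothesis], r_squared: float) -> Hypothesis:
    # Eq. (4): maximize the claimed total value over the confidence set.
    feasible = [g for g in candidates if in_confidence_set(g, r_squared)]
    if not feasible:
        raise ValueError("confidence set is empty; consider enlarging R^2")
    return max(feasible, key=lambda g: g.total_value)


candidates = [
    Hypothesis("g_a", 2.5, [0.05, -0.10, 0.02]),
    Hypothesis("g_b", 3.0, [0.90, 0.80, -0.70]),  # most optimistic, but contradicted by data
    Hypothesis("g_c", 2.8, [0.10, 0.10, -0.05]),
]
chosen = optimistic_hypothesis(candidates, r_squared=0.01)
```

In this toy run the candidate with the largest claimed value is excluded because its estimated bilinear forms are large, so the selection falls to the most optimistic hypothesis that remains consistent with the data.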

5. MAIN THEORY

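This section presents the theoretical analysis of Algorithm 1. Without loss of generality, we assume that the feature classes are all bounded.
Definition 5.1. Assume $\|v_{m,h}(g)\|_2, \|B_h(g)\|_F \le C_W$ and $\|X_{m,h}(g)\|_2 \le C_X$ hold for all $v_{m,h} \in \mathcal{V}_h$, $B_h \in \mathcal{B}_h$, $X_{m,h} \in \mathcal{X}_h$ and $g \in \mathcal{G}$, $h \in [H]$, $m \in [M]$. Here $C_W, C_X \in \mathbb{R}$.
Next, it is important to consider the expressiveness of the function classes $\mathcal{V}, \mathcal{B}, \mathcal{X}$. The following assumption is common in the theory of reinforcement learning Jin et al. (2020b); Wang et al. (2020b); Jin et al. (2021b); Du et al. (2021).
Assumption 5.2 (Realizability). Assume $g^* \in \mathcal{G}$ and $v^*_{1:M,1:H} \in \mathcal{V}$, $B^*_{1:H} \in \mathcal{B}$, $X^*_{1:M,1:H} \in \mathcal{X}$.
Now we present the theory for Algorithm 1.
Theorem 5.3. Set $T = 8HMd \log\big(1 + \frac{M C_X^2}{\lambda}\big)$ and
$$R^2 = \frac{8H^3\big(Mk \log(M n_0 C_W C_X) + \log\big(\frac{HT|\mathcal{G}||\mathcal{V}||\mathcal{B}||\mathcal{X}|}{\delta}\big)\big)}{n_0}.$$
With probability $1 - \delta$ the algorithm outputs a set of policies $\pi_{g^{(t_0)}}$ such that
$$\sum_{m=1}^{M} V^{\pi^*_m}_{m,1}(s_{m,1}) - \sum_{m=1}^{M} V^{\pi_{m,g^{(t_0)}}}_{m,1}(s_{m,1}) \le O(HR).$$
The total number of trajectories used in Algorithm 1 is upper bounded by $O(MHTn_0)$. Therefore, with probability at least $1 - \delta$, Algorithm 1 is able to use
$$O\Big(\frac{H^6 M^2 d\,\big(Mk + \log(|\mathcal{X}||\mathcal{V}||\mathcal{B}||\mathcal{G}|/\delta)\big)}{\epsilon^2}\Big)$$
trajectories to find a set of policies $\{\pi_m\}_{m \in [M]}$ such that
$$\sum_{m=1}^{M} V^{\pi^*_m}_{m,1}(s_{m,1}) - \sum_{m=1}^{M} V^{\pi_m}_{m,1}(s_{m,1}) \le \epsilon. \tag{6}$$
Notice that if we use Bilin-UCB to learn each single task individually, then the total number of trajectories to achieve the above guarantee is $\frac{H^6 M^3 d^2 \log(|\mathcal{G}|/\delta)}{\epsilon^2}$. Indeed, to achieve Eq. (6), the average sub-optimality of each task should be $O(\epsilon/M)$. Using Bilin-UCB, achieving $V^{\pi^*_m}_{m,1}(s_{m,1}) - V^{\pi_m}_{m,1}(s_{m,1}) \le \epsilon/M$ costs $O(H^6 M^2 d^2 \log(|\mathcal{G}|/\delta)/\epsilon^2)$ samples for each task. Thus the total number of trajectories is $\frac{H^6 M^3 d^2 \log(|\mathcal{G}|/\delta)}{\epsilon^2}$ via learning each task individually. Therefore, Theorem 5.3 improves the sample complexity of learning Low-rank Multi-task Bilinear class given small sizes of expressive feature classes $\mathcal{V}, \mathcal{B}, \mathcal{X}$, for example, $\log(|\mathcal{X}||\mathcal{V}||\mathcal{B}|) \le Md$. In general, the dependence on $d$ cannot be reduced since we estimate a $d$-by-$k$ matrix for the shared representation. Furthermore, we believe that without further assumption, a polynomial improvement w.r.t. $M$ is impossible because the learner has to learn the task-specific features $v^*_{m,h} \in \mathbb{R}^k$ for all $m \in [M]$.
For an important special case known as the Linear Mixture Model (Definition 3.3), we have the following result via plugging $|\mathcal{V}| = |\mathcal{B}| = 1$, $|\mathcal{X}| = |\mathcal{G}|$ into Theorem 5.3.
Corollary 5.4. Consider Low-rank Multi-task Bilinear class where each MDP $\mathcal{M}_m$ is a linear mixture model with $P_{m,h}(s, a, s') = \langle \theta^*_{m,h}, \phi_m(s, a, s') \rangle$, $r_{m,h}(s, a) = \langle \theta^*_{m,h}, \psi_m(s, a) \rangle$, and there exist $\nu^*_{m,h} \in \mathbb{R}^k$, $B^*_h \in \mathbb{R}^{d \times k}$ such that $\theta^*_{m,h} = B^*_h \nu^*_{m,h}$ for all $h \in [H]$, $m \in [M]$. Let $\mathcal{X} = \mathcal{G} = \mathcal{N}(\mathbb{R}^k, \epsilon)^{\otimes MH} \otimes \mathcal{N}(\mathbb{R}^{d \times k}, \epsilon)$, where $\mathcal{N}(\mathbb{R}^k, \epsilon)$ denotes the $\epsilon$-covering of $\mathbb{R}^k$ and $\mathcal{N}(\mathbb{R}^{d \times k}, \epsilon)$ denotes the $\epsilon$-covering of $\mathbb{R}^{d \times k}$. Under Assumption 5.2, there exists an algorithm that with probability at least $1 - \delta$ finds a set of policies $\{\pi_m\}_{m \in [M]}$ such that $\sum_{m=1}^{M} V^{\pi^*_m}_{m,1}(s_{m,1}) - \sum_{m=1}^{M} V^{\pi_m}_{m,1}(s_{m,1}) \le \epsilon$ using $O\big(H^6 M^2 d (Mk + kd) \log(1/\delta) / \epsilon^2\big)$ trajectories.
Remark 5.5. Using Bilin-UCB to learn each task individually, it takes $O\big(H^6 M^3 d^3 \log(1/\delta) \cdot \epsilon^{-2}\big)$ trajectories to learn a set of policies $\{\pi_m\}_{m \in [M]}$ such that $\sum_{m=1}^{M} V^{\pi^*_m}_{m,1}(s_{m,1}) - \sum_{m=1}^{M} V^{\pi_m}_{m,1}(s_{m,1}) \le \epsilon$. Therefore, Algorithm 1 achieves a sample complexity improvement compared to single-task learning. Without further assumption, a polynomial improvement w.r.t. $M$ appears impossible because the learner has to learn the task-specific features $v_{m,h} \in \mathbb{R}^k$ for all $m \in [M]$ with an $\Omega(M^3 k)$ sample complexity in total. The main benefit of multi-task learning in this case is to reduce the dependence on the ambient dimension $d$ to $k$.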
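The comparison between the multi-task bound of Theorem 5.3 and the single-task Bilin-UCB bound can be checked numerically. The sketch below simply evaluates the two leading-order expressions with all constants dropped; the function names and the example problem sizes are hypothetical, and for simplicity the same value is used for both logarithmic terms, which need not hold in general.

```python
# Illustrative sketch only: leading-order terms, constants dropped.

def multitask_trajectories(H, M, d, k, log_class_size, eps):
    # H^6 M^2 d (M k + log(|X||V||B||G| / delta)) / eps^2
    return H**6 * M**2 * d * (M * k + log_class_size) / eps**2


def single_task_trajectories(H, M, d, log_g_size, eps):
    # H^6 M^3 d^2 log(|G| / delta) / eps^2, summed over the M tasks
    return H**6 * M**3 * d**2 * log_g_size / eps**2


# Hypothetical problem sizes chosen for illustration.
H, M, d, k, eps = 5, 10, 100, 5, 0.1
log_term = 20.0  # stands in for both logarithmic terms in this toy comparison

multi = multitask_trajectories(H, M, d, k, log_term, eps)
single = single_task_trajectories(H, M, d, log_term, eps)
ratio = single / multi  # equals M * d * log_term / (M * k + log_term)
```

For these sizes the ratio is roughly 286, and when $Mk$ dominates the logarithmic term it behaves like $(d/k)$ times that logarithmic factor, in line with the $(d/k)$-fold improvement discussed above.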

6. TECHNIQUE OVERVIEW

This section gives an overview of the analysis and the main techniques. Owing to the optimism principle and the construction of the confidence set via representation learning, the proof of Theorem 5.3 crucially depends on the following three observations.
Risk bounds for representation learning. Since the discrepancy measures serve as unbiased estimates of bilinear forms, i.e.,
$$\mathbb{E}^m_{a_{m,1:h-1} \sim \pi_{m,f},\, a_{m,h} \sim \pi_{est,m}(f)}\big[l_{m,h,f}(o_{m,h}, g)\big] = \langle W^*_{m,h}(g) - W^*_{m,h}(g^*), X^*_{m,h}(f) \rangle,$$
the solutions $(v^{(g)}_{1:M,h}, B^{(g)}_h, X^{(g)}_{1:M,h}, \tilde{g}^{(g)})$ of ERM concentrate around the population mean, i.e.,
$$\big\langle B^*_h(g) v^*_{m,h}(g) - B^*_h(g^*) v^*_{m,h}(g^*), X^*_{m,h}(g^{(\tau)}) \big\rangle \approx \big\langle B^{(g)}_h(g) v^{(g)}_{m,h}(g) - B^{(g)}_h(\tilde{g}) v^{(g)}_{m,h}(\tilde{g}), X^{(g)}_{m,h}(g^{(\tau)}) \big\rangle.$$
This means that Algorithm 1 has approximately captured the correct bilinear forms. Thus $\mathcal{G}^{(t)}$ essentially finds all hypotheses $g$ such that $\sum_{\tau} \big(\langle B^*_h(g) v^*_{m,h}(g) - B^*_h(g^*) v^*_{m,h}(g^*), X^*_{m,h}(g^{(\tau)}) \rangle\big)^2$ is small.
Regret decomposition associated to the bilinear forms. One key property of Bilinear class is the following upper bound on the Bellman error:
$$\mathbb{E}^m_{a_{m,1:h} \sim \pi_{m,g,1:h}}\big[Q_{m,h,g}(s_{m,h}, a_{m,h}) - r_{m,h}(s_{m,h}, a_{m,h}) - V_{m,h,f}(s_{m,h+1})\big] \le \big|\langle W^*_{m,h}(g) - W^*_{m,h}(g^*), X^*_{m,h}(g) \rangle\big|.$$
Furthermore, we know that the sub-optimality of the greedy policy of a candidate hypothesis can be decomposed into the sum of Bellman errors across time steps. Therefore, we show the following upper bound: $\cdots\big)\big)^2$.
Using Hölder's inequality, it suffices that $\sum_{\tau} X^*_{m,h}(g^{(\tau)}) X^*_{m,h}(g^{(\tau)})^\top$ provides sufficient coverage of $X^*_{m,h}(g)$ for all $m \in [M]$. Setting $T = \Omega(Md)$, we will confirm this via the elliptical potential lemma.
The proof of Theorem 5.3 is then established based on the above three observations. Since the second and third steps are natural extensions of the analysis of Bilin-UCB, the technical bulk is to build risk bounds for representation learning. Specifically, we define the following failure event:
Definition 6.1. Define $E$ as the event that there exist $t \in [T]$ and $h \in [H]$ such that
$$\sum_{\tau=1}^{t-1} \sum_{m=1}^{M} \Big(\big\langle B^*_h(g) v^*_{m,h}(g) - B^*_h(g^*) v^*_{m,h}(g^*), X^*_{m,h}(g^{(\tau)}) \big\rangle - \big\langle B^{(g)}_h(g) v^{(g)}_{m,h}(g) - B^{(g)}_h(\tilde{g}) v^{(g)}_{m,h}(\tilde{g}), X^{(g)}_{m,h}(g^{(\tau)}) \big\rangle\Big)^2 \ge R^2.$$
We will show that $\mathbb{P}[E] \le \delta$ with the choice of $R^2$ in Algorithm 1.
The key step is embedding the estimation noise into the low-dimensional space $\mathbb{R}^k$, thereby achieving improved concentration.
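For reference, the coverage argument sketched in this section rests on two standard facts, stated here in generic notation rather than the paper's. The symbols $w$, $x_t$, $\Sigma_t$, $L$, and $\lambda$ are placeholders introduced for this illustration, and the exact form used in the paper's proof may differ.

```latex
% Generic setting: x_1, ..., x_T in R^d with ||x_t||_2 <= L,
% Sigma_0 = lambda * I, and Sigma_t = Sigma_{t-1} + x_t x_t^T.

% (a) Cauchy-Schwarz in the Sigma-weighted norm, for any w, x in R^d:
\[
  |\langle w, x \rangle|
  \le \|w\|_{\Sigma_{t-1}} \, \|x\|_{\Sigma_{t-1}^{-1}},
  \qquad
  \|w\|_{\Sigma_{t-1}}^2
  = \lambda \|w\|_2^2 + \sum_{\tau=1}^{t-1} \langle w, x_\tau \rangle^2 .
\]

% (b) Elliptical potential lemma:
\[
  \sum_{t=1}^{T} \min\Big\{1, \|x_t\|_{\Sigma_{t-1}^{-1}}^2\Big\}
  \le 2 \log \frac{\det \Sigma_T}{\det \Sigma_0}
  \le 2 d \log\Big(1 + \frac{T L^2}{d \lambda}\Big).
\]
```

Read with $w$ in the role of $W^*_{m,h}(g) - W^*_{m,h}(g^*)$ and $x_\tau$ in the role of $X^*_{m,h}(g^{(\tau)})$, fact (a) turns a small sum of squared bilinear forms into a small bilinear form at a new point whenever that point is well covered, and fact (b) limits how many rounds can have poor coverage.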

7. CONCLUSION

This paper considers learning multiple RL problems jointly with representation learning. We propose a structural framework that permits generalization across tasks via general function approximation. We design a sample efficient algorithm based on representation learning with ERM and the optimism principle, where the confidence sets are constructed from learned features. Our theoretical analysis shows that the algorithm finds nearly-optimal policies within limited samples. Several examples are discussed and sample complexity improvements are illustrated. Given the success of representation learning in multi-task RL, it is an interesting future direction to study transfer learning for quickly adapting prior knowledge to a new, unseen task with limited data and computational power.



; Wang et al. (2021); Zanette et al. (2020); Agarwal et al. (2020); Kakade et al. (2020); Wen & Van Roy (2017); Dann et al. (2018); Du et al. (2019); Dong et al. (2020); Liu et al. (2019a); Wang et al. (2020a); Dong et al. (2021); Zhou et al. (2020); Yang et al. (2020); Jin et al. (2021a); Du et al. (2021). Among them, Bilinear class Du et al. (2021) is one of the most general frameworks.

Multi-task bandit and RL In multi-task bandit and RL problems, theoretical analysis often postulates the existence of low-rank or sparsity structures in the representation Lazaric & Ghavamzadeh (2010); Brunskill & Li (2013); Calandriello et al. (2014); Mutti et al. (2021); Maurer et al. (2016); D'Eramo et al. (2019); Arora et al. (2020); Qin et al. (2021); Yang et al. (2021); Hu et al. (2021). The most related works are Hu et al. (2021); Yang et al. (2021), where the benefits of representation learning in linear bandits and linear MDPs are studied. However, it remains open whether multi-task RL can benefit from representation learning via general function approximation.

s) holds for all h ∈ [H] and s ∈ S m . Therefore Q π * m m,h satisfies the following Bellman optimality equations for all s ∈ S m , a ∈ A m and h ∈ [H]:

$\ldots s_1) - V^{\pi_m}_{m,1}(s_1) \le \epsilon/M$ costs $O(H^6 M^2 d^2 \log(|\mathcal{G}|/\delta)/\epsilon^2)$ samples for each task. Thus the total number of trajectories is $\frac{H^6 M^3 d^2 \log(|\mathcal{G}|/\delta)}{\epsilon^2}$ via learning each task individually. Therefore, Theorem 5.3 improves the sample complexity of learning Low-rank Multi-task Bilinear class given small sizes of expressive feature classes $\mathcal{V}, \mathcal{B}, \mathcal{X}$, for example, $\log(|\mathcal{X}||\mathcal{V}||\mathcal{B}|) \le Md$. In general, the dependence on $d$ cannot be reduced since we estimate a $d$-by-$k$ matrix for the shared representation. Furthermore, we believe that without further assumption, a polynomial improvement w.r.t. $M$ is impossible because the learner has to learn the task-specific features $v^*_{m,h} \in \mathbb{R}^k$ for all $m \in [M]$. For an important special case known as the Linear Mixture Model (Definition 3.3), we have the following result via plugging $|\mathcal{V}| = |\mathcal{B}| = 1$, $|\mathcal{X}| = |\mathcal{G}|$ into Theorem 5.3. Corollary 5.4. Consider Low-rank Multi-task Bilinear class where each MDP $\mathcal{M}_m$ is a linear mixture model with $P$

,1:h ∼π m,g,1:h [Q m,h,g (s m,h , a m,h ) -r m,h (s m,h , a m,h ) -V m,h,f (s m,h+1 )] ≤ | W * m,h (g) -W * m,h (g * ), X * m,h (g) |.Furthermore, we know that the sub-optimality of the greedy policy of candidate hypothesis can be decomposed into the sum of Bellman errors across time steps. Therefore, we show the following upper bound

