PROVABLY EFFICIENT MULTI-TASK REINFORCEMENT LEARNING IN LARGE STATE SPACES

Abstract

We study multi-task Reinforcement Learning, where shared knowledge among different environments is distilled to enable scalable generalization to a variety of problem instances. In the context of general function approximation, the class of Markov Decision Processes (MDPs) with low Bilinear rank encapsulates a wide range of structural conditions that permit polynomial sample complexity in large state spaces, where the Bellman errors are related to bilinear forms of features with low intrinsic dimensions. To achieve multi-task learning in such MDPs, we propose online representation learning algorithms that capture the shared features in the task-specific bilinear forms. We show that in the presence of low-rank structure in the features of the bilinear forms, the algorithms benefit from sample complexity improvements compared to single-task learning. We thereby obtain the first sample-efficient multi-task reinforcement learning algorithm with general function approximation.

1. INTRODUCTION

The ability to capture informative representations that generalize among multiple tasks has become significant in various machine learning applications Li et al. (2014); Tsiakas et al. (2016); Baevski et al. (2019); D'Eramo et al. (2019); Kubota et al. (2020); Liu et al. (2019b). In the context of multi-task learning Caruana (1997); Baxter (2000); Yu et al. (2005), this ability is highly desirable and becomes vital for learning with fewer samples than learning each task individually. Representation learning Bengio et al. (2013) is a powerful approach for achieving such sample efficiency improvements. This paper considers representation learning in multi-task Reinforcement Learning, an important class of meta Reinforcement Learning (meta-RL) Wang et al. (2016); Finn et al. (2017); Ritter et al. (2018).

Reinforcement learning (RL) is a sequential decision-making problem in which an agent aims to learn optimal decisions by interacting with an unknown environment Sutton & Barto (2018). Empowered by representation learning with deep neural networks LeCun et al. (2015); Goodfellow et al. (2016), RL has achieved tremendous success in various real-world applications, such as Go Silver et al. (2016), Atari Mnih et al. (2013), Dota 2 Berner et al. (2019), Texas Hold'em poker Moravčík et al. (2017), and autonomous driving Shalev-Shwartz et al. (2016). The benefit of using representation learning to extract a joint feature embedding from different but related tasks has therefore emerged as an essential problem to investigate. Specifically, this paper studies the problem of learning multiple RL tasks jointly with the help of representation learning. Although multi-task learning in online decision-making problems has received increasing research interest Lazaric & Ghavamzadeh (2010); Mutti et al. (2021); Maurer et al. (2016); Qin et al. (2021); Yang et al. (2021); Hu et al. (2021), most existing works focus on tabular or linear models. Indeed, how general function approximation extrapolates across huge state spaces remains largely an open problem in itself. Recently, the Bilinear class Du et al. (2021) was proposed as a promising structural framework for generalization in reinforcement learning through the use of function approximation. The Bilinear class postulates that the Bellman error can be related to a bilinear form depending on the hypothesis, and it captures nearly all existing function approximation models, e.g., Jin et al. (2020a); Zanette et al. (2020); Yang & Wang (2020); Jiang et al. (2017); Sun et al. (2019); Kakade et al. (2020); Agarwal et al. (2020). However, in the presence of shared information in the bilinear forms across multiple tasks, the Bilin-UCB algorithm proposed in Du et al. (2021) is not able to adapt to such knowledge, and challenges abound in adopting representation learning to find nearly-optimal policies with limited data.

In this paper, we give the first sample-efficient algorithm for multi-task RL with general function approximation through the use of representation learning in the Bilinear class. In particular, to study representation learning, we propose the Low-rank Multi-task Bilinear class, a structural framework that permits generalization both within and across tasks in multi-task RL. Specifically, this model class specifies M MDP instances, where M > 0 is a fixed integer and each instance belongs to the Bilinear class Du et al. (2021), i.e., its Bellman error admits a low-rank factorization in R^d. Since our multi-task setting has M MDP instances, there are M different feature maps specified by the definition of the Bilinear class, each corresponding to one MDP task and taking values in R^d. We additionally assume that these M feature maps, when packed together as a matrix-valued mapping in R^{d×M}, have rank k ≪ d. In other words, in the Low-rank Multi-task Bilinear class, the bilinear form of each task possesses a low-dimensional task-specific feature and a shared representation. Under this setting, it is desirable that the RL algorithm utilize the intrinsic low-dimensional structure to achieve improved sample efficiency compared to solving each task separately. To this end, under the online setting where the agent learns from its past experiences without knowing the model, we design a sample-efficient algorithm that provably finds nearly-optimal policies for all tasks.

Our algorithm is based on the principle of Optimism in the Face of Uncertainty (OFU): it constructs a confidence region that contains the true hypothesis based on the historical data across the M tasks, and then updates the policy according to the most optimistic hypothesis within the confidence region. Here the hypothesis can denote the true transition models or the optimal value functions of the M tasks. When constructing the confidence region, we explicitly utilize the low-dimensional structure by jointly learning the task-specific features and the shared representation via Empirical Risk Minimization (ERM) on the multi-task data. As for planning, we select the hypothesis in the confidence region that leads to the highest aggregated value over the M tasks. In the analysis, we show a concentration result in which the estimation noise can be embedded into a low-dimensional space, and thus prove that our algorithm finds nearly-optimal policies within limited samples. Concretely, compared to learning each task separately using Bilin-UCB Du et al. (2021), an algorithm designed for the Bilinear class that does not utilize the shared representation, our algorithm enjoys a (d/k)-fold improvement in sample efficiency whenever the feature classes are small. To the best of our knowledge, this is the first provably sample-efficient multi-task RL algorithm with general function approximation.

Notations Let R^d denote the d-dimensional Euclidean space and R^{d×k} the space of d-by-k real matrices. The inner product of two vectors x, y ∈ R^d is denoted ⟨x, y⟩. For sets A_1, ..., A_n, define ⊗_{k∈[n]} A_k = A_1 ⊗ ··· ⊗ A_n = {(a_1, ..., a_n) : a_k ∈ A_k, k ∈ [n]}. Given scalars a_1, ..., a_n, let a_{1:n} denote the vector (a_1, ..., a_n). Also let (a_ω)_{ω∈Ω} denote the tuple consisting of a_ω, where ω ranges over a countable set Ω. For variables v_1, ..., v_k, we denote by v_{1:k} the k-tuple (v_1, ..., v_k).

Roadmap In Section 2 we introduce the basic problem setup and notation. In Section 3 we introduce the Low-rank Multi-task Bilinear class, a framework that captures shared information in the Bilinear class. In Section 4 we present the main algorithm for learning Low-rank Multi-task Bilinear class models, empowered by representation learning and the optimism principle in decision making. We state the main theoretical result in Section 5 and give an overview of the techniques in Section 6. A couple of examples are given in Section A. We conclude with a discussion of further directions.
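As an illustrative sketch (not part of the paper's formal development), the low-rank assumption on the stacked feature maps, and the idea of jointly recovering a shared representation together with task-specific features from multi-task data, can be demonstrated numerically. All names and dimensions below are hypothetical, and the ERM step is simplified to a truncated SVD of a noisy observation of the stacked features:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, k = 50, 10, 3  # feature dimension, number of tasks, shared rank (k << d)

# Hypothetical ground truth: shared representation B (d x k) and
# task-specific features A (k x M). Stacking the M task feature maps
# column-wise gives Phi = B A, a d x M matrix of rank at most k.
B = rng.standard_normal((d, k))
A = rng.standard_normal((k, M))
Phi = B @ A
rank = np.linalg.matrix_rank(Phi)  # k (= 3), not min(d, M)

# A simplified ERM-style joint estimate from a noisy observation of Phi:
# the best rank-k approximation (truncated SVD) recovers the shared
# subspace and the task-specific components simultaneously.
noise = 0.01 * rng.standard_normal((d, M))
U, s, Vt = np.linalg.svd(Phi + noise, full_matrices=False)
B_hat = U[:, :k]              # estimated shared representation (d x k)
A_hat = s[:k, None] * Vt[:k]  # estimated task-specific features (k x M)

err = np.linalg.norm(B_hat @ A_hat - Phi) / np.linalg.norm(Phi)
print(f"stacked-feature rank: {rank}, relative recovery error: {err:.4f}")
```

The point of the sketch is that only k directions in R^d carry information across all M tasks, which is why pooling data across tasks can pay off relative to estimating a full d-dimensional object per task.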

Theoretical understanding of the sample complexity of RL with general function approximation remains relatively scarce. In recent years, there has been a surge of theoretical insights on linear and non-linear function approximation Jin et al. (2020b;c); Wang et al. (2021); Zanette et al. (2020); Agarwal et al. (2020); Kakade et al. (2020); Wen & Van Roy (2017); Dann et al. (2018); Du et al. (2019); Dong et al. (2020); Liu et al. (2019a); Wang et al. (2020a); Dong et al. (2021); Zhou et al. (2020); Yang et al. (2020); Jin et al. (2021a); Du et al. (2021). Among them, the Bilinear class Du et al. (2021) is one of the most general frameworks.
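For concreteness, the Bilinear class condition referenced in this section can be stated informally (following Du et al. (2021); measurability and regularity conditions are omitted here): the average Bellman error of a hypothesis f at step h is controlled by a bilinear form,

\[
\Bigl| \mathbb{E}_{(s_h, a_h) \sim \pi_f}\bigl[ Q_f(s_h, a_h) - r(s_h, a_h) - V_f(s_{h+1}) \bigr] \Bigr|
\;\le\; \bigl| \langle W_h(f),\, X_h(f) \rangle \bigr|,
\qquad W_h(f),\, X_h(f) \in \mathbb{R}^d .
\]

In the multi-task setting of this paper, each task m carries its own factor, and the low-rank assumption says that the d × M matrix obtained by stacking the M task-wise factors as columns has rank k ≪ d, so that a k-dimensional shared representation underlies all M bilinear forms.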

