PROVABLY EFFICIENT LIFELONG REINFORCEMENT LEARNING WITH LINEAR REPRESENTATION

Abstract

We theoretically study lifelong reinforcement learning (RL) with linear representation in a regret minimization setting. The goal of the agent is to learn a multi-task policy based on a linear representation while solving a sequence of tasks that may be adaptively chosen based on the agent's past behaviors. We frame the problem as a linearly parameterized contextual Markov decision process (MDP), where each task is specified by a context and the transition dynamics are context-independent, and we introduce a new completeness-style assumption on the representation which is sufficient to ensure that the optimal multi-task policy is realizable under the linear representation. Under this assumption, we propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks while using only sublinear planning calls. Specifically, for K task episodes of horizon H, our algorithm achieves a regret bound of Õ(√((d³ + dd′)H⁴K)) based on O(dH log(K)) planning calls, where d and d′ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to internalize experiences into a multi-task policy and rapidly solve new tasks.

1. INTRODUCTION

Recently, there has been surging interest in designing lifelong learning agents that can continuously learn to solve multiple sequential decision making problems in their lifetimes (Thrun & Mitchell, 1995; Khetarpal et al., 2020; Silver et al., 2013; Xie & Finn, 2021). This scenario is motivated in particular by building multi-purpose embodied intelligence, such as robots working in a weakly structured environment (Roy et al., 2021). Typically, curating all tasks beforehand for such problems is nearly infeasible, and the problems the agent is tasked with may be adaptively selected based on the agent's past behaviors. Consider a household robot as an example. Since each household is unique, it is difficult to anticipate upfront all scenarios the robot will encounter. Moreover, the tasks the robot faces are not independent and identically distributed (i.i.d.). Instead, what the robot has done before can affect the next task and its starting state; e.g., if the robot fails to bring a glass of water and breaks it, then the user is likely to command the robot to clean up the mess. Thus, it is critical that the agent continuously improves and generalizes learned abilities to different tasks, regardless of their order. In this work, we theoretically study lifelong reinforcement learning (RL) in a regret minimization setting (Thrun & Mitchell, 1995; Ammar et al., 2015), where the agent needs to solve a sequence of tasks using rewards in an unknown environment while balancing exploration and exploitation. Motivated by the embodied intelligence scenario, we suppose that tasks differ in rewards, but share the same state and action spaces and transition dynamics (Xie & Finn, 2021). To be realistic, we make no assumptions on how the tasks and initial states are selected¹; generally, we allow them to be chosen from a continuous set by an adversary based on the agent's past behaviors.
Once a task is specified and revealed, the agent has one chance (i.e., executing one rollout from its current state) to complete the task, and then it moves on to the next task. The agent's goal is to perform near-optimally for the tasks it faces, despite the online nature of the problem. This means that the accumulated regret of the learner, compared with the best policy for each task, should be sublinear in its lifetime. We assume that there is no memory constraint; this is usually the case for robotics applications, where real-world interactions are the main bottleneck (Xie & Finn, 2021). Nonetheless, we require that the agent eventually learn to make decisions without frequent deliberate planning, because planning is time-consuming and creates undesirable wait time in user-interactive scenarios. In other words, the agent needs to learn a multi-task policy, generalizing not only from past samples but also from past computation, to solve new tasks. Formally, we consider an episodic setup based on the framework of the contextual Markov decision process (CMDP) (Abbasi-Yadkori & Neu, 2014; Hallak et al., 2015). It repeats the following steps: 1) At the beginning of an episode, the agent is set to an initial state and receives a context specifying the task reward, both of which can be arbitrarily chosen. 2) When needed, the agent uses its past experiences to plan for the current task. 3) The agent runs a policy in the environment for a fixed horizon in an attempt to solve the assigned task and gains experience from its policy execution. The agent's performance is measured as the regret with respect to the optimal policy of the corresponding task. We require that, for any task sequence, both the agent's overall regret and its number of planning calls be sublinear in the number of episodes.
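The three-step episodic protocol above can be sketched as a simple interaction loop. The `ToyEnv` and `ToyAgent` classes below are hypothetical placeholders (not part of the paper's formal setup) whose only purpose is to make the control flow concrete: tasks differ through the revealed context, while the dynamics are context-independent.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for the shared environment."""

    def reset(self, k):
        self.state = 0
        return self.state, k % 3                   # toy adversary: contexts cycle

    def step(self, action):
        self.state = (self.state + action) % 5     # fixed, context-free dynamics
        return self.state, float(self.state == 0)  # toy reward signal

class ToyAgent:
    """Placeholder agent; a real agent would plan only rarely and then
    act from a learned multi-task policy."""

    def plan_if_needed(self, context):
        pass

    def act(self, state, context, h):
        return random.choice([0, 1])

    def record_episode(self):
        pass

def lifelong_rl_protocol(env, agent, K, H):
    """One pass of the episodic protocol: for each of K tasks, receive an
    (initial state, context) pair, optionally plan, run one H-step rollout."""
    returns = []
    for k in range(K):
        state, context = env.reset(k)   # step 1: adversarial state + context
        agent.plan_if_needed(context)   # step 2: plan only when needed
        total = 0.0
        for h in range(H):              # step 3: a single rollout per task
            action = agent.act(state, context, h)
            state, reward = env.step(action)
            total += reward
        agent.record_episode()          # store experience for later planning
        returns.append(total)
    return returns

episode_returns = lifelong_rl_protocol(ToyEnv(), ToyAgent(), K=10, H=4)
```

The per-task returns collected here are what the regret definition compares, episode by episode, against the optimal policy of the corresponding task.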
While lifelong RL is not new, the realistic need to simultaneously achieve 1) sublinear regret and 2) a sublinear number of planning calls for 3) a potentially adversarial sequence of tasks and initial states makes the setup considered here particularly challenging. To our knowledge, existing works only address a strict subset of these requirements; in particular, the computation aspect is often ignored. Most provable works in lifelong RL assume that the tasks are finitely many (Ammar et al., 2015; Zhan et al., 2017; Brunskill & Li, 2015) or i.i.d. (Ammar et al., 2014; Brunskill & Li, 2014; Abel et al., 2018a;b; Lecarpentier et al., 2021), while others considering setups similar to ours do not provide regret guarantees (Isele et al., 2016; Xie & Finn, 2021).² Importantly, none of the existing works properly addresses the need for sublinear planning calls, which creates a large gap between the abstract setup and practical needs. In this paper, we aim to establish a foundation for designing agents meeting these three practically important requirements, a problem which has been overlooked in the literature. As a first step, we study lifelong RL with linear representation. We suppose that the contextual MDP is linearly parameterized (Yang & Wang, 2019; Jin et al., 2020) and that the agent needs to learn a multi-task policy based on this linear representation. To make this possible, we introduce a new completeness-style assumption on the representation which is sufficient to ensure that the optimal multi-task policy is realizable under the linear representation.
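To make the linear structure concrete, the following is an illustrative sketch in the spirit of the linear MDP model of Jin et al. (2020); the feature maps and parameter symbols here are notational placeholders, and the paper's exact factorization may differ. The dynamics are linear in a context-independent d-dimensional feature map, while each task's reward is linear in a d′-dimensional, context-dependent one.

```latex
% Illustrative linear parameterization; \phi, \psi, \mu, \theta are
% placeholders, not necessarily the paper's exact symbols.
\begin{align*}
  \mathbb{P}(s' \mid s, a) &= \langle \phi(s, a),\, \mu(s') \rangle,
    & \phi(s, a) &\in \mathbb{R}^{d}
    && \text{(context-independent dynamics)} \\
  r(s, a; c) &= \langle \psi(s, a, c),\, \theta \rangle,
    & \psi(s, a, c) &\in \mathbb{R}^{d'}
    && \text{(context-dependent reward)}
\end{align*}
```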
Under these assumptions, we propose the first provably efficient lifelong RL algorithm, Upper Confidence Bound Lifelong Value Distillation (UCBlvd, pronounced as "UC Boulevard"), that possesses all three desired qualities. Specifically, for K episodes of horizon H, we prove a regret bound of Õ(√((d³ + dd′)H⁴K)) using Õ(dH log(K)) planning calls, where d and d′ are the feature dimensions of the dynamics and rewards, respectively. From a high-level viewpoint, UCBlvd uses a linear structure to identify what to transfer and operates by interleaving 1) independent planning for a set of representative tasks and 2) distilling the planned results into a multi-task value-based policy. UCBlvd also constantly monitors, based on a doubling schedule, whether its newly gained experiences are sufficiently significant, so as to avoid unnecessary planning. On the technical side, UCBlvd's design is inspired by the single-task algorithm LSVI-UCB (Jin et al., 2020); however, we introduce a novel distillation step based on a QCQP, along with a new completeness assumption, to enable computation sharing across tasks. We also extend the low-switching-cost technique (Abbasi-Yadkori et al., 2011; Gao et al., 2021; Wang et al., 2021) for single-task RL to the lifelong setup to achieve a sublinear number of planning calls.

Notation. Throughout the paper, we use lower-case letters for scalars, lower-case bold letters for vectors, and upper-case bold letters for matrices. The Euclidean norm of x is denoted by ‖x‖₂. We denote the transpose of a vector x by x⊤. For any vectors x and y, we use ⟨x, y⟩ to denote their inner product.
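The doubling schedule that keeps the number of planning calls sublinear can be sketched as a determinant-doubling trigger, a generic version of the low-switching-cost idea of Abbasi-Yadkori et al. (2011): replan only when the log-determinant of the regularized Gram matrix of observed features has grown by a constant since the last planning call. The function name, the threshold `factor=2.0`, and the synthetic feature stream below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def planning_schedule(episode_features, lam=1.0, factor=2.0):
    """Replan only when the regularized Gram matrix's determinant has grown
    by `factor` since the last planning call; with bounded features this
    fires only O(d log K) times across K episodes."""
    d = episode_features[0].shape[1]
    gram = lam * np.eye(d)                      # regularized Gram matrix
    _, last_logdet = np.linalg.slogdet(gram)
    replan_episodes = []
    for k, phi in enumerate(episode_features):  # phi: (H, d) features of episode k
        gram += phi.T @ phi                     # absorb the episode's experience
        _, logdet = np.linalg.slogdet(gram)
        if logdet - last_logdet > np.log(factor):
            replan_episodes.append(k)           # new information is significant
            last_logdet = logdet
    return replan_episodes

rng = np.random.default_rng(0)
features = [rng.normal(scale=0.5, size=(5, 4)) for _ in range(200)]
triggers = planning_schedule(features)          # far fewer triggers than episodes
```

Because the log-determinant of the Gram matrix grows only logarithmically in the number of episodes once the feature directions are well covered, the trigger fires often early on and then increasingly rarely.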



¹ We adopt a stricter definition of lifelong RL here to distinguish it from multi-task RL; note that some existing works on lifelong RL (e.g., Brunskill & Li (2014); Lecarpentier et al. (2021)) assume i.i.d. tasks.



² On the technical side, the closest lines of work are Modi & Tewari (2020); Abbasi-Yadkori & Neu (2014); Hallak et al. (2015); Modi et al. (2018); Kakade et al. (2020) for contextual MDPs, and Wu et al. (2021); Abels et al. (2019) for the dynamic setting of multi-objective RL, which study the sample complexity for arbitrary task sequences; however, they either assume the problem is tabular or require a model-based planning oracle with unknown complexity.

