RESET-FREE LIFELONG LEARNING WITH SKILL-SPACE PLANNING

Abstract

The objective of lifelong reinforcement learning (RL) is to optimize agents that can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.

1. INTRODUCTION

Intelligent agents, such as humans, continuously interact with the real world and make decisions to maximize their utility over the course of their lifetime. This is broadly the goal of lifelong reinforcement learning (RL), which seeks to automatically learn artificial agents that can mimic the continuous learning capabilities of real-world agents. This goal is challenging for current RL algorithms as real-world environments can be non-stationary, requiring the agents to continuously and robustly adapt to changing goals and dynamics. In contrast to much of prior work in lifelong RL, our focus is on developing RL algorithms that can operate in non-episodic or "reset-free" settings and learn from both online and offline interactions. This setup approximates real-world learning, where we might have plentiful logs of offline data but resets to a fixed start distribution are not viable and our goals and environment change. Performing well in this setting is key for developing autonomous agents that can learn without laborious human supervision in non-stationary, high-stakes scenarios. However, the performance of standard RL algorithms drops significantly in non-episodic settings.

To illustrate this issue, we first pre-train agents to convergence in the episodic Hopper environment (Brockman et al., 2016) with state-of-the-art model-free and model-based RL algorithms: Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and Model-Based Policy Optimization (MBPO) (Janner et al., 2019), respectively. These agents are then trained further in a reset-free setting, representing a real-world scenario where agents seek to improve generalization by continuing to adapt at test time, where resets are more expensive. The learning curves are shown in Figure 1. In spite of near-perfect initialization, all agents proceed to fail catastrophically, suggesting that current gradient-based RL methods are inherently unstable in non-episodic settings.
This illustrative experiment complements prior work highlighting other failures of RL algorithms in non-stationary and non-episodic environments: Co-Reyes et al. (2020) find that current RL algorithms fail to learn in a simple gridworld environment without resets, and Lu et al. (2019) find that model-free RL algorithms struggle to learn and adapt to non-stationarity even with access to the ground-truth dynamics model. We can attribute these failures to RL algorithms succumbing to sink states. Intuitively, these are states from which agents struggle to escape, that have low rewards, and that signal a catastrophic halting of learning progress (Lyapunov, 1992). For example, an upright walking agent may fall over and fail to return to a standing position, possibly because of underactuated joints. A less obvious notion of sink state we use is one the agent simply fails to escape from due to low learning signal, which is almost equally undesirable. A lifelong agent must avoid such disabling sink states, especially in the absence of resets.

We introduce Lifelong Skill Planning (LiSP), an algorithmic framework for reset-free lifelong RL that uses long-horizon, decision-time planning in an abstract space of skills to overcome the above challenges. LiSP employs a synergistic combination of model-free policy networks and model-based planning, wherein a policy executes the chosen skills while we plan directly in the skill space. This combination offers two benefits: (1) skills constrain the search space, aiding the planner in finding solutions to long-horizon problems, and (2) skills mitigate errors in the dynamics model by constraining the distribution of behaviors. We demonstrate that agents learned via LiSP can effectively plan over longer horizons than prior work, enabling better long-term reasoning and adaptation. Another key component of the LiSP framework is the flexibility to learn skills from both online and offline interactions.
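To make decision-time planning in skill space concrete, the following is a minimal sketch of a cross-entropy-method (CEM) planner over skill sequences. All names here (`plan_skills`, `skill_policy`, `dynamics`, `reward_fn`) are illustrative placeholders rather than the authors' implementation; the learned dynamics model and skill-conditioned policy are assumed to be given as callables.

```python
import numpy as np

def plan_skills(state, dynamics, skill_policy, reward_fn,
                horizon=10, steps_per_skill=10, n_candidates=256,
                n_iters=3, n_elites=32, skill_dim=4, rng=None):
    """Model-predictive planning over skills: sample candidate skill
    sequences, roll them out under the learned dynamics model, and
    refit the sampling distribution toward high-return sequences (CEM)."""
    rng = rng or np.random.default_rng(0)
    mu = np.zeros((horizon, skill_dim))
    sigma = np.ones((horizon, skill_dim))
    for _ in range(n_iters):
        # Sample candidate skill sequences from the current distribution.
        z = rng.normal(mu, sigma, size=(n_candidates, horizon, skill_dim))
        returns = np.zeros(n_candidates)
        for i in range(n_candidates):
            s = state
            for t in range(horizon):
                # Each abstract skill is executed for several primitive
                # steps by the skill-conditioned policy.
                for _ in range(steps_per_skill):
                    a = skill_policy(s, z[i, t])
                    s = dynamics(s, a)  # learned model, not the real env
                    returns[i] += reward_fn(s, a)
        # Refit the distribution to the elite (highest-return) sequences.
        elites = z[np.argsort(returns)[-n_elites:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # execute only the first skill, then replan (MPC)
```

Because each candidate is a short sequence of skills rather than a long sequence of primitive actions, the search space the planner must cover is far smaller, which is the first benefit noted above.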
For online learning, we extend Dynamics-Aware Discovery of Skills (DADS), an algorithm for unsupervised skill discovery (Sharma et al., 2019), with a skill-practice proposal distribution and a primitive dynamics model for generating rollouts for training. We demonstrate that the use of this proposal distribution significantly amplifies the signal for learning skills in reset-free settings. For offline learning from logged interactions, we employ a similar approach but modify the reward function to correspond to the extent of disagreement amongst the models in a probabilistic ensemble (Kidambi et al., 2020).

Our key contributions can be summarized as follows:

• We identify skills as a key ingredient for overcoming the challenges to achieving effective lifelong RL in reset-free environments.

• We propose Lifelong Skill Planning (LiSP), an algorithmic framework for reset-free lifelong RL with two novel components: (a) a skill learning module that can learn from both online and offline interactions, and (b) a long-horizon, skill-space planning algorithm.

• We propose new, challenging benchmarks for reset-free lifelong RL by extending gridworld and MuJoCo OpenAI Gym benchmarks (Brockman et al., 2016). We demonstrate the effectiveness of LiSP over prior approaches on these benchmarks in a variety of non-stationary, multi-task settings, involving both online and offline interactions.
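As a rough illustration of the two reward signals described above, the sketch below implements (i) a DADS-style intrinsic reward that scores how much better the executed skill predicts the observed next state than skills drawn from the prior, and (ii) an ensemble-disagreement penalty for the offline setting. The function names and the standard-normal skill prior are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def dads_intrinsic_reward(log_q, s, z, s_next, prior_samples=16, rng=None):
    """DADS-style intrinsic reward: log q(s'|s, z) minus the log of the
    average density of s' under skills sampled from the prior. log_q is
    the log-density of a learned skill-conditioned dynamics model."""
    rng = rng or np.random.default_rng(0)
    # Sample alternative skills from an (assumed) standard normal prior.
    alt = rng.standard_normal((prior_samples, z.shape[0]))
    log_alt = np.array([log_q(s, zi, s_next) for zi in alt])
    # Numerically stable log-mean-exp over the alternative skills.
    m = log_alt.max()
    denom = m + np.log(np.mean(np.exp(log_alt - m)))
    return log_q(s, z, s_next) - denom

def disagreement_penalty(ensemble_preds):
    """Offline variant: penalize regions where a probabilistic ensemble
    of dynamics models disagrees (high epistemic uncertainty).
    ensemble_preds has shape (n_models, state_dim)."""
    return -float(np.max(np.std(ensemble_preds, axis=0)))
```

Intuitively, the intrinsic reward is high when a skill produces transitions that are predictable given that skill but not under other skills, while the disagreement penalty discourages skills that exploit parts of the state space the offline data does not cover.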

2. BACKGROUND

Problem Setup. We represent the lifelong environment as a sequence of Markov decision processes (MDPs). The lifelong MDP $\mathcal{M}$ is the concatenation of several MDPs $(\mathcal{M}_i, T_i)$, where $T_i$ denotes the length of time for which the dynamics of $\mathcal{M}_i$ are active. Without loss of generality, we assume the sum of the $T_i$ (i.e., the total environment time) exceeds the agent's lifetime. The properties of the MDP $\mathcal{M}_i$ are defined by the tuple $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P_i : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ are the transition dynamics, $r_i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. Consistent with prior work, we assume $r_i$ is always known to the agent, specifying the task; it is also easy to learn in settings where it is not known. We use $P$ and $r$ as shorthand for the $\mathcal{M}_i$ currently active for the agent. The agent is denoted by a policy $\pi : \mathcal{S} \to \mathcal{A}$ and seeks to maximize its expected return starting from the current state $s_0$: $\arg\max_\pi \mathbb{E}_{a_t \sim \pi, s_{t+1} \sim P}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. The policy $\pi$ may be implemented as a parameterized function or an action-generating procedure. We expect the agent to optimize for the current $\mathcal{M}_i$ rather than trying to predict future dynamics; e.g., a robot may be moved to an arbitrary new MDP and expected to perform well without anticipating this change in advance.
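The setup above can be made concrete with a small sketch: a helper that returns which MDP $\mathcal{M}_i$ is active at a given time step given the durations $T_i$, and the discounted-return objective the agent maximizes from its current state. Both functions are illustrative helpers, not part of the original formulation.

```python
def active_mdp(t, durations):
    """Index i of the MDP M_i whose dynamics are active at time t,
    given the durations T_i of the concatenated MDPs."""
    elapsed = 0
    for i, T in enumerate(durations):
        elapsed += T
        if t < elapsed:
            return i
    raise ValueError("t exceeds the total environment time")

def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t, the quantity the agent
    maximizes starting from its current state in the active MDP."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

Note that the agent never observes the boundaries between MDPs in advance; it simply optimizes the discounted return under whichever dynamics are currently active.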



Project website and materials: https://sites.google.com/berkeley.edu/reset-free-lifelong-learning



Figure 1: RL without planning fails without resets. Each line is one seed. The red line shows the reward with no updates (i.e., frozen weights).

