RESET-FREE LIFELONG LEARNING WITH SKILL-SPACE PLANNING

Abstract

The objective of lifelong reinforcement learning (RL) is to optimize agents that can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.¹
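To make the framework concrete, the following is a minimal Python sketch of planning in a learned skill space. It is an illustration under stated assumptions, not the paper's implementation: skill_policy(state, z) (a low-level policy conditioned on a skill vector z) and dynamics_model(state, action) (a one-step model returning the next state and reward) are hypothetical interfaces, and a simple random-shooting planner stands in for whatever planner is actually used.

import numpy as np

def plan_in_skill_space(state, skill_policy, dynamics_model,
                        skill_dim=4, horizon=10, n_candidates=128):
    """Choose a skill by rolling out candidate skills in a learned model.

    Minimal sketch: sample candidate skill vectors z, simulate each one
    with the learned dynamics model for a fixed horizon, and return the
    skill with the highest predicted return. All interfaces here are
    assumed for illustration.
    """
    # Sample candidate skills uniformly from an assumed bounded skill space.
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, skill_dim))

    best_return, best_skill = -np.inf, None
    for z in candidates:
        s, total = state, 0.0
        # Roll out the learned dynamics model while holding the skill fixed.
        for _ in range(horizon):
            a = skill_policy(s, z)      # low-level action from the skill policy
            s, r = dynamics_model(s, a) # predicted next state and reward
            total += r
        if total > best_return:
            best_return, best_skill = total, z
    return best_skill

The design choice this illustrates is that the planner searches over a low-dimensional skill vector held fixed across the horizon, rather than over raw action sequences, which is what allows planning over long horizons in the abstract skill space.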

1. INTRODUCTION

Intelligent agents, such as humans, continuously interact with the real world and make decisions to maximize their utility over the course of their lifetime. This is broadly the goal of lifelong reinforcement learning (RL), which seeks to automatically learn artificial agents that can mimic the continuous learning capabilities of real-world agents. This goal is challenging for current RL algorithms because real-world environments can be non-stationary, requiring agents to continuously adapt to changing goals and dynamics in a robust fashion.

In contrast to much of the prior work in lifelong RL, our focus is on developing RL algorithms that can operate in non-episodic, or "reset-free," settings and learn from both online and offline interactions. This setup approximates real-world learning, where we might have plentiful logs of offline data, but resets to a fixed start distribution are not viable and our goals and environment change. Performing well in this setting is key to developing autonomous agents that can learn without laborious human supervision in non-stationary, high-stakes scenarios.

However, the performance of standard RL algorithms drops significantly in non-episodic settings. To illustrate this issue, we first pre-train agents to convergence in the episodic Hopper environment (Brockman et al., 2016) with state-of-the-art model-free and model-based RL algorithms: Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and Model-Based Policy Optimization (MBPO) (Janner et al., 2019), respectively. These agents are then trained further in a reset-free setting, representing a real-world scenario where agents seek to improve generalization by continuing to adapt at test time, where resets are expensive. The learning curves are shown in Figure 1. In spite of near-perfect initialization, all agents proceed to fail catastrophically, suggesting that current gradient-based RL methods are inherently unstable in non-episodic settings.
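The reset-free phase of this experiment can be summarized with a short sketch. This is a minimal illustration, assuming a generic agent object with act and update methods rather than the actual SAC/MBPO training code:

def reset_free_training(env, agent, n_steps=100_000):
    """Continue training an agent without episode resets.

    The environment is reset exactly once; thereafter the agent must act
    and update indefinitely. `agent.act` and `agent.update` are assumed
    interfaces for illustration.
    """
    state = env.reset()  # the only reset the agent ever receives
    rewards = []
    for _ in range(n_steps):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        rewards.append(reward)
        agent.update(state, action, reward, next_state)
        # Crucially, we never call env.reset(), even when `done` is True:
        # the agent must recover from failure states (e.g., a fallen
        # Hopper) on its own.
        state = next_state
    return rewards

In practice the environment must also be constructed so that it does not terminate on failure; for gym's Hopper-v3 this can be done with gym.make("Hopper-v3", terminate_when_unhealthy=False), assuming a gym version that exposes this flag.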



¹ Project website and materials: https://sites.google.com/berkeley.edu/reset-free-lifelong-learning



Figure 1: RL without planning fails without resets. Each line corresponds to one random seed. The red line shows the reward with no parameter updates (i.e., frozen weights).

