SKILL-BASED REINFORCEMENT LEARNING WITH INTRINSIC REWARD MATCHING

Abstract

While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the skill discriminator, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an intrinsic reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to match the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and on challenging tabletop manipulation tasks.

1. INTRODUCTION

Generalist agents must possess the ability to execute a diverse set of behaviors and flexibly adapt them to complete novel tasks. Although deep reinforcement learning has proven to be a potent tool for solving complex control and reasoning tasks such as in-hand manipulation (OpenAI et al., 2019) and the game of Go (Silver et al., 2016), specialist deep RL agents learn each new task from scratch, collecting new data and optimizing a new objective with no prior knowledge. This presents a major roadblock to integrating RL into many real-world applications such as robotic control, where collecting data and resetting robot experiments is prohibitively costly (Kalashnikov et al., 2018). Recent progress in scaling multitask reinforcement learning (Reed et al., 2022; Kalashnikov et al., 2021) has revealed the potential of multitask agents to encode vast skill repertoires, rivaling the performance of specialist agents and even generalizing to out-of-distribution tasks. Moreover, skill-based unsupervised RL (Laskin et al., 2022; Liu & Abbeel, 2021; Sharma et al., 2020) shows promise of acquiring similarly useful behaviors but without the expensive per-task supervision required for conventional multitask RL. Recent skill-based RL results suggest that unsupervised RL can distill diverse behaviors into distinguishable skill policies; however, such approaches lack a principled framework for connecting unsupervised pretraining and downstream finetuning. The current state of the art leverages inefficient skill search methods at the policy level, such as performing a sampling-based optimization or sweeping a coarse discretization of the skill space (Laskin et al., 2021). However, such methods still exhibit key limitations: they (1) rely on expensive environment trials to evaluate which skill is optimal and (2) are likely to select suboptimal behaviors as the continuous skill space grows, due to the curse of dimensionality.
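The baseline skill-selection strategy described above can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's implementation: all names (`rollout_return`, `grid_sweep_skill_selection`, the bin counts) are assumptions, and `rollout_return` stands in for one full environment episode.

```python
# Hypothetical sketch: sweep a coarse discretization of a continuous skill
# space and pick the skill whose environment rollout earns the highest task
# return. Each call to rollout_return represents one expensive episode.
import itertools

import numpy as np


def grid_sweep_skill_selection(rollout_return, skill_dim, bins_per_dim=3):
    """Evaluate every skill on a coarse grid of the [0, 1]^skill_dim space.

    Returns the best skill vector, its return, and the number of environment
    rollouts spent just to select a skill.
    """
    grid_vals = np.linspace(0.0, 1.0, bins_per_dim)
    best_skill, best_ret = None, -np.inf
    n_rollouts = 0
    for combo in itertools.product(grid_vals, repeat=skill_dim):
        z = np.array(combo)
        ret = rollout_return(z)  # expensive environment interaction
        n_rollouts += 1
        if ret > best_ret:
            best_skill, best_ret = z, ret
    return best_skill, best_ret, n_rollouts
```

Note that the rollout count is `bins_per_dim ** skill_dim`, which makes both limitations above visible: every grid point costs an episode, and the cost grows exponentially with the skill dimension.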
In this work, we present Intrinsic Reward Matching (IRM), a scalable algorithmic methodology for unifying unsupervised skill pretraining and downstream task finetuning by leveraging the learned intrinsic reward function parameterized by the skill discriminator. Centrally, we introduce a novel approach to leveraging the intrinsic reward model as a multitask reward function that, via
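The core idea of matching intrinsic and task rewards can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's method: the intrinsic reward is taken to be the discriminator log-probability, candidate skills are compared with a plain squared-error matching loss over a batch of states, and the names (`discriminator_logprob`, `task_reward`) are hypothetical.

```python
# Simplified sketch of reward matching: score each candidate skill z by how
# well its intrinsic reward (from the discriminator) agrees with the
# downstream task reward on a batch of states -- no environment rollouts.
import numpy as np


def irm_skill_selection(discriminator_logprob, task_reward, states, candidates):
    """Return the candidate skill whose intrinsic reward best matches the
    task reward over `states`, along with its matching loss (lower is better).
    """
    r_task = np.array([task_reward(s) for s in states])
    best_z, best_loss = None, np.inf
    for z in candidates:
        r_int = np.array([discriminator_logprob(s, z) for s in states])
        loss = float(np.mean((r_int - r_task) ** 2))  # matching loss
        if loss < best_loss:
            best_z, best_loss = z, loss
    return best_z, best_loss
```

Because the loss is computed on already-collected states, skill selection here costs only discriminator and reward evaluations, in contrast to the rollout-based sweep used by prior policy-level approaches.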

