SKILL-BASED REINFORCEMENT LEARNING WITH INTRINSIC REWARD MATCHING

Abstract

While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the skill discriminator, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an intrinsic reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to match the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and on challenging tabletop manipulation tasks.

1. INTRODUCTION

Generalist agents must possess the ability to execute a diverse set of behaviors and flexibly adapt them to complete novel tasks. Although deep reinforcement learning has proven to be a potent tool for solving complex control and reasoning tasks such as in-hand manipulation (OpenAI et al., 2019) and the game of Go (Silver et al., 2016), specialist deep RL agents learn each new task from scratch, often collecting new data and optimizing a new objective with no prior knowledge. This presents a massive roadblock to integrating RL into real-world applications such as robotic control, where collecting data and resetting robot experiments is prohibitively costly (Kalashnikov et al., 2018). Recent progress in scaling multitask reinforcement learning (Reed et al., 2022; Kalashnikov et al., 2021) has revealed the potential of multitask agents to encode vast skill repertoires, rivaling the performance of specialist agents and even generalizing to out-of-distribution tasks. Moreover, skill-based unsupervised RL (Laskin et al., 2022; Liu & Abbeel, 2021; Sharma et al., 2020) shows promise of acquiring similarly useful behaviors without the expensive per-task supervision required for conventional multitask RL. Recent skill-based RL results suggest that unsupervised RL can distill diverse behaviors into distinguishable skill policies; however, such approaches lack a principled framework for connecting unsupervised pretraining and downstream finetuning. The current state of the art relies on inefficient skill search at the policy level, such as sampling-based optimization or sweeping a coarse discretization of the skill space (Laskin et al., 2021). Such methods exhibit two key limitations: (1) they rely on expensive environment trials to evaluate which skill is optimal, and (2) they become increasingly likely to select suboptimal behaviors as the continuous skill space grows, due to the curse of dimensionality.
In this work, we present Intrinsic Reward Matching (IRM), a scalable algorithmic methodology for unifying unsupervised skill pretraining and downstream task finetuning by leveraging the learned intrinsic reward function parameterized by the skill discriminator. Centrally, we introduce a novel approach that treats the intrinsic reward model as a multitask reward function and, via interaction-free task inference, selects the optimal pretrained policy for the extrinsic task reward. During pretraining, unsupervised skill discovery methods learn a discriminator-parameterized family of reward functions that corresponds to a family of policies, or skills, through a shared latent code. Instead of discarding the discriminator during finetuning as is done in prior work, we observe that the discriminator is an effective task specifier for its corresponding policy that can be matched against the extrinsic reward, allowing us to perform skill selection while bypassing brute-force environment trials. Our approach views the extrinsic reward as a distribution with measurable proximity to a pretrained multitask reward distribution and formulates an optimization over skills of a reward distance metric called EPIC (Gleave et al., 2020).
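To make the interaction-free skill selection step concrete, the following is a minimal, hypothetical Python sketch (function names such as `select_skill` are our own, not from the paper). It uses only the Pearson-distance component of EPIC and omits EPIC's reward canonicalization step; it sweeps candidate skill vectors and picks the one whose discriminator-derived intrinsic reward correlates best with the task reward on a batch of sampled states, requiring no environment rollouts:

```python
import numpy as np

def pearson_distance(r_a, r_b):
    # EPIC compares (canonicalized) rewards via the Pearson correlation;
    # d = sqrt((1 - rho) / 2) lies in [0, 1], with 0 meaning the rewards
    # are equivalent up to a positive affine transformation.
    rho = np.corrcoef(r_a, r_b)[0, 1]
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))

def select_skill(intrinsic_reward, extrinsic_reward, states, candidate_zs):
    # intrinsic_reward(states, z): discriminator-parameterized reward for skill z
    # extrinsic_reward(states): downstream task reward
    # Returns the candidate skill minimizing the EPIC-style loss, with no
    # environment interaction: all rewards are evaluated on sampled states.
    r_task = extrinsic_reward(states)
    losses = [pearson_distance(intrinsic_reward(states, z), r_task)
              for z in candidate_zs]
    return candidate_zs[int(np.argmin(losses))]
```

In the full method, the distance is computed between canonicalized rewards over sampled transitions, and the minimization over z can use gradient-based or sampling-based optimizers rather than a discrete sweep.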

Contributions

The key contributions of this paper are summarized as follows: (1) We describe a unifying discriminator reward matching framework and introduce a practical algorithm for selecting skills without relying on environment samples (Section 3). (2) We demonstrate that our method is competitive with previous finetuning approaches on the Unsupervised Reinforcement Learning Benchmark (URLB), a suite of 12 continuous control tasks (Section 4.1). (3) We evaluate our approach on more challenging tabletop manipulation environments which underscore the limitations of previous approaches and show that our method finetunes more efficiently (Section 4.2). (4) We generalize our method to sequence pretrained skills and solve long-horizon manipulation tasks (Section 4.3) as well as ablate key algorithmic components. (5) We provide analysis and visualizations that yield insight into how skills are selected and further justify the generality of our method (Section 5).

2. BACKGROUND

2.1 UNSUPERVISED SKILL PRETRAINING

The skill learning literature has long sought to design agents that autonomously acquire structured behaviors in new environments (Thrun & Schwartz, 1994; Sutton et al., 1999; Pickett & Barto, 2002). Recent work in competence-based unsupervised RL proposes generic objectives encouraging the discovery of skills representing diverse and useful behaviors (Eysenbach et al., 2019; Sharma et al., 2020; Laskin et al., 2022). A skill is defined as a latent code vector z ∈ Z that indexes the conditional policy π(a|s, z). To learn such a policy, this class of skill pretraining algorithms maximizes the mutual information between sampled skills and their resulting trajectories τ (Gregor et al., 2016a; Eysenbach et al., 2018; Sharma et al., 2019): I(τ; z) = H(z) − H(z|τ) = H(τ) − H(τ|z). Since the mutual information I(τ; z) is intractable to compute in practice, competence-based methods instead maximize the variational lower bound proposed in (Barber & Agakov, 2003), I(τ; z) ≥ E_{z,τ}[log q_φ(z|τ)] + H(z), where the variational posterior q_φ(z|τ), or skill discriminator, parameterizes an intrinsic reward maximized by the skill-conditioned policy.
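As a concrete illustration of how a discriminator yields an intrinsic reward, the following hedged sketch (our own, not the paper's implementation) computes a DIAYN-style reward r(s, z) = log q_φ(z|s) − log p(z) from discriminator logits over a discrete set of skills, assuming a uniform skill prior:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over discriminator scores.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def intrinsic_reward(discriminator_logits, z_index, num_skills):
    # r(s, z) = log q(z|s) - log p(z), with uniform prior p(z) = 1/num_skills.
    # The reward is positive when the discriminator identifies the active
    # skill better than chance, pushing skills toward distinguishable behaviors.
    log_q = log_softmax(discriminator_logits)
    return log_q[z_index] + np.log(num_skills)
```

Maximizing this reward in expectation maximizes the variational lower bound on I(τ; z); continuous-skill methods replace the categorical discriminator with a density model or contrastive estimator.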



Figure 1: Intrinsic Reward Matching (IRM) Framework. IRM takes place in three stages: (1) Task-agnostic RL pretraining learns skill primitives in conjunction with a skill discriminator. (2) With no environment interaction, IRM minimizes the EPIC loss between the intrinsic reward parameterized by the discriminator and the extrinsic reward with respect to the skill vector z. (3) The skill policy conditioned on the optimal z* finetunes on the task reward to solve the downstream task.

