SKILL-BASED REINFORCEMENT LEARNING WITH INTRINSIC REWARD MATCHING

Abstract

While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the skill discriminator, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an intrinsic reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to match the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and on challenging tabletop manipulation tasks.

1. INTRODUCTION

Generalist agents must possess the ability to execute a diverse set of behaviors and flexibly adapt them to complete novel tasks. Although deep reinforcement learning has proven to be a potent tool for solving complex control and reasoning tasks such as in-hand manipulation (OpenAI et al., 2019) and the game of Go (Silver et al., 2016), specialist deep RL agents learn each new task from scratch, collecting new data and optimizing a new objective with no prior knowledge. This presents a massive roadblock to integrating RL into real-world applications such as robotic control, where collecting data and resetting robot experiments is prohibitively costly (Kalashnikov et al., 2018). Recent progress in scaling multitask reinforcement learning (Reed et al., 2022; Kalashnikov et al., 2021) has revealed the potential of multitask agents to encode vast skill repertoires, rivaling the performance of specialist agents and even generalizing to out-of-distribution tasks. Moreover, skill-based unsupervised RL (Laskin et al., 2022; Liu & Abbeel, 2021; Sharma et al., 2020) shows promise of acquiring similarly useful behaviors but without the expensive per-task supervision required for conventional multitask RL.

Recent skill-based RL results suggest that unsupervised RL can distill diverse behaviors into distinguishable skill policies; however, such approaches lack a principled framework for connecting unsupervised pretraining and downstream finetuning. The current state of the art relies on inefficient skill search at the policy level, such as sampling-based optimization or sweeping a coarse discretization of the skill space (Laskin et al., 2021). Such methods exhibit two key limitations: they (1) rely on expensive environment trials to evaluate which skill is optimal and (2) are likely to select suboptimal behaviors as the continuous skill space grows, due to the curse of dimensionality.

In this work, we present Intrinsic Reward Matching (IRM), a scalable algorithmic methodology for unifying unsupervised skill pretraining and downstream task finetuning by leveraging the learned intrinsic reward function parameterized by the skill discriminator. Centrally, we introduce a novel approach to leveraging the intrinsic reward model as a multitask reward function that, via interaction-free task inference, enables us to select the optimal pretrained policy for the extrinsic task reward. During pretraining, unsupervised skill discovery methods learn a discriminator-parameterized family of reward functions that correspond to a family of policies, or skills, through a shared latent code. Instead of discarding the discriminator during finetuning as is done in prior work, we observe that the discriminator is an effective task specifier for its corresponding policy that can be matched with the extrinsic reward, allowing us to perform skill selection while bypassing brute-force environment trials. Our approach views the extrinsic reward as a distribution with measurable proximity to a pretrained multitask reward distribution and formulates an optimization over skills with respect to a reward distance metric called EPIC (Gleave et al., 2020).

Contributions

The key contributions of this paper are summarized as follows: (1) We describe a unifying discriminator reward matching framework and introduce a practical algorithm for selecting skills without relying on environment samples (Section 3). (2) We demonstrate that our method is competitive with previous finetuning approaches on the Unsupervised Reinforcement Learning Benchmark (URLB), a suite of 12 continuous control tasks (Section 4.1). (3) We evaluate our approach on more challenging tabletop manipulation environments which underscore the limitations of previous approaches, and show that our method finetunes more efficiently (Section 4.2). (4) We generalize our method to sequence pretrained skills and solve long-horizon manipulation tasks (Section 4.3), and ablate key algorithmic components. (5) We provide analysis and visualizations that yield insight into how skills are selected and further justify the generality of our method (Section 5).

2.1. UNSUPERVISED SKILL PRETRAINING

The skill learning literature has long sought to design agents that autonomously acquire structured behaviors in new environments (Thrun & Schwartz, 1994; Sutton et al., 1999; Pickett & Barto, 2002). Recent work in competence-based unsupervised RL proposes generic objectives encouraging the discovery of skills representing diverse and useful behaviors (Eysenbach et al., 2019; Sharma et al., 2020; Laskin et al., 2022). A skill is defined as a latent code vector z ∈ Z that indexes the conditional policy π(a|s, z). In order to learn such a policy, this class of skill pretraining algorithms maximizes the mutual information between sampled skills and their resulting trajectories τ (Gregor et al., 2016a; Eysenbach et al., 2018; Sharma et al., 2019):

I(τ; z) = H(z) − H(z|τ) = H(τ) − H(τ|z)

Since the mutual information I(τ; z) is intractable to compute in practice, competence-based methods instead maximize a variational lower bound proposed in (Barber & Agakov, 2003), which is parameterized by a learned neural network q_ϕ(τ, z) called a skill discriminator. This discriminator, along with other terms independent of z, parameterizes an intrinsic reward that the skill policy π(·|s, z) maximizes during pretraining. Given an unseen task specification, the agent needs to infer which skill will finetune to solve the task with the fewest samples. For more detailed explanations of the mutual information decompositions of various skill discovery algorithms, refer to Appendix A.2.

Pretrained Multitask Reward Functions: We observe that the intrinsic reward function learned during skill pretraining can be viewed as a multitask reward function, where the continuous skill code z determines the task. In other words, we have some function

R_int(τ, z) := VLB(τ, z)

where VLB ≤ I(τ; z) is the variational lower bound proposed in (Barber & Agakov, 2003) and τ is a trajectory representation such as (s, s′). Since skill discovery algorithms aim to maximize I(τ; z), we can view its parameterized lower bound VLB as a multitask reward function that scores transitions based on their alignment with a skill code (Laskin et al., 2022).
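To make this concrete, the discriminator can be queried directly as a skill-indexed reward. The sketch below is illustrative rather than the authors' exact parameterization: the MLP architecture, `state_dim`, `skill_dim`, and the transition representation τ = (s, s′) are assumptions.

```python
import torch
import torch.nn as nn

class SkillDiscriminator(nn.Module):
    """q_phi(tau, z): scores how well a transition tau = (s, s') aligns with a skill code z."""

    def __init__(self, state_dim: int, skill_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + skill_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s, s_next, z):
        return self.net(torch.cat([s, s_next, z], dim=-1)).squeeze(-1)

def intrinsic_reward(disc, s, s_next, z):
    """R_int(tau, z): the multitask reward indexed by the continuous skill code z."""
    return disc(s, s_next, z)
```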

2.2. EQUIVALENT-POLICY INVARIANT COMPARISON

We can formalize a general notion of reward function similarity via Equivalent-Policy Invariant Comparison (EPIC) as established in (Gleave et al., 2020). EPIC defines a distance between two reward functions such that similar reward functions induce similar optimal policies. We consider the case of action-independent reward:

D_EPIC(R_A, R_B) = E_{s_P, s′_P ∼ D_P; S_C, S′_C ∼ D_C} [ D_ρ( C(R_A)(s_P, s′_P, S_C, S′_C), C(R_B)(s_P, s′_P, S_C, S′_C) ) ]     (Equation 3)

where D_ρ(X, Y) = √((1 − ρ(X, Y)) / 2) is the Pearson distance between two random variables X and Y, s_P, s′_P are samples from the Pearson distribution D_P, and S_C, S′_C are batches sampled from the Canonical distribution D_C. We compute the Pearson distance over the Pearson samples s_P, s′_P, with additional canonicalization over the batches S_C, S′_C to ensure invariance to constant shifts and scaling. The canonicalized reward function is defined as

C(R)(s_P, s′_P, S_C, S′_C) = R(s_P, s′_P) + E[ γR(s′_P, S′_C) − R(s_P, S′_C) − γR(S_C, S′_C) ]

where R : S × S → R is a reward function and the expectation is taken over the Canonical distribution D_C; for simplicity, we sample the batches S_C, S′_C ∼ D_C ahead of time. The canonicalization ensures invariance to reward shaping, so that rewards with different shaping but similar induced optimal policies are close in distance. In practice, the final term can be omitted since the Pearson correlation is already invariant to constant shifts and scaling.
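A sample-based sketch of the EPIC distance is given below, assuming action-independent reward functions R(s, s′) that accept batched tensors; the function names and the `gamma`/`eps` defaults are our own choices, not fixed by the paper.

```python
import torch

def _pairwise_mean(reward_fn, A, B):
    """E_b[R(a, b)] for every a in A, averaging over all b in B; returns a tensor of shape [len(A)]."""
    nA, nB = A.shape[0], B.shape[0]
    a = A.repeat_interleave(nB, dim=0)   # [nA * nB, d]
    b = B.repeat(nA, 1)                  # [nA * nB, d]
    return reward_fn(a, b).view(nA, nB).mean(dim=1)

def canonicalize(reward_fn, s_p, s_p_next, S_c, S_c_next, gamma=0.99):
    """C(R)(s_P, s'_P) = R(s_P, s'_P) + E[gamma*R(s'_P, S'_C) - R(s_P, S'_C) - gamma*R(S_C, S'_C)]."""
    base = reward_fn(s_p, s_p_next)
    shaping = (gamma * _pairwise_mean(reward_fn, s_p_next, S_c_next)
               - _pairwise_mean(reward_fn, s_p, S_c_next)
               - gamma * reward_fn(S_c, S_c_next).mean())
    return base + shaping

def pearson_distance(x, y, eps=1e-8):
    """D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2)."""
    x, y = x - x.mean(), y - y.mean()
    rho = (x * y).sum() / (x.norm() * y.norm() + eps)
    return torch.sqrt(torch.clamp((1.0 - rho) / 2.0, min=eps))

def epic_distance(reward_a, reward_b, s_p, s_p_next, S_c, S_c_next):
    """EPIC pseudometric between two (action-independent) reward functions R(s, s')."""
    return pearson_distance(
        canonicalize(reward_a, s_p, s_p_next, S_c, S_c_next),
        canonicalize(reward_b, s_p, s_p_next, S_c, S_c_next),
    )
```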

3.1. TASK INFERENCE VIA INTRINSIC REWARD MATCHING

A multitask reward function that can supervise the learning of diverse behaviors is useful in its own right. However, in the case of skill-based RL, we have additionally learned a corresponding skill-conditioned policy π(a|s, z). Therefore, for any "task" that can be specified by our intrinsic reward function, we already have an optimal policy, so long as we condition on the corresponding skill. If we have learned a sufficiently diverse library of skills, we might expect that some of our skills are behaviorally similar to the optimal policy for the downstream task. It then also holds that the corresponding intrinsic reward for such a skill is a task specification semantically similar to the downstream task.

Given this interpretation of intrinsic reward, we posit that identifying which of our pretrained skills to apply to a downstream task can be reframed as inferring which task in our multitask reward function is most similar to the downstream task. In other words, we should find the skill code z that produces the reward function most semantically aligned with the downstream task reward. With this formalism, we can pose task inference as the following optimization:

z* = argmin_z D_EPIC(R_int(τ, z), R_ext(τ))     (Equation 5)

which finds the z* most aligned with the task reward. Equation 5 minimizes a novel loss we call the EPIC loss with respect to the skill parameter z. By EPIC's equivalence-class invariance, if the EPIC loss is small for some z*, and π(a|s, z*) is near optimal for R_int(τ, z*), then π(a|s, z*) approaches the optimal policy for the task specified by R_ext. Notably, we require access to the task reward function R_ext to compute the EPIC loss; leveraging a known task reward function is a departure from conventional skill selection methods.

Computing R_int during reward matching: During pretraining, some methods such as (Laskin et al., 2022; Sharma et al., 2020) require negative samples in order to compute the variational objective in Equation 2 and avoid a degenerate optimization where all embedded trajectories have high similarity with all skills. However, during selection, when skills are fixed, the negative sampling component amounts to a reward offset that does not change the task semantics. Furthermore, since we may not in general have access to a large number of negative samples on a given downstream task, we simplify the objective to

R_int(τ, z) := VLB(τ, z) ≡ q_ϕ(τ, z)

where q_ϕ is the skill discriminator. This parameterization of the intrinsic reward preserves the alignment semantics of the VLB without normalization by negative samples. For more details on the discriminator parameterization of the intrinsic reward for (Laskin et al., 2022; Sharma et al., 2020), refer to Appendix A.3 and Appendix A.4. The full skill selection procedure is summarized in Algorithm 1.
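A minimal sketch of the resulting skill selection loop is shown below, reusing the epic_distance helper sketched in Section 2.2 and a discriminator `disc(s, s', z)` that is differentiable in z. The step count, learning rate, and all-0.5 initialization follow the IRM GD settings reported in Appendix A.7; the function and argument names are illustrative.

```python
import torch

def select_skill(disc, r_ext, skill_dim, s_p, s_p_next, S_c, S_c_next,
                 n_steps=5000, lr=5e-3):
    """Approximate z* = argmin_z D_EPIC(R_int(., z), R_ext) by gradient descent on z (IRM GD)."""
    z = torch.full((skill_dim,), 0.5, requires_grad=True)  # initialize at the all-0.5 skill vector
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        # R_int(s, s', z) := q_phi((s, s'), z), the discriminator queried at the current skill.
        r_int = lambda s, s_next: disc(s, s_next, z.expand(s.shape[0], -1))
        loss = epic_distance(r_int, r_ext, s_p, s_p_next, S_c, S_c_next)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```

The IRM Random and IRM CEM variants replace the gradient step with random search or elite re-sampling over z, using the same EPIC loss as the score.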

3.2. EPIC SAMPLE-BASED APPROXIMATION

We make a number of sample-based approximations of various unknown quantities in order to concretize the continuous optimization in Equation 5 as a tractable loss minimization problem.

Canonical State Distribution Approximation:

In order to canonicalize our reward functions, we estimate the expectation over the state and next-state distributions with a sample-based average over 1024 samples. These distributions can be entirely arbitrary, though using heavily out-of-distribution samples with respect to pretraining can weaken the accuracy of the approximation. We choose to instantiate a uniform distribution bounded by known workspace constraints for both of these distributions.

Sampling Distribution for Pearson Correlation: We find that generating samples uniformly roughly within the environment workspace bounds, just as with the reward canonicalization, often leads to strong approximations. Furthermore, as both sample generation and the relatively inexpensive function evaluation are independent of the online finetuning phase, we can perform the full skill optimization as a self-contained preprocess to downstream policy adaptation without any environment samples. Rough knowledge of workspace bounds represents some amount of prior environment knowledge. We leave more general options such as sampling from a learned generative model over trajectories encountered during pretraining or sampling from saved pretraining data to future work. We ablate various sampling distribution choices in Table 6 and present the full algorithm in detail in Algorithm 1.
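As an example, for the planar environment (whose observation bounds are given in Appendix A.5), the Pearson and Canonical batches can be drawn uniformly within the workspace as sketched below; the helper name is an assumption, and the batch size of 1024 matches the canonicalization estimate described above.

```python
import torch

def uniform_state_samples(low, high, n=1024):
    """Draw n states uniformly within (approximately) known workspace bounds."""
    low = torch.as_tensor(low, dtype=torch.float32)
    high = torch.as_tensor(high, dtype=torch.float32)
    return low + (high - low) * torch.rand(n, low.shape[0])

# Planar environment bounds from Appendix A.5: observations lie in [-128, 128] x [-128, 128].
bounds = ([-128.0, -128.0], [128.0, 128.0])
s_p, s_p_next = uniform_state_samples(*bounds), uniform_state_samples(*bounds)   # Pearson batch
S_c, S_c_next = uniform_state_samples(*bounds), uniform_state_samples(*bounds)   # Canonical batch
```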

3.3. GENERALIZATION TO SKILL SEQUENCING

Many realistic downstream tasks derive additional complexity from temporally extended planning horizons. In contrast to hierarchical reinforcement learning (HRL) approaches, which aim to stitch together pretrained skills at the policy level with a higher-level manager policy, we can extend the task matching framework of IRM to efficiently solve the problem of skill sequencing, entirely doing away with the manager policy. Consider the long-horizon setting where we have a sequence of reward functions to optimize over some task horizon H. Central to the finetuning problem is determining over which time intervals different pretrained skills should be selected. In this work we predetermine a fixed skill horizon ⌊H/N⌋, where N is the number of rewards. This skill horizon could in principle be specified as a parameter and learned from the task reward signal. Next, in order to perform skill selection over each time interval, we perform the IRM algorithm in parallel for each reward. We note the key assumption that IRM requires access to the reward functions for each of the subtasks. For example, for a sequential goal-reaching task, we divide the episode into N segments for each of the N goals and corresponding goal-reaching rewards. We then perform the IRM skill selection algorithm for each reward to select the optimal skill over each interval. After selecting the skills, we freeze our selections and finetune the skill policies jointly.
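A sketch of this per-interval selection is shown below, reusing the select_skill routine sketched in Section 3.1 (run here sequentially for clarity rather than in parallel); `sample_states` is an illustrative stand-in for the Pearson/Canonical sampling of Section 3.2.

```python
def select_skill_sequence(disc, subtask_rewards, skill_dim, horizon, sample_states):
    """Select one skill per subtask reward over a fixed skill horizon of floor(H / N) steps."""
    skill_horizon = horizon // len(subtask_rewards)
    skills = []
    for r_ext in subtask_rewards:
        s_p, s_p_next = sample_states(), sample_states()   # Pearson samples for this subtask
        S_c, S_c_next = sample_states(), sample_states()   # Canonical samples for this subtask
        skills.append(select_skill(disc, r_ext, skill_dim, s_p, s_p_next, S_c, S_c_next))
    # Execute skills[i] for timesteps [i * skill_horizon, (i + 1) * skill_horizon) before joint finetuning.
    return skills, skill_horizon
```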

4. EXPERIMENTS

In this section, we experimentally evaluate whether IRM improves the adaptation sample-efficiency of skill finetuning on downstream reinforcement learning tasks compared to baselines. For pretraining skills, we experiment with both the CIC (Laskin et al., 2022) and DADS (Sharma et al., 2020) algorithms. We consider three IRM variants: IRM Random, which randomly samples skills and picks the one with the lowest EPIC loss; IRM CEM, which selects elites as the skills with the lowest EPIC loss; and IRM Gradient Descent (GD), which minimizes the EPIC loss with the Adam optimizer, backpropagating through the discriminator to regress the optimal skill.

Environments: We evaluate IRM on URLB (Laskin et al., 2021), which consists of twelve downstream tasks across three challenging continuous control domains in the DMControl suite: Walker, Quadruped, and Jaco. We also design a reaching environment and a tabletop pushing environment in the OpenAI Gym Fetch environment (Brockman et al., 2016), with further details in Appendix A.5.


Baselines: We benchmark several conventional finetuning approaches after a single skill pretraining phase of Contrastive Intrinsic Control (CIC) (Laskin et al., 2022). The Grid Search (GS) baseline coarsely sweeps 10 skills spaced evenly from the all-0s skill vector to the all-1s skill vector and finetunes the skill that achieves the best evaluation reward over an episode. Env Rollout randomly samples 10 skills to evaluate with a rollout, and Env Rollout CEM uses the episode reward as the metric by which to select elites. Random Skill selects a skill at random. Relabel relabels saved skill rollouts obtained during pretraining with the task reward function and selects the skill that achieved the highest reward. All baselines use the TD3 (Fujimoto et al., 2018) RL algorithm.

Evaluation: We follow an evaluation identical to the 2M pretraining setup in URLB. First, we pretrain each RL agent with the intrinsic rewards for 2M steps. Then, we finetune each agent to the downstream task with extrinsic rewards for 100k steps. Since our primary contribution involves skill selection, we especially focus on zero-shot episode rewards: rewards achieved by a selected skill policy before any RL updates on the task reward. We report results averaged over 5 seeds with standard error bars.
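For contrast with IRM's interaction-free selection, a sketch of the rollout-based selection these baselines perform (here for Env Rollout with uniformly sampled skills), assuming a gym-style environment API; all names are illustrative.

```python
import torch

def env_rollout_selection(env, policy, skill_dim, n_skills=10, episode_len=1000):
    """Evaluate randomly sampled skills with full episode rollouts and keep the best one."""
    best_skill, best_return = None, float("-inf")
    for _ in range(n_skills):
        z = torch.rand(skill_dim)                    # candidate skill
        obs, episode_return = env.reset(), 0.0
        for _ in range(episode_len):
            obs, reward, done, _ = env.step(policy(obs, z))
            episode_return += reward
            if done:
                break
        if episode_return > best_return:
            best_skill, best_return = z, episode_return
    return best_skill                                # each candidate costs a full episode of environment steps
```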

4.1. UNSUPERVISED REINFORCEMENT LEARNING BENCHMARK

In Table 1, we display the zero-shot performance of IRM-based methods compared to interaction-based methods over all 12 URLB tasks. On most of the Walker and Quadruped tasks, IRM is either comparable to or outperforms the interaction-based baselines. Reward relabelling fails to consistently select optimal skills across the benchmark, likely because its options are limited to the finite set of skills sampled during pretraining; IRM, by contrast, leverages continuous optimization in the skill space to find the best skill for the task. An important insight is that IRM uses its environment interactions to immediately begin finetuning the selected skill policy instead of spending a significant number of samples on skill selection. This allows IRM-based methods to obtain greater sample-efficiency than rollout-based methods, even when both initial skill selections obtain similar performance, as demonstrated in Figure 7. Unsurprisingly, methods like IRM GD and IRM CEM tend to perform better than IRM Random, which does not have the luxury of iterative refinement on a smooth EPIC loss manifold, as shown in Figure 5. We find that neither our method nor the baselines are well-suited for skill selection on the Jaco tasks. This is likely because these tasks are very sparsely rewarded, making it unlikely that many samples, whether randomly generated as in IRM or rolled out, will consistently result in high rewards. We provide analysis demonstrating the relationship between task reward sparsity and the smoothness of the EPIC loss manifold in Appendix A.12.3.

4.2. TABLETOP MANIPULATION

Reach Target: We evaluate IRM on the Reach Target task, where the Fetch robot is rewarded for reaching a target position. IRM outperforms or closely matches environment-rollout methods while requiring no environment samples to perform skill selection. As shown in Table 1, the random skill policy performs particularly poorly and with very high variance relative to the IRM and environment-rollout-based methods. Moreover, appropriate skill selection is required for strong zero-shot performance, as certain skills obtain much higher rewards than others. Figure 3 shows the finetuning performance of the methods on the downstream task reward. IRM-based methods reach optimal performance more sample-efficiently than environment-rollout-based methods due to improved skill selection.

Push Block to Goal: Next, we evaluate IRM on a more complex manipulation task involving pushing a block to a goal position. We report the zero-shot IRM skill selection performance in Table 1 and finetuning performance in Figure 3. This more complex task similarly benefits from bootstrapping the appropriate pretrained skill policy, as evidenced by the performance gap of the selection-based methods over random skill selection. We remark that even for more complex manipulation tasks, IRM is robust in consistently guiding optimal skill selection without requiring any interaction with the environment. Although Env Rollout CEM is one of the stronger baselines in terms of zero-shot reward, it exceeds the computational budget of 100k interactions entirely on skill selection; for illustrative purposes, we show its plot starting at 50k steps.

Figure 3: Finetuning performance on Fetch. The performance gap between the IRM skill selection methods and random skill selection evidences the sample-efficiency gains from bootstrapping a pretrained policy whose task-level semantics are similar to the task reward. IRM-based methods select optimal skills with no environment interaction and consequently finetune efficiently. Top: Fetch Reach and Block Push tasks. Bottom: long-horizon Fetch Reach and Block Push with obstacles tasks.

Task            | IRM Rand Seq | IRM CEM Seq  | IRM GD Seq  | Env Seq     | HRL
Fetch Reach Seq | 88.1 ± 1.5   | 89.5 ± 0.34  | 86.7 ± 0.64 | 80.7 ± 4.7  | 28.4 ± 31.0
Fetch Push Seq  | 84.9 ± 0.12  | 84.9 ± 0.12  | 81.4 ± 1.9  | 83.7 ± 0.30 | 78.9 ± 3.1

Table 2: Zero-shot rewards on long-horizon manipulation tasks.

4.3. EXTENSIONS AND ABLATIONS

Long-Horizon Manipulation: Building on the results in Section 4.2, we demonstrate that IRM fully generalizes to solving long-horizon tasks in the tabletop manipulation setting. During the unsupervised pretraining phase, skill discovery methods can acquire useful skills such as directional block pushing or pushing the block to certain spatial locations. We show that IRM can intelligently select a sequence of such skills to finetune via reward matching, avoiding learning a hierarchical manager policy that finetunes at the policy level. For the Fetch Reach environment, we consider an extended horizon where the agent is tasked with reaching a sequence of goals in a particular order. For the Fetch Push task, we consider the environment depicted in Figure 2, where the agent must navigate around a barrier introduced during the finetuning phase in order to reach the goal. We compare IRM methods to an environment rollout baseline (Env Seq) and a hierarchical RL baseline (HRL). The IRM Seq methods select skills for each defined sub-task's reward function according to the IRM optimization scheme. Env Seq chooses the best combination of skills based on extrinsic reward from rollouts. HRL is initialized with random skills and simultaneously optimizes a manager policy over skills and the skill policies themselves. In both settings and across optimization methods, IRM outperforms the environment rollout and HRL baselines in identifying effective skill sequences (Table 2).

EPIC Loss Ablation: We compare the EPIC loss against simpler reward distance metrics, namely L1 and L2 distances, which are not invariant to the reward scaling and shaping that EPIC canonicalization handles. To strengthen these comparisons, we include a learned reward scaling parameter for L1 and L2 and similarly observe that EPIC is a superior matching metric (Table 3). We validate on the Fetch Reach task that IRM CEM and IRM Rand convincingly outperform all episode rollout baselines in zero-shot episode reward.

EPIC Loss Visualizations

Figure 5: We examine EPIC losses between extrinsic rewards and intrinsic rewards conditioned on the skill vector, sweeping across the 2D skill vector for a pretrained planar agent.

Does optimizing the EPIC loss lead to effective skill selection? In Figure 4, we verify that the EPIC loss is strongly negatively correlated with extrinsic reward on a Planar Goal Reaching task detailed in Appendix A.8. Thus, optimizing for a low EPIC loss is an effective substitute for optimizing the environment reward, and crucially, it forgoes collecting expensive environment samples.

How can we understand skills through EPIC losses? In Figure 5, we plot EPIC losses between intrinsic rewards and goal-reaching rewards across the 2D continuous skill space. Not only is the loss landscape smooth, which motivates optimization methods like gradient descent, but there is also a banded partitioning of the manifold. The latent skill space is well-structured: darker-colored partitions of the skill space correspond to groups of skills with low EPIC loss for each task reward. EPIC losses concisely represent the desirability of skills with respect to a downstream reward function, so skills that achieve a low EPIC loss for the Top Left goal achieve high EPIC losses for the opposite reward, the Bottom Right goal. We include a scatter plot and trajectory visualizations in Figure 4. As Figure 4 suggests, skills with the lowest EPIC loss receive high extrinsic reward, reaching the goal with high spatial precision. Skills with the highest losses produce the opposite behavior, moving directly away from the goal. In the sequential case, low-EPIC-loss skills attempt to reach the first goal and then the second, while high-EPIC-loss skills perform the behavior in the reverse order. The intrinsic reward module thus provides much deeper insight into the semantics of skills than the extrinsic rewards obtained by skill policy rollouts.

6. RELATED WORK

Several works, including (Sharma et al., 2020; Eysenbach et al., 2019; Achiam et al., 2018; Gregor et al., 2016b; Baumli et al., 2020; Florensa et al., 2017; Laskin et al., 2022), employ mutual information maximization for skill pretraining. While (Laskin et al., 2022) leverages coarse grid search to select skills for downstream RL, methods such as (Sharma et al., 2020) instead plan through a learned skill dynamics model at finetuning time. Our approach is similar in that it leverages pretraining model components other than the policy to guide skill selection. However, rather than generating a reward-maximizing plan through possibly complex, learned environment dynamics, we instead match a policy to the task reward directly through the pretrained discriminator. In the context of sequential finetuning, (Baumli et al., 2020; Eysenbach et al., 2019) employ hierarchical RL to chain pretrained skills with a manager policy, requiring additional environment interactions. Works on such HRL methods include (Nachum et al., 2018; Frans et al., 2017; Vezhnevets et al., 2017; Springenberg et al., 2018) and, more classically, (Sutton et al., 1999; Stolle & Precup, 2002). By contrast, we demonstrate that the intrinsic reward matching framework can be extended to choose skill sequences without reliance on environment samples. The successor features line of work also adopts a unified view of skill-based RL; it relies on the assumption that arbitrary rewards can be parameterized linearly in some learned features and a task vector, as in (Liu & Abbeel, 2021; Barreto et al., 2016). Our approach relaxes this assumption to the fully general setting by instead searching for a pretrained task with minimal distance to an arbitrarily parameterized task reward.

7. DISCUSSION

We present Intrinsic Reward Matching (IRM), a framework for algorithmically unifying information-maximization unsupervised reinforcement learning with downstream task adaptation. We instantiate a practical algorithm for implementing this framework and demonstrate that IRM outperforms current methods on a continuous control benchmark and on tabletop manipulation tasks. IRM diverges from past work in leveraging the discriminator for downstream task inference and consequently performs skill selection without environment interactions in the short-horizon setting. We also show that IRM can be readily extended to the general skill sequencing setting to solve more realistic long-horizon tasks as an alternative to hierarchical methods. Central to our contribution is a novel loss function, the EPIC loss, which serves both as a skill selection utility and as a new way to interpret the task-level semantics of pretrained skills.

We acknowledge a number of limitations of our approach. IRM relies on samples of the state, roughly within workspace boundaries, as well as access to an external reward function, ideally well-shaped; this prior knowledge trades off against IRM's reduced reliance on environment interactions. In order to obtain realistic image samples for computing the EPIC loss, an agent could learn an expressive generative model such as a VAE over the image states obtained during pretraining and sample from the model to generate diverse and realistic states. For learning unknown state-based rewards, the agent could additionally learn an image-reward model by regressing the rewards encountered during exploration (Hafner et al., 2019). These extensions would further relax some of the assumptions made in this contribution and represent an exciting direction for future work.

A APPENDIX

A.1 BACKGROUND AND NOTATION

Markov Decision Process: The goal of reinforcement learning is to maximize cumulative reward in an uncertain environment that the agent interacts with. The problem can be modelled as a Markov Decision Process (MDP) defined by (S, A, P, r, γ), where S is the set of states, A is the set of actions, P is the transition probability distribution, r is the reward function, and γ is the discount factor.

Unsupervised Skill Discovery: In competence-based unsupervised RL, the aim is to learn skills that generate diverse and useful behaviors (Eysenbach et al., 2019). The broad aim is to learn skill-conditioned, generalizable policies. Formally, we learn skills z ∈ Z and take actions according to a ∼ π(·|s, z). As an illustrative example, applying this formalism to the MuJoCo Walker domain, we might hope to find a skill-conditioned policy and skills z_walk, z_run such that π(·|s, z_walk) makes the agent walk, while π(·|s, z_run) makes it run. Further, if we allow for continuous skills, we can also imagine using the policy to "jog" at different speeds by interpolating the z_walk and z_run skills. That is, taking z_jog^α = α · z_walk + (1 − α) · z_run should, intuitively, yield a policy π(·|s, z_jog^α) that makes the agent jog at a speed dictated by the parameter α.

Finetuning Pretrained Skills: With a skill-conditioned policy π(·|s, z), an agent needs to infer which skill to index for a downstream task (e.g., identifying whether it needs to use z_walk or z_run) during finetuning. This is a relatively under-explored area, with the most universal approach being a coarse, discretized grid search. Least-squares regression has also been investigated in the context of successor features (Liu & Abbeel, 2021).

A.2 COMPETENCE-BASED SKILL DISCOVERY

Competence-based skill discovery algorithms aim to maximize the mutual information between trajectories and skills:

I(τ; z) = H(z) − H(z|τ) = H(τ) − H(τ|z)

Since the mutual information I(τ; z) is intractable to compute in practice, competence-based methods maximize a variational lower bound. Many mutual information maximization algorithms, such as Variational Intrinsic Control (Gregor et al., 2016a) and Diversity Is All You Need (Eysenbach et al., 2018), use the decomposition I(τ; z) = H(z) − H(z|τ). Other competence-based methods, such as Dynamics-Aware Unsupervised Discovery of Skills (Sharma et al., 2019), Active Pretraining with Successor Features (Liu & Abbeel, 2021), and Contrastive Intrinsic Control (CIC) (Laskin et al., 2022), maximize a lower bound for H(τ) − H(τ|z). While the decompositions of the mutual information objective are equivalent, algorithms make different design choices regarding how to approximate entropy, represent trajectories, and embed skills. These choices affect the distillation of skills: for instance, without explicit maximization of H(τ) in the decomposition of mutual information, behavioral diversity may not be guaranteed when the state space is much larger than the skill space (Laskin et al., 2022).

A.3 CIC

Contrastive Intrinsic Control (CIC) (Laskin et al., 2022) maximizes the following lower bound:

I(τ; z) ≥ F_CIC(τ; z) := H_particle(τ) + E[ q_ϕ(τ_i, z_i) − log (1/N) Σ_{j=1}^{N} exp(q_ϕ(τ_j, z_i)) ]

where H_particle(τ) ∝ Σ_i log ||h_i − h_i*||, h_i* is the k-nearest-neighbor embedding of h_i, N_k is the number of nearest neighbors used to approximate the entropy, and N − 1 is the number of negative samples.
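A sketch of the contrastive term and a simplified particle-entropy estimate follows; the temperature-scaled inner-product form of q_ϕ, the embedding shapes, and the use of the k-th neighbor distance (rather than an average over the k nearest neighbors) are simplifying assumptions.

```python
import torch

def cic_contrastive_term(skill_emb, traj_emb, temperature=0.5):
    """E[ q(tau_i, z_i) - log (1/N) sum_j exp(q(tau_j, z_i)) ] with q(tau, z) = <g(tau), h(z)> / T."""
    logits = traj_emb @ skill_emb.T / temperature          # logits[j, i] = q(tau_j, z_i)
    n = logits.shape[0]
    positives = logits.diagonal()                          # q(tau_i, z_i)
    return (positives - torch.logsumexp(logits, dim=0) + torch.log(torch.tensor(float(n)))).mean()

def particle_entropy(traj_emb, k=12, eps=1e-6):
    """H_particle(tau): average log distance to the k-th nearest neighbor in embedding space."""
    dists = torch.cdist(traj_emb, traj_emb)                # pairwise distances, zero on the diagonal
    kth_nn = dists.topk(k + 1, dim=1, largest=False).values[:, -1]
    return torch.log(kth_nn + eps).mean()
```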

A.4 DADS

We additionally use Dynamics-Aware Unsupervised Discovery of Skills (DADS) (Sharma et al., 2020) for skill discovery, as it is one of the few skill discovery algorithms to successfully scale up to continuous skills. DADS maximizes a lower bound for I(τ; z) = H(τ) − H(τ|z) by learning skill-conditioned transition distributions. The lower bound is:

I(τ; z) ≥ F_DADS(τ; z) := log [ q_ϕ(s′|s, z) / Σ_{i=1}^{L} q_ϕ(s′|s, z_i) ] + log L

For our experiments, we reimplement the on-policy DADS algorithm in PyTorch. We follow the default hyperparameters and train for 20 million environment steps, per (Sharma et al., 2020).

A.5 ENVIRONMENT DETAILS

The URLB domains are Walker, Quadruped, and Jaco. Walker requires a bipedal agent to perform a variety of navigation-based tasks on a 2D plane while preserving its balance. Quadruped, a more challenging domain due to its higher-dimensional state-action space, requires a quadrupedal agent to perform navigation tasks in a 3D environment. The Jaco robot arm is a 6-DOF manipulator with a three-finger gripper and contains a variety of directional reaching tasks. For the URLB (Laskin et al., 2021) environments, we follow the default environment settings. Like many skill-discovery methods (Sharma et al., 2020; Eysenbach et al., 2019), we restrict the discriminator input. For Quadruped, we use the x, y, z velocity, which is included in the environment's state space. For Walker, we use the x, y, z world position, which we add to the environment's state space but remove from the policy input. For Jaco, we use the x, y, z world position.

For our Fetch reaching environment, we use the Gym Robotics Fetch environment (Brockman et al., 2016). We set the time limit to 200. For the Fetch push environment, we partition the continuous action space into 4 actions, which push the block forward, backward, left, and right. We set the time limit to 10 for skill learning. We evaluate sequential skill selection on 2 environments: Fetch Reach and Fetch Push. For the Fetch Push task, we fix 3 waypoints, depicted in Figure 2, and fix a time horizon of 15 pushes per waypoint. For Fetch Reach, we consider 2 waypoints and a time horizon of 25 for each waypoint. Our plane environment is a 2D world with observations in [-128, 128] x [-128, 128] and continuous actions in [-10, 10] x [-10, 10].

A.6 PRETRAINING HYPERPARAMETERS

For the Jaco domain, we use a skill dimension of 2 and a discriminator MLP hidden dimension of 64. We use an alpha value of 0 for the entropy weighting, as in (Laskin et al., 2022). We input the 3D position of the end-effector of the Jaco arm to the discriminator. For the Walker domain, we use a skill dimension of 2 and a discriminator MLP hidden dimension of 256. We use an alpha value of 0.7 for the entropy weighting. We input the displacement in the 3D position of the walker's torso to the discriminator. For the Quadruped domain, we use a skill dimension of 16 and a discriminator MLP hidden dimension of 128. We use an alpha value of 0.5 for the entropy weighting. We input the 3D velocity of the quadruped's body to the discriminator. We use a learning rate of 1e-4, a critic target tau parameter of 0.01, and a constant exploration standard deviation schedule of 0.2. The rest of the RL hyperparameters are as in (Laskin et al., 2021). For the Fetch Push environment, we use a skill dimension of 16 and a discriminator MLP hidden dimension of 16. We use an alpha value of 0 for entropy weighting. For the Fetch Reach environment, we use a skill dimension of 8 and a discriminator MLP hidden dimension of 64. We use an alpha value of 0 for entropy weighting. For all environments, we use a replay buffer size of 100k.

A.7 INTRINSIC REWARD MATCHING AND ENVIRONMENT ROLLOUT BASELINE

For illustrative purposes, we start the Env Rollout CEM plot at 50k steps to show that finetuning still occurs; however, sample-efficiency suffers due to excessive rollouts for skill selection, and this problem only worsens for long time horizons. IRM Gradient Descent is trained for 5000 steps with a learning rate of 5e-3 and initialized at the skill vector of all 0.5s. IRM Random selects 100 random skills. Env Rollout trials 10 random skills for a full episode each. Grid Search coarsely trials 10 skills from the skill of all 0s to the skill of all 1s, as in (Laskin et al., 2021).

A.8 PLANAR GOAL REACHING

The planar goal-reaching task consists of a simple 2D plane with a point mass that has a 2D Cartesian state space and can displace in the x and y coordinates with a 2D action space. The skills learned tend to span the 2D space, reaching diverse locations distributed broadly across the environment. We show sample zero-shot skill selection results over three different skill dimensions in Figure 6.

A.9 FINETUNING PERFORMANCE ON URLB

In Figure 7 we compare the finetuning sample-efficiency of IRM methods against environment-rollout baselines on the URLB Walker tasks. IRM performs skill selection with 0 environment interactions. The episode length of the URLB environments is 1000, meaning that in order to evaluate a single skill, rollout-based methods must exhaust 1000 environment steps (i.e., grid search spends 1000 * 10 = 10,000 environment steps, 10 percent of the available finetuning budget). By contrast, our method immediately uses new environment steps to improve the policy. As a result, the IRM-based approaches generally achieve greater sample efficiency, even when the initial skill selection obtains performance similar to that of the rollout-based methods. For illustrative purposes, we show Env CEM starting at 50k steps even though it far exceeds the 100k sample budget to select a skill before making any RL updates, due to having to execute full episode rollouts in the inner loop of optimization. This issue worsens with increasing episode lengths. We plot results over 3 seeds with standard error shading.

A.10 SEQUENTIAL SKILL SELECTION

For sequential skill selection, we compare IRM Sequential and Environment Sequential skill selection. IRM Sequential is an iterative process. The first skill is chosen entirely free of environment samples, exactly as in the single-skill tasks. Once the first skill is chosen, we roll out a trajectory with the skills chosen so far and use the latter half of the trajectory as the Pearson samples for our EPIC loss. We use Gaussian noise with variance 1 for our Canonical samples, as described in Appendix A.12.2. At each step of the skill selection process, we use the corresponding IRM optimization methods. For our Environment Sequential skill selection method, we also select skills iteratively. For each waypoint or subtask, we randomly sample N skills and commit to the best, where N = 10 / n_subtasks.

A.11 HIERARCHICAL REINFORCEMENT LEARNING BASELINE

In order to validate the benefits of IRM's offline skill selection, we compare against a baseline that leverages a conventional hierarchical RL algorithm to solve long-horizon, sequential tasks. We instantiate a TD3 manager agent that outputs into a skill action space from state input at a temporally abstract timescale. As in the IRM setup, this timescale is fixed to align with the changes in reward, encouraging the manager to change its skill prediction according to the change in reward semantics. The manager's output skill is then inputted to the low-level pretrained skill policy, which is rolled out over many steps with the skill fixed. Both the manager policy and the low-level policy weights are updated during finetuning. The manager agent is randomly initialized such that its initial skill prediction is random.

A.12 ADDITIONAL ABLATIONS

A.12.1 SKILL DIMENSION

We ablate skill dimension and evaluate the zero-shot performance of all skill selection methods. IRM's performance generally increases with increased skill dimension despite the discriminator overfitting issues associated with larger skill spaces. The IRM GD learning rate is chosen as 5e-3 for all experiments in this work and is not tuned, which likely explains the divergence of the 64-dimensional result.

A.12.2 PEARSON & CANONICAL DISTRIBUTIONS

We experiment with several ways to approximate the Pearson and Canonical distributions. We define Full Random to be uniform samples drawn from a reasonable estimate of the upper and lower bounds for each dimension of the state. For our planar environment, the bounds are defined explicitly and thus known; for more complex environments, we estimate the bounds. For example, for a tabletop manipulation workspace, we sample 2-dimensional block positions uniformly within the rectangular plane of the table surface. In practice, IRM is fairly robust to the distributions, though there are subtleties that emerge in the various choices for the Pearson and Canonical distributions. For instance, we also ablate a Uniform(0, 1) distribution, which generally performs much worse due to lack of state coverage for most environments. For the Canonical distribution, we also approximate samples by perturbing the Pearson samples with ϵ sampled from a Gaussian distribution; we experiment with the variance hyperparameter, which may be adjusted based on the environment. For our sequential IRM method, we use this Canonical distribution to ablate on-policy samples. None of the distributions ablated above require on-policy environment samples. It is possible to use on-policy samples for the state distributions, and we choose to do so for our sequential IRM method, as previous skill rollouts may provide useful Pearson samples for subsequent skill selection. Note that while on-policy Canonical samples are possible, they are incredibly expensive and require access to the environment simulator, so we focus on other choices of distributions.
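A sketch of the Gaussian perturbation variant follows; the standard deviation of 1 matches the variance-1 noise used for the sequential IRM Canonical samples (Appendix A.10), and the function name is illustrative.

```python
import torch

def gaussian_canonical_samples(pearson_states, std=1.0):
    """Approximate Canonical samples by perturbing Pearson samples with Gaussian noise epsilon."""
    return pearson_states + std * torch.randn_like(pearson_states)
```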



Figure 1: Intrinsic Reward Matching (IRM) Framework. IRM takes place in three stages: (1) Task-agnostic RL pretraining learns skill primitives in conjunction with a skill discriminator. (2) With no environment interaction, IRM minimizes the EPIC loss between the intrinsic reward parameterized by the discriminator and the extrinsic reward with respect to the skill vector z. (3) The skill policy conditioned on the optimal z* finetunes to the task reward to solve the downstream task.

Algorithm 1: Intrinsic Reward Matching (IRM)
Require: Downstream task T, Canonical distribution D_C, Pearson distribution D_P.
Require: Pretrained policy π_θ(a|s, z), intrinsic reward r_int(s, s′, z), and extrinsic reward r_ext(s, s′) for T.
Require: Optimization steps N_OP = 5000 and finetuning steps N_FT = 100K.
/* Skill selection of z* via the EPIC loss */
for N_OP steps do
    Sample a batch of Pearson samples S_P, S′_P ∼ D_P.
    Sample Canonical samples S_C, S′_C ∼ D_C.
    for s_i, s′_i in S_P, S′_P do
        Compute the EPIC loss D_EPIC(r_int(s_i, s′_i, z), r_ext(s_i, s′_i)) = D_ρ(C(r_int)(s_i, s′_i, S_C, S′_C), C(r_ext)(s_i, s′_i, S_C, S′_C)) as in Equation 3.
    end for
    Take an optimization step on the batch with respect to z (gradient descent, CEM step, etc.) as in Equation 5.
end for
Evaluate zero-shot performance and finetune the RL agent for N_FT steps with z* on downstream task T.

Figure 2: In our Fetch Push environment, we discover skills that move the block in different directions. Downstream tasks may involve simple goals or more distant goals that require composition of multiple skills across an extended time horizon and around obstacles.

Figure 4: (a) Scatter plot of extrinsic reward vs. EPIC loss. (b) Trajectories with low and high EPIC losses for planar goal-reaching. (c) Trajectories for sequential goal-reaching. (d) Trajectories for Fetch Reach.

Figure 6: Zero-Shot Returns for Planar Goal Reaching averaged over 5 seeds

Figure 7: IRM finetuning results compared to rollout-based baselines on Walker URLB tasks.

Table 1: IRM with various optimization methods compared to environment rollout-based skill selection, reward relabelling of pretraining data, and random skill selection. IRM-based methods rival or exceed skill selection baselines that are reliant on expensive environment trials.





Zero-shot rewards for DADS skill discovery algorithm.


Skill Dim | IRM CEM     | IRM GD        | IRM Rand    | Env Roll.   | Env CEM     | GS           | Rand
8         | 21.1 ± 0.51 | 15.7 ± 1.61   | 18.9 ± 0.18 | 18.4 ± 0.18 | 18.8 ± 0.48 | 17.9 ± 0.101 | 13.5 ± 1.85
16        | 17.4 ± 1.30 | 14.6 ± 0.63   | 18.8 ± 0.26 | 22.7 ± 0.83 | 23.1 ± 0.36 | 14.0 ± 0.19  | 11.2 ± 2.32
32        | 20.1 ± 0.54 | 22.537 ± 0.25 | 19.8 ± 0.14 | 22.2 ± 0.58 | 21.5 ± 0.67 | 24.0 ± 0.12  | 19.9 ± 0.67
64        | 21.9 ± 0.48 | 1.68 ± 0.069  | 20.9 ± 0.74 | 22.5 ± 0.70 | 21.6 ± 0.89 | 18.2 ± 0.059 | 13.3 ± 2.15

IRM methods and environment rollout methods ablated over multiple skill dimensions on Fetch Push.



A.12.3 SPARSE REWARD ABLATION

We ablate our planar EPIC loss visualizations with sparse rewards. Instead of a well-shaped goal-reaching reward, we use sparse rewards based on a tolerance to the goal. We define the tolerance as the radius the agent must be within when our 2D planar environment is scaled to [0, 1] x [0, 1]. With a very sparse reward, we show that EPIC losses are largely uninformative. However, by slightly relaxing the tolerance, we obtain a much better-shaped EPIC loss landscape that bears similarity to that of Figure 5. Thus, while our method depends on access to extrinsic rewards, and ideally shaped rewards, the EPIC loss landscape over sparse rewards with sufficient tolerance can still be optimized.


Figure 8: We examine EPIC losses between extrinsic rewards and intrinsic rewards conditioned on the skill vector. We sweep across the 2D skill vector for a pretrained planar agent. Left: sparse goal-reaching reward with a tolerance of 0.03. Right: sparse goal-reaching reward with a tolerance of 0.07.

