UNDERSTANDING HINDSIGHT GOAL RELABELING REQUIRES RETHINKING DIVERGENCE MINIMIZATION

Abstract

Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is simple: any trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, the connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, and yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains this connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behaviour cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of both actually hurts model performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances and new pathways of extending the success of generative models to RL.

1. INTRODUCTION

Goal reaching is an essential aspect of intelligence in sequential decision making. Unlike the conventional formulation of reinforcement learning (RL), which aims to encode all desired behaviors into a single scalar reward function that is amenable to learning (Silver et al., 2021), goal reaching formulates the problem of RL as applying a sequence of actions to rearrange the environment into a desired state (Batra et al., 2020). Goal-reaching is a highly flexible formulation. For instance, we can design the goal space to capture salient information about specific factors of variation that we care about (Plappert et al., 2018); we can use natural language instructions to define more abstract goals (Lynch & Sermanet, 2020; Ahn et al., 2022); we can encourage exploration by prioritizing previously unseen goals (Pong et al., 2019; Warde-Farley et al., 2018; Pitis et al., 2020); and we can even use self-supervised procedures to naturally learn goal-reaching policies without reward engineering (Pong et al., 2018; Nair et al., 2018b; Zhang et al., 2021; OpenAI et al., 2021).

Reward is not enough. Usually, rewards are manually constructed, either through laborious reward engineering or from task-specific optimal demonstrations, neither of which is a scalable solution. How can RL agents learn useful behaviors from unlabeled reward-free trajectories, similar to how NLP models such as BERT (Devlin et al., 2018) and GPT-3 (Brown et al., 2020) are able to learn language from unlabeled text corpora? Goal reaching is a promising paradigm for unsupervised behavior acquisition, but unlike language models, which simply predict the next token, it is unclear how to write down a well-defined objective for goal-conditioned policies. This paper aims to define such an objective, one that unifies many prior approaches and strengthens the foundation of goal-conditioned RL.
We start from the following observation: hindsight goal relabeling (Andrychowicz et al., 2017) can turn an arbitrary trajectory into a sub-optimal expert demonstration. Thus, goal-conditioned RL might be doing a special kind of imitation learning. Currently, divergence minimization is the de facto way to describe imitation learning methods (Ghasemipour et al., 2020), so we should be able to recast hindsight goal relabeling into the divergence minimization framework. On top of that, to tackle the sub-optimality of hindsight-relabeled trajectories, we should explicitly maximize the probability of reaching the desired goal. Following these intuitions, we derive the reward function used in HER from first principles, as well as other behaviour cloning (BC) like methods such as GCSL (Ghosh et al., 2019) and Hindsight BC (HBC) (Ding et al., 2019).

Experimentally, we show that multi-goal Q-learning based on HER-like rewards, when carefully tuned, can still outperform goal-conditioned BC (such as GCSL / HBC). Moreover, a vanilla combination of multi-goal Q-learning and BC (HER + HBC), supposedly combining the best of both worlds, in fact hurts performance. We use our unified framework to analyze when a BC loss could help, and propose a modified algorithm named Hindsight Divergence Minimization (HDM) that uses Q-learning to account for the worst while imitating the best. HDM avoids the pitfalls of HER + HBC and improves policy success rates on a variety of self-supervised goal-reaching environments. Additionally, our framework reveals a largely unexplored design space for goal-reaching algorithms and potential paths for importing generative modeling techniques into multi-goal RL.

2.1. THE REINFORCEMENT LEARNING (RL) PROBLEM

We first review the basics of RL and generative modeling. A Markov Decision Process (MDP) is typically parameterized by $(\mathcal{S}, \mathcal{A}, \rho_0, p, r)$: a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution $\rho_0(s)$, a dynamics function $p(s' \mid s, a)$ which defines the transition probability, and a reward function $r(s, a)$. A policy function $\mu$ defines a probability distribution $\mu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^+$. For an infinite-horizon MDP, given the policy $\mu$ and the state distribution at step $t$ (starting from $\rho_0$ at $t = 0$), the state distribution at step $t + 1$ is given by:

$$\rho^{t+1}_\mu(s') = \int_{\mathcal{S} \times \mathcal{A}} p(s' \mid s, a)\, \mu(a \mid s)\, \rho^t_\mu(s)\, ds\, da \tag{1}$$

The state visitation distribution sums over all timesteps via a geometric distribution $\mathrm{Geom}(\gamma)$:

$$\rho_\mu(s) = (1 - \gamma) \cdot \sum_{t=0}^{\infty} \gamma^t \cdot \rho^t_\mu(s) \tag{2}$$

However, the trajectory sampling process does not happen in this discounted manner, so the discount factor $\gamma \in (0, 1)$ is often absorbed into the cumulative return instead (Silver et al., 2014):

$$J(\mu) = \mathbb{E}_{\rho_0(s_0),\, \mu(a_t \mid s_t),\, p(s_{t+1} \mid s_t, a_t)} \Big[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \Big] \tag{3}$$

From equations 1 and 2, we also see that the future state distribution $p^+_\mu$ of policy $\mu$, defined as a geometrically discounted sum of the state distributions at all future timesteps given the current state and action, is given by the following recursive relationship (Eysenbach et al., 2020b; Janner et al., 2020):

$$p^+_\mu(s^+ \mid s, a) = (1 - \gamma)\, p(s^+ \mid s, a) + \gamma \int_{\mathcal{S} \times \mathcal{A}} p(s' \mid s, a)\, \mu(a' \mid s')\, p^+_\mu(s^+ \mid s', a')\, ds'\, da' \tag{4}$$

In multi-goal RL, an MDP is augmented with a goal space $\mathcal{G}$, and we learn a goal-conditioned policy $\pi : \mathcal{S} \times \mathcal{G} \times \mathcal{A} \to \mathbb{R}^+$. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) gives the agent a reward of 0 when the goal is reached and -1 otherwise, and uses hindsight goal relabeling to increase the learning efficiency of goal-conditioned Q-learning by replacing the initial behavioral goals with achieved goals (future states within the same trajectory).
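To make the recursion for the future state distribution concrete, the following is a minimal sketch that iterates it on a small tabular MDP. The random MDP, the array layout, and the function name `future_state_distribution` are illustrative assumptions and not part of the method itself.

```python
import numpy as np

def future_state_distribution(P, mu, gamma, iters=500):
    """Fixed-point iteration of the recursion for p^+_mu on a tabular MDP.

    P:  (S, A, S) array with P[s, a, t] = p(t | s, a)
    mu: (S, A) array with mu[s, a] = mu(a | s)
    Returns p_plus of shape (S, A, S), p_plus[s, a, g] = p^+_mu(g | s, a).
    """
    p_plus = np.zeros_like(P)
    for _ in range(iters):
        # v[t, g] = sum_a' mu(a' | t) * p_plus[t, a', g]
        v = np.einsum("ta,tag->tg", mu, p_plus)
        # (1 - gamma) * one-step dynamics + gamma * future of the next state
        p_plus = (1.0 - gamma) * P + gamma * np.einsum("sat,tg->sag", P, v)
    return p_plus

# Hypothetical random MDP, used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
mu = rng.random((S, A))
mu /= mu.sum(axis=-1, keepdims=True)
p_plus = future_state_distribution(P, mu, gamma)
```

In the tabular case the fixed point also has the closed form $(1 - \gamma)\, P\, (I - \gamma P_\mu)^{-1}$, where $P_\mu$ is the state-to-state transition matrix under $\mu$, which gives a simple sanity check for the iteration.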

2.2. IMITATION LEARNING (IL) AS DIVERGENCE MINIMIZATION

We first review the $f$-divergence between two probability distributions $P$ and $Q$ and its variational bound:

$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \;\geq\; \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))] \tag{5}$$

where $f$ is a convex function such that $f(1) = 0$, $f^*$ is the convex conjugate of $f$, and $\mathcal{T}$ is an arbitrary class of functions $T : \mathcal{X} \to \mathbb{R}$. This variational bound was originally derived in (Nguyen et al., 2010) and was popularized by GAN (Goodfellow et al., 2014; Nowozin et al., 2016) and subsequently by imitation learning (Ho & Ermon, 2016; Fu et al., 2017; Finn et al., 2016; Ghasemipour et al., 2020). The equality holds under mild conditions (Nguyen et al., 2010), and the optimal $T$ is given by $T^*(x) = f'(p(x)/q(x))$. The canonical formulation of imitation learning follows (Ho & Ermon, 2016; Ghasemipour et al., 2020), where $\rho_{\text{exp}}(s, a)$ is from the expert:

$$\min_\mu D_f(\rho_{\text{exp}}(s, a) \,\|\, \rho_\mu(s, a)) \;\Leftrightarrow\; \min_\mu \max_T \; \mathbb{E}_{\rho_{\text{exp}}(s, a)}[T(s, a)] - \mathbb{E}_{\rho_\mu(s, a)}[f^*(T(s, a))] \tag{6}$$

Because of the policy gradient theorem (Sutton et al., 1999), the policy $\mu$ needs to optimize the cumulative return under its own trajectory distribution $\rho_\mu(s, a)$ with the reward being $r(s, a) = f^*(T(s, a))$. Under this formulation, the Jensen-Shannon divergence leads to GAIL (Ho & Ermon, 2016), and the reverse KL leads to AIRL (Fu et al., 2017). Note that $f$ can in principle be any convex function (we can satisfy $f(1) = 0$ by simply adding a constant).
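As a small numerical illustration of the variational bound, the sketch below uses the KL divergence, i.e. $f(u) = u \log u$ with conjugate $f^*(t) = \exp(t - 1)$ and $f'(u) = \log u + 1$. The two discrete distributions are arbitrary and chosen only for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    # D_f(P || Q) for f(u) = u log u, i.e. sum_x q(x) f(p(x) / q(x))
    return float(np.sum(p * np.log(p / q)))

def variational_bound(p, q, T):
    # E_P[T] - E_Q[f*(T)] with f*(t) = exp(t - 1)
    return float(np.sum(p * T) - np.sum(q * np.exp(T - 1.0)))

# Arbitrary discrete distributions (illustrative assumption).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

T_star = 1.0 + np.log(p / q)            # optimal critic T* = f'(p / q)
kl = kl_divergence(p, q)
tight = variational_bound(p, q, T_star)  # attains the divergence
loose = variational_bound(p, q, T_star + np.array([0.3, -0.2, 0.1]))
```

Evaluating the bound at $T^*$ recovers the divergence, while any perturbed critic gives a strictly smaller value, which is what makes the inner maximization over $T$ meaningful.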

2.3. ENERGY-BASED MODELS (EBM)

An energy-based model (EBM) is defined by $p_\theta(x) = \exp(-E_\theta(x)) / Z(\theta)$, where $E_\theta : \mathbb{R}^D \to \mathbb{R}$ is the energy function and $Z(\theta) = \int_{\mathcal{X}} \exp(-E_\theta(x))\, dx$ is the partition function. The gradient of the log-likelihood $\log p_\theta(x)$ w.r.t. $\theta$ (known as contrastive divergence (Hinton, 2002)) can be expressed as:

$$\frac{\partial \log p_\theta(x)}{\partial \theta} = \mathbb{E}_{p_\theta(x')}\!\left[\frac{\partial E_\theta(x')}{\partial \theta}\right] - \frac{\partial E_\theta(x)}{\partial \theta} \tag{7}$$

Sampling from $p_\theta(x)$ can be difficult, as Langevin dynamics is often required (Welling & Teh, 2011; Du & Mordatch, 2019; Grathwohl et al., 2019). Alternatively, EBMs can be trained via Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2012; Mnih & Kavukcuoglu, 2013; Gao et al., 2020; Rhodes et al., 2020), which assumes access to a "noise" distribution $p_n(x)$ and learns energy functions through density ratio estimation. Let $s_\theta(x) = -E_\theta(x) - \log Z(\theta)$ (Mnih & Teh, 2012). NCE learns a classifier that distinguishes the data distribution $p(x)$ from noise $p_n(x)$, where noise samples are $k$ times more frequent (Mnih & Kavukcuoglu, 2013; Mikolov et al., 2013):

$$p(D = 1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + k \cdot p_n(x)} = \sigma\big(s_\theta(x) - \log p_n(x) - \log k\big) \tag{8}$$

To be more concise, we denote $\Delta_\theta(x, k) = s_\theta(x) - \log p_n(x) - \log k$. Gradients of the weighted binary cross entropy loss asymptotically approximate contrastive divergence (equation 7) (Mnih & Teh, 2012):

$$\frac{d}{d\theta}\Big( \mathbb{E}_{p(x)}\big[\log \sigma(\Delta_\theta(x, k))\big] + k \cdot \mathbb{E}_{p_n(x)}\big[\log(1 - \sigma(\Delta_\theta(x, k)))\big] \Big) \;\xrightarrow{k \to \infty}\; \frac{\partial \log p_\theta(x)}{\partial \theta} \tag{9}$$

Closely related to NCE is the more recent InfoNCE loss (Van den Oord et al., 2018), which has gained popularity in the contrastive learning setting (Chen et al., 2020; He et al., 2020); we can interpret $k$ in NCE as the batch size of negative samples in InfoNCE. In summary, NCE gives EBMs a more tractable way to maximize data likelihood.
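The identity behind the NCE classifier can be checked numerically. The sketch below assumes a toy discrete EBM over four outcomes with a uniform noise distribution; the energies, the value of `k`, and the helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_logit(energy, log_Z, log_pn, k):
    # Delta_theta(x, k) = s_theta(x) - log p_n(x) - log k,
    # with s_theta(x) = -E_theta(x) - log Z(theta)
    s_theta = -energy - log_Z
    return s_theta - log_pn - np.log(k)

# Toy discrete EBM (illustrative assumption): energies of four outcomes.
energy = np.array([0.1, 1.3, 0.7, 2.0])
log_Z = np.log(np.sum(np.exp(-energy)))
p_theta = np.exp(-energy - log_Z)
p_noise = np.full(4, 0.25)   # uniform noise distribution
k = 5                        # noise samples are k times more frequent

# Posterior that a sample is real data, computed two equivalent ways.
posterior_direct = p_theta / (p_theta + k * p_noise)
posterior_logit = sigmoid(nce_logit(energy, log_Z, np.log(p_noise), k))
```

Both expressions agree, which is why the classifier can be trained on the logit $\Delta_\theta$ directly.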

3. GRAPHICAL MODELS FOR HINDSIGHT GOAL RELABELING

Consider the following setting: we are given an environment $\xi = (\rho_0(s), p(s' \mid s, a), \rho^+(g))$ in which we generate a dataset of trajectories $\mathcal{D} = \{(s_0, a_0, s_1, a_1, \cdots)\}$ by sampling from the initial state distribution $\rho_0(s)$, an unobserved actor policy $\mu(a \mid s)$, and the dynamics $p(s' \mid s, a)$. We aim to train a goal-conditioned policy $\pi(a \mid s, g)$ from this arbitrary dataset with relabeled future states as the goals. $\rho^+(g)$ is the behavioral goal distribution, assumed to be given a priori by the environment. To recast the problem of goal-reaching as imitation learning (equation 6), we need to set up an $f$-divergence minimization where we define the target (expert) distribution and the policy distribution we want to match.

The training signal comes from factorizing the joint distribution of state-action-goal differently. For the relabeled target distribution, we assume an unconditioned actor $\mu$ generating a state-action distribution at first, with the goals coming from the future state distribution (equation 4) conditioned on the given state and action. For the goal-conditioned policy distribution, behavioral goals are given a priori, and the state-action distribution is generated conditioned on the behavioral goals. Thus, the target distribution (see Figure 2) for states, actions, and hindsight goals is:

$$p_\mu(s, a, s^+) = \rho_\mu(s, a)\, p^+_\mu(s^+ \mid s, a) \tag{10}$$

Note that $p^+_\mu$ is given by equation 4, and $\rho_\mu(s, a)$ is similar to $\rho_{\text{exp}}(s, a)$ in equation 6. In the fashion of behavioral cloning (BC), if we do not care about matching the states, we can write the joint distribution we are trying to match as (see Figure 3):

$$p^{BC}_\pi(s, a, g) = \rho^+(g)\, \rho_\mu(s)\, \pi(a \mid s, g) \tag{11}$$

We can recover the objective of Hindsight Behavior Cloning (HBC) (Ding et al., 2019; Eysenbach et al., 2020a) and Goal-Conditioned Supervised Learning (GCSL) (Ghosh et al., 2019) via minimizing a KL-divergence:

$$\min_\pi D_{KL}\big(p_\mu(s, a, s^+) \,\|\, p^{BC}_\pi(s, a, g)\big) \;\Leftrightarrow\; \min_\pi \mathbb{E}_{\rho_\mu(s, a)\, p^+_\mu(s^+ \mid s, a)}\big[-\log \pi(a \mid s, g)\big] \tag{12}$$

In many cases, matching the states is more important than matching state-conditioned actions (Ross et al., 2011; Ghasemipour et al., 2020). The joint distribution for states, actions, and behavioral goals for $\pi$ (see Figure 4) is:

$$p_\pi(s, a, g) = \rho^+(g)\, \rho_\pi(s, a \mid g) \tag{13}$$

However, it is important to recognize that even after adding this state-matching part, there is still a missing component of the objective. Divergence minimization encourages the agent to stay in the "right" state-action distribution given the benefit of hindsight, but we still need the policy to actually hit the goal (see Figure 1). When we condition the policy on a goal, we should maximize the likelihood of seeing that goal in the future state distribution (see Figure 5). Together with a maximum entropy regularization on the policy $\mathcal{H}(\pi)$ (Ho & Ermon, 2016; Schulman et al., 2017), we propose the following goal-reaching objective:

$$\min_\pi \; \underbrace{D_f\big(\rho^+(g)\, \rho_\pi(s, a \mid g) \,\|\, p^+_\mu(s^+ \mid s, a)\, \rho_\mu(s, a)\big)}_{\text{(a) } f\text{-divergence term}} \; \underbrace{-\, \beta\, \mathbb{E}_{\rho^+(g)\, \rho_\mu(s)\, \pi(a \mid s, g)}\big[\log p^+_\pi(g \mid s, a)\big]}_{\text{(b) goal likelihood term}} \; \underbrace{-\, \lambda\, \mathcal{H}(\pi)}_{\text{(c) entropy}} \tag{14}$$

Compared to equation 12 and equation 6, the $f$-divergence term in equation 14(a) swaps the order between the expert and the policy. The coefficient $\beta$ controls the importance of "hitting the goal". This incentive is already implicit in hindsight-relabeled data, as in many cases a BC objective alone can learn to hit the goal as well (Ding et al., 2019; Ghosh et al., 2019; Lynch et al., 2019; Jang et al., 2021). However, hindsight-relabeled trajectories are often sub-optimal demonstrations: not every action is acting towards achieving the goal. Thus, maximizing the likelihood of achieving the goal is crucial, and it plays a key role in deriving the reward function of HER (Andrychowicz et al., 2017).
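Sampling from the relabeled target distribution $p_\mu(s, a, s^+)$ can be sketched as follows: for each step of a stored trajectory, a future state is drawn with a geometrically distributed offset and used as the hindsight goal. The function name and the choice to clip offsets at the end of a finite trajectory are assumptions made for illustration.

```python
import numpy as np

def relabel_trajectory(states, actions, gamma, rng):
    """Build (state, action, hindsight goal) tuples from one trajectory.

    states has length T + 1 and actions has length T. The offset k >= 1 is
    drawn with probability (1 - gamma) * gamma**(k - 1), mirroring the
    geometric weighting of the future state distribution. Clipping at the
    final state is an assumption for finite trajectories.
    """
    T = len(actions)
    data = []
    for t in range(T):
        offset = int(rng.geometric(1.0 - gamma))
        goal_index = min(t + offset, T)
        data.append((states[t], actions[t], states[goal_index]))
    return data

# Toy trajectory whose states are just their time indices (illustrative).
rng = np.random.default_rng(0)
states = list(range(6))
actions = [0, 1, 0, 1, 1]
relabeled = relabel_trajectory(states, actions, gamma=0.9, rng=rng)
```

A BC-style method would then maximize $\log \pi(a \mid s, g)$ on these tuples with $g$ set to the relabeled goal, whereas the Q-learning route uses the same tuples to form temporal-difference targets.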

4. BRIDGING GOAL-REACHING AND GENERATIVE MODELING

In this section, we study how to optimize the proposed unifying objective (equation 14). We show that the Q-function trained for goal-reaching, $Q_\theta(s, a, g)$, is implicitly doing generative modeling. Its temporal difference models the density ratio between the hindsight-relabeled distribution and the goal-conditioned behavioral distribution. It also approximates an energy-based model for the future-state distribution when marginalized over the goal distribution. In section 4.1, we demonstrate how goal-conditioned Q-learning does divergence minimization (equation 14(a)). In section 4.2, we show how Q-functions can define an EBM for cumulative future states, which in turn allows the policy to maximize goal likelihood (equation 14(b) and (c)). In section 4.3, we show that combining the derived results yields the HER reward. In section 4.4, we study when a BC loss can help by analyzing the loss terms left out by the HER reward.

4.1. DIVERGENCE MINIMIZATION WITH GOAL-CONDITIONED Q-LEARNING

This section decomposes equation 14(a). We start with the $f$-divergence bound from equation 5 and equation 6:

$$D_f\big(p_\pi(s, a, g) \,\|\, p_\mu(s, a, s^+)\big) = \max_T \; \mathbb{E}_{p(g)\, \rho_\pi(s, a \mid g)}[T(s, a, g)] - \mathbb{E}_{\rho_\mu(s, a)\, p^+_\mu(s^+ \mid s, a)}[f^*(T(s, a, s^+))] \tag{15}$$

Now we negate $T$ to get $r(s, a, g) = -T(s, a, g)$, and the divergence minimization problem becomes:

$$\max_\pi \min_r \; \mathbb{E}_{\rho_\mu(s, a)\, p^+_\mu(s^+ \mid s, a)}[f^*(-r(s, a, s^+))] + \mathbb{E}_{p(g)\, \rho_\pi(s, a \mid g)}[r(s, a, g)] \tag{16}$$

We can interpret $r$ as a GAIL-style (Ho & Ermon, 2016) discriminator or reward. However, we aim to derive a discriminator-free learning process that directly trains the Q-function corresponding to this reward, $Q_\theta(s, a, g) = r(s, a, g) + \gamma \cdot P^\pi Q(s, a, g)$, where $P^\pi$ is the transition operator: $P^\pi Q(s, a, g) = \mathbb{E}_{p(s' \mid s, a)\, \pi(a' \mid s', g)}[Q_\theta(s', a', g)]$. Re-writing equation 16 w.r.t. $Q$:

$$\min_Q \; \mathbb{E}_{\rho_\mu(s, a)\, p^+_\mu(s^+ \mid s, a)}\big[f^*\big(-(Q_\theta - \gamma \cdot P^\pi Q)(s, a, s^+)\big)\big] + \mathbb{E}_{p(g)\, \rho_\pi(s, a \mid g)}\big[(Q_\theta - \gamma \cdot P^\pi Q)(s, a, g)\big] \tag{17}$$

A similar change of variable has been explored in the context of offline RL (Nachum et al., 2019a; b) and imitation learning (Kostrikov et al., 2019; Zhu et al., 2020); we may call those methods the DICE (Nachum & Dai, 2020) family. The major pain point of DICE-like methods is that they require samples from the initial state distribution $\rho_0(s)$ (Garg et al., 2021). Here, the following lemma shows that in the goal-conditioned case, we can use arbitrary offline trajectories to evaluate the expected rewards under goal-conditioned online rollouts:

Lemma 4.1 (Online-to-offline transformation for goal reaching). Given a goal-conditioned policy $\pi(a \mid s, g)$, its corresponding Q-function $Q^\pi(s, a, g)$, and an arbitrary state-action visitation distribution $\rho_\mu(s, a)$ of another policy $\mu(a \mid s)$, the expected temporal difference for online rollouts under $\pi$ is:

$$\mathbb{E}_{p(g)\, \rho_\pi(s, a \mid g)}\big[(Q^\pi - \gamma \cdot P^\pi Q^\pi)(s, a, g)\big] = \mathbb{E}_{p(g)\, \rho_\mu(s, a)\, \pi(\tilde{a} \mid s, g)}\big[Q^\pi(s, \tilde{a}, g) - \gamma \cdot P^\pi Q^\pi(s, a, g)\big]$$

Using Lemma 4.1, the objective in equation 16 now becomes:

$$\max_\pi \min_Q \; \mathbb{E}_{\rho_\mu(s, a)\, p^+_\mu(s^+ \mid s, a),\; p(g),\, \pi(\tilde{a} \mid s, g)}\big[f^*\big(-(Q_\theta - \gamma P^\pi Q)(s, a, s^+)\big) + Q_\theta(s, \tilde{a}, g) - \gamma P^\pi Q(s, a, g)\big] \tag{18}$$

The function $f^*$ is the convex conjugate of $f$ in the $f$-divergence. We can pick almost any convex function as $f^*$ as long as $((f^*)^*)(1) = 0$. For instance, in the tradition of Q-learning, we can use:

$$f^*(x) = (x + r)^2 / 2 + c \tag{19}$$

where $r$ and $c$ are constants. Its convex conjugate is $f(x) = (f^*)^*(x) = x^2 / 2 - r x - c$. For any $r$, we can always pick a $c$ that ensures $f(1) = (f^*)^*(1) = 0$, and $c$ does not affect the learning process. In summary, we have derived a way to minimize the first term $D_f(p_\pi(s, a, g) \,\|\, p_\mu(s, a, s^+))$ in equation 14 directly using goal-conditioned Q-learning.
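Lemma 4.1 can be checked numerically on a small tabular MDP. The sketch below fixes a single goal (so the goal argument is dropped), uses randomly generated dynamics and policies, and computes discounted occupancies in closed form. All names and the random test table `Q` are illustrative assumptions; the telescoping argument in the proof does not depend on which table is used.

```python
import numpy as np

def normalize(x):
    return x / x.sum(axis=-1, keepdims=True)

def occupancy(P, policy, rho0, gamma):
    """Discounted state-action occupancy rho(s, a) of a tabular policy."""
    P_pol = np.einsum("sa,sat->st", policy, P)  # state-to-state transitions
    # rho(s) = (1 - gamma) * rho0^T (I - gamma * P_pol)^{-1}
    rho_s = (1.0 - gamma) * np.linalg.solve(
        np.eye(len(rho0)) - gamma * P_pol.T, rho0)
    return rho_s[:, None] * policy

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = normalize(rng.random((S, A, S)))  # dynamics p(s' | s, a)
pi = normalize(rng.random((S, A)))    # goal-conditioned policy at a fixed goal
mu = normalize(rng.random((S, A)))    # unrelated actor policy
rho0 = normalize(rng.random(S))       # shared initial state distribution
Q = rng.normal(size=(S, A))           # random table used for the check

V_pi = np.sum(pi * Q, axis=1)         # E_{pi(a | s)}[Q(s, a)]
PQ = np.einsum("sat,t->sa", P, V_pi)  # (P^pi Q)(s, a)

# Online side: temporal difference under the occupancy of pi.
lhs = np.sum(occupancy(P, pi, rho0, gamma) * (Q - gamma * PQ))
# Offline side: states and actions from mu, current action resampled from pi.
rhs = np.sum(occupancy(P, mu, rho0, gamma) * (V_pi[:, None] - gamma * PQ))
```

Both sides reduce to $(1 - \gamma)\, \mathbb{E}_{\rho_0(s)\, \pi(a \mid s)}[Q(s, a)]$, which is the quantity that DICE-style methods would otherwise need initial-state samples to estimate.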

4.2. EBM FOR PREDICTING (AND CONTROLLING) THE FUTURE

In this section, we study how to maximize the goal likelihood term in equation 14(b) by defining an energy-based model (EBM) to predict the future-state distribution for a policy, which the policy can in turn optimize to control the future-state distribution. We start with Bayes' rule:

$$p^+_\pi(s^+ \mid s, a) = \frac{\rho^+(s^+)\, \rho_\pi(s, a \mid s^+)}{\mathbb{E}_{\rho^+(g)}[\rho_\pi(s, a \mid g)]} \tag{20}$$

This relationship reflects that, if we define an EBM for $\rho_\pi(s, a \mid s^+) = \exp q(s, a, s^+) / Z_q(s^+)$, where $Z_q(s^+) = \int_{\mathcal{S} \times \mathcal{A}} \exp q(s, a, s^+)\, ds\, da$, then we can contrast the dynamics defined in equation 4 with the marginal goal distribution defined in equation 13. In a slightly overloaded notation, we will now define a Q-function as $Q_\theta(s, a, g) = q(s, a, g) - \log Z_q(g)$. For now, we shall assume that this Q-function is separate from the one defined in equation 18. Rearranging equation 20, and setting $\rho^+(g)$ to be $\rho^+(s^+)$, we see that the density ratio can be expressed as:

$$\frac{p^+_\pi(s^+ \mid s, a)}{\rho^+(s^+)} = \frac{\rho_\pi(s, a \mid s^+)}{\mathbb{E}_{\rho^+(g)}[\rho_\pi(s, a \mid g)]} = \frac{\exp Q_\theta(s, a, s^+)}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]} \tag{21}$$

Lemma 4.2 (Gradient of the noise-contrastive term in energy-based goal-reaching). Given the following definition for the logit of an NCE-like binary classifier, with $\rho^+(g) = \rho^+(s^+)$:

$$\Delta_\theta(s, a, g, k) = Q_\theta(s, a, g) - \log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)] - \log k \tag{22}$$

the gradient of the negative NCE term in the density ratio estimation approaches zero as $k \to \infty$:

$$\frac{d}{d\theta}\, \mathbb{E}_{\rho_\mu(s, a)\, \rho^+(g)}\big[k \cdot \log\big(1 - \sigma(\Delta_\theta(s, a, g, k))\big)\big] \;\xrightarrow{k \to \infty}\; 0$$

As for the positive classification term in NCE (equation 9), we can make a similar argument that $\nabla_\theta \log(1 + \exp \Delta_\theta(s, a, s^+, k)) \xrightarrow{k \to \infty} 0$ (see appendix B.2). We have:

$$\arg\max_Q \; \mathbb{E}_{\rho_\mu(s, a)}\Big[\mathbb{E}_{p^+_\pi(s^+ \mid s, a)}[Q_\theta(s, a, s^+)] - \log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]\Big] \tag{23}$$

To optimize the above objective on arbitrary behaviour data, we have to consider the fact that we do not have complete access to the distribution of "positive samples" $p^+_\pi(s^+ \mid s, a)$, as sampling directly from this distribution requires on-policy rollouts. Utilizing importance weights to learn from off-policy data while accounting for hindsight bias yields the following loss minimization instead:

$$\mathbb{E}_{\rho_\mu(s, a)}\Big[\underbrace{-(1 - \gamma) \cdot \mathbb{E}_{p(s' \mid s, a)}[Q_\theta(s, a, s')]}_{\text{Learning single-step dynamics in } Q} + \underbrace{\mathbb{E}_{\rho^+(g)\, p(s' \mid s, a)}[w(s, a, s', g) \cdot Q_\theta(s, a, g)]}_{\text{Learning multi-step dynamics in } Q}\Big] \tag{24}$$

where $w(s, a, s', g)$ contrasts the goals that will likely be met by the agent beyond a single step of dynamics $p(s' \mid s, a)$ in the future with other random goals that the agent has seen in the past:

$$w(s, a, s', g) = \underbrace{\frac{\exp Q(s, a, g)}{\mathbb{E}_{\rho^+(g)}[\exp Q(s, a, g)]}}_{\text{Pushing down the likelihood of the marginal}} - \underbrace{\gamma \cdot \frac{\sum_a \exp Q(s', a, g) / |\mathcal{A}|}{\mathbb{E}_{\rho^+(g)}\big[\sum_a \exp Q(s', a, g) / |\mathcal{A}|\big]}}_{\text{Pushing up the likelihood of the conditional}} \tag{25}$$

See Appendix B.3 for a full derivation.
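The weight in equation 25 is straightforward to evaluate for a tabular Q-function with a finite goal set. The sketch below is a minimal illustration under those assumptions; the table shape `(S, A, G)`, the uniform goal distribution, and the function name are hypothetical.

```python
import numpy as np

def hindsight_weight(Q, rho_goal, s, a, s_next, gamma):
    """Evaluate w(s, a, s', g) from equation 25 for every goal g.

    Q:        (S, A, G) table of goal-conditioned Q-values
    rho_goal: (G,) probabilities of the goal distribution rho^+(g)
    """
    # Marginal term: exp Q(s, a, g) normalized by its mean over goals.
    marginal = np.exp(Q[s, a])
    marginal = marginal / np.sum(rho_goal * marginal)
    # Conditional term: action-averaged exp Q(s', a, g), same normalization.
    conditional = np.exp(Q[s_next]).mean(axis=0)
    conditional = conditional / np.sum(rho_goal * conditional)
    return marginal - gamma * conditional

# Random table and uniform goal distribution (illustrative assumptions).
rng = np.random.default_rng(0)
S, A, G, gamma = 4, 3, 4, 0.9
Q = rng.normal(size=(S, A, G))
rho_goal = np.full(G, 1.0 / G)
w = hindsight_weight(Q, rho_goal, s=0, a=1, s_next=2, gamma=gamma)
```

Because both ratios average to one under $\rho^+(g)$, the weights average to $1 - \gamma$ over goals. Goals with negative weight are those whose Q-value at $(s, a)$ is pushed up by the multi-step term in equation 24.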

4.3. DERIVING HER REWARDS

For a goal-conditioned Q-function $Q_\theta(s, a, g)$, besides optimizing a Bellman residual as a regular $Q(s, a)$ function would do, it has an additional degree of freedom for the goal $g$, which can be used to define an EBM that models the future-state distribution. In such an EBM, the Q-function is used as the negative energy defined in equation 21. Combining the first term in the $f$-divergence minimization part (equation 18) and the single-step dynamics loss term in the EBM learning (equation 24) gives us:

$$\arg\min_Q \; \mathbb{E}_{\rho_\mu(s, a)\, p(s' \mid s, a)\, p^+_\mu(s^+ \mid s, a)}\big[f^*\big(-(Q_\theta - \gamma P^\pi Q)(s, a, s^+)\big) - \beta \cdot (1 - \gamma)\, Q_\theta(s, a, s')\big] \tag{26}$$

The crucial observation here is that, given that $f^*$ is a quadratic (equation 19) and that there is a stop-gradient on $P^\pi Q$ because it uses a target network, the loss above can be re-packaged into a single quadratic loss of Bellman residuals for a specific reward, thanks to the property of the dynamics defined in equation 4:

$$\arg\min_Q \; \mathbb{E}_{\rho_\mu(s, a)\, p(s' \mid s, a)\, p^+_\mu(s^+ \mid s, a)}\Big[\tfrac{1}{2}\big(r(s, a, s', s^+) + (\gamma P^\pi Q - Q_\theta)(s, a, s^+)\big)^2\Big] \tag{27}$$

$$r(s, a, s', s^+) = \begin{cases} r + \beta, & s' = s^+ \\ r, & s' \neq s^+ \end{cases} \tag{28}$$

We now proceed to answer the following question: when does BC help? Our idea is to identify the conditions for a BC-like loss to emerge in the remaining losses unaccounted for by the HER reward: the loss term in equation 18 that pushes the Q-values of the current policy down, and the multi-step dynamics term in the EBM loss (equation 24). We may argue that, by not pushing the Q-values of the current policy down, the HER agent becomes more exploratory; by not pushing up the Q-values of the discounted future-state distribution beyond a single step, the HER agent is encouraged to reach a goal sooner rather than later. Nevertheless, we shall see that combining those two remaining terms produces a BC-like term, which imitates the best actions based on how much an action moves the agent closer to the goal.

We start with the following property of Boltzmann policies $\pi(a \mid s, g) \propto \exp Q(s, a, g)$ (Haarnoja et al., 2017; Schulman et al., 2017), which we can apply thanks to the entropy regularization in equation 14:

$$\mathbb{E}_{\pi(a \mid s, g)}[Q(s, a, g)] = \log \sum_a \exp Q(s, a, g) - \mathcal{H}(\pi) \tag{29}$$

In equation 25, the Q-value of a particular action $a$ is pushed up under the following condition:

$$\frac{\exp Q(s, a, g)}{\mathbb{E}_{\rho^+(g)}[\exp Q(s, a, g)]} < \gamma \cdot \frac{\sum_a \exp Q(s', a, g) / |\mathcal{A}|}{\mathbb{E}_{\rho^+(g)}\big[\sum_a \exp Q(s', a, g) / |\mathcal{A}|\big]} \tag{30}$$

Both denominators on the left and right sides are averaged over all possible behavioral goals, so their values should be roughly the same. Moreover, we approximate the average-exponential operation by using the max operation already computed in the backup operator (either exact (Mnih et al., 2015; Van Hasselt et al., 2016) or approximate (Lillicrap et al., 2015; Haarnoja et al., 2018)) and lower the threshold $\gamma_{hdm}$ (compared to $\gamma$) accordingly. After those changes, the indicator function that decides whether the value of an action $a$ should be pushed up becomes:

$$\hat{w}(s, a, s', g) = \mathbb{1}\big(\exp Q(s, a, g) - \gamma_{hdm} \cdot \exp \max_{a'} Q(s', a', g) < 0\big) \tag{31}$$

Combining with the first term in equation 29, we arrive at a BC-like loss (compare equation 12), but with an indicator weighting:

$$\mathcal{L}_{hdm}(Q) = \mathbb{E}_{\rho_\mu(s, a)\, p(s' \mid s, a)\, \rho^+(g)}\Big[-\hat{w}(s, a, s', g) \cdot \Big(Q_\theta(s, a, g) - \log \sum_{\mathcal{A}} \exp Q_\theta(s, \cdot, g)\Big)\Big] \tag{32}$$

Intuitively, this term imitates a particular action when the value functions believe that this action can move the agent closer to the goal by at least $-\log \gamma_{hdm}$ steps. The idea of imitating the best actions is similar to self-imitation learning (SIL) (Oh et al., 2018; Vinyals et al., 2019), but our Algorithm 1 operates in a (reward-free) goal-reaching setting, with the advantage function in SIL being replaced by the delta of reachability that a particular action can produce in getting closer to the goal.
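The indicator in equation 31 and the weighted BC-like loss in equation 32 can be sketched for a tabular Q-function with discrete actions as follows. The table layout, the batch format, and the helper names are assumptions made for illustration; a practical implementation would use a neural Q-function and a target network.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def hdm_bc_loss(Q, s, a, s_next, g, gamma_hdm):
    """Indicator-weighted BC-like loss on a batch of relabeled transitions.

    Q: (S, A, G) table; s, a, s_next, g: integer arrays of shape (B,).
    Returns the mean loss and the per-sample indicator w_hat.
    """
    q_sa = Q[s, a, g]                          # Q(s, a, g)
    q_next_max = Q[s_next, :, g].max(axis=1)   # max_a' Q(s', a', g)
    # Imitate only when the next state is valued sufficiently higher.
    w_hat = (q_sa - q_next_max < np.log(gamma_hdm)).astype(float)
    log_pi = q_sa - logsumexp(Q[s, :, g], axis=1)  # log of Boltzmann policy
    return float(np.mean(-w_hat * log_pi)), w_hat

# Tiny hand-built example: 2 states, 2 actions, 1 goal (illustrative).
Q = np.array([[[-3.0], [-1.0]],
              [[-1.0], [-2.0]]])
s = np.array([0, 0])
a = np.array([0, 1])
s_next = np.array([1, 0])
g = np.array([0, 0])
loss, w_hat = hdm_bc_loss(Q, s, a, s_next, g, gamma_hdm=0.5)
```

In this toy batch only the first transition is imitated, because its next-state value exceeds $Q(s, a, g)$ by more than $-\log \gamma_{hdm}$; the second transition leaves the agent where it was and is ignored by the BC-like term.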

Algorithm 1 Hindsight Divergence Minimization

Given: Batch of data $\{(s, a, s', s^+)\}$, where $s^+$ is sampled via hindsight relabeling.
1: $L = \mathrm{MSE}\big(Q(s, a, s^+) - \gamma \max_{a'} Q(s', a', s^+),\; r\big)$ ▷ $r$ is a constant, default $r = -1$
2: $L = L - \beta (1 - \gamma_{hdm}) \cdot Q(s, a, s')$ ▷ Push up the values for reaching next states
3: $L_{BC} = Q(s, a, s^+) - \log \sum_{\mathcal{A}} \exp Q(s, \cdot, s^+)$ ▷ Behaviour Cloning like loss
4: $L = L - \beta \cdot \mathbb{1}\big(Q(s, a, s^+) - \max_{a'} Q(s', a', s^+) < \log \gamma_{hdm}\big) \cdot L_{BC}$
5: minimize $L$, update the target network $Q$

HDM uses Q-learning to account for the worst while imitating the best during the goal-reaching process.

5. EXPERIMENTS

5.1. SELF-SUPERVISED GOAL-REACHING SETUP

We consider the self-supervised goal-reaching setting for the environments described in GCSL (Ghosh et al., 2019). While the original HER (Andrychowicz et al., 2017) assumes that the agent has direct access to the ground truth binary reward metric (which is used when relabeling is performed), we do not make such an assumption, as it can be unrealistic for real-world robot learning (Lin et al., 2019). Instead, we simply use next-state relabeling to provide positive rewards. As seen in equation 28, a positive reward is provided only when the relabeled hindsight goal is the immediate next state. The benefit of the next-state relabeling reward is that the training procedure is now completely self-supervised, and therefore provides a fair comparison to GCSL (Ghosh et al., 2019).

5.2. COMPARISONS AND ABLATIONS

As shown in Table 1, we compare HDM to the following baselines in terms of their goal-reaching abilities: GCSL (Ghosh et al., 2019) / HBC (Ding et al., 2019), HER (Andrychowicz et al., 2017) with (0, 1) rewards, HER with Soft Q-Learning (SQL) (Schulman et al., 2017), HER with (-1, 0) rewards, and HER + HBC. HER with (0, 1) rewards gives the agent a reward of 1 when the goal is reached and 0 otherwise; HER with (-1, 0) rewards gives a reward of 0 when a goal is reached and -1 otherwise. Those two types of rewards lead to different learning dynamics because the Q-function is initialized to output values around 0. We also compare against HER + SQL, because soft Q-learning is known to improve the robustness of learning (Haarnoja et al., 2018). HDM builds on top of HER with (-1, 0) rewards and adds a BC-like loss with a clipping condition (equation 32) such that only the actions that move an agent closer to the goal get imitated (see Figure 6). Indeed, combining HER with HBC (which blindly imitates all actions in hindsight) produces worse results than not imitating at all and only resorting to value learning; HDM allows for better control over what to imitate. The results in Table 1 and Figure 7 show that HDM achieves the strongest performance on all environments, while reducing variance in success rates. Interestingly, there is no consensus best baseline algorithm, with all five algorithms achieving good results in some environments and subpar performance in others. In Figure 10 of the appendix, we ablate the additional hyper-parameters introduced by HDM: $\gamma_{hdm}$ in equation 31 and the $\beta$ term in equation 14.

Table 1 (caption, partially recovered): [...] between an algorithm and GCSL. We see that only HER with (-1, 0) rewards and HDM consistently outperform GCSL, while HER + HBC performs worse than HER and sometimes GCSL as well.
Intuitively, γ hdm controls the threshold that determines when an action is considered good enough for imitation, and β controls the trade-off between minimizing f -divergence and maximizing future goal likelihood in equation 14. The ablation shows that HDM outperforms HER and GCSL across a variety of γ hdm and β values.

5.3. WHY DOES HINDSIGHT BC SOMETIMES FAIL?

Figure 8: Success rate versus initial achieved-goal change ratio.

In this section, we aim to better understand whether the (underwhelming) performance of hindsight BC has anything to do with our specific setting. We report an interesting metric that correlates with the performance of GCSL / HBC and can be measured prior to training a policy: the initial ag (achieved-goal) change ratio. We define the ag change ratio of $\pi$ to be the percentage of trajectories where the achieved goals in initial states $s_0$ are different from the achieved goals in final states $s_T$ under $\pi$. Most training signals are only created from ag changes, because they provide examples of how to rearrange an environment. We then define the initial ag change ratio to be the ag change ratio of a random-acting policy $\pi_0$. Using the notation of equation 28, it can be computed as $\mathbb{E}_{s_0 \cdots s_T \sim \pi_0}[-r_{HER}(\cdot, \cdot, s_0, s_T)]$. In Figure 8, we show the relationship between final success rates and the initial ag change ratio across all environments. The performance of GCSL seems to be upper-bounded by a linear relationship between the two. This makes intuitive sense: a BC-style objective starts off cloning the initially random trajectories, so if the initial ag change ratio is low, the policy would not learn to rearrange ag, compounding to a low final performance. HER and HDM are able to surpass this ceiling, likely because of the additional goal-likelihood term in equation 14 besides imitation. This finding suggests that in order to make goal-reaching easier, we should either modify the initial state distribution $\rho_0(s)$ such that ag can be easily changed through random exploration (Florensa et al., 2017) (if the policy is trained from scratch), or initialize BC from some high-quality demonstrations where ag does change (Ding et al., 2019; Nair et al., 2018a; Lynch et al., 2019).
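The ag change ratio is simple to compute from logged rollouts. The sketch below assumes that achieved goals are stored as fixed-length vectors and that a small tolerance `atol` decides whether two achieved goals differ; both the storage format and the tolerance are illustrative assumptions.

```python
import numpy as np

def ag_change_ratio(initial_ag, final_ag, atol=1e-6):
    """Fraction of trajectories whose achieved goal differs between s_0 and s_T.

    initial_ag, final_ag: arrays of shape (num_trajectories, goal_dim).
    """
    initial_ag = np.asarray(initial_ag, dtype=float)
    final_ag = np.asarray(final_ag, dtype=float)
    # A trajectory counts as "changed" if any goal coordinate moved.
    changed = np.any(np.abs(final_ag - initial_ag) > atol, axis=1)
    return float(np.mean(changed))

# Four hypothetical rollouts of a random policy; only the last moves the ag.
initial = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
final = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.5], [3.0, 1.4]]
ratio = ag_change_ratio(initial, final)
```

Running this on rollouts of a random-acting policy before any training gives the initial ag change ratio discussed above.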

6. CONCLUSION

This work presents a unified goal-reaching objective that encompasses a family of goal-conditioned RL algorithms. Our derivation illustrates the connection between hindsight goal relabeling, divergence minimization, and energy-based models. It reveals a largely unexplored design space: we could potentially use other convex functions in f-divergence minimization (Nowozin et al., 2016; Ghasemipour et al., 2020), and improve the optimization of the energy-based model (Du et al., 2020) by going beyond NCE (Grathwohl et al., 2020). The primary limitation of our framework is that it does not account for exploration (Hafner et al., 2020). Eventually, we would like to incorporate empowerment (Klyubin et al., 2005; Sekar et al., 2020) and input density exploration (Bellemare et al., 2016; Pong et al., 2019) into the framework.

A NOTATIONS

$\rho_0$: Initial state distribution.
$p(s' \mid s, a)$: Environmental dynamics.
$p^+_\mu(s^+ \mid s, a)$: Discounted future state distribution under policy $\mu$.
$\rho_\mu(s)$: State distribution (occupancy measure) visited by policy $\mu$.
$P^\pi Q(s, a, g)$: Transition operator in the (goal-conditioned) Bellman backup, $P^\pi Q(s, a, g) = \mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s', g)}[Q_\theta(s', a', g)]$.
$f$: Convex function in the f-divergence $D_f$. Usually $f(1) = 0$.
$f^*$: Convex conjugate of the function $f$.
$f'(x)$: Derivative of the function $f$.
$\rho_{\mathrm{exp}}(s, a)$: State-action visitation distribution of the expert policy.
$\rho_\mu(s, a)$: State-action visitation distribution of the policy $\mu$. Note that $\rho_\mu(s, a) = \rho_\mu(s)\mu(a \mid s)$.
$T(x)$: Function used in the variational bound of the f-divergence. Also appears as $T(s, a)$ or $T(s, a, g)$ in the main text.
$\rho^+(g)$: The behaviour goal distribution, assumed to be given a priori by the environment.
$\rho^+(s^+)$: The marginal hindsight goal distribution of a given dataset / replay, where $\rho^+(s^+) = \mathbb{E}_{\rho_\mu(s, a)}[p^+_\mu(s^+ \mid s, a)]$.
$\pi(a \mid s)$: The goal-conditioned policy $\pi$ marginalized over the behavioral goal distribution, $\pi(a \mid s) = \mathbb{E}_{p^+(g)}[\pi(a \mid s, g)]$.
$\rho_\pi(s, a \mid g)$: The state-action visitation distribution of the goal-conditioned policy $\pi$ when conditioned on the behavioral goal $g$.
$E(x)$: Energy function of an EBM, $p_\theta(x) = \exp(-E_\theta(x))/Z(\theta)$.
$Z(\theta)$: The partition function of an EBM, $Z(\theta) = \int_X \exp(-E_\theta(x))\,dx$.
$\sigma(x)$: Sigmoid function, $\sigma(x) = 1/(1 + \exp(-x))$.
$p_n(x)$: Noise distribution in Noise Contrastive Estimation (NCE).
$k$: The number of times noise samples are sampled more frequently than true data samples in NCE.
$\Delta_\theta$: The logit of the positive sample classification loss in NCE.
$q(s, a, s^+)$: A conditional EBM, $\rho_\pi(s, a \mid s^+) = \exp q(s, a, s^+)/Z_q(s^+)$.
$Z_q(s^+)$: The partition function of the conditional EBM $\rho_\pi(s, a \mid s^+)$, $Z_q(s^+) = \int_{S \times A} \exp q(s, a, s^+)\,ds\,da$.
$H(\pi)$: Entropy of a policy across (replay) states and goals, $H(\pi) = \mathbb{E}_{\rho(s)\rho^+(g)\pi(a \mid s, g)}[-\log \pi(a \mid s, g)]$.
$r(s, a)$: The (learned) reward in traditional RL settings without goal-conditioning.
$r(s, a, s^+)$: The (learned) reward in goal-conditioned divergence minimization.
$r(s, a, s', s^+)$: The (generalized) reward used in HER-style multi-goal RL.
$r_{\mathrm{HER}}(s, a, s', s^+)$: The reward function used in HER. It equals $0$ when $s' = s^+$ and $-1$ when $s' \neq s^+$. Also denoted as $r_{\mathrm{HER}}(\cdot, \cdot, s', s^+)$.
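As a small illustration of the last two reward entries, the sketch below writes the HER reward as a special case of the generalized reward, using the values r = -1 and β = 1 that the paper states recover HER. It also relabels a trajectory with its own end state, as described in the abstract. The function names, the tuple layout of a transition, and the use of exact equality between states are assumptions made for this example; continuous environments would typically need a distance threshold instead.

```python
def generalized_reward(next_state, goal, r=-1.0, beta=1.0):
    """Return r + beta when the next state equals the goal, and r otherwise."""
    return r + beta if next_state == goal else r


def her_reward(next_state, goal):
    """HER reward: 0 when s' == s+, and -1 otherwise (r = -1, beta = 1)."""
    return generalized_reward(next_state, goal, r=-1.0, beta=1.0)


def relabel_with_final_state(trajectory):
    """Relabel every transition with the trajectory's end state as the goal.

    `trajectory` is a list of (state, action, next_state) tuples (a
    hypothetical layout chosen for this sketch).
    """
    goal = trajectory[-1][2]
    return [
        (state, action, next_state, goal, her_reward(next_state, goal))
        for state, action, next_state in trajectory
    ]


if __name__ == "__main__":
    demo = [((0, 0), "right", (1, 0)), ((1, 0), "up", (1, 1))]
    for transition in relabel_with_final_state(demo):
        print(transition)
```

Only the final transition of the relabeled trajectory receives reward 0 here, because it is the only one whose next state coincides with the relabeled goal.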

B PROOFS

B.1 MAIN LEMMAS

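Lemma B.1 (Online-to-offline transformation for goal reaching). Given a goal-conditioned policy $\pi(a \mid s, g)$, its corresponding Q-function $Q^\pi(s, a, g)$, and an arbitrary state-action visitation distribution $\rho_\mu(s, a)$ of another policy $\mu(a \mid s)$, the expected temporal difference for online rollouts under $\pi$ is:

$$\mathbb{E}_{p(g)\rho_\pi(s, a \mid g)}\big[(Q^\pi - \gamma \cdot P^\pi Q^\pi)(s, a, g)\big] = \mathbb{E}_{p(g)\rho_\mu(s, a)\pi(\tilde{a} \mid s, g)}\big[Q^\pi(s, \tilde{a}, g) - \gamma \cdot P^\pi Q^\pi(s, a, g)\big]$$

Proof of Lemma 4.1.

$$\begin{aligned}
&\mathbb{E}_{p(g)\rho_\pi(s, a \mid g)}\big[(Q^\pi - \gamma \cdot P^\pi Q^\pi)(s, a, g)\big] \\
&= \mathbb{E}_{p(g)\rho_\pi(s, a \mid g)}\big[Q^\pi(s, a, g) - \gamma\,\mathbb{E}_{p(s' \mid s, a),\,\pi(a' \mid s', g)} Q^\pi(s', a', g)\big] \\
&= (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t\,\mathbb{E}_{p(g)\rho^t_\pi(s \mid g)\pi(a \mid s, g)}\big[Q^\pi(s, a, g) - \gamma\,\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s', g)} Q^\pi(s', a', g)\big] \\
&= (1 - \gamma)\sum_{t=0}^{\infty} \Big(\gamma^t\,\mathbb{E}_{p(g)\rho^t_\pi(s \mid g)\pi(a \mid s, g)}\big[Q^\pi(s, a, g)\big] - \gamma^{t+1}\,\mathbb{E}_{p(g)\rho^{t+1}_\pi(s \mid g)\pi(a \mid s, g)}\big[Q^\pi(s, a, g)\big]\Big) \\
&= (1 - \gamma)\,\mathbb{E}_{p(g),\,\rho_0(s),\,\pi(a \mid s, g)}\big[Q^\pi(s, a, g)\big] \\
&= (1 - \gamma)\sum_{t=0}^{\infty} \Big(\gamma^t\,\mathbb{E}_{p(g)\rho^t_\mu(s)\pi(a \mid s, g)}\big[Q^\pi(s, a, g)\big] - \gamma^{t+1}\,\mathbb{E}_{p(g)\rho^{t+1}_\mu(s)\pi(a \mid s, g)}\big[Q^\pi(s, a, g)\big]\Big) \\
&= (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t\,\mathbb{E}_{p(g)\rho^t_\mu(s, a)\pi(\tilde{a} \mid s, g)}\big[Q^\pi(s, \tilde{a}, g) - \gamma\,\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s', g)} Q^\pi(s', a', g)\big] \\
&= \mathbb{E}_{p(g)\rho_\mu(s, a)\pi(\tilde{a} \mid s, g)}\big[Q^\pi(s, \tilde{a}, g) - \gamma\,\mathbb{E}_{p(s' \mid s, a),\,\pi(a' \mid s', g)} Q^\pi(s', a', g)\big]
\end{aligned}$$

Both telescoping sums collapse to the same initial-state term because $\pi$ and $\mu$ start from the same initial state distribution $\rho_0$.

Lemma B.2 (Gradient of the noise-contrastive term in energy-based goal-reaching). Given the following definition for the logit of an NCE-like binary classifier, with $\rho^+(g) = \rho^+(g)$:

$$\Delta_\theta(s, a, g, k) = Q_\theta(s, a, g) - \log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)] - \log k \tag{33}$$

The gradient of the negative NCE term in the density ratio estimation approaches zero as $k \to \infty$:

$$\frac{d}{d\theta}\,\mathbb{E}_{\rho_\mu(s, a)\rho^+(g)}\Big[k \cdot \log\big(1 - \sigma(\Delta_\theta(s, a, g, k))\big)\Big] \xrightarrow{k \to \infty} 0$$

Proof of Lemma B.2.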
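Here $\sigma$ is the sigmoid function, and $Z_\theta(s, a) = \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]$:

$$1 - \sigma(\Delta_\theta(s, a, g, k)) = \frac{1}{1 + \exp \Delta_\theta(s, a, g, k)} = \frac{1}{1 + \exp\big(Q_\theta(s, a, g) - \log Z_\theta(s, a)\big)/k} \tag{34}$$

Plugging the above into the loss and taking the gradient:

\begin{align}
&\frac{d}{d\theta}\,\mathbb{E}_{\rho_\mu(s, a)\rho^+(g)}\Big[-k \cdot \log\Big(\frac{\exp Q_\theta(s, a, g)}{k \cdot Z_\theta(s, a)} + 1\Big)\Big] \tag{35} \\
&= \frac{d}{d\theta}\,\mathbb{E}_{\rho_\mu(s, a)\rho^+(g)}\Big[-k \cdot \log\Big(\frac{1}{k}\exp\big(Q_\theta(s, a, g) - \log Z_\theta(s, a)\big) + 1\Big)\Big] \tag{36} \\
&= \mathbb{E}_{\rho_\mu(s, a)\rho^+(g)}\Big[-\frac{\exp Q_\theta(s, a, g)/Z_\theta(s, a)}{1/k \cdot \exp Q_\theta(s, a, g)/Z_\theta(s, a) + 1}\,\frac{d}{d\theta}\big(Q_\theta(s, a, g) - \log Z_\theta(s, a)\big)\Big] \tag{37} \\
&\xrightarrow{k \to \infty} \mathbb{E}_{\rho_\mu(s, a)\rho^+(g)}\Big[-\frac{\exp Q_\theta(s, a, g)}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\,\frac{d}{d\theta}\big(Q_\theta(s, a, g) - \log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]\big)\Big] \notag
\end{align}

The first term inside the expectation over $\rho_\mu(s, a)$ is:

$$-\frac{1}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\,\mathbb{E}_{\rho^+(g)}\Big[\exp Q_\theta(s, a, g)\,\frac{d}{d\theta} Q_\theta(s, a, g)\Big]$$

The second term is:

$$\begin{aligned}
\frac{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\,\frac{d}{d\theta}\log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]
&= \frac{1}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\,\frac{d}{d\theta}\,\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)] \\
&= \frac{1}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\,\mathbb{E}_{\rho^+(g)}\Big[\exp Q_\theta(s, a, g)\,\frac{d}{d\theta} Q_\theta(s, a, g)\Big]
\end{aligned}$$

The two terms cancel out and yield a gradient of 0.

Lemma B.3 (Goal-conditioned Q-functions estimate PMI on given trajectories). Suppose that we set a priori the behavioral goal distribution to be $\rho^+(g) = \int_{S \times A} p_\mu(s, a, s^+)\,ds\,da$, and assume that the state-action distribution of $\pi$ marginalized over behavioral goals, $\mathbb{E}_{\rho^+(g)}[\rho_\pi(s, a \mid g)]$, is the same as $\rho_\mu(s, a)$, which means that $\pi$ stays in the same state-action visitation distribution as $\mu$. Then, on trajectories generated by $\mu$, the point-wise mutual information (PMI) between a state-action pair $(s, a)$ and a future state $g$ is given by the Q-function at convergence $Q^\pi$:

$$\mathrm{PMI}((s, a), g) = Q^\pi(s, a, g) - \log \mathbb{E}_{\rho^+(g)}[\exp Q^\pi(s, a, g)] - \log\big(r + (\gamma P^\pi Q^\pi - Q^\pi)(s, a, g)\big)$$

Proof of Lemma B.3.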
We start with the optimal $T^*$ in the f-divergence bound of equation 5, applied to the special case where the function $f$ is the quadratic of equation 19:

$$T^*(x) = f'(p(x)/q(x)) \tag{43}$$

$$f'(t) = t - r$$

Combining the two, we get:

$$p(x)/q(x) - r = T^*(x)$$

We now apply this identity to the f-divergence minimization problem in equation 15 (note that we have set $r(s, a, g) = -T(s, a, g)$ in our derivation):

\begin{align}
\frac{p_\pi(s, a, g)}{p_\mu(s, a, s^+)} = \frac{\rho^+(g)\rho_\pi(s, a \mid g)}{\rho_\mu(s, a)p^+_\mu(g \mid s, a)} &= r - (Q^\pi - \gamma P^\pi Q^\pi)(s, a, g) \tag{46} \\
&= r + (\gamma P^\pi Q^\pi - Q^\pi)(s, a, g) \notag
\end{align}

We now take the relationship in equation 21:

$$\frac{\rho_\pi(s, a \mid g)}{\rho_\pi(s, a)} = \frac{\exp Q^\pi(s, a, s^+)}{\mathbb{E}_{\rho^+(g)}[\exp Q^\pi(s, a, g)]}$$

Using this substitution, we arrive at:

$$\frac{\rho^+(g)\rho_\pi(s, a)}{\rho_\mu(s, a)p^+_\mu(g \mid s, a)}\,\frac{\exp Q^\pi(s, a, s^+)}{\mathbb{E}_{\rho^+(g)}[\exp Q^\pi(s, a, g)]} = r + (\gamma P^\pi Q^\pi - Q^\pi)(s, a, g)$$

Swapping the numerator and denominator and applying the assumption that $\rho_\pi(s, a) = \rho_\mu(s, a)$:

$$\frac{\rho_\mu(s, a)p^+_\mu(g \mid s, a)}{\rho_\mu(s, a)\rho^+(g)}\,\frac{\mathbb{E}_{\rho^+(g)}[\exp Q^\pi(s, a, g)]}{\exp Q^\pi(s, a, g)} = \frac{1}{r + (\gamma P^\pi Q^\pi - Q^\pi)(s, a, g)}$$

Taking the log on both sides, we get the following expression for $\mathrm{PMI}((s, a), g)$:

$$\log\frac{p^+_\mu(g \mid s, a)}{\rho^+(g)} = Q^\pi(s, a, g) - \log \mathbb{E}_{\rho^+(g)}[\exp Q^\pi(s, a, g)] - \log\big(r + (\gamma P^\pi Q^\pi - Q^\pi)(s, a, g)\big)$$

For the positive classification loss of NCE, it remains to show that the gradient of its remaining part vanishes in the limit:

$$\frac{d}{d\theta}\log\big(1 + \exp \Delta_\theta(s, a, s^+, k)\big) \xrightarrow{k \to \infty} 0$$

We simply follow the definition of $\Delta_\theta$ in equation 22 and take the gradient of the above:

$$\begin{aligned}
&\frac{\exp \Delta_\theta(s, a, s^+, k)}{1 + \exp \Delta_\theta(s, a, s^+, k)}\,\frac{d}{d\theta}\Delta_\theta(s, a, s^+, k) \\
&= \frac{1}{1 + \exp(-\Delta_\theta(s, a, s^+, k))}\Big(\nabla_\theta Q_\theta(s, a, s^+) - \frac{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)\nabla_\theta Q_\theta(s, a, g)]}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\Big) \\
&= \frac{1}{1 + k \cdot \dfrac{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}{\exp Q_\theta(s, a, s^+)}}\Big(\nabla_\theta Q_\theta(s, a, s^+) - \frac{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)\nabla_\theta Q_\theta(s, a, g)]}{\mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]}\Big)
\end{aligned}$$

As $k \to \infty$, the gradient approaches 0 because the scalar factor on the left approaches 0.
Combining the above result about a part of the positive classification loss in NCE with Lemma B.2, which deals with the negative classification loss in NCE, we arrive at the combined NCE loss in the limit $k \to \infty$, as pointed out in equation 23 of the main text:

$$\arg\max_Q\,\mathbb{E}_{\rho_\mu(s, a)}\Big[\mathbb{E}_{p^+_\pi(s^+ \mid s, a)}[Q_\theta(s, a, s^+)] - \log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]\Big]$$

As mentioned, this is roughly equivalent to the InfoNCE loss (Van den Oord et al., 2018), further validating our analysis so far.
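The two limiting claims used here (the negative NCE term and the remaining part of the positive term both have vanishing gradients as k grows) can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not in the paper: a single fixed state-action pair, three discrete goals with made-up probabilities `RHO`, and a one-parameter model Q_θ(g) = θ · PHI[g]. Gradients are taken by central finite differences rather than by the closed-form expressions.

```python
import math

PHI = [0.0, 1.0, 2.0]  # hypothetical goal features, so Q_theta(g) = theta * PHI[g]
RHO = [0.5, 0.3, 0.2]  # hypothetical goal distribution rho^+(g)


def logit(theta, g, k):
    """Delta_theta = Q - log E_{rho^+}[exp Q] - log k for a fixed (s, a)."""
    log_z = math.log(sum(p * math.exp(theta * f) for p, f in zip(RHO, PHI)))
    return theta * PHI[g] - log_z - math.log(k)


def softplus(x):
    """log(1 + exp(x)), which also equals -log(1 - sigmoid(x))."""
    return math.log1p(math.exp(x))


def negative_term(theta, k):
    """E_{rho^+(g)}[k * log(1 - sigmoid(Delta_theta))]."""
    return sum(-p * k * softplus(logit(theta, g, k)) for g, p in enumerate(RHO))


def positive_remainder(theta, k, g_pos=2):
    """log(1 + exp(Delta_theta)) evaluated at one positive goal g_pos."""
    return softplus(logit(theta, g_pos, k))


def grad(fn, theta, k, eps=1e-5):
    """Central finite-difference derivative of fn with respect to theta."""
    return (fn(theta + eps, k) - fn(theta - eps, k)) / (2 * eps)


if __name__ == "__main__":
    for k in (1, 100, 10_000, 1_000_000):
        print(k, grad(negative_term, 0.7, k), grad(positive_remainder, 0.7, k))
```

Both printed gradients shrink toward zero as k increases, which is the behaviour the derivation relies on when it keeps only the InfoNCE-style objective in the limit.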

B.3 EBM LOSSES

To optimize $\mathbb{E}_{\rho_\mu(s, a)}\big[\mathbb{E}_{p^+_\pi(s^+ \mid s, a)}[Q_\theta(s, a, s^+)] - \log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)]\big]$ from an arbitrary dataset of behaviors $\rho_\mu(s, a, s^+)$, the difficulty lies in accessing $p^+_\pi(s^+ \mid s, a)$: we do not have complete access to the distribution of "positive samples", because sampling directly from $p^+_\pi(s^+ \mid s, a)$ requires on-policy rollouts. This can be resolved by using importance weights and equation 4 to rewrite $\mathbb{E}_{p^+_\pi(s^+ \mid s, a)}[Q_\theta(s, a, s^+)]$:

$$(1 - \gamma) \cdot \mathbb{E}_{p(s' \mid s, a)\rho_\mu(s, a)}[Q_\theta(s, a, s')] + \gamma \cdot \mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s')\rho^+(g)\rho_\mu(s, a)}\Big[\frac{p^+_\pi(g \mid s', a')}{\rho^+(g)}\,Q_\theta(s, a, g)\Big]$$

Above, we have introduced a new notation $\pi(a \mid s)$ to address the following issue with $p^+_\pi(s^+ \mid s, a)$: while $\pi$ is a goal-conditioned policy, $p^+_\pi(s^+ \mid s, a)$ is not conditioned on a goal a priori. A similar problem was encountered (but ignored) in C-Learning (Eysenbach et al., 2020b), which simply assumed that the multi-goal policy was a posteriori conditioned on the same future state that was sampled from this policy in the first place (a contradiction). To avoid this problem, we define $\pi(a \mid s) = \mathbb{E}_{p^+(g)}[\pi(a \mid s, g)]$ and assume that $p^+_\pi(s^+ \mid s, a)$ is sampled under $\pi(a \mid s)$ beyond a single step. We package all the EBM training losses into the following:

$$\mathbb{E}_{\rho_\mu(s, a)}\Big\{-(1 - \gamma) \cdot \mathbb{E}_{p(s' \mid s, a)}[Q_\theta(s, a, s')]\Big\} \tag{52}$$

$$\mathbb{E}_{\rho_\mu(s, a)}\Big\{\log \mathbb{E}_{\rho^+(g)}[\exp Q_\theta(s, a, g)] - \gamma\,\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s', g)\rho^+(g)}\Big[\frac{p^+_\pi(g \mid s', a')\,\pi(a' \mid s')}{\rho^+(g)\,\pi(a' \mid s', g)}\,Q_\theta(s, a, g)\Big]\Big\} \tag{53}$$

To decompose the importance weight, note that the policy is trying to maximize the Q-values under the entropy constraint in equation 14, resulting in a Boltzmann policy (Haarnoja et al., 2017; Schulman et al., 2017; Haarnoja et al., 2018): $\arg\max_\pi \mathbb{E}_{\pi(a \mid s, g)}[Q(s, a, g)] - H(\pi) = \exp Q(s, a, g)/\sum_a \exp Q(s, a, g)$.
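The two policy objects used in this decomposition can be sketched as follows, assuming a discrete action set (which the sum over actions in the Boltzmann form suggests) and a discrete set of goals with known probabilities. The function names and the layout of `q_table` (one row of action values per goal) are assumptions made for this example.

```python
import math


def boltzmann_policy(q_values):
    """pi(a | s, g) proportional to exp Q(s, a, g) over a discrete action set."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(q - q_max) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def marginal_policy(q_table, goal_probs):
    """pi(a | s): the goal-conditioned Boltzmann policy averaged over goals.

    q_table[i] holds Q(s, a, g_i) for every action a, and goal_probs[i] is
    the probability of goal g_i (a hypothetical tabular layout).
    """
    n_actions = len(q_table[0])
    marginal = [0.0] * n_actions
    for q_values, p_g in zip(q_table, goal_probs):
        for a, prob in enumerate(boltzmann_policy(q_values)):
            marginal[a] += p_g * prob
    return marginal


if __name__ == "__main__":
    q_table = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
    print(boltzmann_policy(q_table[0]))
    print(marginal_policy(q_table, [0.25, 0.75]))
```

The ratio between the marginal and the goal-conditioned action probabilities is the second factor of the importance weight in the second EBM loss above.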
Minimizing equation 53 is equivalent to minimizing the following loss, with $\perp(\cdot)$ being the stop-gradient sign:

$$\mathbb{E}_{\rho_\mu(s, a)\rho^+(g)p(s' \mid s, a)}\Big[\perp\Big(\frac{\exp Q(s, a, g)}{\mathbb{E}_{\rho^+(g)}[\exp Q(s, a, g)]} - \gamma \cdot \frac{\sum_a \exp Q(s', a, g)/|A|}{\mathbb{E}_{\rho^+(g)}[\sum_a \exp Q(s', a, g)/|A|]}\Big)\,Q_\theta(s, a, g)\Big] \tag{54}$$

Proof of equation 54. Let $w_{\mathrm{ebm}}(s', a', g)$ denote the importance weight:

$$w_{\mathrm{ebm}}(s', a', g) = \frac{p^+_\pi(g \mid s', a')}{\rho^+(g)}\,\frac{\pi(a' \mid s')}{\pi(a' \mid s', g)}$$

Firstly, because of the entropy constraint in equation 14, we have a Boltzmann policy defined on the Q-function, $\arg\max_\pi \mathbb{E}_{\pi(a \mid s, g)}[Q(s, a, g)] - H(\pi) = \exp Q(s, a, g)/\sum_a \exp Q(s, a, g)$:

$$\pi(a \mid s, g) \propto \exp Q(s, a, g)$$

Secondly, we note that the first term in the importance ratio can be estimated from equation 21. Thus we have:

$$\begin{aligned}
w_{\mathrm{ebm}}(s', a', g) &= \frac{\exp Q(s', a', g)}{\mathbb{E}_{\rho^+(g)}[\exp Q(s', a', g)]} \cdot \frac{\mathbb{E}_{\rho^+(g)}[\exp Q(s', a', g)]}{\sum_a \mathbb{E}_{\rho^+(g)}[\exp Q(s', a, g)]} \cdot \frac{\sum_a \exp Q(s', a, g)}{\exp Q(s', a', g)} \\
&= \frac{\sum_a \exp Q(s', a, g)}{\sum_a \mathbb{E}_{\rho^+(g)}[\exp Q(s', a, g)]}
\end{aligned} \tag{58}$$

Taking the gradient of equation 53 inside the expectation $\mathbb{E}_{\rho_\mu(s, a)p(s' \mid s, a)}[\cdot]$:

$$\frac{\mathbb{E}_{\rho^+(g)}[\exp Q(s, a, g)\nabla_\theta Q_\theta(s, a, g)]}{\mathbb{E}_{\rho^+(g)}[\exp Q(s, a, g)]} - \frac{\gamma\,\mathbb{E}_{\rho^+(g)}[\sum_a \exp Q(s', a, g)\nabla_\theta Q_\theta(s, a, g)]}{\sum_a \mathbb{E}_{\rho^+(g)}[\exp Q(s', a, g)]}$$

Dividing both the numerator and denominator of the second term by $|A|$, putting the expectation $\mathbb{E}_{\rho_\mu(s, a)p(s' \mid s, a)}[\cdot]$ back in, and utilizing the stop-gradient sign $\perp(\cdot)$, we arrive at the loss:

$$\mathbb{E}_{\rho_\mu(s, a)\rho^+(g)p(s' \mid s, a)}\Big[\perp\Big(\frac{\exp Q(s, a, g)}{\mathbb{E}_{\rho^+(g)}[\exp Q(s, a, g)]} - \gamma \cdot \frac{\sum_a \exp Q(s', a, g)/|A|}{\mathbb{E}_{\rho^+(g)}[\sum_a \exp Q(s', a, g)/|A|]}\Big)\,Q_\theta(s, a, g)\Big]$$

B.4 DERIVING HER REWARDS

The purpose of this section is to derive the reward function used in HER from equation 26:

$$\arg\min_Q\,\mathbb{E}_{\rho_\mu(s, a)p(s' \mid s, a)p^+_\mu(s^+ \mid s, a)}\Big[f^*\big(-(Q_\theta - \gamma P^\pi Q)(s, a, s^+)\big) - \beta \cdot (1 - \gamma)Q_\theta(s, a, s')\Big]$$

Recall that we have defined $p^+_\mu(s^+ \mid s, a)$ in equation 4:

$$p^+_\mu(s^+ \mid s, a) = (1 - \gamma)p(s^+ \mid s, a) + \gamma\int_{S \times A} p(s' \mid s, a)\mu(a' \mid s')p^+_\mu(s^+ \mid s', a')\,ds'\,da'$$

And that we have defined a quadratic form of $f^*$ in equation 19 (with $r$ and $c$ being constants):

$$f^*(x) = (x + r)^2/2 + c$$

Using the dynamics to expand the expectation and applying the choice of $f^*$ being a quadratic, the loss becomes:

$$\begin{aligned}
\arg\min_Q\;&\mathbb{E}_{\rho_\mu(s, a)p(s' \mid s, a)}\Big[(1 - \gamma) \cdot \Big(\frac{1}{2}\big(r + (\gamma P^\pi Q - Q_\theta)(s, a, s')\big)^2 - \beta \cdot Q_\theta(s, a, s')\Big)\Big] \\
&+ \mathbb{E}_{\rho_\mu(s, a)p(s' \mid s, a)\mu(a' \mid s')p^+_\mu(s^+ \mid s', a')}\Big[\gamma \cdot \frac{1}{2}\big(r + (\gamma P^\pi Q - Q_\theta)(s, a, s^+)\big)^2\Big]
\end{aligned}$$

Assuming that there is a stop-gradient sign on $P^\pi Q$ because of the use of a target network, we can rewrite the above as one single quadratic and see that the gradient of the above loss with respect to $Q$ is equivalent to the gradient of the following squared Bellman residual:

$$\mathbb{E}_{\rho_\mu(s, a)p(s' \mid s, a)p^+_\mu(s^+ \mid s, a)}\Big[\frac{1}{2}\big(r(s, a, s', s^+) + (\gamma P^\pi Q - Q_\theta)(s, a, s^+)\big)^2\Big]$$

where the reward function $r(s, a, s', s^+)$ is:

$$r(s, a, s', s^+) = \begin{cases} r + \beta, & s' = s^+ \\ r, & s' \neq s^+ \end{cases}$$

Figure 1: A unified objective for goal-reaching (see equation 14 for more details).

Figure 2: The hindsight-relabeled distribution $\rho_\mu(s, a)p^+_\mu(s^+ \mid s, a)$.

Figure 5: The future state distribution $p^+_\pi(s^+ \mid s, a)$ of a goal-conditioned policy when starting from a state $s$ visited by $\mu$.

(28) See Appendix B.4 for a full derivation. Setting $r = -1$ and $\beta = 1$, we have arrived at the reward function of HER, $r_{\mathrm{HER}}(s, a, s', s^+)$. The remaining losses that this reward leaves out are the second term in the f-divergence minimization part (equation 18) and the second term in the EBM training (equation 53), which we will analyze in the next section.

4.4 WHEN DOES BEHAVIOUR CLONING (BC) HELP?

Figure 6: Intuitions about when HDM applies BC. From left to right: (a) the Maze environment with a goal in the upper-left corner; (b) the Q-values learned through the converged policy, where lighter red means a higher Q-value; (c) the actions (from the replay) that get imitated when conditioning on the goal and setting $\gamma_{\mathrm{hdm}} = 0.95$, with the background color reflecting how much an action moves the agent closer to the goal based on the agent's own estimate; (d) $\gamma_{\mathrm{hdm}} = 0.85$; (e) $\gamma_{\mathrm{hdm}} = 0.75$. As we lower $\gamma_{\mathrm{hdm}}$, the threshold $-\log \gamma_{\mathrm{hdm}}$ gets higher and fewer actions get imitated, with the remaining imitated actions more concentrated around the goal. HDM uses Q-learning to account for the worst while imitating the best during the goal-reaching process.

Figure 7: Normalized performance gain over GCSL, calculated by normalizing the final performance difference

C ABLATIONS AND HYPER-PARAMETERS

In this section, we include additional ablations and hyper-parameters. We first describe the goal-reaching environments in more detail in Figure 9. We use the following thresholds (for Euclidean norms) for determining success: [0.08, 0.08, 0.05, 0.05, 0.1], which are tight thresholds based on our visualizations of the environments. We use the same network architecture, sampling, and optimization schedules for all the methods, as described in Table 2. As for $\gamma_{\mathrm{HDM}}$, we set it to 0.85 in Four Rooms and Lunar Lander, 0.5 in Sawyer Push and Claw Manipulate, and 0.4 in Door Opening. An ablation on this hyper-parameter can be found in Figure 10. For the next-state relabeling ratio, we set the default to 0.2, and increase it to 0.5 in Lunar Lander and to 0.6 in Sawyer Push and Door Opening. For the soft Q-learning (Schulman et al., 2017) + HER baseline, we set the temperature parameter to 0.2, which we have found to perform well empirically.
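The per-environment hyper-parameter choices reported in Appendix C can be gathered into a small lookup, sketched below. The dictionary keys and helper names are hypothetical identifiers chosen for this example. The values are the ones stated in the text, and the behaviour-cloning threshold is computed as -log γ_hdm following the caption of Figure 6.

```python
import math

# gamma_HDM per environment, as reported in the text.
GAMMA_HDM = {
    "four_rooms": 0.85,
    "lunar_lander": 0.85,
    "sawyer_push": 0.5,
    "claw_manipulate": 0.5,
    "door_opening": 0.4,
}

# Next-state relabeling ratio: a default plus per-environment overrides.
DEFAULT_NEXT_STATE_RELABEL_RATIO = 0.2
NEXT_STATE_RELABEL_RATIO = {
    "lunar_lander": 0.5,
    "sawyer_push": 0.6,
    "door_opening": 0.6,
}

# Temperature of the soft Q-learning + HER baseline.
SOFT_Q_HER_TEMPERATURE = 0.2


def relabel_ratio(env_name):
    """Return the next-state relabeling ratio used for an environment."""
    return NEXT_STATE_RELABEL_RATIO.get(env_name, DEFAULT_NEXT_STATE_RELABEL_RATIO)


def bc_threshold(env_name):
    """Return the threshold -log(gamma_hdm) for an environment."""
    return -math.log(GAMMA_HDM[env_name])


if __name__ == "__main__":
    for name in GAMMA_HDM:
        print(name, relabel_ratio(name), round(bc_threshold(name), 3))
```

A lower γ_hdm gives a higher threshold, so under these settings Door Opening has the highest threshold of the five environments.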

Figure 9: Goal-reaching environments from GCSL (Ghosh et al., 2019) that we consider in this paper: reaching a goal location in Four Rooms, landing at a goal location in Lunar Lander, pushing a puck to a goal location in Sawyer Push, opening the door to a goal angle in Door Open (Nair et al., 2018b), and turning a valve to a goal orientation in Claw Manipulate (Ahn et al., 2020).

Figure 10: Ablation studies on HDM Gamma and Beta. HDM Gamma refers to γ hdm in equation 31, and HDM Beta refers to the β term in equation 14. The orange line and the blue line denote HER and GCSL baseline performance. See Section 5.2 for further discussion of these results.

Benchmark results of test-time success rates in self-supervised goal-reaching. We compare our method

a, g))

B.2 NCE LOSSES

We now complete the derivation of the NCE losses in Section 4.2 by showing that the positive classification loss of NCE reduces to an InfoNCE loss (Van den Oord et al., 2018), as in equation 23, under our definition of the logit in equation 22. More specifically, given that we already have Lemma B.2, we only need to show:

