UNDERSTANDING HINDSIGHT GOAL RELABELING REQUIRES RETHINKING DIVERGENCE MINIMIZATION

Abstract

Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is quite simple: any trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, this connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, and yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains this connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behavior cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of the two actually hurts performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances of and new pathways for extending the success of generative models to RL.

1. INTRODUCTION

Goal reaching is an essential aspect of intelligence in sequential decision making. Unlike the conventional formulation of reinforcement learning (RL), which aims to encode all desired behaviors into a single scalar reward function that is amenable to learning (Silver et al., 2021), goal reaching formulates the problem of RL as applying a sequence of actions to rearrange the environment into a desired state (Batra et al., 2020). Goal reaching is a highly flexible formulation. For instance, we can design the goal space to capture salient information about the specific factors of variation that we care about (Plappert et al., 2018); we can use natural language instructions to define more abstract goals (Lynch & Sermanet, 2020; Ahn et al., 2022); we can encourage exploration by prioritizing previously unseen goals (Pong et al., 2019; Warde-Farley et al., 2018; Pitis et al., 2020); and we can even use self-supervised procedures to learn goal-reaching policies without reward engineering (Pong et al., 2018; Nair et al., 2018b; Zhang et al., 2021; OpenAI et al., 2021).

Reward is not enough. Usually, rewards are manually constructed, either through laborious reward engineering or from task-specific optimal demonstrations, neither of which is a scalable solution. How can RL agents learn useful behaviors from unlabeled reward-free trajectories, similar to how NLP models such as BERT (Devlin et al., 2018) and GPT-3 (Brown et al., 2020) learn language from unlabeled text corpora? Goal reaching is a promising paradigm for unsupervised behavior acquisition, but it is unclear how to write down a well-defined objective for goal-conditioned policies, unlike language models that simply predict the next token. This paper aims to define such an objective, one that unifies many prior approaches and strengthens the foundation of goal-conditioned RL.
We start from the following observation: hindsight goal relabeling (Andrychowicz et al., 2017) can turn an arbitrary trajectory into a sub-optimal expert demonstration. Thus, goal-conditioned RL might be doing a special kind of imitation learning. Currently, divergence minimization is the de facto way to describe imitation learning methods (Ghasemipour et al., 2020), so we should be able to recast hindsight goal relabeling into the divergence minimization framework. On top of that, to tackle the sub-optimality of hindsight-relabeled trajectories, we should explicitly maximize the probability of reaching the commanded goal.

Experimentally, we show that multi-goal Q-learning based on HER-like rewards, when carefully tuned, can still outperform goal-conditioned BC (such as GCSL / HBC). Moreover, a vanilla combination of multi-goal Q-learning and BC (HER + HBC), supposedly combining the best of both worlds, in fact hurts performance. We utilize our unified framework to analyze when a BC loss could help, and propose a modified algorithm named Hindsight Divergence Minimization (HDM) that uses Q-learning to account for the worst while imitating the best. HDM avoids the pitfalls of HER + HBC and improves policy success rates on a variety of self-supervised goal-reaching environments. Additionally, our framework reveals a largely unexplored design space for goal-reaching algorithms and potential paths for importing generative modeling techniques into multi-goal RL.
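To make the core observation concrete, the following is a minimal sketch of hindsight relabeling with a "future" goal-sampling strategy in the spirit of HER. The function name, the discrete states, and the sparse indicator reward are illustrative assumptions, not the exact procedure used in this work:

```python
import random

def hindsight_relabel(trajectory, k=4):
    """Relabel each transition (s, a, s') with k goals drawn from states
    achieved later in the same trajectory. A reward-free trajectory thereby
    becomes a (sub-optimal) demonstration for reaching those goals."""
    relabeled = []
    T = len(trajectory)
    for t, (s, a, s_next) in enumerate(trajectory):
        # Sample k indices of transitions at or after the current step.
        future_idx = random.choices(range(t, T), k=k)
        for i in future_idx:
            goal = trajectory[i][2]  # an achieved future state, used as the goal
            reward = 1.0 if s_next == goal else 0.0  # sparse goal-reaching reward
            relabeled.append((s, a, goal, reward, s_next))
    return relabeled
```

Training a goal-conditioned policy on such relabeled tuples by supervised learning yields a GCSL-style objective, while the sparse reward supports multi-goal Q-learning as in HER.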

2.1. THE REINFORCEMENT LEARNING (RL) PROBLEM

We first review the basics of RL and generative modeling. A Markov Decision Process (MDP) is typically parameterized by $(\mathcal{S}, \mathcal{A}, \rho_0, p, r)$: a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution $\rho_0(s)$, a dynamics function $p(s' \mid s, a)$ which defines the transition probabilities, and a reward function $r(s, a)$. A policy $\mu$ defines a conditional action distribution $\mu(a \mid s)$. For an infinite-horizon MDP, given the policy $\mu$ and the state distribution $\rho^t_\mu$ at step $t$ (starting from $\rho_0$ at $t = 0$), the state distribution at step $t + 1$ is given by:
$$\rho^{t+1}_\mu(s') = \int_{\mathcal{S} \times \mathcal{A}} p(s' \mid s, a)\, \mu(a \mid s)\, \rho^t_\mu(s)\, \mathrm{d}s\, \mathrm{d}a \tag{1}$$
The state visitation distribution sums over all timesteps via a geometric distribution $\mathrm{Geom}(\gamma)$:
$$\rho_\mu(s) = (1 - \gamma) \cdot \sum_{t=0}^{\infty} \gamma^t \cdot \rho^t_\mu(s) \tag{2}$$
However, the trajectory sampling process does not happen in this discounted manner, so the discount factor $\gamma \in (0, 1)$ is often absorbed into the cumulative return instead (Silver et al., 2014):
$$\mathcal{J}(\mu) = \frac{1}{1 - \gamma} \int_{\mathcal{S} \times \mathcal{A}} \rho_\mu(s)\, \mu(a \mid s)\, r(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{\rho_0(s_0)\, \mu(a_0 \mid s_0)\, p(s_1 \mid s_0, a_0)\, \mu(a_1 \mid s_1) \cdots} \left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \right] \tag{3}$$
From (1) and (2), we also see that the future state distribution $p^+$ of policy $\mu$, defined as a geometrically discounted sum of the state distributions at all future timesteps given the current state and action, is given by:
$$p^+_\mu(s_+ \mid s, a) = (1 - \gamma) \cdot \sum_{\Delta = 0}^{\infty} \gamma^{\Delta}\, p_\mu(s_{t+\Delta+1} = s_+ \mid s_t = s, a_t = a)$$

Figure 1: A unified objective for goal-reaching (see Equation 14 for more details).
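The relationship between the discounted state visitation distribution $\rho_\mu$ and the cumulative return $\mathcal{J}(\mu)$ can be checked numerically. The sketch below uses a toy Markov chain induced by a fixed policy; the transition matrix, state-only reward, and discount factor are illustrative assumptions chosen for simplicity:

```python
import numpy as np

# A toy 3-state Markov chain induced by a fixed policy mu: P[s, s'] is the
# state transition probability under mu, r[s] the (action-averaged) reward.
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.2, 0.8],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 0.0, 1.0])
rho0 = np.array([1.0, 0.0, 0.0])  # initial state distribution
gamma = 0.9

# Discounted state visitation: rho = (1 - gamma) * sum_t gamma^t rho_t,
# computed in closed form as (1 - gamma) * rho0 (I - gamma P)^{-1}.
rho = (1 - gamma) * rho0 @ np.linalg.inv(np.eye(3) - gamma * P)

# The identity J(mu) = 1/(1 - gamma) * sum_s rho(s) r(s).
J_from_rho = rho @ r / (1 - gamma)

# Cross-check against the direct (truncated) sum E[sum_t gamma^t r(s_t)].
rho_t, J_direct = rho0.copy(), 0.0
for t in range(1000):
    J_direct += gamma ** t * (rho_t @ r)
    rho_t = rho_t @ P
```

Note that $\rho_\mu$ is a proper probability distribution (it sums to one thanks to the $(1-\gamma)$ normalizer), while the return absorbs the $\frac{1}{1-\gamma}$ factor, matching the text above.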

