C-LEARNING: LEARNING TO ACHIEVE GOALS VIA RECURSIVE CLASSIFICATION

Abstract

We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of estimating this density function directly, we estimate it indirectly by training a classifier to predict whether an observation comes from the future. Via Bayes' rule, predictions from our classifier can be transformed into predictions over future states. Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy without collecting new experience. This variant allows us to optimize functionals of a policy's future state distribution, such as the density of reaching a particular goal state. While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work. This foundation yields testable hypotheses about Q-learning, including the optimal goal-sampling ratio, which we confirm experimentally. Moreover, our proposed method is competitive with prior goal-conditioned RL methods.

1. INTRODUCTION

In this paper, we aim to reframe the goal-conditioned reinforcement learning (RL) problem as one of predicting and controlling the future state of the world. This reframing is useful not only because it suggests a new algorithm for goal-conditioned RL, but also because it explains a commonly used heuristic in prior methods and suggests how to automatically choose an important hyperparameter.

The problem of predicting the future amounts to learning a probability density function over future states, agnostic of the time at which a future state is reached. The future depends on the actions taken by the policy, so our predictions should depend on the agent's policy. While we could simply observe the future and fit a density model to the observed states, we are primarily interested in the following prediction question: given experience collected from one policy, can we predict what states a different policy will visit? Once we can predict the future states of a different policy, we can control the future by choosing a policy that effects a desired future.

While conceptually similar to Q-learning, our perspective is different in that it does not rely on reward functions. Instead, an agent can solve the prediction problem before being given a reward function, similar to models in model-based RL. Reward functions can require human supervision to construct and evaluate, so a fully autonomous agent can learn to solve this prediction problem before being provided any human supervision, and can reuse its predictions to solve many different downstream tasks. Nonetheless, when a reward function is provided, the agent can estimate its expected reward under the predicted future state distribution.

This perspective is different from prior approaches. For example, directly fitting a density model to future states only solves the prediction problem in the on-policy setting, precluding us from predicting where a different policy will go.
Model-based approaches, which learn an explicit dynamics model, do allow us to predict the future state distribution of different policies, but require a reward function or distance metric to learn goal-reaching policies for controlling the future. Methods based on temporal difference (TD) learning (Sutton, 1988) have been used to predict the future state distribution (Dayan, 1993; Szepesvari et al., 2014; Barreto et al., 2017) and to learn goal-reaching policies (Kaelbling, 1993; Schaul et al., 2015). Section 3 will explain why these approaches do not learn a true Q-function in continuous environments with sparse rewards, and it remains unclear what the learned Q-function corresponds to. In contrast, our method will estimate a well-defined classifier.

Since it is unclear how to use Q-learning to estimate such a density, we instead adopt a contrastive approach, learning a classifier to distinguish "future states" from random states, akin to Gutmann & Hyvärinen (2010). After learning this binary classifier, we apply Bayes' rule to obtain a probability density function for the future state distribution, thus solving our prediction problem. While this initial approach requires on-policy data, we then develop a bootstrapping variant for estimating the future state distribution of different policies. This bootstrapping procedure is the core of our goal-conditioned RL algorithm.

The main contribution of our paper is a reframing of goal-conditioned RL as estimating the probability density over future states. We derive a method for solving this problem, C-learning, which we use to construct a complete algorithm for goal-conditioned RL. Our reframing lends insight into goal-conditioned Q-learning, leading to a hypothesis for the optimal ratio for sampling goals, which we demonstrate empirically.
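To make the classifier-to-density conversion concrete, the following toy sketch (not the paper's algorithm; a hypothetical 1-D example where "future" states come from N(2, 1) and "random" states from Uniform(-5, 5) with known density 1/10) trains a logistic-regression classifier and applies Bayes' rule to recover the density of future states:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D setup: "future" states are drawn from N(2, 1);
# "random" states are drawn from Uniform(-5, 5), whose density 1/10
# we know in closed form.
n = 20000
future = rng.normal(2.0, 1.0, size=n)
random_states = rng.uniform(-5.0, 5.0, size=n)

# Quadratic features suffice here: the true log density ratio between
# a Gaussian and a uniform is a quadratic function of s.
def feats(s):
    s = np.asarray(s, dtype=float)
    return np.stack([np.ones_like(s), s, s**2], axis=-1)

X = np.concatenate([feats(future), feats(random_states)])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Fit a logistic-regression classifier C(s) by Newton's method (IRLS).
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(3)
    w += np.linalg.solve(H, X.T @ (y - p))

# Bayes' rule: with balanced classes, C(s) / (1 - C(s)) equals the
# ratio p_future(s) / p_random(s), so multiplying by the known
# marginal density recovers an estimate of p_future(s).
def density(s):
    logit = feats(s) @ w            # log [C(s) / (1 - C(s))]
    return 0.1 * np.exp(logit)      # p_random(s) = 1/10 on [-5, 5]

print(density(2.0))  # true value: 1/sqrt(2*pi) ≈ 0.399
```

The classifier itself never outputs a density; the density emerges only after reweighting the classifier's odds ratio by the known negative-sampling distribution, which is the same mechanism C-learning exploits with learned classifiers over states.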
Experiments demonstrate that C-learning estimates the density over future states more accurately than prior methods, while remaining competitive with recent goal-conditioned RL methods across a suite of simulated robotic tasks.

2. RELATED WORK

Common goal-conditioned RL algorithms are based on behavior cloning (Ghosh et al., 2019; Ding et al., 2019; Gupta et al., 2019; Eysenbach et al., 2020; Lynch et al., 2020; Oh et al., 2018; Sun et al., 2019), model-based approaches (Nair et al., 2020; Ebert et al., 2018), Q-learning (Kaelbling, 1993; Schaul et al., 2015; Pong et al., 2018), and semi-parametric planning (Savinov et al., 2018; Eysenbach et al., 2019; Nasiriany et al., 2019; Chaplot et al., 2020). Most prior work on goal-conditioned RL relies on manually-specified reward functions or distance metrics, limiting its applicability to high-dimensional tasks. Our method will be most similar to the Q-learning methods, which are applicable to off-policy data. These Q-learning methods often employ hindsight relabeling (Kaelbling, 1993; Andrychowicz et al., 2017), whereby experience is modified by changing the commanded goal. New goals are often taken to be a future state or a random state, with the precise ratio being a sensitive hyperparameter. We emphasize that our discussion of goal sampling concerns relabeling previously-collected experience, not the orthogonal problem of sampling goals for exploration (Pong et al., 2018; Fang et al., 2019; Pitis et al., 2020). Our work is closely related to prior methods that use TD-learning to predict the future state distribution, such as successor features (Dayan, 1993; Barreto et al., 2017; 2019; Szepesvari et al., 2014) and generalized value functions (Sutton & Tanner, 2005; Schaul et al., 2015; Schroecker & Isbell, 2020). Our approach bears a resemblance to these prior TD-learning methods, offering insight into why they work and how hyperparameters such as the goal-sampling ratio should be selected. Our approach differs in that it does not require a reward function or manually-designed relabeling strategies, with the corresponding components being derived from first principles.
While prior work on off-policy evaluation (Liu et al., 2018; Nachum et al., 2019) also aims to predict the future state distribution, our work differs in that we describe how to control the future state distribution, leading to a goal-conditioned RL algorithm. Our approach is similar to prior work on noise contrastive estimation (Gutmann & Hyvärinen, 2010), mutual-information-based representation learning (Oord et al., 2018; Nachum et al., 2018), and variational inference methods (Bickel et al., 2007; Uehara et al., 2016; Dumoulin et al., 2016; Huszár, 2017; Sønderby et al., 2016). Like prior work on the probabilistic perspective on RL (Kappen, 2005; Todorov, 2008; Theodorou et al., 2010; Ziebart, 2010; Rawlik et al., 2013; Ortega & Braun, 2013; Levine, 2018), we treat control as a density estimation problem, but our main contribution is orthogonal: we propose a method for estimating the future state distribution, which can be used as a subroutine in both standard RL and these probabilistic RL methods.

3. PRELIMINARIES

We start by introducing notation and prior approaches to goal-conditioned RL. We define a controlled Markov process by an initial state distribution p_1(s_1) and a dynamics function p(s_{t+1} | s_t, a_t).
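As a minimal illustration of these two ingredients, the sketch below defines a hypothetical 1-D controlled Markov process (the distributions, policy, and horizon are all illustrative assumptions, not objects from the paper): states are sampled from p_1, then repeatedly advanced through the dynamics under a goal-seeking policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D controlled Markov process:
# p_1(s_1) is a standard normal, and the dynamics p(s_{t+1} | s_t, a_t)
# add the action plus small Gaussian noise to the current state.
def sample_p1():
    return rng.normal(0.0, 1.0)

def sample_dynamics(s, a):
    return s + a + rng.normal(0.0, 0.1)

# A simple illustrative policy that steers toward a goal g with
# bounded actions.
def policy(s, g):
    return float(np.clip(g - s, -1.0, 1.0))

# Rolling out the process yields the trajectory whose marginal over
# visited states is the object C-learning aims to estimate.
def rollout(g, horizon=20):
    s = sample_p1()
    states = [s]
    for _ in range(horizon):
        s = sample_dynamics(s, policy(s, g))
        states.append(s)
    return states

states = rollout(g=3.0)
print(states[-1])  # ends near the goal, 3.0, up to dynamics noise
```

The future state distribution studied in the paper is exactly the distribution of states appearing in such rollouts, marginalized over the (geometrically discounted) time index.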



Project website with videos and code: https://ben-eysenbach.github.io/c_learning/

