DICHOTOMY OF CONTROL: SEPARATING WHAT YOU CAN CONTROL FROM WHAT YOU CANNOT

Abstract

Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as Decision Transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return can arise from randomness in the environment rather than from the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy to act in the environment, when conditioning on a specific desired return, leads to a distribution of real returns that is wildly different from the one desired. In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity). We achieve this separation by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments that have highly stochastic rewards and transitions.

1. INTRODUCTION

Offline reinforcement learning (RL) aims to extract an optimal policy solely from an existing dataset of previous interactions (Fujimoto et al., 2019; Wu et al., 2019; Kumar et al., 2020). As researchers begin to scale offline RL to large image, text, and video datasets (Agarwal et al., 2020; Fan et al., 2022; Baker et al., 2022; Reed et al., 2022; Reid et al., 2022), a family of methods known as return-conditioned supervised learning (RCSL), including Decision Transformer (DT) (Chen et al., 2021; Lee et al., 2022) and RL via Supervised Learning (RvS) (Emmons et al., 2021), have gained popularity due to their algorithmic simplicity and ease of scaling. At the heart of RCSL is the idea of conditioning a policy on a specific future outcome, often a return (Srivastava et al., 2019; Kumar et al., 2019; Chen et al., 2021) but sometimes also a goal state or generic future event (Codevilla et al., 2018; Ghosh et al., 2019; Lynch et al., 2020). RCSL trains a policy to imitate the actions associated with a conditioning input via supervised learning. During inference (i.e., at evaluation), the policy is conditioned on a desirable high return or future outcome, with the hope of inducing behavior that achieves this desirable outcome.
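To make the RCSL recipe concrete, here is a minimal tabular sketch. The dataset, state, and action names are hypothetical, chosen to mimic a setting where the best observed return came from a rare lucky transition rather than from a better action:

```python
from collections import defaultdict

# Toy dataset of (state, action, observed_return) tuples. Action "a1"
# happened to yield return 100 once via a rare environment transition
# (and 0 otherwise), while "a2" reliably yields return 10.
dataset = [
    ("s0", "a1", 100),  # lucky outcome of a low-probability transition
    ("s0", "a1", 0),
    ("s0", "a1", 0),
    ("s0", "a2", 10),
]

# RCSL "training": estimate P(action | state, return) from counts.
counts = defaultdict(int)
for s, a, g in dataset:
    counts[(s, g, a)] += 1

def rcsl_policy(state, desired_return):
    """Pick the action most often observed with this (state, return) pair."""
    candidates = {a: c for (s, g, a), c in counts.items()
                  if s == state and g == desired_return}
    return max(candidates, key=candidates.get)

# Conditioning on the best return seen in the data selects the lucky action,
# even though its expected return is only 0.01 * 100 = 1 in such an environment.
print(rcsl_policy("s0", 100))  # prints a1
print(rcsl_policy("s0", 10))   # prints a2
```

The sketch is deliberately simplistic (a lookup table rather than a transformer), but it exhibits the failure mode discussed next: the policy reproduces whatever action co-occurred with the conditioning return, regardless of whether that return was caused by the action.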
Despite the empirical advantages that come with supervised training (Emmons et al., 2021; Kumar et al., 2021), RCSL can be highly suboptimal in stochastic environments (Paster et al., 2022; Brandfonbrener et al., 2022), where the future an RCSL policy conditions on (e.g., return) can be determined primarily by randomness in the environment rather than by the data-collecting policy itself. Figure 1 (left) illustrates an example: conditioning an RCSL policy on the highest return observed in the dataset (r = 100) leads to a policy (a_1) that relies on a stochastic transition of very low probability (T = 0.01) to achieve the desired return of r = 100. By comparison, the choice of a_2 is much better in terms of average return, as it surely achieves r = 10. The crux of the issue is that the RCSL policy is inconsistent with its conditioning input: conditioning the policy on a desired return (i.e., 100) to act in the environment leads to a distribution of real returns (with expectation 0.01 * 100 = 1) that is wildly different from the return value being conditioned on. This issue would not occur if the policy could also maximize the transition probability that led to the high-return state, but this is not possible, as transition probabilities are part of the environment and not subject to the policy's control.

A number of works propose a generalization of RCSL, known as future-conditioned supervised learning. These techniques have been shown to be effective in imitation learning (Singh et al., 2020; Pertsch et al., 2020), offline Q-learning (Ajay et al., 2020), and online policy gradient (Venuto et al., 2021). It is common in future-conditioned supervised learning to apply a KL-divergence regularizer on the latent variable - inspired by variational auto-encoders (VAE) (Kingma & Welling, 2013) and measured with respect to a learned prior conditioned only on past information - to limit the amount of future information captured in the latent variable.
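This KL-regularized, future-conditioned objective can be sketched as follows (notation ours, not necessarily any one paper's exact formulation): an encoder q infers a latent z from the full trajectory τ, the policy π imitates actions conditioned on z, and a prior p is conditioned only on past information (here, just the initial state):

```latex
\max_{\pi, q, p} \;
\mathbb{E}_{\tau \sim \mathcal{D},\, z \sim q(z \mid \tau)}
\Big[ \textstyle\sum_t \log \pi(a_t \mid s_{0:t}, a_{0:t-1}, z) \Big]
\;-\; \beta \, \mathbb{E}_{\tau \sim \mathcal{D}}
\Big[ \mathrm{KL}\big( q(z \mid \tau) \,\|\, p(z \mid s_0) \big) \Big]
```

Here β trades off how much future information z retains; note that the KL term penalizes all future information uniformly.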
It is natural to ask whether this regularizer could remedy the inconsistency of RCSL. Unfortunately, because the KL regularizer makes no distinction between future information that is controllable and future information that is not, such an approach will still exhibit inconsistency, in the sense that the latent variable representation may contain information about the future that is due only to environment stochasticity. The major issue with both RCSL and naïve variational methods is thus that they make no distinction between stochasticity of the policy (controllable) and stochasticity of the environment (uncontrollable) (Paster et al., 2020; Štrupl et al., 2022). An optimal policy should maximize over the controllable (actions) and take expectations over the uncontrollable (e.g., transitions), as shown in Figure 1 (right). This implies that, under a variational approach, the latent variable representation that a policy conditions on should not incorporate any information that is solely due to randomness in the environment. In other words, while the latent representation can and should include information about future behavior (i.e., actions), it should not reveal any information about the rewards or transitions associated with this behavior. To this end, we propose a future-conditioned supervised learning framework termed dichotomy of control (DoC), which, in Stoic terms (Shapiro, 2014), has "the serenity to accept the things it cannot change, the courage to change the things it can, and the wisdom to know the difference."
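The separation just described can be written as a constrained objective; the following is a sketch in our own notation (not necessarily the paper's exact formulation), where z is the latent future representation and I(·;·|·) denotes conditional mutual information:

```latex
\max_{\pi, q} \;
\mathbb{E}_{\tau \sim \mathcal{D},\, z \sim q(z \mid \tau)}
\Big[ \textstyle\sum_t \log \pi(a_t \mid s_{0:t}, a_{0:t-1}, z) \Big]
\quad \text{s.t.} \quad
I\big(z \,;\, r_t, s_{t+1} \mid s_{0:t}, a_{0:t}\big) = 0 \quad \forall t
```

Intuitively, the constraint allows z to encode which actions the future behavior takes, while forbidding it from encoding which rewards or transitions the environment happened to sample.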

Code available at https://github.com/google-research/google-research/tree/ master/dichotomy_of_control.

Figure 1: Illustration of DT (RCSL) and DoC. Circles and squares denote states and actions. Solid arrows denote policy decisions. Dotted arrows denote (stochastic) environment transitions. All arrows and nodes are present in the dataset, i.e., there are 4 trajectories, 2 of which achieve 0 reward. DT maximizes returns across an entire trajectory, leading to suboptimal policies when a large return (r = 100) is achieved only due to very low-probability environment transitions (T = 0.01). DoC separates policy stochasticity from that of the environment and only tries to control action decisions (solid arrows), achieving optimal control through maximizing expected returns at each timestep.
