COUNTERFACTUAL CREDIT ASSIGNMENT IN MODEL-FREE REINFORCEMENT LEARNING

Abstract

Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We then propose to use these future-conditional value functions as baselines and critics in policy gradient algorithms, and we develop a valid, practical variant with provably lower variance, achieving unbiasedness by constraining the hindsight information to contain no information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.

1. INTRODUCTION

Reinforcement learning (RL) agents act in their environments and learn to achieve desirable outcomes by maximizing a reward signal. A key difficulty is the problem of credit assignment (Minsky, 1961), i.e., understanding the relation between actions and outcomes and determining to what extent an outcome was caused by external, uncontrollable factors; in other words, determining the share of 'skill' and 'luck'.

One possible solution to this problem is for the agent to build a model of the environment and use it to obtain a more fine-grained understanding of the effects of an action. While this topic has recently generated a lot of interest (Ha & Schmidhuber, 2018; Hamrick, 2019; Kaiser et al., 2019; Schrittwieser et al., 2019), it remains difficult to model complex, partially observed environments.

In contrast, model-free reinforcement learning algorithms such as policy gradient methods (Williams, 1992; Sutton et al., 2000) perform simple time-based credit assignment, in which events and rewards happening after an action are credited to that action, post hoc ergo propter hoc (see the first sketch at the end of this section). While unbiased in expectation, this coarse-grained credit assignment typically has high variance, and the agent will require a large amount of experience to learn the correct relation between actions and rewards.

Another issue with model-free methods is that counterfactual reasoning, i.e., reasoning about what would have happened had different actions been taken with everything else remaining the same, is not possible. Given a trajectory, model-free methods can only learn about the actions that were actually taken to produce the data, and this limits how quickly the agent can learn. As environments grow in complexity due to partial observability, scale, long time horizons, and large numbers of agents, the actions taken by the agent will affect only a vanishing part of the outcome, making it increasingly difficult for classical reinforcement learning algorithms to learn. We need better credit assignment techniques.

In this paper, we investigate a new method of credit assignment for model-free reinforcement learning, which we call Counterfactual Credit Assignment (CCA), that leverages hindsight information to implicitly perform counterfactual evaluation: an estimate of the return for actions other than the ones that were chosen. These counterfactual returns can be used to form unbiased, lower-variance estimates of the policy gradient by building future-conditional baselines (the second sketch at the end of this section illustrates how such a baseline enters the estimator). Unlike classical Q functions, which also provide an estimate of the return for all actions but do so by averaging over all possible futures, our method provides trajectory-specific counterfactual estimates, i.e., estimates of the return for different actions that keep as many of the external factors as possible constant between the return and its counterfactual estimate. Our method is inspired by ideas from causality theory but does not require learning a model of the environment. Our main contributions are: a) proposing a set of environments which further our understanding of when difficult credit assignment leads to poor
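To make the time-based credit assignment discussed above concrete, the following is a minimal sketch of the per-step REINFORCE estimator, in which each action is credited with the full discounted return-to-go that follows it. The function and argument names (`reinforce_gradient_terms`, `log_prob_grads`) are illustrative, not from the paper.

```python
# Minimal sketch of time-based credit assignment in REINFORCE.
# Every reward observed after time t is credited to the action a_t,
# whether or not a_t actually influenced it.
import numpy as np

def reinforce_gradient_terms(rewards, log_prob_grads, gamma=0.99):
    """Per-step terms grad log pi(a_t | s_t) * G_t of the policy gradient.

    rewards: sequence of scalar rewards r_t.
    log_prob_grads: sequence of gradient vectors of log pi(a_t | s_t).
    The sum of the returned terms is an unbiased, but typically
    high-variance, estimate of the policy gradient.
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):  # return-to-go: G_t = r_t + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    return [g_t * grad for g_t, grad in zip(returns, log_prob_grads)]
```

Because G_t sums over everything that happens after time t, rewards caused purely by external factors or by later actions inflate the variance of each term.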

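The second sketch shows how a future-conditional baseline would enter that estimator. The hindsight features phi_t (the `futures` argument) and the `baseline` function are hypothetical stand-ins: in the approach described above they are learned, and the key constraint, that the hindsight information carries no information about the action given the state, is enforced during training rather than assumed.

```python
# Sketch of a future-conditional (hindsight) baseline, assuming
# hindsight features phi_t that are independent of the action a_t
# given the state s_t; if that constraint fails, the estimator is biased.
import numpy as np

def cca_gradient_terms(rewards, states, futures, log_prob_grads,
                       baseline, gamma=0.99):
    """Per-step terms grad log pi(a_t | s_t) * (G_t - b(s_t, phi_t)).

    futures: per-step hindsight features phi_t extracted from the
    future of the trajectory. Conditioning the baseline on them can
    absorb the effect of external factors ('luck') on the return,
    lowering variance relative to a state-only baseline.
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):  # return-to-go, as in the first sketch
        g = rewards[t] + gamma * g
        returns[t] = g
    return [(g_t - baseline(s, phi)) * grad
            for g_t, s, phi, grad in zip(returns, states, futures,
                                         log_prob_grads)]
```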
